HMMER and Infernal: Finding distant homologs of sequences and RNA structures

Information

  • Research Project
  • 10296226
  • ApplicationId
    10296226
  • Core Project Number
    R01HG009116
  • Full Project Number
    2R01HG009116-05
  • Serial Number
    009116
  • FOA Number
    PA-20-185
  • Sub Project Id
  • Project Start Date
    9/16/2016
  • Project End Date
    6/30/2026
  • Program Officer Name
    SEN, SHURJO KUMAR
  • Budget Start Date
    9/10/2021
  • Budget End Date
    6/30/2022
  • Fiscal Year
    2021
  • Support Year
    05
  • Suffix
  • Award Notice Date
    9/10/2021

Project Summary/Abstract

Genome sequence data is now available for hundreds of thousands of species. Our ability to exploit this vast trove of information about the molecular basis and evolution of life depends on sophisticated computational analysis tools. One important class of tools is profile analysis software, for making consensus statistical models of multiple alignments of biological sequence families, and for using those models to sensitively detect homologs and make deep multiple alignments. Profile analysis derives its power from the fact that despite the unbounded growth of sequence data, the majority of functional sequences can be condensed into a manageably small number of conserved families. Profile software underlies numerous protein, RNA, and DNA sequence family databases. The systematic availability of deep multiple alignments (of many thousands of sequences) is enabling revolutionary advances in predicting molecular function and 3D structure by comparative sequence analysis. The HMMER and Infernal software packages from our laboratory are some of the most widely used tools for profile analysis. HMMER implements profile hidden Markov models (profile HMMs) of primary sequence consensus, typically for protein domains and conserved DNA elements. Infernal implements profile stochastic context-free grammars (profile SCFGs) of RNA secondary structure and sequence consensus.

In the context of the continued development of these packages, this proposal has three specific aims for new lines of research that we expect to lead to major improvements in the accuracy, utility, and computational efficiency of profile analysis. The first aim proposes to develop a discontinuous Markov model of nonhomologous sequences, to improve the ability to distinguish homologs from nonhomologs and reduce the false positive rate of database searches. The second aim proposes to develop sketching methods for efficiently representing the voluminous results of a database homology search with a subset of the most phylogenetically informative hits. The third aim proposes to develop adaptive computation methods to flexibly harness the complex mix of CPU/GPU processors, memory, and storage in modern hardware architectures, enabling efficient scalable computation and near-interactive database search times.

IC Name
NATIONAL HUMAN GENOME RESEARCH INSTITUTE
  • Activity
    R01
  • Administering IC
    HG
  • Application Type
    2
  • Direct Cost Amount
    315651
  • Indirect Cost Amount
    217799
  • Total Cost
    533450
  • Sub Project Total Cost
  • ARRA Funded
    False
  • CFDA Code
    172
  • Ed Inst. Type
    SCHOOLS OF ARTS AND SCIENCES
  • Funding ICs
    NHGRI:533450
  • Funding Mechanism
    Non-SBIR/STTR RPGs
  • Study Section
    GCAT
  • Study Section Name
    Genomics, Computational Biology and Technology Study Section
  • Organization Name
    HARVARD UNIVERSITY
  • Organization Department
    MICROBIOLOGY/IMMUN/VIROLOGY
  • Organization DUNS
    082359691
  • Organization City
    CAMBRIDGE
  • Organization State
    MA
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    021385319
  • Organization District
    UNITED STATES