Machine learning approaches for improved accuracy and speed in sequence annotation

Information

  • Research Project
  • ApplicationId
    10231149
  • Core Project Number
    R01GM132600
  • Full Project Number
    5R01GM132600-03
  • Serial Number
    132600
  • FOA Number
    PA-19-056
  • Sub Project Id
  • Project Start Date
    9/20/2019
  • Project End Date
    7/31/2023
  • Program Officer Name
    RAVICHANDRAN, VEERASAMY
  • Budget Start Date
    8/1/2021
  • Budget End Date
    7/31/2022
  • Fiscal Year
    2021
  • Support Year
    03
  • Suffix
  • Award Notice Date
    8/9/2021
Organizations

Summary/Abstract

Alignment of biological sequences is a key step in understanding their evolution, function, and patterns of activity. Here, we describe machine learning approaches to improve both the accuracy and the speed of highly sensitive sequence alignment. To improve accuracy, we develop methods to reduce erroneous annotation caused by (1) the existence of low-complexity and repetitive sequence and (2) the overextension of alignments of true homologs into unrelated sequence. We describe approaches based on both hidden Markov models and artificial neural networks to dramatically reduce these kinds of sequence annotation error. We also address annotation speed by developing a custom deep learning architecture designed to very quickly filter away large portions of candidate sequence comparisons before the relatively slow sequence alignment step. The results of these efforts will be incorporated into forks of the open-source sequence alignment tools HMMER, MMSeqs, and (where appropriate) BLAST; we will also work with community developers of annotation pipelines, such as RepeatMasker and IMG/M, to incorporate these approaches. Developing these methods and incorporating them into these widely used bioinformatics tools will lead to widespread impact on sequence annotation efforts.

  • IC Name
    NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES
  • Activity
    R01
  • Administering IC
    GM
  • Application Type
    5
  • Direct Cost Amount
    200000
  • Indirect Cost Amount
    87379
  • Total Cost
    287379
  • Sub Project Total Cost
  • ARRA Funded
    False
  • CFDA Code
    859
  • Ed Inst. Type
    SCHOOLS OF ARTS AND SCIENCES
  • Funding ICs
    NIGMS:287379
  • Funding Mechanism
    Non-SBIR/STTR RPGs
  • Study Section
    GVE
  • Study Section Name
    Genetic Variation and Evolution Study Section
  • Organization Name
    UNIVERSITY OF MONTANA
  • Organization Department
    BIOSTATISTICS & OTHER MATH SCI
  • Organization DUNS
    010379790
  • Organization City
    MISSOULA
  • Organization State
    MT
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    598124104
  • Organization District
    UNITED STATES