Development and Maintenance of RepeatMasker

Information

  • ApplicationId
    9905539
  • Core Project Number
    R01HG002939
  • Full Project Number
    5R01HG002939-15
  • Serial Number
    002939
  • FOA Number
    PA-16-160
  • Sub Project Id
  • Project Start Date
    8/15/2003
  • Project End Date
    3/31/2021
  • Program Officer Name
    WELLINGTON, CHRISTOPHER
  • Budget Start Date
    4/1/2020
  • Budget End Date
    3/31/2021
  • Fiscal Year
    2020
  • Support Year
    15
  • Suffix
  • Award Notice Date
    3/23/2020

Development and Maintenance of RepeatMasker

Mammalian and most other eukaryotic genomes contain a large number of interspersed repeats (IRs), most of which are copies of transposable elements (TEs) at varying levels of decay. Their presence complicates many genome sequence analyses, but identifying them accurately at an early analysis stage can reduce these complications. Beyond their pervasiveness, IRs have an enormous impact on genome activity and evolution, which the research community has widely come to recognize over the last decades. Every species has been exposed to a unique, complex set of TEs that has left recognizable copies dating from as long ago as 300 million years to the present day.

These TEs are uncovered and reconstructed by de novo discovery methods, often by our RepeatModeler tool, and their copies are then annotated by our RepeatMasker software. De novo methods can create TE libraries at a reasonable pace, but the product falls far short of the quality that can be reached by hand curation. With the recent explosive growth in the number of sequenced species, these finishing steps, perhaps never fully automatable, now form a severe bottleneck in genome analyses because of a lack of manpower and expertise. The results, especially when they come from different research groups, lack consistency and suffer from redundancy. Furthermore, the annotation of genomes for which high-quality libraries have been created is not keeping up with library improvements because of the computational burden of re-analysis.

In this proposal, we describe a plan to alleviate the problems of finishing new repeat libraries: we aim to exploit the power of multi-species genome alignments, especially in revealing lineage-specific TEs; develop a web-based workbench based on our TE library finishing tools and strategies; and crowdsource the most laborious step through the use of gamification. In addition, we propose a new family-centric search strategy and an incremental annotation approach to provide a tractable solution to the re-analysis problem while also providing opportunities to improve annotation quality.

IC Name
NATIONAL HUMAN GENOME RESEARCH INSTITUTE
  • Activity
    R01
  • Administering IC
    HG
  • Application Type
    5
  • Direct Cost Amount
    268066
  • Indirect Cost Amount
    222495
  • Total Cost
    490561
  • Sub Project Total Cost
  • ARRA Funded
    False
  • CFDA Code
    172
  • Ed Inst. Type
  • Funding ICs
    NHGRI:490561
  • Funding Mechanism
    Non-SBIR/STTR RPGs
  • Study Section
    GCAT
  • Study Section Name
    Genomics, Computational Biology and Technology Study Section
  • Organization Name
    INSTITUTE FOR SYSTEMS BIOLOGY
  • Organization Department
  • Organization DUNS
    135646524
  • Organization City
    SEATTLE
  • Organization State
    WA
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    981095263
  • Organization District
    UNITED STATES