Scalable post-assembly editing software for finishing and annotating personal genomes

Information

  • Research Project
  • ApplicationId
    9620948
  • Core Project Number
    R44GM128518
  • Full Project Number
    1R44GM128518-01A1
  • Serial Number
    128518
  • FOA Number
    PA-17-302
  • Sub Project Id
  • Project Start Date
    9/1/2018
  • Project End Date
    2/28/2019
  • Program Officer Name
    RAVICHANDRAN, VEERASAMY
  • Budget Start Date
    9/1/2018
  • Budget End Date
    2/28/2019
  • Fiscal Year
    2018
  • Support Year
    01
  • Suffix
    A1
  • Award Notice Date
    8/20/2018

We are entering a new era of personal genomics in which an individual's genome sequence will be used to identify disease susceptibility, improve diagnosis and better treat illness, and will be combined across cohorts and populations to identify new biomarkers and the causal mutations underlying a phenotype. Mapping short-read next-generation sequencing (NGS) data onto a reference genome (resequencing) has been tremendously successful at identifying genetic variation in a new genome, but the inherent lack of long-range connectivity, together with reference-induced biases, makes obtaining complete haplotype-phased genomes exceedingly difficult. Emerging long-read technologies are beginning to address this critical shortcoming through direct de novo assembly of an individual's genome. However, initial de novo assemblies typically consist of many thousands of unordered contigs that require extensive post-assembly processing to produce finished sequences that can be effectively mined for genetic content and variation. Thus, there is an urgent need for integrated, scalable post-assembly software that 1) automatically organizes, joins and phases the initial contigs into complete haplotype sequences, 2) supports optional NGS and/or manual polishing and 3) provides initial automated annotation of those sequences. No such software currently exists; instead, users must cobble together a confusing array of difficult-to-use, task-specific open-source programs. DNASTAR's post-assembly editing program, SeqMan Pro (SMP), has a proven history of finishing bacterial-sized genomes, but it currently lacks the scalability and some of the functionality needed to tackle human-genome-sized problems. The primary goal of this Fast Track proposal is to create a fully scalable version of SMP for the automated finishing and annotation of de novo assembled large eukaryotic genomes, while also providing a manual editing platform when needed.
During Phase I, we will develop two key prototypes: 1) a new assembly file format, eBAM, which is interconvertible with the BAM format but is also editable like our SQD files, and 2) a rapid reference-assisted contig scaffolding tool adapted from our proprietary Disk Sort Alignment (DSA) algorithm. With that foundation, we will complete the transformation of SMP in Phase II by: 1) refining the eBAM format for optimal editing performance; 2) building a new 64-bit version of the SMP editing engine that incorporates the additional functionality needed for post-assembly finishing of large eukaryotic genomes, including automated DSA-based scaffolding and phase-aware gap filling, contig joining and haplotype refinement; 3) creating a new DSA-based genome aligner for rapidly aligning a finished sequence to an annotated reference genome; and 4) creating a new feature transfer and analysis module. Together, the aligner and the feature transfer module will permit initial annotation of the finished genome, along with a catalog of variants and their impact in both native and reference coordinates. Including the reference coordinates allows variants in the new genome to be easily associated with the wealth of information available through the numerous online knowledgebase resources.
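
The general idea behind reference-assisted contig scaffolding can be illustrated with a minimal sketch. This is not the proprietary DSA algorithm, whose details the abstract does not describe. It assumes each contig has already been aligned to the reference by some external aligner, and it simply orders and orients contigs by their best alignment. The names `Hit` and `scaffold` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One contig-to-reference alignment (hypothetical record layout)."""
    contig: str
    ref_name: str
    ref_start: int  # 0-based, inclusive
    ref_end: int    # 0-based, exclusive
    strand: str     # "+" or "-"


def scaffold(hits):
    """Order and orient contigs along each reference sequence.

    Returns {ref_name: [(contig, strand, gap_before), ...]}, where
    gap_before is the estimated gap to the preceding contig in reference
    coordinates (None for the first contig, 0 when contigs overlap).
    """
    # Assumption: the longest reference span is the best placement.
    best = {}
    for h in hits:
        cur = best.get(h.contig)
        if cur is None or (h.ref_end - h.ref_start) > (cur.ref_end - cur.ref_start):
            best[h.contig] = h

    # Group the chosen placements by reference sequence.
    by_ref = defaultdict(list)
    for h in best.values():
        by_ref[h.ref_name].append(h)

    scaffolds = {}
    for ref_name, placed in by_ref.items():
        placed.sort(key=lambda h: (h.ref_start, h.ref_end))
        layout = []
        prev_end = None
        for h in placed:
            gap = None if prev_end is None else max(0, h.ref_start - prev_end)
            layout.append((h.contig, h.strand, gap))
            prev_end = h.ref_end if prev_end is None else max(prev_end, h.ref_end)
        scaffolds[ref_name] = layout
    return scaffolds


example_hits = [
    Hit("c2", "chr1", 5000, 9000, "-"),
    Hit("c1", "chr1", 0, 4000, "+"),
    Hit("c2", "chr2", 100, 300, "+"),   # shorter secondary hit, ignored
    Hit("c3", "chr2", 0, 1000, "+"),
]
example_layout = scaffold(example_hits)
```

A reference-guided layout of this kind inherits any structural differences between the individual and the reference, which is one reason the proposal pairs scaffolding with phase-aware gap filling and manual editing.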

IC Name
NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES
  • Activity
    R44
  • Administering IC
    GM
  • Application Type
    1
  • Direct Cost Amount
  • Indirect Cost Amount
  • Total Cost
    149981
  • Sub Project Total Cost
  • ARRA Funded
    False
  • CFDA Code
    859
  • Ed Inst. Type
  • Funding ICs
    NIGMS:149981
  • Funding Mechanism
    SBIR-STTR RPGs
  • Study Section
    ZRG1
  • Study Section Name
    Special Emphasis Panel
  • Organization Name
    DNASTAR, INC.
  • Organization Department
  • Organization DUNS
    130194947
  • Organization City
    MADISON
  • Organization State
    WI
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    537055202
  • Organization District
    UNITED STATES