Complete genome de novo assembly software for the emerging long read sequencing era

Information

  • Research Project
  • ApplicationId
    9255092
  • Core Project Number
    R44GM122120
  • Full Project Number
    1R44GM122120-01
  • Serial Number
    122120
  • FOA Number
    PAR-14-088
  • Sub Project Id
  • Project Start Date
    3/1/2017
  • Project End Date
    2/28/2019
  • Program Officer Name
    RAVICHANDRAN, VEERASAMY
  • Budget Start Date
    3/1/2017
  • Budget End Date
    2/28/2018
  • Fiscal Year
    2017
  • Support Year
    01
  • Suffix
  • Award Notice Date
    2/28/2017
Organizations

Complete genome de novo assembly software for the emerging long read sequencing era

Despite the tremendous success of short read next-generation sequencing (NGS) technologies, their inherent inability to establish long range connectivity makes fundamental tasks such as genome closure, haplotype phasing and alternatively spliced transcript characterization all but impossible. Now, two long read sequencing providers, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), are producing data that can overcome these critical shortcomings. PacBio is capable of producing 10-20 kb reads and has seen increased adoption for closing microbial genomes in particular, but also for eukaryotic genomics and transcriptomics. ONT's MinION device is a portable real-time sequencing platform capable of producing 100 kb reads and has already been successfully applied to microbial sequencing and pathogen identification. ONT's new high-throughput instrument, the PromethION, is being released in 2016 and will have sufficient output for human genome scale experiments. The tremendous potential of both technologies is currently hampered by high error rates (10-20%), which make assembly and consensus calling extremely computationally challenging. Various command line software programs have been developed to tackle these challenges, but they typically require substantial bioinformatic expertise and computing resources, and they do not address the critical hurdles associated with diploid genomes. With long read sequencing poised to become a major resource for genomics, there is clearly an urgent need for integrated, easy-to-use assembly and analysis software that can handle and exploit the unique aspects of these data. Toward that end, we have developed a prototype de novo assembler based on our patented Disk Sort Alignment (DSA) algorithm that can assemble an uncorrected bacterial genome data set into a single contig with >99.2% base accuracy on a standard desktop computer in less than 3.5 hours.
The assembler uses DSA-determined read overlaps to construct an assembly string graph, from which a layout is fed to a novel consensus generator designed to maximize accuracy from this error-prone data. The overall goal of this Direct-to-Phase II proposal is to transform the prototype into a fully scalable long read de novo assembler for both haploid and diploid genomes. We will first optimize the performance of the assembler components, building a solid foundation from which to incorporate the essential diploid-aware capabilities of 1) identifying large structural variation between two sister chromosomes, 2) adapting the consensus base caller to handle heterozygous SNVs and small indels and 3) exploiting the long range connectivity of the data to properly phase the variants and produce accurate haplotype sequences. Finally, we will leverage these tools to identify alternatively spliced transcripts and allele-specific expression from long read RNA-Seq data. Consistent with DNASTAR's 30-year history of delivering easy-to-use, expert-level software, this assembler will give any user access to these revolutionary long read sequencing technologies and those to come.
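
The overlap, string graph and layout stages named above can be illustrated with a toy sketch. It is not the DSA algorithm or the proposed consensus generator, neither of which is described in this abstract. It assumes error-free reads, swaps in exact suffix-prefix matching for DSA overlap detection, and replaces consensus calling with simple concatenation along the layout. All function names are hypothetical.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def overlap_graph(reads, min_len=3):
    """Directed graph: graph[i][j] = overlap length between read i and read j."""
    graph = {}
    for i, j in permutations(range(len(reads)), 2):
        k = overlap(reads[i], reads[j], min_len)
        if k:
            graph.setdefault(i, {})[j] = k
    return graph

def reduce_transitive(graph):
    """Drop edge i->k whenever i->j and j->k both exist (naive string-graph reduction)."""
    reduced = {i: dict(nbrs) for i, nbrs in graph.items()}
    for i, nbrs in graph.items():
        for j in nbrs:
            for k in graph.get(j, {}):
                reduced[i].pop(k, None)
    return reduced

def layout(reduced, n_reads):
    """Walk from a read with no incoming edge, always taking the largest overlap."""
    has_incoming = {j for nbrs in reduced.values() for j in nbrs}
    starts = [i for i in range(n_reads) if i not in has_incoming]
    path = [starts[0] if starts else 0]
    seen = set(path)
    while True:
        nbrs = {j: k for j, k in reduced.get(path[-1], {}).items() if j not in seen}
        if not nbrs:
            return path
        nxt = max(nbrs, key=nbrs.get)
        path.append(nxt)
        seen.add(nxt)

def merge_layout(reads, path, graph):
    """Concatenate reads along the path, skipping the overlapping prefix of each."""
    contig = reads[path[0]]
    for a, b in zip(path, path[1:]):
        contig += reads[b][graph[a][b]:]
    return contig

def assemble(reads, min_len=3):
    graph = overlap_graph(reads, min_len)
    path = layout(reduce_transitive(graph), len(reads))
    return merge_layout(reads, path, graph)

# Toy input: 9-base reads tiled every 3 bases across a 21-base sequence, shuffled.
genome = "ACGTTGCATCGGATTACCAGT"
reads = ["CATCGGATT", "ACGTTGCAT", "ATTACCAGT", "TTGCATCGG", "CGGATTACC"]
print(assemble(reads) == genome)
```

Real long reads with 10-20% error would require approximate overlap detection and a true consensus step, which is exactly where this sketch stops.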

IC Name
NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES
  • Activity
    R44
  • Administering IC
    GM
  • Application Type
    1
  • Direct Cost Amount
  • Indirect Cost Amount
  • Total Cost
    749795
  • Sub Project Total Cost
  • ARRA Funded
    False
  • CFDA Code
    859
  • Ed Inst. Type
  • Funding ICs
    NIGMS:749795
  • Funding Mechanism
    SBIR-STTR RPGs
  • Study Section
    ZRG1
  • Study Section Name
    Special Emphasis Panel
  • Organization Name
    DNASTAR, INC.
  • Organization Department
  • Organization DUNS
    130194947
  • Organization City
    MADISON
  • Organization State
    WI
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    53705-5202
  • Organization District
    UNITED STATES