Complete genome de novo assembly software for the emerging long read sequencing era

Information

Research Project
9747613

ApplicationId
9747613
Core Project Number
R44GM122120
Full Project Number
3R44GM122120-02S1
Serial Number
122120
FOA Number
PA-18-591
Sub Project Id

Project Start Date
3/1/2017
Project End Date
2/28/2019
Program Officer Name
RAVICHANDRAN, VEERASAMY
Budget Start Date
3/1/2018
Budget End Date
2/28/2019
Fiscal Year
2018
Support Year
02
Suffix
S1
Award Notice Date
9/13/2018

Organizations

DNASTAR, INC.

Information

Complete genome de novo assembly software for the emerging long read sequencing era

Despite the tremendous success of short read next-generation sequencing (NGS) technologies, their inherent inability to establish long-range connectivity makes fundamental tasks such as genome closure, haplotype phasing and alternatively spliced transcript characterization all but impossible. Now, two long read sequencing providers, Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT), are producing data that can overcome these critical shortcomings. PacBio is capable of producing 10-20kb reads and has seen increased adoption for closing microbial genomes in particular, but also for eukaryotic genomics and transcriptomics. ONT's MinION device is a portable real-time sequencing platform capable of producing 100kb reads and has already been successfully applied to microbial sequencing and pathogen identification. ONT's new high-throughput instrument, the PromethION, is being released in 2016 and will have sufficient output for human genome scale experiments. The tremendous potential of both technologies is currently hampered by high error rates (10-20%), which make assembly and consensus calling extremely computationally challenging. Various command-line software programs have been developed to tackle these challenges, but they typically require substantial bioinformatic expertise and computing resources and do not address the critical hurdles associated with diploid genomes. With long read sequencing poised to become a major resource for genomics, there is clearly an urgent need for integrated, easy-to-use assembly and analysis software that can handle and exploit the unique aspects of this data. Toward that end, we have developed a prototype de novo assembler based on our patented Disk Sort Alignment (DSA) algorithm that can assemble an uncorrected bacterial genome data set into a single contig with >99.2% base accuracy on a standard desktop computer in less than 3.5 hours.
The assembler uses DSA-determined read overlaps to construct an assembly string graph from which a layout is fed to a novel consensus generator designed to maximize accuracy from this error-prone data. The overall goal of this Direct-to-Phase II proposal is to transform the prototype into a fully scalable long read de novo assembler for both haploid and diploid genomes. We will first optimize the performance of the assembler components, building a solid foundation from which to incorporate the essential diploid-aware capabilities of 1) identifying large structural variation between two sister chromosomes, 2) adapting the consensus base caller to handle heterozygous SNVs and small indels and 3) exploiting the long-range connectivity of the data to properly phase the variants and produce accurate haplotype sequences. Finally, we will leverage these tools to identify alternatively spliced transcripts and allele-specific expression from long read RNA-Seq data. Consistent with DNASTAR's 30-year history of delivering easy-to-use, expert-level software, this assembler will give any user access to these revolutionary long read sequencing technologies and those to come.
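The overlap, layout and consensus stages named in the abstract can be illustrated with a toy sketch. This is not the DSA algorithm or DNASTAR's consensus generator, neither of which is described in this record. It is a generic stand-in under loud assumptions: overlaps are found by brute-force suffix/prefix comparison that tolerates substitutions only, the "graph" keeps just the single best successor per read (a real string graph would also remove contained reads and transitive edges), and consensus is a per-column majority vote. The function names, the thresholds `min_len` and `max_mismatches`, and the reads themselves are all hypothetical. Real long-read errors also include insertions and deletions, which would require proper alignment.

```python
from collections import Counter


def overlap_length(a, b, min_len, max_mismatches):
    """Longest k such that the last k bases of `a` match the first k
    bases of `b` with at most `max_mismatches` substitutions (0 if none)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_mismatches:
            return k
    return 0


def build_overlap_graph(reads, min_len, max_mismatches):
    """Map each read index to (best successor index, overlap length).
    Simplification: only the single longest overlap per read is kept."""
    graph = {}
    for i, a in enumerate(reads):
        best_len, best_j = 0, None
        for j, b in enumerate(reads):
            if i == j:
                continue
            k = overlap_length(a, b, min_len, max_mismatches)
            if k > best_len:
                best_len, best_j = k, j
        if best_j is not None:
            graph[i] = (best_j, best_len)
    return graph


def layout_from_graph(reads, graph):
    """Walk the single linear path and return (offset, read) placements."""
    targets = {j for j, _ in graph.values()}
    starts = [i for i in range(len(reads)) if i not in targets]
    if len(starts) != 1:
        raise ValueError("toy layout expects exactly one linear path")
    placed, seen, offset, node = [], set(), 0, starts[0]
    while True:
        placed.append((offset, reads[node]))
        seen.add(node)
        if node not in graph or graph[node][0] in seen:
            break
        nxt, k = graph[node]
        offset += len(reads[node]) - k  # next read starts where the overlap begins
        node = nxt
    return placed


def consensus(placed):
    """Per-column majority vote over all reads covering that column."""
    length = max(off + len(read) for off, read in placed)
    columns = [Counter() for _ in range(length)]
    for off, read in placed:
        for pos, base in enumerate(read):
            columns[off + pos][base] += 1
    return "".join(col.most_common(1)[0][0] for col in columns)


if __name__ == "__main__":
    # Four hypothetical 12-base reads, shuffled; the last one carries one
    # substitution error that the vote should outweigh.
    reads = ["CGTTAGCCTAGG", "ATGGCGTACGTT", "AGCCTAGGATCC", "CGTACGATAGCC"]
    graph = build_overlap_graph(reads, min_len=8, max_mismatches=1)
    print(consensus(layout_from_graph(reads, graph)))
```

In this toy input the erroneous base sits in a column covered by three reads, so the two correct reads outvote it. With only two-fold coverage the vote would tie, which hints at why read depth matters for consensus accuracy on error-prone data.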

IC Name

NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES

Activity
R44
Administering IC
GM
Application Type
3

Direct Cost Amount
Indirect Cost Amount
Total Cost
66985
Sub Project Total Cost

ARRA Funded
False
CFDA Code
859
Ed Inst. Type
Funding ICs
NIGMS:66985
Funding Mechanism
SBIR-STTR RPGs
Study Section
Study Section Name

Organization Name
DNASTAR, INC.
Organization Department
Organization DUNS
130194947
Organization City
MADISON
Organization State
WI
Organization Country
UNITED STATES
Organization Zip Code
537055202
Organization District
UNITED STATES
