Integrated Assembly Software for Sanger and Next Generation Sequence Technologies

Information

  • Research Project
  • ApplicationId
    8011298
  • Core Project Number
    R44GM082117
  • Full Project Number
    5R44GM082117-03
  • Serial Number
    82117
  • FOA Number
    PA-08-050
  • Sub Project Id
  • Project Start Date
    9/1/2007
  • Project End Date
    12/31/2011
  • Program Officer Name
    LYSTER, PETER
  • Budget Start Date
    1/1/2011
  • Budget End Date
    12/31/2011
  • Fiscal Year
    2011
  • Support Year
    3
  • Suffix
  • Award Notice Date
    12/27/2010

DESCRIPTION (provided by applicant): The advent of next-generation (Next-gen) sequencing technologies has set off a surge in whole genome sequencing and resequencing, exemplified spectacularly by four papers describing five complete human genomes in 2008 alone. One company, Knome, now even offers customers their entire genome sequence using Next-gen sequencing technology. These developments, together with targeted genome resequencing, presage the day of the $1,000 human genome. Broad-scale whole human genome resequencing (WHGR) will have enormous impact on the areas of personalized medicine, human evolution and human diversity. To fully realize that potential, however, software capabilities must be dramatically enhanced to meet the significant challenges posed by the sheer volume of data generated in these projects, the diversity of technology-specific data characteristics and the task of simply analyzing the 6-billion-base-pair diploid human genome. Moreover, we see the day when technology improvements and cost reductions make WHGR as commonplace as bacterial genome sequencing has become today. For that to occur, assembly and analysis software must be accessible to a far broader and less computer-savvy range of researchers than the highly specialized bioinformatics teams that decode the information now. Also, the computer resources available even to a well-funded research laboratory are far more limited than those available to a large sequencing center. Therefore, the overall goal of this proposal is to develop a Next-gen sequence assembly and analysis pipeline, DESKAPP, that will run on an affordable ($5,000) high-end desktop computer and produce a human genome sequence in a reasonable timeframe (days, not weeks). WHGR by DESKAPP will involve a reference-guided main assembly as well as a de novo assembly branch to characterize unique regions of the new genome relative to the reference.
Merging of the assemblies produces a complete sequence that can be evaluated for gene content, single nucleotide polymorphisms (SNPs) and structural variation (SV: indels, inversions, translocations), both by web-based searches of external databases to identify known allelic variation and by direct examination of the sequence to identify new polymorphisms. A Disk Sort Alignment (DSA) algorithm allows data sets that are far too large for in-memory processing to be evaluated and clustered for assembly by SeqMan N-Gen (SM N-Gen), our desktop assembly engine. Using a prototype DSA-SM N-Gen pipeline, we have processed the entire 7.4x 454 data set from the James Watson genome to a layout file in 31 hours using DSA and have assembled three chromosomes (8, 21, and X) using SM N-Gen. Assembly times varied from 1 hour for Chromosome 21 to 10.6 hours for an average-sized chromosome, such as Chromosome 8. Together, these results demonstrate the feasibility of constructing a DESKAPP pipeline for WHGR. The Phase II Aims are designed to build upon this foundation and produce a seamless pipeline for the desktop assembly and analysis of a human genome in a matter of days.

PUBLIC HEALTH RELEVANCE: Next-gen sequencing technologies have started a new revolution throughout biology by providing DNA sequence data in unprecedented quantities at continually decreasing costs. This data will be invaluable in the emerging era of personalized medicine and in exploring the immense diversity of life. The goal of this project is to develop desktop computer software that will enable research laboratories and clinics of any size to realize the promise of these new technologies.
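
The description does not say how Disk Sort Alignment works internally, so the following is not the applicant's algorithm. It is only a minimal sketch of the general idea of ordering data that is too large for memory by sorting bounded chunks, spilling each to disk, and merging the sorted runs (an external merge sort). The record layout (tab-separated chromosome, position, read ID) and the names `external_sort` and `position_key` are assumptions made for illustration.

```python
import heapq
import os
import tempfile


def external_sort(records, key, chunk_size, tmp_dir=None):
    """Yield newline-free text records in key order.

    At most `chunk_size` records are held in memory while building runs;
    the merge phase streams one record per run at a time.
    """
    run_paths = []
    chunk = []

    def flush():
        # Sort the in-memory chunk and spill it to disk as one sorted run.
        chunk.sort(key=key)
        fd, path = tempfile.mkstemp(dir=tmp_dir, suffix=".run", text=True)
        with os.fdopen(fd, "w") as fh:
            fh.writelines(rec + "\n" for rec in chunk)
        run_paths.append(path)
        chunk.clear()

    for rec in records:
        chunk.append(rec)
        if len(chunk) >= chunk_size:
            flush()
    if chunk:
        flush()

    files = [open(path) for path in run_paths]
    try:
        streams = [(line.rstrip("\n") for line in fh) for fh in files]
        # heapq.merge lazily performs a k-way merge of the sorted runs.
        yield from heapq.merge(*streams, key=key)
    finally:
        for fh in files:
            fh.close()
        for path in run_paths:
            os.remove(path)


def position_key(record):
    # Hypothetical layout: chromosome <TAB> position <TAB> read ID.
    chrom, pos, _read_id = record.split("\t")
    return chrom, int(pos)


hits = [
    "chr8\t500\tr1",
    "chr21\t40\tr2",
    "chr8\t12\tr3",
    "chrX\t7\tr4",
    "chr21\t3\tr5",
]
ordered = list(external_sort(hits, position_key, chunk_size=2))
```

With `chunk_size=2` the five records are written as three sorted runs and then merged, so records for the same chromosome end up adjacent and in position order. A real pipeline would use a far larger chunk size and a compact binary record format.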

IC Name
NATIONAL INSTITUTE OF GENERAL MEDICAL SCIENCES
  • Activity
    R44
  • Administering IC
    GM
  • Application Type
    5
  • Direct Cost Amount
  • Indirect Cost Amount
  • Total Cost
    722920
  • Sub Project Total Cost
  • ARRA Funded
    False
  • CFDA Code
    859
  • Ed Inst. Type
  • Funding ICs
    NIGMS:722920
  • Funding Mechanism
    SBIR-STTR
  • Study Section
    ZRG1
  • Study Section Name
    Special Emphasis Panel
  • Organization Name
    DNASTAR, INC.
  • Organization Department
  • Organization DUNS
    130194947
  • Organization City
    MADISON
  • Organization State
    WI
  • Organization Country
    UNITED STATES
  • Organization Zip Code
    537055202
  • Organization District
    UNITED STATES