The Sequence Listing written in file 49927-502001US ST25, created Mar. 20, 2017, 636 bytes, machine format IBM-PC, MS Windows operating system, is hereby incorporated by reference.
The subject matter described herein relates to genomics, and more particularly to dynamic genome reference generation for improved next generation sequencing (NGS) accuracy and reproducibility.
As DNA sequencing costs have plummeted over the last few years, the volume of raw data generated by sequencing has increased exponentially and is now measured in petabytes, making analysis and transfer of this data difficult. These large amounts of data produce a critical bottleneck in the DNA sequencing workflow that has previously been addressable only by throwing increasing numbers of ever more powerful CPU cores at the problem. However, since the data being produced by sequencing already far outpaces Moore's Law, this solution has very limited sustainability.
The hugely parallel approach of NGS requires a human reference genome to be used to reconstruct the patient's genome from the raw read data. The human reference genome has become essential for clinical applications, and is used to identify alleles for risk, protection, or treatment-specific response in human disease. Yet, the current reference genome, GRCh38, being based on a limited number of samples, neither adequately represents the full range of human diversity, nor is complete. Further, the existing approach followed by the GRC and the genomics industry to construct a “static” reference genome introduces biases in standard bioinformatic pipelines used to detect the unique complement of variants in an individual's genome. An elegant, cost effective bioinformatics pipeline solution to perform the analysis of the sequenced data rapidly, accurately and in a consistent, reproducible way based on a truly population-wide reference is the final frontier to commoditize sequencing.
In one aspect, a Next Generation Sequencing (NGS) bioinformatics ASIC (Application Specific Integrated Circuit) is disclosed. A “dynamic” reference is introduced that utilizes population level information to improve reference-based alignment to detect novel, deleterious, or functional variants in clinical sequencing applications. To generate the dynamic reference, an automatically updated database of known genetic variants (SNPs, indels, CNVs, etc., archived at e.g. dbSNP, DGVa, dbVar) is built and provided, and a standard reference genome is augmented with these variants for processing by the NGS ASIC.
Implementations of the current subject matter can include, but are not limited to, systems and methods including one or more of the described features, as well as articles that comprise a tangibly embodied machine-readable medium operable to cause one or more machines (e.g., computers, etc.) to perform the operations described herein. Similarly, computer systems are also described that may include one or more processors and one or more memories coupled to the one or more processors. A memory, which can include a computer-readable storage medium, may include, encode, store, or the like one or more programs that cause one or more processors to perform one or more of the operations described herein. Computer-implemented methods consistent with one or more implementations of the current subject matter can be implemented by one or more data processors residing in a single computing system or multiple computing systems. Such multiple computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including but not limited to a connection over a network (e.g., the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), via a direct connection between one or more of the multiple computing systems, etc.
The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims. While certain features of the currently disclosed subject matter are described for illustrative purposes in relation to an enterprise resource software system or other business software solution or architecture, it should be readily understood that such features are not intended to be limiting. The claims that follow this disclosure are intended to define the scope of the protected subject matter.
The accompanying drawings, which are incorporated in and constitute a part of this specification, show certain aspects of the subject matter disclosed herein and, together with the description, help explain some of the principles associated with the disclosed implementations. In the drawings,
When practical, similar reference numbers denote similar structures, features, or elements.
To address these and potentially other issues with currently available solutions, methods, systems, articles of manufacture, and the like consistent with one or more implementations of the current subject matter can, among other possible advantages, provide a Next Generation Sequencing (NGS) bioinformatics ASIC (Application Specific Integrated Circuit). It enables the computational time required for the NGS data analysis pipeline to be radically reduced from many hours down to only a few minutes.
This dramatic speed improvement addresses the “static” reference issue in a way that has not been previously possible. A “dynamic” genome reference is provided that utilizes population level information to improve reference-based alignment to detect novel, deleterious, or functional variants in clinical sequencing applications. To generate the dynamic reference, an automatically updated database of known genetic variants (SNPs, indels (insertions/deletions), CNVs, etc., archived at e.g. dbSNP, DGVa, dbVar) is also provided, and augments a Wavefront Processor to utilize this data to enhance the standard reference genome with these variants.
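By way of non-limiting illustration, the augmentation of a standard reference with known variants can be sketched in software as follows. This is a minimal sketch and not the disclosed hardware implementation. The names `Variant`, `AltSequence`, and `build_alt_sequence`, the 0-based coordinate convention, and the `pad` parameter are assumptions introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str
    pos: int   # 0-based index of the first primary base that is replaced
    ref: str   # bases present in the primary sequence
    alt: str   # bases present in the variant allele


@dataclass(frozen=True)
class AltSequence:
    name: str
    chrom: str
    start: int  # 0-based primary coordinate where the padded window begins
    seq: str


def build_alt_sequence(reference, variant, pad):
    """Return the variant allele flanked by up to `pad` primary bases per side."""
    chrom_seq = reference[variant.chrom]
    ref_end = variant.pos + len(variant.ref)
    # Refuse variants whose reference allele disagrees with the primary sequence.
    if chrom_seq[variant.pos:ref_end] != variant.ref:
        raise ValueError("reference allele does not match the primary sequence")
    start = max(0, variant.pos - pad)
    end = min(len(chrom_seq), ref_end + pad)
    seq = chrom_seq[start:variant.pos] + variant.alt + chrom_seq[ref_end:end]
    name = f"{variant.chrom}:{variant.pos}:{variant.ref}>{variant.alt}"
    return AltSequence(name=name, chrom=variant.chrom, start=start, seq=seq)


# Toy primary sequence and two toy variants (not real data).
reference = {"chr1": "ACGTACGTAC"}
snp = build_alt_sequence(reference, Variant("chr1", 4, "A", "G"), pad=3)
deletion = build_alt_sequence(reference, Variant("chr1", 3, "TAC", "T"), pad=2)
print(snp.seq, deletion.seq)
```

In this sketch the dynamic reference would be the primary sequences together with the collection of alternate sequences, each of which records the primary coordinate it overlaps so that alignments can later be mapped back.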
The Wavefront Processor is shown in
The reference genome has been a guiding principle for the development of a vast array of computational tools and forms the foundation for databases and bioinformatics algorithms that are used to define target regions for re-sequencing, perform genome wide association studies, or measure inter-species conservation. The human reference genome has become essential for clinical applications, and is used to identify alleles for risk, protection, or treatment-specific response in human disease. Yet, the current reference genome, GRCh38, being based on a limited number of samples, neither adequately represents the full range of human diversity, nor is complete. Further, the existing approach followed by the GRC and the genomics industry to construct a “static” reference genome introduces biases in standard bioinformatic pipelines used to detect the unique complement of variants in an individual's genome, as shown in
To address this problem, the “dynamic” reference is introduced (
A Burrows-Wheeler transform (BWT) based aligner has been developed that maps reads to a dynamic reference genome. Importantly, alignment accuracy can be markedly improved in SNP-dense regions and regions with long indels that are often problematic for successful alignment of short reads. For BWT-based aligners (e.g. BWA), as the number of variants included in a dynamic reference increases, memory usage and run times increase. As illustrated in
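For readers unfamiliar with BWT-based matching, the following is a minimal textbook sketch of exact matching by backward search over a Burrows-Wheeler transform. It is not the disclosed aligner and not the BWA implementation. The function names are hypothetical, and the naive suffix sorting and full occurrence table are chosen for clarity rather than memory efficiency.

```python
def build_bwt(text):
    """Return (bwt string, suffix array) for text terminated by '$'."""
    text = text + "$"
    # Naive suffix array: sort suffix start positions by the suffix they spell.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT character for a suffix is the character preceding it (wrapping).
    return "".join(text[i - 1] for i in sa), sa


def build_fm_tables(bwt):
    """Return (C, occ): C[c] counts smaller characters, occ[c][i] counts c in bwt[:i]."""
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ


def find_exact(pattern, sa, C, occ):
    """Return sorted 0-based positions where pattern occurs exactly."""
    lo, hi = 0, len(sa)
    # Extend the match one character at a time, from the pattern's last base.
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


bwt, sa = build_bwt("ACGTACGA")
C, occ = build_fm_tables(bwt)
print(find_exact("ACG", sa, C, occ))
```

Every sequence added to the index lengthens the transform and its tables, which is consistent with the observation above that memory usage and run time grow as more variants are included in the dynamic reference.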
To support a dynamic reference, reads aligning to alternate sequences that overlap primary sequences (chromosomes) in most cases must be re-aligned to the correct primary sequence with properly adjusted FLAG, RNAME, POS, MAPQ, and CIGAR SAM fields. The CIGAR strings for a read aligning to an indel alternate sequence must be translated for proper alignment with the corresponding primary sequence. If such a read maps to a rare indel sequence, its MAPQ value may be penalized to decrease the chance of a false positive variant call. This penalty can be determined empirically with ground truth variant call data.
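The translation described above can be illustrated with a simplified sketch. It assumes that the read aligns without gaps to an alternate sequence carrying a single pure insertion or pure deletion, that coordinates are 0-based (SAM POS would be the returned position plus one), and that inserted bases at a read edge are soft-clipped. The function names, parameters, and the rarity threshold and penalty values shown are hypothetical; as noted above, the actual penalty would be determined empirically.

```python
def lift_to_primary(alt_pos, read_len, alt_start, indel_offset, ref_len, alt_len):
    """Translate a gapless hit on an indel alternate sequence to primary coordinates.

    alt_pos      0-based start of the read on the alternate sequence
    alt_start    0-based primary coordinate of the alternate sequence's first base
    indel_offset offset of the indel within the alternate sequence
    ref_len      primary bases removed by the indel (0 for a pure insertion)
    alt_len      bases inserted by the indel (0 for a pure deletion)
    Returns (0-based primary position, CIGAR string).
    """
    read_end = alt_pos + read_len
    ins_start, ins_end = indel_offset, indel_offset + alt_len
    left = max(0, min(read_end, ins_start) - alt_pos)            # bases before the indel
    inserted = max(0, min(read_end, ins_end) - max(alt_pos, ins_start))
    right = read_len - left - inserted                           # bases after the indel
    if left == 0 and right == 0:
        raise ValueError("read lies entirely inside the inserted bases")
    ops = []
    if left:
        ops.append((left, "M"))
    if inserted:
        # Inserted bases at a read edge cannot be anchored, so soft-clip them.
        ops.append((inserted, "I" if left and right else "S"))
    if ref_len and left and right:
        ops.append((ref_len, "D"))
    if right:
        ops.append((right, "M"))
    if left:
        pos = alt_start + alt_pos
    else:
        # The first aligned base lies to the right of the indel.
        pos = alt_start + max(alt_pos, ins_end) - alt_len + ref_len
    return pos, "".join(f"{n}{op}" for n, op in ops)


def penalized_mapq(mapq, allele_freq, rare_threshold, penalty):
    """Lower MAPQ for hits on rare indel alleles; both parameters are placeholders."""
    if allele_freq < rare_threshold:
        return max(0, mapq - penalty)
    return mapq


print(lift_to_primary(5, 12, 100, 10, 0, 3))   # read spanning a 3-base insertion
print(lift_to_primary(5, 12, 100, 10, 4, 0))   # read spanning a 4-base deletion
```

A complete implementation would also update FLAG and RNAME and would handle reads whose alignment to the alternate sequence already contains gaps; those cases are outside this sketch.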
Aligning reads longer than ~1000 bases is impractical from a memory standpoint if indel alternate sequences are padded with sufficient bases to ensure alignment of full reads. For aligning very long reads, a compact representation of indel alternate sequences (without base padding) will be developed. In essence, each indel sequence must be stitched across the primary sequence region that it overlaps so that bases flanking the indel are coded just once in the dynamic reference.
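The memory argument and the stitched representation can be illustrated with a short sketch. It assumes that a padded alternate sequence carries `read_len - 1` flanking bases on each side, which is one plausible way to guarantee that any read overlapping the indel fits entirely on the alternate sequence. The `StitchedIndel` record and the helper functions are hypothetical and are not the compact representation referred to above, which remains to be developed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StitchedIndel:
    left_end: int     # 0-based primary coordinate (exclusive) where the left flank ends
    right_start: int  # 0-based primary coordinate where the right flank resumes
    inserted: str     # inserted bases; empty for a pure deletion


def padded_bases(inserted_lengths, read_len):
    """Bases stored when every indel carries its own copy of both flanks."""
    pad = read_len - 1  # assumed padding per side
    return sum(2 * pad + n for n in inserted_lengths)


def stitched_bases(inserted_lengths):
    """Bases stored when flanks are shared with the primary sequence."""
    return sum(inserted_lengths)


def spell_across(primary, indel, flank):
    """Reconstruct the allele with flanks read from the primary sequence on demand."""
    left = primary[max(0, indel.left_end - flank):indel.left_end]
    right = primary[indel.right_start:indel.right_start + flank]
    return left + indel.inserted + right


primary = "ACGTACGTAC"  # toy sequence, not real data
deletion = StitchedIndel(left_end=4, right_start=6, inserted="")
insertion = StitchedIndel(left_end=4, right_start=4, inserted="TT")
print(spell_across(primary, deletion, flank=3))
print(spell_across(primary, insertion, flank=2))
print(padded_bases([2, 0], read_len=1000), stitched_bases([2, 0]))
```

Under these assumptions the padded cost grows with the read length for every indel, whereas the stitched cost depends only on the inserted bases plus two coordinates per indel.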
Together with already-proven Mapping/Aligning/Sorting technology, as described in U.S. patent application Ser. No. 14/158,758, filed Jan. 17, 2014, entitled BIOINFORMATICS SYSTEMS, APPARATUSES, AND METHODS EXECUTED ON AN INTEGRATED CIRCUIT PROCESSING PLATFORM, the contents of which are incorporated by reference herein for all purposes, the dynamic reference genome and extensions to the Pipeline Processor that are described herein have a large impact on the quality of analysis results that can be achieved. Not only are rate and quality of variant identification increased from sequence data that is generated by a variety of next generation sequencing technologies, but accuracy of interpretive analysis of variant data is improved to provide novel e-diagnostics for the future, and deeper understanding of disease and its application in a clinical context.
One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable or hardwired system or computing system may interface with client computers and server computers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device, such as for example magnetic disks, optical disks, memory, Programmable Logic Devices (PLDs), and gate arrays, used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example as would a processor cache or other random access memory associated with one or more physical processor cores.
To provide for interaction with a user, one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT), a liquid crystal display (LCD) or a light emitting diode (LED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including, but not limited to, acoustic, speech, or tactile input. Other possible input devices include, but are not limited to, touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive trackpads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.
The subject matter described herein can be embodied in systems, apparatus, methods, and/or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and subcombinations of the disclosed features and/or combinations and subcombinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and/or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Other implementations may be within the scope of the following claims.
This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application Ser. No. 61/943,870, filed Feb. 24, 2014, entitled “Dynamic Genome Reference Generation for Improved NGS Accuracy and Reproducibility”, referenced in this paragraph and incorporated by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
20150339437 A1 | Nov 2015 | US |
Number | Date | Country | |
---|---|---|---|
61943870 | Feb 2014 | US |