The present disclosure relates to an improved method for aligning a read with a haplotype during DNA sequencing.
Recent advances in next-generation sequencing have enabled fast DNA sequencing for cancer, genetic disorder, and pathogen detection. Short DNA fragments are sequenced in a massively parallel fashion, producing billions of DNA reads (strings of ˜100 nucleotides: A, C, G, T) per human genome. Reassembling these DNA fragments to determine differences from a common reference genome (secondary analysis) requires extensive computation. Among the several secondary analysis steps, variant calling is the final step; it identifies disease-related gene mutations and remains extremely time-consuming. In particular, the pair-Hidden-Markov-Model (Pair-HMM) forward algorithm (or PFA) requires ˜250T FLOPs for variant calling to infer mutation probabilities and contributes 70% of variant calling latency. The Pair-HMM forward algorithm requires computing an alignment matrix through a complicated combination of floating-point additions and multiplications to infer the overall similarity of two strings, making it a difficult hardware optimization problem. The Pair-HMM forward algorithm has been mapped to GPUs as well as to systolic arrays and ring-based arrays on FPGAs. However, these methods are still constrained by the availability of floating-point resources, with hardware optimization mainly improving processor element (PE) utilization. Furthermore, no dedicated ASIC has been demonstrated to date to accelerate the Pair-HMM forward algorithm for DNA sequencing.
This section provides background information related to the present disclosure which is not necessarily prior art.
This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
A method is presented for aligning a read with a haplotype. The method includes: constructing an overall matrix for computing alignment probabilities between a given read and a given haplotype; calculating, during a first pass, an alignment probability for each cell in the overall matrix using the Pair-HMM method, where the alignment probabilities are calculated using fixed-point arithmetic; pruning cells from the overall matrix to derive a subset of unpruned cells; and calculating, during a second pass, an alignment probability for each cell in the subset of unpruned cells using the Pair-HMM method, where the alignment probabilities are calculated using floating-point arithmetic.
Pruning cells from the overall matrix further comprises identifying a dominant section of cells in the overall matrix and setting the alignment probability of the remaining cells to zero, where the dominant section of cells represents an alignment between the read and the haplotype with the highest probability. More specifically, pruning cells from the overall matrix may further include: a) identifying a seed cell in the overall matrix, where the seed cell is the cell having the largest alignment probability in the bottom row of the overall matrix; b) determining whether the alignment probability of the seed cell is dominant over an adjacent cell along a diagonal extending towards the upper left of the overall matrix; c) setting the alignment probability of cells in the same row as the seed cell to zero and setting the alignment probability of cells in the same column as the seed cell to zero in response to a determination that the alignment probability of the seed cell is dominant over the adjacent cell; and d) repeating steps b) and c) for each adjacent cell along the diagonal extending from the seed cell until the alignment probability of a given cell along the diagonal is not dominant over the adjacent cell.
In some embodiments, at least one of the given read or the given haplotype is extracted from a biological sample.
In other embodiments, the method further includes selecting the mutation with the highest likelihood based in part on the alignment probability for each cell in the subset of unpruned cells.
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
The drawings described herein are for illustrative purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure.
Corresponding reference numerals indicate corresponding parts throughout the several views of the drawings.
Example embodiments will now be described more fully with reference to the accompanying drawings.
A genome is a long string of DNA base-pairs A, G, C, and T. Sequencers produce chopped and out-of-order reads from biological samples, which are then reconstructed to form the whole-genome sequence. To further identify mutations in a person's genome, the aligned reads need to be compared against a reference genome. This process is referred to as variant calling as seen in
This disclosure presents a pruning-based Pair-HMM algorithm and its ASIC implementation for high-throughput DNA variant calling. The algorithm-architecture co-design identifies high-quality alignment regions in the input data and devotes floating-point resources only to these regions, thereby aggressively reducing expensive floating-point computation. For non-high-quality regions, floating-point multiplication is replaced with low-accuracy 20-bit integer addition by using a log-domain number representation, while maintaining (provable) correctness in the downstream analysis. Floating-point computation is reduced by 43× on real human genome data and is replaced by low-accuracy integer PEs (I-PEs) that are 4.6× smaller in area and 1.9× higher in performance than floating-point PEs (FP-PEs). Implemented in 40 nm CMOS, the 5.67 mm2 accelerator demonstrates 17.3G cell updates per second (CUPS) throughput, marking a 6.6× improvement over a baseline ASIC implementation with the conventional floating-point-only algorithm.
Pair-HMM is a statistical model that determines the per-read likelihood of each haplotype of the individual. It helps determine the real DNA expression of an individual given the possibly incorrect reads. The conventional Pair-HMM forward algorithm calculates the probabilities of all alignments between a candidate mutation string and a DNA read using an alignment matrix as seen in
The forward algorithm is commonly used with the Pair-HMM model; it efficiently calculates the overall probability of all possible alignments between a read and a haplotype. The forward algorithm is essentially a probability-based dynamic programming approach that uses three matrices, fM, fI, and fD, where fk(i,j) corresponds to the combined probability of all alignments up to position (i,j) of the read and haplotype that end in state k. The state k can be I (insertion), D (deletion), or M (match). For each position (i,j), fM, fI, and fD are calculated as below:
$$f_M(i,j) = p_{mm} f_M(i-1,j-1) + p_{im} f_I(i-1,j-1) + p_{dm} f_D(i-1,j-1) \tag{1}$$

$$f_I(i,j) = a_{mi} f_M(i-1,j) + a_{ii} f_I(i-1,j) \tag{2}$$

$$f_D(i,j) = a_{md} f_M(i,j-1) + a_{dd} f_D(i,j-1) \tag{3}$$
where $p_{mm}$, $p_{im}$, $p_{dm}$, $a_{mi}$, $a_{ii}$, $a_{md}$, and $a_{dd}$ are probabilities related to state transitions and the read quality score. The final output of the forward algorithm is the sum of the insertion and match probabilities in the final row: $\sum_{j=1}^{L_h} \left( f_M(L_r, j) + f_I(L_r, j) \right)$, where $L_r$ and $L_h$ denote the read and haplotype lengths.
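For illustration only, a minimal Python sketch of recurrences (1)-(3) follows. The coefficient callback, the initialization of the deletion matrix, and the matrix layout are assumptions introduced here for readability and are not specified by the disclosure.

```python
import numpy as np


def pair_hmm_forward(read, hap, coeffs):
    """Sketch of the Pair-HMM forward recurrences (1)-(3).

    coeffs(i, j) returns (p_mm, p_im, p_dm, a_mi, a_ii, a_md, a_dd) for
    cell (i, j); per the text these fold in state transitions and the
    read base-quality score, so their exact values are assumptions here.
    """
    R, H = len(read), len(hap)
    fM = np.zeros((R + 1, H + 1))
    fI = np.zeros((R + 1, H + 1))
    fD = np.zeros((R + 1, H + 1))
    fD[0, :] = 1.0 / H  # common initialization; an assumption in this sketch
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            p_mm, p_im, p_dm, a_mi, a_ii, a_md, a_dd = coeffs(i, j)
            fM[i, j] = (p_mm * fM[i - 1, j - 1]
                        + p_im * fI[i - 1, j - 1]
                        + p_dm * fD[i - 1, j - 1])
            fI[i, j] = a_mi * fM[i - 1, j] + a_ii * fI[i - 1, j]
            fD[i, j] = a_md * fM[i, j - 1] + a_dd * fD[i, j - 1]
    # Final output: sum of match and insertion probabilities in the last row.
    return fM[R, 1:].sum() + fI[R, 1:].sum()
```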
During a first pass (or scan phase), an alignment probability for each cell in each of the three matrices is calculated or approximated at 13 using the Pair-HMM method. Of note, the alignment probabilities are calculated using the less computationally intensive fixed-point arithmetic. That is, an upper-bound likelihood for each cell in the matrix is computed using integer processor elements operating in a logarithmic number representation, which replaces multiplication with addition and significantly reduces hardware complexity. Other approximation techniques for computing alignment probabilities are also contemplated by this disclosure.
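As a sketch of how the scan phase can avoid floating-point hardware, the snippet below encodes probabilities as fixed-point log2 values so that multiplication becomes integer addition and each sum is bounded from above by a maximum plus a constant. The bit widths, helper names, and the specific bounding rule are assumptions made for illustration, not the disclosed hardware.

```python
import math

FRAC_BITS = 12  # fixed-point fraction bits; the overall word width is assumed


def to_log2_fixed(p):
    """Encode a probability as a fixed-point log2 value (a non-positive integer)."""
    return int(round(math.log2(p) * (1 << FRAC_BITS)))


def log_mul(a, b):
    """Multiplication of probabilities is integer addition in the log domain."""
    return a + b


def log_add_upper_bound(*terms):
    """Cheap upper bound on a sum: log2(x1+...+xn) <= max(log2 xi) + ceil(log2 n).

    Using a maximum plus a constant keeps the scan phase free of multipliers
    and yields the upper-bound likelihood needed for safe pruning.
    """
    n = len(terms)
    bound_bits = math.ceil(math.log2(n)) if n > 1 else 0
    return max(terms) + (bound_bits << FRAC_BITS)
```

For example, the scan-phase counterpart of equation (1) would be computed as `log_add_upper_bound(log_mul(p_mm, fM), log_mul(p_im, fI), log_mul(p_dm, fD))`, where all values are fixed-point log2 encodings.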
In order to calculate the overall alignment probability of a read-haplotype pair, the Pair-HMM method calls for summing fM, fI, and fD from adjacent cells at each position (i,j).
Conceptually, pruning cells from the overall matrix is achieved by identifying a dominant section of cells in the overall matrix and setting the alignment probability of the remaining cells to zero as shown in
First, a seed cell is identified in the overall matrix, where the seed cell is the cell having the largest alignment probability in the bottom row of the overall matrix.
Next, a determination is made as to whether the alignment probability of the seed cell is dominant over an adjacent cell 52, where the adjacent cell is adjacent to the seed cell along a diagonal extending from the seed cell towards the upper left of the overall matrix. In the case that the alignment probability of the seed cell is dominant over the adjacent cell, the alignment probability of cells in the same row as the seed cell is set to zero and the alignment probability of cells in the same column as the seed cell is set to zero as shown in
For each adjacent cell in the diagonal extending from the seed cell, these steps of comparing alignment probabilities for diagonally adjacent cells and pruning cells are repeated until the alignment probability of the given cell in the diagonal is not dominant over the adjacent cell. In some instances, the pruning process may result in a single diagonal stretching from the bottom row of the matrix to the top row of the matrix, and the cells comprising the diagonal are the subset of unpruned cells. In other instances, a stopping condition is reached before the diagonal reaches the top row as shown in
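The following Python sketch illustrates this backward pruning walk. The concrete dominance test (current cell at least a fixed ratio larger than its upper-left diagonal neighbour) and the boolean-mask representation are assumptions introduced here for clarity; the disclosure only requires that a cell be dominant over its diagonal neighbour.

```python
import numpy as np


def prune_matrix(f, dominance_ratio=16.0):
    """Backward pruning walk over a scan-phase matrix f of shape (rows, cols).

    Returns a boolean mask marking the unpruned cells.
    """
    rows, cols = f.shape
    keep = np.ones((rows, cols), dtype=bool)
    # Seed cell: largest alignment probability in the bottom row.
    i, j = rows - 1, int(np.argmax(f[rows - 1]))
    while i > 0 and j > 0:
        # Stop when the current cell no longer dominates its diagonal neighbour.
        if f[i, j] < dominance_ratio * f[i - 1, j - 1]:
            break
        # Prune the rest of the current cell's row and column.
        keep[i, :] = False
        keep[:, j] = False
        keep[i, j] = True
        # Step to the adjacent cell along the upper-left diagonal.
        i, j = i - 1, j - 1
    return keep
```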
The pruning method described above starts with the final row of the matrix and proceeds backwards in the diagonal direction. This requires the hardware to store the values of the entire matrix.
Returning to the method, during the second pass, an alignment probability is calculated for each cell in the subset of unpruned cells using the Pair-HMM method, where the alignment probabilities are calculated using floating-point arithmetic.
The variant calling algorithm emits the final called variant by selecting the mutation with the highest likelihood from amongst all candidate mutations. The mutation likelihood is proportional to the product of all per-read likelihoods (i.e., the output of Pair-HMM). The proposed pruning hardware accelerator guarantees the correctness of the final called variants by requiring the lower bound of the called variant to have a higher probability than the upper bound of the un-called variants. The upper bound of an un-called variant is readily available from the pruning machine output; the lower bound of the called variant is the result of the floating-point machine. Therefore, one can achieve significant speedup and still guarantee the correct final result. The mutation with the highest likelihood is correctly determined as the final variant calling result in 98.5% of all cases. The failing cases (˜1.5%) are guaranteed to be identified and are handled by recomputing using only floating-point processor elements.
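A minimal sketch of this bound check might look as follows, assuming the per-candidate bounds have already been aggregated into scalar likelihoods; the function name and interface are illustrative.

```python
def call_is_guaranteed(called_lower_bound, uncalled_upper_bounds):
    """Return True when the pruned result is provably identical to the
    unpruned result: the lower-bound likelihood of the called variant
    (from the floating-point machine) must exceed the upper-bound
    likelihood of every un-called candidate (from the pruning machine).
    When this check fails (~1.5% of cases), fall back to a full
    floating-point recomputation.
    """
    return all(called_lower_bound > ub for ub in uncalled_upper_bounds)
```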
To effectively implement the proposed pruning-based Pair-HMM algorithm, an example hardware architecture is shown in
In an example embodiment, fabricated in 40 nm CMOS with 5.67 mm2 die area, the pruning-based accelerator runs at 120 MHz with 756 mW power consumption.
The techniques described herein may be implemented by one or more computer programs executed by one or more processors. The computer programs include processor-executable instructions that are stored on a non-transitory tangible computer readable medium. The computer programs may also include stored data. Non-limiting examples of the non-transitory tangible computer readable medium are nonvolatile memory, magnetic storage, and optical storage.
Some portions of the above description present the techniques described herein in terms of algorithms and symbolic representations of operations on information. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. These operations, while described functionally or logically, are understood to be implemented by computer programs. Furthermore, it has also proven convenient at times to refer to these arrangements of operations as modules or by functional names, without loss of generality.
Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system memories or registers or other such information storage, transmission or display devices.
Certain aspects of the described techniques include process steps and instructions described herein in the form of an algorithm. It should be noted that the described process steps and instructions could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.
The present disclosure also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a computer selectively activated or reconfigured by a computer program stored on a computer readable medium that can be accessed by the computer. Such a computer program may be stored in a tangible computer readable storage medium, such as, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, application specific integrated circuits (ASICs), or any type of media suitable for storing electronic instructions, and each coupled to a computer system bus. Furthermore, the computers referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
The algorithms and operations presented herein are not inherently related to any particular computer or other apparatus. Various systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatuses to perform the required method steps. The required structure for a variety of these systems will be apparent to those of skill in the art, along with equivalent variations. In addition, the present disclosure is not described with reference to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present disclosure as described herein.
In sum, the proposed algorithm includes a fast pruning machine implemented using fixed-point calculation and an accurate machine that works only on the unpruned cells using floating-point calculation. In step one, the fast pruning machine calculates the entire alignment matrix rapidly and approximately in the log domain. The calculation can be done using approximation, including fixed-point calculation with fewer bits to optimize speed, and the log-sum is substituted with a fast table lookup. Based on the approximate values, the accelerator prunes squares in the matrix whose values contribute insignificantly to the overall probability, using the method introduced previously. In step two, the precise machine, using a floating-point representation, calculates alignment probabilities on only the remaining squares, resulting in a final probability slightly lower than the accurate overall probability. Based on the proposed pruning technique, one can save 95% of the floating-point computation, leading to a potential 20× reduction in execution time.
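As one way the log-sum could be replaced by a table lookup, the sketch below applies the Jacobian-logarithm identity log2(2^a + 2^b) = max(a, b) + log2(1 + 2^-|a-b|) with a small precomputed correction table. The table size, fixed-point format, and indexing granularity are assumptions made for illustration.

```python
import math

FRAC_BITS = 12  # fixed-point fraction bits (illustrative)

# Correction term log2(1 + 2^-d) for integer differences d; beyond the
# table the correction rounds to zero.
_CORRECTION = [int(round(math.log2(1 + 2.0 ** -d) * (1 << FRAC_BITS)))
               for d in range(16)]


def log_sum_lookup(a, b):
    """Approximate log2(2^a + 2^b) for fixed-point log-domain values a and b
    using a maximum plus a single table lookup, avoiding floating-point
    arithmetic in the scan phase.
    """
    hi, lo = (a, b) if a >= b else (b, a)
    d = (hi - lo) >> FRAC_BITS  # coarse (integer-part) difference index
    return hi + (_CORRECTION[d] if d < len(_CORRECTION) else 0)
```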
The foregoing description of the embodiments has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure. Individual elements or features of a particular embodiment are generally not limited to that particular embodiment, but, where applicable, are interchangeable and can be used in a selected embodiment, even if not specifically shown or described. The same may also be varied in many ways. Such variations are not to be regarded as a departure from the disclosure, and all such modifications are intended to be included within the scope of the disclosure.
This application claims the benefit of U.S. Provisional Application No. 62/795,159, filed on Jan. 22, 2019. The entire disclosure of the above application is incorporated herein by reference.