The present invention relates to the comparison of biological sequences and, more specifically, the invention relates to a method, a computer readable device, and an electronic device for fast and accurate alignment of sequences using local hit tables for rapid screening of local sequence similarity in accordance with the claims.
It is frequently desired to compare two sequences for the purpose of determining similar portions of these sequences. Searching databases for sequences similar to a given sequence is probably one of the most fundamental and important tools for predicting structural variations and functional properties in the modern biology.
The rapidly increasing amounts of genetic sequence information available represent a constant challenge to developers of hardware and software database searching and handling. The expansion of an amount of the genetic sequence information happens at a rate that exceeds the growth in computing power available at a constant cost, in spite of the fact that computing resources also have been increasing exponentially for many years. If this trend continues, increasingly longer time or increasingly more expensive computers will be needed to search the entire database.
Searching for similarity includes introducing “gaps” into a query sequence, a reference sequence, or both sequences that optimize the amount of similarity. When looking for sequences in a database similar to a given query sequence, the search programs compare the query sequence (with unknown distribution of gaps) to every subsequence in the database (again with unknown distribution of gaps). The availability of good tools for fast and accurate alignment of query sequences to database is hence extremely important.
The typical current approach for similarity search and alignment is based on calculating alignment scores of the two sequences using a substitution score matrix and a gap penalty function. A dynamic programming algorithm for computing the optimal local alignment was first described by Smith and Waterman (1981). See T. F. Smith and M. S. Waterman, Identification of Common Molecular Subsequences, J. Mol. Bioi. (1981) 147, 195-97. Sequence matching is commonly performed on very long sequences, and the currently emerging single molecule sequencing technologies are constantly pushing read lengths into longer and longer realm, with 100 K or 1 M base pairs queries are not a fantasy. Performing matching of such sequences using the Smith-Waterman algorithm (SW) is very computationally intensive—on the order of M×N operations (denoted as “O(MN)” complexity), where M and N are the lengths of the two sequences being matched. As a result, the use of the Smith-Waterman algorithm is not practical in many instances. A less computationally intensive method for sequence matching is therefore desired.
U.S. Pat. No. 7,917,302 B2 (Rogues) discloses a method for an efficient parallelization of the Smith-Waterman sequence alignment algorithm using parallel processing in the form of SIMD (Single-Instruction, Multiple-Data) technology. The method still has O(MN) complexity both in time and in space, and hence, not practical for high throughput applications.
U.S. Pat. No. 2009/0125514 A1 (Brown) discloses sequence alignment techniques that introduce a sparse data structure for the Smith-Waterman matrix. This method provides logarithmic improvement for the time complexity O((M+N)log(M)), and because typically
NM that method may still be not sufficient for large sequence databases. For example, for human genomic sequence alignment with N being more than 3 billion characters and M=100, the improvement is about 100/log(100)=21.7, but at the expense of losing some of the true alignments.
“Fast and accurate long-read alignment with Burrows-Wheeler transform” (Li, H. and Durbin, R.) discloses a Burrows-Wheeler Aligner's Smith-Waterman Alignment (BWA-SW) method to align long sequences against a large sequence database (e.g. the human genome). The method adds several heuristics accelerations to the Smith-Waterman algorithm and is able to reduce the time complexity of the algorithm to the sub quadratic level O(N0.628M) and again at the expense of losing some of the true alignments.
U.S. Patent Application No. 61/521,454, Unpublished (filing date Aug. 9, 2011) (Vitaly Galinsky, applicant) discloses techniques for rapid similarity screening between sequences that do not rely on the Smith-Waterman dynamic programming techniques. The method constructs a table of local similarity hits between query and reference sequences and allows to deduce number, size, and type of gaps present in query sequence but do not specify their exact locations. Nevertheless, this method provides improvement for the time complexity O(M), where M is the size of the query sequence.
Accordingly, what is desired, and not heretofore been developed, is a writing alignment tool wherein the total time complexity would be reduced and completely independent from the large size of the reference database. Furthermore, what is desired, and not heretofore been developed, is a writing alignment tool that allows not only rapid assessment of a similarity between genomic sequences but accurately finds all differences between query and reference, including positions of all gaps, in a linear time O(M).
This invention is related to bioinformatics and biological data analysis. Specifically, this invention provides methods, computer software products, and systems for fast and accurate alignment of sequences that follows a step of rapid assessment of similarity between genomic (DNA or RNA) sequences. In preferred embodiments, the methods, computer software products, and systems are used for high throughput processing of vast amount of genomic reads (queries) against very large genomic database (reference), which is aligning queries to the reference.
In preferred embodiments, methods are provided for analyzing the results of the sequential processing of the genomic query against the reference database by means of building of local hit table that includes positions of all similarities between each query subsequence (l base pairs prefix and m base pairs suffix) in the reference database and by merging the local hits in the table that correspond to the same alignment and obtaining gaps and structural variations data in the merging process. The time complexity of local hit table building provided by preferred embodiments depends linearly on the size of the query.
In some preferred embodiments, the methods include using gap size and type data from the merging process in local hit table for further fast and accurate optimization of similarity between query and reference to obtain exact gap locations and/or structural variations present in query.
In another aspect of the invention, systems and computer software are provided for performing the methods of the invention. The systems include a processor; and a memory coupled with the processor, the memory storing a plurality of machine instructions that cause the processor to perform logical steps of the methods of the invention. The computer software products of the invention include a computer readable medium having computer-executable instructions for performing the method of the invention.
The preferred embodiment will be described with reference to the drawings. The method uses prior art forward 102 and backward indices (103, 104) for the reference sequence as shown in
The purpose of splitting l+m bp indices into l-bp prefix and m-bp suffix is not just for conserving memory. Both the prefix and the suffix parts are lexicographically sorted, therefore, they provide not only an index (or hash), but also can work as a forward or backward tree, thus allow fast unwinding of low complexity/low error rate regions. An adaptation of k=8 bit mask to store the locations naturally creates compressed index, as many highly repetitive parts of the reference will be collapsed into a single index entry with the same prefix, suffix and masked location.
The entries with highest number of hits 206 will be candidates for the best alignment of the query to the reference. Presence of single well separated maximum in the hit table clearly indicates the unique alignment. For a set of chosen parameters (l=m=7 and k=8) a difference of 4 or 5 between max and second max entries in the hit table is good enough to rule out random hits that may arise due to rather short period of the l-scaled 2k-masked location for k=8.
Construction of full local hit table 201 for the query sequence creates rather convenient and straight forward way of searching for significant alignments with small but arbitrary gaps 207 as well as for detection of chimeric reads. At the final stage 205, these gaps leave distinct signatures in local hit table. For example, for a single 5 base pairs deletion 207, the leading and the trailing portions of the sequence (roughly of 40 base pairs each in illustration of
Presence in the query sequence highly repetitive areas may complicate the search, especially when combined with errors, single-nucleotide polymorphisms (SNPs), or insertions/deletions. But input from repetitive regions tends to spread uniformly across many hit table entries and won't destroy the input from unique regions. Hence, the local hit table provides easy way of finding the total size of gap or gaps between local subalignments as well as types of these gaps (insertion or deletion) by analyzing the maximums that were formed at different 2k -masked locations. In general, n bases deletion produces n-1 empty cells in the right-left and then top-down direction with respect to the original cell. On the contrary, n bases insertion does it in the reverse, it adds n-1 empty/low noise cells in the left-right down-top direction. Thus, the local hit table provides easy way of finding the total size of gap or gaps between local subalignments as well as types of these gaps (insertion or deletion) by analyzing the maximums that were formed at different 2k-masked locations. For chimeric read the hit table will contain several maximums as well but the separation between them both by the 2k-masked location and by the full distance may in general be arbitrary large.
One of the most important differences of presented method from the majority of long read alignment methods lies in the dismissal of seed-and-extend paradigm that is routinely employed by the most of current long read aligners. The existing algorithms (both hash and trie based) first search for the alignment seeds, but limit their number or size and usually keep several non overlapped 11-12 bp fixed size seeds or one 25-35 bp variable size (possibly with gaps) seed as candidates for further processing. Their next step involves extending seed alignments to the rest of the query sequence using the dynamic programming approaches, usually the Smith-Waterman algorithm. Giving the remaining length of the query q and the size of the part of the reference used for alignment r=q+g (allowing for possible gap or gaps of the total length g) the typical time complexity of this dynamic programming step is O(qr)=O(q2+gq).
Instead of extending relatively short seed area to the entire query sequence with this expensive (quadratic in time) dynamic programming step, the presented invention develops different strategy, which scales linearly O(q) with the size of the query q.
When the entire query sequence has been processed and candidate entries from the hit table for the final alignment have been identified the last processing step involves decoding the unmasked location of the hits in the reference (may be resolving collisions due to finite period of 2k masked location if present in the hits) and glueing those hits together.
Because of the presence of single or multiple base errors, SNPs, or insertions/deletions, there will be areas in the query sequence where no local hits are detected, thus the glueing stage should provide a way of filling all these holes. When the hole is surrounded by the local hits with the same difference between the query and the reference coordinates, the filling is of course trivial and involves simple O(n) (n is the length of the hole, nq) transition through all bases in the hole in order to find and record all mismatches. Of course, theoretically it may be possible that adding equal number of equal length insertions and deletions may lower the total score, practically it is highly unlikely, giving the relatively high gap penalty used in most current algorithms and small size of the holes, especially in resequencing as well as in long read de novo assembly projects.
The major advantage of the algorithm emerges when the hole is located in the area between local hits with different values of the query and the reference coordinate differences (as shown in example in
The approach is illustrated in simplistic form in Figure
The gap that is inserted initially at an arbitrary position in the hole region of the query 401 and is assumed to be able to move freely anywhere in the region. Its position is updated under an influence of attractive force 404 acting on the gap from each mismatch site in the region, either clamped 402 or individually separated 403. A new position of the gap 405 is determined using some effective gap mass (Mg) and cooling schedule (αi). On the following iteration different (usually smaller) number of mismatches 406 will give rise to the attractive force 407 ultimately resulting in a final (optimal) position of the gap 408 after several iterations. The final gap position may be not unique, but in general this iterative procedure has O(n) time complexity for finding an optimal gap location.
The performance (i.e. rate of convergence) of simulated annealing type optimization is implementation dependent and can be controlled or fine tuned by changing various parameters, e.g. annealing (cooling) schedule (i.e. the rate and the pattern of increase or decrease in the mobility of the gap with respect to the same applied force), type of gap/mismatch interaction potential (e.g. dependence of the force on the gap/mismatch separation), effective gap mass and so on.
The sequence assessment embodiments described above may be performed on any suitable type of computer system, which includes any type of computing device.
Processor subsystem 703 may include one or more processors or processing units. For example, processor subsystem 703 may include one or more multi-processor cores, each with its own internal communication and buses. In various embodiments of computer system 720, multiple instances of processor subsystem 703 may be coupled to interconnect 713. In various embodiments, processor subsystem 703 (or each processing unit within 703) may contain a cache 714 or other form of on-board memory.
Computer system 703 also contains memory 704 which is usable by processor subsystem 703. Memory 704 may be implemented using different physical memory media, such as hard disk storage, floppy disk storage, removable disk storage, flash memory, random access memory (SRAM, EDO RAM, SDRAM, DDR SDRAM, etc.), ROM (PROM, EEPROM, etc.), and so on.
As in
I/O interfaces 701 may be any of various types of interfaces configured to couple to and communicate with other devices, according to various embodiments. In one embodiment, I/O interface 701 is a bridge chip from a front-side to one or more back-side buses.
I/O interfaces 701 may be coupled to one or more I/O devices 702 via one or more corresponding buses or other interfaces. Examples of I/O devices include storage devices (hard drive, optical drive, removable flash drive, storage array, storage area network, or their associated controller), network interface devices (e.g., to a local or wide-area network), or other devices (e.g., graphics, user interface devices, etc.)
Memory in computer system 720 is not limited to memory 704. Rather, computer system 720 may be said to have a “memory subsystem” that includes various types/locations of memory. For example, the memory subsystem of computer system 720 may, in one embodiment, include memory 704, cache 714 in processor subsystem 703, and storage on I/O Devices 702 (e.g., a hard drive, storage array, etc.), fixed disk 708 or removable disk 709 directly connected to interconnect bus. Thus, the phrase “memory subsystem” is representative of various types of possible memory media within computer system 720. In some embodiments, the memory subsystem includes program instructions executable by processor subsystem 720 to perform embodiments of the sequence similarity assessment algorithms of the present disclosure.
Various embodiments of the sequence similarity assessment algorithm (described above) may include storing instructions and/or data implemented in accordance with the foregoing description in an article of manufacture such as a tangible computer-readable memory medium, including various portions of the memory subsystem of computer system 720. Certain embodiments of these tangible computer-readable memory media may store instructions and/or data that are computer executable to perform actions in accordance with the present disclosure. Generally speaking, such an article of manufacture may include storage media or memory media such as magnetic (e.g., disk) or optical media (e.g., CD, DVD, BD, and related technologies, etc.). The article of manufacture may be either volatile or nonvolatile memory. For example, the article of manufacture may be (without limitation) SDRAM, DDR SDRAM, SRAM, SanDisk, flash memory, and of various types of ROM, etc.
Further embodiments may include signals such as electrical, electromagnetic, or optical signals, conveyed via a communication medium, link, and/or system (e.g., cable, network, etc.), whether wired, wireless or both. Such signals may carry instructions and/or data implemented in accordance with the foregoing description.
Although specific embodiments have been described above, these embodiments are not intended to limit the scope of the present disclosure, even where only a single embodiment is described with respect to a particular feature. Examples of features provided in the disclosure are intended to be illustrative rather than restrictive unless stated otherwise. The above description is intended to cover such alternatives, modifications, and equivalents as would be apparent to a person skilled in the art having the benefit of this disclosure.
The scope of the present disclosure includes any feature or combination of features disclosed herein (either explicitly or implicitly), or any generalization thereof, whether or not it mitigates any or all of the problems addressed by various described embodiments. Accordingly, new claims may be formulated during prosecution of this application (or an application claiming priority thereto) to any such combination of features. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in the specific combinations enumerated in the appended claims.
The current application claims a priority to the U.S. Provisional Patent application Ser. No. 61/522,853 filed on Aug. 12, 2011.
Number | Date | Country | |
---|---|---|---|
61522853 | Aug 2011 | US |