The invention relates to a method and system for the computationally efficient evaluation of the correlation of sequences, particularly, although not exclusively, nucleotide or protein sequences.
The analysis of nucleotide sequences to determine the correlation between a sample sequence and a reference sequence may be computationally demanding. Sequences consist of multiple elements where the order of the elements in the sequence is important. Each element has a value, and different elements may have the same or different values. For genetic sequences, such as DNA or RNA, each element of the sequence may take one of the following values: A, C, G, T or U. The length of a sequence may vary from relatively small (for example thousands of elements) to large (for example billions of elements).
In general, a first sample sequence (known as a “read”) is analysed with regard to a second reference sequence, typically a genome. Often the reference sequence is longer than the sample sequence, and it is desired to determine whether the reference contains a segment that is similar to or the same as the sample sequence. Reads may be contiguous, as with sequencers produced by Illumina Inc., or be non-contiguous or overlapping, as with sequencers produced by Complete Genomics Inc. and Pacific Biosciences Inc. It is desirable for evaluation algorithms to be able to process any type of read.
Algorithms, such as the Smith-Waterman algorithm and its derivatives, have been developed to compare different genomic sequences. Where the goal of the algorithm is to position a smaller sequence within a larger sequence, the algorithm is known as a gapped alignment algorithm. In many cases, the larger sequence is much longer than the smaller sequence, and as a result there may be more than one location in the larger sequence that is similar to the smaller sequence. There are often small differences between the sample sequence and the corresponding segment of the reference sequence. These differences may be random errors or errors that are systematic to the source of the sample sequence. For example, in the case of DNA sequences, the DNA sequencer reads each nucleotide in the read, but may incorrectly call one nucleotide type as another. Another source of difference is that the DNA segments may naturally differ from the reference genome. Such differences include single nucleotide polymorphisms (SNPs), multiple nucleotide polymorphisms (MNPs), large movements of a region of DNA, and multiple copies of a region of DNA. Errors and differences may be accounted for by using masking techniques as described in other systems, such as in the applicant's international Patent Application No. PCT/NZ2009/000245. Thus it may take a significant amount of computing time to evaluate a sample sequence at each position of a reference sequence for all relevant permutations.
Therefore, the goal of an alignment algorithm is to position the sample sequence within the reference sequence with the best possible match in as short a processing time as possible. This may involve placing an entire read (e.g. as many of the nucleotides in the read as possible) starting at a specific location. Alternatively, it may be desired to determine whether parts of the read (for example, in chimeric reads) are from different locations in the reference.
It is an object of the present invention to provide a method and system for evaluating the correlation of sequences that is more computationally efficient than prior techniques or which at least provides the public with a useful choice.
According to a first aspect there is provided a computer implemented method of evaluating a sequence using a plurality of evaluation algorithms, comprising applying the evaluation algorithms in an order designed to minimise the processing time for carrying out the required evaluation.
According to a further aspect there is provided a computer implemented method of evaluating the correlation between a sample sequence and a reference sequence using a plurality of evaluation algorithms, comprising applying the evaluation algorithms in an order designed to minimise the processing time for carrying out the required evaluation.
There is also disclosed a sequencing system comprising:
The accompanying drawings, which are incorporated in and constitute part of the specification, illustrate embodiments of the invention and, together with the general description of the invention given above and the detailed description of embodiments given below, serve to explain the principles of the invention.
The invention will now be described by way of example only, with reference to examples based on the analysis of nucleotide sequences in the form of genomic sequences of DNA or RNA.
It is usual for different evaluation algorithms to have different properties with regard to speed and the number and frequency of matches between a sample sequence and a reference sequence.
Here, speed refers to how quickly the evaluation algorithm is able to produce results, whereas the quality represents the strength of a match (i.e. an identical match is the most significant and less statistically relevant matches are less significant).
Some alignment algorithms may be fast and produce strong matches, such as a simple “equality sequence aligner algorithm” which simply determines whether there is an exact match.
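By way of illustration only, an equality aligner of this kind might be sketched as follows. The function name `find_exact_matches` and the example sequences are assumptions made for this sketch and are not taken from the described system.

```python
def find_exact_matches(reference: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` occurs exactly in `reference`."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Advance by one base so that overlapping occurrences are also reported.
        start = reference.find(read, start + 1)
    return positions


# Hypothetical toy data: the read occurs twice in the reference.
print(find_exact_matches("ACGTACGTTAGC", "ACGT"))
```

Because such a test only reports identical segments, any position it returns is a perfect match, which is why a match from this algorithm is both fast to obtain and strong.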
A fast algorithm may produce many possible “fires” (matches according to specified match criteria) in a short time, whereas a slow algorithm may produce a few possible areas of alignment in a long time. Evaluation algorithms may be ordered based on their number of matches and frequency of matches.
Take for example:
Algorithm 3 “fires” on 10% of the data and runs at 100,000 alignments/sec
Algorithm 3 makes the fewest alignments but it is so fast that, if run first, it may reduce the remaining data to 90% of the original, resulting in a substantial time saving for the slower algorithms that follow. The quality of the matches produced by different algorithms may also be taken into account in determining the order of application of algorithms.
Based on knowledge of the characteristics of evaluation algorithms (their speed, number of matches with respect to processing time and statistical quality of matches) their order of application may be prescribed so as to minimise typical processing time.
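The effect of ordering on processing time may be illustrated with a simple cost model. The following is a minimal sketch, assuming that each algorithm resolves a fixed fraction of the reads it is given independently of the other algorithms, and that unresolved reads pass to the next algorithm. Only the figures for the fast filter (10% and 100,000 alignments/sec) come from the example above; the other two entries, and all names, are hypothetical values chosen for illustration.

```python
from itertools import permutations
from typing import NamedTuple


class Algo(NamedTuple):
    name: str
    fire_rate: float   # fraction of the reads it receives that it resolves
    throughput: float  # alignments per second


def expected_seconds_per_read(order) -> float:
    """Average time spent per input read when algorithms run in the given order."""
    remaining = 1.0  # fraction of reads still unresolved
    total = 0.0
    for algo in order:
        total += remaining / algo.throughput
        remaining *= 1.0 - algo.fire_rate
    return total


def best_order(algos):
    """Exhaustively try every ordering; practical only for a handful of algorithms."""
    return min(permutations(algos), key=expected_seconds_per_read)


algos = [
    Algo("hypothetical_slow", 0.90, 100.0),      # assumed figures
    Algo("hypothetical_medium", 0.60, 1_000.0),  # assumed figures
    Algo("fast_filter", 0.10, 100_000.0),        # figures from the example above
]
for algo in best_order(algos):
    print(algo.name)
```

Under this model the fast filter is placed first even though it fires least often, because its cost per read is negligible and every read it resolves is spared the slower stages.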
The present system uses a set of evaluation algorithms one after another to evaluate potential alignment positions with high efficiency. In a multi-processor system a number of evaluation algorithms may be run in parallel and allocated to processors based on the speed of the algorithms and the performance characteristics of the processors. For example, slower processors may be allocated algorithms with short processing times (such as identity/equality algorithms) so that the results of those algorithms are not unduly delayed.
In the general case, the system uses faster evaluation algorithms first to reduce the number of potential alignment positions before using slower evaluation algorithms that may produce more and/or better quality matches to further reduce the number of potential alignment positions. However, due to different properties of the data and equipment, different orders of evaluation algorithms may be appropriate and the system is designed to also account for these factors.
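The general case described above, in which each algorithm narrows the set of potential alignment positions left by the previous one, may be sketched as follows. The two stage functions and all names are hypothetical stand-ins chosen for brevity; they are not the evaluation algorithms of the described system.

```python
from typing import Callable

# A stage receives the reference, the read and the surviving candidate
# positions, and returns the positions it keeps.
Stage = Callable[[str, str, list[int]], list[int]]


def first_base_filter(reference: str, read: str, candidates: list[int]) -> list[int]:
    """Cheap stage: keep positions whose first base equals the read's first base."""
    return [p for p in candidates if reference[p] == read[0]]


def mismatch_filter(max_mismatches: int) -> Stage:
    """Slower stage: keep positions with at most `max_mismatches` substitutions."""
    def stage(reference: str, read: str, candidates: list[int]) -> list[int]:
        kept = []
        for p in candidates:
            window = reference[p:p + len(read)]
            if sum(a != b for a, b in zip(window, read)) <= max_mismatches:
                kept.append(p)
        return kept
    return stage


def run_cascade(reference: str, read: str, stages: list[Stage]) -> list[int]:
    """Apply the stages in order, starting from every possible start position."""
    candidates = list(range(len(reference) - len(read) + 1))
    for stage in stages:
        candidates = stage(reference, read, candidates)
        if not candidates:
            break  # nothing left for later stages to examine
    return candidates


print(run_cascade("ACGTACCTTAGC", "ACGT", [first_base_filter, mismatch_filter(1)]))
```

Reordering the `stages` list changes only the running time in this sketch, not the result, since each stage is a pure filter.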
Referring to
If one or more exact matches are discovered, then the one or more alignment positions are recorded as alignment positions with perfect alignment. For long reads it is highly unlikely that a position within the genome exactly matches the sample sequence by chance, and so there is a high probability that at most one exact match will be found and that this will be the correct alignment. The probability of correct alignment is higher for longer sample sequences (the present embodiment typically employs sample sequences of about 18 to 22 bases). In this embodiment, if one alignment position is found in this step, the system ceases searching and returns the location of alignment with the reference sequence as the alignment position.
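The likelihood of a chance exact match can be estimated with a back-of-envelope calculation. The sketch below assumes, as a simplification, that every base of the reference is independent and equally likely to be any of four nucleotides; real genomes contain repeats and compositional bias, so the estimate is indicative only. The function name and the genome length of three billion bases are assumptions made for this illustration.

```python
def expected_random_matches(genome_length: int, read_length: int,
                            alphabet_size: int = 4) -> float:
    """Expected number of positions that match a read purely by chance."""
    positions = max(genome_length - read_length + 1, 0)
    return positions / alphabet_size ** read_length


# Hypothetical genome of three billion bases, reads of 18 to 22 bases.
for length in (18, 20, 22):
    print(length, expected_random_matches(3_000_000_000, length))
```

Each additional base divides the expected count by four under this assumption, which is consistent with longer sample sequences giving a higher probability of correct alignment.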
If an exact match is not found, or it is desired to also find similar but not exact alignment positions, then further evaluation algorithms may be applied.
In this embodiment, the sample sequence and reference sequence are then run through a lower bound algorithm 2. The purpose of this algorithm is to perform a first sample on the sequences by performing a coarse search of the reference sequence to ensure that there is a reasonable chance of discovering alignment positions for the sample sequence in the reference sequence. In this search the unmodified sample sequence is compared to the reference sequence and alignments are scored based on the quality of the alignment—i.e. points are added according to the nature of the misalignments to form a cumulative score at each position (as shown in
In step 3 the sample sequence is modified at each potential alignment position with the reference sequence.
In step 4 a seeded aligner is employed in which portions of the sample sequence that match the reference sequence are positioned and detailed evaluation algorithms analyse the gaps between the seeds. If a match with a score below a threshold value is found then this alignment may be recorded and processing may terminate.
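The seeding part of such an aligner may be sketched as below. This is a simplified illustration, assuming fixed-length seeds (k-mers) looked up in a hash index of the reference; it only proposes candidate start positions by counting agreeing seeds and does not perform the detailed evaluation of the gaps between seeds. All names and the seed length are hypothetical.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference: str, k: int) -> dict[str, list[int]]:
    """Map every length-k substring of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_candidates(index: dict[str, list[int]], read: str, k: int) -> list[tuple[int, int]]:
    """Return (implied read start, number of supporting seeds), best supported first."""
    votes = Counter()
    for offset in range(len(read) - k + 1):
        for ref_pos in index.get(read[offset:offset + k], ()):
            # A seed at read offset `offset` found at `ref_pos` implies that
            # the read starts at `ref_pos - offset` in the reference.
            votes[ref_pos - offset] += 1
    return votes.most_common()


reference = "TTTTACGTACGGTTTT"
index = build_kmer_index(reference, k=4)
print(seed_candidates(index, "ACGTACGG", k=4))
```

Positions supported by many seeds would then be passed to a detailed evaluation algorithm, while poorly supported positions can be discarded cheaply.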
If no alignment has a score below the threshold then a final evaluation algorithm 5 may be employed. This may be an algorithm that returns the best alignment. The further evaluation algorithms may be an algorithm based on the Smith Waterman algorithm such as the Gotoh aligner or Edit Distance aligner.
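As an illustration of a final, exhaustive stage, the following is a minimal sketch of an edit distance aligner in which the read may align to any segment of the reference. It uses unit costs for substitutions and gaps, which is an assumption of this sketch; it is not the Gotoh aligner, which uses affine gap penalties. A lower score indicates a better alignment.

```python
def best_edit_distance(reference: str, read: str) -> tuple[int, int]:
    """Return (distance, end) for the reference segment closest to `read`.

    `end` is the exclusive end index of the first best-scoring segment.
    """
    # previous[j]: cost of aligning the read prefix processed so far to some
    # reference segment ending at index j. An empty prefix costs 0 anywhere,
    # so the alignment may start at any reference position.
    previous = [0] * (len(reference) + 1)
    for i, read_base in enumerate(read, start=1):
        current = [i]  # i read bases aligned against an empty reference segment
        for j, ref_base in enumerate(reference, start=1):
            substitution = previous[j - 1] + (read_base != ref_base)
            skip_read_base = previous[j] + 1     # read base left unpaired
            skip_ref_base = current[j - 1] + 1   # reference base left unpaired
            current.append(min(substitution, skip_read_base, skip_ref_base))
        previous = current
    distance = min(previous)
    return distance, previous.index(distance)


print(best_edit_distance("TTTTACGTACGGTTTT", "ACGTACGG"))
```

The running time grows with the product of the two sequence lengths, which is why an algorithm of this kind is better reserved for the candidates that faster algorithms could not resolve.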
In one embodiment, the series of alignment algorithms may be predetermined before the system is run, and may be set by the user. In another embodiment, the series is at least in part determined by one or more parameters of the job. Such parameters may include, for example, the length of the sample sequence, information on the source of the sample sequence (i.e. the equipment that the sample sequence is sourced from), the alignment score desired by the user, and specific knowledge of the properties of the reference sequence. In one embodiment, the series may be altered between applications of evaluation algorithms according to the results of the evaluation algorithms already applied.
The first evaluation algorithm applied is, in general, a fast searching algorithm. Its purpose is to reduce the number of potential alignment positions from every position in the reference sequence to a smaller set of positions. Then typically a second, high coverage, but slower, evaluation algorithm is used to further reduce the set of potential alignment positions. Further evaluation algorithms may be applied until the set of alignment positions only contains alignments with better scores than the minimum set by the user. In one embodiment, the user selects a maximum operating time and/or number of evaluation algorithms to use, and once either of these conditions is met the system finishes searching for alignment positions. One of the evaluation algorithms may be a weighted probability algorithm that outputs a weighted probability of each position in the read being in each of a variety of states (A, T, C, G, deleted, etc.). The weighted probability is a function of all possible “paths” from the start of the read to the end of the read.
In one embodiment, coarser searching algorithms (simple positioning algorithms) are used to obtain a set of possible alignment positions, and the finer searching algorithms (local or global alignment algorithms) are used to reduce this set until a specified level of certainty is reached. However, it is understood that, depending on a variety of factors, different orders and different types of algorithms may be used. The ordering may be based upon historical information as to the performance of evaluation algorithms, a characteristic of the sequences concerned, the sequencing equipment used to obtain reads etc. A characteristic of the sequence may be obtained by user input or by preliminary analysis of one or more sequences. The system may also dynamically select the order of evaluation algorithms based on the results of algorithms that have already run, or the order may be set at the start of processing or preset for a specific analyser. An evaluation algorithm engine may determine the order of application of algorithms and may be a rule based engine or an artificial intelligence engine employing a neural network or genetic algorithm to select algorithm ordering. The evaluation algorithm engine may also include a “Meta-aligner” which alters the relative positioning of sequences as well as selecting the algorithms to apply. Such a Meta-aligner may be applied as a final algorithm to run in loops to attempt to find an alignment above a required threshold.
In one embodiment, a user selects a minimum alignment score. The alignment score is a measure of how well a segment of the reference sequence matches to the sample sequence. Typically, a higher score is given to segments which align well with the sample sequence. In one case, the score is a relative value, for example 90%, and limits possible segments to those that match within 90% of the sample sequence. The threshold may be based on “local alignment” where the score is determined based on alignment of only a portion of the sequences.
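A relative score of this kind may be illustrated with a simple identity measure. The sketch below assumes an ungapped comparison in which the score is the fraction of read positions that equal the corresponding reference base; the described system may compute its score differently, and the function names and example sequences are hypothetical.

```python
def identity_fraction(segment: str, read: str) -> float:
    """Fraction of positions at which an equal-length segment matches the read."""
    if len(segment) != len(read) or not read:
        raise ValueError("segment and read must be non-empty and of equal length")
    return sum(a == b for a, b in zip(segment, read)) / len(read)


def positions_meeting_threshold(reference: str, read: str,
                                minimum: float = 0.9) -> list[tuple[int, float]]:
    """Return (position, score) for every ungapped placement scoring at least `minimum`."""
    hits = []
    for p in range(len(reference) - len(read) + 1):
        score = identity_fraction(reference[p:p + len(read)], read)
        if score >= minimum:
            hits.append((p, score))
    return hits


# Hypothetical data: the read differs from the reference segment at one of ten bases.
print(positions_meeting_threshold("GGACGTACGTAAGG", "ACGTACGTAC"))
```

With a 90% threshold, a ten-base read tolerates a single mismatch; raising the threshold to 100% reduces this test to the equality test.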
Referring to
Referring to
By ordering evaluation algorithms based on their processing time and likelihood of producing a determinative outcome, processing time can be dramatically reduced.
While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in detail, it is not the intention of the Applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and methods, and illustrative examples shown and described. Accordingly, departures may be made from such details without departure from the spirit or scope of the Applicant's general inventive concept.
Number | Date | Country | Kind |
---|---|---|---|
585505 | May 2010 | NZ | national |
585532 | May 2010 | NZ | national |
585984 | Jun 2010 | NZ | national |
This is a continuation of PCT Application No. PCT/NZ2011/000080, with an international filing date of May 20, 2011, which claims priority to New Zealand Application No. NZ585505, filed May 20, 2010, New Zealand Application No. NZ585532, filed May 21, 2010, and New Zealand Application No. NZ585594, filed Jun. 8, 2010. PCT Application No. PCT/NZ2011/000080, filed May 20, 2011, is incorporated herein by reference in its entirety.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/NZ2011/000080 | May 2011 | US |
Child | 13681046 | | US |