Not Applicable.
Advances in biomolecule sequence determination, in particular with respect to nucleic acid and protein samples, has revolutionized the fields of cellular and molecular biology. Facilitated by the development of automated sequencing systems, it is now possible to sequence whole genomes. However, the quality of the sequence information must be carefully monitored, and may be compromised by many factors related to the biomolecule itself or the sequencing system used, including the composition of the biomolecule (e.g., base composition of a nucleic acid molecule), experimental and systematic noise, variations in observed signal strength, and differences in reaction efficiencies. As such, processes must be implemented to analyze and improve the quality of the data from such sequencing technologies.
One strategy for improving sequence data is to use redundant sequence data, e.g., as a means to improve accuracy by allowing random errors to be identified and resolved in the final sequence determination. However, the methods currently used are typically restrictive with regards to the types of polymorphisms that may be analyzed and the types of error models that can be used. As such, there is a need for a more powerful and flexible method for consensus sequence determination from redundant biomolecule sequence data.
Various embodiments and components of the present invention employ signal and data analysis techniques that are familiar in a number of technical fields. For clarity of description, details of known analysis techniques are not provided herein. These techniques are discussed in a number of available reference works, such as: R. B. Ash. Real Analysis and Probability. Academic Press, New York, 1972; D. T. Bertsekas and J. N. Tsitsiklis. Introduction to Probability. 2002; K. L. Chung. Markov Chains with Stationary Transition Probabilities, 1967; W. B. Davenport and W. L Root. An Introduction to the Theory of Random Signals and Noise. McGraw-Hill, New York, 1958; S. M. Kay, Fundamentals of Statistical Processing, Vols. 1-2, (Hardcover—1998); Monsoon H. Hayes, Statistical Digital Signal Processing and Modeling, 1996; Introduction to Statistical Signal Processing by R. M. Gray and L. D. Davisson; Modern Spectral Estimation: Theory and Application/Book and Disk (Prentice-Hall Signal Processing Series) by Steven M. Kay (Hardcover—January 1988); Modern Spectral Estimation: Theory and Application by Steven M. Kay (Paperback—March 1999); Spectral Analysis and Filter Theory in Applied Geophysics by Burkhard Buttkus (Hardcover—May 11, 2000); Spectral Analysis for Physical Applications by Donald B. Percival and Andrew T. Walden (Paperback—Jun. 25, 1993); Astronomical Image and Data Analysis (Astronomy and Astrophysics Library) by J. L. Starck and F. Murtagh (Hardcover—Sep. 25, 2006); Spectral Techniques In Proteomics by Daniel S. Sem (Hardcover—Mar. 30, 2007); Exploration and Analysis of DNA Microarray and Protein Array Data (Wiley Series in Probability and Statistics) by Dhammika Amaratunga and Javier Cabrera (Hardcover—Oct. 21, 2003).
The invention is generally directed to processes for analyzing replicate sequence information, and for ultimately identifying a consensus sequence of a biomolecular target sequence from the replicate sequence information. Consequently, the invention is also directed to systems that carry out these processes.
The invention and various specific aspects and embodiments will be better understood with reference to the following detailed descriptions and figures, in which the invention is described in terms of various specific aspects and embodiments. These are provided for purposes of clarity and should not be taken to limit the invention. The invention and aspects thereof may have applications to a variety of types of methods, devices, and systems not specifically disclosed herein.
Furthermore, the functional aspects of the invention that are implemented on a computer or other logic processing systems or circuits, as will be understood to one of ordinary skill in the art, may be implemented or accomplished using any appropriate implementation environment or programming language, such as C, C++, Cobol, Pascal, Java, Java-script, HTML, XML, dHTML, assembly or machine code programming, RTL, etc.
The present invention is generally directed to powerful and flexible methods and systems for consensus sequence determination from replicate biomolecule sequence data. Technologies and methods for biomolecule sequence determination do not always produce sequence data that is perfect. For example, it is often the case that DNA sequencing data does not unambiguously identify every base with 100% accuracy, and this is particularly true when the sequencing data is generated from a single pass, or “read.” Using replicate sequence information provides a means to improve validation of sequence reads as compared to single sequence fragment analysis alone. As such, it is an object of the present invention to improve the accuracy of consensus biomolecule sequence determination from replicate sequence data. In certain aspects, the current invention provides algorithms for assimilating replicate sequence into a final consensus sequence more accurately than any one-pass sequence analysis system.
In certain aspects, the methods are computer-implemented methods. In certain aspects, the algorithm and/or results (e.g., consensus sequences generated) are stored on computer-readable medium, and/or displayed on a screen or on a paper print-out. In certain aspects, the results are further analyzed, e.g., to identify genetic variants, to identify a source of the replicate sequence information, to identify genomic regions conserved between individuals or species, to determine relatedness between two individuals, to provide an individual with a diagnosis or prognosis, or to provide a health care professional with information useful for determining an appropriate therapeutic strategy for a patient.
Methods for refining multiple sequence alignments are provided. In certain embodiments, such methods comprise providing an initial multiple sequence alignment; perturbing a portion of the initial multiple sequence alignment to generate a candidate multiple sequence alignment having a perturbed portion; evaluating the candidate multiple sequence alignment by scoring the portion of the initial multiple sequence alignment to generate a first score, and scoring the perturbed portion of the candidate multiple sequence alignment to generate a second score; and accepting the candidate multiple sequence alignment as a new multiple sequence alignment if the second score is greater than the first score. The initial multiple sequence alignment typically comprises a biomolecular sequence, e.g., a polynucleotide sequence. In certain embodiments, the method further comprises performing at least one sequencing reaction to provide the biomolecular sequence, e.g., a sequencing-by-incorporation assay. Preferably, the providing comprises aligning a plurality of replicate sequence reads using one or more MSA algorithms. Typically, the method is iteratively applied, wherein the new multiple sequence alignment is subsequently perturbed and evaluated. The perturbing can comprise performing a gap shifting operation on at least one column of the initial multiple sequence alignment. The scoring can comprise computation of a geometric mean of a signal-to-noise ratio within the column and/or application of an objective function, e.g.,
In certain aspects, methods are provided for determining a consensus call. In certain embodiments, such methods comprise generating a set of training sequence reads from a known template sequence using a given sequencing system; using the set of training sequence reads and a given alignment schema to generate a training multiple sequence alignment, where a read is a series of units, and wherein a unit is either a base call or a single base gap; measuring unit association occurrences in the training multiple sequence alignment; determining conditional probabilities for each combination of training sequence units and a known template sequence unit; generating a set of experimental sequence reads from an unknown template sequence using the given sequencing system; using the set of experimental sequence reads and the given alignment schema to generate an experimental multiple sequence alignment; for each column in the experimental multiple sequence alignment having a plurality of units therein, using the conditional probabilities to compute the likelihoods of observing the plurality of units for each of a set of possible template units for the column; and identifying which of the possible template units gives the highest likelihood in step g and identifying it as the consensus call for the column. In some preferred embodiments, the methods further comprise using a χ2 statistic to determine confidence in the consensus call for the column.
In certain aspects, methods for generating a consensus sequence for a region of a multiple sequence alignment are provided. Such methods preferably comprise providing a multiple sequence alignment comprising a set of actual reads across a region of interest; providing a set of randomly generated candidate reads for the region of interest having a length equal to a mean length of the actual reads; measuring an edit distance between each of the randomly generated candidate reads and the set of actual reads to generate a fitness for each of the randomly generated candidate reads; selecting a subset of the randomly generated candidate reads, wherein the selecting preferentially chooses randomly generated candidate reads having high fitness with respect to others in the set of randomly generated candidate reads; pairing members of the subset selected in the selecting step to produce a plurality of pairs of candidate reads; subjecting each of the pairs of candidate reads to a crossover procedure to generate a first set of recombined candidate reads; subjecting the set of recombined candidate reads to the measuring, selecting, pairing, and subjecting steps to generate a further set of recombined reads, and sequentially repeating this step, each time using the set of recombined candidate reads from an immediately prior crossover procedure for a subsequent measure of the edit distance; terminating the sequentially repeating, thereby providing a final set of recombined candidate reads; measuring an edit distance between each member of the final set of recombined candidate reads and the set of actual reads to generate a fitness for each member of the final set of recombined candidate reads; and determining which member of the final set of recombined candidate reads has a best fitness as a fittest candidate read, and identifying the fittest candidate read as the consensus sequence for the region of the multiple sequence alignment. In certain preferred embodiments, fitness is computed using an equation, e.g.,
Preferably, the “selecting a subset of the randomly generated candidate reads” step preferably retains some of the randomly generated candidate reads having low fitness with respect to others in the set of randomly generated candidate reads, and optionally can be performed using a roulette-wheel selection. In some embodiments, multiple fittest candidate reads are identified in the determining step, and the method further comprises constructing an undirected and weighted graph comprising nodes representing a first of the multiple fittest candidate read and portions of the actual reads that overlap the first within the multiple sequence alignment; repeating the constructing step for all of the fittest candidate reads; generating a minimum spanning tree for each of the undirected and weighted graphs constructed, thereby generating a set of minimum spanning trees; determining which of the minimum spanning trees has a highest degree edge, wherein the highest degree edge is an edge that participates in the greatest number of template-to-read paths; and identifying which of the multiple fittest candidate reads is represented within the minimum spanning tree having the highest degree edge, wherein this multiple fittest candidate read is chosen as the consensus sequence for the region of the multiple sequence alignment. Each of the fittest candidate reads can be a root of one of the set of minimum spanning trees. Further, the edit distance can be computed using a combination of base calls from the actual reads and base quality values for each of the base calls.
In certain aspects, systems for generating a consensus sequence are provided. Such systems typically comprise computer memory containing an alignment of a set of replicate sequence reads; computer-readable code for determining an optimal Steiner string for the alignment, wherein the optimal Steiner string is the consensus sequence; and memory for storing the consensus sequence so generated.
Sequencing applications generally fall into two categories, de novo assembly and re-sequencing. Both efforts require highly-automated, accurate assembly of nucleic acid fragments into contigs. They differ in that de novo assembly is performed using overlapping reads, while re-sequencing assumes knowledge of a reference sequence and maps reads to the reference. Although establishing relative read position is significantly easier for re-sequencing, the subsequent task of calling a consensus base for each aligned column in the contig or alignment is still challenging. The standard of sequencing accuracy was set to 99.99% by the National Human Genome Research Institute (NHGRI) in 1998. While a single base call for each position in a template may not achieve such accuracy, but with increases in coverage multiple overlapping sequencing reads for a template sequence having lower raw read accuracy can be used to determine a consensus sequence with acceptably high accuracy. Consensus calling algorithms attempt to distinguish sequencing error from variants (e.g., SNPs) using multiple “queries” for a given position. A variety of such algorithms have been developed to address changes in sequencing coverage, error profiles, and information accompanying base calls as new sequencing systems are developed, e.g, Li, et al. (2008) Genome Res. 18:1851-1858; and Chen, et al. (2007) Genome Res. 17(5):659-666. Other methods and algorithms that may be used with or are otherwise related to the methods provided herein are found in G. A. Churchill, M. S. Waterman (1992) “The Accuracy of DNA Sequences: Estimating Sequence Quality,” Genomics 14: 89-98; M. Stephens, et al. (2006) “Automating sequence-based detection and genotyping of SNPs from diploid samples,” Nat. Genet., 38: 375-381; Li, et al (2008) “Mapping short DNA sequencing reads and calling variants using mapping quality scores,” Genome Research 18(11):1851-8; and Chen, et al. (2007) “PolyScan: An automatic indel and SNP detection approach to the analysis of human resequencing data,” Genome Research 17(5):659-666, the disclosures of which are incorporated herein by reference in their entireties for all purposes. Further methods and algorithms that may be used with or are otherwise related to the methods provided herein are found in U.S. Patent Publication Nos. 20090024331 and 20100169026, and U.S. patent application Ser. Nos. 13/034,233 and 13/034,199, both filed Feb. 24, 2011, all of which are incorporated by reference herein in their entireties for all purposes.
The present invention is generally directed to powerful and flexible methods and systems for consensus sequence determination from replicate biomolecule sequence data. Technologies and methods for biomolecule sequence determination do not always produce sequence data that is perfect. For example, it is often the case that DNA sequencing data does not unambiguously identify every base with 100% accuracy, and this is particularly true when the sequencing data is generated from a single pass, or “read.” Using replicate sequence information provides a means to improve validation of sequence reads as compared to single sequence fragment analysis alone. As such, it is an object of the present invention to improve the accuracy of consensus biomolecule sequence determination from replicate sequence data. In certain aspects, the current invention provides algorithms and processes for assimilating replicate sequence into a final consensus sequence more accurately than any one-pass sequence analysis system. For example, using the methods described herein a consensus sequence having 99.9% accuracy can be generated from twelve sequencing reads generated by sequencing a single template nucleic acid, where each of the sequencing reads has a single-pass accuracy of about 86%. These twelve sequencing reads can be used in the methods herein to generate a highly accurate consensus sequence for the single template nucleic acid. In preferred embodiments, the consensus sequencing accuracy achieved using the methods herein is at least 95%, 96%, 97%, 98%, 99%, 99.5%, 99.9%, or 99.99%.
In certain aspects, the methods provided herein comprise identifying regions of a biomolecule for which sequencing data are ambiguous. In certain embodiments, a plurality of regions in a biomolecule for which sequencing data are ambiguous are identified as Islands of Ambiguity (IAs), or just “islands.” These regions are adjacent to and/or between other regions of the biomolecule identified as Columns of Certainty (CCs), sometimes referred to as “posts.” In certain aspects, a threshold value is used to evaluate the sequence data to determine if a position is within a CC (post) or an IA (island). In further aspects, ambiguous data for positions identified as within an island are resolved to generate a consensus sequence for that region. In preferred embodiments, a probabilistic graph model is used to determine the consensus sequence for at least one or more islands. In particular embodiments, the methods involve determination of a plurality of consensus sequences for a plurality of islands identified by analysis of replicate sequencing reads of a biomolecule, e.g., a DNA molecule. Consensus sequences for each island in a biomolecule may optionally be combined with sequence data for posts in the biomolecule to generate a template consensus sequence extending across a combination of at least one island and at least one post, e.g., across multiple islands and the intervening posts, e.g., across a target sequence in the template.
In certain preferred embodiments, the problem of calling a consensus sequence from a plurality of sequence reads is formulated as a maximization problem using a Bayesian approach and a generalized representation of the information imparted by each individual read. Further, certain preferred sequencing technologies generate sequencing reads that possess information beyond the individual called bases. For example, single-molecule, real-time sequencing-by-synthesis technologies can provide sequence reads that comprise not only base call information, but also various aspects of the kinetic characteristics of the reaction, which can be used to derive various metrics including, e.g., the likelihood of various error modalities and alternate sequence interpretations. This additional information, in combination with the called bases, is used to better refine the consensus sequence determination. Benefits of this approach include a faster climb to consensus accuracy with increasing coverage, and reduced sensitivity to underlying systematic error.
Sequences from various kinds of biomolecules may be analyzed by the methods presented herein, e.g., polynucleotides and polypeptides. The biomolecule may be naturally-occurring or synthetic, and may comprise chemically and/or naturally modified units, e.g., acetylated amino acids, methylated nucleotides, etc. Methods for detecting such modified units are provided, e.g., in U.S. Ser. No. 12/635,618, filed Dec. 10, 2009; and Ser. No. 12/945,767, filed Nov. 12, 2010, which are incorporated herein by reference in their entireties for all purposes. In certain embodiments, the biomolecule is a nucleic acid, such as DNA, RNA, cDNA, or derivatives thereof. In some preferred embodiments, the biomolecule is a genomic DNA molecule. The biomolecule may be derived from any living or once living organism, including but not limited to prokaryote, eukaryote, plant, animal, and virus, as well as synthetic and/or recombinant biomolecules.
For ease of discussion, various aspects of the invention will be described with regards to analysis of polynucleotide sequences, but it is understood that the methods and systems provided herein are not limited to use with polynucleotide sequence data and may be used to determine consensus sequences for other types of sequence data, e.g., from polypeptide sequencing reactions.
The multiple sequence reads may be generated in any number of ways, including, e.g., repeatedly sequencing the same molecule, sequencing a template comprising multiple copies of the target sequence, sequencing multiple individual biomolecules all of which contain the sequence of interest or “target” sequence, or a combination of such approaches. The replicate sequence reads need not begin and end at the same position in the biomolecule sequence, as long as they contain at least a portion of the target sequence, and in some preferred embodiments contain at least a portion of an island.
In certain embodiments, one aspect of the set of replicate sequence reads is that they have been determined to belong together by a previous stage of processing. In other words, there is an assumption that the set of reads contains replicate sequence data that must be validated. In certain embodiments in which the reads are coming from separate single molecules (e.g., as shown in
At step 2 of
Time and memory requirements for MSA determination increase exponentially with the number of sequences to be represented in the MSA. Therefore, avoiding an exhaustive search of possible alignment configurations and applying some sort of heuristic or random search to find the optimal solution is widely practiced. To produce highly accurate MSAs efficiently, after an existing, approximate solution is established, a series of rearrangements is made to optimize the solution according to some objective function to improve the resulting alignments. However, prior methods demonstrating success tended to evaluate the biological relevance of the alignments by referring to structural alignments. For example, the BALiBASE benchmark dataset contains manually constructed MSAs that result from 3D structural superpositions. The methods described herein provide new methods and objective functions for high quality alignments of sequence data.
In certain aspects, the present invention provides an MSA refinement procedure using Simulated Annealing and a different objective function than others known in the art (e.g., Sum of Pairs and COFFEE). MSA refinement is shown as step 3 of
A preferred embodiment of a simulated annealing framework used to search and evaluate the solution space is shown in
“Alignment blocks” refer to nonoverlapping regions comprising multiple adjacent positions or “columns” within an MSA. Although an alignment block may contain islands or posts or portions thereof, they are not defined based on the location of islands and posts within the MSA. This scoring mechanism effectively computes the geometric mean of the signal-to-noise ratio within a column, where a column is a set of “calls” for a given position in the assembled reads as explained above. It is beneficial to use the geometric mean as opposed to some other measure such as the arithmetic mean because the desired measure is the percentage of change across two alignments (e.g., one having and one lacking the gap shift), not the absolute magnitude of change. To illustrate this point: an increase of quantity x in the score of a low quality column does not have the same impact on downstream consensus calling as the same increase in the score of a high quality column. It is higher priority to improve low quality columns since those are likely to be miscalled at the consensus level.
The new candidate alignment is accepted with a probability of 1 if its score is better than the current solution and is accepted with some probability if the score is worse than the current solution. Bad trades are occasionally made in order to prevent the algorithm from sinking into a local optimum. The temperature (an evolving parameter in the simulated annealing algorithm) used at each iteration of the process can be set using an exponential decay function such that it lowers progressively with the number of iterations, and the chance with which you would accept a bad solution decreases as the temperature cools. After making the decision to accept or reject the candidate, the process either stops (if termination criteria are met) or proceeds to the next iteration. In certain preferred embodiments, termination criteria are met when n iterations have passed without improvement or after exceeding a predefined number of iterations. In general, n is determined empirically by the user based on the particular application. In certain preferred embodiments, n is at least about 20. This MSA refinement improves low coverage consensus calling.
The above MSA determination and MSA refinement methods provide alternatives for the input to consensus determination, as further described below. The consensus determination method includes identification of posts and islands within the MSA, as in step 4 of
The Decode method measures the combined effect of systematic sequencing error and alignment schema, and utilizes this information to determine the most likely base (or lack of one) on the template given an aligned column of base calls. In certain preferred embodiments, the method uses a reference-based multiple sequence alignment to examine the causal association of reference and read on a per-column basis, and condenses the information generated to expose post-alignment error tendencies that are non-random. The measured probabilities are applicable to other MSAs provided that the sequencing reads therein are affected by the same systematic sequencing error and the MSA was generated using the same alignment schema. For example, the reads in additional MSAs to be analyzed should preferably originate from the same sequencing system, and the alignment strategy to generate the MSA should use the same parameters, as were those used in the reference-based sequence alignment. Therefore, the learning or “training” aspect of the methodology using the reference sequence as described above is performed sparingly, possibly only once, and need not be repeated for every MSA to be analyzed. As such, the actual consensus sequence determination post-training can be run very quickly at run-time. In addition, the algorithm is column-independent, and therefore supports parallelization work for scalability to higher-order (e.g., larger) genomes.
In a first phase of the method, an alignment schema of interest for sequencing applications is chosen and used to generate a reference-based alignment of training reads. In one such embodiment, the alignment schema is used to denote all relevant parameters necessary to carry out a Smith-Waterman alignment between a pair of query/target sequences, where the target is a reference sequence. Various alignment tools are known and available to the ordinary practitioner, e.g., Smith-Waterman, YASS, MUMMER, LALIGN, etc. The parameters can include, but are not limited to, gap opening penalty, gap extension penalty, and DNA weight matrix. Using this alignment, the unit association occurrences between the reference and observation for each aligned column are measured. The term “unit” refers to each of the four deoxyribonucleotides (e.g., A, G, C, T) as well as gaps (-). For example, the unit association occurrence for the column ACAAT-AA, where the first letter “A” comes from the reference sequence, can be summarized into a table as shown below. The table as shown is incomplete until all columns in an alignment are analyzed and recorded. Once complete, the table will provide the sum of each type of observed unit (base calls and gaps) for each type of reference unit.
A total of 25 conditional probabilities P(unitobserved|unitreference) are computed, where unitobserved={A, C, G, T, -} and unitreference={A, C, G, T, -}. Once the conditional probabilities are derived from the reference-based alignment of the training reads, these frequencies can be used to generate consensus sequence base calls for alignments of “real” sequencing reads, e.g., de novo sequencing reads.
As noted above, in order for the above described conditional probabilities or “trained frequencies” to be used to analyze actual sequencing data, the actual sequencing reads must have sequencing bias characteristics (error modes, systemic error, etc.) consistent with the training data, and the same alignment schema needs to be used for both the training and actual data. For analysis of the actual sequencing reads in an MSA, one column is processed at a time. In a given column, the likelihood of observing the data d1, d2, d3, . . . dn given the template t is computed by:
P(d1, d2, d3, dn|t)=Πk=1nP(unitobserved=dk|unitreference=t)
The likelihood is computed under each possible assumption of the template unit, namely {A, C, G, T, -}, and the assumption that maximizes P(d1, d2, d3, . . . dn|t), denoted as t*, is the consensus call for the column. The confidence with which the consensus sequence is called can be determined by establishing a null hypothesis H0 and then testing H1, where the underlying template unit is unrestricted, against H0 for significance. The null hypothesis is agnostic to the underlying template unit for a given column, and thus attributes equal probability to each possibility. The template unit is thereby “fixed,” in a sense. The maximum likelihood of the data under the null model, L0, is computed as:
The maximum likelihood of the data under the alternate hypothesis, L1, is already computed:
To assess how much better H1 is at explaining the data than H0, compute the likelihood ratio chi square (χ2) statistic:
Once the test statistic χ2 is obtained for a consensus call, the χ2 distribution with degree of freedom=1 can be consulted to infer the probability of observing such a difference or a more extreme one under the null hypothesis model, and the smaller the P-value, the more confident is the consensus call at a given position. A column where observed bases tend to be in agreement is identified as belonging to a post, and the block of contiguous columns (e.g., one or more columns) between two neighboring posts are marked as an island. As noted above, either the Plurality or Decode method can be used to determine a consensus base for a column in a post.
When a significance test is performed, the framework of the null hypothesis can dramatically shape the confidence with which we accept the alternate hypothesis. Hence, in alternative embodiments the null hypothesis described above can be formulated differently while still preserving its purpose of contrasting against the alternative hypothesis. For example, instead of fixing all possibilities {A,C,G,T,-} as equally likely to be the underlying template unit, frequencies can be used that are more consistent given other known factors, e.g., GC % of the genome from which the sequencing template was obtained. Other types of modifications of the methods herein will be understood by the ordinary practitioner based on the teachings herein in light of the current state of the art, and such modifications may be made without departing from the scope and spirit of the invention.
Various MSA approaches that are well known in the art may also be used with the methods herein. In certain embodiments, alignment strategies for use with the methods herein include methods for determining sequence overlap between different sequence reads or between one or more sequence reads and a reference sequence, and preferred embodiments of such alignment strategies are provided, e.g., in U.S. Ser. No. 13/034,233, filed Feb. 24, 2011 and incorporated herein by reference in its entirety for all purposes.
In certain preferred embodiments, once an IA or “island” has been identified as such, e.g., as described above using the Plurality or Decode methods, determination of the best template sequence for the island involves use of the Steiner framework and base quality values (QV's), e.g., at step 6 in
The definition provided in Algorithms on Strings, Trees and Sequences (Gusfield, Dan. Algorithms on Strings, Trees and Sequences. Cambridge University Press, 1997. Cambridge Books Online Cambridge University Press. 9 Jun. 2011; incorporated herein by reference in its entirety for all purposes) for the Steiner string is as follows: “Given a set of strings S, an optimal Steiner string S* for S is a string that minimizes the consensus error E(S*) over all possible strings.” The consensus error E(S*) for S* and another string is measured as the edit distance between the two strings. Therefore, the Steiner string is the sequence that has the smallest edit distance sum when compared to all strings in the set. Computing the Steiner string for a set of reads (or in our case, a set of read fragments (portions of a single, longer sequencing read) determined to come from the same reference region) involves exploring a large set of possibilities and selecting the possibility which yields the minimal edit distance. This can be an expensive process as the length of the read fragments increases. Therefore, use of the Steiner string method on islands within an MSA requires a robust way to quickly find the Steiner sequence without enumerating through all possible sequences up to a designated length. Genetic algorithm, a technique inspired by Darwin's selection process, is a suitable search heuristic for this problem, because the problem meets the two requirements for using genetic algorithms. First, the solution can be expressed in the form of an array in order to apply mutation and/or cross-over operations; and second, the worth of a candidate in the population can be computed with an objective function.
Genetic algorithms use iterative rounds of selection and inheritance to improve the fitness of candidates in the population over time, and traditionally these rounds of selection are exhaustively enumerated, comprising initial populations containing several hundreds or thousands of individuals, each a “possible solution” to the problem. In certain aspects, the present invention provides an algorithm that converges quickly on the correct result without such exhaustive enumeration and large populations of individuals. In this way, the use of genetic algorithms allows much more efficient use of Steiner string to arrive at a consensus sequence for an island. In certain preferred embodiments, the steps for finding optimal solutions to the Steiner string using genetic algorithm are as follows.
Phase I is termed Initialization. A starting population of candidate reads (“individuals”) needs to be generated to serve as seeds. An individual is represented as an array of integers that correspond to bases {A, C, G, T}, or other bases that may occur in the read, e.g., U in an RNA sequence, or methylated or otherwise modified bases that are identifiable by the sequencing system. Essentially, each individual in the population represents a potential solution to the problem being solved, which is to identify a best consensus sequence for a region of the MSA comprising a plurality of reads, e.g., within an island. The population is generated by constructing random N-mers (where N=the mean length of reads in the region of the MSA (e.g., island) being analyzed) until the population size is met. (The sizes of reads in an island may vary where some reads have insertion or deletion errors, for example, which is why a mean is calculated to specify the size of the N-mer candidate reads.) Population size is an important parameter that needs to be selected carefully in order to ensure good initial diversity and adequate representation of the search space. For the problem at hand, the ideal population size will vary depending on the upper limit for read length. It has been shown that a population size of 150 is sufficient for solving read lengths of approximately ten bases, which meets the criteria for islands.
Phase II is termed Selection. During each generation of simulated evolution, individuals in the population are assessed for their “fitness” by measuring the edit distance between each individual and the reads in the set. Based on their fitness measures, a set of individuals is selected to participate in breeding individuals for the next generation via genetic operations. The Steiner sequence (S*) is characterized as having the minimal edit distance sum. Normalization against the number of reads and converting the objective into a maximization problem produces the following equation:
In order to maintain a healthy diversity in the gene pool and prevent premature convergence to non-optimal solutions, a selection method that preferentially takes high-fitness individuals but still retains some number of “bad” individuals is crucial to being able to generate new solutions in each generation. Roulette-wheel selection is preferably used to achieve this purpose by computing a probability of selection for each individual and spinning the wheel as many times as the pool size. To implement the roulette-wheel selection, the probability that an individual will be selected is computed by comparing its fitness score to its competitors:
The sum of pi for all individuals should equal one, which is akin to adding up all the slices on a roulette wheel, with those individuals having a higher fitness represented by a larger slice than those having a lower fitness. In order to assign boundaries to the wheel, the cumulative pi is computed for each individual, where ci, ranges between 0 and 1.
Next, the “wheel is spun” and the “slice” on which the marker lands is observed. This can be computationally performed by generating a random decimal d between 0 and 1, and the individual i that satisfies ci−1≦d<ci is the one selected. After spinning the wheel n times, where n is the size of the population, a parental pool is generated where the frequency of the individuals present in the pool correlates with their fitness. To produce offspring for the next generation, a pair of individuals is randomly selected from the parental pool and breeding continues. Applying crossover procedures to each pair of parents contributes to two offspring, such that selecting n/2 pairs of parents will yield sufficient progeny to fulfill the next generation. (The parents are not returned to the pool for continued breeding.)
Phase III is termed Inheritance. To introduce new individuals to the pool, selected parents contribute partial sequences. Due to the nature of our problem, we want to favor progeny having similar lengths as their parents but with some variability. Therefore, the length of subsequence to extract from each parent is drawn from a Poisson distribution with λ=0.5*parental length. For each set of parents, two progeny individuals will result by considering different orders of the parental contribution. So, each parental sequence is divided into two portions, and then each portion goes to a different progeny. So, parent A has parts A1 and A2 and parent B has B1 and B2, and the two progeny are A1B2 and B1A2.
Phase IV, termed Iteration and Completion, involves sequentially repeating the Selection and Inheritance phases using the new parental pool generated during the prior inheritance phase, which is typically characterized by an increase in best fitness of the population as time progresses. Termination is contingent upon either (a) dropping below a variability threshold within the parental pool or (b) having surpassed a set number of iterations. The second option is the preferred option. For example, for read lengths of approximately 10 bases, at least about 50 iterations can be chosen for converging upon an optimal solution. In certain preferred embodiments, at least about 40-60 iterations are performed. The solution for the Steiner string is represented by the best fitness individual in the final generation, which is taken as the best consensus sequence for the MSA.
In some cases, multiple Steiner strings can be generated for the same initial population of reads. In certain aspects, the present invention provides a method for determining the “best” Steiner string where multiple Steiner strings are generated. For example, in certain preferred embodiments, the most popular transition event for each potential candidate is identified, where a transition event in the context of the present invention is any combination of insertion, deletion, or substitutions. After the most popular transition events are identified, the plausibility of those transition events is assessed using the inherent biases of the system. For example, for each potential template sequence, an “event diagram” is constructed. An event diagram explains template-to-read as well as read-to-read relationships using the simplest set of transitions that is consistent with the observations. The principle of parsimony is applied as a heuristic to eliminate more complex explanations in favor of one that highlights template transitions shared across multiple reads. The more times a transition can apply itself in the change from template to read, the higher the certainty that it actually occurs.
To construct the event diagram, the data can be framed as an undirected and weighted graph. The minimum spanning tree (MST) for the graph is then identified. Reads and template are represented as nodes on the graph, and an edge is drawn for each pair of notes. For example,
When evaluating different explanations for how a template could give rise to observed reads, pick the tree whose highest degree edge is the most prevalent event given what we know about the sequencing system. The highest degree edge is the edge that participates in the greatest number of template-to-read paths. For example,
The Steiner string approaches provided herein can be modified in various ways. For example, the fitness function may be altered to preserve the same idea of framing the edit distance sum into a maximization problem. Transformations other than the exponential transform can be used. The population size and number of iterations can be varied depending on a number of factors, e.g., the read length requirements of the practitioner. In addition, other methods for finding the MST for a graph can be used, e.g., Kruskal's algorithms and others known in the art. Yet further, the ranking criteria used to determine the prevalence of an event can be adjusted, e.g., depending on the types of errors characteristic of sequencing reads from a given sequencing methodology. Other ways in which this approach can be modified without departing from the scope and spirit of the invention will be recognized by the ordinary artisan in light of the teachings herein. In addition, various methods in the art may be used in conjunction with the methods of the present invention, e.g., Keith, et al. (2002) Bioinformatics 18:1494-1499; Gusfield, Dan. Algorithms on Strings, Trees and Sequences. Cambridge University Press, 1997; and Goldberg, David. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Inc., 1989, all of which are incorporated herein by reference in their entireties for all purposes. In certain related aspects of the present invention, base quality values can be used in determining a local consensus sequence for islands. For example, in the Steiner consensus string approach described above, D(r, t) is defined as the edit distance between read and template. In some embodiments, this measure is used when base-calls are the only data being used to determine the consensus sequence. However, in certain embodiments base-call quality data is also available and used to develop a more accurate measure of D(r, t). As described below, base-call quality data (which can include, e.g., error type, error frequency, pulse metrics, etc.) can be used to estimate the probability of a putative template sequence or “string,” and this probability is converted into a distance metric for a template-read pair. This approach sets up the problem using a framework similar to that of the Steiner consensus string approach described above. For example, given a set of read fragments R={r1, r2, r3, . . . }, and a candidate template t the consensus sequence that minimizes consensus error E(t) across all read fragments can be approximated as Σr∈RD (r, t), where D(r, t) represents the distance between the observed read fragment and the candidate template.
In preferred embodiments, the process begins with an alignment of the observed read fragment with the putative template string in order to establish a coordinate correspondence. For example, global alignment with a simple [+1|0|0 ] scoring for [match|mismatch|gap] can be used. The resulting correspondence suggests whether an observed base-call is a true incorporation event for a particular base, or instead is an insertion event, e.g., due to cognate or noncognate sampling of the polymerase that did not result in an incorporation event. The probability of the consensus sequence for the template given an observed sequence read can be computed as P(t|r)=Πi=1len(alignment)P(ti|ri). Similar to Phred (Ewing, et al. (1988) Genome Research 8:175-185; Ewing, et al. (1988) Genome Research 8:186-194, incorporated herein by reference in their entireties for all purposes), the method described herein produces per-base quality scores as a function of Pe, the probability that the base is incorrectly called. Therefore, the inverse function can be applied to deduce and assign a Pe to each call in the observed read fragment. Using Pe, P(ti|ri) can be calculated for each of the five different possibilities that ti can assume: A, C, G, T, or gap. For example, given Pe and observed base call ri where ri is not a gap:
P(ti|ri)=1−Pe for ti=ri 1)
P(ti|ri)=Pe*10/11 for ti≠ri and ti=gap 2)
P(ti|ri)=Pe*1/11*1/3 for ti≠ri and ti≈gap 3)
In cases 2 and 3, the expected ratio of insertion:substitution 10:1, can be used to distribute the total error across cases where the template does not match the observation. The above formulas will work where the aligned unit in the read fragment is a base-call with an associate quality value to identify matches, mismatches, and insertions in the pair-wise alignment. However, in the case of a deletion in the read with respect to the template sequence, a gap is produced which will have no quality values, being a product of the alignment procedure rather than a base-call. In such cases, where ri is a gap, an expected deletion probability, e.g., based on the known characteristics of a given sequencing platform, can be used to estimate P (ti|ri). Once P(ti|ri) is computed, the distance can be computed as (r, t)=1−P(t|r). As such, the distance D decreases as the likelihood of the template string increases.
At step 7 in
In certain aspects, the methods provided herein are computer-implemented methods, wherein at least one or more steps of the method are carried out by a computer program. In some embodiments, the methods provided herein are implemented in a computer program stored on computer-readable media, such as the hard drive of a standard computer. For example, a computer program for determining at least one consensus sequence from replicate sequence reads can include one or more of the following: code for providing or receiving the replicate sequence reads, code for grouping replicate sequence reads, code for analyzing call quality, code for aligning the replicate sequence reads to generate an MSA, code for refining an MSA, code for identifying islands and posts in the MSA or refined MSA, code for generating consensus sequences for the posts, code for generating consensus sequences for the islands, code for catenating consensus sequences with sequences in posts, and a computer-readable storage medium comprising the codes.
In some embodiments, a system (e.g., a data processing system) that determines at least one consensus sequence from a set of replicate sequences includes a processor, a computer-readable medium operatively coupled to the processor for storing memory, wherein the memory has instructions for execution by the processor, the instructions including one or more of the following: instructions for receiving input of replicate sequence reads, instructions for grouping replicate sequence reads, instructions for analyzing call quality, instructions that align the replicate sequence reads, instructions that refine an alignment, instructions that identify islands and posts, instructions that generate consensus sequence for at least one post, instructions that generate consensus sequence for at least one island, instructions that catenate consensus sequence from an island with sequence from one or two adjacent posts, instructions that compute/store information related to various steps of the method (e.g., quality scores, error rates, etc.), and instructions that record the results of the method.
In certain embodiments, various steps of the method utilize information and/or programs and generate results that are stored on computer-readable media (e.g., hard drive, auxiliary memory, external memory, server, database, portable memory device (CD-R, DVD, ZIP disk, flash memory cards, etc.), and the like. For example, information used for and results generated by the methods that can be stored on computer-readable media include but are not limited to replicate sequence information, MSAs, refined MSAs, post loci and/or sequences, island loci and/or sequences, newly generated consensus sequences, newly generated catenated sequences, quality information, technology information, and homologous or reference sequence information.
In some aspects, the invention includes an article of manufacture for determining at least one consensus sequence from replicate sequence reads that includes a machine-readable medium containing one or more programs which when executed implement the steps of the invention as described herein.
One type of logical apparatus that may embody the invention is a computer system as illustrated in
After input replicate sequences and parameters required for the method of the present invention are specified by the display 1002 (also referred to as a “screen”), the keyboard 1003, and the pointing device 1004, the CPU 1001 executes the program stored in the main memory 1005 and the MSA and consensus sequence determination are performed by the methods of the present invention. The input sequence 1013 is read from the storage device 1012. The output result of MSA 1014 and consensus sequence determination 1015 can be stored into the storage devices 1012. During the progress of MSA and consensus sequence determination by the method of the present invention, the progress of this processing can be displayed on the display 1002. After completing this processing, the result of the processing can be also displayed on the display 1002, saved to an additional storage device (e.g., ZIP disk, CD-R, DVD, floppy disk, flash memory card, etc.), or displayed and/or saved in hard copy (e.g., on paper). The result of the processing may be stored or displayed in whole or in part, as determined by the practitioner. For example, the results for one or more positions, or one or more islands, or one or more combinations of islands and posts may be displayed/saved. These results may further comprise quality information, technology information (e.g., peak characteristics, expected error rates), alternate (e.g., second or third best) consensus sequences, confidence metrics, etc., as described in greater detail above.
It is to be understood that the above description is intended to be illustrative and not restrictive. It readily should be apparent to one skilled in the art that various modifications may be made to the invention disclosed in this application without departing from the scope and spirit of the invention. The scope of the invention should, therefore, be determined not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. Throughout the disclosure various references, patents, patent applications, and publications are cited. Unless otherwise indicated, each is hereby incorporated by reference in its entirety for all purposes. All publications mentioned herein are cited for the purpose of describing and disclosing reagents, methodologies and concepts that may be used in connection with the present invention. Nothing herein is to be construed as an admission that these references are prior art in relation to the inventions described herein.
To assess the results of an MSA refinement procedure, consensus calling accuracy at low coverage (2-6×) was compared. A reference sequence was mutated at every 500th position to contain a different base at each mutated position, and the resulting mutated reference represented the resequencing “reference” sequence. The original (unmutated) reference sequence represented the “sample” sequence. These two sequences were used for consensus sequence determination using the simulated annealing MSA refinement procedure described at length elsewhere herein. The results shown in
To assess the results of the quality value-derived distance metric for the Steiner consensus framework, the methodology was applied on C. elegans fosmid data, subsampled at 12× coverage. The reference sequence used for mapping was mutated such that, on average, 1 in 500 positions would carry a base substitution. The resulting consensus calls were then evaluated against the un-mutated reference for consensus calling accuracy in the low coverage regime and benchmarked against plurality consensus calling.
This application is a continuation-in-part of U.S. patent application Ser. No. 13/034,233, filed Feb. 24, 2011, which claims the benefit of Provisional U.S. Patent Application No. 61/307,732, filed Feb. 24, 2010, both of which are incorporated herein by reference in their entireties for all purposes.
Number | Date | Country | |
---|---|---|---|
61307732 | Feb 2010 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 13034233 | Feb 2011 | US |
Child | 13468347 | US |