The inventions described herein relate to methods for simultaneously evaluating genomic sequences, including cancer-related sequences, and systems therefor. The methods and systems additionally may incorporate Mendelian inheritance among related family members. The inventions also relate to probability-based calling methods suitable for use in calling sequences for reads obtained from samples containing both normal and cancerous material. There are also disclosed methods incorporating copy number variation into probability-based calling methods.
There have been great advances in genomic sequencing in recent times. Sequencing machines can generate reads ever more rapidly with increasingly accurate results. However, the reads produced still contain errors, and during read alignment the reads must be assembled as accurately as possible to generate the most accurate genomic sequence for the sample. The process of “calling” a value of the sequence from the reads requires consideration of a range of relevant factors and potential sources of errors.
Additionally, there has been much research to identify predisposing genomic sequence variants and somatic mutations. The basis for this research is the accurate calling of cancerous sequences obtained from tumors and related samples. However, many samples have included a mixture of normal genomic sequences and cancerous genomic sequences and the quality of calling has been reduced for such mixed samples as the reads for the normal samples act as contamination of the cancerous samples.
A wide range of algorithms for calling sequence values have been employed. Some use filtering techniques but this potentially loses information that may assist in making a call or values that upon more thorough investigation may be the best calls. Mendelian inheritance rules have been used to investigate family relationships but have not been incorporated into an integrated model for simultaneously evaluating multiple population members. Prior approaches have looked to other family members as data rather than as part of a larger dynamic model. Such approaches have had limited success in correctly identifying the likelihood of de novo mutations.
Other techniques for calling biological sequences include the applicant's prior U.S. Pat. No. 7,640,256 and U.S. application Ser. Nos. 13/129,329 and 61/695,408, and PCT/NZ2011/000080, PCT/NZ2011/000081 and PCT/NZ2011/000197 which are hereby incorporated by reference.
Prior calling techniques typically assume that the sample is uncontaminated (i.e. either all normal or all cancerous material) and have not been able to make accurate calls for mixed samples of cancerous and normal biological material or where there is copy number variation (which is common with cancer).
It would be desirable to improve the quality of calling by utilizing population information in an integrated model. It would also be desirable to improve the quality of calling for mixed samples or where there is copy number variation.
It is an object of the disclosed inventions to provide improved methods of calling biological sequences that overcome at least some of these problems or to at least provide the public with a useful choice.
In some embodiments, the invention provides a method of calling a genomic sequence for a sample from a biological entity in a collection of related biological entities, performed by one or more processors executing program instructions stored on one or more memories, causing the one or more processors to perform the method comprising:
In some embodiments, the invention provides a system for calling a genomic sequence for a sample from a biological entity in a collection of related biological entities, the system comprising:
one or more processors configured to execute one or more modules; and a memory storing the one or more modules, the modules comprising:
In some embodiments, the invention provides a method of calling a genomic sequence for a sample from a subject potentially containing normal and cancerous material, performed by one or more processors executing program instructions stored on one or more memories, causing the one or more processors to perform the method comprising:
Additional objects and advantages of the invention will be set forth in part in the description which follows.
It is acknowledged that the terms “comprise,” “comprises” and “comprising” may, under varying jurisdictions, be attributed with either an exclusive or an inclusive meaning. For the purpose of this specification, and unless otherwise noted, these terms are intended to have an inclusive meaning—i.e. they will be taken to mean an inclusion of the listed components which the use directly references, and possibly also of other non-specified components or elements.
Reference to any prior art in this specification does not constitute an admission that such prior art forms part of the common general knowledge.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Reference will now be made to the accompanying drawings showing example embodiments of this disclosure. In the drawings:
When developing a representation of a genomic sequence from a biological sample, sequencing machines produce many reads of short portions of the subject genomic sequence (typically DNA, RNA or proteins). These reads (genomic sequence information) must be aligned and then “calls” must be made as to values of the sequence at each location (e.g., individual bases for DNA). There may typically be only a few reads (and sometimes none) at a particular location or very many reads in others.
Errors can arise in the process of sequencing genomes. In some cases all reads are consistent or “simple calls” may be made using conventional calling techniques. There are typically “regions of interest” that may span a single value or several values where more sophisticated analysis can be required to make a reliable call. A region may be identified as a region of interest because the confidence in calling the region is too low using simple calling techniques or because characteristics of the region indicate that deeper analysis is desirable. These characteristics may be numbers of insertions and/or deletions, the value and proximity of calls (e.g. a number of low confidence calls close to each other) etc.
The problems are compounded when:
(1) The sample includes both genomic information relating to normal and cancerous biological material; and/or
(2) The number of copies of parts of the genomic sequence varies (i.e. in cancerous cells more copies of some parts of the DNA may be produced than of others, a phenomenon known as copy number variation).
A Bayesian approach may be applied to resolve calls in such regions of interest. This is a principled way of combining multiple factors and allows evolving knowledge to be dynamically integrated.
Such regions of interest can be evaluated without reference to family members or a related population. Such regions of interest can also be evaluated without taking into account contamination (mixed normal and cancerous biological samples) or copy number variation (certain portions of the genomic sequence may have more copies due to a cancer). But the exclusion of family member, related population, and contamination information removes a large volume of information that can assist in making reliable calls in difficult regions. Accordingly, in certain embodiments, the reads for multiple samples may be evaluated simultaneously so that all information is utilized to inform the calling of genomic sequences for each sample and provide more accurate calling. Additionally, in certain embodiments, the model is adjusted to account for contamination and/or copy number variation to improve the accuracy of calling genomic sequences.
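In certain embodiments, a Bayesian model can be applied to calling a genomic sequence. For example, the probability of a hypothesis (proposed sequence values for the region of interest) being correct given the data (reads) is the normalized value of the probability of the hypothesis occurring (prior) times the probability of the data occurring given the hypothesis (model), which may be expressed as:
P(H|D)=P(H)P(D|H)/ΣH′P(H′)P(D|H′) (Equation 1)
where H is a hypothesis, D is the data (reads), P(H) is the prior, P(D|H) is the model, and the sum in the denominator runs over all hypotheses H′ under consideration.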
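By way of illustration only, the Bayesian calculation described above (prior times model, normalized over the hypotheses) may be sketched for a single haploid site as follows. The function names, the independent-read error model and the numerical values are assumptions made for this sketch and are not taken from the disclosure.

```python
def read_likelihood(reads, hypothesis, error=0.01):
    """P(D|H) for one haploid site, assuming independent reads.

    The uniform error model (error / 3 per wrong base) is a hypothetical
    placeholder for whatever model values an embodiment actually uses.
    """
    p = 1.0
    for base in reads:
        p *= (1 - error) if base == hypothesis else error / 3
    return p


def posterior(priors, likelihoods):
    """Return P(H|D) for every hypothesis: prior times model, normalized."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: p / total for h, p in joint.items()}


# Illustrative data: four reads covering one position.
reads = ["A", "A", "C", "A"]
priors = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
likelihoods = {h: read_likelihood(reads, h) for h in priors}
post = posterior(priors, likelihoods)
call = max(post, key=post.get)  # hypothesis with the highest posterior
```

The call is the hypothesis with the highest posterior, and the posterior itself can serve as a score for the call.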
For a population of k members this may be expressed as:
where:
For a population, an expectation maximization (EM) algorithm may be employed to improve calling accuracy. The algorithm may enhance calling by utilizing population prior information to refine calling. This may be performed by:
In step (b) the called sequence information may be combined with the historical probability data based on the probability of a haploid sequence occurring. This may assist in achieving rapid convergence. Alternatively the called sequence information may be combined with the historical probability data based on the probability of a diploid sequence occurring. Steps (b) and (c) may be repeated until there is no change in sequence calling or until some other criterion is met.
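The iterative refinement described above may be sketched as follows, purely as an illustration. This is a simplified, hard-assignment variant on a haploid basis: each sample is called, the calls are added to historical counts to form new priors, and the samples are called again until the calls stop changing. The function names, the count-based combination rule and the example values are assumptions of this sketch.

```python
def normalize(counts):
    """Turn non-negative counts into probabilities."""
    total = sum(counts.values())
    return {h: c / total for h, c in counts.items()}


def call(prior, likelihood):
    """Pick the hypothesis maximizing prior times model."""
    return max(prior, key=lambda h: prior[h] * likelihood[h])


def refine_calls(historical_counts, sample_likelihoods, max_iterations=20):
    """Alternate between calling samples and updating the population prior."""
    prior = normalize(historical_counts)
    calls = [call(prior, lk) for lk in sample_likelihoods]
    for _ in range(max_iterations):
        # Combine the called sequences with the historical data (assumed rule:
        # each call adds one observation to the historical counts).
        counts = dict(historical_counts)
        for c in calls:
            counts[c] += 1
        prior = normalize(counts)
        # Re-call every sample with the updated prior.
        new_calls = [call(prior, lk) for lk in sample_likelihoods]
        if new_calls == calls:  # no change in calling: stop
            break
        calls = new_calls
    return calls, prior


historical = {"A": 5, "C": 5, "G": 1, "T": 1}
strong_c = {"A": 0.01, "C": 0.9, "G": 0.01, "T": 0.01}
ambiguous = {"A": 0.5, "C": 0.45, "G": 0.01, "T": 0.01}
calls, prior = refine_calls(historical, [strong_c, strong_c, strong_c, ambiguous])
```

In this example the ambiguous sample is first called as A, and is re-called as C once the confident calls from the other samples have shifted the prior.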
Mendelian Inheritance
In certain embodiments, where a family is being evaluated, such as illustrated in
where:
De Novo Mutations
The Mendelian probability of the hypothesis for the child given the hypotheses for the parents M(Hc|Hm, Hf) may be a simple Mendelian probability or may be a modified form that takes into account non-Mendelian mechanisms. In particular the probabilities associated with de novo mutations may be incorporated into the Mendelian probability M(Hc|Hm, Hf).
In certain embodiments, the probability of de novo mutations may be influenced by population factors (such as species information and the age of the parents), and environmental factors (such as radiation exposure, feed sources, climatic conditions, etc).
One way of constructing a modified Mendelian table M′(Hc|Hm, Hf) is to assume that there is some small probability μ of a single nucleotide being mutated and that both nucleotides are never mutated at the same time (because μ can be very small). Then the various values in M′ can be computed from the original M. For example:
M′(A:C|A:A,A:A)=2μ/3×M(A:A|A:A,A:A)
M′(A:A|A:A,A:A)=(1−2μ)×M(A:A|A:A,A:A)
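As an illustrative sketch only, a modified table M′ consistent with the two example entries above may be built by applying a single-nucleotide mutation step to each genotype produced by simple Mendelian inheritance. The general rule used here (each of the two nucleotides mutates with probability μ to one of the three other bases, never both at once) is an assumption extrapolated from those two entries, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

BASES = "ACGT"


def mendelian(mother, father):
    """Simple Mendelian table M(Hc|Hm,Hf) for unordered diploid genotypes."""
    dist = defaultdict(float)
    for m, f in product(mother, father):
        dist[tuple(sorted((m, f)))] += 0.25
    return dict(dist)


def mutate(genotype, mu):
    """Distribution after at most one single-nucleotide mutation."""
    dist = defaultdict(float)
    dist[genotype] += 1 - 2 * mu  # neither nucleotide mutated
    for i in (0, 1):
        for b in BASES:
            if b != genotype[i]:
                changed = list(genotype)
                changed[i] = b
                dist[tuple(sorted(changed))] += mu / 3
    return dict(dist)


def mendelian_with_de_novo(mother, father, mu):
    """Modified table M'(Hc|Hm,Hf) allowing a de novo mutation in the child."""
    out = defaultdict(float)
    for inherited, p in mendelian(mother, father).items():
        for child, q in mutate(inherited, mu).items():
            out[child] += p * q
    return dict(out)


mu = 1e-3
table = mendelian_with_de_novo(("A", "A"), ("A", "A"), mu)
```

For parents that are both A:A, this reproduces M′(A:C|A:A,A:A)=2μ/3 and M′(A:A|A:A,A:A)=1−2μ, and each row of the modified table still sums to one.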
In this way even though the probability of a de novo mutation may be very low, information across a family may be utilized to reveal the significance of anomalous data in a subject that may reveal a de novo mutation. A de novo mutation may be identified where the probability of an hypothesis for a de novo mutation is greater than for any other hypothesis or according to other prescribed criteria. In some cases a likelihood of a de novo mutation above a certain level may be flagged so that the region of interest may be further analyzed.
Contamination
In certain embodiments, a sample is obtained from a location expected to have predominantly normal genomic material (e.g. a blood sample) and another is obtained from a region where it is suspected that cancerous genomic material is present. The two samples are sequenced by a sequencing machine to produce sets of reads for each sample. It will be appreciated that genomic sequence information (either reads or a sequence listing) for a prior normal sample may advantageously be utilized where available. Alternatively in some cases a reference genome (such as a reference human genome) may be utilized (for example where the region of investigation is relatively uniform in humans).
In certain embodiments that apply a Bayesian model to calling a genomic sequence, the probability of a hypothesis (proposed sequence values for the region of interest) being correct given the data (reads) is the normalized value of the probability of the hypothesis occurring (prior) times the probability of the data occurring given the hypothesis (model). In certain embodiments a Bayesian model is used to compare two genomes, a normal genome (for which the subscript n is used) and a cancer genome (for which the subscript c is used). Hypotheses can be generated for the pair Hn,Hc (i.e. hypotheses as to the sequence values for a region of interest for the normal and cancerous genomes) and the evidence will be a pair En, Ec (i.e. the reads for the normal and cancerous samples in the region of interest, or simply the portions of the normal sequence where a sequence listing is available).
The “priors” (i.e. probability of a hypothesis occurring) may be obtained in a variety of ways. As outlined above, P(Hn) may be obtained from, for example, a reference listing of the human genome, from a prior sequencing and/or from contemporaneous sequencing of the normal sample. P(Hc) may be obtained from, for example, reference listings of known cancer sequences. In certain embodiments P(Hc) is not a required term.
The hypotheses may be the reads for each sample.
Assuming no contamination:
P(En,Ec|Hn,Hc)=P(En|Hn)P(Ec|Hc)
That is, certain embodiments can use the posteriors (before applying priors) for the individual genomes from the calculations that are normally done for SNP (single-nucleotide polymorphism) calling. To compute the priors one can use a model where Hc is taken as being a mutation from an original normal hypothesis, and then:
P(Hn,Hc)=P(Hn)Q(Hc|Hn)
where Q(Hc|Hn) is the probability of a transition from Hn to Hc. In certain embodiments this can be computed as a table given μ, the probability of a novel mutation on one of an homologous pair of chromosomes from the normal to cancer genome.
For example in the haploid case:
Q(C|A)=μ/3
Q(A|A)=1−μ
In the diploid case:
Q(XX|UV)=Q(X|U)Q(X|V)
Q(XY|UV)=Q(X|U)Q(Y|V)+Q(Y|U)Q(X|V) where X≠Y
In certain circumstances there is a non-zero probability that there will be an LOH (loss of heterozygosity) event on the cancer side. Sometimes it will be known from other analyses that this has happened and other times it can only be estimated as a general probability. Given LOH the calculation for Q is:
Q(XX|UV)=[Q(X|U)+Q(X|V)]/2
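The haploid, diploid and LOH transition formulae above may be tabulated as in the following sketch. The function names are hypothetical and the value of μ is illustrative only; the formulae themselves are those given above.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"


def q_haploid(x, u, mu):
    """Q(X|U): probability of a transition from normal base U to cancer base X."""
    return 1 - mu if x == u else mu / 3


def q_diploid(cancer, normal, mu):
    """Q(XY|UV) for unordered diploid genotypes."""
    x, y = cancer
    u, v = normal
    if x == y:
        return q_haploid(x, u, mu) * q_haploid(x, v, mu)
    return (q_haploid(x, u, mu) * q_haploid(y, v, mu)
            + q_haploid(y, u, mu) * q_haploid(x, v, mu))


def q_loh(x, normal, mu):
    """Q(XX|UV) given a loss of heterozygosity event on the cancer side."""
    u, v = normal
    return (q_haploid(x, u, mu) + q_haploid(x, v, mu)) / 2


mu = 1e-6
normal = ("A", "C")
diploid_row = {g: q_diploid(g, normal, mu)
               for g in combinations_with_replacement(BASES, 2)}
loh_row = {x: q_loh(x, normal, mu) for x in BASES}
```

Each row is a probability distribution over the cancer hypotheses for the given normal hypothesis, which can be checked by summing the row.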
For complex calling, the individual transition Q(X|U) can be estimated using the technique described in U.S. Appl. 61/695,408 (which is hereby incorporated by reference) where the sequence X is matched against the sequence U and the transitions are normalized for a given U. It may be advantageous to include part of the reference on either side of the sequences to allow some correction when there are repeat or homopolymer regions.
Combining these formulae, we have:
To account for contamination of the cancer sample by normal DNA, the following modification can be included:
and then assuming α is an estimate of the fraction of the cancer sample which is in fact normal tissue we have:
P(ec|Hn,Hc)=αP(ec|Hn)+(1−α)P(ec|Hc) (Equation 8)
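As a hedged illustration of Equation 8, the sketch below evaluates the joint posterior over pairs Hn,Hc for a single haploid site, with and without the contamination term. Taking the likelihood of the cancer reads as a product of the per-read mixture over independent reads, the uniform error model, the uniform prior on Hn and all names and values are assumptions of this sketch.

```python
from itertools import product

BASES = "ACGT"


def p_read(e, h, error=0.01):
    """Per-read model P(e|H); the error model is a hypothetical placeholder."""
    return 1 - error if e == h else error / 3


def joint_posterior(normal_reads, cancer_reads, alpha, mu=1e-6):
    """Posterior over (Hn, Hc) with contamination fraction alpha."""
    joint = {}
    for hn, hc in product(BASES, repeat=2):
        p = 0.25  # assumed uniform prior P(Hn)
        p *= 1 - mu if hc == hn else mu / 3  # Q(Hc|Hn), haploid case
        for e in normal_reads:
            p *= p_read(e, hn)
        for e in cancer_reads:
            # Equation 8: mixture of normal and cancer material per read.
            p *= alpha * p_read(e, hn) + (1 - alpha) * p_read(e, hc)
        joint[(hn, hc)] = p
    total = sum(joint.values())
    return {pair: p / total for pair, p in joint.items()}


normal_reads = ["A"] * 10
cancer_reads = ["A"] * 6 + ["C"] * 6  # half of the tumor reads look normal
with_contamination = joint_posterior(normal_reads, cancer_reads, alpha=0.5)
without_contamination = joint_posterior(normal_reads, cancer_reads, alpha=0.0)
best_with = max(with_contamination, key=with_contamination.get)
best_without = max(without_contamination, key=without_contamination.get)
```

With these illustrative numbers the mutation A to C is called only when contamination is modeled; with α=0 the small prior probability of a mutation outweighs the evenly split evidence.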
The contamination value α may be determined by, for example:
(1) Expert determination by a clinician based on clinical factors and experience;
(2) Clinical information—using an appropriate formula, an expert system, neural network, learning system, or the like;
(3) Comparison of “SNP chips”—for example, compare the number of reads for an area of the sequence likely to give a good indication of relative proportions of normal and cancerous material;
(4) An optimization technique whereby a probability, for example the global probability, is maximized as the measure of goodness.
Combining the above this gives:
In certain embodiments, P(Ec|Hn,Hc) is accumulated for all the pairs Hn,Hc, which imposes a significantly greater burden than computing P(En|Hn) and P(Ec|Hc) separately. One strategy that may be employed is to first compute without using contamination and then in cases where it seems that there may be a non-trivial case, to perform the full calculation.
Copy Number
In a tumor (and in other types of biological samples) the number of copies of a region may differ from that in the normal genome. This can be modeled by assuming that the total number of copies in the tumor is n and that the number of copies of one of an homologous pair of chromosomes is a and of the other is b, that is n=a+b. A special case of interest is a region of loss of heterozygosity. This occurs, for example, when the normal genome had a copy number of 2 and the tumor has a copy number of 1, that is, n=1 and a=1, b=0 (or vice versa).
When a≠b, a diploid hypothesis is no longer agnostic about orientation, that is, the hypothesis AC differs from CA. To deal with this, the tumor hypothesis Hc may be broken down into a pair H′c and H″c, one for each haploid hypothesis. For example, for simple SNP calls there can be 16 possible hypotheses rather than the normal 10. The set of hypotheses is given by Hc=H′c×H″c.
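The counts of 10 and 16 hypotheses for simple SNP calls can be checked with a short enumeration, given here only as an illustration (the variable names are hypothetical):

```python
from itertools import combinations_with_replacement, product

BASES = "ACGT"

# Orientation-agnostic diploid hypotheses: AC and CA are the same hypothesis.
unordered = list(combinations_with_replacement(BASES, 2))

# Oriented hypotheses, one haploid hypothesis per homologous chromosome:
# AC and CA are distinct.
ordered = list(product(BASES, repeat=2))
```

The six additional oriented hypotheses are the mirror images of the six heterozygous hypotheses.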
According to this embodiment, the formula that includes the effect of both contamination and copy number is:
The copy number values a and b may be calculated in a variety of ways including:
(1) Based on the total number of reads associated with the normal biological sample and the number of reads associated with the cancerous biological sample;
(2) Based on the number of reads associated with the normal biological sample and the number of reads associated with the cancerous biological sample at a plurality of selected locations;
(3) Based on the number of reads associated with the normal biological sample and the number of reads associated with the cancerous biological sample at a location known to be particularly distinctive for one of the sequences.
It will be appreciated that the modification to accommodate copy number variation may be used independently of the modification for dealing with contamination and/or de novo mutations, as well as other aspects of the embodiments disclosed herein. The copy number variation techniques may be applied advantageously to better call cancer-related and other biological sequences irrespective of contamination.
Certain embodiments thus provide sequence calling methods that use information for both normal and cancerous samples so that high quality calls can be made with consistent scoring. The models can provide fast resolution of complex calling problems with improved accuracy. There is provided accurate calling of normal and cancerous sequences for mixed samples and methods of handling copy number variation.
Pruning
The probability of an hypothesis occurring (P(Hm), P(Hf) etc) may be based on historical sequence information, e.g., by comparing the sequence in the area of interest with published sequence information for that area (such as the 1000 Genomes Project or dbSNP). That is, it is the probability of that sequence occurring, irrespective of the read data.
The possible hypotheses may include, for example:
(1) All possible sequences for the region of interest. This is generally the most processing intensive approach and may be most appropriate where deep investigation of a region is required or the sequence length is short.
(2) All read values occurring in the region of interest. It is unlikely that a sequence value not occurring in any read will be the correct value and so this approach limits computation without significant reduction in calling confidence.
(3) Read values above may be combined with “assemblies of reads”. Such “assemblies of reads” may combine “associated reads”. This association may be, for example, paired end reads or reads that are associated with external reference sequences (i.e. “pseudo reads” from publications or external events; not from “wet” reads from a sequencer). Such assembled reads may be combined across multiple samples.
The above hypotheses may be pruned using techniques including removing a hypothesis where, for example:
(1) the number of reads matching the hypothesis is below a threshold level;
(2) the occurrence of the hypothesis in historic data for the type of genomic sequence is below a threshold level; and/or
(3) the hypothesis breaches Mendelian inheritance rules.
In some situations pruning is not appropriate.
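The pruning criteria listed above may be sketched as follows, for illustration only. Here criteria (1) and (2) are applied together to sequence hypotheses, and criterion (3) is shown as a separate check on single-site diploid genotypes under simple Mendelian rules (no de novo mutation). The threshold values and function names are assumptions.

```python
def prune(hypotheses, reads, historic_freq, min_reads=2, min_freq=1e-4):
    """Keep hypotheses with enough read support and historic occurrence."""
    kept = []
    for h in hypotheses:
        support = sum(1 for r in reads if r == h)
        if support < min_reads:  # criterion (1)
            continue
        if historic_freq.get(h, 0.0) < min_freq:  # criterion (2)
            continue
        kept.append(h)
    return kept


def mendel_consistent(child, mother, father):
    """Criterion (3): one allele of the child from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


hypotheses = ["ACG", "ACT", "AGG"]
reads = ["ACG", "ACG", "ACT", "ACG", "AGG", "AGG"]
historic = {"ACG": 0.6, "ACT": 0.3, "AGG": 0.00001}
kept = prune(hypotheses, reads, historic)
```

Note that pruning on criterion (3) would discard exactly the hypotheses needed to detect a de novo mutation, which is one situation in which pruning may not be appropriate.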
Hypotheses may also be evaluated in a prescribed order. This may be based on a weighting of hypotheses. The weighting of hypotheses may be on a graduated scale or on a simple inclusion and exclusion basis. The weighting may be based upon the frequency of occurrence of a hypothesis in the sequence values and the hypotheses may be evaluated from the hypotheses having the highest weighting to those having the lowest weighting. Sex-based inheritance may also be taken into account. Evaluation may be terminated before all hypotheses are evaluated if an acceptance criterion is met. The acceptance criterion may be that a hypothesis is found to have a probability above a threshold value or may be based on a trend in probabilities from evaluation (e.g. continually decreasing probabilities of hypotheses).
Model values (such as P(Dm|Hm)) represent the probability of the genomic sequence information (e.g. (Dm) for a mother) occurring given the hypothesis (e.g. (Hm) for the mother). These model values may be calculated on the basis of one or more of:
(1) quality scores for sequencing machines (i.e. the figures as to sequencing accuracy published by sequencing machine manufacturers);
(2) calibrated quality scores (i.e. quality figures determined from preliminary alignment);
(3) mapping scores (such as MAPQ scores); and/or
(4) the chemistry of the sequences (there may be different probabilities of error, insertion, deletion, etc. depending upon the particular sequence values).
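As one illustration of how model values may be derived from quality scores, a Phred-scaled quality Q corresponds to an error probability of 10^(−Q/10). The sketch below converts a base quality (and optionally a mapping score such as MAPQ, which is also Phred-scaled) into a per-base model value. The way the two scores are combined and the uniform split of the error over the three other bases are simplifying assumptions of this sketch.

```python
def phred_to_error(q):
    """Convert a Phred-scaled quality score into an error probability."""
    return 10 ** (-q / 10)


def base_likelihood(read_base, hyp_base, base_q, map_q=None):
    """P(read base | hypothesized base) from quality scores.

    Assumption: the read base is reliable only if the base call is correct
    and the read is correctly mapped, treated as independent events.
    """
    err = phred_to_error(base_q)
    if map_q is not None:
        err = 1 - (1 - err) * (1 - phred_to_error(map_q))
    return 1 - err if read_base == hyp_base else err / 3


p_match = base_likelihood("A", "A", base_q=20)
p_mismatch = base_likelihood("C", "A", base_q=20)
p_match_low_mapq = base_likelihood("A", "A", base_q=20, map_q=10)
```

A low mapping score therefore reduces the weight that a matching read contributes to a hypothesis.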
Hypotheses may be processed in an order considered most likely to produce a call meeting a required confidence level. Hypotheses may be rated according to factors such as their frequency of occurrence in the reads, a rating score (such as a MAPQ value) etc. Processing may be terminated if a hypothesis probability is above a threshold value or is trending in a desired manner. This is a technique to speed up processing and may not be appropriate where a more detailed evaluation is required.
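The ordered evaluation with early termination described above may be sketched as follows. Hypotheses are ranked by their frequency of occurrence in the reads, and evaluation stops either when a score reaches a threshold or when scores have failed to improve for a given number of consecutive hypotheses. The scoring callback, the `patience` parameter and the example scores are hypothetical.

```python
from collections import Counter


def evaluate_in_order(reads, score, threshold=None, patience=2):
    """Evaluate hypotheses from most to least frequent, stopping early."""
    ranked = [h for h, _ in Counter(reads).most_common()]
    best, best_score = None, float("-inf")
    worse_run = 0   # consecutive hypotheses that did not improve the best
    evaluated = 0
    for h in ranked:
        s = score(h)
        evaluated += 1
        if s > best_score:
            best, best_score, worse_run = h, s, 0
        else:
            worse_run += 1
        if threshold is not None and best_score >= threshold:
            break  # acceptance criterion: probability above threshold
        if worse_run >= patience:
            break  # acceptance criterion: decreasing trend
    return best, best_score, evaluated


reads = ["ACG"] * 5 + ["ACT"] * 3 + ["AGG"] * 2 + ["TTT"]
scores = {"ACG": 0.8, "ACT": 0.1, "AGG": 0.05, "TTT": 0.01}
by_trend = evaluate_in_order(reads, scores.get)
by_threshold = evaluate_in_order(reads, scores.get, threshold=0.5)
```

In this example the trend criterion stops after three of the four hypotheses and the threshold criterion stops after the first.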
Expectation maximization techniques may also be employed, as discussed above, to further refine calling. For example, priors may initially be based on sequence information for a known population. Family sequences may be called using the methodology described above. The family sequences may then be added to the priors and the family sequences called again. This may be repeated until an acceptable convergence is achieved.
H=Hm×Hf×ΠHi
P(H)=P(Hm,Hf,ΠHi)=P(Hm)×P(Hf)×ΠM(Hi|Hm,Hf)
P(D|H)=P(Dm|Hm)×P(Df|Hf)×ΠP(Di|Hi)
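For illustration, the family prior and model above may be evaluated by brute force over every combination of diploid hypotheses for a mother, a father and one child at a single site. The read error model, the uniform priors and the names below are assumptions of this sketch, and a simple (unmodified) Mendelian table is used.

```python
from itertools import combinations_with_replacement, product

GENOTYPES = list(combinations_with_replacement("ACGT", 2))  # 10 hypotheses


def mendel(child, mother, father):
    """Simple Mendelian probability M(Hc|Hm,Hf)."""
    hits = sum(1 for m, f in product(mother, father)
               if tuple(sorted((m, f))) == child)
    return hits / 4


def likelihood(reads, genotype, error=0.01):
    """Model P(D|H) for a diploid genotype (hypothetical error model)."""
    p = 1.0
    for e in reads:
        p *= sum(0.5 * ((1 - error) if e == allele else error / 3)
                 for allele in genotype)
    return p


def call_trio(dm, df, dc, prior):
    """Return the most probable (Hm, Hf, Hc) and its posterior probability."""
    best, best_p, total = None, 0.0, 0.0
    for hm, hf, hc in product(GENOTYPES, repeat=3):
        p = (prior[hm] * prior[hf] * mendel(hc, hm, hf)
             * likelihood(dm, hm) * likelihood(df, hf) * likelihood(dc, hc))
        total += p
        if p > best_p:
            best, best_p = (hm, hf, hc), p
    return best, best_p / total


prior = {g: 1 / len(GENOTYPES) for g in GENOTYPES}
best, p_best = call_trio(["A"] * 8, ["C"] * 8, ["A", "C", "A"], prior)
```

The loop visits 10×10×10 combinations for one child, and the count grows by a factor of ten for each further child.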
The resulting equation is:
where:
It can be seen that for a family with 2 parents and n children, processing will be of the order of 10^(2+n). For very large families this may require substantial processing capacity.
Application of Forward-Backward Algorithms
In certain embodiments, the “B” values are calculated on the basis of the Mendelian inheritance and the priors and models of the descendants below the member. The B values are propagated up to the generation above and affect the model for the parent.
In certain embodiments, the process may operate generally as follows:
While for a single member just a single A value is propagated down, multiple B values may be propagated up and the recalculation will be based on the member's model, its A value, and all B values.
Where there is no genomic information for a population member, values may be inferred using this model. This enables the genomic sequences of population members to be called relatively accurately even where no or little genomic information is available.
Large Pedigrees
In certain embodiments, scores may be computed in a multi-genome variant caller to analyze genomic sequences corresponding to a large pedigree.
Large Pedigree Notation
Forward Backward Algorithm
Methods for approximating a Bayesian analysis for a large pedigree are included in the present disclosure.
In certain embodiments, a forward backward algorithm can be used to approximate the Bayesian analysis:
compute singleton model for all samples (P(Hx|Dx))
initialize Ax to priors and Bx to identities
do
    compute priors
    recompute Ax forward through pedigree
    recompute Bx backward through pedigree
    recompute calls for each sample (P(Ex|h)P(h))
until no change in calls
For founding parents, Ax is the prior computed at the start or on each iteration. For individuals with no children, Bx is an identity where Bx,h=1.
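A minimal sketch of one forward and one backward pass for a single family of two founding parents and one child is given below. Because the exemplary formulae are not reproduced in this text, the particular A and B expressions used here are assumptions chosen so that, for this small tree, the results equal the exact marginals; the error model and all names are likewise hypothetical.

```python
from itertools import combinations_with_replacement, product

GENOTYPES = list(combinations_with_replacement("ACGT", 2))


def mendel(child, mother, father):
    """Simple Mendelian probability M(Hc|Hm,Hf)."""
    return sum(tuple(sorted((m, f))) == child
               for m, f in product(mother, father)) / 4


def likelihood(reads, genotype, error=0.01):
    """Singleton model for one sample (hypothetical error model)."""
    p = 1.0
    for e in reads:
        p *= sum(0.5 * ((1 - error) if e == allele else error / 3)
                 for allele in genotype)
    return p


def normalize(scores):
    total = sum(scores.values())
    return {h: s / total for h, s in scores.items()}


def forward_backward_trio(prior, dm, df, dc):
    """Return posterior marginals for mother, father and child."""
    # Founders: A is the prior. Childless member: B is the identity (1).
    wm = {h: prior[h] * likelihood(dm, h) for h in GENOTYPES}
    wf = {h: prior[h] * likelihood(df, h) for h in GENOTYPES}
    wc = {h: likelihood(dc, h) * 1.0 for h in GENOTYPES}
    pairs = list(product(GENOTYPES, repeat=2))
    # Forward: A for the child from the parents' A values and models.
    a_c = {hc: sum(wm[hm] * wf[hf] * mendel(hc, hm, hf) for hm, hf in pairs)
           for hc in GENOTYPES}
    # Backward: B for each parent from the mate and the child.
    b_m = {hm: sum(wf[hf] * wc[hc] * mendel(hc, hm, hf) for hf, hc in pairs)
           for hm in GENOTYPES}
    b_f = {hf: sum(wm[hm] * wc[hc] * mendel(hc, hm, hf) for hm, hc in pairs)
           for hf in GENOTYPES}
    return (normalize({h: wm[h] * b_m[h] for h in GENOTYPES}),
            normalize({h: wf[h] * b_f[h] for h in GENOTYPES}),
            normalize({h: a_c[h] * wc[h] for h in GENOTYPES}))


prior = {h: 1 / len(GENOTYPES) for h in GENOTYPES}
# The child has no reads at all; its call is inferred from the parents.
post_m, post_f, post_c = forward_backward_trio(
    prior, list("AAAAAA"), list("CCCCCC"), [])
child_call = max(post_c, key=post_c.get)
```

In this sketch the child is called as A:C from the parents alone. In a larger pedigree the passes would be repeated, with the priors recomputed on each iteration, until the calls no longer change.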
Monogamous Family
Certain embodiments involve computing Ax for the children and Bx for the parents in a single family embedded inside a pedigree (see, e.g.,
Exemplary formulae are:
Non-Monogamous Families
In certain embodiments, parents are not necessarily monogamous, that is, a parent can have children with more than one mate. See, e.g.,
Exemplary formulae are:
The order of execution can be straightforward in the forward direction. Execution order may be organized as a directed graph where there are directed arrows from each parent to its children. See, e.g.,
The backward direction requires arrows from children to parents but also between half-siblings. The result is acyclic when the families are monogamous. However, in the presence of non-monogamous families it is possible to end up with cycles in the graph. One can ignore this and just use the most recent values of Bx at each step. Unfortunately, the results then depend on the order in which nodes are visited. The solution above is to use the values of B from the previous generation (B′v,w,k).
This approach can be computationally efficient for large families and provides improved calling for individuals with no or little coverage.
Exemplary hardware components are represented in
Due to the large number of variant calling possibilities at each location in a genome, there may be benefit in using a specific hardware implementation utilizing parallel execution. Such hardware may dramatically increase the speed of the pedigree variant analysis.
In such a specific hardware solution a set of reads may be passed to the hardware device covering a fixed range across the genome. For example, given a window of, say, 20 nucleotides across a chromosome, a set of reads that map to that location may be analyzed by the hardware device.
The pedigree information may also be provided with respect to each read. The hardware devices can update the thousands or hundreds of thousands of possible variants in parallel, and a result may be obtained that maximizes a likelihood function.
The possible variants can be designed as part of a neural network that efficiently updates counts and probabilities as more read-based evidence is supplied. An example representing a hardware device to provide real-time pedigree variant analysis is shown in
As would be well understood by those of skill in the art, the disclosed methods may be performed by one or more processors executing program instructions stored on one or more memories. Certain embodiments comprise systems for calling genomic sequences, in which the system comprises one or more processors configured to execute one or more modules and a memory storing the one or more modules, wherein the modules comprise the exemplary hardware components disclosed above.
There are thus provided methods that utilize population and family information so that high quality calls can be made with consistent scoring. The models provide a principled way of combining multiple effects with the ability to dynamically update model values as information increases. The models provide fast resolution of complex calling problems with improved accuracy.
While the present invention has been illustrated by the description of the embodiments thereof, and while the embodiments have been described in detail, it is not the intention of the applicant to restrict or in any way limit the scope of the appended claims to such detail. Additional advantages and modifications will readily appear to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departure from the spirit or scope of the applicant's general inventive concept.
The following specific examples are to be construed as merely illustrative, and not limiting of the disclosure.
Table 1 below provides an example illustrating the application of the invention to a haploid genome. Applying a Bayesian model to calling a genomic sequence the probability of a hypothesis (proposed sequence values for the region of interest) being correct given the data (reads) is the normalized value of the probability of the hypothesis occurring (prior) times the probability of the data occurring given the hypothesis (model) which may be expressed as described in Equation 1, repeated here:
where:
Table 2 below provides an example illustrating the application of the invention to a family. Where a family is being evaluated, such as illustrated in
where:
This example is identical to Example 2 except that it includes a probability of 0.01 in the M table for a de novo mutation of C:G to either A:G or C:A and then a selection of the de novo mutation in the child. The result is that a call that had a posterior probability of zero in Example 2 now has a posterior higher than the alternative call.
The following embodiments are to be construed as merely illustrative, and not limiting of the disclosure.
where:
Additional embodiments include:
where:
P(ec|Hn,Hc)=αP(ec|Hn)+(1−α)(a/(a+b)P(ec|H′c)+b/(a+b)P(ec|H″c))
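The per-read expression above, which combines contamination and copy number, may be sketched as follows, treating the two haploid tumor hypotheses as H′c and H″c with copy numbers a and b. Modeling the normal hypothesis as a diploid genotype with equal weight on each allele, the uniform error model and all names are assumptions of this sketch.

```python
def p_read(e, h, error=0.01):
    """Per-read model P(e|h) for a haploid hypothesis h (placeholder model)."""
    return 1 - error if e == h else error / 3


def p_cancer_read(e, hn, hc1, hc2, alpha, a, b, error=0.01):
    """P(ec|Hn,Hc) with contamination alpha and copy numbers a and b.

    hn is a diploid normal genotype; hc1 and hc2 are the haploid tumor
    hypotheses carried by a and b copies respectively.
    """
    normal = sum(0.5 * p_read(e, allele, error) for allele in hn)
    tumor = (a * p_read(e, hc1, error) + b * p_read(e, hc2, error)) / (a + b)
    return alpha * normal + (1 - alpha) * tumor


hn = ("A", "A")
balanced = p_cancer_read("A", hn, "A", "C", alpha=0.0, a=1, b=1)
gain_of_a = p_cancer_read("A", hn, "A", "C", alpha=0.0, a=2, b=1)
gain_of_c = p_cancer_read("A", hn, "C", "A", alpha=0.0, a=2, b=1)
loh = p_cancer_read("A", hn, "A", "C", alpha=0.0, a=1, b=0)
```

With a=2 and b=1 the oriented hypotheses AC and CA give different read probabilities, and with a=1 and b=0 the expression reduces to the single remaining haploid hypothesis.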
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. Additional advantages and modifications will readily appear to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details, representative apparatus and method, and illustrative examples shown and described. Accordingly, departures may be made from such details without departure from the spirit or scope of the applicant's general inventive concept. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
This application claims priority to U.S. Provisional Application No. 61/691,271, filed Aug. 21, 2012; U.S. Provisional Application No. 61/729,462, filed Nov. 23, 2012; and U.S. Provisional Application No. 61/803,671, filed Mar. 20, 2013; all of which are incorporated by reference herein.
Number | Date | Country
---|---|---
61691271 | Aug 2012 | US
61729462 | Nov 2012 | US
61803671 | Mar 2013 | US