The present invention relates to a next generation sequencing (NGS) based method for genetic testing and particularly, although not exclusively, to a next generation sequencing (NGS) based method for paternity testing.
Paternity testing has experienced great changes in the last three decades as a result of improvement of DNA sequencing technologies. To date, the most widely adopted methods for paternity testing in forensic laboratories worldwide are polymerase chain reaction (PCR) based sequencing and capillary electrophoresis (CE) based sequencing for detection of fragment length variations in 13 core short tandem repeat (STR) markers in the Combined DNA Index System (CODIS) published by the Federal Bureau of Investigation (FBI).
Although the CODIS system is powerful and widely applied, it suffers from a number of problems. As an increasing number of forensic databases based on CODIS core STRs are established worldwide, the sizes of the databases are dramatically enlarged, and so the probability of random hits (“cold hits”) in databases also increases dramatically. In applications such as individual identification in criminal cases, this may cause an individual in the forensic database to be falsely charged as the criminal when a new crime occurs. On the other hand, in applications such as paternity testing, the result is vulnerable to false exclusions caused by allelic dropout, null alleles, contamination, human errors and mutations in offspring.
In accordance with a first aspect of the present invention, there is provided a next generation sequencing (NGS) based method for genetic testing, comprising: applying, for one or more genetic loci, respective NGS data related to genotype of a first tested subject, genotype of a second tested subject, and genotype of an alleged offspring of the first and second tested subjects to a statistical model for calculating a value representing a likelihood that the alleged offspring is a true offspring of the first and second subjects; and determining, based on the respective values calculated for the one or more genetic loci, a likelihood that the alleged offspring is a true offspring of the first and second tested subjects; wherein the NGS data includes: genotype and sequencing read of the first tested subject; genotype and sequencing read of the second tested subject; and genotype and sequencing read of the alleged offspring; wherein the statistical model utilizes: a probability of the genotype of the first tested subject in a subject population; a probability of the genotype of the second tested subject in a subject population; and a probability of the genotype of the alleged offspring in a subject population.
In one embodiment of the first aspect, the method is applied to a plurality of genetic loci.
In one embodiment of the first aspect, the statistical model utilizes the respective probability of the genotype of the first tested subject, the second tested subject, and the alleged offspring as posterior probability with the sequencing read of the first tested subject, the second tested subject, and the alleged offspring.
In one embodiment of the first aspect, the statistical model applies the following for calculating the value of the respective genetic loci:
where Dm represents the sequencing read for the first tested subject, Dc represents the sequencing read for the alleged offspring, Daf represents the sequencing read for the second tested subject, gm represents the genotype for a corresponding locus of the first tested subject, gc represents the genotype for a corresponding locus of the alleged offspring, gaf represents the genotype for a corresponding locus of the second tested subject, T(gc|gm, gaf) represents a likelihood that both alleles of the alleged offspring are inherited from the first and second tested subjects, and T(gc|gmf) represents a likelihood that the tested second subject is not biologically related to the alleged offspring.
In one embodiment of the first aspect, the first tested subject is a mother of the offspring and the second tested subject is an alleged father of the offspring.
In one embodiment of the first aspect, the method further comprises the step of obtaining raw NGS data from the first tested subject, the second tested subject, and the alleged offspring.
In one embodiment of the first aspect, in the raw NGS data, a sequencing coverage of the first tested subject is above or equal to 0.5×. In other words, the method of the first aspect would operate reliably even if the raw NGS data of the first tested subject is sub-sampled to a certain extent.
In one embodiment of the first aspect, in the raw NGS data a sequencing coverage of the second tested subject is above or equal to 0.5×. In other words, the method of the first aspect would operate reliably even if the raw NGS data of the second tested subject is sub-sampled to a certain extent.
In one embodiment of the first aspect, in the raw NGS data a sequencing coverage of the alleged offspring is above or equal to 0.5×. In other words, the method of the first aspect would operate reliably even if the raw NGS data of the alleged offspring is sub-sampled to a certain extent.
In one embodiment of the first aspect, the method further comprises: prior to the application step, filtering raw NGS data to remove markers with more than two alleles to obtain the respective NGS data for the one or more genetic loci.
In one embodiment of the first aspect, the method further comprises the steps of: dividing respective genomes in the corresponding NGS data of the first tested subject, the second tested subject, and the alleged offspring into a plurality of segments; sorting markers in each of the plurality of segments based on a probability of exclusion; and selecting a plurality of markers based on the sorting result for application to the statistical model. Preferably, the selection step comprises selecting a plurality of markers with the highest probability of exclusion.
In accordance with a second aspect of the present invention, there is provided a next generation sequencing (NGS) based system for genetic testing, comprising: means for applying, for one or more genetic loci, respective NGS data related to genotype of a first tested subject, genotype of a second tested subject, and genotype of an alleged offspring of the first and second tested subjects to a statistical model for calculating a value representing a likelihood that the alleged offspring is a true offspring of the first and second subjects; and means for determining, based on the respective values calculated for the one or more genetic loci, a likelihood that the alleged offspring is a true offspring of the first and second tested subjects; wherein the NGS data includes: genotype and sequencing read of the first tested subject; genotype and sequencing read of the second tested subject; and genotype and sequencing read of the alleged offspring; wherein the statistical model utilizes: a probability of the genotype of the first tested subject in a subject population; a probability of the genotype of the second tested subject in a subject population; and a probability of the genotype of the alleged offspring in a subject population.
In one embodiment of the second aspect, the statistical model utilizes the respective probability of the genotype of the first tested subject, the second tested subject, and the alleged offspring as posterior probability with the sequencing read of the first tested subject, the second tested subject, and the alleged offspring.
In one embodiment of the second aspect, the statistical model applies the following for calculating the value of the respective genetic loci:
where Dm represents the sequencing read for the first tested subject, Dc represents the sequencing read for the alleged offspring, Daf represents the sequencing read for the second tested subject, gm represents the genotype for a corresponding locus of the first tested subject, gc represents the genotype for a corresponding locus of the alleged offspring, gaf represents the genotype for a corresponding locus of the second tested subject, T(gc|gm, gaf) represents a likelihood that both alleles of the alleged offspring are inherited from the first and second tested subjects, and T(gc|gmf) represents a likelihood that the tested second subject is not biologically related to the alleged offspring.
In one embodiment of the second aspect, the first tested subject is a mother of the offspring and the second tested subject is an alleged father of the offspring.
In some embodiments of the second aspect, the system may also include structures suitable for implementing the method in various embodiments of the first aspect.
In accordance with a third aspect of the present invention, there is provided a non-transitory computer readable medium for storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform a next generation sequencing (NGS) based method for genetic testing, comprising: applying, for one or more genetic loci, respective NGS data related to genotype of a first tested subject, genotype of a second tested subject, and genotype of an alleged offspring of the first and second tested subjects to a statistical model for calculating a value representing a likelihood that the alleged offspring is a true offspring of the first and second subjects; and determining, based on the respective values calculated for the one or more genetic loci, a likelihood that the alleged offspring is a true offspring of the first and second tested subjects; wherein the NGS data includes: genotype and sequencing read of the first tested subject; genotype and sequencing read of the second tested subject; and genotype and sequencing read of the alleged offspring; wherein the statistical model utilizes: a probability of the genotype of the first tested subject in a subject population; a probability of the genotype of the second tested subject in a subject population; and a probability of the genotype of the alleged offspring in a subject population.
In some embodiments of the third aspect, the non-transitory computer readable medium may contain computer instructions that, when executed by one or more processors, cause the one or more processors to perform the method in some embodiments of the first aspect.
Embodiments of the present invention will now be described, by way of example, with reference to the accompanying drawings in which:
The inventors of the present invention have devised, through research, experiments, and trials, that next-generation sequencing (NGS), with its high throughput and relatively low cost compared to other sequencing techniques, offers considerable potential for forensic studies. Since the first pyrosequencing-based high-throughput sequencing system, the 454 Genome Sequencing System, was introduced by Roche in 2005, NGS techniques have gradually matured. The throughput of a single sequencing run has increased significantly and the cost per base has fallen significantly. For paternity testing, whole genome sequencing provides redundant marker information that is capable of handling complex scenarios with high accuracy.
The inventors of the present invention have also devised, through research, experiments, and trials, that in order to acquire a reliable result at low cost, a minimum sequencing coverage requirement must be set for NGS-based methods and systems. However, when the sequencing coverage is low, the genotypes of the tested individuals are associated with statistical uncertainty, for two main reasons. First, for diploid organisms, both alleles may not be sampled. Second, in most NGS data, the error rate is at least 0.1% even after filtering out low-quality base pairs. This may result in many homozygous loci being wrongly inferred as heterozygous.
The most widely applied method for paternity testing nowadays is the likelihood method. Given the genotypes of a tested trio, this method calculates the likelihood ratio of two hypotheses, called the Paternity Index (PI):
(1) X, the likelihood that the tested man is the biological father of the child (True Trio);
(2) Y, the likelihood that a random man is the biological father of the child (False Trio).
For each locus, denote gaf, gm and gc as the genotypes of the alleged father, mother and child respectively; then the PI value can be written as
where T(gc|gm, gaf) is the likelihood of a true trio, which means that both alleles of the child are inherited from the mother and the alleged father; T(gc|gmf) is the likelihood that the tested man is not the biological father of the child.
Ranging from 0 to infinity, the PI value provides DNA evidence of paternity for each locus. Specifically, if PI>1, the genetic evidence at this locus supports that the tested man is the biological father; if PI=1, the genetic evidence at this locus provides no information on paternity; and if PI<1, the genetic evidence at this locus is more consistent with non-paternity than paternity. Low PI values primarily result from inconsistencies in genetic markers, which may be caused by non-paternity, mutations in offspring and wrong genotype calls due to sequencing errors.
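The per-locus likelihood ratio described above can be illustrated with a minimal Python sketch for genotypes that are assumed to be known without error. The function names, the tuple representation of genotypes and the allele frequencies are hypothetical. The "random man" likelihood is modelled here in the conventional way, by drawing the paternal allele from population allele frequencies; this is an assumption of the sketch and is not taken from the equation of the present description.

```python
from itertools import product


def transmit(genotype):
    """Return {allele: probability} that a parent passes each allele on."""
    probs = {}
    for allele in genotype:
        probs[allele] = probs.get(allele, 0.0) + 0.5
    return probs


def t_true_trio(gc, gm, gaf):
    """Likelihood of the child genotype if both tested parents are true."""
    child = tuple(sorted(gc))
    total = 0.0
    for (am, pm), (af, pf) in product(transmit(gm).items(),
                                      transmit(gaf).items()):
        if tuple(sorted((am, af))) == child:
            total += pm * pf
    return total


def t_random_man(gc, gm, freqs):
    """Likelihood of the child genotype if the father is a random man.

    Assumption: the paternal allele is drawn from the population
    allele frequencies given in `freqs`.
    """
    child = tuple(sorted(gc))
    total = 0.0
    for am, pm in transmit(gm).items():
        for af, pf in freqs.items():
            if tuple(sorted((am, af))) == child:
                total += pm * pf
    return total


def paternity_index(gc, gm, gaf, freqs):
    """Per-locus PI = X / Y; NaN if the child is inconsistent with the mother."""
    x = t_true_trio(gc, gm, gaf)
    y = t_random_man(gc, gm, freqs)
    return x / y if y > 0 else float("nan")
```

For example, with a mother ("A", "A"), a child ("A", "a"), and hypothetical allele frequencies {"A": 0.8, "a": 0.2}, an alleged father ("a", "a") gives PI = 1 / 0.2 = 5, whereas an alleged father ("A", "A") gives PI = 0 because he cannot have transmitted allele a.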
The following embodiment of the present invention provides a method that reduces the errors caused by sequencing errors.
Preferably, in the method 100 of the present embodiments, the statistical model utilizes the respective probability of the genotype of the first tested subject, the second tested subject, and the alleged offspring as posterior probability with the sequencing read of the first tested subject, the second tested subject, and the alleged offspring.
In some examples, the method 100 may further include obtaining raw NGS data from the first tested subject, the second tested subject, and the alleged offspring. This is prior to step 102. Preferably, in the raw NGS data, the respective sequencing coverages of the first tested subject, the second tested subject, and the alleged offspring are each as low as 0.5×. In one example, in the raw NGS data, the respective sequencing coverages of the first tested subject, the second tested subject, and the alleged offspring are each between 0.5× and 2×. The method may further include, prior to step 102, filtering, either automatically or manually, the raw NGS data to remove markers with more than two alleles to obtain the respective NGS data for the genetic loci.
In one embodiment, the method 100 may include dividing respective genomes in the corresponding NGS data of the first tested subject, the second tested subject, and the alleged offspring into a plurality of segments. The markers in each of the plurality of segments are then sorted based on a probability of exclusion, and afterwards, one or more markers may be selected based on the sorting result for application to the statistical model. In one example, the markers with the highest probability of exclusion are selected.
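The segment-wise marker selection described above may be sketched as follows. This is an illustration only: the function name, the (chromosome, position, probability of exclusion) tuple layout, the fixed segment size and the number of markers kept per segment are all assumptions, and the probability of exclusion of each marker is taken as a precomputed input.

```python
from collections import defaultdict


def select_markers(markers, segment_size, per_segment):
    """Keep the markers with the highest probability of exclusion per segment.

    markers: iterable of (chrom, position, pe) tuples, where pe is a
    precomputed probability of exclusion (hypothetical input format).
    """
    # Divide the genome into fixed-size segments keyed by (chrom, bin index).
    segments = defaultdict(list)
    for chrom, pos, pe in markers:
        segments[(chrom, pos // segment_size)].append((chrom, pos, pe))

    selected = []
    for key in sorted(segments):
        # Sort markers within the segment by probability of exclusion.
        ranked = sorted(segments[key], key=lambda m: m[2], reverse=True)
        selected.extend(ranked[:per_segment])
    return selected
```

Selecting per segment rather than genome-wide spreads the chosen markers across the genome, which is one plausible motivation for the segmentation step.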
In the method 100 of the present embodiment, to reduce the errors caused by sequencing errors, the probabilities of the genotypes are modelled as posterior probabilities given the observed reads, using Bayes' rule. In the present embodiment, the PI value is defined as
where Dc, Dm and Daf represent the observed sequencing reads for, respectively, the tested offspring, mother and alleged father.
According to Bayes' rule, the conditional probability that the individual's real genotype is gi,j, with alleles i and j, given the observed reads on the locus is
where P(gi,j) is the genotype frequency in the subject population. Under the assumption of Hardy-Weinberg equilibrium, it can be calculated that
where f(i) and f(j) are the allele frequencies for allele i and j respectively.
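For reference, the textbook forms of Bayes' rule and of the Hardy-Weinberg genotype frequencies, written with the symbols defined above, are sketched below. They are given as standard results consistent with the surrounding definitions, not as a reproduction of the numbered equations of the present description; the summation index g' ranging over all candidate genotypes is notation introduced here.

```latex
% Bayes' rule: posterior probability of genotype g_{i,j} given reads D
P(g_{i,j} \mid D) =
  \frac{P(D \mid g_{i,j})\, P(g_{i,j})}
       {\sum_{g'} P(D \mid g')\, P(g')}

% Hardy-Weinberg equilibrium: genotype frequency from allele frequencies
P(g_{i,j}) =
  \begin{cases}
    f(i)^2         & \text{if } i = j \\
    2\, f(i)\, f(j) & \text{if } i \neq j
  \end{cases}
```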
In the method of the present embodiment, P(D|gi,j) is the likelihood of observing the allele types that are supported by the reads if the genotype is gi,j. Assuming that the reads are independent of each other in the sequencing process, then
$$P(D \mid g_{i,j}) = \prod_{k} P(d_k \mid g_{i,j}) \qquad (5)$$
where dk is the k-th read that covers the corresponding locus.
The present embodiment models the sequencing process as a random process following a binomial distribution, which means that a sequenced read is equally likely to originate from either allele. Thus
$$P(D \mid g_{i,j}) = \binom{d_i + d_j}{d_i}\, P(i \mid g_{i,j})^{d_i}\, P(j \mid g_{i,j})^{d_j} \qquad (6)$$
where P(i|gi,j) is the probability of the sequenced read with allele i in one sampling under the condition that the individual genotype is gi,j. In one example, if gi,j is Aa, then p(A|gAa)=p(a|gAa)=0.5.
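The combination of a binomial read likelihood with a Hardy-Weinberg prior can be sketched in Python for a biallelic locus. This is a simplified illustration, not the error model of the present embodiment: sequencing errors are approximated by a single hypothetical per-read error rate `err`, so that a homozygous genotype yields the other allele with probability `err`. The function names and the encoding of a genotype as the number of alternative alleles (0, 1 or 2) are also assumptions.

```python
from math import comb


def read_likelihood(d_ref, d_alt, genotype, err=0.001):
    """Binomial P(D | g) for a biallelic locus.

    genotype: number of alternative alleles (0, 1 or 2).
    err: hypothetical per-read error rate (simplifying assumption).
    """
    # Probability that a single read shows the alternative allele.
    p_alt = {0: err, 1: 0.5, 2: 1.0 - err}[genotype]
    return comb(d_ref + d_alt, d_alt) * (1 - p_alt) ** d_ref * p_alt ** d_alt


def genotype_posterior(d_ref, d_alt, f_alt, err=0.001):
    """Posterior P(g | D) under a Hardy-Weinberg prior with allele frequency f_alt."""
    f_ref = 1.0 - f_alt
    prior = {0: f_ref ** 2, 1: 2 * f_ref * f_alt, 2: f_alt ** 2}
    joint = {g: read_likelihood(d_ref, d_alt, g, err) * prior[g] for g in prior}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}
```

With a single reference read and f_alt = 0.5, the homozygous reference and heterozygous genotypes come out almost equally probable, which illustrates the low-coverage uncertainty noted earlier: one read cannot show whether the second allele was simply not sampled.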
To account for sequencing errors, denote the observed reads for alleles i and j as di and dj respectively, and the real reads (without error) of alleles i and j as ri and rj. Then it can be determined that
$$P(d_i, d_j \mid g_{i,j}) = \sum_{r_i, r_j} P(d_i, d_j \mid r_i, r_j)\, P(r_i, r_j \mid g_{i,j}) \qquad (7)$$
Suppose, in one example, that 4 reads supporting allele i and 6 reads supporting allele j are observed for an SNP locus. The real situation may be 4 reads for i and 6 reads for j without sequencing error, or 3 reads for i and 7 reads for j with 1 sequencing error. If the real situation is 4 reads for i and 6 reads for j, the number of errors may be 0, 2, 4, . . . In other words, in this case the errors must occur in opposite pairs, i.e., if one read is incorrectly sequenced as i instead of j, there must be another read incorrectly sequenced as j instead of i in order to get the final observation.
To convert the theoretical sequencing scenario (without sequencing errors) to the observed case (with sequencing errors), the minimum number of sequencing errors on this locus is emin=|di−ri|=|dj−rj|. Under the assumption that each read can only be incorrectly sequenced once, the total number of errors e on this locus must satisfy
After clarifying the rules for errors, equation (7) may be expanded by listing out all the cases with sequencing errors. Subsequently,
P(D|gi,j)=Σd
where e is subject to the inequalities in equation (8).
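The error-counting rule discussed above can be checked with a short enumeration. The sketch below is derived only from the stated assumption that each read is incorrectly sequenced at most once; it is not a transcription of the inequalities of the present description, and the function name is hypothetical. Here `a` counts true-i reads observed as j and `b` counts true-j reads observed as i.

```python
def valid_error_counts(r_i, r_j, d_i, d_j):
    """List every total error count e that turns real counts into observed counts.

    Assumes a biallelic locus and at most one error per read.
    """
    if r_i + r_j != d_i + d_j:
        raise ValueError("real and observed read totals must match")
    counts = []
    for a in range(r_i + 1):      # a: real i reads misread as j
        b = d_i - r_i + a         # b: real j reads misread as i
        if 0 <= b <= r_j:
            counts.append(a + b)  # total errors e = a + b
    return counts
```

For the example in the text, real counts (4, 6) observed as (4, 6) give e in {0, 2, 4, 6, 8}, and real counts (3, 7) observed as (4, 6) give e in {1, 3, 5, 7}. In both cases the smallest value equals emin and successive values differ by 2.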
Referring to
To verify the performance of the method in the above embodiments of the present invention, the following experiments were performed.
One experiment used genetic data of 320 Chinese individuals in the 1000 Genomes Project Phase 3. In the experiment, the allele frequencies for both SNP and STR markers in the Chinese sub-population were counted. Then, NGS data for 8 Chinese family trios with an average sequencing coverage of ~32× were collected. After stringently filtering out the markers with more than two alleles, the statistical model in the above embodiments of the present invention was applied.
A further experiment was performed by randomly subsampling reads to reduce the sequencing coverage of the samples. In this experiment, the overall coverage was reduced to ~2×, ~1×, ~0.5× and ~0.3× respectively. At each sequencing coverage, 800 experiments for both true trios and false trios (100 per family trio) were processed. As shown in
Embodiments of the present invention have provided a statistical model based method for genetic testing with NGS data. By considering the probability of sequencing errors and missing alleles, the likelihoods of the genotypes for individuals in the tested trio are calculated and then combined to obtain the overall probability that the tested subject is biologically related to the alleged offspring (e.g., the tested man is the true biological father of the alleged offspring). The method in some embodiments of the present invention requires a minimum of 0.5× NGS sequencing data of a family trio to perform accurate determination. As a result, reliable results can be obtained at relatively low cost.
It should be noted that the methods of the present invention can be applied not only to paternity testing, but also to genetic analysis for individual identification. Also, the present invention is not limited in its application to human beings, but may also apply to other animals, plants, etc.
Although not required, the embodiments described with reference to the Figures can be implemented as an application programming interface (API) or as a series of libraries for use by a developer or can be included within another software application, such as a terminal or personal computer operating system or a portable computing device operating system. Generally, as program modules include routines, programs, objects, components and data files assisting in the performance of particular functions, the skilled person will understand that the functionality of the software application may be distributed across a number of routines, objects or components to achieve the same functionality desired herein.
It will also be appreciated that where the methods and systems of the present invention are either wholly or partly implemented by computing systems, then any appropriate computing system architecture may be utilized. This will include stand-alone computers, network computers and dedicated hardware devices. Where the terms “computing system” and “computing device” are used, these terms are intended to cover any appropriate arrangement of computer hardware capable of implementing the function described.
It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
Any reference to prior art contained herein is not to be taken as an admission that the information is common general knowledge, unless otherwise indicated.