This application claims the benefit of Serial No. GB 0722649.1, filed Nov. 19, 2007 in Great Britain, and also of Serial No. GB 0804665.8, filed Mar. 13, 2008 in Great Britain, and also of Serial No. GB 0810624.7, filed Jun. 11, 2008, and also of Serial No. GB 0815300.9, filed Aug. 21, 2008, which applications are incorporated herein by reference. A claim of priority to all, to the extent appropriate, is made.
This invention concerns improvements in and relating to the consideration of evidence, particularly, but not exclusively the consideration of DNA evidence.
In many situations, particularly in forensic science, there is a need to consider one piece of evidence against one or more other pieces of evidence.
For instance, it may be desirable to compare a sample collected from a crime scene with a sample collected from a person, with a view to linking the two by comparing the characteristics of their DNA. This is an evidential consideration. The result may be used directly in criminal or civil legal proceedings. Such situations include instances where the sample from the crime scene is contributed to by more than one person.
In other instances, it may be desirable to establish the most likely matches between examples of characteristics of DNA samples stored on a database with a further sample. The most likely matches or links suggested may guide further investigations. This is an intelligence consideration.
In both of these instances, it is desirable to be able to express the strength or likelihood of the comparison made, a so-called likelihood ratio.
The present invention has amongst its possible aims to establish likelihood ratios. The present invention has amongst its possible aims to provide a more accurate or robust method for establishing likelihood ratios. The present invention has amongst its possible aims to provide probability distribution functions for use in establishing likelihood ratios, where the probability distribution functions are derived from experimental data.
According to a first aspect of the invention we provide a method of comparing a test sample result set with another sample result set, the method including:
The method of comparing may be used to consider evidence, for instance in civil or criminal legal proceedings. The comparison may be as to the relative likelihoods, for instance a likelihood ratio, of one hypothesis to another hypothesis. The comparison may be as to the relative likelihoods of the evidence relating to one hypothesis to another hypothesis. In particular, this may be a hypothesis advanced by the prosecution in the legal proceedings and another hypothesis advanced by the defence in the legal proceedings. The likelihood ratio may be of the form:
where
The likelihood ratio may be defined as:
where Ui is one of the supposed sources stated in Hd.
The likelihood ratio may be defined as:
where the f terms are factors that are likelihoods.
The factor Pr(Ui|S,Hd) may be computed by using various methods, including those which may include consideration of relatedness between the suspect and the unknown contributor and/or those described in D. J. Balding (2005), “Weight-of-evidence for forensic DNA profiles”, John Wiley & Sons Ltd and/or J. Buckleton, C. M. Triggs, S. J. Walsh (2005), “Forensic DNA evidence interpretation”, CRC Press.
The factors f(C|S,Hp) and/or f(C|Ui,S,Hd) may be taken as statements of the same type of calculation. Those factors may be taken as definitions of the likelihood of observing the set of peak heights in C given a supposed donor.
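By way of illustration only, the factors named above can be combined as a numerator f(C|S,Hp) divided by a sum, over the supposed sources Ui, of f(C|Ui,S,Hd) multiplied by Pr(Ui|S,Hd). The following is a minimal sketch under that assumption. The function name, argument names and numerical values are hypothetical and are not taken from the specification.

```python
from typing import Dict, Hashable


def likelihood_ratio(
    f_c_given_suspect: float,
    f_c_given_unknown: Dict[Hashable, float],
    pr_unknown_given_suspect: Dict[Hashable, float],
) -> float:
    """Return f(C|S,Hp) / sum_i f(C|Ui,S,Hd) * Pr(Ui|S,Hd).

    Assumed form of the ratio; the keys of both dictionaries label the
    supposed sources Ui considered under the defence hypothesis Hd.
    """
    denominator = sum(
        f_c_given_unknown[u] * pr_unknown_given_suspect[u]
        for u in f_c_given_unknown
    )
    if denominator <= 0.0:
        raise ValueError("denominator must be positive")
    return f_c_given_suspect / denominator


# Illustrative numbers only (not experimental values).
lr = likelihood_ratio(
    0.02,
    {("b", "b"): 0.02, ("b", "Q"): 0.004},
    {("b", "b"): 0.01, ("b", "Q"): 0.15},
)
```

With these illustrative inputs the denominator is 0.0002 + 0.0006 = 0.0008, so the ratio is approximately 25.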
The comparison may include the consideration of the term:
f(C|G)
where G denotes the supposed donor genotype. The comparison may include the consideration of this term as, or as a component of, the denominator of the likelihood ratio, and/or the consideration of this term as, or as a component of, the numerator of the likelihood ratio.
The term f(C|G) may be an estimation. The term f(C|G) may be derived from experimental data. The term f(C|G) may be an estimation derived from experimentation data. The term f(C|G) may be or include one or more probability distribution functions. The term f(C|G) may be or include one or more probability distribution functions provided according to the seventh aspect of the invention. The term f(C|G) may be or may be as further defined in the second aspect of the invention.
The method of comparing may involve a likelihood ratio which is not conditioned on quantity of DNA in the first and/or second sample.
The method of comparing may involve a likelihood ratio which is conditioned on quantity of DNA in the first and/or second sample.
The method of comparing may include defining the term f(C|G) as being defined by:
where Pr(χj|I(h)) is a probability distribution of DNA quantity given peak height information, I(h), preferably experimentally derived.
The term f(C|G) may be conditioned on DNA quantity in the test sample in this way. The term f(C|G) and/or f(C|G,χi) may be or may be as further defined in the third aspect of the invention.
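One way of conditioning on DNA quantity that is consistent with the terms named above is to weight each f(C|G,χj) by Pr(χj|I(h)) and sum over the candidate quantities χj. The sketch below assumes that form. The function name and the numerical values are hypothetical.

```python
from typing import Sequence


def marginal_likelihood(
    f_c_given_g_chi: Sequence[float],
    pr_chi_given_heights: Sequence[float],
) -> float:
    """Return sum_j f(C|G,chi_j) * Pr(chi_j|I(h)) (assumed form).

    Both sequences are indexed by the same list of discrete DNA
    quantities chi_1 ... chi_n.
    """
    if len(f_c_given_g_chi) != len(pr_chi_given_heights):
        raise ValueError("one likelihood is needed per DNA quantity")
    if abs(sum(pr_chi_given_heights) - 1.0) > 1e-9:
        raise ValueError("Pr(chi_j|I(h)) must sum to 1")
    return sum(
        f * p for f, p in zip(f_c_given_g_chi, pr_chi_given_heights)
    )


# Two candidate quantities with illustrative values only.
value = marginal_likelihood([0.1, 0.3], [0.25, 0.75])
```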
The term f(C|G,χi) may be an estimation. The term f(C|G,χi) may be derived from experimental data. The term f(C|G,χi) may be an estimation derived from experimentation data. The term f(C|G,χi) may be or include one or more probability distribution functions. The term f(C|G,χi) may be or include one or more probability distribution functions provided according to the seventh aspect of the invention. The term f(C|G,χi) may be or may be as further defined in the third aspect of the invention.
The method of comparing may be used to gather information to assist further investigations or legal proceedings. The method of comparing may provide intelligence on a situation. The method of comparison may be of the likelihood of the information of the first or test sample result given the information of the second or another sample result. The method of comparison may provide a listing of possible other sample results, ideally ranked according to the likelihood. The method of comparison may seek to establish a link between a DNA profile from a crime scene sample and one or more DNA profiles stored in a database.
The method of comparison may consider a likelihood ratio. Preferably the likelihood ratio, more preferably the posterior probability, is defined as:
where:
C is the first or test result set from a test sample, more particularly, the first result set taken from a sample recovered from a person or location linked with a crime, potentially expressed in terms of peak positions and/or heights;
Gi is the second or another result set, more particularly, the various members of an exhaustive list of potential donor genotypes; and
Pr(Gi) is a prior distribution for genotype Gi, preferably computed for a population, particularly the population from which the test sample under consideration comes. It can be computed using the formulae introduced by Balding et al. (1996).
The comparison may include consideration of the term: f(C|Gi). The method may include any of the features, options or possibilities proposed for f(C|G) applying to f(C|Gi).
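As an illustration of the intelligence use, a posterior probability for each candidate genotype can be obtained from f(C|Gi) and Pr(Gi) by Bayes' theorem, normalising over the exhaustive list of genotypes. The sketch below assumes that form. The function name, genotype labels and values are hypothetical.

```python
from typing import Dict, Hashable, List, Tuple


def rank_genotypes(
    likelihoods: Dict[Hashable, float],
    priors: Dict[Hashable, float],
) -> List[Tuple[Hashable, float]]:
    """Return (genotype, Pr(Gi|C)) pairs, most probable first.

    Assumes Pr(Gi|C) = f(C|Gi) Pr(Gi) / sum_j f(C|Gj) Pr(Gj).
    """
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(joint.values())
    if total <= 0.0:
        raise ValueError("at least one genotype needs non-zero support")
    posterior = {g: v / total for g, v in joint.items()}
    return sorted(posterior.items(), key=lambda kv: kv[1], reverse=True)


# Uniform prior over three candidate genotypes; illustrative values only.
ranked = rank_genotypes(
    {"a,b": 0.2, "b,b": 0.6, "b,Q": 0.2},
    {"a,b": 1 / 3, "b,b": 1 / 3, "b,Q": 1 / 3},
)
```

Here the first entry of the ranked list is the genotype "b,b", with a posterior probability of approximately 0.6.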
The method of comparing may involve a likelihood ratio which is not conditioned on quantity of DNA in the first and/or second sample.
The method of comparing may involve a likelihood ratio which is conditioned on quantity of DNA in the first and/or second sample.
The method of comparing may include defining the term Pr(Gi|C) as being defined by:
where Pr(χj|I(h)) is a probability distribution of DNA quantity given information on peak heights.
The term Pr(χj|I(h)) may be or may be as further defined in the fourth aspect of the invention.
The term f(C|G1,G2) may be or may be as further defined in the fifth aspect of the invention.
The term f(C|G1,G2,χi) may be or may be as further defined in the sixth aspect of the invention.
The method of comparing may be used in one or more of the following situations:
According to a second aspect of the invention we provide a method of comparing a first, potentially test, sample result set with a second, potentially another, sample result set, the method including:
The factor f(C|G) may be an expression of the likelihood of observing the information of the first result set given the information of the second result set. The factor f(C|G) may be an expression of the likelihood of observing the information from a sample taken from a crime scene given the information in a given genotype, particularly the genotype of a suspect and/or another person.
The comparison may include the consideration of this term as, or as a component of, the denominator of a likelihood ratio, and/or the consideration of this term as, or as a component of, the numerator of a likelihood ratio. The likelihood ratio may be of the form:
where
The term f(C|G) may be or include, and/or be computed using, one or more probability distribution functions provided according to the seventh aspect of the invention.
The term f(C|G) may be used to calculate the term: f(C|S,Hp) and/or the term f(C|Ui,S,Hd).
The likelihood ratio of the form:
may be defined as:
where Ui is one of the supposed sources stated in Hd.
The likelihood ratio of the form:
may be defined as:
where the f terms are factors that are likelihoods.
The factor Pr(Ui|S) may be computed by using a subpopulation structure equation, such as the subpopulation structure equation described in D. J. Balding (2005), “Weight-of-evidence for forensic DNA profiles”, John Wiley & Sons Ltd and/or J. Buckleton, C. M. Triggs, S. J. Walsh (2005), “Forensic DNA evidence interpretation”, CRC Press.
The method of comparing may consider one or more of the following scenarios for the suspect and/or information observed in the analysis:
a) Suspect is heterozygous and crime profile has one peak;
b) Suspect is heterozygous and crime profile has two peaks;
c) Suspect is heterozygous and the crime profile contains no peaks;
d) Suspect is homozygous and crime profile contains one peak;
e) Suspect is homozygous and crime profile contains no peaks.
Where the suspect is heterozygous, stated as S={a,b}, and the test sample or first sample has one peak, stated as C={hb}, the likelihood ratio may be defined as:
Preferably the numerator in this likelihood ratio is defined as:
where ha is a peak height or peak area or peak weight smaller than a detection threshold and so is not observed in the results.
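One reading of this numerator, offered as an assumption rather than as the specification's own expression, is that the unobserved height ha is integrated out of the two-dimensional density fhet(ha,hb) over the range from zero to the detection threshold t. The sketch below uses a midpoint-rule approximation. The function names are hypothetical, and example_f_het is a stand-in density of independent exponential heights chosen only so that the result can be checked analytically.

```python
import math
from typing import Callable


def dropout_numerator(
    f_het: Callable[[float, float], float],
    h_b: float,
    threshold: float,
    steps: int = 1000,
) -> float:
    """Approximate the integral of f_het(h_a, h_b) for h_a in [0, threshold].

    Midpoint rule with `steps` equal sub-intervals.
    """
    width = threshold / steps
    return width * sum(
        f_het((i + 0.5) * width, h_b) for i in range(steps)
    )


def example_f_het(h_a: float, h_b: float, rate: float = 0.01) -> float:
    """Hypothetical pdf for illustration: independent exponential heights."""
    return rate * math.exp(-rate * h_a) * rate * math.exp(-rate * h_b)


# Observed peak of height 200 with a detection threshold of 50.
numerator = dropout_numerator(example_f_het, h_b=200.0, threshold=50.0)
```

In practice f_het would be a density estimated from experimental data rather than the stand-in used here.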
Preferably the denominator in this likelihood ratio is defined as:
f(C={hb}|Hd)=f(C={hb}|U={b,b},Hd)Pr(U={b,b}|S={a,b},Hd)+f(C={hb}|U={b,Q},Hd)Pr(U={b,Q}|S={a,b},Hd).
where Q is any allele identity other than a.
The factors Pr(U={b,b}|S={a,b},Hd) and Pr(U={b,Q}|S={a,b},Hd) may be computed using the population substructure calculations described in D. J. Balding, M. Greenhalgh, R. A. Nichols (1996), “Population genetics of STR loci in Caucasians”, 108:300-305.
The factor may consider:
Pr(C={hb}|U={b,b},Hd)=fhet(0.5hb,0.5hb)
where fhet is a two-dimensional pdf on pairs of heights originating from heterozygous donors.
The factor may consider the position as relating to or equating to the solution described below at the bottom of page 8.
The factor may consider the position as fhom and point to a description of solutions, for instance as:
Pr(C={hb}|U={b,b},Hd)=fhom(hb)
The factor may consider:
where hQ is smaller than the threshold t.
Where the suspect is heterozygous, stated as S={a,b}, and the first or test sample gives two peaks, stated as C={ha,hb}, the LR may be defined as:
The numerator in this LR may be defined by:
f(C={ha,hb}|S={a,b},Hp)=fhet(ha,hb)
The denominator is given by:
f(C={ha,hb}|S={a,b},Hd)=fhet(ha,hb)Pr(U={a,b}|S={a,b},Hd)
The LR may be defined as:
Where the suspect is homozygous, stated as S={b,b}, and the first or test sample provides a profile containing one allele, stated as C={hb}, the LR may be defined as:
The numerator may be given by:
f(C={hb}|S={b,b},Hp)=fhom(hb) which may be a density function estimated for homozygous contributors.
The denominator may be given by:
The term: f(C={hb}|U={b,b},S={b,b},Hd)=fhom(hb) is the same as the numerator.
The term:
can be obtained by estimating a pdf for the peak heights.
Where the donor is homozygous, stated as G={b,b}, and the test or first sample provides a profile containing one allele, stated as C={hb}, the likelihood for a homozygous donor may be given by:
f(C={hb}|G={b,b})=fhom(hb)
where fhom can be computed by several methods, for example by the method of rotation described elsewhere in this document and/or by the method:
where fhet(αhb,(1−α)hb) is a two-dimensional probability density function computed for a heterozygote and f(α) is a probability density function for α, where α is the proportion of the homozygote given by one of the alleles of the homozygote and 1−α is the proportion accounted by the other allele of the homozygote.
The function may be approximated by:
fhom(hb)=fhet(αhb,(1−α)hb)
for an α close to 0.5.
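The two routes to fhom described above can be sketched as follows, assuming that the fuller method integrates fhet(αhb,(1−α)hb) against f(α) over α between 0 and 1. The function names are hypothetical, and the densities used in the usage lines are stand-ins chosen for illustration, not estimates from data.

```python
import math
from typing import Callable


def f_hom_integrated(
    h_b: float,
    f_het: Callable[[float, float], float],
    f_alpha: Callable[[float], float],
    steps: int = 200,
) -> float:
    """Midpoint-rule estimate of the integral over alpha in [0, 1] of
    f_het(alpha*h_b, (1-alpha)*h_b) * f_alpha(alpha)."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        alpha = (i + 0.5) * width
        total += f_het(alpha * h_b, (1.0 - alpha) * h_b) * f_alpha(alpha)
    return total * width


def f_hom_approx(
    h_b: float,
    f_het: Callable[[float, float], float],
    alpha: float = 0.5,
) -> float:
    """Single-point approximation f_het(alpha*h_b, (1-alpha)*h_b)."""
    return f_het(alpha * h_b, (1.0 - alpha) * h_b)


def example_f_het(h_1: float, h_2: float, rate: float = 0.01) -> float:
    """Hypothetical pdf for illustration: independent exponential heights."""
    return rate * math.exp(-rate * h_1) * rate * math.exp(-rate * h_2)


# Uniform f(alpha) on [0, 1], used here only as a placeholder.
full = f_hom_integrated(200.0, example_f_het, lambda a: 1.0)
approx = f_hom_approx(200.0, example_f_het)
```

For this particular stand-in density the integrand does not depend on α, so the two values coincide. For a density estimated from data they would generally differ.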
The method of comparing may be used to gather information to assist further investigations or legal proceedings. The method of comparing may provide intelligence on a situation. The method of comparison may be of the likelihood of the information of the first or test sample result given the information of the second or another sample result. The method of comparison may provide a listing of possible other sample results, ideally ranked according to the likelihood. The method of comparison may seek to establish a link between a DNA profile from a crime scene sample and one or more DNA profiles stored in a database.
The method may include taking a crime scene DNA profile, C, and determining an ordered list of suspect profiles, where the first profile in the list is the genotype of the most likely donor. The method may propose a list of genotypes {G1,G2, . . . ,Gm} and then rank those genotypes according to the posterior probability of the genotype given the crime scene profile, C.
The method of comparison may consider a likelihood ratio or more preferably posterior probability. Preferably the likelihood ratio, more preferably posterior probability, is defined as:
where:
C is the first or test result set from a test sample, more particularly, the first result set taken from a sample recovered from a person or location linked with a crime, potentially expressed in terms of peak positions and/or heights;
Gi is the second or another result set, more particularly, the various members of an exhaustive list of potential donor genotypes; and
Pr(Gi) is a prior distribution for genotype Gi, preferably computed for a population, particularly the population from which the test sample under consideration comes. It can be set to be a uniform distribution or computed using the genotype probability formulae described in Balding (2005) and Buckleton et al. (2005), both referenced above.
The comparison may include consideration of the term: f(C|Gi). The method may include any of the features, options or possibilities proposed for f(C|G) applying to f(C|Gi).
According to a third aspect of the invention we provide a method of comparing a first, potentially test, sample result set with a second, potentially another, sample result set, the method including:
where f means this is an expression of likelihood, C is the information of the first result set and G is the information of the second result set, where Pr(χj|I(h)) is a probability distribution of DNA quantity, χj, given a quantitative measure of the one or more identities considered in the first result set and/or second result set, I(h).
Preferably Pr(χj|I(h)) is a probability distribution of DNA quantity with peak height and/or peak area and/or peak weight.
Preferably the factor f(C|G,χi) is computed by conditioning on DNA quantity.
The factor f(C|G,χi) may be an expression of the likelihood of observing the information of the first result set given the information of the second result set. The factor f(C|G,χi) may be an expression of the likelihood of observing the information from a sample taken from a crime scene given the information in a given genotype, particularly the genotype of a suspect and/or another person.
The term f(C|G,χi) may be an estimation. The term f(C|G,χi) may be derived from experimental data. The term f(C|G,χi) may be an estimation derived from experimentation data. The term f(C|G,χi) may be or include one or more probability distribution functions. The estimation of the term f(C|G,χi) may be provided by the consideration of dilution data, particularly data collected from heterozygous sources and/or, and preferably separately, from homozygous sources.
The estimation of f(C|G,χi) may be provided by grouping together control sample analysis data for the same quantity of DNA. A separate estimation of f(C|G,χi) may be provided for each quantity of DNA. The quantities of DNA for which the factor f(C|G,χi) is estimated may be spaced across a range, preferably evenly so.
The estimation of f(C|G,χi) may be provided by grouping control sample analysis data from one or more loci of interest together. A single estimation of f(C|G,χi) covering all the loci for which information is stored in a database and/or which are analysed using an analysis method and/or which are analysed using a multiplex amplification product may be provided. A separate estimation of f(C|G,χi) may be provided for each locus. The estimation of f(C|G,χi) may be generated using two-dimensional density estimations and/or two-dimensional probability distribution functions.
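The grouping and two-dimensional density estimation described above can be sketched as follows. The Gaussian product kernel, the bandwidth value and the record layout (quantity, first peak height, second peak height) are assumptions made for illustration. The specification does not prescribe a particular estimator here, and the sample values are invented.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Pair = Tuple[float, float]


def group_by_quantity(
    records: Iterable[Tuple[float, float, float]],
) -> Dict[float, List[Pair]]:
    """Group control-sample height pairs by the DNA quantity analysed."""
    groups: Dict[float, List[Pair]] = defaultdict(list)
    for quantity, h_1, h_2 in records:
        groups[quantity].append((h_1, h_2))
    return dict(groups)


def make_kde_2d(
    pairs: List[Pair], bandwidth: float
) -> Callable[[float, float], float]:
    """Return a 2D Gaussian kernel density estimate built from `pairs`."""
    norm = 1.0 / (len(pairs) * 2.0 * math.pi * bandwidth ** 2)

    def density(h_1: float, h_2: float) -> float:
        return norm * sum(
            math.exp(
                -((h_1 - a) ** 2 + (h_2 - b) ** 2) / (2.0 * bandwidth ** 2)
            )
            for a, b in pairs
        )

    return density


# Invented control data: (quantity in pg, height 1, height 2).
records = [(50, 100.0, 120.0), (50, 110.0, 100.0), (100, 300.0, 280.0)]
groups = group_by_quantity(records)
# One separate density estimate per DNA quantity.
densities = {q: make_kde_2d(p, bandwidth=20.0) for q, p in groups.items()}
```

A separate estimate per locus could be built in the same way by grouping on a (locus, quantity) key instead.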
The term f(C|G,χi) may be or include, and/or be computed using, one or more probability distribution functions provided according to the seventh aspect of the invention.
The method of comparing may be used to gather information to assist further investigations or legal proceedings. The method of comparing may provide intelligence on a situation. The method of comparison may be of the likelihood of the information of the first or test sample result given the information of the second or another sample result. The method of comparison may provide a listing of possible other sample results, ideally ranked according to the likelihood. The method of comparison may seek to establish a link between a DNA profile from a crime scene sample and one or more DNA profiles stored in a database.
The method may include taking a crime scene DNA profile, C, and determining an ordered list of suspect profiles, where the first profile in the list is the genotype of the most likely donor. The method may propose a list of genotypes {G1,G2, . . . ,Gm} and then rank those genotypes according to the posterior probability of the genotype given the crime scene profile, C.
The method of comparison may consider a likelihood ratio. Preferably the likelihood ratio is defined as:
where Pr(χj|I(h)) is a probability distribution of DNA quantity, χj, given a quantitative measure of the one or more identities considered in the first result set and/or second result set, I(h), more preferably, where Pr(χj|I(h)) is a probability distribution of DNA quantity given information on peak heights.
According to a fourth aspect of the invention we provide a method of comparing a first, potentially test, sample result set with a second, potentially another, sample result set, the method including:
Preferably Pr(χj|I(h)) is a probability distribution of DNA quantity with peak height and/or peak area and/or peak weight.
Preferably the probability distribution for Pr(χj|I(h)) relates to a list of discrete probabilities for χ1 to χj, where ΣiPr(χi)=1. Preferably the distribution is obtained by considering the observed distribution obtained from the analysis of the control samples.
The quantitative measure may be the mean value for all observed values across all loci. The quantitative measure may be the mean of all values for a single locus. The quantitative measure may be the mean value for all the observed values for all loci for a given quantity of DNA. The quantitative measure may be the mean value for all the observed values for a single locus for a given quantity of DNA.
The quantitative measure may be peak height and/or peak area and/or peak weight.
A distribution may be estimated. The distribution may be of the form f(
The calculation of Pr(X=χi|I(h)) for a locus may be performed in a sequential fashion based on an ordering L1,L2, . . . ,Ln of the loci. The calculation may begin by setting Pr(X=χi|I(h)) for L1 to a uniform distribution. Another assumed form for the distribution may be used. The calculation for the next locus, L2, of Pr(X=χi|I(h)) may be based on the assumption for I(h) from L1. The subsequent calculations for the subsequent loci may be based upon the assumption for I(h) for all the loci which precede it in the calculation, for instance Pr(X=χi|I(h)) for Ln is computed based on I(h) from L1, L2, . . . ,Ln−1.
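The sequential calculation can be sketched as a chain of Bayesian updates: the distribution used at the first locus is uniform, and the distribution used at each later locus is the one obtained after taking account of the peak information from every preceding locus. The update rule and the likelihood function passed in are assumptions made for illustration. The specification states only which loci inform each step. All names and values are hypothetical.

```python
from typing import Callable, List, Sequence


def sequential_quantity_distributions(
    quantities: Sequence[float],
    locus_heights: Sequence[float],
    likelihood: Callable[[float, float], float],
) -> List[List[float]]:
    """Return, for each locus in order, the distribution over DNA
    quantities that is based on I(h) from the preceding loci only.

    `likelihood(i_h, chi)` is a hypothetical density of the peak
    information i_h at one locus given DNA quantity chi.
    """
    # Locus L1 starts from a uniform distribution.
    current = [1.0 / len(quantities)] * len(quantities)
    per_locus = []
    for i_h in locus_heights:
        per_locus.append(list(current))
        # Fold this locus in so that the next locus can condition on it.
        weights = [
            p * likelihood(i_h, chi) for p, chi in zip(current, quantities)
        ]
        total = sum(weights)
        if total <= 0.0:
            raise ValueError("peak information has zero likelihood")
        current = [w / total for w in weights]
    return per_locus


# Two candidate quantities and a toy likelihood; illustrative only.
def toy(i_h: float, chi: float) -> float:
    return 0.8 if (i_h > 150) == (chi == 100) else 0.2


dists = sequential_quantity_distributions([50, 100], [200.0, 220.0], toy)
```

In this toy run the distribution used at the first locus is uniform, and the distribution used at the second locus places approximately 0.8 of the probability on the larger quantity.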
The term Pr(X=χi|I(h)) may be an estimation. The term Pr(X=χi|I(h)) may be derived from experimental data. The term Pr(X=χi|I(h)) may be an estimation derived from experimentation data. The term Pr(X=χi|I(h)) may be or include one or more probability distribution functions. The estimation of the term Pr(X=χi|I(h)) may be provided by the consideration of dilution data, particularly data collected from heterozygous sources and/or, and preferably separately, from homozygous sources.
The estimation of Pr(X=χi|I(h)) may be provided by grouping together control sample analysis data for the same quantity of DNA. A separate estimation of Pr(X=χi|I(h)) may be provided for each quantity of DNA. The quantities of DNA for which the factor Pr(X=χi|I(h)) is estimated may be spaced across a range, preferably evenly so.
The estimation of Pr(X=χi|I(h)) may be provided by grouping control sample analysis data from one or more loci of interest together. A single estimation of Pr(X=χi|I(h)) covering all the loci for which information is stored in a database and/or which are analysed using an analysis method and/or which are analysed using a multiplex amplification product may be provided. A separate estimation of Pr(X=χi|I(h)) may be provided for each locus. The estimation of Pr(X=χi|I(h)) may be generated using two-dimensional density estimations and/or two-dimensional probability distribution functions.
The term Pr(X=χi|I(h)) may be or include, and/or be computed using, one or more probability distribution functions provided according to the seventh aspect of the invention.
According to a fifth aspect of the invention we provide a method of comparing a first, potentially test, sample result set with a second, potentially another, sample result set, the method including:
The factor f(C|G1,G2) may be used to calculate the factors f(C|S,Ui,Hp) and f(C|Uj,Uk,S,Hd). The factors f(C|S,Ui,Hp) and f(C|Uj,Uk,S,Hd) may be the expression of the likelihood of the crime profile C given two supposed sources or donors; potentially according to a hypothesis of the defence, Hd, and/or prosecution, Hp. G1 and G2 may be the genotypes of the supposed donors or sources.
The method may involve the consideration, particularly in an evidential context, particularly for a mixed source sample arising from two contributors, of the prosecution and defence hypotheses. The hypotheses may be: the prosecution hypothesis, Hp, that the suspect and someone else are the donors to the first result set, potentially a stain at the scene of crime; and/or the defence hypothesis, Hd, that two unknown people are the donors to the first result set, potentially a stain at the scene of crime. The hypotheses may be: the prosecution hypothesis, Hp, that the suspect and the victim are the donors to the first result set, and/or the defence hypothesis, Hd, that the victim and unknown person are the donors to the first result set. The hypotheses may be: the prosecution hypothesis, Hp, that the two suspects are the donors to the first result set, and/or the defence hypothesis, Hd, that the two unknown persons are the donors to the first result set. One or both hypotheses may involve statements of relatedness amongst the possible contributors.
The LR may be given by the formula:
where
The LR formula can be written as:
where f(C|S,Ui,Hp) is a density function of C given S and Ui and f(C|Uj,Uk,S,Hd) is a density function of C given Uj, Uk and S.
The factors Pr(Ui|S,Hp) and Pr(Uj,Uk|S,Hd) may be computed using the formulae described in Balding (2005) and Buckleton (2005), referenced above, which may include considerations of relatedness between the unknown contributors and the suspect.
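A form of the two-contributor likelihood ratio that is consistent with the factors named above is set out below as a sketch. The summation indices, and the assumption that each sum runs over all unknown contributors compatible with the respective hypothesis, are not taken from the specification.

```latex
\mathrm{LR}
  = \frac{f(C \mid S, H_p)}{f(C \mid S, H_d)}
  = \frac{\sum_{i} f(C \mid S, U_i, H_p)\,\Pr(U_i \mid S, H_p)}
         {\sum_{j}\sum_{k} f(C \mid U_j, U_k, S, H_d)\,\Pr(U_j, U_k \mid S, H_d)}
```

Under this reading the numerator sums over the single unknown contributor who accompanies the suspect under Hp, and the denominator sums over pairs of unknown contributors under Hd.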
The factor f(C|G1,G2) may be an expression of the likelihood of observing the information of the first result set given the information of the second result set. The factor f(C|G1,G2) may be an expression of the likelihood of observing the information from a sample taken from a crime scene given the information in a given genotype, particularly the genotype of a suspect and/or another person.
The term f(C|G1,G2) may be an estimation. The term f(C|G1,G2) may be derived from experimental data. The term f(C|G1,G2) may be an estimation derived from experimentation data. The term f(C|G1,G2) may be or include one or more probability distribution functions. The estimation of the term f(C|G1,G2) may be provided by the consideration of dilution data, particularly data collected from heterozygous sources and/or, and preferably separately, from homozygous sources.
The estimation of f(C|G1,G2) may be provided by grouping together control sample analysis data for the same quantity of DNA. A separate estimation of f(C|G1,G2) may be provided for each quantity of DNA. The quantities of DNA for which the factor f(C|G1,G2) is estimated may be spaced across a range, preferably evenly so.
The estimation of f(C|G1,G2) may be provided by grouping control sample analysis data from one or more loci of interest together. A single estimation of f(C|G1,G2) covering all the loci for which information is stored in a database and/or which are analysed using an analysis method and/or which are analysed using a multiplex amplification product may be provided. A separate estimation of f(C|G1,G2) may be provided for each locus. The estimation of f(C|G1,G2) may be generated using two-dimensional density estimations and/or two-dimensional probability distribution functions.
The term f(C|G1,G2) may be or include, and/or be computed using, one or more probability distribution functions provided according to the seventh aspect of the invention.
The method may involve a consideration of one or more of the following scenarios, particularly in an evidential context:
In both scenarios, where we are considering likelihood ratios for a locus, in respect of a sample from two sources, there are two basic hypotheses which may be considered:
Where there is a heterozygous suspect and victim with no overlapping alleles and three peaks in the crime profile, then Hp may be V+S, and Hd may be V+U, C={ha,hb,hc}, V={a,b} and S={c,d}. The LR may be given by:
The numerator in this function may be given by:
f(C={ha,hb,hc}|V={a,b},S={c,d},Hp)=fhet(ha,hb)×fhet(hc,hd)
The unknown contributors for the denominators can be {a,c}, {b,c}, {c,Q}. The denominator may be given by:
The factor for U={a,c} may be computed with the formula:
f(C={ha,hb,hc}|V={a,b},U={a,c},Hd)=fhet(mxha,hb)×fhet((1−mx)ha,hc).
The factor for U={b,c} may be computed with the formula:
f(C={ha,hb,hc}|V={a,b},U={b,c},Hd)=fhet(ha,mxhb)×fhet((1−mx)hb,hc).
The factor for U={c,Q} may be computed with the formula:
f(C={ha,hb,hc}|V={a,b},U={c,Q},Hd)=fhet(ha,hb)×fhet(hc,hQ).
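The three denominator factors can be sketched as below. Here mx is taken to be the proportion of a shared peak attributed to the victim, with the remainder attributed to the unknown contributor. That reading, the function names, and the explicit passing of the unobserved height hQ as an argument are assumptions made for illustration. The shared-allele split for U={b,c} follows the same pattern as the U={a,c} formula.

```python
from typing import Callable

Density = Callable[[float, float], float]


def factor_unknown_ac(
    f_het: Density, h_a: float, h_b: float, h_c: float, mx: float
) -> float:
    """V={a,b}, U={a,c}: peak a is shared and split by mx."""
    return f_het(mx * h_a, h_b) * f_het((1.0 - mx) * h_a, h_c)


def factor_unknown_bc(
    f_het: Density, h_a: float, h_b: float, h_c: float, mx: float
) -> float:
    """V={a,b}, U={b,c}: peak b is shared and split by mx."""
    return f_het(h_a, mx * h_b) * f_het((1.0 - mx) * h_b, h_c)


def factor_unknown_cq(
    f_het: Density, h_a: float, h_b: float, h_c: float, h_q: float
) -> float:
    """V={a,b}, U={c,Q}: no shared peak; h_q is the unobserved height."""
    return f_het(h_a, h_b) * f_het(h_c, h_q)


def toy_density(x: float, y: float) -> float:
    """Toy stand-in for f_het so the arithmetic is easy to follow.
    It is not a probability density."""
    return x + y


ac = factor_unknown_ac(toy_density, 100.0, 50.0, 40.0, 0.5)
bc = factor_unknown_bc(toy_density, 100.0, 50.0, 40.0, 0.5)
cq = factor_unknown_cq(toy_density, 100.0, 50.0, 40.0, 10.0)
```

With the toy function the three values are (50+50)×(50+40)=9000, (100+25)×(25+40)=8125 and (100+50)×(40+10)=7500. A density estimated from control data would be substituted for toy_density in any real use.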
Where the heterozygous suspect and victim are with one overlapping allele and three peaks in the crime profile, then Hp may be V+S, and Hd may be V+U, C={ha,hb,hc}, V={a,b} and S={b,c}. The likelihood ratio may be given by:
The numerator may be given by:
Pr(C={ha,hb,hc}|V={a,b},S={b,c},Hp)=fhet(ha,mxhb)fhet((1−mx)hb,hc).
The denominator may consider the following potential unknown contributors:
U ∈ {{a,c},{b,c},{c,c},{c,Q}}.
The function may be computed, where U={a,c}, using:
Pr(C={ha,hb,hc}|V={a,b},S={b,c},U={a,c},Hd)=fhet(mxha,hb)fhet((1−mx)ha,hc).
The function may be computed, where U={b,c} using:
f(C={ha,hb,hc}|V={a,b},U={b,c},Hd)=fhet(ha,mxhb)fhet((1−mx)hb,hc).
The function may be computed, where U={c,c} using:
f(C={ha,hb,hc}|V={a,b},U={c,c},Hd)=fhet(ha,hb)fhet(0.5hc,0.5hc)
The function may be computed, where U={c,Q}, using:
f(C={ha,hb,hc}|V={a,b},U={c,Q},Hd)=fhet(ha,hb)fhet(hc,hQ).
The method of comparing may be used to gather information to assist further investigations or legal proceedings. The method of comparing may provide intelligence on a situation. The method of comparison may be of the likelihood of the information of the first or test sample result given the information of the second or another sample result. The method of comparison may provide a listing of possible other sample results, ideally ranked according to the likelihood. The method of comparison may seek to establish a link between a DNA profile from a crime scene sample and one or more DNA profiles stored in a database.
The method may include taking a crime scene DNA profile, C, and determining an ordered list of suspect profiles, where the first profile in the list is the genotype of the most likely donor. The method may propose a list of genotypes {G1,G2, . . . ,Gm} and then rank those genotypes according to the posterior probability of the genotype given the crime scene profile, C.
The method of comparison may consider a posterior probability. Preferably posterior probability is defined as:
The method may propose an ordered list of pairs of genotypes G1 and G2 per locus, preferably so that the first pair in the list is the most likely pair of donors of the crime stain.
The method may start with a first set of results, for instance, crime stain profile C. The method may then generate a list, potentially an exhaustive list {G1,i,G2,i}, of pairs of potential donors. Preferably for each of these pairs, a probability distribution for the genotypes is calculated using the formula:
where Pr(G1,i,G2,i) is a prior distribution for the pair of genotypes inside the brackets that can be set to be a uniform distribution or computed using the probabilities of genotypes described in Balding (2005) and Buckleton et al. (2005), referenced above.
According to a sixth aspect of the invention we provide a method of comparing a first, potentially test, sample result set with a second, potentially another, sample result set, the method including:
The factor f(C|G1,G2,χi) may be used to calculate the factors f(C|S,Ui,Hp) and f(C|Uj,Uk,S,Hd). The factors f(C|S,Ui,Hp) and f(C|Uj,Uk,S,Hd) may be the expression of the likelihood of the crime profile C given two supposed sources or donors; potentially according to a hypothesis of the defence, Hd, and/or prosecution, Hp. G1 and G2 may be the genotypes of the supposed donors or sources.
The method may involve the consideration, particularly in an evidential context, particularly for a mixed source sample arising from two contributors, of the prosecution and defence hypotheses. The hypotheses may be: the prosecution hypothesis, Hp, that the suspect and someone else are the donors to the first result set, potentially a stain at the scene of crime; and/or the defence hypothesis, Hd, that two unknown people are the donors to the first result set, potentially a stain at the scene of crime.
The LR may be given by the formula:
where
The LR formula can be written as:
where f(C|S,Ui,Hp) is a density function of C given S and Ui and f(C|Uj,Uk,S,Hd) is a density function of C given Uj, Uk and S.
The factors Pr(Ui|S,Hp) and Pr(Uj,Uk|S,Hd) may be computed by the methods described in Balding (2005) and Buckleton (2005), referenced above, which may include considerations of relatedness.
The factor f(C|G1,G2,χi) may be an expression of the likelihood of observing the information of the first result set given the information of the second result set. The factor f(C|G1,G2,χi) may be an expression of the likelihood of observing the information from a sample taken from a crime scene given the information in a given genotype, particularly the genotype of a suspect and/or another person.
The term f(C|G1,G2,χi) may be an estimation. The term f(C|G1,G2,χi) may be derived from experimental data. The term f(C|G1,G2,χi) may be an estimation derived from experimentation data. The term f(C|G1,G2,χi) may be or include one or more probability distribution functions. The estimation of the term f(C|G1,G2,χi) may be provided by the consideration of dilution data, particularly data collected from heterozygous sources and/or, and preferably separately, from homozygous sources.
The estimation of f(C|G1,G2,χi) may be provided by grouping together control sample analysis data for the same quantity of DNA. A separate estimation of f(C|G1,G2,χi) may be provided for each quantity of DNA. The quantities of DNA for which the factor f(C|G1,G2,χi) is estimated may be spaced across a range, preferably evenly so.
The estimation of f(C|G1,G2,χi) may be provided by grouping control sample analysis data from one or more loci of interest together. A single estimation of f(C|G1,G2,χi) covering all the loci for which information is stored in a database and/or which are analysed using an analysis method and/or which are analysed using a multiplex amplification product may be provided. A separate estimation of f(C|G1,G2,χi) may be provided for each locus. The estimation of f(C|G1,G2,χi) may be generated using two-dimensional density estimations and/or two-dimensional probability distribution functions.
The term f(C|G1,G2,χi) may be or include, and/or be computed using, one or more probability distribution functions provided according to the seventh aspect of the invention.
According to a seventh aspect of the invention we provide a method for generating one or more probability distribution functions relating to the detected level for a variable characteristic of DNA, the method including:
a) providing a control sample of DNA;
b) analysing the control sample to establish the detected level for the at least one variable characteristic of DNA;
c) repeating steps a) and b) for a plurality of control samples to form a data set of detected levels;
d) defining a probability distribution function for at least a part of the data set of detected levels.
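Steps a) to d) above can be sketched, under stated assumptions, as follows. The sketch assumes that the detected levels are peak heights in rfu and that the probability distribution function is a simple histogram density; the document does not prescribe this estimator, and the function name, bin width and data values are hypothetical.

```python
def histogram_pdf(levels, bin_width):
    """Return a piecewise-constant density estimated from detected levels."""
    if not levels or bin_width <= 0:
        raise ValueError("need at least one level and a positive bin width")
    n = len(levels)
    counts = {}
    for level in levels:
        bin_index = int(level // bin_width)
        counts[bin_index] = counts.get(bin_index, 0) + 1

    def pdf(x):
        # Density is the bin count divided by (sample size * bin width).
        return counts.get(int(x // bin_width), 0) / (n * bin_width)

    return pdf


# Step c): a data set of detected levels from repeated control samples
# (hypothetical values, in rfu).
detected_levels = [35.0, 48.0, 52.0, 61.0, 75.0, 90.0, 110.0, 120.0]

# Step d): define a pdf for the data set.
pdf = histogram_pdf(detected_levels, bin_width=50.0)
```

A histogram is used here only because it is easy to verify by hand; a smoother estimator could be substituted without changing the overall flow.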
The method may be used to provide a probability distribution function for the detected level for control samples from heterozygous persons.
The method may be used to provide a probability distribution function for the detected level for control samples from homozygous persons.
Preferably the method is used to provide a probability distribution function for the detected level for control samples from heterozygous persons and to provide a probability distribution function for the detected level for control samples from homozygous persons. Preferably the performance of the method to provide the probability distribution function for the heterozygous persons is a separate performance to that used to provide the probability distribution function for the homozygous persons.
The detected level for the variable characteristic may be a peak height. The detected level may be a peak area. The detected level for the variable characteristic may be a peak weight. A peak weight may be defined as the molecular weight of the allele multiplied by the peak height or peak area for that allele.
The characteristic may be the allele identity or identities at a locus known to have variable short tandem repeat alleles.
The probability distribution function may be a 2-dimensional probability distribution function. The probability distribution function may be formed as a 2-dimensional probability distribution function or may be converted thereto, for instance from a one-dimensional probability distribution function or from a three-dimensional probability distribution function.
Steps a) and b) may be performed on control samples from one person, preferably a plurality of different persons and ideally at least four different persons.
Steps a) and b) may be performed on at least 20 control samples, preferably at least 100 control samples, more preferably at least 200 control samples and ideally on at least 500 control samples. These numbers of control samples may be the total number or the number for each different person from whom control samples are considered.
Steps a) and b) may be performed on one example of each control sample. Steps a) and b) may be performed on a plurality of examples of each control sample, for instance at least 20 examples of each control sample. These numbers of control samples may be the total number or the number for each different person from whom control samples are considered.
The persons who are the source of the control samples may be selected to be heterozygous with respect to the variable characteristic, particularly to generate a probability distribution function for heterozygotes.
The persons who are the source of the control samples may be selected to be homozygous with respect to the variable characteristic, particularly to generate a probability distribution function for homozygotes.
The method may be applied to one or more control samples which include different quantities of DNA. The quantities of DNA may be provided across a range of quantities, for instance at regular intervals across the range. The range may have a lower limit of 10 pg and more preferably 50 pg. The range may have an upper limit of 1000 pg and more preferably of 500 pg. The interval may be every 10 to 50 pg or potentially every 25 pg.
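As a minimal sketch of the quantity range described above, the following assumes a lower limit of 50 pg, an upper limit of 500 pg and an interval of 25 pg, and groups control sample results by DNA quantity so that a separate probability distribution function could be estimated for each quantity. The function names and the record values are hypothetical.

```python
from collections import defaultdict


def quantity_grid(lower_pg=50, upper_pg=500, step_pg=25):
    """Evenly spaced DNA quantities (in pg) across the range, inclusive."""
    if step_pg <= 0 or upper_pg < lower_pg:
        raise ValueError("invalid range or step")
    return list(range(lower_pg, upper_pg + 1, step_pg))


def group_by_quantity(records):
    """Group (quantity_pg, detected_level) records by DNA quantity."""
    groups = defaultdict(list)
    for quantity_pg, level in records:
        groups[quantity_pg].append(level)
    return dict(groups)


grid = quantity_grid()
# Hypothetical control sample results: (quantity in pg, peak height in rfu).
records = [(50, 41.0), (50, 38.0), (75, 66.0), (100, 92.0), (100, 85.0)]
groups = group_by_quantity(records)
```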
The analysis of the detected level for the variable characteristic is preferably provided in respect of more than one variable characteristic. The variable characteristic(s) are preferably the allele identity or allele identities present at a locus. Preferably the control samples are considered in respect of one or more loci, preferably at least 8 loci.
The probability distribution function, pdf, may be formed of one or more probability distribution functions. A probability distribution function may be provided for a group of control samples, with the control samples being divided into more than one group of control samples. The probability distribution function may be formed of four probability distribution functions, particularly in the context of the consideration of test samples from two different people.
The detected levels for the characteristic may be divided into one or more groups. One group may be those control samples whose detected level is at and/or above a threshold value. One group may be those control samples whose detected level is at and/or below a threshold value.
Where the control samples are from homozygous persons, the detected levels are preferably divided into two groups, most preferably relative to a threshold. The division of the detected levels into two groups may provide a 1-dimensional probability distribution function. It is preferred that the 1-dimensional pdf be converted to a 2-dimensional pdf.
The method used to convert a 1-dimensional pdf to a 2-dimensional pdf may include treating the pdf for a homozygous sample with height h1 as being defined by the pdf for a heterozygous sample with heights 0.5h1 and 0.5h1.
Alternatively, the 1-dimensional pdf may be converted into a 2-dimensional pdf by mathematically rotating the distribution through an angle, particularly 90°. The method may include the use of a 1-dimensional pdf estimated by two components. One of the components may be a uniform distribution for the height interval. One of the components may be a probability distribution that takes positive values within that height interval range, for example an exponential distribution. The one-dimensional pdf may be defined by the formula,
where
p0 is the proportion of heights in the range;
p1 is the proportion of heights above the range;
t is the detection threshold, for instance 30 rfu's.
The surface of the distribution obtained by rotation may be normalised so that the volume under the surface is 1.
The calculation of a pdf value for homozygote peak height h for a likelihood ratio calculation may be given by:
(1/V)×fH. (4)
where V is the volume under the surface obtained by rotation of the one-dimensional pdf.
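The rotation and normalisation can be illustrated with the following sketch, which rests on several assumptions that are not confirmed by this document: that the one-dimensional pdf is uniform on [0,t) with total mass p0 and a shifted exponential with rate lam on [t,∞) with total mass p1; and that rotation through 90° sweeps the curve about the vertical axis over the positive quadrant, so that V=(π/2)×∫r×fH(r)dr. All names and parameter values are hypothetical.

```python
import math


def f_h(h, p0, p1, t, lam):
    """Assumed 1-D pdf: uniform below the threshold t, exponential above it."""
    if h < 0:
        return 0.0
    if h < t:
        return p0 / t
    return p1 * lam * math.exp(-lam * (h - t))


def rotation_volume(p0, p1, t, lam):
    """Closed-form volume under the surface swept by f_h over a quarter turn.

    V = (pi / 2) * integral_0^inf r * f_h(r) dr for the assumed f_h.
    """
    uniform_part = p0 * t / 2.0          # integral of r * p0 / t over [0, t)
    exponential_part = p1 * (t + 1.0 / lam)  # integral of r * tail over [t, inf)
    return (math.pi / 2.0) * (uniform_part + exponential_part)


def homozygote_pdf_value(h, p0, p1, t, lam):
    """Normalised value (1/V) * f_H(h) for a homozygote peak height h."""
    return f_h(h, p0, p1, t, lam) / rotation_volume(p0, p1, t, lam)


# Hypothetical parameters: threshold 30 rfu, 10% of heights below it.
value = homozygote_pdf_value(120.0, p0=0.1, p1=0.9, t=30.0, lam=0.01)
```

Dividing by V is what makes the rotated surface integrate to 1 over the positive quadrant under these assumptions.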
One group may be those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic. One group may be those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic. One group may be those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic. One group may be those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic.
Where the control samples are from heterozygous persons, the detected levels may be divided into four groups, particularly where the test sample to be considered is from two sources.
The threshold may be a detection level at which detection of the characteristic relative to the detection of noise signals is not possible or is impaired. The threshold may be between 10 relative fluorescence units (rfu's) and 70 rfu's, more preferably less than 60 rfu's, still more preferably 50 rfu's or less or even less than 40 rfu's.
The threshold may be different for different variable characteristics at a locus, but is preferably the same. The threshold may be different or the same for different loci.
Where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for each variable may be the values contributing to the probability distribution function defined.
Where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for the first characteristic and a value of zero for the second characteristic may be the values contributing to the probability distribution function.
Where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for the second characteristic and a value of zero for the first characteristic may be the values contributing to the probability distribution function.
Where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, a value of zero for the first characteristic and a value of zero for the second characteristic may be the values contributing to the probability distribution function.
The probability distribution function, particularly for a heterozygous source, may be provided by four probability functions.
The probability distribution function may be, or more preferably include a probability distribution function, that function relating to, or being:
where p1,1 is the proportion of data where h1<t and h2<t where t is the detection threshold, particularly where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for each variable may be the values contributing to the probability distribution function defined.
The probability distribution function may be, or more preferably include a probability distribution function, that function relating to, or being:
p1,2×g1,2(h1,h2)
where p1,2 is the proportion of points where h1≧t and h2<t, g1,2 is a 2-dimensional probability density function for the region where h1≧t and h2<t, h1 is the height of the low-molecular allele, h2 is the height of the high-molecular allele, particularly where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for the first characteristic and a value of zero for the second characteristic may be the values contributing to the probability distribution function. The probability distribution may be, or more preferably include a probability distribution function, that function relating to, or being:
g1,2(h1,h2)=g1,2(h1|h2)g1,2(h2)
where g1,2(h1|h2) is estimated from experimental data where h1≧t and h2<t, and:
The probability distribution function may be, or include, particularly for this group, an exponential distribution.
The probability distribution function may be, or more preferably include a probability distribution function, that function relating to, or being:
p2,1×g2,1(h1,h2)
where p2,1 is the proportion of pairs of heights in the region where h1<t and h2≧t, g2,1 is a 2-dimensional probability density function for the region where h1<t and h2≧t, h1 is the height of the low-molecular allele, h2 is the height of the high-molecular allele particularly where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for the second characteristic and a value of zero for the first characteristic may be the values contributing to the probability distribution function. The probability distribution may be, or more preferably include a probability distribution function, that function relating to, or being:
g2,1(h1,h2)=g2,1(h2|h1)g2,1(h1) (A.5)
where g2,1(h2|h1) is estimated from experimental data where h1<t and h2≧t and:
The probability distribution function may be, or include, particularly for this group, an exponential distribution.
The probability distribution function may be, or more preferably include a probability distribution function, that function relating to, or being:
p2,2×g2,2(h1,h2)
where p2,2 is the proportion of height pairs (h1,h2) that fall in the region where h1≧t and h2≧t, g2,2 is a 2-dimensional probability density function for the region where h1≧t and h2≧t, h1 is the height of the low-molecular-weight allele, h2 is the height of the high-molecular-weight allele, particularly where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, a value of zero for the first characteristic and a value of zero for the second characteristic may be the values contributing to the probability distribution function.
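The four proportions p1,1, p1,2, p2,1 and p2,2 defined above can be estimated from control data by assigning each pair of heights (h1,h2) to its region relative to the threshold t. The following is a minimal sketch; the function names and the data values are hypothetical, and the region labels follow the inequalities given in the definitions above.

```python
def region(h1, h2, t):
    """Return the (i, j) label of the region containing the pair (h1, h2)."""
    if h1 < t and h2 < t:
        return (1, 1)
    if h1 >= t and h2 < t:
        return (1, 2)
    if h1 < t and h2 >= t:
        return (2, 1)
    return (2, 2)  # h1 >= t and h2 >= t


def region_proportions(pairs, t):
    """Proportion of height pairs falling in each of the four regions."""
    if not pairs:
        raise ValueError("at least one pair of heights is required")
    counts = {(1, 1): 0, (1, 2): 0, (2, 1): 0, (2, 2): 0}
    for h1, h2 in pairs:
        counts[region(h1, h2, t)] += 1
    return {label: count / len(pairs) for label, count in counts.items()}


# Hypothetical control data (rfu) with a threshold of 30 rfu.
pairs = [(10.0, 5.0), (120.0, 12.0), (20.0, 90.0), (150.0, 140.0), (200.0, 180.0)]
proportions = region_proportions(pairs, t=30.0)
```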
One or more of the probability distribution functions may be a uniform distribution. One or more of the distribution functions may be a triangular distribution. One or more of the distribution functions may be an exponential distribution.
The probability distribution function, particularly for a heterozygous source, may be defined as
where, k1,2 is an exponential probability density function, k2,1 is an exponential probability density function, l1,2 and l2,1 are parameters of the exponential distribution, and preferably where:
is the pdf of an exponential distribution with parameter l1,2, where x is a variable representing h1−t, and preferably where:
and k2,2 is preferably given by a mixture of two-dimensional normal distributions:
where pi is the mixing proportion and n is a 2-dimensional probability density function of a normally distributed random variable, preferably where:
and preferably:
where correlation coefficient ρ=0.
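The two kinds of component named above, an exponential pdf in x=h1−t and a mixture of two-dimensional normal distributions with correlation coefficient ρ=0, can be sketched as follows. The standard textbook forms of these densities are assumed, since the formulas themselves are not reproduced here; the function names and all parameter values are hypothetical.

```python
import math


def exponential_pdf(x, rate):
    """Exponential pdf rate * exp(-rate * x) for x >= 0, where x = h - t."""
    return rate * math.exp(-rate * x) if x >= 0 else 0.0


def normal_pdf(x, mean, sd):
    """One-dimensional normal pdf."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def bivariate_normal_pdf_uncorrelated(h1, h2, mean1, mean2, sd1, sd2):
    """Two-dimensional normal pdf with rho = 0: a product of the marginals."""
    return normal_pdf(h1, mean1, sd1) * normal_pdf(h2, mean2, sd2)


def mixture_pdf(h1, h2, components):
    """Mixture of uncorrelated 2-D normals.

    components: list of (mixing_proportion, mean1, mean2, sd1, sd2).
    """
    return sum(
        p * bivariate_normal_pdf_uncorrelated(h1, h2, m1, m2, s1, s2)
        for p, m1, m2, s1, s2 in components
    )


# Hypothetical two-component mixture for the region where both heights >= t.
components = [(0.6, 100.0, 100.0, 20.0, 20.0), (0.4, 300.0, 300.0, 50.0, 50.0)]
density = mixture_pdf(100.0, 100.0, components)
```

Setting ρ=0 is what allows each two-dimensional component to be written as a product of two one-dimensional normal pdf's.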
Where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the probability distribution function may be fitted to the detected levels observed for the control samples.
Where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the probability distribution function may be provided by fitting a 2 dimensional planar distribution to the group and/or a distribution function whose shape represents a transition from values at the junction with the adjoining groups and/or whose shape provides a probability value of zero at zero value for detected level for both the first and second characteristic.
Where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the probability distribution function may be provided by fitting a 2 dimensional planar distribution to the group and/or a distribution function whose shape represents a transition from values at the junction with the adjoining groups.
Where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the probability distribution function may be provided by fitting a 2 dimensional planar distribution to the group and/or a distribution function whose shape represents a transition from values at the junction with the adjoining groups.
The probability distribution function, particularly for a heterozygous source, may be provided by estimating a two dimensional pdf for a pair of heights h1 and h2, for instance in the space defined by mean heights, m, and heterozygote imbalance, r. The method may include each pair of heights being transformed by
Given a pdf fM,R, a pdf in the space of pairs of heights may be obtained with the formula:
where the first factor is the Jacobian of the transformation (see G. Casella & R. L. Berger (1990). Statistical Inference. Wadsworth & Brooks/Cole Advanced Books Software, Pacific Grove, Calif., USA, page 148) and the second factor can be estimated from experimental data. The estimation may be provided by obtaining the conditional distribution of heterozygous imbalance given mean height:
The method may use this approach, and particularly this function, to provide one pdf fR|M for all four regions, for example as shown in
The method may provide that the conditional pdf fR|M can be obtained using data from region (1,2) and extrapolated to a pdf fR|M for all regions. The method may provide that a simulation is provided for the material which is missing in each of these three zones. The method may include a joint pdf fln(M),ln(R) for variables ln(M) and ln(R) being obtained first. The method may include a conditional pdf fR|M being calculated from fln(M),ln(R). The method may include correcting the bias introduced into pdf fR|M by the threshold, for instance 30 rfu. The method may provide that the bias is removed by replacing each fR|M with a Log normal pdf and correcting in the space of parameters mean, μ, and variance, σ, of the Log normal pdf's.
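The change of variables between the space of pairs of heights and the space of mean height and heterozygote imbalance can be sketched as below. The sketch assumes the transformation m=(h1+h2)/2 and r=h1/h2, which is one conventional choice and is not confirmed by this document; under that assumption the Jacobian factor is (h1+h2)/(2×h2²). The function names are hypothetical, and f_mr stands for whatever pdf fM,R has been estimated.

```python
def to_mean_and_imbalance(h1, h2):
    """Assumed transformation: m = (h1 + h2) / 2 and r = h1 / h2."""
    return (h1 + h2) / 2.0, h1 / h2


def jacobian(h1, h2):
    """Absolute Jacobian determinant of (h1, h2) -> (m, r) as assumed above."""
    return (h1 + h2) / (2.0 * h2 * h2)


def pdf_heights(h1, h2, f_mr):
    """pdf in the space of heights: Jacobian factor times f_mr(m, r)."""
    m, r = to_mean_and_imbalance(h1, h2)
    return jacobian(h1, h2) * f_mr(m, r)


# Placeholder for an estimated pdf f_{M,R}; a constant is used purely to
# show the call pattern and is not a real density.
value = pdf_heights(100.0, 50.0, lambda m, r: 2.0)
```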
The method may include estimating the two-dimensional pdf fln(M),ln(R) for variables ln(M) and ln(R) using the EM-algorithm, where each Gaussian has zero correlation; A. Dempster, N. Laird, and D. Rubin (1977). “Maximum likelihood from incomplete data via the EM algorithm”. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977; G. Mclachlan, D. Peel (2000). Finite mixture models. John Wiley & Sons, Inc.
The method may include a conditional distribution of R given M being calculated from the estimated fln(M),ln(R) using the formula:
The method may include the pdf fln(M) being a mixture of one-dimensional Gaussian distributions with the same mixing proportions as fln(M),ln(R) and means and variances given by the corresponding means and variances. The method may include several conditional pdf's being computed. The method may include providing Lognormal distributions to simulate the conditional pdf for R given M. The method may include the pdf of the lognormal distribution being given by:
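For reference, the standard form of a Lognormal pdf can be sketched as follows. This assumes the usual parameterisation, in which μ and σ are the mean and standard deviation of the logarithm of the variable; whether σ here denotes the standard deviation or the variance in the original formula is not clear from the text. The function name and the value of σ are hypothetical; the value μ=0.1069 is borrowed from the figure descriptions only as an example input.

```python
import math


def lognormal_pdf(x, mu, sigma):
    """Standard Lognormal pdf with log-scale mean mu and log-scale sd sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))


# Example evaluation for a heterozygote imbalance r = 1.2.
density = lognormal_pdf(1.2, mu=0.1069, sigma=0.3)
```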
The method may include correcting the bias in the space of parameters (μ,σ) of the Log normal pdf's. The method may include a number of iterations of the values for μ and/or σ. In particular, the method may include two or more iterations of the value for μ and the estimation procedure for σ.
The pdf for a value of M may be obtained by extrapolating the trend defined by two or more values of M to lower values of M. One or more, preferably several, polynomials may be fitted to values of σ in the regions of the higher M values, for instance from 80 to 200 rfu. The trend may be a line, spline or straight line.
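The extrapolation of a trend in σ from higher values of M to lower values of M can be illustrated with a straight-line fit, which is only one of the options mentioned above. The sketch uses ordinary least squares; the function names and the values of σ are hypothetical and chosen so that the fit can be checked by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares straight line; returns (slope, intercept)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two matching points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(x, slope, intercept):
    """Evaluate the fitted line at x, possibly outside the fitted range."""
    return slope * x + intercept


# Hypothetical sigma estimates for mean heights from 80 to 200 rfu.
mean_heights = [80.0, 110.0, 140.0, 170.0, 200.0]
sigmas = [0.40, 0.37, 0.34, 0.31, 0.28]
slope, intercept = fit_line(mean_heights, sigmas)
sigma_at_40 = extrapolate(40.0, slope, intercept)
```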
The method may include obtaining a family of Lognormal distributions defined by:
The method may include defining another component in the proposed two-dimensional distribution for peak height as pdf's for mean in each of the four regions.
For region (0,0) a uniform distribution may be defined. The pdf for mean height may be:
where p(0,0) is the proportion of pairs of heights that fall in region (0,0) and t is the threshold and can be 30 rfu's or can be replaced by another rfu value as desired.
For region (1,0), the method may include obtaining a distribution for peak height (h1,h2) in this region. The method may then calculate a distribution for mean height. The method may include extracting a pdf for mean height, using the extrapolated distribution, and transforming back to the space of (h1,h2). The method may include defining a joint distribution in this region by assuming that the variables h1 and h2 are independent. A uniform distribution may be assumed for (h1,h2). A pdf for region (1,0) may be given by:
where p(1,0) is the proportion of pairs of heights that fall in region (1,0) and 30 can be replaced by another rfu value as desired.
The method may include obtaining a pdf for mean height M by first using the transformation:
and then use the Jacobian of the transformation to obtain a two-dimensional pdf for (M,N). The method may include obtaining the marginal pdf for m through integration. The two-dimensional pdf for (M,N) may be given by:
fM,N(m,n)=2×fH
The two-dimensional pdf for (M,N) may be given by:
The method may provide that the pdf for M is given by:
where 15 is the lower quantity proportion and can be replaced by another value and 30 is the higher peak quantity proportion and can be replaced by another value. The method may include that the pdf for M is given by:
where 15 is the lower quantity proportion and can be replaced by another value and 30 is the higher peak quantity proportion and can be replaced by another value.
The method may use the same methodology or a methodology as described above for region (1,0) to obtain a pdf for mean height in region (0,2). The method may provide that the pdf for (H1,H2) is given by:
where p(0,2) is the proportion of height pairs that fall in region (0,2) and 30 can be replaced by another rfu value as desired and constant λ(0,2) is the parameter of an exponential pdf estimated from values of h2.
The method for calculating a marginal for M in region (0,2) may be the same as, or provided according to the possibilities for, region (1,0). The pdf may be given by:
if m ∈ [15,30), and
if m≧30, where 15 is the lower quantity proportion and can be replaced by another value and 30 is the higher peak quantity proportion and can be replaced by another value.
The method may provide that the pdf fM for region (1,2) can be obtained from the two-dimensional pdf estimated for (ln(M),ln(R)). The result may be a mixture of one-dimensional Gaussians. The method may provide that the pdf for M in region (1,2) is given by:
where p(1,2) is the proportion of height pairs that fall in region (1,2) and pi is the mixing proportion of the Gaussian components and f(m|μi,σi) are the Gaussian pdf's.
The method may provide for the generation of a probability distribution function for a given locus. The method may provide for the generation of a probability distribution function for each of a plurality of loci. The method may provide for the generation of a probability distribution function for a combination of loci. The combination of loci may correspond to those analysed using a PCR amplification multiplex, such as SGMplus. A plurality of different probability distribution functions may be provided for different combinations of loci. Probability distribution functions may be generated for a number of combinations corresponding to the combinations of loci analysed in different PCR amplification multiplexes.
The method may provide for the generation of a probability distribution function for a given DNA quantity. The method may provide for the generation of a probability distribution function for each of a number of different DNA quantities. Probability distribution functions may be provided for different values of DNA quantity provided across a range of DNA quantities. The different values may be evenly spaced throughout the range.
The method may provide for the generation of a probability distribution function for a given DNA quantity in respect of one locus and/or for a combination of loci. The method may provide for the generation of a probability distribution function for each of a number of different DNA quantities for one locus and/or for a combination of loci.
The one or more generated probability distribution functions may be stored for later use, for instance in a computer implemented database. The one or more generated probability distribution functions may be stored in an electronic or solid state memory. The method may include providing the one or more probability distribution functions in a computer implemented database and/or electronic memory device and/or solid state memory device. The one or more probability distribution functions may be stored at a location remote from the location of use and/or location of the device using the one or more probability distribution functions, for instance by storing on a remote access unit, for instance accessible via the Internet.
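One possible way of storing generated probability distribution functions for later use is to serialise their parameters, keyed by locus and DNA quantity. The following is a sketch only: the key format, the parameter names and the locus label are hypothetical, and JSON is chosen simply because it is readable and portable, not because the document specifies it.

```python
import json


def store_key(locus, quantity_pg):
    """Build a string key from a locus label and a DNA quantity in pg."""
    return f"{locus}|{quantity_pg}"


def save_pdfs(pdf_params):
    """Serialise {(locus, quantity_pg): parameters} to a JSON string."""
    keyed = {store_key(locus, q): params for (locus, q), params in pdf_params.items()}
    return json.dumps(keyed, sort_keys=True)


def load_pdf(serialised, locus, quantity_pg):
    """Retrieve the stored parameters for one locus and DNA quantity."""
    return json.loads(serialised)[store_key(locus, quantity_pg)]


# Hypothetical parameters for two DNA quantities at one locus.
pdf_params = {
    ("locus_A", 50): {"threshold_rfu": 30, "p11": 0.2, "p12": 0.2, "p21": 0.2, "p22": 0.4},
    ("locus_A", 100): {"threshold_rfu": 30, "p11": 0.05, "p12": 0.1, "p21": 0.1, "p22": 0.75},
}
serialised = save_pdfs(pdf_params)
restored = load_pdf(serialised, "locus_A", 100)
```

The serialised string could equally be written to a database record or a remote store; only the keying by locus and quantity matters for the later lookup.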
The later use of one or more probability distribution functions may be in a method according to the first aspect of the invention or its features or possibilities. The memory may be provided as a part of, in permanent connection with or in temporary connection with a device according to the second aspect of the invention or any of the possibilities provided therefor.
The one or more stored probability distribution functions may be used one or more times in subsequent method steps, such as those of the first aspect of the invention.
Any of the preceding aspects of the invention may include the following features, options or possibilities or those set out elsewhere in this document.
The method may be a computer implemented method.
The method may involve the display of information to a user, for instance in electronic form or hardcopy form.
The test sample may be a sample from an unknown source. The test sample may be a sample from a known source, particularly a known person. The test sample may be analysed to establish the identities present in respect of one or more variable parts of the DNA of the test sample. The one or more variable parts may be the allele or alleles present at a locus. The analysis may establish the one or more variable parts present at one or more loci.
The test sample may be contributed to by a single source. The test sample may be contributed to by an unknown number of sources. The test sample may be contributed to by two or more sources. One or more of the two or more sources may be known, for instance the victim of the crime.
The test sample may be considered as evidence, for instance in civil or criminal legal proceedings. The evidence may be as to the relative likelihoods, a likelihood ratio, of one hypothesis to another hypothesis. In particular, this may be a hypothesis advanced by the prosecution in the legal proceedings and another hypothesis advanced by the defence in the legal proceedings.
The test sample may be considered in an intelligence gathering method, for instance to provide information to further investigative processes, such as evidence gathering. The test sample may be compared with one or more previous samples or the stored analysis results therefor. The test sample may be compared to establish a list of stored analysis results which are the most likely matches therewith.
The test sample and/or control samples may be analysed to determine the peak height or heights present for one or more peaks indicative of one or more identities. The test sample and/or control samples may be analysed to determine the peak area or areas present for one or more peaks indicative of one or more identities. The test sample and/or control samples may be analysed to determine the peak weight or weights present for one or more peaks indicative of one or more identities. The test sample and/or control samples may be analysed to determine a level indicator for one or more identities.
Various embodiments of the invention will now be described, by way of example only, and with reference to the accompanying drawings in which:
a is an illustration of a frequency distribution for allele peak height, including a detection threshold;
b is an illustration of the fitting of one distribution type to the data below the detection threshold of
c is an illustration of the fitting of an alternative distribution type to the data below the detection threshold of
a shows estimated and Log normal pdf's for mean heights with rfu threshold valued at 50 rfu;
b shows estimated and Log normal pdf's for mean heights with rfu threshold valued at 100 rfu;
c shows estimated and Log normal pdf's for mean heights with rfu threshold valued at 500 rfu;
d shows estimated and Log normal pdf's for mean heights with rfu threshold valued at 1,000 rfu;
FIGS. 10a and 10b show Log normal parameters computed from estimated pdf for heterozygote balance given mean height;
FIGS. 11a and 11b show Log normal parameters computed from estimated pdf for heterozygote balance given mean height whilst setting μ=0.1069;
The present invention is concerned with improving the interpretation of DNA analysis. Basically, such analysis involves taking a sample of DNA and analysing the variations present at a number of loci. The identities of the variations give rise to a profile which is then interpreted. The extent of interpretation required can be extensive and/or introduce uncertainties. This is particularly so where the DNA sample contains DNA from more than one person, a mixture.
There is often a need to consider various hypotheses for the identities of the persons responsible for the DNA and evaluate the likelihood of those hypotheses, evidential uses.
There is often a need to consider the analysis genotype against a database of genotypes, so as to establish a list of stored genotypes that are likely matches with the analysis genotype, intelligence uses.
The present invention provides a mathematical specification of a model for computing likelihood ratios (LRs) that uses peak heights taken from such DNA analysis.
The approach of the present invention draws on an estimation of a two-dimensional, 2D, probability density function, pdf, which is estimated from the heights, or areas, of peaks observed after the analysis of control samples. Such pdf's may be generated from heterozygous donors and separately from homozygous donors.
The invention goes on to use the approach to calculate the probability of dropout and achieve other benefits.
As a first part of the explanation, exemplary methods for generating the pdf's are discussed.
A key part of the present invention is the generation and then use of 2D probability distribution functions, pdf's, relating to peak heights. These are obtained through the analysis of a large number of control samples.
The analyses of the control samples used to support the present approach need to consider the presence of a variety of levels of DNA within the control samples because those various levels, and others, are encountered in unknown source samples of interest. When an unknown source sample is collected for DNA analysis, there is an aim to collect a given amount of DNA. However, the amount actually present in the unknown source sample varies around that amount because of a variety of factors.
To account for this, control pre-samples from a number of heterozygous donors were obtained and then a variety of control samples with different DNA amounts in them were generated from these pre-control samples. The control samples were then analysed to establish the alleles present and the peak heights therefor. This control sample investigation process was repeated for each locus of interest in the DNA analysis or comparison process to be improved upon by the invention. Thus different loci will need to be considered for different analysis approaches and/or for databases which store information on different loci.
Separate pdf's are beneficially generated for heterozygous and homozygous situations.
As an example of the control sample investigation process, four heterozygous donors may be used. Control samples having different amounts of DNA present are then generated over the range 50 pg to 500 pg in 25 pg steps. 200 repeats of the pre-control samples were provided from each donor and used to form the control sample sets.
The results of the analysis of these control samples can be considered by plotting:
A set of results using this approach is shown schematically in
When considering the peak heights, the detection of a height relative to the noise signal in the analysis becomes difficult below a threshold level. As a result, it is normal to discount peak heights below a predetermined threshold, t, as being unreliable to consider further. As we know the control samples are from people who are heterozygous for the locus in question, we know that these non-observed peaks correspond to dropout of the allele from the analysis results. The allele is present in the DNA, but not present in the analysis results for the DNA.
The use of these thresholds, t, is provided for in the
The distribution of
In further considerations, both t1 and t2 have the same value and are represented as t, but different values could be used for the lower and higher weight alleles and/or between loci. The threshold could be 50 relative fluorescence units, rfu's, or as low as 20 rfu's.
The 2D pdf's for peak heights, fhet(h1,h2), where that function is a probability density function of heterozygous alleles with height values h1 and h2, can be defined piece-wise according to the four different regions identified above; p1,1 being for region (1,1); p1,2 for region (1,2); p2,1 for region (2,1); and p2,2 for region (2,2).
This approach gives the following definitions for the four regions of
The form of these definitions is now considered further:
Definition of g1,2(h1,h2)
This is the joint distribution of h1 and h2 in region (1,2) of
g1,2(h1,h2)=g1,2(h1|h2)g1,2(h2) (A.2)
where g1,2(h1|h2) is estimated from experimental data where h1>t and h2<t, and:
The reasoning behind this definition of g1,2 follows from the calculation:
The interpretation is that we only know that the peak height h2 is below the threshold, t, and so all values of h2 in the interval (0,t) are given equal probability.
Definition of g2,1(h1,h2)
This is the joint probability of h1 and h2 in region (2,1) and follows a matching form to the previous definition. Hence, it is provided through factorising the distribution in a similar manner and as follows:
g2,1(h1,h2)=g2,1(h2|h1)g2,1(h1) (A.5)
where g2,1(h2|h1) is estimated from experimental data where h1<t and h2>t and:
Definition of g2,2(h1,h2)
This is the joint probability of h1 and h2 in region (2,2). It is provided by an estimate of a 2D pdf from experimental data where h1>t and h2>t.
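As an illustrative sketch only, the assignment of a pair of heights to one of the four regions could be written as follows. The function name `classify_region` is hypothetical, the region labels simply follow the conditions stated in the definitions above, the default of 50 rfu is one of the threshold values mentioned above, and treating a height exactly equal to t as observed is an assumption.

```python
def classify_region(h1, h2, t=50.0):
    """Return the region label for a pair of peak heights (in rfu).

    Labels follow the conditions stated in the text:
      (1,2): h1 observed, h2 below t
      (2,1): h1 below t, h2 observed
      (2,2): both observed
      (1,1): both below t
    Assumption: a height equal to t counts as observed.
    """
    h1_seen = h1 >= t
    h2_seen = h2 >= t
    if h1_seen and h2_seen:
        return (2, 2)
    if h1_seen:
        return (1, 2)
    if h2_seen:
        return (2, 1)
    return (1, 1)
```

The piece-wise pdf can then be evaluated by selecting the definition that belongs to the returned region.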
For the three regions and their distributions, the following distributions are presently estimated as applying:
g1,2(h1,h2)—as a preliminary estimation we use an exponential distribution.
g2,1(h1,h2)—as a preliminary estimation we use an exponential distribution.
g2,2(h1,h2)—use a transformation of the data and then a 2D estimation.
The overall result of these definitions is that the 2D pdf for peak heights, fhet(h1,h2), is given by:
where, k1,2; k2,1; l12; l21; are described above, and where:
is the pdf of an exponential distribution with parameter l1,2, where x is the mixing proportion and:
and k2,2 is described above and the function k2,2 is given by a mixture of two-dimensional normal distributions:
where y and n are described above, where:
and:
where ρ=0.
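A minimal sketch of how a mixture of two-dimensional normal distributions with ρ=0 might be evaluated is given below. With zero correlation each component factorises into the product of two one-dimensional normal pdf's. The function names and the layout of the `components` argument are hypothetical, and any parameter values passed in would have to come from the fit to the control sample data.

```python
import math


def normal_pdf(x, mean, sd):
    """One-dimensional normal pdf."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_2d_pdf(y1, y2, components):
    """Mixture of two-dimensional normals with zero correlation.

    components: iterable of (weight, mean1, sd1, mean2, sd2) tuples,
    with weights assumed to sum to one.
    """
    return sum(
        w * normal_pdf(y1, m1, s1) * normal_pdf(y2, m2, s2)
        for w, m1, s1, m2, s2 in components
    )
```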
Once determined, the 2D pdf for peak heights, fhet(h1,h2), can be stored and made use of in a large number of subsequent considerations of likelihood ratios. As previously mentioned, such pdf's may be generated for each locus under consideration or potential consideration. In some cases, it may be necessary to determine a particular 2D pdf for peak heights, fhet(h1,h2), for a particular situation, as part of the analysis process.
In the previous section, an illustrative method for estimating a 2D pdf for the case where the source is heterozygous was provided. In this section we deal with the situation where the source is homozygous.
In this case, the probability distribution function, pdf, starts out as a 1D plot. There is only a value for h1 as only one peak is observed. Thus above the threshold, t, this has a value and below it is represented at the zero value.
To enable compatibility between situations where the likelihood is being considered in 2D (based on the pdf's developed in the section above and the likelihood evaluation discussed in sections which follow below) the pdf for the homozygous situation also has to be defined in 2D. Otherwise, the likelihoods in the 2D heterozygous approach will be inherently less likely than in a 1D homozygous approach.
A number of methods for converting 1D pdf's to 2D pdf's exist and can be used.
The basis for this method is the observation that a homozygous sample with height h1 is similar to observing a heterozygous sample with approximate heights 0.5h1 and 0.5h1.
It is also possible to convert the 1D pdf into a 2D pdf by mathematically rotating the distribution through say 90°. The result is a pdf in 2D which can be used in the manner described below for homozygous donors and as an alternative to method 1 for homozygous donors.
The estimation of a one-dimensional pdf has two components. A uniform distribution for the interval, for instance height values in the range 0 to 30 and a probability distribution that takes positive values within that range, for example the exponential distribution of
The one-dimensional pdf is given by the formula,
where
p0 is the proportion of heights in (0, 30];
p1 is the proportion of heights in (30, ∞);
For example if fH|H>30 is an exponential pdf, then
where λ is the estimated parameter for the exponential distribution using known methods of estimation, e.g. maximum likelihood estimation methods readily available in statistical packages. The overall pdf of an example for the case when fH|H>30 is an exponential distribution is plotted in
The idea is to rotate the one-dimensional pdf above to take non-zero values in R+×R+. The surface obtained in this manner needs to be normalised so that the volume under the surface is one.
The calculation of a pdf value for homozygote peak height h for a likelihood ratio calculation is given by
(1/V)×fH. (24)
where V is the volume under the surface obtained by rotation of the one-dimensional pdf.
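The normalisation can be sketched numerically as follows. This is a sketch under stated assumptions, not the method as implemented: the one-dimensional pdf is taken to be uniform on (0, t] with weight p0 and a shifted exponential above t with weight p1=1−p0, and the rotation is taken to be about the origin, so that the surface over the positive quadrant is g(h1,h2)=fH(√(h1²+h2²)). All function and parameter names are hypothetical.

```python
import math


def pdf_1d(h, p0, lam, t=30.0):
    """Piece-wise 1D pdf: uniform on (0, t], shifted exponential above t."""
    p1 = 1.0 - p0
    if h <= 0.0:
        return 0.0
    if h <= t:
        return p0 / t
    return p1 * lam * math.exp(-lam * (h - t))


def rotation_volume(p0, lam, t=30.0, r_max=5000.0, steps=200000):
    """Volume under g(h1, h2) = pdf_1d(sqrt(h1^2 + h2^2)) over the positive
    quadrant, using the midpoint rule in polar coordinates.
    The angular integral over a quarter turn contributes pi / 2."""
    dr = r_max / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * dr
        total += r * pdf_1d(r, p0, lam, t) * dr
    return 0.5 * math.pi * total


def homozygote_density(h, p0, lam, t=30.0):
    """Value (1/V) x fH(h) for use in a likelihood ratio calculation."""
    return pdf_1d(h, p0, lam, t) / rotation_volume(p0, lam, t)
```

Under these assumptions the volume also has the closed form V=(π/2)×E[H], which gives a simple check on the numerical integration.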
That part of the distribution which relates to data left-censored by the rfu threshold, t, is discussed in approach a, with approach b offering an alternative.
In this approach, a family of distributions is chosen, f(h|θ), where θ represents the parameters determining the distributions and the parameters encoded in θ can be either one-dimensional or multi-dimensional. The type of distribution chosen can be one of numerous possibilities, such as Exponential, Weibull, Gamma or other distributions with positive support H. For example an Exponential distribution might be chosen where, thus providing:
The Likelihood of the censored data can then be established by the function:
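\[ L(\theta \mid \{h_i : i=1,2,\ldots,n\}) = F(t\mid\theta)^{n_1} \times \prod_{i=1}^{m} f(h_i\mid\theta), \]
where there are n1 heights recorded as zero due to left-censoring and m=n−n1 heights observed at or above the threshold, F is the cumulative probability distribution and f is the probability density function of the chosen distribution, here the Exponential distribution, and t is the detection threshold.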
The family member for θ which gives the maximum value of the likelihood is chosen, that is, the maximum likelihood estimate (MLE) for θ.
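By way of a hedged illustration, the maximum likelihood estimate for an Exponential family with rate λ can be sketched as below. The sketch takes the censored contribution as F(t|λ) raised to the number of censored heights, which is the standard form for left-censored data and is an assumption about the intended likelihood. The function names, the bisection bounds and the use of bisection itself are illustrative choices.

```python
import math


def censored_exp_loglik(lam, observed, n_censored, t):
    """Log-likelihood for n_censored heights known only to lie below t,
    plus the fully observed heights."""
    cens = n_censored * math.log1p(-math.exp(-lam * t)) if n_censored else 0.0
    obs = len(observed) * math.log(lam) - lam * sum(observed)
    return cens + obs


def censored_exp_mle(observed, n_censored, t, lo=1e-9, hi=1.0, iters=200):
    """Solve the score equation by bisection.

    The score is strictly decreasing in lam, so it has a single root.
    Assumption: the root lies in (lo, hi), which holds for heights in rfu
    whose mean exceeds one.
    """
    m, s = len(observed), sum(observed)

    def score(lam):
        cens = n_censored * t / math.expm1(lam * t) if n_censored else 0.0
        return cens + m / lam - s

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With no censored heights the estimate reduces to the usual value, the number of heights divided by their sum.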
An alternative method for establishing a 2D pdf in the context of heterozygous donors is now described, starting with an explanation using a 1D analogy, followed by an expansion of the approach to detail its working in 2D.
In the consideration of heterozygous analysis results for the control samples, the two alleles may result in peaks of different heights. As previously described, there is a chance that one or both peaks may be below a threshold level, t, at which the peak can safely be identified or even identified at all.
The result is that a distribution of data above the threshold, t, is established, but that the form of the distribution below the threshold, t, is not known. The position of the frequency distribution of
In the case of method 1 above, a uniform frequency for the distribution is applied to the unknown section below the threshold, see
In this alternative approach, a non-uniform distribution is applied to the region below the threshold, t. In the illustrated example of
This approach can now be extended to the four regions described in the context of
The region (2,2) has a distribution which is fully known from the analysis results for the control samples. For the region (1,1) it is possible to use a 2D plane or other distribution shape which represents a transition from the region (2,2) distribution at h1=t, h2=t down to zero at the junction of the axes and which also provides a transition from the values of region (1,2) along junction h1=t, h2=0 to t and from the values of region (2,1) along the junction h2=t, h1=0 to t. The regions (1,2) and (2,1) themselves can be further distributions representing a combination of the distribution applied in region (2,2) and that applied in region (1,1).
As described above, a profile from a heterozygous donor in a locus typically consists of two peaks. However if a profile is obtained from small amounts of DNA, it may contain either one peak or no peaks in the observed result. This is because a peak cannot be detected if it is low and close to the baseline. In this example we use 30 rfu as the minimum detection limit, but other detection limits can be applied to the observed results.
The support of the two-dimensional pdf is (R+∪{0})×(R+∪{0}). The pair of heights (h1,h2) is in effect censored by the threshold t, for instance 30 rfu, which divides the support into four regions, see
The estimation of a two dimensional pdf is done in the space defined by mean heights, m, and heterozygote imbalance, r. Each pair of heights is transformed by
Given a pdf fM,R, a pdf in the space of pairs of heights can be obtained with the formula:
where the first factor is the Jacobian of the transformation, G. Casella & R. L. Berger (1990). Statistical Inference. Wadsworth & Brooks/Cole Advanced Books Software, Pacific Grove, Calif., USA, page 148 and the second factor can be estimated from experimental data. The estimation is done by first obtaining the conditional distribution of heterozygous imbalance given mean height:
This will provide one pdf fR|M for all regions, but there will be one marginal for each of the regions. In the rest of the section the estimation of the conditional pdf fR|M and then the estimation of the marginal pdf's fM for each of the regions is illustrated.
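As a worked sketch only, the change of variables can be written out under the assumption that the transformation is m=(h1+h2)/2 and r=h1/h2. The direction of the ratio is an assumption made for illustration. The second factor is split into the conditional pdf and the marginal pdf as described above.

```latex
\[
\left|\det\frac{\partial(m,r)}{\partial(h_1,h_2)}\right|
= \left|\tfrac{1}{2}\cdot\left(-\frac{h_1}{h_2^{2}}\right)-\tfrac{1}{2}\cdot\frac{1}{h_2}\right|
= \frac{h_1+h_2}{2h_2^{2}},
\qquad
f_{H_1,H_2}(h_1,h_2)=\frac{h_1+h_2}{2h_2^{2}}\;f_{R|M}(r\mid m)\,f_{M}(m).
\]
```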
A conditional pdf fR|M can be obtained using data from region (1,2) and extrapolating to a pdf fR|M for all regions. In effect, a simulation is provided for the material which is missing in each of these three zones. A joint pdf fln(M),ln(R) for variables ln(M) and ln(R) is obtained first. A conditional pdf fR|M is calculated from fln(M),ln(R). The pdf fR|M is affected by the bias introduced by the threshold of 30 rfu. This bias is removed by replacing each fR|M with a Log normal pdf and correcting in the space of parameters mean, μ, and variance, σ, of the Log normal pdf's.
A two-dimensional pdf fln(M),ln(R) for variables ln(M) and ln(R) is estimated as a mixture of Gaussian pdf's using the EM-algorithm, where each Gaussian pdf has zero correlation; A. Dempster, N. Laird, and D. Rubin (1977). “Maximum likelihood from incomplete data via the EM algorithm”. Journal of the Royal Statistical Society, Series B, 39(1):1-38; G. McLachlan, D. Peel (2000). Finite mixture models. John Wiley & Sons, Inc. A scatter plot of mean height and heterozygous balance in natural logarithm scales is shown in
The goodness of fit is assessed by ascertaining that the data is a likely sample from the estimated pdf. Verification can be provided by simulating one thousand samples of the same size as the data from the estimated pdf, computing the negative log-likelihood of each sample, and also computing the negative log-likelihood of the data under the estimated pdf. The negative log-likelihood of the data falls within the range of values of the negative log-likelihoods of the simulated samples, which indicates that the mixture fits the data well.
A conditional distribution of R given M can be calculated from the estimated fln(M),ln(R) using the formula:
The pdf fln(M) is in fact a mixture of one-dimensional Gaussian distributions with the same mixing proportions as fln(M),ln(R) and means and variances given by the corresponding means and variances. These are reported in Table 1 above.
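The check described above can be sketched as follows. This is an illustration under stated assumptions: a standard normal stands in for the fitted mixture, and the names `nll_range_check`, `sampler` and `logpdf` are hypothetical. In use, the sampler and log-density of the fitted mixture would be supplied instead.

```python
import math
import random


def nll(sample, logpdf):
    """Negative log-likelihood of a sample under a given log-density."""
    return -sum(logpdf(x) for x in sample)


def nll_range_check(data, sampler, logpdf, n_sim=1000, seed=0):
    """Simulate n_sim samples of the same size as the data and compare
    the data's negative log-likelihood with the simulated range.

    Returns (nll_data, min_simulated, max_simulated, inside_range).
    """
    rng = random.Random(seed)
    sims = [nll([sampler(rng) for _ in data], logpdf) for _ in range(n_sim)]
    observed = nll(data, logpdf)
    lo, hi = min(sims), max(sims)
    return observed, lo, hi, lo <= observed <= hi


# Stand-in model for illustration only: a standard normal.
def std_normal_logpdf(x):
    return -0.5 * x * x - 0.5 * math.log(2.0 * math.pi)


def std_normal_sampler(rng):
    return rng.gauss(0.0, 1.0)
```

Data that could not plausibly have come from the model give a negative log-likelihood far outside the simulated range, which is what flags a poor fit.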
Several conditional pdf's are computed using the formula above. Estimated Log normal distributions are shown in
In
Notice that the Log normal distributions are a good fit. These can therefore correct the bias in the space of parameters (μ,σ) of the Log normal pdf's.
In a second iteration we set μ=0.1069, corresponding to heterozygote imbalance R=1.11. In fact, this number is expected, as the height of the low-molecular-weight allele tends to be larger than the height of the high-molecular-weight allele. We then run the estimation procedure for σ again.
To extrapolate the increasing trend to lower values of M, several polynomials were fitted to values of σ in the region from 80 to 200 rfu. A line was the best fit and it was chosen for extrapolating σ in region (0,80).
In these examples, Log normal distributions are used, but other extrapolatable distributions, such as gamma distributions can be used.
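A minimal sketch of the extrapolation step, and of the resulting Log normal pdf for heterozygote imbalance given mean height, is given below. The function names are hypothetical, and any σ values supplied to `fit_line` would be the corrected estimates obtained from the data. Only μ=0.1069 is taken from the text, and exp(0.1069) is approximately 1.11.

```python
import math

MU = 0.1069  # fixed value quoted in the text


def fit_line(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope


def lognormal_pdf(r, mu, sigma):
    """Log normal pdf with parameters mu and sigma on the log scale."""
    if r <= 0.0:
        return 0.0
    z = (math.log(r) - mu) / sigma
    return math.exp(-0.5 * z * z) / (r * sigma * math.sqrt(2.0 * math.pi))


def imbalance_pdf(r, m, intercept, slope):
    """f(r | m) with sigma(m) read off the fitted line, including
    mean heights below the range used for the fit."""
    return lognormal_pdf(r, MU, intercept + slope * m)
```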
From this a family of Log normal distributions has been obtained:
where μ=0.1069 and σ(m) is given by the corrected σ value in
Other components in the proposed two-dimensional distribution for peak height are pdf's for mean height in each of the four regions. For region (0,0) neither peak height is known and thus both follow a uniform distribution in the interval [0,30]. A pdf for mean height is therefore also a uniform distribution in this interval, i.e.
where p(0,0) is the proportion of pairs of heights that fall in region (0,0) and takes the value 0.1012. Of course, other values than 30 rfu's for the threshold t can be employed.
Obtaining a pdf for region (1,0) requires more effort. First a distribution for peak height (h1,h2) in this region is obtained and then used to calculate a distribution for mean height. Although the goal is to obtain a pdf for (h1,h2), this distribution does not have the extrapolated pdf for heterozygote imbalance given mean height. A pdf for mean height is therefore extracted, combined with the extrapolated distribution and transformed back to the space of (h1,h2).
A joint distribution can be defined in this region by assuming that the variables h1 and h2 are independent. Given that we do not know the value of h2, we can assume that it follows a uniform distribution in the interval (0,30). The values of variable h1 follow an exponential distribution, see
A pdf for region (1,0) is given by
where p(1,0) is the proportion of height pairs that fall in region (1,0) and takes the value of 0.0448.
A pdf for mean height M can be obtained by first using the transformation
and then using the Jacobian of the transformation to obtain a two-dimensional pdf for (M,N). The marginal pdf for M can be obtained through integration. The two-dimensional pdf for (M,N) is given by:
f
M,N(m,n)=2×fH
It can be re-written as
Before we integrate out N to obtain a pdf for M, we need to describe the sample space for (M,N). It is the area in the rectangle.
The resulting pdf for M is given by:
if m ∈ [15,30), and
if m≧30.
The methodology for obtaining a pdf for mean height in region (0,2) is the same as for region (1,0). A pdf for (H1,H2) is given by:
where p(0,2) is the proportion of height pairs that fall in region (0,2) and is equal to 0.0325. Constant λ(0,2) is the parameter of an exponential pdf estimated from values of h2 translated by 30, see
The method for calculating a marginal for M in region (0,2) is the same as for region (1,0). The resulting pdf is given by:
if m ∈ [15,30), and
if m≧30.
Finally, a pdf fM for region (1,2) can be obtained from the two-dimensional pdf estimated for (ln(M),ln(R)). The result is a mixture of one-dimensional Gaussians where the mixing proportions are the same as those reported in Table 1. The means are given in the column with heading “Mean 1” and the variances are given in the column with heading “Variance 1” (footnote 1: Matlab: fitMeanHeightPDF.m). A pdf for M in region (1,2) is given by:
where p(1,2) is the proportion of height pairs that fall in region (1,2) and pi is the mixing proportion of the Gaussian components and f(m|μi,σi) are the Gaussian pdf's.
We have defined all the components of a two-dimensional pdf for pairs of heights for all regions as defined in Equation (27). The conditional pdf fR|M is given by the family of Log normal pdf's with a fixed value of μ and values of σ(m) given in
In approach b above, an approach for dealing with the censoring of the peaks by the rfu limit imposed was discussed. This approach provides a further variant to that manner of dealing with the left censored peaks.
In approach b, given a pdf fM,R, a pdf in the space of pairs of heights was described as being obtained with the formula:
where the first factor is the Jacobian of the transformation and the second factor was estimated from experimental data. The estimation was done by first obtaining the conditional distribution of heterozygous imbalance given mean height:
This provided one pdf fR|M for all regions, but with one marginal for each of the regions. Approach b then described the estimation of the conditional pdf fR|M and then the estimation of the marginal pdf's fM for each of the regions illustrated.
In approach c, the same approach is taken to that of approach b in the use of the formula:
and in the estimation of the conditional pdf fR|M. The variation is in the manner by which the estimation of the marginal pdf's fM is provided. In this further approach, the possible range for the values of the peak heights, even those recorded as zero, is taken into account.
The first step is the selection of a family of probability distributions f(m|θ) where θ is the set of parameters specifying the family of distributions. Variable m is the mean height (h1+h2)/2. The values of h1 and h2 are affected by the threshold t and thus h1 can be either greater than or equal to t or recorded as zero if it is smaller than t. The values of m are affected by left-censoring of h1 and h2 and the definition of the likelihood of θ given the observations of M is affected accordingly.
If h1<t and h2<t, then m<t. In this case m is left-censored and the likelihood function for these m's is F(t|θ), where F is the cumulative probability distribution (CDF) of the previously chosen family of distributions. If h1<t and h2≧t, then m is interval-censored, that is it falls in the interval from 0.5t to 0.5(t+h2). The likelihood for the m's is F(0.5(t+h2)|θ)−F(0.5t|θ). Similarly for the contrary case where h2<t and h1≧t the likelihood of m is F(0.5(t+h1)|θ)−F(0.5t|θ). If h1≧t and h2≧t, then there is no censoring and the likelihood is given by f(m|θ). The overall likelihood is given by multiplying the likelihood for each pair of heights h1 and h2, including those that fall below the threshold t and are recorded as zero:
The parameters encoded in θ are chosen to maximise the likelihood given above.
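The four cases can be sketched in code as follows. This is a minimal sketch assuming an Exponential family with rate `lam` as the chosen f(m|θ); the family, the function names and the representation of censored heights as zero are illustrative assumptions. The interval limits are those stated in the text.

```python
import math


def exp_cdf(x, lam):
    """CDF of the Exponential distribution with rate lam."""
    return 1.0 - math.exp(-lam * x) if x > 0.0 else 0.0


def exp_pdf(x, lam):
    """pdf of the Exponential distribution with rate lam."""
    return lam * math.exp(-lam * x) if x >= 0.0 else 0.0


def pair_likelihood(h1, h2, t, lam):
    """Likelihood contribution of one pair of heights.
    Heights below the threshold t are assumed to be recorded as zero."""
    seen1, seen2 = h1 >= t, h2 >= t
    if seen1 and seen2:
        # No censoring: density at the mean height.
        return exp_pdf(0.5 * (h1 + h2), lam)
    if not seen1 and not seen2:
        # Both heights censored: m is left-censored at t.
        return exp_cdf(t, lam)
    # One height censored: m is interval-censored.
    h = h1 if seen1 else h2
    return exp_cdf(0.5 * (t + h), lam) - exp_cdf(0.5 * t, lam)


def log_likelihood(pairs, t, lam):
    """Log of the overall likelihood: sum of the log contributions."""
    return sum(math.log(pair_likelihood(h1, h2, t, lam)) for h1, h2 in pairs)
```

The value of `lam` that maximises `log_likelihood` over the control sample pairs would then be taken as the estimate.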
Having established the underlying 2D pdf for peak heights in the heterozygous case, fhet (h1,h2), and 2D pdf for peak height in the homozygous case, fhom(h1), it is possible to move on to establish the likelihood ratios in various situations of interest.
The situations considered in turn and in detail below are a variety of those encountered, particularly in the context of forensic science.
The various situations are:
In an evidential context, the relative likelihoods of two hypotheses are normally being considered. Generally, these will be a prosecution hypothesis and a defence hypothesis. The consideration, the likelihood ratio, can be generally expressed as:
where
The LR can be expanded as:
where Ui is one of the supposed donors stated in Hd. Given that the quantities in C are continuous, the LR is written as:
where the f means that these factors are likelihoods and not probabilities.
There are a number of ways in which the second factor Pr(Ui|S) can be computed using the method given by Balding et al. (2005) and Buckleton et al.(2005), referenced above, which may include considerations of relatedness between the unknown contributor and the suspect.
The factors f(C|S,Hp) and f(C|Ui,S,Hd) state the same type of calculation: the likelihood of observing the set of peak heights in C given a supposed donor. Therefore the discussion for their calculation is simplified as:
f(C|G) (45)
where G denotes the supposed donor genotype.
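The structure of the calculation can be sketched as below, assuming the expansion sums over the supposed donors Ui, each term being the likelihood f(C|Ui,S,Hd) weighted by Pr(Ui|S). The function name and the argument layout are hypothetical, and the numerical values in any call would come from the estimated pdf's and from the genotype probability methods referenced above.

```python
def likelihood_ratio(f_c_given_suspect, alternatives):
    """Likelihood ratio with a single numerator term f(C|S,Hp).

    alternatives: iterable of (f(C|U_i,S,Hd), Pr(U_i|S)) pairs, one per
    supposed donor genotype considered under the defence hypothesis.
    """
    denominator = sum(f_u * pr_u for f_u, pr_u in alternatives)
    if denominator <= 0.0:
        raise ValueError("denominator must be positive")
    return f_c_given_suspect / denominator
```

For example, `likelihood_ratio(0.004, [(0.004, 0.02), (0.001, 0.08)])` divides 0.004 by 0.00016, using made-up values.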
This general form of the likelihood is key to the operation of the invention and its benefits.
The estimation of f(C|G) from experimental data, as provided in the pdf forming methods discussed above, has not previously been envisaged or provided.
The estimation can come from the dilution data approach of heterozygous sources, method 1, in particular. The homozygous source methods could be used for homozygous situations.
As far as the estimation of f(C|G) is concerned, this can be done by grouping data from all loci to generate a single estimation of f(C|G) covering all the loci considered. It is equally possible to provide a separate estimation of f(C|G) for each locus. A variety of ways for generating the estimation of f(C|G) using two-dimensional density estimations apply.
In the following sections a more detailed application of the approach to evidential contexts is provided for the following scenarios:
a) Suspect is heterozygous and crime profile has one peak;
b) Suspect is heterozygous and crime profile has two peaks;
c) Suspect is heterozygous and crime profile contains no peaks;
d) Suspect is homozygous and crime profile contains one peak;
e) Suspect is homozygous and crime profile contains one peak.
In this situation, C={hb} and S={a,b}. The likelihood ratio is given, consistent with equation 1 above, by,
The numerator then becomes,
where ha is a peak height smaller than the threshold and so is not observed in the results. This is the type of situation illustrated in
The term in the right-hand-side of Equation (1.2) is a pdf in two-dimensions of the type discussed above in the pdf's for heterozygote sections.
The denominator in equation (1.1) is given by,
f(hb|Hd)=f(hb|U={b,b},Hd)Pr(U={b,b}|S={a,b},Hd)+f(ha<t,hb|U={b,Q},Hd)Pr(U={b,Q}|S={a,b},Hd). (1.3)
where Q is any other allele identity than a.
The second factors in the right-hand side of equation (1.3), Pr(U={b,b}|S={a,b},Hd) and Pr(U={b,Q}|S={a,b},Hd), can be computed using methods described in Balding (2005) and Buckleton et al. (2005) which includes cases where the suspect and the unknown contributors are from the same and different ethnic groups, and cases when the suspect and unknown contributor are related.
The first factors of the right-hand side of equation 1.3 will now be described in more detail.
The first factor is computed as follows:
f(hb|U={b,b},Hd)=fhom(hb). (1.4)
where fhom is the one-dimensional pdf as discussed above. The third factor is computed with the formula:
where hQ is smaller than the threshold.
This provides a determinable value for the likelihood ratio for the situation of, suspect is heterozygous and crime profile has one peak.
In this situation, C={ha,hb} and S={a,b}. The LR is given by:
The numerator is given by:
f(C={ha,hb}|S={a,b},Hp)=fhet(ha,hb). (2.2)
The denominator is given by:
f(C={ha,hb}|S={a,b},Hd)=fhet(ha,hb)Pr(U={a,b}|S={a,b},Hd) (2.3)
The second term in the right hand side of Equation 2.3 can be computed using methods described in Balding (2005) and Buckleton et al. (2005) which includes cases where the suspect and the unknown contributors are from the same and different ethnic groups, and cases when the suspect and unknown contributor are related. The first term is calculated from a two-dimensional pdf for peak heights.
The LR is then given by:
This provides a determinable value for the likelihood ratio for the situation of suspect is heterozygous and crime profile has two peaks. This assumes that different contributors having the same genotype will have the same peak height distributions.
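As a worked step, taking the numerator as fhet(ha,hb) and the denominator as given in Equation 2.3, the peak height term cancels under the stated assumption that different contributors with the same genotype share the same peak height distribution:

```latex
\[
LR=\frac{f_{het}(h_a,h_b)}{f_{het}(h_a,h_b)\,\Pr(U=\{a,b\}\mid S=\{a,b\},H_d)}
=\frac{1}{\Pr(U=\{a,b\}\mid S=\{a,b\},H_d)}.
\]
```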
In this situation, C={hb} and S={b,b}. The LR is given by:
The numerator is given by:
f(hb|S={b,b},Hp)=fhom(hb) (3.2)
a density function estimated for homozygous contributors.
The denominator is given by:
The term:
f(hb|U={b,b},S={b,b},Hd)=fhom(hb). (3.4)
is the same as the numerator. The term:
can be obtained by estimating a pdf for the peak heights.
This provides a determinable value for the likelihood ratio for the situation of suspect is homozygous and the crime profile has one allele.
As previously mentioned, the homozygous situation starts out relating to a 1D pdf and has to be converted to a 2D pdf for compatibility in approaches. This 2D approach extends to the likelihood calculation too. The approach stems from the observation that a homozygous sample with height h1 is similar to observing a heterozygous sample with approximate heights 0.5h1 and 0.5h1.
The likelihood for a homozygous donor is then given by
where f(C={αh1,(1−α)h1}|G={1,1}) is a likelihood function computed in a two-dimensional likelihood for a heterozygote and f(α) is a probability density function for α, where α is the proportion of the sample. Alternatively we can approximate this quantity by:
for an α close to 0.5.
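Both forms can be sketched numerically as below, assuming the first form is the integral over α of the heterozygote likelihood at heights (αh1,(1−α)h1) weighted by f(α). The callables `f_het` and `f_alpha` are hypothetical stand-ins for the estimated two-dimensional pdf and the pdf for α, and the midpoint rule is an illustrative choice.

```python
def homozygote_likelihood(h1, f_het, f_alpha, steps=1000):
    """Midpoint-rule approximation of the integral over alpha in (0, 1)
    of f_het(alpha * h1, (1 - alpha) * h1) * f_alpha(alpha)."""
    d = 1.0 / steps
    total = 0.0
    for i in range(steps):
        a = (i + 0.5) * d
        total += f_het(a * h1, (1.0 - a) * h1) * f_alpha(a) * d
    return total


def homozygote_likelihood_point(h1, f_het, alpha=0.5):
    """Approximation using a single value of alpha close to 0.5."""
    return f_het(alpha * h1, (1.0 - alpha) * h1)
```

The point approximation avoids the integration and needs no pdf for α, at the cost of ignoring the spread of α around 0.5.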
In this section, the approach has been provided without consideration as to the DNA quantity involved; the next section conditions on DNA quantity. For ease of presentation, however, the above mentioned approach, extended to condition on DNA quantity, χ, is mentioned here. The formulae remain valid, and they become:
where f(C={αh1,(1−α)h1}|G={1,1},χi) is a likelihood function for a heterozygote contributor estimated from data obtained with targeted DNA quantity χi.
Alternatively:
for an α close to 0.5.
In situation 1, above, we discussed how the calculation is simplified as:
f(C|G) (50)
where G denotes the supposed donor and that this general form of the likelihood is key to the operation of the invention and its benefits.
In the extension of the approach that now follows, the likelihood can also be computed by conditioning on DNA quantity, in the form:
where Pr(χj|I(h)) is a probability distribution of DNA quantity given peak height information, I(h).
The calculation of f(C|G) can be done for each different DNA quantity considered through the control samples. In effect, for each quantity of DNA a dataset is generated using the general approaches previously described. This data is stored for future use. The relevant distribution to the DNA quantity in the analysis situation is then selected from these and used.
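A minimal sketch of the conditioning step follows, assuming the likelihood is the sum over the DNA quantities χj of f(C|G,χj) weighted by Pr(χj|I(h)). The function name and the use of dictionaries keyed by quantity are hypothetical.

```python
def likelihood_over_quantities(f_c_given_g_chi, pr_chi):
    """Combine per-quantity likelihoods with the quantity distribution.

    f_c_given_g_chi: dict mapping DNA quantity to f(C|G, chi_j).
    pr_chi: dict mapping DNA quantity to Pr(chi_j | I(h)); must sum to one.
    """
    total_pr = sum(pr_chi.values())
    if abs(total_pr - 1.0) > 1e-9:
        raise ValueError("Pr(chi_j | I(h)) must sum to one")
    return sum(f_c_given_g_chi[chi] * p for chi, p in pr_chi.items())
```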
The probability distribution for Pr(χj|I(h)) relates to a list of discrete probabilities for χ1 to χj, where
The distribution can be obtained by considering an observed distribution from the control samples. There are various ways of combining the control sample results to achieve this.
It is possible to consider all the heights, h, observed for each locus in a control sample. This gives a mean h for a control sample across all its loci. The mean h's for repeats at the same quantity χ can then be grouped together across multiple control samples. Hence, {
Other approaches could consider the position locus by locus, with the process of obtaining a mean height at each χ and providing the distribution being performed separately for the different loci.
Both the sets of information for equation 5 can be obtained for one or more loci, preferably, but not necessarily including the locus on which the approach is being used.
It is possible for the calculation of Pr(X=χi|I(h)) for a locus to be performed in a sequential fashion based on an ordering L1,L2, . . . ,Ln of the loci. In that instance, the process starts by setting Pr(X=χi|I(h)) for L1 to a uniform distribution. This is in part because there is no information available as to I(h). It is then possible to compute Pr(X=χi|I(h)) for L2 based on the assumption for I(h) from L1. The prior knowledge of I(h) for L1 is of benefit. The process then computes Pr(X=χi|I(h)) for L3 based on I(h) from L1 and L2, with the process continuing in this manner until Pr(X=χi|I(h)) for Ln is computed based on I(h) from L1,L2, . . . ,Ln−1. This provides the best use of the prior knowledge to compute the various Pr's, but without generating a circularity issue in the processing.
To compute Pr(X=χi|I(h)) that function can be equated to:
where:
f(
is a 1D distribution calculated for data for DNA only, χ=χi.
The likelihood f(C|Gi) and its version conditional on DNA quantity, f(C|Gi,χj), are key to the invention. The estimation of f(C|G,χi) from experimentally derived data has not previously been envisaged or provided. Once again:
this can be provided from dilution data;
the estimation of f(C|G,χi) can be done by grouping data from all loci, potentially with the estimation rendering one f(C|G,χi) for all loci, but with a separate f(C|G,χi) for each DNA quantity χi;
the estimation of f(C|G,χi) can be done once per locus, potentially with the estimation rendering one f(C|G,χi) for each locus and for each DNA quantity χi;
the estimation of f(C|G,χi) can be done using any method of two-dimensional density estimation;
the estimation of f(C|G,χi) can be done using the method described in the heterozygous donors, method one and/or homozygous donors method one approaches.
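As one illustration of the options listed above, the following sketch groups control-sample records by DNA quantity and fits a separate two-dimensional density for each quantity. It uses a plain Gaussian kernel density estimate, which is a swapped-in generic technique and not the heterozygous donors method one approach; the record layout, the bandwidth and the height values are invented for the example.

```python
import math
from collections import defaultdict

def gaussian_kde_2d(points, bandwidth):
    """Return a function estimating a 2D density from (h1, h2) pairs using an
    isotropic Gaussian kernel with a single bandwidth."""
    n = len(points)
    norm = n * 2.0 * math.pi * bandwidth ** 2

    def density(h1, h2):
        total = 0.0
        for p1, p2 in points:
            d2 = (h1 - p1) ** 2 + (h2 - p2) ** 2
            total += math.exp(-0.5 * d2 / bandwidth ** 2)
        return total / norm

    return density

def estimate_per_quantity(records, bandwidth=25.0):
    """Group (quantity, h1, h2) control records by quantity and fit one
    density f(C | G, chi) per quantity chi."""
    grouped = defaultdict(list)
    for quantity, h1, h2 in records:
        grouped[quantity].append((h1, h2))
    return {q: gaussian_kde_2d(pts, bandwidth) for q, pts in grouped.items()}

# Invented control-sample peak heights for heterozygous donors at two quantities.
records = [
    (100.0, 180.0, 210.0), (100.0, 220.0, 190.0), (100.0, 200.0, 205.0),
    (500.0, 950.0, 1010.0), (500.0, 1040.0, 980.0), (500.0, 990.0, 1000.0),
]
f_het = estimate_per_quantity(records)
```

A call such as f_het[100.0](200.0, 200.0) then evaluates the estimated density for the quantity selected in the analysis situation. Pooling records from all loci, or calling estimate_per_quantity once per locus, corresponds to the two grouping options listed above.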
Equally, the estimation of Pr(X=χi|I(h)) from experimentally derived data has not previously been provided or envisaged. Again this could involve:
the estimation of Pr(X=χi|I(h)) from dilution data;
the peak height information I(h) being any function of peak height, for example the mean of two peak heights or the sum of the heights;
the estimation of Pr(X=χi|I(h)) for a locus being based on I(h) from the same locus or from one or more other loci.
In an intelligence context, a different issue is under consideration to that approached in an evidential context. The intelligence context seeks to find links between a DNA profile from a crime scene sample and profiles stored in a database, such as The National DNA Database®, which is used in the UK. The process is interested in the genotype given the collected profile.
Thus in this context, the process starts with a crime profile C and is interested in proposing an ordered list of suspect profiles, where the first profile in the list is the genotype of the most likely donor. This task is usually done by proposing a list of genotypes {G1,G2, . . . ,Gm} which is then ranked according to the posterior probability of the genotype given the crime stain.
The list of genotypes is generated from the crime profile C. For example if C={h1,h2}, where both h1 and h2 are greater than the dropout threshold, t, then the only potential donor genotype is G={1,2}. If C={h1}, then the potential donors are G1={1,1} and G2={1,Q}, where Q represents any allele other than allele 1 that is present in the DNA, but not observed in the results for the crime profile C.
The posterior probability of a supposed genotype given the crime profile is given by the formula:
where the Gi are assumed to be the various members of an exhaustive list of potential donor genotypes, and π(Gi) is a prior distribution for genotype Gi computed from the population in question.
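The ranking step can be sketched as below. The sketch assumes the posterior takes the usual normalised form, in which each f(C|Gi) is weighted by π(Gi) and divided by the sum of those products over the exhaustive list; the function name and all numerical values are invented for illustration.

```python
def rank_genotypes(likelihoods, priors):
    """Rank candidate genotypes by posterior probability given the crime profile.

    likelihoods: genotype label -> f(C | G_i)
    priors:      genotype label -> pi(G_i)
    """
    weighted = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(weighted.values())
    posterior = {g: w / total for g, w in weighted.items()}
    # Most probable donor genotype first.
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)

# C = {h1}: the candidate donors are {1,1} and {1,Q}. All numbers are invented.
likelihoods = {"1,1": 0.004, "1,Q": 0.001}
priors = {"1,1": 0.01, "1,Q": 0.18}
ranking = rank_genotypes(likelihoods, priors)
```

In this invented case the genotype {1,Q} is ranked first: its lower likelihood is outweighed by its larger prior.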
In a similar manner to the evidential context and equation 4 described above, the term:
f(C|Gi)
is key to the process, and this term can be estimated successfully using the various approaches for quantifying the pdf discussed above.
In general, the estimation is the same as with evidential assessment discussed elsewhere.
It is possible to extend the approach in situation 3 to consider the quantity of DNA involved, in a similar manner to the extension of situation 1 to situation 2.
The posterior probability in this situation is computed using the formula:
where Pr(χj|I(h)) is a probability distribution of DNA quantity given information on peak heights. This probability distribution can be established in respect of the same locus and/or across one or more other loci, possibly including the locus where the formula will be applied in practice.
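A minimal sketch of the quantity-conditioned posterior follows. It assumes that the likelihood for each genotype is obtained by summing f(C|Gi,χj) over the discrete quantities χj, weighted by Pr(χj|I(h)), before the prior π(Gi) is applied and the result normalised. That form is an assumption made for this example, and the function name and numbers are invented.

```python
def posterior_with_quantity(lik_by_quantity, quantity_probs, priors):
    """Posterior over genotypes after averaging the likelihood over DNA quantity.

    lik_by_quantity: genotype label -> [f(C | G_i, chi_j) for each chi_j]
    quantity_probs:  [Pr(chi_j | I(h)) for each chi_j], summing to one
    priors:          genotype label -> pi(G_i)
    """
    weighted = {}
    for genotype, liks in lik_by_quantity.items():
        # Average the quantity-conditioned likelihoods over Pr(chi_j | I(h)).
        marginal = sum(l * p for l, p in zip(liks, quantity_probs))
        weighted[genotype] = marginal * priors[genotype]
    total = sum(weighted.values())
    return {g: w / total for g, w in weighted.items()}

# Two candidate genotypes and two DNA quantities; all numbers are invented.
quantity_probs = [0.7, 0.3]
lik_by_quantity = {"1,1": [0.002, 0.006], "1,Q": [0.003, 0.0005]}
priors = {"1,1": 0.01, "1,Q": 0.18}
posterior = posterior_with_quantity(lik_by_quantity, quantity_probs, priors)
```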
In a variety of situations, the DNA sample encountered is not from a single source, but rather from a mixed source. A mixed source arises from two, three or more contributors. The approach is detailed in respect of one such mixed source situation, but the approach is valid for other types of mixed source sample too.
In the evidential context for a mixed source sample arising from two contributors, the prosecution and defence will both have hypotheses. The hypotheses may be, for example:
The LR is given by the formula:
The crime profile can contain from zero to four heights depending upon the alleles contributed by the two sources and the extent to which they are reflected in the analysis results. The LR formula can be written as:
where f(C|S,Ui,Hp) is a density function of C given S and Ui and f(C|Uj,Uk,S,Hd) is a density function of C given Uj, Uk and S.
Equation 10 is a ratio of likelihoods and, therefore, the Pr's have changed to f's.
There are a number of ways in which the factors Pr(Ui|S) and Pr(Uj,Uk|S) can be computed. One is to use the formula introduced by Balding et al. (1996), also described in Balding (2005) and Buckleton (2005), which may include considerations of relatedness.
The factors f(C|S,Ui,Hp) and f(C|Uj,Uk,S,Hd) involve the same type of calculation: the likelihood of the crime profile C given two supposed donors. We denote this calculation by the term:
f(C|G1,G2) (56)
where G1 and G2 are the genotypes of the supposed donors. Hence, the situation is reduced to a factor of an equivalent general nature to that in the previous situations. This can be evaluated using the control sample approach previously described.
In the following sections a more detailed application of the approach to evidential contexts is provided for the two scenarios:
a) Suspect and victim are heterozygous, with no overlapping alleles between them and only three peaks in the crime profile; and
b) Suspect and victim are heterozygous, with one overlapping allele between them and only three peaks in the crime profile.
In both scenarios, where we are considering LR's for a locus, in respect of a sample from two sources, there are two basic hypotheses to consider:
In this situation, Hp: V+S, Hd: V+U, C={ha,hb,hc}, V={a,b} and S={c,d}. The LR is given by:
The numerator is given by:
f(C={ha,hb,hc}|V={a,b},S={c,d},Hp)=fhet(ha,hb)×fhet(hc,hd) (4.2)
The independence assumption in equation 2.2 is also made for a simulation method.
The unknown contributors for the denominator can be {a,c}, {b,c}, {c,Q}. The denominator is given by:
The factor for U={a,c} is computed with the formula:
f(C={ha,hb,hc}|V={a,b},U={a,c},Hd)=fhet(mxha,hb)×fhet((1−mx)ha,hc). (4.4)
The factor for U={b,c} is computed with the formula:
f(C={ha,hb,hc}|V={a,b},U={b,c},Hd)=fhet(ha,mxhb)×fhet((1−mx)hb,hc). (4.5)
The factor for U={c,Q} is computed with the formula:
f(C={ha,hb,hc}|V={a,b},U={c,Q},Hd)=fhet(ha,hb)×fhet(hc,hQ). (4.6)
In this situation, Hp: V+S, Hd: V+U, C={ha,hb,hc}, V={a,b} and S={b,c}. The likelihood ratio is given by:
For the numerator we have:
f(C={ha,hb,hc}|V={a,b},S={b,c},Hp)=fhet(ha,mxhb)fhet((1−mx)hb,hc). (5.2)
For the denominator we consider the following potential unknown contributors:
U ∈ {{a,c},{b,c},{c,c},{c,Q}}.
We look at the formulae per unknown contributor. We start with U={a,c}:
f(C={ha,hb,hc}|V={a,b},U={a,c},Hd)=fhet(mxha,hb)fhet((1−mx)ha,hc) (5.3)
For U={b,c}:
f(C={ha,hb,hc}|V={a,b},U={b,c},Hd)=fhet(ha,mxhb)fhet((1−mx)hb,hc). (5.4)
For U={c,c}:
f(C={ha,hb,hc}|V={a,b},U={c,c},Hd)=fhet(ha,hb)fhet(0.5hc,0.5hc) (5.5)
For U={c,Q}:
f(C={ha,hb,hc}|V={a,b},U={c,Q},Hd)=fhet(ha,hb)fhet(hc,hQ). (5.6)
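The per-contributor factors of equations 5.3 to 5.6 can be evaluated side by side once a heterozygote density is available. In the sketch below, toy_f_het is a hypothetical stand-in for fhet (two independent normal densities), h_q is an assumed height supplied for the unobserved allele Q, and the peak heights and mixing proportion are invented. It is not the pdf derived from control samples.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution; used only to build the toy pdf."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def toy_f_het(h1, h2):
    """Hypothetical stand-in for fhet: each peak height independently normal."""
    return normal_pdf(h1, 300.0, 100.0) * normal_pdf(h2, 300.0, 100.0)

def denominator_factors(ha, hb, hc, h_q, mx, f_het):
    """Factors for V={a,b} with each candidate unknown U, following
    equations 5.3 to 5.6. Keys are the genotypes of U."""
    return {
        ("a", "c"): f_het(mx * ha, hb) * f_het((1.0 - mx) * ha, hc),
        ("b", "c"): f_het(ha, mx * hb) * f_het((1.0 - mx) * hb, hc),
        ("c", "c"): f_het(ha, hb) * f_het(0.5 * hc, 0.5 * hc),
        ("c", "Q"): f_het(ha, hb) * f_het(hc, h_q),
    }

# Invented crime profile in which peak b is about twice the height of a and c.
factors = denominator_factors(ha=300.0, hb=600.0, hc=300.0,
                              h_q=30.0, mx=0.5, f_het=toy_f_het)
best_unknown = max(factors, key=factors.get)
```

With these invented heights the largest factor is the one for U={b,c}, the contributor who shares the tall peak b with the victim.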
At the core for evidential situations, and in the intelligence situations discussed below, is the calculation of the likelihood f(C|G1,G2).
This situation is dealt with in an equivalent manner to that described above for situation 5. The formulae can be modified to use the conditioning of DNA quantity.
In the intelligence context, the task is to propose an ordered list of pairs of genotypes G1 and G2 per locus so that the first pair in the list is the most likely pair of donors of the crime stain.
The starting point is the crime stain profile C. For example C={h1,h2,h3}. From this, an exhaustive list {G1,i,G2,i} of pairs of potential donors is generated. For each of these pairs, a probability distribution for the genotypes is calculated using the formula:
where Pr(G1,i,G2,i) is a prior distribution for the pair of genotypes inside the brackets that can be set to a uniform distribution or computed using the formulae introduced by Balding et al. (1996).
As with the situations 5 and 6 in the previous sections, at the core for evidential evaluation and intelligence is the calculation of the likelihood f(C|G1,G2).
Situation 8 can be handled using the approach of situation 7, but with the term fhet conditioned on DNA quantity.
As previously mentioned, at the core of the mixed source approach for evidential and intelligence contexts is the calculation of the likelihood f(C|G1,G2). The description which follows provides for the calculation of f(C|G1,G2) via a factorisation that reduces it to a calculation of likelihoods for single profiles. That approach is also extended to f(C|G1,G2,χi) to allow conditioning on DNA quantity. Furthermore, the estimation of Pr(χi|I(h)) from experimental data and its use in conjunction with these formulae is detailed.
The way in which these terms are computed depends on the alleles that the genotypes share. They are described case-by-case, starting with a method where the factorisation of likelihoods is based on conditioning on a mixing proportion. A method which conditions on DNA quantity, χ, as well as on mixing proportion mx is then provided.
Calculation of Likelihoods Via Conditioning on mx
In this case we do not need the assistance of a mixing proportion mx to factorise the likelihood of a two-person mixture into two likelihoods of single profiles. For example if the two donors are heterozygous then:
f(C={h1,h2,h3,h4}|G1={1,2},G2={3,4})=f(C={h1,h2}|G1={1,2})f(C={h3,h4}|G2={3,4}) (58)
If one donor is homozygous:
f(C={h1,h2,h3}|G1={1,2},G2={3,3})=f(C={h1,h2}|G1={1,2})f(C={h3}|G2={3,3}). (59)
If both donors are homozygotes:
f(C={h1,h2}|G1={1,1},G2={2,2})=f(C={h1}|G1={1,1})f(C={h2}|G2={2,2}). (60)
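Equations 58 to 60 share one pattern: when the donors have no allele in common, the mixture likelihood is the product of two single-profile likelihoods, each evaluated on that donor's own peaks. A minimal sketch follows, in which toy_single_likelihood is a hypothetical stand-in for f(C|G) and all heights are invented.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution; used only to build the toy likelihood."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def toy_single_likelihood(heights, genotype):
    """Hypothetical f(C | G) for one donor. Illustration only: heterozygote
    peaks are taken as N(300, 100) each, a homozygote peak as N(600, 150)."""
    a, b = genotype
    if a == b:
        return normal_pdf(heights[a], 600.0, 150.0)
    return normal_pdf(heights[a], 300.0, 100.0) * normal_pdf(heights[b], 300.0, 100.0)

def mixture_likelihood_no_shared(heights, g1, g2, f_single):
    """f(C | G1, G2) as a product of two single-profile likelihoods, valid
    only when the donors have no allele in common."""
    if set(g1) & set(g2):
        raise ValueError("donors share an allele; a mixing proportion is needed")
    part1 = {allele: heights[allele] for allele in set(g1)}
    part2 = {allele: heights[allele] for allele in set(g2)}
    return f_single(part1, g1) * f_single(part2, g2)

# The case of equation 59: one heterozygous donor {1,2}, one homozygous donor {3,3}.
crime = {1: 310.0, 2: 280.0, 3: 620.0}
value = mixture_likelihood_no_shared(crime, (1, 2), (3, 3), toy_single_likelihood)
```

The explicit check for shared alleles keeps the product form from being applied to the cases where a peak must first be split between the donors.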
If the donors share one allele, then the peak height in common, the largest peak, is split according to a mixing proportion mx. If both donors are heterozygous:
where Pr(mx) is a discrete probability distribution for the mixing proportion.
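The role of Pr(mx) when two heterozygous donors share one allele can be sketched as follows. The sketch assumes the likelihood is a sum over the discrete values of mx, each term splitting the shared peak into mx and (1−mx) parts, evaluating one heterozygote density per donor and weighting by Pr(mx). That form, the stand-in density toy_f_het and all numbers are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution; used only to build the toy pdf."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def toy_f_het(h1, h2):
    """Hypothetical stand-in for the single-profile heterozygote density."""
    return normal_pdf(h1, 300.0, 100.0) * normal_pdf(h2, 300.0, 100.0)

def shared_allele_likelihood(h_shared, h_other1, h_other2, mx_probs, f_het):
    """Sum over mixing proportions for donors G1={1,2} and G2={1,3}.

    h_shared: height of the peak for the allele common to both donors
    h_other1: height of donor 1's other allele; h_other2: donor 2's other allele
    mx_probs: mapping mx -> Pr(mx), a discrete distribution
    """
    total = 0.0
    for mx, prob in mx_probs.items():
        donor1 = f_het(mx * h_shared, h_other1)          # donor 1's share of the peak
        donor2 = f_het((1.0 - mx) * h_shared, h_other2)  # donor 2's share of the peak
        total += prob * donor1 * donor2
    return total

# Invented profile: shared peak of 600, the two unshared peaks of 300 each.
mx_probs = {0.1: 0.2, 0.3: 0.2, 0.5: 0.2, 0.7: 0.2, 0.9: 0.2}
likelihood = shared_allele_likelihood(600.0, 300.0, 300.0, mx_probs, toy_f_het)
```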
If one of the donors is homozygous:
As with scenario 2, the mixing proportion is used for factorising a two-person-mixture likelihood into two single-profile likelihoods. Both peaks are split. More specifically:
If the two donors are heterozygotes then:
where mxχi is the proportion of the DNA quantity assigned to donor 1 and (1−mx)χi is the DNA quantity assigned to donor 2. Pr(χi|I(h)) is a probability distribution on DNA quantity based on peak height information.
If one donor is homozygous:
If both donors are homozygotes:
If the donors share one allele, then the peak height in common is split according to a mixing proportion mx. If both donors are heterozygous:
where Pr(mx) is a discrete probability distribution for the mixing proportion and Pr(χi|I(h)) is a probability distribution of DNA quantity given peak information from one or more loci, possibly including the locus for which the formula will be used.
If one of the donors is homozygous:
As with previous cases, the mixing proportion is used for factorising a two-person-mixture likelihood into two single-profile likelihoods. More specifically:
By making use of 2D pdf's, the present invention provides a number of advantages and allows a variety of situations and hypotheses to be considered.
For instance, the situation of
In the context of the mixtures consideration, the principle of comparing pdf's having the same number of dimensions is also used. In this case, there are a number of 3D pdf's (such as those in equations 4.4, 4.5, 5.3, 5.4) to compare with 2D pdf's (such as those in equations 4.6, 5.5, 5.6). The 3D pdf's are moved to be 2D pdf's by deconvoluting for each allele, taking into account the mixing proportions.
Modelling according to the present invention offers a number of advantages:
D. J. Balding, M. Greenhalgh, R. A. Nichols (1996). Population genetics of STR loci in Caucasians. International Journal of Legal Medicine, 108:300-305.
D. J. Balding (2005). Weight-of-evidence for forensic DNA profiles. John Wiley & Sons Ltd.
J. Buckleton, C. M. Triggs, S. J. Walsh (2005). Forensic DNA evidence interpretation. CRC press.
A. Dempster, N. Laird, and D. Rubin (1977). “Maximum likelihood from incomplete data via the EM algorithm”. Journal of the Royal Statistical Society, Series B, 39(1):1-38.
Number | Date | Country | Kind |
---|---|---|---|
GB 0722649.1 | Nov 2007 | GB | national |
GB 0804665.8 | Mar 2008 | GB | national |
GB 0810624.7 | Jun 2008 | GB | national |
GB 0815300.9 | Aug 2008 | GB | national |