COMPUTING LIKELIHOOD RATIOS USING PEAK HEIGHTS

This application claims the benefit of Serial No. GB 0722649.1, filed Nov. 19, 2007 in Great Britain, and also of Serial No. GB 0804665.8, filed Mar. 13, 2008 in Great Britain, and also of Serial No. GB 0810624.7, filed Jun. 11, 2008, and also of Serial No. GB 0815300.9, filed Aug. 21, 2008, which applications are incorporated herein by reference. A claim of priority to all, to the extent appropriate, is made.

BACKGROUND OF THE INVENTION

This invention concerns improvements in and relating to the consideration of evidence, particularly, but not exclusively the consideration of DNA evidence.

In many situations, particularly in forensic science, there is a need to consider one piece of evidence against one or more other pieces of evidence.

For instance, it may be desirable to compare a sample collected from a crime scene with a sample collected from a person, with a view to linking the two by comparing the characteristics of their DNA. This is an evidential consideration. The result may be used directly in criminal or civil legal proceedings. Such situations include instances where the sample from the crime scene is contributed to by more than one person.

In other instances, it may be desirable to establish the most likely matches between examples of characteristics of DNA samples stored on a database with a further sample. The most likely matches or links suggested may guide further investigations. This is an intelligence consideration.

In both of these instances, it is desirable to be able to express the strength or likelihood of the comparison made, a so-called likelihood ratio.

SUMMARY OF THE INVENTION

The present invention has amongst its possible aims to establish likelihood ratios. The present invention has amongst its possible aims to provide a more accurate or robust method for establishing likelihood ratios. The present invention has amongst its possible aims to provide probability distribution functions for use in establishing likelihood ratios, where the probability distribution functions are derived from experimental data.

According to a first aspect of the invention we provide a method of comparing a test sample result set with another sample result set, the method including:

- providing information for the first result set on the one or more identities detected for a variable characteristic of DNA;
- providing information for the second result set on the one or more identities detected for a variable characteristic of DNA.

The method of comparing may be used to consider evidence, for instance in civil or criminal legal proceedings. The comparison may be as to the relative likelihoods, for instance a likelihood ratio, of one hypothesis to another hypothesis. The comparison may be as to the relative likelihoods of the evidence relating to one hypothesis to another hypothesis. In particular, this may be a hypothesis advanced by the prosecution in the legal proceedings and another hypothesis advanced by the defence in the legal proceedings. The likelihood ratio may be of the form:

$LR = \frac{\Pr (C | S, H_{p})}{\Pr (C | S, H_{d})}$

where

- C is the first or test result set from a test sample, more particularly, the first result set taken from a sample recovered from a person or location linked with a crime, potentially expressed in terms of peak positions and/or heights;
- S is the second or another result set, more particularly, the second result set taken from a sample collected from a person, particularly expressed as a suspect's genotype;
- H_p is one hypothesis, more particularly the prosecution hypothesis in legal proceedings stating “The suspect left the sample at the scene of crime”;
- H_d is an alternative hypothesis, more particularly the defence hypothesis in legal proceedings stating “Someone else left the sample at the crime scene”.

The likelihood ratio may be defined as:

$LR = \frac{\Pr (C | S, H_{p})}{\sum_{i} \Pr (C | U_{i}, S, H_{d}) \Pr (U_{i} | S, H_{d})}$

where U_i is one of the supposed sources stated in H_d.

The likelihood ratio may be defined as:

$LR = \frac{f (C | S, H_{p})}{\sum_{i} f (C | U_{i}, S, H_{d}) \Pr (U_{i} | S, H_{d})}$

where the f terms are factors that are likelihoods.
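As a minimal sketch of the expanded likelihood ratio above, the following Python function divides the numerator likelihood by the weighted sum over the alternative sources U_i. The function name `likelihood_ratio` and all numeric values are hypothetical and chosen only for illustration; they are not taken from this specification.

```python
from typing import Sequence


def likelihood_ratio(
    f_c_given_s_hp: float,
    f_c_given_u: Sequence[float],
    pr_u_given_s: Sequence[float],
) -> float:
    """LR = f(C|S,Hp) / sum_i f(C|U_i,S,Hd) * Pr(U_i|S,Hd)."""
    if len(f_c_given_u) != len(pr_u_given_s):
        raise ValueError("one probability is needed per supposed source U_i")
    # Weighted sum over the supposed sources stated in Hd.
    denominator = sum(f * p for f, p in zip(f_c_given_u, pr_u_given_s))
    if denominator <= 0.0:
        raise ValueError("denominator must be positive")
    return f_c_given_s_hp / denominator


# Illustrative (made-up) values: two alternative sources under Hd.
lr = likelihood_ratio(0.8, [0.8, 0.1], [0.02, 0.05])
```

With these made-up inputs the denominator is 0.8 × 0.02 + 0.1 × 0.05 = 0.021.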

The factor Pr(U_i|S,H_d) may be computed by using various methods, including those which may include consideration of relatedness between the suspect and the unknown contributor and/or those described in D. J. Balding (2005), “Weight-of-evidence for forensic DNA profiles”, John Wiley & Sons Ltd and/or J. Buckleton, C. M. Triggs, S. J. Walsh (2005), “Forensic DNA evidence interpretation”, CRC Press.

The factors f(C|S,H_p) and/or f(C|U_i,S,H_d) may be taken as statements of the same type of calculation. Those factors may be taken as definitions of the likelihood of observing the set of peak heights in C given a supposed donor.

The comparison may include the consideration of the term:

f(C|G)

where G denotes the supposed donor genotype. The comparison may include the consideration of this term as, or as a component of, the denominator of the likelihood ratio, and/or the consideration of this term as, or as a component of, the numerator of the likelihood ratio.

The term f(C|G) may be an estimation. The term f(C|G) may be derived from experimental data. The term f(C|G) may be an estimation derived from experimentation data. The term f(C|G) may be or include one or more probability distribution functions. The term f(C|G) may be or include one or more probability distribution functions provided according to the seventh aspect of the invention. The term f(C|G) may be or may be as further defined in the second aspect of the invention.

The method of comparing may involve a likelihood ratio which is not conditioned on quantity of DNA in the first and/or second sample.

The method of comparing may involve a likelihood ratio which is conditioned on quantity of DNA in the first and/or second sample.

The method of comparing may include defining the term f(C|G) as being defined by:

$f (C | G) = \sum_{j} f (C | G, χ_{j}) \Pr (χ_{j} | I (h))$

where Pr(χ_j|I(h)) is a probability distribution of DNA quantity given peak height information, I(h), preferably experimentally derived.

The term f(C|G) may be conditioned on DNA quantity in the test sample in this way. The term f(C|G) and/or f(C|G,χ_i) may be or may be as further defined in the third aspect of the invention.
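The conditioning on DNA quantity described above is a weighted sum over discrete quantity levels. The sketch below assumes a small, hypothetical set of three quantity levels with made-up likelihoods and probabilities; the name `marginal_likelihood` is likewise an illustrative assumption.

```python
from typing import Sequence


def marginal_likelihood(
    f_c_given_g_chi: Sequence[float],
    pr_chi_given_ih: Sequence[float],
) -> float:
    """f(C|G) = sum_j f(C|G,chi_j) * Pr(chi_j|I(h))."""
    if len(f_c_given_g_chi) != len(pr_chi_given_ih):
        raise ValueError("one likelihood is needed per DNA quantity chi_j")
    if abs(sum(pr_chi_given_ih) - 1.0) > 1e-9:
        raise ValueError("Pr(chi_j|I(h)) must sum to 1")
    return sum(f * p for f, p in zip(f_c_given_g_chi, pr_chi_given_ih))


# Made-up values for three hypothetical quantity levels.
f_c_given_g = marginal_likelihood([0.001, 0.004, 0.002], [0.2, 0.5, 0.3])
```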

The term f(C|G,χ_i) may be an estimation. The term f(C|G,χ_i) may be derived from experimental data. The term f(C|G,χ_i) may be an estimation derived from experimentation data. The term f(C|G,χ_i) may be or include one or more probability distribution functions. The term f(C|G,χ_i) may be or include one or more probability distribution functions provided according to the seventh aspect of the invention. The term f(C|G,χ_i) may be or may be as further defined in the third aspect of the invention.

The method of comparing may be used to gather information to assist further investigations or legal proceedings. The method of comparing may provide intelligence on a situation. The method of comparison may be of the likelihood of the information of the first or test sample result given the information of the second or another sample result. The method of comparison may provide a listing of possible other sample results, ideally ranked according to the likelihood. The method of comparison may seek to establish a link between a DNA profile from a crime scene sample and one or more DNA profiles stored in a database.

The method of comparison may consider a likelihood ratio. Preferably the likelihood ratio, more preferably the posterior probability, is defined as:

$\Pr (G_{i}  C) = \frac{f (C  G_{i}) \Pr (G_{i})}{\sum_{i}^{} f (C  G_{i}) \Pr (G_{i})}$

where:

C is the first or test result set from a test sample, more particularly, the first result set taken from a sample recovered from a person or location linked with a crime, potentially expressed in terms of peak positions and/or heights;

G_i is the second or another result set, more particularly, the various members of an exhaustive list of potential donor genotypes; and

Pr(G_i) is a prior distribution for genotype G_i, preferably computed for a population, particularly the population from which the test sample under consideration comes. It can be computed using the formulae introduced by Balding et al. (1996).

The comparison may include consideration of the term: f(C|G_i). The method may include any of the features, options or possibilities proposed for f(C|G) applying to f(C|G_i).
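In the intelligence context, the posterior probability above can be used to rank candidate genotypes. The following is a sketch only: the genotype labels, likelihoods and priors are made-up values, and `rank_genotypes` is a hypothetical helper name rather than anything defined in this specification.

```python
from typing import Dict, List, Tuple


def rank_genotypes(
    likelihoods: Dict[str, float],
    priors: Dict[str, float],
) -> List[Tuple[str, float]]:
    """Return (genotype, Pr(G_i|C)) pairs sorted by posterior, highest first."""
    # Numerator of the posterior for each candidate: f(C|G_i) * Pr(G_i).
    weights = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(weights.values())
    if total <= 0.0:
        raise ValueError("at least one genotype must have positive weight")
    posterior = {g: w / total for g, w in weights.items()}
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)


# Made-up likelihoods f(C|G_i) and priors Pr(G_i) for three candidates.
ranked = rank_genotypes(
    {"a,b": 0.006, "b,b": 0.002, "b,c": 0.001},
    {"a,b": 0.10, "b,b": 0.05, "b,c": 0.20},
)
```

The first entry of `ranked` corresponds to the most likely donor genotype among the candidates supplied.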

The method of comparing may involve a likelihood ratio which is not conditioned on quantity of DNA in the first and/or second sample.

The method of comparing may involve a likelihood ratio which is conditioned on quantity of DNA in the first and/or second sample.

The method of comparing may include defining the term Pr(G_i|C) as being defined by:

$\Pr (G_{i}  C) = \frac{[\sum_{j}^{} f (C  G_{i}, χ_{j}) \Pr (χ_{j}  I (h))] \times \Pr (G_{i})}{\sum_{i}^{} [\sum_{j}^{} f (C  G_{i}, χ_{j}) \Pr (χ_{j}  I (h))] \times \Pr (G_{i})}$

where Pr(χ_j|I(h)) is a probability distribution of DNA quantity given information on peak heights.

The term Pr(χ_j|I(h)) may be or may be as further defined in the fourth aspect of the invention.

The term f(C|G₁,G₂) may be or may be as further defined in the fifth aspect of the invention.

The term f(C|G₁,G₂,χ_i) may be or may be as further defined in the sixth aspect of the invention.

The method of comparing may be used in one or more of the following situations:

- 1) in an evidential context, for a single source sample, without conditioning on DNA quantity;
- 2) in an evidential context, for a single source sample, with conditioning on DNA quantity;
- 3) in an intelligence context, for a single source sample, without conditioning on DNA quantity;
- 4) in an intelligence context, for a single source sample, with conditioning on DNA quantity;
- 5) in an evidential context, for a mixed source sample, without conditioning on DNA quantity;
- 6) in an evidential context, for a mixed source sample, with conditioning on DNA quantity;
- 7) in an intelligence context, for a mixed source sample, without conditioning on DNA quantity; and
- 8) in an intelligence context, for a mixed source sample, with conditioning on DNA quantity.

According to a second aspect of the invention we provide a method of comparing a first, potentially test, sample result set with a second, potentially another, sample result set, the method including:

- providing information for the first result set on the one or more identities detected for a variable characteristic of DNA;
- providing information for the second result set on the one or more identities detected for a variable characteristic of DNA; and
- wherein the method uses the factor f(C|G) or a factor incorporating that factor, where f means this is an expression of likelihood, C is the information of the first result set and G is the information of the second result set.

The factor f(C|G) may be an expression of the likelihood of observing the information of the first result set given the information of the second result set. The factor f(C|G) may be an expression of the likelihood of observing the information from a sample taken from a crime scene given the information in a given genotype, particularly the genotype of a suspect and/or another person.

The comparison may include the consideration of this term as, or as a component of, the denominator of a likelihood ratio, and/or the consideration of this term as, or as a component of, the numerator of a likelihood ratio. The likelihood ratio may be of the form:

$LR = \frac{\Pr (C | S, H_{p})}{\Pr (C | S, H_{d})}$

where

- C is the first result set from a test sample, more particularly, the first result set taken from a sample recovered from a person or location linked with a crime, potentially expressed in terms of peak positions and/or heights;
- S is the second result set, more particularly, the second result set taken from a sample collected from a person, particularly expressed as a suspect's genotype;
- H_p is one hypothesis, more particularly the prosecution hypothesis in legal proceedings stating “The suspect left the sample at the scene of crime”;
- H_d is an alternative hypothesis, more particularly the defence hypothesis in legal proceedings stating “Someone else left the sample at the crime scene”.
  
The term f(C|G) may be an estimation. The term f(C|G) may be derived from experimental data. The term f(C|G) may be an estimation derived from experimentation data. The term f(C|G) may be or include one or more probability distribution functions. The estimation of the term f(C|G) may be provided by the consideration of dilution data, particularly data collected from heterozygous sources and/or, and preferably separately, from homozygous sources. The estimation of f(C|G) may be provided by grouping control sample analysis data from one or more loci of interest together. A single estimation of f(C|G) covering all the loci for which information is stored in a database and/or which are analysed using an analysis method and/or which are analysed using a multiplex amplification product may be provided. A separate estimation of f(C|G) may be provided for each locus. The estimation of f(C|G) may be generated using two-dimensional density estimations and/or two-dimensional probability distribution functions.

The term f(C|G) may be or include, and/or be computed using, one or more probability distribution functions provided according to the seventh aspect of the invention.

The term f(C|G) may be used to calculate the term: f(C|S,H_p) and/or the term f(C|U_i,S,H_d).

The likelihood ratio of the form:

$LR = \frac{\Pr (C | S, H_{p})}{\Pr (C | S, H_{d})}$

may be defined as:

$LR = \frac{\Pr (C | S, H_{p})}{\sum_{i} \Pr (C | U_{i}, S, H_{d}) \Pr (U_{i} | S)}$

where U_i is one of the supposed sources stated in H_d.

The likelihood ratio of the form:

$LR = \frac{\Pr (C | S, H_{p})}{\Pr (C | S, H_{d})}$

may be defined as:

$LR = \frac{f (C | S, H_{p})}{\sum_{i} f (C | U_{i}, S, H_{d}) \Pr (U_{i} | S)}$

where the f terms are factors that are likelihoods.

The factor Pr(U_i|S) may be computed by using a subpopulation structure equation, such as the subpopulation structure equation described in D. J. Balding (2005), “Weight-of-evidence for forensic DNA profiles”, John Wiley & Sons Ltd and/or J. Buckleton, C. M. Triggs, S. J. Walsh (2005), “Forensic DNA evidence interpretation”, CRC Press.

The method of comparing may consider one or more of the following scenarios for the suspect and/or information observed in the analysis:

a) Suspect is heterozygous and crime profile has one peak;

b) Suspect is heterozygous and crime profile has two peaks;

c) Suspect is heterozygous and the crime profile contains no peaks;

d) Suspect is homozygous and crime profile contains one peak;

e) Suspect is homozygous and crime profile contains no peaks.

Where the suspect is heterozygous, stated as S={a,b}, and the test sample or first sample has one peak, stated as C={h_b}, the likelihood ratio may be defined as:

$LR = \frac{f (C = \{h_{b}\} | S = \{a, b\}, H_{p})}{f (C = \{h_{b}\} | S = \{a, b\}, H_{d})}$

Preferably the numerator in this likelihood ratio is defined as:

$\begin{aligned} f (C = \{h_{a} < t, h_{b}\} | S = \{a, b\}, H_{p}) &= f_{het} (h_{a} < t, h_{b}) \\ &= \int_{0}^{t} f_{het} (h_{a}, h_{b}) \, \partial h_{a} \end{aligned}$

where h_a is a peak height or peak area or peak weight smaller than a detection threshold and so is not observed in the results.
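The integral over the unobserved peak height can be approximated numerically. The sketch below uses the trapezoidal rule and a placeholder density (independent exponential peak heights) purely so that the example is self-contained; the placeholder is an assumption and is not the experimentally derived f_het contemplated here. The names `f_het_placeholder` and `dropout_likelihood`, and the threshold and height values, are hypothetical.

```python
import math
from typing import Callable


def f_het_placeholder(h_a: float, h_b: float, mean: float = 500.0) -> float:
    """Placeholder 2-D density: independent exponential peak heights.

    An assumption standing in for an experimentally estimated f_het.
    """
    return math.exp(-(h_a + h_b) / mean) / (mean * mean)


def dropout_likelihood(
    f_het: Callable[[float, float], float],
    h_b: float,
    t: float,
    steps: int = 1000,
) -> float:
    """Approximate the integral of f_het(h_a, h_b) over h_a in [0, t]."""
    width = t / steps
    # Trapezoidal rule: half weight on the two end points.
    total = 0.5 * (f_het(0.0, h_b) + f_het(t, h_b))
    for k in range(1, steps):
        total += f_het(k * width, h_b)
    return total * width


# Made-up observed height h_b and detection threshold t.
numerator = dropout_likelihood(f_het_placeholder, h_b=300.0, t=50.0)
```

For a density estimated from control data, the same routine could be applied by passing that density in place of the placeholder.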

Preferably the denominator in this likelihood ratio is defined as:

f(C={h_b}|H_d)=f(C={h_b}|U={b,b},H_d)Pr(U={b,b}|S={a,b},H_d)+f(C={h_b}|U={b,Q},H_d)Pr(U={b,Q}|S={a,b},H_d)

where Q is any other allele identity than a.

The factors Pr(U={b,b}|S={a,b},H_d) and Pr(U={b,Q}|S={a,b},H_d) may be computed using the population substructure calculations described in D. J. Balding, M. Greenhalgh, R. A. Nichols (1996), “Population genetics of STR loci in Caucasians”, 108:300-305.

The factor may consider:

Pr(C={h_b}|U={b,b},H_d)=f_het(0.5h_b,0.5h_b)

where f_het is a two-dimensional pdf on pairs of heights originating from heterozygous donors.

The factor may consider the position as relating to or equating to the solution described below at the bottom of page 8.

The factor may consider the position as f_hom and point to a description of solutions, for instance as:

Pr(C={h_b}|U={b,b},H_d)=f_hom(h_b)

The factor may consider:

$\begin{aligned} f (C = \{h_{b}\} | U = \{b, Q\}, H_{d}) &= f_{het} (h_{b}, h_{Q} < t) \\ &= \int_{0}^{t} f_{het} (h_{b}, h_{Q}) \, \partial h_{Q} \end{aligned}$

where h_Q is smaller than the threshold t.

Where the suspect is heterozygous, stated as S={a,b}, and the first or test sample gives two peaks, stated as C={h_a,h_b}, the LR may be defined as:

$LR = \frac{f (C = \{h_{a}, h_{b}\} | S = \{a, b\}, H_{p})}{f (C = \{h_{a}, h_{b}\} | S = \{a, b\}, H_{d})}$

The numerator in this LR may be defined by:

f(C={h_a,h_b}|S={a,b},H_p)=f_het(h_a,h_b)

The denominator is given by:

f(C={h_a,h_b}|S={a,b},H_d)=f_het(h_a,h_b)Pr(U={a,b}|S={a,b},H_d)

The LR may be defined as:

$LR = \frac{1}{\Pr (U = \{a, b\} | S = \{a, b\}, H_{d})}$
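The cancellation behind this result can be written out as a short worked step, taking the numerator under H_p and the denominator under H_d as defined for the two-peak case:

```latex
LR = \frac{f_{het}(h_{a}, h_{b})}
          {f_{het}(h_{a}, h_{b}) \Pr(U = \{a, b\} | S = \{a, b\}, H_{d})}
   = \frac{1}{\Pr(U = \{a, b\} | S = \{a, b\}, H_{d})}
```

In this case the peak heights drop out, so the ratio depends only on the conditional genotype probability. For a purely hypothetical value of 0.05 for that probability, the LR would be 20.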

Where the suspect is homozygous, stated as S={b,b}, and the first or test sample provides a profile containing one allele, stated as C={h_b}, the LR may be defined as:

$LR = \frac{f (C = \{h_{b}\} | S = \{b, b\}, H_{p})}{f (C = \{h_{b}\} | S = \{b, b\}, H_{d})}$

The numerator may be given by:

f(C={h_b}|S={b,b},H_p)=f_hom(h_b) which may be a density function estimated for homozygous contributors.

The denominator may be given by:

$\begin{aligned} f (C = \{h_{b}\} | S = \{b, b\}, H_{d}) = {} & f (C = \{h_{b}\} | U = \{b, b\}, H_{d}) \Pr (U = \{b, b\} | S = \{b, b\}) \\ & + f (C = \{h_{b}, h_{Q} < Td\} | U = \{b, Q\}, H_{d}) \Pr (U = \{b, Q\} | S = \{b, b\}) . \end{aligned}$

The term: f(C={h_b}|U={b,b},S={b,b},H_d)=f_hom(h_b) is the same as the numerator.

The term:

$f (C = \{h_{b}, h_{Q} < Td\} | U = \{b, Q\}, H_{d}) = \int_{0}^{Td} f_{het} (h_{b}, h_{Q}) \, \partial h_{Q}$

can be obtained by estimating a pdf for the peak heights.

Where the suspect is homozygous, stated as S={b,b}, and the test or first sample provides a profile containing one allele, stated as C={h_b}, the likelihood for a homozygous donor may be given by:

f(C={h_b}|G={b,b})=f_hom(h_b)

where f_hom can be computed by several methods, for example by the method of rotation described elsewhere in this document and/or by the method:

$f_{hom} (h_{b}) = \int_{0}^{1} f_{het} (α h_{b}, (1 - α) h_{b}) f (α) \partial α$

where f_het(αh_b,(1−α)h_b) is a two-dimensional probability density function computed for a heterozygote and f(α) is a probability density function for α, where α is the proportion of the homozygote given by one of the alleles of the homozygote and 1−α is the proportion accounted for by the other allele of the homozygote.

The function may be approximated by:

f_hom(h_b)=f_het(αh_b,(1−α)h_b)

for an α close to 0.5.
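The integral over α and its approximation at α = 0.5 can be sketched numerically as follows. Both densities are assumptions made only so the example runs: the heterozygote density is a placeholder (independent exponential heights) and f(α) is taken to be a Beta(2,2) density. Neither is specified in this document, and the names `f_het_placeholder`, `beta22_pdf` and `f_hom` are hypothetical.

```python
import math
from typing import Callable


def f_het_placeholder(h_1: float, h_2: float, mean: float = 500.0) -> float:
    """Placeholder 2-D heterozygote density (an assumption, not fitted data)."""
    return math.exp(-(h_1 + h_2) / mean) / (mean * mean)


def beta22_pdf(alpha: float) -> float:
    """Assumed density for alpha on [0, 1]: Beta(2, 2), peaked at 0.5."""
    return 6.0 * alpha * (1.0 - alpha)


def f_hom(
    h_b: float,
    f_het: Callable[[float, float], float],
    f_alpha: Callable[[float], float],
    steps: int = 1000,
) -> float:
    """Midpoint-rule estimate of the integral over alpha in [0, 1] of
    f_het(alpha * h_b, (1 - alpha) * h_b) * f(alpha)."""
    width = 1.0 / steps
    total = 0.0
    for k in range(steps):
        alpha = (k + 0.5) * width
        total += f_het(alpha * h_b, (1.0 - alpha) * h_b) * f_alpha(alpha)
    return total * width


value = f_hom(800.0, f_het_placeholder, beta22_pdf)
# Single-point approximation with alpha = 0.5.
approx = f_het_placeholder(0.5 * 800.0, 0.5 * 800.0)
```

With this particular placeholder the density depends only on the sum of the two heights, so the integral and the α = 0.5 approximation coincide; for a density estimated from real data the two would generally differ.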

The method may include taking a crime scene DNA profile, C, and determining an ordered list of suspect's profiles, where the first profile in the list is the genotype of the most likely donor. The method may propose a list of genotypes {G₁,G₂, . . . ,G_m} and then rank those genotypes according to the posterior probability of the genotype given the crime scene profile, C.

The method of comparison may consider a likelihood ratio or more preferably posterior probability. Preferably the likelihood ratio, more preferably posterior probability, is defined as:

$\Pr (G_{i}  C) = \frac{f (C  G_{i}) \Pr (G_{i})}{\sum_{i}^{} f (C  G_{i}) \Pr (G_{i})}$

where:

G_i is the second or another result set, more particularly, the various members of an exhaustive list of potential donor genotypes; and

Pr(G_i) is a prior distribution for genotype G_i, preferably computed for a population, particularly the population from which the test sample under consideration comes. It can be set to be a uniform distribution or computed using the genotype probability formulae described in Balding (2005) and Buckleton et al. (2005), both referenced above.

The comparison may include consideration of the term: f(C|G_i). The method may include any of the features, options or possibilities proposed for f(C|G) applying to f(C|G_i).

According to a third aspect of the invention we provide a method of comparing a first, potentially test, sample result set with a second, potentially another, sample result set, the method including:

- providing information for the first result set on the one or more identities detected for a variable characteristic of DNA;
- providing information for the second result set on the one or more identities detected for a variable characteristic of DNA; and
- wherein the method uses the factor f(C|G) or a factor incorporating that factor, where

$f (C  G) = \sum_{j}^{} f (C  G, χ_{j}) \Pr (χ_{j}  I (h))$

where f means this is an expression of likelihood, C is the information of the first result set and G is the information of the second result set, where Pr(χ_j|I(h)) is a probability distribution of DNA quantity, χ_j, given a quantitative measure of the one or more identities considered in the first result set and/or second result set, I(h).

Preferably Pr(χ_j|I(h)) is a probability distribution of DNA quantity with peak height and/or peak area and/or peak weight.

Preferably the factor f(C|G,χ_i) is computed by conditioning on DNA quantity.

The factor f(C|G,χ_i) may be an expression of the likelihood of observing the information of the first result set given the information of the second result set. The factor f(C|G,χ_i) may be an expression of the likelihood of observing the information from a sample taken from a crime scene given the information in a given genotype, particularly the genotype of a suspect and/or another person.

The term f(C|G,χ_i) may be an estimation. The term f(C|G,χ_i) may be derived from experimental data. The term f(C|G,χ_i) may be an estimation derived from experimentation data. The term f(C|G,χ_i) may be or include one or more probability distribution functions. The estimation of the term f(C|G,χ_i) may be provided by the consideration of dilution data, particularly data collected from heterozygous sources and/or, and preferably separately, from homozygous sources.

The estimation of f(C|G,χ_i) may be provided by grouping together control sample analysis data for the same quantity of DNA. A separate estimation of f(C|G,χ_i) may be provided for each quantity of DNA. The quantities of DNA for which the factor f(C|G,χ_i) is estimated may be spaced across a range, preferably evenly so.

The estimation of f(C|G,χ_i) may be provided by grouping control sample analysis data from one or more loci of interest together. A single estimation of f(C|G,χ_i) covering all the loci for which information is stored in a database and/or which are analysed using an analysis method and/or which are analysed using a multiplex amplification product may be provided. A separate estimation of f(C|G,χ_i) may be provided for each locus. The estimation of f(C|G,χ_i) may be generated using two-dimensional density estimations and/or two-dimensional probability distribution functions.

The term f(C|G,χ_i) may be or include, and/or be computed using, one or more probability distribution functions provided according to the seventh aspect of the invention.

The method of comparison may consider a likelihood ratio. Preferably the likelihood ratio is defined as:

where Pr(χ_j|I(h)) is a probability distribution of DNA quantity, χ_j, given a quantitative measure of the one or more identities considered in the first result set and/or second result set, I(h), more preferably, where Pr(χ_j|I(h)) is a probability distribution of DNA quantity given information on peak heights.

According to a fourth aspect of the invention we provide a method of comparing a first, potentially test, sample result set with a second, potentially another, sample result set, the method including:

- providing information for the first result set on the one or more identities detected for a variable characteristic of DNA;
- providing information for the second result set on the one or more identities detected for a variable characteristic of DNA; and
  
- wherein the method uses the factor Pr(χ_j|I(h)) or a factor incorporating that factor, where Pr(χ_j|I(h)) is a probability distribution of DNA quantity, χ_j, given a quantitative measure of the one or more identities considered in the first result set and/or second result set, I(h).

Preferably Pr(χ_j|I(h)) is a probability distribution of DNA quantity with peak height and/or peak area and/or peak weight.

Preferably the probability distribution for Pr(χ_j|I(h)) relates to a list of discrete probabilities for χ₁ to χ_j, where Σ_i Pr(χ_i)=1. Preferably the distribution is obtained by considering the observed distribution obtained from the analysis of the control samples.

The quantitative measure may be the mean value for all observed values across all loci. The quantitative measure may be the mean of all values for a single locus. The quantitative measure may be the mean value for all the observed values for all loci for a given quantity of DNA. The quantitative measure may be the mean value for all the observed values for a single locus for a given quantity of DNA.

The quantitative measure may be peak height and/or peak area and/or peak weight.

A distribution may be estimated. The distribution may be of the form f(h|χ_i), preferably where h is the mean conditional on a DNA quantity χ_i.

The calculation of Pr(X=χ_i|I(h)) for a locus may be performed in a sequential fashion based on an ordering L₁,L₂, . . . ,L_n of the loci. The calculation may begin by setting Pr(X=χ_i|I(h)) for L₁ to a uniform distribution. Another assumed form for the distribution may be used. The calculation for the next locus, L₂, of Pr(X=χ_i|I(h)) may be based on the assumption for I(h) from L₁. The subsequent calculations for the subsequent loci may be based upon the assumption for I(h) for all the loci which precede it in the calculation, for instance Pr(X=χ_i|I(h)) for L_n is computed based on I(h) from L₁, L₂, . . . ,L_{n−1}.
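One possible reading of this sequential calculation is a Bayesian update carried from locus to locus, sketched below. This is an interpretation under stated assumptions, not a definitive implementation: the update rule (prior multiplied by f(h|χ_i) and renormalised), the normal form of `mean_height_density`, its parameters, and the example quantities and heights are all hypothetical.

```python
import math
from typing import List, Sequence


def mean_height_density(h_mean: float, chi: float) -> float:
    """Hypothetical f(h_mean | chi): normal, mean 4 * chi, standard deviation 100."""
    mu, sigma = 4.0 * chi, 100.0
    z = (h_mean - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def sequential_quantity_distributions(
    locus_mean_heights: Sequence[float],
    quantities: Sequence[float],
) -> List[List[float]]:
    """For each locus in order, return Pr(X=chi_i | I(h)) built only from
    the mean heights of the loci that precede it."""
    # Locus L1 starts from a uniform distribution over the quantities.
    current = [1.0 / len(quantities)] * len(quantities)
    per_locus = []
    for h_mean in locus_mean_heights:
        per_locus.append(current)
        # Fold this locus's height information in for the next locus.
        weights = [
            p * mean_height_density(h_mean, chi)
            for p, chi in zip(current, quantities)
        ]
        total = sum(weights)
        if total <= 0.0:
            raise ValueError("height information is incompatible with all quantities")
        current = [w / total for w in weights]
    return per_locus


# Made-up mean heights for loci L1..L3 and four candidate DNA quantities.
distributions = sequential_quantity_distributions(
    [410.0, 380.0, 450.0], [50.0, 100.0, 200.0, 500.0]
)
```

Here `distributions[0]` is the uniform starting distribution for L₁, and `distributions[2]` reflects the height information from L₁ and L₂ only.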

The term Pr(X=χ_i|I(h)) may be an estimation. The term Pr(X=χ_i|I(h)) may be derived from experimental data. The term Pr(X=χ_i|I(h)) may be an estimation derived from experimentation data. The term Pr(X=χ_i|I(h)) may be or include one or more probability distribution functions. The estimation of the term Pr(X=χ_i|I(h)) may be provided by the consideration of dilution data, particularly data collected from heterozygous sources and/or, and preferably separately, from homozygous sources.

The estimation of Pr(X=χ_i|I(h)) may be provided by grouping together control sample analysis data for the same quantity of DNA. A separate estimation of Pr(X=χ_i|I(h)) may be provided for each quantity of DNA. The quantities of DNA for which the factor Pr(X=χ_i|I(h)) is estimated may be spaced across a range, preferably evenly so.

The estimation of Pr(X=χ_i|I(h)) may be provided by grouping control sample analysis data from one or more loci of interest together. A single estimation of Pr(X=χ_i|I(h)) covering all the loci for which information is stored in a database and/or which are analysed using an analysis method and/or which are analysed using a multiplex amplification product may be provided. A separate estimation of Pr(X=χ_i|I(h)) may be provided for each locus. The estimation of Pr(X=χ_i|I(h)) may be generated using two-dimensional density estimations and/or two-dimensional probability distribution functions.

The term Pr(X=χ_i|I(h)) may be or include, and/or be computed using, one or more probability distribution functions provided according to the seventh aspect of the invention.

According to a fifth aspect of the invention we provide a method of comparing a first, potentially test, sample result set with a second, potentially another, sample result set, the method including:

- providing information for the first result set on the one or more identities detected for a variable characteristic of DNA;
- providing information for the second result set on the one or more identities detected for a variable characteristic of DNA; and
- wherein the method uses the factor f(C|G₁,G₂) or a factor incorporating that factor, where preferably f means this is an expression of likelihood, C is the information of the first result set and where G₁ and G₂ are the information of the second result set.

The factor f(C|G₁,G₂) may be used to calculate the factors f(C|S,U_i,H_p) and f(C|U_j,U_k,S,H_d). The factors f(C|S,U_i,H_p) and f(C|U_j,U_k,S,H_d) may be the expression of the likelihood of the crime profile C given two supposed sources or donors; potentially according to a hypothesis of the defence, H_d, and/or prosecution, H_p. G₁ and G₂ may be the genotypes of the supposed donors or sources.

The method may involve the consideration, particularly in an evidential context, particularly for a mixed source sample arising from two contributors, of the prosecution and defence hypotheses. The hypotheses may be: the prosecution hypothesis, H_p, that the suspect and someone else are the donors to the first result set, potentially a stain at the scene of crime; and/or the defence hypothesis, H_d, that two unknown people are the donors to the first result set, potentially a stain at the scene of crime. The hypotheses may be: the prosecution hypothesis, H_p, that the suspect and the victim are the donors to the first result set, and/or the defence hypothesis, H_d, that the victim and unknown person are the donors to the first result set. The hypotheses may be: the prosecution hypothesis, H_p, that the two suspects are the donors to the first result set, and/or the defence hypothesis, H_d, that the two unknown persons are the donors to the first result set. One or both hypotheses may involve statements of relatedness amongst the possible contributors.

The LR may be given by the formula:

$LR = \frac{\Pr (C | S, H_{p})}{\Pr (C | S, H_{d})}$

where

- C is the first result set from a test sample, more particularly, the first result set taken from a sample recovered from a person or location linked with a crime, potentially expressed in terms of peak positions and/or heights;
- S is the second result set, more particularly, the second result set taken from a sample collected from a person, particularly expressed as a suspect's genotype;
- H_p is one hypothesis, more particularly the prosecution hypothesis in legal proceedings stating “The suspect left the sample at the scene of crime”;
- H_d is an alternative hypothesis, more particularly the defence hypothesis in legal proceedings stating “Someone else left the sample at the crime scene”.

The LR formula can be written as:

$LR = \frac{\sum_{i} f (C | S, U_{i}, H_{p}) \Pr (U_{i} | S, H_{p})}{\sum_{j} \sum_{k} f (C | U_{j}, U_{k}, S, H_{d}) \Pr (U_{j}, U_{k} | S, H_{d})}$

where f(C|S,U_i,H_p) is a density function of C given S and U_i and f(C|U_j,U_k,S,H_d) is a density function of C given U_j, U_k and S.

The factors Pr(U_i|S,H_p) and Pr(U_j,U_k|S,H_d) may be computed using the formulae described in Balding (2005) and Buckleton (2005), referenced above, which may include considerations of relatedness between the unknown contributors and the suspect.
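For a two-contributor mixture, the LR formula above is a ratio of two weighted sums: one over single unknown contributors under H_p and one over pairs of unknown contributors under H_d. The sketch below illustrates the arithmetic only; the genotype labels, densities and probabilities are made-up values, and `mixture_likelihood_ratio` is a hypothetical name.

```python
from typing import Dict, Tuple


def mixture_likelihood_ratio(
    f_hp: Dict[str, float],
    pr_hp: Dict[str, float],
    f_hd: Dict[Tuple[str, str], float],
    pr_hd: Dict[Tuple[str, str], float],
) -> float:
    """LR for a two-person mixture.

    f_hp[u]  = f(C|S,U_i,Hp),       pr_hp[u]  = Pr(U_i|S,Hp)
    f_hd[jk] = f(C|U_j,U_k,S,Hd),   pr_hd[jk] = Pr(U_j,U_k|S,Hd)
    """
    numerator = sum(f_hp[u] * pr_hp[u] for u in f_hp)
    denominator = sum(f_hd[pair] * pr_hd[pair] for pair in f_hd)
    if denominator <= 0.0:
        raise ValueError("denominator must be positive")
    return numerator / denominator


# Made-up values: two candidate unknowns under Hp, two candidate pairs under Hd.
mixture_lr = mixture_likelihood_ratio(
    {"a,b": 0.004, "a,c": 0.001},
    {"a,b": 0.10, "a,c": 0.05},
    {("a,b", "c,d"): 0.004, ("a,c", "b,d"): 0.002},
    {("a,b", "c,d"): 0.01, ("a,c", "b,d"): 0.02},
)
```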

The factor f(C|G₁,G₂) may be an expression of the likelihood of observing the information of the first result set given the information of the second result set. The factor f(C|G₁,G₂) may be an expression of the likelihood of observing the information from a sample taken from a crime scene given the information in a given genotype, particularly the genotype of a suspect and/or another person.

The term f(C|G₁,G₂) may be an estimation. The term f(C|G₁,G₂) may be derived from experimental data. The term f(C|G₁,G₂) may be an estimation derived from experimentation data. The term f(C|G₁,G₂) may be or include one or more probability distribution functions. The estimation of the term f(C|G₁,G₂) may be provided by the consideration of dilution data, particularly data collected from heterozygous sources and/or, and preferably separately, from homozygous sources.

The estimation of f(C|G₁,G₂) may be provided by grouping together control sample analysis data for the same quantity of DNA. A separate estimation of f(C|G₁,G₂) may be provided for each quantity of DNA. The quantities of DNA for which the factor f(C|G₁,G₂) is estimated may be spaced across a range, preferably evenly so.

The estimation of f(C|G₁,G₂) may be provided by grouping control sample analysis data from one or more loci of interest together. A single estimation of f(C|G₁,G₂) covering all the loci for which information is stored in a database and/or which are analysed using an analysis method and/or which are analysed using a multiplex amplification product may be provided. A separate estimation of f(C|G₁,G₂) may be provided for each locus. The estimation of f(C|G₁,G₂) may be generated using two-dimensional density estimations and/or two-dimensional probability distribution functions.

The term f(C|G₁,G₂) may be or include, and/or be computed using, one or more probability distribution functions provided according to the seventh aspect of the invention.

The method may involve a consideration of one or more of the following scenarios, particularly in an evidential context:

- a) Suspect and victim are heterozygous, with no overlapping alleles between them and only three peaks in the crime profile; and
- b) Suspect and victim are heterozygous, with one overlapping allele between them and only three peaks in the crime profile.

In both scenarios, where we are considering likelihood ratios for a locus, in respect of a sample from two sources, there are two basic hypotheses which may be considered:

- H_p: The suspect (S) and the victim (V) are the originators of the crime profile, the hypothesis of the prosecution; and
- H_d: The victim (V) and an unknown (U) are the originators of the crime profile, the hypothesis of the defence.

Where there is a heterozygous suspect and victim with no overlapping alleles and three peaks in the crime profile, then H_p may be V+S, and H_d may be V+U, C={h_a,h_b,h_c}, V={a,b} and S={c,d}. The LR may be given by:

$LR = \frac{f(C = \{h_{a}, h_{b}, h_{c}\} \mid V = \{a, b\}, S = \{c, d\}, H_{p})}{f(C = \{h_{a}, h_{b}, h_{c}\} \mid V = \{a, b\}, S = \{c, d\}, H_{d})}$

The numerator in this function may be given by:

f(C={h_a,h_b,h_c}|V={a,b},S={c,d},H_p)=f_het(h_a,h_b)×f_het(h_c,h_d)

The unknown contributor considered in the denominator can be {a,c}, {b,c} or {c,Q}. The denominator may be given by:

$\begin{aligned} f(C = \{h_{a}, h_{b}, h_{c}\} \mid V = \{a, b\}, S = \{c, d\}, H_{d}) = {} & f(C = \{h_{a}, h_{b}, h_{c}\} \mid V = \{a, b\}, S = \{c, d\}, U = \{a, c\}, H_{d}) \Pr(U = \{a, c\} \mid S = \{c, d\}) \\ & + f(C = \{h_{a}, h_{b}, h_{c}\} \mid V = \{a, b\}, S = \{c, d\}, U = \{b, c\}, H_{d}) \Pr(U = \{b, c\} \mid S = \{c, d\}) \\ & + f(C = \{h_{a}, h_{b}, h_{c}\} \mid V = \{a, b\}, S = \{c, d\}, U = \{c, Q\}, H_{d}) \Pr(U = \{c, Q\} \mid S = \{c, d\}) . \end{aligned}$

The factor for U={a,c} may be computed with the formula:

f(C={h_a,h_b,h_c}|V={a,b},U={a,c},H_d)=f_het(m_x h_a,h_b)×f_het((1−m_x) h_a,h_c).

The factor for U={b,c} may be computed with the formula:

f(C={h_a,h_b,h_c}|V={a,b},U={b,c},H_d)=f_het(h_a,m_x h_b)×f_het((1−m_x) h_b,h_c).

The factor for U={c,Q} may be computed with the formula:

f(C={h_a,h_b,h_c}|V={a,b},U={c,Q},H_d)=f_het(h_a,h_b)×f_het(h_c,h_Q).
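The no-overlap scenario can be sketched numerically as follows. This is a minimal sketch under stated assumptions: `f_het` here is a placeholder density (two independent exponentials), not the experimentally derived pdf of the method; m_x is read as the share of a shared peak attributed to the victim, an interpretation since m_x is not defined in this passage; the {b,c} term splits h_b between the victim and the unknown by analogy with the {a,c} term; the heights of the unobserved peaks d and Q are set to zero; and the genotype probabilities in `priors` are invented for illustration.

```python
import math


def f_het(h1, h2, scale=100.0):
    """Placeholder heterozygote density: two independent exponentials.

    ASSUMPTION: stands in for the experimentally derived two-dimensional
    pdf; it is not the density described in the text.
    """
    return math.exp(-(h1 + h2) / scale) / scale**2


def lr_no_overlap(h_a, h_b, h_c, m_x, priors, h_d=0.0, h_q=0.0):
    """LR for V={a,b}, S={c,d} when peaks a, b and c are observed.

    priors maps each candidate unknown genotype to a hypothetical
    probability Pr(U | S).
    """
    numerator = f_het(h_a, h_b) * f_het(h_c, h_d)
    terms = {
        # Unknown shares allele a with the victim: peak a is split.
        ("a", "c"): f_het(m_x * h_a, h_b) * f_het((1 - m_x) * h_a, h_c),
        # Unknown shares allele b with the victim: peak b is split.
        ("b", "c"): f_het(h_a, m_x * h_b) * f_het((1 - m_x) * h_b, h_c),
        # Unknown carries c and an unobserved allele Q.
        ("c", "Q"): f_het(h_a, h_b) * f_het(h_c, h_q),
    }
    denominator = sum(terms[u] * priors[u] for u in terms)
    return numerator / denominator


# Hypothetical peak heights (rfu) and genotype probabilities.
priors = {("a", "c"): 0.02, ("b", "c"): 0.03, ("c", "Q"): 0.10}
lr = lr_no_overlap(h_a=300.0, h_b=280.0, h_c=120.0, m_x=0.7, priors=priors)
print(lr)
```

The placeholder density depends only on the sum of the two heights, so it is insensitive to how a shared peak is split; a density estimated from control data would distinguish the three candidate genotypes, which is where the peak-height information enters the LR.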

Where the heterozygous suspect and victim have one overlapping allele and there are three peaks in the crime profile, then H_p may be V+S, and H_d may be V+U, C={h_a,h_b,h_c}, V={a,b} and S={b,c}. The likelihood ratio may be given by:

$LR = \frac{f(C = \{h_{a}, h_{b}, h_{c}\} \mid V = \{a, b\}, S = \{b, c\}, H_{p})}{f(C = \{h_{a}, h_{b}, h_{c}\} \mid V = \{a, b\}, S = \{b, c\}, H_{d})}$

The numerator may be given by:

Pr(C={h_a,h_b,h_c}|V={a,b},S={b,c},H_p)=f_het(h_a,m_x h_b)×f_het((1−m_x) h_b,h_c).

The denominator may consider the following potential unknown contributors:

U ∈ {{a,c},{b,c},{c,c},{c,Q}}.

The function may be computed, where U={a,c}, using:

Pr(C={h_a,h_b,h_c}|V={a,b},S={b,c},U={a,c},H_d)=f_het(m_x h_a,h_b)×f_het((1−m_x) h_a,h_c).

The function may be computed, where U={b,c} using:

f(C={h_a,h_b,h_c}|V={a,b},U={b,c},H_d)=f_het(h_a,m_x h_b)×f_het((1−m_x) h_b,h_c).

The function may be computed, where U={c,c} using:

f(C={h_a,h_b,h_c}|V={a,b},U={c,c},H_d)=f_het(h_a,h_b)×f_het(0.5h_c,0.5h_c).

The function may be computed, where U={c,Q}, using:

f(C={h_a,h_b,h_c}|V={a,b},U={c,Q},H_d)=f_het(h_a,h_b)×f_het(h_c,h_Q).

The method of comparison may consider a posterior probability. Preferably posterior probability is defined as:

$\Pr(G_{1,i}, G_{2,i} \mid C) = \frac{f(C \mid G_{1,i}, G_{2,i}) \Pr(G_{1,i}, G_{2,i})}{\sum_{i} f(C \mid G_{1,i}, G_{2,i}) \Pr(G_{1,i}, G_{2,i})}$

The method may propose an ordered list of pairs of genotypes G₁ and G₂ per locus, preferably so that the first pair in the list is the most likely donor of the crime stain.

The method may start with a first set of results, for instance, crime stain profile C. The method may then generate a list, potentially an exhaustive list {G_1,i,G_2,i}, of pairs of potential donors. Preferably for each of these pairs, a probability distribution for the genotypes is calculated using the formula:

$\Pr(G_{1,i}, G_{2,i} \mid C) = \frac{f(C \mid G_{1,i}, G_{2,i}) \Pr(G_{1,i}, G_{2,i})}{\sum_{i} f(C \mid G_{1,i}, G_{2,i}) \Pr(G_{1,i}, G_{2,i})}$

where Pr(G_1,i,G_2,i) is a prior distribution for the pair of genotypes inside the brackets that can be set to be a uniform distribution or computed using the probabilities of genotypes described in Balding (2005) and Buckleton et al. (2005), referenced above.
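The ranking of genotype pairs by posterior probability might be sketched as below. The function name `rank_genotype_pairs`, the allele labels and the likelihood values are hypothetical; the uniform prior is one of the two options mentioned above, and the likelihoods f(C|G_1,i,G_2,i) are assumed to have been computed already.

```python
def rank_genotype_pairs(likelihoods, priors=None):
    """Return [(pair, posterior)] sorted from most to least probable.

    likelihoods maps each candidate pair (G1, G2) to f(C | G1, G2).
    If priors is None, a uniform prior over the listed pairs is used.
    """
    pairs = list(likelihoods)
    if priors is None:
        priors = {pair: 1.0 / len(pairs) for pair in pairs}
    joint = {pair: likelihoods[pair] * priors[pair] for pair in pairs}
    total = sum(joint.values())
    if total == 0:
        raise ValueError("all candidate pairs have zero likelihood")
    posterior = {pair: joint[pair] / total for pair in pairs}
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)


# Hypothetical likelihood values for three candidate pairs at one locus.
likelihoods = {
    (("a", "b"), ("c", "d")): 4.0e-6,
    (("a", "c"), ("b", "d")): 1.0e-6,
    (("a", "d"), ("b", "c")): 5.0e-6,
}
ranked = rank_genotype_pairs(likelihoods)
for pair, prob in ranked:
    print(pair, round(prob, 3))
```

Normalising by the sum over all listed pairs means the posterior probabilities are only meaningful relative to the candidate list supplied, which is why an exhaustive list is the natural input.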

According to a sixth aspect of the invention we provide a method of comparing a first, potentially test, sample result set with a second, potentially another, sample result set, the method including:

- providing information for the first result set on the one or more identities detected for a variable characteristic of DNA;
- providing information for the second result set on the one or more identities detected for a variable characteristic of DNA; and
- wherein the method uses the factor f(C|G₁,G₂,χ_i) or a factor incorporating that factor, where preferably f means this is an expression of likelihood, C is the information of the first result set, G₁ and G₂ are the information of the second result set, and χ_i is a quantitative measure of the one or more identities considered in the first result set and/or second result set.

The factor f(C|G₁,G₂,χ_i) may be used to calculate the factors f(C|S,U_i,H_p) and f(C|U_j,U_k,S,H_d). The factors f(C|S,U_i,H_p) and f(C|U_j,U_k,S,H_d) may be the expression of the likelihood of the crime profile C given two supposed sources or donors; potentially according to a hypothesis of the defence, H_d, and/or prosecution, H_p. G₁ and G₂ may be the genotypes of the supposed donors or sources.

The LR may be given by the formula:

$LR = \frac{\Pr(C \mid S, H_{p})}{\Pr(C \mid S, H_{d})}$

where

- C is the first result set from a test sample, more particularly, the first result set taken from a sample recovered from a person or location linked with a crime, potentially expressed in terms of peak positions and/or heights;
- S is the second result set, more particularly, the second result set taken from a sample collected from a person, particularly expressed as a suspect's genotype; and/or
- H_p is one hypothesis, more particularly the prosecution hypothesis in legal proceedings stating “The suspect is one of the contributors of the crime stain obtained”;
- H_d is an alternative hypothesis, more particularly the defence hypothesis in legal proceedings stating “Two unknown persons are the donors of the crime stain”.

The LR formula can be written as:

$LR = \frac{\sum_{i} f(C \mid S, U_{i}, H_{p}) \Pr(U_{i} \mid S)}{\sum_{j} \sum_{k} f(C \mid U_{j}, U_{k}, S, H_{d}) \Pr(U_{j}, U_{k} \mid S)}$

where f(C|S,U_i,H_p) is a density function of C given S and U_i, and f(C|U_j,U_k,S,H_d) is a density function of C given U_j, U_k and S.

The factors Pr(U_i|S,H_p) and Pr(U_j,U_k|S,H_d) may be computed by the methods described in Balding (2005) and Buckleton (2005), referenced above, which may include considerations of relatedness.

The factor f(C|G₁,G₂,χ_i) may be an expression of the likelihood of observing the information of the first result set given the information of the second result set. The factor f(C|G₁,G₂,χ_i) may be an expression of the likelihood of observing the information from a sample taken from a crime scene given the information in a given genotype, particularly the genotype of a suspect and/or another person.

The term f(C|G₁,G₂,χ_i) may be an estimation. The term f(C|G₁,G₂,χ_i) may be derived from experimental data. The term f(C|G₁,G₂,χ_i) may be an estimation derived from experimental data. The term f(C|G₁,G₂,χ_i) may be or include one or more probability distribution functions. The estimation of the term f(C|G₁,G₂,χ_i) may be provided by the consideration of dilution data, particularly data collected from heterozygous sources and/or, and preferably separately, from homozygous sources.

The estimation of f(C|G₁,G₂,χ_i) may be provided by grouping together control sample analysis data for the same quantity of DNA. A separate estimation of f(C|G₁,G₂,χ_i) may be provided for each quantity of DNA. The quantities of DNA for which the factor f(C|G₁,G₂,χ_i) is estimated may be spaced across a range, preferably evenly so.

The estimation of f(C|G₁,G₂,χ_i) may be provided by grouping control sample analysis data from one or more loci of interest together. A single estimation of f(C|G₁,G₂,χ_i) covering all the loci for which information is stored in a database and/or which are analysed using an analysis method and/or which are analysed using a multiplex amplification product may be provided. A separate estimation of f(C|G₁,G₂,χ_i) may be provided for each locus. The estimation of f(C|G₁,G₂,χ_i) may be generated using two-dimensional density estimations and/or two-dimensional probability distribution functions.

The term f(C|G₁,G₂,χ_i) may be or include, and/or be computed using, one or more probability distribution functions provided according to the seventh aspect of the invention.

According to a seventh aspect of the invention we provide a method for generating one or more probability distribution functions relating to the detected level for a variable characteristic of DNA, the method including:

a) providing a control sample of DNA;

b) analysing the control sample to establish the detected level for the at least one variable characteristic of DNA;

c) repeating steps a) and b) for a plurality of control samples to form a data set of detected levels;

d) defining a probability distribution function for at least a part of the data set of detected levels.

The method may be used to provide a probability distribution function for the detected level for control samples from heterozygous persons.

The method may be used to provide a probability distribution function for the detected level for control samples from homozygous persons.

Preferably the method is used to provide a probability distribution function for the detected level for control samples from heterozygous persons and to provide a probability distribution function for the detected level for control samples from homozygous persons. Preferably the performance of the method to provide the probability distribution function for the heterozygous persons is a separate performance to that used to provide the probability distribution function for the homozygous persons.

The detected level for the variable characteristic may be a peak height. The detected level may be a peak area. The detected level for the variable characteristic may be a peak weight. A peak weight may be defined as the molecular weight of the allele multiplied by the peak height or peak area for that allele.

The characteristic may be the allele identity or identities at a locus known to have variable short tandem repeat alleles.

The probability distribution function may be a 2-dimensional probability distribution function. The probability distribution function may be formed as a 2-dimensional probability distribution function or may be converted thereto, for instance from a one-dimensional probability distribution function or from a three-dimensional probability distribution function.

Steps a) and b) may be performed on control samples from one person, preferably a plurality of different persons and ideally at least four different persons.

Steps a) and b) may be performed on at least 20 control samples, preferably at least 100 control samples, more preferably at least 200 control samples and ideally on at least 500 control samples. These numbers of control samples may be the total number or the number for each different person from whom control samples are considered.

Steps a) and b) may be performed on one example of each control sample. Steps a) and b) may be performed on a plurality of examples of each control sample, for instance at least 20 examples of each control sample. These numbers of control samples may be the total number or the number for each different person from whom control samples are considered.

The persons who are the source of the control samples may be selected to be heterozygous with respect to the variable characteristic, particularly to generate a probability distribution function for heterozygotes.

The persons who are the source of the control samples may be selected to be homozygous with respect to the variable characteristic, particularly to generate a probability distribution function for homozygotes.

The method may be applied to one or more control samples which include different quantities of DNA. The quantities of DNA may be provided across a range of quantities, for instance at regular intervals across the range. The range may have a lower limit of 10 pg and more preferably 50 pg. The range may have an upper limit of 1000 pg and more preferably of 500 pg. The interval may be every 10 to 50 pg or potentially every 25 pg.

The analysis of the detected level for the variable characteristic is preferably provided in respect of more than one variable characteristic. The variable characteristic(s) are preferably the allele identity or allele identities present at a locus. Preferably the control samples are considered in respect of one or more loci, preferably at least 8 loci.

The probability distribution function, pdf, may be formed of one or more probability distribution functions. A probability distribution function may be provided for a group of control samples, with the control samples being divided into more than one group of control samples. The probability distribution function may be formed of four probability distribution functions, particularly in the context of the consideration of test samples from two different people.

The detected levels for the characteristic may be divided into one or more groups. One group may be those control samples whose detected level is at and/or above a threshold value. One group may be those control samples whose detected level is at and/or below a threshold value.

Where the control samples are from homozygous persons, the detected levels are preferably divided into two groups, most preferably relative to a threshold. The division of the detected levels into two groups may provide a 1-dimensional probability distribution function. It is preferred that the 1-dimensional pdf be converted to a 2-dimensional pdf.

The method used to convert a 1-dimensional pdf to a 2-dimensional pdf may include treating the pdf for a homozygous sample with height h₁ as being defined by the pdf for a heterozygous sample with heights 0.5h₁ and 0.5h₁.

Alternatively, the 1-dimensional pdf may be converted into a 2-dimensional pdf by mathematically rotating the distribution through an angle, particularly 90°. The method may include the use of a 1-dimensional pdf estimated by two components. One of the components may be a uniform distribution for the height interval. One of the components may be a probability distribution that takes positive values within that height interval range, for example an exponential distribution. The one-dimensional pdf may be defined by the formula,

$f_{H}(h) = \begin{cases} p_{0} \times \frac{1}{t} & \text{if } h \in (0, t] \\ p_{1} \times f_{H \mid H > t}(h) & \text{if } h \in (t, \infty) \end{cases} \quad (3)$

where

p₀ is the proportion of heights in the range;

p₁ is the proportion of heights above the range;

t is the detection threshold, for instance 30 rfu's.
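As a minimal sketch of formula (3), the fragment below uses a shifted exponential for the component above the threshold. The exponential form is suggested above only as an example, and the parameter `mean_excess`, the proportion p₀=0.1 and the function name `homozygote_pdf` are assumptions made for illustration; p₁ is taken as 1−p₀.

```python
import math


def homozygote_pdf(h, t, p0, mean_excess):
    """One-dimensional pdf of formula (3).

    Uniform on (0, t] with total mass p0; above t an exponential in
    (h - t) with total mass p1 = 1 - p0.
    ASSUMPTION: the exponential tail and its mean are illustrative only.
    """
    if h <= 0:
        return 0.0
    if h <= t:
        return p0 / t
    p1 = 1.0 - p0
    return p1 * math.exp(-(h - t) / mean_excess) / mean_excess


# Hypothetical parameters: threshold 30 rfu, 10% of heights below it.
below = homozygote_pdf(15.0, t=30.0, p0=0.1, mean_excess=200.0)
above = homozygote_pdf(230.0, t=30.0, p0=0.1, mean_excess=200.0)
print(below, above)
```

Because each component carries its own proportion, the two pieces integrate to p₀ and p₁ respectively, so the whole function integrates to one.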

The surface of the distribution obtained by rotation may be normalised so that the volume under the surface is 1.

The calculation of a pdf value for homozygote peak height h for a likelihood ratio calculation may be given by:

(1/V)×f_H. (4)

where V is the volume under the surface obtained by rotation of the one-dimensional pdf.

One group may be those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic. One group may be those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic. One group may be those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic. One group may be those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic.

Where the control samples are from heterozygous persons, the detected levels may be divided into four groups, particularly where the test sample to be considered is from two sources.

The threshold may be a detection level at which detection of the characteristic relative to the detection of noise signals is not possible or is impaired. The threshold may be between 10 relative fluorescence units (rfu's) and 70 rfu's, more preferably less than 60 rfu's, still more preferably 50 rfu's or less or even less than 40 rfu's.

The threshold may be different for different variable characteristics at a locus, but is preferably the same. The threshold may be different or the same for different loci.

Where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for each variable may be the values contributing to the probability distribution function defined.

Where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for the first characteristic and a value of zero for the second characteristic may be the values contributing to the probability distribution function.

Where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for the second characteristic and a value of zero for the first characteristic may be the values contributing to the probability distribution function.

Where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, a value of zero for the first characteristic and a value of zero for the second characteristic may be the values contributing to the probability distribution function.

The probability distribution function, particularly for a heterozygous source, may be provided by four probability functions.

The probability distribution function may be, or more preferably include a probability distribution function, that function relating to, or being:

$p_{1, 1} \times \frac{1}{t^{2}}$

where p_1,1 is the proportion of data where h₁<t and h₂<t, where t is the detection threshold, particularly where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for each variable may be the values contributing to the probability distribution function defined.

The probability distribution function may be, or more preferably include a probability distribution function, that function relating to, or being:

p_1,2×g_1,2(h₁,h₂)

where p_1,2 is the proportion of points where h₁≧t and h₂<t, g_1,2 is a 2-dimensional probability density function for the region where h₁≧t and h₂<t, h₁ is the height of the low-molecular allele, h₂ is the height of the high-molecular allele, particularly where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for the first characteristic and a value of zero for the second characteristic may be the values contributing to the probability distribution function. The probability distribution may be, or more preferably include a probability distribution function, that function relating to, or being:

g_1,2(h₁,h₂)=g_1,2(h₁|h₂)g_1,2(h₂)

where g_1,2(h₁|h₂) is estimated from experimental data where h₁≧t and h₂<t, and:

$g_{1, 2} (h_{2}) = \frac{1}{t} .$

The probability distribution function may be, or include, particularly for this group, an exponential distribution.

The probability distribution function may be, or more preferably include a probability distribution function, that function relating to, or being:

p_2,1×g_2,1(h₁,h₂)

where p_2,1 is the proportion of pairs of heights in the region where h₁<t and h₂≧t, g_2,1 is a 2-dimensional probability density function for the region where h₁<t and h₂≧t, h₁ is the height of the low-molecular allele, h₂ is the height of the high-molecular allele, particularly where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the detected level for the second characteristic and a value of zero for the first characteristic may be the values contributing to the probability distribution function. The probability distribution may be, or more preferably include a probability distribution function, that function relating to, or being:

g_2,1(h₁,h₂)=g_2,1(h₂|h₁)g_2,1(h₁) (A.5)

where g_2,1(h₂|h₁) is estimated from experimental data where h₁<t and h₂≧t, and:

$\begin{matrix} g_{2, 1} (h_{1}) = \frac{1}{t} . & (A .6) \end{matrix}$

The probability distribution function may be, or include, particularly for this group, an exponential distribution.

The probability distribution function may be, or more preferably include a probability distribution function, that function relating to, or being:

p_2,2×g_2,2(h₁,h₂)

where p_2,2 is the proportion of height pairs (h₁,h₂) that fall in the region where h₁≧t and h₂≧t, g_2,2 is a 2-dimensional probability density function for the region where h₁≧t and h₂≧t, h₁ is the height of the low-molecular-weight allele, h₂ is the height of the high-molecular-weight allele, particularly where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, a value of zero for the first characteristic and a value of zero for the second characteristic may be the values contributing to the probability distribution function.

One or more of the probability distribution functions may be a uniform distribution. One or more of the distribution functions may be a triangular distribution. One or more of the distribution functions may be an exponential distribution.
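The proportions p_1,1, p_1,2, p_2,1 and p_2,2 used above can be estimated from control data by counting the height pairs that fall in each region. The following is a minimal sketch; the function names and the eight height pairs are hypothetical, and the threshold of 30 rfu is simply one of the example values given in this description.

```python
from collections import Counter


def region(h1, h2, t):
    """Label a height pair with the (1,1), (1,2), (2,1), (2,2) convention.

    h1 is the low-molecular-weight peak height, h2 the high-molecular-weight
    peak height, t the detection threshold.
    """
    if h1 < t and h2 < t:
        return (1, 1)
    if h1 >= t and h2 < t:
        return (1, 2)
    if h1 < t and h2 >= t:
        return (2, 1)
    return (2, 2)


def region_proportions(pairs, t):
    """Proportion of control-sample height pairs in each of the four regions."""
    counts = Counter(region(h1, h2, t) for h1, h2 in pairs)
    n = len(pairs)
    return {key: counts.get(key, 0) / n for key in [(1, 1), (1, 2), (2, 1), (2, 2)]}


# Hypothetical heterozygote control data (rfu); not experimental values.
pairs = [(0, 0), (45, 12), (20, 60), (150, 140),
         (200, 180), (90, 110), (35, 0), (0, 0)]
props = region_proportions(pairs, t=30)
print(props)
```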

The probability distribution function, particularly for a heterozygous source, may be defined as

$f(h_{1}, h_{2}) = \begin{cases} p_{1,1} \times \frac{1}{t^{2}} & \text{if } h_{1} < t,\ h_{2} < t \\ p_{1,2} \times k_{1,2}(h_{1} - t \mid l_{1,2}) \times \frac{1}{t} & \text{if } h_{1} \geq t,\ h_{2} < t \\ p_{2,1} \times k_{2,1}(h_{2} - t \mid l_{2,1}) \times \frac{1}{t} & \text{if } h_{1} < t,\ h_{2} \geq t \\ p_{2,2} \times k_{2,2}\left(\frac{h_{1} - t + h_{2} - t}{2}, \log\left(\frac{h_{1} - t}{h_{2} - t}\right)\right) \times \frac{1}{2}\left(\frac{1}{h_{1} - t} + \frac{1}{h_{2} - t}\right) & \text{if } h_{1} \geq t,\ h_{2} \geq t \end{cases} \quad (A.7)$

where k_1,2 is an exponential probability density function, k_2,1 is an exponential probability density function, l_1,2 and l_2,1 are parameters of the exponential distributions, and preferably where:

$k_{1,2}(x \mid l_{1,2}) = \frac{1}{l_{1,2}} \times \exp\left(-\frac{x}{l_{1,2}}\right) \quad (A.8)$

is the pdf of an exponential distribution with parameter l_1,2, where x is a variable representing h₁−t, and preferably where:

$k_{2,1}(x \mid l_{2,1}) = \frac{1}{l_{2,1}} \times \exp\left(-\frac{x}{l_{2,1}}\right) \quad (A.9)$

and k_2,2 is preferably given by a mixture of two-dimensional normal distributions:

$k_{2,2}(x, y) = \sum_{i} p_{i} \times n([x, y] \mid [\mu_{x,i}, \mu_{y,i}], \Sigma_{i}) \quad (A.10)$

where p_i is the mixing proportion and n is a 2-dimensional probability density function of a normally distributed random variable, preferably where:

$\Sigma_{i} = \begin{bmatrix} \sigma_{x,i}^{2} & \rho \sigma_{x,i} \sigma_{y,i} \\ \rho \sigma_{x,i} \sigma_{y,i} & \sigma_{y,i}^{2} \end{bmatrix} \quad (A.11)$

and preferably:

$n([x, y] \mid [\mu_{x,i}, \mu_{y,i}], \Sigma_{i}) = \frac{1}{2\pi \sigma_{x,i} \sigma_{y,i}} \times \exp\left\{ -\frac{1}{2} \left( \frac{(x - \mu_{x,i})^{2}}{\sigma_{x,i}^{2}} + \frac{(y - \mu_{y,i})^{2}}{\sigma_{y,i}^{2}} \right) \right\} \quad (A.12)$

where correlation coefficient ρ=0.
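The piecewise definition (A.7), together with (A.8) to (A.12), might be evaluated as in the sketch below. All parameter values (the proportions, the exponential parameters and the single mixture component) are hypothetical, and the function names are chosen for illustration. Treating the boundary h₁=t or h₂=t in the upper region as having zero density is an assumption made here, because the logarithm of the ratio is undefined there.

```python
import math


def exp_pdf(x, l):
    """Exponential pdf with parameter l, as in (A.8) and (A.9)."""
    return math.exp(-x / l) / l


def normal2d(x, y, mu_x, mu_y, s_x, s_y):
    """Two-dimensional normal pdf with zero correlation, as in (A.12)."""
    z = ((x - mu_x) / s_x) ** 2 + ((y - mu_y) / s_y) ** 2
    return math.exp(-0.5 * z) / (2.0 * math.pi * s_x * s_y)


def k22(x, y, components):
    """Mixture (A.10); components are (p_i, mu_x, mu_y, sigma_x, sigma_y)."""
    return sum(p * normal2d(x, y, mx, my, sx, sy)
               for p, mx, my, sx, sy in components)


def f_het_regions(h1, h2, t, p, l12, l21, components):
    """Heterozygote pdf of (A.7); p maps region labels to proportions."""
    if h1 < t and h2 < t:
        return p[(1, 1)] / t**2
    if h1 >= t and h2 < t:
        return p[(1, 2)] * exp_pdf(h1 - t, l12) / t
    if h1 < t and h2 >= t:
        return p[(2, 1)] * exp_pdf(h2 - t, l21) / t
    a, b = h1 - t, h2 - t
    if a <= 0 or b <= 0:
        return 0.0  # ASSUMPTION: zero density on the boundary of region (2,2)
    jacobian = 0.5 * (1.0 / a + 1.0 / b)
    return p[(2, 2)] * k22((a + b) / 2.0, math.log(a / b), components) * jacobian


# Hypothetical parameters for illustration only.
p = {(1, 1): 0.1, (1, 2): 0.15, (2, 1): 0.15, (2, 2): 0.6}
components = [(1.0, 150.0, 0.0, 60.0, 0.3)]
value = f_het_regions(180.0, 180.0, t=30.0, p=p, l12=40.0, l21=40.0,
                      components=components)
print(value)
```

The last factor in region (2,2) is the Jacobian of the change of variables from the pair of excess heights to their mean and log-ratio, which is why it appears alongside k_2,2 in (A.7).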

Where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the probability distribution function may be fitted to the detected levels observed for the control samples.

Where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the probability distribution function may be provided by fitting a 2 dimensional planar distribution to the group and/or a distribution function whose shape represents a transition from values at the junction with the adjoining groups and/or whose shape provides a probability value of zero at zero value for detected level for both the first and second characteristic.

Where the one group is those control samples whose detected level is at and/or above a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or below a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the probability distribution function may be provided by fitting a 2 dimensional planar distribution to the group and/or a distribution function whose shape represents a transition from values at the junction with the adjoining groups.

Where the one group is those control samples whose detected level is at and/or below a threshold value in respect of a first variable characteristic, such as a higher weight characteristic, and is at and/or above a threshold value in respect of a second variable characteristic, such as a lower weight characteristic, the probability distribution function may be provided by fitting a 2 dimensional planar distribution to the group and/or a distribution function whose shape represents a transition from values at the junction with the adjoining groups.

The probability distribution function, particularly for a heterozygous source, may be provided by estimating a two dimensional pdf for a pair of heights h₁ and h₂, for instance in the space defined by mean heights, m, and heterozygote imbalance, r. The method may include each pair of heights being transformed by

$\begin{matrix} (h_{1}, h_{2}) \mapsto (m = \frac{h_{1} + h_{2}}{2}, r = \frac{h_{1}}{h_{2}}) . & (5) \end{matrix}$

Given a pdf f_M,R, a pdf in the space of pairs of heights may be obtained with the formula:

$\begin{matrix} f_{H_{1}, H_{2}} (h_{1}, h_{2}) = \frac{1}{h_{2}^{2}} (\frac{h_{1} + h_{2}}{2}) \times f_{M, R} (m, r) & (6) \end{matrix}$

where the first factor is the Jacobian of the transformation (see G. Casella & R. L. Berger (1990). Statistical Inference. Wadsworth & Brooks/Cole Advanced Books Software, Pacific Grove, Calif., USA, page 148) and the second factor can be estimated from experimental data. The estimation may be provided by obtaining the conditional distribution of heterozygous imbalance given mean height:

$\begin{matrix} f_{H_{1}, H_{2}} (h_{1}, h_{2}) = \frac{1}{h_{2}^{2}} (\frac{h_{1} + h_{2}}{2}) \times f_{R | M} (r | m) f_{M} (m) & (7) \end{matrix}$
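The change of variables in (5) and the Jacobian factor in (6) can be checked with a short sketch. The function names are illustrative, and the density `example_f_mr` passed in is a hypothetical stand-in for a pdf estimated from experimental data.

```python
import math


def to_mean_ratio(h1, h2):
    """Transformation (5): mean height m and heterozygote imbalance r."""
    return (h1 + h2) / 2.0, h1 / h2


def jacobian(h1, h2):
    """First factor of (6): |d(m, r)/d(h1, h2)| = (h1 + h2) / (2 * h2**2)."""
    return (h1 + h2) / (2.0 * h2**2)


def pdf_heights(h1, h2, f_mr):
    """Formula (6): density of (h1, h2) from a density f_mr over (m, r)."""
    m, r = to_mean_ratio(h1, h2)
    return jacobian(h1, h2) * f_mr(m, r)


def example_f_mr(m, r):
    """ASSUMPTION: arbitrary illustrative density, not an estimated one."""
    return math.exp(-m / 150.0) / 150.0 * math.exp(-r)


m, r = to_mean_ratio(120.0, 100.0)
density = pdf_heights(120.0, 100.0, example_f_mr)
print(m, r, density)
```

For the pair (120, 100) the transformation gives m = 110 and r = 1.2, and the Jacobian factor is 220/20000 = 0.011.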

The method may use this approach, and particularly this function, to provide one pdf f_R|Mfor all four regions, for example as shown in FIG. 2. The method may provide that there will be one marginal for each of the four regions, for instance marginal pdf's f_Mfor each of the regions.

The method may provided that the conditional pdf f_R|Mcan be obtained using data from region (1,2) and extrapolate to a pdf f_R|Mfor all regions. The method may provide that a simulation is provided for the material which is missing in each of these three zones. The method may include a joint pdf f_ln(M),ln(R)for variables ln(M) and ln(R) is obtained first. The method may include a conditional pdf f_R|Mbeing calculated from f_ln(M),ln(R). The method may include correcting the biasing of pdf f_R|Mby the biased introduced by the threshold, for instance 30 rfu. The method may provide that the bias is removed by replacing each f_R|Mwith a Log normal pdf and correcting in the space of parameters mean, μ, and variance, σ, of the Log normal pdf's.

The method may include estimating the two-dimensional pdf f_ln(M),ln(R) for variables ln(M) and ln(R) using the EM-algorithm, where each Gaussian has zero correlation; A. Dempster, N. Laird, and D. Rubin (1977). “Maximum likelihood from incomplete data via the EM algorithm”. Journal of the Royal Statistical Society, Series B, 39(1):1-38; G. McLachlan, D. Peel (2000). Finite mixture models. John Wiley & Sons, Inc.

The method may include a conditional distribution of R given M being calculated from the estimated f_ln(M),ln(R) using the formula:

$\begin{matrix} \begin{matrix} f_{R | M} (r | m) = \frac{f_{M, R} (m, r)}{f_{M} (m)} \\ = \frac{\frac{1}{m r} f_{\ln (M), \ln (R)} (m, r)}{\frac{1}{m} f_{\ln (M)} (m)} \\ = \frac{f_{\ln (M), \ln (R)} (m, r)}{r \times f_{\ln (M)} (m)} . \end{matrix} & (8) \end{matrix}$
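The calculation in equation (8) can be sketched for a zero-correlation Gaussian mixture as follows. This sketch assumes the log-space densities are evaluated at ln(m) and ln(r); the component tuples are hypothetical inputs of the kind an EM fit would produce, and the function names are illustrative.

```python
import math


def normal_pdf(x, mu, sigma):
    """One-dimensional Gaussian density."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def conditional_r_given_m(r, m, components):
    """Equation (8) for a zero-correlation Gaussian mixture in log space.

    components: list of (weight, mu_lnM, sd_lnM, mu_lnR, sd_lnR) tuples;
    these are illustrative inputs, e.g. the output of an EM fit.
    """
    ln_m, ln_r = math.log(m), math.log(r)
    joint = 0.0     # f_{ln(M),ln(R)} evaluated at (ln m, ln r)
    marginal = 0.0  # f_{ln(M)} evaluated at ln m
    for w, mu_m, sd_m, mu_r, sd_r in components:
        dens_m = w * normal_pdf(ln_m, mu_m, sd_m)
        marginal += dens_m
        joint += dens_m * normal_pdf(ln_r, mu_r, sd_r)
    return joint / (r * marginal)
```

Because each component has zero correlation, the joint density factorises per component, and the marginal for ln(M) reuses the same mixing proportions.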

The method may include the pdf f_ln(M) being a mixture of one-dimensional Gaussian distributions with the same mixing proportions as f_ln(M),ln(R) and means and variances given by the corresponding means and variances. The method may include several conditional pdf's being computed. The method may include providing Lognormal distributions to simulate the conditional pdf for R given M. The method may include the pdf of the Lognormal distribution being given by:

$\begin{matrix} f_{R} (r) = \frac{1}{r \times σ \sqrt{2 π}} \exp \left( - \frac{(\ln (r) - μ)^{2}}{2 σ^{2}} \right) . & (9) \end{matrix}$

The method may include correcting the bias in the space of parameters (μ,σ) of the Log normal pdf's. The method may include a number of iterations of the values for μ and/or σ. In particular, the method may include two or more iterations of the value for μ and the estimation procedure for σ.

The pdf for a value of M may be obtained by extrapolating the trend defined by two or more values of M to lower values of M. One or more, preferably several, polynomials may be fitted to values of σ in the regions of the higher M values, for instance from 80 to 200 rfu. The trend may be a line, spline or straight line.

The method may include obtaining a family of Lognormal distributions defined by:

$\begin{matrix} f_{R | M} (r | m) = \frac{1}{r \times σ (m) \sqrt{2 π}} \exp \left( - \frac{(\ln (r) - μ)^{2}}{2 [σ (m)]^{2}} \right) & (10) \end{matrix}$
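A minimal sketch of the extrapolation and of the family in equation (10) follows, assuming a straight-line trend for σ(m). The σ values listed are hypothetical; only the 80 to 200 rfu fitting range and the value μ=0.1069 are taken from this document, and the function names are illustrative.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares straight line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def lognormal_pdf(r, mu, sigma):
    """Equation (9): Lognormal density with parameters mu and sigma."""
    z = (math.log(r) - mu) / sigma
    return math.exp(-0.5 * z * z) / (r * sigma * math.sqrt(2.0 * math.pi))


def conditional_pdf(r, m, mu, trend):
    """Equation (10): Lognormal with sigma(m) read off the fitted trend."""
    a, b = trend
    return lognormal_pdf(r, mu, a + b * m)


# Hypothetical sigma estimates at mean heights of 80 to 200 rfu.
ms = [80.0, 110.0, 140.0, 170.0, 200.0]
sigmas = [0.42, 0.39, 0.36, 0.33, 0.30]
trend = fit_line(ms, sigmas)
density = conditional_pdf(1.0, 40.0, 0.1069, trend)  # extrapolated to m = 40 rfu
```

A straight line can extrapolate to non-positive σ at large m, so a practical version would restrict the range of m over which the trend is used.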

The method may include defining another component in the proposed two-dimensional distribution for peak height as pdf's for mean in each of the four regions.

For region (0,0) a uniform distribution may be defined. The pdf for mean height may be:

$\begin{matrix} f_{M} (m) = p_{(0, 0)} \times \frac{1}{t} & (11) \end{matrix}$

where p_(0,0) is the proportion of pairs of heights that fall in region (0,0) and t is the threshold, which can be 30 rfu or can be replaced by another rfu value as desired.

For region (1,0), the method may include obtaining a distribution for peak height (h₁,h₂) in this region. The method may then calculate a distribution for mean height. The method may include extracting a pdf for mean height, using the extrapolated distribution and transforming back to the space of (h₁,h₂). The method may include defining a joint distribution in this region by assuming that the variables h₁ and h₂ are independent. A uniform distribution may be assumed for (h₁,h₂). A pdf for region (1,0) may be given by:

$\begin{matrix} f_{H_{1}, H_{2}} (h_{1}, h_{2}) = p_{(1, 0)} \times \frac{1}{30} \times \frac{1}{λ_{(1, 0)}} \exp {- \frac{h_{1} - 30}{λ_{(1, 0)}}} & (12) \end{matrix}$

where p_(1,0) is the proportion of pairs of heights that fall in region (1,0) and 30 can be replaced by another rfu value as desired.

The method may include obtaining a pdf for mean height M by first using the transformation:

$\begin{matrix} (h_{1}, h_{2}) \mapsto (m = \frac{h_{1} + h_{2}}{2}, n = \frac{h_{1} - h_{2}}{2}) . & (13) \end{matrix}$

and then using the Jacobian of the transformation to obtain a two-dimensional pdf for (M,N). The method may include obtaining the marginal pdf for M through integration. The two-dimensional pdf for (M,N) may be given by:

$\begin{matrix} f_{M, N} (m, n) = 2 \times f_{H_{1}, H_{2}} (h_{1}, h_{2}) . & (14) \end{matrix}$

The two-dimensional pdf for (M,N) may be given by:

$\begin{matrix} f_{M, N} (m, n) = p_{(1, 0)} \times \frac{1}{15} \times \frac{1}{λ_{(1, 0)}} \exp {- \frac{m + n - 30}{λ_{(1, 0)}}} . & (15) \end{matrix}$

The method may provide that, for m in [15,30), the pdf for M is given by:

$\begin{matrix} f_{M} (m) = p_{(1, 0)} \times \frac{1}{15} \times [1 - \exp {- \frac{2 m - 30}{λ_{(1, 0)}}}] & (16) \end{matrix}$

where 15 is the lower quantity proportion and can be replaced by another value and 30 is the higher peak quantity proportion and can be replaced by another value. The method may include that, for m≥30, the pdf for M is given by:

$\begin{matrix} f_{M} (m) = p_{(1, 0)} \times \frac{1}{15} \times [\exp {- \frac{2 (m - 30)}{λ_{(1, 0)}}} - \exp {- \frac{2 m - 30}{λ_{(1, 0)}}}] & (17) \end{matrix}$

where 15 is the lower quantity proportion and can be replaced by another value and 30 is the higher peak quantity proportion and can be replaced by another value.
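The two branches of the marginal, equations (16) and (17), follow from integrating equation (15) over n, and can be sketched as below. This sketch assumes the branches apply for m in [15,30) and m≥30 respectively, mirroring the ranges stated for region (0,2), and writes 15 as t/2 and 30 as t; the function name and argument names are illustrative.

```python
import math


def marginal_mean_pdf(m, p, lam, t=30.0):
    """Equations (16) and (17): pdf of mean height M in region (1,0).

    p is the proportion of height pairs in the region, lam the parameter
    of the exponential pdf for h1, and t the threshold (30 rfu here).
    """
    half = t / 2.0
    if m < half:
        return 0.0  # h1 >= t forces the mean to be at least t / 2
    upper = math.exp(-(2.0 * m - t) / lam)
    if m < t:
        return p / half * (1.0 - upper)                         # (16)
    return p / half * (math.exp(-2.0 * (m - t) / lam) - upper)  # (17)
```

Integrating this function over m from 15 upwards returns the region proportion p, which is a convenient check that the two branches join correctly at m = 30.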

The method may use the same methodology or a methodology as described above for region (1,0) to obtain a pdf for mean height in region (0,2). The method may provide that the pdf for (H₁,H₂) is given by:

$\begin{matrix} f_{H_{1}, H_{2}} (h_{1}, h_{2}) = p_{(0, 2)} \times \frac{1}{30} \times \frac{1}{λ_{(0, 2)}} \exp {- \frac{h_{2} - 30}{λ_{(0, 2)}}} . & (18) \end{matrix}$

where p_(0,2) is the proportion of height pairs that fall in region (0,2), 30 can be replaced by another rfu value as desired, and the constant λ_(0,2) is the parameter of an exponential pdf estimated from values of h₂.

The method for calculating a marginal for M in region (0,2) may be the same as, or provided according to the possibilities for, region (1,0). The pdf may be given by:

$\begin{matrix} f_{M} (m) = p_{(0, 2)} \times \frac{1}{15} \times [1 - \exp {- \frac{2 m - 30}{λ_{(0, 2)}}}] & (19) \end{matrix}$

if m ∈ [15,30), and

$\begin{matrix} f_{M} (m) = p_{(0, 2)} \times \frac{1}{15} \times [\exp {- \frac{2 (m - 30)}{λ_{(0, 2)}}} - \exp {- \frac{2 m - 30}{λ_{(0, 2)}}}] & (20) \end{matrix}$

if m≥30, where 15 is the lower quantity proportion and can be replaced by another value and 30 is the higher peak quantity proportion and can be replaced by another value.

The method may provide that the pdf f_M for region (1,2) can be obtained from the two-dimensional pdf estimated for (ln(M),ln(R)). The result may be a mixture of one-dimensional Gaussians. The method may provide that the pdf for M in region (1,2) is given by:

$\begin{matrix} f_{M} (m) = p_{(1, 2)} \times \sum_{i = 1}^{4} p_{i} \times f (m | μ_{i}, σ_{i}) & (21) \end{matrix}$

where p_(1,2) is the proportion of height pairs that fall in region (1,2), p_i is the mixing proportion of the Gaussian components and f(m|μ_i,σ_i) are the Gaussian pdf's.

The method may provide for the generation of a probability distribution function for a given locus. The method may provide for the generation of a probability distribution function for each of a plurality of loci. The method may provide for the generation of a probability distribution function for a combination of loci. The combination of loci may correspond to those analysed using a PCR amplification multiplex, such as SGMplus. A plurality of different probability distribution functions may be provided for different combinations of loci. Probability distribution functions may be generated for a number of combinations corresponding to the combinations of loci analysed in different PCR amplification multiplexes.

The method may provide for the generation of a probability distribution function for a given DNA quantity. The method may provide for the generation of a probability distribution function for each of a number of different DNA quantities. Probability distribution functions may be provided for different values of DNA quantity provided across a range of DNA quantities. The different values may be evenly spaced throughout the range.

The method may provide for the generation of a probability distribution function for a given DNA quantity in respect of one locus and/or for a combination of loci. The method may provide for the generation of a probability distribution function for each of a number of different DNA quantities for one locus and/or for a combination of loci.

The one or more generated probability distribution functions may be stored for later use, for instance in a computer implemented database. The one or more generated probability distribution functions may be stored in an electronic or solid state memory. The method may include providing the one or more probability distribution functions in a computer implemented database and/or electronic memory device and/or solid state memory device. The one or more probability distribution functions may be stored at a location remote from the location of use and/or location of the device using the one or more probability distribution functions, for instance by storing on a remote access unit, for instance accessible via the Internet.

The later use of one or more probability distribution functions may be in a method according to the first aspect of the invention or its features or possibilities. The memory may be provided as a part of, in permanent connection with or in temporary connection with a device according to the second aspect of the invention or any of the possibilities provided therefor.

The one or more stored probability distribution functions may be used one or more times in subsequent method steps, such as those of the first aspect of the invention.

Any of the preceding aspects of the invention may include the following features, options or possibilities or those set out elsewhere in this document.

The method may be a computer implemented method.

The method may involve the display of information to a user, for instance in electronic form or hardcopy form.

The test sample may be a sample from an unknown source. The test sample may be a sample from a known source, particularly a known person. The test sample may be analysed to establish the identities present in respect of one or more variable parts of the DNA of the test sample. The one or more variable parts may be the allele or alleles present at a locus. The analysis may establish the one or more variable parts present at one or more loci.

The test sample may be contributed to by a single source. The test sample may be contributed to by an unknown number of sources. The test sample may be contributed to by two or more sources. One or more of the two or more sources may be known, for instance the victim of the crime.

The test sample may be considered as evidence, for instance in civil or criminal legal proceedings. The evidence may be as to the relative likelihoods, a likelihood ratio, of one hypothesis to another hypothesis. In particular, this may be a hypothesis advanced by the prosecution in the legal proceedings and another hypothesis advanced by the defence in the legal proceedings.

The test sample may be considered in an intelligence gathering method, for instance to provide information to further investigative processes, such as evidence gathering. The test sample may be compared with one or more previous samples or the stored analysis results therefor. The test sample may be compared to establish a list of stored analysis results which are the most likely matches therewith.

The test sample and/or control samples may be analysed to determine the peak height or heights present for one or more peaks indicative of one or more identities. The test sample and/or control samples may be analysed to determine the peak area or areas present for one or more peaks indicative of one or more identities. The test sample and/or control samples may be analysed to determine the peak weight or weights present for one or more peaks indicative of one or more identities. The test sample and/or control samples may be analysed to determine a level indicator for one or more identities.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments of the invention will now be described, by way of example only, and with reference to the accompanying drawings in which:

FIG. 1 is a plot of the peak height for the low molecular weight allele against peak height for the high molecular weight allele for a series of control samples;

FIG. 2 is a graphical representation of the sample space of peak heights based upon the data of FIG. 1;

FIG. 3 is a statistical histogram of peak heights according to an exponential distribution;

FIG. 4 is a probability distribution function for an example;

FIG. 5 is a two-dimensional pdf obtained from the example of FIG. 4;

FIG. 6 is an illustration of the peak heights in a crime profile relative to a threshold;

FIG. 7a is an illustration of a frequency distribution for allele peak height, including a detection threshold;

FIG. 7b is an illustration of the fitting of one distribution type to the data below the detection threshold of FIG. 7a;

FIG. 7c is an illustration of the fitting of an alternative distribution type to the data below the detection threshold of FIG. 7a;

FIG. 8 is a scatter plot of mean height and heterozygote balance in a natural logarithm scale;

FIG. 9a shows estimated and Log normal pdf's for mean heights with rfu threshold valued at 50 rfu;

FIG. 9b shows estimated and Log normal pdf's for mean heights with rfu threshold valued at 100 rfu;

FIG. 9c shows estimated and Log normal pdf's for mean heights with rfu threshold valued at 500 rfu;

FIG. 9d shows estimated and Log normal pdf's for mean heights with rfu threshold valued at 1,000 rfu;

FIGS. 10a and 10b show Log normal parameters computed from estimated pdf for heterozygote balance given mean height;

FIGS. 11a and 11b show Log normal parameters computed from estimated pdf for heterozygote balance given mean height whilst setting μ=0.1069;

FIG. 12 shows corrected σ through an estimated linear trend;

FIG. 13 is a probability distribution for H₁ in region (1,0); the estimated parameter λ_(1,0)=26.24; the estimation was done by subtracting 29 from the height values;

FIG. 14 shows the state space of (M,N) as the area within the rectangle;

FIG. 15 is a probability distribution for H₂ in region (0,2); the estimated parameter λ_(0,2)=25.30; the estimation was done by subtracting 30 from the height values;

FIG. 16 is an illustration of the probability of dropout as a function of peak height of the remaining allele; and

FIG. 17 shows the variation in LR, for locus D2 where the suspect is heterozygous and there is only one peak in the crime stain, with peak height.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT(S)

The present invention is concerned with improving the interpretation of DNA analysis. Basically, such analysis involves taking a sample of DNA and analysing the variations present at a number of loci. The identities of the variations give rise to a profile which is then interpreted. The extent of interpretation required can be extensive and/or introduce uncertainties. This is particularly so where the DNA sample contains DNA from more than one person, a mixture.

There is often a need to consider various hypotheses for the identities of the persons responsible for the DNA and evaluate the likelihood of those hypotheses, evidential uses.

There is often a need to consider the analysis genotype against a database of genotypes, so as to establish a list of stored genotypes that are likely matches with the analysis genotype, intelligence uses.

The present invention provides a mathematical specification of a model for computing likelihood ratios (LRs) that uses peak heights taken from such DNA analysis.

The approach of the present invention draws on an estimation of a two-dimensional, 2D, probability density function, pdf, which is estimated from the heights, or areas, of peaks observed after the analysis of control samples. Such pdf's may be generated from heterozygous donors and separately from homozygous donors.

The invention goes on to use the approach to calculate the probability of dropout and achieve other benefits.

As a first part of the explanation, exemplary methods for generating the pdf's are discussed.

Generating a 2D Probability Distribution Function for Heterozygous Donors—Method One

A key part of the present invention is the generation and then use of 2D probability distribution functions, pdf's, relating to peak heights. These are obtained through the analysis of a large number of control samples.

The analyses of the control samples used to support the present approach need to consider the presence of a variety of levels of DNA within the control samples because those various levels, and others, are encountered in unknown source samples of interest. When an unknown source sample is collected for DNA analysis, there is an aim to collect a given amount of DNA. However, the amount actually present in the unknown source sample varies around that amount because of a variety of factors.

To account for this, control pre-samples from a number of heterozygous donors were obtained and then a variety of control samples with different DNA amounts in them were generated from these control pre-samples. The control samples were then analysed to establish the alleles present and the peak heights therefor. This control sample investigation process was repeated for each locus of interest in the DNA analysis or comparison process to be improved upon by the invention. Thus different loci will need to be considered for different analysis approaches and/or for databases which store information on different loci.

Separate pdf's are beneficially generated for heterozygous and homozygous situations.

As an example of the control sample investigation process, four heterozygous donors may be used. Control samples having different amounts of DNA present are then generated over the range 50 pg to 500 pg in 25 pg steps. 200 repeats of the pre-control samples were provided from each donor and used to form the control sample sets.

The results of the analysis of these control samples can be considered by plotting:

- on one axis the amount of DNA detected in the analysis, expressed as the peak height for the lower molecular weight allele at the locus which is observed, h₁; and
- on the other axis the amount of DNA detected in the analysis, expressed as the peak height for the higher molecular weight allele at the locus which is observed, h₂.

A set of results using this approach is shown schematically in FIG. 1. As a general observation, the h₁ and h₂ levels are distributed around an equivalent level (the 45° dot-dash-dot line), with the peak height increasing as the amount of DNA in the sample increases.

When considering the peak heights, the detection of a height relative to the noise signal in the analysis becomes difficult below a threshold level. As a result, it is normal to discount peak heights below a predetermined threshold, t, as being unreliable to consider further. As we know the control samples are from people who are heterozygous for the locus in question, we know that these non-observed peaks correspond to dropout of the allele from the analysis results. The allele is present in the DNA, but not present in the analysis results for the DNA.

The use of these thresholds, t, is provided for in the FIG. 1 plot. When the value for one of the h₁ or h₂ values is below the threshold, then the point is plotted on the relevant axis as a zero value. When both are below the threshold, the point is plotted at the junction of the axes; a zero value for both.

The distribution of FIG. 1 can be considered further in the schematic form illustrated in FIG. 2. This divides the results of the analysis of the control samples into four blocks which are defined by the axes and the threshold values, t₁ and t₂. The region (1,1) defines that block in which neither h₁ nor h₂ is above the threshold t. The region (1,2) defines that block in which the higher molecular weight allele, h₂, is observed below the threshold t, but the lower molecular weight allele, h₁, is above the threshold, t. The region (2,1) defines that block in which the lower molecular weight allele, h₁, is observed below the threshold t, but the higher molecular weight allele, h₂, is above the threshold, t. The region (2,2) defines that block in which the alleles, h₁ and h₂, are both above their respective thresholds t₁ and t₂.

In further considerations, both t₁ and t₂ have the same value and are represented as t, but different values could be used for the lower and higher weight alleles and/or between loci. The threshold could be 50 relative fluorescence units, rfu's, or as low as 20 rfu's.

The 2D pdf's for peak heights, f_het(h₁,h₂), where that function is a probability density function of heterozygous alleles with height values h₁ and h₂, can be defined piece-wise according to the four different regions identified above; p_1,1 being for region (1,1); p_1,2 for region (1,2); p_2,1 for region (2,1); and p_2,2 for region (2,2).

This approach gives the following definitions for the four regions of FIG. 2.

$\begin{matrix} f_{het} (h_{1}, h_{2}) = \left\{ \begin{matrix} p_{1, 1} \times \frac{1}{t^{2}} & \text{if } h_{1} < t \text{ and } h_{2} < t \\ p_{1, 2} \times g_{1, 2} (h_{1}, h_{2}) & \text{if } h_{1} > t \text{ and } h_{2} < t \\ p_{2, 1} \times g_{2, 1} (h_{1}, h_{2}) & \text{if } h_{1} < t \text{ and } h_{2} > t \\ p_{2, 2} \times g_{2, 2} (h_{1}, h_{2}) & \text{if } h_{1} > t \text{ and } h_{2} > t \end{matrix} \right. & (A.1) \end{matrix}$

The form of these definitions is now considered further:

Definition of g_1,2(h₁,h₂)

This is the joint distribution of h₁ and h₂ in region (1,2) of FIG. 2. The definition of g_1,2(h₁,h₂) is provided through factorising the distribution as follows:

$\begin{matrix} g_{1, 2} (h_{1}, h_{2}) = g_{1, 2} (h_{1} | h_{2}) g_{1, 2} (h_{2}) & (A.2) \end{matrix}$

where g_1,2(h₁|h₂) is estimated from experimental data where h₁>t and h₂<t, and:

$\begin{matrix} g_{1, 2} (h_{2}) = \frac{1}{t} . & (A .3) \end{matrix}$

The reasoning behind this definition of g_1,2 follows from the calculation:

$\begin{matrix} \Pr (h_{2} < t) = \int_{0}^{t} p_{1, 2} g_{1, 2} (h_{2}) \, d h_{2} = p_{1, 2} \int_{0}^{t} \frac{1}{t} \, d h_{2} = p_{1, 2} \frac{t}{t} = p_{1, 2} . & (A.4) \end{matrix}$

The interpretation is that we only know that the peak height h₂ is below the threshold, t, and so all values in the interval (0,t) have equal probability.

Definition of g_2,1(h₁,h₂)

This is the joint probability of h₁ and h₂ in region (2,1) and follows a matching form to the previous definition. Hence, it is provided through factorising the distribution in a similar manner and as follows:

$\begin{matrix} g_{2, 1} (h_{1}, h_{2}) = g_{2, 1} (h_{2} | h_{1}) g_{2, 1} (h_{1}) & (A.5) \end{matrix}$

where g_2,1(h₂|h₁) is estimated from experimental data where h₁<t and h₂>t and:

$\begin{matrix} g_{2, 1} (h_{1}) = \frac{1}{t} . & (A .6) \end{matrix}$

Definition of g_2,2(h₁,h₂)

This is the joint probability of h₁ and h₂ in region (2,2); it is provided by an estimate of a 2D pdf from experimental data where h₁>t and h₂>t.

For the three regions and their distributions, the following distributions are presently estimated as applying:

g_1,2(h₁,h₂)—as a preliminary estimation we use an exponential distribution.

g_2,1(h₁,h₂)—as a preliminary estimation we use an exponential distribution.

g_2,2(h₁,h₂)—use a transformation of the data and then a 2D estimation.

The overall result of these definitions is that the 2D pdf for peak heights, f_het(h₁,h₂), is given by:

$\begin{matrix} f_{het} (h_{1}, h_{2}) = \left\{ \begin{matrix} p_{1, 1} \times \frac{1}{t^{2}} & \text{if } h_{1} < t \text{ and } h_{2} < t \\ p_{1, 2} \times k_{1, 2} (h_{1} - t | l_{12}) \times \frac{1}{t} & \text{if } h_{1} \geq t \text{ and } h_{2} < t \\ p_{2, 1} \times k_{2, 1} (h_{2} - t | l_{21}) \times \frac{1}{t} & \text{if } h_{1} < t \text{ and } h_{2} \geq t \\ p_{2, 2} \times k_{2, 2} \left( \frac{h_{1} - t + h_{2} - t}{2}, \log \left( \frac{h_{1} - t}{h_{2} - t} \right) \right) \times \frac{1}{2} \left( \frac{1}{h_{1} - t} + \frac{1}{h_{2} - t} \right) & \text{if } h_{1} \geq t \text{ and } h_{2} \geq t \end{matrix} \right. & (A.7) \end{matrix}$

where, k_1,2; k_2,1; l₁₂; l₂₁; are described above, and where:

$\begin{matrix} k_{1, 2} (x | l_{12}) = \frac{1}{l_{12}} \times \exp {- \frac{x}{l_{12}}}, & (A .8) \end{matrix}$

is the pdf of an exponential distribution with parameter l_1,2, where x is the mixing proportion and:

$\begin{matrix} k_{2, 1} (x | l_{2, 1}) = \frac{1}{l_{2, 1}} \times \exp {- \frac{x}{l_{2, 1}}} . & (A .9) \end{matrix}$

and k_2,2 is described above and the function k_2,2 is given by a mixture of two-dimensional normal distributions:

$\begin{matrix} k_{2, 2} (x, y) = \sum_{i} p_{i} \times n ([x, y] | [μ_{x, i}, μ_{y, i}], \Sigma_{i}) & (A.10) \end{matrix}$

where y and n are described above, where:

$\begin{matrix} \Sigma_{i} = \left[ \begin{matrix} σ_{x, i}^{2} & ρ σ_{x, i} σ_{y, i} \\ ρ σ_{x, i} σ_{y, i} & σ_{y, i}^{2} \end{matrix} \right] & (A.11) \end{matrix}$

and where ρ=0.

Once determined, the 2D pdf for peak heights, f_het(h₁,h₂), can be stored and made use of in a large number of subsequent considerations of likelihood ratios. As previously mentioned, such pdf's may be generated for each locus under consideration or potential consideration. In some cases, it may be necessary to determine a particular 2D pdf for peak heights, f_het(h₁,h₂), for a particular situation, as part of the analysis process.
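The piece-wise definition of equation (A.7) can be sketched as a single function, as below. This is an illustration under stated assumptions: the logarithm is taken to be natural, the density is set to zero on the edge of region (2,2) where the log ratio is undefined, and all names, proportions and mixture values are hypothetical.

```python
import math


def expon_pdf(x, scale):
    """Equations (A.8) and (A.9): exponential pdf with the given parameter."""
    return math.exp(-x / scale) / scale


def normal_pdf(x, mu, sd):
    """One-dimensional Gaussian density."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def k22(x, y, mixture):
    """Equation (A.10) with rho = 0: product of two 1D normals per component."""
    return sum(w * normal_pdf(x, mx, sx) * normal_pdf(y, my, sy)
               for w, mx, sx, my, sy in mixture)


def f_het(h1, h2, t, p, l12, l21, mixture):
    """Equation (A.7): piece-wise 2D pdf for heterozygote peak heights.

    p maps the region labels (1,1), (1,2), (2,1), (2,2) to proportions.
    """
    if h1 < t and h2 < t:
        return p[(1, 1)] / t ** 2
    if h2 < t:                        # h1 >= t: region (1,2)
        return p[(1, 2)] * expon_pdf(h1 - t, l12) / t
    if h1 < t:                        # h2 >= t: region (2,1)
        return p[(2, 1)] * expon_pdf(h2 - t, l21) / t
    a, b = h1 - t, h2 - t
    if a == 0.0 or b == 0.0:
        return 0.0                    # assumption: log ratio undefined on the edge
    jac = 0.5 * (1.0 / a + 1.0 / b)   # Jacobian of (a, b) -> (mean, log ratio)
    return p[(2, 2)] * k22((a + b) / 2.0, math.log(a / b), mixture) * jac
```

A function of this kind could be evaluated once per locus and stored, since the proportions and fitted parameters do not change between casework samples.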

Generating a 2D Probability Distribution Function for Homozygous Donors—Method One

In the previous section, an illustrative method for estimating a 2D pdf for the case where the source is heterozygous was provided. In this section we deal with the situation where the source is homozygous.

In this case, the probability distribution function, pdf, starts out as a 1D plot. There is only a value for h₁ as only one peak is observed. Thus above the threshold, t, this has a value and below it is represented at the zero value.

To enable compatibility between situations where the likelihood is being considered in 2D (based on the pdf's developed in the section above and the likelihood evaluation discussed in sections which follow below) the pdf for the homozygous situation also has to be defined in 2D. Otherwise, the likelihoods in the 2D heterozygous approach will be inherently less likely than in a 1D homozygous approach.

A number of methods for converting 1D pdf's to 2D pdf's exist and can be used.

The basis for this method is the observation that a homozygous sample with height h₁is similar to observing a heterozygous sample with approximate heights 0.5h₁and 0.5h₁.

Generating a 2D Probability Distribution Function for Homozygous Donors—Method Two—Approach a

It is also possible to convert the 1D pdf into a 2D pdf by mathematically rotating the distribution through, say, 90°. The result is a pdf in 2D which can be used in the manner described below for homozygous donors and as an alternative to method one for homozygous donors.

The estimation of a one-dimensional pdf has two components: a uniform distribution for the interval below the threshold, for instance height values in the range 0 to 30, and a probability distribution that takes positive values above that interval, for example the exponential distribution of FIG. 3. FIG. 3 shows a statistical histogram of the peak heights. The plot line represents an exponential pdf estimated from the data as an example.

The one-dimensional pdf is given by the formula,

$\begin{matrix} f_{H} (h) = \left\{ \begin{matrix} p_{0} \times \frac{1}{30} & \text{if } h \in (0, 30] \\ p_{1} \times f_{H | H > 30} (h) & \text{if } h \in (30, \infty) \end{matrix} \right. & (22) \end{matrix}$

where

p₀ is the proportion of heights in (0, 30];

p₁ is the proportion of heights in (30, ∞).

For example, if f_H|H>30 is an exponential pdf, then

$\begin{matrix} f_{H} (h) = \left\{ \begin{matrix} p_{0} \times \frac{1}{30} & \text{if } h \in (0, 30] \\ p_{1} \times \frac{1}{λ} \exp \left( - \frac{h - 30}{λ} \right) & \text{if } h \in (30, \infty) \end{matrix} \right. & (23) \end{matrix}$

where λ is the estimated parameter for the exponential distribution using known methods of estimation, e.g. maximum likelihood estimation methods readily available in statistical packages. The overall pdf of an example for the case when f_H|H>30 is an exponential distribution is plotted in FIG. 4.

The idea is to rotate the one-dimensional pdf above to take non-zero values in R⁺×R⁺. The surface obtained in this manner needs to be normalised so that the volume under the surface is one. FIG. 5 shows a two-dimensional pdf obtained from the example using an exponential distribution.

The calculation of a pdf value for homozygote peak height h for a likelihood ratio calculation is given by

$\begin{matrix} \frac{1}{V} \times f_{H} (h) & (24) \end{matrix}$

where V is the volume under the surface obtained by rotation of the one-dimensional pdf.
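One way to read the rotation is as a quarter turn of the one-dimensional pdf about the vertical axis, giving the surface f_H(√(x²+y²)) over x, y ≥ 0. That reading is an assumption made for this sketch rather than something stated in the text; under it, V = (π/2)∫ρ f_H(ρ) dρ, which has a closed form for the exponential example of equation (23). Function names are illustrative.

```python
import math


def pdf_1d(h, p0, p1, lam, t=30.0):
    """Equation (23): uniform up to the threshold, exponential above it."""
    if h <= 0.0:
        return 0.0
    if h <= t:
        return p0 / t
    return p1 * math.exp(-(h - t) / lam) / lam


def rotation_volume(p0, p1, lam, t=30.0):
    """Volume under the surface swept by a quarter turn about the vertical axis.

    Assumes the surface is f_H(sqrt(x^2 + y^2)) over x, y >= 0, so that
    V = (pi / 2) * integral of rho * f_H(rho) d rho.
    """
    return (math.pi / 2.0) * (p0 * t / 2.0 + p1 * (t + lam))


def pdf_homozygote(h, p0, p1, lam, t=30.0):
    """Equation (24): rescale the 1D pdf by 1 / V."""
    return pdf_1d(h, p0, p1, lam, t) / rotation_volume(p0, p1, lam, t)
```

If a different rotation or surface is intended, only `rotation_volume` changes; the rescaling in equation (24) stays the same.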

Generating a 2D Probability Distribution Function for Homozygous Donors—Method Two—Approach b

That part of the distribution which relates to data left-censored by the rfu threshold, t, is discussed in approach a, with approach b offering an alternative.

In this approach, a family of distributions is chosen, f(h|θ), where θ represents the parameters determining the distributions and the parameters encoded in θ can be either one-dimensional or multi-dimensional. The type of distribution chosen can be one of numerous possibilities, such as Exponential, Weibull, Gamma or other distributions with positive support H. For example, an Exponential distribution might be chosen, thus providing:

- f(h|θ)=θe^(−θh), where θ can have different values, for instance 0.1, 0.11 etc., to provide the family.

The Likelihood of the censored data can then be established by the function:

$L(θ | \{h_{i} : i = 1, 2, \ldots, n\}) = F(t | θ)^{n_{1}} \times \prod_{i = 1}^{m} f(h_{i} | θ)$, where

there are n₁ heights recorded as zero due to left-censoring and m heights observed above the threshold, F is the cumulative probability distribution and f is the probability density function of the Exponential distribution, and t is the detection threshold.

The family member for θ which gives the maximum value of the likelihood is chosen, that is, the maximum likelihood estimate (MLE) for θ.
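The selection of the family member can be sketched as a search over candidate values of θ, as below. This sketch assumes the standard left-censored form in which each censored height contributes a factor F(t|θ); the data, the candidate grid and the function names are hypothetical.

```python
import math


def censored_log_likelihood(theta, observed, n_censored, t):
    """Log-likelihood for Exponential heights left-censored at threshold t.

    Assumes each censored height contributes a factor F(t | theta).
    """
    log_f = sum(math.log(theta) - theta * h for h in observed)
    log_cdf = math.log(1.0 - math.exp(-theta * t)) if n_censored else 0.0
    return n_censored * log_cdf + log_f


def grid_mle(observed, n_censored, t, candidates):
    """Pick the family member (value of theta) with the largest likelihood."""
    return max(candidates,
               key=lambda th: censored_log_likelihood(th, observed, n_censored, t))


# Hypothetical data: four observed heights and three dropouts at t = 30 rfu.
heights = [40.0, 60.0, 80.0, 100.0]
family = [k * 1e-4 for k in range(10, 1001)]   # theta from 0.001 to 0.1
theta_hat = grid_mle(heights, 3, 30.0, family)
```

With no censored heights the estimate reduces to the reciprocal of the mean height; including the dropouts pulls the estimate towards larger θ, that is, towards smaller typical heights.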

Generating a 2D Probability Distribution Function for Heterozygous Donors—Method Two—Approach a

An alternative method for establishing a 2D pdf in the context of heterozygous donors is now described, starting with an explanation using a 1D analogy, followed by an expansion of the approach to detail its working in 2D.

In the consideration of heterozygous analysis results for the control samples, the two alleles may result in peaks of different heights. As previously described, there is a chance that one or both peaks may be below a threshold level, t, at which the peak can safely be identified or even identified at all. FIG. 6 illustrates a full analysis signal, A, and includes a representation of the two peaks, PK1 and PK2, which are actually present. In the case of PK2, sufficient signal is detected to discern and call that peak as “observed”. However, in the case of peak PK1, the signal is so close to and mixed in with the noise signal that no peak can be discerned. This is the natural cut off at 20 to 30 rfu's mentioned above. A 50 rfu cut off is often applied to exclude peaks which may be discernable, but which cannot be absolutely distinguished from noise.

The result is that a distribution of data above the threshold, t, is established, but that the form of the distribution below the threshold, t, is not known. The position of the frequency distribution of FIG. 7a applies.

In the case of method 1 above, a uniform frequency for the distribution is applied to the unknown section below the threshold, see FIG. 7b.

In this alternative approach, a non-uniform distribution is applied to the region below the threshold, t. In the illustrated example of FIG. 7c, a triangular distribution is employed. Other distribution shapes can be applied to this region.

This approach can now be extended to the four regions described in the context of FIG. 2 and hence into 2D.

The region (2,2) has a distribution which is fully known from the analysis results for the control samples. For the region (1,1) it is possible to use a 2D plane or other distribution shape which represents a transition from the region (2,2) distribution at h₁=t, h₂=t down to zero at the junction of the axes and which also provides a transition from the values of region (1,2) along junction h₁=t, h₂=0 to t and from the values of region (2,1) along the junction h₂=t, h₁=0 to t. The regions (1,2) and (2,1) themselves can be further distributions representing a combination of the distribution applied in region (2,2) and that applied in region (1,1).

Generating a 2D Probability Distribution Function for Heterozygous Donors—Method Two—Approach b

As described above, a profile from a heterozygous donor in a locus typically consists of two peaks. However, if a profile is obtained from small amounts of DNA, it may contain either one peak or no peaks in the observed result. This is because a peak cannot be detected if it is low and close to the baseline. In this example we use 30 rfu as the minimum detection limit, but other detection limits can be applied to the observed results.

The support of the two-dimensional pdf is (ℝ⁺∪{0})×(ℝ⁺∪{0}). The pair of heights (h₁,h₂) is in effect censored by the threshold t, for instance 30 rfu, which divides the support into four regions, see FIG. 2. The peak height h₁ of the low molecular weight allele is plotted on the x-axis and the peak height h₂ of the high molecular weight allele is plotted on the y-axis. In region (0,0) both peaks are below t, for instance 30 rfu; in region (1,0) h₁ is greater than or equal to t, whilst h₂ is less than t; in region (0,2) the opposite is the case; and in region (1,2) both peaks are greater than or equal to t.

The estimation of a two-dimensional pdf is done in the space defined by mean height, m, and heterozygote imbalance, r. Each pair of heights is transformed by

$\begin{matrix} (h_{1}, h_{2}) \mapsto (m = \frac{h_{1} + h_{2}}{2}, r = \frac{h_{1}}{h_{2}}) . & (25) \end{matrix}$

Given a pdf f_M,R, a pdf in the space of pairs of heights can be obtained with the formula:

$\begin{matrix} f_{H_{1}, H_{2}} (h_{1}, h_{2}) = \frac{1}{h_{2}^{2}} (\frac{h_{1} + h_{2}}{2}) \times f_{M, R} (m, r) & (26) \end{matrix}$

where the first factor is the Jacobian of the transformation (G. Casella & R. L. Berger (1990). Statistical Inference. Wadsworth & Brooks/Cole Advanced Books Software, Pacific Grove, Calif., USA, page 148) and the second factor can be estimated from experimental data. The estimation is done by first obtaining the conditional distribution of heterozygous imbalance given mean height:

$\begin{matrix} f_{H_{1}, H_{2}} (h_{1}, h_{2}) = \frac{1}{h_{2}^{2}} (\frac{h_{1} + h_{2}}{2}) \times f_{R | M} (r | m) f_{M} (m) & (27) \end{matrix}$

This will provide one pdf f_R|M for all regions, but there will be one marginal for each of the regions. In the rest of the section, the estimation of the conditional pdf f_R|M and then the estimation of the marginal pdf's f_M for each of the regions are illustrated.
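The change of variables in Equations (25) to (27) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and `toy_f_mr` is an assumed placeholder density, not a density estimated from control samples.

```python
import math

def heights_to_mean_ratio(h1, h2):
    """Equation (25): map (h1, h2) to mean height m and imbalance r."""
    return (h1 + h2) / 2.0, h1 / h2

def jacobian_factor(h1, h2):
    """First factor of Equation (26): (1 / h2^2) * (h1 + h2) / 2."""
    return (h1 + h2) / (2.0 * h2 ** 2)

def density_in_height_space(h1, h2, f_mr):
    """Equation (26): density of (H1, H2) obtained from a density f_mr(m, r)."""
    m, r = heights_to_mean_ratio(h1, h2)
    return jacobian_factor(h1, h2) * f_mr(m, r)

def toy_f_mr(m, r):
    """Hypothetical placeholder: M ~ Exp(mean 100) and R ~ Exp(mean 1), independent."""
    return (math.exp(-m / 100.0) / 100.0) * math.exp(-r)
```

In practice `f_mr` would be the product f_R|M(r|m) f_M(m) of Equation (27), with the marginal chosen according to the region in which (h₁,h₂) falls.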

A conditional pdf f_R|M can be obtained using data from region (1,2) and extrapolated to a pdf f_R|M for all regions. In effect, a simulation is provided for the material which is missing in each of the other three regions. A joint pdf f_ln(M),ln(R) for variables ln(M) and ln(R) is obtained first. A conditional pdf f_R|M is calculated from f_ln(M),ln(R). The pdf f_R|M is affected by the bias introduced by the threshold of 30 rfu. This bias is removed by replacing each f_R|M with a Log normal pdf and correcting in the space of parameters mean, μ, and variance, σ, of the Log normal pdf's.

A two-dimensional pdf f_ln(M),ln(R)for variables ln(M) and ln(R) is estimated using the EM-algorithm, where each Gaussian pdf has zero correlation; A. Dempster, N. Laird, and D. Rubin (1977). “Maximum likelihood from incomplete data via the EM algorithm”. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977; G. Mclachlan, D. Peel (2000). Finite mixture models. John Wiley & Sons, Inc. A scatter plot of mean height and heterozygous balance in natural logarithm scales is shown in FIG. 8. The components of the estimated mixture of Gaussians are given in Table 1.

TABLE 1

Parameters of mixture of Gaussians for an estimated pdf f_ln(M),ln(R)

| Component | Proportion | Mean 1 | Mean 2 | Variance 1 | Variance 2 |
|---|---|---|---|---|---|
| 1 | 0.2505 | 4.5791 | −0.0963 | 0.2581 | 0.2147 |
| 2 | 0.2528 | 4.4672 | 0.1929 | 0.2158 | 0.2203 |
| 3 | 0.2573 | 5.4292 | 0.1041 | 0.1550 | 0.1407 |
| 4 | 0.2393 | 5.7499 | 0.0585 | 0.3454 | 0.0546 |

The goodness of fit is assessed by ascertaining that the data is a likely sample from the estimated pdf. Verification is provided by generating one thousand samples of the same size as the data from the estimated pdf, computing the negative log-likelihood for each sample, and also computing the negative log-likelihood of the data under the estimated pdf. The negative log-likelihood of the data falls within the range of values of the negative log-likelihoods of the samples, which indicates that the mixture fits the data well.

A conditional distribution of R given M can be calculated from the estimated f_ln(M),ln(R) using the formula:

$\begin{matrix} f_{R | M} (r | m) = \frac{f_{M, R} (m, r)}{f_{M} (m)} = \frac{\frac{1}{mr} f_{\ln (M), \ln (R)} (\ln m, \ln r)}{\frac{1}{m} f_{\ln (M)} (\ln m)} = \frac{f_{\ln (M), \ln (R)} (\ln m, \ln r)}{r \times f_{\ln (M)} (\ln m)} . & (28) \end{matrix}$

The pdf f_ln(M) is in fact a mixture of one-dimensional Gaussian distributions with the same mixing proportions as f_ln(M),ln(R) and with means and variances given by the corresponding means and variances. These are reported in Table 1 above.
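As a minimal sketch of Equation (28), the conditional pdf can be evaluated directly from the Table 1 parameters. The function names are hypothetical. The sketch assumes that the log-scale densities are evaluated at (ln m, ln r), and that the "Mean 1"/"Variance 1" columns refer to ln(M) and the "Mean 2"/"Variance 2" columns to ln(R).

```python
import math

# Table 1 rows: (proportion, mean of ln M, mean of ln R, variance of ln M, variance of ln R)
TABLE_1 = [
    (0.2505, 4.5791, -0.0963, 0.2581, 0.2147),
    (0.2528, 4.4672, 0.1929, 0.2158, 0.2203),
    (0.2573, 5.4292, 0.1041, 0.1550, 0.1407),
    (0.2393, 5.7499, 0.0585, 0.3454, 0.0546),
]

def normal_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * variance)) / math.sqrt(2.0 * math.pi * variance)

def joint_log_scale_pdf(x, y):
    """Mixture density of (ln M, ln R) at (x, y); each component has zero correlation."""
    return sum(p * normal_pdf(x, m1, v1) * normal_pdf(y, m2, v2)
               for p, m1, m2, v1, v2 in TABLE_1)

def marginal_log_mean_pdf(x):
    """Mixture density of ln M at x, with the same mixing proportions."""
    return sum(p * normal_pdf(x, m1, v1) for p, m1, _m2, v1, _v2 in TABLE_1)

def conditional_r_given_m(r, m):
    """Equation (28): f_{R|M}(r | m) from the log-scale mixture."""
    x, y = math.log(m), math.log(r)
    return joint_log_scale_pdf(x, y) / (r * marginal_log_mean_pdf(x))
```

Because the conditional pdf is a ratio of the joint and the marginal, it integrates to one over r for any fixed m, even though the rounded proportions in Table 1 sum to 0.9999.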

Several conditional pdf's are computed using the formula above. Estimated Log normal distributions are shown in FIGS. 9a, 9b, 9c and 9d. For the time being, the focus is on the estimated pdf's. Notice that the heterozygote imbalance pdf given a mean height of 50 rfu, FIG. 9a, appears to have less variability than the pdf conditional on 100 rfu, FIG. 9b. This is an effect of the bias introduced by the dropout region. Notice as well that the pdf for a mean height of 500, FIG. 9c, is very similar to the pdf for 1,000 rfu, FIG. 9d, suggesting that the variability does not significantly decrease after 500 rfu in mean height.

In FIGS. 9a-d, Log normal distributions that closely resemble the conditional pdf for R given M have been plotted. The pdf of the Log normal distribution is given by:

$\begin{matrix} f_{R} (r) = \frac{1}{r \times σ \sqrt{2 π}} \exp \left(\frac{- {(\ln (r) - μ)}^{2}}{2 σ^{2}}\right) . & (29) \end{matrix}$

Notice that the Log normal distributions are a good fit. The bias can therefore be corrected in the space of parameters (μ,σ) of the Log normal pdf's. FIGS. 10a and 10b show plots of the estimated parameters of the Log normal distributions that closely fit f_R|M for each value of M=m. Notice that in the region M≥100, where the effect of the threshold is not present, the value of μ is constant while σ decreases as m increases.

In a second iteration we set μ=0.1069, corresponding to heterozygote imbalance R=1.11. In fact, this number is expected, as the height of the low molecular weight allele tends to be larger than the height of the high molecular weight allele. We then run the estimation procedure for σ again. FIG. 11 shows the estimated values of σ for each value of m.

To extrapolate the increasing trend to lower values of M, several polynomials were fitted to values of σ in region from 80 to 200 rfu. A line was the best fit and it was chosen for extrapolating σ in region (0,80). FIG. 12 displays the estimated line and the extrapolated values.

In these examples, Log normal distributions are used, but other extrapolatable distributions, such as gamma distributions can be used.

From this, a family of Log normal distributions has been obtained:

$\begin{matrix} f_{R | M} (r | m) = \frac{1}{r \times σ (m) \sqrt{2 π}} \exp \left(\frac{- {(\ln (r) - μ)}^{2}}{2 {[σ (m)]}^{2}}\right) & (30) \end{matrix}$

where μ=0.1069 and σ(m) is given by the corrected σ value in FIG. 12.
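The family of Equation (30) can be sketched as follows. The value μ=0.1069 is taken from the text, but the fitted line of FIG. 12 is not reproduced here, so the intercept, slope and floor in `sigma_of_m` are purely hypothetical stand-ins chosen only so that σ decreases as m increases.

```python
import math

MU = 0.1069  # fixed location parameter reported in the text

def sigma_of_m(m, intercept=0.5, slope=-0.001, floor=0.05):
    """Hypothetical straight line standing in for the fitted line of FIG. 12."""
    return max(floor, intercept + slope * m)

def lognormal_pdf(r, mu, sigma):
    """Equation (29): Log normal density with parameters mu and sigma."""
    if r <= 0.0:
        return 0.0
    z = (math.log(r) - mu) / sigma
    return math.exp(-0.5 * z * z) / (r * sigma * math.sqrt(2.0 * math.pi))

def f_r_given_m(r, m):
    """Equation (30): Log normal density whose sigma depends on mean height m."""
    return lognormal_pdf(r, MU, sigma_of_m(m))
```

Other extrapolatable families, such as gamma distributions, could be substituted for `lognormal_pdf` without changing the structure of the sketch.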

Other components in the proposed two-dimensional distribution for peak height are pdf's for mean height in each of the four regions. For region (0,0) neither peak height is known and both are thus assumed to follow a uniform distribution in the interval [0,30]. A pdf for mean height is therefore also taken to be a uniform distribution in this interval, i.e.

$\begin{matrix} f_{M} (m) = p_{(0, 0)} \times \frac{1}{30} & (31) \end{matrix}$

where p_(0,0) is the proportion of pairs of heights that fall in region (0,0) and takes the value 0.1012. Of course, values other than 30 rfu for the threshold t can be employed.

Obtaining a pdf for region (1,0) requires more effort. First a distribution for peak heights (h₁,h₂) in this region is obtained and then used to calculate a distribution for mean height. Although the goal is to obtain a pdf for (h₁,h₂), this distribution does not incorporate the extrapolated pdf for heterozygote imbalance given mean height. A pdf for mean height is therefore extracted, combined with the extrapolated distribution and transformed back to the space of (h₁,h₂).

A joint distribution can be defined in this region by assuming that the variables h₁ and h₂ are independent. Given that we do not know the value of h₂, we can assume that it follows a uniform distribution in the interval (0,30). The values of variable h₁, translated by 30, follow an exponential distribution with parameter λ_(1,0), see FIG. 13. The goodness of fit was further corroborated with a Kolmogorov-Smirnov test.

A pdf for region (1,0) is given by

$\begin{matrix} f_{H_{1}, H_{2}} (h_{1}, h_{2}) = p_{(1, 0)} \times \frac{1}{30} \times \frac{1}{λ_{(1, 0)}} \exp \left(- \frac{h_{1} - 30}{λ_{(1, 0)}}\right) & (32) \end{matrix}$

where p_(1,0) is the proportion of height pairs that fall in region (1,0) and takes the value of 0.0448.

A pdf for mean height M can be obtained by first using the transformation

$\begin{matrix} (h_{1}, h_{2}) \mapsto (m = \frac{h_{1} + h_{2}}{2}, n = \frac{h_{1} - h_{2}}{2}) . & (33) \end{matrix}$

and then using the Jacobian of the transformation to obtain a two-dimensional pdf for (M,N). The marginal pdf for M can be obtained through integration. The two-dimensional pdf for (M,N) is given by:

$\begin{matrix} f_{M, N} (m, n) = 2 \times f_{H_{1}, H_{2}} (h_{1}, h_{2}) . & (34) \end{matrix}$

It can be re-written as

$\begin{matrix} f_{M, N} (m, n) = p_{(1, 0)} \times \frac{1}{15} \times \frac{1}{λ_{(1, 0)}} \exp \left(- \frac{m + n - 30}{λ_{(1, 0)}}\right) . & (35) \end{matrix}$

Before N is integrated out to obtain a pdf for M, the sample space for (M,N) needs to be described. It is the area in the rectangle, that is the region where m+n≥30 and 0≤m−n<30 (equivalently h₁≥30 and 0≤h₂<30).

The resulting pdf for M is given by:

$\begin{matrix} f_{M} (m) = p_{(1, 0)} \times \frac{1}{15} \times \left[1 - \exp \left(- \frac{2 m - 30}{λ_{(1, 0)}}\right)\right] & (36) \end{matrix}$

if m ∈ [15,30), and

$\begin{matrix} f_{M} (m) = p_{(1, 0)} \times \frac{1}{15} \times \left[\exp \left(- \frac{2 (m - 30)}{λ_{(1, 0)}}\right) - \exp \left(- \frac{2 m - 30}{λ_{(1, 0)}}\right)\right] & (37) \end{matrix}$

if m ≥ 30.
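Equations (36) and (37) can be checked numerically with a short sketch. The threshold of 30 rfu and the proportion 0.0448 are taken from the text; the value of λ_(1,0) is not stated in this section, so any value passed as `lam` is an assumption, and the function name is hypothetical.

```python
import math

T = 30.0        # detection threshold in rfu, as in the example
P_10 = 0.0448   # proportion of height pairs in region (1,0), from the text

def f_m_region_10(m, lam, p=P_10, t=T):
    """Equations (36) and (37): marginal pdf of mean height in region (1,0).

    lam is the exponential parameter lambda_(1,0); its value is an assumption here.
    The factor 2 / t equals 1/15 when t = 30.
    """
    if m < t / 2.0:
        return 0.0  # the mean cannot be below t/2 when h1 >= t and h2 >= 0
    upper = math.exp(-(2.0 * m - t) / lam)
    if m < t:
        return p * (2.0 / t) * (1.0 - upper)
    return p * (2.0 / t) * (math.exp(-2.0 * (m - t) / lam) - upper)
```

Integrating this function over m from 15 upwards recovers p_(1,0), which is a useful consistency check on the two branches. The same function applies to region (0,2) with p_(0,2) and λ_(0,2) substituted.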

The methodology for obtaining a pdf for mean height in region (0,2) is the same as for region (1,0). A pdf for (H₁,H₂) is given by:

$\begin{matrix} f_{H_{1}, H_{2}} (h_{1}, h_{2}) = p_{(0, 2)} \times \frac{1}{30} \times \frac{1}{λ_{(0, 2)}} \exp \left(- \frac{h_{2} - 30}{λ_{(0, 2)}}\right) . & (38) \end{matrix}$

where p_(0,2) is the proportion of height pairs that fall in region (0,2) and is equal to 0.0325. Constant λ_(0,2) is the parameter of an exponential pdf estimated from values of h₂ translated by 30, see FIG. 14. The goodness of fit was further corroborated with a Kolmogorov-Smirnov test.

The method for calculating a marginal for M in region (0,2) is the same as for region (1,0). The resulting pdf is given by:

$\begin{matrix} f_{M} (m) = p_{(0, 2)} \times \frac{1}{15} \times \left[1 - \exp \left(- \frac{2 m - 30}{λ_{(0, 2)}}\right)\right] & (39) \end{matrix}$

if m ∈ [15,30), and

$\begin{matrix} f_{M} (m) = p_{(0, 2)} \times \frac{1}{15} \times \left[\exp \left(- \frac{2 (m - 30)}{λ_{(0, 2)}}\right) - \exp \left(- \frac{2 m - 30}{λ_{(0, 2)}}\right)\right] & (40) \end{matrix}$

if m ≥ 30.

Finally, a pdf f_M for region (1,2) can be obtained from the two-dimensional pdf estimated for (ln(M),ln(R)). The result is a mixture of one-dimensional Gaussians where the mixture proportions are the same as those reported in Table 1. The means are given in the column with heading “Mean 1” and the variances are given in the column with heading “Variance 1” (Matlab: fitMeanHeightPDF.m). A pdf for M in region (1,2) is given by:

$\begin{matrix} f_{M} (m) = p_{(1, 2)} \times \sum_{i = 1}^{4} p_{i} \times f (m | μ_{i}, σ_{i}) & (41) \end{matrix}$

where p_(1,2) is the proportion of height pairs that fall in region (1,2), p_i are the mixing proportions of the Gaussian components and f(m|μ_i,σ_i) are the Gaussian pdf's.

We have defined all the components of a two-dimensional pdf for pairs of heights for all regions as defined in Equation (27). The conditional pdf f_R|M is given by the family of Log normal pdf's with a fixed value of μ and values of σ(m) given in FIG. 12. The marginal distributions f_M have been given for all regions above. This pdf form is one of the building blocks in the calculation of LRs in subsequent sections.

Generating a 2D Probability Distribution Function for Heterozygous Donors—Method Two—Approach c

In approach b above, an approach for dealing with the censoring of the peaks by the rfu limit imposed was discussed. This approach provides a further variant to that manner of dealing with the left censored peaks.

In approach b, given a pdf f_M,R, a pdf in the space of pairs of heights was described as being obtained with the formula:

$f_{H_{1}, H_{2}} (h_{1}, h_{2}) = \frac{1}{h_{2}^{2}} (\frac{h_{1} + h_{2}}{2}) \times f_{M, R} (m, r)$

where the first factor is the Jacobian of the transformation and the second factor was estimated from experimental data. The estimation was done by first obtaining the conditional distribution of heterozygous imbalance given mean height:

$f_{H_{1}, H_{2}} (h_{1}, h_{2}) = \frac{1}{h_{2}^{2}} (\frac{h_{1} + h_{2}}{2}) \times f_{R | M} (r | m) f_{M} (m)$

This provided one pdf f_R|M for all regions, but with one marginal for each of the regions. Approach b then described the estimation of the conditional pdf f_R|M and then the estimation of the marginal pdf's f_M for each of the regions.

In approach c, the same approach is taken to that of approach b in the use of the formula:

$f_{H_{1}, H_{2}} (h_{1}, h_{2}) = \frac{1}{h_{2}^{2}} (\frac{h_{1} + h_{2}}{2}) \times f_{R | M} (r | m) f_{M} (m)$

and in the estimation of the conditional pdf f_R|M. The variation is in the manner by which the estimation of the marginal pdf's f_M is provided. In this further approach, the possible range for the values of the peak heights, even those recorded as zero, is taken into account.

The first step is the selection of a family of probability distributions f(m|θ), where θ is the set of parameters specifying the family of distributions. Variable m is the mean height (h₁+h₂)/2. The values of h₁ and h₂ are affected by the threshold t, and thus h₁ can be either greater than or equal to t or recorded as zero if it is smaller than t; the same applies to h₂. The values of m are affected by left-censoring of h₁ and h₂, and the definition of the likelihood of θ given the observations on M is affected accordingly.

If h₁<t and h₂<t, then m<t. In this case m is left-censored and the likelihood function for these m's is F(t|θ), where F is the cumulative distribution function (CDF) of the previously chosen family of distributions. If h₁<t and h₂≥t, then m is interval-censored, that is, it falls in the interval from 0.5t to 0.5(t+h₂). The likelihood for these m's is F(0.5(t+h₂)|θ)−F(0.5t|θ). Similarly, for the contrary case where h₂<t and h₁≥t, the likelihood of m is F(0.5(t+h₁)|θ)−F(0.5t|θ). If h₁≥t and h₂≥t, then there is no censoring and the likelihood is given by f(m|θ). The overall likelihood is given by multiplying the likelihoods for each pair of heights h₁ and h₂, including those that fall below the threshold t and are recorded as zero:

$\begin{matrix} L (θ | m) = \prod \{F (t | θ) : h_{1} < t, h_{2} < t\} \times \\ \prod \{F (0.5 (t + h_{2}) | θ) - F (0.5 t | θ) : h_{1} < t, h_{2} \geq t\} \times \\ \prod \{F (0.5 (t + h_{1}) | θ) - F (0.5 t | θ) : h_{1} \geq t, h_{2} < t\} \times \\ \prod \{f (0.5 (h_{1} + h_{2}) | θ) : h_{1} \geq t, h_{2} \geq t\} \end{matrix}$

The parameters encoded in θ are chosen to maximise the likelihood given above.
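The censored likelihood of approach c can be sketched as below. The text leaves the family f(m|θ) open; a Log normal family with θ=(μ,σ) is assumed here purely for illustration, and the coarse grid search stands in for whatever optimiser would be used in practice. All function names are hypothetical.

```python
import math

def lognormal_pdf(x, mu, sigma):
    """Density of the assumed Log normal family at x."""
    if x <= 0.0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))

def lognormal_cdf(x, mu, sigma):
    """Cumulative distribution function of the assumed Log normal family."""
    if x <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def censored_log_likelihood(pairs, t, mu, sigma):
    """Log of the overall likelihood; heights below t are recorded as 0 in pairs."""
    total = 0.0
    for h1, h2 in pairs:
        if h1 < t and h2 < t:      # both peaks censored: m is left-censored
            contrib = lognormal_cdf(t, mu, sigma)
        elif h1 < t:               # only h1 censored: m is interval-censored
            contrib = lognormal_cdf(0.5 * (t + h2), mu, sigma) - lognormal_cdf(0.5 * t, mu, sigma)
        elif h2 < t:               # only h2 censored: m is interval-censored
            contrib = lognormal_cdf(0.5 * (t + h1), mu, sigma) - lognormal_cdf(0.5 * t, mu, sigma)
        else:                      # no censoring: use the density of the mean
            contrib = lognormal_pdf(0.5 * (h1 + h2), mu, sigma)
        if contrib <= 0.0:
            return float("-inf")
        total += math.log(contrib)
    return total

def fit_by_grid(pairs, t, mus, sigmas):
    """Return (log-likelihood, mu, sigma) maximising the likelihood over a grid."""
    best = None
    for mu in mus:
        for sigma in sigmas:
            ll = censored_log_likelihood(pairs, t, mu, sigma)
            if best is None or ll > best[0]:
                best = (ll, mu, sigma)
    return best
```

Working on the log scale turns the product over height pairs into a sum, which avoids numerical underflow when many control samples are used.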

Establishing Likelihood Ratios

Having established the underlying 2D pdf for peak heights in the heterozygous case, f_het(h₁,h₂), and 2D pdf for peak height in the homozygous case, f_hom(h₁), it is possible to move on to establish the likelihood ratios in various situations of interest.

The situations considered in turn and in detail below are those commonly encountered, particularly in the context of forensic science.

The various situations are:

- 1) in an evidential context, a single source sample, without conditioning on DNA quantity;
- 2) in an evidential context, a single source sample, with conditioning on DNA quantity;
- 3) in an intelligence context, a single source sample, without conditioning on DNA quantity;
- 4) in an intelligence context, a single source sample, with conditioning on DNA quantity;
- 5) in an evidential context, a mixed source sample, without conditioning on DNA quantity;
- 6) in an evidential context, a mixed source sample, with conditioning on DNA quantity;
- 7) in an intelligence context, a mixed source sample, without conditioning on DNA quantity; and
- 8) in an intelligence context, a mixed source sample, with conditioning on DNA quantity.

Situation 1—An Evidential Context—A Single Source Sample—Without Conditioning on DNA Quantity

In an evidential context, the relative likelihoods of two hypotheses are normally being considered. Generally, these will be a prosecution hypothesis and a defence hypothesis. The consideration, the likelihood ratio, can be generally expressed as:

$\begin{matrix} LR = \frac{\Pr (C | S, H_{p})}{\Pr (C | S, H_{d})} & (42) \end{matrix}$

where

- C is the crime profile consisting of peak heights, for example C={h₁,h₂} or C={h₁}, h={h₁,h₂<t}, h={h₁<t};
- S is the suspect's genotype, for example S={1,2} or S={1,1};
- H_p is the prosecution hypothesis stating “The suspect left the stain at the scene of crime”;
- H_d is the defence hypothesis stating “Someone else left the stain at the scene of crime”. This includes a defence hypothesis of a putative donor who is either related or unrelated to the suspect, and from the same or a different ethnicity.

The LR can be expanded as:

$\begin{matrix} LR = \frac{\Pr (C | S, H_{p})}{\sum_{i} \Pr (C | U_{i}, S, H_{d}) \Pr (U_{i} | S, H_{d})} & (43) \end{matrix}$

where U_i is one of the supposed donors stated in H_d. Given that the quantities in C are continuous, the LR is written as:

$\begin{matrix} LR = \frac{f (C | S, H_{p})}{\sum_{i} f (C | U_{i}, S, H_{d}) \Pr (U_{i} | S, H_{d})} & (44) \end{matrix}$

where the f means that these factors are likelihoods and not probabilities.

The second factor in the denominator, Pr(U_i|S,H_d), can be computed in a number of ways using the methods given by Balding et al. (2005) and Buckleton et al. (2005), referenced above, which may include considerations of relatedness between the unknown contributor and the suspect.

The factors f(C|S,H_p) and f(C|U_i,S,H_d) involve the same type of calculation: the likelihood of observing the set of peak heights in C given a supposed donor. Therefore the discussion of their calculation is simplified as:

f(C|G) (45)

where G denotes the supposed donor genotype.

This general form of the likelihood is key to the operation of the invention and its benefits.

The estimation of f(C|G) from experimental data, as provided in the pdf forming methods discussed above, has not previously been envisaged or provided.

The estimation can come from the dilution data approach of heterozygous sources, method 1, in particular. The homozygous source methods could be used for homozygous situations.

As far as the estimation of f(C|G) is concerned, this can be done by grouping data from all loci to generate a single estimation of f(C|G) covering all the loci considered. It is equally possible to provide a separate estimation of f(C|G) for each locus. A variety of ways for generating the estimation of f(C|G) using two-dimensional density estimations apply.

In the following sections a more detailed application of the approach to evidential contexts is provided for the four scenarios:

a) Suspect is heterozygous and crime profile has one peak;

b) Suspect is heterozygous and crime profile has two peaks;

c) Suspect is heterozygous and crime profile contains no peaks;

d) Suspect is homozygous and crime profile contains one peak.

Suspect is Heterozygous and Crime Profile has One Peak

In this situation, C={h_b} and S={a,b}. The likelihood ratio is given, consistent with equation 1 above, by,

$\begin{matrix} LR = \frac{f (h_{b} | S = \{a, b\}, H_{p})}{f (h_{b} | S = \{a, b\}, H_{d})} . & (1.1) \end{matrix}$

The numerator then becomes,

$\begin{matrix} f (h_{a} < t, h_{b} | S = \{a, b\}, H_{p}) = f_{het} (h_{a} < t, h_{b}) = \int_{0}^{t} f_{het} (h_{a}, h_{b}) \, dh_{a} & (1.2) \end{matrix}$

where h_a is a peak height smaller than the threshold and so is not observed in the results. This is the type of situation illustrated in FIG. 17, with h_a corresponding to the situation for PK1.

The term in the right-hand-side of Equation (1.2) is a pdf in two-dimensions of the type discussed above in the pdf's for heterozygote sections.

The denominator in equation (1.1) is given by,

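$\begin{matrix} f (h_{b} | S = \{a, b\}, H_{d}) = f (h_{b} | U = \{b, b\}, H_{d}) \Pr (U = \{b, b\} | S = \{a, b\}, H_{d}) + f (h_{a} < t, h_{b} | U = \{b, Q\}, H_{d}) \Pr (U = \{b, Q\} | S = \{a, b\}, H_{d}) . & (1.3) \end{matrix}$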

where Q is any other allele identity than a.

The second factors of each term in the right-hand side of equation (1.3), Pr(U={b,b}|S={a,b},H_d) and Pr(U={b,Q}|S={a,b},H_d), can be computed using methods described in Balding (2005) and Buckleton et al. (2005), which include cases where the suspect and the unknown contributors are from the same and different ethnic groups, and cases where the suspect and unknown contributor are related.

The first factors of each term in the right-hand side of equation (1.3) will now be described in more detail.

The first factor of the first term is computed as follows:

f(h_b|U={b,b},H_d)=f_hom(h_b). (1.4)

where f_hom is the one-dimensional pdf as discussed above. The first factor of the second term is computed with the formula:

$\begin{matrix} f (h_{b}, h_{a} < t | U = \{b, Q\}, H_{d}) = f_{het} (h_{b}, h_{Q} < t) = \int_{0}^{t} f_{het} (h_{b}, h_{Q}) \, dh_{Q} & (1.6) \end{matrix}$

where h_Qis smaller than the threshold.

This provides a determinable value for the likelihood ratio for the situation where the suspect is heterozygous and the crime profile has one peak.
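The calculation of Equations (1.1) to (1.6) can be sketched numerically as follows. The densities `toy_f_het` and `toy_f_hom` are hypothetical exponential placeholders, not pdf's estimated from control samples, and the genotype probabilities passed as `pr_bb` and `pr_bq` are assumed inputs that would in practice come from the methods of Balding (2005) and Buckleton et al. (2005). The trapezoid rule is used only as a simple stand-in for the integrals.

```python
import math

def integrate(f, a, b, n=2000):
    """Composite trapezoid rule for f on [a, b] with n intervals."""
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * step)
    return total * step

def lr_het_suspect_one_peak(h_b, t, f_het, f_hom, pr_bb, pr_bq):
    """Likelihood ratio of Equation (1.1) for S = {a, b} and C = {h_b}."""
    # Equation (1.2): the unobserved peak h_a is integrated from 0 to t.
    numerator = integrate(lambda h_a: f_het(h_a, h_b), 0.0, t)
    # Equation (1.6): the unobserved peak h_Q is integrated from 0 to t.
    dropout_term = integrate(lambda h_q: f_het(h_b, h_q), 0.0, t)
    # Equation (1.3): homozygous {b, b} donor plus heterozygous {b, Q} donor.
    denominator = f_hom(h_b) * pr_bb + dropout_term * pr_bq
    return numerator / denominator

def toy_f_het(x, y):
    """Hypothetical 2D density: independent exponentials with mean 200 rfu."""
    return math.exp(-(x + y) / 200.0) / 200.0 ** 2

def toy_f_hom(x):
    """Hypothetical 1D density: exponential with mean 400 rfu."""
    return math.exp(-x / 400.0) / 400.0
```

A call such as `lr_het_suspect_one_peak(120.0, 30.0, toy_f_het, toy_f_hom, 0.02, 0.2)` returns a single positive number; its value has no forensic meaning because every input in that call is a placeholder.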

Suspect is Heterozygous and Crime Profile has Two Peaks

In this situation, C={h_a,h_b} and S={a,b}. The LR is given by:

$\begin{matrix} LR = \frac{f (C = \{h_{a}, h_{b}\} | S = \{a, b\}, H_{p})}{f (C = \{h_{a}, h_{b}\} | S = \{a, b\}, H_{d})} . & (2.1) \end{matrix}$

The numerator is given by:

f(C={h_a,h_b}|S={a,b},H_p)=f_het(h_a,h_b). (2.2)

The denominator is given by:

f(C={h_a,h_b}|S={a,b},H_d)=f_het(h_a,h_b)Pr(U={a,b}|S={a,b},H_d) (2.3)

The second term in the right hand side of Equation 2.3 can be computed using methods described in Balding (2005) and Buckleton et al. (2005) which includes cases where the suspect and the unknown contributors are from the same and different ethnic groups, and cases when the suspect and unknown contributor are related. The first term is calculated from a two-dimensional pdf for peak heights.

The LR is then given by:

$\begin{matrix} LR = \frac{1}{\Pr (U = \{a, b\} | S = \{a, b\}, H_{d})} . & (2.4) \end{matrix}$

This provides a determinable value for the likelihood ratio for the situation where the suspect is heterozygous and the crime profile has two peaks. This assumes that different contributors having the same genotype will have the same peak height distributions.

Suspect is Homozygous and the Crime Profile Contains One Allele

In this situation, C={h_b} and S={b,b}. The LR is given by:

$\begin{matrix} LR = \frac{f (h_{b} | S = \{b, b\}, H_{p})}{f (h_{b} | S = \{b, b\}, H_{d})} . & (3.1) \end{matrix}$

The numerator is given by:

f(h_b|S={b,b},H_p)=f_hom(h_b) (3.2)

a density function estimated for homozygous contributors.

The denominator is given by:

$\begin{matrix} f (h_{b} | S = \{b, b\}, H_{d}) = f (h_{b} | U = \{b, b\}, H_{d}) \Pr (U = \{b, b\} | S = \{b, b\}, H_{d}) + f (h_{b}, h_{a} < t | U = \{b, Q\}, H_{d}) \Pr (U = \{b, Q\} | S = \{b, b\}, H_{d}) . & (3.3) \end{matrix}$

The term:

f(h_b|U={b,b},S={b,b},H_d)=f_hom(h_b). (3.4)

is the same as the numerator. The term:

$\begin{matrix} f (h_{b}, h_{Q} < t | U = \{b, Q\}, H_{d}) = f_{het} (h_{b}, h_{Q} < t) = \int_{0}^{t} f_{het} (h_{b}, h_{Q}) \, dh_{Q} & (3.5) \end{matrix}$

can be obtained by estimating a pdf for the peak heights.

This provides a determinable value for the likelihood ratio for the situation where the suspect is homozygous and the crime profile has one allele.

As previously mentioned, the homozygous situation starts out relating to a 1D pdf and has to be converted to a 2D pdf for compatibility in approaches. This 2D approach extends to the likelihood calculation too. The approach stems from the observation that a homozygous sample with height h₁ is similar to observing a heterozygous sample with approximate heights 0.5h₁ and 0.5h₁.

The likelihood for a homozygous donor is then given by

$\begin{matrix} f (C = \{h_{1}\} | G = \{1, 1\}) = \int_{0}^{1} f (C = \{α h_{1}, (1 - α) h_{1}\} | G = \{1, 1\}) f (α) \, dα & (46) \end{matrix}$

where f(C={αh₁,(1−α)h₁}|G={1,1}) is a likelihood function computed in a two-dimensional likelihood for a heterozygote and f(α) is a probability density function for α, where α is the proportion of the sample. Alternatively we can approximate this quantity by:

$\begin{matrix} f (C = \{h_{1}\} | G = \{1, 1\}) = f (C = \{α h_{1}, (1 - α) h_{1}\} | G = \{1, 1\}) & (47) \end{matrix}$

for an α close to 0.5.
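Equations (46) and (47) can be sketched as follows. The density `toy_f_het` and the uniform choice for f(α) are hypothetical placeholders; the text does not specify the form of f(α), and the trapezoid rule is only a simple stand-in for the integral.

```python
import math

def integrate(f, a, b, n=2000):
    """Composite trapezoid rule for f on [a, b] with n intervals."""
    step = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * step)
    return total * step

def homozygous_likelihood(h1, f_het, f_alpha):
    """Equation (46): split h1 into (alpha*h1, (1-alpha)*h1) and average over alpha."""
    return integrate(lambda a: f_het(a * h1, (1.0 - a) * h1) * f_alpha(a), 0.0, 1.0)

def homozygous_likelihood_at(h1, f_het, alpha=0.5):
    """Equation (47): approximation using a single alpha close to 0.5."""
    return f_het(alpha * h1, (1.0 - alpha) * h1)

def toy_f_het(x, y):
    """Hypothetical 2D density: independent exponentials with mean 200 rfu."""
    return math.exp(-(x + y) / 200.0) / 200.0 ** 2

def uniform_alpha(_alpha):
    """Hypothetical f(alpha): uniform on [0, 1]."""
    return 1.0
```

With these particular placeholders the integrand does not depend on α, so the two functions agree; with an estimated heterozygote pdf the single-point form of Equation (47) would be an approximation only.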

In this section, the approach has been provided without consideration as to the DNA quantity involved; the next section conditions on DNA quantity. For ease of presentation, however, the extension of the above approach to conditioning on DNA quantity, χ, is mentioned here. The formulae remain valid, and they become:

$\begin{matrix} f (C = \{h_{1}\} | G = \{1, 1\}, χ_{i}) = \int_{0}^{1} f (C = \{α h_{1}, (1 - α) h_{1}\} | G = \{1, 1\}, χ_{i}) f (α) \, dα & (48) \end{matrix}$

where f(C={αh₁,(1−α)h₁}|G={1,1},χ_i) is a likelihood function for a heterozygote contributor estimated from data obtained with targeted DNA quantity χ_i.

Alternatively:

$\begin{matrix} f (C = \{h_{1}\} | G = \{1, 1\}, χ_{i}) = f (C = \{α h_{1}, (1 - α) h_{1}\} | G = \{1, 1\}, χ_{i}) & (49) \end{matrix}$

for an α close to 0.5.

Situation 2—An Evidential Context—A Single Source Sample—With Conditioning on DNA Quantity

In situation 1, above, we discussed how the calculation is simplified as:

f(C|G) (50)

where G denotes the supposed donor and that this general form of the likelihood is key to the operation of the invention and its benefits.

In the extension of the approach that now follows, the likelihood can also be computed by conditioning on DNA quantity, in the form:

$\begin{matrix} f (C | G) = \sum_{j} f (C | G, χ_{j}) \Pr (χ_{j} | I (h)) & (51) \end{matrix}$

where Pr(χ_j|I(h)) is a probability distribution of DNA quantity given peak height information, I(h).

The calculation of f(C|G) can be done for each different DNA quantity considered through the control samples. In effect, for each quantity of DNA a dataset is generated using the general approaches previously described. This data is stored for future use. The relevant distribution to the DNA quantity in the analysis situation is then selected from these and used.

The probability distribution for Pr(χ_j|I(h)) relates to a list of discrete probabilities for χ₁ to χ_j, where

$\sum_{j} \Pr (χ_{j} | I (h)) = 1.$

The distribution can be obtained by considering an observed distribution from the control samples. There are various ways of combining the control sample results to achieve this.

It is possible to consider all the heights, h, observed for each locus in a control sample. This gives a mean height, h̄, for a control sample across all its loci. The mean heights for repeats at the same quantity χ can then be grouped together across multiple control samples. Hence, {h̄_j}, j=1,2, . . . ,m. This allows the estimation of a distribution f(h̄|χ_i) and the fitting of a distribution to that. The estimation can be provided by various approaches for estimating probability distributions in 1D. For example, this distribution can be estimated using mixtures of 1D normal distributions using the EM-algorithm, see for instance, A. Dempster, N. Laird, and D. Rubin (1977). “Maximum likelihood from incomplete data via the EM algorithm”. Journal of the Royal Statistical Society, Series B, 39(1):1-38.

Other approaches could consider the position locus by locus, with the process of obtaining a mean height at each χ and providing the distribution being performed separately for the different loci.

Both sets of information for equation 5 can be obtained for one or more loci, preferably, but not necessarily, including the locus on which the approach is being used.

It is possible for the calculation of Pr(X=χ_i|I(h)) for a locus to be performed in a sequential fashion based on an ordering L₁,L₂, . . . ,L_n of the loci. In that instance, the process starts by setting Pr(X=χ_i|I(h)) for L₁ to a uniform distribution. This is in part because there is no information available as to I(h). It is then possible to compute Pr(X=χ_i|I(h)) for L₂ based on the assumption for I(h) from L₁. The prior knowledge of I(h) for L₁ is of benefit. The process then computes Pr(X=χ_i|I(h)) for L₃ based on I(h) from L₁ and L₂, with the process continuing in this manner until Pr(X=χ_i|I(h)) for L_n is computed based on I(h) from L₁,L₂, . . . ,L_n−1. This provides the best use of the prior knowledge to compute the various Pr's, but without generating a circularity issue in the processing.

To compute Pr(X=χ_i|I(h)) that function can be equated to:

$\Pr (X = χ_{i} | \bar{h}) = \frac{f (\bar{h} | X = χ_{i}) \Pr (X = χ_{i})}{\sum_{j} f (\bar{h} | X = χ_{j}) \Pr (X = χ_{j})}$

where f(h̄|X=χ_i) is a 1D distribution calculated from data for DNA quantity χ=χ_i only.
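The Bayes step above, together with the weighted sum of Equation (51), can be sketched as follows. The function names are hypothetical, and the Gaussian form assumed for each f(h̄|X=χ_i) in the usage note is an illustration only; the text allows any 1D density estimate, for example a mixture of normals.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a one-dimensional Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def quantity_posterior(h_bar, likelihoods, priors):
    """Pr(X = chi_i | h_bar) by Bayes' rule.

    likelihoods[i] is a callable returning f(h_bar | X = chi_i);
    priors[i] is Pr(X = chi_i).
    """
    weights = [lik(h_bar) * prior for lik, prior in zip(likelihoods, priors)]
    total = sum(weights)
    return [w / total for w in weights]

def likelihood_given_quantities(f_c_given_g_chi, quantity_probs):
    """Equation (51): sum over j of f(C | G, chi_j) * Pr(chi_j | I(h))."""
    return sum(f * p for f, p in zip(f_c_given_g_chi, quantity_probs))
```

For example, with three assumed quantities whose mean-height densities are Gaussians centred on 100, 300 and 600 rfu and a uniform prior, `quantity_posterior(310.0, ...)` places most of its weight on the second quantity, and the resulting list can be passed directly as `quantity_probs`.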

The likelihood f(C|G_i) and its version conditional on DNA quantity, f(C|G_i,χ_j), are key to the invention. The estimation of f(C|G,χ_i) from experimentally derived data has not previously been envisaged or provided. Once again:

this can be provided from dilution data;

the estimation of f(C|G,χ_i) can be done by grouping data from all loci, potentially with the estimation rendering one f(C|G,χ_i) for all loci, but with a separate f(C|G,χ_i) for each DNA quantity χ_i;

the estimation of f(C|G,χ_i) can be done once per locus, potentially with the estimation rendering one f(C|G,χ_i) for each locus and for each DNA quantity χ_i;

the estimation of f(C|G,χ_i) can be done using any method for estimating a two-dimensional density estimation;

the estimation of f(C|G,χ_i) can be done using the method described in the heterozygous donors, method one and/or homozygous donors method one approaches.

Equally, the estimation of Pr(X=χ_i|I(h)) from experimentally derived data has not previously been provided or envisaged. Again this could involve:

the estimation of Pr(X=χ_i|I(h)) from dilution data;

the peak height information I(h) being any function of peak height, for example the mean of two peak heights or the sum of the heights;

the estimation of Pr(X=χ_i|I(h)) for a locus being based on I(h) from the same locus or from one or more other loci.

Situation 3—An Intelligence Context—A Single Source Sample—Without Conditioning on DNA Quantity

In an intelligence context, a different issue is under consideration to that approached in an evidential context. The intelligence context seeks to find links between a DNA profile from a crime scene sample and profiles stored in a database, such as The National DNA Database®, which is used in the UK. The process is interested in the genotype given the collected profile.

Thus in this context, the process starts with a crime profile C and is interested in proposing an ordered list of suspect profiles, where the first profile in the list is the genotype of the most likely donor. This task is usually done by proposing a list of genotypes {G₁,G₂, . . . ,G_m} which is then ranked according to the posterior probability of the genotype given the crime stain.

The list of genotypes is generated from the crime profile C. For example, if C={h₁,h₂}, where both h₁ and h₂ are greater than the dropout threshold, t, then the only potential donor genotype is G={1,2}. If C={h₁}, then the potential donors are G₁={1,1} and G₂={1,Q}, where Q represents any other allele different from allele 1, present in the DNA, but not observed in the results for the crime profile C.

The posterior probability of a supposed genotype given the crime profile is given by the formula:

$\begin{matrix} \Pr (G_{i} | C) = \frac{f (C | G_{i}) π (G_{i})}{\sum_{i} f (C | G_{i}) π (G_{i})} & (52) \end{matrix}$

where the G_i are assumed to be the various members of an exhaustive list of potential donor genotypes, and π(G_i) is a prior distribution for genotype G_i computed from the population in question.

In a similar manner to the evidential context and equation 4 described above, the term:

f(C|G_i)

is key to the process, and this term can be estimated successfully using the various approaches for quantifying the pdf discussed above.

In general, the estimation is the same as with evidential assessment discussed elsewhere.
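The ranking by equation 52 can be sketched as follows. The likelihood values f(C|G_i) and the priors π(G_i) are hypothetical numbers chosen for illustration, and the function name is an assumption.

```python
def rank_genotypes(likelihoods, priors):
    """Rank candidate genotypes by Pr(G_i | C) as in equation 52."""
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    total = sum(joint.values())  # denominator of equation 52
    posterior = {g: value / total for g, value in joint.items()}
    return sorted(posterior.items(), key=lambda item: item[1], reverse=True)

# Hypothetical inputs for a crime profile C = {h1} with a single peak.
likelihoods = {"{1,1}": 0.002, "{1,Q}": 0.001}  # f(C | G_i)
priors = {"{1,1}": 0.01, "{1,Q}": 0.18}         # pi(G_i)

ranking = rank_genotypes(likelihoods, priors)
```

With these illustrative numbers the heterozygote with dropout, {1,Q}, ranks first because its larger prior outweighs its smaller likelihood.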

Situation 4—An Intelligence Context—A Single Source Sample—With Conditioning on DNA Quantity

It is possible to extend the approach in situation 3 to consider the quantity of DNA involved, in a similar manner to the extension of situation 1 to situation 2.

The posterior probability in this situation is computed using the formula:

$\begin{matrix} \Pr (G_{i} | C) = \frac{[\sum_{j} f (C | G_{i}, χ_{j}) \Pr (χ_{j} | I (h))] \times \Pr (G_{i})}{\sum_{i} [\sum_{j} f (C | G_{i}, χ_{j}) \Pr (χ_{j} | I (h))] \times \Pr (G_{i})} & (53) \end{matrix}$

where Pr(χ_j|I(h)) is a probability distribution of DNA quantity given information on peak heights. This probability distribution can be established from the same locus and/or from one or more other loci, possibly including the locus where the formula will be applied in practice.
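A minimal sketch of equation 53 follows: the likelihood of each genotype is first averaged over the DNA quantities, weighted by Pr(χ_j|I(h)), and the result is then normalised over genotypes. All numerical values and names are hypothetical.

```python
def posterior_with_quantity(f_by_genotype, quantity_probs, priors):
    """Compute Pr(G_i | C) as in equation 53.

    f_by_genotype[g][chi] holds f(C | G_i, chi_j);
    quantity_probs[chi] holds Pr(chi_j | I(h));
    priors[g] holds Pr(G_i).
    """
    marginal = {
        g: sum(f_chi[chi] * quantity_probs[chi] for chi in quantity_probs)
        for g, f_chi in f_by_genotype.items()
    }
    joint = {g: marginal[g] * priors[g] for g in marginal}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}

# Hypothetical values for two candidate genotypes and two DNA quantities (pg).
f_by_genotype = {
    "G1": {50: 0.004, 500: 0.0001},
    "G2": {50: 0.001, 500: 0.002},
}
quantity_probs = {50: 0.25, 500: 0.75}
priors = {"G1": 0.5, "G2": 0.5}

posterior = posterior_with_quantity(f_by_genotype, quantity_probs, priors)
```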

Situation 5—An Evidential Context—A Mixed Source Sample—Without Conditioning on DNA Quantity

In a variety of situations, the DNA sample encountered is not from a single source, but rather from a mixed source. A mixed source arises from two, three or more contributors. The approach is detailed in respect of one such mixed source situation, but the approach is valid for other types of mixed source sample too.

In the evidential context for a mixed source sample arising from two contributors, the prosecution and defence will both have hypotheses. The hypotheses may be, for example:

- i) the prosecution hypothesis, H_p: The suspect and someone else are the donors to the stain at the scene of crime;
- ii) the defence hypothesis, H_d: Two unknown people are the donors to the stain at the scene of crime.

The LR is given by the formula:

$\begin{matrix} LR = \frac{\Pr (C | S, H_{p})}{\Pr (C | S, H_{d})} . & (54) \end{matrix}$

The crime profile can contain from zero to four heights depending upon the alleles contributed by the two sources and the extent to which they are reflected in the analysis results. The LR formula can be written as:

$\begin{matrix} LR = \frac{\sum_{i} f (C | S, U_{i}, H_{p}) \Pr (U_{i} | S)}{\sum_{j} \sum_{k} f (C | U_{j}, U_{k}, S, H_{d}) \Pr (U_{j}, U_{k} | S)} . & (55) \end{matrix}$

where f(C|S,U_i,H_p) is a density function of C given S and U_i, and f(C|U_j,U_k,S,H_d) is a density function of C given U_j, U_k and S.

Equation 55 is a ratio of likelihoods and, therefore, the Pr's have changed to f's.

The factors Pr(U_i|S) and Pr(U_j,U_k|S) can be computed in a number of ways, for instance using the formula introduced by Balding et al. (1996), also described in Balding (2005) and Buckleton (2005), which may include considerations of relatedness.

The factors f(C|S,U_i,H_p) and f(C|U_j,U_k,S,H_d) state the same type of calculation: the likelihood of the crime profile C given two supposed donors. We denote this calculation by the term:

f(C|G₁,G₂) (56)

where G₁ and G₂ are the genotypes of the supposed donors. Hence, the situation is reduced to a factor of an equivalent general nature to that in the previous situations. This can be evaluated using the control sample approach previously described.

In the following sections a more detailed application of the approach to evidential contexts is provided for the two scenarios:

a) Suspect and victim are heterozygous, with no overlapping alleles between them and only three peaks in the crime profile; and

b) Suspect and victim are heterozygous, with one overlapping allele between them and only three peaks in the crime profile.

In both scenarios, where we are considering LR's for a locus, in respect of a sample from two sources, there are two basic hypotheses to consider:

- H_p: The suspect (S) and the victim (V) are the originators of the crime profile, the hypothesis of the prosecution; and
- H_d: The victim (V) and an unknown (U) are the originators of the crime profile, the hypothesis of the defence.

Heterozygous Suspect and Victim With no Overlapping Alleles and Three Peaks in the Crime Profile

In this situation, H_p: V+S, H_d: V+U, C={h_a,h_b,h_c}, V={a,b} and S={c,d}. The LR is given by:

$\begin{matrix} LR = \frac{f (C = \{h_{a}, h_{b}, h_{c}\} | V = \{a, b\}, S = \{c, d\}, H_{p})}{f (C = \{h_{a}, h_{b}, h_{c}\} | V = \{a, b\}, S = \{c, d\}, H_{d})} . & (4.1) \end{matrix}$

The numerator is given by:

f(C={h_a,h_b,h_c}|V={a,b},S={c,d},H_p)=f_het(h_a,h_b)×f_het(h_c,h_d) (4.2)

The independence assumption in equation 2.2 is also made for a simulation method.

The unknown contributors for the denominator can be {a,c}, {b,c}, {c,Q}. The denominator is given by:

$\begin{matrix} f (C = \{h_{a}, h_{b}, h_{c}\} | V = \{a, b\}, S = \{c, d\}, H_{d}) = f (C = \{h_{a}, h_{b}, h_{c}\} | V = \{a, b\}, S = \{c, d\}, U = \{a, c\}, H_{d}) \Pr (U = \{a, c\} | S = \{a, b\}) + f (C = \{h_{a}, h_{b}, h_{c}\} | V = \{a, b\}, S = \{c, d\}, U = \{b, c\}, H_{d}) \Pr (U = \{b, c\} | S = \{a, b\}) + f (C = \{h_{a}, h_{b}, h_{c}\} | V = \{a, b\}, S = \{c, d\}, U = \{c, Q\}, H_{d}) \Pr (U = \{c, Q\} | S = \{a, b\}) . & (4.3) \end{matrix}$

The factor for U={a,c} is computed with the formula:

f(C={h_a,h_b,h_c}|V={a,b},U={a,c},H_d)=f_het(m_xh_a,h_b)×f_het((1−m_x)h_a,h_c). (4.4)

The factor for U={b,c} is computed with the formula:

f(C={h_a,h_b,h_c}|V={a,b},U={b,c},H_d)=f_het(h_a,m_xh_b)×f_het((1−m_x)h_b,h_c). (4.5)

The factor for U={c,Q} is computed with the formula:

f(C={h_a,h_b,h_c}|V={a,b},U={c,Q},H_d)=f_het(h_a,h_b)×f_het(h_c,h_Q). (4.6)

Heterozygous Suspect and Victim With One Overlapping Allele and Three Peaks in the Crime Profile

In this situation, H_p: V+S, H_d: V+U, C={h_a,h_b,h_c}, V={a,b} and S={b,c}. The likelihood ratio is given by:

$\begin{matrix} LR = \frac{f (C = \{h_{a}, h_{b}, h_{c}\} | V = \{a, b\}, S = \{b, c\}, H_{p})}{f (C = \{h_{a}, h_{b}, h_{c}\} | V = \{a, b\}, S = \{b, c\}, H_{d})} . & (5.1) \end{matrix}$

For the numerator we have:

f(C={h_a,h_b,h_c}|V={a,b},S={b,c},H_p)=f_het(h_a,m_xh_b)f_het((1−m_x)h_b,h_c). (5.2)

For the denominator we consider the following potential unknown contributors:

U ∈ {{a,c},{b,c},{c,c},{c,Q}}.

We look at the formulae per unknown contributor. We start with U={a,c}:

f(C={h_a,h_b,h_c}|V={a,b},S={b,c},U={a,c},H_d)=f_het(m_xh_a,h_b)f_het((1−m_x)h_a,h_c) (5.3)

For U={b,c}:

f(C={h_a,h_b,h_c}|V={a,b},U={b,c},H_d)=f_het(h_a,m_xh_b)f_het((1−m_x)h_b,h_c). (5.4)

For U={c,c}:

f(C={h_a,h_b,h_c}|V={a,b},U={c,c},H_d)=f_het(h_a,h_b)f_het(0.5h_c,0.5h_c) (5.5)

For U={c,Q}:

f(C={h_a,h_b,h_c}|V={a,b},U={c,Q},H_d)=f_het(h_a,h_b)f_het(h_c,h_Q). (5.6)

At the core for evidential situations, and in the intelligence situations discussed below, is the calculation of the likelihood f(C|G₁,G₂).

Situation 6—An Evidential Context—A Mixed Source Sample—With Conditioning on DNA Quantity

This situation is dealt with in an equivalent manner to that described above for situation 5. The formulae can be modified to use the conditioning of DNA quantity.

Situation 7—An Intelligence Context—A Mixed Source Sample—Without Conditioning on DNA Quantity

In the intelligence context, the task is to propose an ordered list of pairs of genotypes G₁ and G₂ per locus, so that the first pair in the list is the most likely pair of donors of the crime stain.

The starting point is the crime stain profile C. For example C={h₁,h₂,h₃}. From this, an exhaustive list {G_1,i,G_2,i} of pairs of potential donors is generated. For each of these pairs, a probability distribution for the genotypes is calculated using the formula:

$\begin{matrix} \Pr (G_{1, i}, G_{2, i} | C) = \frac{f (C | G_{1, i}, G_{2, i}) \Pr (G_{1, i}, G_{2, i})}{\sum_{i} f (C | G_{1, i}, G_{2, i}) \Pr (G_{1, i}, G_{2, i})} & (57) \end{matrix}$

where Pr(G_1,i,G_2,i) is a prior distribution for the pair of genotypes inside the brackets that can be set to a uniform distribution or computed using the formulae introduced by Balding et al. (1996).

As with the situations 5 and 6 in the previous sections, at the core for evidential evaluation and intelligence is the calculation of the likelihood f(C|G₁,G₂).

Situation 8—An Intelligence Context—A Mixed Source Sample—With Conditioning on DNA Quantity

Situation 8 can be handled using the approach of situation 7, but with the term f_het conditioned on DNA quantity.

Additional Information for Mixed Source Situations

As previously mentioned, at the core of the mixed source approach for evidential and intelligence contexts is the calculation of the likelihood f(C|G₁,G₂). The description which follows provides for the calculation of f(C|G₁,G₂) via a factorisation that reduces it to a calculation of likelihoods for single profiles. That approach is also extended to f(C|G₁,G₂,χ_i) to allow conditioning on DNA quantity. Furthermore, the estimation of Pr(χ_i|I(h)) from experimental data and its use in conjunction with these formulae is detailed.

The way in which these terms are computed depends on the alleles that the genotypes share. They are described case-by-case, starting with a method where the factorisation of likelihoods is based on conditioning on a mixing proportion. A method which conditions on DNA quantity, χ, as well as on the mixing proportion m_x is then provided.

Calculation of Likelihoods Via Conditioning on m_x

Scenario 1—Where the Donors do not Share Any Alleles

In this case we do not need the assistance of a mixing proportion m_x to factorise the likelihood of a two-person mixture into two likelihoods of single profiles. For example, if the two donors are heterozygous then:

f(C={h₁,h₂,h₃,h₄}|G₁={1,2},G₂={3,4})=f(C={h₁,h₂}|G₁={1,2})f(C={h₃,h₄}|G₂={3,4}) (58)

If one donor is homozygous:

f(C={h₁,h₂,h₃}|G₁={1,2},G₂={3,3})=f(C={h₁,h₂}|G₁={1,2})f(C={h₃}|G₂={3,3}). (59)

If both donors are homozygotes:

f(C={h₁,h₂}|G₁={1,1},G₂={2,2})=f(C={h₁}|G₁={1,1})f(C={h₂}|G₂={2,2}). (60)

Scenario 2—Where the Donors Share One Allele

If the donors share one allele, then the peak in common, the largest peak, is split according to a mixing proportion m_x. If both donors are heterozygous:

$\begin{matrix} f (C = \{h_{1}, h_{2}, h_{3}\} | G_{1} = \{1, 2\}, G_{2} = \{2, 3\}) = \sum_{m_{x}} f (C = \{h_{1}, h_{2}, h_{3}\} | G_{1} = \{1, 2\}, G_{2} = \{2, 3\}, m_{x}) \Pr (m_{x}) = \sum_{m_{x}} f (C = \{h_{1}, m_{x} h_{2}\} | G_{1} = \{1, 2\}, m_{x}) f (C = \{(1 - m_{x}) h_{2}, h_{3}\} | G_{2} = \{2, 3\}, m_{x}) \Pr (m_{x}) & (61) \end{matrix}$

where Pr(m_x) is a discrete probability distribution for the mixing proportion.
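Equation 61 can be sketched as a weighted sum over a discrete grid of mixing proportions. The single-profile density `f_single` below is a toy stand-in (independent normal peak heights) for the estimated pdf and, unlike the formula, does not otherwise depend on m_x; the grid for Pr(m_x) is hypothetical.

```python
from statistics import NormalDist

# Toy stand-in for the single-profile 2D pdf (an assumption).
_PEAK = NormalDist(mu=300.0, sigma=150.0)

def f_single(h1, h2):
    """Hypothetical likelihood of two peak heights for one heterozygous donor."""
    return _PEAK.pdf(h1) * _PEAK.pdf(h2)

def mixture_likelihood_shared_allele(h1, h2, h3, mixing_probs):
    """f(C={h1,h2,h3} | G1={1,2}, G2={2,3}) as in equation 61.

    mixing_probs maps each mixing proportion m_x to Pr(m_x).
    """
    total = 0.0
    for m_x, prob in mixing_probs.items():
        donor1 = f_single(h1, m_x * h2)          # donor 1 takes m_x of the shared peak
        donor2 = f_single((1.0 - m_x) * h2, h3)  # donor 2 takes the remainder
        total += donor1 * donor2 * prob
    return total

# Hypothetical discrete distribution for the mixing proportion.
mixing_probs = {0.25: 0.25, 0.5: 0.5, 0.75: 0.25}
likelihood = mixture_likelihood_shared_allele(250.0, 520.0, 280.0, mixing_probs)
```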

If one of the donors is homozygous:

$\begin{matrix} f (C = \{h_{1}, h_{2}\} | G_{1} = \{1, 2\}, G_{2} = \{2, 2\}) = \sum_{m_{x}} f (C = \{h_{1}, h_{2}\} | G_{1} = \{1, 2\}, G_{2} = \{2, 2\}, m_{x}) \Pr (m_{x}) = \sum_{m_{x}} f (C = \{h_{1}, m_{x} h_{2}\} | G_{1} = \{1, 2\}, m_{x}) f (C = \{(1 - m_{x}) h_{2}\} | G_{2} = \{2, 2\}, m_{x}) \Pr (m_{x}) & (62) \end{matrix}$

Scenario 3—Where the Donors Share Two Alleles

As with scenario 2, the mixing proportion is used for factorising a two-person-mixture likelihood into two single-profile likelihoods. Both peaks are split. More specifically:

$\begin{matrix} f (C = \{h_{1}, h_{2}\} | G_{1} = \{1, 2\}, G_{2} = \{1, 2\}) = \sum_{m_{x}} f (C = \{h_{1}, h_{2}\} | G_{1} = \{1, 2\}, G_{2} = \{1, 2\}, m_{x}) \Pr (m_{x}) = \sum_{m_{x}} \left[\begin{matrix} f (C = \{m_{x} h_{1}, m_{x} h_{2}\} | G_{1} = \{1, 2\}, m_{x}) \times \\ f (C = \{(1 - m_{x}) h_{1}, (1 - m_{x}) h_{2}\} | G_{2} = \{1, 2\}, m_{x}) \Pr (m_{x}) \end{matrix}\right] . & (63) \end{matrix}$

Calculation of Likelihoods Via Conditioning on DNA Quantity χ and m_x

Scenario 1—Where the Donors do not Share Any Alleles

If the two donors are heterozygotes then:

$\begin{matrix} f (C = \{h_{1}, h_{2}, h_{3}, h_{4}\} | G_{1} = \{1, 2\}, G_{2} = \{3, 4\}) = \sum_{m_{x}} \sum_{χ_{i}} \left[\begin{matrix} f (C = \{h_{1}, h_{2}\} | G_{1} = \{1, 2\}, m_{x} χ_{i}) \times \\ f (C = \{h_{3}, h_{4}\} | G_{2} = \{3, 4\}, (1 - m_{x}) χ_{i}) \times \\ \Pr (m_{x}) \Pr (χ_{i} | I (h)) \end{matrix}\right] & (64) \end{matrix}$

where m_xχ_i is the DNA quantity assigned to donor 1 and (1−m_x)χ_i is the DNA quantity assigned to donor 2. Pr(χ_i|I(h)) is a probability distribution on DNA quantity based on peak height information.

If one donor is homozygous:

$\begin{matrix} f (C = \{h_{1}, h_{2}, h_{3}\} | G_{1} = \{1, 2\}, G_{2} = \{3, 3\}) = \sum_{m_{x}} \sum_{χ_{i}} \left[\begin{matrix} f (C = \{h_{1}, h_{2}\} | G_{1} = \{1, 2\}, m_{x} χ_{i}) \times \\ f (C = \{h_{3}\} | G_{2} = \{3, 3\}, (1 - m_{x}) χ_{i}) \times \\ \Pr (m_{x}) \Pr (χ_{i} | I (h)) \end{matrix}\right] . & (65) \end{matrix}$

If both donors are homozygotes:

$\begin{matrix} f (C = \{h_{1}, h_{2}\} | G_{1} = \{1, 1\}, G_{2} = \{2, 2\}) = \sum_{m_{x}} \sum_{χ_{i}} \left[\begin{matrix} f (C = \{h_{1}\} | G_{1} = \{1, 1\}, m_{x} χ_{i}) \times \\ f (C = \{h_{2}\} | G_{2} = \{2, 2\}, (1 - m_{x}) χ_{i}) \times \\ \Pr (m_{x}) \Pr (χ_{i} | I (h)) \end{matrix}\right] & (66) \end{matrix}$

Scenario 2—Where the Donors Share One Allele

If the donors share one allele, then the peak height in common is split according to a mixing proportion m_x. If both donors are heterozygous:

$\begin{matrix} f (C = \{h_{1}, h_{2}, h_{3}\} | G_{1} = \{1, 2\}, G_{2} = \{2, 3\}) = \sum_{m_{x}} \sum_{χ_{i}} f (C = \{h_{1}, h_{2}, h_{3}\} | G_{1} = \{1, 2\}, G_{2} = \{2, 3\}, χ_{i}, m_{x}) \Pr (m_{x}) \Pr (χ_{i} | I (h)) = \sum_{m_{x}} \sum_{χ_{i}} \left[\begin{matrix} f (C = \{h_{1}, m_{x} h_{2}\} | G_{1} = \{1, 2\}, m_{x} χ_{i}) \times \\ f (C = \{(1 - m_{x}) h_{2}, h_{3}\} | G_{2} = \{2, 3\}, (1 - m_{x}) χ_{i}) \times \\ \Pr (m_{x}) \Pr (χ_{i} | I (h)) \end{matrix}\right] & (67) \end{matrix}$

where Pr(m_x) is a discrete probability distribution for the mixing proportion and Pr(χ_i|I(h)) is a probability distribution of DNA quantity given peak information from one or more loci, possibly including the locus for which the formula will be used.
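The double sum of equation 67 can be sketched as below. The quantity-conditioned density `f_single` is a toy stand-in in which each peak height is normally distributed around a mean proportional to the DNA quantity assigned to the donor; this model, the scaling constants and the two discrete distributions are assumptions made only for illustration. Mixing proportions must lie strictly between 0 and 1 so that each donor is assigned a positive quantity.

```python
from statistics import NormalDist

def f_single(h1, h2, quantity, height_per_unit=1.0, rel_spread=0.5):
    """Toy f(C={h1,h2} | G, quantity): independent normal peak heights
    whose mean and spread scale with the donor's DNA quantity (an assumption)."""
    mean = height_per_unit * quantity
    peak = NormalDist(mu=mean, sigma=rel_spread * mean)
    return peak.pdf(h1) * peak.pdf(h2)

def mixture_likelihood(h1, h2, h3, mixing_probs, quantity_probs):
    """f(C={h1,h2,h3} | G1={1,2}, G2={2,3}) as in equation 67.

    mixing_probs maps m_x to Pr(m_x); quantity_probs maps chi_i to Pr(chi_i | I(h)).
    """
    total = 0.0
    for m_x, p_m in mixing_probs.items():
        for chi, p_chi in quantity_probs.items():
            # Donor 1 receives m_x * chi, donor 2 receives (1 - m_x) * chi.
            donor1 = f_single(h1, m_x * h2, m_x * chi)
            donor2 = f_single((1.0 - m_x) * h2, h3, (1.0 - m_x) * chi)
            total += donor1 * donor2 * p_m * p_chi
    return total

# Hypothetical discrete distributions.
mixing_probs = {0.25: 0.25, 0.5: 0.5, 0.75: 0.25}
quantity_probs = {250.0: 0.2, 500.0: 0.6, 1000.0: 0.2}
likelihood = mixture_likelihood(250.0, 520.0, 280.0, mixing_probs, quantity_probs)
```

The other scenarios differ only in which peaks are split and in whether a one-peak homozygote density replaces the two-peak density for a donor.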

If one of the donors is homozygous:

$\begin{matrix} f (C = \{h_{1}, h_{2}\} | G_{1} = \{1, 2\}, G_{2} = \{2, 2\}) = \sum_{m_{x}} \sum_{χ_{i}} f (C = \{h_{1}, h_{2}\} | G_{1} = \{1, 2\}, G_{2} = \{2, 2\}, χ_{i}, m_{x}) \Pr (m_{x}) \Pr (χ_{i} | I (h)) = \sum_{m_{x}} \sum_{χ_{i}} \left[\begin{matrix} f (C = \{h_{1}, m_{x} h_{2}\} | G_{1} = \{1, 2\}, m_{x} χ_{i}) \times \\ f (C = \{(1 - m_{x}) h_{2}\} | G_{2} = \{2, 2\}, (1 - m_{x}) χ_{i}) \times \\ \Pr (m_{x}) \Pr (χ_{i} | I (h)) \end{matrix}\right] & (68) \end{matrix}$

Scenario 3—Where the Donors Share Two Alleles

As with previous cases, the mixing proportion is used for factorising a two-person-mixture likelihood into two single-profile likelihoods. More specifically:

$\begin{matrix} f (C = \{h_{1}, h_{2}\} | G_{1} = \{1, 2\}, G_{2} = \{1, 2\}) = \sum_{m_{x}} \sum_{χ_{i}} f (C = \{h_{1}, h_{2}\} | G_{1} = \{1, 2\}, G_{2} = \{1, 2\}, χ_{i}, m_{x}) \Pr (m_{x}) \Pr (χ_{i} | I (h)) = \sum_{m_{x}} \sum_{χ_{i}} \left[\begin{matrix} f (C = \{m_{x} h_{1}, m_{x} h_{2}\} | G_{1} = \{1, 2\}, m_{x} χ_{i}) \times \\ f (C = \{(1 - m_{x}) h_{1}, (1 - m_{x}) h_{2}\} | G_{2} = \{1, 2\}, (1 - m_{x}) χ_{i}) \times \\ \Pr (m_{x}) \Pr (χ_{i} | I (h)) \end{matrix}\right] . & (69) \end{matrix}$

General Observations

By making use of 2D pdf's, the present invention provides a number of advantages and allows a variety of situations and hypotheses to be considered.

For instance, the situation of FIG. 16 could arise due to the sample source being homozygous or due to the sample source being heterozygous and allele dropout occurring. Both of these possibilities are fully considered in the present invention.

In the context of the mixtures consideration, the principle of comparing pdf's having the same number of dimensions is also used. In this case, there are a number of 3D pdf's (such as those in equations 4.4, 4.5, 5.3, 5.4) to compare with 2D pdf's (such as those in equations 4.6, 5.5, 5.6). The 3D pdf's are moved to be 2D pdf's by deconvoluting for each allele, taking into account the mixing proportions.

Modelling according to the present invention offers a number of advantages:

- 1) the model can deal with allelic dropout as dropout events are considered in the estimation of the 2D pdf.
- 2) the model can deal with preferential amplification automatically and as a function of mean peak height. At present, reporting officers involved in the analysis process need to select which combinations of donors to choose according to preferential amplification thresholds.
- 3) the model can deal with degradation. The preferential amplification distribution implicit in the 2D pdf changes according to mean peak height.
- 4) the model will render a system that will give larger likelihood ratios for a larger number of cases and can help rank profiles obtained from a database search.
- 5) the above mentioned approach could be extended to provide for scoring the results of the mixture analysis.
- 6) the approach allows for searches against a database to be prioritised and so give a reduced number of searches which are needed. A reduced number of searches gives an increase in search speed.
  
The following documents, and their contents, are incorporated herein by reference, with particular emphasis on their teachings as referenced at specific locations within this document:

D. J. Balding, M. Greenhalgh, R. A. Nichols (1996). Population genetics of STR loci in Caucasians. 108:300-305.

D. J. Balding (2005). Weight-of-evidence for forensic DNA profiles. John Wiley & Sons Ltd.

J. Buckleton, C. M. Triggs, S. J. Walsh (2005). Forensic DNA evidence interpretation. CRC press.

A. Dempster, N. Laird, and D. Rubin (1977). “Maximum likelihood from incomplete data via the EM algorithm”. Journal of the Royal Statistical Society, Series B, 39(1):1-38.

Number	Date	Country	Kind
GB 0722649.1	Nov 2007	GB	national
GB 0804665.8	Mar 2008	GB	national
GB 0810624.7	Jun 2008	GB	national
GB 0815300.9	Aug 2008	GB	national
