This patent application relates generally to mixture deconvolution systems and methods for identifying DNA profiles.
Investigative genetic genealogy (IGG) has emerged as a new, rapidly growing field of forensic science since its use in identifying the Golden State Killer in 2018. Recent IGG techniques have had a significant impact on the resolution of current and, especially, cold criminal cases. As a result, IGG is in high demand across the international forensic community. Currently, IGG searches are conducted only with a single-source DNA profile, requiring the deconvolution of any DNA mixtures prior to their use for long-range familial searching. However, estimates indicate that approximately 50% of forensic casework samples are low-level, partially degraded and/or mixtures, leaving samples from unidentified human remains, violent crime and matters of national security unresolved. For example, forensic casework samples may include DNA mixtures from more than one person. Mixtures of the DNA of people who did not match reference database profiles (a significant fraction of DNA evidence) cannot be used for emerging/advanced methods like IGG by existing systems.
It is appreciated that there is a need for a system and method to isolate distinct DNA profiles from a DNA mixture to enable searching in existing genealogy databases. Various embodiments described herein concern the deconvolution of unknown DNA profiles in a two-person DNA mixture into two DNA profiles. Deconvolution methods isolate distinct DNA profiles from a DNA mixture without the need to match against DNA reference profiles. As described herein, a system and method are provided for a mixture deconvolution pipeline that involves a series of mathematical steps and machine learning algorithms to achieve the desired performance and decision-support outputs. Various embodiments enable distant familial matching to existing investigative genetic genealogy (IGG; also known as forensic genetic genealogy (FGG)) databases. This capability enables the generation of investigative leads from unresolved casework samples (i.e., DNA mixtures) by identifying possible genealogical relationships to one or more person(s) of interest. Such aspects may be performed in association with one or more systems used for genetic identification.
According to some embodiments, aspects relate to addressing a large unmet need in the forensic genomics market: the ability to deconvolve DNA profiles of unknown persons that are mixed with DNA from one or more other person(s) to enable searching in existing genealogy databases. Adding this capability will improve the generation of investigative leads in challenging defense, intelligence, and prosecutorial cases which often rely on incomplete DNA profile reference databases that hamper case resolution as well as offer an additional revenue stream for commercial laboratories involved in the forensic industry.
In some embodiments described herein, a two-person mixture may be processed in such a manner that does not require reference DNA from a subject. Rather, processing of the mixture as well as one or more existing genealogical databases are used to identify an individual. This process is beneficial, as reference DNA is not required for identification. Rather, long-range familial searching may be used for determining investigative leads. Further, in some embodiments, machine learning methods may be applied to more accurately predict the sex of particular contributors. Such elements may be used in an overall identification strategy and identification pipeline.
Some embodiments include a series of mathematical steps and mechanized processes to ingest, process, and produce results (e.g., in the current standard Verogen ForenSeq Kintelligence sequencing format) of a two-person mixture. During processing, multiple algorithms are applied to select two-person mixtures for evaluation; to identify each contributor's sex and concentration; and finally, to deconvolve each contributor's single nucleotide polymorphism (SNP) profile. Such algorithms (and companion software implementation) may be, according to various embodiments, specifically designed to yield the predicted number of contributors (NOC) in the mixture as well as the estimated percent contribution, predicted sex, and predicted DNA profile of each contributor. This information may be used to compare the individual DNA profile of each contributor to a wide variety of genealogical databases.
According to one aspect, a system is provided. The system comprises a component configured to analyze an input DNA mixture comprising at least two DNA contributors, a component configured to identify the number of contributors in the DNA mixture, a component configured to identify the sex of the two DNA contributors, a component to estimate the concentration of the two DNA contributors, and a component adapted to determine an individual DNA profile for the two DNA contributors.
According to one embodiment, one or more forensic genealogy databases comprise DNA markers enabling long-range familial searching of at least three degrees. According to one embodiment, the system further comprises a supervised learning model, the model being trained on a plurality of classification features relating to the input DNA mixture. According to one embodiment, the plurality of classification features comprises at least one of a group comprising a plurality of autosomal loci of an existing panel, estimated concentrations for minor and major contributors, a minor allele count ratio for each autosomal locus within the input DNA mixture, a number of loci with a minor allele within the input DNA mixture, and global allele frequencies for each of the plurality of autosomal loci of an existing panel. For example, a commercially available panel (e.g., commercially available from Verogen, Inc. or other sources) may be used that provides autosomal loci information.
According to one embodiment, the system further comprises applying a threshold responsive to a predicted DNA marker at each genetic location and the estimated concentrations. According to one embodiment, the supervised learning model includes a random forest model. According to one embodiment, the random forest model is operated to deconvolve two-person mixtures. According to one embodiment, the processing component is used within an identification pipeline. According to one embodiment, the processing component is used to identify and select two-person mixtures for processing through the identification pipeline. According to one embodiment, the supervised learning model includes at least one output from a group comprising a probability for each possible genotype combination contained in the mixture, a predicted genotype with the highest probability score, and predicted DNA profiles and corresponding prediction probabilities for each of the two DNA contributors. According to one embodiment, the processing component is configured to deconvolve an input DNA mixture comprising two DNA contributors into two distinct DNA profiles. According to one embodiment, the processing component is configured to determine the two distinct DNA profiles without performing a comparison with one or more DNA reference profiles. According to one embodiment, the component configured to identify the sex of the two DNA contributors further comprises a learning model, the model being trained on a plurality of classification features relating to the input DNA mixture. According to one embodiment, the plurality of classification features comprises a total number of counts of non-autosomal loci of the input DNA mixture at each sex genetic location.
Still other aspects, examples, and advantages of these exemplary aspects and examples, are discussed in detail below. Moreover, it is to be understood that both the foregoing information and the following detailed description are merely illustrative examples of various aspects and examples and are intended to provide an overview or framework for understanding the nature and character of the claimed aspects and examples. Any example disclosed herein may be combined with any other example in any manner consistent with at least one of the objects, aims, and needs disclosed herein, and references to “an example,” “some examples,” “an alternate example,” “various examples,” “one example,” “at least one example,” “this and other examples” or the like are not necessarily mutually exclusive and are intended to indicate that a particular feature, structure, or characteristic described in connection with the example may be included in at least one example. The appearances of such terms herein are not necessarily all referring to the same example.
Various aspects of at least one example are discussed below with reference to the accompanying figures, which are not intended to be drawn to scale. The figures are included to provide an illustration and a further understanding of the various aspects and examples, and are incorporated in and constitute a part of this specification, but are not intended as a definition of the limits of a particular example. The drawings, together with the remainder of the specification, serve to explain principles and operations of the described and claimed aspects and examples. In the figures, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component may be labeled in every figure. In the figures:
At block 103, the system predicts a number of contributors within the mixture and selects a two-person mixture for processing. Further, the system may perform a number of processes by one or more components that process the mixture to determine predictions about the mixture. For example, the system may include a component (e.g., component 105) that is configured to predict a sex of one or more of the contributors. Further, the system may include a component (e.g., component 104) that is configured to estimate a percent contribution of the contributors to the mixture. Also, the system may include a component (e.g., component 106) that is configured to predict a DNA profile of the contributors. This deconvolved mixture information 110 may then be provided as outputs. The output information may be provided, for example, to a system that allows for identification of individuals identified from information determined from the deconvolved mixtures.
For example, as an optional set of steps, the information determined from deconvolving the mixture (e.g., deconvolved mixture information 110) may be used by an identification system to determine one or more output matches. For example, at block 107, the system compares a DNA profile of each contributor to one or more genealogical databases. At block 108, the system outputs any matches, and at block 109, process 100 ends.
As discussed above and in further detail below, the system may be capable of processing an input DNA mixture and deconvolving information relating the mixture using input DNA features. In particular,
Further, as discussed above, system 200 may implement machine learning models (e.g., learning model 300) that provide information relating to individuals having DNA present in the input mixture.
Some embodiments include a series of mathematical steps and mechanized processes to ingest, process, and produce results (e.g., in the current standard Verogen ForenSeq Kintelligence sequencing format) of a two-person mixture. During processing, multiple algorithms are applied to select two-person mixtures for evaluation; to identify each contributor's sex and concentration; and finally, to deconvolve each contributor's single nucleotide polymorphism (SNP) profile. Such algorithms (and companion software implementation) may be, according to various embodiments, specifically designed to yield the predicted number of contributors (NOC) in the mixture as well as the estimated percent contribution, predicted sex, and predicted DNA profile of each contributor (e.g., as shown in
Various algorithms discover and refine unknown profiles from forensic DNA mixtures, such as the apex unknown method, the unknown coalesce method, the SCOPE method, and direct deconvolution using a random forest classifier. Additional features of some embodiments of the present invention, beyond these algorithms that extract unknown DNA profiles from a two-person mixture, include:
In addition, a small portion of the SCOPE method such as the exemplary Equation described below may be leveraged to determine the number of contributors in the DNA mixture and the Unknown Concentration Estimation (UCE) method may be leveraged to determine the contributor concentrations of each individual in the mixture.
Exemplary Equation:
$$\text{Number of loci with a minor allele} = L - \sum \left(p^{2}\right)^{N}$$
However, adaptations may be required for its compatibility with the ForenSeq Kintelligence sequencing panel, which has approximately 10,000 DNA markers. In silico mixtures may be modelled to calculate the expected mean number of minor alleles for a two-person mixture and the minor contributor's average mAR plateau, which are compared against the unknown mixtures to estimate the number of contributors and the contributor concentrations, respectively. In some embodiments, the adapted number-of-contributors algorithm and contributor-concentration algorithm may be processed sequentially and may be crucial first steps in one embodiment that 1) identifies and selects two-person mixtures for continuation through the deconvolution pipeline and 2) estimates the contributors' concentrations, which are utilized as an input feature in the mixture deconvolution algorithm.
For mixture deconvolution, a random forest model may be used to deconvolve two-person mixtures using actual or estimated contributions (provided by the contributor concentrations algorithm mentioned above), minor allele ratio (mAR) at each autosomal genetic location, rank order of genetic locations as determined by the mAR, total number of minor allele calls in the mixture, and global allele frequencies for each autosomal genetic location from the Genome Aggregation Database (gnomAD). This algorithm may be specific to two-person mixtures using the Verogen ForenSeq Kintelligence genetic panel. In some embodiments, the model may provide the predicted DNA markers for each genetic location with their corresponding probability score. Custom probability thresholds based on DNA markers and contributor concentrations may be used in some embodiments to remove predicted DNA markers below the threshold to increase performance (specificity and sensitivity) relative to benchmark standards. For sex identification, a second random forest model may be used to predict the sex of each contributor in an unknown two-person mixture. In some embodiments, the key classification feature employed by the model may be the total sequencing read count at each sex genetic location. This feature may be conveniently provided in the raw sequencing results from the instrument, which are recorded in the standard Verogen ForenSeq Kintelligence sequencing format. This algorithm may be specific to two-person mixtures using the ForenSeq Kintelligence sex SNPs.
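By way of illustration only, the exemplary equation may be sketched in Python as follows. The sketch assumes, beyond what is stated above, that p is the major allele frequency at each locus, that the sum runs over all L loci, that N is the number of contributors, and that (p^2)^N approximates the chance that all N contributors are homozygous for the major allele. The function name and the toy frequencies are hypothetical and do not represent the Kintelligence panel.

```python
def expected_minor_allele_loci(major_allele_freqs, n_contributors):
    """Expected number of loci with a minor allele: L - sum((p^2)^N)."""
    if n_contributors < 1:
        raise ValueError("n_contributors must be at least 1")
    n_loci = len(major_allele_freqs)  # L in the exemplary equation
    # (p^2)^N: assumed probability that all N contributors are homozygous major
    no_minor = sum((p ** 2) ** n_contributors for p in major_allele_freqs)
    return n_loci - no_minor


# Toy panel of four loci (hypothetical frequencies, for illustration only).
toy_freqs = [0.9, 0.8, 0.7, 0.6]
expected_by_noc = {n: expected_minor_allele_loci(toy_freqs, n) for n in (1, 2, 3)}
```

Under these assumptions the expected count rises with each added contributor, which is the property that allows the observed count to be compared against simulated expectations for each candidate NOC.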
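As a non-limiting sketch, the per-locus input features named above could be assembled as shown below before being passed to a classifier. Several details are assumptions made for illustration: the mAR is taken as minor-allele reads divided by total reads at a locus, ranks are assigned in descending mAR order, a minor allele call is counted whenever the mAR is greater than zero, and all names (`LocusFeatures`, `build_features`, the locus identifiers and values) are hypothetical. The random forest itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class LocusFeatures:
    locus_id: str
    minor_fraction: float    # actual or estimated minor-contributor fraction
    mar: float               # minor allele ratio at this locus
    mar_rank: int            # 1 = highest mAR in the mixture
    total_minor_calls: int   # loci in the mixture with a minor allele call
    global_minor_af: float   # global allele frequency, e.g., from gnomAD


def build_features(read_counts, global_minor_af, minor_fraction):
    """read_counts maps locus_id -> (major_reads, minor_reads)."""
    mar = {}
    for locus, (major, minor) in read_counts.items():
        total = major + minor
        mar[locus] = minor / total if total else 0.0
    # Rank loci by descending mAR (assumed ordering).
    ranked = sorted(mar, key=lambda locus: mar[locus], reverse=True)
    rank = {locus: i + 1 for i, locus in enumerate(ranked)}
    minor_calls = sum(1 for value in mar.values() if value > 0)
    return [
        LocusFeatures(locus, minor_fraction, mar[locus], rank[locus],
                      minor_calls, global_minor_af[locus])
        for locus in read_counts
    ]


# Hypothetical read counts and allele frequencies for three loci.
counts = {"rs1": (90, 10), "rs2": (50, 50), "rs3": (100, 0)}
freqs = {"rs1": 0.12, "rs2": 0.45, "rs3": 0.03}
rows = build_features(counts, freqs, minor_fraction=0.2)
```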
Recently, Investigative Genetic Genealogy (IGG) has been a rapidly growing forensic industry assisting in over 200 cold cases in the United States. IGG is currently conducted using single-source DNA profiles. Various embodiments of the present invention may have a high national security impact by providing the opportunity to utilize mixtures in addition to single-source profiles, thereby increasing the generation of investigative leads in challenging defense, intelligence, and prosecutorial cases. Beyond the national security impact, various embodiments of the present invention will fill a large gap in the forensic genomics market: the deconvolution of DNA profiles from a DNA mixture to enable searching of existing genealogy databases. In some embodiments, various aspects described herein may be incorporated within one or more computer systems for identifying individuals from one or more databases. In some embodiments, some aspects may be configured to operate within various software systems used to search various databases (e.g., Verogen's ForenSeq Kintelligence SNPs and GEDMatch database) implementing one or more workflows (e.g., Verogen's IGG workflow).
Various embodiments described herein have been demonstrated in a laboratory environment beyond proof-of-concept capability for two-person mixtures. Over 500 in silico and 30 real experimental mixtures (consisting of unblinded and blinded datasets) demonstrated feasibility and high performance, as shown in
Recent IGG techniques have had a significant impact on the resolution of current and, especially, cold criminal cases. As a result, IGG is in high demand across the international forensic community. Mixtures of the DNA of people who did not match reference database profiles (a significant fraction of DNA evidence) cannot be used for emerging/advanced methods like IGG by existing systems. Advantages of using various methods as described herein include the ability to identify the sex and recover DNA profiles for each unknown contributor of two-person mixtures to enable long-range familial searching (e.g., 3rd to 4th degree) in genetic genealogy databases. In addition, some embodiments described herein provide a probability value associated with the predicted DNA profiles, yielding confidence scores for the deconvolved profiles. These capabilities provide a significant impact on the large fraction of cases where DNA mixtures currently prevent the use of IGG searches.
As illustrated by
Summary statistics for each NOC based on simulation are listed in Table 1 below and the distribution can be visualized in
In some embodiments, the NOC may be predicted for an unknown mixture by:
Table 2 below illustrates the computed z-scores for all possible NOCs from an unknown mixture having 8797 loci with a minor allele. The NOC of two resulted in the lowest absolute z-score predicting two contributors in the mixture.
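For illustration, the z-score comparison described above may be sketched as follows, assuming the conventional definition z = (observed − mean) / standard deviation against the simulated summary statistics for each candidate NOC. The means and standard deviations below are placeholders chosen only to make the example run; they are not the values of Table 1. The function name is hypothetical.

```python
def predict_noc(observed_loci, noc_stats):
    """noc_stats maps NOC -> (mean, standard deviation) from simulated mixtures."""
    z_scores = {
        noc: (observed_loci - mean) / sd
        for noc, (mean, sd) in noc_stats.items()
    }
    # The predicted NOC is the one with the lowest absolute z-score.
    best = min(z_scores, key=lambda noc: abs(z_scores[noc]))
    return best, z_scores


# Placeholder statistics for illustration only; NOT the values of Table 1.
hypothetical_stats = {1: (6500.0, 150.0), 2: (8800.0, 100.0), 3: (9500.0, 60.0)}
predicted_noc, z = predict_noc(8797, hypothetical_stats)
```

With these placeholder statistics, the observed count of 8797 loci lies closest to the two-contributor distribution, so the sketch selects a NOC of two.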
In some embodiments, a Random Forest model may be generated using a predetermined number (e.g., 500) of in silico mixtures and the total number of counts per non-autosomal locus (normalized to counts per million) to predict the sex of each contributor in an unknown two-person mixture. An exemplary process using exemplary model inputs is illustrated by block 404 of
In other embodiments, approaches similar to those described below for the deconvolution method may be implemented to determine the sex of the contributors (e.g., deterministic approaches using counts-based features, other probabilistic supervised machine learning methods). In some embodiments, a tiered approach may be used that first utilizes the Y-sex markers to determine the presence or absence of a male contributor in the mixture and then determines the ratio of male to female presence utilizing the signal ratio of the Y- to X-sex markers.
In some embodiments, the model input may be the total number of counts per non-autosomal (sex) locus normalized to counts per million. Table 3 illustrates an exemplary model input showing normalized total counts for 3 of the 233 non-autosomal loci.
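A minimal sketch of the counts-per-million normalization is given below. It assumes the denominator is the total read count across the non-autosomal loci supplied to the function; a different denominator (e.g., all reads in the sample) could equally be intended. The locus names and counts are hypothetical.

```python
def counts_per_million(locus_counts):
    """Scale raw read counts per locus so that they sum to one million."""
    total = sum(locus_counts.values())
    if total == 0:
        raise ValueError("no reads at any non-autosomal locus")
    return {
        locus: count * 1_000_000 / total
        for locus, count in locus_counts.items()
    }


# Hypothetical locus names and raw read counts.
raw = {"X_snp_1": 300, "X_snp_2": 500, "Y_snp_1": 200}
cpm = counts_per_million(raw)
```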
In some embodiments, the model output may be a single character vector representing the sex of the major/minor contributor (e.g., “F/M”). For example, “F/M” represents a mixture with a female major contributor and male minor contributor.
In some embodiments, sex markers (X- and Y-SNPs) may provide an effective means of determining the sex of an individual and estimating the ratio of male and female DNA in a mixture. More specifically, the presence or absence of the Y chromosome is critical, as only males will inherit a Y chromosome and will only have a single copy of the X chromosome.
In some embodiments, relative probability thresholds may be selected for each genotype and contributor concentration using 500 in silico mixtures representing various ethnicities and mixture contribution ratios. Optimized thresholds were determined by algorithmically decreasing the number of false positive genotype calls below a target (10% per possible genotype combination). This target was chosen to provide a sufficiently high number of true positive genotype calls (i.e., >3000 loci for 3rd degree relationship and >6000 for 4th degree relationship) for searching in IGG databases.
In some embodiments, the model output may provide the predicted genotypes for each locus with their corresponding probability score. Table 4 shows an example of threshold implementation in which the second row corresponds to genotype calls below the threshold that are assigned “./._./.” and the first row corresponds to assigned genotype calls.
In some embodiments, genotype calls below the threshold (optimized threshold for the given genotype combination and contributor concentration) are assigned “./.” rather than the predicted genotype to reduce the false positive rate. Genotype calls above the threshold may also be assigned. Table 4 provides an example of two genotype calls demonstrating both scenarios: in the first row, the predicted genotype is assigned because its probability score is above the threshold, and in the second row, a predicted genotype is not called.
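The tiered use of Y and X markers may be illustrated, under strongly idealized assumptions, by the sketch below. It assumes that per-marker signal is proportional to chromosome copy number, so that with a male DNA fraction m the mean Y signal scales as m and the mean X signal scales as 2 − m, giving r = Y/X = m / (2 − m) and therefore m = 2r / (1 + r). The noise floor value, function name, and signal values are hypothetical, and real data would require calibration that is not shown.

```python
from statistics import mean


def tiered_male_estimate(x_signals, y_signals, y_presence_floor=5.0):
    """Return (male_present, estimated_male_fraction) from per-marker signals."""
    x_mean = mean(x_signals)
    y_mean = mean(y_signals)
    # Tier 1: is the Y signal above the (hypothetical) noise floor?
    if y_mean < y_presence_floor:
        return False, 0.0
    if x_mean <= 0:
        raise ValueError("X-marker signal is required to estimate a ratio")
    # Tier 2: idealized model, Y ~ m and X ~ 2 - m, so m = 2r / (1 + r).
    r = y_mean / x_mean
    male_fraction = min(1.0, 2 * r / (1 + r))
    return True, male_fraction


# Hypothetical normalized signals: an even male/female mixture, then female only.
mixed = tiered_male_estimate([150, 150, 150], [50, 50])
female_only = tiered_male_estimate([200, 200], [0, 1])
```

In this idealized model, a single-source male sample gives r = 1 and an estimated male fraction of 1, while an even male/female mixture gives r = 1/3 and an estimated male fraction of 0.5.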
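One possible way to select such a threshold is sketched below for a single genotype combination at a single contributor concentration. The sketch assumes that each in silico call carries a probability score and a known correct/incorrect label, and it returns the lowest candidate threshold at which the fraction of false positives among retained calls does not exceed the target. This search strategy, the function name, and the example scores are assumptions for illustration and are not asserted to be the optimization actually used.

```python
def select_threshold(scored_calls, max_false_positive_fraction=0.10):
    """scored_calls: list of (probability_score, is_correct) pairs.

    Returns the lowest candidate threshold whose retained calls meet the
    false positive target, or None if no candidate does.
    """
    candidates = sorted({score for score, _ in scored_calls})
    for threshold in candidates:
        kept = [ok for score, ok in scored_calls if score >= threshold]
        false_positives = kept.count(False)
        if false_positives / len(kept) <= max_false_positive_fraction:
            return threshold
    return None


# Hypothetical scored calls from simulated mixtures with known truth.
calls = [(0.9, True), (0.8, True), (0.7, True),
         (0.6, False), (0.5, True), (0.4, False)]
chosen = select_threshold(calls)
```

Choosing the lowest qualifying threshold retains as many calls as possible, which is consistent with the stated aim of keeping a sufficiently high number of true positive calls for database searching.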
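The threshold implementation described above may be sketched as a simple lookup and comparison. The sketch assumes that thresholds are stored per (genotype combination, contributor concentration) pair, that a score equal to the threshold is retained, and that a two-contributor no-call is written in the “./._./.” form. The genotype strings, concentration label, and threshold values are hypothetical.

```python
NO_CALL = "./._./."  # no-call marker for a two-contributor genotype combination


def call_genotypes(predictions, thresholds, concentration):
    """predictions: list of (locus_id, genotype_combination, probability)."""
    called = {}
    for locus, genotype, probability in predictions:
        cutoff = thresholds[(genotype, concentration)]
        # Keep the predicted genotype only if its score reaches the cutoff.
        called[locus] = genotype if probability >= cutoff else NO_CALL
    return called


# Hypothetical predictions and thresholds for an 80:20 mixture.
predictions = [("rs1", "0/1_0/0", 0.92), ("rs2", "1/1_0/1", 0.41)]
thresholds = {("0/1_0/0", "80:20"): 0.60, ("1/1_0/1", "80:20"): 0.70}
called = call_genotypes(predictions, thresholds, "80:20")
```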
Block 405 of
In other embodiments, other probabilistic classification methods could also be utilized as the deconvolution method to extract unknown DNA profiles from a two-person mixture.
In some embodiments, the model input may include:
Table 5 shows an example of the 5 features used for model inputs, as explained above.
In some embodiments, the model output may include (i) a probability for each possible genotype combination in the mixture; and (ii) a predicted genotype (the genotype with the highest probability score). Table 6 shows an exemplary model output illustrating the probability score for all possible genotype combinations for each locus and the predicted genotype.
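As a brief illustration of the two outputs, the sketch below selects the genotype combination with the highest probability at one locus and splits it into per-contributor genotypes. The combination string format (“major_minor”, with an underscore separator) and the probability values are assumptions made for this example only.

```python
def predict_genotype(combination_probabilities):
    """Return (best_combination, probability) for one locus."""
    best = max(combination_probabilities, key=combination_probabilities.get)
    return best, combination_probabilities[best]


def split_contributors(combination):
    """Split an assumed 'major_minor' combination string into two genotypes."""
    major, minor = combination.split("_")
    return major, minor


# Hypothetical probabilities for the possible combinations at one locus.
locus_probs = {"0/0_0/0": 0.05, "0/1_0/0": 0.70, "0/0_0/1": 0.20, "0/1_0/1": 0.05}
best_combination, best_probability = predict_genotype(locus_probs)
major_genotype, minor_genotype = split_contributors(best_combination)
```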
The utility of these features has been previously demonstrated to be valuable for deconvolving an unknown mixture.
In some embodiments, the software may include code (e.g., written in R) which ingests and deconvolves the standard Verogen ForenSeq Kintelligence sequencing results text file. In some embodiments, files may be output for each of the two contributors containing the predicted number of contributors in the mixture as well as the estimated percent contribution, predicted sex, and predicted DNA profile of each contributor. In some embodiments, the files may then be used individually to compare the DNA profile of each contributor to genealogical databases. In some embodiments, the algorithm code may be packaged into a Docker container that can be easily transitioned to be utilized by any individual and machine.
In some embodiments, the object code may include sysdata as an RDA file, which may include one or more input files:
In some embodiments, the source code may include:
In some embodiments, one or more README files may include the steps needed to run the deconvolution pipeline (descriptions above) as well as how to build the Docker container. Table 7 shows an example of txt file output information that is generated for each contributor.
Table 8 shows an example of output information.
Some embodiments include one or more of the following 3rd-party dependencies:
In some embodiments, one or more of the third party dependencies are unmodified.
The above-described embodiments can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. It should be understood that any component or collection of components that perform the functions described above can be generically considered as one or more controllers that control the above-discussed functions. The one or more controllers can be implemented in numerous ways, such as with dedicated hardware or with one or more processors programmed using microcode or software to perform the functions recited above.
In this respect, it should be understood that one implementation of the embodiments of the present invention comprises at least one non-transitory computer-readable storage medium (e.g., a computer memory, a portable memory, a compact disk, etc.) encoded with a computer program (i.e., a plurality of instructions), which, when executed on a processor, performs the above-discussed functions of the embodiments of the present invention. The computer-readable storage medium can be transportable such that the program stored thereon can be loaded onto any computer resource to implement the aspects of the present invention discussed herein. In addition, it should be understood that the reference to a computer program which, when executed, performs the above-discussed functions, is not limited to an application program running on a host computer. Rather, the term computer program is used herein in a generic sense to reference any type of computer code (e.g., software or microcode) that can be employed to program a processor to implement the above-discussed aspects of the present invention.
Various aspects of the present invention may be used alone, in combination, or in a variety of arrangements not specifically discussed in the embodiments described in the foregoing and are therefore not limited in their application to the details and arrangement of components set forth in the foregoing description or illustrated in the drawings. For example, aspects described in one embodiment may be combined in any manner with aspects described in other embodiments.
Also, embodiments of the invention may be implemented as one or more methods, of which an example has been provided. The acts performed as part of the method(s) may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments.
Use of ordinal terms such as “first,” “second,” “third,” etc., in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed. Such terms are used merely as labels to distinguish one claim element having a certain name from another element having a same name (but for use of the ordinal term).
The phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of “including,” “comprising,” “having,” “containing”, “involving”, and variations thereof, is meant to encompass the items listed thereafter and additional items.
This application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application Ser. No. 63/389,748, filed Jul. 15, 2022, and entitled “MIXTURE DECONVOLUTION METHOD FOR IDENTIFYING DNA PROFILES,” which is incorporated herein by reference in its entirety for all purposes.
This invention was made with government support under FA8702-15-D-0001 awarded by the U.S. Air Force. The government has certain rights in the invention.