DNA-based identification in forensics is typically accomplished via genotyping allele length at a defined set of short tandem repeat (STR) loci via PCR [1]. These PCR assays are robust, reliable, and inexpensive [2]. Given the multiallelic nature of each of these loci, a small panel of STR markers can provide suitable discriminatory power for personal identification [3, 4].
Massively parallel sequencing (MPS) technologies and genotype array technologies invite new approaches for DNA-based identification. Application of these technologies has provided catalogs of global human genetic variation at single-nucleotide polymorphic (SNP) sites and short insertion-deletion (INDEL) sites. For example, from the 1000 Genomes Project [5], there is now a catalog of nearly all human SNP and INDEL variation down to 1% worldwide frequency.
Genotype files, generated via MPS or genotype array, can be compared between individuals to find regions that are co-inherited or identical-by-descent (IBD) [6-8]. These comparisons are the basis of the relative finder functions in many direct-to-consumer genetic testing products [9, 10]. A special case of relative-finding is self-identification. This is a trivial comparison of genotype files as self comparisons will be identical across all sites, minus the error rate of the assay.
For many forensic samples, however, the available DNA may not be suitable for PCR-based STR amplification [11], genotype array analysis [12], or MPS to the depth required for comprehensive, accurate genotype calling [13]. In the case of PCR, one of the most common failure modes occurs when DNA is too fragmented for amplification. For these samples, it may be possible to directly observe the degree of DNA fragmentation from the decreased amplification efficiency of larger STR amplicons from a multiplex STR amplification [14]. In the case of severely fragmented samples, where all DNA fragments are shorter than the shortest STR amplicon length, PCR simply fails with no product.
Provided are computer-implemented methods for comparing genotype data from a first sample to a limited amount of DNA sequence data from a second sample. In certain embodiments, the first sample is from a known individual and the second sample is an unknown sample. The methods find use in a variety of contexts, including for genetic identity detection, e.g., for forensic and other applications. Also provided are computer-implemented methods for assessing the degree of relatedness between genotype data from a first sample and a limited amount of DNA sequence data from a second sample. Computer-readable media and systems that find use in practicing the methods of the present disclosure are also provided
Before the methods, computer-readable media and systems of the present disclosure are described in greater detail, it is to be understood that the methods, computer-readable media and systems are not limited to particular embodiments described, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only, and is not intended to be limiting, since the scope of the methods will be limited only by the appended claims.
Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limit of that range and any other stated or intervening value in that stated range, is encompassed within the methods, computer-readable media and systems. The upper and lower limits of these smaller ranges may independently be included in the smaller ranges and are also encompassed within the methods, computer-readable media and systems, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the methods, computer-readable media and systems.
Certain ranges are presented herein with numerical values being preceded by the term “about.” The term “about” is used herein to provide literal support for the exact number that it precedes, as well as a number that is near to or approximately the number that the term precedes. In determining whether a number is near to or approximately a specifically recited number, the near or approximating unrecited number may be a number which, in the context in which it is presented, provides the substantial equivalent of the specifically recited number.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the methods, computer-readable media and systems belong. Although any methods, computer-readable media and systems similar or equivalent to those described herein can also be used in the practice or testing of the methods, computer-readable media and systems, representative illustrative methods, computer-readable media and systems are now described.
All publications and patents cited in this specification are herein incorporated by reference as if each individual publication or patent were specifically and individually indicated to be incorporated by reference and are incorporated herein by reference to disclose and describe the materials and/or methods in connection with which the publications are cited. The citation of any publication is for its disclosure prior to the filing date and should not be construed as an admission that the present methods, computer-readable media and systems are not entitled to antedate such publication, as the date of publication provided may be different from the actual publication date which may need to be independently confirmed. It is noted that, as used herein and in the appended claims, the singular forms “a”, “an”, and “the” include plural referents unless the context clearly dictates otherwise. It is further noted that the claims may be drafted to exclude any optional element. As such, this statement is intended to serve as antecedent basis for use of such exclusive terminology as “solely,” “only” and the like in connection with the recitation of claim elements, or use of a “negative” limitation.
It is appreciated that certain features of the methods, computer-readable media and systems, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the methods, computer-readable media and systems, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination. All combinations of the embodiments are specifically embraced by the present disclosure and are disclosed herein just as if each and every combination was individually and explicitly disclosed, to the extent that such combinations embrace operable processes and/or compositions. In addition, all sub-combinations listed in the embodiments describing such variables are also specifically embraced by the present methods, computer-readable media and systems and are disclosed herein just as if each and every such sub-combination was individually and explicitly disclosed herein.
As will be apparent to those of skill in the art upon reading this disclosure, each of the individual embodiments described and illustrated herein has discrete components and features which may be readily separated from or combined with the features of any of the other several embodiments without departing from the scope or spirit of the present methods. Any recited method can be carried out in the order of events recited or in any other order that is logically possible.
Aspects of the present disclosure include computer-implemented methods for comparing genotype data from a first sample to a limited amount of DNA sequence data from a second sample. Such methods are implemented using one or more processors and one or more non-transitory computer-readable media comprising instructions stored thereon, which when executed by the one or more processors, cause the one or more processors to perform operations. The operations comprise receiving genotype data from a first sample and a limited amount of DNA sequence data from a second sample, and comparing the genotype data and limited amount of DNA sequence data at a plurality of variable sites across one or more genomic regions. The operations further comprise, at each of the plurality of variable sites, calculating the likelihood of the limited amount of DNA sequence data and the genotype data being related under at least two models of relatedness, wherein the at least two models of relatedness comprise: (i) the first sample and the second sample share two chromosomes identical-by-descent (IBD2); and (ii) the first sample and the second sample share no chromosomes identical-by-descent (IBD0). As used herein, “IBD2”, “IBD1”, and “IBD0” mean that the samples being compared share 2, 1, and 0 chromosomes, respectively, of a chromosome pair as identical-by-descent. The operations further comprise, at each of the plurality of variable sites, comparing the likelihood of model (i) to the likelihood of model (ii). According to some embodiments, comparing the likelihood of model (i) to the likelihood of model (ii) comprises generating a log-likelihood ratio (LLR) of model (i) and model (ii), thereby generating a plurality of log-likelihood ratios comprising a log-likelihood ratio for each of the plurality of variable sites. In certain embodiments, such methods further comprise aggregating the plurality of log-likelihood ratios. Optionally, the log-likelihood ratios are aggregated across each arm of each autosome.
According to some embodiments, the first sample is from a known individual and the second sample is an unknown sample. By “unknown sample” is meant it is unknown whether the second sample is from the same individual as the first sample. Such methods may further comprise determining whether the unknown sample is from the known individual, e.g., based on aggregated log-likelihood ratios.
The methods of the present disclosure find use, and are advantageous in a variety of contexts. For example, several methods currently exist for detecting genetic relatedness by comparing DNA information. These methods generally require genotype calls, either SNPs or STRs, at the sites used for comparison. For some DNA samples, like those obtained from bone fragments or single rootless hairs, there is often not enough DNA present to generate genotype calls that are accurate and complete enough for these comparisons. The methods of the present disclosure, embodiments of which are sometimes referred to herein as “IBDGem”, constitute a fast and robust computational procedure for detecting genomic regions of identity-by-descent using low-coverage sequence data (e.g., low-coverage shotgun sequence data) and genotype calls from a known query individual. At less than one-fold genome coverage, IBDGem reliably detects segments of relatedness and can make high-confidence identity detections with as little as 1% genome coverage.
In certain embodiments, IBDGem does not attempt to call genotypes from the sequence data. Rather, IBDGem may be used to evaluate the likelihood that a test individual, whose genotype is known, could have generated the sequence data from the unknown (or “questioned”) sample. As demonstrated herein, this approach can reliably identify samples with as little as 0.01-fold genome coverage from the questioned sample. Consequently, among other useful applications, IBDGem enables forensic use of sample types such as bone and single rootless hairs that typically yield sub-nanogram quantities of fragmented DNA.
According to some embodiments, the methods are implemented in C and compare genotype data (e.g., generated via genotype array or DNA sequence data) from a known individual to aligned sequence data in BAM format from an unknown individual. In certain embodiments, for each variable site, it calculates the likelihood of the observed sequence data under 3 models of relatedness: (1) the compared samples share two chromosomes identical-by-descent, IBD2, (2) the compared samples share one chromosome identical-by-descent, IBD1, or (3) the compared samples share no chromosomes identical-by-descent, IBD0. These three relationships are the only possible ways that two samples can be related to one another at a particular region in the genome.
The likelihoods of the data under these three models can then be compared to find the most likely model or to generate a log-likelihood ratio. If the distances between variable sites are sufficiently large, i.e., longer than the length of a sequence read, then the observed alleles at each site can be treated as independent observations of the likelihood of each state (IBD2, IBD1, or IBD0) across a genomic region. Thus, these likelihoods can be aggregated across multiple sites to increase the discriminatory power across a genomic region.
In the case of determining whether the sequence data derives from the same individual as the genotype data versus the model of it coming from an unrelated individual (analogous to the match probability for STR identification), log-odds ratios between the IBD2 and IBD0 models may be generated (
Algorithmic Details for IBDGem The purpose of this program is to compare the data in an input BAM file (from shotgun sequencing of a hair, for example) against the variants in an input VCF file (from shotgun sequencing of a DNA sample or genotype array, for example). At each variant site in the input VCF file, likelihoods of the BAM data are generated under three models:
IBDGem requires population allele frequency estimates for each variant site. This information can come from an AF tag in an input VCF file. This information is used in the likelihood calculation for non-IBD chromosomes.
To formalize the procedure, let us define data for a given variable site in the genome, Si, that is present in the VCF file, and for which there is observed data in an input BAM file.
Let Gi represent the known genotype of Si in the VCF file with 0 representing the reference allele and 1 representing the alternative allele. The alleles are not in any order, i.e., are unphased. Only bi-allelic sites are considered.
G
i=(0|1,0|1)
Note that Gi must take one of three possible values: (0,0), (0,1), or (1,1).
Let Di represent the observed data from a BAM file, e.g., of observed reference and alternative allele counts at site Si
D
i=(Di0,Di1)
The probability of the observed data, Di, under each of the three possible genotypes of the data in the BAM data sample can be defined:
Where ε is the probability of wrongly observed an allele, i.e., sequencing error. Note that these probabilities for homozygous genotypes are simply the probability of correctly observing the only allele present in the data the number of times it was observed given the sequencing error rate. For the heterozygous case, the probability of the data is the binomial probability of the data given the number of each of the two alleles observed. Further, it is assumed that errors are symmetric, such that observing an alternative allele in a read when the reference allele was present occurs as often as observing a reference allele in a read when the alternative allele was present.
With these probabilities of data, given an underlying genotype, the probability of observed data can be calculated given an allele's frequency:
P(Di|f)=(1−f)2P(Di|Gi=(0,0))+2f(1−f)P(Di|Gi=(0,1))+f2P(Di|Gi=(1,1)) Equation 2
This equation assumes Hardy-Weinberg equilibrium. It represents the probability of BAM data given no information from the VCF file, i.e., only the frequency of the allele in the population. Thus, this represents a model of the probability of data under the assumption that the comparison VCF file shares zero IBD chromosomes with the BAM file individual.
The probability of the BAM data, Di, at a particular variable site of known alternative allele frequency, f, can be calculated thusly:
P(Di|Gi,f,IBD0)=P(Di|f)
That is, the probability of the BAM data, given that it comes from a sample that is IBD0 with the VCF data is simply determined by the frequency, f, of the alternative allele as described in Equation 2.
P(Di|Gi,f,IBD2)=P(Di|Gi)
That is, the probability of the BAM data, given that it comes from a sample that is IBD2 with the VCF data is simply determined by the genotype of the VCF file, as described in Equation 1.
The IBD1 calculation (unlike the IBD0 and IBD2 calculations), is calculated differently depending on the genotype of the VCF file, GL. Under this model, the BAM data individual shares one allele with the VCF individual. The other is not shared. What can be assumed about the shared allele depends on whether the VCF individual is homozygous or heterozygous. IBD1 probabilities are calculated thusly:
The rationale for these equations is that the probability of the underlying genotypes of the BAM data sample can be calculated knowing the genotype of the VCF data and the frequency of the derived allele at that site. Then, the probability of the BAM data, given a particular comparison genotype can be calculated.
The log-odds ratio (LOR) of any two of these models (IBD0, IBD1, and IBD2) can then be calculated for the data at a given site. Because data are independent across sites, one can aggregate the LOR within bins to increase power to discriminate between models. The case of distinguishing between whether the two samples are from the same individual and whether they are from different, unrelated individuals is a comparison of the IBD2 versus IBD0 model across all arms of all chromosomes.
To analyze signals that are regional over the genome, the genome may be divided into non-overlapping bins of sufficient length such that each bin will contain tens or hundreds of sites. Then, the aggregate LOR in each bin over the genome is calculated.
According to some embodiments, the methods of the present disclosure are computer-implemented. As used herein, “computer-implemented” means at least one step of the method is implemented using one or more processors and one or more non-transitory computer-readable media. The computer-implemented methods of the present disclosure may further comprise one or more steps that are not computer-implemented, e.g., obtaining one or more samples (e.g., a forensic sample), preparing the one or more samples for genotyping and/or nucleic acid sequencing, and/or the like.
In certain embodiments, receiving the genotype data comprises receiving a VCF file comprising the genotype data. According to some embodiments, the genotype data was generated by massively parallel sequencing (MPS). Sequencing may be performed using any of a variety of available MPS sequencing machines and systems. Illustrative sequencing systems include the Illumina iSeq 100, Miniseq, MiSeq series, NextSeq series (e.g., NextSeq 500 series, NextSeq 1000, NextSeq 2000), and NovaSeq sequencing systems (Illumina, Inc., San Diego, Calif.), the Pacific Biosciences Sequel (e.g., Sequel II) sequencing system (Pacific Biosciences, Menlo Park, Calif.), the Oxford Nanopore Technologies MinION™, GridIONx5™, PromethION™, or SmidgION™ nanopore-based sequencing systems (Oxford Nanopore Technologies, Oxford, UK), and other systems having similar capabilities.
According to some embodiments, the genotype data was generated by genotype array. Suitable genotype array technologies are known and include, but are not limited to, the Illumina Mutli-Ethnic Global Screening Array (MEGA) and the Illumina Global Screening Array (GSA).
In certain embodiments, receiving the sequence data comprises receiving a BAM file comprising the sequence data.
According to some embodiments, the variable sites comprise single-nucleotide polymorphisms (SNPs), insertion-deletions (INDELs), or a combination thereof.
In certain embodiments, the unknown sample is a hair sample, a bone sample, a blood sample, a semen sample, or any combination thereof. When the unknown sample comprises a hair sample (e.g., a head hair sample, a pubic hair sample, or the like), in some instances, the hair sample is a rootless hair sample. For example, according to some embodiments, the unknown sample comprises a single rootless hair sample.
As will be appreciated with the benefit of the present disclosure, the methods of the present disclosure find use in performing a forensic analysis. In certain embodiments, the known individual is a person of interest (P01) in a criminal investigation, e.g., a murder investigation, a rape investigation, and/or the like. Accordingly, in some instances, the unknown sample was collected from a crime scene or a victim of a crime, e.g., a murder victim or a rape victim.
In certain embodiments, the limited amount of DNA sequence data comprises less than 1-fold genome coverage, less than 0.5-fold genome coverage, less than 0.1-fold genome coverage, or less than 0.05-fold genome coverage. According to some embodiments, the limited amount of DNA sequence data was obtained from a sample comprising less than 1 nanogram of genomic DNA.
Aspects of the present disclosure also include computer-implemented methods for assessing the degree of relatedness between genotype data from a first sample and a limited amount of DNA sequence data from a second sample. Such methods are implemented using one or more processors and one or more non-transitory computer-readable media comprising instructions stored thereon, which when executed by the one or more processors, cause the one or more processors to perform operations. The operations comprise receiving genotype data from a first sample from a first individual and a limited amount of DNA sequence data from a second sample from a second individual, and comparing the genotype data and limited amount of DNA sequence data at a plurality of variable sites across one or more genomic regions. In some embodiments, the second sample from the second individual is a hair sample (e.g., a root-containing hair, a rootless hair, or the like) from the second individual. The operations further comprise, at each of the plurality of variable sites, calculating the likelihood of the limited amount of DNA sequence data and the genotype data being related under the following models of relatedness: (i) the first sample and the second sample share two chromosomes identical-by-descent (IBD2); (ii) the first sample and the second sample share one chromosome identical-by-descent (IBD1); and (iii) the first sample and the second sample share no chromosomes identical-by-descent (IBD0). The operations further comprise determining the most likely path of the three IBD models through the one or more genomic regions using regional likelihood values of each model. In some instances, the genotype data from the first sample is available from an online database.
According to some embodiments, determining the most likely path of the three IBD states is performed using a dynamic programming algorithm. Non-limiting examples of suitable dynamic programming algorithms which may be employed include a forward algorithm, a forward-backward algorithm, or a Viterbi algorithm.
In certain embodiments, the operations further comprise partitioning the most likely path into IBD2, IBD1 and IBD0. In some instances, the operations further comprise identifying regions of co-inheritance for IBD2, IBD1 and/or IBD0. According to some embodiments, two individuals (e.g., cousins) are related to the second individual (e.g., a grandmother), and wherein the operations further comprise identifying IBD1 segments that are shared between the two individuals but not shared between the two individuals and the second individual, thereby identifying IBD1 segments from a common ancestor (e.g., a grandfather) of the two individuals other than the second individual.
According to some embodiments, the methods for assessing the degree of relatedness of the present disclosure further comprising, based on the determining step, identifying the first and second individuals as parent-child relatives. In certain embodiments, the methods for assessing the degree of relatedness of the present disclosure further comprising, based on the determining step, identifying the first and second individuals as full siblings.
Algorithmic details for an embodiment (sometimes referred to herein as “HiddenGem”) of the methods for assessing the degree of relatedness of the present disclosure will now be described.
Algorithmic Details for HiddenGem
The HiddenGem module employs a maximum-likelihood approach to find the most likely path of IBD states across a genomic region using likelihood values calculated by the main program for each IBD model.
Following comparison of two samples by IBDGem, likelihood values are aggregated into non-overlapping bins containing a fixed number of sites, for a total of N bins. Then, at the b-th bin along the genomic region, the likelihoods for models IBD0, IBD1, and IBD2 are first normalized to probabilities as follows:
where Me∈(IBD0,IBD1,IBD2).
Moving from the first to last bin, for each of the 3 IBD states, a 3× N score matrix is populated by multiplying the cumulative probability of state Mb-1 in the previous bin with the probability of state Mb in the current bin and a penalty for switching states if Mb-1≠Mb, keeping the largest product as the current score at Mb. The score calculation for IBD0, IBD1, and IBD2 at bin b can thus be formalized as follows:
where wIBD0-IBD1 is the switch penalty between IBD0 and IBD1, wIBD0-IBD2 is the switch penalty between IBD0 and IBD2, and wIBD1-IBD2 is the switch penalty between IBD1 and IBD2.
The score matrix also keeps track of which state in the previous bin b−1 yields the highest score in the current bin b, so that when scores for the last bin have been calculated, backtracking along the matrix is performed to find the path of IBD states that results in the final maximum cumulative probability.
The path of states provided by HiddenGem can then be used to estimate the proportion of each IBD state across the genome, for example by counting the number of bins at a specific state, and infer the degree of relatedness between the compared samples.
Shown in
Aspects of the present disclosure further include systems and non-transitory computer-readable media. In certain aspects, provided are one or more computer-readable media comprising instructions stored thereon, which when executed by one or more processors, cause the one or more processors to perform the operations of any of the computer-implemented methods of the present disclosure.
In other aspects, provided are systems comprising one or more processors and one or more computer-readable media comprising instructions stored thereon, which when executed by the one or more processors, cause the one or more processors to perform the operations of any of the computer-implemented methods of the present disclosure.
A variety of processor-based systems may be employed to implement the embodiments of the present disclosure. Such systems may include system architecture wherein the components of the system are in electrical communication with each other using a bus. System architecture can include a processing unit (CPU or processor), as well as a cache, that are variously coupled to the system bus. The bus couples various system components including system memory, (e.g., read only memory (ROM) and random access memory (RAM), to the processor.
System architecture can include a cache of high-speed memory connected directly with, in close proximity to, or integrated as part of the processor. System architecture can copy data from the memory and/or the storage device to the cache for quick access by the processor. In this way, the cache can provide a performance boost that avoids processor delays while waiting for data. These and other modules can control or be configured to control the processor to perform various actions. Other system memory may be available for use as well. Memory can include multiple different types of memory with different performance characteristics. Processor can include any general purpose processor and a hardware module or software module, such as first, second and third modules stored in the storage device, configured to control the processor as well as a special-purpose processor where software instructions are incorporated into the actual processor design. The processor may essentially be a completely self-contained computing system, containing multiple cores or processors, a bus, memory controller, cache, etc. A multi-core processor may be symmetric or asymmetric.
To enable user interaction with the computing system architecture, an input device can represent any number of input mechanisms, such as a microphone for speech, a touch-sensitive screen for gesture or graphical input, keyboard, mouse, motion input, speech and so forth. An output device can also be one or more of a number of output mechanisms. In some instances, multimodal systems can enable a user to provide multiple types of input to communicate with the computing system architecture. A communications interface can generally govern and manage the user input and system output. There is no restriction on operating on any particular hardware arrangement and therefore the basic features here may easily be substituted for improved hardware or firmware arrangements as they are developed.
The storage device is typically a non-volatile memory and can be a hard disk or other types of computer-readable media which can store data that are accessible by a computer, such as magnetic cassettes, flash memory cards, solid state memory devices, digital versatile disks, cartridges, random access memories (RAMs), read only memory (ROM), and hybrids thereof.
The storage device can include software modules for controlling the processor. Other hardware or software modules are contemplated. The storage device can be connected to the system bus. In one aspect, a hardware module that performs a particular function can include the software component stored in a computer-readable medium in connection with the necessary hardware components, such as the processor, bus, output device, and so forth, to carry out various functions of the disclosed technology.
Embodiments within the scope of the present disclosure may also include tangible and/or non-transitory computer-readable storage media or devices for carrying or having computer-executable instructions or data structures stored thereon. Such tangible computer-readable storage devices can be any available device that can be accessed by a general purpose or special purpose computer, including the functional design of any special purpose processor as described above. By way of example, and not limitation, such tangible computer-readable devices can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other device which can be used to carry or store desired program code in the form of computer-executable instructions, data structures, or processor chip design. When information or instructions are provided via a network or another communications connection (either hardwired, wireless, or combination thereof) to a computer, the computer properly views the connection as a computer-readable medium. Thus, any such connection is properly termed a computer-readable medium. Combinations of the above should also be included within the scope of the computer-readable storage devices.
Computer-executable instructions include, for example, instructions and data which cause a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Computer-executable instructions also include program modules that are executed by computers in stand-alone or network environments. Generally, program modules include routines, programs, components, data structures, objects, and the functions inherent in the design of special-purpose processors, etc. that perform tasks or implement abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of the program code means for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.
Other embodiments of the disclosure may be practiced in network computing environments with many types of computer system configurations, including personal computers, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, and the like. Embodiments may also be practiced in distributed computing environments where tasks are performed by local and remote processing devices that are linked (either by hardwired links, wireless links, or by a combination thereof) through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
Notwithstanding the appended claims, the present disclosure is also defined by the following embodiments.
1. A computer-implemented method for comparing genotype data from a first sample to a limited amount of DNA sequence data from a second sample, the method being implemented using one or more processors and one or more non-transitory computer-readable media comprising instructions stored thereon, which when executed by the one or more processors, cause the one or more processors to perform operations comprising:
The following examples are offered by way of illustration and not by way of limitation.
IBDGem analyzes regions of the genome for which there is genotype data from a known sample and some amount of sequence data from an unknown sample. It calculates the likelihood of the two samples being related by 0, 1, or 2 shared chromosomes (identical-by-descent) regionally across the genome. Because humans are diploids, humans carry two copies of each genomic locus. Thus, these three models (IBD0, IBD1, and IBD2) are the only ways that two samples can be related at any particular genomic region. More closely related individuals have more IBD1 regions (genome segments inherited from common ancestors) than less closely related individuals.
Comparisons between unrelated individuals will be IBD0, i.e., not share either chromosomal region from a recent common ancestor, across all or nearly all regions of the genome. Conversely, comparisons between the same person will necessarily be IBD2 across every region of the genome. For closely related individuals, some regions will be IBD1, where a segment of a chromosome is co-inherited from a recent common ancestor. Parent-offspring relatives are IBD1 across all chromosomes. Full siblings are roughly 25% IBD0, 50% IBD1, and 25% IBD2.
To test the ability of IBDGem to reliably compare samples, data from the high-coverage 1000 Genomes panel [15] was first used. This panel provides both genotype calls and aligned DNA sequence data for each individual. In this analysis, self versus self comparisons represent positive controls wherein all segments of all chromosomes should be identifiable as IBD2. Further, self versus non-self comparisons represent negative controls wherein all segments of all chromosomes should be identifiable as IBD0, except in cases of cryptic relatedness, which is known to be present in this panel.
The genotype and sequence data in the GBR (British) and LWK (Luhya) panels from the 1000 Genomes was first analyzed. All available genotypes at bi-allelic SNP sites was used for these comparisons. In this way, this experiment approximates the situation of having high-coverage DNA from one comparison individual with which to generate full genotype information. After excluding known relatives, one self and one non-self comparison was performed for every sample within each panel. Specifically, the genotype of each individual was compared against either their own aligned sequence data (self comparison) or the sequence data of a different, random individual within the same panel (non-self comparison) that had been down-sampled to 2-fold, 1-fold, 0.5-fold, 0.1-fold, and 0.01-fold genomic coverages. For the background model (IBD0), allele-frequencies from all unrelated individuals in the 1000 Genomes panel were used.
Variance in mean aggregated LLR across GBR and LWK individuals for self and non-self comparisons, at different down-sampling coverages.
Variance in mean aggregated LLR across GBR and LWK individuals for self and non-self comparisons, using different populations as background model.
For all of these comparisons, it was found that self comparisons were strongly identifiable from non-self comparisons, even at the ultra-low coverage of 0.01-fold (
Because the probability models for IBD0 and IBD1 take as input the non-reference allele frequency at each variable site, the sensitivity of IBDGem to these input values was tested. Human populations, in general, have low cross-population Fst values [17]. Thus, a priori IBDGem is not expected to be sensitive to the small differences in allele frequencies from various human populations.
To test this expectation, the previous comparisons of 1-fold genome coverage data from the GBR and LWK panels were re-ran. For this experiment, the most general model of allele frequencies, i.e., derived from all unrelated individuals from the 1000 Genomes panel, were not used. Instead, allele frequencies from specific continental or sub-continental subsets were used (
Additional sample comparison data is shown in
The 1000 Genomes project pipeline calls variants from shotgun sequencing data across the genome. Each individual has genotype calls at nearly all sites that are found to be variable in the panel. Therefore, for each IBDGem analysis, the number of sites available for comparison is limited chiefly by the data available from the questioned sample. High-coverage genomic data can be used to generate nearly complete call sets at all of the sites known to be variable within humans in, for example, the 1000 Genomes panel. Thus, the genotype call set will include tens of millions of sites, although any specific individual will be homozygous for the reference allele at most of these sites.
In contrast, commercially available genotype arrays provide highly accurate genotype calls at about one million sites of known variation—those on the array—but no information at other sites. Genotype arrays are an accurate and less expensive approach for generating genotype data. To test the sensitivity of IBDGem when limited to data only at genotype array sites for the subject individual, the program was specified to perform comparisons on only bi-allelic sites found on the Illumina Global Screening Array (GSA). In both the GBR and LWK panels, it was found that for all self comparisons, the IBD2/IBD0 log-likelihood ratios remain higher than 100 and for all non-self comparisons, these ratios are less than −100 (
The sequencing libraries that generated the 1000 Genomes data were predominantly made from cell line derived, high-molecular weight DNA. In this way, the data quality is superior to what is possible from many forensic samples. To test the power of IBDGem using data derived from a more realistic forensic DNA source, DNA from rootless hairs from a panel of eight individuals was extracted and sequenced. Separately, DNA from the saliva of these same eight individuals was collected for genotype analysis using the Illumina Multi-Ethnic Global array.
Multiple head and pubic hairs from each individual were collected. DNA from each hair was extracted in 50 μL elution volume, and the two hair extracts with the highest DNA concentrations were chosen for sequencing. For many of the hair extracts, the DNA concentration was below the level of detection with qubit fluorimetry. Using 20 μL of each extract (40% of total volume), sequencing libraries were generated using a single-stranded library approach [16]. These libraries were pooled and roughly 60 million read pairs per library were generated.
After mapping these sequence data to the reference human genome, it was found that the amount of usable human DNA for each hair sample was variable (
Next, IBDGem was run, comparing each hair DNA dataset to each genotype dataset. For this comparison, the whole-panel 1000 Genomes allele frequencies were used as the background model since nothing about the donors was known and, as shown above, the method is largely insensitive to the use of a specific population background panel. All eight self comparisons and all 56 non-self comparisons were correctly identified (
Determining self versus non-self using this framework is straightforward as self comparisons are IBD2 across every region of the genome and non-self comparisons are IBD0 across nearly every region. Closely related individuals, however, will share genomic regions where one chromosome is identical by descent (IBD1). Full siblings will also share some regions of IBD2.
To assess the power of IBDGem to detect regions of IBD1 and, more generally, to assess the degree of relatedness between compared samples, a module (sometimes referred to herein as “HiddenGem”) was implemented that finds the most likely path of the three IBD states through the genome using regional likelihood values of each state (
IBDGem was run followed by the maximum-likelihood IBD-state caller HiddenGem, comparing genotypes at only GSA sites for the known relatives of two individuals, NA19662 and NA19686, from the MXL (Mexican-American) population. In this experiment, the sequence data from each relative was down-sampled to one-fold average genome coverage. For each known relative, there is general concordance between the observed proportion of each IBD state and the expected values given the degree of relatedness (
Data presented here are from: (1) The 1000 Genomes Project Phase 3 deep sequencing [15] and (2) a panel of eight human volunteers from whom DNA was derived from a saliva sample and cut hairs (hair panel).
For each anonymous study participant, saliva DNA, head hair, and pubic hair was collected. Saliva was collected using the OGR-500 collection device. 1 μg of extracted saliva DNA was submitted to AKESOgen for genotype array processing using the Illumina Multi-Ethnic Global Screening Array (MEGA). For each participant, DNA was extracted from 5 head and 3 pubic hairs, followed by preparation of single-stranded DNA Illumina sequencing libraries [16] from the two highest concentration head and pubic hair extractions. Libraries prepared from the hair extractions were sequenced on an Illumina NovaSeq 6000 at UCSF. Further details of the wet lab methods will now be provided.
Participants anonymously picked up and dropped off a collection kit containing an OGR-500 (DNA Genotek) saliva collection device, a plastic bag for head hair, a plastic bag for pubic hair, and a set of instructions. Each participant was requested to donate at least 5 head hairs, 3 pubic hairs, and saliva following the OGR-500 instructions.
For each saliva sample, DNA was extracted and submitted for array genotyping. First, DNA was extracted and isolated from 500 μL of each saliva sample using the prepIT-L2P (DNA Genotek) reagent, following the manufactures instructions. DNA was quantified using a Qubit dsDNA BR Assay Kit (Invitrogen) and a Qubit 4 fluorometer (Invitrogen). Next, 1 μg of DNA was submitted to Akesogen for genotyping on an Illumina MEGA array.
For each participant, 5 head hairs and 3 pubic hairs were trimmed and washed, followed by DNA extraction from the hairs. First, identifiable roots were removed and the head and pubic hairs were trimmed to a maximum of 5 cm and 3 cm, respectively. For external decontamination, each hair was submerged in 0.5% sodium hypochlorite for 10 seconds and then in three water baths for 10 seconds each. DNA was extracted and isolated from each hair using the Qiagen DNeasy Blood and Tissue Kit (Qiagen) following a user-developed protocol for hair [Purification of total DNA from nails, hair, or feathers using the DNeasy® Blood & Tissue Kit—(EN)]. DNA was eluted in 40 μL buffer EBT (10 mM Tris-HCl, 0.05% Tween). The DNA was quantified using a Qubit 1× dsDNA HS Assay Kit (Invitrogen) and a Qubit 4 fluorometer.
For each participant, Illumina sequencing libraries were prepared from the highest and lowest concentration head and pubic hair DNA extractions. First, 20 μL of each extract was concentrated to 11 μL using a SPRI bead mixture, which was prepared and performed as described in Rohland and Reich [Rohland, N. and D. Reich (2012) Genome Res. 22(5): p. 939-46] using 72 μL SPRI solution and a 108 μL isopropanol addition as described in Fishman et al [Fishman et al. (2018) Genome Biol. 19(1):113]. Next, Illumina libraries were prepared as described in Kapp et al. [J Hered. 2021. 112(3):241-249] with the following modifications: 10 μL of concentrated DNA extract, 1 μL 76 ng/μL ET SSB (NEB), 1 μL 2 μM P5 splinted adapter, 1 μL 0.4 μM P7 splinted adapter, and 12 μL reaction mix (38.33% PEG 8000, 104.16 mM Tris-HCl, 20.83 mM MgCl2, 20.83 mM DTT, 2.08 mM ATP, 41.66 U/μL T4 DNA Ligase (NEB), and 0.46 U/μL T4 PNK (NEB)). The libraries were purified using SPRI as described in Rohland and Reich [supra], beginning with the addition of 60 μL SPRI bead solution and 35 μL buffer EBT. The cleaned libraries were eluted in 20 μL buffer EBT.
Each library was amplified and double-indexed using the primers described in Kircher et al. [Nucleic Acids Res, 2012. 40(1):e3]. For each hair library, 50 μL reactions containing 20 μL library, 25 μL Amplitaq Gold 360 Master Mix (Applied Biosystems), 2.5 μL unique 20 μM i5 indexing primer, and 2.5 μL unique 20 μM i7 indexing primer were prepared. Each hair library was amplified with the following cycling conditions: 95° C. for 10 min, followed by 10 or 13 cycles of 95° C. for 30 s, 60° C. for 30 s, and 72° C. for 60 s, and a final extension of 72° C. for 7 min.
The post-amplified hair libraries were purified using SPRI ratios 1.2×. Next, each library was quantified using a Qubit 1×dsDNA HS Assay Kit and a Qubit 4 fluorometer, followed by visualization of each library using a D1000 ScreenTape (Agilent) and Tapestation 2200 (Agilent). Finally, all libraries were sequenced on one lane of NovaSeq using the S4 2×150 kit. The table below summarizes the data generated for each of the 4 libraries for each individual, the amount of human DNA that aligned to the reference genome (hg19) and the fold-coverage genome for each individual used in the IBDGem comparisons:
DNA sequence data from the hair panel were processed using SeqPrep [St. John, J. Seq Prep. Available from: github.com/jstjohn/SeqPrep]. Only read-pairs that were merged were used for downstream analysis. Merged reads were aligned to hs37d5.fa (hg19 human reference genome) using bwa aln with the following command set:
bwa aln -t 48 -l 26 hs37d5.fa LIB.fq>LIB.sai
bwa samse hs37d5.fa LIB.sai LIB.fa|
samtools view -Sb -o - |
samtools sort -o LIB.sorted.bam -
BAM files for each library were further processed with Picardtools CleanSam and MarkDuplicates. Then, BAM files from each individual were merged into a single bam file for downstream processing.
The merged BAM files were then filtered to remove any reads that are longer than 80 bp as DNA fragments from hair are rarely this long and the few DNA fragments longer than this cutoff show abnormally low concordance with genotype array data.
Since IBDGem takes in sequence data of single chromosomes in the pileup format, samtools mpileup was used to generate pileup files of all 22 autosomes from each individual's BAM file:
samtools mpileup -r CHROM -A -a -q 30 -Q 30 -s INDV.all-hair.180.bam -o INDV.all-hair.180.CHROM.pileup
IBDGem also takes in genotype data in the IMPUTE format. Therefore, a joint VCF from the genotype array data of all Hair1.0 individuals for each separate autosome was first created. Then, these files were further converted to the IMPUTE format using vcftools with the following options:
vcftools --vcf HAIR1-joint.CHROM.vcf \
This command generated a .legend, .indv, and .hap file for each chromosome, which were used as inputs to IBDGem.
Finally, the IBDGem comparison was performed between any two individuals on a specific chromosome with the following options:
ibdgem --hap HAIR1-joint.CHROM.hap \
For the GBR and LWK panels from 1000 Genomes, pileup files were generated for each unrelated individual using samtools mpileup with the same options as for the Hair1.0 panel. The joint VCF file for all 3,202 individuals was also converted to the IMPUTE format, removing known relatives with the following options:
vcftools --gzvcf 1KG-joint.CHROM.vcf.gz \
After relative filtering, a total of 2,594 individuals remained in the joint IMPUTE files.
IBDGem comparisons were then performed between any two individuals with the following options:
ibdgem --hap 1KG.unrelated.CHROM.hap \
Genome Res, 2009. 19(2): p. 318-26.
Accordingly, the preceding merely illustrates the principles of the present disclosure. It will be appreciated that those skilled in the art will be able to devise various arrangements which, although not explicitly described or shown herein, embody the principles of the invention and are included within its spirit and scope. Furthermore, all examples and conditional language recited herein are principally intended to aid the reader in understanding the principles of the invention and the concepts contributed by the inventors to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents and equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure. The scope of the present invention, therefore, is not intended to be limited to the exemplary embodiments shown and described herein.
This application claims the benefit of U.S. Provisional Patent Application No. 63/244,118, filed Sep. 14, 2021, which application is incorporated herein by reference in its entirety.
This invention was made with Government support under contract A21-0538-001 awarded by National Institute of Justice. The Government has certain rights in the invention.
Number | Date | Country | |
---|---|---|---|
63244118 | Sep 2021 | US |