Knowledge about recent shared ancestry between individuals is fundamental to a wide variety of genetic studies. Detecting cryptic relatedness is a valuable technique for mapping disease-susceptibility loci and for identifying other at-risk individuals (Neklason et al. 2008; Thomas et al. 2008). For case-control association studies and population-based genetic analyses, related individuals should be identified and removed from samples that are intended to be random representatives of their populations (Pemberton et al. 2010; Simonson et al. 2010; Voight and Pritchard 2005; Xing et al. 2010). Using genetic data to correct pedigree errors increases the power of disease mapping in families (Cherny et al. 2001). Genetic identification of relatives has proven invaluable in forensic identification of missing persons, victims of mass disasters, and suspects in criminal investigations (Bieber et al. 2006; Biesecker et al. 2005; Zupanic Pajnic et al. 2010). Studies of conservation biology, quantitative genetics, and evolutionary biology are greatly illuminated when the recent shared ancestry between individuals being observed or sampled can be reconstructed, especially in agricultural and wild populations (DeWoody 2005; Slate et al. 2010).
Some embodiments of the disclosure include methods of estimating genetic relatedness between members of a first pair of conspecific organisms comprising receiving, by a processor, a value indicating a number of nonoverlapping polynucleotide segments longer than a threshold length (t) that are identical, by at least about 90 percent sequence identity, between members of the first pair; receiving, by a processor, values indicating lengths of the identical segments; comparing the number of the first pair's identical segments to a number of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of a second pair of organisms, the members of the second pair having an established degree of genetic relatedness to each other; comparing the lengths of the first pair's identical segments to lengths of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of a third pair of organisms, the members of the third pair having an established degree of genetic relatedness to each other; and, based on the number comparison and the length comparison, estimating, by a processor, a degree of genetic relatedness between the members of the first pair. In some embodiments, the members of the first pair are human. In certain embodiments, the first pair's polynucleotide segments comprise DNA, mitochondrial DNA, sex-linked nucleotide segments, and/or RNA. In certain embodiments, t is equal to or greater than about 2.5 cM.
Some embodiments of the disclosure include methods of estimating genetic relatedness between members of a first pair of conspecific organisms further comprising comparing the lengths of the first pair's identical segments to a background distribution of lengths of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a background group, the members of most pairs in the background group being more distantly related than fourth cousins; and wherein the estimating is further based on the comparison of the lengths of the first pair's identical segments to the lengths in the background distribution. In some embodiments the identical segments of the background group are no longer than about 10 cM. In certain embodiments members of the background group are selected randomly from a larger population.
In certain embodiments, the methods further comprise comparing the number of the first pair's identical segments to a first distribution, of numbers of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a first group, the members of each pair in the first group having an established degree of genetic relatedness to each other; and wherein the estimating is further based on the comparison of the number of the first pair's identical segments to the numbers in the first distribution.
In some embodiments, the methods further comprise comparing the lengths of the first pair's identical segments to a second distribution, of lengths of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a second group, the members of each pair in the second group having an established degree of genetic relatedness to each other; wherein the estimating is further based on the comparison of the lengths of the first pair's identical segments to the lengths in the second distribution.
In some embodiments, the methods further comprise comparing the lengths of the first pair's identical segments to a background distribution, of lengths of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a background group, the members of most pairs in the background group being more distantly related than fourth cousins; and wherein the estimating is further based on the comparison of the lengths of the first pair's identical segments to the lengths in the background distribution.
Some embodiments of the disclosure include methods of estimating genetic relatedness between members of a first pair of conspecific organisms wherein the estimating further comprises estimating a likelihood LP that the first pair are no more related than two individuals selected randomly from a population, wherein: LP(n,s|t)=NP(n|t)·SP(s|t); wherein
wherein NP(n|t) comprises the likelihood of sharing n segments, SP(s|t) comprises the likelihood of the set of segments s, and FP(i|t) comprises the likelihood of a segment of size i. In some embodiments, FP(i|t) is approximated as
wherein θ is equal to a mean shared segment size in the population for all segments of size greater than t and less than a maximum length. In some embodiments the maximum length is about 10 cM.
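By way of non-limiting illustration, the null-hypothesis likelihood LP described above can be sketched in Python. The text states only the factorization LP(n,s|t)=NP(n|t)·SP(s|t) and that FP is parameterized by θ. The specific forms used here are assumptions made for illustration: a Poisson count of shared segments with mean `eta`, independent segments so that SP is a product of FP terms, and an exponential density for the excess length above t. The parameter name `eta` and all function names are hypothetical.

```python
import math


def log_n_p(n, eta):
    """Log-probability of sharing n segments.

    Assumption: the count is Poisson with mean eta (eta > 0), where eta would
    be estimated from pairs drawn from the background population.
    """
    return -eta + n * math.log(eta) - math.lgamma(n + 1)


def log_f_p(i, t, theta):
    """Log-density of one shared segment of length i (in cM).

    Assumption: the excess length (i - t) is exponential with scale theta.
    """
    if i < t:
        raise ValueError("segment is shorter than the threshold t")
    return -(i - t) / theta - math.log(theta)


def log_l_p(segments, t, theta, eta):
    """log LP(n, s | t) = log NP(n | t) + sum of log FP(i | t) over s."""
    n = len(segments)
    return log_n_p(n, eta) + sum(log_f_p(i, t, theta) for i in segments)


# Illustrative values only; they are not taken from the disclosure.
example = log_l_p([3.1, 4.8, 2.9], t=2.5, theta=3.0, eta=5.0)
```

Working in log space avoids numerical underflow when many segments are multiplied together.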
In some aspects, the estimating further comprises estimating a likelihood LR that the first pair share one or two ancestors, wherein: LR=LA(nA,sA|d,a,t)LP(sP|t); wherein nP+nA=n, where nA is equal to the number of shared segments inherited from ancestors, nP is the number of segments shared by the population; wherein sP and sA are two mutually exclusive subsets of s, where sA is the subset of segments inherited from ancestor(s) with nA elements, and sP is the subset of segments shared by the population with nP elements; wherein a represents the number of ancestors shared, and d represents the combined number of generations separating the individuals from their ancestor(s).
In some embodiments, the estimating further comprises estimating a likelihood LA that the first pair share n segments from ancestor(s) specified by d and a, with the segment sizes specified by sA, wherein: LA(nA,sA|d,a,t)=NA(n|d,a,t)·SA(sA|d,t); wherein
wherein NA(n|d,a,t) is the likelihood of sharing n segments, SA(sA|d,t) is the likelihood of the set of segments sA, and FA(i|t) is the likelihood of a segment of size i; wherein sP and sA are two mutually exclusive subsets of s, where sA is the subset of segments inherited from ancestor(s) with nA elements, and sP is the subset of segments shared by the population with nP elements; wherein nP+nA=n, where nA is equal to the number of shared segments inherited from ancestors, nP is the number of segments shared by the population; wherein a represents the number of ancestors shared, and d represents the combined number of generations separating the individuals from their ancestor(s). In certain aspects, the estimating further comprises estimating a likelihood LA that the first pair share n segments from ancestor(s) specified by d and a, with the segment sizes specified by sA, wherein: LA(nA,sA|d,a,t)=NA(n|d,a,t)·SA(sA|d,t); wherein
wherein NA(n|d,a,t) is the likelihood of sharing n segments, SA(sA|d,t) is the likelihood of the set of segments sA, and FA(i|t) is the likelihood of a segment of size i.
In some embodiments,
wherein p(t) is the probability that a shared segment is longer than t, c comprises an average number of chromosomes in the organisms, and r comprises an average number of recombination events per haploid genome in the organisms. In certain embodiments, p(t) is assumed to be equal to or about e^(−dt/100). In certain embodiments,
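By way of non-limiting illustration, the ancestral likelihood LA can be sketched in Python. Only p(t)=e^(−dt/100) and the meanings of d, a, c, and r come from the text. The remaining forms are assumptions made for illustration: a Poisson segment count with mean a(rd+c)p(t)/2^(d−1), and an exponential density with mean 100/d for the excess segment length above t. The values r=35.3 and c=22 are rough human autosomal figures supplied here as assumptions.

```python
import math


def p_longer_than(t, d):
    """p(t) = exp(-d * t / 100): chance that a shared segment exceeds t cM."""
    return math.exp(-d * t / 100.0)


def expected_segments(d, a, t, r, c):
    """Assumed Poisson mean for the number of segments inherited from a shared
    ancestors across d meioses: a * (r*d + c) * p(t) / 2**(d - 1)."""
    return a * (r * d + c) * p_longer_than(t, d) / 2 ** (d - 1)


def log_n_a(n, d, a, t, r, c):
    """Log-probability of n ancestral segments under the assumed Poisson model."""
    lam = expected_segments(d, a, t, r, c)
    return -lam + n * math.log(lam) - math.lgamma(n + 1)


def log_f_a(i, d, t):
    """Log-density of an ancestral segment of length i cM.

    Assumption: the excess length (i - t) is exponential with mean 100/d cM.
    """
    if i < t:
        raise ValueError("segment is shorter than the threshold t")
    return -d * (i - t) / 100.0 + math.log(d / 100.0)


def log_l_a(segments, d, a, t, r, c):
    """log LA = log NA(n | d, a, t) + sum of log FA(i | d, t) over the segments."""
    n = len(segments)
    return log_n_a(n, d, a, t, r, c) + sum(log_f_a(i, d, t) for i in segments)


# Illustrative call: two segments, a second-cousin-like path (d=6, a=2).
example = log_l_a([12.0, 30.0], d=6, a=2, t=2.5, r=35.3, c=22)
```

Under these assumptions the expected number of detectable segments falls quickly as d grows, because each additional meiosis roughly halves the chance that any given segment is transmitted.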
In certain aspects, the estimating further comprises estimating a maximum likelihood of LR(MLR), wherein: MLR(nP,nA,s|d,a,t)=NP(nP|t)NA(nA|d,a,t)·SP({s1:n . . . sn
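By way of non-limiting illustration, maximizing the likelihood over the partition of s into sP and sA can be sketched in Python. The sketch assumes that the shortest nP segments are attributed to the population and the longest nA segments to the shared ancestor(s), and tries every split with nP+nA=n. It also assumes Poisson segment counts and exponential excess lengths. These modeling choices, the parameter name `eta`, and the function names are assumptions for illustration and are not confirmed by the text.

```python
import math


def log_poisson(n, mean):
    """Log of the Poisson probability of n events with the given mean (> 0)."""
    return -mean + n * math.log(mean) - math.lgamma(n + 1)


def log_exp_excess(i, t, mean_excess):
    """Log-density of a segment whose excess over t is exponential."""
    return -(i - t) / mean_excess - math.log(mean_excess)


def max_log_lr(segments, d, a, t, theta, eta, r, c):
    """Return (best log-likelihood, n_p, n_a) over all sorted splits of s."""
    s = sorted(segments)
    n = len(s)
    # Assumed Poisson mean for ancestral segments, with p(t) = exp(-d*t/100).
    lam = a * (r * d + c) * math.exp(-d * t / 100.0) / 2 ** (d - 1)
    best = None
    for n_p in range(n + 1):
        n_a = n - n_p
        ll = (
            log_poisson(n_p, eta)                                  # NP(nP | t)
            + log_poisson(n_a, lam)                                # NA(nA | d, a, t)
            + sum(log_exp_excess(i, t, theta) for i in s[:n_p])    # SP(sP | t)
            + sum(log_exp_excess(i, t, 100.0 / d) for i in s[n_p:])  # SA(sA | d, t)
        )
        if best is None or ll > best[0]:
            best = (ll, n_p, n_a)
    return best


# Illustrative values only: two short segments and one long segment.
best = max_log_lr([3.0, 4.0, 40.0], d=6, a=2, t=2.5,
                  theta=3.0, eta=5.0, r=35.3, c=22)
```

In a fuller procedure this maximization would be repeated for each candidate (d, a), and the best-scoring pair would be retained as the estimated relationship.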
In some embodiments, the methods of the invention further comprise receiving, by a processor, values indicating locations of the identical segments; comparing the locations of the first pair's identical segments to locations of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of a fourth pair of organisms, the members of the fourth pair having an established degree of genetic relatedness to each other; and wherein the estimating is further based on the location comparison. In some embodiments, the methods further comprise comparing the locations of the first pair's identical segments to a background distribution of locations of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a background group, the members of most pairs in the background group being more distantly related than fourth cousins; and wherein the estimating is further based on the comparison of the locations of the first pair's identical segments to the locations in the background distribution.
In some embodiments, the methods further comprise comparing the locations of the first pair's identical segments to a first distribution, of locations of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a first group, the members of each pair in the first group having an established degree of genetic relatedness to each other; and wherein the estimating is further based on the comparison of the locations of the first pair's identical segments to the locations in the first distribution.
Some embodiments of the disclosure include a computer-readable medium encoded with a computer program comprising instructions executable by a processor for estimating genetic relatedness between members of a first pair of conspecific organisms, the instructions including instruction code for: receiving, by a processor, a value indicating a number of nonoverlapping polynucleotide segments longer than a threshold length (t) that are identical, by at least about 90 percent sequence identity, between members of the first pair; receiving, by a processor, values indicating lengths of the identical segments; comparing the number of the first pair's identical segments to a number of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of a second pair of organisms, the members of the second pair having an established degree of genetic relatedness to each other; comparing the lengths of the first pair's identical segments to lengths of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of a third pair of organisms, the members of the third pair having an established degree of genetic relatedness to each other; and, based on the number comparison and the length comparison, estimating, by a processor, a degree of genetic relatedness between the members of the first pair. In some embodiments, the members of the first pair are human. In certain embodiments, the first pair's polynucleotide segments comprise DNA, mitochondrial DNA, sex-linked nucleotide segments, and/or RNA. In certain embodiments, t is equal to or greater than about 2.5 cM.
In certain aspects, the computer-readable medium further comprises comparing the lengths of the first pair's identical segments to a background distribution of lengths of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a background group, the members of most pairs in the background group being more distantly related than fourth cousins; and wherein the estimating is further based on the comparison of the lengths of the first pair's identical segments to the lengths in the background distribution. In some embodiments, the identical segments of the background group are no longer than about 10 cM. In certain embodiments the members of the background group are selected randomly from a larger population.
In some aspects, the medium further comprises comparing the number of the first pair's identical segments to a first distribution, of numbers of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a first group, the members of each pair in the first group having an established degree of genetic relatedness to each other; and wherein the estimating is further based on the comparison of the number of the first pair's identical segments to the numbers in the first distribution.
In some embodiments, the medium further comprises comparing the lengths of the first pair's identical segments to a second distribution, of lengths of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a second group, the members of each pair in the second group having an established degree of genetic relatedness to each other; wherein the estimating is further based on the comparison of the lengths of the first pair's identical segments to the lengths in the second distribution.
In certain embodiments, the medium further comprises comparing the lengths of the first pair's identical segments to a background distribution, of lengths of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a background group, the members of most pairs in the background group being more distantly related than fourth cousins; and wherein the estimating is further based on the comparison of the lengths of the first pair's identical segments to the lengths in the background distribution. In certain aspects, the estimating further comprises estimating a likelihood LP that the first pair are no more related than two individuals selected randomly from a population, wherein: LP(n,s|t)=NP(n|t)·SP(s|t); wherein
wherein NP(n|t) comprises the likelihood of sharing n segments, SP(s|t) comprises the likelihood of the set of segments s, and FP(i|t) comprises the likelihood of a segment of size i. In certain aspects, FP(i|t) is approximated as:
wherein θ is equal to a mean shared segment size in the population for all segments of size greater than t and less than a maximum length. In some embodiments, the maximum length is about 10 cM.
In some aspects of the computer-readable medium the estimating further comprises estimating a likelihood LR that the first pair share one or two ancestors, wherein: LR=LA(nA,sA|d,a,t)LP(sP|t); wherein nP+nA=n, where nA is equal to the number of shared segments inherited from ancestors, nP is the number of segments shared by the population; wherein sP and sA are two mutually exclusive subsets of s, where sA is the subset of segments inherited from ancestor(s) with nA elements, and sP is the subset of segments shared by the population with nP elements; wherein a represents the number of ancestors shared, and d represents the combined number of generations separating the individuals from their ancestor(s). In some embodiments the estimating further comprises estimating a likelihood LA that the first pair share n segments from ancestor(s) specified by d and a, with the segment sizes specified by sA, wherein: LA(nA,sA|d,a,t)=NA(n|d,a,t)·SA(sA|d,t); wherein
wherein NA(n|d,a,t) is the likelihood of sharing n segments, SA(sA|d,t) is the likelihood of the set of segments sA, and FA(i|t) is the likelihood of a segment of size i; wherein sP and sA are two mutually exclusive subsets of s, where sA is the subset of segments inherited from ancestor(s) with nA elements, and sP is the subset of segments shared by the population with nP elements; wherein nP+nA=n, where nA is equal to the number of shared segments inherited from ancestors, nP is the number of segments shared by the population; wherein a represents the number of ancestors shared, and d represents the combined number of generations separating the individuals from their ancestor(s).
In some embodiments of the computer-readable medium, the estimating further comprises estimating a likelihood LA that the first pair share n segments from ancestor(s) specified by d and a, with the segment sizes specified by sA, wherein: LA(nA,sA|d,a,t)=NA(n|d,a,t)·SA(sA|d,t); wherein
wherein NA(n|d,a,t) is the likelihood of sharing n segments, SA(sA|d,t) is the likelihood of the set of segments sA, and FA(i|t) is the likelihood of a segment of size i.
In certain aspects,
wherein p(t) is the probability that a shared segment is longer than t, c comprises an average number of chromosomes in the organisms, and r comprises an average number of recombination events per haploid genome in the organisms. In certain embodiments, p(t) is assumed to be equal to or about e^(−dt/100). In certain embodiments of the computer-readable medium,
In certain aspects, the estimating further comprises estimating a maximum likelihood of LR(MLR), wherein: MLR(nP,nA,s|d,a,t)=NP(nP|t)NA(nA|d,a,t)·SP({s1:n . . . sn
In some embodiments, the computer-readable medium further comprises receiving, by a processor, values indicating locations of the identical segments; comparing the locations of the first pair's identical segments to locations of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of a fourth pair of organisms, the members of the fourth pair having an established degree of genetic relatedness to each other; and wherein the estimating is further based on the location comparison.
In some aspects, the computer-readable medium further comprises comparing the locations of the first pair's identical segments to a background distribution of locations of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a background group, the members of most pairs in the background group being more distantly related than fourth cousins; and wherein the estimating is further based on the comparison of the locations of the first pair's identical segments to the locations in the background distribution. In certain aspects, the computer-readable medium further comprises comparing the locations of the first pair's identical segments to a first distribution, of locations of nonoverlapping polynucleotide segments longer than t that are identical, by at least about 90 percent sequence identity, between members of pairs of organisms in a first group, the members of each pair in the first group having an established degree of genetic relatedness to each other; and wherein the estimating is further based on the comparison of the locations of the first pair's identical segments to the locations in the first distribution.
Additional features and advantages of the subject technology will be set forth in the description below, and in part will be apparent from the description, or may be learned by practice of the subject technology. The advantages of the subject technology will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the subject technology as claimed.
All publications, patents, and GenBank sequences cited in this disclosure are incorporated by reference in their entirety.
The accompanying drawings, which are included to provide further understanding of the subject technology and are incorporated in and constitute a part of this specification, illustrate aspects of the subject technology and together with the description serve to explain the principles of the subject technology.
In the following detailed description, numerous specific details are set forth to provide a full understanding of the subject technology. It will be apparent, however, to one ordinarily skilled in the art that the subject technology may be practiced without some of these specific details. In other instances, well-known structures and techniques have not been shown in detail so as not to obscure the subject technology.
A phrase such as “an aspect” does not imply that such aspect is essential to the subject technology or that such aspect applies to all configurations of the subject technology. A disclosure relating to an aspect may apply to all configurations, or one or more configurations. An aspect may provide one or more examples of the disclosure. A phrase such as “an aspect” may refer to one or more aspects and vice versa. A phrase such as “an embodiment” does not imply that such embodiment is essential to the subject technology or that such embodiment applies to all configurations of the subject technology. A disclosure relating to an embodiment may apply to all embodiments, or one or more embodiments. An embodiment may provide one or more examples of the disclosure. A phrase such as “an embodiment” may refer to one or more embodiments and vice versa. A phrase such as “a configuration” does not imply that such configuration is essential to the subject technology or that such configuration applies to all configurations of the subject technology. A disclosure relating to a configuration may apply to all configurations, or one or more configurations. A configuration may provide one or more examples of the disclosure. A phrase such as “a configuration” may refer to one or more configurations and vice versa.
Most established methods for detecting and estimating genetic relationships are based on genome-wide averages of the estimated number of alleles shared identical by descent (IBD) between two individuals (Weir et al. 2006). These methods are accurate and efficient for relationships as distant as third-degree relatives (e.g., first cousins) but cannot identify more distant relationships. In contrast, aspects of the instant disclosure provide novel methods and apparatus for estimation of recent shared ancestry (ERSA) that accurately estimate the degree of relationship for up to eighth-degree relatives (e.g., third cousins once removed), and detect relationships as distant as twelfth-degree relatives (e.g., fifth cousins once removed).
Some methods of detecting relatedness (for example, the method implemented in PLINK; Purcell et al. 2007) rely on genome-wide averages of genetic identity coefficient estimates. These statistics incompletely summarize the information contained in the IBD segment data: genetic identity coefficients can be calculated from IBD segment data, but the reverse is not true. To illustrate the importance of this difference, consider the typical amount of genetic sharing between a pair of fourth cousins. The probability that fourth cousins share at least one IBD segment is 77%, and the expected length of this segment is 10 centiMorgans (cM) (Donnelly 1983). Because a 10 cM segment represents less than 0.3% of the genome, this excess of IBD has very little effect on estimates of relatedness averaged over the genome. However, because unrelated individuals are unlikely to share a 10 cM segment in most populations, the novel ERSA methods and apparatus disclosed herein are capable of detecting many fourth-cousin relationships.
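By way of non-limiting illustration, the fourth-cousin figures above can be reproduced approximately with a short Python calculation. The sketch assumes that fourth cousins are separated by d=10 meioses through a=2 shared ancestors, that segment lengths have mean 100/d cM, that the number of shared segments is Poisson with mean a(rd+c)/2^(d−1), and that the autosomal genome spans about 100·r cM with r=35.3 and c=22. The Poisson form and the values of r and c are assumptions supplied for illustration.

```python
import math


def expected_segment_length_cm(d):
    """Mean length of a segment surviving d meioses, assuming mean 100/d cM."""
    return 100.0 / d


def prob_at_least_one_segment(d, a, r, c):
    """Chance of sharing one or more segments, assuming a Poisson count with
    mean a * (r*d + c) / 2**(d - 1) and no length threshold."""
    mean = a * (r * d + c) / 2 ** (d - 1)
    return 1.0 - math.exp(-mean)


D, A = 10, 2        # fourth cousins: 5 meioses up and 5 down, two shared ancestors
R, C = 35.3, 22     # assumed recombinations per haploid genome and autosome count

length_cm = expected_segment_length_cm(D)
genome_fraction = length_cm / (100.0 * R)
p_share = prob_at_least_one_segment(D, A, R, C)

print(f"expected segment length: {length_cm:.1f} cM")
print(f"fraction of genome:      {genome_fraction:.2%}")
print(f"P(at least one segment): {p_share:.2f}")
```

Under these assumptions the calculation gives a 10 cM expected length, a genome fraction just under 0.3%, and a sharing probability of roughly 0.77, in line with the figures quoted in the text.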
Another family of methods for detecting relationships models the IBD states between haplotypes as a Markov process along a chromosome, with different transition probability matrices corresponding to different hypothesized relationships. The likelihoods of various relationship models are then estimated from the data. Examples of these methods include RELPAIR (Boehnke and Cox 1997; Epstein et al. 2000), PREST (extending the methods in Boehnke and Cox, 1997; McPeek and Sun 2000; Sun et al. 2002), and GBIRP (extending PREST to the problem of general relationship estimation; Stankovich et al. 2005). These tools were initially designed for use with hundreds of microsatellite loci spaced at intervals of several cM, but they have also been applied to high-density single-nucleotide polymorphism (SNP) data (e.g., Berkovic et al. 2008; Pemberton et al. 2010). However, they do not model the patterns of linkage disequilibrium (LD) that exist between very closely spaced SNP markers and instead assume that markers are not in strong LD. High-density SNP data sets must be thinned to approximately 10,000 markers before they can be used (see, e.g., Berkovic et al. 2008; Pemberton et al. 2010). The key information used by such Markov-process methods is the match between the hypothesized transition probability matrix and the pattern of IBD state transitions induced by the genotype data.
In contrast, some embodiments of the instant ERSA methods and apparatus use explicit IBD segment information to estimate the relationships between pairs of individuals in a maximum-likelihood framework. This makes better use of the information present in high-density SNP genotyping data. The power of ERSA disclosed herein to detect relationships between second cousins or closer relatives is essentially perfect and exceeds 85% for third cousins even at the α=0.001 level. ERSA is also more accurate than RELPAIR or GBIRP.
The number, lengths, and locations of chromosomal segments that are shared IBD by a pair of individuals essentially constitute the genetic information that bears on their recent shared genetic ancestry.
Algorithms can be used to detect the number, lengths, and locations of chromosomal segments IBD between two individuals (Browning and Browning 2010; Gusev et al. 2009; Thomas et al. 2008). In some embodiments, ERSA uses a likelihood ratio test to compare the null hypothesis that the two individuals are unrelated with the alternative hypothesis that the individuals share recent ancestry. Because of the qualitative difference between genome-wide averages of relatedness and the information contained in IBD segments, aspects of the present disclosure greatly expand the range of relationships that can be detected from genetic data.
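By way of non-limiting illustration, the likelihood ratio test can be sketched in Python. The sketch assumes that the test statistic −2·ln(L_null/L_alt) is referred to a chi-square distribution with 2 degrees of freedom (one each for d and a). That choice of reference distribution is an assumption made here and is not stated in the text. For 2 degrees of freedom the chi-square upper-tail probability is exactly exp(−x/2), so no external library is needed.

```python
import math


def likelihood_ratio_test(log_l_null, log_l_alt, alpha=0.001):
    """Compare 'unrelated' (null) against 'recent shared ancestry' (alternative).

    Inputs are maximized log-likelihoods. Returns (statistic, p_value, reject).
    Assumption: the statistic is chi-square with 2 degrees of freedom under
    the null hypothesis.
    """
    statistic = max(0.0, -2.0 * (log_l_null - log_l_alt))
    p_value = math.exp(-statistic / 2.0)  # chi-square(2) survival function
    return statistic, p_value, p_value < alpha


# Illustrative log-likelihoods only; they are not taken from the disclosure.
stat, p, related = likelihood_ratio_test(log_l_null=-50.0, log_l_alt=-40.0)
```

With these illustrative inputs the statistic is 20, which rejects the null hypothesis at the α=0.001 level.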
ERSA is immediately applicable to a number of problems. It can be used to identify cryptic relatedness between individuals with the same rare genetic disorder. In analyzing large pedigrees, ERSA can verify distant relationships without genotyping intervening family members. This can sharply reduce sample collection and genotyping requirements.
In the forensic field, a common DNA-based method for identifying the remains of missing persons is based on comparisons of kinship statistics computed from a modest number (13-17) of STR loci, with useful comparisons generally limited to second-degree relationships (Alonso et al. 2005; e.g., MDKAP, Leclair et al. 2007; M-FISys, Budimlija et al. 2003; Cash et al. 2003). The International Commission on Missing Persons (ICMP) has generated matches for more than 18,000 persons missing from armed conflicts or mass disasters at a significance level exceeding 99.95% (personal communication from T J Parsons, ICMP). However, this level of certainty requires typing multiple first- or second-degree relatives. Such close relatives are often unavailable, due either to disasters and conflicts that disperse entire families or to the passage of time (Brenner 2006; Leclair 2004). For example, DNA profiles exist for over 2,000 individuals killed in the armed conflict in Bosnia for which identifications cannot be made due to insufficient family reference samples (T J Parsons, ICMP). ERSA allows the use of a much larger pool of distant relatives (Bieber et al. 2006) and enables definitive conclusions to be drawn based on single closer relatives. For the first time, with ERSA, even a single individual searching for a family member would be able to provide a definitive reference.
The methods described here are computationally efficient, make near-optimal use of the genetic signal of relatedness between individuals, achieve a statistical power very close to the theoretical maximum, and have multiple applications. These methods can be implemented by machine-readable code, e.g., in software or hardware, and over computer networks such as the Internet.
As used herein, “IBD-segments” are nonoverlapping polynucleotide segments longer than a threshold length (t) that are, in certain embodiments, at least about 90% identical; in certain embodiments, about 95% identical; in certain embodiments, about 98% identical; in certain embodiments, about 99% identical; and in certain embodiments, about 100% identical.
Any IBD segment number and length data can be used in aspects of the present disclosure. Likewise, any IBD segment detection method can be used. Examples of software programs for IBD segment detection include GERMLINE (Gusev et al. 2009), fastIBD in Beagle 3.3 (Browning and Browning 2010), MERLIN (via --extended; Abecasis et al.), and Thompson (tech report, U Wash). IBD segments are determined using, for example, SNP data, whole-genome sequencing data, and/or higher-density microarray data.
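By way of non-limiting illustration, the output of any IBD segment detector can be reduced to the number and lengths of segments longer than t, as sketched below in Python. The `Segment` record and the `summarize` function are hypothetical and do not correspond to the file format of any particular program. The default t=2.5 cM follows the threshold mentioned elsewhere in this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical record for one detected IBD segment (positions in cM)."""
    chrom: str
    start_cm: float
    end_cm: float

    @property
    def length_cm(self) -> float:
        return self.end_cm - self.start_cm


def summarize(segments, t=2.5):
    """Return (n, lengths) for segments strictly longer than t.

    Raises ValueError if two retained segments on a chromosome overlap,
    since the method expects nonoverlapping segments.
    """
    kept = sorted((s for s in segments if s.length_cm > t),
                  key=lambda s: (s.chrom, s.start_cm))
    for prev, cur in zip(kept, kept[1:]):
        if prev.chrom == cur.chrom and cur.start_cm < prev.end_cm:
            raise ValueError("overlapping segments on chromosome " + cur.chrom)
    return len(kept), [s.length_cm for s in kept]


# Illustrative input: the 1 cM segment falls below the threshold and is dropped.
n, lengths = summarize([
    Segment("1", 0.0, 5.0),
    Segment("1", 10.0, 11.0),
    Segment("2", 3.0, 20.0),
])
```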
As used herein, “polynucleotides” are in certain embodiments deoxyribonucleic acids (DNA), in certain embodiments ribonucleic acids (RNA), in certain embodiments mitochondrial DNA (mtDNA), and in certain embodiments sex-linked nucleotide segments, such as those found on the Y or X chromosomes.
In certain embodiments, autosomal segments are a source of the polynucleotides used in estimating recent shared ancestry. In certain embodiments, RNA is a source of the polynucleotides used in estimating recent shared ancestry.
In certain embodiments, mtDNA or the Y chromosome(s) is a source of the polynucleotides used in estimating recent shared ancestry. For a hypothesized alternative relationship with a ancestors on a path d meioses long, the likelihood of the observed mtDNA or Y chromosome data is computed by integrating over all possible pedigrees with a ancestors and d meioses, specifying the sex of each individual in the inheritance path so that the probabilities can be calculated. The likelihood of the null hypothesis (no relationship) is calculated based on the frequencies of the observed mtDNA or Y chromosome haplotypes in the background population. In both calculations, an allowance is made for an appropriate genotyping or sequencing error rate. The log-likelihoods based on the mtDNA and Y chromosome data are then added to the log-likelihoods computed from the autosomal data (for the corresponding null and alternative hypotheses), and the relationship is estimated using standard likelihood theory as before.
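As a minimal sketch of the additive step described above, the following Python snippet sums per-source log-likelihoods for each hypothesis before a likelihood ratio is formed. The function name, the dictionary layout, the hypothesis labels, and all numeric values are illustrative assumptions and are not taken from the disclosure.

```python
def combine_log_likelihoods(*sources):
    """Sum log-likelihoods across data sources, separately for each hypothesis.

    Each source is a dict mapping a hypothesis label (hypothetical labels
    such as "null" or "d=6,a=1") to the log-likelihood of that source's data.
    """
    if not sources:
        raise ValueError("at least one data source is required")
    labels = set(sources[0])
    for src in sources[1:]:
        if set(src) != labels:
            raise ValueError("all sources must score the same hypotheses")
    return {label: sum(src[label] for src in sources) for label in labels}


# Made-up log-likelihoods for one pair of individuals.
autosomal = {"null": -310.0, "d=6,a=1": -295.5}
mtdna_y = {"null": -4.0, "d=6,a=1": -3.2}

combined = combine_log_likelihoods(autosomal, mtdna_y)
# Likelihood ratio statistic for the alternative against the null.
lrt_statistic = 2.0 * (combined["d=6,a=1"] - combined["null"])
```

Keeping each data source as its own mapping makes it straightforward to add or omit a source (for example, when no Y chromosome data are available) without changing the test itself.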
In certain embodiments, the X chromosome(s) is a source of the polynucleotides used in estimating recent shared ancestry. IBD segment data from the X chromosome are used in a similar way to Y chromosome and mtDNA data. To calculate the likelihood of the null hypothesis given observed X chromosome SNP genotype or sequence data, the observed IBD segments are compared to distributions estimated from unrelated individuals in the source population. For each alternative hypothesis, likelihoods are calculated by integrating over all possible sex-specified pedigrees in the class of relationships with a ancestors on a path d meioses long. This allows the method to account for the number of meioses in the path in which recombination occurred (only in females), which determines the IBD segment length distribution, and for the probability that the ancestral X chromosome is lost altogether (due to two consecutive male parents in the inheritance path). The log-likelihoods for null and alternative hypotheses based on X chromosome data are added to the log-likelihoods for the autosomal data, and the final likelihood ratio test is carried out as before.
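The two sex-dependent features of X chromosome inheritance mentioned above can be illustrated with a short Python sketch. It assumes a hypothetical encoding in which one descending lineage is written as a string of sexes from the ancestor down to the descendant; the function name and encoding are illustrative only, and the sketch does not perform the integration over pedigrees.

```python
def x_lineage_summary(sexes):
    """Summarize X-chromosome transmission along one descending lineage.

    `sexes` lists the sex ("M" or "F") of each individual from the ancestor
    down to the descendant, so each adjacent pair is one parent-child meiosis.
    Returns (transmissible, recombining_meioses).
    """
    if any(s not in ("M", "F") for s in sexes):
        raise ValueError("sexes must contain only 'M' or 'F'")
    pairs = list(zip(sexes, sexes[1:]))
    # A father passes his Y, not his X, to a son, so any father-to-son
    # step means the ancestral X cannot reach the descendant.
    transmissible = all(not (p == "M" and c == "M") for p, c in pairs)
    # Recombination on the X occurs only when the transmitting parent is female.
    recombining = sum(1 for p, _ in pairs if p == "F")
    return transmissible, recombining


# Grandmother -> father -> daughter: X can be transmitted, one recombining meiosis.
example_ok = x_lineage_summary("FMF")
# A lineage containing two consecutive males cannot transmit the ancestral X.
example_blocked = x_lineage_summary("FMMF")
```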
As used herein, the term “ancestor” is a parent or, recursively, the parent of an ancestor, e.g., a grandparent, great-grandparent, or great-great-grandparent.
As used herein, the term “random selection” is a broad term that includes, without limitation, selections that are any combination of (a) truly random, such as a random number generated by a random physical process, e.g., radioactive decay; (b) pseudo-random, such as a computer-generated random selection; (c) semi-random, including constraints in a selection process such as database size, and (d) quasi-random, such as a selection of n items that fills n-space more uniformly than uncorrelated random items, sometimes also called a low-discrepancy sequence. (The outputs of quasi-random sequences are generally constrained by a low-discrepancy requirement that has a net effect of points being generated in a highly correlated manner, i.e., the next point “knows” where the previous points are).
As used herein, the word “module” refers to logic embodied in hardware or firmware, or to a collection of software instructions, possibly having entry and exit points, written in a programming language, such as, for example C++. A software module may be compiled and linked into an executable program, installed in a dynamic link library, or may be written in an interpretive language such as BASIC. It will be appreciated that software modules may be callable from other modules or from themselves, and/or may be invoked in response to detected events or interrupts. Software instructions may be embedded in firmware, such as an EPROM or EEPROM. It will be further appreciated that hardware modules may be comprised of connected logic units, such as gates and flip-flops, and/or may be comprised of programmable units, such as programmable gate arrays or processors. The modules described herein are preferably implemented as software modules, but may be represented in hardware or firmware. It is contemplated that the modules may be integrated into a fewer number of modules. One module may also be separated into multiple modules. The described modules may be implemented as hardware, software, firmware or any combination thereof. Additionally, the described modules may reside at different locations connected through a wired or wireless network, or the Internet.
In general, it will be appreciated that the processors can include, by way of example, computers, program logic, or other substrate configurations representing data and instructions, which operate as described herein. In other embodiments, the processors can include controller circuitry, processor circuitry, processors, general purpose single-chip or multi-chip microprocessors, digital signal processors, embedded microprocessors, microcontrollers and the like.
Furthermore, it will be appreciated that in one embodiment, the program logic may advantageously be implemented as one or more components. The components may advantageously be configured to execute on one or more processors. The components include, but are not limited to, software or hardware components, modules such as software modules, object-oriented software components, class components and task components, processes, methods, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.
The foregoing description is provided to enable a person skilled in the art to practice the various configurations described herein. While the subject technology has been particularly described with reference to the various figures and configurations, it should be understood that these are for illustration purposes only and should not be taken as limiting the scope of the subject technology.
There may be many other ways to implement the subject technology. Various functions and elements described herein may be partitioned differently from those shown without departing from the scope of the subject technology. Various modifications to these configurations will be readily apparent to those skilled in the art, and generic principles defined herein may be applied to other configurations. Thus, many changes and modifications may be made to the subject technology, by one having ordinary skill in the art, without departing from the scope of the subject technology.
It is understood that the specific order or hierarchy of steps in the processes disclosed is an illustration of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged. Some of the steps may be performed simultaneously. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
As used herein, the singular forms “a,” “an” and “the” include plural references unless the content clearly dictates otherwise.
The term “about,” as used herein, can refer to +/−10% of a value.
Furthermore, to the extent that the term “include,” “have,” or the like is used in the description or the claims, such term is intended to be inclusive in a manner similar to the term “comprise” as “comprise” is interpreted when employed as a transitional word in a claim.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.
A reference to an element in the singular is not intended to mean “one and only one” unless specifically stated, but rather “one or more.” Pronouns in the masculine (e.g., his) include the feminine and neuter gender (e.g., her and its) and vice versa. The term “some” refers to one or more. Underlined and/or italicized headings and subheadings are used for convenience only, do not limit the subject technology, and are not referred to in connection with the interpretation of the description of the subject technology. All structural and functional equivalents to the elements of the various configurations described throughout this disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and intended to be encompassed by the subject technology. Moreover, nothing disclosed herein is intended to be dedicated to the public regardless of whether such disclosure is explicitly recited in the above description.
Aspects of the invention now being generally described, it will be more readily understood by reference to the following examples, which are included merely for purposes of illustration of certain aspects and embodiments of the present disclosure, and are not intended to limit the invention.
Some aspects of the present disclosure employ a likelihood ratio test for which the data are the number and lengths of autosomal genomic segments shared between two individuals, with segment length measured in centiMorgans (cM). The null hypothesis is that the individuals are no more related than two persons picked at random from the population; the alternative hypothesis is that the two individuals share recent ancestry. When the alternative model is not significantly more likely than the null model, it is concluded that there is no evidence for recent shared ancestry. Otherwise, the maximum-likelihood estimate for the degree of relationship between the two individuals is obtained by maximizing the likelihood of the alternative model over all possible relationships. Significance levels and confidence intervals are determined from standard chi-square approximations for the likelihood ratio test.
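The decision rule described above can be sketched in Python as follows. This is a minimal illustration, not the disclosed implementation: the function names are hypothetical, and the use of two degrees of freedom for the chi-square approximation is an assumption made here so that the tail probability has a closed form.

```python
import math


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom.

    For 2 degrees of freedom the survival function is exactly exp(-x / 2).
    """
    return math.exp(-x / 2.0) if x > 0 else 1.0


def likelihood_ratio_test(log_l_null, log_l_alt, alpha=0.001):
    """Return (statistic, p_value, significant) for null vs. alternative.

    log_l_null: maximized log-likelihood under the null (no recent ancestry).
    log_l_alt:  maximized log-likelihood under the alternative (recent ancestry).
    The choice of 2 degrees of freedom is an assumption of this sketch.
    """
    statistic = max(0.0, 2.0 * (log_l_alt - log_l_null))
    p_value = chi2_sf_2df(statistic)
    return statistic, p_value, p_value < alpha


# Made-up log-likelihoods: strong evidence, then weak evidence.
strong = likelihood_ratio_test(-120.0, -100.0)
weak = likelihood_ratio_test(-100.0, -99.0)
```

When the test is not significant, no relationship estimate would be reported; otherwise the relationship that maximizes the alternative likelihood would be returned.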
An embodiment of ERSA according to the present disclosure was applied to three well-defined pedigrees with predominantly Northern European ancestry (Table 1). Informed consent was obtained from all study subjects, and all procedures were approved by the Western Institutional Review Board. DNA samples were collected and purified from blood as described in Xing et al. (2010). Affymetrix 6.0 SNP arrays were used to genotype 169 individuals selected from these pedigrees (Table 1), per the manufacturer's instructions (see Xing et al. 2010). Beagle 3.2 (Browning and Browning 2010) was used to phase and impute missing genotypes, using the Affymetrix 6.0 SNP genotypes of the 30 HapMap CEU trios as a reference (CEL files provided by Affymetrix). Of 868,155 autosomal SNP loci with unique positions on the array (not including controls, whose probe set IDs begin with ‘AFFX-SNP’), 18,610 were excluded from the final data set because they exhibited more than three Mendelian inheritance errors in the CEU trios or more than 10% missing data in either the CEU or pedigree individuals. On the basis of the pedigree genotypes, GERMLINE 1.4.1 (Gusev et al. 2009; software available on Columbia University's Computer Science webpage (Gusev; GERMLINE)) inferred the locations and extents of IBD segments for all pairs of individuals (parameters err_het=2, err_hom=1, and min_m=1cM, with marker positions given on the HapMap r22 genetic map). GERMLINE identifies short regions of exact matches between haplotypes using a library of short seeds, then extends and merges those regions using an efficient hashing and matching algorithm. ERSA was applied to the output of GERMLINE. The program fastIBD in Beagle vers. 3.3 (Browning, University of Washington website) was also used to generate IBD segments for analysis by ERSA (default options). Although principal component analysis (
The likelihood of the null hypothesis is estimated from the empirical distribution of autosomal shared segments in the population. Only shared segments longer than a given threshold, t, are considered because shorter segments are more difficult to detect and provide little information about recent ancestry. Let s equal the set of segments shared between two individuals and n equal the number of elements in s. For this calculation, it is assumed that the number of segments shared and the length of each segment are independent, which is approximately true for the HapMap CEU population (see
$$L_P(n, s \mid t) = N_P(n \mid t)\cdot S_P(s \mid t), \tag{1}$$
where $N_P(n \mid t)$ is the likelihood of sharing n segments, $S_P(s \mid t)$ is the likelihood of the set of segments s, and $F_P(i \mid t)$ is the likelihood of a segment of length i. $N_P(n \mid t)$ is approximated from a Poisson distribution with mean equal to the sample mean of the number of segments shared in the population (
The variable t is set to the smallest value that can achieve a false-negative rate of 1% or lower. This setting maximizes the use of available data while ensuring that the exponential approximation to the distribution of segment lengths in the population holds. Here, the choice of t=2.5 cM was based on GERMLINE's previously reported false-negative rate of 1% for segments 2.5 cM and longer (Gusev et al. 2009). In the HapMap CEU population, the distribution of segments detected by GERMLINE that are longer than 2.5 cM is approximately exponential, with the exception of a few significant outliers (
where θ is equal to the mean shared segment length in the population for all segments of size greater than t and less than h. For HapMap CEU with t=2.5 cM and h=10 cM, the estimate of θ is 3.12 cM.
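A minimal Python sketch of the null-model likelihood follows. It assumes a Poisson distribution for the number of shared segments and an exponential distribution for segment lengths conditioned on exceeding t, written here as exp(-(i - t)/θ)/θ; that functional form, the function names, and the example Poisson mean are assumptions of the sketch, while t = 2.5 cM and θ = 3.12 cM are the values given above.

```python
import math


def log_poisson(n, mean):
    """Log of the Poisson probability of observing n events with the given mean."""
    return n * math.log(mean) - mean - math.lgamma(n + 1)


def log_null_length(i, theta, t):
    """Log-density of a background segment of length i (cM), given i > t.

    Assumed form: exponential with scale theta, shifted to start at t.
    """
    if i < t:
        raise ValueError("segment shorter than threshold t")
    return -(i - t) / theta - math.log(theta)


def log_null_likelihood(segments, mean_count, theta=3.12, t=2.5):
    """Log-likelihood of the shared segments under the null hypothesis.

    segments:   shared segment lengths in cM for one pair of individuals.
    mean_count: sample mean number of shared segments in the population
                (a hypothetical input; it must be estimated from data).
    """
    kept = [i for i in segments if i > t]  # only segments longer than t are used
    count_term = log_poisson(len(kept), mean_count)
    length_term = sum(log_null_length(i, theta, t) for i in kept)
    return count_term + length_term


# Made-up example: three segments, one of which is below the threshold.
example = log_null_likelihood([1.8, 3.0, 6.4], mean_count=2.0)
```

Working in log space avoids numerical underflow when many segments are multiplied together.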
The alternative hypothesis is that the pair of individuals share either one or two recent ancestors. Let a represent the number of ancestors shared, and let d equal the combined number of generations separating the individuals from their ancestor(s), e.g., d=6 and a=1 for half-second cousins. Under the alternative hypothesis, segments shared by two individuals come from two sources: recent ancestry and the population background (denoted by subscripts A and P, respectively). Let $n_P + n_A = n$, where $n_A$ is equal to the number of shared segments inherited from recent ancestors, and $n_P$ is the number of segments shared due to the population background. $s_P$ and $s_A$ are two mutually exclusive subsets of s, with $s_A$ equal to the subset of segments inherited from recent ancestor(s) with $n_A$ elements and $s_P$ equal to the subset of segments shared due to the background with $n_P$ elements. The likelihood of the alternative hypothesis of recent ancestry, $L_R$, is then:
$$L_R = L_A(n_A, s_A \mid d, a, t)\, L_P(n_P, s_P \mid t). \tag{4}$$
Because $s_P$ is distributed according to the population distribution, $L_P$ follows the description in Eq. 1. $L_A$ is the likelihood that two individuals share $n_A$ autosomal segments from recent ancestor(s) specified by d and a, with the segment lengths specified by $s_A$. $L_A$ can be expressed as the product of likelihoods of the number of shared segments and the length of each segment, which parallels Eqs. 1 and 2:
Eq. 6 assumes that, for a given value of d, the lengths of segments are independent. This assumption is not strictly true. One might imagine that the presence of a particularly long segment would reduce the genomic space available for additional segments. However because the length of any one segment is small relative to the length of the genome, and because the genome is physically divided into chromosomes, the segment lengths are approximately independent (Thomas et al. 1994).
For two individuals who are related by an inheritance path that is d meioses long, the probability that they will inherit any particular autosomal segment from a common ancestor on that path is equal to $1/2^{d-1}$. The expected number of shared autosomal segments that could potentially be inherited from a common ancestor is equal to rd+c, where c is the number of autosomes and r is the expected number of recombination events per haploid genome per generation. Therefore, the expected number of shared segments is equal to $a(rd+c)/2^{d-1}$ (Thomas et al. 1994). In humans, c is equal to 22 and r is approximately 35.3 (McVean et al. 2004). Given d, the expected value of i is 100/d. Without conditioning on t, the distribution of segment length is exponential with mean 100/d. Conditioning on t,
The probability that a shared segment is longer than t, p(t), is equal to $e^{-dt/100}$ (Thomas et al. 1994). Because the distribution of the number of shared segments is approximately Poisson (Thomas et al. 1994),
Given $n_A$ and $n_P$, the maximum value of the likelihood function (Eq. 4) is equal to:
where $s_{x:n}$ is equal to the xth smallest value in s. Eq. 9 asserts that the likelihood is maximized when the set of segments resulting from recent ancestry is equal to the longest $n_A$ segments in s, with the remaining $n_P$ segments being due to the population background.
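The quantities that enter the recent-ancestry model can be sketched numerically in Python. The values r = 35.3 and c = 22, the expected count a(rd+c)/2^(d-1), and p(t) = exp(-dt/100) are taken from the text; the function names are hypothetical, and the conditional length density is written under the assumption that an exponential with mean 100/d conditioned on exceeding t is simply shifted to start at t.

```python
import math

R_PER_GENERATION = 35.3  # expected recombination events per haploid genome per generation
N_AUTOSOMES = 22         # number of human autosomes


def expected_shared_segments(d, a, r=R_PER_GENERATION, c=N_AUTOSOMES):
    """Expected number of segments shared through a ancestors, d meioses apart."""
    return a * (r * d + c) / 2 ** (d - 1)


def prob_longer_than(t, d):
    """Probability p(t) that a shared segment is longer than t cM."""
    return math.exp(-d * t / 100.0)


def expected_detectable_segments(d, a, t=2.5):
    """Expected number of shared segments that also exceed the threshold t."""
    return expected_shared_segments(d, a) * prob_longer_than(t, d)


def ancestry_length_density(i, d, t=2.5):
    """Density of a segment of length i cM given that it is longer than t.

    Assumed form: exponential with mean 100/d, shifted to start at t.
    """
    if i < t:
        return 0.0
    return (d / 100.0) * math.exp(-d * (i - t) / 100.0)


# Half-second cousins: d = 6 meioses, a = 1 shared ancestor.
n_expected = expected_shared_segments(6, 1)
n_detectable = expected_detectable_segments(6, 1)
```

Because the expected count falls roughly by half with each additional meiosis while the mean length 100/d shrinks only slowly, distant relationships are characterized by few but still relatively long segments.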
The alternative model contains three additional parameters relative to the null model, d, a, and nA (nP=n−nA). However, when the behavior of d and a was evaluated empirically, it was found that they effectively act as a single parameter (
$$\mathrm{ML}_R(n, s \mid d, a, t) = \max\{\mathrm{ML}_R(n_P, n - n_P, s) : n_P \in \{0, 1, \ldots, n\}\}. \tag{10}$$
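The maximization over $n_P$ can be sketched in Python as a scan over the sorted segment lengths: for each candidate split, the shortest segments are attributed to the population background and the longest to recent ancestry. The function and parameter names are hypothetical, and the component models in the demonstration (Poisson means of 2.0, θ = 3.12 cM, d = 6, t = 2.5 cM, shifted-exponential lengths) are illustrative assumptions rather than fitted values.

```python
import math


def log_poisson(k, mean):
    """Log of the Poisson probability of k events with the given mean."""
    return k * math.log(mean) - mean - math.lgamma(k + 1)


def best_partition(segments, log_count_bg, log_len_bg, log_count_anc, log_len_anc):
    """Maximize the log-likelihood over the number of background segments n_p.

    For each n_p in 0..n, the n_p shortest segments are scored with the
    background model and the remaining n - n_p with the ancestry model.
    Returns (best_log_likelihood, best_n_p).
    """
    s = sorted(segments)
    n = len(s)
    best_ll, best_n_p = -math.inf, 0
    for n_p in range(n + 1):
        ll = (log_count_bg(n_p) + log_count_anc(n - n_p)
              + sum(log_len_bg(x) for x in s[:n_p])
              + sum(log_len_anc(x) for x in s[n_p:]))
        if ll > best_ll:
            best_ll, best_n_p = ll, n_p
    return best_ll, best_n_p


# Hypothetical component models for the demonstration.
T, THETA, D = 2.5, 3.12, 6
demo_ll, demo_n_p = best_partition(
    [45.0, 3.0, 30.0, 4.0],
    log_count_bg=lambda k: log_poisson(k, 2.0),
    log_len_bg=lambda x: -(x - T) / THETA - math.log(THETA),
    log_count_anc=lambda k: log_poisson(k, 2.0),
    log_len_anc=lambda x: math.log(D / 100.0) - D * (x - T) / 100.0,
)
```

In this made-up example the two short segments (3.0 and 4.0 cM) are attributed to the background and the two long segments (30.0 and 45.0 cM) to recent ancestry. Scanning only the n+1 sorted splits, rather than all subsets, relies on the property that the ancestry set consists of the longest segments.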
a. Individuals Ascertained Based on a Shared Genetic Variant
If the two individuals have been ascertained because they both share the same genetic variant, as in the case of a shared disease-causing variant, then the likelihood calculation must be conditioned on this ascertainment. In the case of such ascertainment, the shared segment that contains the variant is equivalent to two shared segments, with the segment boundaries defined by the original boundaries and the location of the ascertained variant (Thomas et al. 2008; Thomas et al. 1994). Thomas et al. have shown that the lengths of these segments, $g_1$ and $g_2$, are exponentially distributed, with the mean equal to the unconditional length of a segment. Excluding the ascertained segment from n and s, the maximum value of the likelihood function is equal to:
$$\mathrm{AML}_R(n, s, g_1, g_2 \mid d, a, t) = \mathrm{ML}_R(n, s \mid d, a, t)\cdot \max\{S_P(\{g_1, g_2\} \mid t)\, S_A(\{g_1, g_2\} \mid d, a, t)\} \tag{11}$$
Equation 9 holds as long as θ<a(rd+c), which is true whenever a and d specify shared ancestry that is recent relative to pairs of individuals selected at random from the population. Given a set of shared segment lengths between two individuals, s, the objective is to identify the subset of these segments, m, containing the nA elements that are most likely to have been inherited from recent ancestor(s). Eq. 9 assumes that m is equal to the largest nA elements in s. Here, it is shown why this assumption holds: Let θ1=100/d, which is the expected length of a shared segment inherited from a recent ancestor. Let θ2=θ, which is the expected length of a shared segment if it is not inherited from a recent ancestor. If the average time to the most recent common ancestor between individuals in the population is greater than d/2, then θ1>θ2. If θ1<θ2, then individuals selected at random from the population are more closely related than the relationship being analyzed, and therefore there is no power to detect a relationship.
To demonstrate that m is equal to the set containing the largest $n_A$ elements of s, consider two mutually exclusive subsets of s, $z_P$ and $z_A$, with $z_A$ containing $n_A$ elements. Let $x_1$ equal the largest element in $z_P$ and $x_2$ equal the smallest element in $z_A$. Let $y_P$ and $y_A$ respectively equal the sets $z_P$ and $z_A$, with the exception that $x_1$ and $x_2$ are swapped. As long as $x_1 > x_2$, the likelihood of $z_P$ and $z_A$ is less than the likelihood of $y_P$ and $y_A$:
$$L_R(n_P, n_A, z_A, z_P \mid d, a, t) < L_R(n_P, n_A, y_A, y_P \mid d, a, t).$$
The components of $L_R$ are $N_A$, $N_P$, $S_A$, and $S_P$. Because $N_A$ and $N_P$ depend only on $n_P$ and $n_A$, the above condition simplifies to:
$$S_P(z_P \mid t)\, S_A(z_A \mid d, a, t) < S_P(y_P \mid t)\, S_A(y_A \mid d, a, t).$$
The elements of $z_P$ and $y_P$, and of $z_A$ and $y_A$, are equal, with the exception of $x_1$ and $x_2$. Therefore, by Eq. 6, the inequality becomes
$$F_P(x_1 \mid t)\, F_A(x_2 \mid d, a, t) < F_P(x_2 \mid t)\, F_A(x_1 \mid d, a, t),$$
which (by Eqs. 3 and 7) is equal to
This simplifies to
Q.E.D.
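As a worked sketch of the final step, assume that both segment-length densities are exponentials conditioned on exceeding t, with mean parameter $\theta_2 = \theta$ for the population background and $\theta_1 = 100/d$ for recent ancestry. These functional forms are an assumption of the sketch. The comparison is between assigning the longer segment $x_1$ to the recent-ancestry set (and $x_2$ to the background set) and the reverse assignment:

```latex
% Assumed forms:
%   F_P(x \mid t)       = \theta_2^{-1} \exp\{-(x - t)/\theta_2\}
%   F_A(x \mid d, a, t) = \theta_1^{-1} \exp\{-(x - t)/\theta_1\}, \quad \theta_1 = 100/d
\begin{align*}
\frac{F_P(x_2 \mid t)\, F_A(x_1 \mid d, a, t)}{F_P(x_1 \mid t)\, F_A(x_2 \mid d, a, t)}
  &= \exp\!\left[\frac{x_1 - x_2}{\theta_2} - \frac{x_1 - x_2}{\theta_1}\right] \\
  &= \exp\!\left[(x_1 - x_2)\left(\frac{1}{\theta_2} - \frac{1}{\theta_1}\right)\right], \\
\text{which exceeds } 1 &\iff (x_1 - x_2)(\theta_1 - \theta_2) > 0 .
\end{align*}
```

Under these assumptions the normalizing constants and the threshold t cancel, so with $x_1 > x_2$ the assignment that places the longer segment in the recent-ancestry set is the more likely one exactly when $\theta_1 > \theta_2$.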
Although d and a are specified as two separate parameters in the likelihood ratio test, analyses indicated that allowing a to vary has almost no effect on the distribution of likelihood scores under the null hypothesis. To demonstrate this behavior, the likelihood scores for pairs of individuals from two closely related populations, the CHB (45 Han Chinese in Beijing) and JPT (45 Japanese in Tokyo) samples, were evaluated using the HapMap phase 2 SNP genotype data (HapMap Consortium 2005). For each pair of individuals, the maximum likelihood for two alternative models (L1 and L2) was calculated. In model 1, a is allowed to vary, and in model 2, a is fixed equal to 2 (d is estimated in both). To evaluate the effect of allowing a to vary, a likelihood ratio test (LRT) statistic for the two models (−2 ln [L1/L2]) was calculated;
The performance of ERSA was assessed by analyzing high-density SNP microarray data on three deep, well-defined pedigrees composed of 24, 30, and 115 individuals (Table 1). The output from this analysis was a maximum-likelihood estimate and confidence interval (C.I.) for the degree of relationship of each pair of individuals in the sample. The computation time taken by ERSA to analyze all 14,196 pairs of individuals in this sample was approximately 9 minutes running on one core of a 2.3 GHz AMD Opteron processor. In
For pairs of individuals as distantly related as eighth-degree relatives, ERSA's estimates are generally accurate to within one degree of the known relationship. ERSA predicted the exact degree of relationship for 66% of the 549 pairs of first- through fifth-degree relatives and was accurate to within one degree of relationship for 97% of those pairs (
ERSA has nearly 100% power to detect first- through fifth-degree relatives and substantial power to detect ancestry as distant as eleventh-degree relatives. A significant relationship was detected among all 549 pairs of first- through fifth-degree relatives in the sample at α=0.001, where the null hypothesis is no relationship (
For comparison, the same relationships were analyzed by applying RELPAIR (Epstein et al. 2000) and GBIRP (Stankovich et al. 2005) to a subset of the SNP loci (see FIGS. 4 and 6A-6F). Both methods had high power to detect third- and fourth-degree relatives (dotted and solid blue lines in
As shown in Table 2, ERSA's probability of detecting a significant relationship between unrelated individuals (the empirical false positive rate) is approximately equal to the nominal significance level (α). To estimate the empirical false positive rate, high-density SNP data on a set of individuals with no recent shared ancestry was needed. Given the sensitivity of ERSA to distant relationships, acquiring an appropriate dataset from pedigree data would require complete ancestry information for each individual in the sample extending back at least seven generations. Because such pedigrees are extremely rare, the false positive rate was instead estimated from two closely related populations, the CHB (45 Han Chinese in Beijing) and JPT (45 Japanese in Tokyo) samples, using the HapMap phase 2 SNP genotype data (HapMap Consortium 2005). Because these populations can be distinguished genetically (HapMap Consortium 2005), estimating the false positive rate from the CHB-JPT comparison is not ideal. However, the allele frequency and haplotype distributions of these populations are very similar (HapMap Consortium 2005), and pairs of CHB and JPT individuals are unlikely to have shared an ancestor in the past 200 years. Therefore, false positive rates were estimated from the proportions of CHB-JPT pairs in which significant recent ancestry was detected. The estimated false positive rates closely matched the nominal rates (Table 2). For the significance level of α=0.001 used in
ERSA can also accurately detect relationships between individuals who share a disease-causing mutation transmitted from a common founder. The process of ascertaining individuals based on a shared mutation introduces biases in the estimation of recent ancestry, but this bias can be taken into account (see Methods). The test case was composed of seven previously described individuals who are affected with attenuated familial adenomatous polyposis (AFAP) due to a single disease-causing mutation (c.426_427delAT in the APC gene; Neklason et al. 2008). The available pedigree information identified four pairs of these individuals as sixth-degree relatives and one pair as eighth-degree relatives. The point estimates from ERSA were accurate to within one degree of relationship for all five of these pairs.
ERSA uses explicit IBD segment information to estimate the relationships between pairs of individuals in a maximum-likelihood framework. This makes better use of the information present in high-density SNP genotyping data, as shown by the power curves in
Because denser and more accurate genetic data will improve the ability to detect and delineate IBD segments, it is expected that the accuracy of IBD segment inference will improve as whole-genome sequencing becomes more affordable and as higher-density microarrays become available. In addition, while the IBD segment detection methods used here (GERMLINE; Gusev et al. 2009; fastIBD in Beagle 3.3) perform well, further improvements are expected as phasing and imputation methods advance (e.g., Genovese et al. 2010).
ERSA detects recent shared ancestry by identifying an excess of IBD segment-sharing relative to the population background. Therefore, the power to detect shared ancestry between individuals depends on the demographic history of the population to which those individuals belong. If the population size is small, or if the population has experienced a founder effect or recent bottleneck, then the level of IBD segment-sharing among unrelated individuals will increase. In such populations, ERSA's power to detect distant relationships will be diminished.
The pedigree samples analyzed in Example 1 are from a homogeneous population. As shown here, it is predicted that ERSA will retain its high detection power in admixed populations.
Analysis of the European samples of Example 1 demonstrates that ERSA performs well in a homogeneous population with no history of recent admixture from a more distantly related population. Because pedigree data for an admixed population was not available, ERSA's performance in the presence of admixture could not be directly analyzed. Impacts of admixture on ERSA's performance would most likely be mediated through effects on the expected distributions of the number and lengths of IBD segments shared between unrelated individuals. Admixture should reduce the number and lengths of such segments. The reasoning for this expected reduction is as follows. The detection of IBD segments is based largely on long runs of consecutive loci at which the genotypes are consistent with identity-by-state (IBS). Admixture will introduce alleles that are frequently IBS among pairs of individuals in the population due to shared ancestry. However, in the absence of founder effect, given that two admixed individuals are of identical ancestry at a particular genomic segment, they are no more likely to share long runs of IBS than individuals chosen at random from the appropriate reference population. When individuals are not required to share ancestry at any particular genomic segment (as would be the case for ascertainment for a shared genetic disease), it results in an expectation of fewer and smaller shared segments among unrelated individuals relative to at least one of the reference populations.
This prediction was tested by comparing individuals from a sample of 25 Bolivian individuals genotyped on Affymetrix SNP 6.0 arrays (Xing et al. 2010). Substantial European admixture (19-41%; data not shown) in 9 Bolivians was identified using the Admixture software (Alexander et al. 2009). The Bolivian population was divided into groups with and without admixture. All non-admixed Bolivians were estimated to have <0.1% admixture. The same process used for the European sample was then applied to identify shared segments, i.e., using Beagle (Browning and Browning 2010) to phase and impute the data and GERMLINE (Gusev et al. 2009) to identify all shared segments longer than 2.5 cM. Consistent with predictions, on average, the admixed Bolivians shared 43 segments (95% C.I. 41-45 segments) with an average size of 3.5 cM (95% C.I. 3.4-3.7 cM), compared to 88 segments (95% C.I. 86-92 segments) with an average size of 4.2 cM (95% C.I. 4.1-4.3 cM) in non-admixed Bolivians.
In comparisons of distantly-related admixed individuals, the smaller expected number and size of background segments could slightly improve ERSA's detection power: short but meaningful shared IBD segments could become statistically significant when compared to a shorter background size distribution. In comparisons of distantly-related individuals with ancestries mostly confined to one of the reference populations, however, the admixed population background distributions would be incorrect. Using them might cause ERSA to suffer a slightly increased false positive rate or a bias towards overestimating the degree of relationship due to the misattribution of some short background segments to a distant relationship.
Many existing methods for detecting IBD segments do not distinguish segments that overlap on homologous chromosomes, and rather than consider them to be separate, merge them into one (see
where $\hat{k}$ is the maximum likelihood estimate for the number of merged segments. Because Eq. S2 introduces additional estimated parameters into the full-sibling model, ERSA only reports the full-sibling model as the maximum likelihood estimate if it is significantly more likely than all other models at the 0.05 level.
ERSA is designed to detect ancestry at a single node in a pedigree; incorporating information about homozygosity by descent (HBD) would result in near-perfect detection of full-sibling relationships, but would have little to no effect on estimates of other relationships. HBD information will be incorporated into future evaluation of full-sibling models as the tools for IBD and HBD segment detection improve.
Many existing IBD methods are also unable to detect the recombination breakpoints between parent-offspring pairs and usually report the length of each entire chromosome as a shared segment (Gusev et al. 2009; Thomas et al. 2008). With this detection scheme, a probabilistic description of the number and size of shared segments is no longer appropriate. Therefore, to identify parent-offspring relationships, a different statistic, the total proportion of the genome shared between the two individuals, was considered. A sibling relationship is rejected in favor of a parent-offspring relationship when the proportion of the genome shared exceeds a specified significance level for siblings (default is 0.01). ERSA includes options to bypass Eqs. S1, S2, and/or the parent-offspring option for situations where the overlapping segments can be accurately identified.
ALEXANDER, D. H., NOVEMBRE, J., AND LANGE, K. 2009. FAST MODEL-BASED ESTIMATION OF ANCESTRY IN UNRELATED INDIVIDUALS. GENOME RES 19: 1655-1664.
ALONSO, A., MARTIN, P., ALBARRAN, C., GARCIA, P., FERNANDEZ DE SIMON, L., JESUS ITURRALDE, M., FERNANDEZ-RODRIGUEZ, A., ATIENZA, I., CAPILLA, J., GARCIA-HIRSCHFELD, J. ET AL. 2005. CHALLENGES OF DNA PROFILING IN MASS DISASTER INVESTIGATIONS. CROAT MED J 46: 540-548.
BERKOVIC, S. F., DIBBENS, L. M., OSHLACK, A., SILVER, J. D., KATERELOS, M., YEARS, D. F., LULLMANN-RAUCH, R., BLANZ, J., ZHANG, K. W., STANKOVICH, J. ET AL. 2008. ARRAY-BASED GENE DISCOVERY WITH THREE UNRELATED SUBJECTS SHOWS SCARB2/LIMP-2 DEFICIENCY CAUSES MYOCLONUS EPILEPSY AND GLOMERULOSCLEROSIS. AM J HUM GENET 82: 673-684.
BIEBER, F. R., BRENNER, C. H., AND LAZER, D. 2006. FINDING CRIMINALS THROUGH DNA OF THEIR RELATIVES. SCIENCE 312: 1315-1316.
BIESECKER, L. G., BAILEY-WILSON, J. E., BALLANTYNE, J., BAUM, H., BIEBER, F. R., BRENNER, C., BUDOWLE, B., BUTLER, J. M., CARMODY, G., CONNEALLY, P. M. ET AL. 2005. EPIDEMIOLOGY. DNA IDENTIFICATIONS AFTER THE 9/11 WORLD TRADE CENTER ATTACK. SCIENCE 310: 1122-1123.
BOEHNKE, M. AND COX, N. J. 1997. ACCURATE INFERENCE OF RELATIONSHIPS IN SIB-PAIR LINKAGE STUDIES. THE AMERICAN JOURNAL OF HUMAN GENETICS 61: 423-429.
BRENNER, C. H. 2006. SOME MATHEMATICAL PROBLEMS IN THE DNA IDENTIFICATION OF VICTIMS IN THE 2004 TSUNAMI AND SIMILAR MASS FATALITIES. FORENSIC SCI INT 157: 172-180.
BROWNING, S. R. AND BROWNING, B. L. 2010. HIGH-RESOLUTION DETECTION OF IDENTITY BY DESCENT IN UNRELATED INDIVIDUALS. THE AMERICAN JOURNAL OF HUMAN GENETICS 86: 526-539.
BUDIMLIJA, Z. M., PRINZ, M. K., ZELSON-MUNDORFF, A., WIERSEMA, J., BARTELINK, E., MACKINNON, G., NAZZARUOLO, B. L., ESTACIO, S. M., HENNESSEY, M. J., AND SHALER, R. C. 2003. WORLD TRADE CENTER HUMAN IDENTIFICATION PROJECT: EXPERIENCES WITH INDIVIDUAL BODY IDENTIFICATION CASES. CROAT MED J 44: 259-263.
CASH, H. D., HOYLE, J. W., AND SUTTON, A. J. 2003. DEVELOPMENT UNDER EXTREME CONDITIONS: FORENSIC BIOINFORMATICS IN THE WAKE OF THE WORLD TRADE CENTER DISASTER. PAC SYMP BIOCOMPUT: 638-653.
CHERNY, S. S., ABECASIS, G. R., COOKSON, W. O., SHAM, P. C., AND CARDON, L. R. 2001. THE EFFECT OF GENOTYPE AND PEDIGREE ERROR ON LINKAGE ANALYSIS: ANALYSIS OF THREE ASTHMA GENOME SCANS. GENET EPIDEMIOL 21 SUPPL 1: S117-122.
INTERNATIONAL HAPMAP CONSORTIUM 2005. A HAPLOTYPE MAP OF THE HUMAN GENOME. NATURE 437: 1299-1320.
DEWOODY, J. A. 2005. MOLECULAR APPROACHES TO THE STUDY OF PARENTAGE, RELATEDNESS, AND FITNESS: PRACTICAL APPLICATIONS FOR WILD ANIMALS. THE JOURNAL OF WILDLIFE MANAGEMENT 69: 1400-1418.
DONNELLY, K. P. 1983. THE PROBABILITY THAT RELATED INDIVIDUALS SHARE SOME SECTION OF GENOME IDENTICAL BY DESCENT. THEOR POPUL BIOL 23: 34-63.
EPSTEIN, M. P., DUREN, W. L., AND BOEHNKE, M. 2000. IMPROVED INFERENCE OF RELATIONSHIP FOR PAIRS OF INDIVIDUALS. THE AMERICAN JOURNAL OF HUMAN GENETICS 67: 1219-1231.
GENOVESE, G., LEIBON, G., POLLAK, M., AND ROCKMORE, D. 2010. IMPROVED IBD DETECTION USING INCOMPLETE HAPLOTYPE INFORMATION. BMC GENETICS 11: 58.
GUSEV, A., LOWE, J. K., STOFFEL, M., DALY, M. J., ALTSHULER, D., BRESLOW, J. L., FRIEDMAN, J. M., AND PE'ER, I. 2009. WHOLE POPULATION, GENOME-WIDE MAPPING OF HIDDEN RELATEDNESS. GENOME RESEARCH 19: 318-326.
LECLAIR, B. 2004. LARGE-SCALE COMPARATIVE GENOTYPING AND KINSHIP ANALYSIS: EVOLUTION IN ITS USE FOR HUMAN IDENTIFICATION IN MASS FATALITY INCIDENTS AND MISSING PERSONS DATABASING. PROGRESS IN FORENSIC GENETICS 10: 42-44.
LECLAIR, B., SHALER, R., CARMODY, G. R., ELIASON, K., HENDRICKSON, B. C., JUDKINS, T., NORTON, M. J., SEARS, C., AND SCHOLL, T. 2007. BIOINFORMATICS AND HUMAN IDENTIFICATION IN MASS FATALITY INCIDENTS: THE WORLD TRADE CENTER DISASTER. J FORENSIC SCI 52: 806-819.
MCPEEK, M. S. AND SUN, L. 2000. STATISTICAL TESTS FOR DETECTION OF MISSPECIFIED RELATIONSHIPS BY USE OF GENOME-SCREEN DATA. THE AMERICAN JOURNAL OF HUMAN GENETICS 66: 1076-1094.
MCVEAN, G. A. T., MYERS, S. R., HUNT, S., DELOUKAS, P., BENTLEY, D. R., AND DONNELLY, P. 2004. THE FINE-SCALE STRUCTURE OF RECOMBINATION RATE VARIATION IN THE HUMAN GENOME. SCIENCE 304: 581-584.
NEKLASON, D. W., STEVENS, J., BOUCHER, K. M., KERBER, R. A., MATSUNAMI, N., BARLOW, J., MINEAU, G., LEPPERT, M. F., AND BURT, R. W. 2008. AMERICAN FOUNDER MUTATION FOR ATTENUATED FAMILIAL ADENOMATOUS POLYPOSIS. CLIN GASTROENTEROL HEPATOL 6: 46-52.
PEMBERTON, T. J., WANG, C., LI, J. Z., AND ROSENBERG, N. A. 2010. INFERENCE OF UNEXPECTED GENETIC RELATEDNESS AMONG INDIVIDUALS IN HAPMAP PHASE III. AM J HUM GENET 87: 457-464.
PURCELL, S., NEALE, B., TODD-BROWN, K., THOMAS, L., FERREIRA, M. A. R., BENDER, D., MALLER, J., SKLAR, P., DE BAKKER, P. I. W., DALY, M. J. ET AL. 2007. PLINK: A TOOL SET FOR WHOLE-GENOME ASSOCIATION AND POPULATION-BASED LINKAGE ANALYSES. THE AMERICAN JOURNAL OF HUMAN GENETICS 81: 559-575.
SIMONSON, T. S., YANG, Y., HUFF, C. D., YUN, H., QIN, G., WITHERSPOON, D. J., BAI, Z., LORENZO, F. R., XING, J., JORDE, L. B. ET AL. 2010. GENETIC EVIDENCE FOR HIGH-ALTITUDE ADAPTATION IN TIBET. SCIENCE 329: 72-75.
SLATE, J., SANTURE, A. W., FEULNER, P. G. D., BROWN, E. A., BALL, A. D., JOHNSTON, S. E., AND GRATTEN, J. 2010. GENOME MAPPING IN INTENSIVELY STUDIED WILD VERTEBRATE POPULATIONS. TRENDS IN GENETICS 26: 275-284.
STANKOVICH, J., BAHLO, M., RUBIO, J. P., WILKINSON, C. R., THOMSON, R., BANKS, A., RING, M., FOOTE, S. J., AND SPEED, T. P. 2005. IDENTIFYING NINETEENTH CENTURY GENEALOGICAL LINKS FROM GENOTYPES. HUM GENET 117: 188-199.
SUN, L., WILDER, K., AND MCPEEK, M. S. 2002. ENHANCED PEDIGREE ERROR DETECTION. HUM HERED 54: 99-110.
THOMAS, A., CAMP, N. J., FARNHAM, J. M., ALLEN-BRADY, K., AND CANNON-ALBRIGHT, L. A. 2008. SHARED GENOMIC SEGMENT ANALYSIS. MAPPING DISEASE PREDISPOSITION GENES IN EXTENDED PEDIGREES USING SNP GENOTYPE ASSAYS. ANNALS OF HUMAN GENETICS 72: 279-287.
THOMAS, A., SKOLNICK, M. H., AND LEWIS, C. M. 1994. GENOMIC MISMATCH SCANNING IN PEDIGREES. MATHEMATICAL MEDICINE AND BIOLOGY 11: 1-16.
VOIGHT, B. F. AND PRITCHARD, J. K. 2005. CONFOUNDING FROM CRYPTIC RELATEDNESS IN CASE-CONTROL ASSOCIATION STUDIES. PLOS GENET 1: E32.
WEIR, B. S., ANDERSON, A. D., AND HEPLER, A. B. 2006. GENETIC RELATEDNESS ANALYSIS: MODERN DATA AND NEW CHALLENGES. NAT REV GENET 7: 771-780.
WITHERSPOON, D. J., HUFF, C. D., ZHANG, Y., WATKINS, W. S., SIMONSON, T. S., TUOHY, T. M., NEKLASON, D. W., BURT, R. W., GUTHERY, S. L., WOODWARD, S. R., AND JORDE, L. B. 2010. MAXIMUM LIKELIHOOD ESTIMATION OF RECENT ANCESTRY (ERA) BETWEEN PAIRS OF INDIVIDUALS USING HIGH-DENSITY SNP-GENOTYPING MICROARRAY DATA. AMERICAN SOCIETY OF HUMAN GENETICS 2010 ANNUAL MEETING, NOV. 5, 2010.
XING, J., WATKINS, W. S., SHLIEN, A., WALKER, E., HUFF, C. D., WITHERSPOON, D. J., ZHANG, Y., SIMONSON, T. S., WEISS, R. B., SCHIFFMAN, J. D. ET AL. 2010. TOWARD A MORE UNIFORM SAMPLING OF HUMAN GENETIC DIVERSITY: A SURVEY OF WORLDWIDE POPULATIONS BY HIGH-DENSITY GENOTYPING. GENOMICS 96: 199-210.
ZUPANIC PAJNIC, I., GORNJAK POGORELC, B., AND BALAZIC, J. 2010. MOLECULAR GENETIC IDENTIFICATION OF SKELETAL REMAINS FROM THE SECOND WORLD WAR KONFIN I MASS GRAVE IN SLOVENIA. INT J LEGAL MED 124: 307-317.
This application is a continuation of International Patent Application No. PCT/US2012/021573, filed on Jan. 17, 2012, entitled ESTIMATION OF RECENT SHARED ANCESTRY, which claims the benefit of and priority to U.S. Provisional Application No. 61/433,921, filed on Jan. 18, 2011, the entire content of each of which is incorporated by reference herein.
This invention was made with government support under K99 HG005846, R01 CA040641, N01 PC035141, P01CA073992, GM059290 and DK069513 awarded by the National Institutes of Health. The government has certain rights in this invention.
Provisional application:
Number | Date | Country
---|---|---
61433921 | Jan 2011 | US

Continuation data:
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/US2012/021573 | Jan 2012 | US
Child | 13943739 | | US