The present invention relates to a method for gene mapping from genotype and phenotype data, which method utilizes linkage disequilibrium between genetic markers mi, which are polymorphic nucleic acid or protein sequences or strings of single-nucleotide polymorphisms deriving from a chromosomal region.
The use of linkage disequilibrium (LD) in detecting disease genes has recently drawn much attention in genetic epidemiology. LD is evaluated with association analysis, which, when applied to disease-gene mapping, requires the comparison of allele or haplotype frequencies between the affected and the control individuals, under the assumption that a reasonable proportion of disease-associated chromosomes has been derived from a common ancestor. Traditional association analysis methods have long been used to test the involvement of candidate genes in diseases and, in special circumstances, to fine-map disease loci found by linkage methods. The testing has mostly been done using simple two-point measures.
Improved statistical methods to detect LD have been presented lately (Terwilliger 1995; Devlin et al. 1996; Lazzeroni 1998; McPeek and Strahs 1999; Service et al. 1999). The newer methods are based on statistical models of LD around a disease susceptibility (DS) gene. Genomic regions—rather than alleles—that are shared among affected individuals, are searched for. The recombination history from the common ancestor to the present day is taken into account with more or less simplified statistical models. The power of these methods, as well as their ability to localize the correct position of the DS gene, has been shown to be better than that of traditional methods. Some of the models are robust against high levels of etiologic heterogeneity (McPeek and Strahs 1999; Service et al. 1999). However, the methods contain assumptions about the inheritance model of the disease and the structure of the survey population, and the effects of violations of these assumptions in the real data are not known. In addition, they can only consider association of one region at a time. Thus, they are currently best suited for fine mapping rather than complex disease mapping or genome screening. The methods also tend to be computationally heavy.
The present inventors have recently introduced a so-called haplotype pattern mining (HPM) method (Toivonen et al. 2000a and 2000b). In the HPM method, haplotype patterns are ordered by their strength of association with the phenotype, and all haplotype patterns exceeding a given threshold level are used for prediction of disease susceptibility gene location. The advantage of the BPM method is that it is model-free as it does not require any assumptions about the inheritance model of the disease. The haplotype patterns are allowed to contain gaps and therefore the HPM method is quite robust against mutations and to missing and erroneous data. How-ever, the basis of the HPM method is that haplotypes, i.e. separate vectors of alleles of markers, are available. As will be explained below, this requirement causes various problems in gene mapping methods, and thus also in the HPM method.
Zhang et al. (2002) have extended the HPM method to allow simultaneous use of haplotype data of related individuals with quantitative trait from an extended pedigree. This is done by employing the Quantitative Pedigree Disequilibrium Test (QPDT) statistic to measure the strength of association between haplotype and a quantitative trait.
The standard procedure in association-based gene mapping is to 1) ascertain individuals carrying the trait of interest and their family members (at least parents), 2) genotype the individuals, 3) derive the haplotypes computationally using genotypes within families, and finally to 4) find associations in the haplotypes (gene mapping).
Even though the actual association analysis is done on sole case and control haplo-types, obtaining these haplotypes requires the parents of the affected individuals to be genotyped as well: vast majority of haplotyping programs available expect the parental genotypes to exist. This means that the parents first have to be recruited, which is not always straightforward, as they might no longer be alive, or cannot be reached, or refuse from giving blood samples. Genotyping more individuals is laborious and elevates the study expenses: per every case or control, 3 individuals will be genotyped instead of just one, so genotyping is done on 3 times as many persons as there are cases and controls. In case the non-transmitted parental chromosomes could be used as controls, a case and his/her parents contributes one case-control pair, in which case the genotyping effort is 1.5 times higher than the number of cases and controls needed.
As an alternative to these haplotyping approaches, some methods for direct haplo-typing from population-based data have indeed been presented, but the problem with these is that they still produce a lot of mistakes, which is a very bad starting point for any haplotype based association program.
There is no straightforward way to use genotypes as input for a method that is designed for haplotypes. Given a genotype, it is in principle possible to consider all haplotype configurations available from the genotype, and to run a haplotype gene mapping method on the different configurations of chromosomes. In practice, however, this is not feasible for marker maps of reasonable size due to a combinatorial explosion: given a genotype with N heterozygous markers, the number of different possible haplotype configurations is 2N-1 (or 1, if N=0). For instance, for N=100 the number of possible haplotype configurations is about 6*1029.
Zhang and Zhao (2002) have studied linkage disequilibrium mapping directly with genotype data. Their approach is model-based and the method is based on the decay of haplotype sharing (DHS) method for haplotype data developed by McPeek and Strahs (1999). The approach of Zhang and Zhao is based on explicitly considering all possible haplotype configurations. Since this is not feasible for marker maps of interesting sizes—as was described above—they apply complex and error-prone techniques to prune the number of haplotype configurations they need to consider. Further, in this method, data consisting of multiallele (microsatellite) loci only—thus, no SNPs (single nucleotide polymorphisms) or any other type of markers—is considered. In short, the method works as follows: it is supposed that there are two alleles in the disease locus: the disease causing allele D and the normal allele d. The basic idea is to treat the chromosomes in affected individuals as if they were a random sample from chromosome population consisting of both chromosomes with the D allele and d allele. Chromosomes in normal individuals are assumed to be a random sample of chromosome population only consisting of d chromosomes. Next, a likelihood for individual haplotypes is formulated in the same way as presented by McPeek and Strahs (1999) where the probability of an observed haplotype is dependent of number of generations from original disease mutation, recombination rates between markers, and mutation rate in marker loci. The gap between using haplotype data as the starting point (as in McPeek and Strahs) and genotype data (as Zhang and Zhao present) as the starting point is bridged with following inference: for each genotype gi, there are several haplotype pairs that are compatible with it (2N-1, where N is the number of heterozygote sites in the genotype). The likelihood for an observed genotype is the sum of probabilities of each possible haplotype pair, where the probabilities of individual haplotypes are formulated as above. The genetic parameters of interest (such as location of the disease locus, mutation rate and disease allele age) are then estimated by using EM algorithm. The large number of possible ancestral haplotypes prerequisites pruning out any too rare haplotypes; the haplotype frequencies are estimated with Markov model, and any which are below some prespecified level are left unconsidered.
The approach of Zhang and Zhao has the following serious drawbacks. First, the principle of Zhang and Zhao is to explicitly consider all possible haplotype configurations. This is feasible only with very small marker maps. Second, to avoid the first problem and to extend the applicability of the approach to larger maps, Zhang and Zhao apply additional pruning techniques to reduce the number of haplotype configurations they need to consider. However, those techniques are complex and error-prone. Third, their approach is based on summing probabilities of different haplo-type configurations. Such an approach is not directly applicable to pattern-based mapping methods such as HPM.
Curtis et al. (2001) studied the use of an artificial neural network to detect association between disease and multiple marker genotypes. The pattern-recognition properties of the network were used in the hope that marker haplotypes implicit in the genotypes differed between cases and controls in a way which led to the network being able to classify the subjects correctly, according to their marker genotype.
The object of the present invention is to provide a model-free and computationally effective method allowing direct association analysis on genotype rather than haplo-type data, which overcomes the above-mentioned drawbacks. The invention offers remarkable advantages by avoiding the technically difficult, costly and sometimes impossible steps of recruiting and genotyping family members, as well as by avoiding some of the error sources present in population-based haplotyping methods.
The above-mentioned object is achieved in accordance with the invention by the method for gene mapping from genotype and phenotype data, which utilizes linkage disequilibrium between genetic markers mi, which are polymorphic nucleic acid or protein sequences or strings of single-nucleotide polymorphisms deriving from a chromosomal region. The method according to the invention is characterized by the following steps:
A computer system according to the invention is programmed to perform the method of any embodiments of the invention.
As used herein the term ‘haplotype’ defines a vector of alleles in a single chromosome. Also, as used herein the term ‘genotype’ defines a vector of (unphased) allele pairs in a chromosome pair.
The term ‘microsatellite’ used defines a small run (usually less than 0.1 kb) of tandem repeats of a very simple DNA sequence, usually 1-4 bp, for example (CA)n. It has been used as the primary tool for genetic mapping during the 1990s. ‘Multiallelic genetic locus’ is a gene with high level of variation; there are several types of variants in the gene locus, each with reasonably high frequency. ‘SNP’, single nuleotide polymorphism, defines any polymorphic variation at a single nucleotide. Although less informative than microsatellites, SNPs are more amenable to large-scale automated scoring.
The object of the present invention is to provide a method for gene mapping from genotype and phenotype data, which utilizes linkage disequilibrium between genetic markers mi, which are polymorphic nucleic acid or protein sequences or strings of single-nucleotide polymorphisms deriving from a chromosomal region. The chromosome data may consist of genotypes or haplotypes. The phenotype being studied may also be a combination of several phenotypes.
The method according to the invention, also named as haplotype pattern mining in genotype data (HPM-G), uses data mining methods in LD-based gene mapping. The method uses both genotypes and haplotypes as input. In diseases with reasonable genetic contribution, affected individuals are likely to have higher frequencies of associated marker alleles near the DS gene than control individuals. Combinations of marker alleles which are more frequent in genotypes of affected individuals than in genotypes of unaffected individuals, are searched for in the data, without assumptions about the mode of inheritance of the disease. These combinations, marker patterns or haplotype patterns, are sorted by the strength of their association with the disease, and the resulting list of marker or haplotype patterns is used in localizing the DS gene. Terms marker pattern and haplotype pattern denote the same concept, and are used interchangeably in this text.
The method according to the present invention is an algorithm-based extension of traditional association analysis. It works with a non-parametric statistical model and without any genetic models. The localization power of the method of the invention is high, even in cases, where multiple independent founder mutations are allowed, and the frequency of the most common mutation in the affected chromosomes varies between 5-15% at realistic sample sizes (100 affected individuals and a similar number of population controls). In addition, the experiments suggest that the method is highly robust against missing data. Since HPM-G can handle high degrees of etiologic heterogeneity, it can be successful in complex disease mapping.
LD, the non-random association of marker alleles and haplotypes to the disease, is likely to be strongest around the DS gene; consequently the locus is likely to be where most of the strongest associations are. In the HPM-G method according to the invention, we search for shared, flexible haplotypes that may contain gaps, and find out, which ones are strongly associated with the disease status. We then use a non-parametric model for predicting the DS locus, on the basis of the locations of the haplotypes. Permutation tests can be used to contrast the results against the null hypothesis that there is no gene effect.
Marker or Haplotype Patterns and Disease Association
We examine linkage disequilibrium by looking for marker or haplotype patterns that consist of a set of nearby markers, not necessarily consecutive ones. Given a marker map M with k markers m1, . . . , mk, a “marker pattern” or “haplotype pattern” P on M is defined as a vector (p1, . . . , pk), where each pi is either an allele of marker mi or the “don't care” symbol (*). The haplotype pattern P occurs in a given haplotype vector (chromosome) H=(h1, . . . , hk) if pi=hi or pi=* for all i,1<=i<=k. Pattern P occurs in a given genotype G=({g11, g12}, . . . , {gk1,gk2}) if pi=gi1 or pi=gi2 or pi=* for all i,1<=i<=k.
For example, consider a marker map of 10 markers. The vector P1=(*, 2, 5, *, 3, *, *, *, *, *), where 1, 2, 3, . . . are marker alleles, is an example of a haplotype pattern. This pattern occurs, for instance, in a chromosome with haplotype (4, 2, 5, 1, 3, 2, 6, 4, 5, 3). The pattern also occurs in the genotype ({2,5}, {2,3}, {1,5}, {4,6}, {3,6}, {2,4}, {1,2}, {1,4}, {3,5}, {1,6}). (For instance, {2,5} is the genotype of marker 1; the alleles are 2 and 5, but their phases are not known.)
Our goal is to search for haplotype patterns that roughly correspond to haplotypes identical by descent in the disease-associated. In doing this, there are two major issues with respect to the shapes of haplotype patterns: the genetic length of the significant part of the patterns, and gaps. We define the “(genetic) length” of a haplo-type pattern P=(p1, . . . , pk) as the maximum distance, in Morgans, between any two markers mi,mj with pi≈*≈pj. Searching for haplotype patterns of arbitrary length hardly makes sense; it is unlikely that genetically extremely long patterns will be discovered, at least not in significant numbers. Consequently, when haplotype patterns are searched for, the maximum length of patterns to be considered can be constrained with an optional pattern-search parameter to the HPM-G method.
We allow for gaps in the marker patterns, since mutations, gene conversions, errors, missing data, and recombinations can corrupt continuous haplotypes. Marker mutations and errors typically cause very short gaps only. Missing information can span several consecutive markers, depending on the data collection scheme. Longer gaps can be introduced by double recombinations which, however, are rare on genetically short distances. In the HPM-G method, the maximum number and maximum length of gaps can be controlled with pattern search parameters.
Mining Disease-Associated Haplotype Patterns
All marker patterns P that satisfy a pattern evaluation function e(P) are searched from the data in the step (i) of the method of the invention. The pattern evaluation function e(P) involves some statistical measure of the association between the marker pattern P and the phenotype being studied. In step (ii), each marker mi of the data is scored by a marker score s(mi), which is a function of the set Si defined as the set of marker patterns overlapping the marker mi and satisfying the pattern evaluation function e as defined in step (i).
In step (i), let U be the universe of marker patterns considered in the study. The output S of this phase is the set of those marker patterns that satisfy e, i.e., S=(P ε U|e(P) is true}.
In step (ii), for each marker mi in the data, let Si={P ε U|P makes a reference to mi, or to a marker to the left of and to a marker to the right of mi} be the set of patterns in S that overlap with the marker mi. In this phase each marker mi is scored as a function of Si, and the result is s(mi).
In step (iii), the location of the gene is predicted as a function of the scores s(mi) of all the markers mi in the data. This function returns an area where the gene is likely to be. The area can be contiguous or fragmented, and it can be a point in a special case.
The marker or haplotype patterns P can be searched for by an algorithm developed by the inventors for this purpose or by the levelwise search method described in Mannila and Toivonen (1997). Preferred algorithms are given in the following.
Version 1 of the Algorithm for Marker Pattern Searching
The following algorithm is a simple, generic, and efficient way to implement step (i) of the method according to the invention. It is based on depth-first search in the space of patterns, a standard procedure in computer science. A pre-requisite is that 10 there is a suitable generalization relation for the patterns, such that if a pattern satisfies the evaluation function, then all more general patterns also satisfy it.
Input
Output
Method
Version 2 of the Algorithm for Marker Pattern Searching
The following algorithm is a simple, generic, and efficient way to implement step (i) of the method according to the invention. It is based on depth-first search in the space of patterns, a standard procedure in computer science. It approximates the exact answer by ignoring infrequent and therefore statistically less important patterns.
Define an auxiliary evaluation function ae(P) which is true if and only if the frequency of pattern P exceeds a given threshold x, (how to specify the threshold is discussed elsewhere) and replace the original evaluation function e(P) by function e′(P) defined as e′(P)=true if and only if e(P) is true and ae(P) is true. This refinement prunes patterns that satisfy e but are no more frequent than x. Such infrequent patterns are statistically not relevant, and therefore little information is lost when they are ignored. Now a suitable generalization relation is obtained from logical implication based on the pattern syntax: P<P′ if and only if P′−>P.
The algorithm uses the generalization relation based on logical implication to structure the search space, and the auxiliary function ae to prune the search space. All patterns satisfying ae are searched for, but only those also satisfying e are output.
Input
Output
Method
Version 3 of the Algorithm for Marker Pattern Searching
When phenotype being studied is qualitative and the pattern evaluation function e(P) has the form e(P)=true if and only if e′(P)>x, where e′(P) is the (signed) association measure χ2 and x is a user-specified minimum value, which is chosen so that the sizes of Si are large enough, such as 7, to give statistically sufficiently reliable estimates for the gene locus, the following algorithm is a simple, generic, and efficient way to implement step (i) of the method according to the invention. It is based on depth-first search in the syntactic space of patterns. It derives a lower bound lb for pattern frequency from the given lower bound x for chi-squared test, and uses lb to prune the search.
Input
Output
Method
Version 4 of the Algorithm for Marker Pattern Searching
The following algorithm is a simple, generic, and efficient way to implement step (i) of the method according to the invention. It is based on the levelwise search method described in Mannila and Toivonen (1997).
Input
Output
Definitions
Method
Version 5 of the Algorithm for Marker Pattern Searching
This is the levelwise search version of the algorithm 2.
Input
Output
Definitions
Method
The phenotype being studied may be qualitative, for example disease is present or disease is absent. In that case, the pattern evaluation function e(P) may have the form e(P)=true if and only if e′(P)>x, where e′(P) is the (signed) association measure χ2 and x is a user-specified minimum value, which is chosen so that the sizes of Si are large enough, such as 7, to give statistically sufficiently reliable estimates for the gene locus and the score s(mi) of marker mi is the size of Si, also called marker-wise pattern frequency of mi and denoted by f(mi).
As mentioned above, the (signed) χ2 is a measure of marker-disease association. A signed version of the measure is used in order to discriminate disease association from control association. The signed χ2 measure ±χ2(P) of a haplotype pattern P is positive if P is more frequent in cases than in controls, and negative otherwise. Given a “(positive) association threshold” x, we say that P is “strongly associated” with the disease if ±χ2(P)≧x.
The first part of the HPM-G method can be described as follows. Given the data—markers M, genotypes H, and phenotypes Y—the task is to output all haplotype patterns P that are strongly associated with the disease status for a given value of the association threshold x. We denote the collection of all such haplotype patterns by S—that is, S={P is a haplotype pattern on M|±χ2(P)≧x}. If pattern parameters are specified—a maximum genetic length, a maximum number of gaps, or a maximum length for gaps—the task is refined by requiring that these additional restrictions are also fulfilled.
The signed χ2 value is calculated from a 2×2 contingency table, where the rows correspond to the trait-association statuses of the persons, and the columns correspond to the presence and absence of the haplotype pattern. A pattern P=(p1, . . . , pk) is present in a given genotype G=({g11, g12}, . . . , {gk1,gk2)) if pi=gi1 or pi=gi2 or pi=* for all i, 1<=i<=k. If, instead of a genotype, two haplotype vectors H1=(h1l, . . . , h1k) and H2=(h2l, . . . , h2k) are given for a person, pattern P is considered to be present in the person if it is present in either of the haplotypes, i.e., if either pi=h1i or pi=* for all i,1<=i<=k, or pi=h2i or p1=* for all i,1<=i<=k.
The value of χ2 statistic is computed normally, and a negative sign is attached, if the relative frequency of the haplotype pattern among the control persons is higher than among the trait-associated persons.
The first observation in solving the pattern-mining task is that given an association threshold x, a lower bound can be derived for the frequency of strongly associated haplotype patterns as follows:
Given a 2×2 contingency table of the numbers of disease-associated (A) and control (C) persons either matching a pattern (P) or not (N), the χ2 test statistic for the disease association of the pattern is defined by
where πij is the number of persons with properties i and j, πi the number of persons with property i, and π the total number of persons. Given the number of affected persons (πA), the number of control persons (πC), and a lower bound x for the test statistic, we can derive a lower bound for the pattern frequency among the affected persons (πAP) as follows. Assuming the pattern is disease-associated, we have πAP·πCN>πAN·πCP. The test statistic is maximized when πCP=0, implying πAP=πP and πCN=πC. Then
The situation is symmetric for protective haplotypes, and the lower bound for πCP is obtained by simply swapping πA and πC in the above result. If disease-associated and protective haplotypes are searched for at the same time, the smaller of πAP and πCP can be used as a lower bound for πP, making the implementation slightly simpler.
On another hand, given such a frequency threshold, all patterns exceeding the threshold can be enumerated efficiently with data-mining algorithms or a standard depth-first search method. An algorithm that first finds all haplotype patterns whose frequency exceeds the computed lower bound and then evaluates the association measure on them, is guaranteed to find the exact set of strongly disease-associated patterns.
The approach is suitable for finding protective haplotype patterns by considering patterns P with ±χ2(P)≦−x. The derivation of the lower bound for the frequency among controls is identical to the case above. Obviously, both disease-associated and protective haplotypes can be found when |±χ2(P)|≧x.
In another embodiment according to the invention, the phenotype being studied may be, in addition to qualitative, also quantitative, for example a measured blood concentration of a substance has a certain value. In that case, the pattern evaluation function e(P) may have the form e(P)=true if and only if e′(P)>x, where e′(P) is the absolute frequency of pattern P in the data and x is a user-specified value, which is chosen so that the sizes of Si are large enough, such as 20, to give statistically sufficently reliable estimates for the gene locus. In this embodiment, the statistical strength of the method may be still increased.
A linear model is of form Y=β1X1+ . . . +βkXk+αZ+β0, where the dependent variable Y is a quantitative phenotype, X1 through Xk are covariates, such as enviromental factors, and Z is a dummy variable for the occurrence of the haplotype pattern. Firstly, the coefficients α and β* are adjusted for best fit. Secondly, the significance of Z as a covariate is assessed by using a t test. If the phenotype is dichotomous, then the logit transformation can be applied.
Marker Scoring in the Case of Qualitative Phenotype Being Studied
Haplotype patterns close to the DS locus are likely to have stronger association than haplotypes further away; consequently the locus is likely to be where most of the strongest associations are. In one embodiment according to the invention, the marker score s(mi) is defined as the marker frequency f(mi) of marker mi (with respect to M,H,Y,x) as the number of patterns that contain marker mi, possibly mi in a gap: f(mi)=|{P=(p1, . . . , pk)εS | there exist t≦i and u≧i such that pt≈*≈pu}|. The idea is that each haplotype pattern roughly corresponds to a continuous chromosomal region, potentially identical by descent, where gaps allow for corruption of marker data. While markers within gaps are not used in measuring the disease association of the pattern, the whole chromosomal region of the pattern is thought to be relevant.
Marker Scoring in the Case of Qualitative or Quantitative Phenotype Being Studied
In order to derive the score s(mi), the p value (statistical significance) of each marker pattern P in determining the phenotype being studied is evaluated, and the score s(mi) is the distance between the observed p value distribution of patterns in Si and the uniform distribution, defined as average of (pi−qi) log (pi/qi) over all i=1 . . . n, where n is the number of haplotype patterns in Si, pi is the ith smallest p value in Si, and qi is the expectation of the ith smallest p value, if the p values were randomly drawn from the uniform distribution.
Gene Localization
The location of the gene, predicted as a function of the scores s(mi) and based on maximizing or minimizing the score, is predicted to
The location of the gene may also be determined by expert investigation of the marker scores or their visualization e.g. as a curve.
Permutation Tests
More information about the significance of the observed scores may be obtained by permutation tests. The results obtained by considering the marker frequencies or the linear model, as explained earlier, can be contrasted against the null hypothesis that all the persons are drawn from the same distribution; that is, there is no gene effect in the disease status. We propose to permute randomly the status fields of the persons, keeping the proportions of affected and control persons constant, in a fashion similar to the method of Churchill and Doerge (1994). We approximate marker-wise p values using permutations and then predict the DS gene to be in the vicinity of the marker with the smallest empirical p value. Consecutive markers are dependent, and thus a large number of mutually dependent p values are produced. This is not a problem, since we do not use the p values for hypothesis testing, but only for ranking markers.
Marker-wise p values are used to re-score markers by their statistical unexpectedness. The test is carried out as follows: The phenotypes of the persons are randomly shuffled a number (thousands) of times. The scores are re-calculated for each per-mutation mutation in turn. Marker-wise p value p(mi) is the proportion of such permutation scores for marker mi that are larger than or equal to the non-permuted score.
Each score s(mi) is then refined by replacing it by the marker-wise p value p(mi) of the score s(mi).
Searching Several Genes
Several genes may be searched for simultaneously by using marker patterns that refer to several potential gene loci at the same time.
Certain embodiments and results of the present invention are described in the following non-limiting examples.
We evaluated the performance of the proposed HPM-G method with simulated data sets that correspond to a recently founded, relatively isolated founder subpopulation. Simulation of a population isolate was chosen, since it is recommended as the study population for LD studies. However, the method can be applied to any population that is suitable for LD analysis, since no assumptions are made about the population structure.
An isolated founder population, which grows from the initial size of 200 to 100,000 individuals in 20 generations, was simulated.
The population pedigree was first generated assuming distinct generations and ex-ponential growth of the population size. In each generation, the parents of the newborn individuals were randomly selected from members of the previous generation, with the exception that whenever a parent with at least one child was chosen, his/her spouse was always forced to become the other parent of the child. This procedure generates family structure into each generation.
In the simulation of inheritance, each member of the first generation was assigned to have one pair of homologous chromosomes. The genetic length of the chromosomes was 100 cM for both males and females. Meiosis was repeatedly simulated, and in each meiosis the number of crossover points was taken from a Poisson distribution with parameter value 1, which corresponds to the total genetic length of the chromosome. No chiasm interference was modeled. To accommodate the fact that ever-increasing informativeness of marker maps may soon facilitate whole-genome LD mapping, we used relatively dense and informative marker maps with inter-marker intervals of exactly one cM. Each marker contained 4 alleles, whose frequencies in the founder population were 0.4 for one allele and 0.2 for the remaining three alleles. The polymorphism information content (PIC) of each marker was thus fixed at 0.678.
To produce each of the 100 data sets for HPM-G and HPM, the processes of disease locus selection, diagnosing and sampling were done independently. Next, these processes are described.
For each data set, a random locus was selected as a disease locus, and 8 random chromosomes present in the original population were labeled as disease-carrying chromosomes. In the final population, all chromosomes that had inherited the disease locus identical-by-descent from one of the eight founders were considered to carry a disease-causing mutation.
In the diagnosing phase, we used a liability-based model, where an individual's probability of becoming affected depends on two factors: the presence of a disease mutation and a normally distributed random component. The random component is thought to contain factors such as environmental effects and effects of other, un-known genes. The liability of an individual is defined as L=5x1+x2+C, where indicator variable x1 indicates the presence of any of the disease-causing mutations, and variable x2 is randomly sampled from standard normal distribution. Based on the generated segment data, the value of constant C is set in such a way that the desired population prevalence of 5 per cent is reached. The liability value L defines the probability of an individual to be affected, denoted by p, via formula
Having discovered the affection status of each individual, the desired number of individuals labeled as affected was randomly selected to constitute the affected sample.
The control samples were created using two different methods: for HPM-G that utilizes genotype data, the control individuals were simply taken randomly from the entire population. To do this, the sampling process described above was repeated, but this time the liability of each individual was purely random, including no genetic component.
For the original HPM that requires haplotype data, we used a more laborious sampling method: the genotypes of the parents of the affected individuals were collected to create family-based pseudocontrol chromosomes. This was done in practice by taking the alleles in the non-transmitted chromosomal segments of the parents of each affected individual and labeling them as control chromosomes. In reality, this is a common practice. In the simulations we treated the haplotypes obtained from the simulator as given, which corresponds to error-free haplotyping, and is expected to slightly favor HPM in the comparisons.
The simulation of missing data was based on the notion that in real genotyping laboratories there seems to be two types of clustering of missing data. First, missing genotypes tend to cluster to certain individuals, which can be a consequence of low quality samples. Second, certain markers may function poorly, likely producing missing genotypes. To mimic such clustering of missing data, we defined two parameters: parameter α corresponds to the amount of missing data that clusters to individuals and parameter β to the amount that clusters to markers. The missing genotypes were selected using the following procedure:
For each individual i, a personal missing genotype probability xi1 was computed as the x value of the first random point in (x, y) plane (x, yε[0,1]) that satisfies the inequality y<1/eαx. Having computed the value of variable xi1 for the individual, each of his/her genotypes was then labeled as missing with probability xi1. In the second phase, the procedure was repeated for each marker. For each marker j, a marker failure probability xjM was computed in an analogous fashion as the x value of the first random point in (x, y) plane (x, yε[0,1]) that satisfies the inequality y<1/eβx and each genotype corresponding to that marker was labeled as missing independently for each individual with probability xjM.
Values of variables α and β were empirically adjusted to produce the desired over-all levels of missing data. These values were: 25 and 80 for 5%, and 13 and 40 for 10% of missing data.
The localization accuracy was explored by plotting curves similar to power graphs:
the height of the curve shows the fraction of data sets for which the localization was successful, as a function of the length of the predicted region. The sample consisted of 150 affected and 150 control genotypes. The maximum length of a pattern was 7, and one gap of one marker was allowed. The association threshold was set to 10.
These numbers were based on experimentation. For comparison, we also show the corresponding curve for HPM with ⅓ smaller sample size, and thus equal genotyping cost (
The results show that HPM-G has a high accuracy, and that it is extremely competitive even in comparison to state-of-the-art methods that use explicitly haplotyped data.
The effect of sample size was examined by experimenting with sample sizes of 100+100, 150+150, 200+200 and 300+300 people (
HPM-G performs well even with only 100+100 genotypes. On the other hand, if the amount of data is increased, the accuracy is improved.
The influence of missing data was explored by randomly removing 5% or 10% of marker genotypes (
These results show that HPM-G is very robust against missing data.
Permutation tests were used to obtain more information about the significance of observed marker frequencies. Marker-wise P values were used to sort markers by their statistical unexpectedness, not to test the statistical significance of the findings. We performed the following experiment in order to see if the prediction accuracy can be improved by permutation tests. We predicted the location of the DS gene to be at the marker with the smallest P value instead of the most frequent marker. The localization accuracy with 100 permutations compared to that without permutations is shown in
The situation could be different with real marker data, where permutation tests are likely to bring a greater benefit.
Number | Date | Country | Kind |
---|---|---|---|
20020651 | Apr 2002 | FI | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/FI03/00248 | 4/1/2003 | WO | 5/10/2005 |