1. Field of the Invention
The present invention relates to a method of estimating a penetrance by using genotype data and phenotype data observed in a predetermined population. The present invention also relates to a method of estimating the probability of a new particular individual developing a phenotype by using genotype data on the individual and by using a penetrance (estimated value) obtained by using the penetrance estimation method.
2. Background Art
Analysis of polymorphism based on linkage disequilibrium (LD) and haplotype structure is becoming increasingly important. A combination of a plurality of linked alleles existing in one gamete is defined here as a haplotype. Also, non-independence of different alleles is defined as linkage disequilibrium. An allele represents a polymorphism in a one of a pair of chromosomes in terms of molecular biology.
Recent studies analyzing data on multiple linked loci (alleles) in many individuals have clarified the structure of the human genome from the viewpoint of the population genetics. That is, it has been made clear that the human genome has a haplotype block (or LD block) structure. Within a block, LD is strong and the number of major haplotypes is limited. One can extract tag SNPs (single nucleotide polymorphism, single base substituted) from the SNPs within the block and use them for association studies. That is, the haplotype structure is very useful for fine mapping of traits.
Haplotypes are also useful for association of genetic information about humans with various phenotypes. A phenotype is often associated not only with each SNP but also with haplotypes. We can interpret haplotypes as complete data and genotypes (e.g., SNP genotypes) as incomplete data. It is because we can extract all genotype data from haplotype data, while the reverse is not the case.
In fact, we can redefine an allele at a locus as a set of multiple haplotypes that have the allele at the locus. Therefore, it is more general to consider the association between a polymorphism and a phenotype based on a haplotype or a diplotype configuration (a combination of haplotypes) rather than based on allele or genotypes. Many studies recently made have suggested that phenotypes such as diabetes and reaction to drugs are associated with haplotypes or diplotype configurations rather than genotypes such as SNP.
However, a haplotype or a diplotype configuration of a subject cannot easily be observed although instances of experimental direct determination of haplotypes have been reported. Haplotypes are inferred by various algorithms using multilocus genotype data instead of being directly observed. That is, Clark's algorithm (see non-patent document 1), the expectation-maximization (EM) algorithm (see non-patent documents 2 to 6), the PHASE algorithm, the PL-EM algorithm, and so on have been proposed.
In genome-based association analysis based on haplotypes, haplotype frequencies are estimated by one of the above-mentioned algorithms. The haplotype frequencies are then compared based on the estimated values among case and control groups. Zaykin et al. have published an algorithm for analysis of the association between haplotype frequencies and traits based on an regression-based approach even in the presence of diplotype configurations not concentrated with respect to discrete or continuous traits (see non-patent document 7). Fallin et al. have also proposed a method of testing the associations between estimated haplotype frequencies and traits on the basis of case/control sampling design (see non-patent document 8).
However, when phenotypes at the level of individuals are brought into focus, they ought be based on diplotype configurations rather than the haplotypes. This relationship is equivalent to that between genotypes and alleles at a locus. In examining the association between a phenotype and genetic information in some cases, therefore, comparing the proportions of affected subjects among individuals with different diplotype configurations is more effective than comparing the haplotype frequencies among different phenotypes.
To compare the proportions of affected subjects among individuals with different diplotype configurations, the diplotype configuration of each individual is inferred by using one of the above-mentioned algorithms and all the individuals are classified according to the presence or absence of a haplotype or a diplotype configuration. The affected and non-affected subjects in populations classified in this manner are further classified and a test as to the independence is thereafter carried out.
However, the problem is that, at least in some individuals, diplotype configurations are not unequivocally determined (only ambiguous diplotype configurations are obtained). The degree of type I error due to classification of individuals according to inferred haplotype information used instead of real haplotype information cannot be distinctly ascertained.
[Non-Patent Document 1]
In view of the above-described problem, an object of the present invention is to provide an algorithm for estimating a diplotype-based penetrance as well as haplotype frequencies and diplotype configurations on the basis of observed genotype and phenotype data. Another object of the present invention is to provide an algorithm for estimating the probability of development of a predetermined phenotype in a tested individual on the basis of genotype data on the individual.
An algorithm in accordance with the present invention enables maximum likelihood estimation of haplotype frequencies in a population, diplotype distributions of individuals (posterior probability distributions of diplotype configurations) and penetrances when genotype data and phenotype data are given, while eliminating the need for definitely determining diplotype configurations of the individuals. If the algorithm in accordance with the present invention is used, for example, the association between the existence of a haplotype and one phenotype can be tested by using genotype data and phenotype data obtained as a result of a cohort study or a clinical trial. The inventors of the present invention studied the effectiveness of the algorithm by using both simulation and real data, found that the algorithm is advantageously effective in analysis of genotype data and phenotype data in cohort studies and clinical trials, and completed the present invention. Further, the inventors of the present invention found that the algorithm can also be applied to analysis of genotype data and phenotype data obtained as a result of a case-control study.
That is, the present invention comprises methods and so on described below.
(1) A penetrance estimation method comprising:
(2) The penetrance estimation method described in (1), wherein genotype data and phenotype data obtained as a result of a cohort study or a clinical trial and observed in a predetermined population are used, wherein the step a, the maximum likelihood (Lmax) is obtained by maximizing the following expression (I) over Θ, q+ and q−:
and the maximum likelihood (L0max) is obtained by maximizing the following expression (II) over Θ and q0:
(in expressions (I) and (II), Θ denotes a vector of the haplotype frequencies; q+ denotes the probability that a phenotype ψ+ results under
(3) The penetrance estimation method described in (1), wherein genotype data and phenotype data obtained as a result of a case-control study are used wherein the step a, the maximum likelihood (Lmax) is obtained by maximizing the following expression (I) over Θ, r+ and r−:
the maximum likelihood (L0max) is obtained by maximizing the following expression (II) over Θ and r0:
(in expressions (I) and (II), Θ denotes a vector of the haplotype frequencies; r+ denotes an estimate under
(4) In the penetrance estimation method described in (1), wherein a set of haplotypes to be tested is defined in such a limiting manner that loci for giving information for discrimination between individual haplotypes and loci having redundant information due to combinations of the loci for giving the discrimination information are determined on the basis of the haplotype frequencies used as a parameter, and the determined loci are masked.
(5) A method of testing a relationship between a diplotype configuration and a phenotype, comprisng:
(6) The method of testing a relationship between a diplotype configuration and a phenotype described in (5), wherein genotype data and phenotype data obtained as a result of a cohort study or a clinical trial and observed in a predetermined population are used wherein the step a, the maximum likelihood (Lmax) is obtained by maximizing the following expression (I) over Θ, q+ and q−:
and the maximum likelihood (L0max) is obtained by maximizing the following expression (II) over Θ and q0:
(in expressions (I) and (II), Θ denotes a vector of the haplotype frequencies; q+ denotes the probability that a phenotype ψ+ results under
(7) The method of estimating the probability of development of a phenotype, described in (9), wherein, in said step b, −2log(Lmax/L0max) is obtained as a statistic (where log denotes natural logarithm), and wherein because the static asymptotically follows the λ2 distribution with the degree of freedom of 1 in a case where the diplotype configuration and the phenotype are independent of each other, it is determined that it cannot be said that there is an association between the predetermined diplotype configurations and the predetermined phenotype, when the statistic does not exceed a limit value λ2< (which is a value at which a cumulative distribution function is 1-α in λ2 distribution with the degree of freedom of 1, where α denotes the risk rate of the test), and it is determined that there is an association between the predetermined diplotype configurations and the predetermined phenotype, when the statistic exceeds the limit value λ2<.
(8) The method of testing a relationship between a diplotype configuration and a phenotype, described in (5), wherein genotype data and phenotype data obtained as a result of a case-control study are used. In the above-described step a, the maximum likelihood (Lmax) is obtained by maximizing the following expression (I) over Θ, r+ and r−:
and the maximum likelihood (L0max) is obtained by maximizing the following expression (II) over Θ and r0:
(in expressions (I) and (II), Θ denotes a vector of the haplotype frequencies; r+ denotes an estimate under
(9) The method of testing a relationship between a diplotype configuration and a phenotype, described in (5), wherein a set of haplotypes to be tested is defined in such a limiting manner that loci for giving information for discrimination between individual haplotypes and loci having redundant information due to combinations of the loci having the discrimination information are determined on the basis of the haplotype frequencies used as a parameter, and the determined loci are masked.
(10) A method of estimating the probability of development of a phenotype, comprising:
(11) The method of estimating the probability of development of a phenotype, described in (10), wherein genotype data and phenotype data obtained as a result of a cohort study or a clinical trial and observed in a predetermined population are used. In the above-described step a, the maximum likelihood (Lmax) is obtained by maximizing the following expression (I) over Θ, q+ and q−:
and the maximum likelihood (L0max) is obtained by maximizing the following expression (II) over Θ and q0:
(in expressions (I) and (II), Θ denotes a vector of the haplotype frequencies; q+ denotes the probability that a phenotype ψ+ results under
(12) The method of estimating the probability of development of a phenotype, described in (10), wherein genotype data and phenotype data obtained as a result of a case-control study are used, wherein the step a, the maximum likelihood (Lmax) is obtained by maximizing the following expression (I) over Θ, r+ and r−:
and the maximum likelihood (L0max) is obtained by maximizing the following expression (II) over Θ and r0:
(in expressions (I) and (II), Θ denotes a vector of the haplotype frequencies; r+ denotes an estimate under
(13) The method of estimating the probability of development of a phenotype, described in (10), wherein a set of haplotypes to be tested is defined in such a limiting manner that loci for giving information for discrimination between individual haplotypes and loci having redundant information due to combinations of the loci having the discrimination information are determined on the basis of the haplotype frequencies used as a parameter, and the determined loci are masked.
(14) The method of estimating the probability of development of a phenotype, described in (10), wherein, in said step b, the probability is obtained by the following equation (III):
(In equation (III), the tested individual is shown as the N+1th individual, gN+1 denotes a genotype of the observed individual, and
The present invention also comprises a program for enabling a computer to carry out any of the above-described methods (1) to (14). That is, each of the above-described methods (1) to (14) can be implemented as a program for enabling a computer having pieces of hardware such as a controller, a transmitter/receiver and a storage device to execute each step.
The present invention will be described below in detail.
The penetrance estimation method and the phenotype development probability estimation method in accordance with the present invention are implemented by means of an algorithm described below (hereinafter referred to as “the present algorithm”). In use of this algorithm, genotype data and phenotype data observed in a predetermined population obtained as a result of a cohort study or a clinical trial or genotype data and phenotype data observed in a predetermined population obtained as a result of a case control study can be used.
1. Application to Cohort Study or Clinical Trial
Description will be first made of a method in which the association between the existence of a phenotype and the existence of a haplotype is tested by using genotype data and phenotype data on individuals obtained by a cohort study or a clinical trial to estimated a penetrance on the basis of the haplotype. This algorithm is formed on the basis of an EM algorithm.
By this algorithm, the frequency of a haplotype in a population and penetrances differing between individuals having the haplotype and other individuals not having the haplotype can be estimated. Therefore, this algorithm also enables maximum-likelihood estimation of a relative risk. More specifically, the maximum likelihood (L0max) under a hypothesis that there is no association between a phenotype and a haplotype (the null hypothesis; there is one penetrance) and the maximum likelihood (Lmax) under a hypothesis that there is an association (the alternative hypothesis; there are two penetrances) are first calculated. Next, in this algorithm, a statistic, e.g., −2log(L0max/Lmax) (hereinafter referred to simply as “statistic”) is computed. The association between the phenotype and the haplotype is tested on the basis of this statistic.
This algorithm can be applied to a method of estimating the probability of development of a phenotype from genotype information on a specimen. That is, this algorithm enables estimation of the probability of development of a predetermined phenotype in a specimen on the basis of genotype information on the specimen. Therefore, this algorithm useful for analysis of the association between a genetic factor and reaction of an individual to a drug. This algorithm can be executed on a computer by being implemented in the computer program PENHAPLO. The computer has a central processing unit (CPU) (control means) for overall control on the operation, a keyboard and a mouse capable of inputting an instruction for executing a program and so on, an input means for obtaining various sorts of data stored on a storage medium (database) or the like, a display means such as a display device, a memory in which temporary information, a program and so on are recorded, and a storage means such as a hard disk on which various sorts of data, a program and so on are stored. The computer may be connected to an external database, another computer and the like via a communication network such as the Internet.
For execution of this algorithm on the computer, the computer program PENHAPLO is installed in the computer. The computer can execute this algorithm in accordance with the computer program PENHAPLO under the control of the CPU. Genotype data and phenotype data may be input, for example, via a communication network such as the Internet, or from a storage medium on which genotype data and/or phenotype data is stored.
That is, the present algorithm enables the computer to function as an input means for obtaining genotype data and phenotype data and as a control means (computation means) for obtaining the maximum-likelihood estimate and the maximum likelihood (Lmax) of haplotype frequencies and penetrances by using the genotype and phenotype data obtained by the input means. The present algorithm also enables the computer to function so as to obtain the penetrance from the maximum-likelihood estimate by means of the control means. Further, the present algorithm enables the computer to function so as to obtain, by means of the control means, the likelihood ratio from the maximum likelihood (L0max) and the maximum likelihood (Lmax) and to test, with reference to the λ2 distribution, a hypothesis that a predetermined diplotype configurations and a predetermined phenotype are associated with each other. Further, the present algorithm enables the computer to function so as to obtain the probability that any of individuals to be tested develops the predetermined phenotype by means of the control means using the maximum likelihood estimate and the genotype data on the individuals.
The genotype data comprises information denoting the position of a polymorphism obtained as a result of execution of so-called SNP typing or the like on a certain individual and information denoting the kind of the polymorphism. The genotype data may be made anonymous by removing information identifying the individual.
The phenotype data comprises binary data denoting whether or not a certain individual has a predetermined phenotype. The particular phenotype may be, for example, the existence/nonexistence of a drug action or a side effect, or an affected/non-affected state checked by a clinical test and diagnosis or the like.
Algorithm
The present algorithm will be described in detail.
<Sample Space>
Let us assume that there are 1 linked SNP loci. The number of all the possible haplotypes is L=21. We set up a collection of infinite number of haplotype copies. The haplotype frequencies in the collection are Θ=(θ1, . . . ,θj, . . . , θL), where θj is the frequency of jth haplotype, and
The subset of Hall is not limited to H+. H1 described below may be defined. H1 is defined as a subset of Hall in such a manner that loci for giving information for discrimination between individual haplotypes and loci having redundant information due to combinations of the loci having the discrimination information are determined on the basis of a haplotype distribution inferred by the EM algorithm, and the determined loci are masked.
This masking means concealing the information at particular one or more of the plurality of loci constituting the haplotypes by assuming that any polymorphism can apply to the particular locus or loci. Thus, the concept of an incomplete haplotype having its portion concealed by masking is expressed as a set H1 including a plurality of haplotypes as constituents. From the definition of H1,
H1⊂Hall
is apparent. Further, if an incomplete haplotype is constructed by using information on only one locus and by masking the other loci, only SNP information on the used locus is used. This incomplete haplotype is therefore equivalent to SNP. If haplotypes formed in this manner are defined as a set HSNP, HSNP is a particular case of H1, and
H1⊂HSNP⊂Hall
The rationality of using incomplete haplotypes H1 in place of H+ is as described below.
1) In a case where a gene polymorphism is expressed by haplotypes, the cause of the polymorphism comprises base substitution and recombination. In a case where a haplotype in a certain region is associated with a cause locus associated with a certain phenotype, a plurality of haplotypes may be associated with the cause locus. This association is due to a mutation or recombination caused after the occurrence of the cause locus. If incomplete haplotypes are constructed, a mutation can be expressed as masking at a particular locus and recombination can be expressed as masking at consecutive loci.
2) When incomplete haplotypes are constructed by using SNP information on L loci, the loci to be masked are changed from 0 to L−1 to enable haplotypes from those using all the information items to SNP to be included in objects to be tested by the present algorithm.
If all incomplete haplotypes H1 are constructed by using SNP information on the L loci, three information items consisting of two alleles and masking operation is used with respect to each locus and, to put it simply, there is a need to consider 3L−1 combinations. The number of combinations is excessively large even in comparison with the number of combinations in haplotype estimation is 2L. However, the number of haplotypes is ordinarily smaller than 10 in a region where linkage disequilibrium is strong, and it is not necessary to simply construct all the possible combinations. Therefore, incomplete haplotypes H1 may be constructed in accordance with an algorithm having the following steps 1 to 3.
1. Haplotype estimation based on the EM algorithm is performed.
2. From a haplotype distribution thereby estimated, loci for giving information for discrimination between individual haplotypes are extracted. From the loci thereby extracted, those having redundant information due to combinations are removed to determine haplotype tags SNP(hereinafter referred to as htSNP).
3. To the htSNP loci, three information items consisting of two alleles and a masking operation are successively applied to construct incomplete haplotypes H1. There is a possibility of identical haplotypes H1 being constructed even by different masking methods. A redundant haplotype constructed in such a case is also removed.
Description will be made with respect to a case where H+is provided as an object to be tested. However, the same objects to be tested as the above-described incomplete haplotypes H1 may alternatively be provided.
Let D+ denote a set of diplotype configurations that contain a member or members of H+. Let q+ denote the probability that the ith individual develops ψ+ when
That is, if ψi denotes the phenotype of ith subject,
P(ψi=ψ+|diεD+)=q+
P(ψi=ψ+|di∉D+)=q−
Note that Θ and q+, q− are independent.
The present algorithm is different from the conventional EM algorithm in that the process of developing a phenotype is included in the present algorithm. In the present algorithm, the parameters such as q+ and q− are also included in the probability apace in addition to Θ. Note that ψi is independent of Θ conditionally on di, particularly in this algorithm.
<Likelihood Function>
The observed data used in the present algorithm are genotype data and phenotype data on individuals. Let Gobs=(g1, g2, . . . , gN) and ψobs=(w1, w2, . . . , wN) denote the vectors of the observed genotypes and phenotypes, respectively, where gi and wi denote the observed genotype and phenotype of the ith individual. Then the likelihood function is
where Ai denotes the set of ak for the ith individual that is consistent with gi.
Since di is independent of q+, q− and ψi is independent of Θ conditionally on di,
where Ai denotes the set of ak for the ith individual that is consistent with gi.
For any i and k,
Under the null hypothesis that the phenotype is independent of the diplotype configuration concerning the loci examined, the likelihood function is
where q0 denotes the penetrance for all the diplotype configurations, and Ai denotes the set of ak for the ith individual that is consistent with gi.
For any i and k,
<EM Algorithm>
Expression (1) shown above is maximized over Θ, q+ and q−, and the maximum likelihood thereby obtained is calculated as Lmax. In the present algorithm, expression (2) is also maximized over Θ and q0, and the maximum likelihood thereby obtained is calculated as L0max. The likelihood ratio L0max/Lmax is used to test the association between the existence of the haplotype and the phenotype.
In the maximization for Lmax, the parameters to be estimated are Θ=(θ1, θ2, . . . ,θL), q+ and q−, while in the maximization for L0max, the parameters to be estimated are Θ=(θ1, θ2, . . . , θL), q0. The space spanned in the maximization for L0max is a subspace of that spanned in the maximization for Lmax. Under the null hypothesis, −2log(L0max/Lmax) follows the λ2 distribution with the degree of freedom of 1.
If the complete data on d1, d2, . . . , dN and ψ1, ψ2, . . . , ψN can be obtained, the maximum likelihood estimates of θ1, θ2, . . . , θL and q+, q− can easily be obtained as
{circumflex over (θ)}j=nj/(2N)
{circumflex over (q)}+=N+ψ+/N+
{circumflex over (q)}−=N−ψ+/N−
(where j=1, 2, . . . , L), where nj is the number of copies of the jth haplotype in the N individuals. Also,
N+=#{i;diεD+},
N−=#{i;di∉D+},
N+ψ+=#{i;diεD+,ψi=ψ+}
N−ψ+=#{i;di∉D+,ψi=ψ+}
where #{i; , ,} denotes the number of individuals that fulfill the condition after “;”.
However, the complete data on the diplotype configurations of the individuals cannot be obtained, while we observe only the genotype data and the phenotype data on the individuals. Therefore, we prepares the following algorithm in which the expected values of nj/(2N), N+ψ+/N+ and N−ψ−/N are substituted for the real values. That is, the present algorithm includes the following steps (i) to (viii).
From the calculation of step (v), the penetrances are updated for the next step as follows.
q+(n+1)=E[N+ψ+/N+|Ψobs, Gobs, θ(n),q+(n),q−(n)]
q−(n+1)=E[N−ψ+/N−|Ψobs, Gobs, θ(n),q+(n),q−(n)]
The values when converged are considered as the maximum likelihood estimates:
{circumflex over (Θ)}=({circumflex over (θ)}1,{circumflex over (θ)}2, . . . {circumflex over (θ)}L)
{circumflex over (q)}+
{circumflex over (q)}−
Here,
If the condition q0=q+=q− is given and if the steps (iii) through (vii) are repeated, then the maximum likelihood L0max for the alternative hypothesis is obtained.
Under the null hypothesis, the statistic −2log(L0max/Lmax) is expected to asymptotically follow the λ2 distribution with the degree of freedom of 1.
<Phenotype Development Probability Estimation Algorithm>
By the above-described EM algorithm (the present algorithm), the probability that a specimen develops a predetermined phenotype can be estimated on the basis of genotype data on the specimen. Let the specimen to be examined be N+1, and let gN+1 denote the observed genotype of this specimen. The probability that the specimen develops the phenotype ψ+:
P(ψN+1=ψ+|gN+1, {circumflex over (θ)})
The above-described EM algorithm and phenotype development probability estimation algorithm (the present algorithm) can be implemented, for example, in a computer program. If the present algorithm is implemented in a computer program, all calculations can be performed in a computer.
The computer in which the software having the present algorithm implemented therein is installed may be connected to an external network via a communication network, for example, to enable genotype data and phenotype data obtained from a cohort study or a clinical trial to be obtained via the communication network as well as to enable penetrances and probabilities obtained by the present algorithm to be output via the communication network.
2. Application to Case-Control Study
Description will next be made of a method in which genotype data and phenotype data on individuals obtained as a result of a case-control study are used in the above-described algorithm to test the association between the existence of a phenotype and a haplotype and to estimate the penetrance based on the haplotype.
In a case-control study, cases and controls are sampled by fixing the number of cases (affected) and the number of controls (non-affected). Therefore, if the ratio of cases and controls is “case:control=ω: 1−ω” and if the proportion of affected specimens in the entire population (incidence rate) is λ, cases corresponding to ω in the affected specimens (λ) in the sample space and controls corresponding to (1−ω) in the non-affected specimens (1−λ) in the sample space are extracted.
Accordingly, if the condition of an individual as to whether or not the individual has an element in D+ of diplotype configurations is associated with the existence/nonexistence of a particular phenotype (e.g., a particular disease) in a case-control study, parameters estimated in the case-control study are given as shown in Table 1 below.
In the above Table 1, ω is a constant and
f′D
denotes the frequencies of diplotype configurations in the population including cases and controls. Unlike penetrances q+ and q− estimated in the above-described case of application to a cohort study or a clinical trial, r+ and r− do not denote penetrances. The relationship between these estimates can be shown by the following equation:
It can be understood from the above equation that r+ and r− are not in a mutually dependent relationship, and that the number of estimated parameters in the case of application of the present algorithm to a case-control study is smaller by one than that in the above-described case of application to a cohort study or a clinical trial. The relationship r+=r− is established when r+=r=w. Therefore, the null hypothesis and the alternative hypothesis in the case of application of the present algorithm to a case-control study are
H0:r+=ω
H1:r+≠ω
Also, the penetrance q+ is expressed by using the estimate r+ as
In a case-control study, therefore, an estimate of the penetrance can be computed by using a result of the case-control study. Thus, the estimate of the penetrance is calculated by substituting the estimate r+ estimated by applying the present algorithm to the case-control study.
The incidence rate λ cannot be estimated from any result of a case-control study, and therefore needs to be separately given. For example, the incidence rate λ may be obtained as a particular value from a statistical investigation of a target disease or a follow-up survey in a particular population.
Even in the case of use of results of a case-control study, the probability of development of a predetermined phenotype in specimens can be estimated on the basis of genotype data on the specimens, as is that in the above-described case of use of results of a cohort study or a clinical trial.
<Simulation>
In the following description,
We first examined the empirical distribution of the statistic −2log(L0max/Lmax) under the null hypothesis by a simulation. In this examination, we obtained the haplotype frequency Θ not by a simulation but from real data. That is, we used Θ obtained from a study made in the past on SAA (serum amyloid A) genes [Moriguchi et al. (2001) A novel single-nucleotide polymorphism at the 5′-flanking region of SAA1 associated with risk of type AA amyloidosis secondary to rheumatoid arthritis. Arthritis Rheum 44: 1266-1272]. For q0, we tested various values between 0 and 1. Note that under the null hypothesis the penetrance is the same for all the diplotype configurations.
We began the simulation by giving ordered two haplotype copies to each of N individuals by drawing haplotypes using Θ, and determined the phenotype of each individual according to q0. Thus, q0 was used as the probability at which an arbitrary one of the individuals develops the phenotype ψ+. After removing the phase information, the algorithm as defined above was applied to the simulated data for determination of the statistic −2log(L0max/Lmax). The simulation was repeatedly performed and the distribution of the statistic was examined.
Haplotype data on SAA gene includes six SNP data items. Table 2 shows haplotypes and frequencies relating to SAA gene.
In the above Table 2, the haplotype “ACCGTC” is a haplotype assigned as “phenotype-associated haplotype” in a simulation according to the alternative hypothesis.
On the basis of Table 2, haplotype copies were randomly drawn to give ordered two haplotype copies to each individual. Since the probability of development of the phenotype ψ is the same with respect to all the individuals, the phenotype ψ was given on the basis of the probability q0 fixed for each individual. Various values from 0 to 1 were given as q0.
Table 3 shows, with respect to various values given as q0 and N, the probability of the type I error (α=0.05) calculated by assuming the distribution of the statistic follows the λ2 distribution with the degree of freedom of 1, and the values of “q+ hat” and “q− hat” estimated under the alternative hypothesis.
In the above Table 3, the values of “q+ hat” and “q− hat” are shown as mean±standard deviation. As the type I error rate, the rate at which the value of the statistic exceeded 3.841 (at which the cumulative density function of the λ2 distribution with the degree of freedom of 1 is 0.95) is shown.
It is apparent from the results shown in
Simulation Under the Alternative Hypothesis:
Next, the simulation under the alternative hypothesis was performed. One of the haplotypes was assumed to be associated with the phenotype ψ+, and this phenotype was denoted as “phenotype-associated haplotype”. For this simulation, D+ was defined as the set of diplotype configurations that contained at least one “phenotype-associated haplotype”. Various from 0 to 1 were given as each of the two penetrances q+ and q−.
The simulation was began by giving each individual ordered two haplotype copies by drawing them using Θ. Then q+ or q− was given as the probability of development of the phenotype ψ+ in each individual. The probability of development of phenotype ψ+ when the individual had the “phenotype-associated haplotype” was q+, while the probability of development of phenotype ψ+ when the individual did not have the “phenotype-associated haplotype” was q−.
After removing the phase information, the genotype and the phenotype data were subjected to the above defined algorithm. The simulation was repeated many times and the results obtained were analyzed. The power was estimated at various values of q+ and q− by assuming that under the null hypothesis the statistic −2log(L0max/Lmax) follows the λ2 distribution with the degree of freedom of 1.
In the graphs a to f in
The same simulation was repeated 10,000 times by setting various values of q+, q− and N and the proportion of trials resulting in a state where the value of the statistic exceeded 3.841 (at which the cumulative density function of the λ2 distribution with the degree of freedom of 1 is 0.95) was obtained as an empirical power.
Distribution of Estimated Penetrances “q+ hat” and “q−0 hat”:
The distributions of “q+ hat” and “q−0 hat” were examined under the above-described alternative hypothesis. More specifically, the distributions were examined under the condition of changing the penetrance q− to the given penetrance q+=0.2. The relative risk (q+/q−) was changed from 1.0 to 2.0. The sample size N was fixed at 1,000. By using the present algorithm, “q+ hat” and “q− hat” were estimated under the alternative hypothesis and the statistic −2log(L0max/Lmax) was calculated. This simulation was repeated 10,000 times with respect to each parameter. Table 4 shows the results of this simulation.
In Table 4, “q+ hat” and “q− hat” are shown as mean±standard deviation, and the empirical detection rate is shown as the proportion of trials resulting in a state where the value of the statistic exceeded 3.841 (at which the cumulative density function of the λ2 distribution with the degree of freedom of 1 is 0.95). As shown in Table 4, it has been made clear that q+ and q− can be estimated with substantially high accuracy by the present algorithm and the variation is comparatively small. From this result, it is thought that the accuracy of each of the estimate of q+/q−, “q+ hat” and “q31 hat” is comparatively high.
We also examined whether “Θ hat” and the posterior probability distribution of the diplotype configuration (diplotype distribution) when they were estimated by factoring in the phenotype data were the same as those estimated without factoring in the phenotype data. The estimation without the phenotype data was performed by the program, so-called LDSUPPORT. By program LDSUPPORT, the diplotype distribution is estimated from the genotype data only. However, the penetrances are set as q0=q+=q−, i.e., under the null hypothesis, the estimated “Θ hat” and the diplotype distribution of each individual were the same as those estimated by LDSUPPORT. “q+ hat” and “q− hat” were changed when the phenotypes of the individuals were changed, although no data on this change is shown here.
Table 5 shows the result of the simulation performed under the alternative hypothesis by using Θ obtained from SAA gene shown in Table 2. The parameters in the simulation are q+=0.2, q−=0.1 and N=1,000. Table 5 shows data on four typical subjects.
In the column “Phenotype” of Table 5, the affected states of the subjects are shown. N denotes non-affected subjects, and A denotes affected subjects. In the columns “Diplotype distribution” of Table 5, the posterior distribution of the diplotype configuration of each individual is shown. In particular, in the column “Only genotype”, the results of analysis performed by using LDSUPPORT without phenotype information are shown. In the column “+ Phenotype”, the results of analysis performed by the present algorithm using the genotype data and the phenotype data are shown. In the column “Phenotype is changed”, the results of analysis performed by the present algorithm using data obtained by inverting only the phenotype data on the individuals are shown.
As shown in Table 5, the diplotype configuration of each individual estimated as “Θ hat” was actually changed by factoring in the phenotype data or changing the phenotype of the individual. In a case where the estimated diplotype distributions of the individuals were concentrated on one event, “Θ hat” and the diplotype distributions of the individuals were not substantially changed, although data on such a phenomenon is not shown in Table 5. In contrast, in a case where the diplotype distributions of the individuals were not concentrated on one event as shown in Table 4, “Θ hat” and the diplotype distributions of the individuals were changed by factoring in the phenotype data or changing the phenotypes of the individuals.
Next, a test was performed as to which of the present algorithm and the algorithm not factoring in phenotypes was higher in accuracy when both or one of “Θ hat” and the diplotype configuration estimated by factoring in phenotypes (i.e., by the present algorithm) was different from that estimated without factoring in the phenotypes (i.e., by the program LDSUPPORT).
More specifically, when a simulation was performed under the alternative hypothesis, Θ was obtained from SAA gene and a gene artificially designed. For the gene artificially designed, data on six SNP loci between which weak linkage disequilibrium exists was prepared and a collection of haplotype copies was formed.
The simulation of SAA gene using q+=0.5, q31 =0.125 and N=100 were set as parameters was performed 10,000 times. In the simulation of the artificial gene, the parameters were q+=0.5 and N=1,000 and the value of q− was changed. This simulation was performed 1,000 times.
After each simulation, the phase information was removed and the data was subjected to the present algorithm. With respect to each individual, the true diplotype configuration and the estimated diplotype distribution were compared and a case where the true diplotype configurations coincides with the estimated diplotype distribution with the highest probability was recognized as accurate estimation. The proportion of the individuals for which diplotype configuration estimation was performed with accuracy as recognized by this comparison was recorded. The results of this test are shown in Table 6.
As shown in Table 6, the proportion of individuals for which the diplotype configuration was correctly estimated was increased in the case where the phenotype data was added (in the case where the present algorithm was used). It is thereby made clear that the accuracy of inference of the diplotype configuration can be improved by using the present algorithm.
Case-Control Study, Examination about Type I Error in Null Hypothesis
Type I error in the null hypothesis in the case of a case-control study was evaluated by a simulation using the present algorithm on the basis of haplotype frequencies of SAA gene (in Table 2 shown above) and haplotype frequencies of a gene artificially made and having weak linkage disequilibrium (hereinafter referred to briefly as ART). The haplotype frequencies used in this simulation are shown in Table 7.
From
In a case where genotype data and phenotype data on individuals obtained as a result of a case-control study are used in the present algorithm, advantages described below can be obtained over methods of performing diplotype configuration estimation on individuals and thereafter performing a test by using a contingency table. Comparison with the following four methods of determining the diplotype configurations of individuals will be made.
1. Diplotype configuration estimation is separately performed on each of a case population and a control population. If a plurality of diplotype configurations can exist for each individual, one of them having the maximum probability is adopted as the diplotype configuration for the individual. (This method will be referred to as separate0.)
2. Diplotype configuration estimation is separately performed on each of a case population and a control population. If a plurality of diplotype configurations can exist for the individuals, the individuals are divided according to the probabilities of the diplotype configurations. (This method will be referred to as separate1.)
3. Diplotype configuration estimation is performed on the entire population comprising a case population and a control population, and the diplotype configuration having the maximum probability is adopted as the diplotype configuration of the corresponding individual. (This method will be referred to as separate2.)
4. Diplotype configuration estimation is performed on the entire population comprising a case population and a control population, and the individuals are divided according to the probabilities of diplotype configurations. (This method will be referred to as separate3.)
The Peason test statistic was calculated from contingency tables formed on the basis of the above-described separate0 to separate3 methods. The Peason test statistic obtained from a contingency table formed on the basis of the complete data on diplotype configurations (denoted as diplotype) and the likelihood ratio test statistic obtained by the present algorithm (denoted as Penhaplo) were also computed and correlation coefficients for these statistics were examined. Table 8 shows the results of simulations based on the haplotype frequencies of ART shown in Table 7.
As can be understood from Table 8, the likelihood ratio test statistic (Penhaplo) obtained by the present algorithm shows the strongest correlation with the test statistic (diplotype) based on the complete data, and the statistic obtained by the separate3 division method shows the second strongest correlation.
As shown in
<Analysis of Real Data 1>
We applied the present algorithm to real data actually collected. As the real data, data on MTHFR gene [Urano et al. (2002) Polymorphisms in the methylenetetrahydrofolate reductase gene were associated with both the efficacy and the toxicity of methotrexate used for the treatment of rheumatoid arthritis, as evidenced by single locus and haplotype analyses. Pharmacogenetics. 12:183-190] and the NAT2 gene [Tanaka et al. (2002) Adverse effects of sulfasalazine in patients with rheumatoid arthritis are associated with diplotype configuration at the N-acetyltransferase 2 gene. J Rheumatol 29:2492-2499]. Both the data on MTHFR gene and the data on NAT2 gene are derived from cohort studies.
The data set for MTHFR is derived from the cohort study on rheumatoid arthritis patients. An examination was made as to the occurrence of side effects and the state of two SNPs of MTFHR gene with respect to 104 patients who received methotrexate. One haplotype was assumed to be “phenotype-associated haplotype”.
The statistic −2log(L0max/Lmax) calculated from the data was 6.8074, which is significant (P<0.01). The maximum likelihood estimates of “q+ hat” and “q− hat” in this case were 0.2571 and 0.0588, respectively. That is, the maximum likelihood estimate of the relative risk was 4.37. The diplotype distributions of the individuals were concentrated on a single event. The diplotype configurations estimated under the alternative hypothesis with respect to all the individuals were approximately the same as those estimated under the null hypothesis or those estimated by LDSUPPORT (while the data is not shown). That is, in this example, the estimated diplotype configurations were approximately the same regardless of which of the present algorithm using phenotype data or LDSUPPORT not using phenotype data was used. “Θ hat” was also the same regardless of which of the present algorithm or LDSUPPORT was used.
The data set for NAT2 gene is also derived from the cohort study on rheumatoid arthritis patients. This data set was obtained by making a search with respect to the occurrence of side effects and seven SNPs in NET2 gene in 144 patients who received sulfasalazine (see the above-mentioned document by Tanaka et al.) One haplotype recognized in advance as a wild-type haplotype was assumed to be “phenotype-associated haplotype”. The statistic −2log(L0max/Lmax) calculated from the data was 13.4629, which is significant (P<0.001) showing that the presence of the haplotype is associated with the side effects. The maximum likelihood estimates of “q+ hat” and “q− hat” were 0.0809 and 0.6248, respectively. That is, the maximum likelihood estimate of the relative risk was 0.129. That is, the presence of the “phenotype-associated haplotype” was associated with a reduction in the side effects.
The diplotype distributions of the individuals were concentrated on a single event except for that for one individual. The diplotype configurations estimated under the alternative hypothesis with respect to all the individuals were the same as those estimated under the null hypothesis or those estimated by LDSUPPORT.
According to the results of analysis using real data, the penetrance can be calculated by the present algorithm using genotype data and phenotype data. Then, the association between a phenotype and a haplotype at the level of individuals can be tested by the present algorithm using the genotype data and phenotype data. Also, maximum likelihood estimation of the penetrance can be performed by the present algorithm on the assumption that different penetrances are obtained according to the existence/nonexistence of a particular haplotype. Further, a maximum likelihood estimate of the relative risk can, of course, be obtained from the different penetrances.
We also analyzed the genotype data comprising the seven SNPs for NAT2 gene and data on the side effects, which data were obtained from a cohort study on the 144 rheumatoid arthritis patients who received sulfasalazine, with the above-described incomplete haplotypes H1 provided as an object to be tested. From this data, it has been shown that the risk of generation of the side effects in the case where the wild-type haplotype is possessed is descreased or 0.129/1 of the risk in the case where the wild-type haplotype is not possessed, as described above.
More specifically, the wild-type haplotype is expressed by an SNP sequence “GCTCGAG” and, when incomplete haplotypes were constructed by the above-described method, an analysis result showing that “GC****G” was most significant was obtained without designating the wild-type haplotype. The symbol “*” indicates the masked state. That is, in this example, the third to sixth loci can be masked and incomplete haplotypes H1 constructed on SNPs at the first, second and seventh loci can be provided as an object to be tested. As a result of application of the present algorithm using the constructed incomplete haplotypes H1 as an object to be tested, it was found that “GC****G” was most significant. It is meant that the masked loci as shown in “GC****G” are expressed by information on the other loci, and that the fact that “C” is at the second locus and “G” is at the seventh locus is associated with the phenotype. The wild-type haplotype “GCTCGAG” is a haplotype correctly conforming to “GC****G”. Thus, the “phenotype-associated haplotype” can be found by using incomplete haplotypes H1.
Further, another example of application using incomplete haplotypes H1 is analysis performed on genotype data comprising three SNPs for ABCB8 gene and data on execution/non-execution of dosing with folic acid to be performed in the event of occurrence of a side effect, which data were obtained from a cohort study on 175 rheumatoid arthritis patients who received methotrexate, with incomplete haplotypes used as an object to be tested. Table 9 shows the results of this analysis.
In the analysis results shown in Table 9, the haplotype “CG*” has the lowest P value indicating that the haplotype “CG*” is more significant than thee haplotype “CGA” using information on all the loci. This means that the incomplete haplotype “CG” on the first and second loci is strongly associated with the cause of the phenotype. That is, it can be understood that the cause locus is associated with both the haplotypes “CGA” and “CGG”, and that the association with the phenotype appears strongly when the third locus is masked. These results show that even a phenotype associated with a plurality of haplotypes can be detected when incomplete haplotypes H1 are used.
According to the present invention, it is possible to perform maximum likelihood estimation of haplotype frequencies in a population, diplotype distributions of individuals (posterior probability distributions of diplotype configurations) and penetrances by using genotype data and phenotype data, with no need to definitely determine diplotype configurations of the individuals. If the algorithm in accordance with the present invention is used, the association between the existence of a haplotype and one phenotype can be tested by using genotype data and phenotype data obtained as a result of, for example, a cohort study, a clinical trial or a case-control study.
Number | Date | Country | Kind |
---|---|---|---|
130638/2003 | May 2003 | JP | national |
130349/2004 | Apr 2004 | JP | national |