The present application claims priority from Japanese application JP 2004-091104 filed on Mar. 26, 2004, the content of which is hereby incorporated by reference into this application.
The present invention relates to a diagnostic decision support system and a method of diagnostic decision support which can analyze association of clinical information with genetic information and sample and show clinically useful information.
The Human Genome Project has nearly completed sequence determination, and research is moving into the post-sequencing era. From now on, the enormous amount of accumulated genetic information is expected to be put to effective use in medical science. As the association between genes and disease is clarified, it becomes possible to predict the risk of developing a disease from the genotype of an individual, which enables prevention, early detection and treatment of the disease according to the genetic predisposition of the individual. To realize these, it is necessary to analyze the association of clinical information with genetic information.
One powerful method of analyzing the association of clinical information with genetic information is statistical genetics. Statistical genetics uses the genetic information of individuals and the presence or absence of disease as data to search statistically for disease-associated genes. It can also find disease-associated genes whose mechanism is unknown, which is increasingly important. The method searches for a genetic region associated with a specific trait using linkage between a plurality of loci (positions of genes on a chromosome). A trait is any characteristic observed at the individual level, such as the presence or absence of a disease, height, or the color of the eyes or hair. Linkage is an exception to Mendel's law of independent assortment: “Two different traits are segregated and inherited independently.”
When the loci defining two traits are located close to each other on a chromosome, the genes do not segregate independently and are inherited from parent to child in a linked state. This state is referred to as linkage between the two loci. In meiosis, partial exchange may occur between a pair of chromosomes passed from the parents, so that the combination of genes passed to the child may differ from that carried by the parents. This phenomenon is called recombination.
The probability that recombination occurs between two loci in one meiosis is called the recombination fraction. The closer the two loci are to each other, the smaller the recombination fraction; that is, the more likely the loci are to be linked. The method of statistical genetics examines, on the basis of recombination information, the presence or absence of linkage between polymorphisms (such as single nucleotide polymorphisms and microsatellites) and disease-associated genes over a chromosome to close in on disease-associated loci.
Several methods of statistical genetics have been reported. For genetic diseases, a number of causal genes have been identified by parametric linkage analysis using data from large pedigrees. In future searches for disease causal genes, the search for causal genes of complex diseases, which arise from a plurality of genetic and environmental effects, is expected to become the mainstream. It was initially thought that the causal genes of complex diseases could be identified by nonparametric linkage analysis (affected sib-pair analysis) using data from a number of small pedigrees. In general, however, it is often difficult to directly identify the causal genes of complex diseases having low penetrance (probability of disease onset). In recent years, owing to its high power and ease of analysis, attention has been given to association analysis, which compares the allele frequencies of a polymorphism of interest between a case group and a control group.
In the prior art association analysis, the possibility that a gene truly associated with a trait may be missed or a gene not associated with a target trait may be selected by mistake is relatively high. In general, the former is handled as a false negative problem and the latter is handled as a false positive problem. The reasons why false negative and false positive analyzed results are given are as follows: only a haplotype of single polymorphism or polymorphism in a narrow range is used to analyze association of a gene with a trait; no haplotype blocks are considered when performing analysis using haplotype; and no diversity existing in a target group (hereinafter, called a genetic structure) is considered.
A haplotype is a combination of alleles derived from the same parent at a plurality of linked loci. Alleles at a plurality of loci located close to each other on a chromosome are passed to the next generation in a linked state, without being separated by recombination in meiosis. Even after many meioses, an association is found among loci located close to each other. This state is called linkage disequilibrium. In recent years, for instance, Non-patent Document 1 (Gabriel SB et al.: The Structure of Haplotype Blocks in the Human Genome, Science, Vol. 296, pp. 2225-2229, 2002) has reported that two kinds of parts alternate along a genome: haplotype blocks, in which linkage disequilibrium is maintained in a relatively strong state, and hotspots, in which recombination occurs at high frequency and weakens linkage disequilibrium between loci.
This fact means that if the position of a haplotype block can be correctly inferred, an exact haplotype pattern can be decided only by measuring the genotype of a few loci in the haplotype block. At the same time, this fact means that when performing analysis using a plurality of loci across a hotspot, many false positive results which are not important in genetics are given.
In association analysis, a target population is generally divided into groups according to a trait of interest. The best known design is the case-control study, which samples a number of cases and controls from a certain population, compares the frequencies of the alleles of interest between the case group and the control group, and detects polymorphic loci having a significant difference in allele frequency. The case-control study assumes that the case group is perfectly matched with the control group except for the trait of interest.
This assumption does not always hold, and it becomes a problem when a genetic structure exists in the target population. When a case group and a control group are sampled from genetically different populations, the genetic structure significantly affects the analyzed result. The influence of the genetic structure of a population will be described using a simple example. For instance, when a case group having sickle cell disease (drepanocytosis) and a control group are collected in the U.S., the case group is expected to include many people of African descent and the control group many people of European descent. When the two groups are compared without considering the influence of the genetic structure, a number of loci that inherently differ in allele frequency between African and European people are detected as causal loci of the disease. In this way, the genetic structure of a population gives many false positive analyzed results. It may also give false negative analyzed results.
[Non-patent Document 1] Gabriel S B et al.: The Structure of Haplotype Blocks in the Human Genome, Science, Vol. 296, pp. 2225-2229, 2002
As described above, when performing association analysis without considering the influence of a haplotype block and a genetic structure existing in a target population, many false negative and false positive analyzed results are given, significantly affecting the analyzed results. Accordingly, an object of the present invention is to provide a system performing high-accuracy diagnostic decision support in consideration of the influence of a haplotype block and a genetic structure.
In a diagnostic decision support system and a method of diagnostic decision support according to the present invention, haplotype block inference means infers, on the basis of polymorphism information, the positions of recombination and thereby the positions of haplotype blocks, and analyzes each of the haplotype blocks to infer a haplotype pattern of individuals with high accuracy. The inferred haplotype frequency information and haplotype pattern information of the individuals are stored in a haplotype information database. Genetic structure inference means clusters the individuals on the basis of the haplotype pattern to divide a population into several subpopulations, which removes the influence of a genetic structure existing in the population so that the association of clinical information with genetic information can be analyzed with high accuracy. The result obtained by the genetic structure inference means is stored in a genetic structure information database, and the association of clinical information with genetic information is analyzed using the genetic structure information database and a clinical information database to provide high-accuracy diagnostic decision support knowledge. The diagnostic decision support knowledge obtained by this analysis is stored in a decision support knowledge database. Risk calculation means calculates, on the basis of information in the decision support knowledge database, a risk that a predetermined individual is affected by disease.
In a diagnostic decision support system and a method of diagnostic decision support according to the present invention, a haplotype block inference algorithm can infer the positions of recombination and thereby the positions of haplotype blocks, and analyze each of the haplotype blocks to infer a haplotype pattern of individuals with high accuracy. A genetic structure inference algorithm can cluster individuals on the basis of the haplotype pattern to divide a population into several subpopulations, which removes the influence of a genetic structure existing in the population so that the association of clinical information with genetic information can be analyzed with high accuracy.
The databases handle data of a population, and the information in the decision support knowledge database 18 is valid for that population. The contents of the databases are further enriched by accumulating data of persons who have received a diagnostic decision.
In the diagnostic decision support system of the present invention, the haplotype block inference program 13 infers, on the basis of polymorphism information, the positions of recombination and thereby the positions of haplotype blocks, and analyzes each of the haplotype blocks to infer a haplotype pattern of individuals with high accuracy. The inferred haplotype frequency information and haplotype pattern information of the individuals are stored in the haplotype information database 14. The genetic structure inference program 15 clusters the individuals on the basis of the haplotype pattern to divide a population into several subpopulations, which removes the influence of a genetic structure existing in the population so that the association of clinical information with genetic information can be analyzed with high accuracy. The result obtained by the genetic structure inference program 15 is stored in the genetic structure information database 16, and the association of clinical information with genetic information is analyzed using the genetic structure information database 16 and the clinical information database 11 to provide high-accuracy diagnostic decision support knowledge. The diagnostic decision support knowledge obtained by this analysis is stored in the decision support knowledge database 18. The risk calculation program 19 calculates, on the basis of information in the decision support knowledge database 18, a risk that a predetermined individual is affected by disease.
The clinical information database 11 stores basic data of the name, address, birthday and family structure of an individual, clinical data such as information on the case history, family history, major complaint, findings, examined result, lifestyle, condition process, treatment process and medicine prescription of the individual, and data on an informed consent. The genetic polymorphism information database 12 stores basic information on polymorphism (position, measurement method, polymorphism type (such as SNP or STRP), and allele frequency), the polymorphism measured result of the individual (such as base sequence pattern, homozygote, or heterozygote), identification information of a specimen used in an examination, and specimen management data of a stored state.
The haplotype block inference program 13 will be described. As described previously, linkage disequilibrium is maintained in a relatively strong state in a haplotype block. For instance, as shown in the previously described Non-patent Document 1, the diversity of a haplotype is known to be relatively small in a haplotype block. To infer the position of the haplotype block, it is necessary to define the strength of linkage disequilibrium in a certain region on a genome.
In general, the strength of linkage disequilibrium is often expressed using the coefficient of linkage disequilibrium D′ between two loci. In the present invention, when the coefficient of linkage disequilibrium D′ of a plurality of loci in a certain region satisfies the condition of the following equation, the region is defined as a haplotype block.
min(|D′|)>0.8
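The block condition above can be illustrated with a minimal sketch. It assumes phased haplotypes are already available as strings of '0'/'1' alleles, uses Lewontin's standard definition of D′, and assumes that the minimum in the condition is taken over every pair of loci in the region; the function names and the sample data are hypothetical and are not taken from the specification.

```python
from itertools import combinations


def d_prime(p_ab, p_a, p_b):
    """Lewontin's D' for two biallelic loci (standard definition).

    p_ab: frequency of haplotypes carrying allele '1' at both loci
    p_a, p_b: frequencies of allele '1' at the first and second locus
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return d / d_max if d_max > 0 else 0.0


def pairwise_d_prime(haplotypes, i, j):
    """D' between loci i and j, estimated from a list of phased haplotypes."""
    n = len(haplotypes)
    p_a = sum(h[i] == "1" for h in haplotypes) / n
    p_b = sum(h[j] == "1" for h in haplotypes) / n
    p_ab = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    return d_prime(p_ab, p_a, p_b)


def is_haplotype_block(haplotypes, loci, threshold=0.8):
    """Assumed reading of min(|D'|) > 0.8: every pair of loci must exceed the threshold."""
    values = (abs(pairwise_d_prime(haplotypes, i, j)) for i, j in combinations(loci, 2))
    return min(values) > threshold


# Hypothetical sample: loci 0 and 1 are in complete disequilibrium, locus 2 is not.
sample = ["000", "000", "111", "111", "001", "110"]
block_01 = is_haplotype_block(sample, [0, 1])
block_012 = is_haplotype_block(sample, [0, 1, 2])
```

In practice the haplotype frequencies needed for D′ are themselves estimated from unphased genotypes, so this check would be applied to estimated frequencies rather than to directly observed haplotypes.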
The haplotype frequency of a population and the haplotype pattern of individuals are inferred in each inferred haplotype block. A combination of two haplotypes owned by an individual is called a diplotype configuration. Some methods of inferring the diplotype of an individual from genotype data have been proposed. Representative methods are a method using the EM algorithm, as shown in Document: Excoffier L & Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population, Mol Biol Evol, Vol. 12, pp. 921-927, 1995, and the PHASE method, as shown in Document: Stephens M et al.: A new statistical method for haplotype reconstruction from population data, Am J Hum Genet, Vol. 68, pp. 978-989, 2001.
A method of inferring the haplotype frequency of a population and the diplotypes of individuals using the EM algorithm will be described below. A sample having n individuals will be considered now. In the population, haplotypes in a plurality of linked marker loci are considered, and their frequencies in the population are F=(F1, F2, . . . , FM). M is the total number of potential haplotypes. When the marker loci are all SNP loci and the number of loci is L, M=2^L. The genotype observed data in the plurality of linked marker loci of the individuals is G=(G1, G2, . . . , Gn). In many cases, Gi is incomplete data; that is, the diplotype corresponding to Gi is often not uniquely determined. In such a case, a probability distribution (called a diplotype distribution) over the potential diplotypes is defined. For individual i (i=1, 2, . . . , n), the diplotypes corresponding to Gi are Dij (j=1, 2, . . . , mi). Here, mi is the number of potential diplotypes for Gi and the maximum value of mi is M.
Step 21: Give an initial value F(0) of haplotype frequency to M potential haplotypes (H1, H2, . . . , HM). The haplotype frequencies sum to 1.
For t=0, 1, 2, . . . , calculation for F(t) to F(t+1) is performed by the following steps 22 to 25.
Step 22: Each diplotype Dij has two haplotypes Hl, Hm where 1≦l≦M and 1≦m≦M. When the haplotype frequency F(t) of a population is given, the probability that Dij is obtained is as shown in Equation (1):
Posterior probability Pr(Dij|Gi) that under genotype observed data Gi, the diplotype of individual i is Dij is expressed by Equation (2) by the Bayes' theorem:
When this is calculated for all j (j=1, 2, . . . , mi), the diplotype distribution of the individual i is decided. This is applied to all individuals in the sample.
Step 23: When the diplotype distribution of the individual is decided, an expectation of haplotype frequency of the population can be calculated from the diplotype distribution of all individuals in the sample. The expectation of the haplotype frequency of the population is expressed by Equation (3):
Step 24: The entire likelihood can be expressed by Equation (4) by coupling the likelihood of all diplotypes in each of the individuals and coupling the likelihood of all individuals:
Step 25: F is updated as F(t+1)=E[F(t)], and whether the value of L(F) has converged is determined. When L(F(t+1))−L(F(t))<β is satisfied, the value is regarded as converged and the routine advances to step 26. Otherwise, the routine returns to step 22 and steps 22 to 25 are repeated. Here, β is a threshold.
Step 26: E[F]=F(EM) at convergence is maximum likelihood estimation of the haplotype frequency of the population, and Pr(D|G) is the diplotype distribution of the individual under the maximum likelihood estimation of the haplotype frequency of the population.
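Steps 21 to 26 can be sketched as follows. Equations (1) to (4) are not reproduced in this text, so the sketch assumes their standard forms from the cited EM method: under Hardy-Weinberg equilibrium a diplotype of haplotypes Hl and Hm has probability Fl² when l=m and 2·Fl·Fm otherwise, the posterior is that probability normalized over the diplotypes compatible with Gi, and the expected frequency is the posterior-weighted haplotype count divided by 2n. Two further assumptions are made for illustration: genotypes are coded as the count (0, 1 or 2) of one allele at each SNP locus, and convergence is tested on the log-likelihood rather than on the likelihood itself. All function and variable names are hypothetical.

```python
from itertools import product
from math import log


def compatible_diplotypes(genotype):
    """All unordered haplotype pairs D_ij consistent with one genotype G_i."""
    het = [k for k, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    pairs = set()
    for phase in product((0, 1), repeat=len(het)):
        h1, h2 = base[:], base[:]
        for k, bit in zip(het, phase):
            h1[k], h2[k] = bit, 1 - bit
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


def diplotype_prob(pair, freq):
    """Assumed form of Equation (1): Hardy-Weinberg probability of a diplotype."""
    h1, h2 = pair
    return freq[h1] ** 2 if h1 == h2 else 2 * freq[h1] * freq[h2]


def em_haplotype_frequencies(genotypes, beta=1e-9, max_iter=1000):
    n_loci = len(genotypes[0])
    haps = list(product((0, 1), repeat=n_loci))      # M = 2**L potential haplotypes
    freq = {h: 1.0 / len(haps) for h in haps}        # step 21: uniform F(0)
    candidates = [compatible_diplotypes(g) for g in genotypes]
    prev_ll = None
    posteriors = []
    for _ in range(max_iter):
        # Step 22: posterior Pr(D_ij | G_i) under the current frequencies F(t).
        posteriors, ll = [], 0.0
        for cand in candidates:
            weights = [diplotype_prob(d, freq) for d in cand]
            total = sum(weights)
            ll += log(total)                         # step 24: log-likelihood of F(t)
            posteriors.append({d: w / total for d, w in zip(cand, weights)})
        # Step 23: expected haplotype frequencies given the diplotype distributions.
        counts = dict.fromkeys(haps, 0.0)
        for post in posteriors:
            for (h1, h2), p in post.items():
                counts[h1] += p
                counts[h2] += p
        freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
        # Step 25: stop when the log-likelihood gain falls below the threshold beta.
        if prev_ll is not None and ll - prev_ll < beta:
            break
        prev_ll = ll
    # Step 26: frequencies at convergence and posteriors from the last E-step.
    return freq, posteriors


# Hypothetical sample: four homozygotes and one double heterozygote at two SNP loci.
genotypes = [(0, 0), (0, 0), (2, 2), (2, 2), (1, 1)]
freq, posteriors = em_haplotype_frequencies(genotypes)
```

In this sample the double heterozygote is compatible with two diplotypes, and the homozygous individuals pull the estimate toward the 00/11 phasing, which is the kind of ambiguity the diplotype distribution is meant to resolve.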
As described above, the positions of the haplotype blocks are inferred on the basis of information in the genetic polymorphism information database 12, and the haplotype frequency of the population and the haplotype pattern of the individuals are inferred for each of the haplotype blocks. The haplotype information database 14 stores the basic information necessary for setting the haplotype blocks, together with the haplotype frequency information of the population and the haplotype patterns of the individuals for each of the haplotype blocks.
The genetic structure inference program 15 will be described. In the present invention, to infer a genetic structure of a population, individuals are clustered on the basis of a haplotype pattern to divide the population into several subpopulations. A new distance, determined by the likelihood of mutation and recombination between haplotypes, is defined and used for clustering the individuals. A clustering method of the present invention will be described below.
In the present invention, an evolutionary tree is created so that the edge of the evolutionary tree shows evolution by one mutation or one recombination. As in the evolution of haplotypes 1 to 5 of
For each edge of the created evolutionary tree, whether the evolution is by recombination or mutation is decided. In
The likelihood when a certain haplotype HS is evolved to another haplotype HT is expressed by Equation (5):
As in the evolution of haplotypes 1 to 4 in
Here, two haplotypes being IBD (identical by descent) means that they carry alleles derived from the same ancestor. Since two haplotypes that are IBS (identical by state) in appearance may actually be IBD, this is expressed as IBS*.
When applying the Bayes' theorem, Equation (10) is given:
Here, Equation (11) can be supposed:
Since Equation (12) expresses the frequency of H_T^{1:k}, the value of Equation (10) can be easily calculated:
Pr(H_T^{1:k} | H_T^{1:k} IBS* to H_S^{1:k}) (12)
In the present invention, the likelihood expressed by Equation (5) is newly defined as distance between haplotypes to perform clustering individuals using the distance. Distance dk between an individual having haplotypes of Hkak, Hkbk and an individual having haplotypes of Hkck, Hkdk for the kth haplotype block is defined as in Equation (13):
When the number of haplotype blocks is m, distance d between two individuals is expressed as Equation (14) by coupling distances between all haplotype blocks:
A method of inferring a membership proportion of an individual, that is, the genetic structure inference program 15 will be described. In the present invention, information on to which subpopulation generated by the above-described clustering method each individual belongs is defined as a membership proportion of the individual.
Step 71: Distance between haplotypes in each haplotype block is decided by the method explained with reference to
Step 72: Clustering on the basis of the distance between haplotypes is performed.
Step 73: From the result of step 72, a population having n individuals is divided into N subpopulations. When a certain individual i is classified into a certain subpopulation j, the membership proportion of the individual i to the subpopulation j is 100% and the membership proportion of the individual i to a subpopulation other than the subpopulation j is 0%. When the number of haplotype blocks is m, the entire likelihood can be expressed as Equation (15):
Step 74: Whether the value of L(N) has converged is determined. When L(N_{k-1})−L(N_{k})<β is satisfied, the value is regarded as converged and the routine advances to step 75. Otherwise, the routine returns to step 71 and steps 71 to 74 are repeated. Here, β is a threshold.
Equation (17) is the membership proportion of the individual i to the subpopulation j:
Qj(i) (17)
Step 75: N when the likelihood expressed by Equation (15) is maximum, is maximum likelihood estimation of the number of subpopulations. The maximum likelihood estimation is adopted as a parameter.
Step 76: The membership proportion of the individual to the subpopulation is calculated on the basis of the likelihood expressed by Equation (15). For instance, suppose there are N_{k} subpopulations, and subpopulation N_{l} is coupled to subpopulation N_{l+1} in the next link step to form N_{k−1} subpopulations. When the likelihood is not changed by this step and the likelihood is maximum, all individuals classified into subpopulations N_{l} and N_{l+1} have a membership proportion of 50% to subpopulation N_{l} and 50% to subpopulation N_{l+1}.
As described above, the genetic structure information database 16 stores haplotype pattern and haplotype frequency information in each subpopulation and membership proportion of each individual to each subpopulation.
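The clustering and hard-membership parts of steps 72 and 73 can be sketched as below. The distance of Equations (13) and (14) and the likelihood of Equation (15) are not reproduced in this text, so the sketch takes a precomputed, purely hypothetical distance matrix as input and does not select the number of subpopulations by likelihood. Average linkage is an assumption chosen for illustration; the specification only says that subpopulations are coupled in successive link steps.

```python
def agglomerate(dist, n_clusters):
    """Merge the two closest clusters (average linkage) until n_clusters remain.

    dist: symmetric matrix of pairwise distances between individuals.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # one link step couples two subpopulations
        del clusters[b]
    return clusters


def membership_proportions(clusters, n_individuals):
    """Step 73: 100% membership to the assigned subpopulation, 0% to the others."""
    q = [[0.0] * len(clusters) for _ in range(n_individuals)]
    for j, members in enumerate(clusters):
        for i in members:
            q[i][j] = 1.0
    return q


# Hypothetical distances: individuals 0 and 1 are close, as are individuals 2 and 3.
dist = [
    [0.0, 0.1, 0.9, 0.9],
    [0.1, 0.0, 0.9, 0.9],
    [0.9, 0.9, 0.0, 0.2],
    [0.9, 0.9, 0.2, 0.0],
]
clusters = agglomerate(dist, 2)
q = membership_proportions(clusters, len(dist))
```

A fuller implementation would evaluate the likelihood after each link step, stop according to the convergence test of step 74, and split membership between two subpopulations when coupling them leaves the likelihood unchanged, as in step 76.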
As understood with reference to
A procedure will be described by which the association analysis program 17 analyzes, on the basis of information in the clinical information database 11 and the genetic structure information database 16, the association of the haplotype pattern of an individual with a trait for each haplotype block of each subpopulation. The association analysis program 17 compares a trait (for instance, the presence or absence of disease onset) between a group of individuals owning a specified haplotype and a group of individuals not owning it, calculates the odds ratio between the two groups, and thereby infers to what degree owning the haplotype increases the risk of the disease.
In the present invention, the odds ratio of disease onset of the group of individuals owning a specified haplotype to the group of individuals not owning it is defined as a haplotype relative risk. In many cases, a 2×2 contingency table is created from the presence or absence of a specified haplotype and the presence or absence of disease onset (which may instead be the presence or absence of a clinical event or of a side effect of medicine), and the influence of owning the haplotype on disease onset is evaluated by a test of independence (chi-squared test or Fisher's exact test) on the 2×2 contingency table. When the trait cannot be divided into categories, the t test or the Wilcoxon test may be conducted to compare the trait between the group of individuals owning a specified haplotype and the group of individuals not owning it.
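The 2×2 analysis described above can be illustrated with a short sketch. The cell layout, the function names and the counts are hypothetical; the chi-squared statistic is the standard Pearson statistic without continuity correction, and the p-value uses the fact that a chi-squared variable with one degree of freedom is the square of a standard normal variable.

```python
from math import erfc, sqrt


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: haplotype owners with disease      b: haplotype owners without disease
    c: non-owners with disease            d: non-owners without disease
    """
    return (a * d) / (b * c)


def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test of independence (1 degree of freedom, no correction)."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(stat / 2))   # upper tail of chi-squared with 1 d.o.f.
    return stat, p_value


# Hypothetical counts: 100 haplotype owners and 100 non-owners.
haplotype_relative_risk = odds_ratio(30, 70, 10, 90)
stat, p_value = chi_squared_2x2(30, 70, 10, 90)
```

Both functions divide by cell counts or marginal totals, so a table containing a zero cell or an empty row or column needs separate handling; with small counts Fisher's exact test, which the text also mentions, is the usual alternative.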
Knowledge obtained by the association analysis program 17 is stored in the decision support knowledge database 18.
The risk calculation program 19 calculates, with reference to the genetic structure information database 16 and the decision support knowledge database 18, a risk that a predetermined individual is affected by disease. Risk Ri that an individual i is affected by certain disease can be expressed by Equation (18) when the number of haplotype blocks is m, the number of subpopulations existing in a population is N, and the haplotype relative risk of individual i in haplotype block k of subpopulation j is rijk:
Number | Date | Country | Kind |
---|---|---|---|
2004-091104 | Mar 2004 | JP | national |