Embodiments described herein relate to a genotyping device and method.
An organism holds genetic information as a nucleotide sequence (deoxyribonucleic acid (DNA)) and, within the same species, most of the nucleotide sequence is identical among individuals. However, a part of the nucleotide sequence differs among individuals and, in particular, a locus where a nucleotide differs at a frequency of 1% or more in a population of the same species is referred to as a single nucleotide polymorphism (SNP). In organisms having two sets of chromosomes (diploid organisms) such as humans, three types of combination patterns are formed due to the difference in the nucleotides at an SNP. Such a combination pattern is called a genotype.
Since individual differences such as constitution occur even within the same species depending upon genotypes of SNPs, the genotypes are relevant to genetic diseases and to the effects and side effects of medicines. Accordingly, investigation of the genotype of a specific SNP of a certain individual enables prediction of effectiveness of medicines and/or side effects prior to actual medication.
In the case of humans, it is necessary to determine genotypes of hundreds of thousands to several millions of SNPs at once in order to discover a genotype or genotypes associated with genetic diseases and effectiveness of medicines and their side effects. As a genotyping method that realizes this, a method using a DNA microarray may be mentioned.
According to this method, first, a known nucleotide sequence of an SNP on the array side and an unknown nucleotide sequence of a certain organism (specimen) whose genotype should be determined are hybridized by the DNA microarray, and a signal intensity is measured. Next, the signal intensities of a plurality of specimens measured for the same SNP are projected on a plane and classified into clusters of the same genotype for each SNP. The genotypes are then assigned (labeled) to the respective clusters using biological findings. As a result, it is made possible to determine the genotypes of the same SNP at once for a plurality of specimens.
However, the above-described traditional method does not take into consideration fluctuations in the signal intensities caused by the experimentation environment, such as temperature and humidity, so that erroneous genotypes may be assigned to the clusters. As a result, the traditional method has a drawback in that the number of SNPs whose genotypes are erroneously determined increases, degrading the accuracy of the genotyping.
According to one embodiment, a genotyping device includes: a representative value calculator, a first labeler, a model creator and a second labeler.
The representative value calculator is configured to calculate a representative value for each of one or more clusters each including a plurality of specimens with respect to each of a plurality of SNPs, the specimens being classified based on signal intensities of the specimens into the clusters with respect to each of the SNPs, and the representative value being calculated based on the signal intensities of the specimens included in each of the clusters.
The first labeler is configured to assign genotypes to clusters of an SNP pertaining to three clusters among the SNPs on the basis of the representative values of the clusters of the SNP pertaining to three clusters.
The model creator is configured to create a model indicative of a relationship between the genotypes of the clusters of the SNP pertaining to the three clusters among the SNPs and the representative values of the clusters of the SNP pertaining to three clusters.
The second labeler is configured to assign genotypes to clusters of an SNP pertaining to one or two clusters among the SNPs on the basis of the representative values of the clusters of the SNP pertaining to one or two clusters and the model.
Embodiments of the present invention are described with reference to the drawings.
First, an outline of a genotyping technique using a DNA microarray will be described with reference to
Each SNP section includes two types of probes “A” and “B,” each having a known nucleotide sequence. A probe is a mechanism for grasping two different nucleotides in each SNP, and the probes have different nucleotides of an SNP corresponding to the SNP section of this SNP. In the example of
When the DNAs of the specimens are hybridized to the respective probes, a signal intensity such as fluorescence intensity and electric current intensity changes. The DNA microarray measures this signal intensity for each type of the probes. In the following, one probe is referred to as probe “A,” and the other probe is referred to as probe “B.” Also, a signal whose intensity changes according to the hybridization of the probe “A” is referred to as signal “A” and the intensity of the signal “A” is referred to as signal intensity “A.” Also, a signal whose intensity changes according to the hybridization of the probe “B” is referred to as signal “B,” and the intensity of the signal “B” is referred to as signal intensity “B.”
Here, it is assumed that the probe in which the nucleotide of SNPi is “A” is defined as probe “A” and a probe in which the nucleotide is “C” is defined as probe “B.” As illustrated in
In addition, if a genotype of an SNPi of “Specimen 2” is “TG,” similar numbers of specimens are hybridized to the probes “A” and “B,” respectively, at the SNP section corresponding to the SNPi, and the signal intensities “A” and “B” will be about the same. A genotype causing the signal intensities “A” and “B” to be about the same in this way is hereinafter referred to as genotype “AB.” The genotype “AB” is a heterozygous genotype.
Further, if a genotype of an SNPi of “Specimen 3” is “GG,” then many specimens are hybridized to the probe “B” at the SNP section corresponding to the SNPi, and the signal intensity “B” increases. A genotype that increases the signal intensity “B” in this manner is hereinafter referred to as genotype “BB.” The genotype “BB” is a homozygous genotype.
The DNA microarray simultaneously measures the signal intensities “A” and “B” for a plurality of specimens in a plurality of SNPs. Subsequently, clustering of the specimens on a per-SNP basis is carried out on the basis of the signal intensities “A” and “B” measured by the DNA microarray.
In addition, after the clustering, genotypes are assigned to the generated clusters. As described above, since the specimens of the genotype “AB” have the same or similar degree of the signal intensities “A” and “B,” the cluster of the genotype “AB” is considered to be distributed on or along a 45-degree straight line in the signal intensity plane. In addition, since the cluster of a genotype “AA” exhibits a large signal intensity “A” and a small signal intensity “B,” it is considered that the cluster of the genotype “AA” is distributed closer to the signal intensity “A” axis with reference to the 45-degree straight line. Since the cluster of a genotype “BB” exhibits a large signal intensity “B” and a small signal intensity “A,” it is considered that the cluster of the genotype “BB” is distributed closer to the signal intensity “B” axis with reference to the 45-degree line.
According to traditional genotyping techniques, assignment of genotypes to the clusters is performed using the magnitude relationship of the signal intensities of the individual genotypes.
The traditional genotyping technique can simultaneously determine the genotypes at a plurality of SNPs of a plurality of specimens by carrying out the above processing on the individual SNPs. For example, in the example of
According to the genotype assignment method using the magnitude relationship of the signal intensities, the genotypes can be assigned with high accuracy when the signal intensities “A” and “B” are accurately measured. However, in actuality, a measurement error may occur in the signal intensities “A” and “B” due to the influence of an experimentation environment (such as a reagent of the DNA microarray) in measuring the signal intensities “A” and “B” by the DNA microarray, and the distribution of the specimens may exhibit fluctuation.
For example, as illustrated in
As described above, if fluctuation occurs in the distribution of the specimens, clusters other than that of the genotype “AB” may be located on the 45-degree straight line as illustrated in
This is because it is unknown how fluctuation occurs in the distribution of the specimens when only one cluster or only two clusters are created as illustrated in
A first embodiment will be described with reference to
First, the outline of the genotyping method by the genotyping device according to the first embodiment will be described. FIGS. 7 and 8 are diagrams for explanation of the outline of the determination method by the genotyping device according to this embodiment.
In the example of
As described above, the genotyping device assigns genotypes not on a per-specimen basis but on a per-cluster basis. For this purpose, the genotyping device first calculates representative values of the clusters from the signal intensities of the specimens included in the respective clusters. The representative value is calculated for each SNP.
Next, the genotyping device assigns genotypes to the clusters of SNPs classified as pertaining to the three clusters by using the magnitude relationship of the representative values. In the example of
As a result, representative values of the respective genotypes of 500,000 SNPs are obtained as illustrated in
The genotyping device creates a probability distribution model using the genotypes and the representative values of 500,000 SNPs thus obtained. For example, the probability distribution model of the genotype “AA” is expressed as a probability density function of 500,000 representative values of the genotype “AA.”
Subsequently, the genotyping device assigns the genotypes to the respective clusters of SNPs classified as pertaining to the one or two clusters using the probability distribution model. Specifically, the genotyping device applies the representative values of the respective clusters to the above probability distribution model, and assigns the genotypes having the maximum probability density to the clusters.
In the example of
Next, the functional configuration of the genotyping device (hereinafter referred to as “determination device”) according to this embodiment will be described with reference to
As illustrated in
The signal intensity DB 1 is configured to store the signal intensities “A” and “B” (signal intensity data) measured by the DNA microarray. As described above, the signal intensities “A” and “B” may be a fluorescence intensity or an electric current intensity. In the following description, it is assumed that the signal intensities of SNPs 1 to “n” of the specimens 1 to “M” are respectively stored in the signal intensity DB 1. At this point, “M”דn” signal intensities “A” and “B” are stored in the signal intensity DB 1.
The clustering unit 2 is configured to create a cluster or clusters for each SNP based on the signal intensities “A” and “B” stored in the signal intensity DB 1. A cluster is a set of specimens. The specimens are each classified as pertaining to one of the clusters generated by the clustering unit 2. When the specimen is a human, there are only three genotypes “AA,” “AB,” and “BB,” so that three or fewer clusters are generated for each SNP. The clustering unit 2 may perform clustering of specimens using a well-known clustering method such as a k-means method.
The cluster DB 3 is configured to store the result of clustering (cluster data) carried out by the clustering unit 2. Specifically, the cluster DB 3 stores cluster information of the respective specimens with the respective SNPs.
It should be noted that the determination device may acquire the clustering result as illustrated in
In addition, the clustering unit 2 may calculate converted signal intensities “x” and “y” from the signal intensities “A” and “B” and carry out the clustering based on the converted signal intensities “x” and “y.” The converted signal intensities “x” and “y” are calculated, for example, by the following expressions.
[Expression 1]
x=log(B/A) . . . (1)
y=1/2 log(A*B) . . . (2)
When the clustering is carried out using the converted signal intensities “x” and “y” calculated by the expressions (1) and (2), the specimens are plotted on a plane of the converted signal intensity defined by an axis representing the converted signal intensity “x” and another axis representing the converted signal intensity “y,” as illustrated in
The converted signal intensities “x” and “y” calculated by the clustering unit 2 may be stored in the signal intensity DB 1.
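The conversion in expressions (1) and (2) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `convert_intensities` is hypothetical, the logarithm is taken to be the natural logarithm (the base is not specified above), and both intensities are assumed to be strictly positive.

```python
import math


def convert_intensities(a, b):
    """Return the converted signal intensities (x, y) for raw intensities A and B.

    x = log(B / A) contrasts the two signals,
    y = 1/2 * log(A * B) reflects their overall strength.
    The natural logarithm is an assumption; the base is not stated in the text.
    """
    if a <= 0 or b <= 0:
        raise ValueError("signal intensities must be positive to take the logarithm")
    x = math.log(b / a)
    y = 0.5 * math.log(a * b)
    return x, y
```

With this conversion, a specimen whose signal intensities “A” and “B” are equal has x = 0, and swapping “A” and “B” only changes the sign of x.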
The representative value calculator 4 is configured to calculate representative values of the clusters generated by the clustering unit 2. The representative value is a value unique to each cluster of each SNP. In this embodiment, the representative values are calculated based on the signal intensities “A” and “B” or the converted signal intensities “x” and “y” of the specimens included in each cluster of each SNP. In the following, it is assumed that the representative values are calculated based on the signal intensities “A” and “B.”
The representative value is, for example, a regression coefficient of a regression line of each cluster, an arc tangent of a regression coefficient, or an inclination of an approximate straight line passing through the origin, but it is not limited thereto. The representative value may be a correlation coefficient of each cluster, a cluster center value, a cluster median value, a cluster variance, an average value of ratios, or an average value of differences.
The representative value DB 5 stores the representative values (representative value data) of the respective clusters of the respective SNPs calculated by the representative value calculator 4.
The first labeler 6 is configured to refer to the representative value DB 5 and extract SNPs for which three clusters have been generated. An SNP for which three clusters are generated corresponds to an SNP for which representative values are stored for three clusters. For example, in the example of
Next, the first labeler 6 assigns a genotype to each of the clusters of each extracted SNP. Genotype assignment is carried out using the magnitude relationship of the representative values. More specifically, when a value that increases as the signal intensity “A” of the specimens included in the cluster increases is calculated as the representative value, the first labeler 6 assigns the genotypes “AA,” “AB,” and “BB” in a descending order of the representative value. Likewise, when a value that increases as the signal intensity “B” of the specimens included in the cluster increases is calculated as the representative value, the first labeler 6 assigns the genotypes “BB,” “AB,” and “AA” in a descending order of the representative value. This also applies to a case where the representative values are calculated based on the converted signal intensities “x” and “y.”
For example, when the representative value is a regression coefficient of each cluster on the signal intensity plane in
The first labeler 6 applies the result of assignment to the cluster data stored in the cluster DB 3 and thereby generates the result of determination of the genotype of the SNP classified as pertaining to three clusters. The result of determination is stored in the determination result DB 10.
The model creator 7 creates a probability distribution model indicative of the relationship between the genotype and the representative value on the basis of the genotype of each cluster assigned by the first labeler 6 and the representative value of each cluster to which the genotype is assigned. The probability distribution model is constituted by probability density functions of the representative values for the respective genotypes. The probability variable of each probability density function is a representative value.
As the probability distribution model, a probability density function according to an appropriate probability distribution such as Gaussian distribution (normal distribution), mixed Gaussian distribution, F distribution, and beta distribution can be used. Also, each probability density function may follow different types of distribution for each genotype. For example, it may be considered that the probability density functions of the genotypes “AA” and “BB” follow a mixed Gaussian distribution, and the probability density function of the genotype “AB” follows a normal distribution.
When the signal intensities “A” and “B” are accurately measured, the probability distributions of the genotypes “AA” and “BB” become symmetric with respect to the probability distribution of the genotype “AB.” Also, the probability distribution of the genotype “AB” has an average value of about 45°. In contrast, in the probability distribution model of
In this manner, by using the genotypes and the representative values assigned by the first labeler 6, the model creator 7 can create a probability distribution model reflecting the fluctuations of the distributions due to the influence of the experimentation environment.
The model DB 8 is configured to store the probability distribution model created by the model creator 7. Specifically, parameters (average, variance, etc.) of the probability density function for each genotype are stored therein.
The second labeler 9 refers to the representative value DB 5 and extracts SNPs for which one or two clusters are generated. The SNPs for which one or two clusters are generated respectively correspond to the SNPs for which representative values are stored for one or two clusters. For example, in the example of
Next, the second labeler 9 assigns genotypes to the clusters of the respective SNPs that have been extracted. The assignment of the genotypes is carried out using the probability distribution model stored in the model DB 8. More specifically, the second labeler 9 substitutes the representative value of each cluster into the probability density functions of the respective genotypes, and assigns the genotype having the maximum probability density to each cluster.
For example, as illustrated in
The result of determination of the genotype of the SNP classified as pertaining to one or two clusters is generated by the second labeler 9 which applies the result of assignment to the cluster data stored in the cluster DB 3. The result of determination is stored in the determination result DB 10.
The determination result DB 10 stores therein the result of determination of the genotype of each SNP of each specimen. The result of determination is generated by applying the genotypes assigned by the first labeler 6 and the second labeler 9 to the respective clusters stored in the cluster DB 3.
The display 11 is configured to convert the various kinds of information generated by the determination device into image data and video data, and display the image data and video data on the display device 103 (which will be described later). In the example of
Next, a hardware configuration of the determination device according to this embodiment will be described with reference to
The CPU 101 is a control device and a computing device of the computer 100. The CPU 101 performs arithmetic processing based on data and programs input from the individual devices (e.g., the input device 102, the communication device 104, and the storage device 105) connected via the bus 106, and outputs results of calculation and control signals to the devices (e.g., the display device 103, the communication device 104, and the storage device 105) connected via the bus 106.
Specifically, the CPU 101 runs an operating system (OS) of the computer 100, a determination program, and the like, and controls the devices constituting the computer 100. The determination program is a program that causes the computer 100 to implement the above-described functions of the determination device. When the CPU 101 runs the determination program, the computer 100 functions as the determination device.
The input device 102 is a device for inputting information to the computer 100. Examples of the input device 102 include, but are not limited to, a keyboard, a mouse, and a touch panel. By using the input device 102, a user (operator) of the determination device can cause the determination device to start the determination processing or can input the parameters of the probability distribution model.
The display device 103 is a device for displaying images and videos. Examples of the display device 103 include, but are not limited to, an LCD (liquid crystal display), a CRT (cathode ray tube), and a PDP (plasma display panel). Image data generated by the display 11 is displayed on the display device 103.
The communication device 104 is a device for allowing the computer 100 to make wired or wireless communications with an external device. Examples of the communication device 104 include, but are not limited to, a modem, a hub, and a router. Information such as the signal intensity measured by the DNA microarray and the clustering results of the specimens can be input from the external device via the communication device 104.
The storage device 105 is a storage medium that stores therein the OS of the computer 100, the determination program, data necessary for running the determination program, data generated by execution of the determination program, and the like. The storage device 105 includes a main storage device and an external storage device. Examples of the main storage device include, but are not limited to, RAM, DRAM, and SRAM. Also, examples of the external storage device include, but are not limited to, a hard disk, an optical disk, a flash memory, and a magnetic tape. The signal intensity DB 1, the cluster DB 3, the representative value DB 5, the model DB 8, and the determination result DB 10 can be configured using the storage device 105.
It should be noted that the computer 100 may include one or more of the CPU 101, the input device 102, the display device 103, the communication device 104, and the storage device 105, and peripheral devices such as a printer and a scanner may be connected thereto.
Also, the determination device may be constituted by a single computer 100, or may be configured as a system including a plurality of interconnected computers 100.
Further, the determination program may be stored in advance in the storage device 105 of the computer 100, recorded in a computer-readable recording medium such as a CD-ROM, or uploaded on the Internet. In any case, the determination device can be configured by installing the determination program onto the computer 100 and executing it.
Next, the determination processing executed by the determination device according to this embodiment will be described with reference to
First, the outline of the determination processing will be described.
Through the above processing, genotypes are assigned to each cluster of SNPs 1 to “n” of Specimens 1 to “M,” and the determination processing is completed. The result of determination is stored in the determination result DB 10.
Here, details of each process of the above-described steps S1 to S4 will be specifically described.
(Step S1)
First, the representative value calculation process in step S1 will be described.
First, in step S10, the representative value calculator 4 acquires the signal intensity data stored in the signal intensity DB 1 and the cluster data stored in the cluster DB 3.
Next, in step S11, the representative value calculator 4 extracts the signal intensities “A” and “B” of “Cluster j” of SNPi, where “i” is an integer from 1 to “n” and “j” is an integer from 1 to 3. For example, when extracting the signal intensity of “Cluster 1” of SNPi, the representative value calculator 4 first refers to the cluster data of SNPi and extracts the specimens of “Cluster 1” as illustrated in
Next, the representative value calculator 4 refers to the signal intensity data and extracts the signal intensities “A” and “B” of the specimens of “Cluster 1.” As a result, as illustrated in
Subsequently, in step S12, the representative value calculator 4 calculates a representative value “CLU(i,j)” of “Cluster j” of SNPi. The representative value “CLU(i,j)” is the slope (angle) of the approximate straight line of “Cluster j.”
CLU(i,j)=tan⁻¹((average B(i,j))/(average A(i,j))) . . . (1)
In the expression (1), B(i,j) is the signal intensity “B” of “Cluster j” of SNPi, and A(i,j) is the signal intensity “A” of “Cluster j” of SNPi. The coordinates of the cluster center of “Cluster j” of SNPi are (average A(i,j),average B(i,j)). The representative value calculator 4 calculates the representative value “CLU(i,j)” by assigning the signal intensities “A” and “B” of “Cluster j” of SNPi extracted in step S11.
Further, in step S13, the representative value calculator 4 stores the calculated representative value “CLU(i,j)” in the representative value DB 5.
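The calculation in step S12 can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the function name `representative_value` and the list-based inputs are hypothetical, and the angle is returned in degrees on the assumption that the representative value is compared against the 45-degree line.

```python
import math


def representative_value(intensities_a, intensities_b):
    """Angle, in degrees, of the line from the origin through the cluster center.

    intensities_a / intensities_b hold the signal intensities "A" and "B"
    of the specimens belonging to one cluster of one SNP.
    """
    if not intensities_a or len(intensities_a) != len(intensities_b):
        raise ValueError("need the same non-zero number of A and B intensities")
    mean_a = sum(intensities_a) / len(intensities_a)
    mean_b = sum(intensities_b) / len(intensities_b)
    # atan2(mean_b, mean_a) equals atan(mean_b / mean_a) for positive mean_a
    # and avoids a division by zero when mean_a is 0.
    return math.degrees(math.atan2(mean_b, mean_a))
```

For a cluster whose center has equal average intensities “A” and “B,” this sketch returns 45 degrees; a cluster dominated by the signal “A” yields a smaller angle, and one dominated by the signal “B” a larger angle.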
As illustrated in FIGS. 25 to 27, the representative value DB 5 may have different tables for the respective numbers of clusters of SNPs. Further, as illustrated in
(Step S2)
Next, the genotype assignment processing for three-cluster SNPs (SNPs classified as pertaining to the three clusters) in step S2 will be described.
First, in step S20, the first labeler 6 acquires representative value data of three-cluster SNPi from the representative value DB 5. As a result, a table as illustrated in
Next, in step S21, the first labeler 6 refers to the cluster data and assigns genotypes to “Clusters 1” to “3” of each SNPi.
As illustrated in
Subsequently, in step S22, the first labeler 6 applies the result of assignment of the genotypes for SNPi to the cluster data. Specifically, the first labeler 6 replaces the cluster of each specimen of SNPi stored in the cluster DB 3 with the genotype assigned to each cluster of SNPi.
When the first labeler 6 applies the result of assignment, the result of determination of the genotypes of the three-cluster SNP as illustrated in
In addition, in step S23, the generated result of determination is stored in the determination result DB 10.
Also, in step S24, the first labeler 6 applies the result of assignment of the genotypes for SNPi to the representative value data. Specifically, the first labeler 6 replaces the “Cluster j” of each representative value “CLU(i,j)” stored in the representative value DB 5 with the genotype assigned to each “Cluster j” of SNPi, and sorts the representative values by genotype. As a result, the representative value DB 5 is updated.
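The assignment for a three-cluster SNP can be sketched as follows. This is only an illustration: the function name `label_three_clusters` and the dictionary layout are hypothetical, and the representative value is assumed to be one that increases with the signal intensity “B” (such as the angle of the cluster center), so that the genotypes “BB,” “AB,” and “AA” are assigned in descending order.

```python
def label_three_clusters(rep_values):
    """Map each cluster id of a three-cluster SNP to a genotype.

    rep_values: dict of cluster id -> representative value, where the value
    is assumed to grow with the signal intensity "B".
    """
    if len(rep_values) != 3:
        raise ValueError("this rule applies only to SNPs with three clusters")
    # Largest representative value first: BB, then AB, then AA.
    ordered = sorted(rep_values, key=rep_values.get, reverse=True)
    return dict(zip(ordered, ("BB", "AB", "AA")))
```

If the representative value instead increases with the signal intensity “A,” the tuple of genotypes would be reversed.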
(Step S3)
Next, the process of creating the probability distribution model in step S3 will be described.
First, in step S30, the model creator 7 acquires representative value data of SNPs of the three clusters stored in the representative value DB 5. As a result, the updated representative value data as illustrated in
Next, in step S31, the model creator 7 extracts a representative value for each genotype. As illustrated in
Subsequently, in step S32, the model creator 7 calculates an average “μ” and a variance “σ” of each genotype. Specifically, the model creator 7 calculates the average “μAA” and variance “σAA” of the set “CLUAA,” the average “μAB” and variance “σAB” of the set “CLUAB,” and the average “μBB” and variance “σBB” of the set “CLUBB.”
In addition, in step S33, the model creator 7 applies the averages “μ” and variances “σ” of the respective genotypes to the normal distribution, and generates the probability density function f(x) for each genotype. The probability density functions are expressed by the following expressions.
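Expressions (3) to (5) are not legible in this text. As a sketch only, assuming that each density is a normal density and that “σ” denotes the variance as stated above, the three functions would take the following common form for g = AA, AB, and BB. If “σ” was instead intended as a standard deviation, σ_g would be replaced by σ_g² throughout.

```latex
f_{g}(x) = \frac{1}{\sqrt{2\pi\sigma_{g}}}
           \exp\!\left( -\frac{(x-\mu_{g})^{2}}{2\sigma_{g}} \right),
\qquad g \in \{AA,\ AB,\ BB\}
```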
In the above expressions (3) to (5), “x” is a representative value “CLU,” “fAA(x)” is the probability density function of the genotype “AA,” “fAB(x)” is the probability density function of the genotype “AB,” and “fBB(x)” is the probability density function of the genotype “BB.” The set of the above three probability density functions constitutes the probability distribution model.
After creating the probability distribution model, the model creator 7 stores the probability distribution model in the model DB 8 in step S34. In the model DB 8, the averages “μ” and the variances “σ” for the respective genotypes are stored.
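Steps S31 to S34 can be sketched as follows. This is a minimal illustration under stated assumptions: the names `fit_model` and `density` are hypothetical, each genotype is modeled by a single normal distribution, and the population variance is used because the text does not say whether a population or sample variance is intended.

```python
import math
from statistics import fmean, pvariance


def fit_model(values_by_genotype):
    """Return {genotype: (average, variance)} from three-cluster SNP data.

    values_by_genotype: dict of genotype -> list of representative values
    collected from the clusters labeled with that genotype.
    """
    model = {}
    for genotype, values in values_by_genotype.items():
        mu = fmean(values)
        # Population variance is an assumption of this sketch.
        model[genotype] = (mu, pvariance(values, mu))
    return model


def density(x, mu, var):
    """Normal probability density with average mu and variance var."""
    if var <= 0:
        raise ValueError("variance must be positive")
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
```

Only the pair (average, variance) per genotype needs to be kept, which matches the parameters described as being stored in the model DB 8.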
(Step S4)
Next, the genotype assignment processing for one- or two-cluster SNPs (SNP classified as pertaining to the one cluster or SNP classified as pertaining to the two clusters) in step S4 will be described.
First, in step S40, the second labeler 9 acquires the representative value data of the one-cluster SNP or the two-cluster SNP stored in the representative value DB 5. As a result, the representative value data as illustrated in FIGS. 26 and 27 is acquired.
Also, in step S41, the second labeler 9 acquires the probability distribution model stored in the model DB 8. As a result, the probability distribution model illustrated in
Next, in step S42, the second labeler 9 applies the representative value “CLU(i,j)” to the probability distribution model. Specifically, as illustrated in
Subsequently, in step S43, the second labeler 9 assigns a genotype having the maximum probability density “f(CLU(i,j))” to “Cluster j” of SNPi. For example, in the example of
In addition, in step S44, the second labeler 9 applies the result of assignment of the genotypes for SNPi to the cluster data. Specifically, the second labeler 9 replaces the cluster of each specimen of SNPi stored in the cluster DB 3 with the genotype assigned to each cluster of SNPi. The method of applying the result of assignment is the same as in step S22.
When the second labeler 9 applies the result of assignment, the determination result of genotype of one-cluster SNP or two-cluster SNP as illustrated in
In addition, in step S45, the generated result of determination is stored in the determination result DB 10. As a result, the determination of the genotypes of the SNPs 1 to “n” of the specimens 1 to “M” is completed.
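The assignment in steps S42 and S43 can be sketched as follows. This is an illustration only: the function names and the model layout (genotype mapped to an average and a variance of a normal density) are hypothetical, and the numerical parameters in the usage note are invented for demonstration and are not taken from any measurement.

```python
import math


def normal_density(x, mu, var):
    """Normal probability density with average mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def label_by_model(rep_value, model):
    """Return the genotype whose density is largest at rep_value.

    model: dict of genotype -> (average, variance), as produced from the
    three-cluster SNPs.
    """
    return max(model, key=lambda g: normal_density(rep_value, *model[g]))
```

For example, with a hypothetical model `{"AA": (15.0, 16.0), "AB": (48.0, 25.0), "BB": (80.0, 16.0)}`, a cluster whose representative value is 50.0 would be labeled “AB,” because the density of the genotype “AB” is the largest at that value. The “AB” average of 48.0 in this example stands in for a distribution shifted away from 45 degrees by the experimentation environment.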
As described above, according to this embodiment, the genotype is determined by using the probability distribution model reflecting the fluctuation of distribution due to the influence of the experimentation environment. Accordingly, errors in genotype assignment due to the influence of the experimentation environment can be suppressed, and the accuracy of genotyping can be improved.
(Second Embodiment)
A second embodiment will be described below with reference to
The third labeler 12 is configured to acquire the result of the genotype assignment by the second labeler 9 and determine whether or not the reliability of the result of assignment is high.
If it is determined that the reliability of the result of assignment is high, the third labeler 12 outputs the result of assignment of the second labeler 9 on an as-is basis. On the other hand, if it is determined that the reliability of the result of assignment is low, the third labeler 12 reassigns the genotypes. In addition, the third labeler 12 outputs the result of assignment of the reassigned genotypes.
According to this embodiment, the results of determination of the genotypes of one-cluster and two-cluster SNPs are generated by applying the result of assignment that has been output by the third labeler 12 to the cluster data stored in the cluster DB 3.
First, in step S50, the third labeler 12 acquires the result of the genotype assignment for SNPi from the second labeler 9. The SNPi acquired here is a one-cluster or two-cluster SNP.
Next, in step S51, the third labeler 12 determines whether or not the acquired SNPi is a two-cluster SNP. When the SNPi is a two-cluster SNP (Yes), the process proceeds to step S52.
In step S52, the third labeler 12 determines whether or not the two genotypes assigned to the two-cluster SNPi are different genotypes. If they are different genotypes (Yes), the process proceeds to step S53.
In step S53, the third labeler 12 determines whether or not the genotype “AB” is included in the two genotypes assigned to the two-cluster SNPi. When the genotype “AB” is included (Yes), the third labeler 12 outputs the result of assignment acquired from the second labeler 9 on an as-is basis, and the reassignment processing is completed.
On the other hand, if the genotype “AB” is not included in the two genotypes in step S53 (No), the process proceeds to step S54.
In step S54, the third labeler 12 reassigns the genotype to the two clusters, i.e., the “Clusters 1 and 2” of SNPi using an assignment method A. The assignment method A will be described later. Thereafter, the third labeler 12 outputs the result of assignment of the reassigned genotype, and the reassignment process is completed.
Also, if the two genotypes assigned to the two-cluster SNPi are the same in step S52 (No), the process proceeds to step S55.
In step S55, the third labeler 12 determines whether or not the genotype assigned to SNPi is “AB.” If the genotype “AB” is assigned to SNPi (Yes), the process proceeds to step S56.
In step S56, the third labeler 12 reassigns the genotype to the two clusters, i.e., the “Clusters 1 and 2” of SNPi using an assignment method B. The assignment method B will be described later. Thereafter, the third labeler 12 outputs the result of assignment of the reassigned genotype, and the reassignment process is completed.
On the other hand, if the genotype “AB” has not been assigned to SNPi in step S55 (No), the process proceeds to step S57.
In step S57, the third labeler 12 reassigns the genotypes to the two clusters, i.e., the “Clusters 1 and 2” of SNPi using an assignment method C. The assignment method C will be described later. Thereafter, the third labeler 12 outputs the result of assignment of the reassigned genotype, and the reassignment process is completed.
Further, in step S51, if SNPi is of one cluster (No), the process proceeds to step S58.
In step S58, the third labeler 12 determines whether or not the genotype assigned to SNPi is “AB.” When the genotype “AB” is assigned to SNPi (Yes), the process proceeds to step S59.
In step S59, the third labeler 12 reassigns the genotype to one cluster, i.e., “Cluster 1” of the SNPi using an assignment method D. The assignment method D will be described later. Thereafter, the third labeler 12 outputs the result of assignment of the reassigned genotype, and the reassignment process is completed.
On the other hand, if the genotype “AB” is not assigned to SNPi in step S58 (No), the third labeler 12 outputs the result of assignment acquired from the second labeler 9 on an as-is basis, and the reassignment process is completed.
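The branching of steps S51 to S59 can be sketched as a small dispatch function. This is only an illustrative sketch: the function name `select_assignment_method` and the convention of returning `None` when the result of the second labeler 9 is output as is are assumptions made for this example and are not part of the embodiment.

```python
from typing import List, Optional


def select_assignment_method(genotypes: List[str]) -> Optional[str]:
    """Return "A", "B", "C" or "D", or None when the assignment is kept as is.

    `genotypes` holds the genotype assigned to each cluster of SNPi
    (one entry for a one-cluster SNP, two entries for a two-cluster SNP).
    """
    if len(genotypes) == 2:                  # S51: two-cluster SNP
        first, second = genotypes
        if first != second:                  # S52: different genotypes
            if "AB" in genotypes:            # S53: "AB" is included
                return None                  # output as is
            return "A"                       # S54: "AA" and "BB"
        if first == "AB":                    # S55: both clusters are "AB"
            return "B"                       # S56
        return "C"                           # S57: both "AA" or both "BB"
    if len(genotypes) == 1:                  # S51: one-cluster SNP
        if genotypes[0] == "AB":             # S58
            return "D"                       # S59
        return None                          # output as is
    raise ValueError("only one-cluster and two-cluster SNPs are handled")
```

For example, `select_assignment_method(["AA", "BB"])` selects the assignment method A, whereas `select_assignment_method(["AA", "AB"])` returns `None` because the assignment is regarded as reliable.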
Next, the assignment methods A to D will be described.
(Assignment Method A)
The assignment method A will be described first. Reassignment by the assignment method A is carried out when the genotypes “AA” and “BB” are assigned to the two clusters of “Clusters 1 and 2” of SNPi.
The possibility that the genotypes of a certain ethnic group of humans consist exclusively of the genotype “AA” and the genotype “BB” is considered to be biologically extremely low. This is because a child between a mother (father) of the genotype “AA” and a father (mother) of the genotype “BB” receives one allele from each parent and therefore has the genotype “AB.” Accordingly, from a biological point of view, the reliability of this result of assignment is determined to be low.
In such a case, the third labeler 12 first acquires the probability distribution model and the representative value data of SNPi. As a result, the probability density functions “fAA(x),” “fAB(x),” and “fBB(x),” the representative value “CLU(i,1)” of “Cluster 1,” and the representative value “CLU(i,2)” of “Cluster 2” are acquired.
Next, the third labeler 12 substitutes the representative values into the probability density function “fAB(x)” to calculate a probability density “fAB(CLU(i,1))” and a probability density “fAB(CLU(i,2)).” In addition, the third labeler 12 reassigns the genotype “AB” to the cluster having the higher probability density “fAB(x).” The genotype of the cluster with the lower probability density “fAB(x)” remains unchanged.
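The assignment method A can be sketched as follows. The sketch assumes, purely for illustration, that “fAB(x)” is a normal density over a one-dimensional representative value; the function name `reassign_method_a`, the model parameters, and the handling of equal densities are hypothetical and are not specified by the embodiment.

```python
from statistics import NormalDist
from typing import Callable, List, Sequence


def reassign_method_a(labels: Sequence[str], reps: Sequence[float],
                      f_ab: Callable[[float], float]) -> List[str]:
    """Relabel as "AB" the cluster whose representative value has the higher fAB."""
    densities = [f_ab(x) for x in reps]   # fAB(CLU(i,1)) and fAB(CLU(i,2))
    # Equal densities are not covered by the text; Cluster 1 is chosen here.
    target = 0 if densities[0] >= densities[1] else 1
    new_labels = list(labels)
    new_labels[target] = "AB"             # the other cluster is left unchanged
    return new_labels


# Hypothetical model: fAB is a normal density centred at 0.
f_ab = NormalDist(mu=0.0, sigma=0.3).pdf
result = reassign_method_a(["AA", "BB"], [0.9, -0.2], f_ab)
```

With these assumed values, the representative value of “Cluster 2” lies closer to the centre of “fAB(x),” so “Cluster 2” is relabeled “AB” and “Cluster 1” remains “AA.”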
(Assignment Method B)
Next, the assignment method B will be described. Reassignment by the assignment method B is carried out when the genotype “AB” is assigned to the two clusters of “Clusters 1 and 2” of SNPi. Since the same genotype is assigned to the two clusters, the reliability of this assignment result is determined to be low.
In such a case, the third labeler 12 first acquires the probability distribution model and the representative value data of SNPi. As a result, the probability density functions “fAA(x),” “fAB(x),” and “fBB(x),” the representative value “CLU(i,1)” of “Cluster 1,” and the representative value “CLU(i,2)” of “Cluster 2” are acquired.
Next, the third labeler 12 substitutes the representative values into the probability density function “fAB(x)” to calculate the probability density “fAB(CLU(i,1))” and the probability density “fAB(CLU(i,2)).” In addition, the third labeler 12 reassigns the genotype “AA” or “BB” to the cluster having the lower probability density “fAB(x).” The genotype of the cluster with the higher probability density “fAB(x)” remains “AB.”
The third labeler 12 calculates the probability densities “fAA(x)” and “fBB(x)” of the cluster having the lower probability density “fAB(x).” In the case of “fAA(x)”>“fBB(x),” the third labeler 12 reassigns the genotype “AA” to the cluster having the lower probability density “fAB(x).” On the other hand, in the case of “fAA(x)”<“fBB(x),” the third labeler 12 reassigns the genotype “BB” to the cluster having the lower probability density “fAB(x).”
With regard to the assignment method B, the reason why the genotype of one of the clusters is left as “AB” is that the possibility that the genotype results exclusively in “AA” or “BB” is considered to be biologically extremely low as mentioned above.
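The assignment method B can be sketched as follows. The three normal densities, their parameters, and the function name `reassign_method_b` are assumptions made for illustration. The case in which “fAA(x)” equals “fBB(x)” is not described in the embodiment; the sketch leaves the genotype as “AB” in that case.

```python
from statistics import NormalDist
from typing import Callable, List, Sequence

Density = Callable[[float], float]


def reassign_method_b(reps: Sequence[float], f_aa: Density,
                      f_ab: Density, f_bb: Density) -> List[str]:
    """Both clusters start as "AB"; relabel the one with the lower fAB."""
    low = 0 if f_ab(reps[0]) < f_ab(reps[1]) else 1
    x = reps[low]
    labels = ["AB", "AB"]
    if f_aa(x) > f_bb(x):
        labels[low] = "AA"
    elif f_aa(x) < f_bb(x):
        labels[low] = "BB"
    # Equal densities: left as "AB" (assumption, not stated in the text).
    return labels


# Hypothetical probability distribution model.
f_aa = NormalDist(mu=1.0, sigma=0.3).pdf
f_ab = NormalDist(mu=0.0, sigma=0.3).pdf
f_bb = NormalDist(mu=-1.0, sigma=0.3).pdf
result = reassign_method_b([0.1, -0.8], f_aa, f_ab, f_bb)
```

Here “Cluster 2” has the lower “fAB(x)” and lies nearer to the assumed centre of “fBB(x),” so it is relabeled “BB” while “Cluster 1” remains “AB.”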
(Assignment Method C)
Next, the assignment method C will be described. Reassignment by the assignment method C is carried out when the genotype “AA” or the genotype “BB” is assigned to both of the two clusters of “Clusters 1 and 2” of SNPi. Since the same genotype is assigned to the two clusters, the reliability of this assignment result is determined to be low.
In such a case, the third labeler 12 first acquires the probability distribution model and the representative value data of SNPi. As a result, the probability density functions “fAA(x),” “fAB(x),” and “fBB(x),” the representative value “CLU(i,1)” of “Cluster 1” and the representative value “CLU(i,2)” of “Cluster 2” are acquired.
When the genotype “AA” is assigned to “Clusters 1 and 2,” the third labeler 12 substitutes each representative value into the probability density function “fAA(x)” to calculate the probability density “fAA(CLU(i,1))” and the probability density “fAA(CLU(i,2)).” In addition, the third labeler 12 reassigns the genotype “AB” to the cluster having the lower probability density “fAA(x).” The genotype of the cluster with the higher probability density “fAA(x)” remains “AA.”
On the other hand, when the genotype “BB” is assigned to “Clusters 1 and 2,” the third labeler 12 substitutes each representative value into the probability density function “fBB(x)” to calculate the probability density “fBB(CLU(i,1))” and the probability density “fBB(CLU(i,2)).” In addition, the third labeler 12 reassigns the genotype “AB” to the cluster having the lower probability density “fBB(x).” The genotype of the cluster with the higher probability density “fBB(x)” remains “BB.”
In the assignment method C, the reason why the genotype of one cluster is reassigned to AB is that the possibility that the genotype is divided only to AA or BB is considered to be biologically extremely low as mentioned above.
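The assignment method C can be sketched as follows. The function name `reassign_method_c`, the normal densities, and their parameters are hypothetical; the caller is assumed to pass the density function that corresponds to the genotype shared by both clusters (“fAA(x)” for “AA,” “fBB(x)” for “BB”).

```python
from statistics import NormalDist
from typing import Callable, List, Sequence


def reassign_method_c(genotype: str, reps: Sequence[float],
                      f_same: Callable[[float], float]) -> List[str]:
    """Both clusters share `genotype`; relabel the lower-density one as "AB"."""
    if genotype not in ("AA", "BB"):
        raise ValueError("method C applies only to 'AA' or 'BB'")
    low = 0 if f_same(reps[0]) < f_same(reps[1]) else 1
    labels = [genotype, genotype]
    labels[low] = "AB"          # the higher-density cluster keeps its genotype
    return labels


# Hypothetical probability distribution model.
f_aa = NormalDist(mu=1.0, sigma=0.3).pdf
f_bb = NormalDist(mu=-1.0, sigma=0.3).pdf
result_aa = reassign_method_c("AA", [0.95, 0.3], f_aa)
result_bb = reassign_method_c("BB", [-0.2, -1.1], f_bb)
```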
(Assignment Method D)
Next, the assignment method D will be described. Reassignment by the assignment method D is carried out when the genotype “AB” is assigned to one-cluster SNPi.
The possibility that all the members of a certain ethnic group of humans have the genotype “AB” is considered to be biologically extremely low. This is because, if both of the parents have the genotype “AB,” a homozygous child having the genotype “AA” or “BB” appears with a probability of about 50%. In addition, if the genotype of all members of a large population is “AB,” then only the combination of a mother (father) of the genotype “AA” and a father (mother) of the genotype “BB” can be considered as the parents of the individuals. Accordingly, from a biological point of view, the reliability of this result of assignment is determined to be low.
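The probability of about 50% mentioned above can be checked by enumerating the allele combinations of two “AB” parents. The following is a minimal sketch assuming simple Mendelian inheritance in which each parent passes on either allele with equal probability; the function name `offspring_distribution` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from typing import Dict


def offspring_distribution(mother: str, father: str) -> Dict[str, Fraction]:
    """Enumerate the four equally likely allele pairs of a child."""
    counts = Counter("".join(sorted(m + f)) for m, f in product(mother, father))
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


dist = offspring_distribution("AB", "AB")
homozygous = dist["AA"] + dist["BB"]
```

Under this assumption, the child is “AA” with probability 1/4, “AB” with probability 1/2, and “BB” with probability 1/4, so a homozygous child appears with probability 1/2.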
In such a case, the third labeler 12 first acquires the probability distribution model and the representative value data of SNPi. As a result, the probability density functions “fAA(x),” “fAB(x),” and “fBB(x)” and the representative value “CLU(i,1)” of “Cluster 1” are acquired.
Next, the third labeler 12 substitutes the representative value “CLU(i,1)” into the probability density functions “fAA(x)” and “fBB(x)” to calculate the probability densities “fAA(CLU(i,1))” and “fBB(CLU(i,1)).” In addition, in the case of “fAA(CLU(i,1))”>“fBB(CLU(i,1)),” the third labeler 12 reassigns the genotype “AA” to “Cluster 1,” and in the case of “fAA(CLU(i,1))”<“fBB(CLU(i,1)),” the third labeler 12 reassigns the genotype “BB” to “Cluster 1.”
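The assignment method D reduces to a comparison of two probability densities at a single representative value, which can be sketched as follows. The normal densities, their parameters, and the function name `reassign_method_d` are assumptions; the case of equal densities is not described in the embodiment, and the sketch keeps “AB” in that case.

```python
from statistics import NormalDist
from typing import Callable

Density = Callable[[float], float]


def reassign_method_d(rep: float, f_aa: Density, f_bb: Density) -> str:
    """Return the genotype reassigned to "Cluster 1" of a one-cluster SNP."""
    if f_aa(rep) > f_bb(rep):
        return "AA"
    if f_aa(rep) < f_bb(rep):
        return "BB"
    return "AB"   # equal densities: unchanged (assumption, not in the text)


# Hypothetical probability distribution model.
f_aa = NormalDist(mu=1.0, sigma=0.3).pdf
f_bb = NormalDist(mu=-1.0, sigma=0.3).pdf
genotype = reassign_method_d(0.4, f_aa, f_bb)
```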
As described above, according to this embodiment, it is possible to reassign a genotype to a cluster to which a genotype with low reliability is assigned by using biological knowledge. Accordingly, the reliability of genotype assignment is improved, and as a result, the accuracy of genotyping can be improved.
(Third Embodiment)
A third embodiment will be described below with reference to
The second representative value may be calculated based on the signal intensities “A” and “B.” Such a representative value may include, for example, a regression coefficient of a regression line of each cluster, an arc tangent of a regression coefficient, a gradient of an approximate straight line passing through the origin, a correlation coefficient of each cluster, a cluster center value, a cluster median value, a cluster variance, an average value of ratios, and an average value of differences.
Also, the second representative value may not be calculated based on the signal intensities “A” and “B.” As such a representative value, for example, the number of specimens can be mentioned. The number of specimens is the number of specimens included in each cluster.
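Some of the candidate second representative values listed above can be computed for one cluster as in the following sketch. The function name `second_representative_values`, the dictionary keys, the direction of the ratio (B divided by A), and the direction of the difference (A minus B) are assumptions, since the embodiment does not fix them; the gradient through the origin is taken here as the least-squares fit of B on A.

```python
import math
from statistics import mean
from typing import Dict, Sequence


def second_representative_values(a: Sequence[float],
                                 b: Sequence[float]) -> Dict[str, object]:
    """Compute several candidate representative values for one cluster.

    `a` and `b` are the signal intensities "A" and "B" of its specimens.
    """
    # Least-squares gradient of a straight line through the origin (b = k * a).
    gradient = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)
    return {
        "origin_gradient": gradient,
        "arctan_gradient": math.atan(gradient),
        "mean_ratio": mean(y / x for x, y in zip(a, b)),       # assumed B / A
        "mean_difference": mean(x - y for x, y in zip(a, b)),  # assumed A - B
        "center": (mean(a), mean(b)),
        "number_of_specimens": len(a),   # not based on the signal intensities
    }


values = second_representative_values([1.0, 2.0, 4.0], [2.0, 4.0, 8.0])
```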
According to this embodiment, the method of determining the reliability of genotypes by the third labeler 12 is the same as that of the second embodiment (see the flowchart of
(Assignment Method A)
First, the assignment method A will be described. Reassignment by the assignment method A is carried out when the genotypes “AA” and “BB” are assigned to the two clusters of “Clusters 1 and 2” of SNPi.
According to this embodiment, the third labeler 12 reassigns the genotype “AB” to a cluster having a small number of specimens. This is because clusters with a small number of specimens are considered to have low reliability in their genotype assignment. The genotype of the cluster with many specimens is left unchanged.
(Assignment Method B)
Next, the assignment method B will be described. Reassignment by the assignment method B is carried out when the genotype “AB” is assigned to the two clusters of “Clusters 1 and 2” of SNPi.
According to this embodiment, the third labeler 12 reassigns the genotype “AA” or “BB” to the cluster having the smaller number of specimens. This is because clusters with a small number of specimens are considered to have low reliability in their genotype assignment. The genotype of the cluster with the larger number of specimens remains “AB.”
The third labeler 12 should reassign a genotype to a cluster having a small number of specimens in the same manner as in the second embodiment. Specifically, the third labeler 12 calculates the probability densities “fAA(x)” and “fBB(x),” reassigns the genotype “AA” in the case of “fAA(x)”>“fBB(x),” and reassigns the genotype “BB” in the case of “fAA(x)”<“fBB(x).”
(Assignment Method C)
Next, the assignment method C will be described. Reassignment by the assignment method C is carried out when the genotype “AA” or the genotype “BB” is assigned to both of the two clusters of “Clusters 1 and 2” of SNPi.
According to this embodiment, the third labeler 12 reassigns the genotype “AB” to a cluster having a small number of specimens. This is because clusters with a small number of specimens are considered to have low reliability in terms of the genotype assignment. The genotypes of the clusters with many specimens are left unchanged.
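The assignment methods A to C of this embodiment can be sketched together, because all three select the cluster with the smaller number of specimens. The function name `reassign_by_specimen_count`, the normal densities, and the tie handling are assumptions for illustration. The caller is assumed to have already determined that one of the methods A to C applies.

```python
from statistics import NormalDist
from typing import Callable, List, Sequence

Density = Callable[[float], float]


def reassign_by_specimen_count(labels: Sequence[str], counts: Sequence[int],
                               reps: Sequence[float],
                               f_aa: Density, f_bb: Density) -> List[str]:
    """Relabel the cluster that contains fewer specimens."""
    # Equal counts are not covered by the text; Cluster 2 is chosen here.
    small = 0 if counts[0] < counts[1] else 1
    new_labels = list(labels)
    if labels[0] == labels[1] == "AB":
        # Method B: choose "AA" or "BB" as in the second embodiment.
        x = reps[small]
        if f_aa(x) > f_bb(x):
            new_labels[small] = "AA"
        elif f_aa(x) < f_bb(x):
            new_labels[small] = "BB"
    else:
        # Methods A and C: the smaller cluster becomes "AB".
        new_labels[small] = "AB"
    return new_labels


# Hypothetical probability distribution model.
f_aa = NormalDist(mu=1.0, sigma=0.3).pdf
f_bb = NormalDist(mu=-1.0, sigma=0.3).pdf
method_a = reassign_by_specimen_count(["AA", "BB"], [120, 7], [0.9, -0.9], f_aa, f_bb)
method_b = reassign_by_specimen_count(["AB", "AB"], [6, 150], [0.8, 0.0], f_aa, f_bb)
method_c = reassign_by_specimen_count(["BB", "BB"], [5, 90], [-0.5, -1.0], f_aa, f_bb)
```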
As explained above, according to this embodiment, genotypes are reassigned using the second representative value. If the reliability of the genotype assignment is low due to the low reliability of the first representative value, the reliability of the assignment of the genotypes can be improved through the reassignment using the second representative value, which leads to improvement of the accuracy of the genotyping.
It should be noted that, with regard to the assignment methods A to C, it is also possible to use the method of this embodiment and the method of the second embodiment in combination. For example, a threshold value “α” of the number of specimens may be set. If at least one of the numbers of specimens in “Clusters 1 and 2” is equal to or less than the threshold value “α,” the genotype is reassigned by the method of this embodiment, and if both of the numbers of specimens are greater than the threshold value “α,” the genotype is reassigned by the method of the second embodiment.
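The combined use described above amounts to a single threshold test, sketched below. The function name `choose_reassignment_embodiment` and the returned strings are hypothetical labels used only in this example.

```python
from typing import Sequence


def choose_reassignment_embodiment(counts: Sequence[int], alpha: int) -> str:
    """Pick the reassignment method from the numbers of specimens per cluster."""
    # "At least one count is equal to or less than alpha" is equivalent to
    # "the smallest count is equal to or less than alpha".
    if min(counts) <= alpha:
        return "third embodiment"    # reassign by the number of specimens
    return "second embodiment"       # reassign by the probability density


choice_small = choose_reassignment_embodiment([4, 200], alpha=10)
choice_large = choose_reassignment_embodiment([80, 200], alpha=10)
```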
In addition, the model creator 7 may create a second probability distribution model on the basis of the second representative value, the model DB 8 may store the second probability distribution model, and the third labeler 12 may carry out the reassignment of the genotypes on the basis of the second representative value and the second probability distribution model.
Further, the representative value calculator 4 may calculate three or more representative values for each cluster, and the third labeler 12 may carry out the reassignment of the genotypes using two or more types of representative values other than the first representative value.
(Fourth Embodiment)
A fourth embodiment will be described below with reference to
In the screen of
In the screen of
Since the display 11 displays such a screen, the user of the determination device can readily grasp the clusters and the representative values. It should be noted that when a plurality of types of representative values are calculated as in the third embodiment, the representative value table in
In the screen of
In the screen of
Since the display 11 displays such a screen, the user of the determination device can readily grasp the results of determination (assignment result) of the clusters and the genotypes.
In the screen of
In the screen of
Also, on the graph of
Since the display 11 displays such a screen, the user of the determination device can readily grasp the created probability distribution model and the basis (probability density) of the genotype assignment.
It should be noted that, when the genotype is reassigned by the third labeler 12, the probability density used in the reassignment may be plotted on the probability density function as illustrated in
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.
The present application is a Continuation of International Application No. PCT/JP2015/060368, filed on Apr. 1, 2015, the entire contents of which are hereby incorporated by reference.
 | Number | Date | Country
---|---|---|---
Parent | PCT/JP2015/060368 | Apr 2015 | US
Child | 15693268 | | US