The present invention relates to a method and system for statistical analysis of cDNA microarray data using two different fluorescence dyes, and a recording medium for the same. In particular, the present invention relates to a system, method, and program for estimating the probability of gene expression in each channel, and a recording medium for the same.
Currently, the study of genomics is expanding from structural analysis on individual genes to systematic functional analysis of genes. Experiments using cDNA (complementary DNA) microarrays capable of simultaneously quantifying the expression levels of a large number of genes are expected to be extremely effective in functional analysis of functionally unknown genes or whole genes.
The objective of experiments using cDNA microarrays with two different fluorescence dyes is to detect the difference in gene expression level between two kinds of cells. The following gives a summary of a cDNA microarray configuration with two different fluorescence dyes. First, cDNAs of a large number of sets of genes are densely fixed on glass slides in arrays (microarrays) as reference probes.
Next, mRNAs extracted from two kinds of different conditional samples, cell 1 and cell 2 (e.g., normal cell and cancer cell), are labeled respectively with fluorescence dyes different in wavelength from each other to synthesize target cDNAs. Then, these cDANs are mixed in equal proportions, and hybridized with the microarrayed cDNAs or reference probes. After this competitive hybridization, the glass slides are imaged using a scanner and fluorescence intensities are measured separately for each dye. The fluorescence dye with which the cell 1 is labeled and the fluorescence dye with which the cell 2 is labeled are read from channel 1 and channel 2, respectively, to obtain gene expression level data (microarray data).
Thus, since the process of obtaining microarray data is so complicated as to require advanced experimental techniques, it is conceivable that several experimental errors could occur at each stage of the experiment. Therefore, in order to retrieve data to be truly biologically significant from the microarray data, analyzing expression level distributions and experimental errors presents a significant challenge to be solved.
In regard to the expression level distributions, prior art document 1 (Newton et. al., 2001, Journal of Computational Biology, Vol. 8, pp. 37-52) can be referred to, for example, in which Newton et. al. forms a hypothesis that proposes the use of the gamma distribution function to help analyze expression levels so as to consider statistical characteristics about the ratio of expression levels (the ratio of expression levels in channel 1 and channel 2).
f(x)=pφ(x−μ1|σ12)+(1−p)φ(x−μ2|σ22) (1)
In regard to the observed expression level data, prior art document 2 (Lee et. al., 2000, Proceeding of the National Academy of Sciences, Vol. 97, No. 18, pp. 9834-9839) can be referred to, for example. Assuming the ability to separate true expression levels into two levels and the existence of accidental errors, Lee et. al. adopts a mixed normal distribution as shown in the following equation (1) to consider statistical characteristics about the expression level data:
Here, x denotes (the logarithmic value of) a gene expression level such as fluorescence intensity obtained with a scanner or the like. Further, the first term of the right hand side, φ(x−μ1|σ12), represents a normal distribution with average μ1 and variance σ12 when a gene is being expressed, the second term φ(x−μ2|σ22) represents the density function of a normal distribution with average μ2 and variance σ22 when no gene is being expressed, and p is a parameter representing the mixing ratio.
In regard to the analysis of the experimental errors, there have been proposed several methods of removing systematic errors, so-called normalization methods. For example, when referring to prior art document 3 (Chen et. al., 1997, Journal of Biomedical Optics, Vol. 2, pp. 364-374), Chen et. al. assumes that the median values of gene expression levels of two cells are equal to correct for the measured values obtained from channel 1 and channel 2, respectively. Further, when referring to prior art document 4 (Dudoit et. al., 2000, “Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments,” Technical Report #5782), prior art document 5 (Schuchhardt et. al., 2000, Nucleic Acids Research, Vol.28, No. 10), and prior art document 6 (Yang et. al., 2002, Nucleic Acids Research, Vol.30, No.4), Dudoit, Schuchhardt, and Yang consider that systematic errors are caused by different locations of spots on glass slides or different sensitivities of the two kinds of fluorescence dyes, and propose methods of removing the errors.
The above-mentioned prior art problems are derived from the fact that the analytical results of microarray data lack reproducibility because of low precision and efficiency. It is considered that the cause is insufficient separation of microarray data into true signals, and systematic and measurement errors in the conventional analytical methods. Therefore, removal of systematic errors and evaluation of measurement errors are important issues.
In regard to the removal of systematic errors, a copending patent application entitled “Method and System for Correction of cDNA Microarray Data, and Recoding Medium Therefor” has been filed separately. Therefore, the present invention assumes that systematic errors are already removed from the microarray data.
The conventional analytical methods using microarray data deal with only the ratio (the difference of logarithmic values) of gene expression levels of two channels, that is, they do not deal with the gene expression level of each channel, for the reason that quantitative uniformity of cDNA in each spot is not ensured. Therefore, the conventional analytical methods results in insufficient separation between true signals related to gene expression state and measurement errors.
It is an object of the present invention to provide a method and system for separating true signals related to gene expression from measurement errors to increase the precision and efficiency of analysis using microarray data, and further estimating the probability of gene expression in each channel.
A gene expression state estimating system of the present invention includes an input device for inputting microarray data, a program-controlled data analyzer, and an output device. The data analyzer has parameter estimating means for estimating distributed parameters for each component of a mixed normal distribution and a mixing ratio parameter using gene expression level data given from the input device, and posterior probability calculating means for calculating the posterior probabilities of gene expression in each channel using each of the estimated parameters. The calculated posterior probabilities are outputted to the output device.
The adoption of such a configuration to estimate the state of gene expression can attain the object of the present invention.
A mathematical model of gene expression level data obtained from a microarray according to the present invention will first be described. If X denotes the gene expression level of cell 1 obtained with channel 1 and Y denotes the gene expression level of cell 2 obtained with channel 2, then respective gene expression level data are shown in the following equation (2)
X=τ1α+β+ε1
Y=τ2α+β+ε2 (2)
where X and Y denote amounts subjected to adequate transformations of observed values including logarithmic transformation or power transformation and linear transformation.
Here, τ1 and τ2 take either 1 or 0, which represents the presence or absence (ON/OFF) of true gene expression in each cell. Further, α denotes the amount of mRNA produced when the gene is ON-state and a random variable of gene expression defined by the state of the spot, β denotes a common measurement error between the channel 1 and the channel 2, and ε denotes a measurement error independent between the channels. Note that each distribution of random variables follows the following equation (3)
Here, N (μ, σ2) demotes a one-dimensional normal distribution with average μ and variance σ2. Further, α, β, and ε are all independent. In this mathematical model, when a gene is being expressed (ON-state), the true expression level is a random variable that takes on nonnegative values, while when it is not being expressed (OFF-state), only simple measurement errors are considered to be observed. Further, referring to the prior art document 6, an S-D transformation is performed as a modification from the M-A transformation adopted by Yang Y. H. et. al. as shown in the following equation (4):
U=X+Y,
V=X−Y (4)
In other words, the transformation is made assuming that U and V are the sum and difference of gene expression levels of two channels, respectively. A schematic graph of this S-D transformation model is shown in
g00(u,ν|θ)=φ(u|4σβ2+2σε2)φ(ν|2σε2) (5)
Here, φ(u|σ2) is the density function of a one-dimensional normal distribution with average 0 and variance σ2. The density function of the distribution g10 is shown in the following equation (6):
Here, φ2(u, v|Σ) is the density function of a two-dimensional normal distribution with average vector 0 and variance-covariance matrix Σ, and Σ10 is a 2×2 variance-covariance matrix, which is shown in the following equation (7)
The density function of the distribution g01 is shown in the following equation (8):
Here, Σ01 is a 2×2 variance-covariance matrix, which is shown in the following equation (9):
The density function of the distribution g11 is shown in the following equation (10)
Based on the above-mentioned distributions, posterior probabilities of gene expression in cell 1 and cell 2 are shown in the following equations (11) and (12)
where f(u, v|p, θ) is given by the following equation (13)
Note that p=(p00, p10, p01, p11) is a parameter representing the mixing ratio for each distribution.
An embodiment of the present invention will next be described in detail with reference to the accompanying drawings. Referring to
The data analyzer 2 is provided with distributed parameter estimating means 21, mixing ratio parameter estimating means 22, and posterior probability calculating means 23. The distributed parameter estimating means 21 estimates distributed parameters for each component in a mixed normal distribution using gene expression level data from the input device 1. The estimated distributed parameters are sent to the mixing ratio parameter estimating means 22 and the posterior probability calculating means 23. The mixing ratio parameter estimating means 22 estimates a mixing ratio parameter for the mixed normal distribution by a conditional maximum likelihood method using the gene expression level data from the input device 1 and the distributed parameters for each component given from the distributed parameter estimating means 21. The estimated mixing ratio parameter is sent to the posterior probability calculating means 23. The posterior probability calculating means 23 calculates the posterior probability of a gene expression state in each channel using the gene expression level data from the input device 1, the distributed parameters for each component given from the distributed parameter estimating means 21, and the mixing ratio parameter from the mixing ratio parameter estimating means 22. The calculated posterior probability is sent to the output device 3.
Referring next to
(1−ξ)φ(u−μ0|σ02)+ξφ(u−μ1|σ12) (14)
where φ(*|σ2) is the density function of a one-dimensional normal distribution with average 0 and variance σ2, (μ0, σ02) and (μ1, σ12) are average and variance parameters for first and second components, respectively, and ξ is the mixing ratio, with the assumption that μ0<μ1, σ02>0, σ12>0, 0<ξ<1 is satisfied.
Next, the distributed parameter estimating means 21 uses the estimated {circumflex over (ξ)}, {circumflex over (μ)}0, {circumflex over (σ)}02, {circumflex over (μ)}1, {circumflex over (σ)}12 to estimate μ, σε2, σβ2, λ according to the following equations (15), (16), (17), and (18) (step A2)
where N0 denotes an index set of data values that satisfies i ε {i|ui<{circumflex over (μ)}0} and ∥N0∥ denotes the number of elements.
Next, the mixing ratio parameter estimating means 22 estimates a mixing ratio parameter p=(p00, p10, p01, p11) by a conditional maximum likelihood method using an estimate {circumflex over (θ)}=({circumflex over (μ)}, {circumflex over (λ)}, {circumflex over (σ)}ε2, {circumflex over (σ)}β2) of each parameter given from the distributed parameter estimating means 21 by applying a two-variable mixed normal distribution of four components shown in the following equation (19) to the gene expression level data {(ui, vi)|i=1, . . . ,n} given from the input device 1 (step A3).
p00g00(u,ν|{circumflex over (θ)})+p10g10(u,ν|{circumflex over (θ)})+p01g01(u,ν|{circumflex over (θ)})+p11g11(u,ν|{circumflex over (θ)})=p00φ(u|4{circumflex over (σ)}β2+2{circumflex over (σ)}ε2)φ(ν|2{circumflex over (σ)}ε2)+p10φ2(u−{circumflex over (μ)},ν−{circumflex over (μ)}|Σ10)+p01φ2(u−{circumflex over (μ)},ν+{circumflex over (μ)}|Σ01)+p11φ(u−2{circumflex over (μ)}|4{circumflex over (μ)}2(eλ
Here, it is assumed that the above equation satisfies the relationships shown in the following equation (20) (where {circumflex over (Σ)}10 is a 2×2 variance-covariance matrix derived from the equation (7)) and the following equation (21) (where {circumflex over (Σ)}01 is a 2×2 variance-covariance matrix derived from the equation (9)).
The process of estimating posterior probabilities of gene expression states in cell 1 and cell 2 using the calculated estimates of parameters will next be described.
The posterior probability calculating means 23 can describe the posterior probabilities of gene expression state in each cell for each pair (u, v) of the gene expression level data given from the input device 1 using the estimates {circumflex over (θ)}=({circumflex over (μ)}, {circumflex over (λ)}, {circumflex over (σ)}ε2, {circumflex over (σ)}β2) and {circumflex over (p)}=({circumflex over (p)}00, {circumflex over (p)}10, {circumflex over (p)}11) of each parameter given from the distributed parameter estimating means 21 and the mixing ratio parameter estimating means 22.
In other words, the posterior probabilities indicating that any gene expression is ON-state in cell 1 and cell 2 can be calculated from the following equations (22) and (23) (step A4).
It is then judged whether calculations of posterior probabilities indicating that any gene expression is ON-state have been made for all the pairs (u, v) of the gene expression level data (step A5). When all the calculations have been completed, the process is ended, while when all the calculations have not been completed yet, the posterior probability related to the next gene is calculated.
The calculated posterior probabilities of gene expression in each channel are sent to the output device 3. The output device 3 displays or prints out the posterior probabilities of gene expression in each channel in the form of a graph.
The following describes the effects of the embodiment. In the embodiment, a mathematical model in which the concept of gene expression/nonexpression is introduced is constructed to separate true signals from experimental errors. Further, the use of data on the sum and difference of gene expression levels in two channels makes it easy to obtain information on the sensitivity of microarray data to fluorescence intensities in each channel, allowing more accurate extraction of the magnitude of experimental errors. Furthermore, a two-dimensional simultaneous distribution is described for these sum and difference data. It allows high-precision estimation of posterior probabilities of gene expression in each channel.
In addition, the posterior probability indicating an event of differential expression between cell 1 and cell 2 (mismatched ON-OFF state) (step 4 in
Thus, the embodiment has the advantage of detecting candidate genes that are likely to reveal differential expression in cell 1 and cell 2.
A second embodiment of the present invention will next be described in detail with reference to
The data analyzing program is read from the recording medium 4 into a data analyzer 5 to control the operation of the data analyzer 5 to execute processing on data files inputted from the input device 1 in the same manner as the data analyzer 2 does in the first embodiment.
The following specifically describes the embodiments of the present invention. Data used as an example is obtained from an experiment for comparing the states of gene expression of two different types of cancer cells (cell 1 and cell 2).
The test is conducted on expression patterns of 48 grids on one chip, 441 (21×21) spots pergrid, a total of 21168 genes.
The long and short dashed line in
In
The following table (2) shows estimation results of distributed parameters by the conditional maximum likelihood method (results of steps A2 and A3 in
The first effect of the present invention is to achieve a separation between true signals related to gene expression and experimental errors by introducing the concept of gene expression/nonexpression into the gene expression level data obtained from microarrays to construct a mathematical model.
The second effect of the present invention is to make it easy to obtain sensitivity information on the sensitivity of microarray data to fluorescence intensities of two channels by transforming the gene expression level data obtained from microarrays into data on the sum and difference of the gene expression levels between two channels. It then makes it possible to visualize the magnitude of experimental errors.
The third effect of the present invention is to enable estimation of posterior probability related to expression/nonexpression of each gene in each of the two channels by transforming the gene expression level data obtained from microarrays into data on the sum and difference of the gene expression levels between two channels to describe a two-dimensional simultaneous distribution of the sum and difference data. It then makes it possible to detect genes related to differences between cell 1 and cell 2 with high precision.
Number | Date | Country | Kind |
---|---|---|---|
275983/2003 | Jul 2003 | JP | national |