This application claims the priority benefit of Taiwan application serial no. 101149024, filed Dec. 21, 2012, the full disclosure of which is incorporated herein by reference.
1. Technical Field
The present invention relates to a computer-implemented method for identifying differentially expressed genes (DEGs) and a computer-readable medium encoded with a computer program to execute the method.
2. Description of Related Art
Since the DNA microarray was introduced in 1995, high-throughput gene expression profiling has emerged as one of the most important and powerful approaches in biomedical research. Its use to discover differentially expressed genes between replicated sample groups has found many applications. Although many studies reported successful applications, often with high rates of validation using alternate technologies such as qRT-PCR or northern blot analysis, researchers were unsettled by the disparities observed between results obtained by different groups analyzing similar samples, and called into question the validity of microarray assays. In a later study, by contrasting two commercially produced RNAs in technical quintuplicates, the MicroArray Quality Control (MAQC) Consortium showed that distinct platforms and test sites performed comparably, generating similar lists of genes whose activity differed by at least a factor of two between the two RNA samples. The Consortium attributed the improved reproducibility over previous studies to its data analysis approach: while most researchers employed a statistical criterion, foremost by applying a cutoff on the p-value from a t-test, the MAQC Consortium advised loosening the p-value cutoff and adding a fold-change cutoff, because genes selected based on fold-change were found to be much more reproducible between platforms and test sites than those selected based on the t-test. The study has been criticized for implying that prioritizing genes by fold-change is more productive than prioritizing them by the level of statistical significance, and because employing a fold-change cutoff leads to loss of statistical control. Nevertheless, the approach, henceforth the MAQC method (MAQCm), has been widely practiced.
The t-test's apparent lack of statistical power results from its naive approach to variance estimation. For elucidation, we categorize data as either type I or type II and divide variance into two components, noise and non-noise. Type I data are made from samples of the same DNA, such as biological replicates of a cell line; type II data are made from samples of different DNAs, such as clinically collected specimens. Noise includes random noise and biological noise, is independent of differential expression, and typically follows a normal distribution. Non-noise, as explained below, exists only in type II data, arises from differential expression, and hence should not be included in the statistical testing. For the gene under test, if the variance of noise for each measurement were known, so that the means and the fold-change could each be predicted using a Gaussian distribution function (Gaussian) as the probability density function, the z-test could in principle lead to the most reproducible gene ranking among all possibilities. The t-test ranking is the same as the z-test ranking with the variance of noise taken as homogeneous among replicates and approximated by the sample variance. The accuracy of the approximation is limited by sample size. For type II data, the approximation is rendered unjustifiable by molecular heterogeneity, which manifests itself as expansion of the sample variance with absolute fold-change. Although the expansion apparently requires the statistical testing to be based on individually estimated variances, it arises from differential expression and, regardless of sample size, invalidates any method that mistakes the affected variances for noise and thereby understates the genes' priority. Fold-change ranking, on the other hand, is the same as z-test ranking with the variance of noise taken as homogeneous among replicates and among genes.
Its global superiority to t-test ranking implies that either the variance of noise is homogeneous among genes or the differences between genes are trivial compared to the effects of sample size limitations and molecular heterogeneity. In summary, the key to better statistical power for both types of data lies in an approach that excludes non-noise from the statistical testing, takes the variance of noise as homogeneous among genes, and estimates the common variance at the full throughput capacity of the platform.
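The relation between the three rankings discussed above can be checked numerically. The following is a minimal sketch for illustration only and is not part of the disclosed method. It assumes log-scale expression values, simulated data, and a made-up common noise variance `sigma2`; the function names are hypothetical. With one variance shared by all genes and replicates, the z-statistic is the fold-change divided by a constant, so both produce the same gene order, whereas a t-statistic built on per-gene sample variances generally does not.

```python
import math
import random
import statistics

def fold_change(test, control):
    """Log fold-change of one gene: difference of the group means."""
    return statistics.mean(test) - statistics.mean(control)

def z_common(test, control, sigma2):
    """z-statistic with one noise variance shared by all genes and replicates."""
    se = math.sqrt(sigma2 / len(test) + sigma2 / len(control))
    return fold_change(test, control) / se

def t_welch(test, control):
    """Welch t-statistic using the gene's own sample variances."""
    se = math.sqrt(statistics.variance(test) / len(test)
                   + statistics.variance(control) / len(control))
    return fold_change(test, control) / se

def ranking(scores):
    """Gene indices ordered from largest to smallest absolute score."""
    return sorted(range(len(scores)), key=lambda g: -abs(scores[g]))

# Simulated log-scale data: 50 genes, 4 replicates per group (illustrative only).
random.seed(0)
genes = [([random.gauss(0, 1) for _ in range(4)],
          [random.gauss(0, 1) for _ in range(4)]) for _ in range(50)]

rank_fc = ranking([fold_change(t, c) for t, c in genes])
rank_z = ranking([z_common(t, c, sigma2=0.5) for t, c in genes])
rank_t = ranking([t_welch(t, c) for t, c in genes])

print(rank_fc == rank_z)  # common variance: same order as fold-change
print(rank_fc == rank_t)  # per-gene variances: order generally differs
```

The choice of `sigma2` does not affect the comparison, since any positive common variance rescales every gene's score by the same factor.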
In light of the above insight, we have developed a method named Weighting Arrays By Error (WABE). WABE's design takes variance of noise as homogeneous among genes and, to handle samples of uneven quality, heterogeneous among replicates. By further assuming most genes are not differentially expressed and hence not affected by non-noise, WABE estimates the sample-wise variances of noise based on data of all genes. We schematically illustrate WABE in
As an embodiment of this invention, a computer-implemented method for identifying DEGs is provided. The method, named WABE, comprises the following steps:
(a) Obtain gene expression data from several test samples and several control samples.
(b) Estimate variance of noise in each sample.
(c) For each expression measurement, use a Gaussian distribution function (Gaussian), which takes the measured value as mean and the sample's variance of noise as variance, as probability density function (PDF) for predicting its true value.
(d) Normalize the Gaussians.
(e) For each gene, based on the normalized Gaussians for the test samples, derive the test-group Gaussian for predicting mean expression level of the test group; based on the normalized Gaussians for the control samples, derive the control-group Gaussian for predicting mean expression level of the control group.
(f) Based on the test-group Gaussian and the control-group Gaussian, derive the final Gaussian for predicting fold-change of the gene.
(g) Based on the final Gaussian, conduct a z-test to determine whether the gene is differentially expressed.
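Steps (c), (e), (f) and (g) may be sketched for a single gene as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the text above does not specify how the group Gaussians are derived, so the sketch assumes inverse-variance weighting of the per-sample Gaussians, independence between samples, log-scale data (so that fold-change is a difference of means), and that the normalization of step (d) has already been applied. The names `Gaussian`, `group_gaussian`, `fold_change_gaussian` and `z_score`, and all numeric values, are hypothetical.

```python
import math
from typing import NamedTuple, Sequence

class Gaussian(NamedTuple):
    mean: float
    var: float

def group_gaussian(values: Sequence[float], noise_vars: Sequence[float]) -> Gaussian:
    """Steps (c) and (e): combine per-sample Gaussians N(value, noise_var).

    Assumption: inverse-variance weighting, so noisier samples count less.
    """
    weights = [1.0 / v for v in noise_vars]
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    return Gaussian(mean, 1.0 / total)

def fold_change_gaussian(test: Gaussian, control: Gaussian) -> Gaussian:
    """Step (f): difference of two independent Gaussians (log fold-change)."""
    return Gaussian(test.mean - control.mean, test.var + control.var)

def z_score(final: Gaussian) -> float:
    """Step (g): z statistic for the null hypothesis of zero log fold-change."""
    return final.mean / math.sqrt(final.var)

# One gene on a log2 scale; values and noise variances are made up.
test = group_gaussian([8.0, 8.5, 9.0], [0.25, 0.25, 0.5])
control = group_gaussian([7.0, 7.5], [0.25, 0.25])
final = fold_change_gaussian(test, control)
print(final, z_score(final))
```

In this sketch the third test sample, having twice the noise variance of the others, receives half their weight in the test-group mean.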
As another embodiment of this invention, a computer-readable storage medium storing a computer program for executing the steps of the aforementioned method is provided. Steps of the method are as disclosed above.
These and other features, aspects, and advantages of the present invention will become better understood with reference to the following description and appended claims. It is to be understood that both the foregoing general description and the following detailed description are by way of example, and are intended to provide further explanation of the invention as claimed.
The invention can be more fully understood by reading the following detailed description of the embodiments, with reference made to the accompanying drawings as follows:
Reference will now be made in detail to the present embodiments of the invention, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
As an embodiment of the present invention, a z-test based method for identifying DEGs is provided. The method, named WABE, differs from t-test based methods in that the non-noise component of variance is excluded from the statistical testing and that variance of noise is taken as homogeneous among genes but heterogeneous among replicates. Accordingly, the statistical testing is based on sample-wise variances of noise derived from data of all genes rather than based on gene-wise variances derived from data of the gene under test. The method may take the form of a computer program product stored on a non-transitory computer-readable storage medium having computer-readable instructions embodied in the medium. Any suitable non-transitory storage medium may be used including non-volatile memory such as read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), and electrically erasable programmable read only memory (EEPROM) devices; volatile memory such as static random access memory (SRAM), dynamic random access memory (DRAM), and double data rate random access memory (DDR-RAM); optical storage devices such as compact disc read only memories (CD-ROMs) and digital versatile disc read only memories (DVD-ROMs); and magnetic storage devices such as hard disk drives (HDD) and floppy disk drives.
At step 110, gene expression data for several test samples and several control samples are obtained.
At step 120, variances of noise in the samples are calculated.
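One possible way to carry out step 120 is sketched below. The description does not specify the estimator, so everything here is an assumption: the sketch takes each gene's median across all samples as its true level (reasonable only if most genes are not differentially expressed), collects the residuals of each sample over all genes, and converts their median absolute deviation into a variance. The function name `samplewise_noise_variance` and the simulated data are hypothetical.

```python
import random
import statistics

def samplewise_noise_variance(data):
    """Estimate one noise variance per sample from the data of all genes.

    data: list of rows, one per gene; each row holds that gene's log-scale
    expression in every sample (test and control together).
    Assumption: most genes are not differentially expressed, so a gene's
    median across samples approximates its true level in every sample.
    """
    n_samples = len(data[0])
    residuals = [[] for _ in range(n_samples)]
    for row in data:
        center = statistics.median(row)
        for j, value in enumerate(row):
            residuals[j].append(value - center)
    variances = []
    for res in residuals:
        # Median absolute deviation, scaled to match a Gaussian standard
        # deviation; robust to a minority of differentially expressed genes.
        med = statistics.median(res)
        mad = statistics.median(abs(r - med) for r in res)
        variances.append((1.4826 * mad) ** 2)
    return variances

# Simulated data: 1000 genes, 5 samples, the last sample being much noisier.
random.seed(1)
true_sd = [0.2, 0.2, 0.2, 0.2, 0.8]
data = []
for _ in range(1000):
    level = random.uniform(4.0, 12.0)
    data.append([random.gauss(level, sd) for sd in true_sd])

variances = samplewise_noise_variance(data)
print([round(v, 3) for v in variances])
```

With few samples this estimator is rough, because each sample contributes to the median it is compared against; the example is meant only to show that a low-quality sample is assigned a larger variance than the others.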
At step 130, a PDF for predicting the true value of each measurement is derived.
At step 140, the PDFs are normalized.
At step 150, for the gene under test, the test-group PDF for predicting mean expression level of the test samples is derived based on the normalized PDFs for predicting expression levels in the individual test samples, and the control-group PDF for predicting mean expression level of the control samples is derived from the normalized PDFs for predicting expression levels in the individual control samples. The flow from
At step 160, a final PDF for predicting fold-change of the gene under test is derived based on the test-group PDF and the control-group PDF. The flow from
At step 170, a statistical test is performed based on the final PDF for predicting fold-change of the gene under test to determine whether the gene under test is differentially expressed.
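The statistical test of step 170 may be sketched as follows, assuming the final PDF is a Gaussian over the log fold-change with a given mean and variance, and that the null hypothesis is a log fold-change of zero. The significance level `alpha` and the function names are hypothetical; the description does not prescribe a cutoff.

```python
import math

def two_sided_p(mean, var):
    """Two-sided z-test p-value for H0: true log fold-change equals zero.

    mean, var: parameters of the final Gaussian for the gene under test.
    """
    z = mean / math.sqrt(var)
    # erfc keeps precision for large |z|, unlike 2 * (1 - cdf(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))

def is_differentially_expressed(mean, var, alpha=0.05):
    """Decision rule at an assumed significance level alpha."""
    return two_sided_p(mean, var) < alpha

# Made-up final Gaussians: same variance, different predicted fold-changes.
print(two_sided_p(1.0, 0.25), is_differentially_expressed(1.0, 0.25))
print(two_sided_p(0.5, 0.25), is_differentially_expressed(0.5, 0.25))
```

When many genes are tested, a correction for multiple testing would normally be applied to the resulting p-values; that step is outside the scope of this sketch.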
Although the present invention has been described in considerable detail with reference to certain embodiments thereof, other embodiments are possible. Therefore, the spirit and scope of the appended claims should not be limited to the description of the embodiments contained herein. It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present invention without departing from the scope or spirit of the invention. In view of the foregoing, it is intended that the present invention cover modifications and variations of this invention provided they fall within the scope of the following claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 101149024 | Dec 2012 | TW | national |