1. Field of the Invention
The present invention relates in general to statistical analysis of microarray data generated from nucleotide arrays. Specifically, the present invention relates to identification of differentially expressed genes by multivariate microarray data analysis. More specifically, the present invention provides an improved multivariate random search method for identifying large sets of genes that are differentially expressed under a given biological state or at a given biological locale of interest. The method of the invention implements multiple starts and early stop in the random search of sets of differentially expressed genes.
2. Description of the Related Art
Gene expression analyses based on microarray data promises to open new avenues for researchers to unravel the functions and interactions of genes in various biological pathways and, ultimately, to uncover the mechanisms of life in diversified species. A significant objective in such expression analyses is to identify genes that are differentially expressed in different cells, tissues, organs of interest or at different biological states. So identified, a set of differentially expressed genes associated with a certain biological state, e.g., tumor or certain pathology, may point to the cause of such tumor or pathology, and thereby shed light on the search of potential cures.
In practice, however, gene expression studies are hampered by many difficulties. For example, poor reproducibility in microarray readings can obscure actual differences between normal and pathological cells or create false positives and false negatives. The tension between the extremely large number of genes present (hence high dimensionality of the feature space) and the relatively small number of measurements also poses serious challenges to researchers in making accurate diagnostic inferences.
Existing methods for selecting differentially expressed genes are typically univariate, not taking into account the information on interactions among genes. As appreciated by an ordinary skilled molecular biologist, genes do not operate in isolation—activation of one gene may trigger changes in the expression levels of other genes. That is, genes may be involved in one or more pathways. Therefore, determination of differentially expressed genes calls for consideration of covariance structure of the microarray data, in addition to, for example, mean expression levels. In this regard, however, application of well-established statistical techniques for multidimensional variable selection encounters much difficulty. This is so because, in one aspect, the small number of independent samples and the presence of outliers make the estimates on selected variables unstable for large dimensions. In other words, only small sets of genes can be meaningfully considered while a relatively large number of genes are potentially differentially expressed. It is generally impossible to compare all gene subsets and find the optimal one because the number of possible gene combinations is prohibitively large. On the other hand, if a global optimum could be found, it might be overly specific to a training sample due to overfitting. Thus, it remains a significant challenge to scale methods for identifying differentially expressed genes to deal with microarray data of high dimensional space.
Therefore, there is a need to address the difficulties in applying multivariate analysis to microarray data—a need to establish rigorous methods for identification of differentially expressed genes from high dimensional gene expression data.
It is therefore an object of this invention to provide multivariate methods for analyzing microarray gene expression data of high dimensional space and thereby identifying differentially expressed genes. Particularly, it is an object of this invention to provide methods for identifying larger sets of differentially expressed genes starting from feature spaces of smaller dimensionality where accurate estimates on covariance matrix can be made. More particularly, the present invention provides a random search method with multiple starts and early stop.
In accordance with the present invention, there is provided methods for identifying a set of genes from a multiplicity of genes whose expression levels at a first and a second state, in a first and a second tissue, or in a first and a second types of cells are measured in replicates using one or more nucleotide arrays, thereby generating a first plurality of independent measurements of the expression levels for the first state, tissue, or type of cells and a second plurality of independent measurements of the expression levels for the second state, tissue, or type of cells. The method comprises: (a) identifying a quality function capable of evaluating the distinctiveness between the first plurality and the second plurality; (b) selecting a subset of genes, whose expression levels in the first and second states, tissues, or types of cells are represented in the first plurality and the second plurality, respectively; (c) calculating the values of the quality function for the subset of genes in the first state and said second state based on the first and second plurality, thereby determining the distinctiveness of the first and the second plurality; (d) substituting a gene in the subset with one outside of the subset, thereby generating a new subset, and repeating step (c), keeping the new subset if the distinctiveness increases and the original subset if otherwise; (e) repeating steps (c) and (d) for a first predetermined number of times, thereby identifying a locally optimal subset of genes; (f) repeating steps (b) to (e) for a second predetermined number of times, thereby identifying the second predetermined number of the locally optimal subsets; and (g) integrating the second predetermined number of the locally optimal subsets into the set of genes, wherein the set is larger than the locally optimal subsets in size.
According to the present invention, in certain embodiments, the states may be biological states, physiological states, pathological states, and prognostic states. In other embodiments, the tissues may be normal lung tissues, cancer lung tissues, normal heart tissues, pathological heart tissues, normal and abnormal colon tissues, normal and abnormal renal tissues, normal and abnormal prostate tissues, and normal and abnormal breast tissues. In yet other embodiments, the types of cells may be normal lung cells, cancer lung cells, normal heart cells, pathological heart cells, normal and abnormal colon cells, normal and abnormal renal cells, normal and abnormal prostate cells, and normal and abnormal breast cells. In still other embodiments, the types of cells may be cultured cells and cells isolated from an organism.
According to another embodiment of this invention, the integrating is performed by selecting the genes whose frequency of occurrences in the second predetermined number of the locally optimal subsets exceeds a third predetermined number. In certain embodiments, the third predetermined number is 1% or 5%. According to yet another embodiment, the first predetermined number is sufficiently small such that the global maximum is not reached. According to still another embodiment, the quality function is a parametric function or a non-parametric function. In a further embodiment, the parametric function is selected from the group consisting of the Mahalanobis distance and the Bhattacharya distance.
In various embodiments of the invention, the nucleotide arrays may be arrays having spotted thereon cDNA sequences and/or arrays having synthesized thereon oligonucleotides.
Definition
As used herein, the term “microarray” refers to nucleotide arrays; “array,” “slide,” and “chip” are used interchangeably in this disclosure. Various kinds of nucleotide arrays are made in research and manufacturing facilities worldwide, some of which are available commercially. There are, for example, two kinds of arrays depending on the ways in which the nucleic acid materials are spotted onto the array substrate: oligonucleotide arrays and cDNA arrays. One of the most widely used oligonucleotide arrays is GeneChip™ made by Affymetrix, Inc. The oligonucleotide probes that are 20- or 25-base long are synthesized in silico on the array substrate. These arrays tend to achieve high densities (e.g., more than 40,000 genes per cm2). The cDNA arrays, on the other hand, tend to have lower densities, but the cDNA probes are typically much longer than 20- or 25-mers. A representative of cDNA arrays is LifeArray made by Incyte Genomics. Pre-synthesized and amplified cDNA sequences are attached to the substrate of these kinds of arrays.
Microarray data, as used herein, encompasses any data generated using various nucleotide arrays, including but not limited to those described above. Typically, microarray data includes collections of gene expression levels measured using nucleotide arrays on biological samples of different biological states and origins. The methods of the present invention may be employed to analyze any microarray data; irrespective of the particular microarray platform from which the data are generated.
Gene expression, as used herein, refers to the transcription of DNA sequences, which encode certain proteins or regulatory functions, into RNA molecules. The expression level of a given gene refers to the amount of RNA transcribed therefrom measured on a relevant or absolute quantitative scale. The measurement can be, for example, an optic density value of a fluorescent or radioactive signal, on a blot or a microarray image. Differential expression, as used herein, means that the expression levels of certain genes are different in different states, tissues, or type of cells, according to a predetermined standard. Such standard maybe determined based on the context of the expression experiments, the biological properties of the genes under study, and/or certain statistical significance criteria.
The terms “vector,” “probability distance,” “distance,” “the Mahalanobis distance,” “the Euclidean distance,” “feature,” “feature space,” “dimension,” “space,” “type I error,” “type II error,” and “ROC curve” are to be understood consistently with their typical meanings established in the relevant art, i.e. the art of mathematics, statistics, and any area related thereto. For example, a set of microarray data on p distinct genes represents a random vector X=X1, . . . , Xp with mutually dependent components.
Improved Random Search Procedure with Multiple Starts and Early Stop
Random search algorithms have been used for finding optima in complex combinatorial spaces. See, e.g., Zhigljavsky A A., Vol. 65, Mathematics and its Applications, Kluwer Academic Publishers Group, Dordrecht, 1991. The improved random search procedure according to one embodiment of this invention applies a local search procedure multiple times and then integrates the selected sets of genes to build a global optimal set of differential expressed genes. To prevent overfitting, short local searches may be performed. Local maximum regions are carefully examined and convergence to a unique global maximum is avoided. The method can be applied in conjunction with a variety of parametric and non-parametric quality functions, which are discussed in more detail in the next section. In certain embodiments, the improved random search procedure with multiple starts and early stop includes the following steps:
1. Randomly select Nsubset genes from Nall, wherein Nsubset is the number of genes in a subset, Nall is the total number of the genes, and Nsubset is smaller then Nall.
2. Evaluate the quality function for the Nsubset genes.
3. Generate a new evaluation point (i.e., starting point) by swapping one or more randomly selected genes between the currently selected set and the rest of the genes, thereby identifying g a new Nsubset.
4. Evaluate the quality function for the new Nsubset genes; if its value has decreased, then return to the previous Nsubset, otherwise maintain the new Nsubset.
5. Repeat steps 3 and 4 until the number of iterations reaches a predetermined number—let it be Niter—then save the resultant Nsubset at that point.
6. Repeat steps 1-5 Ncycle times.
7. Evaluate the resultant Ncycle groups of Nsubset genes to identify an integrated larger set of genes.
In step 7, a post-processing step, the local optima are combined to provide a final, global solution, i.e., an integrated larger set of differentially expressed genes. Heuristically, strongly differentially expressed genes should appear in many of the local maxima. Therefore, each gene to be included in the final set of differentially may be identified based on the frequency of its occurrence in the sub-optimal (i.e., locally optimal) sets derived from each of the Ncycle cycles, as performed in steps 1-6 above. A conservative estimate of the p-value corresponding to the observed frequency can be calculated. For example, if a gene is not differentially expressed, the probability that it will be in the selected subset by chance is expected to be equal to Nsubset/Nall, and most likely smaller. As the number of repetitions Ncycle is large, the final selection frequency of this gene may be approximated by a Poisson distribution with a mean Ncycle∘Nsubset/Nall. Based on this null-distribution the corresponding p-values for each gene may be calculated.
Generally, Nsubset is limited by the number of available training samples (e.g., the number of microarray slides in a typical experiment) and hence, Nsubset may be significantly smaller than Nall. Depending upon the particular quality function of choice, the nature and the extent of this limitation may vary; but, generally, both parametric and non-parametric criteria are sensitive to the scarceness of training samples in a high-dimensional feature space. In this connection, one significant advantage of the improved random search method disclosed herein is that, the detectable number of the differentially expressed genes is not limited by Nsubset, even though the depth of the estimated interaction structure (e.g., the covariance matrix) may be affected. In other words, a relatively large set of differentially expressed genes may be identified by integrating the subsets of genes selected from multiple local searches. In some embodiments of this invention, the final set of differentially expressed genes is significantly larger in size than the subset identified in the local search, i.e., the locally optimal subset: Nsubset.
The determination of Niter is crucial for preventing overfitting. It cannot be too small because a small value may not permit finding truly differentially expressed genes. On the other hand, too large a number will not be efficient. When the value is too big, the same maximum may be attained in many iterations of search because of overfitting.
With regard to Ncycle, this is a number that substantiates the variability of this random search procedure. It may be as large as possible, only limited by the applicable CPU power (e.g., Ncycle=1,000,000 may be used).
Quality Functions Used in Conjunction with the Random Search Procedure
A variety of quality functions may be used in conjunction with the improved random search procedure in various embodiments of this invention. A quality function measures the “distinctiveness” of the two tissues or two biological states under comparison based on a set of genes, taking into account the correlation structure. Generally, properly specified parametric methods are more powerful than non-parametric methods due to the utilization of additional information accounted in the model, although such parametric quality functions may be sensitive to any departure from the model. With microarray data, since small sample sizes are a prevalent problem, choosing an appropriate parametric quality function may be advantageous in its power, whereas a non-parametric random search method may be more robust. One parametric measure of the differences between two multidimensional samples is the Mahalanobis distance, which is used in one embodiment of this invention. See, Mahalanobis PC., Proceedings of the National Institute of India (1936) 2 Vol. 49.
where v and u are the sample means and Σu, Σv are the two sample variance-covariance matrices. It is a natural extension of the t-statistic to a multidimensional setting. Because of the matrix inverse involved, the calculation of the Mahalanobis distance at every step of the search—for Ncycle·Nitertimes—may appear to be prohibitive. However, with the improved random search procedure of this invention, changes in the vectors are only in one dimension on every step (see supra, steps 1-5); therefore, a fast update formula may be permitted. See, e.g., McLachlan G J., Discriminant Analysis and Statistical Pattern Recognition, (1992) Wiley, N.Y.
In another embodiment of this invention, the Bhattacharya distance may be used, especially where differences in both the mean and the covariance structure are of interest.
Similarly, other parametric or non-parametric dissimilarity measures may be used in various alternative embodiments in conjunction with the improved random search procedure disclosed herein. Such different choices of quality functions each may be designed to deal with microarray data with different characteristics.
Further, when using various quality functions, various background reduction, normalization, and other adjustment procedures may be applied to the microarray data. For example, rank-based adjustment and the typical mean-log adjustment (dividing by mean and take logarithm) may be used. In one embodiment, the following adjustment is implemented: the data points on each slide or array were replaced by their normal scores using the formula
x=Φ−1(rankjxij/Nall).
where −1 (α) is the 100αth percentile of the standard normal distribution and rankjxij is the rank of xij among all of the observations on the jth slide. See, Tsodikov A. et al., (2002) Bioinformatics 18: 251-260.
Computer Simulation of the Multivariate Search Method
A simulation study was performed to evaluate the improved random search method. Totally 1000 genes were divided into subsets of equal size 20. One of the subsets was selected to be deemed as differentially expressed with the gene-specific ratio d randomly generated for each of the genes from a lognormal distribution with mean 1 and variance 0.5. The correlation structure was kept the same in the two hypothetic tissues. In the selected subset some of the genes exhibited large over- or under-expression, while others with d≈1 changed their expression level only slightly. The simulation was performed on 20 slides or arrays with one of the tissues on the green channel and the other on the red channel. The relevant parameters for the random search were set: Ncycle=10,000, Nsubset=5; and, the Mahalanobis distance was used as the quality function.
Referring to
The middle graphs illustrate the same phenomenon in another way. In the case of early stopping, i.e., when Niter=1000, the distribution of the Mahalanobis distances corresponding to the Ncycle sub-optimal sets is unimodal with high variability. Thus, the procedure has explored many different local maxima with a variety of corresponding values of the quality function. On the other hand, when the number of iterations increase, e.g., when Niter=100,000, the distribution of the Mahalanobis distances achieved in the sub optimal sets became very discrete. In about half of the cases the search reached the global maximum on a unique combination of genes. Therefore, in this situation, although the global maximum was found, many local maxima and the corresponding differentially expressed genes from the various subsets were missed. When early stop is carried out at the 1000-th iteration, none of 10,000 cycles found the global maximum, but a variety of genes were selected.
At the bottom panel of
Referring to
The invention is further described by the following examples, which are illustrative of the invention but do not limit the invention in any manner.
Referring to
1. Repeat the following Niter times. Niter is not too large; early stop—stop before convergence—is implemented.
2. Repeat step 1 Ncycle times, obtain Ncycle sets of genes of size k.
3. For each gene, calculate the frequency of its occurrence as a member of a sub-optimal set.
4. The final set of genes is defined as the genes that have a frequency of occurrence exceeding a preset limit.
HT29 cells represent advanced, highly aggressive colon tumors. They contain mutations in both the APC gene and p53 gene, two tumor suppressor genes that frequently mutate during colon tumorigenesis. HCT116 cells manifest less aggressive colon tumors and harbor functional p53 and APC. They are defective in DNA repair. The experiment was performed with three RNA samples (1 μg RNA each). Cy-3-dCTP (green) was used to label HCT116 cells while Cy-5-dCTP (red) was used for HT29 cells. Each comparison set was hybridized against two microarray slides (facing each other) containing 4608 minimally redundant cDNAs spotted in duplicate. As control, six Drosophila genes were added to the Cy-5 samples. Thus, in a red vs. green comparison they are differentially expressed by design. This experiment resulted in a total of twelve measurements on each channel for each gene on the microarrays. Although a nested dependence structure existed in the samples, the analysis assumes them as independent replicates. Additionally, ten HCT116 samples hybridized with Cy-5 (red) from a separate experiment were included in the analysis.
Two comparisons were performed: (i) HCT116 vs. HT29 and (ii) HCT116 (green) vs. HCT116 (red); the first is inter cell lines whereas the second is intra cell lines. The relevant parameters for the random search were set: Ncycle=10,000, Nsubset=5; and, the Mahalanobis distance was used as the is quality function.
Referring to
Referring to
A frequency level of 1% was selected as the cutoff for identifying differentially expressed genes. Total 59 genes were selected and thus 59 cDNA spots were identified on the slides. A comparison was carried out between the 59 cDNA spots and the top 59 genes selected by t-statistic. Almost half of those genes (25 to be exact) were identified by both methods. However, a characteristic advantage of the multivariate random search procedure was its ability to identify correlated genes. Some of the genes had several corresponding spots on the slides, and therefore their expression levels at various spots were known to be correlated. Among the 59 genes identified by the multivariate random search method, 13 had two, and two had three spots inter-related to each other. By comparison, among the genes identified by the marginal t-statistic, 17 genes had two or more replicates on the slides, and only one of them had all of its replicates selected in the resulting list of genes. Therefore, the improved random search procedure of this invention is powerful in identifying less pronounced differentially expressed genes when they are correlated with more strongly differentially expressed genes.
It is to be understood that the description, specific examples and data, while indicating exemplary embodiments, are given by way of illustration and are not intended to limit the present invention. Various changes and modifications within the present invention will become apparent to the skilled artisan from the discussion, disclosure and data contained herein, and thus are considered part of the invention.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US03/05730 | 2/28/2003 | WO | 2/15/2006 |
Number | Date | Country | |
---|---|---|---|
60361068 | Mar 2002 | US |