The pattern of expressed genes in DNA microarray data demonstrates a typical profile, such as in relation to a cancer type or disease severity. These unique sets of genes defining a specific pathology are regarded as molecular “signatures” or “fingerprints” and have the potential to be indispensable tools for diagnosis, prognosis and treatment of various types of cancers and diseases. Gene expression profiling may help physicians better understand cellular morphology, resistance to chemotherapy, and the clinical outcome of disease. This type of individualized treatment may significantly increase survival by optimizing the treatment procedure in accordance with the clinical pathogenesis.
With regard to the reliability and robustness of microarray techniques, microarray gene expression measurements have been found to be highly reproducible within and across high-volume labs. The emergence of new gene signatures from wet lab microarray experiments has resulted in an exponential surge in microarray data. Although gene clustering is an important tool for the identification of like-groups in a microarray experiment, this methodology is not valid for two-group comparisons. Several statistical methods such as analysis of variance, the Mann-Whitney U test, Pearson's correlation test, the t-test, and the Wilcoxon signed-rank test have been used for comparison of microarray data. However, these conventional statistical methods often produce spurious outputs when comparing microarray gene expression data.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The described systems and methods relate to indexing gene expression data for comparing gene signatures. Such systems and methods may assign one of a plurality of fold change-based grading scores to each of a number of genes in a probe gene signature. The fold change-based grading scores reflect relative expression of one of the number of genes in the probe gene signature. Each of the number of genes in the probe gene signature assigned a particular grading score is weighted by the assigned grading score. A ratio is determined of each weighted number of genes in the probe gene signature assigned a particular grading score to a total number of genes in the probe gene signature. Then, the ratios of each weighted number of genes in the probe gene signature assigned each particular grading score to the total number of genes in the probe gene signature are summed to arrive at an index of gene expression.
The detailed description is set forth with reference to the accompanying figures, in which the left-most digit of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items or features.
The systems and methods described herein relate to indexing of gene expression data and its application for comparing gene signatures. The present systems and methods provide robust indexing for comparison of microarray expression data to provide clinical application of gene signatures. The index of gene expression provided by the present systems and methods may be referred to as the Haseeb Index of Gene Expression (HIGE) score, but may generally be referred to herein as an Index of Gene Expression (IGE) score.
Despite the influx of new gene signatures from wet lab microarray experiments, limited attempts have been made to establish a unified strategy for useful application of this exponentially surging microarray data. The present systems and methods employ an algorithm for robust indexing of gene expression data to compare gene signatures. The fold-change strategy used in the present systems and methods for indexing gene expression scores is robust, accurate and reproducible. Although fold-change has been used in microarray experiments, it has not been applied for collective interpretation of gene signatures. Conventionally, in a microarray experiment, the ratio of the color intensity of each spot location with a specific probe describes a relative expression of the corresponding gene under two different conditions. A gene is considered to be differentially expressed if the ratio of the expression levels between two groups exceeds predefined threshold values. The conventionally accepted expression ratios for up-regulated and down-regulated genes have been suggested to be greater than 1.5 and less than 0.5, respectively. The present systems and methods employ similar cut-off margins, but provide a more refined protocol using additional sub-grading of expression ratios.
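By way of illustration only, the conventional cut-off step described above can be sketched as follows. The function name, labels, and sample ratios are hypothetical and are not taken from the described systems; only the 1.5 and 0.5 cut-offs come from the text. The additional sub-grading of expression ratios is not shown here.

```python
def classify_expression(ratio: float,
                        up_cutoff: float = 1.5,
                        down_cutoff: float = 0.5) -> str:
    """Label one gene from its expression ratio between two conditions.

    The default cut-offs follow the conventionally accepted values
    (greater than 1.5 for up-regulation, less than 0.5 for down-regulation).
    """
    if ratio < 0:
        raise ValueError("expression ratio cannot be negative")
    if ratio > up_cutoff:
        return "up-regulated"
    if ratio < down_cutoff:
        return "down-regulated"
    return "not differentially expressed"


# Hypothetical expression ratios, for illustration only.
ratios = {"gene_a": 2.4, "gene_b": 0.3, "gene_c": 1.1}
labels = {gene: classify_expression(r) for gene, r in ratios.items()}
```

Both comparisons are strict, so a ratio exactly equal to a cut-off is treated as not differentially expressed in this sketch.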
Particular examples discussed herein are described with respect to cancer or other disease-related genes. However, the present invention can be utilized for indexing gene expression data for comparison of gene signatures for any type of genes. Also, particular examples discussed herein are described with respect to an algorithm employed in a general purpose processor-based computing device. However, the present invention can utilize any number of types of computing devices, by way of example a further enhanced DNA microarray, an Application Specific Integrated Circuit (ASIC), and/or the like.
Data communication network 108 represents any type of network, such as a local area network (LAN), wide area network (WAN), or the Internet. In particular embodiments, data communication network 108 is a combination of multiple networks communicating data using various protocols across any communication medium.
Although one computing system (102) is shown in
At 204, each of the number of genes in the probe gene signature assigned a particular grading score is weighted by the assigned particular grading score. Such weighting might entail, by way of example, finding the product of the number of genes in the gene signature having each of a plurality of grading scores (Nx) and the respective grading score (Gy).
A ratio of each weighted number of genes in the probe gene signature assigned a particular grading score to a total number of genes in the gene signature is determined at 206. For example, the product of the number of genes in the gene signature having each of the plurality of grading scores and the respective grading score (NxGy) may be divided by the total number of genes in the gene signature (Nt) at 206.
The ratios of each weighted number of genes in the probe gene signature assigned each particular grading score to a total number of genes in the gene signature are summed at 208 to arrive at an index of gene expression. This index of gene expression may be expressed as a percent, such as may be achieved by multiplying the sum of the ratios by one hundred.
Thus, in accordance with various implementations of the present systems and methods, a formula for arriving at the IGE score may be expressed as:
IGE = [Σ(Nx × Gy) / Nt] × 100
where Nx is the number of genes with grading score Gy. As noted above, the subscript ‘x’ can vary between 0 and the total number of genes in a signature (Nt) and ‘y’ can vary between 0 and 1. (See Table 1, above.)
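As a worked illustration of this formula, consider a hypothetical signature of Nt = 10 genes in which 4 genes receive a grading score of 0, 3 genes receive 0.5, and 3 genes receive 1. These grading values are assumptions chosen only because they fall within the stated range of 0 to 1; they are not taken from Table 1.

```latex
\[
\mathrm{IGE}
  = \left[\sum \frac{N_x G_y}{N_t}\right] \times 100
  = \left[\frac{4 \times 0}{10} + \frac{3 \times 0.5}{10} + \frac{3 \times 1}{10}\right] \times 100
  = \left[0 + 0.15 + 0.30\right] \times 100
  = 45
\]
```

Under these assumed values, a signature in which every gene receives a grading score of 1 yields an IGE score of 100, and one in which every gene receives 0 yields an IGE score of 0.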
Thus, applying this formula to process 200, the gene expression ratios of DNA microarray data may be categorized according to a logically defined scale, such as shown in Table 1 above, to arrive at the respective Nx and Gy values at 202. The percent contributions of each set of genes, that is the genes with the same expression score, are computed at 204 and 206 and their summation, found at 208, is regarded as the IGE score.
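Steps 204 through 208 might be sketched in code as follows. This is a minimal sketch rather than the program 316 itself: the function name and the input format are hypothetical, and the sketch assumes that each gene has already been assigned a grading score between 0 and 1 at 202 according to the scale of Table 1.

```python
from collections import Counter
from typing import Iterable


def ige_score(grades: Iterable[float]) -> float:
    """Return the IGE score, as a percent, from per-gene grading scores.

    `grades` holds one grading score (Gy, between 0 and 1) per gene in
    the signature; how a score is assigned to a gene is assumed to have
    been decided beforehand and is not modeled here.
    """
    grades = list(grades)
    n_total = len(grades)  # Nt: total number of genes in the signature
    if n_total == 0:
        raise ValueError("gene signature is empty")
    if any(g < 0 or g > 1 for g in grades):
        raise ValueError("grading scores must lie between 0 and 1")

    counts = Counter(grades)  # maps each grading score Gy to its gene count Nx
    # 204: weight each count by its score; 206: divide by Nt; 208: sum.
    ratio_sum = sum(n_x * g_y / n_total for g_y, n_x in counts.items())
    return 100.0 * ratio_sum
```

Grouping the genes by grading score before summing mirrors the order of operations in process 200; summing the individual grading scores and dividing by Nt would give the same result.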
Computing device 300 includes one or more processor(s) 302, one or more memory device(s) 304, one or more interface(s) 306, one or more mass storage device(s) 308, one or more Input/Output (I/O) device(s) 310, and a display device 312 all of which are coupled to a bus 314. Processor(s) 302 include one or more processors or controllers that execute instructions stored in memory device(s) 304 and/or mass storage device(s) 308, such as one or more programs (316) implementing process 200 of
Memory device(s) 304 include various computer-readable media, such as volatile memory (e.g., random access memory (RAM)) 318 and/or nonvolatile memory (e.g., read-only memory (ROM) 320). Memory device(s) 304 may also include rewritable ROM, such as Flash memory.
Mass storage device(s) 308 include various computer readable media, such as magnetic tapes, magnetic disks, optical disks, solid-state memory (e.g., Flash memory), and so forth. Program 316 implementing process 200 may be stored in such mass storage. Data, such as one or more databases 322 containing, by way of example, standardized or known gene expression data, and/or the like, may also be stored on mass storage device(s) 308. As shown in
I/O device(s) 310 include various devices that allow data and/or other information to be input to or retrieved from computing device 300. Example I/O device(s) 310 might include the aforementioned microarray 104, cursor control devices, keyboards, keypads, microphones, monitors or other display devices, speakers, printers, network interface cards, modems, lenses, CCDs or other image capture devices, and the like.
Display device 312 is optionally directly coupled to the computing device 300. If display device 312 is not coupled to device 300, such a device is operatively coupled to another device that is operatively coupled to device 300 and accessible by a user of the results of method 200. Display device 312 includes any type of device capable of displaying information to one or more users of computing device 300, such as the IGE results of process 200. Examples of display device 312 include a monitor, display terminal, video projection device, and the like.
Interface(s) 306 include various interfaces that allow computing device 300 to interact with other systems, devices, or computing environments. Example interface(s) 306 include any number of different network interfaces 328, such as interfaces to local area networks (LANs), wide area networks (WANs), wireless networks, and the Internet. As alluded to above, a microarray, such as microarray 104 of
Bus 314 allows processor(s) 302, memory device(s) 304, interface(s) 306, mass storage device(s) 308, and I/O device(s) 310 to communicate with one another, as well as other devices or components coupled to bus 314. Bus 314 represents one or more of several types of bus structures, such as a system bus, PCI bus, IEEE 1394 bus, USB bus, and so forth.
For purposes of illustration, programs and other executable program components, such as program 316, are shown herein as discrete blocks, although it is understood that such programs and components may reside at various times in different storage components of computing device 300, and are executed by processor(s) 302.
Alternatively, the systems and procedures described herein can be implemented in hardware, or a combination of hardware, software, and/or firmware. For example, one or more application specific integrated circuits (ASICs) can be programmed to carry out one or more of the systems and procedures described herein.
The present systems and methods have been validated using simulated gene signatures with known differences. The resultant IGE scores have been compared with the outputs of seven nonparametric tests. This case study cross-checks the validity of various statistical methods for two-group comparison of gene signatures using carefully designed sets of simulated data. Due to the format of expression data, the conventional statistical methods largely failed to perform accurately and consistently for comparison of gene signatures. However, the present IGE offered a robust and authenticated indexing system for comparing microarray gene signatures.
To evaluate the validity of conventional nonparametric statistical tests for comparison of gene signatures, six pairs of expression data (e.g., Pair-1 through Pair-6) were designed to represent various degrees of similarities or differences. The two groups in Pair-4 and Pair-6 represented the minimum and maximum differences, respectively. All six pairs were subjected to statistical comparisons using the conventional Friedman test, the conventional Kendall W test, the conventional Kolmogorov-Smirnov test, the conventional Kruskal-Wallis test, the conventional Mann-Whitney U test, the conventional Wilcoxon signed-rank test, and the conventional Sign test, using the SPSS statistical analysis software package. The IGE scores obtained in accordance with the present systems and methods were compared in parallel to the outputs of these tests, as detailed in Table 2, below.
The results of this validation using simulated signatures clearly show paradoxical outcomes when the six signature pairs are compared using the seven conventional nonparametric tests. This indicates the incompatibility of conventional statistics for comparing gene expression data. (See Table 2, above.) Five of the tests (the Friedman test, the Kendall W test, the Kruskal-Wallis test, the Mann-Whitney U test and the Sign test) provided identical but logically unrealistic P values for all six of the signature pairs. These tests show a P value of one for the signature pair with maximum difference (Pair-6) and P=0.001 for a signature pair with a slight difference (Pair-5); the corresponding IGE scores obtained in accordance with the present systems and methods for these signatures were 100 and 4, respectively (See Table 2, above). The remaining two conventional tests, the Kolmogorov-Smirnov test and the Wilcoxon signed-rank test, also failed to efficiently handle these statistical comparisons. On the other hand, the IGE scores obtained in accordance with the present systems and methods effectively quantitated the differences or similarities between the groups of each pair.
The results of this validation clearly demonstrate the failure of conventional statistical methods to handle the microarray expression data, particularly for two-group comparison of gene signatures. The present IGE systems and methods provide a more accurate and unified system that enables routine and uniform clinical application of gene signatures. The present systems and methods are a convenient and robust means for comparison of gene signatures. IGE scores obtained in accordance with the present systems and methods are intuitive to interpret and make comparison of the collective expression of molecular signatures straightforward.
The applicability of IGE scores has also been validated using actual signature data for two different disease types, ulcerative colitis (“Signature 1” from Dooly et al., Inflamm. Bowel. Dis., 2004, 10, 1-14) and ovarian cancer (“Signature 2” from Wang et al., Gene, 1999, 229, 101-108). The characteristics of these signatures are summarized in Table 3 and the results obtained using various statistical methods are shown in Table 4, below.
The results of this validation also clearly demonstrate the failure of conventional statistical methods to consistently handle the microarray expression data, as there was a large disparity in the P values obtained from different statistical tests. However, the IGE scores provide robust and straightforward comparisons that are consistent with the known expression data for these signatures.
Although the systems and methodologies for indexing gene expression data for comparison of gene signatures have been described in language specific to structural features and/or methodological operations or actions, it is understood that the implementations defined in the appended claims are not necessarily limited to the specific features or actions described. For example, although the described systems and methods may refer to the use of microarray data, gene expression data from any source may be used in accordance with embodiments of the present systems and methods. Accordingly, the specific features and operations of the described systems and methods of indexing gene expression data for comparison of gene signatures are disclosed as exemplary forms of implementing the claimed subject matter.
Number | Date | Country | |
---|---|---|---|
Parent | PCT/US11/51147 | Sep 2011 | US |
Child | 14202487 | US |