The invention relates to the field of markers for various types of brain cancer. More particularly, it relies on a sequential system for sorting individual cancer types.
Identification markers for various types of disease conditions have been developed based on gene expression data. Assessment of the transcriptome has been able to identify various markers for diagnosis, prognosis prediction and optimal therapy of various cancers (Friedman, D. R., et al., Clin. Cancer Res. (2009) 15:6947-6955; Khan, J., et al., Nature Med. (2001) 7:673-679; Yeoh, E. J., et al., Cancer Cell (2002) 1:133-143).
These studies, while useful, exhibit a wide variation among various datasets obtained for particular types of cancer. These disparate results may be accounted for by differing methodologies, different demographics among the subjects, individual variation in cancer heterogeneity, and, perhaps, different measurement techniques. Meta-analyses that compile a multiplicity of studies as a basis for judgment have, to some extent, alleviated the problems caused by this variability (Miller, J. A., et al., PNAS (2010) 107:12698-12703; Dudley, J. T., et al., Molecular Systems Biol. (2009) 5:307). However, such meta-analysis has not been provided with respect to determination of markers for various brain cancers.
In addition, others have experimented with data-driven hierarchical approaches to multi-category classification in the context of machine learning (Blanchard, G., et al., Am. Stat. (2005) 33:1155-11202; Amit, Y., et al., IEEE Transactions on Pattern Analysis and Machine Intelligence (2004) 26:1606-1621).
The present inventors have marshaled these techniques specifically with respect to determination and verification of successful gene expression markers for various types of brain tumors.
The invention provides a panel that successfully can distinguish cancerous brain tissue from normal brain tissue, and further can distinguish among six different types of brain cancer with high levels of sensitivity and specificity in correlation with phenotypic assessments. The panel can be employed in a hierarchical discrimination sequence to parse tissues into these six cancerous types. It employs a framework for brain cancer diagnosis that is a tree-structured hierarchy of these brain cancer phenotypes.
Thus, in one aspect, the invention is directed to a panel for distinguishing among normal brain tissue, samples that harbor meningioma (MNG), samples that harbor ependymoma (EPN), samples that harbor medulloblastoma (MDL), samples that harbor glioblastoma (GBM), samples that harbor oligodendroglioma (OLG), and samples that harbor pilocytic astrocytoma (PA) wherein said panel comprises detection reagents for the transcripts of the following genes: PRPF40A and PURA; NRCAM and ISLR; IDH2 and GMDS; SALL1 and PAFAH1B3; SRI and NBEA; DDR1 and TIA1 or MAB21L1; ITPKB and PDS5B; NUP62CL and ZNF280A; GALNS and WAS; CELSR1 and OR10H3; TLE4 and OLIG2; DDX27 and KCNMA1; COX7A2 and GNPTAB; GNPTAB and NDUFS2; APOD and PPIA; CD59 and SNRPB2; SEMA3E and ADAMTS3; HINT1 and CD59; BAMBI and CIAPIN1; FLNA and TNKS2; ITGB3BP and RB1CC1; DDX27 and TRIM8; and LARP5 and ANXA1.
In another aspect, the invention is directed to a method to distinguish among normal brain tissue, samples that harbor MNG, samples that harbor EPN, samples that harbor MDL, samples that harbor GBM, samples that harbor OLG, and samples that harbor PA which method comprises initially distinguishing normal brain tissue from tissue with all of the above-mentioned MNG, EPN, MDL, GBM, OLG and PA, followed by distinguishing samples that harbor MNG from samples that harbor EPN, MDL, GBM, OLG or PA, followed by distinguishing samples that harbor MDL from samples that harbor EPN, GBM, OLG or PA, followed by distinguishing samples that harbor EPN from samples that harbor GBM, OLG or PA, followed by distinguishing samples that harbor PA from samples that harbor GBM or OLG, followed by distinguishing between samples that harbor GBM and samples that harbor OLG.
The invention is thus directed to methods to distinguish individual types of cancers in the context of this method and to kits for performing various portions of the method.
In still another aspect, the invention is directed to a method to identify brain cancer or other disease markers by meta-analysis of multiple datasets designed to identify such markers.
The invention takes advantage of the results from multiple datasets and applies a specific algorithm to order the markers derived from these datasets into a hierarchical system for discriminating between normal tissue and among six different types of brain cancers.
Data-driven, hierarchical approaches to multi-category classification have been investigated extensively in machine learning. A classification framework in the form of a tree-structured hierarchy of sets of different categories, is first designed followed by identifying binary classifiers for all decision points (i.e., nodes and/or edges) of the tree. The sets of binary classifiers are aggregated into a classifier marker-panel, which directs diagnosis of a sample from a subject down the hierarchical structure towards a particular phenotype. The cumulative expression patterns constitute “hierarchically-structured” diagnostic signatures.
A computational approach called Identification of Structured Signatures And Classifiers (ISSAC) based on this idea was developed to identify diagnostic signatures that simultaneously distinguishes major cancers of the human brain. From an integrated dataset of publicly available gene expression data, ISSAC provided a global diagnostic hierarchy and corresponding brain cancer signatures composed of sets of gene-pair classifiers. Integration of datasets from multiple studies enhances the disease signal sufficiently to mitigate batch effects and improve independent validation results.
ISSAC constructs the framework for brain cancer diagnosis as shown in FIG. 1A—a tree-structured hierarchy of all brain cancer phenotypes built using an agglomerative hierarchical clustering algorithm on gene expression training data. Briefly, the construction of the hierarchy relies on the fact that there exist natural groupings among phenotypes based on shared features in their gene expression. As the set of different phenotypes is partitioned into smaller and more homogeneous subsets, the multi-class diagnosis problem is thereby decomposed into more tractable sub-problems.
As shown in
ISSAC identifies a binary classifier corresponding to each node and to each edge of the diagnostic hierarchy. Briefly, each classifier attempts to distinguish between two sets of phenotypes. These classifiers are based on comparing the relative expression values (i.e., ranks) between two genes, or for one or several pairs of genes within a gene expression profile at each stage. The chosen pairs are the ones that best differentiate between the phenotype sets, and are based entirely on the reversal of relative expression, as previously reported (Geman, D., Stat. Apps. in Gen. & Mol. Biol. (2004) 3:Article 19. Briefly, the decision rule by Geman, et al. consists of two genes (gene i and gene j), distinguishing two phenotypes (class A and class B): If the expression of gene i is greater than that of gene j for a given profile, then the phenotype is classified as class A; otherwise, class B. Recently, it has been shown that using such simple decision rules with only a small number of gene-pairs can lead to highly accurate supervised classification of human cancers (Tan, A. C., et al., Bioinformatics (2005) 21:3896-3904).
The objective of a node classifier is to distinguish the set of phenotypes associated with the node from all other phenotypes. Overall, the node classifiers represent a series of coarse-grained to fine-grained explanations of the hierarchical groupings, and are used in diagnosis to screen for phenotype-specific expression patterns. Thus, the hierarchy of binary predictors guides classification of an expression profile in a dynamic “coarse-to-fine” fashion: a classifier is executed if and only if all of its ancestor classifiers have been executed and returned a positive response, i.e., predicted the phenotypes in each node. The cumulative outcome of the node classifiers for a given expression profile is the set of its candidate phenotypes, corresponding to all the leaves of the hierarchy that were reached successfully.
For tie-breaking purposes, ISSAC also identifies classifiers at the edges of the diagnostic hierarchy. The objective of these classifiers is analogous to that of decision rules of an ordinary decision-tree: to distinguish the two sets of phenotypes associated with the two child nodes. The cumulative outcome of the decision-tree classifiers is a unique diagnosis.
Step-By-Step Description of How ISSAC Works
Construction of the Disease Diagnostic Hierarchy
Let £=(d1, . . . , d7) be the collection of class labels, where di denotes brain phenotype i. Using expression profiles of the phenotype classes, we first calculate the Top Scoring Pair (TSP) score (Δ) of all gene-pair combinations between all pair-wise class comparisons. As previously described (17), the TSP score between two classes dm and dn, of two genes, gene i and gene j, is defined as:
Δi,j(dm,dn)=|Pi>j(dm)−Pi>j(dn)|,
where Pi>j(dm) and Pi>j(dn) denotes the percentage of samples in dm and dn, respectively, whose expression of gene i is higher than that of gene j. Δmax(dm, dn) denotes the maximum Δi,j between dm and dn over all gene pairs i and j.
Let C designate an evolving set of groups of labels that starts off as the set of individual class (d1, . . . , d7). The brain disease diagnostic hierarchy was constructed by progressively evolving C towards the set of all groupings in the hierarchy using the following steps:
1. For all pair-wise comparisons of distinct elements in C, we calculate all Δmax. The leaves of the class-pair dm and dn with the smallest value of Δmax are merged into the first node of the tree, denoted as nd
2. Δmax of all pair-wise comparisons of the elements in the updated C are calculated, and the pair with the smallest value of Δmax is grouped into the next node of the tree. Since at this point C contains one non-singleton node and a host of other leaves, the next merging can be either between two leaves du and dv, denoted as nd
3. This process of finding the minimum Δmax for all pair-wise elements in C, and adding the new node in C, is iterated until all nodes and leaves are connected to form a tree structure. All classes combine to form the top node nd
The Markers Used in the Invention Method:
The classifier transcriptome gene expression markers are shown in Table 1.
Thus, the marker panels consist of 39 total gene pairs and 44 unique genes. The 44 genes are available as a subset of Affymetrix® microarrays.
In this table, aNode # corresponds to numerical labels in the diagnostic hierarchy shown in
The notations in Table 2 are as follows:
Node #: Corresponds to numerical labels shown in the brain phenotype diagnostic hierarchy (
To distinguish normal brain tissue from the six cancer types, only a single gene pair need be analyzed—a higher expression of PRPF40A than PURA classifies the tissue as cancerous. In the next step, only a single pair is required to distinguish MNG from the remaining cancer types; a higher expression of ISLR compared to NRCAM classifies the tissue as MNG. On the other hand, to distinguish MDL from the four cancer types EPM, GBM, OLG or PA, it has been found that two pairs need to be compared.
ISSAC uses the gene-pair classifiers for class prediction as described above and shown in
The following examples are intended to illustrate but not to limit the invention.
All transcriptomic data used in our analysis are publicly available at the NCBI Gene Expression Omnibus (GEO). We integrated 921 microarray samples of six brain cancers which are ependymoma (EPN), glioblastoma multiforme (GBM), medulloblastoma (MDL), meningioma (MNG), oligodendroglioma (OLG), pilocytic astrocytoma (PA) and normal brain across 16 independent studies into a transcriptome meta-dataset. Importantly, we obtained the raw data (.CEL files) from each of these studies and preprocessed them simultaneously using identical techniques to reduce extraneous sources of technical artifacts (discussed below). All data manipulation and numerical calculations were performed using MATLAB (MathWorks).
We used the following strict criteria and reasoning to select brain phenotypes, to ensure data quality, and to help control for systemic bias:
1. To facilitate data integration, expression profiles must have been conducted on either the Affymetrix® Human Genome U133A or U133 Plus 2.0 microarray platform. This allowed maximum microarray sample collection without considerable reduction in number of overlapping classifier features (i.e., microarray probe-sets).
2. Transcriptomic datasets (i.e., GSE xxx) for each phenotype must have been collected from at least two independent sources to help mitigate batch effects.
3. All datasets must have consisted of no fewer than 5 microarray samples.
4. All datasets must have originated from primary brain tumor or tissue biopsies. Expression profiles from cell-lines or laser micro-dissections were not used in our study to better ensure sample consistency.
5. Raw microarray intensity data (.CEL files) must have been available on GEO for consensus preprocessing.
6. Sample preparation protocol must have been fully disclosed on GEO.
7. All microarray samples in a dataset of a given phenotype were used in order to take into consideration all sources of heterogeneity.
After an exhaustive search on GEO, we were able to find 921 microarray samples from 16 studies that met the above criteria. Information on all datasets (e.g., publication sources, Affymetrix® platforms, GEO dataset IDs, and microarray sample IDs) used in Table 3 and Table 4.
in human GBM
-catenin pathway, and complete loss of chromosome 6
indicates data missing or illegible when filed
Raw microarray intensity data (.CEL files) were obtained online from GEO and preprocessed simultaneously using identical techniques to reduce extraneous sources of technical artifacts. More specifically, common probe-sets were found across all transcriptome samples, and consensus preprocessing was performed on all the raw microarray image data to build a consensus dataset. This step removes one major non-biological source of variance between different studies. These preprocessed samples were used to build a multi-study, meta-dataset of human brain cancer and normal brain transcriptomes. Finally, stringent probe-set filtering was used to remove spurious classifier features.
The resulting hierarchical markers are shown above in Table 1. The discrimination at each node is shown in
A further summary is found in Table 5.
aAffymetrix ® microarray platform probe IDs of the classifier genes are shown in Table 3 and Table 4.
bFor each gene-pair comparison (i.e., Is Gene i > Gene j ?), 1 and 0 delineates ‘true’ and ‘false’, respectively, and ‘—’ denotes that the outcome is not used for classification.
The classification performance of our brain cancer diagnostic marker-panel was first evaluated by ten-fold cross-validation. Our marker-panel achieved a 90.4% average of phenotype-specific classification accuracies (Table 6), showing strong promise for accurate diagnostics against a multi-category, multi-dataset background at the gene expression level.
92.2
84.8
91.1
97.5
74.6
94.4
98.5
aAccuracies reflect average performance in ten-fold cross-validation conducted ten times. The main diagonal gives the average classification accuracy of each class (bold), and the off-diagonal elements show the erroneous predictions.
bUC (Unclassified samples). When using the node classifiers, expression profiles that did not exert a signature of any phenotype (i.e., did not percolate down to at least one positive terminal node) were rejected from classification. In this case, the Unclassified sample is treated as a misclassification.
In addition, we observed higher classification accuracy (93.2%) among the expression profiles for which a unique diagnosis was obtained without subsequent disambiguation from the decision-tree.
Four brain cancers (ependymoma, medulloblastoma, meningioma, and pilocytic astrocytoma) have accuracies of at least 91.1%, suggesting clear differences between them and the other phenotypes at the transcriptomics level. These cancers arise from unique cell types and regions in the brain, which affects the accuracy of the signatures. Ependymoma is composed of ependymal cells, which are the epithelial layer of the ventricular system of the brain and the spinal cord. Meningioma arises from the arachnoidal cells in the meninges, the system of membranes that covers and protects the central nervous system. Medulloblastoma is a neuroectodermal tumor derived from neural stem cell precursors originating in the cerebellum or posterior fossa. And finally, pilocytic astrocytoma is generally considered a low-grade, benign tumor of astrocytes, usually arising in the cerebellum or hypothalamus. Accordingly, the anatomical region specificity of these four cancers is likely to contribute toward their accurate separation—as there are regional areas of unique gene expression patterns, as discussed below.
The cross-validation accuracies for glioblastoma and oligodendroglioma, two well-progressed gliomas, were 84.8% and 74.6%, respectively. Their lower performance was mainly a consequence of the limited ability of the marker-panel to correctly differentiate these two cancers from each other. Indeed, the distinction of these two phenotypes seems to be rather difficult; although oligodendroglioma is generally characterized by its own unique histological features, it is also known to present morphological traits similar to those of glioblastoma. This suggests that the two phenotypes are not as clearly distinct as presently clinically defined. Interestingly, however, these two accuracies are comparable to those reported previously. Furthermore, our signatures did show an excellent degree of sensitivity (96.4%) and specificity (97.4%) for distinguishing these two well-progressed gliomas as a set from all other brain phenotypes. There exist genetic tests and methods that differentiate glioblastoma and oligodendroglioma well, such as the combined loss of chromosome arms 1p and 19q, and over-expression of transcription factors Olig1 and Olig2.
We trained ISSAC on each of the five transcriptomic datasets (i.e., GSE ####) of glioblastoma individually, coupled in each case to data from all other brain phenotypes. The results from various data handling methods are shown in
In general, GBM signatures from larger datasets (GSE4271, GSE4290) had better average performance than those from smaller datasets (GSE8692, GSE9171), but variation across different validation sets limited overall performance (
These favorable outcomes are likely due to the molecular heterogeneity within and across transcriptomes in this particular dataset adequately encompassing broad, population-level characteristics. This suggests that GSE4271 may serve as an ideal dataset in future studies for learning representative, molecular features of glioblastoma. Indeed, we found that training on GSE4271 was a notable exception; when GSE4290 (77 samples) was used as the training set, there was over a 30% decrease in average glioblastoma classification accuracy (55.5%), despite the nearly identical sample sizes of the two datasets. This shows that any individual dataset, even those of a sufficient sample size, do not consistently yield robust diagnostic signatures.
Signatures from GSE8692 (6 samples) and GSE9171 (13 samples), led to average accuracies of 22.3% and 0.0%, respectively; these significantly low performance results are not surprising given the very small sample numbers. However, that glioblastoma signatures from GSE9171 could not classify even a single sample correctly is an intriguing observation. After searching through sample preparation and handling protocols provided in the publications of all five glioblastoma studies, we were not able to identify any steps unique to the GSE9171 study that could have obviously led to such severe over-fitting. We suspect that, rather than from a single aspect, erroneous signals were obtained from a myriad of different factors, from the lack of variance in the biology of the patient samples studied, to batch effects that compromised transcriptomic measurements, and to possibly unreported variations in standard protocol. Finally, training on GSE4412 (59 samples) gave an average accuracy of 23.1%. Interestingly, the average accuracies from training sets GSE4412 and GSE8692 (23.1% and 22.3%, respectively) were very similar despite almost ten-fold difference in sample sizes (59 and 6 samples, respectively). This implies that, in general, sample size is really not a sole determining factor of signature performance. The overall hold-one-lab-in validation performance, or the average of all classification accuracies in
We found considerable discrepancy between the minimum and maximum validation set accuracies for training sets GSE4412 (0.0% and 83.33%) and GSE4290 (16.7% and 92.31%) (Table 8). This shows that batch effects, as well as potential biological discrepancies between populations studied at different sites, can lead to remarkable variation among transcriptomic datasets of the supposedly same phenotype. This “dataset variation” is widespread in large-scale expression studies, causing inconsistencies in diagnostic signature identification and performance reproducibility. Large variation within and across transcriptomic datasets of glioblastoma is not surprising, given that glioblastoma is known to have various molecular subtypes. Therefore, as mentioned above, diagnostic signatures from any single dataset need to be approached with caution.
We next analyzed how the multi-study integration approach affects performance robustness. One of each of the five datasets of glioblastoma was sequentially withheld as the validation set, while all remaining gene expression data (including those from other phenotypes) were used for training. The glioblastoma signature was then evaluated on the held-out validation set. We term this strategy as “leave-one-lab-out validation.”
Classification accuracies ranged from 63.2% (GBM training set: 155 samples across four datasets; validation set: GSE4271, 76 samples) to 100% (GBM training set: 225 samples across four datasets; validation set: GSE8692, 6 samples) (
To evaluate how multi-study dataset integration alone affects performance robustness, we performed hold-one-lab-in and leave-one-lab-out validations for GSE4412, GSE4271, and GSE4290 (59, 76, and 77 samples, respectively) while training on the same number of samples for glioblastoma. More specifically, the same steps in the analyses of
The results we observed from these analyses were consistent with our two aforementioned conclusions, as shown in Table 9.
First, when a diagnostic signature is learned from an individual dataset, its ability to accurately and precisely represent phenotype features across a broad population highly varies depending on the particular dataset used for training (
Second, combining datasets considerably increased average accuracy (
Thus, dataset integration across multiple studies, even without change in sample size, can lead to significant improvements in diagnostic performance.
We used the results in
When we performed the stringent test of obtaining a diagnostic signature from a single dataset of glioblastoma, we found the variation between individual studies often have a larger effect on the transcriptome than did phenotype differences, resulting in dramatically decreased average accuracy. However, we found that learning signatures across multiple datasets significantly improved average accuracy with concomitant reduction in performance variance, even when keeping the size of the training set the same. This was most likely due to the meta-signature encompassing more of the heterogeneity across different sources and conditions, while not losing focus on the important, global characteristics of the phenotype.
This invention was supported in part by a National Institutes of Health/National Center for Research Resources Grant UL1 RR 025005 (DG), and the Grand Duchy of Luxembourg-Institute for Systems Biology Program (LH, NDP). The U.S. government has certain rights in this invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2013/067890 | 10/31/2013 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
61720947 | Oct 2012 | US |