The present invention is of a method and system for non-linear quantification of pathway deregulation for analysis of malignancies, and in particular of such a method and system for characterizing malignancies and tumor behavior, including with regard to the extent of pathway deregulation for a particular tumor.
The operation of many important pathways is altered during cancer initiation and progression. Identifying the involved pathways and quantifying their deregulation may improve understanding of the malignant process1-5. Moreover, since advanced therapies target specific pathways, pathway-level understanding is a key step for developing personalized cancer treatments. Indeed, many methods (e.g.5-14) were developed for pathway analysis of high throughput (mainly expression) data. Nearly all methods analyze pathway activity in a global manner, on the basis of an entire sample set, and do not attempt to characterize individual tumors. One prominent exception is PARADIGM15, a powerful tool which deduces an integrated pathway activity (IPA) from mRNA expression and DNA copy number data, for each sample, on the basis of known pathway structure. However, PARDAGIM cannot analyze well complex pathways when either detailed knowledge of the mechanism of pathway activity or essential relevant data (such as protein abundance and phosphorylation) is unavailable, which is the case for a vast majority of existing cancer datasets.
The present invention, in at least some embodiments, is of a method and system for non-linear quantification of pathway deregulation for analysis of malignancies. According to at least some embodiments, the method and system are suitable for characterizing malignancies and tumor behavior, optionally and preferably including with regard to the extent of pathway deregulation for a particular tumor.
According to at least some embodiments of the present invention, no detailed knowledge of the network or mechanism of the pathway activity is needed. Preferably, deregulation levels of many complex pathways are estimated, generating a pathway-level representation of every sample. Such a representation is demonstrated below to be clinically relevant for some exemplary, illustrative non-limiting examples of cancers, and to generate novel stratifications and outcome predictors for these cancers, for which data is shown for glioblastoma and colon cancer, specifically on three colorectal cancer datasets and two glioblastoma multiforme datasets.
According to at least some embodiments, the method comprises inferring pathway deregulation scores for each tumor sample on the basis of expression data. This score is determined, in a context-specific manner, for every particular data set and type of cancer that is being investigated. The algorithm transforms gene level information into pathway level information, generating a compact and biologically relevant representation of each sample.
According to at least some embodiments, there are provided specific pathways that are demonstrated below to be significantly associated with survival of glioblastoma patients, and two, CXCR3-mediated signaling and oxidative phosphorylation, whose scores are predictive of survival in colorectal cancer.
According to at least some embodiments, there are provided specific cancer sub-classes, including but not limited to a sub-class of each of Proneural and Neural glioblastoma with significantly better survival, and a new class of EGFR-deregulated colon cancers.
According to at least some embodiments of the present invention, there is provided a method for non-linear quantification of pathway deregulation for analysis of a malignancy, comprising selecting a plurality of biological pathways P for a malignancy; for each pathway P, identifying dP genes that belong to it; determining expression values for each of said dP genes, wherein each sample i of a plurality of samples is represented by a point Xi of the expression values of said genes, said plurality of samples comprising malignant and normal tissue samples; calculating a Principal Curve of the cloud of points formed by the full sample set of the expression values of said genes, projecting every point onto said Principal Curve, denoting by Y, the projection of Xi onto said curve. Optionally, the method further comprises determining, for each pathway P, the center of mass of the points that represent all the available normal tissue samples; and determining quantification of pathway deregulation for each individual malignant sample by comparing said center of mass of the normal tissue to said point representing every one of the malignant samples.
Optionally, said comparing said center of mass of the normal tissue to said malignant sample comprises calculating said center of mass of the normal tissue, its projection, N, onto said Principal Curve of the cloud of points formed by the full sample set of the expression values of said genes, projecting every point Xi onto said Principal Curve, for each of the normal and malignant samples; determining the distance of the projection, Yi from N, measured along the principal curve, wherein said distance is Di (P), the Deregulation Score of pathway P in sample i; and determining, for every sample from said plurality of samples, a plurality of deregulation scores for said plurality of pathways P.
Optionally, the method further comprises clustering, on the basis of said plurality of deregulation scores, the said plurality of samples and said plurality of pathways P.
Optionally the method further comprises determining a class or subclass for said malignancy according to said clustering of the samples.
FIGS. 1A and 1B—The principal curve learned for the apoptosis KEGG pathway on the Sheffer et al. colorectal dataset. The data points (representing samples of different tissue types) and the principal curve are projected onto the three leading principal components. A. The principal curve (in blue) going through the cloud of samples. The arrow indicates the direction of the curve. B. The samples projected onto the curve.
FIGS. 5A and 5B—A. PARADIGM's IPAs for the TCGA GBM dataset. Each row corresponds to a PARADIGM “entity” and each column to a sample. Entities (pathways, interactions, complexes etc.) and samples are clustered according to the IPAs. Blue color represents low activity, and red high activity. The bottom bar displays the subtype. B. PARADIGM's IPAs correlated with mutations. The bottom bars display the mutation status for the corresponding gene.
FIG. 6—Expression of genes belonging to the immune pathway cluster ShP2. The upper panel shows the PDS of cluster ShP2. As the PDS of the normal samples is around zero (green-blue), highly positive PDS (dark red) and highly negative PDS (dark blue) correspond to pathway deregulation, but in different directions. The lower panel shows the expression of genes that participate in at least a quarter of the pathways in ShP2: each row represent a gene, blue corresponds to low expressions and red to high expressions. The color bars at the bottom correspond to the tissue type of the tumor (normal, polyp, tumor and metastasis) and the CIN index (equally distributed into 10 bins). These genes are related to lymphocytes, and may represent TILs (tumor infiltrating lymphocytes). Note that positive PDS co-occur with low expression (mostly in metastatic samples and in some of the polyps), while negative PDS correlate with high expression (mostly found among the tumors with low CIN).
FIG. 8—Clustering analysis of all the PDS in the Kogo dataset. Each row corresponds to a pathway and each column—to a sample. Pathways and samples are clustered according to PDS. For most pathways the PDS of the normal samples is minimal (dark blue), and hence the higher the PDS is the more deregulated the pathway is. For a few pathways (mostly in KoP1) the PDS of normal samples is around zero (green-blue), and hence highly positive PDS (dark red) and highly negative PDS (dark blue) both correspond to pathway deregulation, but in different directions. The color bars at the bottom correspond to the tissue type of the tumor (normal, tumors) and the CIN index (equally distributed into 10 bins).
FIG. 9—Summary of recurrent clusters in the colorectal datasets (A. Sheffer, B. Sveen, C. Kogo). Each row corresponds to a pathway cluster and each column to a sample cluster, displaying the median value of deregulation for each pair of clusters. These four pathway clusters were reproduced in all three datasets. In the first row are pathway clusters for which the PDS of the normal samples is minimal around zero (cyan), and hence highly positive PDS (dark red) and highly negative PDS (dark blue) both correspond to pathway deregulation, but in different directions. For all other clusters the PDS of the normal samples is minimal (dark blue), and hence the higher the PDS is the more deregulated the pathway is. Arrows connect between pathway clusters that match (that is, the pathways in the clusters significant overlap).
FIG. 10—Hastie and Stuetzle's principal curve for simulated noisy trajectory. After 20 iterations the algorithm converges to the (wrong) curve shown in brown. The initial guess (first principal component) is shown in red. t=0 points are in green.
FIG. 11—Hastie and Stuetzle's principal curve for simulated noisy trajectory, with an educated initial guess. The initial starting curve (in red) reflects external knowledge and goes through the centers of the points labeled as t=0 (in green) and t=1. After 15 iterations the algorithm converges to the (wrong) purple curve.
FIG. 12—Modified semi-supervised principal curve for simulated noisy trajectory. After 10 iterations the algorithm converges to the curve shown in dark blue, capturing the real trajectory. The initial guess (first principal component) is shown in red. t=0 points are shown in green.
FIG. 13—Modified semi-supervised principal curve for simulated noisy trajectory. After 20 iterations the algorithm converges to the curve shown in brown, capturing the real trajectory. The initial guess (line that goes through the center of t=0 and t=1 points) is shown in red. t=0 points are shown in green.
FIG. 17—Pathways predicting survival in both glioblastoma datasets. Significant agreement was found between the pathways that predict survival. Pathways were identified by logrank p-value, with FDR (false discovery rate)<10% for each dataset.
According to at least some embodiments, there is provided a method and system for non-linear quantification of pathway deregulation for analysis of malignancies. According to at least some embodiments, the method and system are suitable for characterizing malignancies and tumor behavior, optionally and preferably including with regard to the extent of pathway deregulation for a particular tumor.
According to at least some embodiments of the present invention, there is provided a method for non-linear quantification of pathway deregulation for analysis of a malignancy, comprising selecting a plurality of biological pathways P for a malignancy; for each pathway P, identifying dP genes that belong to it; determining expression values for each of said dP genes, wherein each sample i of a plurality of samples is represented by a point Xi of the expression values of said genes, said plurality of samples comprising malignant and normal tissue samples; calculating a Principal Curve of the cloud of points formed by the full sample set of the expression values of said genes, projecting every point onto said Principal Curve, denoting by Y, the projection of Xi onto said curve.
Optionally, the method further comprises determining, for each pathway P, the center of mass of the points that represent all the available normal tissue samples; and determining quantification of pathway deregulation for each individual malignant sample by comparing said center of mass of the normal tissue to said point representing every one of the malignant samples.
Optionally, said comparing said center of mass of the normal tissue to said malignant sample comprises calculating said center of mass of the normal tissue, its projection, N, onto said Principal Curve of the cloud of points formed by the full sample set of the expression values of said genes, projecting every point Xi onto said Principal Curve, for each of the normal and malignant samples; determining the distance of the projection, Yi from N, measured along the principal curve, wherein said distance is Di (P), the Deregulation Score of pathway P in sample i; and determining, for every sample from said plurality of samples, a plurality of deregulation scores for said plurality of pathways P.
Optionally, the method further comprises clustering, on the basis of said plurality of deregulation scores, the said plurality of samples and the said plurality of pathways P.
Optionally the method further comprises determining a class or subclass for said malignancy according to said clustering of the samples.
Scoring a Gene Set
According to at least some embodiments, the above described method may optionally be performed as follows. Denote by SP the dP-dimensional space, where each coordinate is the expression level of a gene that belongs to a given pathway P, and represent each sample by a point in this space. Determine a one-dimensional curve in SP (or in a subspace of SP) that best describes the variability (e.g., due to disease progression) of the samples across SP. That is, a curve that passes through the “middle of the cloud” of samples is found, and it is assumed that any two points (samples) that have proximal projections onto the curve, also share similar pathway functionality.
Variance Stabilization:
Since for some genes one can observe large variation of expression, while for others a similar effect on a pathway's functionality may be induced by a smaller variation, the absolute expression values are not used. Rather, a gene's expression values are divided by the standard deviation of its expression in some normalization set of samples (such as normal samples, or the entire set of samples). To avoid genes whose entire variation is due to noise, optionally and preferably genes with little variation within all samples (normal and malignant) are omitted.
Correlations:
Many of the genes in the gene set of a pathway might be highly correlated, conveying the same information, while some other important information might reside in a single gene in the set. To counter this effect, and to improve the running time, a curve is preferably not searched in SP, but in a space SP′ of smaller dimensionality k, identified as follows. First optionally and preferably PCA is performed, followed by selecting only the principal components with high variance (for this Example the threshold was 1.1, i.e. keeping PC along which the variance exceeds by more than 10% that of the normalization set). The number of such PCs is k and they define the space SP′, in which the entire ensuing computation is done.
Principal Curve:
Hastie and Stuetzle's algorithm20 was used to find a principal curve in SP′ (see
In some cases the normal samples fall roughly in the middle of the curve. When this happens, the curve captures two different kinds of deregulation, with samples moving away from the normal samples along two distinct paths. In principle one can use other (than normal) samples as reference, though doing this makes sense only in cases when the inner variability of the new reference set is considerably smaller than the overall variability.
Finding a Stable Gene Set
Often some of the genes in the gene set are noisy (in the sense that their variation does not reflect information relevant to the biology to be analyzed, see Discussion), and they are preferably omitted. Since the analysis is performed in SP′ and not in SP, metagenes (linear combinations of genes) are optionally and preferably omitted, but similar considerations imply that some of the metagenes might be noisy and should be omitted. This is partly taken care of by omitting genes and metagenes that don't vary much, but some of the noise might be due to highly varying metagenes, where most of the variation is unrelated to the biological information captured by the gene set.
To find out which metagenes should be omitted, one at a time, those metagenes were selected along which the samples were farthest from the curve, as expected for noisy metagenes, and found after each omission the new corresponding principal curve. To assess which curve is the best, the sensitivity of the gene set's scores was compared to sampling noise (the variance over 100 repeats of 80% of the samples selected randomly each time). If there is a significant improvement in the stability, the metagene was omitted whose omission yielded the most stable curve, and continue in a greedy fashion. If the improvement is not significant (or stability actually becomes worse), the process stops.
Principal Curves
The concept of principal curve was first proposed by Hastie and Stuetzle20 as a non-parametric nonlinear extension of the linear Principal Component Analysis. By f(λ) a curve in p-dimensional space is denoted, where λ is a single parameter whose variation traces all the points along the curve. A curve f is defined to be a principal curve associated with a distribution P(x) defined over some space, if and only if it is a smooth, one dimensional non intersecting curve that is self-consistent, i.e. each point y on the curve is the expected value of all the points x whose projection onto the curve is y. Let the projection index λf (x) be the value of λ for which the projection of x on the curve is f(λ), i.e.:
The condition for self-consistency is simply
f(λ)=E(X|λf(X)=λ)
Since in practice one is given a finite data set X, of n points in dP dimensional space XεMn×d
1. Conditional-Expectation step: Fix λ={λi}i=1n and minimize E∥X−f(λ)∥ by setting f(λi) to be the local average (of the points projected onto a neighborhood of f(λi), e.g. the points xj for which λj≃λi).
2. Projection step: Given f={f(λi)}i=1n find for each xi the corresponding value of λi=λf(xi) assuming f is piecewise linear.
The line along the first linear principal component is used as a starting curve, and the algorithm is iterated until convergence.
Since Hastie and Stuetzle's seminal publication, many alternatives and generalizations were offered56-61, but they all focused on unsupervised learning, not allowing the incorporation of additional information. In particular, another non-linear trajectory, defined in the full expression space, was used to capture tumor progression62. This Example comprises a modification to Hastie and Stuetzle's algorithm that allows introduction of prior belief on the order of the curve parameter {λi}i=1n. This prior affects the estimation of the distribution from which X is sampled, and therefore generates a modified principal curve.
Detailed Exemplary Description for Determining the Analysis:
The concept of principal curve was first proposed by Hastie and Stuetzle (1) as a non-parametric nonlinear extension of the linear Principal Component Analysis. In this Example, one denotes by f(λ) a curve in p-dimensional space, where λ is a single parameter whose variation traces all the points along the curve. A curve f is defined to be a principal curve associated with a distribution P(x) defined over some space, if and only if it is a smooth, one dimensional non intersecting curve that is self-consistent, i.e. each point y on the curve is the expected value of all the points x whose projection onto the curve is y. Let the projection index λf (x) be the value λ of for which the projection of x on the curve is f(λ), i.e.:
The condition for self-consistency is simply
f(λ)=E(X|λf(X)=λ)
Since in practice one is given a finite data set X, of n points in dP dimensional space XεMn×d
The line along the first linear principal component is used as a starting curve, and the algorithm is iterated until convergence.
Since Hastie and Stuetzle's seminal publication, many alternatives and generalizations were offered (2-7), but they all focused on unsupervised learning, not allowing the incorporation of additional information. In particular, another non-linear trajectory, defined in the full expression space, was used to capture tumor progression(8).
Integration of External Information: Semi Supervised Principal Curve
Here a modification to Hastie and Stuetzle's algorithm is described that allows introduction of prior belief on the order of the curve parameter {λi}i=1n. This prior affects the estimation of the distribution from which X is sampled, and therefore generates a modified principal curve. Finding a principal curve is closely related to finding the “correct” order to go through the data points. A curve explicitly defines that order (the ranking of {λi}i=1n), and given an order, a curve could be deduced, e.g. by Hastie and Stuetzle's(1) Conditional-Expectation step (or any other data fitting approach). Therefore, if one has detailed information on the order, finding the curve is very simple. However, if one has only vague or imprecise information, one is facing the full problem of finding a principal curve, but the information still might help us to improve identification of the desired curve. This knowledge is captured by assuming that the Spearman correlation of λ={λi}i=1n and a vector of some (approximately known) ranking is high. Spearman correlation was selected as a non-limiting exemplary correlation since (a) it requires only the knowledge of ranks (allowing ties to represent uncertainties), and not the explicit values of some variable, and (b) it does not depend on the specific details of the curve parameterization.
Notice that in Hastie and Stuetzle's algorithm, the conditional-expectation step is not affected by ranking prior, as the {λi}i=1n are given and fixed. Therefore only the projection step is adapted, in a way that incorporates the above described prior. The projection step is equivalent to minimizing ∥xi−f(λi)∥ for every 1≦i≦n. It is desirable to simultaneously maximize the Spearman correlation
where di (λi) is the difference between the rank of λi and the rank of i in the given guiding ranks vector. Therefore, the minimized function can be written as
where α reflects confidence in the ordering.
Thanks to the iterative framework, it is not necessary to find the global minimum in each step. Therefore, the method used a greedy approach to save computation. First one computes for every xi the distance to each segment of the curve, and then one selects the νi=λi that minimize the distance. Given these {νi}i=1n, the following is selected for every
that is, the λi that minimizes the weighted sum of the distance and the difference in contribution to the correlation term, given that all the other λ's remain the same. Notice that the first term was divided by dP and the second multiplied by n, so that both terms will be independent of dP and n, but this does not harm the optimization as it can be introduced back via α.
As noted in (2), Hastie and Stuetzle's algorithm can be analyzed from a probabilistic point of view. A principal curve with cubic spline smoothing is the solution for maximizing the likelihood that the curve generated the given data, assuming Gaussian noise and a prior on the smoothness of the curve. This modification, therefore, can be viewed as adding another prior, according to which the probability of a curve with {λi}i=1n is
In addition, since some ranking is known, the initial guess is set to be piecewise linear, generated by first calculating the average of all the points with the same rank, and connecting the averages corresponding to consecutive ranks. If the rank of most points is unknown, this might be a poor guess and one therefore starts with the first principal component, making sure that its direction does not impose a reverse order to that of the ranking.
Setting the Weight α
In order to automatically learn a suitable value of a, the following heuristic may optionally be employed. Find the best curve for a set of values of αε[0,1) (say, in steps of 0.05) and record for each one the sum of squared distances of the data points to the curve, D(α), as well as the Spearman correlation ρ(a) between {λi}i=1n and the labels. Ideally, one would like to minimize the distance and maximize the correlation. However, as that is unlikely to occur simultaneously, one needs to assign these two demands some weights.
One can therefore weigh the minimization goal g(α,R)=D(α)−Rρ(α), so that the right choice of R will allow us to select the best a. One possibility is to set
or any other similar function of a, but this might lead to over or under-estimation of α—for example, if the maximal correlation is low, one would choose R=0, so that the correlation will not affect the target function, but this will force us to select α=0, even though a higher value may be more appropriate, as it will improve the correlation with almost no effect on the distance. Therefore, one needs to weigh the two terms in a way that is independent of a, yet flexible enough to handle those cases when both terms can be minimized considerably, as well as cases when one cannot find a low enough minimum for one of the terms. To achieve this, one can minimize the difference between the standardized terms, that is:
The mean and the variance are taken over the different values achieved for the different α's.
In many cases one may use the ranking information only to guide the curve learning process, and not to set the actual final projection. To achieve that one performs an additional iteration with no ranking information (α=0). One reason one would want to do this is that often the ranking information cannot be trusted, or one is not confident of its relevance to the gene set in question. Another related reason is that since one already knows the guiding ranking, new information that can be obtained from the gene set is more interesting, even at the cost of losing consistency with the guiding ranking. Furthermore, the curve can be used to score new samples, for which no ranking information is available (for example using patient survival as the guiding ranking, and then predicting survival for a new patient on the basis of its projection onto the curve). In this the case the curve's performance should be assessed in the same framework in which it is meant to be applied; that is, with no available ranking information.
A Simulated Example for Semi-Supervised Principal Curves
Suppose one measures the location of a particle over time. The location measurements are a little noisy, and the time measurements are very noisy, to the extent that it is only possible to determine whether the measurement was done at time t=0 or t=1 (e.g. the particle moves very fast). One would like to deduce the trajectory from the data. Approximately linear trajectories are easy to deduce, but complex ones require using principal curves or similar methods. In this simple example the trajectory (r sin θ, r cos θ) was selected with Gaussian noise with standard deviation 0.3, where rε[1, 3] and θε[0, 2π].
As expected, an unsupervised principal curve fails to capture the correct trajectory (see
Data and Preprocessing
GBM mRNA expression data was downloaded from TCGA data portal on April 2011 (9). To reduce batch effects of arrays and protocols, only data from Agilent G4502A arrays measured at the UNC medical school were used, yielding 455 samples, 10 of which were from normal tissue. Additionally, 228 glioblastoma samples and 28 normal brain samples were obtained from REMBRANDT (10). Subtypes classification was taken from Veerhak et al. (11) Classification of the REMBRANDT data and additional TCGA samples was done using the same genes. Colorectal mRNA data was taken from Sheffer et al. (12) which contains 313 samples including normal tissue (52 samples), polyps (49), primary tumors (182) and metastases from the liver (21) and lung (9); Sveen et al. (13) containing 13 normals and 76 primary tumors of stages 2,3 (one tumor sample was removed); and Kogo et al. (14) containing 9 normal and 132 primary tumors of all stages.
For TCGA data level 3 already processed data was used and for Kogo data the downloaded processed data was used. For all the rest, data was summarized with PLIER and normalized with cyclic LOWESS (15). To eliminate noisy genes only the 5000 most varying genes for each cancer type (sum of variation on all 2 or 3 datasets) were selected for further analysis.
Assembly of Pathway Associated Gene Sets
Gene sets were imported from three pathway databases, KEGG (16, 17), BioCarta (18) (both downloaded from MSigDB (19)) and the NCI-Nature curated Pathway Interaction Database (20). Identity of genes in gene sets was decided according to their official gene symbols. Gene sets with less than 3 genes varying in the data were omitted, leaving 173 KEGG pathways, 188 BioCarta pathways and 197 NCI-Nature PID pathways.
Chromosomal Instability Index
Chromosomal instability index (CIN) was deduced from gene expression, following reference (12). For a given tumor sample, each chromosomal arm was scored as follows. For every gene i calculate fci the fold change versus the median expression of the gene in the normal samples. The median fold change of the chromosomal arm a is defined as
The total chromosomal instability index is the sum (over all arms) of the squared median fold changes
Spearman correlation between the CIN index and the deregulation score of every pathway was calculated across each colorectal dataset. The list of pathways passing 5% FDR for every dataset was recorded. The Fisher exact test was applied to evaluate the significance of the similarity between every pair of pathway lists.
Pathway Deregulation Scores that are Correlated with Necrosis Levels of Glioblastoma
The PDS of 242 pathways significantly correlate with the necrosis levels of the samples, as quantified by TCGA (FDR<1%), see
Pathway Clusters in the TCGA GBM Dataset
Pathway cluster TgP1 consists of cell cycle arrest and cell death pathways; TgP2 contains cell cycle pathways and many of KEGG's “cancer” pathways (including glioma) which capture cancer progression and signaling; TgP3 contains mainly cell death and DNA repair pathways and is deregulated mostly on the Neural and Proneural samples; The pathways of cluster TgP4 correspond to the EGF activated pathways mentioned above; Cluster TgP5 contains pathways that are deregulated mostly on the Classical samples. Some of them are indeed suspected to be specific to this subtype, such as hedgehog-GLI signaling(11) and GPCR/CXCR4 signaling(21) while the deregulation of some others in this subtype is a new prediction: such as PAR1(F2R)-mediated thrombin signaling, axon guidance, etc.; Half of the TgP6 pathways involve alpha synuclein amyloids; All TgP7 pathways involve phospholipase C; the pathways that comprise clusters TgP8-TgP10, TgP12-TgP15 belong to the 242 pathways that were correlated with necrosis that were mentioned above, and are also highly expressed on many Mesenchymal samples. As mentioned, many of these pathways (and specifically the pathways of TgP8-TgP10, TgP13 and TgP15) are related to hypoxia and angiogenesis, and it was found, in agreement with previous knowledge, that they score higher in Mesenchymal glioblastoma(22). Clusters TgP8 and TgP12 contain several Epithelial-Mesenchymal Transition (EMT) related pathways (such as N-cadherin signaling, epithelial tight junctions, Rho/Rac/CDC42 signaling, regulation of actin cytoskeleton, ECM-receptor, Focal adhesion) obviously related to Mesenchymal tumors; 7 of the 8 pathways of TgP11 are key signaling pathways involving caveolin; TgP12 contains many of the pathways correlated with NF1 mutation (P3 in
Matching Between Rembrandt and TCGA GBM Pathway Clusters
Cluster ReP1 matches TgP1, ReP2 matches TgP2, ReP3 matches TgP15, ReP4 matches TgP16, ReP7 matches TgP3, ReP8 matches TgP14, ReP9 includes parts of TgP10-TgP13 (most strongly related to TgP12), and ReP10 includes parts of TgP4-TgP9 (most strongly related to TgP5 and TgP9). Under this mapping, similar characteristics of the deregulation profiles of sample types emerge (
Deregulation Scores of Many Pathways are Correlated with Survival of GBM Patients
35 pathways were found to be related to survival in both GBM datasets (see Table 1), many of them make biological sense: Agrin deregulation may temper the blood-brain barrier in glioblastoma(23); Growth hormone (GH) plays a crucial role in stimulating and controlling the growth, metabolism and differentiation of many mammalian cells, and hence clearly relevant for cancer aggressiveness(24); The hematopoiesis pathway contains cytokines and it is suspected to be related to cancer progression and drug resistance by interactions with the immune system(25-27); Linolenic acids and their products were suggested to prolong cancer patient survival(28); FcεRI may protect against cancer by IgE antitumor immunity(29); Cell-matrix adhesions are clearly related to invasion and metastasis(30); GnRH is a neurohormone that may drive proliferation in glioblastoma and other cancers, and therefore is also a suggested drug target(31-36); Phosphatidyl-inositol 3- and 4-kinases, key ingredients of the inositol phosphate pathway, are known to have important roles in glioblastoma and cancer in general, and hence are possible drug targets(37-39); Surprisingly, cholera toxin was also found related to glioblastoma; WNT signaling has a key role in brain and other cancers, and is related to cancer stem-like cells and poor prognosis(40-42); Alterations in E-cadherin mediated cell-cell adhesion are associated with an increase in carcinoma cell motility, invasiveness and metastasis(43); Glypican-1 is crucial for efficient growth, metastasis, and angiogenesis of cancer, and lack of it slows down pancreatic tumor progression(44); Fas and TNF-alpha are key players in apoptosis whose deregulation is a clear hallmark of cancer(45, 46); α4β1 integrin is related to angiogenesis(47), and is involved the survival and chemoresistance of several types of cancer(48); P2 integrins are known to predict poor survival in blood cancers(49, 50), and may cause fibrinolysis via uPA/uPAR; α6 integrin regulated stemness and invasiveness in glioblastoma(51, 52), and has a prognostic value in breast cancer(53); P53 is a key player in glioblastoma and cancer in general; PTPN1 (aka PTP1B) may promote apoptosis in cancer and is a possible drug target for gliomas(54); Reelin is a secreted glycoprotein guiding migratory neurons, it is downregulated in neuroblastoma, which may contribute to metastasis(55, 56); Syndecans induce proliferation and invasion(57-59), and may serve as a prognostic predictor(60, 61).
Deregulation Scores of Many Pathways are Correlated with CIN in Colorectal Tumors
84 pathways were found to have significant positive correlation with the estimated CIN level in the tumors. To check that this correlation does not reflect only the differences between the MSI-high (most with low CIN) and MSS (most with high CIN) subtypes, the correlations for the subset of MSS and MSI-low tumors of the Sheffer and Sveen datasets were recalculated. 71 pathways were significant (p<0.046, Fisher exact test), with an overlap of 55 with the previously found 84 pathways, suggesting that the correlations between CIN and deregulation of tumorigenic pathways is not only due to the differences between MSS and MSI-high tumors.
Deregulation Scores in MSS Colorectal Tumors Compared with MSI-High
In the Sheffer dataset, 325 pathways are differentially deregulated between MSS and MSI-high tumors (Mann-Whitney, 10% FDR,
Pathway Clusters in the Sheffer Colorectal Dataset
ShP1 includes B-cell receptor and T-cell receptor pathways, and ShP2 includes antigen processing and presentation, T-cell differentiation, T helper cell surface molecules, T cytotoxic cell surface molecules, Graft-versus-host disease, Autoimmune thyroid disease etc. Clusters ShP4 and ShP5 include ECM pathways, focal adhesion, syndecan pathways, integrin pathways such as α9/β1 that induce adhesion and migration of endothelial and cancer cells(64); interleukines that mediate inflammation and angiogenesis(65); toll like receptors, that are involved in innate immune response and HIF2, a transcription factor that induces the hypoxia response. Other pathways are related to metabolism, such as glycolysis and drug metabolism. Cluster ShP7 includes regulation of actin cytoskeleton, cell adhesion, JAK/STAT signaling, MAPK signaling and complement and coagulation cascades pathway. Interestingly, this cluster is also deregulated in the polyps (cluster ShS2), although polyps show low level of CIN. Cluster ShP8 includes various cAMP-dependent signaling pathways, triggered by receptor binding to GPCRs involving the G-Protein such as insulin and BAD phosphrylation. Other pathways include metabolism of different amino acids and TGF-beta signaling. Cluster ShP9 includes a number of death associated pathways such as apoptosis, T-cell apoptosis, p53 downstream signaling, lysosome and FAS signaling. In addition, it includes also fatty acid metabolism, VEGF, mTOR, and notch signaling. Cluster ShP10 includes mitochondrial metabolic pathways such as pyruvate metabolism, TCA cycle, metabolism of sugars and oxidative phosphorylation. Cluster ShP11 includes cell cycle related pathways such as G1/S check point, aurora pathway, p53 upstream regulation, E2F, RB1, MCM, DNA replication, mismatch repair, purine metabolism, and more.
72 pathways out of the 106 that exhibited increase of PDS with progression of the disease (see above) belong to clusters ShP8, ShP9 and ShP10 (p<10−3 for all three clusters, Fisher exact test). This group of pathways consist of cell death, cAMP-dependent signaling and mitochondrial metabolism pathways, along with p53 pathways.
Data and Preprocessing
GBM mRNA expression data was downloaded from TCGA data portal on April 201121. To reduce batch effects of arrays and protocols, only data from Agilent G4502A arrays measured at the UNC medical school were used, yielding 455 samples, 10 of which were from normal tissue. Additionally, 228 glioblastoma samples and 28 normal brain samples were obtained from REMBRANDT23. Subtypes classification was taken from Veerhak et al.25 Classification of the REMBRANDT data and additional TCGA samples was done using the same genes. Colorectal mRNA data was taken from Sheffer et al.41 which contains 313 samples including normal tissue (52 samples), polyps (49), primary tumors (182) and metastases from the liver (21) and lung (9); Sveen et al.44 containing 13 normals and 76 primary tumors of stages 2,3 (one tumor sample was removed); and Kogo et al.45 containing 9 normal and 132 primary tumors of all stages.
For TCGA data level 3 already processed data was used and for Kogo data the downloaded processed data was used. For all the rest, data was summarized with PLIER and normalized with cyclic LOWESS63. To eliminate noisy genes only the 5000 most varying genes for each cancer type (sum of variation on all 2 or 3 datasets) were selected for further analysis.
Assembly of Pathway Associated Gene Sets
Gene sets were imported from three pathway databases, KEGG16,17, BioCarta18 (both downloaded from MSigDB64) and the NCI-Nature curated Pathway Interaction Database19. Identity of genes in gene sets was decided according to their official gene symbols. Gene sets with less than 3 genes varying in the data were omitted, leaving 173 KEGG pathways, 188 BioCarta pathways and 197 NCI-Nature PID pathways.
Chromosomal Instability Index
Chromosomal instability index (CIN) was deduced from gene expression, following reference41. For a given tumor sample, each chromosomal arm was scored as follows. For every gene i calculate fci the fold change versus the median expression of the gene in the normal samples. The median fold change of the chromosomal arm a is defined as
The total chromosomal instability index is the sum (over all arms) of the squared median fold changes
Spearman correlation between the CIN index and the deregulation score of every pathway was calculated across each colorectal dataset. The list of pathways passing 5% FDR was recorded for every dataset. The Fisher exact test was applied to evaluate the significance of the similarity between every pair of pathway lists.
Implementation and Availability
The code is implemented in R and Fortran based on “princurve 1.1-10” by Andreas Weingessel (http://cran.r-project.org/web/packages/princurve/index.html), which is in turn based on the original S/Fortran code princury by Hastie.
Results
Brief Outline of the Above Described Method
The above described method analyzes NP pathways, one at a time, and assigns to each sample i and pathway P a score DP(i), which estimates the extent to which the behavior of pathway P deviates, in sample i, from normal. To determine this pathway deregulation score (PDS), the expression levels of those dP genes that belong to P (e.g. using databases such as16-19) were used. Each sample i is a point in this dP dimensional space; the entire set of samples forms a cloud of points, and the “principal curve”20, that captures the variation of this cloud was calculated. Next each sample was projected onto this curve; the PDS is defined as the distance DP(i), measured along the curve, of the projection of sample i, from the projection of the normal samples (see
The PDS capture biologically relevant information in glioblastoma
The above described method was applied to expression data from 445 glioblastoma multiforme (GBM) and 10 normal brain samples from The Cancer Genome Atlas (TCGA)21. The PDS for 548 pathways from KEGG16,17, BioCarta18 and the NCI-Nature Pathway Interaction Database19 are presented in a 548 by 455 table (
Pathways of cluster P1 are deregulated mostly in samples of cluster S2, which comprises tumors with IDH1 mutation. All 32 pathways in P2 are activated by EGF signaling. Indeed, they are highly deregulated on sample cluster S5, which includes almost all patients with EGFR mutations. The fact that the scores capture the deregulation of EGF signaling pathways, expected in samples with oncogenic EGF mutations22, is reassuring and indicates that the method indeed captures relevant biological information. Another example of concordance of PDS with mutations is of cluster P3, which contains many pathways with high PDS in tumors with NF1 mutations (mostly in sample cluster S4), and low PDS in tumors with IDH1 mutation (mostly in S2). Another indication of the biological relevance of the PDS is the observed correlation of the scores of many cell death-related pathways with the necrosis levels of GBM samples (see
PDS-Based Stratification of Glioblastoma
Hierarchical average-linkage clustering according to PDS of the TCGA data (
A concise representation of the characteristic deregulation profiles of the sample types over the pathway clusters, given in the left part of
To validate these results data was used from REMBRANDT (REpository for Molecular BRAin Neoplasia DaTa)23. The clustering results are shown in
The Pathway Based Sub-Stratification of GBM has Important Clinical Implications
Neural and Proneural samples are thought to have better prognosis24,25; the pathway based sub-stratification reveals, however, that this notion is due to a subset of better survivors (logrank p-value26<0.05). In the TCGA data, patients of clusters TgS15 and TgS13, which have relatively few deregulated pathways, survive significantly longer than other Neural and Proneural samples (p=0.009 for TgS15 and p=0.015 for TgS13), while patients from TgS12 have worse prognosis (p=0.003) as can be seen in
These results are reproduced on the REMBRANDT data as well—patients with Neural or Proneural tumors in cluster ReS2, the one for which only few pathways are deregulated, have better prognosis than other Neurals and Proneurals (p=0.066,
Pathways Associated with Survival in Glioblastoma
77 pathways are significantly related to survival on the TCGA data, and 187 on the REMBRANDT data (FDR<10%, from Kaplan-Meier analysis, comparing the top ⅓ deregulated samples to the bottom ⅓, logrank p-value). 37 of these pathways overlap, constituting a significant intersection (p=0.005). Higher PDS were associated with bad prognosis on both datasets for all but two pathways (which are probably false positives, as higher deregulation implies worse prognosis on REMBRANDT but better on TCGA). Many of the other 35 pathways, listed in Table 1, make biological sense: Some are related to angiogenesis, critical to glioblastoma progression (such as VEGF signaling, Fibrinolysis, PDGFRβ signaling, α4β1 integrin signaling, HIF2α pathway); Many are known key players in glioblastoma and cancer in general, are known to have a prognostic value, and are promising drug targets (such as MAP kinase27-31, Insulin signaling and its components32-34, RET tyrosine kinase35, EGFR/ERBB signaling36, PDGF signaling37 and integrins38). For the full list of other survival-related pathways and their roles in glioblastoma see the section entitled “Detailed exemplary description for determining the analysis”.
Comparison of the Inventive Method, in at Least Some Embodiments, with PARADIGM
The analyses were repeated using the scores derived using PARADIGM. EGFR gene and EGFR complexes were indeed found to be correlated to EGFR mutations as expected (FDR<1%). However, no pathways were found to be correlated to the EGFR mutations. Not much more could be inferred by the analysis based on PARADIGM scores analysis (
Pathway Deregulation in Colorectal Cancer is Associated with Chromosomal and Micros Atellite Instability
Two kinds of genetic instabilities were identified in colon cancer: chromosomal instability (CIN) and microsatellite instability (MSI-high). CIN tumors exhibit abnormal numbers of chromosomes, deletions and amplifications of smaller genomic regions and translocations. CIN tumors constitute 85% of colon cancers, tend to have p53 mutations39-41 and tend to originate in the distal (left) colon42. MSI-high tumors have highly varying lengths of short sequences of nucleotides, caused by dysfunctional mismatch-repair genes. These tumors account for 15% of the colon cancer cases43, usually display no large-scale deletions and amplifications and originate in the proximal (right) colon42. High CIN tumors are usually microsattelite stable (MSS).
The above described method was applied to expression data of Sheffer et al.41 (313 samples of normal colon, polyps, primary carcinoma and metastases), and validated the results on datasets of Sveen et al44 and Kogo et al45 (89 and 141 samples, respectively; see Methods).
Notably, for many relevant pathways the PDS reflected disease progression (see
Many pathways are differentially deregulated between MSS and MSI-high tumors (325 pathways in the Sheffer dataset, with significant overlap to those in the Sveen data, p=1.92×10−5 for MSS deregulated pathways, p=0.022 for MSI-high, Fisher exact test; see the section entitled “Detailed exemplary description for determining the analysis” and
Pathway Deregulation Based Stratification of Colorectal Cancer
Clustering analysis of the deregulation scores of the Sheffer data identified 11 pathway clusters (denoted by ShP1-ShP11) and 12 sample clusters (ShS1-ShS12) (see
Sub-stratifications of the samples emerge from the clustering, as can be noticed from
A new class of colon cancer was discovered, characterized by high deregulation of cluster ShP3, which contains EGF signaling pathways, including EGF, EGFR and ERBB signaling. This cluster is markedly deregulated in ShS8, is comprised of 9 tumors and 2 polyps. The main cause of the deregulation is over-expression of EGF; no prior identification of such a subgroup has been made, even though there are some reports of EGFR mutations in colon cancer47, Apparently about 5% of the tumors belong to this class; hence relatively large cohort is needed to observe them. Identification of mutations or amplifications associated with these tumors may provide a new therapeutic strategy that targets the EGF pathway.
Clusters ShP4 and ShP5 (containing pathways known to play a role in cell migration and invasion48,49) show marked deregulation in most samples of ShS10. Deregulation of pathways from cluster ShP5 defines a new subgroup of low-CIN tumors (ShS10), that contains both MSS and MSI-high samples, and hence is probably independent of the MSI status; the survival rates of this cluster are not different than those of the high CIN groups. Clusters ShP6, ShP7, ShP9 and ShP10 show high deregulation in the high CIN clusters ShS3-4; see
The results of clustering the Sveen and Kogo datasets are shown in
Survival Analysis of Pathways in Colorectal Cancer
13 pathways are significantly related to survival in the Sheffer data (FDR<10%, comparing primary tumors with the top ⅓ deregulation scores to the bottom ⅓, logrank p-value). Three of these pathways are significantly associated with disease-free survival in the Sveen dataset (p<0.01, Kogo et al didn't provide survival information).
The first is oxidative phosphorylation (logrank p-value<1.98×10−3 for Sheffer, p<0.027 for Sveen,
A new, significant finding is the prognostic value of the CXCR3 pathway—this is a chemokine receptor expressed by activated T-cells and NK cells51 (p<3.28×10−5 for Sheffer, p<1.13×10−3 for Sveen,
The third is the IL22BP pathway, which may play a role in anti-inflammation (p<1.41×10−3 for Sheffer, p<0.014 for Sveen). Notably, both oxidative phosphorylation and CXCR3 remained significant in both datasets, even when only MSS and MSI-low tumors were considered (CXCR3, p<2.5×10−4 Sheffer, p<0.043 Sveen; oxid. Phos. p<7.34×10−3 Sheffer, p<0.035 Sveen), suggesting that these pathways are related to survival in colorectal cancer independently of microsatellite stability.
The above described method performs pathway-level analysis of an expression dataset of tumors, and determines for each sample a set of pathway deregulation scores. These PDS are calculated separately for each pathway using genes which are known to take part in its functioning. The approach can be extended to any other kind of data with known pathway assignments, and allows incorporation of an additional parameter that is likely to be correlated with pathway deregulation (chromosomal instability, mutations, etc.). This framework may serve as basis for future integration of different measurements (such as DNA copy number, protein abundance, localization, etc.). The approach is data-based: for each pathway a principal curve was constructed, which captures the variation of the data. All samples are projected onto this curve, and for each sample the distance between this projection and that of the normal samples is measured along the curve. This distance represents the level of deregulation of the pathway. The method copes successfully with the biggest challenges of expression-based pathway analysis: (a) knowledge of biological pathways is partial (b) pathway deregulation is context specific, and (c) available data (e.g. expression) contain only part of the relevant information. By including genes that were labeled by different studies as belonging to a pathway, and using data from the very problem to be studied, it was possible to define a context-specific PDS. This is accomplished without relying on (incompletely known) underlying network connectivity and function, by deducing pathway deregulation from the data itself. The absence of relevant information (e.g. post-translational modifications, protein localization) was handled by projecting the very complex (and unavailable) parameterization of the “biological state” onto expression space, where deregulation is defined by the deviation from the signature of normal samples, measured along a trajectory that reflects context specific deregulation.
The main conceptual aim of defining these scores is to incorporate a large body of prior biological knowledge (in the form of assignment of genes to pathways) to allow further analysis on a “higher” (pathway) level, instead of analyzing directly the expression levels of tens of thousands of genes, in a brute force, “ignorance based” manner.
The method was applied to glioblastoma and colorectal cancer data, and showed that the PDS successfully reflect deregulation of pathways, and constitute a compact and biologically relevant representation of the samples. The resulting representation of the tumors retains most of the essential information present in the original data. Tumors were stratified into subtypes which are easy to interpret in terms of the biologically meaningful and relevant pathways. The resulting tumor groups were consistent with previously identified clinical classes of glioblastoma and colon cancer, and new sub-classes with important clinical consequences were also identified.
For glioblastoma a clinically relevant new sub-stratification of Neural and Proneural samples was determined, separating them into poor and good survivors, as well as a robust new substratification of the Mesenchymal subtype. The method was shown to be robust and the results were validated on additional datasets.
It was shown that important recurrent mutations in glioblastoma have a clear impact on the deregulation scores of the relevant pathways. Some samples without the mutation exhibit similar deregulation profiles to the mutated ones, suggesting alternative equivalent deregulation mechanisms. 35 pathways whose deregulation score is significantly correlated with survival were found, where higher levels of deregulation match poor survival. Some of these pathways (such as MAP kinase) were previously known to be associated with survival in glioblastoma patients, while several others constitute new findings (such as PDGFRβ and WNT signaling), that may serve as new hypotheses for glioblastoma research.
For colorectal cancer it was shown that CXCR3-mediated signaling and oxidative phosphorylation pathways are significantly predictive of survival in two different datasets. Furthermore, a new classification of tumors may be suggested based on their CIN status; high and low, which is broader than the known partition into MSS and MSI-high. Many of the pathways show differential deregulation between these two CIN-based classes of tumors, which cannot be explained solely by their MSI status, emphasizing the important effect of CIN on tumor development. Within the class of low CIN tumors a new subgroup was found, comprised of both MSI-high and MSS, that show high deregulation of pathways related to migration, inflammation and angiogenesis, and indeed these tumors have similar survival rates as the group of high CIN tumors. In addition a novel subclass of tumors was discovered related to aberrant EGF signaling, that comprise about 5% of the patients.
The previously described Glioblastoma (GBM) data and results were investigated further. Additional data was also created by another group (Cell 155, 462-477, Oct. 10, 2013, hereby incorporated by reference as if fully set forth herein) which extended the above data in several ways: it added more samples and more molecular information about all the samples. This new data was analyzed as described above.
The new data was used to further analyze one of the above described findings, which was a sub-division of the proneural group of tumors into several clusters with very different survival. One group (TgS15) contained 9 patients with particularly good outcome, and another, TgS12, had 34 samples with very bad outcome. This partition was derived in a completely unsupervised way on the basis of PDS profiling. The newly published data has revealed that 8 out of 9 samples of TgS15 had mutations of the IDH1 gene and a high level of methylation (referred to as G-CIMP in the above paper). No samples of the bad outcome group had either mutated IDH1 or methylation (see Tables 2 and 3 below).
These results demonstrate the power of the previously described method as straightforward analysis of expression data in the above described paper was unable to identify these clinically meaningful and relevant subgroups.
Robustness of the previously described method was also tested by repeating the analysis over the full dataset of the above paper. In particular the PDS for all patients was derived and then compared with the original classification. Furthermore, data was analyzed that was obtained from two different expression platforms as described in the paper. Even though the number of probes, and genes and the (oligos representing them) of these two microarrays are very different, the resulting PDS profiles of the samples and their stratification were strikingly similar.
Robustness of the method was tested by repeating the analysis over the glioblastoma samples from the above Cell paper as described in the paper Pathway-based personalized analysis of cancer, Drier et al, PNAS 2013, vol. 110 no. 16, pp 6388-6393. In particular the PDS for all patients was derived using expression as measured by two different platforms; Agilent and Affymetrix. The original set of 5000 genes identified as the most variable genes in the Agilent platform was used as a basis for the comparison. For the Affymetrix platform 3328 genes were used from the 5000 genes list (all those genes that were found in Affymetrix probe list). The above described method was applied to both data sets and generated PDS matrices.
In order to compare the results between the two platforms, the Pearson correlation between Agilent and Affymetrix PDS values over all pathways was calculated for every sample. The histogram of the correlation values per sample is shown in
Similarly, the Pearson correlation between Agilent and Affymetrix PDS values over all samples was also calculated for every pathway. The histogram of the correlation values per pathway is shown in
The results clearly demonstrate that the inventive method is operative, efficient and accurate, regardless of the expression platform used or other experimental conditions which are present.
It is to be understood that while the invention has been described in conjunction with the detailed description thereof, the foregoing description is intended to illustrate and not limit the scope of the invention, which is defined by the scope of the appended claims. Optionally any one or more embodiments, sub-embodiments and/or components of any embodiment may be combined. Also optionally any combination or subcombination of elements or embodiments may optionally be combined. Other aspects, advantages, and modifications are within the scope of the following claims.
Number | Date | Country | |
---|---|---|---|
61764604 | Feb 2013 | US |