The present invention relates to identification of pairs of genes for which the respective gene expression values in a subject are statistically significant in relation to a medical condition, for example cancer, or more particularly breast cancer. The gene expression values may for example be indicative of the susceptibility of the subject to the medical condition, or the prognosis of a subject who exhibits the medical condition. The invention further relates to methods employing the identified gene pairs.
Global gene expression profiles of subjects are often used to obtain information about those subjects, such as their susceptibility to a certain medical condition, or, in the case of subjects exhibiting medical conditions, their prognosis. For example, having determined that a particular gene is important, the level in which that gene is expressed in a subject can be used to classify the individual into one of a plurality of classes, each class being associated with a different susceptibility or prognosis. The class comparison analysis leads to a better understanding of the disease process by identifying gene expression in primary tumours associated with subject survival outcomes (Kuznetsov et al., 2006).
Currently available clinical prediction and prognostic scores, such as the follicular lymphoma international prognostic index (LeBrun et al, 2008), or breast cancer aggressiveness scores (the St. Gallen consensus criterion or the Nottingham grading score (Ivshina et al, 2006)), are not able to optimally predict cancer aggressive dynamics or poor prognosis. On the other hand, genome-wide sequencing and expression technologies provide an enormous amount of information about genome and gene expression patterns associated with ethnology and pathogenesis of diseases, including cancers and other dangerous pathological process. However, these datasets provided by high-throughput technologies are still very noisy and biased so that selection of biologically meaningful and clinically essential genes and their products is imperative in further progress of diagnosis, prediction, prognosis and assignment of individual optimal therapy.
Furthermore, a given disease may exhibit many known and unknown genotype variations, which must be reliably defined on structural and functional levels. To this extent the task is twofold: first, the identification of reproducible, clinically-essential genetic (and phenotypic) sub-types of pathological cell types, and second the simultaneous optimal selection of the biologically meaningful genes. It is desirable to select genes (and or other markers) that correlate with survival time of the patients and provide in their combinations the best partition of the patients of relatively homogeneous distinct sub-groups of biological importance.
A gene and its products (mRNA transcripts, proteins) could be co-regulated and involved in many regulatory circuits, pathways and cellular programs which are very complex and poorly understood. However, several studies involving group analysis of expression profiles suggest that expression ratios of some gene pairs might be relatively reproducible indicators of the pathological process and, at the optimal thresholds for patient partition on to sub-groups, could be reliable prognostic indicators (Cui and Churchill, 2003; Motakis et al, 2008).
In some works, the combinations of individual gene pairs have been successfully used for diagnostics and prognosis (van't Veer, Sorlle et al, 2006; Ivshina et al, 2006). For example, In the work (LeBrun et al, 2008) the authors carried out so-called predictive interaction analysis (PIA) to examine whether any of the gene pairs generated from the very limited number of pres-selected genes (300 single genes) of follicular lymphoma were able to discriminate the 5-year outcomes more reliably than either single gene of the pair. Only gene pairs with both stringent and principal p-value gains of 10 times that of their respective single genes models were considered for further analysis. Of the 303 gene pairs that passed both criteria, the authors observed only 15 repeated gene pairs due to redundant features or genes represented by multiple probes on the array. The survival model in this and many other studies has been used only for validation of selected gene pairs. It was found that a high HOXB13/IL-17BR expression ratio is associated with increased relapse and death in patients with resected node-negative, ER-positive breast cancer treated with tamoxifen (Goetz et al, 2006). This case study showed that appropriate gene pairs may identify patients in whom alternative therapies should be studied.
However, several key questions remained outstanding:
We now present the background theory of survival analysis in more mathematical terms:
1. Analytical Method
First we will describe briefly the background theory of survival analysis. We denote by T the patient's survival time. T is a continuous non-negative random variable which can take values t, t∈[0, ∞). T has density function f(t) and cumulative distribution function F(t)=P(T≦t).
We are primarily interested in estimating two quantities:
The survival function: S(t)=P(T>t)=1−F(t)
The hazard function:
The survival function expresses the probability of a patient to be alive at time t. It is often presented in the form S(t)=exp(−H(t)), where
denotes the cumulative hazard. The hazard function assesses the instantaneous risk of death at time t, conditional on survival up to that time.
Notice that the hazard function is expressed in terms of the survival function. To this extent, survival distributions and hazard functions can be generated for any distribution defined for t∈[0, ∞). By considering a random variable W, distributed in (∞, −∞), we can generate a family of survival distributions by introducing location (α) and scale (σ) changes of the form log T=α+σW.
Alternatively, we can express the relationship of the survival distribution to covariates by means of a parametric model. The parametric model employs a “regressor” variable x. Take for example a model based on the exponential distribution and write: log(h(t))=α+βx, or equivalently, h(t)=exp(α+βx).
This is a linear model for the log-hazard, or, equivalently, a multiplicative model for the hazard. The constant α represents the log-baseline hazard (the hazard when the regressor x=0) and the slope parameter β gives the change in hazard rate as x varies. This is an easy example of how survival models can be obtained from simple distributional assumptions. In the next paragraphs we will see more specific examples.
2. The Cox Proportional Hazards Model
One of the most popular survival models is the Cox proportional hazards model (Cox, 1972):
log h(t)=α(t)+βx
where, as before, t is the survival time, h(t) represents the hazard function, α(t) is the baseline hazard, β is the slope parameter of the model and x is the regressor. The popularity of this model lies in the fact that it leaves the baseline hazard function α(t) (which we may alternatively designate as log h0(t)) unspecified (no distribution assumed). It can be estimated iteratively by the method of partial likelihood of Cox (1972). The Cox proportional hazards model is semi-parametric because while the baseline hazard can take any form, the covariates enter the model linearly.
Cox (1972) showed that the β coefficient can be estimated efficiently by the Cox partial likelihood function. Suppose that for each of a plurality of K subjects (labelled by k=1, . . . , K), we observe at corresponding time tk a certain nominal (i.e. yes/no) clinical event has occurred (e.g. whether there has been metastasis). This knowledge is denoted ek. For example ek may be 0 if the event has not occurred by time tk (e.g. no tumour metastasis at time tk) and 1 if the event has occurred (e.g. tumour metastasis at time tk). Cox (1972) showed that the β coefficient can be estimated efficiently by the Cox partial likelihood function, estimated as:
where R(tk)={j: tj=tk} is the risk set at time tk.
3. Application of Cox Proportional Hazards Model to Expression Level Data
Suppose that, for a microarray experiment with i=1, 2, . . . , N genes, the intensities are measured for k=1, 2, . . . , K subjects. The measure of the intensity of gene i (i.e. the expression value of gene i) in subject k may be denoted as yi,k. This may be a direct expression value measurement, or some transformed value derived from it such as a logarithm of an expression value measurement. Denote the chance of survival of a given individual as hk(t), then the Cox proportional-hazards model is that:
where α(t) is unspecified and βi, i=1, . . . N, is a set of N parameters. Each βi is a measure of the significance of the corresponding gene i. The set of N values βi can be considered as a 1×N vector of regression parameters denoted by β.
It is known to divide the group of patients into “high-risk” and “low risk” groups using a set of parameters {ci} according to whether:
The values {ci} may be referred to as “cut-off expression values”.
In this case, the Cox proportional hazard regression model becomes (Cox, 1984):
log hik(tk|xki, βi)=αi(tk)+βi·xki,
where hik is the hazard function and αi(tk) represents the unspecified log-baseline hazard function; β is the 1×N regression parameters vector with components βi; and tk is the subjects' survival time.
A Wald statistic, corresponding to a partial likelihood for each of the βi, may estimated for each gene i as:
where R(tk)={j: tj=tk} is the risk set at time tk.
4. Mean-Based Grouping (MBg)
One specific example of this procedure is Mean-based grouping (MBg). The following procedure is performed successively for i=1, 2, . . . , N.
log hik(tk|xki, βi)=αi(tk)+βi·xki
5. Alternatives to a Cox Proportional Hazards Model
Table 1 lists a number of known parametric survival models which are alternatives to the Cox proportional Hazards model. The survival function S(t)=P(T>t) defines the probability of an individual to survive after time t.
Each of the survival models employ one or more parameters λ, γ, μ, σ which are fixed, in the sense that these values are selected so as to fit the survival model to a given data-set, such that the survival function then varies only with the regressor.
The first four survival models in Table 1 make a proportional hazards assumption. The last two do not. Note that the Hazard functions for the six survival models are what we will call here “structurally different”. In other words, they differ not only in the values which certain fixed parameters take, but in the form of the mathematical function. In other words, the survival models cannot be transformed from one into another by a revaluing of any of the parameters.
Φ is the cumulative distribution function of the standard normal distribution and φ is the probability density function of the standard normal distribution.
Yet another known model is a “stratified Cox model” as proposed by Kalbfleisch and Prentice, 1980. In this model,
log hk,gi(tk,xki)=αi,g(tk)+βi,gxki, g=1, 2, . . . , G
where g labels G “strata”.
This modification of the Cox regression model includes the stratification of a covariate that does not satisfy the proportional hazards assumption. It assumes that there is one variable that does not satisfy the proportional hazards. The modification creates strata for this one variable, and within these strata estimates the effect of grouping on survival times. Covariates that are assumed to satisfy the proportional hazards assumption are included in the model, whereas the predictor being stratified is not included.
The present invention aims to provide new and useful methods for identifying genes which are statistically significant in relation to a medical condition. It further aims to provide new and useful methods for identifying statistical models employing data about the expression levels of the identified genes. It further suggests gene pairs which are statistically significant for certain medical conditions.
A first aspect of the invention proposes in general terms that, for a given set of patient data, there is a process of selecting from a plurality of survival models, which survival model best models a set of clinical data. In each of the multiple survival models, the hazard function is a structurally-different mathematical function of the time variable or other exogenous variables, i.e. the functions differ not only in that they include respective parameter(s) which take different respective values, but in that the hazard functions of the functions have a different functional dependency on the regressor variables (e.g. time). For example, one of the models is typically a Cox proportional hazards model, while another may be one of the other models specified in Table 1. For example, one or more of the survival models may be proportional survival models, and one or more of the survival models may be non-proportional survival models.
In one form, there may first be a step of determining whether the data set fits a single survival model (e.g. such that a numerical measure of the quality of the fit is above a predetermined level), and if it is found not to fit, one or more other survival models are tried out. In one example, the single survival model may be one involving proportional hazards, and the step of determining whether the data set fits the survival model may include determining whether the proportional hazards assumption holds.
The criterion for selecting which of the survival models best fits the data set may employ the Bayesian Information Criterion (BIC).
A first possible expression of the first aspect of the invention is a computerized method for identifying one or more genes, or pairs of genes, selected from a set of N genes, which are statistically associated with prognosis of a potentially fatal medical condition, the method employing test data which, for each subject k of a set of K subjects suffering from the medical condition, indicates (i) a survival time of subject k, and (ii) for each gene i, a corresponding gene expression value yi,k of subject k;
the method comprising:
Another possible expression of the first aspect of the invention is a computerized method for identifying one or more genes, or pairs of genes, selected from a set of N genes, which are statistically associated with prognosis of a potentially fatal medical condition, the method employing test data which, for each subject k of a set of K subjects suffering from the medical condition, indicates (i) a survival time of subject k, and (ii) for each gene i, a corresponding gene expression value yi,k of subject k;
the method comprising:
A second aspect of the invention relates, in general terms, to a survival model which, for each of one or more pairs of genes, is a function of a measure of the ratio of expression levels of the pair of genes. For each pair of genes, there is a corresponding a cut-off value, such that patients are classified according to whether the corresponding measure is above or below the cut-off value. The second aspect of the invention proposes that the cut-off value should be selected so as to maximize the separation of the respective survival curves of the two groups of patients. From another point of view, this means that the cut-off value is selected such that the partition of subjects which it implies is of maximal statistical significance. One embodiment of the first aspect of the invention is referred to below as “DDgR”.
One possible expression of this aspect of the invention is a computerized method for identifying one or more pairs of genes, selected from a set of N genes, which are statistically associated with prognosis of a potentially fatal medical condition, the method employing test data which, for each subject k of a set of K subjects suffering from the medical condition, indicates (i) a survival time of subject k, and (ii) for each gene i, a corresponding gene expression value yi,k of subject k;
the method comprising:
A certain embodiment of the invention relates to the usage of gene pairs identified by a method according to the first and/or second aspects of the invention. It further relates to programmed computers capable of performing the methods (any general purpose computer may be used for this purpose), and to data-carriers (such as tangible media, such as CDs) carrying software instructions for performing the method.
A third aspect of the invention proposes obtaining information about a first subject in relation to breast cancer, by measuring in the first subject the expression levels of a pair of genes in one of the rows of Table 2 below, or one of the pairs of genes in Table 10 later in this document, and obtaining the information using a statistical model which employs the expression levels of those genes.
This third aspect of the invention further proposes a microarray (or other kit) for obtaining data for performing prognosis of breast cancer. Since the microarray is dedicated to this purpose it does not require sensing of large numbers of genes which are not of significance. Instead, its cost can be reduced by designing it only to sense the presence of a limited number of genes (e.g. at most 100 genes, at most 50 genes, or at most 30 genes, or at most 20 genes) of which some or all have been determined to be a specific relevance to breast cancer.
One specific expression of this method is a microarray being for measuring the expression value of a set of no more than 20 genes (or not more than 100, or no more than 50, or no more than 30), including at least one of the pairs of genes which are the rows of Table 2.
Another specific expression of this method is a microarray (or other kit) for measuring the expression value of a set of no more than 20 genes (or not more than 100, no more than 50, or no more than 30) including at least one of the 16 pairs of genes which are respectively given in the upper 17 rows of Table 10, and preferably more than one of these pairs.
Embodiments of the invention will now be explained, with reference to the following figures in which:
“Array” or “microarray,” as used herein, comprises a surface with an array, preferably an ordered array, of putative binding (e.g., by hybridization) sites for a sample which often has undetermined characteristics. An array can provide a medium for matching known and unknown nucleic acid molecules based on base-pairing rules and automating the process of identifying the unknowns. An array experiment can make use of common assay systems such as microplates or standard blotting membranes, and can be worked manually, or make use of robotics to deposit the sample. The array can be a macro-array (containing nucleic acid spots of about 250 microns or larger) or a micro-array (typically containing nucleic acid spots of less than about 250 microns).
The term “cut-off expression value” represented by ci refers to a value of the expression level of a particular nucleic acid molecule (or gene or probe) i in a subject. The cut-off expression value is used to partition the subjects into classes, according to whether the expression level of the corresponding gene is below or above the cut-off expression value. Note that it makes no difference whether all subjects for which the expression value is actually equal to the cut-off value are classified as being in one class or alternatively in the other. The term “cut-off value” is used for variables such as ci,j where the cut-off is used to classify based on a value other than the expression value of one nucleic acid molecule (or gene or probe), such as the ratio of the expression values two nucleic acid molecules i, j.
The term “gene” refers to a nucleic acid molecule that encodes, and/or expresses in a detectable manner, a discrete product, whether RNA or protein. It is appreciated that more than one nucleic acid molecule may be capable of encoding a discrete product. The term includes alleles and polymorphisms of a gene that encodes the same product or an analog thereof.
There are various methods for detecting gene expression levels. Examples include Reverse-Transcription PCR, RNAse protection, Northern hybridisation, Western hybridisation, Real-Time PCR and microarray analysis. The gene expression level may be determined at the transcript level, or at the protein level, or both. The nucleic acid molecule or probe may be immobilized on a support, for example, on an array or a microarray. The detection may be manual or automated. Standard molecular biology techniques known in the art and not specifically described may be employed as described in Sambrook and Russel, Molecular Cloning: A Laboratory Manual, Cold Springs Harbor Laboratory, New York (2001).
Gene expression may be determined from sample(s) isolated from the subject(s).
Algorithms referred to as R-algorithms, or routines “in the R library”, refer to algorithms in the package described at http://cran.r-project.org/, or written in the language of the environment described there.
Embodiments of the invention will now be explained. Before doing so, we discuss a technique (“DDg”) which was not in the public domain at the priority date of the present application, and which is not independently claimed in the present application (it is described in detail in one of the related patent applications referenced above), but which is employed in certain embodiments of the present invention.
DDg is an alternative to the mean-based grouping (MBg) explained in the background section of the present application. The explanation of it given below uses terms with the same definition as in the explanation of MBg above.
For each gene i, compute the 10th quantile (q10i) and the 90th quantile (q90i) of the distribution of the K signal intensity values (i.e. the expression levels of the K patients). Within (q10i, q90i), we search for the value that most successfully discriminates the two unknown genetic classes. This can be done by the following steps:
We now turn to an embodiment of the invention, illustrated by the flow diagram of
Data-driven grouping for ratios is an extension of our original Data-driven method that takes into account ratios of intensities. Instead of considering each gene independently and attempting to identify the respective patients' grouping, this new approach takes into account ratios of raw gene intensities (or equivalently the difference of gene log-transformed intensities), which are then processed in the same way as in the DDg method. To this extent, the ratio-based approach utilizes pairs of genes (instead of single genes) and aims to identify significantly different groups of cancer patients. The steps of the algorithm are as follows:
log hkij(tk|xkij, βijz)=αij(tk)÷βijzxkij and
Statistically significant gene pairs are identified as those having low βi,jz p-values.
Before discussing further embodiments of the invention we now discuss the concept of 2-dimensional patient grouping. Again, this idea is not independently claimed in the present application (it is described in detail in one of the related patent applications referenced above), but it is employed in certain embodiments of the present invention.
The idea of 2-D grouping, in general terms is that that pairs of genes are selected. For each gene pair, we use a respective cut-off value for each gene to generate a plurality of models, which represent ways of partitioning a set of subjects based on the expression values of the pair of genes in relation to the respective cut-off values. We then identify those gene pairs for which one of the models has a high prognostic significance. Specifically, the models are defined using the scheme illustrated in
Specifically, for a given gene pair i, j, i≠j, and respective individual cut-off expression values ci and cj, we define seven “models”, each of which is a possible way in which the expression levels of the two genes might be significant. Then we test the data to see if any of the seven models are in fact statistically significant. The models are defined using
Each of the seven models is then defined as a respective selection from among the four regions:
Model 1 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A or D, rather than B or C.
Model 2 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A, B or C, rather than D.
Model 3 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A, C or D, rather than B.
Model 4 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions B, C or D, rather than A.
Model 5 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A, B or D, rather than C.
Model 6 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A or C, rather than B or D.
Model 7 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A or B, rather than C or D.
Note that model 6 is equivalent to asking only whether the subject's expression level of gene 1 is below of above c1 (i.e. it assumes that the expression value of gene 2 is not important). Model 7 is equivalent to asking only whether the subject's expression for gene 2 is above or below c2 (it assumes that the expression value of gene 1 is not important). Thus, models 1-5 are referred to as “synergetic” (1-5), and the models 6 and 7 as “independent”.
The 2-D patient grouping evaluates the significance of all possible gene pairs i, j as follows:
log hi,jk(tk|xi,j,km,βi,jm)=αi,j(tk)+βi,jm·xi,j,km,
“2-D patient grouping” is grouping of K patients based on a pair of genes exhibiting potential interaction in context of survival time. “1-D grouping” is grouping of K patients based on individual genes. In view of the term “2-D patient grouping”, DDg can be considered as more general and biologically meaningful approach than “1-D grouping”, in which each gene is considered individually. Strictly speaking, MBg is a kind of 1-D grouping but it does not estimate the optimal cut-off. The optimal cut-off is estimated only by the DDg method.
To validate the significance of our findings (in terms of the estimated Wald p-values) we bootstrap our samples (from the two cohorts) and estimate 99% confidence intervals for the βi coefficients of the Cox proportional hazards model. Note that the following discussion considers only the case of 1-D grouping. In the case of the 2-D grouping, we would substitute βi with βi,jm in the following discussion. The bootstrap method though is the same in both cases (1-D and 2-D). Extending this technique for validating the ratio method DDgR is also straightforward, so will not be explained here.
We use the non-parametric residuals bootstrap for the proportional hazards model of Loughin (1995) using the boot package in R.
Specifically, the algorithm works as follows sequentially for each gene i:
log hik(tk|xki,βi)=αi(tk)+βi·xki
u
k=[1−F0(t)]exp(β
t
k
(b)=1−[uk(b)]1/exp(β
Based on these coefficients, we estimate and report the Bias-Corrected accelerated (BCa) bootstrap confidence intervals for each βi coefficient that correct the simple quantile intervals of βi for bias and skewness in their distribution (Efron and Tibshirani, 1994). A detailed discussion (theory and applications) on BCa intervals is given in Efron and Tibshirani (1994). Here the 99% BCa were estimated by the boot package in the R library. Bootstrap test p-values based on quantile method gave similar results using the Wald test.
The most crucial assumption of the Cox proportional hazards survival model is the one concerning the effect of the parameters to the hazard ratio. Specifically, consider two patients k1 and k2 modelled by the Cox proportional hazards model
log hik(tk|xki,βi)=αi(tk)+βi·xki
The hazard log-difference of these two groups must be:
log hik
which is independent of time t. We call this assumption the proportional hazards assumption of the Cox model. Under these circumstances it is desirable to determine whether the Cox regression model adequately fits the data or, in other words, whether the proportional hazards assumption holds. Tests and graphical diagnostics for proportional hazards may be based on the estimation of the scaled Schoenfeld residuals (rk) (Schoenfeld, 1982), which for non-censored patients are estimated as:
Grambsch and Therneau (1994) developed the respective test statistic that has the form:
where ri are the scaled Schoenfeld residuals for probe i, gi is the time scale,
In the survival package in the R library, the proportional hazards test can be done more conveniently with the cox.zph function, which checks the proportional-hazards assumption by correlating the corresponding set of scaled Schoenfeld residuals with a suitable transformation of time for each covariate xi (for more information refer to the survival package in the R library).
The embodiments described below incorporate a test for the proportional hazards assumption applied to the results of the 1 D-, 2D- and ratio-patient grouping methods. When the assumption holds, we can rely on the Cox model and make inferences based on the results of our 1 D/2D/ratio methods. However, when the assumption is not satisfied, we need to think of a proper approach to analyze the data. Note however, that preliminary analysis of our data showed that non-proportionality holds for approximately 1-1.5% of the probes/probe pairs.
Survival Model Misspecification
At this paragraph we wish to discuss a possible reason of the proportionality failure before we describe possible remedies of the problem. Schemper (1992) and Therneau and Grambsch (2000) suggest that certain model specification errors can lead to false positive results of the test. Such specification errors can be simply the omission of an important covariate of the model (either a single covariate or an interaction) or an incorrect functional form of the model. The latter implies that if a parametric model is more appropriate than Cox proportional model for the data, using Cox can lead to a misleading significant test result. In addition, we should take into account the case where proportional hazards assumption does not hold due to the data properties which requires the use of a model that does not make this assumption.
To solve the problem with respect to model misspecification, a second embodiment of the invention uses the 4 different parametric survival models in the top lines of Table 1, along with any one of the 1 D-, 2D- and ratio-methods. Note that these 4 models assume proportional hazards. Parametric survival models are constructed by choosing a specific probability distribution for the survival function. The choice of survival distribution expresses some particular information about the relation of time and any exogenous variables to survival. It is natural to choose a statistical distribution which has non-negative support since survival times are non-negative.
Here we describe the second embodiment in detail with reference to
t
k=αi(tk)+βixki
For the estimation of β one needs to maximize the appropriate likelihood ƒ(t|θ) of each model, where θ denotes a vector of parameters. Notice that this step is similar to what we described in step 1, but requires the use of a parametric model instead of Cox's.
BIC=−2*log L+p×log(K)
The Non-Proportional Hazards Case
We will now discuss how the second embodiment deals with the case where proportionality fails due to omission of model parameters or simply because of the data properties (the hazards are not proportional).
Two possible solutions exist: first, these two cases may be handled simultaneously by considering the stratified Cox proportional hazards model of Kalbfleisch and Prentice (1980); second, the embodiment may use a parametric model that does not assume proportional hazards (e.g. the lower two models in Table 1) and fit it as described above. The second embodiment first uses the stratified Cox proportional hazards model, checks whether proportionality is obeyed, and then only if not, moves on to considering the parametric methods which do not assume proportional hazards.
As discussed above, the stratified Cox regression model is a modification of the Cox regression model by the stratification of a covariate that does not satisfy the proportional hazards assumption. Covariates that are assumed to satisfy the proportional hazards assumption are included in the model, whereas the predictor being stratified is not included. As discussed above, the embodiment initially assumes that the patients' grouping (as estimated by the 1 D-, 2D- or ratio methods) satisfies the proportional hazards assumption. However, in the cases that this hypothesis is rejected, we assume that there is at least one additional variable that may cause the problem. Our aim is to create strata based on this additional variable and within this strata estimate the effect of our grouping on survival times. As an additional we consider a series of clinical variables such as the histological grade (obtained from SWS analysis of Kuznetsov et al., 2006), the patients' age, the patients' intrinsic molecular signature, the metastasis status and the endocrine status. For a detailed description of these variables see Kuznetsov et al. (2006) and Kuznetsov et al. (2008).
More mathematically, our procedure to solve non-proportionality problem by the stratified no-interaction Cox proportional hazards model (Kalbfleisch and Prentice, 1980; Ata and Sozer, 2007) or/and with a non-proportional hazards parametric approach can be described as follows:
log hk,gi(tk,xki)=αi,g(tk)+βi,gxki, g=1, 2, . . . , G
t
k=αi(tk)+βixki
We now list three detailed examples of algorithms which implement the second embodiment of the invention. All three of these algorithms have the general structure illustrated in
7.1 Using 1-D Data-Driven Patient Grouping
t
k=αi(tk÷βixki
BIC=−2*log L+p×log(K)
7.2 Using 2-D Data-Driven Grouping
r
ij
=x
k
ij−Σk=1j,∈R(t
t
k=αij(tk)÷βijxkij
BIC=−2*log L+p×log(K)
7.3 Using Ratio Data Driven Grouping (the Ratio-Method)
t
k=αij(tk)÷βijxkij
BIC=−2*log L+p×log(K)
Breast cancer is most common malignancy among women (van't Veer et al., 2002; Sotiriou et al., 2001). We compare our data-driven grouping for individual genes (DDg) and the data-driven grouping for ratios (DDgR) methods by considering the expression patterns of approximately 30,000 gene transcripts (representing by 44,928 probesets (p.s.) on Affymetrix U133A and U133B arrays) in 315 primary breast tumors (NCBI Gene Expression Omnibus (GEO) data sets GSE4922 and GSE1456). The tumor samples were derived from two independent population-based cohorts: Uppsala (249 samples) and Stockholm (159 samples). For details on patients, clinical information, tumor samples and microarrays see (Ivshina et al., 2006).
Information on patients' disease free survival (DFS) times/events and the expression patterns of gene transcripts on Affymetrix U133A and U133b arrays in all primary breast tumors were obtained from NCBI Gene Expression Omnibus (GEO) (Stockholm data set label is GSE4922; Uppsala data set label is GSE1456). The microarray intensities were normalized (by RMA) and the probe set signal intensities were log-transformed and scaled by adjusting the mean signal to a target value of log500 (Liu et al., 2006).
1D Analysis
The first step of our analysis involves estimating the βi of the Cox Proportional hazards survival model:
log hik(tk|xki,βi)=αi(tk)+βi·xki
for each cohort, where i=1, . . . , 44928 denotes the number of probesets. To do this, we apply our data-driven grouping for individual genes (DDg) using the survival package in the R library. As described above we are particularly interested in storing the p-value of each βi coefficient, which we consider as a measure of group discrimination. At significance level α=1%, we identified 9930 prognostic significant probesets (6559 prognostic gene signatures) in Stockholm and 9737 in Uppsala (6353 prognostic gene signatures).
To further validate the significance of our findings (in terms of the estimated p-values) we bootstrap the Stockholm and Uppsala samples and estimate 99% Bias Correction acceleration (BCa) confidence intervals for the βi coefficients of the Cox proportional hazards model. We use the non-parametric residuals bootstrap approach described above. By applying this second, independent test we intend to produce a more refined and reliable list of prognostic significant genes. Indeed, the residuals bootstrap method resulted in 1844 common Stockholm-Uppsala probesets, verified by two independent statistical procedures and two independent cohorts. Tables 3-4 present the 10 top-list probesets identified by our 1D algorithm under Cox and the best (in terms of BIC) parametric models in two cohorts.
Notice that in Tables 3-4 none of the probesets rejects the assumptions of the proportionality of the hazards. Particularly, this is true for approximately 99% of the individual cases, where the stratified Cox proportional hazards model and the parametric non-proportional hazards alternatives needed to be evaluated.
2D Analysis
We proceed further by using this reliable list of 1844 probes. By sorting the survival p-values in increasing order we identified the top-level 400 probes (corresponding to 310 unique genes) with which we form all possible 79800 gene pairs. Using these pairs we apply our 2D grouping method in Stockholm and Uppsala cohorts. The 2D grouping has been developed by Motakis et al (2008). The 2D grouping algorithm seems to improve significantly the patients' grouping obtained by individual genes.
At 1% significance level, our analysis revealed 50021 significant probesets in Stockholm and 49986 in Uppsala. To validate mathematically our results and be able to select more reliable gene pairs, we refine our lists, based on three criteria;
In several cases criterion 3 fails to be satisfied resulting in exclusion of several highly significant and biologically meaningful pairs. To solve this problem we propose here an extension of our 2D algorithm:
This procedure results in 20665 highly significant, reproducible synergetic genes by the extended, design-corrected 2D method (i.e. the method of section 7.2 above) at significance level α=1%.
In the Stockholm cohort we identified 6562 and in Uppsala 8303 significant gene pairs (by applying both BCa and Bonferroni to the original 20665 reproducible pairs). The distribution of the p-values is plotted at the bottom panel of
Notice that in Tables 5-6 none of the probesets rejects the assumptions of the proportionality of the hazards. Particularly, this is true for approximately 98.5% of the paired cases, where the stratified Cox model and the loglogistic/lognormal models needed to be evaluated.
Gene Ratio Analysis
At the final step of our analysis we attempt to compare the results obtained so far by 1D and 2D grouping approaches against our new ratio data-driven grouping method (DDgR). At this point we consider all 20665 significant probe pairs of the 2D DDg step. Notice that each individual gene, included in this list, is highly significant by our 1D data-driven method.
Each pair is transformed to a log ratio (or log-difference) of intensities; i.e. if we denote by I1 as the vector of raw intensities of Affy1 and I2 as the vector of raw intensities of Affy2 for two probesets Affy1 and Affy2 that form a pair, then the log-ratio (or log-difference):
forms the ratio of interest. In this way we form a large sample of 20665 ratios of intensities composed of well-characterized, survival significant genes and gene pairs. By using the DDgR approach, we estimate the survival p-values of βi coefficients, which we compare to our previous results. Tables 7-8 give the 10 top-level probes (and gene symbols) identified by the DDgR and their respective survival p-values in Stockholm and Uppsala cohorts.
Notice that in Tables 7-8 none of the probesets rejects the assumptions of the proportionality of the hazards. Particularly, this is true for approximately 99% of the paired cases, where the stratified Cox model and the loglogistic/lognormal models needed to be evaluated.
Our new ratio approach improved the classification of 603 out of 20665 (approximately 3%) probe pairs. Each of these survival p-values was lower than both the 2D data-driven and the two 1D data-driven p-values of the respective pair. Note also that the ratio approach improved the patient grouping for 383 out of 400 individual probes (approximately 96%) that formed our gene pairs and consequently our gene ratios in our analysis.
Biological Results and Conclusion
The above preliminary results show that DDgR method and its combination with DDg can be considered as a novel fully automatic and robust methodology for a dual biomedical and patient grouping searching: (1) selection of survival/clinical significant genes (or other available tumor data) providing synergistic effect on patient survival and (2) optimal grouping of the patients onto poorly and good disease outcome groups. Such tasks are appeared in several biomedical studies.
The embodiments can provide practical interest for identification key genetic genes and their interactive gene pairs related directly to patho-biological mechanisms, prediction of disease outcome and screening novel therapeutic targets.
The embodiments were able to find many hundreds robust and statistically strong confidence genes and gene pairs. Several top-level robust gene markers present on the
Among identified survival significant genes, some of the genes have been considered as clinically significant genes For example, ARF1 (Aurka, STK6), and MELK are survival significant genes which have been included in the genomic grading classification of breast cancers (Ivshina et al, 2006, Kuznetsov et al, 2008). Some of the survival significant genes of Tables 3-8 are in clinical trials (e.g. Survivin, TYMS, MELK). Importantly, combination of mRNA markers in blood including Survivin and TK1 (Thymidine kinase 1) has been used in the diagnosis of patients with breast cancer (Chen et al, 2006). It was also shown that patients' clinic-pathological characteristics tumor size (p=0.006), histologic grade (p=0.012), lymph node metastasis (p=0.001) and TNM stage (p=0.006) significantly correlated with the positive detection rate of these cancer markers. We suggest that not only Survivin and TK1 but other cancer-prognostic multi-markers of Tables 3-8 and could be also used in their combination to detect circulating tumor cells (CTCs) in the circulation of breast cancer patients with considerably high sensitivity and specificity.
Thus, our method can provide practical interest for identification key genetic genes and their interactive gene pairs and composed survival significant genes (SSGs) and their products (mRNAs, proteins, peptides) related directly to pathobiological mechanisms, prediction of disease outcome and screening novel therapeutic targets.
In particular, spartin (SPG20) gene (which is involved in the intracellular trafficking of EGFR and that impaired endocytosis may underlie the pathogenesis of Troyer syndrome) and gene PGAM1 (Phosphoglycerate mutase 1, involved in glycolise pathway) are strongly predicted by the DDgR as the new breast cancer survival synergetic genes pairs. In the literature, the both genes have not been associated with breast cancer and cancer survival. However, our additional analysis of expression PGAM1 based on longSAGE tag (cgap.nci.nih.gov/SAGE) DB demonstrates that this gene is strongly over-expressed in embryonic stem cells, and several cancers, including breast cancer cell line.
DDg and DDgR the both predict that synergic TUBB3-KIAA0999 gene pair. TUBB3 (Tubulin 3) is the major constituent of microtubules. It binds two moles of GTP, one at an exchangeable site on the beta chain and one at a non-exchangeable site on the alpha-chain. This gene is predicted to be survival significant pairing with unknown gene KIAA0999. We found that ChIP-PET experiments show over-expression TUBB3 in ER-positive breast cancer cell line MCF7 cells and in the human ESC hES3 cells. Recently, KIAA099 was included in the list of “candidate of breast cancer genes”, which have been subjected to mutational selection during tumorgenesis (Sjoblom et al, 2006). However, survival significant of this hypothetical gene was not discussed.
We conclude that the embodiments provide a novel approach to identify small gene signatures providing statistically reliable, biological important and clinical significant molecular markers.
http://cgap.nci.nih.gov/SAGE/FreqsOfTag?ORG=Hs&METHOD=SS10,LS10&FORMAT=html&TAG=GGTCCAGTGT
The DDg method (but not DDgR) predicts strong clinical significance of genes H2AFZ and RBM35B and its pair H2AFZ-RBM35B. Expression H2AFZ in breast cancer cells was increased and the prognostic power of these biomarkers demonstrated in clinical data (Hua at al., 2008). Gene RBM35B is involved in RNA binding; however, biological function and clinical significance of this gene has been not unknown.
RBM35B forms also the survival significant gene pair with gene MEIS3P1. MEIS3P1 is the homo sapiens Meis homeobox 3 pseudogene 1 which also might be classified as non-coding RNA gene. This gene appears to be a retrotransposed pseudogene based on its lack of exons compared to other family members, one of which is found on chromosome 19 (MEIS3). The MESI3P1 pseudogene lacks three exons compared to MESI3. However, both genes encode proteins of similar size and sequence and contain the same homeodomain, which is a one-of-a-kind DNA binding domain involved in the transcriptional regulation of key eukaryotic developmental processes. MESI3P1 could activate the cAMP-response element (CRE) pathway (Linjie Tian et al, 2007). MEI3P1 has DDgR partner CIAPIN1. CIAPIN1 is a cytokine-induced inhibitor of apoptosis with no relation to apoptosis regulatory molecules of the BCL2 (MIM 151430) or CASP (see MIM 147678) families. Expression of CIAPIN1 is dependent on growth factor stimulation (Shibayama et al., 2004). CIAPIN1, a newly identified apoptosis-related molecule, was found to be down-regulated in several human cancers as compared to the matched adjacent non-neoplastic tissues. It was found that CIAPIN1 is a suppressor of gastric cancer cell proliferation. Due to strong survival significance, we suggest that CIAPIN1 might play an important role in the development of human breast cancer.
Thus, the list of selected genes demonstrates that our survival significance analysis via microarrays can be a powerful technique in elucidating the functions of many novel genes in normal and cancerous tissues. However, many of survival significant and synergetic genes pairs are poorly studied in context of prediction of biological functions and clinical prognostic significance. Our GO network analysis shows that more than 60% gene of top SSSG (Tables 3-8) are not included in the known gene interaction networks (data not presented). These findings strongly suggest that our methodology able to discovery many new clinically important markers and promising molecular targets for drug development and validation.
Among identified survival significant genes, some of the genes have been considered as a clinically significant genes, some of these genes are in clinical trials (e.g. survivin, TYMS, MELK), but the most of identified genes and synergetic genes pairs are novel in context of prediction of biological essentiality and clinical prognostic significance for specific (cancer) diseases.
Our final step involves determining the patients' composite group based on the final list of significant genes and gene pairs identified by our 1D, 2D and DDgR grouping algorithms. In other words, we combine the information from all significant synergetic genes and gene pairs of Tables 3-8 into a single final grouping scheme and then we test whether our algorithm was able to identify two biologically distinct patients' groups.
Our final list of significant findings includes multiple entries of the same genes (e.g. AP2S1 appears at least once in all grouping methods). We assumed that redundancy implies highly correlated groups, which we would like to avoid (e.g. all groupings based on AP2S1 tend to be correlated). To deal with this problem, for each probe set i we sort in ascending order (from the lowest to the highest) the 2D and DDgR Wald P values pij it generates with other probe sets j, j=1, . . . , J and keep the i, j that produces the lowest pij. Subsequently, we do not reconsider i and j either as members of a different pair or as individual entities. Next, we enrich our list with all 1D significant probe sets that do not form any significant pair. We resulted in 17 top-level, representative significant pairs and 4 significant probe sets. These are shown in Table 10 below.
Table 11 indicates the SEQ ID NO. and the corresponding gene and GenBank reference in the sequence listing.
Where more than one transcript variant exists, the invention may be practiced with any one of the transcript variants, any of several of the transcript variants or all of the transcript variants.
Ata, N. and Sozer, M. T. (2007). Cox regression models with non-proportional hazards applied to lung cancer survival data, Hacettepe Journal of Mathematics and Statistics, 36, 157-167.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing, Journal of the Royal Statistical Society, Series B (Methodological) 57 (1): 289-300.
Benjamini, Y. and Yekutieli, D. (2001). The control of the false discovery rate in multiple testing under dependency, Annals of Statistics 29 (4): 1165-1188
Breslow, N. E., Covariance analysis of censored survival data, Biometrics, vol. 30, pp 89-99, 1974.
Broet P., Kuznetsov V. A., J. Bergh J., Liu E. T-B., Miller L. D. (2006) Identifying gene expression changes in breast cancer that distinguish early and late relapse among uncured patients. Bioinformatics, 22, 1477-1485.
Chen C C, Chang T W, Chen F M, Hou M F, Hung S Y, Chong I W, Lee S C, Zhou T H, Lin S R. Combination of multiple mRNA markers (PTTG1, Survivin, UbcH10 and TK1) in the diagnosis of Taiwanese patients with breast cancer by membrane array. Oncology. 2006; 70 (6):438-46. Epub 2007 Jan. 12.
Cox, D. R. (1972). Regression Models and Life Tables (with Discussion). Journal of the Royal Statistical Society, Series B 34:187-220.
Cox R. D. and Oakes, D. (1984). Analysis of Survival Data. London: Chapman and Hall.
Cox, D. R. and Snell, E. J. (1968). A general definition of residuals (with discussion)”, Journal of the Royal Statistical Society, Series B, vol. 30, pp 248-265.
Cui, X. And Churchill, G. A. (2003). Statistical tests for differential expression in cDNA microarray experiments, Genome Biology, 4 (4):210
Efron, B. and Tibshirani, R. J. (1994). An Introduction to the Bootstrap. New York: Chapman and Hall.
Goetz M P, Suman V J, Ingle J N, Nibbe A M, Visscher D W, Reynolds C A, Lingle W L, Erlander M, Ma X J, Sgroi D C, Perez E A, Couch F J. (2006) A two-gene expression ratio of homeobox 13 and interleukin-17B receptor for prediction of recurrence and survival in women receiving adjuvant tamoxifen. Clin Cancer Res.; 12 (7 Pt 1):2080-7.
Grambsch, M. P. and Therneau, M. T. (1994). Proportional Hazards Tests and Diagnostics Based on Weighted Residuals, Biometrika, 81, 515-526.
Hua S, Kallen C B, Dhar R, Baquero M T, Mason C E, Russell B A, Shah P K, Liu J, Khramtsov A, Tretiakova M S, Krausz T N, Olopade O I, Rimm D L, White K P (2008). Genomic analysis of estrogen cascade reveals histone variant H2A.Z associated with breast cancer progression. Mol Syst Biol., 4:188. Epub 2008 Apr. 15.
Ivshina A V, George J, Senko O V., Mow B, Putti T C, Smeds J, Nordgren H, Bergh J, Liu E T-B, Kuznetsov V A, Miller L D (2006). Genetic reclassification of histologic grade delineates new clinical subtypes of breast cancer. Cancer Res. 66, 10292-10301.
Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data, New York: John Wiley & Sons, Inc.
Kuznetsov V. A. , Senko O. V., Miller L. D. and Ivshina Anna V. (2006). Statistically Weighted Voting Analysis of Microarrays for Molecular Pattern Selection and Discovery Cancer Genotypes. Intern J of Computer Sciences and Network Security, 6, 73-83.
Kuznetsov, V. A., Motakis E. and Ivshina, A. V. (2008). Low- and High-Agressive Genetic Breast Cancer Subtypes and Significant Survival Gene Signatures, IEEE Congress on Computational Intelligence, Hong-Kong, June 1-6.
LeBrun D, Baetz T, Foster C, Farmer P, Sidhu R, Guo H, Harrison K, Somogyi R, Greller L D, Feilotter H. (2008) Predicting outcome in follicular lymphoma by using interactive gene pairs. Clin Cancer Res. 14 (2):478-87.
Linjie Tian, Pingzhang Wang, 1, Jinh Guo, Xinyu Wang, Weiwei Deng, Chenying Zhang, Dongxu Fu, Xi Gao, Taiping Shi, Dalong Ma. Screening for novel human genes associated with CRE pathway activation with cell microarray’ Genomics, 90, Issue 1, July 2007, 28-34.
Loughin, T. M. (1995). A residual bootstrap for regression parameters in proportional hazards model”, J. of Statistical and Computational Simulations, vol. 52, pp 367-384.
Motakis, E., Ivshina, A. V. and Kuznetsov, V. A. (2008). A data-driven method of identification of essential genes and gene pairs associated with survival time of cancer patients, IEEE Engineering in Medicine and Biology (to appear)
Ruth M. R., Adrian L. Harris, Lionel Tarassenko (2004). Non-linear survival analysis using neural networks, Statistics in Medicine, 23, 825-842
Schemper, M. (1992). Cox analysis of survival data with non-proportional hazard functions, Statistician, 41, 455-465.
Schoenfeld D. (1982). Residuals for the proportional hazards regression model. Biometrika, 69 (1):239-241.
Shibayama H, Takai E, Matsumura I, Kouno M, Morii E, Kitamura Y, Takeda J, Kanakura Y. Identification of a cytokine-induced antiapoptotic molecule anamorsin essential for definitive hematopoiesis. J Exp Med. 2004 Feb. 16; 199 (4):581-92.
Sjoblom T, Jones S, Wood L D, Parsons D W, Lin J, Barber T D, Mandelker D, Leary R J, Ptak J, Silliman N, Szabo S, Buckhaults P, Farrell C, Meeh P, Markowitz S D, Willis J, Dawson D, Willson J K, Gazdar A F, Hartigan J, Wu L, Liu C, Parmigiani G, Park B H, Bachman K E, Papadopoulos N, Vogelstein B, Kinzler K W, Velculescu V E. Science. 2006 Oct. 13; 314 (5797):268-74. Epub 2006 Sep. 7
Sotiriou, T., Perou, C. M., Tibshirani, R. et al. (2001). Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc Natl Acad Sci USA 98:10869-10874.
Therneau, M. T. and Grambsch, M. P. (2000). Modeling Survival Data: Extending The Cox Model. New York: Springer-Verlag.
Thompson, A. L., Chhikara, S. R. and Conkin, J. (2003). Cox proportional hazards models for modeling the time to onset of decompression sickness in hypobaric environments, NASA/TP-2003-210791, Manuscript.
Tibshirani, R., Hastie, T., Narasimhan, B. and Chu, G. (2002). Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS, 99, 6567-6572. van't Veer, L. J., Dai, H., van de Vijver et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature 415:530-536.
Number | Date | Country | Kind |
---|---|---|---|
200901682-5 | Mar 2009 | SG | national |
The present application is related to U.S. application 61/158,948 which was filed on 10 Mar. 2009, and to Singapore patent application 200901682-5 from which it claims priority.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/SG10/00078 | 3/10/2010 | WO | 00 | 9/9/2011 |