This application incorporates by reference the material (Sequence Listing) in the ASCII text file Sequence_Listing.txt, created Sep. 9, 2011, having a file size of 99,000 bytes.
The present invention relates to identification of clinically distinct sub-groups of patients and corresponding genes and/or pairs of genes for which the respective gene expression values in a subject and clinical status are statistically significant in relation to a medical condition, for example cancer progression, or more particularly breast cancer patient's survival. The gene expression values may for example be indicative of the susceptibility of the individual subject to the medical condition (in context of time survival or/and disease progression), or the prognosis of a subject who exhibits the medical condition. The invention further relates to methods employing the identified patient survival significant genes and gene pairs.
Global gene expression profiles of subjects are often used to obtain information about those subjects, such as their susceptibility to certain medical condition, or, in the case of subjects exhibiting medical conditions, their prognosis. For example, having determined that a particular gene is important, the level in which that gene is expressed in a subject can be used to classify the individual into one of a plurality of classes, each class being associated with a different susceptibility or prognosis. The class comparison analysis leads to a better understanding of the disease process by identifying gene expression in primary tumours associated with subject survival outcomes (Kuznetsov et al., 2006).
First we will describe briefly the background theory of survival analysis. We denote by T the patient's survival time. T is a continuous non-negative random variable which can take values t, tε[0, ∞). T has density function f(t) and cumulative distribution function
We are primarily interested in estimating two quantities:
The survival function: S(t)=P(T>t)=1−F(t)
The hazard function:
The survival function expresses the probability of a patient to be alive at time t. It is often presented in the form S(t)=exp(−H(t)), where
denotes the cumulative hazard. The hazard function assesses the instantaneous risk of death at time t, conditional on survival up to that time.
Notice that the hazard function is expressed in terms of the survival function. To this extent, survival distributions and hazard functions can be generated for any distribution defined for tε[0,∞). By considering a random variable W, distributed in (∞,−∞), we can generate a family of survival distributions by introducing location (α) and scale (σ) changes of the form log T=α+σW.
Alternatively, we can express the relationship of the survival distribution to covariates by means of a parametric model. The parametric model employs a “regressor” variable x. Take for example a model based on the exponential distribution and write: log(h(t))=α+βx, or equivalently, h(t)=exp(α+βx).
This is a linear model for the log-hazard, or, equivalently, a multiplicative model for the hazard. The constant α represents the log-baseline hazard (the hazard when the regressor x=0) and the slope parameter β gives the change in hazard rate as x varies. This is an easy example of how survival models can be obtained from simple distributional assumptions. In the next paragraphs we will see more specific examples.
2. Cox Proportional Hazards. Model
One of the most popular survival models is the Cox proportional hazards model (Cox, 1972):
log h(t)=α(t)+βx (1)
where, as before, t is the survival time, h(t) represents the hazard function, α(t) is the baseline hazard, β is the slope parameter of the model and x is the regressor. The popularity of this model lies in the fact that it leaves the baseline hazard function α(t) (which we may alternatively designate as log h0(t)) unspecified (no distribution assumed). It can be estimated iteratively by the method of partial likelihood of Cox (1972). The Cox proportional hazards model is semi-parametric because while the baseline hazard can take any form, the covariates enter the model linearly.
Cox (1972) showed that the p coefficient can be estimated efficiently by the Cox partial likelihood function. Suppose that for each of a plurality of K subjects (labelled by k=1, . . . , K), we observe at corresponding time tk a certain nominal (i.e. yes/no) clinical event has occurred (e.g. whether there has been metastasis). This knowledge is denoted ek. For example ek may be 0 if the event has not occurred by time tk (e.g. no tumour metastasis at time tk) and 1 if the event has occurred (e.g. tumour metastasis at time tk). Cox (1972) showed that the p coefficient can be estimated efficiently by the Cox partial likelihood function, estimated as:
where R(tk)={j: tj≧tk} is the risk set at time tk. Typically, e is a binary variable taking value 0=non-occurrence of the event until time t or 1=occurrence of the event at time t. Later we will discuss a particular case of clinical event we consider in the work, without limiting though our model to this specific case.
The likelihood (2) is minimized by Newton-Raphson optimization method for finding successively better approximations to the zeroes (or roots) of a real-valued function (Press et al., 1992), with a very simple elimination algorithm to invert and solve the simultaneous equations.
Assume a microarray experiment with i=1, 2, . . . , N genes, whose intensities are measured for k=1, 2, . . . , K breast cancer patients. The log-transformed intensities of gene i and patient k are denoted as yi,k. Log-transformation serves for data “Gaussianization” and variance stabilization purposes, although other approaches, such as the log-linear hybrid transformation of Holder et al. (2001), the generalized logarithm transform of Durbin et al. (2002) and the data-driven Haar-Fisz transform of Motakis et al. (2006), have also been considered in the literature.
Associated with each patient k are a disease free survival time tk (in this work DFS time), a nominal clinical event ek taking values 0 in the absence of an event until DFS time tk or 1 in the presence of the event at DFS time tk (DFS event) and a discrete gradual characteristic (histologic grade). Note that in this particular work the events correspond to the presence or absence of tumor metastasis for each of the k patients. Other types of events and/or survival times are possible to be analyzed by the model we will discuss below.
Additional information, which is not utilized in this work, includes patients' age (continuous variable ranging from 28 to 93 years old), tumor size (in millimeters), breast cancer subtype (Basal, ERBB2, Luminal A, Luminal B, No subtype, normal-like), patients' ER status (ER+ and ER−) and distant metastasis (a binary variable indicating the presence or absence of distant metastasis).
Assuming, without loss of generality, that the K clinical outcomes are negatively correlated with the vector of expression signal intensity yi of gene i, patient k can be assigned to the high-risk or the low-risk group according to:
where cj denotes the predefined cut-off of the ith gene's intensity level. In the case of positive correlation between the K clinical outcomes and y1, patient k is simply assigned to one of the two groups according to:
After specifying xki, the DFS times and events are subsequently fitted to the patients' groups by the Cox proportional hazard regression model (Cox, 1972):
log hkt(tK|xki,βi)=αl(tk)+βixkl (4)
where, as before, hik is the hazard function and αi(tk)=log hi0(tk) represents the unspecified log-baseline hazard function for gene i; PA is the ith element of the vector
where R(tk)={j:tj>tk} is the risk set at time tk and ek is the clinical event at time tk. The actual fitting of model (4) is conducted by the survival package in R (cran.r-project.org/web/packages/survival/index.html). The genes with the largest βi Wald statistics (Wi's) or the lowest βi Wald P values are assumed to have better group discrimination ability and thus called highly survival significant genes. These genes are selected for further confirmatory analysis or for inclusion in a prospective gene signature set. Note that given βi, one derives the Wald statistic, W, as:
where
and I denotes the Fisher information matrix of the βi parameter. Estimating the Wald P value, simply requires evaluation of the probability:
where χvz denotes the chi-square distribution with v degrees of freedom. Typically, v is the number of parameters of the Cox proportional hazards model and in our case v=1. Expression (5) can be derived from the proper statistical tables of the chi-square distribution.
From Eqn. (3) notice that the selection of prognostic significant genes relies on the predefined cut-off value ci that separates the low-risk from the high-risk patients. The simplest cut-off basis is the mean of the individual gene expression values within samples (Kuznetsov, 2006), although other choices (e.g. median, trimmed mean, etc) could be also applied. Two problems, associated with such cut-offs, are: 1) they are suboptimal cut-off values that often provide low classification accuracy or even miss existing groups; 2) the search for prognostic significance is carried out for each gene independently, thus ignoring the significance and the impact of genes' co-expression on the patient' survival.
In a first aspect, the present invention proposes in general terms that a cut-off expression value should be selected so as to maximise the separation of the respective survival curves of the two groups of patients. From another point of view, this means that the cut-off expression value is selected such that the partition of subjects which it implies is of maximal statistical significance. This overcomes a possible problem in the known method described above: that if the cut-off expression values are not well-chosen they may provide low classification accuracy even for genes which are very statistically significant for certain ranges of expression value.
A specific expression of the first aspect of the invention is a computerized method for optimising, for each gene i of a set of N genes, a corresponding cut-off expression value ci.
for partitioning subjects according to the expression level of the corresponding gene,
the method employing medical data which, for each subject k of a set of K* subjects suffering from the medical condition, indicates (i) the survival time of subject k, and (ii) for each gene i, a corresponding gene expression value yi,k of subject k;
the method comprising, for each gene i,
The cut-off expression value yi,k here may be a logarithm of the measured expression intensity of gene i in patient k (as in the background section above, and in following description of specific embodiments). However, the expression value yi,k may alternatively be any other transformation of the measured expression insensity, or indeed the raw intensity itself.
In a second aspect, the invention proposes in general terms that pairs of genes are selected. For each gene pair, we generate a plurality of models, each of which represents a way of partitioning a set of subjects based on the expression values of the pair of genes. We then identify those gene pairs for which one of the models has a high prognostic significance. This overcomes a problem of the known method described above that the search for prognostic significance is carried out for each gene independently, thus neglecting any significance of genes' co-expression on patient survival.
A specific expression of the second aspect of the invention is a computerized method for identifying one or more pairs of genes, selected from a set of N genes, which are statistically associated with prognosis of a potentially-fatal medical condition,
the method employing medical data which, for each subject k of a set of K* subjects suffering from the medical condition, indicates (i) the survival time of subject k, and (ii) for each gene i, a corresponding gene expression value yi,k of subject k; the method comprising:
The first and second aspects of the invention may be used in combination. That is, cut-off expression values for individual genes derived according to the first aspect of the invention may be employed in a method according to the second aspect of the invention.
Note that the “expression values” referred to in the specific expressions of the invention may be the direct outputs of expression value measurements, but more preferably are the logarithms (e.g. natural logarithms) of such measurements, and optionally may have been subject to a normalisation operation.
In either the first or second aspect of the invention, the medical data preferably includes nominal data which indicates for each patient, whether one or more clinical events have occurred. For example, the nominal data may indicate whether tumour metastasis had occurred by the survival time. The significance value may be calculated using a formula which incorporates the nominal data. Alternatively or additionally, it can be used to select a subset of the patients, such that the clinical data for that subset of patients is used in the method.
In either the first or second aspect of the invention, the survival time may be an actual survival time (i.e. a time taken to die) or a time spent in a certain state associated with the medical condition, e.g. a time until metastasis of a cancer occurs.
Furthermore, in either the first or second aspect of the invention the K* subjects may be a subset of a larger dataset of K (K>K*) subjects. For example, the data for K* subjects can be used as training data, and the rest used for validation.
Alternatively, a plurality of subsets of the K subjects can be defined, and the method defined above is carried out independently for each of the subsets. Each of these subsets of the K subjects is a respective “cohort” of the subjects; if the cohorts do not overlap, they are independent training datasets. Note that each time the method is performed for a certain cohort, K* denotes the number of subjects in that cohort, which may be different from the number of subjects in other of the cohorts. After this, there is a step of discovering which pairs of genes were found to be significant for all the cohorts.
Once one or more genes, or pairs of significant genes, are identified by a method according to either the first or second aspect of the invention, they can be used to obtain useful information in relation to a certain subject (typically not one of the cohort(s) of subjects) using a statistical model which takes as an input the ratio(s) of the expression values of the corresponding identified pair(s) of genes. The information may for example be susceptibility to the medical condition, the or prognosis (e.g. relating to recurrence or death) of a subject suffering from the condition.
The invention may be expressed, as above, in terms of a method implemented using a computer, or alternatively as a computer system programmed to implement the method, or alternatively as a computer program product (e.g. embodied in a tangible storage medium) including program instructions which are operable by the computer to perform the method. The computer system may in principle be any general computer, such as a personal computer, although in practice it is more likely typically to be a workstation or a mainframe supercomputer.
In a third aspect, the present invention proposes a kit, such as a microarray, for detecting the expression level of a set of genes, the set having no more than 1000 member, or no more than 100 members, or no more than 20 members, and including
“Array” or “microarray,” as used herein, comprises a surface with an array, preferably an ordered array, of putative binding (e.g., by hybridization) sites for a sample which often has undetermined characteristics. An array can provide a medium for matching known and unknown nucleic acid molecules based on base-pairing rules and automating the process of identifying the unknowns. An array experiment can make use of common assay systems such as microplates or standard blotting membranes, and can be worked manually, or make use of robotics to deposit the sample. The array can be a macro-array (containing nucleic acid spots of about 250 microns or larger) or a micro-array (typically containing nucleic acid spots of less than about 250 microns).
The term “cut-off expression value” represented by c (followed with the proper superscript; e.g. ci) refers to a value of the expression level of a particular nucleic acid molecule (or gene) in a subject. The cut-off expression value is used to partition the subjects into classes, according to whether the expression level of the corresponding gene is below or above the cut-off expression value. Note that it makes no difference whether all subjects for which the expression value is actually equal to the cut-off value are classified as being in one class or alternatively in the other.
The term “gene” refers to a nucleic acid molecule that encodes, and/or expresses in a detectable manner, a discrete product, whether RNA or protein. It is appreciated that more than one nucleic acid molecule may be capable of encoding a discrete product.
The term includes alleles and polymorphisms of a gene that encodes the same product or an analog thereof.
There are various methods for detecting gene expression levels. Examples include Reverse-Transcription PCR, RNAse protection, Northern hybridisation, Western hybridisation, Real-Time PCR and microarray analysis. The gene expression level may be determined at the transcript level, or at the protein level, or both. The nucleic acid molecule or probe may be immobilized on a support, for example, on an array or a microarray. The detection may be manual or automated. Standard molecular biology techniques known in the art and not specifically described may be employed as described in Sambrook and Russel, Molecular Cloning: A Laboratory Manual, Cold Springs Harbor Laboratory, New York (2001).
Gene expression may be determined from sample(s) isolated from the subject(s).
Embodiments of the methods will now be explained with reference to the figures, and experimental results are presented which were generated using two cohorts of subjects: the Uppsala and Stockholm cohorts.
The clinical characteristics of the subjects and the tumour samples of Uppsala and Stockholm cohorts have been summarised in Ivshina et al., 2006. The Stockholm cohort comprised Ks=159 subjects with breast cancer, operated on in Karolinska Hospital from Jan. 1, 1994, through Dec. 31, 1996, and identified in the Stockholm-Gotland breast Cancer registry. The Uppsala cohort involved Ku=251 subjects representing approximately 60% of all breast cancers resections in Uppsala County, Sweden, from Jan. 1, 1987, to Dec. 31, 1989. Information on the subjects' disease free survival (DFS) times/events and the expression patterns of approximately 30000 gene transcripts (representing N=44928 probe sets on Affymetrix U133A and U133b arrays) in 315 primary breast tumours were obtained from NCBI Gene Expression Omnibus (GEO) (Stockholm data set label is GSE4922; Uppsala data set label is GSE1456). The microarray intensities were calibrated (RMA) and the probe set signal intensities were log-transformed and scaled by adjusting the mean signal to a target value of log 500 (Ivshina et al., 2006). In this study, Affymetrix U133A and 133b probesets (232 gene signatures) was used to provide classification of the low- and high-aggressive cancer sub-types described in Ivshina et al., 2006. For each of these patients, there was date specifying the disease free survival time (DFS) of an event (tumor metastasis). and the actual occurrence of this event (a binary variable taking the values 1=occurrence of the event, 0=no occurrence of the event).
As may be seen in
The present embodiments used only the “typical” data as training data. In other words, there was a pre-processing of the data to identify only typical patients: data from subjects who satisfy the above Event and tk relationship. Then we apply our methods on the typical (or training) data only (data from responders that satisfy the above Event and tk relationship), estimate a cutoff value for each gene by (1) and use these estimates to the whole set of patients to infer about prognostic significance by (2).
Based on visual inspection of
The terminology explained in the background section of this document is followed in the following explanation of the embodiments and some comparative examples. In these embodiments and comparative examples, the number of patients in the training set is denoted K* which is less than K, the total number of patients about whom data existed. The subjects of the training set are labelled k=1, . . . , K*. The DFS survival time for the k-th subject is referred to as tk, and whether or not tumour metastatis has occurred is referred to as ek. Each of the embodiments, and the comparative example, involve fitting a set of survival times to an equation such as Eqn. (4). The fitting to Eqn. (4) was conducted by the survival package which can be found at the following website: cran.r-project.org/web/packages/survival/index.html. See also the references Cox, D. R and Snell, E. J (1968) and Cox, D. R. and Oakes, D (1984). This made it possible to find a Wald statistic for each gene/gene-pair in the matter explained in the background section of this document.
We now discuss estimation of a cut-off expression value for each gene. This is done first using the prior art method discussed above (Mean based grouping—“MBg”) and using a embodiment of the invention illustrated in
For each gene i of the training set, the respective mean μi of the values yi,k of the subjects in the training set was found using the K* training set patients. Note that K*<K and that in Stockholm Ks=159 and Ks*=148 while in Uppsala Ku=251 and Ku*=212 (for simplicity we drop the s and u subscripts in the following paragraphs). The subjects of the training set were then grouped according to whether their values of yi,k were above or below a cut-off ci=μi. Equation (3) was used to generate a set of values {xik}, and from this a set of corresponding values βi was generated by fitting Equation (4). The prognostic significance of gene i was evaluated by reporting the p-value of the estimated βi. The genes with significantly small p-values (p<0.05 or p<0.01) were selected.
Schematically, the steps we follow are:
log hki(tk|xkt,βi)=αi(tk)+βixkt
The first embodiment is explained with reference to
1. Form the candidate cut-offs vector of dimension 1×Q,
to separate the K* subjects with ci.
For z=1 (the first element in
3. Iterate for z=2, . . . , Q to estimate the Q p-values (corresponding to Q distinct cut-off expression value levels) for each i. The “optimal” cut-off expression value ci for each i is the taken as the one with the minimum βi2 p-value, provided that the sample size of each group is sufficiently large (formally above 30) and Cox proportional hazards model is plausible.
In step 3, the “optimal” cut-off expression value for each i is the taken as the one with the minimum βiz p-value, provided that the sample size of each group is sufficiently large (formally above 30) and model defined by Eqn. (4) is plausible. Note that here βi is substituted by βiz indicating that the search was for the βi that leads to the best cut-off expression value.
In order to validate the significance of the findings (in terms of the estimated p-values), the Stockholm and Uppsala samples were bootstrapped and the 99% confidence intervals for the βi coefficients of Eqn. (4) were estimated. We use the non-parametric residuals bootstrap for the proportional hazards model of Loughin (1995) using the boot package in R. Specifically, the algorithm works as follows sequentially for each gene i:
log hkt(tk|xkt,βt)=α1(tk)+(β1xkt
u
k=[1−F0(t)]exp(β
Based on these coefficients, we estimate and report the Bias-Corrected accelerated (BCa) bootstrap confidence intervals for each βi coefficient that correct the simple quantile intervals of βi for bias and skewness in their distribution (Efron and Tibshirani, 1994). A detailed discussion (theory and applications) on BCa intervals is given in Efron and Tibshirani (1994). Here the 99% BCa were estimated by the boot package in R. Bootstrap test p-values based on quantile method gave similar results using the Wald test.
The gene expression profiles were correlated with clinical outcome (disease free survival time; DFS) in the two cohorts with the intention of identifying specific genes that predict survival. It was found that a large fraction of the genes had some correlation with survival, and could be used to predict survival.
This cut-off expression value is marked by a cross on the dashed line. The dotted line in each of the graphs indicates the log (p-value) corresponding to a MBg cut-off found by method of section 4.1, and this is cut-off is marked by a cross on the dotted line.
The mean-based grouping and the data-driven grouping with the estimated cut-off expression value were repeated for the full set of 159 subjects of the Stockholm Cohort to estimate the Wald p-values for each of the 264 probe sets (representing 232-gene genetic grade signature (Ivshina et al., 2006). At 1% significance level, MBg identified 151 probesets (148 prognostic gene signatures), while DDg identified 195 probesets (192 prognostic gene signatures). 82 of the 100 top-level survival related probesets of the two approaches are common. For the genes with p-values lower than 0.1%, the methods are highly reproducible. In this case, ˜99.0% of the MBg probesets was also present in the DDg list.
Bias-Corrected accelerated confidence intervals (BCa Cl's) were used to confirm the Wald statistic estimates. By estimating the 99% BCa Cl's, 118 and 145 probesets were predicted by MBg and DDg methods, respectively, as survival significant probesets. As a comparison between Wald statistic and BCa, 52 probesets (Wald-only positives) with significant Wald p-values (at alpha=1%) were not significant by BCa bootstrapping; while 2 genes (Wald-only negatives) for which the p-values were not significant were significant with BCa bootstrapping for the DDg selected group set. For the MBg selected probesets, the 99% BCa Cl's found 118 significant probesets with 38 Wald-only positives and 2 Wald-only negatives.
A second embodiment of the invention is explained with reference to
A: yi,k<ci and yj,k<cj
B: yi,k≧ci and yj,k<cj
C: yi,k<ci and yj,k≧cj
D: yi,k≧ci and yj,k≧cj
Each of the seven models is then defined as a respective selection from among the four regions:
Model 1 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A or D, rather than B or C.
Model 2 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A, B or C, rather than D.
Model 3 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A, C or D, rather than B.
Model 4 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions B, C or D, rather than A.
Model 5 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A, B or D, rather than C.
Model 6 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A or C, rather than B or D.
Model 7 is that a subject's prognosis is correlated with whether the subject's patient's expression levels are within regions A or B, rather than C or D.
Note that model 6 is equivalent to asking only whether the subject's expression level of gene 1 is below of above c1 (i.e. it assumes that the expression value of gene 2 is not important). Model 7 is equivalent to asking only whether the subject's expression for gene 2 is above or below c2 (it assumes that the expression value of gene 1 is not important). Thus, models 1-5 are referred to as “synergetic” (1-5), and the models 6 and 7 as “independent”.
The algorithm of the second embodiment evaluates the significance of all possible gene pairs i, j as follows (see
log hi,jk(tk|xi,j,km,βi,jm)=αi,j(tk)+βi,jm·xi,j,km, (7)
The algorithm above was then applied to the data set in order to examine whether gene pairing could improve the prognostic outcome for certain survival significant genes. All the possible 34716 probeset pairs (0.5×264×263) of the study were considered.
As shown in the table of
For the LRP2-ITGA7 case, the subjects at A region (identified as “low-risk” subjects) were separated from the B, C and D subjects (“high-risk” subjects) while for the CCNA2-PTPRT case the A, C and D (“high-risk”) subjects were separated from the B subjects (“low-risk”).
For the gene pair CCNA2-PTPRT, the corresponding results for performing the method of
For both pairs of genes, the DDg method separates the “low-risk” from the “high-risk” subjects more accurately as shown by the respective Kaplan-Meier curves.
As mentioned above, in the case of the Uppsala cohort, the training set of “typical” patients consisted of Ku*=212 subjects. The the “mean-based” and “data-driven” cut-off expression values for each of the N=264 probesets were derived according to the comparative example and the first embodiment. The the Wald p-values were derived by Eqns. (4) and (5). At 1% significance, DDg found 195 probesets (191 prognostic gene signatures) and MBg found 131 probesets (127 prognostic gene signatures). There was almost perfect reproducibility for the top level MBg probesets (probesets with p-values lower than 0.1%) in the DDg list, while 82% of the top 100 MBg-DDG probesets were common. The 99% BCa Cl's showed results similar to the Stockholm results. For DDg, 168 significant genes (157 by MBg) were found with 28 Wald-only positives (26 by MBg) and 2 Wald-only negatives (1 by MBg). Across the two cohorts, the Wald statistic discovered 165 common probe sets (˜85% of the DDg significant set) while the bootstrap method found 123 common probe sets (˜73% of the bootstrap DDg significant set).
Recently, we discovered a 5-gene signature (6 probesets) which re-classifies tumours with histologic grade 2 into two sub-types, 1-like grade and 3-like grade tumours (Ivshina et al., 2006) that show similar genetic features with grade 1 and grade 3 breast cancer tumours, respectively. Here, we find that all 6 probesets of this 5-gene signature are survival-significant genes identified by the DDg method (BRRN1 (212949_at, p=1.4E-03); FLJ11029 (228273_at, p=1.7E-04); C6orf173 (226936_at, p=6.2E-04); STK6 (208079_s_at, p=3.4E-04; 204092_s_at, p=6.4E-04), MELK (204825_at, p=1.2E-05)), while combinations of these genes with other genes produce synergetic survival effects, suggesting that these genes are representative and robust members of quite a diverse gene regulatory network.
The search for common genes across two independent studies may lead to the identification of the most reliable genes for further analysis. Accordingly, the functional significance and functional reproducibility of the results in the two cohorts were investigated further.
In order to compare the specificity of DDg and MBg in terms of the gene functions they identify, GO (gene ontology) analyses of the top 100 genes of each method were conducted in Panther (Protein Analysis through Evolutionary Relationships) software from the website (pantherdb.org). This grouped the genes into pathways and/or biological processes. Significant enrichment in p53 (DDg p-value=2.4E-03; MBg p-value=2.4E-03) and ubiquitin proteasome (DDg p-value=5.1E-02; MBg p-value=5.2E-02) pathways were identified. Further, significant biological processes include: cell cycle (DDg p-value=1.2E-23; MBg p-value=3.4E-23) and mitosis (DDg p-value=2.4E-11; MBg p-value=6.0E-11). Significant molecular functions include: Microtubule family Cytoskeletal protein (DDg p-value=9.9E-10; MBg p-value=3.0E-10) and protein kinase (DDg p-value−1.5E-04; MBg p-value=6.6E-05).
Similar GO analysis results were observed for the Uppsala data and the findings were well supported by the results of previous studies (Ivshina et al., 2006 and Pawitan et al., 2005). Importantly, the GO analysis produced results very similar to those of the top 1000 survival highly significant genes identified by DDg method from the entire list of 44928 probesets for both the Stockholm and Uppsala cohorts. Table 1 shows the GO analysis of the best gene synergy results according to Criteria 1 and 2. 100 highly significant synergetic pairs represent 44 unique DDg genes and 49 unique MBg genes were identified
In order to compare specificity and sensitivity of the methods in the 2-D case (i.e. using pairs of genes), 34716 pairs of genetic grade signature pairs were considered and the numbers of pairs which provide a significant p-value by the Wald statistic of survival curves <0.01 were counted. The DDg method was more specific and more sensitive than the MBg method. For example using the Stockholm cohort, MBg identified 11778 significant probeset pairs. In comparison, the DDg method identified 16489 significant pairs (˜1.4 times the MBg method), resulting in 4711 DDg pairs unique to DDg. The large difference in the number of significant genes identified by the two methods shows that the DDg method can find interesting genes not located by the MBg method (or any other grouping method based on a single point estimate of yi expression levels). This feature indicates that the DDg method may be particularly appealing for prognostic gene identification.
Using the more stringent p-value of 0.005, it was found that ˜39% of the gene pairs common to both DDg and MBg were false positives. In addition, at a significance level alpha=1% (equivalent to p-value of 0.01), 40% of the unique DDg gene pairs were false positives. In order to reduce Type I errors due to multiple testing (false positives), Bonferroni correction was applied on the Wald test p-values to base the inference on a more stringent significance level. This method identified 1180 significant DDg gene pairs in the Stockholm cohort and 1465 significant DDg gene pairs in the Uppsala cohort, with 53 common pairs in the two cohorts. The respective values derived using MBg approach were 85 significant gene pairs in the Stockholm cohort and 75 significant gene pairs in the Upsala cohort, with no common gene pairs between the two cohorts. For the individual gene analysis, DDg with Bonferroni correction found 97 and 88 significant genes in the Stockholm and Uppsala cohorts, respectively, with 58 common genes between the two cohorts, while the corresponding numbers for MBg were 35 and 36 significant genes (10 common).
The grouping scheme analysis was repeated for the full data set of 44928 probesets. In the Stockholm cohort, the DDg method identified 7473 prognostic significant genes by the Wald test. With Bonferroni correction, the number was 90 prognostic significant genes. In the Uppasala cohort, the respective numbers were 5545 by the Wald test and 55 after Bonferroni correction. Between the two cohorts, 3152 common prognostic genes were identified by the Wald-based DDg test while the MBg method identified 559 (˜18% of DDg). This further supports that the DDg method is able to identify many statistically significant and biologically meaningful prognostic and predictive genes.
Table 2 presents the top 7 gene pairs in terms of the Criteria 1 and 2. These pairs exhibit high synergetic effect in both cohorts and their synergy produces significantly stronger effect than individual grouping (as indicated by the respective p-values in Table 2).
1.9e−03
1.1e−04
3.9e−03
1.1e−04
2.1e−03
Survival significant gene are summarized in the table 3.
In the embodiments, the semi-parametric Cox proportional hazard regression model was used to estimate predictive significance of genes for disease outcome as indicated by patient survival times. For a given gene, the optimal partition (cut-off expression value) of its expression domain was estimated by maximising the separation of survival curves related to the high- and low-risk of disease behaviour, as indicated by the Wald statistic derived from the corresponding univariate Cox partial likelihood function. The top-level genes having the largest Wald statistic were selected for further confirmatory GO-analysis and inclusion into gene signatures.
A similar selection procedure was also developed in order to construct two-genes signatures exhibiting synergetic influence on patient survival. This approach was applied to analyse Affymetrix U133 data sets of two large breast cancer cohorts to identify genes and genes pairs related to genetic breast cancer grade signature (Ivshina et al., 2006). The genes that were most significantly correlated with the disease free survival time of breast cancer patients were selected. These genes could be subsequently used as an input in reconstruction analysis of biological programs/pathways associated with aggressiveness of breast cancer (Ivshina et al., 2006).
All genes of 5-gene genetic grading signature are survival significant. Additionally, they are co-regulated in primary breast cancer samples (data not present) and could functionally related to each other. BRRN1 encodes a member of the barr gene family and a regulatory subunit of the condensin complex. This complex is required for the conversion of interphase chromatin into condensed chromosomes. The protein encoded by this gene is associated with mitotic chromosomes, except during the early phase of chromosome condensation. During interphase, the protein has a distinct punctate nucleolar localization. There is a SSB-specific response of condensin I through PARP-1 and a role for condensin in SSB. repair. The protein encoded by STK6 is a cell cycle-regulated kinase that appears to be involved in microtubule formation and/or stabilization at the spindle pole during chromosome segregation. The encoded protein is found at the centrosome in interphase cells and at the spindle poles in mitosis. This gene may play a role in tumor development and progression. A processed pseudogene of this gene has been found on chromosome 1, and an unprocessed pseudogene has been found on chromosome 10. Multiple transcript variants encoding the same protein have been found for this gene. MELK is known to have a critical role in the proliferation of brain tumors, including their stem cells, and suggest that MELK may be a compelling molecular target for treatment of high-grade brain tumors. 2. Maternal embryonic leucine zipper kinase transcript abundance correlates with malignancy grade in astrocytomas 3. the kinase activity of MELK is likely to affect mammary carcinogenesis through inhibition of the pro-apoptotic function of Bcl-GL 4. analysis of MELK substrate specificity and activity regulation 5. pEg3 is a potential regulator of the G2/M progression and may act antagonistically to the CDC25B phosphatase, pEg3 kinase is able to specifically phosphorylate CDC25B in vitro. One phosphorylation site was identified and corresponded to serine 323.
For our work that available information is important in context of validation of our method of identification of patient survival significant genes. In particular, these 3 genes are included in the genetic grade signature which classifies breast cancer patients according to aggressiveness of the cancer disease (Ivshina et al, 2006) and have also used as important clinical markers of breast cancer and other human cancers. MELK is used now as a target for adjuvant therapy in clinical trials. All these genes are survival significant by our DDg estimates.
The biological importance of the DDg survival genes was also demonstrated by the proliferative pattern of the 5-gene grade signature (
Thus, these genes itself or in pairs or in larger number groups with other genes represent biologically and clinically important survival significant gene signature. We could claim, that such genes and their combinations could be used as “a positive control” for reliable selection of poorly defined/unknown genes which could be promising as novel and important components of critical biological pathways in cancer cells, and novel markers for prognosis and prediction of cancer patients.
In particular, C6orf173 (226936_at, 1D DDg: p=6.2E-04) which is a member of our 5-gene genetic grading signature (Ivshina et al, 2006) and this gene is also a survival significant by our analysis (Kuznetsov et al, 2006; Motakis et al, 2009). It was reported a a cancer up-regulated gene (CUG2) (Lee et al, 2007; Kim et al, 2009). CUG2 was recently renamed CENPW based on the new findings that it is a component of the centromeric complex playing a crucial role in the assembly of functional kinetochore complex during mitosis (Hod et al, 2008).
FLJ11029 (228273_at, p=1.7E-04) is an unknown gene. We found that with very high expression in many cancer tissues and cell lines, however the functions of this gene are unknown. Due to moderate level of evolution conservation and lost of ORF structures (data not shown), FLJ11029 (228273_at) gene could be considered as a novel long non-protein coding gene. In Uppsala and Stockholm cohorts, CUG2(CENPW) and FLJ11029 are strongly positively correlated to each other (Kendal correlation; p<0.0001) and simultaneously with the expression levels of BRRN1, STK6, and MELK (Kendal correlation p<0.0001). Our results of survival analysis suggest that all 5 genes are survival significant and could form essential functional module associated with mitosis phase of breast cancer cells.
Base on these findings we suggest that CUG2, BRRN1, STK6, MELK and can be involved in same regulatory pathway (or sub-network) which is associated with mitotic chromosomes, chromosome condensation and their segregation.
Thus, we suggest a function of genetic grade 5-gene signature genes, related to their biological attributes to coordination and co-regulation dynamics of mitotic chromosomes; these genes provide a synergetic effect on survival of the breast cancer patients and could be used to identify distinct subtypes of breast cancers.
A large number of synergetic gene pairs that were significantly associated with survival of breast cancer patients had been identified and biologically meaningful information about these survival-significant genes may also be postulated. This DDg approach was able to identify interesting targets that were not picked up by other methods, e.g. MBg approach. In order to further test the effectiveness of the DDg approach on the whole gene population of the study, the two methods were applied to the 44928 probesets and the DDg prognostic genes list was found to be twice as large and all genes identified by MBg were also present in the DDg list.
Most of the top-level survival significant genes were related to the cell cycle, and more specifically to the shortest phase of the cell cycle, mitosis (Table 1, Table 2).
Another survival significant gene SPAG5 encodes a protein associated with the mitotic spindle apparatus (www.genecards.org/cgi-bin/carddisp.pl?gene=SPAG5&search=SPAG5). By the literature, SPAG5 encoded protein may be involved in the functional and dynamic regulation of mitotic spindles. FLJ20105 ((ERCC6L); www.genecards.org/cgi-bin/carddisp.pl?gene=ERCC6L&search=ERCC6L) is a partner of synergistic survival significant pair of SPAG5. ERCC6L is DNA helicase that acts as an essential component of the spindle assembly checkpoint. Contributes to the mitotic checkpoint by recruiting MAD2 to kinetochores and monitoring tension on centromeric chromatin. Acts as a tension sensor that associates with catenated DNA which is stretched under tension until it is resolved during anaphase. Thus, the both genes of that pair are co-localized (centromeric region) and involved in the same molecular machinery (mitotic spindle assembly).
Thus, based on our analysis, we claim that gene pairs SPAG5-ERCC6L, CENPE-CCNE2, CDCA8-CLDN5 consist of a novel breast cancer associated synergistic survival significant gene pairs playing a important role in dynamics of chromosomes during cell mitosis. Together with the 5-gene genetic grade signature genes they form an integrative functional module essential for breast cancer prognosis and prediction.
Compared to single survival significant genes, gene pairs showed highly significant p-values of Wald statistic (˜10−8 vs ˜10˜5). This result implied that the pairing procedure itself should be considered as a unique statistical tool for identification of patients with very poor/good prognosis. Interestingly, some of the gene pairs are not directly related to the cell cycle.
Some of the pairs, for example, megalin (LRP2) and itnergrin alpha 7 (ITGA7) have not been discussed for breast cancer classification or patients' survival in the literature. The megalin gene contributes to the endocytic uptake of 25(OH)D3-DBP and activation of the vitamin D receptor (VDR) pathway. The VDR pathway is normally expressed in mammary gland, where it functions to oppose estrogen-driven proliferation and maintain differentiation. LRP2 can be highly expressed in some breast cancer cells. The above suggest that LRP2 participates in negative growth regulation of mammary epithelial cells. Associations of expression of integrin alpha 7 with breast cancer and breast cancer patient survival have not been reported. Megalin and integrin alpha 7 have also not been reported as molecular partners in anti-cancer regulation.
In the present study, several other gene pairs can be related to strong survival significance and produce an interaction effect influencing the likelihood of survival of breast cancer patients (Results, Table 2). These gene pairs, discovered by the present approach, might be grouped into survival significant interactome sub-networks and be an important source for the discovery of new anti-cancer drugs.
HN1-CACNA1D pair. Hematological and neurological expressed 1 protein is a protein that in humans is encoded by the HN1 gene. CACNA1D is a calcium channel, voltage-dependent, L type, alpha 1D subunit. The roles of these genes in breast cancer have been not studied yet.
Mis-incorporation of oxidized nucleoside triphosphates into DNA/RNA during replication and transcription can cause mutations that may result in carcinogenesis or neurodegeneration. The protein encoded by NUDT1 (synonym MTH1) is an enzyme that hydrolyzes oxidized purine nucleoside triphosphates, such as 8-oxo-dGTP, 8-oxo-dATP, 2-hydroxy-dATP, and 2-hydroxy rATP, to monophosphates, thereby preventing mis-incorporation. The encoded protein is localized mainly in the cytoplasm, with some in the mitochondria, suggesting that it is involved in the sanitization of nucleotide pools both for nuclear and mitochondrial genomes. Several alternatively spliced transcript variants, some of which encode distinct isoforms, have been identified. Additional variants have been observed, but their full-length natures have not been determined. A single-nucleotide polymorphism that results in the production of an additional, longer isoform (p26) has been described. NUDT1 plays an important role in protecting cells against H(2)O(2)-induced apoptosis via a Noxa- and caspase-3/7-mediated signaling pathway. Elevated levels of NUDT1 protein is associated non-small cell lung carcinomas. This gene provides synergistic survival effect together with gene neuromedin U (NMU). NMU may be involved in the HGF-c-Met paracrine loop regulating cell migration, invasiveness and dissemination of pancreatic ductal adenocarcinoma. NMU and its cancer-specific receptors, as well as its target genes, are frequently overexpressed in clinical samples of lung cancer and in cell lines, and that those gene products play indispensable roles in the growth and progression of lung cancer cells. NmU expression is related to Myb and that the NmU/NMU1R axis constitutes a previously unknown growth-promoting autocrine loop in myeloid leukemia cells. NMU plays a role in feeding behavior and catabolic functions via corticotropin-releasing hormone. Amino acid variants in NMU associate with overweight and obesity, suggesting that NMU is involved in energy regulation in humans. Overexpression of neuromedin U is associated with bladder tumor formation, lung metastasis and cancer cachexia. Our results suggest important role of NUDT1 and NMU genes in breast cancer progression and the breast patient's survival. Thus, this pair could be considered as novel prognostic and predictive biomarkers for breast cancer.
Thus, our data-driven approach allows the discovery of novel mechanistically important individual genes and small gene signatures that predict cancer patient survival. In doing so, the method can lead to the identification of new potential targets for anti-cancer drugs and facilitate the development of alternative approaches for cancer treatment.
The methods can be extended up to any number of genes. We can check only the “all possible pairs” of the statistically significant individual genes or generally all possible pairs of genes in our study. The number of genes to be used is purely a matter of the researcher's choice.
Table 4 indicates the SEQ ID NO. and the corresponding gene and GenBank reference in the sequence listing.
Where more than one transcript variant exists, the invention may be practiced with any one of the transcript variants, any of several of the transcript variants or all of the transcript variants.
This patent application is a divisional application of U.S. patent application Ser. No. 13/255,898, filed Sep. 9, 2011, entitled IDENTIFICATION OF BIOLOGICALLY AND CLINICALLY ESSENTIAL GENES AND GENE PAIRS, AND METHODS EMPLOYING THE IDENTIFIED GENES AND GENE PAIRS, which is a U.S. National Phase application under 35 U.S.C. §371 of International Application No. PCT/SG2010/000080, filed on Mar. 10, 2010, entitled IDENTIFICATION OF BIOLOGICALLY AND CLINICALLY ESSENTIAL GENES AND GENE PAIRS, AND METHODS EMPLOYING THE IDENTIFIED GENES AND GENE PAIRS, which claims priority to U.S. provisional patent application No. 61/158,948, filed Mar. 10, 2009, and also is related to Singapore patent application number 200901682-5, which has the same filing date as U.S. 61/158,948.
Number | Date | Country | |
---|---|---|---|
61158948 | Mar 2009 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 13255898 | Sep 2011 | US |
Child | 14301296 | US |