The application relates generally to methods for biomarker identification and to biomarkers for non-small cell lung cancer.
Non-small cell lung cancer (NSCLC) is the predominant histological type of lung cancer, accounting for up to 85% of cases (1). Tumor stage is the best established and validated predictor of patient survival (2). When identified at an early stage, NSCLC is primarily treated by surgical resection, which is potentially curative. However, 30-60% of patients with stage IB to IIIA NSCLC die within five years after surgery, primarily from tumor recurrence (3). These relapses have been postulated to arise from a reservoir of cells beyond the resection site, such as microscopic residual tumors at the resection margin, occult systemic metastases, or circulating tumor cells. Such a reservoir could potentially be eliminated with an adjuvant systemic therapy, such as systemic chemotherapy. Indeed, this type of adjuvant therapy is routinely applied in the treatment of other solid tumors, including breast (4) and colorectal cancer (5, 6).
Randomized clinical trials have confirmed the benefit of adjuvant chemotherapy in stage II to IIIA NSCLC patients, but the benefit in stage I remains controversial (7-10). However, even in stage I the overall survival is only 70%, which suggests that there is a sub-population of stage I patients who have more aggressive tumors. In theory these patients might benefit from post-operative adjuvant chemotherapy. In contrast, there may be sub-populations of stage II or IIIA patients who have such good prognosis that they may neither need nor derive benefit from adjuvant therapy.
Several groups have attempted to identify these sub-populations by studying the mRNA expression profiles of surgically excised tumor samples using high-density microarray platforms (11-17). Several groups, including our own, have reported smaller prognostic signatures assayed by quantitative reverse-transcriptase PCR (RT-PCR) (18). However, the specific signatures identified by these groups show minimal overlap (19), and it is unclear why this is so. Ein-Dor and coworkers demonstrated that, because of biological heterogeneity, thousands of samples are required to identify robust and reproducible subsets for most tumour types (20). These conclusions are supported by the finding that thousands of genes display intra-tumor heterogeneity, likely caused by the diversity of tumour microenvironments and cell populations (21, 22). We hypothesized that different statistical methods handle the disease heterogeneity in different ways, and thus play a major role in the lack of overlap amongst reported NSCLC prognostic signatures.
In accordance with one aspect, there is provided a method for identifying a biomarker associated with a biological parameter comprising:
In accordance with a further aspect, there is provided a computer readable memory having recorded thereon statements and instructions for execution by a computer to carry out the method described herein.
In accordance with a further aspect, there is provided a computer program product, comprising a memory having a computer readable code embodied therein, for execution by a CPU, said code comprising code means for each of the steps of the method described herein.
In accordance with a further aspect, there is provided a method for identifying a gene signature associated with a biological parameter comprising:
In accordance with a further aspect, there is provided a method of prognosing or classifying a subject with non-small cell lung cancer (NSCLC) comprising:
In accordance with a further aspect, there is provided a method of predicting prognosis in a subject with non-small cell lung cancer (NSCLC) comprising the steps:
In accordance with a further aspect, there is provided a method of selecting a therapy for a subject with NSCLC, comprising the steps:
In accordance with a further aspect, there is provided a method of selecting a therapy for a subject with NSCLC, comprising the steps:
In accordance with a further aspect, there is provided a composition comprising a plurality of isolated nucleic acid sequences, wherein each isolated nucleic acid sequence hybridizes to:
In accordance with a further aspect, there is provided an array comprising, for each of at least three of sixteen genes: CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1, one or more polynucleotide probes complementary and hybridizable to an expression product of the gene.
In accordance with a further aspect, there is provided a computer program product for use in conjunction with a computer having a processor and a memory connected to the processor, the computer program product comprising a computer readable storage medium having a computer mechanism encoded thereon, wherein the computer program mechanism may be loaded into the memory of the computer and cause the computer to carry out the method described herein.
In accordance with a further aspect, there is provided a computer implemented product for predicting a prognosis or classifying a subject with NSCLC comprising:
In accordance with a further aspect, there is provided a computer implemented product for determining therapy for a subject with NSCLC comprising:
In accordance with a further aspect, there is provided a computer readable medium having stored thereon a data structure for storing the computer implemented product described herein.
In accordance with a further aspect, there is provided a computer system comprising
In accordance with a further aspect, there is provided a kit to prognose or classify a subject with early stage NSCLC, comprising detection agents that can detect the expression products of at least three biomarkers selected from CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1, and instructions for use.
In accordance with a further aspect, there is provided a kit to select a therapy for a subject with NSCLC, comprising detection agents that can detect the expression products of at least three biomarkers selected from CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1, and instructions for use.
These and other features of the preferred embodiments of the invention will become more apparent in the following detailed description in which reference is made to the appended drawings wherein:
Table 1 shows univariate properties of the six-gene signature. Stable (Entrez Gene ID) identifiers and the independent univariate prognostic ability (based on the log-rank test and Cox proportional hazards modeling) are given for each component of the six-gene mSD signature.
Table 2 shows a summary of all patient data. The survival, follow-up status, clinical stage, and normalized expression levels for the six-gene signature of all patients considered in any analysis in this study. Patients are identified by the study of origin: UHN, Lau et al.; MI02, Beer et al.; MIT, Bhattacharjee et al.; Duke, Potti et al.; MI06, Raponi et al.; AD1, Larsen et al.; SQ2, Larsen et al.; LuMayo and LuWashU, Lu et al. mSD prediction status is also given for the training (UHN) dataset.
Table 3 shows a summary of mSD validation. For each validation dataset considered in this experiment, the number of patients, hazard ratio and 95% confidence interval, and p-value are given. The hazard ratio and p-value are derived from stage-adjusted Cox proportional hazard models, with p-values determined using the Wald test.
Table 4 shows a summary of permutation analyses for the training (UHN) and four validation (Duke, MI02, MI06, MIT) datasets. This table gives the total number of permutations considered, the number of missing values, the number and percentage of permutations statistically significant at p<0.05 (corresponding to chi-squared>3.84), the chi-squared value obtained from the mSD signature, and the number and percentage of permutations showing superior performance to the mSD signature. Missing values occur when clustering or classifying results in groups with such unequal sizes that log-rank analysis could not be performed. This occurred in approximately 0.01% of cases, and as such makes a negligible contribution to the overall classifier evaluation. Datasets are identified by the first author of the publication first reporting them.
Table 5 shows enrichment scores. Specifically, for each of the 113 genes in the permutation dataset the total number of signatures was counted containing that gene and the fraction of those signatures that are statistically significant at p<0.05 (chi-squared>3.84). Genes were then ranked by this enrichment score. The Gene ID gives the integer used to identify this gene in the raw permutation data. The official gene symbol uniquely identifies each gene in the dataset. The p-value for each gene is in the right-most column.
The application generally relates to identifying gene signatures and provides methods and computer implemented products therefor.
The application also relates to 16 biomarkers that form a 16-gene signature, and provides methods, compositions, computer implemented products, detection agents and kits for prognosing or classifying a subject with non-small cell lung cancer (NSCLC) and for determining the benefit of adjuvant chemotherapy.
It must be noted that as used herein and in the appended claims, the singular forms “a”, “an” and “the” include the plural referents unless the context clearly dictates otherwise.
As used herein, “biological parameter” may refer to any measurable or quantifiable characteristic in a biological system and includes, without limitation, physical characteristics and attributes, genotype, phenotype, biomarkers, gene expression, splice-variants of an mRNA, polymorphisms of DNA or protein, levels of protein, cells, nucleic acids, amino acids or other biological matter.
The term “biomarker” as used herein refers to a gene that is differentially expressed in individuals. For example, specifically with respect to non-small cell lung cancer (NSCLC), the biomarkers may be differentially expressed in individuals according to prognosis and thus may be predictive of different survival outcomes and of the benefit of adjuvant chemotherapy. In one embodiment, the 16 biomarkers that form the NSCLC gene signature of the present application are listed as the first 16 genes in Table 5.
The term “level of expression” or “expression level” as used herein refers to a measurable level of expression of the products of biomarkers, such as, without limitation, the level of messenger RNA transcript expressed or of a specific exon or other portion of a transcript, the level of proteins or portions thereof expressed of the biomarkers, the number or presence of DNA polymorphisms of the biomarkers, the enzymatic or other activities of the biomarkers, and the level of specific metabolites.
The term “dataset” as used herein refers to the measurement or detection of one or more biological parameters for a series of subjects or individuals. Typically, a dataset will be generated at a single location or will involve measurements of biological parameters performed in a consistent manner. For example, the set of expression levels of different mRNAs and survival times for one or more individuals with non-small cell lung cancer would comprise a “dataset”.
The term “partitioning method” as used herein refers to a method that divides a dataset into two or more groups along any dimension of the dataset using either features inherent to the dataset or external meta-information. The number of groups can be as large as the dimension of the dataset or can be a continuous variable. For example k-means clustering, median-dichotomization, novelty-detection, and hierarchical clustering are all partitioning methods and others would be known to a person skilled in the art.
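For illustration only, median dichotomization, one of the partitioning methods named above, may be sketched as follows. The function name and the sample values are hypothetical, and the tie rule (values equal to the median are placed in the low group) is an assumption rather than a feature of the application.

```python
from statistics import median

def median_dichotomize(levels):
    """Split subjects into 'low' and 'high' groups around the median level.

    `levels` maps a subject identifier to a single expression value.
    Values equal to the median go to the low group (an assumption).
    """
    cut = median(levels.values())
    low = sorted(s for s, v in levels.items() if v <= cut)
    high = sorted(s for s, v in levels.items() if v > cut)
    return low, high

# Hypothetical normalized expression levels for four subjects.
example = {"p1": 0.2, "p2": 1.4, "p3": 0.9, "p4": 2.1}
low_group, high_group = median_dichotomize(example)
```

The two groups returned here play the role of the populations whose separation is subsequently assessed against the biological parameter.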
The term “strength” as used herein refers to the predictive power that a biomarker has for a specific biological parameter. Predictive power can be assessed by methods known to a person skilled in the art and include, without limitation, using measures of magnitude, such as differences in survival rates or hazard ratios, or using prediction accuracies or measures of statistical significance such as p-values. Methods also exist to consider both magnitude and statistical significance, such as the F-statistic.
The term “set threshold” as used herein refers to a threshold value of the strength of the relationship between a biomarker and a biological parameter that is used to identify biomarkers that have a meaningful association with a biological parameter. The specific value of the set threshold is dependent on the parameter used to measure the strength of the association. For example, if hazard ratios are used to measure the magnitude of a predictive association, then a set threshold might be a hazard ratio greater than two. If p-values are used to measure the reproducibility of a biomarker, then a set threshold might be a p-value less than 0.05. If prediction accuracies are used to measure the reproducibility of an association, then a set threshold might be a prediction accuracy greater than 70%.
The term “enrichment threshold” as used herein refers to a threshold value of the number of sets in which a gene is found where that set has a strong association with a biological parameter as determined by the set threshold. For example, an enrichment threshold might be a fraction of sets containing a specific gene, such as 20%. Thus in this example if at least 20% of sets containing a specific gene have a strong association with the biological parameter then this gene will be above the enrichment threshold. An enrichment threshold might also be a p-value derived from a chi-squared test, a hypergeometric distribution, a proportion test, or a permutation-based estimate of the null distribution, amongst others.
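For illustration only, the fraction-based enrichment threshold of the example above may be sketched as follows. The function names, the gene labels A to D, and the pass/fail flags are hypothetical; the flags stand in for whichever set threshold has been applied.

```python
from collections import Counter

def enrichment_scores(gene_sets, is_strong):
    """For each gene, return the fraction of sets containing it that are strong.

    `gene_sets` is a list of gene collections; `is_strong` is a parallel list
    of booleans stating whether each set passed the set threshold.
    """
    total = Counter()
    strong = Counter()
    for genes, flag in zip(gene_sets, is_strong):
        for g in set(genes):
            total[g] += 1
            if flag:
                strong[g] += 1
    return {g: strong[g] / total[g] for g in total}

def enriched_genes(scores, enrichment_threshold=0.20):
    """Genes whose score is at or above the enrichment threshold."""
    return sorted(g for g, s in scores.items() if s >= enrichment_threshold)

# Hypothetical data: four sets of two genes, two of which passed the set threshold.
sets = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")]
flags = [True, True, False, False]
scores = enrichment_scores(sets, flags)
selected = enriched_genes(scores, enrichment_threshold=0.20)
```

In this sketch gene D appears only in sets that failed the set threshold, so it falls below the 20% enrichment threshold and is not selected.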
The term “molecular characteristics” as used herein refers to measurements of properties of the molecular composition of a biological specimen including, but not limited to, measurements of the levels or structural variations of specific mRNA transcripts or portions thereof, measurements of the levels of specific non-coding RNA species or portions thereof, measurements of the levels or structural variations of specific proteins including post-translational modifications thereof, measurements of the activity of specific proteins or complexes containing proteins, measurements of the number or type of genetic or epigenetic polymorphisms, and measurements of the levels of specific organic or inorganic metabolites within a cell.
According to an aspect, there is provided a method for identifying a biomarker associated with a biological parameter comprising:
Preferably, there are at least two validation datasets and the method further comprises, between steps (h) and (i), the step of pooling the ranks determined in step (h) for each validation dataset.
In one embodiment, the ranks are expressed as percentiles and the pooling comprises taking the product of the percentiles.
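For illustration only, pooling by the product of percentiles may be sketched as follows. The function name and the identifiers s1 to s3 are hypothetical, and restricting the pooling to items ranked in every validation dataset is an assumption.

```python
from math import prod

def pooled_rank(percentiles_by_dataset):
    """Pool per-dataset percentile ranks by taking their product.

    `percentiles_by_dataset` is a list of dicts, one per validation dataset,
    mapping an item identifier to its percentile rank expressed in (0, 1].
    Only items ranked in every dataset are pooled (an assumption).
    """
    common = set.intersection(*(set(d) for d in percentiles_by_dataset))
    return {s: prod(d[s] for d in percentiles_by_dataset) for s in common}

# Hypothetical percentile ranks from two validation datasets.
dataset_1 = {"s1": 0.5, "s2": 0.25}
dataset_2 = {"s1": 0.25, "s2": 1.0, "s3": 0.5}
pooled = pooled_rank([dataset_1, dataset_2])
```

An item that ranks well in every dataset retains a small product, whereas a poor rank in any one dataset enlarges it.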
Pooling may also be performed using other methods known by a person skilled in the art. For example, without limitation, pooling may be performed using a standard dataset and machine-learning methods such as support vector machines or random forests, or pooling may be performed by taking the product of the p-values of a statistical test of the strength of the association of a biomarker with a biological parameter, or pooling may be performed by taking the sum or product (weighted or unweighted) of the magnitudes of the strength of the association of a biomarker with a biological parameter. For example, the sum of hazard ratios or of coefficients from a Cox proportional hazard model across multiple validation datasets could be used to pool validation datasets.
In some embodiments, there are at least two validation datasets and the method further comprises, after step (i), the step of determining those genes identified in (j) that were enriched above the predetermined enrichment threshold in a plurality of validation datasets.
In some embodiments, the partitioning method comprises k-means clustering. However, other partitioning methods would be known to a person skilled in the art, for example, without limitation, agglomerative hierarchical clustering, divisive hierarchical clustering, novelty-detection, median dichotomization, asymmetric thresholding and self-organizing maps. Preferably, this embodiment additionally comprises performing a log-rank analysis to estimate the separation between the at least two populations. However, a person skilled in the art would understand that other methods could be used, for example, without limitation, Cox proportional hazards modeling with or without adjustment for clinical parameters, Wilcoxon Rank-Sum analysis, t-test analysis, general linear modeling, and non-linear mixed modeling.
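For illustration only, k-means clustering into two populations may be sketched with Lloyd's algorithm as follows. Seeding the centres with the first k points, the iteration limit, and the sample data are assumptions made to keep the sketch deterministic; they are not taken from the application.

```python
from math import dist

def kmeans(points, k=2, iterations=100):
    """Plain k-means (Lloyd's algorithm); the first k points seed the centres."""
    centres = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assign every point to its nearest centre (Euclidean distance).
        new_labels = [min(range(k), key=lambda c: dist(p, centres[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Move each centre to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres

# Hypothetical two-gene expression profiles for five subjects.
profiles = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [0.1, 0.2]]
labels, centres = kmeans(profiles, k=2)
```

The resulting labels define the populations whose separation may then be estimated, for example by log-rank analysis.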
In some embodiments, the classifying in step (f) comprises calculation of Euclidean distance to determine the distance to the expression levels of s from the subjects in the training dataset. It is readily apparent to one skilled in the art that many alternative methods exist to determine the distance to the expression levels of s from the subjects in the training set, including but not limited to Pearson's correlation, k-nearest neighbours, classification in a hyperspace such as by support-vector machines, Manhattan distance, and mutual information.
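For illustration only, a classification based on Euclidean distance may be sketched as follows. Assigning a subject to the group with the nearest centroid is an assumption made for the sketch, as are the function names, group labels and values.

```python
from math import dist

def centroid(profiles):
    """Mean expression vector of a group of training profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def classify(sample, groups):
    """Assign `sample` to the group whose centroid is nearest (Euclidean)."""
    centroids = {name: centroid(p) for name, p in groups.items()}
    return min(centroids, key=lambda name: dist(sample, centroids[name]))

# Hypothetical training groups described by two genes each.
training = {
    "good": [[1.0, 0.0], [1.2, 0.2]],
    "poor": [[3.0, 2.0], [3.2, 2.2]],
}
label = classify([2.9, 1.8], training)
```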
In some embodiments, the relationship between the biological parameter and each (P) is determined using log-rank analysis. It is readily apparent to one skilled in the art that many alternative methods exist to determine this relationship, including but not limited to Cox proportional hazards modeling with or without adjustment for other clinical covariates, Wilcoxon rank-sum analysis, general linear modeling, and linear or non-linear mixed modeling.
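For illustration only, a two-group log-rank statistic may be computed as sketched below. The function name and the survival data are hypothetical; the statistic is the standard chi-squared form with one degree of freedom, for which a value above 3.84 corresponds to p<0.05.

```python
def logrank_chi2(times_a, events_a, times_b, events_b):
    """Two-group log-rank chi-squared statistic (1 degree of freedom).

    times_*  : follow-up times for each subject.
    events_* : 1 if the event (e.g. death) was observed, 0 if censored.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group A
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)  # at risk, group B
        n = n_a + n_b
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d = sum(1 for u, e, _ in data if u == t and e)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # the hypergeometric variance is undefined for one subject
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic undefined: zero variance")
    return (observed_a - expected_a) ** 2 / variance

# Hypothetical data: group A dies early, group B dies late, no censoring.
chi2 = logrank_chi2([1, 2, 3], [1, 1, 1], [4, 5, 6], [1, 1, 1])
significant = chi2 > 3.84
```

The ValueError branch reflects the situation in which groups are so unequal that the log-rank analysis cannot be performed.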
In some embodiments, the set size n is between 2 and 20, preferably between 4 and 18, 4 and 14, 4 and 10, and 6 and 8 in increasing preferability.
In some embodiments, the number of genes (m) is between 3 and 10,000, preferably between 20 and 200.
In some embodiments, the plurality (S) of sets of genes is the smaller of 1,000,000 and 0.1% of all possible sets of m genes having n set size.
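For illustration only, the number of sets and a random draw of that many sets may be sketched as follows. The function names, the seeded generator, and the requirement that the drawn sets be distinct are assumptions; the values of m and n in the usage lines are chosen only as examples.

```python
import random
from math import comb

def number_of_sets(m, n, cap=1_000_000):
    """Smaller of `cap` and 0.1% of all C(m, n) possible sets of size n."""
    return min(cap, comb(m, n) // 1000)

def sample_sets(genes, n, count, seed=0):
    """Draw `count` distinct random sets of `n` genes (seeded for repeatability)."""
    if count > comb(len(genes), n):
        raise ValueError("more sets requested than exist")
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < count:
        chosen.add(frozenset(rng.sample(genes, n)))
    return chosen

# With m = 113 genes and n = 6, 0.1% of all sets exceeds the cap of 1,000,000.
s_large = number_of_sets(113, 6)
# With m = 20 genes and n = 4 there are 4,845 possible sets, so 0.1% is 4.
s_small = number_of_sets(20, 4)
drawn = sample_sets(list("ABCDEF"), 3, 5)
```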
In some embodiments, the validation dataset at least partially overlaps with the training dataset.
In accordance with a further aspect, there is provided a computer readable memory having recorded thereon statements and instructions for execution by a computer to carry out the method described herein.
In accordance with a further aspect, there is provided a computer program product, comprising a memory having a computer readable code embodied therein, for execution by a CPU, said code comprising code means for each of the steps of the method described herein.
In accordance with a further aspect, there is provided a method for identifying a gene signature associated with a biological parameter comprising:
(j) identify genes in the high strength sets that are enriched above a predetermined enrichment threshold.
In accordance with a further aspect, there is provided a method of prognosing or classifying a subject with non-small cell lung cancer (NSCLC) comprising:
In accordance with a further aspect, there is provided a method of predicting prognosis in a subject with non-small cell lung cancer (NSCLC) comprising the steps:
Preferably, the biomarker reference expression profile comprises a poor survival group or a good survival group.
The term “reference expression profile” as used herein refers to the expression level of at least 3 of the 16 biomarkers selected from CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1 associated with a clinical outcome in a NSCLC patient. The reference expression profile comprises 16 values, each value representing the level of a biomarker, wherein each biomarker corresponds to one gene selected from CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1. The reference expression profile is identified using one or more samples comprising tumor or adjacent or otherwise tumour-related stromal/blood based tissue or cells, wherein the expression is similar between related samples defining an outcome class or group such as poor survival or good survival and is different to unrelated samples defining a different outcome class such that the reference expression profile is associated with a particular clinical outcome. The reference expression profile is accordingly a reference profile or reference signature of the expression of at least 3 of the 16 biomarkers selected from CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1, to which the subject expression levels of the corresponding genes in a patient sample are compared in methods for determining or predicting clinical outcome.
As used herein, the term “control” refers to a specific value or dataset that can be used to prognose or classify the value, e.g. expression level or reference expression profile, obtained from the test sample associated with an outcome class. In one embodiment, a dataset may be obtained from samples from a group of subjects known to have NSCLC and good survival outcome or known to have NSCLC and have poor survival outcome or known to have NSCLC and have benefited from adjuvant chemotherapy or known to have NSCLC and not have benefited from adjuvant chemotherapy. The expression data of the biomarkers in the dataset can be used to create a control value that is used in testing samples from new patients. In such an embodiment, the “control” is a predetermined value for the set of at least 3 of the 16 biomarkers obtained from NSCLC patients whose biomarker expression values and survival times are known. Alternatively, the “control” is a predetermined reference profile for the set of at least three of the sixteen biomarkers described herein obtained from patients whose survival times are known.
Accordingly, in one embodiment, the control is a sample from a subject known to have NSCLC and good survival outcome. In another embodiment, the control is a sample from a subject known to have NSCLC and poor survival outcome.
A person skilled in the art will appreciate that the comparison between the expression of the biomarkers in the test sample and the expression of the biomarkers in the control will depend on the control used. For example, if the control is from a subject known to have NSCLC and poor survival, and there is a difference in expression of the biomarkers between the control and test sample, then the subject can be prognosed or classified in a good survival group. If the control is from a subject known to have NSCLC and good survival, and there is a difference in expression of the biomarkers between the control and test sample, then the subject can be prognosed or classified in a poor survival group. For example, if the control is from a subject known to have NSCLC and good survival, and there is a similarity in expression of the biomarkers between the control and test sample, then the subject can be prognosed or classified in a good survival group. For example, if the control is from a subject known to have NSCLC and poor survival, and there is a similarity in expression of the biomarkers between the control and test sample, then the subject can be prognosed or classified in a poor survival group.
A person skilled in the art will appreciate that the comparison between the expression of the biomarkers in the test sample and the expression of the biomarkers in the control can be made in different ways. For example, without limitation, Euclidean distances, Pearson's correlation, and k-nearest neighbours can be used to determine the similarity of the expression of the biomarkers in the test sample to the expression of the biomarkers in the control sample.
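For illustration only, a comparison based on Pearson's correlation may be sketched as follows. The function names, the control labels and the expression values are hypothetical, and choosing the control with the highest correlation is an assumption about how the similarity would be used.

```python
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def most_similar_control(test_profile, controls):
    """Name of the control profile most correlated with the test profile."""
    return max(controls, key=lambda name: pearson(test_profile, controls[name]))

# Hypothetical four-biomarker profiles for two controls and one test sample.
controls = {
    "good survival": [1.0, 2.0, 3.0, 4.0],
    "poor survival": [4.0, 3.0, 2.0, 1.0],
}
assigned = most_similar_control([1.1, 1.9, 3.2, 3.8], controls)
```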
The term “differentially expressed” or “differential expression” as used herein refers to a difference in the level of expression of the biomarkers that can be assayed by measuring the level of expression of the products of the biomarkers, such as the difference in level of messenger RNA transcript or a portion thereof expressed or of proteins expressed of the biomarkers. In a preferred embodiment, the difference is statistically significant. The term “difference in the level of expression” refers to an increase or decrease in the measurable expression level of a given biomarker, for example as measured by the amount of messenger RNA transcript and/or the amount of protein in a sample as compared with the measurable expression level of a given biomarker in a control. In one embodiment, the differential expression can be compared using the ratio of the level of expression of a given biomarker or biomarkers as compared with the expression level of the given biomarker or biomarkers of a control, wherein the ratio is not equal to 1.0. For example, an RNA or protein is differentially expressed if the ratio of the level of expression in a first sample as compared with a second sample is greater than or less than 1.0. For example, the ratio may be greater than 1, 1.2, 1.5, 1.7, 2, 3, 5, 10, 15, 20 or more, or less than 1, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 0.001 or less. In another embodiment the differential expression is measured using a p-value. For instance, when using a p-value, a biomarker is identified as being differentially expressed as between a first sample and a second sample when the p-value is less than 0.1, preferably less than 0.05, more preferably less than 0.01, even more preferably less than 0.005, and most preferably less than 0.001.
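For illustration only, the ratio-based test for differential expression may be sketched as follows. The function name is hypothetical, and the cut-offs of 1.5 and 0.6 are simply two of the example ratios listed above, chosen arbitrarily for the sketch.

```python
def is_differentially_expressed(test_level, control_level, up=1.5, down=0.6):
    """True if the test/control expression ratio lies outside [down, up].

    `up` and `down` are example cut-offs; any of the listed ratios could be used.
    """
    if control_level <= 0:
        raise ValueError("control level must be positive to form a ratio")
    ratio = test_level / control_level
    return ratio >= up or ratio <= down

# Hypothetical expression levels measured on a linear scale.
increased = is_differentially_expressed(3.0, 1.0)   # ratio 3.0
unchanged = is_differentially_expressed(1.0, 1.0)   # ratio 1.0
decreased = is_differentially_expressed(0.5, 1.0)   # ratio 0.5
```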
The term “similarity in expression” as used herein means that there is no or little difference in the level of expression of the biomarkers between the test sample and the control or reference profile. For example, similarity can refer to a fold difference compared to a control. In a preferred embodiment, there is no statistically significant difference in the level of expression of the biomarkers.
The term “most similar” in the context of a reference profile refers to a reference profile that is associated with a clinical outcome that shows the greatest number of identities and/or degree of changes with the subject profile.
The term “prognosis” as used herein refers to a clinical outcome group such as a poor survival group or a good survival group associated with a disease subtype which is reflected by a reference profile such as a biomarker reference expression profile or reflected by an expression level of the sixteen biomarkers disclosed herein. The prognosis provides an indication of disease progression and includes an indication of likelihood of death due to lung cancer. In one embodiment the clinical outcome class includes a good survival group and a poor survival group.
The term “prognosing or classifying” as used herein means predicting or identifying the clinical outcome group that a subject belongs to according to the subject's similarity to a reference profile or biomarker expression level associated with the prognosis. For example, prognosing or classifying comprises a method or process of determining whether an individual with NSCLC has a good or poor survival outcome, or grouping an individual with NSCLC into a good survival group or a poor survival group, or predicting whether or not an individual with NSCLC will respond to therapy.
The term “good survival” as used herein refers to an increased chance of survival as compared to patients in the “poor survival” group. For example, the biomarkers of the application can prognose or classify patients into a “good survival group”. These patients are at a lower risk of death after surgery.
The term “poor survival” as used herein refers to an increased risk of death as compared to patients in the “good survival” group. For example, biomarkers or genes of the application can prognose or classify patients into a “poor survival group”. These patients are at greater risk of death or adverse reaction from disease or surgery, treatment for the disease or other causes.
Accordingly, in one embodiment, the biomarker reference expression profile comprises a poor survival group. In another embodiment, the biomarker reference expression profile comprises a good survival group.
The term “subject” as used herein refers to any member of the animal kingdom, preferably a human being and most preferably a human being that has NSCLC or that is suspected of having NSCLC.
In various embodiments, the at least three biomarkers are four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen or sixteen biomarkers, respectively.
In some embodiments the NSCLC is stage I or stage II.
NSCLC patients are classified into stages, which are used to determine therapy. Staging classification testing may include any or all of history, physical examination, routine laboratory evaluations, chest x-rays, and chest computed tomography scans or positron emission tomography scans with infusion of contrast materials. For example, stage I includes cancer that is in the lung but has not spread to adjacent lymph nodes or outside the chest. Stage I is divided into two categories based primarily on the size of the tumor (IA and IB). Stage II includes cancer located in the lung and proximal lymph nodes. Stage II is divided into two categories based on the size of tumor and nodal status (IIA and IIB). Stage III includes cancer located in the lung and the lymph nodes. Stage III is divided into two categories based on the size of tumor and nodal status (IIIA and IIIB). Stage IV includes cancer that has metastasized to distant locations. The term “early stage NSCLC” includes patients with Stage I to IIIA NSCLC. These patients are treated primarily by complete surgical resection.
The term “test sample” as used herein refers to any fluid, cell or tissue sample from a subject which can be assayed for biomarker expression products and/or a reference expression profile, e.g. genes differentially expressed in subjects with NSCLC according to survival outcome.
The phrase “determining the expression of biomarkers” as used herein refers to determining or quantifying RNA or proteins or protein activities or protein-related metabolites expressed by the biomarkers. The term “RNA” includes mRNA transcripts, and/or specific spliced or other alternative variants of mRNA, including anti-sense products. The term “RNA product of the biomarker” as used herein refers to RNA transcripts transcribed from the biomarkers and/or specific spliced or alternative variants. In the case of “protein”, it refers to proteins translated from the RNA transcripts transcribed from the biomarkers. The term “protein product of the biomarker” refers to proteins translated from RNA products of the biomarkers.
A person skilled in the art will appreciate that a number of methods can be used to detect or quantify the level of RNA products of the biomarkers within a sample, including arrays, such as microarrays, RT-PCR (including quantitative RT-PCR), nuclease protection assays and Northern blot analyses.
Accordingly, in one embodiment, the biomarker expression levels are determined using arrays, optionally microarrays, RT-PCR, optionally quantitative RT-PCR, nuclease protection assays or Northern blot analyses.
In another embodiment, the biomarker expression levels are determined by using an array. In one embodiment, the array is a HG-U133A chip from Affymetrix. In another embodiment, a plurality of nucleic acid probes that are complementary or hybridizable to an expression product of at least 3 of the 16 biomarkers selected from CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1 are used on the array.
The term “nucleic acid” includes DNA and RNA and can be either double stranded or single stranded.
The term “hybridize” or “hybridizable” refers to the sequence specific non-covalent binding interaction with a complementary nucleic acid. In a preferred embodiment, the hybridization is under high stringency conditions. Appropriate stringency conditions which promote hybridization are known to those skilled in the art, or can be found in Current Protocols in Molecular Biology, John Wiley & Sons, N.Y. (1989), 6.3.1-6.3.6. For example, 6.0× sodium chloride/sodium citrate (SSC) at about 45° C., followed by a wash of 2.0×SSC at 50° C. may be employed.
The term “probe” as used herein refers to a nucleic acid sequence that will hybridize to a nucleic acid target sequence. In one example, the probe hybridizes to an RNA product of the biomarker or a nucleic acid sequence complementary thereof. The length of probe depends on the hybridization conditions and the sequences of the probe and nucleic acid target sequence. In one embodiment, the probe is at least 8, 10, 15, 20, 25, 50, 75, 100, 150, 200, 250, 400, 500 or more nucleotides in length.
In another embodiment, the biomarker expression levels are determined by using quantitative RT-PCR. In another embodiment, the primers used for quantitative RT-PCR comprise a forward and reverse primer for each of CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1.
The term “primer” as used herein refers to a nucleic acid sequence, whether occurring naturally as in a purified restriction digest or produced synthetically, which is capable of acting as a point of synthesis when placed under conditions in which synthesis of a primer extension product that is complementary to a nucleic acid strand is induced (e.g. in the presence of nucleotides and an inducing agent such as DNA polymerase and at a suitable temperature and pH). The primer must be sufficiently long to prime the synthesis of the desired extension product in the presence of the inducing agent. The exact length of the primer will depend upon factors including temperature, sequences of the primer and the methods used. A primer typically contains 15-25 or more nucleotides, although it can contain fewer or more. The factors involved in determining the appropriate length of primer are readily known to one of ordinary skill in the art.
In addition, a person skilled in the art will appreciate that a number of methods can be used to determine the amount of a protein product of the biomarker of the invention, including immunoassays such as Western blots, ELISA, and immunoprecipitation followed by SDS-PAGE and immunocytochemistry.
Accordingly, in another embodiment, an antibody is used to detect the polypeptide products of at least 3 of the 16 biomarkers selected from CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1. In another embodiment, the sample comprises a tissue sample. In a further embodiment, the tissue sample is suitable for immunohistochemistry.
The term “antibody” as used herein is intended to include monoclonal antibodies, polyclonal antibodies, and chimeric antibodies. The antibody may be from recombinant sources and/or produced in transgenic animals. The term “antibody fragment” as used herein is intended to include Fab, Fab′, F(ab′)2, scFv, dsFv, ds-scFv, dimers, minibodies, diabodies, and multimers thereof and bispecific antibody fragments. Antibodies can be fragmented using conventional techniques. For example, F(ab′)2 fragments can be generated by treating the antibody with pepsin. The resulting F(ab′)2 fragment can be treated to reduce disulfide bridges to produce Fab′ fragments. Papain digestion can lead to the formation of Fab fragments. Fab, Fab′ and F(ab′)2, scFv, dsFv, ds-scFv, dimers, minibodies, diabodies, bispecific antibody fragments and other fragments can also be synthesized by recombinant techniques.
Conventional techniques of molecular biology, microbiology and recombinant DNA techniques are within the skill of the art. Such techniques are explained fully in the literature. See, e.g., Sambrook, Fritsch & Maniatis, 1989, Molecular Cloning: A Laboratory Manual, Second Edition; Oligonucleotide Synthesis (M. J. Gait, ed., 1984); Nucleic Acid Hybridization (B. D. Hames & S. J. Higgins, eds., 1984); A Practical Guide to Molecular Cloning (B. Perbal, 1984); and a series, Methods in Enzymology (Academic Press, Inc.); Short Protocols In Molecular Biology, (Ausubel et al., ed., 1995).
For example, antibodies having specificity for a specific protein, such as the protein product of a biomarker, may be prepared by conventional methods. A mammal, (e.g. a mouse, hamster, or rabbit) can be immunized with an immunogenic form of the peptide which elicits an antibody response in the mammal. Techniques for conferring immunogenicity on a peptide include conjugation to carriers or other techniques well known in the art. For example, the peptide can be administered in the presence of adjuvant. The progress of immunization can be monitored by detection of antibody titers in plasma or serum. Standard ELISA or other immunoassay procedures can be used with the immunogen as antigen to assess the levels of antibodies. Following immunization, antisera can be obtained and, if desired, polyclonal antibodies isolated from the sera.
To produce monoclonal antibodies, antibody producing cells (lymphocytes) can be harvested from an immunized animal and fused with myeloma cells by standard somatic cell fusion procedures, thus immortalizing these cells and yielding hybridoma cells. Such techniques are well known in the art (e.g. the hybridoma technique originally developed by Kohler and Milstein (Nature 256:495-497 (1975)) as well as other techniques such as the human B-cell hybridoma technique (Kozbor et al., Immunol. Today 4:72 (1983)), the EBV-hybridoma technique to produce human monoclonal antibodies (Cole et al., Methods Enzymol, 121:140-67 (1986)), and screening of combinatorial antibody libraries (Huse et al., Science 246:1275 (1989))). Hybridoma cells can be screened immunochemically for production of antibodies specifically reactive with the peptide and the monoclonal antibodies can be isolated.
The gene signature described herein can be used to select treatment for NSCLC patients. As explained herein, the biomarkers can classify patients with NSCLC into a poor survival group or a good survival group and into groups that might benefit from adjuvant chemotherapy or not.
Accordingly, in one embodiment, the application provides a method of selecting a therapy for a subject with NSCLC, comprising the steps:
In another embodiment, the application provides a method of selecting a therapy for a subject with NSCLC, comprising the steps:
The term “adjuvant chemotherapy” as used herein means treatment of cancer with chemotherapeutic agents after surgery where all detectable disease has been removed, but where there still remains a risk of small amounts of remaining cancer. Typical chemotherapeutic agents include cisplatin, carboplatin, vinorelbine, gemcitabine, docetaxel, paclitaxel and navelbine.
In another aspect, the application provides compositions useful in detecting changes in the expression levels of at least 3 of the 16 biomarkers selected from CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1. Accordingly in one embodiment, the application provides a composition comprising a plurality of isolated nucleic acid sequences wherein each isolated nucleic acid sequence hybridizes to:
In a further aspect, the application also provides an array that is useful in detecting the expression levels of at least 3 of the 16 biomarkers selected from CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1. Accordingly, in one embodiment, the application provides an array comprising for each of the above biomarkers one or more nucleic acid probes complementary and hybridizable to an expression product of the gene.
In yet another aspect, the application also provides for kits used to prognose or classify a subject with NSCLC into a good survival group or a poor survival group or to select a therapy for a subject with NSCLC that includes detection agents that can detect the expression products of the biomarkers. Accordingly, in one embodiment, the application provides a kit to prognose or classify a subject with early stage NSCLC comprising detection agents that can detect the expression products of at least 3 of the 16 biomarkers selected from CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1. In another embodiment, the application provides a kit to select a therapy for a subject with NSCLC, comprising detection agents that can detect the expression products of at least 4 of the 16 biomarkers selected from CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1.
A person skilled in the art will appreciate that a number of detection agents can be used to determine the expression of the biomarkers. For example, to detect RNA products of the biomarkers, probes, primers, complementary nucleotide sequences or nucleotide sequences that hybridize to the RNA products can be used. To detect protein products of the biomarkers, ligands or antibodies that specifically bind to the protein products can be used.
Accordingly, in one embodiment, the detection agents are probes that hybridize to the at least 4 of the sixteen biomarkers. A person skilled in the art will appreciate that the detection agents can be labeled.
The label is preferably capable of producing, either directly or indirectly, a detectable signal. For example, the label may be radio-opaque or a radioisotope, such as 3H, 14C, 32P, 35S, 123I, 125I, 131I; a fluorescent (fluorophore) or chemiluminescent (chromophore) compound, such as fluorescein isothiocyanate, rhodamine or luciferin; an enzyme, such as alkaline phosphatase, beta-galactosidase or horseradish peroxidase; an imaging agent; or a metal ion.
The kit can also include a control or reference standard and/or instructions for use thereof. In addition, the kit can include ancillary agents such as vessels for storing or transporting the detection agents and/or buffers or stabilizers.
In a further aspect, the application provides computer programs and computer implemented products for carrying out the methods described herein. Accordingly, in one embodiment, the application provides a computer program product for use in conjunction with a computer having a processor and a memory connected to the processor, the computer program product comprising a computer readable storage medium having a computer mechanism encoded thereon, wherein the computer program mechanism may be loaded into the memory of the computer and cause the computer to carry out the methods described herein.
In another embodiment, the application provides a computer implemented product for predicting a prognosis or classifying a subject with NSCLC comprising:
In yet another embodiment, the application provides a computer implemented product for determining therapy for a subject with NSCLC comprising:
Another aspect relates to computer readable media such as CD-ROMs. In one embodiment, the application provides a computer readable medium having stored thereon a data structure for storing a computer implemented product described herein.
In one embodiment, the data structure is capable of configuring a computer to respond to queries based on records belonging to the data structure, each of the records comprising:
In another aspect, the application provides a computer system comprising
The advantages of the present invention are further illustrated by the following example. The example and its particular details set forth herein are presented for illustration only and should not be construed as a limitation on the claims of the present invention.
To identify a subset of genes whose mRNA expression profile is predictive of patient prognosis we combined feature selection by greedy forward-selection with unsupervised pattern-recognition. We call this algorithm modified Steepest Descent, or “mSD”. This iterative algorithm adds genes to an existing classifier based on their ability to maximize the significance of a log-rank test on patient groups identified by k-medians clustering, and is described in further detail below.
To identify a signature comprising genes that are not ranked by some univariate criterion, we developed a discrete, greedy gradient-descent algorithm (i.e the mSD). mSD begins by considering all possible classifiers (signatures) of one dimension (gene), and selecting the best gene. Once this optimal single-gene classifier is identified, the algorithm proceeds to add additional dimensions (genes) sequentially, testing all possible subsets of two genes that contain the optimal single-gene classifier. This corresponds to testing all supersets of the single-gene classifier and taking the largest discrete step to improve classifier performance. This procedure iterates through higher dimensions, evaluating successive supersets of the best n-gene classifier identified thus far. The algorithm terminates when an n gene classifier is discovered whose performance is not exceeded by any n+1 gene superset of itself. At each stage of the feature selection, classifier performance is evaluated by using k-medians clustering with k=2 to separate patients into two groups. Note that clustering is used here as an exploratory technique, not as a significance-testing method (30,31). Next, survival differences between these two groups are assessed using the log-rank test. Gene selection was made on the basis of the chi-squared statistic from the log-rank test, and thus the termination criterion corresponds to finding an n gene classifier whose chi-squared score cannot be exceeded by adding any single additional gene. The final output of the algorithm is a subset of prognostic genes, along with a separation of patients into a group with good survival (the “good prognosis group”) and a group with poor survival (the “poor prognosis group”). A Cox proportional hazards model including stage was then fit to these group assignments. Hazard ratios for the classification were extracted, along with p-values based on the Wald test. 
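By way of illustration only, the greedy forward-selection loop described above might be sketched in Python as follows. This is a simplified sketch and not the Perl/R implementation referred to herein: the function names (`logrank_chi2`, `kmedians2`, `msd`), the deterministic cluster initialisation, the use of Euclidean distance with component-wise medians, and the toy data are all assumptions made for the example.

```python
from math import dist
from statistics import median

def logrank_chi2(times, events, groups):
    """Two-group log-rank chi-squared statistic; groups are 0/1 labels."""
    obs_minus_exp, var = 0.0, 0.0
    for t in sorted({tt for tt, e in zip(times, events) if e}):
        at_risk = [g for tt, g in zip(times, groups) if tt >= t]
        n, n1 = len(at_risk), sum(at_risk)
        d = sum(1 for tt, e in zip(times, events) if tt == t and e)
        d1 = sum(1 for tt, e, g in zip(times, events, groups)
                 if tt == t and e and g == 1)
        obs_minus_exp += d1 - d * n1 / n          # observed minus expected deaths
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return obs_minus_exp ** 2 / var if var > 0 else 0.0

def kmedians2(rows, iters=50):
    """Split rows into two clusters whose centres are component-wise medians."""
    # Assumption: deterministic start from the lowest- and highest-sum profiles.
    centres = [min(rows, key=sum), max(rows, key=sum)]
    labels = None
    for _ in range(iters):
        new = [0 if dist(r, centres[0]) <= dist(r, centres[1]) else 1 for r in rows]
        if new == labels:
            break
        labels = new
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centres[k] = [median(col) for col in zip(*members)]
    return labels

def msd(expr, times, events):
    """expr maps gene name -> per-patient expression values."""
    def score(genes):
        rows = [list(vals) for vals in zip(*(expr[g] for g in genes))]
        return logrank_chi2(times, events, kmedians2(rows))

    selected, best = [], 0.0
    while True:
        candidates = [g for g in expr if g not in selected]
        if not candidates:
            return selected, best
        top_score, top_gene = max((score(selected + [g]), g) for g in candidates)
        if top_score <= best:       # no (n+1)-gene superset improves the score
            return selected, best
        selected, best = selected + [top_gene], top_score

# Toy data (hypothetical): gene "A" tracks survival, gene "B" does not.
expr = {"A": [0, 0, 0, 0, 10, 10, 10, 10], "B": [1, 2, 1, 2, 1, 2, 1, 2]}
times = [10, 11, 12, 13, 1, 2, 3, 4]
events = [0, 0, 0, 0, 1, 1, 1, 1]
signature, chi2 = msd(expr, times, events)
```

On the toy data the loop selects gene "A" and then stops, because adding "B" does not raise the chi-squared score; the stage-adjusted Cox model described above is not part of this sketch.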
Feature selection was implemented in Perl (v5.8.7) and was run on AIX (v5.2.0.0) on an IBM p690. Clustering employed the Algorithm::Cluster (v1.31) C library (32) via its Perl bindings. Survival analysis used the survival package (v2.20) in R (v2.0.1).
A previously published RT-PCR dataset of 158 genes assessed in 147 NSCLC patients (19) was used for training. Data were normalized as described previously (28). Training used the original clinical annotation; subsequent survival analyses were performed using updated annotations, which increased patient follow-up by an average of 5.2 months (Table 2).
Two genes (STX1A and HIF1A) from this signature overlap with our previously reported linear risk-score analysis (33). Because we employed the same training dataset for both algorithms we are able to investigate the effect this overlap has on patient classifications. We compared the patient-by-patient predictions of our earlier risk-score-derived three-gene signature and our current six-gene signature (Table 2). The three-gene signature did not classify 10 patients from the initial cohort of 147, leaving 137 patients classified by both methods. Of these, 108 (79%) were classified identically by both methods. Most of the 29 mismatches (24/29=83%) were classified as poor prognosis by the three-gene signature and good prognosis by the six-gene signature. Similar proportions of adenocarcinomas and squamous cell carcinomas were divergently classified (22.6% vs. 20.2%, p=0.904). The two classifiers showed somewhat greater divergence for stage I than stage II or III patients, although this was not statistically significant (25.6% vs. 13.7%, p=0.154). The few divergences observed reflect the use of median dichotomization in the risk-score analysis. Median dichotomization is a common statistical procedure used when the training groups cannot be defined a priori, and forces the good and poor prognosis groups to be equally sized in the training dataset. By contrast the semi-supervised approach used by the mSD algorithm finds groups that reflect the strongest trend within the training dataset, regardless of group sizes. This is done by using unsupervised pattern-recognition (clustering). As a result mSD identifies groups of unequal size (92 good and 55 poor prognosis patients) while the risk-score analysis identified groups of equal size (68 good and 69 poor prognosis patients). 
Despite this underlying algorithmic difference, these data show that the two classifiers concur on the classifications for the majority of patients and that the few divergent classifications are not strongly biased according to any clinical covariates.
To estimate the generalization error of the mSD method we performed leave-one-out cross validation (29). Each of the 147 patients was classified using clusters defined with the remaining 146 patients. Euclidean distances were used to classify patients and significance was assessed with a stage-adjusted Cox proportional hazards model.
Specifically, using the normalized dataset, each of the 147 patients was sequentially removed from the sample. The mSD algorithm was then trained on the remaining 146 patient samples to select a prognostic subset of genes, as outlined above. The Euclidean distance between the expression profile of the omitted patient and the median expression profiles of the good and poor prognosis groups of patients were then calculated. The patient was classified into the nearer of these two groups, and the entire procedure was repeated 147 times so that each patient was omitted once. A survival curve of the resulting classifications was then plotted, and a stage-adjusted Cox proportional hazards model fitted as above. Cross validation was performed in R (v2.4.1) using the survival package (v2.31).
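The leave-one-out procedure above can be illustrated with the following Python sketch. It is a simplification under stated assumptions: `train` is a hypothetical stand-in for re-running gene selection and clustering on the remaining patients, and the example data are invented for illustration.

```python
from math import dist
from statistics import median

def group_medians(profiles, labels):
    """Component-wise median expression profile of each labelled group."""
    return {
        lab: [median(col) for col in zip(*(p for p, l in zip(profiles, labels) if l == lab))]
        for lab in set(labels)
    }

def classify(profile, centres):
    """Return the label of the group median nearest in Euclidean distance."""
    return min(centres, key=lambda lab: dist(profile, centres[lab]))

def leave_one_out(profiles, train):
    """Classify each patient using groups learned from all other patients."""
    predictions = []
    for i, held_out in enumerate(profiles):
        rest = profiles[:i] + profiles[i + 1:]
        centres = group_medians(rest, train(rest))
        predictions.append(classify(held_out, centres))
    return predictions

# Hypothetical two-gene profiles and a trivial stand-in for the training step.
profiles = [[0.1, 0.2], [0.0, 0.3], [0.2, 0.1], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
train = lambda ps: ["good" if p[0] < 2.5 else "poor" for p in ps]
predictions = leave_one_out(profiles, train)
```

The resulting labels would then be passed to the survival analysis; that step is omitted here.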
Four independent public datasets were used for validation (13, 14, 16, 25). The normalized data were downloaded and a unique probe for each of the six genes in the six gene signature (see above regarding Training Dataset and Table 2) was identified in each dataset. Median scaling and house-keeping gene normalization (to the geometric mean of ACTB, BAT1, B2M, and TBP levels) were performed (28). Euclidean distances to the training clusters were used to classify each patient. Survival differences were assessed using stage-adjusted Cox proportional hazards models.
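As a hedged illustration of house-keeping gene normalization, the Python sketch below divides each gene level in a sample by the geometric mean of the four house-keeping genes named above. The assumption that levels are on a linear scale, the function name `normalize_to_housekeeping`, and the example values are not taken from the procedure of reference (28).

```python
from statistics import geometric_mean

HOUSEKEEPING = ("ACTB", "BAT1", "B2M", "TBP")

def normalize_to_housekeeping(sample, housekeeping=HOUSEKEEPING):
    """Divide each non-house-keeping level by the house-keeping geometric mean."""
    reference = geometric_mean([sample[g] for g in housekeeping])
    return {
        gene: level / reference
        for gene, level in sample.items()
        if gene not in housekeeping
    }

# Hypothetical linear-scale levels for one patient sample.
sample = {"ACTB": 8.0, "BAT1": 2.0, "B2M": 4.0, "TBP": 4.0, "STX1A": 8.0}
normalized = normalize_to_housekeeping(sample)
```

If the data were log-transformed, the equivalent operation would be subtraction of the arithmetic mean of the house-keeping values.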
Specifically, the four independent, publicly available datasets were used to validate the six-gene classifier identified by modified steepest-descent (34-37). These datasets were not used to select the 158 genes in our study and thus each constitutes an independent validation dataset. Two validation datasets were generated using Affymetrix microarrays (36, 37) and two using custom cDNA arrays (34, 35). Two are comprised primarily of adenocarcinomas (34, 36) and two exclusively of squamous cell carcinomas (35, 37). In each case, the normalized data were downloaded from the GEO repository. ProbeSets or spots representing the genes involved in the signature were identified using NetAffx annotation for Affymetrix arrays (36, 37) and BLAST analysis against UniGene build Hs.199 (34, 35) for cDNA arrays. When multiple ProbeSets for a single gene were present, the Pearson's correlation between their vectors was calculated. If they were strongly correlated (R>0.75) they were collapsed by averaging; otherwise bl2seq analysis against the RefSeq mRNA for the gene in question was used to identify the best match. Median scaling was performed as described previously (38). House-keeping gene normalization was used for the two Affymetrix array platforms, as described above for the PCR analysis. Because only one of the four house-keeping genes used was available on the custom cDNA platforms, this normalization step was omitted for those platforms.
For each validation dataset, the distances between the expression profile for each patient and the cluster centers (medians) identified from the training dataset were calculated. A patient was classified into the nearer cluster if the ratio of the distances between the profile and the two clusters was at least 0.9. This quality criterion was not used for the two studies with small sample sizes where one signature gene was not present on the array platform (34, 35). The resulting classifications were then tested to determine if our prognostic signature resulted in significant survival differences using a Cox proportional hazards model with adjustment for stage in R (v2.4.1) using the survival library (v2.33) as previously described.
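The ProbeSet-collapsing rule described above can be sketched in Python as follows. This is an illustrative sketch only: requiring every pair of ProbeSets to exceed the threshold when more than two are present is an assumption, the sequence-based fallback (bl2seq) is represented only by returning `None`, and the vectors are invented.

```python
from math import sqrt
from statistics import fmean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def collapse_probesets(vectors, threshold=0.75):
    """Average the ProbeSets if all pairs correlate above threshold, else None."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    if all(pearson(vectors[i], vectors[j]) > threshold for i, j in pairs):
        return [fmean(col) for col in zip(*vectors)]
    return None  # caller would fall back to sequence-based selection

# Hypothetical per-patient signals for three ProbeSets of one gene.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.5]
c = [4.0, 1.0, 3.0, 2.0]
collapsed = collapse_probesets([a, b])
rejected = collapse_probesets([a, c])
```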
We combined patients from the four validation datasets described above with four older or smaller NSCLC datasets (11, 12, 23). These 589 patients were classified as described above, with Cox modeling to identify survival differences.
Several smaller expression studies of non-small cell lung cancer were also available but, because of their limited number of patients, were not useful as validation datasets. To leverage these resources, we combined all patients from the four studies described above, along with datasets from the Mayo Clinic and Washington University (39), and two additional studies of mRNA expression in NSCLC (40, 41). In each of these cases, the raw data (CEL files) were downloaded and pre-processed using the RMA algorithm (42) as implemented in the affy package (43) (v1.6.7) for R (v2.1.1). One dataset (40) included highly-correlated technical replicates for some samples, which were collapsed through ProbeSet-wise averaging. The resulting dataset of 589 patients was then subjected to the same nearest-centre classification described above. Survival between the two groups was tested using a Cox proportional hazards model with adjustment for stage. The normalized data and clinical annotations for all patients used in this paper are presented in
To determine the number of 6-gene classifiers (signatures) that could be generated from our 158-gene training dataset we performed a permutation analysis. We tested the prognostic capability of ten million combinations of six genes. For each combination we divided the patients into two groups using k-means clustering and calculated significance using log-rank analysis.
Study of all combinations is not possible for larger subset sizes because of the combinatorial explosion. This analysis was performed in the R statistical environment (v2.6.1) using the survival package (v2.34).
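As an illustration of how random six-gene subsets might be drawn for such an analysis, consider the Python sketch below. It is not the R code referred to above; the function name `sample_signatures`, the fixed seed, and the placeholder gene names are assumptions, and only 1,000 subsets are drawn here to keep the example fast.

```python
import random
from math import comb

def sample_signatures(genes, size, count, seed=0):
    """Draw `count` distinct gene subsets of the given size."""
    if count > comb(len(genes), size):
        raise ValueError("more subsets requested than exist")
    rng = random.Random(seed)
    seen = set()
    while len(seen) < count:
        seen.add(frozenset(rng.sample(genes, size)))  # duplicates are discarded
    return seen

# Placeholder names standing in for the 158 training genes.
genes = [f"gene{i}" for i in range(158)]
signatures = sample_signatures(genes, size=6, count=1000)
```

Each sampled subset would then be clustered and scored with the log-rank test as described above.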
To test each signature we used the clusters defined in our training cohort to classify patients from four additional datasets (36, 37, 40, 41), again using Euclidean distances and log-rank analysis. The normalized data for each of these datasets was extracted for the genes in each signature. Euclidean distances were calculated between each patient and the centre of the two training clusters, and the patient was classified into the nearest cluster. Survival differences between good and poor prognosis clusters were then assessed using log-rank analysis.
Finally, to consider the generalizability of each prognostic signature across all four testing datasets we employed percentile analysis. The distribution of subsets with prognostic significance (χ²>3.84 or p<0.05) in the training dataset was visualized using Gaussian density plots. First, for visualization purposes we calculated and plotted the Gaussian kernel density of prognostic signatures in each validation dataset. Next, we calculated the percentile rank of each signature in each of the four validation datasets. The product of these ranks provides an estimate of the overall validation of a classifier across all four datasets, and we plotted the Gaussian kernel densities of these ranks. The performance of the six-gene mSD-signature was then treated in the same manner and its location marked on plots with an arrow to indicate its performance relative to the distribution of all potential prognostic markers.
Specifically, we focused on those six-gene signatures having a p-value below 0.05 (a strength greater than a pre-defined parameter). Enrichment of each gene was studied in the high-strength (p<0.05) subsets using two enrichment statistics. First, the fraction of subsets containing that gene that were statistically significant at p<0.05 by a log-rank test was calculated. Second, this fraction was compared to the fraction that would be expected by chance alone using a bootstrap analysis. A bootstrap analysis involves repeated random samplings from the original dataset; in this case 1000 random samplings were used to estimate each p-value. Bootstrap analysis is preferred when the distribution of the underlying data is unknown or highly complex.
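The percentile-rank product described above might be computed along the following lines. This Python sketch assumes that the percentile rank is the fraction of signatures scoring strictly below the signature of interest; the exact definition used is not stated herein, and the dataset names and scores are hypothetical.

```python
from bisect import bisect_left

def percentile_rank(value, population):
    """Fraction of the population scoring strictly below value."""
    ordered = sorted(population)
    return bisect_left(ordered, value) / len(ordered)

def validation_score(scores, populations):
    """Product of a signature's percentile ranks across datasets."""
    product = 1.0
    for name, score in scores.items():
        product *= percentile_rank(score, populations[name])
    return product

# Hypothetical chi-squared scores of all signatures in two datasets.
populations = {"d1": [1, 2, 3, 4], "d2": [10, 20, 30, 40]}
scores = {"d1": 3.5, "d2": 25}
overall = validation_score(scores, populations)
```

A signature that ranks highly in every dataset keeps a product close to 1, whereas a poor rank in any one dataset pulls the product down.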
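The two enrichment statistics can be illustrated with the Python sketch below. The resampling scheme shown (drawing, with replacement, as many signatures as contain the gene and counting how often the resampled significant fraction reaches the observed one) is an assumption made for illustration, as are the function names and the invented data.

```python
import random

def enrichment(gene, signatures):
    """Fraction of signatures containing `gene` that are significant, and their count."""
    flags = [sig for genes, sig in signatures if gene in genes]
    return sum(flags) / len(flags), len(flags)

def bootstrap_p(gene, signatures, n_resamples=1000, seed=0):
    """Estimate how often chance alone matches the observed enrichment."""
    rng = random.Random(seed)
    observed, n = enrichment(gene, signatures)
    all_flags = [sig for _, sig in signatures]
    hits = sum(
        sum(rng.choices(all_flags, k=n)) / n >= observed
        for _ in range(n_resamples)
    )
    return hits / n_resamples

# Hypothetical (gene set, significant?) pairs: gene "X" always appears in
# significant signatures, while only 10 of 180 signatures with "Y" are significant.
signatures = [({"X", f"g{i}"}, True) for i in range(20)]
signatures += [({"Y", f"h{i}"}, i < 10) for i in range(180)]
fraction_x, count_x = enrichment("X", signatures)
p_x = bootstrap_p("X", signatures)
```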
Genes were ranked by the p-value-based enrichment statistics. To identify genes that have an enrichment above a pre-defined threshold we set our threshold as p<0.01.
To determine the impact of alternative statistical methods on prognostic marker identification we considered our previously published 147-patient, 158-gene RT-PCR NSCLC dataset. This dataset had been analyzed with a risk-score methodology, which identified a three-gene classifier capable of separating patients into groups with significantly different prognoses (19). The majority of signatures developed for NSCLC employed linear or risk-score methods to classify patients (11, 13, 14, 16, 23), which are unable to capture non-linear interactions amongst genes. For example, regulatory networks make substantial use of “or” logic: a cell may respond to hypoxic conditions by up-regulating HIF1A or down-regulating VHL. Such relationships cannot generally be captured by linear methods. We thus developed a novel non-linear semi-supervised method by coupling unsupervised pattern-recognition to gradient descent optimization (i.e. mSD). Referring to
Applying mSD to a training dataset of 147 NSCLC patients initially generated a prognostic signature comprising six genes: syntaxin 1A (STX1A), hypoxia inducible factor 1A (HIF1A), chaperonin containing TCP1 subunit 3 (CCT3), MHC Class II DPbeta 1 (HLA-DPB1), v-maf musculoaponeurotic fibrosarcoma oncogene homolog K (MAFK), and ring finger protein 5 (RNF5) (as described in U.S. patent application Ser. No. 11/940,707). Table 1 gives additional information on these genes. Specifically, stable (Entrez Gene ID) identifiers and the independent univariate prognostic ability (based on the log-rank test and Cox proportional hazards modeling) are given for each component of the six-gene mSD signature.
Referring to
The signature separated the 147 training patients into groups with significantly different survivals (p=2.14×10⁻⁸; log-rank test;
To validate our initial six-gene signature we tested its ability to stratify patients into groups with different prognosis using four independent publicly available datasets from Duke University (25), the University of Michigan (16), and the Prince Charles Hospital (13, 14). These datasets represent two versions of Affymetrix arrays (U133Plus2.0, Duke; U133A, Michigan) and a custom cDNA array (Prince Charles). Two of these studies comprise exclusively squamous cell carcinomas (13, 16), one exclusively adenocarcinomas (14), and one both (25). Each dataset was analyzed separately, as outlined in the supplementary methods. The molecular stratifications are plotted in
In addition to the four datasets analyzed in
We identified a six-gene classifier that shows partial overlap with the three-gene classifier identified previously from the same training dataset using risk-score methods. We questioned whether other small prognostic signatures could be identified from this 158-gene dataset. To test this question comprehensively we mapped our 158 genes into four test datasets (11, 12, 16, 25). In total 113 genes were common to these four datasets, and adding additional datasets greatly reduced this number. We restricted subsequent analyses to the 113 genes profiled in all four datasets. We then generated ten million permutations of six genes and tested their prognostic capability in these four datasets. For each subset we calculated its statistical significance using the log-rank test, as before.
A large number of these permutations showed statistical significance. In total 16.4% of all six-gene signatures were significant at p<0.05. This is 3.28-fold greater than the 5% expected by chance alone, and reflects a statistically significant enrichment (p<2.2×10⁻¹⁶; proportion test).
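The fold enrichment can be checked with simple arithmetic, and a test of proportions can be approximated as in the Python sketch below. The use of a one-sample z-test with n equal to ten million signatures is an assumption; the proportion test reported above may have been computed differently (for example, as a chi-squared test with continuity correction).

```python
from math import erfc, sqrt

def one_sample_proportion_z(observed, expected, n):
    """z statistic and one-sided p-value for an observed proportion."""
    z = (observed - expected) / sqrt(expected * (1 - expected) / n)
    p_one_sided = 0.5 * erfc(z / sqrt(2))
    return z, p_one_sided

fold = 0.164 / 0.05                     # observed vs. chance-expected fraction
z, p = one_sample_proportion_z(0.164, 0.05, 10_000_000)
```

With so many signatures the z statistic is very large and the p-value underflows to zero in double precision, which is consistent with a reported bound of p<2.2×10⁻¹⁶.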
The distribution of all 10,000,000 six-gene signatures is shown in
We next compared the validation of the aforementioned 6-gene mSD signature to that of ten million random 6-gene signatures. For each test dataset (11, 12, 16, 25) the distribution of validation rates was again plotted as kernel density estimates. For each kernel density estimate in the training dataset we marked the performance of the six-gene mSD signature in that dataset with an arrow (
These data demonstrate the efficacy of the aforementioned initial six-gene signature in four distinct testing datasets. While said 6-gene signature performed amongst the top 10% of all signatures in each test dataset, it was not the single best signature in any single dataset. Rather, its strength is its validation in four independent datasets. To compare the validation of this 6-gene signature across all four test datasets we calculated its percentile ranking in each dataset and took the product of these rankings. The resulting validation score provides a measure of the inter-dataset reproducibility of a signature. Only 1,789 of the 10,000,000 signatures tested perform better than the mSD signature across all four validation datasets. Thus the mSD signature was superior to 99.98% of signatures tested (
Having used our large permutation dataset to rank the aforementioned initial six-gene prognostic signature, we next tested if specific genes were enriched in prognostic signatures. For each gene, we calculated the percentage of signatures containing it that were statistically significant (p<0.05, log-rank test). At this threshold we expect 5% of signatures to be significant by chance alone. When we plotted the percentages for the 113 gene set (
Table 5 provides the enrichment values for all 113 genes. At an enrichment above a threshold set at p<0.01, 16 genes remain in our final signature. This choice of threshold is further supported by the clear inflection point that is evident both in the enrichment plot (
The observed lack of overlap in typically reported prognostic signatures for NSCLC likely results from the use of different statistical techniques. To address this, we trained two distinctive algorithms on a single dataset to determine if identical signatures would be found. For training, we selected a real-time PCR dataset of 158 genes assessed in 147 patients, which we had used previously to identify a three-gene signature using linear risk-score methods (19). To provide a counterpoint to this linear analysis we then developed a semi-supervised algorithm by coupling unsupervised pattern-recognition and gradient descent algorithms (i.e. mSD).
The application of mSD to the same 147-patient training dataset identified a six-gene signature. This signature stratified NSCLC patients into two groups with different outcomes in four independent public datasets (
Clinical implementation of signatures should be straightforward. For each patient, RT-PCR analysis would be performed for the identified prognostic genes in conjunction with a number of (e.g. 4) house-keeping genes for normalization purposes. Following normalization, Euclidean distances will determine if a patient's profile most resembles good or poor prognosis tumors, a similar approach to that adopted in two major breast-cancer studies (26, 27). Such signature(s) can be used even if some of the PCR reactions fail or data are otherwise unavailable, as shown by successful validation of the aforementioned 6-gene signature in two cDNA microarray datasets where one signature and one normalization gene were not present on the array platform (13, 14).
We have validated the aforementioned six-gene signature in eight of the eleven most recent NSCLC microarray studies (
Two genes (STX1A and HIF1A) are common to both the previously described three-gene (19) and aforementioned six-gene signatures. This partial overlap led us to hypothesize that additional small prognostic signatures could be identified from our training dataset. To test this, we trained ten million sets of six genes in our PCR dataset and tested each in four independent validation datasets. In both the training and testing datasets the aforementioned six-gene signature is superior to 99.98% of prognostic signatures (
These results demonstrate that very large numbers of potential prognostic signatures exist. Our permutation study focused on 113 genes that were profiled in five separate studies. This small dataset can generate approximately 2.5 billion unique six-gene signatures. If, as our results suggest, 0.02% of these can be verified in multiple independent validation cohorts, then a minimum of 500,000 verifiable six-gene prognostic signatures exist. This large number may explain the poor gene-wise overlap observed in prognostic signatures from different groups (19). It will be critical to determine whether this conclusion can be generalized to other datasets and sizes of prognostic signature.
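The arithmetic behind these figures can be checked directly. The sketch below assumes that a signature is an unordered set of six distinct genes drawn from the 113, and that the 0.02% fraction observed among the sampled signatures carries over to all possible signatures. Variable names are illustrative.

```python
import math

n_genes = 113
signature_size = 6

# Number of unordered six-gene subsets of 113 genes: C(113, 6).
n_signatures = math.comb(n_genes, signature_size)

# Fraction of signatures assumed verifiable (0.02%, from the text).
verifiable_fraction = 0.0002
estimated_verifiable = n_signatures * verifiable_fraction

print(f"{n_signatures:,} possible six-gene signatures")
print(f"about {estimated_verifiable:,.0f} expected to be verifiable")
```

The count is 2,526,561,576, which matches the approximately 2.5 billion stated above, and 0.02% of that is slightly more than 500,000.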
A detailed comparison of verifiable prognostic signatures might reveal common features. Our initial univariate analysis shows that some specific genes were highly enriched in statistically significant prognostic signatures (
Our approach may provide a template for future studies to develop reproducible, mRNA-based signatures for cancer and other complex diseases. We started by using a high-quality training dataset enriched for prognostic markers. By keeping this dataset small we minimized the problems of over-fitting that arise from using thousands of genes. Next, we used a non-linear algorithm that dynamically learned patient groupings (i.e. a semi-supervised algorithm). Finally, we extensively validated our results, using cross-validation, multiple external datasets, and permutation-type analyses. Application of this protocol to the development of other signatures should be fruitful.
In summary, the present application encompasses a novel, semi-supervised algorithm (utilized in combination with a novel permutation analysis) which was used to demonstrate that a single training dataset can yield multiple prognostic signatures. By way of example, an initial prognostic signature (previously described; i.e. U.S. patent application Ser. No. 11/940,707) was validated in multiple testing datasets. Additionally, the application further teaches an approach for the identification and verification of a multiplicity of diverse and distinct NSCLC prognostic gene signatures, as exemplified by those signatures comprising at least three of CALCA, CCR7, STX1A, CCT3, SPRR1B, SELP, PAFAH1B3, CPE, XRCC6, HIF1A, MARCH6, PLOD2, NAP1L1, SFTPC, KRT5 and STC1.
Although preferred embodiments of the invention have been described herein, it will be understood by those skilled in the art that variations may be made thereto without departing from the spirit of the invention or the scope of the appended claims.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CA2009/001775 | 12/2/2009 | WO | 00 | 9/26/2011 |
Number | Date | Country |
---|---|---|
61119936 | Dec 2008 | US |
61149847 | Feb 2009 | US |