The present invention is in the field of biomarkers and therapeutic targets.
Lung cancer is the leading cause of cancer-related death worldwide [1], where non-small cell lung cancer (NSCLC) is the most common type of cancer affecting the lungs with adenocarcinoma being the most common subtype. Microarray and next generation sequencing technologies have become invaluable tools to deconvolute the genetic heterogeneity and complexity of NSCLC, providing tremendous information to define new biomarkers for diagnosis, prognosis and prediction of therapeutic response, and to identify new potential therapeutic targets. Despite the advances in our knowledge of the genetic factors underlying this disease, the five-year survival rate for NSCLC patients is approximately 21% [2]. Lung cancer treatment is therefore moving rapidly towards an era of personalized medicine, where the molecular characteristics of an individual patient's tumor will dictate the optimal treatment modalities. For example, NSCLC patients with EGFR mutations show significantly improved responses to treatment with tyrosine kinase inhibitors, e.g., gefitinib or erlotinib, which target this protein [3].
Patient stratification based on histopathological markers, immunohistochemistry and other molecular factors has been evaluated to improve treatment decisions in lung adenocarcinoma (LuADC) patients [4-6]. The availability of large cancer genomic data sets allows for unbiased approaches to identify multi-gene signatures important in tumor progression. Gene transcript based signatures that predict prognosis have successfully been developed for many different tumor types [7-10]. A number of gene signatures using microarray analysis show promise for prognosis or prediction of response to therapy in NSCLC [11-14]. However, these signatures were either based on incomplete genome annotation or were based solely on existing knowledge. Therefore, a new comprehensive and unbiased genome-wide screening for genes associated with lung cancer prognosis is warranted.
The present invention provides for a library or an array of nucleic acids or nucleotides encoding portions of two or more genes selected from a set of 27 genes as indicated in Table 1 which are useful for predicting lung cancer survival.
In some embodiments, the library or the array comprises nucleic acids or nucleotides encoding portions of 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, or more, or all, genes selected from the set of 27 genes as indicated in Table 1. In some embodiments, the lung cancer is lung adenocarcinoma (LuADC).
The present invention also provides for a method for predicting a subject's overall survival (OS) from a lung cancer, comprising: (a) obtaining a lung gene transcript sample from a subject, (b) determining the transcript level of two or more genes selected from a set of 27 genes as indicated in Table 1, (c) correlating the pattern of the transcript to a predicted OS based on the analysis described herein, and (d) optionally treating the subject with a treatment regime appropriate to the predicted OS of the subject obtained from the correlating step.
In some embodiments, the determining step comprises determining the transcript level of 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, or more, or all, genes selected from the set of 27 genes as indicated in Table 1. In some embodiments, the subject is a person suspected of having a lung cancer, a person with a high probability of having a lung cancer, or a person diagnosed with a lung cancer. In some embodiments, the lung cancer is lung adenocarcinoma (LuADC). In some embodiments, the treatment comprises one or more of surgery, radiation therapy, and/or chemotherapy.
In some embodiments, the treatment regime is surgery, radiation therapy, chemotherapy, targeted therapy, administering an angiogenesis inhibitor, or immunotherapy. In some embodiments, the surgery is a lobectomy, wedge resection, segmentectomy, pneumonectomy, or sleeve resection. In some embodiments, the chemotherapy comprises administering to the subject a therapeutic amount of (i) cisplatin or carboplatin, and (ii) pemetrexed or docetaxel. In some embodiments, the targeted therapy comprises administering to the subject a therapeutic amount of (i) Crizotinib (Xalkori®), (ii) Ceritinib (Zykadia®), (iii) Alectinib (Alecensa®), and/or (iv) Brigatinib (Alunbrig™). Brigatinib (Alunbrig™) is administered to subjects whose cancer has grown while they were on Crizotinib or who are intolerant to Crizotinib. In some embodiments, the targeted therapy comprises administering to the subject a therapeutic amount of (i) Afatinib (Gilotrif®), (ii) Dacomitinib (Vizimpro®), (iii) Erlotinib (Tarceva®), (iv) Gefitinib (Iressa®), and/or (v) Osimertinib (Tagrisso®). Osimertinib (Tagrisso®) is administered to subjects whose tumors are EGFR T790M-positive and whose disease has progressed on or after EGFR TKI therapy. In some embodiments, administering the angiogenesis inhibitor comprises administering to the subject a therapeutic amount of Bevacizumab (Avastin®) and/or Ramucirumab (Cyramza®). In some embodiments, the immunotherapy comprises administering to the subject a therapeutic amount of Nivolumab (Opdivo®), Pembrolizumab (Keytruda®), and/or Atezolizumab (Tecentriq®).
The identification of reliable predictive biomarkers and new therapeutic targets is a critical step toward real improvement in patient outcomes. To this end, we developed a multi-step bioinformatics analytic strategy to mine large omics data together with clinical data. A meta-analysis of transcriptome data identified 1327 genes significantly and robustly deregulated in lung adenocarcinomas (LuADCs) compared to normal lung tissue. Of these genes, 600 are significantly associated with overall survival (OS) of LuADC patients. The structure of a gene co-expression network revealed the biological functions of these 600 genes in normal lung and LuADCs, which were enriched for cell cycle-related processes, blood vessel development, cell adhesion and metabolic processes. We established a 600-gene expression-based molecular classification of LuADCs into 4 possible subtypes, which is weakly, but significantly, associated with OS in TCGA data. Finally, we implemented a multiple resampling method combined with a Cox regression analysis to identify a 27-gene signature associated with OS in the TCGA dataset, and then created a prognostic scoring system based on the Cox regression function. This scoring system robustly predicts OS of LuADC patients in 100 sampling test sets and is further validated in four LuADC datasets. Our multi-omics and clinical data integration study identified a 27-gene prognostic signature that could guide adjuvant therapy for LuADC patients and includes novel potential molecular targets for therapy.
This invention is based on the discovery that: (1) a genome-wide screen identified 1327 genes significantly and robustly deregulated across four independent lung adenocarcinoma datasets compared to normal lung tissues; (2) the gene expression of 600 genes is significantly associated with overall survival (OS) in lung adenocarcinoma patients; (3) 4 molecular subtypes are identified based on the 600 genes associated with OS of patients with lung adenocarcinomas; (4) a forward-conditional Cox regression analysis identified a 27-gene signature associated with overall survival (OS) of lung adenocarcinomas; and, (5) a prognostic scoring system was created based on the 27-gene signature. This scoring system robustly predicted lung adenocarcinoma patient OS in 100 sampling test sets and was further validated in 4 independent lung tumor data sets. The 27-gene prognostic signature of the present invention is useful for guiding adjuvant therapy for lung cancer patients, and includes, but is not limited to, novel potential molecular targets for therapy.
The present invention is useful for identifying genes important in lung cancer survival, so that novel targeted therapies can be developed, and for predicting lung cancer survival.
The foregoing aspects and others will be readily appreciated by the skilled artisan from the following description of illustrative embodiments when read in conjunction with the accompanying drawings.
Before the invention is described in detail, it is to be understood that, unless otherwise indicated, this invention is not limited to particular sequences, expression vectors, enzymes, host microorganisms, or processes, as such may vary. It is also to be understood that the terminology used herein is for purposes of describing particular embodiments only, and is not intended to be limiting.
In this specification and in the claims that follow, reference will be made to a number of terms that shall be defined to have the following meanings:
The terms “optional” or “optionally” as used herein mean that the subsequently described feature or structure may or may not be present, or that the subsequently described event or circumstance may or may not occur, and that the description includes instances where a particular feature or structure is present and instances where the feature or structure is absent, or instances where the event or circumstance occurs and instances where it does not.
As used in the specification and the appended claims, the singular forms “a,” “an,” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to an “expression vector” includes a single expression vector as well as a plurality of expression vectors, either the same (e.g., the same operon) or different; reference to “cell” includes a single cell as well as a plurality of cells; and the like.
Where a range of values is provided, it is understood that each intervening value, to the tenth of the unit of the lower limit unless the context clearly dictates otherwise, between the upper and lower limits of that range is also specifically disclosed. Each smaller range between any stated value or intervening value in a stated range and any other stated or intervening value in that stated range is encompassed within the invention. The upper and lower limits of these smaller ranges may independently be included or excluded in the range, and each range where either, neither or both limits are included in the smaller ranges is also encompassed within the invention, subject to any specifically excluded limit in the stated range. Where the stated range includes one or both of the limits, ranges excluding either or both of those included limits are also included in the invention.
The term “about” refers to a value including 10% more than the stated value and 10% less than the stated value.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are now described. All publications mentioned herein are incorporated herein by reference to disclose and describe the methods and/or materials in connection with which the publications are cited.
It is to be understood that, while the invention has been described in conjunction with the preferred specific embodiments thereof, the foregoing description is intended to illustrate and not limit the scope of the invention. Other aspects, advantages, and modifications within the scope of the invention will be apparent to those skilled in the art to which the invention pertains.
All patents, patent applications, and publications mentioned herein are hereby incorporated by reference in their entireties.
The invention having been described, the following examples are offered to illustrate the subject invention by way of illustration, not by way of limitation.
Identification of reliable predictive biomarkers and new therapeutic targets is a critical step for significant improvement in patient outcomes. Here, we developed a multi-step bioinformatics analytic strategy to mine large omics and clinical data to build a prognostic scoring system for predicting the overall survival (OS) of lung adenocarcinoma (LuADC) patients. We first identified 1327 significantly and robustly deregulated genes, 600 of which were significantly associated with the OS of LuADC patients. Gene co-expression network analysis revealed the biological functions of these 600 genes in normal lung and LuADCs, which were found to be enriched for cell cycle-related processes, blood vessel development, cell-matrix adhesion and metabolic processes. Finally, we implemented a multiple resampling method combined with Cox regression analysis to identify a 27-gene signature associated with OS, and then created a prognostic scoring system based on this signature. This scoring system robustly predicted OS of LuADC patients in 100 sampling test sets and was further validated in four independent LuADC cohorts. In addition, in comparison to other existing prognostic gene signatures published in the literature, our signature was significantly superior in predicting OS of LuADC patients. In summary, our multi-omics and clinical data integration study created a 27-gene prognostic risk score that can predict OS of LuADC patients independent of age, gender and clinical stage. This score could guide therapeutic selection and allow stratification in clinical trials.
Herein is described a multi-step bioinformatics analytic strategy to mine large omics data together with clinical information to develop a gene expression-based prognostic risk score for lung adenocarcinomas (LuADCs). A resampling method is employed in which the TCGA LuADC dataset is split into training and testing sets, and repeated cross-validation is then used to identify critical genes for prognostic classification. Based on these analyses, a 27-gene expression prognostic scoring system is created and successfully applied to predict overall survival (OS) in multiple validation datasets. This study raises the prospect that LuADC patient prognosis may be assessed in practice with this prognostic scoring system.
Results
Identification of Consistently Deregulated Genes in Human LuADCs
A meta-analysis of three publicly available LuADC transcriptome datasets (GSE31210, GSE19188 and GSE19804) was conducted to identify genes that are consistently deregulated in human LuADCs compared to normal lung tissues (
Impact of the Deregulated Genes on Overall Survival in Human LuADCs
To assess the importance of the 1327 deregulated genes in LuADC development, we evaluated their prognostic value for LuADC patients in a large public database combining tumor gene expression and patient survival [17] (
To reveal the molecular mechanism underlying LuADC development, we determined which Gene Ontology (GO) categories are statistically overrepresented in the 600 gene set, and observed significant enrichment for cell cycle, adhesion, cell death, angiogenesis, metabolism and kinase activity (
Expression Architecture of Prognostic Genes in Normal Lung and LuADCs
Co-expression network analysis has been used to identify clusters of genes with common biological functionality important in normal or tumor tissues. We used data obtained from the GTEx database of 320 normal human lung tissues and the TCGA database of 517 LuADC samples to reveal the expression architecture of the 600 OS-associated genes in normal lung and LuADC tissues. We first calculated correlation coefficients among the 600 genes in both normal and LuADC tissue samples, and then constructed a gene co-expression network where nodes represent individual genes and edges connecting genes represent a significant correlation in expression (|R|≥0.7; adjusted p-value<0.001;
Development of a Gene Expression Signature-Based Prognostic Risk Score in LuADC
We designed a strategy to develop a prognostic scoring system (
27-Gene Expression Signature-Based Prognostic Risk Score Independently Predicts Overall Survival in LuADC Patients
We then tested our 27-gene prognostic signature in four independent datasets of LuADC patients. Prognostic scores for all patients were calculated and patients were ranked based on their score and divided into three equal sized cohorts. Kaplan-Meier analysis revealed a significant difference among three patient cohorts. Patients with a high prognostic score had a significantly shorter OS compared to patients with a low prognostic score (p<0.001) in all datasets (
Comparison of 27-Gene Expression Signature with Existing Prognostic Signatures
There are a number of prognostic signatures for NSCLC prognosis in the literature. We compared the performance of three published signatures [12-14] with our 27-gene signature. For each of the published signatures, we performed a multivariate Cox regression analysis on the same 100 training sets, averaged the Cox regression co-efficient and calculated prognostic scores for all patients. For each signature, the patients were then divided into tertiles based on their prognostic scores and the prognostic scores at the cut-points were recorded. Finally, the HR was calculated for each testing set for the “intermediate” and “poor” groups in comparison to the “good” group (
Discussion
Lung cancer is the most common cancer and the leading cause of cancer death among both men and women worldwide [1,20]. NSCLC, like many other cancers, exhibits considerable complexity and heterogeneity in biology, drug response and survival [21], which represents a major obstacle to effective personalized treatment. This work aimed to identify reliable predictive biomarkers and build a prognostic scoring system for predicting OS of LuADC patients.
There are several prognostic signatures for NSCLC prognosis in the literature [12-14]. While these signatures have been shown to predict lung cancer survival, they were developed based on a subset of all genes in the genome or were assembled based on existing knowledge on the role of genes in cancer. With the availability of lung cancer transcriptome data sets covering many additional genes, it seemed plausible that novel gene signatures better able to predict LuADC patient survival could exist. To this end, we embarked on a comprehensive and unbiased genome-wide screen for genes associated with lung cancer prognosis. We show that our 27-gene scoring system has robust discriminative ability to distinguish patients with good versus bad prognosis in multiple datasets independent of clinical characteristics including age, gender and pathological stage. A direct performance comparison of our signature with the three published signatures mentioned above in terms of predicting patient survival showed that, while all signatures were able to predict survival, our 27-gene signature was much more robust. To translate such findings into clinical practice, a multigene assay should be developed for further validation of this gene signature in assessment of LuADC survival. Such information will assist treatment decision-making in a way similar to that used for the Oncotype DX breast cancer assay developed by Genomic Health [9] and the MammaPrint 70-gene breast cancer recurrence assay by Agendia [7]. Randomized prospective clinical trials to further validate the accuracy and clinical value of this novel prognostic test for LuADC patients will need to be conducted.
In conclusion, lung cancer remains the leading cause of cancer-related disease burden. We developed a multi-step unbiased bioinformatics analytic approach to identify reliable predictive biomarkers and new therapeutic targets for LuADCs. We discovered that the expression of 600 genes is consistently altered in LuADCs and is significantly associated with OS of LuADC patients. Our study created a robust 27-gene prognostic signature that could predict patient overall survival independent of age, gender and clinical stage. This signature could guide adjuvant therapy for LuADC patients and includes novel potential molecular targets for therapy.
Materials and Methods
Data Sets Used in this Study
Gene transcript data of normal and LuADC tissues was obtained from NCBI Gene Expression Omnibus (GEO) accession numbers: GSE31210, GSE19188 and GSE19804. Normal lung gene transcript data used for generating gene expression correlation networks were obtained from GTEx (website for: gtexportal.org/home/datasets) using the RPKM normalized gene transcript counts table [15,16].
Statistical Analysis
GEO2R was used to calculate the differential expression of tumor versus normal tissue using a fold-change cut-off of 5 and adjusted p-value<0.0001. Association of differentially expressed genes with OS in LuADC patients was assessed using Kaplan-Meier plotter (website for: kmplot.com), including KM survival analysis, hazard ratio (HR) with 95% confidence intervals and log-rank p-value for each gene [17]. The Cytoscape plugin ClueGO was used to assess overrepresentation of Gene Ontology categories in biological networks (adjusted p<0.001 was used as a threshold for significance) [18].
Gene Co-Expression Network Construction
Gene expression Spearman correlation coefficients were calculated in “R” for the 600 genes that were differentially expressed between LuADC and normal tissue samples and significantly associated with OS of LuADC patients. A gene network was generated where nodes represent individual genes and edges connecting nodes were drawn when the absolute correlation coefficient met the threshold |R|≥0.7 (adjusted p-value≤0.001). Gene co-expression networks were generated for normal lung gene expression data (GTEx) and lung adenocarcinoma (TCGA) and visualized using Cytoscape 3.4.0 (website for: cytoscape.org). DyNet was used to highlight differences between the two networks based on node and edge presence, and ClueGO was used to identify significantly enriched biological pathways [18,19].
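The edge rule described above can be illustrated with a minimal sketch. It is not the code used in the study (which was run in “R”): the gene names and expression values are hypothetical, the functions are written here for illustration only, and the adjusted p-value filter is omitted for brevity.

```python
from itertools import combinations


def rank(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def coexpression_edges(expression, threshold=0.7):
    """Return (gene1, gene2, rho) for every pair with |rho| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        rho = spearman(expression[g1], expression[g2])
        if abs(rho) >= threshold:
            edges.append((g1, g2, rho))
    return edges


# Hypothetical expression values for four genes across five samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 3.9, 6.2, 8.1, 9.7],
    "GENE_C": [5.0, 4.0, 3.0, 2.0, 1.0],
    "GENE_D": [3.0, 1.0, 4.0, 2.0, 5.0],
}
edges = coexpression_edges(expression)
for g1, g2, rho in edges:
    print(g1, g2, round(rho, 3))
```

Using the absolute value of the coefficient keeps strongly anti-correlated gene pairs as edges, which matches the |R|≥0.7 threshold.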
Gene Expression Signature-Based Prognostic Risk Score
100 random selections of 350 patients with LuADC were extracted from TCGA dataset and used as a training set to isolate a biomarker panel associated with OS. The remaining 167 patients for each selection were used as a test set to validate the prognostic significance of the biomarker panel. A forward-conditional Cox regression using all 600 genes as covariates was performed using SPSS on each of the training sets in order to isolate the biomarker panel. The results of each test were recorded and the genes that appeared in more than half of the training sets were included in our biomarker panel.
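The resampling and majority-vote selection described above can be sketched as follows. This is an illustrative outline only: `select_genes` is a hypothetical callback standing in for the forward-conditional Cox regression (performed in SPSS in the study), and the toy selector below returns made-up gene names purely to show how the frequency filter behaves.

```python
import random
from collections import Counter


def random_splits(patient_ids, n_splits=100, train_size=350, seed=0):
    """Yield (training, test) patient lists for repeated random splits."""
    rng = random.Random(seed)
    for _ in range(n_splits):
        train = rng.sample(patient_ids, train_size)
        in_train = set(train)
        test = [p for p in patient_ids if p not in in_train]
        yield train, test


def stable_genes(patient_ids, select_genes, n_splits=100, train_size=350, seed=0):
    """Keep genes selected in more than half of the training sets."""
    counts = Counter()
    for train, _test in random_splits(patient_ids, n_splits, train_size, seed):
        counts.update(set(select_genes(train)))
    return sorted(g for g, c in counts.items() if c > n_splits / 2)


# Hypothetical selector: a placeholder for per-training-set Cox regression.
calls = {"n": 0}


def toy_selector(train):
    k = calls["n"]
    calls["n"] += 1
    genes = ["GENE_A"]            # selected in every training set
    if k % 3 == 0:
        genes.append("GENE_B")    # selected in 34 of 100 training sets
    if k % 4 != 0:
        genes.append("GENE_C")    # selected in 75 of 100 training sets
    return genes


patients = list(range(517))       # 517 TCGA LuADC samples, as in the text
panel = stable_genes(patients, toy_selector)
print(panel)
```

With 517 patients and a training size of 350, each split leaves the remaining 167 patients as the test set.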
Cox regression was repeated on all 100 training sets using our 27-gene signature as covariates using the forced-entry (enter) method to obtain the co-efficient values for each biomarker. The resulting 100 co-efficient values of each biomarker were averaged to estimate the true co-efficient value of each gene. A prognostic scoring system was created based on this formula:
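The formula itself is not reproduced in this text. As an assumption, a score built from Cox regression coefficients is conventionally the linear predictor of the model, which would take the following form, where the symbols are introduced here for illustration: x_i is the expression level of gene i, β_i^(k) is the coefficient of gene i estimated in training set k, and the bar denotes the average over the 100 training sets described above.

```latex
\text{Prognostic score} = \sum_{i=1}^{27} \bar{\beta}_i \, x_i ,
\qquad
\bar{\beta}_i = \frac{1}{100} \sum_{k=1}^{100} \beta_i^{(k)}
```

Under this form, a gene with a positive mean coefficient raises the score as its expression increases, and a gene with a negative mean coefficient lowers it.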
The patients were ranked by their prognostic scores and divided into three equal sized cohorts. Kaplan-Meier plots were constructed and a log-rank test was used to determine differences in OS of LuADC patients.
Prognostic scores for each of the test set samples were then calculated using the same set of mean co-efficient values developed in the training set. Patients were ranked based on their prognostic scores and divided into three cohorts based on the average prognostic score at cut-point in the training sets. Kaplan-Meier plots were constructed and a log-rank test was used to determine differences among OS in all testing sets.
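The scoring and stratification steps above can be sketched as follows, assuming the score is a coefficient-weighted sum of expression values (the usual Cox linear predictor). The coefficients, gene names, and scores are hypothetical, and the Kaplan-Meier and log-rank steps are not shown. Group labels follow the "good", "intermediate" and "poor" terminology used in the comparison of signatures, with a higher score corresponding to a poorer prognosis.

```python
from statistics import mean


def prognostic_score(expression, coefficients):
    """Weighted sum of expression values (assumed linear-predictor form)."""
    return sum(coefficients[g] * expression[g] for g in coefficients)


def tertile_cut_points(scores):
    """Return the lowest score of the middle and of the top third."""
    ordered = sorted(scores)
    n = len(ordered)
    return ordered[n // 3], ordered[2 * n // 3]


def mean_cut_points(training_score_sets):
    """Average the tertile cut-points over all training sets."""
    cuts = [tertile_cut_points(s) for s in training_score_sets]
    return mean(c[0] for c in cuts), mean(c[1] for c in cuts)


def assign_group(score, low_cut, high_cut):
    """Higher score means poorer predicted overall survival."""
    if score < low_cut:
        return "good"
    if score < high_cut:
        return "intermediate"
    return "poor"


# Hypothetical mean coefficients for a two-gene toy signature.
coefficients = {"GENE_A": 0.5, "GENE_B": -0.25}

# Hypothetical prognostic scores from two training sets.
training_score_sets = [
    [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5],
    [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0],
]
low_cut, high_cut = mean_cut_points(training_score_sets)

# Test-set patients are scored with the same coefficients and cut-points.
patient_1 = {"GENE_A": 4.0, "GENE_B": 2.0}
patient_2 = {"GENE_A": 1.0, "GENE_B": 4.0}
for patient in (patient_1, patient_2):
    score = prognostic_score(patient, coefficients)
    print(score, assign_group(score, low_cut, high_cut))
```

Fixing the cut-points from the training sets, rather than re-deriving tertiles in each test set, keeps the test set evaluation independent of its own score distribution.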
To further validate our biomarker panel, mRNA expression levels for the 27-gene signature were obtained from four additional datasets (GSE42127, GSE31210, GSE37745 and GSE30219). New coefficients for the 27 genes were obtained from Cox regression. Prognostic scores for all patients were calculated and patients were ranked based on their scores and divided into three equal sized cohorts. Kaplan-Meier analysis and a log-rank test were used to determine differences in survival.
While the present invention has been described with reference to the specific embodiments thereof, it should be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the true spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation, material, composition of matter, process, process step or steps, to the objective, spirit and scope of the present invention. All such modifications are intended to be within the scope of the claims appended hereto.
The application claims priority to U.S. Provisional Patent Application Ser. No. 62/573,057, filed Oct. 16, 2017, which is herein incorporated by reference in its entirety.
The invention was made with government support under Contract Nos. DE-AC02-05CH11231 awarded by the U.S. Department of Energy and Grant No. R01CA116481 awarded by the NIH. The government has certain rights in the invention.
Number | Date | Country
---|---|---
62573057 | Oct 2017 | US