This invention is generally related to cancer diagnostic methods and uses thereof.
Worldwide, cervical cancer is the most common and deadliest gynecologic malignancy, accounting for an estimated 570,000 new cases and 311,000 deaths each year (Bray, et al., CA Cancer J Clin., 68:394 (2018)). Despite efforts in screening and human papillomavirus (HPV) vaccine adoption, cervical cancer remains a persistent health challenge for women in the United States, with 13,170 new cases and 4,250 deaths estimated for 2019 (Siegel, et al., CA Cancer J Clin., 69:7 (2019)). Survival for women with cervical cancer has not significantly improved since the mid-1970s, in contrast to the majority of other common cancers in the United States (Jemal, et al., J Natl Cancer Inst., 109 (2017)). While early-stage cervical cancer can be successfully treated, with 5-year overall survival (OS) rates as high as 97%, metastatic cervical cancer is virtually incurable, with 5-year OS rates below 10% (Quinn, et al., Int J Gynaecol Obstet., 95 Suppl 1:S43-103 (2006)). The prognosis for patients with recurrent cervical cancer also remains poor. The mortality risk for metastatic or recurrent cervical cancer is high: median OS remains below 1.5 years, even with the 3.5-month gain in median OS shown in GOG 240 by adding bevacizumab to first-line systemic platinum-based combination chemotherapy (Tewari, et al., N Engl J Med., 370:734 (2014); Tewari, et al., Lancet 390: 1654 (2017)). Therefore, new approaches are needed to better identify and treat patients with cervical cancer at high risk of recurrence and death.
A major focus in improving systemic treatment of cervical cancer involves developing a better understanding of the genomic, transcriptomic, and proteomic underpinnings and heterogeneity of the disease. The central tenet in the pathogenesis of cervical cancer is the involvement of HPV, which can be found in up to 99.7% of cervical cancers (Walboomers, et al., J Pathol., 189:12 (1999)). Despite the near-universal contribution of HPV to cervical carcinogenesis, there is wide variance in the risk of cancer associated with the different types of carcinogenic HPV, as well as the association of types of carcinogenic HPV with the different histologic subtypes (squamous cell carcinoma and adenocarcinoma) of cervical cancer (Li, et al., Int J Cancer, 128:927 (2011)).
To further advance the molecular understanding of cervical cancer, The Cancer Genome Atlas (TCGA) project recently published their analysis of 228 primary cervical cancers (Cancer Genome Atlas Research Network, Nature, 543:378 (2017)). While the results from that project noted a number of novel molecular features, the integrated clustering, which identified 3 main subgroups (keratin-low squamous, keratin-high squamous, adenocarcinoma), was not based on patient outcomes such as survival. A proteomic grouping was associated with differences in survival, but that grouping was (a) not primarily based on patient outcomes and (b) used as a small component of the integrative clustering that resulted in the featured novel subgroups (of note, the prognostic value of the proteomic grouping was recently validated by a separate group and dataset (Rader, et al., Gynecol Oncol., 155:324 (2019))). Further, no data was reported by TCGA to show that differences in the main novel cervical cancer subgroups were associated with differences in clinically relevant outcomes. Several other studies have investigated the genomic contributions to differences in clinical outcomes in cervical cancer, but outcomes were typically not a starting point in those studies, and their sample sizes were much smaller than TCGA (Barron, et al., PLoS One, 10:e0137397 (2015); Espinosa, et al., PLoS One, 8:e55975 (2013); Medina-Martinez, et al., PLoS One, 9:e97842 (2014); Wright, et al., Cancer, 119:3776 (2013)). Other groups have evaluated the potential of micro-RNA signatures for use as prognostic biomarkers, but results have been mixed and the most promising of those signatures did not validate (How, et al., PLoS One, 10:e0123946 (2015); Liu et al., Oncotarget, 7:56690 (2016); Zeng et al., J Cell Biochem, 119:1558-1566 (2018)). Further, it is unclear whether the findings in the above studies were confounded by fundamental differences between the 2 major histologic subtypes of cervical cancer (squamous cell vs. adenocarcinoma), which arise from separate sites of the cervix and have different molecular profiles (Wright, et al., Cancer, 119:3776 (2013)).
Locally advanced cervical cancer can be treated with surgery, radiation, chemoradiation (CRT), or a combination of these modalities (Rose et al. 1999, Cohen et al. 2019). These options work well in patients who have localized disease as the 5-year survival is 85 to 90% (Cohen et al. 2019, Kim, Choi, and Byun 2000). One challenge for clinicians is determining which patients will need adjuvant treatment following surgery, as the use of dual modality therapy has been associated with considerable morbidity (Peters III et al. 2000, Sedlis et al. 1999). Therefore, the decision to recommend surgery or CRT is multifactorial, taking into account comorbidities, available pathologic and imaging data, and the side effect profiles of the different treatment modalities (Landoni et al. 2017, Vistad, Fossa, and Dahl 2006). A prognostic score capable of predicting survival and treatment response would greatly help in counseling patients on potential options at the time of diagnosis or after surgery.
To date, there have been few genetic scores developed for cervical cancer (Wong et al. 2003, Wang et al. 2019, Huang et al. 2012). These prior studies have been limited by sample size, lack of validation, or no association with treatment response.
Therefore, there is a need to identify sets of genes that distinguish subgroups with large and clinically meaningful survival differences, and to develop a genetic risk score capable of predicting prognosis and stratifying patients into those who will or will not respond to primary therapy.
It is an object of the invention to provide methods and reagents for diagnosis or assisting in the prognosis of cancer.
Survival for patients with newly diagnosed cervical cancer has not significantly improved over the past several decades. Disclosed herein is a clinically relevant set of prognostic genes for squamous cell carcinoma of the cervix (SCCC), the most common cervical cancer subtype. Using RNA-sequencing data and survival data from 203 patients in The Cancer Genome Atlas (TCGA), a series of analyses using different decile and quartile cutoffs for gene expression was performed to identify genes that could indicate large and consistent survival differences across different cutoffs of gene expression. Those analyses identified 40 prognostic genes that have the greatest utility to stage cervical cancer and include the following: EGLN1, CD46, PLOD1, QSOX1, TM2D1, PEAR1, FKBP9, NRP1, GALNT2, TMED4, KIRREL, LAMC1, SDF4, COPA, FNDC3A, GALNT3, PLK1S1, ANGPTL4, APCDD1L, ZNF281, MMS19, GPR27, MTDH, LIF, BRSK1, GLG1, KBTBD2, PFKP, CD59, PLAGL1, PRR12, KBTBD6, GRB10, ZC3H12C, FSD1L, AIMP2, ZNF701, RPS6KA2, TMEM167A, RNF145, and combinations thereof. In one embodiment, a patient's survivability is estimated by using gene expression levels of each of the individual 40 genes and more importantly by using a machine learning (ML) algorithm such as Ridge regression to calculate a Ridge regression score, wherein a smaller Ridge regression score indicates lower survivability than a larger score. Other machine learning methods can also be used to calculate a transcriptomic risk score similar to Ridge Regression Score. In some embodiments, a Ridge regression score is calculated by using any combination of two to thirty-nine genes or all forty genes disclosed above. In some embodiments the RNA gene expression of 2, 5, 10, 15, 20, 25, 30, 35 or all 40 genes is used to calculate the subject's Ridge regression score. 
In one embodiment, large numbers of Ridge regression models can be created by randomly sampling subsets of patients from all patients in the total data and using the expression of the 40 genes in each subset. In still another embodiment, the final staging of a patient is determined by the consensus of two or more Ridge regression models created using the expression levels of any combination of the forty genes disclosed above. This transcriptomic biomarker can better predict survival than clinical prognostic factors, including the stage of the cancer in the subject.
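As a hedged illustration of how a Ridge regression score could be computed from gene expression values, the following Python sketch fits ridge coefficients in closed form and scores a patient as a weighted sum of expression values. The response variable `y`, the penalty `lam`, the omission of an intercept, and the names `ridge_fit` and `ridge_score` are assumptions made for illustration only; the disclosure does not specify them.

```python
def ridge_fit(X, y, lam=1.0):
    """Solve (X^T X + lam * I) beta = X^T y by Gauss-Jordan elimination.

    X is a list of rows (one row per patient, one column per gene).
    No intercept is fitted; this is a simplifying assumption.
    """
    n, p = len(X), len(X[0])
    # Augmented matrix [X^T X + lam * I | X^T y].
    A = [
        [sum(X[k][i] * X[k][j] for k in range(n)) + (lam if i == j else 0.0)
         for j in range(p)]
        + [sum(X[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


def ridge_score(beta, expression):
    """Score one patient as the coefficient-weighted sum of gene expression."""
    return sum(b * x for b, x in zip(beta, expression))
```

With `lam=0` the fit reduces to ordinary least squares; a positive `lam` shrinks the coefficients toward zero, which is the property that makes ridge regression suitable when many correlated genes are used together.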
One embodiment provides a method of assessing a patient's survivability by determining RNA levels of one or more of the genes selected from the group consisting of EGLN1, CD46, PLOD1, QSOX1, TM2D1, PEAR1, FKBP9, NRP1, GALNT2, TMED4, KIRREL, LAMC1, SDF4, COPA, FNDC3A, GALNT3, PLK1S1, ANGPTL4, APCDD1L, ZNF281, MMS19, GPR27, MTDH, LIF, BRSK1, GLG1, KBTBD2, PFKP, CD59, PLAGL1, PRR12, KBTBD6, GRB10, ZC3H12C, FSD1L, AIMP2, ZNF701, RPS6KA2, TMEM167A, RNF145, and subcombinations thereof from a sample from the patient and comparing the patient's RNA levels to RNA levels of reference samples with a known survivability assignment. In some embodiments the RNA gene expression of 1, 2, 5, 10, 15, 20, 25, 30, 35 or all 40 genes is used to generate a survivability assignment or transcriptomic risk score (TRS). Gene expression levels can be determined using RT-PCR, microarrays, RNAseq, or other standard molecular biology techniques. The method also includes generating multigenic models using modeling techniques including, but not limited to, machine learning such as Ridge regression and deep learning to compute transcriptomic risk scores for patients. The transcriptomic risk scores are then used to stratify patients into low, intermediate, and high TRS groups using plurality voting of the models. In some embodiments the models are predictive models for predicting the survivability of the patient. The disclosed methods can be used to estimate survival time of the patient, estimate treatment outcome, inform decisions on therapeutic options, and assist in the selection of new therapies versus traditional therapies. For a patient with a high risk score, more aggressive treatment would be selected for the patient.
One embodiment provides a method for staging a patient's cervical cancer by using gene expression levels in one or more of the following genes or any combination thereof: EGLN1, CD46, PLOD1, QSOX1, TM2D1, PEAR1, FKBP9, NRP1, GALNT2, TMED4, KIRREL, LAMC1, SDF4, COPA, FNDC3A, GALNT3, PLK1S1, ANGPTL4, APCDD1L, ZNF281, MMS19, GPR27, MTDH, LIF, BRSK1, GLG1, KBTBD2, PFKP, CD59, PLAGL1, PRR12, KBTBD6, GRB10, ZC3H12C, FSD1L, AIMP2, ZNF701, RPS6KA2, TMEM167A, RNF145, and using Ridge regression, a machine learning (ML) algorithm, to calculate a Ridge regression score for the patient, wherein the Ridge regression score is compared to a Ridge regression score of patients with a known stage of cervical cancer to stage the patient's cervical cancer. In some embodiments the RNA gene expression of 2, 5, 10, 15, 20, 25, or all 40 genes is used to calculate the subject's Ridge regression score. In one embodiment, thousands or more Ridge regression models can be created by randomly sampling subsets of patients from all patients in the total data. In still another embodiment, the final staging of a patient is determined by the consensus of two or more Ridge regression models created using subsets of patients and subsets of the 40 genes disclosed above.
Another embodiment provides a method of prognosing and treating cervical carcinoma in a subject in need thereof by quantifying RNA gene expression in a sample, wherein the genes include one or more of the 40 genes or any combination thereof selected from the group consisting of EGLN1, CD46, PLOD1, QSOX1, TM2D1, PEAR1, FKBP9, NRP1, GALNT2, TMED4, KIRREL, LAMC1, SDF4, COPA, FNDC3A, GALNT3, PLK1S1, ANGPTL4, APCDD1L, ZNF281, MMS19, GPR27, MTDH, LIF, BRSK1, GLG1, KBTBD2, PFKP, CD59, PLAGL1, PRR12, KBTBD6, GRB10, ZC3H12C, FSD1L, AIMP2, ZNF701, RPS6KA2, TMEM167A, RNF145; calculating the subject's Ridge regression score (RRS) by examining the expression levels of the genes and the inter-dependence between two or more of the 40 genes, wherein a higher TRS indicates that the patient may not respond to therapy. The method further includes the step of administering radiation and/or chemotherapy to the patient diagnosed with cervical carcinoma. In some embodiments the RNA gene expression of 1, 2, 5, 10, 15, 20, 25, or all 40 genes is used to calculate the subject's TRS. The groups of genes used can be in any combination of the 40 genes disclosed above in groups of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 30, 35, and 40.
Still another embodiment provides a method of prognosing and treating cervical carcinoma in a subject in need thereof by generating different machine learning models (ML models) using gene expression of two or more genes selected from the group consisting of EGLN1, CD46, PLOD1, QSOX1, TM2D1, PEAR1, FKBP9, NRP1, GALNT2, TMED4, KIRREL, LAMC1, SDF4, COPA, FNDC3A, GALNT3, PLK1S1, ANGPTL4, APCDD1L, ZNF281, MMS19, GPR27, MTDH, LIF, BRSK1, GLG1, KBTBD2, PFKP, CD59, PLAGL1, PRR12, KBTBD6, GRB10, ZC3H12C, FSD1L, AIMP2, ZNF701, RPS6KA2, TMEM167A, RNF145, and combinations thereof from subsets of patients in a dataset and the plurality voting of the top models to calculate a TRS, wherein a higher TRS indicates that the patient has worse survivability than a lower TRS. The method further includes the step of administering radiation and/or chemotherapy to the patient diagnosed with cervical carcinoma.
One embodiment provides a method of diagnosing and treating cervical carcinoma in a subject in need thereof by generating different machine learning models (ML models) using gene expression of one or more genes selected from the group consisting of EGLN1, CD46, PLOD1, QSOX1, TM2D1, PEAR1, FKBP9, NRP1, GALNT2, TMED4, KIRREL, LAMC1, SDF4, COPA, FNDC3A, GALNT3, PLK1S1, ANGPTL4, APCDD1L, ZNF281, MMS19, GPR27, MTDH, LIF, BRSK1, GLG1, KBTBD2, PFKP, CD59, PLAGL1, PRR12, KBTBD6, GRB10, ZC3H12C, FSD1L, AIMP2, ZNF701, RPS6KA2, TMEM167A, RNF145, and combinations thereof from subsets of patients in a dataset and the plurality voting of the top models to calculate a TRS for each model, wherein a TRS is associated with survivability of cervical carcinoma. The method further includes the step of administering radiation and/or chemotherapy to the patient diagnosed with cervical carcinoma. In some embodiments the RNA gene expression of 2, 5, 10, 15, 20, 25, 30, or all 40 genes is used to calculate the subject's TRS. The groups of genes used can be in any combination of the 40 genes disclosed above, for example any combination of genes in groups of 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 25, 30, 35 and 40.
One embodiment provides a method of diagnosing and treating cervical carcinoma in a subject in need thereof by developing a transcriptomic risk score (TRS) from expression data of one or more of the genes selected from the group consisting of EGLN1, CD46, PLOD1, QSOX1, TM2D1, PEAR1, FKBP9, NRP1, GALNT2, TMED4, KIRREL, LAMC1, SDF4, COPA, FNDC3A, GALNT3, PLK1S1, ANGPTL4, APCDD1L, ZNF281, MMS19, GPR27, MTDH, LIF, BRSK1, GLG1, KBTBD2, PFKP, CD59, PLAGL1, PRR12, KBTBD6, GRB10, ZC3H12C, FSD1L, AIMP2, ZNF701, RPS6KA2, TMEM167A, RNF145, and combinations thereof, wherein the TRS predicts prognosis and stratifies patients into those who will respond well or poorly to primary therapy. In some embodiments, the TRS identifies patients who may not need aggressive therapies such as chemotherapy or radiation therapy; and the TRS also identifies patients who do not respond well to primary therapy and need therapies with better efficacy. In some embodiments the RNA gene expression of 2, 5, 10, 15, 20, 25, 30, 35 or all 40 genes is used to calculate the subject's TRS. The groups of genes used can be in any combination of the forty genes disclosed above.
Another embodiment provides a method for identifying biological pathways that can be targeted to improve the poor prognosis of those patients with disease predicted to be unresponsive to chemotherapy and radiation therapy. For example, one or more of the 40 genes recited above can be targeted to modulate their expression to improve a poor prognosis of a patient.
It should be appreciated that this disclosure is not limited to the compositions and methods described herein as well as the experimental conditions described, as such may vary. It is also to be understood that the terminology used herein is for the purpose of describing certain embodiments only, and is not intended to be limiting, since the scope of the present disclosure will be limited only by the appended claims.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs. Any compositions, methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention. All publications mentioned are incorporated herein by reference in their entirety.
The use of the terms “a,” “an,” “the,” and similar referents in the context of describing the presently claimed invention (especially in the context of the claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context.
Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein.
Use of the term “about” is intended to describe values either above or below the stated value in a range of approx. +/−10%; in other embodiments the values may range in value either above or below the stated value in a range of approx. +/−5%; in other embodiments the values may range in value either above or below the stated value in a range of approx. +/−2%; in other embodiments the values may range in value either above or below the stated value in a range of approx. +/−1%. The preceding ranges are intended to be made clear by context, and no further limitation is implied. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.
Disclosed herein is a clinically relevant set of prognostic genes for squamous cell carcinoma of the cervix (SCCC), the most common cervical cancer subtype. Forty (40) genes were identified that individually predict SCCC patient survival (
In current clinical practice, there is no prognostic biomarker for cervical cancer. Factors that inform adjuvant treatment for early cervical cancer include: a risk stratification based on stromal invasion, lymphovascular space invasion, and tumor diameter (Sedlis, et al., Gynecol Oncol., 73:177 (1999)) (intermediate-risk disease: give pelvic radiotherapy); criteria for high-risk of recurrence and death (positive margins, positive lymph nodes, parametrial involvement) that merit chemoradiation (Peters, et al., J Clin Oncol., 18:1606-1613 (2000)). For locally advanced cervical cancer, chemoradiation is standard of care; the benefit of additional chemotherapy given after chemoradiation is currently under investigation (ClinicalTrials.gov Identifier: NCT01414608). Stage and lymph node status can influence treatment planning for cervical cancer, but those factors may miss some patients at high risk for mortality.
Data presented herein raise the concern that early stage may underestimate mortality in some patients, as approximately 40.9% of early-stage (stage I and II) patients in the studied TCGA SCCC population had high TRS and poor survival. Given the finding that TRS appears to outperform stage and lymph node status as a prognostic variable, it warrants further investigation as a biomarker for SCCC. Such investigation would be especially important to a poor-prognosis subgroup of earlier-stage SCCC patients with high TRS, who might be under-treated relative to their prognosis based on clinical factors alone.
Another important observation was that 47.8% of late stage (stage III and IV) SCCC had low TRS associated with good survival, which would suggest that a subset of late stage SCCC patients may have an overestimation of mortality risk with clinical factors alone. Further investigation in more patients would be needed to confirm the presence and degree of prognosis-modifying impact of low TRS in patients with stage III SCCC. However, the finding of 2 within-stage TRS subgroups prognostically different than expected based on stage alone strongly suggests that TRS is not completely confounded by stage.
This disclosure also provides a new perspective on gene expression in SCCC with respect to survival. The approach taken is quite different from prior studies in several respects: clinical outcomes were not a starting point in those studies, sample sizes were much smaller than TCGA, analysis was limited to specific gene types (e.g., micro-RNAs), and/or the inclusion of both major histologic subtypes may have confounded the genomic analyses (Cancer Genome Atlas Research Network, Nature, 543:378 (2017); Barron, et al., PLoS One, 10:e0137397 (2015); Espinosa, et al., PLoS One, 8:e55975 (2013); Medina-Martinez, et al., PLoS One, 9:e97842 (2014); Wright, et al., Cancer, 119:3776 (2013); How, et al., PLoS One, 10:e0123946 (2015); Liu et al., Oncotarget, 7:56690 (2016); Zeng et al., J Cell Biochem, 119:1558-1566 (2018)). In contrast, this work leveraged the relatively high number of SCCC patients with both gene expression and survival data and avoided the pitfalls of grouping multiple histologic subtypes into a single-omic analysis. Further, an analysis was conducted through the lens of clinical relevance (i.e., who survived and who died?). While the finding of a transcriptomic risk gene signature for SCCC has not yet been validated with a separate data set, a strength of this disclosure is its focus on genes showing large and consistent survival differences at multiple cutoffs. Such genes are more likely to be validated in other datasets and be clinically relevant.
One innovation of this work is the discovery and selection of a large number of prognostic models based on machine learning. This is achieved by sampling thousands of subsets of patients for training the models and using the remaining samples for testing the models. Only those models that provide excellent prognostic potential in both training and testing are kept and used for prognostic prediction. Furthermore, selected models are validated by a bootstrapping procedure.
Another innovation is the use of plurality voting, or consensus, of many excellent machine learning models to determine the final assignment of each patient to different survival groups. This approach results in a biomarker that is much more likely to be applicable to future samples because the classification does not depend on any individual gene or any individual model.
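The plurality-voting step can be sketched as follows. This is a minimal Python illustration, not the disclosed implementation: the group labels, the function names `plurality_vote` and `classify_patients`, and the tie-breaking behavior (the label counted first wins a tie) are assumptions.

```python
from collections import Counter


def plurality_vote(model_votes):
    """Return the most frequent group label and the fraction of models voting for it.

    Ties are resolved in favor of the label encountered first (an assumption).
    """
    counts = Counter(model_votes)
    label, n_votes = counts.most_common(1)[0]
    return label, n_votes / len(model_votes)


def classify_patients(votes_by_patient):
    """Assign each patient to the group chosen by the plurality of models."""
    return {
        patient_id: plurality_vote(votes)[0]
        for patient_id, votes in votes_by_patient.items()
    }
```

Reporting the winning fraction alongside the label gives a simple measure of how consistently the models agree on a given patient.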
Forty genes were found that are highly associated with survival in SCCC (Table 5). Among TCGA SCCC patients analyzed, survival prognosis worsened with (a) increasing expression level for each individual high-risk gene or (b) a greater number of those genes with high expression level in a patient's tumor. These findings suggest the importance of the transcriptomic risk load on survival. The plurality voting of ML models appears to have better prognostic ability than any reported prognostic marker for SCCC, including stage and lymph node status. Although the clinical application of these discoveries will require validation in other datasets, this disclosure provides a roadmap towards a clinically meaningful prognostic biomarker for SCCC.
A. 25 and 20 Gene Signatures
To generate models for survival prediction, the entire dataset was randomly divided into training and testing subsets, each containing 50% of the patients. This process was repeated 3000 times to generate 3000 pairs of training and testing subsets of patients. In the first round of analyses, Ridge regression analysis was carried out for each subset of patients to calculate a Ridge regression score (RRS) using all 40 genes, together with the relative contribution of each gene to the RRS. Proportional hazard analysis for survival was carried out to calculate the hazard ratio (HR) and p value using the RRS. All 3000 models were evaluated using their HRs in the training and testing pair. The HRs were greater than 5 in both the training and testing subsets for 86 models, which were then used to rank the 40 genes based on the mean contribution of each gene to the RRS across all 86 top models. The top 25 genes were selected and are shown in Table 1.
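The resampling and model-selection procedure described above can be sketched in Python as follows. This is an illustrative outline under stated assumptions: `hazard_ratio_fn` is a hypothetical callable standing in for the Ridge regression and proportional hazard steps, and the function names and the dictionary layout of per-model gene contributions are not taken from the disclosure.

```python
import random


def random_half_splits(patient_ids, n_splits, seed=0):
    """Yield (training, testing) pairs, each holding about half of the patients."""
    rng = random.Random(seed)
    ids = list(patient_ids)
    half = len(ids) // 2
    for _ in range(n_splits):
        shuffled = ids[:]
        rng.shuffle(shuffled)
        yield shuffled[:half], shuffled[half:]


def select_models(splits, hazard_ratio_fn, hr_threshold=5.0):
    """Keep splits whose model exceeds the HR threshold in both subsets.

    hazard_ratio_fn(train, test) is a hypothetical callable returning
    (hr_train, hr_test) for a model fitted on the training subset.
    """
    kept = []
    for train, test in splits:
        hr_train, hr_test = hazard_ratio_fn(train, test)
        if hr_train > hr_threshold and hr_test > hr_threshold:
            kept.append((train, test, hr_train, hr_test))
    return kept


def rank_genes(contributions_per_model):
    """Rank genes by their mean contribution across the retained models."""
    n_models = len(contributions_per_model)
    genes = contributions_per_model[0].keys()
    mean = {
        g: sum(model[g] for model in contributions_per_model) / n_models
        for g in genes
    }
    return sorted(mean, key=mean.get, reverse=True)
```

In this sketch, calling `random_half_splits(ids, 3000)` would correspond to the 3000 training/testing pairs, and the first 25 entries returned by `rank_genes` would correspond to the top 25 genes.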
1. Ridge Models with the Top 25 and Top 20 Genes
Three thousand (3000) random training/testing pairs were generated and Ridge regression analyses conducted with the 25 top genes. A total of 190 models yielded HRs greater than 5 in both the training and testing subsets, more than doubling the 86 models with comparable performance from the 40-gene analysis and suggesting that the top 25 genes perform much better than the full set of 40 genes. Data for 40 selected models are shown in Table 2. Furthermore, 3000 Ridge regression models were developed and assessed with the top 20 genes, and 245 models had HRs >5 in both training and testing, nearly tripling the number of excellent models from the 40-gene analysis but only modestly exceeding the number from the 25-gene analysis. As quality control for the analytical pipeline, 3000 models were also developed and assessed using the bottom 20 genes that were not used in the top 20 gene analyses. Interestingly but not surprisingly, none of the 3000 bottom 20-gene models yielded HRs >5 in both training and testing, and only 28 models had HR >3 in both training and testing. These results suggest that the gene selection pipeline presented herein performs very well.
2. Consensus Model
Whether and how consistently different Ridge regression models classify each patient to the RRS groups was examined next. For this purpose, the top 40 models from the 25-gene analysis and top 40 models from the 20-gene analysis were selected based on the ranking of mean HRs from the bootstrapping analyses (Table 2) to examine how each of the 80 models vote on the classification of each patient. As shown in
One embodiment provides a method of diagnosing and treating cervical carcinoma in a subject in need thereof by quantifying RNA gene expression in a sample, wherein the genes include the 40 genes listed in Table 1; calculating the subject's survival risk score by determining the expression level of genes and their relationships;
One embodiment provides a method of diagnosing and treating cervical carcinoma in a subject in need thereof by quantifying RNA gene expression in a sample, wherein the genes include one or more of the 40 genes listed in Table 1; calculating the subject's Ridge regression score (RRS) by examining the expression levels of genes and the relationships between two to forty different genes;
One embodiment provides a method of diagnosing and treating cervical carcinoma in a subject in need thereof by generating different ML models using subsets of patients in a dataset and the plurality voting of the top models;
One embodiment provides a method of diagnosing and treating cervical carcinoma in a subject in need thereof by developing a transcriptomic risk score (TRS) or Ridge regression score (RRS) capable of predicting prognosis and stratifying patients into those who will respond well or poorly to primary therapy. Furthermore, the TRS/RRS would also help identify patients who may not need aggressive therapies such as chemotherapy or radiation therapy, as well as patients who do not respond well to primary therapy and need therapies with better efficacy.
Another embodiment provides insight towards potential pathways which could be targeted to improve the poor prognosis of those patients with disease predicted to be unresponsive to chemotherapy and radiation therapy.
3. Overall Survival and Response to Primary Therapy in SCCC
Cervical cancer remains a major contributor to female mortality worldwide (Cohen et al. 2019). For women diagnosed with locally advanced disease, the cornerstone of treatment remains a combination of radiation, chemotherapy, and surgery (Landoni et al. 1997) (Rose et al. 1999). However, therapeutic decisions remain multifactorial, based on stage, pathologic variables, and treatment side effects. Unfortunately, none of these factors predicts treatment response or identifies molecular alterations that could be targeted to extend survival and guide treatment recommendations (Sedlis et al. 1999) (Peters III et al. 2000).
For these reasons the score presented herein has the potential to greatly impact care of cervical cancer patients. First, the genetic score was able to identify patients who have an excellent prognosis whether they received radiation or not. Even early-stage patients in the low-risk group with poor pathologic findings such as LVSI or advanced stage had 5-year overall survival exceeding 80%. This may allow patients with low-risk scores to be triaged to the treatment that will result in the greatest long-term quality of life for them. Furthermore, with additional studies this work may serve to replace the Sedlis or Peters criteria (Sedlis et al. 1999, Peters III et al. 2000). Second, those in the high-risk group demonstrated a phenotype that was remarkably resistant to treatment, as a combination of radiation and chemotherapy did not improve survival among this subgroup of patients. This represents a group of patients who would benefit from novel therapies, early initiation of palliative care, or both.
When examining functions of the 20 genes making up the TRS/RRS score, the three main pathways observed were related to hypoxia survival, which would contribute to radiation resistance; DNA repair activation, which would confer resistance to both chemotherapy and radiation; and immune evasion. PFKP, PLOD1, QSOX1, LIF, ANGPTL4, and GRB10 were all associated with surviving hypoxic conditions or reducing reactive oxygen species, which would theoretically contribute to radiation resistance (Peng et al. 2019, Qi and Xu 2018, Coppock and Thorpe 2006, Liu et al. 2013, Metcalfe 2011, Kim et al. 2011, Holt and Siddle 2005). MMS19, BRSK1, GALNT3, and LIF have roles in the nucleotide excision pathway, the homologous recombination pathway, or the response to DNA damage (Kou et al. 2008, Chen and Vogel 2009, Sheta et al. 2019, Liu et al. 2013, Metcalfe 2011). Last, LIF, NRP1, and CD46 all have essential roles in anti-tumoral immunity, including promoting Tregs, suppressing CD8+ T cells, and decreasing TH1 responses (Liu et al. 2013, Metcalfe 2011, Acharya and Anderson 2020, Cardone, Le Friec, and Kemper 2011).
Prior genetic risk scores in cervical cancer have been limited by the low number of patients or lack of association with treatment response (Wong et al. 2003) (Huang et al. 2012) (Lee et al. 2013). In another paper, which used the TCGA, Wang et al. found a 9-gene combination that was predictive of survival, but did not report the proportion of early-stage patients in each group, the proportion with LVSI, or how patient score related to known pathologic risk factors such as LVSI (Wang et al. 2019). There was only 1 common gene between the Wang et al. score and the score presented herein, PEAR1 (platelet endothelial aggregation receptor 1). Compared with the previously mentioned publications, the work presented herein used a large sample size, had improved survival prediction, and was associated with treatment response (Wong et al. 2003, Huang et al. 2012) (Lee et al. 2013, Wang et al. 2019).
The data represents an exciting advancement in cervical cancer. The score presented herein demonstrated excellent survival prediction along with biologically targetable mechanisms, which could be used to extend patient survival. However, future studies are needed to validate this risk score.
Materials and Methods:
Study Design and Patients:
Data for squamous cell cervical cancer patients from the TCGA patient cohort (n=203) were obtained through the UCSC Xena platform. All patients had level 3, log2-transformed RNAseq data. Patients were divided into overarching stage groups (I, II, III, IV). The FIGO stage breakdown for all patients can be found in Table 3. Overall survival was the primary endpoint of this disclosure.
Statistical Analyses:
Statistical analyses were performed using the R language and environment for statistical computing (Team 2013). Categorical variables were compared using the chi-squared test. Continuous variables were compared using Student's t-test. P-values were considered significant if less than 0.05. In single-gene testing utilizing quartiles, the first quartile was used as the reference.
Results:
Of the 203 squamous cell cervical cancer patients identified, the median age was 47 years, and 47% (n=96) had moderately differentiated tumors. Stage I disease made up 50% of the cohort. A total of 115 patients (57%) had a known lymph node assessment at initial diagnosis, and 118 (58%) underwent adjuvant therapy. Demographic information is further summarized in Table 4. On univariate analysis, stage IV disease, lymphovascular invasion, presence of positive lymph nodes, partial response to primary therapy, and no response to primary therapy were all associated with worse overall survival Table 4 and
Materials and Methods:
TCGA Cervical Squamous Cell Carcinoma (SCC) Dataset:
The RNAseq data (IlluminaHiSeq: log2-normalized count+1) for SCCC from TCGA were downloaded from UCSC Xena (Goldman, et al., bioRxiv, 326470 (2019)). The details regarding the clinical characteristics of this dataset are available in a recent publication from TCGA (Cancer Genome Atlas Research Network, Nature, 543:378 (2017)). The TCGA dataset was used for this disclosure because it has the largest number of patients and the highest quality gene expression data of any publicly available dataset of patients with cervical cancer. Given the inherent molecular differences between the 2 histologic subtypes of cervical cancer, the analysis described herein was focused on SCC. The rationale was that SCC is the most common cervical cancer subtype and there were far more patient-derived samples for SCC than for adenocarcinoma in the TCGA cervical cancer dataset. RNA-seq data for a total of 20,530 genes were available for each patient sample analyzed in this disclosure. Samples were included in this disclosure if they were SCCC and had both RNAseq and OS data available. Accordingly, samples were excluded from the disclosure if they (a) did not contain SCC, (b) contained SCC but were mixed with another histologic subtype (e.g., a mixed SCC and adenocarcinoma tumor), (c) did not contain RNA-seq data, or (d) did not contain OS data.
A total of 203 patients with SCCC met inclusion criteria for this analysis. Median age of the sampled population was 47 years. Median follow-up was 27.3 months. Stage distribution was as follows: I (102; 50.2%), II (50; 24.6%), III (32; 15.8%), IV (14; 6.9%), unknown (5; 2.5%). As of last follow-up, 143 (70.4%) of patients were alive, and 60 (29.6%) had died.
Statistical Analyses:
All statistical analyses were performed using the R language and environment for statistical computing (R version 3.2.2; R Foundation for Statistical Computing; www.r-project.org). The Cox proportional hazards models were used to evaluate the impact of gene expression levels on overall survival. Overall survival data (diagnosis to date of death) were downloaded from TCGA patient phenotype files. Patients who were alive were censored at the date of last follow-up visit. Kaplan-Meier survival analysis and log-rank test were used to compare differences in overall survival between groups classified using different cut-offs of expression level.
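The Kaplan-Meier estimate with right-censoring described above can be illustrated with a minimal Python sketch. This is not the R code used in the disclosure; the function name `kaplan_meier` and the toy follow-up data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    times  -- follow-up time per patient (e.g. months from diagnosis)
    events -- 1 if the patient died at that time, 0 if censored
              (alive at the last follow-up visit)
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)      # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]     # remove deaths and censored patients at t
    return curve


# Hypothetical toy data: six patients, two of them censored.
times = [5, 8, 12, 12, 20, 30]
events = [1, 0, 1, 1, 0, 1]
curve = kaplan_meier(times, events)
```

Censored patients never lower the curve directly; they only shrink the risk set for later death times, which is how patients alive at last follow-up are handled in the analysis described above.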
Identification of Survival-Associated Genes:
Survival differences associated with each gene were initially examined using 10 different cutoffs corresponding to each decile. For example, for the 90% cutoff, the 10% of patients with the highest expression levels for a given gene were assigned to a “high expression” group and the remaining 90% of patients were assigned to a “low expression” group, and the two groups were compared using a univariate Cox regression analysis. Similarly, the 80% of patients with the highest expression could be compared with the remaining 20% of patients. For individual genes, the difference in survival above and below the cutoff was assessed using the hazard ratio (HR) and log-rank test, with a significance level of P<0.01. This process was repeated for each gene and at each cutoff.
Survival differences associated with each gene were also examined after dividing the patients into four quartiles based on expression levels for that gene. Survival for patients in the second, third, and fourth quartiles was compared with that of patients in the first quartile. Genes were ranked based on hazard ratio (HR) and log-rank test. These procedures allowed identification of genes that had large survival differences and could consistently predict survival at different cutoffs.
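The decile-cutoff scan for a single gene can be sketched as follows. This is a simplified, standard-library Python illustration rather than the R analysis used in the disclosure: it computes only the two-group log-rank test (the hazard ratio from the Cox model is omitted), the function names are hypothetical, the toy data are invented, and the nine interior percentiles 10 through 90 are an assumed choice of cutoffs.

```python
import math


def split_by_percentile(expression, percentile):
    """Label each patient 1 (high expression) if above the cutoff, else 0."""
    ranked = sorted(expression)
    k = int(len(ranked) * percentile / 100)
    k = min(max(k, 1), len(ranked) - 1)   # keep both groups non-empty
    cutoff = ranked[k - 1]
    return [1 if x > cutoff else 0 for x in expression]


def logrank_test(times, events, groups):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    obs_minus_exp = 0.0
    variance = 0.0
    death_times = sorted({t for t, e in zip(times, events) if e == 1})
    for t in death_times:
        n = sum(1 for ti in times if ti >= t)                        # at risk, all
        n1 = sum(1 for ti, g in zip(times, groups) if ti >= t and g == 1)
        d = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        d1 = sum(1 for ti, e, g in zip(times, events, groups)
                 if ti == t and e == 1 and g == 1)
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = obs_minus_exp ** 2 / variance
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 degree of freedom
    return chi2, p_value


def scan_cutoffs(expression, times, events, percentiles=range(10, 100, 10)):
    """Run the log-rank test at each percentile cutoff for one gene."""
    return {pct: logrank_test(times, events,
                              split_by_percentile(expression, pct))
            for pct in percentiles}


# Hypothetical toy data: higher expression goes with earlier death.
expression = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
times = [50, 48, 46, 44, 42, 10, 8, 6, 4, 2]
events = [1] * 10
results = scan_cutoffs(expression, times, events)
```

In a full analysis this scan would be repeated for every gene, keeping those that meet the P<0.01 threshold stated above.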
To accomplish these goals, survival analysis was systematically conducted for every gene and at every decile cutoff. Examination of the results suggested that larger survival differences were usually observed at the fourth quartile, although survival differences were also seen at the third and sometimes the second quartile.
Results:
Using selection criteria, 40 genes had good survival prediction potential as shown by the HR and p values (Table 5) and the Kaplan-Meier survival curves for representative genes (
Materials and Methods:
Building the SCCC Gene Signature and TRS Stratifier:
The individual genes with large survival differences were used to construct a survival prediction model using a machine learning method. The least absolute shrinkage and selection operator (LASSO) algorithm was used to select genes and fit the regression coefficient for each gene in a penalized Cox proportional hazards model (Simon, et al., J Stat Softw. 39:1 (2011); Friedman, et al., J Stat Softw. 33:1 (2010)). This process allowed us to select a subset of the genes, with weighted expression values, to use in calculating a survival risk score for each patient. The risk scores were then used to stratify all patients into 3 transcriptomic risk score (TRS) or Ridge regression score (RRS) groups. The stratification was optimized using the log-rank test. For the univariate analysis, major clinical characteristics with prognostic relevance were fitted to a Cox regression model after removal of patients with unknown clinical information. All clinical variables that were significant on univariate analysis (stage and lymph node status) were combined with TRS for the multivariate Cox model. Although LASSO is capable of selecting genes, it is not practical to apply LASSO to the entire genomic dataset of over 20,000 genes and arrive at the best model. Therefore, the approach presented herein of pre-selecting genes using single-gene survival analyses and then fitting a LASSO model represents a practical and efficient way of developing multivariate models.
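Once coefficients have been fitted, the per-patient risk score is a weighted sum of expression values, which is then binned into three groups. The Python sketch below illustrates only that scoring and binning step. It does not fit the penalized Cox model, the coefficients and expression values are invented, and the two cut points are passed in as hypothetical fixed numbers rather than optimized with the log-rank test as described above.

```python
def risk_score(expression, coefficients):
    """Weighted sum of expression values over the signature genes."""
    return sum(weight * expression[gene] for gene, weight in coefficients.items())


def stratify(score, low_cut, high_cut):
    """Assign a risk group from two cut points (low_cut < high_cut)."""
    if score <= low_cut:
        return "low"
    if score <= high_cut:
        return "intermediate"
    return "high"


# Hypothetical coefficients; real values would come from the fitted model.
coefficients = {"PFKP": 0.4, "LIF": 0.3, "CD46": -0.2}

# Hypothetical log2 expression values for three patients.
patients = {
    "P1": {"PFKP": 10.0, "LIF": 5.0, "CD46": 12.0},
    "P2": {"PFKP": 12.0, "LIF": 8.0, "CD46": 10.0},
    "P3": {"PFKP": 14.0, "LIF": 10.0, "CD46": 8.0},
}

scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
groups = {pid: stratify(s, low_cut=4.0, high_cut=6.0) for pid, s in scores.items()}
```

Genes whose coefficient is shrunk to zero by the LASSO penalty would simply be absent from `coefficients`, which is how the penalty performs gene selection.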
Validation:
Because there was no hold-out validation set, bootstrapping was performed to validate models. The mean HR was computed over 1,000 bootstraps, each utilizing 70% of the data set. A model was considered valid if 95% or more of the bootstrapped models had a p-value of 0.05 or less.
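The validation rule above can be sketched in Python as a loop around any model-evaluation routine. The sketch makes two assumptions not stated in the text: each 70% sample is drawn without replacement, and the evaluation routine is supplied by the caller as a hypothetical callable `evaluate` that returns a hazard ratio and p-value for a sample. The placeholder `always_significant` stands in for a real survival model fit.

```python
import random
import statistics


def bootstrap_validate(patients, evaluate, n_boot=1000, fraction=0.7,
                       alpha=0.05, required=0.95, seed=0):
    """Return (mean HR, share of significant resamples, is_valid)."""
    rng = random.Random(seed)
    size = int(len(patients) * fraction)
    hazard_ratios = []
    significant = 0
    for _ in range(n_boot):
        sample = rng.sample(patients, size)   # assumption: without replacement
        hr, p_value = evaluate(sample)
        hazard_ratios.append(hr)
        if p_value <= alpha:
            significant += 1
    share = significant / n_boot
    return statistics.mean(hazard_ratios), share, share >= required


def always_significant(sample):
    """Placeholder evaluation; a real one would fit a survival model."""
    return 5.0, 0.01


patient_ids = list(range(203))
mean_hr, share, is_valid = bootstrap_validate(patient_ids, always_significant)
```

Swapping `rng.sample` for `rng.choices(patients, k=size)` would give sampling with replacement if that is the intended bootstrap.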
Results:
The 40 genes identified using the described selection criteria were used to identify the gene signatures that can predict survival. Ridge regression was then performed utilizing the “glmnet” package in R to build a prognostic score combining all genes (Friedman, Hastie, and Tibshirani 2009). With the 40 genes, Ridge regression analysis was conducted to find the optimal regression coefficients and determine which composition of genes was most predictive of survival. This was performed in 3,000 training and testing pairs. Of the 3,000 models, 86 had an HR greater than 5 in both the training and test sets. Based on individual gene RRS, it was determined that 25 of the 40 genes contributed the majority of the points to each individual risk score. Therefore, the top 25 genes were selected and are shown in Table 1 and Table 5.
Using the 25 genes with the highest RRS scores, the same process was repeated, which resulted in 190 models with a hazard ratio greater than 5 in both the training and test sets. A final iteration was performed using the top 20 of the 25 genes. This resulted in 245 models with an HR greater than 5 in both training and test sets. The best 40 models for both the 25- and 20-gene signatures are shown in Table 6. These models were further evaluated by bootstrapping for 1,000 bootstraps over the entire dataset, which showed that each model retained its prognostic capabilities (Table 6). Almost all models had close to a 40% difference in 5-year overall survival; representative models are shown in
Methods:
Consensus Voting:
Once model selection was complete and the models had been validated by bootstrapping, the individual model predictions were combined using consensus voting.
Results:
Because it is not expected that all models would rank patients into the same high and low risk groupings, the disclosure examined how often the 80 models with the highest mean HR (after bootstrapping) in both the 25 and 20 gene combinations agreed on patient risk grouping. If models concurred on patient risk grouping at least 75% of the time, then the patient was classified as that risk grouping. When models concurred less than 75% of the time, patients were classified as ambiguous or moderate risk. The 80 models agreed on placing 86/203 (42%) patients into the low risk group, and 83/203 (41%) patients into the high-risk group. For the remaining 34/203 (17%) patients, there was agreement less than 75% of the time, and therefore these patients were defined as a middle or ambiguous risk group (
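The 75% agreement rule can be expressed compactly. The Python sketch below is illustrative only; the function name `consensus_group` and the vote lists are hypothetical, with each vote standing for one model's high-risk or low-risk call for a single patient.

```python
def consensus_group(votes, threshold=0.75):
    """Combine per-model risk calls ('high' or 'low') for one patient."""
    total = len(votes)
    if votes.count("high") / total >= threshold:
        return "high"
    if votes.count("low") / total >= threshold:
        return "low"
    return "ambiguous"   # models agree less than 75% of the time


# Hypothetical calls from 80 models for three patients.
clear_high = ["high"] * 60 + ["low"] * 20    # exactly 75% agreement
clear_low = ["low"] * 70 + ["high"] * 10
split = ["high"] * 59 + ["low"] * 21         # just under 75%

calls = [consensus_group(v) for v in (clear_high, clear_low, split)]
```

Because the rule is "at least 75%", a patient with exactly 60 of 80 concordant calls is still assigned to that group, while 59 of 80 falls into the ambiguous group.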
Results:
Surprisingly, 22% (18/86) of patients in the low-risk group had stage III disease or greater, 31% (27/86) had positive LVSI, and 23% (20/86) had positive lymph node metastases (Table 7). Because of the large number of patients in the low-risk group with poor prognostic pathologic characteristics, a subgroup analysis was performed to see whether TGS/RRS outweighed the potential negative effects of those characteristics. Patients in the low-risk group with unknown or absent LVSI had a 5-year overall survival of 87%, compared with 34% among high-risk patients with positive LVSI (HR 10.3, 95% CI 4.01-26.7, p<0.001). Stage III and IV patients in the low-risk group had a 5-year survival of 83%. Based on this, it is evident that the TGS/RRS outweighs the potential negative implications of poor pathologic findings. Importantly, the only clinical, pathologic, or treatment characteristics that occurred more frequently in the high-risk group were advanced stage (p=0.04) and worse response to primary therapy (p<0.001). There was no difference in the receipt of radiation (p=0.89), chemotherapy (p=0.33), or surgery (p=0.33) among high-, moderate-, and low-risk group patients (Table 7).
Interestingly, 70% (n=57/83) of high-risk group patients had either stage I or II disease. When comparing only early-stage low- and high-risk patients, there was a persistent survival difference (5-year overall survival of 91% for low risk and 39% for high risk, HR=11.3, 95% CI=4.30-29.6, p=8.49E-07). Among stage I and II high-risk patients, there was no difference in survival whether patients received no treatment, radiation alone (HR=0.95, p=0.94), or CRT (HR=1.19, p=0.67). This indicates that high-risk patients represent an extremely treatment-resistant population. Supporting this conclusion, there was a difference in response to primary therapy between low- and high-risk patients (p<0.001). The high-risk group contained 94% (17/18) of patients who did not respond to primary therapy and 100% (6/6) of partial responders to primary therapy (Table 7).
Results:
Across the TRS groups, median age, stage, lymph node status, and grades were similarly represented (Table 7). Median overall survival was 1.56 years for the high-risk TRS group, 8.48 years for the intermediate-risk TRS group, and not yet reached for the low-risk TRS group.
The stage-by-stage distribution of the TRS groups can also be found in Table 7. Two observations from this part of the analysis are worth highlighting. First, 40.9% of early-stage patients (stage I and stage II) were in the TRS poor-survival subgroup. Second, 47.8% of late-stage patients (stage III and IV) had low TRS, belonging to the good-survival TRS group, suggesting that TGS overrides the contribution of stage.
Univariate analysis of major clinical variables for SCCC found that stage IV and lymph node status were each significantly associated with survival, but grade was not (
Given that stage was the most significant clinical factor associated with survival, survival analysis was further carried out on stage I-III patients stratified by TRS (
All references cited herein are incorporated by reference in their entirety. The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof and, accordingly, reference should be made to the appended claims, rather than to the foregoing specification, as indicating the scope of the invention.
While in the foregoing specification this invention has been described in relation to certain embodiments thereof, and many details have been put forth for the purpose of illustration, it will be apparent to those skilled in the art that the invention is susceptible to additional embodiments and that certain of the details described herein can be varied considerably without departing from the basic principles of the invention.
This application claims benefit of and priority to U.S. Provisional Patent Application No. 63/015,045 filed on Apr. 24, 2020, which is incorporated by reference in its entirety.
Number | Date | Country
---|---|---
63015045 | Apr 2020 | US