The application relates generally to methods for identifying biomarkers, and to biomarkers for squamous cell carcinoma of the lung.
Identifying gene expression signatures that capture altered key pathways/regulators in carcinogenesis may discover molecular subclasses and predict patient outcomes (1). Several prognostic gene expression signatures have been published for non-small cell lung cancer (NSCLC) (2-8) and its adenocarcinoma (ADC) subtype (9-12). Few studies have been performed to identify prognostic signatures specific for lung squamous cell carcinoma (SQCC) (13, 14), but their validation in independent cohorts or datasets has been limited.
Factors such as patient/sample heterogeneity, small sample size, variation in microarray platforms, RNA preparation and hybridization protocols could all contribute to difficulties in validation of gene expression signatures. In addition, the loss of information through arbitrary exclusion of patients or genes prior to analysis may play an important role. Supervised data mining methodology assigns cases into good and poor prognosis subgroups at specified time points (13, 15). This arbitrary assignment of a cutoff to split good/poor prognosis cases could be problematic due to the non-linear relationships between gene expression and patient survival. Other investigators have compared two extremes in outcome (very early death versus long survival) (3, 12); however, this approach may result in significant information loss, for almost half of the cases with intermediate survival are excluded from analysis, thereby leading to high finite sample variation (16), and making the cohort under study less representative. Therefore, it is anticipated that the validation of the identified signature could be very challenging.
It is estimated that most tissues express only 30-40% of genes (17) or 10,000 to 15,000 genes (18). Furthermore, among the expressed genes from similar tissue types, only a small fraction is differentially expressed. Only these differentially expressed genes distinguish one phenotype from another. In an attempt to compensate for this in genome-wide microarray studies, some investigators have excluded genes with low expression or low variation prior to signature selection (3, 8-10). This approach may result in the exclusion of potentially important low expression but key regulatory genes, leading to another potential source of information loss. In addition, signatures are generated using a forced forward inclusion procedure pre-determined by the rank of significance of the gene (8, 9) or the bootstrap score (13), regardless of whether the included gene contributes to the classification ability of the signature. The lack of heuristic measures in these methods potentially reduces the robustness of these signatures.
According to a further aspect, there is provided a method of predicting prognosis in a subject with lung squamous cell carcinoma (SQCC) comprising the steps:
According to a further aspect, there is provided a method of selecting a therapy for a subject with SQCC, comprising the steps:
According to a further aspect, there is provided a composition comprising a plurality of isolated nucleic acid sequences, wherein each isolated nucleic acid sequence hybridizes to:
According to a further aspect, there is provided an array comprising, for each of at least one of twelve genes: RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A, one or more polynucleotide probes complementary and hybridizable to an expression product of the gene.
According to a further aspect, there is provided a computer program product for use in conjunction with a computer having a processor and a memory connected to the processor, the computer program product comprising a computer readable storage medium having a computer mechanism encoded thereon, wherein the computer program mechanism may be loaded into the memory of the computer and cause the computer to carry out a method described herein.
According to a further aspect, there is provided a computer implemented product for predicting a prognosis or classifying a subject with SQCC comprising:
According to a further aspect, there is provided a computer implemented product for determining therapy for a subject with SQCC comprising:
According to a further aspect, there is provided a computer implemented product described herein for use with a method described herein.
According to a further aspect, there is provided a computer readable medium having stored thereon a data structure for storing a computer implemented product described herein.
According to a further aspect, there is provided a computer system comprising
According to a further aspect, there is provided a kit to prognose or classify a subject with early stage SQCC, comprising detection agents that can detect the expression products of at least one biomarker selected from RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A, and instructions for use.
According to a further aspect, there is provided a kit to select a therapy for a subject with SQCC, comprising detection agents that can detect the expression products of at least one biomarker selected from RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A, and instructions for use.
These and other features of the preferred embodiments of the invention will become more apparent in the following detailed description in which reference is made to the appended drawings wherein:
The application generally relates to identifying gene signatures and provides methods and computer implemented products therefor. The application also relates to 12 biomarkers that form 1-gene to 12-gene signatures, and provides methods, compositions, computer implemented products, detection agents and kits for prognosing or classifying a subject with SQCC and for determining the benefit of adjuvant chemotherapy.
Global gene expression profiling has been implemented successfully for tumor characterization, classification and prediction of disease outcome. However, few studies have explored prognostic signatures for squamous cell carcinoma of the lung (SQCC).
A published microarray dataset from 129 SQCC patients was used as a training set to identify the minimal gene set prognostic signature. This was selected using the MAximizing R Square Algorithm (MARSA), a novel heuristic signature optimization procedure based on goodness-of-fit (R square). The signature was tested internally by leave-one-out cross-validation (LOOCV), and then externally in 3 independent public lung cancer microarray datasets: 2 datasets of NSCLC and one of adenocarcinoma (ADC) only. Quantitative PCR (qPCR) was used to validate the signature in a fourth independent SQCC cohort.
A 12-gene signature that passed the internal LOOCV validation was identified. The signature was independently prognostic for SQCC in two NSCLC datasets (total n=223) but not in ADC. The lack of prognostic significance in ADC was confirmed in the largest available ADC dataset (n=442). The prognostic significance of the signature was validated further by qPCR in another independent cohort containing 62 SQCC samples (HR=3.76, 95% CI 1.10-12.87, p=0.035).
We have identified a novel 12-gene prognostic signature specific for SQCC and demonstrated the effectiveness of MARSA to identify prognostic gene expression signatures.
It must be noted that as used herein and in the appended claims, the singular forms “a”, “an” and “the” include the plural referents unless the context clearly dictates otherwise.
As used herein, “biological parameter” may refer to any measurable or quantifiable characteristic in a biological system and includes, without limitation, physical characteristics and attributes, genotype, phenotype, biomarkers, gene expression, splice-variants of an mRNA, polymorphisms of DNA or protein, levels of protein, cells, nucleic acids, amino acids or other biological matter.
The term “biomarker” as used herein refers to a gene that is differentially expressed in individuals. For example, specifically with respect to lung squamous cell carcinoma (SQCC), the biomarkers may be differentially expressed in individuals according to prognosis and thus may be predictive of different survival outcomes and of the benefit of adjuvant chemotherapy. In one embodiment, the 12 biomarkers that form the SQCC gene signature of the present application are RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A.
The term “level of expression” or “expression level” as used herein refers to a measurable level of expression of the products of biomarkers, such as, without limitation, the level of messenger RNA transcript expressed or of a specific exon or other portion of a transcript, the level of proteins or portions thereof expressed of the biomarkers, the number or presence of DNA polymorphisms of the biomarkers, the enzymatic or other activities of the biomarkers, and the level of specific metabolites.
The term “reference expression profile” as used herein refers to the expression level of at least one of the 12 biomarkers selected from RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A associated with a clinical outcome in a SQCC patient. The reference expression profile comprises up to 12 values, each value representing the level of a biomarker, wherein each biomarker corresponds to one gene selected from RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A. The reference expression profile is typically identified using one or more samples comprising tumor or adjacent or otherwise tumor-related stromal/blood based tissue or cells, wherein the expression is similar between related samples defining an outcome class or group such as poor survival or good survival and is different to unrelated samples defining a different outcome class such that the reference expression profile is associated with a particular clinical outcome. The reference expression profile is accordingly a reference profile or reference signature of the expression of at least 1 of the 12 biomarkers selected from RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A, to which the subject expression levels of the corresponding genes in a patient sample are compared in methods for determining or predicting clinical outcome.
As used herein, the term “control” refers to a specific value or dataset that can be used to prognose or classify the value, e.g. expression level or reference expression profile, obtained from the test sample associated with an outcome class. In one embodiment, a dataset may be obtained from samples from a group of subjects known to have SQCC and good survival outcome or known to have SQCC and have poor survival outcome or known to have SQCC and have benefited from adjuvant chemotherapy or known to have SQCC and not have benefited from adjuvant chemotherapy. The expression data of the biomarkers in the dataset can be used to create a control value that is used in testing samples from new patients. In such an embodiment, the “control” is a predetermined value for the set of at least 1 of the 12 biomarkers obtained from SQCC patients whose biomarker expression values and survival times are known. Alternatively, the “control” is a predetermined reference profile for the set of at least 1 of the 12 biomarkers described herein obtained from patients whose survival times are known.
A person skilled in the art will appreciate that the comparison between the expression of the biomarkers in the test sample and the expression of the biomarkers in the control will depend on the control used. For example, if the control is from a subject known to have SQCC and poor survival, and there is a difference in expression of the biomarkers between the control and test sample, then the subject can be prognosed or classified in a good survival group. If the control is from a subject known to have SQCC and good survival, and there is a difference in expression of the biomarkers between the control and test sample, then the subject can be prognosed or classified in a poor survival group. For example, if the control is from a subject known to have SQCC and good survival, and there is a similarity in expression of the biomarkers between the control and test sample, then the subject can be prognosed or classified in a good survival group. For example, if the control is from a subject known to have SQCC and poor survival, and there is a similarity in expression of the biomarkers between the control and test sample, then the subject can be prognosed or classified in a poor survival group.
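The four cases above reduce to a simple rule: a test sample that is similar to the control is placed in the control's outcome group, and a sample that differs from the control is placed in the opposite group. A minimal sketch of that rule follows; the function name and the string labels are hypothetical, and how similarity is decided is left outside the sketch.

```python
def classify_by_control(control_outcome: str, similar_to_control: bool) -> str:
    """Return the survival group implied by comparing a test sample with a control.

    control_outcome: "good" or "poor", the known outcome of the control.
    similar_to_control: True if biomarker expression in the test sample is
        similar to the control, False if it differs.
    """
    if control_outcome not in ("good", "poor"):
        raise ValueError("control_outcome must be 'good' or 'poor'")
    if similar_to_control:
        # Similar expression: same outcome group as the control.
        return control_outcome
    # Different expression: the opposite outcome group.
    return "poor" if control_outcome == "good" else "good"


if __name__ == "__main__":
    for outcome in ("good", "poor"):
        for similar in (True, False):
            print(outcome, similar, "->", classify_by_control(outcome, similar))
```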
The term “differentially expressed” or “differential expression” as used herein refers to a difference in the level of expression of the biomarkers that can be assayed by measuring the level of expression of the products of the biomarkers, such as the difference in level of messenger RNA transcript or a portion thereof expressed or of proteins expressed of the biomarkers. In a preferred embodiment, the difference is statistically significant. The term “difference in the level of expression” refers to an increase or decrease in the measurable expression level of a given biomarker, for example as measured by the amount of messenger RNA transcript and/or the amount of protein in a sample as compared with the measurable expression level of a given biomarker in a control. In one embodiment, the differential expression can be compared using the ratio of the level of expression of a given biomarker or biomarkers as compared with the expression level of the given biomarker or biomarkers of a control, wherein the ratio is not equal to 1.0. For example, an RNA or protein is differentially expressed if the ratio of the level of expression in a first sample as compared with a second sample is greater than or less than 1.0. For example, the ratio may be greater than 1, 1.2, 1.5, 1.7, 2, 3, 5, 10, 15, 20 or more, or less than 1, 0.8, 0.6, 0.4, 0.2, 0.1, 0.05, 0.001 or less. In another embodiment, the differential expression is measured using a p-value. For instance, when using a p-value, a biomarker is identified as being differentially expressed as between a first sample and a second sample when the p-value is less than 0.1, preferably less than 0.05, more preferably less than 0.01, even more preferably less than 0.005, and most preferably less than 0.001.
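The ratio and p-value criteria can be illustrated with a short sketch. The function names are hypothetical, the default thresholds (1.5, 0.6 and 0.05) are simply picked from the example values listed above, and the p-value is assumed to have been computed elsewhere by a suitable statistical test.

```python
def differential_by_ratio(test_level: float, control_level: float,
                          upper: float = 1.5, lower: float = 0.6) -> bool:
    """True if the test/control expression ratio lies outside [lower, upper]."""
    if control_level <= 0:
        raise ValueError("control_level must be positive to form a ratio")
    ratio = test_level / control_level
    return ratio > upper or ratio < lower


def differential_by_p_value(p_value: float, alpha: float = 0.05) -> bool:
    """True if a precomputed p-value is below the chosen significance threshold."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    return p_value < alpha


if __name__ == "__main__":
    print(differential_by_ratio(2.0, 1.0))   # ratio 2.0, above the upper threshold
    print(differential_by_ratio(1.1, 1.0))   # ratio 1.1, inside the thresholds
    print(differential_by_p_value(0.03))     # below 0.05
```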
The term “similarity in expression” as used herein means that there is no or little difference in the level of expression of the biomarkers between the test sample and the control or reference profile. For example, similarity can refer to a fold difference compared to a control. In a preferred embodiment, there is no statistically significant difference in the level of expression of the biomarkers.
The term “most similar” in the context of a reference profile refers to a reference profile that is associated with a clinical outcome that shows the greatest number of identities and/or degree of changes with the subject profile.
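One way to pick the “most similar” reference profile is to compute a distance between the subject profile and each reference profile and take the closest. The sketch below is illustrative only: Euclidean distance is an assumed similarity measure that the definition above does not prescribe, the function name is hypothetical, and the expression values are made up.

```python
import math


def most_similar_profile(subject: dict, references: dict) -> str:
    """Return the name of the reference profile closest to the subject profile.

    subject: mapping of gene symbol -> standardized expression level.
    references: mapping of outcome name -> reference profile (same shape).
    Distance is Euclidean over the genes shared by both profiles (an assumption).
    """
    def distance(a: dict, b: dict) -> float:
        shared = a.keys() & b.keys()
        if not shared:
            raise ValueError("profiles share no genes")
        return math.sqrt(sum((a[g] - b[g]) ** 2 for g in shared))

    return min(references, key=lambda name: distance(subject, references[name]))


if __name__ == "__main__":
    # Hypothetical values for two of the twelve biomarkers.
    references = {
        "good survival": {"VEGFA": -1.0, "NES": -0.5},
        "poor survival": {"VEGFA": 1.0, "NES": 0.8},
    }
    subject = {"VEGFA": 0.9, "NES": 0.4}
    print(most_similar_profile(subject, references))
```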
The term “prognosis” as used herein refers to a clinical outcome group such as a poor survival group or a good survival group associated with a disease subtype which is reflected by a reference profile such as a biomarker reference expression profile or reflected by an expression level of the biomarkers disclosed herein. The prognosis provides an indication of disease progression and includes an indication of likelihood of death due to lung cancer. In one embodiment the clinical outcome class includes a good survival group and a poor survival group.
The term “prognosing or classifying” as used herein means predicting or identifying the clinical outcome group that a subject belongs to according to the subject's similarity to a reference profile or biomarker expression level associated with the prognosis. For example, prognosing or classifying comprises a method or process of determining whether an individual with SQCC has a good or poor survival outcome, or grouping an individual with SQCC into a good survival group or a poor survival group, or predicting whether or not an individual with SQCC will respond to therapy.
The term “good survival” as used herein refers to an increased chance of survival as compared to patients in the “poor survival” group. For example, the biomarkers of the application can prognose or classify patients into a “good survival group”. These patients are at a lower risk of death after surgery.
The term “poor survival” as used herein refers to an increased risk of death as compared to patients in the “good survival” group. For example, biomarkers or genes of the application can prognose or classify patients into a “poor survival group”. These patients are at greater risk of death or adverse reaction from disease or surgery, treatment for the disease or other causes.
The term “subject” as used herein refers to any member of the animal kingdom, preferably a human being and most preferably a human being that has SQCC or that is suspected of having SQCC.
The term “test sample” as used herein refers to any fluid, cell or tissue sample from a subject which can be assayed for biomarker expression products and/or a reference expression profile, e.g. genes differentially expressed in subjects with SQCC according to survival outcome.
The phrase “determining the expression of biomarkers” as used herein refers to determining or quantifying RNA or proteins or protein activities or protein-related metabolites expressed by the biomarkers. The term “RNA” includes mRNA transcripts, and/or specific spliced or other alternative variants of mRNA, including anti-sense products. The term “RNA product of the biomarker” as used herein refers to RNA transcripts transcribed from the biomarkers and/or specific spliced or alternative variants. In the case of “protein”, it refers to proteins translated from the RNA transcripts transcribed from the biomarkers. The term “protein product of the biomarker” refers to proteins translated from RNA products of the biomarkers.
A person skilled in the art will appreciate that a number of methods can be used to detect or quantify the level of RNA products of the biomarkers within a sample, including arrays, such as microarrays, RT-PCR (including quantitative RT-PCR), nuclease protection assays and Northern blot analyses.
Accordingly, in one embodiment, the biomarker expression levels are determined using arrays, optionally microarrays, RT-PCR, optionally quantitative RT-PCR, nuclease protection assays or Northern blot analyses.
In another embodiment, the biomarker expression levels are determined by using an array.
In one embodiment, the array is a HG-U133A chip from Affymetrix. In another embodiment, a plurality of nucleic acid probes that are complementary or hybridizable to an expression product of at least one of the 12 biomarkers selected from RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A are used on the array.
The term “nucleic acid” includes DNA and RNA and can be either double stranded or single stranded.
The term “hybridize” or “hybridizable” refers to the sequence specific non-covalent binding interaction with a complementary nucleic acid. In a preferred embodiment, the hybridization is under high stringency conditions. Appropriate stringency conditions which promote hybridization are known to those skilled in the art, or can be found in Current Protocols in Molecular Biology, John Wiley & Sons, N.Y. (1989), 6.3.1-6.3.6. For example, 6.0× sodium chloride/sodium citrate (SSC) at about 45° C., followed by a wash of 2.0×SSC at 50° C. may be employed.
The term “probe” as used herein refers to a nucleic acid sequence that will hybridize to a nucleic acid target sequence. In one example, the probe hybridizes to an RNA product of the biomarker or a nucleic acid sequence complementary thereof. The length of probe depends on the hybridization conditions and the sequences of the probe and nucleic acid target sequence. In one embodiment, the probe is at least 8, 10, 15, 20, 25, 50, 75, 100, 150, 200, 250, 400, 500 or more nucleotides in length.
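Probe complementarity can be illustrated by deriving the DNA sequence that is exactly complementary to an RNA target. This is a sketch under simplifying assumptions: it considers only perfect Watson-Crick pairing, whereas actual hybridization depends on stringency and can tolerate mismatches; the function names and the example sequence are hypothetical.

```python
# Maps each RNA base to the DNA base that pairs with it.
_RNA_TO_DNA_COMPLEMENT = str.maketrans("ACGU", "TGCA")


def dna_probe_for_rna(target_rna: str) -> str:
    """Return the DNA probe (written 5'->3') exactly complementary to an RNA target."""
    target = target_rna.upper()
    if set(target) - set("ACGU"):
        raise ValueError("target_rna may contain only A, C, G and U")
    # Complement each base, then reverse so the probe reads 5'->3'.
    return target.translate(_RNA_TO_DNA_COMPLEMENT)[::-1]


def is_perfect_complement(probe_dna: str, target_rna: str) -> bool:
    """True if the probe is the exact reverse complement of the whole target."""
    return probe_dna.upper() == dna_probe_for_rna(target_rna)


if __name__ == "__main__":
    print(dna_probe_for_rna("AUGGCC"))
    print(is_perfect_complement("ggccat", "AUGGCC"))
```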
In another embodiment, the biomarker expression levels are determined by using quantitative RT-PCR. In another embodiment, the primers used for quantitative RT-PCR comprise a forward and reverse primer for each of RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A.
The term “primer” as used herein refers to a nucleic acid sequence, whether occurring naturally as in a purified restriction digest or produced synthetically, which is capable of acting as a point of synthesis when placed under conditions in which synthesis of a primer extension product, which is complementary to a nucleic acid strand, is induced (e.g. in the presence of nucleotides and an inducing agent such as DNA polymerase and at a suitable temperature and pH). The primer must be sufficiently long to prime the synthesis of the desired extension product in the presence of the inducing agent. The exact length of the primer will depend upon factors including temperature, sequences of the primer and the methods used. A primer typically contains 15-25 or more nucleotides, although it can contain fewer or more. The factors involved in determining the appropriate length of primer are readily known to one of ordinary skill in the art.
In addition, a person skilled in the art will appreciate that a number of methods can be used to determine the amount of a protein product of the biomarker of the invention, including immunoassays such as Western blots, ELISA, and immunoprecipitation followed by SDS-PAGE and immunocytochemistry.
Accordingly, in another embodiment, an antibody is used to detect the polypeptide products of at least 1 of the 12 biomarkers selected from RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A. In another embodiment, the sample comprises a tissue sample. In a further embodiment, the tissue sample is suitable for immunohistochemistry.
The term “antibody” as used herein is intended to include monoclonal antibodies, polyclonal antibodies, and chimeric antibodies. The antibody may be from recombinant sources and/or produced in transgenic animals. The term “antibody fragment” as used herein is intended to include Fab, Fab′, F(ab′)2, scFv, dsFv, ds-scFv, dimers, minibodies, diabodies, and multimers thereof and bispecific antibody fragments. Antibodies can be fragmented using conventional techniques. For example, F(ab′)2 fragments can be generated by treating the antibody with pepsin. The resulting F(ab′)2 fragment can be treated to reduce disulfide bridges to produce Fab′ fragments. Papain digestion can lead to the formation of Fab fragments. Fab, Fab′ and F(ab′)2, scFv, dsFv, ds-scFv, dimers, minibodies, diabodies, bispecific antibody fragments and other fragments can also be synthesized by recombinant techniques.
Conventional techniques of molecular biology, microbiology and recombinant DNA techniques are within the skill of the art. Such techniques are explained fully in the literature. See, e.g., Sambrook, Fritsch & Maniatis, 1989, Molecular Cloning: A Laboratory Manual, Second Edition; Oligonucleotide Synthesis (M. J. Gait, ed., 1984); Nucleic Acid Hybridization (B. D. Hames & S. J. Higgins, eds., 1984); A Practical Guide to Molecular Cloning (B. Perbal, 1984); and a series, Methods in Enzymology (Academic Press, Inc.); Short Protocols In Molecular Biology, (Ausubel et al., ed., 1995).
For example, antibodies having specificity for a specific protein, such as the protein product of a biomarker, may be prepared by conventional methods. A mammal, (e.g. a mouse, hamster, or rabbit) can be immunized with an immunogenic form of the peptide which elicits an antibody response in the mammal. Techniques for conferring immunogenicity on a peptide include conjugation to carriers or other techniques well known in the art. For example, the peptide can be administered in the presence of adjuvant. The progress of immunization can be monitored by detection of antibody titers in plasma or serum. Standard ELISA or other immunoassay procedures can be used with the immunogen as antigen to assess the levels of antibodies. Following immunization, antisera can be obtained and, if desired, polyclonal antibodies isolated from the sera.
To produce monoclonal antibodies, antibody producing cells (lymphocytes) can be harvested from an immunized animal and fused with myeloma cells by standard somatic cell fusion procedures, thus immortalizing these cells and yielding hybridoma cells. Such techniques are well known in the art (e.g. the hybridoma technique originally developed by Kohler and Milstein (Nature 256:495-497 (1975)) as well as other techniques such as the human B-cell hybridoma technique (Kozbor et al., Immunol. Today 4:72 (1983)), the EBV-hybridoma technique to produce human monoclonal antibodies (Cole et al., Methods Enzymol, 121:140-67 (1986)), and screening of combinatorial antibody libraries (Huse et al., Science 246:1275 (1989))). Hybridoma cells can be screened immunochemically for production of antibodies specifically reactive with the peptide and the monoclonal antibodies can be isolated.
The gene signature described herein can be used to select treatment for SQCC patients. As explained herein, the biomarkers can classify patients with SQCC into a poor survival group or a good survival group and into groups that might benefit from adjuvant chemotherapy or not.
The term “adjuvant chemotherapy” as used herein means treatment of cancer with chemotherapeutic agents after surgery where all detectable disease has been removed, but where there still remains a risk of small amounts of remaining cancer. Typical chemotherapeutic agents include cisplatin, carboplatin, vinorelbine, gemcitabine, docetaxel, paclitaxel and navelbine.
According to one aspect, there is provided a method of prognosing or classifying a subject with lung squamous cell carcinoma (SQCC) comprising:
According to a further aspect, there is provided a method of predicting prognosis in a subject with lung squamous cell carcinoma (SQCC) comprising the steps:
In some embodiments, the biomarker reference expression profile comprises a poor survival group or a good survival group.
In different embodiments, the at least one biomarker is any of two biomarkers, three biomarkers, four biomarkers, five biomarkers, six biomarkers, seven biomarkers, eight biomarkers, nine biomarkers, ten biomarkers, eleven biomarkers and twelve biomarkers.
In some embodiments, determining the biomarker expression level comprises use of quantitative PCR or an array, preferably a U133A chip.
In some embodiments, determining the biomarker expression profile comprises use of an antibody to detect polypeptide products of the biomarker.
In some embodiments, the sample comprises a tissue sample, preferably a sample suitable for immunohistochemistry.
According to a further aspect, there is provided a method of selecting a therapy for a subject with SQCC, comprising the steps:
According to a further aspect, there is provided a composition comprising a plurality of isolated nucleic acid sequences, wherein each isolated nucleic acid sequence hybridizes to:
According to a further aspect, there is provided an array comprising, for each of at least one of twelve genes: RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A, one or more polynucleotide probes complementary and hybridizable to an expression product of the gene.
According to a further aspect, there is provided a computer program product for use in conjunction with a computer having a processor and a memory connected to the processor, the computer program product comprising a computer readable storage medium having a computer mechanism encoded thereon, wherein the computer program mechanism may be loaded into the memory of the computer and cause the computer to carry out a method described herein.
According to a further aspect, there is provided a computer implemented product for predicting a prognosis or classifying a subject with SQCC comprising:
Preferably, a computer implemented product described herein is for use with a method described herein.
According to a further aspect, there is provided a computer implemented product for determining therapy for a subject with SQCC comprising:
According to a further aspect, there is provided a computer readable medium having stored thereon a data structure for storing a computer implemented product described herein.
Preferably, the data structure is capable of configuring a computer to respond to queries based on records belonging to the data structure, each of the records comprising:
According to a further aspect, there is provided a computer system comprising
According to a further aspect, there is provided a kit to prognose or classify a subject with early stage SQCC, comprising detection agents that can detect the expression products of at least one biomarker selected from RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A, and instructions for use.
According to a further aspect, there is provided a kit to select a therapy for a subject with SQCC, comprising detection agents that can detect the expression products of at least one biomarker selected from RPL22, VEGFA, G0S2, NES, TNFRSF25, DKFZP586P0123, COL8A2, ZNF3, RIPK5, RNFT2, ARHGEF12 and PTPN20A, and instructions for use.
A person skilled in the art will appreciate that a number of detection agents can be used to determine the expression of the biomarkers. For example, to detect RNA products of the biomarkers, probes, primers, complementary nucleotide sequences or nucleotide sequences that hybridize to the RNA products can be used. To detect protein products of the biomarkers, ligands or antibodies that specifically bind to the protein products can be used.
Accordingly, in one embodiment, the detection agents are probes that hybridize to the at least 1 of the 12 biomarkers. A person skilled in the art will appreciate that the detection agents can be labeled.
The label is preferably capable of producing, either directly or indirectly, a detectable signal. For example, the label may be radio-opaque or a radioisotope, such as 3H, 14C, 32P, 35S, 123I, 125I or 131I; a fluorescent (fluorophore) or chemiluminescent (chromophore) compound, such as fluorescein isothiocyanate, rhodamine or luciferin; an enzyme, such as alkaline phosphatase, beta-galactosidase or horseradish peroxidase; an imaging agent; or a metal ion.
The kit can also include a control or reference standard and/or instructions for use thereof. In addition, the kit can include ancillary agents such as vessels for storing or transporting the detection agents and/or buffers or stabilizers.
In a further aspect, the application provides computer programs and computer implemented products for carrying out the methods described herein. Accordingly, in one embodiment, the application provides a computer program product for use in conjunction with a computer having a processor and a memory connected to the processor, the computer program product comprising a computer readable storage medium having a computer mechanism encoded thereon, wherein the computer program mechanism may be loaded into the memory of the computer and cause the computer to carry out the methods described herein.
The advantages of the present invention are further illustrated by the following examples. The examples and their particular details set forth herein are presented for illustration only and should not be construed as a limitation on the claims of the present invention.
Datasets: Four large, publicly available NSCLC microarray datasets were used: 129 SQCC samples from Molecular Diagnostics, Veridex LLC (UM) (13), 85 NSCLC samples (44 SQCC and 41 ADC) from Duke University (Duke) (3), 138 NSCLC samples (76 SQCC and 62 ADC) from Sungkyunkwan University (SKKU) (7), and 327 ADC samples from the NCI Director's Challenge Consortium for the Molecular Classification of ADC (DCC) (11). UM was used as the training set, while the remaining three datasets served as independent test sets. In addition, qPCR validation of the signature was carried out in 62 SQCC samples from the University Health Network (UHN). Patient demographics of the five independent datasets are shown in Table 1. The primary survival endpoint was 5-year survival (in UM, Duke, DCC, and UHN where overall survival was used) or disease-free survival (SKKU).
Data pre-processing: The raw data of the Veridex dataset were made available by Dr. Mitch Raponi and Veridex. The Duke and DCC datasets were downloaded from http://data.cgt.duke.edu/oncogene.php and https://caarraydb.nci.nih.gov/caarray/publicExperimentDetailAction.do?expId=1015945236141280, respectively. Raw .cel files were pre-processed by the Robust Multichip Average (RMA) algorithm using RMAexpress v0.5 (55), and then log2 transformed. Probe sets were annotated using the NetAffx v4.2 annotation tool (56). Affymetrix assigns five grades (A, B, C, E, and R) to classify the quality of the probe sets used in the GeneChip (56). Matching probe, or Grade A, annotations represent the best quality transcript assignments, in which at least 9 of the 11 probes in a probe set match a transcript mRNA or gene model sequence. Therefore, only probe sets with ‘grade A’ annotation were used for signature optimization. The GCRMA-normalized data and the limited clinical information from SKKU were downloaded directly from the NCBI GEO database (http://www.ncbi.nlm.nih.gov/geo/) under accession number GSE8894. The normalized data were standardized by Z-score transformation, which centered the expression level to a mean of zero and a standard deviation of one (57). It is noteworthy that two methods were used for the calculation of the risk score. The first method was used in signature optimization, where the risk score was the product of the Z-score weighted by the coefficient from the univariate survival analysis (58,59). The second method was used when PCA was applied to the 12-gene signature: the Z-score was first weighted by the coefficient of each gene in each of the 4 selected principal components, and the risk score was the sum of the scores of the 4 principal components weighted by their coefficients in the multivariate model (Table 4).
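The first risk score calculation described above can be illustrated with a minimal sketch. It assumes that the per-probe-set products of Z-score and univariate Cox coefficient are summed across probe sets to give one score per sample, and that the sample standard deviation is used for standardization; neither detail is stated in the text. The function names, probe set names, expression values and coefficients are hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Centre one probe set's values to mean 0 and scale to SD 1 across samples."""
    m = mean(values)
    s = stdev(values)  # assumption: sample SD (n - 1 denominator)
    return [(v - m) / s for v in values]


def risk_scores(expression, coefficients):
    """Return one risk score per sample.

    expression:   {probe_set: [log2 expression value per sample]}
    coefficients: {probe_set: univariate Cox coefficient}
    Assumption: coefficient-weighted Z-scores are summed over probe sets.
    """
    z = {ps: z_scores(vals) for ps, vals in expression.items()}
    n_samples = len(next(iter(expression.values())))
    return [
        sum(coefficients[ps] * z[ps][i] for ps in expression)
        for i in range(n_samples)
    ]


# Hypothetical toy data: 2 probe sets measured in 3 samples.
expression = {"probe_a": [6.0, 8.0, 10.0], "probe_b": [5.0, 7.0, 9.0]}
coefficients = {"probe_a": 0.5, "probe_b": -0.2}
scores = risk_scores(expression, coefficients)
```

Because each probe set is standardized within the cohort, the resulting scores are relative to that cohort, which is why each validation dataset is Z-score transformed separately before scoring.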
Univariate analysis: Overall survival (date of surgery to date of last follow-up or death) was used as the outcome endpoint. Follow-up was truncated at 5 years. The association of the expression of individual probe sets with 5-year overall survival was evaluated by Cox proportional hazards regression. An inclusion criterion of p<0.005 was set for pre-selecting the candidate probe sets chosen for signature optimization (22).
Signature selection: Signature optimization was conducted by an exclusion followed by an inclusion selection procedure (
Principal Component Analysis (PCA): To further reduce the data dimensionality and remove possible collinearity in gene expression, PCA and a multivariate Cox proportional hazards model with stepwise selection were used. PCA identified 12 principal components (PCs), and these PCs were introduced into a multivariate Cox proportional hazards model with stepwise selection using an inclusion criterion of 0.5 (sle=0.5). PCs that were significantly associated with survival (sls=0.05) were retained. Four PCs were identified, and their coefficients are listed in Table 4. The weight of each member of the 12-gene signature in each of the 4 PCs is also listed in Table 4. The risk score was dichotomized at the optimal cutoff in the training set, determined by the macro available at http://ndc.mayo.edu/mayo/research/biostat/sasmacros.cfm (60), which gave a value of −0.056 as the risk score cutoff (Table 4).
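The PCA-based risk score and its dichotomization can be sketched as follows. Only the cutoff of −0.056 is taken from the text; the PC weights and Cox coefficients below are hypothetical placeholders (the actual values are those of Table 4), and the toy example uses 3 genes and 2 PCs rather than 12 genes and 4 PCs. Treating scores above the cutoff as high risk is also an assumption.

```python
CUTOFF = -0.056  # risk score cutoff reported for the training set


def pca_risk_score(z, pc_weights, pc_coefs):
    """Combine per-gene Z-scores into a single risk score.

    z:          Z-scores of the signature genes for one sample
    pc_weights: one weight vector per retained PC (same gene order as z)
    pc_coefs:   multivariate Cox coefficient of each retained PC
    """
    # Score of each PC: Z-scores weighted by the gene weights of that PC.
    pc_scores = [sum(w * x for w, x in zip(weights, z)) for weights in pc_weights]
    # Risk score: PC scores weighted by their Cox coefficients, then summed.
    return sum(c * s for c, s in zip(pc_coefs, pc_scores))


def risk_group(score, cutoff=CUTOFF):
    """Assumption: scores above the cutoff are called high risk."""
    return "high" if score > cutoff else "low"


# Hypothetical toy example (values are NOT those of Table 4).
z = [1.0, -1.0, 0.5]
pc_weights = [[0.5, 0.5, 0.0], [0.2, -0.2, 1.0]]
pc_coefs = [0.3, -0.4]
score = pca_risk_score(z, pc_weights, pc_coefs)
group = risk_group(score)
```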
Leave-one-out cross-validation (LOOCV): LOOCV was used as an internal validation of how accurately the signature assigned cases to the low- and high-risk groups. Cases were classified as low- or high-risk by the 12-gene signature based on the optimal cutoff in the entire cohort (n=129). Each case was then excluded one at a time, and the risk class (low or high) of the excluded case was predicted from the remaining cases (n=128). A case that was classified as high (or low) risk in the entire cohort but assigned as low (or high) risk in the LOOCV was counted as an error. The acceptable prediction error rate was <5%.
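The comparison between full-cohort and leave-one-out class assignments can be sketched as below. The text does not specify how the excluded case was predicted from the remaining cases; this sketch assumes that only the cutoff is re-derived from the remaining risk scores, and it substitutes the median for the optimal-cutoff macro, which is not reproduced here. All names and values are hypothetical.

```python
from statistics import median


def classify(score, cutoff):
    """Assumption: scores above the cutoff are called high risk."""
    return "high" if score > cutoff else "low"


def loocv_error_rate(scores, find_cutoff=median):
    """Fraction of cases whose class changes when they are left out.

    find_cutoff is a stand-in for the optimal-cutoff procedure; the
    median is used here purely for illustration.
    """
    full_cutoff = find_cutoff(scores)
    errors = 0
    for i, s in enumerate(scores):
        remaining = scores[:i] + scores[i + 1:]
        loo_cutoff = find_cutoff(remaining)  # derived without case i
        if classify(s, loo_cutoff) != classify(s, full_cutoff):
            errors += 1
    return errors / len(scores)


# Hypothetical risk scores for a small cohort.
toy_scores = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
error_rate = loocv_error_rate(toy_scores)
acceptable = error_rate < 0.05  # threshold stated in the text
```

Cases whose scores lie close to the cutoff are the ones most likely to change class, so the error rate reflects how many cases sit near the decision boundary.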
In silico validation of expression signature: In silico validation of the prognostic signature was carried out separately on the 3 validation datasets from Duke (52), SKKU (53), and DCC (54). Expression level was Z-score transformed and the risk score was generated using the parameters listed in Table 4. Multivariate analysis was performed by Cox proportional hazards regression with adjustment for stage, age and sex. Statistical analyses were performed using SAS v9.1 (SAS Institute, CA).
Quantitative RT-PCR (qPCR) validation of the signature: qPCR validation was carried out in 62 SQCC samples from the University Health Network. The patients did not receive any chemo- or radiotherapy before the samples were surgically resected. PrimerExpress v3.0 (Applied Biosystems, Foster City, CA) was used to design primers. Primers were primarily designed within the target sequence of the probe sets; when no primer could be found in this area, primers were designed in the CDS of the target gene. Primers used for quantification of the target genes are listed in Table 5. Five ng of cDNA was used for each reaction in the HT-7900 fast real-time PCR system (Applied Biosystems, Foster City, CA). PCR reaction optimization was described previously (57). Four house-keeping genes (ACTB, TBP, BAT1, and B2M) were used initially (57); however, NormFinder (63) found that the combination of 3 genes (ACTB, TBP, and BAT1) was the most stable (smallest variation, Table 6). Therefore, the mean of the Cts of the 3 house-keeping genes was used to normalize the qPCR data. Expression was quantitated using the 2^(−ΔΔCt) method and then Z-score transformed. The risk score was then calculated using the parameters listed in Table 4.
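The normalization and quantitation steps can be sketched as follows, using the mean Ct of ACTB, TBP and BAT1 as the reference. The choice of calibrator (the sample against which ΔΔCt is computed) is not stated in the text, so the calibrator here is an assumption, and all Ct values are hypothetical.

```python
from statistics import mean

HOUSEKEEPING = ("ACTB", "TBP", "BAT1")  # the 3 most stable genes per the text


def delta_ct(ct, target):
    """Ct of the target gene minus the mean Ct of the house-keeping genes."""
    return ct[target] - mean(ct[g] for g in HOUSEKEEPING)


def fold_change(sample_ct, calibrator_ct, target):
    """Relative expression by the 2^(-ddCt) method.

    calibrator_ct is an assumed reference sample; the text does not
    say which calibrator was used.
    """
    ddct = delta_ct(sample_ct, target) - delta_ct(calibrator_ct, target)
    return 2.0 ** (-ddct)


# Hypothetical Ct values for one tumor sample and one calibrator.
sample = {"ACTB": 18.0, "TBP": 24.0, "BAT1": 21.0, "VEGFA": 23.0}
calibrator = {"ACTB": 18.0, "TBP": 24.0, "BAT1": 21.0, "VEGFA": 25.0}
vegfa_fold = fold_change(sample, calibrator, "VEGFA")
```

In this toy example the target amplifies two cycles earlier in the sample than in the calibrator after normalization, which corresponds to a four-fold higher relative expression.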
Protein-protein interaction (PPI) network construction and analysis: To determine the relationships among the proteins corresponding to the 12-gene SQCC prognostic signature and two published SQCC prognostic signatures [the 50-gene signature of Sun et al. (64) and the 50-gene signature of Raponi et al. (51)], gene identifiers (EntrezGene IDs) and protein identifiers (SwissProt IDs) corresponding to the probe sets of each of the prognostic signatures were obtained from NetAffx (NA24) annotation tables. The 12-gene signature mapped to 12 genes (Table 6), Sun's 50-gene signature mapped to 42 genes, and Raponi's 50-gene signature mapped to 48 genes. PPI data were obtained by querying the Interologous Interaction Database (I2D v1.71; http://ophid.utoronto.ca/i2d (65)). Interactions were obtained for 8/12, 31/42, and 35/48 genes of the 12-gene, Sun's 50-gene and Raponi's 50-gene signatures, respectively, including 8/9 genes overlapping between the two 50-gene signatures. The interacting proteins were then used to query the same database to determine whether any interactions were present among them. The resulting PPI network based on these three SQCC prognostic signatures comprised 1,075 nodes/proteins and 14,651 edges/interactions. The PPI network was visualized and annotated using NAViGaTOR v2.08 (http://ophid.utoronto.ca/navigator/) (66).
Gene Ontology (GO) term and KEGG pathways enrichment analysis: GoStat (67) was used to evaluate GO term representation enrichment in the 12-gene signature. Significance was tested using Fisher's exact test and corrected by the Benjamini and Hochberg method. For KEGG pathway (68) (http://www.genome.jp/kegg/) representation enrichment analysis, Fisher's exact test was employed and the significance was corrected by the Bonferroni method. KEGG pathway representation enrichment in the protein-protein interaction (PPI) network of the three signature probe sets was also tested. Enrichment in the PPI data was determined by testing the proportions of KEGG pathway genes (for the 45 KEGG pathways for which at least 25% of the pathway genes were mapped in the experimentally determined PPI network) against expected proportions estimated from 1,000 randomly generated PPI networks, obtained by querying I2D using the same number of proteins as in the interaction network of these 3 signatures (66 genes/proteins). Student's t-test was then used to compare the proportion in the experimentally determined PPI network against the distributions in the random networks (69). The p-values were corrected by the Bonferroni method.
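The enrichment test and the two multiple-testing corrections named above can be sketched in a few lines. This is not the GoStat implementation; it assumes a one-sided (over-representation) Fisher's exact test computed from the hypergeometric distribution, and the gene counts and p-values are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    N: genes in the background, K: background genes with the annotation,
    n: genes in the signature,  k: signature genes with the annotation.
    Returns P(X >= k) under the hypergeometric distribution.
    """
    total = comb(N, n)
    upper = min(n, K)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def benjamini_hochberg(pvals):
    """Benjamini and Hochberg adjusted p-values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def bonferroni(pvals):
    """Bonferroni adjusted p-values, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


# Hypothetical example: 2 of 2 signature genes carry an annotation held
# by 3 of 10 background genes.
p_example = fisher_enrichment_p(k=2, n=2, K=3, N=10)
bh_example = benjamini_hochberg([0.01, 0.04, 0.03])
bonf_example = bonferroni([0.01, 0.04, 0.03])
```

The Benjamini and Hochberg adjustment controls the false discovery rate and is less conservative than Bonferroni, which is consistent with using it for the many GO terms and reserving Bonferroni for the smaller set of KEGG pathways.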
The steps leading to signature identification and subsequent validation are represented schematically in
This is followed by the inclusion procedure using 211514_at as its starting probe-set. The procedure included one probe-set at a time until all 96 ps were included. The exclusion procedure identified the largest R2 of 0.77 with a combination of 12 ps (12-gene) (
When the risk score was dichotomized at the optimal cutoff (−0.056, Table 4), the 12-gene signature classified 63 and 66 SQCC patients into low- and high-risk groups, respectively with a significant difference in overall survival (HR=11.47, 95% CI 4.78-27.49, p<0.0001,
We first tested the 12-gene signature in the Duke 89 NSCLC dataset (46 SQCC and 43 ADC). Four patients with stage III-IV (2 ADC and 1 SQCC in stage III and 1 SQCC in stage IV) were excluded from further analysis (Table 1). When the risk score was dichotomized at −0.056, the signature classified 25 and 19 of 44 SQCC and 13 and 28 of 41 ADC into low- and high-risk groups, respectively. High-risk SQCC had significantly poorer survival than the low-risk group (HR=2.91, 95% CI 1.17-7.24, p=0.022,
The SKKU dataset (7) included 138 stage I-III NSCLC (76 SQCC and 62 ADC) patients profiled using the U133 Plus 2 chip. This is the only NSCLC microarray dataset from Asia. Validation of our signature used recurrence-free survival, as this is the only endpoint reported for this study. Because the GEO database has no raw data, we downloaded the expression data, which were already GCRMA-preprocessed and log2-transformed. Gene expression level was Z-score transformed and the risk score was derived using the formula listed in Table 4. The 12-gene signature classified 41 and 35 of 76 SQCC and 27 and 35 of 62 ADC into low- and high-risk groups, respectively. Significantly shortened recurrence-free survival was observed in the high-risk group in the SQCC (HR=2.46, 95% CI 1.26-4.79, p=0.008,
To determine further whether the signature was prognostic in ADC, the 12-gene signature was tested in the largest available ADC microarray dataset from the NCI Director's Challenge Consortium study (11), which included 442 samples. Among them, 327 patients did not receive any adjuvant chemotherapy or radiotherapy and had follow-up longer than 1 month. The 12-gene signature was not prognostic (HR=1.26, 95% CI 0.87-1.81, p=0.221,
qPCR Validation in UHN SQCC Cohort
qPCR validation of the 12-gene signature was performed in an independent set of 62 snap-frozen SQCC samples from UHN. Fold change was calculated using the 2^(−ΔΔCt) method and then Z-score transformed. The risk score was generated using the parameters listed in Table 4. When the risk score was dichotomized at −0.056, the 12-gene signature separated 41 and 21 SQCC into low- and high-risk groups, respectively, with a significant difference in 5-year overall survival (HR=4.00, 95% CI 1.20-13.31, p=0.024,
Table 3 shows the members of the 12-gene signature and their ranks of expression level, variance, and significance in the Veridex dataset (in decreasing order of importance). Notably, the expression level of individual genes varies greatly, from very high levels for RPL22 (rank in the top 0.6%) to extremely low levels for PTPN20A/B (ranked at 99.7%). The standard deviation also varies greatly, from very large for G0S2 (rank at 1.9% of the total) to very small for RIPK5 (rank at 97.5% of the total). These data showed that the low-expression and low-variability genes were as important as those with higher expression and higher variability.
Gene ontology (GO) (29) and KEGG pathways (26, 30) annotations revealed the involvement of several of the prognostic genes in signal transduction (e.g., VEGFA, TNFRSF25), cell cycle (e.g., VEGFA, G0S2), apoptosis (e.g., TNFRSF25), adhesion (e.g., COL8A2), and transcription and translation (ZNF3 and RPL22, respectively) (Table 9).
To assess the potential SQCC-specific biological relevance of the 12-gene signature genes further, we evaluated the functional relationship between our 12-gene signature and the reported Raponi (13) and Sun (8) 50-gene signatures (mapped to 12, 48 and 42 genes, respectively) through their corresponding protein-protein interaction (PPI) networks. We mapped 8/12 genes of the 12-gene signature, and 35/48 and 31/42 genes of the Raponi and Sun signatures, respectively, to PPIs in the Interologous Interaction Database ver. 1.7 (I2D; (23)). While the Raponi and Sun signatures have 10 overlapping probe sets (9 genes), the 12-gene signature has no probe sets/genes overlapping with either of the 50-gene signatures. However, direct interactions between the signature genes/proteins or via shared interacting proteins were seen among these signatures, implying a rich shared functional milieu (
We describe here the MAximizing R Square Algorithm (MARSA), a heuristic signature selection method that includes only genes contributing to the separation ability of the signature. By applying the algorithm to the UM dataset, we identified a 12-gene prognostic signature. The prognostic value of the 12-gene signature was validated in silico in 2 independent SQCC microarray datasets (Duke: HR=3.05, 95% CI 1.14-8.21, p=0.027; SKKU: HR=2.73, 95% CI 1.32-5.64, p=0.007, Table 2) but not in the corresponding ADC datasets (Table 2). Further, we confirmed the absence of the prognostic value of the 12-gene signature in the largest available ADC dataset from DCC containing 442 ADC samples (Table 2). Importantly, qPCR validation in another independent cohort confirmed that the signature was an independent prognostic factor in SQCC (Table 2). Combined, our data strongly suggested that the 12-gene signature is a valuable prognostic factor for SQCC.
The cellular origin and pathogenesis of SQCC and ADC remain controversial. In contrast to ADC, SQCC tends to arise in the epithelium of large airways and its etiology is clearly linked to smoking, suggesting pathogenetic differences between the two lung cancer types (31). This is supported by differences in the occurrence of key genetic alterations in the two types of cancer (32). While frequently found in ADC, KRAS (33, 34) and EGFR (35) mutations occur very infrequently in SQCC. In contrast, P53 mutation (34) and TIMP3 (36) and HIF-1α (37) overexpression occur more frequently in SQCC than in ADC of the lung. Moreover, gene expression profiling has demonstrated distinctive patterns among the subtypes of NSCLC (38). Additionally, experience with targeted therapy indicates that significantly more ADC patients benefit from gefitinib and erlotinib treatments (39), both of which target EGFR, whereas SQCC patients benefit more from vandetanib (40), which targets both EGFR and VEGFR. Therefore, it may not be surprising that there could be gene signatures that are prognostic in SQCC but not in ADC patients.
Cancer phenotype is characterized by underlying gene expression. Thus, gene expression signatures may predict clinical outcome. The fact that our signature was validated consistently in multiple independent SQCC cohorts supports the notion that it may have captured a key gene expression program in squamous cancer biology. Indeed, many members of the 12-gene signature have been reported to be involved in processes underlying tumorigenesis, including tumor necrosis factor receptor superfamily, member 25 (TNFRSF25), which triggers apoptosis and activates the transcription factor NF-kappa-B in HEK293 or HeLa cells (41), and RIPK5, a cell death inducer (42). Vascular endothelial growth factor (VEGF or VEGFA) has been extensively studied (43) and is a major regulator of tumor angiogenesis (44). ARHGEF12 (Rho guanine nucleotide exchange factor 12) is involved in G-protein mediated signaling, which has been implicated in regulating cell morphology and invasion (45). It has also been shown to interact directly with insulin-like growth factor receptor 1 (IGF1r), providing a link between G protein-coupled and IGF1r signaling pathways (46) (
Previous approaches to the identification of prognostic signatures filtered out low-expression or low-variance genes prior to signature selection. However, this might lead to the exclusion of low expression but important genes in the signatures. In fact, one third of the genes (ARHGEF12, RIPK5, PTPN20A, and ZNF3) in the 12-gene signature had expression levels in the lowest 20% (from 79.9-99.7%), while their variation (SD) was in the lowest 10% (from 91.5-97.5%, Table 3) of all probe-sets. The consistent performance of the 12-gene signature in the training and test cohorts implied that these low-expressed and low-variable genes might have played important roles in tumor progression, and thus these genes must be included in signature selection.
In summary, MARSA is an effective approach to identify prognostic gene expression signatures and this novel 12-gene prognostic signature appears specific for SQCC.
Although preferred embodiments of the invention have been described herein, it will be understood by those skilled in the art that variations may be made thereto without departing from the spirit of the invention or the scope of the appended claims. All documents mentioned herein, including but not limited to the following reference list, are hereby incorporated by reference.
Q9NZN5*
Q93038*
Q6XUX3
P15692
P48681
P17036
P35268*
P25067
P62491*
P05388*
P35998*
P31483
P31483
Q16539*
P08397
Q99952
P21802*
Q92633
P55957*
Q16534*
O14936*
P21802*
P05388*
O00214
O00214
Q9NZ52*
Q9H3H5*
Q9UBP0
P51671
P20340*
Q6UB98
Q68CR1
Q8ND30
O75161
Q9UDR5*
P10145
P11836
Q9Y570*
P52594
Q86Y56
Q9HAT8
Q96I59
Q5VST7
Q9P0L2*
Q9H5J8
Q03112
P27816
Q9Y570*
P30279
P06737
P81877
Q92633
Q13761
Q13761
P35790
Q16534*
Q16534*
Q16633
P30825
O60494
Q16820*
O14936*
O00214
O00214
Q9UBP0
Q14005*
P41180*
Q14004
Q7Z340
Q9UPR0
O60941*
P10145
Q9Y5Z0
Q9Y2V2
O95336
Q5SXN3
Q9HAT8
Q9NPA5
Q9P0L2*
Q9BYV9*
Q8N9I9
O60936*
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/CA2010/000596 | 4/20/2010 | WO | 00 | 12/23/2011 |
Number | Date | Country | |
---|---|---|---|
61170743 | Apr 2009 | US |