This application claims the benefit of U.S. Patent Application No. 60/632,053, filed Nov. 30, 2005, which is incorporated herein by reference.
This invention relates to prognostics for lung cancer based on the gene expression profiles of biological samples.
Lung cancer is the leading cause of cancer deaths in developed countries, killing about 1 million people worldwide each year. An estimated 171,900 new cases are expected in 2003 in the US, accounting for about 13% of all cancer diagnoses. Non-small cell lung cancer (NSCLC) represents the majority (˜75%) of bronchogenic carcinomas, while the remainder are small cell lung carcinomas (SCLC). NSCLC comprises three main subtypes: 40% adenocarcinoma, 40% squamous, and 20% large cell cancer. Adenocarcinoma has replaced squamous cell carcinoma as the most frequent histological subtype over the last 25 years, peaking in the early 1990s. This may be associated with the use of “low tar” cigarettes resulting in deeper inhalation of cigarette smoke. Wingo et al. (1999). The overall 10-year survival rate of patients with NSCLC is a dismal 8-10%.
Approximately 25-30% of patients with NSCLC have stage I disease and of these 35-50% will relapse within 5 years after surgical treatment. Depending upon stage, adenocarcinoma has a higher relapse rate than squamous cell carcinoma with approximately 65% and 55% of SCC and adenocarcinoma patients surviving at 5 years, respectively. Mountain et al. (1987). Currently, it is not possible to identify those patients with a high risk of relapse. The ability to identify high-risk patients among the stage I disease group will allow for the consideration of additional therapeutic intervention leading to the potential for improved survival. Indeed, recent clinical trials have shown that adjuvant therapy following resection of lung tumors can lead to improved survival. Kato et al. (2004). Specifically, Kato et al. demonstrated that adjuvant chemotherapy with uracil-tegafur improves survival among patients with completely resected pathological stage I adenocarcinoma, particularly T2 disease.
Microarray gene expression profiling has recently been utilized to define prognostic signatures in patients with lung adenocarcinomas (Beer et al. (2002)); however, no large studies have investigated gene expression profiles of prognosis in the squamous cell carcinoma population. Here, we have profiled 134 SCC samples and 10 normal matched lung samples on the Affymetrix U133A chip. Hierarchical clustering and Cox modeling have identified genes that correlate with patient prognosis. These signatures can be used to identify patients who may benefit from adjuvant therapy following initial surgery.
The present invention provides a method of assessing lung cancer status by obtaining a biological sample from a lung cancer patient; and measuring Biomarkers associated with Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7 where the expression levels of the Marker genes above or below pre-determined cut-off levels are indicative of lung cancer status.
The present invention provides a method of staging lung cancer patients by obtaining a biological sample from a lung cancer patient; and measuring Biomarkers associated with Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7 where the expression levels of the Marker genes above or below pre-determined cut-off levels are indicative of the lung cancer stage.
The present invention provides a method of determining lung cancer patient treatment protocol by obtaining a biological sample from a lung cancer patient; and measuring Biomarkers associated with Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7 where the expression levels of the Marker genes above or below predetermined cut-off levels are sufficiently indicative of risk of recurrence to enable a physician to determine the degree and type of therapy recommended to prevent recurrence.
The present invention provides a method of treating a lung cancer patient by obtaining a biological sample from a lung cancer patient; measuring Biomarkers associated with Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7 where the expression levels of the Marker genes above or below pre-determined cut-off levels indicate a high risk of recurrence; and treating the patient with adjuvant therapy if they are a high risk patient.
The present invention provides a method of determining whether a lung cancer patient is high or low risk of mortality by obtaining a biological sample from a lung cancer patient; and measuring Biomarkers associated with Marker genes corresponding to those selected from Table 4 where the expression levels of the Marker genes above or below pre-determined cut-off levels are sufficiently indicative of risk of mortality to enable a physician to determine the degree and type of therapy recommended.
The present invention provides a method of generating a lung cancer prognostic patient report by determining the results of any one of the methods described herein and preparing a report displaying the results and patient reports generated thereby.
The present invention provides a composition comprising at least one probe set selected from the group consisting of: Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7.
The present invention provides a kit for conducting an assay to determine lung cancer prognosis in a biological sample comprising: materials for detecting isolated nucleic acid sequences, their complements, or portions thereof of a combination of genes selected from the group consisting of Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7.
The present invention provides articles for assessing lung cancer status comprising: materials for detecting isolated nucleic acid sequences, their complements, or portions thereof of a combination of genes selected from the group consisting of Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7.
The present invention provides a microarray or gene chip for performing the method described herein.
The present invention provides a diagnostic/prognostic portfolio comprising isolated nucleic acid sequences, their complements, or portions thereof of a combination of genes selected from the group consisting of Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7.
Non-small cell lung cancer (NSCLC) represents the majority (˜75%) of lung carcinomas and is comprised of three main subtypes: 40% squamous, 40% adenocarcinoma, and 20% large cell cancer. Approximately 25-30% of patients with NSCLC have stage I disease and of these 35-50% will relapse within 5 years after surgical treatment. Current histopathology and genetic biomarkers are insufficient for identifying patients who are at a high risk of relapse. As described in the present invention, 129 primary squamous cell lung carcinomas and 10 matched normal lung tissues were profiled using the Affymetrix U133A gene chip. Unsupervised hierarchical clustering identified two clusters of patients with lung carcinoma that had no correlation with stage of disease but had significantly different median overall survival (p=0.036). Cox proportional hazard models were then utilized to identify an optimal set of 50 genes (Table 1) in a 65 patient training set that significantly predicted survival in a 64 patient test set. This signature achieved 52% specificity and 82% sensitivity and provided an overall predictive value of 71%. Kaplan-Meier analysis showed clear significant stratification of high and low risk patients (p=0.0075). The identification of prognostic signatures allows identification of patients with high-risk squamous cell lung carcinoma who could benefit from adjuvant therapy following initial surgery.
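By way of illustration only, the Kaplan-Meier stratification mentioned above can be sketched with a product-limit estimator. The function name `kaplan_meier` and the follow-up values below are hypothetical and are not data from the study; the sketch does not compute the log-rank p-values reported above.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times:  follow-up duration for each patient (any consistent unit)
    events: True if death was observed, False if the patient was censored
    Returns a list of (time, survival_probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Hypothetical follow-up (months) for a small group of patients.
example_times = [1, 2, 2, 3, 4]
example_events = [True, True, False, True, False]
example_curve = kaplan_meier(example_times, example_events)
```

Running the estimator separately on the patients assigned to the high-risk and low-risk groups yields the two curves whose separation a Kaplan-Meier analysis compares.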
A Biomarker is any indicia of the level of expression of an indicated Marker gene. The indicia can be direct or indirect and measure over- or under-expression of the gene given the physiologic parameters and in comparison to an internal control, normal tissue or another carcinoma. Biomarkers include, without limitation, nucleic acids (both over and under-expression and direct and indirect). Using nucleic acids as Biomarkers can include any method known in the art including, without limitation, measuring DNA amplification, RNA, micro RNA, loss of heterozygosity (LOH), single nucleotide polymorphisms (SNPs, Brookes (1999)), microsatellite DNA, DNA hypo- or hyper-methylation. Using proteins as Biomarkers can include any method known in the art including, without limitation, measuring amount, activity, modifications such as glycosylation, phosphorylation, ADP-ribosylation, ubiquitination, etc., immunohistochemistry (IHC). Other Biomarkers include imaging, cell count and apoptosis markers.
The indicated genes provided herein are those associated with a particular tumor or tissue type. A Marker gene may be associated with numerous cancer types, but provided that the expression of the gene is sufficiently associated with one tumor or tissue type to be identified using the algorithm described herein to be specific for a lung cancer cell, the gene can be used in the claimed invention to determine cancer status and prognosis. Numerous genes associated with one or more cancers are known in the art. The present invention provides preferred Marker genes and even more preferred Marker gene combinations. These are described herein in detail.
A Marker gene corresponds to the sequence designated by a SEQ ID NO when it contains that sequence. A gene segment or fragment corresponds to the sequence of such gene when it contains a portion of the referenced sequence or its complement sufficient to distinguish it as being the sequence of the gene. A gene expression product corresponds to such sequence when its RNA, mRNA, or cDNA hybridizes to the composition having such sequence (e.g. a probe) or, in the case of a peptide or protein, it is encoded by such mRNA. A segment or fragment of a gene expression product corresponds to the sequence of such gene or gene expression product when it contains a portion of the referenced gene expression product or its complement sufficient to distinguish it as being the sequence of the gene or gene expression product.
The inventive methods, compositions, articles, and kits described and claimed in this specification include one or more Marker genes. “Marker” or “Marker gene” is used throughout this specification to refer to genes and gene expression products that correspond with any gene the over- or under-expression of which is associated with a tumor or tissue type. The preferred Marker genes are described in more detail in Table 8.
The present invention provides a method of assessing lung cancer status by obtaining a biological sample from a lung cancer patient; and measuring Biomarkers associated with Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7 where the expression levels of the Marker genes above or below pre-determined cut-off levels are indicative of lung cancer status.
The present invention provides a method of staging lung cancer patients by obtaining a biological sample from a lung cancer patient; and measuring Biomarkers associated with Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7 where the expression levels of the Marker genes above or below pre-determined cut-off levels are indicative of the lung cancer stage. The stage can correspond to any classification system, including, but not limited to the TNM system or to patients with similar gene expression profiles.
The present invention provides a method of determining lung cancer patient treatment protocol by obtaining a biological sample from a lung cancer patient; and measuring Biomarkers associated with Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7 where the expression levels of the Marker genes above or below pre-determined cut-off levels are sufficiently indicative of risk of recurrence to enable a physician to determine the degree and type of therapy recommended to prevent recurrence.
The present invention provides a method of treating a lung cancer patient by obtaining a biological sample from a lung cancer patient; measuring Biomarkers associated with Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7 where the expression levels of the Marker genes above or below pre-determined cut-off levels indicate a high risk of recurrence; and treating the patient with adjuvant therapy if they are a high risk patient.
The present invention provides a method of determining whether a lung cancer patient is high or low risk of mortality by obtaining a biological sample from a lung cancer patient; and measuring Biomarkers associated with Marker genes corresponding to those selected from Table 4 where the expression levels of the Marker genes above or below pre-determined cut-off levels are sufficiently indicative of risk of mortality to enable a physician to determine the degree and type of therapy recommended.
In the above methods, the sample can be prepared by any method known in the art including, but not limited to, bulk tissue preparation and laser capture microdissection. The bulk tissue preparation can be obtained for instance from a biopsy or a surgical specimen.
In the above methods, the gene expression measuring can also include measuring the expression level of at least one gene constitutively expressed in the sample.
In the above methods, the specificity is preferably at least about 40% and the sensitivity at least about 80%.
In the above methods, the pre-determined cut-off levels are at least about 1.5-fold over- or under-expression in the sample relative to benign cells or normal tissue.
In the above methods, the pre-determined cut-off levels have at least a statistically significant p-value over-expression in the sample having metastatic cells relative to benign cells or normal tissue, preferably the p-value is less than 0.05.
In the above methods, gene expression can be measured by any method known in the art, including, without limitation, on a microarray or gene chip; by nucleic acid amplification conducted by polymerase chain reaction (PCR) such as reverse transcription polymerase chain reaction (RT-PCR); by measuring or detecting a protein encoded by the gene, such as by an antibody specific to the protein; or by measuring a characteristic of the gene such as DNA amplification, methylation, mutation and allelic variation. The microarray can be, for instance, a cDNA array or an oligonucleotide array. All these methods can further contain one or more internal control reagents.
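For illustration, the preferred specificity (at least about 40%) and sensitivity (at least about 80%) stated for the above methods can be checked against paired predictions and outcomes as sketched below. The function names and the sample values are hypothetical; this is a minimal sketch, not the evaluation procedure used in the Examples.

```python
def sensitivity_specificity(predicted_high_risk, had_event):
    """Return (sensitivity, specificity) for two equal-length boolean lists.

    predicted_high_risk: True where the signature called the patient high risk
    had_event:           True where the patient actually relapsed or died
    """
    if len(predicted_high_risk) != len(had_event):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(predicted_high_risk, had_event))
    tp = sum(1 for p, e in pairs if p and e)
    fn = sum(1 for p, e in pairs if not p and e)
    tn = sum(1 for p, e in pairs if not p and not e)
    fp = sum(1 for p, e in pairs if p and not e)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one patient with and one without an event")
    return tp / (tp + fn), tn / (tn + fp)


def meets_preferred_performance(sensitivity, specificity,
                                min_sensitivity=0.80, min_specificity=0.40):
    """Compare against the preferred thresholds stated in the text."""
    return sensitivity >= min_sensitivity and specificity >= min_specificity


# Hypothetical calls for five patients.
example_sens, example_spec = sensitivity_specificity(
    [True, True, True, False, False],
    [True, True, False, False, True],
)
```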
The present invention provides a method of generating a lung cancer prognostic patient report by determining the results of any one of the methods described herein and preparing a report displaying the results and patient reports generated thereby. The report can further contain an assessment of patient outcome and/or probability of risk relative to the patient population.
The present invention provides a composition comprising at least one probe set selected from the group consisting of: Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7.
The present invention provides a kit for conducting an assay to determine lung cancer prognosis in a biological sample comprising: materials for detecting isolated nucleic acid sequences, their complements, or portions thereof of a combination of genes selected from the group consisting of Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7. The kit can further comprise reagents for conducting a microarray analysis, and/or a medium through which said nucleic acid sequences, their complements, or portions thereof are assayed.
The present invention provides articles for assessing lung cancer status comprising: materials for detecting isolated nucleic acid sequences, their complements, or portions thereof of a combination of genes selected from the group consisting of Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7. The articles can further contain reagents for conducting a microarray analysis and/or a medium through which said nucleic acid sequences, their complements, or portions thereof are assayed.
The present invention provides a microarray or gene chip for performing the method of claim 1, 2, 5, 6 or 7. The microarray can contain isolated nucleic acid sequences, their complements, or portions thereof of a combination of genes selected from the group consisting of Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7. Preferably, the microarray is capable of measurement or characterization of at least 1.5-fold over- or under-expression. Preferably, the microarray provides a statistically significant p-value over- or under-expression. Preferably, the p-value is less than 0.05. The microarray can contain a cDNA array or an oligonucleotide array and/or one or more internal control reagents.
The present invention provides a diagnostic/prognostic portfolio comprising isolated nucleic acid sequences, their complements, or portions thereof of a combination of genes selected from the group consisting of Marker genes corresponding to those selected from Table 1, Table 4, Table 5 or Table 7. Preferably, the portfolio is capable of measurement or characterization of at least 1.5-fold over- or under-expression. Preferably, the portfolio provides a statistically significant p-value over- or under-expression. Preferably, the p-value is less than 0.05.
The mere presence or absence of particular nucleic acid sequences in a tissue sample has only rarely been found to have diagnostic or prognostic value. Information about the expression of various proteins, peptides or mRNA, on the other hand, is increasingly viewed as important. The mere presence of nucleic acid sequences having the potential to express proteins, peptides, or mRNA (such sequences referred to as “genes”) within the genome by itself is not determinative of whether a protein, peptide, or mRNA is expressed in a given cell. Whether or not a given gene capable of expressing proteins, peptides, or mRNA does so and to what extent such expression occurs, if at all, is determined by a variety of complex factors. Irrespective of difficulties in understanding and assessing these factors, assaying gene expression can provide useful information about the occurrence of important events such as tumorigenesis, metastasis, apoptosis, and other clinically relevant phenomena. Relative indications of the degree to which genes are active or inactive can be found in gene expression profiles. The gene expression profiles of this invention are used to provide diagnosis, status, prognosis and treatment protocol for lung cancer patients.
Sample preparation requires the collection of patient samples. Patient samples used in the inventive method are those that are suspected of containing diseased cells such as cells taken from a nodule in a fine needle aspirate (FNA) of tissue. Bulk tissue preparation obtained from a biopsy or a surgical specimen and Laser Capture Microdissection (LCM) are also suitable for use. LCM technology is one way to select the cells to be studied, minimizing variability caused by cell type heterogeneity. Consequently, moderate or small changes in Marker gene expression between normal or benign and cancerous cells can be readily detected. Samples can also comprise circulating epithelial cells extracted from peripheral blood. These can be obtained according to a number of methods but the most preferred method is the magnetic separation technique described in U.S. Pat. No. 6,136,182. Once the sample containing the cells of interest has been obtained, a gene expression profile is obtained using a Biomarker, for genes in the appropriate portfolios.
Preferred methods for establishing gene expression profiles include determining the amount of RNA that is produced by a gene that can code for a protein or peptide. This is accomplished by reverse transcriptase PCR (RT-PCR), competitive RT-PCR, real time RT-PCR, differential display RT-PCR, Northern Blot analysis and other related tests. While it is possible to conduct these techniques using individual PCR reactions, it is best to amplify complementary DNA (cDNA) or complementary RNA (cRNA) produced from mRNA and analyze it via microarray. A number of different array configurations and methods for their production are known to those of skill in the art and are described in U.S. Patents such as: U.S. Pat. Nos. 5,445,934; 5,532,128; 5,556,752; 5,242,974; 5,384,261; 5,405,783; 5,412,087; 5,424,186; 5,429,807; 5,436,327; 5,472,672; 5,527,681; 5,529,756; 5,545,531; 5,554,501; 5,561,071; 5,571,639; 5,593,839; 5,599,695; 5,624,711; 5,658,734; and 5,700,637.
Microarray technology allows for the measurement of the steady-state mRNA level of thousands of genes simultaneously, thereby presenting a powerful tool for identifying effects such as the onset, arrest, or modulation of uncontrolled cell proliferation. Two microarray technologies are currently in wide use. The first are cDNA arrays and the second are oligonucleotide arrays. Although differences exist in the construction of these chips, essentially all downstream data analysis and output are the same. The products of these analyses are typically measurements of the intensity of the signal received from a labeled probe used to detect a cDNA sequence from the sample that hybridizes to a nucleic acid sequence at a known location on the microarray. Typically, the intensity of the signal is proportional to the quantity of cDNA, and thus mRNA, expressed in the sample cells. A large number of such techniques are available and useful. Preferred methods for determining gene expression can be found in U.S. Pat. Nos. 6,271,002; 6,218,122; 6,218,114; and 6,004,755.
Analysis of the expression levels is conducted by comparing such signal intensities. This is best done by generating a ratio matrix of the expression intensities of genes in a test sample versus those in a control sample. For instance, the gene expression intensities from a diseased tissue can be compared with the expression intensities generated from benign or normal tissue of the same type. A ratio of these expression intensities indicates the fold-change in gene expression between the test and control samples.
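The ratio matrix described above can be sketched as follows. The sketch divides each test-sample intensity by the mean control intensity for the same gene and then applies the 1.5-fold cut-off mentioned earlier in this specification. The function names, gene labels, and intensities are hypothetical, and averaging the control samples is an assumption made for the example.

```python
def ratio_matrix(test, control):
    """Fold-change of each test sample relative to the mean control intensity.

    test, control: dicts mapping gene -> list of intensities (one per sample)
    Returns a dict mapping gene -> list of ratios (one per test sample).
    """
    ratios = {}
    for gene, test_values in test.items():
        control_mean = sum(control[gene]) / len(control[gene])
        ratios[gene] = [value / control_mean for value in test_values]
    return ratios


def call_regulation(ratios, fold=1.5):
    """Label each gene by its mean ratio against a symmetric fold cut-off."""
    calls = {}
    for gene, values in ratios.items():
        mean_ratio = sum(values) / len(values)
        if mean_ratio >= fold:
            calls[gene] = "up"
        elif mean_ratio <= 1 / fold:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls


# Hypothetical intensities for three made-up genes.
example_test = {"GENE_A": [300.0, 450.0], "GENE_B": [50.0, 50.0], "GENE_C": [100.0, 110.0]}
example_control = {"GENE_A": [100.0, 200.0], "GENE_B": [100.0, 100.0], "GENE_C": [100.0, 100.0]}
example_ratios = ratio_matrix(example_test, example_control)
example_calls = call_regulation(example_ratios)
```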
Gene expression profiles can also be displayed in a number of ways. The most common method is to arrange raw fluorescence intensities or ratio matrix into a graphical dendrogram where columns indicate test samples and rows indicate genes. The data are arranged so genes that have similar expression profiles are proximal to each other. The expression ratio for each gene is visualized as a color. For example, a ratio less than one (indicating down-regulation) may appear in the blue portion of the spectrum while a ratio greater than one (indicating up-regulation) may appear as a color in the red portion of the spectrum. Computer software programs are commercially available to display such data, including “GENESPRING” from Silicon Genetics, Inc. and “DISCOVERY” and “INFER” software from Partek, Inc.
In the case of measuring protein levels to determine gene expression, any method known in the art is suitable provided it results in adequate specificity and sensitivity. For example, protein levels can be measured by binding to an antibody or antibody fragment specific for the protein and measuring the amount of antibody-bound protein. Antibodies can be labeled by radioactive, fluorescent or other detectable reagents to facilitate detection. Methods of detection include, without limitation, enzyme-linked immunosorbent assay (ELISA) and immunoblot techniques.
Modulated Markers used in the methods of the invention are described in the Examples. The genes that are differentially expressed are either up regulated or down regulated in patients with various lung cancer prognostics. Up regulation and down regulation are relative terms meaning that a detectable difference (beyond the contribution of noise in the system used to measure it) is found in the amount of expression of the genes relative to some baseline. In this case, the baseline is determined based on the algorithm. The genes of interest in the diseased cells are then either up- or down-regulated relative to the baseline level using the same measurement method.
Diseased, in this context, refers to an alteration of the state of a body that interrupts or disturbs, or has the potential to disturb, proper performance of bodily functions as occurs with the uncontrolled proliferation of cells. Someone is diagnosed with a disease when some aspect of that person's genotype or phenotype is consistent with the presence of the disease. However, the act of conducting a diagnosis or prognosis may include the determination of disease/status issues such as determining the likelihood of relapse, type of therapy and therapy monitoring. In therapy monitoring, clinical judgments are made regarding the effect of a given course of therapy by comparing the expression of genes over time to determine whether the gene expression profiles have changed or are changing to patterns more consistent with normal tissue.
Genes can be grouped so that information obtained about the set of genes in the group provides a sound basis for making a clinically relevant judgment such as a diagnosis, prognosis, or treatment choice. These sets of genes make up the portfolios of the invention. As with most diagnostic markers, it is often desirable to use the fewest number of markers sufficient to make a correct medical judgment. This prevents a delay in treatment pending further analysis as well as unproductive use of time and resources.
One method of establishing gene expression portfolios is through the use of optimization algorithms such as the mean variance algorithm widely used in establishing stock portfolios. This method is described in detail in US patent publication number 20030194734. Essentially, the method calls for the establishment of a set of inputs (stocks in financial applications, expression as measured by intensity here) that will optimize the return (e.g., signal that is generated) one receives for using it while minimizing the variability of the return. Many commercial software programs are available to conduct such operations. “Wagner Associates Mean-Variance Optimization Application,” referred to as “Wagner Software” throughout this specification, is preferred. This software uses functions from the “Wagner Associates Mean-Variance Optimization Library” to determine an efficient frontier and optimal portfolios in the Markowitz sense. Use of this type of software requires that microarray data be transformed so that it can be treated as an input in the way stock return and risk measurements are used when the software is used for its intended financial analysis purposes.
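As a simplified illustration of the mean-variance idea, and not of the Wagner Software or its library, the sketch below treats each gene's per-sample signal as a "return" series, forms equal-weight portfolios of a fixed size by brute force, and keeps the portfolio with the lowest variance among those meeting a minimum mean signal. The equal weighting, the exhaustive search, the function names, and the gene signals are all assumptions made for the example.

```python
from itertools import combinations


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def select_portfolio(signals, size, min_return):
    """Lowest-variance equal-weight gene subset with mean signal >= min_return.

    signals: dict mapping gene -> per-sample signal values
    Returns (subset, portfolio_mean, portfolio_variance), or None if no
    subset of the requested size meets min_return.
    """
    best = None
    for subset in combinations(sorted(signals), size):
        ret = mean([mean(signals[g]) for g in subset])
        if ret < min_return:
            continue
        # Variance of an equal-weight portfolio: sum of all covariances / size^2.
        var = sum(covariance(signals[a], signals[b])
                  for a in subset for b in subset) / size ** 2
        if best is None or var < best[2]:
            best = (subset, ret, var)
    return best


# Hypothetical signals for four made-up genes across four samples.
example_signals = {
    "GENE_A": [2.0, 2.0, 2.0, 2.0],
    "GENE_B": [1.0, 3.0, 1.0, 3.0],
    "GENE_C": [3.0, 1.0, 3.0, 1.0],
    "GENE_D": [0.1, 0.1, 0.2, 0.0],
}
example_subset, example_return, example_variance = select_portfolio(
    example_signals, size=2, min_return=1.5
)
```

In this toy input, GENE_B and GENE_C vary in opposite directions, so combining them cancels the variability while preserving the mean signal, which is the behavior a mean-variance selection rewards.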
The process of selecting a portfolio can also include the application of heuristic rules. Preferably, such rules are formulated based on biology and an understanding of the technology used to produce clinical results. More preferably, they are applied to output from the optimization method. For example, the mean variance method of portfolio selection can be applied to microarray data for a number of genes differentially expressed in subjects with cancer. Output from the method would be an optimized set of genes that could include some genes that are expressed in peripheral blood as well as in diseased tissue. If samples used in the testing method are obtained from peripheral blood and certain genes differentially expressed in instances of cancer could also be differentially expressed in peripheral blood, then a heuristic rule can be applied in which a portfolio is selected from the efficient frontier excluding those that are differentially expressed in peripheral blood. Of course, the rule can be applied prior to the formation of the efficient frontier by, for example, applying the rule during data pre-selection.
Other heuristic rules can be applied that are not necessarily related to the biology in question. For example, one can apply a rule that only a prescribed percentage of the portfolio can be represented by a particular gene or group of genes. Commercially available software such as the Wagner Software readily accommodates these types of heuristics. This can be useful, for example, when factors other than accuracy and precision (e.g., anticipated licensing fees) have an impact on the desirability of including one or more genes.
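The two kinds of heuristic rules described above, a biology-based exclusion and a cap on the share of the portfolio represented by a gene or group of genes, can be sketched as simple filters. The function names, gene labels, and weights are hypothetical, and how such rules are configured in any commercial package is not shown here.

```python
def exclude_genes(portfolio, excluded):
    """Drop genes flagged by a rule, e.g. differential expression in peripheral blood."""
    return [gene for gene in portfolio if gene not in excluded]


def exceeds_cap(weights, group, cap):
    """True if the genes in `group` together exceed `cap` (a fraction of total weight).

    weights: dict mapping gene -> portfolio weight (any non-negative scale)
    """
    total = sum(weights.values())
    share = sum(w for gene, w in weights.items() if gene in group) / total
    return share > cap


# Hypothetical portfolio and rule inputs.
example_portfolio = ["G1", "G2", "G3"]
example_blood_genes = {"G2"}
example_filtered = exclude_genes(example_portfolio, example_blood_genes)
example_weights = {"G1": 0.5, "G2": 0.3, "G3": 0.2}
```

The exclusion can be applied either to the optimized output or during data pre-selection, as the text notes; the function is the same in both cases and only the input list differs.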
The gene expression profiles of this invention can also be used in conjunction with other non-genetic diagnostic methods useful in cancer diagnosis, prognosis, or treatment monitoring. For example, in some circumstances it is beneficial to combine the diagnostic power of the gene expression based methods described above with data from conventional markers such as serum protein markers (e.g., Cancer Antigen 27.29 (“CA 27.29”)). A range of such markers exists. In one such method, blood is periodically taken from a treated patient and then subjected to an enzyme immunoassay for one of the serum markers described above. When the concentration of the marker suggests the return of tumors or failure of therapy, a sample source amenable to gene expression analysis is taken. Where a suspicious mass exists, a fine needle aspirate (FNA) is taken and gene expression profiles of cells taken from the mass are then analyzed as described above. Alternatively, tissue samples may be taken from areas adjacent to the tissue from which a tumor was previously removed. This approach can be particularly useful when other testing produces ambiguous results.
Kits made according to the invention include formatted assays for determining the gene expression profiles. These can include all or some of the materials needed to conduct the assays such as reagents and instructions and a medium through which Biomarkers are assayed.
Articles of this invention include representations of the gene expression profiles useful for treating, diagnosing, prognosticating, and otherwise assessing diseases. These profile representations are reduced to a medium that can be automatically read by a machine such as computer readable media (magnetic, optical, and the like). The articles can also include instructions for assessing the gene expression profiles in such media. For example, the articles may comprise a CD ROM having computer instructions for comparing gene expression profiles of the portfolios of genes described above. The articles may also have gene expression profiles digitally recorded therein so that they may be compared with gene expression data from patient samples. Alternatively, the profiles can be recorded in different representational format. A graphical recordation is one such format. Clustering algorithms such as those incorporated in “DISCOVERY” and “INFER” software from Partek, Inc. mentioned above can best assist in the visualization of such data.
Different types of articles of manufacture according to the invention are media or formatted assays used to reveal gene expression profiles. These can comprise, for example, microarrays in which sequence complements or probes are affixed to a matrix to which the sequences indicative of the genes of interest combine creating a readable determinant of their presence. Alternatively, articles according to the invention can be fashioned into reagent kits for conducting hybridization, amplification, and signal generation indicative of the level of expression of the genes of interest for detecting cancer.
The invention is further illustrated by the following non-limiting examples. All references cited herein are hereby incorporated herein.
Genes analyzed according to this invention are typically related to full-length nucleic acid sequences that code for the production of a protein or peptide. One skilled in the art will recognize that identification of full-length sequences is not necessary from an analytical point of view. That is, portions of the sequences or ESTs can be selected according to well-known principles for which probes can be designed to assess gene expression for the corresponding gene.
Methods
Patient Population
134 fresh frozen, surgically resected lung SCC and 10 matched normal lung samples from 133 individual patients (LS-71 and LS-136 were duplicate samples from different areas of the same tumor) from all stages of squamous cell lung carcinoma were evaluated in this study. These samples were collected from patients from the University of Michigan Hospital between October 1991 and July 2002 with patient consent and Institutional Review Board (IRB) approval. Portions of the resected lung carcinomas were sectioned and evaluated by the study pathologist by routine hematoxylin and eosin (H&E) staining. Samples chosen for analysis contained greater than 70% tumor cells. Approximately one third of patients (with equal proportions for each stage) received radiotherapy or chemotherapy following surgery. Seventy-seven patients were lymph node negative. Follow-up data were available for all patients. The mean patient age was 68±10 (range 42-91) with approximately 45% of patients 70 years or older. One patient (LS-3) likely died of surgery-related causes and was therefore not utilized in identifying prognostic signatures. Also, three specimens had mixed histology and were also not included in prognostic profiling (LS-76, LS-84, LS-112).
Microarray Analysis
For isolation of RNA, 20 to 40 cryostat sections of 30 μm were cut from each sample, in total corresponding to approximately 100 mg of tissue. Before, in between, and after cutting the sections for RNA isolation, 5 μm sections were cut for hematoxylin and eosin staining to confirm the presence of tumor cells. Total RNA was isolated with RNAzol B (Campro Scientific, Veenendaal, Netherlands), and dissolved in DEPC (0.1%)-treated H2O. About 2 ng of total RNA was resuspended in 10 μl of water and 2 rounds of the T7 RNA polymerase based amplification were performed to yield about 50 μg of amplified RNA. Quality of RNA was checked using the Agilent Bioanalyzer. The mean ribosomal ratio (28s/18s) for all samples was 1.5 (range: 1.0-2.1). Four micrograms of total RNA was amplified, labeled and aRNA was fragmented and hybridized to the Affymetrix U133A chip according to the manufacturer's instructions. Microarray data were extracted using the Affymetrix MAS 5 software. Global gene expression was scaled to an average intensity of 600 units. The data were then normalized using a spline quantile normalization method.
Statistical Analysis
Three complementary statistical methods were used to identify the optimal prognostic gene signature: Cox proportional-hazard regression modeling, bootstrapping, and a leave-20-percent-out cross validation (L20OCV).
Univariate Cox proportional-hazard regression modeling was performed to identify genes that were significantly associated with overall survival. The Cox score was defined as the sum of the selected genes' log2-based chip signals multiplied by their z scores from the Cox regression. Similarly, Cox scores were calculated for patients in the testing set with the same selected genes from the training set. A series of cutoffs (percentile of risk index for the patients in the training set) was applied to predict the clinical outcome of patients in the testing set by comparing each patient's Cox score in the testing set with a cutoff for the risk index. If a patient's Cox score was higher than the cutoff, the patient was classified as “high risk”; otherwise, the patient was placed in the “low risk” group. Kaplan-Meier analysis was performed to explore the survival characteristics of high-risk and low-risk patients. A cutoff of 3-year survival was employed since the majority of patients in this population who relapse do so within 3 years. Kiernan et al. (1993). Also, many of these patients die of non-cancer related illnesses after 3 years. Kiernan et al. (1993). This rationale was also employed when performing Cox modeling.
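The Cox score and percentile-cutoff classification described above can be illustrated with a minimal sketch. The gene signals, z scores, and the interpolation rule used for the percentile are hypothetical choices made for illustration only; the specification does not state how the percentile was computed, and the sketch assumes raw chip signals that are log2-transformed inside the function.

```python
import math

def cox_score(signals, z_scores):
    """Sum over selected genes of log2(chip signal) times the gene's Cox z score."""
    return sum(math.log2(s) * z for s, z in zip(signals, z_scores))

def percentile_cutoff(training_scores, pct):
    """Training-set score at percentile pct (0-100); linear interpolation is an assumption."""
    ordered = sorted(training_scores)
    pos = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def classify(score, cutoff):
    """A score above the cutoff is 'high risk'; anything else is 'low risk'."""
    return "high risk" if score > cutoff else "low risk"

# Hypothetical example: three selected genes and five training patients.
z_scores = [2.0, -1.0, 0.5]
training_signals = [[8, 4, 16], [2, 16, 4], [4, 4, 4], [16, 2, 16], [2, 2, 2]]
training_scores = [cox_score(s, z_scores) for s in training_signals]
cutoff = percentile_cutoff(training_scores, 50)

# A testing-set patient is scored with the same genes and z scores.
test_score = cox_score([16, 4, 4], z_scores)
label = classify(test_score, cutoff)
```

In this sketch the cutoff is derived from the training set only and then applied unchanged to the testing-set patient, mirroring the training/testing separation described above.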
The bootstrap method was also employed to provide a more stringent means of defining prognostic genes. Using the same training and testing sets created above, 65 samples were selected, with replacement, from the training set, and then Cox regression was performed on these samples. Each gene's P value and z score were recorded. This step was repeated 400 times, thus giving 400 P values and z scores for each gene. For each gene, the top and bottom 5% of P values were removed and then the mean P value and the rank of each gene (based on the mean P value) were defined. Similarly, the top and bottom 5% of z scores for each gene in the training set were removed and the sum of the remaining ones was calculated. Various numbers of top genes based on the mean P value were defined, and their log2-based chip signals were multiplied by the sum of their z scores. This yielded their Cox scores, namely, the risk index. The patients' Cox scores in the testing set were also calculated in this manner. Receiver operator characteristic (ROC) curves were drawn for patients in the training and testing sets and the area under the curve (AUC) values for each gene classifier were recorded. The AUC values were then plotted versus various numbers of gene classifiers to determine the optimal gene number that provides steady AUC values in the training set.
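The resampling and trimming steps of the bootstrap procedure can be sketched as follows. The Cox regression itself is not shown; the sketch assumes that the per-gene P values and z scores from each bootstrap round have already been collected, and all function names and example values are hypothetical.

```python
import random

def bootstrap_sample(training_ids, rng, size=65):
    """Draw `size` samples with replacement from the training set."""
    return rng.choices(training_ids, k=size)

def trim(values, fraction=0.05):
    """Remove the lowest and highest `fraction` of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return ordered[k:len(ordered) - k] if k else ordered

def summarize_gene(p_values, z_scores, fraction=0.05):
    """Return (mean of trimmed P values, sum of trimmed z scores) for one gene."""
    kept_p = trim(p_values, fraction)
    kept_z = trim(z_scores, fraction)
    return sum(kept_p) / len(kept_p), sum(kept_z)

def rank_genes(bootstrap_results, fraction=0.05):
    """bootstrap_results maps gene -> (p_values, z_scores); rank by ascending mean P."""
    summary = {g: summarize_gene(p, z, fraction) for g, (p, z) in bootstrap_results.items()}
    ranked = sorted(summary, key=lambda g: summary[g][0])
    return ranked, summary
```

With 400 bootstrap rounds, a 5% trim removes the 20 smallest and 20 largest values for each gene before the mean P value and the z-score sum are computed.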
A L20OCV was also performed to confirm the optimal gene number of the classifier. First, samples were partitioned into 5 groups with the same or very close numbers of samples. Five pairs of training and testing sets were generated, with the training set consisting of 80% of samples and the testing set consisting of the remaining 20%. Therefore, each sample was chosen exactly once in a testing set. Cox regression modeling was performed to select the top prognostic genes (from 2 to 200) in the training set and the selected genes were tested in the corresponding testing set. ROC analysis was performed to calculate the AUC. The mean AUC of the 5 testing sets for gene numbers from 2 to 200 was calculated. This was repeated 100 times and the mean of the 100 AUCs for gene numbers from 2 to 200 was then calculated. The mean AUC versus gene number (2 to 200) was plotted and the optimal number of genes in the signature was selected.
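The partitioning and AUC steps of the L20OCV can be sketched as below. The gene selection by Cox regression is omitted, and the rank-based AUC shown is one common way to compute the area under an ROC curve; the specification does not say which implementation was used, so both functions are illustrative assumptions.

```python
import random

def five_fold_splits(sample_ids, rng):
    """Shuffle the samples, deal them into 5 near-equal groups, and yield
    (training, testing) pairs so that each sample is tested exactly once."""
    shuffled = list(sample_ids)
    rng.shuffle(shuffled)
    folds = [shuffled[i::5] for i in range(5)]
    for i, testing in enumerate(folds):
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield training, testing

def auc(scores, events):
    """Rank-based AUC: fraction of (event, non-event) pairs in which the event
    case has the higher score; tied pairs count as one half."""
    pos = [s for s, e in zip(scores, events) if e]
    neg = [s for s, e in zip(scores, events) if not e]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Repeating the split with a different random seed 100 times and averaging the resulting AUC values for each gene number would correspond to the repetition described above.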
Hierarchical clustering was performed with GeneSpring 7.0 (Silicon Genetics) to identify major clusters of patients and investigate their association with patient co-variates. Prior to clustering, genes that had a coefficient of variation (CV) smaller than 0.3 (arbitrarily chosen) were removed so as to reduce the impact of genes that displayed minimal change in expression across the dataset. Thus a dataset with 11,101 genes was created for clustering analysis. The signal intensity of each gene was divided by the median expression level of that gene from all patients. Samples were clustered using Pearson correlation as the measurement of similarity. Genes were clustered in the same way.
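The pre-clustering steps (CV filter, per-gene median scaling, and Pearson correlation as the similarity measure) can be sketched as follows. This is not the GeneSpring implementation; the use of the sample standard deviation for the CV and all names and values are assumptions for illustration.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (sample SD is an assumption)."""
    return statistics.stdev(values) / statistics.mean(values)

def filter_and_scale(expression, min_cv=0.3):
    """expression maps gene -> list of signals, one per patient.
    Drop genes with CV below min_cv, then divide each signal by the gene's median."""
    scaled = {}
    for gene, values in expression.items():
        if coefficient_of_variation(values) >= min_cv:
            median = statistics.median(values)
            scaled[gene] = [v / median for v in values]
    return scaled

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical signals for two genes across four patients.
expression = {"flat": [100, 101, 99, 100], "variable": [50, 100, 200, 400]}
scaled = filter_and_scale(expression)
```

Here the gene with nearly constant signal is removed by the CV filter, while the variable gene is retained and expressed relative to its median.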
Results
Microarray Profiling
141 of the 144 microarrays gave excellent data (% present>40, scaling factor<10) while the remaining 3 samples (LS-76, LS-78, LS-82) gave acceptable results (% present>30, scaling factor<15). Table 2 shows the clinical-pathological staging of the 134 SCC samples analyzed by microarray. All samples were included in the initial clustering analysis. Genes were filtered from the dataset if they were not called present in at least 10% of all samples (including normal). This left 14,597 genes for analysis.
Note: One duplicate stage IIb; 77 lymph node negative samples.
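The array quality thresholds and the present-call filter described under Microarray Profiling can be sketched as follows. The sketch assumes detection calls coded as “P”, “M”, and “A” (present, marginal, absent), as reported by MAS 5; the “below acceptable” label and the function names are hypothetical and do not appear in the study.

```python
def qc_grade(percent_present, scaling_factor):
    """Grade an array using the thresholds stated in the text."""
    if percent_present > 40 and scaling_factor < 10:
        return "excellent"
    if percent_present > 30 and scaling_factor < 15:
        return "acceptable"
    return "below acceptable"  # hypothetical label; no such arrays are reported

def filter_by_present_calls(calls, min_fraction=0.10):
    """calls maps gene -> list of detection calls, one per sample.
    Keep genes called present ('P') in at least min_fraction of all samples."""
    kept = []
    for gene, gene_calls in calls.items():
        present = sum(1 for c in gene_calls if c == "P")
        if present >= min_fraction * len(gene_calls):
            kept.append(gene)
    return kept

# Hypothetical calls for three genes across ten samples.
calls = {"g1": ["P"] + ["A"] * 9, "g2": ["A"] * 10, "g3": ["P"] * 10}
kept = filter_by_present_calls(calls)
```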
Unsupervised Hierarchical Clustering
For unsupervised clustering, the dataset was further filtered by removing genes (CV<30%) that had low variation of expression across the entire dataset. The 134 SCC and 10 normal lung samples were initially clustered based on unsupervised k-means clustering of the remaining 11,101 genes. The normal lung samples had a distinct profile from the carcinomas and clustered together. The 2 duplicate SCC samples (LS-71 and LS-136) clustered together, demonstrating the reproducibility of the microarray analysis. Of the 133 unique patient carcinomas, four were removed from further analysis since the patient either died due to surgery (LS-3) or the sample had mixed histology (LS-76, LS-84, LS-112). When the 129 samples were clustered using the 11,101 genes, two major clusters were formed, one with 55 patients and the other with 74 patients (
Identification of Prognostic Gene Signatures
To identify genes that could further stratify early stage patients into good and poor prognostic groups, several complementary statistical analyses were performed. These included: 1) Cox modeling on a training set and validating prognostic signatures on a test set of samples; 2) bootstrapping; and 3) L20OCV.
First, the 129 SCC samples were split into training and test sets with the stages equally represented in both groups. Both groups showed similar overall median survival times. The 65-patient training set was analyzed using a bootstrapping method (see Methods section) to determine the optimal number of genes to be used in the prognostic signature. When increasing numbers of genes were plotted versus the AUC from a receiver operator characteristic analysis, it could be seen that the signature performance began to plateau at around 50 genes (
A LOOCV procedure was then used in the 65-patient training set to determine the optimal cutoff of the risk index. The error rates were calculated with various cutoffs. This indicated that a cutoff at the 58th percentile gave the lowest error rate (
Identification of a Robust Prognostic Signature
Although we used a bootstrap method to avoid random sampling issues in the training-testing method, a more robust prognostic signature might be identified if we use all 129 samples in the training set. Therefore, a gene signature was also selected by bootstrapping the entire 129-patient dataset. Genes were ranked based on their mean P value and the top 100 genes were identified (Table 4). Twenty-three of these genes were in common with the top 50 genes identified from the training-test method.
We had data on time to relapse (TTR) for 16 patients. The mean TTR was 21.7 months with 88% of patients relapsing within 3 years. Since the majority of patients who die after 3 years die from non-cancer related causes we chose a cutoff of 36 months for classifying patients who will have a lung cancer-related death. Our defined classifiers were tested with or without a 36-month cutoff. The signatures had a better performance in the testing set when a 3-year cutoff was employed. Therefore, a gene signature selected with the time limit is better than without the time limit.
Identification of a High-Risk Sub-Group of SCC Patients
The unsupervised hierarchical clustering described above identified two main groups of patients that differed significantly in their overall survival. A bootstrap analysis performed on the two patient groups found 121 genes (non-unique) whose expression levels were significantly different between the high- and low-risk groups (p<0.001, mean difference>3-fold; Table 5). Interestingly, the majority of these genes (118) were down-regulated in the high-risk group (
Gene Expression Signatures for Prognosis of Lung Cancer.
Methods
Real-Time Quantitative RT-PCR
Total RNA samples were normalized by OD260. Quality testing included analysis by capillary electrophoresis using a Bioanalyzer (Agilent). For aRNA, the Ribobeast™ 1-Round Aminoallyl-aRNA amplification kit (Epicentre) was used. All first-strand cDNA synthesis, second-strand cDNA synthesis, in vitro transcription of aRNA, DNase treatment, purification and other steps were performed according to the manufacturer's protocol. For each sample, aRNA was reverse transcribed into first-strand cDNA and used for real-time quantitative RT-PCR. The first-strand cDNA synthesis reaction contained 100 ng of aRNA, 1 μl of 50 ng/μl T7-Oligo(dT) primer, 0.25 μl of 10 mM dNTPs, 1 μl of 5× Superscript™ III Reverse Transcriptase Buffer, 0.25 μl of 200 U/μl Superscript™ III Reverse Transcriptase (Invitrogen Corp), 0.25 μl of 100 mM DTT and 0.25 μl of 0.3 U/μl RNase Inhibitor (Epicentre) in a total reaction volume of 5 μl.
Real-time quantitative RT-PCR analyses were performed on the ABI Prism 7900HT sequence detection system (Applied Biosystems). Each reaction contained 10 μl of 2× TaqMan® Universal PCR Master Mix (Applied Biosystems), 5 μl of cDNA template, and 1 μl of 20× Assays-on-Demand Gene Expression Assay Mix (Applied Biosystems) in a total reaction volume of 20 μl. The PCR consisted of a UNG activation step at 50° C. for 2 min and an initial enzyme activation step at 95° C. for 10 min, followed by 40 cycles of 95° C. for 15 sec and 60° C. for 1 min.
Immunohistochemistry
Immunohistochemistry (IHC) was performed on tissue microarrays containing 60 lung squamous cell carcinomas. Areas of the tumor that best represented the overall morphology were selected for generating a tissue microarray (TMA) block as previously described by Kononen et al. (1998). All controls stained negative for background.
Pathway Analysis
Pathway analysis was performed by first mapping the genes on the Affy U133A chip to the Biological Process categories of Gene Ontology (GO). The categories that had at least 10 genes on the U133A chip were used for subsequent pathway analyses. Genes that were selected from data analysis were mapped to the GO Biological Process categories. Then the hypergeometric distribution probability of the genes was calculated for each category. A category that had a p-value less than 0.05 and had at least two genes was considered over-represented in the selected gene list.
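The over-representation test described above can be sketched with an exact hypergeometric calculation. The text does not state whether a point probability or a tail probability was used; the sketch assumes the upper tail, P(X >= k), which is a common choice for enrichment testing. The parameter names are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, total_genes, category_genes, selected_genes):
    """Probability of seeing at least k category genes when selected_genes are
    drawn without replacement from total_genes, of which category_genes
    belong to the GO category."""
    upper = min(category_genes, selected_genes)
    favorable = sum(
        comb(category_genes, i) * comb(total_genes - category_genes, selected_genes - i)
        for i in range(k, upper + 1)
    )
    return favorable / comb(total_genes, selected_genes)

def over_represented(k, total_genes, category_genes, selected_genes,
                     alpha=0.05, min_genes=2):
    """Apply the two criteria from the text: p-value below 0.05 and at least two genes."""
    p = hypergeom_upper_tail(k, total_genes, category_genes, selected_genes)
    return k >= min_genes and p < alpha
```

For example, under these assumptions a category of 20 genes on a 1,000-gene chip that contributes 5 genes to a 50-gene selected list would be flagged, whereas a category contributing a single gene would not, regardless of its p-value.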
Identification of Core Set of Prognostic Genes
Briefly, 400 random training sets of 65 patients were selected from the 129 lung SCC patients. For each training set, Cox regression was performed to identify significant genes at the 5% significance level (i.e. P<0.05). The 331 genes that were significant in more than 40% of the training sets were used as the core gene set. These 331 genes are shown in Table 7.
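The selection of the core gene set can be sketched as follows. The Cox regression step is not shown; the sketch assumes that, for each training set, the genes with P<0.05 have already been collected into a set. Drawing each training set without replacement is also an assumption, and all names are hypothetical.

```python
import random

def random_training_sets(patient_ids, n_sets=400, size=65, seed=0):
    """Draw n_sets random training sets of `size` patients each (without replacement)."""
    rng = random.Random(seed)
    return [rng.sample(patient_ids, size) for _ in range(n_sets)]

def core_genes(significant_by_set, min_fraction=0.40):
    """significant_by_set is a list with one set of significant genes per training set.
    Return genes that are significant in more than min_fraction of the training sets."""
    counts = {}
    for genes in significant_by_set:
        for gene in genes:
            counts[gene] = counts.get(gene, 0) + 1
    threshold = min_fraction * len(significant_by_set)
    return sorted(g for g, c in counts.items() if c > threshold)

# Hypothetical results from five training sets.
significant_by_set = [{"A", "B"}, {"A"}, {"A", "C"}, {"B"}, {"A"}]
core = core_genes(significant_by_set)
```

In this toy example only gene “A” exceeds the 40% threshold; gene “B” is significant in exactly 40% of the sets and is therefore excluded by the “more than 40%” criterion.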
Microarray Results Verification
To confirm the microarray results, we initially performed TaqMan® quantitative RT-PCR on 4 genes (FGFR2, KRT13, NTRK2, and VEGF). The correlation between the platforms ranged from 0.71 to 0.96, indicating the expression data were reproducible.
Immunohistochemistry was then performed on tissue microarrays to confirm expression of several of these proteins within the tumor cells. Various levels of expression of several keratins, in addition to the tyrosine kinase proteins FGFR2 and NTRK2, were demonstrated in SCC cells.
Identification of a Core Set of Prognostic Genes
In the previous analysis a set of 50 genes was identified from a single training set of 65 patients. One problem with this approach is that the genes identified as predictors of prognosis can be unstable since the molecular signature strongly depends on the selection of patients in the training sets. The use of validation by repeated random sampling can avoid this instability. We therefore generated 400 random training sets of 65 patients from the 129 lung SCC patients and performed Cox regression to identify significant genes at the 5% significance level (i.e. P<0.05). 331 genes that were significant in more than 40% of the training sets were identified as a core set of prognostic genes in squamous cell lung cancer. These genes are SEQ ID NOs: in Table 7.
Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, the descriptions and examples should not be construed as limiting the scope of the invention.
No government funds were used to make this invention.
Number | Date | Country
---|---|---
60632053 | Nov 2004 | US
60655573 | Feb 2005 | US