The present invention pertains to a method for the detection of non-small cell lung cancer (NSCLC) based on the abundance of particular RNAs from blood samples, as well as diagnostic tools such as kits and arrays suitable for such method.
Lung cancer is still the leading cause of cancer related death worldwide. Prognosis has remained poor with a disastrous two-year survival rate of only about 15% due to diagnosis of the disease in late, i.e. incurable stages in the majority of patients (Jemal, A. et al., CA Cancer J. Clin., 58: 71-96 (2008)) and still disappointing therapeutic regimens in advanced disease (Sandler, A. et al., N. Engl. J. Med., 355: 2542-50 (2006)). Thus, there is an urgent need to establish reliable tools for the identification of non-small cell lung cancer (NSCLC) patients at early stages of the disease e.g. prior to the development of clinical symptoms. Today, the only way to detect non-small cell lung cancer is by means of imaging technologies detecting morphological changes in the lung in combination with biopsy specimens taken for histological examination. However, these screening approaches are not easily applied to secondary prevention of non-small cell lung cancer in an asymptomatic population (Henschke, C. I. et al., N. Engl. J. Med., 355: 1763-71 (2006)).
The use of surrogate tissue-based, e.g. blood-based, biomarkers for non-small cell lung cancer might therefore circumvent the known pitfalls of imaging technologies and invasive diagnostics (Henschke, C. I. et al., N. Engl. J. Med., 355: 1763-71 (2006); Bach, P. B. et al., Chest, 132: 69S-77S (2007)). Such biomarkers might be utilized to direct imaging-based and invasive screening approaches to only those individuals identified as potential non-small cell lung cancer patients by biomarker screening.
Array-based assessment of disease-specific gene expression patterns in peripheral blood mononuclear cells (PBMC) has been reported for non-malignant (Staratschek-Jox, A. et al., Expert Review of Molecular Diagnostics, 9: 271-80 (2009)) and malignant diseases including renal cell carcinoma, melanoma, bladder, breast and lung cancer (Burczynski, M. E. et al., Clin. Cancer Res., 11: 1181-9 (2005); Twine, N. C. et al., Cancer Res., 63: 6069-75 (2003); Sharma, P. et al., Breast Cancer Res., 7: R634-44 (2005); Osman, I. et al., Clin. Cancer Res., 12:3374-80 (2006); Critchley-Thorne, R. J. et al., PLoS Med., 4: e176 (2007); Showe, M. K. et al., Cancer Res., 69: 9202-10 (2009)). In some cases, gene expression profiles derived from PBMC were even suggested as promising tools for early detection (Sharma, P. et al., Breast Cancer Res., 7: R634-44 (2005); Showe, M. K. et al., Cancer Res., 69: 9202-10 (2009)) or prediction of prognosis (Burczynski, M. E. et al., Clin. Cancer Res., 11: 1181-9 (2005)), although these findings have not yet been validated in independent studies. Furthermore, circumventing known pitfalls of analyzing PBMC in a clinical setting (Debey, S. et al., Pharmacogenomics J., 4: 193-207 (2004); Debey, S. et al., Genomics, 87: 653-64 (2006)) by using stabilized RNA derived from whole blood would further strengthen the validity of blood-based surrogate biomarkers for early diagnosis of lung cancer and other malignant diseases.
The inventors have shown for the first time that measuring the abundance of a set of at least 5 different RNAs, possibly stemming from the transcription of genes, from a sample of whole blood allows for a highly accurate prediction of whether a human individual suffers from non-small cell lung cancer or not. Thus, a means for detecting or diagnosing non-small cell lung cancer is provided that is based on a minimally invasive method such as drawing a blood sample from a patient.
In this context, the invention teaches a test system that is trained in the detection of non-small cell lung cancer, comprising at least 5 RNAs, which can be quantitatively measured on an adequate set of training samples, with adequate clinical information on carcinoma status, applying adequate quality control measures, and on an adequate set of test samples, for which the detection has yet to be made. Given the quantitative values for the biomarkers and the clinical data for the training, a classifier can be trained and applied to the test samples to calculate the probability of the presence of non-small cell lung cancer. The 484 RNAs of the invention are defined in particular in Table 2 and are characterized through their nucleotide sequence or further through synonymous probe IDs provided in Table 2. It is noted that the abundance of a particular RNA may be determined using different probe IDs, as known in the art.
Therefore, the inventors provide a means for diagnosing or detecting NSCLC in a human individual with a sensitivity and specificity (as shown e.g. by the area under the curve (AUC) values in lists 1 to 51) that has thus far not been described for a blood-based method.
In a first aspect, the invention provides a method for the detection of non-small cell lung cancer (NSCLC) of any clinical stage in a human individual based on RNA obtained from a blood sample obtained from the individual. Such a method comprises at least the following two steps: Firstly, the abundance of at least 5 RNAs that are chosen from the RNAs listed in Table 2 is determined in the sample. Secondly, based on the measured abundance, it is concluded whether the patient has NSCLC or not.
In preferred embodiments of the invention, the abundance of at least 6 RNAs, of at least 7 RNAs, of at least 11 RNAs, of at least 16 RNAs, of at least 21 RNAs, of at least 25 RNAs, or of at least 34 RNAs that are chosen from the RNAs listed in Table 2 is determined, respectively. Reference is made in this context to lists 1 to 51, which show exemplary embodiments of sets of RNAs whose abundance can be measured in the method of the invention. For each set of RNAs, the AUC is provided, which is a quantitative parameter for the clinical utility (specificity and sensitivity) of the invention.
In a preferred embodiment of this method, it comprises determining the abundance of the RNAs specified in
In further preferred embodiments of the invention, the abundance of at least 5 RNAs from the RNAs listed in Table 2 is measured, wherein at least one RNA that is chosen from the group consisting of SEQ ID NOs:72 (PLSCR1), 85 (VPREB3), 99 (NT5C2), 146 (BTN3A2), 183 (ANKMY1) and 283 (BLR1) is excluded from the measurement. In preferred embodiments, at least two RNAs that are chosen from the group consisting of SEQ ID NOs:72, 85, 99, 146, 183 and 283, or all six RNAs from the group consisting of SEQ ID NOs:72, 85, 99, 146, 183 and 283 are not measured. In the latter case the abundance of at least 5 RNAs is measured that are listed in Table 2 and that are chosen from the group consisting of SEQ ID NOs: 1-71, 73-84, 86-98, 100-145, 147-182, 184-282 and 284-484. It is understood that the embodiments of the method of the invention that avoid the determination of the abundance of one or more of the RNAs of SEQ ID NOs:72, 85, 99, 146, 183 and 283 are at least equally sensitive as those including the determination of said RNAs.
An RNA obtained from an individual's blood sample, i.e. an RNA biomarker, is an RNA molecule with a particular base sequence whose presence within a blood sample from a human individual can be quantitatively measured. The RNA can e.g. be mRNA, cDNA, unspliced RNA, or fragments of any of the aforementioned molecules.
The term “abundance” refers to the amount of RNA in a sample of a given size. In a preferred embodiment, the term “abundance” is equivalent to the term “expression level”. The term “whole blood” refers to a sample of blood taken from a human individual for which no separation of particular fractions of the blood is performed. In particular, no separation of a certain type of blood cell or of blood cells in general needs to be performed, since the whole blood sample is used in the present invention. This allows for easier handling and shipping of the blood samples compared to methods in which the blood sample is separated into different fractions.
Lung Cancer is subdivided into two major histological and clinical groups: Small Cell Lung Cancer and Non Small Cell Lung Cancer (WHO Lung Cancer classification, Travis et al., IARC Press 2004). The UICC based staging system has recently been revised. All data obtained for this study were based on the UICC based staging system version 6 (Travis et al., JTO 2008).
The conclusion whether the patient has NSCLC or not may comprise, in a preferred embodiment of the method, classifying the sample as being from a healthy individual or from an individual having NSCLC based on the specific difference of the abundance of the at least 5 RNAs in healthy individuals versus the abundance of the at least 5 RNAs in individuals with NSCLC in a reference set. In the present method, a sample can be classified as being from a patient with NSCLC or from a healthy individual without the necessity to run a reference sample of known origin (i.e. from an NSCLC patient or a healthy individual) at the same time.
In a preferred embodiment the method of the invention is a method for the detection of NSCLC in a human individual based on RNA obtained from a blood sample obtained from the individual, comprising:
determining the abundance of at least 5 RNAs in the sample that are chosen from the RNAs listed in Table 2, and
classifying the sample as being from a healthy individual or from an individual having NSCLC based on the specific difference of the abundance of the at least 5 RNAs in healthy individuals versus the abundance of the at least 5 RNAs in individuals with NSCLC. Particularly preferred for this method is that the abundance of at least 7 RNAs, of at least 11 RNAs, of at least 16 RNAs, of at least 21 RNAs, of at least 25 RNAs, or of at least 34 RNAs chosen from the RNAs listed in Table 2 is determined.
The conclusion or test result whether the individual has NSCLC or not is preferably reached on the basis of a classification algorithm, such as a support vector machine, a random forest method, or a K-nearest neighbor method, as known in the art.
For the development of a model that allows for the classification for a given set of biomarkers, such as RNAs, only those methods are needed that are generally known to a person of skill in the art.
The major steps of such a model are:
1) condensation of the raw measurement data (for example combining probes of a microarray to probeset data, and/or normalizing measurement data against common controls);
2) training and applying a classifier (i.e. a mathematical model that generalizes properties of the different classes (NSCLC vs. healthy individual) from the training data and applies them to the test data, resulting in a classification for each test sample).
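Step 1) above can be illustrated with a minimal Python sketch. It is not the procedure used by the inventors; the function names, the probe-to-probeset mapping and the choice of dividing by the mean of the control probesets are assumptions made purely for illustration.

```python
from statistics import mean


def condense_probes(probe_values, probe_to_probeset):
    """Average raw probe intensities that map to the same probeset."""
    grouped = {}
    for probe, value in probe_values.items():
        grouped.setdefault(probe_to_probeset[probe], []).append(value)
    return {probeset: mean(values) for probeset, values in grouped.items()}


def normalize_to_controls(probeset_values, control_ids):
    """Divide each non-control probeset by the mean of the common controls.

    Assumption: intensities are on a linear scale and the controls are non-zero.
    """
    control_mean = mean(probeset_values[c] for c in control_ids)
    return {
        probeset: value / control_mean
        for probeset, value in probeset_values.items()
        if probeset not in control_ids
    }


# Hypothetical raw data: three probes for two RNAs plus one control probe.
raw = {"p1": 100.0, "p2": 140.0, "p3": 60.0, "ctrl1": 200.0}
mapping = {"p1": "rnaA", "p2": "rnaA", "p3": "rnaB", "ctrl1": "CTRL"}

condensed = condense_probes(raw, mapping)
normalized = normalize_to_controls(condensed, {"CTRL"})
```

The normalized values would then serve as input for step 2), the training and application of the classifier.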
The key component of these classifier training and classification techniques is the choice of RNA biomarkers that are used as input to the classification algorithm. The classification is in one embodiment achieved by applying a Prediction Method for Support Vector Machines (hereinafter SVM; David Meyer based on C++-code by Chih-Chung Chang and Chih-Jen Lin), which predicts values based upon a model trained by svm.
Usage:
## S3 method for class 'svm':
predict(object, newdata, decision.values = FALSE,
        probability = FALSE, ..., na.action = na.omit)
Value:
A vector of predicted values (for classification: a vector of labels, for density estimation: a logical vector). If decision.values is TRUE, the vector gets a “decision.values” attribute containing an n×c matrix (n number of predicted values, c number of classifiers) of all c binary classifiers' decision values. There are k*(k−1)/2 classifiers (k number of classes). The colnames of the matrix indicate the labels of the two classes. If probability is TRUE, the vector gets a “probabilities” attribute containing an n×k matrix (n number of predicted values, k number of classes) of the class probabilities.
If the training set was scaled by svm (done by default), the new data is scaled accordingly using scale and center of the training data.
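Two details of the prediction step described above can be made concrete with a short Python sketch: new data are scaled with the center and scale taken from the training data only, and a problem with k classes yields k*(k−1)/2 pairwise classifiers. The sketch does not reproduce the svm implementation itself; the function names are hypothetical, and the use of the sample standard deviation as the scale is an assumption.

```python
from statistics import mean, stdev


def fit_scaling(train_rows):
    """Return per-feature centers and scales computed on the training data only."""
    columns = list(zip(*train_rows))
    centers = [mean(col) for col in columns]
    # Assumption: scale = sample standard deviation; constant features get 1.0.
    scales = [stdev(col) or 1.0 for col in columns]
    return centers, scales


def apply_scaling(rows, centers, scales):
    """Scale any data set (training or new) with the training centers and scales."""
    return [
        [(x - c) / s for x, c, s in zip(row, centers, scales)]
        for row in rows
    ]


def n_pairwise_classifiers(k):
    """Number of binary classifiers for k classes: k*(k-1)/2."""
    return k * (k - 1) // 2


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
centers, scales = fit_scaling(train)
scaled_new = apply_scaling([[4.0, 20.0]], centers, scales)
```

For the two-class problem considered here (NSCLC vs. healthy individual), n_pairwise_classifiers(2) gives a single binary classifier.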
The specific SVM algorithm suitable for the method of the invention is shown in Table 3.
The determination of the expression profiles of the RNAs described herein is performed in a blood-based fashion. In particular, blood samples are used in the method of the invention. Methods for determining the expression profiles therefore comprise e.g. in situ hybridization, PCR-based methods, sequencing, or preferably microarray-based methods.
The classification of a sample as stemming from a healthy individual or from an individual with NSCLC is based on increases or decreases in the abundance of the RNAs in the sample compared to reference values. Specifically, an increase of the abundance preferably corresponds to a change (ratio) of >1.1, >1.2, or >1.3, and a decrease of the abundance preferably corresponds to a change (ratio) of <0.9, <0.8, or <0.7, in each case relative to the respective expression in healthy individuals.
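The change thresholds can be illustrated as follows. This is a sketch under the assumption that the change is the ratio of the abundance in the sample to a reference value from healthy individuals, both on a linear scale; the function name and the example values are hypothetical.

```python
def abundance_change(sample_value, reference_value, up=1.1, down=0.9):
    """Return the ratio sample/reference and a call based on the thresholds.

    Assumption: both values are linear-scale abundances and reference_value > 0.
    The defaults correspond to the least stringent thresholds (>1.1 and <0.9).
    """
    ratio = sample_value / reference_value
    if ratio > up:
        return ratio, "increased"
    if ratio < down:
        return ratio, "decreased"
    return ratio, "unchanged"


# Hypothetical abundances relative to a healthy reference value of 100.
call_up = abundance_change(130.0, 100.0)
call_down = abundance_change(85.0, 100.0)
call_same = abundance_change(100.0, 100.0)
```

Stricter embodiments would pass up=1.2 or up=1.3 and down=0.8 or down=0.7.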
The abundance can e.g. be determined with an RNA hybridization assay, preferably with a solid phase hybridization array (microarray), or with a real-time polymerase chain reaction, or through sequencing. Preferably, the abundance of the at least 5 RNAs of Table 2 is determined through a hybridization with probes. Such a probe may comprise 12 to 150, preferably 25 to 70 consecutive nucleotides with a sequence that is reverse complementary to at least part of the RNA whose abundance is to be determined, such that a specific hybridization between the probe and the RNA whose abundance is to be determined can occur.
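How a probe relates to its target can be sketched in Python. The RNA sequence below is invented for illustration and is not taken from Table 2; the function names are hypothetical, and the sketch only derives the reverse complementary DNA sequence of a window of the target without assessing hybridization specificity.

```python
# Complement of each RNA (or DNA) base, written as a DNA base.
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}


def reverse_complement_dna(seq):
    """Return the reverse complementary DNA sequence of an RNA or DNA string."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def candidate_probe(rna_seq, start, length=25):
    """Reverse complement of `length` consecutive nucleotides of the target RNA.

    The permitted length range of 12 to 150 nucleotides follows the text above.
    """
    if not 12 <= length <= 150:
        raise ValueError("probe length must be between 12 and 150 nucleotides")
    target = rna_seq[start:start + length]
    if len(target) < length:
        raise ValueError("target window extends beyond the RNA sequence")
    return reverse_complement_dna(target)


# Hypothetical target RNA (not a sequence of the invention).
example_rna = "AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG"
probe = candidate_probe(example_rna, start=0, length=12)
```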
In another aspect, the invention refers to a solid phase hybridization array (microarray) for the detection of NSCLC based on a blood sample from a human individual. Such an array comprises probes for detecting at least 5 of the RNAs that are chosen from the RNAs listed in Table 2. Preferably, the array comprises probes for detecting at least 6 RNAs, at least 7 RNAs, at least 11 RNAs, at least 16 RNAs, at least 21 RNAs, at least 25 RNAs, or at least 34 RNAs that are chosen from the RNAs listed in Table 2, respectively. It is further preferred that said probes for detecting said at least 5 RNAs from the RNAs listed in Table 2 exclude probes for detecting one, or two, or all six of the RNAs that are chosen from the group consisting of SEQ ID NOs:72, 85, 99, 146, 183 and 283. In particularly preferred embodiments, the probes for detecting the at least 5 RNAs as listed in Table 2 are chosen from the group consisting of SEQ ID NOs: 1-71, 73-84, 86-98, 100-145, 147-182, 184-282 and 284-484.
A microarray includes a specific set of probes, such as oligonucleotides and/or cDNAs (e.g. ESTs) corresponding in whole or in part, and/or continuously or discontinuously, to regions of RNAs that can be extracted from a blood sample of a human individual, wherein the probes are localized onto a support. The probes can correspond in sequence to the RNAs of the invention such that hybridization between the RNA from the individual and the probe occurs, yielding a detectable signal. This signal can be detected and, together with its location on the support, can be used to determine which probe hybridized with RNA from the individual's blood sample.
In another aspect, the invention refers to the use of a hybridization array as described above and herein for the detection of NSCLC in a human individual based on RNA from a blood sample obtained from the individual, comprising determining the abundance of at least 5 RNAs, preferably at least 7 RNAs, stemming from the 484 genes listed in Table 2 in the sample.
In another aspect, the invention refers to a kit for the detection of NSCLC, comprising means for determining the abundance of at least 5, preferably at least 7 out of 484 RNAs in the sample that are chosen from the RNAs listed in Table 2. Preferably said kit comprises means for determining the abundance of at least 5 RNAs chosen from the RNAs listed in Table 2, wherein the means for determining one, or two, or all six of the RNAs that are chosen from the group consisting of SEQ ID NOs:72, 85, 99, 146, 183 and 283 are excluded. In particularly preferred embodiments, the kit comprises means for determining the abundance of at least 5 RNAs as listed in Table 2 that are chosen from the group consisting of SEQ ID NOs: 1-71, 73-84, 86-98, 100-145, 147-182, 184-282 and 284-484.
Such a kit may comprise probes and/or a microarray as described above and herein. Specifically, the kit preferably comprises probes, which in turn comprise 15 to 150, preferably 30 to 70 consecutive nucleotides with a reverse complementary sequence to the at least 5 RNAs (or more, as disclosed above and herein) whose abundance is to be determined. Alternatively, the kit preferably comprises a microarray comprising probes with a reverse complementary sequence to the at least 5 RNAs whose abundance is to be determined. Optionally, the kit may further comprise a mixture of at least 5 of the RNAs of Table 2 in a given amount or concentration for use as a standard and/or other components, such as solvents, buffers, labels, primers and/or reagents.
The expression profile of the herein disclosed at least 5 RNAs (or more, as disclosed above and herein) is determined, preferably through the measurement of the quantity of the mRNA of the marker gene. This quantity of the mRNA of the marker gene can be determined for example through chip technology (microarray), (RT-) PCR (for example also on fixated material), Northern hybridization, dot-blotting, sequencing, or in situ hybridization.
The microarray technology, which is most preferred, allows for the simultaneous measurement of the RNA abundance of up to many thousands of gene products and is therefore an important tool for determining differential expression in this context, in particular between two biological samples or groups of biological samples. As will be understood by a person of skill in the art, the analysis can also be performed through single reverse transcriptase-PCR, competitive PCR, real time PCR, differential display RT-PCR, Northern blot analysis, sequencing, and other related methods. In general, the larger the number of markers to be measured, the more preferred is the use of the microarray technology.
Measurements can be performed using the complementary DNA (cDNA) or complementary RNA (cRNA), which is produced on the basis of the RNA to be analyzed, e.g. using microarrays. A great number of different arrays as well as their manufacture are known to a person of skill in the art and are described for example in the U.S. Pat. Nos. 5,445,934; 5,532,128; 5,556,752; 5,242,974; 5,384,261; 5,405,783; 5,412,087; 5,424,186; 5,429,807; 5,436,327; 5,472,672; 5,527,681; 5,529,756; 5,545,331; 5,554,501; 5,561,071; 5,571,639; 5,593,839; 5,599,695; 5,624,711; 5,658,734; and 5,700,637.
Using three independent data sets of patients and controls, the validity of whole blood-based gene expression profiling for the detection of NSCLC patients among smokers was investigated. It is shown that RNA-stabilized blood samples allow for the identification of NSCLC patients among hospital based controls as well as healthy individuals.
Using RNA-stabilized whole blood from smokers in three independent sets of NSCLC patients and controls, the inventors present a gene expression based classifier that can be used as a biomarker to discriminate between NSCLC cases and controls. The optimal parameters of this classifier were first determined by applying a classical 10-fold cross-validation approach to a training set (TS) consisting of NSCLC patients (stage I-IV) and hospital based controls. Subsequently, this optimized classifier was successfully applied to two independent validation sets, namely VS1, comprising NSCLC patients of stage I-IV and hospital based controls, and VS2, containing patients with stage I NSCLC and healthy blood donors. This successful application of the classifier in both validation sets underlines the validity and robustness of the classifier. Extensive permutation analysis using random feature lists and the possibility of building specific classifiers independently of the composition of the initial training set further support the specificity of the classifier. The inventors found no associations between stage of disease and the probability score assigned to each sample. In addition, the inventors observed no association between other cancers and the probability score of the controls (data not shown). However, controls without documented morbidity (controls in VS2) tended to have lower probability scores of being a case than controls with documented morbidity, although this difference was not statistically significant.
The gene set used to build the classifier was enriched in genes related to immune functions. The inventors therefore postulate, without wanting to be bound by theory, that the classifier is based on the transcriptome of blood-based immune effector cells rather than influenced by the occurrence of rare tumor cells occasionally detected in blood of cancer patients, although this possibility cannot be ruled out (Nagrath, S. et al., Nature, 450: 1235-9 (2007)). Moreover, the lack of NSCLC tumor cell specific transcripts, e.g. TTF1, cytokeratins or hTERT, in the inventors' classifier points in the same direction. Gene set enrichment analysis (Osman, I. et al., Clin. Cancer Res., 12:3374-80 (2006)) of the gene set used in the inventors' diagnostic approach in comparison with published expression datasets from a variety of cancer entities (Segal, E. et al., Nat. Genet., 36: 1090-8 (2004)) revealed a significant overlap with 26% of the lung cancer tissue specific gene expression profiles. Since NSCLC tissue consists of tumor cells, immune and stromal cells (Critchley-Thorne, R. J. et al., PLoS Med., 4: e176 (2007)), the inventors presume that the similarities of both gene sets are due to a similar regulation of genes present in immune cells in NSCLC tissue and peripheral blood of NSCLC patients. These findings are in line with data demonstrating tumor-induced alteration of the immune system in mice (Burczynski, M. E. et al., Clin. Cancer Res., 11: 1181-9 (2005); Twine, N. C. et al., Cancer Res., 63: 6069-75 (2003)) and in humans (Showe, M. K. et al., Cancer Res., 69: 9202-10 (2009); Keller, A. et al., BMC Cancer, 9: 353 (2009)).
Recently, Showe, M. K. et al., Cancer Res., 69: 9202-10 (2009) reported an NSCLC associated gene expression signature derived from PBMC of predominantly early stage NSCLC patients. In this study, too, an enrichment of immune-associated pathways in the signature was observed, further indicating that the alteration of the immune system might be a common feature already during the initial phase of NSCLC development. Since the inventors used RNA-stabilized whole blood and not PBMC for analysis, the signature identified by Showe et al. could not be used in the inventors' dataset to distinguish between cases and controls. The same holds true when applying the inventors' classifier to the published data set. Findings derived from several of the inventors' own studies further underline that signatures derived from PBMC and RNA-stabilized whole blood samples cannot be directly compared (Debey, S. et al., Pharmacogenomics J., 4: 193-207 (2004); Debey, S. et al., Genomics, 87: 653-64 (2006)). However, with regard to clinical applicability and robustness, RNA-stabilized approaches yield more reliable results in a multi-center setting (Debey, S. et al., Genomics, 87: 653-64 (2006)).
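The 10-fold cross-validation mentioned above can be sketched in Python as a partition of the sample indices into ten folds, each serving once as the held-out part. The shuffling seed, the function name and the absence of stratification by case or control status are assumptions of this sketch and are not taken from the description.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Assumption: samples are shuffled once with a fixed seed and dealt into
    k folds of nearly equal size; no stratification is applied.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Example with the size of the training set (77 individuals).
splits = list(k_fold_indices(77, k=10))
```

In each round a classifier would be trained on train_idx and scored on test_idx, so that every sample receives exactly one prediction from a model that has not seen it.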
Lists 1 to 51 show exemplary sets of RNAs whose abundance in a blood sample from a human individual can be determined according to the invention to detect NSCLC in the individual. Each list shows a set of RNAs (defined by accession number and probe ID) with an area under the curve (AUC) of at least 0.8. The AUC is a quantitative parameter for the clinical utility (specificity and sensitivity) of the detection method described herein. An AUC of 1.0 refers to a sensitivity and specificity of 100%.
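As an illustration of how an AUC value relates to classifier scores, the following Python sketch computes the AUC as the fraction of case/control pairs in which the case receives the higher score, with ties counted as one half. The scores are invented, and the function name is hypothetical.

```python
def auc(case_scores, control_scores):
    """AUC as the probability that a case scores higher than a control.

    Ties contribute 0.5; this equals the area under the ROC curve.
    """
    wins = 0.0
    for case in case_scores:
        for control in control_scores:
            if case > control:
                wins += 1.0
            elif case == control:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))


# Hypothetical probability scores assigned by a classifier.
perfect = auc([0.9, 0.8], [0.1, 0.2])
partial = auc([0.9, 0.4], [0.5, 0.1])
```

In the first example every case scores above every control, which corresponds to an AUC of 1.0; in the second, three of the four case/control pairs are ordered correctly.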
The invention is further described in the following non-limiting examples.
Material and Methods
Cases and controls: NSCLC cases and hospital based controls were recruited at the University Hospital Cologne and the Lung Clinic Merheim, Cologne, Germany. Healthy blood donors were recruited at the Institute for Transfusion Medicine, University of Cologne. From all individuals, PAXgene stabilized blood samples were taken for blood-based gene expression profiling. For all NSCLC cases, blood was taken prior to chemotherapy. To establish and validate an NSCLC specific classifier, 3 independent sets of cases and controls were assembled. The training set (TS) comprised 77 individuals. Of those, 35 were NSCLC cases of stage I-IV admitted to the hospital with symptoms of non-small cell lung cancer (coughing, dyspnea, weight loss or reduction in general health state) and 42 were hospital based controls with a comparable comorbidity but no prior history of lung cancer. The validation set 1 (VS1, n=54) likewise contained 28 NSCLC cases of stage I-IV and 26 hospital based controls. Overall, the hospital based controls in TS and VS1 included individuals suffering from advanced chronic obstructive pulmonary disease (COPD) as typically seen in a population of heavily smoking adults (TS: n=7; VS1: n=5). Other diseases such as hypertension (TS: n=17; VS1: n=11) or other malignancies (TS: n=10; VS1: n=6) were also observed in the group of hospital based controls. The validation set 2 (VS2, n=102) contained 32 NSCLC cases that had documented stage I NSCLC and were diagnosed mostly during routine chest X-rays or due to clinical workup of unspecific symptoms such as reduced general health status. All individuals had an ECOG performance status of 0. In addition, VS2 contained 70 healthy blood donors without prior history of lung cancer. Detailed information on cases and controls is summarized in Table 1. The analyses were approved by the local ethics committee and all probands gave informed consent.
Blood collection, cRNA synthesis and array hybridization: Blood (2.5 ml) was drawn into PAXgene vials. After RNA isolation, biotin-labeled cRNA preparation was performed using the Ambion® Illumina RNA amplification kit (Ambion, UK) or Epicentre TargetAmp™ Kit (Epicentre Biotechnologies, USA) and Biotin-16-UTP (10 mM; Roche Molecular Biochemicals) or Illumina® TotalPrep RNA Amplification Kit (Ambion, UK). Biotin-labeled cRNA (1.5 μg) was hybridized to Sentrix® whole genome bead chips WG6 version 2 (Illumina, USA) and scanned on the Illumina® BeadStation 500X. For data collection, the inventors used Illumina® BeadStudio 3.1.1.0 software.
Quality control: For RNA quality control the ratio of the OD at wavelengths of 260 nm and 280 nm was calculated for all samples and was between 1.85 and 2.1. To determine the quality of cRNA, a semi-quantitative RT-PCR amplifying a 5′ and a 3′ product of the β-actin gene was used as previously described (Zander T. et al., J. Med. Biol. Res., 39: 589-93 (2006)) and demonstrated no sign of degradation, with both the 5′ and the 3′ product being present. All expression data presented in this manuscript were of high quality. Quality of RNA expression data was controlled by different separate tools. First, the inventors performed quality control by visual inspection of the distribution of raw expression values. To this end, the inventors constructed pairwise scatterplots of expression values from all arrays (R-project Vs 2.8.0) (Team RDC. R: A language and environment for statistical computing. R Foundation for Statistical Computing (2006)). For data derived from an array of good quality, a high correlation of expression values is expected to lead to a cloud of dots along the diagonal. In all comparisons the r² was above 0.95. Second, the present call rate was high in all samples. Finally, the inventors performed quantitative quality control. Here, the absolute deviation of the mean expression values of each array from the overall mean was determined (R-project Vs 2.8.0) (Team RDC. R: A language and environment for statistical computing. R Foundation for Statistical Computing (2006)). In short, the mean expression value for each array was calculated. Next, the mean of these mean expression values (overall mean) was taken and the deviation of each array mean from the overall mean was determined (analogous to probe outlier detection used by Affymetrix before expression value calculation) (Affymetrix. Statistical algorithms description document; http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf (2002)).
The deviation was below 28 for all samples.
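The quantitative quality control described above can be sketched in Python as follows. The array names and expression values are invented; the limit of 28 is taken from the text, but its unit and the exact way the inventors computed the deviation in R are not specified there, so this is an illustration only.

```python
from statistics import mean


def array_mean_deviations(arrays):
    """Absolute deviation of each array mean from the mean of all array means."""
    array_means = {name: mean(values) for name, values in arrays.items()}
    overall_mean = mean(list(array_means.values()))
    return {name: abs(m - overall_mean) for name, m in array_means.items()}


def arrays_exceeding(arrays, max_deviation=28.0):
    """Names of arrays whose deviation is not below the limit (28 in the text)."""
    deviations = array_mean_deviations(arrays)
    return sorted(name for name, d in deviations.items() if d >= max_deviation)


# Hypothetical expression values for three arrays.
example_arrays = {
    "array_1": [100.0, 200.0],
    "array_2": [110.0, 210.0],
    "array_3": [120.0, 220.0],
}
deviations = array_mean_deviations(example_arrays)
flagged = arrays_exceeding(example_arrays)
```

In this example the array means are 150, 160 and 170, the overall mean is 160, and no array is flagged.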
Classification algorithm: An overview of the experimental design is depicted in
Datamining: To investigate the gene ontology of transcripts used for the classifier, GeneTrail analysis for over- and under-expressed genes (Backes, C. et al., Nucleic Acids Res., 35: W186-92 (2007)) was performed. To this end, the inventors analyzed the enrichment in genes in the classifier compared to all genes present on the whole array. The inventors analyzed under- and over-expressed genes, respectively, using the hypergeometric test with a minimum of 2 genes per category. In addition, the inventors performed datamining by Gene Set Enrichment Analysis (GSEA) (Subramanian, A. et al., Proc. Natl. Acad. Sci. USA, 102: 15545-50 (2005)). As indicated, the inventors compared the respective list of genes obtained in the inventors' expression profiling experiment with datasets deposited in the Molecular Signatures Database (MSigDB). The power of the gene set analysis is derived from its focus on groups of genes that share common biological functions. In GSEA an overlap between predefined lists of genes and the newly identified genes can be identified using a running-sum statistic that leads to attribution of a score. The significance of this score is tested using a permutation design which is adapted for multiple testing (Subramanian, A. et al., Proc. Natl. Acad. Sci. USA, 102: 15545-50 (2005)). Groups of genes, called gene sets, were deposited in the MSigDB database and ordered in different biological dimensions such as cancer modules, canonical pathways, miRNA targets, GO-terms etc. (http://www.broadinstitute.org/gsea/msigdb/index.jsp). In the analysis, the inventors focused on cancer modules. The cancer modules integrated into the MSigDB are derived from a compendium of 1975 different published microarrays spanning several different tumor entities (Segal, E. et al., Nat. Genet., 36: 1090-8 (2004)).
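The hypergeometric test mentioned above can be sketched in Python for the over-representation case. The sketch does not reproduce GeneTrail; the function name and the example numbers are hypothetical, and returning None for categories with fewer than 2 classifier genes is one possible reading of the minimum stated in the text.

```python
from math import comb


def enrichment_p_value(n_array, n_category, n_classifier, n_overlap, min_genes=2):
    """One-sided hypergeometric p-value P(X >= n_overlap).

    n_array:      genes on the whole array (background)
    n_category:   background genes annotated to the category
    n_classifier: genes in the classifier
    n_overlap:    classifier genes annotated to the category
    Returns None if the category holds fewer than min_genes classifier genes.
    """
    if n_overlap < min_genes:
        return None
    upper = min(n_category, n_classifier)
    favourable = sum(
        comb(n_category, i) * comb(n_array - n_category, n_classifier - i)
        for i in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_array, n_classifier)


# Hypothetical numbers: 10 background genes, 5 in the category,
# 5 classifier genes, all 5 of which fall into the category.
p_all = enrichment_p_value(10, 5, 5, 5)
```

Here only one of the 252 possible draws of 5 genes from 10 contains all 5 category genes, so the p-value is 1/252.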
Experimental design: Expression profiles were generated from PAXgene stabilized blood samples from 3 independent groups consisting of NSCLC cases and controls (n=77, 54, 102) using the Illumina WG6-VS2 system.
Results: Several genes are consistently differentially expressed in whole blood of NSCLC patients and controls. These expression profiles were used to build a diagnostic classifier for NSCLC which was validated in an independent validation set of NSCLC patients (stage I-IV) and hospital based controls. The area under the receiver operator curve (AUC) was calculated to be 0.824 (p<0.001). In a further independent dataset of stage I NSCLC patients and healthy controls the AUC was 0.977 (p<0.001). Specificity of the classifier was validated by permutation analysis in both validation cohorts. Genes within the classifier are enriched in immune associated genes and demonstrate specificity for non-small cell lung cancer.
Conclusions: The inventors' results show that gene expression profiles of whole blood allow for detection of manifest NSCLC. These results prompted further development of gene expression based biomarker tests in peripheral blood for the diagnosis and early detection i.e. stage I lung cancer of NSCLC.
Establishment of a gene expression profiling-based classifier for blood-based diagnosis of NSCLC: The classifier was built based on an initial training set (TS) containing 35 NSCLC cases of different stages (stage I: n=5, stage II: n=5, stage III: n=17, stage IV: n=8) and 42 hospital based controls suffering in part from severe comorbidities such as COPD, hypertension, cardiac diseases as well as malignancies other than lung cancer. The inventors first evaluated three different approaches, namely support vector machine (SVM), linear discriminant analysis (LDA) and prediction analysis of microarrays (PAM), to identify the best algorithm to build a classifier for the diagnosis of NSCLC in a 10-fold cross-validation design. To this end, the inventors used 36 different feature lists extracted from the list of differentially expressed genes according to 36 different cut-off p-values of the T-statistics. In this setting, the SVM algorithm performed best by reaching the highest AUC (mean AUC=0.754) at a cut-off p-value of the T-statistics of 0.003 (
Use of the diagnostic NSCLC classifier to detect NSCLC cases in an independent validation set of NSCLC cases and hospital based controls: First, it was tested whether the classifier can be used to discriminate NSCLC cases of early and advanced stages from hospital based controls. Therefore, in the first independent validation set, cases and controls were chosen in a similar setting as in the training set, i.e. patients with NSCLC stage I-IV and clinical symptoms associated with lung cancer and hospital based controls with relevant comorbidities (n=26). The AUC for the diagnostic test of NSCLC in this first validation set was calculated to be 0.824 (p<0.001) (
The diagnostic NSCLC classifier identifies stage I NSCLC patients in an independent second validation set comprising stage I NSCLC cases and healthy blood donors: After demonstrating that the classifier can be used to detect NSCLC cases among individuals with comorbidities, it was also investigated whether this test can be used to distinguish NSCLC cases presenting at stage I with no or only minor symptoms from healthy individuals. Therefore, the inventors recruited a second independent validation set (VS2) consisting of 32 NSCLC cases at stage I and 70 healthy blood donors. By applying the identical classifier to VS2, the AUC was determined to be 0.977 (p<0.001) (
Permutation test to analyse the specificity of the classifier: To further underline the specificity of this classifier, the inventors used 1000 random feature lists, each comprising 484 features, to likewise build SVM-based classifiers in the training set (TS), which were then applied to validation set 1 (VS1) and validation set 2 (VS2), respectively. For VS1, the mean AUC obtained by using these random feature lists was 0.49 (range 0.1346-0.8633), with only 2 AUCs being ≧0.824, the AUC obtained using the NSCLC-specific classifier (Figure 3B). This corresponds to a p-value of less than 0.002 for the permutation test, further confirming the specificity of the NSCLC classifier. Similarly, by applying the permuted classifiers to VS2, only 1.8% of random feature lists led to an AUC of ≧0.977, the AUC obtained using the NSCLC-specific classifier (
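The logic of such a permutation test with random feature lists may be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the total number of array features and the toy AUC values are hypothetical, the training and scoring of each SVM-based classifier is not shown, and the empirical p-value is taken as the plain fraction of random lists performing at least as well as the real classifier.

```python
import random


def random_feature_lists(n_features_total, list_size, n_lists, seed=0):
    """Draw n_lists random feature lists of list_size distinct feature indices."""
    rng = random.Random(seed)
    return [
        rng.sample(range(n_features_total), list_size) for _ in range(n_lists)
    ]


def empirical_p_value(observed_auc, null_aucs):
    """Fraction of random-list AUCs that reach or exceed the observed AUC."""
    if not null_aucs:
        raise ValueError("at least one null AUC is required")
    n_as_good = sum(1 for auc in null_aucs if auc >= observed_auc)
    return n_as_good / len(null_aucs)


# Hypothetical array size; 1000 lists of 484 features as in the text.
feature_lists = random_feature_lists(
    n_features_total=20000, list_size=484, n_lists=1000
)

# In a real analysis each list would be used to train a classifier on the
# training set and to compute an AUC on a validation set. Toy values are
# used here purely to exercise the function.
toy_null_aucs = [0.50, 0.90, 0.80, 0.10]
print(len(feature_lists), empirical_p_value(0.80, toy_null_aucs))
```

Fixing the seed makes the random feature lists reproducible, so that the same null distribution can be regenerated for both validation sets.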
Mining of expression profiles: To analyze the biological significance of the 484 features extracted as the classifier by the SVM approach, different strategies were used. First, the inventors used GeneTrail (Backes, C. et al., Nucleic Acids Res., 35: W186-92 (2007)) to analyze the enrichment in GO-terms of the genes associated with NSCLC in the inventors' study. The inventors observed 112 GO categories demonstrating a significant (FDR-corrected p-value <0.05) enrichment of genes in the inventors' extracted gene list, of which 25 were associated with the immune system (Table 3). These data indicate an impact of immune cells on the genes involved in the classifier.
Next, the inventors performed a gene set enrichment analysis (Subramanian, A. et al., Proc. Natl. Acad. Sci. USA, 102: 15545-50 (2005); Segal, E. et al., Nat. Genet., 36: 1090-8 (2004)), thereby focusing on cancer modules which comprise groups of genes participating in biological processes related to cancer. Initially, the power of such modules was demonstrated exemplarily for single genes such as cyclin D1 or PGC-1alpha (Lamb, J. et al., Cell, 114: 323-34 (2003); Mootha, V. K. et al., Nat. Genet., 34: 267-73 (2003)), and a more comprehensive view on such modules has been introduced recently (Segal, E. et al., Nat. Genet., 36: 1090-8 (2004)). This comprehensive collection of modules allows the identification of similarities across different tumor entities, such as the common ability of a tumor to metastasize to the bone, e.g. in subsets of breast, lung and prostate cancer (Segal, E. et al., Nat. Genet., 36: 1090-8 (2004)). Overall, 456 such modules are described in the database, spanning several biological processes such as metabolism, transcription, cell cycle and others.
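As an illustration of the running-sum statistic underlying GSEA, the enrichment score of a gene set in a ranked gene list may be sketched as follows. This is a minimal sketch and not the GSEA software: the function name, the gene identifiers and the ranking scores are hypothetical, and the permutation-based significance testing and multiple-testing adjustment are not shown.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Maximum deviation from zero of the GSEA-style running sum.

    ranked_genes -- genes ordered by their ranking metric
    scores       -- ranking metric of each gene, in the same order
    gene_set     -- genes belonging to the set under test
    p            -- weighting exponent (p=0 gives the unweighted statistic)
    """
    members = set(gene_set)
    hits = [gene in members for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    norm = sum(abs(s) ** p for s, hit in zip(scores, hits) if hit)
    if norm == 0:
        raise ValueError("all gene-set scores are zero; use p=0")
    running = 0.0
    best = 0.0
    for s, hit in zip(scores, hits):
        if hit:
            running += abs(s) ** p / norm   # step up at a gene-set member
        else:
            running -= 1.0 / n_miss         # step down otherwise
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranking with hypothetical gene identifiers.
genes = ["a", "b", "c", "d"]
metric = [4.0, 3.0, 2.0, 1.0]
print(enrichment_score(genes, metric, {"a", "b"}, p=0.0))
print(enrichment_score(genes, metric, {"c", "d"}, p=0.0))
```

A positive score indicates that the gene set is concentrated towards the top of the ranked list, and a negative score that it is concentrated towards the bottom.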
When analysing the identified 484 NSCLC-specific features, 199 cancer modules, including 26% of all NSCLC-associated modules, were identified as showing a significant enrichment. This indicates that genes used to build a classifier for NSCLC cases in the inventors' study represent in part a subset of biologically cooperating genes that are also differentially expressed in primary lung cancer.
To further investigate the specificity of the extracted list of 484 features obtained from the analysis for the classification of NSCLC, the inventors also calculated the overlap between this extracted gene set and a set of genes differentially expressed in the blood of patients with renal cell cancer (Twine, N. C. et al., Cancer Res., 63: 6069-75 (2003); Sharma, P. et al., Breast Cancer Res., 7: R634-44 (2005)). No significant overlap was observed between the two gene sets. Similarly, no overlap was observed between the inventors' NSCLC-specific gene set and gene sets obtained from blood-based expression profiles specific for melanoma (Critchley-Thorne, R. J. et al., PLoS Med., 4: e176 (2007)), breast cancer (Sharma, P. et al., Breast Cancer Res., 7: R634-44 (2005)) and bladder cancer (Osman, I. et al., Clin. Cancer Res., 12:3374-80 (2006)). In summary, these data point to an NSCLC-specific gene set present in the inventors' classifier.
Number | Date | Country | Kind |
---|---|---|---|
11164480.3 | May 2011 | EP | regional |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 14339896 | Jul 2014 | US |
Child | 15723122 | | US |
Parent | 14115565 | | US |
Child | 14339896 | | US |