The invention comprises a method for predicting the progression of colon cancer (colorectal carcinoma) within three years of diagnosis in patients who were diagnosed with colon cancer at UICC stage I or II according to the state of the art and whose primary tumor was completely removed according to surgical and pathological criteria (R0). The method according to the invention comprises the determination and analysis of the expression profiles of 30 or fewer marker genes in a tissue sample from the primary tumor that was removed during the surgery of the patient. Using the method, it is predicted whether or not a progression of the cancer is likely to occur within three years after surgery. The progression of the disease refers to the professional medical diagnosis of a recurrence of the disease in the same organ, of a metastasis in other organs, or of the occurrence of other cancer types. In other words, the method allows for the prediction of the three-year progression-free survival of patients with colon cancer through determining a gene expression profile of 30 marker genes or a selection thereof as well as the subsequent bioinformatical analysis. The 30 genes are defined by their sequences as depicted in SEQ ID NOs: 1 to 30. One aspect of the invention concerns a specific gene expression profile of a subgroup of 9 genes from the 30 marker genes. Another aspect of the invention concerns a gene expression profile of 5 genes from the 30 marker genes. The accuracy of prediction of a progression is 89% when the expression profile consists of 8 genes. Also disclosed are kits for performing the method according to the invention and diagnostic kits. Other embodiments of the invention concern the use of the marker genes disclosed herein and/or of the combinations of marker genes disclosed herein.
Colon cancer, also referred to as colorectal carcinoma, is the third most common tumor entity in western countries. In Germany, about 66,000 patients are diagnosed with colon cancer each year. Colorectal carcinoma is a heterogeneous disease with complex etiology. Colon cancer patients are classified into four clinical stages, UICC I-IV, according to histopathological criteria defined by the Union Internationale Contre le Cancer (UICC). The TNM classification scheme of the UICC is used all over the world.
Patients with colon cancer in UICC stage I have a TNM-status of T1/2N0M0. In these patients, no regional lymph nodes show metastases (N=0) and no metastases have been found and histologically confirmed (M=0).
Patients with colon cancer in stage II have a TNM-status of T3,4N0M0. Although the primary tumor is significantly larger than in stage I and has already penetrated the wall of the colon, no metastases in the regional lymph nodes and no distant metastases have been found in these patients.
About half of all newly diagnosed patients, in Germany ca. 33,000 patients per year, have colon cancer in UICC stages I and II. The total surgical removal of tumors in clinical stages I and II is very effective and leads to progression-free survival rates of 76% after 5 years in UICC stage I and of 67% in UICC stage II. However, within 5 years after the total surgical removal of the primary tumor, progression of the cancer occurs in about 24% of the colon cancer patients in UICC stage I and in about 33% of the colon cancer patients in UICC stage II. The diagnosis of metastases of the primary tumor in liver and/or lung constitutes the majority of the observed progressions.
Patients in UICC stage III have a TNM-status of T1-4N1-2M0. For patients in this stage, it is typical that regional lymph nodes are afflicted with metastases, whereas no metastases in other organs can be found. The presence of afflicted lymph nodes in UICC stage III increases the probability for the progression of the disease significantly. About 60% of the patients in stage III are likely to suffer from a progression of the disease within 5 years after the surgical removal of the primary tumor. Due to this high progression rate, patients in UICC stage III receive adjuvant chemotherapy according to the guidelines of the German Cancer Society. The adjuvant chemotherapy decreases the incidence of progressions by about 10-20%, so that generally only about 40-50% of stage III patients show a progression of the disease after surgery and adjuvant chemotherapy within the first 5 years.
Colon cancer patients in whom metastases were found and histologically confirmed at first diagnosis are allotted to UICC stage IV. They have only a relatively small 5-year probability of survival. In Germany, this is true for about 20,000 patients. In these patients, lung or liver metastases occur synchronously or metachronously. In about 4,000 of the patients in UICC stage IV, a removal of the primary tumor and a complete removal of metastases (R0) are technically feasible, which is accompanied by a 5-year survival rate of about 30%. In the other 16,000 patients in UICC stage IV, a resection is not feasible for various reasons (multinodular, unfavorable localization of metastases adjacent to blood vessels and bile duct, extrahepatic). In these cases, a palliative therapy option is recommended. The aim of the palliative chemotherapeutic treatment is the prolongation of survival and the maintenance of a good quality of life.
A series of problems arises when classifying and allotting colon cancer patients to disease stages. The allotment of patients into stages I and II is not exact. About 10% of patients of stage I and about 25% of patients of stage II suffer from a progression within 5 years, and in the majority of these the progression occurs within two years after surgical removal of the primary tumor. In Germany alone, this affects 6,000-8,000 patients per year. There is no possibility to identify the patients with a high probability of progression within this seemingly homogeneous group. For quite some time, experts have discussed whether patients in UICC stage II should generally receive adjuvant chemotherapy. Due to the relatively small probability of progression of 33% within 5 years for stage II patients, the benefit of such a therapy is difficult to predict and is therefore still controversial. About 67% of all patients in stage II would not benefit from adjuvant chemotherapy, and the costs would be enormous.
An individual therapy could be decided upon based on predictive markers. In this context, many attempts have been made to find new markers that can identify patients with an increased risk of progression. Hawkins et al. (2002) Gastroenterology 122:1376-1387, analyzed the instability of microsatellites and promoter methylation. Noura et al. (2002) J Clin Oncol 20:4232, used a RT-PCR based detection of lymph node metastases. Zhou et al. (2002) Lancet 359:219-225, analyzed allele imbalances to predict recurrence in colorectal carcinoma. Eschrich et al. (2005) J Clin Oncol. 2005 May 20;23(15):3526-35, used cDNA microarrays to predict the probability of survival of patients with colorectal cancer.
Common to all markers examined in the literature is that they have so far not been used as the basis for prognostic assays in a clinical environment, since they have not been independently validated. A possible explanation for this could be that the progression of the colorectal carcinoma is a consequence of very different genetic events that occur within the malignant epithelium or that are induced through modifying events in the surrounding stromal tissue. In order to understand the potential complexity of the progression of the disease, a comprehensive analysis of the underlying molecular events is required.
The technical problem underlying the invention consists in the provision of a reliable diagnostic means that can lead to an improved individual therapy.
The technical problem is solved through the provision of the herein disclosed embodiments and in particular through the claims characterizing the invention. The invention therefore comprises a method for predicting the probability of a progression (local recurrence, metastases, secondary malignoma) within the first three years after surgical removal of the primary tumor of colon cancer patients in UICC stage I and in UICC stage II.
The invention relates to the determination of expression profiles of particular genes that are of importance in carcinoma, in particular in gastro-intestinal carcinomas and preferably in colorectal carcinoma. In this context, the invention teaches a test system for (in vitro) detection of the probability of progression of a carcinoma referred to above, comprising a method for quantitatively measuring the expression profiles of particular marker genes in particular tumor tissue samples as well as bioinformatical analysis methods for calculating therefrom the probability of the occurrence of a progression (local recurrence, metastases, secondary malignoma) for a patient for whom a colorectal carcinoma in UICC stage I or UICC stage II was diagnosed and is being treated. The 30 marker genes of the invention are defined in particular in table 1 and are characterized through their corresponding sequence or further through synonymous identifiers in the table. These are:
mitochondrial malic enzyme 2 (NAD(+)-dependent) [Affymetrix number 210154_at], SEQ ID NO: 1
Fas (TNF receptor superfamily, member 6) [Affymetrix number 215719_x_at], SEQ ID NO: 2
solute carrier family 25 (mitochondrial carrier; oxoglutarate carrier), member 11 [Affymetrix number 207088_s_at], SEQ ID NO: 3
signal transducer and activator of transcription 1, 91 kDa [Affymetrix number AFFX-HUMISGF3A/M97935_MB_at], SEQ ID NO: 4
CDC42 binding protein kinase alpha (DMPK-like) [Affymetrix number 214464_at], SEQ ID NO: 5
glia maturation factor beta [Affymetrix number 202543_s_at], SEQ ID NO: 6
chemokine (C-X-C motif) ligand 10 [Affymetrix number 204533_at], SEQ ID NO: 7
mitochondrial malic enzyme 2 (NAD(+)-dependent) [Affymetrix number 209397_at], SEQ ID NO: 8
signal transducer and activator of transcription 1, 91 kDa [Affymetrix number AFFX-HUMISGF3A/M97935_MA_at], SEQ ID NO: 9
nucleoporin 210 kDa [Affymetrix number 212316_at], SEQ ID NO: 10
dystonin [Affymetrix number 212254_s_at], SEQ ID NO: 11
tryptophanyl-tRNA synthetase [Affymetrix number 200628_s_at], SEQ ID NO: 12
nucleoside phosphorylase [Affymetrix number 201695_s_at], SEQ ID NO: 13
phosphoserine aminotransferase 1 [Affymetrix number 220892_s_at], SEQ ID NO: 14
heterogeneous nuclear ribonucleoprotein D (AU-rich element RNA binding protein 1, 37 kDa) [Affymetrix number 221481_x_at], SEQ ID NO: 15
solute carrier family 25 (mitochondrial carrier; oxoglutarate carrier), member 11 [Affymetrix number 209003_at], SEQ ID NO: 16
methylenetetrahydrofolate dehydrogenase (NADP+ dependent) 2, methenyltetrahydrofolate cyclohydrolase [Affymetrix number 201761_at], SEQ ID NO: 17
NADH dehydrogenase (ubiquinone) 1 alpha subcomplex, 9, 39 kDa [Affymetrix number 208969_at], SEQ ID NO: 18
transferrin receptor (p90, CD71) [Affymetrix number 207332_s_at], SEQ ID NO: 19
1-acylglycerol-3-phosphate O-acyltransferase 5 (lysophosphatidic acid acyltransferase, epsilon) [Affymetrix number 218096_at], SEQ ID NO: 20
chromatin licensing and DNA replication factor 1 [Affymetrix number 209832_s_at], SEQ ID NO: 21
transferrin receptor (p90, CD71) [Affymetrix number 208691_at], SEQ ID NO: 22
eukaryotic translation initiation factor 4E [Affymetrix number 201435_s_at], SEQ ID NO: 23
peptidylglycine alpha-amidating monooxygenase [Affymetrix number 202336_s_at], SEQ ID NO: 24
KIT ligand [Affymetrix number 207029_at], SEQ ID NO: 25
splicing factor, arginine/serine-rich 2 [Affymetrix number 200754_x_at], SEQ ID NO: 26
fucosyltransferase 4 (alpha (1,3) fucosyltransferase, myeloid-specific) [Affymetrix number 209892_at], SEQ ID NO: 27
thymidylate synthetase [Affymetrix number 202589_at], SEQ ID NO: 28
translocated promoter region (to activated MET oncogene) [Affymetrix number 201730_s_at], SEQ ID NO: 29
peroxiredoxin 3 [Affymetrix number 201619_at], SEQ ID NO: 30
The prediction of the progression of a primary colorectal carcinoma is of particular relevance for a clinician, since it determines the further treatment of the patient. When neither tumor cells in regional lymph nodes nor distant metastases are found, the patient is allotted to UICC stage I or II. These tumors, when they are colorectal carcinomas, are treated exclusively through surgery. Adjuvant chemotherapy is not indicated, except in clinical studies. In contrast, when tumor cells are found in regional lymph nodes (UICC stage III), a postoperative adjuvant chemotherapy is recommended according to the guidelines of the German Cancer Society and other international societies. This adjuvant chemotherapy yields a progression-free 3-year survival of patients in UICC stage III of about 69%; without subsequent chemotherapy, the 3-year progression-free survival is only about 49%. The total survival is also significantly influenced by the adjuvant chemotherapy. In the case of rectal carcinoma, it is also of particular relevance whether tumor cells are already present in regional lymph nodes. In these cases, preoperative radiochemotherapy is recommended, because it significantly reduces the occurrence of local recurrence in the rectum. In addition, preoperative radiochemotherapy allows significantly more patients to have surgery and retain their continence, which contributes to a significant improvement of the postoperative quality of life for these patients.
Concerning the present invention, the term “colorectal carcinoma” refers in particular to polypoid, plateau-shaped, ulcerous and scirrhous forms, which according to the WHO classification can be histologically typified into solid, mucinous or adenous adenocarcinoma, signet-ring cell carcinoma, squamous, adenosquamous, cribriform, squamous-like or undifferentiated carcinoma (Becker, Hohenberger, Junginger, Schlag. Chirurgische Onkologie. Thieme, Stuttgart 2002).
In relation to the invention, the term “gene expression profile” comprises the determination of “expression profiles” as well as of particular “expression levels” of the respective genes. The term “expression level” and the term “expression profile” comprise, according to the invention, both the quantity of a gene product as well as its qualitative modifications, like for example methylation, glycosylation, phosphorylation, and so on. Therefore, when determining the “expression profiles” in relation to the invention, mainly the quantity of the respective gene products (RNA/protein) is determined. The expression level is, if applicable, compared with that of other individuals. Corresponding embodiments are shown in the experimental part and are also depicted in the tables.
The determination of the expression profiles of the genes (gene sections) described herein is performed in particular in tissues and/or single cells of the tissues. Methods for determining the expression profiles therefore comprise (in the sense of the invention) e. g. in situ hybridisation, PCR-based methods (e.g. Taqman), or microarray-based methods (see the experimental part of the invention).
In a particular embodiment, the invention comprises the above mentioned method, wherein the expression profile of at least one or of any combination of the 30 marker genes that are unequivocally defined through SEQ ID NO 1 to SEQ ID NO 30, is determined.
In a further preferred embodiment, the invention comprises the above mentioned method, wherein the expression profile of any combination from the subset of nine marker genes, depicted in SEQ ID NO 1 to SEQ ID NO 9, is determined.
In a further preferred embodiment, the invention comprises the above mentioned method, wherein the expression profile of exactly nine marker genes, as depicted in SEQ ID NO 1 to SEQ ID NO 9, is determined.
In a further particularly preferred embodiment, the invention comprises the above mentioned method, wherein the expression profile of any combination from the subset of the five marker genes, as depicted in SEQ ID NO 1 to SEQ ID NO 5, is determined.
In a further particularly preferred embodiment, the invention comprises the above mentioned method, wherein the expression profile of exactly five marker genes, depicted in SEQ ID NO 1 to SEQ ID NO 5, is determined.
As will be defined, the term marker gene in the sense of this invention comprises not only the specific gene sequences (or the respective gene products) as depicted in the specific nucleotide sequences, but also gene sequences which have a high homology to these sequences. Further, the reverse complementary sequences of the defined marker genes are encompassed. Sequences of high homology comprise sequences which have at least 80%, preferably at least 90%, most preferably at least 95% homology to the sequences depicted in the SEQ ID NOs: 1 to 30.
In the context of this invention, these highly homologous sequences also comprise sequences that encode gene products (e.g. RNA or proteins) which are at least 80% identical to the defined gene products of SEQ ID NOs: 1 to 30. According to the invention, the term marker gene comprises a gene or a gene portion that is at least 90% homologous, more preferably at least 95% homologous, more preferably at least 98% homologous, most preferably 100% homologous to the sequences depicted in SEQ ID NO 1 to SEQ ID NO 30 in the form of deoxyribonucleotides or equivalent ribonucleotides, or the proteins derived therefrom.
A protein derived from one of the 30 marker genes (defined in SEQ ID NOs 1 to 30, in table 1) is meant to refer to, according to the invention, a protein, a protein fragment or a polypeptide that was translated in its native reading frame (in frame).
The sequence identity can be determined conventionally through the use of computer programs such as the FASTA program (W. R. Pearson (1990) Rapid and Sensitive Sequence Comparison with FASTP and FASTA, Methods in Enzymology 183:63-98), which is available for example as a service of the EBI in Hinxton. When using FASTA or another sequence alignment program to determine whether a particular sequence is for example 25% identical to a reference sequence of the present invention, the parameters are chosen such that the percentage of identity is calculated over the entire length of the reference sequence and that homology gaps (also referred to as gaps) of up to 5% of the total number of nucleotides in the reference sequence are allowed. Important program parameters such as GAP PENALTIES and KTUP are left at their default values.
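By way of illustration only, the identity criterion described above can be sketched as follows. The sketch assumes that the two sequences have already been aligned by an external program and that gaps are written as "-"; the function name `percent_identity` and the parameter `max_gap_percent` are hypothetical and are not part of FASTA. It merely counts identical positions over the entire length of the reference sequence and checks the 5% gap allowance.

```python
def percent_identity(aligned_ref, aligned_query, max_gap_percent=5.0):
    """Return (identity in percent, gap allowance respected) for two aligned sequences.

    Assumption: both strings come from a prior alignment and use '-' for gaps.
    Identity is calculated over the entire length of the reference sequence.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")

    # Length of the reference sequence without alignment gaps.
    ref_length = sum(1 for base in aligned_ref if base != "-")
    if ref_length == 0:
        raise ValueError("reference sequence is empty")

    pairs = list(zip(aligned_ref.upper(), aligned_query.upper()))
    # Identical positions: same nucleotide in both sequences, no gap.
    matches = sum(1 for r, q in pairs if r != "-" and r == q)
    # Gap positions in either sequence.
    gaps = sum(1 for r, q in pairs if r == "-" or q == "-")

    identity = 100.0 * matches / ref_length
    gaps_ok = gaps * 100 <= max_gap_percent * ref_length
    return identity, gaps_ok


# Toy example (invented sequences): one mismatch and one gap in 20 nucleotides.
identity, gaps_ok = percent_identity("ACGTACGTACGTACGTACGT",
                                     "ACGTACGTACCTACGTAC-T")
print(identity, gaps_ok)
```

A sequence would then be regarded as sufficiently identical when `identity` reaches the chosen threshold (for example 80%, 90% or 95%) and `gaps_ok` is true.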
In a particular embodiment, the relevant marker genes can be determined not only in tumor samples, but also in other biological samples, e.g. in blood, blood serum, blood plasma, feces or other body fluids (ascites of the abdominal cavity, lymph). Accordingly, the present invention is not limited to the analysis of frozen or fresh tumor tissue. The results according to the invention can also be obtained through analysis of fixed tumor tissue, for example paraffin material. In fixed material, other detection methods for detecting genes and gene expression products can also preferably be used, e.g. RNA-specific primers in a real-time PCR.
As shown in the embodiments of the invention, the expression profile of the herein disclosed 30 marker genes (or a selection thereof) is determined, preferably through the measurement of the quantity of the mRNA of the marker gene. This quantity of the mRNA of the marker gene can be determined for example through gene chip technology, (RT-) PCR (for example also on fixed material), Northern hybridization, dot-blotting, or in situ hybridization. Further, the method according to the invention can also be performed by measuring the gene products on a protein or peptide level. Therefore, the invention also comprises the methods described herein, in which the gene expression products are determined in the form of their synthesized proteins (or peptides). In this case, the quantity as well as the quality (e.g. modifications such as phosphorylation or glycosylation) can be determined. Preferably, the expression profile of the marker gene is determined through measuring the polypeptide quantity of the marker gene and, if desired, is compared to a reference value of the particular comparison specimen. The quantity of the polypeptide of the marker gene can be determined through ELISA, RIA, (immuno-) blotting, FACS or immunohistochemical methods.
Microarray technology, which is most preferably used in the present invention, allows for the simultaneous measurement of the mRNA expression levels of many thousands of genes and is therefore an important tool for determining differential expression between two biological samples or groups of biological samples. As known to a person of skill in the art, the analysis can also be performed through single reverse transcriptase-PCR, competitive PCR, real time PCR, differential display RT-PCR, Northern blot analysis, and other related methods.
It is best to analyze the complementary DNA (cDNA) or complementary RNA (cRNA) which is produced on the basis of the RNA to be analyzed using microarrays. A great number of different arrays as well as their manufacture are known to a person of skill in the art and are described for example in the U.S. Pat. Nos. 5,445,934; 5,532,128; 5,556,752; 5,242,974; 5,384,261; 5,405,783; 5,412,087; 5,424,186; 5,429,807; 5,436,327; 5,472,672; 5,527,681; 5,529,756; 5,545,331; 5,554,501; 5,561,071; 5,571,639; 5,593,839; 5,599,695; 5,624,711; 5,658,734; and 5,700,637.
In a further embodiment, the invention comprises a well-defined sequence of analysis steps which in the end lead to the determination of marker signatures with which the sample group can be distinguished from the control group. This method, which was not previously described in this manner, comprises the following, as described in detail in the examples and as depicted in
The raw data from the biochips are first condensed with FARMS as shown by Hochreiter (2006), Bioinformatics 22(8):943-9, and are subsequently partitioned in a double nested bootstrap approach [Efron (1979) Bootstrap Methods: Another Look at the Jackknife, Ann. Statist. 7, 1-26]: in the outer loop, the data are split into a test data set and a training data set. In the inner bootstrap loop, the feature relevance is extracted from the training data set through a decision-tree analysis. For this purpose, a particular number of samples to be classified is chosen at random in several bootstrap iterations, and the influence of a feature is determined from its contribution to the classification error: if the classification error increases when the values of a feature are permuted while the values of all other features remain unchanged, this feature is weighted more strongly. Using a frequency table, the features that were chosen most often are determined and used in the outer bootstrap loop for the classification of the test data set through a support vector machine or other classification algorithms known to a person of skill in the art, for example classification and regression trees, penalized logistic regression, sparse linear discriminant analysis, Fisher linear discriminant analysis, K-nearest neighbors, shrunken centroids, and artificial neural networks.
In this context, a feature is a particular measurement point for a gene to be analyzed which is located on the surface of the biochip and hybridizes with the labeled probe that is to be analyzed and thereby generates an intensity signal.
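The permutation-based determination of feature relevance described above may be illustrated by the following minimal sketch. It is not the implementation used for the invention: for brevity, a simple nearest-centroid classifier is substituted for the decision-tree analysis and the support vector machine, and all function names, the toy data and the parameter `n_repeats` are assumptions made for illustration. The sketch shows only the principle that a feature is weighted more strongly when permuting its values increases the classification error.

```python
import random


def fit_centroids(X, y):
    """Compute one mean vector (centroid) per class label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label of the closest centroid (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def error_rate(centroids, X, y):
    """Fraction of samples that are classified incorrectly."""
    wrong = sum(1 for x, lab in zip(X, y) if predict(centroids, x) != lab)
    return wrong / len(y)


def permutation_relevance(centroids, X, y, n_repeats=20, seed=0):
    """Mean increase of the error when one feature at a time is permuted."""
    rng = random.Random(seed)
    baseline = error_rate(centroids, X, y)
    relevance = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [x[j] for x in X]
            rng.shuffle(column)  # permute feature j, leave all others unchanged
            X_perm = [x[:j] + [v] + x[j + 1:] for x, v in zip(X, column)]
            increases.append(error_rate(centroids, X_perm, y) - baseline)
        relevance.append(sum(increases) / n_repeats)
    return relevance


# Toy data (invented): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.2, 5.1], [0.1, 4.9], [3.0, 5.0], [3.2, 5.1], [3.1, 4.9]]
y = [0, 0, 0, 1, 1, 1]
model = fit_centroids(X, y)
relevance = permutation_relevance(model, X, y)
print(relevance)
```

In the toy data, permuting the uninformative second feature leaves the error unchanged, whereas permuting the first feature increases it, so the first feature receives the larger weight.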
The present invention also relates to a kit for performing the method described herein, wherein the kit comprises specific DNA or RNA probes, primers (also pairs of primers), antibodies, or aptamers for determining at least one of the 30 marker genes that are depicted in SEQ ID NO: 1 to 30 or for determining at least one gene product of the 30 marker genes that are encoded in the sequences of SEQ ID NO: 1 to 30. The kit is preferably a diagnostic kit. A kit in the sense of the invention is also any microarray or specifically an “Affymetrix GeneChip”. The kit may contain all or some of the material necessary for performing the assay as well as the instructions therefor.
Subjects of the invention are also depictions of marker gene signatures that are advantageous for the treatment, diagnosis and prognosis of the diseases mentioned above. These depictions of the gene profiles are reduced to machine-readable media, e.g. computer-readable media (magnetic media, optical media, and so on). The subject of the invention can also be CD-ROMs containing computer programs for the comparison with the stored 30-gene expression profile described above. The subjects of the invention can contain digitally stored expression profiles such that they can be compared to expression data from patients. Alternatively, such profiles can be stored in a different physical format. A graphic depiction is for example such a format.
In the following, the invention is further described on the basis of sequences, tables and examples, without being limited thereto.
The tables show:
Table 1a contains the 30 marker genes that are differentially expressed in the present invention between patients with and without a progression of the primary colorectal carcinoma when in the validation bootstrap one data set was used as a test set in each iteration.
Table 1b contains the 30 marker genes that are differentially expressed in the present invention between patients with and without a progression of the primary colorectal carcinoma when in the validation bootstrap two data sets were used as a test set in each iteration.
Table 1c contains the 30 marker genes that are differentially expressed in the present invention between patients with and without a progression of the primary colorectal carcinoma when in the validation bootstrap three data sets were used as a test set in each iteration.
Table 2a shows the index of the classification of the five-year progression-free survival for the chosen population of patients (a total of 55, of those 26 with progression) with respect to the number of marker genes used when in the validation bootstrap in each iteration one data set was used as a test set.
Table 2b shows the index of the classification of the five-year progression-free survival for the chosen population of patients (a total of 55, of those 26 with progression) with respect to the number of marker genes used when in the validation bootstrap in each iteration two data sets were used as a test set.
Table 2c shows the index of the classification of the five-year progression-free survival for the chosen population of patients (a total of 55, of those 26 with progression) with respect to the number of marker genes used when in the validation bootstrap in each iteration three data sets were used as a test set.
a shows the index of the classification of the occurrence of a progression within five years after the primary diagnosis of a colorectal carcinoma with respect to the number of marker genes used when in the validation bootstrap in each iteration one data set was used as a test set.
b shows the index of the classification of the occurrence of a progression within five years after the primary diagnosis of a colorectal carcinoma with respect to the number of marker genes used when in the validation bootstrap in each iteration two data sets were used as a test set.
c shows the index of the classification of the occurrence of a progression within five years after the primary diagnosis of a colorectal carcinoma with respect to the number of marker genes used when in the validation bootstrap in each iteration three data sets were used as a test set.
The population of patients for the determination of the signature consisted of 55 patients, 34 men and 21 women, in whom a colorectal carcinoma had been diagnosed. These patients had surgery between August 1988 and June 1998 for total removal of the colorectal carcinoma. The age of the patients at the time of surgery was from 33 years to 87 years; the mean age was 63.4 years.
Among the 55 carcinomas that were removed, 11 were classified to be in UICC stage I (TNM-Classification: pT1 or pT2 and pN0 and pM0) and 44 were classified as tumors in UICC stage II (TNM-Classification: pT3 or pT4 and pN0 and pM0).
The total observation time of the patients, that is, the time from the first surgery performed to the last observation of the patient, was on average 11.25 years; the minimum was 6.36 years, the maximum was 16.53 years.
After surgery, a progression of the disease was diagnosed in 26 patients; 29 patients remained progression-free after surgery.
The tumors were homogenized and the RNA was isolated using the RNeasy Mini Kit (Qiagen, Hilden, Germany) and resuspended in 55 μl of water. The cRNA preparation was performed as described (Birkenkamp-Demtroder K, Christensen L L, Olesen S H, et al. Gene expression in colorectal cancer. Cancer Res 2002; 62:4352-63). Double-stranded cDNA was synthesized using an oligo-dT-T7 primer (Eurogentec, Köln, Germany) and was subsequently transcribed using the Promega RiboMax T7-kit (Promega, Madison, Wis.) and Biotin-NTP marker mix (Loxo, Dossenheim, Germany).
15 μg cRNA were subsequently fragmented at 95° C. for 35 minutes.
To the cRNA, B2-control oligonucleotide (Affymetrix, Santa Clara, Calif.), eukaryotic hybridization controls (Affymetrix, Santa Clara, Calif.), herring sperm DNA (Promega, Madison, Wis.), hybridization buffer and BSA were added to a final volume of 300 μl. The cRNA was hybridized on a U133A microarray chip (Affymetrix, Santa Clara, Calif.) for 16 hours at 45° C. The wash and incubation steps with streptavidin (Roche, Mannheim), biotinylated goat-anti-streptavidin antibody (Serva, Heidelberg), goat-IgG (Sigma, Taufkirchen) and streptavidin-phycoerythrin conjugate (Molecular Probes, Leiden, The Netherlands) were performed on an Affymetrix Fluidics Station according to the manufacturer's protocol.
Subsequently, the arrays were scanned with a confocal microscope based on a HP-Argon-Ion laser and the digitized image data were processed using the Affymetrix® Microarray Suite 5.0 Software. The gene chips underwent a quality control to remove scans with abnormal characteristics. The criteria were: a too high or too low dynamic range, high saturation of the “perfect matches”, high pixel background, grid misalignment problems and a low mean signal-to-noise ratio.
The statistical data analysis was performed with the Open-Source Software R, Version 2.3 and the Bioconductor Packages, Version 1.8. Based on the 55 CEL-Files, which are created by the above-referenced Affymetrix Software, the gene expression values were determined through FARMS condensation [Hochreiter et al. (2006), Bioinformatics 22(8):943-9].
Based on the clinical data of 55 patients, the classification problem “classification of 55 expression data sets according to progression-free survival of the respective patients” was formulated and analyzed. The expression data sets stemmed from the above-described patients, of whom 26 had a progression, while for 29 of the patients progression-free survival was documented. The marker genes according to the invention were, as shown in
Based on the training data set, the feature relevance was extracted from the data in the inner bootstrap loop through a Random Forest analysis. For this purpose, in each of 50 inner loop iterations, 10 data sets were randomly chosen as an inner test set. These were classified through an SVM that was trained on the 44, 43, or 42 remaining data sets, and the influence of a feature was determined from its contribution to the classification error: when the error increased through permutation of the values of a feature in the 10 test data sets while the values of all other features remained unchanged, this feature was weighted more strongly. Using a frequency table, the 30 features that were chosen most often in the inner loop iterations were determined and used for the prognosis of the test data sets of the outer loop: a support vector machine with a linear kernel (cost parameter=10) was trained on the 54, 53, or 52 data sets of the outer training set and then applied to the one, two or three test data sets. After 500 iterations, the average prospective classification rate (with sensitivity and specificity) and the frequency of the identified features were determined. The gene signatures contain only features that were relevant with high frequency in all drawings, sorted according to their relative frequency. In the retrospective leave-one-out cross-validation (LOOCV) of the signatures, 80% of the data sets were classified correctly when seven features were used (see also tables 2a, 2b, and 2c).
In case b), in which two test samples were drawn, the resulting gene signature contains 11 features that were relevant in more than 50% of all drawings. They were sorted according to their relative frequency. Using the retrospective cross validation (500 Leave-10-Out-CV) on the 11-feature signature, 86% of the data sets were classified correctly. The average prospective classification rate for this case was determined to be 76%.
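The partitioning of the 55 data sets in the nested loops described above can be sketched as follows. The sketch covers only the bookkeeping of the indices and the frequency table; the classifier itself is omitted. The function names, the use of draws without replacement and the random seed are assumptions made for illustration and are not taken from the analysis actually performed.

```python
import random
from collections import Counter


def split(indices, n_test, rng):
    """Randomly split a list of indices into (training set, test set)."""
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def nested_partitions(n_samples=55, n_outer_test=2, n_inner_test=10,
                      n_inner_iter=50, seed=0):
    """Index partitions of one outer iteration with its inner iterations."""
    rng = random.Random(seed)
    # Outer loop: hold out one, two or three data sets as the test set.
    outer_train, outer_test = split(list(range(n_samples)), n_outer_test, rng)
    # Inner loop: repeatedly hold out 10 data sets of the outer training set.
    inner = [split(outer_train, n_inner_test, rng) for _ in range(n_inner_iter)]
    return outer_train, outer_test, inner


def most_frequent(selections, top_n=30):
    """Frequency table: the features chosen most often over all iterations."""
    counts = Counter(feature for chosen in selections for feature in chosen)
    return [feature for feature, _ in counts.most_common(top_n)]


outer_train, outer_test, inner = nested_partitions(n_outer_test=2)
inner_train, inner_test = inner[0]
print(len(outer_train), len(outer_test), len(inner_train), len(inner_test))
```

With two outer test data sets, the outer training set contains 53 data sets and each inner training set contains 43, in agreement with the numbers given above; with one or three outer test data sets, the corresponding sizes are 54 and 44, or 52 and 42.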
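For illustration, the classification rate, sensitivity and specificity referred to above can be computed from true and predicted labels as in the following sketch. The function name `classification_summary` and the toy labels are hypothetical and do not reproduce the patient data; a progression is encoded as 1 and progression-free survival as 0.

```python
def classification_summary(true_labels, predicted_labels, positive=1):
    """Classification rate, sensitivity and specificity for two classes."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in true_labels")
    return {
        "rate": (tp + tn) / len(pairs),   # fraction classified correctly
        "sensitivity": tp / (tp + fn),    # progressions that were recognized
        "specificity": tn / (tn + fp),    # progression-free cases recognized
    }


# Toy labels (invented): 4 patients with and 6 without progression.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_labels = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
summary = classification_summary(true_labels, predicted_labels)
print(summary)
```

Averaging such summaries over all outer iterations would give an average prospective classification rate of the kind reported above.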
Number | Date | Country | Kind |
---|---|---|---|
10 2006 035 388.9 | Nov 2006 | DE | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/DE07/50005 | 11/1/2007 | WO | 00 | 6/15/2009 |