No government funds were used to make this invention.
Reference to a “Sequence Listing”, a table, or a computer program listing appendix submitted on a compact disc and an incorporation by reference of the material on the compact disc including duplicates and the files on each compact disc shall be specified.
Microarray technology has become a popular tool to classify breast cancer patients into subtypes, relapse and non-relapse, type of relapse, responder and non-responder3-11. A concern for application of gene expression profiling is stability of the gene list as a signature12. Considering that many genes have correlated expression on a chip, especially for genes involved in the same biological process, it is quite possible that different genes may be present in different signatures when different training sets of patients are used. Gene signatures to date for separating patients into different risk groups were derived based on the performance of individual genes, regardless of its biological processes or functions. It has been suggested that it might be more appropriate to interrogate the gene list for biological themes, rather than for individual genes1,2,8,13-19.
The present invention provides a method for predicting distant metastasis of lymph node negative primary breast cancer by obtaining breast cancer cells; isolating nucleic acid and/or protein from the cells; and analyzing the nucleic acid and/or protein to determine the presence, expression level or status of a Biomarker selected from the pathways in Table 2.
The present invention provides a method for predicting distant metastasis of lymph node negative primary breast cancer by obtaining breast cancer cells; isolating nucleic acid and/or protein from the cells; and analyzing the nucleic acid and/or protein to determine the presence, expression level or status of a Biomarker selected from the pathways in Table 2.
A Biomarker is any indicia of an indicated Marker nucleic acid/protein. Nucleic acids can be any known in the art including, without limitation, nuclear, mitochondrial (homeoplasmy, heteroplasmy), viral, bacterial, fungal, mycoplasmal, etc. The indicia can be direct or indirect and measure over- or under-expression of the gene given the physiologic parameters and in comparison to an internal control, placebo, normal tissue or another carcinoma. Biomarkers include, without limitation, nucleic acids and proteins (both over and under-expression and direct and indirect). Using nucleic acids as Biomarkers can include any method known in the art including, without limitation, measuring DNA amplification, deletion, insertion, duplication, RNA, micro RNA (miRNA), loss of heterozygosity (LOH), single nucleotide polymorphisms (SNPs, Brookes (1999)), copy number polymorphisms (CNPs) either directly or upon genome amplification, microsatellite DNA, epigenetic changes such as DNA hypo- or hyper-methylation and FISH. Using proteins as Biomarkers includes any method known in the art including, without limitation, measuring amount, activity, modifications such as glycosylation, phosphorylation, ADP-ribosylation, ubiquitination, etc., or immunohistochemistry (IHC) and turnover. Other Biomarkers include imaging, molecular profiling, cell count and apoptosis Markers.
“Origin” as referred to in ‘tissue of origin’ means either the tissue type (lung, colon, etc.) or the histological type (adenocarcinoma, squamous cell carcinoma, etc.) depending on the particular medical circumstances and will be understood by anyone of skill in the art.
A Marker gene corresponds to the sequence designated by a SEQ ID NO when it contains that sequence. A gene segment or fragment corresponds to the sequence of such gene when it contains a portion of the referenced sequence or its complement sufficient to distinguish it as being the sequence of the gene. A gene expression product corresponds to such sequence when its RNA, mRNA, or cDNA hybridizes to the composition having such sequence (e.g. a probe) or, in the case of a peptide or protein, it is encoded by such mRNA. A segment or fragment of a gene expression product corresponds to the sequence of such gene or gene expression product when it contains a portion of the referenced gene expression product or its complement sufficient to distinguish it as being the sequence of the gene or gene expression product.
The inventive methods, compositions, articles, and kits of described and claimed in this specification include one or more Marker genes. “Marker” or “Marker gene” is used throughout this specification to refer to genes and gene expression products that correspond with any gene the over- or under-expression of which is associated with an indication or tissue type.
Preferred methods for establishing gene expression profiles include determining the amount of RNA that is produced by a gene that can code for a protein or peptide. This is accomplished by reverse transcriptase PCR (RT-PCR), competitive RT-PCR, real time RT-PCR, differential display RT-PCR, Northern Blot analysis and other related tests. While it is possible to conduct these techniques using individual PCR reactions, it is best to amplify complementary DNA (cDNA) or complementary RNA (cRNA) produced from mRNA and analyze it via microarray. A number of different array configurations and methods for their production are known to those of skill in the art and are described in for instance, U.S. Pat. Nos. 5,445,934; 5,532,128; 5,556,752; 5,242,974; 5,384,261; 5,405,783; 5,412,087; 5,424,186; 5,429,807; 5,436,327; 5,472,672; 5,527,681; 5,529,756; 5,545,531; 5,554,501; 5,561,071; 5,571,639; 5,593,839; 5,599,695; 5,624,711; 5,658,734; and 5,700,637.
Microarray technology allows for the measurement of the steady-state mRNA level of thousands of genes simultaneously thereby presenting a powerful tool for identifying effects such as the onset, arrest, or modulation of uncontrolled cell proliferation. Two microarray technologies are currently in wide use. The first are cDNA arrays and the second are oligonucleotide arrays. Although differences exist in the construction of these chips, essentially all downstream data analysis and output are the same. The product of these analyses are typically measurements of the intensity of the signal received from a labeled probe used to detect a cDNA sequence from the sample that hybridizes to a nucleic acid sequence at a known location on the microarray. Typically, the intensity of the signal is proportional to the quantity of cDNA, and thus mRNA, expressed in the sample cells. A large number of such techniques are available and useful. Preferred methods for determining gene expression can be found in U.S. Pat. Nos. 6,271,002; 6,218,122; 6,218,114; and 6,004,755.
Analysis of the expression levels is conducted by comparing such signal intensities. This is best done by generating a ratio matrix of the expression intensities of genes in a test sample versus those in a control sample. For instance, the gene expression intensities from a diseased tissue can be compared with the expression intensities generated from benign or normal tissue of the same type. A ratio of these expression intensities indicates the fold-change in gene expression between the test and control samples.
The selection can be based on statistical tests that produce ranked lists related to the evidence of significance for each gene's differential expression between factors related to the tumor's original site of origin. Examples of such tests include ANOVA and Kruskal-Wallis. The rankings can be used as weightings in a model designed to interpret the summation of such weights, up to a cutoff, as the preponderance of evidence in favor of one class over another. Previous evidence as described in the literature may also be used to adjust the weightings.
A preferred embodiment is to normalize each measurement by identifying a stable control set and scaling this set to zero variance across all samples. This control set is defined as any single endogenous transcript or set of endogenous transcripts affected by systematic error in the assay, and not known to change independently of this error. All Markers are adjusted by the sample specific factor that generates zero variance for any descriptive statistic of the control set, such as mean or median, or for a direct measurement. Alternatively, if the premise of variation of controls related only to systematic error is not true, yet the resulting classification error is less when normalization is performed, the control set will still be used as stated. Non-endogenous spike controls could also be helpful, but are not preferred.
Gene expression profiles can be displayed in a number of ways. The most common is to arrange raw fluorescence intensities or ratio matrix into a graphical dendogram where columns indicate test samples and rows indicate genes. The data are arranged so genes that have similar expression profiles are proximal to each other. The expression ratio for each gene is visualized as a color. For example, a ratio less than one (down-regulation) appears in the blue portion of the spectrum while a ratio greater than one (up-regulation) appears in the red portion of the spectrum. Commercially available computer software programs are available to display such data including “Genespring” (Silicon Genetics, Inc.) and “Discovery” and “Infer” (Partek, Inc.)
In the case of measuring protein levels to determine gene expression, any method known in the art is suitable provided it results in adequate specificity and sensitivity. For example, protein levels can be measured by binding to an antibody or antibody fragment specific for the protein and measuring the amount of antibody-bound protein. Antibodies can be labeled by radioactive, fluorescent or other detectable reagents to facilitate detection. Methods of detection include, without limitation, enzyme-linked immunosorbent assay (ELISA) and immunoblot techniques.
Modulated genes used in the methods of the invention are described in the Examples. The genes that are differentially expressed are either up regulated or down regulated in patients with carcinoma of a particular origin relative to those with carcinomas from different origins. Up regulation and down regulation are relative terms meaning that a detectable difference (beyond the contribution of noise in the system used to measure it) is found in the amount of expression of the genes relative to some baseline. In this case, the baseline is determined based on the algorithm. The genes of interest in the diseased cells are then either up regulated or down regulated relative to the baseline level using the same measurement method. Diseased, in this context, refers to an alteration of the state of a body that interrupts or disturbs, or has the potential to disturb, proper performance of bodily functions as occurs with the uncontrolled proliferation of cells. Someone is diagnosed with a disease when some aspect of that person's genotype or phenotype is consistent with the presence of the disease. However, the act of conducting a diagnosis or prognosis may include the determination of disease/status issues such as determining the likelihood of relapse, type of therapy and therapy monitoring. In therapy monitoring, clinical judgments are made regarding the effect of a given course of therapy by comparing the expression of genes over time to determine whether the gene expression profiles have changed or are changing to patterns more consistent with normal tissue.
Genes can be grouped so that information obtained about the set of genes in the group provides a sound basis for making a clinically relevant judgment such as a diagnosis, prognosis, or treatment choice. These sets of genes make up the portfolios of the invention. As with most diagnostic Markers, it is often desirable to use the fewest number of Markers sufficient to make a correct medical judgment. This prevents a delay in treatment pending further analysis as well unproductive use of time and resources.
One method of establishing gene expression portfolios is through the use of optimization algorithms such as the mean variance algorithm widely used in establishing stock portfolios. This method is described in detail in 20030194734. Essentially, the method calls for the establishment of a set of inputs (stocks in financial applications, expression as measured by intensity here) that will optimize the return (e.g., signal that is generated) one receives for using it while minimizing the variability of the return. Many commercial software programs are available to conduct such operations. “Wagner Associates Mean-Variance Optimization Application,” referred to as “Wagner Software” throughout this specification, is preferred. This software uses functions from the “Wagner Associates Mean-Variance Optimization Library” to determine an efficient frontier and optimal portfolios in the Markowitz sense is preferred. Markowitz (1952). Use of this type of software requires that microarray data be transformed so that it can be treated as an input in the way stock return and risk measurements are used when the software is used for its intended financial analysis purposes.
The process of selecting a portfolio can also include the application of heuristic rules. Preferably, such rules are formulated based on biology and an understanding of the technology used to produce clinical results. More preferably, they are applied to output from the optimization method. For example, the mean variance method of portfolio selection can be applied to microarray data for a number of genes differentially expressed in subjects with cancer. Output from the method would be an optimized set of genes that could include some genes that are expressed in peripheral blood as well as in diseased tissue. If samples used in the testing method are obtained from peripheral blood and certain genes differentially expressed in instances of cancer could also be differentially expressed in peripheral blood, then a heuristic rule can be applied in which a portfolio is selected from the efficient frontier excluding those that are differentially expressed in peripheral blood. Of course, the rule can be applied prior to the formation of the efficient frontier by, for example, applying the rule during data pre-selection.
Other heuristic rules can be applied that are not necessarily related to the biology in question. For example, one can apply a rule that only a prescribed percentage of the portfolio can be represented by a particular gene or group of genes. Commercially available software such as the Wagner Software readily accommodates these types of heuristics. This can be useful, for example, when factors other than accuracy and precision (e.g., anticipated licensing fees) have an impact on the desirability of including one or more genes.
The gene expression profiles of this invention can also be used in conjunction with other non-genetic diagnostic methods useful in cancer diagnosis, prognosis, or treatment monitoring. For example, in some circumstances it is beneficial to combine the diagnostic power of the gene expression based methods described above with data from conventional Markers such as serum protein Markers (e.g., Cancer Antigen 27.29 (“CA 27.29”)). A range of such Markers exists including such analytes as CA 27.29. In one such method, blood is periodically taken from a treated patient and then subjected to an enzyme immunoassay for one of the serum Markers described above. When the concentration of the Marker suggests the return of tumors or failure of therapy, a sample source amenable to gene expression analysis is taken. Where a suspicious mass exists, a fine needle aspirate (FNA) is taken and gene expression profiles of cells taken from the mass are then analyzed as described above. Alternatively, tissue samples may be taken from areas adjacent to the tissue from which a tumor was previously removed. This approach can be particularly useful when other testing produces ambiguous results.
The present invention provides a method for analyzing a biological specimen for the presence of cells specific for an indication by: a) enriching cells from the specimen; b) isolating nucleic acid and/or protein from the cells; and c) analyzing the nucleic acid and/or protein to determine the presence, expression level or status of a Biomarker specific for the indication.
The biological specimen can be any known in the art including, without limitation, urine, blood, serum, plasma, lymph, sputum, semen, saliva, tears, pleural fluid, pulmonary fluid, bronchial lavage, synovial fluid, peritoneal fluid, ascites, amniotic fluid, bone marrow, bone marrow aspirate, cerebrospinal fluid, tissue lysate or homogenate or a cell pellet. See, e.g. 20030219842.
The indication can include any known in the art including, without limitation, cancer, risk assessment of inherited genetic pre-disposition, identification of tissue of origin of a cancer cell such as a CTC 60/887,625, identifying mutations in hereditary diseases, disease status (staging), prognosis, diagnosis, monitoring, response to treatment, choice of treatment (pharmacologic), infection (viral, bacterial, mycoplasmal, fungal), chemosensitivity U.S. Pat. No. 7,112,415, drug sensitivity, metastatic potential or identifying mutations in hereditary diseases.
Cells enrichment can be by any method known in the art including, without limitation, by antibody/magnetic separation, (Immunicon, Miltenyi, Dynal) U.S. Pat. No. 6,602,422, 5,200,048, fluorescence activated cell sorting, (FACs) U.S. Pat. No. 7,018,804, filtration or manually. The manual enrichment can be for instance by prostate massage. Goessl et al. (2001) Urol 58:335-338.
The nucleic acid can be any known in the art including, without limitation, is nuclear, mitochondrial (homeoplasmy, heteroplasmy), viral, bacterial, fungal or mycoplasmal.
Methods of isolating nucleic acid and protein are well known in the art. See e.g. U.S. Pat. No. 6,992,182, RNA www.aibion.com/techlib/basics/rnaisol/index.htlm, and 20070054287.
DNA analysis can be any known in the art including, without limitation, methylation, de-methylation, karyotyping, ploidy (aneuploidy, polyploidy), DNA integrity (assessed through gels or spectrophotometry), translocations, mutations, gene fusions, activation—de-activation, single nucleotide polymorphisms (SNPs), copy number or whole genome amplification to detect genetic makeup. RNA analysis includes any known in the art including, without limitation, q-RT-PCR, miRNA or post-transcription modifications. Protein analysis includes any known in the art including, without limitation, antibody detection, post-translation modifications or turnover. The proteins can be cell surface markers, preferably epithelial, endothelial, viral or cell type. The Biomarker can be related to viral/bacterial infection, insult or antigen expression.
The claimed invention can be used for instance to determine metastatic potential of a cell from a biological specimen by isolating nucleic acid and/or protein from the cells; and analyzing the nucleic acid and/or protein to determine the presence, expression level or status of a Biomarker specific for metastatic potential.
The cells of the claimed invention can be used for instance to identify mutations in hereditary diseases cell from a biological specimen by isolating nucleic acid and/or protein from the cells; and analyzing the nucleic acid and/or protein to determine the presence, expression level or status of a Biomarker specific for specific for a hereditary disease.
The cells of the claimed invention can be used for instance to obtain and preserve cellular material and constituent parts thereof such as nucleic acid and/or protein. The constituent parts can be used for instance to make tumor cell vaccines or in immune cell therapy. 20060093612, 20050249711.
Kits made according to the invention include formatted assays for determining the gene expression profiles. These can include all or some of the materials needed to conduct the assays such as reagents and instructions and a medium through which Biomarkers are assayed.
Articles of this invention include representations of the gene expression profiles useful for treating, diagnosing, prognosticating, and otherwise assessing diseases. These profile representations are reduced to a medium that can be automatically read by a machine such as computer readable media (magnetic, optical, and the like). The articles can also include instructions for assessing the gene expression profiles in such media. For example, the articles may comprise a CD ROM having computer instructions for comparing gene expression profiles of the portfolios of genes described above. The articles may also have gene expression profiles digitally recorded therein so that they may be compared with gene expression data from patient samples. Alternatively, the profiles can be recorded in different representational format. A graphical recordation is one such format. Clustering algorithms such as those incorporated in “DISCOVERY” and “INFER” software from Partek, Inc. mentioned above can best assist in the visualization of such data.
Different types of articles of manufacture according to the invention are media or formatted assays used to reveal gene expression profiles. These can comprise, for example, microarrays in which sequence complements or probes are affixed to a matrix to which the sequences indicative of the genes of interest combine creating a readable determinant of their presence. Alternatively, articles according to the invention can be fashioned into reagent kits for conducting hybridization, amplification, and signal generation indicative of the level of expression of the genes of interest for detecting cancer.
The present invention defines specific marker portfolios that have been characterized to detect a single circulating breast tumor cell in a background of peripheral blood. The molecular characterization multiplex assay portfolio has been optimized for use as a QRT-PCR multiplex assay where the molecular characterization multiplex contains 2 tissue of origin markers, 1 epithelial marker and a housekeeping marker. QRT-PCR will be carried out on the Smartcycler II for the molecular characterization multiplex assay. The molecular characterization singlex assay portfolio has been optimized for use as a QRT-PCR assay where each marker is run in a single reaction that utilizes 3 cancer status markers, 1 epithelial marker and a housekeeping marker. Unlike the RPA multiplex assay the molecular characterization singlex assay will be run on the Applied Biosystems (ABI) 7900HT and will use a 384 well plate as it platform. The molecular characterization multiplex assay and singlex assay portfolios accurately detect a single circulating epithelial cell enabling the clinician to predict recurrence. The molecular characterization multiplex assay utilizes Thermus thermophilus (TTH) DNA polymerase due to its ability to carry out both reverse transcriptase and polymerase chain reaction in a single reaction. In contrast, the molecular characterization singlex assay utilizes the Applied Biosystems One-Step Master Mix which is a two enzyme reaction incorporating MMLV for reverse transcription and Taq polymerase for PCR. Assay designs are specific to RNA by the incorporation of an exon-intron junction so that genomic DNA is not efficiently amplified and detected.
Knowledge of biological processes may be more relevant for understanding of the disease than information on differentially expressed genes. We have investigated distinct biological pathways associated with the metastatic capability of lymph-node negative primary breast tumors. A re-sampling method was used to create 500 different training sets, and to derive the corresponding gene signatures for estrogen receptor (ER)-positive and -negative tumors. The constructed gene signatures were mapped to Gene Ontology Biological Process (GOBP) to identify over-represented pathways related to patient outcomes. Global Test program1,2 was used to confirm that these biological pathways were associated with the development of metastases. Furthermore, by mapping 4 published prognostic gene signatures with more than 60 genes to the top 20 pathways, each of them can be mapped to 19 of the top distinct pathways despite a minimum overlap of identical genes. Our study provides a new way to understand the mechanisms of breast cancer progression and to derive a pathway-based signatures for prognosis.
We investigated the various prognostic gene signatures derived from different patient groups with an aim towards understanding the underlying biological pathways. Since gene expression patterns of ER-subgroups of breast tumors are quite different3-6,8,20, data analysis to derive gene signatures and subsequent pathway analysis was conducted separately8. For either ER-positive or ER-negative patients, 80 samples were randomly selected as a training set and the top 100 genes were used as a signature to predict tumor recurrence for the remaining ER-positive or ER-negative patients (
In Table 1, the top 20 genes are ranked by their frequency in the 500 signatures of 100 genes for ER-positive and ER-negative tumors (for details see
The biological pathways are distinct for ER-positive and -negative tumors. For ER-positive tumors, many pathways that are related with cell division are present in the top 20 over-represented pathways, in addition to a couple of immune-related pathways (Table 4).
All of the 20 pathways had a significant association with distant metastasis-free survival (DMFS) by Global Testing program. The top 2 most significant being Apoptosis, and Regulation of cell cycle (Table 2). For ER-negative tumors, many of the top 20 pathways are related with RNA processing, transportation and signal transduction (Table 4). Eighteen of the top 20 pathways demonstrated significant association with DMFS, the 2 most significant being Regulation of cell growth, and Regulation of G-protein coupled receptor signaling (Table 2).
In Table 2, each of the top 20 over-represented pathways that have the highest frequencies in the 500 signatures of ER-positive and ER-negative tumors (see Table 5) were subjected to Global Test program1,2. The Global Test examines the association of a group of genes as a whole to a specific clinical parameter, in this case DMFS, and generates an asymptotic theory P value for the pathway1,2. The pathways are ranked by their P value in the respective ER-subgroup of tumors.
The contribution of individual genes in the top over-represented pathways to the association with DMFS, and their significance, were determined for ER-positive (
C. elegans)
C. elegans)
Drosophila)
Drosophila)
In ER-negative tumors, examples of pathways with genes that had both positive or negative correlation to DMFS include Regulation of cell growth (
In an attempt to use gene expression profiles in the most significant biological processes to predict distant metastases we used the genes of the top 2 significant pathways in both ER-positive and -negative tumors (Table 7) to construct a gene signature for prediction of distant recurrence. A 50-gene signature was constructed by combining the 38 genes from the top 2 ER-positive pathways and 12 genes for the top 2 ER-negative pathways. The Affymetrix U133A data on a recently published set of breast tumors with follow-up information21 was used as an independent test set to validate the signature. The 152-patient validation set consisted of 125 ER-positive tumors and 27 ER-negative tumors. When the 38-gene signature was applied to ER-positive tumors, an ROC analysis gave an AUC of 0.782 (
(HR=3.36) (
To compare genes from various prognostic signatures for breast cancer, five published gene signatures were selected6,8,21-23. We first compared the gene sequence identity between each pair of the gene signatures and found very few overlapping genes as expected (Table 8). The gene expression grade index comprising 97 genes, of which most are associated with cell cycle regulation and proliferation21, showed the highest number of overlapping genes between the various signatures ranging from 5 with the 16 genes of Genomic Health22 to 10 with Yu's 62 genes23. The other 4 gene signatures showed only 1 gene overlap in pair-wise comparison, and there was no common gene for all signatures. In spite of the low number of overlapping genes across signatures, which are due to different platforms and bioinformatical analyses used and different groups of patients analyzed, we found that the representation of common pathways in the various signatures may underlie their individual prognostic value8. Therefore, we examined the representation of the top 20 core pathways (Table 2) in the 5 signatures, the genes in the signatures were mapped to GOBP. Except the Genomic Health 16-gene signature mapped to 10 distinct core pathways, each of the other 4 signatures with 62 genes or more mapped to 19 distinct core prognostic pathways (Table 3). Of these 19 pathways, 8 were identical for all 4 signatures, i.e., Mitosis, Apoptosis, Regulation of cell cycle, DNA repair, Cell cycle, Protein amino acid phosphorylation, Intracellular signaling cascade, and Cell adhesion. The other 11 pathways were either present in 1, 2, or 3, of the signatures, but not in all (Table 3). In a recent study, comparing the prognostic performance of different gene signatures, agreement in outcome predictions were found as well24. However, in contrast to our present approach, the underlying pathways were not investigated, and merely the performance of various gene signatures on a single patient cohort, heterogeneous with respect to nodal status and adjuvant systemic therapy25, was compared24. It is important to note, however, that although similar pathways are represented in various signatures, it does not necessarily mean the individual genes in a pathway contribute equally and into the same direction. Genes in a specific pathway may be positively or negatively associated with tumor aggressiveness, and have very different contributions and significance levels (
aPublished gene signatures that were studied include the 76-gene signature by Wang et al8, the 70-gene signature by van 't Veer et al6, the 16-gene signature by Paik et al22, the 62-gene signature by Yu et al23, and the 97-gene signature by Sotiriou et al21. Individual genes in each signature were mapped to the top 20 core pathways for ER-positive and ER-negative tumors.
In conclusion, we have shown that gene signatures can be derived by combining statistical methods and biological knowledge. Our study for the first time applied a method that systematically evaluated the biological pathways related to patient outcomes of breast cancer and have provided biological evidence that various published prognostic gene signatures providing similar outcome predictions are based on the representation of common biological processes. Identification of the key biological processes, rather than the assessment of signatures based on individual genes, provides targets for future drug development.
The following examples are provided to illustrate but not limit the claimed invention. All references cited herein are hereby incorporated herein by reference.
Patient population. The study was approved by the Medical Ethics Committee of the Erasmus MC Rotterdam, The Netherlands (MEC 02.953), and was performed in accordance to the Code of Conduct of the Federation of Medical Scientific Societies in the Netherlands (www.fmwv.nl). A cohort of 344 breast tumor samples from a tumor bank at the Erasmus Medical Center (Rotterdam, Netherlands) were used in this study. All these samples were from patients with lymph node-negative breast cancer who had not received any adjuvant systemic therapy, and had more than 70% tumor content. Among them, 286 samples had been used to derive a 76-gene signature to predict distant metastasis8. An additional 58 ER-negative cases were included to increase the numbers in this subgroup in the analyses performed. In this study, ER status for a patient was determined based on the expression level of the ER gene on the chip. A patient is considered ER-positive if its ER expression level is higher than 1000 after scaling the average of intensity on a chip to 600. Otherwise, the patient is ER-negative26. As a result, there were 221 ER-positive and 123 ER-negative patients in the 344-patient population. The mean age of the patients was 53 years (median 52, range 26-83 years), 175 (51%) were premenopausal and 169 (49%) postmenopausal. T1 tumors (≦2 cm) were present 168 patients (49%), T2 tumors (>2-5 cm) in 163 patients (47%), T3/4 tumors (>5 cm) in 12 patients (3%), and 1 patient with unknown tumor stage. Pathological examination was carried out by regional pathologists as described previously27 and the histological grade was coded as poor in 184 patients (54%), moderate in 45 patients (13%, good in 7 patients (2%), and unknown for 108 patients (31%). During follow-up 103 patients showed a relapse within 5 years and were counted as failures in the analysis for DMFS. Eighty two patients died after a previous relapse. The median follow-up time of patients still alive was 101 months (range 61-171 months).
RNA isolation and hybridization. Total RNA was extracted from 20-40 cryostat sections of 30 um thickness with RNAzol B (Campro Scientific, Veenendaal, Netherlands). After being biotinylated, targets were hybridized to Affymetrix HG-U133A chips as described8. Gene expression signals were calculated using Affymetrix GeneChip analysis software MAS 5.0. Chips with an average intensity less than 40 or a background higher than 100 were removed. Global scaling was performed to bring the average signal intensity of a chip to a target of 600 before data analysis.
For the validation dataset21, quantile normalization was performed and ANOVA was used to eliminate batch effects from different sample preparation methods, RNA extraction methods, different hybridization protocols and scanners.
Multiple gene signatures. Since gene expression patterns of ER-positive breast tumors are quite different from that of ER-negative breast tumors8, data analysis to derive gene signatures and subsequent pathway analysis were conducted separately. For either ER-positive or ER-negative patients, 80 samples were randomly selected as a training set. For the training set, univariant Cox proportional-hazards regression was performed to identify genes whose expression patterns were most correlated to patients' distant metastasis-free survival (DMFS) time. Our previous analysis suggested that 80 patients represent a minimum size of the training set for producing a prognostic gene signature of stable performance8. The top 100 genes were used as a signature to predict tumor recurrence for the remaining independent patients as a test set. A receiver operating characteristic (ROC) analysis with distant metastasis within 5 years as a defining point was conducted. The area under curve (AUC) was used as a measurement of the performance of a signature in the test set. The above procedure was repeated 500 times (
As a control, the patient clinical information for the ER-positive patients or ER-negative patients was permutated randomly and reassigned to the chip data. As described above, 80 chips were then randomly selected as a training set and the top 100 genes were selected using the Cox modeling based on the permutated clinical information. The top 100 genes were then used as a signature to predict relapse in the remaining patients. The clinical information was permutated 10 times. For each permutation of the clinical information, 50 various training sets of 80 patients were created. For each training set, the top 100 genes were obtained as a control gene list based on the Cox modeling. Thus, a total of 500 control signatures were obtained. The predictive performance of the 100 genes was examined in the remaining patients. An ROC analysis was conducted and AUC was calculated in the test set.
Mapping to GOBP. To identify over-representation of biological pathways in the signatures, genes on Affymetrix HG-U133A chip were mapped to the categories of GOBP based on the annotation table downloaded from www.affymetrix.com. Categories that contain at least 10 probe sets from HG-U133A chip were retained for subsequent pathway analysis. The 100 genes of each signature were mapped to GOBP. Hypergeometric distribution probabilities for GOBP categories were calculated for each signature. A pathway that has a hypergeometric distribution probability <0.05 and was hit by two or more genes from the 100 genes was considered as an over-represented pathway in a signature. The total number of a pathway appeared in the 500 signatures was considered as the frequency of over-representation.
Global Test program. To evaluate the relationship between a pathway and the clinical outcome, each of the top 20 over-represented pathways that have the highest frequencies in the 500 signatures were subjected to Global Test program1,2. The Global Test examines the association of a group of genes as a whole to a specific clinical parameter such as DMFS. The contribution of individual genes in the top over-represented pathways to the association was also evaluated and significant contributors were selected for subsequent analyses.
To explore the possibility of using the genes in a specific pathway as a signature to predict distant metastasis, the top two pathways for ER-positive or ER-negative tumors that were in the top 20 list based on frequency of over-representation and had the smallest P values from Global Test program were chosen to build a gene signature. First, genes in the pathway were selected if their z-score was greater than 1.95 from the Global Test program. A z-score greater than 1.95 indicates that the association of the gene expression with DMFS time is significant (P<0.05)1,2. The relapse score was the difference of weighted expression signals for negatively correlated genes and ones for positively correlated genes. To determine the optimal number of genes in a signature, ROC analysis was performed using signatures of various numbers of genes in the training set. The performance of the selected gene signature was evaluated by Kaplan-Meier survival analysis in an independent patient group21.
Comparing multiple gene signatures. To compare the genes from various prognostic signatures for breast cancer, five gene signatures were selected6,8,22-23. Identity of the genes between the signatures was determined by BLAST program. To examine the representation of the top 20 pathways in the signatures, genes in each of the signatures were mapped to GOBP.
Data Availability. The microarray data analyzed in this paper have been submitted to the NCBI/Genbank GEO database. The microarray and clinical data used for the independent validation testing set analysis were obtained from the Gene Expression Omnibus database (http://www.ncbi.nlm.hih.gov.geo) with accession code GSE2990.
Statistical Methods. Statistical analyses were conducted using the R system, version 2.2.1 (http://www.r-project.org). Cox proportional-hazard regression modeling analysis was performed to identify genes with a high correlation to DMFS in each training set. The survival package included in the R system was used for survival analysis. The hazard ratio (HR) and 95% confidence intervals (CI) were estimated using the stratified Cox regression analysis. Hypergeometric distribution probability analysis was performed to identify over-represented pathways in each of the 500 signatures. Global Test, version 3.1.1, was used to evaluate the top over-represented pathways related to DMFS and provided a way to visualize contributions of individual genes in a pathway.
Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, the descriptions and examples should not be construed as limiting the scope of the invention.
Drosophila)
Drosophila)
Number | Date | Country | Kind |
---|---|---|---|
PCT/US07/77593 | Sep 2007 | US | national |
Number | Date | Country | |
---|---|---|---|
60842212 | Sep 2006 | US |