A major challenge for microarray studies, especially those with clinical implications, is validation (Ioannidis 2005; Jenssen and Hovig 2005; Michiels et al. 2005). Due to the practical considerations of cost and accessing large numbers of fresh samples with associated clinical information, very few microarray studies have analyzed enough samples to allow the findings to be extended to the general population. Furthermore, it has been difficult to combine and/or validate results from independent laboratories due to differences in sample preparation, patient demographics and the microarray platforms used. An accepted method for validation is to derive a prognostic gene set from a “training set” and then apply it to a “test set” that was not used in any way to derive the prognostic gene set (Simon et al. 2003); the “purest” test sets have also been suggested to be comprised of samples not contained in the training set and not generated by the primary investigators (Ioannidis 2005). What is needed in the art is a new breast tumor intrinsic gene list that identifies new and important biological features of breast tumors and validates this predictor using a true test set.
Described herein is a method of diagnosing cancer, the method comprising comparing expression levels of a combination of genes from Table 21 to test nucleic acids, wherein specific expression patterns of the test nucleic acids indicate a cancerous state.
Also disclosed is a method of quantitating level of expression of a test nucleic acid comprising: a) comparing gene expression levels of a combination of genes from Table 21 to test nucleic acids corresponding to the same combination of genes; and b) quantitating level of expression of the test nucleic acid.
Also disclosed is a method for determining prognosis based on the expression patterns in a subject diagnosed with cancer comprising: a) comparing expression levels of a combination of genes from Table 21 to test nucleic acids corresponding to the same combination of genes, b) identifying a subtype of cancer of the subject, and c) determining prognosis (i.e., outcome) and treatment decisions based on the subtype of cancer in the subject.
Disclosed is a method of classifying cancer in a subject, comprising: a) identifying intrinsic genes of the subject to be used to classify the cancer; b) obtaining a sample from the subject; c) amplifying and detecting levels of intrinsic genes in the subject; and d) classifying cancer or subject based upon results of step c.
Also disclosed is a method of diagnosing cancer in a subject, the method comprising: a) amplifying and detecting intrinsic genes; and b) diagnosing cancer based on expression levels of the genes within the subject.
Disclosed herein is a method of deriving a minimal intrinsic gene set for making biological classifications of cancer comprising: a) collecting data from multiple samples from the same individual to identify potential intrinsic classifier genes; b) weighting intrinsic classifier genes of multiple individuals identified using the method of step a relative to each other and forming classification clusters; c) estimating the number of clusters formed in step b) and assigning individual samples to classification clusters; d) identifying genes that optimally distinguish the samples in the assigned groups of step c); e) performing iterative cross-validation with a nearest centroid classifier and overlapping gene sets of various sizes using the genes identified in step d); and f) choosing a gene set which provides the highest class prediction accuracy when compared to the classifications made in step b).
Also disclosed is a method of assigning a sample to an intrinsic subtype, comprising a) creating an intrinsic subtype average profile (centroid) for each subtype; b) individually comparing a new sample to each centroid; and c) assigning the new sample to the centroid that is most similar to the expression profile of new sample.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate several embodiments and together with the description illustrate the disclosed compositions and methods.
Before the present compounds, compositions, articles, devices, and/or methods are disclosed and described, it is to be understood that they are not limited to specific synthetic methods or specific recombinant biotechnology methods unless otherwise specified, or to particular reagents unless otherwise specified, as such may, of course, vary. It is also to be understood that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.
As used in the specification and the appended claims, the singular forms “a,” “an” and “the” include plural referents unless the context clearly dictates otherwise. Thus, for example, reference to “a pharmaceutical carrier” includes mixtures of two or more such carriers, and the like.
Ranges can be expressed herein as from “about” one particular value, and/or to “about” another particular value. When such a range is expressed, another embodiment includes from the one particular value and/or to the other particular value. Similarly, when values are expressed as approximations, by use of the antecedent “about,” it will be understood that the particular value forms another embodiment. It will be further understood that the endpoints of each of the ranges are significant both in relation to the other endpoint, and independently of the other endpoint. It is also understood that there are a number of values disclosed herein, and that each value is also herein disclosed as “about” that particular value in addition to the value itself. For example, if the value “10” is disclosed, then “about 10” is also disclosed. It is also understood that when a value is disclosed, “less than or equal to” the value, “greater than or equal to” the value, and possible ranges between values are also disclosed, as appropriately understood by the skilled artisan. For example, if the value “10” is disclosed, then “less than or equal to 10” as well as “greater than or equal to 10” is also disclosed. It is also understood that throughout the application, data is provided in a number of different formats, and that this data represents endpoints and starting points, and ranges for any combination of the data points. For example, if a particular data point “10” and a particular data point “15” are disclosed, it is understood that greater than, greater than or equal to, less than, less than or equal to, and equal to 10 and 15 are considered disclosed as well as between 10 and 15. It is also understood that each unit between two particular units is also disclosed. For example, if 10 and 15 are disclosed, then 11, 12, 13, and 14 are also disclosed.
As used throughout, by a “subject” is meant an individual. Thus, the “subject” can include, for example, domesticated animals, such as cats, dogs, etc., livestock (e.g., cattle, horses, pigs, sheep, goats, etc.), laboratory animals (e.g., mouse, rabbit, rat, guinea pig, etc.), mammals, non-human mammals, primates, non-human primates, rodents, birds, reptiles, amphibians, fish, and any other animal. The subject can be a mammal such as a primate or a human.
“Treating” or “treatment” does not mean a complete cure. It means that the symptoms of the underlying disease are reduced, and/or that one or more of the underlying cellular, physiological, or biochemical causes or mechanisms causing the symptoms are reduced. It is understood that reduced, as used in this context, means relative to the state of the disease, including the molecular state of the disease, not just the physiological state of the disease.
By “reduce” or other forms of reduce means lowering of an event or characteristic. It is understood that this is typically in relation to some standard or expected value, in other words it is relative, but that it is not always necessary for the standard or relative value to be referred to. For example, “reduces phosphorylation” means lowering the amount of phosphorylation that takes place relative to a standard or a control.
By “inhibit” or other forms of inhibit means to hinder or restrain a particular characteristic. It is understood that this is typically in relation to some standard or expected value, in other words it is relative, but that it is not always necessary for the standard or relative value to be referred to. For example, “inhibits phosphorylation” means hindering or restraining the amount of phosphorylation that takes place relative to a standard or a control.
By “prevent” or other forms of prevent means to stop a particular characteristic or condition. Prevent does not require comparison to a control as it is typically more absolute than, for example, reduce or inhibit. As used herein, something could be reduced but not inhibited or prevented, but something that is reduced could also be inhibited or prevented. It is understood that where reduce, inhibit or prevent are used, unless specifically indicated otherwise, the use of the other two words is also expressly disclosed. Thus, if inhibits phosphorylation is disclosed, then reduces and prevents phosphorylation are also disclosed.
By “specific expression pattern” is meant an elevation or reduction of expression of given genes when compared with a control or a standard. One of ordinary skill in the art is capable of identifying and measuring the expression patterns of genes related to the methods disclosed herein.
The term “therapeutically effective” means that the amount of the composition used is of sufficient quantity to ameliorate one or more causes or symptoms of a disease or disorder. Such amelioration only requires a reduction or alteration, not necessarily elimination. The term “carrier” means a compound, composition, substance, or structure that, when in combination with a compound or composition, aids or facilitates preparation, storage, administration, delivery, effectiveness, selectivity, or any other feature of the compound or composition for its intended use or purpose. For example, a carrier can be selected to minimize any degradation of the active ingredient and to minimize any adverse side effects in the subject.
Throughout the description and claims of this specification, the word “comprise” and variations of the word, such as “comprising” and “comprises,” means “including but not limited to,” and is not intended to exclude, for example, other additives, components, integers or steps.
The term “cell” as used herein also refers to individual cells, cell lines, or cultures derived from such cells. A “culture” refers to a composition comprising isolated cells of the same or a different type.
References in the specification and concluding claims to parts by weight, of a particular element or component in a composition or article, denotes the weight relationship between the element or component and any other elements or components in the composition or article for which a part by weight is expressed. Thus, in a compound containing 2 parts by weight of component X and 5 parts by weight component Y, X and Y are present at a weight ratio of 2:5, and are present in such ratio regardless of whether additional components are contained in the compound.
A weight percent of a component, unless specifically stated to the contrary, is based on the total weight of the formulation or composition in which the component is included.
In this specification and in the claims which follow, reference will be made to a number of terms which shall be defined to have the following meanings:
“Optional” or “optionally” means that the subsequently described event or circumstance may or may not occur, and that the description includes instances where said event or circumstance occurs and instances where it does not.
“Primers” are a subset of probes which are capable of supporting some type of enzymatic manipulation and which can hybridize with a target nucleic acid such that the enzymatic manipulation can occur. A primer can be made from any combination of nucleotides or nucleotide derivatives or analogs available in the art which do not interfere with the enzymatic manipulation.
“Probes” are molecules capable of interacting with a target nucleic acid, typically in a sequence specific manner, for example through hybridization. The hybridization of nucleic acids is well understood in the art and discussed herein. Typically a probe can be made from any combination of nucleotides or nucleotide derivatives or analogs available in the art.
Disclosed herein are methods and compositions for deriving a minimal intrinsic gene set for making biological classifications of cancer. Also disclosed are methods of using intrinsic genes in a real-time qRT-PCR assay for cancer classification, prognosis and/or treatment. Described herein are several algorithms for use in combination in order to generate a statistically validated minimal gene set that makes biological classifications of cancers. While the methods disclosed herein are generally useful with any type of cancer, breast cancer is specifically used as an example herein. Below follows a list of specific cancers that are useful with the methods disclosed herein, and the example of breast cancer is not intended to be limiting, but rather exemplary. The samples disclosed herein can be obtained from a variety of sources, including fresh tissue, fresh-frozen samples, or formalin-fixed paraffin-embedded samples.
The methodology described herein can be used to make a classification that distinguishes 2 or more intrinsic subtypes of breast cancer. The intrinsic subtypes can be designated as Luminal (and classes therein), HER2+/ER− (and classes therein), Basal (and classes therein), and Normal-like (and classes therein). The steps for finding the minimal intrinsic gene set for making subtype (and class) distinctions are as follows.
The first step is to use microarray data from biological replicates from the same patient to find intrinsic classifier genes. For example, a data set of tumors and normal breast samples can be used. In one embodiment, these data sets can comprise paired biological replicates to identify the intrinsic gene set. This is described, for example, in Perou et al. (2000), which is herein incorporated by reference in its entirety for its teaching regarding finding intrinsic classifier genes. In Perou et al., the molecular portraits revealed in the patterns of gene expression not only uncovered similarities and differences among the tumors, but also pointed to a biological interpretation. Variation in growth rate, in the activity of specific signalling pathways, and in the cellular composition of the tumors were all reflected in the corresponding variation in the expression of specific subsets of genes.
In the second step of the method disclosed herein, microarray data are hierarchically clustered using an intrinsic gene set. Here, data can be combined from different microarray platforms for clustering using methods described in Example 2. Specifically, the “intrinsic gene set” from the first step (above) is tested on new tumors and normal breast samples after combining different datasets (such as cross platform analyses), and common genes/elements are hierarchically clustered. For example, a two-way average linkage hierarchical cluster analysis can be performed using a centered Pearson correlation metric and the program “Cluster” (Eisen et al. 1998), with the data being displayed relative to the median expression for each gene (i.e. median centering of the rows/genes).
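The clustering described in this step can be sketched in a few lines of Python. The following is only an illustration under stated assumptions and is not the “Cluster” program itself: the function names (`median_center`, `pearson`, `average_linkage`) and the toy expression values are hypothetical, and the distance is taken to be one minus the centered Pearson correlation.

```python
from math import sqrt
from statistics import mean, median

def median_center(rows):
    """Express each gene (row) relative to its own median across samples."""
    centered = []
    for row in rows:
        m = median(row)
        centered.append([v - m for v in row])
    return centered

def pearson(x, y):
    """Centered Pearson correlation between two equal-length profiles."""
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / den if den else 0.0

def average_linkage(profiles):
    """Agglomerative clustering with distance 1 - r; returns the merge history."""
    n = len(profiles)
    dist = {(i, j): 1 - pearson(profiles[i], profiles[j])
            for i in range(n) for j in range(i + 1, n)}
    clusters = {i: [i] for i in range(n)}
    merges, next_id = [], n
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Average of all pairwise leaf distances between the two clusters.
                d = mean(dist[min(i, j), max(i, j)]
                         for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges

# Toy matrix: 4 genes (rows) x 4 samples (columns); the values are made up.
genes = [
    [5.0, 5.2, 1.0, 1.1],
    [4.0, 4.1, 0.5, 0.4],
    [1.0, 0.9, 6.0, 6.2],
    [2.0, 2.1, 2.0, 1.9],
]
centered = median_center(genes)
samples = [list(col) for col in zip(*centered)]  # one profile per sample
merges = average_linkage(samples)                # cluster the samples
```

Calling `average_linkage(centered)` instead clusters the genes, which gives the second dimension of a two-way analysis.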
In the third step, the number of clusters formed in the microarray dataset is estimated, and samples/tumors are assigned to clusters based on the sample-associated dendrogram groupings. In other words, the “test set” is used as a training set to create subtype centroids based upon the expression of the common intrinsic genes. New samples are assigned to the subtype corresponding to the nearest centroid when using Spearman correlation values.
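As a minimal sketch of the centroid step, assuming expression profiles are plain lists of values over the same ordered genes, the following builds one average profile per subtype and assigns a new sample to the centroid with the highest Spearman correlation. The function names and the toy values are hypothetical and chosen for illustration only.

```python
from math import sqrt
from statistics import mean

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    mx, my = mean(x), mean(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    den = sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / den if den else 0.0

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def build_centroids(profiles, labels):
    """Average expression per gene within each subtype."""
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    return {label: [mean(col) for col in zip(*members)]
            for label, members in groups.items()}

def assign_subtype(profile, centroids):
    """Return the subtype whose centroid correlates best with the profile."""
    return max(centroids, key=lambda s: spearman(profile, centroids[s]))

# Toy training data: 4 samples x 5 genes with cluster-derived labels (made up).
training = [
    [2.0, 1.5, -1.0, -2.0, 0.1],
    [1.8, 1.7, -0.8, -1.9, 0.0],
    [-1.5, -2.0, 1.9, 1.2, 0.2],
    [-1.7, -1.8, 2.1, 1.0, -0.1],
]
labels = ["Luminal", "Luminal", "Basal-like", "Basal-like"]
centroids = build_centroids(training, labels)
call = assign_subtype([1.0, 0.9, -0.5, -1.2, 0.0], centroids)
```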
In the fourth step, genes are found that optimally distinguish the samples in the assigned groups using the ratio of between-group to within-group sums of squares (the entire microarray dataset is used in this analysis). An example of this can be found in Chung et al, Cancer Cell 2004, herein incorporated by reference in its entirety for its teaching concerning identification of genes that optimally distinguish samples.
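The ratio named in this step can be written out for a single gene as follows. This is a sketch, assuming the between-group sum of squares is weighted by group size; the gene names and values are hypothetical.

```python
from statistics import mean

def bss_wss_ratio(values, labels):
    """Between-group over within-group sum of squares for one gene."""
    overall = mean(values)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    bss = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    wss = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return bss / wss if wss else float("inf")

def rank_genes(expression, labels):
    """Genes sorted from most to least discriminating."""
    return sorted(expression,
                  key=lambda gene: bss_wss_ratio(expression[gene], labels),
                  reverse=True)

# Toy data: two groups of two samples each (values are made up).
labels = ["A", "A", "B", "B"]
expression = {
    "gene_1": [5.0, 5.2, 1.0, 1.2],  # separates the groups
    "gene_2": [3.0, 1.0, 3.1, 0.9],  # varies within groups only
}
ranking = rank_genes(expression, labels)
```

A gene with a large ratio differs between the assigned groups far more than it varies inside them, which is the property the ranking relies on.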
In the fifth step, iterative cycles of 10-fold cross-validation are performed with a nearest centroid classifier and overlapping gene sets of varying sizes. In other words, each gene and gene set are ranked based upon the metric from step four above, and various overlapping and ever-increasing gene lists are used in a 10-fold cross-validation.
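A single cycle of this cross-validation can be sketched as below. Several choices here are assumptions made for the sake of a short example rather than details taken from this description: the folds are interleaved deterministically instead of drawn at random, the centroid distance is squared Euclidean, and all names and data are hypothetical.

```python
from statistics import mean

def build_centroids(profiles, labels):
    groups = {}
    for profile, label in zip(profiles, labels):
        groups.setdefault(label, []).append(profile)
    return {label: [mean(col) for col in zip(*members)]
            for label, members in groups.items()}

def nearest(profile, centroids):
    """Label of the centroid at the smallest squared Euclidean distance."""
    return min(centroids,
               key=lambda s: sum((a - b) ** 2
                                 for a, b in zip(profile, centroids[s])))

def cv_accuracy(profiles, labels, n_folds=10):
    """Fraction of held-out samples classified correctly over all folds."""
    n = len(profiles)
    correct = 0
    for fold in range(n_folds):
        held_out = set(range(fold, n, n_folds))
        train = [i for i in range(n) if i not in held_out]
        cents = build_centroids([profiles[i] for i in train],
                                [labels[i] for i in train])
        correct += sum(nearest(profiles[i], cents) == labels[i]
                       for i in held_out)
    return correct / n

def accuracy_by_size(data, labels, ranked_genes, sizes):
    """Cross-validated accuracy for nested gene lists of increasing size."""
    result = {}
    for size in sizes:
        keep = ranked_genes[:size]  # the top-ranked genes overlap across sizes
        subset = [[row[g] for g in keep] for row in data]
        result[size] = cv_accuracy(subset, labels)
    return result

# Toy data: 20 samples x 3 genes; only gene 0 carries class signal.
labels = ["A"] * 10 + ["B"] * 10
data = [[(2.0 if lab == "A" else -2.0) + 0.1 * (i % 3),
         float(i % 5) - 2.0,
         0.5]
        for i, lab in enumerate(labels)]
result = accuracy_by_size(data, labels, ranked_genes=[0, 1, 2], sizes=[1, 2, 3])
```

Repeating the procedure with different random fold assignments would correspond to the iterative cycles mentioned in the text.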
In the sixth, and final step, the smallest gene set which provides the highest class prediction accuracy when compared to the classifications made by the complete microarray-based intrinsic gene set is chosen. Subtypes are assigned for each gene set and the minimal gene set with the highest agreement in sample assignment to the full intrinsic gene set is chosen. In one example, using a 1410 intrinsic gene set as disclosed in Example 2, 100 genes were identified (see Table 12 (7p 100), after the “Examples” section) that are important for identifying 7 different biological classes of breast cancer. Specific steps and sample sets used to develop the 7-class predictor as shown in
The minimal intrinsic gene set (identified using the methods described above, and found in Tables 12 and 13) has prognostic and predictive significance in breast cancer. The complete assay for making these biological “intrinsic” classifications includes 3 “housekeeper” genes (MRPL19, PUM1, and PSMC4) for normalizing the quantitative data. In addition, it has been shown that proliferation genes can also be used in combination with the housekeeper genes for providing a quantitative measurement of grade and for assessing prognosis in breast cancer.
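One plausible way to normalize quantitative data to the three housekeeper genes is shown below. The use of a geometric mean as the reference and of a log2 ratio as the reported value are assumptions made for this sketch, not details stated in this description; the classifier gene names and copy numbers are hypothetical.

```python
from math import log2
from statistics import geometric_mean

HOUSEKEEPERS = ("MRPL19", "PUM1", "PSMC4")

def normalize(copy_numbers):
    """Log2 copy number of each classifier gene relative to the housekeepers."""
    # Assumption: the reference is the geometric mean of the housekeepers.
    reference = geometric_mean([copy_numbers[g] for g in HOUSEKEEPERS])
    return {gene: log2(value / reference)
            for gene, value in copy_numbers.items()
            if gene not in HOUSEKEEPERS}

# Made-up copy numbers for one sample.
sample = {"MRPL19": 100.0, "PUM1": 400.0, "PSMC4": 200.0,
          "gene_A": 800.0, "gene_B": 50.0}
normalized = normalize(sample)
```

With these toy values the reference is 200 copies, so `gene_A` comes out near +2 and `gene_B` near -2 on the log2 scale.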
Also disclosed herein is the Single Sample Predictor (SSP). The Single Sample Predictor/SSP is based upon the Nearest Centroid method presented in Hastie et al. (2001). The subtype centroids (either all intrinsic genes or the minimal gene lists) can be used to make subtype predictions on additional test sets (e.g., homogeneously treated subjects from clinical trial groups). The resulting classifications are then analyzed using Kaplan-Meier survival plots to determine prognostic and therapeutic significance. An example of SSP can be found in Example 2.
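The Kaplan-Meier analysis applied to the resulting classifications can be illustrated with the standard product-limit estimate, computed separately for the samples assigned to each subtype. The sketch below handles one group; the function name and the follow-up data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times: follow-up time per subject; events: 1 if the event (e.g. relapse)
    was observed, 0 if the subject was censored.
    Returns (time, survival probability) at each time with an observed event.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        same_time = [e for tt, e in data if tt == t]
        n_events = sum(same_time)
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= len(same_time)  # events and censored subjects both leave
        i += len(same_time)
    return curve

# Made-up follow-up (months) for subjects assigned to one subtype.
curve = kaplan_meier(times=[5, 8, 12, 12, 20], events=[1, 0, 1, 1, 0])
```

Comparing such curves between subtypes, for example with a log-rank test, is how prognostic differences would typically be assessed.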
1. Intrinsic Genes and Cancer
An intrinsic gene is a gene that shows little variance within repeated samplings of the same tumor, but which shows high variance across tumors. Disclosed herein are genes that can be used as intrinsic genes with the methods disclosed herein. The intrinsic genes disclosed herein can be genes that have less than or equal to 0.00001, 0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 1,000, 10,000, or 100,000% variation between two samples from the same tissue. It is also understood that these levels of variation can also be applied across 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, or 50 or more tissues, and the level of variation compared. It is also understood that variation can be determined as discussed in the examples using the algorithms as disclosed herein.
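The definition above can be made concrete with a simple score that compares variance across tumors with variance between paired samplings of the same tumor. This ratio is an illustrative assumption and is not necessarily the algorithm used in the examples; the gene names and values are hypothetical.

```python
from statistics import mean

def intrinsic_score(pairs):
    """Between-tumor variance divided by within-tumor (replicate) variance.

    pairs: one (replicate_1, replicate_2) expression pair per tumor.
    """
    # Sample variance of two values a and b equals (a - b)^2 / 2.
    within = mean((a - b) ** 2 / 2 for a, b in pairs)
    tumor_means = [(a + b) / 2 for a, b in pairs]
    grand = mean(tumor_means)
    between = sum((m - grand) ** 2 for m in tumor_means) / (len(tumor_means) - 1)
    return between / within if within else float("inf")

# Made-up paired measurements for three tumors.
score_a = intrinsic_score([(1.0, 1.2), (5.0, 5.2), (9.0, 8.8)])  # stable pairs
score_b = intrinsic_score([(1.0, 5.0), (5.0, 1.5), (3.0, 6.0)])  # noisy pairs
```

A high score, as for the first gene, marks a candidate intrinsic gene; a score near or below one, as for the second, indicates that replicate noise dominates.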
“Intrinsic gene set” is defined herein as comprising one or more intrinsic genes. “Minimal intrinsic gene set” is defined herein as being derived from an intrinsic gene set, and is considered the fewest number of intrinsic genes that can be used to classify a sample.
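Selecting a minimal set can be sketched as picking the smallest candidate gene list whose subtype calls agree best with the calls made by the full intrinsic gene set. The function names and the example calls below are hypothetical and serve only to illustrate the selection rule.

```python
def agreement(calls, reference):
    """Fraction of samples given the same subtype by both gene sets."""
    return sum(a == b for a, b in zip(calls, reference)) / len(reference)

def minimal_gene_set(calls_by_size, full_calls):
    """Smallest gene-set size reaching the highest agreement, with that agreement."""
    scores = {size: agreement(calls, full_calls)
              for size, calls in calls_by_size.items()}
    best = max(scores.values())
    return min(size for size, s in scores.items() if s == best), best

# Made-up subtype calls for five samples.
full_calls = ["Luminal", "Luminal", "Basal-like", "HER2+/ER-", "Normal-like"]
calls_by_size = {
    20: ["Luminal", "Basal-like", "Basal-like", "HER2+/ER-", "Normal-like"],
    50: ["Luminal", "Luminal", "Basal-like", "HER2+/ER-", "Normal-like"],
    100: ["Luminal", "Luminal", "Basal-like", "HER2+/ER-", "Normal-like"],
}
size, score = minimal_gene_set(calls_by_size, full_calls)
```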
Disclosed herein is a set of 212 minimal intrinsic genes, as found in Table 21. These genes can be used alone, or in combination, as intrinsic genes for the purposes of classification, prognosis, and diagnosis of cancer, for example. 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199 of the genes can be used with the methods disclosed herein for analyzing samples.
Described herein is a method of diagnosing cancer, the method comprising comparing expression levels of a combination of genes from Table 21 to test nucleic acids corresponding to the same combination of genes, wherein specific expression patterns of the test nucleic acids indicate a cancerous state.
Also disclosed is a method of quantitating level of expression of a test nucleic acid comprising: a) comparing gene expression levels of a combination of genes from Table 21 to test nucleic acids corresponding to the same combination of genes; and b) quantitating level of expression of the test nucleic acid.
Also disclosed is a method of prognosing outcome in a subject diagnosed with cancer comprising: a) comparing expression levels of a combination of genes from Table 21 to test nucleic acids corresponding to the same combination of genes, b) identifying a subtype of cancer of the subject, and c) prognosing the outcome based on the subtype of cancer of the subject.
The intrinsic genes disclosed herein can be normalized to control housekeeper genes and used in a qRT-PCR diagnostic assay that uses relative copy number to assess risk or therapeutic response in cancer. For example, the housekeeper genes can include MRPL19 (SEQ ID NO:1), PSMC4 (SEQ ID NO:2), SF3A1 (SEQ ID NO:3), PUM1 (SEQ ID NO:4), ACTB (SEQ ID NO:5) and GAPD (SEQ ID NO:6). Other genes include GUSB, RPLP0, and TFRC, whose sequences can be found in Genbank. These are part of the 212 gene list. Other genes as disclosed herein can also be considered intrinsic genes.
The intrinsic genes can be used in any combination or singularly in any method described herein. It is also understood that any nucleic acid related to the expression control genes, such as the RNA, mRNA, exons, introns, or 5′ or 3′ upstream or downstream sequence, or DNA or gene can be used or identified in any of the methods or with any of the compositions disclosed herein.
2. Molecules for Detecting Genes, Gene Expression Products, Proteins Encoded by Genes
The disclosed methods involve using specific intrinsic genes or gene sets or expression control genes or gene sets such that they are detected in some way or their expression product is detected in some way. Typically the expression of a gene or its expression product will be detected by a primer or probe as disclosed herein. However, it is understood that they can also be detected by any means, such as in a microarray analysis or a specific monoclonal antibody or other visualization technique. Often, the expression of the genes of interest (control “housekeeper” genes or intrinsic classifier genes) can be detected after or during an amplification process, such as RT-PCR, including quantitative PCR.
3. Method of Diagnosing or Prognosing Cancer
Microarrays have shown that gene expression patterns can be used to molecularly classify various types of cancers into distinct and clinically significant groups. In order to translate these profiles into routine diagnostics, a microarray breast cancer classification system has been recapitulated using real-time quantitative (q)RT-PCR (Example 2). Statistical analyses were performed on multiple independent microarray datasets to select an “intrinsic” gene set that can classify breast tumors into four different subtypes designated as Luminal, Normal-like, HER2+/ER−, and Basal-like. Intrinsic genes, as described in Perou et al. (Nature (2000) 406:747-752), are statistically selected to have low variation in expression between biological sample replicates from the same individual and high variation in expression across samples from different individuals. Thus, intrinsic genes are the classifier genes for breast cancer classification and each classifier gene can be normalized to the housekeeper (or control) genes in order to make the classification. A minimal gene set from the microarray “intrinsic” list, and additional genes important for outcome (e.g., proliferation genes), were used to develop a real-time qRT-PCR assay comprised of 53 classifiers and 3 housekeepers. The expression data and classifications from microarray and real-time qRT-PCR were respectively compared using 123 unique breast samples (117 invasive carcinomas, 1 fibroadenoma and 5 normal tissues) and 3 cell lines. The overall correlation for the 50 genes in common between microarray and qRT-PCR was 0.76. There was 91% (114/126) concordance in the hierarchical clustering classification of the real-time qRT-PCR minimal “intrinsic” gene set (37 genes) and the larger (550 genes) microarray intrinsic gene set from which the PCR list was derived. As expected, the Luminal tumors (ER+) had a significantly better outcome than the HER2+/ER− (p=0.043) and Basal-like tumors (p=0.001).
High expression of the proliferation genes GTBP4 (p=0.011), HSPA14 (p=0.023), and STK6 (p=0.027) was a significant predictor of relapse free survival (RFS) independent of grade and stage. It has been shown that genomic microarray data can be translated into a qRT-PCR diagnostic assay that improves the standard of care in breast cancer.
The overlap in the minimized gene set discussed above and in Example 2 versus those in Example 3 is 14 out of 40. There are 108 genes in common between the larger intrinsic gene sets, which included 427 in Perreard et al versus 1300 used in Example 3. Example 2 illustrates how intrinsic gene sets can be minimized from microarray data and used on fresh tissue in a qRT-PCR assay to recapitulate the microarray classifications. It also shows the importance of the ‘proliferation’ genes in risk stratifying Luminal (ER+) breast tumors. Example 3 discusses a version of the intrinsic gene set from Hu et al and shows again how it can be minimized to provide intrinsic classifications on both fresh and FFPE tissue and using microarray or qRT-PCR data. Validated primer sequences from FFPE tissues for 212 genes important for breast cancer diagnostics are presented in Table 21.
A major challenge in the clinical care of cancer has been providing an accurate diagnosis for appropriate management of breast cancer. For over 50 years, medicine has relied on morphological features (histopathology) and anatomic staging (Tumor size/Node involvement/Metastasis) for classification of tumors (Greenough, R. B. J Cancer Res 9:452-463; Bloom et al. (1957) British Journal of Cancer 9:359-377). The TNM staging system provides information about the extent of disease and has been the “gold standard” for prognosis (Henson, et al. (1991) Cancer 68:2142-2149; Fitzgibbons, et al (2000) Arch Pathol Lab Med 124:966-978).
In addition to TNM, the grade of the tumor is also prognostic for relapse free survival (RFS) and overall survival (OS) (Elston et al. (1991) Histopathology 19:403-410). Tumor grade is determined from histological assessment of tubule formation, nuclear pleomorphism, and mitotic count. Due to the subjective nature of grading and difficulties standardizing methods, there has been less than optimal agreement between pathologists (Dalton et al. (1994) Cancer 73:2765-2770). Applying the Nottingham combined histological grade has made scoring more quantitative and improved agreement between observers (Frierson (1995) Am J Clin Pathol 103:195-198); however, more objective methods are still needed before grade is integrated into the TNM classification (Singletary (2003) Surg Clin North Am 83:803-819). For instance, most studies show significance in outcome between Grade 1 (low/least aggressive) and Grade 3 (high/most aggressive), but Grade 2 (intermediate) tumors show variability in outcome and are commonly not classified the same across institutions (Kollias et al. (1999) Eur J Cancer 35:908-912; Robbins et al. (1995) Hum Pathol 26:873-879; Genestie et al. (1998) Anticancer Res 18:571-576). Alternatively, proliferation assays, such as S-phase fraction and mitotic index, have been shown to be independent prognostic indicators and could be used in conjunction with, or instead of, grade (Michels et al. (2004) Cancer 100:455-464; Caly et al. (2004) Anticancer Res 24:3283-3288). It has been shown that proliferation genes can be used in a qRT-PCR assay and the genes can be averaged to produce a proliferation meta-gene that correlates with grade but is more prognostic (
Women with the same stage of breast cancer can have widely different clinical outcomes due to differences in tumor biology (van't Veer et al. (2002) Nature 415:530-536; van de Vijver et al. (2002) N Engl J Med 347:1999-2009). The use of gene expression markers in breast pathology can provide additional clinical information that complements the TNM system for prognosis and is important for making therapeutic decisions (van't Veer et al. (2002) Nature 415:530-536; van de Vijver et al. (2002) N Engl J Med 347:1999-2009; Paik et al. (2004) N Engl J Med 351:2817-2826; Sorlie et al. (2001) Proc Natl Acad Sci USA 98:10869-10874; Sorlie et al. (2003) Proc Natl Acad Sci USA 100:8418-8423). Undoubtedly, one of the greatest advancements in breast cancer medicine has been the identification and routine testing for the expression of the hormone receptors, namely the Estrogen Receptor (ER) and the Progesterone Receptor (PgR), which allows the clinician to offer endocrine blockade therapy that can significantly prolong survival in women with tumors expressing these proteins (Buzdar et al. (2003) J Clin Oncol 21:1007-1014; Fisher et al. (1989) N Engl J Med 320:479-484).
Although ER expression is a predictive marker, it also serves as a surrogate marker for describing a tumor biology that is characteristically less aggressive (e.g. lower grade) than ER-negative tumors (Fisher et al. (1981) Breast Cancer Res Treat 1:37-41). Microarrays have elucidated the richness and diversity in the biology of breast cancer and have identified many genes that associate with ER-positive and ER-negative tumors (Perou et al. (2000) Nature 406:747-752; West et al. (2001) Proc Natl Acad Sci USA 98:11462-11467; Gruvberger et al. (2001) Cancer Res 61:5979-5984). When microarray data from invasive breast carcinomas are analyzed by hierarchical clustering, samples are separated primarily based on ER status (Sotiriou et al. (2003) Proc Natl Acad Sci USA 100:10393-10398).
Breast tumors of the “Luminal” subtype are ER positive and have a keratin expression profile similar to that of the epithelial cells lining the lumen of the breast ducts (Taylor-Papadimitriou et al. (1989) J Cell Sci 94:403-413; Perou et al. (2000) New Technologies for life sciences: A Trends Guide:67-76). Conversely, ER-negative tumors can be broken into two main subtypes, namely those that overexpress (and are DNA amplified for) HER2 and GRB7 (HER2+/ER−), and “Basal-like” tumors that have an expression profile similar to basal epithelium and express Keratin 5, 6B and 17. Both these tumor subtypes are aggressive and typically more deadly than Luminal tumors; however, there are subtypes of Luminal tumors that lead to poor outcome despite being ER-positive. For instance, Sorlie et al. identified a Luminal B subtype with similar outcomes to the HER2+/ER− and Basal-like subtypes, and Sotiriou et al. showed that there are 3 different types of Luminal tumors with different outcomes. The Luminal tumors with poor outcomes consistently share the histopathological feature of being higher grade and the molecular feature of highly expressing proliferation genes.
The so called “proliferation genes” show periodicity in expression through the cell cycle and have a variety of functions necessary for cell growth, DNA replication, and mitosis (Whitfield et al. (2002) Mol Biol Cell 13:1977-2000; Ishida et al. Mol Cell Biol 21:4684-4699). Despite their diverse functions, proliferation genes have similar gene expression profiles when analyzed by hierarchical clustering. As might be expected, proliferation genes correlate with grade, the mitotic index (Perou et al. (1999) Proc Natl Acad Sci USA 96:9212-9217), and outcome (Sørlie et al. (2001) Proc Natl Acad Sci USA 98:10869-10874). Proliferation genes are often selected when supervised analysis is used to find genes that correlate with patient outcome. For example, the SAM264 “survival” list presented in Sorlie et al., the 231 “prognosis classifier” list in van't Veer et al., and the “485 prognostic gene” list in Sotiriou et al., identified common proliferation genes (PCNA, TOP2A, CENPF). This suggests that all these studies are likely tracking a similar phenotype.
Gene expression profiling using DNA microarrays is a powerful tool to discover genes for molecular classifications of cancer, but the platforms are labor intensive, expensive and currently not amenable to routine clinical diagnostics. Real-time qRT-PCR is well-suited for solid tumor diagnostics since it is rapid, homogeneous (amplification and quantification in a single vessel), and can be performed from archived (FFPE tissue) samples. Example 3 shows that FFPE samples can perform as well as fresh samples. It has been shown that “intrinsic” breast cancer classifications from microarray can be recapitulated by qRT-PCR using a minimal “intrinsic” gene set. In addition, by supplementing the “intrinsic” gene set with proliferation genes, a more objective measurement of grade has been developed. The assay disclosed herein adds prognostic information to the standard of care for breast cancer.
Microarray used in conjunction with RT-PCR provides a powerful system for discovering and translating genomic markers into the clinical laboratory for molecular diagnostics. Although these platforms are fundamentally very different, the quantitative data across the methods have a high correlation. In fact, the data across the methods are no more disparate than across different microarray platforms. By hierarchical clustering, it has been shown that a biological classification of breast cancer derived from microarray data can be recapitulated using real-time qRT-PCR. Biological classification by real-time qRT-PCR makes the important clinical distinction between ER positive and ER negative tumors and identifies additional subtypes that have prognostic (ie, correlate to outcome) and predictive value (ie, correlate to treatment response).
The benefit of using real-time qRT-PCR for cancer diagnostics is that new informative markers can be readily validated and implemented, making tests expandable and/or tailored to the individual. For instance, it has been shown that including proliferation genes serves a similar purpose to grade but is more prognostic. Since grade has been shown to be universal as a prognostic factor in cancer, it is likely that the same markers correlate to grade and are important for survival in other tumor types. Real-time qRT-PCR is attractive for clinical use because it is fast, reproducible, tissue sparing, and able to be automated. Although genomic profiling should currently be used for ancillary testing, the fact that normal tissues can be distinguished from tumor tissue shows that these molecular assays may eventually be used for cancer diagnostics without histological corroboration.
Disclosed is a method of classifying cancer in a subject, comprising: a) identifying intrinsic genes of the subject to be used to classify the cancer; b) obtaining a sample from the subject; c) amplifying and detecting levels of intrinsic genes in the subject; and d) classifying cancer based upon results of step c. The sample can be fresh, or can be an FFPE sample.
Also disclosed is a method of diagnosing cancer in a subject, the method comprising: a) amplifying and detecting intrinsic genes; and b) diagnosing cancer based on expression levels of the gene within the subject. The methods disclosed herein can be used with any of the types of cancer listed herein. The cancer can be breast cancer, for example. The breast cancer can be classified into one of four or more groups: luminal, normal-like, HER2+/ER− and basal-like, for example. Again, the sample can be fresh, or can be an FFPE sample.
Disclosed are methods of analyzing nucleic acid expression levels in a sample, the methods comprising comparing expression levels of an intrinsic gene set to a test nucleic acid, wherein specific expression patterns of the test gene relative to the intrinsic gene set indicate a diagnosis, poor prognosis, likelihood of obtaining, predisposition to obtaining, or presence of a cancer. Also disclosed are methods wherein the step of comparing comprises identifying the expression levels of an intrinsic gene set and a test nucleic acid by interaction with a primer or probe.
Disclosed are methods where a specific expression pattern of a test nucleic acid relative to an intrinsic gene set indicates the presence of a cancer, a poor (or good) prognosis for a patient having a cancer, a predisposition to getting a cancer, or a diagnosis of cancer or a cancerous state.
It is understood that any method of assaying any gene discussed herein can be performed. For example, methods of assaying gene copy number or mRNA expression copy number can be performed. For example, RT-PCR, PCR, quantitative PCR, and any other forms of nucleic acid amplification can be performed. Furthermore, methods of hybridization can be performed, such as blotting (for example, Northern or Southern techniques), chip and microarray techniques, and any other techniques involving hybridization of nucleic acids.
4. A Non-Limiting List of Cancers which can be Assayed with Disclosed Compositions and Methods
The disclosed compositions can be used to diagnose or prognose any disease where uncontrolled cellular proliferation occurs such as cancers. A non-limiting list of different types of cancers is as follows: lymphomas (Hodgkin's and non-Hodgkin's), leukemias, carcinomas, carcinomas of solid tissues, squamous cell carcinomas, adenocarcinomas, sarcomas, gliomas, high grade gliomas, blastomas, neuroblastomas, plasmacytomas, histiocytomas, melanomas, adenomas, hypoxic tumours, myelomas, AIDS-related lymphomas or sarcomas, metastatic cancers, or cancers in general.
A representative but non-limiting list of cancers that the disclosed compositions can be used to diagnose or prognose is the following: lymphoma, B cell lymphoma, T cell lymphoma, mycosis fungoides, Hodgkin's Disease, myeloid leukemia, bladder cancer, brain cancer, nervous system cancer, head and neck cancer, squamous cell carcinoma of head and neck, kidney cancer, lung cancers such as small cell lung cancer and non-small cell lung cancer, neuroblastoma/glioblastoma, ovarian cancer, pancreatic cancer, prostate cancer, skin cancer, liver cancer, melanoma, squamous cell carcinomas of the mouth, throat, larynx, and lung, colon cancer, cervical cancer, cervical carcinoma, breast cancer, and epithelial cancer, renal cancer, genitourinary cancer, pulmonary cancer, esophageal carcinoma, head and neck carcinoma, large bowel cancer, hematopoietic cancers; testicular cancer; colon and rectal cancers, prostatic cancer, or pancreatic cancer.
Compounds disclosed herein may also be used for the diagnosis or prognosis of precancer conditions such as cervical and anal dysplasias, other dysplasias, severe dysplasias, hyperplasias, atypical hyperplasias, and neoplasias.
5. Methods of Identifying a Minimal Intrinsic Gene Set
Disclosed are methods of identifying minimal intrinsic genes. These methods are described in detail above, and generally comprise the following: deriving a minimal intrinsic gene set for making biological classifications of cancer comprising: a) collecting data from multiple samples from the same or different individuals to identify potential intrinsic classifier genes (microarray data can be used in this step, for example); b) weighting intrinsic classifier genes of multiple individuals identified using the method of step a relative to each other and forming classification clusters (weighting can be done, for example, by forming hierarchical clusters); c) estimating the number of clusters formed in step b) and assigning individual samples to clusters; d) identifying genes that optimally distinguish the samples in the assigned groups of step c); e) performing iterative cross-validation with a nearest centroid classifier and overlapping gene sets of various sizes using the genes identified in step d); and f) choosing a gene set which provides the highest class prediction accuracy when compared to the classifications made in step b).
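As a non-limiting illustration, steps d) through f) above can be sketched as follows. The sketch assumes two sample groups (standing in for the cluster assignments of step c), a simple mean-difference gene ranking, Euclidean distance to the centroids, and leave-one-out cross-validation; none of these particular choices is stated in the text, and the function names and toy data are hypothetical.

```python
import numpy as np

def rank_genes(X, y):
    """Step d (assumed scoring): rank genes by class-mean difference over spread."""
    classes = np.unique(y)
    a, b = X[y == classes[0]], X[y == classes[1]]
    score = np.abs(a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-9)
    return np.argsort(score)[::-1]  # best-scoring gene first

def nearest_centroid_predict(X_train, y_train, x):
    """Assign x to the class whose average profile (centroid) is closest."""
    classes = np.unique(y_train)
    centroids = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    return classes[np.argmin(np.linalg.norm(centroids - x, axis=1))]

def loo_accuracy(X, y, n_genes):
    """Step e: leave-one-out cross-validation for one gene-set size."""
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        # Genes are re-selected inside each fold so the held-out sample is unseen.
        genes = rank_genes(X[keep], y[keep])[:n_genes]
        pred = nearest_centroid_predict(X[keep][:, genes], y[keep], X[i, genes])
        hits += int(pred == y[i])
    return hits / len(y)

# Hypothetical toy data: 20 samples x 10 genes, first 3 genes informative.
rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)  # stands in for the cluster assignments
X = rng.normal(size=(20, 10))
X[y == 0, :3] += 3.0
X[y == 1, :3] -= 3.0

# Step f: choose the gene-set size with the highest prediction accuracy.
sizes = [1, 2, 3, 5, 10]
accuracy = {k: loo_accuracy(X, y, k) for k in sizes}
best_size = max(sizes, key=lambda k: accuracy[k])
```

When several sizes tie, `max` returns the first (smallest) one, which is in keeping with the aim of a minimal gene set.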
Also disclosed is a method of assigning a sample to an intrinsic subtype, comprising a) creating an intrinsic subtype average profile (centroid) for each subtype; b) individually comparing a new sample to each centroid; and c) assigning the new sample to the centroid that is most similar to the new sample. This is known as the Single Sample Predictor (SSP) method, and is described in further detail in Example 2.
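By way of a hedged example, the three steps of the Single Sample Predictor can be sketched as below. Pearson correlation is assumed here as the similarity measure because the text above does not name one; the function names, subtype labels used for the toy samples, and expression values are hypothetical.

```python
import numpy as np

def build_centroids(expression, labels):
    """Step a: average the training profiles within each subtype."""
    expression = np.asarray(expression, dtype=float)
    labels = np.asarray(labels)
    return {str(s): expression[labels == s].mean(axis=0) for s in np.unique(labels)}

def assign_subtype(sample, centroids):
    """Steps b and c: compare the sample to each centroid, return the most similar."""
    sample = np.asarray(sample, dtype=float)

    def similarity(centroid):
        # Assumed similarity measure: Pearson correlation across genes.
        return np.corrcoef(sample, centroid)[0, 1]

    return max(centroids, key=lambda s: similarity(centroids[s]))

# Hypothetical training data: 4 samples x 4 genes.
train = [[2.0, 1.8, -1.5, -1.9],
         [1.7, 2.1, -1.2, -2.2],
         [-1.8, -2.0, 1.6, 2.1],
         [-2.2, -1.6, 1.9, 1.7]]
subtypes = ["Luminal", "Luminal", "Basal-like", "Basal-like"]

centroids = build_centroids(train, subtypes)
predicted = assign_subtype([1.5, 1.9, -1.0, -1.4], centroids)
```

Because each new sample is compared only with fixed centroids, a sample can be classified on its own without re-clustering the training set.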
Also disclosed are computerized implementing systems, as well as storage and retrieval systems, of biological information, comprising: a data entry means; a display means; a programmable central processing unit; and a data storage means having expression data for a gene electronically stored; wherein the stored sequences are used as input data for determining which sequence is the best intrinsic gene set for a specific tissue type.
Disclosed are the components to be used to prepare the disclosed compositions as well as the compositions themselves to be used within the methods disclosed herein. These and other materials are disclosed herein, and it is understood that when combinations, subsets, interactions, groups, etc. of these materials are disclosed that while specific reference of each various individual and collective combinations and permutation of these compounds may not be explicitly disclosed, each is specifically contemplated and described herein. For example, if a particular expression control gene is disclosed and discussed and a number of modifications that can be made to a number of molecules including the expression control gene are discussed, specifically contemplated is each and every combination and permutation of expression control gene and the modifications that are possible unless specifically indicated to the contrary. Thus, if a class of molecules A, B, and C are disclosed as well as a class of molecules D, E, and F and an example of a combination molecule, A-D is disclosed, then even if each is not individually recited each is individually and collectively contemplated meaning combinations, A-E, A-F, B-D, B-E, B-F, C-D, C-E, and C-F are considered disclosed. Likewise, any subset or combination of these is also disclosed. Thus, for example, the sub-group of A-E, B-F, and C-E would be considered disclosed. This concept applies to all aspects of this application including, but not limited to, steps in methods of making and using the disclosed compositions. Thus, if there are a variety of additional steps that can be performed it is understood that each of these additional steps can be performed with any specific embodiment or combination of embodiments of the disclosed methods.
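The A-D through C-F example above can be enumerated mechanically. The following is only an illustrative sketch; the variable and function names are hypothetical.

```python
from itertools import combinations, product

class_one = ["A", "B", "C"]
class_two = ["D", "E", "F"]

# Every pairing of one member from each class: A-D, A-E, ..., C-F.
pairs = [f"{x}-{y}" for x, y in product(class_one, class_two)]

def subgroups(items, size):
    """All sub-groups of a given size drawn from the listed combinations."""
    return list(combinations(items, size))

three_member_subgroups = subgroups(pairs, 3)
```

The list `pairs` holds the nine combination molecules, and the sub-group of A-E, B-F, and C-E mentioned above is one of the 84 three-member sub-groups.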
1. Sequence Similarities
It is understood that as discussed herein the use of the terms homology and identity mean the same thing as similarity. Thus, for example, if the use of the word homology is used between two non-natural sequences it is understood that this is not necessarily indicating an evolutionary relationship between these two sequences, but rather is looking at the similarity or relatedness between their nucleic acid sequences. Many of the methods for determining homology between two evolutionarily related molecules are routinely applied to any two or more nucleic acids or proteins for the purpose of measuring sequence similarity regardless of whether they are evolutionarily related or not.
In general, it is understood that one way to define any known variants and derivatives or those that might arise, of the disclosed genes and proteins herein, is through defining the variants and derivatives in terms of homology to specific known sequences. This identity of particular sequences disclosed herein is also discussed elsewhere herein. In general, variants of genes and proteins herein disclosed typically have at least, about 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, or 99 percent homology to the stated sequence or the native sequence. Those of skill in the art readily understand how to determine the homology of two proteins or nucleic acids, such as genes. For example, the homology can be calculated after aligning the two sequences so that the homology is at its highest level.
Another way of calculating homology can be performed by published algorithms. Optimal alignment of sequences for comparison may be conducted by the local homology algorithm of Smith and Waterman Adv. Appl. Math 2: 482 (1981), by the homology alignment algorithm of Needleman and Wunsch, J. Mol. Biol. 48: 443 (1970), by the search for similarity method of Pearson and Lipman, Proc. Natl. Acad. Sci. U.S.A. 85: 2444 (1988), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer Group, 575 Science Dr., Madison, Wis.), or by inspection.
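For illustration only, a global alignment in the style of Needleman and Wunsch followed by a percent-identity calculation can be sketched as follows. The scoring values (match +1, mismatch -1, gap -1) and the use of the alignment length as the denominator are assumptions of this sketch, not parameters taken from the cited references or software packages.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))

def percent_identity(aligned_a, aligned_b):
    """Identical, non-gap columns as a percentage of the alignment length."""
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)

aligned = needleman_wunsch("ACGT", "AGT")
identity = percent_identity(*aligned)
```

Other conventions divide by the length of the shorter sequence or of a reference sequence, which is one reason different methods can report different percentages for the same pair.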
The same types of homology can be obtained for nucleic acids by for example the algorithms disclosed in Zuker, M. Science 244:48-52, 1989, Jaeger et al. Proc. Natl. Acad. Sci. USA 86:7706-7710, 1989, Jaeger et al. Methods Enzymol. 183:281-306, 1989 which are herein incorporated by reference for at least material related to nucleic acid alignment. It is understood that any of the methods typically can be used and that in certain instances the results of these various methods may differ, but the skilled artisan understands if identity is found with at least one of these methods, the sequences would be said to have the stated identity, and be disclosed herein.
For example, as used herein, a sequence recited as having a particular percent homology to another sequence refers to sequences that have the recited homology as calculated by any one or more of the calculation methods described above. For example, a first sequence has 80 percent homology, as defined herein, to a second sequence if the first sequence is calculated to have 80 percent homology to the second sequence using the Zuker calculation method even if the first sequence does not have 80 percent homology to the second sequence as calculated by any of the other calculation methods. As another example, a first sequence has 80 percent homology, as defined herein, to a second sequence if the first sequence is calculated to have 80 percent homology to the second sequence using both the Zuker calculation method and the Pearson and Lipman calculation method even if the first sequence does not have 80 percent homology to the second sequence as calculated by the Smith and Waterman calculation method, the Needleman and Wunsch calculation method, the Jaeger calculation methods, or any of the other calculation methods. As yet another example, a first sequence has 80 percent homology, as defined herein, to a second sequence if the first sequence is calculated to have 80 percent homology to the second sequence using each of the calculation methods (although, in practice, the different calculation methods will often result in different calculated homology percentages).
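The rule in the preceding examples, that meeting the recited homology under any one calculation method suffices, can be expressed compactly. In this sketch the recited value is treated as a minimum ("at least"), which is an assumption, and the per-method percentages are hypothetical.

```python
def has_recited_homology(calculated, recited_percent):
    """True if at least one calculation method reaches the recited percent homology."""
    return any(value >= recited_percent for value in calculated.values())

# Hypothetical results for one pair of sequences under three methods.
results = {"Zuker": 80.0, "Smith-Waterman": 74.0, "Needleman-Wunsch": 77.5}
meets_80 = has_recited_homology(results, 80)
meets_85 = has_recited_homology(results, 85)
```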
2. Hybridization/Selective Hybridization
The term hybridization typically means a sequence driven interaction between at least two nucleic acid molecules, such as a primer or a probe and a gene. Sequence driven interaction means an interaction that occurs between two nucleotides or nucleotide analogs or nucleotide derivatives in a nucleotide specific manner. For example, G interacting with C or A interacting with T are sequence driven interactions. Typically sequence driven interactions occur on the Watson-Crick face or Hoogsteen face of the nucleotide. The hybridization of two nucleic acids is affected by a number of conditions and parameters known to those of skill in the art. For example, the salt concentrations, pH, and temperature of the reaction all affect whether two nucleic acid molecules will hybridize.
Parameters for selective hybridization between two nucleic acid molecules are well known to those of skill in the art. For example, in some embodiments selective hybridization conditions can be defined as stringent hybridization conditions. For example, stringency of hybridization is controlled by both temperature and salt concentration of either or both of the hybridization and washing steps. For example, the conditions of hybridization to achieve selective hybridization may involve hybridization in high ionic strength solution (6×SSC or 6×SSPE) at a temperature that is about 12-25° C. below the Tm (the melting temperature at which half of the molecules dissociate from their hybridization partners) followed by washing at a combination of temperature and salt concentration chosen so that the washing temperature is about 5° C. to 20° C. below the Tm. The temperature and salt conditions are readily determined empirically in preliminary experiments in which samples of reference DNA immobilized on filters are hybridized to a labeled nucleic acid of interest and then washed under conditions of different stringencies. Hybridization temperatures are typically higher for DNA-RNA and RNA-RNA hybridizations. The conditions can be used as described above to achieve stringency, or as is known in the art. (Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd Ed., Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y., 1989; Kunkel et al. Methods Enzymol. 154:367, 1987 which is herein incorporated by reference for material at least related to hybridization of nucleic acids). A preferable stringent hybridization condition for a DNA:DNA hybridization can be at about 68° C. (in aqueous solution) in 6×SSC or 6×SSPE followed by washing at 68° C.
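As a simple worked example of the temperature ranges recited above, the following sketch converts a melting temperature into the corresponding hybridization and wash windows. The Tm of 80° C. is a hypothetical input, and the function names are illustrative only; in practice the Tm would be measured or estimated for the particular duplex and salt condition.

```python
def hybridization_window(tm_celsius):
    """Hybridization range: about 12-25 degrees C below the Tm, as (low, high)."""
    return (tm_celsius - 25.0, tm_celsius - 12.0)

def wash_window(tm_celsius):
    """Wash range: about 5-20 degrees C below the Tm, as (low, high)."""
    return (tm_celsius - 20.0, tm_celsius - 5.0)

tm = 80.0  # hypothetical melting temperature in degrees Celsius
hyb_low, hyb_high = hybridization_window(tm)
wash_low, wash_high = wash_window(tm)
```

For this hypothetical Tm, hybridization would fall between 55° C. and 68° C. and washing between 60° C. and 75° C.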
Stringency of hybridization and washing, if desired, can be reduced accordingly as the degree of complementarity desired is decreased, and further, depending upon the G-C or A-T richness of any area wherein variability is searched for. Likewise, stringency of hybridization and washing, if desired, can be increased accordingly as homology desired is increased, and further, depending upon the G-C or A-T richness of any area wherein high homology is desired, all as known in the art.
Another way to define selective hybridization is by looking at the amount (percentage) of one of the nucleic acids bound to the other nucleic acid. For example, in some embodiments selective hybridization conditions would be when at least about, 60, 65, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 percent of the limiting nucleic acid is bound to the non-limiting nucleic acid. Typically, the non-limiting primer is in for example, 10 or 100 or 1000 fold excess. This type of assay can be performed under conditions where both the limiting and non-limiting primer are for example, 10 fold or 100 fold or 1000 fold below their Kd, or where only one of the nucleic acid molecules is 10 fold or 100 fold or 1000 fold or where one or both nucleic acid molecules are above their Kd.
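To make the percentage-bound criterion concrete, the sketch below estimates the fraction of the limiting nucleic acid that is bound. It assumes a simple 1:1 equilibrium in which the non-limiting nucleic acid is in large excess, so that the bound fraction is approximately [N]/([N] + Kd); this binding model, the concentrations, and the function names are assumptions made for illustration and are not taken from the text.

```python
def fraction_bound(non_limiting_conc, kd):
    """Approximate bound fraction of the limiting strand (assumed 1:1 model).

    Both arguments must be in the same concentration units.
    """
    return non_limiting_conc / (non_limiting_conc + kd)

def is_selective(non_limiting_conc, kd, required_percent):
    """True if the estimated percentage bound meets the required percentage."""
    return 100.0 * fraction_bound(non_limiting_conc, kd) >= required_percent

# Hypothetical values: non-limiting strand at 9 nM, Kd of 1 nM.
bound = fraction_bound(9.0, 1.0)
meets_80 = is_selective(9.0, 1.0, 80)
meets_95 = is_selective(9.0, 1.0, 95)
```

Under this model, a non-limiting concentration well below the Kd gives a low bound fraction, while a concentration well above the Kd drives the bound fraction toward 100 percent.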
Another way to define selective hybridization is by looking at the percentage of primer that gets enzymatically manipulated under conditions where hybridization is required to promote the desired enzymatic manipulation. For example, in some embodiments selective hybridization conditions would be when at least about, 60, 65, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 percent of the primer is enzymatically manipulated under conditions which promote the enzymatic manipulation, for example if the enzymatic manipulation is DNA extension, then selective hybridization conditions would be when at least about 60, 65, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100 percent of the primer molecules are extended. Preferred conditions also include those suggested by the manufacturer or indicated in the art as being appropriate for the enzyme performing the manipulation.
Just as with homology, it is understood that there are a variety of methods herein disclosed for determining the level of hybridization between two nucleic acid molecules. It is understood that these methods and conditions may provide different percentages of hybridization between two nucleic acid molecules, but unless otherwise indicated meeting the parameters of any of the methods would be sufficient. For example if 80% hybridization was required and as long as hybridization occurs within the required parameters in any one of these methods it is considered disclosed herein.
It is understood that those of skill in the art understand that if a composition or method meets any one of these criteria for determining hybridization either collectively or singly it is a composition or method that is disclosed herein.
3. Nucleic Acids
There are a variety of molecules disclosed herein that are nucleic acid based, including for example the nucleic acids that encode, for example, the intrinsic genes disclosed herein (Table 12), as well as various functional nucleic acids. The disclosed nucleic acids are made up of for example, nucleotides, nucleotide analogs, or nucleotide substitutes. Non-limiting examples of these and other molecules are discussed herein. It is understood that for example, when a vector is expressed in a cell, that the expressed mRNA will typically be made up of A, C, G, and U. Likewise, it is understood that if, for example, an antisense molecule is introduced into a cell or cell environment through for example exogenous delivery, it is advantageous that the antisense molecule be made up of nucleotide analogs that reduce the degradation of the antisense molecule in the cellular environment.
a) Nucleotides and Related Molecules
A nucleotide is a molecule that contains a base moiety, a sugar moiety and a phosphate moiety. Nucleotides can be linked together through their phosphate moieties and sugar moieties creating an internucleoside linkage. The base moiety of a nucleotide can be adenin-9-yl (A), cytosin-1-yl (C), guanin-9-yl (G), uracil-1-yl (U), or thymin-1-yl (T). The sugar moiety of a nucleotide is a ribose or a deoxyribose. The phosphate moiety of a nucleotide is pentavalent phosphate. A non-limiting example of a nucleotide would be 3′-AMP (3′-adenosine monophosphate) or 5′-GMP (5′-guanosine monophosphate).
b) Primers and Probes
It is understood that primers and probes can be produced for the actual gene (DNA) or expression product (mRNA) or intermediate expression products which are not fully processed into mRNA. Discussion of a particular gene is also a disclosure of the DNA, mRNA, and intermediate RNA products associated with that particular gene.
Disclosed are compositions including primers and probes, which are capable of interacting with the intrinsic genes disclosed herein, as well as any other genes or nucleic acids discussed herein. In certain embodiments the primers are used to support DNA amplification reactions. Typically the primers will be capable of being extended in a sequence specific manner. Extension of a primer in a sequence specific manner includes any methods wherein the sequence and/or composition of the nucleic acid molecule to which the primer is hybridized or otherwise associated directs or influences the composition or sequence of the product produced by the extension of the primer. Extension of the primer in a sequence specific manner therefore includes, but is not limited to, PCR, DNA sequencing, DNA extension, DNA polymerization, RNA transcription, or reverse transcription. Techniques and conditions that amplify the primer in a sequence specific manner are preferred. In certain embodiments the primers are used for the DNA amplification reactions, such as PCR or direct sequencing. It is understood that in certain embodiments the primers can also be extended using non-enzymatic techniques, where for example, the nucleotides or oligonucleotides used to extend the primer are modified such that they will chemically react to extend the primer in a sequence specific manner. Typically the disclosed primers hybridize with the disclosed genes or regions of the disclosed genes or they hybridize with the complement of the disclosed genes or complement of a region of the disclosed genes.
The size of the primers or probes for interaction with the disclosed genes in certain embodiments can be any size that supports the desired enzymatic manipulation of the primer, such as DNA amplification or the simple hybridization of the probe or primer. A typical disclosed primer or probe would be at least 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1250, 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3500, or 4000 nucleotides long.
In other embodiments the disclosed primers or probes can be less than or equal to 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1250, 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3500, or 4000 nucleotides long.
The primers for the disclosed genes in certain embodiments can be used to produce an amplified DNA product that contains the desired region of the disclosed genes. In general, typically the size of the product will be such that the size can be accurately determined to within 10, 5, 4, 3, or 2 or 1 nucleotides.
In certain embodiments this product is at least 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1250, 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3500, or 4000 nucleotides long.
In other embodiments the product is less than or equal to 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 125, 150, 175, 200, 225, 250, 275, 300, 325, 350, 375, 400, 425, 450, 475, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1250, 1500, 1750, 2000, 2250, 2500, 2750, 3000, 3500, or 4000 nucleotides long.
In certain embodiments the primers and probes are designed such that they are targeting a specific region in one of the genes disclosed herein. It is understood that primers and probes having an interaction with any region of any gene disclosed herein are contemplated. In other words, primers and probes of any size disclosed herein can be used to target any region specifically defined by the genes disclosed herein. Thus, primers and probes of any size can begin hybridizing with nucleotide 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or any specific nucleotide of the genes or gene expression products disclosed herein. Furthermore, it is understood that the primers and probes can be of a contiguous nature meaning that they have continuous base pairing with the target nucleic acid for which they are complementary. However, also disclosed are primers and probes which are not contiguous with their target complementary sequence. Disclosed are primers and probes which have at least 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, 150, 200, 500, or more bases which are not contiguous across the length of the primer or probe. Also disclosed are primers and probes which have less than or equal to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, 150, 200, 500, or more bases which are not contiguous across the length of the primer or probe.
In certain embodiments the primers or probes are designed such that they are able to hybridize specifically with a target nucleic acid. Specific hybridization refers to the ability to bind a particular nucleic acid or set of nucleic acids preferentially over other nucleic acids. The level of specific hybridization of a particular probe or primer with a target nucleic acid can be affected by salt conditions, buffer conditions, temperature, length of time of hybridization, wash conditions, and visualization conditions. Increasing the specificity of hybridization means decreasing the number of nucleic acids that a given primer or probe hybridizes to, typically under a given set of conditions. For example, at 20 degrees Celsius under a given set of conditions a given probe may hybridize with 10 nucleic acids in a sample. However, at 40 degrees Celsius with all other conditions being equal, the same probe may only hybridize with 2 nucleic acids in the same sample. This would be considered an increase in specificity of hybridization. A decrease in specificity of hybridization means an increase in the number of nucleic acids that a given primer or probe hybridizes to, typically under a given set of conditions. For example, at 700 mM NaCl under a given set of conditions a particular probe or primer may hybridize with 2 nucleic acids in a sample; however, when the salt concentration is increased to 1 Molar NaCl the primer or probe may hybridize with 6 nucleic acids in the same sample.
The salt can be any salt such as those made from the alkali metals: Lithium, Sodium, Potassium, Rubidium, Cesium, or Francium or the alkaline earth metals: Beryllium, Magnesium, Calcium, Strontium, Barium, or Radium, or the transition metals: Scandium, Titanium, Vanadium, Chromium, Manganese, Iron, Cobalt, Nickel, Copper, Zinc, Yttrium, Zirconium, Niobium, Molybdenum, Technetium, Ruthenium, Rhodium, Palladium, Silver, Cadmium, Hafnium, Tantalum, Tungsten, Rhenium, Osmium, Iridium, Platinum, Gold, Mercury, Rutherfordium, Dubnium, Seaborgium, Bohrium, Hassium, Meitnerium, Ununnilium, Unununium or Ununbium at any molar strength to promote the desired condition, such as 1, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05, or 0.02 molar salt. In general increasing salt concentration decreases the specificity of a given probe or primer for a given target nucleic acid and decreasing the salt concentration increases the specificity of a given probe or primer for a given target nucleic acid.
The buffer conditions can be any buffer such as TRIS at any pH, such as 5.0, 5.5, 6.0, 6.1, 6.2, 6.3, 6.4, 6.5, 6.6, 6.7, 6.8, 6.9, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8, 7.9, 8.0, 8.5, or 9.0. In general pHs above or below 7.0 increase the specificity of hybridization.
The temperature of hybridization can be any temperature. For example, hybridization can occur at 20°, 21°, 22°, 23°, 24°, 25°, 26°, 27°, 28°, 29°, 31°, 32°, 33°, 34°, 35°, 36°, 37°, 38°, 39°, 40°, 41°, 42°, 43°, 44°, 45°, 46°, 47°, 48°, 49°, 50°, 51°, 52°, 53°, 54°, 55°, 56°, 57°, 58°, 59°, 60°, 61°, 62°, 63°, 64°, 65°, 66°, 67°, 68°, 69°, 70°, 81°, 82°, 83°, 84°, 85°, 86°, 87°, 88°, 89°, 90°, 91°, 92°, 93°, 94°, 95°, 96°, 97°, 98°, or 99° Celsius.
The length of time of hybridization can be any time. For example, the length of time can be 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 120, 150, 180, 210, 240, 270, 300, or 360 minutes, or 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 30, 36, 48 or more hours.
It is understood that any wash conditions can be used including no wash step. Generally the wash conditions occur by a change in one or more of the other conditions designed to require more specific binding, by for example increasing temperature or decreasing the salt or changing the length of time of hybridization.
It is understood that there are a variety of visualization conditions which have different levels of detection capabilities. In general any type of visualization or detection system can be used. For example, radiolabeling or fluorescence labeling can be used and in general fluorescence labeling would be more sensitive, meaning that fewer molecules would have to be present to be detected.
c) Sequences
There are a variety of sequences related to the intrinsic genes as well as the others disclosed herein and others are herein incorporated by reference in their entireties as well as for individual subsequences contained therein. A specific intrinsic gene set can be found in Table 12.
4. Kits
Disclosed are kits comprising nucleic acids which can be used in the methods disclosed herein and, for example, buffers, salts, and other components to be used in the methods disclosed herein. Disclosed are kits for identifying minimal intrinsic gene sets comprising nucleic acids, such as in a microarray. Also disclosed are specific minimal intrinsic genes used for classifying cancer, such as those found in Table 21. As described above, these intrinsic genes can be used in any combination or permutation, and any combination or permutation of these genes can be used in a kit. Also disclosed are kits comprising instructions.
5. Chips and Micro Arrays
Disclosed are chips where at least one address is the sequences or part of the sequences set forth in any of the nucleic acid sequences disclosed herein.
Also disclosed are chips where at least one address is a variant of the sequences or part of the sequences set forth in any of the nucleic acid sequences disclosed herein.
6. Computer Readable Mediums
Those of skill in the art understand how to display and express any nucleic acid or protein sequence in any of the variety of ways that exist, each of which is considered herein disclosed. Specifically contemplated herein is the display of these sequences on computer readable mediums, such as commercially available floppy disks, tapes, chips, hard drives, compact disks, and video disks, or other computer readable mediums. Also disclosed are the binary code representations of the disclosed sequences. Those of skill in the art understand what computer readable mediums are. Thus, disclosed are computer readable mediums on which the nucleic acids or protein sequences are recorded, stored, or saved.
Disclosed are computer readable mediums comprising the sequences and information regarding the sequences set forth herein.
The compositions disclosed herein and the compositions necessary to perform the disclosed methods can be made using any method known to those of skill in the art for that particular reagent or compound unless otherwise specifically noted.
1. Nucleic acid synthesis
For example, the nucleic acids, such as the oligonucleotides to be used as primers, can be made using standard chemical synthesis methods or can be produced using enzymatic methods or any other known method. Such methods can range from standard enzymatic digestion followed by nucleotide fragment isolation (see for example, Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd Edition (Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., 1989) Chapters 5, 6) to purely synthetic methods, for example, by the cyanoethyl phosphoramidite method using a Milligen or Beckman System 1Plus DNA synthesizer (for example, Model 8700 automated synthesizer of Milligen-Biosearch, Burlington, Mass. or ABI Model 380B). Synthetic methods useful for making oligonucleotides are also described by Ikuta et al., Ann. Rev. Biochem. 53:323-356 (1984) (phosphotriester and phosphite-triester methods), and Narang et al., Methods Enzymol., 65:610-620 (1980) (phosphotriester method). Protein nucleic acid molecules can be made using known methods such as those described by Nielsen et al., Bioconjug. Chem. 5:3-7 (1994).
The disclosed compositions can be used in a variety of ways as research tools. The compositions can be used for example as targets in combinatorial chemistry protocols or other screening protocols to isolate molecules that possess desired functional properties related to the disclosed genes.
The disclosed compositions can also be used as diagnostic tools related to diseases, such as cancers, such as those listed herein.
The disclosed compositions can be used as discussed herein as either reagents in micro arrays or as reagents to probe or analyze existing microarrays. The disclosed compositions can be used in any known method for isolating or identifying single nucleotide polymorphisms. The compositions can also be used in any method for determining allelic analysis of for example, the genes disclosed herein. The compositions can also be used in any known method of screening assays, related to chip/micro arrays. The compositions can also be used in any known way of using the computer readable embodiments of the disclosed compositions, for example, to study relatedness or to perform molecular modeling analysis related to the disclosed compositions.
Throughout this application, various publications are referenced. The disclosures of these publications in their entireties are hereby incorporated by reference into this application in order to more fully describe the state of the art to which this pertains. The references disclosed are also individually and specifically incorporated by reference herein for the material contained in them that is discussed in the sentence in which the reference is relied upon.
The following examples are put forth so as to provide those of ordinary skill in the art with a complete disclosure and description of how the compounds, compositions, articles, devices and/or methods claimed herein are made and evaluated, and are intended to be purely exemplary and are not intended to limit the disclosure. Efforts have been made to ensure accuracy with respect to numbers (e.g., amounts, temperature, etc.), but some errors and deviations should be accounted for. Unless indicated otherwise, parts are parts by weight, temperature is in ° C. or is at ambient temperature, and pressure is at or near atmospheric.
a) Methods
Patient selection. An ethnically diverse cohort of patients was studied using samples collected from various locations throughout the United States. Tissues analyzed included 117 invasive breast cancers, 1 fibroadenoma, 5 “normal” samples (from reduction mammoplasty), and 3 cell lines. Patients were heterogeneously treated in accordance with the standard of care dictated by their disease stage, ER and HER2 status. Patients were censored for recurrence and/or death for up to 118 months (median 21.5 months). Clinical data are presented in supplementary Table 7.
Sample preparation and first strand synthesis for qRT-PCR. Nucleic acids were extracted from fresh frozen tissue using RNeasy Midi Kit (Qiagen Inc., Valencia, Calif.). The quality of RNA was assessed using the Agilent 2100 Bioanalyzer with the RNA 6000 Nano LabChip Kit (Agilent Technologies, Palo Alto, Calif.). All samples used had discernible 18S and 28S ribosomal peaks. First strand cDNA was synthesized from approximately 1.5 μg total RNA using 500 ng Oligo(dT)12-18 and Superscript III reverse transcriptase (1st Strand Kit, Invitrogen, Carlsbad, Calif.). The reaction was held at 42° C. for 50 min followed by a 15-min step at 70° C. The cDNA was washed on a QIAquick PCR purification column and stored at −80° C. in TE′ (25 mM Tris, 1 mM EDTA) at a concentration of 5 ng/μl (concentration estimated from the starting RNA concentration used in the reverse transcription).
Primer design. Genbank sequences were downloaded from Evidence viewer (NCBI website) into the Lightcycler Probe Design Software (Roche Applied Science, Indianapolis, Ind.). All primer sets were designed to have a Tm of approximately 60° C., a GC content of approximately 50%, and to generate a PCR amplicon <200 bps. Finally, BLAT and BLAST searches were performed on primer pair sequences using the UCSC Genome Bioinformatics (http://genome.ucsc.edu/) and NCBI (http://www.ncbi.nlm.nih.gov/BLAST/) to check for uniqueness. Primer sets and identifiers are provided in supplementary Table 8.
Real-time PCR. For PCR, each 20 μL reaction included 1×PCR buffer with 3 mM MgCl2 (Idaho Technology Inc., Salt Lake City, Utah), 0.2 mM each of dATP, dCTP, and dGTP, 0.1 mM dTTP, 0.3 mM dUTP (Roche, Indianapolis, Ind.), 10 ng cDNA and 1 U Platinum Taq (Invitrogen, Carlsbad, Calif.). The dsDNA dye SYBR Green I (Molecular Probes, Eugene, Oreg.) was used for all quantification (1/50000 final). PCR amplifications were performed on the Lightcycler (Roche, Indianapolis, Ind.) using an initial denaturation step (94° C., 90 sec) followed by 50 cycles of denaturation (94° C., 3 sec), annealing (58° C., 5 sec with 20° C./s transition), and extension (72° C., 6 sec with 2° C./s transition). Fluorescence (530 nm) from the dsDNA dye SYBR Green I was acquired each cycle after the extension step. Specificity of PCR was determined by post-amplification melting curve analysis. Reactions were automatically cooled to 60° C. at a rate of 3° C./s and slowly heated at 0.1° C./s to 95° C. while continuously monitoring fluorescence.
Relative quantification by RT-PCR. Quantification was performed using the LightCycler 4.0 software. The crossing threshold (Ct) for each reaction was determined using the 2nd derivative maximum method (Wittwer et al. (2004) Washington, D.C.: ASM Press; Rasmussen (2001) Heidelberg: Springer Verlag. 21-34). Relative copy number was calculated using an external calibration curve to correct for PCR efficiency and a within-run calibrator to correct for the variability between runs. The calibrator is made from 4 equal parts of RNA from 3 cell lines (MCF7, SKBR3, ME16C) and Universal Human Reference RNA (Stratagene, La Jolla, Calif., Cat #740000). Differences in cDNA input were corrected by dividing target copy number by the arithmetic mean of the copy number for 3 housekeeper genes (MRPL19, PSMC4, and PUM1) (Szabo et al. (2004) Genome Biol 5:R59). The normalized relative gene copy number was log 2 transformed and analyzed by hierarchical clustering using Cluster (Eisen et al. (1998) Proc Natl Acad Sci USA 95:14863-14868). The clustering was visualized using Treeview software (Eisen Lab, http://rana.lbl.gov/EisenSoftware.htm).
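By way of illustration only, the housekeeper normalization and log 2 transformation described above can be sketched as follows. The function name and the copy numbers are hypothetical and are not part of the LightCycler software; the sketch assumes that efficiency-corrected, calibrator-adjusted copy numbers are already available for the target and for the 3 housekeeper genes.

```python
import math
from statistics import mean


def normalized_log2_copy_number(target_copies, housekeeper_copies):
    """Divide the target copy number by the arithmetic mean of the
    housekeeper copy numbers, then log2 transform the ratio."""
    housekeeper_copies = list(housekeeper_copies)
    if target_copies <= 0 or any(c <= 0 for c in housekeeper_copies):
        raise ValueError("copy numbers must be positive")
    return math.log2(target_copies / mean(housekeeper_copies))


# Hypothetical copy numbers for one sample (not measured data).
housekeepers = {"MRPL19": 900.0, "PSMC4": 1000.0, "PUM1": 1100.0}
value = normalized_log2_copy_number(4000.0, housekeepers.values())
print(value)
```

In this example the housekeeper mean is 1000 copies, so a target at 4000 copies is reported as a four-fold relative level, i.e. a log 2 value of 2.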
Microarray experiments. The same 126 samples used for qRT-PCR were analyzed by microarray (Agilent Human oligonucleotide). Total RNA was prepared and quality checked as described above. Labeling and hybridization of RNA for microarray was done using the Agilent low RNA input linear amplification kit (http://www.chem.agilent.com/Scripts/PDS.asp?lPage=10003), but with one-half the recommended reagent volumes and using a Qiagen PCR purification kit to clean up the cRNA. Each sample was assayed versus a common reference sample that was a mixture of Stratagene's Human Universal Reference total RNA (100 μg) enriched with equal amounts of RNA (0.3 μg each) from MCF7 and ME16C cell lines. Microarray hybridizations were carried out on Agilent Human oligonucleotide microarrays (1A-v1, 1A-v2 and custom designed 1A-v1 based microarrays) using 2 μg each of Cy3-labeled “reference” and Cy5-labeled “experimental” sample. Hybridizations were done using the Agilent hybridization kit and a Robbins Scientific “22k chamber” hybridization oven. The arrays were incubated overnight and then washed once in 2×SSC and 0.0005% triton X-102 (10 min), twice in 0.1×SSC (5 min) and then immersed into Agilent Stabilization and Drying solution for 20 seconds. All microarrays were scanned using an Axon Scanner 4000A. The image files were analyzed with GenePix Pro 4.1 and loaded into the UNC Microarray Database at the University of North Carolina at Chapel Hill (https://genome.unc.edu/) where a lowess normalization procedure was performed to adjust the Cy3 and Cy5 channels (Yang et al. (2002) Nucleic Acids Res 30:e15). All primary microarray data associated with this study are available at the UNC Microarray Database and have been deposited into the GEO (http://www.ncbi.nlm.nih.gov/geo/) under the accession number of GSE1992, series GSM34424-GSM34568.
Selecting genes for real-time qRT-PCR. A new “intrinsic” gene set for classifying breast tumors was derived using 45 before and after therapy samples from the combined data sets presented in Sorlie et al. (see Table 9 for the list of 45 pairs). The two-color DNA microarray data sets were downloaded from the internet and the R/G ratio (experimental/reference) for each spot was normalized and log 2 transformed. Missing values were imputed using the k-NN imputation algorithm described by Troyanskaya et al. (Troyanskaya et al. (2001) Bioinformatics 17:520-525). The “intrinsic” analysis identified 550 gene elements.
Next, a completely independent data set was utilized (van't Veer et al. 2002) to derive an optimized version of the 550 intrinsic gene list. To allow across data set analyses, gene annotation from each dataset was translated to UniGene Cluster IDs (UCID) using the SOURCE database (Diehn et al. (2003) Nucleic Acids Res 31:219-223). Following the algorithm outlined by Tibshirani and colleagues (Bair et al. (2004) PLoS Biol 2:E108; Bullinger et al. (2004) N Engl J Med 350:1605-1616), the 97 samples from the van't Veer et al. 2002 study were hierarchically clustered using a common set of 350 genes, and an “intrinsic” subtype of either Luminal, HER2+/ER−, Basal-like, or Normal-like was assigned to each sample. A feature/gene selection was then performed to identify genes that optimally distinguished these 4 classes using a version of the gene selection method first described by Dudoit et al. (Genome Biol 3:RESEARCH0036), where the best class distinguishers are identified according to the ratio of between-group to within-group sums of squares (a type of ANOVA). In addition to the statistically selected “intrinsic” classifiers, proliferation genes (e.g., TOP2A, KI-67, PCNA) and other important prognostic markers (e.g., PgR) that have potential for diagnostics were also chosen. In total, 53 differentially expressed biomarkers were used in the real-time qRT-PCR assay (Table 8).
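For illustration, the ratio of between-group to within-group sums of squares used to rank candidate class distinguishers can be sketched as follows. This is a minimal sketch of the general statistic only; the function name, gene names, and expression values are hypothetical, and the exact variant of the selection method used in the study is not reproduced here.

```python
from statistics import mean


def bss_wss_ratio(values, labels):
    """Ratio of between-group to within-group sums of squares for one gene.

    values: expression values, one per sample.
    labels: class label (e.g. intrinsic subtype) for each sample.
    """
    overall = mean(values)
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    # Between-group: each sample contributes (class mean - overall mean)^2.
    bss = sum(len(g) * (mean(g) - overall) ** 2 for g in groups.values())
    # Within-group: each sample contributes (value - its class mean)^2.
    wss = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)
    return float("inf") if wss == 0 else bss / wss


# Hypothetical data: 3 samples in each of two classes.
labels = ["Luminal"] * 3 + ["Basal-like"] * 3
genes = {
    "gene_a": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],  # separates the classes well
    "gene_b": [1.0, 8.0, 3.0, 7.0, 2.0, 9.0],  # separates the classes poorly
}
ranked = sorted(genes, key=lambda g: bss_wss_ratio(genes[g], labels), reverse=True)
print(ranked)
```

Genes with a larger ratio vary more between classes than within them, so they sort to the top of the ranked list.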
Combining microarray and qRT-PCR datasets. Distance Weighted Discrimination (DWD) was used to identify and correct systematic biases across the microarray and qRT-PCR datasets (Benito et al. (2004) Bioinformatics 20:105-114). Prior to DWD, each dataset was normalized by setting the mean to zero and the variance to one. Normalization was done within each microarray experiment and for genes profiled across many experimental runs for real-time qRT-PCR. After DWD, genes in common between the datasets were clustered using Spearman correlation and average linkage association.
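The pre-DWD normalization and the Spearman correlation used for clustering can be sketched as follows. The sketch does not implement DWD itself. It assumes the population standard deviation is used when setting the variance to one, which the text does not specify, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Shift to mean zero and scale to unit (population) variance."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("cannot standardize constant values")
    return [(v - m) / s for v in values]


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


z = standardize([1.0, 2.0, 3.0, 4.0])
rho = spearman([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 400.0])
print(z, rho)
```

Because the Spearman correlation depends only on ranks, it is unchanged by the standardization step and is insensitive to monotone differences in scale between platforms.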
Receiver operator curves. In order to determine agreement between protein expression (immunohistochemistry) and gene expression (qRT-PCR), a cut-off for relative gene copy number was selected by minimizing the sum of the observed false positive and false negative errors. That is, minimizing the estimated overall error rate under equal priors for the presence/absence of the protein. The sensitivity and specificity of the resulting classification rule were estimated via bootstrap adjustment for optimism (Efron et al. (1998) CRC Press LLC. p 247 pp).
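The cut-off selection described above can be illustrated with a minimal sketch that scans candidate cut-offs and keeps the one with the fewest false positives plus false negatives. The function names and data are hypothetical, samples at or above the cut-off are assumed to be called positive, and the bootstrap adjustment for optimism is not implemented.

```python
def best_cutoff(expression, ihc_positive):
    """Return (cutoff, errors) minimizing false positives + false negatives.

    A sample is called positive when its expression is >= cutoff.
    Ties are resolved in favor of the lowest cutoff.
    """
    candidates = sorted(set(expression)) + [float("inf")]
    best = None
    for c in candidates:
        fp = sum(1 for e, p in zip(expression, ihc_positive) if e >= c and not p)
        fn = sum(1 for e, p in zip(expression, ihc_positive) if e < c and p)
        if best is None or fp + fn < best[1]:
            best = (c, fp + fn)
    return best


def sensitivity_specificity(expression, ihc_positive, cutoff):
    """Apparent (unadjusted) sensitivity and specificity at a cutoff."""
    tp = sum(1 for e, p in zip(expression, ihc_positive) if e >= cutoff and p)
    tn = sum(1 for e, p in zip(expression, ihc_positive) if e < cutoff and not p)
    positives = sum(1 for p in ihc_positive if p)
    negatives = len(ihc_positive) - positives
    return tp / positives, tn / negatives


# Hypothetical relative copy numbers and IHC calls (not study data).
expression = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
ihc_positive = [False, False, True, False, True, True]
cutoff, errors = best_cutoff(expression, ihc_positive)
sens, spec = sensitivity_specificity(expression, ihc_positive, cutoff)
print(cutoff, errors, sens, spec)
```

Estimates obtained on the same samples used to choose the cut-off are optimistic, which is why the study applies a bootstrap adjustment before reporting sensitivity and specificity.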
Survival analyses. Survival curves were estimated by the Kaplan-Meier method and compared via a log-rank or stratified log-rank test as appropriate. Standard clinical pathological parameters of age (in years), node status (positive vs. negative), tumor size (cm, as a continuous variable), grade (1-3, as a continuous covariate), and ER status (positive vs. negative) were tested for differences in RFS and OS using a Cox proportional hazards regression model. Pairwise log-rank tests were used to test for equality of the hazard functions among the intrinsic classes. Only the Luminal, HER2+/ER−, and Basal-like classes were included in the analyses because it was believed the Normal Breast-like subtype is not a pure tumor class and may result from normal breast contamination. Cox regression was used to determine predictors of survival from continuous expression data. All statistical analyses were performed using the R statistical software package (R Foundation for Statistical Computing).
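As an illustration of the Kaplan-Meier method named above, the product-limit estimate can be sketched in a few lines. The study itself used the R statistical software; the Python function below is a hypothetical stand-in with made-up follow-up times, and it does not implement the log-rank test or Cox regression.

```python
def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with an event.

    times:  follow-up time for each patient (e.g. months).
    events: True if the event (relapse or death) was observed at that
            time, False if the patient was censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_removed = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_removed
    return curve


# Hypothetical follow-up data for five patients.
times = [6, 12, 12, 20, 30]
events = [True, True, False, True, False]
curve = kaplan_meier(times, events)
print(curve)
```

Censored patients contribute to the number at risk up to their last follow-up but never lower the curve themselves, which is what distinguishes this estimate from a simple fraction of patients surviving.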
b) Results
Recapitulating microarray breast cancer classifications by qRT-PCR. 126 different breast tissue samples (117 invasive, 5 normal, 1 fibroadenoma, and 3 cell lines) were expression profiled using a real-time qRT-PCR assay comprised of 53 biological classifiers and 3 control/housekeeper genes. Genes were statistically selected to optimally identify the 4 main breast tumor intrinsic subtypes, and to create an objective gene expression predictor for cell proliferation and outcome (Ross et al. (2000) Nat Genet 24:227-235).
There were 402 genes in common between this microarray dataset and the 550 “intrinsic” genes selected from the Sorlie et al. 2003 study. Two-way hierarchical clustering of the 402 genes in the microarray gave the same tumor subtypes as the minimal 37 “intrinsic” genes assayed by qRT-PCR (
Proliferation and grade. Expression of the 14 “proliferation” genes (
Agreement between immunohistochemistry, qRT-PCR “intrinsic” classifications, and gene expression. Fifty out of fifty-five (91%) Luminal tumors with IHC data were scored positive for ER. Conversely, 50 out of 56 (89%) tumors classified as HER2+/ER− or Basal-like were negative for ER by IHC. Cluster analysis showed that the Luminal tumors co-express ER and estrogen responsive genes such as LIV1/SLC39A6, X-box binding protein 1 (XBP1), and hepatocyte nuclear factor 3a (HNF3A/FOXA1). The gene with the highest correlation in expression to ESR1 was GATA3 (0.79, 95% CI: 0.71-0.85). It was found that the gene expression of ESR1 alone had 88% sensitivity and 85% specificity for calling ER status by IHC, and GATA3 alone showed 79% sensitivity and 88% specificity (
Reproducibility of qRT-PCR. The run-to-run variation in Cp (cycle number determined from fluorescence crossing point) for all 56 genes (53 classifiers and 3 housekeepers) was determined from 8 runs. The median CV (standard deviation/mean) for all the genes was 1.15% (0.28%-6.55%) and 51/56 genes (91%) had a CV≤2%. The reproducibility of the classification method is illustrated by the observation that replicates of the same sample (UB57A&B and UB60A&B) cluster directly adjacent to one another. Notably, the replicates were from separate RNA/cDNA preparations done on different pieces of the same tumor.
Survival Predictors. The clinical significance of individual markers and “intrinsic” subtypes was analyzed using qRT-PCR data. Patients with Luminal tumors showed significantly better outcomes for relapse-free survival (RFS) and overall survival (OS) compared to HER2+/ER− (RFS: p=0.023; OS: p=0.003) and Basal-like (RFS: p=0.065; OS: p=0.002) tumors (
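The run-to-run coefficient of variation can be computed as sketched below. The sample standard deviation is assumed here because the text does not state which estimator was used, and the gene names and Cp values are hypothetical.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(cp_values):
    """CV in percent: sample standard deviation divided by the mean."""
    return 100.0 * stdev(cp_values) / mean(cp_values)


# Hypothetical Cp values for two genes across repeated runs.
cp_runs = {
    "gene_1": [24.0, 26.0],
    "gene_2": [20.0, 20.0, 20.0],
}
cvs = {gene: coefficient_of_variation(cp) for gene, cp in cp_runs.items()}
median_cv = median(cvs.values())
n_within_2_percent = sum(1 for cv in cvs.values() if cv <= 2.0)
print(cvs, median_cv, n_within_2_percent)
```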
Using a Cox proportional hazards model to find biomarkers from the qRT-PCR data that predict survival, it was found that high expression of the proliferation genes GTBP4 (p=0.011), HSPA14 (p=0.023), and STK6 (p=0.027) were significant predictors of RFS independent of grade and stage (
Co-clustering qRT-PCR and Microarray Data. In order to determine if qRT-PCR and microarray data could be analyzed together in a single dataset, DWD was used to combine data for 50 genes and 126 samples profiled on both platforms (252 samples total). Hierarchical clustering of these data shows that 98% (124/126) of the paired samples classified in the same group and 83/126 (66%) clustered directly adjacent to their corresponding partner (
c) Discussion
Gene expression analyses can identify differences in breast cancer biology that are important for prognosis. However, a major challenge in using genomics for diagnostics is finding biomarkers that can be reproducibly measured across different platforms and that provide clinically significant classifications on different patient populations. Using microarray data, 402 “intrinsic” genes were identified that classify breast cancers based on vastly different expression patterns. This “intrinsic” gene set was shown to provide the same classifications when applied to a completely new and ethnically diverse population. Furthermore, the microarray dataset can be minimized to 37 “intrinsic” genes, translated into a real-time qRT-PCR assay, and provide the same classifications as the larger gene set. Molecular classifications using the “intrinsic” qRT-PCR assay agree with standard pathology and are clinically significant for prognosis. Thus, biological classifications based on “intrinsic” genes are robust, reproducible across different platforms, and can be used for breast cancer diagnostics.
The greatest contribution genomic assays have made towards clinical diagnostics in breast cancer has been in identifying risk of recurrence in women with early stage disease. For instance, MammaPrint™ is a microarray assay based on the 70 gene prognosis signature originally identified by van't Veer et al. On the test set validation, the 70 gene assay found that individuals with a poor prognostic signature had approximately a 50% chance of remaining free of distant metastasis at 10 years while those with a good-prognostic signature had an 85% chance of remaining free of disease. Another assay with similar utility is Oncotype Dx (Genomic Health Inc), a real-time qRT-PCR assay that uses 16 classifiers to assess if patients with ER positive tumors are at low, intermediate, or high risk for relapse. While recurrence can be predicted for high and low risk tumors, patients in the intermediate risk group still have variable outcomes and need to be diagnosed more accurately.
In general tumors that have a low risk of early recurrence are low grade and have low expression of proliferation genes. Due to the correlation of proliferation genes with grade and their significance in predicting outcome, a group of 14 proliferation genes was assayed. While the classic proliferation markers TOP2A and MKI67 significantly correlated with grade in the cohort, they were not near the top of the list. Furthermore, PCNA did not significantly correlate with grade (p=0.11) in the cohort. This could result from PCR primer design or differences between RNA and protein stability. Nevertheless, the proliferation gene that was found to have the highest correlation to grade was CENPF (mitosin), another commonly used mitotic marker that has been shown to correlate with grade and outcome in breast cancer (Clark et al. (1997) Cancer Res 57:5505-5508). Since tumor grade and the mitotic index have been shown to be important in predicting risk of relapse (Chia et al. (2004) J Clin Oncol 22:1630-1637; Manders et al. (2003) Breast Cancer Res Treat 77:77-84), it is not surprising that 4 (GTBP4, HSPA14, STK6/15, BUB1) out of the top 5 predictors for RFS (independent of stage) were proliferation genes. The proliferation gene that was the best predictor of RFS was GTBP4, a GTP-binding protein implicated in chronic renal disease and shown to be upregulated after serum administration (i.e., serum response gene) (Laping et al. (2001) J Am Soc Nephrol 12:883-890). Overall, the best predictor for both RFS (p=0.004) and OS (p=0.004) independent of grade and stage was SMA3. The role of SMA3 in the pathogenesis of breast cancer is still unclear, although it has also been associated with the BCL2 anti-apoptotic pathway (Iwahashi et al. (1997) Nature 390:413-417).
A training set of 105 tumors was used to derive a new breast tumor “intrinsic” gene list, which was validated using a combined test set of 315 tumors compiled from three independent microarray studies. An unchanging Single Sample Predictor was also used, and applied to three additional test sets. The Intrinsic/UNC gene set identified a number of findings not seen in previous analyses including 1) significance in multivariate testing, 2) that the proliferation signature is an intrinsic property of tumors, 3) the high expression of many Kallikrein genes in Basal-like tumors, and 4) the expression of the Androgen Receptor within the HER2+/ER− and Luminal tumor subtypes. The Single Sample Predictor, which was based upon subtype average profiles, was able to identify groups of patients within a test set of local therapy only patients, and two independent tamoxifen-treated patient sets, which showed significant differences in outcomes. The analyses demonstrate that the “intrinsic” subtypes add value to the existing repertoire of clinical markers used for breast cancer patients. The computational approach also provides a means for quickly validating gene expression profiles using publicly available data.
Breast cancers represent a spectrum of diseases comprised of different tumor subtypes, each with a distinct biology and clinical behavior. Despite this heterogeneity, global analyses of primary breast tumors using microarrays have identified gene expression signatures that characterize many of the essential qualities important for biological and clinical classification. Using cDNA microarrays, five distinct subtypes of breast tumors arising from at least two distinct cell types (basal-like and luminal epithelial cells) were previously identified (Perou et al. 2000; Sorlie et al. 2001; Sorlie et al. 2003). This molecular taxonomy was based upon an “intrinsic” gene set, which was identified using a supervised analysis to select genes that showed little variance within repeated samplings of the same tumor, but which showed high variance across tumors (Perou et al. 2000). An intrinsic gene set reflects the stable biological properties of tumors and typically identifies distinct tumor subtypes that have prognostic significance, even though no knowledge of outcome was used to derive this gene set.
The 315 breast tumor samples were compiled from publicly available microarray data generated on different microarray platforms. These analyses show, for the first time, that the breast tumor intrinsic subtypes are significant predictors of outcome when correcting for standard clinical parameters, and that common patterns of expression and outcome predictions can be identified when comparing data sets generated by independent labs.
a) Methods
Tissue samples, RNA preparations and microarray protocols. 105 fresh frozen breast tumor samples and 9 normal breast tissue samples were used as the training set and were obtained from 4 different sources using IRB approved protocols from each participating institution: the University of North Carolina at Chapel Hill, The University of Utah, Thomas Jefferson University and the University of Chicago. Thus, this sample set represents an ethnically diverse cohort from different geographic regions in the US with the clinical and microarray data for samples provided in Table 11. Patients were heterogeneously treated in accordance with the standard of care dictated by their disease stage, ER and HER2 status. The 105 patient training data set had a median follow up of 19.5 months, while the 315 sample combined test set had a median follow up of 74.5 months. Finally, another 16 tamoxifen-treated patient tumor samples were included that were used for the Single Sample Predictor additional test set analysis (tamoxifen-treated set #2).
Total RNA was purified from each sample using the Qiagen RNeasy Kit according to the manufacturer's protocol (Qiagen, Valencia Calif.) and using 10-50 milligrams of tissue per sample. The integrity of the RNA was determined using the RNA 6000 Nano LabChip Kit and an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, Calif.). The total RNA labeling and hybridization protocol used is described in the Agilent low RNA input linear amplification kit (http://www.chem.agilent.com/Scripts/PDS.asp?lPage=110003) with the following modifications: 1) a Qiagen PCR purification kit was used to clean up the cRNA and 2) all reagent volumes were cut in half. Each sample was assayed versus a common reference sample that was a mixture of Stratagene's Human Universal Reference total RNA (Novoradovskaya et al. 2004) (100 μg) enriched with equal amounts of RNA (0.3 μg each) from MCF7 and ME16C cell lines. Microarray hybridizations were carried out on Agilent Human oligonucleotide microarrays (1A-v1, 1A-v2 and custom designed 1A-v1 based microarrays) using 2 μg of Cy3-labeled Reference and 2 μg of Cy5-labeled experimental sample. Hybridizations were done using the Agilent hybridization kit and a Robbins Scientific “22 k chamber” hybridization oven. The arrays were incubated overnight and then washed once in 2×SSC and 0.0005% triton X-102 (10 min), twice in 0.1×SSC (5 min), and then immersed into Agilent Stabilization and Drying solution for 20 seconds. All microarrays were scanned using an Axon Scanner GenePix 4000B. The image files were analyzed with GenePix Pro 4.1 and loaded into the UNC Microarray Database at the University of North Carolina at Chapel Hill (https://genome.unc.edu/) where a Lowess normalization procedure was performed to adjust the Cy3 and Cy5 channels (Yang et al. 2002).
All primary microarray data associated with this study are available at https://genome.unc.edu/pubsup/breastTumor/ and have been deposited into the GEO (http://www.ncbi.nlm.nih.gov/geo/) under the accession number of GSE1992, series GSM34424-GSM34568.
Intrinsic gene set analysis. A new breast tumor intrinsic gene set, called the “Intrinsic/UNC” list, was derived using 105 patients (146 total arrays) and 15 repeated tumor samples that were different physical pieces (and RNA preparations) of the same tumor, 9 tumor-metastasis pairs and 2 normal sample pairs (26 paired samples in total, Table 11). This sample size was chosen based upon Basal-like, Luminal A, Luminal B, HER2+/ER−, and Normal-like samples, which occur at a frequency of 15%, 40%, 15%, 20%, and 10%, respectively. It was estimated that most clinically relevant classes would constitute at least 10% of the affected population, and it was hoped to acquire at least 10 samples from each class in the new data set. Therefore, a sample size of 100 tumors was deemed adequate to identify most classes that might be present in breast cancer patients.
The background subtracted, Lowess normalized log 2 ratios of Cy5 over Cy3 intensity values were first filtered to select genes that had a signal intensity of at least 30 units above background in both the Cy5 and Cy3 channels. Only genes that met these criteria in at least 70% of the 146 microarrays were included for subsequent analysis. Next, an “intrinsic” analysis was performed as described in Sorlie et al. 2003 using the 26 paired samples and 86 additional microarrays. An intrinsic analysis identifies genes that have low variability in expression within paired samples and high variability in expression across different tumors; for an intrinsic analysis, each gene receives a score based on the average “within-pair variance” (the average square before/after difference) and the “between-subject variance” (the variance of the pair averages across subjects). The ratio D=(within-pair variance)/(between-subject variance) was then computed, and those genes with a small value of D (i.e., below a cut-off) were declared to be “intrinsic”. The cut-off value of D was set at one standard deviation below the mean intrinsic score of all genes. This analysis resulted in the selection of 1410 microarray elements representing 1300 genes. In order to obtain an estimate of the number of false-positive intrinsic genes, the sample labels were permuted to generate 26 random pairs and 86 non-paired samples. This permutation was performed 100 times and the intrinsic scores were calculated for each. These permuted scores were used to determine a threshold on the intrinsic score corresponding to a false discovery rate less than 1%. The selected threshold resulted in 1410 microarray features being called significant with a FDR=0.3% and the 90th percentile FDR=0.5%. (See Tusher et al. 2001 for a complete description of this calculation.)
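The intrinsic score D and the cut-off rule described above can be sketched as follows. This is a simplified illustration under stated assumptions: only paired samples enter the between-subject variance (the additional non-paired arrays are omitted), the sample variance and sample standard deviation are used, the permutation-based FDR estimate is not implemented, and all names and values are hypothetical.

```python
from statistics import mean, stdev, variance


def intrinsic_score(pairs):
    """D = (within-pair variance) / (between-subject variance) for one gene.

    pairs: list of (first, second) log 2 ratios, one tuple per paired sample.
    """
    within = mean([(a - b) ** 2 for a, b in pairs])
    between = variance([(a + b) / 2 for a, b in pairs])
    return within / between


def select_intrinsic(scores):
    """Keep genes whose D is more than one standard deviation below the mean."""
    cutoff = mean(scores.values()) - stdev(scores.values())
    return sorted(gene for gene, d in scores.items() if d < cutoff)


# Hypothetical gene: pairs agree closely, subjects differ widely (small D).
stable_pairs = [(1.0, 1.2), (3.0, 2.8), (-1.0, -1.2)]
# Hypothetical gene: pairs disagree as much as subjects differ (large D).
noisy_pairs = [(1.0, -1.0), (0.5, 2.5), (-2.0, 0.0)]
d_stable = intrinsic_score(stable_pairs)
d_noisy = intrinsic_score(noisy_pairs)

# Hypothetical precomputed scores for five genes.
scores = {"g1": 0.01, "g2": 1.0, "g3": 1.1, "g4": 0.9, "g5": 1.2}
selected = select_intrinsic(scores)
print(d_stable, d_noisy, selected)
```

A small D means that repeated samplings of the same tumor agree far better than different tumors do, which is the property the intrinsic gene list is meant to capture.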
These 1410 microarray elements were then used to perform a two-way average linkage hierarchical cluster analysis using a centered Pearson correlation metric and the program “Cluster” (Eisen et al. 1998), with the data being displayed relative to the median expression for each gene (i.e. median centering of the rows/genes). The cluster results were then visualized using “Treeview”.
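The row centering and similarity measure used for this clustering can be illustrated briefly. This is a sketch only: the function names are hypothetical, the average linkage tree building performed by “Cluster” is not reproduced, and converting the correlation into a dissimilarity as one minus the correlation is an assumption.

```python
from statistics import mean, median

def median_center(row):
    """Express one gene (row) relative to its median across samples."""
    m = median(row)
    return [v - m for v in row]

def centered_pearson(x, y):
    """Centered Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def correlation_distance(x, y):
    """Assumed dissimilarity: 1 - correlation (0 means identical shape)."""
    return 1 - centered_pearson(x, y)
```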
Combined test set analysis. The two-color DNA microarray data sets of Sorlie et al. (2001 and 2003), van't Veer et al., and Sotiriou et al. (Sotiriou et al. 2003) were each downloaded from the internet and pre-processed similarly. Briefly, pre-processing included log 2 transformation of the R/G ratio and then Lowess normalization of the data set (Yang et al. 200). Next, missing values were imputed using the k-NN imputation algorithm described by Troyanskaya et al. (Troyanskaya et al. 2001). Gene annotation from each dataset was translated to UniGene Cluster IDs (UCID) using the SOURCE database (Diehn et al. 2003), which gave a common gene set of approximately 2800 genes that were present across all four data sets. UniGene was chosen because a majority of the identifiers from each dataset could be easily mapped to a UniGene identifier (Build 161). Multiple occurrences of a UCID were collapsed by taking the median value for that ID within each experiment and platform. Next, Distance Weighted Discrimination was performed in a pair-wise fashion by first combining the Sorlie et al. data set with the Sotiriou et al. data set, and then combining this with the van't Veer et al. data to make a single data set. In the final step of pre-processing, each individual experiment (microarray) was normalized by setting its mean to zero and its variance to one. The data for 306 of the 1300 Intrinsic/UNC genes were present in the combined test set and were used in a two-way average linkage hierarchical cluster analysis across the set of 315 microarrays as described above.
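Two of the pre-processing steps above, collapsing repeated UCIDs by their median and standardizing each microarray, lend themselves to a short sketch. The function names and the identifiers in the usage note are hypothetical, and the use of the population standard deviation (divisor n) is an assumption; imputation and Distance Weighted Discrimination are not shown.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

def collapse_by_ucid(rows):
    """Collapse repeated identifiers to one row by taking per-array medians.

    rows: list of (ucid, [value for each array]) tuples.
    Returns {ucid: [median value for each array]}.
    """
    grouped = defaultdict(list)
    for ucid, values in rows:
        grouped[ucid].append(values)
    # zip(*probes) walks the arrays one at a time across the duplicate rows.
    return {u: [median(col) for col in zip(*probes)] for u, probes in grouped.items()}

def standardize_array(values):
    """Scale the values of one microarray to mean 0 and variance 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]
```

For example, two rows sharing the placeholder identifier "Hs.1" with values [1.0, 4.0] and [3.0, 2.0] collapse to the single row [2.0, 3.0].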
Single Sample Predictor. The Single Sample Predictor/SSP is based upon the Nearest Centroid method presented in (Hastie et al. 2001). More specifically, the combined test set was utilized, and 306 Intrinsic/UNC gene set hierarchical cluster presented in
Three additional test data sets were then analyzed. First, the 60-sample data set of Ma et al. (Ma et al. 2004), which is an already pre-processed data set of log 2 transformed ratios (GEO GSE1379), was taken, and a DWD correction was performed using the 278 genes that were in common between the Ma et al. data set and the set of 306 Intrinsic/UNC genes used in the SSP. The SSP was applied to the 60 Ma et al. samples and, using Spearman correlation, each of the 60 samples was assigned to an intrinsic subtype based upon the highest correlation value to a centroid. Next, 220 samples from Chang et al. (Chang et al. 2005) and 16 additional samples from UNC that were not used in the training set were analyzed. The 220 samples represent an extension of the sample set presented in van't Veer et al. (van't Veer et al. 2002), and the combination of these two constitutes the data used in van de Vijver et al. (van de Vijver et al. 2002). Each sample was column-standardized, and DWD was then performed to combine the 249 SSP samples (306 intrinsic genes) with the 220 samples from Chang et al. and the 16 additional UNC test set samples. Next, each sample's correlation to each centroid was calculated using a Spearman correlation, and each sample was assigned to the centroid to which it was closest. The test set was then split into a local-only therapy test set and a tamoxifen-treated test set. Finally, the SSP was applied to the 105-sample original training set after DWD normalization.
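The assignment step of the Single Sample Predictor, in which a sample is given the subtype of the centroid with the highest Spearman correlation, can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the centroids are taken as already computed, and the DWD adjustment that precedes classification is not shown.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def assign_subtype(sample, centroids):
    """Return the label of the centroid most correlated with the sample.

    sample: expression values over the classifier genes.
    centroids: {subtype label: centroid values over the same genes}.
    """
    return max(centroids, key=lambda name: spearman(sample, centroids[name]))
```

Because the centroids are fixed once they have been derived, classifying a new sample does not change the assignment of any earlier sample, which is the property that motivates a single sample predictor.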
Survival analyses. Univariate Kaplan-Meier analysis using a log-rank test was performed using WinSTAT for Excel (R. Fitch Software). Standard clinical pathological parameters of age (in decades), node status (positive vs. negative), tumor size (categorical variable of T1-T4), grade (I vs. II and I vs. III), and ER status (positive vs. negative) were tested for differences in RFS, OS and DSS using a proportional hazards regression model. The likelihood ratio test was used to test for equality of the hazard functions among the intrinsic classes after adjusting for the covariates listed above. For the intrinsic subtype analyses, the coding was such that LumA was the reference group to which the other classes were compared. SAS (SAS Institute Inc., SAS/STAT User's Guide, Version 8, 1999, Cary, N.C.) was used for proportional hazards modeling.
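For readers unfamiliar with the Kaplan-Meier method named above, the estimator itself is compact. The sketch below is a generic illustration and not the WinSTAT or SAS implementation; the function name is hypothetical, and neither the log-rank test nor the proportional hazards model is shown.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times: follow-up time for each subject.
    events: 1 if the event (e.g. relapse) occurred, 0 if censored.
    """
    at_risk = len(times)
    survival, curve = 1.0, []
    for t in sorted(set(times)):
        deaths = sum(1 for tt, e in zip(times, events) if tt == t and e)
        leaving = sum(1 for tt in times if tt == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Subjects with an event and censored subjects both leave the risk set.
        at_risk -= leaving
    return curve
```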
Immunohistochemistry. Five micron sections from formalin-fixed, paraffin-embedded tumors were cut and mounted onto Probe On Plus slides (Fisher Scientific). Following deparaffinization in xylene, slides were rehydrated through a graded series of alcohol and placed in running water. Endogenous peroxidase activity was blocked with 3% hydrogen peroxide and methanol. Samples were steamed for antigen retrieval with 10 mM citrate buffer (pH 6.0) for 30 min. Following protein block, slides were incubated with biotinylated antibody for the Androgen Receptor (Zymed, 08-1292) and incubated with streptavidin conjugated HRP using Vectastain ABC kit protocol (Vector Laboratories). 3,3′-diaminobenzidine tetrahydrochloride (DAB) chromogen (the substrate) was used for the visualization of the antibody/enzyme complex. Slides were counterstained with hematoxylin (Biomeda-M10) and examined by light microscopy.
b) Results
Overview. The goals were to create a new breast tumor intrinsic list and validate this list using multiple test data sets so that new biology could be identified, and the clinical significance of “intrinsic” classifications shown. A new intrinsic list was created using paired samples that were similarly treated (note that these were different “intrinsic” pairs than previously used since they were not before and after therapy pairs). In deriving the “new” list, microarrays containing many thousands more genes than were used before were employed. A diagram representing the flow of data sets used here, and the different analysis methods, is presented in
Identification of the Intrinsic/UNC gene set. A new breast tumor intrinsic gene set was created, called the “Intrinsic/UNC” list, using 26 paired samples comprised of 15 paired primary tumors that were different physical pieces (and RNA preparations) of the same tumor, 9 primary tumor-metastasis pairs, and 2 normal breast sample pairs. In total, 105 biologically diverse breast tumor specimens and 9 normal breast samples (146 microarrays, see Table 11) were assayed on Agilent oligo DNA microarrays representing 17,000 genes (GEO accession number GSE1992). This intrinsic analysis identified 1410 microarray elements that represented 1300 genes. When this new gene list was used in a two-way hierarchical clustering analysis on the training set (
The biology of the intrinsic subtypes is rich and extensive, and the current analysis identified new biologically important features. A HER2+ expression cluster was observed that contained genes from the 17q11 amplicon including HER2/ERBB2 and GRB7 (
A Basal-like expression cluster was also present and contained genes characteristic of basal epithelial cells such as SOX9, CK17, c-KIT, FOXC1 and P-Cadherin (
The subtype defining genes from this analysis showed similarity to the previous breast tumor intrinsic lists (i.e. Intrinsic/Stanford) described in (Perou et al. 2000; Sorlie et al. 2003), except there was a significant increase in gene numbers likely due to the increased number of genes present on the current microarrays, and another significant difference was that the new Intrinsic/UNC list contained a large proliferation signature (
Combined test set analysis. Another difference between the intrinsic subtypes found in the 105 sample training data set versus those presented in Sorlie et al. 2001 and 2003 (Sorlie et al. 2001; Sorlie et al. 2003), was that the training set did not have a clear Luminal B (LumB) group as determined by hierarchical clustering analysis. The lack of a LumB group in the training set cluster analysis could be due to few LumB tumors being present in this data set, an artifact of the clustering analysis, or the lack of LumB defining genes in the Intrinsic/UNC gene list. To address this question, a “combined test set” of 315 breast samples was made (311 tumors and 4 normal breast samples) that was a single data set created by combining together the data from Sorlie et al. 2001 and 2003 (cDNA microarrays), van't Veer et al. 2002 (custom Agilent oligo microarrays) and Sotiriou et al. 2003 (cDNA microarrays).
A single data table of these three sets was created by first identifying the common genes present across all four microarray data sets (2800 genes). Next, Distance Weighted Discrimination (DWD) was used to combine these three data sets together (Benito et al. 2004); DWD is a multivariate analysis tool that is able to identify systematic biases present in separate data sets and then make a global adjustment to compensate for these biases. Finally, it was determined that 306 of the 1300 unique Intrinsic/UNC genes were present in the combined test set.
Even though there was limited overlap between the new Intrinsic/UNC list and the Intrinsic/Stanford list of Sorlie et al. 2003 (108 genes in common), there was high agreement in sample classification. For example, 85% concordance in subtype assignments was found for the 416 tumor data set (combined samples from training and combined test set) when it was analyzed independently using the Intrinsic/Stanford and Intrinsic/UNC lists, and both lists showed significance in univariate survival analyses (data not shown). This analysis suggests that, even though the exact constituent genes may vary, the different lists are tracking the same phenotypes and the same “portraits” are seen. However, since the Intrinsic/UNC list contained many more genes and a biologically relevant pattern of expression not seen in the Intrinsic/Stanford lists (i.e. proliferation signature), it can be more biologically representative of breast tumors. The Intrinsic/UNC list can also be more valuable because it provides a larger number of genes for performing across data set analyses and thus, classifications made across different platforms are less susceptible to artifactual groupings as a result of gene attrition.
Multivariate analyses. In the training set and combined test set, the standard clinical parameters of ER status, node status, grade, and tumor size were all significant predictors of Relapse-Free Survival (RFS, where an event is either a recurrence or death) using univariate Kaplan-Meier analysis (
When the five standard clinical parameters were tested on the 315 sample combined test set using a proportional hazards regression model and RFS, OS or Disease-Specific Survival (DSS) as endpoints, tumor size, grade and ER status were the significant predictors with node status being close to significant (p=0.06-0.07); however, node status was still prognostic in a univariate analysis (
Single Sample Predictions using three additional test sets. A major limitation of using hierarchical clustering as a classification tool is its dependence upon the sample/gene set used for the analysis (Simon et al. 2003). That is, new samples cannot be analyzed prospectively by simply adding them to an existing dataset because doing so may alter the initial classification of a few previous samples. If an assay is going to be used in the clinical setting, it must be robust and unchanging. To address this concern, a Single Sample Predictor (SSP) was developed using the “combined test set” and its 306 Intrinsic/UNC genes (See
Using the combined test set, five centroids representing the LumA, LumB, Basal-like, HER2+/ER− and Normal Breast-like groups were created). The SSP was tested on three “additional test sets”, the first of which was the Ma et al. data set of ER+ patients that were homogeneously treated with tamoxifen (Ma et al. 2004). Using the 60 whole tissue samples of Ma et al., the SSP called 2 Basal-like, 2 HER2+/ER−, 12 Normal Breast-like, 34 LumA, and 9 LumB. Since this patient set had RFS data, the SSP classifications were tested in terms of outcomes (the 2 Basal-like and 2 HER2+/ER− samples by SSP analysis were excluded). The SSP assignments were a significant predictor for this group of adjuvant tamoxifen treated patients (p=0.04,
Next, the SSP was applied onto a 96 sample test set of local only (surgery) treated patients from Chang et al. (Chang et al. 2005), which showed highly significant results (
c) Discussion
This study identified a number of new biologically relevant “intrinsic” features of breast tumors and methods that are important for the microarray community. These new biological features include the 1) demonstration that proliferation is a stable and intrinsic feature of breast tumors, 2) identification of the high expression of many Kallikrein genes in Basal-like tumors, and 3) demonstration that there are multiple types of “HER2-positive” tumors; the HER2-positive tumors falling into the “HER2+/ER−” intrinsic subtype were also shown to associate with the expression of the Androgen Receptor, while those not falling into this group were present in the LumB or LumI subtypes and usually showed better outcomes relative to the HER2+/ER− tumors. Recent studies in prostate cancer have shown that HER2 signaling enhances AR signaling under low androgen levels (Mellinghoff et al. 2004). When this finding is coupled to the observation that some HER2+/ER− tumors showed nuclear AR expression (
Microarray studies are often criticized for a lack of reproducibility and limited validation due to small sample sizes (Simon et al. 2003; Ioannidis 2005). By using DWD, multiple microarray data sets have been combined together to create a single and large combined test set, and it has been shown that the same “intrinsic” patterns can be identified in different data sets in a coordinated analysis, even though entirely different patient populations were investigated on different microarray platforms. The analysis of the 315 sample combined test set showed that the “intrinsic” subtypes based upon the Intrinsic/UNC list, were independent prognostic variables, and thus, were providing new clinical information.
To be of routine clinical use, a gene expression-based test must be based upon an unchanging assay that is capable of making a prediction on a single sample. Therefore, a Single Sample Predictor/SSP was created that was able to classify samples from three additional test sets of similarly treated patients. In particular, the new Intrinsic/UNC list and the SSP, recapitulated the finding that the intrinsic subtypes are truly prognostic on a test set of local only treated patients (
These data show that the breast tumor intrinsic subtypes identified using the Intrinsic/UNC gene list can be generalized to many different patient sets, both treated and untreated. The intrinsic portraits of breast tumors are recognizable patterns of expression that are of biological and clinical value, and the SSP-based classification tool represents an unchanging predictor to be used for individualized medicine.
Microarray analyses of breast cancers have identified different biological groups that are important for prognosis and treatment. In order to transition these classifications into the clinical laboratory, a real-time quantitative (q)RT-PCR assay has been developed for profiling breast tumors from formalin-fixed paraffin-embedded (FFPE) tissues, and its performance has been evaluated relative to fresh-frozen (FF) RNA samples.
Microarray data from 124 breast samples were used as a training set for classifying tumors into four different previously defined molecular subtypes of Luminal, HER2+/ER−, Basal-like, and Normal-like. Sample class predictors were developed from hierarchical clustering of microarray data using two different centroid-based algorithms: Prediction Analysis of Microarray and a Single Sample Predictor. The training set data was applied to predicting sample class on an independent test set of 35 breast tumors procured as both fresh-frozen and formalin-fixed, paraffin embedded tissues (70 samples). Classification of the test set samples was determined from microarray data using a large 1300 gene set, and using a minimized version of this gene list (40 genes). The minimized gene set was also used in a real-time qRT-PCR assay to predict sample subtype from the fresh-frozen and formalin-fixed, paraffin embedded tissues. Agreement between primer set performance on fresh-frozen and formalin-fixed, paraffin embedded tissues was evaluated using diagonal bias, diagonal correlation, diagonal standard deviation, concordance correlation coefficient, and subtype assignment.
The centroid-based algorithms (Prediction Analysis of Microarray and Single Sample Predictor) had complete agreement in classification from formalin-fixed, paraffin-embedded tissues using qRT-PCR and the minimized ‘intrinsic’ gene set (40 classifiers). There was 94% (33/35) concordance between the diagnostic algorithms when comparing subtype classification from fresh-frozen tissue using microarray (large and minimized gene set) and qRT-PCR data. By qRT-PCR alone, there was 97% (34/35) concordance between fresh-frozen and formalin-fixed, paraffin embedded tissues using Prediction Analysis of Microarray and 91% (32/35) concordance using Single Sample Predictor. Finally, we used several analytical techniques to assess primer set performance between fresh-frozen and formalin-fixed, paraffin-embedded tissues and found that the ratio of the diagonal standard deviation to the dynamic range was the best method for assessing agreement on a gene-by-gene basis.
Determining agreement in classification between platforms and procurement methods requires a variety of methods. It has been shown that centroid-based algorithms are robust classifiers for breast cancer subtype assignment across platforms (microarray and qRT-PCR data) and procurement conditions (fresh-frozen and formalin-fixed, paraffin-embedded tissues). In addition, the standard deviation, dynamic range, and concordance correlation coefficient are important parameters to assess individual primer set performance across procurement methods. The strategy for primer set validation and classification has applications in routine clinical practice for stratifying breast cancers and other tumor types.
Expression-based classifications are important for determining risk of relapse and making treatment decisions in breast cancer (Fan et al. N Engl J Med 2006, 355:560-569; Paik et al. N Engl J Med 2004, 351:2817-2826; Perou et al. Nature 2000, 406:747-752; van't Veer et al. Nature 2002, 415:530-536). Classifications are often developed using microarray data and then further validated on the same or different platforms using minimized gene sets. For instance, van't Veer and van de Vijver used microarray data in training and test sets to validate a 70-gene signature that predicts relapse in early stage ER-positive and ER-negative tumors (van't Veer et al. Nature 2002, 415:530-536; van de Vijver et al. N Engl J Med 2002, 347:1999-2009). In addition, Paik et al. developed a 16-gene classifier that predicts relapse in ER-positive tumors using qRT-PCR on formalin-fixed, paraffin embedded (FFPE) tissues. Furthermore, Perou and Sorlie showed that hierarchical clustering of microarray data separates breast tumors into different ‘biological’ subtypes (Luminal, HER2+/ER−, Basal-like, and Normal-like) and that these subtypes are prognostic (Sorlie et al. Proc Natl Acad Sci USA 2001, 98:10869-10874). The biological classification has been validated on multiple patient cohorts using cross-platform microarray analyses and qRT-PCR (Hu et al. BMC Genomics 2006, 7:96; Perreard et al. Breast Cancer Res 2006, 8:R23; Sorlie et al. Proc Natl Acad Sci USA 2003; 100:8418-8423).
Although there are few genes in common between those used to determine the biological subtypes and those used in other classifications for breast cancer prognosis, the different tests identify similar properties that predict tumor behavior (Fan et al. N Engl J Med 2006, 355:560-569). A major difference between the classification for biological subtypes and other classifications for breast cancer is that it is based on hierarchical clustering. The unsupervised nature of hierarchical clustering is effective for discovery (Eisen et al. Proc Natl Acad Sci USA 1998, 95:14863-14868), but it is not suitable for predicting a new sample's class since dendrogram associations can change when new data are introduced. However, it is possible to classify samples within the framework of hierarchical clustering using centroid-based methods (Tibshirani et al. Proc Natl Acad Sci USA 2002, 99:6567-6572; Bair et al. PLoS Biol 2004, 2:E108; Bullinger et al. N Engl J Med 2004, 350:1605-1616). For instance, Tibshirani et al. have shown that the nearest shrunken centroid method, used in Prediction Analysis of Microarray (PAM), can classify samples as accurately as statistical approaches like artificial neural networks. In addition, Hu et al. employed another simple centroid method called Single Sample Predictor (SSP) to classify subtypes of breast cancer (Hu et al. 2006).
a) Materials and Methods
(1) Tissue Procurement and Processing
All tissues and data used in this study were collected and handled in compliance with federal and institutional guidelines. Breast samples received in pathology were flash frozen in liquid nitrogen and stored at −80° C. Samples were procured at the University of North Carolina at Chapel Hill, Thomas Jefferson University, University of Chicago, and University of Utah. The 159 breast samples analyzed included a 124-sample microarray training set and a 35-sample test set profiled by microarray and real-time qRT-PCR (FF and FFPE). Total RNA from FF samples was isolated using the RNeasy Midi Kit (Qiagen, Valencia, Calif.) and treated on-column with DNase I to eliminate contaminating DNA. The RNA was stored at −80° C. until used for cDNA synthesis.
Each FF sample in the test set was compared to the clinical FFPE tissue block. An H&E slide was used to confirm the presence of >50% tumor and 20 micron cuts were prepared using a microtome. Tissue blocks were 1-5 years in age (i.e. early age FFPE). The FFPE cut was de-paraffinized in Hemo-De (Scientific Safety Solvents) and washed with 100% ethanol. Total RNA was isolated using the High Pure RNA Paraffin Kit (Roche Molecular Biochemicals, Mannheim, Germany). Manufacturer's instructions were followed for RNA extraction except that the reagents were increased 2-fold for the first proteinase K digestion. Samples were treated with TURBO DNA-free (Ambion, #1906) and stored at −80° C. until cDNA synthesis.
(2) First-Strand cDNA Synthesis
cDNA synthesis for each sample was performed in 40 μl total volume reaction containing 600 ng total RNA. Total RNA was first mixed with 2 μl gene specific cocktail containing 55 primers (each anti-sense primer at 1 pmol/μl) and 2 μl 10 mM dNTP Mix (10 mM each dATP, dGTP, dCTP, dTTP at pH7). Reagents were heated at 65° C. for 5 minutes in a PTC-100 Thermal Cycler (MJ Research, Inc.) and briefly centrifuged. The following reagents were added to each tube: 8 μl 5× First-Strand Buffer, 2 μl 0.1 M DTT, 2 μl RNase Out (Invitrogen), and 2 μl SuperScript III polymerase (200 units/μl). The reaction was thoroughly mixed by pipetting and incubated at 55° C. for 45 minutes followed by 15 minutes at 70° C. for enzyme inactivation. Following cDNA synthesis, samples were purified with the QIAquick PCR Purification Kit (Qiagen, Valencia, Calif.). Samples were adjusted to a final concentration of 1.25 ng/μl cDNA with TE (10 mM Tris-HCl, pH 8.0, 0.1 mM EDTA).
(3) Primer Design and Optimization
Primers were designed using Roche LightCycler Probe Design Software 2.0. Reference gene sequences were obtained through NCBI LocusLink and optimal primer sites were found with the aid of Evidence Viewer (http://www.ncbi.nlm.nih.gov). Primer sets were selected to avoid known insertions/deletions and mismatches while including all isoforms possible. Amplicons were limited to 60-100 bp in length due to the degraded condition of the FFPE mRNA. When possible, RNA specific amplicons were localized between exons spanning large introns (>1 kb). Finally, NCBI BLAST was used to verify gene target specificity of each primer set. Primer sequences are presented in Table 1. Primers were synthesized by Operon, Inc. (Huntsville, Ala.), re-suspended in TE to a final concentration of 60 μM, and stored at −80° C. Each new FFPE primer set was assessed for performance through qRT-PCR runs with three serial 10-fold dilutions of reference cDNA in duplicate and two no template control reactions. Primers were verified for use when they fulfilled the following criteria: 1) target Cp<30 in 10 ng reference cDNA; 2) PCR efficiency>1.75; 3) no primer-dimers in presence of template as determined through post amplification melting curve analysis; and 4) no primer-dimers in negative template control before cycle 40.
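The acceptance criteria above can be expressed as a simple check. The sketch below assumes that PCR efficiency is estimated from the serial dilutions with the usual standard-curve relation E = 10^(-1/slope), where the slope is that of Cp against log10 of the input amount; whether the instrument software computes efficiency in exactly this way is not stated here. The function and argument names are hypothetical.

```python
from math import log10

def pcr_efficiency(inputs_ng, cps):
    """Estimate efficiency E = 10**(-1/slope) from Cp versus log10(input)."""
    xs = [log10(q) for q in inputs_ng]
    mx, my = sum(xs) / len(xs), sum(cps) / len(cps)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, cps))
             / sum((x - mx) ** 2 for x in xs))
    return 10 ** (-1 / slope)

def primer_passes(cp_at_10ng, efficiency, dimer_with_template, ntc_dimer_cycle):
    """Apply the four criteria; ntc_dimer_cycle is None if no dimer appeared."""
    return (cp_at_10ng < 30
            and efficiency > 1.75
            and not dimer_with_template
            and (ntc_dimer_cycle is None or ntc_dimer_cycle >= 40))
```

With this relation, a perfectly efficient reaction that doubles the product every cycle gives E = 2, so the threshold of 1.75 admits primer sets that fall somewhat short of perfect doubling.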
(4) Real-Time Quantitative (q)RT-PCR
PCR amplification was carried out on the Roche LightCycler 2.0. Each reaction contained 2 μl cDNA (2.5 ng) and 18 μl of PCR master mix with the following final concentration of reagents: 1 U Platinum Taq, 50 mM Tris-HCl (pH 9.1), 1.6 mM (NH4)2SO4, 0.4 mg/μl BSA, 4 mM MgCl2, 0.2 mM dATP, 0.2 mM dCTP, 0.2 mM dGTP, 0.6 mM dUTP, 1/40000 dilution of SYBR Green I dye (Molecular Probes, Eugene, Oreg., USA), and 0.4 μM of both forward and reverse primers for the selected target. The PCR was done with an initial denaturation step at 94° C. for 90 s and then 50 cycles of denaturation (94° C., 3 s), annealing (58° C., 6 s), and extension (72° C., 6 s). Fluorescence acquisition (530 nm) was taken once each cycle at the end of the extension phase. After PCR, a post-amplification melting curve program was initiated by heating to 94° C. for 15 s, cooling to 58° C. for 15 seconds, and slowly increasing the temperature (0.1° C./s) to 95° C. while continuously measuring fluorescence.
Each PCR run contained a no template control, a calibrator reference in triplicate, and each sample in duplicate. The calibrator reference sample was comprised of 3 breast cancer cell lines (MCF7, SKBR3, and ME16C2) and Stratagene Universal Human Reference RNA (Stratagene, La Jolla, Calif., USA) represented in equal parts. The crossing point (Cp) for each reaction was automatically calculated by the Roche LightCycler Software 4.0. Relative quantification was done by importing an external efficiency curve (Eff=1.89) and setting the calibrator at 10 ng for each gene. In order to correct for differences in sample quality and cDNA input, copy numbers were adjusted to the arithmetic mean of 5 ‘housekeeper’ genes (ACTB, PSMC4, PUM1, MRPL19, SF3A1). Values from replicate samples were averaged and data was log 2 transformed.
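The relative quantification and normalization just described can be sketched as follows. This is an illustration under assumptions rather than the LightCycler calculation itself: it uses the common efficiency-corrected form quantity = calibrator × Eff^(Cp of calibrator − Cp of sample), treats “adjusted to the arithmetic mean” as division by the mean housekeeper quantity, and uses hypothetical function names.

```python
from math import log2
from statistics import mean

EFF = 1.89            # external efficiency stated in the text
CALIBRATOR_NG = 10.0  # calibrator set at 10 ng for each gene

def quantity(cp_sample, cp_calibrator):
    """Efficiency-corrected quantity relative to the 10 ng calibrator."""
    return CALIBRATOR_NG * EFF ** (cp_calibrator - cp_sample)

def normalized_log2(target_cps, target_cal_cp, housekeeper_quantities):
    """Average replicate quantities, divide by the housekeeper mean, take log 2.

    target_cps: Cp values of the replicate reactions for one gene.
    housekeeper_quantities: quantities of the housekeeper genes in the sample.
    """
    q = mean([quantity(cp, target_cal_cp) for cp in target_cps])
    return log2(q / mean(housekeeper_quantities))
```

A sample that crosses threshold one cycle earlier than the calibrator is therefore assigned 1.89 times the calibrator quantity, not 2 times, because the efficiency is below perfect doubling.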
(5) Microarray
All samples were analyzed by DNA microarray (Agilent Human A1, Agilent Human A2, and Agilent custom oligonucleotide microarrays). Labeling and hybridization of RNA for microarray analysis were performed using the Agilent low RNA input linear amplification kit (http://www.chem.agilent.com/Scripts/PDS.asp?lPage=10003) as described in Hu et al (Hu et al. Biotechniques 2005, 38:121-124). Each sample was assayed versus a common reference that was a mixture of Stratagene's Human Universal Reference total RNA (Stratagene, La Jolla, Calif., USA) enriched with equal amounts of RNA from the MCF7 and ME16C cell lines. Microarray hybridizations were carried out on Agilent Human oligonucleotide microarrays using 2 μg Cy3-labeled ‘reference’ sample and 2 μg Cy5-labeled ‘experimental’ sample.
All microarrays were scanned using an Axon Scanner 4000B (Axon Instruments Inc, Foster City, Calif., USA). The image files were analyzed with GenePix Pro 4.1 (Axon Instruments) and were uploaded into the UNC Microarray Database at the University of North Carolina at Chapel Hill (https://genome.unc.edu/), where a Lowess normalization procedure was performed to adjust the Cy3 and Cy5 channels (Yang et al. Nucleic Acids Res 2002, 30:e15).
(6) Clinical Immunohistochemistry and PCR
Samples were scored for protein expression at the time of diagnosis using standard operating procedures established at each institution. Greater than 10% positive staining nuclei was considered positive for the ER and PR. Staining and scoring criteria for HER2 were carried out according to the Dako HercepTest™ (Dako, Carpinteria, Calif., USA). Quantitative PCR, used to determine DNA copy number of the ERBB2 gene, was done using a clinical assay from ARUP Laboratories Inc (cat# 00049390, Salt Lake City, Utah, USA).
(7) Selecting Genes for Real-Time qRT-PCR
The real-time qRT-PCR assay consisted of 5 housekeeper genes (Szabo et al. Genome Biol 2004, 5:R59), 5 proliferation genes for risk stratification of the Luminal (ER-positive) tumors, and 40 ‘intrinsic’ genes important for distinguishing biological subtypes of breast cancer. The minimal 40 ‘intrinsic’ classifiers were statistically selected from a larger 1300 ‘intrinsic’ gene set previously reported in Hu et al. (2006). The larger gene set was minimized as described in Perreard et al. (2006). Briefly, a semi-supervised classification method was used in which samples are hierarchically clustered and assigned subtypes based on the sample-associated dendrogram. Samples were designated as Luminal, HER2+/ER−, Basal-like, or Normal-like. The best class distinguishers were identified according to the ratio of between-group to within-group sums of squares. A 10-fold cross-validation was performed using a nearest centroid classifier and testing overlapping gene sets of varying sizes. The smallest gene set which provided the highest class prediction accuracy when compared to the classifications made by the complete microarray-based intrinsic gene set was selected.
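The ranking statistic named above, the ratio of between-group to within-group sums of squares, can be sketched for a single gene as follows. The function name is hypothetical and the cross-validation used to choose the final gene set size is not shown.

```python
from statistics import mean

def bss_wss(values, labels):
    """Ratio of between-group to within-group sums of squares for one gene.

    values: expression of the gene in each sample.
    labels: subtype label of each sample, in the same order.
    """
    overall = mean(values)
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    # Between-group: spread of the subtype means around the overall mean.
    bss = sum(len(vs) * (mean(vs) - overall) ** 2 for vs in groups.values())
    # Within-group: spread of the samples around their own subtype mean.
    wss = sum((v - mean(vs)) ** 2 for vs in groups.values() for v in vs)
    return bss / wss
```

Genes are then ranked by this ratio, so that a gene with well separated subtype means and little scatter within each subtype appears near the top of the list.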
(8) Assessing qRT-PCR Agreement between FF and FFPE Tissues
Thirty-five matched FF and FFPE samples (70 samples total) were analyzed by qRT-PCR using the same primer sets. Agreement in the quantitative data was determined using diagonal bias (m), diagonal spread (d), diagonal standard deviation (dsd), diagonal correlation (rd), and concordance correlation coefficient (CCC).
In diagonal bias, a best fitting line parallel to the diagonal (slope equals 1) is made from a plot of the qRT-PCR data (FF versus FFPE). Numerically, if (xi,yi), i=1, . . . , n denote the measurement pairs then the best fitting line parallel to the diagonal is given by the expression:
y=x+
where
Then diagonal bias is calculated as:
The diagonal standard deviation was calculated as follows:
Let d represent:
Diagonal correlation was used to determine the spread of points around the diagonal line:
This method does not provide information about the extent of deviation but allows measurements with different units to be compared. Further, if we let ρ denote the correlation coefficient and σX and σY the respective standard deviations, then
That is, the diagonal correlation penalizes the correlation coefficient if there is a scale shift (σX≠σY). The combined effect of the bias and scale shift was measured using the concordance correlation coefficient (CCC) proposed by Lin et al (Lin et al. Biometrics 1989, 45:255-268):
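Written out, the agreement measures take the following form. Only the last line, Lin's concordance correlation coefficient, is a standard published expression. The other lines are a sketch consistent with the verbal definitions above and are not necessarily the exact expressions used here: the intercept symbol \hat{b}, the factor 1/\sqrt{2} that converts vertical offsets into perpendicular distances from the diagonal, and the n-1 divisor are all assumptions.

```latex
\begin{align*}
y &= x + \hat{b}, \qquad \hat{b} = \bar{y} - \bar{x}, \qquad m = \frac{\hat{b}}{\sqrt{2}} \\
d_i &= \frac{y_i - x_i}{\sqrt{2}}, \qquad
\mathrm{dsd} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(d_i - \bar{d}\right)^{2}} \\
r_d &= \frac{2\,\rho\,\sigma_X\,\sigma_Y}{\sigma_X^{2} + \sigma_Y^{2}} \\
\mathrm{CCC} &= \frac{2\,\rho\,\sigma_X\,\sigma_Y}{\sigma_X^{2} + \sigma_Y^{2} + \left(\mu_X - \mu_Y\right)^{2}}
\end{align*}
```

Under these definitions r_d equals ρ when σX=σY and is smaller in magnitude otherwise, and the CCC reduces to r_d when the two means coincide, which agrees with the description of the CCC as capturing bias and scale shift together.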
(9) Assessing Agreement between Microarray and qRT-PCR for Classification.
A breast cancer subtype predictor was developed in PAM (http://www-stat.stanford.edu/˜tibs/PAM/) and SSP using 124 breast samples and the ‘intrinsic’ gene set identified in Hu et al (2006). The training set contained representative samples of Luminal (64 samples), HER2+/ER− (23 samples), Basal-like (28 samples), and Normal-like (9 samples) subtypes. Classification of an independent test set (35 matched FF and FFPE samples) was done using a large (1300 genes) and minimized (40 genes) version of the ‘intrinsic’ set. Subtypes were assigned based on Spearman correlation to the centroid. The qRT-PCR data from the test set was merged with the microarray data of the training set prior to classification using distance weighted discrimination (Benito et al. Bioinformatics 2004, 20:105-114). The gold standard for classification of the training and test samples was based on FF tissue RNA and using the classifications obtained when performing hierarchical clustering analysis using the 1300 gene intrinsic gene set from microarray data.
b) Results
(1) Assessment of qRT-PCR Primer Set Performance by Comparing Agreement between FF and FFPE Tissues.
The data set of 35 matched FF and FFPE tissues (70 samples) was evaluated for 50 genes using the same PCR conditions. Agreement between FF and FFPE tissues was assessed for diagonal bias (m), diagonal correlation (rd), diagonal standard deviation (dsd), and concordance correlation coefficient (ccc).
For each gene, the agreement between FF and FFPE was analyzed using the raw data, housekeeper normalized data, and DWD adjusted normalized data. Scatter plots are provided in
The median biases for the un-normalized, housekeeper normalized, and DWD adjusted normalized data were −1.5 (−3.1 to −0.033), 0.58 (−1.1 to 2) and 0.24 (−0.3 to 1.3), respectively. Normalization to the housekeeper genes had a relatively modest effect on the diagonal standard deviation with a change in the median from 1.1 (0.76-2) to 0.81 (0.38-1.8). While most genes had a similar standard deviation (e.g. ESR1) after applying the housekeepers, other genes such as C10orf7 and COX6C had nearly a 3-fold reduction in standard deviation after normalization.
In general, genes with the highest diagonal correlation between FF and FFPE also had the largest dynamic range in expression (e.g., ESR1, TFF3, COX6C, and FBP1). Housekeeper genes and other genes with low variability in expression (IGBP1) had the lowest diagonal correlation since they form more of a cloud than a line around the diagonal. The housekeeper genes all had high agreement in terms of having low variability in expression across samples in the FF and FFPE tissues.
The concordance correlation coefficient (CCC) considers both bias and scale shift when determining agreement. The median concordance correlation coefficient between FF and FFPE for the raw data of the 45 genes (housekeepers excluded) was 0.28. Normalization to housekeepers raised the CCC median to 0.48, and adjusting with DWD brought the median to 0.61. In the raw data, only 27% of the genes had a CCC value greater than 0.5, whereas 47% of the genes were above that value in the housekeeper normalized data, and 76% were above it in the DWD adjusted normalized data. A comparison of the CCC value to the ratio of the diagonal standard deviation over the dynamic range identified many of the same primer sets as good (or poor) performers from the FFPE derived samples.
(2) Breast Cancer Subtype Classification of Test Set using PAM and SSP.
Hierarchical clustering of the 124 sample training set using the “intrinsic” gene set identified in Hu et al shows 4 distinct classes representing Luminal, HER2+/ER−, Basal-like, and Normal-like (
Agreement in classification between large and minimized microarray gene sets. Thirty-three out of 35 (94%) samples classified the same between PAM and SSP when using the large ‘intrinsic’ microarray dataset for classification. In both discrepant cases, IHC data agreed with the PAM classification. There was the same agreement (94%) when performing the analysis with the minimized version of the microarray data. Interestingly, there was one sample that was called HER2+/ER− by both PAM and SSP when using the large microarray dataset, but called Basal-like by both methods when using the minimized microarray dataset. Additional analysis of this sample by quantitative PCR showed no DNA amplification of the HER2/ERBB2 amplicon.
Agreement in classification between FF and FFPE. By qRT-PCR, there was 97% (34/35) concordance between FF and FFPE using PAM, and 91% (32/35) concordance using SSP. There was 94% (33/35) concordance between the diagnostic algorithms from FF tissue and complete agreement in classification from FFPE tissue. Since the FFPE samples were obtained from the clinical block, it is likely that there was a higher tumor percentage in those samples than in the matched FF sample, which could affect the agreement. Indeed, 2 out of the 3 discrepancies in classification made by SSP were when the FF tissue sample was called Normal-like (microarray and PCR) and the FFPE sample was called Luminal (PCR). These samples were ER-positive by IHC and likely Luminal. The only discrepancy in PAM was in a sample classified as Normal-like from FF tissue and Luminal from FFPE.
Overall concordance across methods. Overall, PAM diagnosed 33 out of 35 samples (94%) the same across microarray and qRT-PCR, while SSP diagnosed 30 out of 35 samples (86%) the same across platforms and procurement methods. Discrepancies were of several types including Luminal tumors classified as Normal-like, HER2+/ER− tumors classified as Luminal, and Basal-like tumors classified as HER2+/ER−.
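The concordance figures reported above are simple proportions of samples receiving the same subtype call under two conditions. A minimal sketch of that arithmetic is given below; the function name `concordance` and the example calls are hypothetical and do not reproduce the study data.

```python
def concordance(calls_a, calls_b):
    """Return (agreeing samples, total samples, percent agreement)."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("need two non-empty call lists of equal length")
    agree = sum(1 for a, b in zip(calls_a, calls_b) if a == b)
    total = len(calls_a)
    return agree, total, 100.0 * agree / total


# Hypothetical calls for 35 samples with two discordant pairs.
ff_calls = ["Luminal"] * 20 + ["Basal-like"] * 10 + ["Normal-like"] * 5
ffpe_calls = ["Luminal"] * 20 + ["Basal-like"] * 10 + ["Luminal"] * 2 + ["Normal-like"] * 3
agree, total, pct = concordance(ff_calls, ffpe_calls)
```

With 35 samples, each discordant call changes the percentage by roughly 3 points, so small differences between methods correspond to one or two samples.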
c) Discussion
The transition of large-scale microarray experiments into a clinical test requires identifying a minimum set of genes for classification, translating the assay from microarray to qRT-PCR for routine diagnostics, and validating the assay using both FF and FFPE specimen types.
A previous qRT-PCR assay for identifying biological subtypes was based on an intrinsic gene set derived from first generation microarrays that contained 8,100 genes. In comparison, the current intrinsic set was derived from a different microarray platform (cDNA versus Agilent), contained a larger number of genes (427 vs. 1300), and used pre-treatment samples only (Hu et al. 2006). The overlap in the minimized gene set developed here versus the list in Perreard et al. was 14 out of 40, which is not surprising since there were only 108 genes in common between the larger intrinsic gene sets. It has been shown that the new intrinsic gene set reproducibly identifies the same breast cancer subtypes within independent datasets (i.e. pure training and test sets), and that the biological classification adds significant clinical information when used in a multivariate Cox analysis.
It has been shown that the centroid-based method called Single Sample Predictor can use microarray data to classify breast cancers into biological subtypes that predict survival in treated and untreated patients (Hu et al. 2006). Here PAM is directly compared to SSP using the large microarray dataset applied in Hu et al, and a minimized version is also tested using microarray and qRT-PCR data. Both methods performed well.
This method of classification is considered semi-supervised since data from hierarchical clustering is initially used to develop a centroid or shrunken centroid from a training set and new samples are then classified based on the distance to the centroid. In this way, the training set is not only necessary for initial discovery and validation but the data continues to be used as a reference base for future classification of new samples. Similarly, the Oncotype Dx assay established cut points for risk of relapse from a training set and this classifier rule is applied to new samples to derive a recurrence score.
Determining agreement between methods is a complex issue that requires consideration of several factors before reaching a conclusion. Cronin et al used Pearson correlation to show that the genes with the highest correlation in microarray maintained their association with qRT-PCR. They used short amplicons and control ‘housekeeper’ genes in the qRT-PCR assay to correct biases between FF and FFPE tissues. Although correlation provides information about the linearity and slope (positive or negative correlation) of the data, it does not indicate the amount of bias, scale shift, or data spread. These additional measurements are helpful in determining whether the discrepancies in the data can be compensated for experimentally (e.g., housekeeper genes) or by software algorithms. For example, when the qRT-PCR data from FF and FFPE were compared, it was found that a significant bias could be corrected by normalization to the housekeepers and applying Distance Weighted Discrimination. Distance Weighted Discrimination corrected systematic biases but did not change other measurements of agreement. After correcting for systematic bias, it is then possible to evaluate variation due to noise that cannot be predicted or controlled.
It was found that the most useful analyses for assessing PCR primer set performance across FF and FFPE tissues were the concordance correlation coefficient, the diagonal standard deviation, and the dynamic range. Genes with a large dynamic range often had high correlation and were good classifiers across conditions, even with relatively large diagonal standard deviations. Although genes with a small dynamic range can be good classifiers, the measurement may not be as reproducible if there is a large amount of variation. Thus, it was found that the best assessment of a classifier was using a ratio of the diagonal standard deviation to the dynamic range. This allowed genes with smaller dynamic ranges to be considered as good classifiers, if they also had low diagonal standard deviations. The concordance correlation coefficient and the ratio of the diagonal standard deviation to the dynamic range selected many of the same genes as having similar performance from the FF and FFPE tissues.
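The ratio of the diagonal standard deviation to the dynamic range can be sketched as a per-gene score for ranking primer sets. This is an illustration under stated assumptions rather than the procedure used in the study: the dynamic range is taken here as the maximum minus the minimum expression value over the pooled FF and FFPE measurements, the standard deviation uses an n − 1 divisor, and the function names and toy values are hypothetical.

```python
import math


def dsd_to_range_ratio(ff, ffpe):
    """Diagonal standard deviation divided by dynamic range for one gene."""
    n = len(ff)
    if n != len(ffpe) or n < 2:
        raise ValueError("need matched FF and FFPE values for at least 2 samples")

    # Diagonal bias: mean of the paired differences.
    m = sum(b - a for a, b in zip(ff, ffpe)) / n

    # Spread of the points about the line y = x + m (assumed n - 1 divisor).
    dsd = math.sqrt(sum((b - a - m) ** 2 for a, b in zip(ff, ffpe)) / (n - 1))

    # Assumption: dynamic range is max - min over the pooled measurements.
    pooled = list(ff) + list(ffpe)
    dynamic_range = max(pooled) - min(pooled)
    if dynamic_range == 0:
        raise ValueError("gene shows no variation; ratio is undefined")

    return dsd / dynamic_range


def rank_genes(data):
    """data maps gene -> (ff_values, ffpe_values); lowest ratio ranks first."""
    return sorted(data, key=lambda gene: dsd_to_range_ratio(*data[gene]))


# Toy values: one gene with a wide range, one with a narrow range.
data = {
    "wide": ([0, 2, 4, 6, 8], [0.5, 2, 4.5, 6, 8.5]),
    "narrow": ([1, 1.2, 0.9, 1.1, 1.0], [1.5, 0.8, 1.3, 0.7, 1.2]),
}
ranking = rank_genes(data)
```

A low ratio means the scatter between FF and FFPE is small relative to the spread of expression across samples, so a gene with a narrow range can still score well if its scatter is correspondingly small.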
Translating an assay from microarray to qRT-PCR provides a second level of gene validation and allows the test to be used on archived FFPE tissue blocks from clinical trials or on samples submitted for routine diagnostics (Paik et al. 2004; Cronin et al. Am J Pathol 2004, 164:35-42). qRT-PCR on formalin-fixed paraffin-embedded tissue can be effectively used for gene expression based diagnostics for translation into the clinical laboratory. The FFPE procured RNA provided accurate subtype classifications in breast cancer, and in some instances provided more tumor specific information than the FF derived samples. This study also developed methodologies that have wider application for developing qRT-PCR assays for subtype classification in a wide variety of cancer types. These gene expression based tests can provide powerful new prognostic clinical tools and aid in more appropriate individualized treatment decisions.
Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. “Cluster analysis and display of genome-wide expression patterns” Proc Natl Acad Sci USA 95:14863-14868 (1998).
Perou C M, Sorlie T, Eisen M B, van de Rijn M, Jeffrey S S, Rees C A, Pollack J R, Ross D T, Johnsen H, Akslen L A, Fluge O, Pergamenschikov A, Williams C, Zhu S X, Lonning P E, Borresen-Dale A L, Brown P O, Botstein D. “Molecular portraits of human breast tumours” Nature 406:747-752 (2000).
SantaLucia J. “A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics” Proc Natl Acad Sci USA 95:1460-1465 (1998).
van't Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., et al. “Gene expression profiling predicts clinical outcome of breast cancer” Nature 415:530-536 (2002).
This application claims priority to U.S. Provisional Patent Application No. 60/739,155, which was filed on 23 Nov. 2005.
This work was supported in part by the National Cancer Institute (P50-CA58223-11, R33 CA097769-01, and U01 CA114722). The United States Government may have certain rights in the inventions disclosed herein.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US06/44737 | 11/17/2006 | WO | 00 | 3/13/2009 |
Number | Date | Country | |
---|---|---|---|
60739155 | Nov 2005 | US |