This invention relates to protein analysis and, in particular, to protein expression profiling of breast tumors and cancers.
Adjuvant systemic therapy has a favorable impact on survival in patients with early breast cancer.1, 2 The decision to give or withhold such therapy is based upon a series of histoclinical prognostic criteria reviewed in consensus conferences, i.e., those of the National Institutes of Health (NIH) and St-Gallen.3, 4 However, despite the establishment of standardized criteria, the heterogeneity of breast tumors remains poorly understood. For example, clinical decisions on whether to treat patients with node-negative breast cancer by surgery and radiotherapy alone, or in combination with adjuvant chemotherapy, are currently being made with scant information on patient risk for metastatic relapse. Additionally, identifying, among the patients who receive chemotherapy, those who will and those who will not benefit from standard anthracycline-based protocols remains elusive. However, the relatively limited efficacy of current protocols (a failure rate of about 30-40%) and the increasing availability of new therapies make this issue clinically important. Furthermore, the development of molecularly-targeted drugs such as trastuzumab (Herceptin™), a monoclonal antibody against the ERBB2 tyrosine kinase receptor, is needed.5 With few exceptions, such as estrogen receptor and ERBB2 receptor, the available molecular markers are of limited value in clinical practice.
High-throughput molecular technologies such as DNA arrays have recently contributed significantly to the understanding of the molecular complexity of breast cancer.6 Several studies have demonstrated the potential clinical utility of gene expression signatures defined by the combined RNA expression of a few tens of genes. These signatures have led to the development of a new molecular taxonomy of disease, including the identification of previously indistinguishable prognostic subclasses.7-15 The clinical impact of these tests on disease management must be subsequently evaluated in large retrospective and prospective studies of adequate statistical power on fully annotated patient samples, followed by the development of gene expression-based diagnostics adapted to the clinical setting.
Unfortunately, the cost, technical complexity, and interpretation of DNA microarray technology still complicate investigation with cancer specimens and make the technology currently unsuitable for routine use in the standard clinical setting. Issues that must be addressed prior to validation and integration of this technology into clinical pathology laboratories include the requirement for high-quality RNA extracted from unfixed tissues, intra-tumoral heterogeneity of excised patient samples, and bias resulting from the asymmetry of variables, with the number of hybridized samples far smaller than the number of genes being tested, which leads to non-trivial statistical problems. Finally, the sensitivity, specificity, reproducibility and technical feasibility outside large academic centers will have to be addressed, and experimental conditions will have to be standardized and data compared in multi-center clinical trials.
Additional opportunities to validate and/or identify prognostic expression signatures are provided by alternative high-throughput approaches, which may be used either separately or in combination with DNA microarrays. One of these is the tissue microarray (TMA) technique,16-18 which allows for the simultaneous study of hundreds of tumor specimens at the DNA, RNA or protein level. Immunohistochemistry (IHC) is applicable to paraffin-embedded samples that constitute the bulk of pathology archives, avoiding the requirement for high-quality RNA extracted from frozen specimens. IHC is relatively inexpensive, straightforward and well established in standard clinical pathology laboratories. Thus, IHC on TMA may be a practical approach both in validation studies and in routine testing. However, analytical classification methods to efficiently process and interpret multiple target IHC data have not been previously developed.
Recent studies have shown the reliability of hierarchical clustering for classifying cancers when applied to IHC TMA data of a significant range of markers.19-24 However, none addressed the prognostic issue.
This invention in a broad sense provides a means of analyzing histopathologic features of breast disease, in particular, of classifying breast cancers into prognostically relevant subclasses. After exhaustive testing on a retrospective panel of 552 early breast cancer samples we found that this classification was possible by analyzing a consistent set of proteins. Classification of samples, based on this multidimensional protein data set, was first done using classical unsupervised hierarchical clustering. We then developed a supervised bioinformatic method that further improved the classification as compared with usual prognostic factors.
The invention provides a protein expression signature identified by protein expression profiling which may be used for analyzing histopathologic features of breast disease as well as methods for carrying out such analysis. In particular, protein expression profiling is a clinically useful approach to assess breast cancer heterogeneity and prognosis in patients with stage I, II, or III disease. It may be used both for breast tumor management in clinical settings and as a research tool in academic laboratories.
The invention provides in one aspect a method for analyzing differential protein expression associated with histopathologic features of breast disease, in particular, breast tumours, e.g., breast carcinomas, comprising detecting overexpression or underexpression of a pool of proteins in breast tissues or cells, the pool comprising all or part of a protein set comprising:
By “all or part” is meant 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51 or 52 proteins.
By “Cytokeratin 5/6” is meant Cytokeratin 5 and/or Cytokeratin 6. The same is applicable to “Cytokeratin 8/18.”
The following table displays proteins of the invention and their corresponding amino-acid sequences (SEQ ID NO. 1 to 52). These proteins are identified by their common names (first column) in the methods, libraries, sets, pools, etc. of the invention. Other names in the literature which designate the same proteins (alias, synonyms, etc.) are included and are incorporated herein by reference.
The invention may also define these proteins by their amino-acid (polypeptidic) sequences (SEQ ID NO.), or portions or modifications thereof in accordance with the definition of “protein” provided in Table 1 below.
“Over or underexpression of a pool of protein” means that overexpression of certain proteins is detected simultaneously with the underexpression of other proteins. “Simultaneously” means concurrent with or within a biologically or functionally relevant period of time during which the overexpression of a protein may be followed by the underexpression of another protein, or conversely, e.g., because both expressions are directly or indirectly correlated.
In a further aspect, the invention provides a method for analyzing differential protein expression associated with histopathologic features of breast disease comprising detecting overexpression or underexpression of a pool of protein in breast tissues comprising a protein set comprising:
In a further aspect, the invention provides a method for analyzing differential protein expression associated with histopathologic features of breast disease comprising detecting overexpression or underexpression of a pool of protein in breast tissues comprising a protein set comprising:
According to a preferred aspect, the pool of protein comprises a protein set comprising:
According to another aspect, the pool of protein comprises a protein set comprising all proteins of the Table 1 above.
The method further comprises at least one of the following aspects:
The method may further comprise at least one of the following aspects:
A further aspect of the invention provides a protein library useful for molecular characterization of histopathologic features of breast disease comprising or corresponding to a pool of protein sequences, over or under expressed, in breast tissue or cells, the pool corresponding to the protein sets previously described.
Preferably, the protein libraries may be immobilized on a solid support which may preferably be selected from the group comprising nylon membrane, nitrocellulose membrane, polyvinylidene difluoride, glass slide, glass beads, polystyrene plates, membranes on glass support and silicon chip or gold chip.
In a further aspect, the invention provides a method for analyzing differential protein expression associated with histopathologic features of breast disease comprising detecting overexpression or underexpression of a pool of protein in breast tissues comprising:
As an alternative to breast tissue cells from a patient, detecting over or under expression of the pool of protein may be carried out on breast tumor cell lines.
The proteins may be directly or indirectly labeled before reaction step (b) with a label which may be selected from the group comprising radioactive, colorimetric, enzymatic, molecular amplification, bioluminescent or fluorescent labels. Advantageously, one or more specific labels are used for each protein of the library. A person skilled in the art will be able to select appropriate labels and labelling methods to carry out the invention. For example, one may use a label selected from the group comprising, but not limited to: biotin and digoxigenin.
Measuring over or under expression of proteins may be carried out on cell or tissue, frozen or embedded in any appropriate material, e.g., paraffin, e.g., tissue microarray. Various known methods may be used such as, e.g., ImmunoHistoChemistry (IHC) technologies. Measuring over or under expression of proteins may also be carried out with, e.g., protein (micro)arrays, antibody (micro)arrays, antigen (micro)arrays or any other appropriate technology, e.g., by using the previously defined supports.
According to an advantageous aspect, the method for analysing differential protein expression of the invention further comprises:
The invention is useful for detecting, diagnosing, staging, monitoring, predicting, and preventing conditions associated with breast cancer. It is particularly useful for predicting clinical outcome of breast cancer and/or predicting occurrence of metastatic relapse and/or determining the stage or aggressiveness of a breast disease in at least about 50%, e.g., at least about 55%, e.g., at least about 60%, e.g., at least about 65%, e.g., at least about 70%, e.g., at least about 75%, e.g., at least about 80%, e.g., at least about 85%, e.g., at least about 90%, e.g., at least about 95%, e.g., about 100% of the patients. The invention is also useful for selecting more appropriate doses and/or schedule of chemotherapeutics and/or biopharmaceuticals and/or radiation therapy to circumvent toxicities in a patient.
The invention is also useful for selecting appropriate doses and/or schedule of chemotherapeutics and/or (bio)pharmaceuticals, and/or targeted agents, which include Aromatase Inhibitors (e.g., Exemestane, Anastrozole, Letrozole), Anti-estrogens (e.g., Fulvestrant, Tamoxifen), Taxanes (e.g., Paclitaxel, Docetaxel), Anthracyclines (e.g., Doxorubicin, Cyclophosphamide), CHOP (Doxorubicin, Cyclophosphamide, Oncovin, prednisone when taken in combination). Other drugs such as Velcade™, 5-Fluorouracil, Vinblastine, Gemcitabine, Methotrexate, Goserelin, Irinotecan, Thiotepa, Topotecan or Toremifene are included as well.
Targeted therapies include the use of Iressa (gefitinib, ZD1839, anti-EGFR, PDGFR, c-kit, Astra-Zeneca); ABX-EGFR (anti-EGFR, Abgenix/Amgen); Zarnestra (FTI, J & J/Ortho-Biotech); Herceptin (anti-HER2/neu, Genentech); Avastin (bevacizumab, anti-VEGF antibody, Genentech); Tarceva (erlotinib, OSI-774, RTK inhibitor, Genentech-Roche); ZD6474 (anti-VEGFR, Astra-Zeneca); Erbitux (IMC-225, cetuximab, anti-EGFR, Imclone/BMS); Oncolar (anti-GRH, Novartis); PD-183805 (RTK inhibitor, Pfizer); EMD72000 (anti-EGFR/VEGF ab, Merck KGaA); CI-1033 (HER2/neu & EGF-R dual inhibitor, Pfizer); EGF10004; Herzyme (anti-HER2 ab, Medizyme Pharmaceuticals); Corixa (Microsphere delivery of HER2/neu vaccine, Medarex).
Further relevant anti-breast cancer agents are described by Awada et al. in “The pipeline of new anticancer agents for breast cancer treatment in 2003,” Critical Reviews in Oncology/Hematology 48 (2003) 45-63, the content of which is incorporated herein by reference.
Advantageously, in the method, breast tissue cells may be obtained from a patient regardless of whether or not the patient has received a neo-adjuvant or adjuvant, e.g., systemic, therapy. Similarly, treated or untreated cell lines may be used.
Advantageously, in the method, breast tissue cells may be obtained from a patient regardless of ER expression.
In a further aspect, the invention provides a method for treating a patient with breast cancer comprising (i) implementing a method for analysing differential protein expression on a sample from the patient, and (ii) determining a treatment for the patient based on the analysis of differential protein expression profile obtained in step i).
In a further aspect, the invention relates to a method for analyzing differential protein expression associated with histopathologic features of breast disease, wherein detecting overexpression or underexpression of the pool of protein in breast tissues comprises detecting overexpression or underexpression of nucleic acids coding for the proteins.
The invention further relates to a nucleic acids library useful for the molecular characterization of histopathologic features of breast disease comprising nucleic acids coding for the over or underexpressed proteins, or equivalents thereof.
The sequences of the nucleic acids of the library are readily available to a person skilled in the art, who may, for example, use printed publications describing the sequences and/or public databases, e.g., the National Center for Biotechnology Information (NCBI) database, that provide such sequences as well. The content of the NCBI database is available via the internet at the following address: http://www.ncbi.nlm.nih.gov/.
Definitions
“Aggressiveness of cancer” refers to cancer growth rate or potential to metastasize; a so-called “aggressive cancer” will grow or metastasize rapidly or significantly affect overall health status and quality of life.
“Adjuvant therapy” refers to treatment involving radiation, chemotherapy (drug treatment), biologic therapy (vaccines) or hormone therapy, or any combination given after primary treatment.
“Antibody” is intended to include whole antibodies, e.g., of any isotype, and includes fragments thereof which are also specifically reactive with a vertebrate, e.g., mammalian, protein. Antibodies can be fragmented using conventional techniques and the fragments screened for utility in the same manner as described above for whole antibodies. Thus, the term includes segments generated by proteolytic cleavage or prepared recombinant portions of an antibody molecule capable of selectively reacting with a certain protein. Non-limiting examples of such proteolytic and/or recombinant fragments include Fab, F(ab′)2, Fab′, Fv, and single chain antibodies (scFv) containing a V[L] and/or V[H] domain joined by a peptide linker. The scFv's may be covalently or non-covalently linked to form antibodies having two or more binding sites. Antibodies may include polyclonal, monoclonal, or other purified preparations of antibodies and recombinant antibodies.
“Associated with” refers to a disease in a subject which is caused by, contributed to by, or causative of an abnormal level of expression of a protein.
“Control” comprises, for example, proteins from a sample of the same patient or from a pool of different patients, or selected among reference proteins which may be already known to be over or under expressed. The expression level of the control can be an average or an absolute value of the expression of reference proteins. These values may be processed to accentuate the difference relative to the expression of the proteins according to the invention. The analysis of the over or under expression of proteins can be carried out on samples such as biological material derived from any mammalian cells, including cell lines, xenografts, human tissues, preferably breast tissue, and the like. The method according to the invention may be performed on a sample from, e.g., cell lines, healthy donors, patients or an animal (for example, for veterinary application or preclinical studies).
“Directly or indirectly labeled” includes proteins the sub-constituents of which, i.e., amino acids or amino acid groups or atoms, are themselves labeled (directly), as well as proteins labeled by the intermediate of any element able to recognize and bind to the targeted protein, e.g., an antibody.
“Equivalent” includes nucleic acids encoding functionally equivalent proteins. Equivalent nucleotide sequences include sequences that differ by one or more nucleotide substitutions, additions or deletions, such as allelic variants, and will therefore include sequences that differ from the nucleotide sequence of the nucleic acids of the invention because of the degeneracy of the genetic code.
“Good-prognosis” and “poor-prognosis,” respectively, refer to favorable (e.g., remission) or unfavorable (e.g., metastasis, death) patient clinical outcome.
“Histopathologic features of breast diseases” includes diseases, disorders or conditions known as, lethally or not, affecting breast cells and/or tissues, including but not limited to breast tumours, for example i) non cancerous breast diseases, for example, hyperplasias, metaplasias, fibroadenomas, fibrocystic disease, papillomas, sclerosing adenosis or preneoplastic, or ii) breast cancer. “Breast cancer” includes but is not limited to:
“ImmunoHistoChemistry (IHC)” refers to methods using histochemical localization of immunoreactive substances using antibodies as reagents on cells or tissues by technologies such as, but not limited to flow cytometry, ELISA, Western and Southwestern Blot Analysis, and frozen and paraffin-embedded samples.
“Nucleic acids” refers to polynucleotides, e.g., isolated, such as deoxyribonucleic acid (DNA), and, where appropriate, ribonucleic acid (RNA). The term should also be understood to include, as equivalents, analogs of RNA or DNA made from nucleotide analogs, and, as applicable to the aspect being described, single (sense or antisense) and double-stranded polynucleotides. ESTs, chromosomes, cDNAs, mRNAs, and rRNAs are representative examples of molecules that may be referred to as nucleic acids.
“Over or underexpression” may comprise the detection of differences in expression of the proteins according to the invention in relation to at least one control.
“Predicting clinical outcome” refers to the ability of one skilled in the art to classify patients into at least two classes, “good prognosis” and “poor prognosis,” showing significantly different long-term Metastasis Free Survival (MFS).
“Protein” refers to a polypeptide with a primary, secondary, tertiary or quaternary structure, or any portion or modification, e.g., a mutant, or isoform thereof. A “portion” or “modification” of a protein retains at least one biological or antigenic characteristic of a native (wild-type) protein.
“Protein microarray” refers to a spatially defined and separated collection of individual proteins immobilised on a solid surface.
“Treating” as used herein is intended to encompass treating as well as ameliorating at least one symptom of the condition or disease.
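The definitions of “Control” and “Over or underexpression” above can be illustrated with a minimal sketch. It assumes one numeric staining score per sample and takes the control level as the average of reference scores, which is one of the options named in the “Control” definition. The function name `classify_expression`, the tolerance band and the example scores are hypothetical and are not part of the described method.

```python
from statistics import mean

def classify_expression(sample_score, control_scores, tolerance=0.5):
    """Label a protein as over-, under- or normally expressed versus a control.

    control_scores: scores of reference samples; their average is the control level.
    tolerance: illustrative band around the control level (an assumption,
               not a validated cut-off).
    """
    control_level = mean(control_scores)
    if sample_score > control_level + tolerance:
        return "over"
    if sample_score < control_level - tolerance:
        return "under"
    return "unchanged"

# Hypothetical scores for one protein in four control samples (average 1.0)
controls = [1.0, 1.5, 1.0, 0.5]
print(classify_expression(3.0, controls))  # over
print(classify_expression(0.0, controls))  # under
```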
We combined IHC and TMA to measure the expression levels of selected proteins in a consecutive series of 552 patients with early stage breast cancer. We determined protein combinations to refine tumor classification and improve the prognostic classification of disease.
Protein Expression Profiling Identifies Subclasses of Breast Cancer
Analysis and interpretation of the large amount of data generated (552 samples and 26 antibodies, about 14,000 data points) required us to develop bioinformatic tools. As a first step, we applied pre-existing unsupervised hierarchical clustering algorithms as previously reported.19-24 Two recent studies on breast cancer analyzed the expression of 15 proteins in 166 tumors,22 and 13 proteins on 107 samples,19 respectively. Several of these markers were included in this work (BCL2, ER, PR, ERBB2, EGFR, Cyclins, Cytokeratins, MIB1, P53), allowing for direct comparison of results. In our analysis, clustering allowed the identification of four major coherent protein clusters designated according to the function of most included proteins: “ER-related cluster”, “adhesion cluster”, “mitosis cluster” and “proliferation cluster.” Correlated expression of proteins may be due to different mechanisms such as coregulation (e.g., ER/BCL230), functional interaction (e.g., STK6/Taxins27, 28), phenotypic association (e.g., ERBB2/P5331) or chromosomal location (e.g., FGFR1/TACC1 located on 8p11). Some co-expressed proteins were previously reported in RNA or protein expression profiling studies. For example, ER, PR, BCL2 and GATA3 clustered together.8-10, 13 This “ER-related cluster” was negatively correlated with the “mitosis” and “proliferation” clusters, in agreement with the higher proliferation index in ER-negative tumors32 and the known proliferation-differentiation balance in carcinomas.
The “ER-related cluster” was close to the “adhesion cluster” that included other markers that may correlate positively with ER expression such as FHIT,33 CK8/18,19, 22 CCND134 and MUC1.8 Our “proliferation cluster” had some similarities to that identified by others with the common presence of P53, Ki67, CCNE, ERBB2 and CK5/619 or CCNE, ERBB2, EGFR and CK5/6.22 Interestingly, this cluster also included CDH3/P-Cadherin, present in a “basal cluster” identified in gene expression analyses9 and previously shown to be overexpressed in a subgroup of breast carcinomas associated with higher proliferation rates and aggressive behavior.35
Hierarchical clustering sorted tumors into three clusters that correlated with relevant histoclinical parameters, including histological type, SBR grade, ER status, ERBB2 status and the presence or absence of peritumoral vascular emboli. Correlations were found between the characteristics of these tumor clusters and their protein expression profiles. For example, the high number of grade III tumors in cluster B, as well as the high number of ERBB2-positive samples, agreed with the frequent strong expression of the “proliferation” cluster—which included ERBB2—and the “mitosis” cluster in these tumors. Conversely, 99% of cluster A1 samples were ER-positive, and showed a frequent strong expression of the “ER-related” cluster and low expression of the “proliferation cluster”.32
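The unsupervised step described above can be sketched as a small agglomerative clustering routine. The source does not state the distance metric or linkage rule that was used, so Euclidean distance and average linkage are assumptions here, and the six toy tumor profiles (scores for three hypothetical markers such as ER, CK8/18 and CK5/6) are invented for illustration only.

```python
from itertools import combinations
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two protein expression profiles."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(cluster_a, cluster_b, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(profiles[i], profiles[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def hierarchical_clusters(profiles, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        # combinations yields index pairs (a, b) with a < b
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_linkage(
                clusters[pair[0]], clusters[pair[1]], profiles),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return [sorted(c) for c in clusters]

# Hypothetical scores per tumor: (ER, CK8/18, CK5/6)
profiles = [
    (3, 3, 0), (3, 2, 0),   # luminal-like pattern
    (0, 0, 3), (0, 1, 3),   # basal-like pattern
    (1, 2, 1), (2, 2, 1),   # intermediate pattern
]
print(hierarchical_clusters(profiles, n_clusters=3))
# [[0, 1], [2, 3], [4, 5]]
```

On a real 552 x 26 score matrix, a dedicated library routine would be preferable to this quadratic toy loop; the sketch only shows the merging logic.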
Interestingly, the tumor clusters also correlated with a breast cancer classification recently proposed in two series of analyses that provided a new conceptual framework of mammary oncogenesis. First, phenotypic analyses have established a three-cell phenotypic classification of breast cancer cells.22, 36, 37 These authors suggested that biomarkers such as the intermediate filament cytokeratins (CK), encoded by a large number of keratin genes, are able to distinguish between distinct cell subpopulations within the mammary gland epithelial compartment. It has been proposed that “basal” cells contain mammary gland progenitor cells able to give rise to both “luminal” and “myoepithelial”38 cells (reviewed in 39). Progenitor cells express type II keratins CK5 and 6. In contrast, differentiated “luminal” cells express type II keratin CK8 and type I keratin CK18, which are also observed in normal simple and glandular epithelia. Luminal cells also express ER.10, 11 Use of tissue microarray screening has confirmed this emerging theory.19, 22 Second, recent gene expression analyses using DNA microarrays have led to a similar identification of subclasses of breast tumors that corresponded to the phenotypic classification.9-11
These experiments concur in establishing a distinction between several types of epithelial cells in the mammary gland. The origin of the breast malignant cell remains unknown. Two major types of breast cancer may derive from basal/progenitor or luminal cells, respectively. Alternatively, most tumors may originate from pluripotent stem cells and reach different stages of differentiation.40 Our results support this new classification model. Tumor cluster A1 may be approximated to a cluster of luminal cell-like tumors, with frequent strong expression of ER and CK8/18. Cluster B may consist of tumors with basal/progenitor, ER-negative characteristics, i.e., strong expression of CK5/6 and proliferation markers. A2 tumors, with an intermediate profile, may represent a transitory “baso-luminal” stage, or consist of tumors that have lost ER function. It can be expected that luminal A1 tumors, in which the bulk of cells are more differentiated and express ER-related cluster proteins, are of better prognosis, whereas more undifferentiated and proliferative basal B tumors are associated with poor prognosis. The significant differences in clinical outcome observed between the three defined tumor clusters in this study are consistent with this model and recent studies.9-11, 41 In addition, we discovered that lobular carcinomas are luminal-like tumors, and comprise differentiated luminal cells that express CK8/18.
Protein Expression Profiling Predicts Clinical Outcome of Breast Cancer
Thus, classical unsupervised hierarchical clustering applied to all tested proteins was able to identify biologically and clinically relevant classes of breast cancer. Recently, supervised methods have been successfully applied to gene expression data analysis in parallel with unsupervised approaches. In a second step, we thus developed a supervised method to identify the best combination within 26 proteins that would further improve the prognostic classification. To our knowledge, our study is the first application of such supervised methods to large-scale IHC data. We identified a 21-protein set which optimally classified patients into two classes (“good-prognosis” and “poor-prognosis class”) with significantly different long-term MFS.
Initially identified in a random learning set of 368 patients, this prognostic signature was validated in an independent set of 184 patients, showing its robustness. Our discriminator set included 10 proteins coded by genes identified across recent gene expression studies,7-15 as well as other proteins with unclear role in disease progression and sensitivity to systemic therapy. The prognostic value of the signature was increasingly accurate with the addition of other proteins as evidenced by univariate and multivariate analyses, further highlighting the strength of large-scale molecular analyses for understanding tumor heterogeneity through the identification of expression signatures.
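The random division into a learning set of 368 patients and an independent validation set of 184 patients (two-thirds and one-third of the 552 patients) can be sketched as follows. The function name, the use of integer patient identifiers and the fixed seed are assumptions made for reproducibility of the illustration; the source does not describe how the random draw was performed.

```python
import random

def split_learning_validation(patient_ids, n_learning, seed=0):
    """Randomly split patient identifiers into learning and validation sets."""
    ids = list(patient_ids)
    rng = random.Random(seed)   # fixed seed so the split can be reproduced
    rng.shuffle(ids)
    return sorted(ids[:n_learning]), sorted(ids[n_learning:])

learning, validation = split_learning_validation(range(552), n_learning=368)
print(len(learning), len(validation))  # 368 184
```

Keeping the validation set untouched while the signature is selected on the learning set is what allows the second set to serve as an independent check.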
The classification based on the 21-protein predictor was associated with a highly significant difference in clinical outcome. The 5-year MFS was 90% for patients of the “good-prognosis class” and only 62% for patients of the “poor-prognosis class.” When compared in multivariate analysis with classical prognostic factors and with each tested protein separately, our classification performed significantly better for predicting the occurrence of metastatic relapse. Such prognostic association persisted when applied to patients with lymph node-positive and lymph node-negative cancer.
Interestingly, the MFS of node-negative patients from the “poor-prognosis class” was similar to that of node-positive patients from the “good-prognosis class.” Notably, our molecular classification performed better than that defined by St-Gallen and NIH criteria for node-negative patients. This finding is of particular significance, since about 75% of node-negative patients who are candidates for adjuvant chemotherapy based on the St-Gallen/NIH criteria are currently thought to be over-treated.
Our 21-protein predictor assigned fewer node-negative patients to the “poor-prognosis class,” and their clinical outcome was more frequently unfavorable than it was for patients assigned to the high-risk class defined by St-Gallen or NIH criteria. Our predictor also performed well in patients irrespective of ER status. The 5-year MFS was 90% for ER-positive patients from the “good-prognosis class,” and 58% for ER-positive patients from the “poor-prognosis class,” suggesting our 21-protein set may provide more accurate clinical information than ER status alone, possibly reflecting functional differences in the ER pathway.
Additionally, our molecular classification conserved its predictive impact for patients independent of adjuvant systemic therapy. Since distant metastasis may be influenced by adjuvant therapy, we separately analyzed the 186 patients who did not receive any chemotherapy or hormone therapy, as well as the 133 patients who exclusively received adjuvant chemotherapy, with an anthracycline-based regimen in most cases.
Interestingly, we found within the group of 186 untreated patients an odds ratio of 7.45 for metastatic relapse in the “poor-prognosis class” when compared with patients of the “good-prognosis class.” Similar discrimination was observed within the 133 patients treated with chemotherapy alone, with a corresponding odds ratio of 3. Thus, the 21-protein signature may facilitate the selection of appropriate treatment options in early breast cancer patients. It may be an important clinical tool to circumvent unnecessary, toxic and costly treatment of node-negative patients, and it may help in selecting, among patients who need adjuvant chemotherapy, those who might benefit from a standard protocol and those who would be candidates for another protocol or another form of systemic therapy.
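For readers unfamiliar with the statistic, an odds ratio for metastatic relapse between the two prognostic classes can be computed from a 2 x 2 table as sketched below. The counts are hypothetical and were chosen only to give a round value; they are not the counts of this study, and the source does not state whether its odds ratios came from a crude table or from a fitted model.

```python
def odds_ratio(poor_relapse, poor_no_relapse, good_relapse, good_no_relapse):
    """Crude odds ratio of relapse, poor-prognosis versus good-prognosis class."""
    if poor_no_relapse == 0 or good_relapse == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (poor_relapse * good_no_relapse) / (poor_no_relapse * good_relapse)

# Hypothetical counts: 30 of 70 poor-prognosis and 10 of 110 good-prognosis
# patients relapse, i.e. odds of 30/40 versus 10/100.
print(odds_ratio(30, 40, 10, 100))  # 7.5
```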
Materials and Methods
Patients and Histological Samples
A consecutive series of 552 women with early (stage I, II or III) breast cancer treated at the Institut Paoli-Calmettes before December 1999 was studied using the TMA technology. The stage of disease was defined according to TNM classification (Union Internationale Contre le Cancer, UICC, TNM, 5th edition). Patients with locally advanced, inflammatory or metastatic disease, or with previous history of cancer were not included. Tumors were invasive adenocarcinomas including, according to the WHO histological typing, 388 ductal carcinomas (70%), 72 lobular (13%), 24 mixed (4%), 40 tubular (8%), 8 medullary (1%) and 20 other types (4%). Clinical annotation of each sample included patient age, axillary lymph node status, pathological tumor size, Scarff-Bloom-Richardson (SBR) grade, peritumoral vascular invasion, estrogen receptor (ER), progesterone receptor (PR) and ERBB2 status as evaluated by IHC, with positivity cut-off values of 1% for hormone receptors and a 2+ or 3+ score (HercepTest kit scoring guidelines) for ERBB2. The characteristics of patients are listed in Table 2 (see first column only).
*as defined using the 21-protein signature;
**P-values for the comparison of numbers of patients were calculated using the Chi-2 test, and P-values for the comparison of metastasis-free survival (MFS) were calculated using the log-rank test;
NS, not significant;
***calculated, for the 450 patients who did not experience metastatic relapse as a first event, from the date of diagnosis to the time of last follow-up;
CI denotes confidence interval.
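The positivity cut-offs given above for the clinical annotation (1% of stained cells for hormone receptors, a 2+ or 3+ score for ERBB2) can be written as two small predicates. This is a sketch only: the function names are hypothetical, and treating a value of exactly 1% as positive is an assumption, since the text does not say whether the cut-off is inclusive.

```python
def hormone_receptor_positive(percent_stained_cells):
    """ER or PR status from the percentage of stained tumor cells.

    Assumes the 1% cut-off is inclusive (exactly 1% counts as positive).
    """
    return percent_stained_cells >= 1.0

def erbb2_positive(score):
    """ERBB2 status from an integer IHC score of 0, 1, 2 or 3 (for 0, 1+, 2+, 3+)."""
    if score not in (0, 1, 2, 3):
        raise ValueError("score must be 0, 1, 2 or 3")
    return score >= 2

print(hormone_receptor_positive(0.5), hormone_receptor_positive(10))  # False True
print(erbb2_positive(1), erbb2_positive(3))                           # False True
```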
Patients were treated according to the following guidelines: all had primary surgery that included complete resection of breast tumor (modified radical mastectomy in 28% of cases and lumpectomy in 72%) and axillary lymph node dissection; 96% of patients (including 100% of those treated with breast-conservative surgery) received adjuvant local-regional radiotherapy; 47% were given adjuvant chemotherapy (anthracyclin-based regimen in most cases), and 42% received adjuvant hormone treatment (tamoxifen for most cases). After completion of local-regional treatment, patients were evaluated at least twice per year for the first 5 years and at least annually thereafter. The median follow-up was 57 months (range, 2 to 182) after diagnosis for the 450 patients who did not experience metastatic relapse as a first event, 37 months (range, 4 to 151) for the 102 patients with metastasis as first event, and 51 months (range, 2 to 182) for all patients. The 5-year MFS rate was 80% [95% CI 76.2-83.7].
Tissue Microarray Construction
TMAs were prepared as previously described25 with slight modifications. For each tumor, three representative areas from the primary tumor were carefully selected from a hematoxylin-eosin stained section of a donor block. Core cylinders with a diameter of 0.6 mm each were punched from each of these areas and deposited into three separate recipient paraffin blocks using a specific arraying device (Beecher Instruments, Silver Spring, Md.). The TMA technique allows the analysis of tumors and controls under identical experimental conditions. In addition to tumor tissues, the recipient block also received 10 normal breast tissue samples from 10 healthy women who underwent reductive mammary surgery and pellets from nine mammary cell lines. Five-μm sections of the resulting TMA block were made and used for IHC analysis after transfer onto glass slides. We previously assessed the reliability of the method by comparison with the standard immunohistochemical method for the usual prognostic parameters; the value of the kappa test was 0.95.25
Selection of the 26 Markers
Selection of the proteins was performed according to the following criteria: known or potential importance in breast cancer and availability of a corresponding antibody that performed well in IHC on paraffin-embedded tissues. Twenty-six proteins were selected including hormone receptors (ER, PR), subclass markers (Cytokeratins), oncogenes and proliferation proteins (ERBB family members, BCL2, Cyclins, MIB1, FGFR1, Aurora A, Taxins), tumor suppressors (P53, FHIT), adhesion molecules (Cadherins, Catenins, Afadin), proteins from oncogenes of amplified genomic regions (ERBB2, CCND1, STK6), and other potential prognostic markers identified in specific studies or previous DNA microarray experiments (CCNE, GATA3, MUC1). Twelve out of the 26 proteins were mentioned as potential significant genes in RNA expression profiling studies in breast cancer.6-15 The characteristics of the antibodies used are listed in Table 3. When available, several antibodies were studied for comparison, and only the reagents that gave the best quality data were kept for the global analysis.
Mmab: mouse monoclonal antibody;
Rpab: rabbit polyclonal antibody;
DTRS: Dako target retrieval solution.
Immunohistochemical Analysis
IHC was carried out on five-μm sections of tissue fixed in alcohol formalin for 24 h and embedded in paraffin. Sections were deparaffinized in Histolemon (Carlo Erba Reagenti, Rodano, Italy) and rehydrated in graded alcohol. Antigen retrieval was accomplished by incubating the sections in pre-treatment solutions depending on the antibody used. Pretreatment conditions are listed in Table 3. The reactions were carried out using an autoimmunostainer (Dako Autostainer). Staining was performed at room temperature as follows: rehydrated tissues were washed in phosphate buffer, and endogenous peroxidase activity was quenched by treatment with 0.1% H2O2; slides were then incubated with blocking serum (Dako) for 30 min., then with the affinity-purified antibody for one hour. After washing, slides were sequentially incubated with biotinylated antibody against rabbit IgG for 20 min. followed by streptavidin-conjugated peroxidase (Dako LSAB®2 kit), then visualized with Diaminobenzidine (3-amino-9-ethylcarbazole). Slides were counter-stained with hematoxylin, coverslipped using Aquatex (Merck, Darmstadt, Germany) mounting solution, then evaluated under a light microscope by two pathologists. The results were expressed in terms of percentage (P) and intensity (I) of positive cells as previously described.25 For each sample, the mean of the score of a minimum of two core biopsies was calculated. The results were then scored by the quick score (Q) (Q=P×I), except for ERBB2 status that was evaluated with the Dako scale (HercepTest™ kit scoring guidelines).
Quick score allowed separating tumors into two or three classes. Homogeneous classes were defined by grouping samples with an equivalent staining level according to the distribution curves as described.25 Two classes (negative and positive) were defined for Afadin, α and β Catenins, BCL2, Cyclins D1 and E, Cytokeratins 5/6 and 8/18, EGFR, ERBB3, ERBB4, FGFR1, GATA3, MIB1, P53, P-Cadherin, PR and TACC3, with a positivity cut-off value of Q=1, except for Cyclin D1 and MIB1 with a positivity cut-off value of 10 and 20, respectively. Three classes were defined (negative, moderate and strong staining) for Aurora A, E-Cadherin, ER, FHIT, MUC1, TACC1, and TACC2, with negative (Q=0), moderate (0<Q≦100) or strong expression (100<Q≦300). For ERBB2, three classes (0/1+, 2+, 3+) were obtained with the Dako scale.
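The scoring rules above can be illustrated with a short Python sketch. It assumes that P is the percentage of positive cells (0 to 100) and I an intensity grade from 0 to 3, which is consistent with the stated maximum of Q=300 but is not spelled out in the text. The function names are hypothetical, and treating the two-class cut-off as inclusive (a score at or above the cut-off is positive) is likewise an assumption.

```python
def quick_score(percent_positive: float, intensity: int) -> float:
    """Quick score Q = P x I (assumed ranges: P in 0..100, I in 0..3)."""
    if not 0 <= percent_positive <= 100:
        raise ValueError("P must lie between 0 and 100")
    if intensity not in (0, 1, 2, 3):
        raise ValueError("I must be 0, 1, 2 or 3")
    return percent_positive * intensity


def three_class(q: float) -> str:
    """Three-class rule from the text: Q=0, 0<Q<=100, 100<Q<=300."""
    if q == 0:
        return "negative"
    if q <= 100:
        return "moderate"
    return "strong"


def two_class(q: float, cutoff: float = 1) -> str:
    """Two-class rule; the inclusive comparison is an assumption."""
    return "positive" if q >= cutoff else "negative"


if __name__ == "__main__":
    q = quick_score(60, 2)          # 60% of cells stained at intensity 2
    print(q, three_class(q))        # three-class markers such as ER
    print(two_class(q, cutoff=20))  # a marker scored with a cut-off of 20
```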
Data Analysis
A combination of exploratory unsupervised and supervised bioinformatic methods was used to analyze these immunohistochemical profiles. First, we applied unsupervised hierarchical clustering similar to that used in gene expression profiling studies. Data were reformatted using the following scoring system: −2 designated negative staining, 1 weakly positive staining, 2 strongly positive staining and missing data were left blank in the scored table. Hierarchical clustering investigates relationships between samples and between proteins, based on the similarity of sample immunoreactive scores. We used the Cluster program (average-linkage with Pearson correlation as similarity metric) and results were displayed with the TreeView software.26
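As an illustration of the similarity measure underlying this clustering, the following Python sketch recodes staining levels as described and computes a Pearson correlation between two samples over the proteins scored in both. It is not the Cluster program itself: the use of a centered correlation, the handling of blanks by pairwise deletion, and all names are assumptions made for the example.

```python
from math import sqrt
from typing import Optional, Sequence

# Recoding quoted in the text; a blank cell is represented here by None.
RECODE = {"negative": -2, "weak": 1, "strong": 2, "missing": None}


def pearson(a: Sequence[Optional[float]],
            b: Sequence[Optional[float]]) -> Optional[float]:
    """Centered Pearson correlation over positions scored in both profiles."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return None  # not enough shared data to compare
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        return None  # a constant profile has no defined correlation
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    return sxy / sqrt(sxx * syy)


if __name__ == "__main__":
    # Two hypothetical tumors scored on four proteins.
    tumor_1 = [RECODE[s] for s in ("negative", "weak", "strong", "missing")]
    tumor_2 = [RECODE[s] for s in ("strong", "negative", "negative", "weak")]
    print(pearson(tumor_1, tumor_2))
```

Average-linkage clustering would then repeatedly merge the two groups with the highest mean pairwise similarity.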
We then performed supervised analysis to identify the protein-set that best distinguished between two classes of samples with different clinical outcome. To simplify the analyses, the IHC scores were recorded as negative (negative staining) or positive (weak and strong positive staining). The classifier was derived through training on a subset of chosen samples (⅔ of population, learning set) and then validated on the remaining subset (⅓ of population, validation set). The assignment of samples to each set was random, but the ratio between tumors with and without metastatic relapse was preserved. An exhaustive testing comprising all combinations of 1 to 5 proteins, as well as the complementary combinations of 21 to 25 proteins was performed to assess their ability to classify tumors into 2 classes (“poor-prognosis” and “good-prognosis”) in agreement with their clinical outcome.
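The split and the size of the search space can be sketched in Python as follows. The example assumes the 102 relapsing and 450 non-relapsing patients reported in this section; the function name, the fixed random seed and the rounding rule are hypothetical choices for illustration, not the procedure actually used.

```python
import random
from math import comb


def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Randomly split sample indices while preserving the class ratio."""
    rng = random.Random(seed)
    learning, validation = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        learning += idx[:cut]
        validation += idx[cut:]
    return sorted(learning), sorted(validation)


# 1 = metastatic relapse as first event, 0 = no metastatic relapse.
labels = [1] * 102 + [0] * 450
learning, validation = stratified_split(labels)
print(len(learning), len(validation))

# Number of protein sets tested among 26 markers.
n_small = sum(comb(26, k) for k in range(1, 6))    # sets of 1 to 5 proteins
n_large = sum(comb(26, k) for k in range(21, 26))  # sets of 21 to 25 proteins
print(n_small, n_large, n_small + n_large)
```

Because choosing k proteins to keep is equivalent to choosing 26−k proteins to leave out, the two ranges contain the same number of combinations, which keeps the exhaustive search tractable.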
Using the protein expression scores of each combination, we developed a “Metastasis Scoring” system that assigned to each tumor a probability of belonging to the “poor-prognosis class” or the “good-prognosis class.” Consider a combination of N proteins P1, …, PN (where N ranges from 1 to 5 and 21 to 26) and two predefined classes X, Y of tumors within the learning set: X={X1, …, XK} includes samples with metastatic relapse during the follow-up and Y={Y1, …, YM} includes samples without any metastatic relapse. For each protein combination tested, one tumor is represented as a ternary vector (e.g. X1={X1(P1), …, X1(PN)}) where each component is scored 0 for missing data or +1/−1 for positive/negative IHC staining. Every tumor Z has a score S(Z) defined as follows. For each protein Pi, we compute the frequencies of +1/−1 value in the X class (adjusted to avoid a 0 probability):
where, for instance, card{k: Xk(Pi)=+1} is the number of X tumors with positive IHC staining for protein Pi. Similarly we compute the frequencies fYi(+1) and fYi(−1) in the Y class and we define f•i(0)=1. The Metastasis Score of tumor Z is the log ratio of the joint probabilities:
Samples were then sorted according to their S(Z) score. The natural threshold that divides the population in 2 classes is S=0: if S(Z)>0 then Z is more similar to the class X and is predicted to belong to the “poor-prognosis class” and if S(Z)<0 then Z is more similar to the class Y and is predicted to belong to the “good-prognosis class.” The number of misclassifications (error rate) was defined as the number of X tumors classified in the “good-prognosis class” plus the number of Y tumors classified in the “poor-prognosis class.” The best classifier protein-set was that with the minimal rate of misclassified tumors.
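A minimal Python sketch of such a scoring scheme is given below. Two points are assumptions rather than the method as filed: the frequency adjustment is implemented as add-one smoothing, and the joint probability is taken as the product of the per-protein frequencies, so that S(Z) is a sum of per-protein log ratios. A score of exactly 0, which the text does not assign to either class, is reported as undetermined. All names and the toy data are hypothetical.

```python
from math import log


def class_frequencies(samples, n_proteins):
    """Per-protein frequencies of +1 and -1 in one class (0 = missing)."""
    freqs = []
    for i in range(n_proteins):
        pos = sum(1 for s in samples if s[i] == +1)
        neg = sum(1 for s in samples if s[i] == -1)
        total = pos + neg + 2  # add-one smoothing: an assumed adjustment
        freqs.append({+1: (pos + 1) / total,
                      -1: (neg + 1) / total,
                      0: 1.0})  # f(0) = 1, so missing data adds nothing
    return freqs


def metastasis_score(z, freq_x, freq_y):
    """Log ratio of the joint frequencies of profile z in classes X and Y."""
    return sum(log(fx[v]) - log(fy[v])
               for v, fx, fy in zip(z, freq_x, freq_y))


def predict(z, freq_x, freq_y):
    s = metastasis_score(z, freq_x, freq_y)
    if s > 0:
        return "poor-prognosis"
    if s < 0:
        return "good-prognosis"
    return "undetermined"


# Toy learning set with two proteins (hypothetical data).
X = [(+1, -1), (+1, 0), (+1, -1)]   # tumors with metastatic relapse
Y = [(-1, +1), (-1, +1), (0, +1)]   # tumors without metastatic relapse
fx, fy = class_frequencies(X, 2), class_frequencies(Y, 2)

if __name__ == "__main__":
    for z in [(+1, -1), (-1, +1), (0, 0)]:
        print(z, round(metastasis_score(z, fx, fy), 3), predict(z, fx, fy))
```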
Once identified, the prognostic power of the classifier was tested on the validation set by classifying the remaining independent tumors using the same approach. Finally, it was assessed on the whole population. For each tumor set, the prognostic impact was further estimated by univariate analyses that compared the rate of metastatic relapses within the two molecularly defined classes of tumors (Fisher exact test).
Statistical Methods
Distributions of molecular markers and other categorical variables were compared using either the standard Chi-2 test or Fisher exact test. The follow-up was calculated from the date of diagnosis to the time of metastasis as first event or time of last follow-up for censored patients. The end point was the metastasis-free survival (MFS), calculated from the date of diagnosis, first metastasis being scored as an event. All other patients were censored at the time of the last follow-up, death, recurrence of local or regional disease, or development of a second primary cancer, including contralateral breast cancer. Survival curves were derived from Kaplan-Meier estimates and were compared by log-rank test. The influence of molecular grouping, adjusted for other factors including classical prognostic factors and significant IHC measurement, was assessed in multivariate analysis by the Cox proportional hazards model. Survival rates and odds ratios (OR) are presented with their 95% confidence intervals (95% CI). Statistical tests were two-sided at the 5% level of significance. All statistical tests were done using SAS Version 8.02.
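For readers unfamiliar with the Kaplan-Meier estimate, the following Python sketch shows the product-limit calculation on a handful of made-up follow-up times. It is only an illustration of the estimator: the analyses reported here were run in SAS, and the function name and data are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit estimate.

    times  : follow-up in months for each patient
    events : 1 if metastasis occurred at that time, 0 if censored
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Patients with an event or censored at t leave the risk set.
        at_risk -= n_leaving
    return curve


if __name__ == "__main__":
    months = [6, 12, 12, 20, 30]   # hypothetical follow-up times
    status = [1, 0, 1, 1, 0]       # 1 = metastasis, 0 = censored
    for t, s in kaplan_meier(months, status):
        print(t, round(s, 2))
```

Patients censored at the same time as an event are counted as still at risk at that time, which is the usual convention.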
Results
Protein Expression Profiling of Breast Cancers Using Tissue Microarrays
The expression of 26 proteins was studied by IHC on TMA containing 552 early stage breast tumor samples and controls (
*as compared to 10 normal breast samples.
**P-values for the comparison of MFS were calculated using the log-rank test.
CI denotes confidence interval.
Unsupervised Hierarchical Classification of 552 Breast Tumors Upon Protein Expression Profiling
Hierarchical Clustering
The overall expression patterns for the 552 samples were first analyzed with hierarchical clustering. Results are displayed in a color-coded matrix in
The combined protein expression patterns defined two major clusters of tumors designated cluster A (462 cases) and cluster B (89 cases) in
Correlation with Histoclinical Parameters and Survival
We identified correlations between tumor clusters and relevant biopathological parameters. In each cluster, the most frequent histological type was the ductal type. However, in cluster A1, 19% of samples were of the lobular type compared with 12% in cluster A2 and only 7% in cluster B (p=0.03; Chi-2 test).
Importantly, the tumor clusters correlated with clinical outcome. With a median follow-up of 57 months, the 5-year MFS was significantly different (p<0.0001, log-rank test) between cluster A1 (54 metastases, 86% MFS [95% CI 82.1-89.9]), cluster A2 (21 metastases, 68% MFS [95% CI 56.5-79.9]) and cluster B (26 metastases, 66% MFS [95% CI 54.3-77.6]) (data not shown).
Supervised Analysis and Clinical Outcome
We developed a supervised analysis method to search for smaller sets of discriminator proteins that might improve our prognostic classification. Analysis was conducted using two equivalent but independent tumor sets (learning and validation sets).
Supervised Analysis and Classification of Patients
The learning set of samples (n=368) allowed the identification of a combination of proteins (protein expression signature) that correlated with long-term MFS. The number of proteins in the “metastatic predictor” was optimized by iteratively testing all combinations of 1 to 5 proteins and the complementary combinations of 21 to 25 proteins and by assessing their ability for correct classification of samples using a “Metastatic Score.” The optimal combination for these tumors contained 21 proteins (
We then tested the ability of this multiprotein signature to predict prognosis in an independent set of 184 patients (validation set). Using the same threshold for the “Metastatic Score” previously described, we identified two classes of patients that strongly correlated with clinical outcome. There were 24 metastatic relapses out of the 63 patients (38%) in the “poor-prognosis class” and only 10 out of the 121 (8%) in the “good-prognosis class” (odds ratio, OR=6.8 [95% CI 2.8-17.3], p<0.0001, Fisher exact test) (
When all 552 cases (learning and validation cases) were analyzed together, the predictor correlated well with long-term MFS.
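The odds ratio reported for the validation set can be recomputed from the counts given above (24 of 63 relapses in the “poor-prognosis class” versus 10 of 121 in the “good-prognosis class”). The Python sketch below does so and adds a two-sided Fisher exact test built from the hypergeometric distribution. The function names are hypothetical, and the two-sided rule used (summing all tables no more probable than the observed one) is a common convention that may differ in detail from the SAS implementation.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Probability of k events in row 1 given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [prob(k) for k in range(lo, hi + 1)]
    # Small tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-7))


if __name__ == "__main__":
    # Rows: poor-prognosis, good-prognosis; columns: relapse, no relapse.
    a, b, c, d = 24, 63 - 24, 10, 121 - 10
    print(f"OR = {odds_ratio(a, b, c, d):.2f}")  # OR = 6.83
    print(f"p  = {fisher_exact_two_sided(a, b, c, d):.2e}")
```

The same calculation applied to the untreated patients (19 of 67 versus 6 of 119) gives 7.45, in agreement with the value quoted for that subgroup.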
Correlation of Molecular Classification with Histoclinical Parameters and Survival
Table 2 (see the three last columns) shows the characteristics of patients in each class. The histoclinical parameters significantly associated with this classification were SBR grade (p<0.0001, Chi-2 test), hormone receptor status (p<0.0001, Fisher exact test), ERBB2 status (p<0.0001, Fisher exact test), and whether patients received adjuvant chemotherapy (p=0.001, Fisher exact test) or hormone therapy (p<0.0001, Fisher exact test). There was no correlation with patient age, tumor size, and number of involved lymph nodes. In contrast, a strong correlation with clinical outcome was observed (
Survival and Lymph Node Status
Our protein expression signature also classified the 255 patients with node-positive disease into two classes that correlated with clinical outcome. In the “good-prognosis class,” 28 out of 158 patients experienced metastatic relapse during follow-up as compared with 43 out of 97 in the “poor-prognosis class” (odds ratio, OR=3.7 [95% CI 2.0-6.8], p<0.0001, Fisher exact test) (
The same was true for the 292 patients with node-negative breast cancer. In this group, the odds ratio for metastasis was 6.5 ([95% CI 2.7-16.8], p<0.0001, Fisher exact test) among the 93 women from the “poor-prognosis class,” as compared with the 199 women from the “good-prognosis class” (
We compared our prognostic classification of node-negative patients with those provided by the consensus criteria established during the St-Gallen and NIH conferences.3, 4 These criteria classified all 292 patients into two groups (low risk versus high risk) (
Survival and Estrogen Receptor Status.
The same analysis was separately applied to ER-positive and ER-negative tumors. In the ER-positive group (n=422), 35 of 345 patients from the “good-prognosis class” displayed metastatic relapse as compared with 29 of 77 from the “poor-prognosis class” (odds ratio, OR=5.4 [95% CI 2.8-9.9], p<0.0001, Fisher exact test). The corresponding 5-year MFS were 90% [95% CI 85.9-93.3] and 58% [95% CI 45.4-70.6], respectively (p<0.0001, log-rank test) (data not shown). The same trend was observed, although not significant (p=0.21, log-rank test), for the 129 ER-negative tumors with 5-year MFS of 91% [95% CI 76.0-100.0] and 66% [95% CI 56.0-75.1], respectively.
Survival and Adjuvant Systemic Therapy
Since the occurrence of metastatic relapse may be influenced by the delivery of adjuvant systemic therapy, the classification based on our 21-protein signature was applied to 186 women who received neither chemotherapy nor hormone therapy after local-regional treatment. Importantly, the 21-protein signature successfully predicted prognosis in these patients: 6 metastatic relapses of 119 patients in the “good-prognosis class” and 19 of 67 in the “poor-prognosis class” (odds ratio, OR=7.4 [95% CI 2.6-23.9], p<0.0001, Fisher exact test) (
Similar results were observed when we focused on the 133 patients who received adjuvant chemotherapy without hormone therapy. In the “good-prognosis class,” 12 of the 58 patients displayed metastatic relapse whereas 33 of 75 experienced metastasis in the “poor-prognosis class” (odds ratio, OR=3 [95% CI 1.3-7.2], p=0.006 Fisher exact test) (
Uni- and Multivariate Prognostic Analysis
We finally compared the prognostic ability of our molecular grouping of tumors with classical histoclinical factors and individual protein markers. In univariate analysis, the histoclinical factors that correlated with MFS (p<0.05, log-rank test) were pathological tumor size (≦20 mm, >20), tumor grade (SBR I, II, III), number of positive axillary lymph nodes (0, 1-3, ≧4), and peritumoral vascular invasion (negative, positive). Proteins significantly correlated to MFS were BCL2 (p<0.0001), GATA3 (p=0.0006), MIB1 (p<0.0001), ER (p<0.0001), PR (p=0.0007), P53 (p=0.003) and α-Catenin (p=0.005) (Table 5).
CI denotes confidence interval.
The influence on the risk of distant metastasis of our multiprotein-based grouping, adjusted for other prognostic factors, was assessed in multivariate analysis by the Cox proportional hazards model. The parameters entered in the model were dichotomised and included the classification based on the discriminator 21-protein set (“good-prognosis class” and “poor-prognosis class”), age of patients (≦50 years, >50 years), number of positive axillary lymph nodes (0, 1-3, ≧4), pathological tumor size (≦20 mm, >20), tumor grade (SBR I, II, III), estrogen receptor status (negative, positive), progesterone receptor status (negative, positive), peritumoral vascular invasion (negative, positive), chemotherapy (delivery or not), hormone therapy (delivery or not) and each of the proteins (negative, positive) significantly associated with survival in univariate analyses. Results are shown in Table 5. Several independent factors predictive of distant metastasis as first event were evidenced including the prognosis signature based on the 21-protein combination, pathological size of tumors, axillary lymph node status (only when dichotomized ≦3 vs >3), Ki67/MIB1 status and delivery of hormone therapy. However, the 21-protein signature was the strongest predictor with a hazard ratio of 2.2 for “poor-prognosis class” patients, compared to “good-prognosis class” patients ([95% CI 1.25-3.89], p<0.0001).
The references below and the subject matter therein are incorporated herein by reference.
21. Liu C L, Prapong W, Natkunam Y, et al. Software tools for high-throughput analysis and archiving of immunohistochemistry staining data obtained with tissue microarrays. Am J Pathol 2002; 161:1557-65.
40. Al-Hajj M, Wicha M S, Benito-Hernandez A, Morrison S J, Clarke M F. Prospective identification of tumorigenic breast cancer cells. Proc Natl Acad Sci USA 2003; 100:3983-8.
This patent application claims priority of U.S. Provisional Application No. 60/537,412, filed Jan. 16, 2004. This earlier provisional application is hereby incorporated by reference.
Number | Date | Country
---|---|---
60537412 | Jan 2004 | US