This invention relates, in one embodiment, to a method of providing a prognosis for breast cancer by determining the number of single nucleotide polymorphisms (SNPs) in specified genes.
Breast cancer is a heterogeneous disease that exhibits a wide variety of clinical presentations, histological types and growth rates. In patients with no detectable lymph node involvement (a population thought to be at low risk), 20% to 30% develop recurrent disease after five to ten years of follow-up. Individuals in this group who are at risk for recurrence cannot currently be identified reliably.
DNA copy number alterations (CNAs) or copy number polymorphisms (CNPs), such as deletions, insertions and amplifications, are believed to be among the major genomic alterations that contribute to carcinogenesis. Both conventional and array-based comparative genomic hybridizations have revealed chromosomal regions that are altered in breast tumors. No study, however, has used a high-throughput, high-resolution platform to investigate the relationship of DNA copy number alterations with breast cancer prognosis.
The methods disclosed herein make it feasible to use copy number alterations (CNAs) to predict patient prognostic outcome. When combined with gene expression based signatures for prognosis, a copy number signature (CNS) refines risk classification and can identify those breast cancer patients who have a significantly worse prognosis and a potentially different response to chemotherapeutic drugs.
In the examples discussed herein, a high-throughput and high-resolution oligonucleotide-based single nucleotide polymorphism (SNP) array technology was used to analyze the CNAs for more than 100,000 SNP loci in the breast cancer genome. In a large cohort of 313 LNN (lymph node negative) breast cancer patients, CNAs were identified that were correlated with a subset of patients with a very high probability of developing distant metastasis. The prognostic power of the CNAs was validated in two independent patient cohorts. In addition, using published predictive gene signatures, the identified patient subgroups with different prognoses were tested for putative drug efficacy. The results indicate that combining DNA copy number analysis and gene expression analysis provides an additional and better means of risk assessment for breast cancer patients.
The present invention is disclosed with reference to the accompanying drawings, wherein:
The examples set out herein illustrate several embodiments of the invention but should not be construed as limiting the scope of the invention in any manner.
Specific DNA copy number alterations (CNAs), such as deletions and amplifications, are major genomic alterations that contribute to carcinogenesis and tumor progression through reduced apoptosis, unchecked proliferation, increased motility and angiogenesis. Because a significant proportion of genomic aberrations are unrelated to cancer biology and are merely due to random neutral events, it is a challenge to identify the causative gene CNAs that are responsible for the gene expression regulation that ultimately leads to malignant transformation and progression. Both fluorescence in situ hybridization and comparative genomic hybridization (CGH) have revealed chromosomal regions that showed CNAs in breast tumors. In a recent study including 51 breast tumors, a high-resolution SNP array was used together with gene-expression profiling to refine breast cancer amplicon boundaries and narrow the list of potential driver genes. However, only a limited number of studies have investigated CNAs in relation to their prognostic significance, and the sample sizes of these studies were too small to draw firm conclusions. Even fewer studies have investigated breast cancer prognosis using combined analysis of CNAs and gene expression profiling with sufficient sample size and a technology that had appropriate coverage and mapping resolution of the human genome.
This specification describes the analysis of DNA copy numbers for over 100,000 SNP loci across the human genome in genomic DNA from 313 lymph node-negative (LNN) primary breast tumors for which genome-wide gene-expression data were also available. Combining these two data sets allowed the identification of genomic loci, and their mapped genes, that correlate strongly with distant metastasis. The identified patient subgroups were further tested for putative drug efficacy based on published predictive signatures.
A combined analysis of DNA copy number and gene expression was performed on a large cohort of 313 LNN breast cancer patients who received no adjuvant systemic therapy. To our knowledge, this is the largest such study to analyze CNAs for breast cancer prognosis using high-density SNP array technology, which has much higher resolution than aCGH. A signature of 81 genes that showed CNAs and concordant gene expression regulation was identified from a training set of 200 LNN patients. This CNS was validated in the independent 113 LNN patients, as well as in an external aCGH data set of 116 LNN patients. Preliminary clinical utility has been demonstrated, since the very poor prognostic group with a particularly rapid relapse identified by the 81-gene CNS constituted a subset of the poor prognostic patients predicted by the 76-gene GES alone. Thus, by applying the CNS in addition to the GES, risk classification for breast cancer patients' prognosis is improved. Furthermore, by using previously reported gene signature profiles for sensitivity to chemotherapeutic compounds, it was shown that this very poor prognostic group might be much more resistant to preoperative T/FAC combination chemotherapy, particularly against the cyclophosphamide and doxorubicin compounds, while benefiting from etoposide and topotecan. This may suggest that patients belonging to this category should be closely monitored and managed with different chemotherapy regimens compared with other patient groups, and that the 81 genes of the CNS also play an important role in chemosensitivity.
Previous studies investigating the association between gene amplification and breast cancer prognosis considered different breast cancer subtypes, such as ER positive and ER negative, as a single homogeneous cohort. However, it is well known that these tumors are pathologically and biologically very different, as evidenced by their markedly distinct global gene expression profiles. This dichotomy also extended to the global pattern of the DNA copy numbers. Therefore, the analysis needed to be performed separately for ER-positive and ER-negative (estrogen-receptor positive and negative) tumors. Indeed, the prognostic chromosomal regions identified from the ER-positive tumors share little in common with those from the ER-negative tumors. For example, chromosome region 8q is a widely known site of DNA amplification that is associated with poor prognosis in breast cancer. The region 8q was indeed a hotspot for amplification in ER-positive tumors, but contained no significantly amplified areas in ER-negative tumors. Because ER-negative tumors constitute only a small percentage (˜25%) of LNN breast cancers, it is reasonable to speculate that studies that did not separate the two types of breast tumors in their analysis may have had their conclusions overwhelmed by the results from the ER-positive tumors that made up the majority of the samples. Another apparent difference between the two types of tumors observed in our analysis was at chromosome region 20q13.2-13.3. A gain in copy number of this region in ER-positive tumors, but by contrast a loss in copy number of this region in ER-negative tumors, was related to early recurrence. Taken together, these results re-emphasize that ER-positive and ER-negative tumors follow different biological pathways for cancer development and progression.
The median of the mean copy numbers computed from each SNP's interquartile copy number estimates was 2.1, consistent with the general assumption that the majority of the genome is diploid. Unsupervised analysis using PCA on all 313 tumors showed that chromosomal copy number variations displayed a clear trend of separation between ER-positive and ER-negative tumors (
First, chromosome regions were identified whose CNAs were correlated with patients' DMFS. For ER-positive tumors, 45 chromosomal regions distributed over 17 chromosomes were identified as having CNAs that correlated with DMFS (
In the training set of 200 patients, an 81-gene prognostic copy number signature (CNS) was constructed that identified a subgroup of patients with a high probability of distant metastasis in the independent testing set of 113 patients (hazard ratio [HR]: 2.8, 95% confidence interval [CI]: 1.4-5.6, p=0.0036), and in an external data set of 116 patients (HR: 3.7, 95% CI: 1.3-10.6, p=0.0102). These high-risk patients constituted a subset of the high-risk patients predicted by our previously established 76-gene expression signature (GES). This very poor prognostic group identified by CNS and GES was putatively more resistant to preoperative paclitaxel and 5-FU-doxorubicin-cyclophosphamide (T/FAC) combination chemotherapy (p=0.0003), particularly against the doxorubicin and cyclophosphamide compounds, while potentially benefiting from etoposide and topotecan.
Frozen tumor specimens of 313 LNN breast cancer patients selected from the tumor bank at the Erasmus Medical Center (Rotterdam, Netherlands) were used in this study. None of these patients received any systemic (neo)adjuvant therapy. The guidelines for local primary treatment were the same. Among these specimens, 273 were used to develop a 76-gene signature for the prediction of distant metastasis using Affymetrix U133A chips. The remaining 40 patients were used to study prognostic biological pathways. The study was approved by the Medical Ethics Committee of the Erasmus MC Rotterdam, The Netherlands (MEC 02.953), and was conducted in accordance with the Code of Conduct of the Federation of Medical Scientific Societies in the Netherlands (http://www.fmwv.nl/), and wherever possible the Reporting Recommendations for Tumor Marker Prognostic Studies (REMARK) were followed.
A total of 199 tumors were classified as ER positive and 114 as ER negative, using previously described ER (and PgR) cutoffs. The median age of patients at the time of surgery (breast conserving surgery: 230 patients; modified radical mastectomy: 83 patients) was 54 years (range, 26-83 years). The median follow-up time for surviving patients (n=220) was 99 months (range, 20-169 months). A total of 114 patients (36%) developed distant metastasis and were counted as failures in the analysis of DMFS. Of the 93 patients who died, 7 died without evidence of disease and were censored at last follow-up in the analysis of DMFS; 86 patients died after a previous relapse. The clinicopathological characteristics of the patients are given in Table 1. The data set containing the clinical and SNP data has been submitted to the Gene Expression Omnibus database with accession number 10099 (http://www.ncbi.nlm.nih.gov/geo, username: jyu8; password: jackxyu).
The external array CGH (aCGH) data set of 116 LNN patients used in this study as an independent validation was downloaded from http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE8757. The clinical data (Table 1) related to this data set were kindly provided by Dr. Teschendorff, University of Cambridge, UK.
Genomic DNA was isolated from 5 to 10 tumor cryostat sections of 30 μm (10-25 mg) with the QIAamp DNA mini kit (Qiagen, Venlo, Netherlands) according to the protocol provided by the manufacturer. Genomic DNA from each patient sample was allelotyped using the Affymetrix GeneChip® Mapping 100K Array Set (Affymetrix, Santa Clara, Calif.) in accordance with the standard protocol. Briefly, 250 ng of genomic DNA was digested with either HindIII or XbaI, and then ligated to adapters that recognize the cohesive four base pair (bp) overhangs. A generic primer that recognizes the adapter sequence was used to amplify adapter-ligated DNA fragments with PCR conditions optimized to preferentially amplify fragments ranging from 250 to 2000 bp in size using a DNA Engine (MJ Research, Watertown, Mass.). After purification with the Qiagen MinElute 96 UF PCR purification system, a total of 40 μg of PCR product was fragmented and about 2.9 μg was visualized on a 4% TBE agarose gel to confirm that the average size of DNA fragments was smaller than 180 bp. The fragmented DNA was then labeled with biotin and hybridized to the Affymetrix GeneChip® Human Mapping 100K Array Set for 17 hours at 48 °C in a hybridization oven. The arrays were washed and stained using an Affymetrix Fluidics Station, and scanned with a GeneChip® Scanner 3000 7G and GeneChip® Operating Software (GCOS) (Affymetrix). GTYPE (Affymetrix) software was used to generate a SNP call for each probe set on the array. A SNP call was determined for 96.6% of the probe sets across the study, with a standard deviation of 2.6%. CCNT 3.0 software was then used to generate a value representing the copy number of each probe set. This was done by comparing the hybridization intensities of each chip to a manufacturer-provided reference set of intensity measurements for over 100 normal individuals of various ethnicities. The copy number measurements were then smoothed using the genomic smoothing function of CCNT with a window size of 0.5 Mb.
The Affymetrix GeneChip® Human Mapping 100K Array Set contains 115,353 probe sets for which the exact mapping positions were defined. The median length of the interval between the probe sets was 8.6 kb; 75% of the intervals were less than 28 kb and 95% were less than 94.5 kb.
Identification of Chromosome Regions with Prognostic Copy Number Alterations
An integrated analytical method was designed to identify the chromosome regions and the mapped candidate genes whose CNAs were correlated with distant metastasis. The method takes advantage of the availability, for the same cohort of patients, of both RNA gene expression data generated in our previous studies and DNA copy number data that became available in this study (
The first step in our analysis was to identify chromosome regions whose copy number alterations were correlated with distant metastasis. In the training set, univariate Cox proportional-hazards regression was used to evaluate the statistical significance of the correlation between the copy number of each individual SNP and the time of DMFS. Then, to define prognostic chromosomal regions, chromosomes were scanned in steps of 1 Mb using a sliding window of 5 Mb, which contained an average of 250 SNPs, to compile the Cox regression p-values of all SNPs within the window and to determine a smoothed p-value for these SNPs as a whole relative to permuted data sets. For a given window of size 5 Mb containing n SNPs, let βi and Pi denote the Cox regression coefficient and the P value from the Cox regression for the ith SNP, respectively. A log score S for this window was defined by summarizing the statistical significance of all SNPs within this window as a whole as follows:
The indicator variable Ii was used to account for and to distinguish the positively correlated copy number changes from the negatively correlated ones, indicated by the signs of the Cox regression coefficients βi. The positive coefficients reflect that relapsing patients had higher copy numbers than disease-free patients and the negative coefficients suggested the opposite. To compute the smoothed p-values from the log scores, permutations were used to derive the null distribution of the log scores. Four hundred permutations were performed by shuffling the clinical information with regard to the patient IDs. From the smoothed p-values, the prognostic chromosomal regions were defined as the chromosomal segments within which the smoothed p-values were all less than 0.05.
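The window scan and permutation step described above can be sketched as follows. This is a minimal illustration only. The signed form of the score, S = Σ Ii·(−log10 Pi) with Ii = +1 for βi > 0 and −1 otherwise, is an assumption consistent with the surrounding definitions rather than a formula taken from this text. The function names, the two-sided comparison and the add-one correction in the empirical p-value are likewise hypothetical choices. The null scores are assumed to come from recomputing the window score after each of the 400 label shuffles.

```python
import math

def sliding_windows(positions, width=5_000_000, step=1_000_000):
    """Yield (window_start, snp_indices) for 5 Mb windows advanced in 1 Mb steps."""
    if not positions:
        return
    last = max(positions)
    start = 0
    while start <= last:
        idx = [i for i, pos in enumerate(positions) if start <= pos < start + width]
        if idx:
            yield start, idx
        start += step

def window_log_score(betas, pvals):
    """Signed log score for one window (ASSUMED form: sum of I_i * -log10(P_i))."""
    score = 0.0
    for beta, p in zip(betas, pvals):
        sign = 1.0 if beta > 0 else -1.0  # indicator I_i from the sign of beta_i
        score += sign * -math.log10(p)
    return score

def smoothed_p_value(observed, null_scores):
    """Empirical two-sided p-value of a window score against permutation scores."""
    extreme = sum(1 for s in null_scores if abs(s) >= abs(observed))
    return (extreme + 1) / (len(null_scores) + 1)  # add-one correction (assumption)
```

Under this sketch, a prognostic region would be a run of consecutive windows whose smoothed p-values all fall below 0.05.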
Once the prognostic chromosome regions were identified, the well-defined genes with an Entrez Gene ID within those regions were mapped using the UCSC Genome Browser (http://genome.ucsc.edu) Human March 2006 (hg18) assembly. Next, two filtering steps were used to select the genes with greater confidence of having prognostic value to build a CNS. First, only genes that had at least one corresponding Affymetrix U133A probe set ID were retained, and of these only genes with statistically significant Cox regression p-values (p<0.05) in the gene expression data were carried forward. Second, the correlation between the gene expression levels and copy numbers had to be greater than 0.5. If a gene contained multiple SNPs, the SNP with the best Cox regression p-value was selected; if it contained no SNP, the nearest SNP was chosen. Where a gene had multiple U133A probe sets, the one with the best Cox p-value was used.
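The two filtering steps and the choice of a representative SNP can be sketched as below. The record layout (dictionaries with `pos`, `cox_p`, `probe_sets` and `expr_cn_corr` fields) and the function names are hypothetical and serve only to make the selection rules concrete. The thresholds are those stated above.

```python
def representative_snp(gene_start, gene_end, snps):
    """Pick the SNP with the best Cox p-value inside the gene, else the nearest SNP."""
    inside = [s for s in snps if gene_start <= s["pos"] <= gene_end]
    if inside:
        return min(inside, key=lambda s: s["cox_p"])
    # No SNP inside the gene: fall back to the SNP closest to either gene boundary.
    return min(
        snps,
        key=lambda s: min(abs(s["pos"] - gene_start), abs(s["pos"] - gene_end)),
    )

def select_signature_genes(genes, p_cutoff=0.05, r_cutoff=0.5):
    """Apply the probe set, expression p-value and correlation filters in order."""
    kept = []
    for gene in genes:
        if not gene["probe_sets"]:
            continue  # no U133A probe set for this gene
        best_probe = min(gene["probe_sets"], key=lambda ps: ps["cox_p"])
        if best_probe["cox_p"] >= p_cutoff:
            continue  # expression not significantly associated with DMFS
        if gene["expr_cn_corr"] <= r_cutoff:
            continue  # expression does not track copy number closely enough
        kept.append(gene["symbol"])
    return kept
```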
To build a model using the genes in the CNS to predict distant metastasis, each gene's numeric copy number estimates were transformed into discrete values, i.e., amplification, no change, or deletion. For this transformation, the diploid copy number for each gene was estimated by fitting a normal mixture model to the representative SNP's copy number data and taking the main peak of the modeled distribution as the estimate of the diploid copy number. Amplification was defined as 1.5 units above the diploid copy number estimate to ensure low false positives due to the intrinsic data variability, whereas deletion was defined as 0.5 units below the diploid copy number estimate because of the nature of the alteration and the narrow distribution of the copy number data for copy number loss. Once the copy number data were transformed, the following simple and intuitive algorithm was used to build a predictive model. The algorithm classified a patient as a relapser if at least n genes had copy numbers altered in that patient, and as a non-relapser otherwise. All possible values of n, ranging from 1 to the number of genes in the CNS, were examined. The value of n was chosen by examining the performance of the signature in the training set, as measured by a significant log-rank test p-value, while setting a lower limit on the percentage of positives (predicted relapsers) to avoid a very small number of positives as n increases.
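The discretization and the counting rule can be sketched as follows. The function names are hypothetical. Treating the thresholds as inclusive, and counting an amplification or a deletion equally as an altered gene, are assumptions made for illustration. The diploid estimate is taken as an input, since the mixture-model fit is outside the scope of this sketch.

```python
def discretize(copy_number, diploid, amp_offset=1.5, del_offset=0.5):
    """Return 1 (amplification), -1 (deletion) or 0 (no change) for one gene."""
    if copy_number >= diploid + amp_offset:
        return 1
    if copy_number <= diploid - del_offset:
        return -1
    return 0

def predict_relapse(calls, n):
    """Classify a patient as a relapser if at least n genes are altered."""
    altered = sum(1 for call in calls if call != 0)
    return altered >= n

# Example: one patient, four signature genes, diploid estimate of 2.1 for each.
patient_copy_numbers = [3.7, 2.3, 1.5, 2.0]
patient_calls = [discretize(cn, 2.1) for cn in patient_copy_numbers]
is_relapser = predict_relapse(patient_calls, n=2)
```

Choosing n would then amount to repeating `predict_relapse` over the training set for each candidate n and keeping a value that gives a significant log-rank p-value while still flagging enough patients.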
The performance of the CNS was assessed both in the copy number data set of the remaining testing patients and in the external aCGH data set using the same algorithm described above. Because the external data set was derived from a different technology (aCGH) and its data format was log2 ratios, the cutoff for amplification was set at 0.45 and the cutoff for deletion at −0.35 to ensure a percentage of positives comparable to that generated with the SNP array technology. As with the construction of the CNS, the validation was done in the ER-positive and ER-negative tumors separately using the corresponding subsets of genes in the CNS. The final performance shown, however, represents the combined performance for both ER-positive and ER-negative patients in the testing set.
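For the log2-ratio data, the same counting rule applies with the cutoffs stated above. The sketch below is illustrative: the function names, the patient record layout and the per-subgroup thresholds in `n_by_er` are hypothetical, and the conversion to an implied copy number assumes a pure diploid reference, which tumor samples only approximate.

```python
def call_from_log2_ratio(ratio, amp_cutoff=0.45, del_cutoff=-0.35):
    """Return 1 (amplification), -1 (deletion) or 0 (no change) from a log2 ratio."""
    if ratio >= amp_cutoff:
        return 1
    if ratio <= del_cutoff:
        return -1
    return 0

def implied_copy_number(ratio, reference_copies=2.0):
    """Copy number implied by a log2 ratio against a diploid reference (assumption)."""
    return reference_copies * 2.0 ** ratio

def classify_cohort(patients, n_by_er):
    """Classify each patient with the gene subset and threshold of its ER subgroup,
    then return the pooled predictions in the original patient order."""
    predictions = []
    for patient in patients:
        calls = [call_from_log2_ratio(r) for r in patient["ratios"]]
        altered = sum(1 for call in calls if call != 0)
        predictions.append(altered >= n_by_er[patient["er"]])
    return predictions
```

Under the diploid-reference assumption, the two cutoffs correspond to roughly 2.7 and 1.6 copies, a narrower band than the 1.5-unit and 0.5-unit offsets used for the SNP array data.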
To test for putative responses of testing set patients to chemotherapeutic compounds, gene expression signatures from two published studies were used. The original gene expression data set and the R function for the diagonal linear discriminant analysis (DLDA) prediction algorithm for the 30-gene preoperative paclitaxel, fluorouracil, doxorubicin and cyclophosphamide (T/FAC) response signature were downloaded from http://bioinformatics.mdanderson.org/pubdata.html. The model was trained on the original data set using the provided R function and then tested in our gene expression data set. For each of the seven gene expression signatures that predict sensitivity to individual chemotherapeutic drugs, the predicted probability of sensitivity to each compound was calculated using Bayesian fitting of binary probit regression models, with the help of Drs. Anil Potti and Joseph Nevins (for details see Potti A, Dressman H K, Bild A, Riedel R F, Chan G, Sayer R, et al. Genomic signatures to guide the use of chemotherapeutics. Nat Med. 2006 November; 12(11):1294-300).
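As an illustration of the DLDA technique named above, the following is a generic sketch and not the downloaded R function. It assumes equal class priors and a pooled per-gene variance, and it assigns a sample to the class whose mean is closest in variance-scaled squared distance. The function names and the class labels in the usage note are hypothetical.

```python
def fit_dlda(samples, labels):
    """Estimate per-class gene means and pooled within-class variances."""
    classes = sorted(set(labels))
    n_features = len(samples[0])
    means = {}
    for c in classes:
        rows = [x for x, y in zip(samples, labels) if y == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(n_features)]
    dof = len(samples) - len(classes)  # degrees of freedom for the pooled variance
    variances = []
    for j in range(n_features):
        ss = sum((x[j] - means[y][j]) ** 2 for x, y in zip(samples, labels))
        variances.append(ss / dof)
    return means, variances

def predict_dlda(model, x):
    """Assign x to the class with the smallest variance-scaled squared distance."""
    means, variances = model

    def distance(c):
        return sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, means[c], variances))

    return min(means, key=distance)
```

A model fitted on a published training set with labels such as "pCR" and "RD" (residual disease) could then be applied to each patient's expression vector with `predict_dlda`. Genes with zero pooled variance would need to be removed first.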
Unsupervised analysis using principal component analysis (PCA) was performed on the copy number data set with all SNPs to examine the potential subclasses of the tumors. Kaplan-Meier survival plots and log-rank tests were used to assess the differences in DMFS between the predicted high-risk and low-risk groups. Cox proportional-hazards regression was performed to compute the HR and its 95% CI. Because of missing data on grade, multivariate Cox regression analysis was done by multiple imputation using the Markov Chain Monte Carlo method under the general location model (Schafer J L. Analysis of incomplete multivariate data. London: Chapman & Hall/CRC Press; 1997). t-tests were performed to assess the significance of differential therapeutic responses among the prognostic groups. All statistical analyses were performed using R version 2.6.2.
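The Kaplan-Meier estimate and the two-group log-rank test can be sketched in a few lines. This is a textbook illustration written in Python rather than the R code used for the analyses, and the function names are hypothetical. Times and event flags are assumed to be plain lists, with an event flag of 1 for distant metastasis and 0 for censoring.

```python
import math

def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    at_risk = len(times)
    surv = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, ei in zip(times, events) if ti == t and ei)
        leaving = sum(1 for ti in times if ti == t)  # events plus censorings at t
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= leaving
    return curve

def logrank_test(times_a, events_a, times_b, events_b):
    """Return (chi-square statistic, p-value) for a two-group log-rank test."""
    pooled = zip(times_a + times_b, events_a + events_b)
    event_times = sorted({t for t, e in pooled if e})
    obs_minus_exp = 0.0
    variance = 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return chi2, math.erfc(math.sqrt(chi2 / 2))
```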
The gene expression profiling data from our previous studies of the same tumors were used (Wang Y, Klijn J G, Zhang Y, Sieuwerts A M, Look M P, Yang F, et al. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet. 2005 Feb. 19;365(9460):671-9 and Yu J X, Sieuwerts A M, Zhang Y, Martens J W, Smid M, Klijn J G, et al. Pathway analysis of gene signatures predicting metastasis of node-negative primary breast cancer. BMC Cancer. 2007 Sep. 25;7(1):182) to screen for genes that had consistent change patterns between the gene expression profiles and the copy number variations. It was deemed reasonable that the change in copy numbers has to be reflected in the corresponding change in gene expression levels in order to have a phenotypic effect. Within these prognostic regions, a total of 2,833 and 3,656 genes were mapped for ER-positive tumors (Table 4) and ER-negative tumors (Table 5), respectively. For the ER-positive tumors, 122 genes had significant Cox regression p<0.05 in both the gene expression data and the copy number data, and showed the same direction for the changes in DNA copy number and gene expression. For the ER-negative tumors, 78 genes had significant p-values in both data sets, and showed the same direction of alterations (
The validation was done in the ER-positive and ER-negative tumors separately for the testing set using 53 and 28 genes from the CNS, respectively. The final performance shown represents the combined results of the two subgroups. In the testing set of 113 independent patients, the Kaplan-Meier analyses of the two patient groups stratified by the 81-gene CNS showed a statistically significant difference in time to distant metastasis (
Next, the CNS was tested in a completely independent external data set of 116 LNN patients (79 ER-positive and 37 ER-negative tumors) derived from a lower resolution aCGH technology (Chin S F, Teschendorff A E, Marioni J C, Wang Y, Barbosa-Morais N L, Thorne N P, et al. High-resolution array-CGH and expression profiling identifies a novel genomic subtype of ER negative breast cancer. Genome Biol. 2007 Oct. 9;8(10):R215). The 81-gene CNS significantly stratified this patient cohort (
The chemotherapy response profiles were subsequently investigated for the three prognostic groups determined by the GES and CNS prognostic assays using well-validated gene signatures derived from two studies (Potti A, Dressman H K, Bild A, Riedel R F, Chan G, Sayer R, et al. Genomic signatures to guide the use of chemotherapeutics. Nat Med. 2006 Nov.;12(11):1294-300 and Hess K R, Anderson K, Symmans W F, Valero V, Ibrahim N, Mejia J A, et al. Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J Clin Oncol. 2006 Sep. 10;24(26):4236-44) for which follow-up validation studies were also available (Bonnefoi H, Potti A, Delorenzi M, Mauriac L, Campone M, Tubiana-Hulin M, et al. Validation of gene signatures that predict the response of breast cancer to neoadjuvant chemotherapy: a substudy of the EORTC 10994/BIG 00-01 clinical trial. Lancet Oncol. 2007 Dec.;8(12):1071-8 and Peintinger F, Anderson K, Mazouni C, Kuerer H M, Hatzis C, Lin F, et al. Thirty-gene pharmacogenomic test correlates with residual cancer burden after preoperative chemotherapy for breast cancer. Clin Cancer Res. 2007 Jul. 15;13(14):4078-82). Firstly, using a previously published 30-gene signature that predicted pathological complete response (pCR) to preoperative T/FAC chemotherapy (Hess K R, Anderson K, Symmans W F, Valero V, Ibrahim N, Mejia J A, et al. Pharmacogenomic predictor of sensitivity to preoperative chemotherapy with paclitaxel and fluorouracil, doxorubicin, and cyclophosphamide in breast cancer. J Clin Oncol. 2006 Sep. 10;24(26):4236-44), each patient in the different prognostic subgroups was assigned into 2 response groups: either as having pCR or still with residual disease. 
Only 2 of the 15 patients (13%) in the very poor prognostic group were predicted as having pCR, while 34 of the 60 patients (57%) and 14 of the 38 patients (37%) in the poor and good prognostic groups, respectively, were predicted as having pCR. The chemo response score for the very poor prognostic group was significantly lower than that of the poor prognostic group (p=0.0003), indicating that these patients would have been much more resistant to preoperative T/FAC chemotherapy had they received it (
This application claims priority to and the benefit of co-pending U.S. provisional patent application Ser. No. 61/007,650, filed Dec. 14, 2007, which application is incorporated herein by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
61007650 | Dec 2007 | US |