Each sample in the training (a) and test set (b) is plotted (x-axis) against the sample's prediction strength (PS, y-axis). The training data set consists of 55 tumours and the test data set consists of 41 tumours. Samples exhibiting high positive PS values are classified as ER+, while samples with a high negative PS are ER−. Blue samples were correctly classified while red samples were misclassified. In general, a group of ‘low-confidence’ samples is observed (grey box) in both the training and test tumours.
(a) and (b) Depicted are the relative expression levels of the top 122 ER discriminating genes (obtained from the SAM-133 gene set, see text) that are positively correlated to ER+ status in (a) ER+/High (yellow) and ER+/Low (turquoise), and (b) ER−/High (dark blue) and ER−/Low (pink) samples.
The order of the 122 genes along the x axis is determined by their S2N ratio (see Materials and Methods). The S2N metric for a particular gene takes into account both the difference in mean expression level between the two classes and the standard deviation in expression for that gene within each class being compared. Note that the specific order of the 122 genes differs between (a) and (b), depending on their S2N ratio (Table 2). (c) and (d) Depicted are the relative expression levels of the top 54 ER discriminating genes that are negatively correlated to ER+ status (11 belonging to the SAM-133 gene set, see supplementary info for details) in (c) ER+/High (yellow) and ER+/Low (turquoise), and (d) ER−/High (dark blue) and ER−/Low (pink) samples. Considerably fewer perturbations are observed than in (a) and (b).
The overall incidence patterns of breast cancer in Caucasian and Asian populations are distinct (8), prompting the inventors to investigate if findings from previous reports (3, 4) could also be observed in their local patient population. They first used gene expression profile data to classify a set of breast tumours by their ER status. A training set of 55 breast tumours was selected, where the ER status of each tumour was pre-determined using IHC. Two classification methods were tested: weighted-voting (WV) and support vector machines (SVM), and classification accuracy was assessed through leave-one-out cross validation (LOOCV) (Supplementary Information). In addition to classifying a sample, quantitative metrics were used to provide an assessment of classification uncertainty (Materials and Methods). The overall classification accuracy on the training set was 95% (WV) and 96% (SVM), with seven samples characterized by ‘low confidence’ or marginal predictions (grey box,
Since the differentiation of tumours into ‘high’ and ‘low-confidence’ sub-populations was achieved through a purely computational analysis of tumour gene expression profiles, it is unclear if this distinction is biologically or clinically meaningful, and if the use of gene expression profiles in this manner affords any substantial advantage over conventional immunohistochemical techniques to determine the ER status of breast tumours. To address this issue, the inventors investigated if the ‘low-confidence’ tumours might exhibit any clinical behaviours distinct from their ‘high-confidence’ counterparts. They used two publicly available breast cancer expression data sets for which related but distinct types of clinical information were available. The first set (9) consists of a cDNA microarray data set of 78 breast carcinomas and 7 nonmalignant samples with overall patient survival information (referred to as the Stanford data set). The second one (10) consists of 71 ER+ and 46 ER− lymph-node negative tumours profiled using oligonucleotide-based microarrays; for 97 of these samples, the available clinical information was the time interval from initial tumour diagnosis to the appearance of a new distant metastasis (referred to as the Rosetta data set). The inventors used WV to classify the breast tumours in the Stanford and Rosetta data sets by their ER subtype. Consistent with their own data set, among the 56 ER+ and 18 ER− tumours in the Stanford data set (4 tumours were removed due to lack of ER status information), they observed an overall LOOCV accuracy of 93%, with 14 tumours being classified as ‘low-confidence’. Similarly, the WV analysis also identified 15 tumours in the Rosetta data set as exhibiting a ‘low-confidence’ classification, with an overall LOOCV accuracy of 92%. These numbers are comparable to those observed in the inventors' own patient population.
They then compared the clinical behaviour of the ‘high’ and ‘low-confidence’ tumour populations using Kaplan-Meier analysis. As shown in
The classification algorithms used in these and other studies (e.g. WV, SVM, ANN, see below) all rely upon the combinatorial input of multiple discriminator genes whose individual contributions are then combined to arrive at a particular classification decision (i.e. if the tumour is ER+ or ER−). It is formally possible that the ‘low-confidence’ prediction status of these breast tumours is due to either the dramatic deregulation of a few key discriminator elements (i.e. specific effects), or the more subtle perturbation of a large number of discriminator genes (i.e. widespread effects). To distinguish between these two possibilities, the inventors compared the expression levels of genes important for ER subtype discrimination between ‘high’ and ‘low’ confidence tumours. First, to identify ER discriminating genes which were differentially regulated between ER+ and ER− tumours, they utilized a statistical technique called significance analysis of microarrays (SAM) (11).
Employing their combined data set (total number=96 tumours), a total of 133 differentially regulated genes (SAM-133) were identified at a ‘false discovery rate’ (FDR) of 0% (the FDR is an index used by SAM to estimate the number of false positives; for example, an FDR of 10% for 100 genes indicates that 10 genes are likely to be false positives). In this set, 122 genes were up-regulated in ER+ samples (i.e. positively correlated to ER status), while the remaining 11 were down-regulated in ER+ tumours (i.e. negatively correlated to ER). As predicted, the SAM-133 gene set includes a number of genes related to the ER pathway, such as ESR1, LIV1 (an estrogen-inducible gene), and TFF1, and some genes (e.g. GATA-3) were identified multiple times. A number of genes in the SAM-133 list are also found in similar lists reported by others (3, 4).
The inventors then subdivided the ER+ and ER− tumours each into ‘high’ and ‘low’ confidence categories (ie ER+/High, ER+/Low, ER−/High, ER−/Low), and the expression levels of the SAM-133 genes were compared between the groups (
The expression perturbations observed in the ‘low-confidence’ breast tumours could be due to multiple reasons, ranging from experimental variation (e.g. poor sample quality, tumour excision and handling) and choice of the classification method to population and sample heterogeneity. To gain insights into the possible mechanisms underlying these expression perturbations, the inventors attempted to determine if there were any specific histopathological parameters that might be correlated to the ‘low-confidence’ state. No significant associations were observed between the ‘low-confidence’ status of a tumour and patient age, lymph node status, tumour grade, p53 mutation status or progesterone receptor status (Table 1). The inventors discovered, however, a significant positive association (p<0.001, Supplementary Information) between a tumour's ERBB2 status and a ‘low-confidence’ prediction. This correlation, observed using the training set data, was then assessed using the independent test set samples. Of the nine ‘low-confidence’ samples in the independent test set, eight tumours were also ERBB2+ (8/9), indicating that this association is not dataset-specific.
The inventors also investigated if the correlation between the ‘low-confidence’ predictions and high ERBB2 expression could have been independently discovered by comparing the global expression profiles of ‘high’ and ‘low’ confidence tumours. First, they compared the ‘high-confidence’ and ‘low-confidence’ tumours belonging to the ER+ subtype. A total of 89 genes were identified as being significantly regulated (FDR=14%). Among the top 50 most significantly up-regulated genes in the ER+ ‘low-confidence’ samples, three genes, PMNT (ranked 4th), GRB7V (8th), and ERBB2 (36th), were of particular interest (Supplementary Information), as they are all physically located in the 17q region, a frequent target of DNA amplification in breast cancer (12). In a separate analysis, the ER− ‘high-confidence’ and ER− ‘low-confidence’ samples were also compared. Among the top 50 genes identified as being differentially regulated (FDR=4%), the inventors once again identified the 17q genes PMNT (ranked 5th), GRB7V (10th) and ERBB2 (28th) as exhibiting increased expression in the ‘low-confidence’ samples (Supplementary Information). Taken collectively, these results suggest that for both the ER+ and ER− subtypes, the ‘low-confidence’ breast tumours are significantly associated with increased expression of ERBB2 in comparison to the ‘high-confidence’ tumours, most likely resulting from DNA amplification of the 17q locus. Note, however, that the association between ‘low-confidence’ prediction and ERBB2+ expression, although highly significant, is not perfect: a few tumours that were designated as ERBB2+ by conventional IHC exhibited ‘high-confidence’ predictions, while not all ‘low-confidence’ tumours are ERBB2+. One possibility is that other genes besides ERBB2 also contribute to a breast tumour exhibiting a ‘low-confidence’ state.
To validate their finding, the inventors then analyzed the other independently derived breast cancer expression datasets. First, of the nine ERBB2+ tumours in the Stanford data set, all nine were predicted as being in the ‘low-confidence’ group (p<0.001, Supplementary Information). Second, in the Rosetta data set, they once again found a significant association between the confidence level of prediction and ERBB2 expression (p<0.001, Supplementary Information). Third, Gruvberger and his colleagues utilized artificial neural networks (ANNs) on a cDNA microarray data set of 28 ER+ and 30 ER− samples to predict the ER status of breast tumours (3). Their results, shown in
The strong correlation between high ERBB2 levels and the widespread perturbations of ER-subtype discriminating genes observed in the ‘low-confidence’ tumours raises the possibility that ERBB2 may functionally contribute towards this phenomenon. One possible mechanism by which this could occur is through ERBB2 signaling, which has been proposed to inhibit the transcriptional activity of ER (see Discussion). Under this scenario, one might expect that a significant proportion of the genes perturbed between the ‘high-confidence’ (ERBB2−) and ‘low-confidence’ (ERBB2+) tumours would consist of genes regulated by ER. The inventors tested this hypothesis in two ways. First, they compared their list of significantly perturbed genes (Table 2) to SAGE expression data derived from estrogen (E2) stimulated MCF-7 cells (13) to determine the extent of overlap between the two. Only two genes (STC2, TFF1) were found in common between the SAGE data and the ‘perturbed’ gene list, and one (TFF1) was regulated in the opposite manner from that expected, exhibiting higher expression in the ERBB2+ samples. This result, within the limits of the cell line assay, suggests that many of the ‘perturbed’ genes in the ‘low-confidence’ tumours may not be directly regulated by estrogen. Second, as in vitro cell line studies may not fully recapitulate the effects of estrogen in vivo, the inventors then adopted a bioinformatics approach using a recently described algorithm, Dragon Estrogen Response Element Finder (DEREF), to search for putative estrogen-response elements (EREs) in the promoter regions of the perturbed genes (14).
The prediction accuracy of DEREF has been validated in a number of in vivo examples: it detects ERE patterns 2.8× more frequently in the promoter regions of estrogen responsive versus non-responsive genes in a microarray experiment, and 5.4× more frequently in the promoters of genes belonging to the estrogen-induced SAGE data set versus genes whose expression is negatively correlated to ER in breast cancers (Supplementary Information). Of the top 50 perturbed genes in the ER+ tumours (Table 2), the transcriptional start sites of 35 could be accurately determined and thus were subsequently analyzed by DEREF. Of these 35, EREs were detected with high confidence in only 12 promoters (total frequency 34%) (Table 2).
Conversely, of the top 50 perturbed genes in the ER− tumours, 33 were analyzed by DEREF and high-confidence EREs were detected in only 3 (total frequency 9%) (Table 2). Thus, EREs were detected in the promoters of perturbed genes in ER+ tumours at 3.7× higher frequency than in the ER− tumours. This difference was significant by a chi-square analysis (p=0.012), suggesting that ERBB2 may affect transcription in ER+ and ER− tumours via distinct mechanisms (see Discussion). Regardless, EREs were not over-represented among the perturbed genes in either subtype (ER+ or ER−), suggesting that these genes may not be direct transcriptional targets of ER. These genes may represent indirect targets of ER, or may be transcriptionally regulated via ER-independent mechanisms.
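The chi-square comparison of ERE frequencies described above can be illustrated with a minimal sketch. It assumes an uncorrected Pearson chi-square test on the 2×2 table of promoters with and without a high-confidence ERE (12 of 35 in ER+ versus 3 of 33 in ER−); the text does not state whether a continuity correction was applied, and the function name `chi_square_2x2` is hypothetical.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Assumption: this is an illustrative re-computation, not the authors' own script.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # With 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts taken from the text: promoters with / without a high-confidence ERE.
er_pos_with, er_pos_total = 12, 35
er_neg_with, er_neg_total = 3, 33
chi2, p = chi_square_2x2(er_pos_with, er_pos_total - er_pos_with,
                         er_neg_with, er_neg_total - er_neg_with)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
```

Under these assumptions the statistic is about 6.27 and the p-value is close to the reported 0.012.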
The objective of this analysis was to identify an optimal set of genes which could be used to classify “high” and “low-confidence” tumours regardless of their ER status.
A total of 96 tumours were analyzed, of which 16 were low-confidence (LC) and 80 were high-confidence (HC). A series of three independent analytical methods (SAM, GR, and WT, see below) were used to identify genes that were differentially regulated between the two groups (LC and HC). The ability of these gene sets to classify the HC or LC status of a tumour was assessed by a leave-one-out cross validation assay using either Support Vector Machine or Weighted Voting as the classification algorithm.
SAM (Significance Analysis of Microarrays): At an FDR (false discovery rate) of <15%, a total of 86 up-regulated and 2 down-regulated genes in low-confidence tumours were identified. Using this gene set, the LOOCV assay produced a classification accuracy of 84%. The 88 genes are shown in Table A1.
GR (Gene Ranking by SVM): A total of 251 genes were identified with the ability to classify the HC or LC status of a tumour, with a classification accuracy of 86%. The 251 genes are shown in Table A2.
WT (Wilcoxon Test): At a P-value of <0.05 and a >=2-fold change cutoff, a total of 38 genes were identified. This 38-gene set delivered a LOOCV accuracy of 80%. The 38 genes are shown in Table A3.
Thirteen ‘common’ genes shared among the three gene sets (SAM-88, GR-251, WT-38) were then identified. This 13-member gene set achieved a classification accuracy of 84% by LOOCV. In essence, these 13 ‘common genes’ are robust, significant markers and can achieve performance comparable to that of the other ‘complete’ marker sets. Hence they could be taken as ‘optimal’ genes. The 13 genes are shown in Table A4.
The objective of this analysis was to compare the clinical prognoses of patients with ‘high-confidence’ ER negative tumours to those patients harbouring ‘low-confidence’ ER negative tumours.
Two independent data sets were analysed, referred to as the ‘Rosetta’ and ‘Stanford’ data sets. The Rosetta data set contains 29 ER negative tumours, of which 19 are ‘high-confidence’ while 10 are ‘low-confidence’. The Stanford data set contains 19 ER negative tumours, of which 12 are ‘high-confidence’ and 7 are ‘low-confidence’. The results of the analysis are shown in
In both cases, patients with ‘low-confidence’ tumours exhibited a worse prognosis than their high-confidence counterparts. Although this difference is not statistically significant, this may be due to low numbers of patients analyzed in these studies.
The findings in this report complement and extend the previous work in this area related to the classification of breast tumours by ER subtype. In general, these studies have shown that while gene expression data can be successfully used to classify the ER subtype of most tumours, there invariably exists a certain population of tumours that exhibit a low confidence of prediction and thus cannot be accurately classified (3, 4). The inventors therefore performed an in-depth analysis of these ‘low-confidence’ tumours and made a number of surprising findings. They found that in comparison to patients with ‘high-confidence’ tumours, patients with ‘low-confidence’ tumours exhibited a significantly worse overall survival and shorter time to distant metastasis. The ‘high’ vs ‘low-confidence’ classification, arrived at by computational analysis of gene expression profiles, also served to separate ER+ tumours into groups exhibiting distinct clinical behaviours (
The inventors also made the surprising finding that the ‘low-confidence’ state is significantly associated with elevated expression of the ERBB2 receptor. However, they emphasize that the connection between ERBB2 and ‘low-confidence’ predictions remains an association, and that at this point they have no evidence (from their own data) that ERBB2 is functionally responsible for causing the ‘low-confidence’ state. Nevertheless, given that ER and ERBB2 are currently the two most clinically relevant molecular biomarkers in breast cancer, it is tempting to speculate that there may exist substantial cross-talk between these two signaling pathways in breast cancer, a possibility that has also been proposed by others (7). Intriguingly, the association between ERBB2+ and ‘low-confidence’ prediction, although highly significant, is not perfect, as a few ERBB2+ tumours were also found to exhibit ‘high-confidence’ predictions, while not all ‘low-confidence’ tumours are ERBB2+. Thus, it is unlikely that the ‘low-confidence’ population of breast tumours could have been discerned by conventional histopathological techniques used to detect ERBB2, such as IHC and FISH. Instead, the inventors believe that for tumours designated ERBB2+ by routine histopathology, further examination for the presence of such characteristic ‘expression perturbations’ may be a promising method to distinguish between tumours that are likely to be more clinically aggressive and those that will progress along a comparatively more indolent course.
Exploring this possibility will be an important task for future research. Clinically, elevated ERBB2 expression in ER+ breast tumours has long been associated with decreased sensitivity to anti-hormonal therapies, and a number of experimental studies have addressed possible mechanisms by which ERBB2 activity might cause this effect. In general, the most popular model has been one in which elevated ERBB2 signaling causes ER to exhibit diminished transcriptional activity, either through transcriptional down-regulation of the ER gene (17), posttranslational modifications of ER (e.g. phosphorylation) (18), or via induction of ER-binding corepressors such as MTA1 (19). If the effects of ERBB2 were mediated primarily through effects on ER transcriptional activity, then one might expect that a substantial number of the genes whose transcription is significantly perturbed in the ERBB2+ ‘low-confidence’ samples should correspond to genes which are direct targets of ER. The inventors found, however, that a large proportion of the genes that were significantly perturbed in both ER+ and ER− tumours have not been previously identified as estrogen-induced genes, and these genes also appear to lack potential EREs in their promoters. This is particularly the case in the ER− tumours, in which only 9% of the significantly perturbed genes were found to contain high-confidence putative EREs in their promoters. Although the inventors cannot rule out the possibility that these perturbed genes may be indirect targets of ER or may be activated by ER via non-ERE mechanisms, these findings raise the possibility that ERBB2 activity may regulate a significant fraction of genes in breast tumours in an ER-independent fashion. There are numerous avenues by which this could occur. For example, ERBB2 might regulate other transcription factors besides ER through activation of the RAS/MAPK or PI3/Akt pathways (18).
Alternatively, ERBB2 activity may result in the induction of chromatin factors such as MTA1, which may exert more pleiotropic effects (19).
Breast Tissue Samples and Patient Data
Breast tissue samples and clinical data were obtained from the Tissue Repository of the National Cancer Center of Singapore, after appropriate approvals had been obtained from the institution's Repository and Ethics Committees. Samples were grossly dissected in the operating theater immediately after surgical excision, and flash-frozen in liquid N2. Histological information (ER, ERBB2) was provided by the Department of Pathology at Singapore General Hospital, and samples were selected to provide a comparable number of ER+ and ER− tumours (as determined by IHC) for each data set.
Tumour samples contained >50% tumour content as assessed by cryosections. A set of 55 tumours (35 ER+ samples and 20 ER− samples) was used as training data, while a separate set of 41 tumours (21 ER+ and 20 ER− samples) was used for blind testing. A detailed list of all samples and clinical data for the patients is included in Table S1.
RNA was extracted from tissues using Trizol reagent and processed for Affymetrix Genechip hybridizations using U133A Genechips according to the manufacturer's instructions.
Raw chip scans were quality controlled using the Genedata Refiner program and deposited into a central data storage facility. The expression data were pre-processed by removing genes whose expression was absent throughout all samples (i.e. ‘A’ calls), subjecting the remaining genes to a log2 transformation, and median-centering by sample.
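The three pre-processing steps can be sketched as follows. This is a minimal illustration, assuming expression values and detection calls are held in plain dictionaries keyed by gene and that all retained intensities are strictly positive; the function name `preprocess` and the toy values are hypothetical and are not taken from the original data.

```python
import math
from statistics import median

def preprocess(expression, calls):
    """expression: {gene: [intensity per sample]}; calls: {gene: [detection call per sample]}."""
    # Step 1: drop genes called absent ('A') in every sample.
    kept = {g: v for g, v in expression.items()
            if any(c != "A" for c in calls[g])}
    # Step 2: log2-transform (assumes strictly positive intensities).
    logged = {g: [math.log2(x) for x in v] for g, v in kept.items()}
    # Step 3: median-centre each sample, i.e. subtract the per-sample median.
    genes = list(logged)
    n_samples = len(logged[genes[0]]) if genes else 0
    for j in range(n_samples):
        m = median(logged[g][j] for g in genes)
        for g in genes:
            logged[g][j] -= m
    return logged

# Toy input with hypothetical gene names; g4 is absent in all samples and is removed.
expression = {"g1": [8, 16], "g2": [2, 4], "g3": [32, 64], "g4": [1, 1]}
calls = {"g1": ["P", "P"], "g2": ["P", "A"], "g3": ["P", "P"], "g4": ["A", "A"]}
processed = preprocess(expression, calls)
print(processed)
```

After centering, each sample has a median log2 value of zero, which makes samples with different overall intensities comparable.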
Two classification algorithms, weighted voting (WV) (20) and support vector machines (SVMs) (21), were used to classify breast tumours according to ER subtype. Classification accuracy is defined as the number of correctly classified samples divided by the total number of samples. For the WV analyses, classification accuracy was determined using a gene set of the top 50 discriminating genes for ER status, while the SVM-based binary classifier utilized all genes.
Weighted Voting (WV): The weighted voting algorithm utilizes a signal-to-noise (S2N) metric to perform binary classifications. Each gene belonging to a predictor set is assigned a ‘vote’, expressed as the weighted difference between the gene expression level in the sample to be classified and the average class mean expression level. Weighting is determined using the correlation metric
$$P(g) = \frac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2}$$
(μ and σ denote the means and standard deviations of the expression levels of the gene in each of the two classes). The ultimate vote for a particular class assignment is computed by summing all weighted votes made by each gene used in the class discrimination. The “prediction strength” (PS) is defined as:
$$PS = \frac{V_{WIN} - V_{LOSE}}{V_{WIN} + V_{LOSE}}$$
where VWIN and VLOSE are the vote totals for the winning and losing classes, respectively. PS reflects the relative margin of victory and hence provides a quantitative reflection of prediction certainty.
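The weighted-voting scheme can be sketched in a few lines. The sketch assumes the commonly used formulation of the method cited as (20): a per-gene weight (μ1 − μ2)/(σ1 + σ2), a decision boundary at the midpoint of the two class means, and PS = (VWIN − VLOSE)/(VWIN + VLOSE). The use of the sample standard deviation, the function names and the toy data are assumptions made for illustration only.

```python
from statistics import mean, stdev

def train_weighted_voting(class_a, class_b):
    """Each class is a list of samples; each sample is a list of expression values, one per gene."""
    params = []
    for g in range(len(class_a[0])):
        xa = [s[g] for s in class_a]
        xb = [s[g] for s in class_b]
        mu_a, mu_b = mean(xa), mean(xb)
        # S2N weight; assumes the two standard deviations are not both zero.
        weight = (mu_a - mu_b) / (stdev(xa) + stdev(xb))
        boundary = (mu_a + mu_b) / 2  # average of the two class means
        params.append((weight, boundary))
    return params

def predict(params, sample):
    """Return (winning class label, prediction strength)."""
    votes_a = votes_b = 0.0
    for (weight, boundary), x in zip(params, sample):
        vote = weight * (x - boundary)
        if vote > 0:
            votes_a += vote   # positive votes favour class A
        else:
            votes_b += -vote  # negative votes favour class B
    v_win, v_lose = max(votes_a, votes_b), min(votes_a, votes_b)
    label = "A" if votes_a >= votes_b else "B"
    total = v_win + v_lose
    ps = (v_win - v_lose) / total if total else 0.0
    return label, ps

# Toy data: two genes, two samples per class (hypothetical values).
class_a = [[5, 1], [7, 3]]
class_b = [[1, 5], [3, 7]]
params = train_weighted_voting(class_a, class_b)
print(predict(params, [6, 2]))    # both genes vote for class A
print(predict(params, [5, 4.5]))  # genes disagree, so PS is lower
```

A signed PS, as plotted in the classification figure, could be obtained by negating the PS of samples assigned to the second class.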
Support Vector Machine (SVM): Support Vector Machines are classification algorithms which define a discrimination surface in the utilized feature (gene) space that attempts to maximally separate classes of training data (21). An unknown test sample's position relative to the discrimination surface determines its class. Distances are usually calculated in the n-dimensional gene space, corresponding to the total number of gene expression values considered. The inventors used SVM-FU (available at www.ai.mit.edu/projects/cbcl/) with the linear kernel to implement the SVM analysis. The confidence of each SVM prediction is based on the distance of a test sample from the discrimination surface, as previously described (22).
Due to the clinical importance of achieving good prediction confidence, the inventors conservatively chose a high confidence threshold to minimize potential false positive classifications. On the basis of the leave-one-out cross validation (LOOCV) results, they used a threshold of 0.4 and identified 16 samples (out of a total of 96) as being in the ‘low confidence’ group. A tumour sample was assigned to the “low-confidence” category if its prediction strength (PS) from WV was less than this threshold.
Selection of Differentially Expressed Genes and Determination of Expression Perturbations
Significance analysis of microarrays (SAM) is a statistical methodology developed to identify genes that are differentially expressed between separate groups (11). Genes are ranked according to their statistical likelihood of being regulated. The SAM algorithm also performs a permutation analysis of the expression data to estimate the number of genes identified as being ‘differentially regulated’ by random chance (i.e. false positives). This estimate is expressed as the ‘false discovery rate’ (FDR). Depending upon the desired stringency, different reports have used FDRs ranging from <5% to 33% (23, 24).
Student's t-test was used to compare levels of expression in the SAM-133 gene set between ‘high’ and ‘low-confidence’ groups. A gene was classified as exhibiting significant ‘perturbed expression’ if its p-value was less than 0.05.
Computational Identification of Estrogen Response Elements (EREs) using DEREF
A computational algorithm, Dragon ERE Finder (DEREF) (14), was used to identify putative estrogen response elements (EREs), which are DNA binding sites of ER within promoters (see http://sdmc.lit.org.sg/ERE-V2/index for a description of the underlying methodology of DEREF). On the default setting, DEREF produces on average one ERE pattern prediction per 13,000 nt on human genomic DNA, with a sensitivity of 83%. To reduce the number of false positives, the inventors applied in this report an additional criterion: a predicted ERE pattern of 17 nucleotides (14) also had to match (based on BLAST (25) matching without allowed gaps) a similar ERE pattern from at least one other human gene promoter, under conditions where the latter pattern could be predicted by DEREF at a sensitivity of 97%. The ERE searches in this report were performed against a database of approximately 11,000 reference human promoter sequences covering the range [−3000, +1000] relative to the 5′ end of the gene, which was generated using the FIE2 program (26, 27). Some genes to be analyzed were not contained in this promoter database, and the ERE searches for these genes were thus not performed. Such genes are denoted in Table 2 by N/A.
Weighted Voting and Leave One Out Cross Validation were performed separately for two independent data sets (referred to as the “Stanford” and “Rosetta” data sets). The results are plotted in a similar manner to those of
Stanford data set: These data were produced using 2-colour cDNA microarrays, in which PCR-amplified cDNA fragments (representing different genes) were robotically deposited onto a solid substrate to create the microarray.
Rosetta data set: These data were produced using 2-colour oligonucleotide microarrays, in which 70-80mer oligonucleotides (representing different genes) were chemically synthesized in situ on a solid substrate to create the microarray.
The Stanford data set consists of cDNA microarray data for 78 breast carcinomas (tumours) and 7 nonmalignant samples with overall patient survival information.
The Rosetta set consists of 117 early stage (lymph-node negative) breast tumours profiled using oligonucleotide-based microarrays.
As shown above, the low-confidence tumours account for around 15-19% of each breast tumour population. To confidently identify this tumour subpopulation, a data set of at least 25-30 profiles is required, and a larger one (around 80-100 tumours, as in the three data sets above) is preferred.
Table S7 shows the mean (μ) and standard deviation (σ) parameters for use in a Weighted Voting algorithm for each gene of the SAM-133 geneset. These data could be used to assign an unknown breast tumour sample as high or low confidence, given a set of expression levels for genes of the SAM-133 geneset. The genes of Table 2 are included in the SAM-133 geneset. The data are specific to Weighted Voting techniques applied to expression data from the Affymetrix U133 genechip.
Table S8 shows expression data for the Table A4 multigene classifier (common 13 genes) across high confidence and low confidence samples. The data are specific to the Affymetrix U133A genechip and have been preprocessed. The gene expression profiles of the Table A4 multigene classifier can be used as training data to build a predictive model (e.g. WV or SVM), which can then assign the confidence of an unknown breast tumour.
The data is tab delimited, and has the following format:
1st column: Probe-ID of prognostic set genes
2nd column: Gene Name
3rd and other columns: gene expression data
1st row: Sample Ids (35 samples)
2nd row: Confidence (high or low) of sample.
3rd and other rows: gene expression data
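The layout listed above can be read with a short parser. This is a sketch only: it assumes that the first two cells of the sample-ID and confidence rows are empty or hold labels that can be ignored, and the probe IDs, gene names and values in the example are placeholders rather than entries from Table S8.

```python
import csv
import io

def parse_classifier_table(text):
    """Parse the tab-delimited layout: two header rows, then one row per probe."""
    rows = [r for r in csv.reader(io.StringIO(text), delimiter="\t") if r]
    sample_ids = rows[0][2:]   # 1st row, from the 3rd column onwards
    confidence = rows[1][2:]   # 2nd row: 'high' or 'low' for each sample
    genes = []
    for row in rows[2:]:
        probe_id, gene_name = row[0], row[1]
        values = [float(x) for x in row[2:]]
        genes.append((probe_id, gene_name, values))
    return sample_ids, confidence, genes

# Placeholder content (hypothetical identifiers and values).
example = (
    "\t\tS1\tS2\tS3\n"
    "\t\thigh\tlow\thigh\n"
    "probe_1\tGENE_A\t0.5\t-1.2\t0.3\n"
    "probe_2\tGENE_B\t1.1\t0.0\t-0.7\n"
)
sample_ids, confidence, genes = parse_classifier_table(example)
print(sample_ids, confidence, genes[0])
```

The parsed confidence labels and expression vectors could then serve as training input for a classifier such as WV or SVM.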
The gene expression data is derived as described in the ‘Sample Preparation and Microarray Hybridization’ and ‘Data Preprocessing’ (see Materials and Methods section).
Table S9 shows the mean (μ) and standard deviation (σ) parameters for use in a Weighted Voting algorithm for each gene of the Table A4 geneset. These data could be used to assign an unknown breast tumour sample as high or low confidence, irrespective of ER status of the tumour, given a set of expression levels for genes of the Table A4 geneset.
The data is specific to Weighted Voting techniques applied to expression data from the Affymetrix U133 genechip.
Table 2. The top 50 genes that are significantly perturbed between ER+/Low and ER+/High samples (a), and ER−/Low and ER−/High samples (b). In the ERE column, “ERE” indicates that the promoter contains a high confidence putative ERE as predicted by DEREF, “non-ERE” indicates that a putative ERE was not found, while “Low” indicates that an ERE was found for that promoter at medium confidence. N/A means that the promoter was not analyzed as it was not possible to determine its transcription start site based on full-length transcripts. Genes are ranked in order of their S2N ratio between High and Low-confidence samples.
Table S2: Classification Results of Independent Test and External Breast Cancer Datasets
Leave-One-Out Cross Validation (LOOCV): We used a standard leave-one-out cross-validation (LOOCV) approach to assess classification accuracy in the training set. In LOOCV, one sample in the training set is initially ‘left out’, and the classifier operations (e.g. gene selection and classifier training) are performed on the remaining samples. The ‘left out’ sample is then classified using the trained algorithm, and this process is repeated for all samples in the training set.
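The LOOCV procedure can be sketched generically as a loop that retrains on all but one sample and scores the held-out prediction. In this illustration a simple nearest-centroid classifier on one-dimensional toy values stands in for WV or SVM; the classifier, the function names and the data are assumptions and not part of the original analysis.

```python
from statistics import mean

def loocv_accuracy(samples, labels, train, predict):
    """Fraction of samples classified correctly when each is left out in turn."""
    correct = 0
    for i in range(len(samples)):
        # Train on every sample except the i-th one.
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        # Classify the held-out sample with the freshly trained model.
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)

def train_centroid(xs, ys):
    """Stand-in classifier: one centroid (mean value) per class."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}

def predict_centroid(model, x):
    """Assign the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

# Toy one-dimensional data; the last sample sits among the opposite class.
samples = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9, 2.8]
labels = ["neg", "neg", "neg", "pos", "pos", "pos", "neg"]
accuracy = loocv_accuracy(samples, labels, train_centroid, predict_centroid)
print(f"LOOCV accuracy: {accuracy:.2f}")
```

Any gene selection step would belong inside the `train` callable, so that the held-out sample never influences which genes are chosen.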
The output of the WV analyses for all four data sets (including PS) and corresponding p-values for the association of ERBB2 expression with prediction confidence can be obtained as an Excel file from http://www.omniarray.com/ERClassification.html.
Table S3: Identification of Genes Important for ER Subtype Discrimination
Significance Analysis of Microarrays (SAM) was used to identify and rank 133 genes that were differentially regulated between ER+ and ER− tumors (FDR of 0%, ≥2-fold expression change). Of these, 122 are up-regulated in ER+ (positive genes) and 11 are down-regulated in ER+ (negative genes). The S2N ratio of a particular gene reflects the extent of the expression perturbation observed between Low and High confidence samples.
Homo sapiens, clone MGC: 1925,
Homo sapiens mRNA; cDNA
Homo sapiens mRNA; cDNA
Homo sapiens mRNA; cDNA
Homo sapiens mRNA; cDNA
Homo sapiens mRNA; cDNA
Homo sapiens clone 23736 mRNA
Homo sapiens cDNA: FLJ21695 fis,
Homo sapiens mRNA for membrane
Due to the limited number of ER negative genes, we decreased the threshold of SAM to derive 54 genes with FDR of 0%. These negative genes were used in
Table S4: Comparing the Global Expression Profiles of ‘High’ and ‘Low-Confidence’ Tumors
SAM was used to identify differentially regulated genes between a) ER+ ‘High’ and ‘Low’ Confidence tumors, and b) ER− ‘High’ and ‘Low’ Confidence tumors. For the ER+ comparison, 50 genes were identified as up-regulated and 39 as down-regulated in ER+/Low relative to ER+/High tumors. For the ER− comparison, 50 genes were identified as up-regulated in ER−/Low relative to ER−/High tumors, and no genes were identified as down-regulated.
Homo sapiens mRNA; cDNA DKFZp564G112 (from clone
Homo sapiens clone 23809 mRNA sequence
Homo sapiens PAC clone RP5-1093O17 from 7q11.23-q21
Homo sapiens cDNA: FLJ21695 fis, clone COL09653
Homo sapiens cytokine-like nuclear factor n-pac mRNA, complete
Use of DRAGON-ERE Finder (DEREF) to Identify Putative EREs in Gene Promoters
The DEREF algorithm was used to define potential EREs in the promoters of genes belonging to various categories (see http://sdmc.lit.org.sg/ERE-V2/index for a description of the underlying methodology of DEREF). The manuscript of ref. 14 can be accessed via http://www.omniarray.com/ERClassification.html. The estrogen-induced SAGE data set was derived from http://143.111.133.249/ggeg/ (see ref. 13), using the thresholds of 3 hr fold increase ≥2 and 3 hr p value <0.005. This selected 65 SAGE tags, which matched 68 genes that were further subjected to ERE analysis. The gene set of the top 100 genes negatively correlated to ER status was derived using SAM. Table S6a depicts the results.
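The SAGE tag selection described above amounts to a two-threshold filter, sketched below in Python. The record layout and the field names `tag`, `fold_3hr` and `p_3hr` are hypothetical and do not reflect the actual format of the SAGE data set. Only the two thresholds are taken from the text.

```python
def select_induced_tags(rows, min_fold=2.0, max_p=0.005):
    """Keep tags with 3 hr fold increase >= min_fold and 3 hr p value < max_p.

    `rows` is a list of dicts with hypothetical keys
    'tag', 'fold_3hr' and 'p_3hr'.
    """
    return [
        r["tag"]
        for r in rows
        if r["fold_3hr"] >= min_fold and r["p_3hr"] < max_p
    ]
```

The fold threshold is inclusive and the p value threshold is strict, matching the ≥2 and <0.005 conditions stated above.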
Table S7: Weighted Voting parameters for mean (μ) and standard deviation (σ) of expression data
SAM-133 geneset
Homo sapiens mRNA for membrane glycoprotein LIG-1, complete cds.
Homo sapiens, clone MGC: 1925, mRNA, complete cds.
Homo sapiens mRNA; cDNA DKFZp564F053 (from clone DKFZp564F053)
Homo sapiens mRNA; cDNA DKFZp564F053 (from clone DKFZp564F053)
Homo sapiens mRNA; cDNA DKFZp434D2111 (from clone DKFZp434D2111)
Homo sapiens mRNA; cDNA DKFZp434D2111 (from clone DKFZp434D2111)
Homo sapiens mRNA; cDNA DKFZp434E082 (from clone DKFZp434E082)
Homo sapiens clone 23736 mRNA sequence
Homo sapiens cDNA: FLJ21695 fis, clone COL09653
indicates data missing or illegible when filed
Homo sapiens cDNA FLJ36630 fis, clone TRACH2018278, mRNA sequence
Homo sapiens cDNA FLJ34019 fis, clone FCBBF2002898, mRNA sequence
Homo sapiens cDNA FLJ30298 fis, clone BRACE2003172, mRNA sequence
Homo sapiens cDNA FLJ30096 fis, clone BNGH41000045, mRNA sequence
Homo sapiens cDNA FLJ12140 fis, clone MAMMA1000340, mRNA sequence
Homo sapiens mRNA; cDNA DKFZp564F053 (from clone DKFZp564F053), mRNA sequence
Homo sapiens cDNA FLJ35646 fis, clone SPLEN2012743, mRNA sequence
Homo sapiens mRNA; cDNA DKFZp434E235 (from clone DKFZp434E235), mRNA sequence
Homo sapiens cDNA FLJ38575 fis, clone HCHON2007046, mRNA sequence
Homo sapiens clone 24566 mRNA sequence
Homo sapiens cDNA: FLJ21521 fis, clone COL05880, mRNA sequence
Number | Date | Country | Kind |
---|---|---|---|
0323226.1 | Oct 2003 | GB | national |
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/GB04/04190 | 10/1/2004 | WO | 00 | 4/23/2007 |