This application claims the benefit of and priority from Australian Provisional Application No. 2021901939, filed Jun. 25, 2021, the contents and disclosures of which are incorporated herein by reference in their entirety.
The present invention relates to methods for the classification and stratification of cells within tumour samples. In one aspect, the invention provides for methods for determining cell-type abundances in whole tumour samples and categorising these cell-type abundances into ecotypes.
Cancer largely results from various molecular aberrations comprising somatic mutational events such as single nucleotide mutations, copy number changes and DNA methylations. In addition, cancer is viewed as a wildly heterogeneous disease, consisting of different subtypes with diverse molecular progression of oncogenesis and therapeutic responses. Many organ-specific cancers have established definitions of molecular subtypes on the basis of genomic, transcriptomic, and epigenomic characterizations, indicating diverse molecular oncogenic processes and clinical outcomes.
One such example is breast cancer (BrCa), which is stratified based on the expression of the estrogen receptor (ER) and progesterone receptor (PR) and on overexpression of HER2 or amplification of the HER2 gene ERBB2. This results in three broad clinical subtypes of BrCa: Luminal (ER+, PR+/−), HER2+ (HER2+, ER+/−, PR+/−) and triple negative (TNBC; ER−, PR−, HER2−) that correlate with prognosis and define treatment strategies. Luminal cancers have an inherently less aggressive natural history than the HER2+ and TNBC subsets and are typically treated with systemic endocrine therapy targeting the estrogen receptor, with or without cytotoxic chemotherapy. HER2+ cancers are treated with small molecule and antibody-based systemic drugs targeting the HER2 receptor plus cytotoxic chemotherapy. TNBC are typically only eligible for systemic cytotoxic chemotherapy and thus have the poorest outcomes of the three subtypes. BrCa are also stratified based on bulk transcriptomic profiling using the ‘PAM50’ gene signature into five ‘intrinsic’ molecular subtypes: luminal-like (LumA and LumB), HER2-enriched (HER2E), basal-like (BLBC) and normal-like. There is ~70-80% concordance between molecular subtypes and clinical subtypes. For instance, the HER2E subtype is composed of clinically HER2+ and HER2− BrCa, as well as those that are ER+ and ER−.
BrCa comprise diverse cellular microenvironments, whereby heterotypic interactions between neoplastic and non-neoplastic cells, such as stromal and immune cells, are important in defining disease etiology and response to treatment. So, while BrCa are generally considered to have a low mutational burden and immunogenicity, there is evidence that immune activation is pivotal in a subset of patients. It has followed that the presence of tumour infiltrating lymphocytes is a strong biomarker for good clinical outcome and complete pathological response to neoadjuvant chemotherapy. In contrast, tumour associated macrophages are often associated with poor prognosis and are recognised as important emerging targets for cancer immunotherapy. Moreover, mesenchymal cells have also emerged as important regulators of the malignant phenotype, chemotherapy response and anti-tumour immunity. Although these findings have elevated mesenchymal cells as critical mediators of tumour biology, progress has been impeded by a lack of a clear taxonomy of stromal subclasses.
Our understanding of the cellular heterogeneity and tissue architecture of human cancers has been largely derived from histology, bulk-sequencing, low dimensionality hypothesis-based studies and experimental model systems. As a consequence, information about the tumour microenvironment has not yet been integrated into clinical stratification and stromal-directed therapies are not yet in clinical practice.
A more detailed transcriptional atlas of various cancers at high molecular resolution, representative of all subtypes and cell types, is therefore required to further define the taxonomy of the disease and to determine how cells in the tumour microenvironment are organized as functional units in space. The identification of tumour heterogeneity is essential to the design of effective stratified treatments and for the discovery of treatments that can be extended to particular tumour subtypes.
In view of the above-described limitations, there is a need for improved methods for cancer stratification that overcome one or more of the above described limitations.
It will be clearly understood that, if a prior art publication is referred to herein, this reference does not constitute an admission that the publication forms part of the common general knowledge in the art in Australia or in any other country.
In an aspect of the invention, there is provided a method for the identification of an ecotype within cancer samples, the method comprising:
In an embodiment of the invention, the step of generating the gene expression profiles from the cells of the training set samples comprises annotating cells within each of the cancer sample training sets as a specific cell type and/or cell state.
In another aspect of the invention, there is provided a method for the identification of an ecotype within cancer samples, the method comprising:
In an embodiment, the step of generating a cell abundance profile based on the respective cancer sample training set comprises:
generating a cell gene expression profile for each cell of the respective cancer sample training set based on cell type or cell state, wherein the cell gene expression profile correlates with a distinct cell type and/or cell state within the respective cancer sample training set.
In another aspect of the invention, there is provided a method for the identification of an ecotype within cancer samples, the method comprising:
In an embodiment of the invention, the method includes optionally applying the training set to a cancer sample from a subject by:
In an embodiment, the step of generating cell gene expression profiles comprises annotating cells within the cancer sample training sets as a specific cell type and/or cell state.
In another aspect of the invention, there is provided a method for the identification of an ecotype within cancer samples, the method comprising:
In another aspect of the invention, there is provided a method for generating cell gene expression profiles based on which an ecotype within cancer samples can be determined, the method comprising:
In an embodiment, from the cell gene expression profiles, an ecotype within cancer samples can be determined by:
In an embodiment of the invention, the step of performing or having performed bulk gene expression RNA sequencing on cancer samples to generate a bulk gene expression profile of the cancer samples comprises the generation of bulk gene expression profiles from the same samples or the generation of an independent dataset of bulk expression profiles, e.g., METABRIC.
In an embodiment of the invention, the ecotype may be selected from the group consisting of E1, E2, E3, E4, E5, E6, E7, E8 and E9.
In an embodiment of the invention, all steps of the methods described herein may be performed on a computer except for the initial generation of the single-cell or bulk gene expression profiles from the cancer sample.
In another aspect, there is provided a method for diagnosing or prognosing cancer in a subject, the method comprising:
In another aspect of the invention, there is provided a method for diagnosing or prognosing cancer in a subject, the method comprising:
In an embodiment of the invention, where an identification of ecotype, diagnosis, prognosis or prediction to drug treatment or survival is provided, the method may comprise:
Where the training of a predictor set of cancer samples from subjects with known diagnosis, prognosis or prediction to drug treatment or survival is required, the method may comprise:
In another embodiment, where the training of a predictor set of cancer samples from subjects with known ecotype, diagnosis, prognosis or prediction to drug treatment or survival is required, the method may comprise applying the predictor set to a test cancer sample from a subject by:
In an embodiment of the invention, the method comprises identifying a treatment for the subject based on the identification of the ecotype of the cancer sample. In this embodiment, the treatment may comprise chemotherapy, hormonal therapy, radiation therapy, biological therapy such as immunotherapy, small molecule therapy or antibody therapy, or a combination thereof. In another embodiment, the method comprises administering the identified treatment.
In an embodiment, the cancer may be any cancer known in the art, including, but not limited to, a basal cell carcinoma; biliary tract cancer; bladder cancer; bone cancer; brain and central nervous system cancer; breast cancer; cancer of the peritoneum; cervical cancer; choriocarcinoma; colon and rectum cancer; connective tissue cancer; cancer of the digestive system; endometrial cancer; esophageal cancer; eye cancer; cancer of the head and neck; gastric cancer (including gastrointestinal cancer); glioblastoma; hepatic carcinoma; hepatoma; intraepithelial neoplasm; kidney or renal cancer; larynx cancer; leukemia; liver cancer; lung cancer (e.g., small-cell lung cancer, non-small cell lung cancer, adenocarcinoma of the lung, and squamous carcinoma of the lung); melanoma; myeloma; neuroblastoma; oral cavity cancer (lip, tongue, mouth, and pharynx); ovarian cancer; pancreatic cancer; prostate cancer; retinoblastoma; rhabdomyosarcoma; rectal cancer; cancer of the respiratory system; salivary gland carcinoma; sarcoma; skin cancer; squamous cell cancer; stomach cancer; testicular cancer; thyroid cancer; uterine or endometrial cancer; cancer of the urinary system; vulval cancer; lymphoma including Hodgkin's and non-Hodgkin's lymphoma, as well as B-cell lymphoma (including low grade/follicular non-Hodgkin's lymphoma (NHL); small lymphocytic (SL) NHL; intermediate grade/follicular NHL; intermediate grade diffuse NHL; high grade immunoblastic NHL; high grade lymphoblastic NHL; high grade small non-cleaved cell NHL; bulky disease NHL; mantle cell lymphoma; AIDS-related lymphoma; and Waldenstrom's Macroglobulinemia); chronic lymphocytic leukemia (CLL); acute lymphoblastic leukemia (ALL); Hairy cell leukemia; chronic myeloblastic leukemia; as well as other carcinomas and sarcomas; and post-transplant lymphoproliferative disorder (PTLD), as well as abnormal vascular proliferation associated with phakomatoses, edema (such as that associated with brain tumours), and Meigs' syndrome.
In an embodiment, the sample was obtained from a subject who has, or is suspected of having, breast cancer and exhibits one or more of the following symptoms:
In an embodiment, the cancer is diagnosed according to one or more clinical subtypes: HR+/HER2− (“Luminal A”); HR−/HER2− (“Triple Negative”); HR+/HER2+ (“Luminal B”); or HR−/HER2+ (“HER2-enriched”). In another embodiment, the subject is diagnosed with a non-invasive or invasive carcinoma including ductal, lobular, colloid (mucinous), medullary, micropapillary, papillary, and tubular invasive carcinoma.
In an embodiment, the method further comprises diagnosing the subject with any type of cancer defined herein or known in the art, preferably breast cancer. In another embodiment, the method further comprises a step of treating the subject for a period of time sufficient for a therapeutic response prior to obtaining the sample from the subject.
In an embodiment, the treatment comprises an adjuvant or neoadjuvant therapy. In another embodiment, the neoadjuvant or adjuvant therapy comprises or is selected from the group consisting of radiotherapy, chemotherapy, immunotherapy, biological response modifiers or hormone therapy.
In an embodiment, any gene expression profile or matrix described herein is generated using reverse transcription and real-time quantitative polymerase chain reaction (qPCR) with primers specific for each of the genes. In another embodiment, the gene expression profile is generated by microarray analysis with probes specific for each of the genes. In yet another embodiment, the gene expression profile or matrix is generated using RNA-Seq or other methods known in the art, including the NanoString GeoMx DSP platform, which uses hybridisation of probes followed by elution and sequencing of the probes to estimate gene expression; spatial transcriptomics (commercialised as Visium by 10× Genomics), which uses spotted arrays of barcoded capture probes in a manner similar to a microarray; and methods that use in situ sequencing to perform targeted RNA-Seq. In a preferred embodiment, the gene expression profile or matrix is generated using single-cell RNA sequencing.
In an embodiment, the gene expression profile is normalised to a control, preferably one or more housekeeping genes. In this embodiment, the housekeeping genes may be selected from RRN18S, ACTB, GAPDH, PGK1, PPIA, RPL13A, RPLP0, B2M, GUSB, HPRT1 and TBP.
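By way of non-limiting illustration only, normalisation of a gene expression profile to housekeeping genes may be sketched as follows. The function name, the use of a log2 ratio against the geometric mean of the housekeeping genes, and the example expression values are assumptions made for this sketch and are not taken from the specification.

```python
import math


def normalise_to_housekeeping(expression, housekeeping_genes):
    """Return log2 expression relative to the housekeeping reference.

    The reference is the mean log2 expression (i.e. the log2 geometric mean)
    of the housekeeping genes found in the profile. This choice of reference
    is an assumption made for illustration.
    """
    reference_values = [
        expression[gene] for gene in housekeeping_genes if gene in expression
    ]
    if not reference_values:
        raise ValueError("no housekeeping genes found in the profile")
    if any(value <= 0 for value in reference_values):
        raise ValueError("housekeeping expression values must be positive")
    log_reference = sum(math.log2(v) for v in reference_values) / len(reference_values)
    # Genes with non-positive expression cannot be log-transformed and are skipped.
    return {
        gene: math.log2(value) - log_reference
        for gene, value in expression.items()
        if value > 0
    }


# Hypothetical profile: the expression values are invented for the example.
example_profile = {"ACTB": 400.0, "GAPDH": 100.0, "ESR1": 800.0}
example_normalised = normalise_to_housekeeping(example_profile, ["ACTB", "GAPDH"])
```

In this sketch a value of 0 indicates expression equal to the housekeeping reference, and each unit corresponds to a two-fold difference.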
In another embodiment, the method comprises one or more of the following diagnostic tests:
In another aspect, there is provided a method for predicting survival in a subject having or suspected of having cancer, the method comprising:
In another aspect, there is provided a method for predicting survival in a subject having or suspected of having cancer, the method comprising:
In an embodiment, the prognosis or survival is selected from the group comprising or consisting of cancer specific survival, event-free survival, or response to therapy.
In an embodiment, samples with Basal-like and proliferative cells (or E3 as described herein) correlate with a poorer survival outcome or prognosis. In another embodiment, samples with HER2E and HER2E_SC cells (or E7 as described herein) correlate with a poorer survival outcome or prognosis. In another embodiment, samples with ecotypes comprising LumA and Normal-like cells (or E2 as described herein) correlate with a better survival outcome or prognosis. In another embodiment, samples with ecotypes comprising LumA, Normal-like cells as well as endothelial CXCL12+ and ACKR1+ cells, s1 MSC iCAFs and a depletion of cycling cells (or E2 as described herein) correlate with a better survival outcome or prognosis. Accordingly, ecotypes with a better survival outcome or prognosis have a better likelihood of cancer specific survival, event-free survival, or response to therapy.
In another aspect, there is provided a method for predicting a response to therapy in a subject having or suspected of having cancer, the method comprising:
In another aspect, there is provided a method for predicting a response to therapy in a subject having or suspected of having cancer, the method comprising:
In another aspect, there is provided a method for treating cancer in a subject having or suspected of having cancer, the method comprising:
In another aspect, there is provided a method for treating cancer in a subject having or suspected of having cancer, the method comprising:
In another aspect, there is provided use of a treatment in the preparation of a medicament for treating cancer in a subject having or suspected of having cancer, the use comprising:
In another aspect, there is provided use of a treatment in the preparation of a medicament for treating cancer in a subject having or suspected of having cancer, the use comprising:
In an embodiment, the sample comprises ecotypes with cell type abundances selected from the group comprising or consisting of immune enriched cells; cycling cells; normal or healthy cells; PVLs; endothelial cells; myeloid cells; plasmablasts; B-cells; T-cells; innate lymphoid cells (ILCs); cancer associated fibroblasts; immune depleted; high cancer heterogeneity; and combinations thereof.
In an embodiment, the gene expression profile comprises a plurality of gene expression profiles, each of which correlates with a distinct cell type within a sample.
In an embodiment, the method comprises providing or having provided a cancer sample comprising different cell types.
In an embodiment, the sample comprises bulk tissue. In another embodiment, the sample comprises cells, blood or body fluid. In another embodiment, the sample comprises a formalin-fixed, paraffin-embedded (FFPE) tissue or a frozen tissue.
In a preferred embodiment, the cancer is breast cancer.
In an embodiment, the method comprises single cell RNA sequencing of at least 1000, 2000, 3000, 4000 or 5000 cells.
In an embodiment, the deconvolution module comprises estimating cell type abundance using any deconvolution method known in the art, preferably the CIBERSORTx or DWLS method.
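By way of non-limiting illustration only, the estimation of cell type abundance from a bulk profile and a signature matrix may be sketched as a non-negative least squares problem. The sketch below is neither CIBERSORTx nor DWLS; it swaps in a simple multiplicative-update solver purely to show the shape of the calculation. The function name, the iteration count, and the example signature values are assumptions.

```python
def estimate_fractions(signature, bulk, iterations=500):
    """Estimate cell type fractions f such that signature * f approximates bulk.

    signature: list of rows, one row per gene, one column per cell type
               (all values non-negative).
    bulk:      list of non-negative expression values, one per gene.
    Uses multiplicative updates for non-negative least squares, then
    rescales the result so that the fractions sum to 1.
    """
    n_genes = len(signature)
    n_types = len(signature[0])
    # Pre-compute S^T m and S^T S.
    st_m = [
        sum(signature[g][k] * bulk[g] for g in range(n_genes))
        for k in range(n_types)
    ]
    st_s = [
        [
            sum(signature[g][j] * signature[g][k] for g in range(n_genes))
            for k in range(n_types)
        ]
        for j in range(n_types)
    ]
    fractions = [1.0 / n_types] * n_types
    for _ in range(iterations):
        denominators = [
            sum(st_s[j][k] * fractions[k] for k in range(n_types))
            for j in range(n_types)
        ]
        fractions = [
            fractions[j] * st_m[j] / denominators[j] if denominators[j] > 0 else 0.0
            for j in range(n_types)
        ]
    total = sum(fractions)
    return [value / total for value in fractions] if total > 0 else fractions


# Hypothetical signature: 3 genes x 2 cell types, mixed at 70% / 30%.
example_signature = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
example_bulk = [7.0, 3.0, 5.0]
example_fractions = estimate_fractions(example_signature, example_bulk)
```

The multiplicative update keeps every estimate non-negative without an explicit constraint step, which is why it was chosen for a short sketch; published deconvolution methods use different estimators and additional normalisation.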
In another aspect, the invention provides a kit for identifying an ecotype in a cancer sample, the kit comprising reagents for the detection of the genes in the cancer sample. In an embodiment, the reagents comprise oligonucleotide primers and/or probes sufficient for the detection and/or quantitation of one or more of the genes in a cancer sample.
Any of the features described herein can be combined in any combination with any one or more of the other features described herein within the scope of the invention.
This patent application contains at least one drawing executed in color. Copies of this patent application with color drawing(s) will be provided by the Office upon request and payment of the necessary fee.
Various embodiments of the invention will be described with reference to the following drawings, in which:
Preferred features, embodiments and variations of the invention may be discerned from the following Description which provides sufficient information for those skilled in the art to perform the invention. The following Description is not to be regarded as limiting the scope of the preceding Summary of the Invention in any way.
Reference will now be made in detail to certain embodiments of the invention. While the invention will be described in conjunction with the embodiments, it will be understood that the intention is not to limit the invention to those embodiments. On the contrary, the invention is intended to cover all alternatives, modifications, and equivalents, which may be included within the scope of the present invention as defined by the claims. One skilled in the art will recognize many methods and materials similar or equivalent to those described herein, which could be used in the practice of the present invention. The present invention is in no way limited to the methods and materials described.
It will be understood that the invention disclosed and defined in this specification extends to all alternative combinations of two or more of the individual features mentioned or evident from the text or drawings. All of these different combinations constitute various alternative aspects of the invention. It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
Throughout this specification, unless specifically stated otherwise or the context requires otherwise, reference to a single step, composition of matter, group of steps or group of compositions of matter shall be taken to encompass one and a plurality (i.e. one or more) of those steps, compositions of matter, groups of steps or groups of compositions of matter. Thus, as used herein, the singular forms “a”, “an” and “the” include plural aspects, and vice versa, unless the context clearly dictates otherwise. For example, reference to “a” includes a single as well as two or more; reference to “an” includes a single as well as two or more; reference to “the” includes a single as well as two or more and so forth.
In the present specification and claims (if any), the word ‘comprising’ and its derivatives including ‘comprises’ and ‘comprise’ include each of the stated integers but do not exclude the inclusion of one or more further integers.
The present invention is not to be limited in scope by the specific examples described herein, which are intended for the purpose of exemplification only. Functionally-equivalent products, compositions and methods are clearly within the scope of the present invention.
Any example or embodiment of the present invention herein shall be taken to apply mutatis mutandis to any other example or embodiment of the invention unless specifically stated otherwise.
Unless specifically defined otherwise, all technical and scientific terms used herein shall be taken to have the same meaning as commonly understood by one of ordinary skill in the art (for example, in cell culture, molecular genetics, immunology, immunohistochemistry, protein chemistry, and biochemistry).
The inventors show herein for the first time the development of a single cell method for the stratification of tumour samples into tumour ecotypes. In particular, by using single cell signatures, deconvolution of large breast cancer cohorts allows for the stratification of tumour samples into nine clusters, termed ‘ecotypes’, with unique cellular compositions and clinical outcomes.
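By way of non-limiting illustration only, the grouping of samples into clusters according to their cell type abundance profiles may be sketched as follows. The clustering algorithm is not specified in this passage; plain k-means with deterministic seeding is swapped in here as an assumption, and the function name and the toy abundance profiles are hypothetical. In practice the number of clusters would be nine, corresponding to the ecotypes described above.

```python
def cluster_abundance_profiles(profiles, n_clusters, iterations=50):
    """Cluster abundance profiles with k-means.

    profiles:   list of equal-length lists of cell type fractions.
    n_clusters: number of clusters (nine for the ecotypes described above).
    The first n_clusters profiles seed the centroids so the result is
    deterministic. Returns (labels, centroids).
    """
    centroids = [list(profile) for profile in profiles[:n_clusters]]
    labels = None
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = []
        for profile in profiles:
            distances = [
                sum((a - b) ** 2 for a, b in zip(profile, centroid))
                for centroid in centroids
            ]
            new_labels.append(distances.index(min(distances)))
        # Update step: each centroid becomes the mean of its members.
        for j in range(n_clusters):
            members = [p for p, label in zip(profiles, new_labels) if label == j]
            if members:
                centroids[j] = [sum(column) / len(members) for column in zip(*members)]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
    return labels, centroids


# Hypothetical abundance profiles (three cell types, four samples).
example_profiles = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
]
example_labels, example_centroids = cluster_abundance_profiles(example_profiles, 2)
```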
This approach has advantages over previously described approaches including:
Moreover, whilst WO 2019/018684 provides a computational framework for performing in silico tissue dissection to accurately infer cell type abundance and cell type (e.g., cell type-specific) gene expression from RNA profiles of intact tissues, the inventors' work described herein provides superior signatures that have been specifically extracted from breast cancers, and provides for clustering of patients, optionally after deconvolution, to stratify patients with similar cellular composition into ecotypes.
Tissue composition can be a major determinant of phenotypic variation and a key factor influencing disease outcomes. Although scRNA-Seq can be a powerful technique for characterizing cellular heterogeneity, it can be impractical for large sample cohorts and may not be applied to fixed specimens collected as part of routine clinical care. To overcome these challenges, the present disclosure provides a platform for in silico cytometry that can enable the simultaneous inference of cell type-specific gene expression profiles (GEPs) and cell type abundance from bulk tissue transcriptomes. Using the methods disclosed herein for in silico purification, bulk tissue composition can be accurately estimated using scRNA-Seq-derived reference signatures. The disclosed methods and systems may link unbiased cell type discovery with large-scale tissue dissection. Digital cytometry can augment single-cell profiling efforts, enabling cost-effective, high-throughput tissue characterization without antibodies, disaggregation, or viable cells.
Immunophenotyping approaches, such as flow cytometry and immunohistochemistry (IHC), can rely on small combinations of preselected marker genes, which can limit the number of cell types that can be simultaneously interrogated. By contrast, single-cell mRNA sequencing (scRNA-Seq) can be used for unbiased transcriptional profiling of hundreds to thousands of individual cells from a single-cell suspension. Despite the power of this technology, analyses of large sample cohorts may not be practical, and many fixed clinical specimens (e.g., formalin-fixed, paraffin embedded (FFPE) samples) may not be dissociated into single-cell suspensions. Furthermore, the impact of tissue disaggregation on cell type representation may be poorly understood.
Computational techniques for dissecting cellular content directly from genomic profiles of mixture samples may rely on a specialized knowledgebase of cell type-specific “barcode” genes (e.g., a “signature matrix”), which is derived from FACS-purified or in vitro differentiated/stimulated cell subsets. Although useful when cell types of interest are well defined, such gene signatures may be suboptimal for the discovery of novel cell types and cell type gene expression profiles, and for capturing the full spectrum of major cell phenotypes in complex tissues.
The present disclosure provides a computational framework to accurately infer cell type abundance and cell type-specific gene expression from RNA profiles of intact tissues. By leveraging cell type expression signatures from single-cell experiments or sorted cell subsets, the methods of the present disclosure can provide comprehensive portraits of tissue composition without physical dissociation, antibodies, or living material. Such approaches may include, for example, a method for enumerating cell composition from tissue gene expression profiles with techniques for cross-platform data normalization and in silico cell purification. The latter can allow the transcriptomes of individual cell types of interest to be digitally “purified” from bulk RNA admixtures without physical isolation. As a result, changes in cell type-specific gene expression can be inferred without cell separation or prior knowledge. The results described herein illustrate that methods of the present disclosure are useful for deciphering complex tissues, with implications for high-resolution cell phenotyping in research and clinical settings.
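By way of non-limiting illustration only, the inference of cell type abundance from a bulk profile is commonly framed as a linear mixture problem. The symbols below are introduced for this illustration and do not appear elsewhere in this specification; the estimator used to solve for the fractions depends on the deconvolution method chosen.

```latex
\[
\mathbf{m} \;=\; S\,\mathbf{f} + \boldsymbol{\varepsilon},
\qquad f_k \ge 0 \ \ (k = 1,\dots,K),
\qquad \sum_{k=1}^{K} f_k = 1
\]
% m : bulk expression vector over G genes
% S : G x K signature matrix (one column per cell type)
% f : vector of K cell type fractions to be estimated
% \varepsilon : residual not explained by the signature matrix
```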
The methods described herein can be used to decode cellular heterogeneity in complex tissues. This strategy can be used to “digitally gate” cell subsets of interest from single-cell transcriptomes, profile the identities and expression patterns of these cells in cohorts of bulk tissue gene expression profiles (e.g., fixed specimens from clinical trials), and systematically determine their associations with diverse metadata, including genomic features and clinical outcomes.
The term “scRNA-Seq,” as used herein, generally refers to a single-cell RNA sequencing method to obtain expression profiles of individual cells. For example, single-cell libraries can be prepared from single-cell suspensions of dissociated cancers (e.g., from cancer patients) using Chromium with v2 chemistry (10× Genomics). Such single-cell libraries can be sequenced (e.g., on a NextSeq 500 (Illumina)). Sequencing reads may be processed, for example, by alignment, filtration, deduplication, and/or conversion into a digital count matrix using Cell Ranger 1.2 (10× Genomics).
Outlier cells may be identified and filtered based on (1) anomalously high/low mitochondrial gene expression (e.g., cells with >10% or <1% mitochondrial content may be removed) and/or (2) potential doublets/multiplets, as identified by comparing the number of expressed genes detected per cell versus the number of unique molecular identifiers (UMIs) detected per cell (e.g., cells with greater than 3,500 or fewer than 500 expressed genes may be removed). Clusters may be identified (e.g., using Seurat v.1.4.0.16) by (1) regressing out the dependence of gene expression on the number of UMIs and the percentage of mitochondrial content, and (2) running “FindClusters” on a suitable number (e.g., 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, or more than 10) of principal components of the data. Cell labels may be assigned according to the expression of canonical marker genes, for instance in leukocytes (e.g., MS4A1 high=B cells; CD8A high and GNLY low=CD8 T cells; CD3E high, CD8A low, and GNLY low=CD4 T cells; GNLY high and CD3E low=NK cells; GNLY high and CD3E high=NKT cells; CD14 high=monocytes). Publicly available PBMC datasets from healthy donors profiled by Chromium v2 (5′ and 3′ kits) may be downloaded (Table 1) and preprocessed as above, with the following minor modifications.
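By way of non-limiting illustration only, the assignment of leukocyte labels from canonical marker genes may be sketched as a small set of rules. The marker combinations follow the examples given above; the function name, the numeric threshold separating “high” from “low”, the order in which the rules are evaluated, and the “unassigned” fallback are assumptions made for this sketch.

```python
def label_leukocyte_cluster(mean_expression, threshold=1.0):
    """Assign a leukocyte label to a cluster from its mean marker expression.

    mean_expression: dict mapping gene symbol to mean expression in the cluster.
    threshold:       hypothetical cut-off; values at or above it count as "high".
    """
    def high(gene):
        return mean_expression.get(gene, 0.0) >= threshold

    if high("MS4A1"):
        return "B cell"
    if high("CD14"):
        return "monocyte"
    if high("GNLY"):
        # GNLY high: NKT cells if CD3E is also high, otherwise NK cells.
        return "NKT cell" if high("CD3E") else "NK cell"
    if high("CD8A"):
        # CD8A high with GNLY low.
        return "CD8 T cell"
    if high("CD3E"):
        # CD3E high with CD8A low and GNLY low.
        return "CD4 T cell"
    return "unassigned"


# Hypothetical cluster averages, invented for the example.
example_nk = label_leukocyte_cluster({"GNLY": 2.0, "CD3E": 0.1})
example_cd4 = label_leukocyte_cluster({"CD3E": 2.0, "CD8A": 0.2, "GNLY": 0.0})
```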
During quality control, cells with >5000 expressed genes for 5′ assays, >4000 expressed genes for 3′ assays, and <200 expressed genes may be excluded. Seurat “FindClusters” may be applied on the first 20 principal components, with the resolution parameter set to 0.6. Cell labels may be assigned as described above. In addition, myeloid cells may be defined by high CD68 expression, megakaryocytes may be defined by high PPBP expression, and dendritic cells may be defined by high FCER1A expression.
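By way of non-limiting illustration only, the gene-count quality control thresholds described above may be sketched as follows. The thresholds are those stated above; the function names, the assay keys and the per-cell record layout are assumptions made for this sketch.

```python
# Upper limits on expressed genes per cell, by assay chemistry (from the text above).
MAX_EXPRESSED_GENES = {"5prime": 5000, "3prime": 4000}
# Lower limit applied to both assay types.
MIN_EXPRESSED_GENES = 200


def passes_gene_count_qc(n_expressed_genes, assay):
    """Return True if a cell is retained under the gene-count thresholds."""
    return MIN_EXPRESSED_GENES <= n_expressed_genes <= MAX_EXPRESSED_GENES[assay]


def filter_cells(cells, assay):
    """Keep cells (dicts with an 'n_genes' entry) that pass quality control."""
    return [cell for cell in cells if passes_gene_count_qc(cell["n_genes"], assay)]


# Hypothetical cells, invented for the example.
example_cells = [
    {"barcode": "cell_a", "n_genes": 150},
    {"barcode": "cell_b", "n_genes": 1800},
    {"barcode": "cell_c", "n_genes": 4500},
]
example_kept_3prime = filter_cells(example_cells, "3prime")
example_kept_5prime = filter_cells(example_cells, "5prime")
```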
The term “bulk RNA-Seq,” as used herein, generally refers to a bulk RNA sequencing method to obtain expression profiles of bulk cell populations or tissues. For example, total RNA may be isolated from blood samples stored in, e.g., PAXgene tubes using, e.g., the PAXgene Blood RNA Kit (Qiagen) according to the manufacturer's recommendations. RNA may be quantitated and quality assessed using, e.g., a 2100 Bioanalyzer (Agilent). Library preparation may be performed using, e.g., an RNA exome kit (Illumina) per the manufacturer's recommendations. RNA-Seq libraries may be multiplexed together and sequenced using, e.g., a single HiSeq 4000 lane (Illumina) with 2×150 bp reads. As another example, total RNA may be isolated from PBMC samples using TRIzol (Invitrogen) per the manufacturer's recommendations. RNA molecules may be quantitated and quality assessed, e.g., using a 2100 Bioanalyzer (Agilent) with an RNA 6000 Pico chip (Agilent). Library preparation of the RNA molecules may be performed, e.g., using the SMARTer Stranded Total RNA-Seq - Pico kit (Takara Biosciences) per the manufacturer's recommendations. Libraries may be quantified, e.g., with the dsDNA HS Assay kit (Thermo Fisher Scientific) using a Qubit 3.0 fluorometer (Thermo Fisher Scientific). Library quality may be assessed, e.g., using a 4200 TapeStation Instrument (Agilent) with D1000 ScreenTape. RNA-Seq libraries may be sequenced on a suitable sequencing instrument (e.g., a NextSeq 500 (Illumina) using 2×150 base-pair (bp) reads). As a further example, total RNA may be extracted from bulk tumours (e.g., NSCLC) and sorted cell populations (e.g., in a range of about 100, about 200, about 300, about 400, about 500, about 1,000, about 5,000, about 10,000, about 15,000, about 20,000, about 25,000, or more than 25,000 cells), e.g., using an AllPrep DNA/RNA Micro kit (Qiagen).
An amount of total RNA (e.g., about 10 nanograms (ng), about 20 ng, about 30 ng, about 40 ng, about 50 ng, or more than 50 ng) may be amplified, e.g., using an Ovation RNA-Seq System V2 (NuGEN). The resulting complementary DNA (cDNA) may be sheared (e.g., by sonication (Covaris S2 System) to an average size of 150-200 bp) and used to construct DNA libraries (e.g., using the NEBNext DNA Library Prep Master Mix (New England Biolabs)). Libraries may be sequenced on a suitable sequencing instrument (e.g., a HiSeq 2000 (Illumina) to generate 100 bp paired end reads with an average of 100 million (M) reads per sample).
To maximize linearity in the context of deconvolution analyses, raw FASTQ reads may be processed (e.g., with Salmon v0.8.265) using GENCODE v23 reference transcripts, the—biasCorrect flag, and otherwise default parameters. RNA-Seq quantification results may be merged into a single gene-level TPM matrix using an R package, tximport.
Microarrays may also be used to generate ground truth reference profiles. Total RNA may be extracted from bulk follicular lymphoma (FL) specimens and sorted B cells and assessed for yield and quality. Complementary RNA (cRNA) may be prepared from 100 ng of total RNA following linear amplification (3′ IVT Express, Affymetrix), and then hybridized to HGU133 Plus 2.0 microarrays (Affymetrix) according to the manufacturer's protocol. The resulting CEL data files may be pooled with a publicly available Affymetrix dataset containing CD4 and CD8 tumour-infiltrating lymphocytes (TILs) that were FACS-sorted from FL lymph nodes (GSE27928). The resulting datasets may be RMA normalized using the “affy” package in Bioconductor, mapped to NCBI Entrez gene identifiers using a custom chip definition file (e.g., Brainarray version 21.0; http://brainarray.mbni.med.umich.edu/Brainarray/), and converted to HUGO gene symbols. Replicates of sorted cell subsets may be combined to create ground truth reference profiles using the geometric mean of expression values.
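By way of illustration only, the combination of replicate profiles by geometric mean may be sketched as follows. The function name, the dictionary layout and the example intensities are hypothetical and are not taken from any dataset described herein; the sketch further assumes that all expression values are strictly positive.

```python
import math

def geometric_mean_profile(replicates):
    """Combine replicate expression profiles gene-by-gene using the geometric mean.

    `replicates` is a list of dicts mapping gene symbol -> expression value (> 0).
    Only genes present in every replicate are kept.
    """
    shared_genes = set(replicates[0])
    for rep in replicates[1:]:
        shared_genes &= set(rep)
    profile = {}
    for gene in sorted(shared_genes):
        # Averaging in log space and exponentiating gives the geometric mean.
        log_sum = sum(math.log(rep[gene]) for rep in replicates)
        profile[gene] = math.exp(log_sum / len(replicates))
    return profile

# Hypothetical normalized intensities for two replicates of one sorted subset.
b_cell_reps = [{"CD19": 100.0, "MS4A1": 400.0}, {"CD19": 400.0, "MS4A1": 100.0}]
reference = geometric_mean_profile(b_cell_reps)
```

The geometric mean is less sensitive to a single high-intensity replicate than the arithmetic mean, which is one reason it may be preferred for expression values spanning several orders of magnitude.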
External datasets may comprise next generation sequencing (NGS) datasets that are downloaded and analyzed using normalization settings. Such external datasets may be expressed in one or more of: transcripts per million (TPM), reads per kilobase of transcript per million (RPKM), or fragments per kilobase of transcript per million (FPKM) space. For analyses in log2 space, a value of 1 may be added to expression values prior to log2 adjustment. Affymetrix microarray datasets may be summarized and normalized as described above for microarrays, using RMA in cases where bulk tissues and ground truth cell subsets were profiled on the same Affymetrix platform, and otherwise using MAS5 normalization. NanoString nCounter data may be downloaded and analyzed with batch correction in non-log linear space, but without any additional preprocessing.
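As a minimal sketch of the log2 adjustment described above, adding 1 before taking the logarithm keeps zero expression values defined and maps them to zero. The function name and the example values are illustrative assumptions only.

```python
import math

def log2_plus_one(values):
    """Return log2(x + 1) for each linear-space expression value."""
    return [math.log2(v + 1.0) for v in values]

# Hypothetical linear-space values (e.g., TPM) for four genes.
tpm = [0.0, 1.0, 3.0, 1023.0]
log_tpm = log2_plus_one(tpm)
```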
Single-cell expression values may be first normalized to transcripts per million (TPM) and divided by 10 to better approximate the number of transcripts per cell. For each cell phenotype, genes with low average expression in log2 space may be set to 0 as a quality control filter. Because of sparser gene coverage, this filter may not be applied to data generated by 10x Chromium. For each cell type represented by at least 3 single cells, 50% of all available single-cell GEPs may be selected using random sampling without replacement (fractional sample sizes may be rounded up such that 2 cells are sampled if only 3 are available). The profiles may be aggregated by summation in non-log linear space and each population-level GEP may be normalized into TPM. This process may be repeated in order to generate aggregated transcriptome replicates (e.g., 2, 3, 4, 5, or more than 5) per cell type. For example, scRNA-Seq and bulk RNA-Seq signature matrices may be generated as described previously with the following typical parameters: minimum number of genes per cell type=300, maximum number of genes per cell type=500, q-value of 0.01, and no quantile normalization.
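The sampling and aggregation steps described above may be sketched as follows. This is an illustrative outline only: the function names, the fixed random seed and the toy count vectors are assumptions, and the sketch omits the low-expression filter and the signature-matrix gene selection.

```python
import math
import random

def to_tpm(profile):
    """Rescale a non-negative expression vector so that it sums to one million."""
    total = sum(profile)
    return [v * 1e6 / total for v in profile]

def aggregate_replicates(cells, n_replicates=5, fraction=0.5, seed=0):
    """Build population-level GEP replicates for one cell type.

    `cells` is a list of per-cell expression vectors in non-log linear space.
    """
    if len(cells) < 3:
        raise ValueError("need at least 3 single cells for this cell type")
    rng = random.Random(seed)
    # Fractional sample sizes are rounded up (e.g., 2 cells out of 3).
    n_sample = math.ceil(fraction * len(cells))
    replicates = []
    for _ in range(n_replicates):
        sampled = rng.sample(cells, n_sample)  # sampling without replacement
        summed = [sum(column) for column in zip(*sampled)]  # sum in linear space
        replicates.append(to_tpm(summed))
    return replicates

# Hypothetical expression vectors for three cells of one cell type (three genes).
cells = [[1.0, 0.0, 3.0], [2.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
reps = aggregate_replicates(cells)
```

Each call draws a fresh subset of cells per replicate, so the replicates differ from one another while each remains normalized to TPM.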
In some embodiments, at least 20, at least 30, at least 40, at least 50, at least 60, at least 70, at least 80, at least 100, at least 120, at least 140, at least 160, at least 180, at least 200, at least 220, at least 240, at least 260, at least 280, at least 300 or more genes from a cancer sample are measured. In some embodiments, it is the combination of substantially all of the genes from a cancer sample that allows for the most accurate determination of abundance of the cell type in the sample and prognostication of outcome, diagnosis or therapeutic response to treatment. In a preferred embodiment, the methods described herein directly utilise single-cell RNA sequencing data rather than known gene-lists as input to generate a gene expression matrix or profile.
“Gene expression” as used herein refers to the relative levels of expression and/or pattern of expression of a gene. The expression of a gene may be measured at the level of DNA, cDNA, RNA, mRNA, or combinations thereof. “Gene expression profile” refers to the levels of expression of multiple different genes measured for the same sample. An expression profile can be derived from a biological sample collected from a subject at one or more time points prior to, during, or following diagnosis, treatment, or therapy for cancer (or any combination thereof), can be derived from a biological sample collected from a subject at one or more time points during which there is no treatment or therapy for cancer (e.g., to monitor progression of disease or to assess development of disease in a subject at risk for breast cancer), or can be collected from a healthy subject.
Gene expression profiles may be measured in a sample, such as samples comprising a variety of cell types, different tissues, different organs, or fluids (e.g., blood, urine, spinal fluid, sweat, saliva or serum) by various methods including but not limited to microarray technologies and quantitative and semi-quantitative RT-PCR techniques as well as single-cell transcriptome sequencing (sc-RNA-seq) and other methods known in the art.
The term “deconvolution,” as used herein, may refer to the process of identifying (e.g., estimating) the relative proportions or the abundance (e.g., an absolute or fractional abundance) of cell subsets or cell populations in a mixture of cell subsets or cell populations of a sample. Deconvolution methods generally work on the principle that the expression value of each gene in a bulk, heterogeneous sample can be mathematically modelled as the sum of the gene expression contributions from each of the individual cell types that constitute the sample (Cobos et al. (2018) Bioinformatics 34:11, 1969-1979, incorporated herein in its entirety).
Deconvolution methods are often broadly grouped into three common types: ordinary least squares (OLS); linear least squares (LLS); or simply least squares (LS). A skilled person will understand which deconvolution methods are suitable for use in the methods described herein. The process of deconvolution may vary as understood by a skilled person in the art. Some processes of deconvolution use known gene-lists as input (e.g., the original CIBERSORT method). Others directly utilise single-cell RNA sequencing data (e.g., the newer CIBERSORTx and DWLS methods). In a preferred embodiment, the methods described herein directly utilise single-cell RNA sequencing data rather than known gene-lists as input.
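The linear mixture principle may be illustrated with a simple non-negative least squares sketch, in which a bulk profile m is modelled as S·f, where S is a signature matrix (genes by cell types) and f is the vector of cell-type fractions. The sketch below uses plain projected gradient descent; it is not the DWLS or CIBERSORTx algorithm, and the function name, the signature matrix and the mixture values are hypothetical.

```python
def deconvolve(signature, mixture, n_iter=2000):
    """Estimate non-negative cell-type fractions f such that signature * f ~ mixture.

    `signature` is a genes x cell-types list of lists; `mixture` is a list of
    bulk expression values, one per gene.
    """
    n_genes, n_types = len(signature), len(signature[0])
    # A step of 1/trace(S^T S) never exceeds 1/lambda_max, so descent is stable.
    step = 1.0 / sum(v * v for row in signature for v in row)
    f = [1.0 / n_types] * n_types
    for _ in range(n_iter):
        residual = [
            sum(signature[g][k] * f[k] for k in range(n_types)) - mixture[g]
            for g in range(n_genes)
        ]
        grad = [
            sum(signature[g][k] * residual[g] for g in range(n_genes))
            for k in range(n_types)
        ]
        # Gradient step followed by projection onto non-negative values.
        f = [max(0.0, f[k] - step * grad[k]) for k in range(n_types)]
    total = sum(f)
    return [v / total for v in f] if total > 0 else f

# Hypothetical signature matrix: 4 genes (rows) x 3 cell types (columns).
S = [[10.0, 1.0, 0.0], [0.0, 8.0, 1.0], [1.0, 0.0, 9.0], [5.0, 5.0, 5.0]]
# Bulk profile simulated from fractions 0.5, 0.3 and 0.2.
m = [5.3, 2.6, 2.3, 5.0]
fractions = deconvolve(S, m)
```

In this noise-free toy example the estimated fractions recover the simulated proportions; with real data, gene weighting and batch effects would need to be handled by a dedicated method.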
In an embodiment, the process of deconvolution includes:
According to the methods described herein, dampened weighted least squares (DWLS) or CIBERSORTx may be used to determine gene expression deconvolution, whereby cell-type composition of a bulk RNA-sequence data set is computationally inferred. However, a skilled person will understand that other known methods may be used to determine gene expression deconvolution and the methods described herein are not limited accordingly.
Batch correction techniques may be developed to minimize technical variation in expression profiling and may be applied to gene expression deconvolution. In an embodiment, a deconvolution method (e.g., to identify or quantify cell-type states from a mixture of different cell types) may comprise performing a batch correction procedure to reduce technical variation (e.g., between the cell signature profile and the bulk mixture profiles). For example, a bulk reference mode (e.g., B-mode) batch correction may be performed as follows. Generally, while a deconvolution method (e.g., CIBERSORT) may be applied to RNA-Seq, including to reference phenotypes derived from single-cell transcriptome profiling, such a method may not explicitly handle technical variation between the cell signature profile and bulk mixture profiles. Technical variation may include cross-platform technical variation or cross-sample technical variation. For example, technical variation may arise from obtaining feature profiles of the signature matrix and feature profiles of the bulk mixture across different platforms (e.g., RNA-Seq, scRNA-Seq, microarrays, 10x Chromium, SMART-Seq2, droplet-based techniques, UMI-based techniques, non-UMI-based techniques, 3′/5′-biased techniques) and/or different sample types (e.g., fresh/frozen samples, FFPE samples, single-cell samples, bulk sorted cell populations or cell types, and samples containing mixtures of cell populations or cell types). For example, cross-platform technical variation may arise in cases where feature profiles with a same type of expression data (e.g., GEPs) are obtained using different platforms. Since technical variation can variably confound deconvolution results, a normalization workflow, which may comprise at least two distinct strategies, can be used to apply gene expression deconvolution reliably across platforms (e.g., RNA-Seq, microarrays) and tissue storage types (e.g., fresh/frozen versus FFPE).
For example, a decision tree to guide users in selecting the most appropriate strategy may be used to assist in selecting a bulk-mode batch correction (e.g., B-mode) procedure and/or a single cell batch correction (e.g., S-mode) procedure to be performed.
The distinct cell subsets (e.g., cell types) of the biological sample according to the present disclosure may be any distinct cell types that contribute to the feature profile of the biological sample.
In an embodiment, the distinct cell types comprise any of:
In an embodiment, the ecotypes may comprise the following qualitative parameters:
A skilled person will understand that varying proportions of these subtypes can form a given ecotype. Within the cell types listed above, a skilled person will understand that each of the cell types can be further broken down into the five ‘intrinsic’ molecular subtypes: luminal-like (LumA and LumB), HER2-enriched (HER2E), basal-like (BLBC) and normal-like.
In some embodiments, the distinct subsets of cells comprise subsets of cells at different cell cycle stages. A subset of cells may include cells in any suitable cell cycle stage, including, but not limited to, interphase, mitotic phase or cytokinesis. In some embodiments, cells in a subset of cells are at prophase, metaphase, anaphase, or telophase. In some cases, the cells in a subset of cells are quiescent (G0 phase), at the G1 checkpoint (G1 phase), have replicated DNA but have not yet entered mitosis (G2 phase), or are undergoing DNA replication (S phase). A skilled person will understand that the term “cycling cell” refers to a cell at different cell cycle stages.
In some embodiments, the distinct cell subsets include different functional pathways within one or more cells. Functional pathways of interest include, without limitation, cellular signalling pathways, gene regulatory pathways, or metabolic pathways. Thus, in some embodiments, the method of the present disclosure may be a method estimating the relative activity of different signalling or metabolic pathways in a cell, a collection of cells, a tissue, etc., by measuring multiple features of the signalling or metabolic pathways (e.g., measuring activation state of proteins in a signalling pathway; measuring expression level of genes in a gene regulatory network; measuring the level of a metabolite in a metabolic pathway, etc.). The cellular signalling pathways of interest include any suitable signalling pathway, such as, without limitation, cytokine signalling, death factor signalling, growth factor signalling, survival factor signalling, hormone signalling, Wnt signalling, Hedgehog signalling, Notch signalling, extracellular matrix signalling, insulin signalling, calcium signalling, G-protein coupled receptor signalling, neurotransmitter signalling, and combinations thereof. The metabolic pathway may include any suitable metabolic pathway, such as, without limitation, glycolysis, gluconeogenesis, citric acid cycle, fermentation, urea cycle, fatty acid metabolism, pyrimidine biosynthesis, glutamate amino acid group synthesis, porphyrin metabolism, aspartate amino acid group synthesis, aromatic amino acid synthesis, histidine metabolism, branched amino acid synthesis, pentose phosphate pathway, purine biosynthesis, glucuronate metabolism, inositol metabolism, cellulose metabolism, sucrose metabolism, starch and glycogen metabolism, and combinations thereof.
In some embodiments, a cell subset may be any group of cells in a biological sample whose presence is characterized by one or more features (such as gene expression on the RNA level, protein expression, genomic mutations, biomarkers, and so forth). A cell subset may be, for example, a cell type or cell sub-type. In certain aspects, one or more cell subsets may be leukocytes (e.g., white blood cells or WBCs). Potential leukocyte cell subsets include monocytes, dendritic cells, neutrophils, eosinophils, basophils, and lymphocytes. These leukocyte subsets can be further subdivided; for example, lymphocyte cell subsets include natural killer cells (NK cells), T-cells (e.g., CD8 T cells, CD4 naive T cells, CD4 memory RO unactivated T cells, CD4 memory RO activated T cells, follicular helper T cells, regulatory T cells, and so forth) and B-cells (naive B cells, memory B cells, plasma cells). Immune cell subsets may be further separated based on activation (or stimulation) state.
In certain embodiments, leukocytes may be from an individual with a leukocyte disorder, such as blood cancer, an autoimmune disease, myelodysplastic syndrome, and so forth. Examples of a blood disease include Acute lymphoblastic leukemia (ALL), Acute myelogenous leukemia (AML), Chronic lymphocytic leukemia (CLL), Chronic myelogenous leukemia (CML), Acute monocytic leukemia (AMoL), Hodgkin's lymphoma, Non-Hodgkin's lymphoma, and myeloma.
In certain embodiments, one or more cell subsets may include tumour infiltrating leukocytes (TILs). Tumour infiltrating leukocytes may be in mixture with cancer cells in the biological sample, or may be enriched by any methods described above or known in the art.
In certain aspects, one or more cell subsets may include cancer cells, such as blood cancer, breast cancer, colon cancer, lung cancer, prostate cancer, hepatocellular cancer, gastric cancer, pancreatic cancer, cervical cancer, ovarian cancer, liver cancer, bladder cancer, cancer of the urinary tract, thyroid cancer, renal cancer, carcinoma, melanoma, and brain cancer.
Cell subsets of interest may include brain cells, including neuronal cells, astrocytes, oligodendrocytes, and microglia, and progenitor cells thereof. Other cell subsets of interest include stem cells, pluripotent stem cells, and progenitor cells of any biological tissue, including blood, solid tissue from brain, lymph node, thymus, bone marrow, spleen, skeletal muscle, heart, colon, stomach, small intestine, kidney, liver, lung, and so forth.
Despite recent advances, the challenge of cancer treatment remains to target specific treatment regimens to distinct tumour types with different pathogenesis, and ultimately personalize tumour treatment in order to maximize outcome. In particular, once a patient is diagnosed with cancer, such as breast cancer, there is a need for methods that allow a practitioner to predict the expected course of disease, including the likelihood of cancer recurrence, long-term survival of the patient and the like, and select the most appropriate treatment options accordingly.
For the purposes of the present invention, “breast cancer” includes, for example, those conditions classified by biopsy or histology as malignant pathology. One of skill in the art will appreciate that breast cancer refers to any malignancy of the breast tissue, including, for example, carcinomas and sarcomas. Particular embodiments of breast cancer include ductal carcinoma in situ (DCIS), lobular carcinoma in situ (LCIS), or mucinous carcinoma. Breast cancer also refers to infiltrating ductal (IDC) or infiltrating lobular carcinoma (ILC). In most embodiments of the invention, the subject of interest is a human patient suspected of or having been diagnosed with breast cancer.
Breast cancer is a heterogeneous disease with respect to molecular alterations and cellular composition. This diversity creates a challenge for researchers trying to develop classifications that are clinically meaningful. Gene expression profiling by microarray has provided insight into the complexity of breast tumours and can be used to provide prognostic information beyond standard pathologic parameters.
Expression profiling of breast cancer identifies biologically and clinically distinct molecular subtypes which may require different treatment approaches. The major intrinsic subtypes of breast cancer referred to as Luminal A, Luminal B, HER2-enriched, Basal-like have distinct clinical features, relapse risk and response to treatment. The “intrinsic” subtypes known as Luminal A (LumA), Luminal B (LumB), HER2-enriched, Basal-like, and Normal-like were discovered using unsupervised hierarchical clustering of microarray data (Perou et al. (2000) Nature 406:747-752). Intrinsic genes, as described in Perou et al. (2000) Nature 406:747-752, are statistically selected to have low variation in expression between biological sample replicates from the same individual and high variation in expression across samples from different individuals. Thus, intrinsic genes are the classifier genes for breast cancer classification. Although clinical information was not used to derive the breast cancer intrinsic subtypes, this classification has proved to have prognostic significance (Sorlie et al. (2001) PNAS 98(19) 10869-10874).
Breast tumours of the “Luminal” subtype are ER positive and have a similar keratin expression profile as the epithelial cells lining the lumen of the breast ducts (Taylor-Papadimitriou et al. (1989) J Cell Sci 94:403-413; Perou et al. (2000) New Technologies for Life Sciences: A Trends Guide 67-76). Conversely, ER-negative tumours can be broken into two main subtypes, namely those that overexpress (and are DNA amplified for) HER-2 and GRB7 (HER-2-enriched) and “Basal-like” tumours that have an expression profile similar to basal epithelium and express Keratin 5, 6B, and 17. Both these tumour subtypes are aggressive and typically more deadly than Luminal tumours; however, there are subtypes of Luminal tumours with different outcomes. The Luminal tumours with poor outcomes consistently share the histopathological feature of being higher grade and the molecular feature of highly expressing proliferation genes.
The methods described herein may be further combined with information on clinical variables to generate a risk of relapse predictor or to aid diagnosis or prognosis or for use in any other method described herein.
As described herein, a number of clinical and prognostic breast cancer factors are known in the art and are used to predict treatment outcome and the likelihood of disease recurrence. Such factors include, for example, lymph node involvement, tumour size, histologic grade, estrogen and progesterone hormone receptor status, HER-2 levels, and tumour ploidy.
Methods of identifying breast cancer patients and staging the disease are well known and may include manual examination, biopsy, review of patient's and/or family history, and imaging techniques, such as mammography, magnetic resonance imaging (MRI), and positron emission tomography (PET). It will be understood that breast cancer stage is usually expressed as a number on a scale of 0 through IV with stage 0 describing non-invasive cancers that remain within their original location and stage IV describing invasive cancers that have spread outside the breast to other parts of the body.
Stage 0 is used to describe non-invasive breast cancers, such as DCIS (ductal carcinoma in situ). In stage 0, there is no evidence of cancer cells or non-cancerous abnormal cells breaking out of the part of the breast in which they started, or getting through to or invading neighbouring normal tissue. Stage I describes invasive breast cancer (cancer cells are breaking through to or invading normal surrounding breast tissue). Stage IA describes invasive breast cancer in which the tumour measures up to 2 centimeters (cm) and the cancer has not spread outside the breast; no lymph nodes are involved. Stage IB describes invasive breast cancer in which there is no tumour in the breast; instead, small groups of cancer cells—larger than 0.2 millimeter (mm) but not larger than 2 mm—are found in the lymph nodes or there is a tumour in the breast that is no larger than 2 cm, and there are small groups of cancer cells—larger than 0.2 mm but not larger than 2 mm—in the lymph nodes.
Stage II is divided into subcategories known as IIA and IIB. Stage IIA describes invasive breast cancer in which no tumour can be found in the breast, but cancer (larger than 2 millimeters [mm]) is found in 1 to 3 axillary lymph nodes (the lymph nodes under the arm) or in the lymph nodes near the breast bone (found during a sentinel node biopsy) or the tumour measures 2 centimeters (cm) or smaller and has spread to the axillary lymph nodes or the tumour is larger than 2 cm but not larger than 5 cm and has not spread to the axillary lymph nodes. Stage IIB describes invasive breast cancer in which the tumour is larger than 2 cm but no larger than 5 centimeters; small groups of breast cancer cells—larger than 0.2 mm but not larger than 2 mm—are found in the lymph nodes or the tumour is larger than 2 cm but no larger than 5 cm; cancer has spread to 1 to 3 axillary lymph nodes or to lymph nodes near the breastbone (found during a sentinel node biopsy) or the tumour is larger than 5 cm but has not spread to the axillary lymph nodes.
Stage III is divided into subcategories known as IIIA, IIIB, and IIIC. In general, stage IIIA describes invasive breast cancer in which either no tumour is found in the breast or the tumour may be any size; cancer is found in 4 to 9 axillary lymph nodes or in the lymph nodes near the breastbone (found during imaging tests or a physical exam) or the tumour is larger than 5 centimeters (cm); small groups of breast cancer cells (larger than 0.2 millimeter [mm] but not larger than 2 mm) are found in the lymph nodes or the tumour is larger than 5 cm; cancer has spread to 1 to 3 axillary lymph nodes or to the lymph nodes near the breastbone (found during a sentinel lymph node biopsy). Stage IIIB describes invasive breast cancer in which the tumour may be any size and has spread to the chest wall and/or skin of the breast and caused swelling or an ulcer and may have spread to up to 9 axillary lymph nodes or may have spread to lymph nodes near the breastbone.
Stage IIIC describes invasive breast cancer in which there may be no sign of cancer in the breast or, if there is a tumour, it may be any size and may have spread to the chest wall and/or the skin of the breast and the cancer has spread to 10 or more axillary lymph nodes or the cancer has spread to lymph nodes above or below the collarbone or the cancer has spread to axillary lymph nodes or to lymph nodes near the breastbone.
Stage IV describes invasive breast cancer that has spread beyond the breast and nearby lymph nodes to other organs of the body, such as the lungs, distant lymph nodes, skin, bones, liver, or brain.
Using the methods of the present invention, the diagnosis and/or prognosis of a breast cancer patient can be determined independent of, or in combination with assessment of these clinical factors. In some embodiments, combining the methods disclosed herein with evaluation of these clinical factors may permit a more accurate risk assessment.
The methods of the invention may be further coupled with analysis of, for example, estrogen receptor (ER) and progesterone receptor (PgR) status, and/or HER-2 expression levels. Other factors, such as patient clinical history, family history and menopausal status, may also be considered when evaluating breast cancer prognosis or diagnosis via the methods of the invention.
In one embodiment of the present invention, abundance of cell type is assessed through the evaluation of gene expression profiles of the genes in one or more subject samples. For the purpose of discussion, the term subject, or subject sample, refers to an individual regardless of health and/or disease status. A subject can be a patient, a study participant, a control subject, a screening subject, or any other class of individual from whom a sample is obtained and assessed in the context of the invention.
Accordingly, a subject can be diagnosed with breast cancer, can present with one or more symptoms of breast cancer, or a predisposing factor, such as a family (genetic) or medical history (medical) factor, for breast cancer, can be undergoing treatment or therapy for breast cancer, or the like. Alternatively, a subject can be healthy with respect to any of the aforementioned factors or criteria. It will be appreciated that the term “healthy” as used herein, is relative to breast cancer status. Thus, an individual defined as healthy with reference to any specified disease or disease criterion, can in fact be diagnosed with any other one or more diseases, or exhibit any other one or more disease criterion, including one or more cancers other than breast cancer. However, the healthy controls are preferably free of any cancer.
In particular embodiments, the methods for determining abundance of the cell type in the sample include collecting a sample comprising a cancer cell or tissue, such as a breast tissue sample or a primary breast tumour tissue sample.
A “sample” or “biological sample” is intended to mean any sampling of cells, tissues, or bodily fluids in which expression of one or more intrinsic genes can be determined. Examples of such biological samples include, but are not limited to, biopsies and smears. Bodily fluids useful in the present invention include blood, lymph, urine, saliva, nipple aspirates, gynecological fluids, or any other bodily secretion or derivative thereof. Blood can include whole blood, plasma, serum, or any derivative of blood. In some embodiments, the biological sample includes breast cells, particularly breast tissue from a biopsy, such as a breast tumour tissue sample. Biological samples may be obtained from a subject by a variety of techniques including, for example, by scraping or swabbing an area, by using a needle to aspirate cells or bodily fluids, or by removing a tissue sample (i.e., biopsy). Methods for collecting various biological samples are well known in the art. In some embodiments, a breast tissue sample is obtained by, for example, fine needle aspiration biopsy, core needle biopsy, or excisional biopsy. Fixative and staining solutions may be applied to the cells or tissues for preserving the specimen and for facilitating examination. Biological samples, particularly breast tissue samples, may be transferred to a glass slide for viewing under magnification. In one embodiment, the biological sample is a formalin-fixed, paraffin-embedded breast tissue sample, particularly a primary breast tumour sample.
Any methods available in the art for detecting expression of genes in a cancer sample are encompassed herein. By “detecting expression” is intended determining the quantity or presence of an RNA transcript or its expression product of an intrinsic gene.
Methods for detecting expression of the intrinsic genes of the invention, that is, gene expression profiling, include methods based on hybridization analysis of polynucleotides, methods based on sequencing of polynucleotides, immunohistochemistry methods, and proteomics based methods. The methods generally detect expression products (e.g., mRNA) of the genes in a cancer sample.
In embodiments, PCR-based methods, such as reverse transcription PCR (RT-PCR) (Weis et al., TIG 8:263-64, 1992), array-based methods, such as microarray (Schena et al., Science 270:467-70, 1995), or, preferably, single-cell RNA sequencing are used. By “microarray” is intended an ordered arrangement of hybridisable array elements, such as, for example, polynucleotide probes, on a substrate. The term “probe” refers to any molecule that is capable of selectively binding to a specifically intended target biomolecule, for example, a nucleotide transcript or a protein encoded by or corresponding to an intrinsic gene. Probes can be synthesized by one of skill in the art, or derived from appropriate biological preparations. Probes may be specifically designed to be labelled. Examples of molecules that can be utilized as probes include, but are not limited to, RNA, DNA, proteins, antibodies, and organic molecules.
Many expression detection methods use isolated RNA. The starting material is typically total RNA isolated from a biological sample, such as a tumour or tumour cell line, and corresponding normal tissue or cell line, respectively. If the source of RNA is a primary tumour, RNA (e.g., mRNA) can be extracted, for example, from frozen or archived paraffin embedded and fixed (e.g., formalin-fixed) tissue samples (e.g., pathologist-guided tissue core samples).
General methods for RNA extraction are well known in the art and are disclosed in standard textbooks of molecular biology, including Ausubel et al., ed., Current Protocols in Molecular Biology, John Wiley & Sons, New York 1987-1999. Methods for RNA extraction from paraffin embedded tissues are disclosed, for example, in Rupp and Locker (Lab Invest. 56:A67, 1987) and De Andres et al. (Biotechniques 18:42-44, 1995). In particular, RNA isolation can be performed using a purification kit, a buffer set and protease from commercial manufacturers, such as Qiagen (Valencia, Calif.), according to the manufacturer's instructions. For example, total RNA from cells in culture can be isolated using Qiagen RNeasy mini-columns. Other commercially available RNA isolation kits include MASTERPURE™ Complete DNA and RNA Purification Kit (Epicentre, Madison, Wis.) and Paraffin Block RNA Isolation Kit (Ambion, Austin, Tex.). Total RNA from tissue samples can be isolated, for example, using RNA Stat-60 (Tel-Test, Friendswood, Tex.). RNA prepared from a tumour can be isolated, for example, by cesium chloride density gradient centrifugation. Additionally, large numbers of tissue samples can readily be processed using techniques well known to those of skill in the art, such as, for example, the single-step RNA isolation process of Chomczynski (U.S. Pat. No. 4,843,155).
Isolated RNA can be used in hybridization or amplification assays that include, but are not limited to, PCR analyses and probe arrays. One method for the detection of RNA levels involves contacting the isolated RNA with a nucleic acid molecule (probe) that can hybridize to the mRNA encoded by the gene being detected. The nucleic acid probe can be, for example, a full-length cDNA, or a portion thereof, such as an oligonucleotide of at least 7, 15, 30, 60, 100, 250, or 500 nucleotides in length and sufficient to specifically hybridize under stringent conditions to an intrinsic gene of the present invention, or any derivative DNA or RNA. Hybridization of an mRNA with the probe indicates that the intrinsic gene in question is being expressed.
In one embodiment, the mRNA is immobilized on a solid surface and contacted with a probe, for example by running the isolated mRNA on an agarose gel and transferring the mRNA from the gel to a membrane, such as nitrocellulose. In an alternative embodiment, the probes are immobilized on a solid surface and the mRNA is contacted with the probes, for example, in an Agilent gene chip array. A skilled person can readily adapt known mRNA detection methods for use in detecting the level of expression of the intrinsic genes of the present invention.
An alternative method for determining the level of intrinsic gene expression product in a sample involves the process of nucleic acid amplification, for example, by RT-PCR (U.S. Pat. No. 4,683,202), ligase chain reaction (Barany, Proc. Natl. Acad. Sci. USA 88:189-93, 1991), self sustained sequence replication (Guatelli et al., Proc. Natl. Acad. Sci. USA 87: 187 4-78, 1990), transcriptional amplification system (Kwoh et al., Proc. Natl. Acad. Sci. USA 86:1173-77, 1989), Q-Beta Replicase (Lizardi et al., Bio/Technology 6:1197, 1988), rolling circle replication (U.S. Pat. No. 5,854,033), or any other nucleic acid amplification method, followed by the detection of the amplified molecules using techniques well known to those of skill in the art. These detection schemes are especially useful for the detection of nucleic acid molecules if such molecules are present in very low numbers.
In particular aspects of the invention, intrinsic gene expression is assessed by quantitative RT-PCR. Numerous different PCR or QPCR protocols are known in the art and exemplified herein below and can be directly applied or adapted for use with the presently described methods for the detection and/or quantification of the intrinsic genes listed in a cancer sample. Generally, in PCR, a target polynucleotide sequence is amplified by reaction with at least one oligonucleotide primer or pair of oligonucleotide primers. The primer(s) hybridize to a complementary region of the target nucleic acid and a DNA polymerase extends the primer(s) to amplify the target sequence. Under conditions sufficient to provide polymerase-based nucleic acid amplification products, a nucleic acid fragment of one size dominates the reaction products (the target polynucleotide sequence which is the amplification product). The amplification cycle is repeated to increase the concentration of the single target polynucleotide sequence. The reaction can be performed in any thermocycler commonly used for PCR. However, preferred are cyclers with real-time fluorescence measurement capabilities, for example, SMARTCYCLER® (Cepheid, Sunnyvale, Calif.), ABI PRISM 7700® (Applied Biosystems, Foster City, Calif.), ROTOR-GENE™ (Corbett Research, Sydney, Australia), LIGHTCYCLER® (Roche Diagnostics Corp, Indianapolis, Ind.), ICYCLER® (Biorad Laboratories, Hercules, Calif.) and MX4000® (Stratagene, La Jolla, Calif.).
Quantitative PCR (QPCR) (also referred to as real-time PCR) is preferred under some circumstances because it provides not only a quantitative measurement, but also reduced time and contamination. In some instances, the availability of full gene expression profiling techniques is limited due to requirements for fresh frozen tissue and specialized laboratory equipment, making the routine use of such technologies difficult in a clinical setting. However, QPCR gene measurement can be applied to standard formalin-fixed paraffin-embedded clinical tumour blocks, such as those used in archival tissue banks and routine surgical pathology specimens. As used herein, “quantitative PCR” (or “real time QPCR”) refers to the direct monitoring of the progress of PCR amplification as it is occurring without the need for repeated sampling of the reaction products. In quantitative PCR, the reaction products may be monitored via a signalling mechanism (e.g., fluorescence) as they are generated and are tracked after the signal rises above a background level but before the reaction reaches a plateau. The number of cycles required to achieve a detectable or “threshold” level of fluorescence varies inversely with the logarithm of the concentration of amplifiable targets at the beginning of the PCR process, enabling a measure of signal intensity to provide a measure of the amount of target nucleic acid in a sample in real time.
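The dependence of the threshold cycle on the starting amount of target can be illustrated with a minimal sketch. It assumes an idealized reaction in which the number of copies grows by a constant factor (1 + efficiency) per cycle, which holds only during the exponential phase; the function name, the parameter names and the copy numbers are hypothetical and chosen for illustration only.

```python
import math


def threshold_cycle(initial_copies, threshold_copies, efficiency=1.0):
    """Cycles needed for an idealized reaction to reach the threshold.

    Assumes N_c = N_0 * (1 + efficiency) ** c with constant efficiency,
    an idealization that ignores the plateau phase of a real reaction.
    """
    if efficiency <= 0:
        raise ValueError("efficiency must be positive")
    if initial_copies <= 0 or threshold_copies <= initial_copies:
        raise ValueError("need 0 < initial_copies < threshold_copies")
    # Solve N_0 * (1 + E) ** c = N_threshold for c.
    return math.log(threshold_copies / initial_copies) / math.log(1.0 + efficiency)


# Illustrative numbers only: a tenfold increase in starting template
# lowers the threshold cycle by log2(10) cycles at perfect doubling.
low_input = threshold_cycle(1024, 2 ** 30)
high_input = threshold_cycle(10240, 2 ** 30)
print(low_input, high_input)
```

Under these assumptions, each tenfold increase in starting template shortens the threshold cycle by roughly 3.3 cycles when the product doubles every cycle.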
In another embodiment of the invention, microarrays are used for expression profiling. Microarrays are particularly well suited for this purpose because of the reproducibility between different experiments. DNA microarrays provide one method for the simultaneous measurement of the expression levels of large numbers of genes. Each array consists of a reproducible pattern of capture probes attached to a solid support. Labelled RNA or DNA is hybridized to complementary probes on the array and then detected by laser scanning. Hybridization intensities for each probe on the array are determined and converted to a quantitative value representing relative gene expression levels. See, for example, U.S. Pat. Nos. 6,040,138, 5,800,992 and 6,020,135, 6,033,860, and 6,344,316. High-density oligonucleotide arrays are particularly useful for determining the gene expression profile for a large number of RNAs in a sample. Techniques for the synthesis of these arrays using mechanical synthesis methods are described in, for example, U.S. Pat. No. 5,384,261. Although a planar array surface is generally used, the array can be fabricated on a surface of virtually any shape or even a multiplicity of surfaces. Arrays can be nucleic acids (or peptides) on beads, gels, polymeric surfaces, fibers (such as fiber optics), glass, or any other appropriate substrate. See, for example, U.S. Pat. Nos. 5,770,358, 5,789,162, 5,708,153, 6,040,193 and 5,800,992. Arrays can be packaged in such a manner as to allow for diagnostics or other manipulation of an all-inclusive device. See, for example, U.S. Pat. Nos. 5,856,174 and 5,922,591.
In a specific embodiment of the microarray technique, PCR amplified inserts of cDNA clones are applied to a substrate in a dense array. The microarrayed genes, immobilized on the microchip, are suitable for hybridization under stringent conditions. Fluorescently labelled cDNA probes can be generated through incorporation of fluorescent nucleotides by reverse transcription of RNA extracted from tissues of interest. Labelled cDNA probes applied to the chip hybridize with specificity to each spot of DNA on the array. After stringent washing to remove non-specifically bound probes, the chip is scanned by confocal laser microscopy or by another detection method, such as a CCD camera. Quantitation of hybridization of each arrayed element allows for assessment of corresponding mRNA abundance.
With dual colour fluorescence, separately labelled cDNA probes generated from two sources of RNA are hybridized pairwise to the array. The relative abundance of the transcripts from the two sources corresponding to each specified gene is thus determined simultaneously. The miniaturized scale of the hybridization affords a convenient and rapid evaluation of the expression pattern for large numbers of genes. Such methods have been shown to have the sensitivity required to detect rare transcripts, which are expressed at a few copies per cell, and to reproducibly detect at least approximately two-fold differences in the expression levels (Schena et al., Proc. Natl. Acad. Sci. USA 93:106-49, 1996). Microarray analysis can be performed by commercially available equipment, following manufacturer's protocols, such as by using the Affymetrix GeneChip technology, or Agilent ink jet microarray technology. The development of microarray methods for large-scale analysis of gene expression makes it possible to search systematically for molecular markers of cancer classification and outcome prediction in a variety of tumour types.
It is often useful to pre-process gene expression data, for example, by addressing missing data, translation, scaling, normalization, weighting, etc. Multivariate projection methods, such as principal component analysis (PCA) and partial least squares analysis (PLS), are so-called scaling sensitive methods. By using prior knowledge and experience about the type of data studied, the quality of the data prior to multivariate modelling can be enhanced by scaling and/or weighting. Adequate scaling and/or weighting can reveal important and interesting variation hidden within the data, and therefore make subsequent multivariate modelling more efficient. Scaling and weighting may be used to place the data in the correct metric, based on knowledge and experience of the studied system, and therefore reveal patterns already inherently present in the data.
If possible, missing data, for example gaps in column values, should be avoided. However, if necessary, such missing data may be replaced or “filled” with, for example, the mean value of a column (“mean fill”); a random value (“random fill”); or a value based on a principal component analysis (“principal component fill”).
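By way of illustration only, the “mean fill” option can be sketched as follows. The sketch assumes that missing entries are encoded as None or NaN, which is an assumption of this example rather than a requirement of the methods; the function name is hypothetical, and random fill and principal component fill are not shown.

```python
import math


def mean_fill(column):
    """Replace missing entries (None or NaN) with the mean of the observed values."""
    observed = [v for v in column if v is not None and not math.isnan(v)]
    if not observed:
        raise ValueError("column has no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if (v is None or math.isnan(v)) else v for v in column]


# Illustrative column with two gaps.
print(mean_fill([1.0, None, 3.0, float("nan")]))
```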
“Translation” of the descriptor coordinate axes can be useful. Examples of such translation include normalization and mean centering. “Normalization” may be used to remove sample-to-sample variation. For microarray data, the process of normalization aims to remove systematic errors by balancing the fluorescence intensities of the two labelling dyes. The dye bias can come from various sources including differences in dye labelling efficiencies, heat and light sensitivities, as well as scanner settings for scanning two channels. Some commonly used methods for calculating a normalization factor include: (i) global normalization that uses all genes on the array; (ii) housekeeping genes normalization that uses constantly expressed housekeeping/invariant genes; and (iii) internal controls normalization that uses a known amount of exogenous control genes added during hybridization (Quackenbush (2002) Nat. Genet. 32 (Suppl.), 496-501). In one embodiment, the intrinsic genes disclosed herein can be normalized to control housekeeping genes. For example, the housekeeping genes described in U.S. Patent Publication 2008/0032293, which is herein incorporated by reference in its entirety, can be used for normalization. Exemplary housekeeping genes include MRPL19, PSMC4, SF3A1, PUM1, ACTB, GAPD, GUSB, RPLP0, and TFRC. It will be understood by one of skill in the art that the methods disclosed herein are not bound by normalization to any particular housekeeping genes, and that any suitable housekeeping gene(s) known in the art can be used.
Many normalization approaches are possible, and they can often be applied at any of several points in the analysis. In one embodiment, microarray data is normalized using the LOWESS method, which is a global locally weighted scatterplot smoothing normalization function. In another embodiment, qPCR data is normalized to the geometric mean of a set of multiple housekeeping genes.
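Normalization to the geometric mean of several housekeeping genes can be sketched as follows. The sketch assumes that expression values are relative quantities on a linear scale, not raw threshold cycles; the function names, the target gene GENE_X and all numerical values are hypothetical and serve only as an illustration.

```python
import math


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalize_to_housekeeping(quantities, housekeeping):
    """Divide each non-housekeeping quantity by the housekeeping geometric mean.

    quantities   : dict mapping gene name to relative quantity (linear scale)
    housekeeping : names of the reference genes present in `quantities`
    """
    reference = geometric_mean([quantities[g] for g in housekeeping])
    return {g: q / reference for g, q in quantities.items() if g not in housekeeping}


# Made-up quantities; ACTB and PUM1 act as reference genes here.
sample = {"ACTB": 4.0, "PUM1": 16.0, "GENE_X": 24.0}
print(normalize_to_housekeeping(sample, ["ACTB", "PUM1"]))
```

The geometric mean is less affected than the arithmetic mean by a single highly expressed reference gene, which is one reason it is commonly chosen when several reference genes are combined.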
“Mean centering” may also be used to simplify interpretation. Usually, for each descriptor, the average value of that descriptor for all samples is subtracted. In this way, the mean of a descriptor coincides with the origin, and all descriptors are “centered” at zero. In “unit variance scaling,” data can be scaled to equal variance. Usually, the value of each descriptor is scaled by 1/StDev, where StDev is the standard deviation for that descriptor for all samples. “Pareto scaling” is, in some sense, intermediate between mean centering and unit variance scaling. In pareto scaling, the value of each descriptor is scaled by 1/sqrt(StDev), where StDev is the standard deviation for that descriptor for all samples. In this way, each descriptor has a variance numerically equal to its initial standard deviation. The pareto scaling may be performed, for example, on raw data or mean centered data.
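The three operations described above can be sketched for a single descriptor as follows. This is an illustrative sketch only: it assumes the sample standard deviation (n − 1 denominator) is used for StDev, and the function names and example values are hypothetical.

```python
import math
import statistics


def mean_center(values):
    """Subtract the descriptor mean so that the descriptor is centered at zero."""
    mean = statistics.fmean(values)
    return [v - mean for v in values]


def unit_variance_scale(values):
    """Scale by 1/StDev so that the descriptor has unit variance."""
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("descriptor is constant; cannot scale")
    return [v / sd for v in values]


def pareto_scale(values):
    """Scale by 1/sqrt(StDev); the resulting variance equals the initial StDev."""
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("descriptor is constant; cannot scale")
    return [v / math.sqrt(sd) for v in values]


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up descriptor values
print(statistics.stdev(values), statistics.variance(pareto_scale(values)))
```

The two printed numbers agree up to rounding, which reflects the statement that a pareto-scaled descriptor has a variance numerically equal to its initial standard deviation.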
“Logarithmic scaling” may be used to assist interpretation when data have a positive skew and/or when data spans a large range, e.g., several orders of magnitude. Usually, for each descriptor, the value is replaced by the logarithm of that value. In “equal range scaling,” each descriptor is divided by the range of that descriptor for all samples. In this way, all descriptors have the same range, that is, 1. However, this method is sensitive to the presence of outlier points. In “autoscaling,” each data vector is mean centered and unit variance scaled. This technique is very useful because each descriptor is then weighted equally, and large and small values are treated with equal emphasis. This can be important for genes expressed at very low, but still detectable, levels.
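A minimal sketch of logarithmic scaling, equal range scaling and autoscaling for a single descriptor is given below. The choice of base 2 for the logarithm and of the sample standard deviation are assumptions of this example, and the function names and values are hypothetical.

```python
import math
import statistics


def log_scale(values, base=2.0):
    """Replace each strictly positive value by its logarithm (base is an assumption)."""
    if any(v <= 0 for v in values):
        raise ValueError("logarithmic scaling requires strictly positive values")
    return [math.log(v, base) for v in values]


def equal_range_scale(values):
    """Divide by the range so that the scaled descriptor spans a range of 1."""
    span = max(values) - min(values)
    if span == 0:
        raise ValueError("descriptor is constant; cannot scale")
    return [v / span for v in values]


def autoscale(values):
    """Mean center, then scale to unit variance."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("descriptor is constant; cannot scale")
    return [(v - mean) / sd for v in values]


values = [1.0, 2.0, 8.0, 64.0]  # made-up values spanning a wide range
print(log_scale(values))
print(equal_range_scale(values))
print(autoscale(values))
```

Because equal range scaling divides by the difference between the largest and smallest value, a single outlier compresses all other values toward one another, which is the sensitivity noted above.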
In one embodiment, data is collected for one or more test samples and classified using the methods described herein. When comparing data from multiple analyses (e.g., comparing expression profiles for one or more test samples to the centroids constructed from samples collected and analyzed in an independent study), it will be necessary to normalize data across these data sets. In one embodiment, Distance Weighted Discrimination (DWD) is used to combine these data sets together (Benito et al. (2004) Bioinformatics 20(1):105-114, incorporated by reference herein in its entirety). DWD is a multivariate analysis tool that is able to identify systematic biases present in separate data sets and then make a global adjustment to compensate for these biases; in essence, each separate data set is a multidimensional cloud of data points, and DWD takes two point clouds and shifts one such that it more optimally overlaps the other.
The methods described herein may be implemented and/or the results recorded using any device capable of implementing the methods and/or recording the results. Examples of devices that may be used include but are not limited to electronic computational devices, including computers of all types. When the methods described herein are implemented and/or recorded in a computer, the computer program that may be used to configure the computer to carry out the steps of the methods may be contained in any computer readable medium capable of containing the computer program. Examples of computer readable medium that may be used include but are not limited to diskettes, CD-ROMs, DVDs, ROM, RAM, and other memory and computer storage devices. The computer program that may be used to configure the computer to carry out the steps of the methods and/or record the results may also be provided over an electronic network, for example, over the internet, an intranet, or other network.
In an embodiment, a processor of the computer is configured to perform the deconvolution method and the cell signature expression profile is stored in a computer readable medium.
Provided herein are methods for predicting cancer outcome. Outcome or prognosis may refer to overall or disease-specific survival, event-free survival, or outcome in response to a particular treatment or therapy. In particular, the methods may be used to predict the likelihood of long-term, disease-free survival. Predicting the likelihood of survival of a cancer patient is intended to assess the risk that a patient will die as a result of the underlying cancer. Long-term, disease-free survival is intended to mean that the patient does not die from or suffer a recurrence of the underlying cancer within a period of at least five years, or at least ten or more years, following initial diagnosis or treatment.
In one embodiment, outcome is predicted based on classification of a subject according to subtype. This classification is based on expression profiling using one or more of the genes in a cancer sample. Generally, cell-type abundance, when classified according to the methods described herein, is indicative of not only prognosis but also response to treatment.
In an embodiment, the ecotypes may comprise the following qualitative parameters which correlate with the prognosis of a subject having or suspected of having cancer:
In another embodiment, the methods described herein provide a determination of a Risk Of Relapse (ROR) score that can be used in any patient population regardless of disease status and treatment options. The ROR also has value in the prediction of pathological complete response in subjects treated with, for example, neoadjuvant taxane and anthracycline chemotherapy. Thus, in various embodiments of the present invention, a ROR model is used to predict outcome. Using these risk models, subjects can be stratified into low, medium, and high risk of relapse groups. Calculation of ROR can provide prognostic information to guide treatment decisions and/or monitor response to therapy.
In some embodiments described herein, the prognostic performance of the defined ecotypes and/or other clinical parameters is assessed utilizing a Cox Proportional Hazards Model Analysis, which is a regression method for survival data that provides an estimate of the hazard ratio and its confidence interval. The Cox model is a well-recognized statistical technique for exploring the relationship between the survival of a patient and particular variables. This statistical method permits estimation of the hazard (i.e., risk) of individuals given their prognostic variables (e.g., intrinsic gene expression profile with or without additional clinical factors, as described herein). The “hazard ratio” is the risk of death at any given time point for patients displaying particular prognostic variables. See generally Spruance et al., Antimicrob. Agents & Chemo. 48:2787-92, 2004.
In an embodiment of the invention, where a diagnosis, prognosis or prediction to drug treatment is provided, it will be understood that the method will comprise:
Where the training of a predictor set of cancer samples from subjects with known diagnosis, prognosis or prediction to drug treatment is required, the method may comprise:
In another embodiment, where the training of a predictor set of cancer samples from subjects with known diagnosis, prognosis or prediction to drug treatment is required, the method may also comprise applying the predictor set to the cancer sample by:
Cancer is managed by several alternative strategies that may include, for example, surgery, radiation therapy, hormone therapy, chemotherapy, or some combination thereof. For example, as is known in the art, treatment decisions for individual breast cancer patients can be based on endocrine responsiveness of the tumour, menopausal status of the patient, the location and number of patient lymph nodes involved, estrogen and progesterone receptor status of the tumour, size of the primary tumour, patient age, and stage of the disease at diagnosis. Analysis of a variety of clinical factors and clinical trials has led to the development of recommendations and treatment guidelines for early-stage breast cancer by the International Consensus Panel of the St. Gallen Conference (2005). See, Goldhirsch et al., Annals Oncol. 16:1569-83, 2005. The guidelines recommend that patients be offered chemotherapy for endocrine non-responsive disease; endocrine therapy as the primary therapy for endocrine responsive disease, adding chemotherapy for some intermediate- and all high-risk groups in this category; and both chemotherapy and endocrine therapy for all patients in the uncertain endocrine response category except those in the low-risk group.
Stratification of patients according to risk of relapse and risk score disclosed herein provides an additional or alternative treatment decision-making factor. The methods comprise evaluating risk of relapse optionally in combination with one or more clinical variables, such as node status, tumour size, and ER status. The risk score can be used to guide treatment decisions. For example, a subject having a low risk score may not benefit from certain types of therapy, whereas a subject having a high risk score may be indicated for a more aggressive therapy.
The methods of the present invention find use in identifying a high-risk, poor-prognosis population of subjects and thereby determining which patients would benefit from continued and/or more aggressive therapy and close monitoring following treatment. For example, early-stage cancer patients assessed as having a high risk score by the methods disclosed herein may be selected for more aggressive adjuvant therapy, such as chemotherapy, following surgery and/or radiation treatment. In particular embodiments, the methods of the present invention may be used in conjunction with the treatment guidelines established by the St. Gallen Conference to permit practitioners to make more informed cancer treatment decisions.
The methods disclosed herein also find use in predicting the response of a cancer patient to a selected treatment. Predicting the response of a cancer patient to treatment is intended to mean assessing the likelihood that a patient will experience a positive or negative outcome with a particular treatment. As used herein, indicative of a positive treatment outcome refers to an increased likelihood that the patient will experience beneficial results from the selected treatment (e.g., complete or partial remission, reduced tumour size, etc.). Indicative of a negative treatment outcome is intended to mean an increased likelihood that the patient will not benefit from the selected treatment with respect to the progression of the underlying breast cancer.
In some embodiments, the relevant time for assessing prognosis or disease-free survival time begins with the surgical removal of the tumour or suppression, mitigation, or inhibition of tumour growth. In another embodiment, the risk score is calculated based on a sample obtained after initiation of neoadjuvant therapy such as endocrine therapy. The sample may be taken at any time following initiation of therapy, but is preferably obtained after about one month so that neoadjuvant therapy can be switched to chemotherapy in unresponsive patients. It has been shown that a subset of tumours indicated for endocrine treatment before surgery is non-responsive to this therapy. The model provided herein can be used to identify aggressive tumours that are likely to be refractory to endocrine therapy, even when tumours are positive for estrogen and/or progesterone receptors.
Survival analysis can be performed using any known method in the art, including the Kaplan-Meier method (as described in the Example herein). The Kaplan-Meier method estimates the survival function from life-time data. In medical research, it can be used to measure the fraction of patients living for a certain amount of time after treatment. A plot of the Kaplan-Meier estimate of the survival function is a series of horizontal steps of declining magnitude which, when a large enough sample is taken, approaches the true survival function for that population. The value of the survival function between successive distinct sampled observations (“clicks”) is assumed to be constant.
An important advantage of the Kaplan-Meier curve is that the method can take into account “censored” data, that is, losses from the sample before the final outcome is observed (for instance, if a patient withdraws from a study). On the plot, small vertical tick-marks indicate losses, where patient data has been censored. When no truncation or censoring occurs, the Kaplan-Meier curve is equivalent to the empirical distribution.
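The Kaplan-Meier estimate and its handling of censored observations can be illustrated with a minimal sketch. The function name, the encoding of events (1 for an observed event, 0 for a censored observation) and the follow-up data are hypothetical and are used for illustration only.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time at which an event was observed.

    times  : follow-up time for each patient
    events : 1 if the event (e.g., death) was observed, 0 if the patient was censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Patients still under observation just before time t.
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        if deaths:
            # Censored patients leave the risk set without lowering the curve.
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve


# Made-up follow-up data: patients 2 and 5 are censored.
for t, s in kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0]):
    print(t, round(s, 4))
```

In this sketch a censored patient contributes to the number at risk up to the time of censoring and is then removed without being counted as an event, so the curve steps down only at observed event times.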
In statistics, the log-rank test (also known as the Mantel-Cox test) is a hypothesis test to compare the survival distributions of two groups of patients. It is a nonparametric test and appropriate to use when the data are right censored. It is widely used in clinical trials to establish the efficacy of new drugs compared to a control group when the measurement is the time to event. The log-rank test statistic compares estimates of the hazard functions of the two groups at each observed event time. It is constructed by computing the observed and expected number of events in one of the groups at each observed event time and then adding these to obtain an overall summary across all time points where there is an event. The log-rank statistic can be derived as the score test for the Cox proportional hazards model comparing two groups. It is therefore asymptotically equivalent to the likelihood ratio test statistic based on that model.
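The construction of the log-rank statistic from observed and expected event counts can be sketched as follows. This is an illustrative two-group implementation under the usual hypergeometric variance; the function name, the event encoding (1 for an observed event, 0 for censored) and the example data are hypothetical.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi_square, p_value) on 1 degree of freedom."""
    pooled = [(t, e, 0) for t, e in zip(times_a, events_a)]
    pooled += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in pooled if e == 1})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in pooled if u >= t)                 # at risk, both groups
        n_a = sum(1 for u, _, g in pooled if u >= t and g == 0)    # at risk, group A
        d = sum(1 for u, e, _ in pooled if u == t and e == 1)      # events, both groups
        d_a = sum(1 for u, e, g in pooled if u == t and e == 1 and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1.0 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("statistic undefined: no informative event times")
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Made-up data: group A fails early, group B fails late, no censoring.
print(logrank_test([1, 2, 3, 4], [1, 1, 1, 1], [10, 11, 12, 13], [1, 1, 1, 1]))
```

At each event time the expected count for group A is the total number of events multiplied by the fraction of at-risk patients who belong to group A, which corresponds to the null hypothesis of identical hazards in the two groups.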
The invention also provides for methods for diagnosing a breast cancer clinical subtype in a test sample from a subject. Diagnosis as used herein refers to the determination that a subject or patient has a type of breast cancer, or intrinsic subtype of breast cancer as described herein or known in the art. The type of breast cancer diagnosed according to the methods described herein may be any type known in the art or described herein.
In an embodiment, one or more of the following additional diagnostic tests may be used in addition to the methods for diagnosis described herein. These include:
In an embodiment, the subject may exhibit one or more of the following risk factors: age, preferably over 50 years of age; genetic mutations to certain genes, such as BRCA1 and BRCA2; early menstrual periods before age 12 and starting menopause after age 55; having dense breasts; personal history of breast cancer or certain non-cancerous breast diseases; family history of breast or ovarian cancer; previous treatment using radiation therapy; or history of taking the drug diethylstilbestrol (DES).
In some embodiments, the subject diagnosed with breast cancer exhibits one or more of the symptoms of breast cancer described herein or known in the art.
In an aspect of the invention, there are provided methods for diagnosing and treating breast cancer in a subject.
The terms “patient” and “subject” to be treated herein are used interchangeably and refer to patients and subjects of human or other mammal and includes any individual being examined or treated using the methods of the invention. Suitable mammals that fall within the scope of the invention include, but are not restricted to, primates, livestock animals (e.g., sheep, cows, horses, donkeys, pigs), laboratory test animals (e.g., rabbits, mice, rats, guinea pigs, hamsters), companion animals (e.g., cats, dogs) and captive wild animals (e.g., koalas, bears, wild cats, wild dogs, wolves, dingoes, foxes and the like).
In some embodiments, the treatment may include any of those described herein or known in the art including surgery; chemotherapy; hormonal therapy; biological therapy such as immunotherapy, small molecule therapy or antibody therapy; and radiation therapy. In a further embodiment, the chemotherapy may include the administration of one or more of:
In yet another embodiment, the radiotherapy may include the administration of one or more of:
In an embodiment, the subject to be treated exhibits one or more symptoms of a disease associated with breast cancer described herein or known in the art. Non-limiting examples may include one or more of:
Thus, a positive response to treatment with a therapeutically effective amount of any drug or compound identified herein may include amelioration of one or more of the above described symptoms or other symptoms known in the art. For instance, an individual having a positive response to treatment with any drug or compound administered as a result of the methods described herein may have a reduced presence of a lump in the breast or underarm or alternatively this may be surgically excised. An individual having a positive response to treatment with any drug or compound administered as a result of the methods described herein may also have reduced thickening or swelling, reduced irritation of breast skin, reduced redness or flaky skin in the nipple area or the breast, reduced nipple discharge or lessened pain or the symptoms may have disappeared altogether.
“Therapeutically effective amount” is used herein to denote any amount of a drug identified by the methods defined herein which is capable of reducing one or more of the symptoms associated with breast cancer. A single administration of the therapeutically effective amount of the drug may be sufficient, or they may be applied repeatedly over a period of time, such as several times a day for a period of days or weeks. The amount of the active ingredient will vary with the conditions being treated, the stage of advancement of the condition, the age and type of host, and the type and concentration of the formulation being applied. Appropriate amounts in any given instance will be readily apparent to those skilled in the art or capable of determination by routine experimentation.
The terms “treatment” or “treating” a subject includes the application or administration of a drug or compound with the purpose of delaying, slowing, stabilizing, curing, healing, alleviating, relieving, altering, remedying, less worsening, ameliorating, improving, or affecting the disease or condition, the symptom of the disease or condition, or the risk of (or susceptibility to) the disease or condition. The term “treating” refers to any indication of success in the treatment or amelioration of an injury, pathology or condition, including any objective or subjective parameter such as abatement; remission; lessening of the rate of worsening; lessening severity of the disease; stabilization, diminishing of symptoms or making the injury, pathology or condition more tolerable to the subject; slowing in the rate of degeneration or decline; making the final point of degeneration less debilitating; or improving a subject's physical or mental well-being.
The drugs or compounds that may be administered following the methods described herein may be provided in the form of a pharmaceutical composition comprising a therapeutically effective amount of any drug described herein or known in the art. In additional embodiments there is provided a pharmaceutical composition of any drug described herein or known in the art comprising a pharmaceutically acceptable salt.
The term “pharmaceutically acceptable salt” also refers to a salt of the compositions of the present invention having an acidic functional group, such as a carboxylic acid functional group, and a base. Pharmaceutically acceptable salts include, by way of non-limiting example, sulfate, citrate, acetate, oxalate, chloride, bromide, iodide, nitrate, bisulfate, phosphate, acid phosphate, isonicotinate, lactate, salicylate, acid citrate, tartrate, oleate, tannate, pantothenate, bitartrate, ascorbate, succinate, maleate, gentisinate, fumarate, gluconate, glucuronate, saccharate, formate, benzoate, glutamate, methanesulfonate, ethanesulfonate, benzenesulfonate, p-toluenesulfonate, camphorsulfonate, pamoate, phenylacetate, trifluoroacetate, acrylate, chlorobenzoate, dinitrobenzoate, hydroxybenzoate, methoxybenzoate, methylbenzoate, o-acetoxybenzoate, naphthalene-2-benzoate, isobutyrate, phenylbutyrate, α-hydroxybutyrate, butyne-1,4-dicarboxylate, hexyne-1,4-dicarboxylate, caprate, caprylate, cinnamate, glycolate, heptanoate, hippurate, malate, hydroxymaleate, malonate, mandelate, mesylate, nicotinate, phthalate, terephthalate, propiolate, propionate, phenylpropionate, sebacate, suberate, p-bromobenzenesulfonate, chlorobenzenesulfonate, ethylsulfonate, 2-hydroxyethylsulfonate, methylsulfonate, naphthalene-1-sulfonate, naphthalene-2-sulfonate, naphthalene-1,5-sulfonate, xylenesulfonate, and tartarate salts.
Further, any drug described herein or known in the art can be administered to a subject as a component of a composition that comprises a pharmaceutically acceptable carrier or vehicle. Such compositions can optionally comprise a suitable amount of a pharmaceutically acceptable excipient so as to provide the form for proper administration.
Pharmaceutical excipients can be liquids, such as water and oils, including those of petroleum, animal, vegetable, or synthetic origin, such as peanut oil, soybean oil, mineral oil, sesame oil and the like. The pharmaceutical excipients can be, for example, saline, gum acacia, gelatin, starch paste, talc, keratin, colloidal silica, urea and the like. In addition, auxiliary, stabilizing, thickening, lubricating, and colouring agents can be used.
In one embodiment, the pharmaceutically acceptable excipients are sterile when administered to a subject. Water is a useful excipient when any agent described herein is administered intravenously. Saline solutions and aqueous dextrose and glycerol solutions can also be employed as liquid excipients, specifically for injectable solutions. Suitable pharmaceutical excipients also include starch, glucose, lactose, sucrose, gelatin, malt, rice, flour, chalk, silica gel, sodium stearate, glycerol monostearate, talc, sodium chloride, dried skim milk, glycerol, propylene glycol, water, ethanol and the like. Any agent described herein, if desired, can also comprise minor amounts of wetting or emulsifying agents, or pH buffering agents.
In one embodiment, any drug described herein or known in the art can take the form of solutions, suspensions, emulsions, drops, tablets, pills, pellets, capsules, capsules containing liquids, powders, sustained-release formulations, suppositories, aerosols, sprays, nanoparticles or microneedles or any other form suitable for use. In one embodiment, the composition is in the form of a capsule. Other examples of suitable pharmaceutical excipients are described in Remington's Pharmaceutical Sciences 1447-1676 (Alfonso R. Gennaro ed., 19th ed. 1995), incorporated herein by reference.
Where necessary, any drug described herein or known in the art also includes a solubilizing agent. Also, the agents can be delivered with a suitable vehicle or delivery device as known in the art.
Any drug described herein or known in the art can be co-delivered in a single delivery vehicle or delivery device. Compositions for administration can optionally include a local anaesthetic such as, for example, lignocaine to lessen pain at the site of the injection.
Any drug described herein or known in the art may conveniently be presented in unit dosage forms and may be prepared by any of the methods well known in the art. Such methods generally include the step of bringing the therapeutic agents into association with a carrier, which constitutes one or more accessory ingredients. Typically, the formulations are prepared by uniformly and intimately bringing the therapeutic agent into association with a liquid carrier, a finely divided solid carrier, or both, and then, if necessary, shaping the product into dosage forms of the desired formulation (e.g., wet or dry granulation, powder blends, etc., followed by tableting using conventional methods known in the art).
In one embodiment, any drug described herein or known in the art is formulated in accordance with routine procedures as a composition adapted for a mode of administration described herein. In one aspect, the pharmaceutical composition is formulated for administration to the respiratory tract, the skin or the gastrointestinal tract. Accordingly, the pharmaceutical composition for administration to the respiratory tract may be formulated as an inhalable substance, such as common to the art and described herein. In another embodiment, the pharmaceutical composition for administration to the gastrointestinal tract may be formulated with an enteric coating, such as common to the art and described herein.
In an embodiment, the pharmaceutical composition may be administered as a single dose or as multiple doses. The pharmaceutical composition may be administered between one and three times in a 24 hour period, or daily over a 7 day period or longer. The frequency and timing of administration may be as known in the art.
Routes of administration include, for example: intradermal, intramuscular, intraperitoneal, intravenous, subcutaneous, intranasal, epidural, oral, sublingual, intracerebral, intra-lymph node, intratracheal, intravaginal, transdermal, rectal, by inhalation, or topical, particularly to the ears, nose, eyes, or skin. In some embodiments, the administering is effected orally or by parenteral injection. The mode of administration can be left to the discretion of the practitioner, and depends in part upon the site of the medical condition. In most instances, administration results in the release of any agent described herein into the bloodstream.
In certain embodiments, the human suffering from or suspected of having breast cancer has an age in a range of from about 0 months to about 6 months old, from about 6 to about 12 months old, from about 6 to about 18 months old, from about 18 to about 36 months old, from about 1 to about 5 years old, from about 5 to about 10 years old, from about 10 to about 15 years old, from about 15 to about 20 years old, from about 20 to about 25 years old, from about 25 to about 30 years old, from about 30 to about 35 years old, from about 35 to about 40 years old, from about 40 to about 45 years old, from about 45 to about 50 years old, from about 50 to about 55 years old, from about 55 to about 60 years old, from about 60 to about 65 years old, from about 65 to about 70 years old, from about 70 to about 75 years old, from about 75 to about 80 years old, from about 80 to about 85 years old, from about 85 to about 90 years old, from about 90 to about 95 years old or from about 95 to about 100 years old.
The present invention also provides kits useful for determining cell type abundance. These kits comprise a set of capture probes and/or primers specific for the intrinsic genes listed in a cancer sample, as well as reagents sufficient to facilitate detection and/or quantitation of the intrinsic gene expression product. The kit may further comprise a computer readable medium.
In one embodiment of the present invention, the capture probes are immobilized on an array. By “array” is intended a solid support or a substrate with peptide or nucleic acid probes attached to the support or substrate. Arrays typically comprise a plurality of different capture probes that are coupled to a surface of a substrate in different, known locations.
The arrays of the invention comprise a substrate having a plurality of capture probes that can specifically bind an intrinsic gene expression product. The number of capture probes on the substrate varies with the purpose for which the array is intended. The arrays may be low-density arrays or high-density arrays and may contain 4 or more, 8 or more, 12 or more, 16 or more, 32 or more addresses, but will minimally comprise capture probes for the intrinsic genes in a cancer sample.
Arrays may be packaged in such a manner as to allow for diagnostics or other manipulation on the device. See, for example, U.S. Pat. Nos. 5,856,174 and 5,922,591 herein incorporated by reference.
In another embodiment, the kit comprises a set of oligonucleotide primers sufficient for the detection and/or quantitation of each of the intrinsic genes in a cancer sample.
The oligonucleotide primers may be provided in a lyophilized or reconstituted form or may be provided as a set of nucleotide sequences. In one embodiment, the primers are provided in a microplate format, where each primer set occupies a well (or multiple wells, as in the case of replicates) in the microplate. The microplate may further comprise primers sufficient for the detection of one or more housekeeping genes as discussed infra. The kit may further comprise reagents and instructions sufficient for the amplification of expression products from the genes in a cancer sample.
In order that the invention may be readily understood and put into practical effect, particular preferred embodiments will now be described by way of the following non-limiting examples.
The present example illustrates an embodiment of the invention. In particular, the example demonstrates, using single cell signatures, deconvolution of large breast cancer cohorts to stratify them into nine clusters, termed ‘ecotypes’, with unique cellular compositions and clinical outcomes.
Primary untreated breast cancers used in this study were collected under protocols x13-0133, x19-0496, x16-018 and x17-155. Human research ethics committee approval was obtained through the Sydney Local Health District Ethics Committee, Royal Prince Alfred Hospital zone, and the St Vincent's hospital Ethics Committee. Site-specific approvals were obtained for all additional sites. Written consent was obtained from all patients prior to collection of tissue and clinical data stored in a de-identified manner, following pre-approved protocols. Consent into the study included the agreement to the use of all patient tissue and data for publication. Two TNBC samples used for Visium analysis (1142243F and 1160920F) were sourced from BioIVT Asterand®.
Samples collected in this study (Table 1) were analysed from fresh surgical resections and cryopreserved tissue. Tumours were mechanically and enzymatically dissociated using Human Tumour Dissociation Kit (Miltenyi Biotec), following the manufacturer's protocol. For cryopreserved tissue, tumour tissues were thawed and washed twice with RPMI 1640 prior to dissociation, as previously described by Wu et al., (2021) Genome Medicine, doi: 10.1186/s13073 00885-z. Following incubation at 37° C. for 30 to 60 min, the sample was resuspended in RPMI 1640 and filtered through MACS® SmartStrainers (70 μm; Miltenyi Biotec). The resulting single cell suspension was centrifuged at 300×g for 5 min. For fresh tissue processing, red blood cells were lysed with Lysing Buffer (Becton Dickinson) for 5 min and the resulting suspension was centrifuged at 300×g for 5 min. Where viability was <80%, viability enrichment was performed using the EasySep Dead Cell Removal (Annexin V) Kit (StemCell Technologies) as per manufacturer's protocol. Dissociated cells were resuspended in a final solution of PBS with 10% fetal calf serum (FCS) solution prior to loading on the 10× Chromium platform.
Single-cell sequencing was performed using the Chromium Single-Cell v2 3′ and 5′ Chemistry Library, Gel Bead, Multiplex and Chip Kits (10× Genomics) according to the manufacturer's protocol. A total of 5,000 to 7,000 cells were targeted per well. Libraries were sequenced on the NextSeq 500 platform (Illumina) with paired-end sequencing and dual indexing. A total of 26, 8 and 98 cycles were run for Read 1, i7 index and Read 2, respectively.
Raw bcl files were demultiplexed and mapped to the reference genome GRCh38 using the Cell Ranger Single Cell v2.0 software (10× Genomics). For individual samples, the EmptyDrops method was applied to filter the raw unique molecular identifier (UMI) count matrix, retaining real cell barcodes and discarding ambient background RNA. An additional cutoff was applied, filtering for cells with a gene and UMI count greater than 200 and 250, respectively. All cells with a mitochondrial UMI count percentage greater than 20% were removed. We used the Seurat v3 method (Stuart et al., (2019) Cell 177, 1888-1902 e21) in R for data normalisation, dimensionality reduction and clustering using default parameters. Cell clusters were annotated using the Garnett method (Pliner et al., Nature Methods (2019) 16, 983-986) using the default recommended parameters, with a classifier derived from an array of cell signatures for breast epithelial subsets from Lim et al. (2009) Nat Med 15, 907-13, and immune and stromal cell types from the XCell database (Aran et al., (2017) Genome Biol 18:220), including T-cells, B-cells, plasmablasts, monocyte/macrophages, endothelial, fibroblast and perivascular cell signatures.
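The per-cell quality thresholds described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study; the function name passes_qc, the dictionary layout and the use of an "MT-" gene-symbol prefix to recognise mitochondrial genes are assumptions made for the example.

```python
def passes_qc(cell_counts, min_genes=200, min_umis=250, max_mito_pct=20.0):
    """Return True if one cell passes the thresholds given in the text.

    cell_counts maps gene symbol -> UMI count for a single cell
    (hypothetical layout, chosen for illustration only).
    """
    total_umis = sum(cell_counts.values())
    if total_umis == 0:
        return False
    genes_detected = sum(1 for n in cell_counts.values() if n > 0)
    # Assumption: mitochondrial genes are identified by the "MT-" prefix.
    mito_umis = sum(n for g, n in cell_counts.items() if g.startswith("MT-"))
    mito_pct = 100.0 * mito_umis / total_umis
    return (genes_detected > min_genes
            and total_umis > min_umis
            and mito_pct <= max_mito_pct)
```

A cell is kept only when all three conditions hold, so a cell with many detected genes but a high mitochondrial fraction is still discarded.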
Data integration was performed using Seurat v3 using default parameters (Stuart et al., (2019) Cell 177, 1888-1902 e21). A total of 2000 features for anchoring (FindIntegrationAnchors step) and 30 dimensions for alignment (IntegrateData step) were used. For reclustering immune and mesenchymal lineages, a total of 5000 features were used for anchoring (FindIntegrationAnchors step), and a total of 30, 20, and 10 principal components were used for clustering T-cells, myeloid cells and B-cells, respectively. The default resolution of 0.8 was used (FindNeighbors and FindClusters step). For clustering without batch correction steps, we merged all individual datasets together (merge function) and performed clustering steps (RunPCA, FindNeighbors and FindClusters steps) using the “RNA” assay with a total of 100 principal components.
Identifying Neoplastic from Normal Breast Cancer Epithelial Cells
CNV signal for individual cells was estimated using the inferCNV method with a 100 gene sliding window. Genes with a mean count of less than 0.1 across all cells were filtered out prior to analysis, and signal was denoised using a dynamic threshold of 1.3 standard deviations from the mean. Immune and endothelial cells were used to define the reference cell inferred copy-number profiles. Epithelial cells were used for the observations. Epithelial cells were classified into normal (non-neoplastic), neoplastic or unassigned using a similar method to that previously described by Neftel et al., (2019) Cell 178, 835-849 e21. Briefly, inferred changes at each genomic locus were scaled (between −1 and +1) and the mean of the squares of these values was used to define a genomic instability score for each cell. In each individual tumour, the top 5% of cells with the highest genomic instability scores were used to create an average CNV profile. Each cell was then correlated to this profile. Cells were plotted with respect to both their genomic instability and correlation scores. Partitioning around medoids (PAM) clustering was performed using the ‘pamk’ function in the R package ‘cluster’ to choose the optimum value for k (between 2-4) using silhouette scores, and the ‘pam’ function to apply the clustering. Thresholds defining normal and neoplastic cells were set at 2 cluster standard deviations to the left and 1.5 standard deviations below the first cancer cluster means. For tumours where PAM could not define more than 1 cluster, the thresholds were set at 1 standard deviation to the left and 1.25 standard deviations below the cluster means. This method was used to identify 27,506 neoplastic and 6,084 normal cells in all tumours; the remaining 3,208 cells were classed as unassigned (
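The two per-cell quantities used in this classification, the genomic instability score and the correlation with the average profile of the most unstable cells, can be sketched as below. This is an illustrative Python sketch, not the code used in the study. It assumes each cell's inferred CNV profile has already been scaled to the range −1 to +1, and it assumes Pearson correlation, which the text does not specify. The function names are hypothetical.

```python
import math

def instability_score(profile):
    """Mean of the squared, already-scaled CNV values for one cell."""
    return sum(v * v for v in profile) / len(profile)

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def score_cells(profiles, top_fraction=0.05):
    """Return (instability scores, correlations) for the cells of one tumour.

    profiles: list of per-cell CNV profiles, each a list of values in [-1, 1].
    """
    scores = [instability_score(p) for p in profiles]
    n_top = max(1, math.ceil(top_fraction * len(profiles)))
    order = sorted(range(len(profiles)), key=lambda i: scores[i], reverse=True)
    top = order[:n_top]
    n_loci = len(profiles[0])
    # Average CNV profile of the most unstable cells in this tumour.
    reference = [sum(profiles[i][j] for i in top) / n_top for j in range(n_loci)]
    correlations = [pearson(p, reference) for p in profiles]
    return scores, correlations
```

The resulting pairs of scores are the coordinates on which the PAM clustering and the standard-deviation thresholds described above would then operate.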
We constructed “pseudo-bulk” expression profiles for each tumour, where all the reads from all cells of a given tumour were added together, and then mapped as one sample. The resulting pseudo-bulk matrix thus constructed was named “Allcells-Pseudobulk” and was subsequently processed similarly to any bulk RNA-Seq sample (i.e. upper quartile normalized-log transformed) for calling molecular subtypes using the PAM50 method (Parker et al., (2009) J Clin Oncol 27, 1160-7). An important consideration made before PAM50 subtyping is to adjust a new sample set relative to the PAM50 training set according to their ER and HER2 status as detailed by Zhao et al., (2015) Breast Cancer Res 17, 29. Thus, after ER/HER2 group-based adjustments, and then applying the PAM50 centroid predictor to the pseudo-bulk data, the methodology identified 7 of 20 Basal-like (CID3963, CID4465, CID4495, CID44971, CID4513, CID4515, CID4523), 4 of 20 HER2E (CID3921, CID4066, CID44991, CID45171), 5 of 20 LumA (CID3941, CID4067, CID4290A, CID4463, CID4530N), 3 of 20 LumB (CID3948, CID4461, CID4535) and 1 of 20 as Normal-like (CID4471).
We performed whole-transcriptome RNA-Seq using Ribosomal Depletion on 18 matching tumour samples from our single-cell dataset. RNA was extracted from diagnostic FFPE blocks using the High Pure RNA Paraffin Kit (Roche #03 270 289 001). Sequence alignment was performed using Salmon (Patro et al., (2017) Nature Methods 14, 417-419). We then called PAM50 on each bulk tumour using Zhao et al., (2015) Breast Cancer Res 17, 29 normalization and then the PAM50 centroid predictor (Table 2).
Table 2: PAM50/scSubtype Comparative Table of all patient samples included in the scSubtype analysis showing their clinical Immunohistochemistry classification, PAM50 Subtype calls on pseudobulk RNA profiles from 10× scRNA-Seq and PAM50 Subtype calls on bulk RNA profiles using Ribozero mRNA-Seq data. Also included are the number and percentage of individual neoplastic cells in each tumour assigned to each of the 4 scSubtype subtypes.
Calling Intrinsic Subtype on scRNA-Seq Using scSubtype
To design and validate a new subtyping tool specific for scRNA-Seq data, we first divided our tumour samples into training and testing sets. The training dataset was defined by identifying tumours with unambiguous molecular subtypes. Here, we identified robust training set samples using two subtyping approaches: (i) PAM50 subtyping of the Allcells-Pseudobulk datasets (described above); and (ii) hierarchical clustering of the Allcells-Pseudobulk data with the 1,100 tumours in the TCGA BrCa RNA-Seq dataset using 2000 genes from an intrinsic breast cancer gene list (Parker, J. S. et al. (2009) J Clin Oncol 27, 1160-7). We first identified tumours that shared the same “concordant” subtype from both Allcells-Pseudobulk PAM50 calls and TCGA hierarchical clustering-based subtype classifications (Table 2). Next, since our methodology aimed to subtype cancer cells, we removed any tumours with <150 cancer cells. Finally, we did not include cells from the two metaplastic samples (CID4513 and CID4523) in the training data because this is a histological subtype not used in the original PAM50 training set. Using this approach, we identified 10 tumour samples in the training dataset: HER2E (CID3921, CID44991, CID45171), Basal-like (CID4495, CID44971, CID4515), LumA (CID4290, CID4530) and LumB (CID3948, CID4535). Only tumour cells with greater than 500 UMIs were used for training and test datasets in scSubtype (total of 24,489 cells).
Within each training set subtype, we utilized the cancer cells from each tumour sample and performed pairwise single cell integrations and differential gene expression calculations. The integration was carried out in a “within group” pairwise fashion using the FindIntegrationAnchors and IntegrateData functions in the Seurat v3 package (Stuart et al., (2019) Cell 177, 1888-1902 e21). Briefly, the first step identifies anchors between pairs of cells from each dataset using mutual nearest neighbors. The second step integrates the datasets together based on a distance-based weights matrix constructed from the anchor pairs. Differentially expressed genes were calculated between each pair using a Wilcoxon Rank Sum test by the FindAllMarkers function within Seurat v3. As the number of cancer cells per tumour sample was highly variable, this strategy prevented a bias of identifying genes for a training group from a sample with the highest number of cells. The following pairs were analyzed: HER2E (CID3921-CID44991, CID44991-CID45171, CID45171-CID3921), Basal-like (CID4495-CID44971, CID44971-CID4515, CID4515-CID4495), LumA (CID4290-CID4530) and LumB (CID3948-CID4535). In this way we identified unique upregulated genes per sample, but also genes broadly highlighting cells within each respective training group or subtype. We removed any duplicate genes occurring between the 4 training groups, which yielded 4 sets of genes composed of 89 genes defining Basal_SC, 102 genes defining HER2E_SC, 46 genes defining LumA_SC and 65 genes defining LumB_SC, which we define as “scSubtype” gene signatures (Table 3). Table 3 lists the genes used to define the single-cell scSubtype molecular subtype classifier, with one gene list for each scSubtype (Basal_SC, Her2E_SC, LumA_SC and LumB_SC).
To assign a subtype call to a cell we calculated the average (i.e. mean) read counts for each of the 4 signatures for each cell. The SC subtype with the highest signature score was then assigned to each cell. We utilized this method to subtype all 24,489 neoplastic cells, from both our training samples (n=10) and the remaining test (n=10) set samples.
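The per-cell subtype call described above amounts to averaging expression over each signature and taking the highest score. The following Python sketch illustrates that step under stated assumptions: the function name call_sc_subtype and the dictionary layout are hypothetical, genes missing from a cell are treated as zero, and the gene names in the usage example are placeholders rather than genes from Table 3.

```python
def call_sc_subtype(cell_expr, signatures):
    """Assign one cell to the signature with the highest mean expression.

    cell_expr:  gene -> read count for a single cell.
    signatures: subtype name -> list of signature genes.
    Returns (best subtype, dict of all signature scores).
    """
    scores = {}
    for subtype, genes in signatures.items():
        # Assumption: a gene absent from the cell contributes a count of zero.
        scores[subtype] = sum(cell_expr.get(g, 0.0) for g in genes) / len(genes)
    best = max(scores, key=scores.get)
    return best, scores

# Usage with placeholder gene names (not the real signature genes).
toy_signatures = {"Basal_SC": ["GENE_A", "GENE_B"], "LumA_SC": ["GENE_C", "GENE_D"]}
toy_cell = {"GENE_A": 4, "GENE_B": 2, "GENE_C": 1}
subtype, signature_scores = call_sc_subtype(toy_cell, toy_signatures)
```

Because the score is a mean rather than a sum, signatures of different lengths (for example 46 versus 102 genes) are compared on the same scale.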
As previously described, we calculated the degree of epithelial cell differentiation status (DScore), and proliferation signature status, on every tumour cell in our scRNA-Seq cohort, as well as the 1,100 tumours in the TCGA dataset. The 11 genes used to compute the proliferation signature status are independent of the scSubtype gene lists, while the DScore is computed using a centroid-based predictor with information from ˜20 thousand genes.
Tumour tissue was fixed in 10% neutral buffered formalin for 24 hrs and then processed for paraffin embedding. Diagnostic tumour blocks were accessed for samples that did not have a research block available. Blocks were sectioned at 4 μm. Sections were stained with Haematoxylin and Eosin for standard histological analysis. Immunohistochemistry (IHC) was performed on serial sections with pre-diluted primary antibodies against ER (clone 6F11; Leica PA0151) or CK5 (clone XM26; Leica PA0468) using suggested protocols on the BOND RX Autostainer (Leica, Germany). Antigen retrieval was performed for 20 min using BOND Epitope Retrieval solution 1 for ER or solution 2 for CK5, followed by primary antibody incubation for 60 min and secondary staining with the Bond Refine detection system (Leica). Slides were imaged using the Aperio CS2 Digital Pathology Slide Scanner.
For each individual tumour with more than 50 neoplastic cells, the neoplastic cells were clustered using Seurat v3 at five resolutions (0.4, 0.8, 1.2, 1.6, 2.0). MAST was then used to identify the top-200 differentially regulated genes in each cluster. Only gene-signatures containing greater than 5 genes and originating from clusters of more than 5 cells were kept. In addition, redundancy was reduced by comparing all pairs of signatures within each sample and, for each pair with a Jaccard index greater than 0.75, removing the signature with the fewest genes. Across all tumours, a total of 574 gene-signatures of intra-tumour heterogeneity were identified.
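The redundancy filter described above can be sketched in Python as follows. The function names are hypothetical, and two details are assumptions not stated in the text: pairs are visited in sorted name order, skipping any signature already removed, and when two signatures have equal size the second of the pair is dropped.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two gene collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def drop_redundant(signatures, threshold=0.75):
    """Return the names of signatures kept after the redundancy filter.

    signatures: signature name -> iterable of gene symbols, all taken
    from the same tumour sample.
    """
    removed = set()
    for n1, n2 in combinations(sorted(signatures), 2):
        if n1 in removed or n2 in removed:
            continue
        if jaccard(signatures[n1], signatures[n2]) > threshold:
            size1, size2 = len(set(signatures[n1])), len(set(signatures[n2]))
            # Drop the smaller signature of the pair (ties drop n2; assumption).
            removed.add(n1 if size1 < size2 else n2)
    return [n for n in sorted(signatures) if n not in removed]
```

For instance, a 9-gene signature fully contained in a 10-gene signature has a Jaccard index of 0.9 and would be removed, while an unrelated signature is unaffected.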
Consensus clustering (using spherical k-means, skmeans, implemented in the cola R package: https://www.bioconductor.org/packages/release/bioc/html/cola.html) of the Jaccard similarities between these gene-signatures was used to identify 7 robust groups, or gene-modules. For each of these, a gene module was defined by taking the 200 genes that had the highest frequency of occurrence across clusters and individual tumours. These are defined as gene-modules GM1 to GM7. A gene-module signature was calculated for each cell using AUCell and each neoplastic cell was assigned to a module, using the maximum of the scaled AUCell gene-module signature scores. This resulted in 4,368, 3,288, 2,951, 4,326, 3,931, 2,500, 3,125 cells assigned to GM1 to GM7, respectively. These are defined as gene-module based neoplastic cell states.
Differential gene expression was performed using the MAST method (Finak, G. et al., (2015). Genome Biol 16, 278) in Seurat (FindAllMarkers step) using default cutoff parameters. All DEGs from each cluster (data not shown) were used as input into the ClusterProfiler package for gene ontology functional enrichment. All ontologies within the enrichGO databases were used with the human org.Hs.eg.db database. Results were clustered, scaled and visualised using the pheatmap package in R. Cytotoxic, TAM and Dysfunctional T-cell gene expression signatures were assigned using the AddModuleScore function in Seurat v3. The list of genes used for dysfunctional T-cells was adopted from Li et al., (2019) Cell 176, 775-789 e18. The TAM gene list was adopted from Cassetta et al., (2019) Cancer Cell 35, 588-602 e10. The cytotoxic gene list consists of 12 genes which translate to effector cytotoxic proteins (GZMA, GZMB, GZMH, GZMK, GZMM, GNLY, PRF1 and FASLG) and well described cytotoxic T-cell activation markers (IFNG, TNF, IL2R and IL2).
Cell differentiation was inferred for mesenchymal cells (CAFs, PVL and Endothelial cells) using the Monocle 2 method with default parameters as recommended by developers. Integrated gene expression matrices from each cell type were first exported from Seurat v3 into Monocle to construct a CellDataSet. All variable genes defined by the differentialGeneTest function (q-val cutoff<0.001) were used for cell ordering with the setOrderingFilter function. Dimensionality reduction was performed with no normalisation methods and the DDRTree reduction method in the reduceDimension step.
Samples were stained with 10× Chromium 3′ mRNA capture compatible TotalSeq-A antibodies (Biolegend, USA). Staining was performed as previously described by Stoeckius et al., (2017) Nat Methods 14, 865-868 with a few modifications listed below. A total of four cases from our scRNA-Seq cohort were analyzed, including one luminal (CID4040), one HER2 (CID383) and two TNBC (CID4515 and CID3956). A panel of 157 barcoded antibodies (data not shown) was used, which recognised a range of cell surface lineage and activation markers, in addition to a large collection of co-stimulatory and co-inhibitory receptors and ligands. Briefly, a maximum of 1 million cells per sample was resuspended in 120 μl of cell staining buffer (Biolegend, USA) with 5 μl of Fc receptor Block (TrueStain FcX, Biolegend, USA) for 15 min. This was followed by a 30 min staining of the antibodies at 4° C. A concentration of 1 μg/100 μl was used for all antibody markers used in this study. The cells were then washed 3 times with PBS containing 10% FCS media followed by centrifugation (300×g for 5 min at 4° C.) and removal of the supernatant. The sample was then resuspended in PBS with 10% FCS for 10× Chromium capture.
Demultiplexed reads were assigned to individual cells and antibodies with the Python package CITE-seq-Count v.1.4.3 (https://github.com/Hoohm/CITE-seq-Count/tree/1.4.2). CITE counts were normalised and scaled with Seurat v.3.1.4. Imputation of CITE data was performed per individual cell type (B-cells, T-cells, myeloid cells, mesenchymal cells) for those antibodies that were differentially expressed between subclusters (FindAllMarkers step) for individual samples. We used anchoring-based transfer learning to transfer protein expression levels from these four samples to the remaining BrCa cases.
Survival Analysis of scRNA-Seq Signatures
To assess the impact of particular cell types described by scRNA-Seq (e.g. LAM1 and LAM2) on clinical outcome, we assessed the association between gene signatures (derived as described above) and patient overall survival in the METABRIC cohort. For each tumour from the bulk expression cohort, average gene signature expression was derived using the top 100 genes from the gene signature of interest. Patients were then stratified based on the top and bottom 30%, and survival curves were generated using the Kaplan-Meier method with the ‘survival’ package in R (https://cran.r-project.org/package=survival). We assessed the significance of the difference between the two groups using the log-rank test statistic. Differences in survival between ecotypes were assessed using Kaplan-Meier analysis and log-rank test statistics, using the survival and survminer R packages.
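The stratification and survival-curve steps described above can be illustrated with the following Python sketch. It is not the R ‘survival’ workflow used in the study: the function names are hypothetical, the group size is obtained by rounding 30% of the cohort (an assumption), and the log-rank test is omitted.

```python
def stratify_top_bottom(scores, fraction=0.30):
    """Split patients into the highest and lowest scoring fractions.

    scores: patient id -> average signature expression.
    Returns (high group ids, low group ids).
    """
    ranked = sorted(scores, key=scores.get)
    k = round(len(ranked) * fraction)  # assumption: group size is rounded
    return ranked[len(ranked) - k:], ranked[:k]

def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times:  follow-up time for each patient.
    events: 1 if death was observed at that time, 0 if censored.
    Returns a list of (time, survival probability) at each event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve
```

Each group returned by stratify_top_bottom would be passed separately to kaplan_meier, and the two curves then compared with a log-rank test.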
CIBERSORTx and DWLS were used to deconvolute predicted cell-fractions from a number of bulk transcript profiling datasets. To prevent confounding of cycling cell-types we first assigned all neoplastic epithelial cells with a proliferation score>0 as cycling and then combined these with “cycling” cell states from all other cell-types to generate a single “Cycling” cell-state. To generate cell-type signature matrices for each of the tiers of cell-type annotation described in this study, we randomly subsampled 15% of cells from each level of annotation type.
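The random subsampling step can be sketched as follows, reading "15% of cells from each level of annotation type" as 15% of the cells carrying each annotation label; that reading is an assumption. The function name, the fixed random seed and the rule of keeping at least one cell per label are also assumptions made for this Python illustration.

```python
import random
from collections import defaultdict

def subsample_per_type(cell_labels, fraction=0.15, seed=0):
    """Randomly sample a fixed fraction of cells within each cell-type label.

    cell_labels: cell id -> cell-type label.
    Returns a sorted list of the sampled cell ids.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_type = defaultdict(list)
    for cell_id, label in sorted(cell_labels.items()):
        by_type[label].append(cell_id)
    sampled = []
    for label, ids in by_type.items():
        # Assumption: keep at least one cell for very small populations.
        k = max(1, round(fraction * len(ids)))
        sampled.extend(rng.sample(ids, k))
    return sorted(sampled)
```

Sampling within each label, rather than across the whole dataset, keeps rare cell types represented in the reference used to build the signature matrix.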
We then ran CIBERSORTx “cibersortx/fractions” to generate cell-type signature matrices using the following parameters: --single_cell TRUE --G.min 300 --G.max 500 --q.value 0.01 --filter FALSE --k.max 999 --replicates 5 --sampling 0.5 --fraction 0.75.
For cell-type deconvolution of bulk tumours we ran CIBERSORTx “cibersortx/fractions” to calculate the relative cell-type abundances in each tumour. S-mode batch correction was used for the METABRIC tumours.
For deconvolution analysis using DWLS we used the functions in the “Deconvolution_functions.R” script obtained from https://github.com/dtsoucas/DWLS. Cell-type signature matrices were generated using the buildSignatureMatrixMAST( ) function and then filtered to only contain genes that are present in both the bulk and single-cell derived signature matrices, using the trimData( ) function. Cell-type abundances were then calculated using the solveDampenedWLS( ) function.
Pseudo-bulk expression matrices were generated from the scRNA-Seq datasets in this study by summing the unique molecular identifiers (UMIs) for each gene across all cells for each tumour. Normalised METABRIC expression matrices, clinical information and PAM50 subtype classifications were obtained from https://www.cbioportal.org/study/summary?id=brca_metabric.
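The pseudo-bulk construction described above is a per-tumour sum of UMI counts over cells. A minimal Python sketch follows; the function name, the input layout and the tumour identifiers in the usage example are hypothetical.

```python
from collections import defaultdict

def pseudo_bulk(cells):
    """Sum UMI counts per gene across all cells of each tumour.

    cells: iterable of (tumour id, {gene: UMI count}) pairs, one per cell.
    Returns tumour id -> {gene: summed UMI count}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for tumour_id, counts in cells:
        for gene, n in counts.items():
            totals[tumour_id][gene] += n
    return {tumour: dict(genes) for tumour, genes in totals.items()}

# Usage with made-up tumour identifiers and counts.
toy_cells = [
    ("T1", {"ESR1": 2, "KRT5": 1}),
    ("T1", {"ESR1": 3}),
    ("T2", {"KRT5": 4}),
]
bulk = pseudo_bulk(toy_cells)
```

The summed profile can then be normalised and handled like any bulk RNA-Seq sample.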
Tumour ecotypes in the METABRIC cohort were identified using spherical k-means (skmeans) based consensus clustering (as implemented in the cola R package: https://www.bioconductor.org/packages/release/bioc/html/cola.html) of the predicted cell-fraction from either CIBERSORTx or DWLS, in each bulk METABRIC patient tumour. When comparing ecotypes between methods (i.e., consensus clustering results from using cell-abundances of all cell-types or just the 32 significantly correlated cell-types from CIBERSORTx deconvolution and consensus clustering results from CIBERSORTx or DWLS cell-abundances) the number of tumour ecotypes was fixed as 9 and the tumour overlaps between all ecotype pairs was calculated (Tables 4 and 5). Common ecotypes were then identified by identifying the ecotype pairs with the largest average METABRIC tumour overlap.
With reference to Table 4: The table columns are: ecotype_all: The ecotype ID when using all cell-types; ecotype_all_samples: number of tumours in ecotype from using all cell-types; ecotype_signif: The ecotype ID when using only the significantly correlated cell-types; ecotype_signif_samples: number of tumours in ecotype from using only the significantly correlated cell-types; overlap: number of overlapping tumours between the ecotype pairs; ecotype_all_overlap: fraction of overlapping tumours from ecotypes generated using all cell-types; ecotype_signif_overlap: fraction of overlapping tumours from ecotypes generated using only the significantly correlated cell-types; avg_overlap: the averaged fractional overlap (i.e., (ecotype_all_overlap+ecotype_signif_overlap)/2).
With reference to Table 5: The table columns are: cibersortx_ecotype: The ecotype ID when using CIBERSORTx; cibersortx_ecotype_samples: number of tumours in ecotype from CIBERSORTx; dwls_ecotype: The ecotype ID when using DWLS; dwls_ecotype_samples: number of tumours in ecotype from using DWLS; overlap: number of overlapping tumours between the ecotype pairs; cibersortx_ecotype_overlap: fraction of overlapping tumours from ecotypes generated using CIBERSORTx; dwls_ecotype_overlap: fraction of overlapping tumours from ecotypes generated using DWLS; avg_overlap: the averaged fractional overlap (i.e., (cibersortx_ecotype_overlap+dwls_ecotype_overlap)/2).
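The overlap quantities tabulated in Tables 4 and 5 can be computed as in the following Python sketch. The function name and the generic column names (ecotype_a, ecotype_b and so on) are hypothetical stand-ins for the method-specific column names above, and restricting the comparison to tumours present in both assignments is an assumption.

```python
def ecotype_overlaps(assign_a, assign_b):
    """Compare two ecotype assignments of the same tumours.

    assign_a, assign_b: tumour id -> ecotype label from each method.
    Returns one row per ecotype pair, sorted by decreasing avg_overlap.
    """
    shared = set(assign_a) & set(assign_b)  # assumption: shared tumours only
    members_a, members_b = {}, {}
    for tumour in shared:
        members_a.setdefault(assign_a[tumour], set()).add(tumour)
        members_b.setdefault(assign_b[tumour], set()).add(tumour)
    rows = []
    for eco_a, tumours_a in members_a.items():
        for eco_b, tumours_b in members_b.items():
            overlap = len(tumours_a & tumours_b)
            frac_a = overlap / len(tumours_a)
            frac_b = overlap / len(tumours_b)
            rows.append({
                "ecotype_a": eco_a,
                "ecotype_a_samples": len(tumours_a),
                "ecotype_b": eco_b,
                "ecotype_b_samples": len(tumours_b),
                "overlap": overlap,
                "ecotype_a_overlap": frac_a,
                "ecotype_b_overlap": frac_b,
                "avg_overlap": (frac_a + frac_b) / 2,
            })
    return sorted(rows, key=lambda r: r["avg_overlap"], reverse=True)
```

Common ecotypes would then correspond to the pairs at the top of the sorted list, that is, those with the largest averaged fractional overlap.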
To elucidate the cellular architecture of BrCa, we analysed 26 primary pre-treatment human BrCa, including 11 ER+, 5 HER2+ and 10 TNBCs, by scRNA-Seq (Table 1;
As previously reported in other cancer types, UMAP visualization showed a clear separation of epithelial cells by tumour, although three clusters contained cells from multiple patients and subtypes (
scSubtype: Intrinsic Subtyping for Single Cell RNA-Seq Data
As unsupervised clustering could not be used to find recurring neoplastic cell gene expression features between tumours, we asked whether we could classify cells using the established PAM50 method. Due to the inherent sparsity of single-cell data, we took the opportunity to develop a scRNA-Seq compatible method for intrinsic molecular subtyping. We constructed “pseudo-bulk” profiles from scRNA-Seq for each tumour with at least 150 neoplastic cells, and applied the PAM50 centroid predictor. This identified 7 Basal-like, 4 HER2E, 5 LumA, 3 LumB and 1 Normal-like BrCa. To identify a robust training set, we used hierarchical clustering of the pseudo-bulk samples with the TCGA dataset of 1,100 BrCa using a 2,000-gene intrinsic BrCa gene list (
For each PAM50 subtype within the training dataset, we performed pairwise single cell integrations and differential gene expression to identify 4 sets of genes that would define our single-cell derived molecular subtypes (89 genes Basal_SC; 102 genes HER2E_SC; 46 genes LumA_SC; 65 genes LumB_SC; methods). We defined these genes as the “scSubtype” gene signatures (
As another means of assessing the accuracy of scSubtype, we performed “true bulk” whole transcriptome RNA-Seq on 18 matching tumours in our scRNA-Seq cohort. As scSubtype does not include a Normal-like subtype, the two tumours called as Normal-like by RNA-Seq were not included in the comparison. We observed concordance between the majority scSubtype cell calls and the overall bulk tumour FFPE RNA-Seq profile in 12 of the remaining 16 BrCa, including 7 of the 8 matching training set tumours. We also clustered the true bulk RNA-Seq data with TCGA and confirmed that the true bulk clustered with the pseudo-bulk profiles for 14 of 18 samples (
scSubtype revealed that 13 of 20 samples had fewer than 90% of neoplastic cells falling under one molecular subtype, while only one tumour (CID3921; HER2E) was composed of neoplastic cells with a completely homogeneous molecular subtype (
To further support the validity of scSubtype, we calculated the degree of epithelial cell differentiation (DScore) and proliferation, both of which are independently associated with the molecular intrinsic subtype of each tumour cell (
We investigated the biological pathways driving intra-tumour transcriptional heterogeneity (ITTH) in an unsupervised manner, using integrative clustering of tumours with at least 50 neoplastic cells to generate 574 gene-signatures of ITTH. Across all tumours, we used these gene-signatures to identify 7 robust groups, “gene-modules”, based on their Jaccard similarity (
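As an illustration of grouping gene-signatures by Jaccard similarity, the sketch below computes the pairwise Jaccard index between gene sets and then merges signatures whose similarity meets a threshold. The threshold-based grouping is a simple stand-in chosen for brevity and is not the clustering procedure used to derive the 7 gene-modules; the function names and the toy signatures are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two gene collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(signatures):
    """Return (names, matrix) of pairwise Jaccard similarities."""
    names = list(signatures)
    matrix = [[jaccard(signatures[x], signatures[y]) for y in names] for x in names]
    return names, matrix


def group_by_threshold(signatures, threshold):
    """Group signatures linked by similarity >= threshold (connected components)."""
    names, sim = similarity_matrix(signatures)
    assigned = set()
    groups = []
    for i, name in enumerate(names):
        if name in assigned:
            continue
        assigned.add(name)
        stack, members = [i], []
        while stack:
            k = stack.pop()
            members.append(names[k])
            for j, other in enumerate(names):
                if other not in assigned and sim[k][j] >= threshold:
                    assigned.add(other)
                    stack.append(j)
        groups.append(sorted(members))
    return groups


# Toy gene-signatures with placeholder gene names.
toy = {"s1": ["A", "B", "C"], "s2": ["B", "C", "D"], "s3": ["X", "Y"]}
```

Here `jaccard(toy["s1"], toy["s2"])` is 2 shared genes over 4 distinct genes, so the two signatures are merged at a threshold of 0.5 while `s3` remains on its own.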
Gene-set enrichment identified a number of shared and distinct functional features of these GMs (
For each neoplastic cell, we calculated signature scores for each of the 7 GMs and used hierarchical clustering to identify correlations between cells (
Immune checkpoint inhibitors have revolutionized cancer therapy but have shown minimal efficacy for the treatment of BrCa, mostly restricted to TNBC. To examine the BrCa immune milieu at high resolution, we reclustered immune cells to identify T cells and innate lymphoid cells (
A total of 18 T-cell and innate lymphoid clusters were identified based on RNA expression, which were detected in the majority of cases (
We identified five CD8+ T-cell clusters (c4, c5, c7, c8 and c17), two of which were specific to individual tumours (c5, c17). The remaining three were exhausted tissue resident memory (TRM) CD8+ T-cells expressing high levels of inhibitory checkpoint molecules including LAG3, PDCD1 and TIGIT (CD8+ T-cells:LAG3/c8), TRM PDCD1low CD8+ T-cells that expressed relatively high levels of IFNG and TNF (CD8+ T-cells:IFNG/c7), and CD8+ effector memory (EM) chemokine expressing T-cells (CD8+ T-cells:ZFP36/c4) (
TNBC have more TILs in general and CD8+ T-cells in particular. We also observed that T cell clusters IFIT1/c6, LAG3/c8 and MKI67/c11 made up a higher proportion of T cells in TNBC samples compared to other subsets (
When we reclustered B cells, we observed two major subclusters (naive and memory), with plasmablasts forming a separate cluster (
Myeloid cells formed 13 clusters which could be identified in all tumours at varying frequencies, with the exception of macrophage cluster 5, which was mostly limited to an individual tumour (
Macrophages formed 6 clusters, including a cluster (Mac:CXCL10/c9) with features previously associated with an “M1-like” phenotype and two clusters (Mac:EGR1/c10 and Mac:SIGLEC1/c5) resembling the “M2-like” phenotype, all of which bear some resemblance to TAMs previously described in BrCa (
The stromal cell types and subclasses present in human BrCa are yet to be profiled at high resolution and across clinical subtypes. We identified three major mesenchymal cell types including CAFs (PDGFRA and COL1A1), perivascular-like cells (PVL; MCAM/CD146, ACTA2 and PDGFRB), endothelial cells (PECAM1/CD31 and CD34), and two smaller clusters of lymphatic endothelial cells (LYVE1) and cycling PVL cells (MKI67) (
Trajectory analysis revealed five CAF states with two distinct branch points (
Previously reported pancreatic ductal adenocarcinoma (PDAC) CAF signatures20, defined by iCAFs and myCAFs, were predominantly enriched in CAF s1 and s5, respectively (
Trajectory analysis revealed three main PVL states with a single branch point (
The expression of these genes decreased along the pseudotime trajectory as cells transitioned to PVL s3, which was enriched for contractile related genes including MYH11 and ACTA2 (
Interestingly, all PVL states were also modestly enriched for PDAC myCAF gene signatures, suggesting that they have shared transcriptional features related to contractility (
Endothelial cells sub-clustered into three pseudotime states with one distinct branch point (
To determine whether stromal states were unique to the TME, we performed scRNA-Seq on three normal breast tissue samples and were surprised to find that no clusters or cell states were unique to disease status or subtypes (
Deconvolution of Breast Cancer Cohorts Reveals Nine Ecotypes Associated with Patient Survival
Our single cell data has generated a draft cellular taxonomy of BrCa, with at least three tiers of cell types and states (Major, Minor and Subset;
We deconvoluted all primary breast tumour datasets in the METABRIC cohort. Supporting the validity of the predictions (and the scSubtype signatures), we observed significant enrichment (Wilcoxon test, p<2.2e-16) of the four scSubtypes (Basal_SC, HER2E_SC, LumA_SC and LumB_SC) in tumours with matching bulk-PAM50 classifications and significant enrichment (Wilcoxon test, p<2.2e-16) of cycling cells in Basal, LumB and HER2E tumours (
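The enrichment comparison above can be illustrated with a two-sample Wilcoxon rank-sum test on deconvoluted cell-type fractions. The sketch below is an assumption-laden illustration: it uses the normal approximation without continuity or tie correction, the function name `rank_sum_test` is hypothetical, and the abundance values are invented toy numbers rather than METABRIC results.

```python
import math


def _ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (U statistic for x, two-sided p-value). No continuity or tie
    correction is applied, so p-values are approximate for small samples.
    """
    n1, n2 = len(x), len(y)
    ranks = _ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return u1, p


# Toy predicted fractions of one cell type in matching vs non-matching tumours.
matching = [0.80, 0.90, 0.70, 0.85, 0.95]
non_matching = [0.10, 0.20, 0.15, 0.30, 0.25]
u_stat, p_value = rank_sum_test(matching, non_matching)
```

With cohorts of thousands of tumours, a clear separation between groups yields p-values below the smallest value that common statistical software reports, which is how a bound such as p<2.2e-16 typically arises.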
We next investigated the prognostic differences between all ecotypes (
To further assess the robustness of the ecotypes, we repeated the consensus clustering using only the 32 significantly correlated cell-types, as well as the DWLS method. We observed substantial overlap of tumours (Table 4 and Table 5) and ecotype features (
Finally, we investigated the association between ecotypes and the integrative genomic clusters (int-clusters) identified by METABRIC (
We used deconvolution to define nine ecotypes amongst thousands of primary breast cancers. Interestingly, clustering of most ecotypes is driven by cells spanning the major lineages (epithelial, immune and stromal), features not captured by previous studies that stratified disease based on mass cytometry primarily using immune markers. Integration of our data with these datasets is an important future direction for the field. While ecotypes were partially associated with intrinsic subtype and genomic classifiers, they are not simply surrogates for previous stratification methods. Future work will investigate the molecular mechanisms organizing tissue architecture and tumour ecotypes, aiming to explain their differences in clinical outcome and examine whether tumour ecotypes can be used to personalise therapy.
Number | Date | Country | Kind |
---|---|---|---|
2021901939 | Jun 2021 | AU | national |