This disclosure relates generally to biomarkers, and more particularly to systems, devices, and methods for constructing and using biomarkers.
The treatment of early luminal (estrogen receptor positive) breast cancer is both a major success story and an ongoing clinical challenge. Targeted anti-endocrine therapies have significantly reduced mortality over the last 30-40 years [1,2], but luminal disease still leads to the majority of deaths from early breast cancer. To address this urgent clinical need, research has focused on improving anti-endocrine therapies (e.g. third-generation aromatase inhibitors) [2] and on generating a plethora of “prognostic markers” to personalize risk stratification for luminal breast cancer patients [3]. These strategies have led to a statistically significant, but clinically modest, improvement in outcome [2,3].
More broadly, human disease is complex, caused by the interaction of genetic, epigenetic and environmental insults. These interactions allow a specific disease phenotype to arise in many different ways, with a far greater diversity of molecular underpinnings than phenotypic consequences. Molecular heterogeneity within a disease is believed to underlie poor clinical trial results for some therapies [43] and the poor performance of many genome-wide association studies [44-46].
A new solution is thus needed for overcoming the shortfalls of the solutions currently available in the market in respect of not just early luminal (estrogen receptor positive) breast cancer, but also a wider range of diseases and other phenotypes.
In an aspect, there is disclosed a method of prognosing or classifying a patient using a biomarker comprising a plurality of subnetwork modules, said method comprising: determining an activity of a plurality of genes in a test sample of the patient, said plurality of genes associated with the plurality of subnetwork modules; constructing an expression profile using the activity of the plurality of genes; determining dysregulation of each of the plurality of subnetwork modules by calculating a score proportional to a degree of dysregulation in each of the plurality of subnetwork modules from said expression profile; prognosing or classifying the patient by: inputting each dysregulation score into a model for predicting patient outcomes for patients having a disease, the model trained with a plurality of reference dysregulation scores and a plurality of reference clinical indicators; and inputting a clinical indicator of the patient into the model to obtain a risk associated with the disease.
In another aspect, there is disclosed a method of prognosing or classifying a patient comprising: determining mRNA abundance using a sample of a breast cancer tumour of the patient for the group of genes comprising: GSK3B, AKT1S1, RHEB, TSC1, TSC2, RPS6KB1, RPTOR, MTOR, RICTOR, ERBB2, MKI67, ESR1 and PGR, each of said genes associated with at least one node of the PIK3 cell signalling pathway; constructing an expression profile from the mRNA abundance; comparing said expression profile to a plurality of reference expression profiles and comparing clinical indicators of the patient to a plurality of reference clinical indicators, wherein the clinical indicators comprise N-stage and tumour size, and wherein each of the plurality of reference expression profiles and each of the reference clinical indicators are associated with a predetermined residual risk of breast cancer; and selecting the reference expression profile most similar to the expression profile and the reference clinical indicators most similar to the patient clinical indicators, to obtain a residual risk associated with breast cancer.
In yet another aspect, there is disclosed a computer-implemented method of prognosing or classifying a patient using a biomarker comprising a plurality of subnetwork modules, said method comprising: storing, in electronic memory, a model for predicting patient outcomes for patients having a disease, the model trained with a plurality of reference dysregulation scores and a plurality of reference clinical indicators; receiving, at at least one processor, data reflecting an activity of a plurality of genes in a test sample of the patient, said plurality of genes associated with the plurality of subnetwork modules; constructing, at the at least one processor, an expression profile using the data reflecting the activity of the plurality of genes; determining, at the at least one processor, dysregulation of each of the plurality of subnetwork modules by calculating a score proportional to a degree of dysregulation in each of the plurality of subnetwork modules from said expression profile; prognosing or classifying, at the at least one processor, the patient by: inputting each dysregulation score into the model; and inputting a clinical indicator of the patient into the model to obtain a risk associated with the disease.
In one aspect, there is disclosed a computer-implemented method of prognosing or classifying a patient, the method comprising: receiving, at at least one processor, data reflecting mRNA abundance determined using a sample of a breast cancer tumour of the patient for the group of genes comprising: GSK3B, AKT1S1, RHEB, TSC1, TSC2, RPS6KB1, RPTOR, MTOR, RICTOR, ERBB2, MKI67, ESR1 and PGR, each of said genes associated with at least one node of the PIK3 cell signalling pathway; constructing, at the at least one processor, an expression profile from the data reflecting mRNA abundance; comparing, at the at least one processor, said expression profile to a plurality of reference expression profiles and comparing clinical indicators of the patient to a plurality of reference clinical indicators, wherein the clinical indicators comprise N-stage and tumour size, and wherein each of the plurality of reference expression profiles and each of the reference clinical indicators are associated with a predetermined residual risk of breast cancer; and selecting, at the at least one processor, the reference expression profile most similar to the expression profile and the reference clinical indicators most similar to the patient clinical indicators, to obtain a residual risk associated with breast cancer.
In one aspect, there is disclosed a device for prognosing or classifying a patient using a biomarker comprising a plurality of subnetwork modules, the device comprising: at least one processor; and electronic memory in communication with the at least one processor, the electronic memory storing: a model for predicting patient outcomes for patients having a disease, the model trained with a plurality of reference dysregulation scores and a plurality of reference clinical indicators; and processor-executable code that, when executed at the at least one processor, causes the at least one processor to: receive data reflecting an activity of a plurality of genes in a test sample of the patient, said plurality of genes associated with the plurality of subnetwork modules; construct an expression profile using the data reflecting the activity of the plurality of genes; determine dysregulation of each of the plurality of subnetwork modules by calculating a score proportional to a degree of dysregulation in each of the plurality of subnetwork modules from said expression profile; prognose or classify the patient by: inputting each dysregulation score into the model; and inputting a clinical indicator of the patient into the model to obtain a risk associated with the disease.
In another aspect, there is disclosed a device for prognosing or classifying a patient, the device comprising: at least one processor; and electronic memory in communication with the at one processor, the electronic memory storing processor-executable code that, when executed at the at least one processor, causes the at least one processor to: receive data reflecting mRNA abundance determined using a sample of a breast cancer tumour of the patient for the group of genes comprising: GSK3B, AKT1S1, RHEB, TSC1, TSC2, RPS6KB1, RPTOR, MTOR, RICTOR, ERBB2, MKI67, ESR1 and PGR, each of said genes associated with at least one node of the PIK3 cell signalling pathway; construct an expression profile from the data reflecting mRNA abundance; compare said expression profile to a plurality of reference expression profiles and comparing clinical indicators of the patient to a plurality of reference clinical indicators, wherein the clinical indicators comprise N-stage and tumour size, and wherein each of the plurality of reference expression profiles and each of the reference clinical indicators are associated with a predetermined residual risk of breast cancer; and select the reference expression profile most similar to the expression profile and the reference clinical indicators most similar to the patient clinical indicators, to obtain a residual risk associated with breast cancer.
In another aspect, there is disclosed a method of treating a patient, comprising: determining the disease relapse risk of the patient according to the methods disclosed herein; and selecting a treatment based on the disease relapse risk, and preferably treating the patient according to the treatment.
In yet another aspect, there is disclosed a computer-implemented method of constructing a biomarker for a biological state of a given type, the method comprising: maintaining an electronic datastore storing: a plurality of subnetwork records, each comprising data reflecting one of a plurality of subnetwork modules of biological pathways; and a plurality of patient records, each comprising data reflecting molecular aberration measured for one of a plurality of patients of the biological state, and data reflecting a patient state for that patient; processing, at at least one processor, the subnetwork records and the patient records to assign, to each of the plurality of subnetwork modules, a score proportional to a degree of dysregulation in that subnetwork module; ranking, at the at least one processor, the plurality of subnetwork modules according to score assigned to each of the plurality of subnetwork modules; and upon said ranking, selecting, at the at least one processor, the biomarker as comprising a subset of the plurality of subnetwork modules.
In one aspect, there is disclosed a computer-implemented method of identifying a dysregulated subnetwork module of a biological pathway causing a biological state of a given type, the method comprising: maintaining an electronic datastore storing: a plurality of subnetwork records, each comprising data reflecting one of a plurality of subnetwork modules of biological pathways; and a plurality of patient records, each comprising data reflecting molecular aberration measured for one of a plurality of patients of the biological state, and data reflecting a patient state for that patient; processing, at at least one processor, the subnetwork records and the patient records to assign, to each of the plurality of subnetwork modules, a score proportional to a degree of dysregulation in that subnetwork module; identifying, at the at least one processor, from the scores, the dysregulated subnetwork module from amongst the plurality of subnetwork modules.
In yet another aspect, there is disclosed a device for constructing a biomarker for a biological state of a given type, the device comprising: at least one processor; and electronic memory in communication with the at least one processor, the electronic memory storing: a plurality of subnetwork records, each comprising data reflecting one of a plurality of subnetwork modules of biological pathways; a plurality of patient records, each comprising data reflecting molecular aberration measured for one of a plurality of patients of the biological state, and data reflecting a patient state for that patient; and processor-executable code that, when executed at the at least one processor, causes the at least one processor to: process the subnetwork records and the patient records to assign, to each of the plurality of subnetwork modules, a score proportional to a degree of dysregulation in that subnetwork module; rank the plurality of subnetwork modules according to score assigned to each of the plurality of subnetwork modules; and upon said ranking, select the biomarker as comprising a subset of the plurality of subnetwork modules.
In one aspect, there is disclosed a device for identifying a dysregulated subnetwork module of a biological pathway causing a biological state of a given type, the device comprising: at least one processor; and electronic memory in communication with the at least one processor, the electronic memory storing a plurality of subnetwork records, each comprising data reflecting one of a plurality of subnetwork modules of biological pathways; a plurality of patient records, each comprising data reflecting molecular aberration measured for one of a plurality of patients of the biological state, and data reflecting a patient state for that patient; and processor-executable code that, when executed at the at least one processor, causes the at least one processor to: process the subnetwork records and the patient records to assign, to each of the plurality of subnetwork modules, a score proportional to a degree of dysregulation in that subnetwork module; identify from the scores, the dysregulated subnetwork module from amongst the plurality of subnetwork modules.
In another aspect, there is disclosed a system comprising: a first device for prognosing or classifying a patient using a biomarker comprising a plurality of subnetwork modules; a second device for constructing a biomarker for a biological state of a given type, the device comprising; and wherein the biomarker of the first device is a biomarker constructed by the second device.
In the drawings, embodiments are illustrated by way of example. It is to be expressly understood that the description and drawings are only for the purpose of illustration and as an aid to understanding, and are not intended as a definition of the limits of the invention.
Embodiments will now be described, by way of example only, with reference to the attached figures, wherein:
As a consequence of the complexity of human disease, disease researchers face two pressing challenges. First, molecular markers are needed to personalize and optimize treatment decisions by predicting patient outcome (prognosis) and response to therapy. Second, the clinical heterogeneity in patient outcome needs to be molecularly rationalized to allow direct targeting of the mechanistic underpinnings of disease. For example, if a single pathway is being dysregulated in multiple ways, drugs targeting that pathway as a whole could be developed. Further, there is a need for improved ways to detect or predict various other aspects of patient state such as disease type, disease subtype, cancer type, cancer subtype, disease state, or the like.
Conventionally, most validated multigene tests for residual risk prediction in breast cancer were generated using genome-wide analysis of mRNA data and are strongly driven by proliferation [5]. They provide similar and modest clinical utility [6, 7], do not identify key pathways for targeted therapeutics and do not inform patients or clinicians on the optimal therapeutic approach. One alternative is to use key signaling pathways to improve the accuracy of multi-parameter tests for residual risk prediction and to stratify patients into trials of targeted molecular therapeutics. The PIK3CA signalling pathway represents a robust candidate for this approach as it is frequently dysregulated in multiple cancer types [8], including breast cancer [9-12]. Mutations in PIK3CA are present in almost 40% of luminal breast cancers [8, 9, 13, 14] and drugging of the PIK3CA/mTOR pathway is a promising approach for advanced breast cancer [15]. Nonetheless, to date mutational analysis of the PIK3CA pathway has not enabled molecular targeting of existing agents, nor have key mechanistic events been identified in primary patients to focus drug development on specific pathway components [16-19].
In an aspect, this disclosure provides novel molecular markers and methods of prognosing or classifying a patient using such molecular markers.
For example, targeted molecular profiling was performed of the PIK3CA pathway in a multinational phase III clinical trial. These data allowed for the development and validation of a novel residual risk signature that out-performs a clinically-validated test.
In other aspects, the residual risk signature and associated methods developed in respect of breast cancer may be modified to provide prognostic signatures for a multitude of diseases, including colon, ovarian and lung cancers, and other biological states.
In another aspect, this disclosure also provides methods of using the novel breast cancer signature to stratify patients for trials targeting PIK3CA signaling nodes. More generally, this disclosure provides methods of using the signatures detailed herein to stratify patients for particular trials/treatments that target particular pathways and/or particular nodes/edges of those pathways.
In a further aspect, a subnetwork-based approach is provided that can use arbitrary molecular data types to identify one or more dysregulated pathways and to create functional biomarkers for a variety of biological states (e.g., phenotypes, diseases of a given type, cancers of a given type, etc.).
In a yet further aspect, a subnetwork-based approach is used to identify one or more dysregulated pathways in order to stratify patients for trials/treatments that target those pathways or particular nodes/edges of those pathways.
In this disclosure, the terms “pathways” and “biological pathways” are used broadly to refer to cellular signaling pathways, extra-cellular signaling pathways, or other biological functional units such as protein complexes. “Pathways” or “biological pathways” may also refer to interaction amongst or between intra-cellular and/or extra-cellular molecules.
While there are several well-studied complex diseases, including Alzheimer's, schizophrenia and diabetes, examples are provided herein for cancer, as it is among the most heterogeneous complex disease [63, 64]. Patients with the same cancer type have highly variable outcome [65], response to therapy [66] and mutational profiles [67, 68]. Studies across multiple cancer types provide strong evidence that cancer mutations are often exclusive: exactly one gene in a pathway is dysregulated, leading to a common phenotype [69]. We validate the ability of our approach, called SIMMS, by using it to create prognostic models in cohorts of 4,096 breast, 517 colon, 749 lung and 1,303 ovarian cancer patients profiled with a diverse range of molecular assays.
As depicted, device 10 and device 20 may be interconnected by a network 30. When so interconnected, these devices may operate in concert to construct a biomarker for a given biological state, and then use that biomarker to perform prognosis and/or classifications of patients. In particular, biomarkers constructed by device 10 may be transferred to device 20, and used at device 20 to perform prognosis/classification in manners detailed herein. Of course, biomarkers constructed by device 10 may also be transferred to device 20 in other ways, e.g., by way of suitable computer storage/transport media (e.g., disks).
Processor 100 may be any type of processor, such as, for example, any type of general-purpose microprocessor or microcontroller (e.g., an Intel™ x86, PowerPC™, ARM™ processor, or the like), a digital signal processing (DSP) processor, an integrated circuit, a field programmable gate array (FPGA), or any combination thereof.
Memory 102 may include a suitable combination of any type of computer memory that is located either internally or externally such as, for example, random-access memory (RAM), read-only memory (ROM), compact disc read-only memory (CDROM), electro-optical memory, magneto-optical memory, erasable programmable read-only memory (EPROM), and electrically-erasable programmable read-only memory (EEPROM), or the like. Portions of memory 102 may be organized using a conventional filesystem, controlled and administered by an operating system governing overall operation of device 10.
I/O interfaces 104 enable device 10 to interconnect with input and output devices. For example, I/O interfaces 104 may enable device 10 to interconnect with other input/output devices such as a keyboard, mouse, display, storage device, or the like.
Network interfaces 106 enable device 10 to communicate with other devices by connecting to one or more networks such as network 30 (
Operating system 140 may be a conventional operating system. For example, operating system 140 may be a Microsoft Windows™, Unix™, Linux™, OSX™ operating system or the like. Operating system 140 allows patient prognosis/classification application 150 and other applications at device 10 to access the hardware components of device 10 (e.g., processors 100, memory 102, I/O interfaces 104, network interfaces 106).
Data storage engine 142 allows operating system 140 and applications at device 10 to read from and write to datastore 144. Datastore 144 may be a conventional relational database such as a MySQL™, Microsoft™ SQL, Oracle™ database, or the like. So, data storage engine 142 may be a conventional relational database engine. Datastore 144 may also be another type of database such as, for example, an objected-oriented database or a NoSQL database, and data storage engine 142 may be a database engine adapted to read from and write to such other types of databases. Datastore 144 may reside in memory 102.
In some embodiments, datastore 144 may also simply be a collection of files stored and organized in memory 102. In such embodiments, data storage engine 142 may be omitted.
Datastore 144 may store a plurality of subnetwork records, each including data reflecting one of a plurality of subnetwork modules of one or more biological pathways.
Datastore 144 may also store a plurality of patient records, each including data reflecting molecular aberration measured for one of a plurality of patients of a biological state of a given type. The molecular aberration may include at least one of genomic aberration, epigenomic aberration, transcriptomic aberration, proteomic aberration, and metabolic aberration. More specifically, the molecular aberration may include at least one of somatic point mutation, small indel, mRNA abundance, somatic or germline copy-number status, somatic or germline genomic rearrangements, metabolite abundance, protein abundance, and DNA methylation.
Datastore 144 may also store a plurality of pathway records, each identifying a biological pathway associated with one of the plurality of subnetwork modules.
The records of datastore 144 may be populated by data retrieved from data repositories interconnected to device 10 by way of network interface 106, or by data inputted at device 10 through one of I/O interfaces 104.
As detailed herein, biomarker/pathway identification application 150 may be configured to implement the SIMMS approach detailed herein. As such, application 150 may also be referred to as “SIMMS” herein, or an application implementing “SIMMS”.
So, application 150 may be configured to implement methods of constructing a biomarker for a biological state of a given type, where the biomarker is selected as including a subset of a plurality of subnetwork modules. Application 150 may be also configured to implement methods of identifying a dysregulated subnetwork module of a biological pathway causing a biological state of a given type.
Each of these components may be implemented in a high-level programming language (e.g., a procedural language, an object-oriented language, a scripting language, or any combination thereof). For example, each of these components may be implemented using C, C++, C#, Perl, Java, or the like. Each of these components may also be implemented in assembly or machine language. Each of the components may be in the form of an executable program, a script, a statically linkable library, or a dynamically linkable library.
In a particular embodiment, one or more of the components of application 150 may be implemented in the R programming language.
Data preprocessing component 152 is configured to preprocess (e.g. normalize) data reflecting measurements of molecular aberrations. Data may be normalized by one or more of a plurality of methods, including using algorithmic controls or experimental controls. For example, with respect to experimental controls, data may be normalized with reference to corresponding data collected from a patient or a plurality of patients and stored in datastore 144. For example, mRNA abundance of a given set of genes of a patient may be normalized with reference to mRNA abundance of the same set of genes obtained from a sample of one or more different samples of the patient, or alternatively samples obtained from one or more different patients. mRNA abundance for a patient may also be normalized with reference to mRNA abundance of one or more specific control genes (i.e., reference genes) of the same patient, or one or more different patients (i.e., a reference patient), said control genes may be different to those being assessed for purposes of constructing a biomarker or prognosing/classifying a patient. Alternatively, the data may be normalized using an algorithmic control to mathematically manipulate data to remove noise, reduce variance and make data comparable across multiple experimental cohorts. Algorithmic controls may also enable normalization with reference to external data sets.
Module scoring component 154 is configured to process the subnetwork records and the patient records in datastore 144 to assign, to each of the subnetwork modules, a score proportional to a degree of dysregulation in that subnetwork module.
Module ranking component 156 is configured to rank the subnetwork modules according to their assigned scores.
Module selection component 158 is configured to select, as a biomarker, a subset of the subnetwork modules.
As detailed in the examples below, module selection component 158 may be configured to perform this selection by applying backward variable elimination. Module selection component 158 may also be configured to perform this selection by applying forward variable selection.
In some embodiments, module selection component 158 may be configured to select the biomarker such that the subnetwork modules in the subset of the plurality of subnetwork modules belong to one biological pathway.
Model construction component 160 is configured to a construct model for predicting patient states, where the model includes a selected subset of subnetwork modules.
In the examples detailed below, a Cox proportional hazards model is constructed by model construction component 160. However, model construction component 160 may also be configured to construct other types of models for predicting patient state, such as, a general linear model, a random forest model, a support vector machine model, a k-nearest neighbour model, a naïve Bayes model, or the like.
Module/pathway identification component 162 is configured to identify from the calculated scores a dysregulated subnetwork module.
These components of application 150 (or a subset thereof) may cooperate to implement methods detailed herein.
In particular, they may implement a method of constructing a biomarker for a biological state of a given type. The method including: maintaining an electronic datastore (e.g., datastore 144) storing: a plurality of subnetwork records, each comprising data reflecting one of a plurality of subnetwork modules of biological pathways; and a plurality of patient records, each comprising data reflecting molecular aberration measured for one of a plurality of patients of the biological state, and data reflecting a patient state for that patient. The method also includes processing (e.g., by module scoring component 154), at least one processor (e.g., processors 100), the subnetwork records and the patient records to assign, to each of the plurality of subnetwork modules, a score proportional to a degree of dysregulation in that subnetwork module. The method also includes ranking (e.g., by module ranking component 156), at the at least one processor, the plurality of subnetwork modules according to score assigned to each of the plurality of subnetwork modules; and upon said ranking, selecting (e.g., by module selection component 158), at the at least one processor, the biomarker as comprising a subset of the plurality of subnetwork modules.
The method may also include constructing (e.g., by model construction component 160), at the at least one processor, a model for predicting patient states for patients of the biological state, the model comprising the selected subset of the plurality of subnetwork modules.
The method may also include preprocessing (e.g., by data preprocessing component 152) the data reflecting molecular aberration, e.g., to normalize the data.
The components of application 150 (or a subset thereof) may also cooperate to implement a method of identifying a dysregulated subnetwork module of a biological pathway causing a biological state of a given type. The method including: maintaining an electronic datastore (e.g., datastore 144) storing: a plurality of subnetwork records, each comprising data reflecting one of a plurality of subnetwork modules of biological pathways; and a plurality of patient records, each comprising data reflecting molecular aberration measured for one of a plurality of patients of the biological state, and data reflecting a patient state for that patient. The method also includes processing (e.g., by module scoring component 154), at at least one processor, the subnetwork records and the patient records to assign, to each of the plurality of subnetwork modules, a score proportional to a degree of dysregulation in that subnetwork module. The method also includes identifying (e.g., by module/pathway identification component 162), at the at least one processor, from the scores, the dysregulated subnetwork module from amongst the plurality of subnetwork modules.
In some embodiments, said identifying comprises identifying a plurality of dysregulated subnetwork modules from amongst the plurality of subnetwork modules.
The method may also include maintaining in the electronic datastore a plurality of pathway records, each identifying a biological pathway associated with one of the plurality of subnetwork modules, and processing (e.g., by module/pathway identification component 162), at the at least one processor, the pathway records to identify a biological pathway associated with the dysregulated subnetwork module.
The method may also include preprocessing (e.g., by data preprocessing component 152) the data reflecting molecular aberration, e.g., to normalize the data.
I/O interfaces 204 enable device 20 to interconnect with input and output devices. For example, device 20 may be configured to receive patient data (e.g., mRNA abundance data) from an interconnected assay device, for example a gel electrophoresis device configured for northern blotting, a device configured for quantitative polymerase chain reaction (qPCR) or reverse transcriptase quantitative polymerase chain reaction (RT-qPCR), a hybridization microarray, a device configured for serial analysis of gene expression (SAGE), or a device configured for RNA Seq or Whole Transcriptome Shotgun Sequencing (WTSS), by way of I/O interface 204. I/O interfaces 204 also enable device 20 to interconnect with other input/output devices such as a keyboard, mouse, display, or the like.
Network interfaces 206 enable device 20 to communicate with other devices by connecting to one or more networks such as network 30 (
Operating system 240 may be substantially similar to operating system 140. Operating system 240 allows biomarker/pathway identification application 250 and other applications at device 20 to access the hardware components of device 20 (e.g., processors 200, memory 202, I/O interfaces 204, network interfaces 206).
Data storage engine 242 may be substantially similar to data storage engine 142. Data storage engine 242 allows operating system 240 and applications at device 20 to read from and write to datastore 244.
Datastore 244 may store data reflective of measurements of molecular aberrations (e.g., mRNA abundance) obtained from a test sample, to be processed by application 150 in manners detailed below. Datastore 244 may also store one or more biomarkers to be used by application 250 in manners detailed below. Such biomarkers may be biomarkers constructed by biomarker construction/pathway identification device 10, and received therefrom.
The records of datastore 244 may be populated by data retrieved from data repositories interconnected to device 20 by way of network interface 206, or by data inputted at device 20 through one of I/O interfaces 204.
As detailed herein, patient prognosis/classification application 250 may be configured to perform prognosis and/or classification of patients using a biomarker for a given biological state, where the biomarker comprises a plurality of subnetwork modules.
Each of these components may be implemented in any of the manners and take any of the forms described above for the components of application 150.
Data preprocessing component 252 is configured to perform preprocessing (e.g., normalization) on data reflecting activity of a plurality of genes obtained from a test sample.
Activity level determination component 254 is configured to determine an activity of a plurality of genes in a test sample of the patient.
Expression profile construction component 256 is configured to construct an expression profile by processing the data reflecting activity of a plurality of genes.
Dysregulation scoring component 258 is configured to process an expression profile to calculate scores proportional to a degree of dysregulation in a given subnetwork module.
Risk evaluation component 260 is configured to process a clinical indicator of the patient to determine a risk associated with the disease. Risk evaluation component 260 may use a model for predicting patient outcomes for patients having a disease, the model trained with a plurality of reference dysregulation scores and a plurality of reference clinical indicators. A trained model may be constructed at device 20 in the manners described herein for model construction component 160. A trained model may also be received at device 20 from device 10.
These components of application 250 (or a subset thereof) may cooperate to implement methods detailed herein.
In particular, they may implement a method of prognosing or classifying a patient using a biomarker comprising a plurality of subnetwork modules. The method including: determining (e.g., by activity level determination component 254), an activity of a plurality of genes in a test sample of the patient, said plurality of genes associated with the plurality of subnetwork modules; constructing (e.g., by expression profile construction component 256) an expression profile using the activity of the plurality of genes; determining (e.g., by dysregulation scoring component 258), dysregulation of each of the plurality of subnetwork modules by calculating a score proportional to a degree of dysregulation in each of the plurality of subnetwork modules from said expression profile; prognosing or classifying (e.g., by risk evaluation component 260) the patient by: inputting each dysregulation score into a model for predicting patient outcomes for patients having a disease, the model trained with a plurality of reference dysregulation scores and a plurality of reference clinical indicators; and inputting a clinical indicator of the patient into the model to obtain a risk associated with the disease.
The method may also include normalizing the activity of the plurality of genes using at least one control by, for example, data preprocessing component 252, in substantially the same manner as data preprocessing component 152, described above.
A risk associated with the disease may refer to the probability or expected probability of a disease occurring or reoccurring in a given patient. This, for example in the context of cancer, may be expressed as distant recurrence free survival or distant metastasis free survival (DRFS), or the length of time after primary treatment ends for a cancer that the patient survives without any signs or symptoms of that cancer, or before death of that patient for any cause. Examples of primary cancer treatments include, but are not limited to, endocrine therapy, chemotherapy, radiotherapy, hormone therapy, surgery, gene therapy, thermal therapy, and ultrasound therapy. However, risk may be associated with diseases other than cancer, and therefore other metrics of risk may be used. For example, risk may be expressed as overall survival (OS), which represents the length of time from either the date of diagnosis or the start of treatment for a disease that patients diagnosed with the disease are still alive.
Alternatively, the risk associated with the disease may be expressed as either a low, medium, and/or high risk of disease relapse, and for example, may correspond to a standard or commonly used risk scoring system, for example the Oncotype DX risk score in respect of cancer. For example, if risk is expressed as either a high or low risk, an Oncotype DX score of under 24.5 for a patient may be designated as low risk for relapse, while a patient's score greater than 24.5 may be designated as high risk for relapse. Low or high risk thresholds may also be modified in accordance with any other standard disease relapse risk scoring system in order to accommodate specific risks associated with any one disease. For example, the risk may also correspond with specific values associated with the MammaPrint gene signature risk scoring system.
Clinical indicators may be any measured or observed pathological or clinical metric of a patient, a patient's tumour, or a metric relating to a molecular marker associated with the patient. Clinical indicators may, in respect of cancer for example, comprise the TNM Classification of Malignant Tumours (TNM), wherein the size and growth of a tumour (T), whether cancer has spread to lymph nodes (N) and whether cancer has spread to different parts of the body (M), is determined and scored. Each of or all of these indicators may be relevant as part of a biomarker. Other cancers may have their own classification systems, or may have different relevant metrics. For example, prostate cancer may be scored using a Gleason score, while lymphoma may be staged using the Ann Arbor staging system. Additional clinical indicators may, for example, be tumour size, tumour location, cancerous cell type (for example, squamous cell or adenocarcinoma in the case of esophageal cancers), or may be levels of a specific molecule (i.e., prostate specific antigen in respect of prostate cancer) measured in, for example, the blood or serum of a patient.
The components of application 250 (or a subset thereof) may also cooperate to implement a method of prognosing or classifying a patient comprising: determining (e.g., by activity level determination component 254) mRNA abundance using a sample of a breast cancer tumour of the patient for the group of genes comprising: GSK3B, AKT1S1, RHEB, TSC1, TSC2, RPS6KB1, RPTOR, MTOR, RICTOR, ERBB2, MKI67, ESR1 and PGR, each of said genes associated with at least one node of the PIK3 cell signalling pathway; constructing (e.g., by expression profile construction component 256) an expression profile from the normalized mRNA abundance; comparing (e.g., by risk evaluation component 260) said expression profile to a plurality of reference expression profiles and comparing clinical indicators of the patient to a plurality of reference clinical indicators, wherein the clinical indicators comprise N-stage and tumour size, and wherein each of the plurality of reference expression profiles and each of the reference clinical indicators are associated with a predetermined residual risk of breast cancer; and selecting the reference expression profile most similar to the expression profile and the reference clinical indicators most similar to the patient clinical indicators, to obtain a residual risk associated with breast cancer.
The method may also include normalizing the activity of the plurality of genes using at least one control by, for example, data preprocessing component 252, in substantially the same manner as data preprocessing component 152, described above.
As used herein, “residual risk” refers to the probability or risk of cancer recurrence in breast cancer patients after primary treatment. Residual risk may, for example, be expressed as distant recurrence free survival or distant metastasis free survival (DRFS), or the length of time in, for example, days, months or years, after primary treatment ends for a cancer that the patient survives without any signs or symptoms of that cancer or before death of that patient for any cause. Examples of primary cancer treatments include, but are not limited to, endocrine therapy, chemotherapy, radiotherapy, hormone therapy, surgery, gene therapy, thermal therapy, and ultrasound therapy.
Referring again to
Biomarker construction/pathway identification device 10 and patient prognosis/classification device 20 are further described with reference to constructing and using an example biomarker for breast cancer. For this example biomarker, each subnetwork module corresponds to a node of a signaling pathway, namely the PIK3CA pathway.
First, biomarker/pathway identification device 10 is configured and operated to construct the breast cancer biomarker. Then, patient prognosis/classification device 20 is configured and operated to use the breast cancer biomarker to perform patient prognosis and classification.
The TEAM trial is a multinational, randomised, open-label, phase III trial in which postmenopausal women with hormone receptor-positive luminal [20] early breast cancer were randomly assigned to receive exemestane (25 mg), once daily or tamoxifen (20 mg) once daily for the first 2.5-3 years followed by exemestane (total of 5 years treatment). This study complied with the Declaration of Helsinki, individual ethics committee guidelines, and the International Conference on Harmonisation and Good Clinical Practice guidelines; all patients provided informed consent. Distant metastasis free survival (DRFS) was defined as time from randomisation to distant relapse or death from breast cancer [20].
The TEAM trial included a well-powered pathology research study of over 4,500 patients from five countries (
At device 10, datastore 144 was populated with patient records created for patients of the TEAM trial cohort.
Five 4 μm formalin-fixed paraffin-embedded (FFPE) sections per case were deparaffinised, tumor areas were macro-dissected and RNA extracted according to Ambion® Recoverall™ Total Nucleic Acid Isolation Kit-RNA extraction protocol (Life Technologies™, Ontario, Canada) except for one change: samples were incubated in protease for 3 hours instead of 15 minutes. RNA samples were eluted and quantified using a Nanodrop-8000 spectrophometer (Delaware, USA). Samples, where necessary, underwent sodium-acetate/ethanol re-precipitation. RNAs extracted from 3,476 samples were successfully analysed.
mRNA Abundance Analysis
Thirty-three genes of interest were selected from the PIK3CA signalling pathway and 6 reference genes. Genes of interest were selected specifically to interrogate key functional nodes within the PIK3CA signalling pathway [24, 25] as shown in
Probes for each gene were designed and synthesised at NanoString® Technologies (Washington, USA). RNA samples (400 ng; 5 μL of 80 ng/4) were hybridised, processed and analysed using the NanoString® nCounter® Analysis System, according to NanoString® Technologies protocols.
At device 10, raw mRNA abundance counts data were pre-processed by data preprocessing component 152, which incorporated the R package NanoStringNorm [26] (v1.1.16), as further detailed below. A range of pre-processing schemes was assessed to identify the most optimal normalisation parameters. (
Univariate survival analysis of processed mRNA abundance data was performed by median-dichotomizing patients into high- and low-risk groups, except for ERBB2 (
4.53 × 10−14
2.19 × 10−11
2.89 × 10−15
IHC4-protein model risk scores were calculated as described by Cuzick et al. and further adjusted for clinical covariates. An IHC4-mRNA model was trained on mRNA abundance profiles of ESR1, PGR, ERBB2 and MKI67 in the training cohort using multivariate Cox proportional hazards modelling (Table 5). Model predictions (continuous risk scores) were grouped into quartiles (
mRNA Network Analysis
The 33 genes were derived from 8 functionally-related modules (
Datastore 144 was populated with subnetwork records created for each of these 8 modules.
At device 10, for each functional module, module scoring component 154 calculated a ‘module-dysregulation score’ (MDS). Module-specific MDSs were subsequently used in multivariate Cox proportional hazards modelling by model construction component 160, adjusted for clinical covariates as above. All models were trained in the training cohort and validated in the fully-independent validation cohort (Table 1) using DRFS truncated to 10 years as an end-point. Recurrence probabilities were estimated as described below. All survival modelling was performed on distant metastasis free survival (DRFS), in the R statistical environment with the survival package (v2.37-4) and model performance compared through area under the receiver operating characteristic (ROC) curve (see below).
Power calculations were performed on complete TEAM cohort (n=3,476; events=507) and for each of the training (n=1,734; events=250) and validation (n=1,742; events=257) subsets separately. Power estimates representing the likelihood of observing a specific HR against the above-mentioned events, (assuming equal-sized patient groups) were derived using the following formula [41]:
where E represents the total number of events (DRFS) and a represents the significance level which was set to 10−3. zpower was calculated for HR ranging from 1 to 3 with steps of 0.01.
mRNA Abundance Data Processing
As noted, raw mRNA abundance counts data were preprocessed by data preprocessing component 152 incorporating the R package NanoStringNorm [15] (v1.1.16). In total, 252 preprocessing schemes were evaluated; spanning normalization with respect to six positive controls, eight negative controls and six housekeeping genes (GUSB, PUM1, SF3A1, TBP, TFRC and TMED10) followed by global normalization (
Patient records in datastore 144 were updated to reflect the data, as preprocessed by data processing component 152.
As will be appreciated, in some embodiments, raw measurements may be used to calculate MDS, and preprocessing may be avoided.
At device 10, predefined functional modules reflected in the subnetwork records in datastore 144 were scored by module scoring component 154 using a two-step process. First, weights (β) of all the genes were estimated by fitting a univariate Cox proportional hazards model (Training cohort only). Second, these weights were applied to scaled mRNA abundance profiles to estimate per-patient module dysregulation score using the following equation:
where n represents the number of genes in a given module and Xi is the scaled (z-score) abundance of gene i. MDS was subsequently used in the multivariate Cox proportional hazards model alongside clinical covariates.
Univariate survival analysis of mRNA abundance data was performed by median-dichotomizing patients into high- and low-risk groups, except for ERBB2 (Table 3). ERBB2 risk groups were determined with expectation-maximization clustering (k=2) using R package mclust (v4.2). Univariate survival analysis of clinical variables was performed by modelling age as binary variable (dichotomized at age≧55), while grade, N-stage and T-stage were modelled as ordinal variables (Table 4). Univariate survival analysis of mutational profiles (AKT1, PIK3CA and RAS; Table 4) was performed by dichotomizing patients into mutant and wild-type groups.
At device 10, MDS profiles (equation 2) of patients in the Training cohort were used to fit a multivariate Cox proportional hazards model alongside clinical variables by processing the patient records and subnetwork records in datastore 144. Through a backwards step-wise refinement algorithm implemented in module selection component 158 following ranking of the modules by module ranking component 156, a module-based risk model containing selected subnetwork modules was created by model construction component 160 (Table 7). The parameters estimated by the multivariate model were applied to the MDS and clinical profiles of patients in the Validation cohort to generate per-patient risk score. These risk scores (continuous) were grouped into quartiles using the thresholds derived from the Training cohort, and resulting groups were subsequently evaluated through Kaplan-Meier analysis.
At device 20, the biomarker comprising the selected subnetwork modules may be used by patient prognosis/classification application to perform patient prognosis/classification. In particular, application 250 may use the model generated by model construction component 160 to predict patient outcomes. For example, for a given patient with mRNA abundance profile of genes underlying modules in Table 7, MDS can be calculated (equation 2) by dysregulation scoring component 258, then a risk score estimate can be generated by risk evaluation component 260 from the MDS and clinical data to predict the likelihood of relapse using the model in
More generally, application 250 may implement methods to determine (e.g., by activity level determination component 254), an activity of a plurality of genes in a test sample of the patient, said plurality of genes associated with the plurality of predetermined subnetwork modules. Activity of the genes contained in the biomarker, as described above, may be determined, for example, using mRNA abundance of the genes. mRNA abundance may, for example, be measured using a qPCR or RT-qPCR device which may be interconnected with device 20 by way of an I/O interface 204.
Application 250 may also implement methods to construct (e.g., by expression profile construction component 256) an expression profile of the patient using the determined activity of the plurality of genes. The expression profile may be a data structure, said structure comprising entries, wherein each entry comprises the mRNA abundance data of each of the genes comprising the biomarker for the patient. However, the expression profile may alternatively comprise data corresponding to activity measured, for example, according to one or more of somatic point mutation, small indel, somatic copy-number status, germline copy-number status, somatic genomic rearrangements, germline genomic rearrangements, metabolite abundances, protein abundances and DNA methylation.
The dysregulation of each of the plurality of subnetwork modules for the patient may be calculated by dysregulation scoring component 258 in substantially the same fashion as module scoring component 154, assigning to each of the plurality of subnetwork modules a score proportional to a degree of dysregulation in that subnetwork module based on the patient's expression profile.
Prognosing or classifying the patient may be performed by risk evaluation component 260 implementing the following: inputting each dysregulation score into a model for predicting patient outcomes for patients having a disease, the model trained with a plurality of reference dysregulation scores and a plurality of reference clinical indicators; and inputting a clinical indicator of the patient into the model to obtain a risk associated with the disease, which is described in more detail above.
The IHC4-RNA model was trained on mRNA abundance profiles of ESR1, PGR, ERBB2 and MKI67 in the Training cohort using a multivariate Cox proportional hazards model (Table 5). The model parameters learnt through fitting the multivariate Cox proportional hazards model were subsequently applied to the mRNA abundance profiles of the above-mentioned four genes in the Validation cohort to generate per-patient risk score. These risk scores (continuous) were grouped into quartiles. These groups were evaluated using Kaplan-Meier analysis and multivariate Cox proportional hazards model adjusted for age (binary variable dichotomized at age 55), N-stage (ordinal), tumour size (continuous variable) and grade (ordinal variable). The IHC4-protein model was calculated as described by Cuzick et al [42]. All models were trained and validated using DRFS truncated to 10 years as an end-point.
Recurrence probabilities at 5 years were estimated by binning the predicted risk-scores in 25 equal groups. For each group, recurrence probability R(t) was estimated as 1-S(t), where S(t) is the Kaplan-Meier survival estimate at year 5. The R(t) estimates of 25 groups were smoothed using local polynomial regression fit. The predicted estimates were plotted against the median risk score of each group except the first and last group, where the lowest risk score and 99th percentile were used, respectively. All survival modelling was performed in the R statistical environment (R package: survival v2.37-4).
Performance of survival models was compared through area under the receiver operating characteristic (ROC) curve. Significance of difference between the ROC curves was assessed through permutation analysis (10,000 permutations by shuffling the risk scores while maintaining the order of survival objects). Patients censored before 5 years (Training cohort: n=192, Validation cohort: n=181) were eliminated from sampling. ROC analysis was implemented using R packages pROC (v1.6.0.1) and survivalROC (v1.0.3).
mRNA abundance data shown in the heatmaps (
mRNA abundance profiles of 33 genes were available for 3,476 patients and complete mutational data was available for 3,353 patients [12]. Outcome data were available for 3,343 patients (
Univariate mRNA Expression
Tumors from patients who subsequently progressed to metastatic breast cancer showed markedly different mRNA abundance profiles relative to tumors from patients who did not progress during follow up (
IHC4—mRNA Based Assessment of a Conventional Risk Score
The ability of a protein-based residual risk classifier, IHC4, was evaluated to predict outcome in this large, well-powered cohort (
A prognostic model was generated using the mRNA abundances of the IHC4 markers, which we call IHC4-mRNA (Table 5). IHC4-protein and IHC4-mRNA risk scores were well-correlated (p=0.66, p=3.55×10−205,
The 33 PI3K pathway genes were aggregated into 8 modules representing different nodes of the pathway. mRNA abundance data within each module was collapsed into a single per patient Module Dysregulation Score (MDS) to enable comparisons between modules and to determine module co-expression. All 8 modules were univariately associated with patient outcome in the training cohort (p<0.05, Table 6). Given that only 7 genes were univariately prognostic (
4.11 × 10−15
1.78 × 10−15
A residual risk model was generated by biomarker construction/pathway identification application 150 in the training cohort. The final signature contained four modules (i.e. modules 2, 3, 7 & 8), N-Stage and tumor size (Table 7;
Finally, we compared the prognostic ability of the clinically-validated IHC4-protein model to those of our new IHC4-mRNA and PI3K signalling module models. We used the area under the receiver operating characteristic curve as a performance indicator. The PI3K pathway-based MDS model (AUC=0.75) was significantly superior to both the IHC4-mRNA (AUC=0.70; p=1.39×10−3) and IHC-protein (AUC=0.67; p=5.78×10−6) models (
By profiling key signalling nodes within the PIK3CA signalling pathway, a sixteen-gene residual risk signature adapted for theranostic use in association with early luminal breast cancer (
The residual risk signature was derived using the key signalling modules in the PIK3CA signalling pathways and integration with known prognostic markers (Ki67, ER, PgR, HER2) and type I receptor tyrosine kinase signalling (EGFR, ERBB2-4). The “IHC4” markers, which assess proliferation, ER and HER2 signalling, represent a strong component of existing residual risk signatures [6].
This result establishes that molecular profiling of signalling pathways may be used for risk stratification of cancer and for patient stratification. Both the IHC4 and type I receptor tyrosine kinase modules have extensive clinical and pre-clinical data validating their utility in early breast cancer [5, 30-32]. In addition, two key nodes within the PIK3CA pathway identify TSC1/TSC2/Rheb (Module 2) and Raptor/Rictor/mTOR (Module 3) signalling nodes as of pivotal prognostic importance in early breast cancer.
Targeted therapies directed against Rheb/mTOR signalling may be of value in treatment of early luminal breast cancers. Strikingly, the collective impact of these two modules outweighed individual gene contributions from the EIF4 gene family, mediators of protein translation through CCND1/GSK3B/4EBP1 signalling, which are also associated with poor outcome in luminal cancers [33-35]. Univariate analysis of individual genes (see Table 3) indicate additional candidates for theranostic intervention in this pivotal pathway including Harvey and Kirsten RAS, PDK1 and PIK3CA itself. The documented effects of PIK3CA pathway inhibitors in advanced breast cancer, if appropriately targeted using theranostic gene/drug partnerships, may be translated into significant improvements in survival in early breast cancer. Despite the high frequency of PIK3CA mutations in this dataset [13], no prognostic impact was observed. Nor did we find any evidence that either PTEN or AKT expression, across all 3 isoforms, was important in residual risk prediction [36, 37].
Biomarker construction/pathway identification device 10 and patient prognosis/classification device 20 are further described with reference to further example biomarker for breast cancer, colon cancer, NSCLC cancer, and ovarian cancer. In these examples, each subnetwork module corresponds to a signaling pathway.
These example biomarkers are listed in Appendix A, and include:
First, biomarker/pathway identification device 10 is configured and operated to construct the biomarker for the particular cancer type. Then, patient prognosis/classification device 20 is configured and operated to use the constructed biomarker to perform patient prognosis and classification for patients of the particular cancer type.
mRNA Abundance Data Pre-Processing
As before, pre-processing was performed at biomarker construction/pathway identification device 10 by data preprocessing component 152 incorporating an R statistical environment (v2.13.0). Raw datasets from breast, colon, NSCLC and ovarian cancer studies (Tables 10-13) were normalized using RMA algorithm [70] (R package: affy v1.28.0) except for two colon cancer datasets (TOGA and Loboda dataset) which were used in their original pre-normalized and log-transformed format. ProbeSet annotation to Entrez IDs was done using custom CDFs [71] (R packages: hgu133ahsentrezgcdf v12.1.0, hgu133bhsentrezgcdf v12.1.0, hgu133plus2hsentrezgcdf v12.1.0, hthgu133ahsentrezgcdf v12.1.0, hgu95av2hsentrezgcdf v12.1.0 for breast cancer datasets. hgu133ahsentrezgcdf v14.0.0, hgu133bhsentrezgcdf v14.0.0, hgu133plus2hsentrezgcdf v14.0.0, hthgu133ahsentrezgcdf v14.0.0, hgu95av2hsentrezgcdf v14.0.0 and hu6800hsentrezgcdf v14.0.0 for the respective colon, NSCLC and ovarian cancer datasets). The Metabric breast cancer dataset was preprocessed, summarized and quantile-normalized from the raw expression files generated by Illumina BeadStudio. (R packages: beadarray v2.4.2 and illuminaHuman v3.db_1.12.2). Raw Metabric files were downloaded from European genome-phenome archive (EGA) (Study ID: EGAS00000000083). Data files of one Metabric sample were not available at the time of our analysis, and were therefore excluded. All datasets were normalized independently. Raw CEL files for mRNA abundance of TOGA ovarian cancer (Broad institute cohort) were downloaded from the TOGA data matrix (http://tcga-data.nci.nih.gov/). These were normalized using RMA (R package: affy v1.28.0) and ProbeSets were annotated to Entrez Gene IDs using custom CDF (R package: hthgu133ahsentrezgcdf v14.1.0). Pre-normalized ovarian cancer copy-number aberration and DNA methylation data was downloaded from cBio cancer genomics portal at: http://cbio.mskcc.org/cancergenomics/ov/.
For each of breast, colon, NSCLC and ovarian cancer studies, datastore 144 was populated with patient records for patients from those studies with data in the patient records normalized by data preprocessing component 152.
The pathway dataset was downloaded from the NCI-Nature Pathway Interaction database [72] in PID-XML format (Table 9). The XML dataset was parsed to extract protein-protein interactions from all the pathways using custom Perl (v5.8.8) scripts. The protein identifiers extracted from the XML dataset were further mapped to Entrez gene identifiers using Ensembl BioMart (version 62). Whereever annotations referred to a class of proteins, all members of the class were included in the pathway, in some case using additional annotations from Reactome and Uniprot databases. The protein-protein interactions, once mapped to the Entrez gene identifiers, were grouped under respective pathways for subsequent processing. The initial dataset contained 1,159 variable size subnetwork modules (
At device 10, datastore 144 was populated with subnetwork records created for each of these 500 subnetwork modules.
In order to avoid dataset-specific bias, all included studies were analyzed independently (Table 10). First, each dataset was pre-processed independently by data preprocessing component 152, as described in the ThRNA abundance data pre-processing′ section above. Next, genes across all the datasets were evaluated for their prognostic power using a univariate Cox proportional hazards model followed by the Wald-test (R package: survival v2.36-9). Overall survival (OS) was used as the survival time variable; for the studies that do not report OS, the closest alternative endpoint available in that study was used (e.g. disease-specific survival or distant metastasis-free survival). All the genes were subsequently ranked by the Wald-test p-value within each study. The top genes across all studies were compared on multiple criterion:
The Rank Product [73] of each gene was computed as:
Here k represents the number of studies which had the mRNA abundance measure available for gene g. ri is the rank of gene g in study i. The overall ranking table was used as a benchmark to identify datasets in which a given gene was ranked farthest when its rank product was compared to studywise ranks. The farthest dataset count was computed for the overall top ranked (100, 200, 300, . . . , 1000, 2000) genes (
The p-value (Wald-test) based ranking was transformed into percentile ranks within each study. These ranks were used as a measure of gene's position with reference to the benchmark rank derived in the step 1 to evaluate deviation of genes' ranks for each study (
The mRNA abundance profiles of common genes across all studies were extracted and patient wise Spearman rank correlation coefficient was estimated (R package: stats v2.13.0). The correlation coefficient was used to further analyze intra- and inter-study correlation in order to identify any outlier studies (
Eliminating Redundant mRNA Profiles (Breast Cancer Data)
The Spearman rank correlation coefficient was also used to establish a non-redundant set of patients. This is important not only to identify any patients that might have participated in more than one study or duplicate data used in multiple papers, but also to train a robust model thereby preventing model over-fitting. The survival data of patients with high correlation coefficient (ρ≧0.98) was matched, and 22 samples [65, 74] having identical survival time and status were found. These patients were removed from further analyses (
Correspondingly, patient records in datastore 144 were updated to remove records for redundant patients.
Following univariate analyses and elimination of redundant patients, the remaining studies were divided into two sets, training and validation (Tables 10-13). The RMA normalized mRNA abundance measures were median scaled within the scope of each dataset (R package: stats v2.13.0) by data preprocessing component 152.
At device 10, models were fitted to the patient records by model construction component 160. The hazard ratio for all the genes by combining samples from all the training datasets was estimated using the univariate Cox proportional hazards model. The Cox model was fit to the median dichotomized grouping of mRNA abundance profiles of the samples as opposed to continuous measure of mRNA abundance.
The hazard ratio for all the protein-protein interactions gathered from the NCI-Nature pathway interaction database were estimated using a multivariate Cox proportional hazards model. A Cox model, shown below, was fit to median dichotomized patient grouping of each of the interacting gene pairs:
h(t)=h0(t)exp(β1XG1+β2XG2/β3XG1.G2) (2)
where XG1 and XG2 represent patient's group for gene 1 and gene 2. XG1.G2 represents patient's binary interaction measure between the gene 1 and gene 2, as shown below:
X
G1.G2=(
where ⊕ represents exclusive disjunction between the grouping of each gene. The expression encodes XNOR boolean function emulating true (1) whenever both the interacting genes belong to the same group.
At device 10, module scoring component 154 processed patient records and subnetwork records stored in datastore 144 to score each of the modules. In particular, the pathway-based subnetwork modules were scored using three different models. These models compute a module-dysregulation score (MDS) by incorporating the hazard ratio of nodes and edges that form the subnetwork:
where n and e represent total number of nodes (genes) and edges (interactions) in a subnetwork module respectively. HR represents the hazard ratios of genes and the protein-protein interactions in a subnetwork module (section: Meta-analysis). The subnetworks were ranked by module ranking component 156 according to their MDS, thereby identifying candidate prognostic features.
The subnetwork MDS was used to draw a list of the top n subnetwork features for each of the three models (see section: Subnetwork module-dysregulation score). These features were subsequently used to estimate patient risk scores using Model N+E, N and E. The patient risk score for each of the subnetwork modules (riskSN) was expressed using the following models constructed by model construction component 160:
where n and e represent the total number of nodes (genes) and edges (interactions) in a subnetwork module (SN), respectively. HR is the hazard ratio of genes and the protein-protein interactions (section: Meta-analysis) in a subnetwork module. x and y are the two nodes connected by an edge ej and ω is the scaled intensity of an arbitrary molecular profile (e.g. mRNA abundance, copy number aberrations, DNA methylation beta values etc).
A univariate Cox proportional hazards model was fitted to the training set by model construction component 160, and applied to the validation set for each of the subnetwork modules. The prognostic power of all three models was compared using non-parametric two sample Wilcoxon rank-sum test (R package: stats v2.13.0) (
In order to narrow down the size of subnetwork features in each of the three models yet maintaining the prognostic power, backward variable elimination and forward variable selection algorithms was applied by module selection component 158. The backward elimination algorithm starts with a model having a complete feature set and attempts to remove the least informative features one by one, as long as the overall performance is not compromised. Conversely, the forward selection algorithm starts with the most prognostic feature and expands the model by adding one feature at a time. Both models terminate as soon as the overall performance is locally maximized. Following every addition or deletion, the model re-computes the goodness of fit, called Akaike information criterion (AIC). The AIC measure guides the model on the statistical significance of a feature/variable in consideration. The selection/elimination trace was tracked from the beginning to the convergence point and, at each iteration, the prognostic power for that particular state of the model was evaluated (R package: MASS v7.3-12). The evaluation was conducted by fitting a multivariate Cox proportional hazards model on the training set. The coefficients (β) estimated by the fit were subsequently used to compute an overall measure of per patient risk score for the validation set using the following formula:
where Yij is the ith patient's risk score for subnetwork module j. The training set HRs of the nodes and edges were used to compute Yij (see section: Patient risk score). Next, the validation cohort was median dichotomized into low- and high-risk patients using the median risk score estimated on the training set. The risk group classification was assessed for potential association with patient survival data using Cox proportional hazards model and Kaplan-Meier survival analysis.
The biomarker is the selected subset of the subnetwork modules following backward variable elimination/forward variable selection.
The performance comparison of all three models was conducted by bootstrapping training set samples 10,000 times. Each model was tested on the validation set samples. Validation results of Model N+E, N, and E were compared using Tukey HSD test (R package: stats v2.13.0).
Jackknifing was performed over the subnetwork marker space for four tumour types; breast, colon, NSCLC and ovarian. Ten million prognostic classifiers (200,000 for each size n=5, 10, 15, . . . , 250; where n represents the number of subnetworks) were randomly sampled using all 500 subnetworks. The predictive performance of each random classifier was measured as the absolute value of the log2-transformed hazard ratio obtained by fitting a multivariate Cox proportional hazards model using Model N.
All plots were created in the R statistical environment (v2.13.0). Forest plots were generated using rmeta package (v2.16), all others were created using lattice (v0.19-28), latticeExtra (v0.6-16) and VennDiagram (v1.0.0) packages.
At device 10, 14 mRNA abundance breast cancer datasets were collated (Table 10). Since these datasets originate from different studies and array platforms, comprehensive univariate analyses were conducted to identify outlier datasets and to find patients duplicated across datasets. Two studies were identified as outliers and 22 redundant patients having identical survival data (
Comparison with Colon, NSCLC and Ovarian Cancer Prognostic Biomarkers
In order to compare the performance of SIMMS's with existing gene expression-based colon [99, 100], NSCLC [101-105] and ovarian [106-109] cancer prognostic biomarkers, we limited our search to the studies which shared the validation datasets with those included in our analysis as validation datasets too. This selection criterion enabled unbiased comparison of hazard ratios and P-values between published markers and those identified by SIMMS for the same set of patients unless specified otherwise. To maintain parity, strictly gene expression-based predictors with dichotomous output were considered for performance evaluation. These results are presented in Table 26. To test the colon cancer 34-gene signature [100] on TCGA cohort, this signature was re-implemented following the original protocol. Briefly, VMC and Moffitt sub-cohorts were treated as training and validation sets respectively. The validation results on the Moffitt cohort and TCGA cohort are shown in Table 26.
Comparison with Oncotype DX and MammaPrint
Oncotype DX is an RT-PCR 21-gene signature having 5 normalization genes and 16 predictor genes [110]. Of the 16 predictor genes, Entrez gene 2944 was missing from all validation datasets and Entrez gene 57758 was missing from the Bild dataset. Entrez gene 6175 was missing from the normalization genes. These missing genes were assigned zero score. The mRNA profiles of the predictor genes were normalized by subtracting the mean of normalization gene set. The original Oncotype DX protocol was implemented using R package genefu (v1.2.1) [111]. The Oncotype DX protocol offers 3 risk groups; low (risk score<18), intermediate (18 risk score<31) and high 31). To make it comparable with SIMMS, the intermediate risk group patients was split into low- and high-risk groups at the median of risk score guide for the intermediate group (24.5). The dichotomized groups across all validation datasets were further analyzed using Cox proportional hazards model followed by Kaplan-Meier analysis (Table 8).
MammaPrint is a microarray based 70-gene signature [112]. Of the 70 genes, we were unable to map 7 genes to Entrez ids in our validation cohort, namely Contig32125_RC, Contig20217_RC, Contig24252_RC, Contig40831_RC, Contig35251_RC, AA555029_RC and Contig63649_RC. We set the corresponding mRNA abundance score of these genes to zero. The gene signature implementation was done using R package genefu (v1.2.1) [111]. The risk scores were dichotomized by using two different thresholds; default (0.3) and median risk score (Table 8).
For both Oncotype DX and MammaPrint, due to limited clinical annotations for
Affymetrix based datasets, we used all patients. However, for Metabric (Illumina dataset), Oncotype DX was applied to preselected Stage [0,1,2,3], ER positive, lymph node negative and HER2 negative patients only. Similarly MammaPrint was applied to Stage [0,1,2], lymph node negative patients having tumour size<5 cm.
Overall, SIMMS performance was at least as good as MammaPrint and better than Oncotype DX across the studies in validation cohort, independently as well as combined.
Recent studies conducted by TOGA have generated datasets on multiple genomic aberrations including somatic mutations, mRNA abundance, copy-number aberration (CNA) and DNA methylation [107, 113]. These datasets lend themselves naturally to integrative analyses that are crucial to bridge the gap between molecular features and clinical covariates. To this end, we applied our methodology to TOGA ovarian cancer [107] (Broad Institute cohort) and established 7 different models using SIMMS Model N. Molecular features based on mRNA, CNA and DNA methylation were used as gene-level properties. Next, subnetwork modules feature selection was carried out and MDS was computed by using the above-mentioned features independently as well as in a multivariate setting. As we only had one dataset with 478 patients having all three data types, the dataset was randomly dichotomized into equal sized training and validation cohorts. To avoid randomization specific bias, the procedure was repeated 1,000 times and aggregated the validation results (
SIMMS, as for example implemented in biomarker construction/pathway identification application 150, is generic and can work with any combination of molecular features and interaction networks. In an embodiment, it provides an extendible framework to support user-defined parameter estimation and classification algorithms. In an embodiment, SIMMS provides: (i) support for multiple datatypes (mRNA, methylation, CNA etc), (ii) support for user-defined networks, and (iii) support for user-defined methods for quantifying dysregulation effect of a subnetwork. For (i), users can supply the location and names of the files they would like to analyze with SIMMS. For (ii), a text file describing networks in a tab-delimited format can be supplied as an input to SIMMS, see pathway_based—networks*.txt files that comes as a part of R package. For (iii), the package offers an interface function ‘derive.network.features’ that accepts a parameter ‘feature.selection.fun’ for user-defined function name (see code snippet below). By default, the function ‘calculate.network.coefficients’ is called to compute MDS for Mode N, Model E and Mode N+E. However, users can easily write their own algorithms and simply use them with SIMMS as plug and play components.
SIMMS, as implemented for example in biomarker construction/pathway identification application 150, acts upon a collection of subnetwork modules, where each node is a molecule (e.g. a gene or metabolite) and each edge is an interaction (physical or functional) between molecules. Molecular data is projected onto these subnetworks using network topology measurements that represent the impact of and synergy between different molecular features and associated patient data. Because different biological processes can have different underlying tumourigenic promoting network architectures, three network topology measurements are provided based on different interaction models. One model, hereafter referred to as Model N (nodes only), estimates the extent of dysregulation in molecules that function together. Two other models Model E (edges only) and Model N+E (nodes and edges) incorporate the impact of dysregulated interactions (Methods). Regardless of which model is used, module scoring component 154 of application 150 computes a Thodule-dysregulation score′ (MDS) for each subnetwork that measures how a disease affects any given subnetwork (
We first focused on prognostic models, which predict patient survival, and therefore used Cox proportional hazards models for these censored data. Each Cox model generated a hazard ratios (HR) which quantifies how effectively a biomarker can stratify patients into low- and high-risk groups (Methods).
The distributional characteristics of our candidate disease-subnetwork modules revealed unexpected and important properties of tumour network biology. First, there was a global propensity for highly prognostic subnetworks to be larger, containing more genes and interactions than expected by chance (nodes p<10−3, edges p<10−3; permutation test) (
To ensure independence from the discovery cohort-specific effects, we inspected prediction robustness by permuting the discovery cohorts. While a distribution of performance was observed both in terms of statistical significance (
At device 10, module/pathway identification component 162 processes the subnetwork module scores, as calculated by module scoring component 154, to identify one or more dysregulated subnetwork modules. Upon identifying one or more dysregulated subnetwork modules, module/pathway identification component 162 may process the pathway records stored in datastore 144 to identify one or more biological pathway associated with the identified dysregulated subnetwork modules as dysregulated pathways.
Identifying dysregulation of particular subnetwork modules and/or pathways for specific diseases (or other phenotypes) provides targets for treatment.
For example, by acting at the pathway level, insight can be provided about therapeutic approaches that might target an entire pathway. Subnetwork module scores are used to identify specific pathways statistically-significantly dysregulated in each disease (Methods section: Patient risk score). Survival analysis demonstrated that the subnetwork based patient risk scores were prognostic indicators of patient outcome in each tumour type (
Having established that the subnetwork modules are predictive of clinical phenotype, the inter-subnetwork co-occurrence and mutual exclusivity in breast cancer (
Next, it was determined if specific pathways were recurrently mutated across different tumour types, in spite of the large inter-patient variability in disease presentation [69]. There were some clear similarities in subnetwork dysregulation between cancer types, with four pathways dysregulated in all types (
In addition to identifying specific subnetworks dysregulated in each disease type (e.g., each tumour type), a more general question is to quantitatively determine the similarity between different tumour types at the pathway-level. This question was addressed by sampling random sets of subnetworks, generating a prognostic model for each, and comparing the prognostic capacity of this model on each tumour type. Then million random samples of n subnetworks (where n=5, 10, 15, . . . , 250) were generated and tested their prognostic capability in the 4 tumour types. Breast and NSCLC markers showed a modest correlation (
Performance as a function of biomarker size was also analyzed (
The ability of biomarker construction/pathway identification application 150 to construct clinically-use biomarkers for each of the four noted tumor types was assessed. The most optimal size of subnetworks for different tumour types was determined using permutation analysis (
Because SIMMS operates at the level of pathways, it is robust to changes in the genomics platform. The Metabric clinical cohort of 1,988 patient profiles generated using IIlumina microarrays was used to demonstrate this flexibility [85]. The 50-subnetwork breast cancer classifier generated using Affymetrix microarrays (
Comparison with Existing Pan-Cancer Prognostic Biomarkers
To demonstrate the clinical utility of the biomarkers generated by SIMMS, as implemented in application 150, we conducted coherent performance comparison with previously published colon, NSCLC and ovarian cancer markers. The performance of SIMMS's identified markers was highly competitive and reproducible across a panel of independent patient studies. SIMMS produced the best prognostic marker for colon cancer by a wide margin, and was tied for the best lung and ovarian cancer markers (Table 26). Of note, each of the 15 other biomarkers evaluated used an entirely separate methodology. Overall, these results indicate that functionally-derived subnetworks have excellent prognostic capability, and can be used to identify new biomarkers across a range of human diseases.
1The validity of this dataset has been much criticised in the literature, with several studies being retracted (PMIDs: 17057710 and 16899777)
To further establish the clinical utility of SIMMS's classifications, we tested for synergy between SIMMS-predicted risk groups and the intrinsic breast cancer subtypes [81] using the Metabric cohort. The prognostic model created on the Metabric training cohort yielded risk-groups with in agreement with the PAM50 intrinsic subtypes (
However SIMMS can assist in the improved clinical management of breast cancer beyond simply subtyping them. For example, the majority of Basal-like tumours are triple negatives (ER-, PgR-, and Her2-) and vice versa, yet these are heterogeneous diseases with subgroups of patients having differential response to neo-adjuvant therapy [86]. Hence, molecular biomarkers are urgently needed for better management of patient subgroups that do not respond to current therapeutic regimes. To identify such biomarkers, we created subtype-specific SIMMS classifiers for breast cancer subgroups. Despite greatly reduced sample-sizes, SIMMS's classifiers successfully stratified the most heterogeneous groups (i.e. luminal A, luminal B and ER-positive [87]) into good and poor prognosis sub-groups (
To further demonstrate clinical utility, SIMMS's classifier was directly compared to two clinically-approved breast cancer biomarkers, Oncotype DX [88] and MammaPrint [89], in 7 independent validation cohorts. Each validation patient was classified using both these clinically-approved biomarkers and the SIMMS-trained breast-cancer classifier created using forward selection (
Large-scale disease-specific initiatives are rapidly generating matched genomic, transcriptomic and epigenomic profiling on large cohorts, with detailed clinical annotation [90]. Systematic integration of such data remains challenging, but offers the prospect for enhanced biomarker accuracy. We applied SIMMS to the Metabric dataset to combine copy number aberration (CNA) and mRNA abundance data. The integrated data yielded improved prediction relative to either data-type alone (
Such data types may include data reflecting aberration, epigenomic aberration, transcriptomic aberration, proteomic aberration, and metabolic aberration, and more particularly data reflecting somatic point mutation, small indel, mRNA abundance, somatic or germline copy-number status, somatic or germline genomic rearrangements, metabolite abundance, protein abundance, and DNA methylation.
It will be appreciated that any device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, tape, and other forms of computer readable media. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD), blue-ray disks, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any application or component herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media.
Furthermore, the described embodiments are capable of being distributed in a computer program product including a physical, non-transitory computer readable medium that bears computer-executable instructions for one or more processors. The medium may be provided in various forms, including one or more diskettes, compact disks, tapes, chips, magnetic and electronic storage media, volatile memory, non-volatile memory and the like. Non-transitory computer-readable media may include all computer-readable media, with the exception being a transitory, propagating signal. The term non-transitory is not intended to exclude computer readable media such as primary memory, volatile memory, RAM and so on, where the data stored thereon may only be temporarily stored. The computer useable instructions may also be in various forms, including compiled and non-compiled code.
It will be appreciated that numerous specific details are set forth in order to provide a thorough understanding of the exemplary embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Furthermore, this description is not to be considered as limiting the scope of the embodiments described herein in any way, but rather as merely describing implementation of the various embodiments described herein. All references herein, including in the following Appendices and Reference List, are hereby incorporated by reference.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CA2015/050692 | 7/23/2015 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62027966 | Jul 2014 | US | |
62085416 | Nov 2014 | US |