The present invention relates to a data-driven integrative system and method for providing patient-specific gene expression predictions by building a gene-gene regulatory influence network that incorporates community-curated biological pathway network information and omics data, such as RNAseq-based expression data, copy number variation (CNV) data, and DNA methylation data, and comparing with multi-omic patient-specific measurements, including RNAseq-based gene expression, array-based DNA methylation (epigenetic) and SNP-array based somatic copy-number alterations (sCNA). More particularly, the patient-specific gene expression predictions are used to identify significant deviations and inconsistencies in gene expression levels from expected levels in individual patient samples for providing predictive information in relation to cancer and cancer treatment.
The pathobiology of cancer is associated with significant aberrations within the natural complex biological processes governing the growth and differentiation of normal cells. However, there exists significant heterogeneity within cancers originating even within the same tissue type, possibly reflecting the multiple ways in which the normal signaling networks can be pathologically altered. This heterogeneity underlies the significant challenges posed in the development of diagnostic and theranostic biomarkers as well as potential therapeutic interventions in oncology, and points to the need for a systems-level understanding of cancer etiology and progression.
For instance, the ERBB2 gene which encodes a member of the epidermal growth factor (EGF) receptor family of receptor tyrosine kinases and plays a significant role in cell proliferation is highly overexpressed in multiple cancers, especially breast, gastrointestinal and ovarian cancers. This gene is deregulated in approximately 20% of breast cancer and in most cases its overexpression is associated with copy number amplifications, and has resulted in the definition of a specific subtype of breast cancer named after this gene, HER2-positive breast cancer. Despite the availability of a targeted therapeutic intervention for this particular subtype of breast cancer, namely Herceptin, the response rate of breast cancer patients to this therapy remains in the 50-55% range. This heterogeneity in response points to the existence of other genetic modulators of tumor progression. Indeed, it has been shown that aberrations in the AKT/PI3K pathway, such as deletions of the tumor suppressor gene PTEN, and mutations in the PIK3CA gene result in resistance to Herceptin. However, no systems-level pathway model currently exists that can integrate all of these factors into a single integrative biomarker for therapy resistance.
While the tumorigenic effects of specific recurrent mutations in known cancer driver-genes is well-characterized, not much is known about the functional relevance of the vast majority of recurrent mutations observed across cancers. Computational methods to assess the functional relevance of mutations largely depend on either estimating their impact on protein structure or are based on the relative frequency of their occurrence as compared to background mutational processes. In order to shed light on the potential impact of the mutations on downstream cellular processes, recent approaches have attempted to identify functional effects of genomic aberrations by integrating multi-omics measurements in cancer samples with community-curated biological pathway networks. However, the vast majority of these approaches tend to overlook key biological considerations including unequal and possibly nonlinear influences of multiple regulatory factors on downstream gene transcription and the tissue-specificity of pathway interactions.
Several computational frameworks have been developed in order to evaluate the functional significance of mutations or genomic aberrations in cancer samples. While methods based on the inference of mutational effects on protein structure have been widely used in the community, recent work has focused on determining driver mutations in genes by evaluating the relative frequency of a gene being mutated as compared to background mutation processes. Recognizing that silent mutations are typically rare for any candidate gene, likely resulting in inaccurate background mutation rate estimates, MutSigCV attempted to leverage genes with similar genomic properties to the candidate gene in order to improve the background mutation rate estimates. Other methods aim to identify subnetworks that are frequently hit by somatic mutations within a given cancer subtype. However, these approaches do not provide mechanistic insights into the downstream deregulatory or signaling effects of somatic aberrations. These shortcomings have led to network-based methods, where the well-curated biological interactions among cell entities (e.g., genes, RNAs, proteins, protein complexes and miRNAs) are incorporated into the model in terms of pathway networks. Other studies focus solely on the association between cancer clinical outcomes and the activation level of molecular entities such as gene and protein expression levels but do not explicitly model the functional effects of mutations in cancer biology. Recently, PARADIGM-SHIFT was proposed to integrate pathway regulatory networks with multi-omics data to model the functional impact of somatic mutations on the activity of individual nodes in the pathway. The functional effect of a somatic aberration in any given protein is inferred based on the difference in activity of the corresponding node obtained once from its upstream regulatory network and again from its downstream target nodes.
Although different in development, there are common drawbacks shared among these methods, which is their absolute reliance on the biological pathway networks, hence utilization of these methods should be restricted to well-curated pathway networks and not recommended for partially validated or molecular networks that were derived from different tissue contexts. More importantly, these techniques typically presume that all parent genes contribute equally to the corresponding interaction, thus overlooking the potential for variations in the strength of influence between the interactions among the network nodes. For instance, if multiple genes appear as transcriptional regulators of a specific target gene, they are considered to contribute equally to the expression level of the target gene, which is biologically questionable. In reality, the pairwise influence among adjacent nodes can be extremely different. The heterogeneity among links is considered in the HotNet algorithm, which intends to discover this heterogeneity through defining a measure of pairwise influence among gene pairs based on the network topology. However, the actual pairwise influence heterogeneity arising from complex underlying regulatory interactions is not fully extractable from the putative pathway network topology.
Since pathway level aberrations can result from multiple sources, such as somatic mutations, copy-number alterations, epigenetic variations and the regulatory gene expression changes, jointly modeling these sources of variability is essential to developing comprehensive pathway-based predictive models of use in oncology. Furthermore, with the recent advances in low-cost genome-wide data acquisition techniques in molecular biology, measurements of the different sources of variability are becoming increasingly available. However, modeling frameworks that can fully utilize the information present in these multi-omics profiles are lacking in both the research and diagnostic communities. Development of computational frameworks to integrate various data sources, including RNA expression level, copy number variations, DNA methylation patterns, and somatic mutations, with the objective of finding clinically useful biomarkers is therefore an essential need in the oncology community.
Recently, several integrative approaches were proposed to incorporate various sources of information into a unified framework to facilitate early cancer diagnosis, clinical outcome prediction and more relevant therapeutic interventions. The majority of these approaches take one of the two extreme perspectives of either; i) totally ignoring the conceptual biological information and relying solely on data-driven techniques, or ii) fully trusting the conceptual biological information via incorporating a network of interacting molecular entities. Ignoring biological interactions among the cell molecular entities (e.g., genes and proteins) in the first approach is highly inefficient in finding biologically relevant subsets of entities with significant collective predictive powers, due to the potential of data over-fitting. Indeed, this problem is particularly accentuated in cancer research since the number of cancer samples in any given study tends to be an order of magnitude lower than the number of molecular features measured. On the other hand, a full reliance on descriptive biological networks ignores their limitations: the pathway networks are typically constructed based on experimental evidence in a specific cellular context, which may not always be translatable to other tissue and pathological contexts.
The within invention takes a hybrid approach and incorporates both measurement-based omics data and partially trusted pathway information into a unified framework to build a gene-gene influence network, which is capable of predicting a particular gene expression level given the regulatory network status. This framework not only refines and extends our knowledge of tissue-specific protein-protein interactions but also provides patient-specific predictions and conditional distributions of network entities (e.g., genes). These patient-specific gene expression predictions are then leveraged to find significant deviations and inconsistencies in gene expression levels from expected levels in individual patient samples, thus allowing for the discovery of potential associations with phenotypes such as therapy response and prognosis.
This invention overcomes several significant limitations in integrating biological information and various molecular measurement data sources into a unified network-based computational framework. This leads to revealing more relevant patient-specific malfunctioning genes and perturbed biological processes.
For example, the method of this invention incorporates the biological information and reports only genes that show significant inconsistency with the underlying network-based predictions and patient-specific measurements. This approach, therefore, results in higher specificity as well as sensitivity in identifying the most functionally-relevant genes associated with the phenotype in consideration.
Also, the current set-based methods take biological information into account by first annotating sets of genes that are jointly associated with a particular phenotype or cellular/biological process based on a prior biological knowledge. However, set-based methods are not capable of adaptive integration and the user is required to include the biological information manually via forming potentially more relevant gene sets. In contrast, the within invention does not require any prior information about the cancer biology. The method develops a gene regulatory network for each gene from the pathway network annotations. The resulting pathway subnetworks associated with a phenotype provide functional insights along with robust biomarkers and is therefore widely applicable across cancers.
Currently available network-based methods such as Paradigm, Pathologist and SPIA aim at integrating pathway information with measurement data, in order to identify perturbed pathways and genes exhibiting significant deviations from predictions obtained from the network. These approaches have two important drawbacks. Firstly, these approaches fully trust the biological pathway network relationships without allowing for the potential of tissue-specific variations in pathway network connectivity. The second and even more important issue is that these techniques overlook the potential for functional heterogeneity amongst the interaction links in the network. They presume an equal influence of all the direct parent nodes, while in reality the influence of some regulatory parent genes may be significantly higher than the other parent genes.
The within method and system do not rely fully on the pathway network but rather refines the influence network by assigning different coefficients to the network edges that are learned from the multi-omics data. See, e.g., Tables 2 and 3; network edges representing upstream regulators are captured using the coefficients for ancestors; cis-regulatory influences are captured as CNV and Methylation coefficients. Further, loosely connected links are removed. Therefore, our method highlights and discovers the heterogeneous relations among network nodes (e.g., genes, RNAs, proteins).
In further contrast, our method uses both biological pathways and multi-omics measurement data to capture not only the topology but also the strength of the influence between nodes in the network as mentioned above. Therefore, it provides a more accurate and realistic influence among network nodes. Secondly, the within method is not limited to finding paths that are frequently affected by somatic mutations, but also finds the malfunctioning nodes.
In order to address these issues, the process of the invention, that we refer to as Information Flow impacted by Mutations (“InFlo-Mut”), incorporates multi-omics measurements, including RNAseq-based gene expression, array-based DNA methylation (epigenetic) and SNP-array based somatic copy-number alterations (sCNA), and biological pathway network information to build a gene-gene regulatory influence network. InFlo-Mut learns the pairwise influence of the regulatory nodes on the target genes from molecular profiles of normal and cancer tissues. To infer activity of the nodes in a new sample, InFlo-Mut uses the network coefficients, which are already learned from the training dataset. This is realized through learning a non-linear Bayesian model to predict the expression level of any given gene using its own sCNA and methylation profiles along with upstream regulatory influences inferred from biological pathway networks. This approach not only solves the issue of unequal parent node contributions by capturing heterogeneous pairwise influence coefficients, but also is capable of learning non-linear relations among the nodes. InFlo-Mut also allows for the assessment of the association between the somatic mutations with the downstream target genes, in order to reveal subset of genes whose mutations have higher impact on dysregulating the target genes. We demonstrate the robustness and biological validity of InFlo-Mut by applying it to two large multi-omics datasets on breast and colon cancers and uncover the potential modulatory effects of mutations across genes in key oncogenic pathways in these diseases.
In particular, an object of the present invention is to provide a system and method that solves the above-mentioned problems of the prior art by integrating curated pathway networks with multi-omic biological information and various molecular measurement data sources, into a unified network-based computational framework to identify the impact of somatic mutations. It is also an object of the present invention to provide a system and method for providing patient-specific gene expression predictions and identifying the significant deviations and inconsistencies in patient gene expression levels from predicted levels, to identify more relevant malfunctioning genes and perturbed biological processes. It is a further object of the present invention to identify potential associations with phenotypes such as therapy response and prognosis. It is also an object of the present invention to provide an alternative to the prior art.
Thus, the above-described object and several other objects are intended to be obtained in a first aspect of the invention by providing a system and method for identifying and reporting potential somatic aberrations driving dysregulated genes, such method comprising the steps of:
determining a primary dataset of upstream regulatory parent gene information for each specific target gene of interest by obtaining biological network pathway information from well-curated publicly available pathway networks and inputting the pathway information onto a processor configured to receive the pathway information;
determining, by applying, a regulatory tree for each specific target gene that captures the relationship between the gene's expression level with its own genomic and epigenetic status as well as its upstream transcriptional regulators; the gene of interest resides in the root node and the leaves of the tree represent all of the genes that potentially regulate its transcription either directly or indirectly through intermediate signaling partners;
determining a second dataset of measurement-based omics data, such as RNAseq expression data, copy number variation data and DNA methylation data, and inputting the measurement-based omics data onto a processor configured to receive such data,
applying computational techniques, by computer, to learn a non-linear function for each gene of interest based on the gene's epigenetic information and regulatory network status, in order to relate that particular gene expression level to the measurements associated with the regulatory tree leaves; the parameters of the non-linear function are estimated using a Bayesian inference method incorporating a novel depth penalization mechanism to capture the potentially stronger regulatory impact of nodes closer to the root node in the tree;
applying analytical techniques, by computer, to predict expression levels for each gene of interest;
determining patient-specific information relating to observed expression levels for the desired target genes, and inputting the patient-specific information as a third dataset, the patient—specific information including new cancer sample data such as RNA expression data, CNV data, methylation data and somatic mutation data;
using the patient-specific information and the predicted expression level information, calculating relative patient-specific inconsistency scores between the predicted and observed expression levels for the desired target genes in a given sample;
evaluating the activation and inconsistency scores obtained for all test samples to discover statistically significant associations between inconsistencies in the target gene expression level with the somatic mutations in the upstream regulatory network of that particular gene.
According to a second aspect of the invention a system is provided for utilizing the statistically significant associations between inconsistencies in the target gene expression level in individual patient samples with somatic mutations in the upstream regulatory network, to identify patient-specific biomarkers, such system comprising an integrated, unified network for identifying significant deviations and inconsistencies in gene expression levels, comprising;
a primary dataset of upstream regulatory parent gene information for each specific target gene of interest obtained from well-curated biological network pathway information, the primary dataset contained on a processor configured to receive said pathway information;
a regulatory tree for each specific target gene that captures the relationship between the target gene's expression level with said target gene's own genomic and epigenetic status, as well as its upstream transcriptional regulators, the gene of interest resides in the root node and the leaves of the tree represent all of the genes that potentially regulate its transcription either directly or indirectly through intermediate signaling partners, said tree determined from the primary dataset;
a second dataset of measurement-based omics data, such as RNAseq expression data, copy number variation data and DNA methylation data, the second dataset also located on a processor configured to receive such data,
a non-linear function learned for each target gene determined from said target gene's epigenetic information and regulatory network status, said non-linear function relating that particular target gene's expression level to measurements associated with said regulatory tree; wherein the parameters of said non-linear function are estimated using a Bayesian inference method incorporating a novel depth penalization mechanism to capture the potentially stronger regulatory impact of nodes closer to the root node in the tree;
a third dataset of patient-specific information relating to observed expression levels for the target genes, the patient-specific information including new cancer sample data such as RNA expression data, CNV data, methylation data and somatic mutation data;
wherein, expression levels of the target genes are determined utilizing the non-linear function, and relative patient-specific inconsistency scores are determined between the predicted and observed expression levels for the target genes in a given sample; and
wherein activation and inconsistency scores a r e determined a third dataset of patient-specific information relating to observed expression levels for the target genes, the patient-specific information including new cancer sample data such as RNA expression data, CNV data, methylation data and somatic mutation data;
wherein, expression levels of said target genes are determined utilizing the non-linear function, and relative patient-specific inconsistency scores are determined between the predicted and observed expression levels for the target genes in a given sample; and
wherein activation and inconsistency scores a r e determined for all test samples whereby statistically significant associations between inconsistencies in the target gene expression level with the somatic mutations in the upstream regulatory network of that particular gene are identified.
The methods according to the invention will now be described in more detail with regard to the accompanying figures. The figures showing ways of implementing the present invention and are not to be construed as being limiting to other possible embodiments falling within the scope of the attached claims.
The present invention provides a system and method for integrating multi-omic biological information and various molecular measurement data sources into a unified network-based computational method for providing patient-specific gene expression predictions and identifying significant deviations and inconsistencies in gene expression levels from expected levels. The present invention is described in further detail below with reference made to
According to an embodiment of the present invention, a flowchart presenting the overall block-diagram of the method for providing patient-specific gene expression predictions, identifying significant deviations and inconsistencies in gene expression levels from expected levels and reporting patient-specific biomarkers, is set forth by the steps, or modules, outlined in
In Module 2, the second step, we determine a non-linear function for each gene in order to relate that particular gene expression level to measurements associated with the regulatory tree leaves. Thus, each tree subnetwork is used to learn a non-linear function to predict the corresponding gene expression level from its own epigenetic information (e.g., DNA Methylation and Copy Number) and its regulatory ancestor gene expression levels. The parameters of the non-linear function are estimated using a Bayesian inference method incorporating a novel depth penalization mechanism to capture the potentially stronger regulatory impact of nodes closer to the root node of the tree. This provides a bank of functions each corresponding to a specific gene in the context of specific tissue type. This function database is learned once and can be used for patient-specific analysis in the two subsequent steps performed by Modules 3 and 4.
In step 3, Module 3 calculates relative patient-specific inconsistency scores between the predicted and observed expression levels for the desired target genes in a given sample. That is, Module 3 receives information for a given patient and performs prediction of gene expression levels for all genes within the regulatory network using the function bank. This module further calculates the consistency scores for each gene by comparing the actual measurement of gene expression, or observed value, with the predicted value. In the fourth step, Module 4, evaluates the activation and inconsistency scores obtained for all test samples to discover statistically significant associations between the inconsistencies in the target gene expression level with the somatic mutations in the upstream regulatory network of that particular gene. Thus, Module 4 identifies the genes whose expression levels are significantly inconsistent with the prediction values obtained from the regulatory network. These genes are likely malfunctioning due to copy number aberrations in the gene or somatic mutations in its ancestors. Module 4 further provides statistics to evaluate the significance of ancestor gene mutations that potentially are associated with the inconsistencies in the child gene expression level.
Module 1: Incorporating Pathway Networks—Regulatory Tree Construction
Gene transcription is a complex biological process, which is regulated at different levels through multiple interacting proteins and complexes as well as the extent of DNA methylation and the harboring DNA segment copy number, as annotated in biological pathway databases. Pathway networks are widely used to present the intra-cellular interactions and gene regulatory networks in a network format. The network builds a directed graph of nodes and edges. The nodes may consist of a diverse range of entities such as genes, proteins, RNAs, miRNAs, protein complexes, signal receptors, and even abstract processes such as apoptosis, meiosis, mitosis and cell proliferation. The network edges determine the pairs of interacting nodes and specify the type of each interaction. Several publicly available pathway networks are developed to model intra cellular activities between various species and tissue types.
In this invention, we use a comprehensive network that brings together pathways from various well-curated pathway sources including NCI-PID, Biocarta and Reactome. This “super pathway network” consists of six node types including; proteins or the corresponding genes, RNAs, protein complexes, gene families, miRNAs, and abstracts. These nodes interact with one another in six different ways of; i) positive transcription, ii) negative transcription, iii) positive activation, iv) negative activation, v) gene family membership, and vi) being a component of a protein complex. Usually, transcription is terminated only to genes represented by the corresponding proteins, while activation is applicable to all node types.
In order to learn a function relating a gene's mRNA expression level to its epigenetic parameters (DNA methylation and copy number variation), as well as its regulatory network, we extract the regulatory network for each gene from the super-pathway network databases and represent it as a “tree” (
In developing a regulatory tree for each gene, we start with a specific target gene and traverse the upstream network in the opposite direction of links to collect all upstream nodes and capture the regulatory genes along with their depth, defined as the number of links to the root node, as depicted in
We first terminate traversing a branch once we reach a predefined maximum depth level, where the depth is defined as the number of links from the visiting node to the root node. We then eliminate all the branches that do not terminate to a gene node; therefore, the leaves of tree are always genes. We pass through all nodes, except abstract nodes that represent the conceptual abstract processes, in order to avoid unnecessary network complications and inclusion of irrelevant interactions. While reaching a gene node, we only pass through the links that are not of type “transcription,” because the part of the upstream regulatory network which terminates to a gene node via a “transcription” link, is already accounted for by considering the expression level of this particular gene. The only exception for this rule is the root node, where we do the exact reverse as follows:
Passing from the root node to the direct neighbors in the first ring of root neighborhood is only allowed if the connecting edge is of “transcription” type to limit the parents to those who impact the expression level of the gene residing in the tree root. We also keep track of the distances from the leaves to the root node that is further used in the function learning process. Finally, if we meet a node via two disjoint paths, the shortest path is considered. The pseudo-code for the Module 1 selection process is summarized below, and a sample upstream tree extracted for gene PPP3CA from the network is depicted in
As an example, in
Module 2: Learning a Nonlinear Function for Each Gene
A second step of the inventive method is to learn a function relating the expression level of the gene residing at root node to its regulatory network and its own epigenetic information (e.g., DNA methylation and CNV). “Learning” a function means quantifying the influence of a regulatory gene's expression level on the target gene's expression. Also, the within method trains a model for a target gene that assigns different coefficients for parent genes based on their pairwise influence as observed in training data (as described in the Bayesian model estimation below, specifically the methods to estimate βg). Because multiple DNA methylation probes may overlap with a gene's coding or regulatory regions, this invention leverages methylation measurements by including several representative statistics such as minimum, maximum, and weighted mean value, where in calculating the weighted mean we exclude the regions with less than 10 probes for more accuracy. Thus, if gene g, overlaps with ngm regions, each having probe number p1, p2, . . . , pn
where I(.) is the identity function.
To include copy number variations, this invention uses the segment mean value provided for the region that harbors the particular gene. Most genes fall into a single CNV segment. Otherwise, if a gene falls in the border of two segments, we simply take the mean value of both segment measurements.
In order to learn a function for each gene, Module 2 uses mRNA expression of its ancestors, somatic copy-number alteration and DNA methylation measurements for ng samples to form the following classical regression model:
y
g=μg1n
where yg is a n×1 vector of expression levels for gene g across all ng samples. Xg=[Xg(Self),Xg(Parent)] is a n×p data matrix composed of two parts including Xg(Self) (self-methylation and CNV data) and Xg(Parent) (the expression levels of the ancestor genes), where;
The term 1n
The objective here is to find the optimal model parameters βi, i=1, 2, . . . , p that provides the best prediction power via minimizing the Mean Squared Error (MSE). One may use normal samples in learning phase to avoid model crash due to severely perturbed interactions in the highly contaminated/disordered cancer cells. However, this may lead to a poor predictive power when the number of predictors is large or comparable with respect to the number of samples (n<O(p)). In most studies, the number of cancer samples profiled tend to be significantly higher than the number of normal samples. For instance, in the case of TCGA data for breast cancer, the number of cancer samples exceed the normal samples by a factor of 10. Consequently, excluding all cancer samples is highly inefficient. On the other hand, including cancer samples in the training set, may deteriorate the model performance for specific genes that significantly deviate from the true underlying biological function in some samples due to genomic events as stated above. Therefore, we include all the normal samples and part of the cancer samples that have not impacted by somatic mutations in this particular gene and its ancestors in order to learn the predictive function. This approach leads to a different training set size for each gene, but provides a considerable improvement in model prediction power.
The Least Squared Error (LSE) solution minimizes the squared error for the training set, when no prior information is available about the model parameters βi.
The LSE solution is not optimal when there is prior information about the model parameters. Here, there is prior knowledge about the model that can be used to enhance the model accuracy. First, it is likely that not all of the ancestor genes may have a substantial impact of a given gene's expression levels. Therefore, a significant number of the model parameters βi could be shrunk towards zero. Therefore, imposing sparsity enhances the model generalization property by avoiding noise over-fitting. Although part of sparsity is already accounted for by using the pathway network and including only ancestor genes instead of using all genes as the input data; but a still higher level of sparsity is expected, when the number of ancestor genes grow higher (in order of tens and hundreds).
One of the common optimization-based solutions to impose sparsity is regularizing the norm of model parameters. The penalization can be applied to the Lp, p≥0 norm of coefficient vector β=[β1, β2, . . . , βp]T, which is called bridge regression. Important special cases of this approach are Lasso, Ridge, and subset selection for L, L2, L0 norm penalization, respectively. In elastic net, the penalty term is the linear combination of L1 and L2 penalty;
where λ1 and λ2 are shrinkage parameters to impose sparsity and generalizability. Efficient algorithms based on convex optimizations, Basis pursuit, LARS, coordinate descent, Dantzig Selector, Orthogonal matching pursuit, and approximate message passing may be used to solve this problem. However, the most limiting drawback of these methods is that it only provides point estimates for the regression coefficients.
In contrast, this invention employs a Bayesian framework that provides more detailed information about the model parameters through posterior distribution to be used in the subsequent consistency check analysis. It also allows the incorporation of other prior knowledge, in addition to sparsity, as explained below.
Historically, in analyzing gene expression studies potentially non-linear relations among the biological measurements have been ignored. In order to capture such nonlinear relationships, Module 2 of this invention uses a centered sigmoid function
to capture the sensitivity around the mean value and the soft thresholding function ƒ2(x; c)=sign(x)(√{square root over (x2+c2)}−c), to account for the cases in which only extremely high or low values contribute to the model. ƒ2 (x; c) can be considered as a softer version of the commonly used peace-wise linear soft-thresholding function, ƒ(x; c)=sign(x)(|x|−c)+. These functions are depicted in
Another important biological consideration in developing the ancestor-set for each gene through upwards traversing the pathway network is the variation in the distance of leaf nodes to the root node. One may expect the closer ancestors contribute more to the descendant downstream gene expression level than farther nodes that are connected via a long chain of intermediate nodes. Hence, the closer nodes tend to pose higher coefficient in the regression model. Module 2 leverages this fact into the method through depth penalization mechanism in the Bayesian framework, as captured by Id in the Bayesian model described below.
Here, this invention uses the Bayesian framework to predict the gene expression level via a nonlinear transformation/projection of its self-epigenetic data as well as the expression levels of the its regulatory ancestor genes. The Bayesian framework provides the desired statistics (e.g., median, mean, moments and . . . ) via full posterior distributions of the model parameters. Moreover, we incorporate a prior knowledge about the model parameters using hierarchical Bayesian models. The resulting posterior distributions provide significant insights into the functional effects of aberrations in the pathway.
The invention uses the idea of global and local shrinkages with penalization based on the distance of the ancestor gene (i.e., the number of links from leaves to root in the regulatory network) from the gene whose expression is being predicted. The following model is constructed, where the subscript g is omitted for notation convenience:
The above formulation extends the normal gamma prior construction in order to incorporate the link depth information to the gamma prior construction. This information is leveraged via coefficients k included in the variance of the model parameters. Thus, the variance of βi is chosen to be inversely proportional to the link depth of the corresponding ancestor via setting var(βi)=σ2τi2/ki2, where σ2 controls the global shrinkage, τi2 accounts for the local shrinkage and Id enforces the link depth impact. To provide more flexibility, we Use of a Gamma prior distribution for ki2 provides more flexibility. Using gamma prior has the advantage of yielding closed-form posterior distribution for ki and hence facilitates employing computationally efficient Gibbs sampler. Therefore, we use ki2˜Gamma(ak, bk) such that the mean of variance is inversely proportional to depth parameter, i.e.,
The constant c is a normalizing term to ensure
which is obtained by setting
Therefore, we only have one free hyperparameter ak
The above hierarchical model yields the following full joint distribution:
which immediately provides the following posterior distributions using the fact that the full conditional posterior distribution for each parameter is simply the product of the terms including that variable with other terms serving as a normalization constant to ensure the resulting probability integrates to one. This method is called completion of terms:
The Woodbury Matrix inversion formula is used to calculate A−1 when n<p to obtain more stable results and save in computations by converting a p×p square matrix inversion to a n×n one. We apply a Gibbs sampler with burn-in iterations 1000 and computation iterations 5000 to obtain the approximate posterior distributions for the model parameters βi, σ. This process is repeated for all genes g∈G using all samples s∈S, where G and S are the set of gene ids and sample ids, respectively.
In order to evaluate the disruption of a target gene g for any given sample, we obtain activation score Ag(new) and inconsistency score Cg(new), where the first shows the level of gene expression, which may be consistent with its regulatory network, and the second shows the deviation from the expected value pointing to deregulation of the gene (potentially associated with somatic mutations).
Performing Module 2 using training samples from both normal and cancer cohorts provides results in the form of a function bank, where each function corresponds to a specific gene. This function bank is then used in Module 3 to analyze test samples to identify potential inconsistencies. Thus, this module performs gene expression level prediction for all genes. For each gene, we extract the expression levels of the ancestor genes as well as the self-epigenetic information for all samples. Then, we predict expression level of this specific gene for all samples using the corresponding function learned for this gene. The prediction process provides the conditional posterior distribution for the expression level of this gene. We use the maximum a-posteriori (MAP) method to obtain the expected gene expression levels.
In order to calculate consistency scores for un-isolated target genes for which a function is learned, we note that the predictive distribution for the RNA expression of any gene for each new test sample ynew is obtained by marginalizing out the model parameters from the conditional posterior distribution for given input xnew (self-epigenetic info and ancestors expression levels):
ƒ(ynew|xnew)=∫ƒ(ynew|xnew,β,σ2)ƒ(β,σ2|y,X)dβdσ2
While the first term, which is the conditional distribution, is available in-closed form, the second term which is the posterior distribution of model parameters is not. This distribution can be approximated with the following expression, where the realizations of model parameters (β(i), σ2
The above distribution is a Gaussian mixture model (GMM) with large number of equi-probable components of mean (Ψ(xnew)Tβ(i)) and variance (σ2
ƒ(ynew|xnew)=N(ynew;μy
μy
where ∥.∥2 is the matrix induced norm. Based on this distribution, we calculate the z-score or the equivalent likelihood for the observed value as follows:
Moreover, due to the complexity of the underlying biological process for each gene and different level of inherit randomness, natural regularity and impact of unknown factors, the predictive power of the learned functions maybe significantly different for each gene. Accordingly, we consider the average empirical predictability of each gene for normal samples as a ground level for the consistency check. Hence, only cancer samples having consistency levels far below the average inconsistency of normal samples are reported as inconsistent samples. The following normalized likelihood is used:
where n0 and n1 are the number of normal and cancer samples and a is a tuning parameter between 0 and 1 in order to push different emphasis on normal and cancer cohorts. Lower values for α are chosen in order to emphasis more on the normal cancers and compensate for the lower number of normal samples. In this invention, we arbitrarily set α= 1/10, which is almost equal to the ratio of normal samples to cancer samples in the training set for TCGA Breast Cancer dataset. The inequality becomes equality if the variances of the predictive distribution are equal for all samples. The above process is repeated for all genes in parallel.
In addition to consistency score, the activation score of each gene is obtained using the gene expression level distribution modeled as a normal distribution;
where μ and σ is the mean and standard deviation of the normal distribution learned for each gene expression level after iteratively excluding the outliers. The postscript g is omitted for notation convenience. A similar normalization is used for activation scores.
As discussed above, the application of this module is to use the trained model on top of the regulatory network to predict a desired target gene expression level for a given sample based on the target gene epigenetics as well as expression levels of the genes playing transcription regulations roles in the utilized regulatory tree. In
To further illustrate the association between the gene expression level inconsistencies with somatic mutations as established in this module,
In
Gene expression levels may deviate from predicted values by the presence of somatic mutations in the regulatory network, resulting in loss/gain of regulatory functions. That is, a mutation in any of regulator genes may impact its proper role in regulating gene expression and impose a deviation on the target gene expression. Module 4 of the within method provides a methodology which assesses the impact of somatic mutations in regulatory genes on the inconsistency scores for downstream target genes. Accordingly, this module takes the activation and consistency scores provided by Module 3 and, for each new test sample, identifies the genes that are significantly inconsistent and examines if they are potentially driven by CNV aberrations or somatic mutations in the current gene or in its regulatory subnetwork.
To begin, the inconsistencies driven by CNV aberration events are identified. If the inconsistency is due to overexpression of the gene and the gene experiences copy number amplifications (CNV>0.5), then CNV amplification is reported as the main cause of the inconsistency. Likewise, if copy number deletion (CNV<−0.5) is associated with the down expression of the gene, CNV deletion is considered to be the inconsistency driver.
For the genes which do not experience relevant copy number aberrations, the inconsistencies are likely to arise from mutations in the upstream regulatory network of the gene that impacts the transcription of the downstream gene. The closer the regulatory gene is to the downstream target gene, the more influence is expected on the downstream gene expression level inconsistency. Therefore, Module 4 assigns a global depth penalization parameter 0<α≤1, such that the impact of mutated gene i with d19 hops to the root node g is scaled with value (α)d
To quantify the impact of mutations in the regulatory tree, we count all non-silent mutations affecting either the target gene or its regulators for each of the cancer samples scaled by their absolute inconsistency levels and the depth penalization factor. In general, the functional impact of gene h mutations on the expression of gene g, denoted by ƒg(h) is calculated as:
where Pg is the set of regulatory ancestor genes of gene g (i.e., the leaves of the corresponding regulatory tree), M(i) is the set of genes that are mutated in sample j, Zc(g)(j) is the inconsistency score of gene g at sample j and 1(.) is the indicator function. The role of the denominator is to normalize Σh∈P
The flowchart in
In to illustrate the prediction power of the method of this invention, its performance is compared to several near-optimal state-of-the art point-estimators including LASSO, RIDGE and Elastic-Net Regressions.
In order to demonstrate the accuracy of the inventive method, we first learn a Gaussian distribution for each gene expression level via maximum likelihood method after iteratively excluding significant outliers. We begin by learning a Gaussian distribution for the samples at each iteration and then we remove the samples which are not in the second standard deviation neighborhood of the mean value. In the subsequent iteration we repeat the process for the remaining samples until the algorithm converges and no more outlier exists. The empirical distribution for a sample gene PTEN and the learned Normal distribution is presented in
Next, we divide gene expression levels into three states (neutral, over-expressed and under-expressed) based on predefined thresholds. Thresholds are arbitrarily set such that the probability of down-expression, neutral and over-expression states become 10%, 80% and 10%, respectively. Module 3 provides patient-specific gene expression predictions for all 839 un-isolated genes. The state change rate is calculated via averaging state change events over all genes and patients. The results are calculated for each cohort separately. If the observed and predicted expression state for sample i and gene g are sg(i) and ŝg(i), respectively, the state change rate is calculated as:
In Table 1, the prediction error is calculated for some important genes that are highly associated with cancer and have a valid set of upstream regulatory genes in the global pathway network. It is seen that the within method outperforms the state of the art sparsity-imposing regression models with the additional advantage of providing full posterior distribution for the gene expression level.
Another important observation is that despite the fact of higher contribution of cancer samples to the model training due to the larger number of cancer samples with respect to normal samples, the normal cohort presents a better predictability. This observation holds for all models and reveals that the functional states of the gene expressions in normal tissues are more consistent with the upstream regulatory network.
The fact of higher consistency between the predicted and observed values of target gene expression levels in normal samples compared to cancer samples is also observed in
This figure demonstrates that the nonlinear CNV terms with coefficients obtained from the learning process well define the RNA expression level for ERBB2 with some minor variability due to other terms such as DNA methylation and ancestor gene expression levels. In fact, the coefficients of DNA methylation and majority of ancestors are explicitly removed from the predictor list by LASSO and Elastic Net Methods and also notable is that the within invention assigns negligible coefficients for DNA methylation.
On the other hand, the RNA expression level for GATA3 is more influenced by DNA methylations as well as upstream regulatory network. The expected negative sign for DNA methylation coefficients are suggestive of an inverse relationship between the gene expression level and DNA methylation for both genes. Finally, for GATA3, the upstream regulatory network plays a crucial role in regulating the expression of this gene, suggesting that most of the variation of this gene's expression in breast cancer arises primarily due to the activity of transcription factors. The regression coefficients estimated by the method for two genes ERBB2 and GATA3 provided in Table 2 and 3 reveal that the regression coefficients can be significantly different for genes due to high heterogeneity of the gene regulation functionalities.
An important source of inconsistency is due to mutations in the target gene's upstream regulatory network. Noting the fact that the impact of expression level of regulatory genes is already captured by the method if the predicted value of target gene expression level is not consistent with the observed value then we infer that the regulatory network does not play its regulatory role properly. This malfunction of the regulatory network most likely arises from somatic mutations in the regulatory network that prevents its genes or product proteins from performing their functions (complex formation, gene transcription, protein activation and . . . ) properly which in turn impacts the downstream target gene expression level.
As an illustrative example, the functional impact of somatic mutations on the dysregulation of gene PTEN is depicted in
This application claims priority to U.S. Provisional Application No. 62/210,502, filed Aug. 27, 2015, which is specifically incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB2016/055092 | 8/26/2016 | WO | 00 |
Number | Date | Country | |
---|---|---|---|
62210502 | Aug 2015 | US |