A. Field of the Invention
The present invention relates to systems and methods for constructing gene network models and determining relationships between genes.
B. Description of the Related Art
One of the most important aspects of current research and development in the life sciences, medicine, drug discovery and development and pharmaceutical industries is the need to develop methods and devices for interpreting large amounts of raw data and drawing conclusions based on such data. Bioinformatics has contributed substantially to the understanding of systems biology and promises to produce even greater understanding of the complex relationships between components of living systems. In particular, with the advent of new methods for rapidly detecting expressed genes and for quantifying expression of genes, bioinformatics can be used to predict potential therapeutic targets even without knowing with certainty, the exact roles a particular gene(s) may play in the biology of an organism.
Simulation of genetic systems is a central topic of systems biology. Because simulations can be based on biological knowledge, a network estimation method can support biological simulation by predicting or inferring previously unknown relationships.
In particular, development of microarray technology has permitted studies of expression of a large number of genes from a variety of organisms. A large amount of raw data can be obtained from a number of genes from an organism, and gene expression can be studied by intervention either by mutation, disease or drugs. Finding that a particular gene's expression is increased in a particular disease or in response to a particular intervention may lead one to may then be modified to increase potency, stability, or other properties, and the modified compounds retested in the assay. This approach returns little or no information on the mechanisms of action or cellular pathways affected by the candidates.
A second approach to drug discovery involves testing numerous compounds for a specific effect on a known molecular target, typically a cloned gene sequence or an isolated enzyme or protein. For example, high-throughput assays can be developed in which numerous compounds can be tested for the ability to change the level of transcription from a specific promoter or the binding of identified proteins. Although the use of high-throughput screens is a powerful methodology for identifying drug candidates, again it provides little or no information about the effects of a compound at the cellular or organismal level, in particular information concerning the actual cellular pathways affected.
In fact, analysis of the pathway of efficacy and toxicity of candidate drugs can consume a significant fraction of the drug development process (see, e.g., Oliff et al., 1997, “Molecular Targets for Drug Development,” in DeVita et al. Cancer: Principles & Practice of Oncology 5th Ed. 1997 Lippincott-Raven Publishers, Philadelphia). Therefore, methods of improving this analysis are of considerable current value.
In the past, it has been possible to glean some information to some extent about the pathways and mechanisms occurring inside a biological system of interest (including pathways of drug action) by simply observing the system's response to known inputs. The response observed has typically been gene expressions (i.e., mRNA abundances) and/or protein abundances. The inputs are experimental perturbations including genetic mutations (such as genetic deletions), drug treatments, and changes in environmental growth conditions.
However, it has been a usually hopeless task to try to infer the details of the system simply from the observed input-output relationships. Even if a pathway hypothesis is available, it has not been easy to determine if differential experiments provides adequate or efficient tests or confirmation of the pathway hypothesis. And even with such experiments, it has not always been known how to interpret their results in view of the pathway hypothesis.
Despite much effort and elaborate measurements, little concrete progress has been made in reconstructing the pathways of biological systems, much less their time-dependent interactions, from simple observations such as protein and mRNA concentrations (McAdams et al., 1995, Circuit simulation of genetic networks, Science 269:650-656; Reinitz et al., 1995, Mechanism of eve stripe formation, Mechanisms of Development 49:133-158).
One approach to this problem has been bringing modeling tools from other disciplines to bear on this problem. For example, boolean representations and network models familiar to the electrical engineering community have been applied to biological systems (Mikulecky, 1990, Modeling intestinal absorption and other nutrition-related processes using PSPICE and STELLA, J. of Ped. Gastroenterology and Nutrition 11:7-20; McAdams et al., 1995, Circuit simulation of genetic networks, Science 269:650-656). One application has been to the control of gene transcription during development, particularly during sequential organism development (Yuh et al., 1998, Genomic Cis-regulatory logic: Experimental and computational analysis of a sea urchin gene, Science 279:1896-1902).
The difficulties noted in developing and testing models of biological pathways in organisms has hindered effective use of the great advances recently made in biological measurement techniques. While traditional bioinformatic techniques have enabled the simultaneous study of thousands of molecular signals and their patterns of co-regulation, they suffer from the inability to reveal causal relationships among molecular signals within cells. This is a significant setback since the combination of the cause and effect relationships between all the signals operating in a cell, rather than the isolated actions of individual signals, is typically what drives and regulates processes.
Thus, much effort is being expended to develop methods for determining cause and effect relationships between genes, which genes are central to a biological phenomenon, and which genes' expression(s) are peripheral to the biological process under study. Although such peripheral gene's expression may be useful as a marker of a biological or pathophysiological condition, if such a gene is not central to physiological or pathophysiological conditions, developing drugs based on such genes may not be worth the effort. In contrast, for genes identified to be central to a process, development of drugs or other interventions may be crucial to developing treatments for conditions associated with altered expression of genes.
Increasingly, mathematical methods are being employed to determine relationships between expressed genes. However, accurately deriving a gene regulatory network from gene expression data can be difficult. Several mathematical methods have been used to infer gene regulatory networks such as differential equation models (Chen et al. 1999; de Hoon et al. 2003), state space models (Rangel et al. 2004), Boolean network models (Akutsu et al. 1998; Shmulevich et al. 2002) and Bayesian network models (Friedman et al. 2000; Imoto et al. 2002). See also U.S. patent application Ser. No. 10/259,723, filed Sep. 26, 2002, U.S. patent application Ser. No. 10/722,033, filed Nov. 25, 2003, and U.S. patent application Ser. No. 10/716,330, filed Nov. 18, 2003, which are incorporated herein fully by reference).
In certain embodiments, this invention includes the use of time course expression data in a Bayesian network model with nonparametric regression. The Bayesian network model estimated from time course data by dynamic Bayesian network model can be combined with gene knockdown expression data, and regulatory relations estimated from knock down microarrays to construct an accurate gene network that reflects the affect of a chemical compound on the system. The invention can be applied to identify a gene network affected by an agent, identify a target gene in a system containing a gene network, or to provide a service that receives raw data from a party and identify target genes for the party based on a gene network model constructed according to the present invention.
The present invention utilizes computational strategies for combining different types of biological information to construct an accurate gene network. In some embodiments of the invention, the method is used to discover druggable gene networks, i.e. those that are most strongly affected by a chemical compound. To illustrate the method, we use two types of microarray data: One is gene expression data obtained by measuring transcript abundance responses over time following treatment with the chemical compound. The other is gene knock-down expression data, where one gene is knocked-down for each microarray.
First, we estimate dynamic relationships denoted by GT between genes based on time-course data by using dynamic Bayesian networks. Second, in gene knock-down expression data, since we know the information of knocked-down genes, possible regulatory relationships between knocked-down gene and its regulatees can be derived. We denote this information by R. Finally, the gene network GK is estimated by gene knock-down data denoted by XK together with GT and R by using Bayesian networks based on multi-source biological information. The key idea for estimating a gene network based on multi-source biological information is to use GT and R as the Bayesian prior probability of GK. In the present invention, we extend the prior probability of graph in order to use prior information represented as continuous values. After estimating a gene network, we have also developed a gene network analysis tool called iNET that is an extended version of G.NET for extracting biologically plausible information from the estimated gene network. The iNet tool provides a computational environment for various path searches among genes with annotated gene network visualization.
The method of the present invention can estimate druggable gene networks as directed graphs, that are sub-networks of the tissue-specific network. In this method, the edge direction is very important information and selection of compound-related genes is necessary. Our method is also capable of using various kinds of biological data, which further enhances the accuracy of the estimated network, and enables not merely the identification of affected genes, but also the elucidation of their dependency as the network.
In one embodiment of the present invention, we use Bayesian networks and dynamic Bayesian networks for estimating gene networks from gene knock-down and time-course microarray data, respectively. In this section, we briefly describe these two network models and then elucidate how we combine multi-source biological information to estimate more accurate gene networks.
Suppose that we have the observational data X the set of p random variables X={X1, . . . Xp) and that the dependency among p random variables, shown as a directed graph G, is unknown and we want to estimate it from X. In gene network estimation based on microarray data, a gene is regarded as a random variable representing the abundance of a specific RNA species, and X is the microarray data. From a Bayes approach, the optimal graph is selected by maximizing the posterior probability of the graph conditional on the observed data. By the Bayes' theorem, the posterior probability of the graph can be represented as
where p(G) is the prior probability of the graph, p(X|G) is the likelihood of the data X conditional on G and p(X)) is the normalizing constant and does not depend on the selection of G. Therefore, we need to set p(G) and compute p(X|G) for the graph selection based on p(G|X).
The prior probability of the graph p(G) enables us to use biological data other than microarray data to estimate gene networks and the likelihood p(X|G) can be computed by Bayesian networks and dynamic Bayesian networks from gene knock-down and time-course microarray data, respectively. As those of skill in the art will recognize, the present invention can be broadly applied to biological data other than gene knock-down and time-course microarray data. We elucidate how we construct p(G|X) in the following sections.
Bayesian networks are a graphical model that represents the causal relationship in random variables. In the Bayesian networks, we use a directed acyclic graph encoding Markov relationship between connected nodes. Suppose that we have a set of random variables X={X1, . . . , Xp) and that there is a causal relationship in X by representing a directed acyclic graph GK. Bayesian networks then enable us to compute the joint probability by the product of conditional probabilities
where Paj is the set of random variables corresponding to the direct parents of Xj in GK. In gene network estimation, we regard a gene as a random variable representing the abundance of a specific RNA species, shown as a node in a graph, and the interaction between genes is represented by the direct edge between nodes.
Let XK be an N×p gene knock-down data matrix whose (i, j)-th element xj|Di corresponds to the expression data of j-th gene when Di-th gene is knocked down, where j=1, . . . , p and i=1, . . . , N. Here we assume that i-th knock-down microarray is measured by knocking-down Di-th gene. Since microarray data take continuous variables, we represent the decomposition (1) by using densities
where Θ=(θ′1, . . . , θ′p) is a parameter vector, paj|Di is the expression value vector of Paj measured by i-th knock-down microarray. Hence, the construction of the graph GK is equivalent to model the conditional probabilities fj (j=1, . . . , p), that is essentially the same as the regression problem. For constructing fj(xj|Di|paj|Di, θj), we assume the nonparametric regression model with B-splines of the form
where paj|Di(k) is the k-th element of paj|Di, εj|Di˜i.i.d.N(0, σ2) for i=1, . . . N, and mjk (k=1, . . . , |Paj|) are smooth functions constructed by B-splines as mjk(x)=Σm=1M
and b(jk)m(x) (m=1, . . . , Mjk) are parameters and B-splines, respectively.
The likelihood p(XK|GK) is then obtained by
p(XK|GK)=∫fBN(XK|Θ,GK)p(Θ|λ,GK)dΘ, (2)
where p(Θ|λ, GK) is the prior distribution on the parameter Θ specified by the hyperparameter λ. The high-dimensional integral can be asymptotically approximated with an analytical form by the Laplace approximation and Imoto et al. defined a graph selection criterion, named BNRC, of the form
where
r is the dimension of Θ, and Θ is the mode of lλ(Θ|XK). The network structure is learned so that BNRC (GK) decreases by the greedy hill-climbing algorithm. We should note that the solution obtained by the greedy hill-climbing algorithm cannot be guaranteed as the optimal. To find better solution, we repeat the greedy algorithm and choose the best one as GK. It happens quite often that the likelihood p(XK|GK) gives almost the same values for several network structures, construction an effective p(GK) based on various kinds of biological information is a key technique. We elucidate how we construct p(GK) in the section entitled “Combining Multi-Source Biological Information for Gene Network Estimation.”
Dynamic Bayesian networks represent the dependency in random variables based on time-course data. Let X(t)={X1(t), . . . , Xp(t)} be the set of p random variables at time t (t=1, 7). In the dynamic Bayesian networks, a directed graph that contains p nodes is rewritten as a complete bipartite graph that allows direct edges from X(t) to X(t+1), where t=1, . . . , T−1. The directed graph GT of the causal relationship among p random variables is then constructed by estimating the bipartite graph defined above. Under GT structure, we then have the decomposition
where Paj(t) is the set of random variables at time t corresponding to the direct parents of Xj in GT.
Let XT be a T×p time-course data matrix whose (t,j)-th element xj(t) corresponds to the expression data of j-th gene at time t, where j=1, . . . , p and t=1, . . . , T. As we described in the Bayesian networks, the decomposition in (3) holds by using densities
where Ξ=(ξ′1, . . . , ξ′p)′ is a parameter vector, paj(t) is the expression value vector of direct parents of Xj measured at time t. Here we set paj(0)=Ø. We can construct fDBN by using nonparametric regression with B-splines in the same way of the Bayesian networks. Therefore, by replacing fBN by fDBN in (2), Kim et al. proposed a graph selection criterion for dynamic Bayesian networks, named BNRCdynamic, with successful applications.
Imoto et al. proposed a general framework for combining biological knowledge with expression data aimed at estimating more accurate gene networks. In Imoto et al., the biological knowledge is represented as the binary values, e.g. known or unknown, and is used for constructing p(G). In reality, there are, however, various confidence in biological knowledge in practice. Bernard and Hartemink constructed p(G) using the binding location data that is a collection of p-values (continuous information). In the method of the present invention, we construct p(G) by using multi-source information including continuous and discrete prior information.
Let Zk is the matrix representation of k-th prior information, where (i, j)-th element zij
represents the information of “gene i→gene j”. For example, (1) If we use a prior network Gprior for Zk, Zii
takes 1 if e(i, j)εGprior or 0 if e(i, j)∉Gprior, Here, e(i,j) denotes the direct edge from gene i to gene j. (2) By using the gene knock-down data for Zk, Zij
p(εij)=πije
where πIJ=Pr(eij=1). For constructing πij, we use the logistic model with linear predictor ηij=Σk=1Kωk(zi(k)j−ck) as πij={1+exp(−ηij)}−1, where ωk and ck (k=1, . . . , K) are weight and baseline parameters, respectively. We then define a prior probability of the graph based on prior information Zk (k=1, . . . , K) by
As those of skill in the art will recognize, the method of the present invention can accommodate one or more types of data as prior probability. This prior probability of the graph assumes that edges e(i,j) (i,j=1, . . . , p) are independent of each other. In reality, there are several dependencies among ei,j's such as p(eij=1)<p(eij=1|eki=1), and so on, but we consider adding such information into p(G) to be premature by the quality of such information.
While
The computer system shown in
A number of program modules may be stored in the drives and RAM 2925, including an operating system 2935, one or more application programs 2936, other program modules 2937, and program data 2938. A user may enter commands and information into the personal computer 2920 through a keyboard 2940 and pointing device, such as a mouse 2942. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 2921 through a serial port interface 2946 that is coupled to the system bus, but may be connected by other interfaces, such as a parallel port, game port, or a universal serial bus (USB). A monitor 2947 or other type of display device is also connected to the system bus 2923 via an interface, such as a display controller or video adapter 2948. In addition to the monitor, personal computers typically include other peripheral output devices (not shown), such as speakers and printers.
The above computer system is provided merely as an example. The invention can be carried out in a wide variety of other configurations. Further, a wide variety of approaches for collecting and analyzing data related to quantifying gene relatedness is possible. For example, the data can be collected, nonlinear models built, the models' effectiveness measured, and the results presented on different computer systems as appropriate. In addition, various software aspects can be implemented in hardware, and vice versa.
The following examples are provided by way of illustration only and not by way of limitation. Those of skill in the art will readily recognize a variety of non-critical parameters that could be changed or modified to yield essentially similar results.
To demonstrate how the present invention operates, we analyzed expression data from human endothelial cells and generated new time-course data that reveal the responses of human endothelial cell transcripts to treatment with the anti-hyperlipidaemia drug fenofibrate. We also generated new data from 270 gene knock-down experiments in human endothelial cells. The fenofibrate-related gene network is estimated based on fenofibrate time-course data and 270 gene knock-down expression data by the method of the invention. The estimated gene network reveals gene regulatory relationships related to PPAR-α, which is known to be activated by fenofibrate. Our computational analysis suggests that this computational strategy based on gene knock-down and drug-dosed time-course microarrays provides a new and improved tool in druggable gene discovery.
We measured the time-responses of human endothelial cell genes to 25 μM fenofibrate. The expression levels of 20,469 probes were measured by CodeLink™ Human Uniset 120K at six time-points (0, 2, 4, 6, 8 and 18 hours). Here time 0 means the start point of this observation and just before exposure to the fenofibrate. In addition, we measured this time-course data as the duplicated data in order to confirm the quality of experiments.
Since our fenofibrate time-course data are duplicated data and contain six time-points, there are 26=64 possible combinations to create a time-course dataset. We should fit the same regression function to a parent-child relationship in the 64 datasets. Under this constraint, we consider fitting nonparametric regression model to the connected data of 64 datasets. That is, if we consider gene i→gene j, we will fit the model xj(c)(t)=mj(xi(c)(t−1))+εf(t), where xj(c)(t) is the expression data of gene j at time t in the c-th dataset for c=1, . . . , 64. In the Bayesian networks, the reliability of estimated edges can be measured by using the bootstrap method. For time-course data, several modifications of the bootstrap method are proposed such as block resampling, but it is difficult to apply these methods to the small number of data points generated by short time-courses. However, by using above time-course modeling, we can define a method based on the bootstrap as follows: Let D={D(1), . . . , D(64)} be the combinatorial time-course data of all genes. We randomly resample D(c) with replacement and define a bootstrap sample D*={D*(1), . . . , D*(64)}. We then re-estimate a gene network based on D*. We repeat 1000 times bootstrap replications and obtain ĜT*, . . . , ĜT*1000, where ĜT*B is the estimated graph based on the B-th bootstrap sample. The estimated reliability of edge can then be used as the matrix representation of the first prior information Z1 as zi(1)j=#{B|e(i,j)εĜT*B, B=1, . . . , 1000}/1000.
For constructing exemplary gene networks, we newly created 270 gene knock-down data by using siRNA. We measured 20,469 probes by CodeLink™ Human Uniset 120K for each knock-down microarray after 24 hours of siRNA transfection. The knock-down genes were mainly transcription factors and signaling molecules. Let {tilde over (x)}Di=({tilde over (x)}1|Di, . . . , {tilde over (x)}p|Di)′ be the raw intensity vector of i-th knock-down microarray. For normalizing expression values of each microarray, we computed the median expression value vector ν=(ν1, . . . , νp)′ as the control data, where νj=median i({tilde over (x)}j|D
In 270 gene knock-down microarray data, we know which gene is knocked-down for each microarray. Thus, when we knocked down gene Di, genes that significantly change their expression levels can be considered as the direct regulatees of gene Di. We measured this information by computing corrected log-ratio as follows: The fluctuations of the log-ratios depend on their sum of sample's and control's intensities. From the normalized MA transformed data, we then obtained the conditional variance sj=Var[log(xj|D
For estimating fenofibrate-related gene networks from fenofibrate time-course data and 270 gene knock-down data, we first defined the set of genes that are possibly related to fenofibrate as follows: First, we extracted the set of genes whose variance-corrected log-ratios, |log(xj|D
By converting the estimated dynamic network and knock-down gene information into the matrix representations of the first and second prior information Z1 and Z2, respectively, we estimated the gene network ĜK based on Z1, Z2 and the knock-down data matrix XK. For extracting biological information from the estimated gene network, we first focused on lipid metabolism-related genes, because the clusters related this function were significantly changed at 18 hours microarray. In the estimated gene network, there were 42 lipid metabolism-related genes and PPAR-α (Homo sapiens peroxisome proliferative activated receptor, alpha) is the only transcription factor among them. Actually, PPAR-α is a known target of fenofibrate. Therefore, we next focused on the node downstream of PPAR-α.
In particular, we show one sub-network having PPAR-α as a root node in
From the point of view of pharmacogenomics, it is very important to know druggable gene networks. Our gene networks have the potential to predict the mechanism-of-action of a chemical compound, discover more effective drug targets, and predict side effects arising from exposure to a given drug. The present invention provides a computational method for discovering gene networks relating to a chemical compound. We used gene knock-down microarray data and time-course response microarray data for this purpose and combine multiple information obtained from observational data in order to estimate accurate gene networks under a Bayesian statistics framework. We illustrated the entire process of the method using an actual example of gene network inference in human endothelial cells. Using fenofibrate time-course data and data from gene knock-downs in human endothelial cells, we successfully estimated a gene network related to the drug fenofibrate, which is a known agonist of PPAR-α. In the estimated gene network, PPAR-α has many direct and indirect regulatees including lipid metabolism related genes and this result indicates PPAR-α works as a trigger of the estimated fenofibrate-related network. There are many known relationships in the candidate regulatees of PPAR-α and we could find the relationship between PPAR-α and RXR-\alpha in the estimated network. Peroxisome proliferator-activated receptors (PPARs) are ligand-activated transcription factors expressed by endothelial cells and several other cell types. They are activated by ligands such as naturally occurring fatty acids and synthetic fibrates. Once activated, they heterodimerize with the retinoid-X-receptor (RXR) to activate the transcription of target genes. Many of these genes encode proteins that control carbohydrate and glucose metabolism and down-regulate inflammatory responses.
The present invention was also applied in a study of the human endothelial cell (EC) apoptosis process.
To understand the kinetics of transcriptome change during EC apoptosis, we performed a time course experiment. HUVEC (a pooled culture from 10 donors) were exposed to SFD conditions identical to those used in the study described above. RNA was prepared from the cultures before the induction of apoptosis (time 0) and after 0.5, 1.5, 3, 6, 9, 12 and 24 hours, hybridised to CodeLink Human Uniset 20K gene arrays. This experiment was repeated independently three times. The regulation of transcripts during EC apoptosis can be visualized in the scatter plots shown in
We generated a gene network (
Of the 488 edges contained in the modified gene network shown in
Although the foregoing invention has been described in some detail by way of illustration and example for purposes of clarity of understanding, it is readily apparent to those of ordinary skill in the art in light of the teachings of this invention that certain changes and modifications may be made thereto without departing from the spirit or scope of the appended claims. All publications, patents, and patent applications cited herein are hereby incorporated by reference in their entirety for all purposes.
This application claims the benefit of Provisional Patent Application No. 60/725,701, filed Oct. 12, 2005, which is incorporated by reference herein in its entirety.
Number | Date | Country | |
---|---|---|---|
60725701 | Oct 2005 | US |