1. Field of the Invention
The present invention relates to a technology for extracting relations between a plurality of genes that are elicited in specific contexts, by generating various contexts based on expression data on the expression amounts of the genes.
2. Description of the Related Art
Recently, with advances in gene analysis technology, the expression states of several thousand to tens of thousands of genes can be measured at once. Accordingly, techniques for extracting intergenic expression relations from the expression states of many genes are under development (for example, see domestic re-publication of PCT international publication for Patent Applications No. WO2002/048915, and Homin K. Lee, Amy K. Hsu, Jon Sajdak, Jie Qin and Paul Pavlidis, “Coexpression Analysis of Human Genes Across Many Microarray Data Sets,” Genome Research 14: 1085-1094, 2004).
Examples of the extracted intergenic expression relations include promotion and suppression relations between genes: if the expression amount of a certain gene A becomes larger, that of another gene B becomes larger (promotion), or if the expression amount of the gene A becomes larger, that of another gene B becomes smaller (suppression). Specifying such an expression relation helps to uncover the causes of a disease and to treat the disease.
However, an intergenic expression relation appears to be elicited only in a specific context (a gene expression environment). Therefore, if plural pieces of expression data acquired in varied contexts are analyzed together without regard to their contexts, it is difficult to extract expression relations. Examples of the contexts include a spatial context such as a context of tissues or a context of intercellular sites, and a temporal context such as a context of growth periods and a context of cell cycles. The context as the gene expression environment is considered to be complicated since many factors influence one another.
An apparatus according to an aspect of the present invention, which is an apparatus for extracting a relation between a plurality of genes based on expression data regarding expression amounts of the genes, includes: a generating unit that generates contexts based on expression data of a plurality of genes satisfying a predetermined relation, the contexts representing an environment for an expression of a gene; and a determining unit that determines a relation between the genes in the contexts generated.
A method according to another aspect of the present invention, which is a method for extracting a relation between a plurality of genes based on expression data regarding expression amounts of the genes, includes: generating contexts based on expression data of a plurality of genes satisfying a predetermined relation, the contexts representing an environment for an expression of a gene; and determining a relation between the genes in the contexts generated.
A computer-readable recording medium according to still another aspect of the present invention stores a computer program that causes a computer to execute the above method.
The above and other objects, features, advantages and technical and industrial significance of this invention will be better understood by reading the following detailed description of presently preferred embodiments of the invention, when considered in connection with the accompanying drawings.
Exemplary embodiments of the present invention will be explained in detail with reference to the accompanying drawings.
A concept of expression relation extraction performed by the genetic relation extracting apparatus according to this embodiment using contexts will be explained first.
An example of expression relations elicited in specific contexts is shown in
An example of limited expression relations in a combination of contexts is shown in
As can be seen, the genetic relation extracting apparatus according to this embodiment extracts expression relations among genes that are elicited in specific contexts using the contexts. In order to extract the expression relations using the contexts, the way of selecting the contexts is important.
The genetic relation extracting apparatus according to this embodiment selects contexts in which many genes are synchronously expressed; that is, it selects contexts based on the synchronous expression of partial gene groups.
Alternatively, the apparatus can select contexts based on synchronous suppression of the partial gene groups instead of the synchronous expression thereof.
A configuration of the genetic relation extracting apparatus according to this embodiment will be explained.
The network extracting unit 110 receives as input an expression amount matrix composed of a plurality of samples (plural pieces of expression data), creates a correlation network based on correlations among the expression amounts of genes, and extracts partial networks from the created correlation network.
A context is configured by a plurality of samples and corresponds to a partial matrix configured by a plurality of rows in the expression amount matrix. In
In the correlation network, the higher the correlation coefficient between the expression amounts of two genes, that is, the higher the correlation between the two genes, the closer to each other the two genes are arranged.
For example, the correlation coefficient between the genes “TJP1” and “NCKAP1” is 0.8709, and these two genes are therefore arranged close to each other. If the expression amount of a gene i in a sample j is xij, the correlation coefficient rαβ of a gene pair (α, β) is represented by
The network extracting unit 110 extracts, from among all the genes, synchronous gene groups, which are groups of genes whose correlation coefficients are equal to or higher than a predetermined value (0.8, for example). When a correlation network is created by connecting with a line each pair of genes whose correlation coefficient is equal to or higher than the predetermined value, the synchronous gene groups correspond to respective partial networks of the correlation network. The network extracting unit 110, therefore, extracts a plurality of partial networks from the correlation network.
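The equation itself is not reproduced in this text. Assuming that the coefficient is the standard Pearson product-moment correlation computed over the n samples of a context (an assumption, not something this text confirms), it would take the following form. The symbols n and the per-gene means are introduced here only for illustration:

```latex
r_{\alpha\beta}
  = \frac{\sum_{j=1}^{n} (x_{\alpha j} - \bar{x}_{\alpha})(x_{\beta j} - \bar{x}_{\beta})}
         {\sqrt{\sum_{j=1}^{n} (x_{\alpha j} - \bar{x}_{\alpha})^{2}}\,
          \sqrt{\sum_{j=1}^{n} (x_{\beta j} - \bar{x}_{\beta})^{2}}},
\qquad
\bar{x}_{i} = \frac{1}{n} \sum_{j=1}^{n} x_{ij}
```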
In this embodiment, it is assumed that groups of genes in which each correlation coefficient between the expression amounts of the two genes is equal to or higher than the predetermined value, that is, groups of genes in which there is a high positive correlation between each gene pair are synchronous gene groups. Alternatively, groups of genes in which there is a high negative correlation between each gene pair can be assumed as the synchronous gene groups.
In this embodiment, the network extracting unit 110 calculates a correlation coefficient per gene pair, and creates the correlation network that links pairs each having the correlation coefficient equal to or higher than the threshold, thereby extracting the synchronous gene groups. Alternatively, the network extracting unit 110 can extract the synchronous gene groups by clustering the genes according to the correlation coefficients.
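The thresholded correlation network described above can be illustrated with a minimal sketch. It assumes that the expression amount matrix has samples as rows and genes as columns, that the coefficient is the Pearson correlation, and that a synchronous gene group can be read off as a connected component of the thresholded network; the connected-component reading, the function name `synchronous_gene_groups`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def synchronous_gene_groups(expr, genes, threshold=0.8):
    """Return groups of genes linked by correlations >= threshold.

    expr: 2-D array, rows = samples, columns = genes (assumed layout).
    genes: gene names, one per column.
    """
    corr = np.corrcoef(expr, rowvar=False)   # gene-by-gene Pearson matrix
    adjacent = corr >= threshold             # link = strongly correlated pair
    np.fill_diagonal(adjacent, False)        # ignore self-correlation

    seen, groups = set(), []
    for start in range(len(genes)):
        if start in seen or not adjacent[start].any():
            continue  # skip visited genes and genes with no strong partner
        stack, component = [start], []
        seen.add(start)
        while stack:  # depth-first walk over one partial network
            g = stack.pop()
            component.append(g)
            for h in map(int, np.flatnonzero(adjacent[g])):
                if h not in seen:
                    seen.add(h)
                    stack.append(h)
        groups.append(sorted(genes[i] for i in component))
    return groups

# Toy data (hypothetical): g1/g2 move together, g3/g4 move together, g5 is alone.
g1 = np.arange(8.0)
g3 = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
g5 = np.array([1.0, 1.0, -1.0, -1.0, -1.0, -1.0, 1.0, 1.0])
expr = np.column_stack([g1, 2 * g1 + 1, g3, 2 * g3 + 3, g5])
groups = synchronous_gene_groups(expr, ["g1", "g2", "g3", "g4", "g5"])
```

With this toy matrix, `groups` contains the two pairs `["g1", "g2"]` and `["g3", "g4"]`, while `g5` belongs to no group. A stricter reading of the text, in which every pair inside a group must exceed the threshold, would call for clique detection instead of connected components.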
The context generator 120 calculates a typical expression amount for genes included in each of the partial networks extracted by the network extracting unit 110, i.e., for the genes belonging to each of the synchronous gene groups, and generates contexts based on the calculated typical expression amounts, respectively. The contexts correspond to partial samples that are subgroups obtained by dividing a group of all samples into a plurality of segments.
For example, in
As can be seen, the context generator 120 generates the contexts by dividing the samples based on the average expression amounts of the genes belonging to each synchronous gene group. Thus, the context generator 120 can extract each expression relation between the two genes in each specific context.
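As a rough illustration of this division, the sketch below averages the expression amounts of the genes in one synchronous gene group for each sample, builds a histogram of those averages, and cuts the samples at the emptiest interior bin. The rule for choosing the cut point is not specified in this text, so the valley rule, the function name `generate_contexts`, the `bins` parameter, and the toy matrix are assumptions.

```python
import numpy as np

def generate_contexts(expr, group_cols, bins=5):
    """Split samples into a low and a high context for one synchronous group.

    expr: rows = samples, columns = genes (assumed layout).
    group_cols: column indices of the genes in the synchronous gene group.
    bins: number of histogram bins; must be at least 3 so that an
          interior bin exists.
    """
    avg = expr[:, group_cols].mean(axis=1)       # one average per sample
    counts, edges = np.histogram(avg, bins=bins)
    valley = 1 + int(np.argmin(counts[1:-1]))    # emptiest interior bin (assumed rule)
    cut = 0.5 * (edges[valley] + edges[valley + 1])
    low = np.flatnonzero(avg < cut)              # sample indices of the low context
    high = np.flatnonzero(avg >= cut)            # sample indices of the high context
    return low, high, cut

# Toy matrix (hypothetical): columns 0 and 1 form the synchronous gene group.
expr = np.array([
    [1.0, 1.2, 9.0],
    [5.0, 5.4, 2.0],
    [0.8, 1.0, 7.0],
    [5.2, 4.8, 3.0],
    [1.1, 0.9, 8.0],
    [4.9, 5.1, 1.0],
])
low, high, cut = generate_contexts(expr, group_cols=[0, 1])
```

Here samples 0, 2, and 4 fall into the low context and samples 1, 3, and 5 into the high context. Each index array selects the rows of the expression amount matrix that make up the partial matrix for that context.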
The network extracting unit 110 extracts synchronous gene groups for the partial sample corresponding to each context generated by the context generator 120. The context generator 120 generates contexts based on the average expression amounts of the genes belonging to each synchronous gene group extracted by the network extracting unit 110. By repeating this process, it is possible to generate various contexts and extract expression relations each between the two genes in the various contexts.
The repetition of the extraction of synchronous gene groups and the generation of contexts can be finished when a predetermined condition is satisfied, for example, when the number of samples belonging to each of the generated contexts is equal to or smaller than a predetermined value. Alternatively, the repetition can be finished by an instruction from a user.
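The alternation between the two units can be sketched as a simple work queue. The sketch below is only a skeleton under stated assumptions: the two callables `extract_groups` and `split_samples` stand in for the network extracting unit 110 and the context generator 120, the stopping rule is applied per context, and `min_samples` and `max_rounds` are hypothetical parameters.

```python
from collections import deque

def explore_contexts(all_samples, extract_groups, split_samples,
                     min_samples=4, max_rounds=100):
    """Alternate group extraction and context generation until contexts are small.

    extract_groups(samples) -> list of gene groups found within those samples
    split_samples(samples, group) -> list of sample subsets (new contexts)
    Both callables are placeholders supplied by the caller.
    """
    root = tuple(all_samples)
    queue, visited, results = deque([root]), {root}, []
    rounds = 0
    while queue and rounds < max_rounds:
        context = queue.popleft()
        rounds += 1
        groups = extract_groups(context)
        results.append((context, groups))
        if len(context) <= min_samples:
            continue  # size limit reached: do not divide this context further
        for group in groups:
            for sub in map(tuple, split_samples(context, group)):
                if sub and sub not in visited:  # skip empty or repeated contexts
                    visited.add(sub)
                    queue.append(sub)
    return results

# Stub callables (hypothetical) that only demonstrate the control flow.
def one_fixed_group(samples):
    return [("gA", "gB")]

def halve(samples, group):
    mid = len(samples) // 2
    return [samples[:mid], samples[mid:]]

results = explore_contexts(range(8), one_fixed_group, halve, min_samples=2)
visited_contexts = [context for context, _ in results]
```

With eight samples and `min_samples=2`, the loop visits the full sample set, then its two halves, then the four quarters, and stops there. Replacing the stubs with real correlation-based extraction and histogram-based division would give the repeated process described above; a user-issued stop could be modeled by lowering `max_rounds`.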
In this embodiment, the context generator 120 uses the average expression amounts of the genes as the typical expression amounts of the genes belonging to each synchronous gene group. Alternatively, the context generator 120 can use, as the typical expression amount, another value such as a first principal component obtained by singular value decomposition.
In this embodiment, the context generator 120 creates the histogram according to the average expression amounts and divides the samples based on the created histogram. Alternatively, the context generator 120 can obtain two, three, or more contexts by applying clustering, binarization, or the like to the typical expression amounts.
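The singular value decomposition alternative can be sketched as follows. The sketch assumes samples as rows and genes as columns, centers each gene across the samples, and takes each sample's score on the first component as its typical expression amount; the function name `typical_expression_svd`, the sign convention, and the toy matrix are assumptions.

```python
import numpy as np

def typical_expression_svd(expr, group_cols):
    """Score each sample on the first principal component of the group's genes."""
    sub = expr[:, group_cols]
    centered = sub - sub.mean(axis=0)            # center each gene across samples
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, 0] * s[0]                      # sample coordinates on component 1
    # The sign of a singular vector is arbitrary; orient it so that a higher
    # score means a higher average expression (a convention chosen here).
    if np.corrcoef(scores, sub.mean(axis=1))[0, 1] < 0:
        scores = -scores
    return scores

# Toy matrix (hypothetical): columns 0 and 1 form the synchronous gene group.
expr = np.array([
    [1.0, 1.2, 9.0],
    [5.0, 5.4, 2.0],
    [0.8, 1.0, 7.0],
    [5.2, 4.8, 3.0],
    [1.1, 0.9, 8.0],
    [4.9, 5.1, 1.0],
])
scores = typical_expression_svd(expr, group_cols=[0, 1])
```

Because the scores are centered, samples with high expression of the group receive positive values and samples with low expression receive negative values, so the scores can be divided into contexts in the same way as the averages.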
The network comparator 130 compares and displays the various partial networks extracted by the network extracting unit 110. The network comparator 130 compares the various contexts by comparing correlation networks in the respective various contexts.
The network extracting unit 110 performs a pair correlation calculation of calculating the correlation coefficient between the expression amounts of two genes using each of the extracted partial matrixes (step S102), thus creating intergenic correlation matrixes. Each of the intergenic correlation matrixes is a matrix having, as its elements, the correlation coefficients rαβ of the gene pairs (α, β).
Based on the created intergenic correlation matrixes, the network extracting unit 110 extracts synchronous gene groups from the whole gene groups (step S103). Namely, the network extracting unit 110 creates a correlation network based on the intergenic correlation matrixes, and extracts partial networks from the created correlation network.
Thus, by creating the correlation network for specific contexts, the network extracting unit 110 can extract expression relations elicited in the specific contexts. In addition, by extracting the partial networks from the created correlation network, the network extracting unit 110 can extract the synchronous gene groups used to generate new contexts.
In addition to generation of contexts, the context generator 120 calculates an evaluation value for each of the generated contexts. As the evaluation value, the number of samples included in the context, a separation rate of the context from the other contexts, a variation amount thereof from the original contexts, or the like can be used.
The context generator 120 ranks the generated contexts based on their evaluation values (step S202), and presents the user with the ranks of the contexts and their evaluation values (step S203). The context generator 120 prompts the user to select contexts (step S204), and feeds the contexts selected by the user to the network extracting unit 110 as new contexts.
By generating the new contexts using the whole expression amount matrix, the original contexts for generating contexts, and the specific synchronous gene groups, the context generator 120 can extract the expression relations elicited in the specific contexts.
In this embodiment, the context generator 120 presents the user with the generated contexts and their evaluation values so that the user selects contexts. Alternatively, the context generator 120 can automatically select contexts having evaluation values equal to or higher than a predetermined value and feed the selected contexts to the network extracting unit 110.
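The ranking and the automatic selection can be illustrated with a short sketch. It assumes that the evaluation value is the number of samples in each context, which is one of the options named above; the function name `rank_and_select`, the context labels, and the cutoff value are hypothetical.

```python
def rank_and_select(evaluations, min_score):
    """Rank contexts by evaluation value and keep those at or above min_score.

    evaluations: dict mapping a context label to its evaluation value.
    Returns the full ranking (best first) and the automatically selected labels.
    """
    ranked = sorted(evaluations.items(), key=lambda item: item[1], reverse=True)
    selected = [name for name, score in ranked if score >= min_score]
    return ranked, selected

# Hypothetical evaluation values: number of samples in each generated context.
evaluations = {"context_A": 12, "context_B": 30, "context_C": 7}
ranked, selected = rank_and_select(evaluations, min_score=10)
```

In this example the ranking is `context_B`, `context_A`, `context_C`, and the first two are selected automatically. In the interactive variant, `ranked` would be shown to the user and the user's choice would replace `selected`.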
As shown in
The context generator 120 creates a histogram based on the averages calculated for the respective samples, divides the samples from the created histogram, and generates contexts (step S303). The context generator 120 also calculates evaluation values for the respective generated contexts (step S304), and stores the evaluation values together with the respective contexts (step S305).
Thus, the context generator 120 calculates the average expression amounts of the genes included in the synchronous gene groups for the respective samples, divides the samples based on the calculated averages, and generates the contexts.
As explained, according to this embodiment, the network extracting unit 110 creates the correlation network from the expression amount matrix and extracts the synchronous gene groups. The context generator 120 generates contexts based on the expression amounts of the genes belonging to each of the specific synchronous gene groups. The network extracting unit 110 creates the correlation network from the expression amount matrix corresponding to the contexts generated by the context generator 120, thereby extracting the expression relations elicited in the specific contexts.
According to this embodiment, the generation of the contexts by the context generator 120 and the extraction of the synchronous gene groups by the network extracting unit 110 are repeatedly performed, thereby generating various contexts and extracting the expression relations elicited in the various contexts.
In this embodiment, the genetic relation extracting apparatus has been explained. By implementing the configuration of the genetic relation extracting apparatus in software, a genetic relation extracting program having functions similar to those of the genetic relation extracting apparatus can be obtained. A computer that executes this genetic relation extracting program will be explained next.
The RAM 210 is a memory that stores programs and intermediate results of executing the programs. The CPU 220 reads and executes a program from the RAM 210.
The HDD 230 is a disk device that stores programs and data. The LAN interface 240 is used for connecting the computer 200 to another computer through a LAN.
The input and output interface 250 is used for connecting an input device such as a mouse or a keyboard and a display device to the computer 200. The DVD drive 260 reads and writes data from and to a DVD.
A genetic relation extracting program 211 executed by the computer 200 is stored in the DVD, read from the DVD by the DVD drive 260, and installed in the computer 200.
Alternatively, the genetic relation extracting program 211 is stored in a database or the like of another computer system connected to the computer 200 through the LAN interface 240, read from the database, and installed in the computer 200.
The installed genetic relation extracting program 211 is stored in the HDD 230, read to the RAM 210, and executed as a genetic relation extracting process 221 by the CPU 220.
Additional advantages and modifications will readily occur to those skilled in the art. Therefore, the invention in its broader aspects is not limited to the specific details and representative embodiments shown and described herein. Accordingly, various modifications may be made without departing from the spirit or scope of the general inventive concept as defined by the appended claims and their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
2005-075726 | Mar 2005 | JP | national |