This application claims the benefit of Korean Patent Application No. 10-2007-0027795, filed on Mar. 21, 2007, and Korean Patent Application No. 10-2007-0099927, filed on Oct. 4, 2007, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.
1. Field of the Invention
The present invention relates to clustering of gene expression profiles, and more particularly, to a method and apparatus for clustering gene expression profiles by using the Gene Ontology (GO).
The present invention is derived from a study conducted by the Ministry of Information and Communication (MIC) of the Republic of Korea and the Institute for Information Technology Advancement (IITA) as one of a number of new growth engine core IT technology development projects (Assignment Number: 2006-S-007-02; Assignment Name: Ubiquitous Health Care Module System).
2. Description of the Related Art
Genes are expressed in response to specific stimuli. The amount of gene expression varies according to various stimuli (experimental conditions) and time variation. Data obtained by measuring the amount of gene expression by conducting a micro-array experiment is gene expression data, i.e., gene expression profiles.
It is known that genes having similar functions have similar expression patterns. Therefore, genes having similar expression profiles are clustered (i.e., grouped), so that a biological relationship among genes belonging to the same cluster (group) can be inferred. In more detail, through cluster analysis, unknown functions of a gene can be inferred from the known functions of other genes belonging to the same cluster, and biological correlations between genes having similar expression patterns can be deduced.
Conventional technologies of dividing (clustering) gene expression profiles into subsets of genes having similar expression patterns are as follows:
In one conventional method, gene expression data sets are clustered by using a neural network algorithm referred to as a self-organizing map (SOM). The SOM clusters the gene expression data sets by learning a connection network having weights between input nodes and output nodes. The SOM allocates input data (gene expression profiles in the form of vectors) to the most similar cluster representative (which is randomly determined in the initial state), and re-calculates the weights of the connection network so as to be best suited to the currently allocated data. That is, the SOM is a kind of winner-take-all neural network algorithm. This method is able to discover the topological relationship between clusters by placing similar clusters next to one another. However, many input parameters, such as the topology of the SOM, need to be determined, and the quality of the clustering result depends on these input parameters. Furthermore, the initial cluster representatives should be determined accurately.
In another conventional method, the determination of seed genes for each cluster (i.e., cluster representatives), which has been a main drawback of conventional partition-based clustering methods, is handled more effectively. In more detail, in order to extract the seed genes of each cluster, singular value decomposition (SVD) is applied to gene expression data that has undergone a Gaussian transformation. Unlike the conventional clustering algorithms, this method does not need a process of determining complex initial input parameters. However, the number of initial seed genes still needs to be determined. A wrong selection of the number of initial seed genes may dramatically deteriorate the quality of the clustering result. Moreover, this method focuses on mathematical similarity rather than biological function, which makes the biological analysis of the detected gene groups unclear.
Another clustering method, unlike the above methods, takes into account genes in the Gene Ontology (GO). This method is able to analyze the individual functions of each gene included in a cluster and to concentrate on candidate genes, and may thereby reduce unnecessary processing time. However, since only genes whose correlation is greater than a predetermined reference level are selected, useful information included in other genes may be lost.
The conventional methods either must determine complex parameters or initial cluster representatives that have a significant influence on the quality of the clustering results, or use mathematical similarity only, which makes the analysis of biological function unclear. Moreover, even when an analysis of the biological function is used, some important information may be lost or the application of the method is limited.
The present invention provides a method and apparatus for detecting similar expression gene groups, which ensure the reliability of the clustering seeds that have a significant influence on the clustering result, and effectively use Gene Ontology (GO) terms as clustering seeds, thereby enhancing the biological meaning and reliability of the clustering result and reducing information loss in the GO term seeds.
According to an aspect of the present invention, there is provided a method of clustering gene expression profiles comprising: selecting one or more Gene Ontology (GO) terms from a GO tree; receiving gene expression data sets; classifying the gene expression data sets into groups according to the GO terms; firstly clustering gene expression data belonging to each of the groups based on a similarity of the gene expression data; and secondly clustering the gene expression data sets by using the result of the first clustering as a seed.
According to another aspect of the present invention, there is provided an apparatus for clustering gene expression profiles comprising: a GO selection unit selecting one or more GO terms from a GO tree; a gene expression data input unit receiving gene expression data sets; a classification unit classifying the gene expression data sets into groups according to the GO terms; a first clustering unit firstly clustering gene expression data belonging to each of the selected groups based on a similarity of the gene expression data; and a second clustering unit secondly clustering the gene expression data sets by using the result of the first clustering as a seed.
The above and other features and advantages of the present invention will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings in which:
Hereinafter, the present invention will be described in detail by explaining embodiments of the invention with reference to the attached drawings.
After the GO terms of interest are selected, gene expression data sets that are to be used for clustering are received (Operation 110). When a gene of a cell is exposed to specific conditions, the gene is expressed so as to create a material such as mRNA or DNA, i.e., a gene expression product. The specific conditions include exposure to a temperature, acidity (pH), growth/culture conditions, time variation, medicine or a candidate medicine material, etc. A value obtained by measuring the amount of the gene expression product is a gene expression value. The expression values of a gene constitute its gene expression profile. An example of the gene expression profile is illustrated in
After the GO terms of interest are selected and the gene expression data sets are inputted, the gene expression data sets are classified according to the selected GO terms of interest (Operation 120). Genes of the gene expression data sets have GO terms relating to their functions. That is, one gene can have a plurality of related GO terms. The genes are allocated to groups of the selected GO terms.
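The classification of Operation 120 can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `classify_by_go_terms`, the gene names, and the annotation table are hypothetical, and the GO identifiers are used only as example labels. As a simplifying assumption, a gene is placed in a group only when it is directly annotated with the selected term; propagation of annotations up the GO tree is not handled.

```python
from collections import defaultdict


def classify_by_go_terms(gene_annotations, selected_terms):
    """Group gene identifiers under each selected GO term.

    gene_annotations: dict mapping a gene id to the GO terms annotated to it.
    selected_terms: iterable of GO terms of interest chosen by the user.
    A gene annotated with several selected terms appears in several groups.
    """
    selected = set(selected_terms)
    groups = defaultdict(list)
    for gene, terms in gene_annotations.items():
        for term in terms:
            if term in selected:
                groups[term].append(gene)
    return dict(groups)


# Hypothetical annotation table used only for illustration.
annotations = {
    "geneA": ["GO:0006412", "GO:0006950"],
    "geneB": ["GO:0006950"],
    "geneC": ["GO:0008152"],
}
groups = classify_by_go_terms(annotations, ["GO:0006412", "GO:0006950"])
print(groups)
```

In this sketch, geneA is allocated to both selected terms, while geneC, which has no selected term, is allocated to no group.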
Thereafter, the gene expression data sets are firstly clustered according to the expression profile similarity of the genes allocated to each of the GO terms (Operation 130). The gene expression data sets are secondly clustered by using the result of the first clustering as a seed (Operation 140). The first and second clustering are described in detail with reference to
A similarity between the gene expression profiles allocated to each of the GO terms of interest is calculated (Operation 200). The similarity is calculated using any one of the conventional methods. For example, a Pearson correlation coefficient is used to calculate the similarity. The similarity calculation is obvious to one of ordinary skill in the art and thus its detailed description will not be provided.
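As one example of the similarity calculation of Operation 200, the Pearson correlation coefficient of two expression profiles x and y is the sum of the products of their deviations from the mean, divided by the square root of the product of their summed squared deviations. The sketch below uses only the Python standard library; the function name `pearson` and the handling of a constant (flat) profile are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if denominator == 0:
        # Assumption: a constant profile is treated as uncorrelated.
        return 0.0
    return numerator / denominator


same_trend = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite_trend = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(same_trend, opposite_trend)
```

Two profiles that rise together give a coefficient close to 1, and profiles that move in opposite directions give a coefficient close to -1, regardless of the absolute expression level.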
The genes are rearranged based on the similarity (Operation 210). In this regard, the key point is to extend a gene set sequentially, starting from any one of the genes and adding further genes one at a time. Each additional gene is the gene most similar to the currently created gene set. The similarity between the set and a gene can be calculated using various conventional methods. The order in which genes are included as the set is extended is the order of the rearranged genes.
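The sequential extension of Operation 210 can be sketched as a greedy ordering over a precomputed similarity matrix. The text leaves the set-to-gene similarity open, so the sketch assumes the mean similarity between a candidate gene and the genes already in the set; the function name `rearrange`, the choice of starting gene, the tie-breaking rule, and the example matrix are likewise illustrative assumptions.

```python
def rearrange(sim, start=0):
    """Return gene indices in the order they join a growing gene set.

    sim: symmetric matrix (list of lists) of pairwise gene similarities.
    start: index of the gene that forms the initial set.
    Assumption: set-to-gene similarity is the mean similarity between the
    candidate gene and every gene already in the set.
    """
    n = len(sim)
    if n == 0:
        return []
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        # Ties are broken in favour of the lowest gene index.
        best = max(
            sorted(remaining),
            key=lambda g: sum(sim[g][m] for m in order) / len(order),
        )
        order.append(best)
        remaining.remove(best)
    return order


# Hypothetical similarity matrix for four genes.
sim = [
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.2, 0.8],
    [0.9, 0.2, 1.0, 0.3],
    [0.2, 0.8, 0.3, 1.0],
]
order = rearrange(sim, start=0)
print(order)
```

Reordering the rows and columns of the similarity matrix by the returned order places mutually similar genes next to one another, which is what makes blocks visible in a similarity map.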
After the genes are rearranged, a similarity map is prepared by reflecting the sequence of the rearranged genes (Operation 220). The similarity map is used to help a user determine blocks (seeds) of similarity. An example of the similarity map is illustrated in
Once the similarity map is completed, a user sets blocks of one or more genes that are considered to be similar to one another (Operation 230). Referring to
Each gene is allocated to the cluster (seeds of the cluster) having the highest similarity (Operation 310). The similarity can be calculated using the method that is adopted in the first clustering.
Not every gene allocated to a cluster necessarily has a satisfactory similarity to the seed of that cluster. Therefore, if the similarity of a gene is lower than a designated similarity, the user excludes the gene from the cluster (Operation 320).
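Operations 310 and 320 can be sketched together as follows. Several points are assumptions made for illustration and are not stated in the text: each seed is represented by the mean profile of its block of genes, the similarity is the Pearson correlation coefficient, and the exclusion that the user performs is modeled as an automatic threshold `min_similarity`. All names and data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    den = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return sum(a * b for a, b in zip(dx, dy)) / den if den else 0.0


def mean_profile(profiles):
    """Element-wise mean of a list of equally long expression profiles."""
    return [sum(column) / len(column) for column in zip(*profiles)]


def assign_to_seeds(profiles, seeds, min_similarity):
    """Allocate each gene to its most similar seed, then filter by threshold.

    profiles: dict mapping a gene id to its expression profile.
    seeds: dict mapping a seed name to the gene ids of the block.
    Returns (clusters, excluded), where excluded lists the genes whose best
    similarity is below min_similarity.
    """
    centroids = {
        name: mean_profile([profiles[g] for g in genes])
        for name, genes in seeds.items()
    }
    clusters = {name: [] for name in seeds}
    excluded = []
    for gene, profile in profiles.items():
        best_name, best_sim = max(
            ((name, pearson(profile, c)) for name, c in centroids.items()),
            key=lambda pair: pair[1],
        )
        if best_sim >= min_similarity:
            clusters[best_name].append(gene)
        else:
            excluded.append(gene)
    return clusters, excluded


# Hypothetical profiles and seed blocks.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.1, 6.0, 8.2],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.1, 4.0, 2.2],
    "g5": [1.0, 3.0, 1.0, 3.0],
}
seeds = {"up": ["g1", "g2"], "down": ["g3", "g4"]}
clusters, excluded = assign_to_seeds(profiles, seeds, min_similarity=0.8)
print(clusters, excluded)
```

In this example the oscillating profile g5 is most similar to the "up" seed, but its similarity is well below the threshold, so it is excluded rather than forced into a cluster.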
The GO term selection unit 700 displays the GO term tree on a screen to allow a user to select one or more GO terms. For user convenience, the GO term selection unit 700 displays the GO term tree on a conventional GUI screen and receives the user's selection.
The gene input unit 710 receives gene expression data sets from a user. A preprocessing process of the gene expression data sets is obvious to one of ordinary skill in the art, and thus its detailed description will not be provided.
The gene classification unit 720 classifies genes of the gene expression data sets according to the selected GO terms.
The first clustering unit 730 measures a similarity between the genes allocated to each of the GO terms, rearranges the genes based on the similarity, and prepares a similarity map reflecting the order of the rearrangement. The first clustering unit 730 displays the similarity map on the screen to allow the user to set one or more blocks of the genes.
The second clustering unit 740 secondly clusters the genes by using the result of the first clustering unit 730 as seeds. In more detail, the second clustering unit 740 sets the results obtained from the first clustering unit 730 as a seed, allocates similar genes to each seed, and secondly clusters the genes. The second clustering unit 740 displays its result on the screen to allow the user to remove the genes having a lower similarity than a prespecified similarity from the cluster results.
The embodiments of the present invention can be written as computer programs and can be implemented in general-use digital computers that execute the programs using a computer readable recording medium. Examples of the computer readable recording medium include magnetic storage media (e.g., ROM, floppy disks, hard disks, etc.), optical recording media (e.g., CD-ROMs, or DVDs), and storage media such as carrier waves (e.g., transmission through the Internet). The computer readable recording medium can also be distributed network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
The method of detecting a similar expression gene group by using the GO according to the present invention effectively uses GO information when time-series gene expression profile sets obtained from a micro-array experiment are divided into clusters having similar expression patterns, thereby creating a biologically meaningful and highly reliable clustering result. The method can also reduce information loss in the GO seeds. Therefore, effective study of how genes operate can be supported.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The exemplary embodiments should be considered in a descriptive sense only and not for purposes of limitation. Therefore, the scope of the invention is defined not by the detailed description of the invention but by the appended claims, and all differences within the scope will be construed as being included in the present invention.
Number | Date | Country | Kind |
---|---|---|---|
10-2007-0027795 | Mar 2007 | KR | national |
10-2007-0099927 | Oct 2007 | KR | national |