The present invention relates generally to the field of data analysis methods, systems and apparatus, sometimes referred to as data mining.
In today's systems, there is a severe shortage of advanced data analysis software to search for information in large genome data sets. Current statistical and data mining tools cannot adequately address the needs of scientists who want to find answers to complex questions in genome data sets. Now that the human genome has been sequenced, a greater challenge faces scientists: to use the information being populated in genome databases worldwide for improved disease diagnosis and drug discovery. With advances in sequencing techniques, increasingly large amounts of data are becoming available on a worldwide basis as a combination of public and private genome databases. It has been estimated that a single genome may require as much as 300 Terabytes of trace files. With the genomes of several organisms completely sequenced, interest within bio-informatics has shifted from sequencing to learning more about the genes encoded in the sequence and their functions. Specifically, scientists would like answers to questions such as
WO0237102A2 “Methods for Analyzing Dynamic Changes in Cellular Informatics and Uses Therefor” by Huang and Ingber describes analysis of dynamic changes in cellular processes and representing cellular processes as dynamic signatures or phase portraits. The signature is based upon time dependent molecular changes that are associated with a transition between distinct stable cellular behavioral states.
WO0134789A2 relates to gene expression clustering by statistically significant connections.
U.S. Pat. No. 6,420,108 relates to a computer aided display for comparative gene expression.
US 2002/0019704 describes a method for analyzing a plurality of sets of values associated with a plurality of genes to identify those genes whose associated values differ by a statistically significant amount.
U.S. Pat. No. 6,185,561 relates to organizing expression information in a way that facilitates data mining.
The present invention provides a method and programmed means for clustering genes having potential functional similarity by a comparison of their time varying gene expression profiles.
The temporal expression patterns of a large number of genes are known to exhibit some degree of order across a tissue. Therefore, a match of the gene expression profiles using both time and intensity information is better at detecting functional similarity than using intensity alone.
According to the instant invention, two temporal sequences are similar and can be placed in the same cluster if they have enough non-overlapping time-ordered pairs of sub-sequences that are similar.
It is an advantage of our invention that functional similarity between portions of gene expression profiles can be clustered, thereby characterizing similarity between genes in one or more phases of the cell cycle.
Another advantage of our invention is that, by using a fast multidimensional index structure, it can cluster all similar genes without a linear search of the genome database.
These and other advantages of the invention which will become clear upon reading the following description of a preferred embodiment are obtained by novel processes of clustering the result of several methods of signal matching in signal processing, such as correlation.
A method and programmed means are disclosed for discovering functional similarity between portions of gene expression profiles, to cluster all similar genes without a linear search of the genome, thereby characterizing similarity between genes in one or more phases of a cell cycle. Our preferred embodiment uses a time and intensity-invariant correlation function such as that described by R. Agrawal, K. Lin, H. S. Sawhney, K. Shim: “Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases”, Proc. of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, September 1995. Specifically, we apply the similar sequences algorithm embodiment of the above described correlation function in Intelligent Miner for Data (™ IBM Corp.), which was designed for business intelligence, to time varying gene expression data.
The method of the invention uses the time and intensity invariant correlation function of the IBM tool to find matches of gene expression profiles using both time and intensity information, which is better at detecting functional similarity than using intensity information alone. The output of Intelligent Miner is a data set of gene expression pairs with the match fraction and number of subsets used to compare each pair. A threshold match fraction is chosen and genes are listed in clusters by their match fractions. Genes are then removed from all except the cluster with the highest match fraction. Any genes not already in a cluster are added to a cluster which includes the gene that has the highest match fraction with the added gene.
Referring now to the drawings, and first to
The focal point of the preferred personal computer architecture comprises a processor 51. The processor 51 is connected to a bus 52 which comprises a set of data lines, a set of address lines and a set of control lines. A plurality of I/O devices, memory and storage devices 53-58 and 66 are connected to the bus 52 through separate adapters 59-64 and 67, respectively. For example, the display 54 may be either a CRT or a flat panel display.
The random access memory (RAM) 56 and the read-only memory (ROM) 58 and their corresponding adapters 62 and 64 are included as standard equipment in most computers, although additional random access memory to supplement memory 56 may be added via a plug-in memory expansion option.
Within the ROM 58 are stored a plurality of instructions, known as the basic input/output operating system, or BIOS, for execution by the processor 51. The BIOS controls the fundamental operations of the computer. An operating system such as a windows-oriented operating system software available from IBM Corporation, MICROSOFT Corporation or other supplier is loaded into the memory 56 and runs in conjunction with the BIOS stored in ROM 58.
The programs embodying the instant invention as well as other programs such as scientific instrument control programs may also be loaded into the memory 56 to provide instructions to the microprocessor 51 to enable a comprehensive set of tasks, including the gathering of gene expression profiles to be performed by the computer system shown in
In a computer such as the computer for the system shown in
Computer architecture and components are further explained in The Winn Rosch Hardware Bible, W. L. Rosch, Simon & Schuster, ISBN 0-13-160979-3 (“Rosch”), which is specifically incorporated herein by reference.
Referring now to
Referring now to
The program product logic means of the invention has a clustering section 223 which lists gene expression pairs in clusters by their match fractions. If gene gi is similar to gene gj, then these two genes are placed in a cluster ca, and i and j are added to the gene index array G and to the cluster index array C. The next gene expression pair gi and gk is then examined. If gene gi is similar to gene gk, but i and k are already in the gene index array G, then the next gene expression pair is examined. But if gene gi is in the index G and gene gk is not, then gene gk is placed in cluster ca with gi and gj by adding k to G and to C as indicated at 223.
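By way of illustration only, the pair-by-pair bookkeeping described for block 223 may be sketched as follows. This is a minimal sketch and not the actual program product: the function name `cluster_pairs`, the use of Python lists for the index arrays G and C, the symmetric handling of the case in which only the second gene of a pair has been seen, and the example pairs are all assumptions made for the illustration.

```python
def cluster_pairs(pairs):
    """Assign genes to clusters from an ordered list of similar pairs.

    pairs: iterable of (i, j) gene indices reported as similar.
    Returns (G, C), where G lists the gene indices seen so far and
    C[n] is the cluster number assigned to gene G[n].
    """
    G, C = [], []
    next_cluster = 1
    for i, j in pairs:
        i_seen, j_seen = i in G, j in G
        if i_seen and j_seen:
            continue  # both genes already indexed: examine the next pair
        if i_seen:
            cluster = C[G.index(i)]  # j joins the cluster of i
            G.append(j)
            C.append(cluster)
        elif j_seen:
            cluster = C[G.index(j)]  # assumed symmetric case: i joins j
            G.append(i)
            C.append(cluster)
        else:
            # neither gene seen before: open a new cluster for the pair
            G.extend([i, j])
            C.extend([next_cluster, next_cluster])
            next_cluster += 1
    return G, C


# Hypothetical similar pairs, not taken from the tables of this description.
G, C = cluster_pairs([(1, 2), (1, 7), (3, 4), (5, 6)])
```

In this hypothetical run, genes 1, 2 and 7 share one cluster number, while the pairs (3, 4) and (5, 6) each open a new cluster.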
The program product of the invention also has means at block 225 for removing a first gene from a cluster cb when the first gene is also in another cluster ca which has another gene with a higher match fraction with the first gene than any of the genes in the cluster cb have with the first gene. When a gene has such a higher match fraction mf with another gene in another cluster ca but the difference between the match fractions is less than a predetermined match difference threshold mdt value such as 5 percent, and the similarity with the other gene comprises more subsequences than the similarity in the cluster cb, then the gene is placed only in the cluster cb and is removed from the other cluster ca. This programmed logic removing means is cycled until all genes are listed in only one cluster.
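The principal rule of block 225 may be illustrated by the following sketch, which keeps a gene only in the cluster that holds its best-matching partner. It is an assumption-laden illustration rather than the actual logic: the names `best_match_in` and `resolve_memberships`, the dictionary representation of clusters and match fractions, and the example values are hypothetical, and the tie-breaking rule based on the match difference threshold mdt and the subsequence count is deliberately omitted.

```python
def best_match_in(gene, members, mf):
    """Highest match fraction between `gene` and any other cluster member."""
    return max(
        (mf.get(frozenset((gene, g)), 0.0) for g in members if g != gene),
        default=0.0,
    )


def resolve_memberships(clusters, mf):
    """Keep each gene only in the cluster holding its best-matching partner.

    clusters: dict mapping cluster id -> set of gene indices.
    mf: dict mapping frozenset({a, b}) -> match fraction of genes a and b.
    """
    # Compare against the original membership so results do not depend
    # on the order in which genes are visited.
    snapshot = {c: set(m) for c, m in clusters.items()}
    result = {c: set(m) for c, m in clusters.items()}
    for gene in set().union(*snapshot.values()):
        homes = [c for c, m in snapshot.items() if gene in m]
        if len(homes) < 2:
            continue  # gene is already in a single cluster
        keep = max(homes, key=lambda c: best_match_in(gene, snapshot[c], mf))
        for c in homes:
            if c != keep:
                result[c].discard(gene)
    return result


# Hypothetical clusters and match fractions for illustration only.
clusters = {"ca": {1, 2}, "cb": {2, 5}}
mf = {frozenset((1, 2)): 0.9, frozenset((2, 5)): 0.3}
resolved = resolve_memberships(clusters, mf)
```

Here gene 2 matches gene 1 more strongly than gene 5, so it is retained in cluster ca and removed from cluster cb.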
The program product of the invention also has means at block 229 for responding to the content of list L and index G to determine whether all genes being analyzed have been placed in a cluster. If not, the means at block 229 adds each remaining gene to a cluster having a gene with which the remaining gene has the highest match fraction mf, regardless of whether mf is less than the threshold mft.
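The placement of remaining genes at block 229 may be sketched as below. The function name `assign_remaining`, the dictionary inputs and the example values are assumptions for illustration; in particular, the sketch assumes that a remaining gene is compared only against genes that were clustered before block 229 ran.

```python
def assign_remaining(all_genes, cluster_of, mf):
    """Place every unclustered gene with its best-matching clustered gene.

    all_genes: iterable of all gene indices being analyzed.
    cluster_of: dict mapping already clustered gene -> cluster id.
    mf: dict mapping frozenset({a, b}) -> match fraction of genes a and b.
    No threshold is applied, so a match fraction below mft still counts.
    """
    placed = dict(cluster_of)
    if not cluster_of:
        return placed  # nothing to attach the remaining genes to
    for gene in all_genes:
        if gene in placed:
            continue
        # Clustered gene with the highest match fraction to this gene.
        best = max(cluster_of, key=lambda g: mf.get(frozenset((gene, g)), 0.0))
        placed[gene] = cluster_of[best]
    return placed


# Hypothetical values: gene 5 was left out because its best match
# fraction (0.4) is below an assumed threshold mft of 0.5.
cluster_of = {1: "c1", 2: "c1", 3: "c2", 4: "c2"}
mf = {frozenset((5, 3)): 0.4, frozenset((5, 1)): 0.1}
placed = assign_remaining([1, 2, 3, 4, 5], cluster_of, mf)
```

In this hypothetical case gene 5 is added to cluster c2, because gene 3 is its best match even though that match fraction is below the assumed threshold.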
An important feature of the invention is that non-statistical clustering is used. This retains the benefits of scale invariance but adds time invariance to the analysis. Unlike other conventional methods, even partial similarity can be recognized. Multiple sub-sequence matches are handled without compromising accuracy, and because gaps are allowed, the result obtained is very resistant to noise. Unlike other methods, the invention allows an algorithm to be used that accommodates a shift in time over which similarity is seen. An example similarity search output is shown below for two hypothetical genes, gene 8 and gene 9 whose profiles are shown in
The higher the match fraction, the better is the match of two sequences. The match fraction above is shown for purposes of description only and is not an actual calculated fraction.
Another feature of the algorithm that is accommodated by the method of the invention is that there can be “gaps” in similar sequences as shown in
The algorithm that is used in this preferred embodiment is described in the paper by R. Agrawal, K. Lin, H. S. Sawhney, K. Shim: “Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases”, Proc. of the 21st Int'l Conference on Very Large Databases, Zurich, Switzerland, September 1995. This algorithm will be referred to hereinafter as the Agrawal Fast Similarity Search. It is understood that the use of the algorithm per se does not comprise the novelty of the invention but that the novel and unobvious programmed logic means and method of the clustering means permit such use. The algorithm uses a model of similarity of time sequences that permits a fast search technique. The amplitude of one of the two sequences is scaled by any suitable amount and its offset is adjusted appropriately. The matching of sequences is then scale-independent, state-independent, translation neutral and noise resistant. The algorithm creates a fast, indexable data structure using small, atomic subsequences that represent all the sequences up to amplitude scaling and offset. The R*-tree family of structures is used for this representation because arbitrary precision can be maintained for the sequence values while still allowing for similarities to be defined with respect to a user-defined ε distance in the L-infinity norm between the atomic subsequences. Therefore, in the first stage, all atomic subsequence matches within a distance ε can be efficiently calculated. The second stage employs a fast algorithm for stitching atomic matches to form long subsequence matches, allowing non-matching gaps to exist between the atomic matches. The third stage linearly orders the subsequence matches found in the second stage to determine if enough similar pieces exist in the two sequences.
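The notion of an atomic subsequence match may be illustrated by the following minimal sketch. It is not the Agrawal Fast Similarity Search itself: an exhaustive comparison of windows is substituted for the R*-tree index, the stitching and ordering stages are omitted, and the names `normalize` and `atomic_matches`, the mapping of each window onto [-1, 1], the window length `w` and the sample sequences are assumptions chosen for the illustration.

```python
def normalize(window):
    """Remove offset and amplitude scale by mapping a window onto [-1, 1]."""
    lo, hi = min(window), max(window)
    if hi == lo:
        return [0.0] * len(window)  # flat window carries no shape information
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in window]


def atomic_matches(s, t, w, eps):
    """Return (i, j) start offsets where the length-w windows of s and t
    agree within eps in the L-infinity norm after normalization."""
    matches = []
    for i in range(len(s) - w + 1):
        a = normalize(s[i:i + w])
        for j in range(len(t) - w + 1):
            b = normalize(t[j:j + w])
            # L-infinity distance: largest pointwise difference
            if max(abs(x - y) for x, y in zip(a, b)) <= eps:
                matches.append((i, j))
    return matches


# Hypothetical profiles: t repeats the shape of s at ten times the
# intensity and one time stamp later.
s = [1, 2, 4, 3, 1, 1]
t = [10, 10, 20, 40, 30, 10]
pairs = atomic_matches(s, t, w=3, eps=0.05)
```

For these hypothetical profiles the sketch reports the window pairs (0, 1), (1, 2) and (2, 3), that is, each matching window of `s` is found one time stamp later in `t` despite the tenfold difference in intensity.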
A typical gene expression data set appears below as table I. g1 to g7 are 7 genes expressed over 6 time stamps. Table I is a data set of gene expression profiles which also appear as data set 211 in
If one plots these 7 lines as a function of time we have the graphs shown in
Referring to
However, with gene 7 and gene 1, shown in
This similarity which is shifted in time can be identified by the Agrawal Fast Similarity Search algorithm which identifies these two as similar genes.
These results of similarity are used by the invention for clustering. The data shown in
Referring again to
The logic of block 223 places the genes gi through gz into appropriate clusters. For each similar gene pair in 219, if the index i of gi has been seen before as evidenced by an entry i in G, the method skips to the next gene pair. If the index i of gi has not been seen in G but the index j of the other gene of the pair has been seen in G, then the pair gi, gj belongs in the cluster ca to which gene gj belongs. If the index i of gi has not been seen in G and index j of the other gene of the pair has also not been seen in G, then the pair gi, gj belongs in a new cluster cb. In this way the logic at block 223 lists gene expression pairs in clusters by their match fractions. Another way to express these results is using associative logic. That is, if A is similar to B and B is similar to C, then A, B, and C belong to a similar group.
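The associative rule stated above, taken by itself, amounts to grouping genes that are linked by a chain of similar pairs. The following sketch shows that rule using a union-find structure. It is offered as a simplified illustration under stated assumptions and not as the logic of block 223, since it ignores the order in which pairs are examined; the name `group_similar`, the optional `threshold` parameter and the example pairs are hypothetical.

```python
def group_similar(pairs, threshold=0.0):
    """Group genes linked by chains of similar pairs.

    pairs: iterable of (a, b, mf) with mf the match fraction of a and b.
    Pairs whose match fraction is below `threshold` are ignored.
    Returns a list of sets, ordered by the smallest member of each set.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, mf in pairs:
        if mf < threshold:
            continue
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two groups

    groups = {}
    for g in list(parent):
        groups.setdefault(find(g), set()).add(g)
    return sorted(groups.values(), key=min)


# Hypothetical pairs: A~B and B~C place A, B and C in one group.
pairs = [("A", "B", 0.9), ("B", "C", 0.7), ("D", "E", 0.6)]
all_groups = group_similar(pairs)
strict_groups = group_similar(pairs, threshold=0.8)
```

With no threshold the hypothetical pairs yield the groups {A, B, C} and {D, E}; with an assumed threshold of 0.8 only the pair A, B survives, which shows how the choice of threshold changes the number of clusters.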
This method is applied to the data of table II. For example, gene 1 and gene 2 are the first pair in table II and they are placed in cluster c1 as shown below in table III. Gene 1 and gene 7 are the second pair in table II and, by the program logic of block 223, gene 7 is also placed in cluster c1.
Thus we have 3 clusters if the threshold is above 0.2 and 4 clusters if no threshold is set or if the threshold is less than 0.2. This result is shown in table form below as table IV.
Referring now to the table above, gene 3 and gene 4 are seen to each be in two clusters. According to the logic of block 225, the match fraction (0.2) of gene 3 and gene 6 of cluster 4 is compared to the match fraction (0.8) of gene 3 and gene 4 of cluster 2. Since the match fraction (0.8) of cluster 2 is greater than (0.2), gene 3 is removed from cluster 4 and retained in cluster 2. Likewise, gene 4 is removed from cluster 4 and retained in cluster 2, and gene 6 is removed from cluster 4 and retained in cluster 3. This logic is expressed in block 225 as: if gene gk belongs to cluster ca and to cluster cb, and the maximum match fraction of gk in cluster ca is greater than the maximum match fraction of gk in cluster cb, then the gene gk is placed only in cluster ca.
In an embodiment where a threshold is provided, also shown in
Since all other match fractions are less than 0.5, these pairs will not be included in the cluster building logic, and because gene expression profiles g1 through g7 have all been accounted for, the process ends.
In another embodiment, we may have identified a particular gene gn of interest contained in a data set 233 shown in
First we insert the gene expression for the particular gene of interest as a row gn into the data set 211 of interest. Then we perform the algorithm processing step of block 213 in the method of
In a still further embodiment, we have identified a particular set of genes cp=gm, gn, . . . from a data set 233 and we are now looking for all genes that behave similarly but are stored in a different data set 211 that has been created using similar experimental conditions. The steps in this embodiment are as follows:
First we insert the gene expressions for the particular genes of interest gm, gn, . . . as rows gm, gn, . . . into the data set 211 of interest. Then we perform the algorithm processing step of block 213 in the method of
Having described the programmed means and method of the invention and several embodiments thereof, it may be seen that the present invention overcomes the shortcomings of the prior art systems by providing clusters of genes in the presence of noise and time shifts by a programmed apparatus using efficient method steps. It will be understood by those skilled in the art of computer systems that many additional modifications and adaptations to the present invention can be made in both embodiment and application without departing from the spirit and scope of this invention. Accordingly, this description should be considered as illustrative of the present invention and not in limitation thereof.