The present invention is of a method for analyzing and visualizing large collections of data.
Exploratory data analysis is critical in a broad range of research areas, where large collections of data need to be meaningfully arranged and presented. Indeed, a major challenge in the analysis of large-scale multidimensional data is effective organization and visualization. Graphically structured presentation can greatly aid humans in data mining: a clear and interactive display may reveal subtle structure and relationships, and assist in tracking down elusive connections.
The background art does not teach or suggest an efficient, intuitive tool for automated analysis and visualization, which may optionally be performed with little or no manual intervention. The background art also does not teach or suggest reorganization of distance matrices using the characteristics of the distances themselves. The background art does not teach how to read the properties and relationships of the data from the reordered distance matrix.
The present invention overcomes these deficiencies of the background art by providing a method for an unsupervised analysis of data according to a reordered distance matrix. According to preferred embodiments thereof, the present invention is useful for large scale multidimensional data, more preferably data having at least four dimensions. The present invention is also preferably used for data comprising a plurality of objects characterized by continuous variables, for example variables having a continuum of possible values rather than a plurality of discrete values. It should be noted that a single object featuring a plurality of points would also be considered a plurality of objects with regard to the present invention.
According to preferred embodiments, the present invention provides an analysis method termed herein SPIN, a novel method for the organization and visualization of data, implemented in a simple tool. SPIN utilizes traits of distance matrices to sort objects in a natural ordering that highlights the underlying structure of the original, multidimensional data. The shape of the distribution of objects and/or of the objects themselves, and relationships between objects can be inferred from the reordered distance matrix generated by SPIN. As an unsupervised analysis tool, SPIN does not rely on any external labels, but rather explores the inherent characteristics of the data. In the analysis of high-throughput biological experiments, discretely-labeled data, such as clinical labels of ‘sick’ versus ‘healthy’, is traditionally organized by various clustering approaches. However, when the objects are characterized by continuous variables, e.g. survival intervals of patients or expression levels of genes, any sharp separation into distinct clusters will be rather arbitrary. Thus, a different organization approach, one which emphasizes ordering rather than grouping, could be more relevant.
This work focuses on finding a one-dimensional ordering of a set composed of n data points, and on presenting as output the matching (2-dimensional) n by n distance matrix D. An element Dij of D represents the dissimilarity between objects i and j. Our aim is to find a permutation of the data points, such that the correspondingly reordered distance matrix reveals the underlying structure of the data, utilizing the human ability to readily recognize patterns in color images [1]. Sorting Points Into Neighborhoods (SPIN) generates a one-dimensional ordering of the objects and presents the reordered distance matrix in an intuitive color coded image that allows the observer to infer the underlying structure of the data. SPIN is especially suitable for analyzing high-throughput biological experiments, such as gene array experiments, where results are typically summarized in an expression matrix, in which each element denotes the expression level of a particular gene in a specific sample [1]. In this context two types of distance matrices can be produced: the distances between all pairs of samples can be calculated based on their expression levels over the measured genes, and the distance between all pairs of genes can be measured in the sample dimensions [2]. The sorted distance matrix generated by SPIN is particularly useful in time-series experiments, where an elongated cluster represents the temporal evolution of a particular biological module, such as cell-cycle progression. Another example where the shape revealed by SPIN has a clear biological interpretation comes from cancer research where samples are often composed of mixtures of cells: for instance, colon tissue samples isolated from liver metastases were arranged in an elongated, ellipsoid cluster [3]. The genes that induced the elongation were characteristic of liver, suggesting that this pattern reflects a mixture of the metastasis samples with cells originating from the liver.
Among the many advantages of the present invention is that the method provides an efficient and intuitive way to read the properties and relationships of the data from the reordered distance matrix. Contact maps of proteins have been used to discover secondary structure, but they possess an inherent ordering (according to the primary sequence). Therefore, the present invention represents the first method to be able to discover such properties and relationships without any inherent ordering (that is to say, pre-ordering) of the data.
The invention is herein described, by way of example only, with reference to the accompanying drawings, wherein:
a-d shows the results of using the method according to the present invention for machine vision;
a-g shows the results of analyzing colon cancer data with the method according to the present invention.
The present invention is of a method for an unsupervised analysis of data according to a reordered distance matrix. According to preferred embodiments thereof, the present invention is useful for large scale multidimensional data, more preferably data having at least four dimensions. The present invention is also preferably used for data comprising a plurality of objects characterized by continuous variables, for example variables having a continuum of possible values rather than a plurality of discrete values.
According to preferred embodiments, the present invention provides an analysis method termed herein SPIN, a novel method for the organization and visualization of data, implemented in a simple tool.
The input to SPIN is a distance matrix, and its output is a reordered distance matrix, obtained by permuting the n objects. Currently two different algorithms, based on two complementary intuitions, are implemented. However, optionally and preferably substantially any algorithm may be employed with the method of the present invention. The two algorithms utilize two distinct (and sometimes competing) desirable properties of properly ordered distance matrices. First, in many cases the values in the upper rows of a well-ordered distance matrix tend to increase with the column index, while the values in the bottom rows have the opposite inclination. In other words, the slope of the linear regression of the values in a row is a decreasing function of the row's index in the sorted matrix. The first algorithm, named Side-to-side, simply generates such a pattern. The second property is that the region near the main diagonal tends to have smaller dissimilarity values, i.e. a “good” ordering locates points next to their neighbors in the full high-dimensional space. The second algorithm, called Neighborhood, tries to create such an arrangement by ensuring that distant data points in the multi-dimensional space are not placed close to each other in the linear ordering.
Although both algorithms achieve a one dimensional ordering of the data set, the final resulting permutations are different in the following sense: Side-to-side tries to capture a particular pattern in the image of the distance matrix. As a result, points that are placed far apart in the linear ordering are also distant in the full high-dimensional space. Neighborhood, on the other hand, tries to make sure that neighboring points in the linear ordering are close to each other in the high dimensional space. This subtle distinction in emphasis may lead to substantial difference in the results, as illustrated for points that form a ring, described in greater detail below. A ring is a simple example where these two criteria are mutually exclusive. Neighborhood orders the points around the circumference of the ring. Due to the cyclic symmetry of the ring, the end points in this ordering are very close to one another in the true high dimensional space. This does not conform to the pattern that Side-to-side enforces.
In general, Side-to-side is simpler, faster, and seems to converge quickly for all the examples currently examined. It has no parameters, so the final ordering depends only on the initial permutation. Neighborhood, on the other hand, seems to have the potential to generate superior results in several cases. However, it does not always converge, and occasionally gets mired in obviously local stationary permutations. The size of the neighborhood, σ, determines the typical scales of objects that are revealed: small values of σ tend to break large clusters into small “clumps”; conversely, large σ values tend to merge neighboring clusters.
Several examples are given below in which the structure uncovered by SPIN has a clear biological interpretation, such as the cyclic nature of cell-cycle progression, visualized in a ring conformation. In another example the tissue composition of tested samples is captured by their relative placement in an ordered elongated cluster, formed in the space of tissue specific genes. Another example is related to machine or robot vision. Therefore, the method of the present invention has general applicability, which makes it relevant to diverse scientific disciplines and/or technologies.
Example 1 demonstrates the concepts and the intuitions that underlie the method according to the present invention, and shows how to infer structure characteristics from the sorted distance matrix. Example 2 provides a formal description of a preferred illustrative embodiment of the method. Examples 3-4 feature several applications to real data, where the shapes uncovered by SPIN are directly interpretable in biological terms. Example 5 relates to data for machine or robot vision.
Section 1: General Description of an Illustrative Method
A properly ordered distance matrix is indicative of the shape of a set of points. All the data sets presented in this article were ordered using SPIN, starting from a random initial permutation. The distance matrices were generated using the Euclidean distance measure, though our methodology can be applied to many dissimilarity metrics. The color of element Dij reflects the relative distance between points i and j, where blue (red) denotes small (large) distances, respectively.
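By way of non-limiting illustration, the following Python sketch shows one way a Euclidean distance matrix may be built and then permuted, with rows and columns reordered together. It is only a sketch using the standard library; the function names `euclidean_distance_matrix` and `reorder`, and the randomly generated toy points, are assumptions introduced here and are not part of the described implementation.

```python
import math
import random

def euclidean_distance_matrix(points):
    """Return the symmetric n x n matrix D with D[i][j] = ||p_i - p_j||."""
    n = len(points)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            D[i][j] = D[j][i] = d
    return D

def reorder(D, order):
    """Permute rows and columns of D together; position a holds object order[a]."""
    return [[D[i][j] for j in order] for i in order]

# Toy data (assumption): 6 random points in 4 dimensions.
rng = random.Random(0)
points = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(6)]
D = euclidean_distance_matrix(points)

order = list(range(len(points)))
rng.shuffle(order)              # a random initial permutation
D_shuffled = reorder(D, order)  # the matrix a sorting method would start from
```

Any other dissimilarity measure could be substituted for `math.dist` without changing the reordering step.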
For explaining the SPIN method, we first address a set of points that form a single object in multidimensional space. The top row (1) of
For example, consider points uniformly distributed within a cylinder, as presented in
Another simple structure, a ring (see
Given more complex data, the ordered distance matrix suggested by SPIN can capture the overall layout of a compound structure, as well as the local conformation of various components. In
This toy data set is composed of 800 points in 10 dimensions. The complex object was originally generated in 3-D, and then seven additional dimensions of noise (uniformly distributed between −1 and 1) were added. The right image of
The next example illustrates SPIN's ability to deal with complex objects embedded in high dimensional space:
In the distance matrix in
This Example provides an illustrative method according to the present invention, as a description of a preferred embodiment thereof, the SPIN method.
The input to SPIN is a distance matrix $D_{n \times n}$ calculated for a data set composed of n points, and its output is a reordered distance matrix, obtained by permuting the n objects according to a particular permutation $P \in S_n$ (the permutation group of n points). We also denote by P the permutation matrix associated with this permutation.
In order to find criteria for a good ordering, we studied several simple objects characterized by an inherent natural ordering (See
These attributes can be mathematically formulated by introducing an energy function $F \equiv F_D : S_n \to \mathbb{R}$ quantifying the fitness of every matrix ordering. Thus, the ordering problem becomes finding the permutation P minimizing F. We emphasize that there is no unique ‘correct’ choice of F, as different energy functions may potentially reveal different aspects of the data, thus enabling study of diverse properties, as will be demonstrated later.
The two aforementioned desired features of an ordered distance matrix can be represented by the following energy functions:
1. Side-to-Side (STS): Let X be a strictly increasing (column) vector. Set $F(P) = X^T P D P^T X$.
2. Neighborhood: Let W be a symmetric weight matrix concentrated in a region, determined by a parameter σ, around its main diagonal. Set $F(P) = \mathrm{tr}(P D P^T W)$,
where tr denotes the matrix trace.
Interestingly, the problems of minimizing the two choices of F mentioned above are special cases of a more general optimization problem, known as the Quadratic Assignment Problem (QAP), introduced by [4]. The QAP formulation is as follows: given two n×n matrices D and W, find $P \in S_n$ that minimizes $\mathrm{tr}(P D P^T W)$. Note that $W = X X^T$ corresponds to the STS problem.
The general QAP is considered an extremely difficult optimization problem. It is known to be NP-hard even to approximate and, in practice, is usually intractable for n greater than 30 (see [5] for a comprehensive survey of the problem). The particular choices of F that were made for the present Examples are shown to be also NP-hard, and therefore two analogous heuristic search algorithms were proposed, aimed at finding a global minimum.
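The relationship between the two energy functions and the QAP form can be checked numerically. The following Python sketch is illustrative only: the helper names `reorder`, `sts_energy` and `qap_energy`, the four collinear toy points and the particular vector X are assumptions made here. A permutation is represented as a list `order`, where position a holds object `order[a]`, so that `reorder(D, order)` plays the role of P D P^T.

```python
def reorder(D, order):
    """Return P D P^T for the permutation that puts object order[a] at position a."""
    return [[D[i][j] for j in order] for i in order]

def sts_energy(D, order, X):
    """Side-to-Side energy X^T (P D P^T) X."""
    Dp = reorder(D, order)
    n = len(X)
    return sum(X[a] * Dp[a][b] * X[b] for a in range(n) for b in range(n))

def qap_energy(D, order, W):
    """QAP energy tr(P D P^T W), using tr(A W) = sum_ab A[a][b] * W[b][a]."""
    Dp = reorder(D, order)
    n = len(W)
    return sum(Dp[a][b] * W[b][a] for a in range(n) for b in range(n))

# Toy data (assumption): four points on a line, so the natural order is known.
positions = [0.0, 1.0, 2.0, 3.0]
D = [[abs(p - q) for q in positions] for p in positions]

X = [-1.0, -1.0 / 3.0, 1.0 / 3.0, 1.0]      # strictly increasing vector
W = [[xi * xj for xj in X] for xi in X]     # W = X X^T

sorted_order = [0, 1, 2, 3]
shuffled_order = [2, 0, 3, 1]
print(sts_energy(D, sorted_order, X), sts_energy(D, shuffled_order, X))
```

For this toy case the natural ordering has the lower Side-to-Side energy, and the two energy functions agree when W = X X^T.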
These two algorithms are now explained in more detail below.
Side-to-side.
The algorithm of STS is summarized as follows:
Given a distance matrix D, multiply it by a weight-vector W; the resulting vector S is termed “scores” (see
In the second step the score vector is sorted in descending order, and this is taken as the new ordering of the points. Since the distance matrix is symmetrical, reordering the points dictates rearranging both rows and columns. The change in the order of the columns alters the order of the values in all rows. This means that if we repeat the process of scoring, the new score of a row will, in general, differ from the old one. This is resolved by iterating the process of scoring and sorting.
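The scoring and sorting loop described above may be sketched as follows. This is a minimal Python illustration under stated assumptions, not the implementation of the present invention: the function name `sts_sort`, the `max_iter` safeguard, the linearly ascending weight vector from −1 to 1, and the stable tie-breaking of Python's `sorted` are all choices made here for illustration.

```python
def sts_sort(D, order=None, max_iter=100):
    """Iterate scoring and sorting until the ordering stops changing.

    D is the original distance matrix; position a holds object order[a].
    """
    n = len(D)
    order = list(range(n)) if order is None else list(order)
    # Assumed weights: linearly ascending from -1 to 1.
    X = [-1.0 + 2.0 * i / (n - 1) for i in range(n)]
    for _ in range(max_iter):
        # Step 1: score each row of the currently ordered matrix.
        scores = [
            sum(D[order[a]][order[b]] * X[b] for b in range(n))
            for a in range(n)
        ]
        # Step 2: sort positions by descending score (stable for ties).
        ranking = sorted(range(n), key=lambda a: -scores[a])
        new_order = [order[a] for a in ranking]
        # Step 3: stop at a fixed point, otherwise repeat.
        if new_order == order:
            break
        order = new_order
    return order

# Toy data (assumption): five collinear points, started from a shuffled order.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]
D = [[abs(p - q) for q in positions] for p in positions]
result = sts_sort(D, order=[2, 0, 4, 1, 3])
print(result)
```

Because the result depends on the initial permutation, several random restarts may be run and the ordering with the lowest energy retained.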
Each pass through steps 1-3 is called an STS iteration, whose complexity is $O(n^2)$. Each STS iteration can be viewed as a mapping from the permutation group $S_n$ to itself, $G_D : S_n \to S_n$. Thus P is a possible output of STS if and only if it is a fixed point of $G_D$. Note that the resulting fixed point may not be a global minimum of F, as for different initial permutations the algorithm may terminate at different fixed points, with different values of F. A known strategy to cope with this problem is to start the algorithm from many randomly generated initial permutations, and choose the best fixed point obtained. Moreover, it is also possible to have multiple global minima. For example, define for every permutation P its ‘reverse’
As a concrete example, when the algorithm is applied to data comprised of three well separated “superclusters”, each of which consists of three dense spherical sub clusters close to each other (see
The three super-clusters are visible as dark blue squares along the main diagonal and their actual separation in the true multi-dimensional space is captured by the colors of the regions connecting these dark squares. At a higher resolution, the three sub clusters are also apparent. Furthermore, their relative positions can be inferred by the shading of the relevant rectangles in the distance matrix. The sizeable separation of the super-clusters is reflected in the final score vector in the form of large jumps that correspond to the boundaries between super-clusters, and smaller jumps corresponding to individual clusters.
Neighborhood.
The algorithm of Neighborhood is summarized as follows:
Input: $D_{n \times n}$ and $W_{n \times n}$
1. Compute $M = DW$.
2. Set $P = \arg\min_{Q \in S_n} \mathrm{tr}(QM)$.
3. If $\mathrm{tr}(PM) \neq \mathrm{tr}(M)$, set $D = P D P^T$ and go to 1.
4. Output D.
Each passage of steps 1-3 is a Neighborhood iteration. Step 2 is accomplished by solving the Linear Assignment Problem. This solution reflects the best current guess for an improved location for all the data points. At every iteration, points are sent to their new location, based on the current ordering of the points. However, since all the points are permuted simultaneously, there is no guarantee that the previous assignment is optimal for the new ordering. Hence the need to re-iterate. Since the Linear Assignment Problem is known to be solvable in time $O(n^3)$ [6], the complexity of each iteration is $O(n^3)$.
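A Neighborhood iteration may be sketched in Python as follows. This is an illustration under explicit assumptions and not the implementation of the present invention: the Linear Assignment Problem is solved here by brute force over all permutations, which is feasible only for very small n and stands in for an $O(n^3)$ solver; the Gaussian form of the weights, the function names, the numerical tolerance and the `max_iter` bound are likewise assumptions.

```python
import itertools
import math

def gaussian_weights(n, sigma):
    """Assumed weight form: W[i][j] = exp(-(i - j)^2 / (2 sigma^2))."""
    return [[math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2)) for j in range(n)]
            for i in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def best_assignment(M):
    """Brute-force argmin over permutations pi of tr(Q M) = sum_a M[pi[a]][a]."""
    n = len(M)
    best = min(itertools.permutations(range(n)),
               key=lambda pi: sum(M[pi[a]][a] for a in range(n)))
    return list(best), sum(M[best[a]][a] for a in range(n))

def neighborhood_sort(D, sigma, max_iter=50):
    n = len(D)
    W = gaussian_weights(n, sigma)
    order = list(range(n))
    for _ in range(max_iter):
        D_cur = [[D[i][j] for j in order] for i in order]
        M = matmul(D_cur, W)                      # step 1: mismatch matrix
        pi, cost = best_assignment(M)             # step 2: linear assignment
        if cost >= sum(M[a][a] for a in range(n)) - 1e-12:
            break                                 # step 3: no improvement
        order = [order[pi[a]] for a in range(n)]  # apply D = P D P^T
    return order

# Toy data (assumption): six collinear points given in scrambled order.
positions = [3.0, 0.0, 4.0, 1.0, 2.0, 5.0]
D = [[abs(p - q) for q in positions] for p in positions]
order = neighborhood_sort(D, sigma=1.0)
print([positions[i] for i in order])
```

For realistic data sizes the brute-force step would be replaced by a dedicated linear assignment solver.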
This algorithm of SPIN relocates points to the local neighborhood that best fits them. In this context a neighborhood is defined by a positive weight matrix Wij with a finite range σ. For example we use Gaussian weights,
The size of the neighborhood affects the scale at which objects are distinguished. By taking the product of the distance matrix with W we perform Gaussian smoothing of width σ on each of its rows; we call the result the mismatch matrix Mij. The index of the minimum in the smoothed row i, termed the score Si, reflects the best current guess for an improved location for that particular point. The vector of scores is calculated for all points i simultaneously, as explained in
Since all the points are relocated at the same time, the points in the target regions also change, so the process of scoring and sorting is repeated iteratively, until convergence is reached or the number of iterations exceeds a preset bound.
The current implementation is an interactive GUI in which the user adjusts σ manually. For a given data set there exists a range of relevant σ values where the resulting sorted distance matrix reflects the structure of the data at that resolution. In general, relatively large σ values correspond to working at low resolution, which allows the user to study the overall layout of the data, and observe the main separations. Smaller σ values can give a better local organization (near the main diagonal) at the expense of possibly fragmenting larger clusters. At the extreme end of small σ this is simply a nearest neighbor algorithm. One heuristic scheme that usually works well is starting with a very large neighborhood, iterating several times, then lowering σ (e.g. by a factor of 2) and so forth.
Although both algorithms find a one dimensional ordering of the data set, the characteristics of the final permutations are different in the following sense: Side-to-side (denoted STS) enforces a particular pattern on the image of the distance matrix, one that places red points (which denote large distances) in the top-right (and bottom-left) corners. Thus points that are placed far apart in the linear ordering are also distant in the full high-dimensional space. Neighborhood, on the other hand, tries to make sure that neighboring points in the linear ordering are close to each other in the high dimensional space. This subtle distinction in emphasis may lead to substantial difference in the results, as illustrated in
The left image is the result of STS, which tries to position red points in the top-right (and bottom-left) corners. The image on the right is the result of neighborhood sorting, which aims to avoid placing red points near the main diagonal. As a result, the optimal Neighborhood permutation orders the points around the circumference of the ring. Due to the cyclic symmetry of the ring, the end points in this ordering are very close to one another in the original space. This does not conform to the pattern that STS imposes.
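The geometric reason why a ring conflicts with the STS pattern can be verified with a small Python sketch. The twelve equally spaced points on a unit circle are an assumption chosen for illustration; the points are listed in the order in which they lie around the circumference.

```python
import math

n = 12
# Points ordered around the circumference of a unit ring (toy data, assumption).
ring = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
        for k in range(n)]
D = [[math.dist(p, q) for q in ring] for p in ring]

neighbor_gap = D[0][1]      # distance between adjacent points in the ordering
end_to_end = D[0][n - 1]    # distance between the two ends of the ordering
farthest = max(D[0])        # largest distance from the first point

print(neighbor_gap, end_to_end, farthest, D[0].index(farthest))
```

The two end points of the ordering are as close to each other as any pair of neighbors, and the largest distances fall midway along each row rather than in the corners of the matrix.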
For both algorithms, the score is shown to improve on every iteration; thus convergence to a fixed point is guaranteed after a finite time (see below for an outline of the proofs of complexity and convergence).
Proofs of Complexity
Claim: The Side-to-Side problem is NP-Hard
Proof: Let G=<V,E> be some graph on n vertices. Define D as follows:
if $(V_i, V_j) \in E$ then $D_{ij} = 1$, else $D_{ij} = 2$.
Set $D_{ii} = 0$.
Let $k \in [1, n]$ be some integer, and set $X_i = 1$ if $i \ge n-k+1$ and $X_i = 0$ otherwise. It can be easily shown that G has a clique of size k if and only if minP∈S
Claim: The Neighborhood problem is NP-Hard
We obtain a reduction from the Traveling Salesman Problem, known to be NP-hard even in the Euclidean case (Papadimitriou [15]).
Proofs of Convergence
We first give the STS algorithm in a slightly revised manner, where P operates on X instead of D:
It can be easily seen that this algorithm presentation is equivalent. We now prove the following lemma:
Lemma
1. X
2. X≠XXDX<X
Proof
1. Note that:
X
2. From the first place it follows that X
We can now prove the following theorem:
Theorem:
Proof:
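According to Thm. 2.11 in Baxter [4], D is Almost Negative Definite. That is, for any vector V such that $\sum_i V_i = 0$, we have $V^T D V \le 0$. Since every iterate of X is a permutation of X, the entries of the difference between two consecutive iterates $X_{t+1}$ and $X_t$ sum to zero, and it follows that
$(X_{t+1} - X_t)^T D (X_{t+1} - X_t) \le 0$ (2)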
Since D is symmetric it follows that
Subtracting H
But the algorithm never stays at the same point for more than one iteration (step 4), namely $X_{t+1} \neq X_t$, and therefore, according to the previous lemma:
X
To conclude, the energy function ()=X
The proof above shows that for Lp norms with p∈(1,2], SPIN converges to a local minimum of the dynamical energy (X,X)=X
Convergence of Neighborhood:
To prove convergence, we revise the Neighborhood algorithm as follows:
Taking Q=p in 7 gives the desired result.
Using the above claim, a proof of the algorithm termination can be obtained, similarly to STS. We skip the details here.
Proof of Neighborhood Convergence
First we revise the algorithm to an equivalent form:
Neighborhood (Rev.)
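Input: $D_{n \times n}$ and $W_{n \times n}$
1. Set $W_0 = W$, $P_{-1} = I_{n \times n}$, $t = 0$.
2. Compute $M_t = D W_t$.
3. Set $P_t = \arg\min_{Q \in S_n} \mathrm{tr}(Q M_t)$.
4. If $\mathrm{tr}(P_t M_t) \neq \mathrm{tr}(P_{t-1} M_{t-1})$, set $W_{t+1} = P_t^T W$, $t = t+1$ and go to 2.
5. Output $P_t D P_t^T$.
Claim: $\mathrm{tr}(P_{t+1} D P_t^T W) \le \mathrm{tr}(P_t D P_{t-1}^T W)$
Proof: $\mathrm{tr}(P_{t+1} D P_t^T W) = \mathrm{tr}(P_{t+1} D W_{t+1}) \le \mathrm{tr}(Q D W_{t+1})$ for all $Q \in S_n$.
Using the symmetry of W and D and the property tr(AB)=tr(BA) we get:
$\mathrm{tr}(Q D W_{t+1}) = \mathrm{tr}(Q D P_t^T W) = \mathrm{tr}((Q D P_t^T W)^T) = \mathrm{tr}(W P_t D Q^T) = \mathrm{tr}(P_t D Q^T W)$
Taking $Q = P_{t-1}$ gives the desired result.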
According to step 4, the algorithm terminates unless a strict inequality holds in the above claim. This prevents cycles of constant energy. Since the permutation space is finite, termination in a fixed point after a finite number of steps is guaranteed.
The current implementation of SPIN is as an interactive GUI, which enables the user to use either STS or Neighborhood. In general, STS is simpler, faster, and convergence seems to be quick for all the examples we have tried so far. It has no parameters, so the final ordering depends only on the initial permutation. Neighborhood, on the other hand, seems to capture features of the data which are missed by STS. For STS, one exemplary choice of weights is
which is an anti-symmetric linearly ascending vector, from −1 to 1. For this particular choice, (DX)j is simply the slope of the linear regression of the values in the jth row of D.
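The link between such a weight vector and the regression slope can be illustrated with a short Python sketch. The explicit formula used for the weights below, $X_i = -1 + 2i/(n-1)$ for $i = 0, \dots, n-1$, is an assumption consistent with the description "linearly ascending from −1 to 1"; the function names and the toy row are also illustrative. Under this assumption the score of a row equals its least-squares slope against the column index multiplied by a positive constant that depends only on n, so sorting by score is equivalent to sorting by slope.

```python
def sts_weights(n):
    """Assumed weights: anti-symmetric, linearly ascending from -1 to 1."""
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

def regression_slope(y):
    """Least-squares slope of y against the column index 0..n-1."""
    n = len(y)
    xbar = (n - 1) / 2.0
    ybar = sum(y) / n
    sxx = sum((i - xbar) ** 2 for i in range(n))
    sxy = sum((i - xbar) * (y[i] - ybar) for i in range(n))
    return sxy / sxx

row = [0.0, 1.0, 2.5, 2.0, 4.0]   # one row of a distance matrix (toy values)
n = len(row)
X = sts_weights(n)

score = sum(d * x for d, x in zip(row, X))            # (D X)_j for this row
sxx = sum((i - (n - 1) / 2.0) ** 2 for i in range(n))
scale = 2.0 * sxx / (n - 1)                            # constant, depends on n only
print(score, scale * regression_slope(row))
```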
One exemplary choice for the weight matrix of Neighborhood is taken to be Gaussian
which is then normalized to be doubly stochastic (i.e. sum of each row and column is equal to one). For a given data set, there exists a range of relevant length scales, where large scales reflect the overall layout of the data while smaller values give a better local organization at the expense of possibly fragmenting larger structures. This is captured in SPIN by controlling the value of σ. One heuristic scheme that usually works well is starting with a very large sigma, iterating several times, then lowering σ (e.g. by a factor of 2) and so forth.
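The normalization and the σ-lowering heuristic may be sketched as follows. This Python illustration rests on several assumptions: the Gaussian form $W_{ij} = \exp(-(i-j)^2 / 2\sigma^2)$, the use of alternating row and column normalization (Sinkhorn iterations) to reach a doubly stochastic matrix, the fixed number of iterations, and the stopping value of the schedule. None of these specifics is stated in the present description.

```python
import math

def gaussian_weights(n, sigma):
    """Assumed weight form: W[i][j] = exp(-(i - j)^2 / (2 sigma^2))."""
    return [[math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2)) for j in range(n)]
            for i in range(n)]

def make_doubly_stochastic(W, n_iter=500):
    """Alternately normalize rows and columns so that all sums approach one."""
    n = len(W)
    W = [row[:] for row in W]
    for _ in range(n_iter):
        for i in range(n):
            s = sum(W[i])
            W[i] = [w / s for w in W[i]]
        for j in range(n):
            s = sum(W[i][j] for i in range(n))
            for i in range(n):
                W[i][j] /= s
    return W

def sigma_schedule(start, factor=2.0, stop=1.0):
    """Yield start, start/factor, ... while the value is at least `stop`."""
    sigma = start
    while sigma >= stop:
        yield sigma
        sigma /= factor

W = make_doubly_stochastic(gaussian_weights(8, 2.0))
schedule = list(sigma_schedule(8.0))
print(schedule)
```

In use, a sorting step would be iterated several times at each value of the schedule, each stage starting from the ordering produced by the previous one.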
Section II-Illustrative Examples and Applications
This Section describes some illustrative, non-limiting examples and applications for the method according to the present invention, demonstrated according to a preferred embodiment of the method, termed herein SPIN. A sorting algorithm, such as the method presented herein, is particularly useful in cases where the effect of some continuous parameter needs to be studied.
A specific example of the type of data where this form of analysis may be pertinent is biological experiments, such as genome-wide experiments. For example, the expression profile of synchronized cells is governed by the time in cell-cycle progression in which a particular sample was harvested, as demonstrated in Example 3. In Example 4, initial findings from the analysis of cancer data are presented. Example 5 demonstrates the use of the present invention for machine or robot vision.
In these cases, SPIN's ability to ferret out elongated structures, even when the elongation refers to a complicated contour embedded in a high dimensional space, is extremely valuable.
We chose to present here analysis of the yeast Elutriation-Synchronized cell-cycle expression data (taken from [1]). Spellman et al. employed a supervised ‘phasing’ method to assign genes to five known classes, namely G1, S, S/G2, G2/M and M/G1, utilizing the expression profiles of genes that were previously known to participate in specific phases of the cell cycle. They then proceeded to perform unsupervised analysis, specifically hierarchical clustering, and found that most genes belonging to the same class were clustered together. In another work, [2] further improved the organization of the tree by employing a leaf ordering algorithm, and recovered the order of the phases in the cycle.
Here we suggest the sorting approach as a different exploratory analysis methodology. Instead of partitioning the genes into distinct clusters we generate a distance matrix and order it by SPIN. As explained in Section I above, the nature of a cyclic object can be deduced from the colored pattern in the sorted distance matrix (
The technical details of our analysis for this data set are as follows: The raw expression data was downloaded from a server at Stanford (http://cellcycle-www.stanford.edu), and included a total of 5,981 genes measured across 14 samples (which denote several consecutive stages along the cell-cycle). The only pre-processing step was a variance filter: the standard deviation was calculated for each of the 5,981 genes, and only the 600 genes with the highest values were chosen for analysis in SPIN. The gene distance matrix of size 600×600 was calculated using the simple Euclidean distance metric, and sorted in SPIN, as shown in
In the general context of expression data SPIN provides a two-way sorting platform, i.e. it is possible to order both samples and genes. In the specific case of the yeast cell-cycle data the samples are already organized and labeled according to the stage in cell-cycle progression from which they were harvested. Therefore, we proceeded to sort only the genes, and left the samples ordered according to their labels. However, we did examine the organization of the Euclidean distance matrix for the samples (of size 14×14). One interesting observation is that the samples also order in a cyclic conformation of a ring. This observation is in accordance with the biology of the experiment, since each sample represents the expression profile of a yeast cell during consecutive stages of the cell-cycle.
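A variance filter of the kind described above, followed by computation of the gene distance matrix, may be sketched in Python as follows. The function name `variance_filter`, the use of the population standard deviation, and the tiny toy expression matrix are assumptions for illustration; the actual data comprised 5,981 genes and 14 samples.

```python
import math
import statistics

def variance_filter(expr, k):
    """Return the indices (in original order) of the k rows with highest std."""
    sd = [statistics.pstdev(row) for row in expr]
    ranked = sorted(range(len(expr)), key=lambda g: -sd[g])
    return sorted(ranked[:k])

# Toy expression matrix (assumption): rows are genes, columns are samples.
expr = [
    [1.0, 1.0, 1.0],
    [0.0, 5.0, 10.0],
    [2.0, 3.0, 2.0],
    [0.0, 1.0, 9.0],
]
kept = variance_filter(expr, k=2)
filtered = [expr[g] for g in kept]

# Gene-by-gene Euclidean distance matrix over the sample dimensions.
D = [[math.dist(p, q) for q in filtered] for p in filtered]
print(kept, D)
```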
1. Spellman P T, Sherlock G, Zhang M Q, Iyer V R, Anders K, Eisen M B, Brown P O, Botstein D, Futcher B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell, 1998. 9(12): p. 3273-3297.
2. Bar-Joseph Z, Gifford D K, Jaakkola T S. Fast optimal leaf ordering for hierarchical clustering. Bioinformatics, 2001. 17: p. S22-9.
3. Alter O, Brown P O, Botstein D. Singular Value Decomposition for Genome-Wide Expression Data Processing and Modeling. PNAS, 2000. 97(18): p. 10101-10106.
The present invention used SPIN in the analysis of expression data originating from large-scale cancer experiments. A known problem in microarray experiments is that a given sample is usually contaminated by a mixture of cell types, so that the expression signal from the desired target may be partially masked [2].
The method of the present invention was used to analyze genomic data obtained from human leukemia patients. Expression data was used from Table 3 in ([17]), who identified 80 genes that separate Pro B-cell from pre B- and T-cell ALLs. The analysis presented in (10) may link those genes to hematopoiesis, which is the process of generation and differentiation of blood cells. During hematopoiesis stem cells divide and undergo differentiation to various stages, gradually losing their multipotency (11).
In the present analysis both the samples and genes were reordered using SPIN. By looking at the reordered distance matrices in
When the expression data is ordered in both directions it becomes apparent that most (about 60) genes are gradually turning off, and a minority of 20 genes, specific to the final target of the process, are turned on as differentiation proceeds. The gradual decrease in transcription viewed here is in accordance with the hypothesis that stem cells possess an open chromatin structure, which is progressively quenched during differentiation (12).
As previously described, the present invention is useful for any type of analysis problem involving the analysis of large sets of multidimensional data, including those characterized by having continuous variables. One example of such data is pattern recognition for machine or robot vision.
As an exemplary data set, the multi-feature digit dataset was examined. This dataset consists of features of handwritten numerals (‘0’-‘9’) extracted from a collection of Dutch utility maps (13-14). Two hundred patterns per class (for a total of 2,000 patterns) have been digitized in binary images, which have subsequently been averaged in windows of 2×3, resulting in 240 averaged pixels per image. Each pattern is thus represented as a vector of 240 elements, with values ranging from 0 to 1. The 2000×2000 Euclidean distance matrix between the patterns was calculated; optionally other distance matrices could also be used. One advantage to selecting a simpler distance measure such as Euclidean distance is that the characteristics of the measure itself do not bias the results in a particular direction. The distance matrix was then sorted by Neighborhood, using a series of decreasing values of σ (σ2=1,000,000, 500,000, 100,000, etc.) until convergence.
As can be seen in
A zoom-in operation in this context refers to extracting a sub-matrix from the input data, and regarding it as a “new” data-set. The distances are recalculated using only the remaining information in this sub-matrix. This is somewhat reminiscent of local PCA. SPIN thus allows the evaluation of the effects of sub-sets of the features on the data points. At the same time it also allows the evaluation of sub-sets of the data on the importance of the features. As an example, some of the pixels in the images of digits are always black (0) and thus do not contribute at all to the distances between patterns. Other pixels only change within a sub-set of the digits, and are thus important for discriminating between them, but not between others.
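The zoom-in operation may be sketched in Python as follows. The helper names `zoom_in` and `distance_matrix` and the toy data, in which the first feature is always zero (analogous to a pixel that is always black), are assumptions made for illustration only.

```python
import math

def distance_matrix(rows):
    """Euclidean distances between all pairs of rows."""
    return [[math.dist(p, q) for q in rows] for p in rows]

def zoom_in(data, objects, features):
    """Extract the sub-matrix for the chosen objects (rows) and features (columns)."""
    return [[data[i][f] for f in features] for i in objects]

# Toy data (assumption): 4 patterns, 3 features; feature 0 is always 0.
data = [
    [0.0, 0.2, 0.9],
    [0.0, 0.8, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.9, 0.2],
]
D_full = distance_matrix(data)

# Zoom in on patterns 0 and 2 using only features 1 and 2,
# then recalculate the distances from the sub-matrix alone.
sub = zoom_in(data, objects=[0, 2], features=[1, 2])
D_sub = distance_matrix(sub)
print(D_full[0][2], D_sub[0][1])
```

Dropping the constant feature leaves the distance between the two retained patterns unchanged, whereas dropping an informative feature would alter it.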
The method of the present invention was also used to analyze colon cancer. The biological question addressed here is that of recognizing alterations in gene expression that may be linked with the progression of cancer. SPIN is especially appropriate for this analysis, since cancer evolution is an inherently continuous process, which arises from a gradual accumulation of genetic alterations that promote selection of cells with increasingly aggressive behavior. Such continuity may be completely overlooked by traditional methods that emphasize clear separations.
Colon cancer is a good model system since samples are readily available across several well-defined stages of the disease, enabling a study of the onset of the neoplastic transformation. Expression profiles were determined for seven types of samples using the Affymetrix U133A GeneChip [D. Tsafrir, W. Liu, Y. Yamaguchi, I. Tsafrir, Y. Wen, W. Gerald, R. Stengel, F. Barany, P. Paty, E. Domany, and D. Notterman. A novel mathematical approach to analyzing gene expression data results from an international colon cancer consortium. In proc. of AACR-2004, 2004]: 47 primary carcinomas; 24 adenomas; 22 normal colon epithelium; 16 liver metastases; 19 lung metastases; 11 normal liver; and 5 normal lung. Standard pre-processing of the data included thresholding to T=10 and log transformation. A variance filter was utilized to concentrate on the most relevant genes. The analysis started with the 500 highest-varying transcripts, and this number was then doubled to 1000; since there was a significant change in the results, the number of transcripts was doubled again, to 2000. This did not alter the main conclusions to a noticeable degree, so many of the results were obtained with the top 1000 transcripts.
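The pre-processing steps named above can be sketched as follows. Several details are assumptions, since the text does not specify them: thresholding is taken to mean raising values below T to T, the logarithm is taken to base 2, and variance is computed per transcript across samples after the log transformation. The function name `preprocess` and the toy expression values are hypothetical.

```python
import numpy as np

def preprocess(expr, threshold=10.0, top_k=1000):
    """Threshold, log-transform, and keep the top_k highest-variance rows.

    expr is a transcripts x samples array of raw expression values.
    Returns the filtered log-expression matrix and the kept row indices.
    """
    # Assumption: values below the threshold are raised to the threshold.
    x = np.maximum(np.asarray(expr, dtype=float), threshold)
    # Assumption: base-2 logarithm.
    log_x = np.log2(x)
    # Variance of each transcript across samples.
    variances = log_x.var(axis=1)
    k = min(top_k, log_x.shape[0])
    # Row indices sorted by decreasing variance, truncated to the top k.
    keep = np.argsort(variances)[::-1][:k]
    return log_x[keep], keep

# Toy matrix: 5 transcripts x 4 samples (made-up values).
expr = np.array([[5.0, 8.0, 3.0, 9.0],          # entirely below threshold
                 [100.0, 100.0, 100.0, 100.0],  # constant
                 [10.0, 1000.0, 10.0, 1000.0],  # strongly varying
                 [50.0, 60.0, 55.0, 65.0],      # weakly varying
                 [20.0, 200.0, 20.0, 200.0]])   # moderately varying
filtered, keep = preprocess(expr, threshold=10.0, top_k=2)
```

Repeating the call with `top_k` set to 500, 1000 and 2000 would mirror the doubling procedure described in the text.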
The results are shown in
In the context of such complex data, the search for genes and pathways that are causally involved in cancer is complicated by the need to distinguish their signal from a large background of innocent bystander genes, whose expression levels appear altered due to secondary causes. An initial objective is to generate an overall impression of the data's structure, identifying major partitions and relationships. By filtering the highest variance genes and ordering the resulting expression matrix in SPIN (see
In consecutive analysis stages, detailed in the following paragraphs, the process focused individually on sets of correlated genes that were identified in this initial step. SPIN is used to re-order the samples in the context of each gene-set separately, and the resulting permutation is shown to be informative of the underlying biology (see
This Example also includes a consideration of the effect of overrepresentation of, or contamination by, particular types of tissues. Previous expression-data studies recognized the challenge posed by the heterogeneous composition of sampled tissues [Alon et al., 1999], which was not answered in the context of traditional analysis methods [Ghosh, 2004]. In the current data the clearest separation in the samples is according to their organ of origin—either colon, liver or lung—with the liver samples forming the most distinct group (see
The most prominent gene-cluster, highlighted by the bottom black rectangle (
Other types of contamination were also seen, including muscle and connective tissue contamination. As demonstrated above, the problem of tissue heterogeneity may be a major complication, and one that was mostly unresolved by traditional analysis methods. In some data sets an assessment by the pathologist of the percentage of relevant tissues in each sample is available [Notterman et al., 2001, Alon et al., 1999], and this information can be utilized to construct an appropriate statistical test [Ghosh, 2004]. In the current data no such knowledge is available, which prevents the proper employment of supervised methods, and necessitates the use of an unsupervised approach.
For example, consider a group of genes that appear significantly under-expressed in the neoplastic samples as compared with normal tissue (434 transcripts out of the examined top 1000 passed the Wilcoxon rank-sum test with FDR of q=0.05). It has already been observed in colon cancer studies that tumor samples are more biased towards epithelium tissue than their normal counterparts, causing apparent under-expression of genes functioning in muscle and connective tissues [Alon et al., 1999]. In the SPIN-permuted data (
The method of the present invention was also shown to be able to detect gradual loss of differentiation, in addition to the artifacts described above. Employing supervised statistical tests to compare our normal colon samples with the tumors resulted in a mixed list, which included some genes that the SPIN analysis revealed to be related to tissue mixtures. It is further possible using SPIN to distinguish the desired set of disease-progression-associated genes, and to show that the reduction in their expression is correlated with the gradual onset of the cancer. Focusing on this subset of genes reveals that in this context the samples trace an elongated shape (
The analysis of the colon cancer data demonstrates a situation where SPIN can be used to assign new labels to samples, and employ this knowledge to improve the application of supervised methods. Metastasis samples, for example, can be marked according to the degree of surrounding normal tissue inadvertently included in the sample's preparation. One way of gaining this information is in the context of the liver-specific cluster, where the samples' expression profiles can be viewed as the result of a gradual mixing process, starting with samples extracted from the colon, that contain no liver tissue, and continuing with the metastasis samples that vary in the amount of liver contamination. The degree of liver mixture in each sample is reflected by the SPIN ordering, as can be seen in
In this work we presented several data sets where the ordered distance matrix generated by SPIN was extremely helpful in uncovering the structure of the data. One of the examples demonstrating SPIN's ability to reveal the layout of the data is the yeast cell-cycle. This data set was previously analyzed using hierarchical clustering [7]. Despite being a very useful visualization tool, hierarchical dendrograms do not give a clear indication of the relative positions, symmetries, and shapes of the clusters. Another drawback of hierarchical clustering is the large number of possible leaf orderings of the clustering tree. The algorithm in [8] finds the optimal leaf-ordering with respect to the nearest-neighbors energy function, given a particular dendrogram. This energy function is a special case of Neighborhood with Wij=1 if |i−j|=1, and Wij=0 otherwise. Moreover, the requirement of an ordering satisfying a given dendrogram could be too restrictive, especially since different clustering algorithms may give different results for the same data set. SPIN, on the other hand, provides an ordering of the objects using only the information available from the distance matrix, thus maintaining the ability to explore the entire permutation space, bypassing the need for a middle-man. Having said that, it may be beneficial to combine our sorting strategy with clustering. In such synergy the clustering algorithm would enhance the separation into clear clusters, while the sorter would help elucidate the shapes and relationships between the clusters.
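The nearest-neighbors special case mentioned above can be illustrated with a short sketch. It assumes the energy of an ordering is the weighted sum of the permuted distances, i.e. the sum over i and j of Wij times the distance between the objects placed at positions i and j; the function names and the toy points are hypothetical. With the nearest-neighbor weights, only distances between adjacent positions contribute.

```python
import numpy as np

def nearest_neighbor_weights(n):
    """Weight matrix with W[i, j] = 1 if |i - j| == 1, and 0 otherwise."""
    idx = np.arange(n)
    return (np.abs(idx[:, None] - idx[None, :]) == 1).astype(float)

def ordering_energy(dist, order, weights):
    """Weighted sum of distances after permuting rows and columns by order."""
    order = np.asarray(order)
    permuted = dist[np.ix_(order, order)]
    return float(np.sum(weights * permuted))

# Toy example: five points on a line, so the natural order is known.
positions = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
dist = np.abs(positions[:, None] - positions[None, :])
W = nearest_neighbor_weights(len(positions))

# Each adjacent pair is counted twice because W is symmetric.
sorted_energy = ordering_energy(dist, [0, 1, 2, 3, 4], W)
shuffled_energy = ordering_energy(dist, [0, 3, 1, 4, 2], W)
```

In this toy case the natural order yields the lower energy, which is the behavior a sorting criterion of this form is meant to reward.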
Although SPIN can be viewed as a special case of dimensionality reduction (to one dimension), the emphasis is on ordering the points, rather than preserving their distances. Dimensionality reduction techniques, such as MDS, LLE [12] or Isomap [13], distort the distances. Therefore, the existence of a low-dimensional object can be discovered, but its structure is not readily inferred. Using SPIN, we have demonstrated that the re-ordered distance matrix highlights structural features of the object embedded in the high-dimensional space. Furthermore, we have also shown how SPIN can enhance dimensionality reduction techniques, as exemplified above where the color coded ordering significantly clarifies the PCA image. To conclude, the sole input to SPIN is a distance matrix (not necessarily Euclidean), which makes it applicable to any problem involving arrangements of points in multi-dimensional space, where a metric can be defined.
Although the invention has been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IL05/00241 | | WO | |
Number | Date | Country | |
---|---|---|---|
60548182 | Mar 2004 | US |