The present invention relates to systems for performing evolutionary spectral clustering.
In many clustering applications, the characteristics of the objects to be clustered change over time. Typically, such characteristic change contains both long-term trend due to concept drift and short-term variation due to noise. For example, in the blogosphere where blog sites are to be clustered (e.g., for community detection), the overall interests of a blogger and the blogger's friendship network may drift slowly over time and simultaneously, short-term variation may be triggered by external events. As another example, in a ubiquitous computing environment, moving objects equipped with GPS sensors and wireless connections are to be clustered (e.g., for traffic jam prediction or for animal migration analysis). The coordinate of a moving object may follow a certain route in the long-term but its estimated coordinate at a given time may vary due to limitations on bandwidth and sensor accuracy.
These application scenarios, where the objects to be clustered evolve with time, raise new challenges to traditional clustering algorithms. On one hand, the current clusters should depend mainly on the current data features—aggregating all historic data features makes little sense in non-stationary scenarios. On the other hand, the current clusters should not deviate too dramatically from the most recent history. This is because in most dynamic applications, the system does not expect data to change too quickly and as a consequence, the system expects certain levels of temporal smoothness between clusters in successive time steps. This point can be illustrated using an evolutionary clustering scenario example in
In time series analysis, moving averages are often used to smooth out short-term fluctuations. Because similar short-term variation also exists in clustering applications, due either to noise in the data or to non-robust behavior of clustering algorithms (e.g., converging to different locally suboptimal modes), new clustering techniques are needed to handle evolving objects and to obtain stable and consistent clustering results.
In clustering data streams, large amounts of data that arrive at high rate make it impractical to store all the data in memory or to scan them multiple times. Such a new data model raises issues such as how to efficiently cluster massive data set by using limited memory and by one-pass scanning of data, and how to cluster evolving data streams under multiple resolutions so that a user can query any historic time period with guaranteed accuracy.
Incremental clustering algorithms have been used to efficiently apply dynamic updates to the cluster centers, medoids, or hierarchical trees when new data points arrive. However, newly arrived data points have no direct relationship with existing data points, other than that they probably share similar statistical characteristics. For example, moving objects can be clustered based on micro-clustering and an incremental spectral clustering algorithm has been applied to similarity changes among objects that evolve with time. However, the focus of these systems is to improve computation efficiency at the cost of lower cluster quality. Constrained clustering has also been used where either hard constraints such as cannot links and must links or soft constraints such as prior preferences are incorporated in the clustering task.
Evolutionary clustering is an emerging research area essential to important applications such as clustering dynamic Web and blog contents and clustering data streams. In evolutionary clustering, a good clustering result should fit the current data well, while simultaneously not deviating too dramatically from the recent history. To fulfill this dual purpose, a measure of temporal smoothness is integrated in the overall measure of clustering quality. In Chakrabarti et al., Evolutionary clustering, In Proc. of the 12th ACM SIGKDD Conference, 2006, an evolutionary hierarchical clustering algorithm and an evolutionary k-means clustering algorithm are discussed. Chakrabarti et al. propose to measure the temporal smoothness by a distance between the clusters at time t and those at time t−1. The cluster distance is defined by (1) pairing each centroid at t to its nearest peer at t−1 and (2) summing the distances between all pairs of centroids. However, the pairing procedure is based on heuristics and it could be unstable (a small perturbation on the centroids may change the pairing dramatically). Additionally, because Chakrabarti et al. ignore the fact that the same data points are to be clustered at both t and t−1, this distance may be sensitive to movements of the data points such as shifts and rotations (e.g., consider a fleet of vehicles that move together while the relative distances among them remain the same).
Systems and methods are disclosed for clusterizing information by determining a similarity matrix for historical information and a similarity matrix for current information; generating an aggregated similarity matrix (aggregated kernel); and applying evolutionary spectral clustering on the aggregated kernel to a content stream to produce one or more clusters.
Implementations of the above aspect may include one or more of the following. The system can combine the similarity matrices to obtain the aggregated kernel. The system can scale the similarity matrices. The similarity matrices can be linearly combined to obtain the kernel. The system can determine a quality of current cluster result. The system can also determine temporal smoothness. Evolutionary clusters can be generated. The system can define a cost function to measure a quality of a clustering result on evolving information. The cost function can be defined using one or more graph-based measures. The cost function can be
Cost=α·CS+β·CT
where CS represents a snapshot cost that measures a snapshot quality of a current clustering result with respect to current data features and CT represents a temporal cost that measures a temporal smoothness, and where 0≦α≦1 is a parameter assigned by a user and together with β(=1−α), reflect the user's emphasis on the snapshot cost and temporal cost. CT can represent a goodness-of-fit of the current clustering result with respect to historic data features. CT can also measure a cluster quality. The system can determine a negated average association for evolutionary spectral clustering. A cost for evolutionary normalized cut can be generated. The system can derive corresponding optimal solutions such as relaxed optimal solutions. Such systems can be used for clusterizing blog entries for community detection.
In another aspect, a method for clusterizing information includes determining a first similarity matrix from a historic cluster obtained from historic information; generating an aggregated similarity matrix (aggregated kernel); and applying evolutionary spectral clustering on the aggregated kernel to a content stream to produce one or more clusters.
Implementations of the above aspect can include one or more of the following. The similarity matrices can be combined to obtain the kernel. The similarity matrices can be scaled. The system can linearly combine the similarity matrices to obtain the kernel. The system can determine a quality of current cluster result and temporal smoothness. The system can then generate evolutionary clusters. The cost function is defined using one or more graph-based measures. The cost function can be Cost=α·CS+β·CT, where CS represents a snapshot cost that measures a snapshot quality of a current clustering result with respect to current data features and CT represents a temporal cost that measures a temporal smoothness, and where 0≦α≦1 is a parameter assigned by a user and together with β(=1−α), reflect the user's emphasis on the snapshot cost and temporal cost. CT represents a goodness-of-fit of the current clustering result with respect to historic data features. CT can be used to measure a cluster quality. The system can determine a negated average association for evolutionary spectral clustering. The system can also determine a cost for evolutionary normalized cut. Corresponding optimal solutions can be located, such as relaxed optimal solutions. The system can be used for clusterizing blog sites for community detection. The system supports changing cluster numbers and adding and removing of nodes.
Advantages of the system and/or method may include one or more of the following. The system provides the ability to solve the entire class of time-dependent clustering problem and to provide clustering results with higher quality. The system solves the evolutionary spectral clustering problems in a manner that provides more stable and consistent clustering results that are less sensitive to short-term noises and at the same time are adaptive to long-term cluster drifts. Furthermore, the system provides optimal solutions to the relaxed versions of the corresponding evolutionary k-means clustering problems. The system supports two frameworks for evolutionary spectral clustering in which the temporal smoothness is incorporated into the overall clustering quality. The system derives optimal solutions to the relaxed versions of the proposed evolutionary spectral clustering frameworks. Because the unrelaxed versions are shown to be NP-hard, the system provides practical ways of obtaining the final clusters and the upper bounds on the performance of the algorithms. The system can handle cases where the number of clusters changes with time and the case where new data points are inserted and old ones are removed over time. The system also obtains clusters that evolve smoothly over time. The constraints are not given a priori. Instead, the system sets its goal to optimize a cost function that incorporates temporal smoothness. As a consequence, some soft constraints are automatically implied when historic data and clusters are connected with current ones.
The clustering techniques in the system can be used in any application where temporal information is available. Some examples include: community and event detection in blogs and on the Web, where the inter-relations among blogs and Web sites change with time; image segmentation for a sequence of images that follow temporal order; and object tracking in ubiquitous computing, where objects equipped with sensors move around over time, among others.
Next, details on the above processes will be discussed. In the following discussion, capital letters, such as $W$ and $Z$, represent matrices. Lower case letters in vector forms, such as $\vec{v}_i$ and $\vec{\mu}_l$, represent column vectors. Scripted letters, such as $V$ and $V_p$, represent sets. For ease of presentation, for a given variable, such as $W$ and $\vec{v}_i$, the system attaches a subscript $t$, i.e., $W_t$ and $\vec{v}_{i,t}$, to represent the value of the variable at time $t$. The system uses $\mathrm{Tr}(W)$ to represent the trace of $W$, where $\mathrm{Tr}(W)=\sum_i W(i,i)$. In addition, for a matrix $X\in\mathbb{R}^{n\times k}$, the system uses $\mathrm{span}(X)$ to represent the subspace spanned by the columns of $X$. For vector norms the system uses the Euclidean norm and for matrix norms the system uses the Frobenius norm, i.e., $\|W\|^2=\sum_{i,j}W(i,j)^2=\mathrm{Tr}(W^TW)$.
The clustering problem can be stated in the following way. For a set $V$ of $n$ nodes, a clustering result is a partition $\{V_1,\ldots,V_k\}$ of the nodes in $V$ such that $V=\bigcup_{l=1}^{k}V_l$ and $V_p\cap V_q=\emptyset$ for $1\le p,q\le k$, $p\ne q$. A partition (clustering result) can be equivalently represented as an n-by-k matrix $Z$ whose elements are in $\{0,1\}$, where $Z(i,j)=1$ if and only if node $i$ belongs to cluster $j$. Because each node belongs to exactly one cluster, $Z\cdot\vec{1}_k=\vec{1}_n$, where $\vec{1}_k$ and $\vec{1}_n$ are k-dimensional and n-dimensional vectors of all ones. In addition, the columns of $Z$ are orthogonal. Furthermore, $Z$ can be normalized in the following way: the $l$-th column of $Z$ is divided by $\sqrt{|V_l|}$ to get $\tilde{Z}$, where $|V_l|$ is the size of $V_l$. Note that the columns of $\tilde{Z}$ are orthonormal, i.e., $\tilde{Z}^T\tilde{Z}=I_k$.
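The indicator-matrix representation above can be sketched as follows. This is a minimal illustration in Python with NumPy; the function names `indicator_matrix` and `normalize_indicator` and the six-node label vector are hypothetical and are not part of the described system.

```python
import numpy as np

def indicator_matrix(labels, k):
    """Return the n-by-k 0/1 matrix Z with Z[i, j] = 1 iff node i is in cluster j."""
    labels = np.asarray(labels)
    Z = np.zeros((labels.size, k))
    Z[np.arange(labels.size), labels] = 1.0
    return Z

def normalize_indicator(Z):
    """Divide column l of Z by sqrt(|V_l|); assumes no cluster is empty."""
    sizes = Z.sum(axis=0)
    return Z / np.sqrt(sizes)

# Hypothetical assignment of 6 nodes to 3 clusters.
labels = [0, 0, 1, 2, 2, 2]
Z = indicator_matrix(labels, 3)
Z_tilde = normalize_indicator(Z)
```

With these definitions, `Z @ np.ones(3)` is the all-ones vector and `Z_tilde.T @ Z_tilde` is the 3-by-3 identity, which are the two properties stated in the text.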
K-Means Clustering
The k-means clustering problem is one of the most widely studied clustering problems. Assume the $i$-th node in $V$ can be represented by an $m$-dimensional feature vector $\vec{v}_i\in\mathbb{R}^m$; then the k-means clustering problem is to find a partition $\{V_1,\ldots,V_k\}$ that minimizes the following measure
$$KM=\sum_{l=1}^{k}\sum_{i\in V_l}\|\vec{v}_i-\vec{\mu}_l\|^2$$
where $\vec{\mu}_l$ is the centroid (mean) of the $l$-th cluster, i.e., $\vec{\mu}_l=\sum_{j\in V_l}\vec{v}_j/|V_l|$.
A well-known algorithm for the k-means clustering problem is the so-called k-means algorithm, in which, after k initial centroids are picked at random, the following procedure is repeated until convergence: all the data points are assigned to the clusters whose centroids are nearest to them, and then the cluster centroids are updated by taking the average of the data points assigned to them.
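The iteration just described can be sketched as below. This is an illustrative NumPy version under stated assumptions: the initial centroids are drawn from the data points themselves, an empty cluster keeps its previous centroid, and the names `kmeans` and `km_cost` as well as the four sample points are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means: alternate nearest-centroid assignment and centroid update."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Assumption: initial centroids are k distinct data points chosen at random.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid (n x k).
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Assumption: an empty cluster keeps its old centroid.
        new_centroids = np.array([
            points[labels == l].mean(axis=0) if np.any(labels == l) else centroids[l]
            for l in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

def km_cost(points, labels, centroids):
    """Sum of squared distances of the points to their assigned centroids."""
    return float(((points - centroids[labels]) ** 2).sum())

points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
labels, centroids = kmeans(points, 2)
cost = km_cost(points, labels, centroids)
```

Because the procedure only converges to a local optimum in general, different seeds can give different partitions on harder data; this is one source of the short-term instability discussed earlier.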
Spectral Clustering
The basic idea of spectral clustering is to cluster based on the eigenvectors of a (possibly normalized) similarity matrix $W$ defined on the set of nodes in $V$. Very often $W$ is positive semi-definite. Commonly used similarities include the inner product of the feature vectors, $W(i,j)=\vec{v}_i^T\vec{v}_j$, the diagonally-scaled Gaussian similarity, $W(i,j)=\exp(-(\vec{v}_i-\vec{v}_j)^T\mathrm{diag}(\vec{\gamma})(\vec{v}_i-\vec{v}_j))$, and the affinity matrices of graphs.
Spectral clustering algorithms usually solve graph partitioning problems where different graph-based measures are to be optimized. Two popular measures are to maximize the average association and to minimize the normalized cut. For two subsets, $V_p$ and $V_q$, of the node set $V$ (where $V_p$ and $V_q$ do not have to be disjoint), the system first defines the association between $V_p$ and $V_q$ as $assoc(V_p,V_q)=\sum_{i\in V_p,\,j\in V_q}W(i,j)$. The system then defines the k-way average association as
$$AA=\sum_{l=1}^{k}\frac{assoc(V_l,V_l)}{|V_l|}$$
and the k-way normalized cut as
$$NC=\sum_{l=1}^{k}\frac{assoc(V_l,V\setminus V_l)}{assoc(V_l,V)}$$
where $V\setminus V_l$ is the complement of $V_l$. For consistency, the system further defines the negated average association as
$$NA=\mathrm{Tr}(W)-AA=\mathrm{Tr}(W)-\sum_{l=1}^{k}\frac{assoc(V_l,V_l)}{|V_l|}$$
where, as will be shown later, NA is always non-negative if $W$ is positive semi-definite. In the remainder of this description, instead of maximizing AA, the system equivalently aims to minimize NA, and as a result, all three objective functions, KM, NA and NC, are to be minimized.
Finding the optimal partition Z for either the negated average association or the normalized cut is NP-hard. Therefore, in spectral clustering algorithms, usually a relaxed version of the optimization problem is solved by (1) computing eigenvectors X of some variation of the similarity matrix W, (2) projecting all data points to span(X), and (3) applying the k-means algorithm to the projected data points to obtain the clustering result. While it may seem counterintuitive to apply spectral analysis and then use the k-means algorithm again, it has been shown that such procedures have many advantages; for example, they work well when the data points are not linearly separable. Steps (2) and (3) use standard procedures in traditional spectral clustering and thus will not be discussed in depth.
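Steps (1) and (2) can be sketched as follows for the unnormalized case, where the eigenvectors are taken from W itself. The sketch assumes a symmetric W and uses NumPy's `numpy.linalg.eigh`; the function name `spectral_embedding` and the small block-diagonal example are hypothetical.

```python
import numpy as np

def spectral_embedding(W, k):
    """Return X (eigenvectors of the k largest eigenvalues of W) and those eigenvalues.

    Row i of X is node i projected onto span(X); step (3) would run k-means on the rows.
    """
    eigvals, eigvecs = np.linalg.eigh(W)   # eigh returns eigenvalues in ascending order
    X = eigvecs[:, -k:]
    return X, eigvals[-k:]

# Two disconnected pairs of nodes: {0, 1} and {2, 3}.
W = np.kron(np.eye(2), np.ones((2, 2)))
X, top = spectral_embedding(W, 2)
```

In this example the rows of `X` for nodes 0 and 1 coincide, as do those for nodes 2 and 3, so any reasonable clustering of the rows recovers the two groups.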
Two Frameworks for Evolutionary Spectral Clustering
Overview
The system defines a general cost function to measure the quality of a clustering result on evolving data points. The function contains two costs. The first cost, snapshot cost (CS), only measures the snapshot quality of the current clustering result with respect to the current data features, where a higher snapshot cost means worse snapshot quality. The second cost, temporal cost (CT), measures the temporal smoothness in terms of the goodness-of-fit of the current clustering result with respect to either historic data features or historic clustering results, where a higher temporal cost means worse temporal smoothness. The overall cost function is defined as a linear combination of these two costs:
Cost=α·CS+β·CT
where 0≦α≦1 is a parameter assigned by the user and together with β(=1−α), they reflect the user's emphasis on the snapshot cost and temporal cost, respectively.
In both frameworks, for a current partition (clustering result), the snapshot cost CS is measured by the clustering quality when the partition is applied to the current data. The two frameworks are different in how the temporal cost CT is defined. In the first framework, which the system names PCQ for preserving cluster quality, the current partition is applied to historic data and the resulting cluster quality determines the temporal cost. In the second framework, which the system names PCM for preserving cluster membership, the current partition is directly compared with the historic partition and the resulting difference determines the temporal cost.
Preserving Cluster Quality (PCQ)
In the first framework, PCQ, the temporal cost is expressed as how well the current partition clusters historic data. The system illustrates this through an example shown in
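For the k-means measure, the overall cost at time $t$ can be written as
$$Cost_{KM}=\alpha\cdot KM_t|_{Z_t}+\beta\cdot KM_{t-1}|_{Z_t}$$
where $|_{Z_t}$ means evaluated by the partition $Z_t$, which is computed at time $t$.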
Negated Average Association
The PCQ framework for evolutionary spectral clustering starts with the case of negated average association. At time $t$, for a given partition $Z_t$, a natural definition of the overall cost is
$$Cost_{NA}=\alpha\cdot NA_t|_{Z_t}+\beta\cdot NA_{t-1}|_{Z_t}$$
The cluster quality is measured by the negated average association NA rather than the k-means KM.
Next, the system derives a solution to minimizing CostNA. First, the negated average association can be equivalently written as
$$NA=\mathrm{Tr}(W)-\mathrm{Tr}(\tilde{Z}^TW\tilde{Z})$$
Therefore the system writes the overall cost as
$$Cost_{NA}=\alpha\cdot\mathrm{Tr}(W_t)-\alpha\cdot\mathrm{Tr}(\tilde{Z}_t^TW_t\tilde{Z}_t)+\beta\cdot\mathrm{Tr}(W_{t-1})-\beta\cdot\mathrm{Tr}(\tilde{Z}_t^TW_{t-1}\tilde{Z}_t)=\mathrm{Tr}(\alpha W_t+\beta W_{t-1})-\mathrm{Tr}[\tilde{Z}_t^T(\alpha W_t+\beta W_{t-1})\tilde{Z}_t]$$
The first term $\mathrm{Tr}(\alpha W_t+\beta W_{t-1})$ is a constant independent of the clustering partitions and as a result, minimizing $Cost_{NA}$ is equivalent to maximizing the trace $\mathrm{Tr}[\tilde{Z}_t^T(\alpha W_t+\beta W_{t-1})\tilde{Z}_t]$, subject to $\tilde{Z}_t$ being a normalized indicator matrix. Because maximizing the average association is an NP-hard problem, finding the solution $\tilde{Z}_t$ that minimizes $Cost_{NA}$ is also NP-hard. So following most spectral clustering algorithms, the system relaxes $\tilde{Z}_t$ to $X_t\in\mathbb{R}^{n\times k}$ with $X_t^TX_t=I_k$. It is known that one solution to this relaxed optimization problem is the matrix $X_t$ whose columns are the k eigenvectors associated with the top-k eigenvalues of the matrix $\alpha W_t+\beta W_{t-1}$. Therefore, after computing the solution $X_t$, the system can project the data points into $\mathrm{span}(X_t)$ and then apply k-means to obtain a solution to the evolutionary spectral clustering problem under the measure of negated average association. In addition, the value $\mathrm{Tr}(\alpha W_t+\beta W_{t-1})-\mathrm{Tr}[X_t^T(\alpha W_t+\beta W_{t-1})X_t]$ provides a lower bound on the cost of the evolutionary clustering problem, i.e., no partition can achieve a smaller $Cost_{NA}$.
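The relaxed PCQ solution for negated average association can be sketched as follows. The sketch assumes symmetric similarity matrices and NumPy's `numpy.linalg.eigh`; the names `pcq_na` and `cost_na`, the random feature matrices, and the value alpha = 0.8 are hypothetical choices for illustration only.

```python
import numpy as np

def pcq_na(W_t, W_prev, k, alpha):
    """Top-k eigenvectors of alpha*W_t + beta*W_prev and the relaxed lower bound on Cost_NA."""
    beta = 1.0 - alpha
    M = alpha * W_t + beta * W_prev
    eigvals, eigvecs = np.linalg.eigh(M)          # ascending eigenvalues
    X_t = eigvecs[:, -k:]
    lower_bound = np.trace(M) - eigvals[-k:].sum()
    return X_t, lower_bound

def cost_na(W_t, W_prev, labels, k, alpha):
    """Unrelaxed Cost_NA of a hard partition given as integer labels in 0..k-1."""
    M = alpha * W_t + (1.0 - alpha) * W_prev
    Z = np.zeros((len(labels), k))
    Z[np.arange(len(labels)), labels] = 1.0
    Z_tilde = Z / np.sqrt(Z.sum(axis=0))          # assumes no empty cluster
    return np.trace(M) - np.trace(Z_tilde.T @ M @ Z_tilde)

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(3, 8))                  # 8 points with 3 features at time t-1
A_t = A_prev + 0.1 * rng.normal(size=(3, 8))      # slightly perturbed features at time t
W_prev, W_t = A_prev.T @ A_prev, A_t.T @ A_t      # inner-product similarities
X_t, bound = pcq_na(W_t, W_prev, k=2, alpha=0.8)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])       # an arbitrary hard partition
hard_cost = cost_na(W_t, W_prev, labels, 2, 0.8)
```

For any hard partition, `hard_cost` is at least `bound`, which is how the relaxed value serves as a reference point for the quality of the final clusters.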
Moreover, a close connection between the k-means clustering problem and spectral clustering algorithms has been shown: if the system puts the m-dimensional feature vectors of the n data points in $V$ into an m-by-n matrix $A=(\vec{v}_1,\ldots,\vec{v}_n)$, then
$$KM=\mathrm{Tr}(A^TA)-\mathrm{Tr}(\tilde{Z}^TA^TA\tilde{Z})$$
The k-means clustering problem is a special case of the negated average association spectral clustering problem, where the similarity matrix W is defined by the inner product ATA. As a consequence, the solution to the NA evolutionary spectral clustering problem can also be applied to solve the k-means evolutionary clustering problem in the PCQ framework, i.e., under the cost function previously defined.
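The trace identity for KM can be checked numerically. The following sketch compares the within-cluster sum of squares computed directly with the trace form; the random data, the label vector, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 6))            # m = 2 features, n = 6 points stored as columns
labels = np.array([0, 0, 1, 1, 1, 2])  # an arbitrary partition into k = 3 clusters
k = 3

# Normalized indicator matrix Z_tilde for the partition.
Z = np.zeros((6, k))
Z[np.arange(6), labels] = 1.0
Z_tilde = Z / np.sqrt(Z.sum(axis=0))

# KM computed directly: squared distances of points to their cluster centroid.
km_direct = sum(
    ((A[:, labels == l] - A[:, labels == l].mean(axis=1, keepdims=True)) ** 2).sum()
    for l in range(k)
)

# KM computed through the trace form with W = A^T A.
W = A.T @ A
km_trace = np.trace(W) - np.trace(Z_tilde.T @ W @ Z_tilde)
```

The two values agree up to floating-point error, which is why a solver for the NA problem with W = AᵀA can be reused for k-means.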
Normalized Cut
For the normalized cut, the system defines the overall cost for evolutionary normalized cut to be
$$Cost_{NC}=\alpha\cdot NC_t|_{Z_t}+\beta\cdot NC_{t-1}|_{Z_t}$$
However, computing the optimal solution to minimize the normalized cut is NP-hard. As a result, finding an indicator matrix Zt that minimizes CostNC is also NP-hard. The system now provides an optimal solution to a relaxed version of the problem.
For a given partition Z, the normalized cut can be equivalently written as
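$$NC=k-\mathrm{Tr}[Y^T(D^{-1/2}WD^{-1/2})Y]$$
where $D$ is a diagonal matrix with $D(i,i)=\sum_{j=1}^{n}W(i,j)$ and $Y$ is any matrix in $\mathbb{R}^{n\times k}$ that satisfies two conditions: (a) the columns of $D^{-1/2}Y$ are piecewise constant with respect to $Z$ and (b) $Y^TY=I_k$. The system removes constraint (a) to get a relaxed version of the optimization problem
$$Cost_{NC}\approx\alpha k-\alpha\,\mathrm{Tr}[X_t^T(D_t^{-1/2}W_tD_t^{-1/2})X_t]+\beta k-\beta\,\mathrm{Tr}[X_t^T(D_{t-1}^{-1/2}W_{t-1}D_{t-1}^{-1/2})X_t]=k-\mathrm{Tr}[X_t^T(\alpha D_t^{-1/2}W_tD_t^{-1/2}+\beta D_{t-1}^{-1/2}W_{t-1}D_{t-1}^{-1/2})X_t]$$
for some $X_t\in\mathbb{R}^{n\times k}$ such that $X_t^TX_t=I_k$. Again the system has a trace maximization problem and a solution is the matrix $X_t$ whose columns are the k eigenvectors associated with the top-k eigenvalues of the matrix
$$\alpha D_t^{-1/2}W_tD_t^{-1/2}+\beta D_{t-1}^{-1/2}W_{t-1}D_{t-1}^{-1/2}$$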
And again, after obtaining Xt, the system can further project data points into span(Xt) and then apply the k-means algorithm to obtain the final clusters.
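The normalized-cut variant differs from the NA variant only in that each similarity matrix is degree-normalized before the weighted combination. A minimal sketch follows, assuming symmetric matrices with strictly positive row sums; the names `normalized_similarity` and `pcq_nc` and the random test matrices are hypothetical.

```python
import numpy as np

def normalized_similarity(W):
    """Return D^{-1/2} W D^{-1/2}, where D is the diagonal matrix of row sums of W."""
    d = W.sum(axis=1)                 # assumed strictly positive
    s = 1.0 / np.sqrt(d)
    return W * s[:, None] * s[None, :]

def pcq_nc(W_t, W_prev, k, alpha):
    """Top-k eigenvectors of the combined normalized matrix and the relaxed cost value."""
    M = alpha * normalized_similarity(W_t) + (1.0 - alpha) * normalized_similarity(W_prev)
    eigvals, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, -k:], k - eigvals[-k:].sum()

def random_similarity(rng, n):
    """Hypothetical symmetric similarity matrix with positive entries."""
    R = rng.random((n, n))
    return (R + R.T) / 2

rng = np.random.default_rng(3)
W_prev, W_t = random_similarity(rng, 6), random_similarity(rng, 6)
X_t, relaxed_cost = pcq_nc(W_t, W_prev, k=2, alpha=0.8)
```

The rows of `X_t` would then be clustered with k-means, as in the unnormalized case.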
Moreover, the normalized cut approach can be used to minimize the cost function of a weighted kernel k-means problem. As a consequence, the evolutionary spectral clustering algorithm can also be applied to solve the evolutionary version of the weighted kernel k-means clustering problem.
The PCQ evolutionary clustering framework provides a data clustering technique similar to the moving average framework in time series analysis, in which the short-term fluctuation is expected to be smoothed out. The solutions to the PCQ framework turn out to be very intuitive—the historic similarity matrix is scaled and combined with current similarity matrix and the new combined similarity matrix is fed to traditional spectral clustering algorithms.
One assumption used in the above derivation is that the temporal cost is determined by data at time t−1 only. However, the PCQ framework can be easily extended to cover longer historic data by including similarity matrices W's at older time, probably with different weights (e.g., scaled by an exponentially decaying factor to emphasize more recent history).
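One way such a longer history could be aggregated is sketched below. The weighting scheme is an assumption made for illustration: weights proportional to decay^j for the matrix j steps in the past, normalized to sum to one. The function name `aggregate_history` and the example matrices are hypothetical.

```python
import numpy as np

def aggregate_history(W_list, decay):
    """Combine similarity matrices with exponentially decaying weights.

    W_list[0] is the current matrix W_t and W_list[j] is W_{t-j}.
    Assumption: weights are decay**j, rescaled so that they sum to 1.
    """
    weights = decay ** np.arange(len(W_list))
    weights = weights / weights.sum()
    aggregated = sum(w * W for w, W in zip(weights, W_list))
    return aggregated, weights

W_list = [np.eye(2), 2 * np.eye(2), 4 * np.eye(2)]   # W_t, W_{t-1}, W_{t-2}
aggregated, weights = aggregate_history(W_list, decay=0.5)
```

With only two matrices this reduces to the combination αW_t + βW_{t-1} with α = 1/(1 + decay), so the decay factor plays the role of the user-assigned emphasis on history.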
Preserving Cluster Membership (PCM)
The second framework of evolutionary spectral clustering, PCM, is different from the first framework, PCQ, in how the temporal cost is measured.
In this second framework, the temporal cost is expressed as the difference between the current partition and the historic partition.
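The current partition is defined as $Z_t=\{V_{1,t},\ldots,V_{k,t}\}$ and the historic partition as $Z_{t-1}=\{V_{1,t-1},\ldots,V_{k,t-1}\}$. A measure for the difference between $Z_t$ and $Z_{t-1}$ is defined next. Comparing two partitions has long been studied in the literature on classification and clustering. Here the traditional chi-square statistic is used to represent the distance between two partitions
$$\chi^2(Z_t,Z_{t-1})=n\left(\sum_{i}\sum_{j}\frac{|V_{ij}|^2}{|V_{i,t}|\cdot|V_{j,t-1}|}-1\right)$$
where $|V_{ij}|$ is the number of nodes that are both in $V_{i,t}$ (at time $t$) and in $V_{j,t-1}$ (at time $t-1$), and the sums run over the clusters at time $t$ and at time $t-1$, respectively. In the above definition, the number of clusters k does not have to be the same at time $t$ and $t-1$. By ignoring the constant shift of −1 and the constant scaling $n$, the temporal cost for the k-means clustering problem is defined as
$$CT_{KM}=-\sum_{i}\sum_{j}\frac{|V_{ij}|^2}{|V_{i,t}|\cdot|V_{j,t-1}|}$$
where the negative sign is because the system wants to minimize $CT_{KM}$. The overall cost can be written as
$$Cost_{KM}=\alpha\cdot CS_{KM}+\beta\cdot CT_{KM}=\alpha\sum_{l=1}^{k}\sum_{i\in V_{l,t}}\|\vec{v}_{i,t}-\vec{\mu}_{l,t}\|^2-\beta\sum_{i}\sum_{j}\frac{|V_{ij}|^2}{|V_{i,t}|\cdot|V_{j,t-1}|}$$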
Negated Average Association
For negated average association, $\mathrm{Tr}(\tilde{Z}^TW\tilde{Z})$ should be maximized, because $NA=\mathrm{Tr}(W)-\mathrm{Tr}(\tilde{Z}^TW\tilde{Z})$. In this case $\tilde{Z}$ is further relaxed to a continuous-valued $X$, whose columns are the k eigenvectors associated with the top-k eigenvalues of $W$. So in the PCM framework, the system defines a distance $dist(X_t,X_{t-1})$ between $X_t$, a set of eigenvectors at time $t$, and $X_{t-1}$, a set of eigenvectors at time $t-1$. However, for a solution $X\in\mathbb{R}^{n\times k}$ that maximizes $\mathrm{Tr}(X^TWX)$, any $X'=XQ$ is also a solution, where $Q\in\mathbb{R}^{k\times k}$ is an arbitrary orthogonal matrix. This is because $\mathrm{Tr}(X^TWX)=\mathrm{Tr}(X^TWXQQ^T)=\mathrm{Tr}((XQ)^TWXQ)=\mathrm{Tr}(X'^TWX')$. Therefore the distance $dist(X_t,X_{t-1})$ must be invariant with respect to the rotation $Q$. One such distance is the norm of the difference between two projection matrices, i.e.,
$$dist(X_t,X_{t-1})=\frac{1}{2}\|X_tX_t^T-X_{t-1}X_{t-1}^T\|^2$$
which essentially measures the distance between $\mathrm{span}(X_t)$ and $\mathrm{span}(X_{t-1})$. The number of columns in $X_t$ does not have to be the same as that in $X_{t-1}$, as discussed in the next section.
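The rotation invariance of a projection-based distance can be checked numerically. The sketch below assumes the distance is one half of the squared Frobenius norm of the difference between the two projection matrices, which is the normalization consistent with the eigenproblem stated for this framework; the name `subspace_distance` and the random test matrices are hypothetical.

```python
import numpy as np

def subspace_distance(X_a, X_b):
    """0.5 * ||X_a X_a^T - X_b X_b^T||_F^2 for matrices with orthonormal columns."""
    P_a = X_a @ X_a.T
    P_b = X_b @ X_b.T
    return 0.5 * np.linalg.norm(P_a - P_b, "fro") ** 2

rng = np.random.default_rng(1)
X, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # orthonormal basis of a random 2-D subspace
Y, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # a second, unrelated subspace
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)))   # an arbitrary 2x2 orthogonal matrix

d_same = subspace_distance(X, X @ Q)           # same subspace, rotated basis
d_xy = subspace_distance(X, Y)
d_rot = subspace_distance(X @ Q, Y)            # rotating the basis should not matter
```

Here `d_same` is zero up to rounding and `d_rot` equals `d_xy`, so the measure depends only on the spanned subspaces and not on the particular eigenvector basis returned by a solver.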
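By using this distance to quantify the temporal cost, the system derives the total cost for the negated average association as
$$Cost_{NA}=\alpha\cdot CS_{NA}+\beta\cdot CT_{NA}=\alpha\,\mathrm{Tr}(W_t-X_t^TW_tX_t)+\frac{\beta}{2}\|X_tX_t^T-X_{t-1}X_{t-1}^T\|^2=\alpha\,\mathrm{Tr}(W_t)+\beta k-\mathrm{Tr}[X_t^T(\alpha W_t+\beta X_{t-1}X_{t-1}^T)X_t]$$
where the last step assumes that $X_t$ and $X_{t-1}$ both have k orthonormal columns. Therefore, an optimal solution that minimizes $Cost_{NA}$ is the matrix $X_t$ whose columns are the k eigenvectors associated with the top-k eigenvalues of the matrix $\alpha W_t+\beta X_{t-1}X_{t-1}^T$. After getting $X_t$, the following steps are the same as before. Furthermore, when $X_t$ and $X_{t-1}$ are the normalized indicator matrices $\tilde{Z}_t$ and $\tilde{Z}_{t-1}$,
$$\frac{1}{2}\|\tilde{Z}_t\tilde{Z}_t^T-\tilde{Z}_{t-1}\tilde{Z}_{t-1}^T\|^2=k-\sum_{i}\sum_{j}\frac{|V_{ij}|^2}{|V_{i,t}|\cdot|V_{j,t-1}|}$$
which differs from the chi-square-based temporal cost of the k-means problem only by the constant k.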
As a result, the evolutionary spectral clustering based on negated average association in the PCM framework provides a relaxed solution to the evolutionary k-means clustering problem defined in the PCM framework.
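The PCM update for negated average association can be sketched as follows. The sketch assumes symmetric similarity matrices, the same k at both time steps, and an orthonormal `X_prev`; the names `top_k_eigvecs` and `pcm_na` and the random data are hypothetical.

```python
import numpy as np

def top_k_eigvecs(M, k):
    """Eigenvectors of the k largest eigenvalues of a symmetric matrix M."""
    _, vecs = np.linalg.eigh(M)       # ascending eigenvalues
    return vecs[:, -k:]

def pcm_na(W_t, X_prev, k, alpha):
    """Top-k eigenvectors of alpha*W_t + beta*X_prev X_prev^T."""
    M = alpha * W_t + (1.0 - alpha) * (X_prev @ X_prev.T)
    return top_k_eigvecs(M, k)

rng = np.random.default_rng(4)
A_prev = rng.normal(size=(3, 8))
A_t = A_prev + 0.1 * rng.normal(size=(3, 8))
W_prev, W_t = A_prev.T @ A_prev, A_t.T @ A_t

X_prev = top_k_eigvecs(W_prev, 2)
X_t = pcm_na(W_t, X_prev, k=2, alpha=0.8)

# The two extremes of alpha, compared through their projection matrices.
X_snapshot = pcm_na(W_t, X_prev, k=2, alpha=1.0)   # history ignored
X_frozen = pcm_na(W_t, X_prev, k=2, alpha=0.0)     # current data ignored
```

Only `X_prev` needs to be carried forward from time t−1, not the full historic similarity matrix, which is the practical difference from the PCQ update.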
Normalized Cut
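The PCM framework can be extended from the negated average association to normalized cut as
$$Cost_{NC}=\alpha\cdot CS_{NC}+\beta\cdot CT_{NC}=\alpha k-\alpha\,\mathrm{Tr}[X_t^T(D_t^{-1/2}W_tD_t^{-1/2})X_t]+\frac{\beta}{2}\|X_tX_t^T-X_{t-1}X_{t-1}^T\|^2=k-\mathrm{Tr}[X_t^T(\alpha D_t^{-1/2}W_tD_t^{-1/2}+\beta X_{t-1}X_{t-1}^T)X_t]$$
Therefore, an optimal solution that minimizes $Cost_{NC}$ is the matrix $X_t$ whose columns are the k eigenvectors associated with the top-k eigenvalues of the matrix
$$\alpha D_t^{-1/2}W_tD_t^{-1/2}+\beta X_{t-1}X_{t-1}^T$$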
After obtaining Xt, the subsequent steps are the same as before.
In the PCM framework, $Cost_{NC}$ has an advantage over $Cost_{NA}$ in terms of the ease of selecting an appropriate α. In $Cost_{NA}$, the two terms $CS_{NA}$ and $CT_{NA}$ are of different scales: $CS_{NA}$ measures a sum of variances and $CT_{NA}$ measures some probability distribution. Consequently, this difference needs to be considered when choosing α. In contrast, for $Cost_{NC}$, because $CS_{NC}$ is normalized, both $D_t^{-1/2}W_tD_t^{-1/2}$ and $X_{t-1}X_{t-1}^T$ have the same 2-norm scale, for both matrices have $\lambda_{max}=1$. Therefore, the two terms $CS_{NC}$ and $CT_{NC}$ are comparable and α can be selected in a straightforward way.
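The claim that both matrices have largest eigenvalue 1 can be checked with a short sketch. It assumes a symmetric similarity matrix with strictly positive entries and an orthonormal historic eigenvector matrix; all variable names and the value alpha = 0.7 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical symmetric similarity matrix with positive entries.
R = rng.random((6, 6))
W_t = (R + R.T) / 2

# Degree-normalized similarity D^{-1/2} W D^{-1/2}.
d = W_t.sum(axis=1)
N_t = W_t / np.sqrt(np.outer(d, d))

# Projection matrix built from hypothetical historic eigenvectors.
X_prev, _ = np.linalg.qr(rng.normal(size=(6, 2)))
P_prev = X_prev @ X_prev.T

lam_N = np.linalg.eigvalsh(N_t)[-1]     # largest eigenvalue of the normalized similarity
lam_P = np.linalg.eigvalsh(P_prev)[-1]  # largest eigenvalue of the projection matrix

# PCM normalized-cut update: top-2 eigenvectors of the combined matrix.
alpha = 0.7
M = alpha * N_t + (1 - alpha) * P_prev
X_t = np.linalg.eigh(M)[1][:, -2:]
```

Since both `lam_N` and `lam_P` equal 1, the weights α and β act on terms of the same scale, which is the reason given above for the easier choice of α.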
In the PCM evolutionary clustering framework, all historic data are taken into consideration (with different weights)—Xt partly depends on Xt-1, which in turn partly depends on Xt-2 and so on. In one extreme case, when α approaches 1, the temporal cost will become unimportant and as a result, the clusters are computed at each time window independent of other time windows. On the other hand, when α approaches 0, the eigenvectors in all time windows are required to be identical. Then the problem becomes a special case of the higher-order singular value decomposition problem, in which singular vectors are computed for the three modes (the rows of W, the columns of W, and the timeline) of a data tensor W where W is constructed by concatenating Wt's along the timeline.
In addition, if the similarity matrix $W_t$ is positive semi-definite, then $\alpha D_t^{-1/2}W_tD_t^{-1/2}+\beta X_{t-1}X_{t-1}^T$ is also positive semi-definite because both $D_t^{-1/2}W_tD_t^{-1/2}$ and $X_{t-1}X_{t-1}^T$ are positive semi-definite.
Next, a comparison of the PCQ and PCM frameworks will be discussed. For simplicity of discussion, only time slots t and t−1 are considered and older history is ignored.
In terms of the temporal cost, PCQ aims to maximize $\mathrm{Tr}(X_t^TW_{t-1}X_t)$ while for PCM, $\mathrm{Tr}(X_t^TX_{t-1}X_{t-1}^TX_t)$ is to be maximized. However, these two are closely connected. By applying the eigen-decomposition on $W_{t-1}$, the system has
$$X_t^TW_{t-1}X_t=X_t^T(X_{t-1},X_{t-1}^{\perp})\Lambda_{t-1}(X_{t-1},X_{t-1}^{\perp})^TX_t$$
where $\Lambda_{t-1}$ is a diagonal matrix whose diagonal elements are the eigenvalues of $W_{t-1}$ ordered by decreasing magnitude, and $X_{t-1}$ and $X_{t-1}^{\perp}$ are the eigenvectors associated with the first k and the residual n−k eigenvalues of $W_{t-1}$, respectively. It can be easily verified that both $\mathrm{Tr}(X_t^TW_{t-1}X_t)$ and $\mathrm{Tr}(X_t^TX_{t-1}X_{t-1}^TX_t)$ are maximized when $X_t=X_{t-1}$ (or more rigorously, when $\mathrm{span}(X_t)=\mathrm{span}(X_{t-1})$).
The differences between PCQ and PCM are (a) whether the eigenvectors associated with the smaller eigenvalues (other than the top k) are considered and (b) the level of penalty when Xt deviates from Xt-1. For PCQ, all the eigenvectors are considered and their deviations between time t and t−1 are penalized according to the corresponding eigenvalues. For PCM, rather than all eigenvectors, only the first k eigenvectors are considered and they are treated equally. In other words, in the PCM framework, other than the historic cluster membership, all details about historic data are ignored.
Although by keeping only historic cluster membership, PCM introduces more information loss, there may be benefits in other aspects. For example, the CT part in the PCM framework does not necessarily have to be temporal cost—it can represent any prior knowledge about cluster membership. For example, the system can cluster blogs purely based on interlinks. However, other information such as the content of the blogs and the demographic data about the bloggers may provide valuable prior knowledge about cluster membership that can be incorporated into the clustering. The PCM framework can handle such information fusion easily.
There are two assumptions in the PCQ and the PCM framework discussed above. First, the system assumed that the number of clusters remains the same over all time. Second, the system assumed that the same set of nodes is to be clustered in all timesteps. Both assumptions are too restrictive in many applications. In this section, the frameworks are extended to handle the issues of variation in cluster numbers and insertion/removal of nodes over time.
Variation in Cluster Numbers
So far, the system has assumed that the number of clusters k does not change with time. However, keeping a fixed k over all time windows is a strong restriction. Various effective methods for selecting appropriate cluster numbers (e.g., by thresholding the gaps between consecutive eigenvalues) can be used. The number of clusters k at time t can be determined by one of these methods.
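One possible reading of the eigenvalue-gap rule is sketched below: sort the eigenvalues in decreasing order and pick the first position at which the gap to the next eigenvalue exceeds a threshold. The rule, the threshold value, and the name `choose_k_by_eigengap` are assumptions made for illustration; the text does not fix a particular method.

```python
import numpy as np

def choose_k_by_eigengap(W, threshold):
    """Return the smallest k whose gap lambda_k - lambda_{k+1} exceeds the threshold.

    Assumption: if no gap exceeds the threshold, every node becomes its own cluster.
    """
    eigvals = np.sort(np.linalg.eigvalsh(W))[::-1]   # descending order
    gaps = eigvals[:-1] - eigvals[1:]
    large = np.flatnonzero(gaps > threshold)
    return int(large[0]) + 1 if large.size else len(eigvals)

# Three disconnected groups of three nodes each.
W = np.kron(np.eye(3), np.ones((3, 3)))
k = choose_k_by_eigengap(W, threshold=1.0)
```

For this block-diagonal example the eigenvalues are 3, 3, 3 followed by zeros, so the first large gap appears after the third eigenvalue.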
If the cluster number k at time t is different from the cluster number k′ at time t−1, both the PCQ and the PCM frameworks can handle variations in cluster number. In the PCQ framework, the temporal cost is expressed by historic data themselves, not by historic clusters and therefore the computation at time t is independent of the cluster number k′ at time t−1. In the PCM framework, the partition distance and the subspace distance can both be used without change when the two partitions have different numbers of clusters. As a result, the PCQ and PCM frameworks can handle variations in the cluster numbers.
Insertion and Removal of Nodes
In many applications the data points to be clustered may vary with time. In the blog example application, often there are old bloggers who stop blogging and new bloggers who just start.
Node Insertion and Removal in PCQ
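For the PCQ framework, the key is $\alpha W_t+\beta W_{t-1}$. When old nodes are removed, the system can simply remove the corresponding rows and columns from $W_{t-1}$ to get $\tilde{W}_{t-1}$ (assuming $\tilde{W}_{t-1}$ is $n_1\times n_1$). However, when new nodes are inserted at time $t$, the system needs to add entries to $\tilde{W}_{t-1}$ and extend it to $\hat{W}_{t-1}$, which has the same dimension as $W_t$ (assuming $W_t$ is $n_2\times n_2$). Without loss of generality, the system assumes that the first $n_1$ rows and columns of $W_t$ correspond to those nodes in $\tilde{W}_{t-1}$. The system defines
$$\hat{W}_{t-1}=\begin{pmatrix}\tilde{W}_{t-1}&E_{t-1}\\E_{t-1}^T&F_{t-1}\end{pmatrix},\quad E_{t-1}=\frac{1}{n_1}\tilde{W}_{t-1}\vec{1}_{n_1}\vec{1}_{n_2-n_1}^T,\quad F_{t-1}=\frac{1}{n_1^2}\vec{1}_{n_2-n_1}\vec{1}_{n_1}^T\tilde{W}_{t-1}\vec{1}_{n_1}\vec{1}_{n_2-n_1}^T$$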
Such a heuristic has the following good property:
Property 1 (1) Ŵt-1 is positive semi-definite if Wt-1 is. (2) In Ŵt-1, for each existing node vold, each newly inserted node vnew looks like an average node in that the similarity between vnew and vold is the same as the average similarity between any existing node and vold. (3) In Ŵt-1, the similarity between any pair of newly inserted nodes is the same as the average similarity among all pairs of existing nodes. The property is appealing when no prior knowledge is given about the newly inserted nodes.
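The three parts of Property 1 can be illustrated with a small sketch. It builds an extended matrix in which each new node has the average similarity profile described in parts (2) and (3); the name `extend_similarity` and the random example are hypothetical, and a symmetric historic matrix is assumed.

```python
import numpy as np

def extend_similarity(W_old, n_new):
    """Extend an n1 x n1 historic similarity matrix with n_new 'average' nodes."""
    n1 = W_old.shape[0]
    row_avg = W_old.sum(axis=1) / n1                   # average similarity to each old node
    E = np.tile(row_avg[:, None], (1, n_new))          # old-to-new block, n1 x n_new
    F = np.full((n_new, n_new), W_old.sum() / n1**2)   # new-to-new block: overall average
    return np.block([[W_old, E], [E.T, F]])

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 5))
W_old = A.T @ A                      # positive semi-definite 5 x 5 historic similarity
W_hat = extend_similarity(W_old, 2)  # two nodes inserted at time t
```

The extended matrix can be written as B·W_old·Bᵀ with B obtained by stacking the identity on top of rows that average over the old nodes, which is why positive semi-definiteness is preserved.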
Node Insertion and Removal in PCM
For the PCM framework, when old nodes are removed, the system removes the corresponding rows from $X_{t-1}$ to get $\tilde{X}_{t-1}$ (assuming $\tilde{X}_{t-1}$ is $n_1\times k$). When new nodes are inserted at time $t$, the system extends $\tilde{X}_{t-1}$ to $\hat{X}_{t-1}$, which has the same dimension as $X_t$ (assuming $X_t$ is $n_2\times k$), as follows
$$\hat{X}_{t-1}=\begin{pmatrix}\tilde{X}_{t-1}\\G_{t-1}\end{pmatrix},\quad G_{t-1}=\frac{1}{n_1}\vec{1}_{n_2-n_1}\vec{1}_{n_1}^T\tilde{X}_{t-1}$$
That is, the system inserts new rows as the row average of $\tilde{X}_{t-1}$. After obtaining $\hat{X}_{t-1}$, the system replaces the term $\beta X_{t-1}X_{t-1}^T$ with $\beta\hat{X}_{t-1}(\hat{X}_{t-1}^T\hat{X}_{t-1})^{-1}\hat{X}_{t-1}^T$. The foregoing equation corresponds to assigning to each newly inserted node a prior clustering membership that is approximately proportional to the size of the clusters at time $t-1$.
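The row-removal, row-insertion, and re-projection steps can be sketched as follows. The sketch assumes the extended matrix has full column rank so that the inverse exists; the names `extend_embedding` and `history_projector` and the random example are hypothetical.

```python
import numpy as np

def extend_embedding(X_old, n_new):
    """Append n_new rows, each equal to the average of the rows of X_old."""
    row_avg = X_old.mean(axis=0, keepdims=True)           # shape 1 x k
    return np.vstack([X_old, np.repeat(row_avg, n_new, axis=0)])

def history_projector(X_hat):
    """X_hat (X_hat^T X_hat)^{-1} X_hat^T, the projector onto span(X_hat)."""
    gram = X_hat.T @ X_hat                                # assumed invertible
    return X_hat @ np.linalg.solve(gram, X_hat.T)

rng = np.random.default_rng(6)
X_prev, _ = np.linalg.qr(rng.normal(size=(6, 2)))   # hypothetical eigenvectors at t-1
X_old = np.delete(X_prev, 0, axis=0)                # node 0 is removed at time t
X_hat = extend_embedding(X_old, 2)                  # two nodes are inserted at time t
P_hat = history_projector(X_hat)
```

The explicit projector is needed because, after rows are removed and added, the columns of the extended matrix are generally no longer orthonormal, so the plain product of the matrix with its transpose would not be a projection.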
Next, experimental studies based on both synthetic data sets and a real blog data set are discussed. First, several experiments on synthetic data sets are reported to illustrate the good properties of the PCQ and PCM algorithms.
NA-Based Evolutionary Spectral Clustering
Three experimental studies based on synthetic data are discussed next. In the first experiment, a stationary case is tested where data variation is due to zero-mean noise. In the second experiment, a non-stationary case is tested where there are concept drifts. In the third experiment, a case is tested in which the PCQ and PCM frameworks differ substantially.
Using the k-means algorithm, two baselines are constructed. The first baseline, called ACC, accumulates all historic data before the current timestep t and applies the k-means algorithm to the aggregated data. The second baseline, called IND, independently applies the k-means algorithm to the data in timestep t only and ignores all historic data before t.
The system uses the NA-based PCQ and PCM algorithms because of the equivalence between the NA-based spectral clustering problem and the k-means clustering problem. The system uses $W = A^T A$ in the NA-based evolutionary spectral clustering and compares its results with those of the k-means baseline algorithms. For a fair comparison, the system uses the KM defined for the k-means clustering problem as the performance measure, where a smaller KM value is better.
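As a small illustration of the similarity matrix used here, the sketch below computes the matrix of pairwise inner products in Python. It assumes that the data points form the columns of A, so that the (i, j) entry of the product is the inner product of points i and j. The function name `gram_matrix` is hypothetical.

```python
def gram_matrix(points):
    """Return W with W[i][j] = <points[i], points[j]>.

    Assumes the points are the columns of A, so that W equals A^T A.
    Hypothetical helper written for illustration.
    """
    return [[sum(a * b for a, b in zip(p, q)) for q in points] for p in points]


# Three two-dimensional points.
points = [(1.0, 0.0), (0.0, 2.0), (1.0, 1.0)]
W = gram_matrix(points)
```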
The data points to be clustered are generated in the following way. 800 two-dimensional data points are initially positioned as described in
In the first experiment, for timesteps 2 through 10, the system adds i.i.d. Gaussian noise following N(0,0.5) to the initial positions of the data points. The system uses this data to simulate a stationary situation where the concept is relatively stable but short-term noise exists.
In
Next, for the same data set, α is increased from 0.2 to 1 with a step of 0.1.
In the second experiment, the system simulates a non-stationary situation. At timesteps 2 through 10, before adding random noises, the system first rotates all data points by a small random angle (with zero mean and a variance of π/4).
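One way such a drifting timestep might be generated is sketched below in Python. Several details are assumptions rather than facts from the experiment: the angle is drawn from a Gaussian, the rotation is about the origin, and the noise level is left as an explicit parameter because N(0,0.5) does not say whether 0.5 is a variance or a standard deviation. The names `rotate_points` and `drift_step` are hypothetical.

```python
import math
import random


def rotate_points(points, angle):
    """Rotate 2-D points counter-clockwise about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def drift_step(points, rng, angle_var=math.pi / 4, noise_std=0.0):
    """One non-stationary timestep: a random rotation, then additive noise."""
    # Zero-mean angle with the given variance; the Gaussian shape is an assumption.
    angle = rng.gauss(0.0, math.sqrt(angle_var))
    rotated = rotate_points(points, angle)
    # Independent Gaussian noise on each coordinate.
    return [(x + rng.gauss(0.0, noise_std), y + rng.gauss(0.0, noise_std))
            for x, y in rotated]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
snapshot = drift_step([(1.0, 0.0), (0.0, 1.0)], rng, noise_std=0.1)
```

Passing the previous snapshot to `drift_step` accumulates the rotation across timesteps, while passing the initial positions each time does not. The description above does not say which is intended.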
In the third experiment, the system shows a case where the PCQ and PCM frameworks behave differently. The system first generates data points using the procedure described in the first experiment (the stationary scenario), except that this time the system generates 60 timesteps for a better view. This time, instead of 4 clusters, the system lets the algorithms partition the data into 2 clusters. From
Next, NC-based Evolutionary Spectral Clustering experiments will be discussed. It is difficult to compare the NC-based evolutionary spectral clustering with the k-means clustering algorithm. Instead, in this experiment, the system uses a toy example in the 2-dimensional Euclidean space with only 4 timesteps (as shown in
In conclusion, these experiments based on synthetic data sets demonstrate that, compared to traditional clustering methods, the instant evolutionary spectral clustering algorithms can provide clustering results that are more stable and consistent, less sensitive to short-term noise, and adaptive to long-term trends.
Next, experiments on actual blog data will be discussed. The sample blog data set contains 148,681 entry-to-entry links among 407 blogs crawled by a crawler during 63 consecutive weeks, between Jul. 10, 2005 and Sep. 23, 2006. By looking at the contents of the blogs, the system discovered two main groups: a set of 314 blogs with technology focus and a set of 93 blogs with politics focus.
One application of clustering blogs is to discover communities. Since the system already has the ground truth of the two communities based on content analysis, the system starts by running the clustering algorithms with k=2. The data is prepared in this way: each week corresponds to a timestep; all the entry-to-entry links in a week are used to construct an affinity matrix for the blogs of that week (i.e., those blogs that are relevant to at least one entry-to-entry link in that week); and the affinity matrix is used as the similarity matrix W in the clustering algorithms. For baselines, the system again uses ACC and IND, except that this time the normalized cut algorithm is used. For our algorithms, the system uses the NC-based PCQ and PCM.
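The weekly data preparation can be illustrated with a short Python sketch. The weighting is not specified above, so the sketch assumes that the affinity between two blogs is the number of entry-to-entry links between them in either direction, and that links within a single blog are ignored. The function name `weekly_affinity` and the (source blog, target blog) tuple format are hypothetical.

```python
def weekly_affinity(links):
    """Build (blogs, W) for one week from (source_blog, target_blog) links.

    Hypothetical helper written for illustration. W[i][j] counts links
    between blogs i and j in either direction (assumed weighting).
    """
    # Links within one blog carry no cross-blog affinity (assumption).
    cross = [(src, dst) for src, dst in links if src != dst]

    # Only blogs touched by at least one remaining link get a row and column.
    blogs = sorted({blog for link in cross for blog in link})
    index = {blog: i for i, blog in enumerate(blogs)}

    W = [[0] * len(blogs) for _ in blogs]
    for src, dst in cross:
        i, j = index[src], index[dst]
        W[i][j] += 1
        W[j][i] += 1   # keep the matrix symmetric
    return blogs, W


week_links = [("a", "b"), ("b", "a"), ("c", "a"), ("d", "d")]
blogs, W = weekly_affinity(week_links)
```

Because the set of active blogs differs from week to week, matrices built this way have different sizes at different timesteps, which is the situation the node insertion and removal handling addresses.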
(a), (b), and (c) give the CSNC, CTNC, and CostNC for the two baseline algorithms and the PCM algorithm (to keep the figures readable, the results for PCQ, which are similar to those of PCM as shown in Table 1, were not plotted). In
Table 1 (values as extracted; row and column labels were not recovered): 0.46, 0.06, 0.42, 1.07, 0.02, 0.98, 1.70, 0.03, 1.57
In addition, the system runs the algorithms under different cluster numbers and reports the performance in Table 1, where the best results within each category are in bold face. Our evolutionary clustering algorithms always give more stable and consistent cluster results than the baselines where the historic data is totally ignored or totally aggregated.
There are new challenges when traditional clustering techniques are applied to new data types, such as streaming data and Web/blog data, where the relationship among data evolves with time. On one hand, because of long-term concept drifts, a naive approach based on aggregation will not give satisfactory cluster results. On the other hand, short-term variations occur very often due to noise. Preferably, the cluster results should not change dramatically over short periods and should exhibit temporal smoothness. The system proposes two frameworks to incorporate temporal smoothness in evolutionary spectral clustering. In both frameworks, a cost function is defined where, in addition to the traditional cluster quality cost, a second cost is introduced to regularize the temporal smoothness. The system then derives the (relaxed) optimal solutions for solving the cost functions. The solutions turn out to have very intuitive interpretations and have forms analogous to traditional techniques used in time series analysis. Experimental studies demonstrate that these new frameworks provide cluster results that are both stable and consistent in the short term and adaptive in the long run.
The above two processes or frameworks incorporate temporal smoothness in evolutionary spectral clustering. These processes solve corresponding cost functions for the evolutionary spectral clustering problems. The system's evolutionary spectral clustering processes provide stable and consistent clustering results that are less sensitive to short-term noise while remaining adaptive to long-term cluster drifts. As discussed above, performance experiments over a number of real and synthetic data sets illustrate that the system's evolutionary spectral clustering methods provide more robust clustering results that are not sensitive to noise and can adapt to data drifts.
The invention may be implemented in hardware, firmware or software, or a combination of the three. Preferably the invention is implemented in a computer program executed on a programmable computer having a processor, a data storage system, volatile and non-volatile memory and/or storage elements, at least one input device and at least one output device.
By way of example, a block diagram of a computer to support the system is discussed next. The computer preferably includes a processor, random access memory (RAM), a program memory (preferably a writable read-only memory (ROM) such as a flash ROM) and an input/output (I/O) controller coupled by a CPU bus. The computer may optionally include a hard drive controller which is coupled to a hard disk and CPU bus. Hard disk may be used for storing application programs, such as the present invention, and data. Alternatively, application programs may be stored in RAM or ROM. I/O controller is coupled by means of an I/O bus to an I/O interface. I/O interface receives and transmits data in analog or digital form over communication links such as a serial link, local area network, wireless link, and parallel link. Optionally, a display, a keyboard and a pointing device (mouse) may also be connected to I/O bus. Alternatively, separate connections (separate buses) may be used for I/O interface, display, keyboard and pointing device. Programmable processing system may be preprogrammed or it may be programmed (and reprogrammed) by downloading a program from another source (e.g., a floppy disk, CD-ROM, or another computer).
Each computer program is tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
The invention has been described herein in considerable detail in order to comply with the patent Statutes and to provide those skilled in the art with the information needed to apply the novel principles and to construct and use such specialized components as are required. However, it is to be understood that the invention can be carried out by specifically different equipment and devices, and that various modifications, both as to the equipment details and operating procedures, can be accomplished without departing from the scope of the invention itself.
This application claims the benefit of priority to U.S. Provisional Patent Application Ser. No. 60/939,696, filed May 23, 2007, the content of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
20080294684 A1 | Nov 2008 | US

Number | Date | Country
---|---|---
60939696 | May 2007 | US