Data clustering, also called cluster analysis, is an active research area with a long history (see [21, 24, 14, 23]), ranging from the K-means methods [25, 26, 22] and hierarchical clustering algorithms of the 1960s-1970s to more complex present-day methods such as those based on Gaussian mixtures [28] and other model-based methods [1], graph partitioning methods [31, 19, 32, 29], and methods developed by the database community (see the summary in [20]).
The essential task of data clustering is partitioning data points into disjoint clusters so that (P1) objects in the same cluster are similar, and (P2) objects in different clusters are dissimilar. When the data objects are distributed as compact clumps that are well separated, clusters are well defined, and we refer to such well-defined clusters as natural clusters. When the clumps are not compact or overlap with each other, clusters are not well defined; a clear and meaningful definition of clusters then becomes crucial.
Existing clustering methods typically attempt to satisfy one of the two requirements above. The K-means algorithm, for example, attempts to ensure that data points in the same cluster are similar, which is (P1), while the graph partitioning methods RatioCut and NormalizedCut attempt to ensure that objects in different clusters are maximally different, which is (P2).
In this paper we introduce MinMaxCut, a graph-partitioning-based clustering algorithm that incorporates (P1) and (P2) simultaneously. We formally state them as the following min-max clustering principle: data should be grouped into clusters such that the similarity or association across clusters is minimized, while the similarity or association within each cluster is maximized (see [14, 35] for recent studies of clustering objective functions).
Clustering algorithms such as K-means and those based on Gaussian mixtures require the coordinates/attributes of each object explicitly. Graph partitioning algorithms require only the pairwise similarities between objects. Given the pairwise similarities S=(sij), where sij indicates the similarity between objects i and j, we may consider S as the adjacency matrix of a weighted graph G; the data clustering problem then becomes a graph partitioning problem. (Splitting a dataset into two is rephrased as cutting a graph into two subgraphs; cutting a graph into two very imbalanced subgraphs is referred to as a skewed cut; the boundary between two subgraphs is sometimes called the cut.)
Cluster analysis is applied to large amounts of data with a variety of distributions/shapes for the clusters. Using a similarity metric, more complex-shaped distributions can be accommodated. For example, K-means favors spherically shaped clusters, while hierarchical agglomerative clustering can produce elongated clusters by using single linkage. Using similarity-based graph partitioning, the connectivity between objects becomes most important, instead of their shape in a Euclidean space, which is very hard to model.
The min-max clustering principle favors the objective function optimization approach, i.e., clusters are obtained by optimizing an appropriate objective function. This is a mathematically more principled approach, in contrast to procedure-oriented clustering methods, such as the hierarchical algorithms.
In the following we briefly summarize the results obtained in this paper, which also serves as the outline of the paper. In § 2, we discuss MinMaxCut for the K=2 case. We first show that the continuous solution of the cluster membership indicator vector is an eigenvector of the generalized Laplacian matrix of the similarity matrix. Related work on spectral graph clustering, RatioCut [19] and NormalizedCut [32], is discussed in § 2.1. Using the random graph model, we show that MinMaxCut tends to produce balanced clusters while earlier methods do not (see § 2.2). The cluster balancing power of MinMaxCut can be softened or hardened by a slight generalization of the clustering objective function (see § 2.3). In § 2.4, we define the cohesion of a dataset/graph as the optimal value of the MinMaxCut objective function when the dataset is split into two. We prove important lower and upper bounds for the cohesion value. Experiments on clustering internet newsgroups are presented in § 2.5, which show the advantage of MinMaxCut over existing methods. In § 2.6 we derive the conditions for possible skewed clustering for MinMaxCut and NormalizedCut, which shows the balancing power of MinMaxCut. In § 2.7, we show that the MinMaxCut linkage is useful for further refinement of the clusters obtained from MinMaxCut. The linkage differential ordering can further improve the clustering results (see § 2.8). In § 2.9, we discuss the clustering of a contingency table, which can be viewed as a weighted bipartite graph. The simultaneous clustering of rows and columns of the contingency table can be done in much the same way as the 2-way clustering of § 2.
In § 3, we discuss MinMaxCut for the K>2 case. We show in § 3.1 that K-way MinMaxCut leads to a more refined or subtle cluster balance, the similarity-weighted size balance. In § 3.2, the importance of the first K eigenvectors is noted, with generalized lower and upper bounds on the optimal value of the objective function. K-way clustering requires two stages, initial clustering and refinement. In § 3.3, three methods of initial clustering are briefly explained: eigenspace K-means, divisive clustering, and agglomerative clustering. The cluster refinement algorithms based on the MinMaxCut objective function are outlined in § 3.4.
In § 4, divisive MinMaxCut as a K-way clustering method is explained in detail. We first prove the monotonicity of the MinMaxCut and K-means objective functions w.r.t. cluster merging and splitting in § 4.1. In § 4.2, we outline the cluster selection methods: those based on size priority, average similarity, cohesion, and temporary objectives. Stopping criteria are outlined in § 4.3. In § 4.4, we discuss objective function saturation, a subtle issue in objective-function-optimization based approaches. In § 4.5, results of comprehensive experiments on newsgroups are presented, which show that average similarity is a better cluster selection method. Our results also show the importance of MinMaxCut based refinement after the initial clusters are obtained in divisive clustering. This indicates the appropriateness of the MinMaxCut objective function. In § 5, a summary and discussion are given. Some preliminary results [10, 8] of this paper were previously presented at conferences.
Given pairwise similarities or associations for a set of n data objects specified in S=(sij), we wish to cluster the data objects into two clusters A, B based on the following min-max clustering principle: data points are grouped into clusters such that between-cluster associations are minimized while within-cluster associations are maximized. The association between A, B is the sum of pairwise associations between the two clusters, s(A,B)=Σi∈A,j∈Bsij. The association within cluster A is s(A,A)=Σi∈A,j∈Asij; s(B, B) is analogously defined. The min-max clustering principle requires
min s(A, B), max s(A, A), max s(B, B). (1)
These requirements are simultaneously satisfied by minimizing the objective function [10]
JMMC(A,B) = s(A,B)/s(A,A) + s(A,B)/s(B,B). (2)
Note that there are many objective functions that satisfy Eq. (1). However, for JMMC, a continuous solution can be computed efficiently.
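As a concrete illustration of Eq. (2), the sketch below evaluates s(A,B), s(A,A), s(B,B) and JMMC for a given two-way partition. It assumes the similarity matrix is a dense symmetric NumPy array; the function name and toy data are ours, not from the paper.

```python
import numpy as np

def minmaxcut_objective(S, A, B):
    """J_MMC(A,B) = s(A,B)/s(A,A) + s(A,B)/s(B,B) for a symmetric similarity matrix S."""
    A, B = np.asarray(A), np.asarray(B)
    s_AB = S[np.ix_(A, B)].sum()     # between-cluster association s(A,B)
    s_AA = S[np.ix_(A, A)].sum()     # within-cluster association s(A,A)
    s_BB = S[np.ix_(B, B)].sum()     # within-cluster association s(B,B)
    return s_AB / s_AA + s_AB / s_BB

# toy example: two dense clumps weakly connected to each other
S = np.array([[0.0, 1.0, 1.0, 0.1, 0.1],
              [1.0, 0.0, 1.0, 0.1, 0.1],
              [1.0, 1.0, 0.0, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.0, 1.0],
              [0.1, 0.1, 0.1, 1.0, 0.0]])
print(minmaxcut_objective(S, A=[0, 1, 2], B=[3, 4]))   # small value: A, B well separated
```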
The clustering solution can be represented by an indicator vector q with
qi = a if i∈A, qi = −b if i∈B, (3)
where a=√(dB/dA), b=√(dA/dB), dA=Σi∈A di, dB=Σi∈B di, (4)
and di=Σj sij is the degree of node i. Thus
qTDe=0, (5)
where D=diag(d1, . . . ,dn). We first prove that minimizing JMMC is equivalent to maximizing the Rayleigh quotient
Jm(q) = qTSq / qTDq. (6)
Define the indicator vector x, where
Now
By definition of q in Eq. (3), we obtain
qTSq = a²s(A,A) + b²s(B,B) − 2ab·s(A,B). (8)
The orthogonality condition of Eq. (5) becomes
as(A,A)−bs(B,B)+(a−b)s(A,B)=0. (9)
With these relations, after some algebraic manipulation, we obtain
Since b/a>0 is fixed, one can easily see that
Hence JMMC is a monotonically decreasing function of Jm. This proves Eq. (6).
Optimization of Jm(q) with the constraint that qi takes discrete values {a, −b} is a hard problem. Following § 2.1, we let qi take arbitrary continuous values in the interval [−1, 1]. The optimal solution for the Rayleigh quotient Jm in Eq. (6) is the eigenvector q associated with the largest eigenvalue of the system
Sq=λDq. (11)
Let q=D−1/2z and multiply both sides by D−1/2; this equation becomes a standard eigenvalue problem:
D−1/2SD−1/2 zk = λk zk, λk = 1−ζk. (12)
The desired solution is q2. (The trivial solution λ1=1 with q1=e is discarded.) Since the qk satisfy the orthogonality relation zkTzp = qkTDqp = 0 for k≠p, the constraint Eq. (5) is automatically satisfied. We summarize these results as Theorem 2.1: the second eigenvector q2 of Eq. (11) is the optimal continuous solution of the MinMaxCut cluster indicator vector.
From Eq. (3), we can recover cluster membership by sign, i.e., A={i|q2(i)<0}, B={i|q2(i)≥0}. In general, the optimal dividing point could shift away from 0; we search over dividing points icut=1, . . . ,n−1, setting
A = {i | q2(i) ≤ q2(icut)}, B = {i | q2(i) > q2(icut)}, (13)
such that JMMC(A, B) is minimized. The corresponding A and B are the final clusters.
The computation of the eigenvectors can be done quickly via the Lanczos method [30]. A software package for this calculation, LANSO, is available online (http://www.nersc.gov/~kewu/planso.html). Overall, the computational complexity is O(n²).
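The two-way procedure summarized above — the second eigenvector of Eq. (11), followed by the linear search of Eq. (13) over the q2-order — might be sketched as follows. This simplified version uses a dense eigensolver (scipy.linalg.eigh on D−1/2SD−1/2) rather than the Lanczos/LANSO code mentioned above, assumes all node degrees are positive, and uses names of our own choosing.

```python
import numpy as np
from scipy.linalg import eigh

def two_way_minmaxcut(S):
    """2-way MinMaxCut sketch: second eigenvector of Eq. (11) + linear search of Eq. (13)."""
    n = S.shape[0]
    d = S.sum(axis=1)                           # node degrees d_i (assumed positive)
    Dih = np.diag(1.0 / np.sqrt(d))             # D^{-1/2}
    lam, Z = eigh(Dih @ S @ Dih)                # Eq. (12); eigenvalues in ascending order
    q2 = Dih @ Z[:, -2]                         # q = D^{-1/2} z for the second largest eigenvalue
    order = np.argsort(q2)                      # the q2-order
    best_J, best_cut = np.inf, 1
    for icut in range(1, n):                    # search the dividing point
        A, B = order[:icut], order[icut:]
        sAB = S[np.ix_(A, B)].sum()
        sAA = S[np.ix_(A, A)].sum()
        sBB = S[np.ix_(B, B)].sum()
        if sAA > 0 and sBB > 0:
            J = sAB / sAA + sAB / sBB
            if J < best_J:
                best_J, best_cut = J, icut
    return order[:best_cut], order[best_cut:], best_J
```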
Spectral graph partitioning is based on properties of the eigenvectors of the Laplacian matrix L=D−W, first developed by Donath and Hoffman [11] and Fiedler [16, 17]. The method became widely known in the high performance computing area through the work of Pothen, Simon and Liu [31]. The objective of the partitioning is to minimize the MinCut objective, i.e., the cutsize (the between-cluster similarity)
Jcut(A,B) = s(A,B) = ¼ xT(D−S)x, (14)
with the requirement that the two subgraphs have the same number of nodes: |A|=|B|. Here xu ∈ {1,−1} is an indicator variable, depending on whether u∈A or u∈B.
Relaxing xu from {1, −1} to continuous values in [−1, 1], minimizing s(A, B) is equivalent to solving the eigensystem
(D−S)x = ζx. (15)
Since the trivial solution x1=e is associated with ζ1=0, the second eigenvector x2, also called the Fiedler vector, is the solution. Hagen and Kahng [19] remove the requirement |A|=|B| and show that x2 provides the continuous solution of the cluster indicator vector for the RatioCut objective function [5],
JRcut(A,B) = s(A,B)/|A| + s(A,B)/|B|. (16)
The generalized eigensystem of Eq. (15) is
(D−S)x=ζDx, (17)
which is identical to Eq. (11) with λ=1−ζ. The use of this equation has been studied by a number of authors [12, 6, 32]. Chung [6] emphasizes the advantage of using the normalized Laplacian matrix, which leads to Eq. (17). Shi and Malik [32] propose the NormalizedCut,
Jncut(A,B) = s(A,B)/dA + s(A,B)/dB, (18)
where dA, dB defined in Eq. (4) are also called the volumes [6] of subgraphs A, B, in contrast to the sizes |A|, |B|. A key observation is that Jncut can be written as
Jncut(A,B) = s(A,B)/(s(A,A)+s(A,B)) + s(A,B)/(s(B,B)+s(A,B)), (19)
since dA=s(A,A)+s(A,B) and dB=s(B,B)+s(A,B). The presence of s(A,B) in the denominators of Jncut indicates that it does not conform to the min-max clustering principle. In practical applications, NormalizedCut sometimes leads to unbalanced clusters (see § 2.5). MinMaxCut is designed to conform to the min-max clustering principle. In extensive experiments (see § 2.5), MinMaxCut consistently outperforms NormalizedCut and RatioCut.
The RatioCut, NormalizedCut and MinMaxCut objective functions are first prescribed by proper motivating considerations, and then q2 is shown to be the continuous solution of the cluster indicator vector. It should be noted that in a perturbation analysis of the case where clusters are well separated (and thus clearly defined) [9], the same three objective functions can be automatically recovered: the second eigenvector of the corresponding (normalized) Laplacian matrix and the indicator vector of Eq. (3) are recovered. This further strengthens the connection between clustering objective functions and the Laplacian matrix of a graph.
Besides Laplacian matrix based spectral partitioning methods, other recent partitioning methods use singular value decompositions [2, 13].
One important feature of the MinMaxCut method is that it tends to produce balanced clusters, i.e., the resulting subgraphs have similar sizes. Here we use the random graph model [5, 3] to illustrate this point. Suppose we have a uniformly distributed random graph with n nodes, in which any two nodes are connected with probability p, 0≤p≤1. We consider the four objective functions MinCut, RatioCut, NormalizedCut and MinMaxCut (see § 2.1). We have the following result (Theorem 2.2): on the random graph, MinCut favors a highly skewed cut, RatioCut and NormalizedCut show no size preference, and MinMaxCut favors the balanced cut |A|=|B|=n/2.
Proof. We compute the objective functions for a partition of G into A and B. Note that the expected number of edges between A and B is p|A||B|. For MinCut, we have
Jmincut(A,B) = p|A||B|.
For RatioCut, we have
JRcut(A,B) = p|A||B|/|A| + p|A||B|/|B| = p(|A|+|B|) = pn.
For NormalizedCut, since all nodes have the same degree (n−1)p,
Jncut(A,B) = p|A||B|/((n−1)p|A|) + p|A||B|/((n−1)p|B|) = n/(n−1).
For MinMaxCut, we have
JMMC(A,B) = p|A||B|/(p|A|(|A|−1)) + p|A||B|/(p|B|(|B|−1)) = |B|/(|A|−1) + |A|/(|B|−1).
We now minimize these objectives. Clearly, MinCut favors either |A|=n−1,|B|=1 or |B|=n−1,|A|=1, both of which are skewed cuts. Minimizing JMMC(A,B), we obtain a balanced cut, |A|=|B|=n/2, with
JMMC(n/2, n/2) = 2n/(n−2) ≈ 2. (20)
Both Rcut and Ncut objectives have no size dependency and no size preference.
Clearly, MinMaxCut has a strong tendency to produce balanced clusters. Although balanced clusters are desirable, naturally occurring clusters are not necessarily balanced. Here we introduce a generalized MinMaxCut with a varying degree of cluster balancing. We define the generalized clustering objective function
for any fixed parameter α>0.
The important property of JMMC(α) is that the procedure for computing the clusters remains identical to the α=1 case in § 2.1, because minimization of JMMC(α) leads to the same problem of maximizing Jm(q), i.e.,
for any α>0; this can be proved by repeating the proof of Eq. (6).
The generalized MinMaxCut for any α>0 still retains the cluster balancing property: one can easily show that Theorem 2.2 regarding cluster balancing on random graphs remains valid. However, the level of balancing depends on α.
If α>1, JMMC(α) has stronger cluster balancing than JMMC(α=1), because the larger of the two terms will dominate JMMC(α) more, and thus minimizing JMMC(α>1) will more strongly force the two terms to be equal. We call this case the hard MinMaxCut. In particular, for α>>1, we have
We call this case the "minimax cut". The minimax cut ignores the details of the smaller term and is therefore less sensitive than JMMC(α=1).
If α<1, JMMC(α) will have weaker cluster balancing. This case is more applicable for datasets where natural clusters are of different sizes. Here ½≤α<1 are good choices. We call this case the soft MinMaxCut.
Given a dataset of n objects and their pairwise similarities S=(sij), we may partition them into two subsets in many different ways, with different values of JMMC. However, the optimal value JMMCopt(S) = min(A,B) JMMC(A,B), taken over all two-way partitions, is a well-defined quantity, although its exact value may not be easily computed.
Definition. Cluster cohesion of a dataset is the smallest value of the MinMaxCut objective function when the dataset is split into two clusters.
Cluster cohesion is a good characterization of a dataset against splitting it into two clusters. Suppose we apply MinMaxCut to split a dataset into two clusters. If JMMCopt thus obtained is large, this indicates the overlap between the two resulting clusters is large in comparison to the within-cluster similarity, and thus the dataset is likely a single natural cluster and should not be split.
On the other hand, if JMMCopt(S) is small, the overlap between the two resulting clusters is small, i.e., two clusters are well-separated, which indicates that the dataset should be split. Thus JMMCopt is a good indicator of cohesion of the dataset with respect to clustering. For this reason, JMMCopt(S) is called cluster cohesion and is denoted as h(S).
Note that h is similar to the Cheeger constant [6] h1 in graph theory, defined as
h1(G) = min over subsets A of s(A,Ā)/min(dA, dĀ), where Ā denotes the complement of A.
From Eq. (19), one can see that NormalizedCut is a generalization of the Cheeger constant, i.e., both terms are retained in the optimization of NormalizedCut. Using the analogy of the minimax version of MinMaxCut via Eq. (23), we may also say that h1 is the minimax version of NormalizedCut. Since S can be viewed as the adjacency matrix of a graph G, we call h the cohesion of graph G.
For all possible graphs one might expect the cohesion value to have a large range and thus be difficult to gauge. Surprisingly, the cohesion of an arbitrarily weighted graph is restricted to a narrow range, as we can prove the following (Theorem 2.4): (a) the cohesion of any graph, weighted or unweighted, is no larger than that of the unweighted complete graph, given in Eq. (24); (b) the cohesion of a graph has the bound
where λ2 is from Eq. (12).
Proof. Part (a) can be proved by the following two lemmas regarding graphs. Lemma (L1): The unweighted complete graph (clique) has the cohesion of Eq. (24), same as the random graph with p=1 (see Eq. (20)). Lemma (L2): All graphs, both weighted and unweighted, have cohesion smaller than that of the complete graph. L2 is very intuitive and can be proved rigorously by starting with a clique and removing edges. Details are skipped here. Part (b) is proved by considering JMMC(Jm, a/b) as a function of a/b and Jm. It can be shown that
The last inequality follows from
Theorem 2.4 establishes cluster cohesion JMMCopt as a useful quantity to characterize a dataset with the chosen similarity metric. The upper bound is useful for checking whether a partition of the dataset is within the right range.
Document clustering has been popular in analyzing text information. Here we perform experiments on newsgroup articles from the 20-newsgroups collection (dataset available online [27]). We focus on three datasets, each containing two newsgroups:
A word-document matrix X=(x1, . . . ,xn) is first constructed. 2000 words are selected according to the mutual information between words and documents,
where w represents a word and x represents a document. Words are stemmed using [27]. The standard tf.idf scheme is used for term weighting, and the standard cosine similarity between two documents x1, x2, sim(x1,x2)=x1·x2/(|x1||x2|), is used. When each document (a column of X) is normalized to 1 using the L2 norm, the document-document similarities are calculated as W=XTX. W is interpreted as the weight/affinity matrix of an undirected graph. From this similarity matrix, we perform the clustering as explained above.
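A minimal sketch of the similarity-matrix construction described above (tf.idf weighting, L2-normalized document vectors, cosine similarities W=XTX), using scikit-learn's TfidfVectorizer purely for convenience; the mutual-information word selection and stemming steps are omitted, and the toy documents are ours.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["graph partitioning and spectral clustering",
        "newsgroup articles about baseball",
        "spectral methods for data clustering"]

# tf.idf term weighting with L2-normalized document vectors
vec = TfidfVectorizer(norm="l2", stop_words="english")
X = vec.fit_transform(docs)          # scikit-learn stores documents as rows,
W = (X @ X.T).toarray()              # so W = X X^T gives the cosine similarities
np.fill_diagonal(W, 0.0)             # drop self-similarities
print(np.round(W, 2))
```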
For comparison purposes, we also consider three other clustering methods: RatioCut, NormalizedCut and principal direction divisive partitioning (PDDP) [2]. PDDP is based on the idea of principal component analysis (PCA) applied to the vector-space model on X. First X is centered, i.e., the average of each row (a word) is subtracted. Then the first principal direction is computed. The loadings of the documents (the projection of each document on the principal axis) form a 1-dimensional linear search order. This provides a heuristic very similar to the linear search order provided by the Fiedler vector. Instead of searching through this order to find a minimum of some objective function, PDDP partitions the data into two parts at the center of mass.
We perform these two-cluster experiments in a way similar to cross-validation. We divide one newsgroup A randomly into K1 subgroups and the other newsgroup B randomly into K2 subgroups. Then one of the K1 subgroups of A is mixed with one of the K2 subgroups of B to produce a dataset G. The graph partitioning methods are run on this dataset G to produce two clusters. Since the true label of each newsgroup article is known, we use accuracy, the percentage of newsgroup articles correctly clustered, as the measure of success. This is repeated for all K1K2 pairs between A and B, and the accuracy is averaged. In this way, every newsgroup article is used the same number of times. The mean and standard deviation of the accuracy are listed.
In Table 1, the clustering results are listed for the balanced cluster cases, i.e., both subgroups have about 200 newsgroup articles. MinMaxCut performs about the same as Ncut for newsgroups NG1/NG2, where the cluster overlap is small. MinMaxCut performs substantially better than Ncut for newsgroups NG10/NG11 and NG18/NG19, where the cluster overlaps are large. MinMaxCut performs slightly better than PDDP. Rcut always performs the worst among the four methods and will not be studied further.
In Table 2, the clustering results are listed for the unbalanced cases, i.e., one subgroup has 300 newsgroup articles and the other has 200. This is generally a harder problem due to the unbalanced prior distributions. In this case, both MinMaxCut and Ncut perform reasonably well, with no clear deterioration, while the performance of PDDP clearly deteriorates. This indicates the strength of the graph-model-based MinMaxCut method. MinMaxCut consistently performs better than NormalizedCut in cases where the cluster overlaps are large.
We further study the reasons that MinMaxCut consistently outperforms NormalizedCut in large overlap cases. NormalizedCut sometimes cuts out a small subgraph, because the presence of s(A, B) in the denominators helps to produce a smaller Jncut value for the skewed cut than for the balanced cut.
We examine several cases and one specific case is shown in
These case studies provide some insights into these graph partitioning methods. Prompted by them, we provide further analysis here and derive general conditions under which a skewed cut will occur. Consider the balanced case where s(A,A)≅s(B,B). Let
s(A,B) = f·⟨s⟩, ⟨s⟩ = ½(s(A,A)+s(B,B)),
where f>0 is the average fraction of the cut relative to the within-cluster associations.
When the partition is optimal, A and B are exactly the partitioning result. The corresponding NormalizedCut value is
Jncut(A,B) ≈ f⟨s⟩/(⟨s⟩+f⟨s⟩) + f⟨s⟩/(⟨s⟩+f⟨s⟩) = 2f/(1+f). (27)
For a skewed partition A1, B1, we have s(A1,A1)<<s(B1,B1) and therefore s(A1,B1)<<s(B1,B1). The corresponding Jncut value is
Jncut(A1,B1) = s(A1,B1)/(s(A1,A1)+s(A1,B1)) + s(A1,B1)/(s(B1,B1)+s(A1,B1)) ≈ s(A1,B1)/(s(A1,A1)+s(A1,B1)). (28)
Using NormalizedCut, a skewed or incorrect cut will happen if Jncut(A1,B1)<Jncut(A,B). Using Eqs. (27, 28), this condition is satisfied if
We repeat the same analysis for MinMaxCut, calculating JMMC(A,B) and JMMC(A1,B1). The condition for a skewed cut using MinMaxCut is JMMC(A1,B1)<JMMC(A,B), which is
For the large overlap case, say f=½, the conditions for a possible skewed cut become
NormalizedCut: s(A1,B1) ≤ 2·s(A1,A1),
MinMaxCut: s(A1,B1) ≤ s(A1,A1). (29)
The relevant quantities are listed in Table 4. For the datasets newsgroups 10-11 and newsgroups 18-19, the condition for a skewed NormalizedCut is satisfied most of the time, leading to many skewed cuts and therefore lower clustering accuracy in Tables 1 and 2. For the same datasets, the condition for a skewed MinMaxCut is not satisfied most of the time, leading to more correct cuts and therefore higher clustering accuracy. Eq. (29) is the main result of this analysis.
So far we have discussed MinMaxCut using the eigenvector of Eq. (11) as the continuous solution of the objective function, as provided by Theorem 2.1. This is a good solution to the MinMaxCut problem, as the experimental results above show, but it is still an approximate solution. Given a current clustering solution, we can refine it to improve the MinMaxCut objective function. There are many ways to refine a given clustering solution. In this and the next subsection, we discuss two refinement strategies and show the corresponding experimental results.
Searching for the optimal icut in Theorem 2.1 is equivalent to a linear search based on the order defined by sorting the elements of q2, which we call the q2-order. Let π=(π1, . . . ,πn) represent a permutation of (1, . . . , n). The q2-order is the permutation π induced by sorting q2(i) in increasing order, i.e., q2(πi)≤q2(πi+1) for all i. The linear search algorithm based on π searches for the minimal JMMC(A,B) over j=1, 2, . . . ,n−1, setting the clusters A, B as
A = {i | q2(πi) ≤ q2(πj)}, B = {i | q2(πi) > q2(πj)}. (30)
The linear search implies that nodes on one side of the cut point must belong to one cluster: if q2(i)≥q2(j)≥q2(k), where i, j, k are nodes, then the linear search will not allow i, k to belong to one cluster while j belongs to the other. Such a strict order is not necessary. In fact, in large overlap cases, we expect that some nodes could be moved to the other side of the cut, lowering the overall objective function.
How do we identify the nodes near the boundary between the two clusters? For this purpose, we define the linkage as a closeness or similarity measure between two clusters (subgraphs):
ℓ(A,B) = s(A,B)/[s(A,A)s(B,B)]. (31)
(This is motivated by the average linkage ℓ(A,B)=s(A,B)/(|A||B|) in hierarchical agglomerative clustering; following the spirit of MinMaxCut, we replace |A|, |B| by s(A,A), s(B,B).) For a single node u, its linkage to subgraph A is ℓ(A,u)=s(A,u)/s(A,A). Now we can identify the nodes near the cut. If a node u is well inside a cluster, u will have a large linkage with that cluster and a small linkage with the other cluster. If u is near the partition boundary, its linkages with both clusters should be close. Therefore, we define the linkage difference
Δ(u) = ℓ(u,A) − ℓ(u,B). (32)
A node with small Δ should be near the cut and is a candidate to be moved to the other cluster.
In
After moving node #62 to cluster B, we try to move another node with negative Δ from cluster A to cluster B, depending on whether the objective function is lowered. In fact, we move all nodes in cluster A with negative Δ to cluster B if the objective function is lowered, and similarly move all nodes in cluster B with positive Δ to cluster A. This procedure of swapping nodes is called the "linkage-based swap". It is implemented by sorting the array s(u)Δ(u), where s(u)=−1 if u∈A and s(u)=1 if u∈B, in decreasing order to provide a priority list, and then moving the nodes one by one. The greedy move proceeds from the top of the list to the last node u where s(u)Δ(u)≥0. This swap reduces the objective function and increases the partitioning quality. In Table 5, the effects of the swap on clustering accuracy are listed. In all cases, the accuracy increases. Note that in the large overlap cases, NG9/NG10 and NG18/NG19, the accuracy increases by about 10% over MinMaxCut without refinement.
If s(u)Δ(u)<0 but close to 0, node u is in the correct cluster, although it is close to the cut. Thus, among the nodes with s(u)Δ(u)<0, we select the 5% closest to 0 as candidates, and move those that reduce the MinMaxCut objective to the other cluster. This is done for both clusters A and B. We call this procedure the "linkage-based move". Again, these moves reduce the MinMaxCut objective and therefore improve the solution. In Table 5, their effects on clustering accuracy are shown. Put together, the linkage-based refinements improve the accuracy by 20%. Note that the final MinMaxCut results are about 30-50% better than NormalizedCut and about 6-25% better than PDDP (see Tables 5 and 1).
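A sketch of the linkage-based swap, under the simplifying assumption that Δ(u) is recomputed once per pass rather than after every move; the function and variable names are ours, not from the paper.

```python
import numpy as np

def linkage_swap(S, labels, max_passes=2):
    """Greedy linkage-based swap for a 2-way partition (labels of 0/1)."""
    labels = np.asarray(labels).copy()

    def jmmc(lab):
        A, B = np.where(lab == 0)[0], np.where(lab == 1)[0]
        sAB = S[np.ix_(A, B)].sum()
        return sAB / S[np.ix_(A, A)].sum() + sAB / S[np.ix_(B, B)].sum()

    for _ in range(max_passes):
        A, B = np.where(labels == 0)[0], np.where(labels == 1)[0]
        sAA, sBB = S[np.ix_(A, A)].sum(), S[np.ix_(B, B)].sum()
        # linkage differences Delta(u) = l(u,A) - l(u,B), Eq. (32)
        delta = S[:, A].sum(axis=1) / sAA - S[:, B].sum(axis=1) / sBB
        sign = np.where(labels == 0, -1.0, 1.0)     # s(u) = -1 in A, +1 in B
        priority = np.argsort(-sign * delta)        # decreasing s(u)*Delta(u)
        improved = False
        for u in priority:
            if sign[u] * delta[u] < 0:              # remaining nodes are well placed
                break
            if (labels == labels[u]).sum() <= 1:    # never empty a cluster
                continue
            trial = labels.copy()
            trial[u] = 1 - trial[u]                 # tentatively move u across the cut
            if jmmc(trial) < jmmc(labels):          # accept only if J_MMC decreases
                labels, improved = trial, True
        if not improved:
            break
    return labels
```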
Given a current clustering solution A, B, we can always compute the linkage difference of Eq. (32) for every node. By sorting the linkage differences we obtain an ordering which we call the linkage differential ordering (LD-order).
The motivation of the LD-order is from observing linkage differences as shown in
This prompts us to apply the linear search algorithm of Eq. (30) to the LD-order to search for the optimal MinMaxCut. The results are given in Table 6. We see that the MinMaxCut values obtained with the LD-order are lower than those based on the q2-order. The clustering accuracy also increases substantially. Note that the LD-order can be applied recursively to the clustering results for further improvement.
In many applications we look for inter-dependence among different aspects (attributes) of the same data objects. For example, in text processing, a collection of documents is represented by a rectangular word-document association matrix, where each column represents a document and each row represents a word. The mutual interdependence reflects the fact that the content of a document is determined by the word occurrences, while the meaning of words can be inferred from their occurrences across different documents. The association data matrix P=(pij) typically has non-negative entries. It can be studied as a contingency table and viewed as a bipartite graph with P as its adjacency matrix.
For a contingency table with m rows and n columns, we wish to partition the rows R into two clusters R1, R2 and simultaneously partition the columns C into two clusters C1, C2. Let s(Rp,Cq) ≡ Σi∈Rp Σj∈Cq pij denote the association between row cluster Rp and column cluster Cq.
If n=m and pij=pji, this objective reduces to Eq. (2). Let the indicator vector f determine how to split R into R1, R2 and the indicator vector g determine how to split C into C1, C2:
Let dir=Σj=1n pij be the row sums and djc=Σi=1m pij be the column sums, and form the diagonal matrices Dr=diag(d1r, . . . ,dmr), Dc=diag(d1c, . . . ,dnc). Define the scaled association matrix
P̂ = Dr−1/2 P Dc−1/2 = Σk σk uk vkT,
with the singular value expansion explicitly written. We have the following:
The proof is an extension of Theorem 2.1, treating the bipartite graph P as a standard graph [34] with adjacency matrix
S = [ 0 P; PT 0 ].
Details are skipped due to space limit. The use of SVD is also noted in [7].
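Under our reading of this construction (the theorem statement itself is not reproduced above), the simultaneous row/column clustering might be sketched as follows: form the scaled matrix Dr−1/2PDc−1/2, take its second left and right singular vectors, rescale by Dr−1/2 and Dc−1/2, and split by sign. Splitting at 0 rather than searching over cut points as in Eq. (13) is a simplification, and the names are ours.

```python
import numpy as np

def bipartite_minmaxcut(P):
    """Co-cluster the rows and columns of a nonnegative contingency table P (a sketch)."""
    dr = P.sum(axis=1)                          # row sums d_i^r (assumed positive)
    dc = P.sum(axis=0)                          # column sums d_j^c (assumed positive)
    Phat = P / np.sqrt(np.outer(dr, dc))        # scaled matrix Dr^{-1/2} P Dc^{-1/2}
    U, sig, Vt = np.linalg.svd(Phat, full_matrices=False)
    f = U[:, 1] / np.sqrt(dr)                   # continuous row indicator vector
    g = Vt[1, :] / np.sqrt(dc)                  # continuous column indicator vector
    rows = (f >= 0).astype(int)                 # R1 vs R2, splitting at 0 for simplicity
    cols = (g >= 0).astype(int)                 # C1 vs C2
    return rows, cols
```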
So far we have focused on 2-way clustering. We now extend to K-way clustering, K≥3. We define the objective function as the sum of the 2-way JMMC over all cluster pairs,
JMMC(C1, . . . ,CK) = Σp<q [ s(Cp,Cq)/s(Cp,Cp) + s(Cp,Cq)/s(Cq,Cq) ] = Σk s(Ck,C̄k)/s(Ck,Ck),
where C̄k denotes the complement of cluster Ck, and NormalizedCut is extended to K-way clustering as
Jncut(C1, . . . ,CK) = Σk s(Ck,C̄k)/dCk.
Note that for large K, s(Ck,
The analysis of MinMaxCut, RatioCut, and NormalizedCut on the random graph model in § 2.2 can easily be extended to the K≥3 case, with identical conclusions: RatioCut and NormalizedCut show no size preference, while MinMaxCut favors balanced cuts.
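For reference, a small helper that evaluates the K-way objective just defined, in the Σk s(Ck,C̄k)/s(Ck,Ck) form, for an integer label vector; the name is ours.

```python
import numpy as np

def kway_minmaxcut(S, labels):
    """K-way MinMaxCut objective: sum over clusters of s(Ck, complement) / s(Ck, Ck)."""
    labels = np.asarray(labels)
    J = 0.0
    for k in np.unique(labels):
        Ck = np.where(labels == k)[0]
        rest = np.where(labels != k)[0]
        J += S[np.ix_(Ck, rest)].sum() / S[np.ix_(Ck, Ck)].sum()
    return J
```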
In the above discussion of cluster balance, we are primarily concerned with cluster size, i.e., we desire that the final clusters have approximately the same sizes,
|C1| ≅ |C2| ≅ . . . ≅ |CK|. (39)
There is another form of cluster balance, which we discuss below. When minimizing JMMC(C1, . . . , CK), there are K terms, all of which are positive. For JMMC to be minimized, all terms should be of approximately the same value: the minimization does not favor a situation in which one term is much larger than the rest. Thus we have
Now define the average between-cluster similarity
we have
Assume further that
s11|C1| ≅ s22|C2| ≅ . . . ≅ sKK|CK|.
We call this the similarity-weighted size balance. MinMaxCut is examined in a recent study of clustering objective functions [35], where, for a dataset of articles about sports clustered with K=10, MinMaxCut produces clusters whose sizes vary by about a factor of 3.3 while the similarity-weighted cluster sizes vary by only a factor of 1.5 (Table 9 of [35]).
The lower and upper bounds of JMMC for K=2 (see § 2.4) can be extended to the K>2 case:
where ζ2, . . . , ζK are the largest eigenvalues of Eq. (12).
Proof. The proof of the lower-bound relating to the first K eigenvectors is given in (which differ from those for K=2 in § 2.1 and § 2.4). The upper-bound is a simple extension from the K=2 case.
K-way MinMaxCut is more complicated because multiple eigenvectors are involved, as explained by Theorem 3.2. Our approach is to first obtain K approximate initial clusters and then refine them. We discuss three methods for initial clustering here.
Eigenspace K-means. As provided by Theorem 3.2, the cluster membership indicators of the K-way MinMaxCut are closely related to the first K eigenvectors. Thus we may project the data into the K-dimensional eigenspace formed by these K eigenvectors and perform K-means clustering there. K-means is a popular and efficient clustering method. It minimizes the clustering objective function
JKmeans = Σk Σi∈Ck ||xi − ck||²,
where xi is the projected feature vector in the eigenspace and ck = Σi∈Ck xi / |Ck| is the centroid of cluster Ck.
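A sketch of the eigenspace K-means initialization, assuming the K eigenvectors are taken as the top K of Eq. (12) (including the trivial one) and that scikit-learn's KMeans stands in for the K-means step; these are our choices, not prescriptions from the paper.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def eigenspace_kmeans(S, K, random_state=0):
    """Initial K-way clustering: K-means on the first K (generalized) eigenvectors."""
    d = S.sum(axis=1)
    Dih = np.diag(1.0 / np.sqrt(d))             # D^{-1/2}
    lam, Z = eigh(Dih @ S @ Dih)                # eigenvalues in ascending order
    Q = Dih @ Z[:, -K:]                         # K eigenvectors with the largest eigenvalues
    return KMeans(n_clusters=K, n_init=10, random_state=random_state).fit_predict(Q)
```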
Divisive MinMaxCut. We start from the top, treating the whole dataset as one cluster. We repeatedly partition a current cluster (a leaf node in a binary tree) into two via the 2-way MinMaxCut until the number of clusters reaches a predefined value K, or some other stopping criterion is met. The crucial issue here is how to select the next candidate cluster to split. Details are explained in § 4.
Agglomerative MinMaxCut. Here clusters are built from the bottom up, as in conventional hierarchical agglomerative clustering. In each step, we select two current clusters Cp and Cq and merge them into a bigger cluster. Standard cluster selection methods include single linkage, complete linkage and average linkage; for the MinMaxCut objective function, the MinMax linkage of Eq. (31) seems more appropriate. The cluster merging is repeated until a stopping condition is met.
Once the initial clustering (e.g., from divisive MinMaxCut) is computed, refinements should be applied to improve the MinMaxCut objective function. The cluster refinement for K=2 discussed in § 2.7 may be extended to the K>2 case by applying the 2-way linkage-based refinement pairwise to all pairs of clusters.
Alternatively, a direct K-way linkage-based refinement procedure may be adopted. Assume a node u currently belongs to cluster Cp. The linkage differences Δpq(u) = ℓ(u,Cp) − ℓ(u,Cq) to the other K−1 clusters Cq are computed. The smallest Δpq(u) and the corresponding cluster indices are stored as an entry in a priority list. This is repeated for all nodes so that every entry of the list is filled. The list is then sorted according to Δpq(u) to obtain the final priority list. Following the list, nodes are moved one after another to the appropriate clusters if the overall MinMaxCut objective is reduced. This completes one pass. Several passes may be necessary.
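A sketch of this K-way refinement pass, with our own names and the simplification that the linkages are recomputed once per pass; a move is accepted only if it lowers the K-way MinMaxCut objective, and a cluster is never emptied.

```python
import numpy as np

def kway_linkage_refine(S, labels, passes=2):
    """One or more passes of K-way linkage-based refinement (a sketch)."""
    labels = np.asarray(labels).copy()
    K = int(labels.max()) + 1

    def jmmc(lab):
        return sum(S[np.ix_(np.where(lab == k)[0], np.where(lab != k)[0])].sum()
                   / S[np.ix_(np.where(lab == k)[0], np.where(lab == k)[0])].sum()
                   for k in range(K))

    for _ in range(passes):
        idx = [np.where(labels == k)[0] for k in range(K)]
        skk = [S[np.ix_(idx[k], idx[k])].sum() for k in range(K)]
        # l(u, Ck) = s(u, Ck) / s(Ck, Ck) for every node u and cluster Ck
        link = np.stack([S[:, idx[k]].sum(axis=1) / skk[k] for k in range(K)], axis=1)
        entries = []
        for u in range(len(labels)):
            k = int(labels[u])
            others = [q for q in range(K) if q != k]
            q = others[int(np.argmax(link[u, others]))]       # closest foreign cluster
            entries.append((link[u, k] - link[u, q], u, q))   # smallest difference = top priority
        improved = False
        for _, u, q in sorted(entries):                        # walk the priority list
            if (labels == labels[u]).sum() <= 1:               # never empty a cluster
                continue
            trial = labels.copy()
            trial[u] = q
            if jmmc(trial) < jmmc(labels):                     # move only if J_MMC drops
                labels, improved = trial, True
        if not improved:
            break
    return labels
```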
Divisive MinMaxCut is one practical algorithm for implementing K-way MinMaxCut via a hierarchical approach. It amounts to recursively selecting and splitting a cluster into two smaller ones in a top-down fashion until termination. One advantage of our divisive MinMaxCut over traditional hierarchical clustering is that our method has a clear objective function; refinements of the clusters obtained from the divisive process improve both the objective function and the clustering accuracy, as demonstrated in the experiments (§ 4.5). Divisive clustering depends crucially on the criterion for selecting the cluster to split.
It is instructive to see how the clustering objective functions change with K, the number of clusters. Given the dataset and the similarity measure (Euclidean distance in K-means and graph similarity weights in MinMaxCut), the global optimal value of the objective function is a function of K. An important property of these clustering objective functions is monotonicity: as K increases, K=2, 3, . . . , the MinMaxCut objective increases monotonically, while the K-means objective decreases monotonically. Thus there is a fundamental difference between the graph-based MinMaxCut and the Euclidean-distance-based K-means:
(a) the optimal value of the K-means objective function decreases monotonically:
JKmeansopt(C1, . . . ,CK) > JKmeansopt(C1, . . . ,CK, CK+1),
and (b) the optimal value of the MinMax Cut objective function increases monotonically:
JMMCopt(C1, . . . ,CK) < JMMCopt(C1, . . . ,CK, CK+1).
Proof. (a) is previously known. To prove (b), assume A, B1, B2 are the optimal clusters for K=3 for a given dataset, and merge B1, B2 into a single cluster B. We compute the resulting JMMC(A, B) and obtain
noting that s(A,B)=s(A,B1)+s(A,B2), s(B1,B1)<s(B,B) and s(B2,B2)<s(B,B). The global minimum for K=2 must be lower than or equal to this particular instance of JMMC(A, B). Thus we have
JMMCopt(A,B) ≤ JMMCB-merge(A,B) < JMMCopt(A, B1,B2).
Theorem 4.1 shows the difference between the MinMaxCut objective and the K-means objective. If we use the optimal value of the objective function to judge the optimal K, then K-means favors a large number of clusters while MinMaxCut favors a small number of clusters. The monotonic increase or decrease indicates that one cannot determine the optimal K from the objective function alone. Another consequence is that in top-down divisive clustering, as clusters are split into more clusters, the K-means objective will steadily decrease while the MinMaxCut objective will steadily increase.
Suppose the dataset is clustered into m clusters in the divisive clustering. The question is how to select one of these m clusters to split.
by setting γ=½. Note that setting γ=1 gives the similarity criterion, and setting γ=0 gives the cohesion criterion.
In our experiments below, we terminate the divisive procedure when the number of leaf clusters reaches the predefined K. Another criterion is based on cluster cohesion. Theorem 4.1(b) indicates that as the divisive process continues and the number of leaf clusters increases, the cluster cohesion of these leaf clusters increases. A threshold on cohesion is therefore a good stopping criterion in applications.
If a dataset has K reasonably distinguishable clusters, these natural clusters could have many different shapes and sizes. But in many datasets, clusters overlap substantially and natural clusters cannot be clearly defined. Therefore, in general, a single objective function J (even the "best" one, if it exists) cannot effectively model the vastly different types of datasets. For many datasets, as J is optimized, the accuracy (quality) of the clustering usually improves. But this works only up to a point; beyond it, further optimization of the objective will not improve the quality of the clustering, because the objective function does not necessarily model the data in fine detail. We formalize this characteristic of clustering objective functions as the saturation of the objective function.
Definition. For a given measure η of clustering quality (e.g., accuracy), the saturation objective Jsat is defined as the value beyond which further optimization of J no longer improves η. We say η reaches its saturation value ηsat.
Saturation accuracy is a useful concept and also a useful measure. Given a dataset with known class labels, there is a unique saturation accuracy for a clustering method. The saturation accuracy gives a good sense of how well the clustering algorithm will do on the given dataset.
In general, we would have to run extensive clustering experiments with the clustering method to compute the saturation accuracy. Here we propose an effective method to compute an upper bound on the saturation accuracy for a clustering method: (a) initialize with the perfect clusters constructed from the known class labels (at this stage the accuracy is 100%); (b) run the refinement algorithm on this clustering until convergence; (c) compute the accuracy and other measures. These values are upper bounds on the saturation values.
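The three-step procedure (a)-(c) might be sketched as follows; `refine` is any refinement routine, for example the K-way linkage-based sketch given earlier (a hypothetical stand-in, not the paper's implementation).

```python
import numpy as np

def saturation_accuracy_upper_bound(S, true_labels, refine):
    """(a) start from the perfect clustering, (b) refine it w.r.t. the MinMaxCut
    objective until convergence, (c) measure accuracy -> an upper bound on the
    saturation accuracy."""
    true_labels = np.asarray(true_labels)
    refined = refine(S, true_labels.copy())
    # the refinement only moves nodes between existing clusters, so cluster
    # identities are preserved and accuracy is a direct label comparison
    return float(np.mean(refined == true_labels))
```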
We apply the divisive MinMaxCut algorithm to document clustering. We perform experiments on Internet newsgroup articles from the 20-newsgroups collection, as in § 2.5. We focus on two sets of 5-cluster cases. The choice of K=5 is made to have enough levels in the cluster tree; we avoid K=4, 8, where the clustering results are less sensitive to cluster selection. The first dataset includes
In M5, the clusters overlap at a medium level. In L5, the overlaps among different clusters are large. From each set of newsgroups, we construct two datasets of different sizes: (A) randomly select 100 articles from each newsgroup; (B) randomly select 200, 140, 120, 100, and 60 articles from the 5 newsgroups, respectively. Dataset (A) has clusters of equal sizes, which is presumably easier to cluster. Dataset (B) has clusters of significantly varying sizes, which is presumably more difficult to cluster. Therefore, we have 4 newsgroup-cluster size combination categories.
For each category, 5 different datasets are randomly sampled from the newsgroup articles, and the divisive MinMaxCut algorithm is applied to each of them. The final results are the averages over these 5 random datasets in each category.
The results of clustering on the four datasets are listed in Table 7. The upper bounds of the saturation values are computed as described in § 4.4. Clustering results for each cluster selection method, size-priority (Size-P), average similarity (avg-sim), cohesion, similarity-cohesion (sim-coh) (see Eq. 43) and temporary objective (tmp-obj), are given in two rows: "I" (initial) are the results immediately after divisive clustering; "F" (final) are the results after two rounds of greedy refinement.
A number of observations can be made from these extensive clustering experiments. (1) The best results are obtained with average-similarity cluster selection. This is consistent across all 4 datasets. (2) The similarity-cohesion cluster selection gives very good results, statistically no different from the average-similarity selection. (3) Cluster cohesion alone as the selection method gives consistently the poorest results. The temporary-objective choice performs slightly better than the cohesion criterion, but still substantially below the avg-sim and sim-coh choices. These results are somewhat unexpected. We checked the details of several divisive processes; the temporary objective and cohesion often lead to unbalanced clusters because of the greedy nature and unboundedness of these choices.¹ (4) The size-priority selection method gives good results for datasets with balanced sizes, but not as good results for datasets with unbalanced cluster sizes, as expected. (5) The refinement based on the MinMaxCut objective almost always improves the accuracy, for all cluster selection methods on all datasets. This indicates the importance of refinement in hierarchical clustering. (6) The accuracies of the final clusterings with the avg-sim and sim-coh choices are very close to the saturation values, indicating that the obtained clusters are as good as the MinMaxCut objective function can provide. (7) Dataset M5B has been studied using K-means methods. The standard K-means method achieves an accuracy of 66%, while two improved K-means methods achieve 76-80% accuracy.
¹ A current cluster Ck is usually split into balanced clusters Ck1, Ck2 by MinMaxCut. However, Ck1 and Ck2 may be quite a bit smaller than other current clusters, because no mechanism exists in the divisive process to enforce balance across all current clusters. After several divisive steps, the clusters can become substantially out of balance. In contrast, the avg-similarity and size-priority choices prevent large imbalance from occurring.
In comparison, the divisive MinMaxCut achieves 92% accuracy.
In this paper, we provide a comprehensive analysis of the MinMaxCut spectral data clustering method. Compared to earlier clustering methods, MinMaxCut has a strong cluster balancing feature (§ 2.2, § 2.6, § 3.1). The 2-way clustering can be computed easily, while K-way clustering requires a divisive clustering (§ 4).
In divisive MinMaxCut, cluster selection based on average similarity and cluster cohesion leads to balanced clusters in the final stage and thus to better clustering quality. Experiments on agglomerative MinMaxCut (as discussed in § 3.3) indicate [8] that agglomerative MinMaxCut is as good as divisive MinMaxCut, both in clustering quality and in computational efficiency.
Our extensive experiments, on medium and large overlapping clusters with balanced and unbalanced cluster sizes, show that refinement of the clusters obtained with divisive and agglomerative MinMaxCut always improves the clustering quality, strongly indicating that the min-max clustering objective function captures the essential features of clusters in a wide range of situations. This supports our emphasis on the objective-function-optimization based approach.
Since cluster refinement is an essential part of the objective-function based approach, efficient refinement algorithms are needed. The refinement methods discussed in § 2.7, § 2.8 and § 3.4 are of O(n²) complexity. An efficient refinement algorithm in the spirit of the Fiduccia-Mattheyses linear-time heuristic [15] is highly desirable.
A counterpoint to the objective function optimization approach is objective function saturation, i.e., objective optimization is useful only up to a certain point (see § 4.4). Therefore, finding a universal clustering objective function is another important direction of research. On the other hand, the saturation values of accuracy or of the objective function can be used as a good assessment of the effectiveness of a clustering method, as shown in Table 7. This consideration does not favor procedure-oriented clustering approaches, however, where the lack of an objective function makes such self-consistent assessment impossible; justification of those methods is empirical.
Acknowledgments. This work is supported by U.S. Department of Energy, Office of Science (MICS Office and LDRD) under contract DE-AC03-76SF00098 and an NSF grant CCR-0305879.