1. Field of the Invention
The present invention is directed toward the field of computer implemented clustering techniques, and more particularly, toward methods and apparatus for fast sampling based approximate clustering.
2. Art Background
In general, clustering is the problem of grouping objects into categories such that members of the category are similar in some interesting way. Literature in the field of clustering spans numerous application areas, including data mining, data compression, pattern recognition, and machine learning. The computational complexity of the clustering problem is very well understood. The general problem is known to be NP hard.
The analysis of the clustering problem in the prior art has largely focused on the accuracy of the clustering results. For example, there exist methods that compute a clustering with maximum diameter at most twice as large as the maximum diameter of the optimum clustering. Although such prior art clustering techniques generate close to optimum results, they are not tuned for implementation in a computer, particularly when the dataset for clustering is large. Essentially, most prior art clustering methods are not designed to work with massively large datasets, especially because most computer implemented clustering methods require multiple passes through the entire dataset which may overwhelm or bog down a computer system if the dataset is too large. As such, it may not be feasible to cluster large datasets, even given the recent developments in large computing power.
In an effort to overcome this problem, a few prior art approaches have focused on purported solutions. Some of these approaches represent the dataset in a compressed fashion, according to how important a point is from a clustering perspective. For example, one prior art technique stores those points most important in main computer memory, compresses those that are less important, and discards the remaining points.
Another prior art technique for handling large datasets is through the use of sampling. For example, one technique illustrates how large a sample is needed to ensure that, with high probability, the sample contains at least a certain fraction of points from each cluster.
Attempts to use sampling to cluster large databases typically require a sample whose size depends on the total number of points n. Such approaches are not readily adaptable to potentially infinite datasets (which are commonly encountered in data mining and other applications that may use large data sources such as the web, click streams, phone records, or transactional data). Essentially, all prior art clustering techniques are constrained by the sample size and running time parameters, both of which depend on n, and as such they do not adequately address the realities of large dataset environments. Moreover, many prior art approaches do not make guarantees regarding the quality of the actual clustering rendered. Accordingly, it is desirable to develop a clustering technique with some guarantee of clustering quality that operates on massively large datasets for efficient implementation in a computer, all without the sample size and running time dependence on n.
Fast sampling methods offer significant improvements in both the number of points that may be clustered and in the quality of the clusters that are produced. The first fast sampling-based method, directed to center-based clustering, clusters a set of points S to identify k centers by utilizing probability and approximation techniques. The potentially infinite set of points S may be clustered through k-median approximate clustering. The second fast sampling-based method, directed to conceptual clustering, identifies k disjoint conjunctions that describe each cluster, so that the clusters themselves are more than merely a collection of data points.
In center-based clustering, the diameter M of the space is determined as the largest distance between a pair of points in S. Where M is unknown, it may be accurately estimated by utilizing a sampling based method that reflects the relevant aspects of the given space in the sample. Utilizing the determined value for M, a sample R of the set of points is drawn, which in turn provides the input to be clustered, in one embodiment according to α-approximation methods. Further provision is made for employing the above methodology in cases where there are more dimensions than there are data points, in which case the dimensions can be crushed in order to eliminate the dependence of the sample complexity on the dimensional parameter d.
In conceptual clustering, in order to identify k disjoint conjunctions, each collection of k clusters is characterized by a signature s. A sample R from S is initially taken. Then, for each signature s, the sample R is partitioned into a collection of buckets, where points in the same bucket agree on the literals stipulated by the signature s. A cap on the number of allowable buckets exists so as not to unnecessarily burden computational complexity by a dependence on n. For each bucket Bi in the collection, a conjunction ti, reflecting the most specific conjunction satisfied by all examples in Bi, is computed, and an empirical frequency R(ti) is computed. A quality may then be defined as the sum, over all buckets B1, . . . , Bk induced by signature s, of the product of the conjunction length |ti| and the empirical frequency R(ti). These computational procedures yield respective numerical quality values from which the outputted clustering may be maximized.
Center-based Clustering:
For center-based clustering, clustering is a process to operate on a set “S” of “n” points, and a number “k”, to compute a partitioning of S into k groups such that some clustering metric is optimized. The number of points n to be clustered dominates the running time, particularly for prior art approximate clustering techniques which tend to be predicated on a time complexity of O(n²), which differs from the inventive approach as described below.
The application of clustering to knowledge discovery and data mining requires a clustering technique with quality and performance guarantees that apply to large datasets. In many of the data mining applications mentioned above, the number of data items n is so large that it tends to dominate other parameters, hence the desire for methods that are not only polynomial, but in fact sublinear in n. Due to these large datasets, even computer implemented clustering requires significant computer resources and can consume extensive time resources. As described fully below, the fast sampling technique of the present invention is sublinear, and as such significantly improves the efficiency of computer resources, reduces execution time, and ultimately provides an accurate, fast technique for clustering that is independent of the size of the data set. Moreover, the inventive fast sample clustering has wide applicability over the realm of metric spaces, but will nevertheless be primarily discussed throughout in terms of one embodiment within Euclidean space, utilized within a computer implemented framework.
Overall, the fast sampling technique of the present invention provides the benefit of sampling without the prior art limitations according to sample size (potentially, an infinite size data set or an infinite probability distribution is clusterable according to the inventive methodology) and with the added benefit that the resulting clusters have good quality.
In general, the fast sampling technique of center based clustering reduces a large problem (i.e., clustering large datasets) to samples that are then clustered. This inventive application of sampling to clustering provides for the clustering to be sublinear, so that there is no dependence on either the number of points n, or on time (which is typically a squared function of n). Similar to the strategy employed in learning theory, the inventive fast sampling is, in one embodiment, modeled as “approximate clustering”, and provides for methods which access much less of an input data set, while affording desirable approximation guarantees. In particular, prior art methods for solving clustering problems also tend to share a common behavior in that they make multiple passes through the datasets, thereby rendering them poorly adapted to applications involving very large datasets. A prior art clustering approach may typically generate a clustering through some compressed representation (e.g., by calculating a straight given percentage on the points n in the dataset). By contrast, the inventive fast sampling technique of center based clustering applies an α-approximation method to a sample of the input dataset whose size is independent of n, thereby reducing the actual accessing of the data set, while providing for an acceptable level of clustering cost that in fact yields a desirable approximation of the entire data set. Also, the reduced accessing of input data sets allows for manageable memory cycles when implemented within a computer framework, and can therefore render a previously unclusterable data set clusterable.
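By way of illustration only, the following sketch shows the overall flow just described: draw a sample whose size m is chosen independently of n, then hand that sample to any α-approximate k-median method. The name approx_clusterer is a hypothetical plug-in standing in for whatever approximation method is chosen; this is a sketch of the general flow, not a definitive implementation of the invention.

```python
import random

def fast_sample_cluster(points, k, m, approx_clusterer):
    # Draw a uniform sample whose size m is chosen independently of n = len(points).
    sample = random.sample(points, min(m, len(points)))
    # Run any alpha-approximate k-median method on the sample only;
    # approx_clusterer(sample, k) is a hypothetical plug-in, not a fixed method.
    centers = approx_clusterer(sample, k)
    return centers
```

Because only the sample is ever clustered, the cost of the clustering step no longer depends on n.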
Furthermore, implicit in clustering is the concept of tightness. In defining tightness, a family F: Rd→R of cost functions exists, where for each ƒ in F and for each x in Rd, ƒ(x) simply returns the distance dist from x to its closest center, where dist is any distance metric on Rd. Reference may then be made to F as the family of k-median cost functions: F={ƒc1, . . . , ck: ƒc1, . . . , ck(x)=mini dist(x, ci)}. Closely related, and also of interest, is the family of k-median² cost functions that return the squared distance from a point to its nearest center. This objective is the basis of the popular k-means clustering method. The inventive technique provides for finding the k-median cost function ƒ with minimum expected value. Because methods that minimize the sum of distances from points to centers also minimize the average distance from points to centers, a multitude of approximation methods may also be used. In the present embodiment, for a particular cost function ƒ in F, the expected tightness of ƒ relative to S, denoted ES(ƒ), is simply the average distance from a point to its closest center, i.e.,

ES(ƒ)=(1/|S|)ΣxεS ƒ(x).

In the event that S is a probability distribution over a finite space, the expectation is taken with respect to that distribution, i.e., ES(ƒ)=Σx S(x)ƒ(x), where S(x) denotes the probability assigned to the point x. In the event that S is an infinite-sized dataset, the summation in the expectation is replaced with integration in the usual way. Define the optimum cost function for a set of points S to be the cost function ƒSεF with minimum tightness, i.e., ƒS=arg minƒεF ES(ƒ). Similarly, define the optimum cost function for a sample R of S to be the cost function ƒRεF with minimum tightness, i.e., ƒR=arg minƒεF ER(ƒ). Because it is impossible to guarantee that the optimum cost function for any sample R of S performs like ƒS (such as in situations where an unrepresentative sample is drawn), the parameter δ indicates the closeness of the expected tightness values. Given that finding the optimum clustering ƒR is NP-hard, and that α-approximation methods to ƒR represent very effective methods herein, an α-approximate clustering of ƒR should then behave like ƒS. This establishes that F is α-approximately clusterable with additive cost ε iff for each ε, δ>0 there exists an m such that for a sample R of size m, the probability that ES(ƒR)≦αES(ƒS)+ε is at least 1−δ. From this it is possible to derive the case where m does not depend on n under the Euclidean space assumption (and the case where m depends on log n under the more general metric space assumption).
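As an illustrative aside, the expected tightness defined above can be computed directly for any candidate set of centers. The short sketch below assumes Euclidean distance and is offered only as a reading aid for the definition, not as part of the claimed method.

```python
import math

def expected_tightness(points, centers):
    # E_S(f): average distance from each point to its nearest center
    # (Euclidean distance is assumed here for illustration).
    def dist(x, c):
        return math.sqrt(sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
    return sum(min(dist(x, c) for c in centers) for x in points) / len(points)
```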
In one embodiment, we consider the k-median clustering problem where given a set S of n points in Rd, the objective is to find k centers that minimize the average distance from any point in S to its nearest center. As shown in greater detail below, the fast sampling technique of the center-based clustering may take a sample
which suffices to find a roughly α-approximate clustering assuming a distance metric on the space [0,M]d. This and other techniques in the invention will generalize to other problems besides the k-median problem, but for purposes of illustrating the sublinear nature of the methods herein, the k-median problem is used when demonstrating the independence of the sample size and running time on n.
While the inventive techniques may apply to a metric space (X, d), for the particular case of clustering in d-dimensional Euclidean space, it is possible to obtain time and sample bounds completely independent of the size of the input dataset.
The fast sampling technique of the present invention will first determine diameter M, graphically depicted by the illustrative arrow 140 in
After drawing a sample R of size m1, k centers are discovered in R using standard clustering methods. The size m1 of the sample R is chosen so that approximately good centers of R are approximately good centers of S. These approximately good centers for the sample R will, as further detailed hereafter, yield close to the same result as if one had processed each and every point in S. The inventive center based clustering may therefore be seen—especially when taken within the context of the sample size m1 given hereafter—as a minimization of the true average distance from points in S to the center(s) of respective clusters (referred to as “true cost”), despite the fact that the center-based clustering approximately minimizes the sample average distance from points in R to the centers of their respective clusters (referred to as “sample cost”).
The objective of the k-median problem is to find a clustering of minimum cost, i.e., minimum average distance from a point to its nearest center. As mentioned before, prior art k-median inquiries focus on obtaining constant factor approximations when finding the optimum k centers that minimize the average distance from any point in S to its nearest center. In doing so, these constant factor approximations are dependent on the time factor O(n²). By contrast, the inventive techniques provide for a large enough sample such that the true cost approximates the sample cost. Thus, minimizing the sample cost is like minimizing the true cost. In other embodiments, the fast sampling technique may use other clustering methods to achieve similar ends, such as other clustering methods which exist for the k-center problem. Accordingly, the fast sampling technique described herein may be applied to any clustering method that outputs k centers. These methods may output k centers that optimize a metric different from k-center. Moreover, it is similarly important to note that any of the sample sizes referred to herein are exemplary in fashion, and one skilled in the art will readily recognize that any sample size may be utilized that ensures the uniform convergence property, i.e., that the sample cost approaches the true cost.
Turning then to
An assessment is made at decision block 230,
and compute M′ as the maximum distance between two points in a sample U, as graphically depicted in previously discussed
The probability that no point is drawn from any one of these strips is at most δ when a sample of size
is drawn. The probability that a point in a particular strip between G and H is not drawn in m trials is at most
This probability is at most
The probability that a point is not drawn in all 2d strips between G and H in m trials is at most δ by the sample size given. Hence, if a bound M on the space is unknown, then estimating M with M′ on a sample size given above, while running an α-approximation method on a sample size
yields an α-approximation clustering with additive cost ε(1+M).
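For illustration, a sketch of the diameter estimation step described above follows: draw a small uniform sample U and take M′ to be the maximum pairwise distance within U. The parameter sample_size stands in for the δ-dependent sample size bound given above, which is assumed rather than reproduced here, and Euclidean distance is assumed.

```python
import math
import random

def estimate_diameter(points, sample_size):
    # Draw a small uniform sample U and return M' = maximum pairwise distance in U.
    U = random.sample(points, min(sample_size, len(points)))
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return max(dist(x, y) for x in U for y in U)
```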
A sample R is then drawn according to
which suffices to find a roughly α-approximate clustering, assuming a Euclidean metric on [0,M]d. As delineated at block 250,
which provides for the clustering of a sample by an α-approximate k-median method that yields a k-median cost function ƒR such that with probability at least 1−δ, ES(ƒR)≦αES(ƒS)+ε. For the general metric assumption, a sample R of size
provides the same k-median quality guarantee.
If the number of dimensions d were crushed down to log n in step 220,
Conceptual Clustering Method:
In prior art applications, methods that output conclusions such as “this listing of 43 Mb of data points is in one cluster” may not be as useful as finding a description of a cluster. Conceptual clustering is the problem of clustering so as to find the more helpful conceptual descriptions. Within the context of an embodiment of a k disjoint conjunction example, the inventive techniques can not only offer a meaningful description of data, but can also provide a predictor of future data when clustering.
In practical applications, the set S of data to be clustered is typically a subcollection of a much larger, possibly infinite set, sampled from an unknown probability distribution. In contrast, the fast sampling techniques utilize processes similar to that of the probably approximately correct (“PAC”) model of learning, in that the error or clustering cost is distribution weighted, and a clustering method finds an approximately good clustering. Broadly speaking, the related mathematics are such that where D is an arbitrary probability distribution on X, the quality of a clustering depends simultaneously on all clusters in the clustering, and on the distribution, with the goal being to minimize (or maximize) some objective function Q(t1,t2, . . . , tk,D) over all choices of k-tuples t1, . . . , tk. In this way, PAC clustering can be utilized within a disjoint conjunction clustering application.
More specifically however, a dO(k
where PrD(ti) is the fraction of the distribution (also termed “probability”) that satisfies ti. It is evident that an optimum k-clustering is always at least as good as an optimum k−1 clustering, since any cluster can be split into two by constraining some variable, obtaining two tighter clusters with the same cumulative distributional weight. Hence, the number of desired clusters k is assumed to be input to the method. Further, it is required that the conjunctive clusters cover most of the points in S (or most of the probability distribution). This requirement is enforced with a parameter γ that stipulates that all but γ of the distribution must be covered by the conjunctions. Thus, the objective is to maximize the length of the cluster descriptions (i.e., longer, more specific conjunctions are more “tight”), weighted by the probabilities of the clusters, subject to the constraint that all but a γ fraction of the points are satisfied by the conjunctions (alternatively, at least 1−γ of the probability distribution is covered).
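A minimal sketch of this objective, written against an assumed encoding in which each point is a dictionary of Boolean attributes and each conjunction ti is a dictionary of required attribute values, may read as follows; it estimates PrD(ti) empirically from a sample and enforces the γ coverage constraint.

```python
def clustering_quality(sample, conjunctions, gamma):
    # Each point is a dict of boolean attribute values; each conjunction t_i is a
    # dict mapping attribute -> required value (a hypothetical encoding).
    def satisfies(point, conj):
        return all(point.get(attr) == value for attr, value in conj.items())

    n = len(sample)
    covered = sum(1 for p in sample if any(satisfies(p, t) for t in conjunctions))
    if covered < (1 - gamma) * n:
        return None  # coverage constraint violated: more than a gamma fraction uncovered
    # Objective: sum over clusters of conjunction length times empirical probability.
    return sum(len(t) * sum(1 for p in sample if satisfies(p, t)) / n
               for t in conjunctions)
```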
Conceptual clustering provides clusters that are more than a mere collection of data points. Essentially, the inventive conceptual clustering outputs the set of attributes that compelled the data to be clustered together.
By way of graphic depiction of this concept, Table 1 shows an example of four customers together with the items they purchased. In the table, customer 1 purchased a printer and a toner cartridge, but did not purchase a computer. Assuming an exemplary clustering of k=2, Table 1 can be broken into two clusters, one including customers 1 and 2 and the other including customers 3 and 4. In determining the aforementioned quality, we measure a length of a conjunction (a grouping of attributes in a string of positions, also termed a “data length”) by the number of variables or attributes making up the respective conjunction, while a probability of a conjunction is determined from the number of points (in this example the number of customers) that satisfy the conjunction.
The longer a conjunction, the fewer the number of points that satisfy it. For example, a short conjunction P (represented by customers who bought Printers) includes the first three customers. On the other hand, the longer conjunction PT, i.e., those customers that bought both printers and toner cartridges, is satisfied by only the first two customers.
Utilizing the above described quality function, max Σi|ti|·PrD(ti), for the two conjunctions P and C, we see that these short conjunctions have a quality that yields:

|P|·PrD(P)+|C|·PrD(C)=1·(3/4)+1·(3/4),

which equals 1.5 (where the data length of P is 1, the data length of C is 1, and the probability of each is three out of four data points being satisfied). Similarly, we may use the same quality function for the two conjunctive clusters PT and {overscore (T)}C to obtain:

|PT|·PrD(PT)+|{overscore (T)}C|·PrD({overscore (T)}C)=2·(2/4)+2·(2/4),
which equals 2. This means that the conjunctions PT (represented by the first two customers) and {overscore (T)}C (represented by customers 3 and 4), have a better quality (e.g., 2), than P (represented by the first 3 customers) and C (represented by the last 3 customers), which only have a quality of 1.5.
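The arithmetic above can be checked with a short script over the customer data described for Table 1; the rows below are reconstructed from that description and are illustrative only.

```python
# Attributes: P = printer, T = toner cartridge, C = computer (one row per customer).
customers = [
    {"P": True, "T": True, "C": False},   # customer 1
    {"P": True, "T": True, "C": True},    # customer 2
    {"P": True, "T": False, "C": True},   # customer 3
    {"P": False, "T": False, "C": True},  # customer 4
]

def prob(pred):
    # Empirical probability: fraction of customers satisfying the predicate.
    return sum(1 for c in customers if pred(c)) / len(customers)

# Short conjunctions P and C: lengths 1 and 1, probabilities 3/4 and 3/4.
q_short = 1 * prob(lambda c: c["P"]) + 1 * prob(lambda c: c["C"])            # 1.5
# Longer conjunctions PT and (not T)C: lengths 2 and 2, probabilities 2/4 each.
q_long = 2 * prob(lambda c: c["P"] and c["T"]) \
       + 2 * prob(lambda c: (not c["T"]) and c["C"])                          # 2.0
```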
In the k disjoint conjunction problem, the clustering produces disjoint clusters, in which some variable is negated in one cluster and un-negated in another cluster. For example, two arbitrary clusters designated as, say, PT and TC are not disjoint, because there are points that satisfy both of these conjunctions (customer 2). Similarly, conjunctive clusters whose variables do not overlap may not be disjoint, like P and C, since customers 2 and 3 satisfy both conjunctions. By contrast, a clustering of, say, PT and {overscore (T)}C would be disjoint.
In one exemplary embodiment known as the k disjoint conjunction problem, the disjoint aspect of clusters can be utilized to provide an inventive signature q between clusters. Each set of k disjoint conjunctions has a corresponding signature q that contains a variable that witnesses the difference between each pair of conjunctions. The length of a signature is thus O(k2). The following table gives a simple example of three signatures for k=2 clustering of the data in Table 1.
The first signature “P” means that the first conjunction contains the literal P and the second conjunction contains the literal {overscore (P)}. Thus the second column shows the induced skeleton for this signature. The third column indicates the buckets into which the points are partitioned. The points “110,111,101” are associated with the first bucket since the first bit position (corresponding to P) is always “1”. The point “001” is placed in the second bucket since this point satisfies {overscore (P)}. Given the buckets, a most specific conjunction is computed. The most specific conjunction is a conjunction of attributes that is satisfied by all the points and yet is as long as possible. For the first bucket, the conjunction P is as long as possible since adding any other literal (T,{overscore (T)},C, or {overscore (C)}) will cause one of the points to not satisfy the conjunction. For the second bucket, the conjunction {overscore (P)} can be extended to include {overscore (T)}C and the resulting conjunction covers exactly “001” and can't be extended further.
In general, the signature q of k disjoint conjunctions may be defined as a k-signature comprising a sequence of literals lij, 1≦i<j≦k, where each lij is a literal in {x1, . . . , xd, {overscore (x)}1, . . . , {overscore (x)}d}. Associated with each k-signature is a “skeleton” of k disjoint conjunctions s1, . . . , sk, where conjunction si contains exactly those literals lij for i<j, and the complements of the literals lji for j<i. k disjoint conjunctions t1, . . . , tk are a specialization of a skeleton s1, . . . , sk iff, for each i, the set of literals in si is contained in the set of literals in ti. Clearly, if q is a k-signature, then the skeleton conjunctions induced by q are disjoint, as are any k conjunctions that are a specialization of that skeleton. Furthermore, every k disjoint conjunctions are a specialization of some skeleton induced by a k-signature.
According to the signature q, the sample R may then be partitioned into buckets B according to the literals in the signature. For each bucket b in B, we can then compute the most specific conjunctive description. The overall method for identifying k disjoint conjunctions may then be exemplified as in the flow diagram
A sample R is drawn at block 300,
(block 345,
In one embodiment, the size of the sample R drawn in Block 300,
then with probability at least 1−δ, the clustering found by the method covers all but γ of the distribution, and the quality of the clustering is within an additive value ε of the optimum clustering.
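A compact sketch of the per-signature computation described above (partition the sample into buckets, compute each bucket's most specific conjunction, and score the result) is given below. The signature is simplified here to a list of literal sets, one per bucket, which is an assumed stand-in for the pairwise-witness form defined earlier; the encoding of points as attribute dictionaries is likewise illustrative.

```python
def most_specific_conjunction(bucket, attributes):
    # Keep every attribute on which all points in the bucket agree.
    first = bucket[0]
    return {a: first[a] for a in attributes
            if all(p[a] == first[a] for p in bucket)}

def score_signature(sample, signature, attributes):
    # `signature` is a list of literal sets, one per bucket, each a dict
    # mapping attribute -> required value (simplified stand-in for a k-signature).
    buckets = [[] for _ in signature]
    for point in sample:
        for i, literals in enumerate(signature):
            if all(point[a] == v for a, v in literals.items()):
                buckets[i].append(point)
                break  # each point lands in at most one bucket (disjointness)
    quality = 0.0
    conjunctions = []
    for bucket in buckets:
        if not bucket:
            continue
        t = most_specific_conjunction(bucket, attributes)
        conjunctions.append(t)
        quality += len(t) * len(bucket) / len(sample)  # |t_i| * empirical frequency R(t_i)
    return conjunctions, quality
```

The clustering ultimately output is the set of conjunctions from whichever signature attains the highest such score.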
Computer Implementation Efficiency:
Clustering of large data sets, as it relates to the use of computer resources, may generally consume enormous amounts of memory and processing bandwidth. If mem is the size of memory in the computer, then one key issue in the computer implementation of clustering is to ascertain the best way to cluster S, using any clustering technique, when |S|>>mem.
In general, most computer implemented clustering methods require multiple passes through the entire dataset. Thus, if the dataset is too large to fit in the main memory of a computer, then the computer must repeatedly swap the dataset in and out of main memory (i.e., the computer must repeatedly access an external data source, such as a hard disk drive). In general, a method that manages the placement or movement of data is called an external memory method (also referred to as an I/O-efficient or out-of-core method). The I/O efficiency of an external memory method is measured by the number of I/O accesses it performs, or alternatively by the number of times the input dataset is scanned. In the inventive technique, however, the number of scans is greatly reduced by the sampling approach described previously. Moreover, prior art computer based clustering was incapable of processing vast data sets, particularly where the amount of data was infinite or approached infinity, unlike the inventive sampling which overcomes this limit.
By way of an exemplary embodiment,
To process a massively large dataset using a prior art clustering technique, the program swaps data in and out of main memory 420 and/or executes numerous input/output operations to the external data source 440. The fast sampling method of the present invention improves I/O efficiency because a very large dataset, initially stored in the persistent data store 440, is sampled and stored in main memory 420. The clustering method calculation may be executed on these vast data sets without any data swapping to the external data source 440, unlike the prior art clustering techniques, which would bog down or simply overwhelm all aspects of the computer system when infinite or near infinite data sets are processed. Furthermore, the fast sampling technique requires only one scan of the dataset, whereas the prior art clustering techniques require multiple scans of the dataset. Hence, the described computer implementation provides a more efficient method that is capable of clustering infinite and near infinite data sets, all while affording the aforementioned quality guarantees.
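One standard way to obtain the single-scan sampling behavior described above is reservoir sampling, sketched below. This is an illustrative technique for drawing a uniform sample in one pass over a dataset that does not fit in main memory; it is not asserted to be the exact sampling procedure of the invention.

```python
import random

def reservoir_sample(stream, m):
    # Maintain a uniform random sample of size m while scanning the data exactly once,
    # so the full dataset never needs to reside in, or be swapped through, main memory.
    reservoir = []
    for i, point in enumerate(stream):
        if i < m:
            reservoir.append(point)
        else:
            j = random.randint(0, i)   # inclusive on both ends
            if j < m:
                reservoir[j] = point
    return reservoir
```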
Although the present invention has been described in terms of specific exemplary embodiments, it will be appreciated that various modifications and alterations might be made by those skilled in the art without departing from the spirit and scope of the invention.
|  | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 10039617 | Jan 2002 | US |
| Child | 11140857 | May 2005 | US |