This invention relates to combinations of segmentations, and particularly to a computer-implemented clustering method for data mining in image segmentation, social network analysis, computational biology, market research, search engines, and other applications.
The explosive growth of data being generated has presented an urgent need to improve the efficiency of data mining. Data mining is defined as a process used to extract usable information from a larger set of raw data. Among the major data mining tasks, clustering is a technique to discover groupings in a given dataset.
A clustering algorithm groups data points into clusters based on a notion of similarity between data points. Examples of existing algorithms are k-means clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and hierarchical clustering. The state-of-the-art algorithms are density-peak clustering (DP) and scalable kernel k-means. However, these clustering algorithms hardly achieve both high-quality clustering outcomes and runtime efficiency. For instance, DP is strong in clustering outcomes, but it is also one of the most computationally expensive algorithms because it relies on pairwise similarity between data points: it needs a large memory space, and its runtime is proportional to the square of the data size (n²). Scalable kernel k-means, on the other hand, is efficient yet less effective because it uses a kernel that has a feature map of intractable dimensionality and is independent of the data.
The present description discloses the first kernel-based clustering which has runtime proportional to data size and yields clustering outcomes that are superior to those of existing clustering algorithms.
This summary is not an extensive overview of the disclosure and it does not exhaustively identify key/critical elements of the invention or delineate the scope of the invention. Its sole purpose is to present the rationale behind the disclosure herein in a simplified form as a prelude to the more detailed description that is presented later.
A computer-implemented method of image segmentation comprises: receiving one image; and
Application data descriptor step: converting the image into a dataset of descriptors (e.g., CIELAB) to form a dataset E; and
Conversion using Isolation Kernel step: converting each point in dataset E, using a feature map of Isolation Kernel, to a point in a dataset D; and
Seed finding step: using a point-set kernel to find the most similar point wrt (with respect to) D and then using it as an initial seed for cluster G; and
Cluster growing step: growing the cluster G at a set rate (ϱ) incrementally using the point-set kernel by recruiting the most similar points from D wrt G; the cluster stops growing when all points in D excluding G have similarities wrt G less than or equal to τ, where τ is a user-defined similarity threshold; and
growing the next cluster using the remaining points in dataset D excluding G by restarting from the Seed finding step, until D is empty or no point can be found which has similarity greater than τ.
The most similar point, used as the initial seed for cluster G in the dataset D, is defined based on a point-set kernel K̂ as argmax_{x∈D} K̂(x, D). In the cluster growing step, the most similar point to grow a cluster G is obtained from argmax_{x∈D} K̂(x, G), where D excludes all points already in G.
CIELAB is a color space based on human perception. Instead of using Red, Green, and Blue as the “axes”, CIELAB uses Lightness (Black/White), “a” (Green/Red), and “b” (Blue/Yellow). CIELAB color space provides a perceptually uniform color space. In this color space, the distance between two points approximates how different the colors are in luminance, chroma, and hue.
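As an illustration of the application data descriptor step, the following is a minimal sketch that converts one image into a dataset E of CIELAB points; it assumes the scikit-image library is available, and the file name is hypothetical.

```python
# Sketch: convert an RGB image into dataset E of CIELAB descriptors,
# one 3-dimensional point (L, a, b) per pixel. Assumes scikit-image.
import numpy as np
from skimage import color, io

def image_to_dataset_E(path):
    rgb = io.imread(path)[..., :3]   # H x W x 3; drop any alpha channel
    lab = color.rgb2lab(rgb)         # perceptually uniform CIELAB values
    return lab.reshape(-1, 3)        # dataset E: n x 3, n = number of pixels

# E = image_to_dataset_E("forbidden_city_gate.png")  # hypothetical file name
```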
Further, the point-set kernel clustering method can be applied to cluster a set of images, and in social network analysis, computational biology, market research, search engines and other applications, by replacing the one image with a dataset in each of these applications.
When the clustering method is applied to cluster a set of images into several subsets of images, the given dataset is a set of images. In each of the above applications (social network analysis, computational biology, market research, search engines, etc.), point-set kernel clustering (psKC) can be applied either to segmenting one data object into multiple segments, or to clustering a set of data objects into several subsets of data objects, as exemplified when the data objects are images. The choice depends on the desired outcome and the problem formulation. For example, when the clustering method is applied in social network analysis, one can either segment one social network data object into multiple segments, or cluster a set of social network data objects into subsets of social network data objects.
Further, wherein a data descriptor for each application shall be used to convert the original dataset, consisting of either one data object or a set of data objects, into a set of points in vector representation.
Further, wherein using the feature map of Isolation Kernel to convert each point of dataset E to a point in dataset D comprises: using a random sample of ψ points from dataset E to produce a Voronoi diagram, where each Voronoi cell isolates one point from the rest of the points in the sample. A total of t Voronoi diagrams are produced from dataset E, and each point x in dataset E is converted using the t Voronoi diagrams to produce a feature vector Φ(x) of tψ binary attributes in dataset D: x → Φ(x). A minimal sketch of this conversion is given below.
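The sketch below assumes Euclidean distance defines the Voronoi cells and scales the feature vector by 1/√t so that dot products average cell co-occurrence over the t diagrams (keeping similarities in [0, 1]); neither assumption is mandated by the method.

```python
# Sketch of the Isolation Kernel feature map built from t Voronoi diagrams.
import numpy as np

rng = np.random.default_rng(42)

def build_partitionings(E, t=100, psi=16):
    """Draw t random samples of psi points from E; each induces a Voronoi diagram."""
    n = len(E)
    return [E[rng.choice(n, size=psi, replace=False)] for _ in range(t)]

def feature_map(x, partitionings):
    """Phi(x): t*psi binary attributes, with one 1 per diagram at x's Voronoi cell."""
    blocks = []
    for sample in partitionings:
        # The nearest sample point identifies the Voronoi cell that x falls into.
        cell = int(np.argmin(np.linalg.norm(sample - x, axis=1)))
        one_hot = np.zeros(len(sample))
        one_hot[cell] = 1.0
        blocks.append(one_hot)
    return np.concatenate(blocks) / np.sqrt(len(partitionings))

# parts = build_partitionings(E)                       # built once from E
# D = np.stack([feature_map(x, parts) for x in E])     # dataset D: n x (t*psi)
```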
Further, wherein using the point-set kernel to find the most similar point wrt D comprises producing a kernel mean map Φ̂(G) from a set of points G via averaging, and measuring the similarity between a point x and the set G using the point-set kernel:

K̂(x, G) = ⟨Φ(x), Φ̂(G)⟩ and Φ̂(G) = (1/|G|) Σ_{y∈G} Φ(y),

where Φ̂ is the kernel mean map of K̂; Φ is the feature map of Isolation Kernel k; and ⟨a, b⟩ denotes the dot product between two vectors a and b.
As the point-set kernel is constructed from a dataset D, the point-set kernel equations can be more precisely expressed as:

K̂(x, G | D) = ⟨Φ(x | D), Φ̂(G | D)⟩ and Φ̂(G | D) = (1/|G|) Σ_{y∈G} Φ(y | D),

where G ⊆ D, and Φ is the feature map of Isolation Kernel, which is constructed from D.
Further, wherein a post-processing can be applied to all clusters produced by point-set kernel clustering to ensure that the following objective is achieved:

max Σ_{j=1}^{k} Σ_{x∈G_j} K̂(x, G_j),

where, for a dataset D having k clusters G_j, j = 1, …, k, the post-processing re-examines all points which have the lowest similarity with respect to cluster G_j to determine whether they could be reassigned to another cluster so as to maximize the total similarity.
Further, wherein the similarity threshold τ < 1 and the growth rate ϱ ∈ (0, 1).
Software stored on a non-transitory machine-readable medium comprises instructions for enabling a data processing system to:
a) receive one image; and
b) convert the image into a dataset of descriptors (e.g., CIELAB) to form a dataset E; and
c) convert each point in dataset E, using a feature map of Isolation Kernel, to a point in a dataset D; and
d) use a point-set kernel to find the most similar point wrt D and then use it as an initial seed for cluster G; and
e) grow cluster G at a set rate (ϱ) incrementally using the point-set kernel by recruiting the most similar points from dataset D wrt G; the cluster stops growing when all points in dataset D excluding G have similarities wrt G less than or equal to τ, where τ is a similarity threshold; and
f) grow the next cluster using the remaining points in dataset D excluding G by restarting from step d, until D is empty or no point can be found which has similarity greater than τ.
A kernel-based clustering which is based on a point-set kernel, i.e., point-set kernel clustering (psKC), is described. In an embodiment, it characterizes every cluster of arbitrary shape, varied density and size in a dataset, starting from a seed; and it runs orders of magnitude faster than existing state-of-the-art clustering algorithms, which have quadratic time cost.
Comparatively, density-peak clustering (DP) did well on the five benchmark datasets shown in the accompanying figures.
In an embodiment, the computed ratio of psKC in a scaleup test using the MNIST8M dataset, which has a total of 8.1 million data points with 784 dimensions, was linear in the data size. The algorithmic advantage of psKC, together with the use of the point-set kernel, allows it to run on a standard machine with a single CPU (for clustering) and a GPU (for feature mapping in pre-processing). This enables the clustering to be run on a commonly available machine (with both GPU and CPU) to deal with large scale datasets. In a nutshell, it is the only clustering algorithm that can process millions of data points on a commonly used machine.
The present description will be better understood from the following detailed description and the accompanying drawings. The details of the present invention, both as to its structure and operation, can best be understood by referring to the accompanying drawings, in which like reference numbers and designations refer to like elements.
The point-set kernel measures how similar a data point x, represented as a vector, is to a set of data points G. The point-set kernel is represented as:

K̂(x, G) = ⟨Φ(x), Φ̂(G)⟩

and

Φ̂(G) = (1/|G|) Σ_{y∈G} Φ(y),

where Φ̂ is the kernel mean map of K̂; Φ is the feature map of a point-to-point kernel; and ⟨a, b⟩ denotes the dot product between two vectors a and b.
Computing time: the summation in Φ̂(G) needs to be done only once, as a pre-processing step. Then, computing K̂(x, G) based on the dot product takes a fixed amount of time, independent of n (the data size of G). Therefore, computing the similarity of x with respect to G for all points x in G, i.e., K̂(x, G) ∀x ∈ G, has a computational cost which is proportional to n only.
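A minimal sketch of this O(n) computation follows, assuming the feature vectors of the points in G are held as the rows of an array D_G.

```python
# Sketch: Phi_hat(G) is one average computed as pre-processing; afterwards,
# each K_hat(x, G) is a single dot product whose cost is independent of |G|.
import numpy as np

def kernel_mean_map(D_G):
    """Phi_hat(G): average of the feature vectors of all points in G (done once)."""
    return D_G.mean(axis=0)

def point_set_similarity(phi_x, phi_hat_G):
    """K_hat(x, G) = <Phi(x), Phi_hat(G)>."""
    return float(phi_x @ phi_hat_G)

# All n similarities K_hat(x, G) for x in G via one matrix-vector product:
# sims = D_G @ kernel_mean_map(D_G)   # cost proportional to n only
```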
Note that kernel mean embedding is an approach to convert a point-to-point kernel into a distribution kernel which measures the similarity between two distributions. The point-set kernel can be viewed as a special case of kernel mean embedding. Kernel mean embedding uses the same kernel mean map as used here.
The use of the feature map is necessary to achieve the stated efficiency. The alternative, which employs the point-to-point kernel/distance directly in the computation, has a computational cost proportional to n², which is the root cause of the high computational cost in existing density-based algorithms. The point-set kernel formulation assumes that the point-to-point kernel has a finite-dimensional feature map. Commonly used point-to-point kernels (such as Gaussian and Laplacian kernels) have two key limitations: they have a feature map of intractable dimensionality; and their similarity is independent of a given dataset. The first limitation prevents these kernels from being used in the formulation directly.
Therefore, a recently introduced point-to-point kernel which has an exact, finite-dimensional feature map, called Isolation Kernel, is used in K̂. Isolation Kernel is a data dependent kernel that is derived directly from data. Isolation Kernel is employed here because it is data dependent, which is essential to a good clustering outcome. If a Gaussian kernel, which is data independent, were employed instead, psKC would perform poorly on datasets with clusters of non-globular shape, different data sizes and/or densities, because the Gaussian kernel's similarity measurement is independent of the data distribution.
Isolation Kernel has two characteristics, which are antitheses of the two limitations of data independent kernels mentioned above. The first is that it has a finite-dimensional feature map, which enables Isolation Kernel to be used directly in the point-set kernel; the exact finite-dimensional feature map is crucial in achieving runtime proportional to data size. The second is that its similarity adapts to the local density of the data distribution of a given dataset, which means that two points in a sparse region are more similar than two points of equal inter-point distance in a dense region. This characteristic is crucial for the clustering algorithm to obtain good clustering outcomes.
As the point-set kernel is constructed from a dataset D, the point-set kernel equations can be more precisely expressed as:

K̂(x, G | D) = ⟨Φ(x | D), Φ̂(G | D)⟩ and Φ̂(G | D) = (1/|G|) Σ_{y∈G} Φ(y | D),

where G ⊆ D, and Φ is the feature map of Isolation Kernel, which is constructed from D. Note that Isolation Kernel has no closed-form expression.
The point-set kernel can be used to describe a cluster in terms of a similarity distribution, independent of the clustering process. Given a dataset D having k clusters G_j, j = 1, …, k, the clusters could be the ground truth or the clustering outcome of an algorithm. The K̂ similarity distribution of all clusters G_j in D is the distribution of K̂(x, G_j) over all points x ∈ D, for j = 1, …, k.
The properties of the point-set kernel for points outside the cluster are described as follows. Given a dataset D and a cluster G ⊂ D, let x, x′ ∈ D\G, let the distance between x and a set G be

l(x, G) = min_{z∈G} ‖x − z‖,

and let ρ(x) denote the density at x. Properties of the point-set kernel derived from D include: (a) Fall-off-the-cliff property: K̂(x, G) decreases sharply as l(x, G) increases; (b) Data dependent property: K̂(x, G) < K̂(x′, G) if l(x, G) = l(x′, G) and ρ(argmin_{z∈G} ‖x − z‖) > ρ(argmin_{z∈G} ‖x′ − z‖). In other words, the rate of falling off at x is data dependent, i.e., it is proportional to the density at the point in G closest to x, i.e., argmin_{z∈G} ‖x − z‖, and inversely proportional to l(x, G).
These properties enable each cluster to be expanded radially in all directions from a seed over multiple iterations, where each iteration recruits a subset of new members in the immediate neighborhood of the expanding cluster; they also enable arbitrarily shaped clusters of different densities and sizes to be discovered through growing a cluster.
The clustering, called point-set kernel clustering or psKC, employs the point-set kernel K̂ to characterize clusters. It identifies all members of each cluster by first locating a seed in the dataset. Then, it expands the cluster's membership in its local neighborhood, which grows at a set rate (ϱ) incrementally; it stops growing when all unassigned points have similarity with respect to the cluster below a threshold (τ). The process repeats for the next cluster using the remaining points in dataset D, yet to be assigned to any clusters found so far, until D is empty or no point can be found which has similarity greater than τ. All points remaining after the entire clustering process are noise, as their similarities are below the set threshold for each of the clusters discovered. The psKC procedure is shown in Algorithm 1; a sketch of the procedure is given below.
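Since Algorithm 1 is not reproduced here, the following is a minimal sketch of the psKC procedure over the feature-mapped dataset; the starting value and the exact decay schedule of the recruiting threshold γ are assumptions of this sketch, not a definitive reading of Algorithm 1.

```python
# Sketch of psKC. D is an n x (t*psi) array of Isolation Kernel feature vectors.
import numpy as np

def psKC(D, tau=0.1, rho=0.1):
    remaining = list(range(len(D)))              # points not yet assigned
    clusters = []
    while remaining:
        X = D[remaining]
        # Seed finding: the point most similar to the remaining data as a whole.
        sims_to_all = X @ X.mean(axis=0)
        if sims_to_all.max() <= tau:
            break                                # no point similar enough: stop
        members = [int(np.argmax(sims_to_all))]  # G starts from the seed
        gamma = 1.0                              # recruiting threshold (assumed start)
        while True:
            phi_hat_G = X[members].mean(axis=0)  # kernel mean map of cluster G
            outside = np.setdiff1d(np.arange(len(X)), members)
            if outside.size == 0:
                break
            sims = X[outside] @ phi_hat_G        # K_hat(x, G) for x outside G
            if sims.max() <= tau:
                break                            # the cluster stops growing
            gamma *= (1 - rho)                   # relax the threshold at rate rho
            members.extend(outside[sims > max(gamma, tau)].tolist())
        clusters.append([remaining[i] for i in members])
        kept = set(members)
        remaining = [remaining[i] for i in range(len(X)) if i not in kept]
    return clusters                              # leftover points are noise
```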
The cluster, grown from a seed, according to psKC can be formally defined as follows: A ϱ-expanded cluster grown from a seed x_p selected from D, using D and K̂(·, ·) with similarity threshold τ < 1 and growth rate ϱ ∈ (0, 1), is defined recursively as:

G_0 = {x_p}; G_i = {x ∈ D | K̂(x, G_{i−1}) > γ_i > τ},

where x_q = argmax_{x∈D\{x_p}} K̂(x, G_0), and γ_i is a recruiting threshold that starts from K̂(x_q, G_0) and decreases with each iteration i at the growth rate ϱ.
Let G_j be the ϱ-expanded cluster j from dataset D. The number of ϱ-expanded clusters in dataset D is discovered automatically by repeating the above cluster growing process to obtain G_k from D\{G_j, j = 1, …, k−1}. After discovering all ϱ-expanded clusters G_j in D, noise is defined as

N = {x ∈ D | ∀j K̂(x, G_j) ≤ τ}.
A post-processing can be applied to all clusters produced by psKC to ensure that the following objective is achieved:

max Σ_{j=1}^{k} Σ_{x∈G_j} K̂(x, G_j).

This post-processing re-examines all points which have the lowest similarity with respect to cluster G_j to determine whether they could be reassigned to another cluster so as to maximize the total similarity. This re-examination begins with points in G_j, j = 1, …, k, in the order the clusters are produced.
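A minimal sketch of this post-processing is given below; the fraction of lowest-similarity points re-examined per cluster and the choice to hold the kernel mean maps fixed during one pass are assumptions of this sketch.

```python
# Sketch: re-examine each cluster's least-similar points and reassign any point
# whose similarity is higher with respect to another cluster.
import numpy as np

def postprocess(D, clusters, frac=0.1):
    means = [D[c].mean(axis=0) for c in clusters]    # Phi_hat(G_j), fixed per pass
    for j, c in enumerate(clusters):                 # in the order clusters were produced
        sims = D[c] @ means[j]                       # K_hat(x, G_j) for x in G_j
        lowest = np.argsort(sims)[: max(1, int(frac * len(c)))]
        for i in sorted(lowest.tolist(), reverse=True):
            best = int(np.argmax([D[c[i]] @ m for m in means]))
            if best != j:                            # reassigning raises total similarity
                clusters[best].append(c.pop(i))
    return clusters
```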
As shown in the accompanying drawings, the computer-implemented method of image segmentation comprises:

receiving one image; and

Application data descriptor step: converting the image into a dataset of descriptors (e.g., CIELAB) to form a dataset E; and

Conversion using Isolation Kernel step: converting each point in dataset E, using a feature map of Isolation Kernel, to a point in a dataset D; and

Seed finding step: using a point-set kernel to find the most similar point wrt D and then using it as an initial seed for cluster G; and

Cluster growing step: growing the cluster G incrementally using the point-set kernel by recruiting the most similar points from dataset D wrt G; the cluster stops growing when all points in D excluding G have similarities wrt G less than or equal to τ, where τ is a similarity threshold; and

growing the next cluster using the remaining points in dataset D excluding G by restarting from the Seed finding step, until the dataset D is empty or no point can be found which has similarity greater than τ.
The use of Isolation Kernel in the point-set kernel enables the similarity between a point and a set to be computed efficiently, without the need to compute point-to-point similarity/distance, which is the root cause of the high time complexity in existing algorithms. The finite-dimensional feature map of Isolation Kernel and its use in the point-set kernel enable the algorithm to achieve its full potential: runtime proportional to data size (n), a level that cannot be achieved by existing effective clustering algorithms such as DP, or even by less effective but efficient algorithms such as scalable kernel k-means, whose runtimes are at least proportional to n². The time complexity of psKC is linear in n because the maximum number of iterations is fixed by the threshold τ and the growth rate ϱ, independent of the data size; t and ψ are parameters of Isolation Kernel.
In order to compare the clustering performance with existing clustering algorithms, including DP, scalable kernel k-means (which employs a Gaussian kernel), and kernel k-means (which employs an adaptive kernel), five experiments were conducted: one reporting clustering outcomes on artificial datasets, one on clustering outcomes on a single image (of high or low resolution), one on clustering a set of images into subsets of images, one on the scaleup test, and one on a stability analysis.
Parameter search ranges. Parameters are searched for each of the algorithms, i.e., DP, scalable kernel k-means, kernel k-means and psKC, and their best clustering outcomes are reported after the search. psKC is implemented in C++. Scalable kernel k-means is implemented in Scala as part of the Spark framework; DBSCAN is implemented in Java as part of the WEKA framework; and DP, DPik, k-means and the kNN kernel are implemented in MATLAB.
The parameter search ranges used in the experiments on artificial datasets are:
(1) DP: ε (the bandwidth used for density estimation) is in [0.001m, 0.002m, …, 0.4m], where m is the maximum pairwise distance. The number of clusters is set to the true number of clusters.
(2) Kernel k-means: k in kNN kernel is in [0.01n, 0.02n, . . . , 0.99n]; and the number of dimensions used is 100. The number of clusters is set to the true number of clusters.
(3) Scalable kernel k-means: σ in [0.1, 0.25, 0.5, 1, …, 16, 24, 32]; k is set to the true number of clusters; s=100 (the target dimension of the PCA step) and c=400 (the sketch size for the Nyström approximation), except for datasets with fewer than 400 points, where s=20 and c=200.
(4) psKC: ψ in [2, 4, 6, 8, 16, 24, 32], t=100, τ=0.1 and ϱ=0.1.
(5) psKCg: γ = 2^i where i is in [1, 2, 3, …, 16], τ=0.1 and σ in [0.1, 0.01, 0.001, …, 1×10⁻¹⁰].
(6) DBSCAN: ε in [0.001m, 0.002m, …, 0.999m] and MinPts in [2, 3, …, 30], where m is the maximum pairwise distance and MinPts is the density threshold.
(7) DPik: For DP, ε is in [0.001m, 0.002m, …, 0.4m], where m is the maximum pairwise distance; the number of clusters is set to the true number of clusters. For Isolation Kernel: ψ in [2, 4, 6, 8, 16, 24, 32] and t=100.
(8) k-means: The number of clusters is set to the true number of clusters.
The experiments ran on a Linux CPU machine: AMD 16-core CPU with each core running at 2.0 GHz, and 32 GB RAM. The feature space conversion was executed on a machine having GPUs: 2× GTX 1080 Ti, with each card having 12 GB RAM. Both are commonly used machines.
Clustering outcomes on artificial datasets. In the first experiment, five commonly used benchmark datasets, namely Ring-G, AC, Aggregation, Spiral and S3 are used.
The two versions of kernel k-means are weaker algorithms than DP, as they did poorly on at least three of the five datasets, i.e., Ring-G, Aggregation and Spiral. This is because of the use of k-means, which has fundamental weaknesses in detecting clusters that have non-globular shapes. The use of a kernel in both versions of kernel k-means transfers these fundamental weaknesses from input space to feature space. The results show that there is no guarantee that they can detect clusters of non-globular shapes.
Point-set kernel clustering is the only algorithm that did well on all five datasets; and it is the only algorithm that successfully identified all four clusters in the Ring-G dataset. This is a direct result of the cluster identification procedure which employs the point-set kernel. Other algorithms failed to correctly identify the four clusters because of their algorithmic designs, which must determine all density peaks/centers before individual points can be assigned to one of the peaks/centers.
Clustering outcomes on one image (of high or low resolution). In the second experiment, three photographic images of resolutions from low to high: Forbidden City Gate (499×324 pixels), Vincent van Gogh's Starry Night over the Rhône (932×687 pixels) and Zhao Mengfu's Chinese painting Autumn Colors (2,005×500 pixels), are used to compare the clustering outcomes. Color images are represented in the CIELAB color space. All clustering algorithms in the comparison are presented with a dataset in this CIELAB representation when an image is to be segmented.
As shown in the accompanying figures, kernel k-means split the elongated sky cluster in the Forbidden City Gate image into two segments.
In contrast, psKC could discover the two clusters without splitting the sky into two. Note that the Forbidden City Gate image has a total of 161,676 pixels; DP could only process a lower-resolution version of this image with 60,000 pixels. Like kernel k-means, DP identified this elongated cluster as having two peaks instead of one, which is one weakness of the DP peak identification procedure.
Clustering a set of images. In the third experiment, psKC is used to cluster a set of images into several subsets of images, instead of segmenting one image into multiple segments.
Meanwhile, digits 4 & 9 are grouped into three clusters. In addition to the two pure clusters, the third cluster consists of both digits in a 1:3 proportion. This is in contrast to the result produced by a kNN-graph based clustering algorithm, RCC (short for Robust Continuous Clustering), where both digits 4 & 9 have been grouped into a single cluster. The digits grouped in the third cluster have different writing styles from those in the two pure clusters of 4 & 9.
Runtime experiment. In the fourth experiment, the MNIST8M dataset, which has 8.1 million data points with 784 dimensions, is used for the scaleup test. The runtime is measured in CPU seconds (and includes the GPU seconds when the GPU is used).
The experimental result, shown in the accompanying figures, confirms that the runtime of psKC is linear in the data size.
Even with 12 CPUs, as shown in the accompanying figures, scalable kernel k-means remained substantially slower.
In contrast, the algorithmic advantage of psKC, together with the use of the point-set kernel, allows it to run on a standard machine with a single CPU (for clustering) and a GPU (for feature mapping in preprocessing). This enables the clustering to be run on a commonly available machine (with both GPU and CPU) to deal with large scale datasets.
In terms of real time: on the dataset with 40k data points, psKC took 73 seconds, which consists of 58 GPU seconds for feature mapping and 15 CPU seconds for clustering. In contrast, DP took 541 seconds. The gap in runtime widens as the data size increases: to complete the run on 8.1 million points, DP is projected to take 379 years. That would be about 12 billion seconds, which is almost six orders of magnitude slower than psKC's 20 thousand seconds (less than 6 hours). The widening gap is apparent in the accompanying figures.
As it is, there is no opportunity for DP to do feature mapping (where the GPU could be utilized). While it is possible for kernel k-means to make use of the GPU as in psKC, the main restriction of scalable kernel k-means is its PCA step, which has no efficient parallel implementation, to the best of our knowledge. The clustering procedures of both DP and psKC could potentially be parallelized, but this does not change their time complexities.
Stability test. The fifth experiment examines the stability of the clustering outcomes over repeated trials.
On the Spiral dataset, psKC appears to have a larger variance than kernel k-means. This is because kernel k-means produced significantly poorer clustering overall, with all 10 trials scoring below 0.5 in F1 score.
Overall, psKC (using t=100) produces a higher F1 score than kernel k-means on all three datasets, where the median result is shown as the line inside the box.
According to the five experimental results, psKC outclasses DP and the two versions of kernel k-means in terms of both clustering outcomes and runtime efficiency. psKC has the following advantages. First, the algorithm is deterministic, given a kernel function and the user-specified parameters. This resolves the instability issue and often leads to better clustering outcomes. The only randomization is due to the Isolation Kernel. The use of the most similar points in D as seeds is much more stable, even with different initializations of Isolation Kernel, compared with random initial groupings, which can change wildly from one run to the next. Second, the psKC procedure enables the detection of clusters of arbitrary shape and of different sizes and densities. Third, the psKC procedure commits each point to a cluster once it is assigned; and most points which are similar to the cluster never need to be reassigned. This is possible because of the use of a seed to grow a cluster: points which are similar to a cluster grown from the seed will not be similar to another cluster if the points are less similar to the seeds of the other clusters in the first place. The sequential determination of seeds (as opposed to the parallel determination of centers in k-means) makes that possible.
As a result, psKC avoids many of the unnecessary recomputations in k-means mentioned earlier. In other words, the clustering outcome of psKC is already close to the final maximization objective. The post-processing merely tweaks at the edges by re-examining the lowest-similarity points of each cluster for possible reassignment, to achieve the final maximization of the objective function.
In summary, the two root causes of the shortcomings of existing clustering algorithms are: (i) the use of a data independent point-to-point distance/kernel (where the kernel has a feature map with intractable dimensionality) to compute the required similarity directly; and (ii) algorithmic designs that constrict the types of clusters that they can identify. For example, in the case of kernel k-means, even though a kind of point-set kernel is used, it can detect clusters of globular shape only in feature space, and this does not guarantee that non-globular shaped clusters in input space can be detected. These root causes have led to poorer clustering outcomes and the longstanding runtime issue that has prevented these algorithms from dealing with large scale datasets.
These root causes are addressed by using a data dependent point-set kernel and a new clustering algorithm which utilizes the point-set kernel to characterize clusters—they encompass many types of clusters which cannot be detected by existing algorithms. As a result, psKC is the only clustering algorithm that is both effective and efficient—a quality which is all but nonexistent in current clustering algorithms. It is also the only kernel-based clustering that has runtime proportional to data size.
The clustering method for data mining of the present invention can be applied to multiple fields; the image segmentation application is taken as an example in the above embodiment. The data mining method can also be applied to applications such as clustering a set of images, social network analysis, computational biology, market research, search engines, etc. When data analysis is performed in the corresponding field, the data descriptor for each application shall be used to convert the original dataset into a set of points in vector representation.
The method is both effective and efficient, which enables it to deal with large scale datasets. In comparison with the state-of-the-art density-peak clustering and scalable kernel k-means clustering, the method is more effective and runs orders of magnitude faster when applied to datasets of millions of data points, on a commonly used computing machine.
This application claims the benefit of U.S. Provisional Application No. 63/020,248, filed on May 5, 2020, the contents of which are incorporated herein by reference.