The present disclosure relates to data clustering, and more particularly relates to procedures for clustering of objects into clusters of related objects, such as clustering of a group of images into one or more clusters of related images.
In the field of digital imaging, it is common to partition a large set of images into smaller clusters of images. For example, a PC or camera may organize a large collection of images into a small number of groups of images similar to each other. To that end, a number of different clustering procedures have been developed, to automatically cluster images based on features of the images.
The performance of a clustering procedure may vary depending on the set of images being clustered. Thus, in some situations it may be better to choose one clustering procedure over another.
To compare clusterings, a Mutual Information (MI) measure has been developed. The MI measures how closely two different clustering procedures place the same objects in the same clusters.
One shortcoming of MI is that it only compares one clustering against another. MI can indicate a similarity between two clusterings, but does not account for which (if either) of the clusterings is “true” or correct. For example, some MI procedures produce the same symmetric MI score regardless of which clustering is taken as the reference, implying that the procedure simply compares the two clusterings to one another without regard for either clustering being “true”. Thus, while MI can indicate cluster similarity, it does not provide a user with an indication of which clustering procedure would work best for a set of objects, i.e., which clustering procedure would best approximate a manual clustering performed by the user.
The foregoing situation is addressed by comparing results from a set of clustering procedures against a predetermined categorization of images to generate respective scores for each clustering procedure, and selecting the clustering procedure with the highest score.
Thus, in an example embodiment described herein, a clustering procedure for grouping a set of images is selected from amongst plural clustering procedures. A predetermined categorization of objects such as images is input, and image features are extracted from each image in the set of images. A comparison measure is determined, by which to compare respective features of the set of images. Respective features between the images in the set of images are compared, based on the comparison measure, and a group of measures representing the differences between features of respective images is output. The plural clustering procedures are applied to the set of images to cluster the images based in part on the calculated group of measures. A clustering quality score is generated for each clustering procedure, based on the clusters created by the clustering procedure and the predetermined categorization of images. The clustering procedure with a high clustering quality score is selected.
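As an aid to understanding, the overall flow can be expressed as a short sketch in Python. This is a minimal illustration only, not the claimed implementation; the parameter names (extract, measure, procedures, score) are placeholders for the choices described in the detailed steps below.

```python
def select_clustering_procedure(images, true_categories, extract, measure,
                                procedures, score):
    """Sketch: extract features, compare them pairwise, run each candidate
    clustering procedure, score its clusters against the predetermined
    categorization, and return the name of the best-scoring procedure."""
    feats = [extract(img) for img in images]
    n = len(feats)
    # Group of measures representing the differences between respective images.
    D = [[measure(feats[i], feats[j]) for j in range(n)] for i in range(n)]
    scores = {name: score(true_categories, run(D)) for name, run in procedures}
    return max(scores, key=scores.get)
```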
By comparing a set of clustering procedures against a fixed “true” categorization of images to generate respective scores for each clustering procedure, it is ordinarily possible to automatically choose a clustering procedure which will group images in a manner best approximating a grouping performed manually by a user.
This brief summary has been provided so that the nature of this disclosure may be understood quickly. A more complete understanding can be obtained by reference to the following detailed description and to the attached drawings.
Host computer 41 also includes computer-readable memory media such as computer hard disk 45 and DVD disk drive 44, which are constructed to store computer-readable information such as computer-executable process steps. DVD disk drive 44 provides a means whereby host computer 41 can access information, such as image data, computer-executable process steps, application programs, etc. stored on removable memory media. Other devices for accessing information stored on removable or remote media may also be provided.
Host computer 41 may acquire digital image data from other sources such as a digital video camera, a local area network or the Internet via a network interface. Likewise, host computer 41 may interface with other color output devices, such as color output devices accessible over a network interface.
Display screen 42 displays a list of clustering procedures and a respective score for each procedure, along with a selection of the clustering procedure with the highest score. In that regard, while the below process will generally be described with respect to images for purposes of conciseness, it should be understood that other embodiments could also operate on other objects. For example, other embodiments could be directed to selecting a clustering procedure for clustering audio files, or moving image files.
RAM 115 interfaces with computer bus 114 so as to provide information stored in RAM 115 to CPU 110 during execution of the instructions in software programs such as an operating system, application programs, image processing modules, and device drivers. More specifically, CPU 110 first loads computer-executable process steps from fixed disk 45, or another storage device, into a region of RAM 115, and then executes those process steps from RAM 115. Data such as color images or other information can be stored in RAM 115, so that the data can be accessed by CPU 110 during the execution of computer-executable software programs, to the extent that such software programs have a need to access and/or modify the data.
As also shown, fixed disk 45 stores computer-executable process steps including an operating system, application programs, device drivers, and image processing module 123.
Image processing module 123 comprises computer-executable process steps, and generally comprises an input module, an extraction module, a determination module, a comparison module, an application module, a score generation module, and a selection module. Image processing module 123 inputs a set of images, and outputs a selection of a clustering procedure which best fits the set of images. More specifically, image processing module 123 comprises computer-executable process steps executed by a computer for causing the computer to perform a method for selecting a clustering procedure for grouping the set of images, as described more fully below.
The computer-executable process steps for image processing module 123 may be configured as a part of operating system 118, as part of an output device driver such as a printer driver, or as a stand-alone application program such as a color management system. They may also be configured as a plug-in or dynamic link library (DLL) to the operating system, device driver or application program. For example, image processing module 123 according to example embodiments may be incorporated in an output device driver for execution in a computing device, such as a printer driver, embedded in the firmware of an output device, such as a printer, in an input/output device such as a camera with a display, in a mobile output device (with or without an input camera) such as a cell-phone or music player, or provided in a stand-alone image management application for use on a general purpose computer. It can be appreciated that the present disclosure is not limited to these embodiments and that the disclosed image processing module 123 may be used in other environments in which image clustering is used.
Briefly, according to the process described in detail below, a clustering procedure for grouping a set of images is selected from amongst plural clustering procedures, by scoring the clusters produced by each procedure against a predetermined categorization of the images and selecting a procedure with a high score.
In more detail, in step 401, a set of images is input, along with a predetermined categorization of the images. In that regard, the predetermined categorization of images can be selected by a user, generated based on past categorizations of images, or generated using a pre-labeled learning set of images, among other methods. For example, multiple user selections of categorizations could be aggregated and stored, or transmitted to computing equipment 40 over a network. Of course, the predetermined categorization could also be adjusted or modified over time, to keep up with changes in categorizations by users.
In step 402, image features are extracted from all of the input images. For example, colors, shapes, and other features can be extracted, depending on which comparison measure is to be used. In that regard, in many cases, clustering is not performed on the actual data, but on features extracted from it. For example, a procedure for clustering images of cars does not usually operate in the pixel space of the images, but instead works with features such as color or shape extracted from the images.
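For illustration, one such feature is a color histogram. The sketch below, assuming NumPy and Pillow are available, extracts a normalized joint RGB histogram; the function name and bin count are illustrative choices, not requirements of the process.

```python
import numpy as np
from PIL import Image

def color_histogram(path, bins=8):
    # Load the image as an (H*W, 3) array of RGB pixel values.
    pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
    # Joint 3-D color histogram, flattened into a single feature vector.
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()  # normalize so differently sized images compare
```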
In step 403, a comparison measure is determined, by which to compare respective features of the set of images. For example, the comparison measure could be a chi-squared distance, a “histogram intersection” measure, a cosine distance, the Tanimoto coefficient, an Lp distance, the earth mover's distance, or the Hamming distance, among many others.
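Two of the named measures might be implemented as in the following sketch, assuming the normalized histogram features above; the epsilon guard is an illustrative detail to avoid division by zero.

```python
import numpy as np

def chi_squared(h1, h2, eps=1e-10):
    # Chi-squared distance: zero for identical histograms, larger when they differ.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def histogram_intersection_distance(h1, h2):
    # Histogram intersection is a similarity in [0, 1] for normalized
    # histograms; subtracting from 1 gives a dissimilarity.
    return 1.0 - np.sum(np.minimum(h1, h2))
```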
In step 404, respective features between the images in the set of images are compared based on the comparison measure, and a group of measures representing the differences between features of respective images is output. In particular, each image is compared against every other image in the set, and the output measures indicate how similar (or different) the images are according to the selected comparison measure.
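This all-pairs comparison can be sketched as filling a symmetric matrix of measures, one entry per pair of images:

```python
import numpy as np

def pairwise_measures(features, measure):
    # D[i, j] holds the measure between the features of images i and j.
    n = len(features)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = measure(features[i], features[j])
    return D
```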
In step 405, the plural clustering procedures are applied to the set of images to cluster the images, based in part on the calculated group of measures. In that regard, nearly every clustering procedure uses at least one such measure in the clustering process. Thus, each clustering procedure is executed on the set of images, based in part on the calculated set of measures, to generate resultant clusters. In that regard, it should be understood that each clustering procedure involves a choice of feature, comparison measure, and specific clustering process. Thus, this step contemplates that the same clustering procedure could be used multiple times but with differing parameters, to thereby produce different results.
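As one possible realization, the sketch below generates several candidate procedures from a single clustering process by varying its parameters, using scikit-learn's AgglomerativeClustering on the precomputed matrix of measures (the `metric="precomputed"` keyword assumes scikit-learn 1.2 or later; the particular linkages and cluster counts are illustrative).

```python
from sklearn.cluster import AgglomerativeClustering

def candidate_procedures(D):
    """Yield (name, labels) pairs; the same clustering process with
    different parameters counts as a distinct candidate procedure."""
    for linkage in ("average", "complete", "single"):
        for k in (2, 5, 10):
            labels = AgglomerativeClustering(
                n_clusters=k, metric="precomputed", linkage=linkage
            ).fit_predict(D)
            yield f"agglomerative/{linkage}/k={k}", labels
```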
In step 406, a clustering quality score is generated for each clustering procedure, based on the clusters created by the clustering procedure and the predetermined categorization of images.
For example, the clustering quality score can be generated for each clustering procedure by calculating mutual information between the clustering procedure and the predetermined categorization of images, and adjusting by a penalty factor indicating expected mutual information from randomly assigning images to clusters of the clustering procedure.
In one example, the clustering quality score (AMI*) is generated according to

$$ \mathrm{AMI}^* = \frac{\hat{I}(M) - E[\hat{I}(M) \mid a, C]}{\kappa}, $$

wherein $\hat{I}(M)$ equals the mutual information between the clustering procedure and the predetermined categorization of images, $E[\hat{I}(M) \mid a, C]$ equals the penalty factor and is based on the number of clusters C generated by the clustering procedure and on the predetermined categorization a, and wherein κ is a normalization constant which depends only on the predetermined categorization a. Generation of AMI* will be described more fully below.
In step 407, a clustering procedure with a high clustering quality score is selected.
In step 408, the selected clustering procedure is output. In that regard, the output clustering procedure may be displayed to the user, for example on display screen 42 as described above.
Generation of the clustering quality score (i.e., AMI*) will now be described more fully.
Consider a set of N objects partitioned by a “true” categorization into R categories, labeled $\mathcal{U} = \{U_1, U_2, \ldots, U_R\}$. A clustering procedure produces a partition of these N objects into C clusters, labeled $\mathcal{V} = \{V_1, V_2, \ldots, V_C\}$.
The overlap between the true categories and the clusters produced by a clustering procedure can be summarized in the form of a contingency table M, in which entry $M_{ij}$ counts the objects belonging to category $U_i$ that were placed in cluster $V_j$; the row sums of M are denoted $a_i$ and the column sums $b_j$.
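Such a table can be tallied directly from the two label assignments; the following minimal sketch assumes the categorization and the clustering are given as integer labels 0…R−1 and 0…C−1 respectively.

```python
import numpy as np

def contingency_table(categories, clusters):
    # M[i, j] = number of objects in true category i placed in cluster j.
    R, C = max(categories) + 1, max(clusters) + 1
    M = np.zeros((R, C), dtype=int)
    for u, v in zip(categories, clusters):
        M[u, v] += 1
    return M
```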
Formally, the mutual information I(X; Y) between discrete random variables X and Y is defined as

$$ I(X; Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} P(x, y) \log \frac{P(x, y)}{P(x)\, P(y)}, $$

where $\mathcal{X}$ and $\mathcal{Y}$ are the domains of X and Y respectively. I(X; Y) is a symmetric measure that quantifies the information X and Y share. Entropy, denoted by H(X), is a measure of the uncertainty associated with a random variable X. Formally,

$$ H(X) = -\sum_{x \in \mathcal{X}} P(x) \log P(x). $$
It can be verified that I(X; Y)=H(X)−H(X|Y)=H(Y)−H(Y|X). Thus, MI is a measure of how much knowing one of the variables reduces uncertainty of the other. I(X; Y) is upper-bounded by both H(X) and H(Y).
Using a statistical view, random variables U and V can be used to represent, respectively, the category and the cluster that an object belongs to. Then, after observing a contingency table M, the following frequentist estimates are generated:

$$ \hat{P}(U = i) = \frac{a_i}{N}, \qquad \hat{P}(V = j) = \frac{b_j}{N}, \qquad \hat{P}(U = i,\, V = j) = \frac{M_{ij}}{N}. $$
The mutual information between U and V can then be estimated as

$$ \hat{I}(M) = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{M_{ij}}{N} \log \frac{N\, M_{ij}}{a_i\, b_j}. $$
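The estimate (together with the entropy defined earlier) transcribes directly into code; a sketch:

```python
import numpy as np

def entropy(p):
    # H = -sum p log p, over outcomes with non-zero probability.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mi_estimate(M):
    # Plug-in estimate of I(U; V) from contingency table M.
    M = np.asarray(M, dtype=float)
    N = M.sum()
    a = M.sum(axis=1)  # row sums: category sizes a_i
    b = M.sum(axis=0)  # column sums: cluster sizes b_j
    i, j = np.nonzero(M)  # 0 log 0 terms contribute nothing
    return float(np.sum((M[i, j] / N) * np.log(N * M[i, j] / (a[i] * b[j]))))
```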
Now consider comparing two partitions, V and V′, with C and C′ clusters respectively, against a “true” partition U. If C=C′, the MI of each of the two partitions with the true partition, I(U; V) and I(U; V′), is a fair measure for comparing these clustering procedures.
However, if C≠C′ this might not be the case. For example, suppose three partitions V1, V2 and V3 are compared, of a dataset consisting of two objects from one category and two objects from another.
In such a comparison, a partition that simply splits the data into more clusters tends to obtain a higher MI score with the true partition purely by chance, even though it is not a better approximation of the true categorization.
Accordingly, a more informative measure should include a correction term to account for the mutual information that would be obtained by chance. That is, in order to evaluate a procedure that partitions the data into C clusters, the evaluation should take into account how much better this procedure does, on average, than a procedure that randomly partitions the same data into C clusters.
Therefore, a penalty factor, EMI*, is calculated below. EMI* serves as a baseline that can be subtracted from MI to obtain a more meaningful measure for validating a clustering procedure against a “true” clustering. The difference is then normalized to lie within a fixed range, and the resulting measure is called the adjusted mutual information, denoted by AMI*.

To calculate the penalty factor EMI*, given N objects, with $a_i > 0$ objects belonging to category $U_i$ for i=1 . . . R, the quantity to compute is the expectation of the mutual information estimate over all possible clusterings of these objects into exactly C clusters.
In that regard, the penalty factor is the expectation

$$ \mathrm{EMI}^* = E[\hat{I}(M) \mid a, C] = \sum_{M \in \mathcal{M}} \hat{I}(M)\, P(M \mid a, C), $$

where $\mathcal{M}$ is the set of all R×C contingency tables M such that row i sums to $a_i$ for i=1 . . . R, and such that all column sums are non-zero. P(M | a, C) is calculated as

$$ P(M \mid a, C) = \frac{\mathcal{N}(M)}{\sum_{M' \in \mathcal{M}} \mathcal{N}(M')}, $$

where $\mathcal{N}(M)$ is the number of ways to cluster the given objects that result in the contingency table M. Plugging in the above,

$$ E[\hat{I}(M) \mid a, C] = \frac{1}{\sum_{M' \in \mathcal{M}} \mathcal{N}(M')} \sum_{M \in \mathcal{M}} \mathcal{N}(M) \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{M_{ij}}{N} \log \frac{N\, M_{ij}}{a_i\, b_j}. $$
The summation over M above can be replaced with a summation over all possible values of $b_j$ and $M_{ij}$. There is no need to sum over the $a_i$, since they are fixed quantities.
Considering the range of values that $b_j$ and $M_{ij}$ can take: since there must be at least one element in each column of M, $b_j$ must be at least 1 and at most N−(C−1). Given $b_j$, $M_{ij}$ can be at most $\min(a_i, b_j)$. Additionally, after the (i, j)-th cell is filled, the rest of the j-th column must be filled with $b_j - M_{ij}$ elements drawn from a pool of $N - a_i$ elements; therefore, $M_{ij}$ must be at least $(a_i + b_j - N)^+$, which is $\max(0,\, a_i + b_j - N)$.
To replace the summation over M as mentioned above, $\mathcal{N}(M)$ should be replaced with $\mathcal{N}(M_{ij}, a_i, b_j \mid C)$, where $\mathcal{N}(n, a, b \mid C)$ is the number of ways to cluster the given objects into exactly C clusters such that there are n elements in a particular cell, and the numbers of elements in the corresponding row and column are a and b respectively. With this transformation,

$$ E[\hat{I}(M) \mid a, C] = \frac{1}{\sum_{M' \in \mathcal{M}} \mathcal{N}(M')} \sum_{j=1}^{C} \sum_{i=1}^{R} \sum_{b_j=1}^{N-C+1} \sum_{M_{ij}=(a_i+b_j-N)^+}^{\min(a_i,\, b_j)} \frac{M_{ij}}{N} \log\!\left(\frac{N\, M_{ij}}{a_i\, b_j}\right) \mathcal{N}(M_{ij}, a_i, b_j \mid C). $$
Since the categories of the objects are given, the denominator in the above equations is simply the number of ways to partition N distinguishable objects into C distinguishable non-empty bins, i.e.,

$$ \sum_{M' \in \mathcal{M}} \mathcal{N}(M') = S(N, C) \times C!, $$

where S denotes a Stirling number of the second kind.
Next, it is shown how $\mathcal{N}(n, a, b \mid C)$ can be calculated. As mentioned, this is the number of ways to cluster the given N objects into exactly C clusters so that a given cell contains n elements and there are a and b elements in its corresponding row and column respectively. Specifically,

$$ \mathcal{N}(n, a, b \mid C) = \binom{a}{n} \binom{N-a}{b-n}\, S(N-b,\, C-1)\,(C-1)!, $$

since the n cell elements can be chosen from the a row elements in $\binom{a}{n}$ ways, the remaining b−n elements of the column can be chosen from the other N−a objects in $\binom{N-a}{b-n}$ ways, and the remaining N−b objects can then be partitioned into the other C−1 distinguishable non-empty clusters in $S(N-b, C-1)(C-1)!$ ways.
In addition, substituting this into the above, the terms inside the summation are independent of j, and hence the summation over j can be removed and the whole expression multiplied by C. Thus,

$$ E[\hat{I}(M) \mid a, C] = \frac{C}{S(N, C)\, C!} \sum_{i=1}^{R} \sum_{b=1}^{N-C+1} \sum_{n=(a_i+b-N)^+}^{\min(a_i,\, b)} \frac{n}{N} \log\!\left(\frac{N\, n}{a_i\, b}\right) \binom{a_i}{n} \binom{N-a_i}{b-n}\, S(N-b,\, C-1)\,(C-1)!. $$
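Under the reconstruction above, EMI* can be computed exactly for modest N, with Stirling numbers of the second kind obtained from the standard recurrence. A sketch follows; note that the factor C·(C−1)!/C! cancels to 1, but the code keeps the terms separate to mirror the expression above.

```python
from functools import lru_cache
from math import comb, factorial, log

@lru_cache(maxsize=None)
def stirling2(n, k):
    # Stirling number of the second kind: S(n, k) = k*S(n-1, k) + S(n-1, k-1).
    if n == 0 and k == 0:
        return 1
    if n == 0 or k == 0 or k > n:
        return 0
    return k * stirling2(n - 1, k) + stirling2(n - 1, k - 1)

def emi_star(a, N, C):
    # Expected MI of a random clustering of N objects into exactly C labeled,
    # non-empty clusters, given true category sizes a = (a_1, ..., a_R).
    total = stirling2(N, C) * factorial(C)  # number of such clusterings
    e = 0.0
    for ai in a:
        for b in range(1, N - C + 2):  # 1 <= b <= N - (C - 1)
            rest = stirling2(N - b, C - 1) * factorial(C - 1)
            for n in range(max(1, ai + b - N), min(ai, b) + 1):  # n = 0 term vanishes
                ways = comb(ai, n) * comb(N - ai, b - n) * rest
                e += C * (n / N) * log(N * n / (ai * b)) * ways / total
    return e
```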
Once EMI* has been calculated, the adjusted mutual information can be calculated as

$$ \mathrm{AMI}^* = \frac{\hat{I}(M) - E[\hat{I}(M) \mid a, C]}{\kappa}, $$

where κ is a normalization constant which depends only on the predetermined categorization of images. Using one such choice for κ, namely the entropy of the predetermined categorization, $\kappa = H(U) = -\sum_{i=1}^{R} \frac{a_i}{N} \log \frac{a_i}{N}$, which upper-bounds $\hat{I}(M)$, we have

$$ \mathrm{AMI}^* = \frac{\hat{I}(M) - E[\hat{I}(M) \mid a, C]}{H(U)}. $$
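Putting the sketched helpers together, the clustering quality score of step 406 might be computed as follows, with κ = H(U) as assumed above; a higher AMI* indicates a procedure that beats chance by a wider margin at reproducing the predetermined categorization.

```python
import numpy as np

def ami_star(M):
    # Adjusted MI for the clustering summarized by contingency table M,
    # normalized by kappa = H(U) (one possible choice, assumed here).
    M = np.asarray(M)
    N = int(M.sum())
    a = M.sum(axis=1)
    kappa = entropy(a / N)  # depends only on the predetermined categorization
    penalty = emi_star(tuple(int(x) for x in a), N, M.shape[1])
    return (mi_estimate(M) - penalty) / kappa
```

Step 407 then reduces to taking the candidate with the largest such score, e.g. `max(candidates, key=lambda c: ami_star(contingency_table(true_labels, c[1])))`.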
By comparing a set of clustering procedures against a fixed “true” categorization of images to generate respective scores for each clustering procedure, it is ordinarily possible to automatically choose a clustering procedure which will group images in a manner best approximating a grouping performed manually by a user.
As mentioned above, while the above process has been described with respect to images for purposes of conciseness, it should be understood that other embodiments could also operate on other objects. For example, other embodiments could be directed to selecting a clustering procedure for clustering audio files, or moving image files.
This disclosure has provided a detailed description with respect to particular representative embodiments. It is understood that the scope of the appended claims is not limited to the above-described embodiments and that various changes and modifications may be made without departing from the scope of the claims.