Clustering is the task of grouping a set of objects in such a way that objects in the same group (i.e., a cluster) are more similar to each other than to objects in other groups (i.e., other clusters). In a typical scenario, a user provides a clustering application with a plurality of objects that are to be clustered. The clustering application typically generates clusters from the plurality of objects in an unsupervised manner, where the clusters may be of interest to the user.
Features of the present disclosure are illustrated by way of example and not limited in the following figure(s), in which like numerals indicate like elements, in which:
For simplicity and illustrative purposes, the present disclosure is described by referring mainly to examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be readily apparent, however, that the present disclosure may be practiced without limitation to these specific details. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure.
Throughout the present disclosure, the terms “a” and “an” are intended to denote at least one of a particular element. As used herein, the term “includes” means includes but is not limited to, and the term “including” means including but not limited to. The term “based on” means based at least in part on.
In a clustering application that generates clusters in an unsupervised manner, the resulting clusters may not be useful to a user. For example, a clustering application may cluster documents related to boats by color (e.g., red, blue, etc.) based on the prevalence of color-related terms in the documents. However, the generated clusters may be irrelevant to an area of interest (e.g., sunken boats, boats run aground, etc.) of the user. In this regard, according to examples, an intent based clustering apparatus and a method for intent based clustering are disclosed herein to generate clusters that align with a user's expectations of the way that data should be organized. Beyond clustering the data, the apparatus and method disclosed herein also provide an interactive process for customization of clustering results to a user's particular needs and intentions. The clustering implemented by the apparatus and method disclosed herein further adds efficiency to the clustering process, thus reducing inefficiencies related to hardware utilization and reducing the processing time related to generation of the clusters.
Generally, the apparatus and method disclosed herein may provide for clustering of categorical data that is based on, and/or complements, previously defined clusters. Categorical data may be described as data with features whose values may be grouped based on categories. A category may represent a value of a feature of the categorical data. A feature may include a plurality of categories. For example, in the area of cyber-security, data may be grouped by features related to operating system, Internet Protocol (IP) address, packet size, etc. For the apparatus and method disclosed herein, categorical data may be clustered (e.g., to generate new clusters) so as to take into account already defined clusters (e.g., initial clusters). For the apparatus and method disclosed herein, the aspect of complementing previously defined clusters may provide, for example, determination of new and emerging problems that are being reported about a company's products. For example, a data analyst or domain expert may determine new and emerging problems with respect to a company's products.
For the apparatus and method disclosed herein, categorical data may be clustered based on a pre-defined order of the data. The pre-defined order may be based on Kullback-Leibler (KL) distances (i.e., relative entropy) between histograms of each feature in clustered data versus non-clustered data, user feedback, feature entropy, a number of categories in a feature, and/or random numbers.
For the apparatus and method disclosed herein, thresholds may be pre-defined constants, or a function of several parameters including the number of features in a cluster definition, the maximal distance between items in a cluster, the number of clusters that are already defined, the number of items being currently processed, and/or the number of items in each of the other clusters, optionally divided by the number of items that have been processed since the other clusters were determined.
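As one illustration of the parameterized-threshold option above, a threshold function might combine a constant floor with a data-dependent term. This is a minimal sketch under assumed parameter names and weights (base, scale); the disclosure does not prescribe a specific formula:

```python
def cluster_threshold(num_features_in_rule, num_items_processed,
                      base=10, scale=0.01):
    """One possible threshold: a constant floor (base) plus a term that
    grows with the number of items currently being processed and shrinks
    as a cluster's defining rule gains features, so that deeper clusters
    may accept smaller sub-clusters."""
    return base + scale * num_items_processed / (1 + num_features_in_rule)
```

For example, with no features yet in the rule the threshold is simply the floor, while processing more items raises it.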
According to an example, the apparatus disclosed herein may include a processor, and a memory storing machine readable instructions that when executed by the processor cause the processor to generate a plurality of initial clusters from objects, where the objects include features (e.g., attributes) that include categories. The machine readable instructions may further cause the processor to receive user feedback related to the initial clusters, order the features of the objects as a function of, for example, a histogram of values of each of the features based on the user feedback (and/or feature entropy, and/or a number of categories in the feature, and/or random numbers), and determine whether a number of samples of the objects for a category of the categories meets a specified criterion (e.g., exceeding a specified threshold). The specified threshold may be a function of, for example, the number of features in a cluster definition and/or the maximal distance between the items in a cluster. The machine readable instructions may further cause the processor to generate a plurality of new clusters based on an analysis of the categories of each of the features with respect to the order based on the determination of whether the number of samples of the objects for the category of the categories meets the specified criterion.
According to an example, the machine readable instructions to generate a plurality of initial clusters from objects may further include ordering the features of the objects based on an entropy of each of the features. According to an example, the machine readable instructions to generate a plurality of initial clusters from objects may further include recursively analyzing each of the categories of each of the features in order of increasing entropy of each of the features based on the determination of whether the number of samples of the objects for each of the categories of each of the features meets the specified criterion of exceeding a threshold. According to an example, the machine readable instructions to order the features of the objects as a function of a histogram of values of each of the features based on the user feedback may further include determining histograms of values obtained by each of the features in clustered objects and in non-clustered objects, and determining KL distances between the histograms.
Feature Entropy = −Sum over all categories i of (pi*log(pi))   Equation (1)
For Equation (1), pi represents the probability of a data value being in a particular category i. Generally, a feature has a higher entropy when its values are spread in relatively similar (i.e., balanced) amounts across its categories; for a balanced feature, the entropy also grows with the number of categories. For the example of
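Equation (1) can be sketched in code as follows; the function name and the list-of-labels input format are illustrative assumptions:

```python
import math

def feature_entropy(values):
    """Entropy of a categorical feature per Equation (1):
    -sum over all categories i of pi * log(pi)."""
    counts = {}
    for v in values:  # histogram of category labels
        counts[v] = counts.get(v, 0) + 1
    n = len(values)
    entropy = 0.0
    for count in counts.values():
        p = count / n  # probability pi of category i
        entropy -= p * math.log(p)
    return entropy
```

A balanced two-category feature scores log(2), while a skewed feature over the same categories scores lower.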
With respect to the feature ordering module 102, the user feedback may be either direct or indirect. For example, direct feedback may include explicit manipulation of a cluster by a user via the user input 108 (e.g., renaming or dividing a cluster), while indirect feedback may include actions from which intent is inferred, such as dragging items from one cluster to another.
A data partitioning module 110 is to take a block of the data 104 that fits into memory, and apply a recursive process as described with reference to the flowchart 200 of
In order to cluster the data 104, a clustering module 112 is to utilize the ordered features to cluster the data 104. Starting with the first feature, the clustering module 112 may determine whether there are additional features that may be used to cluster the data 104. For the flowchart 200 of
In response to a determination that there are additional features that may be used to cluster the data 104 (e.g., since there are greater than zero features, the output to block 204 is yes), the clustering module 112 may begin with the first (i.e., lowest entropy) feature (or take the next feature after the first feature has been analyzed) to cluster the data 104 for this first feature by all possible values of this feature. For example, referring to
At block 208, the clustering module 112 may determine whether there are additional categories for the feature evaluated at block 206.
In response to a determination that there are additional categories for the feature evaluated at block 206 (i.e., since there are greater than zero categories, the output to block 208 is yes), at block 210, the clustering module 112 may begin with the first category of the first feature. For the example of
At block 212, the clustering module 112 may determine whether a number of samples in the first category (e.g., category 302 for the example of
In response to a determination that a number of samples in the first category (e.g., category 302 for the example of
At block 206, in response to a determination that there are additional features that may be used to cluster the data 104, the clustering module 112 may continue with the next feature (e.g., feature #1 for the example of
In response to a determination at block 208 that there are more categories for the next feature (e.g., feature #1 for the example of
At block 212, the clustering module 112 may determine whether a number of samples in the first category (e.g., category 306 for the example of
In this manner, further categories for the next feature (e.g., feature #1, and then feature #2 for the example of
With respect to features #1-#3 for the example of
At block 218, in response to a determination that there are no additional categories (e.g., no additional categories for feature #3 for the example of
With the initial clusters 118 in a block of the data 104 being determined, the clustering module 112 may mark all data items that fit the initial clusters 118 as clustered, and all of the data items that do not fit the initial clusters 118 as residual (i.e., for adding the data items that do not fit the clusters to a residual 120). For the example of
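The recursive flow of blocks 204-220 described above can be sketched roughly as follows. The representation of items as feature-to-category dictionaries, and of each cluster as a tuple of (feature, category) pairs, are assumptions for illustration; the ignore-feature parameter g and residual handling are omitted for brevity:

```python
from collections import Counter

def recursive_cluster(items, features, threshold, rule=()):
    """Recursively partition items by feature categories.

    items: list of dicts mapping feature name -> category.
    features: feature names, pre-ordered (e.g., by increasing entropy).
    Returns a list of cluster-defining rules, each a tuple of
    (feature, category) pairs.
    """
    if not features:  # no more features: the current rule is a cluster
        return [rule] if rule else []
    feature, rest = features[0], features[1:]
    clusters = []
    counts = Counter(item[feature] for item in items)
    found_split = False
    for category, count in counts.items():
        if count > threshold:  # enough samples: descend into a sub-cluster
            found_split = True
            subset = [i for i in items if i[feature] == category]
            clusters.extend(recursive_cluster(
                subset, rest, threshold, rule + ((feature, category),)))
    if not found_split:
        # no category met the threshold for this feature; keep the
        # current (shallower) rule as a cluster, if one exists
        return [rule] if rule else []
    return clusters
```

Items that fall only into categories below the threshold match no rule and would be marked as residual, as described above.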
In certain cases, an “ignore feature” parameter g may be defined. This parameter may be used when feature weights entered by a user result in a feature order that yields clusters which are not sufficiently deep (i.e., clusters that do not include sufficient features). In such cases, if there is no new cluster for this feature, but g>0, block 220 may be bypassed, and processing may instead proceed to block 204. The cluster and the rule that defines the feature may remain unchanged, and the unhelpful feature may be ignored. In this case, the ignore feature parameter g may be decreased by 1.
The data partitioning module 110 may take a next block of the data 104, and the clustering module 112 may assign the data items to the known clusters (e.g., the clusters #1 to #6 for the example of
In this manner, the data partitioning module 110 and the clustering module 112 may continue to cluster blocks of data and add to the residual 120 until the residual 120 is larger than a specified size. For example, once the number of samples of the residual 120 (e.g., N(residual)) is larger than a specified size (e.g., N(block)*constant, where N(block) is the size of a block of data used by the data partitioning module 110, and constant is a specified number), the data for the residual 120 may be similarly clustered as described with reference to blocks 202-220. The order of processing may remain unchanged (e.g., copied from the previous order). However, the order of processing may also be changed, for example, because the block content is different and the entropies are to be re-determined, or because no new clusters have emerged in the last few blocks and the cluster propositions need to be refreshed. In this regard, the feature order (e.g., based on feature entropy) with respect to block 202 may be re-determined or used as previously determined. For the example of
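The block-and-residual loop might be sketched as follows, assuming hypothetical helpers `cluster_block` (the recursive clustering of blocks 202-220 applied to the residual) and `fits` (a test of whether an item matches a cluster definition); both names are illustrative:

```python
def process_blocks(blocks, block_size, residual_factor, cluster_block, fits):
    """Sketch of the block/residual loop: items matching a known cluster
    are absorbed; the rest accumulate in a residual, which is clustered
    once N(residual) > N(block) * constant (here, residual_factor)."""
    clusters = []
    residual = []
    for block in blocks:
        for item in block:
            if any(fits(item, c) for c in clusters):
                continue  # item joins a known cluster
            residual.append(item)  # otherwise it is residual
        if len(residual) > block_size * residual_factor:
            clusters.extend(cluster_block(residual))
            # drop residual items now covered by the new clusters
            residual = [i for i in residual
                        if not any(fits(i, c) for c in clusters)]
    return clusters, residual
```

For example, with items as plain labels, `fits` as equality, and `cluster_block` returning the distinct labels, two blocks of data yield the known clusters plus a small remaining residual.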
Once the initial clusters 118 are generated, a user interaction module 122 is to receive results from user interaction (i.e., the user input 108) with the proposed initial clusters 118. The user interaction module 122 may also receive input from user interaction, prior to generation of the initial clusters 118, or during generation of the initial clusters 118 for each block of the data 104. For example, a user may utilize the user interaction module 122 to modify alignment of the proposed clusters to the user's needs. For example, a user may obtain a cluster description, obtain items (e.g., data) from a cluster, rename a cluster, set an action with respect to a cluster (e.g., alert, don't show), delete a feature from cluster definition, merge clusters, divide clusters, define a new cluster from features, create a new cluster based on data from a cluster, submit a cluster of items to classify, assign items to a relevant cluster, delete a cluster, etc. Each of these interactions by a user may change the clustering version (e.g., from the initial cluster version, to a modified cluster version). The clusters that are modified based on the user interaction may be designated as user-approved clusters 124. For the example of
The clusters that are modified by a user may be used as input of block 202, and lead to a different KL score for each feature. For example, referring to block 202, the feature ordering module 102 may order the features 106 (e.g., the features #1 to #3 for the example of
KL Distance(P,Q) = Sum over all values i of (P(i)*log(P(i)/Q(i)))   Equation (2)
For Equation (2), P represents a histogram of the values that a feature obtains in clustered data, and Q represents a histogram of the values that the feature obtains in the residual 120. With respect to Equation (2), user manipulation of a cluster may change the cluster, and hence the samples that are in the cluster (i.e., the samples that represent the cluster). Thus, user manipulation of a cluster may change the histogram P; further, the Q histograms may change with each block of data. With respect to user manipulation (e.g., dividing a cluster), entropy calculation, and KL distance calculation, a function may be defined to combine the several numbers determined for each feature (e.g., several “vectors”) into one number per feature (e.g., one “vector”), and the features may then be ordered by the resulting numbers. For example, the features may be ordered as a function of user weights and KL distance; if two features receive the same number, ties may be broken by user weight, then by KL distance, and then by entropy. The different KL score for each feature may result in a different order of features in which the recursive clustering operates, and thus in a closer fit of the proposed new clusters 126 to the needs of the user.
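The KL-based feature ordering can be sketched as follows. The helper names and the smoothing constant eps are assumptions, the latter added so that categories absent from one histogram do not produce an undefined ratio:

```python
import math

def kl_distance(p_counts, q_counts, eps=1e-9):
    """KL distance between two histograms of a feature's values:
    sum over i of P(i) * log(P(i) / Q(i)), with eps smoothing."""
    cats = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) or 1
    q_total = sum(q_counts.values()) or 1
    d = 0.0
    for c in cats:
        p = p_counts.get(c, 0) / p_total + eps
        q = q_counts.get(c, 0) / q_total + eps
        d += p * math.log(p / q)
    return d

def order_features_by_kl(clustered, residual, features):
    """Order features in decreasing KL distance between their histograms
    in clustered data (P) and in non-clustered residual data (Q)."""
    def hist(items, f):
        h = {}
        for item in items:
            h[item[f]] = h.get(item[f], 0) + 1
        return h
    return sorted(features,
                  key=lambda f: kl_distance(hist(clustered, f),
                                            hist(residual, f)),
                  reverse=True)
```

A feature whose distribution differs sharply between clustered and residual data receives a high KL score and is considered first in the recursive clustering.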
For the example of
Referring to
The modules and other elements of the apparatus 100 may be machine readable instructions stored on a non-transitory computer readable medium. In this regard, the apparatus 100 may include or be a non-transitory computer readable medium. In addition, or alternatively, the modules and other elements of the apparatus 100 may be hardware or a combination of machine readable instructions and hardware.
Referring to
At block 504, the method may include determining a measure of each of the features. For example, as described herein with reference to
At block 506, the method may include ordering the features of the data based on the measure of each of the features. For example, as described herein with reference to
At block 508, the method may include determining whether a number of samples of the data for a category of the categories meets a specified criterion. For example, as described herein with reference to
At block 510, the method may include generating a plurality of clusters based on an analysis of the categories of each of the features with respect to the order of each of the features based on the determination of whether the number of samples of the data for the category of the categories meets the specified criterion. For example, as described herein with reference to
According to an example, for the method 500, generating a plurality of clusters based on an analysis of the categories of each of the features with respect to the order of each of the features based on the determination of whether the number of samples of the data for the category of the categories meets the specified criterion may further include recursively analyzing each of the categories of each of the features in order of increasing entropy of each of the features based on the determination of whether the number of samples of the data for each of the categories of each of the features meets the specified criterion of exceeding a threshold.
According to an example, for the method 500, the plurality of clusters may be designated as initial clusters, the method may further include receiving user feedback related to the initial clusters, and generating a plurality of new clusters based on the user feedback.
According to an example, for the method 500, generating a plurality of new clusters based on the user feedback may further include ordering the features as a function of histograms of values of each of the features.
According to an example, for the method 500, generating a plurality of new clusters based on the user feedback may further include determining histograms of values obtained by each of the features in clustered data and in non-clustered data, determining KL distances between the histograms, ordering the features based on the KL distances between each of the features, and generating the plurality of new clusters based on an analysis of the categories of each of the features with respect to the order based on KL distances based on the determination of whether the number of samples of the data for the category of the categories meets the specified criterion.
According to an example, the method 500 may further include processing blocks of the data to generate the plurality of new clusters, and combining respective new clusters of the plurality of new clusters that are generated based on the processing of the blocks of the data.
Referring to
At block 604, the method may include determining histograms of values obtained by each of the attributes in clustered data and in non-clustered data. For example, as described herein with reference to
At block 606, the method may include determining KL distances between the histograms. For example, as described herein with reference to
At block 608, the method may include ordering the attributes in decreasing order of the KL distances between a respective histogram of the histograms of the clustered data and a respective histogram of the histograms of the non-clustered data. For example, as described herein with reference to
At block 610, the method may include generating a plurality of new clusters based on an analysis of the categories of each of the attributes with respect to the order by determining whether a number of samples of the data for a category of the categories meets a specified criterion (e.g., exceeding a specified threshold). For example, as described herein with reference to
According to an example, for the method 600, generating a plurality of new clusters based on an analysis of the categories of each of the attributes with respect to the order by determining whether a number of samples of the data for a category of the categories exceeds a specified threshold may further include recursively analyzing each of the categories of each of the attributes, in the determined order of the attributes, by determining whether the number of samples of the data for each of the categories of each of the attributes exceeds the specified threshold.
According to an example, the method 600 may further include determining whether an attribute of the attributes blocks the determination of sub-clusters of one of the plurality of new clusters, and, in response to a determination that the attribute blocks the determination of sub-clusters of the one of the plurality of new clusters, continuing the analysis of other attributes with respect to the one of the plurality of new clusters, and omitting the attribute from the ordered attributes with respect to the one of the plurality of new clusters.
The computer system 700 may include a processor 702 that may implement or execute machine readable instructions performing some or all of the methods, functions and other processes described herein. Commands and data from the processor 702 may be communicated over a communication bus 704. The computer system may also include a main memory 706, such as a random access memory (RAM), where the machine readable instructions and data for the processor 702 may reside during runtime, and a secondary data storage 708, which may be non-volatile and stores machine readable instructions and data. The memory and data storage are examples of computer readable mediums. The memory 706 may include an intent based clustering module 720 including machine readable instructions residing in the memory 706 during runtime and executed by the processor 702. The intent based clustering module 720 may include the modules of the apparatus 100 shown in
The computer system 700 may include an I/O device 710, such as a keyboard, a mouse, a display, etc. The computer system may include a network interface 712 for connecting to a network. Other known electronic components may be added or substituted in the computer system.
What has been described and illustrated herein is an example along with some of its variations. The terms, descriptions and figures used herein are set forth by way of illustration only and are not meant as limitations. Many variations are possible within the spirit and scope of the subject matter, which is intended to be defined by the following claims—and their equivalents—in which all terms are meant in their broadest reasonable sense unless otherwise indicated.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2014/058837 | 10/2/2014 | WO | 00 |