In machine learning (ML), there are two main types of learning approaches: supervised and unsupervised. Supervised learning involves training a supervised ML model on a labeled dataset comprising labeled data instances. Each labeled data instance includes values for the features (i.e., attributes/dimensions) of the dataset and a label, typically determined by a human, indicating the correct value that should be output/predicted by the supervised ML model upon being provided the data instance's feature values as inputs. This label can be a class/category in the case where the ML task is classification or a continuous number in the case where the ML task is regression. By training the supervised ML model in this manner, the supervised ML model can learn how the feature values of the labeled dataset's data instances map to their desired outputs/predictions. Once the training is complete, the supervised ML model can be applied to generate predictions for query data instances (i.e., data instances whose labels are unknown).
Unsupervised learning, on the other hand, does not make use of a labeled dataset and does not involve training a supervised ML model. Instead, with unsupervised learning, an unsupervised ML model is provided as input an unlabeled dataset comprising unlabeled data instances (i.e., data instances that do not have labels indicating what their correct outputs/predictions should be). The unsupervised ML model then makes inferences—or in other words, generates predictions—regarding the unlabeled data instances based on information gleaned from the inherent structure of that data. For example, one common type of unsupervised ML model is a clustering model that groups unlabeled data instances into clusters according to the data distribution of the unlabeled dataset.
In the context of supervised learning, it is possible to compute feature importance scores for the features in a labeled dataset L used to train a supervised ML model M, where the feature importance score for a given feature f in L signifies the importance or usefulness of f in generating correct predictions via the trained version of M. Upon computing these feature importance scores, they can be leveraged in various ways to improve the efficiency and effectiveness of the supervised learning process.
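One common way such feature importance scores are computed is by measuring how much a trained model's accuracy degrades when a feature's values are randomly permuted. The following minimal sketch is an assumption-laden illustration, not part of this disclosure: the toy dataset is synthetic and a 1-nearest-neighbor predictor stands in for the supervised ML model M.

```python
# Purely illustrative sketch of feature importance in supervised learning,
# measured as the drop in accuracy when a feature's values are permuted.
# The toy dataset and the 1-nearest-neighbor stand-in model are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def knn_predict(train_X, train_y, query_X):
    # Predict each query's label from its single nearest training instance.
    dists = np.linalg.norm(query_X[:, None, :] - train_X[None, :, :], axis=2)
    return train_y[np.argmin(dists, axis=1)]

# Toy labeled dataset: feature 0 determines the label, feature 1 is noise.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

baseline = np.mean(knn_predict(X, y, X) == y)  # 1.0: each point is its own neighbor
importances = []
for f in range(X.shape[1]):
    X_perm = X.copy()
    X_perm[:, f] = rng.permutation(X_perm[:, f])  # break feature f's link to y
    acc = np.mean(knn_predict(X, y, X_perm) == y)
    importances.append(baseline - acc)            # accuracy drop = importance

print(importances)  # feature 0's drop should dwarf feature 1's
```

In practice a library routine (e.g., a permutation importance utility) would typically be used in place of this hand-rolled loop.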
However, in the context of unsupervised learning, there is no analogous technique for calculating feature importance scores for the features in an unlabeled dataset, because such a dataset does not include labels and thus lacks a “ground truth” for determining the predictive importance/usefulness of each feature. As a result, there is currently no way to improve the efficiency and effectiveness of unsupervised learning using a feature-based metric that is similar to the feature importance metric available in supervised learning.
In the following description, for purposes of explanation, numerous examples and details are set forth in order to provide an understanding of various embodiments. It will be evident, however, to one skilled in the art that certain embodiments can be practiced without some of these details or can be practiced with modifications or equivalents thereof.
The present disclosure is directed to techniques for computing and using a new type of feature-based metric, referred to herein as “inter-feature influence,” for the features (e.g., f1, . . . , fn) in an unlabeled dataset X. Generally speaking, these techniques involve training a set of supervised ML models M1, . . . , Mn on labeled datasets that are derived from unlabeled dataset X, such that each supervised ML model Mi is trained to predict feature fi based on the other features in X. The techniques further involve computing an inter-feature influence score for each pair of features in X using the trained versions of supervised ML models M1, . . . , Mn, where the inter-feature influence score for a given feature pair (fi, fj) indicates the degree of influence feature fi has on feature fj (or in other words, how useful/important feature fi is in predicting feature fj).
With these techniques, the inter-feature influence scores computed for unlabeled dataset X can be leveraged in various ways that are similar to the use cases for feature importance scores in supervised learning. For example, in certain embodiments the inter-feature influence scores can be applied to perform dimensionality reduction on unlabeled dataset X, which means reducing the number of features (i.e., dimensions) in X from n to some lower number n−r. Among other things, this advantageously allows unlabeled dataset X to be used, in its compressed/reduced form, for unsupervised learning in environments that cannot efficiently operate on high dimensional datasets due to compute, memory, bandwidth, time, and/or other constraints.
The foregoing and other aspects of the present disclosure are described in further detail in the sections that follow.
Starting with step (1) (reference numeral 102), computer system 100 can receive an unlabeled dataset X that is composed of m unlabeled data instances d1, . . . , dm and n features f1, . . . , fn. Each unlabeled data instance di can be understood as a row of unlabeled dataset X and each feature fi can be understood as a column or dimension of X, such that each unlabeled data instance includes n feature values corresponding to features f1, . . . , fn. By way of example, Table 1 below illustrates a version of unlabeled dataset X in the scenario where X comprises three features (columns) “age,” “eye color,” and “hair color” and four unlabeled data instances (rows) with values for these features:
At step (2) (reference numeral 104), computer system 100 can construct, for each feature fi of unlabeled dataset X, a labeled dataset (Xi, yi) that incorporates the m unlabeled data instances of X, but (a) excludes feature fi from the feature set of each data instance in (Xi, yi) and (b) adds fi as the dataset's label column (i.e., y) (resulting in labeled, rather than unlabeled, data instances). Stated in a more formal manner, each labeled dataset constructed at step (2) can be defined as follows:
(Xi, yi)=(X\X[i], X[i]) for i=1, . . . , n
In the formulation above, X[i] is the i-th feature of unlabeled dataset X, Xi is the matrix of features in labeled dataset (Xi, yi), and yi is the column (or vector) of labels in labeled dataset (Xi, yi). In addition, the expression “a\b” indicates that b is excluded from a (and thus “X\X[i]” signifies the exclusion of feature i from unlabeled dataset X).
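As a concrete illustration of this construction, the short sketch below derives (Xi, yi) from a numerically encoded version of X; numpy and the sample values are assumptions made purely for illustration.

```python
# Illustrative sketch of step (2): derive labeled dataset (X_i, y_i) from
# unlabeled dataset X by dropping feature i and reusing it as the label column.
import numpy as np

def make_labeled_dataset(X, i):
    """Return (X \\ X[i], X[i]) per the formulation above."""
    y_i = X[:, i]                  # feature i becomes the label vector y_i
    X_i = np.delete(X, i, axis=1)  # the remaining features form the matrix X_i
    return X_i, y_i

# 4 instances, 3 numerically encoded features (invented values).
X = np.array([[23, 0, 1],
              [35, 1, 2],
              [29, 0, 0],
              [41, 2, 1]])

X_0, y_0 = make_labeled_dataset(X, 0)
print(X_0.shape, y_0.shape)  # (4, 2) (4,)
```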
Assuming the foregoing formulation is applied to the version of unlabeled dataset X shown in Table 1, the following are the labeled datasets that would be created for the features “age,” “hair color,” and “eye color” respectively:
Upon constructing the labeled datasets using unlabeled dataset X at step (2), computer system 100 can train a corresponding set of supervised ML models M1, . . . , Mn on those labeled datasets (step (3); reference numeral 106). Through this training, each supervised ML model Mi can be trained to predict the value of feature fi in unlabeled dataset X based on the values of the other features in X. For example, with respect to the version of unlabeled dataset X in Table 1, computer system 100 would train a first supervised ML model M1 on the labeled dataset shown in Table 2 (thereby training M1 to predict “age” based on the values for “hair color” and “eye color”); train a second supervised ML model M2 using the labeled dataset shown in Table 3 (thereby training M2 to predict “hair color” based on values for “age” and “eye color”); and train a third supervised ML model M3 using the labeled dataset shown in Table 4 (thereby training M3 to predict “eye color” based on values for “age” and “hair color”).
In this scenario, because “age” is a numerical feature, supervised ML model M1 will be a regressor model (i.e., an ML model configured to predict/output a numerical value). In contrast, because “hair color” and “eye color” are categorical features, supervised ML models M2 and M3 will be classifier models (i.e., ML models configured to predict/output categorical, or class, values).
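A hypothetical helper for this regressor-versus-classifier choice is sketched below: a feature whose values all parse as numbers is treated as numerical, and any other feature as categorical. The column names mirror Table 1, but the feature values themselves are invented for illustration.

```python
# Hypothetical sketch: pick a model kind per feature based on its value type.
def model_kind(values):
    # Treat the feature as numerical if every value parses as a float.
    try:
        for v in values:
            float(v)
        return "regressor"
    except (TypeError, ValueError):
        return "classifier"

features = {
    "age":        [23, 35, 29, 41],
    "hair color": ["brown", "black", "blond", "brown"],
    "eye color":  ["blue", "brown", "green", "blue"],
}
kinds = {name: model_kind(vals) for name, vals in features.items()}
print(kinds)  # "age" -> regressor; "hair color"/"eye color" -> classifier
```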
Once the ML model training at step (3) is complete, computer system 100 can compute, for each pair of features (fi, fj) in unlabeled dataset X, an inter-feature influence score using the trained version of supervised ML model Mj, where the inter-feature influence score for feature pair (fi, fj) indicates how useful or important fi is in predicting fj (step (4); reference numeral 108). For example, with respect to the version of unlabeled dataset X in Table 1, computer system 100 would compute inter-feature influence scores for: (1) feature pair (“age,” “hair color”) using the trained version of the supervised ML model for “hair color,” (2) feature pair (“age,” “eye color”) using the trained version of the supervised ML model for “eye color,” (3) feature pair (“hair color,” “age”) using the trained version of the supervised ML model for “age,” (4) feature pair (“hair color,” “eye color”) using the trained version of the supervised ML model for “eye color,” (5) feature pair (“eye color,” “age”) using the trained version of the supervised ML model for “age,” and (6) feature pair (“eye color,” “hair color”) using the trained version of the supervised ML model for “hair color.”
In one set of embodiments, the computation of the inter-feature influence score for each feature pair (fi, fj) can be carried out using a “random re-shuffling” approach that involves determining a first accuracy score for the trained version of supervised ML model Mj, randomly re-shuffling the values for feature fi in labeled dataset (Xj, yj) (resulting in a new labeled dataset (Xj, yj)′), re-training Mj on new labeled dataset (Xj, yj)′, determining a second accuracy score for the re-trained version of Mj, and computing the inter-feature influence score based on the first and second accuracy scores. This approach is described in further detail in Section (3) below. In other embodiments, the inter-feature influence score computation at step (4) can be carried out using other approaches that are similar to existing techniques for computing feature importance scores in supervised learning. This is possible because the inter-feature influence score for feature pair (fi, fj) is analogous to the feature importance score for feature fi in the context of labeled dataset (Xj, yj).
Finally, at step (5) (reference numeral 110), computer system 100 can apply the inter-feature influence scores computed at step (4) in order to carry out one or more further actions with respect to unlabeled dataset X. As one example, computer system 100 can provide the inter-feature influence scores as additional input features to one or more unsupervised ML models that operate on unlabeled dataset X. As another example, computer system 100 can use the inter-feature influence scores to reduce the number of features/dimensions in unlabeled dataset X and thereby compress it, without substantially affecting the dataset's inherent structure and data distribution. One method for implementing this dimensionality reduction is detailed in Section (4) below.
It should be appreciated that the high-level workflow described above is illustrative and various modifications are possible.
Starting with blocks 202 and 204, computer system 100 can receive unlabeled dataset X (comprising unlabeled data instances d1, . . . , dm and features f1, . . . , fn as mentioned above) and enter a first loop for each feature fi in X (where i=1, . . . , n).
Within this first loop, computer system 100 can construct a labeled dataset (Xi, yi) that incorporates the m unlabeled data instances in unlabeled dataset X, but excludes feature fi from each data instance and instead adds that feature as the label for the data instance (block 206). Computer system 100 can then train a supervised ML model Mi using labeled dataset (Xi, yi) (block 208), thereby enabling model Mi to predict the value of feature fi based on the values of features f1, . . . , fn\fi (i.e., the features of unlabeled dataset X excluding fi).
As noted previously, in scenarios where feature fi is categorical, model Mi will be a classifier model; conversely, in scenarios where feature fi is numerical, model Mi will be a regressor model. However, computer system 100 is not otherwise constrained in terms of the type of ML model that it uses to implement Mi. For example, if feature fi is categorical, Mi may be implemented using a random forest classifier, an adaptive boosting classifier, a gradient boosting classifier, etc. Similarly, if feature fi is numerical, Mi may be implemented using a random forest regressor, an adaptive boosting regressor, and so on. In certain embodiments, computer system 100 may employ different types of classifier/regressor models for different features of X (e.g., a random forest classifier for feature f1, an adaptive boosting classifier for feature f2, etc.).
Upon training supervised ML model Mi at block 208, computer system 100 can reach the end of the current loop iteration (block 210) and return to the top of the loop to process the next feature. Once all of the features have been processed and corresponding supervised ML models M1, . . . , Mn have been trained, computer system 100 can enter a second loop for each feature fi in X (block 212) and, within this second loop, enter a third loop for each feature fj in X that is not fi (in other words, the set of features in X that exclude fi) (block 214).
Within this third loop, computer system 100 can compute the inter-feature influence score for feature pair (fi, fj) using the random re-shuffling approach mentioned earlier. In particular, at blocks 216 and 218, computer system 100 can provide one or more query data instances as input to the trained version of supervised ML model Mj and determine a first accuracy score for the trained version of Mj based on the resulting predictions. In a particular embodiment, this accuracy score can be computed as the number of correct predictions made by Mj divided by the total number of predictions.
At block 220, computer system 100 can randomly re-shuffle the values for feature fi in labeled dataset (Xj, yj)—such that the values in the column of (Xj,yj) corresponding to fi are randomly switched among the labeled data instances of (Xj, yj)—resulting in a new labeled dataset (Xj, yj)′. Computer system 100 can thereafter re-train supervised ML model Mj using (Xj, yj)′ (block 222), provide the same query data instances from block 216 as input to the re-trained version of supervised ML model Mj (block 224), and determine a second accuracy score for the re-trained version of Mj based on the resulting predictions (block 226).
Then, at block 228, computer system 100 can compute an inter-feature influence score for feature pair (fi, fj) based on the first and second accuracy scores determined at blocks 218 and 226 respectively. In a particular embodiment, the computed inter-feature influence score can be proportional to the degree of divergence between the first and second accuracy scores, such that a relatively high degree of divergence between the two accuracy scores corresponds to a relatively high inter-feature influence score and a relatively low degree of divergence between the two accuracy scores corresponds to a relatively low inter-feature influence score. This is because a high degree of divergence indicates that feature fi has a strong influence on predicting feature fj and conversely a low degree of divergence indicates that feature fi does not have a strong influence on predicting feature fj.
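Blocks 216-228 can be sketched end to end as follows. The sketch makes several assumptions for illustration only: a 1-nearest-neighbor predictor stands in for supervised ML model Mj, accuracy is computed as correct predictions divided by total predictions (per the embodiment described at block 218), and the data is synthetic, with column 0 (playing the role of feature fi) strongly determining the label so that its re-shuffling should yield a high influence score.

```python
# End-to-end sketch of the random re-shuffling score (blocks 216-228),
# with a 1-nearest-neighbor predictor standing in for model M_j.
import numpy as np

rng = np.random.default_rng(1)

def knn_predict(train_X, train_y, query_X):
    d = np.linalg.norm(query_X[:, None, :] - train_X[None, :, :], axis=2)
    return train_y[np.argmin(d, axis=1)]

def accuracy(train_X, train_y, query_X, query_y):
    # Block 218/226 embodiment: correct predictions / total predictions.
    return np.mean(knn_predict(train_X, train_y, query_X) == query_y)

# Labeled dataset (X_j, y_j): y_j follows column 0; column 1 is noise.
X_j = rng.normal(size=(300, 2))
y_j = (X_j[:, 0] > 0).astype(int)
train, query = slice(0, 200), slice(200, 300)

# Blocks 216/218: first accuracy score for the trained version of M_j.
acc1 = accuracy(X_j[train], y_j[train], X_j[query], y_j[query])

# Block 220: re-shuffle feature f_i (column 0), yielding (X_j, y_j)'.
X_shuf = X_j.copy()
X_shuf[:, 0] = rng.permutation(X_shuf[:, 0])

# Blocks 222-228: "re-train" on (X_j, y_j)', score the same query instances,
# and take the divergence between the two accuracies as the influence score.
acc2 = accuracy(X_shuf[train], y_j[train], X_j[query], y_j[query])
influence = acc1 - acc2
print(acc1, acc2, influence)
```

Because column 0 carries nearly all the predictive signal, the first accuracy should be high, the second close to chance, and the divergence (and hence the influence score) correspondingly large.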
Upon computing the inter-feature influence score at block 228, computer system 100 can reach the end of the current loop iteration for feature fj (block 230) and return to block 214 in order to process the next fj in X that is not fi. Further, upon processing all of the features in X that are not fi, computer system 100 can reach the end of the current loop iteration for feature fi (block 232) and return to block 212 in order to process the next fi in X. Finally, upon processing all of the features in X per blocks 212-232, the flowchart can end.
It should be appreciated that flowchart 200 is illustrative and various modifications are possible. For example, although flowchart 200 assumes that computer system 100 creates and trains n separate supervised ML models (one for each feature f1, . . . , fn) via the first loop starting at block 204, in some embodiments computer system 100 may create and train fewer than n models by, e.g., selecting a subset of features for model creation/training (using principal component analysis (PCA) or some other feature selection/ranking method) or by combining several features into a single feature (via a sum, sum of squares, or any other function).
Further, although flowchart 200 assumes that the labeled dataset (Xi, yi) constructed for each feature fi at block 206 includes all of the features of unlabeled dataset X other than fi (i.e., features f1, . . . , fn\fi), in some embodiments this may not be the case. Instead, computer system 100 may select a subset of those other features for inclusion in the labeled dataset (Xi, yi) based on one or more criteria (e.g., a correlation measure between those other features and feature fi, etc.).
Yet further, various modifications to the random re-shuffling process at blocks 216-228 are possible. For example, as an alternative to re-training supervised ML model Mj on new labeled dataset (Xj, yj)′ and providing the query data instances as input to the re-trained version of Mj at blocks 222 and 224, computer system 100 can provide new labeled dataset (Xj, yj)′ as input to the initial trained version of Mj and thus check the accuracy of that initial trained model against the randomly re-shuffled training data. In addition, rather than performing a single random re-shuffling of labeled dataset (Xj, yj), in some embodiments computer system 100 can perform several random re-shufflings of (Xj, yj)—thereby generating several “second accuracy scores” per block 226—and compute the inter-feature influence score for each feature pair (fi, fj) based on the first accuracy score and some aggregation (e.g., average) of the second accuracy scores.
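The two modifications above can be sketched together: the initial trained model is scored directly on re-shuffled data (no re-training), and the accuracy drop is averaged over several independent re-shufflings. As before, the synthetic data and the 1-nearest-neighbor stand-in for Mj are illustrative assumptions, not the disclosure's prescribed model.

```python
# Sketch of both variants: score the initial model on re-shuffled data and
# average the accuracy drop over several independent re-shufflings.
import numpy as np

rng = np.random.default_rng(3)

def knn_predict(train_X, train_y, query_X):
    d = np.linalg.norm(query_X[:, None, :] - train_X[None, :, :], axis=2)
    return train_y[np.argmin(d, axis=1)]

X_j = rng.normal(size=(200, 2))
y_j = (X_j[:, 0] > 0).astype(int)

drops = []
for _ in range(5):                      # five independent re-shufflings
    X_shuf = X_j.copy()
    X_shuf[:, 0] = rng.permutation(X_shuf[:, 0])
    acc = np.mean(knn_predict(X_j, y_j, X_shuf) == y_j)
    drops.append(1.0 - acc)  # baseline accuracy on unshuffled data is 1.0,
                             # since each instance is its own nearest neighbor
influence = float(np.mean(drops))
print(influence)
```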
Starting with block 302, computer system 100 can build a strongly-connected directed graph G comprising vertices v1, . . . , vn and edges e1, . . . , en(n-1) where (1) each vertex vi corresponds to a feature fi in unlabeled dataset X and (2) each edge from vertex vi to vertex vj is weighted with the inter-feature influence score computed for feature pair (fi, fj). For example, assume the following inter-feature influence scores are computed for the version of unlabeled dataset X shown in Table 1:
Given the scores above,
At block 304, computer system 100 can remove all of the edges in graph G whose weights are below a predefined score threshold t. Through this edge removal step, the computer system can effectively isolate subsets of vertices in G—and thus, features in X—whose members have a relatively strong influence on each other. For example,
Upon removing the edges per block 304, computer system 100 can select one vertex (or x vertices, where x is less than the subset total) from each subset of one or more vertices in graph G whose members remain strongly connected (i.e., connected to each other via incoming and outgoing edges) or that comprises exactly one vertex (block 306). With respect to
Finally, at block 308, computer system 100 can output a new unlabeled dataset X′ that includes all of the unlabeled data instances in unlabeled dataset X, but excludes the features that correspond to unselected vertices at block 306. Stated another way, each unlabeled data instance in new unlabeled dataset X′ can solely include feature values for the features selected at block 306. For example, if computer system 100 selects the “age” and “eye color” vertices from graph 400 in
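The graph-based reduction of blocks 302-308 can be sketched as follows. The 3×3 score matrix is invented for illustration (the actual scores for Table 1 are not reproduced here), strongly connected groups are found via a boolean transitive closure, and the first vertex of each group is kept as its representative.

```python
# Sketch of blocks 302-308 on an invented 3x3 inter-feature influence matrix.
import numpy as np

features = ["age", "hair color", "eye color"]
# scores[i][j] = influence of feature i on feature j (invented values).
scores = np.array([[0.0, 0.9, 0.2],
                   [0.8, 0.0, 0.1],
                   [0.3, 0.2, 0.0]])
t = 0.5  # predefined score threshold

# Blocks 302/304: adjacency after removing edges with weight below t.
adj = scores >= t
n = len(features)

# Mutual reachability via a boolean transitive closure (Floyd-Warshall style).
reach = adj.copy()
np.fill_diagonal(reach, True)
for k in range(n):
    for i in range(n):
        for j in range(n):
            reach[i, j] = reach[i, j] or (reach[i, k] and reach[k, j])

# Block 306: group mutually reachable vertices; keep one feature per group.
groups, seen = [], set()
for i in range(n):
    if i in seen:
        continue
    group = [j for j in range(n) if reach[i, j] and reach[j, i]]
    seen.update(group)
    groups.append(group)
selected = [features[g[0]] for g in groups]

# Block 308: X' would keep only the selected feature columns.
print(selected)  # -> ['age', 'eye color']
```

With these invented scores, "age" and "hair color" form one strongly connected group (so only "age" is kept), while "eye color" stands alone.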
In addition to the techniques described above, in some embodiments computer system 100 may construct one or more new features for unlabeled dataset X and augment X with those new feature(s) prior to computing inter-feature influence scores. For example, in the case where X will be used to perform anomaly detection, computer system 100 may employ an unsupervised anomaly detection model to generate anomaly scores for the unlabeled data instances in X and add the anomaly scores as a new feature/column to X.
The benefit of this approach is that it will cause computer system 100 to compute inter-feature influence scores for the new feature(s), which can help in capturing the importance of certain existing features in X that generally have a low influence on other existing features, but have high value for the purpose(s) reflected in the new feature(s) (e.g., anomaly detection). For example, a given feature fk may have low inter-feature influence scores with respect to other existing features in X, but may have a high inter-feature influence score with respect to a newly added anomaly score feature fa, thereby making clear that feature fk is important for anomaly detection.
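As a sketch of the augmentation step described above, the example below scores each unlabeled instance with a simple stand-in detector (distance from the feature-wise mean, in standard deviations) and appends the scores as a new column fa. A production system might instead use an isolation forest or another unsupervised anomaly detection model; the synthetic data and z-score detector here are assumptions made for illustration.

```python
# Sketch: augment unlabeled dataset X with an anomaly-score feature f_a.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))  # unlabeled dataset with 3 features
X[0] += 10.0                   # plant one obvious outlier in row 0

# z-score-based anomaly score per instance (stand-in for an unsupervised
# anomaly detection model).
z = (X - X.mean(axis=0)) / X.std(axis=0)
anomaly_scores = np.linalg.norm(z, axis=1)

X_aug = np.column_stack([X, anomaly_scores])  # X with new feature f_a
print(X_aug.shape)                     # one extra column
print(int(np.argmax(anomaly_scores)))  # the planted outlier scores highest
```

Inter-feature influence scores would then be computed over X_aug, so that each existing feature's influence on the new anomaly-score feature is captured.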
Certain embodiments described herein can employ various computer-implemented operations involving data stored in computer systems. For example, these operations can require physical manipulation of physical quantities—usually, though not necessarily, these quantities take the form of electrical or magnetic signals, where they (or representations of them) are capable of being stored, transferred, combined, compared, or otherwise manipulated. Such manipulations are often referred to in terms such as producing, identifying, determining, comparing, etc. Any operations described herein that form part of one or more embodiments can be useful machine operations.
Further, one or more embodiments can relate to a device or an apparatus for performing the foregoing operations. The apparatus can be specially constructed for specific required purposes, or it can be a generic computer system comprising one or more general purpose processors (e.g., Intel or AMD x86 processors) selectively activated or configured by program code stored in the computer system. In particular, various generic computer systems may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations. The various embodiments described herein can be practiced with other computer system configurations including handheld devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
Yet further, one or more embodiments can be implemented as one or more computer programs or as one or more computer program modules embodied in one or more non-transitory computer readable storage media. The term non-transitory computer readable storage medium refers to any data storage device that can store data which can thereafter be input to a computer system. The non-transitory computer readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer system. Examples of non-transitory computer readable media include a hard drive, network attached storage (NAS), read-only memory, random-access memory, flash-based nonvolatile memory (e.g., a flash memory card or a solid state disk), a CD (Compact Disc) (e.g., CD-ROM, CD-R, CD-RW, etc.), a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The non-transitory computer readable media can also be distributed over a network coupled computer system so that the computer readable code is stored and executed in a distributed fashion.
Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations can be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component can be implemented as separate components.
As used in the description herein and throughout the claims that follow, “a,” “an,” and “the” includes plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of “in” includes “in” and “on” unless the context clearly dictates otherwise.
The above description illustrates various embodiments along with examples of how aspects of particular embodiments may be implemented. These examples and embodiments should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of particular embodiments as defined by the following claims. Other arrangements, embodiments, implementations and equivalents can be employed without departing from the scope hereof as defined by the claims.