In machine learning (ML), classification is the task of predicting, from among a plurality of predefined categories (i.e., classes), the class to which a given data instance belongs. An ML model that implements classification is referred to as an ML classifier. Examples of well-known types of supervised ML classifiers include random forest, adaptive boosting, and gradient boosting, and an example of a well-known type of unsupervised ML classifier is isolation forest.
ML classifiers are often employed in use cases/applications where high classification accuracy is important (e.g., identifying fraudulent financial transactions, network security monitoring, detecting faults in safety-critical systems, etc.). Accordingly, techniques that can improve the performance of ML classifiers are highly desirable.
In the following description, for purposes of explanation, numerous examples and details are set forth in order to provide an understanding of various embodiments. It will be evident, however, to one skilled in the art that certain embodiments can be practiced without some of these details or can be practiced with modifications or equivalents thereof.
Embodiments of the present disclosure are directed to techniques for augmenting the training data set for an ML classifier (e.g., M1) via metadata that is generated by another, different ML classifier (e.g., M2) at the time of classifying data instances in that data set. As used herein, “augmenting” the training data set refers to adding one or more additional features to each data instance in the training data set based on the metadata generated by ML classifier M2. Such metadata can include, e.g., the classification and associated confidence level output by ML classifier M2 for each data instance.
Once the training data set has been augmented as described above, the augmented training data set can be used to train ML classifier M1, thereby improving the performance of M1 by virtue of the additional features derived from ML classifier M2. In one set of embodiments, the entirety of the augmented training data set may be used to train ML classifier M1. In another set of embodiments, a subset of the augmented training data set may be selected according to one or more criteria and the selected subset may be used to train ML classifier M1, thus reducing the training time for M1.
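By way of illustration, the following listing is a minimal sketch of this augment-then-train flow, assuming the scikit-learn and NumPy libraries are available; the choice of RandomForestClassifier as M2, GradientBoostingClassifier as M1, the synthetic data set, and the helper name augment_and_train are illustrative assumptions rather than requirements of the embodiments described herein.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

def augment_and_train(X, y):
    # Train the auxiliary classifier M2 on the original training data set.
    m2 = RandomForestClassifier(n_estimators=100, random_state=0)
    m2.fit(X, y)
    # Classify the training data with M2 and collect per-instance metadata:
    # the predicted classification and its associated confidence level.
    proba = m2.predict_proba(X)
    metadata = np.column_stack([proba.argmax(axis=1), proba.max(axis=1)])
    # Augment each data instance's feature set with the metadata and train
    # the target classifier M1 on the augmented training data set.
    X_aug = np.hstack([X, metadata])
    m1 = GradientBoostingClassifier(random_state=0)
    m1.fit(X_aug, y)
    return m1, m2

# Illustrative usage with a synthetic training data set.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
m1, m2 = augment_and_train(X, y)

Note that, under this sketch, a new data instance to be classified by trained M1 would first be passed through trained M2 so that the same metadata features can be appended before classification.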
The foregoing and other aspects of the present disclosure are described in further detail below.
To provide context for the embodiments presented herein, consider first a conventional process 100 for training an ML classifier M1 using a training data set X, where each data instance i in X comprises a feature set xi and a corresponding label yi.
At step (1) of process 100 (reference numeral 106), training data set X is provided as input to ML classifier M1. At step (2) (reference numeral 108), ML classifier M1 is trained using training data set X. The details of this training will differ depending on the type of ML classifier M1, but in general the training entails configuring/building ML classifier M1 in a manner that enables the classifier to correctly predict label yi for each data instance i in training data set X. Upon completion of this training, a trained version of ML classifier M1 (reference numeral 110) is generated.
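For concreteness, the following listing is a minimal sketch of conventional process 100 under the assumption that scikit-learn is used; the synthetic data set and the choice of GradientBoostingClassifier for M1 are illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Training data set X (feature sets) and labels y; synthetic data for illustration.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

m1 = GradientBoostingClassifier(random_state=0)  # ML classifier M1
m1.fit(X, y)                                     # steps (1)-(2): train M1 directly on X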
While conventional training process 100 is functional, the resulting trained classifier can only learn from the original features present in training data set X. Enhanced training process 200, according to certain embodiments, addresses this by employing a second ML classifier M2 whose classification metadata is used to augment training data set X. At step (1) of process 200, training data set X is provided as input to ML classifier M2.
At step (2) (reference numeral 206), ML classifier M2 is trained using training data set X, resulting in a trained version of M2 (reference numeral 208). Training data set X is then provided as input to trained ML classifier M2 (step (3); reference numeral 210) and trained ML classifier M2 classifies the data instances in X (step (4); reference numeral 212), thereby generating metadata W comprising p metadata values w1 . . . wp for each data instance i in X arising out of the classification process (reference numeral 214).
For example, in one set of embodiments, metadata W can include the predicted classification and associated confidence level output by trained ML classifier M2 for each data instance. In other embodiments, metadata W can include other types of classification-related information, such as the full class distribution vector generated by trained ML classifier M2 (in the case where M2 is a random forest classifier), the number of trees in trained ML classifier M2 that voted for the predicted classification (in the case where M2 is a tree-based ensemble method classifier), and so on.
At step (5) (reference numeral 216), training data set X is augmented using metadata W, resulting in augmented training data set X′ (reference numeral 218). As shown, augmented training data set X′ includes, for feature set xi of each data instance i, an additional set of metadata values w1 . . . wp generated by trained ML classifier M2 at the time of classifying that data instance.
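The following listing sketches steps (2)-(5) of process 200 under the assumption that M2 is a scikit-learn random forest; the particular metadata values computed (predicted classification, confidence level, tree vote count, and full class distribution vector) mirror the examples above, and the synthetic data set and variable names are illustrative assumptions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

m2 = RandomForestClassifier(n_estimators=100, random_state=0)
m2.fit(X, y)                                     # step (2): train M2 on training data set X

proba = m2.predict_proba(X)                      # steps (3)-(4): classify X; full class distribution vectors
pred_idx = proba.argmax(axis=1)
predicted = m2.classes_[pred_idx]                # predicted classification per data instance
confidence = proba[np.arange(len(X)), pred_idx]  # confidence level of that classification

# Tree-based ensemble metadata: number of trees voting for the predicted class
# (individual trees in a scikit-learn forest predict encoded class indices).
tree_preds = np.stack([tree.predict(X) for tree in m2.estimators_], axis=1)
votes = (tree_preds == pred_idx.reshape(-1, 1)).sum(axis=1)

# Step (5): append metadata values w1 . . . wp to each data instance's feature set xi.
W = np.column_stack([predicted, confidence, votes])
X_prime = np.hstack([X, W, proba])               # augmented training data set X'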
With training process 200, at step (6) augmented training data set X′ is provided as input to ML classifier M1 and M1 is trained using X′, thereby generating a trained version of M1 whose performance is improved by virtue of the additional metadata features derived from ML classifier M2.
In some embodiments, rather than using the entirety of augmented training data set X′ to train ML classifier M1 per step (6) of process 200, X′ can be filtered and thus reduced in size from n data instances to q data instances, where q < n. The filtered version of augmented training data set X′ can then be provided as input to ML classifier M1 for training. This approach is depicted via alternative steps (6)-(8) (reference numerals 300-306) of process 300.
The particular criterion or criteria used for filtering augmented training data set X′ can vary depending on the implementation (a number of examples are presented in section (3) below). However, in certain embodiments this filtering step can remove a higher percentage of training data instances that were deemed “easy” by trained ML classifier M2 (i.e., those training data instances that M2 was able to classify with a high degree of confidence), while keeping a higher percentage of training data instances that were deemed “difficult” by trained ML classifier M2 (i.e., those training data instances that M2 could not classify with a high degree of confidence). By removing a larger number of easy data instances and consequently keeping a larger number of difficult data instances, the size of the training data set can be reduced without meaningfully affecting the accuracy of trained ML classifier M1.
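One possible filtering policy consistent with the foregoing is sketched below; the confidence threshold, the keep probabilities, and the helper name filter_augmented are illustrative assumptions rather than requirements of the embodiments described herein.

import numpy as np

def filter_augmented(X_prime, y, confidence, threshold=0.9,
                     keep_easy=0.1, keep_difficult=0.9, seed=0):
    # Instances classified by M2 with confidence >= threshold are "easy" and are
    # kept with low probability; low-confidence ("difficult") instances are kept
    # with high probability, reducing the data set from n to q < n instances.
    rng = np.random.default_rng(seed)
    keep_prob = np.where(confidence >= threshold, keep_easy, keep_difficult)
    mask = rng.random(len(X_prime)) < keep_prob
    return X_prime[mask], y[mask]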
It should be appreciated that processes 200 and 300 are provided for purposes of illustration, and that various modifications to these processes are possible.
Further, although processes 200 and 300 assume that training data sets X and X′ are labeled data sets (i.e., they include label column y) and thus ML classifiers M1 and M2 are supervised classifiers, in other embodiments one or both of M1 and M2 may be unsupervised classifiers (such as an isolation forest classifier). In these embodiments, training data set X and/or augmented training data set X′ may comprise unlabeled data instances. One of ordinary skill in the art will recognize other variations, modifications, and alternatives.
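As an illustration of the unsupervised case, the following sketch assumes that M2 is a scikit-learn IsolationForest whose anomaly score and outlier prediction serve as the metadata for each unlabeled data instance; the helper name augment_unlabeled is hypothetical.

import numpy as np
from sklearn.ensemble import IsolationForest

def augment_unlabeled(X):
    # Train the unsupervised classifier M2 on the unlabeled data instances.
    m2 = IsolationForest(random_state=0)
    m2.fit(X)
    # Metadata per instance: anomaly score (higher = more "normal") and the
    # predicted label (-1 = anomaly, 1 = normal) mapped to a 0/1 flag.
    anomaly_score = m2.decision_function(X)
    is_outlier = (m2.predict(X) == -1).astype(float)
    W = np.column_stack([anomaly_score, is_outlier])
    return np.hstack([X, W])                     # augmented (unlabeled) data set X'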
Workflow 400 depicts, according to certain embodiments, the high-level steps that may be performed for training an ML classifier using an augmented training data set. Starting with blocks 402 and 404 of workflow 400, a computing device/system can receive a training data set (e.g., training data set X of process 200) and can train a first ML classifier (e.g., ML classifier M2 of process 200) using the training data set, resulting in a trained first ML classifier.
At blocks 406 and 408, the computing device/system can provide the training data set as input to the trained first ML classifier and the trained first ML classifier can classify each data instance in the training data set. As part of block 408, the trained first ML classifier can generate metadata arising out of the classification of each data instance. This metadata can include, e.g., the predicted classification and confidence level for the data instance, the confidence levels of other classes that were not chosen as the predicted classification, etc.
At block 410, the computing device/system can augment the training data set to include the classification metadata generated at block 408. For example, for each data instance i in the training data set, the computing device/system can add one or more new features to the feature set of data instance i corresponding to the metadata generated for i.
Finally, at block 412, the computing device/system can train a second ML classifier (e.g., ML classifier M1 of process 200) using the augmented training data set, thereby generating a trained second ML classifier whose performance benefits from the additional metadata-derived features.
Turning now to workflow 500, which extends workflow 400 with a filtering step, its initial blocks (culminating in the creation of an augmented training data set at block 510) can proceed in a manner substantially similar to blocks 402-410 of workflow 400.
At block 512, the computing device/system can filter the data instances in the augmented training data set created at block 510, thereby generating a filtered (i.e., reduced) augmented training data set. In certain embodiments, this filtering step can involve randomly sampling data instances in the augmented training data set and, for each sampled data instance, determining whether the data instance meets one or more criteria; if the answer is yes, the data instance can have a higher probability of being added to the filtered data set, and if the answer is no, the data instance can have a higher probability of being discarded. This can continue until a target number or percentage of data instances have been added to the filtered data set (e.g., 10% of the total number of data instances).
One example criterion that may be applied to each data instance for the filtering at block 512 is whether the confidence level of the prediction generated by the first ML classifier is less than a confidence threshold; if so, that means the data instance was relatively difficult for the first ML classifier to classify and thus should have a high chance of being added to the filtered data set. Another example criterion is whether the variance of the per-class probabilities generated by the first ML classifier for the data instance (in the case where the first classifier is, e.g., a random forest classifier) is less than a variance threshold; because near-uniform per-class probabilities yield low variance, this also indicates that the data instance was relatively difficult for the first ML classifier to classify and thus should have a high chance of being added to the filtered data set.
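The sampling loop of block 512, combined with these two example criteria, might be sketched as follows; the thresholds, keep probabilities, 10% target, and helper name sample_filtered are illustrative assumptions rather than requirements of the embodiments described herein.

import numpy as np

def sample_filtered(X_prime, proba, target_frac=0.10, conf_threshold=0.9,
                    var_threshold=0.05, keep_if_difficult=0.9, keep_if_easy=0.1,
                    seed=0):
    # proba holds the first ML classifier's per-class probabilities for each
    # data instance in the augmented training data set X'.
    rng = np.random.default_rng(seed)
    n = len(X_prime)
    target = int(target_frac * n)
    confidence = proba.max(axis=1)
    variance = proba.var(axis=1)

    selected = []
    for i in rng.permutation(n):                 # randomly sample data instances
        difficult = confidence[i] < conf_threshold or variance[i] < var_threshold
        keep_prob = keep_if_difficult if difficult else keep_if_easy
        if rng.random() < keep_prob:
            selected.append(i)
        if len(selected) >= target:
            break
    return np.asarray(selected)                  # indices of the filtered data set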
Finally, upon filtering the augmented training data set, the computing device/system can train a second ML classifier (e.g., ML classifier M1 of process 200) using the filtered augmented training data set, thereby reducing the training time of the second ML classifier without meaningfully affecting its classification accuracy.
Certain embodiments described herein can employ various computer-implemented operations involving data stored in computer systems. For example, these operations can require physical manipulation of physical quantities—usually, though not necessarily, these quantities take the form of electrical or magnetic signals, where they (or representations of them) are capable of being stored, transferred, combined, compared, or otherwise manipulated. Such manipulations are often referred to in terms such as producing, identifying, determining, comparing, etc. Any operations described herein that form part of one or more embodiments can be useful machine operations.
Further, one or more embodiments can relate to a device or an apparatus for performing the foregoing operations. The apparatus can be specially constructed for specific required purposes, or it can be a generic computer system comprising one or more general purpose processors (e.g., Intel or AMD x86 processors) selectively activated or configured by program code stored in the computer system. In particular, various generic computer systems may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations. The various embodiments described herein can be practiced with other computer system configurations including handheld devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like.
Yet further, one or more embodiments can be implemented as one or more computer programs or as one or more computer program modules embodied in one or more non-transitory computer readable storage media. The term non-transitory computer readable storage medium refers to any data storage device that can store data which can thereafter be input to a computer system. The non-transitory computer readable media may be based on any existing or subsequently developed technology for embodying computer programs in a manner that enables them to be read by a computer system. Examples of non-transitory computer readable media include a hard drive, network attached storage (NAS), read-only memory, random-access memory, flash-based nonvolatile memory (e.g., a flash memory card or a solid state disk), a CD (Compact Disc) (e.g., CD-ROM, CD-R, CD-RW, etc.), a DVD (Digital Versatile Disc), a magnetic tape, and other optical and non-optical data storage devices. The non-transitory computer readable media can also be distributed over a network-coupled computer system so that the computer readable code is stored and executed in a distributed fashion.
Finally, boundaries between various components, operations, and data stores are somewhat arbitrary, and particular operations are illustrated in the context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within the scope of the invention(s). In general, structures and functionality presented as separate components in exemplary configurations can be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component can be implemented as separate components.
As used in the description herein and throughout the claims that follow, "a," "an," and "the" include plural references unless the context clearly dictates otherwise. Also, as used in the description herein and throughout the claims that follow, the meaning of "in" includes "in" and "on" unless the context clearly dictates otherwise.
The above description illustrates various embodiments along with examples of how aspects of particular embodiments may be implemented. These examples and embodiments should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of particular embodiments as defined by the following claims. Other arrangements, embodiments, implementations and equivalents can be employed without departing from the scope hereof as defined by the claims.