Anomaly detection is the task of distinguishing anomalies from normal data, typically with the use of a machine learning model. Anomaly detection has a variety of real-world applications, such as in manufacturing to detect faults in manufactured products; in financial analysis to monitor financial transactions for potentially fraudulent activity; and in healthcare data analysis to identify diseases or other harmful conditions in a patient. Anomaly detection is considered in multiple settings.
One scenario is a fully supervised setting, where labels are available for all samples, both normal and anomalous. This setting is typically addressed with specialized approaches for data imbalance, such as weighted loss functions or resampling methods. A special case of this fully supervised setting is where only labeled normal samples are available. One-class classifiers (OCCs), such as support vector machines (SVMs) or auto-encoders, and isolation-based detection, such as Isolation Forest, are approaches for this special case. Despite being widely studied, these scenarios impose a tedious labeling requirement in real-world applications.
Another scenario is an unsupervised setting, without any labeled data. Various methods have been proposed for this setting. While the labeling costs can be entirely eliminated, performance degradation is often significant compared to the supervised setting, limiting reliability in real-world applications.
Yet another scenario is a semi-supervised setting for anomaly detection that aims to achieve high performance with a limited amount of labeled data. Methods for the semi-supervised setting include those that focus on a positive-unlabeled setting and those that apply OCCs or adversarial training to semi-supervised learning while treating all unlabeled data as normal samples. Most semi-supervised learning methods assume that the labeled and unlabeled data come from the same distribution. More specifically, these methods assume that the labeled subset is drawn uniformly at random from the data. However, in practice, this assumption often does not hold, as distribution mismatch commonly occurs, with labeled and unlabeled data coming from different distributions.
Some methods consider distribution mismatch in a limited setting where only the label distributions are different, such as when the anomaly ratio is 10% for training but 50% for testing. However, more general real-world scenarios can commonly include positive and unlabeled (PU) or negative and unlabeled (NU) settings, where the distributions of the labeled samples (either positive or negative) and the unlabeled samples (both positive and negative) differ. Further, additional unlabeled data can be gathered after labeling, causing distribution shift. For example, manufacturing processes may keep evolving; thus, the corresponding defects can change, and the defect types at labeling can differ from the defect types in unlabeled data. In addition, for financial fraud detection and anti-money laundering applications, new anomalies can appear after the data labeling process, as the criminals themselves adapt. Lastly, human labelers are more confident about easy samples; thus, easy samples are more likely to be included in the labeled data and difficult samples are more likely to be included in the unlabeled data. For example, with some crowd-sourcing-based labeling tools, only the samples with some consensus on the labels, as a measure of confidence, are included in the labeled set.
Semi-supervised learning methods are sub-optimal for anomaly detection under distribution mismatch because they are developed with the assumption that labeled and unlabeled data come from the same distribution. Generated pseudo-labels are highly dependent on a small set of labeled data; thus, the trained semi-supervised models would be biased toward the labeled data distribution. Transfer learning methods or frameworks for distribution shifts may constitute alternatives by treating source/target data as labeled/unlabeled data. However, these alternatives have not been effective with a small number of labeled samples.
Aspects of the disclosure are directed to a semi-supervised anomaly detection framework to achieve high performance with a limited labeling budget. The semi-supervised anomaly detection framework yields robust performance even in the presence of distribution mismatch, such as when labeled and unlabeled data come from different distributions. An ensemble of one-class classifiers is used for pseudo-labeling to reduce dependence on a limited amount of labeled data. A predictor is trained with both a small amount of labeled data and pseudo-labeled samples. Partial distribution matching is utilized to automatically determine critical hyper-parameters for the pseudo-labeled samples.
An aspect of the disclosure provides for a method for anomaly detection. The method includes: receiving, by one or more processors, unlabeled data; determining, by the one or more processors, pseudo labels for the unlabeled data using a plurality of one-class classifiers; assigning, by the one or more processors, the pseudo labels to the unlabeled data to generate pseudo labeled data; and training, by the one or more processors, a machine learning model to detect network anomalies using the pseudo labeled data.
In an example, each of the one-class classifiers is trained with negatively labeled data and a disjoint subset of unlabeled data. In another example, determining the pseudo labels further includes determining a positive pseudo label when a threshold amount of the one-class classifiers agree to assign the positive pseudo label. In yet another example, determining the pseudo labels further includes determining a negative pseudo label when a threshold amount of the one-class classifiers agree to assign the negative pseudo label. In yet another example, determining the pseudo labels further includes determining an unlabeled label when a threshold amount of the one-class classifiers do not agree whether to assign positive or negative pseudo labels.
In yet another example, the method further includes receiving, by the one or more processors, labeled data. In yet another example, training the machine learning model further includes training the machine learning model to detect network anomalies using the labeled data.
In yet another example, determining the pseudo labels further includes matching a distribution of anomaly scores of positively labeled data to anomaly scores of the unlabeled data and estimating a positive marginal distribution. In yet another example, determining the pseudo labels further includes matching a distribution of anomaly scores of negatively labeled data to anomaly scores of the unlabeled data and estimating a negative marginal distribution.
In yet another example, training the machine learning model further comprises using binary cross entropy on the pseudo labeled data.
Another aspect of the disclosure provides for a system including: one or more processors; and one or more storage devices coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the one or more processors to perform operations for anomaly detection. The operations include: receiving unlabeled data; determining pseudo labels for the unlabeled data using a plurality of one-class classifiers; assigning the pseudo labels to the unlabeled data to generate pseudo labeled data; and training a machine learning model to detect network anomalies using the pseudo labeled data.
In an example, each of the one-class classifiers is trained with negatively labeled data and a disjoint subset of unlabeled data; and determining the pseudo labels further includes: determining a positive pseudo label when a threshold amount of the one-class classifiers agree to assign the positive pseudo label; determining a negative pseudo label when a threshold amount of the one-class classifiers agree to assign the negative pseudo label; and determining an unlabeled label when a threshold amount of the one-class classifiers do not agree whether to assign positive or negative pseudo labels.
In another example, the operations further include receiving labeled data; and training the machine learning model further includes training the machine learning model to detect network anomalies using the labeled data.
In yet another example, determining the pseudo labels further includes: matching a distribution of anomaly scores of positively labeled data to anomaly scores of the unlabeled data and estimating a positive marginal distribution; and matching a distribution of anomaly scores of negatively labeled data to anomaly scores of the unlabeled data and estimating a negative marginal distribution.
In yet another example, training the machine learning model further comprises using binary cross entropy on the pseudo labeled data.
Yet another aspect of the disclosure provides for a non-transitory computer readable medium for storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations for anomaly detection. The operations include: receiving unlabeled data; determining pseudo labels for the unlabeled data using a plurality of one-class classifiers; assigning the pseudo labels to the unlabeled data to generate pseudo labeled data; and training a machine learning model to detect network anomalies using the pseudo labeled data.
In an example, each of the one-class classifiers is trained with negatively labeled data and a disjoint subset of unlabeled data; and determining the pseudo labels further includes: determining a positive pseudo label when a threshold amount of the one-class classifiers agree to assign the positive pseudo label; determining a negative pseudo label when a threshold amount of the one-class classifiers agree to assign the negative pseudo label; and determining an unlabeled label when a threshold amount of the one-class classifiers do not agree whether to assign positive or negative pseudo labels.
In another example, the operations further include receiving labeled data; and training the machine learning model further includes training the machine learning model to detect network anomalies using the labeled data.
In yet another example, determining the pseudo labels further includes: matching a distribution of anomaly scores of positively labeled data to anomaly scores of the unlabeled data and estimating a positive marginal distribution; and matching a distribution of anomaly scores of negatively labeled data to anomaly scores of the unlabeled data and estimating a negative marginal distribution.
In yet another example, training the machine learning model further includes using binary cross entropy on the pseudo labeled data.
Generally disclosed herein are implementations for a semi-supervised anomaly detection framework, SPADE, that yields strong and robust performance even under distribution mismatch. SPADE introduces a pseudo-labeling mechanism using an ensemble of OCCs and a method for combining supervised and self-supervised learning. SPADE reduces the dependence on the labeled data as the predictors are trained with a small number of labeled and pseudo-labeled samples. SPADE includes using a partial matching method to pick hyperparameters without a validation set, which is advantageous as validation sets are often unavailable in real-world applications with limited labeled data.
SPADE significantly improves Area under the Curve (AUC) measurements in real-world scenarios, such as those utilizing tabular data or image data. SPADE also consistently outperforms existing methods in fraud detection with distribution shifts over time due to the adversarial nature of the real-world application.
For semi-supervised anomaly detection with distribution mismatch, consider given labeled training data $D^l = \{(x_i^l, y_i^l)\}_{i=1}^{N}$
SPADE aims to train a binary classifier for normal and anomalous data by iteratively learning from labeled and pseudo-labeled data. As such, SPADE includes a pseudo-labeler to assign binary labels to unlabeled data. Using a trained binary classifier for pseudo-labeling can be sub-optimal for anomaly detection with distribution shift as the decision boundaries of binary classifiers could be highly biased by the small amount of labeled data.
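As a minimal sketch of how an ensemble-based pseudo-labeler could be set up instead, the following Python example trains K one-class scorers, each on the negatively labeled data plus a disjoint subset of the unlabeled data, as in the examples above. The `CentroidOCC` class, the function names, and the distance-to-centroid anomaly score are illustrative assumptions; they stand in for whichever OCC is actually used and are not taken from the disclosure.

```python
import numpy as np


class CentroidOCC:
    """Toy one-class scorer (assumption): score = distance to the training centroid."""

    def fit(self, x):
        self.center_ = np.asarray(x, dtype=float).mean(axis=0)
        return self

    def score(self, x):
        # Higher score means more anomalous.
        return np.linalg.norm(np.asarray(x, dtype=float) - self.center_, axis=1)


def fit_occ_ensemble(x_neg_labeled, x_unlabeled, num_occs, seed=0):
    """Train K OCCs, each on the labeled negatives plus one disjoint unlabeled subset."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x_unlabeled))
    subsets = np.array_split(perm, num_occs)  # disjoint index sets covering all rows
    occs = []
    for idx in subsets:
        train = np.concatenate([x_neg_labeled, x_unlabeled[idx]], axis=0)
        occs.append(CentroidOCC().fit(train))
    return occs, subsets


def ensemble_scores(occs, x):
    """Return anomaly scores of shape (K, n), one row per OCC."""
    return np.stack([occ.score(x) for occ in occs], axis=0)
```

Each row of the returned score matrix can then be thresholded separately, so that a pseudo-label is assigned only when the ensemble members agree.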
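For example, positive pseudo-labels can be assigned to unlabeled data samples if all OCCs 602 agree on them: $v(h(x^u)) = 1$ if $\prod_{k=1}^{K} \hat{y}_k^{pu} = 1$, where

$$\hat{y}_k^{pu} = \begin{cases} 1 & \text{if } o_k(h(x^u)) > \eta_k^p \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

Similarly, a negative pseudo-label can be assigned if all OCCs 602 agree on them: $v(h(x^u)) = 0$ if $\prod_{k=1}^{K} \hat{y}_k^{nu} = 1$, where

$$\hat{y}_k^{nu} = \begin{cases} 1 & \text{if } o_k(h(x^u)) < \eta_k^n \\ 0 & \text{otherwise.} \end{cases} \tag{2}$$

Unlabeled data without consensus can be annotated as unknown: $v(h(x^u)) = -1$ if $\prod_{k=1}^{K} \hat{y}_k^{pu} \times \hat{y}_k^{nu} = 0$.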
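The consensus rule in EQs. (1) and (2) can be sketched in Python as follows. This is an illustration under stated assumptions: the function name and array layout are hypothetical, the per-OCC thresholds are assumed to satisfy ηn ≤ ηp so that a sample cannot receive both labels, and "unknown" is interpreted here as the case in which neither the positive nor the negative consensus holds.

```python
import numpy as np


def consensus_pseudo_labels(scores, eta_p, eta_n):
    """Assign pseudo-labels from an ensemble of OCC anomaly scores.

    scores: array of shape (K, n), one row of anomaly scores per OCC.
    eta_p, eta_n: arrays of shape (K,), per-OCC thresholds (assumes eta_n <= eta_p).
    Returns an int array of shape (n,): 1 (anomalous), 0 (normal), -1 (unknown).
    """
    scores = np.asarray(scores, dtype=float)
    eta_p = np.asarray(eta_p, dtype=float)[:, None]
    eta_n = np.asarray(eta_n, dtype=float)[:, None]

    votes_pos = scores > eta_p  # per-OCC positive votes, as in EQ. (1)
    votes_neg = scores < eta_n  # per-OCC negative votes, as in EQ. (2)

    labels = np.full(scores.shape[1], -1, dtype=int)  # default: no consensus
    labels[votes_pos.all(axis=0)] = 1  # every OCC votes positive
    labels[votes_neg.all(axis=0)] = 0  # every OCC votes negative
    return labels
```

Taking the product of binary votes over k is equivalent to requiring that all K votes equal one, which is why the sketch uses `all(axis=0)`.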
Thresholds ηp and ηn can correspond to parameters for converting the continuous values output from the OCCs 602 into binary values for determining the pseudo-label. These parameters can be determined without sacrificing labeled data for validation by adapting partial distribution matching 604. The partial distribution matching 604 can estimate a marginal distribution of unlabeled data by matching the distribution to a known one-class distribution, e.g., positive or negative. Essentially, normal samples can be closer to other normal samples and anomalous samples can be closer to other anomalous samples. The partial distribution matching 604 can match the distribution of anomaly scores of positively labeled data to that of unlabeled data to estimate their marginal distribution and determine ηp accordingly. Similarly, the partial distribution matching 604 can match the distribution of anomaly scores of negatively labeled data to that of unlabeled data to estimate their marginal distribution and determine ηn accordingly. Example formulations for ηp and ηn are below:
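$$\eta_k^p = \arg\min_{\eta} D_w\left(\{o_k(h(x^l)) \mid y^l = 1\},\ \{o_k(h(x^u)) \mid o_k(h(x^u)) > \eta\}\right) \tag{3}$$

$$\eta_k^n = \arg\min_{\eta} D_w\left(\{o_k(h(x^l)) \mid y^l = 0\},\ \{o_k(h(x^u)) \mid o_k(h(x^u)) < \eta\}\right) \tag{4}$$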
where Dw is a Wasserstein distance between two distributions. Subsets of the unlabeled data can be determined for pseudo-labeling whose Wasserstein distance from labeled data is a minimum.
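One way to approximate this threshold search is a simple grid search over candidate thresholds, sketched below in Python. The sketch assumes one-dimensional anomaly scores, uses the observed unlabeled scores as the candidate grid, and requires a minimum subset size (`min_count`); these choices, like the function names, are assumptions made for illustration rather than details from the disclosure.

```python
import numpy as np


def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two 1-D empirical samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    all_values = np.sort(np.concatenate([a, b]))
    deltas = np.diff(all_values)
    # Empirical CDFs evaluated on the merged support.
    cdf_a = np.searchsorted(a, all_values[:-1], side="right") / len(a)
    cdf_b = np.searchsorted(b, all_values[:-1], side="right") / len(b)
    return float(np.sum(np.abs(cdf_a - cdf_b) * deltas))


def match_threshold(labeled_scores, unlabeled_scores, side, min_count=2):
    """Grid-search the threshold whose unlabeled subset best matches the labeled scores.

    side="positive" keeps unlabeled scores above the threshold, as in EQ. (3);
    side="negative" keeps unlabeled scores below the threshold, as in EQ. (4).
    """
    labeled_scores = np.asarray(labeled_scores, dtype=float)
    unlabeled_scores = np.asarray(unlabeled_scores, dtype=float)
    best_eta, best_dist = None, np.inf
    for eta in np.unique(unlabeled_scores):
        if side == "positive":
            subset = unlabeled_scores[unlabeled_scores > eta]
        elif side == "negative":
            subset = unlabeled_scores[unlabeled_scores < eta]
        else:
            raise ValueError("side must be 'positive' or 'negative'")
        if len(subset) < min_count:
            continue  # too few samples to compare distributions
        dist = wasserstein_1d(labeled_scores, subset)
        if dist < best_dist:
            best_eta, best_dist = float(eta), dist
    return best_eta, best_dist
```

In an ensemble, the search would be repeated once per OCC to obtain a separate pair of thresholds for each k.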
In some semi-supervised settings, such as positive and unlabeled (PU) and negative and unlabeled (NU) settings, only one class of labeled samples is available. In these settings, Otsu's method can be employed to identify a threshold for the class without labeled samples. With Otsu's method, the threshold that minimizes intra-class anomaly score variances can be determined in an unsupervised way. For example, in a PU setting, ηp can be set using EQ. (3) and ηn can be set using Otsu's method.
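A compact histogram-based version of Otsu's method on anomaly scores might look like the following Python sketch. It relies on the standard equivalence between minimizing the intra-class variance and maximizing the between-class variance; the function name and the number of bins are illustrative assumptions.

```python
import numpy as np


def otsu_threshold(scores, num_bins=64):
    """Return a threshold that splits 1-D scores into two classes (Otsu's method)."""
    scores = np.asarray(scores, dtype=float)
    hist, edges = np.histogram(scores, bins=num_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weights = hist / hist.sum()

    w0 = np.cumsum(weights)            # mass of the lower class for each split
    w1 = 1.0 - w0                      # mass of the upper class
    mu_cum = np.cumsum(weights * centers)
    mu_total = mu_cum[-1]

    valid = (w0 > 0) & (w1 > 0)
    valid[-1] = False                  # the last split leaves the upper class empty
    between = np.zeros_like(w0)
    # Between-class variance; maximizing it minimizes the intra-class variance.
    between[valid] = (mu_total * w0[valid] - mu_cum[valid]) ** 2 / (w0[valid] * w1[valid])

    k = int(np.argmax(between))
    return float(edges[k + 1])         # upper edge of the last bin in the lower class
```

In a PU setting, for instance, such a function could be applied to the unlabeled anomaly scores of each OCC to obtain a candidate negative threshold.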
An anomaly detection model q(h(⋅)), such as the predictor 504, can be trained using loss functions, such as binary cross entropy (BCE) on labeled data, BCE on pseudo-labeled data, and self-supervised loss on all data. A self-supervised module g, such as the decoder 502 for reconstruction loss or the projection head 508 for contrastive loss, can be jointly trained with an auxiliary self-supervised loss.
For example, the BCE loss on the labeled data can be formulated as LY
To improve the quality of the encoder 502, auxiliary self-supervised losses can be utilized with various pretext tasks depending on the real-world application domain. For example, the auxiliary self-supervised losses can include a reconstruction objective, such as $L_R = \mathbb{E}[L_{MSE}(x, g(h(x)))]$, or objectives more specific to the data type, such as contrastive learning for image data.
Overall, the encoder 502 (h), predictor 504 (q), and self-supervised module 508 (g) can be trained by solving the following optimization problem:
h*,g*,q*=arg arg[LY
where α and β are hyperparameters. Training loss can be used as the convergence criterion. For example, if the training loss has converged, e.g., no improvement is observed in the loss for at least 5 epochs, it can be determined that the models have converged as well. The pseudo-labeler 506 can also converge during training.
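The following Python sketch illustrates how the three loss terms and the patience-based convergence check could be combined. Because the exact objective is not fully reproduced above, the weighting used here (α on the pseudo-label BCE term and β on the reconstruction term) is an assumption, as are the function names and the convention that a pseudo-label of -1 marks an unknown sample that is excluded from the pseudo-label loss.

```python
import numpy as np


def bce(probs, targets, eps=1e-7):
    """Mean binary cross entropy between predicted probabilities and 0/1 targets."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean(-(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))))


def combined_loss(p_labeled, y_labeled, p_unlabeled, pseudo_labels, x, x_recon, alpha, beta):
    """Assumed form: BCE(labeled) + alpha * BCE(pseudo-labeled) + beta * MSE reconstruction."""
    pseudo_labels = np.asarray(pseudo_labels)
    known = pseudo_labels >= 0  # drop samples annotated as unknown (-1)
    loss_labeled = bce(p_labeled, y_labeled)
    loss_pseudo = bce(np.asarray(p_unlabeled)[known], pseudo_labels[known]) if known.any() else 0.0
    loss_recon = float(np.mean((np.asarray(x, dtype=float) - np.asarray(x_recon, dtype=float)) ** 2))
    return loss_labeled + alpha * loss_pseudo + beta * loss_recon


def has_converged(loss_history, patience=5):
    """True if the last `patience` epochs did not improve on the best earlier loss."""
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    return min(loss_history[-patience:]) >= best_before
```

A training loop would append the per-epoch value of `combined_loss` to a list and stop once `has_converged` returns true.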
The benefits of SPADE can be highlighted in various practical settings involving semi-supervised learning with distribution mismatch. To illustrate the benefits, multiple anomaly detection datasets can be considered for image and tabular data types, such as MVTec anomaly detection and Magnetic tile datasets for image data and Covertype, Thyroid, and Drug datasets for tabular data. Further, fraud detection datasets, such as Kaggle credit and Xente, can be utilized to illustrate the benefits of SPADE as well. The datasets can be divided into disjoint train and test data and the training data can be further divided into disjoint labeled and unlabeled data. The labeled and unlabeled data can come from different distributions. AUC can be used as the evaluation metric for SPADE.
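For reference, the AUC metric can be computed directly from anomaly scores and binary labels through the rank-sum (Mann-Whitney) formulation, as in the Python sketch below. The function name is hypothetical, and tied scores are handled by assigning average ranks.

```python
import numpy as np


def roc_auc(labels, scores):
    """Area under the ROC curve from binary labels (1 = anomalous) and anomaly scores."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = int(labels.sum())
    n_neg = int((~labels).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative sample")

    # Average 1-based rank for each distinct score value (handles ties).
    _, inverse, counts = np.unique(scores, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]

    # Mann-Whitney U statistic of the positive class, normalized to [0, 1].
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))
```

An AUC of 0.5 corresponds to scores that do not separate the two classes, while 1.0 corresponds to every anomalous sample scoring higher than every normal sample.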
Anomalies can evolve over time in many applications. For fraud detection, criminals might invent new fraudulent approaches to trick the existing systems. For manufacturing, a modified process might yield different defects that have never been encountered before. Therefore, labeled data can become outdated and newly gathered unlabeled data can come from different distributions. Datasets can be constructed with multiple anomaly types to simulate such scenarios. Among multiple anomaly types, subsets of the anomaly types and normal samples can be provided as labeled data and other anomaly types can only appear in unlabeled data. As depicted in the tables in
Each baseline has its own limitations. Supervised classifiers cannot utilize unlabeled data at all, and negative supervised classifiers suffer from contaminated labeled data for training the predictive model. OCC models are suboptimal as they cannot utilize the anomalous label information. Semi-supervised learning baselines suffer from distribution mismatch between labeled and unlabeled data. The domain adaptation baseline shows poor performance with a small number of source samples.
While some samples can be easier to label, other samples can be misleadingly difficult to label because they can appear differently from known cases. To simulate this scenario, datasets can be constructed where the labeled data only includes easy-to-label samples while the unlabeled data includes hard-to-label samples. Logistic regression can be trained using the entire training data and labeled samples can be gathered where confidence of the trained logistic regression outputs is larger than a certain threshold and the predictions are correct.
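The selection step can be sketched as follows in Python, assuming that the predicted probabilities of the anomalous class have already been obtained from a logistic regression trained on the entire training data. The function name and the 0.9 confidence threshold are hypothetical values chosen for illustration.

```python
import numpy as np


def select_easy_samples(probs, labels, confidence_threshold=0.9):
    """Split sample indices into easy (labeled) and hard (unlabeled) sets.

    probs: predicted probability of the anomalous class for each sample.
    labels: ground-truth binary labels (1 = anomalous).
    A sample is easy if the prediction is correct and its confidence exceeds the threshold.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels).astype(int)
    preds = (probs >= 0.5).astype(int)
    confidence = np.where(preds == 1, probs, 1.0 - probs)
    easy = (confidence > confidence_threshold) & (preds == labels)
    return np.flatnonzero(easy), np.flatnonzero(~easy)
```

The first returned index array would form the labeled set, and the second would be treated as unlabeled data with its labels hidden.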
The easiness section of the bottom table in
With only positive samples as the labeled data and all other samples being unlabeled, i.e., the positive and unlabeled (PU) setting, the distributions of the labeled samples (positive only) and the unlabeled samples (both positive and negative) would be different. Datasets can be constructed with multiple anomaly types to simulate such scenarios. Among multiple anomaly types, subsets of the anomaly types can be provided as labeled data and other anomaly types can only appear in unlabeled data. Normal samples can be excluded from the labeled data to represent the PU setting. The table in
SPADE can also be evaluated with real-world fraud detection datasets: Kaggle credit card fraud, 0.17% anomaly ratio with 284807 total samples; and Xente fraud detection, 0.20% anomaly ratio with 95662 total samples. Here, anomalies can be evolving, e.g., their distributions change over time. To catch evolving anomalies, the anomaly detection model needs to be retrained based on labeling for new anomalies, which can be costly and time consuming. SPADE can improve anomaly detection performance using both labeled data and newly gathered data, even without additional labeling.
Training and test data can be split based on measurement time. The later samples can be included in the testing data, which can be about 50%, and earlier samples can be included in the training data, which can be about 50%. The training data can be further divided into labeled and unlabeled data. Earlier acquired data can be included in the labeled data, which can be about 5-20% while later acquired data can be included in the unlabeled data, which can be about 80-95%. AUC can be used as the anomaly detection metric. As shown in the table in
The server computing device 1104 can include one or more processors 1112 and memory 1114. The memory 1114 can store information accessible by the processors 1112, including instructions 1116 that can be executed by the processors 1112. The memory 1114 can also include data 1118 that can be retrieved, manipulated, or stored by the processors 1112. The memory 1114 can be a type of transitory or non-transitory computer readable medium capable of storing information accessible by the processors 1112, such as volatile and non-volatile memory. The processors 1112 can include one or more central processing units (CPUs), graphic processing units (GPUs), field-programmable gate arrays (FPGAs), and/or application-specific integrated circuits (ASICs), such as tensor processing units (TPUs).
The instructions 1116 can include one or more instructions that when executed by the processors 1112, cause the one or more processors 1112 to perform actions defined by the instructions 1116. The instructions 1116 can be stored in object code format for direct processing by the processors 1112, or in other formats including interpretable scripts or collections of independent source code modules that are interpreted on demand or compiled in advance. The instructions 1116 can include instructions for implementing processes consistent with aspects of this disclosure. Such processes can be executed using the processors 1112, and/or using other processors remotely located from the server computing device 1104.
The data 1118 can be retrieved, stored, or modified by the processors 1112 in accordance with the instructions 1116. The data 1118 can be stored in computer registers, in a relational or non-relational database as a table having a plurality of different fields and records, or as JSON, YAML, proto, or XML documents. The data 1118 can also be formatted in a computer-readable format such as, but not limited to, binary values, ASCII or Unicode. Moreover, the data 1118 can include information sufficient to identify relevant information, such as numbers, descriptive text, proprietary codes, pointers, references to data stored in other memories, including other network locations, or information that is used by a function to calculate relevant data.
The user computing device 1106 can also be configured similar to the server computing device 1104, with one or more processors 1120, memory 1122, instructions 1124, and data 1126. The user computing device 1106 can also include a user output 1128, and a user input 1130. The user input 1130 can include any appropriate mechanism or technique for receiving input from a user, such as keyboard, mouse, mechanical actuators, soft actuators, touchscreens, microphones, and sensors.
The server computing device 1104 can be configured to transmit data to the user computing device 1106, and the user computing device 1106 can be configured to display at least a portion of the received data on a display implemented as part of the user output 1128. The user output 1128 can also be used for displaying an interface between the user computing device 1106 and the server computing device 1104. The user output 1128 can alternatively or additionally include one or more speakers, transducers or other audio outputs, a haptic interface or other tactile feedback that provides non-visual and non-audible information to the platform user of the user computing device 1106.
Although
The server computing device 1104 can be configured to receive requests to process data from the user computing device 1106. For example, the environment 110 can be part of a computing platform configured to provide a variety of services to users, through various user interfaces and/or APIs exposing the platform services. One or more services can be a machine learning framework or a set of tools for generating neural networks or other machine learning models according to a specified task and training data. The user computing device 1106 may receive and transmit data specifying target computing resources to be allocated for executing a machine learning model trained to perform a particular machine learning task.
The computing devices 1104, 1106 can be capable of direct and indirect communication over the network 1110. The computing devices 1104, 1106 can set up listening sockets that may accept an initiating connection for sending and receiving information. The network 1110 itself can include various configurations and protocols including the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local networks, and private networks using communication protocols proprietary to one or more companies. The network 1110 can support a variety of short- and long-range connections. The short- and long-range connections may be made over different bandwidths, such as 2.402 GHz to 2.480 GHz (commonly associated with the Bluetooth® standard), 2.4 GHz and 5 GHz (commonly associated with the Wi-Fi® communication protocol); or with a variety of communication standards, such as the LTE® standard for wireless broadband communication. The network 1110, in addition or alternatively, can also support wired connections between the computing devices 1104, 1106, including over various types of Ethernet connection.
Although a single server computing device 1104 and user computing device 1106 are shown in
As such, generally disclosed herein are implementations for a framework, SPADE, which combines supervised and self-supervised learning using a pseudo-labeling mechanism with an ensemble of OCCs. Further, SPADE includes an approach to pick hyperparameters without a validation set, a crucial component for data-efficient anomaly detection. Overall, SPADE can consistently outperform alternatives in various scenarios. AUC improvements with SPADE can be up to 10.6% on tabular data and 3.6% on image data.
Aspects of this disclosure can be implemented in digital circuits, computer-readable storage media, as one or more computer programs, or a combination of one or more of the foregoing. The computer-readable storage media can be non-transitory, e.g., as one or more instructions executable by a cloud computing platform and stored on a tangible storage device.
In this specification the phrase “configured to” is used in different contexts related to computer systems, hardware, or part of a computer program, engine, or module. When a system is said to be configured to perform one or more operations, this means that the system has appropriate software, firmware, and/or hardware installed on the system that, when in operation, cause the system to perform the one or more operations. When some hardware is said to be configured to perform one or more operations, this means that the hardware includes one or more circuits that, when in operation, receive input and generate output according to the input and corresponding to the one or more operations. When a computer program, engine, or module is said to be configured to perform one or more operations, this means that the computer program includes one or more program instructions that, when executed by one or more computers, cause the one or more computers to perform the one or more operations.
While operations shown in the drawings and recited in the claims are shown in a particular order, it is understood that the operations can be performed in different orders than shown, and that some operations can be omitted, performed more than once, and/or be performed in parallel with other operations. Further, the separation of different system components configured for performing different operations should not be understood as requiring the components to be separated. The components, modules, programs, and engines described can be integrated together as a single system or be part of multiple systems. One or more processors in one or more locations implementing an example SPADE according to aspects of the disclosure can perform the operations shown in the drawings and recited in the claims.
Unless otherwise stated, the examples described herein are not mutually exclusive, but may be implemented in various combinations to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the description should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as clauses phrased as “such as,” “including” and the like, should not be interpreted as limiting the subject matter of the claims to the specific examples; rather, the examples are intended to illustrate only one of many possible embodiments. Further, the same reference numbers in different drawings can identify the same or similar elements.
The present application claims the benefit of the filing date of U.S. Provisional Patent Application No. 63/303,294, filed Jan. 26, 2022, the disclosure of which is hereby incorporated herein by reference.
| Number | Date | Country |
|---|---|---|
| 63303294 | Jan 2022 | US |