The present disclosure relates generally to machine learning models and neural networks, and more specifically, to out-of-distribution detection of data using ellipsoidal data description with pre-trained models.
Artificial intelligence, implemented with neural networks and deep learning models, has been widely applied to automatically analyze real-world information with or approaching human-like accuracy. Specifically, a machine learning model may receive input data, e.g., a natural language question, an image, etc., and classify the input data as one of a set of pre-defined classes. This process is referred to as classification. Machine learning models may perform well when the training and testing data are sampled from the same distribution, e.g., when the training data and testing data largely fall within the scope of the same set of pre-defined classes. However, real-world applications of the models can involve different datasets, which may exhibit a different distribution than that of the training datasets, and machine learning models that are pre-trained with in-distribution training datasets only may fail in such applications. Thus, the detection of out-of-distribution data is an important component of the deployment of AI in real-world scenarios.
In the figures and appendix, elements having the same designations have the same or similar functions.
As used herein, the term “network” may comprise any hardware or software-based framework that includes any artificial intelligence network or system, neural network or system and/or any training or learning models implemented thereon or therewith.
As used herein, the term “module” may comprise a hardware or software-based framework that performs one or more functions. In some embodiments, the module may be implemented on one or more neural networks.
Artificial intelligence, implemented with neural networks and deep learning models, has demonstrated great promise as a technique for automatically analyzing real-world information with or approaching human-like accuracy. In general, such neural network and deep learning models receive input information and make predictions based on the same. Whereas other approaches to analyzing real-world information may involve hard-coded processes, statistical analysis, and/or the like, neural networks learn to make predictions gradually, by a process of trial and error, using a machine learning process. A given neural network model may be trained using a large number of training examples, proceeding iteratively until the neural network model begins to consistently make similar inferences from the training examples as a human might make. Neural network models have been shown to outperform and/or have the potential to outperform other computing techniques in a number of applications.
Machine learning models may perform well when the training data and the testing data are sampled from the same distribution, i.e., when the same distribution is used to generate the training data and the testing data. In other words, machine learning models function well in static situations where the input distribution generating test data at testing time is the same as the training distribution generating the training data for model training. The real world, however, can be dynamic and complex, where data distributions shift over time, or even new categories of objects may appear at testing time after model training. In such cases, pre-trained neural networks may still classify inputs from unknown classes into known classes, incorrectly but with high confidence, potentially resulting in catastrophic failures during real-world deployments of the models. For example, a traffic sign classification model for an autonomous driving system may predict a high-speed-limit sign for a real-life scene that does not contain any such sign, with obvious potentially catastrophic consequences. As such, detecting such out-of-distribution (OOD) data, i.e., detecting or finding patterns in data that do not conform to expected behavior, can be an important aspect of the application of machine learning models in various real-life scenarios.
Previously developed approaches for OOD detection, however, can have several drawbacks. For example, some approaches may fail for OOD detection applications involving high-dimensional data (e.g., texts and images). Other approaches, for example approaches deploying unsupervised OOD methods, have been known to exhibit pathological behavior, such as having higher confidence on specific types of OOD data than on the in-distribution data. And yet other approaches may require additional OOD data as negative samples, limiting their application in several real-world scenarios. Further, because the algorithms of classifier-based approaches depend on properties of multi-class classifiers, such approaches may not be applicable to cases such as one-class classification, out-of-domain question detection for question answering, etc., where one cannot access label information.
To address these challenges, embodiments described herein provide an approach and/or framework that leverages the representation power of existing pre-trained models or neural networks for OOD detection. Specifically, a hyper-ellipsoid structure in the feature space is utilized to distinguish in-distribution data from OOD data. The hyper-ellipsoid based approach may be generalized to a wide range of classifiers, and to non-classification tasks such as but not limited to question answering. In addition, the approach may be applicable whether the pre-trained models are obtained under a supervised setting or an unsupervised setting. Further, unlike most existing systems, the hyper-ellipsoid based approach may not need specific OOD samples to pre-train a model, facilitating the approach's application to real world scenarios.
According to some embodiments, the systems and methods of the present disclosure employ an ellipsoid data description technique with a pre-trained model for OOD detection of a testing data, i.e., for determining whether the testing data could have been generated by the distribution that generated the training data used to train the pre-trained model. The ellipsoid data description technique may allow the determination of an optimal hyper-ellipsoidal space in feature space that may include at least a portion (e.g., most or all) of the data in the feature space, where the data in the feature space are mappings of the training data (e.g., as mapped via a feature map). In some embodiments, the testing data may be determined to not be an OOD data or to be an OOD data based on whether the mapping of the testing data in feature space is enclosed or is not enclosed within the optimal hyper-ellipsoidal space, respectively.
Memory 120 may be used to store software executed by computing device 100 and/or one or more data structures used during operation of computing device 100. Memory 120 may include one or more types of machine readable media. Some common forms of machine readable media may include floppy disk, flexible disk, hard disk, magnetic tape, any other magnetic medium, CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, FLASH-EPROM, any other memory chip or cartridge, and/or any other medium from which a processor or computer is adapted to read.
Processor 110 and/or memory 120 may be arranged in any suitable physical arrangement. In some embodiments, processor 110 and/or memory 120 may be implemented on a same board, in a same package (e.g., system-in-package), on a same chip (e.g., system-on-chip), and/or the like. In some embodiments, processor 110 and/or memory 120 may include distributed, virtualized, and/or containerized computing resources. Consistent with such embodiments, processor 110 and/or memory 120 may be located in one or more data centers and/or cloud computing facilities.
As shown, memory 120 includes an out-of-distribution (OOD) detection module 130 that may be used to implement and/or emulate the neural network systems and models described further herein and/or to implement any of the methods described further herein, such as but not limited to the method described with reference to
In some examples, memory 120 may include non-transitory, tangible, machine readable media that includes executable code that when run by one or more processors (e.g., processor 110) may cause the one or more processors to perform the methods described in further detail herein. In some examples, OOD detection module 130 may be implemented using hardware, software, and/or a combination of hardware and software. As shown, computing device 100 receives input 140, which is provided to OOD detection module 130, which then may generate output 150.
In some embodiments, the input 140 may include testing data and training data, or a collection of data including training data for training a neural network. The types of testing data and/or the types of the collection of data including the training data can be of any kind, including but not limited to images, texts, etc. The output 150 can include a determination of whether the testing data is an out-of-distribution data or not, i.e., whether the testing data is generated based on the same distribution used to generate the collection of data including the training data. In other examples, the output 150 may include a classification output from the pre-trained base model.
Referring to
In such embodiments, the OOD detection module may be tasked to determine whether a new data y is generated by the same distribution that generated the set of training data. In some cases, the OOD detection module may have access only to the pre-trained model f and in particular may not have knowledge of or access to data samples (e.g., such as outlier samples). For example, for non-classification tasks (e.g., question answering), the OOD detection module may not have access to adversarial outliers. In some embodiments, however, the OOD detection module may have access to outlier samples (e.g., adversarial outliers), such as in the cases of classification tasks (e.g., one-class classification, multi-class image/text classification, etc.).
At process 220, the OOD detection module may generate, via a processor, a feature map by combining mapping functions corresponding to the plurality of layers into a vector of mapping function elements. In some cases, the feature map may be generated using the feature before the softmax layer as the feature of the data, i.e., ϕ(x)=f(x), where ϕ(x) is the feature map. However, it may be desirable to generate the feature map using multiple layers of the plurality of layers, because outputs from different layers can provide information related to different levels of a feature. For example, for convolutional neural networks that may be used in the image domain, the bottom layer may provide texture information of a feature, while the top layer may provide shape information of the feature. As such, an improved feature map that can utilize the features from most or all of the different layers of the neural network may be generated.
In some embodiments, the vector of mapping function elements of the feature map may be generated at least in part by performing a function composition of two or more mapping functions of the plurality of mapping functions to form a mapping function element in the vector. For instance, the feature map may contain mapping function elements that are function compositions of functions corresponding to layers of the neural network. An example of such an improved feature map can include the features from different layers and be expressed as
ϕ(x)=[f1(x),f2∘f1(x), . . . ,fL∘fL-1∘ . . . ∘f1(x)] (Eq. 1)
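As an illustrative, non-limiting sketch of the feature map of Eq. 1, the following example builds toy layers and concatenates the output of every layer; the layer widths, the tanh nonlinearity, and the random weights are assumptions for illustration only, not part of the disclosed models:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_layer(d_in, d_out):
    # Each layer f_l is a toy affine map followed by tanh (illustrative only).
    W = rng.standard_normal((d_out, d_in)) * 0.1
    return lambda h: np.tanh(W @ h)

layers = [make_layer(8, 8) for _ in range(3)]  # stand-ins for f_1, f_2, f_3

def phi(x):
    """phi(x) = [f1(x), f2∘f1(x), ..., fL∘...∘f1(x)], concatenated (Eq. 1)."""
    feats, h = [], x
    for f in layers:
        h = f(h)           # compose one more layer
        feats.append(h)    # keep the intermediate feature
    return np.concatenate(feats)

x = rng.standard_normal(8)
print(phi(x).shape)  # 3 layers of width 8 concatenated: (24,)
```

The composition fl∘fl-1∘ . . . ∘f1 is accumulated incrementally in the loop, so each layer's feature is computed once and reused.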
At process 230, the OOD detection module may map, by the feature map (e.g., the improved feature map ϕ(x)), the set of training data to a set of feature space training data in a feature space. That is, the set of training data x may be mapped to the set of feature space training data as shown in the equation above by the feature map ϕ(x).
At process 240, the OOD detection module may identify, via the processor, a hyper-ellipsoid in the feature space enclosing the feature space training data based on the generated feature map. In some cases, one may employ one-class support vector machine (SVM) techniques to determine whether y is an OOD data, which may include finding a hyper-plane in feature space separating the in-distribution data from the out-of-distribution data. In some cases, one may employ support vector data description (SVDD) techniques to find a hypersphere in feature space that can separate the in-distribution data from the out-of-distribution data. In some cases, however, such hyper-surfaces may not be able to separate features provided by deep models. In some embodiments, a hyper-ellipsoid in feature space can be used to separate the in-distribution data from the out-of-distribution data, such hyper-ellipsoid defined by the expression or equation
∥ϕ(x)−c∥Σ2=(ϕ(x)−c)TΣ−1(ϕ(x)−c)≤R2 (Eq. 2)
where c is the center of the hyper-ellipsoid, Σ is a symmetric positive definite matrix that reflects the shape of the hyper-ellipsoid, and R reflects the volume of the hyper-ellipsoid. In some cases, ∥Σ∥=1, where the norm can be the operator norm or the Frobenius norm, which gives a definition of the hyper-ellipsoid with unique Σ and R. In some embodiments, the hyper-ellipsoid that separates in-distribution data from OOD data may be an optimal hyper-ellipsoid obtained from the expression

minR,c,Σ,ξR2+½∥Σ∥Fr2+(1/(νn))Σi=1nξi, subject to ∥ϕ(xi)−c∥Σ2≤R2+ξi, ξi≥0, i=1, . . . , n (Eq. 3)

In some cases, the regularization term ½∥Σ∥Fr2 may serve to constrain the complexity of Σ, ξi are slack variables that allow the margin to be soft, and ν∈(0,1] is a hyper-parameter that balances the penalties and the radius.
At process 250, the OOD detection module may receive a first test data sample outside the set of training data. For example, the OOD detection module may receive a data sample y that is not from the set of training data, for determining whether data sample y is generated by the same distribution that generated the set of training data.
At process 260, the OOD detection module may map the first test data sample into the feature space by the feature map. For example, the OOD detection module may apply the feature map of Eq. 1 to the new data sample y, i.e., ϕ(y)=[f1(y), f2∘f1(y), . . . , fL∘fL-1∘ . . . ∘f1(y)].
At process 270, the OOD detection module may determine, via the processor, that the first test data sample is OOD data when the mapped first test data sample in the feature space is outside the hyper-ellipsoid. For example, as discussed above, the OOD detection module may solve Eq. 3 to identify the optimal hyper-ellipsoid (for example, by determining R, c, and Σ that satisfy Eq. 3), and determine whether the new data sample y is OOD data or not by determining whether ϕ(y) is outside of or enclosed by the optimal hyper-ellipsoid. That is, in some embodiments, the OOD detection module may determine that the new data sample y (e.g., which is not from the set of training data) is an OOD data if ϕ(y) is not enclosed by the optimal hyper-ellipsoid, i.e., if ∥ϕ(y)−c∥Σ2>R2.
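As an illustrative, non-limiting sketch of the decision at process 270, the following example fits a toy center c and shape matrix Σ from synthetic features, sets the squared radius R2 from a quantile (a simple stand-in for solving the optimization), and flags a test feature as OOD when its squared Σ-norm distance exceeds R2; all values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
feats = rng.standard_normal((500, d))       # stand-in for the phi(x_i)
c = feats.mean(axis=0)                      # center of the ellipsoid
Sigma = np.cov(feats, rowvar=False)         # shape matrix estimate
Sigma_inv = np.linalg.inv(Sigma)

def sq_dist(v):
    # Squared Sigma-norm distance: (v - c)^T Sigma^{-1} (v - c)
    diff = v - c
    return float(diff @ Sigma_inv @ diff)

# Quantile-based radius: encloses ~95% of training features (toy choice).
R2 = np.quantile([sq_dist(f) for f in feats], 0.95)

def is_ood(phi_y):
    return sq_dist(phi_y) > R2

print(is_ood(c))            # the center is inside the ellipsoid: False
print(is_ood(c + 100.0))    # a far-away feature is flagged OOD: True
```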
In some embodiments, the set of training data includes a single class training dataset; and the pre-trained neural model is a classifier neural model pre-trained with the single class training dataset. In some embodiments, the set of training data includes a multi-class training dataset; and the pre-trained neural model is a classifier neural model pre-trained with the multi-class training dataset. In some embodiments, the set of training data includes an intent classification training dataset; and the pre-trained neural model is a classifier neural model pre-trained with the intent classification training dataset. In some embodiments, the set of training data includes a question answering training dataset; and the pre-trained neural model is a non-classifier neural model pre-trained with the question answering training dataset. In some embodiments, the pre-trained neural model is a non-classifier neural model pre-trained with data lacking an OOD sample. Some embodiments of method 200 further comprise determining one hyper-ellipsoid that has a smallest volume among a set of candidate hyper-ellipsoids that enclose the feature space training data in the feature space as the hyper-ellipsoid.
In some embodiments, solving Eq. 3 exactly to identify the optimal hyper-ellipsoid may be a computationally challenging or even intractable problem, because solving the equation may involve finding an optimal Σ of shape d×d, where d is the dimension of the feature and can have values up to tens or hundreds of thousands. In some embodiments, an efficient approximation scheme that renders Eq. 3 computationally tractable includes decomposing the feature space into several subspaces based on the features from different layers of the neural network. For example, Σ may be assumed to be a block diagonal matrix,

Σ̂=diag(Σ̂1/w1, Σ̂2/w2, . . . , Σ̂L/wL)

where Σ̂l reflects the shape of the feature distribution at layer l and wl is a layer-dependent constant, which may allow ∥ϕ(x)−c∥Σ̂2 to be decomposed as

∥ϕ(x)−c∥Σ̂2=Σl=1Lwl(fl∘fl-1∘ . . . ∘f1(x)−ĉl)TΣ̂l−1(fl∘fl-1∘ . . . ∘f1(x)−ĉl)

where ĉl is the center of the layer-l features. With these approximations, in some cases, the problem of solving Eq. 3 becomes at least substantially equivalent to the problem of finding the proper {wl}l=1L and the corresponding R and {ξi}i=1n, which is a low-dimensional optimization problem that scales only linearly with the number of layers L. In some embodiments, with the definitions w=[w1, w2, . . . , wL]T, Ml(xi)=(fl∘fl-1 . . . ∘f1(xi)−ĉl)TΣ̂l−1(fl∘fl-1 . . . ∘f1(xi)−ĉl), and M(x)=[M1(x), M2(x), . . . , ML(x)]T, the quadratic form ∥ϕ(x)−c∥Σ̂2 may be written compactly as

∥ϕ(x)−c∥Σ̂2=⟨w, M(x)⟩
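As an illustrative, non-limiting sketch, the layer-wise terms Ml(x) and the inner product ⟨w, M(x)⟩ discussed above may be computed as follows; the layer sizes, the identity-covariance stand-ins, and the uniform weights are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
L, d = 3, 5
centers = [rng.standard_normal(d) for _ in range(L)]   # stand-ins for c_hat_l
covs_inv = [np.eye(d) for _ in range(L)]               # stand-ins for Sigma_hat_l^{-1}

def M(layer_feats):
    """M(x) = [M_1(x), ..., M_L(x)] from the per-layer features of x."""
    out = []
    for h, c_l, S_inv in zip(layer_feats, centers, covs_inv):
        diff = h - c_l
        out.append(float(diff @ S_inv @ diff))  # per-layer squared distance
    return np.array(out)

w = np.full(L, 1.0 / L)                  # layer weights to be optimized
feats = [rng.standard_normal(d) for _ in range(L)]
dist2 = float(w @ M(feats))              # approximates ||phi(x) - c||_Sigma^2
print(dist2 >= 0.0)                      # a weighted sum of squared distances
```

Because each Ml(x) is a squared distance and the weights are nonnegative here, the approximated ellipsoid distance is nonnegative, as a squared norm should be.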
In some embodiments, because the regularization term, expressed in terms of w as

½∥Σ̂∥Fr2=½Σl=1L∥Σ̂l∥Fr2/wl2,

is not convex with respect to w, −½∥w∥22 may be minimized instead to have a similar or same regularization effect on Σ (e.g., because ∥w∥2 being small is equivalent to ∥Σ∥Fr2 being too large). Combining the foregoing, Eq. 3 may be re-expressed as:

minw,R,ξ−½∥w∥22+R2+(1/(νn))Σi=1nξi, subject to ⟨w, M(xi)⟩≤R2+ξi, ξi≥0, i=1, . . . , n (Eq. 7)
In some embodiments, Eq. 7 may be viewed as a one-class SVM with linear kernel that can be solved with convex optimization methods. In some cases, to determine whether a new data sample y is OOD or not, the R and w determined from the training data may be applied to the new data sample y, and S(y)=⟨w, M(y)⟩−R2 may be used as the anomaly score (e.g., for determining whether y is an anomaly, i.e., not from the distribution that generated x).
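As an illustrative, non-limiting example of the anomaly score S(y)=⟨w, M(y)⟩−R2, the following sketch uses hand-picked weights w and radius R standing in for the values obtained from the optimization; a positive score flags y as OOD:

```python
import numpy as np

w = np.array([0.5, 0.3, 0.2])   # hypothetical layer weights (illustrative)
R = 2.0                          # hypothetical radius (illustrative)

def anomaly_score(M_y):
    # S(y) = <w, M(y)> - R^2; positive score => y is flagged as OOD.
    return float(w @ M_y) - R**2

print(anomaly_score(np.array([1.0, 1.0, 1.0])))     # about -3.0: in-distribution
print(anomaly_score(np.array([20.0, 20.0, 20.0])))  # about 16.0: OOD
```

The score is linear in the per-layer distance vector M(y), which is what makes the one-class-SVM-with-linear-kernel view of Eq. 7 apply.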
In some embodiments, if the feature of the data x is assumed to follow a Gaussian distribution, i.e., ϕ(x)~N(c, Σ), where c and Σ are the mean and the covariance of the Gaussian distribution, then the above method or formulation may include or be connected to density estimation (e.g., Gaussian density estimation). That is, the formulation may include or be related to estimating the density of the data x by assuming the feature of the data follows a (layer-wise factorized) Gaussian distribution, and then identifying data with likelihood smaller than a threshold as OOD data. To show this, with the assumption of Gaussian distribution, the log-density of p(ϕ(x)) can be written as:
log p(ϕ(x))=−½(ϕ(x)−c)TΣ−1(ϕ(x)−c)+log Z=−½∥ϕ(x)−c∥Σ2+log Z
where Z is the normalization constant (e.g., the partition function) that is independent of ϕ(x). In such cases, the above-noted approximation of decomposing the feature space into several subspaces may be related or equivalent to introducing an additional assumption that the feature from each layer is approximately independent, i.e., p(ϕ(x))≈Πl=1Lp(fl∘fl-1 . . . ∘f1(x)). As such, the log-density may be approximated by a sum of layer-wise log-densities, each taking the Gaussian form above.
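As an illustrative numerical check of the connection to Gaussian density estimation, the following sketch (using toy c and Σ) verifies that the Gaussian log-density differs from −½∥ϕ(x)−c∥Σ2 only by a constant (log Z) independent of the input:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)
d = 3
c = rng.standard_normal(d)
A = rng.standard_normal((d, d))
Sigma = A @ A.T + np.eye(d)          # symmetric positive definite covariance
Sigma_inv = np.linalg.inv(Sigma)
dist = multivariate_normal(mean=c, cov=Sigma)

def neg_half_maha(v):
    # -1/2 * (v - c)^T Sigma^{-1} (v - c), the input-dependent part of log p
    diff = v - c
    return -0.5 * float(diff @ Sigma_inv @ diff)

v1, v2 = rng.standard_normal(d), rng.standard_normal(d)
const1 = dist.logpdf(v1) - neg_half_maha(v1)   # should equal log Z
const2 = dist.logpdf(v2) - neg_half_maha(v2)   # same log Z, different point
print(np.isclose(const1, const2))              # constant is input-independent
```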
In some embodiments, the results depicted in
In some embodiments, the results depicted in
In some embodiments,
As discussed above, in some embodiments, OODDM, the OOD detection method disclosed in the present disclosure, may not require accessing outlier samples.
In some embodiments, Mahalanobis distance can be used to measure the distance between a point x and a distribution P, and as such can be used for anomaly detection. Assuming that the mean of P is c and the covariance of P is Σ, the Mahalanobis distance is defined as √((x−c)TΣ−1(x−c))=∥x−c∥Σ.
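As an illustrative check of the Mahalanobis distance definition above, the following sketch compares the formula with SciPy's implementation on toy values:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

x = np.array([2.0, 0.5])                      # toy point
c = np.array([0.0, 0.0])                      # toy distribution mean
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                # toy covariance (SPD)
Sigma_inv = np.linalg.inv(Sigma)

# sqrt((x - c)^T Sigma^{-1} (x - c)) computed directly from the definition
d_manual = float(np.sqrt((x - c) @ Sigma_inv @ (x - c)))

# SciPy's mahalanobis takes the *inverse* covariance as its third argument
d_scipy = mahalanobis(x, c, Sigma_inv)

print(np.isclose(d_manual, d_scipy))
```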
To obtain the results of
In some embodiments, an intent-classifier is fine-tuned on the in-scope training dataset using bi-directional encoder representations from transformers (BERT), discussed in Devlin et al., BERT: Pre-training of deep bidirectional transformers for language understanding, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pp. 4171-4186 (2019), the disclosure of which is incorporated herein by reference in its entirety. For all the OOD detection methods shown in
In some embodiments,
In some embodiments, the results shown in
In some embodiments, a heuristic baseline may be constructed by using the scores output from BERT for candidate spans, normalized by a softmax function. The score of a candidate span from position i to position j is defined as S·Ti+E·Tj, where S is the starting vector and E is the ending vector introduced in BERT, and Ti and Tj are the token embeddings output from BERT for positions i and j, respectively. In some cases, the maximum value of the normalized scores can be used as the anomaly score (MSP). In addition, the maximum calibrated probability with temperature scaling (MTS) can be used as a baseline. For OODDM, the test question and passage may be treated as a single packed input sequence, and w and R may be optimized on the training dataset using the latent features of all layers from BERT.
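As an illustrative, non-limiting sketch of the span-scoring heuristic described above, the following example computes S·Ti+E·Tj over candidate spans, normalizes with a softmax, and takes the maximum normalized score; the sequence length, hidden size, and random embeddings are assumptions standing in for actual BERT outputs:

```python
import numpy as np

rng = np.random.default_rng(5)
seq_len, h = 6, 8
T = rng.standard_normal((seq_len, h))   # stand-in token embeddings T_1..T_n
S = rng.standard_normal(h)              # stand-in starting vector
E = rng.standard_normal(h)              # stand-in ending vector

# Score every candidate span (i, j) with i <= j as S . T_i + E . T_j
scores = np.array([S @ T[i] + E @ T[j]
                   for i in range(seq_len) for j in range(i, seq_len)])

# Softmax-normalize the span scores (shifted by the max for stability)
probs = np.exp(scores - scores.max())
probs /= probs.sum()

msp = probs.max()   # maximum normalized score, used as the MSP-style score
print(0.0 < msp <= 1.0)
```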
In some embodiments, to obtain the results shown in
In some embodiments,
Although illustrative embodiments have been shown and described, a wide range of modification, change and substitution is contemplated in the foregoing disclosure and in some instances, some features of the embodiments may be employed without a corresponding use of other features. One of ordinary skill in the art would recognize many variations, alternatives, and modifications. Thus, the scope of the invention should be limited only by the following claims, and it is appropriate that the claims be construed broadly and in a manner consistent with the scope of the embodiments disclosed herein.
The present application claims priority to and the benefit of the U.S. Provisional Patent Application No. 63/032,696, filed May 31, 2020, titled “Systems and Methods for Out-of-Distribution Detection,” which is hereby incorporated by reference herein in its entirety.