The following disclosure(s) are submitted under 35 U.S.C. 102(b)(1)(A):
The present invention relates generally to the electrical, electronic and computer arts and, more particularly, to machine learning.
Neural networks are known to be over-confident when their output label distributions are used directly to generate uncertainty measures. Existing methods mainly resolve this issue by retraining the entire machine learning model to impose the uncertainty quantification capability, so that the learned model can achieve the desired performance in accuracy and uncertainty prediction simultaneously. However, training the model from scratch is computationally expensive and may not be feasible in many situations.
Despite the promising performance of deep neural networks in various practical tasks, uncertainty quantification (UQ) has attracted growing attention in recent years to meet the emerging demand for more robust, reliable, and transparent machine learning models. UQ aims to quantitatively measure the reliability of a model's predictions: generating low uncertainty when the prediction is correct and high uncertainty when the prediction is wrong. Accurate uncertainty estimation is particularly significant for fields that are highly sensitive to erroneous predictions, such as autonomous driving, financial services, and medical diagnosis. Accurate uncertainty quantification can prevent machine learning models from taking the risk of making wrong predictions by letting humans become involved in the decision-making process.
As alluded to above, existing uncertainty quantification solutions are computationally heavy, requiring the training of a model from scratch to achieve the desired prediction and uncertainty quantification performance simultaneously, and sometimes require training multiple models or running inference on the model multiple times. Conventional solutions often suffer from a lack of flexibility, cannot quantify different uncertainties (such as total, aleatoric, and epistemic uncertainties), and require specific training objectives or model architectures.
Principles of the invention provide systems and techniques for post-hoc uncertainty quantification for machine learning systems. In one aspect, an exemplary method includes the operations of obtaining a pretrained machine learning model; configuring a Bayesian meta-model to cooperate with the pretrained machine learning model, the Bayesian meta-model being configured to quantify different kinds of uncertainties associated with the pretrained machine learning model, wherein the Bayesian meta-model comprises a plurality of linear layers attached to different intermediate features of the pretrained machine learning model with a final linear layer generating a Dirichlet distribution; receiving multiple intermediate features extracted from the pretrained machine learning model as inputs; generating a Dirichlet distribution over a probability simplex as output, wherein the Dirichlet distribution is parameterized by the Bayesian meta-model and allows quantification of uncertainty of model prediction; and using the Bayesian meta-model and the pretrained machine learning model in a downstream task.
In one aspect, a computer program product comprises one or more tangible computer-readable storage media and program instructions stored on at least one of the one or more tangible computer-readable storage media, the program instructions executable by a processor, the program instructions comprising obtaining a pretrained machine learning model; configuring a Bayesian meta-model to cooperate with the pretrained machine learning model, the Bayesian meta-model being configured to quantify different kinds of uncertainties associated with the pretrained machine learning model, wherein the Bayesian meta-model comprises a plurality of linear layers attached to different intermediate features of the pretrained machine learning model with a final linear layer generating a Dirichlet distribution; receiving multiple intermediate features extracted from the pretrained machine learning model as inputs; generating a Dirichlet distribution over a probability simplex as output, wherein the Dirichlet distribution is parameterized by the Bayesian meta-model and allows quantification of uncertainty of model prediction; and using the Bayesian meta-model and the pretrained machine learning model in a downstream task.
In one aspect, an apparatus comprises a memory and at least one processor, coupled to the memory, and operative to perform operations comprising obtaining a pretrained machine learning model; configuring a Bayesian meta-model to cooperate with the pretrained machine learning model, the Bayesian meta-model being configured to quantify different kinds of uncertainties associated with the pretrained machine learning model, wherein the Bayesian meta-model comprises a plurality of linear layers attached to different intermediate features of the pretrained machine learning model with a final linear layer generating a Dirichlet distribution; receiving multiple intermediate features extracted from the pretrained machine learning model as inputs; generating a Dirichlet distribution over a probability simplex as output, wherein the Dirichlet distribution is parameterized by the Bayesian meta-model and allows quantification of uncertainty of model prediction; and using the Bayesian meta-model and the pretrained machine learning model in a downstream task.
As used herein, “facilitating” an action includes performing the action, making the action easier, helping to carry the action out, or causing the action to be performed. Thus, by way of example and not limitation, instructions executing on a processor might facilitate an action carried out by instructions executing on a remote processor, by sending appropriate data or commands to cause or aid the action to be performed. Where an actor facilitates an action by other than performing the action, the action is nevertheless performed by some entity or combination of entities.
Techniques as disclosed herein can provide substantial beneficial technical effects. Some embodiments may not have these potential advantages and these potential advantages are not necessarily required of all embodiments. By way of example only and without limitation, one or more embodiments may provide one or more of:
These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The following drawings are presented by way of example only and without limitation, wherein like reference numerals (when used) indicate corresponding elements throughout the several views, and wherein:
It is to be appreciated that elements in the figures are illustrated for simplicity and clarity. Common but well-understood elements that may be useful or necessary in a commercially feasible embodiment may not be shown in order to facilitate a less hindered view of the illustrated embodiments.
Principles of inventions described herein will be in the context of illustrative embodiments. Moreover, it will become apparent to those skilled in the art given the teachings herein that numerous modifications can be made to the embodiments shown that are within the scope of the claims. That is, no limitations with respect to the embodiments shown and described herein are intended or should be inferred.
Most state-of-the-art approaches to uncertainty quantification focus on building a deep model equipped with uncertainty quantification capability so that a single deep model can achieve both the desired prediction and UQ performance simultaneously. However, such an approach to UQ suffers from practical limitations because it requires either a specific model structure or explicit training of the entire model from scratch to impose the uncertainty quantification ability. A more realistic scenario is to quantify the uncertainty of a pretrained model in a post-hoc manner due to practical constraints. For example, (1) compared with prediction accuracy and generalization performance, the uncertainty quantification ability of deep learning models is usually given lower priority, especially for profit-oriented applications, such as recommendation systems; (2) some applications require the models to satisfy other constraints, such as fairness or privacy, which might sacrifice the UQ performance; and (3) for some applications, such as transfer learning, pretrained models are usually available, and it might be a waste of resources to train a new model from scratch.
Motivated by these practical concerns, one pertinent focus of one or more embodiments is on tackling the post-hoc uncertainty learning problem; i.e., given a pretrained model, determine how to improve its UQ quality without affecting its predictive performance. Prior works on improving uncertainty quality in a post-hoc setting have mainly been targeted towards improving calibration. These approaches typically fail to augment the pre-trained model with the ability to capture different sources of uncertainty, such as epistemic uncertainty, which is pertinent for applications such as Out-of-Distribution (OOD) detection. Several recent works have adopted the meta-modeling approach, where a meta-model is trained to predict whether or not the pretrained model is correct on the validation samples. These methods still rely on a point estimate of the meta-model parameters, which can be unreliable, especially when the validation set is small.
In one example embodiment, a more practical post-hoc uncertainty learning setting is considered, where a well-trained base model is given and the uncertainty quantification task is focused on a second stage of training. In one example embodiment, a Bayesian meta-model is disclosed that augments pre-trained models with better uncertainty quantification abilities and is both effective and computationally efficient. The disclosed methods require no additional training data and no modification of the base model, are flexible enough to quantify different uncertainties for different downstream tasks, and easily adapt to different application settings, including out-of-domain data detection, misclassification detection, and trustworthy transfer learning. An exemplary meta-model approach's flexibility and superior empirical performance are demonstrated on these applications over multiple representative image classification benchmarks.
Exemplary empirical results provide pertinent insights regarding meta-model training: (1) the diversity in feature representations across different layers is important for uncertainty quantification, especially for out-of-domain (OOD) data detection tasks; (2) the Dirichlet meta-model can be leveraged to capture different uncertainties, including total uncertainty and epistemic uncertainty; and (3) uncertainty learning exhibits an over-fitting issue similar to that of supervised learning, which should be addressed by a novel validation strategy to achieve better performance. Furthermore, it is shown that exemplary embodiments have the flexibility to adapt to various applications, including OOD detection, misclassification detection, and trustworthy transfer learning.
Uncertainty quantification methods can be broadly classified as intrinsic or extrinsic, depending on how the uncertainties are obtained from the machine learning models. Intrinsic methods encompass models that inherently provide an uncertainty estimate along with their predictions. Some intrinsic methods, such as neural networks with homoscedastic/heteroscedastic noise models and quantile regression, can only capture data (aleatoric) uncertainty. Many applications, including out-of-distribution detection, require capturing both data (aleatoric) and model (epistemic) uncertainty accurately. Bayesian methods, such as Bayesian neural networks (BNNs) and Gaussian processes, and ensemble methods are well-known examples of intrinsic methods that can quantify both uncertainties. However, Bayesian methods and ensembles can be quite expensive and require several approximations to learn/optimize in practice.
Under model misspecification, Bayesian approaches are not well-calibrated and can produce severely mis-calibrated uncertainties.
For models without an inherent notion of uncertainty, extrinsic methods are employed to extract uncertainties in a post-hoc manner. However, many conventional methods require additional data samples, from either a validation set or an out-of-distribution dataset, to train the model or tune its hyper-parameters, which is infeasible when such data are not available. Moreover, they are often not flexible enough to distinguish epistemic uncertainty from aleatoric uncertainty, both of which are known to be significant in various learning applications. In contrast, exemplary embodiments do not require additional training data or modification of the training procedure of the base model.
Consider a classification problem, as described below. Let $Z = \mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ denotes the input space and $\mathcal{Y} = \{1, \ldots, K\}$ denotes the label space. Given a base-model training set $D_B = \{x_i^B, y_i^B\}_{i=1}^{N_B}$, let $\Phi: \mathcal{X} \to \mathbb{R}^d$ and $h: \mathbb{R}^d \to \Delta^{K-1}$ denote two complementary components of the neural network. More specifically, $\Phi(x) = \Phi(x; w_\phi)$ stands for the intermediate feature representation of the base model, and the model output $h(\Phi(x)) = h(\Phi(x; w_\phi); w_h)$ denotes the predicted label distribution $P_B(y \mid \Phi(x)) \in \Delta^{K-1}$ given an input sample $x$, where $(w_\phi, w_h) \in \mathcal{W}$ are the parameters of the pretrained base model.
The performance of the base model is evaluated by a non-negative loss function $\ell: \mathcal{W} \times Z \to \mathbb{R}_{\geq 0}$, e.g., the cross-entropy loss. Thus, a standard way to obtain the pretrained base model is by minimizing the empirical risk over $D_B$, i.e., $(w_\phi^*, w_h^*) = \arg\min_{(w_\phi, w_h) \in \mathcal{W}} \frac{1}{N_B} \sum_{i=1}^{N_B} \ell(w_\phi, w_h; x_i^B, y_i^B)$.
Although the well-trained deep base model is able to achieve good prediction accuracy, the output label distribution $P_B(y \mid \Phi(x))$ is usually unreliable for uncertainty quantification; i.e., it can be overconfident or poorly calibrated. Without retraining the model from scratch, the interest here is in improving the uncertainty quantification performance in an efficient post-hoc manner. To this end, a meta-model $g: \mathbb{R}^d \to \tilde{\mathcal{Y}}$ with parameters $w_g \in \mathcal{W}_g$ is built on top of the base model. The meta-model shares the feature extractor of the base model and generates an output $\tilde{y} = g(\Phi(x); w_g)$, where $\tilde{y} \in \tilde{\mathcal{Y}}$ can take any form, e.g., a distribution over $\Delta^{K-1}$ or a scalar. Given a meta-model training set $D_M = \{x_i^M, y_i^M\}_{i=1}^{N_M}$, the meta-model is trained by minimizing the empirical risk $\frac{1}{N_M} \sum_{i=1}^{N_M} \ell_M(w_g, w_\phi; x_i^M, y_i^M)$ over $w_g$, where $\ell_M: \mathcal{W}_g \times \mathcal{W}_\phi \times Z \to \mathbb{R}_{\geq 0}$ is the loss function for the meta-model.
In the following, the post-hoc uncertainty learning problem using a meta-model is formally introduced.
Given a base model $h \circ \Phi$ learned from the base-model training set $D_B$, the uncertainty learning problem by meta-model is to learn the function $g$ using the meta-model training set $D_M$ and the shared feature extractor $\Phi$, i.e., $g^* = \arg\min_{w_g \in \mathcal{W}_g} \frac{1}{N_M} \sum_{i=1}^{N_M} \ell_M(w_g, w_\phi; x_i^M, y_i^M)$, such that the output from the meta-model $\tilde{y} = g(\Phi(x))$, equipped with an uncertainty metric function $u: \tilde{\mathcal{Y}} \to \mathbb{R}$, is able to generate a robust uncertainty score $u(\tilde{y})$.
Next, the most critical questions are how the meta-model should use the information extracted from the pretrained base model, what kinds of uncertainty the meta-model should aim to quantify, and finally, how to train the meta-model appropriately.
The post-hoc uncertainty learning framework defined in Problem 1 is specified below. First, the structure of the meta-model is introduced. Next, the meta-model training procedure, including the training objectives and a validation trick, is discussed. Finally, metrics for uncertainty quantification used in different applications are defined.
The design of an exemplary meta-model method is based on three high-level insights. First, different intermediate layers of the base model usually capture different levels of feature representation, from low-level features to high-level features; e.g., for the OOD detection task, the OOD data is unlikely to be similar to in-distribution data across all levels of feature representation. Therefore, it is desirable to leverage the diversity in feature representations to achieve better uncertainty quantification performance. Second, a Bayesian method is used to model different types of uncertainty for various uncertainty quantification applications, i.e., total uncertainty and epistemic uncertainty. Thus, a Bayesian meta-model is disclosed that parameterizes the Dirichlet distribution, used as the conjugate prior distribution, over the label distribution. Third, the overconfidence issue of the base model is believed to be caused by over-fitting in supervised learning with the cross-entropy loss. In the post-hoc training of the meta-model, a validation strategy is proposed that targets uncertainty learning performance rather than prediction accuracy.
Given an input sample $x$, the base model 204 outputs a conditional label distribution $P_B(y \mid \Phi(x)) \in \Delta^{K-1}$, corresponding to a single point in the probability simplex. However, such a label distribution $P_B(y \mid \Phi(x))$ is a point estimate, which only shows the model's uncertainty about different classes but cannot reflect the uncertainty due to a lack of knowledge of a given sample, i.e., the epistemic uncertainty. To this end, the Dirichlet technique is adopted in order to better quantify the epistemic uncertainty. Let the label distribution be a random variable $\pi = [\pi_1, \pi_2, \ldots, \pi_K]$ over the probability simplex; the Dirichlet distribution is the conjugate prior of the categorical distribution, i.e., $P(\pi \mid \alpha) = \mathrm{Dir}(\pi \mid \alpha) = \frac{\Gamma(\alpha_0)}{\prod_{c=1}^{K} \Gamma(\alpha_c)} \prod_{c=1}^{K} \pi_c^{\alpha_c - 1}$, where $\alpha = [\alpha_1, \ldots, \alpha_K]$ is the concentration parameter with $\alpha_c > 0$.
The exemplary meta-model $g$ explicitly parameterizes the posterior Dirichlet distribution, i.e., $q(\pi \mid \Phi(x); w_g) = \mathrm{Dir}(\pi \mid \alpha(x))$, where the output of the disclosed meta-model 216 is $\tilde{y} = \log \alpha(x)$, and $\alpha(x) = [\alpha_1(x), \alpha_2(x), \ldots, \alpha_K(x)]$ is the concentration parameter of the Dirichlet distribution given an input $x$.
From a Bayesian perspective, the predicted label distribution using the Dirichlet meta-model is given by the expected categorical distribution $P(y = c \mid \Phi(x); w_g) = \mathbb{E}_{q(\pi \mid \Phi(x); w_g)}[\pi_c] = \alpha_c(x) / \alpha_0(x)$, where $\alpha_0 = \sum_{c=1}^{K} \alpha_c$ is the precision of the Dirichlet distribution.
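By way of example only and without limitation, the following numerical sketch (in Python; the three-class concentration values are purely illustrative) shows how the expected categorical distribution and the precision follow from the concentration parameter $\alpha(x)$:

```python
import numpy as np

# Illustrative three-class concentration parameters alpha(x)
alpha = np.array([10.0, 2.0, 1.0])
alpha0 = alpha.sum()          # precision alpha_0 = 13.0
p = alpha / alpha0            # expected categorical distribution E[pi]

print(np.round(p, 3))         # [0.769 0.154 0.077] -> first class is predicted
print(alpha0)                 # 13.0; a larger precision means a sharper Dirichlet
```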
The true posterior of the categorical distribution over a sample $(x, y)$ is $P(\pi \mid \Phi(x), y) \propto P(y \mid \pi, \Phi(x)) P(\pi \mid \Phi(x))$, which is difficult to evaluate. Instead, a variational inference technique is used: a variational distribution $q(\pi \mid \Phi(x); w_g)$, parameterized as a Dirichlet distribution by the meta-model, is generated to approximate the true posterior distribution $P(\pi \mid \Phi(x), y)$, and the KL-divergence $\mathrm{KL}(q(\pi \mid \Phi(x); w_g) \,\|\, P(\pi \mid \Phi(x), y))$ is then minimized, which is equivalent to maximizing the evidence lower bound (ELBO), i.e., $\mathcal{L}(w_g) = \sum_{i=1}^{N_M} \mathbb{E}_{q(\pi \mid \Phi(x_i^M); w_g)}[\log P(y_i^M \mid \pi)] - \lambda\, \mathrm{KL}(\mathrm{Dir}(\pi \mid \alpha^{(i)}) \,\|\, \mathrm{Dir}(\pi \mid \mathbf{1}))$, where $\alpha^{(i)}$ is the Dirichlet concentration parameter produced by the meta-model 216, i.e., $\alpha^{(i)} = \exp(g(\Phi(x_i^M); w_g))$, and $\lambda$ is a hyper-parameter balancing the likelihood and regularization terms.
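By way of example only, a minimal sketch of such a Dirichlet variational loss is given below (PyTorch; it assumes the meta-model outputs $\tilde{y} = \log \alpha(x)$ and uses the closed-form identity $\mathbb{E}_{\mathrm{Dir}(\alpha)}[\log \pi_y] = \psi(\alpha_y) - \psi(\alpha_0)$; the kl_weight argument plays the role of the hyper-parameter $\lambda$ and its default value is illustrative):

```python
import torch
from torch.distributions import Dirichlet, kl_divergence

def dirichlet_variational_loss(log_alpha, targets, kl_weight=1e-3):
    """Negative ELBO for a Dirichlet meta-model (illustrative sketch).

    log_alpha: (batch, K) meta-model outputs, i.e., log alpha(x)
    targets:   (batch,) integer class labels
    """
    alpha = log_alpha.exp()                            # concentration, alpha_c > 0
    alpha0 = alpha.sum(dim=-1, keepdim=True)           # Dirichlet precision
    # Expected log-likelihood: E[log pi_y] = digamma(alpha_y) - digamma(alpha_0)
    expected_loglik = (torch.digamma(alpha) - torch.digamma(alpha0)) \
        .gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # KL regularizer toward the uniform Dirichlet Dir(1, ..., 1)
    kl = kl_divergence(Dirichlet(alpha), Dirichlet(torch.ones_like(alpha)))
    return (-expected_loglik + kl_weight * kl).mean()
```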
Validation with early stopping is a commonly used technique in supervised learning to train a model with desired generalization performance, i.e., stop training when the error evaluated on the validation set starts increasing. However, we have found that the standard validation method does not work well for uncertainty learning. One possible explanation is that the model achieves the highest accuracy when the validation loss is small, but may not achieve the best UQ performance, i.e., the model can be overconfident. To this end, an exemplary simple and effective validation approach is disclosed specifically for uncertainty learning. Instead of monitoring the validation cross-entropy loss, a specific uncertainty quantification performance metric is evaluated. For example, another noisy validation set for the OOD task is created by adding noise to the original validation samples and such noisy validation samples are treated as OOD samples (more details are provided in the section entitled “Description of OOD datasets”). The uncertainty score u({tilde over (y)}) is evaluated on both the validation set and the noisy validation set, and the meta-model training is stopped when the OOD detection performance achieves its maximum based on some predefined metrics, e.g., area under the receiver operating characteristic (ROC) curve (AUROC) score. Unlike most existing techniques using additional training data to help achieve a desired performance, in one or more embodiments, no additional data is required for training the meta-model.
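By way of example only, the validation strategy may be sketched as follows (Python; train_one_epoch and uncertainty_score are hypothetical callables that run one epoch of meta-model training and compute the uncertainty score $u(\tilde{y})$ for a batch of inputs, respectively):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fit_with_uq_early_stopping(train_one_epoch, uncertainty_score,
                               val_inputs, noisy_val_inputs, max_epochs=50):
    """Early stopping on a UQ metric (OOD AUROC) rather than validation loss."""
    # Clean validation samples are treated as in-distribution (label 0),
    # noisy validation samples as pseudo-OOD (label 1).
    labels = np.concatenate([np.zeros(len(val_inputs)),
                             np.ones(len(noisy_val_inputs))])
    best_auroc, best_state = -np.inf, None
    for _ in range(max_epochs):
        state = train_one_epoch()                  # one epoch of meta-model training
        scores = np.concatenate([uncertainty_score(val_inputs),
                                 uncertainty_score(noisy_val_inputs)])
        auroc = roc_auc_score(labels, scores)      # higher = better OOD separation
        if auroc > best_auroc:                     # keep the best-UQ checkpoint
            best_auroc, best_state = auroc, state
    return best_state, best_auroc
```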
It is shown below that the exemplary meta-model 216 has the desired behavior of quantifying different uncertainties, and it is shown how these uncertainties can be used in various applications.
Total uncertainty, also known as predictive uncertainty, is a combination of epistemic uncertainty and aleatoric uncertainty. The total uncertainty is often used for misclassification detection problems, where the misclassified samples are viewed as in-distribution hard samples. There are two standard ways to measure total uncertainty: (1) Entropy (Ent): the Shannon entropy of the expected categorical label distribution over the Dirichlet distribution, i.e., $H(P(y \mid \Phi(x_i); w_g)) = H(\mathbb{E}_{q(\pi \mid \Phi(x_i); w_g)}[\pi])$; and (2) Max Probability (MaxP): the maximum class probability of the expected categorical label distribution, i.e., $\max_c P(y = c \mid \Phi(x_i); w_g)$.
The epistemic uncertainty quantifies the uncertainty when the model has insufficient knowledge of a prediction, e.g., the case of an unseen data sample. The epistemic uncertainty is especially useful in OOD detection problems. When the meta-model 216 encounters an unseen sample during testing, it will output a high epistemic uncertainty score due to a lack of knowledge. Three metrics are defined to measure the epistemic uncertainties.
Differential entropy measures the entropy of the Dirichlet distribution; a large differential entropy corresponds to a more spread-out Dirichlet distribution, i.e., $H(P(\pi \mid \Phi(x_i); w_g)) = -\int P(\pi \mid \Phi(x_i); w_g) \log P(\pi \mid \Phi(x_i); w_g)\, d\pi$.
Mutual information is the difference between the entropy (which measures total uncertainty) and the expected entropy of the categorical distribution sampled from the Dirichlet distribution (which approximates aleatoric uncertainty), i.e., $I(y, \pi \mid \Phi(x_i); w_g) = H(\mathbb{E}_{q(\pi \mid \Phi(x_i); w_g)}[\pi]) - \mathbb{E}_{q(\pi \mid \Phi(x_i); w_g)}[H(\pi)]$.
The precision is the summation of the Dirichlet concentration parameters $\alpha$, where a larger value corresponds to a sharper distribution and higher confidence, i.e., $\alpha_0 = \sum_{c=1}^{K} \alpha_c$.
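By way of example only, the total and epistemic uncertainty metrics defined above can be computed from the concentration parameter $\alpha$ as in the following sketch (Python with SciPy; the expected-entropy identity $\mathbb{E}[H(\pi)] = -\sum_c \frac{\alpha_c}{\alpha_0}(\psi(\alpha_c + 1) - \psi(\alpha_0 + 1))$ is a standard closed form for the Dirichlet distribution):

```python
import numpy as np
from scipy.special import digamma
from scipy.stats import dirichlet

def uncertainty_metrics(alpha):
    """Total and epistemic uncertainty scores from Dirichlet parameters.

    alpha: (K,) array of concentration parameters, alpha_c > 0.
    """
    alpha0 = alpha.sum()
    p = alpha / alpha0                                 # expected categorical dist.
    total_entropy = -np.sum(p * np.log(p))             # total uncertainty (Ent)
    diff_entropy = dirichlet(alpha).entropy()          # epistemic: differential entropy
    expected_entropy = -np.sum(p * (digamma(alpha + 1.0) - digamma(alpha0 + 1.0)))
    mutual_info = total_entropy - expected_entropy     # epistemic: mutual information
    return {"entropy": total_entropy,
            "differential_entropy": float(diff_entropy),
            "mutual_information": mutual_info,
            "precision": alpha0}                       # higher = more confident
```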
The strong empirical performance of the exemplary meta-model-based uncertainty learning method is demonstrated below: first, the UQ applications are introduced and the experiment settings are described; next, the main results of the three aforementioned uncertainty quantification applications are presented; and finally, the take-aways are discussed. More experiment results and implementation details are given in the sections entitled "Experiment Setup" and "Additional Experiment Results."
Three exemplary applications that can be tackled using the disclosed meta-model approach are focused on below:
Given a base model $h$ trained using data sampled from the distribution $P_Z^B$, the same base-model training set is used to train the meta-model 216, i.e., $D_B = D_M$. During testing, there exists some unobserved out-of-domain data from another distribution $P_Z^{ood}$. The meta-model 216 is expected to identify the out-of-distribution input samples based on epistemic uncertainties.
Instead of detecting whether a testing sample is out of domain, the goal here is to identify the failure or success of the meta-model prediction at test time using total uncertainties.
In transfer learning, there exists a pretrained model trained using source task data $D_s$ sampled from a source distribution $P_Z^s$, and the goal is to adapt the source model to a target task using target data $D_t$ sampled from a target distribution $P_Z^t$. Most existing transfer learning approaches only focus on improving the prediction performance of the transferred model, but ignore its UQ performance on the target task. The disclosed meta-model method can be utilized to address this problem; i.e., given a pretrained source model $h_s \circ \Phi_s$, the meta-model 216 can be efficiently trained using target domain data by $g_t = \arg\min_g \ell_M(g \circ \Phi_s, D_t)$.
For both OOD detection and misclassification detection tasks, three standard datasets are employed to train the base model 204 and the meta-model 216: the first conventional large database of handwritten digits, a first conventional image dataset, and a second conventional image dataset. For each dataset, a different base-model structure is used, i.e., a first conventional convolutional neural network for the first conventional large database of handwritten digits, a conventional object detection and classification algorithm for the first conventional image dataset, and a conventional wide residual network for the second conventional image dataset. For the first conventional convolutional neural network and the conventional object detection and classification algorithm, the meta-model 216 uses the extracted feature after each pooling layer, and for the conventional wide residual network, the meta-model 216 uses the extracted feature after each residual block. In general, the total number of intermediate features is less than five, to ensure computational efficiency. For the OOD task, five different OOD datasets are considered for evaluating the OOD detection performance: a second conventional large database of handwritten digits, a conventional dataset of article images, a third conventional large database of handwritten digits, the first conventional image dataset, and a corrupted version of the first conventional large database of handwritten digits as outliers for the first conventional large database of handwritten digits; and a conventional database of house numbers, the conventional dataset of article images, a third conventional image dataset, a fourth conventional image dataset, and a corrupted version of the first conventional image dataset (respectively, a corrupted version of the second conventional image dataset) as outliers for the first conventional image dataset (respectively, the second conventional image dataset). For the trustworthy transfer learning task, a second conventional convolutional neural network pretrained on a fifth conventional image dataset was used as the pretrained source domain model, and the source model was adapted to two target datasets, a sixth conventional image dataset and the first conventional image dataset, by training the meta-model 216.
For the OOD and misclassification tasks, in addition to the naive base model trained with cross-entropy loss, an exemplary technique is mainly compared with existing post-hoc UQ methods as baselines: (1) an existing meta-model based method; and (2) a post-hoc uncertainty quantification method using the Laplace approximation. In order to further validate the strong empirical performance, the disclosed method was also compared with other state-of-the-art (SOTA) intrinsic UQ methods in the section entitled "Additional Experiment Results": (1) a standard Bayesian method; (2) a Dirichlet network with variational inference; (3) a posterior network with density estimation; and (4) a robust OOD detection method.
For the trustworthy transfer learning task, since there is no existing work designed for this problem, the exemplary technique was compared with two simple baselines: (1) fine-tune the last layer of the source model; and (2) train the disclosed meta-model on top of the source model using standard cross-entropy loss.
The UQ performance was evaluated by measuring the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPR). The results are averaged over five random trials for each experiment. For the OOD task, the in-distribution test samples are considered as the negative class and the outlier samples as the positive class. For the misclassification task, the correctly classified test samples are considered as the negative class and the misclassified test samples are considered as the positive class.
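A non-limiting sketch of this evaluation protocol using scikit-learn is given below (average precision is used here as the AUPR estimate):

```python
from sklearn.metrics import average_precision_score, roc_auc_score

def evaluate_detection(negative_scores, positive_scores):
    """AUROC/AUPR from uncertainty scores: for OOD, in-distribution test
    samples are the negative class and outliers are the positive class;
    for misclassification, correct predictions are negative and
    misclassified predictions are positive."""
    labels = [0] * len(negative_scores) + [1] * len(positive_scores)
    scores = list(negative_scores) + list(positive_scores)
    return {"AUROC": roc_auc_score(labels, scores),
            "AUPR": average_precision_score(labels, scores)}
```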
The second conventional convolutional neural network pretrained on the fifth conventional image dataset was used as the source domain base model and the pretrained model was adapted to the target task by training the meta-model 216 using the target domain training data. Unlike traditional transfer learning, which only focuses on testing prediction accuracy on the target task, the UQ ability of the meta-model 216 was also evaluated in terms of OOD detection performance. A conventional dataset of article images was used as OOD samples for both target datasets, the sixth conventional image dataset and the first conventional image dataset, and the AUROC score was evaluated.
The disclosed method was further investigated through an ablation study using the first conventional image dataset OOD task. Based on the disclosed insights and the empirical results, it was concluded that the following four pertinent factors relate to the success of the exemplary meta-model based method:
The disclosed meta-model structure was replaced with a simple linear classifier attached to only the final layer.
Instead of using a meta-model to parameterize a Dirichlet distribution, the meta-model 216 was trained using the standard cross-entropy loss, which simply outputs a categorical label distribution. The ablation results are shown in the table of
The last layer of the base model 204 is retrained using the cross-entropy loss with the exemplary validation trick. The results are shown in the table of
Instead of using all the training samples, only 10% of samples to train the meta-model were randomly chosen. The results are shown in the table of
The exemplary meta-model approach is believed to not only have the flexibility to tackle other applications relevant to uncertainty quantification, such as quantifying transferability in transfer learning and domain adaptation, but also to be adaptable to other model architectures, such as transformers and language models.
The post-hoc uncertainty learning problem aims to improve the UQ performance of a pretrained base model. First, the pretrained model is generated by training the base model using the cross-entropy loss to achieve optimal testing accuracy. The maximum numbers of epochs for training the first conventional convolutional neural network, the conventional object detection and classification algorithm, and the conventional wide residual network are set to be 20, 200, and 200, respectively. Then, in the second stage, the parameters of the pretrained base model are frozen and the meta-model is trained on top of it using the Dirichlet variational loss. The meta-model uses the same training data as the base model, and the maximum number of epochs for training the meta-model is set to be 50. All the models are optimized using a stochastic gradient descent (SGD) optimizer.
The pretrained second conventional convolutional neural network was trained on the fifth conventional image dataset as the base-model. Similarly, the parameter of the pretrained model was frozen and the meta-model was trained on top of it using the training data of the target task. All the models are optimized using an SGD optimizer. The hyper-parameters for training the meta-model are summarized in the table of
The high-level description of the meta-model structure is provided in the section entitled "Meta-model Structure." More specifically, all the linear layers $g_i$ and $g_c$ include three elements: a fully-connected layer, a rectified linear unit (ReLU) activation function, and max-pooling. Each $g_i$ has multiple fully-connected layers, each followed by a ReLU and a max-pooling; each fully-connected layer reduces the input feature dimension to half its size, and the output meta-feature of $g_i$ has the same dimension as the number of classes, e.g., 10 for the first conventional image dataset. The final linear layer $g_c$ is a single fully-connected layer that takes the concatenation of all the meta-features and outputs the concentration parameter $\alpha$.
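By way of example only, the meta-model structure may be sketched as follows (PyTorch; the feature dimensions are illustrative, the intermediate base-features are assumed to be already pooled and flattened, and the max-pooling elements of each $g_i$ are omitted for brevity):

```python
import torch
import torch.nn as nn

class DirichletMetaModel(nn.Module):
    """Illustrative sketch: one small head g_i per intermediate feature,
    plus a final fully-connected layer g_c that outputs log alpha."""

    def __init__(self, feature_dims, num_classes):
        super().__init__()
        self.heads = nn.ModuleList()
        for dim in feature_dims:
            layers, d = [], dim
            while d // 2 > num_classes:        # halve the dimension per layer
                layers += [nn.Linear(d, d // 2), nn.ReLU()]
                d //= 2
            layers += [nn.Linear(d, num_classes), nn.ReLU()]
            self.heads.append(nn.Sequential(*layers))
        # g_c maps the concatenated meta-features to the K concentration logits.
        self.g_c = nn.Linear(num_classes * len(feature_dims), num_classes)

    def forward(self, features):
        # features: list of (batch, dim_j) tensors from the frozen base model.
        meta = [g(f) for g, f in zip(self.heads, features)]
        return self.g_c(torch.cat(meta, dim=-1))   # tilde-y = log alpha(x)

# Usage sketch: meta = DirichletMetaModel([512, 256, 128], num_classes=10)
```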
An exemplary meta-model-based approach is much more efficient than traditional uncertainty quantification approaches due to its simpler structure and faster convergence speed. To quantitatively demonstrate such efficiency, the wall clock time of training the meta-model was measured in seconds (on a single conventional graphics processing unit (GPU)) as follows. The training time of the meta-model for the conventional object detection and classification algorithm (model) on the first conventional image dataset is 66.5 seconds (s) for five epochs; the training time of the meta-model for the conventional wide residual network model on the second conventional image dataset is 241.9 s for ten epochs. The training time of the meta-model is negligible compared to approaches training the entire base model from scratch (which usually take several hours).
The proposed validation trick described in the section entitled “Uncertainty Learning” was used to perform early stopping in the training of the meta-model. 20% of the original training data was randomly selected as the validation set. For the OOD detection task, the noisy validation set was created by applying various kinds of noise and perturbation to the original images, including permuting the pixels, applying Gaussian blurring, and performing contrast re-scaling. For the misclassification task, the validation set was directly used to evaluate the misclassification detection performance with the correctly classified and misclassified validation samples.
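By way of example only, the noisy validation set may be created as in the following sketch (Python; grayscale images in [0, 1] and the specific noise parameters are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

def make_noisy_validation_set(images):
    """Turn clean validation images into pseudo-OOD samples.

    images: (N, H, W) grayscale array with values in [0, 1].
    """
    noisy = []
    for img in images:
        choice = rng.integers(3)
        if choice == 0:    # permute the pixels
            noisy.append(rng.permutation(img.reshape(-1)).reshape(img.shape))
        elif choice == 1:  # apply Gaussian blurring
            noisy.append(gaussian_filter(img, sigma=2.0))
        else:              # perform contrast re-scaling
            noisy.append(np.clip(0.3 * (img - 0.5) + 0.5, 0.0, 1.0))
    return np.stack(noisy)
```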
For the OOD detection task, the testing set was used as the in-domain dataset, and the out-of-domain dataset was ensured to have the same number of samples (10,000 samples) as the in-domain dataset. Input images from different datasets are resized to, for example, 32×32 to ensure they have the same size, and all gray-scale images are converted into three-channel images. The following datasets were used as OOD samples for the OOD detection task:
A fourth conventional large database of handwritten digits contains 1,623 handwritten characters taken from 50 different alphabets. 10,000 images were randomly picked from its testing set as OOD samples for the first conventional large database of handwritten digits.
The conventional dataset of article images with 10,000 images was used as OOD samples for both the first conventional large database of handwritten digits and CIFAR (including CIFAR-10 and CIFAR-100).
The third conventional large database of handwritten digits contains handwritten characters from Japanese texts. The testing set with 10,000 images was used as OOD samples for the first conventional large database of handwritten digits.
The conventional database of house numbers contains images of house numbers. The testing set with 10,000 images was used as OOD samples for CIFAR.
The third conventional image dataset is a dataset of different objects taken from 10 different scene categories. The images from the classroom category were used, and 10,000 training images were randomly sampled as OOD samples for CIFAR.
The fourth conventional image dataset is a subset of the fifth conventional image dataset, and the validation set with 10,000 images was used as OOD samples for CIFAR.
Corrupted is an artificial dataset generated by perturbing the original testing images using Gaussian blurring, pixel permutation, and contrast re-scaling.
In the following, an exemplary embodiment was compared with several SOTA uncertainty quantification methods with traditional settings on the OOD detection task.
In the following, an exemplary embodiment was compared with several SOTA uncertainty quantification methods with traditional settings on the misclassification detection task.
Given the discussion thus far, it will be appreciated that, in general terms, an exemplary method, according to an aspect of the invention, includes the operations of obtaining a pretrained machine learning model; configuring a Bayesian meta-model 216 to cooperate with the pretrained machine learning model 204, the Bayesian meta-model 216 being configured to quantify different kinds of uncertainties associated with the pretrained machine learning model 204, wherein the Bayesian meta-model 216 comprises a plurality of linear layers 212-1, 212-2, 212-3 attached to different intermediate features of the pretrained machine learning model 204 with a final linear layer 220 generating a Dirichlet distribution; receiving multiple intermediate features extracted from the pretrained machine learning model 204 as inputs; generating a Dirichlet distribution over a probability simplex as output, wherein the Dirichlet distribution is parameterized by the Bayesian meta-model 216 and allows quantification of uncertainty of model prediction; and using the Bayesian meta-model 216 and the pretrained machine learning model 204 in a downstream task.
In one aspect, a computer program product comprises one or more tangible computer-readable storage media and program instructions stored on at least one of the one or more tangible computer-readable storage media, the program instructions executable by a processor, the program instructions comprising obtaining a pretrained machine learning model; configuring a Bayesian meta-model 216 to cooperate with the pretrained machine learning model 204, the Bayesian meta-model 216 being configured to quantify different kinds of uncertainties associated with the pretrained machine learning model 204, wherein the Bayesian meta-model 216 comprises a plurality of linear layers 212-1, 212-2, 212-3 attached to different intermediate features of the pretrained machine learning model 204 with a final linear layer 220 generating a Dirichlet distribution; receiving multiple intermediate features extracted from the pretrained machine learning model 204 as inputs; generating a Dirichlet distribution over a probability simplex as output, wherein the Dirichlet distribution is parameterized by the Bayesian meta-model 216 and allows quantification of uncertainty of model prediction; and using the Bayesian meta-model 216 and the pretrained machine learning model 204 in a downstream task.
In one aspect, a system comprises a memory and at least one processor coupled to the memory, and operative to perform operations comprising obtaining a pretrained machine learning model; configuring a Bayesian meta-model 216 to cooperate with the pretrained machine learning model 204, the Bayesian meta-model 216 being configured to quantify different kinds of uncertainties associated with the pretrained machine learning model 204, wherein the Bayesian meta-model 216 comprises a plurality of linear layers 212-1, 212-2, 212-3 attached to different intermediate features of the pretrained machine learning model 204 with a final linear layer 220 generating a Dirichlet distribution; receiving multiple intermediate features extracted from the pretrained machine learning model 204 as inputs; generating a Dirichlet distribution over a probability simplex as output, wherein the Dirichlet distribution is parameterized by the Bayesian meta-model 216 and allows quantification of uncertainty of model prediction; and using the Bayesian meta-model 216 and the pretrained machine learning model 204 in a downstream task.
In one example embodiment, the configuring of the Bayesian meta-model to cooperate with the pretrained machine learning model is performed without modifying the pretrained machine learning model.
In one example embodiment, the Bayesian meta-model is trained using a Bayesian variational loss on a training dataset and using a validation process to ensure the Bayesian meta-model achieves optimal uncertainty quantification performance.
In one example embodiment, the validation process comprises generating a noisy validation set by adding noise to validation data; using the noisy validation data as an approximation of out-of-distribution (OOD) data; evaluating uncertainty quantification performance using the noisy validation set; and selecting the optimal uncertainty quantification performance.
In one example embodiment, meta-model training is stopped when an out-of-distribution (OOD) detection performance achieves its maximum based on predefined metrics.
In one example embodiment, the final linear layer combines all the intermediate features.
In one example embodiment, the Dirichlet distribution comprises a concentrated Dirichlet distribution over the probability simplex corresponding to confident prediction and comprises a diffused Dirichlet distribution corresponding to uncertain predictions.
In one example embodiment, the linear layers consist only of fully connected layers and activation functions.
In one example embodiment, the generating the Dirichlet distribution is based on a loss function and the loss function uses a likelihood term to encourage sharpening of a categorical distribution around a true class on the simplex and uses a KL-divergence term as a regularizer to prevent overconfident prediction, and a hyper-parameter is supplied to balance a trade-off between the sharpening of the categorical distribution and the prevention of the overconfident prediction. Further details regarding the loss function are provided above with regard to equations (4) and (5) and the accompanying text.
In one example embodiment, a physical system, such as an autonomous vehicle, is controlled using the Bayesian meta-model in conjunction with the pretrained machine learning model.
In one example embodiment, an operator such as a driver is alerted to assume control of the physical system such as an autonomous vehicle in response to a confidence level generated by the Bayesian meta-model in conjunction with the pretrained machine learning model being less than a given threshold. Generally, refer to
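By way of example only, such an alerting rule may be sketched as follows (the use of the maximum expected class probability as the confidence score and the 0.8 threshold are illustrative assumptions):

```python
def maybe_alert_operator(alpha, threshold=0.8):
    """Request human control when meta-model confidence is low (sketch).

    alpha: iterable of Dirichlet concentration parameters for one input.
    """
    alpha0 = sum(alpha)
    confidence = max(a / alpha0 for a in alpha)   # max expected class probability
    if confidence < threshold:
        return "ALERT: operator should assume control"
    return "autonomous operation may continue"
```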
In one example embodiment, a predicted label distribution using the Dirichlet meta-model is given by an expected categorical distribution: $P(y = c \mid \Phi(x); w_g) = \mathbb{E}_{q(\pi \mid \Phi(x); w_g)}[\pi_c] = \alpha_c(x) / \alpha_0(x)$, where $\alpha_0 = \sum_{c=1}^{K} \alpha_c$ is a precision of the Dirichlet distribution.
In one example embodiment, given an input sample $x$, a representation of the multiple intermediate features extracted from the pretrained machine learning model is denoted as $\{\Phi_j(x)\}_{j=1}^{m}$; for each intermediate base-feature $\Phi_j$, a corresponding linear layer constructs a low-dimensional meta-feature $g_j(\Phi_j(x))$, and the final linear layer of the meta-model takes the multiple meta-features $\{g_j(\Phi_j(x))\}_{j=1}^{m}$ as inputs and generates a single output $\tilde{y} = g(\{\Phi_j(x)\}_{j=1}^{m}; w_g) = g_c(\{g_j(\Phi_j(x))\}_{j=1}^{m}; w_{g_c})$.
In one example embodiment, an image of a license plate is classified using the Bayesian meta-model in conjunction with the pretrained machine learning model. In one example embodiment, a user is alerted to inspect an image of a license plate in response to a confidence level generated by the Bayesian meta-model in conjunction with the pretrained machine learning model being less than a given threshold.
Refer now to
Various aspects of the present disclosure are described by narrative text, flowcharts, block diagrams of computer systems and/or block diagrams of the machine logic included in computer program product (CPP) embodiments. With respect to any flowcharts, depending upon the technology involved, the operations can be performed in a different order than what is shown in a given flowchart. For example, again depending upon the technology involved, two operations shown in successive flowchart blocks may be performed in reverse order, as a single integrated step, concurrently, or in a manner at least partially overlapping in time.
A computer program product embodiment (“CPP embodiment” or “CPP”) is a term used in the present disclosure to describe any set of one, or more, storage media (also called “mediums”) collectively included in a set of one, or more, storage devices that collectively include machine readable code corresponding to instructions and/or data for performing computer operations specified in a given CPP claim. A “storage device” is any tangible device that can retain and store instructions for use by a computer processor. Without limitation, the computer readable storage medium may be an electronic storage medium, a magnetic storage medium, an optical storage medium, an electromagnetic storage medium, a semiconductor storage medium, a mechanical storage medium, or any suitable combination of the foregoing. Some known types of storage devices that include these mediums include: diskette, hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or Flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile disk (DVD), memory stick, floppy disk, mechanically encoded device (such as punch cards or pits/lands formed in a major surface of a disc) or any suitable combination of the foregoing. A computer readable storage medium, as that term is used in the present disclosure, is not to be construed as storage in the form of transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide, light pulses passing through a fiber optic cable, electrical signals communicated through a wire, and/or other transmission media. As will be understood by those of skill in the art, data is typically moved at some occasional points in time during normal operations of a storage device, such as during access, de-fragmentation or garbage collection, but this does not render the storage device as transitory because the data is not transitory while it is stored.
Computing environment 100 contains an example of an environment for the execution of at least some of the computer code involved in performing the inventive methods, such as machine learning system 200 incorporating aspects of the invention. In addition to block 200, computing environment 100 includes, for example, computer 101, wide area network (WAN) 102, end user device (EUD) 103, remote server 104, public cloud 105, and private cloud 106. In this embodiment, computer 101 includes processor set 110 (including processing circuitry 120 and cache 121), communication fabric 111, volatile memory 112, persistent storage 113 (including operating system 122 and block 200, as identified above), peripheral device set 114 (including user interface (UI) device set 123, storage 124, and Internet of Things (IoT) sensor set 125), and network module 115. Remote server 104 includes remote database 130. Public cloud 105 includes gateway 140, cloud orchestration module 141, host physical machine set 142, virtual machine set 143, and container set 144.
COMPUTER 101 may take the form of a desktop computer, laptop computer, tablet computer, smart phone, smart watch or other wearable computer, mainframe computer, quantum computer or any other form of computer or mobile device now known or to be developed in the future that is capable of running a program, accessing a network or querying a database, such as remote database 130. As is well understood in the art of computer technology, and depending upon the technology, performance of a computer-implemented method may be distributed among multiple computers and/or between multiple locations. On the other hand, in this presentation of computing environment 100, detailed discussion is focused on a single computer, specifically computer 101, to keep the presentation as simple as possible. Computer 101 may be located in a cloud, even though it is not shown in a cloud in
PROCESSOR SET 110 includes one, or more, computer processors of any type now known or to be developed in the future. Processing circuitry 120 may be distributed over multiple packages, for example, multiple, coordinated integrated circuit chips. Processing circuitry 120 may implement multiple processor threads and/or multiple processor cores. Cache 121 is memory that is located in the processor chip package(s) and is typically used for data or code that should be available for rapid access by the threads or cores running on processor set 110. Cache memories are typically organized into multiple levels depending upon relative proximity to the processing circuitry. Alternatively, some, or all, of the cache for the processor set may be located “off chip.” In some computing environments, processor set 110 may be designed for working with qubits and performing quantum computing.
Computer readable program instructions are typically loaded onto computer 101 to cause a series of operational steps to be performed by processor set 110 of computer 101 and thereby effect a computer-implemented method, such that the instructions thus executed will instantiate the methods specified in flowcharts and/or narrative descriptions of computer-implemented methods included in this document (collectively referred to as “the inventive methods”). These computer readable program instructions are stored in various types of computer readable storage media, such as cache 121 and the other storage media discussed below. The program instructions, and associated data, are accessed by processor set 110 to control and direct performance of the inventive methods. In computing environment 100, at least some of the instructions for performing the inventive methods may be stored in block 200 in persistent storage 113.
COMMUNICATION FABRIC 111 is the signal conduction path that allows the various components of computer 101 to communicate with each other. Typically, this fabric is made of switches and electrically conductive paths, such as the switches and electrically conductive paths that make up busses, bridges, physical input/output ports and the like. Other types of signal communication paths may be used, such as fiber optic communication paths and/or wireless communication paths.
VOLATILE MEMORY 112 is any type of volatile memory now known or to be developed in the future. Examples include dynamic type random access memory (RAM) or static type RAM. Typically, volatile memory 112 is characterized by random access, but this is not required unless affirmatively indicated. In computer 101, the volatile memory 112 is located in a single package and is internal to computer 101, but, alternatively or additionally, the volatile memory may be distributed over multiple packages and/or located externally with respect to computer 101.
PERSISTENT STORAGE 113 is any form of non-volatile storage for computers that is now known or to be developed in the future. The non-volatility of this storage means that the stored data is maintained regardless of whether power is being supplied to computer 101 and/or directly to persistent storage 113. Persistent storage 113 may be a read only memory (ROM), but typically at least a portion of the persistent storage allows writing of data, deletion of data and re-writing of data. Some familiar forms of persistent storage include magnetic disks and solid state storage devices. Operating system 122 may take several forms, such as various known proprietary operating systems or open source Portable Operating System Interface-type operating systems that employ a kernel. The code included in block 200 typically includes at least some of the computer code involved in performing the inventive methods.
PERIPHERAL DEVICE SET 114 includes the set of peripheral devices of computer 101. Data communication connections between the peripheral devices and the other components of computer 101 may be implemented in various ways, such as Bluetooth connections, Near-Field Communication (NFC) connections, connections made by cables (such as universal serial bus (USB) type cables), insertion-type connections (for example, secure digital (SD) card), connections made through local area communication networks and even connections made through wide area networks such as the internet. In various embodiments, UI device set 123 may include components such as a display screen, speaker, microphone, wearable devices (such as goggles and smart watches), keyboard, mouse, printer, touchpad, game controllers, and haptic devices. Storage 124 is external storage, such as an external hard drive, or insertable storage, such as an SD card. Storage 124 may be persistent and/or volatile. In some embodiments, storage 124 may take the form of a quantum computing storage device for storing data in the form of qubits. In embodiments where computer 101 is required to have a large amount of storage (for example, where computer 101 locally stores and manages a large database) then this storage may be provided by peripheral storage devices designed for storing very large amounts of data, such as a storage area network (SAN) that is shared by multiple, geographically distributed computers. IoT sensor set 125 is made up of sensors that can be used in Internet of Things applications. For example, one sensor may be a thermometer and another sensor may be a motion detector.
NETWORK MODULE 115 is the collection of computer software, hardware, and firmware that allows computer 101 to communicate with other computers through WAN 102. Network module 115 may include hardware, such as modems or Wi-Fi signal transceivers, software for packetizing and/or de-packetizing data for communication network transmission, and/or web browser software for communicating data over the internet. In some embodiments, network control functions and network forwarding functions of network module 115 are performed on the same physical hardware device. In other embodiments (for example, embodiments that utilize software-defined networking (SDN)), the control functions and the forwarding functions of network module 115 are performed on physically separate devices, such that the control functions manage several different network hardware devices. Computer readable program instructions for performing the inventive methods can typically be downloaded to computer 101 from an external computer or external storage device through a network adapter card or network interface included in network module 115.
WAN 102 is any wide area network (for example, the internet) capable of communicating computer data over non-local distances by any technology for communicating computer data, now known or to be developed in the future. In some embodiments, the WAN 102 may be replaced and/or supplemented by local area networks (LANs) designed to communicate data between devices located in a local area, such as a Wi-Fi network. The WAN and/or LANs typically include computer hardware such as copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and edge servers.
END USER DEVICE (EUD) 103 is any computer system that is used and controlled by an end user (for example, a customer of an enterprise that operates computer 101), and may take any of the forms discussed above in connection with computer 101. EUD 103 typically receives helpful and useful data from the operations of computer 101. For example, in a hypothetical case where computer 101 is designed to provide a recommendation to an end user, this recommendation would typically be communicated from network module 115 of computer 101 through WAN 102 to EUD 103. In this way, EUD 103 can display, or otherwise present, the recommendation to an end user. In some embodiments, EUD 103 may be a client device, such as thin client, heavy client, mainframe computer, desktop computer and so on.
REMOTE SERVER 104 is any computer system that serves at least some data and/or functionality to computer 101. Remote server 104 may be controlled and used by the same entity that operates computer 101. Remote server 104 represents the machine(s) that collect and store helpful and useful data for use by other computers, such as computer 101. For example, in a hypothetical case where computer 101 is designed and programmed to provide a recommendation based on historical data, then this historical data may be provided to computer 101 from remote database 130 of remote server 104.
PUBLIC CLOUD 105 is any computer system available for use by multiple entities that provides on-demand availability of computer system resources and/or other computer capabilities, especially data storage (cloud storage) and computing power, without direct active management by the user. Cloud computing typically leverages sharing of resources to achieve coherence and economies of scale. The direct and active management of the computing resources of public cloud 105 is performed by the computer hardware and/or software of cloud orchestration module 141. The computing resources provided by public cloud 105 are typically implemented by virtual computing environments that run on various computers making up the computers of host physical machine set 142, which is the universe of physical computers in and/or available to public cloud 105. The virtual computing environments (VCEs) typically take the form of virtual machines from virtual machine set 143 and/or containers from container set 144. It is understood that these VCEs may be stored as images and may be transferred among and between the various physical machine hosts, either as images or after instantiation of the VCE. Cloud orchestration module 141 manages the transfer and storage of images, deploys new instantiations of VCEs and manages active instantiations of VCE deployments. Gateway 140 is the collection of computer software, hardware, and firmware that allows public cloud 105 to communicate through WAN 102.
Some further explanation of virtualized computing environments (VCEs) will now be provided. VCEs can be stored as “images.” A new active instance of the VCE can be instantiated from the image. Two familiar types of VCEs are virtual machines and containers. A container is a VCE that uses operating-system-level virtualization. This refers to an operating system feature in which the kernel allows the existence of multiple isolated user-space instances, called containers. These isolated user-space instances typically behave as real computers from the point of view of programs running in them. A computer program running on an ordinary operating system can utilize all resources of that computer, such as connected devices, files and folders, network shares, CPU power, and quantifiable hardware capabilities. However, programs running inside a container can only use the contents of the container and devices assigned to the container, a feature which is known as containerization.
PRIVATE CLOUD 106 is similar to public cloud 105, except that the computing resources are only available for use by a single enterprise. While private cloud 106 is depicted as being in communication with WAN 102, in other embodiments a private cloud may be disconnected from the internet entirely and only accessible through a local/private network. A hybrid cloud is a composition of multiple clouds of different types (for example, private, community or public cloud types), often respectively implemented by different vendors. Each of the multiple clouds remains a separate and discrete entity, but the larger hybrid cloud architecture is bound together by standardized or proprietary technology that enables orchestration, management, and/or data/application portability between the multiple constituent clouds. In this embodiment, public cloud 105 and private cloud 106 are both part of a larger hybrid cloud.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.