This application claims priority to European Application No. 20181134.6, having a filing date of Jun. 19, 2020, the entire contents of which are hereby incorporated by reference.
The following relates to a computer-implemented method for post-processing output data of a classifier. Further, the following relates to a corresponding technical unit and a computer program product.
Artificial intelligence ("AI") systems for decision making in dynamically changing environments are known from the conventional art. Such AI systems require not only a high predictive power, but also uncertainty awareness. A meaningful and trustworthy predictive uncertainty is particularly important for real-world applications where the distribution of input samples can drift away from the training distribution. Continuously monitoring model performance and reliability under such domain drift scenarios can be facilitated by well-calibrated confidence scores: that is, if model accuracy decreases due to shifts in the input distribution, the confidence scores change in a coordinated fashion, reflecting the true correctness likelihood of a prediction.
Previous attempts to obtain well-calibrated estimates of predictive uncertainties have focused on training intrinsically uncertainty-aware probabilistic neural networks or post-processing unnormalized logits to achieve in-domain calibration. However, the known approaches cannot provide consistently well-calibrated predictions under dataset shifts.
An aspect relates to a computer-implemented method for post-processing output data of a classifier in an efficient and reliable manner.
This problem is solved by a computer-implemented method for post-processing output data of a classifier, comprising the steps:
Accordingly, embodiments of the invention are directed to a method for post-processing output data of a classifier in the context of machine learning. In other words, the method is directed to a post-processing algorithm. In an embodiment, the output logits of the classifier are post-processed.
Thereby, the classifier is a trained machine learning model, in particular an AI ("Artificial Intelligence") model. Exemplary classifiers are listed further below, including neural networks.
In the first steps a. and b., the input data sets, namely the validation data set and the perturbation levels, are provided or received as input for step c.
The validation data set comprises a set of labelled sample pairs, also referred to as samples. Thereby, the validation set comes from the same distribution as the training set. In the context of machine learning, each sample is a pair. The pair comprises an input object, in particular a vector or matrix, and a desired output value or label (also called the supervisory signal). According to this, the model input can be equally referred to as input object and the model output can be equally referred to as output value or label.
The perturbation levels can be interpreted as perturbation strength. Thereby, a perturbation level quantifies how far away from the training distribution a perturbed sample is. The perturbation levels are chosen to span the entire spectrum of domain shift, from in-domain to truly out-of-domain (OOD; for OOD samples a model has random accuracy). The perturbation levels can be denoted as values epsilon, e.g., between 0 and 1. The perturbation levels can be randomly sampled or selected via alternative methods.
The input data sets can be received via one or more interfaces and/or can be stored in a storage unit for data storage and data transmission from the storage unit to a computing unit with respective interfaces for data transmission. The transmission includes receiving and sending data in two directions.
Next, in step c., the perturbation method is applied to the received input data sets, resulting in perturbed sample pairs. In other words, perturbed sample pairs of varying perturbation strength are generated based on the validation data set, in particular using the Fast Gradient Sign Method (FGSM).
In the next steps d. to f., the post-processing of the output data of the classifier is performed. The post-processing can be parametric or non-parametric. Accordingly, a monotonic function, such as piecewise temperature scaling, can be used to transform the unnormalized logits of the classifier into post-processed logits.
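For illustration, a minimal sketch of plain (single-temperature) scaling as such a monotonic transform of the unnormalized logits is given below; the piecewise and range-adaptive variants discussed further down refine this idea, and the numeric values here are arbitrary examples:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def temperature_scale(logits, T):
    """Monotonic transform of unnormalized logits: divide by a scalar temperature T > 0.
    The predicted class (and hence the accuracy) is unchanged; only the sharpness
    of the resulting softmax distribution is affected."""
    return np.asarray(logits, dtype=float) / T

# A temperature T > 1 softens over-confident predictions.
raw = np.array([4.0, 1.0, 0.5])
print(softmax(raw))                           # over-confident scores
print(softmax(temperature_scale(raw, 2.5)))   # post-processed scores
```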
In more detail, a post-processing model is determined based on the plurality of perturbed sample pairs. In other words, the post-processing model is trained. Thereby, optimizers such as Nelder-Mead can be applied together with calibration metrics such as the log likelihood, the Brier score and the expected calibration error (ECE).
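As an illustration of one such calibration metric, the following is a minimal sketch of the expected calibration error, assuming confidence scores, predicted classes and true labels as inputs; the equal-width binning with 15 bins is a common convention and an assumption here, not prescribed above:

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=15):
    """ECE: weighted average gap between mean confidence and accuracy per confidence bin."""
    confidences = np.asarray(confidences, dtype=float)
    predictions = np.asarray(predictions)
    labels = np.asarray(labels)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            accuracy = (predictions[in_bin] == labels[in_bin]).mean()
            confidence = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(accuracy - confidence)
    return ece
```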
Then, the determined post-processing model is applied to testing data to post-process the output data of the classifier. This step of applying the post-processing model can be repeated, in particular whenever a new classification is made, throughout the life-cycle of the model. In other words, the trained post-processing model can be applied any time a prediction is made. Hence, once the post-processing model is trained, no more perturbed sample pairs are needed.
In the last step, the post-processed output data of the classifier is provided.
The advantage of the method according to embodiments of the invention is that trained classifiers can be post-processed in an efficient and reliable manner without the need for retraining.
Another advantage is that the method has no negative effect on the accuracy. The method ensures that the classifier is well calibrated not only for in-domain predictions but also yields well-calibrated predictions under domain drift.
In one aspect the classifier is a trained machine learning model selected from the group comprising: SVM, xgboost, random forest and neural network. Accordingly, the classifier or trained machine learning model can be selected in a flexible manner according to the specific application case, underlying technical system and user requirements.
In another aspect the perturbation method is a noise function selected from the group comprising: Fast Gradient Sign Method (FGSM) and Gaussian function. The FGSM has proven to be particularly advantageous: since not only the direction but also the strength of the domain drift that may occur after model deployment remains unknown, the adversarial samples can be generated at a variety of noise levels covering the entire spectrum from in-domain to truly out-of-domain.
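For the Gaussian alternative, a minimal, gradient-free sketch is given below; scaling the standard deviation of isotropic noise by the perturbation level is an assumption made purely for illustration:

```python
import numpy as np

def gaussian_perturb(x, epsilon, rng=None):
    """Gradient-free alternative to FGSM: add isotropic Gaussian noise scaled by the perturbation level epsilon."""
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x, dtype=float)
    return x + epsilon * rng.standard_normal(x.shape)
```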
A further aspect of embodiments of the invention is a technical unit for performing the aforementioned method.
The technical unit may be realized as any device, or any means, for computing, in particular for executing a software, an app, or an algorithm. For example, the unit may comprise a central processing unit (CPU) and a memory operatively connected to the CPU.
The unit may also comprise an array of CPUs, an array of graphical processing units (GPUs), at least one application-specific integrated circuit (ASIC), at least one field-programmable gate array, or any combination of the foregoing. The unit may comprise at least one module which in turn may comprise software and/or hardware. Some, or even all, modules of the unit may be implemented by a cloud computing platform.
A further aspect of embodiments of the invention is a computer program product (non-transitory computer readable storage medium having instructions, which when executed by a processor, perform actions) directly loadable into an internal memory of a computer, comprising software code portions for performing the steps according to the aforementioned method when the computer program product is running on a computer.
Some of the embodiments will be described in detail, with references to the following FIGURES, wherein like designations denote like members, wherein:
The method can be split into three distinct stages, as listed in the following:
Stages 1 and 2 are performed only once. Stage 3 can be performed repeatedly.
A set of samples is generated which covers the entire spectrum from in-domain samples to truly out-of-domain samples in a continuous and representative manner. According to this, the fast gradient sign method (FGSM) is used on the basis of the validation data set with sample pairs to generate perturbed sample pairs S3 with varying perturbation strength. More specifically, for each sample pair in the validation data set, the derivative of the loss is determined with respect to each input dimension and the sign of this gradient is recorded. If the gradient cannot be determined analytically (e.g., for decision trees), one can resort to a 0th-order approximation and determine the gradient using finite differences. Then, noise epsilon is added to each input dimension in the direction of its gradient. For each sample pair, a noise level can be selected at random, such that the adversarial validation set comprises representative samples from the entire spectrum of domain drift, as shown in the pseudo code of Algorithm 1 and its explanation.
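The following is a minimal sketch of such an adversarial validation set construction; it is not a reproduction of Algorithm 1, and the helper loss_grad_fn (returning the gradient of the loss with respect to the input for one labelled sample) as well as its interface are hypothetical:

```python
import numpy as np

def build_adversarial_validation_set(X_val, y_val, epsilons, loss_grad_fn, rng=None):
    """Generate perturbed sample pairs of varying perturbation strength via FGSM.

    X_val, y_val : validation inputs and labels (drawn from the training distribution)
    epsilons     : perturbation levels spanning in-domain to truly out-of-domain
    loss_grad_fn : callable returning d(loss)/d(input) for one labelled sample (hypothetical helper)
    """
    if rng is None:
        rng = np.random.default_rng()
    X_adv, y_adv = [], []
    for x, y in zip(X_val, y_val):
        grad_sign = np.sign(loss_grad_fn(x, y))   # FGSM direction per input dimension
        eps = rng.choice(epsilons)                # one randomly selected perturbation level per pair
        X_adv.append(np.asarray(x, dtype=float) + eps * grad_sign)
        y_adv.append(y)                           # the perturbed sample keeps its original label
    return np.asarray(X_adv), np.asarray(y_adv)
```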
According to an alternative embodiment, the formulation of Algorithm 1 differs in that not only one adversarial sample is generated per sample pair; instead, FGSM is applied for all available epsilons. Thereby, the size of the adversarial validation set can be increased by a factor equal to the size of the set of epsilons. In other words, different perturbation strategies can be used, e.g., based on image perturbation. The advantage is that the method according to embodiments of the invention can be applied to black-box models where it is not possible to compute the gradient.
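A brief sketch of these two variations is given below, under the same assumptions as above; the step size h of the finite-difference (0th-order) approximation and the loss_fn interface are illustrative choices, not values taken from the description:

```python
import numpy as np

def finite_difference_grad_sign(loss_fn, x, y, h=1e-3):
    """0th-order substitute for the analytic gradient sign, usable for black-box models such as decision trees."""
    x = np.asarray(x, dtype=float)
    sign = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step.flat[i] = h
        sign.flat[i] = np.sign(loss_fn(x + step, y) - loss_fn(x - step, y))
    return sign

def perturb_for_all_epsilons(x, y, epsilons, grad_sign):
    """Variant of Algorithm 1: emit one perturbed copy of the sample pair per available epsilon."""
    x = np.asarray(x, dtype=float)
    return [(x + eps * grad_sign, y) for eps in epsilons]
```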
The third stage covers the generation of the post-processing model. According to this, a strictly monotonic parameterized function is used to transform the unnormalized logits of the classifier. For example, Platt scaling, temperature scaling, other parameterizations of a monotonic function, or non-parametric alternatives can be used. In an embodiment according to the following equation, a novel parameterization is used, which adds additional flexibility to known functions by introducing range-adaptive temperature scaling. While in classical temperature scaling a single temperature is used to transform logits across the entire spectrum of outputs, here a range-specific temperature is used for different value ranges.
The following is a formula of an embodiment:
with θ = [θ0, . . . , θ3] parameterizing the temperature T(z_r; θ) and z_r = max(z) − min(z) being the range of an unnormalized logits tuple z. θ0 can be interpreted as an asymptotic dependency on z_r. To ensure a positive output, the following function can be used:

exp_id(x) = x + 1 if x > 0, and exp(x) otherwise.

This parameterized temperature is then used to obtain calibrated confidence scores Q̂_i for sample i based on the unnormalized logits:

σ_SM denotes the softmax function. The parameters θ of the function are then determined by optimizing a calibration metric based on the adversarial validation set. Calibration metrics can be the log likelihood, the Brier score or the expected calibration error, see also Algorithm 2.
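Since the embodiment's equation for T(z_r; θ) is not reproduced in this excerpt, the following sketch uses a hypothetical polynomial-in-z_r parameterization passed through exp_id purely to illustrate the range-adaptive idea; the fitting step follows the description above (a Nelder-Mead search over a calibration metric evaluated on the adversarial validation set), and all function names and initial values are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def exp_id(x):
    """Maps any real value to a positive one: x + 1 for x > 0, exp(x) otherwise (as defined above)."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x + 1.0, np.exp(x))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def range_adaptive_temperature(logits, theta):
    """HYPOTHETICAL stand-in for T(z_r; theta): a cubic polynomial in the logit range z_r
    made positive via exp_id. The embodiment's exact parameterization is not reproduced here."""
    z_r = logits.max(axis=-1) - logits.min(axis=-1)   # range of each unnormalized logits tuple
    return exp_id(theta[0] + theta[1] * z_r + theta[2] * z_r**2 + theta[3] * z_r**3)

def calibrated_confidences(logits, theta):
    """Divide logits by the range-adaptive temperature, apply the softmax sigma_SM, take the top score."""
    logits = np.asarray(logits, dtype=float)
    T = range_adaptive_temperature(logits, theta)[:, None]
    return softmax(logits / T).max(axis=-1)

def negative_log_likelihood(theta, logits, labels):
    """Calibration metric evaluated on the adversarial validation set (Brier score or ECE work as well)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    T = range_adaptive_temperature(logits, theta)[:, None]
    probs = softmax(logits / T)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_post_processing_model(adv_logits, adv_labels):
    """Determine theta by optimizing the calibration metric, e.g. with Nelder-Mead."""
    result = minimize(negative_log_likelihood, x0=np.array([1.0, 0.0, 0.0, 0.0]),
                      args=(adv_logits, adv_labels), method="Nelder-Mead")
    return result.x
```

Once θ has been determined in this way, applying the post-processing model at prediction time amounts to calling calibrated_confidences on the new unnormalized logits; no further perturbed sample pairs are required.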