This application claims priority under 35 U.S.C. § 119(a) to Korean Patent Application No. 10-2023-0178447, filed on Dec. 11, 2023, with the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.
The present disclosure relates to an apparatus and method for calibrating confidence of an artificial neural network, and more particularly, to an apparatus and method for calibrating confidence based on adaptive conditional label smoothing.
Currently, artificial neural networks are widely used in various applications such as image processing, speech recognition, and text-to-speech conversion.
In such applications, the artificial neural network outputs, together with its prediction, a confidence value indicating how reliable that prediction is.
As such, when an artificial neural network is actually applied, decisions are made based on the confidence value, so the network must output an accurate confidence value. For example, if the confidence value for the class (c) estimated by an image recognition neural network is 0.9, it should mean that the input image belongs to the estimated class (c) with a 90% probability.
In addition, when the network outputs accurate confidence values, it is easy to decide in real situations whether to follow the network's predictions based on those confidence values. In other words, a reliable artificial neural network model should have matching prediction accuracy and confidence for the input.
However, most current image recognition networks using deep learning suffer from the problem of overconfident prediction, where the confidence value is too large compared to the accuracy. This is because the softmax cross entropy loss Lce is used in the process of training the network.
Currently, when training an image recognition network, learning is mostly performed using a one-hot label distribution of dimension C as the target probability distribution q, where the probability (qj, j=y) for the correct class (here, y as an example) labeled as the ground truth is set to 1, and the probability (qj, j≠y) for the remaining false classes is set to 0.
In addition, the neural network model outputs a logit vector z for the training data, a probability vector p is obtained from the logit vector z through a softmax operation, and training is performed so that the probability vector p follows the target probability distribution q.
At this time, the cross entropy loss Lce can be calculated according to Equation 1.
In Equation 1, the class-wise estimated probability (pj) of the probability vector p according to the element zj of the logit vector z is expressed as Equation 2.
In addition, the gradient of the cross entropy loss Lce for backpropagation during learning can be calculated using Equation 3.
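Since Equations 1 to 3 are not reproduced in this text, the conventional forms consistent with the surrounding description (softmax cross entropy loss, softmax probability, and its gradient with respect to the logits) are sketched below for reference; they are a reconstruction, not the equation images of the original filing.

```latex
% Conventional forms consistent with the description of Equations 1-3
% (a reconstruction; the original equation images are not reproduced here).
\begin{align}
L_{ce} &= -\sum_{j=1}^{C} q_j \log p_j              && \text{(cf. Equation 1)} \\
p_j    &= \frac{\exp(z_j)}{\sum_{k=1}^{C}\exp(z_k)} && \text{(cf. Equation 2)} \\
\frac{\partial L_{ce}}{\partial z_j} &= p_j - q_j   && \text{(cf. Equation 3)}
\end{align}
```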
In other words, training is performed so that the class-wise estimated probability (pj) of the probability vector p and the class-wise target probability (qj) of the target probability distribution q become the same (pj−qj=0).
In this process, the softmax cross entropy loss Lce drives the correct class estimate probability (py) toward the correct class target probability (qy) of 1 in the target probability distribution q. However, since the correct class estimate probability (py) can never actually reach 1, it is trained to increase continuously. On the other hand, the false class estimate probability (pj, j≠y) for the false classes is trained to decrease continuously by following the false class target probability (qj, j≠y) in the target probability distribution q. This causes the artificial neural network to overfit and make overconfident predictions.
An object of the present disclosure is to provide an apparatus and method for calibrating confidence, capable of preventing the overconfident prediction problem by smoothing the target probabilities for the correct class and the false classes of a target probability distribution according to a one-hot label.
Another object of the present disclosure is to provide an apparatus and method for calibrating confidence, capable of performing adaptive conditional smoothing for a target probability distribution by providing a smoothing function that adjusts the level of smoothing of a target probability distribution and an indicator function that determines whether to smooth or not, and applying the provided smoothing function and indicator function.
According to one embodiment of the present disclosure, an apparatus for calibrating confidence includes: a memory; and a processor that executes at least a part of an operation according to a program stored in the memory, wherein the processor determines a correct logit element and a false logit element among a plurality of logit elements of a logit vector obtained by a neural network model performing a neural network operation on the training data, based on a correct class and a false class determined from a target probability distribution including target probabilities for each of a plurality of classes according to ground truth of the training data as elements, and determines whether to calibrate and a calibration value for a plurality of target probabilities of the target probability distribution based on a difference in value between the correct logit element and the false logit element.
The processor may perform calibration for the correct class when the value obtained by subtracting the value of the false logit element having the minimum value among the remaining false logit elements from the value of the correct logit element is greater than or equal to the margin value.
The processor may perform calibration for the false logit element when the value obtained by subtracting the value of each false logit element from the value of the correct logit element for the false class is greater than or equal to the margin value.
The processor may obtain a correct calibration value by weighting the difference value, obtained by subtracting the margin value and the false logit element having the minimum value among the remaining false logit elements from the value of the correct logit element for the correct class, with a first weight.
The processor may perform calibration, when it is determined that calibration is to be performed for the correct class, by subtracting the determined correct calibration value from the correct target probability among the plurality of target probabilities of the target probability distribution.
The processor may obtain a false calibration value by weighting the difference value obtained by subtracting the value of each false logit element and the margin value from the value of the correct logit element, for the false class, with a second weight.
The processor may perform calibration, when it is determined that calibration is to be performed for a false class, by adding the determined false calibration value to the false target probability among the plurality of target probabilities of the target probability distribution.
The processor may calculate the gradient of the loss for training the neural network model by weighting the calculated difference value with a first weight, when the difference value, obtained by subtracting the margin value and the value of the false logit element having the minimum value among the remaining false logit elements from the value of the correct logit element, for the correct class, is a positive value.
The processor may calculate the gradient of the loss for training the neural network model by weighting the calculated difference value with a negative second weight, when the difference value, obtained by subtracting the margin value and the value of each false logit element from the value of the correct logit element, for the false class, is a positive value.
The target probability distribution may be configured in a one-hot label format where the target probability for the correct class has a probability value of 1 and the target probability for the false class has a probability value of 0.
According to another embodiment of the present disclosure, a method for calibrating confidence is performed by a processor, the method comprising the steps of: determining a correct logit element and a false logit element among a plurality of logit elements of a logit vector obtained by a neural network model performing a neural network operation on the training data, based on a correct class and a false class determined from a target probability distribution including target probabilities for each of a plurality of classes according to ground truth of the training data as elements; and determining whether to calibrate and a calibration value for a plurality of target probabilities of the target probability distribution based on a difference in value between the correct logit element and the false logit element.
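For illustration, the decision logic summarized above can be sketched as follows; the function and variable names (z, y, M, lam1, lam2) and the use of NumPy are assumptions made for this sketch and are not part of the claimed apparatus.

```python
import numpy as np

def calibration_decision(z, y, M, lam1, lam2):
    """Sketch: decide whether to calibrate each target probability and by how much,
    based on differences between the correct logit element and the false logit elements."""
    C = len(z)
    z_y = z[y]                                   # correct logit element
    false_idx = [j for j in range(C) if j != y]  # false classes
    z_min_false = min(z[j] for j in false_idx)   # false logit element with the minimum value

    calibrate = np.zeros(C, dtype=bool)
    value = np.zeros(C)

    # Correct class: calibrate when (z_y - minimum false logit) >= margin M.
    if z_y - z_min_false >= M:
        calibrate[y] = True
        value[y] = lam1 * (z_y - z_min_false - M)   # correct calibration value (to be subtracted)

    # False classes: calibrate when (z_y - z_j) >= margin M.
    for j in false_idx:
        if z_y - z[j] >= M:
            calibrate[j] = True
            value[j] = lam2 * (z_y - z[j] - M)      # false calibration value (to be added)

    return calibrate, value
```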
The apparatus and method for calibrating confidence of the present disclosure provide a smoothing function for adjusting the level of smoothing and an indicator function for determining whether to smooth, and apply the provided smoothing function and the indicator function during learning to smooth the target probability for the correct class and the false class of the target probability distribution according to the one-hot label, thereby preventing the occurrence of an overconfident prediction problem.
Hereinafter, specific embodiments of the present disclosure will be described with reference to the drawings. The following detailed description is provided to assist in a comprehensive understanding of the methods, devices and/or systems described herein. However, this is only an example and the present invention is not limited thereto.
In describing the embodiments of the present disclosure, when it is determined that detailed descriptions of known technology related to the present disclosure may unnecessarily obscure the gist of the embodiments, the detailed descriptions thereof will be omitted. The terms used below are defined in consideration of functions in the present disclosure, but may be changed depending on the customary practice or the intention of a user or operator. Thus, the definitions should be determined based on the overall content of the present specification. The terms used herein are only for describing the embodiments, and should not be construed as limitative. Unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. It should be understood that the terms "comprises," "comprising," "includes," and "including," when used herein, specify the presence of stated features, numerals, steps, operations, elements, or combinations thereof, but do not preclude the presence or addition of one or more other features, numerals, steps, operations, elements, or combinations thereof. In addition, terms such as ". . . unit", ". . . er/or", "module" and "block" described in the specification mean a unit for processing at least one function or operation, which may be implemented by hardware, software, or a combination of hardware and software.
Confidence calibration refers to a technique of smoothing the correct class target probability (qy) and the false class target probability (qj, j≠y) in the target probability distribution q to prevent the occurrence of the above-mentioned overconfident prediction problem.
Various smoothing techniques may be considered.
First, in the label smoothing technique, the correct class target probability (qy) of the target probability distribution q is calibrated by subtracting a pre-specified first calibration value (ε1), and the false class target probability (qj, j≠y) is calibrated by adding a pre-specified second calibration value (ε2).
That is, each probability value is calibrated by a pre-specified value so that the difference in probability value between the correct class target probability (qy) and the false class target probability (qj, j≠y) in the target probability distribution q is reduced.
This label smoothing technique has the advantage of being the simplest method and can solve the overconfident prediction problem to some extent, but it smooths the class-wise target probability (qj) of the target probability distribution q regardless of the correct class estimate probability (py) and the false class estimate probability (pj, j≠y) of the probability vector p, so the learning efficiency may be reduced. In particular, even if the correct class target probability (qy) is calibrated to a probability value less than 1, the correct class target probability (qy) required in an actual artificial neural network may be lower than the calibrated probability value in some cases, and conversely, the false class target probability (qj) may be higher than the calibrated probability value in some cases. Therefore, when applying the label smoothing technique, the neural network model may not exhibit the required performance depending on the set first and second calibration values (ε1, ε2).
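For illustration only, one common instantiation of such fixed label smoothing (not necessarily the specific calibration values used here) sets the calibrated targets as follows.

```latex
% A common instantiation of fixed label smoothing (illustrative only):
% the correct class target is lowered by a fixed amount and the
% false class targets are raised by a fixed amount.
q'_y = 1 - \varepsilon_1, \qquad
q'_j = 0 + \varepsilon_2 \quad (j \neq y),
\qquad \text{e.g. } \varepsilon_2 = \frac{\varepsilon_1}{C-1}.
```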
Meanwhile, in the adaptive smoothing technique, the level of calibration applied to the target probability distribution q is not fixed in advance but is adjusted adaptively according to the class-wise estimate probability (pj) of the probability vector p output from the neural network (or, equivalently, the logit element (zj) of the logit vector z).
That is, in the adaptive smoothing technique, calibration may be performed by subtracting a calibration value that increases in proportion to the probability value of the correct class estimate probability (py) from the correct class target probability (qy), and subtracting a calibration value that increases in inverse proportion to the probability value of the false class estimate probability (pj) from the false class target probability (qj).
The adaptive smoothing technique can improve the efficiency of learning by adjusting the level of calibration for smoothing according to the correct class estimate probability (py) and the false class estimate probability (pj) in the probability vector p, and can also train the artificial neural network model to perform operations suitable for actual reality. In the adaptive smoothing technique, smoothing can be performed by setting a smoothing function (f) in advance to perform calibration. However, in order to perform ideal calibration, the smoothing function (f) must be set so that it can linearly and accurately perform the required calibration according to the class-wise estimate probability (pj) of the probability vector p (or the logit element (zj) of the logit vector z).
Meanwhile, in the conditional smoothing technique, whether to smooth each target probability of the target probability distribution q is determined according to whether a predetermined condition is satisfied, so that smoothing is applied only to the target probabilities for which the condition holds.
The above-mentioned label smoothing technique, adaptive smoothing technique, and conditional smoothing technique each have their own advantages and disadvantages, and therefore, in this disclosure, the label smoothing technique, the adaptive smoothing technique, and the conditional smoothing technique are all reflected to perform efficient smoothing. Here, this technique is called Adaptive and Conditional Label Smoothing (hereinafter, "ACLS").
Referring to the drawings, the apparatus for calibrating confidence according to an embodiment may include a training data acquisition module 10, a neural network model 20, and a training module 30.
The neural network model 20 is an artificial neural network model that is the target of training, and may be implemented as various artificial neural networks. The neural network model 20 may be a neural network model trained using a supervised learning method based on training data, and here, it is explained assuming that it is an image recognition neural network as an example.
The training data acquisition module 10 acquires a large amount of training data for training the neural network model 20. Here, the training data may be data labeled with ground truth, which are the result values that the neural network model must estimate.
The training module 30 is a configuration for training a neural network model based on training data acquired by the training data acquisition module 10. The training module 30 may train a neural network model by receiving the logit vector z and the probability vector p output from the neural network model 20 and backpropagating the loss using the ground truth labeled in the training data.
In particular, in the present disclosure, the training module 30 smoothes and calibrates the class-wise target probability (qj) of the target probability distribution q obtained from the ground truth, and trains the neural network model based on the calibrated target probability distribution (q′) according to the calibrated class-wise target probability (q′j), thereby preventing the neural network model from making overconfident predictions.
The training module 30 may include a one-hot label generation module 31, a condition module 32, a smoothing module 33, and a loss backpropagation module 34.
The one-hot label generation module 31 receives the ground truth labeled in the training data acquired by the training data acquisition module 10 and generates a target probability distribution q in the form of a one-hot label, in which the target probability for the correct class according to the ground truth is 1 and the target probabilities for the remaining false classes are 0.
In the present disclosure, the condition module 32 distinguishes between a correct logit element (zj, j=y) and false logit elements (zj, j≠y) among a plurality of logit elements (zj) of the logit vector z applied from the neural network model 20, based on the target probability distribution q in the form of a one-hot label generated by the one-hot label generation module 31, and compares the distinguished correct logit element (zj, j=y) and false logit elements (zj, j≠y) to determine whether to perform smoothing for each class target probability (qj). The condition module 32 may set an indicator function (I(zj)) that determines whether to perform smoothing, as in Equation 4.
Here, the indicator function (I(zj)) determines, for each class, whether to perform smoothing according to the difference between the value of the correct logit element (zy) and the values of the false logit elements with respect to a margin value (M).
According to the indicator function (I(zj)), for the correct class (j=y), smoothing is performed if the difference between the value of the correct logit element (zy) and the value of the false logit element (zk) having the minimum value among the remaining false logit elements (zk, k≠y) is greater than or equal to the margin value (M).
In addition, for the false class (j≠y), if the difference between the value of the correct logit element (zy) and the value of the false logit element (zj) is greater than or equal to the margin value (M), smoothing is performed.
Using the indicator function (I(zj)) of Equation 4, the condition module 32 may generate an indicator vector indicating whether to perform smoothing for each of the plurality of class-wise target probabilities (qj) of the target probability distribution q, and output it to the loss backpropagation module 34.
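Since Equation 4 is not reproduced in this text, a sketch of an indicator function consistent with the two conditions above is given below; the symbol I(zj) is assumed here for illustration.

```latex
% Sketch of the indicator function described for Equation 4
% (the notation I(z_j) is an assumption for illustration).
I(z_j) =
\begin{cases}
1, & j = y \ \text{and} \ z_y - \min_{k \neq y} z_k \ge M \\[4pt]
1, & j \neq y \ \text{and} \ z_y - z_j \ge M \\[4pt]
0, & \text{otherwise}
\end{cases}
```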
The smoothing module 33 also, similarly to the condition module 32, distinguishes between a correct logit element (zj, j=y) and false logit elements (zj, j≠y) among a plurality of logit elements (zj) of the logit vector z applied from the neural network model 20, based on the target probability distribution q in the form of a one-hot label generated by the one-hot label generation module 31, and compares the distinguished correct logit element (zj, j=y) and false logit elements (zj, j≠y) to adjust the smoothing level for each class target probability (qj).
The smoothing module 33 may set a smoothing function (f(zj)) that adjusts the smoothing level, and input a plurality of logit elements (zj) of the logit vector z into the set smoothing function (f(zj)) to obtain a calibration value indicating the smoothing level for each of a plurality of class-wise target probabilities (qj) of the target probability distribution q.
Here, the smoothing function (f(zj)) can be set as in Equation 5.
According to the smoothing function (f(zj)) of Equation 5, the smoothing module 33 may obtain a correct class calibration value by weighting the difference value, obtained by subtracting the value of the false logit element (zk) having the minimum value among the remaining false logit elements (zk, k≠y) and the margin value (M) from the value of the correct logit element (zy), for the correct class (j=y), with a first weight (λ1).
In addition, for the false class (j≠y), the false class calibration value may be obtained by weighting the difference value, obtained by subtracting the false logit element (zj) and the margin value (M) from the correct logit element (zy), with a second weight (λ2).
Since the smoothing function (f(zj)) is implemented as a linear function, the class target probability (qj) may increase or decrease at a constant rate depending on the size of the plurality of logit elements (zj).
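Equation 5 is likewise not reproduced in this text; a linear form consistent with the description above would be the following sketch.

```latex
% Sketch of the linear smoothing function described for Equation 5
% (a reconstruction, not the filed equation).
f(z_j) =
\begin{cases}
\lambda_1 \left( z_y - \min_{k \neq y} z_k - M \right), & j = y \\[4pt]
\lambda_2 \left( z_y - z_j - M \right),                 & j \neq y
\end{cases}
```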
Here, the condition module 32 and the smoothing module 33 do not consider the class target probability (qj) of the target probability distribution q, because the probability values of the correct class target probability (qy) and the false class target probability (qj, j≠y) are 1 and 0, respectively, so the class target probability (qj) is already reflected simply by distinguishing between the correct class (j=y) and the false class (j≠y).
The smoothing module 33 may generate a smoothing vector composed of calibration values for each class obtained by using the smoothing function (f(zj)) of Equation 5 as elements, and output it to the loss backpropagation module 34.
The loss backpropagation module 34 calibrates the plurality of class target probabilities (qj) of the target probability distribution q generated by the one-hot label generation module 31 based on the probability vector p obtained by the neural network model 20, the indicator vector obtained by the condition module 32, and the smoothing vector obtained by the smoothing module 33, and trains the neural network model 20 by backpropagating the gradient value of the loss corrected based on the calibrated class target probability (q′j) to the neural network model 20.
Since the loss backpropagation module 34 performs training using the calibrated class target probability (q′j) obtained by smoothing, the loss backpropagation module 34 must consider not only the cross entropy loss (Lce) but also the smoothing loss (LREG) due to smoothing. Therefore, the total loss (L) can be expressed as Equation 6.
In addition, since the gradient for the cross entropy loss (Lce) is expressed as Equation 3, the gradient (∂L/∂zj) for the total loss (L) of Equation 6 can be expressed as Equation 7.
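For reference, under the assumption that Equation 6 simply adds the smoothing loss (LREG) to the cross entropy loss (Lce), a conventional sketch of Equations 6 and 7 (not the equations of the original filing) is:

```latex
% Sketch consistent with the description of Equations 6 and 7:
% the total loss adds a smoothing (regularization) term to the
% cross entropy loss, and its gradient adds the corresponding term
% to the cross entropy gradient of Equation 3.
L = L_{ce} + L_{REG}, \qquad
\frac{\partial L}{\partial z_j}
  = \underbrace{p_j - q_j}_{\text{Equation 3}}
  + \frac{\partial L_{REG}}{\partial z_j}.
```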
As described above, the loss backpropagation module 34 may perform smoothing based on the indicator vector and the smoothing vector, and the indicator vector and the smoothing vector are applied separately for the correct class and the false class using the indicator function (I(zj)) of Equation 4 and the smoothing function (f(zj)) of Equation 5.
The loss backpropagation module 34 may calculate the gradient (∂L/∂zj) for the loss (L) using the probability vector p, the target probability distribution q, the indicator vector, and the smoothing vector, and backpropagate it.
In addition, since the indicator function (I(zj)) and the smoothing function (f(zj)) are applied as described above, the gradient (∂L/∂zj) for the loss (L) of Equation 8 can be organized into Equation 9.
Referring to Equation 9, the loss backpropagation module 34 may obtain the gradient (∂L/∂zj) of the loss (L) by weighting the calculated difference value with a first weight (λ1), when the difference value obtained by subtracting the margin value (M) and the value of the false logit element (zk) having the minimum value among the remaining false logit elements (zk, k≠y) from the value of the correct logit element (zy), for the correct class (j=y), is a positive value.
In addition, for the false class (j≠y), when the difference value, obtained by subtracting the false logit element (zj) and the margin value (M) from the correct logit element (zy), is a positive value, the gradient (∂L/∂zj) of the loss (L) may be obtained by weighting the calculated difference value with a negative second weight (−λ2).
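A plausible piecewise sketch consistent with the two cases just described, combining the cross entropy gradient of Equation 3 with the weighted difference values and writing (x)+ for max(0, x), is given below; this is a reconstruction for readability, not Equation 9 of the filing itself.

```latex
% Sketch of the gradient terms described for Equation 9,
% using (x)_+ = max(0, x); a reconstruction, not the filed equation.
\frac{\partial L}{\partial z_j} =
\begin{cases}
(p_j - q_j) + \lambda_1 \left( z_y - \min_{k \neq y} z_k - M \right)_+, & j = y \\[4pt]
(p_j - q_j) - \lambda_2 \left( z_y - z_j - M \right)_+,                 & j \neq y
\end{cases}
```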
The loss backpropagation module 34 may train the neural network model 20 to prevent overconfident prediction by backpropagating the gradient (∂L/∂zj) of the loss (L) obtained according to Equation 9 to the neural network model 20.
Here, the loss (L) can be expressed by Equation 10.
However, since the loss backpropagation module 34 performs training by backpropagating the gradient (∂L/∂zj) of the loss (L) to the neural network model 20, the loss (L) itself does not need to be explicitly calculated.
In the above-described configuration, the condition module 32, the smoothing module 33, and the loss backpropagation module 34 can be seen as operating as an apparatus for calibrating confidence. For the convenience of understanding, the condition module 32, the smoothing module 33, and the loss backpropagation module 34 through which the training module 30 operates are illustrated separately. However, since the loss backpropagation module 34 obtains the gradient (∂L/∂zj) of the loss (L) according to Equation 9 and backpropagates it, the condition module 32 and the smoothing module 33 may be omitted. Here, for the convenience of understanding, the one-hot label generation module 31 is described as generating a target probability distribution q in the form of a one-hot label based on the ground truth applied from the training data acquisition module 10, but in many cases, the training data acquisition module 10 already transmits the ground truth in the form of a one-hot label. Therefore, the one-hot label generation module 31 may also be omitted. That is, the training module 30 including the one-hot label generation module 31, the condition module 32, the smoothing module 33, and the loss backpropagation module 34 can be called an apparatus for calibrating confidence.
In the illustrated embodiment, respective configurations may have different functions and capabilities in addition to those described above, and may include additional configurations in addition to those described above. In addition, in an embodiment, each configuration may be implemented using one or more physically separated devices, or may be implemented by one or more processors or a combination of one or more processors and software, and may not be clearly distinguished in specific operations unlike the illustrated example.
In addition, the apparatus for calibrating confidence shown in the drawings may be implemented to include a memory and a processor that executes at least a part of the operations according to a program stored in the memory.
In addition, the apparatus for calibrating confidence may be mounted, as software, hardware, or a combination thereof, in a computing device or server provided with hardware elements. The computing device or server may refer to various devices including all or some of a communication device for communicating with various devices and wired/wireless communication networks such as a communication modem, a memory which stores data for executing programs, and a microprocessor which executes programs to perform operations and commands.
Referring to the drawings, in the method for calibrating confidence, training data labeled with ground truth is first acquired, and a target probability distribution q in the form of a one-hot label is obtained based on the ground truth of the acquired training data (51).
Once the target probability distribution q is obtained, the training data is input into the neural network model 20 that is the target of training, and the neural network model 20 obtains a logit vector z that encodes the training data (52). Here, the logit vector z is composed of a plurality of logit elements (zj) that have values according to each class.
Once a logit vector z composed of a plurality of logit elements (zj) is obtained, it is determined whether each logit element (zj) is a correct logit element (zy) for the correct class or a false logit element (zj) for the false class according to a plurality of target probabilities (qj), which are elements of the target probability distribution q (53).
Once the correct logit element (zy) and the false logit element (zj) are distinguished, as in Equation 4, an indicator vector is obtained that indicates whether to perform calibration, i.e. smoothing, for the target probability (qj) of the target probability distribution q according to the difference in the values of the correct logit element (zy) and the false logit element (zj) (54).
Here, the indicator vector may be generated so as to instruct that, for the correct class (j=y), smoothing is to be performed when the difference between the value of the correct logit element (zy) and the value of the false logit element (zk) having the minimum value among the remaining false logit elements (zk, k≠y) is greater than or equal to the margin value (M), and that, for the false class (j≠y), smoothing is to be performed when the difference between the value of the correct logit element (zy) and the value of the false logit element (zj) is greater than or equal to the margin value (M).
Meanwhile, once the correct logit element (zy) and the false logit elements (zj) are distinguished, a smoothing vector consisting of calibration values representing the smoothing level of the target probability (qj) of the target probability distribution q is obtained according to the difference in the values of the correct logit element (zy) and the false logit element (zj), as in Equation 5 (55).
In the smoothing vector, the calibration value for the correct class (j=y) may be obtained by weighting the difference value, obtained by subtracting the false logit element (zk) with the minimum value among the remaining false logit elements (zk, k≠y) and the margin value (M) from the value of the correct logit element (zy), with a first weight (λ1), and the calibration value for the false class (j≠y) may be obtained by weighting the difference value, obtained by subtracting the false logit element (zj) and the margin value (M) from the correct logit element (zy), with a second weight (λ2).
Once the indicator vector and the smoothing vector are obtained, a plurality of target probabilities (qj) of the target probability distribution q are calibrated according to the obtained indicator vector and smoothing vector (56). At this time, only the target probabilities (qj) specified by the indicator vector among the plurality of target probabilities (qj) may be calibrated, and each target probability specified by the indicator vector may be increased or decreased by the amount specified by the smoothing vector. The smoothing vector may be applied so that the calibration value is subtracted from the correct class target probability (qy) and added to the false class target probability (qj).
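For illustration, steps 53 to 56 can be sketched together as follows; as before, the names, the margin and weight parameters, and the NumPy-based implementation are assumptions of this sketch rather than a prescribed implementation.

```python
import numpy as np

def calibrate_targets(z, y, M, lam1, lam2):
    """Sketch of steps 53-56: build the indicator and smoothing vectors from the
    logit vector z and the correct class y, then calibrate the one-hot targets."""
    C = len(z)
    q = np.zeros(C)
    q[y] = 1.0                                   # one-hot target distribution (step 51)

    z_y = z[y]
    z_min_false = min(z[j] for j in range(C) if j != y)

    indicator = np.zeros(C)                      # whether to smooth each class (step 54)
    smoothing = np.zeros(C)                      # smoothing level per class (step 55)
    for j in range(C):
        if j == y and z_y - z_min_false >= M:
            indicator[j] = 1.0
            smoothing[j] = lam1 * (z_y - z_min_false - M)
        elif j != y and z_y - z[j] >= M:
            indicator[j] = 1.0
            smoothing[j] = lam2 * (z_y - z[j] - M)

    # Step 56: subtract from the correct class target, add to the false class targets.
    q_prime = q.copy()
    q_prime[y] -= indicator[y] * smoothing[y]
    for j in range(C):
        if j != y:
            q_prime[j] += indicator[j] * smoothing[j]
    return q_prime
```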
Once the target probability distribution q is calibrated to obtain the calibrated target probability distribution (q′), the gradient (∂L/∂zj) of the loss (L) is calculated based on the difference between the class-wise estimated probabilities (pj) of the probability vector p obtained from the logit vector z and the calibrated class-wise target probabilities (q′j) of the calibrated target probability distribution (q′), and the neural network model 20 is trained by backpropagating the calculated gradient (∂L/∂zj) of the loss (L) (57).
In the above, it was explained that the indicator vector and the smoothing vector for calibrating the target probability distribution q are first obtained, the target probability distribution q is calibrated based on the obtained indicator vector and smoothing vector, and then the gradient (∂L/∂zj) of the loss (L) is calculated based on the calibrated target probability distribution (q′).
However, as shown in Equation 9, the gradient (∂L/∂zj) of the loss (L) can also be calculated directly from the logit vector z in a manner in which the calibrated target probability distribution (q′) is reflected, without first calculating the calibrated target probability distribution (q′).
Here, the gradient (∂L/∂zj) of the loss (L) reflecting the calibrated target probability distribution (q′) can be obtained, for the correct class (j=y), by weighting the calculated difference value with the first weight (λ1) when the difference value, obtained by subtracting the false logit element (zk) having the minimum value among the remaining false logit elements (zk, k≠y) and the margin value (M) from the value of the correct logit element (zy), is a positive value.
In addition, for the false class (j≠y), when the difference value, obtained by subtracting the false logit element (zj) and the margin value (M) from the correct logit element (zy), is a positive value, the gradient may be obtained by weighting the calculated difference value with the negative second weight (−λ2).
In the illustrated embodiment, respective configurations may have different functions and capabilities in addition to those described below, and may include additional configurations in addition to those described below. The illustrated computing environment 70 may include a computing device 71 to perform the method for calibrating confidence described above.
The computing device 71 includes at least one processor 72, a computer readable storage medium 73 and a communication bus 75. The processor 72 may cause the computing device 71 to operate according to the above-mentioned exemplary embodiment. For example, the processor 72 may execute one or more programs 74 stored in the computer readable storage medium 73. The one or more programs 74 may include one or more computer executable instructions, and the computer executable instructions may be configured, when executed by the processor 72, to cause the computing device 71 to perform operations in accordance with the exemplary embodiment.
The communication bus 75 interconnects various other components of the computing device 71, including the processor 72 and the computer readable storage medium 73.
The computing device 71 may also include one or more input/output interfaces 76 and one or more communication interfaces 77 that provide interfaces for one or more input/output devices 78. The input/output interfaces 76 and the communication interfaces 77 are connected to the communication bus 75. The input/output devices 78 may be connected to other components of the computing device 71 through the input/output interface 76. Exemplary input/output devices 78 may include input devices such as a pointing device (such as a mouse or trackpad), keyboard, touch input device (such as a touchpad or touchscreen), voice or sound input device, sensor devices of various types and/or photography devices, and/or output devices such as a display device, printer, speaker and/or network card. The exemplary input/output device 78 is one component constituting the computing device 71, may be included inside the computing device 71, or may be connected to the computing device 71 as a separate device distinct from the computing device 71.
The present disclosure has been described in detail through a representative embodiment, but those of ordinary skill in the art to which the art pertains will appreciate that various modifications and other equivalent embodiments are possible from this. Therefore, the true technical protection scope of the present invention should be determined by the technical spirit set forth in the appended scope of claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 10-2023-0178447 | Dec 2023 | KR | national |