The present disclosure is related to reducing classification errors in a neural network when an input data set used for training includes mislabeled data.
Neural networks are trained on labeled data. Datasets often contain erroneous labels, also called noisy labels. A neural network trained with noisy labels can be improved by formulating a noise transition matrix (NTM). The NTM relies on the assumption that the dataset contains enough data points known to be correctly labeled or mislabeled to accurately model the transition probabilities from clean labels to noisy labels. However, this assumption does not hold in many real-world applications because label noise (noisy labels) in training datasets is instance-dependent and the underlying noise does not follow a particular distribution (e.g., a uniform distribution).
An artificial intelligence (AI) classifier is trained using supervised training, and the effect of noise in the training data is reduced. The training data includes observed noisy labels. A posterior transition matrix (PTM) is used to minimize, in a statistical sense, a cross entropy between a noisy label and a function of the classifier output. A loss function using the PTM is provided for use in training the classifier. The classifier provides final output predictions with higher accuracy even in the presence of noisy labels. Also, information fusion is included in the classifier training, using the PTM and an estimated noise transition matrix (NTM) to reduce estimation error at the classifier output.
Provided herein is a computer-implemented method of training a neural network, the computer-implemented method comprising: obtaining a data set, the data set comprising a plurality of instances (x1, . . . , xn) and a plurality of labels ({tilde over (y)}1, . . . , {tilde over (y)}n), each label of the plurality of labels corresponding to respective ones of the plurality of instances; training the neural network using the data set to obtain a first neural network; obtaining a first output (f) of the first neural network in response to a first instance (x) of the plurality of instances; obtaining a probability of a noisy label ({circumflex over (P)}) given x; obtaining a first transition matrix (PTM), wherein the obtaining the PTM comprises including in the PTM a term based on the first output and {circumflex over (P)}; and updating the first neural network at a first time based on the PTM to obtain a second neural network.
Also provided herein is an apparatus for training a neural network, the apparatus comprising: one or more processors; and one or more memories, the one or more memories storing instructions configured to cause the apparatus to at least: obtain a data set, the data set comprising a plurality of instances (x1, . . . , xn) and a plurality of labels ({tilde over (y)}1, . . . , {tilde over (y)}n), each label of the plurality of labels corresponding to respective ones of the plurality of instances; train the neural network using the data set to obtain a first neural network; obtain a first output (f) of the first neural network in response to a first instance (x) of the plurality of instances; obtain a probability of a noisy label ({circumflex over (P)}) given x; obtain a first transition matrix (PTM), wherein the obtaining the PTM comprises including in the PTM a term based on the first output and {circumflex over (P)}; and update the first neural network at a first time based on the PTM to obtain a second neural network.
Also provided herein is a non-transitory computer readable medium storing instructions for training a neural network, the instructions configured to cause a computer to at least: obtain a data set, the data set comprising a plurality of instances (x1, . . . , xn) and a plurality of labels ({tilde over (y)}1, . . . , {tilde over (y)}n), each label of the plurality of labels corresponding to respective ones of the plurality of instances; train a neural network using the data set to obtain a first neural network; obtain a first output (f) of the first neural network in response to a first instance (x) of the plurality of instances; obtain a probability of a noisy label ({circumflex over (P)}) given x; obtain a first transition matrix (PTM), wherein the obtaining the PTM comprises including in the PTM a term based on the first output and {circumflex over (P)}; and update the first neural network at a first time based on the PTM to obtain a second neural network.
Also provided herein is a server configured to train a neural network, the server comprising: one or more processors; and a non-transitory computer readable medium storing instructions, the instructions configured to cause the one or more processors to at least: access a training sample comprising an input x and a noisy output label {tilde over (y)}; apply the input x to the neural network and receive an observed output f(x); determine a posterior transition matrix (PTM) associated with the training sample based on the noisy output label {tilde over (y)} and the observed output f(x), wherein the PTM represents a posterior probability of having a clean output label y given the noisy output label {tilde over (y)}; determine a posterior loss based on the PTM; and update the neural network based on the posterior loss.
In some embodiments of the server, the instructions are further configured to cause the one or more processors to at least perform: determining a noise transition matrix (NTM) associated with the training sample based on the observed output f(x) and anchor points, wherein the NTM represents a probability of the clean output label y flipping into the noisy output label {tilde over (y)}; determining a first reconstruction error associated with the NTM and a second reconstruction error associated with the PTM; determining a first weight and a second weight for a linear combination of the PTM and the NTM, wherein the first weight and the second weight are determined by a minimization of mean squared reconstruction error; and determining the posterior loss based on the linear combination of the PTM and the NTM.
The text and figures are provided solely as examples to aid the reader in understanding the invention. They are not intended and are not to be construed as limiting the scope of this invention in any manner. Although certain embodiments and examples have been provided, it will be apparent to those skilled in the art based on the disclosures herein that changes in the embodiments and examples shown may be made without departing from the scope of embodiments provided herein.
Overall, noisy training samples (data D 1-3) are collected. In an example, the training samples may be bid requests, user feedback, user historical events (for example web search strings, user purchases, user clicks on particular web screens), and metadata of advertisements (“ads”).
The application of the embodiments is not limited to the aforementioned examples; the embodiments can be applied to other applicable environments.
A noise transition matrix (NTM) may be estimated offline based on the noisy training samples (D 1-3). A PTM may be estimated iteratively based on model predictions of the classifier 1-5 and the observed noisy labels in D 1-3. The NTM and the PTM may be combined based on an optimal Kalman gain.
At operation 1-6, an ad bid request 1-7 is received and inference is performed using classifier 1-5. An ad bid 1-8 is then output with improved estimation of user response prediction (URP).
The logic flow of the training procedure is described below.
Embodiments presented herein compute a noise transition matrix (NTM) 2-10 (see the drawings).
At operation 3-2, the noisy data set D 1-3 is obtained.
At operation 3-4, a warm up is performed in which f 1-5 is trained based on D 1-3. This training may use conventional supervised learning algorithms, since the training instances x in the set D 1-3 (for example, images) are provided with labels (although the labels are sometimes incorrect).
At operation 3-6, the probability {circumflex over (P)} 3-5 of the underlying noisy label {tilde over (Y)} 3-3 given the instance x is found. For an example, see "Case 1" and "Case 2" in the drawings.
At operation 3-8, logic 3-1 computes an estimate of the posterior clean probability given noisy labels, Ŵ, also referred to as PTM 2-12.
At operation 3-10, logic 3-1 computes a noise transition matrix {circumflex over (T)}, also referred to as NTM 2-10. NTM 2-10 represents the probability that instance x, which belongs to class i, is labeled as class j.
At operation 3-12, a blended loss 3-11 is found for instance x. The blended loss 3-11 blends PTM 2-12 with NTM 2-10.
In some embodiments, obtaining the blended loss comprises combining the PTM with a second transition matrix (the NTM) to obtain a third transition matrix (WKM, also written W_km or Wkm herein), wherein the updating comprises minimizing a loss function based on the WKM.
At operation 3-14, a posterior loss, Lposterior 3-15 for x is found based on the blended loss 3-11 and a cross entropy loss L(x) 3-13.
At operation 3-16, Lposterior 3-15 is included in an ensemble 3-17 of posterior loss values. Operations 3-6 through 3-16 are repeated until the instances x in the noisy data set D 1-3 have been processed; the ensemble 3-17 is then considered to be complete and an average posterior loss 3-19 is found.
At operation 3-18, f 1-5 is updated based on the average posterior loss 3-19. The update may be performed using back propagation and stochastic gradient descent (SGD). Backpropagation computes the gradient of the loss with respect to the network weights so that the weights can be adjusted to improve the accuracy of the neural network's outputs. In this case, f 1-5 and the average posterior loss 3-19 are inputs to the back propagation and SGD algorithm, and an improved, more robust f 1-5 is the output.
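As an illustration only, a minimal sketch of such an update step is shown below, assuming a PyTorch environment; the model, optimizer, and stand-in loss are hypothetical placeholders and are not the disclosure's actual implementation of the average posterior loss 3-19.

```python
# Minimal sketch of operation 3-18 (back propagation + SGD), assuming PyTorch.
# The cross-entropy below is only a stand-in for the average posterior loss 3-19.
import torch
import torch.nn.functional as F

model = torch.nn.Linear(8, 2)                      # stand-in for classifier f 1-5
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(16, 8)                             # a batch of instances from D 1-3
noisy_y = torch.randint(0, 2, (16,))               # observed noisy labels

loss = F.cross_entropy(model(x), noisy_y)          # placeholder for average posterior loss 3-19
optimizer.zero_grad()
loss.backward()                                    # back propagation computes gradients w.r.t. weights
optimizer.step()                                   # SGD update yields an improved, more robust f 1-5
```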
First, with the goal of utilizing the observed noisy labels, a posterior transition matrix (PTM) is used to describe the transition probabilities given the observed noisy labels. Second, a loss function incorporates the estimated PTM so that the final output predictions can be corrected even with the existence of noisy labels. Third, to further improve accuracy, an information fusion (IF) method may be used, which combines the estimated noise transition matrix (NTM) and the PTM to achieve lower estimation error.
An architecture of an example embodiment is provided in the drawings and described below.
Specifically, data D 1-3 is input to training 4-4, which trains the classifier f 1-5 and uses softmax to provide the classifier output f(x). At 4-16, PTM estimation is performed and PTM 2-12 is output. At 4-8, blending in the form of a linear combination is applied to PTM 2-12 (referring to the instance x) and to NTM 2-10 (also referring to the instance x). NTM 4-12 is obtained based on the instance x and the noisy label {tilde over (Y)}. The resulting matrix, Wkm 3-11, is operated on to obtain a posterior loss 3-15, which is collected into the ensemble 3-17. The average 3-19 over the ensemble for all x is finally used to update f 1-5 at training 4-4. The ensemble 3-17 may be constructed at a batch level.
In another embodiment, the training logic proceeds as follows.
At operation 5-2, the noisy data set D 1-3 is obtained.
At operation 5-4, warm up training is performed of f 1-5 using D 1-3.
Operation 5-6 indicates beginning of batch processing.
Operation 5-8 indicates beginning of processing of the instance x of the batch.
At operation 5-10, {circumflex over (P)} 3-5 is found based on {tilde over (Y)} 3-3 such that {circumflex over (P)} 3-5 is all 0 except for the jth entry, which is 1, when {tilde over (Y)}(x)=j. That is, the observed noisy label {tilde over (Y)} 3-3 for the instance x determines the index j (for example, 1 or 2), which corresponds to a label category (for example, "dog" or "cat").
At operation 5-12, the PTM 2-12 is found as the outer product of f(x) with {circumflex over (P)} 3-5. For a vector u and a vector v, the outer product A=u vT has elements Ai,j=ui*vj, where "*" is scalar multiplication and "vT" is the vector transpose of v.
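As a brief illustration, the following numpy sketch computes {circumflex over (P)} and the outer-product PTM for one instance; the number of classes, the classifier output, and the noisy label index are hypothetical example values.

```python
# Sketch of operations 5-10 and 5-12 for a single instance x (numpy); values are illustrative.
import numpy as np

num_classes = 3
f_x = np.array([0.7, 0.2, 0.1])      # classifier output f(x) for instance x (a probability vector)

j = 1                                 # observed noisy label index for x, i.e., Y_tilde(x) = j
P_hat = np.zeros(num_classes)
P_hat[j] = 1.0                        # P_hat is all 0 except the jth entry (operation 5-10)

# Operation 5-12: PTM as the outer product of f(x) with P_hat, so W[i, k] = f_i(x) * P_hat_k.
W = np.outer(f_x, P_hat)
assert np.isclose(W[2, 1], f_x[2] * P_hat[1])
```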
At operation 5-14, blending 4-8 is performed between NTM 2-10 and PTM 2-12. NTM 2-10 may be found using a conventional technique.
At operation 5-16, the posterior loss Lposterior 3-15 is found.
At operation 5-18, the posterior loss Lposterior 3-15 is included in an ensemble 3-17. When all x in the batch have been considered, the average 3-19 of the ensemble 3-17 is found. Otherwise path 5-20 is taken back to 5-8 to obtain another x from D 1-3.
At operation 5-22, f 1-5 is updated based on the average 3-19. If all batches have been considered, then f 1-5 is output as the improved classifier. Otherwise, path 5-24 is taken back to 5-6 to begin a new batch.
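For illustration, a self-contained sketch of one pass through operations 5-10 to 5-22 for a single batch is given below, assuming PyTorch and randomly generated toy data; the placeholder NTM, the fixed blending weight, and the tensor shapes are assumptions made for the example, not the disclosure's estimated quantities.

```python
# Hedged sketch of one batch of operations 5-10 through 5-22, assuming PyTorch; toy data only.
import torch
import torch.nn.functional as F

num_classes, batch_size, feat_dim = 3, 16, 8
model = torch.nn.Linear(feat_dim, num_classes)            # classifier f 1-5
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(batch_size, feat_dim)                     # instances from D 1-3
noisy_y = torch.randint(0, num_classes, (batch_size,))    # noisy labels Y_tilde 3-3

f_x = F.softmax(model(x), dim=1)                          # f(x), shape (batch, c)
P_hat = F.one_hot(noisy_y, num_classes).float()           # operation 5-10: one-hot P_hat
W = f_x.unsqueeze(2) * P_hat.unsqueeze(1)                 # operation 5-12: batched outer product, (batch, c, c)

T_hat = torch.full((batch_size, num_classes, num_classes),
                   1.0 / num_classes)                     # placeholder NTM 2-10 (not estimated here)
lam = 0.5                                                 # placeholder blending weight lambda(x)
W_km = (1.0 - lam) * T_hat + lam * W                      # operation 5-14: blended matrix per Eq. 5

# Operation 5-16: posterior reweight loss (Eq. 3b), using the column of W_km at the noisy label.
w_cols = W_km[torch.arange(batch_size), :, noisy_y]       # (batch, c): W_km[b, i, noisy_y[b]]
posterior_losses = -(w_cols * torch.log(f_x + 1e-12)).sum(dim=1)

avg_loss = posterior_losses.mean()                        # average 3-19 over the ensemble 3-17
optimizer.zero_grad()
avg_loss.backward()
optimizer.step()                                          # operation 5-22: update f 1-5
```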
Exemplary performance of the trained classifier 1-5 is provided in U.S. Provisional Application No. 63/310,040 filed Feb. 14, 2022.
Example details of the operations described above are now provided.
Let function f 1-5 represent a neural network and f(x) denote the c-dimensional output probability for instance x, where the ith index of the output fi(x) represents the predicted probability for class i. An approach is to minimize the cross-entropy (CE) loss L(f(x), y)=−log(fy(x)) to force the output fy(x) to approximate 1. However, the label noise may mislead a deep learning model. In some embodiments, an approach is to first estimate the NTM T(x) 2-10 and then adopt it to correct the loss function L 3-15. For example, in a forward correction procedure, the estimated NTM 2-10 is adopted to corrupt the predicted probability f(x), i.e., the corrupted predicted probability is {tilde over (f)}(x)=T(x)Tf(x), and then the corrupted predicted probability is enforced to approximate the noisy label {tilde over (Y)} 3-3. Suppose T(x) 2-10 is non-singular and the loss function L 3-15 is proper and composite. The forward loss correction can then achieve a consistent classifier, i.e., the optimal classifier for the corrected loss with respect to the underlying noisy distribution is the same as that for the CE loss with respect to the underlying clean distribution.
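As a brief illustration of this forward correction idea, a numpy sketch follows; the NTM values, the prediction f(x), and the noisy label are hypothetical example values, not quantities estimated by the disclosure.

```python
# Illustrative forward loss correction (numpy): corrupt f(x) with T(x) and apply CE to the noisy label.
import numpy as np

f_x = np.array([0.8, 0.15, 0.05])            # predicted clean-label probabilities f(x)
T_x = np.array([[0.9, 0.05, 0.05],           # NTM T(x); T[i, j] = P(Y_tilde = j | Y = i, x), rows sum to 1
                [0.1, 0.80, 0.10],
                [0.1, 0.10, 0.80]])
noisy_label = 0

f_tilde = T_x.T @ f_x                         # corrupted predicted probability f_tilde(x) = T(x)^T f(x)
forward_loss = -np.log(f_tilde[noisy_label])  # CE loss pushing f_tilde(x) toward the noisy label
print(forward_loss)
```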
The main goal is to train a c-class neural network classifier f 1-5 to predict the clean label probability P(Y|X). Since only the noisy labels are observed, there is a gap between the clean and noisy label, described via NTM 2-10.
Motivated by the observed noisy labels (i.e., posterior information), embodiments define the PTM W(x) 2-12 to describe the posterior clean label probability given noisy labels, where Wi,j(x)=P(Y=i|{tilde over (Y)}=j, X=x). The relationship between the PTM W(x) 2-12 and NTM T(x) 2-10 can be expressed via Bayes' rule as shown in Eq. 2.
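Eq. 2 itself appears in the drawings; as a reconstruction based on the definitions above (an assumption, not the verbatim equation of the disclosure), Bayes' rule gives:

```latex
W_{i,j}(x) \;=\; P\!\left(Y=i \mid \tilde{Y}=j, X=x\right)
           \;=\; \frac{T_{i,j}(x)\, P(Y=i \mid X=x)}
                      {\sum_{k=1}^{c} T_{k,j}(x)\, P(Y=k \mid X=x)}
```

Under this form, each column of W(x) sums to 1 over i, consistent with the column-sum property noted below.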
The summation of any column is 1 for PTM W(x), while the summation of any row is 1 for NTM T(x). Embodiments provide a posterior loss correction method via NTM.
The model prediction is f(x) for noisy sample (x, {tilde over (y)}) and W(x) is the PTM 2-12 associated with the noisy sample x.
In some embodiments, a posterior forward loss is used for training (Eq. 3a).
Lforward=L({tilde over (y)}, Σi=1c Wi,{tilde over (y)}(x)fi(x))   Eq. 3a
where fi(x) is the ith element of f(x).
In some embodiments, a posterior reweight loss 3-15 is used; the posterior reweight loss 3-15 is defined as in Eq. 3b.
Lreweight=Σi=1c Wi,{tilde over (y)}(x)L(i, f(x))   Eq. 3b
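For illustration, a numpy sketch of Eq. 3a and Eq. 3b for one noisy sample follows; the classifier output, the PTM values, and the noisy label are hypothetical example values.

```python
# Illustrative computation of the posterior forward loss (Eq. 3a) and reweight loss (Eq. 3b).
import numpy as np

f_x = np.array([0.6, 0.3, 0.1])              # f(x)
W_x = np.array([[0.85, 0.10, 0.05],          # PTM W(x); W[i, j] = P(Y = i | Y_tilde = j, x), columns sum to 1
                [0.10, 0.80, 0.10],
                [0.05, 0.10, 0.85]])
y_tilde = 1                                   # observed noisy label

# Eq. 3a: cross-entropy of the noisy label against the W-weighted prediction sum_i W[i, y~](x) f_i(x).
L_forward = -np.log(np.dot(W_x[:, y_tilde], f_x))

# Eq. 3b: reweighted cross-entropy over candidate clean labels, sum_i W[i, y~](x) * (-log f_i(x)).
L_reweight = -np.sum(W_x[:, y_tilde] * np.log(f_x))
print(L_forward, L_reweight)
```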
An analysis of expected risk and empirical risk shows that the posterior reweight loss can achieve a consistent classifier for the underlying distribution and the empirical distribution. "Underlying distribution" means the true distribution, and "empirical distribution" means the distribution based on the observed noisy labels.
The PTM 2-12 is estimated by finding the solution with the minimal Frobenius norm. The solution is expressed in Eq. 4. Ŵ may also be referred to as W_hat herein, and {tilde over (Y)} may be referred to as Y_tilde herein.
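Eq. 4 itself appears in the drawings; one reconstruction consistent with the outer-product computation described above (an assumption, not the verbatim equation of the disclosure) is the minimum-Frobenius-norm solution of W(x){circumflex over (P)}({tilde over (Y)}|x)=f(x):

```latex
\hat{W}(x) \;=\; \frac{f(x)\, \hat{P}(\tilde{Y}\mid x)^{\top}}
                      {\bigl\lVert \hat{P}(\tilde{Y}\mid x) \bigr\rVert_2^{2}}
```

When ∥{circumflex over (P)}({tilde over (Y)}|x)∥2=1, this reduces to the outer product of f(x) with {circumflex over (P)}, matching operation 5-12.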
Eq. 4 provides the PTM estimation for general empirical noisy label distributions. For the case of an instance x with only a single occurrence, the empirical noisy distribution satisfies ∥{circumflex over (P)}({tilde over (Y)}|x)∥2=1 (i.e., {circumflex over (P)} is one-hot), and this empirical noisy distribution achieves a consistent estimated PTM 2-12 in Eq. 4.
Eq. 4 provides a PTM estimation method based on the observed noisy labels D 1-3. However, the condition that the neural network approximates clean labels could still be strong even after the warm-up strategy and iterative estimation are adopted, and the PTM estimation error could be large at high noisy label rates. To further reduce the estimation error, motivated by Kalman filtering, embodiments provide an information fusion (IF) approach to obtain a more accurate transition matrix estimate via a weighted average of PTM 2-12 and NTM 2-10.
Intuitively, for each instance, the estimated NTM 2-10 and PTM 2-12 may have different estimation accuracy, and, therefore, it is possible to obtain a more accurate transition matrix estimation by adaptively and linearly combining these two matrices. Embodiments quantify the estimation uncertainty and assign higher weight for the estimation with lower uncertainty. In this way, a more accurate estimated transition matrix is generated.
Uncertainty is modeled as follows. For the estimated NTM 2-10, the noisy label {tilde over (Y)} satisfies a c-dimensional Bernoulli distribution with parameter {tilde over (f)}.
Once the uncertainty has been established, embodiments integrate the two estimated transition matrices into a Kalman transition matrix, defined as Wkm(x), via a weighted average operation. Mathematically, the Kalman transition matrix is given by Eq. 5. Wkm may be referred to as W_km or WKM herein, and {circumflex over (T)} may be referred to as T_hat herein.
Wkm(x)=(1−λ(x)){circumflex over (T)}(x)+λ(x)Ŵ(x)   Eq. 5
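For illustration, a numpy sketch of this fusion step follows; the inverse-uncertainty weighting used to form λ(x) is a hypothetical stand-in for the Kalman-style weighting described herein, and the matrix values are example numbers only.

```python
# Illustrative information fusion per Eq. 5 (numpy); uncertainties and matrices are example values.
import numpy as np

T_hat = np.array([[0.90, 0.05, 0.05],        # estimated NTM 2-10 (rows sum to 1)
                  [0.10, 0.80, 0.10],
                  [0.10, 0.10, 0.80]])
W_hat = np.array([[0.85, 0.10, 0.05],        # estimated PTM 2-12 (columns sum to 1)
                  [0.10, 0.85, 0.10],
                  [0.05, 0.05, 0.85]])

var_ntm, var_ptm = 0.04, 0.01                # placeholder estimation uncertainties for NTM and PTM
lam = var_ntm / (var_ntm + var_ptm)          # higher weight goes to the lower-uncertainty estimate

W_km = (1.0 - lam) * T_hat + lam * W_hat     # Eq. 5: Kalman transition matrix W_km(x)
print(W_km)
```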
Hardware for performing embodiments provided herein is now described with respect to the accompanying drawings.
Through the above embodiments, higher quality labels can be generated so that improved classification can be performed. In an exemplary application, the use of the above embodiments can result in more accurate classification by a neural network trained with the blended loss 3-11. An example of the neural network is the classifier 1-5. In some embodiments, the blended loss 3-11 is based on the posterior clean label probability given noisy labels (PTM 2-12). The improved neural network thus provides more accurate output classifications.
Overall, explicitly considering label noise improves the accuracy of supervised learning models (for example, classifier f 1-5). In practice, it is not possible to have 100% clean data. The above embodiments model label noise during training and reduce the extent to which supervised learning models are misled by that noise.
For example, performance of the classifier f 1-5 on example benchmark datasets, CIFAR-10 and SVHN, is improved over alternative approaches. Classifier f 1-5 achieves better performance than alternative approaches across different datasets and over a range of noise rates. The higher accuracy indicates that the posterior information is particularly important at higher noise rates. Thus, by explicitly rectifying noisy labels, embodiments provide robust models and corrected predictions, and therefore improve performance.
This application claims benefit of priority of U.S. Provisional Application No. 63/310,040 filed Feb. 14, 2022, the contents of which are hereby incorporated by reference.