Deep learning has demonstrated impressive performance on a variety of tasks. Arguably the most important task, that of supervised classification, has led to many advancements. Notably, the use of deeper structures and more powerful loss functions has resulted in far more robust feature representations. There has also been more attention to obtaining better-behaved gradients through normalization of batches or weights.
One of the most important practical applications of deep networks with supervised classification is face recognition. Robust face recognition poses a challenge as it is characterized by a very large number of classes with relatively few samples per class for training with significant nuisance transformations.
A good understanding of the challenges in this task results in a better understanding of the core problems in supervised classification and, more generally, in representation learning. However, despite the considerable attention paid to face recognition tasks over the past few years, there are still many gaps in the understanding of this task, notably regarding the need for and practice of feature normalization. Normalization of features, which implicitly results in a cosine embedding, provides a significant improvement in performance. However, direct normalization in deep networks is a non-convex formulation and results in local minima generated by the loss function.
A common primary loss function is Softmax. Proposals have been made to use norm constraints before the Softmax loss is applied. However, the formulations investigated are non-convex in the feature representations leading to difficulties in optimization. Further, there is a need for better understanding of the benefits of normalization itself. The ‘radial’ nature of the Softmax features, as shown in
Described herein is a novel approach to normalization, known as “Ring Loss”. This method may be used to normalize all sample features through a convex augmentation of the primary loss function. The value of the target norm is also learned during training. Thus, the only hyperparameter in Ring Loss is the loss weight with respect to the primary loss function.
Deep feature normalization is an important aspect of supervised classification problems, where the model is required to represent each class in a multi-class problem equally well. The direct approach to feature normalization through a hard normalization operation results in a non-convex formulation. Instead, Ring Loss applies soft normalization, where it gradually learns to constrain the norm to the scaled unit circle while preserving convexity, leading to more robust features.
Feature matching during testing in face recognition is typically done through cosine distance, which creates a gap between the testing protocol and training protocols that do not utilize normalization. The incorporation of Ring Loss during training eliminates this gap. Ring Loss allows for seamless and simple integration into deep architectures trained using gradient-based methods. Ring Loss provides consistent improvements over a large range of its hyperparameter when compared to other normalization baselines and to other losses proposed for face recognition in general. Through the norm constraint, Ring Loss is also robust to lower resolutions.
The Ring Loss augmentation method constrains the radial classifications of the Softmax loss function, shown in
The Ring Loss method provides three main advantages over the use of the un-augmented Softmax as the loss function: 1) the norm constraint helps maintain a balance between the angular classification margins of multiple classes; 2) Ring Loss removes the disconnect between training and testing metrics; and 3) Ring Loss minimizes test errors caused by the angular variation of low-norm features.
The Angular Classification Margin Imbalance. Consider a binary classification task with two feature vectors x1 and x2 from classes 1 and 2 respectively, extracted using some model (possibly a deep network). Let the classification weight vectors for classes 1 and 2 be ω1 and ω2 respectively. The primary loss function may be, for example, Softmax.
An example arrangement is shown in
In general, for the class 1 vector ω1 to pick x1 and not x2 for correct classification, it is required that $\omega_1^T x_1 > \omega_1^T x_2 \Rightarrow \|x_1\|_2 \cos\theta_1 > \|x_2\|_2 \cos\theta_2$. Here, θ1 and θ2 are the angles between the weight vector ω1 (class 1 vector only) and x1, x2 respectively. The feasible set (range for θ1) required for this inequality to hold is known as the angular classification margin. Note that it is also a function of θ2.
Setting $r = \|x_2\|_2 / \|x_1\|_2$ (so that $r > 0$), correct classification requires $\cos\theta_1 > r\cos\theta_2 \Rightarrow \theta_1 < \cos^{-1}(r\cos\theta_2)$, because cos θ is a decreasing function of θ on [0, π], taking values in [−1, 1]. This inequality needs to hold true for any θ2.
Fixing $\cos\theta_2 = \delta$ results in $\theta_1 < \cos^{-1}(r\delta)$. From the domain constraints of $\cos^{-1}$, we have $-1 \le r\delta \le 1$.
Combining this inequality with $r > 0$ results in $0 < r \le 1/|\delta|$, that is, $\|x_2\|_2 \le \|x_1\|_2 / |\delta|$.
For these purposes it suffices to look only at the case δ>0, because δ<0 does not change the inequality −1≤rδ≤1.
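The bound on θ1 derived above can be tabulated numerically. The following is a minimal sketch, assuming r is the norm ratio ∥x2∥/∥x1∥ implied by the inequality and δ = cos θ2 is restricted to (0, 1]; the function name `max_angle_deg` is hypothetical and chosen only for illustration.

```python
import math

def max_angle_deg(r, delta):
    """Upper bound on theta1 in degrees: theta1 < arccos(r * delta).

    r     : assumed norm ratio ||x2|| / ||x1|| (must be > 0)
    delta : cos(theta2), restricted here to (0, 1]
    Returns None when r * delta > 1, where arccos is undefined and
    no feasible theta1 exists.
    """
    if r <= 0 or not (0 < delta <= 1):
        raise ValueError("expected r > 0 and 0 < delta <= 1")
    product = r * delta
    if product > 1:
        return None
    return math.degrees(math.acos(product))

if __name__ == "__main__":
    # Larger r (x2 has the larger norm) leaves a smaller margin for x1.
    for delta in (0.1, 0.5, 1.0):
        for r in (0.5, 1.0, 2.0):
            print(f"delta={delta} r={r} bound={max_angle_deg(r, delta)}")
```

For a fixed δ, the printed bound shrinks as r grows, which is the margin imbalance that the norm constraint is meant to remove.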
Discussion on the angular classification margin. The upper bound on θ1 (i.e., cos−1(r cos θ2)) is plotted for a range of δ ([0.1, 1]) and the corresponding range of r.
In other words, as the norm of x2 increases with respect to that of x1, the angular margin within which ω1 classifies x1 correctly while rejecting x2 decreases. The difference in norm (r>1) therefore has an adverse effect during training by effectively enforcing smaller angular classification margins for classes with smaller norm samples. This also leads to lopsided classification margins for multiple classes due to the difference in class norms, as can be seen in
Effects of Softmax on the norm of MNIST features. The effects of Softmax on the norm of the features (and thereby classification margin) on MNIST can be qualitatively observed in
Removing the disconnect between training and testing. Evaluation using the cosine metric is currently ubiquitous in applications such as face recognition, where the features in the gallery are normalized beforehand (thereby requiring fewer FLOPs during large scale testing). During training, however, this is not the case, and the norm is usually not constrained. This creates a disconnect between training and testing scenarios which hinders performance. The Ring Loss method removes this disconnect in an elegant way.
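The test-time protocol described above, in which gallery features are normalized once and probes are matched by cosine similarity, can be sketched as follows. This is only an illustration: the helper names (`l2_normalize`, `build_gallery`, `best_match`) and the two-dimensional toy features are assumptions and do not come from the original text.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(c * c for c in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [c / norm for c in v]

def build_gallery(features):
    """Normalize every gallery feature once, ahead of query time."""
    return {name: l2_normalize(vec) for name, vec in features.items()}

def best_match(gallery, probe):
    """Return (name, cosine similarity) of the closest gallery entry."""
    q = l2_normalize(probe)
    # With unit-norm vectors, the dot product equals the cosine similarity.
    scores = {name: sum(a * b for a, b in zip(g, q)) for name, g in gallery.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]

# Toy 2D features with very different norms (illustrative values only).
gallery = build_gallery({"id_a": [3.0, 0.0], "id_b": [0.0, 0.5]})
print(best_match(gallery, [10.0, 1.0]))
```

Because both sides are normalized, rescaling the probe leaves the score unchanged, so the feature norm plays no role at test time even though it is unconstrained during training.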
Regularizing Softmax Loss with the Norm Constraint. The ideal training scenario for a system testing under the cosine metric would be one where all features pointing in the same direction have the same loss. However, this is not true for the most commonly used loss function, Softmax, and its variants (FC layer combined with the Softmax function and the cross-entropy loss). Assuming that the weights are normalized (i.e. ∥ωk∥=1), the Softmax loss for feature vector xi (for the correct class yi) can be expressed as:
$$L_{SM} = -\log \frac{\exp(\omega_{y_i}^T x_i)}{\sum_{k=1}^{K} \exp(\omega_k^T x_i)} = -\log \frac{\exp(\|x_i\|_2 \cos\theta_{y_i})}{\sum_{k=1}^{K} \exp(\|x_i\|_2 \cos\theta_k)}$$
Here, K is the number of classes and θk is the angle between ωk and xi.
Clearly, despite having the same direction, two features with different norms have different losses. From this perspective, the straightforward solution to regularize the loss and remove the influence of the norm is to normalize the features before Softmax. However, this approach is effectively a projection method, that is, it calculates the loss as if the features are normalized to the same scale, while the actual network does not learn to normalize features.
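The dependence of the Softmax loss on the feature norm can be checked with a small numerical sketch. It assumes two unit-norm class vectors, no bias term, and a feature at 30 degrees from the class 1 vector; these values and the function names are illustrative assumptions, not taken from the original text.

```python
import math

def softmax_loss(feature, weights, label):
    """Cross-entropy of a softmax over logits w_k . x (no bias term)."""
    logits = [sum(w_c * x_c for w_c, x_c in zip(w, feature)) for w in weights]
    top = max(logits)  # subtract the max for numerical stability
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum - logits[label]

# Unit-norm class vectors (assumed) and one fixed feature direction.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]
DIRECTION = [math.cos(math.radians(30)), math.sin(math.radians(30))]

def loss_at_norm(norm):
    """Loss for class 0 when the feature is DIRECTION scaled to `norm`."""
    feature = [norm * c for c in DIRECTION]
    return softmax_loss(feature, WEIGHTS, label=0)

for norm in (0.5, 1.0, 5.0):
    print(norm, loss_at_norm(norm))
```

All three features point in the same direction, yet the loss differs with the norm, which is the behavior the norm constraint is intended to regularize.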
The need for feature normalization in feature space. As an illustration, consider the training and testing set features of the digit 8 from MNIST, trained with vanilla Softmax, in
This is yet another motivation to normalize features during training. Forcing the network to learn to normalize the features helps to mitigate this problem during testing, because the network learns to work in the normalized feature space.
Incorporating the norm constraint as a convex problem. Having identified the need to normalize the sample features from the network, the problem can now be formulated. If the primary loss function is defined as $L_S$ (for instance Softmax loss), and it is assumed that $\mathcal{F}$ provides deep features for a sample x as $\mathcal{F}(x)$, the loss can be minimized subject to the normalization constraint as follows:
$$\min L_S(\mathcal{F}(x)) \quad \text{s.t.} \quad \|\mathcal{F}(x)\|_2 = R$$
Here, R is the scale constant that the features are to be normalized to. Note that this problem is non-convex in $\mathcal{F}(x)$ because the set of feasible solutions is itself non-convex due to the norm equality constraint. Approaches which use standard SGD while ignoring this critical point would not provide feasible solutions to this problem, and the network would therefore not learn to output normalized features. Indeed, the features obtained using this straightforward approach are not normalized compared to the Ring Loss method, shown in
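Ring Loss Definition. Ring Loss $L_R$ is defined as:
$$L_R = \frac{\lambda}{2m} \sum_{i=1}^{m} \left( \|\mathcal{F}(x_i)\|_2 - R \right)^2$$
where $\mathcal{F}(x_i)$ is the deep network feature for the sample $x_i$. Here, R is the target norm value, which is also learned, λ is the loss weight enforcing a trade-off with the primary loss function, and m is the batch size. The square on the norm difference helps the network to take larger steps when the norm of a sample is too far off from R, leading to faster convergence. The corresponding gradients are as follows.
$$\frac{\partial L_R}{\partial R} = -\frac{\lambda}{m} \sum_{i=1}^{m} \left( \|\mathcal{F}(x_i)\|_2 - R \right)$$
$$\frac{\partial L_R}{\partial \mathcal{F}(x_i)} = \frac{\lambda}{m} \left( 1 - \frac{R}{\|\mathcal{F}(x_i)\|_2} \right) \mathcal{F}(x_i)$$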
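As a rough illustration, a squared norm-difference penalty of this kind and its gradients can be written in a few lines. The sketch below assumes the form (λ/2m) Σ (∥f_i∥₂ − R)² over a batch of m features; the 1/(2m) scaling, the function names, and the toy batch are assumptions made for the example, and nonzero feature norms are assumed throughout.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

def ring_loss(features, R, lam):
    """(lam / 2m) * sum_i (||f_i||_2 - R)^2 over a batch of m features."""
    m = len(features)
    return lam / (2 * m) * sum((l2_norm(f) - R) ** 2 for f in features)

def ring_loss_grads(features, R, lam):
    """Analytic gradients with respect to R and to each feature vector."""
    m = len(features)
    norms = [l2_norm(f) for f in features]
    grad_R = -lam / m * sum(n - R for n in norms)
    # Each feature gradient is a scalar multiple of the feature (radial).
    grad_f = [[lam / m * (1 - R / n) * c for c in f]
              for f, n in zip(features, norms)]
    return grad_R, grad_f

def numeric_grad_R(features, R, lam, eps=1e-6):
    """Central finite difference in R, used only as a sanity check."""
    return (ring_loss(features, R + eps, lam)
            - ring_loss(features, R - eps, lam)) / (2 * eps)

batch = [[3.0, 4.0], [0.6, 0.8]]  # toy features with norms 5 and 1
print(ring_loss(batch, R=1.0, lam=1.0))
print(ring_loss_grads(batch, R=1.0, lam=1.0))
print(numeric_grad_R(batch, R=1.0, lam=1.0))
```

In a training loop, this term would be added to the primary loss and R updated by the same optimizer as the network weights; that wiring is framework-specific and is omitted here.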
Ring Loss (LR) can be used along with any other loss function such as Softmax or large-margin Softmax. The loss encourages the norm of each sample to take the value R (a learned parameter) rather than explicitly enforcing it through a hard normalization operation. This approach provides informed gradients towards a better minimum, which helps the network to satisfy the normalization constraint. The network therefore learns to normalize the features using model weights (rather than needing an explicit non-convex normalization operation, or batch normalization). By contrast, batch normalization enforces a scaled normal distribution for each element in the feature independently. This does not constrain the overall norm of the feature to be equal across all samples, and it addresses neither the class imbalance problem nor the gap between the training and testing protocols in face recognition.
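The soft normalization described above can be illustrated with a toy gradient-descent run. This is a simplified sketch under stated assumptions: the 2D points are treated as free parameters (in a real network the gradients would flow into the model weights), only the norm penalty is applied with no primary loss, and R is held fixed at 1 rather than learned. The step size, the number of steps, and the starting grid are arbitrary choices.

```python
import math

def ring_step(point, R, lam, lr):
    """One gradient-descent step on (lam / 2) * (||p||_2 - R)^2 for one point."""
    norm = math.hypot(point[0], point[1])
    scale = lam * (1 - R / norm)  # the gradient is scale * p, purely radial
    return [c - lr * scale * c for c in point]

def simulate(points, R=1.0, lam=1.0, lr=0.1, steps=200):
    """Repeatedly apply the norm penalty gradient to every point."""
    for _ in range(steps):
        points = [ring_step(p, R, lam, lr) for p in points]
    return points

# A coarse 4 x 4 grid of starting points; none of them sits at the origin.
start = [[x / 2, y / 2] for x in (-3, -1, 1, 3) for y in (-3, -1, 1, 3)]
final = simulate(start)

for s, f in zip(start[:4], final[:4]):
    print(s, "->", [round(c, 4) for c in f], "norm", round(math.hypot(*f), 6))
```

Each step shrinks the gap between a point's norm and R by a constant factor while leaving its direction unchanged, so the points drift onto the circle of radius R gradually instead of being projected onto it in a single hard operation.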
Ring Loss Convergence Visualizations. To illustrate the effect of the Softmax loss augmented with the enforced soft-normalization, analytical simulations were conducted. A 2D mesh of points spanning (−1.5, 1.5) along both the x- and y-axes was generated. The gradients of Ring Loss (R=1) were computed, assuming the vertical dotted line in
A method of training using a primary loss function augmented by a Ring Loss function has been presented in which the network learns to normalize features as they are extracted. The Ring Loss method was found to consistently provide significant improvements over a large range of the hyperparameter λ. Further, the network learns normalization, thereby being robust to a large range of degradations.
The application claims the benefit of U.S. Provisional Patent Application No. 62/710,814, filed Feb. 28, 2018, which is incorporated herein by reference in its entirety.
This invention was made with government support under N6833516C0177 awarded by the US Navy. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US19/20090 | 2/28/2019 | WO | 00 |

Number | Date | Country |
---|---|---|
62710814 | Feb 2018 | US |