Deep learning has demonstrated impressive performance on a variety of tasks. Arguably the most important task, that of supervised classification, has led to many advancements. Notably, the use of deeper structures and more powerful loss functions has resulted in far more robust feature representations. There has also been increased attention on obtaining better-behaved gradients through normalization of batches or weights.
One of the most important practical applications of deep networks with supervised classification is face recognition. Robust face recognition poses a challenge because it is characterized by a very large number of classes, relatively few training samples per class, and significant nuisance transformations.
A good understanding of the challenges in this task leads to a better understanding of the core problems in supervised classification and, more generally, in representation learning. However, despite the considerable attention paid to face recognition tasks over the past few years, there are still many gaps in the understanding of this task, notably the need for, and practice of, feature normalization. Normalization of features provides a significant improvement in performance and implicitly results in a cosine embedding. However, direct normalization in deep networks yields a non-convex formulation, resulting in local minima generated by the loss function.
A common primary loss function is Softmax. Proposals have been made to apply norm constraints before the Softmax loss. However, the formulations investigated are non-convex in the feature representations, leading to difficulties in optimization. Further, there is a need for a better understanding of the benefits of normalization itself. The ‘radial’ nature of the Softmax features, as shown in
Described herein is a novel approach to normalization, known as “Ring Loss”. This method may be used to normalize all sample features through a convex augmentation of the primary loss function. The value of the target norm is also learned during training. Thus, the only hyperparameter in Ring Loss is the loss weight with respect to the primary loss function.
Deep feature normalization is an important aspect of supervised classification problems in which the model is required to represent each class in a multi-class problem equally well. The direct approach to feature normalization, through a hard normalization operation, results in a non-convex formulation. Instead, Ring Loss applies soft normalization, in which the network gradually learns to constrain the norm to the scaled unit circle while preserving convexity, leading to more robust features.
Feature matching during testing in face recognition is typically done through the cosine distance, creating a gap between testing and training protocols that do not utilize normalization. The incorporation of Ring Loss during training eliminates this gap. Ring Loss allows for seamless and simple integration into deep architectures trained using gradient-based methods. Ring Loss provides consistent improvements over a large range of its hyperparameter when compared to other normalization baselines and to other losses proposed for face recognition in general. Through the norm constraint, Ring Loss is also robust to lower resolutions.
The Ring Loss augmentation method constrains the radial classifications of the Softmax loss function, shown in
The Ring Loss method provides three main advantages over the un-augmented Softmax loss: 1) the norm constraint helps maintain a balance between the angular classification margins of multiple classes; 2) Ring Loss removes the disconnect between training and testing metrics; and 3) Ring Loss minimizes test errors caused by angular variation in low-norm features.
The Angular Classification Margin Imbalance. Consider a binary classification task with two feature vectors x1 and x2 from classes 1 and 2 respectively, extracted using some model (possibly a deep network). Let the classification weight vector for class 1 and 2 be ω1 and ω2 respectively. The primary loss function may be, for example, Softmax.
An example arrangement is shown in
In general, for the class 1 vector ω1 to pick x1 and not x2 for correct classification, it is required that ω1ᵀx1>ω1ᵀx2⇒∥x1∥2 cos θ1>∥x2∥2 cos θ2. Here, θ1 and θ2 are the angles between the weight vector ω1 (class 1 vector only) and x1, x2 respectively. The feasible set (range for θ1) required for this inequality to hold is known as the angular classification margin. Note that it is also a function of θ2.
Setting r=∥x2∥2/∥x1∥2 (r>0), for correct classification it is required that cos θ1>r cos θ2⇒θ1<cos⁻¹(r cos θ2), as cos θ decreases monotonically from 1 to −1 for θ∈[0, π]. This inequality needs to hold true for any θ2.
Fixing cos θ2=δ results in θ1<cos⁻¹(rδ). From the domain constraints of cos⁻¹, we have −1≤rδ≤1. Combining this inequality with r>0 results in 0<r≤1/δ.
For these purposes it suffices to only look at the case δ>0 because δ<0 doesn't change the inequality −1≤rδ≤1.
Discussion on the angular classification margin. The upper bound on θ1 (i.e., cos⁻¹(r cos θ2)) is plotted for a range of δ ([0.1, 1]) and the corresponding range of r.
In other terms, as the norm of x2 increases relative to that of x1, the angular margin within which ω1 correctly classifies x1 while rejecting x2 decreases. A difference in norms (r>1) therefore has an adverse effect during training by effectively enforcing smaller angular classification margins for classes with smaller-norm samples. This also leads to lopsided classification margins across multiple classes due to the difference in class norms, as can be seen in
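As a numerical illustration of this bound, the following Python sketch (the grid ranges and the helper name margin_upper_bound are illustrative only, not part of the disclosed method) evaluates cos⁻¹(rδ) over the admissible range of r for several values of δ:

```python
import numpy as np

# Upper bound on theta_1 for correct classification: theta_1 < arccos(r * delta),
# defined only where -1 <= r * delta <= 1 (the domain of arccos).
def margin_upper_bound(r, delta):
    rd = np.clip(r * delta, -1.0, 1.0)   # clip only guards against round-off at the boundary
    return np.degrees(np.arccos(rd))

for delta in (0.1, 0.5, 1.0):                 # delta = cos(theta_2)
    rs = np.linspace(0.1, 1.0 / delta, 5)     # admissible range 0 < r <= 1/delta
    bounds = margin_upper_bound(rs, delta)
    print(f"delta={delta}: r={np.round(rs, 2)} -> bound (deg)={np.round(bounds, 1)}")
```

For every fixed δ, the bound shrinks toward 0° as r approaches 1/δ, which is the margin imbalance described above.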
Effects of Softmax on the norm of MNIST features. The effects of Softmax on the norm of the features (and thereby on the classification margin) on MNIST can be qualitatively observed in
Removing the disconnect between training and testing. Evaluation using the cosine metric is currently ubiquitous in applications such as face recognition, where the gallery features are normalized beforehand (thereby requiring fewer FLOPs during large-scale testing). However, during training this is not the case, and the norm is usually not constrained. This creates a disconnect between the training and testing scenarios which hinders performance. The Ring Loss method removes this disconnect in an elegant way.
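By way of illustration only, such a cosine-metric test protocol may be sketched as follows; the feature dimensions, the random stand-in features, and the helper name cosine_match are assumptions for the example rather than part of the method:

```python
import numpy as np

def cosine_match(query, gallery_unit):
    """Match a query feature against gallery features that were L2-normalized
    once, offline; matching then reduces to a single matrix-vector product."""
    q = query / np.linalg.norm(query)   # the query is still normalized at test time
    scores = gallery_unit @ q           # cosine similarity against every gallery entry
    return int(np.argmax(scores)), scores

rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 128))                                   # stand-in deep features
gallery_unit = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)  # normalized offline
best, scores = cosine_match(rng.normal(size=128), gallery_unit)
```

Because only directions matter at test time, any norm information the network relied on during training is discarded here, which is the disconnect addressed by Ring Loss.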
Regularizing Softmax Loss with the Norm Constraint. The ideal training scenario for a system tested under the cosine metric would be one in which all features pointing in the same direction have the same loss. However, this is not true for the most commonly used loss function, Softmax, and its variants (an FC layer combined with the Softmax function and the cross-entropy loss). Assuming that the weights are normalized (i.e., ∥ωk∥=1), the Softmax loss for a feature vector xi with correct class yi can be expressed as LS(xi)=−log(exp(∥xi∥2 cos θyi)/Σk exp(∥xi∥2 cos θk)), where θk is the angle between xi and ωk.
Clearly, despite having the same direction, two features with different norms have different losses. From this perspective, the straightforward solution to regularize the loss and remove the influence of the norm is to normalize the features before Softmax. However, this approach is effectively a projection method, that is, it calculates the loss as if the features are normalized to the same scale, while the actual network does not learn to normalize features.
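This observation can be checked directly. The following NumPy sketch (randomly generated stand-in features and an illustrative softmax_loss helper, not the disclosed network) evaluates the Softmax cross-entropy loss for two features that share a direction but differ in norm:

```python
import numpy as np

def softmax_loss(x, W, y):
    """Cross-entropy of the softmax over logits W @ x (an FC layer with normalized rows)."""
    logits = W @ x
    logits = logits - logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[y])

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64))
W = W / np.linalg.norm(W, axis=1, keepdims=True)   # ||w_k||_2 = 1 for every class
x = rng.normal(size=64)
x = x / np.linalg.norm(x)                          # a unit-norm direction

print(softmax_loss(1.0 * x, W, y=3))   # norm 1
print(softmax_loss(5.0 * x, W, y=3))   # same direction, norm 5: a different loss value
```

The two losses differ even though the cosine metric used at test time would treat the two features as identical.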
The need for feature normalization in feature space. As an illustration, consider the training-set and testing-set features of class 8 from MNIST, trained by vanilla Softmax, in
This is yet another motivation to normalize features during training. Forcing the network to learn to normalize the features helps to mitigate this problem during testing wherein the network learns to work in the normalized feature space.
Incorporating the norm constraint as a convex problem. Having identified the need to normalize the sample features from the network, the problem can now be formulated. If the primary loss function is denoted LS (for instance, the Softmax loss), and the network is assumed to provide the deep feature for a sample x as F(x), the loss can be minimized subject to the normalization constraint as follows:
min LS(F(x)) s.t. ∥F(x)∥2=R
Here, R is the scale constant to which the features are to be normalized. Note that this problem is non-convex in F(x) because the set of feasible solutions is itself non-convex due to the norm equality constraint. Approaches which use standard SGD while ignoring this critical point do not provide feasible solutions to this problem, and thereby the network does not learn to output normalized features. Indeed, the features obtained using this straightforward approach are not normalized, in contrast to those from the Ring Loss method, as shown in
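For contrast, the projection-style (hard) normalization may be sketched in PyTorch as follows; the tensor shapes are arbitrary stand-ins for deep features. The projected copy used by the loss has norm exactly R, while the raw features emitted by the network remain unconstrained:

```python
import torch
import torch.nn.functional as F

# Hard normalization as a projection inside the loss path.
features = torch.randn(8, 128, requires_grad=True)   # stand-in for deep features F(x)
R = 1.0
projected = R * F.normalize(features, p=2, dim=1)    # each row scaled to norm R

print(projected.norm(dim=1))   # all exactly R
print(features.norm(dim=1))    # unconstrained: the network itself never learns to normalize
```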
Ring Loss Definition. Ring Loss LR is defined as LR=(λ/2m) Σi=1…m (∥F(xi)∥2−R)², where F(xi) is the deep feature for the i-th sample, m is the number of samples in the batch, R is the learned target norm, and λ is the loss weight with respect to the primary loss function.
Ring Loss (LR) can be used along with any other loss function, such as Softmax or large-margin Softmax. The loss encourages the norm of the samples to approach the value R (a learned parameter) rather than enforcing it explicitly through a hard normalization operation. This approach provides informed gradients towards a better minimum, which helps the network satisfy the normalization constraint. The network therefore learns to normalize the features using its model weights (rather than requiring an explicit non-convex normalization operation, or batch normalization). In contrast, batch normalization enforces a scaled normal distribution for each element of the feature independently. This does not constrain the overall norm of the feature to be equal across all samples, and it addresses neither the class-imbalance problem nor the gap between the training and testing protocols in face recognition.
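As one possible realization (a sketch only; the module name RingLoss, the initialization of R, and the default loss weight are implementation choices, not requirements of the method), the loss above may be written as a PyTorch module:

```python
import torch
import torch.nn as nn

class RingLoss(nn.Module):
    """Soft normalization: a quadratic penalty on the gap between each sample's
    feature norm and a learned target radius R, weighted by lambda."""

    def __init__(self, loss_weight=0.01, init_radius=1.0):
        super().__init__()
        self.loss_weight = loss_weight                                 # lambda
        self.radius = nn.Parameter(torch.tensor(float(init_radius)))   # learned R

    def forward(self, features):
        norms = features.norm(p=2, dim=1)                  # ||F(x_i)||_2 per sample
        return self.loss_weight * ((norms - self.radius) ** 2).mean() / 2.0

# Typical use alongside a primary loss such as Softmax cross-entropy:
#   total_loss = cross_entropy(logits, labels) + ring_loss(features)
```

Because R is an nn.Parameter, it receives gradients and is learned jointly with the network weights, leaving the loss weight λ as the only hyperparameter, as described above.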
Ring Loss Convergence Visualizations. To illustrate the effect of the Softmax loss augmented with the enforced soft normalization, analytical simulations were conducted. A 2D mesh of points spanning (−1.5, 1.5) along each of the x- and y-axes was generated. The gradients of Ring Loss (R=1) were computed, assuming the vertical dotted line in
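In outline, the simulation may be reproduced as follows; this NumPy sketch assumes the quadratic form of Ring Loss given above and an arbitrary mesh resolution:

```python
import numpy as np

# Gradient of (1/2) * (||x||_2 - R)^2 with respect to a 2-D point x is
# (||x|| - R) * x / ||x||; its negative (the descent direction) points toward the
# circle of radius R from both inside and outside.
def ring_grad(points, R=1.0):
    norms = np.linalg.norm(points, axis=1, keepdims=True)
    return (norms - R) * points / np.clip(norms, 1e-12, None)

xs = np.linspace(-1.5, 1.5, 7)
mesh = np.array([[x, y] for x in xs for y in xs])    # 2-D mesh of sample points
grads = ring_grad(mesh, R=1.0)                       # one gradient vector per mesh point
```

Plotting the negative gradients as a vector field shows every point being pulled toward the circle of radius R=1, which is the soft-normalization behavior discussed above.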
A method of training using a primary loss function augmented by a Ring Loss function has been presented, in which the network learns to normalize features as they are extracted. The Ring Loss method was found to consistently provide significant improvements over a large range of the hyperparameter λ. Further, the network learns normalization, thereby being robust to a large range of degradations.
This application is a national phase filing under 35 U.S.C. § 371 claiming the benefit of and priority to International Patent Application No. PCT/US2019/020090, filed on Feb. 28, 2019, which claims the benefit of U.S. Provisional Patent Application No. 62/710,814, filed Feb. 28, 2018. The entire contents of these applications are incorporated herein by reference.
This invention was made with government support under N6833516C0177 awarded by the US Navy. The government has certain rights in the invention.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/US2019/020090 | 2/28/2019 | WO |

Publishing Document | Publishing Date | Country | Kind
---|---|---|---
WO2019/169155 | 9/6/2019 | WO | A
Number | Date | Country
---|---|---
20210034984 A1 | Feb 2021 | US

Number | Date | Country
---|---|---
62710814 | Feb 2018 | US