Embodiments of this invention relate to the processing of training samples in a Deep Convolutional Neural Network (DCNN), for example in vision tasks such as image classification.
Deep Convolutional Neural Networks (DCNN) are a popular method for vision tasks including image classification, object detection and semantic segmentation. DCNNs usually comprise convolutional layers, normalization layers and activation layers. Normalization layers are important in improving performance and speeding up the training process.
However, the training of DCNNs is generally difficult and time consuming. The performance of previous training methods is also limited.
Batch Normalization (BN), as described in Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift”, International Conference on Machine Learning, pages 448-456, 2015, normalizes the feature map with the mean and variance calculated along the batch, height and width dimensions of a feature map, and then re-scales and re-shifts the normalized feature map to maintain the representation ability of a DCNN. Based on BN, many normalization methods for other tasks have been proposed that calculate the mean and variance statistics along different dimensions. For example, Layer Normalization (LN), as described in Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton, “Layer normalization”, NIPS Deep Learning Symposium, 2016, was proposed for calculating the statistics along the channel, height and width dimensions for Recurrent Neural Networks (RNNs). Weight Normalization (WN), as described in Tim Salimans and Durk P Kingma, “Weight normalization: A simple reparameterization to accelerate training of deep neural networks”, Advances in neural information processing systems, pages 901-909, 2016, was proposed to reparameterize the weight vector for supervised image recognition, generative modelling, and deep reinforcement learning. Divisive Normalization, as described in Mengye Ren, Renjie Liao, Raquel Urtasun, Fabian H Sinz, and Richard S Zemel, “Normalizing the normalizers: Comparing and extending network normalization schemes”, International Conference on Learning Representations, 2016, which includes BN and LN as special cases, was proposed for image classification, language modeling and super-resolution. Instance Normalization (IN), as described in Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky, “Instance normalization: The missing ingredient for fast stylization”, arXiv preprint arXiv:1607.08022, 2016, where the statistics are calculated along the height and width dimensions, was proposed for fast stylization. Instead of calculating the statistics from the data, Normalization Propagation, as described in Devansh Arpit, Yingbo Zhou, Bhargava Kota, and Venu Govindaraju, “Normalization propagation: A parametric technique for removing internal covariate shift in deep networks”, International Conference on Machine Learning, pages 1168-1176, 2016, estimated the statistics data-independently from the distributions in the layers. Group Normalization (GN), as described in Yuxin Wu and Kaiming He, “Group normalization”, Proceedings of the European Conference on Computer Vision (ECCV), pages 3-19, 2018, divided the channels into groups and calculated the statistics along each grouped channel, height and width dimension, showing stability across batch sizes. Positional Normalization (PN), as described in Boyi Li, Felix Wu, Kilian Q Weinberger, and Serge Belongie, “Positional normalization”, Advances in Neural Information Processing Systems, pages 1620-1632, 2019, was proposed to calculate the statistics along the channel dimension for generative networks.
BN, IN, LN, GN and PN share the same four steps: divide the intermediate feature map into multiple feature groups; calculate the mean and variance of each feature group; use the calculated mean and variance of each feature group to normalize the corresponding feature group; and use two extra trainable parameters for each channel of the intermediate feature map to recover the DCNN representation ability. The main difference between BN, IN, LN, GN and PN is the division of feature groups.
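By way of illustration only, the following minimal PyTorch sketch (not taken from any of the cited works; the function name and tensor shapes are assumptions) shows these shared steps for a feature map of shape (N, C, H, W). The choice of the dimensions over which the statistics are reduced, together with any additional grouping of channels, is what distinguishes BN, IN, LN, GN and PN.

```python
import torch

def normalize_generic(x, gamma, beta, reduce_dims, eps=1e-5):
    """Illustrative only: the shared normalization pattern for one choice of grouping.

    x:           feature map of shape (N, C, H, W)
    gamma, beta: per-channel scale and shift, each of shape (1, C, 1, 1)
    reduce_dims: dimensions over which the statistics are computed; this choice
                 (step 1, the division into feature groups) distinguishes the methods.
    """
    # Step 2: mean and variance of each feature group.
    mean = x.mean(dim=reduce_dims, keepdim=True)
    var = x.var(dim=reduce_dims, keepdim=True, unbiased=False)
    # Step 3: normalize each feature group with its own statistics.
    x_hat = (x - mean) / torch.sqrt(var + eps)
    # Step 4: two trainable parameters per channel recover representation ability.
    return gamma * x_hat + beta

x = torch.randn(8, 16, 32, 32)
gamma, beta = torch.ones(1, 16, 1, 1), torch.zeros(1, 16, 1, 1)
bn_like = normalize_generic(x, gamma, beta, reduce_dims=(0, 2, 3))  # BN-style axes
in_like = normalize_generic(x, gamma, beta, reduce_dims=(2, 3))     # IN-style axes
```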
Among these normalization methods, BN can usually achieve good performance at large batch sizes. However, its performance may degrade at small batch sizes. GN enjoys a greater degree of stability at different batch sizes, while slightly under-performing BN at large batch sizes. Other normalization methods, including IN, LN and PN, perform well in specific tasks, but are usually less generalizable to multiple vision tasks than BN and under-perform at large batch sizes.
It is desirable to develop a method for normalization that overcomes such problems.
According to one aspect there is provided a device for machine learning, the device comprising one or more processors configured to implement a first neural network layer, a second neural network layer and a normalization layer arranged between the first neural network layer and the second neural network layer, the normalization layer being configured to, when the device is undergoing training on a batch of training samples: receive multiple outputs of the first neural network layer for a plurality of training samples of the batch, each output comprising multiple data values for different indices on a first dimension and on a second dimension, the first dimension representing a channel dimension; group the outputs into multiple groups in dependence on the indices on the first and second dimensions to which they relate; form a normalization output for each group; and provide the normalization outputs as input to the second neural network layer.
This may allow for the training of a DCNN with good performance, that performs stably at different batch sizes, and that is generalizable to multiple vision tasks. This may also speed up and improve the performance of DCNN training.
The said second dimension may represent one or more spatial dimensions, for example the height and width of a feature map of an image. This may provide an effective way of performing machine learning on spatially extended samples.
The step of forming a normalization output for each group may comprise computing an aggregate statistical parameter over the outputs in that group. Such a parameter may conveniently be used to assist in the training of subsequent neural network layers.
The step of forming a normalization output for each group may comprise computing a mean and a variance over the outputs in that group. One or both of these quantities may be useful in training subsequent neural network layers.
The step of grouping the outputs may comprise allocating each output to only a single one of the groups. In this way each output may not be overrepresented in the training of subsequent neural network layers.
The step of grouping the outputs may comprise allocating all outputs relating to a common index or point on the first dimension and to a common index or point on the second dimension to the same group. Thus such a group may comprise outputs that are related by having those indices or points in common.
The step of grouping the outputs may comprise allocating outputs relating to a common batch to different groups. Including the batch dimension in the statistic calculation may further improve the performance and generalizability of normalization.
The step of grouping the outputs may comprise allocating outputs to different groups in dependence on the point or index on the first dimension to which they relate. This may allow aggregated values derived from that group to provide information about outputs having that point or index.
The step of grouping the outputs may comprise allocating outputs to different groups in dependence on the point or index on the second dimension to which they relate. This may allow aggregated values derived from that group to provide information about outputs having that point or index.
The normalization layer may be configured to: receive a control parameter; compare the control parameter to a predetermined threshold; and in dependence on that parameter determine how, during the said grouping step, to allocate outputs to different groups in dependence on the points in the first dimension and the second dimension to which they relate. Selecting the size of feature group that is used to calculate the statistic may further improve the stability of normalization to different batch sizes.
The device may be configured to form the control parameter in dependence on the number of training samples in the batch. For example, when the batch size is small, a small value of the control parameter (referred to below as the hyper-parameter G) can be used, while when the batch size is large, a large value can be used.
The outputs may be feature maps formed by the first neural network layer. This may allow the device to be used in computer vision and image classification tasks.
The device may be configured to train the second neural network layer in dependence on the normalization outputs.
According to a second aspect there is provided a method for training, on a batch of training samples, a device for machine learning comprising a first neural network layer, a second neural network layer and a normalization layer arranged between the first neural network layer and the second neural network layer, the method comprising: receiving multiple outputs of the first neural network layer for a plurality of training samples of the batch, each output comprising multiple data values for different indices on a first dimension and on a second dimension, the first dimension representing a channel dimension; grouping the outputs into multiple groups in dependence on the indices on the first and second dimensions to which they relate; forming a normalization output for each group; and providing the normalization outputs as input to the second neural network layer.
This method may allow for the training of a DCNN with good performance, that performs stably at different batch sizes, and that is generalizable to multiple vision tasks. The method may speed up and improve the performance of DCNN training.
The present invention will now be described by way of example with reference to the accompanying drawings.
Described herein is a normalization approach for the training of deep convolutional neural networks that has been shown in some implementations to achieve better performance, stability and generalizability than previous approaches.
The method described herein may be implemented by a machine learning device having a processor, the processor being configured to implement a first neural network layer, a second neural network layer and a normalization layer arranged between the first neural network layer and the second neural network layer.
As will be described in more detail below, the normalization layer may be configured to, when the device is undergoing training on a batch of training samples, receive multiple outputs of the first neural network layer for a plurality of training samples of the batch, each output comprising multiple data values for different indices on a first dimension and on a second dimension, the first dimension representing a channel dimension.
Preferably, the outputs are feature maps formed by the first neural network layer, as described in the examples below.
In one example, the first dimension is the channel C of the feature map. The second dimension represents one or more spatial dimensions of the feature map. For example, the second dimension may represent the height (H) and/or width (W) of the feature map.
The outputs are then grouped into multiple groups in dependence on the indices on the first and second dimensions to which they relate and a normalization output is formed for each group. Advantageously, the step of grouping the outputs may also comprise allocating outputs relating to a common batch to different groups.
In one example, consider a feature map output by the previous layers of the network, F_{N×C×H×W}, where N is the batch size of the feature map.
The channel, height and width dimensions are first merged into a new dimension to give F_{N×M}, where M = C × H × W.
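As a minimal illustration of this merging step, assuming a PyTorch-style (N, C, H, W) tensor layout:

```python
import torch

N, C, H, W = 8, 64, 32, 32
F = torch.randn(N, C, H, W)         # feature map from the previous layer
F_merged = F.reshape(N, C * H * W)  # shape (N, M), with M = C * H * W
```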
The step of forming a normalization output for each group preferably comprises computing an aggregate statistical parameter over the outputs in that group, such as the mean and variance.
In this example, the mean µ_g and variance σ_g² of the g-th feature group are calculated along the batch dimension and the new merged (C, H, W) dimension as:

$$\mu_g = \frac{1}{NS}\sum_{n=1}^{N}\sum_{i=(g-1)S+1}^{gS} F_{n,i}, \qquad \sigma_g^2 = \frac{1}{NS}\sum_{n=1}^{N}\sum_{i=(g-1)S+1}^{gS}\left(F_{n,i}-\mu_g\right)^2$$

where G is a hyper-parameter giving the number of groups into which the new dimension is divided, and S = M/G is the number of feature instances inside each divided feature group.
The hyper-parameter G may be used to control the number of feature instances or the size of feature groups for calculating the statistics.
The normalization layer may therefore be further configured to receive a control parameter (i.e. hyper-parameter G) and compare the control parameter to a predetermined threshold. In dependence on that parameter, the normalization layer may determine how, during the said grouping step, to allocate outputs to different groups in dependence on the points in the first dimension and the second dimension to which they relate.
The device may be configured to form the parameter G in dependence on the number of training samples in the batch.
When the batch size of a DCNN is determined, a full batch size may cause confused gradients while a small batch size may cause noisy gradients. Good statistics in normalization should cover a proper amount of feature instances. The method described herein may therefore introduce the feature group and the hyper-parameter G to control the number of feature instances or the size of feature groups for calculating the statistics. For example, when the batch size is small, a small G can be used to combine the whole new dimension into the statistic calculation, while when the batch size is large, a large G can be used to split the new dimension into small pieces for calculating the statistics. Then for g ∈ [1, G], the feature map is normalized as:

$$\hat{F}_{n,i} = \frac{F_{n,i} - \mu_g}{\sqrt{\sigma_g^2 + \epsilon}}, \qquad i \in \big[(g-1)S + 1,\ gS\big]$$

where ε is a small number added for division stability. F_{N×M} is then split back to F_{N×C×H×W} and, following BN, IN, LN, GN and PN, two extra trainable parameters γ_c and β_c are added for each feature channel c in order to maintain the representation ability of the DCNN:

$$\tilde{F}_{n,c,h,w} = \gamma_c\,\hat{F}_{n,c,h,w} + \beta_c$$
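The following is a minimal PyTorch sketch of a forward pass consistent with the equations above. It is illustrative only: the module name, the default value of G, the handling of the case where G does not divide M, and the absence of test-time running statistics are assumptions rather than details taken from the description.

```python
import torch
import torch.nn as nn

class BatchGroupNorm2d(nn.Module):
    """Illustrative sketch of the normalization described above (training-mode path only)."""

    def __init__(self, num_channels, num_groups=32, eps=1e-5):
        super().__init__()
        self.G = num_groups   # hyper-parameter G
        self.eps = eps
        # Two trainable parameters per channel (gamma_c, beta_c).
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x):
        N, C, H, W = x.shape
        M = C * H * W          # merged (channel, height, width) dimension
        S = M // self.G        # feature instances per group (assumes G divides M)
        # Merge C, H and W into one dimension, then split it into G groups of size S.
        f = x.reshape(N, self.G, S)
        # Statistics are computed over the batch dimension and within each group.
        mean = f.mean(dim=(0, 2), keepdim=True)                # mu_g, shape (1, G, 1)
        var = f.var(dim=(0, 2), keepdim=True, unbiased=False)  # sigma_g^2
        f_hat = (f - mean) / torch.sqrt(var + self.eps)
        # Split back to (N, C, H, W) and apply the per-channel affine transform.
        x_hat = f_hat.reshape(N, C, H, W)
        return self.gamma * x_hat + self.beta

# Example usage: normalize a batch of 16-channel feature maps with G = 4.
layer = BatchGroupNorm2d(num_channels=16, num_groups=4)
y = layer(torch.randn(8, 16, 24, 24))
```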
In BN, the µ_c and σ_c² used in the testing stage are moving averages of those calculated in the training stage. The method described herein may use this policy as well, since a normalization method should preferably be batch-size independent at test time. IN, LN, GN and PN generally use the statistics calculated directly during the testing stage.
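A minimal sketch of how such moving averages of the per-group statistics might be tracked during training is given below; the momentum value, buffer names and update rule follow common BN implementations and are assumptions rather than details taken from the description.

```python
import torch
import torch.nn as nn

class RunningGroupStats(nn.Module):
    """Illustrative: exponential moving averages of per-group statistics, accumulated
    during training so that batch-size-independent statistics can be reused at test time."""

    def __init__(self, num_groups, momentum=0.1):
        super().__init__()
        self.momentum = momentum  # assumed smoothing factor
        self.register_buffer("running_mean", torch.zeros(1, num_groups, 1))
        self.register_buffer("running_var", torch.ones(1, num_groups, 1))

    def update(self, batch_mean, batch_var):
        m = self.momentum
        with torch.no_grad():
            self.running_mean.mul_(1 - m).add_(m * batch_mean)
            self.running_var.mul_(1 - m).add_(m * batch_var)
```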
The normalization layer therefore groups the outputs into multiple groups in dependence on the indices on the first and second dimensions to which they relate. A normalization output is then formed for each group. The normalization outputs are then provided as input to the second neural network layer.
The outputs may be grouped in different ways. The step of grouping the outputs may comprise allocating each output to only a single one of the groups. The step of grouping the outputs may comprise allocating all outputs relating to a common point on the first dimension and to a common point on the second dimension to the same group.
In another example, the step of grouping the outputs comprises allocating outputs to different groups in dependence on the point on the first dimension to which they relate. Alternatively or additionally, the step of grouping the outputs may comprise allocating outputs to different groups in dependence on the point on the second dimension to which they relate.
In a preferred implementation, the step of grouping the outputs comprises allocating outputs relating to a common batch to different groups. Therefore, groups may additionally be formed along the batch dimension (N). Referring to the representation shown in
The difference between BN, IN, LN, GN, PN and the method described herein, which will be referred to below as Batch Group Normalization (BGN), with respect to the dimensions along which the statistics are computed is illustrated in
One application of the method described herein is in image classification. In the example described below, ImageNet (see Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks”, Advances in neural information processing systems, pages 1097-1105, 2012) was used, which contains 1.28M training images and 50,000 validation images. The model used in the examples is ResNet-50 (see Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition”, Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016), in which around 50 convolutional layers followed by normalization and activation layers are stacked with residual learning. 8 GPUs were used in the ImageNet experiments. The gradients used for backpropagation were averaged across the 8 GPUs, while the mean and variance used in BN and BGN were calculated within each GPU independently. γ_c and β_c were initialized as 1 and 0 respectively, while all other trainable parameters were initialized in the same way as in He et al. 120 epochs were trained, with the learning rate decayed by 10x at the 30th, 60th and 90th epochs. The initial learning rates for the experiments with batch sizes of 128, 64, 32, 16, 8, 4 and 2 were 0.4, 0.2, 0.1, 0.05, 0.025, 0.0125 and 0.00625 respectively, following the method described in Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He, “Accurate, large minibatch sgd: Training imagenet in 1 hour”, arXiv preprint arXiv:1706.02677, 2017. Stochastic Gradient Descent (SGD) was used as the optimizer. A weight decay of 10⁻⁴ was applied to all trainable parameters.
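A minimal PyTorch sketch of a corresponding optimizer and learning-rate schedule is given below. The linear scaling of the initial learning rate with batch size (0.1 at batch size 32) is inferred from the values listed above; the momentum value and the use of torchvision's stock ResNet-50 are assumptions.

```python
import torch
from torchvision.models import resnet50

batch_size = 128
model = resnet50()  # stand-in for the ResNet-50 variant described above

# Initial learning rate scaled linearly with batch size: 0.4 at 128, 0.00625 at 2.
lr = 0.1 * batch_size / 32
optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                            momentum=0.9, weight_decay=1e-4)
# Decay the learning rate by 10x at the 30th, 60th and 90th of 120 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 60, 90], gamma=0.1)
```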
For the validation, each image was cropped into 224 x 224 patches from the center. The Top1 accuracy is reported as the evaluation criterion. All experiments were trained under the same programming implementation, but replacing the normalization layer according to BN, IN, LN, GN, PN, and BGN respectively.
To explore the hyper-parameter G, BGN with group numbers of 512, 256, 128, 64, 32, 16, 8, 4, 2 and 1 respectively was used as the normalization layer in ResNet-50 for ImageNet classification. The largest (as permitted by GPU memory) and smallest batch sizes in the experiments (128 and 2) were tested. The Top1 accuracy on the validation dataset is shown in
In general, the results demonstrate that a large G (e.g. 512) is more suitable for a large batch size (e.g. 128), while a small G (e.g. 1) is more suitable for a small batch size (e.g. 2). This demonstrates that the number of feature instances affects the statistic calculation in normalization. Suitably, when the batch size is large, a large G may be used to split the new dimension so as to maintain a proper number of feature instances for the statistic calculation. Suitably, when the batch size is small, a small G may be used to combine the new dimension so as to maintain a proper number of feature instances for the statistic calculation.
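By way of illustration only, a simple heuristic for selecting G from the batch size, consistent with the trend reported above, might look as follows. The exact mapping is an assumption and would in practice be tuned (the experiments above favour G = 1 at batch size 2 and G = 512 at batch size 128).

```python
def select_num_groups(batch_size, max_groups=512, reference_batch=128):
    """Hypothetical heuristic: a large G for large batches and a small G for small
    batches, keeping a moderate number of feature instances per feature group."""
    g = max_groups * batch_size // reference_batch
    return max(1, min(max_groups, g))

select_num_groups(128)  # -> 512
select_num_groups(2)    # -> 8 (the reported results favour values this small or smaller)
```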
The results of further experiments are shown in
The following example demonstrates the application of the method for image classification on the CIFAR-10 (Canadian Institute for Advanced Research) dataset with NAS (Neural Architecture Search). This demonstrates that, in addition to manually designed and regular neural architectures, BGN is also applicable to automatically designed and less regular neural architectures. The following example uses cell-based architectures designed automatically with NAS, specifically DARTS, as described in Hanxiao Liu, Karen Simonyan, and Yiming Yang, “DARTS: Differentiable architecture search”, International Conference on Learning Representations, 2019. For DARTS, the normalization methods were used for both the search and training stages.
As shown in
Given a set of possible operations, DARTS encodes the architecture search space with continuous parameters to form a one-shot model and performs the search by training the one-shot model with bi-level optimization, in which the model weights and the architecture parameters are optimized alternately on training data and validation data respectively.
For the DARTS training, the same experimental settings were used as in Liu et al. The BN layers in DARTS were replaced with the normalization layers of IN, LN, GN, PN and BGN in both the search and evaluation stages. In this implementation, the method searched for 8 cells in 50 epochs with batch size 64 and the initial number of channels set to 16. SGD was used to optimize the model weights with an initial learning rate of 0.025, momentum 0.9 and weight decay 3×10⁻⁴. Adam, as described in Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization”, arXiv preprint arXiv:1412.6980, 2014, was used to optimize the architecture parameters with an initial learning rate of 3×10⁻⁴, momentum (0.5, 0.999) and weight decay 10⁻³. A network of 20 cells and 36 initial channels was used for evaluation to ensure a model size comparable to other baseline models. The whole training set was used to train the model for 600 epochs with batch size 96 to ensure convergence. For GN, the configuration G = 32 was used, while for BGN, the configuration G = 256 was used. Other hyper-parameters were set to be the same as those in the search stage. The best 20-cell architecture searched on CIFAR-10 by DARTS was trained from scratch with the corresponding normalization method used during the search phase. The validation accuracy of each method is shown in
DCNNs have been known to be vulnerable to maliciously perturbed examples, known as adversarial attacks. Adversarial training has been proposed to counter this problem. In the following example, BGN was applied to adversarial training and its results compared to BN, IN, LN, GN and PN. The WideResNet, as described in Sergey Zagoruyko and Nikos Komodakis, “Wide residual networks”, Edwin R. Hancock, Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1-87.12, BMVA Press, September 2016, with the depth set to 10 and the widening factor set to 2, was used for image classification tasks on the CIFAR-10 dataset. The neural network was trained and evaluated against a four-step Projected Gradient Descent (PGD) attack. For the PGD attack, the step size was set to 0.00784 and the maximum perturbation norm to 0.0157. 200 epochs were trained until convergence. Owing to the particular nature of adversarial training, G = 128 was used in GN and BGN. This divides images into patches, which can help to improve robustness by breaking the correlation of adversarial attacks in different image blocks and constraining the adversarial attacks on the features within a limited range. The Adam optimizer was used with a learning rate of 0.01. The robust and clean accuracies of training WideResNet with BN, IN, LN, GN, PN and BGN as the normalization layer are shown in
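A minimal sketch of the four-step PGD attack used in such adversarial training is given below, using the step size and perturbation bound quoted above. The L-infinity formulation, the [0, 1] pixel range and the function name are assumptions rather than details taken from the description.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, images, labels, eps=0.0157, step_size=0.00784, steps=4):
    """Illustrative four-step L-infinity PGD attack (assumed formulation)."""
    adv = images.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), labels)
        grad = torch.autograd.grad(loss, adv)[0]
        # Ascend the loss along the gradient sign, then project back into the eps-ball.
        adv = adv.detach() + step_size * grad.sign()
        adv = torch.min(torch.max(adv, images - eps), images + eps)
        adv = adv.clamp(0.0, 1.0)  # keep pixel values in an assumed [0, 1] range
    return adv.detach()
```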
The BGN method may also be implemented as part of a Few Shot Learning (FSL) task. FSL aims to train models capable of recognizing new, previously unseen categories using only limited training samples. A training dataset with sufficient annotated samples comprises base categories. The test dataset contains C novel classes, each of which is associated with only a few, K, labelled samples (for example, 5 or fewer samples) that comprise the support set, while the remaining unlabelled samples comprise the query set and are used for evaluation. This may also be referred to as a C-way K-shot FSL classification problem.
In one example, the imprinted weights model, as described in Hang Qi, Matthew Brown, and David G Lowe, “Low-shot learning with imprinted weights”, Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5822-5830, 2018, was used. At training time, a cosine classifier was learned on top of feature extraction layers, and each column of the classifier weight parameters can be regarded as a prototype for the respective class. At test time, a new class prototype (a new column of classifier weight parameters) was defined by averaging the feature representations of the support images, and the unlabelled images were classified via a nearest neighbor strategy. Settings including 5-way 1-shot and 5-way 5-shot were tested for the ResNet-12 backbone (see Boris Oreshkin, Pau Rodriguez López, and Alexandre Lacoste, “Tadam: Task dependent adaptive metric for improved few-shot learning”, NeurIPS, 2018) on miniImageNet (see Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al., “Matching networks for one shot learning”, NeurIPS, 2016). In this example, the training protocol described in Spyros Gidaris and Nikos Komodakis, “Dynamic few-shot visual learning without forgetting”, CVPR, 2018, was used. The BGN model was optimized using SGD with Nesterov momentum set to 0.9, weight decay to 0.0005, mini-batch size to 256, and 60 epochs of training. All input images were resized to 84 x 84. The learning rate was initialized to 0.1, and changed to 0.006, 0.0012 and 0.00024 at the 20th, 40th and 50th epochs, respectively. The mean and variance of the accuracy obtained when replacing the normalization layers in Imprinted Weights with BN, IN, LN, GN, PN and the proposed BGN, training on miniImageNet, for the 5-way 1-shot and 5-shot tasks, are shown in
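A minimal sketch of the weight-imprinting and nearest-prototype classification described above is given below; the feature dimensionality, function names and episode construction are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def imprint_prototypes(support_features, support_labels, num_classes):
    """Average the L2-normalized support features of each novel class to form a
    prototype, i.e. one new column of cosine-classifier weights per class."""
    feats = F.normalize(support_features, dim=1)
    protos = torch.stack([feats[support_labels == c].mean(dim=0)
                          for c in range(num_classes)])
    return F.normalize(protos, dim=1)

def classify_queries(query_features, prototypes):
    """Cosine similarity between query features and prototypes; nearest prototype wins."""
    q = F.normalize(query_features, dim=1)
    return (q @ prototypes.t()).argmax(dim=1)

# Example: a 5-way 5-shot episode with 64-dimensional (assumed) feature embeddings.
support_feats = torch.randn(25, 64)
support_labels = torch.arange(5).repeat_interleave(5)
prototypes = imprint_prototypes(support_feats, support_labels, num_classes=5)
predictions = classify_queries(torch.randn(10, 64), prototypes)
```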
A machine learning device 900 configured to implement the BGN method is schematically illustrated in
The device 900 comprises a processor 901 configured to process the datasets in the manner described herein. For example, the processor 901 may be implemented as a computer program running on a programmable device such as a Central Processing Unit (CPU). The device 900 comprises a memory 902 which is arranged to communicate with the processor 901. Memory 902 may be a non-volatile memory. The processor 901 may also comprise a cache (not shown in
As described above, the method divides the intermediate feature map into feature groups in a different way. In the preferred implementation, each intermediate feature map has four dimensions including the batch, height, width and channel dimension. The height, width and channel dimensions are first merged into one dimension and then this new dimension is divided into multiple feature groups. The hyper-parameter G is used to control how many groups the intermediate feature map is divided into. The statistics (e.g. mean and variance) are then calculated for each feature group across the entire mini-batch.
The normalization method described herein exhibits good performance, performs stably at different batch sizes, and is generalizable to multiple vision tasks. It does not use additional trainable parameters, information across multiple layers or iterations, or extra computation. It can calculate the mean and variance statistics from batch and grouped (channel, height and width) dimensions and may use a hyper-parameter G to control the size of divided feature groups. This normalization method can, in some implementations, speed up and improve the performance of DCNN training.
The method can advantageously consider the batch dimension in the statistic calculation (i.e. include the batch dimension in the mean and variance calculation), and can control the size of the feature group used for the statistic calculation to be moderate (i.e. neither too large nor too small). Including the batch dimension in the statistic calculation may further improve the performance and generalizability of normalization, while selecting the size of feature group that is used to calculate the statistic may further improve the stability of normalization to different batch sizes.
In the method described herein, no extra trainable parameters or calculations or multi-iteration/multi-layer information are used. The method can be used jointly with other techniques using extra trainable parameters or calculations or multi-iteration/multi-layer information to further improve the performance. It is therefore intuitive to implement, is orthogonal to and can be used in addition to many methods to further improve performance.
In some implementations, BGN outperforms BN by almost 10% on ImageNet classification with a small batch size. It has been shown in some implementations to outperform BN, IN, LN, GN and PN on image classification, Neural Architecture Search, adversarial learning, Few Shot Learning and Unsupervised Domain Adaptation tasks.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
This application is a continuation of International Application No. PCT/CN2020/114041, filed on Sep. 8, 2020, the disclosure of which is hereby incorporated by reference in its entirety.
|  | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | PCT/CN2020/114041 | Sep 2020 | WO |
| Child | 18180841 |  | US |