Various example embodiments relate to compression of neural network(s).
Neural networks have recently prompted an explosion of intelligent applications for IoT devices, such as mobile phones, smart watches and smart home appliances. Because of high computational complexity and battery consumption related to data processing, it is usual to transfer the data to a centralized computation server for processing. However, concerns over data privacy and latency of large volume data transmission have been promoting distributed computation scenarios.
There is, therefore, a need for common communication and representation formats for neural networks to enable efficient transmission of neural network(s) among devices.
Now there has been invented an improved method and technical equipment implementing the method, by which the above problems are alleviated. Various aspects comprise an apparatus, a method, and a computer program product comprising a computer program stored therein, which are characterized by what is stated in the independent claims. Various example embodiments are disclosed in the dependent claims.
According to a first aspect, there is provided an apparatus comprising means for performing: training a neural network by applying an optimization loss function, wherein the optimization loss function considers empirical errors and model redundancy; pruning a trained neural network by removing one or more filters that have insignificant contributions from a set of filters; and providing the pruned neural network for transmission.
According to an embodiment, the means are further configured to perform: measuring filter diversities based on normalized cross correlations between weights of filters of the set of filters.
According to an embodiment, the means are further configured to perform: forming a diversity matrix based on pair-wise normalized cross correlations quantified for a set of filter weights at layers of the neural network.
According to an embodiment, the means are further configured to perform: estimating accuracy of the pruned neural network; and retraining the pruned neural network if the accuracy of the pruned neural network is below a pre-defined threshold.
According to an embodiment, the optimization loss function further considers estimated pruning loss and wherein training the neural network comprises minimizing the optimization loss function and the pruning loss.
According to an embodiment, the means are further configured to perform: estimating the pruning loss, the estimating comprising computing a first sum of scaling factors of filters to be removed from the set of filters after training; computing a second sum of scaling factors of the set of filters; and forming a ratio of the first sum and the second sum.
According to an embodiment, the means are further configured to perform, for mini-batches of a training stage: ranking filters of the set of filters according to scaling factors; selecting the filters that are below a threshold percentile of the ranked filters; pruning the selected filters temporarily during optimization of one of the mini-batches; and iteratively repeating the ranking, selecting and pruning for the mini-batches.
According to an embodiment, the threshold percentile is user specified and fixed during training.
According to an embodiment, the threshold percentile is dynamically changed from 0 to a user specified target percentile.
According to an embodiment, the filters are ranked according to a running average of scaling factors.
According to an embodiment, a sum of model redundancy and pruning loss is gradually switched off from the optimization loss function by multiplying with a factor changing from 1 to 0 during the training.
According to an embodiment, the pruning comprises ranking the filters of the set of filters based on column-wise summation of a diversity matrix; and pruning the filters that are below a threshold percentile of the ranked filters.
According to an embodiment, the pruning comprises ranking the filters of the set of filters based on an importance scaling factor; and pruning the filters that are below a threshold percentile of the ranked filters.
According to an embodiment, the pruning comprises ranking the filters of the set of filters based on column-wise summation of a diversity matrix and an importance scaling factor; and pruning the filters that are below a threshold percentile of the ranked filters.
According to an embodiment, the pruning comprises layer-wise pruning and network-wise pruning.
According to an embodiment, the means comprises at least one processor; at least one memory including computer program code; the at least one memory and the computer program code configured to, with the at least one processor, cause the performance of the apparatus.
According to a second aspect, there is provided a method for neural network compression, comprising training a neural network by applying an optimization loss function, wherein the optimization loss function considers empirical errors and model redundancy; pruning a trained neural network by removing one or more filters that have insignificant contributions from a set of filters; and providing the pruned neural network for transmission.
According to a third aspect, there is provided a computer program comprising computer program code configured to, when executed on at least one processor, cause an apparatus to: train a neural network by applying an optimization loss function, wherein the optimization loss function considers empirical errors and model redundancy; prune a trained neural network by removing one or more filters that have insignificant contributions from a set of filters; and provide the pruned neural network for transmission.
According to a fourth aspect, there is provided an apparatus, comprising at least one processor; at least one memory including computer program code; the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to train a neural network by applying an optimization loss function, wherein the optimization loss function considers empirical errors and model redundancy; prune a trained neural network by removing one or more filters that have insignificant contributions from a set of filters; and provide the pruned neural network for transmission.
In the following, various example embodiments will be described in more detail with reference to the appended drawings.
A neural network (NN) is a computation graph comprising several layers of computation. Each layer comprises one or more units, where each unit performs an elementary computation. A unit is connected to one or more other units, and the connection may have an associated weight. The weight may be used for scaling a signal passing through the associated connection. Weights may be learnable parameters, i.e., values which may be learned from training data. There may be other learnable parameters, such as those of batch-normalization (BN) layers.
The neural networks may be trained to learn properties from input data, either in a supervised or in an unsupervised way. Such learning is a result of a training algorithm, or of a meta-level neural network providing a training signal. The training algorithm changes some properties of the neural network so that its output is as close as possible to a desired output. For example, in the case of classification of objects in images, the output of the neural network can be used to derive a class or category index which indicates the class or category that the object in the input image belongs to. Examples of classes or categories may be e.g. “person”, “cat”, “dog”, “building”, “sky”.
Training usually happens by changing the learnable parameters so as to minimize or decrease the output's error, also referred to as the loss. The loss may be e.g. a mean squared error or cross-entropy. In recent deep learning techniques, training is an iterative process, where at each iteration the algorithm modifies the weights of the neural net to make a gradual improvement of the network's output, i.e., to gradually decrease the loss.
Training a neural network is an optimization process, but the final goal is different from the typical goal of optimization. In optimization, the only goal is to minimize a functional. In machine learning, the goal of the optimization or training process is to make the model learn the properties of the data distribution from a limited training dataset. In other words, the goal is to use a limited training dataset in order to learn to generalize to previously unseen data, i.e., data which was not used for training the model. This is usually referred to as generalization.
The network to be trained may be a classifier neural network, such as a Convolutional Neural Network (CNN) capable of classifying objects or scenes in input images.
Trained models or parts of deep Neural Networks (NN) may be shared in order to enable rapid progress of research and development of AI systems. The NN models are often complex and demand a lot of computational resources which may make sharing of the NN models inefficient.
There is provided a method and an apparatus to enable compressed representation of neural networks and efficient transmission of neural network(s) among devices.
As used in this application, the term “circuitry” may refer to one or more or all of the following: (a) hardware-only circuit implementations (such as implementations in only analog and/or digital circuitry) and (b) combinations of hardware circuits and software, such as (as applicable):
(i) a combination of analog and/or digital hardware circuit(s) with software/firmware and (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) and (c) hardware circuit(s) and/or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.
This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in server, a cellular network device, or other computing or network device.
The method disclosed herein provides for enhanced diversity of neural networks. The method enables pruning redundant neural network parts in an optimized manner. In other words, the method reduces filter redundancies at the layers of the NN and compresses the number of NN parameters. The method imposes constraints during the learning stage, such that the learned parameters of the NN are as orthogonal and independent of each other as possible. The outcome of the neural network compression is a representation of the neural network which is compact in terms of model complexity and size, and yet comparable to the original, uncompressed NN in terms of performance.
The method may be implemented in an off-line mode or in an on-line mode.
In the off-line mode, a neural network is trained by applying an optimization loss function considering empirical errors and model redundancy. The defined loss function, i.e. a first loss function, may be written as
Loss=Error+weight redundancy.
Given network architectures may be trained with the original task performance optimized, without imposing any constraints on learned network parameters, i.e. weights and bias terms. Mathematically, this general optimization task may be described by:
W*=arg min E0(W,D),
wherein D denotes the training dataset, and E0 the task objective function, e.g. class-wise cross-entropy for an image classification task. W denotes the weights of the neural network.
In the method disclosed herein, the optimization loss function, i.e. the objective function of filter diversity enhanced NN learning may be formulated by:
W*=arg min E0(W,D)+λKθ(W),
wherein λ is a parameter controlling the relative significance of the original task and the filter diversity enhancement term Kθ, and θ is the parameter used in function K to measure filter diversities. The objective function above represents the first loss function.
Filter diversities may be measured based on Normalized Cross Correlations between weights of filters of a set of filters. Filter diversities may be measured by quantifying pair-wise Normalized Cross Correlation (NCC) between weights of two filters represented as weight vectors e.g. Wi, Wj:
Cij=⟨Wi, Wj⟩/(‖Wi‖‖Wj‖),

in which ⟨·,·⟩ denotes the dot product of two vectors and ‖·‖ the vector norm. Note that Cij is between [−1, 1] due to the normalization of Wi and Wj.
A diversity matrix may be formed based on pair-wise NCCs quantified for a set of filter weights at layers of the neural network. For a set of filter weights at each layer, i.e. Wi, i={1, . . . , N}, all pair-wise NCCs constitute a matrix

C=[Cij], i,j=1, . . . , N  (1),

with its diagonal elements C11= . . . =CNN=1.
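For illustration, the pair-wise NCC matrix of one layer could be computed, for example, as in the following minimal Python sketch; the function name, the flattened (N, K) weight layout and the zero-norm guard are assumptions made for the example, not part of the disclosure:

```python
# Illustrative sketch only: pair-wise NCC matrix of one convolutional layer,
# assuming the N filters have been flattened into weight vectors of length K.
import numpy as np

def ncc_matrix(weights):
    """weights: array of shape (N, K) holding N filter weight vectors W_i.
    Returns the N x N matrix C with C[i, j] = <W_i, W_j> / (||W_i|| * ||W_j||)."""
    norms = np.linalg.norm(weights, axis=1, keepdims=True)
    normalized = weights / np.maximum(norms, 1e-12)   # guard against all-zero filters
    return normalized @ normalized.T                  # entries in [-1, 1], diagonal equals 1

# Toy example: 4 filters of a 3x3 convolution with 2 input channels (18 weights each).
rng = np.random.default_rng(0)
C = ncc_matrix(rng.standard_normal((4, 2 * 3 * 3)))
print(np.round(C, 2))
```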
The filter diversity Klθ at layer l may be defined based on the NCC matrix:

Klθ=Σi,j=1N,N|Cij|  (2).
A total filter diversity term Kθ=ΣKlθ is the sum of the filter diversities at all layers l=1 . . . L. The redundancy among the filters decreases, i.e. the filter diversity increases, as Kθ gets smaller.
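As one possible realization, the diversity term Kθ and the regularized loss E0+λKθ of the off-line mode could be assembled as sketched below in PyTorch; the toy model, the global-average-pooling head and the value of λ are illustrative assumptions:

```python
# Hedged sketch, not the disclosed implementation: diversity term K_theta and the
# regularized loss E0 + lambda * K_theta over the convolutional layers of a model.
import torch
import torch.nn as nn
import torch.nn.functional as F

def layer_diversity(conv: nn.Conv2d) -> torch.Tensor:
    """K_l_theta of equation (2): sum of |C_ij| over the layer's NCC matrix."""
    W = F.normalize(conv.weight.flatten(1), dim=1)   # (N, K) unit-norm filter vectors
    C = W @ W.t()                                    # pair-wise NCC matrix, equation (1)
    return C.abs().sum()

def diversity_term(model: nn.Module) -> torch.Tensor:
    """K_theta: sum of the per-layer diversities over all convolutional layers."""
    return sum(layer_diversity(m) for m in model.modules() if isinstance(m, nn.Conv2d))

# Toy forward/backward pass with the regularized loss E0 + lambda * K_theta.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Conv2d(8, 16, 3))
x, y = torch.randn(4, 3, 32, 32), torch.randint(0, 10, (4,))
logits = model(x).mean(dim=(2, 3))                   # toy classification head
lam = 1e-3                                           # assumed relative weight lambda
loss = F.cross_entropy(logits, y) + lam * diversity_term(model)
loss.backward()
```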
The trained neural network may be pruned by removing one or more filters that have insignificant contribution from a set of filters. There are alternative pruning schemes. For example, in diversity based pruning, the filters of the set of filters may be ranked based on column-wise summation of the diversity matrix (1). These summations may be used to quantify the diversity of a given filter with regard to the other filters in the set of filters. The filters may be arranged in descending order of the column-wise summations of the diversities. The filters that are below a threshold percentile p % of the ranked filters may be pruned. A value p of the threshold percentile may be e.g. user-defined. The value p may be any value from zero to 1, and is subject to requirements on the performance, e.g. accuracy, of the model and on the model size. For example, p may be 0.75 for a VGG19 network on the CIFAR-10 dataset without significantly losing accuracy. As another example, p may be 0.6 for a VGG19 network on the CIFAR-100 dataset without significantly losing accuracy. A value p of 0.75 means that 75% of the filters are pruned; correspondingly, a value p of 0.6 means that 60% of the filters are pruned.
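A minimal sketch of the diversity based pruning rule could look as follows; treating a larger column-wise sum of |NCC| values as lower diversity (i.e. higher redundancy) is an interpretation made for this example:

```python
# Illustrative sketch of diversity based pruning from the diversity matrix of one layer.
import numpy as np

def diversity_prune_mask(C, p):
    """C: N x N diversity matrix (equation (1)); p: fraction of filters to prune.
    Returns a boolean mask with True for the filters that are kept."""
    redundancy = np.abs(C).sum(axis=0)        # column-wise summation per filter
    # A large summed |NCC| means the filter is strongly correlated with the others,
    # i.e. it is the least diverse, so such filters become the pruning candidates.
    n_prune = int(round(p * len(redundancy)))
    order = np.argsort(redundancy)            # ascending: most diverse filters first
    keep = np.ones(len(redundancy), dtype=bool)
    keep[order[len(order) - n_prune:]] = False
    return keep

C = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.0],
              [0.2, 0.1, 0.0, 1.0]])
print(diversity_prune_mask(C, 0.5))           # the two highly correlated filters are pruned
```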
As another example, scaling factor based pruning may be applied. The filters of the set of filters may be ranked based on importance scaling factors. For example, a Batch-Normalization (BN) based scaling factor may be used to quantify the importance of different filters. The scaling factor may be obtained from e.g. batch-normalization or additional scaling layer. The filters may be arranged in descending order of the scaling factor, e.g. the BN-based scaling factor. The filters that are below a threshold percentile p % of the ranked filters may be pruned. A value p of the threshold percentile may be e.g. user-defined. The value p may be any value from zero to 1, and is subject to requirements on performance, e.g. accuracy, of the model, and on model size.
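Scaling factor based pruning could be sketched, for example, with Batch-Normalization weights as the importance scores; the helper name and the toy initialization are assumptions:

```python
# Illustrative sketch of scaling factor based pruning using BN gamma as importance.
import torch
import torch.nn as nn

def bn_prune_mask(bn: nn.BatchNorm2d, p: float) -> torch.Tensor:
    """Rank filters by |gamma| and mark the lowest p fraction for pruning."""
    scores = bn.weight.detach().abs()          # per-filter BN scaling factors
    n_prune = int(round(p * scores.numel()))
    order = torch.argsort(scores)              # ascending importance
    keep = torch.ones_like(scores, dtype=torch.bool)
    if n_prune > 0:
        keep[order[:n_prune]] = False          # prune the least important filters
    return keep

torch.manual_seed(0)
bn = nn.BatchNorm2d(16)
nn.init.uniform_(bn.weight)                    # toy, non-uniform importance scores
mask = bn_prune_mask(bn, p=0.25)
print(int(mask.sum()), "of 16 filters kept")
```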
As yet another example, a combination approach may be applied to prune filters. In the combination approach, the scaling factor based pruning and the diversity based pruning are combined. For example, the ranking results of the both pruning schemes may be combined, e.g. by applying an average or a weighted average. Then, the filters may be arranged according to the combined results. The filters that are below a threshold percentile p % of the ranked filters may be pruned. A value p of the threshold percentile may be e.g. user-defined. The value p may be any value from zero to 1, and is subject to requirements on performance, e.g. accuracy, of the model, and on model size.
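One concrete reading of combining the two rankings is to average the rank positions produced by each scheme, as in the sketch below; the example scores, the weight w and the averaging of rank positions (rather than raw scores) are assumptions:

```python
# Illustrative sketch of the combination approach via (weighted) averaging of rank positions.
import numpy as np

def rank_positions(scores):
    """0 = most important; larger scores are assumed to mean 'keep'."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    positions = np.empty(len(order), dtype=int)
    positions[order] = np.arange(len(order))
    return positions

def combined_prune_mask(diversity_scores, importance_scores, p, w=0.5):
    """Weighted average of the two rank positions; prune the worst p fraction."""
    combined = w * rank_positions(diversity_scores) + (1 - w) * rank_positions(importance_scores)
    n_prune = int(round(p * len(combined)))
    keep = np.ones(len(combined), dtype=bool)
    if n_prune > 0:
        keep[np.argsort(combined)[len(combined) - n_prune:]] = False
    return keep

print(combined_prune_mask([0.9, 0.2, 0.8, 0.1], [0.7, 0.6, 0.1, 0.2], p=0.5))
```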
Alternative pruning schemes 340, 345 may be applied for the trained network. The Approach I 340 represents the diversity based pruning described above. The combination approach described earlier is not shown in the illustrated example.
The Approach II 345 represents scaling factor based pruning. The filters of the set of filters may be ranked based on importance scaling factors. For example, a Batch-Normalization (BN) based scaling factor may be used to quantify the importance of different filters. The filters may be arranged in descending order of the scaling factor, e.g. the BN-based scaling factor. The filters that are below a threshold percentile p % of the ranked filters may be pruned 350. A value p of the threshold percentile may be e.g. user-defined. The value p may be any value from zero to 1, and is subject to requirements on performance, e.g. accuracy, of the model, and on model size.
As a result of pruning 350, there is provided a pruned ith convolutional layer 360. The filters illustrated using a dashed line represent the pruned filters. The pruned network may be provided for transmission from an apparatus wherein the compression of the network is performed to another apparatus. The pruned network may be transmitted from an apparatus to another apparatus.
Table 1 below shows accuracies of off-line mode pruned VGG19 network at various pruning rates.
As can be seen in Table 1, even when a pruning rate of 70% is applied, the accuracy remains high, at 0.9373.
Pruning the network in the off-line mode may cause a loss of performance, e.g. when the pruning is excessive. For example, the accuracy of image classification may be reduced. Therefore, the pruned network may be retrained, i.e. fine-tuned on the original dataset, to regain its original performance. Table 2 below shows improved accuracies after applying retraining to a VGG19 network pruned at the 70% and 75% percentiles. The network pruned at 70% achieves sufficient accuracy and thus does not require retraining, while the network pruned at 75% shows degraded performance and thus requires retraining to restore its performance. Sufficient accuracy is use case dependent, and may be pre-defined e.g. by a user. For example, an accuracy loss of approximately 2% due to pruning may be considered acceptable. It is to be understood that in some cases the acceptable accuracy loss may be different, e.g. 2.5% or 3%.
The method may comprise estimating accuracy of the network after pruning. For example, the accuracy of the image classification may be estimated using a known dataset. If the accuracy is below a threshold accuracy, the method may comprise retraining the pruned network. Then the accuracy may be estimated again, and the retraining may be repeated until the threshold accuracy is achieved.
In the on-line mode, a neural network is trained by applying an optimization loss function considering empirical errors, model redundancy and, further, an estimated pruning loss, i.e. the loss incurred by pruning. The defined loss function, i.e. a second loss function, may be written as
Loss=Error+weight redundancy+pruning loss.
The loss incurred by pruning is iteratively estimated and minimized during the optimization. Thus, the training of the neural network may comprise minimizing the optimization loss function and the pruning loss. Minimization of the pruning loss ensures that the potential damage caused by pruning does not exceed a given threshold. Thus, there is no need for the post-pruning retraining stage of the off-line mode.
When the pruning loss is taken into account during the learning stage, potential performance loss caused by pruning of filters may be alleviated.
When the pruning loss is taken into account during the learning stage, unimportant filters may be safely removed from the trained networks without compromising the final performance of the compressed network.
When the pruning loss is taken into account during the learning stage, the possible retraining stage of the off-line pruning mode is not needed. Thus, the extra computational costs invested in the possible retraining stage may be avoided.
When the pruning loss is taken into account during the learning stage, the strengths of important filters will be boosted and the unimportant filters will be suppressed, as illustrated in the appended drawings.
The method may comprise estimating the pruning loss. In order to estimate the potential pruning loss for a given set of filters Γ associated with scaling factors γi, the following formula may be used to define the pruning loss:

P=(Σi∈P(Γ)γi)/(Σj∈Γγj)  (3),

in which P(Γ) is the set of filters to be removed after training. The scaling factors may be e.g. the BN scaling factors. The scaling factor may be obtained e.g. from batch-normalization or from an additional scaling layer. The numerator in equation (3) is a first sum of the scaling factors of the filters to be removed from the set of filters after training. The denominator in equation (3) is a second sum of the scaling factors of the set of filters. The ratio of the first sum and the second sum is the pruning loss.
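Equation (3) translates directly into a few lines of code; the use of absolute values of the scaling factors and the small denominator guard are assumptions made for the sketch:

```python
# Minimal sketch of the pruning-loss estimate of equation (3).
import torch

def pruning_loss(scaling_factors: torch.Tensor, prune_mask: torch.Tensor) -> torch.Tensor:
    """Summed scaling factors of the filters selected for removal, P(Gamma),
    divided by the summed scaling factors of all filters in the set."""
    gammas = scaling_factors.abs()             # absolute values used here, an assumption
    return gammas[prune_mask].sum() / gammas.sum().clamp_min(1e-12)

gamma = torch.tensor([0.9, 0.05, 0.7, 0.02])   # e.g. BN scaling factors of four filters
mask = torch.tensor([False, True, False, True])
print(pruning_loss(gamma, mask))               # small ratio = little expected pruning damage
```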
So, the objective function in the on-line mode may be formulated by
W*=arg min E0(W,D)+λKθ(W)+γP.
The objective function above represents the second loss function.
In the on-line mode, a dynamic pruning approach may be applied to ensure that the scaling factor based pruning loss is a reliable and stable estimate of the real pruning loss. For each mini-batch of the training stage, the following steps may be iteratively applied: the filters of the set of filters may be ranked according to the associated scaling factors γi; filters that are below a threshold percentile p % of the ranked filters may be selected; and the selected filters, which are candidates to be removed after the training stage, may be switched off by enforcing their outputs to zero, i.e. temporarily pruned, during the optimization of one mini-batch.
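The per-mini-batch switching off could be realized, for instance, by temporarily zeroing the affine parameters of the batch-normalization layer associated with the selected filters, as sketched below; this masking mechanism is an assumption, the disclosure only states that the selected outputs are enforced to zero:

```python
# Illustrative sketch of dynamic pruning for one mini-batch: rank, switch off, restore.
import torch
import torch.nn as nn

def temporarily_prune(bn: nn.BatchNorm2d, p: float):
    """Zero the BN scale and shift of the lowest-ranked p fraction of filters so that
    their channel outputs become zero; returns the saved parameters for restoring."""
    scores = bn.weight.detach().abs()
    idx = torch.argsort(scores)[:int(round(p * scores.numel()))]
    saved = (bn.weight.data.clone(), bn.bias.data.clone())
    bn.weight.data[idx] = 0.0
    bn.bias.data[idx] = 0.0
    return saved

def restore(bn: nn.BatchNorm2d, saved):
    bn.weight.data, bn.bias.data = saved

bn = nn.BatchNorm2d(8)
saved = temporarily_prune(bn, p=0.25)
# ... forward/backward/optimizer step for this mini-batch would go here ...
restore(bn, saved)
```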
According to an embodiment, the parameter p of the lower p % percentile is user specified and fixed during the learning process/training.
According to an embodiment, the parameter p is dynamically changed, e.g. from 0 to a user specified target percentage p %.
According to an embodiment, the parameter p is automatically determined during the learning stage by minimizing the designated objective function.
According to an embodiment, the ranking of the filters is performed according to the Running Average of Scaling Factors, which is defined as follows:

γ̄it=(1−k)·γ̄it−1+k·γit,

in which γit is the scaling factor for filter i at epoch t, γ̄it is its running average, and k is an update rate between 0 and 1. Note that for k=1, then γ̄it=γit, i.e. the running average reduces to the instantaneous scaling factor.
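Under the reconstruction above, one update of the running average is a single weighted sum, as in the following small sketch (the value of k and the example scaling factors are arbitrary):

```python
# Illustrative sketch of the running average of a filter's scaling factor.
def update_running_average(avg_prev, gamma_t, k):
    """One update; with k = 1 the average reduces to the current scaling factor."""
    return (1.0 - k) * avg_prev + k * gamma_t

avg = 0.8                                    # running average at epoch t-1
for gamma_t in [0.6, 0.7, 0.65]:             # scaling factors at subsequent epochs
    avg = update_running_average(avg, gamma_t, k=0.3)
print(round(avg, 3))
```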
According to an embodiment, all regularization terms in the objective function may be gradually switched off by:
Loss=Error+a×(weight redundancy+pruning-loss),
in which a is the annealing factor which may change from 1.0 to 0.0 during the learning stage. This option helps to deal with undesired local minima introduced by regularization terms.
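The annealing factor a may follow, for example, a simple linear schedule; the linear shape and the epoch-based update are assumptions, the disclosure only states that a changes from 1.0 to 0.0 during the learning stage:

```python
# Illustrative sketch of a linear annealing schedule for the factor a in
# Loss = Error + a * (weight redundancy + pruning loss).
def annealing_factor(epoch, total_epochs):
    """1.0 at the first epoch, decreasing linearly to 0.0 at the last epoch."""
    return max(0.0, 1.0 - epoch / max(1, total_epochs - 1))

for epoch in (0, 3, 6, 9):
    print(epoch, round(annealing_factor(epoch, total_epochs=10), 2))
```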
The alternative pruning schemes described above may be applied in the on-line mode as well. The alternative pruning schemes comprise diversity based pruning, scaling factor based pruning and a combination approach, wherein the scaling factor based pruning and the diversity based pruning are combined.
The pruning may be performed at two stages, i.e. the pruning may comprise layer-wise pruning and network-wise pruning. This two-stage pruning scheme improves adaptability and flexibility. Further, it removes the potential risk of network collapse, which may be a problem in a simple network-wise pruning scheme.
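One possible reading of the two stages, sketched below, is that layer-wise pruning selects candidates within each layer separately while network-wise pruning selects them over all layers jointly; this interpretation, and the per-filter scores, are assumptions for the example:

```python
# Illustrative contrast between layer-wise and network-wise candidate selection.
import numpy as np

def layerwise_candidates(scores_per_layer, p):
    """Select the lowest p fraction of filters separately within each layer."""
    return [np.argsort(s)[:int(round(p * len(s)))] for s in scores_per_layer]

def networkwise_candidates(scores_per_layer, p):
    """Select the lowest p fraction of filters over all layers jointly."""
    flat = np.concatenate([np.asarray(s, dtype=float) for s in scores_per_layer])
    return np.argsort(flat)[:int(round(p * len(flat)))]

scores = [np.array([0.9, 0.1, 0.8]), np.array([0.4, 0.3, 0.7, 0.2])]
print(layerwise_candidates(scores, 0.33))    # per-layer filter indices
print(networkwise_candidates(scores, 0.33))  # indices into the concatenated list
```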
The neural network compression framework may be applied to a given neural network architecture to be trained with a dataset of examples for a specific task, such as an image classification task, an image segmentation task, an image object detection task, and/or a video object tracking task. Dataset may comprise e.g. image data or video data. The neural network compression method and apparatus disclosed herein enables efficient, error resilient and safe transmission and reception of the neural networks among device or service vendors.
An apparatus may comprise at least one processor; at least one memory including computer program code; the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to train a neural network by applying an optimization loss function, wherein the optimization loss function considers empirical errors and model redundancy; prune a trained neural network by removing one or more filters that have insignificant contributions from a set of filters; and provide the pruned neural network for transmission.
The apparatus may be further caused to measure filter diversities based on normalized cross correlations between weights of filters of the set of filters.
The apparatus may be further caused to form a diversity matrix based on pair-wise normalized cross correlations quantified for a set of filter weights at layers of the neural network.
The apparatus may be further caused to estimate accuracy of the pruned neural network; and retrain the pruned neural network if the accuracy of the pruned neural network is below a pre-defined threshold.
The apparatus may be further caused to estimate the pruning loss, the estimating comprising computing a first sum of scaling factors of filters to be removed from the set of filters after training; computing a second sum of scaling factors of the set of filters; and forming a ratio of the first sum and the second sum.
The apparatus may be further caused to, for mini-batches of a training stage: rank filters of the set of filters according to scaling factors; select the filters that are below a threshold percentile of the ranked filters; prune the selected filters temporarily during optimization of one of the mini-batches; iteratively repeat the ranking, selecting and pruning for the mini-batches.
It is obvious that the present invention is not limited solely to the above-presented embodiments, but it can be modified within the scope of the appended claims.
Number | Date | Country | Kind
---|---|---|---
20195032 | Jan 2019 | FI | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/FI2020/050006 | 1/2/2020 | WO | 00