Aspects of the present disclosure relate, in general, to methods for training an artificial neural network and, more particularly although not exclusively, to a method for providing a quantized neural network.
Deep neural networks show promising results in various computer vision tasks as well as many other fields. However, deep learning models usually contain many layers and a large number of parameters and can therefore use a large amount of resources during their training (repeated forward and backward propagation) as well as during inference (forward propagation), which ultimately limits their application on edge devices, such as hardware-limited devices that are positioned logically at the edge of a telecommunication network. For example, using a trained deep neural network model for predictions involves computations consisting of multiplications of real-valued weights by real-valued activations in a forward pass. These multiplications are computationally expensive as they comprise floating-point operations, which may prove intractable or impracticable to perform on resource-limited devices where computational resources and memory are at a premium.
To alleviate this problem, a number of approaches have been proposed to quantize deep neural network (DNN) models in order to enable acceleration and compression of the models. That is, in order to decrease the storage and compute requirements of a model during inference, it is possible for some of its parameters, such as weights and/or activations, to be stored as integers with a low number of bits. For example, instead of 32-bit (4 bytes) floating point numbers, 8-bit integers (1 byte) may be used. The process of converting model parameters from “continuous” floating point to discrete integer numbers is called quantization.
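By way of illustration only, the following Python (PyTorch) sketch shows the basic idea of converting 32-bit floating point weights to 8-bit integers and back. The per-tensor max-absolute scaling and the helper names are assumptions used for illustration and are not taken from the present disclosure.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Map a float32 tensor onto the signed 8-bit integer grid [-128, 127]."""
    scale = w.abs().max() / 127.0                    # one scale per tensor (assumed convention)
    q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover an approximate float32 tensor from the int8 representation."""
    return q.to(torch.float32) * scale

w = torch.randn(4, 4)                                # 32-bit floats: 4 bytes per value
q, s = quantize_int8(w)                              # 8-bit integers: 1 byte per value, plus one scale
print((w - dequantize(q, s)).abs().max())            # rounding error introduced by quantization
```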
Quantization of DNN models by reducing the bit-width of weights and/or activations enables significant reductions in memory. Such quantized DNNs use less storage space and are more easily distributed over resource-limited devices. Furthermore, arithmetic with lower bit-width numbers is faster and cheaper. Since floating-point calculations are computationally intensive and may not always be supported on the microcontrollers of some ultra low-power embedded devices, quantized DNN models may also be advantageously deployed on edge devices with reduced resources. For example, 8-bit fixed-point quantized DNNs have become widely used in industrial projects and are supported in the most common frameworks for DNN training and inference.
However, despite the advantages of quantization, the accuracy and quality of quantized DNNs can suffer in comparison with DNNs that use relatively higher bit-width parameters utilising 16-bit or 32-bit floating point arithmetic. Furthermore, the use of relatively lower bit-width parameters can trigger a requirement to either fine-tune a model in order to compensate for the reduction in granularity with which parameters may be represented, or retrain it from scratch, thus leading to increased use of resources and downtime.
According to a first aspect, there is provided a method for training an artificial neural network comprising multiple nodes each defining a quantized activation function configured to output a quantized activation value, the nodes arranged in multiple layers, in which nodes in adjacent layers are connected by connections each defining a quantized connection weight function configured to output a quantized connection weight value, the method comprising minimising a loss function, the loss function comprising a scalable regularisation factor defined by a differentiable periodic function configured to provide a finite number of minima selected on the basis of a quantisation scheme for the neural network, whereby to constrain a connection weight value to one of a predetermined number of values of the quantisation scheme.
The differentiable periodic function forms a smooth quantization regularization function for training quantized neural networks. The function is so defined as to push or constrain weight and activation values of a neural network to a selected quantization grid according to quantization parameters representing scale parameters for weights and/or hidden inputs of the network. Advantageously, a model can therefore be quantized to any bit-width precision, and no special implementation measures are required in order to apply the method using existing architectures.
Each of the minima of the periodic function coincides with a value of the quantisation scheme, which itself defines a number of integer bits.
The differentiable periodic function can be used as part of a regularization term in a loss calculation in order to push values of weights/activations to a set of discrete points during training.
Using the loss function, a quantized activation value is constrained to one of a predetermined number of values of the quantisation scheme. In an implementation of the first aspect, a quantized connection weight value is tuned, e.g. varied or modified, and the loss function is minimised using a gradient descent mechanism. This process can be iteratively performed until the loss function is minimised.
According to a second aspect, there is provided a non-transitory machine-readable storage medium encoded with instructions for training an artificial neural network, the instructions executable by a processor of a machine whereby to cause the machine to minimise a loss function comprising a scalable regularisation factor defined by a differentiable periodic function configured to provide a finite number of minima selected on the basis of a quantisation scheme for the neural network, whereby to constrain a connection weight value to one of a predetermined number of values of the quantisation scheme.
In an implementation of the second aspect, the non-transitory machine-readable storage medium can comprise further instructions to adjust a weight scale parameter of the differentiable periodic function, the weight scale parameter representing a scale factor for a weight value of a weight function defining a connection between nodes of the neural network, and compute a value for the loss function on the basis of the adjusted weight scale parameter.
Further instructions can be provided to adjust an activation scale parameter of the differentiable periodic function, the activation scale parameter representing a scale factor for an activation value of an activation function of a node of the neural network, and compute a value for the loss function on the basis of the adjusted activation scale parameter. In an example, a value of the loss function can be computed by performing a gradient descent calculation in which, for example, a first-order iterative optimization algorithm is used to determine local minima of the periodic differentiable function.
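By way of illustration only, the following Python (PyTorch) sketch shows how a loss function comprising a base task loss and a scalable regularisation factor may be minimised by gradient descent with respect to both the connection weights and a weight scale parameter. The sin²-based regulariser, the helper name sqr and the factor 0.1 are illustrative assumptions and are not taken from the present disclosure.

```python
import torch

def sqr(x, n1=-128, n2=127):
    """Assumed smooth periodic regulariser: zero exactly at the integers of
    [n1, n2], growing quadratically outside the grid (illustration only)."""
    xc = torch.clamp(x, n1, n2)
    return torch.sin(torch.pi * xc) ** 2 + (x - xc) ** 2

w = torch.randn(8, 8, requires_grad=True)        # connection weights
log_s = torch.zeros(1, requires_grad=True)       # weight scale parameter (log-domain keeps it positive)
opt = torch.optim.SGD([w, log_s], lr=0.05)
x, y = torch.randn(32, 8), torch.randn(32, 8)    # toy regression data

for _ in range(200):
    s = log_s.exp()
    base_loss = ((x @ w) - y).pow(2).mean()      # base task loss
    reg = sqr(s * w).mean()                      # pushes the scaled weights towards integer values
    loss = base_loss + 0.1 * reg                 # 0.1 acts as the scalable regularisation factor
    opt.zero_grad()
    loss.backward()
    opt.step()
```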
According to a third aspect, there is provided a quantisation method to constrain a parameter for a neural network to one of a number of integer bits defining a selected quantisation scheme as part of a regularisation process in which a loss function comprising a scalable regularisation factor defined by a differentiable periodic function configured to provide a finite number of minima selected on the basis of a quantisation scheme for the neural network is minimised, the method comprising iteratively minimising the loss function by adjusting a quantized parameter value as part of a gradient descent mechanism.
According to a fourth aspect, there is provided a neural network comprising a set of quantised activation values and a set of quantised connection weight values trained according to a method as provided herein. The neural network can comprise a set of parameters quantised according to a method of the third aspect.
In an implementation of the fourth aspect, the neural network can be initialised using sample statistics from training data. For example, the parameters can comprise scale factors for activations, which can be initialised using sample statistics from the training data, and scale factors for weights, which can be initialised using the current maximum absolute values of the weights.
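By way of illustration only, a minimal sketch of such an initialisation is shown below; the helper name init_scales and the mapping of the maximum absolute value onto the upper grid point n2 are assumptions for illustration rather than part of the present disclosure.

```python
import torch

def init_scales(weight_tensors, calibration_activations, n2=127):
    """Illustrative initialisation: weight scales from the current maximum
    absolute weight values, activation scales from sample statistics of the
    training (calibration) data."""
    s_w = [n2 / w.abs().max().clamp_min(1e-8) for w in weight_tensors]
    s_a = [n2 / a.abs().max().clamp_min(1e-8) for a in calibration_activations]
    return s_w, s_a

weights = [torch.randn(32, 16), torch.randn(4, 32)]   # weight tensors of quantized blocks
acts = [torch.randn(64, 16)]                           # activations collected on training samples
s_w, s_a = init_scales(weights, acts)
```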
For a more illustrative understanding of the present disclosure, reference is now made, by way of example only, to the following descriptions taken in conjunction with the accompanying drawings, in which:
Example embodiments are described below in sufficient detail to enable those of ordinary skill in the art to embody and implement the systems and processes herein described. It is important to understand that embodiments can be provided in many alternate forms and should not be construed as limited to the examples set forth herein. Accordingly, while embodiments can be modified in various ways and take on various alternative forms, specific embodiments thereof are shown in the drawings and described in detail below as examples. There is no intent to limit to the particular forms disclosed. On the contrary, all modifications, equivalents, and alternatives falling within the scope of the appended claims should be included. Elements of the example embodiments are consistently denoted by the same reference numerals throughout the drawings and detailed description where appropriate.
The terminology used herein to describe embodiments is not intended to limit the scope. The articles “a,” “an,” and “the” are singular in that they have a single referent, however the use of the singular form in the present document should not preclude the presence of more than one referent. In other words, elements referred to in the singular can number one or more, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, items, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, items, steps, operations, elements, components, and/or groups thereof.
Unless otherwise defined, all terms (including technical and scientific terms) used herein are to be interpreted as is customary in the art. It will be further understood that terms in common usage should also be interpreted as is customary in the relevant art and not in an idealized or overly formal sense unless expressly so defined herein.
Quantization techniques enable reductions in memory and computational costs in DNN models. However, without using specialized tricks for training and optimization, model accuracy and quality can be degraded. Typically, the optimisation of models by way of quantization cannot be generalized to arbitrary tasks due to the specific approaches used. For example, approaches to reduce the bit width of the values of weights and activations may rely on schemes that map the range of the distribution of parameter values to the range of a quantization grid which depends on the bit width. Although such approaches can reduce memory and computational cost, they rely on keeping full-precision floating point weights during training for backpropagation. Thus, during training a comparatively large amount of memory is still required.
Accordingly, approaches for enabling the deployment of DNNs over a larger set of devices that may be computationally limited are based either on specific training quantization algorithms or on the use of additional loss functions which are added to a main loss function in order to minimize the difference between full-precision weights and/or activations of DNNs and their quantized analogues. As noted, specific settings or selections made as part of a training process are needed in order to enable such approaches to be compatible with different models. Existing methods cannot therefore be generalized to arbitrary tasks.
According to an example, a general quantization mechanism is provided in which a smooth quantization regularization function for training quantized neural networks is provided. The function is smooth and differentiable, and naturally pushes the weight and activation values of a DNN closer to a selected quantization grid according to quantization parameters. It is therefore possible to enable quantization of a model to any bit-width precision. The mechanism does not depend on network architecture and can be applied to any existing supervised problems (such as classification, regression, speech synthesis, etc.).
In an implementation, the function is used as part of a regularization term in a loss calculation in order to push values of weights/activations to a set of discrete points during training. The function has a limited number of minima that can correspond to a selected quantization scheme, which means that values for weights (and/or activations) can be constrained within a desired range.
In order to build a framework for the definition of a smooth quantization regularization (SQR) function according to an example, some preliminary explanations are provided. Firstly, it is possible to define a standard uniform quantization round on a uniform grid of integers from segment [n1: n2] by Qu(x), such that Qu(x) rounds x to the nearest integer and clips the result to the segment, that is, Qu(x) = min(max(round(x), n1), n2). A similar function can be defined for a vector x, and it is assumed that the function Qu is applied to each component of the vector x. The mean squared quantization error (MSQE) of vector x on a uniform grid of integers from segment [n1: n2] can then be denoted MSQE(x), the mean of the squared componentwise differences between x and Qu(x).
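By way of illustration only, the uniform quantization round Qu and the MSQE described above may be sketched as follows; the clip-of-round form of Qu and the componentwise mean are assumptions consistent with the description rather than expressions reproduced from the present disclosure.

```python
import torch

def Q_u(x: torch.Tensor, n1: int, n2: int) -> torch.Tensor:
    """Round each component to the nearest integer of the grid [n1, n2]."""
    return torch.clamp(torch.round(x), n1, n2)

def msqe(x: torch.Tensor, n1: int, n2: int) -> torch.Tensor:
    """Mean squared quantization error of x on the integer grid [n1, n2]."""
    return (x - Q_u(x, n1, n2)).pow(2).mean()

x = torch.tensor([0.2, 1.7, -3.4, 130.0])
print(Q_u(x, -128, 127))    # tensor([  0.,   2.,  -3., 127.])
print(msqe(x, -128, 127))   # mean squared rounding error of the four components
```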
In practice, the layers of a neural network commute with scalar multiplication, so a layer can be evaluated using integer arithmetic by applying it to rounded, scaled weights and input activations and then undoing the scaling.
Therefore, instead of minimizing MSQE(x) directly, the quantization error of the scaled vector, MSQE(s·x), can be minimized by setting the scale factor s. Generally speaking, the weights of each quantized block of a model and each quantized activation layer have their own scale factors. In some cases, individual rows of weight matrices or convolution filters have their own scale factors, in which case they can be considered as individual quantized blocks.
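By way of illustration only, the following sketch evaluates a linear layer using rounded, scaled weights and input activations and then undoes the scaling, as described above; the function name quantized_linear and the per-tensor scale factors are illustrative assumptions.

```python
import torch

def quantized_linear(x, w, s_w, s_a, n1=-128, n2=127):
    """Evaluate a linear layer with rounded, scaled weights/activations and
    undo the scaling afterwards (illustrative sketch)."""
    w_q = torch.clamp(torch.round(s_w * w), n1, n2)   # integer-valued weights
    x_q = torch.clamp(torch.round(s_a * x), n1, n2)   # integer-valued activations
    y_int = x_q @ w_q.t()                             # matmul of integer-valued tensors (kept as floats here)
    return y_int / (s_w * s_a)                        # inverse of the scaling

x = torch.randn(5, 16)
w = torch.randn(8, 16)
s_w, s_a = 127 / w.abs().max(), 127 / x.abs().max()
print((quantized_linear(x, w, s_w, s_a) - x @ w.t()).abs().max())  # small rounding error
```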
In the case of a neural network quantization mechanism, the vector x and its scale factor s can represent the weights tensors (Wi) of quantized blocks of the model and their scale factors s^w_i, and they can equally represent the quantized activation layers (Aj) of the model and their scale factors s^a_j. The set of weights tensors of quantized blocks can be denoted W, whilst the vector consisting of the scale factors s^w_i of the quantized blocks can be denoted by s^w and the vector consisting of the scale factors s^a_j of quantized activation layers can be denoted by s^a.

Let ξ be the input data distribution, and let f(x; W, s^w, s^a) be the function that the quantized model defines on sample x. That is, according to an example, the quantized model performs calculations using the rounded, rescaled tensor Qu(s^w_i·Wi)/s^w_i instead of using the weight tensor Wi of each quantized block, and using the rounded, rescaled tensor Qu(s^a_j·Aj(x))/s^a_j instead of the tensor Aj(x) of each quantized activation layer on sample x. In fact, the tensor Aj(x) of each activation layer also depends on W, s^w and s^a.
The MSQE of weights of a quantized model can be defined, in an example, as the aggregate over the quantized blocks of the quantization errors MSQE(s^w_i·Wi) of the scaled weights tensors, denoted MSQEw(W, s^w), and the MSQE of activations of the quantized model for an input data distribution ξ can be denoted by MSQEa(W, s^w, s^a), the expectation over ξ of the aggregate quantization errors MSQE(s^a_j·Aj(x)) of the scaled activation layers. If L(x; W, s^w, s^a) is the value of the base loss function on a sample of data x, then the neural network quantization problem can be represented as minimising the expected base loss E_x∼ξ[L(x; W, s^w, s^a)] over the parameters (W, s^w, s^a), subject to those parameters lying in Ω, where Ω is some region of parameters (W, s^w, s^a) of limited quantization errors MSQEw and MSQEa.

According to an aspect, for a class of smooth functions ϕ that will be defined and discussed in more detail below, the above minimisation problem reduces to finding the minimum of the regularised loss

E_x∼ξ[L(x; W, s^w, s^a)] + λw·SQRw(W, s^w) + λa·SQRa(W, s^w, s^a)

in the domain of parameters (W, s^w, s^a), where SQRw and SQRa denote the terms obtained from MSQEw and MSQEa by replacing the squared rounding error with the smooth function ϕ, and λw, λa > 0 are regularisation weights.
According to an example, a function ϕ(x) can be defined as a smooth quantization regularization (SQR) function for the uniform grid of integers from segment [n1: n2] when it is smooth and its minima are attained exactly at the integer points of the segment [n1: n2], at which the value of ϕ(x) is zero, so that ϕ(x) vanishes precisely where the quantization error MSQE(x) vanishes. Thus, for any a, b ∈ ℝ, 0 < a < b there exists an SQR function ϕ(x) such that aMSQE(x) ≤ ϕ(x) ≤ bMSQE(x).
Furthermore, SQR functions as defined herein according to an example possess the natural properties of a quantization error - that is, an SQR, ϕ(x), has the same number of minima as MSQE(x), is symmetric with respect to the grid points, and treats all grid points equally.
Furthermore, for an SQR, ϕ(x), according to an example for which aMSQE(x) ≤ ϕ(x) ≤ bMSQE(x) for some a, b ∈ ℝ, 0 < a < b, a corresponding bound holds for scaled arguments: for any s > 0, the value ϕ(s·x) is bounded above and below by the corresponding multiples of MSQE(s·x).
The goal of quantization is to minimize a quantization norm, because it reflects the value of the rounding error. MSQE is such a quantization norm, but it is not differentiable, and its behaviour at the points where its derivative is discontinuous makes it difficult to move between its minima. As a result, it is a poor target functional for the optimization problem. On the contrary, an SQR function as defined above can be used to optimize a model loss function using, e.g., a gradient descent algorithm, since it is differentiable and has a preselected number of minima in the range [n1, n2].
Accordingly, for any SQR ϕ(x) for which aMSQE(x) ≤ ϕ(x) ≤ bMSQE(x) for some a, b ∈ ℝ, 0 < a < b, and for any λw, λa > 0, each solution to the optimization problem of minimising the regularised loss set out above in the domain of parameters (W, s^w, s^a) also minimises the expected base loss in some region Ω of parameters (W, s^w, s^a) of limited quantization errors MSQEw and MSQEa. Region Ω contains all the points at which the weighted sum of squared quantization errors λwMSQEw + λaMSQEa is less than or equal to a threshold determined by the bound b, while within region Ω the weighted sum λwMSQEw + λaMSQEa does not exceed a corresponding threshold determined by the bound a. Thus, according to an example, selection of values for the parameters a, b, λw and λa enables adjustment of the width of the channel between these two thresholds.
Thus, by minimizing the smooth regularised loss set out above, the minima of the expected base loss in a region of limited quantization errors MSQEw and MSQEa can be obtained. Note that a similar proposition for the function MSQE is not true, since it is not differentiable. The scale factors s^w and s^a are themselves parameters of this optimisation and can be adjusted during training alongside the weights.
An SQR function according to an example, denoted Qsin, can be constructed for two fixed integers n1 and n2, where n1 < n2, and any x ∈ ℝ, as a smooth periodic sin-based function whose minima coincide with the integers of the segment [n1: n2]. The function so defined is smooth, and the proper choice of n1 and n2 enables quantization of a model to any bit precision. For example, selecting n1 = -128 and n2 = 127, an int8 scheme is obtained, whereas selecting n1 = 0 and n2 = 255, a uint8 scheme is obtained. Furthermore, a 1-bit scheme (Qsinbin) and a ternary (triple) scheme (Qsintriple) can be obtained in the same way, and these are also C2(ℝ).
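By way of illustration only, and because the precise Qsin expression is not reproduced here, the following sketch uses an assumed sin²-based stand-in to show how the choice of n1 and n2 selects the quantization grid: every point of the int8 grid (n1 = -128, n2 = 127) or of the uint8 grid (n1 = 0, n2 = 255) is a minimum at which the regulariser vanishes.

```python
import torch

def sqr(x, n1, n2):
    # assumed stand-in for the Qsin regularizer described above: smooth,
    # periodic inside [n1, n2], minima exactly at the integers of the grid
    xc = torch.clamp(x, n1, n2)
    return torch.sin(torch.pi * xc) ** 2 + (x - xc) ** 2

grid_int8 = torch.arange(-128, 128, dtype=torch.float32)    # n1 = -128, n2 = 127
grid_uint8 = torch.arange(0, 256, dtype=torch.float32)      # n1 = 0,    n2 = 255
print(sqr(grid_int8, -128, 127).abs().max())   # ~0: every int8 grid point is a minimum
print(sqr(grid_uint8, 0, 255).abs().max())     # ~0: every uint8 grid point is a minimum
```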
Thus, according to an example, given a neural network with floating-point weights, a corresponding network which uses integer weights and inner activations (hidden inputs) can be constructed using smooth quantization regularizers (SQRs) such as those described above (of which Qsin is an example).
In block 205, a global accumulation SQR block is set that is configured to collect SQR values for weights and hidden inputs from each layer of the neural network. In block 207, the loss function with SQR (as described above) is calculated and gradients are computed for the neural network weights and new parameters (scale parameters).
In block 209, all parameters are updated by the gradient step. When minimising the SQR value (i.e. training the network), weights that are closer and closer to integer values are obtained, which means that evaluation of the neural network iteratively approaches integer evaluation. Finally, in an example, the weights (and activations, etc.) can be rounded to the nearest integer values.
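By way of illustration only, the following PyTorch sketch mirrors blocks 205 to 209: SQR values are accumulated over the parameters of the network, added to the base loss, the parameters and scale factors are updated by a gradient step, and the scaled weights are finally rounded to the nearest integers. The sin²-based stand-in regulariser, the per-parameter scale tensors and the factor lam are illustrative assumptions; the SQR contribution of hidden inputs (activations) would be accumulated similarly, for example via forward hooks, and is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sqr(x, n1=-128, n2=127):
    # assumed smooth periodic regulariser with minima on the integer grid (see earlier sketch)
    xc = torch.clamp(x, n1, n2)
    return torch.sin(torch.pi * xc) ** 2 + (x - xc) ** 2

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
scales = {n: torch.ones(1, requires_grad=True) for n, _ in model.named_parameters()}
opt = torch.optim.SGD(list(model.parameters()) + list(scales.values()), lr=0.01)
lam = 0.1                                                    # scalable regularisation factor

for step in range(100):
    x, y = torch.randn(64, 16), torch.randint(0, 4, (64,))  # toy classification batch
    loss = F.cross_entropy(model(x), y)                      # base loss
    reg = sum(sqr(scales[n] * p).mean()                      # block 205: accumulate SQR over layers
              for n, p in model.named_parameters())
    (loss + lam * reg).backward()                            # block 207: loss with SQR, gradients
    opt.step()                                               # block 209: gradient step
    opt.zero_grad()

# finally, snap the scaled weights to the nearest integer values
with torch.no_grad():
    for n, p in model.named_parameters():
        p.copy_(torch.round(scales[n] * p) / scales[n])
```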
Therefore, a method for training an artificial neural network comprising multiple nodes each defining a quantized activation function configured to output a quantized activation value, in which the nodes are arranged in multiple layers, and in which nodes in adjacent layers are connected by connections each defining a quantized connection weight function configured to output a quantized connection weight value, can proceed by minimising a loss function, the loss function comprising a scalable regularisation factor defined by a differentiable periodic function configured to provide a finite number of minima selected on the basis of a quantisation scheme for the neural network, whereby to constrain a connection weight value to one of a predetermined number of values of the quantisation scheme. As described above, the differentiable periodic function can be an SQR function, such as the Qsin function. Selection of parameters for the function enables the construction of a framework that defines a desired bit width for the training mechanism, such as uint8, int8 and so on, as described above, such that a network with floating-point parameters can be used as the basis for construction of a corresponding (that is, the same) neural network which uses integer weights and inner activations (hidden inputs).
With reference to
Examples in the present disclosure can be provided as methods, systems or machine-readable instructions, such as any combination of software, hardware, firmware or the like. Such machine-readable instructions may be included on a computer readable storage medium (including but not limited to disc storage, CD-ROM, optical storage, etc.) having computer readable program codes therein or thereon.
The present disclosure is described with reference to flow charts and/or block diagrams of the method, devices and systems according to examples of the present disclosure. Although the flow diagrams described above show a specific order of execution, the order of execution may differ from that which is depicted. Blocks described in relation to one flow chart may be combined with those of another flow chart. In some examples, some blocks of the flow diagrams may not be necessary and/or additional blocks may be added. It shall be understood that each flow and/or block in the flow charts and/or block diagrams, as well as combinations of the flows and/or diagrams in the flow charts and/or block diagrams can be realized by machine readable instructions.
The machine-readable instructions may for example, be executed by a general-purpose computer, a special purpose computer, an embedded processor or processors of other programmable data processing devices to realize the functions described in the description and diagrams. In particular, a processor or processing apparatus may execute the machine-readable instructions. Thus, modules of apparatus may be implemented by a processor executing machine-readable instructions stored in a memory, or a processor operating in accordance with instructions embedded in logic circuitry. The term ‘processor’ is to be interpreted broadly to include a CPU, processing unit, ASIC, logic unit, or programmable gate set etc. The methods and modules may all be performed by a single processor or divided amongst several processors.
Such machine-readable instructions may also be stored in a computer readable storage that can guide the computer or other programmable data processing devices to operate in a specific mode. For example, the instructions may be provided on a non-transitory computer readable storage medium encoded with instructions, executable by a processor.
Accordingly, such machine-readable instructions may be loaded onto a computer or other programmable data processing devices, so that the computer or other programmable data processing devices perform a series of operations to produce computer-implemented processing, such that the instructions executed on the computer or other programmable devices provide an operation for realizing functions specified by flow(s) in the flow charts and/or block(s) in the block diagrams, such as block 315, for example, or those functions specified with reference to
Further, the teachings herein may be implemented in the form of a computer software product, the computer software product being stored in a storage medium and comprising a plurality of instructions for making a computer device implement the methods recited in the examples of the present disclosure.
While the method, apparatus and related aspects have been described with reference to certain examples, various modifications, changes, omissions, and substitutions can be made without departing from the present disclosure. In particular, a feature or block from one example may be combined with or substituted by a feature/block of another example.
The features of any dependent claim may be combined with the features of any of the independent claims or other dependent claims.
This application is a continuation of International Application No. PCT/RU2020/000263, filed on Jun. 03, 2020, the disclosure of which is hereby incorporated by reference in its entirety.
|        | Number            | Date     | Country |
|--------|-------------------|----------|---------|
| Parent | PCT/RU2020/000263 | Jun 2020 | WO      |
| Child  | 18074166          |          | US      |