The present application claims priority to European Patent Application 18151416.7 filed with the European Patent Office on 12 Jan. 2018, the entire contents of which are incorporated herein by reference.
This disclosure relates to artificial neural networks (ANNs).
The “background” description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description which may not otherwise qualify as prior art at the time of filing, is neither expressly nor implicitly admitted as prior art against the present disclosure.
So-called deep neural networks (DNNs), as an example of an ANN, have become standard machine learning tools for solving a variety of problems such as computer vision and automatic speech recognition.
In deep learning, so-called batch normalisation (BN) has become popular, and many successful DNNs contain some BN layers. Batch normalisation relies on the empirical observation that DNNs tend to learn more efficiently when their input features (or in other words, in this context, the data passed from layer to layer of the DNN) are uncorrelated with zero mean and unit variance. Because a DNN may comprise an ordered series of layers, such that one layer receives as input data the output of the preceding layer and passes its output data to form the input to a next layer, batch normalisation acts to normalise, or in this context to convert to zero mean and unit variance, the feature data passed from one layer to another. However, the processing is also based upon learned parameters which act to apply an affine transform to the data.
The training of learned parameters of a batch normalisation process can involve a backward propagation of errors or "back propagation" process as part of an overall training process by which a so-called loss function is evaluated for the whole ANN. The learned parameters are modified during the training phase so that the loss decreases as training proceeds.
During inference, BN therefore involves a multiplicative scaling and additive shifting of the feature maps and so requires at least some multiplications.
This disclosure provides a computer-implemented method of training an artificial neural network (ANN) by generating a first learned parameter for use in normalising input data values during a subsequent inference phase of the trained ANN, the method comprising:
By this arrangement, the scaling factor that can subsequently be used for inference can be a vector of powers-of-two so that no multiplication is necessarily needed in the BN layer at least during such subsequent inference operations.
The present disclosure also provides computer software which, when executed by a computer, causes the computer to implement the above method.
The present disclosure also provides a non-transitory machine-readable medium which stores such computer software.
The present disclosure also provides a computer-implemented method of operating an artificial neural network (ANN) to process input data values, the method comprising:
The present disclosure also provides an artificial neural network (ANN) configured to process input data values, the ANN comprising:
Further respective aspects and features of the present disclosure are defined in the appended claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary, but are not restrictive, of the present technology.
A more complete appreciation of the disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, in which:
Referring now to the drawings, the output of an example neuron may be expressed as:

$$y = \Phi\left(b + \sum_i w_i x_i\right)$$

Here x and w represent the inputs and weights respectively, b is the bias term that the neuron optionally adds, and the variable i is an index covering the number of inputs (and therefore also the number of weights that affect this neuron).
The neurons in a layer have the same activation function Φ, though from layer to layer, the activation functions can be different.
The input neurons I1 . . . I3 do not themselves normally have associated activation functions. Their role is to accept data from (for example) a supervisory program overseeing operation of the ANN. The output neuron(s) O1 provide processed data back to the supervisory program. The input and output data may be in the form of a vector of values such as:

$$[x_1, x_2, x_3]$$
Neurons in the layers 210, 220 are referred to as hidden neurons. They receive inputs only from other neurons and output only to other neurons.
The activation function is non-linear (such as a step function, a so-called sigmoid function, a hyperbolic tangent (tanh) function, or a rectification function (ReLU)).
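By way of an illustrative, non-limiting sketch (the function name and the default choice of tanh here are assumptions for illustration only), the neuron computation described above may be expressed as:

```python
import math

def neuron_output(inputs, weights, bias=0.0, activation=math.tanh):
    """Compute phi(sum_i(w_i * x_i) + b) for a single neuron."""
    # Weighted sum of the inputs, plus the optional bias term b.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Apply the non-linear activation function phi (tanh by default).
    return activation(z)

# Example: a neuron fed by three inputs, as with I1 . . . I3 above.
print(neuron_output([0.5, -1.0, 2.0], [0.1, 0.4, -0.3], bias=0.2))
```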
Training and Inference
Use of an ANN such as the ANN discussed above involves two phases of operation: a training phase and an inference phase.
The so-called training process for an ANN can involve providing known training data as inputs to the ANN, generating an output from the ANN, comparing the output of the overall network to a known or expected output, and modifying one or more parameters of the ANN (such as one or more weights or biases) in order to aim towards bringing the output closer to the expected output. Therefore, training represents a process to search for a set of parameters which provide the lowest error during training, so that those parameters can then be used in an operational or inference stage of processing by the ANN, when individual data values are processed by the ANN.
An example training process includes so-called back propagation. A first stage involves initialising the parameters, for example randomly or using another initialisation technique. Then a so-called forward pass and a backward pass of the whole ANN are iteratively applied. A gradient or derivative of an error function is derived and used to modify the parameters.
At a basic level the error function can represent how far the ANN's output is from the expected output, though error functions can also be more complex, for example imposing constraints on the weights such as a maximum magnitude constraint. The gradient represents a partial derivative of the error function with respect to a parameter, at the parameter's current value. If the ANN were to output the expected output, the gradient would be zero, indicating that no change to the parameter is appropriate. Otherwise, the gradient provides an indication of how to modify the parameter to achieve the expected output. A negative gradient indicates that the parameter should be increased to bring the output closer to the expected output (or to reduce the error function). A positive gradient indicates that the parameter should be decreased to bring the output closer to the expected output (or to reduce the error function).
Gradient descent is therefore a training technique with the aim of arriving at an appropriate set of parameters without the processing requirements of exhaustively checking every permutation of possible values. The partial derivative of the error function is derived for each parameter, indicating that parameter's individual effect on the error function. In a backpropagation process, starting with the output neuron(s), errors are derived representing differences from the expected outputs and these are then propagated backwards through the network by applying the current parameters and the derivative of each activation function. A change in an individual parameter is then derived in proportion to the negated partial derivative of the error function with respect to that parameter and, in at least some examples, having a further component proportional to the change to that parameter applied in the previous iteration.
An example of this technique is discussed in detail in the following publication http://page.mi.fu-berlin.de/rojas/neural/ (chapter 7), the contents of which are incorporated herein by reference.
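As an illustrative sketch only of the gradient descent update described above (the function name and the learning rate and momentum values are assumptions for illustration), one iteration of the parameter update, including the component proportional to the previous change, might look like:

```python
def gradient_descent_step(params, grads, prev_updates,
                          learning_rate=0.1, momentum=0.9):
    """Update each parameter in proportion to the negated partial
    derivative of the error, plus a component proportional to the
    change applied to that parameter in the previous iteration."""
    new_params, updates = [], []
    for p, g, prev in zip(params, grads, prev_updates):
        update = -learning_rate * g + momentum * prev
        new_params.append(p + update)
        updates.append(update)
    return new_params, updates

# Example: two parameters, their error gradients, and no previous updates.
params, updates = gradient_descent_step([0.5, -0.2], [0.8, -0.1], [0.0, 0.0])
print(params, updates)
```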
Batch Normalisation
It has been found empirically and reported in the paper Ioffe, S. & Szegedy, C. (2015), Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (arXiv:1502.03167), and in Ioffe et al., U.S. 2016/0217368 A1, the contents of both of which are incorporated herein by reference, that ANNs can potentially be trained more efficiently when the input data or "features" to a layer are uncorrelated with zero mean and unit variance. Because each layer of the ANN receives the output of the preceding layer as its inputs, a so-called batch normalisation process can be used to transform or normalise the inputs to a layer into a form having zero mean and unit variance. The batch normalisation process also involves the application of learned parameters (discussed below) and so can also apply an element-wise affine transform, so that the output of the batch normalisation process need not necessarily have zero mean and unit variance.
Referring to the drawings, in example arrangements a batch normalisation (BN) layer is interposed between successive layers of the ANN, so as to normalise the feature data passed from one layer to the next.
In brief, the batch normalisation process as used during inference includes a stage of multiplying each data value by a quantity (which may be dependent on the variance of data values used in training), and adding or subtracting a quantity (which may be dependent upon the mean of data values used in training), so as to allow a unit variance and zero mean to be achieved. However, as mentioned above these quantities are also modified by learned parameters which can be trained by the training phase including a backpropagation process as discussed above. The learned parameters can vary the effect of the normalisation and depending on the learned values acquired in training, can in principle undo or otherwise change its effect. So the network can arrive at a normalisation process which best suits the data in use. The learned parameters may for example be initialised to provide zero mean and unit variance but are then allowed to vary from this arrangement.
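As a minimal sketch of this inference-time behaviour (assuming scalar data for simplicity; the epsilon constant and the function name are illustrative assumptions), the normalisation followed by the learned affine transform may be written as:

```python
import math

def batch_norm_inference(x, mean, var, gamma, beta, eps=1e-5):
    """Normalise x using statistics gathered in training, then apply the
    learned element-wise affine transform (scale gamma, shift beta)."""
    x_hat = (x - mean) / math.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta                # learned scale and shift

# With gamma = 1 and beta = 0 the normalisation is kept as-is; other
# learned values can vary, or in principle undo, its effect.
print(batch_norm_inference(2.0, mean=1.0, var=4.0, gamma=1.0, beta=0.0))
```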
An example of the training phase of the batch normalisation process is shown in a schematic flowchart, the steps of which are described below.
In the following, the input to the BN layer is the vector $x \in \mathbb{R}^O$ and the output is the vector $y \in \mathbb{R}^O$. The formulas shown below use the element of index o ($x_o$) of the input vector, and are meant to be applied element-wise for every element of the input vector x.
At a step 400, the learned parameters $\gamma_o$ and $\beta_o$ are initialised, for example randomly, or to values which will give a unit variance and zero mean, or to other initial values. The process which follows allows for learning to take place starting from those initial values.
The training is conducted using batches, each of multiple data values. Multiple batches may be used in the entire training process, and an individual batch may be used more than once. At a step 410, if all available batches have been used, the process ends at a step 415. Otherwise the training process continues. In some contexts, the batch of data used in a single iteration of the flowchart is referred to as a "mini-batch".
At a step 420, a batch mean $\mu_B$ and a batch variance $\sigma_B^2$ (the square of the standard deviation $\sigma_B$) are derived from the data in the current batch B.
At a step 430, a running mean $E[x_o]$ and a running variance $\mathrm{Var}[x_o]$ are updated. These represent the mean and variance applicable to all of the m training data values processed so far.
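The following sketch illustrates the batch statistics of the step 420 and one possible running update for the step 430. The exact running update rule is not reproduced here; the exponential moving average shown (and the momentum value) is a common choice and is an assumption for illustration:

```python
def batch_statistics(batch):
    """Step 420: batch mean and (biased) batch variance of the current batch B."""
    m = len(batch)
    mu_b = sum(batch) / m
    var_b = sum((x - mu_b) ** 2 for x in batch) / m
    return mu_b, var_b

def update_running_statistics(running_mean, running_var, mu_b, var_b,
                              momentum=0.1):
    """Step 430: update E[x] and Var[x] from the current batch statistics
    (exponential moving average; the update rule is an assumption here)."""
    running_mean = (1.0 - momentum) * running_mean + momentum * mu_b
    running_var = (1.0 - momentum) * running_var + momentum * var_b
    return running_mean, running_var

mu_b, var_b = batch_statistics([0.5, 1.5, 2.0, 4.0])
print(update_running_statistics(0.0, 1.0, mu_b, var_b))
```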
The batch normalisation process (which will be applied at the step 450 to be discussed below) can be represented by the expression:

$$y_o = \alpha_o \cdot x_o + b_o$$

In other words, it involves multiplying an input data sample by a parameter $\alpha_o$ and adding a parameter $b_o$. Note that the multiplier factor $\alpha_o$ depends on the learned parameter $\gamma_o$.
At a step 440, the learned value $\gamma_o$ is quantised or approximated to a value $\hat{\gamma}_o$ such that

$$\alpha_o = \frac{\gamma_o}{\sigma_T}$$

is quantized to $\hat{\alpha}_o$, and the resulting $\hat{\alpha}_o$ is a power-of-two number, i.e., (for the element of index o):

$$\hat{\alpha}_o = \mathrm{sign}(\alpha_o) \cdot 2^{\mathrm{round}(\log_2 |\alpha_o|)}$$

and

$$\hat{\gamma}_o = \mathrm{sign}(\alpha_o) \cdot 2^{\mathrm{round}(\log_2 |\alpha_o|)} \cdot \sigma_T$$
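A minimal sketch of this quantisation of the step 440 follows (assuming a non-zero $\alpha_o$; the function name is an illustrative assumption):

```python
import math

def quantise_gamma(gamma_o, sigma_t):
    """Step 440: approximate gamma_o so that alpha_o = gamma_o / sigma_t
    snaps to the nearest power of two (assumes alpha_o is non-zero)."""
    alpha_o = gamma_o / sigma_t
    sign = 1.0 if alpha_o >= 0 else -1.0
    # Round the magnitude to the nearest power of two in the log2 domain.
    alpha_hat = sign * 2.0 ** round(math.log2(abs(alpha_o)))
    gamma_hat = alpha_hat * sigma_t  # so that gamma_hat / sigma_t == alpha_hat
    return gamma_hat, alpha_hat

gamma_hat, alpha_hat = quantise_gamma(gamma_o=0.75, sigma_t=2.0)
print(gamma_hat, alpha_hat)  # alpha_hat = 0.5, an exact power of two
```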
At a step 450, a forward pass is conducted, in which (for the BN layer in question) the batch of training data values are normalised to generate normalised data $\hat{x}_o$ and are scaled and shifted by the learned parameters $\hat{\gamma}_o$ and $\beta_o$ respectively:

$$\hat{x}_o = \frac{x_o - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_o = \hat{\gamma}_o \cdot \hat{x}_o + \beta_o$$

where $\hat{\gamma}$ (the approximated version of γ) is computed so that

$$\frac{\hat{\gamma}_o}{\sigma_T}$$

is a power-of-two. In these formulas, $\mu_B$ and $\sigma_B$ are the mean and standard deviation of the data in the batch B, while $\sigma_T = \sqrt{\mathrm{Var}[x] + \epsilon}$ is the running standard deviation from the whole training set processed so far (the m training data values). Here, ϵ is an arbitrarily small constant to avoid division by zero. Thus, $\sigma_T$ is the standard deviation which, at the end of the training (after all batches of data have been processed), will actually be used for inference. Note that, during training time,

$$\frac{\hat{\gamma}_o}{\sigma_B}$$

is not necessarily an exact power-of-two. However, the arrangement approximates $\hat{\gamma}$ such that the quantity

$$\frac{\hat{\gamma}_o}{\sigma_T},$$

which will be used at inference time, is a power-of-two (here, in training, the prevailing running variance is used, and at inference, the final running variance applicable to the entire training phase is used). Therefore, this involves approximating γ (as an approximation of a current value of the learned parameter) so that a first scaling factor (to be used during inference), dependent upon the approximation of the first learned parameter and the running variance, is constrained to be equal to a power of two.
The mean and variance used in this forward pass at the step 450 are the current batch values $\mu_B$ and $\sigma_B^2$, so that normalising the batch of input data values is performed by multiplying each input data value by a second scaling factor dependent upon the approximation of the current value of the first learned parameter and the batch variance. However, the process for the approximation of γ explained above is dependent upon the running variance rather than the batch variance.
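A sketch of the forward pass of the step 450, using the quantised $\hat{\gamma}_o$ together with the current batch statistics (the epsilon value and function name being illustrative assumptions), might read:

```python
import math

def bn_forward_training(batch, gamma_hat, beta, eps=1e-5):
    """Step 450: normalise with the current batch statistics, then scale
    and shift with the quantised gamma_hat and the learned beta."""
    m = len(batch)
    mu_b = sum(batch) / m
    var_b = sum((x - mu_b) ** 2 for x in batch) / m
    inv_std_b = 1.0 / math.sqrt(var_b + eps)
    # The effective (second) scaling factor gamma_hat / sigma_B depends on
    # the batch variance, so it need not be a power of two during training.
    return [gamma_hat * (x - mu_b) * inv_std_b + beta for x in batch]

print(bn_forward_training([0.5, 1.5, 2.0, 4.0], gamma_hat=1.0, beta=0.0))
```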
As mentioned above, the learned parameters γ and β are provided and trained as a vector of values, one for each vector element in the input data, and batch normalisation is carried out separately for the different vector elements in the data.
At a step 460, a gradient computation takes place to compute the error and the gradient of the error with respect to $\hat{\gamma}$ (the approximated value for γ), and also to compute the gradient of the error with respect to β.
At a step 470, the full precision (not approximated) prevailing value of γ and the prevailing value of β are updated according to the detected gradients.
After the final batch has been processed (the negative outcome from the step 410) this is the completion of the training of the batch normalisation process, as the learned parameters γ and β have been trained to their values for use at inference. Note that at inference, the process uses β and $\hat{\gamma}$, the quantized version of γ.
By this training scheme, it is ensured that the resulting $\hat{\alpha}$ at the end of training is a vector of powers-of-two and no multiplication will be needed in the batch normalisation layer during inference time. Note that the embodiment achieves a saving of multiplications during inference time, rather than at training time.
Inference Phase
The final values of the running mean E[x] and running variance Var[x] of the training data set are provided at a step 660.
The learned parameters $\hat{\gamma}$ and β from the training phase are provided at a step 650.
An input data value 672 is processed to generate an output data value 674 by the trained BN layer at a step 670 by applying the function:

$$y_o = \hat{\alpha}_o \cdot x_o + b_o, \qquad \hat{\alpha}_o = \frac{\hat{\gamma}_o}{\sqrt{\mathrm{Var}[x_o] + \epsilon}}, \qquad b_o = \beta_o - \hat{\alpha}_o \cdot E[x_o]$$

where γ is set to $\hat{\gamma}$ and accordingly α to $\hat{\alpha}$, which is to say, the process uses $\hat{\gamma}$, the quantized or approximated version of γ.
At inference time $b_o$ is a constant, since $\beta_o$ and $\hat{\gamma}_o$ (the learned parameters) are constant and the final values of the running mean and variance ($E[x_o]$ and $\mathrm{Var}[x_o]$) are also constant.
By the training scheme described above, it is ensured that the resulting vector of scaling factors

$$\hat{\alpha}_o = \frac{\hat{\gamma}_o}{\sqrt{\mathrm{Var}[x_o] + \epsilon}},$$

which is used for inference, is always a vector of powers-of-two numbers, and no multiplication is needed in the BN layer during inference time.
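As a minimal, non-limiting sketch of this multiplication-free inference (assuming fixed-point integer data and a positive $\hat{\alpha}_o$, both of which are assumptions for illustration), the power-of-two scaling reduces to a bit shift:

```python
def bn_inference_shift(x_fixed, log2_alpha, b_fixed):
    """Inference-time BN on fixed-point data: since alpha_hat is a power
    of two, the scaling is a bit shift rather than a multiplication.
    (A negative alpha_hat would additionally negate the shifted value.)"""
    if log2_alpha >= 0:
        scaled = x_fixed << log2_alpha    # multiply by 2 ** log2_alpha
    else:
        scaled = x_fixed >> -log2_alpha   # divide by 2 ** (-log2_alpha)
    return scaled + b_fixed               # additive shift b_o

# Example: alpha_hat = 2 ** -1 = 0.5 applied to a fixed-point sample.
print(bn_inference_shift(x_fixed=8, log2_alpha=-1, b_fixed=3))  # (8 >> 1) + 3 = 7
```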
In summary, applying the terminology of the flowchart described here, the overall process may be divided into a training phase and a subsequent inference phase.
As illustrated schematically by a dividing line 870, the training process may be considered to be represented by the steps 800-840. The result or outcome is a trained ANN which can subsequently (and potentially separately) be used in inference.
For ease of explanation, the inference phase (shown below the schematic divider 870) is illustrated as following on directly from the training phase, though the two phases may be performed separately and at different times.
The steps 850, 860 represent the inference phase of operation.
Separately, the steps 850-860 may provide an example of a stand-alone computer-implemented method of operating an artificial neural network (ANN) to process input data values, the method comprising:
It will be apparent that numerous modifications and variations of the present disclosure are possible in light of the above teachings. It is therefore to be understood that within the scope of the appended clauses, the technology may be practised otherwise than as specifically described herein.
Various respective aspects and features will be defined by the following numbered clauses:
Foreign Application Priority Data

| Number | Date | Country | Kind |
|---|---|---|---|
| 18151416 | Jan 2018 | EP | regional |
References Cited: U.S. Patent Documents

| Number | Name | Date | Kind |
|---|---|---|---|
| 20160217368 | Ioffe et al. | Jul 2016 | A1 |
| 20170286830 | El-Yaniv et al. | Oct 2017 | A1 |
Other Publications

Ferrer et al., "NeuroFPGA—Implementing Artificial Neural Networks on Programmable Logic Devices", Proceedings of the Conference on Design, Automation and Test in Europe, vol. 3, 2004, pp. 1-6.

Hubara, Itay, et al., "Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations", arxiv.org, Cornell University Library, Sep. 22, 2016, XP080813052, p. 6, paragraph 2.4.

European Extended Search Report dated Jun. 13, 2019, issued in corresponding European Patent Application No. 18215628.1.

Courbariaux et al., "Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1", Mar. 17, 2016, 11 pages.

Ioffe et al., "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift", Mar. 2, 2015, pp. 1-11.

R. Rojas, "Neural Networks—A Systematic Introduction—Chapter 7—The Backpropagation Algorithm", Springer-Verlag, Berlin, New York, 1996, pp. 151-184.
Publication Information

| Number | Date | Country |
|---|---|---|
| 20190220741 A1 | Jul 2019 | US |