The present application claims priority to European Patent Application 18158172.9, filed with the European Patent Office on Feb. 22, 2018, the entire contents of which are incorporated herein by reference.
This disclosure relates to artificial neural networks (ANNs).
So-called deep neural networks (DNN) have become standard machine learning tools to solve a variety of problems such as computer vision and automatic speech recognition processing.
Designing and training such a DNN is typically very time consuming. When a new DNN is developed for a given task, many so-called hyper-parameters (parameters related to the overall structure of the network) must be chosen empirically. For each possible combination of structural hyper-parameters, a new network is typically trained from scratch and evaluated. While progress has been made on hardware (such as Graphics Processing Units providing efficient single instruction multiple data (SIMD) execution) and software (such as a DNN library developed by NVIDIA called cuDNN) to speed up the training of a single structure of a DNN, the exploration of a large set of possible structures remains potentially slow.
It is envisaged that various electronic devices may be equipped with ANN technology. An example is the use of an ANN in a digital camera for techniques such as face detection and/or recognition. It is recognised that a family of devices may use similar techniques but provide different processing capabilities.
The present disclosure provides a computer-implemented method of generating a derived artificial neural network (ANN) from a base ANN, the method comprising:
initialising a set of parameters of the derived ANN in dependence upon parameters of the base ANN;
inferring a set of output data from a set of input data using the base ANN;
quantising the set of output data; and
training the derived ANN using training data comprising the set of input data and the quantised set of output data.
The present disclosure also provides computer software which, when executed by a computer, causes the computer to implement the above method.
The present disclosure also provides a non-transitory machine-readable medium which stores such computer software.
The present disclosure also provides an artificial neural network (ANN) generated by the above method and data processing apparatus comprising one or more processing elements to implement such an ANN.
Further respective aspects and features of the present disclosure are defined in the appended claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary, but are not restrictive, of the present technology.
A more complete appreciation of the disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, in which:
Referring now to the drawings,
Here x and w represent the inputs and weights respectively, b is the bias term that the neuron optionally adds, and the variable i is an index covering the number of inputs (and therefore also the number of weights that affect this neuron).
The neurons in a layer have the same activation function ϕ, though from layer to layer, the activation functions can be different.
The input neurons I1 . . . I3 do not themselves normally have associated activation functions. Their role is to accept data from (for example) a supervisory program overseeing operation of the ANN. The output neuron(s) O1 provide processed data back to the supervisory program. The input and output data may be in the form of a vector of values such as:
[x1, x2, x3]
Neurons in the layers 210, 220 are referred to as hidden neurons. They receive inputs only from other neurons and output only to other neurons.
The activation function ϕ is non-linear (such as a step function, a so-called sigmoid function, a hyperbolic tangent (tanh) function or a rectification function (ReLU)).
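By way of a minimal illustrative sketch, the weighted sum, bias and activation behaviour of a single neuron described above might be expressed as follows; the function names and the use of NumPy are merely illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def sigmoid(z):
    # So-called sigmoid activation: squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Rectification (ReLU) activation: passes positive values, zeroes negative ones
    return np.maximum(0.0, z)

def neuron_output(x, w, b, activation=sigmoid):
    # Weighted sum over the inputs (the index i runs over the inputs and weights),
    # plus the optional bias term b, followed by the activation function phi
    z = np.dot(w, x) + b
    return activation(z)

# Example: a neuron with three inputs [x1, x2, x3]
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.7])
b = 0.2
print(neuron_output(x, w, b, activation=relu))
```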
Use of an ANN such as the ANN of
The so-called training process for an ANN can involve providing known training data as inputs to the ANN, generating an output from the ANN, comparing the output of the overall network to a known or expected output, and modifying one or more parameters of the ANN (such as one or more weights or biases) in order to aim towards bringing the output closer to the expected output. Therefore, training represents a process to search for a set of parameters which provide the lowest error during training, so that those parameters can then be used in an operational or inference stage of processing by the ANN, when individual data values are processed by the ANN.
An example training process includes so-called back propagation. A first stage involves initialising the parameters, for example randomly or using another initialisation technique. Then a so-called forward pass and a backward pass of the whole ANN are iteratively applied. A gradient or derivative of an error function is derived and used to modify the parameters.
At a basic level the error function can represent how far the ANN's output is from the expected output, though error functions can also be more complex, for example imposing constraints on the weights such as a maximum magnitude constraint. The gradient represents a partial derivative of the error function with respect to a parameter, at the parameter's current value. If the ANN were to output the expected output, the gradient would be zero, indicating that no change to the parameter is appropriate. Otherwise, the gradient provides an indication of how to modify the parameter to achieve the expected output. A negative gradient indicates that the parameter should be increased to bring the output closer to the expected output (or to reduce the error function). A positive gradient indicates that the parameter should be decreased to bring the output closer to the expected output (or to reduce the error function).
Gradient descent is therefore a training technique with the aim of arriving at an appropriate set of parameters without the processing requirements of exhaustively checking every permutation of possible values. The partial derivative of the error function is derived for each parameter, indicating that parameter's individual effect on the error function. In a backpropagation process, starting with the output neuron(s), errors are derived representing differences from the expected outputs and these are then propagated backwards through the network by applying the current parameters and the derivative of each activation function. A change in an individual parameter is then derived in proportion to the negated partial derivative of the error function with respect to that parameter and, in at least some examples, with a further component proportional to the change to that parameter applied in the previous iteration.
An example of this technique is discussed in detail in the following publication http://page.mi.fu-berlin.de/rojas/neural/ (chapter 7), the contents of which are incorporated herein by reference.
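A minimal sketch of the parameter update described above (not the implementation of the cited reference) might look as follows, assuming a designer-chosen learning rate and momentum coefficient:

```python
import numpy as np

def gradient_descent_step(param, grad, velocity, learning_rate=0.01, momentum=0.9):
    # The change applied is proportional to the negated partial derivative of the
    # error function (grad), with a further component proportional to the change
    # applied in the previous iteration (the momentum/velocity term)
    velocity = momentum * velocity - learning_rate * grad
    return param + velocity, velocity

# Example: one layer's weights and the gradient of the error function w.r.t. them
W = np.random.randn(4, 3)
dE_dW = np.random.randn(4, 3)   # in practice this comes from backpropagation
v = np.zeros_like(W)            # change applied in the previous iteration (none yet)
W, v = gradient_descent_step(W, dE_dW, v)
```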
Training from Scratch
At a step 400, the parameters (such as W, b for each layer) of the ANN to be trained are initialised. The training process then involves the successive application of known training data, having known outcomes, to the ANN, by steps 410, 420 and 430.
At the step 410, an instance of the input training data is processed by the ANN to generate a training output. The training output is compared to the known output at the step 420 and deviations from the known output (representing the error function referred to above) are used at the step 430 to steer changes in the parameters by, for example, a gradient descent technique as discussed above.
The technique described above can be used to train a network from scratch, but in the discussion below, techniques will be described by which an ANN is established by adaptation of an existing ANN.
Some reasons for adopting this approach include that training an ANN from scratch can be a lengthy and expensive process. In situations where similar but not identical ANNs are required, for example for use in (say) face recognition in a range of relatively similar products such as digital cameras having different respective processing capabilities, training an individual ANN for each model could be prohibitively expensive and/or time consuming. Indeed, it is possible that the original training data may not in fact be available at the time that a new camera model's ANN needs to be trained.
Other reasons can relate to adaptation of an ANN. In the illustrative example of digital cameras, an ANN trained on a brand new and fully operational digital camera may start to become less well suited to that particular camera as the camera ages and (for example) some pixels of an image sensor in the camera potentially deteriorate at different rates, or lens damage affects some parts of the captured images but not others. Here, it could be useful for the ANN to be able to adapt to these changes, but because such an adaptation would be “in the field”, or in other words while the camera is in the hands of the user, the original training data is unlikely to be available.
With regard to the context discussed above,
One type of adaptation mentioned above is to adapt an existing (base) ANN, running on a particular data processing apparatus, in response to changes in the nature of data being handled by the base ANN. An illustrative example given above relates to deterioration of an image sensor and/or lens in a camera arrangement, but this is just one example. In another illustrative example where the ANN is used for (say) speech recognition, changes over time could occur through the aging of the main user or a geographical move of the apparatus to an area with a different style or accent of speech. This type of adaptation will be referred to in this description as a “performance adaptation”.
A process for handling such a performance adaptation will be described below with reference to
The process illustrated in
In brief, the derived ANN is initialised to the same parameters (such as ϕ, W, b for each layer) as those of the base ANN, but is then further trained using input and output data handled by the base ANN, with the base ANN's output vectors being processed using a quantisation technique such as so-called "one hot" encoding so that the largest single data value of the output vector is set to "1.0" and other data values are set to "0.0". This is one form of quantisation (other forms can be used) and can serve to reduce errors or uncertainties in the output data from the base ANN and can in at least some situations provide better or more useful data by which the derived ANN can be trained.
As an example, if an output vector 720 as generated by the base ANN is:
[0.2, 0.01, −0.3, 0.8, 0.12, −0.9]
then it is quantised by one-hot encoding to:
[0, 0, 0, 1, 0, 0]
Note that typically the output coding is such that negative values correspond to a very low likelihood. Very large magnitude negative values carry the meaning of a likelihood very close to zero.
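A minimal sketch of this quantisation step, using the example vector above (the function name is merely illustrative), might be:

```python
import numpy as np

def one_hot_quantise(output_vector):
    # Set the largest single data value of the output vector to 1.0
    # and all other data values to 0.0 (one possible form of quantisation)
    quantised = np.zeros_like(output_vector)
    quantised[np.argmax(output_vector)] = 1.0
    return quantised

base_ann_output = np.array([0.2, 0.01, -0.3, 0.8, 0.12, -0.9])
print(one_hot_quantise(base_ann_output))   # [0. 0. 0. 1. 0. 0.]
```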
So, referring back to the example training process of
The training process corresponding to the step 410 of
So, in summary, the derived ANN is initialised, for example to be initially identical to the base ANN, and is then trained using actual data processed by the base ANN, with the ground truth output corresponding to that data being taken to be the quantised or one-hot encoded version of the actual output of the base ANN.
In the examples discussed above in which the nature of the input data has potentially changed since the base ANN was first trained, the above process takes this into account by training the derived ANN using actual processed data handled by the base ANN.
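The overall performance-adaptation flow might be sketched as follows; base_infer and derived_train_step are hypothetical placeholders for whatever inference and training routines a particular system provides, and are not defined by the present disclosure:

```python
import numpy as np

def one_hot_quantise(v):
    q = np.zeros_like(v)
    q[np.argmax(v)] = 1.0
    return q

def adapt_derived_ann(base_infer, derived_train_step, input_vectors):
    """Train a derived ANN using data actually handled by the base ANN.

    base_infer:         callable mapping an input vector to the base ANN's output vector
    derived_train_step: callable applying one training update to the derived ANN,
                        given (input vector, ground-truth output vector)
    input_vectors:      input data captured during normal use of the base ANN
    """
    for x in input_vectors:
        y = base_infer(x)          # inference by the base ANN
        t = one_hot_quantise(y)    # quantised output taken as the ground truth
        derived_train_step(x, t)   # one training step for the derived ANN
```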
In some examples, the process described above can be undertaken by two inter-communicating data processing apparatuses, potentially simultaneously (though this is not a requirement) so that as an input vector is processed by the base ANN, the input and output vectors are passed directly to be used in training the derived ANN. This could in principle avoid the need for the buffers 750, 780.
In other examples, the inference by the base ANN and the training of the derived ANN are executed by the same data processing apparatus on a time division basis, so that the data processing apparatus first executes the base ANN and buffers the input and output data obtained in (for example) a period of normal use. The derived ANN is then trained by the same data processing apparatus using the buffered training data.
In further examples, the inference by the base ANN and the training of the derived ANN are executed, potentially simultaneously, by the same data processing apparatus on a multi-tasking or multi-threading basis.
In summary, in these techniques, the derived ANN may have the same network structure as the base ANN; and the initialising may comprise setting the parameters of the derived ANN to be the same as respective parameters of the base ANN.
As mentioned above, it may also (or instead) be desirable to adapt the ANN designed for one model of camera and using one configuration of image sensor, lens and face recognition module, for use with another model of camera and using another configuration of image sensor, lens and face recognition module. For example, it may be desired to operate an ANN similar in function to a base ANN but on a data processing apparatus of lower or different computational power than that for which the base ANN was developed. Other examples in apparatuses other than cameras also apply. This type of adaptation will be referred to as a “structural adaptation” in the discussion below.
In a structural adaptation, a derived ANN can differ in network structure to the base ANN. An example of this situation is shown schematically in
The derived ANN has its parameters initialised using those of the base ANN as a starting point. In some instances, it can be possible to omit a layer of the base ANN without change to the non-omitted parameters. In other arrangements however, the initialisation of the parameters of the derived ANN 830 is carried out by an initialisation module 832 which acts on the parameters 835 of the base ANN 800 to generate initialisation parameters 837 for the derived ANN. An example of the performance of this technique (by a so-called least squares method) will be discussed in detail below.
The training process corresponding to the step 410 of
Therefore, in embodiments, the set of output data comprises one or more output data vectors each having a plurality of data values; and the quantising comprises replacing each data value, other than a data value having a highest value amongst the plurality of data values, by a first predetermined value. In, for example, the quantising process of so-called one-hot encoding, the first predetermined value may be zero. The quantising step may comprise replacing a data value having a highest value amongst the plurality of data values by a second predetermined value such as 1.
So, in summary, in the structural adaptation case, the derived ANN is initialised, for example using parameters derived from those of the base ANN (for example by the least squares process to be discussed below), and is then trained using actual data processed by the base ANN, with the ground truth output corresponding to that data being taken to be the quantised or one-hot encoded version of the actual output of the base ANN.
Embodiments of the present disclosure can provide techniques to use an approximation method to modify the structure of a previously trained neural network model (a base ANN) to a new structure (of a derived ANN) so as to avoid training from scratch every time. In the present examples, the previously trained network is the base ANN 800, the new structure is that of the derived ANN 830, and these processes can be performed by the module 832. The possible modifications (of the derived ANN over the base ANN) include, for example, increasing or decreasing layer size, widening or shortening depth, and changing activation functions.
A previously proposed approach to this problem would have involved evaluating several network structures by training each structure from scratch and evaluating it on a validation set. This requires the training of many networks and can potentially be very slow. Also, in some cases only a limited number of different structures can be evaluated. In contrast, embodiments of the disclosure modify the structure and parameters of the base ANN to a new structure (the derived ANN) to avoid training from scratch every time.
In embodiments, the derived ANN has a different network structure to the base ANN. In examples, the base ANN has an ordered series of two or more successive layers of neurons, each layer passing data signals to the next layer in the ordered series, the neurons of each layer processing the data signals received from the preceding layer according to an activation function and weights for that layer,
the method comprising:
detecting the data signals for a first position and a second position in the ordered series of layers of neurons;
generating the derived ANN from the base ANN by providing an insertion layer of neurons to provide processing between the first position and the second position with respect to the ordered series of layers of neurons of the base ANN; and
initialising at least a set of weights for the insertion layer using a least squares approximation from the data signals detected for the first position and the second position.
In a left hand column of
In the present example, the two or more successive layers 1000, 1010, 1020 may be fully connected layers in which each neuron in a fully connected layer is connected to receive data signals from each neuron in a preceding layer and to pass data signals to each neuron in a following layer.
In the present technique, a so-called least squares morphism (LSM) is used to approximate the parameters of a single linear layer such that it preserves the function of a (replaced) sub-network of the parent network.
To do this, a first step is to forward training samples through the parent network up to the input of the sub-network to be replaced, and up to the output of the sub-network. In the example of
Given the data at the input of the parent sub-network x_1, …, x_N and the corresponding data at the output of the sub-network y_1, …, y_N it is possible to approximate (or for example optimize) a replacement linear layer with weight parameters W_init and bias term b_init which approximate the sub-network. This then provides a starting point for subsequent training of the replacement network (derived ANN) as discussed above. The approximation/optimization problem can be written as:

(W_init, b_init) = arg min_{W,b} Σ_{k=1..N} ‖ y_k − (W x_k + b) ‖² = arg min_{W,b} Σ_{k=1..N} Σ_n ( y_{n,k} − (W x_k + b)_n )²
The expression in the vertical double bars is the square of the deviation of the desired output y of the replacement layer from its actual output (the expression with W and b). The subscript n runs over the neurons (units) of the layer. So, the sum is certainly positive (because of the square) and is zero only if the linear replacement layer accurately reproduces y (for all neurons). So an aim is to minimize the sum, and the free parameters which are available to do this are W and b, which is reflected in the "arg min" (argument of the minimum) operation. In general, no solution is possible that provides zero error except in certain circumstances; the expected error has a closed form solution and is given below as J_min.
The solution to this least squares problem can be expressed in closed form and is given by:

W_init = C_yx C_xx^−1, b_init = μ_y − W_init μ_x

where μ_x and μ_y are the means of the x_k and y_k respectively, C_xx = Σ_k (x_k − μ_x)(x_k − μ_x)^T and C_yx = Σ_k (y_k − μ_y)(x_k − μ_x)^T.
The residual error is given by:

J_min = tr( C_yy − C_yx C_xx^−1 C_yx^T )

with C_yy = Σ_k (y_k − μ_y)(y_k − μ_y)^T.
So, for the replacement layer 1040 of the morphed network (derived ANN) 1050, the initial weights W′ are given by W_init and the initial bias b′ is given by b_init, both of which are derived by a least squares approximation process from the input and output data (at the first and second positions).
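A minimal sketch of this least squares initialisation, assuming the observed x_k and y_k are collected as the columns of matrices X and Y (the function name and the use of NumPy are illustrative assumptions):

```python
import numpy as np

def lsm_init(X, Y):
    """Least squares morphism (LSM) initialisation of a replacement linear layer.

    X: (d_in, N) data at the input of the replaced sub-network (columns x_1..x_N)
    Y: (d_out, N) data at the output of the replaced sub-network (columns y_1..y_N)
    Returns W_init, b_init such that W_init @ x + b_init approximates the
    sub-network in the least squares sense.
    """
    mu_x = X.mean(axis=1, keepdims=True)
    mu_y = Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mu_x, Y - mu_y
    C_xx = Xc @ Xc.T                      # input covariance (summed over samples)
    C_yx = Yc @ Xc.T                      # output/input cross-covariance
    W_init = C_yx @ np.linalg.pinv(C_xx)  # pseudo-inverse for numerical robustness
    b_init = (mu_y - W_init @ mu_x).ravel()
    return W_init, b_init

# Example: approximate a small two-layer sub-network by a single linear layer
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 500))
hidden = np.tanh(rng.standard_normal((8, 8)) @ X)
Y = np.tanh(rng.standard_normal((6, 8)) @ hidden)
W_init, b_init = lsm_init(X, Y)
```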
Therefore, in examples, the neurons of each layer of the base ANN process the data signals received from the preceding layer according to a bias function for that layer, the method comprising deriving an initial approximation of at least a bias function for the insertion layer using a least squares approximation from the data signals detected for the first position and the second position.
This process of parameter initialisation is summarised in
the method comprising:
detecting (at a step 1100) the data signals for a first position x_1, …, x_N (such as the input to the layer 1000) and a second position y_1, …, y_N (such as the output of the layer 1010) in the ordered series of layers of neurons;
generating (at a step 1110) the modified ANN from the base ANN by providing an insertion layer 1040 of neurons to provide processing between the first position and the second position with respect to the ordered series of layers of neurons of the base ANN (in the example above, the layer 1040 replaces the layers 1000, 1010 and so acts between the (previous) input to the layer 1000 and the (previous) output of the layer 1010);
deriving (at a step 1120) an initial approximation of at least a set of weights (such as W_init and/or b_init) for the insertion layer 1040 using a least squares approximation from the data signals detected for the first position and the second position; and
processing (at a step 1140) training data (such as training data generated by the one-hot encoder 870 from output data of the base ANN) using the modified ANN to train the modified ANN including training the weights W′ of the insertion layer from their initial approximation.
In this example, use is made of training data comprising a set of data having a set of known input data and corresponding output data (for example being generated by quantising the base ANN output data as discussed above), and in which the processing step 1140 comprises varying at least the weighting of at least the insertion layer so that, for an instance of known input data, the output data of the modified ANN is closer to the corresponding known output data. For example, for each instance of input data in the set of known input data, the corresponding known output data may be output data of the base ANN for that instance of input data.
An optional further weighting step 1130 is also provided in
In particular,
The process discussed above can be used in the following example ways:
The ANNs of
The techniques may be implemented by computer software which, when executed by a computer, causes the computer to implement the method described above and/or to implement the resulting ANN. Such computer software may be stored by a non-transitory machine-readable medium such as a hard disk, optical disk, flash memory or the like, and implemented by data processing apparatus comprising one or more processing elements.
In further example embodiments, when increasing net size (increasing layer size or adding more layers), it can be possible to make use of the increased size to make the subnet more robust to noise.
The scheme discussed above for increasing the size of a subnet aims to preserve a subnet's function t:
t = NET(X) = MORPHED_NET(X)
In other examples, similar techniques can be used in respect of a deliberately corrupted outcome, so as to provide a morphed subnet so that:
t = NET(X) ≈ MORPHED_NET(X̃)

with X̃ being a corrupted version of X.
A way to produce the corrupted version X̃ is to use binary masking noise, sometimes known as so-called "Dropout". Dropout is a technique in which neurons and their connections are randomly or pseudo-randomly dropped or omitted from the ANN during training. Each network from which neurons have been dropped in this way can be referred to as a thinned network. This arrangement can provide a precaution against so-called overfitting, in which a single network, trained using a limited set of training data including sampling noise, can aim to fit too precisely to the noisy training data. It has been proposed that in training, any neuron is dropped with a probability p (0<p<1), or in other words retained with a probability (1−p). Then at inference time, the neuron is always present but the weight associated with the neuron is modified by multiplying it by the retention probability (1−p).
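A minimal sketch of this kind of binary masking noise, with p as the dropping probability as used in the derivation below (the function name is merely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_corrupt(x, p=0.5):
    # Binary masking noise: each component is dropped (set to zero) with
    # probability p and kept unchanged with probability (1 - p)
    mask = rng.random(x.shape) >= p
    return x * mask

x = np.array([0.3, -1.1, 2.0, 0.7])
x_tilde = dropout_corrupt(x, p=0.5)   # a corrupted version of x
```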
Applying this type of technique to the LSM process discussed above leads to a so-called denoising morphing process. As seen previously, the least squares solution approximates the sub-network outputs from the uncorrupted inputs x_k. For the denoising morphing, an aim is instead to optimize:

(W, b) = arg min_{W,b} Σ_{k=1..N} ‖ t_k − (W x̃_k + b) ‖²
where x̃_k is x_k corrupted by dropout with probability p. The corruption x̃_k depends on a random or pseudo-random process; therefore, in some examples the technique is used to produce R repetitions of the dataset with different corruptions x̃_{r,k}, so as to produce a large dataset representative of the corrupted dataset. The least squares (LS) problem then becomes:

(W, b) = arg min_{W,b} Σ_{r=1..R} Σ_{k=1..N} ‖ t_k − (W x̃_{r,k} + b) ‖²
Ideally, the optimization would be performed with a very large number of repetitions R→∞. Clearly in a practical embodiment R will not be infinite, but for the purposes of the mathematical derivation the limit R→∞ is considered, in which case the solution of the LS problem is:

W = E[C_tx̃] E[C_x̃x̃]^−1
The coefficients of (t_k − μ_t)(x̃_k − μ_x)^T keep their "non-corrupted" value with a probability of (1−p) or are set to zero.
Therefore, the expected corrupted correlation matrix can be expressed as:
E[C_tx̃] = (1−p) C_tx
The off-diagonal coefficients of (x̃_k − μ_x)(x̃_k − μ_x)^T keep their "non-corrupted" value with a probability of (1−p)² (they are corrupted if either of the two dimensions is corrupted).

The diagonal coefficients of (x̃_k − μ_x)(x̃_k − μ_x)^T keep their "non-corrupted" value with a probability of (1−p).
Therefore, the expected corrupted correlation matrix can be expressed as:

E[C_x̃x̃] = (1−p)² (C_xx − diag(C_xx)) + (1−p) diag(C_xx)

where diag(C_xx) denotes the diagonal part of C_xx.
Substituting these expectations into the R→∞ solution W = E[C_tx̃] E[C_x̃x̃]^−1 and taking the common factor (1−p) out, the solution can also be expressed with a simple weighting of C_xx:
W = C_tx (A ∘ C_xx)^−1
with A being a weighting matrix with ones on the diagonal and the off-diagonal coefficients being (1−p), and ∘ denoting the element-wise (Hadamard) product.
Therefore, W and b can be computed in closed form directly from the original input data x_k without in fact having to construct any corrupted data x̃_k. This requires only a relatively small modification to the implementation of the LS solution for the network decreasing operation.
This provides an example of the further weighting step 1130, or in other words an example of adding a further weighting to the least squares approximation of the weights to simulate the addition of dropout noise in the ANN.
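A sketch of this closed-form computation, reusing the covariance-style quantities introduced above; the bias term shown is an assumption, since the text above does not spell out b for this case:

```python
import numpy as np

def denoising_lsm_init(X, T, p):
    """Closed-form denoising LSM initialisation (a sketch).

    X: (d_in, N) original, uncorrupted sub-network input data x_1..x_N
    T: (d_out, N) corresponding uncorrupted sub-network outputs t_1..t_N
    p: dropping probability of the simulated dropout (binary masking) noise
    """
    mu_x = X.mean(axis=1, keepdims=True)
    mu_t = T.mean(axis=1, keepdims=True)
    Xc, Tc = X - mu_x, T - mu_t
    C_xx = Xc @ Xc.T
    C_tx = Tc @ Xc.T
    # Weighting matrix A: ones on the diagonal, (1 - p) off the diagonal
    d = X.shape[0]
    A = np.full((d, d), 1.0 - p)
    np.fill_diagonal(A, 1.0)
    # W = C_tx (A o C_xx)^-1, with "o" the element-wise (Hadamard) product
    W = C_tx @ np.linalg.pinv(A * C_xx)
    # Bias chosen so the layer is unbiased for the expected corrupted input
    # E[x_tilde] = (1 - p) * mu_x -- an assumption, not given by the text above
    b = (mu_t - W @ ((1.0 - p) * mu_x)).ravel()
    return W, b
```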
The techniques discussed above relate to fully-connected or Affine layers. In the case of a convolutional layer a further technique can be applied to reformulate the convolutional layer as an Affine layer for the purposes of the above technique. In a convolutional layer a set of one or more learned filter functions is convolved with the input data. Referring to
So, in this example, at least one of the two or more successive layers is a convolutional layer, the method comprising deriving a fully connected layer from the convolutional layer.
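As a minimal illustration of such a reformulation, restricted for clarity to a single 1-D filter and a fixed input length (an assumption; the disclosure is not limited to this case), a convolution can be rewritten as an Affine layer whose weight matrix applies the filter at each position:

```python
import numpy as np

def conv1d_as_affine(kernel, input_length):
    # Build the weight matrix of a fully connected (Affine) layer equivalent to a
    # 1-D "valid" convolution with the given learned filter, for a fixed input
    # length: each output row applies the filter at one position of the input
    k = len(kernel)
    out_length = input_length - k + 1
    W = np.zeros((out_length, input_length))
    for i in range(out_length):
        W[i, i:i + k] = kernel
    return W

# The affine reformulation reproduces the (cross-correlation style) convolution
kernel = np.array([1.0, -2.0, 0.5])
x = np.arange(8, dtype=float)
W = conv1d_as_affine(kernel, len(x))
assert np.allclose(W @ x, np.convolve(x, kernel[::-1], mode="valid"))
```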
The techniques discussed above may be summarised as a computer-implemented method of generating a derived ANN from a base ANN, the method comprising:

initialising (at a step 1400) a set of parameters of the derived ANN in dependence upon parameters of the base ANN;
inferring (at a step 1410) a set of output data from a set of input data using the base ANN;
quantising (at a step 1420) the set of output data; and
training (at a step 1430) the derived ANN using training data comprising the set of input data and the quantised set of output data.
In so far as embodiments of the disclosure have been described as being implemented, at least in part, by software-controlled data processing apparatus, it will be appreciated that a non-transitory machine-readable medium carrying such software, such as an optical disk, a magnetic disk, semiconductor memory or the like, is also considered to represent an embodiment of the present disclosure. Similarly, a data signal comprising coded data generated according to the methods discussed above (whether or not embodied on a non-transitory machine-readable medium) is also considered to represent an embodiment of the present disclosure.
It will be apparent that numerous modifications and variations of the present disclosure are possible in light of the above teachings. It is therefore to be understood that within the scope of the appended clauses, the technology may be practised otherwise than as specifically described herein.
Various respective aspects and features will be defined by the following numbered clauses:
1. A computer-implemented method of generating a derived artificial neural network (ANN) from a base ANN, the method comprising:
initialising a set of parameters of the derived ANN in dependence upon parameters of the base ANN;
inferring a set of output data from a set of input data using the base ANN;
quantising the set of output data; and
training the derived ANN using training data comprising the set of input data and the quantised set of output data.
2. A method according to clause 1, in which:
the set of output data comprises one or more output data vectors each having a plurality of data values; and
the quantising step comprises replacing each data value other than a data value having a highest value amongst the plurality of data values, by a first predetermined value.
3. A method according to clause 2, in which the first predetermined value is zero.
4. A method according to clause 2 or clause 3, in which the quantising step comprises replacing a data value having a highest value amongst the plurality of data values, by a second predetermined value.
5. A method according to clause 4, in which the second predetermined value is 1.
6. A method according to any one of the preceding clauses, in which:
the derived ANN has the same network structure as the base ANN; and
the initialising step comprises setting the parameters of the derived ANN to be the same as respective parameters of the base ANN.
7. A method according to any one of clauses 1 to 5, in which the derived ANN has a different network structure to the base ANN.
8. A method according to clause 7, in which the base ANN has an ordered series of two or more successive layers of neurons, each layer passing data signals to the next layer in the ordered series, the neurons of each layer processing the data signals received from the preceding layer according to an activation function and weights for that layer,
the method comprising:
detecting the data signals for a first position and a second position in the ordered series of layers of neurons;
generating the derived ANN from the base ANN by providing an insertion layer of neurons to provide processing between the first position and the second position with respect to the ordered series of layers of neurons of the base ANN; and
initialising at least a set of weights for the insertion layer using a least squares approximation from the data signals detected for the first position and the second position.
9. A method according to clause 8, in which the two or more successive layers are fully connected layers in which each neuron in a fully connected layer is connected to receive data signals from each neuron in a preceding layer and to pass data signals to each neuron in a following layer.
10. A method according to clause 8 or clause 9, in which at least one of the two or more successive layers is a convolutional layer, the method comprising deriving a fully connected layer from the convolutional layer.
11. A method according to any one of clauses 8 to 10, in which the training step comprises varying at least the weighting of at least the insertion layer so that, for an instance of known input data, the output data of the derived ANN is closer to the quantised set of output data.
12. A method according to any one of clauses 8 to 11, in which the generating step comprises providing the insertion layer to replace one or more layers of the base ANN.
13. A method according to clause 12, in which the insertion layer has a different layer size to that of the one or more layers it replaces.
14. A method according to any one of clauses 8 to 13, in which the generating step comprises providing the insertion layer in addition to the layers of the base ANN.
15. A method according to any one of clauses 8 to 14, comprising adding a further weighting to the least squares approximation of the weights to simulate the addition of dropout noise in the ANN.
16. A method according to any one of clauses 8 to 15, in which the neurons of each layer of the base ANN process the data signals received from the preceding layer according to a bias function for that layer, the method comprising deriving an initial approximation of at least a bias function for the insertion layer using a least squares approximation from the data signals detected for the first position and the second position.
17. Computer software which, when executed by a computer, causes the computer to implement the method of any one of the preceding clauses.
18. A non-transitory machine-readable medium which stores computer software according to clause 17.
19. An artificial neural network (ANN) generated by the method of any one of clauses 1 to 16.
20. Data processing apparatus comprising one or more processing elements to implement the ANN of clause 19.