The present disclosure relates to a computer-implemented method of optimising a neural network. A related computer program product and system are also disclosed.
Neural networks are employed in a wide range of applications such as image classification, speech recognition, character recognition, image analysis, natural language processing, gesture recognition and so forth. Many different types of neural network such as Convolutional Neural Networks “CNN”, Recurrent Neural Networks “RNN”, Generative Adversarial Networks “GAN”, and Autoencoders have been developed and tailored to such applications.
A feature common to neural networks is that they include multiple "neurons", which are the basic unit of a neural network. A neuron has one or more inputs and generates an output based on the input(s). The value of the data applied to each input is weighted, and the weighted inputs are summed and passed to an "activation function" that determines the output of the neuron. The activation function also has a "bias" that controls the output of the neuron by providing a threshold to the neuron's activation. The neurons are typically arranged in layers, which may include an input layer, an output layer, and one or more hidden layers arranged between the input layer and the output layer. The neurons are connected to one another by the weights that are applied to the neuron inputs. Connections between the neurons may be between neurons in the same layer in the neural network, or between neurons in different layers. The weights determine the strength of each connection in the network and thus control the flow of information between the input layer and the output layer of the neural network. The weights, the biases, and the neuron connections are examples of "trainable parameters" of the neural network that are "learnt", or in other words, capable of being trained, during a neural network "training" process. Another example of a trainable parameter of a neural network, found particularly in neural networks that include a normalization layer, is the (batch) normalization parameter(s). During training, the (batch) normalization parameter(s) are learnt from the statistics of data flowing through the normalization layer.
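By way of a non-limiting illustration only, the computation performed by a single neuron as described above may be sketched as follows; the sigmoid activation function and the numerical values in this sketch are assumptions chosen purely for explanation and do not form part of the disclosed method.

```python
import numpy as np

def neuron_output(inputs, weights, bias):
    # Weight each input, sum the weighted inputs, and add the bias
    z = np.dot(weights, inputs) + bias
    # Apply the activation function (a sigmoid is assumed here for illustration)
    return 1.0 / (1.0 + np.exp(-z))

# Example: a neuron with three inputs
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.1, -0.7])
bias = 0.2
print(neuron_output(inputs, weights, bias))
```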
A neural network also includes “hyperparameters” that are used to control the neural network training process. Depending on the type of neural network concerned, the hyperparameters may for example include one or more of: a learning rate, a decay rate, momentum, a learning schedule and a batch size. The learning rate controls the magnitude of the weight adjustments that are made during training. The batch size is defined herein as the number of data points used to train a neural network model in each iteration. Together, the hyperparameters and the trainable parameters of the neural network are defined herein as the “parameters” of the neural network.
The process of training a neural network includes adjusting the weights that connect the neurons in the neural network, as well as adjusting the biases of the activation functions controlling the outputs of the neurons. There are two main approaches to training: supervised learning and unsupervised learning. Supervised learning involves providing a neural network with input data and corresponding output data. During supervised learning the weights and the biases are automatically adjusted such that when presented with the input data, the neural network accurately provides the corresponding output data. The input data is said to be "labelled" or "classified" with the corresponding output data. In unsupervised learning the neural network decides for itself how to classify, or generate another type of prediction from, un-labelled input data based on common features in the input data, likewise by automatically adjusting the weights and the biases. Semi-supervised learning is another approach to training wherein a neural network is provided with a combination of labelled and un-labelled input data, typically with only a minor portion of the input data being labelled. During training the weights and biases of the neural network are automatically adjusted using guidance from the labelled data.
Whichever training process is used, training a neural network typically involves inputting a large amount of data, and making numerous iterations of adjustments to the neural network parameters in order to ensure that the trained neural network provides an accurate output. As may be appreciated, significant processing resources are typically required in order to perform such training. Dedicated neural processors, also known as neural network accelerators, AI accelerators, or Tensor Processing Units "TPU", are often employed instead of a general-purpose Central Processing Unit "CPU" or Graphics Processing Unit "GPU" in order to accelerate the process of training a neural network. Training therefore typically employs a centralized approach wherein cloud-based or mainframe-based neural processors are used to train a neural network. By contrast, after the training process has been completed, the processing requirements of neural networks are significantly diminished. This allows a trained neural network to be deployed, for example to a device, and used in systems having significantly less processing capability.
However, there remains a need to provide improved neural networks.
According to a first aspect of the present disclosure, there is provided a computer-implemented method of optimising a student neural network, based on a previously-trained neural network trained on first data using a first processing system. The method includes: using a second processing system to generate reference output data from the previously-trained neural network in response to inputting second data to the previously-trained neural network; and optimising a student neural network for processing the second data with the second processing system, by using the second processing system to adjust a plurality of parameters of the student neural network such that a difference between the reference output data, and second output data generated by the student neural network in response to inputting the second data to the student neural network, satisfies a stopping criterion.
According to a second aspect of the present disclosure the method includes: identifying a subset of second processing system input data to use as the second data. The second processing system input data is sampled, and the sampled second processing system input data is included in the subset if it increases a diversity metric of the subset.
According to a third aspect of the present disclosure the method includes: optimising the student neural network by reducing a precision of its weights, and/or removing neurons and/or connections defined by its weights.
According to a fourth aspect of the present disclosure the method includes: generating test output data from the student neural network in response to test input data. The test input data has corresponding expected output data that is expected from the student neural network. The optimising of the student neural network is constrained such that a difference between the generated test output data, and the expected output data, is less than a second predetermined value.
A computer program product and a system are provided in accordance with other aspects of the disclosure. The functionality disclosed in relation to the computer-implemented method may also be implemented in the computer program product, and in the system, in a corresponding manner.
Further features and advantages of the disclosure will become apparent from the following description of preferred implementations of the disclosure, given by way of example only, which is made with reference to the accompanying drawings.
Examples of the present application are provided with reference to the following description and the figures. In this description, for the purposes of explanation, numerous specific details of certain examples are set forth. Reference in the specification to “an example”, “an implementation” or similar language means that a feature, structure, or characteristic described in connection with the example is included in at least that one example. It is also to be appreciated that features described in relation to one example may also be used in another example and that all features are not necessarily duplicated for the sake of brevity. For instance, features described in relation to the computer-implemented method may be used in the computer program product and in the system in a corresponding manner.
In the present disclosure, reference is made to examples of a neural network in the form of a Deep Feed Forward neural network. It is however to be appreciated that the disclosed method is not limited to use with this particular type of neural network, and that it may be used with other types of neural networks, such as for example a CNN, a RNN, a GAN, an Autoencoder, and so forth. Reference is also made to operations in which the neural network processes input data in the form of image data, and uses this to generate output data in the form of a predicted classification. It is to be appreciated that these example operations serve for the purpose of explanation, and that the disclosed method is not limited to use in classifying image data. The disclosed method may be used to generate predictions in general, and the method may process other forms of input data such as audio data, motion data, financial data, and so forth.
As illustrated in
Variations of the example Feed Forward Deep neural network described above with reference to
As outlined above, the process of training a neural network includes automatically adjusting the above-described weights that connect the neurons in the neural network, as well as the biases of the activation functions controlling the outputs of the neurons. In supervised learning, the neural network is presented with (training) input data that has a known classification. The input data might for instance include images of animals that have been classified with an animal "type", such as cat, dog, horse, etc. In supervised learning, the training process automatically adjusts the weights and the biases, such that when presented with the input data, the neural network accurately provides the corresponding output data. The neural network may for example be presented with a variety of images corresponding to each class. The neural network analyses each image and predicts its classification. A difference between the predicted classification and the known classification is used to "backpropagate" adjustments to the weights and biases in the neural network such that the predicted classification is closer to the known classification. The adjustments are made by starting from the output layer and working backwards in the network until the input layer is reached. In the first training iteration, the initial weights and biases of the neurons are often randomized. The neural network then predicts the classification, which is essentially random. Backpropagation is then used to adjust the weights and the biases. The training process is terminated when the difference, or error, between the predicted classification and the known classification is within an acceptable range for the training data. In a later deployment phase, the trained neural network is presented with new images without any classification. If the training process was successful, the trained neural network accurately predicts the classification of the new images.
Various algorithms are known for use in the backpropagation stage of training. Algorithms, or "optimizers", such as Stochastic Gradient Descent "SGD", Momentum, Adam, Nadam, Adagrad, Adadelta, RMSProp, and Adamax have been developed specifically for this purpose. Essentially, the value of a loss function, such as the mean squared error, the Huber loss, or the cross entropy, is determined based on a difference between the predicted classification and the known classification. The backpropagation algorithm uses the value of this loss function to adjust the weights and biases. In SGD, for example, the derivative of the loss function with respect to each weight is computed via the chain rule through the activation functions, and this derivative is used to adjust each weight.
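By way of a non-limiting illustration, a single SGD weight update may be sketched as follows; the mean-squared-error loss, the single linear neuron, and the learning rate used here are assumptions made purely for explanation.

```python
import numpy as np

def sgd_step(weights, gradient, learning_rate=0.01):
    # Move each weight a small step against the gradient of the loss
    return weights - learning_rate * gradient

# Example: one update of a single linear neuron trained with a
# mean-squared-error loss L = (prediction - target)**2
inputs = np.array([0.5, -1.2, 3.0])
weights = np.array([0.4, 0.1, -0.7])
target = 1.0
prediction = np.dot(weights, inputs)
gradient = 2.0 * (prediction - target) * inputs   # dL/dw for the linear neuron
weights = sgd_step(weights, gradient)
```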
With reference to
After a neural network such as that described with reference to
Compression is defined herein as pruning and/or weight clustering and/or quantization, and is carried out prior to deploying a neural network. Pruning a neural network is defined herein as the removal of one or more connections in a neural network. Pruning involves removing one or more neurons from the neural network, or removing one or more connections defined by the weights of the neural network. This may involve removing one or more of its weights entirely, or setting one or more of its weights to zero. Pruning permits a neural network to be processed faster due to the reduced number of connections, or due to the reduced computation time involved in processing zero-value weights. Quantization of a neural network involves reducing a precision of one or more of its weights. Quantization may involve reducing the number of bits that are used to represent the weights, for example from 32 to 16, or changing the representation of the weights from floating point to fixed point. Quantization permits the quantized weights to be processed faster, or by a less complex processor. Weight clustering in a neural network involves identifying groups of shared weight values in the neural network and storing a common weight for each group of shared weight values. Weight clustering permits the weights to be stored with fewer bits, and reduces the storage requirements of the weights as well as the amount of data transferred when processing the weights. Each of the above-mentioned compression techniques acts independently to accelerate or otherwise alleviate the processing requirements of the neural network. Example techniques for pruning, quantization and weight clustering are described in a document by Han, Song et al. (2016) entitled "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", arXiv:1510.00149v5, published as a conference paper at ICLR 2016.
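By way of a non-limiting illustration, the three compression operations defined above may be sketched on a toy weight matrix as follows; the pruning threshold, the 16-bit target precision, and the number of weight clusters are illustrative assumptions and are not taken from the above-cited document.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 4)).astype(np.float32)      # toy weight matrix

# Pruning: remove connections by zeroing weights of small magnitude
pruned = np.where(np.abs(weights) < 0.5, 0.0, weights)

# Quantization: reduce the precision of the weights, e.g. 32-bit to 16-bit float
quantized = pruned.astype(np.float16)

# Weight clustering: replace each weight with a shared value for its cluster
# (a simple uniform binning into 8 shared values is assumed here)
n_clusters = 8
edges = np.linspace(float(quantized.min()), float(quantized.max()), n_clusters + 1)
centres = (edges[:-1] + edges[1:]) / 2.0
indices = np.clip(np.digitize(quantized, edges) - 1, 0, n_clusters - 1)
clustered = centres[indices].astype(np.float16)
```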
In accordance with the present disclosure, a computer-implemented method of optimising a neural network is provided. The method may for example be used to optimise the neural network described in relation to
With reference to
In more detail, first processing system FPS in
Referring to first processing system FPS in
By way of an example, neural network NN in
Referring now to second processing system SPS in
Second processing system SPS in
With continued reference to second processing system SPS in
With continued reference to second processing system SPS in
As illustrated in
As illustrated by the label DIFF in
Student neural network SNN in
As outlined in the following example implementations, the optimising can include training, and/or compressing, the student neural network SNN. In general, the parameters that are adjusted during the optimising may include the trainable parameters and/or the hyperparameters. The actual parameters that are adjusted during the optimising, as well as the stopping criterion, both depend on how the student neural network SNN is optimised.
In one example implementation, the optimisation involves training the student neural network SNN. In this example implementation, with reference to
Thus, in this example implementation the parameters that are adjusted are the weights w0 . . . j and the biases B. The stopping criterion is that the difference DIFF between the reference output data ROD and the second output data SOD is less than a predetermined value. The adjusting may include the above-mentioned backpropagation process. By way of an example, the backpropagation may use the above-mentioned SGD algorithm, wherein the derivative of the difference DIFF with respect to each weight is computed via the chain rule through the activation functions, and this derivative is used to adjust each weight.
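By way of a non-limiting illustration, the iterative adjustment described above may be sketched as follows; the use of PyTorch, the mean-squared-error measure of the difference DIFF, the learning rate and the threshold value are assumptions made purely for explanation, and `teacher`, `student` and `second_data` are hypothetical placeholders for the previously-trained neural network PTNN, the student neural network SNN and the second data SD respectively.

```python
import torch

def optimise_student(student, teacher, second_data, lr=0.01, threshold=1e-3):
    optimiser = torch.optim.SGD(student.parameters(), lr=lr)
    difference = torch.nn.MSELoss()       # one possible measure of DIFF
    teacher.eval()
    for batch in second_data:
        with torch.no_grad():
            reference = teacher(batch)    # reference output data ROD
        prediction = student(batch)       # second output data SOD
        diff = difference(prediction, reference)
        if diff.item() < threshold:       # stopping criterion
            break
        optimiser.zero_grad()
        diff.backward()                   # backpropagate DIFF through the student
        optimiser.step()                  # adjust the weights and biases
    return student
```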
In so doing, the optimised student neural network that is provided by the iterative adjustments is tailored both to the processing capabilities of the second processing system and to the second data. This alleviates the processing burden of operating the student neural network on the second processing system.
Optionally, in this example implementation, the iteratively adjusting the weights w0 . . . j and the biases B of the student neural network SNN may additionally include adjusting a temperature parameter of the student neural network SNN. In general, the temperature parameter of a neural network controls its classification confidence. When the student neural network is being trained, it may be beneficial to use the temperature parameter to soften the predictions of the previously-trained neural network PTNN before they are used as targets for the student neural network SNN. In this example implementation the previously-trained neural network PTNN and the student neural network may each generate class probabilities from a logit vector Output, where Output=(Output1, . . . Outputn). A Softmax function may be performed in order to produce a probability vector q=(q1, . . . qn) by comparing each logit Outputi with the other logits. Probability vector q is defined as:
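qi = exp(Outputi/T) / Σj exp(Outputj/T)    (Equation 1)

where T is the temperature parameter and the summation over j runs over the logits Output1, . . . Outputn.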
In general, the temperature parameter T in Equation 1 may be used to control the classification confidence of a neural network because it affects the sensitivity of the student neural network SNN to low probability output data candidates. Increasing the temperature parameter reduces the classification confidence.
Thus, in this example implementation, the previously-trained neural network PTNN is trained on the first data FD using a first value of a temperature parameter, the temperature parameter controlling a classification confidence of the previously-trained neural network PTNN. The iteratively adjusting the weights w0 . . . j and the biases B of the student neural network SNN until the difference between the reference output data ROD and the second output data SOD is less than a predetermined value, comprises: using a second value for the temperature parameter, the second value being higher than the first value such that a classification confidence of the optimised student neural network is lower than the classification confidence of the previously-trained neural network PTNN.
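By way of a non-limiting illustration, the effect of the temperature parameter in Equation 1 may be sketched as follows; the logit values and temperatures are assumptions chosen purely for explanation.

```python
import numpy as np

def softmax_with_temperature(logits, T):
    # Equation 1: divide each logit by the temperature T, exponentiate,
    # and normalise so that the probabilities sum to one
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()                               # for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = [4.0, 1.0, 0.2]
print(softmax_with_temperature(logits, T=1.0))   # peaked, high-confidence output
print(softmax_with_temperature(logits, T=5.0))   # softened, lower-confidence output
```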
As illustrated by the dashed boxes in
Each of these processes further reduces the processing requirements of the second processing system.
In any of the above examples wherein the optimisation involves training the student neural network SNN, one or more hyperparameters of the student neural network SNN may also be adjusted during training in order to further optimise the training process.
In another example implementation, the optimisation described with reference to
Reducing a precision of the weights w0 . . . j, and removing neurons N0 . . . i and/or connections defined by the weights w0 . . . j, both degrade the predictive accuracy of the student neural network SNN whilst simultaneously reducing the processing burden of running the optimised student neural network. This example implementation therefore allows a trade-off between predictive accuracy and processing burden to be made, and thereby tailored to the second processing system SPS.
The value of the predetermined limit that is used when reducing a precision of the weights w0 . . . j, or when removing neurons N0 . . . i and/or connections defined by the weights w0 . . . j, therefore controls the accuracy with which the second output data SOD generated by the student neural network SNN predicts the reference output data ROD. Continuing with the above animal image classification example, the predetermined limit may be that the student neural network SNN should predict the classification generated by the previously-trained neural network to within a certain percentage. For example, images of cats that are inputted as second data SD may generate an output from the previously-trained neural network PTNN with the classification of "cat" having 90% probability. The predetermined limit may be that images of cats should generate an output from the student neural network SNN with the classification of "cat" as being within 10% of the 90% probability generated by the previously-trained neural network PTNN; i.e. greater than 80%.
By using the second processing system SPS to perform each of these operations, and also by performing each of these operations with the second data SD, the optimised student neural network that is provided by each of these optimisation operations is tailored both to the processing capabilities of the second processing system SPS and to the second data SD. This alleviates the processing burden of operating the optimised student neural network on the second processing system SPS.
Optionally, in some implementations the weights of the student neural network SNN are represented with a lower precision than the weights of the previously-trained neural network PTNN. This facilitates faster optimisation of the student neural network SNN. In these implementations the plurality of parameters of the student neural network SNN includes a plurality of weights w0 . . . j connecting a plurality of neurons N0 . . . i in the student neural network SNN. The previously-trained neural network PTNN also comprises a plurality of weights connecting a plurality of neurons in the previously-trained neural network PTNN. The weights of the student neural network w0 . . . j are represented with a lower precision than the weights of the previously-trained neural network PTNN.
Optionally, in some implementations the student neural network SNN is provided by performing a quantization process on the previously-trained neural network PTNN. The quantization process may for instance be performed by the first processing system FPS, or by the second processing system SPS, or by yet another processing system. In these implementations the quantization process includes providing the weights w0 . . . j of the student neural network SNN by reducing a precision of the weights of the previously-trained neural network PTNN such that the weights of the student neural network SNN are represented with a lower precision than the weights of the previously-trained neural network PTNN.
Optionally, in some implementations the second processing system SPS is used to perform the quantization process on the previously-trained neural network PTNN so that the weights of the student neural network w0 . . . j are represented with a lower precision than the weights of the previously-trained neural network PTNN. In these implementations, the second processing system SPS is used to perform the quantization process on the previously-trained neural network PTNN to provide the student neural network SNN, prior to optimising the student neural network SNN for processing the second data SD with the second processing system SPS. Using the second processing system SPS to perform the quantization process on the previously-trained neural network PTNN requires only a single neural network, specifically the previously-trained neural network PTNN, to be transferred to the second processing system SPS.
Optionally, in some implementations, the second data SD that is used in the optimisation is provided by sampling a dataset, specifically second processing system input data SPSID, that is input to the second processing system SPS. The second processing system input data SPSID is sampled, and the sampled second processing system input data is included in a subset, which provides the second data SD, if it increases a diversity metric of the subset.
This is indicated in
In more detail, the second processing system SPS receives second processing system input data SPSID; and the second processing system SPS is used to identify a subset of the second processing system input data SPSID to use as the second data SD. Identifying a subset of the second processing system input data SPSID to use as the second data SD comprises: sampling the second processing system input data SPSID, and including the sampled second processing system input data in the subset if the sampled second processing system input data increases a diversity metric of the subset.
Selecting the second data SD using the diversity metric avoids the optimised student neural network SNN becoming over-optimised, i.e. too sensitive, to common features in the data that is used to optimise the student neural network SNN, at the expense of diminished sensitivity to less common features in the data. Using the above-described animal image classification example, if the optimisation performed with the second data SD is training, and the second data SD predominantly includes images of a particular type, such as horses, then the optimisation risks producing a network that is highly sensitive to horses at the expense of poor sensitivity to cats. Using the diversity metric helps to prevent this situation by using data that is as different as possible to optimise the student neural network SNN.
The diversity metric of the subset that is used to provide the second data SD may be computed in various ways. For example, the diversity metric may be computed based on a numerical distance between the output of the student neural network SNN, or the output of the previously-trained neural network PTNN, generated in response to inputting the sampled second processing system input data, and the output of the respective neural network, generated in response to inputting each existing element of the subset. This is illustrated in more detail with reference to
With reference to
The second data SD that is defined in this manner is then used to optimise the student neural network SNN. The second data SD in the subset may also be periodically updated. For example, if the subset has a fixed maximum size, such as 1000 images, then after including sufficient sampled second processing system input data SPSID to fill the subset, existing subset data elements may be replaced in order to further increase the diversity of the second data SD.
Various distance metrics may be used to compute the aforementioned numerical distance, including for example the Kullback-Leibler divergence “KLD”, the cosine distance “CD”, the Mean-Absolute Error “MAE”, the Mean-Squared Error “MSE”, the Minkowski Distance, the Euclidean Distance, and so forth.
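By way of a non-limiting illustration, the inclusion decision based on such a distance metric may be sketched as follows; the use of the Kullback-Leibler divergence, the minimum-distance threshold, and the `network` placeholder, assumed here to map an input to a probability vector such as the output of the student neural network SNN, are choices made purely for explanation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    # Kullback-Leibler divergence between two probability vectors
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    return float(np.sum(p * np.log(p / q)))

def maybe_include(sample, subset, network, min_distance=0.1):
    # Include the sampled input in the subset only if its output is
    # sufficiently far from the output of every existing subset element,
    # i.e. only if it increases the diversity of the subset.
    output = network(sample)
    for existing in subset:
        if kl_divergence(network(existing), output) < min_distance:
            return False       # too similar to an existing element
    subset.append(sample)
    return True
```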
Referring now to
Optionally, in some implementations the second processing system output data SPSOD that is generated by the neural network is provided to a user substantially in real-time, and the optimising of the student neural network is performed at a later point in time. This option is indicated by way of the horizontal dashed line separating the labels "Down-time" and "Real-time" in
Using the above-described animal image classification example, in these latter implementations, if second processing system input data SPSID represents images of animals, the second neural network may be used to generate second processing system output data SPSOD in the form of a classification of the animal images. The classification may be in real-time. After having performed the classification, the second processing system may use a subset of the second processing system input data SPSID, i.e. the second data SD, to optimise the student neural network SNN. The subset may be determined by sampling the second processing system input data SPSID, and including the sampled second processing system input data in the subset if the sampled second processing system input data increases a diversity metric of the subset. By performing the optimisation after the real-time classification, it is avoided that the optimisation interrupts the classification.
Optionally, in some implementations the optimisation of the student neural network is constrained in order to ensure that for particular test input data, the output of the optimised neural network does not diverge too far from corresponding expected output data. This acts to prevent the optimised student neural network from becoming too sensitive to some features of the input data at the expense of being insensitive to other features of the input data. Thereto,
With reference to
Using the above-described animal image classification example, test input data TID may for example include an image of a dog, a cat and a horse that are each classified with corresponding expected output data EOD indicative of the classification and its associated probability: "Dog, 100%", "Cat, 100%", "Horse, 100%". If, using the proposed adjusted parameters, the student neural network SNN classifies each image by generating test output data TOD that is within a certain percentage, for example within 20%, of each of the above EOD classification probability values, then the proposed adjusted parameters are used in the optimised student neural network SNN. Otherwise, the optimisation described above is repeated.
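By way of a non-limiting illustration, the constraint check described above may be sketched as follows; the 20% tolerance and the assumption that `student` returns the probability assigned to the expected class for each test input are made purely for explanation.

```python
def constraint_satisfied(student, test_inputs, expected_probabilities, tolerance=0.2):
    # Accept the proposed adjusted parameters only if, for every test input,
    # the test output data TOD stays within `tolerance` of the corresponding
    # expected output data EOD.
    for test_input, expected in zip(test_inputs, expected_probabilities):
        generated = student(test_input)   # probability for the expected class
        if abs(generated - expected) > tolerance:
            return False
    return True
```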
The above-described methods may be provided on a non-transitory computer-readable storage medium comprising a set of computer-readable instructions stored thereon which, when executed by at least one processor, cause the at least one processor to perform the method. In other words, the above-described methods may be implemented as a computer program product. The computer program product can be provided by dedicated hardware, or by hardware capable of running software in association with appropriate software. When provided by a processor, these functions can be provided by a single dedicated processor, a single shared processor, or multiple individual processors, some of which can be shared. Moreover, the explicit use of the terms "processor" or "controller" should not be interpreted as exclusively referring to hardware capable of running software, and can implicitly include, but is not limited to, digital signal processor "DSP" hardware, read only memory "ROM" for storing software, random access memory "RAM", a non-volatile storage device, and the like. Furthermore, implementations of the present disclosure can take the form of a computer program product accessible from a computer-usable storage medium or a computer-readable storage medium, the computer program product providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable storage medium or computer-readable storage medium can be any apparatus that can comprise, store, communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or propagation medium. Examples of computer-readable media include semiconductor or solid state memories, magnetic tape, removable computer disks, random access memory "RAM", read only memory "ROM", rigid magnetic disks, and optical disks. Current examples of optical disks include compact disk-read only memory "CD-ROM", optical disk-read/write "CD-R/W", Blu-Ray™, and DVD.
A system is also provided for execution of the above-described method. Thereto,
In use, previously-trained neural network PTNN, and student neural network SNN, are transferred to second processing system SPS. These neural networks may for instance be transferred to second processing system SPS by transferring the parameters and configuration settings that define their architecture and control their operation. The neural networks may be transferred by reading data from a computer-readable storage medium, or downloaded from the Internet or the Cloud. System SY may optionally include a camera or another type of input device for receiving or generating second processing system input data SPSID. System SY may for instance include an input device in the form of a microphone configured to generate audio data. The use of other input devices configured to sense or receive other types of data, including optical, vibration, pressure, temperature, and motion data, is also contemplated. Second processing system input data SPSID may alternatively be read from an external computer-readable storage medium. System SY may also include an output device such as a display or a speaker (not illustrated in
The above example implementations are to be understood as illustrative examples of the present disclosure. Further implementations are also envisaged. For example, the implementations described in relation to a method may also be implemented in the computer program product, in the computer readable storage medium, or in the system. It is therefore to be understood that a feature described in relation to any one implementation may be used alone, or in combination with other features described, and may also be used in combination with one or more features of another implementation, or with a combination of other implementations. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the disclosure, which is defined in the accompanying claims. Any reference signs in the claims should not be construed as limiting the scope of the disclosure.
Foreign application priority data: Application No. 2007329.2, filed May 2020, GB (national).
This application is the U.S. national stage application filed pursuant to 35 U.S.C. 365(c) and 120 as a continuation of International Patent Application No. PCT/GB2021/051190, filed May 18, 2021, which application claims priority to United Kingdom Patent Application No. 2007329.2, filed May 18, 2020, which applications are incorporated herein by reference in their entireties.
Related U.S. application data: Parent application PCT/GB2021/051190, filed May 2021 (US); child application No. 18055192 (US).