The present disclosure claims priority to Chinese patent application No. 201911115314.5 filed on Nov. 14, 2019, and to Chinese patent application No. 201910807591.6 filed on Aug. 29, 2019, the entire contents of both of which are hereby incorporated by reference.
The present disclosure relates to the field of Deep Neural Network (DNN) modeling, and in particular to a training method of neural network models suitable for different calculation accuracies.
The deep neural network is a model with a complex network architecture. Common neural network models include a Convolutional Neural Network (CNN) model, a Recurrent Neural Network (RNN) model, a Graph Neural Network (GNN) model, and the like.
Taking the CNN model shown in the accompanying drawings as an example:
In the process of updating each weight in the network model, the partial derivative (gradient) g of each weight w is calculated layer by layer in the network model from bottom to top, according to the error evaluation function L(y, y*), whose inputs are the actual output result y and the expected output result y* and which indicates the error between them. It is assumed that the network model to be trained has been trained t times (that is, the number of training iterations is t), that there are several weights to be updated in the network model, and that the weight currently being updated is wt. First, the gradient gt of the weight wt is calculated according to the following formula (1); then, the weight wt is updated according to the following formula (2) to obtain the weight wt+1 after this, the (t+1)-th, training.
gt = ∂L(y, y*)/∂wt    formula (1)
wt+1 = wt − ηt·gt    formula (2)
Wherein ηt is an update step scale (also known as a learning rate), which can be a constant or a variable and is used to scale the gradient gt.
According to the above update process for the weight wt, calculation of the gradient is an important step in the weight update process. However, formula (2) only takes the relationship between the weight wt and the latest gradient into consideration, without considering the effect of historical gradients (gradients in previous training iterations) on the gradient in the current training iteration. As a result, the direction inertia generated at the gradient position is small, which is adverse to accelerating the training of the neural network model.
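As a concrete illustration, the plain update rule of formulas (1) and (2) can be sketched in a few lines of Python. The error function, the learning rate, and the one-weight toy model below are illustrative assumptions, not part of the present disclosure:

```python
def sgd_update(w, grad, lr=0.1):
    """Plain weight update of formula (2): w_{t+1} = w_t - eta_t * g_t."""
    return w - lr * grad

# Illustrative error function L(y, y*) = 0.5 * (y - y_star)**2 with y = w * x,
# so the gradient of formula (1) is dL/dw = (w * x - y_star) * x.
w, x, y_star = 0.0, 2.0, 4.0
for _ in range(100):
    grad = (w * x - y_star) * x
    w = sgd_update(w, grad)
# After repeated updates, w approaches y_star / x = 2.0.
```

Each iteration moves the weight against its gradient, scaled by the update step scale; nothing from earlier iterations is carried over, which is exactly the limitation discussed above.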
According to an aspect of the present disclosure, there is provided a training method of a neural network model, comprising: determining gradients of weights in the neural network model during a back propagation; performing, for at least one of the determined gradients, the following processing: determining whether a gradient is within a constraint threshold range, and constraining a gradient that exceeds the constraint threshold range to be within the constraint threshold range, wherein the constraint threshold range is determined according to the number of training iterations and calculation accuracy of the neural network model; and updating weights using constrained gradients.
According to another aspect of the present disclosure, there is provided a training system of a neural network model, comprising: a server that stores at least one first network model, wherein the first network model provides information for synchronizing a second network model, and the server is used to determine gradients of weights in the first network model during a back propagation, and perform, for at least one of the determined gradients, the following processing: determining whether a gradient is within a constraint threshold range, and constraining a gradient that exceeds the constraint threshold range to be within the constraint threshold range, updating a weight using the constrained gradient, and outputting the updated weight, wherein the constraint threshold range is determined according to the number of training iterations and a calculation accuracy of the first network model; and a terminal that stores the second network model, wherein the terminal is used to synchronize the second network model using the weight output by the server.
According to another aspect of the present disclosure, there is provided a training apparatus of a neural network model, comprising: a gradient determination unit configured to determine gradients of weights in the neural network model during a back propagation; a gradient constraint unit configured to perform, for at least one of the gradients determined by the gradient determination unit, the following processing: determining whether a gradient is within a constraint threshold range, and constraining a gradient that exceeds the constraint threshold range to be within the constraint threshold range, wherein the constraint threshold range is determined according to the number of training iterations and a calculation accuracy of the neural network model; and an update unit configured to update weights using constrained gradients.
According to another aspect of the present disclosure, there is provided an application method of a neural network model, comprising: storing a neural network model trained based on the above training method; receiving a data set which is required to correspond to a task which can be executed by the stored neural network model; and performing operation on the data set in each layer in the stored neural network model from top to bottom, and outputting a result.
According to another aspect of the present disclosure, there is provided an application apparatus of a neural network model, comprising: a storage module configured to store a neural network model trained based on the above training method; a receiving module configured to receive a data set which is required to correspond to a task which can be executed by the stored neural network model; and a processing module configured to perform operation on the data set in each layer in the stored neural network model from top to bottom, and output a result.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing instructions, which, when executed by a computer, cause the computer to perform the training method based on the above neural network model.
Further features of the present disclosure will become apparent from the following description of exemplary embodiments with reference to the accompanying drawings.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments of the present disclosure, and together with the description of the exemplary embodiments, serve to explain the principles of the present disclosure.
In order to optimize the traditional weight update process, improve the convergence speed of the neural network model, and accelerate the training of the neural network model, an ADAM (Adaptive Moment Estimation) training method has been proposed. When a certain weight is updated in the neural network model, the historical gradients of that weight from previous updates (trainings) are used to update the gradient of the weight in this update (training), and the new gradient is then used to update the weight.
First, the gradient gt of the weight wt is calculated using the aforementioned formula (1).
Then, the following derived formula (3) is used to calculate a moving average first-order moment of the historical gradients of the weight wt, that is, weighted summation of the historical gradient first-order moments.
mt = β1·mt−1 + (1−β1)·gt => mt = (1−β1)·Σi=1..t β1^(t−i)·gi    formula (3)
Wherein β1·mt−1 + (1−β1)·gt is a recursive formula: mt−1 is calculated from mt−2 using the same formula, and so on; starting from m1 of the first training, m2, m3, . . . , mt−1 are calculated successively, so that the first-order weighted summation formula (1−β1)·Σi=1..t β1^(t−i)·gi is derived. β1 is a weighted value that represents the gradient attenuation rate and may be a constant, such as 0.9. β1^(t−i) in formula (3) represents the (t−i)-th power of β1.
Next, the following derived formula (4) is used to calculate a moving average second-order moment of the historical gradients, that is, weighted summation of the historical gradient second-order moments.
Vt = β2·Vt−1 + (1−β2)·gt² => Vt = (1−β2)·diag(Σi=1..t β2^(t−i)·gi²)    formula (4)
Wherein β2·Vt−1 + (1−β2)·gt² is a recursive formula similar to that in formula (3): starting from V1 of the first training, V2, V3, . . . , Vt−1 are calculated successively, so that the second-order weighted summation formula (1−β2)·diag(Σi=1..t β2^(t−i)·gi²) is derived. β2 is a weighted value that represents the gradient attenuation rate and may be a constant, such as 0.999. β2^(t−i) in formula (4) represents the (t−i)-th power of β2.
Finally, according to the calculation results of formulas (3) and (4), the formula (5) is used to update the gradient gt to obtain the updated gradient gt′; then the updated gradient gt′ is further used to update the weight wt according to formula (6), so as to obtain the weight wt+1 after this training.
gt′ = mt/√Vt    formula (5)
wt+1 = wt − ηt·gt′ = wt − ηt·mt/√Vt    formula (6)
In the ADAM-based weight update method, not only is the gradient gt of the weight wt calculated during this training used, but the historical gradients of the weight from previous trainings are also introduced by means of weighted summation to obtain the gradient gt′ used in this training, so that a greater inertia can be adaptively generated at positions where the gradient direction persists, in order to accelerate the training of the neural network model.
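The ADAM update of formulas (3) through (6) can be sketched as follows. This is a minimal NumPy illustration assuming a toy one-dimensional loss L = w²; the eps safeguard in the denominator and the omission of the original ADAM bias correction are implementation details not shown in the formulas above:

```python
import numpy as np

def adam_step(w, grad, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One ADAM update following formulas (3)-(6).  eps avoids division
    by zero; the bias correction of the original ADAM paper is omitted
    for brevity."""
    m = beta1 * m + (1 - beta1) * grad        # formula (3): first moment
    v = beta2 * v + (1 - beta2) * grad ** 2   # formula (4): second moment
    g_prime = m / (np.sqrt(v) + eps)          # formula (5): updated gradient
    w = w - lr * g_prime                      # formula (6): weight update
    return w, m, v

# Toy loss L = w**2, whose gradient is 2*w; w shrinks toward 0.
w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for _ in range(1000):
    w, m, v = adam_step(w, 2.0 * w, m, v)
```

Because m carries a decaying sum of past gradients, consecutive gradients pointing the same way reinforce each other, which is the direction inertia described above.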
Although the ADAM method overcomes the problem that the training speed of traditional neural network models is slow, the premise of applying the ADAM method is that the trained neural network model is a high-accuracy model (the weights and the input x are high-accuracy); for example, the weight w in the model is of a 32-bit floating-point type. As the number of training iterations increases, the weight w changes as shown in the corresponding figure.
When the high-accuracy neural network model is quantized to low accuracy, for example, a weight w of 32-bit floating-point type is quantized to a weight wb of Boolean type, whose value is only −1 or 1.
Wherein wb=sign(w).
At this time, as the number of training iterations increases, the gradient change of the weight wb is as shown in the corresponding figure.
Wherein the gradient of wb can be obtained by differentiating the following w̃:
w̃ = α·sign(w)
Wherein α is a quantization scale factor.
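A minimal sketch of the binary quantization described above, together with an STE-style backward pass, is given below. The choice of α as the mean absolute value of the weights and the clipping range used in the STE are common conventions assumed here for illustration; the present disclosure does not fix them:

```python
import numpy as np

def quantize_weights(w):
    """Binarize high-accuracy weights: wb = sign(w), scaled by alpha.

    alpha is the quantization scale factor; taking it as the mean
    absolute weight value is an assumed, commonly used choice."""
    alpha = np.mean(np.abs(w))
    wb = np.sign(w)
    wb[wb == 0] = 1.0        # map sign(0) to +1 so values stay in {-1, +1}
    return alpha * wb, alpha

def ste_backward(grad_wb, w, clip=1.0):
    """Straight-Through Estimator: pass the gradient of the quantized
    weight through to the real-valued weight, zeroed where |w| exceeds
    a clipping range (a common STE variant, assumed here)."""
    return grad_wb * (np.abs(w) <= clip)

w = np.array([0.7, -0.2, 1.5, -1.2])
w_tilde, alpha = quantize_weights(w)    # w_tilde = alpha * sign(w)
grads = ste_backward(np.ones(4), w)     # gradients blocked where |w| > 1
```

The gap between w and w̃ = α·sign(w) is exactly the quantization error discussed next.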
Comparing the figures, there is a quantization error between the high-accuracy case and the low-accuracy case (Υ̃ is calculated in the same manner as Υt+1, except that Υt+1 is based on the full-accuracy network and Υ̃ is based on the quantized network), which results in a difference between the gradient g of the weight w and the gradient ĝ of the weight wb. When a low-accuracy network model is trained, if an inappropriate gradient ĝ is used continuously to update the weights, the accumulation of gradient differences may prevent the network model from achieving the expected performance. On the other hand, since the ADAM method requires weighted summation of historical gradients, the quantization errors will be accumulated during the weighted summation of the historical gradients, which causes the direction inertia to shift and thus affects the training accuracy of the network model.
This is shown in the corresponding figure.
In order to illustrate the deficiencies of the ADAM method, the following three theorems are provided in the present disclosure as supplementary explanations.
Theorem 1: Suppose that there is a quantization scale factor α and a binary quantization function sign(w), and consider an online convex optimization problem. For the optimization of a quantized neural network, given any initial step scale η, the ADAM cannot converge to the optimal solution, because it has a non-zero regret, that is, when T→∞, RT/T ↛ 0.
Theorem 2: Suppose that there is a quantization scale factor α and a binary quantization function sign(w). Given any β1, β2 belonging to [0, 1) with β1 < √β2, consider an online convex optimization problem. For the optimization of a quantized neural network, given any initial step scale η, the ADAM cannot converge to the optimal solution, because it has a non-zero regret, that is, when T→∞, RT/T ↛ 0. For any convex functions {fi}i=1..∞, the gradients over the feasible domain F are bounded by the constraint value G∞.
Theorem 3: Suppose that there is a quantization scale factor α and a binary quantization function sign(w). Given any β1, β2 belonging to [0, 1) with β1 < √β2, consider a stochastic convex optimization problem. For the optimization of a quantized neural network, given any initial step scale η, the convergence speed C based on the ADAM is determined by β1, β2, α and G∞. For any convex functions {fi}i=1..∞, the gradients over the feasible domain F are bounded by the constraint value G∞.
Wherein, as for the detailed description of the above Theorem 1, Theorem 2 and Theorem 3, please refer to the corresponding part of the <Theorems and inferences in the exemplary embodiments> in the specification.
Since the existing ADAM method has the above problems, the present disclosure proposes a training solution for a multilayer neural network model. Compared with the ADAM method, the solution in the present disclosure is not limited to being applied to a high-accuracy neural network model, and also has better performance in the training of low-accuracy neural network models.
As described above, when a low-accuracy network model is trained based on the ADAM method, compared with the training of a high-accuracy network model, there is a quantization error Υ for the quantization of weights, wherein w represents a high-accuracy weight and wb represents a low-accuracy weight obtained after the high-accuracy weight is quantized, as can be seen by comparing the corresponding figures.
Hereinafter, exemplary embodiments of the present disclosure will be described with reference to the accompanying drawings. For the purpose of clarity and conciseness, not all features of an embodiment are described in the specification. However, it should be understood that many implementation-specific settings must be made during the implementation of the embodiment in order to achieve the developer's specific goals, such as meeting those restrictions related to equipment and business, and these restrictions may change depending on the different implementations. In addition, it should also be understood that, although development work may be very complex and time-consuming, it is only a routine task for those skilled in the art having the benefit of the present disclosure.
It should also be noted here that in order to avoid obscuring the present disclosure due to unnecessary details, only the processing steps and/or system structures that are closely related to at least the solution according to the present disclosure are shown in the drawings, and other details not closely relevant to the present disclosure are omitted.
Next, various aspects of the present disclosure will be described.
Step S101: The forward propagation of this training is performed, and the difference value between the actual output result and the expected output result of the neural network model is determined.
The training process of the neural network model is a cyclic and iterative process. Each training includes a forward propagation and a back propagation. Wherein the forward propagation is a process of layer-by-layer operation of the data x to be trained from top to bottom in the neural network model. The forward propagation process described in the present disclosure may be a known forward propagation process. The forward propagation process may include the quantization process of the weight of arbitrary bits and the feature map. The present disclosure is not limited to this. If the difference value between the actual output result and the expected output result of the neural network model does not exceed a predetermined threshold, it means that the weights in the neural network model are optimal solutions, the performance of the trained neural network model has reached the expected performance, and the training of the neural network model is finished. Conversely, if the difference value between the actual output result and the expected output result of the neural network model exceeds a predetermined threshold, the back propagation process needs to be continued, that is, based on the difference value between the actual output result and the expected output result, the operation is performed layer by layer in the neural network model from bottom to top, and the weights in the model are updated so that the performance of the network model with updated weights is closer to the expected performance.
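The cyclic process described above can be sketched as follows, with all model interfaces assumed for illustration (the present disclosure does not prescribe a particular implementation):

```python
def train(forward, backward, weights, x, y_star,
          threshold=1e-3, lr=0.1, max_iters=1000):
    """Sketch of the cyclic training process described above (interfaces
    are assumptions): forward propagation from top to bottom, comparison
    of the difference value against a predetermined threshold, then back
    propagation from bottom to top with a weight update."""
    for _ in range(max_iters):
        y = forward(weights, x)               # forward propagation
        if abs(y - y_star) <= threshold:      # expected performance reached
            break
        grads = backward(weights, x, y_star)  # back propagation
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights

# Toy one-weight "model": y = w * x with a squared-error evaluation.
forward = lambda ws, x: ws[0] * x
backward = lambda ws, x, y_star: [(ws[0] * x - y_star) * x]
weights = train(forward, backward, [0.0], 2.0, 4.0)
```

Training stops as soon as the difference value falls below the threshold; otherwise the forward/backward cycle repeats, exactly as described in the text.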
The neural network model applicable to the present disclosure may be any known model, such as a convolutional neural network model, a recurrent neural network model, a graph neural network model, etc. The disclosure is not limited to the type of the network model.
The calculation accuracy of the neural network model applicable to the present disclosure can be arbitrary accuracy, whether high accuracy or low accuracy; the terms "high accuracy" and "low accuracy" are relative to each other and do not limit specific numerical values. For example, high accuracy can be a 32-bit floating-point type, and low accuracy can be a 1-bit fixed-point type. Of course, other accuracies such as 16-bit, 8-bit, 4-bit, and 2-bit are also included in the calculation accuracy range applicable to the solution of the present disclosure. The term "calculation accuracy" may refer to the accuracy of the weights in the neural network model, or the accuracy of the input x to be trained, which is not limited in the present disclosure. The neural network model described in the present disclosure may be a Binary Neural Network (BNN), but is of course not limited thereto and may also be a neural network model with other calculation accuracy.
Step S102: In the back propagation, the gradient of each weight in the network model is calculated, and subsequent optimization processing for the gradient of at least one weight is executed.
In the back propagation, the layers involved in quantization in the forward propagation are processed using STE (Straight-Through Estimator) techniques.
It is assumed that, in this embodiment, subsequent optimization processing is performed for the gradient of the weight w.
Since the neural network model has a multi-layer structure, in the back propagation, the gradient value of the weight in each layer is calculated layer by layer from bottom to top using the chain rule, according to the error evaluation function L(y, y*) (also known as the loss function). In this step S102, the method for calculating the gradient of the weight may be any known method, and the present disclosure is not limited to this.
Here, the subsequent optimization processing may be scheduled in several ways. After the gradient of each single weight is calculated, the subsequent optimization processing may be executed on that gradient; when the gradients of multiple weights require subsequent optimization processing, the processing is performed in series between the gradients of the respective weights. Alternatively, after the gradients of the weights belonging to the same layer are calculated, the subsequent optimization processing may be executed on the gradients of the weights in that layer; for example, it may be executed in parallel on the gradients of the weights in the same layer and in series on the gradients of the weights in different layers. Alternatively, after the gradients of all weights in the neural network model are calculated, the subsequent optimization processing may be executed on the gradients of a part or all of the weights; for example, it may be executed in series between the gradients of the weights, or, following the order of layers from bottom to top in the neural network model, in parallel on the gradients of the weights in the same layer and in series on the gradients of the weights in different layers. The present disclosure does not limit for which weights' gradients the subsequent optimization processing is executed; for example, it may be performed on the gradients of all weights, or only on the gradients of weights in certain layers, such as a convolution layer.
Step S103: The weighted maximum value is determined from the gradient of the weight w calculated in step S102 and the gradient of the weight w in the previous N trainings, wherein N is an integer greater than or equal to 1.
It is assumed that the network model to be trained has been trained t times (the number of training iterations is t), and in step S102, the gradient of the weight w is calculated as gt. In this step S103, in consideration of the influence of the historical gradients on the gradient of this training, the gradients including the gt and gradients in the previous N trainings are weighted, and the maximum value is taken. The gradients in the previous N trainings may be a gradient updated using the method of the present disclosure, or may also be a gradient calculated by using formula (3), formula (4), and formula (5) with the ADAM method. The number N of historical gradients used here is not greater than t.
An optional algorithm for determining the weighted maximum value in step S103 is to calculate the weighted maximum value of the historical gradient second-order moment as shown in formula (7).
vt = β2·vt−1 + (1−β2)·gt²,  v̂t = max(v̂t−1, vt)    formula (7)
Wherein β2·vt−1 + (1−β2)·gt² is a recursive formula. The calculated vt is compared with v̂t−1, and the larger value is used as the gradient after the optimization processing in step S103 has been performed once. Since the weighted maximum value is determined in this step S103 during each training, selecting the larger value between vt calculated by the recursive formula in this training and the weighted maximum value v̂t−1 of the previous training ensures that the weighted maximum value of the historical gradient second-order moment is determined in this step S103.
β2 in formula (7) may be the same weighted value 0.999 as β2 in formula (4), and may also be a weighted value determined according to the calculation accuracy of the neural network model. An optional algorithm for determining β2 based on the calculation accuracy is shown in formula (8).
Wherein β2(t) represents β2 during the t-th training, for example, 0.999; β2(t−1) represents β2 during the (t−1)-th training, which is determined jointly by β2(t) and the calculation accuracy.
Further, the weighted maximum value v̂t can be converted into the form of a diagonal matrix, referring to formula (9).
Vt = Diag(v̂t)    formula (9)
It should be noted that Vt in formula (9) is a diagonal matrix of the weighted maximum value of the historical gradients, which is not equal to the diagonal matrix Vt of the weighted summation of the historical gradients in formula (4).
Compared with the manner of weighted summation of the historical gradient second-order moments in the ADAM method, taking the weighted maximum of the historical gradient second-order moments in the first embodiment better represents the direction inertia of the gradient of the current neural network model, and taking the weighted maximum value makes the performance of the network model similar to that obtained with the weighted summation value. While ensuring that the performance does not degrade, using the weighted maximum value instead of the weighted summation value prevents the accumulation of quantization errors.
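The weighted-maximum computation of formula (7) can be sketched as follows; this is a NumPy illustration, and the toy gradient sequence is an assumption:

```python
import numpy as np

def second_moment_max(grad, v_prev, v_hat_prev, beta2=0.999):
    """Formula (7): recursive second moment plus a running maximum.

    v_t     = beta2 * v_{t-1} + (1 - beta2) * g_t**2
    vhat_t  = max(vhat_{t-1}, v_t)
    Keeping the maximum (instead of only the ADAM weighted sum) stops a
    burst of small gradients from shrinking the second moment and thereby
    inflating the step size."""
    v = beta2 * v_prev + (1 - beta2) * grad ** 2
    v_hat = np.maximum(v_hat_prev, v)
    return v, v_hat

v, v_hat = np.zeros(1), np.zeros(1)
for g in [1.0, 0.5, 0.1, 0.1]:            # assumed gradient sequence
    v, v_hat = second_moment_max(np.array([g]), v, v_hat)
```

By construction v̂t never decreases, so the per-iteration step scale derived from it can only shrink or stay constant, which matches the stability argument above.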
This step S103 is to optimize the gradient calculated in step S102 once. This is a preferred step in this embodiment, and this embodiment is not limited to the case of directly proceeding to step S104 from step S102.
Step S104: A constraint threshold range is determined.
In the training process of the neural network model, especially in the later stage of the training, in addition to using direction inertia to accelerate convergence, it is also necessary to set an appropriate gradient so that the model converges with an appropriate step size. However, due to the existence of the quantization error, especially in the case of low accuracy, the quantization error has a non-negligible impact on training, and the gradient that has been optimized once in step S103 needs to be constrained again. Therefore, how to determine the constraint threshold range is particularly important.
From the above description, it can be known that, in a case where the accuracy is lower in the later stage of training, it is very necessary to constrain the gradient, that is, the number of training iterations and the calculation accuracy of the model have a large impact on the gradient, so the constraint threshold range can be determined by comprehensively considering both the number of training iterations and the calculation accuracy of the neural network model. Further, since the calculation accuracy of the neural network model directly determines the quantization error, determining the constraint threshold range based on both the number of training iterations and the calculation accuracy can also be regarded as determining the constraint threshold range based on the number of training iterations and the quantization error.
The quantization error used to determine the constraint threshold range can be a quantization error for the entire neural network model or a quantization error for each weight in the neural network model. According to the different meanings of the quantization error, the meaning of the determined constraint threshold range is also different. Specifically, on the one hand, if the quantization error used to determine the constrained threshold range is the quantization error for the entire neural network model (that is, quantization errors for all weights in the neural network model are calculated, and the maximum quantization error therein is used as the quantization error for the neural network model), the determined constraint threshold range is also for the entire neural network model. In this case, a gradient of any weight in the neural network model is constrained using the same constraint threshold range when performing optimization processing. On the other hand, if the quantization error used to determine the constraint threshold range is a quantization error for each weight, the constraint threshold range is determined for each weight, and the determined constraint threshold range constrains the gradient of the weight.
It is assumed that the upper limit value and the lower limit value of the constrained threshold range are two abstract limit functions, cu and cl, respectively. An optional manner for calculating the upper limit value cu and the lower limit value cl is to use manners in the following formulas (10) and (11). In the algorithms shown in formulas (10) and (11), the abstract limit functions representing the upper limit value cu and lower limit value cl are monotonically decreasing and monotonically increasing, respectively.
Wherein t is the number of training iterations; Υt+1 is the quantization error; β is a weighted value, which can be β1 in formula (3) with a value of 0.9, or can be calculated and determined according to the calculation accuracy of the neural network in the manner shown in formula (8); of course, β can also be β2 in formula (4) with a value of 0.999, likewise optionally calculated and determined according to the calculation accuracy in the manner shown in formula (8). Here, β is a weighted value indicating the gradient attenuation rate. In addition to selecting β1 or β2 as β as mentioned above, this embodiment is not limited to setting β in other manners.
It can be seen from formulas (10) and (11) that the upper limit value and the lower limit value of the constraint threshold range are determined by the quantization error l(w, wb) of the neural network model and the number of training iterations t. Since the value of β is always less than 1 and l(w, wb) is always greater than 0, when the number of training iterations t is large (close to infinity), the two abstract boundary functions cu and cl approach each other.
In an optional manner, no matter whether the neural network model starts training for the first time or restarts training after the training is interrupted, t starts from 0; in another optional manner, when the neural network model starts training for the first time, t starts from 0, and when the neural network model restarts training after the training is interrupted, t starts from where the training is interrupted.
The above formula (10) and formula (11) are an optional manner to implement this step, and the present disclosure is not limited to reasonable modifications of formula (10) and formula (11), or other manners for determining the constraint threshold range based on the number of training iterations and quantization errors of the neural network model.
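Since formulas (10) and (11) themselves are not reproduced in this excerpt, the sketch below only illustrates the stated properties of the two abstract limit functions: cu decreases and cl increases monotonically with the number of training iterations t, both depend on the quantization error, and the two approach each other as t grows. The concrete expressions here are hypothetical stand-ins, not the disclosure's formulas:

```python
def constraint_bounds(t, quant_error, beta=0.999):
    """Hypothetical bound pair with the properties stated in the text:
    c_u(t) is monotonically decreasing, c_l(t) is monotonically
    increasing, both scale with the quantization error, and they
    converge toward each other as t grows.  The exact forms of
    formulas (10) and (11) are NOT reproduced here; these expressions
    are illustrative assumptions only."""
    c_u = quant_error * (1.0 + beta ** t)   # assumed upper limit function
    c_l = quant_error * (1.0 - beta ** t)   # assumed lower limit function
    return c_l, c_u

l_early, u_early = constraint_bounds(1, 0.5)
l_late, u_late = constraint_bounds(10_000, 0.5)
```

Early in training the range is wide (weak constraint); late in training it narrows around a value set by the quantization error, which is the qualitative behavior the text attributes to formulas (10) and (11).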
Step S105: It is judged whether the gradient is within the constraint threshold range, if yes, proceeding to step S106; if not, constraining the gradient to a value within the constraint threshold range.
The gradient value which has been optimized once in step S103 is to be optimized (constrained) twice in this step S105. An optional constraint manner is the manner shown in formula (12).
Ṽt = ΠF(Vt)    formula (12)
Wherein F is the constraint mapping value domain [cl, cu]; Vt is the diagonal matrix of the weighted maximum value of the historical gradient second-order moment in formula (9); ΠF( ) is the constraint mapping operation, which means that Vt is mapped into [cl, cu]; Ṽt represents the gradient after the constraint. When Vt is greater than cu, Vt is constrained to [cl, cu], for example, Vt is constrained to the upper limit value cu; when Vt is less than cl, Vt is constrained to [cl, cu], for example, Vt is constrained to the lower limit value cl. The present disclosure is also not limited to constraining Vt to other values within [cl, cu]; for example, when Vt is not in [cl, cu], Vt may be constrained to the average value of the upper limit value and the lower limit value.
The constraint processing of the gradient in step S105 can overcome the gradient distortion caused by the quantization error, and the constrained gradient can be substantially close to the actual gradient of the weight in the network model under high accuracy, as shown in the corresponding figure.
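Using the variant that maps an out-of-range value to the nearest limit value, the constraint mapping of formula (12) can be sketched as a simple element-wise clipping operation (the numeric bounds and values below are illustrative assumptions):

```python
import numpy as np

def constrain_gradient(v_hat, c_l, c_u):
    """Formula (12): project the (diagonal of the) weighted-maximum
    second moment into the constraint range [c_l, c_u].  Values already
    inside the range pass through unchanged; values outside are mapped
    to the nearest limit value (one of the optional manners described)."""
    return np.clip(v_hat, c_l, c_u)

v_hat = np.array([0.02, 0.4, 1.7])        # assumed per-weight values
v_tilde = constrain_gradient(v_hat, c_l=0.1, c_u=1.0)
```

Here 0.02 is raised to the lower limit, 1.7 is lowered to the upper limit, and 0.4 is untouched; replacing `np.clip` with a mapping to the mid-point (cl + cu)/2 would implement the averaging variant also mentioned in the text.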
Step S106: The weight is updated using the constrained gradient.
An optional manner is to use formula (6) to update the weights. Since the gradient in this step is the gradient after the constraint processing, formula (6) can be transformed into formula (13).
wt+1 = wt − ηt·mt/√Ṽt    formula (13)
Wherein mt can be calculated according to formula (3); preferably, β1 in formula (3) can be a constant 0.9, or can be calculated and determined according to the calculation accuracy of the neural network based on the manner shown in formula (8).
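Putting the pieces together, one full update step, combining formula (3), formula (7), formula (12), and formula (13), might look as follows. The bound expressions stand in for the unreproduced formulas (10) and (11), and the toy loss is an assumption:

```python
import numpy as np

def constrained_step(w, grad, m, v, v_hat, t, lr=0.001, beta1=0.9,
                     beta2=0.999, quant_error=0.5, eps=1e-8):
    """One sketched update step of the described method: first moment
    (formula (3)), weighted maximum (formula (7)), constraint mapping
    (formula (12)), and weight update (formula (13)).  The bound shapes
    are hypothetical stand-ins for formulas (10) and (11), which are
    not reproduced in this excerpt; eps is a numerical safeguard."""
    m = beta1 * m + (1 - beta1) * grad             # formula (3)
    v = beta2 * v + (1 - beta2) * grad ** 2        # formula (7), recursion
    v_hat = np.maximum(v_hat, v)                   # formula (7), maximum
    c_u = quant_error * (1.0 + beta2 ** t)         # assumed upper bound
    c_l = quant_error * (1.0 - beta2 ** t)         # assumed lower bound
    v_tilde = np.clip(v_hat, c_l, c_u)             # formula (12)
    w = w - lr * m / (np.sqrt(v_tilde) + eps)      # formula (13)
    return w, m, v, v_hat

# Toy loss L = w**2; 500 iterations starting from w = 1.0.
w, m, v, v_hat = np.array([1.0]), np.zeros(1), np.zeros(1), np.zeros(1)
for t in range(1, 501):
    w, m, v, v_hat = constrained_step(w, 2.0 * w, m, v, v_hat, t)
```

As t grows, the assumed bounds tighten around the quantization error, so the denominator of formula (13) stabilizes and the step size settles, which is the behavior the two-stage optimization is meant to achieve.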
Since the gradients of the weights are optimized twice in steps S103 and S105, in this step S106 the weights are updated using the optimized gradients to train the neural network model, so that the performance of the neural network model with arbitrary calculation accuracy can approach the expected performance. To demonstrate the methods of the present disclosure, the following theorems and inferences are provided.
Theorem 4: Supposing that there is a quantization scale factor α, a binary quantization function sign(w), and a quantization scale domain F with a bounded diameter, that is, ∥w̃1 − w̃2∥∞ ≤ D∞ for all w̃1, w̃2 ∈ F, and ∥f(w̃)∥ ≤ G∞ for all t ∈ [T] and w̃ ∈ F. It is supposed that ∥f(w̃1) − f(w̃2)∥ ≤ C(α)·∥w̃1 − w̃2∥ ≤ C(α)·D∞ and ∥C(α)∥ ≤ L∞. For w̃t, with the method of the present disclosure, there may be a constraint solution shown in the following formula (14):
According to the above constraint solution, the following can be inferred:
Inference 1: It is supposed that β1t=β1λt−1 in Theorem 4, wherein λ=β1/√β2; then formula (15) can be obtained.
Wherein the details of the above Theorem 4 and Inference 1 can be found in the corresponding part of the <Theorems and inferences in the exemplary embodiments> in the specification.
Step S107: It is determined whether there is a weight which has not been updated; if yes, step S103 is performed to continue updating other weights; otherwise, this training iteration is ended, and the process proceeds to step S101.
It should be noted that the hyper-parameters of the network model in the first embodiment may be stored in advance, or acquired from the outside through the network, or obtained by local operation, which is not limited in the present disclosure. The hyper-parameters include, but are not limited to, the calculation accuracy of the network model, the learning rate ηt, the decay rates β1 and β2, and the like.
In this embodiment, steps S101 to S107 are repeatedly performed until the training end condition is satisfied. Here, the training end condition may be any condition set in advance, for example, the difference value between the actual output result and the expected output result of the neural network model does not exceed a predetermined threshold, or the number of trainings of the network model reaches a predetermined number of times, or the like.
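As an illustrative sketch (not part of the claimed method), the repetition of steps S101 to S107 with the two example end conditions can be outlined as follows; the scalar toy model, the quadratic loss, and the fixed clipping bound are assumptions standing in for the constraint threshold range of the method:

```python
def train(w, target, lr=0.1, loss_threshold=1e-4, max_iters=1000, bound=10.0):
    """Repeat forward pass, gradient computation, gradient constraint and
    weight update until the output error is small enough or the number of
    iterations reaches a predetermined number of times."""
    for t in range(max_iters):
        y = w                                  # forward propagation (identity model)
        loss = 0.5 * (y - target) ** 2
        if loss <= loss_threshold:             # end condition 1: small error
            break
        g = y - target                         # gradient of the loss w.r.t. w
        g = min(max(g, -bound), bound)         # constrain the gradient magnitude
        w = w - lr * g                         # weight update
    return w

w_final = train(w=0.0, target=1.0)
```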
Through the solution of the first exemplary embodiment of the present disclosure, even in a case where a lower calculation accuracy of the neural network model generates a larger quantization error and thus distorts the gradient of the weight, the distorted gradient is constrained by using the set constraint threshold range, so that the step size obtained from the gradient is appropriate. Referring to
Based on the foregoing first exemplary embodiment, the second exemplary embodiment of the present disclosure describes a network model training system, which includes a terminal, a communication network, and a server. The terminal and the server communicate through the communication network. The server uses the locally stored network model to train the network model stored in the terminal online, so that the terminal can use the trained network model to perform the real-time business. Each part in the training system of the second exemplary embodiment of the present disclosure is described below.
The terminal in the training system may be an embedded image acquisition device such as a security camera or the like, or a device such as a smart phone or PAD, or the like. Of course, the terminal is not limited to terminals with a weak computing capability such as embedded devices, and may also be a terminal with a strong computing capability. The number of terminals in the training system can be determined according to actual needs. For example, if the training system is to train security cameras in the shopping mall, all security cameras in the shopping mall can be considered as terminals. At this time, the number of terminals in the training system is fixed. For another example, if the training system is to train the smartphones of users in the shopping mall, all the smartphones connected to the wireless local area network of the shopping mall can be regarded as terminals. At this time, the number of terminals in the training system is not fixed. In the second exemplary embodiment of the present disclosure, the type and number of terminals in the training system are not limited, as long as the network model can be stored and trained in the terminal.
The server in the training system can be a high-performance server with strong computing capability, such as a cloud server. The number of servers in the training system can be determined according to the number of terminals served by the servers. For example, if the number of terminals to be trained in the training system is small or the geographical range of the terminal distribution is small, the number of servers in the training system is small, such as only one server. If the number of terminals to be trained in the training system is large or the geographical range of the terminal distribution is large, then the number of servers in the training system is large, such as establishing a server cluster. In the second exemplary embodiment of the present disclosure, the type and number of servers in the training system are not limited, as long as the server can store at least one network model and provide information for training the network model stored in the terminal.
The communication network in the second exemplary embodiment of the present disclosure is a wireless network or a wired network for implementing information transfer between the terminal and the server. Any network currently available for uplink/downlink transmission between the network server and the terminal can be used as the communication network in this embodiment, and the second exemplary embodiment of the present disclosure does not limit the type and communication manner of the communication network. Of course, the second exemplary embodiment of the present disclosure is not limited thereto, and other communication manners may also be used. For example, a third-party storage area is allocated for this training system. When the terminal and the server want to transfer information to each other, the information to be transferred is stored in the third-party storage area. The terminal and the server periodically read the information in the third-party storage area to realize the information transfer between them.
The online training process of the training system of the second exemplary embodiment of the present disclosure will be described in detail with reference to
Step S201: The terminal initiates a training request to the server via the communication network.
The terminal initiates a training request to the server through the communication network, and the request includes information such as the terminal identification or the like. The terminal identification is information that uniquely indicates the identity of the terminal (for example, ID or IP address of the terminal, or the like).
This step S201 is described by taking one terminal initiating a training request as an example. Of course, multiple terminals may initiate a training request in parallel. The processing procedure for multiple terminals is similar to the processing procedure for one terminal, and will be omitted here.
Step S202: The server receives the training request.
The training system shown in
Therefore, the communication network can transfer a training request initiated by the terminal to the server. If the training system includes multiple servers, the training request can be transferred to a relatively idle server according to the idle status of the server.
Step S203: The server responds to the received training request.
The server determines the terminal that initiated the request according to the terminal identification included in the received training request, and thus determines the network model to be trained stored in the terminal. One optional manner is that the server determines the network model to be trained stored in the terminal that initiated the request according to the comparison table between the terminal and the network model to be trained; another optional manner is that the training request includes information on the network model to be trained, and the server can determine the network model to be trained based on the information. Here, determining the network model to be trained includes, but is not limited to, determining information representing the network model such as a network architecture, a hyper-parameter or the like of the network model.
After the server determines the network model to be trained, the method of the first exemplary embodiment of the present disclosure may be used to train the network model stored in the terminal that initiated the request using the same network model stored locally by the server. Specifically, the server updates the weights in the network model locally according to the method in step S101 to step S106 in the first exemplary embodiment, and transfers the updated weights to the terminal, so that the terminal synchronizes the network model to be trained stored in the terminal according to the updated weights that have been received. Here, the network model in the server and the network model trained in the terminal may be the same network model, or the network model in the server may be more complicated than the network model in the terminal, but the outputs of the two are close. The disclosure does not limit the types of the network model used for training in the server and the trained network model in the terminal, as long as the updated weights output from the server can synchronize the network model in the terminal to make the output of the synchronized network model in the terminal closer to the expected output.
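A minimal sketch of this server-terminal interaction, assuming a model represented as a dict of named scalar weights and plain method calls in place of the communication network (all names here are assumptions for illustration):

```python
class Server:
    """Holds the same network model locally and computes the weight updates
    (steps S101 to S106 of the first exemplary embodiment)."""
    def __init__(self, weights):
        self.weights = dict(weights)

    def train_step(self, grads, lr=0.1):
        # Update each weight with its (already constrained) gradient.
        for name, g in grads.items():
            self.weights[name] -= lr * g
        return dict(self.weights)       # updated weights to transfer

class Terminal:
    """Stores the deployed network model and synchronizes it according to
    the updated weights received from the server."""
    def __init__(self, weights):
        self.weights = dict(weights)

    def synchronize(self, updated):
        self.weights.update(updated)

server = Server({"conv1": 0.5})
terminal = Terminal({"conv1": 0.5})
terminal.synchronize(server.train_step({"conv1": 1.0}))
```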
In the training system shown in
Through the training system described in the second exemplary embodiment of the present disclosure, the server can perform online training on the network model in the terminal, which improves flexibility of the training, meanwhile greatly enhances the business processing capability of the terminal and expands the business processing scenario of the terminal. The above second exemplary embodiment describes the training system by taking the online training as an example, but the present disclosure is also applicable to the offline training process, which will not be repeated here.
The third exemplary embodiment of the present disclosure describes a training apparatus for a neural network model, and the apparatus can execute the training method described in the first exemplary embodiment. Moreover, the apparatus, when applied to an online training system, may be an apparatus in the server described in the second exemplary embodiment. The software structure of the apparatus will be described in detail below with reference to
The training apparatus in the third embodiment includes a gradient determination unit 11, a gradient constraint unit 12, and an update unit 13, wherein the gradient determination unit 11 is used to determine gradients of weights in the network model during a back propagation; the gradient constraint unit 12 is used to perform, for at least one of the gradients determined by the gradient determination unit 11, the following processing: determining whether the gradient is within a constraint threshold range, and constraining a gradient exceeding the constraint threshold range to be a value within the constraint threshold range, wherein the constraint threshold range is determined according to the number of training iterations and calculation accuracy of the neural network model; and the update unit 13 is used to update weights by using constrained gradients.
Preferably, the gradient constraint unit 12 is further used to: determine a quantization error of each weight in the neural network model, and use the maximum quantization error as a quantization error of the neural network model; and, use the quantization error and the number of training iterations of the neural network model to determine a constraint threshold range, wherein the determined constraint threshold range constrains the at least one gradient. That is, a common constraint threshold range is set for the entire neural network model, which is used to constrain all gradients to be constrained.
Preferably, the gradient constraint unit 12 is further used to: determine, for at least one weight in the network model, a quantization error of the weight; and determine a constraint threshold range by using the quantization error of the weight and the number of training iterations, wherein the determined constraint threshold range constrains the gradient of the weight. In other words, a respectively independent constraint threshold range is set for each weight, which is only used to constrain the gradient of the corresponding weight.
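The two manners (a common range for the whole model versus an independent range per weight) can be contrasted in a short sketch; the absolute-difference error measure and the mapping from quantization error and iteration count to a range are assumptions for illustration, since the concrete formulas are given elsewhere in the disclosure:

```python
def quantization_error(w, w_quantized):
    """Per-weight quantization error (here simply the absolute difference)."""
    return [abs(a - b) for a, b in zip(w, w_quantized)]

def threshold_range(q_err, t, base=1.0):
    """Assumed mapping: the range widens with the quantization error and
    tightens as the number of training iterations t grows."""
    width = base * (1.0 + q_err) / (t ** 0.5)
    return -width, width

w = [0.8, -0.3, 1.2]
w_b = [1.0 if x >= 0 else -1.0 for x in w]   # binary quantization sign(w)
errs = quantization_error(w, w_b)            # per-weight errors

# Manner 1: one common range from the maximum error of the whole model
lo, hi = threshold_range(max(errs), t=100)

# Manner 2: an independent range per weight
ranges = [threshold_range(e, t=100) for e in errs]
```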
The training apparatus further includes a gradient update unit 14 which is used to determine, for at least one weight in the network model, a weighted maximum value from the gradient determined for the weight in this training and the constrained gradients of the weight in the previous multiple trainings; the gradient constraint unit 12 is used to determine whether the weighted maximum value is within a constraint threshold range and constrain a weighted maximum value that exceeds the constraint threshold range to be a value within the constraint threshold range.
The training apparatus of this embodiment also has a module that implements the functions (such as a function of identifying received data, a data encapsulation function, a network communication function and so on, which are not described herein again) of a server in the training system.
The training apparatus of the third exemplary embodiment of the present disclosure may be operated in the structure shown in
The network model storage unit 20 stores hyper-parameters of the network model to be trained described in the first exemplary embodiment of the present disclosure, including but not limited to, structural information of the network model, and information required in operation performed in each layer (such as the calculation accuracy of the network model, the learning rates ηt, β1 and β2, etc.). The feature map storage unit 21 stores feature map information required by each layer in the network model during operation.
In the forward propagation, the convolution unit 22 is used to perform convolution processing on the data set according to the information input by the network model storage unit 20 and the information input by the feature map storage unit 21 (such as the input feature map of the i-th layer). In the back propagation, according to the method of the first embodiment, the constraint threshold range for the constraint is determined according to the calculation accuracy of the weights in the convolutional layer and the number of trainings, and the gradient of the weights in the convolutional layer is constrained. The weights in the convolutional layer are updated by using the constrained gradient.
Other units such as the pooling/activation unit 23, the quantization unit 24 and so on are not necessary units to implement the present disclosure.
The control unit 25 controls the operations of the network model storage unit 20 to the quantization unit 24 by outputting control signals to each unit in
The environment to which the training apparatus for the neural network model in the third exemplary embodiment of the present disclosure is applied is described below with reference to
The processor 30 may be a CPU or a GPU, which is used to perform overall control on the training apparatus. The internal memory 31 includes a random access memory (RAM), a read-only memory (ROM), and the like. The RAM can be used as a main memory, a work area, and the like of the processor 30. The ROM may be used to store a control program of the processor 30, and may also be used to store files or other data to be used when the control program is operated. The network interface 32 may be connected to a network and implement network communication. The input unit 33 controls input from devices such as a keyboard, a mouse, and the like. The external memory 34 stores a startup program, various applications, and the like. The bus 35 is used to connect the above components.
After the training of the neural network model by using the solution of the first exemplary embodiment of the present disclosure is implemented, the application business may be executed using the trained network model. Taking a case where a network model trained according to the manner in the first exemplary embodiment is stored in a security camera as an example, it is assumed that the security camera is to execute a target detection application. After the security camera takes a picture as a data set, the taken picture is input into the network model, so that the picture is calculated in each layer from the top to the bottom in the network model, and the target detection result is output. The present disclosure is also not limited to further performing post-processing on the output results, such as data classification and the like.
Corresponding to the application method described herein, the present disclosure further describes an application apparatus of a neural network model, the application apparatus includes: a storage module for storing a trained network model; a receiving module for receiving a data set which is required to correspond to a task which can be executed by the stored network model; and a processing module for performing operations on the data set from top to bottom in each layer in the stored network model and outputting a result.
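A minimal sketch of such an application apparatus, assuming the stored model is a list of layer weight matrices and each layer performs a linear operation followed by a ReLU activation (both assumptions for illustration, not the claimed structure):

```python
class ApplicationApparatus:
    def __init__(self, layers):
        self.layers = layers            # storage module: trained model

    def receive(self, data):
        self.data = list(data)          # receiving module: input data set

    def process(self):
        """Processing module: operate on the data in each layer from top to
        bottom and output the result."""
        x = self.data
        for weight_matrix in self.layers:
            # linear layer followed by ReLU activation (assumed layer ops)
            x = [max(sum(w * v for w, v in zip(row, x)), 0.0)
                 for row in weight_matrix]
        return x

app = ApplicationApparatus([[[2.0, 0.0], [0.0, 2.0]]])  # one 2x2 layer
app.receive([1.0, -1.0])
out = app.process()
```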
Embodiment(s) of the present disclosure can also be realized by a computer of a system or apparatus that reads out and executes computer executable instructions (e.g., one or more programs) recorded on a storage medium (which may also be referred to more fully as a ‘non-transitory computer-readable storage medium’) to perform the functions of one or more of the above-described embodiment(s) and/or that includes one or more circuits (e.g., application specific integrated circuit (ASIC)) for performing the functions of one or more of the above-described embodiment(s), and by a method performed by the computer of the system or apparatus by, for example, reading out and executing the computer executable instructions from the storage medium to perform the functions of one or more of the above-described embodiment(s) and/or controlling the one or more circuits to perform the functions of one or more of the above-described embodiment(s). The computer may comprise one or more processors (e.g., central processing unit (CPU), micro processing unit (MPU)) and may include a network of separate computers or separate processors to read out and execute the computer executable instructions. The computer executable instructions may be provided to the computer, for example, from a network or the storage medium. The storage medium may include, for example, one or more of a hard disk, a random-access memory (RAM), a read only memory (ROM), a storage of distributed computing systems, an optical disk (such as a compact disc (CD), digital versatile disc (DVD), or Blu-ray Disc (BD)™), a flash memory device, a memory card, and the like.
Embodiments of the present disclosure can also be realized by the following method, that is, a software (program) that executes the functions of the above-described embodiments is supplied to a system or apparatus through a network or various storage medium, a computer of the system or apparatus, or a central processing unit (CPU) and a micro processing unit (MPU), reads and executes the program.
While the present disclosure has been described with reference to exemplary embodiments, it is to be understood that the disclosure is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.
Lemmas needed for the proof:
Lemma 1. For any wt∈Rd and convex feasible domain F⊂Rd, it is assumed that c1=min{tilde over (w)}∥wt−{tilde over (w)}∥.
Proof. Because c1=min{tilde over (w)}∥wt−{tilde over (w)}∥, formula (16) can be obtained:
c1=∥wt−αt sign(wt)∥  formula (16)
By rearranging, given t∈N, it is assumed that αt<0, and formula (17) can be obtained:
According to the projection property {tilde over (w)}t∈F and the convex functional F, formula (18) can be obtained:
It is easy to know that formula (18) does not hold under the assumption αt<0; therefore, formula (18) is true only when αt>0.
Lemma 2. It is supposed that vt=β2vt−1+(1−β2) gt2 and v0=0 and 0≤β2<1. Given ∥gt∥<G∞, formula (19) can be obtained:
If β2=0 and vt=gt2, the above assumptions are satisfied. In addition, if 0<β2<1, formula (20) can be obtained:
∥vt∥≤β2∥vt−1∥+(1−β2)∥gt2∥≤β2∥vt−1∥+(1−β2)G∞2. formula (20)
Formula (20) comes from the gradient constraint ∥gt∥<G∞. At the same time, because v0=0, the formula (21) can be obtained:
Formula (21) is summed, wherein t=[1,T], and formula (22) can be obtained:
Formula (22) comes from the constraints of formula (23):
Lemma 3. For the parameter settings and conditions assumed in Theorem 3, formula (24) can be obtained:
Referring to the proof of formula (24) in formula (25).
According to β1t≤β1<1, the first inequality in formula (25) can be proved. According to the definition of the maximum value in αT,ivT,i before the current iteration step and Algorithm 1 described later, the second inequality in formula (25) can be proved. According to the Cauchy-Schwarz inequality, the third inequality in formula (25) can be proved. According to β1k<β1 (wherein k∈[T]) and Σj=1Tβ1j<1/(1−β1), the fourth inequality in formula (25) can be proved. Formula (26) can be further obtained:
Because formula (27) holds, the last inequality in formula (26) also holds.
Consider the following settings: ft is a linear function containing implicit quantized weights, and the domain of definition of weights is F=[−1,1]. Specifically, we consider the sequence of functions shown in the following formula (28):
Wherein {tilde over (w)}t=αtwb is the implicit quantization weight, the purpose of which is to minimize the quantization error, that is, to obtain min{tilde over (w)}
Wherein, for C∈N, the following formula (30) is satisfied:
Since the solving problem is a one-dimensional problem, ⊙ can be omitted in order to simplify the representation. At the same time, the coordinate index can be further omitted.
For formula (29), it is not difficult to see that when {tilde over (w)}=−1, the smallest regret is provided.
Since the ADAM processing is performed on formula (29), formula (31) can be obtained:
F has an L∞ domain limitation. At this time, all parameter settings satisfy the Adam algorithm.
In order to provide a proof of the theorem, an arbitrary learning rate is supposed and there is {tilde over (w)}t>0, wherein ∀t∈N; meanwhile, there further is {tilde over (w)}Ct+3=1, wherein ∀C∈N∪{0}. The subscripts of the parameters can be rearranged. For any step size C, assuming t∈N, there can be {tilde over (w)}Ct+1≥0. Our goal is to prove {tilde over (w)}Ct+i>0, wherein ∀i∈N∩[2, C+1]. It is not difficult to see that when {tilde over (w)}Ct+i>1 holds, the above assumption holds. Further assuming that {tilde over (w)}Ct+1≤1, because wb=sign(w) and {tilde over (w)}t>0 hold, the corresponding gradient is observed as shown in formula (32):
For (Ct+1) times of update of the ADAM algorithm, formula (33) can be obtained:
When β2vCt>0, where ∀t∈N holds, formula (34) can be obtained:
When the above holds, the second inequality of the above formula (34) holds.
Therefore, 0<ŵCt+2<1 and {tilde over (w)}Ct+2=ŵCt+2 can be obtained.
In order to complete the above proof, we need to prove that {tilde over (w)}Ct+3=1. In order to prove the theorem, when ŵCt+3≥1, we can get: when {tilde over (w)}Ct+3=ΠF(ŵCt+3) and F=[−1,1] (wherein F is a simple Euclidean mapping) hold, {tilde over (w)}Ct+3=1 holds.
Consider the following situation again: if 0<{tilde over (w)}Ct+2<1, when the (Ct+2)-th update is performed, formula (35) can be obtained:
Since ŵCt+2={tilde over (w)}Ct+2 holds, the second equation of formula (35) also holds. In order to prove that {tilde over (w)}Ct+3≥1, we need to prove the following formula (36):
Formula (37) can be obtained by rearranging formula (36):
The last inequality in formula (37) comes from the constraints of formula (38):
Because of Lemma 1 and F=[−1,1], the second inequality in the above formula (38) holds.
Further, when {tilde over (w)}i≥0 and i mod C is not equal to 1 or 2, the gradient will be equal to 0, so formula (39) can be obtained:
Therefore, given w1=1, formula (39) holds for ∀t∈N, so formula (40) can be obtained:
Wherein k∈N. Therefore, for each C steps, the regret of the ADAM algorithm is C. Therefore, RT/T does not converge to 0 when T→∞.
Proof of Theorem 1 is complete.
Proof of Theorem 2.
Theorem 2 generalizes the optimal setting of Theorem 1. Specifically, we can construct a binary optimization algorithm. We define a more general case and design a constant deviation ε during the update process of the ADAM algorithm; see formula (41):
ŵt+1=ŵt−ηtmt/{square root over (Vt+εI)}. formula (41)
We consider the following settings: ft is a linear function containing implicit quantized weights, and the domain of definition of weights F=[−1,1]. Specifically, we consider the sequence of functions shown in the following formula (42):
Wherein, the constant C∈N satisfies formula (43)
Wherein, C is a constant based on β1, β2, and α.
If mkC≤0 (wherein ∀k∈N∪{0}) holds, in a more general case, mkC+C is as shown in formula (44):
mkC+C=−⅓(1−β1)β1C−2+2(1−β1)β1C−1+β1CmkC. formula (44)
If mkC<0, mkC+C<0 is still satisfied.
At time iteration t, formula (45) can be obtained:
xi+C≥min{xi+ct, 1}  formula (45)
When ct>0, for a sequence function containing implicit quantized weights, formula (46) can be obtained:
Wherein, i∈{1, . . . , C}. If δt+j≥0 for j∈{1, . . . , C−1}, then δt+s≥0 (wherein ∀s∈{j, . . . , C−1}) holds. Using known lemmas, formula (47) can be obtained:
Letting that i′=C/2 holds. In order to prove the above formula (45), we need to further prove the following formula (48):
Finally, formula (49) can be obtained:
According to Lemma 2 and the following formula (50), wherein i′≤i≤C, the following formula (51) can be obtained.
When t≥T′, for every C steps, RT/T does not converge to 0 when T→∞.
Proof of Theorem 2 is complete.
Proof of Theorem 3.
Setting ξ as an arbitrarily small positive real number. Considering the one-dimensional random convex optimization setting in the domain [−1,1], for each iteration number t, the gradient of ft(w) is shown in formula (52):
Wherein, C is a constant based on β1, β2, ξ and α*. The expectation function is F(w)=ξw. Therefore, the optimal solution in the domain [−1,1] is w*=−1.
Therefore, the step size is updated by Adam to be formula (53):
There is a real number C that is large enough so that the lower limit of the average value E[Δt] of Δt in the formula (53) is as shown in formula (54):
Wherein C is a function, which is determined by β1, β2, ξ and α*.
The result of Theorem 4 is used to prove the validity of Algorithm 1, wherein the update step of Algorithm 1 is: {tilde over (w)}t+1=ΠF({tilde over (w)}t−ηtmt/{square root over ({tilde over (V)}t)})
The above Algorithm 1 is described by taking a binary neural network model as an example, but the present disclosure is also applicable to other types of neural network models.
The following proof of Theorem 4 establishes the convergence of Algorithm 1.
Letting that w*=argminΣt=1Tft({tilde over (w)}) holds, wherein w* exists in the case where F is a closed convex solution.
Formula (55) is known:
Wherein, using Lemma 4 with u1={tilde over (w)}t+1 and u2=w*, formula (56) can be obtained:
Rearranging the above formula (56), formula (57) can be obtained:
The second inequality in formula (57) can be proved by the Cauchy-Schwarz inequality and Young's inequality. We use the convexity of the function ft to restrict the regret at each step to obtain the following formula (58):
By using Lemma 3, we can get formula (59):
Since β1t≤β1<1, according to the above formula (59), formula (60) can be obtained.
Based on the constraint on Υt, formula (61) can be obtained:
In the feasible domain of the function, using L∞ and all the above formulas, formula (62) can be obtained:
According to the following formula (63):
{tilde over (w)}t=αt sign(w); s.t. sign(w)∈Sd  formula (63)
The following formula (64) can be obtained.
Number | Date | Country | Kind |
---|---|---|---
201910807591.6 | Aug 2019 | CN | national |
201911115314.5 | Nov 2019 | CN | national |