The present disclosure relates to image processing, and in particular to a method, an apparatus, a system, a storage medium and an application for generating a quantized neural network, for example.
At present, deep neural networks (DNNs) are widely used in various tasks. With the increase in the number of parameters in these networks, the resource load has become an issue when applying DNNs to practical industrial applications. In order to reduce the storage and computing resources needed in practical applications, quantizing neural networks has become a conventional means.
In the process of quantizing neural networks (i.e., in the process of generating quantized neural networks), an issue that gradients do not match (i.e., loss of gradient information) is caused since a large number of non-differentiable functions (e.g., the operation of taking a sign (the sign function)) are usually used, thereby affecting the performance of the generated quantized neural networks. For the problem that the gradients do not match, the non-patent literature, Mixed Precision DNNs: All you need is a good parameterization (Stefan Uhlich, Lukas Mauch, Kazuki Yoshiyama, Fabien Cardinaux, Javier Alonso Garcia, Stephen Tiedemann, Thomas Kemp, Akira Nakamura; ICLR 2020), proposes an exemplary method. This non-patent literature discloses an approximately differentiable neural network quantizing method. In the process of quantizing the floating-point weights of the neural network to be quantized using the sign function and a straight-through estimator (STE), this exemplary method introduces auxiliary parameters obtained based on the precision of the neural network to be quantized, and uses the auxiliary parameters to smooth the variance of the backward gradient, corresponding to the quantized weight, estimated by the STE, thereby achieving the purpose of correcting the gradient.
As can be seen from the above, the above-mentioned exemplary method still needs to use the non-differentiable function, and only alleviates the issue that the gradients do not match in the neural network quantizing process by introducing the auxiliary parameters. Since the issue that the gradients do not match, that is, the issue of loss of gradient information, still exists in the neural network quantizing process, the performance of the generated quantized neural network will still be affected.
In view of the above description of the Related Art, the present disclosure is directed to solving at least one of the above issues.
According to an aspect of the present disclosure, there is provided a method of generating a quantized neural network, the method comprising: determining, based on floating-point weights in a neural network to be quantized, networks which correspond to the floating-point weights and are used for directly outputting quantized weights, respectively; quantizing, using the determined network, the floating-point weight corresponding to the network to obtain the quantized neural network; and updating, based on a loss function value obtained via the quantized neural network, the determined network, the floating-point weight and the quantized weight in the quantized neural network.
According to a further aspect of the present disclosure, there is provided a system for generating a quantized neural network, the system comprising: a first embedded device that determines, based on floating-point weights in a neural network to be quantized, networks which correspond to the floating-point weights and are used for directly outputting quantized weights, respectively; a second embedded device that quantizes, using the network determined by the first embedded device, the floating-point weight corresponding to the network to obtain the quantized neural network; and a server that calculates a loss function value via the quantized neural network obtained by the second embedded device, and updates the determined network, the floating-point weight and the quantized weight in the quantized neural network based on the loss function value obtained by calculation, wherein the first embedded device, the second embedded device and the server are connected to each other via a network.
Wherein, in the present disclosure, one floating-point weight in the neural network to be quantized corresponds to one network for directly outputting the quantized weight. In the present disclosure, the network for directly outputting the quantized weight can be referred to as a meta-network, for example. Wherein, in the present disclosure, one meta-network includes: a module for convolving floating-point weights; and a first objective function for constraining an output of the module for convolving the floating-point weights. Wherein, for one floating-point weight in the neural network to be quantized and the meta-network corresponding to the floating-point weight, the first objective function in the meta-network preferentially causes, based on a priority of the elements in the floating-point weight, those elements in the output of the module for convolving the floating-point weights that can reduce the loss of the objective task to tend toward the quantized weight.
According to another further aspect of the present disclosure, there is provided a method of applying a quantized neural network, the method comprising: loading a quantized neural network; inputting, to the quantized neural network, a data set corresponding to a task which can be executed by the quantized neural network; performing operations on the data set in each layer of the quantized neural network from top to bottom; and outputting a result. Wherein, the loaded quantized neural network is a quantized neural network obtained according to the method of generating the quantized neural network.
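As a non-authoritative illustration of this application flow, the following Python sketch (the file name, the input shape and the use of PyTorch are assumptions of this example, not part of the disclosure) loads a stored quantized neural network, feeds it a data set corresponding to its task, and outputs the result:

```python
import torch

# Minimal sketch of applying a stored quantized neural network; the file name
# "quantized_net.pt" and the input shape are hypothetical placeholders.
quantized_net = torch.load("quantized_net.pt")  # load the quantized neural network
quantized_net.eval()

# A data set (here a single dummy batch) corresponding to a task the
# quantized neural network can execute, e.g. object classification.
batch = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    result = quantized_net(batch)  # each layer operates on the data from top to bottom
print(result)                      # output the result
```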
As can be known from the above, in the process of quantizing the neural network, the present disclosure uses a meta-network capable of directly outputting the quantized weight to replace the sign function and the STE needed in the conventional method, and generates the quantized neural network in a manner of training the meta-network and the neural network to be quantized cooperatively, thereby achieving the purpose of not losing information. Therefore, according to the present disclosure, the issue that the gradients do not match in the neural network quantizing process can be solved, thereby improving the performance of the generated quantized neural network.
Further features and advantages of the present disclosure will become apparent from the following description of typical embodiments with reference to the attached drawings.
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate embodiments of the present disclosure and, together with the description of the embodiments, serve to explain the principles of the present disclosure.
Exemplary embodiments of the present disclosure will be described in detail below with reference to the drawings. It should be noted that the following description is illustrative and exemplary in nature and is in no way intended to limit the disclosure, its application or uses. The relative arrangement of the components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure unless it is specifically stated otherwise. In addition, techniques, methods and devices known to persons skilled in the art may not be discussed in detail; however, they shall be regarded as a part of the present specification where appropriate.
It is noted that, similar reference numbers and letters refer to similar items in the drawings, and thus once an item is defined in one figure, it may not be discussed in the following figures. The present disclosure will be described in detail below with reference to the drawings.
(Hardware Configuration)
At first, the hardware configuration capable of implementing the technique described below will be described with reference to
The hardware configuration 100 includes for example a central processing unit (CPU) 110, a random access memory (RAM) 120, a read only memory (ROM) 130, a hard disk 140, an input device 150, an output device 160, a network interface 170 and a system bus 180. In one implementation, the hardware configuration 100 can be implemented by a computer such as a tablet computer, a laptop, a desktop or other suitable electronic devices.
In one implementation, an apparatus for generating a quantized neural network according to the present disclosure is configured by hardware or firmware, and serves as a module or a component of the hardware configuration 100. For example, an apparatus 500 for generating a quantized neural network that will be described in detail below with reference to
The CPU 110 is any suitable programmable control device (e.g. a processor) and can execute various functions to be described below by executing various application programs stored in the ROM 130 or the hard disk 140 (e.g. a memory). The RAM 120 is used for temporarily storing programs or data loaded from the ROM 130 or the hard disk 140, and is also used as a space in which the CPU 110 executes various procedures (e.g. implementing the technique to be described in detail below with reference to
In one implementation, the input device 150 is used for allowing a user to interact with the hardware configuration 100. In one example, the user can input for example neural networks to be quantized, specific task processing information (e.g. object detection task), etc., via the input device 150, wherein the neural networks to be quantized include for example various weights (e.g. floating-point weights). In another example, the user can trigger the corresponding processing of the present disclosure via the input device 150. Further, the input device 150 can adopt a plurality of forms, such as a button, a keyboard or a touch screen.
In one implementation, the output device 160 is used for storing the finally generated and obtained quantized neural network in the hard disk 140 for example, or is used for outputting the finally generated quantized neural network to specific task processing such as object detection, object classification, image segmentation, etc.
The network interface 170 provides an interface for connecting the hardware configuration 100 to a network. For example, the hardware configuration 100 can perform data communication with other electronic devices that are connected by a network via the network interface 170. Alternatively, the hardware configuration 100 may be provided with a wireless interface to perform wireless data communication. The system bus 180 can provide a data transmission path for mutually transmitting data among the CPU 110, the RAM 120, the ROM 130, the hard disk 140, the input device 150, the output device 160, the network interface 170, etc. Although being referred to as a bus, the system bus 180 is not limited to any specific data transmission technique.
The above hardware configuration 100 is only illustrative and is in no way intended to limit the present disclosure, its application or uses. Moreover, for the sake of simplification, only one hardware configuration is illustrated in
(Meta-Network)
In order to avoid using the sign function and the STE, which cause loss of information (i.e., gradient mismatch), in the process of quantizing floating-point weights in the neural network to be quantized, the inventors consider that the sign function and the STE can be replaced by designing, for each floating-point weight, one corresponding meta-network capable of directly outputting the quantized weight, thereby achieving the purpose of losing no information. In addition, in the process of quantizing the floating-point weights in the neural network to be quantized, not all floating-point weights are in fact equally important. For example, since the performance of the generated quantized neural network will be greatly affected even if information is lost only slightly in the process of quantizing a floating-point weight with a high importance degree, it is necessary to ensure that its quantized weight tends more closely to “+1” or “−1” when a floating-point weight with a high importance degree is quantized. In the process of quantizing a floating-point weight with a low importance degree, the performance of the generated quantized neural network will not be affected even if information is lost slightly; moreover, the purpose of quantizing the floating-point weights is to obtain a quantized neural network with the best performance, rather than making the quantized weights of all floating-point weights tend to “+1” or “−1”, such that it is unnecessary for its quantized weight to tend accurately to “+1” or “−1” when a floating-point weight with a low importance degree is quantized.
Wherein, in the present disclosure, the floating-point weight with a high importance degree can be further defined by the following mathematical assumption. It is assumed that all vectors v belong to an n-dimensional real-number set Rn and each have a k-sparse representation, and meanwhile, there is a minimal ε (which belongs to (0, 1)) and an optimal quantized weight w*q. Wherein, as the task objective function is applied to the specific task in the process of updating and optimizing the quantized neural network, the updating and optimizing process can have the attributes expressed by the following formulas (1) and (2):
Therefore, the inventors deem that, in order to help generate the quantized weight with a higher accuracy, for one floating-point weight in the neural network to be quantized, the meta-network capable of directly outputting its quantized weight can be designed to have the structure as shown in
Hereinafter, explanation is given by taking a floating-point weight w in the neural network to be quantized as an example, wherein the matrix shape of the floating-point weight is, for example, [a width of a convolution kernel, a height of the convolution kernel, a number of input channels, a number of output channels]. In one implementation, the first module 211 can be used as a coding function module for converting the floating-point weight w into a high dimension. Specifically, in order to convert the floating-point weight w into a high-dimension structure so as to generate features with more distinctiveness for the objective task, the input shape size of the coding function module can be set to be the same as the matrix shape size of the floating-point weight w, and the number of output channels of the coding function module can be set to be greater than or equal to four times the square of the size of the convolution kernel of the floating-point weight w, wherein the square of the size of the convolution kernel of the floating-point weight w is the product of the “width of the convolution kernel” and the “height of the convolution kernel”.
The third module 213 can be used as a compressing function module for analyzing the principal components of the output result of the coding function module, and for compressing and extracting the principal components. Specifically, in order to extract the principal components of the converted high-dimension structure so as to screen out the priority of each element, the input shape size of the compressing function module can be set to be the same as the output shape size of the coding function module, and the number of output channels of the compressing function module can be set to be greater than or equal to twice the size of the convolution kernel of the floating-point weight, while meanwhile being less than or equal to half of the number of output channels of the coding function module.
The second module 212 can be used as a decoding function module for activating and decoding an output result of the coding function module or the compressing function module. Specifically, in order to restore the dimension of the floating-point weight w to generate the quantized weight, the input shape size of the decoding function module can be set to be the same as the output shape size of the coding function module or the compressing function module, and the number of output channels of the decoding function module can be set to be the same as the matrix shape size of the floating-point weight.
The first objective function 220 can be used as a quantized objective function for constraining an output result of the decoding function module to obtain a quantized weight wq of the floating-point weight w. Wherein, in order to derive the quantized objective function, the following assumption can be defined in the present disclosure:
Assume that there is a function F(w), and meanwhile a function tanh(F(w)) is formed, such that the gradient of the hyperbolic tangent function tanh(F(w)) with respect to w can be expressed as the following formulas (3) and (4):
In the above formula (4), “w.r.t.” indicates that formula (4) is an extension of formula (3), and ∇ indicates taking a gradient of the function tanh(F(w)).
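Formulas (3) and (4) themselves are not reproduced here; as a hedged reading based only on the standard chain rule (the exact notation of the original formulas may differ), the gradient of tanh(F(w)) with respect to w would take the form:

```latex
\nabla_{w} \tanh\bigl(F(w)\bigr) \;=\; \bigl(1 - \tanh^{2}\bigl(F(w)\bigr)\bigr)\,\nabla_{w} F(w)
```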
Specifically, in the present disclosure, the quantized objective function can be for example defined as the following formula (5):
In the above formula (5), b indicates a quantized reference vector, which functions to constrain the output result of the decoding function module to tend toward the quantized weight wq; w*q indicates an optimal quantized weight obtained after optimization and constraining, wherein wq and w*q are vectors belonging to an mn-dimensional real-number set; m and n indicate the number of input channels and the number of output channels of the quantized weight; ∥wq∥ indicates an L1 norm operator, which functions to identify the priority of each element in the floating-point weight w by the sparsity rule, wherein any operator capable of identifying the priority of each element in the floating-point weight w can be used.
Further, in the present disclosure, the coding function module (i.e., the first module 211), the compressing function module (i.e., the third module 213) and the decoding function module (i.e., the second module 212) can each consist of at least one neural network layer (e.g., a full-connection layer). Wherein, the number of neural network layers constituting each function module can be decided according to the accuracy of the quantized neural network that needs to be generated. Taking the case where the module 210 for convolving the floating-point weights includes the coding function module, the compressing function module and the decoding function module at the same time as an example, in one implementation, the coding function module consists of a full-connection layer 410, the compressing function module consists of a full-connection layer 420, and the decoding function module consists of a full-connection layer 430, for example as shown in
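As a non-limiting illustration, the following Python sketch (PyTorch is an assumption of this example; the concrete layer widths merely follow the constraints described above, and the tanh output follows the tanh(F(w)) formulation introduced above) builds one such meta-network from three full-connection layers:

```python
import torch
import torch.nn as nn

class MetaNetwork(nn.Module):
    """Sketch of a meta-network for one floating-point weight whose matrix shape is
    [kernel_w, kernel_h, in_channels, out_channels]; the layer widths follow the
    constraints described above and are otherwise illustrative assumptions."""

    def __init__(self, kernel_w, kernel_h):
        super().__init__()
        k2 = kernel_w * kernel_h        # the "square" of the convolution kernel size
        enc_out = 4 * k2                # coding module: at least four times k2
        cmp_out = enc_out // 2          # compressing module: >= 2x kernel size, <= half of enc_out
        self.coding = nn.Linear(k2, enc_out)            # e.g. full-connection layer 410
        self.compressing = nn.Linear(enc_out, cmp_out)  # e.g. full-connection layer 420
        self.decoding = nn.Linear(cmp_out, k2)          # e.g. full-connection layer 430

    def forward(self, w_2d):
        # w_2d: the floating-point weight reshaped/transposed to
        # [in_channels * out_channels, kernel_w * kernel_h] (an assumed orientation)
        x = self.coding(w_2d)
        x = self.compressing(x)
        x = self.decoding(x)
        return torch.tanh(x)            # output constrained toward the quantized weight
```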
(Apparatus and Method for Generating a Quantized Neural Network)
Next, taking implementation by one hardware configuration as an example, generation of the quantized neural network according to the present disclosure will be described with reference to
First, for example, the input device 150 shown in
Then, as shown in
The quantization unit 520, using the meta-network determined by the determination unit 510, quantizes the floating-point weight corresponding to the meta-network, so as to obtain the quantized neural network. That is to say, the quantization unit 520 quantizes each floating-point weight using the meta-network corresponding to the floating-point weight, so as to obtain the corresponding quantized weight. After all floating-point weights are quantized, the corresponding quantized neural network can be obtained.
The update unit 530 updates the meta-network determined by the determination unit 510, the floating-point weight in the neural network to be quantized and the quantized weight in the quantized neural network based on the loss function value obtained via the quantized neural network.
In addition, the update unit 530 further judges whether the quantized neural network after being updated satisfies a predetermined condition, e.g. the total number of updates (for example, T times) has already been completed or the predetermined performance has already been achieved (e.g. the loss function value tends to a constant value). If the quantized neural network does not satisfy the predetermined condition yet, the quantization unit 520 and the update unit 530 will execute the corresponding operation again.
If the quantized neural network has already satisfied the predetermined condition, the storage unit 540 stores the quantized neural network obtained by the quantization unit 520, thereby applying the quantized neural network to the subsequent specific task processing such as object detection, object classification, image segmentation, etc.
The method flow chart 600 shown in
In the quantization step S620, the quantization unit 520 quantizes, using the meta-network determined in the determination step S610, the floating-point weight corresponding to the meta-network, so as to obtain the quantized neural network. That is to say, in the quantization step S620, the quantization unit 520 quantizes each floating-point weight using the meta-network corresponding to the floating-point weight, so as to obtain the corresponding quantized weight. After all floating-point weights are quantized, the corresponding quantized neural network can be obtained. For an arbitrary floating-point weight (e.g. floating-point weight w), in one implementation, the floating-point weight w can be quantized for example by the following operation:
First, the quantization unit 520 transforms the floating-point weight w and inputs the transformation result into the meta-network corresponding to the floating-point weight w. As can be seen from the above, the matrix shape of the floating-point weight w is [a width of a convolution kernel, a height of the convolution kernel, a number of input channels, a number of output channels]. That is to say, the matrix shape of the floating-point weight w is a four-dimensional matrix. After the transformation operation, the matrix shape of the floating-point weight w is transformed into a two-dimensional matrix, whose matrix shape is [a width of the convolution kernel × a height of the convolution kernel, a number of input channels × a number of output channels].
Then, the quantization unit 520 quantizes the transformed floating-point weight w using the meta-network corresponding to the floating-point weight w, so as to obtain the corresponding quantized weight. Since the input of the meta-network is a two-dimensional matrix, the matrix shape of the obtained quantized weight is also a two-dimensional matrix. Thus, the quantization unit 520 also needs to transform the obtained quantized weight to have a matrix shape that is the same as the matrix shape of the floating-point weight w, that is, needs to transform the matrix shape of the quantized weight to be a four-dimensional matrix.
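A minimal sketch of this transform-quantize-restore flow might look as follows (Python/PyTorch; the concrete sizes are hypothetical, and MetaNetwork refers to the illustrative sketch given earlier, whose input orientation is an assumption):

```python
import torch

# Hypothetical floating-point weight w with matrix shape
# [kernel_w, kernel_h, in_channels, out_channels].
kernel_w, kernel_h, c_in, c_out = 3, 3, 16, 32
w = torch.randn(kernel_w, kernel_h, c_in, c_out)

# Transform the four-dimensional matrix into a two-dimensional matrix
# [kernel_w * kernel_h, in_channels * out_channels].
w_2d = w.reshape(kernel_w * kernel_h, c_in * c_out)

# Quantize with the meta-network corresponding to w (the MetaNetwork sketch
# above; the transposes match the orientation assumed by that sketch).
meta_net = MetaNetwork(kernel_w, kernel_h)
w_q_2d = meta_net(w_2d.t()).t()

# Transform the obtained quantized weight back to the original
# four-dimensional matrix shape of the floating-point weight w.
w_q = w_q_2d.reshape(kernel_w, kernel_h, c_in, c_out)
```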
Returning to
Further, after the operation of the update step S630 ends, in the storage step S640, the storage unit 540 stores the quantized neural network obtained in the quantization step S620, so that the quantized neural network can be applied to subsequent specific task processing such as object detection, object classification, image segmentation, etc. Wherein, for example, the quantized weight in the quantized neural network, or the fixed-point weight obtained after converting the quantized weight to fixed point, is stored in the storage unit 540. Wherein, the operation of converting the quantized weight to fixed point is, for example, a rounding operation on the quantized weight.
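For instance, such a fixed-point conversion by rounding could be as simple as the following sketch (the weight values and the scale-free rounding are assumptions of this illustration):

```python
import torch

w_q = torch.tensor([[0.97, -1.02], [0.41, -0.88]])  # hypothetical quantized weight values
w_fixed = torch.round(w_q)                           # fixed-point conversion by rounding
# w_fixed (here +1/-1/0 values) can be stored in place of the floating-point quantized weight
```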
In one implementation, in order to improve accuracy of the generated quantized neural network, the update unit 530 executes the corresponding update operation referring to
As shown in
First, the update unit 530 performs the forward propagation operation using the quantized neural network obtained in the quantization step S620, and calculates the task loss function value according to the task objective function.
Then, the update unit 530 updates the quantized weight using the function for updating the quantized weight, based on the task loss function value obtained by calculation. Wherein, the function for updating the quantized weight can be defined as the following formula (6) for example:
In the above formula (6), one term indicates a task objective loss function value, and gwq indicates the gradient of that loss function value with respect to the quantized weight wq.
Returning to
On the one hand, the update unit 530 updates the floating-point weight using the function for updating the floating-point weight, based on the gradient value obtained by calculation through the above formula (6). Wherein, the function for updating the floating-point weight can be defined as the following formula (7), for example:
wt+1 = wt − ηgΘ  (7)
In the above formula (7), η indicates a training learning rate of the meta-network, t indicates the number of times the current quantized neural network has been updated (i.e., the number of training iterations), and wt indicates the floating-point weight at the t-th update.
On the other hand, the update unit 530 updates the weights in the meta-network itself using the general backward propagation operation, based on the quantized loss function value obtained by calculation.
Further, in the present disclosure, the two update operations executed by the update unit 530 can be jointly trained using two independent neural network optimizers, respectively, as sketched below.
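For example, in a PyTorch-style sketch (an illustrative assumption rather than a prescribed implementation; meta_net refers to the MetaNetwork sketch given earlier, and the losses here are dummy stand-ins for the task loss and the quantized loss), the two independent optimizers could be set up and stepped as follows:

```python
import torch

# Illustrative setup: meta_net is the MetaNetwork sketch given earlier,
# float_w a hypothetical floating-point weight of the network to be quantized.
float_w = torch.randn(3, 3, 16, 32, requires_grad=True)
meta_net = MetaNetwork(3, 3)

# Two independent optimizers, one per update operation.
opt_float = torch.optim.SGD([float_w], lr=0.01)               # floating-point weight update (cf. formula (7))
opt_meta = torch.optim.Adam(meta_net.parameters(), lr=1e-3)   # meta-network weight update

# Dummy losses standing in for the task loss and the quantized loss; both are
# propagated through the same graph here purely for brevity of the sketch.
w_q = meta_net(float_w.reshape(9, 16 * 32).t()).t()
task_loss = w_q.pow(2).mean()
quantized_loss = (w_q.abs() - 1).pow(2).mean()

opt_float.zero_grad(); opt_meta.zero_grad()
task_loss.backward(retain_graph=True)   # gradient used to update the floating-point weight
quantized_loss.backward()               # gradient used to update the meta-network weights
opt_float.step()
opt_meta.step()
```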
Returning to
In the flow S630 shown in
As an example, the operation flow of generating the quantized neural network according to an embodiment of the present disclosure will be described below:
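As a hedged end-to-end sketch under the same assumptions as the earlier code examples (PyTorch, the illustrative MetaNetwork class, a single convolution layer and a dummy task loss), one possible realization of this operation flow, covering steps S610 to S640, is:

```python
import torch
import torch.nn.functional as nnf

# End-to-end sketch of one possible generation flow (steps S610-S640); the
# layer sizes, the data and the task loss are dummy placeholders, and
# MetaNetwork is the illustrative class sketched earlier.
kw, kh, c_in, c_out = 3, 3, 1, 8
w = torch.randn(kw, kh, c_in, c_out, requires_grad=True)  # floating-point weight
meta_net = MetaNetwork(kw, kh)                            # S610: determine the meta-network

opt_float = torch.optim.SGD([w], lr=0.01)
opt_meta = torch.optim.Adam(meta_net.parameters(), lr=1e-3)

for t in range(100):                                      # at most T updates
    # S620: quantize the floating-point weight with its meta-network
    w_q = meta_net(w.reshape(kw * kh, c_in * c_out).t()).t().reshape(kw, kh, c_in, c_out)

    # Forward propagation with the quantized weight on a dummy batch
    x = torch.randn(4, c_in, 8, 8)
    y = nnf.conv2d(x, w_q.permute(3, 2, 0, 1), padding=1)
    task_loss = y.pow(2).mean()                           # dummy task objective
    quantized_loss = (w_q.abs() - 1).pow(2).mean()        # stand-in for the quantized objective

    # S630: update the floating-point weight and the meta-network (the
    # two-optimizer split is detailed in the previous sketch; here both losses
    # are propagated in a single backward pass for brevity).
    opt_float.zero_grad(); opt_meta.zero_grad()
    (task_loss + quantized_loss).backward()
    opt_float.step(); opt_meta.step()

    if task_loss.item() < 1e-4:                           # predetermined condition reached
        break

# S640: store the resulting quantized neural network, e.g. via torch.save(...)
```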
In addition, as stated above, the number of floating-point weights that need to be quantized depends on the number of network layers constituting the neural network to be quantized. Therefore, taking a neural network to be quantized that consists of three network layers as an example, quantizing this neural network to be quantized according to an embodiment of the present disclosure obtains the corresponding quantized neural network, whose structure diagram is for example shown in
As stated above, in the process of quantizing the neural network, the present disclosure uses a meta-network capable of directly outputting the quantized weight to replace the sign function and the STE needed in the conventional method, and generates the quantized neural network in a manner of training the meta-network and the neural network to be quantized cooperatively, thereby achieving the purpose of losing no information. Therefore, according to the present disclosure, the problem that the gradients do not match in the neural network quantizing process can be solved, thereby improving the performance of the generated quantized neural network.
(System for Generating the Quantized Neural Network)
As illustrated in
As shown in
The second embedded device 1020 quantizes, using the meta-network determined by the first embedded device 1010, the floating-point weight corresponding to the meta-network to obtain the quantized neural network.
The server 1030 calculates the loss function value via the quantized neural network obtained by the second embedded device 1020, and updates the determined meta-network, the floating-point weight and the quantized weight in the quantized neural network based on the loss function value obtained by calculation. Wherein, the server 1030, after updating the meta-network, the floating-point weight and the quantized weight in the quantized neural network, transmits the updated meta-network to the first embedded device 1010, and transmits the updated floating-point weight and quantized weight to the second embedded device 1020.
All the above units are illustrative and/or preferable modules for implementing the processing in the present disclosure. These units may be hardware units (such as Field Programmable Gate Array (FPGA), Digital Signal Processor, Application Specific Integrated Circuit and so on) and/or software modules (such as computer readable program). Units for implementing each step are not described exhaustively above. However, in a case where a step for executing a specific procedure exists, a corresponding functional module or unit for implementing the same procedure may exist (implemented by hardware and/or software). The technical solutions of all combinations by the described steps and the units corresponding to these steps are included in the contents disclosed by the present application, as long as the technical solutions constituted by them are complete and applicable.
The methods and apparatuses of the present disclosure can be implemented in various forms. For example, the methods and apparatuses of the present disclosure may be implemented by software, hardware, firmware or any other combinations thereof. The above order of the steps of the present method is only illustrative, and the steps of the method of the present disclosure are not limited to such order described above, unless it is stated otherwise. In addition, in some embodiments, the present disclosure may also be implemented as programs recorded in recording medium, which include a machine readable instruction for implementing the method according to the present disclosure. Therefore, the present disclosure also covers the recording medium storing programs for implementing the method according to the present disclosure.
While some specific embodiments of the present disclosure have been demonstrated in detail by examples, it is to be understood by persons skilled in the art that the above examples are only illustrative and do not limit the scope of the present disclosure. In addition, it is to be understood by persons skilled in the art that the above embodiments can be modified without departing from the scope and spirit of the present disclosure. The scope of the present disclosure is restricted by the attached Claims.
This application claims the benefit of Chinese Patent Application No. 202010142443.X, filed Mar. 4, 2020, which is hereby incorporated by reference herein in its entirety.