The disclosure relates to a method and an apparatus for computation on a convolutional layer of a neural network.
Quantization is primarily a technique for speeding up computation when a deep learning model is deployed. Quantization allows each parameter and activation in the deep learning model to be transformed into a fixed-point integer. On the other hand, during the convolution operation, de-quantization is required to transform a fixed-point value back into a floating-point value; alternatively, a corresponding zero point (i.e., the fixed-point value corresponding to the floating-point zero) must be subtracted from the fixed-point value so that the floating-point zero corresponds to the fixed-point zero, and the multiplication in the convolution operation is performed thereafter. However, floating-point multiplication is not supported by hardware, which renders the de-quantization approach inapplicable. Moreover, since the range of an input activation is not assumed to be symmetric with respect to the real value 0, asymmetric quantization is normally applied to the input activation, which requires additional bits for the input activation.
A method and an apparatus for computation on a convolutional layer of a neural network are proposed.
According to one of the exemplary embodiments, the apparatus includes an adder configured to receive a first sum of products, receive a pre-computed convolution bias of the convolutional layer, and perform accumulation on the first sum of products and the pre-computed convolution bias to generate an adder result of the convolutional layer, where the first sum of products is a sum of products of quantized input activation of the convolutional layer and quantized convolution weights of the convolutional layer, and where the pre-computed convolution bias is associated with a zero point of input activation of the convolutional layer and a zero point of output activation of the convolutional layer.
According to one of the exemplary embodiments, the method includes receiving a first sum of products, receiving a pre-computed convolution bias of the convolutional layer, and performing accumulation on the first sum of products and the pre-computed convolution bias to generate an adder result of the convolutional layer, where the first sum of products is a sum of products of quantized input activation of the convolutional layer and quantized convolution weights of the convolutional layer, and where the pre-computed convolution bias is associated with a zero point of input activation of the convolutional layer and a zero point of output activation of the convolutional layer.
It should be understood, however, that this summary may not contain all of the aspects and embodiments of the disclosure and is therefore not meant to be limiting or restrictive in any manner. Also, the disclosure would include improvements and modifications which are obvious to one skilled in the art.
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments of the disclosure and, together with the description, serve to explain the principles of the disclosure.
To make the above features and advantages of the application more comprehensible, several embodiments accompanied with drawings are described in detail as follows.
To solve the issues described above, some embodiments of the disclosure will now be described more fully hereinafter with reference to the accompanying drawings, in which some, but not all, embodiments of the application are shown. Indeed, various embodiments of the disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will satisfy applicable legal requirements. Like reference numerals refer to like elements throughout.
Denote rin, qin, and zin as, respectively, a floating-point input activation before quantization (i.e., an input activation to be quantized), a fixed-point quantized input activation, and a zero point with respect to the input activation. A quantized input activation qin may be represented as follows:
Herein, scalein denotes a floating-point scale factor for the input activation to be quantized from floating-point values to integers and is also referred to as “a first scale factor” hereafter. As for asymmetric quantization of the input activation, scalein may be represented as follows:
Note that qmin and qmax respectively denote the minimum and the maximum of the quantized integer values. For example, if 8-bit quantization is performed, [qmin, qmax] may be [−128, 127] or [0, 255].
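For illustration only, the asymmetric quantization described above may be sketched in Python with NumPy as follows. Since Eq. (1) is not reproduced here, the rounding and clamping conventions, and the derivation of the zero point from the minimum of the floating-point range, are assumptions rather than the exact expressions of the disclosure.

    import numpy as np

    def quantize_asymmetric(r, r_min, r_max, q_min=-128, q_max=127):
        # First scale factor: ratio of the floating-point range to the integer range
        scale = (r_max - r_min) / (q_max - q_min)
        # Zero point: the fixed-point integer that represents the floating-point zero
        zero_point = int(round(q_min - r_min / scale))
        # Quantize: scale, shift by the zero point, and clamp to [q_min, q_max]
        q = np.clip(np.round(r / scale) + zero_point, q_min, q_max).astype(np.int32)
        return q, scale, zero_point

For example, quantize_asymmetric(np.array([0.0, 1.5]), -0.5, 1.5) yields a first scale factor of about 0.00784 and a zero point of -64, so the floating-point zero maps exactly onto an integer value.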
Quantized convolution weights qweight may be represented as follows:
Herein, rweight, qweight, and zweight are respectively a floating-point weight before quantization (i.e., a weight to be quantized), a fixed-point quantized weight, and a zero point with respect to the weight. scaleweight denotes a floating-point scale factor for the convolution weights to be quantized from floating-point values to integers and is also referred to as “a second scale factor” hereinafter. As for symmetric quantization of the convolution weights, scaleweight may be represented as follows:
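Continuing the previous sketch (and again only as an assumption, since Eq. (2) is not reproduced here), the symmetric case may derive the second scale factor from the largest weight magnitude so that the zero point zweight is exactly 0:

    def quantize_symmetric(r_weight, q_max=127):
        # Second scale factor chosen from the largest magnitude so that z_weight = 0
        scale = float(np.max(np.abs(r_weight))) / q_max
        # Quantize and clamp to the symmetric integer range [-q_max, q_max]
        q = np.clip(np.round(r_weight / scale), -q_max, q_max).astype(np.int32)
        return q, scale

Because the range is symmetric around zero, no zero point needs to be stored for the weights.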
A quantized output activation qout may be represented as follows:
Herein, rout, qout, and zout are respectively a floating-point output activation before quantization (i.e., an output activation to be quantized), a fixed-point quantized output activation, and a zero point with respect to the output activation. scaleout denotes a floating-point scale factor for the output activation to be quantized from floating-point values to integers and is also referred to as “a third scale factor”. As for asymmetric quantization of the output activation, scaleout may be represented as follows:
Note that a quantized bias qbias may be represented as follows:
Also, note that a floating-point output activation rout may be represented as follows:
rout = Σ(rin × rweight) + rbias    Eq. (4)
By substituting the information in Eq. (1), Eq. (2), and Eq. (4) into Eq. (3), the quantized output activation qout in Eq. (3) may be rewritten as follows:
It can be observed from Eq. (5) that the quantized input activation is subtracted by a zero point such that the floating-point zero corresponds to the fixed-point zero, and this subtraction requires one additional bit for the quantized input activation. If the input activation is quantized to n bits and the convolution weights are quantized to m bits, the convolution multiplication therefore requires an (n+1)-bit by m-bit multiplier. To remedy such an issue, the quantized output activation qout in Eq. (5) may be expanded and rearranged as follows:
Therefore, no additional bit is required for the computation of a zero point.
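The equivalence underlying the rearrangement from Eq. (5) to Eq. (6) can be checked numerically with a small sketch; the concrete values below are arbitrary and only illustrate that folding the term involving zin into the bias leaves the accumulated result unchanged while every multiplication operand keeps its original bit width:

    q_in = np.array([12, -7, 30], dtype=np.int32)   # quantized input activation
    q_w = np.array([5, 9, -4], dtype=np.int32)      # quantized convolution weights
    z_in = -64                                      # zero point of the input activation
    q_bias = 100                                    # quantized convolution bias

    # Eq. (5) style: subtract the zero point before every multiplication
    acc_subtract = int(np.sum((q_in - z_in) * q_w)) + q_bias

    # Eq. (6) style: fold sum(z_in * q_weight) into a pre-computed bias instead
    q_bias_folded = q_bias - z_in * int(np.sum(q_w))
    acc_folded = int(np.sum(q_in * q_w)) + q_bias_folded

    assert acc_subtract == acc_folded == 617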
Moreover, it can be observed from Eq. (6) that re-quantization adopts multiplication with a floating-point factor (i.e., the product of the first and second scale factors divided by the third scale factor), which is not hardware friendly. This factor may therefore be approximated by a multiplication operation with a multiplication factor req_mul and a bit-shift operation with a bit-shift number req_shift, where req_mul and req_shift are both natural numbers. Therefore, the approximation of the quantized output activation qout may be expressed as follows, which does not involve floating-point multiplication:
qout ≈ ([Σ(qin × qweight) + q′bias] × req_mul) >> req_shift    Eq. (8)
In practice, req_mul and req_shift may be determined offline such that req_mul/2^req_shift approximates the floating-point re-quantization factor.
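One plausible offline procedure for choosing req_mul and req_shift, sketched here under the assumptions that the factor is smaller than 1 (which is typical for re-quantization) and that the multiplier is limited to 15 bits (a width not taken from the disclosure), is to scale the floating-point factor by a power of two and round:

    def approximate_requant_factor(factor, mul_bits=15):
        # Find the largest req_shift such that req_mul = round(factor * 2**req_shift)
        # still fits in mul_bits bits, so that factor ~ req_mul / 2**req_shift
        req_shift = 0
        while factor * (1 << (req_shift + 1)) < (1 << mul_bits) and req_shift < 31:
            req_shift += 1
        req_mul = int(round(factor * (1 << req_shift)))
        return req_mul, req_shift

For instance, a factor of 0.003 gives req_mul = 25166 and req_shift = 23, and 25166 / 2**23 ≈ 0.00300002, which is accurate enough for an 8-bit output.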
The receiving circuit 310 is configured to receive an n-bit integer input as a quantized input activation qin. Also, the quantization circuit 320 is configured to perform quantization on convolution weights to generate quantized convolution weights qweight. The quantization circuit 320 may receive floating-point weights and symmetrically quantize the floating-point weights into m-bit integer weights.
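As a usage note on the earlier quantize_symmetric sketch, an 8-bit quantization of this kind (m = 8 is only an example) could be exercised as follows; the weight values are arbitrary:

    r_weight = np.array([-0.20, 0.05, 0.30], dtype=np.float32)
    q_weight, scale_weight = quantize_symmetric(r_weight, q_max=127)
    # q_weight == [-85, 21, 127], scale_weight == 0.3 / 127 (about 0.00236)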
The multiplication circuit 330 is configured to receive the quantized input activation qin and the quantized convolution weights qweight to generate multiplication results, and the summation circuit 340 is configured to receive and sum the multiplication results to generate a first sum of products, where the first sum of products corresponds to the term Σ(qin×qweight) in Eq.(8).
The adder 350, similar to the adder 150, is configured to receive the first sum of products and the pre-computed convolution bias q′bias of the convolutional layer, and to perform accumulation on the first sum of products and the pre-computed convolution bias to generate an adder result of the convolutional layer.
Note that the pre-computed convolution bias q′bias may be pre-computed through offline quantization based on a quantized bias, a zero point of the input activation, a zero point of the output activation, and the quantized convolution weights, where the quantized bias is in integer values scaled from a convolution bias in floating-point values. In the present exemplary embodiment, the pre-computed convolution bias may be computed according to Eq. (7). Up to this stage, each step only involves integer operations, and no additional bit is required for the computation of a zero point.
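Because Eq. (7) is not reproduced here, the following is only one plausible form of the offline pre-computation that is consistent with the inputs listed above and with Eq. (8); in particular, folding the output zero point through the inverse of the re-quantization factor req_mul / 2**req_shift is an assumption. The sketch continues to use NumPy from the earlier examples.

    def precompute_bias(q_bias, z_in, z_out, q_weight, req_mul, req_shift):
        # Fold the input zero point into the bias: subtract sum(z_in * q_weight)
        folded = q_bias - z_in * int(np.sum(q_weight))
        # Fold the output zero point through the inverse of req_mul / 2**req_shift
        # (assumed form, since the exact expression of Eq. (7) is not shown here)
        folded += int(round(z_out * (1 << req_shift) / req_mul))
        return folded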
The multiplier 360 is configured to perform a multiplication operation on the adder result with a multiplication factor req_mul to generate a multiplication result, and the bit-shifter 370 is configured to perform a bit-shift operation with a bit-shift number req_shift on the multiplication result to generate a quantized output activation qout. Herein, the floating-point multiplication adopted in re-quantization is replaced by the approximation using the multiplication operation and the bit-shift operation. The quantized output activation qout is also a quantized input activation of a next convolutional layer of the neural network, and the output circuit 380 is configured to output the quantized output activation qout to the receiving circuit 310.
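Putting the pieces together, the integer-only data path of the circuits 310 through 380 described above might be mimicked in software as follows; flattening the convolution into a single dot product and clamping the result to the output range are simplifications assumed for the sake of a short, runnable sketch:

    import numpy as np

    def conv_layer_int_only(q_in, q_weight, q_bias_pre, req_mul, req_shift,
                            q_min=-128, q_max=127):
        # Multiplication circuit and summation circuit: the first sum of products
        acc = int(np.sum(q_in.astype(np.int64) * q_weight.astype(np.int64)))
        # Adder: accumulate the pre-computed convolution bias
        acc += q_bias_pre
        # Multiplier and bit-shifter: re-quantization as in Eq. (8)
        q_out = (acc * req_mul) >> req_shift
        # Clamp to the quantized output range (an assumption; clamping is not
        # explicitly described above)
        return int(np.clip(q_out, q_min, q_max))

The returned integer can then serve directly as the quantized input activation of the next convolutional layer, mirroring the output circuit 380 feeding the receiving circuit 310.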
In view of the aforementioned descriptions, an effective quantization approach is proposed for computation on a convolutional layer of a neural network so as to ease the hardware burden.
No element, act, or instruction used in the detailed description of disclosed embodiments of the present application should be construed as absolutely critical or essential to the present disclosure unless explicitly described as such. Also, as used herein, each of the indefinite articles "a" and "an" could include more than one item. If only one item is intended, the terms "a single" or similar languages would be used. Furthermore, the terms "any of" followed by a listing of a plurality of items and/or a plurality of categories of items, as used herein, are intended to include "any of", "any combination of", "any multiple of", and/or "any combination of multiples of" the items and/or the categories of items, individually or in conjunction with other items and/or other categories of items. Further, as used herein, the term "set" is intended to include any number of items, including zero. Further, as used herein, the term "number" is intended to include any number, including zero.
It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the disclosed embodiments without departing from the scope or spirit of the disclosure. In view of the foregoing, it is intended that the disclosure cover modifications and variations of this disclosure provided they fall within the scope of the following claims and their equivalents.