This application is based on and claims priority under 35 U.S.C. § 119 to French Patent Application No. FR2308018 filed on Jul. 25, 2023, the disclosure of which is herein incorporated by reference in its entirety.
The present disclosure relates to deep learning techniques in the field of artificial intelligence, and more particularly to a hardware multiply-accumulate operator adapted to perform some of the calculations involved.
In deep learning techniques, some of the layers of deep neural networks use arithmetic involving small integers (e.g., 8-bit integers, known as INT8) instead of the more hardware-intensive 32-bit floating-point arithmetic (FP32 number format according to the IEEE 754 standard).
Conventional deep learning techniques are based on a simple integer quantization transformation between FP32 floating-point numbers and INT8 numbers. A real number X represented in FP32 format is mapped to an INT8 number x within a given quantization domain defined by a 32-bit integer zero point z (INT32) and an FP32 floating-point scaling factor S. The quantized representation of X, denoted ˜X, is then expressed by:

˜X = (x − z)·S
The number x in INT8 format is the one that is stored and used for arithmetic operations. The quantization domain corresponds to a low-resolution range of variation of the number ˜X around a central value defined by z, placed within a high dynamic range (that of the factor S, an FP32 number). As x takes values in the range [−128, 127], the range of variation of ˜X becomes:

˜X ∈ [(−128 − z)·S, (127 − z)·S]
This range of variation, which in practice is supposed to cover a Gaussian distribution of X values around the zero point z, does not cover the entire range of variation of the number X. So, for values of X outside the range, ˜X is saturated at the boundaries of the range.
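As a point of reference, this mapping and its inverse can be sketched as follows in C, assuming per-tensor parameters S and z and rounding to the nearest integer (the function names are illustrative, not taken from the text):

#include <stdint.h>
#include <math.h>

/* Map an FP32 value X to an INT8 value x in the domain (z, S), saturating
   at the [-128, 127] boundaries, and recover the approximation
   ~X = (x - z) * S. */
static int8_t quantize(float X, float S, int32_t z)
{
    long q = lroundf(X / S) + z;        /* nearest integer, then offset by z */
    if (q > 127)  q = 127;              /* saturate at the domain boundaries */
    if (q < -128) q = -128;
    return (int8_t)q;
}

static float dequantize(int8_t x, float S, int32_t z)
{
    return (float)((int32_t)x - z) * S; /* ~X = (x - z) * S */
}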
There may be one quantization domain per activation tensor (a multidimensional matrix of activation values), and a channel of the tensor (one of its dimensions) may sometimes have a different quantization domain. Once the FP32 numbers have been quantized, they may be exploited using resource-efficient operations on the small integers x.
Despite the advantages of this quantization, FP32 arithmetic is still used in calculations that modify the representation of tensor elements between quantization domains. FP32 arithmetic is more costly to implement in terms of performance, power consumption and silicon area.
A hybrid hardware multiply-accumulate operator is generally provided, configured to multiply a floating-point multiplicand by an integer multiplicand and add an integer multiplication result to an integer operand.
The operator may comprise a first multiplicand input for the floating-point multiplicand having a mantissa field and an exponent field; a second multiplicand input for the integer multiplicand; an accumulator input for the integer operand; an integer multiplier configured to multiply the mantissa field of the floating-point multiplicand by the integer multiplicand and produce a result in a fixed-point format corresponding to a zero exponent of the floating-point multiplicand; a shifter connected to receive the result of the multiplier and shift it left or right according to the value of the exponent field of the floating-point multiplicand; and an integer adder receiving the integer operand and a window on the result of the shifter, the window capturing a number of bits to the left of the fixed decimal point position equal to the number of bits of the integer operand to within one carry bit.
The left and right shift amplitude of the shifter may be limited to the number of bits of the integer operand, and the operator may comprise a saturation circuit connected to the adder and the shifter, configured to produce the result of the adder when no bit to the left of the window is significant and the exponent field encodes a value less than the number of bits of the integer operand, and to produce a saturated result when at least one bit to the left of the window is significant, or the exponent field encodes a value greater than or equal to the number of bits of the integer operand.
The saturation circuit may be configured to round the adder result according to significant values of the bits to the right of the window.
The following non-limiting description is provided in relation to the attached drawings.
Hereinafter, a new type of hardware hybrid fused multiplication and addition (HFMA) operator is provided for efficient tensor conversion between different quantization domains. The structure of this HFMA operator results from a particular analysis and decomposition of the operations involved in converting between quantization domains. The resulting structure multiplies a signed integer (e.g., INT32) by a floating-point number (e.g., FP32), and adds a signed integer operand (INT32) to the product to provide an integer result in the same format as the operand. Since the result is in the same format as the addition operand, the structure may be used to calculate the accumulation of products, so it may also be referred to as a “Hybrid Multiply-ACcumulate” (HMAC) operator.
The most common calculation in deep learning networks is the accumulated scalar product of an activation vector Xi with a weight vector Wi, plus a bias Bj, according to the following relationship (as only quantized values are referred to hereinafter, the prefix "˜" is omitted to simplify the notation):

Yj = Σi=0..n Xi·Wi + Bj
Each of the terms is quantized in its own quantization domain, respectively (zy, Sy), (zx, Sx), (0, Sw) and (0, Sx·Sw). The zero point of the weights Wi and biases Bj is 0. The quantization domain (0, Sx·Sw) of Bj has the same scale factor as the quantization domain resulting from the product Xi·Wi.
The different terms are expressed as follows:

Yj = (yj − zy)·Sy
Xi = (xi − zx)·Sx
Wi = wi·Sw
Bj = bj·Sx·Sw
Recall that values S are FP32 scaling factors, values z are INT32 zero points, and xi and yj are INT8 numbers used for storage and recurrent operations.
Making these substitutions yields:

(yj − zy)·Sy = Σi=0..n (xi − zx)·Sx·wi·Sw + bj·Sx·Sw
The integer representing the quantized result is then expressed as:

yj = (Sx·Sw/Sy)·(Σi=0..n xi·wi + cj) + zy

with

cj = bj − zx·Σi=0..n wi
Each term cj is an INT32 integer constant that may be calculated off-line using simple integer arithmetic, such as a conventional multiply-accumulate (MAC) operator operating on INT32 integers. The sum Σi=0..n wi is an integer constant over the range of j values; it may be calculated once and stored to compute all the terms cj.
The basic recurring calculation on index j, conventionally involving operations in FP32 arithmetic, becomes an integer product and addition operation that multiplies the INT32 integer Σi=0..n xi·wi + cj by a real number Sx·Sw/Sy in FP32 format, and adds the INT32 integer zy, the result again being an INT32 number. This is exactly what the HFMA operator calculates. The yj values are then saturated to fit back into INT8 format, and are stored and reused as the quantized representation.
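A reference model of this decomposition in C may help fix ideas: the constant cj is built off-line with integer arithmetic only, and each output yj then requires a single hybrid multiply-accumulate, modelled here with an ordinary float product and rounding (names are illustrative; in hardware the hybrid operator replaces the lroundf-based line):

#include <stdint.h>
#include <math.h>

/* Off-line constant for output j: cj = bj - zx * sum(wi), pure integer math.
   sum_w is the sum of the wi over i = 0..n, computed once for all j. */
static int32_t make_cj(int32_t bj, int32_t zx, int32_t sum_w)
{
    return bj - zx * sum_w;
}

/* One output of the quantized dot product: the inner loop uses INT8/INT32
   arithmetic only; the final step is the hybrid operation
   yj = round(scale * (acc + cj)) + zy, with scale = Sx*Sw/Sy held in FP32. */
static int8_t quantized_dot(const int8_t *x, const int8_t *w, int n,
                            int32_t cj, float scale, int32_t zy)
{
    int32_t acc = 0;
    for (int i = 0; i <= n; i++)            /* n + 1 integer-only iterations */
        acc += (int32_t)x[i] * (int32_t)w[i];

    int32_t yj = (int32_t)lroundf(scale * (float)(acc + cj)) + zy;

    if (yj > 127)  yj = 127;                /* saturate back to INT8 storage */
    if (yj < -128) yj = -128;
    return (int8_t)yj;
}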
With conventional means, the operation would consist in converting the integer multiplicand and the integer addition operand into FP32 numbers, performing a fused multiplication and addition (FMA) of the resulting FP32 numbers, and converting the FP32 result into the INT32 integer, i.e., three or four operations instead of one, one of which involves a complex FMA operator for FP32 numbers.
This gain in the number of operations occurs, in the above example, once every n+1 iterations on i. In some cases, n may be of the same order as the number of iterations on j, so that a gain of the order of 3n operations is obtained.
The HFMA operator may also be used for occasional operations such as the sum of quantized vectors:
This corresponds to a nested application of the HFMA or HMAC operator (with two INT32 integer additions).
This calculation may be further optimized when the quantization domain of one of the inputs is the same as that of the output, which often occurs in practice. For example, if Sx=Sy and zx=zy then:
This corresponds to a single application of the HMAC operator (with one addition of INT32 integers).
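A sketch of this optimized case in C, writing the second input's quantization parameters as (zv, Sv) (these symbol names are not from the text, and the per-element formula below is one way to write the step consistent with the preceding relations): each element then costs one hybrid multiply-accumulate whose integer addend is x directly.

#include <stdint.h>
#include <math.h>

/* Element-wise sum of two quantized vectors when one input shares the output
   domain (Sx = Sy, zx = zy): y = round((Sv/Sy) * (v - zv)) + x, followed by
   INT8 saturation. The hybrid operator performs the multiply-add step. */
static int8_t quantized_add(int8_t x, int8_t v, float Sv_over_Sy, int32_t zv)
{
    int32_t y = (int32_t)lroundf(Sv_over_Sy * (float)((int32_t)v - zv))
              + (int32_t)x;
    if (y > 127)  y = 127;
    if (y < -128) y = -128;
    return (int8_t)y;
}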
The HMAC operator may also be used in combination with an operator designated FSCALE to quantize FP32 values to INT8 values without involving FP32 arithmetic. An FSCALE operation scales an FP32 value by an integer power of two, in the form of a simple integer addition over the 8-bit exponent field of the FP32 numerical representation.
For example, the case of Xi = (xi − zx)·Sx with 1/Sx = 2^e·m yields:

xi = (2^e·Xi)·m + zx
Here, the FSCALE operation produces an FP32 result (2^e·Xi), while e is INT8, and m and zx are INT32. Thus, the calculation of xi is once again an HMAC calculation, producing an INT32 value which is then converted to an INT8 value.
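The combination may be sketched as follows in C, with FSCALE modelled as an integer addition on the exponent field; exponent overflow and the special cases (zero, subnormal, infinite values) are ignored in this sketch, the multiply-add line stands in for the HMAC hardware step, and the function names are illustrative:

#include <stdint.h>
#include <string.h>
#include <math.h>

/* FSCALE: scale an FP32 value by 2^e through an integer addition on the
   8-bit exponent field (normal numbers only in this sketch). */
static float fscale(float X, int32_t e)
{
    uint32_t b;
    memcpy(&b, &X, sizeof b);
    uint32_t exp = (b >> 23) & 0xFFu;
    if (exp != 0u && exp != 0xFFu)          /* skip zero/subnormal/inf/NaN */
        b = (b & ~(0xFFu << 23))
          | ((((uint32_t)((int32_t)exp + e)) & 0xFFu) << 23);
    memcpy(&X, &b, sizeof X);
    return X;
}

/* Quantization of an FP32 value X with 1/Sx = 2^e * m:
   x = round(FSCALE(X, e) * m) + zx, then saturation to INT8. */
static int8_t quantize_fscale_hmac(float X, int32_t e, int32_t m, int32_t zx)
{
    int32_t x = (int32_t)lroundf(fscale(X, e) * (float)m) + zx;
    if (x > 127)  x = 127;
    if (x < -128) x = -128;
    return (int8_t)x;
}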
The floating-point multiplicand Op0 includes an 8-bit exponent field EXP and a 24-bit mantissa field MANT. The mantissa field MANT of the operand Op0, treated as an integer, and the integer multiplicand Op1 are supplied to an integer multiplier 10. The multiplier 10 produces a 57-bit product P. The product P is in fact a fixed-point number whose point position is chosen, by convention, for an exponent EXP equal to 0. In this case, the product corresponds to the multiplication of a number 1.xx…xx with 24 bits after the decimal point by a 32-bit integer XX…XX (in front of the decimal point), so that the fixed decimal point is positioned between bits 23 and 24 of the 57-bit product P.
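For reference, the field decomposition assumed at the input of the multiplier may be sketched in C as follows (this follows the standard IEEE 754 FP32 layout, with the implicit leading 1 restored into the significand of normal numbers; names are illustrative):

#include <stdint.h>
#include <string.h>

/* Split an FP32 operand into its sign, 8-bit exponent field and significand
   (23 stored fraction bits plus the implicit leading 1 for normal numbers). */
typedef struct { uint32_t sign; uint32_t exp; uint32_t mant; } fp32_fields;

static fp32_fields split_fp32(float op0)
{
    uint32_t b;
    memcpy(&b, &op0, sizeof b);
    fp32_fields f = {
        b >> 31,                            /* sign bit                    */
        (b >> 23) & 0xFFu,                  /* biased exponent field EXP   */
        (b & 0x7FFFFFu) | (1u << 23)        /* significand MANT (normals)  */
    };
    return f;
}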
A shifter circuit 12 receives product P at positions 31 to 87 of a 119-bit virtual number having 31 bits at 0 to the right of product P and 31 bits at 0 to the left of product P. The shift of circuit 12 is controlled by the exponent EXP of multiplicand Op0. When this exponent is positive, circuit 12 performs a corresponding shift to the left. When the exponent is negative, circuit 12 performs a corresponding shift to the right.
The “useful” output of shifter 12 is taken from a window covering positions 55 to 87 of the shifted number, i.e., the 33 bits in front of the fixed decimal point, forming a 32-bit integer with a carry bit. The unused bits to the right of the window are used for rounding, while the unused bits to the left are used for saturation.
Note that the maximum amplitude considered for the left or right shift is the size of the integer addition operand, here 32 bits (31 positions plus the unshifted position for a zero exponent), corresponding to a 6-bit signed exponent instead of 8 bits. This constrained amplitude corresponds to the limit positions at which significant bits of the shifted product may still lie in the window and form a usable integer. On the left, this corresponds to the product of an integer equal to 1 and a float with exponent +31. On the right, this corresponds to the product of an integer whose most significant bit is 1 and a float with exponent −31. Beyond an exponent of +31, the result is systematically saturated; below an exponent of −31, it is rounded to 0.
The contents of the 33-bit window are supplied to an adder 14, which also receives the 32-bit integer operand Op2. The adder 14 produces a 34-bit result, the 34th bit being a carry bit.
The result of adder 14 is fed to a saturation and rounding circuit 16, which also receives the 55 bits to the right and the 31 bits to the left of the window of shifter 12. Circuit 16 produces a final result R1 in the form of a signed integer INT32, under the following conditions.
The result R1 is saturated to the largest positive integer when at least one of the following conditions is met:
The result R1 is saturated to the smallest negative integer when at least one of the following conditions is met:
In the description of the circuit shown in
Furthermore, since the operator involves a floating-point number FP32 as a multiplicand, this number may be indefinite (NaN) or infinite in certain situations. If it is infinite and the integer multiplicand Op1 is non-zero, the result R1 is saturated to the largest positive integer or the smallest negative integer, depending on the sign of the result. If it is indefinite or infinite and multiplied by a zero integer, the result R1 may be saturated by convention, for example to the smallest negative integer.
Finally, the result R1 is rounded according to the significant bits to the right of the window, for example following one of the rounding modes of the IEEE 754 floating-point standard.
The operator may also raise the IEEE 754 overflow, underflow, inexact and invalid-operation flags. The underflow flag is raised when the unrounded result is non-zero and strictly between −1 and 1 (so-called subnormal numbers).
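Pulling the preceding description together, a behavioral C model of the complete operation (integer multiply of the significand, shift by the unbiased exponent, windowing, integer addition, saturation and round-to-nearest-even) may be sketched as below. It is a functional model, not the datapath itself: it follows the standard IEEE 754 field widths rather than the exact bit positions given above, omits the status flags, treats NaN with a non-zero multiplicand by the same convention as the zero-multiplicand case, and assumes a GCC/Clang-style compiler for the 128-bit intermediate type.

#include <stdint.h>
#include <string.h>

/* Behavioral sketch of R1 = saturate_int32(round(Op0 * Op1) + Op2). */
static int32_t hfma(float op0, int32_t op1, int32_t op2)
{
    uint32_t b;
    memcpy(&b, &op0, sizeof b);
    int      neg  = (int)(b >> 31) ^ (op1 < 0);        /* sign of the product */
    uint32_t expf = (b >> 23) & 0xFFu;                  /* biased exponent     */
    uint64_t frac = b & 0x7FFFFFu;

    if (expf == 0xFFu) {                                /* infinity or NaN     */
        if (frac != 0 || op1 == 0)
            return INT32_MIN;                           /* saturation by convention */
        return neg ? INT32_MIN : INT32_MAX;             /* infinity * non-zero */
    }
    if (op1 == 0 || (expf == 0 && frac == 0))
        return op2;                                     /* zero product        */

    uint64_t mant = (expf ? frac | (1u << 23) : frac);  /* significand         */
    uint64_t mag  = (op1 < 0) ? (uint64_t)(-(int64_t)op1) : (uint64_t)op1;

    /* Exact |product| = (mant * mag) * 2^(exp - 127 - 23). */
    unsigned __int128 p = (unsigned __int128)mant * mag;    /* <= 56 bits      */
    int shift = (int)(expf ? expf : 1u) - 150;

    unsigned __int128 ipart;                            /* rounded integer part */
    if (shift >= 0) {
        ipart = (shift > 40) ? ((unsigned __int128)1 << 100) /* forces saturation */
                             : (p << shift);
    } else {
        int s = -shift;
        if (s > 60) {
            ipart = 0;                                  /* far below 1: rounds to 0 */
        } else {
            unsigned __int128 fl   = p >> s;
            unsigned __int128 rem  = p - (fl << s);
            unsigned __int128 half = (unsigned __int128)1 << (s - 1);
            if (rem > half || (rem == half && (fl & 1)))
                fl += 1;                                /* round to nearest even */
            ipart = fl;
        }
    }

    __int128 res = (neg ? -(__int128)ipart : (__int128)ipart) + op2;
    if (res > INT32_MAX) return INT32_MAX;              /* positive saturation */
    if (res < INT32_MIN) return INT32_MIN;              /* negative saturation */
    return (int32_t)res;
}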