This application is based on and claims priority under 35 U.S.C. § 119 to French Patent Application No. FR2308018 filed on Jul. 25, 2023, the disclosure of which is herein incorporated by reference in its entirety.
The present disclosure relates to deep learning techniques in the field of artificial intelligence, and more particularly to a hardware multiply-accumulate operator adapted to perform some of the calculations involved.
In deep learning techniques, some of the layers of deep neural networks use arithmetic involving small integers (e.g., 8-bit integers, known as INT8) instead of the more hardware-intensive 32-bit floating-point arithmetic (FP32 number format according to the IEEE 754 standard).
Conventional deep learning techniques are based on a simple integer quantization transformation between FP32 floating-point numbers and INT8 numbers. A real number X represented in FP32 format is mapped to an INT8 number x within a given quantization domain defined by a 32-bit integer zero point z (INT32) and an FP32 floating-point scaling factor S. The quantized representation of X, denoted ˜X, is then expressed by:

˜X = (x − z)·S
The number x in INT8 format is the one that is stored and used for arithmetic operations. The quantization domain corresponds to a low-resolution range of variation of the number ˜X around a central value defined by z, placed within a high dynamic range (that of the factor S, an FP32 number). As x takes values in the range [−128, 127], the range of variation of ˜X becomes:

˜X ∈ [(−128 − z)·S, (127 − z)·S]
This range of variation, which in practice is supposed to cover a Gaussian distribution of X values around the zero point z, does not cover the entire range of variation of the number X. So, for values of X outside the range, ˜X is saturated at the boundaries of the range.
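As a point of reference, this mapping and its inverse can be sketched as follows in C, assuming per-tensor parameters S and z and rounding to the nearest integer (the function names are illustrative, not taken from the text):

#include <stdint.h>
#include <math.h>

/* Map an FP32 value X to an INT8 value x in the domain (z, S), saturating
   at the [-128, 127] boundaries, and recover the approximation
   ~X = (x - z) * S. */
static int8_t quantize(float X, float S, int32_t z)
{
    long q = lroundf(X / S) + z;        /* nearest integer, then offset by z */
    if (q > 127)  q = 127;              /* saturate at the domain boundaries */
    if (q < -128) q = -128;
    return (int8_t)q;
}

static float dequantize(int8_t x, float S, int32_t z)
{
    return (float)((int32_t)x - z) * S; /* ~X = (x - z) * S */
}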
There may be one quantization domain per activation tensor (a multidimensional matrix of activation values), and a channel of the tensor (one of its dimensions) may sometimes have a different quantization domain. Once the FP32 numbers have been quantized, they may be exploited using resource-efficient operations on the small integers x.
Despite the advantages of this quantization, FP32 arithmetic is still used in calculations that modify the representation of tensor elements between quantization domains. FP32 arithmetic is more costly to implement in terms of performance, power consumption and silicon area.
A hybrid hardware multiply-accumulate operator is generally provided, configured to multiply a floating-point multiplicand by an integer multiplicand and add an integer multiplication result to an integer operand.
The operator may comprise a first multiplicand input for the floating-point multiplicand having a mantissa field and an exponent field; a second multiplicand input for the integer multiplicand; an accumulator input for the integer operand; an integer multiplier configured to multiply the mantissa field of the floating-point multiplicand by the integer multiplicand and produce a result in a fixed-point format corresponding to a zero exponent of the floating-point multiplicand; a shifter connected to receive the result of the multiplier and shift it left or right according to the value of the exponent field of the floating-point multiplicand; and an integer adder receiving the integer operand and a window on the result of the shifter, the window capturing a number of bits to the left of the fixed decimal point position equal to the number of bits of the integer operand to within one carry bit.
The left and right shift amplitude of the shifter may be limited to the number of bits of the integer operand, and the operator may comprise a saturation circuit connected to the adder and the shifter, configured to produce the result of the adder when no bit to the left of the window is significant and the exponent field encodes a value less than the number of bits of the integer operand, and to produce a saturated result when at least one bit to the left of the window is significant, or the exponent field encodes a value greater than or equal to the number of bits of the integer operand.
The saturation circuit may be configured to round the adder result according to significant values of the bits to the right of the window.
The following non-limiting description is provided in relation to the attached drawings.
Hereinafter, a new type of hardware hybrid fused multiplication and addition (HFMA) operator is provided for efficient tensor conversion between different quantization domains. The structure of this HFMA operator results from a particular analysis and decomposition of the operations involved in converting between quantization domains. The resulting structure multiplies a signed integer (e.g., INT32) by a floating-point number (e.g., FP32), and adds a signed integer operand (INT32) to the product to provide an integer result in the same format as the operand. Since the result is in the same format as the addition operand, the structure may be used to calculate the accumulation of products, so it may also be referred to as a “Hybrid Multiply-ACcumulate” (HMAC) operator.
The most common calculation in deep learning networks is the accumulated scalar product of an activation vector Xi with a weight vector Wi, plus a bias Bj, according to the following relationship (as only quantized values are referred to hereinafter, the prefix "˜" is omitted to simplify the notation):

Yj = Σi=0..n Xi·Wi + Bj
Each of the terms is quantized in its own quantization domain, respectively (zy, Sy), (zx, Sx), (0, Sw) and (0, Sx·Sw). The zero point of the weights Wi and biases Bj is 0. The quantization domain (0, Sx·Sw) of Bj has the same scale factor as the quantization domain resulting from the product Xi·Wi.
The different terms are expressed as follows:

Yj = (yj − zy)·Sy
Xi = (xi − zx)·Sx
Wi = wi·Sw
Bj = bj·Sx·Sw
Recall that values S are FP32 scaling factors, values z are INT32 zero points, and xi and yj are INT8 numbers used for storage and recurrent operations.
Making these substitutions yields:

(yj − zy)·Sy = Σi=0..n (xi − zx)·Sx·wi·Sw + bj·Sx·Sw
The integer representing the quantized result is then expressed as:

yj = (Sx·Sw/Sy)·(Σi=0..n xi·wi + cj) + zy

with

cj = bj − zx·Σi=0..n wi
Each term cj is an INT32 integer constant that may be calculated off-line using simple integer arithmetic, such as a conventional multiply-accumulate (MAC) operator operating on INT32 integers. The sum Σi=0..n wi is an integer constant over the range of j values; it may be calculated once and stored to compute all the terms cj.
The basic recurring calculation on index j, conventionally involving operations in FP32 arithmetic, becomes an integer product and addition operation that multiplies the INT32 integer Σi=0..n xi·wi + cj by a real number Sx·Sw/Sy in FP32 format, and adds the INT32 integer zy, the result again being an INT32 number. This is exactly what the HFMA operator calculates. The yj values are then saturated to fit back into INT8 format, and are stored and reused as the quantized representation.
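A reference model of this decomposition in C may help fix ideas: the constant cj is built off-line with integer arithmetic only, and each output yj then requires a single hybrid multiply-accumulate, modelled here with an ordinary float product and rounding (names are illustrative; in hardware the hybrid operator replaces the lroundf-based line):

#include <stdint.h>
#include <math.h>

/* Off-line constant for output j: cj = bj - zx * sum(wi), pure integer math.
   sum_w is the sum of the wi over i = 0..n, computed once for all j. */
static int32_t make_cj(int32_t bj, int32_t zx, int32_t sum_w)
{
    return bj - zx * sum_w;
}

/* One output of the quantized dot product: the inner loop uses INT8/INT32
   arithmetic only; the final step is the hybrid operation
   yj = round(scale * (acc + cj)) + zy, with scale = Sx*Sw/Sy held in FP32. */
static int8_t quantized_dot(const int8_t *x, const int8_t *w, int n,
                            int32_t cj, float scale, int32_t zy)
{
    int32_t acc = 0;
    for (int i = 0; i <= n; i++)            /* n + 1 integer-only iterations */
        acc += (int32_t)x[i] * (int32_t)w[i];

    int32_t yj = (int32_t)lroundf(scale * (float)(acc + cj)) + zy;

    if (yj > 127)  yj = 127;                /* saturate back to INT8 storage */
    if (yj < -128) yj = -128;
    return (int8_t)yj;
}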
With conventional means, the operation would consist in converting the integer multiplicand and the integer addition operand into FP32 numbers, performing a fused multiplication and addition (FMA) of the resulting FP32 numbers, and converting the FP32 result into the INT32 integer, i.e., three or four operations instead of one, one of which involves a complex FMA operator for FP32 numbers.
This gain in the number of operations occurs, in the above example, once every n+1 iterations on i. In some cases, n may be of the same order as the number of iterations on j, so that a gain of the order of 3n operations is obtained.
The HFMA operator may also be used for occasional operations such as the sum of quantized vectors:
This corresponds to a nested application of the HFMA or HMAC operator (with two INT32 integer additions).
This calculation may be further optimized when the quantization domain of one of the inputs is the same as that of the output, which often occurs in practice. For example, if Sx=Sy and zx=zy then:
This corresponds to a single application of the HMAC operator (with one addition of INT32 integers).
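A sketch of this optimized case in C, writing the second input's quantization parameters as (zv, Sv) (these symbol names are not from the text, and the per-element formula below is one way to write the step consistent with the preceding relations): each element then costs one hybrid multiply-accumulate whose integer addend is x directly.

#include <stdint.h>
#include <math.h>

/* Element-wise sum of two quantized vectors when one input shares the output
   domain (Sx = Sy, zx = zy): y = round((Sv/Sy) * (v - zv)) + x, followed by
   INT8 saturation. The hybrid operator performs the multiply-add step. */
static int8_t quantized_add(int8_t x, int8_t v, float Sv_over_Sy, int32_t zv)
{
    int32_t y = (int32_t)lroundf(Sv_over_Sy * (float)((int32_t)v - zv))
              + (int32_t)x;
    if (y > 127)  y = 127;
    if (y < -128) y = -128;
    return (int8_t)y;
}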
The HMAC operator may also be used in combination with an operator designated FSCALE to quantize FP32 values to INT8 values without involving FP32 arithmetic. An FSCALE operation scales an FP32 value by an integer power of two, in the form of a simple integer addition over the 8-bit exponent field of the FP32 numerical representation.
For example, the case of Xi = (xi − zx)·Sx with 1/Sx = 2^e·m yields:

xi = (2^e·Xi)·m + zx
Here, the FSCALE operation produces an FP32 result (2^e·Xi), while e is INT8, and m and zx are INT32. Thus, the calculation of xi is once again an HMAC calculation, producing an INT32 value which is then converted to an INT8 value.
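The combination may be sketched as follows in C, with FSCALE modelled as an integer addition on the exponent field; exponent overflow and the special cases (zero, subnormal, infinite values) are ignored in this sketch, the multiply-add line stands in for the HMAC hardware step, and the function names are illustrative:

#include <stdint.h>
#include <string.h>
#include <math.h>

/* FSCALE: scale an FP32 value by 2^e through an integer addition on the
   8-bit exponent field (normal numbers only in this sketch). */
static float fscale(float X, int32_t e)
{
    uint32_t b;
    memcpy(&b, &X, sizeof b);
    uint32_t exp = (b >> 23) & 0xFFu;
    if (exp != 0u && exp != 0xFFu)          /* skip zero/subnormal/inf/NaN */
        b = (b & ~(0xFFu << 23))
          | ((((uint32_t)((int32_t)exp + e)) & 0xFFu) << 23);
    memcpy(&X, &b, sizeof X);
    return X;
}

/* Quantization of an FP32 value X with 1/Sx = 2^e * m:
   x = round(FSCALE(X, e) * m) + zx, then saturation to INT8. */
static int8_t quantize_fscale_hmac(float X, int32_t e, int32_t m, int32_t zx)
{
    int32_t x = (int32_t)lroundf(fscale(X, e) * (float)m) + zx;
    if (x > 127)  x = 127;
    if (x < -128) x = -128;
    return (int8_t)x;
}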
The floating-point multiplicand Op0 includes an 8-bit exponent field EXP and a 24-bit mantissa field MANT. The mantissa field MANT of the operand Op0, treated as an integer, and the integer multiplicand Op1 are supplied to an integer multiplier 10. The multiplier 10 produces a 57-bit product P. The product P is in fact a fixed-point number whose point position is chosen, by convention, for an exponent EXP equal to 0. In this case, the product corresponds to the multiplication of a number 1.xx…xx with 24 bits after the decimal point by a 32-bit integer XX…XX (in front of the decimal point), so that the fixed decimal point is positioned between bits 23 and 24 of the 57-bit product P.
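For reference, the field decomposition assumed at the input of the multiplier may be sketched in C as follows (this follows the standard IEEE 754 FP32 layout, with the implicit leading 1 restored into the significand of normal numbers; names are illustrative):

#include <stdint.h>
#include <string.h>

/* Split an FP32 operand into its sign, 8-bit exponent field and significand
   (23 stored fraction bits plus the implicit leading 1 for normal numbers). */
typedef struct { uint32_t sign; uint32_t exp; uint32_t mant; } fp32_fields;

static fp32_fields split_fp32(float op0)
{
    uint32_t b;
    memcpy(&b, &op0, sizeof b);
    fp32_fields f = {
        b >> 31,                            /* sign bit                    */
        (b >> 23) & 0xFFu,                  /* biased exponent field EXP   */
        (b & 0x7FFFFFu) | (1u << 23)        /* significand MANT (normals)  */
    };
    return f;
}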
A shifter circuit 12 receives product P at positions 31 to 87 of a 119-bit virtual number having 31 bits at 0 to the right of product P and 31 bits at 0 to the left of product P. The shift of circuit 12 is controlled by the exponent EXP of multiplicand Op0. When this exponent is positive, circuit 12 performs a corresponding shift to the left. When the exponent is negative, circuit 12 performs a corresponding shift to the right.
The “useful” output of shifter 12 is taken from a window covering positions 55 to 87 of the shifted number, i.e., the 33 bits in front of the fixed decimal point, forming a 32-bit integer with a carry bit. The unused bits to the right of the window are used for rounding, while the unused bits to the left are used for saturation.
Note that the maximum amplitude considered for the left or right shift is the size of the integer addition operand, here 32 bits (31 positions plus the unshifted position for a zero exponent), corresponding to a 6-bit signed exponent instead of 8 bits. This constrained amplitude corresponds to the limit positions at which significant bits of the shifted product may still lie in the window and form a usable integer. On the left, this corresponds to the product of an integer equal to 1 and a float with exponent +31. On the right, this corresponds to the product of an integer whose most significant bit is 1 and a float with exponent −31. Beyond an exponent of +31, the result is systematically saturated; below an exponent of −31, it is rounded to 0.
The contents of the 33-bit window are supplied to an adder 14, which also receives the 32-bit integer operand Op2. The adder 14 produces a 34-bit result, the 34th bit being a carry bit.
The result of adder 14 is fed to a saturation and rounding circuit 16, which also receives the 55 bits to the right and the 31 bits to the left of the window of shifter 12. Circuit 16 produces a final result R1 in the form of a signed integer INT32, under the following conditions.
The result R1 is saturated to the largest positive integer when at least one of the following conditions is met:
The result R1 is saturated to the smallest negative integer when at least one of the following conditions is met:
In the description of the circuit shown in
Furthermore, since the operator involves a floating-point number FP32 as a multiplicand, this number may be indefinite (NaN) or infinite in certain situations. If it is infinite and the integer multiplicand Op1 is non-zero, the result R1 is saturated to the largest positive integer or the smallest negative integer, depending on the sign of the result. If it is indefinite or infinite and multiplied by a zero integer, the result R1 may be saturated by convention, for example to the smallest negative integer.
Finally, the result R1 is rounded according to the significant bits to the right of the window, for example following one of the rounding modes of the IEEE 754 floating-point standard.
The operator may also raise the IEEE 754 overflow, underflow, inexact and invalid-operation flags. The underflow flag is raised when the unrounded result is non-zero and strictly between −1 and 1 (so-called subnormal numbers).
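Pulling the preceding description together, a behavioral C model of the complete operation (integer multiply of the significand, shift by the unbiased exponent, windowing, integer addition, saturation and round-to-nearest-even) may be sketched as below. It is a functional model, not the datapath itself: it follows the standard IEEE 754 field widths rather than the exact bit positions given above, omits the status flags, treats NaN with a non-zero multiplicand by the same convention as the zero-multiplicand case, and assumes a GCC/Clang-style compiler for the 128-bit intermediate type.

#include <stdint.h>
#include <string.h>

/* Behavioral sketch of R1 = saturate_int32(round(Op0 * Op1) + Op2). */
static int32_t hfma(float op0, int32_t op1, int32_t op2)
{
    uint32_t b;
    memcpy(&b, &op0, sizeof b);
    int      neg  = (int)(b >> 31) ^ (op1 < 0);        /* sign of the product */
    uint32_t expf = (b >> 23) & 0xFFu;                  /* biased exponent     */
    uint64_t frac = b & 0x7FFFFFu;

    if (expf == 0xFFu) {                                /* infinity or NaN     */
        if (frac != 0 || op1 == 0)
            return INT32_MIN;                           /* saturation by convention */
        return neg ? INT32_MIN : INT32_MAX;             /* infinity * non-zero */
    }
    if (op1 == 0 || (expf == 0 && frac == 0))
        return op2;                                     /* zero product        */

    uint64_t mant = (expf ? frac | (1u << 23) : frac);  /* significand         */
    uint64_t mag  = (op1 < 0) ? (uint64_t)(-(int64_t)op1) : (uint64_t)op1;

    /* Exact |product| = (mant * mag) * 2^(exp - 127 - 23). */
    unsigned __int128 p = (unsigned __int128)mant * mag;    /* <= 56 bits      */
    int shift = (int)(expf ? expf : 1u) - 150;

    unsigned __int128 ipart;                            /* rounded integer part */
    if (shift >= 0) {
        ipart = (shift > 40) ? ((unsigned __int128)1 << 100) /* forces saturation */
                             : (p << shift);
    } else {
        int s = -shift;
        if (s > 60) {
            ipart = 0;                                  /* far below 1: rounds to 0 */
        } else {
            unsigned __int128 fl   = p >> s;
            unsigned __int128 rem  = p - (fl << s);
            unsigned __int128 half = (unsigned __int128)1 << (s - 1);
            if (rem > half || (rem == half && (fl & 1)))
                fl += 1;                                /* round to nearest even */
            ipart = fl;
        }
    }

    __int128 res = (neg ? -(__int128)ipart : (__int128)ipart) + op2;
    if (res > INT32_MAX) return INT32_MAX;              /* positive saturation */
    if (res < INT32_MIN) return INT32_MIN;              /* negative saturation */
    return (int32_t)res;
}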