The present disclosure relates to digital circuits, and in particular, to systems and methods for numerical precision in digital multiplier circuitry.
Digital circuits process logical signals represented by zeros (0) and ones (1) (i.e., bits). A digital multiply-accumulator is an electronic circuit capable of receiving multiple digital input values, determining a product of the input values, and summing the results. Performing digital multiply-accumulate operations can raise a number of challenges. For example, data values being multiplied may be represented digitally in a number of different data types. However, including different multipliers to handle all the different data types a system may need to process would consume circuit area and increase complexity.
One application where digital multiplication of different data types is particularly useful is machine learning (aka artificial intelligence). Such applications may receive large volumes of data values in a multiply-accumulator. Accordingly, such systems require particularly fast, efficient, and/or accurate multiply-accumulators capable of handling multiple different data types to carry out various system functions.
Embodiments of the present disclosure pertain to digital multimodal multiplier systems and methods. In one embodiment, the present disclosure includes a circuit comprising a plurality of multimodal multiplier circuits, the multimodal multiplier circuits comprising one or more storage register circuits for storing digital bits corresponding to one or more first operands and one or more second operands. In a first mode, the one or more storage register circuits store one first operand and one second operand having a first data type. In a second mode, the one or more storage register circuits store a first plurality of operands and a second plurality of operands having a second data type. A plurality of multiplier circuits are configured to receive the one or more first operands and the one or more second operands. In the first mode, the one first operand and the one second operand are multiplied in one or more of the plurality of multiplier circuits. In the second mode, a first operand of the first plurality of operands is multiplied with a first operand of the second plurality of operands and a second operand of the first plurality of operands is multiplied with a second operand of the second plurality of operands in the plurality of multiplier circuits.
In one embodiment, the first operands are weights and the second operands are activation values.
In one embodiment, the one first operand and the one second operand having the first data type comprise floating point values, and the first and second plurality of operands having the second data type comprise integer values.
In one embodiment, at least one of the plurality of multiplier circuits are used to multiply operands in both the first mode and the second mode. In another embodiment, a number of multiplier circuits used to multiply operands in the first mode is the same as a number of multiplier circuits used to multiply operands in the second mode.
In one embodiment, the one first operand and the one second operand having the first data type comprise a greater number of bits than the first and second plurality of operands having the second data type.
In one embodiment, multiplier circuitry is used to multiply an operand of a first format and another operand of the first format. One or more storage register circuits of the multiplier circuitry store digital bits corresponding to the operand of the first format and the other operand of the first format. A decomposing circuit of the multiplier circuitry decomposes the operand into a first plurality of operands, and decomposes the other operand into a second plurality of operands. The multiplier circuitry further includes a plurality of multiplier circuits. Each multiplier circuit multiplies a respective first operand of the first plurality of operands with a respective second operand of the second plurality of operands to generate a corresponding partial result of a plurality of partial results. An accumulator circuit coupled to the plurality of multiplier circuits accumulates the plurality of partial results using a second format to generate a complete result of the second format that is stored in the accumulator circuit. A conversion circuit converts the complete result of the second format into an output result of an output format.
In another embodiment, the techniques described herein are incorporated in a hardware description language program, the hardware description language program comprising sets of instructions, which when executed produce a digital circuit. The hardware description language program may be stored on a non-transitory machine-readable medium, such as a computer memory (e.g., a data storage system).
The following detailed description and accompanying drawings provide a better understanding of the nature and advantages of the present disclosure.
In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present disclosure. Such examples and details are not to be construed as unduly limiting the elements of the claims or the claimed subject matter as a whole. It will be evident to one skilled in the art, based on the language of the different claims, that the claimed subject matter may include some or all of the features in these examples, alone or in combination, and may further include equivalent modifications of the features and techniques described herein.
Numerical precision is critical for many artificial intelligence (AI) and machine learning (ML) applications. However, because errors accumulate when addends are smaller than the range supported by the significand, numerical precision is often sacrificed. Although approaches to minimize error accumulation are known, for example, using a higher precision format such as floating-point 32-bit (FP32) to accumulate floating-point 16-bit (FP16) addends, such approaches require a large number of FP32 significand bits (e.g., 23 bits). Integer accumulation is loss-less, but requires large registers and may slow calculations.
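The problem can be seen with a short sketch. This is a hedged illustration using Python's FP64 doubles rather than FP16 hardware, but the mechanism is identical: addends smaller than the accumulator's significand spacing are silently rounded away, while a wide integer accumulator is loss-less:

```python
# At magnitude 2**53, the spacing between adjacent FP64 values (the ulp)
# is 2.0, so an addend of 1.0 is at most half an ulp and rounds away
# under round-to-nearest-even.
fp_acc = 2.0 ** 53
for _ in range(1000):
    fp_acc += 1.0            # each add rounds back to 2**53

# A loss-less (wide integer) accumulator keeps every addend.
int_acc = 2 ** 53
for _ in range(1000):
    int_acc += 1

print(fp_acc == 2.0 ** 53)   # True: all 1000 small addends were lost
print(int_acc - 2 ** 53)     # 1000
```

The same loss occurs when FP16 products are accumulated in an FP16 or even FP32 register, which motivates the loss-less accumulation described below.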
The present disclosure describes a computing system that provides numerical precision equivalent to or better than FP32 numerical representation using integer formatted operands, e.g., 8-bit (INT8) or 4-bit (INT4) integer format operands. In one or more embodiments, the computing system presented herein converts operands from a floating point format to an integer format and implements a Toom-Cook decomposition algorithm to perform a plurality of integer multiplications to generate a plurality of partial multiplication results. The partial multiplication results are then shifted so that each is aligned with its appropriate power (e.g., 1, 10, 100). After that, the partial multiplication results are accumulated in one or more accumulation registers using the TruePoint™ (TP) numerical precision (i.e., fixed point format representation). A final multiplication result is obtained by rounding (i.e., truncating) the accumulated result to a desired numerical precision (e.g., FP32 numerical representation).
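The shift-then-accumulate flow can be sketched as follows. This is a simplified software model, not the hardware implementation; base 10 is used for readability where the circuit would shift by powers of two, and the function name is illustrative:

```python
def accumulate_partials(partials, base=10):
    """Align each (value, power) partial product to its power and
    accumulate losslessly in a wide integer accumulator."""
    acc = 0
    for value, power in partials:
        acc += value * base ** power   # the "shift" aligns each partial
    return acc

# Partial products of 23 * 35 after decomposition into digits:
# 2*3 aligned to 100, (2*5 + 3*3) aligned to 10, and 3*5 aligned to 1.
partials = [(2 * 3, 2), (2 * 5 + 3 * 3, 1), (3 * 5, 0)]
assert accumulate_partials(partials) == 23 * 35   # 805
```

Because the accumulator is a plain wide integer, no precision is lost until the single final rounding step.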
In accordance with embodiments of the present disclosure, a tensor streaming processor (TSP) may be utilized as a core processor module of the computing system presented herein. The TSP is particularly suited for computations in AI and ML applications. The TSP is a device that is commercially available from Groq, Inc. of Mountain View, Calif. For use in commerce, the Groq TSP Node™ Accelerator Card is available as an x16 PCI-Express (PCIe) 2-slot expansion card that hosts a single Groq Chip1™ device.
Referring now to
As further illustrated in
Referring again to
As mentioned above, operands having the first data type (e.g., floating point values) may have a greater number of bits than operands having the second data type (e.g., integers). Accordingly, multiplier circuit 210 may be configured to multiply inputs having a greater number of bits than multiplier circuit 211, for example. In this example, operands having the second data type entering multiplier 210 may be sign extended to match the extended bit capabilities of multiplier circuit 210. For instance, the multimodal multiplier circuits may further comprise a sign extension circuit 212 coupled to outputs of the first and second storage register circuits 200 and 201 to receive, in the second mode, one of the first plurality of operands (e.g., Op1) from the first storage register circuit 200 and one of the second plurality of operands (e.g., Op3) from the second storage register circuit 201, for example. Sign extension circuit 212 may increase the number of bits of each binary number (e.g., Op1 and Op3) while preserving the number's sign (positive/negative) and value, for example. Another select circuit 204 receives the mode control signal to couple inputs of multiplier 210 to either outputs of the sign extension circuit 212 to receive operands of the second data type, or alternatively, to outputs of select circuits 202 and 203 to receive operands of the first data type.
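Sign extension as performed by circuit 212 can be modeled in software as follows (a sketch; the function name and bit widths are illustrative, not taken from the disclosure):

```python
def sign_extend(value, from_bits, to_bits):
    """Widen a two's-complement value from `from_bits` to `to_bits`,
    preserving its sign (positive/negative) and its magnitude."""
    value &= (1 << from_bits) - 1          # keep only the narrow field
    if value & (1 << (from_bits - 1)):     # sign bit set -> negative
        value -= 1 << from_bits            # recover the signed value
    return value & ((1 << to_bits) - 1)    # re-encode at the wider width

# -5 as an 8-bit value widens to -5 as a 16-bit value: the sign bit
# is replicated into the new high-order positions.
assert sign_extend(0b11111011, 8, 16) == 0b1111111111111011
assert sign_extend(0b00000101, 8, 16) == 0b0000000000000101
```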
As mentioned above, in some applications operands coupled to input registers 200 and 201 may be floating point numbers. Accordingly, a multimodal multiplier circuit may further comprise an adder circuit 213. In one mode, exponent bits of one operand (e.g., a floating point operand) in storage register circuit 200 and exponent bits in a second operand (e.g., another floating point operand) in storage register circuit 201 are coupled to adder circuit 213 (designated as dashed lines for when floating point is used). Floating point values may have the form "significand×base^exponent," where the exponents of two FP operands may be added in adder 213 and significands (aka the mantissa) of the FP operands are multiplied in multiplier 210, for example. Floating point numbers may be represented in the system using more bits than integers, for example, and thus multiplier 210 may have more bits than multiplier 211, which may only multiply operands having the second data type, for example. As described in more detail below, outputs of multipliers 210 and 211 and adder 213 may be further processed and added to other multiplier outputs.
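The division of labor between adder 213 (exponents) and multiplier 210 (significands) follows from the identity (s_a × base^e_a) × (s_b × base^e_b) = (s_a × s_b) × base^(e_a + e_b). The sketch below operates on unpacked values; normalization, rounding, and exponent bias are omitted for clarity, and the function name is illustrative:

```python
def fp_multiply(sig_a, exp_a, sig_b, exp_b):
    """Multiply two unpacked floating point values of the form
    significand * 2**exponent: the exponents are summed (adder 213's
    role) while the significands are multiplied (multiplier 210's role)."""
    return sig_a * sig_b, exp_a + exp_b

# (3 * 2**4) * (5 * 2**-2) = 15 * 2**2 = 60
sig, exp = fp_multiply(3, 4, 5, -2)
assert (sig, exp) == (15, 2)
assert sig * 2 ** exp == (3 * 2 ** 4) * (5 * 2 ** -2)
```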
One example application of the techniques described herein is in machine learning processors (aka artificial intelligence processors, e.g., neural networks). Such processors may require large volumes of multiply-accumulate operations, and it may be desirable in many applications to flexibly process input data represented in a variety of different data types, such as signed integer, unsigned integer, or floating point (e.g., FP16 IEEE 754). Accordingly, in one embodiment, the first operands are weights and the second operands are activation values and the circuits and methods described herein are implemented in a machine learning processor. For example, one mode may configure a machine learning processor to multiply floating point (FP) numbers. Accordingly, a first FP operand corresponding to a weight may be stored in register 200 and a second FP operand corresponding to an activation (e.g., a pixel value of an input image) may be stored in register 201. In the example shown in
In another mode, the circuit may receive operands A and B having a different data type with a greater number of bits. For example, operands A and B may be 16-bit floating point numbers. Accordingly, these operands may be stored as components in different register segments of registers 230-231. For example, one operand A may be stored as two components in two register segments in register 230, and another operand B may be stored as two components in two register segments in register 231. In one embodiment, operand A comprises a first component (e.g., lower order bits) received on A0 and stored in register segment 230A and a second component (e.g., higher order bits) received on A2 and stored in register segment 230C. Operand B comprises a first component (e.g., lower order bits) received on B0 and stored in register segment 231A and a second component (e.g., higher order bits) received on B1 and stored in register segment 231B, for example. Embodiments of the present disclosure may selectively couple different input bits into different register segments in different modes. For example, in this mode, the first component of A on input A0 may be coupled to and stored in register segment 230B, and the second component of A on input A2 may be coupled to and stored in register segment 230D. Similarly, the first component of B on input B0 may be coupled to and stored in register segment 231C, and the second component of B on input B1 may be coupled to and stored in register segment 231D. The selective arrangement of inputs in different register segments for different modes is illustrated in
Output product values C0-C3 of components of the inputs may be stored in register 237, for example. In this mode, outputs of multipliers 232-235 may be coupled to shift circuits 240-243. Outputs of shift circuits 240-243 are coupled to an adder circuit to produce an output product of the inputs A*B. For example, C0 may be coupled to shift circuit 240, which may have a nominal shift value of 0, C1 may be coupled to shift circuit 241, which may have a nominal shift value of N (where N is the number of bits of the input component, e.g., N=8 for an 8 bit component into each multiplier), C2 may be coupled to shift circuit 242, which may have a nominal shift value of N, and C3 may be coupled to shift circuit 243, which may have a nominal shift value of 2N. Each shift circuit may perform a left shift, for example. Accordingly, in this example, products of lower order bits A0B0 are not shifted, products of higher and lower order bits A2B0 and B1A0 are shifted by N, and products of higher order bits A2B1 are shifted by 2N. From the above it can be seen that in some embodiments shifter 240 may be omitted since C0 is not shifted. However, in one embodiment, exponent bits of floating point operands, expA and expB, may be input to adder circuit 260 and added together and the result used to increase the shift performed by each shift circuit. For example, an output of adder circuit 260 is coupled to a control input of each shift circuit 240-243 so that the sum of exponent bits expA and expB may increase the shift of each shift circuit (e.g., expA=1; expB=2; increase each shift by 3). The outputs of the shift circuits are summed in an adder circuit 244, which may comprise a plurality of N-bit adders, for example. The shifted and added output product values may provide a second output (Out2) in one of the modes, which may be a fixed point representation, for example.
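The shift amounts 0, N, N, and 2N implement ordinary long multiplication on the operand components, as the following sketch shows (illustrative names; N=8 as in the example above):

```python
def multiply_by_components(a, b, n=8):
    """Multiply two 2n-bit operands from four n-bit partial products,
    shifted by 0, n, n, and 2n before summing (shifters 240-243)."""
    mask = (1 << n) - 1
    a0, a2 = a & mask, a >> n      # low / high components of A
    b0, b1 = b & mask, b >> n      # low / high components of B
    c0 = a0 * b0                   # lower-order product, no shift
    c1 = a2 * b0                   # cross product, shifted by n
    c2 = a0 * b1                   # cross product, shifted by n
    c3 = a2 * b1                   # higher-order product, shifted by 2n
    return c0 + (c1 << n) + (c2 << n) + (c3 << 2 * n)

assert multiply_by_components(0xBEEF, 0xCAFE) == 0xBEEF * 0xCAFE
```

This is why only four n-bit multipliers are needed to produce a full 2n-bit product.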
Accordingly, in some embodiments, multiplication of the inputs may result in output products being converted to a third data type, which may be added to output products of other multimodal multiplier circuits as described below.
The circuit in
As described in more detail below, some embodiments of multiplier 310 may, in the first mode, produce floating point values, which are then converted to a third data type, such as fixed point, having an extended bit length to achieve wide dynamic range and accuracy. In one embodiment, a fixed point value may comprise a number of bits equal to at least N (e.g., N=4) times the number of bits produced by products of operands (e.g., Op4*Op2, Op5*Op7, Op6*Op8) having the second data type (e.g., 8-bit integer). Accordingly, the same adder 330 and output register 340 may be used to store one extended length data type or multiple integer data types, for example, which may have advantages including reduced circuit area, for example.
In this example, equalizing the number of bits between first and second modes may include concatenating the multiplier outputs, for example, using concatenation circuit 402. Accordingly, in the second mode, select circuit 401 couples the output of multiplier 210 to one input of concatenation circuit 402, and other inputs of concatenation circuit 402 may be coupled to outputs of other multiplier circuits, such as multiplier circuit 211 as shown in
As illustrated in
Referring again to
The outputs of each multimodal multiplier 510A-N may be coupled to adder 520, which may (in some embodiments) correspond to adder 330 in
In some embodiments, each multiply-accumulator circuit 500-502 may comprise an input register circuit having an input coupled to an output register circuit of another multimodal multiply-accumulator circuit. For example, multiply accumulator circuit 500 includes an input register 540, which may be configured to receive one or more sums from multiply-accumulator 501 based on the mode the system is operating in, for example. Accordingly, when multiply-accumulator circuits 500 and 501 are in a first mode, input register 540 receives and stores a single input value, which may have the third data type (e.g., an extended fixed point value), and when multiply-accumulator circuits 500 and 501 are in a second mode, input register 540 receives and stores a plurality of input values having the second data type (e.g., four (4) integer values).
An output of register 540 is coupled to the adder circuit 520. Accordingly, in the first mode, a plurality of values, one from each multimodal multiplier 510A-N, may be added together and further added to the single input value in register 540. Alternatively, in the second mode, multiple values from each multimodal multiplier 510A-N and the multiple values from input register 540 are added, where values corresponding to particular columns are added to other values corresponding to particular columns. For example, if there are four values in input register 540 and four multipliers used in each multimodal multiplier 510A-N in the second mode, then a first of the four values from register 540 may be added with values from N multipliers 310 (See
In some embodiments, a digital system, such as a computer system based on the TSP core 100, utilizes either a floating point format or an integer format to store representations of input operands in a compressed format while arithmetic calculations (e.g., multiplications and additions) can be performed in an integer format. The results of arithmetic operations are accumulated in one or more accumulate registers using the TP format (i.e., fixed point numerical representation), and a final multiplication result is obtained by truncating the accumulation result to a desired precision (e.g., FP32). More specifically, the TP format is a fixed point numerical representation of an accumulation of FP16 products that avoids the need for higher precision calculations in the matrix multiplication loop. The TP format represents a fixed point numerical representation for the accumulation result having an accuracy comparable to a higher precision FP numerical representation (e.g., FP64 numerical precision). At an output of the MXM 117 and/or the MXM 118, a sum of products is converted from the TP format (i.e., the fixed point loss-less integer representation) to, e.g., FP32 numerical representation with only 23 bits of significand.
A decomposition circuit 704 decomposes the operand into a first plurality of operands (e.g., smaller integer numbers). The decomposition circuit 704 further decomposes the other operand into a second plurality of operands (e.g., smaller integer numbers). The decomposition circuit 704 may decompose the operand and the other operand by applying, e.g., a Toom-Cook decomposition algorithm. Details about the Toom-Cook decomposition algorithm are provided further below.
The first plurality of multipliers 706A, . . . , 706N and the second plurality of multipliers 708A, . . . , 708M are integer multipliers. When the first format is an integer format, each operand of the first plurality of operands is routed from the decomposition circuit 704 to each multiplier of the first plurality of multipliers 706A, . . . , 706N as well as to each multiplier of the second plurality of multipliers 708A, . . . , 708M. Similarly, each operand of the second plurality of operands is routed from the decomposition circuit 704 to each multiplier of the first plurality of multipliers 706A, . . . , 706N as well as to each multiplier of the second plurality of multipliers 708A, . . . , 708M. Each pair of operands from the first and second pluralities of operands is mutually multiplied in a corresponding multiplier of the first and second pluralities of multipliers 706A, . . . , 706N, 708A, . . . , 708M to generate a corresponding partial result of a plurality of partial results. The partial results generated by the multipliers 706A, . . . , 706N, 708A, . . . , 708M are stored in corresponding registers 709A, . . . , 709N, 710A, . . . , 710M.
When the first format is a floating point format, a significand portion from each operand of the first plurality of operands is routed from the decomposition circuit 704 to each multiplier of the first plurality of multipliers 706A, . . . , 706N as well as to each multiplier of the second plurality of multipliers 708A, . . . , 708M. Similarly, a significand portion from each operand of the second plurality of operands is routed from the decomposition circuit 704 to each multiplier of the first plurality of multipliers 706A, . . . , 706N as well as to each multiplier of the second plurality of multipliers 708A, . . . , 708M. Each pair of significand portions from the first and second pluralities of operands is mutually multiplied in a corresponding multiplier of the first and second pluralities of multipliers 706A, . . . , 706N, 708A, . . . , 708M to generate a corresponding partial result stored in a corresponding register 709A, . . . , 709N, 710A, . . . , 710M. Additionally, an exponent portion from each operand of the first plurality of operands is routed from the decomposition circuit 704 to each adder of a first plurality of adders 705A, . . . , 705N as well as to each adder of a second plurality of adders 707A, . . . , 707M. Similarly, an exponent portion from each operand of the second plurality of operands is routed from the decomposition circuit 704 to each adder of the first plurality of adders 705A, . . . , 705N as well as to each adder of the second plurality of adders 707A, . . . , 707M. Each pair of exponent portions from the first and second pluralities of operands is mutually summed in a corresponding adder of the first and second pluralities of adders 705A, . . . , 705N, 707A, . . . , 707M to generate a corresponding exponent Exp11, . . . , ExpN1, Exp1M, . . . , ExpNM. When the first format is an integer format, the first and second pluralities of adders 705A, . . . , 705N, 707A, . . . , 707M are not utilized.
In such case, the adders 705A, . . . , 705N, 707A, . . . , 707M can be turned off based on the Mode signal, all zero bits are routed to the inputs of the adders 705A, . . . , 705N, 707A, . . . , 707M, or the adders 705A, . . . , 705N, 707A, . . . , 707M are bypassed in some other manner and their outputs are not utilized.
When the first format is a floating point format, each partial result stored in the corresponding register 709A, . . . , 709N, 710A, . . . , 710M is shifted at a corresponding shift circuit 713A, . . . , 713N, 714A, . . . , 714M by a number of bits equal to a value of a respective exponent Exp11, . . . , ExpN1, Exp1M, . . . , ExpNM output from a corresponding adder 705A, . . . , 705N, 707A, . . . , 707M. Each shifted partial result is passed onto a corresponding conversion circuit 715A, . . . , 715N, 716A, . . . , 716M. The conversion circuits 715A, . . . , 715N, 716A, . . . , 716M convert the plurality of partial results to the TP format, i.e., to the fixed point numerical representation. A position of the radix point in the TP numerical representation of each shifted partial result is based on a value of the respective exponent Exp11, . . . , ExpN1, Exp1M, . . . , ExpNM.
When the first format is an integer format, shifting and conversion are not required, i.e., the shift circuits 713A, . . . , 713N, 714A, . . . , 714M and the conversion circuits 715A, . . . , 715N, 716A, . . . , 716M are bypassed using, e.g., corresponding demultiplexers 711A, . . . , 711N, 712A, . . . , 712M controlled by an appropriate value of the Mode signal. In such case, the partial results stored in the registers 709A, . . . , 709N, 710A, . . . , 710M are directly provided to an accumulator circuit 719, e.g., via corresponding multiplexers 717A, . . . , 717N, 718A, . . . , 718M controlled by an appropriate value of the Mode signal. When the first format is a floating point format, the shifted partial results at the outputs of the conversion circuits 715A, . . . , 715N, 716A, . . . , 716M are provided to the accumulator circuit 719, e.g., via corresponding multiplexers 717A, . . . , 717N, 718A, . . . , 718M controlled by an appropriate value of the Mode signal.
The accumulator circuit 719 accumulates the plurality of partial results (or the plurality of shifted partial results) using the second format (i.e., the TP numerical representation) to generate a complete result of the second format that is also stored in a register of the accumulator circuit 719. In a preferred embodiment, in order to minimize accumulation of an error, the accumulator circuit 719 accumulates the plurality of partial results from a smallest partial result among the plurality of partial results to a largest partial result among the plurality of partial results. Although
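In a loss-less TP accumulator the summation order cannot change the result, but the benefit of small-to-large ordering is easy to see wherever rounding can occur, as in this floating point sketch:

```python
# One large addend and 2**20 tiny ones, summed in two different orders.
values = [1.0] + [2.0 ** -60] * (2 ** 20)

# Small-to-large: the tiny addends first combine exactly into 2**-40,
# which is large enough to survive the final addition to 1.0.
ascending = sum(sorted(values))

# Large-to-small: each individual 2**-60 addend is below half an ulp
# of 1.0 and rounds away, so the accumulator never moves off 1.0.
descending = sum(sorted(values, reverse=True))

assert ascending == 1.0 + 2.0 ** -40
assert descending == 1.0
```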
In an illustrative embodiment, when FP16 matrix multiplication operations utilize the accumulator circuit 719 for accumulations (e.g., within the MXM 117 or the MXM 118) with the precision of, e.g., a 91-bit integer, a register of the accumulator circuit 719 is at least 116 bits wide because 22 compressed carry bits and three status bits are used for carry information to enable calculations using a faster clock frequency. Accumulated multiplier results are converted from the 116-bit register of the accumulator circuit 719 with 91-bit integer precision to FP32 using a truncation/conversion circuit 720 coupled to an output of the accumulator circuit 719. The truncation/conversion circuit 720 may be part of the NIM 115 or the NIM 116, and the conversion may occur when the accumulated multiplier results are streamed from the MXM 117 or the MXM 118 to the VXM 110.
In another embodiment, for INT8 matrix multiplication operations and accumulation, a width of each partial output sum at the register of the accumulator circuit 719 is 25 bits. A total of four partial sums are concatenated to 100 bits at the register of the accumulator circuit 719 to achieve INT32 precision. The remaining bits in the register of the accumulator circuit 719 are not used. The value produced and stored at the register of the accumulator circuit 719 is in a fully loss-less INT32 format, i.e., the TP format with INT32 numerical representation.
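The concatenation of independent partial sums into one wide register can be modeled as packing fixed-width lanes (a sketch with illustrative names; the 25-bit lane width follows the example above, and the lane ordering is an assumption):

```python
def pack_partial_sums(partials, width=25):
    """Concatenate independent partial sums into one register image,
    each sum occupying its own `width`-bit lane."""
    reg = 0
    for lane, value in enumerate(partials):
        assert 0 <= value < (1 << width), "partial sum overflows its lane"
        reg |= value << (lane * width)
    return reg

def unpack_partial_sum(reg, lane, width=25):
    """Read one lane back out of the packed register."""
    return (reg >> (lane * width)) & ((1 << width) - 1)

# Four 25-bit lanes occupy 100 bits of the accumulator register.
reg = pack_partial_sums([1, 2, 3, 4])
assert reg.bit_length() <= 100
assert all(unpack_partial_sum(reg, i) == v for i, v in enumerate([1, 2, 3, 4]))
```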
In yet another embodiment, an accumulator in the NIM 115 (or in the NIM 116) performing a full sum operation would resolve compressed carry bits in a 112-bit word to 90 bits, and then accumulate multiple 256×256 matrix multiplication output values, with a maximum capacity to accumulate up to 2^38 90-bit TP numbers into a single INT128. If matrix multiplications are interleaved, then the partial (interim) results are added separately. The VXM 110 may comprise an arbitrary precision arithmetic instruction that includes a carry that is persistent to the next clock cycle. Using an initial ADD_MOD, a series of ADD_MOD_CI instructions, and an optional final ADD_MOD_CI INT32 0,0 to get the final carry bit, any size INT can be accumulated at the accumulator circuit 719.
Because of a size of the accumulator circuit 719, no rounding (i.e., truncation) is applied during the accumulation in the accumulator circuit 719. The only rounding (i.e., truncation) is applied to a final accumulation result to obtain a final multiplication result of a desired floating point precision (e.g., FP16, FP32, FP64 precision, or some other floating point precision). In one or more embodiments, significands of input operands are converted to integer format (e.g., at the conversion circuits 702, 703), enabling the multipliers 706A, . . . , 706N, 708A, . . . , 708M to perform a fused dot product operation instead of a fused multiply accumulate operation. The result of the fused dot product operation is obtained and stored within the register of the accumulator circuit 719 to maintain a pre-defined precision, e.g., the precision of at least 80 bits. For example, when the multiplier circuitry of
An accumulated result in the second format (e.g., TP format) stored in the register of the accumulator circuit 719 represents a complete multiplication result. The truncation/conversion circuit 720 coupled to the register of the accumulator circuit 719 converts the complete multiplication result of the second format (e.g., the TP number) into an output result of an output format that is stored in an output register 721. The truncation/conversion circuit 720 may convert the complete multiplication result from the second format into the output format by first selectively truncating a portion of the complete multiplication result stored in the register of the accumulator circuit 719. After the truncation, the truncation/conversion circuit 720 converts the complete multiplication result (i.e., the truncated accumulation result) into the output format, e.g., FP32 format, FP64 format, FP128 format, or some other floating point format. The conversion by the truncation/conversion circuit 720 may be based on a desired output precision provided to the truncation/conversion circuit 720 via an “Out_Format” signal, as shown in
For example, the rounding (i.e., truncation) to the FP32 format in accordance with the IEEE 754 standard uses 8 bits to represent an exponent and 23 bits to represent a significand. For example, it can be shown that the accumulation at the accumulator circuit 719 with truncation of a final accumulated result to the FP32 format precision at the truncation/conversion circuit 720 provides the calculation rate of approximately 4.98 teraflops. Note that "one teraflops" represents a computing speed of one trillion floating point operations per second while providing numerical results with precision equivalent to an FP32 unit. The rounding (i.e., truncation) to the FP16 format in accordance with the IEEE 754 standard uses 5 bits for the exponent and 10 bits for the significand. It can be shown that the accumulation at the accumulator circuit 719 with truncation of a final accumulated result to the FP16 format precision at the truncation/conversion circuit 720 provides the calculation rate of approximately 403 teraflops. Additionally, the rounding (i.e., truncation) to FP16 representation with 8 exponent bits and 7 bits for the significand can be utilized, which is denoted as bfloat16 or BF16. It can be shown that the accumulation at the accumulator circuit 719 with truncation of a final accumulated result to the BF16 format precision at the truncation/conversion circuit 720 provides the calculation rate of approximately 44.78 teraflops.
In one or more embodiments, as aforementioned, the decomposition circuit 704 performs the decomposition of large integers by applying the Toom-Cook decomposition algorithm in order to obtain smaller integers suitable for faster integer multiplications. Alternatively, the decomposition circuit 704 can apply the Toom-Cook3 decomposition algorithm. The decomposition circuit 704 that applies the Toom-Cook algorithm (or, alternatively, the Toom-Cook3 algorithm) can be a building block of the VXM 110, MXM 117 and/or the MXM 118 separate from digital multiplier circuitry.
A simplified version of the Toom-Cook decomposition algorithm is illustrated herein by way of example in the case of multiplying the pair of integers 23 and 35. The following polynomials that represent decomposed integers 23 and 35 are obtained: p(x)=2x+3, q(x)=3x+5, where p(x) represents decomposition of 23 into smaller integers 2 and 3, and q(x) represents decomposition of 35 into smaller integers 3 and 5, where x equals 10. Accordingly, the result of the multiplication would be p(x)*q(x)=r(x). Decomposing the significands of the first and second numbers into a first and second plurality of operands according to the Toom-Cook algorithm yields the polynomial equation (2x+3)*(3x+5)=ax^2+bx+c=r(x) with smaller integers mutually multiplied, where a, b and c are unknown parameters. From p(0)*q(0)=r(0), it can be determined that c=15. From p(1)*q(1)=r(1), it follows, after substitutions for x and c, that a+b=25; and from p(−1)*q(−1)=r(−1), it follows, after substitutions for x and c, that a−b=−13. From the two linear equations with two unknowns a+b=25 and a−b=−13, it can be determined that a=6 and b=19. Thus, the result of multiplication is p(x)*q(x)=r(x)=6x^2+19x+15. By substituting x=10 in r(x), the result of multiplication can be obtained as r(10)=600+190+15=805.
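The walkthrough above can be verified with a short sketch of the evaluate-at-{0, 1, −1}-and-interpolate scheme (illustrative function name; each coefficient pair is (high digit, low digit)):

```python
def toom2_multiply(p_coeffs, q_coeffs, x=10):
    """Multiply p(x)=p1*x+p0 and q(x)=q1*x+q0 by evaluating at 0, 1,
    and -1, then interpolating r(x)=a*x^2+b*x+c, as in the 23*35 example."""
    p1, p0 = p_coeffs
    q1, q0 = q_coeffs
    r0 = p0 * q0                       # r(0)  = c
    r1 = (p1 + p0) * (q1 + q0)         # r(1)  = a + b + c
    rm1 = (p0 - p1) * (q0 - q1)        # r(-1) = a - b + c
    c = r0
    a = (r1 + rm1) // 2 - c            # solve the two linear equations
    b = (r1 - rm1) // 2
    return a * x * x + b * x + c

assert toom2_multiply((2, 3), (3, 5)) == 23 * 35   # 805
```

Note that the three evaluations replace the four digit-by-digit products of schoolbook multiplication, which is the source of the algorithm's speed advantage for large operands.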
In another example, the integers 7 and 22 are multiplied. In such a case, with 22 decomposed into the smaller integers 2 and 2, two integer multiply operations would occur, and each time the partial result would be 7*2=14, but the correct numbers to be added are 140 and 14, yielding a proper final multiplication result of 154. However, a problem would occur if the least significant digit were truncated to obtain an approximate final result, which is typical in the case of rounding floating point numbers. Then, shifting the digits to account for the ones, tens, and hundreds columns (e.g., performed at the shift circuits 713A, . . . , 713N, 714A, . . . , 714M) and the rounding (i.e., truncation) of the least significant digit (e.g., at the truncation/conversion circuit 720) would result in multiplying 10 with 10, yielding a final multiplication result of 100 instead of 154. Even if only the least significant digit of the partial multiplication result of 14 is dropped, the final multiplication result would still only be an approximation. This example illustrates what happens if the precision for accumulation of partial products at the accumulator circuit 719 is sacrificed in favor of computational speed and/or power dissipation.
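A brief software sketch of this failure mode, under the assumption that truncation drops the least significant decimal digit of each partial product before accumulation (function names hypothetical):

```python
def exact_mac(partials):
    # Shift each partial product into its decimal column, then accumulate exactly
    return sum(p * 10**shift for p, shift in partials)

def truncated_mac(partials):
    # Drop the least significant decimal digit of each partial before accumulating
    return sum((p // 10) * 10 * 10**shift for p, shift in partials)

# 7 * 22 produces the partials 7*2 = 14 (tens column) and 7*2 = 14 (ones column)
partials = [(14, 1), (14, 0)]
print(exact_mac(partials))      # → 154
print(truncated_mac(partials))  # only an approximation of 154
```

Once any digit of a partial product is discarded before accumulation, no later step can recover the exact result.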
In one or more embodiments, operands that are input into the multiplier circuitry of
In general, INT8 multiplications (with INT32 accumulation) have sufficient precision and accuracy for inference applications. It should be noted that precision and accuracy are two different requirements. The precision requirement is related to a number of bits for representation of a multiplication result, e.g., a 16-bit multiplication result. The accuracy requirement is related to whether the multiplication result is mathematically correct, e.g., whether the 16-bit result is mathematically correct or not.
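The precision requirement for INT8 multiplication can be made concrete with a short sketch (illustrative only; the function name is hypothetical): the widest signed 8-bit product needs 16 bits to be represented exactly, and an INT32 accumulator then has headroom for many worst-case products.

```python
INT8_MIN, INT8_MAX = -128, 127

def int8_product_requirements():
    # Widest possible INT8*INT8 product magnitude, the signed bits needed to
    # hold it exactly, and how many worst-case products an INT32 can sum
    extreme = max(abs(a * b) for a in (INT8_MIN, INT8_MAX)
                             for b in (INT8_MIN, INT8_MAX))
    bits = extreme.bit_length() + 1     # +1 for the sign bit
    headroom = (2**31 - 1) // extreme   # accumulations before INT32 overflow
    return extreme, bits, headroom

extreme, bits, headroom = int8_product_requirements()
print(extreme, bits, headroom)  # → 16384 16 131071
```

In the terms above, the 16-bit product is the precision requirement, while the accumulator never overflowing (so every sum is mathematically correct) is the accuracy requirement.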
Note that models in AI and/or ML applications (e.g., performed at the TSP core 100) are generally trained using floating point representation of numbers because the trained models require the fidelity to calculate converging differences between weights of a previous learning iteration and weights of a current learning iteration. Otherwise, the trained models would not converge, as the differences would be greater than predetermined threshold values, i.e., the differences would be too large to converge. The multiplier circuitry of
Input operands of the multiplier circuitry of
In one or more embodiments, when the input operands are in INT8 format, the multiplier circuitry of
Note that, for integer multiplication, there is no risk of overflow. However, in the case of multiplication of floating point numbers, there is a potential for overflow. Accordingly, the operands are converted to the TP format, e.g., at the conversion circuits 702, 703 or the conversion circuits 715A, . . . , 715N, 716A, . . . , 716M. The products of floating point multiply and accumulate operations are thus maintained in the TP format at the accumulator circuit 719. Thus, the multiplier circuitry of
By utilizing the TP format for accumulation of partial multiplication results, the multiplier circuitry of
It should be noted that the integer multiplication with the accumulation based on the TP format provides improved precision for AI and/or ML workloads.
Therefore, the TP based calculations provide improved latency and throughput, while providing the most accurate floating point results. To achieve the same accuracy when training an ML model on a single core of a CPU or GPU using the same weights and the same inputs, CPU or GPU based systems would have to accumulate to, e.g., the FP128 precision format. Advantageously, the presented TP based multiply-and-accumulate (MAC) operations running on the TSP core 100 utilize FP16 operands and generate FP32 results, with accuracy that is significantly better than that of a GPU or CPU.
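The details of the TP format are internal to the disclosed circuitry; purely as a software analogy (function names hypothetical), the sketch below accumulates products exactly and rounds once at the end, which is the behavior a wide internal accumulation format approximates in hardware, in contrast with naive accumulation that rounds after every operation:

```python
from fractions import Fraction

def wide_accumulate(pairs):
    # Exact partial products and exact additions; a single rounding at the end,
    # as a software stand-in for accumulating in a wide internal format
    acc = Fraction(0)
    for a, b in pairs:
        acc += Fraction(a) * Fraction(b)
    return float(acc)

def naive_accumulate(pairs):
    # Rounds after every operation, so cancellation can destroy small addends
    acc = 0.0
    for a, b in pairs:
        acc += a * b
    return acc

pairs = [(1e16, 1.0), (1.0, 1.0), (-1e16, 1.0)]
print(wide_accumulate(pairs))   # → 1.0
print(naive_accumulate(pairs))  # → 0.0 (the 1.0 was rounded away)
```

The order of accumulation cannot change the wide result, which is why a single final conversion introduces the only rounding error.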
Energy and performance cost for higher-precision numeric calculations can be significant in many applications. However, the TP format can also be a key enabler for low power calculations when calculations involve utilizing floating point formats. It is known that the energy required to compute products of operands in FP16 format is less than the energy required to compute products of operands represented in wider formats, e.g., FP32 or FP64 formats. For example, it can be shown that FP32 based calculations consume approximately four times the energy compared to FP16 based calculations. To take energy advantage of mixed-precision applied at the multiplier circuitry of
Embodiments of the present disclosure further relate to various methods for conversion of FP numerical representation (e.g., FP32 or BF16) of input operands (e.g., activations and weights) for performing element-wise operations, e.g., element-wise multiplications between an activation matrix and a weight matrix—MATMUL. In some embodiments, in a first method, all exponents of the input operands are sorted by range. In one embodiment, in a first sub-method of the first method, all input numbers (e.g., matrix elements) are first pre-processed by being sorted into groups each having a respective exponent range. Note that each exponent can be within one of the following ranges: 2n−2 to 1, 2n×2−4 to 2n−1, 2n×3−6 to 2n×2−3, 2n×4−8 to 2n×3−5, 2n×5−10 to 2n×4−7, 2n×6−12 to 2n×5−9, 2n×7−14 to 2n×6−11, 2n×8−16 to 2n×7−13, 2n×9−34 to 2n×8−15, where n is a number of bits for representing the exponent. Second, numbers (e.g., matrix elements) from each group are normalized to be within a defined exponent range of the MATMUL while keeping track of which range each group was in before the normalization. Third, an element-wise operation (e.g., multiplication) is performed on the normalized numbers from each activations group and weights group obtained at the second step. Fourth, an intermediate result is adjusted to align with the original range. Fifth, accumulation with previous group result(s) is performed. Sixth, if any groups remain, the third, fourth and fifth steps are repeated. Seventh, once all the groups are completed, final accumulation and conversion to the final format are performed. The first sub-method of the first method utilizes the TP format on intermediate results, and no error is introduced until the final conversion. The first sub-method of the first method requires (roundup(exponent range of inputs/exponent range of matrix))^2 passes in the matrix×matrix size/MATMUL matrix size plus pre-processing and post-processing cycles to complete.
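The seven steps above can be sketched in software; the group width, the test values, and the helper names below are hypothetical, with math.frexp standing in for exponent extraction:

```python
import math
from collections import defaultdict

GROUP_WIDTH = 14  # hypothetical width of one exponent-range group

def grouped_dot(pairs):
    # Step 1: sort (activation, weight) pairs by the activation's exponent range
    groups = defaultdict(list)
    for a, w in pairs:
        _, exp = math.frexp(a)                 # a = m * 2**exp, 0.5 <= |m| < 1
        groups[exp // GROUP_WIDTH].append((a, w))
    total = 0.0
    for g, members in groups.items():
        scale = 2.0 ** (g * GROUP_WIDTH)       # remember the group's range
        # Steps 2-3: normalize, then multiply element-wise
        partial = sum((a / scale) * w for a, w in members)
        # Steps 4-5: adjust to the original range and accumulate
        total += partial * scale
    return total

pairs = [(2.0**20, 3.0), (2.0**-5, 4.0), (1.5, 2.0)]
print(grouped_dot(pairs))  # matches the direct dot product
```

Because the scaling factors are powers of two, normalization and re-alignment are exact, so no error enters before the final conversion.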
In another embodiment, in a second sub-method of the first method, all matrix weights are first pre-processed to belong to the same range. An exponent of a respective matrix weight can be within one of the following ranges: 2n−2 to 1, 2n×2−4 to 2n−1, 2n×3−6 to 2n×2−3, 2n×4−8 to 2n×3−5, 2n×5−10 to 2n×4−7, 2n×6−12 to 2n×5−9, 2n×7−14 to 2n×6−11, 2n×8−16 to 2n×7−13, 2n×9−34 to 2n×8−15, where n is a number of bits for the exponent. Second, the largest intermediate exponent N is pre-processed, and all values with exponent less than (e−log2(m)−s) are zeroed out, where m is a number of operations to perform, e is a size of the exponent in the final format, s is a size of the significand for conversion, and e≥N. Third, activations are re-sorted and the zeroed out values are removed. Fourth, all matrix activations are pre-processed to belong to the same range. Fifth, each group of activations is normalized to be in the exponent range of the MATMUL, while keeping track of which range each group was in before the normalization. Sixth, an element-wise operation (e.g., multiplication) is performed on a current normalized activations group and the pre-processed weights. Seventh, an intermediate result is adjusted to align with the original range. Eighth, accumulation with previous group result(s) is performed. Ninth, once all the groups are completed, final accumulation and conversion to the final format are performed. The second sub-method of the first method throws away, up front, values that would not make a difference in the final conversion. The second sub-method of the first method utilizes the TP format on intermediate results. The second sub-method of the first method has the potential to introduce error in the least significant bit (LSB) region and requires more pre-processing than the first sub-method of the first method.
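The zeroing criterion in the second step can be sketched as follows; the parameter values are hypothetical, and math.frexp stands in for exponent extraction:

```python
import math

def zero_out_negligible(values, e, s, m):
    # Zero values whose exponent falls below e - log2(m) - s: even m accumulations
    # of such values cannot move a result with exponent e and an s-bit significand
    threshold = e - math.log2(m) - s
    return [v if v != 0.0 and math.frexp(v)[1] >= threshold else 0.0
            for v in values]

# Hypothetical parameters: result exponent e = 10, s = 10 significand bits,
# m = 1024 accumulation operations, so the threshold is 10 - 10 - 10 = -10
print(zero_out_negligible([1000.0, 1e-9, 0.25], e=10, s=10, m=1024))
```

Discarding these values up front saves passes at the cost of a bounded LSB-region error, which matches the trade-off stated above.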
In some other embodiments, in a second method of a limited range, only most significant bits (MSBs) of exponents of the input operands (e.g., activations and weights) are utilized. In one embodiment, in a first sub-method of the second method, pre-processing of the input operands is first performed and only m MSBs of an exponent of each input number are used. Second, the input numbers are pre-processed by breaking the input numbers into n significands, where n=roundup(significand bits in/significand bits in matrix unit) for zero internal error, or n=truncation(significand bits in/significand bits in matrix unit) for non-zero internal error. Third, an element-wise operation (e.g., multiplication) is performed on each activations group and weights group obtained at the second step. Fourth, an intermediate result is adjusted to align with an original significand. Fifth, accumulation with previous group result(s) is performed. Sixth, if any groups remain, the third, fourth and fifth steps are repeated. Seventh, once all the groups are completed, final accumulation is performed, followed by adjustment to the original range and conversion to the final format. The first sub-method of the second method introduces a precision error in two ways. First, the precision error is introduced by limiting the exponents. Second, the precision error is non-zero if the number of sub-significands times the significand bits in the matrix is less than the number of input significand bits. The first sub-method of the second method requires a number of passes to complete the matrix that is significantly less than for the first and second sub-methods of the first method. For conversion to the final format that is FP32, the first sub-method of the second method requires just four to nine passes to complete the matrix depending on the sub-significands (i.e., depending on whether truncation or roundup is performed at the second step).
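The sub-significand split in the second step can be sketched as follows (illustrative only; function names hypothetical). With roundup, the chunks recombine exactly, which is the zero-internal-error case:

```python
import math

def split_significand(sig, in_bits, unit_bits, exact=True):
    # n = roundup(...) for zero internal error, truncation(...) otherwise;
    # chunk i holds bits [i*unit_bits, (i+1)*unit_bits) of the significand
    n = math.ceil(in_bits / unit_bits) if exact else in_bits // unit_bits
    mask = (1 << unit_bits) - 1
    return [(sig >> (i * unit_bits)) & mask for i in range(n)]

def recombine(chunks, unit_bits):
    return sum(c << (i * unit_bits) for i, c in enumerate(chunks))

# An FP32 significand is 24 bits (with the hidden bit); an 11-bit matrix unit
# needs ceil(24/11) = 3 sub-significands for an exact split
sig = 0b101100111000110010101011
parts = split_significand(sig, 24, 11)
print(len(parts), recombine(parts, 11) == sig)  # → 3 True
```

With truncation instead of roundup, only two chunks are produced and the top bits are lost, which is where the non-zero internal error arises.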
In another embodiment, in a second sub-method of the second method, pre-processing of all activations is first performed including normalization to a highest exponent bit that is "1" (that particular bit and the m−1 MSBs are used after that). Second, activations are pre-processed by breaking them into n significands, where n=roundup(significand bits in/significand bits in matrix unit) for zero internal error or n=truncation(significand bits in/significand bits in matrix unit) for non-zero internal error. Third, all weights are pre-processed and normalized to a highest exponent bit that is "1", using that bit plus the m−1 MSBs after that. Fourth, the normalized weights are pre-processed by breaking them into n significands, where n=roundup(significand bits in/significand bits in matrix unit) for zero internal error or n=truncation(significand bits in/significand bits in matrix unit) for non-zero internal error. Fifth, an element-wise operation (e.g., multiplication) is performed on each activations group and weights group obtained at the second and fourth steps. Sixth, an intermediate result is adjusted to align with an original significand. Seventh, accumulation with previous group result(s) is performed. Eighth, if any groups remain, the third, fourth, fifth, sixth and seventh steps are repeated. Ninth, once all the groups are completed, final accumulation is performed, followed by adjustment to the original range and conversion to the final format. The second sub-method of the second method introduces a potential precision error in two ways. First, the potential precision error can occur due to limiting the exponents. Second, if the number of bits for sub-significands times the significand bits in the matrix is less than the number of input significand bits, the potential precision error is introduced.
The number of passes required to complete the matrix is significantly less for the second sub-method of the second method than for the first and second sub-methods of the first method. For conversion to the final format of FP32, the second sub-method of the second method requires just four to nine passes to complete the matrix depending on the sub-significands (i.e., depending on whether truncation or roundup is performed at the second and fourth steps). The second sub-method of the second method has the potential to be more accurate than the first sub-method of the second method.
In yet another embodiment, in a third sub-method of the second method, the first step is to force a format of input exponents to only use the range of the matrix unit. Second, all input numbers (activations and weights) are pre-processed by breaking them into n significands, where n=roundup(significand bits in/significand bits in matrix unit) for zero internal error or n=truncation(significand bits in/significand bits in matrix unit) for non-zero internal error. Third, an element-wise operation (e.g., multiplication) is performed on each activations group and weights group obtained at the second step. Fourth, an intermediate result is adjusted to align with an original significand. Fifth, accumulation with previous group result(s) is performed. Sixth, if any groups remain, the third, fourth and fifth steps are repeated. Seventh, once all groups are completed, final accumulation is performed, followed by adjustment to the original range and conversion to the final format. The third sub-method of the second method forces the input range to match the limited range of the matrix for the exponent. If the roundup is used at the second step, no error is introduced until the final conversion. The third sub-method of the second method matches a throughput of the first sub-method of the second method. However, the third sub-method of the second method does not introduce any precision or range error during the processing of matrix elements.
In some other embodiments, in a third method, exponents are broken into N m-bit units. In one embodiment, in a first sub-method of the third method, pre-processing of all input numbers (i.e., activations and weights) is first performed by breaking the exponent portion into equal bits (or near equal bits) under the size of the matrix unit exponent size. Second, pre-processing of the input numbers generated at the first step into n significands is performed, where n=roundup(significand bits in/significand bits in matrix unit) for zero internal error or n=truncation(significand bits in/significand bits in matrix unit) for non-zero internal error. Third, an element-wise operation (e.g., multiplication) is performed on each activations group and weights group obtained at the second step. Fourth, an intermediate result is adjusted to align with the original range. Fifth, accumulation with previous group result(s) is performed. Sixth, if any groups remain, the third, fourth and fifth steps are repeated. Seventh, once all groups are completed, final accumulation and conversion to the final format are performed. The first sub-method of the third method utilizes the TP format for intermediate results, and no error is introduced until the final conversion, if the roundup is used at the second step. The first sub-method of the third method requires N equal exponents times N equal exponents times n significands times n significands passes for each matrix to complete. For FP32 to FP16 full range conversion (i.e., the input operands having the FP32 format and the final output format being FP16), the first sub-method of the third method with full precision requires, e.g., 2×2×3×3=36 passes to complete the matrix operation.
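The 36-pass figure can be reproduced with a small sketch, under the assumption that the 8-bit FP32 exponent splits into 2 near-equal pieces under the FP16 unit's 5 exponent bits and the 24-bit FP32 significand splits into ceil(24/11)=3 sub-significands (function name hypothetical):

```python
import math

def matmul_passes(exp_in, sig_in, exp_unit, sig_unit):
    # N equal exponent pieces (each under the matrix unit's exponent size)
    # times n sub-significands, squared because both operand sides are split
    n_exp = math.ceil(exp_in / exp_unit)
    n_sig = math.ceil(sig_in / sig_unit)
    return n_exp * n_exp * n_sig * n_sig

# FP32 inputs (8 exponent bits, 24-bit significand with hidden bit) on an FP16
# matrix unit (5 exponent bits, 11-bit significand): 2 * 2 * 3 * 3 = 36 passes
print(matmul_passes(8, 24, 5, 11))  # → 36
```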
In another embodiment, in a second sub-method of the third method with a limited significand, pre-processing of all input numbers (i.e., activations and weights) is first performed by breaking each exponent portion into equal bits (or near equal bits) under the size of the matrix unit exponent size. Second, all significands are truncated to match the size of the matrix unit significand. Third, an element-wise operation (e.g., multiplication) is performed on each activations group and weights group obtained at the second step. Fourth, an intermediate result is adjusted to align with the original range. Fifth, accumulation with previous group result(s) is performed. Sixth, if any groups remain, the third, fourth and fifth steps are repeated. Seventh, once all groups are completed, final accumulation and conversion to the final output format are performed. The second sub-method of the third method keeps the complete range of the original input numbers but limits the precision to an internal matrix. By applying the second sub-method of the third method, the number of passes for each matrix with FP32 format is just four. The number of passes for each matrix with BF16 format is also four, but the second sub-method of the third method provides better precision for FP32 format than for BF16 format until the final conversion.
In some embodiments, the accumulation as part of the matrix multiplication can be performed in the extended variable precision TP format. An amount of accumulated precision required for a given matrix multiply accumulation (MATMUL) can be dynamically changed. In an embodiment, when performing an N×N FP16 MATMUL that is a size of an internal matrix, no extension is required in the final accumulation and conversion to obtain a final output format. In another embodiment, when performing, e.g., a 2^16×2^16 N×N FP16 MATMUL, an intermediate accumulation (i.e., accumulation of partial products) is required to be extended by 32 bits to keep from overflowing the final result during the accumulation. In yet another embodiment, when performing, e.g., a 2^64×2^64 N×N FP16 MATMUL, an intermediate accumulation (i.e., accumulation of partial products) is extended by 128 bits to keep from overflowing the final result during the accumulation. In each of these cases, no error is introduced for precision or accuracy until the final conversion to the final format. Furthermore, in each of these cases, the minimum final format is FP32 in order to maintain a complete range for the final result without overflow during the accumulation.
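The growth of the required extension with MATMUL size can be sketched as follows; the 2*log2(dim) relation is an inference from the two data points above (32 bits at 2^16, 128 bits at 2^64), not a formula stated in the disclosure, and the function name is hypothetical:

```python
import math

def accumulator_extension_bits(matmul_dim):
    # Inferred relation: extend the intermediate accumulator by 2 * log2(dim)
    # bits so partial-product accumulation cannot overflow the final result
    return 2 * int(math.log2(matmul_dim))

print(accumulator_extension_bits(2**16))  # → 32
print(accumulator_extension_bits(2**64))  # → 128
```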
In yet another embodiment, when performing, e.g., a 2^16×2^16 256×256 FP32 MATMUL in a 256×256 FP16 matrix, an intermediate accumulation (i.e., accumulation of partial products) is extended by a total of 512 bits to keep from overflowing the final result during the accumulation. The total of 512 bits required for extension of the intermediate accumulation is due to, e.g., 32 bits required for the size of the FP32 MATMUL, plus 564 bits for FP32 TP, minus 90 bits for FP16, plus roundup(log2(36)) bits (i.e., 6 bits), as one FP32 matrix operation requires 36 FP16 operations for full range and precision TP assuming a 256×256 base matrix size. Again, no error is introduced for precision or accuracy until the final conversion to the final format. In such case, the minimum final output format is FP64 in order to maintain a complete range for the final result without overflow during the accumulation.
By way of example,
The structure of a computing machine described in
The example computer system 1100 includes one or more processors (generally, a processor 1102) (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), one or more application specific integrated circuits (ASICs), one or more radio-frequency integrated circuits (RFICs), or any combination of these), a main memory 1104, and a static memory 1106, which are configured to communicate with each other via a bus 1108. The computer system 1100 may further include a graphics display unit 1110 (e.g., a plasma display panel (PDP), a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT)). The computer system 1100 may also include an alphanumeric input device 1112 (e.g., a keyboard), a cursor control device 1114 (e.g., a mouse, a trackball, a joystick, a motion sensor, or other pointing instrument), a storage unit 1116, a signal generation device 1118 (e.g., a speaker), and a network interface device 1120, which also are configured to communicate via the bus 1108.
The storage unit 1116 includes a computer-readable medium 1122 on which the instructions 1124 are stored embodying any one or more of the methodologies or functions described herein. The instructions 1124 may also reside, completely or at least partially, within the main memory 1104 or within the processor 1102 (e.g., within a processor's cache memory). Thus, during execution thereof by the computer system 1100, the main memory 1104 and the processor 1102 may also constitute computer-readable media. The instructions 1124 may be transmitted or received over a network 1126 via the network interface device 1120.
While the computer-readable medium 1122 is shown in an example embodiment to be a single medium, the term “computer-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store instructions (e.g., the instructions 1124). The computer-readable medium 1122 may include any medium that is capable of storing instructions (e.g., the instructions 1124) for execution by the machine and that cause the machine to perform any one or more of the methodologies disclosed herein. The computer-readable medium 1122 may include, but not be limited to, data repositories in the form of solid-state memories, optical media, and magnetic media. The computer-readable medium 1122 does not include a transitory medium such as a signal or a carrier wave.
The above specification provides illustrative and example descriptions of various embodiments. While the present disclosure illustrates various techniques and embodiments as physical circuitry (e.g., on an integrated circuit), it is to be understood that such techniques and innovations may also be embodied in a hardware description language program such as VHDL or Verilog as is understood by those skilled in the art. A hardware description language (HDL) is a specialized computer language used to describe the structure and behavior of electronic circuits, including digital logic circuits. A hardware description language results in an accurate and formal description of an electronic circuit that allows for the automated analysis and simulation of an electronic circuit. An HDL description may be synthesized into a netlist (e.g., a specification of physical electronic components and how they are connected together), which can then be placed and routed to produce the set of masks used to create an integrated circuit including the elements and functions described herein.
The above examples should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of the particular embodiments as defined by the following claims. Based on the above disclosure and the following claims, other arrangements, embodiments, implementations and equivalents may be employed without departing from the scope of the present disclosure as defined by the claims.
This application claims a benefit and priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application Ser. No. 63/134,941, filed on Jan. 7, 2021, which is hereby incorporated by reference in its entirety. This application is a continuation-in-part of co-pending U.S. application Ser. No. 16/986,007, filed May 8, 2020, which is a continuation of U.S. application Ser. No. 16/139,093, filed Sep. 23, 2018, now U.S. Pat. No. 10,776,078, each of which are incorporated by reference in its entirety.
Number | Date | Country
---|---|---
63134941 | Jan 2021 | US

 | Number | Date | Country
---|---|---|---
Parent | 16139093 | Sep 2018 | US
Child | 16986007 | | US

 | Number | Date | Country
---|---|---|---
Parent | 16986007 | Aug 2020 | US
Child | 17351044 | | US