SYSTEM AND METHOD FOR IMPLEMENTING VARIABLE-PRECISION MATRIX MULTIPLICATION USING LOW-PRECISION DIGIT MATRIX MULTIPLIER

Information

  • Patent Application 20230359694
  • Date Filed: May 05, 2022
  • Date Published: November 09, 2023
Abstract
A system and method for implementing variable-precision matrix multiplications using a low-precision digit matrix multiplier is disclosed. The system enables multiplication of matrices of different dimensions by splitting the large matrix into fixed-size matrix blocks. These block matrices are further decomposed into fixed-precision digit submatrices that are then individually multiplied, scaled, and accumulated to allow for variable-precision matrix multiplication. The system uses a systolic array of block matrix multipliers, which are each an array of dot product units, to efficiently implement larger matrix multiplications without substantially increasing either latency or wiring congestion. The system further uses only unsigned digit matrix multipliers but accounts for signed matrix multiplication by using row and column sums of the input matrices to adjust for the signed to unsigned conversion.
Description
FIELD OF INVENTION

Embodiments of the present disclosure relate to the field of processing systems and more particularly to a system and method for implementing variable-precision matrix multiplication using a low-precision digit matrix multiplier.


BACKGROUND

A Central Processing Unit generally performs different computing tasks such as running application software, operating systems, graphics processing, image processing, and digital signal processing, some of which can be off-loaded to specialized accelerators (e.g., graphics and image processing on a Graphics Processing Unit). Each such processor is programmed in a different manner. The era of big data processing demands higher performance at lower energy than today's general-purpose processors provide.


Most conventional scalar and data-parallel processors operate on a small subset of data types, often limited to only 8-, 16-, 32-, or 64-bit data. However, in fields like signal processing and machine learning, lower and more varied precisions are becoming increasingly prevalent, such as 2-bit, 4-bit, or 12-bit data. For instance, machine learning problems may require different precisions and matrix sizes in different neural network layers, as well as separate precisions for each operand (such as weights and inputs). As such, the ability to operate on minimum-precision operands while maintaining the desired accuracy can in some cases yield an order of magnitude or more in power and latency savings.


One of the most common computation kernels in machine learning is matrix-matrix multiplication, which takes two input matrices and returns a matrix result where each output element is the inner product of a row of the first matrix and a column of the second. A larger matrix problem can be decomposed into several smaller matrix multiplications of a fixed “block size” and then combined. The multiplication of two matrices A and B is essentially a series of dot products of the rows of A with the columns of B. Equation (1) below depicts an example of the two input matrices A and B being broken into 9 blocks each (3 rows by 3 columns), where each output block of the matrix multiply is the inner product of a block row of A and a block column of B (e.g., C0 = A0*B0 + A1*B3 + A2*B6). A matrix multiplier is made up of many dot product units, each of which computes the dot product of a row of A with a column of B. Note that the full-precision output of a dot product unit contains more bits than any individual element of A or B due to the bit growth after each multiplication and addition.













( A0 A1 A2 )         ( B0 B1 B2 )         ( C0 C1 C2 )
( A3 A4 A5 )    ×    ( B3 B4 B5 )    =    ( C3 C4 C5 )
( A6 A7 A8 )         ( B6 B7 B8 )         ( C6 C7 C8 )
   k1×k2                k2×k3                k1×k3            equation (1)
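
As a non-authoritative illustration, the block decomposition of equation (1) can be sketched in a few lines of Python; the function name, the use of NumPy, and the assumption that the matrix dimensions divide evenly into the block sizes are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def block_matmul(A, B, kb1, kb2, kb3):
    """Multiply A (k1 x k2) by B (k2 x k3) block by block, as in
    equation (1); e.g., C0 = A0*B0 + A1*B3 + A2*B6."""
    k1, k2 = A.shape
    _, k3 = B.shape
    C = np.zeros((k1, k3), dtype=np.int64)
    # Each output block is the inner product of a block-row of A
    # with a block-column of B.
    for i in range(0, k1, kb1):
        for j in range(0, k3, kb3):
            for r in range(0, k2, kb2):
                C[i:i+kb1, j:j+kb3] += A[i:i+kb1, r:r+kb2] @ B[r:r+kb2, j:j+kb3]
    return C
```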








One conventional approach, Bit-serial Matrix Multiply (BISMO), is a vectorized bit-serial matrix multiplication overlay that contains a fully parallel Dot Product Unit (DPU) array. Unlike element-level multi-precision approaches, BISMO uses a matrix-level multi-precision approach. However, the BISMO approach relies on bit-level decomposition, which limits the achievable throughput, and on broadcasting information across the entire matrix array, which leads to wiring congestion and makes it difficult to scale to larger matrix sizes. As a result, BISMO is not suitable for high-performance matrix multiplication.


Yet another approach is a systolic array of multipliers. Specifically, the Tensor Processing Unit (TPU) uses fixed 8-bit multipliers and is designed for 256×256 matrices. The TPU is unable to efficiently handle lower precisions such as 2-bit or 4-bit and smaller matrix dimensions such as 8×8 or 32×32. Additionally, due to the high latency of systolic arrays, the TPU is not suitable for embedded applications.


In another approach, the Intel Xeon processor uses vector instructions to compute bit-level dot products by ANDing two vectors and counting the number of ones remaining. Throughput is limited because the computation is not at the matrix level and there is not enough recirculation of data. Furthermore, there is significant overhead in converting from the natural-order in-memory representation to a bit-serial representation. As a result, the Intel Xeon is also not suitable for high-performance low-precision matrix multiplication.


Moreover, existing matrix multiply systems use one of many schemes, such as dot product arrays and systolic arrays. However, these operate on a fixed data type, such as an 8-bit systolic array. Sometimes, these matrix multipliers can also be made to work on a limited subset of hard-coded larger-precision data types like 16, 32, or 64 bits. These systems do not allow for low-precision operations (those below 8 bits), such as 2- or 4-bit data types. They also do not give flexibility for any multiple of the matrix multiplier data size, such as allowing for 24-bit data with an 8-bit matrix multiplier.


Hence, there is a need for an improved system and method for implementing variable-precision matrix multiplications using a low-precision digit matrix multiplier.


SUMMARY

This summary is provided to introduce a selection of concepts, in a simple manner, which is further described in the detailed description of the disclosure. This summary is neither intended to identify key or essential inventive concepts of the subject matter nor to determine the scope of the disclosure.


In accordance with an embodiment of the present disclosure, a system and method for implementing variable-precision matrix multiplications using a low-precision digit matrix multiplier is disclosed. The system comprises a load store unit configured to load and store matrix data between a register file and either on-chip or off-chip system memory. The system further comprises the register file configured to provide data in natural order to a permute unit. The system further comprises the permute unit, configured to deinterleave the data, i.e., to convert the data into fixed dimension and fixed precision sub-matrices. The deinterleaved data is stored back into the register file. The system further comprises a digit matrix multiplier unit configured to receive digit matrices in deinterleaved form from the register file. The digit matrix multiplier unit is composed of multiple subsystems to perform various operations. The subsystems include an input unit, a compute unit, a combination unit, and an output unit. The input unit is configured to decode instructions and generate control signals. Concurrently, in the input unit, the digit matrices are buffered, and a portion of the digit matrices are transmitted to the compute unit to be multiplied. The compute unit is configured to compute the product of two digit matrices and transmit the product to a matrix combination unit to be accumulated. The product of the digit matrices is computed using an array of dot product units (DPUs). Each DPU computes a dot product of one row of one digit matrix and one column of the other digit matrix. The matrix combination unit accumulates the results of the current digit matrix multiplication with the results of previous multiplications and applies offsets as needed to account for the multi-precision aspect and unsigned product to signed product conversion. Accumulation results are stored locally inside the matrix combination unit, and portions of the results are transmitted to an output unit. The output unit converts full-precision elements of the matrix product to a reduced precision by performing scaling, rounding and saturation operations.


The present disclosure also includes a method for implementing variable-precision matrix multiplications using a low-precision digit matrix multiplier. The method comprises decomposing a first matrix A into a set {Air} of n×n sub-matrices of M-bit elements and decomposing a second matrix B into a set {Brj} of n×n sub-matrices of M-bit elements. Here, matrix A is an N1×N matrix, matrix B is an N×N2 matrix, and









i = {0, . . . , N1/n−1}, r = {0, . . . , N/n−1}, and j = {0, . . . , N2/n−1}.






The product of A and B is an N1×N2 matrix C.


The matrices A and B and their corresponding sets {Air} and {Brj} are stored in the system memory. The method further comprises initializing i and j to 0. The method further comprises computing product sub-matrix Cij. The computation of the product sub-matrix Cij comprises an outer loop and an inner loop.


The outer loop computation is initialized by setting r to 0 and initializing an n×n bank of accumulators, ACC, to 0, then repeating an outer loop iteration step, followed by incrementing r, provided that r is less than N/n. The outer loop iteration step comprises retrieving the submatrix Air and the submatrix Brj from the system memory and placing them in intermediate storage. The outer loop iteration step further comprises computing the product Air×Brj, the computation of which includes accumulating into ACC. Once all the outer loop iteration steps have completed, the method further comprises post-processing each accumulator element by multiplying it with a defined scale factor, performing rounding and saturation to produce an element of Cij, and storing it in the system memory.


The computation of the product Air×Brj comprises decomposing Air into a set of K m-bit digit matrices {Airk0} by deinterleaving each digit position for each element of Air. In this case, digit position k0 ranges from {0, . . . , K−1}, and each Airk0 has a corresponding scale factor 2^(m*k0). The computation of the product Air×Brj further comprises decomposing Brj into a set of K m-bit digit matrices {Brjk1} by deinterleaving each digit position for each element of Brj. The digit position k1={0, . . . , K−1}, and each Brjk1 has a corresponding scale factor 2^(m*k1). The digit matrices Airk0 and Brjk1 are placed in the intermediate storage. The computation of the product Air×Brj further comprises initializing k0 and k1 to 0, setting the scale factor, S, to 1, and performing the inner loop computation.


The inner loop computation comprises several iteration steps. One inner loop iteration step is performed by retrieving Airk0 and Brjk1 from intermediate storage, computing an intermediate matrix T (an n×n digit matrix) by multiplying Airk0 and Brjk1 using a digit matrix multiplier functional unit, then multiplying T by the scale factor S and adding the resulting product to the accumulator ACC (i.e., ACC = ACC + T*S). The inner loop computation further comprises incrementing k1, setting the scale factor S equal to 2^(m*(k0+k1)), and repeating the inner loop iteration step defined above provided that k1 is less than K. When k1 is equal to K, k1 is set to 0, k0 is incremented, the scale factor S is set to 2^(m*(k0+k1)), and the inner loop iteration step is repeated provided that k0 is less than K.


Once all the outer loop and inner loop computation has concluded and the result Cij has been placed in system memory, the method further comprises incrementing i and repeating the outer loop and inner loop computations provided that i is less than N1/n. When i reaches N1/n, the method further comprises setting i to 0 and incrementing j and repeating the outer loop and inner loop computations provided that j is less than N2/n. Once all the iterations of i and j have been completed, the result matrix C will be in system memory.
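
The loop structure summarized above can be sketched end to end as follows. This is a minimal behavioral model, assuming NumPy, unsigned elements, and dimensions divisible by n; all names are illustrative and do not denote the claimed hardware:

```python
import numpy as np

def deinterleave_digits(X, m, K):
    """Decompose a matrix of unsigned (m*K)-bit elements into K digit
    matrices of m-bit digits; digit k has scale factor 2^(m*k)."""
    return [(X >> (m * k)) & ((1 << m) - 1) for k in range(K)]

def variable_precision_matmul(A, B, n, m, K, out_scale=1.0, out_bits=32):
    """C = A @ B computed entirely from n x n digit matrix multiplies,
    following the outer/inner loop structure described above."""
    N1, N = A.shape
    _, N2 = B.shape
    C = np.zeros((N1, N2), dtype=np.int64)
    for i in range(N1 // n):
        for j in range(N2 // n):
            ACC = np.zeros((n, n), dtype=np.int64)      # n x n accumulator bank
            for r in range(N // n):                     # outer loop over blocks
                Air = A[i*n:(i+1)*n, r*n:(r+1)*n]
                Brj = B[r*n:(r+1)*n, j*n:(j+1)*n]
                dA = deinterleave_digits(Air, m, K)
                dB = deinterleave_digits(Brj, m, K)
                for k0 in range(K):                     # inner loops over digits
                    for k1 in range(K):
                        T = dA[k0] @ dB[k1]             # fixed-precision digit multiply
                        S = 1 << (m * (k0 + k1))        # scale factor 2^(m*(k0+k1))
                        ACC += T * S
            # Post-process: scale, round, and saturate each accumulator element.
            Cij = np.clip(np.rint(ACC * out_scale), 0, (1 << out_bits) - 1)
            C[i*n:(i+1)*n, j*n:(j+1)*n] = Cij.astype(np.int64)
    return C
```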


To further clarify the advantages and features of the present disclosure, a more particular description of the disclosure will follow by reference to specific embodiments thereof, which are illustrated in the appended figures. Note that these figures depict only typical embodiments of the disclosure and are therefore not to be considered limiting in scope. The disclosure will be described and explained with additional specificity and detail with the appended figures.





BRIEF DESCRIPTION OF DRAWINGS

The disclosure will be described and explained with additional specificity and detail with the accompanying figures in which:



FIG. 1 is a block diagram illustrating a multi-precision matrix multiply unit, in accordance with an embodiment of the present disclosure;



FIG. 2 is a block diagram, illustrating an internal organization of a multi-precision matrix multiply unit, in accordance with an embodiment of the present disclosure;



FIG. 3 is a block diagram illustrating an exemplary input unit, in accordance with an embodiment of the present disclosure;



FIG. 4 is a block diagram illustrating an exemplary dataflow for sending blocks of A and B into a systolic array of block matrix multipliers, in accordance with an embodiment of the present disclosure;



FIG. 5 is a block diagram illustrating an exemplary dataflow of the outputs streaming out of the systolic array of block matrix multipliers, in accordance with an embodiment of the present disclosure;



FIG. 6 is a block diagram illustrating an exemplary dot product unit (DPU), in accordance with an embodiment of the present disclosure;



FIG. 7 is a block diagram illustrating an exemplary block matrix multiplier (PE), in accordance with an embodiment of the present disclosure;



FIG. 8 is a block diagram illustrating an exemplary matrix combination unit, in accordance with an embodiment of the present disclosure;



FIGS. 9A and 9B are a block diagram illustrating an exemplary output unit, in accordance with an embodiment of the present disclosure;



FIGS. 10A and 10B are a process flow diagram illustrating an exemplary method for implementing variable-precision matrix multiplications using a low-precision digit matrix multiplier, in accordance with an embodiment of the present disclosure; and



FIG. 11 is a block diagram illustrating the decomposing of matrices into digit matrices, in accordance with an embodiment of the present disclosure.





Further, those skilled in the art will appreciate that elements in the figures are illustrated for simplicity and may not have necessarily been drawn to scale. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the figures by conventional symbols, and the figures may show only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the figures with details that will be readily apparent to those skilled in the art having the benefit of the description herein.


DETAILED DESCRIPTION

For the purpose of promoting an understanding of the principles of the disclosure, reference will now be made to the embodiment illustrated in the figures and specific language will be used to describe them. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended.


In the present document, the word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment or implementation of the present subject matter described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.


The terms “comprises”, “comprising”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such a process or method. Similarly, one or more devices or subsystems or elements or structures or components preceded by “comprises . . . a” does not, without more constraints, preclude the existence of other devices, subsystems, elements, structures, components, additional devices, additional subsystems, additional elements, additional structures or additional components. Appearances of the phrase “in an embodiment”, “in another embodiment” and similar language throughout this specification may, but not necessarily do, all refer to the same embodiment.


Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which this disclosure belongs. The system, methods, and examples provided herein are only illustrative and not intended to be limiting.


A computer system (standalone, client or server computer system) configured by an application may constitute a “module” (or “subsystem”) that is configured and operated to perform certain operations. In one embodiment, the “module” or “subsystem” may be implemented mechanically or electronically, so a module may comprise dedicated circuitry or logic that is permanently configured (within a special-purpose processor) to perform certain operations. In another embodiment, a “module” or “subsystem” may also comprise programmable logic or circuitry (as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured by software to perform certain operations.


Accordingly, the term “module” or “subsystem” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (hardwired) or temporarily configured (programmed) to operate in a certain manner and/or to perform certain operations described herein.


Referring now to the drawings, and more particularly to FIGS. 1 through 11, where similar reference characters denote corresponding features consistently throughout the figures, there are shown preferred embodiments and these embodiments are described in the context of the following exemplary system and/or method.


The present disclosure provides a solution for the matrix multiplication problem as several dot product computations. The solution is applied to block matrix multiplication. A large matrix is decomposed into blocks of smaller matrices, and each block is treated like a single piece of data, as shown in Table 1. A single block matrix multiply would be computed as follows: C0 = A0*B0 + A1*B3 + A2*B6. Note that each block A0 . . . A8 has dimensions kb1×kb2 and each block B0 . . . B8 has dimensions kb2×kb3. Hence, each resulting block C0 . . . C8 has dimensions kb1×kb3. The present disclosure uses this concept of block decomposition to compute large matrix multiplications.













( A0 A1 A2 )         ( B0 B1 B2 )         ( C0 C1 C2 )
( A3 A4 A5 )    ×    ( B3 B4 B5 )    =    ( C3 C4 C5 )
( A6 A7 A8 )         ( B6 B7 B8 )         ( C6 C7 C8 )
   k1×k2                k2×k3                k1×k3            Table 1







The higher-precision matrices are decomposed into lower-precision matrices through deinterleaving. The matrices are originally stored in natural order, with each element in a row stored contiguously. Through the deinterleaving process, the data is converted so that, within a row, digits of the same weight are stored contiguously, forming the rows of the digit matrices; the data is then processed by the matrix multiplier in this deinterleaved form.



FIG. 1 is a block diagram illustrating a multi-precision matrix multiply unit 100, in accordance with an embodiment of the present disclosure. The system 100 comprises a load store unit 104 configured to load and store data between a register file 106 and a system memory 112. The system memory 112 may either be external or on-chip. The system 100 further comprises a register file 106 configured to provide data in natural order to a permute unit 110. The system 100 further comprises a permute unit 110 configured to deinterleave the data, converting the data into fixed dimension and fixed precision sub-matrices. The deinterleaved data is stored back into the register file 106. The process of deinterleaving is carried out as follows. Each digit matrix is initially stored in memory, such as system memory 112, in row-major format in natural order. This means elements in a row are stored contiguously and digits in an element are stored contiguously. The purpose of deinterleaving is to convert a matrix of multi-digit elements into multiple matrices with single-digit elements of the same weight stored contiguously. This is needed because the digit matrix multiplier can only process data formatted as single-digit elements. Table 2 shows a matrix row with 4 elements, each of which is made up of 4 digits. The elements are broken apart into digits and then repacked so that digits of the same weight are stored next to each other. Table 3 shows the result of this deinterleaving. The process of deinterleaving is performed block by block.
























TABLE 2

A33   A32   A31   A30   A23   A22   A21   A20   A13   A12   A11   A10   A03   A02   A01   A00
A73   A72   A71   A70   A63   A62   A61   A60   A53   A52   A51   A50   A43   A42   A41   A40
A113  A112  A111  A110  A103  A102  A101  A100  A93   A92   A91   A90   A83   A82   A81   A80
A153  A152  A151  A150  A143  A142  A141  A140  A133  A132  A131  A130  A123  A122  A121  A120

TABLE 3

A150  A140  A130  A120  A110  A100  A90   A80   A70   A60   A50   A40   A30   A20   A10   A00
A151  A141  A131  A121  A111  A101  A91   A81   A71   A61   A51   A41   A31   A21   A11   A01
A152  A142  A132  A122  A112  A102  A92   A82   A72   A62   A52   A42   A32   A22   A12   A02
A153  A143  A133  A123  A113  A103  A93   A83   A73   A63   A53   A43   A33   A23   A13   A03
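
The repacking from Table 2 to Table 3 can be sketched as below; the helper name and the most-significant-first digit ordering of the input are illustrative assumptions:

```python
def deinterleave_row(row, digits_per_element=4):
    """Repack a row of multi-digit elements (Table 2 order) into rows of
    same-weight digits (Table 3 order). Each element is a list of digits
    ordered most-significant first."""
    K = digits_per_element
    # Output row k collects digit k (weight 2^(m*k)) of every element.
    return [[elem[K - 1 - k] for elem in row] for k in range(K)]

# Two 4-digit elements, highest element first, as in Table 2:
row = [["A13", "A12", "A11", "A10"], ["A03", "A02", "A01", "A00"]]
print(deinterleave_row(row))
# [['A10', 'A00'], ['A11', 'A01'], ['A12', 'A02'], ['A13', 'A03']]
```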









The system 100 further comprises a matrix multiply unit 102 that uses an unsigned digit matrix multiplier unit 108, which includes a systolic array of processing elements (PE's). Each PE computes a digit matrix multiply of a submatrix of smaller dimensions using an array of low-precision integer dot product units (DPU's). The low-precision integers may be 1-, 2-, or 4-bit digits.


The unsigned digit matrix multiplier unit 108 calculates the product of two signed matrices and adjusts for the sign correction using the sum of all elements within each row of one matrix and the sum of all elements within each column of the other matrix. The row sums and the column sums are calculated online using dedicated hardware blocks.



FIG. 2 is a block diagram illustrating an internal organization of the digit matrix multiply unit 108, in accordance with an embodiment of the present disclosure. The digit matrix multiplier unit 108 further comprises a plurality of subsystems to perform various operations. The plurality of subsystems comprises an input unit 202 configured to decode instructions and generate control signals. Concurrently, in the input unit 202, the digit matrices are buffered for transfer to a compute unit 204 to be multiplied over time. The plurality of subsystems further comprises the compute unit 204 configured to compute the product of two digit matrices and transfer the product to a matrix combination unit to be accumulated. The product of digit matrices is computed using a systolic array of processing elements (PE's), where each PE comprises an array of dot product units (DPU's). The hierarchical distribution of digit matrix computations between DPU arrays within the PE and systolic arrays of PE's allows for large digit matrix sizes within the digit matrix multiplier while reducing overall latency. Each DPU computes the unsigned dot product of one row of one digit matrix and one column of the other digit matrix. To adjust for the product of signed matrices, the row sums of one matrix and the column sums of the other matrix are computed in the compute unit 204. The plurality of subsystems further comprises a matrix combination unit 206 configured to accumulate results of current digit matrix multiplications with the results of previous multiplications and to apply scale factors as necessary to account for the multi-precision computations and to adjust for offsets using row sums and column sums as necessary to account for signed matrix products. Partial accumulation results are stored locally inside the matrix combination unit 206 and, once all accumulations are completed, the accumulation results are transferred to an output unit 208. The plurality of subsystems further comprises the output unit 208 configured to convert full-precision elements of the matrix product to a reduced precision by performing scaling, rounding and saturation operations.



FIG. 3 is a block diagram illustrating an exemplary input unit 202, in accordance with an embodiment of the present disclosure. The input unit comprises buffers for storing the input digit matrices of fixed dimension and fixed digit size. The buffered digit matrices are subsequently transferred to the compute unit over multiple cycles while the input matrices for the next operation are collected in the input unit. In an exemplary embodiment, the input unit may buffer two 32×32 matrices A and B of 2-bit elements while an additional two matrices are being transferred to the compute unit. A and B are each reorganized into 16 sets of 8×8 sub-matrices {A0, . . . , A15} and {B0, . . . , B15}. A subset of four sub-matrices of A and four sub-matrices of B are transferred to the compute unit in each clock cycle.



FIG. 4 illustrates an exemplary input data flow of digit matrices A and B within the systolic array of PE's that comprises the compute unit 204. The illustration shows how data is shared within the systolic array for efficiency. Sub-matrices of A are shared by PE's in the same row. Sub-matrices of B are shared by the PE's in the same column. At every clock cycle, the sub-matrix of A at the input of each PE is passed to the right. At every clock cycle, the sub-matrix of B at the input of each PE is passed down to the row below.



FIG. 5 illustrates an exemplary output data flow of digit matrices out of the systolic array of PEs, in accordance with an embodiment of the present disclosure. To prevent congestion of wires and to output results in the correct cycle, the result of each PE is sent to the PE below to be forwarded, rather than being sent directly to the output. At every cycle, the systolic array outputs up to four sub-matrix multiplication results from the four bottom-most PEs. Concurrently, as the PE's are computing results, the sum of rows and columns of matrices A and B are calculated to be sent out to the combination unit 206.



FIG. 6 is a block diagram illustrating an exemplary dot product unit (DPU) 600, in accordance with an embodiment of the present disclosure.



FIG. 7 is a block diagram illustrating an exemplary block matrix multiplier (PE) 700, in accordance with an embodiment of the present disclosure. The block matrix multiplier 700 is an array of dot product units (DPU's) 600. Each of the dot product units (DPU's) 600 computes the dot product of a row of a block A with a column of a block B. In addition to several unsigned multipliers and an adder, each dot product unit (DPU) 600 includes a register. The register accumulates the results of the dot product unit (DPU) 600 over time. For synchronization of the results, the dot product unit (DPU) 600 includes an extra register and a mux to hold or forward the outputs arriving from the dot product units (DPU's) 600 in the block matrix multiplier (PE) 700 above it.
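
A behavioral sketch of a single DPU follows; the class shape, the hold/forward signaling, and the Python register model are illustrative assumptions about the structure described above, not a hardware specification:

```python
class DotProductUnit:
    """Behavioral model of a DPU: unsigned multiplies feeding an adder,
    a local accumulation register, and an extra register plus mux for
    holding or forwarding results from the PE above."""
    def __init__(self):
        self.acc = 0        # accumulates dot product results over time
        self.forward = 0    # extra register for output synchronization

    def step(self, a_row, b_col, upstream_out=0, hold=False):
        # Unsigned dot product of one row of A with one column of B.
        self.acc += sum(a * b for a, b in zip(a_row, b_col))
        # The mux either holds the previous value or latches the output
        # arriving from the DPU in the PE above.
        if not hold:
            self.forward = upstream_out
        return self.forward
```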



FIG. 8 is a block diagram illustrating an exemplary matrix combination unit 206, in accordance with an embodiment of the present disclosure. Data has thus far been treated as unsigned. The signed product, as shown in equation (2), is the sum of an unsigned component, which is the result of the compute block, and an offset:






(A*B)[i][j] = A′[i]*B′[j] − α*ΣB[j] − β*ΣA[i] − α*β   equation (2).


Here, the offset has three components: a weighted row sum of A[i], a weighted column sum of B[j], and a constant. In a combine stage of the matrix combination unit 206, the offsets are subtracted from the unsigned component using the row sum of A[i] and the column sum of B[j], which were both computed in the computation stage in parallel with the systolic array. Once the sign is adjusted for, the product is shifted to adjust its weight for the multi-precision computation. Finally, the product is accumulated in an accumulator array 806. The output from the combination stage is read from the accumulator array 806 and sent to the output stage.
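
For illustration only, the combine-stage algebra of equations (2) and (22) can be modeled as below; the offsets alpha and beta, the NumPy types, and the explicit inner-dimension factor in the constant term are assumptions of this sketch rather than a description of the datapath:

```python
import numpy as np

def signed_matmul_via_unsigned(A, B, alpha, beta):
    """Recover the signed product A @ B from the unsigned product of the
    biased matrices A' = A + alpha and B' = B + beta by subtracting the
    weighted row sums of A, the weighted column sums of B, and a constant
    (which, written element-wise, carries the inner dimension k2)."""
    A_u = A.astype(np.int64) + alpha   # unsigned equivalent of A
    B_u = B.astype(np.int64) + beta    # unsigned equivalent of B
    unsigned_prod = A_u @ B_u          # what the unsigned multiplier produces
    k2 = A.shape[1]
    row_sum_A = A.sum(axis=1, keepdims=True)   # sum of each row of A
    col_sum_B = B.sum(axis=0, keepdims=True)   # sum of each column of B
    return unsigned_prod - beta * row_sum_A - alpha * col_sum_B - alpha * beta * k2
```

For 4-bit signed data, for example, alpha = beta = 8 maps the signed range [−8, 7] onto the unsigned range [0, 15].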



FIGS. 9A and 9B are a block diagram illustrating an exemplary output unit 900, in accordance with an embodiment of the present disclosure. The full-precision matrix multiply results are packed before being output. The full-precision matrix multiply results are rounded, shifted, and saturated to a desired precision. The number of result elements at the output of the output unit 900 is small compared to the total number of elements in A and B. The present design of the system 100 focuses on outputting results over time instead of outputting the results all at once. By shifting, rounding, saturating, and outputting results over time, less logic is used, which makes the design more cost efficient. In an exemplary embodiment, the digit matrix multiplier unit 108 can compute a matrix multiplication of dimensions (32×1024)*(1024×32) at 64-bit precision and post-process it using just a single 138-bit-wide shift, round, and saturate datapath over many cycles. This is done by outputting one element per cycle for 1024 cycles. The digit matrix multiplier unit 108 can also efficiently compute smaller matrix multiplications. In an exemplary embodiment, a 32×32 matrix of 8-bit precision is multiplied with a 32×32 matrix of 2-bit precision, outputting 64 elements per cycle for 16 cycles.
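
A minimal sketch of the shift, round, and saturate post-processing follows; the function name and the float-based rounding are illustrative assumptions:

```python
import numpy as np

def scale_round_saturate(acc, scale, out_bits, signed=True):
    """Apply the user scale factor, round to nearest, and saturate the
    full-precision accumulator values to out_bits."""
    scaled = np.rint(acc.astype(np.float64) * scale)
    if signed:
        lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    else:
        lo, hi = 0, (1 << out_bits) - 1
    return np.clip(scaled, lo, hi).astype(np.int64)

# A 20-bit accumulator value scaled by 1/1024 and limited to 8 bits:
print(scale_round_saturate(np.array([102400]), 1/1024, 8))  # -> [100]
```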



FIGS. 10A and 10B are a process flow diagram 1000 illustrating an exemplary method for implementing variable-precision matrix multiplications using a low-precision digit matrix multiplier, in accordance with an embodiment of the present disclosure. At step 1002, the method comprises computation of the matrix product C=A*B, where A is an N1×N matrix, B is an N×N2 matrix, and C is an N1×N2 matrix. All elements of A and B are M-bit elements. A, B, and C are stored in system memory 112. At step 1004, the method comprises decomposing a first matrix A into a set {Air} of n×n sub-matrices of M-bit elements. The first matrix A is an N1×N matrix and is stored in a system memory 112. In this case:






i={0, . . . , N1/n−1} and r={0, . . . , N/n−1}  equation (3).


The set {Air} is stored in the system memory 112.


At step 1004, the method comprises decomposing a second matrix B into a set {Brj} of n×n sub-matrices of the M-bit digits. The second matrix B is an N×N2 matrix and is stored in the system memory 112. In this case:






r={0, . . . , N/n−1} and j={0, . . . , N2/n−1}  equation (4).


The set {Brj} is stored in the system memory 112.


At step 1006, the method further comprises initializing i and j to 0.


At step 1008, the method comprises computing product sub-matrix Cij. At step 1008, the method further comprises initializing r to 0 and initializing an n×n bank of accumulators ACC to 0.


At step 1010, the method comprises retrieving the submatrix Air and the submatrix Brj from the system memory 112 and placing them in intermediate storage. Further, the method comprises computing the product Air×Brj.


At step 1012, the method comprises decomposing the submatrix Air into a set of K m-bit digit matrices {Airk0}. The digit position k0 ranges from {0, . . . , K−1}. In this case, Airk0 has a corresponding scale factor 2^(m*k0). The digit matrix {Airk0} is placed in the intermediate storage. Similarly, the method further comprises decomposing the submatrix Brj into a set of K m-bit digit matrices {Brjk1}. In this case, the digit position k1={0, . . . , K−1}, and Brjk1 has a corresponding scale factor 2^(m*k1). The digit matrix {Brjk1} is placed in the intermediate storage.


At step 1014, the method comprises initializing k0 and k1 to 0, and the scale factor S is set to 1. At step 1016, the method comprises retrieving Airk0 and Brjk1 from the intermediate storage. Further, the method comprises computing a third n×n matrix T by multiplying Airk0 and Brjk1 using the digit matrix multiplier functional unit 108.


At step 1018, the method comprises multiplying the third matrix T by the scale factor S and adding the resulting product to the accumulator ACC. The accumulator ACC is updated with the resulting value:





ACC=ACC+T*S   equation (5).


At step 1020, the method comprises incrementing k1 and setting the scale factor S equal to 2^(m*(k0+k1)).


At step 1022, the method comprises determining whether k1<K.


If k1 is not less than K, then at step 1024, the method comprises setting k1 to 0, incrementing k0, and setting the scale factor S equal to 2^(m*(k0+k1)). If k1 is less than K, then the loop goes back to step 1016.


At step 1026, it is determined whether k0<K.


If k0 is not less than K, then at step 1028, the method comprises incrementing the value of r. If k0 is less than K, then the loop goes back to 1016.


At step 1030, it is determined whether r<N/n.


If r is not less than N/n, then at step 1032, the method comprises post-processing each accumulator element by first multiplying each accumulator element comprising Cij with a defined scale factor, next performing rounding and saturation on the scaled Cij, and lastly storing the saturated Cij in the system memory 112. If r is less than N/n, then the loop goes back to step 1010.


At step 1034, the method comprises incrementing i.


At step 1036, it is determined whether i<N1/n.


If i is not less than N1/n, then at step 1038 the method comprises incrementing j. If i is less than N1/n, then the loop goes back to step 1008.


At step 1040, it is determined whether j<N2/n.


If j is not less than N2/n, then at step 1042 the method comprises outputting a product matrix C=A*B. The product matrix C is an N1×N2 matrix. If j is less than N2/n, then the loop goes back to step 1008.


The method comprises repeating steps from 1008 to 1038 if i is less than N1/n and if j is less than N2/n.


The method further comprises repeating steps from 1010 to 1028, if r is less than N/n.


The method further comprises repeating steps from 1016 to 1020, if k1 is less than K.


The method further comprises repeating steps from 1016 to 1024, if k1 is less than K and if k0 is less than K.



FIG. 11 is a diagram illustrating decomposition of submatrices into digit matrices, in accordance with an embodiment of the present disclosure. A matrix can be represented as a sequence of digit matrices (similar to an integer being represented as a sequence of digits). Matrix multiplication can be performed in terms of digit matrix multiplications. The number of digit matrices that decomposition produces depends on the original matrix's precision and the supported digit size. This allows for using a fixed digit matrix multiplier to implement multi-precision matrix multiplication.


In one exemplary embodiment, a method for implementing variable-precision matrix multiplications using a low-precision digit matrix multiplier is explained below. Consider two matrices A and B stored in the system memory 112. A is a 12×12 matrix of 8-bit digits. B is a 12×12 matrix of 8-bit digits. Each matrix has 12×12 = 144 elements, where each element is stored as an 8-bit number. A 4×4 digit matrix multiplier of 2-bit digits is used. A and B are each decomposed into 4×4 matrices of 2-bit digits.


The first step of the multiplication of matrices A and B includes decomposing A into submatrices, as shown below:






A = [[A00 A01 A02], [A10 A11 A12], [A20 A21 A22]]   equation (6)


Each submatrix Aij is a 4×4 matrix. For example, A00 is a 4×4 matrix of 16 scalar elements as shown below:






A00 = [[a00 a01 a02 a03], [a10 a11 a12 a13], [a20 a21 a22 a23], [a30 a31 a32 a33]]   equation (7)


Elements of a submatrix are full-precision (same precision as A's elements).


At step II, the matrix B is similarly decomposed into 4×4 block submatrices as shown below:






B = [[B00 B01 B02], [B10 B11 B12], [B20 B21 B22]]   equation (8)


Each of the 9 submatrices is further decomposed into four 4×4 digit matrices of 2-bit digits as shown below:






B00 = 2^6×B003 + 2^4×B002 + 2^2×B001 + 2^0×B000   equation (9)


The same process is done for A.
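
A round-trip check of the digit decomposition of equation (9) is sketched below; the sample values and helper name are illustrative:

```python
import numpy as np

def to_digit_matrices(X, m=2, K=4):
    """Split a matrix of unsigned (m*K)-bit elements into K digit
    matrices; digit matrix k carries weight 2^(m*k)."""
    return [(X >> (m * k)) & ((1 << m) - 1) for k in range(K)]

B00 = np.array([[0xB7, 0x12], [0x3C, 0xFF]], dtype=np.int64)  # sample slice
B000, B001, B002, B003 = to_digit_matrices(B00)
# Reconstruction per equation (9): B00 = 2^6*B003 + 2^4*B002 + 2^2*B001 + 2^0*B000
recon = (B003 << 6) + (B002 << 4) + (B001 << 2) + B000
assert np.array_equal(recon, B00)
```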


At step III, the product C of A and B is decomposed. The 12×12 matrix product C=A×B is implemented with 4×4 sub-matrices as shown below:





[[C00 C01 C02], [C10 C11 C12], [C20 C21 C22]]=[[A00 A01 A02], [A10 A11 A12], [A20 A21 A22]]×[[B00 B01 B02], [B10 B11 B12], [B20 B21 B22]]  equation (10)


Each output submatrix is calculated by multiplying 3 pairs of submatrices and adding the results together. For example, C00 can be calculated as shown below:






C00 = A00×B00 + A01×B10 + A02×B20   equation (11)


Therefore, the matrix product A×B can be decomposed into 27 submatrix multiply and add operations. For each one of these 27 operations, the processor fetches a submatrix Air and a submatrix Brj, decomposes each into its constituent digit matrices (4 digit matrices for Air and 4 digit matrices for Brj), and implements 16 digit matrix multiply, scale, and add operations. In total, 27×16 = 432 digit matrix multiply, scale, and add operations implement A×B.


At step IV, which is a single digit matrix multiply needed in step III, the product A00×B00 is computed in the following manner:


At the first sub-step of step IV, a processor fetches A00 and B00 from system memory 112. At this point, A00 and B00 are 4×4 matrices of 8-bit elements.


At the second sub-step of step IV, the processor decomposes A00 into four 4×4 digit matrices of 2-bit elements:





(A000, A001, A002 and A003)   equation (12)


At the third sub-step of step IV, the processor similarly decomposes B00 into four 4×4 digit matrices of 2-bit elements:





(B000, B001, B002 and B003)   equation (13)


At the fourth sub-step of step IV, the processor multiplies A000 and B000 and stores the resulting 4×4 matrix into a bank of 4×4 accumulators. Each element of the accumulator is wide enough to store the result of the largest matrix multiply supported by the system 100. While the result of A000×B000 would only need 6 bits to store each element, each element of the accumulator bank could be much wider, such as 100 bits or more.


At the fifth sub-step of step IV, the processor multiplies A000 and B001. The processor multiplies the result by the scale factor of B001, which is 2^2, and adds the result to the accumulator. Therefore, the accumulator contains the result of





acc = A000×B000 + 2^2×A000×B001   equation (14)


At the sixth sub-step of step IV, the processor multiplies A000 and B002, applies a scale factor of 2^4, and adds the result to the accumulator. The accumulator thus contains the result of





acc = A000×B000 + 2^2×A000×B001 + 2^4×A000×B002   equation (15)


The seventh sub-step of step IV includes adding (A000×B003)×2^6. The resulting accumulator state is as shown below:





acc = A000×B000 + 2^2×A000×B001 + 2^4×A000×B002 + 2^6×A000×B003   equation (16)


At the eighth sub-step of step IV, the process is continued similarly for each remaining combination of the digit matrices of A00 and B00:





acc = acc + 2^2×A001×B000
acc = acc + 2^4×A001×B001
acc = acc + 2^6×A001×B002
acc = acc + 2^8×A001×B003   equation (17)

Then:





acc = acc + 2^4×A002×B000
acc = acc + 2^6×A002×B001
acc = acc + 2^8×A002×B002
acc = acc + 2^10×A002×B003   equation (18)

Finally:





acc = acc + 2^6×A003×B000
acc = acc + 2^8×A003×B001
acc = acc + 2^10×A003×B002
acc = acc + 2^12×A003×B003   equation (19)


Once all digit matrices for A00 and B00 have been multiplied, scaled and added to the accumulator, the accumulator contains the results for A00×B00.
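
The sixteen scaled digit products of equations (14) through (19) can be verified against a direct full-precision product; this sketch assumes NumPy and unsigned 8-bit inputs:

```python
import numpy as np

def digit_block_product(A00, B00, m=2, K=4):
    """acc = sum over k0, k1 of 2^(m*(k0+k1)) * A00_k0 @ B00_k1."""
    acc = np.zeros((A00.shape[0], B00.shape[1]), dtype=np.int64)
    dA = [(A00 >> (m * k)) & ((1 << m) - 1) for k in range(K)]
    dB = [(B00 >> (m * k)) & ((1 << m) - 1) for k in range(K)]
    for k0 in range(K):
        for k1 in range(K):
            acc += (dA[k0] @ dB[k1]) << (m * (k0 + k1))
    return acc

A00 = np.random.randint(0, 256, (4, 4), dtype=np.int64)
B00 = np.random.randint(0, 256, (4, 4), dtype=np.int64)
assert np.array_equal(digit_block_product(A00, B00), A00 @ B00)
```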


At step V, the product A01×B10 is calculated. Once step IV is completed, the processor fetches A01 and B10 and repeats the second sub-step of step IV through the eighth sub-step of step IV on these submatrices. Each digit matrix product is scaled appropriately and added to the accumulator. At the end of step V, the accumulator contains:





A00×B00 + A01×B10   equation (20)


At step VI, the product A02×B20 is computed. Step VI is identical to step V. At the end of step VI, the accumulator contains:






C00 = A00×B00 + A01×B10 + A02×B20   equation (21)


Note that this is the same as equation (11).


At this point, 48 digit matrix multiply results have been added to the accumulator after appropriate scaling.


Step VII includes post-processing of the accumulator results. Once C00 has been calculated, the accumulator's results must be scaled, rounded, and saturated to the desired output bit width, and the results must be committed back to system memory 112. The results stored in each element of the accumulator for this example occupy 20 bits, since 12 additions of 8-bit×8-bit products require 20 bits. The physical accumulator element may be much larger than 20 bits to accommodate the product of large matrices. For example, suppose an 8-bit output precision is desired and a user specifies a scale factor of 1/1024. The data post-processing subsystem would shift the data right by 10 bits (equivalent to multiplying each accumulator element by 1/1024) and range-limit the result to 8 bits by rounding and saturating. The resulting 4×4 matrix of 8-bit elements is stored back into the system memory 112 to complete the process for C00.


Step VIII includes the calculation of C01 through C22. The process for calculating C01 through C22 is identical to the process described above for C00 in steps IV through VII. Overall, 48×9 = 432 digit matrix multiply operations are needed to calculate C.


In an additional embodiment, the present disclosure is applicable to both signed and unsigned arithmetic operations and multi-precision arithmetic operations.


Generally, in fixed point arithmetic, the number of bits needed to maintain full precision increases with each operation. When adding two fixed point numbers, the number of bits increases by 1. When multiplying two numbers, the number of bits doubles, since multiplication is essentially the same as adding a number to itself many times. When adding n numbers, the bit growth is log2(n). An unsigned n-bit number can hold values from 0 to 2^n−1. A signed n-bit number in 2's complement representation can hold values from −2^(n−1) to 2^(n−1)−1. One way to go from a signed number to an unsigned number is to add the offset 2^(n−1) so that it now has the range 0 to 2^n−1.
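
The signed-to-unsigned offset just described is shown concretely below; a minimal sketch:

```python
def signed_to_unsigned(x, n):
    """Map a signed n-bit value in [-2^(n-1), 2^(n-1)-1] to the unsigned
    range [0, 2^n - 1] by adding the offset 2^(n-1)."""
    return x + (1 << (n - 1))

def unsigned_to_signed(u, n):
    """Inverse mapping: subtract the offset 2^(n-1)."""
    return u - (1 << (n - 1))

assert signed_to_unsigned(-128, 8) == 0    # most negative 8-bit value -> 0
assert signed_to_unsigned(127, 8) == 255   # most positive 8-bit value -> 255
```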


An embodiment of the present disclosure allows multiplication of two signed matrices using an unsigned matrix multiplier. This can be achieved by adding a fixed offset to all the numbers. Let A (with dimensions k1×k2) and B (with dimensions k2×k3) be two signed matrices, and let A′ (k1×k2) and B′ (k2×k3) be the unsigned equivalents of A and B, respectively. If α and β are scalars representing the offsets and 1 represents a matrix of ones (with the same dimensions as A and B, respectively; not the same as the identity matrix I), equation (22) below shows how the signed matrix multiplication is calculated from the unsigned matrix multiplication result:






A′*B′=(A+α1)(B+β1)






A′*B′=A*B+A*(β1)+(α1)*B+(α1)*(β1)






A*B=A′*B′−A*(β1)−(α1)*B−(α1)*(β1)   equation (22)


In an alternate embodiment, the two middle terms may be computed as below:






A*(β1) = β*ΣA[i]

(α1)*B = α*ΣB[j]  equation (23)


Equation (23) depicts the weighted row sums of A and the weighted column sums of B. Essentially, the first term is a scaled version of a row sum of A and the second term is a scaled version of a column sum of B. Rather than compute the matrix multiplication with a matrix of ones, the present system 100 uses the row sums and column sums to simplify this calculation.
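
A numeric check of equations (22) and (23) is sketched below; the sizes, offsets, and use of NumPy are illustrative, and the constant term is written with the inner-dimension factor that appears when the expansion is evaluated element-wise:

```python
import numpy as np

alpha = beta = 8                        # offsets for 4-bit signed data
A = np.random.randint(-8, 8, (3, 5))    # signed k1 x k2 matrix
B = np.random.randint(-8, 8, (5, 2))    # signed k2 x k3 matrix
lhs = (A + alpha * np.ones_like(A)) @ (B + beta * np.ones_like(B))  # A' * B'
# Equation (23): the middle terms reduce to scaled row/column sums.
row_sum = A.sum(axis=1, keepdims=True)
col_sum = B.sum(axis=0, keepdims=True)
rhs = A @ B + beta * row_sum + alpha * col_sum + alpha * beta * A.shape[1]
assert np.array_equal(lhs, rhs)
```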


This decomposition of a signed matrix multiplication into an unsigned multiplication and offset will be used in this design so that the matrix multiplier can compute an unsigned or signed matrix multiplication without requiring a separate signed and unsigned multiplier.


In case of multi-precision arithmetic, an embodiment of the present disclosure allows the system 100 to break the data into digits of fixed precision to use a small fixed-precision multiplier to multiply larger data. Suppose a and b are each 8-bit data and the multiplier takes data of 2-bit precision. a and b are each decomposed into 2-bit digits, so there are 4 digits of a and 4 digits of b. Each 2-bit digit of a needs to be multiplied with each 2-bit digit of b and shifted, depending on the bit position of that digit. The final product would be computed as shown below:










a*b = a3*b3*2^12 + a3*b2*2^10 + a3*b1*2^8 + a3*b0*2^6
    + a2*b3*2^10 + a2*b2*2^8 + a2*b1*2^6 + a2*b0*2^4
    + a1*b3*2^8 + a1*b2*2^6 + a1*b1*2^4 + a1*b0*2^2
    + a0*b3*2^6 + a0*b2*2^4 + a0*b1*2^2 + a0*b0*2^0   equation (24)
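
Equation (24) can be checked exhaustively for all 8-bit operands; a minimal sketch:

```python
def multiprecision_mul(a, b, m=2, K=4):
    """Build an 8-bit by 8-bit product from 2-bit digit products,
    each weighted by 2^(m*(i+j)), as in equation (24)."""
    digits = lambda x: [(x >> (m * k)) & ((1 << m) - 1) for k in range(K)]
    da, db = digits(a), digits(b)
    return sum(da[i] * db[j] << (m * (i + j)) for i in range(K) for j in range(K))

assert all(multiprecision_mul(a, b) == a * b
           for a in range(256) for b in range(256))
```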








This same approach can be applied to matrix multiplication. Matrices would be broken up into digit matrices as shown in FIG. 11. This allows a fixed digit matrix multiplier to be used to implement a multi-precision matrix multiply. Equation (25) below shows how this would be calculated:










A*B = A3*B3*2^12 + A3*B2*2^10 + A3*B1*2^8 + A3*B0*2^6
    + A2*B3*2^10 + A2*B2*2^8 + A2*B1*2^6 + A2*B0*2^4
    + A1*B3*2^8 + A1*B2*2^6 + A1*B1*2^4 + A1*B0*2^2
    + A0*B3*2^6 + A0*B2*2^4 + A0*B1*2^2 + A0*B0*2^0   equation (25)








This is significant because the multipliers in the present design are of fixed precision but can be used for multi-precision computations. A multiplier would compute each sub-product, such as A3*B3, and a shifter would adjust the weight, in this case shifting A3*B3 left by 12. This ability to process different sized data using a fixed size block is what makes it multi-precision.


Various embodiments of the present disclosure relate to a system using a low-precision digit matrix multiplier for operations. The operations include any multiple of the digit size. The present system handles a range of precisions, such as 2-bit, 4-bit, 6-bit, 12-bit, or 64-bit. In contrast to existing designs, the present system's design supports dynamic selection of precisions, making it easy to select a different precision for each individual operation and hence maximizing power efficiency. The number of multiply operations and cycles required for multiplying two 6-bit precision matrices in this design is a little over half that required using 8-bit precision. Therefore, the ability to dynamically select precision improves power consumption and latency.


Furthermore, the purpose of the present design is to multiply two matrices of variable dimensions and precisions using a fixed-size matrix multiplier. The system 100 breaks up the two matrices into matrices of a fixed precision and dimension, feeding those new matrices to the matrix multiplier over time. The product of each is computed in a systolic array of block matrix multipliers, accumulated in a set of accumulators, and sent back to the register file 106. Over time, the result of the entire matrix multiply of the original matrices is computed.


The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.


The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.


The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid-state memory, magnetic tape, a removable computer diskette, a random-access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.


A description of an embodiment with several components in communication with each other does not imply that all such components are required. On the contrary, a variety of optional components are described to illustrate the wide variety of possible embodiments of the invention. When a single device or article is described herein, it will be apparent that more than one device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate), it will be apparent that a single device/article may be used in place of the more than one device or article or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the invention need not include the device itself.


The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open-ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.


Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based here on. Accordingly, the embodiments of the present invention are intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.

Claims
  • 1. A system for implementing variable-precision matrix operations that uses an array of low-precision (1, 2, and 4 bits) unsigned integer dot product units and an array of higher-precision accumulators to calculate a matrix product of higher-precision integer elements, the system comprising: a load store unit configured to load and store data between a register file and system memory, which may either be external or on-chip; the register file configured to provide the data in natural order to a permute unit; the permute unit configured to deinterleave the data, converting the data into fixed dimension and fixed precision sub-matrices, wherein the deinterleaved data is stored back into the register file; and a Digit Matrix Multiplier Unit configured to receive digit matrices in deinterleaved form from the register file, wherein the Digit Matrix Multiplier Unit comprises a plurality of subsystems to perform the following operations: an input unit configured to decode instructions and generate control signals, wherein concurrently, the digit matrices are buffered, and a portion of the digit matrices are transmitted to a compute unit to be multiplied; the compute unit configured to compute the product of two digit matrices and to transmit the product to a matrix combination unit to be accumulated, wherein the product of the digit matrices is computed using an array of dot product units, wherein each dot product unit computes a dot-product of one row of one digit matrix and one column of the other digit matrix; the matrix combination unit configured to accumulate results of current digit matrix multiplications with results of previous multiplications and apply offsets as needed to account for the multi-precision aspect and the unsigned product to signed product conversion, wherein the accumulation results are stored locally inside the matrix combination unit, which transmits a portion of the results to an output unit; and the output unit configured to convert full-precision elements of the matrix product to a reduced precision by performing scaling, rounding and saturation operations.
  • 2. A system for implementing variable-precision matrix operations that uses unsigned digit matrix multiplication to calculate a product of two signed matrices and adjusts for the sign correction using a sum of all elements within each row of one matrix and a sum of all elements within each column of the other matrix, wherein the row sums and column sums are calculated using dedicated hardware blocks.
  • 3. A system for implementing matrix operations that uses a systolic array of processing elements wherein each processing element computes a matrix multiply of a submatrix of smaller dimension using an array of dot product units.
  • 4. A method for implementing variable-precision matrix operations using a low-precision (1-, 2-, and 4-bit) digit matrix multiplication, the method comprising: (a) decomposing a first matrix A into a set {Air} of n×n sub-matrices of M-bit elements, wherein the first matrix A is an N1×N matrix, and wherein i={0, . . . , N1/n−1} and r={0, . . . , N/n−1}; (b) decomposing a second matrix B into a set {Brj} of n×n sub-matrices of M-bit elements, wherein the second matrix B is an N×N2 matrix, and wherein r={0, . . . , N/n−1} and j={0, . . . , N2/n−1}; (c) initializing i and j to 0; (d) computing product sub-matrix Cij, wherein the computation comprises: (d1) initializing r to 0 and initializing an n×n bank of accumulators ACC to 0; (d2) retrieving the submatrix Air and the submatrix Brj from one of a subset comprising on-chip and off-chip shared memory, and then placing the submatrices into an intermediate storage location; and (d3) computing product Air×Brj, wherein the computing comprises: (d3i) decomposing Air into a set of K m-bit digit matrices {Airk0}, wherein K=M/m and digit position k0 ranges from {0, . . . , K−1}, and wherein Airk0 has a corresponding scale factor 2^(m*k0), and wherein the digit matrix {Airk0} is placed in the intermediate storage location; (d3ii) decomposing Brj into a set of K m-bit digit matrices {Brjk1}, wherein K=M/m and digit position k1={0, . . . , K−1}, and wherein Brjk1 has the corresponding scale factor 2^(m*k1), and wherein the digit matrix {Brjk1} is placed in the intermediate storage location; (d3iii) initializing k0 and k1 to 0 and setting a scale factor S to 1; (d3iv) retrieving Airk0 and Brjk1 from the intermediate storage location; (d3v) computing a third n×n matrix T by multiplying Airk0 and Brjk1 using digit matrix multiplication; (d3vi) multiplying the third matrix T by the scale factor S and adding a respective product to the accumulator (ACC), wherein the accumulator is updated with the resulting value ACC′=ACC+T*S; (d3vii) incrementing k1 and setting the scale factor S equal to 2^(m*(k0+k1)) and going back to step d3iv provided that k1 is less than K, otherwise going to step d3viii; (d3viii) setting k1 to 0, incrementing k0 and setting the scale factor S equal to 2^(m*(k0+k1)) and going back to step d3iv provided that k0 is less than K, otherwise going to step d3ix; (d3ix) incrementing the value of r and going back to step d2 provided that r is less than N/n, otherwise going to step e; (e) post processing of each accumulator element by: applying a defined scale factor to each accumulator element comprising Cij; performing rounding and saturating of a scaled element of Cij; and storing the resulting scaled element in either the on-chip or off-chip shared memory; (f) incrementing j and going back to step d provided that j is less than N2/n, otherwise going to step g; and (g) setting j to 0, incrementing i, and going back to step d provided that i is less than N1/n.
  • 5. The method of claim 4, wherein matrix A and matrix B are broken into rectangular sub-matrices of dimensions n1×n2 and n2×n3 respectively.
  • 6. The method of claim 4 wherein matrix A and matrix B have different element sizes M1 and M2 respectively.
  • 7. The method of claim 4 wherein the order of loops in step d3 (incrementing k0 and k1) is changed when computing the product of Airk0 and Brjk1.
  • 8. The method of claim 4 wherein the order of the loop in step d (incrementing r) is changed when computing the product Cij.
  • 9. The method of claim 4 wherein the order of loops (incrementing i and j) is changed when computing the product of Air and Brj.