Embodiments of the present disclosure relate to the field of processing systems and more particularly to a system and method for implementing variable-precision matrix multiplication using a low-precision digit matrix multiplier.
A Central Processing Unit generally performs different computing tasks such as running application software, operating systems, graphics processing, image processing, and digital signal processing, some of which can be off-loaded to specialized accelerators (e.g., graphics and image processing on a Graphics Processing Unit). Each such processor is programmed in a different manner. The era of big data processing demands higher performance at lower energy than today's general-purpose processors can provide.
Most conventional scalar and data-parallel processors operate on a small subset of data types, often limited to only 8, 16, 32, or 64-bit data. However, in fields like signal processing and machine learning, lower and more varied precisions are becoming increasingly prevalent, such as 2-bit, 4-bit, or 12-bit data. For instance, machine learning problems may require different precisions and matrix sizes in different neural network layers, as well as separate precisions for each operand (such as weights and inputs). As such, the ability to operate on minimum-precision operands while maintaining desired accuracy can in some cases give an order of magnitude or more savings of power and latency.
One of the most common computation kernels in machine learning is matrix-matrix multiplication, which takes two input matrices and returns a matrix result in which each output element is the inner product of a row of the first matrix and a column of the second. A larger matrix problem can be decomposed into several smaller matrix multiplications of a fixed “block size” and then combined. The multiplication of two matrices A and B is essentially a series of dot products of the rows of A with the columns of B. Equation (1) below depicts an example of the two input matrices A and B being broken into 9 blocks each (3 rows by 3 columns), where each output block of the matrix multiply is the inner product of a block row of A and a block column of B (e.g., C0=A0*B0+A1*B3+A2*B6):
[[C0 C1 C2], [C3 C4 C5], [C6 C7 C8]]=[[A0 A1 A2], [A3 A4 A5], [A6 A7 A8]]×[[B0 B1 B2], [B3 B4 B5], [B6 B7 B8]] equation (1)
A matrix multiplier is made up of many dot product units, each of which computes the dot product of a row of A with a column of B. Note that the full-precision output of a dot product unit contains more bits than any individual element of A or B due to the bit growth after each multiplication and addition.
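For illustration only, the block decomposition of equation (1) can be sketched in software as follows. This is a minimal sketch in Python with NumPy; the function and variable names are assumptions for the example and are not part of the disclosure.

```python
# Illustrative sketch of block matrix multiplication per equation (1):
# A and B are each split into a 3x3 grid of blocks, and one output block
# is formed from three block products.
import numpy as np

def split_blocks(M, rows, cols):
    """Split matrix M into a (rows x cols) grid of equally sized blocks."""
    br, bc = M.shape[0] // rows, M.shape[1] // cols
    return [[M[i*br:(i+1)*br, j*bc:(j+1)*bc] for j in range(cols)]
            for i in range(rows)]

A = np.random.randint(0, 256, (12, 12))
B = np.random.randint(0, 256, (12, 12))
Ab, Bb = split_blocks(A, 3, 3), split_blocks(B, 3, 3)

# Top-left output block: C0 = A0*B0 + A1*B3 + A2*B6 in the flattened numbering above.
C0 = Ab[0][0] @ Bb[0][0] + Ab[0][1] @ Bb[1][0] + Ab[0][2] @ Bb[2][0]
assert np.array_equal(C0, (A @ B)[0:4, 0:4])
```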
One conventional approach, such as Bit-serial Matrix Multiply (BISMO), which is a vectorized bit-serial matrix multiplication overlay, contains a fully parallel Dot Product Unit (DPU) array. Unlike element-level multi-precision approaches, BISMO uses a matrix level multi-precision approach. However, the BISMO approach relies on bit-level decomposition, which limits the throughput that can be achieved, and broadcasting of information within the entire matrix array, which leads to wiring congestion and makes it difficult to scale for larger matrix sizes. As a result, BISMO is not suitable for high-performance matrix multiplication.
Yet another approach is a systolic array of multipliers. Specifically, the Tensor Processing Unit (TPU) uses fixed 8-bit multipliers and is designed for 256×256 matrices. The TPU is unable to efficiently handle lower precisions such as 2-bit or 4-bit and smaller matrix dimensions such as 8×8 or 32×32. Additionally, due to the high latency of systolic arrays, the TPU is not suitable for embedded applications.
In another approach, the Intel Xeon processor uses vector instructions to compute bit-level dot-products by ANDing two vectors and counting the number of ones remaining. Throughput is limited because computation is not at the matrix level and there is not enough recirculation of data. Furthermore, there is significant overhead in converting from the natural-order in-memory representation to a bit-serial representation. As a result, the Intel Xeon is also not suitable for high-performance low-precision matrix multiplication.
Moreover, existing matrix multiply systems use one of many schemes such as dot product arrays and systolic arrays. However, these operate on a fixed data type, such as an 8-bit systolic array. Sometimes, these matrix multipliers can also be used for a limited subset of hard-coded larger-precision data types such as 16, 32, or 64 bits. These systems do not allow for low-precision operations (those below 8 bits), such as 2- or 4-bit data types. They also do not give flexibility for arbitrary multiples of the matrix multiplier data size, such as allowing for 24-bit data with an 8-bit matrix multiplier.
Hence, there is a need for an improved system and method for implementing variable-precision matrix multiplications using a low-precision digit matrix multiplier.
This summary is provided to introduce a selection of concepts, in a simple manner, which is further described in the detailed description of the disclosure. This summary is neither intended to identify key or essential inventive concepts of the subject matter nor to determine the scope of the disclosure.
In accordance with an embodiment of the present disclosure, a system and method for implementing variable-precision matrix multiplications using a low-precision digit matrix multiplier is disclosed. The system comprises a load store unit configured to load and store matrix data between a register file and either on-chip or off-chip system memory. The system further comprises the register file configured to provide data in natural order to a permute unit. The system further comprises the permute unit, configured to deinterleave the data, i.e., to convert the data into fixed-dimension and fixed-precision sub-matrices. The deinterleaved data is stored back into the register file. The system further comprises a digit matrix multiplier unit configured to receive digit matrices in deinterleaved form from the register file. The digit matrix multiplier unit is composed of multiple subsystems to perform various operations. The subsystems include an input unit, a compute unit, a combination unit, and an output unit. The input unit is configured to decode instructions and generate control signals. Concurrently, in the input unit, the digit matrices are buffered, and a portion of the digit matrices is transmitted to the compute unit to be multiplied. The compute unit is configured to compute the product of two digit matrices and transmit the product to a matrix combination unit to be accumulated. The product of the digit matrices is computed using an array of dot product units (DPUs). Each DPU computes a dot product of one row of one digit matrix and one column of the other digit matrix. The matrix combination unit accumulates the results of the current digit matrix multiplication with the results of previous multiplications and applies offsets as needed to account for the multi-precision aspect and for unsigned-to-signed product conversion. Accumulation results are stored locally inside the matrix combination unit, and portions of the results are transmitted to an output unit. The output unit converts full-precision elements of the matrix product to a reduced precision by performing scaling, rounding and saturation operations.
The present disclosure also includes a method for implementing variable-precision matrix multiplications using a low-precision digit matrix multiplier. The method comprises decomposing a first matrix A into a set {Air} of n×n sub-matrices of M-bit digits and decomposing a second matrix B into a set {Brj} of n×n sub-matrices of M-bit digits. Here, matrix A is an N1×N matrix, matrix B is an N×N2 matrix, and the product of A and B is an N1×N2 matrix C.
The matrices A and B and their corresponding sets {Air} and {Brj} are stored in the system memory. The method further comprises initializing i and j to 0. The method further comprises computing the product sub-matrix Cij. The computation of the product sub-matrix Cij comprises an outer loop and an inner loop.
The outer loop computation is initialized by setting r to 0 and initializing an n×n bank of accumulators, ACC, to 0, and then repeating an outer loop iteration step, followed by incrementing r, provided that r is less than N/n. The outer loop iteration step comprises retrieving the sub-matrix Air and the sub-matrix Brj from the system memory and placing them in intermediate storage. The outer loop iteration step further comprises computing the product Air×Brj, the computation of which includes accumulating into ACC. Once all the outer loop iteration steps have completed, the method further comprises post-processing each accumulator element by multiplying it by a defined scale factor, rounding and saturating it to produce an element of Cij, and storing that element in the system memory.
The computation of the product Air×Brj comprises decomposing Air into a set of K m-bit digit matrices {Airk0} by deinterleaving each digit position for each element of Air. In this case, the digit position k0 ranges over {0, . . . , K−1}, and each Airk0 has a corresponding scale factor 2^(m*k0). The computation of the product Air×Brj further comprises decomposing Brj into a set of K m-bit digit matrices {Brjk1} by deinterleaving each digit position for each element of Brj. The digit position k1 ranges over {0, . . . , K−1}, and each Brjk1 has a corresponding scale factor 2^(m*k1). The digit matrices Airk0 and Brjk1 are placed in the intermediate storage. The computation of the product Air×Brj further comprises initializing k0 and k1 to 0, setting the scale factor, S, to 1, and performing the inner loop computation.
The inner loop computation comprises several iteration steps. One inner loop iteration step is performed by retrieving Airk0 and Brjk1 from intermediate storage, followed by computing an intermediate matrix T (an n×n digit matrix) by multiplying Airk0 and Brjk1 using a digit matrix multiplier functional unit, followed by multiplying T by the scale factor S and adding the resulting product to the accumulator ACC (i.e., ACC=ACC+T*S). The inner loop computation further comprises incrementing k1, setting the scale factor S equal to 2^(m*(k0+k1)), and repeating the inner loop iteration step defined above provided that k1 is less than K. When k1 is equal to K, k1 is set to 0, k0 is incremented, the scale factor S is set to 2^(m*(k0+k1)), and the inner loop iteration step is repeated provided that k0 is less than K.
Once all the outer loop and inner loop computation has concluded and the result Cij has been placed in system memory, the method further comprises incrementing i and repeating the outer loop and inner loop computations provided that i is less than N1/n. When i reaches N1/n, the method further comprises setting i to 0 and incrementing j and repeating the outer loop and inner loop computations provided that j is less than N2/n. Once all the iterations of i and j have been completed, the result matrix C will be in system memory.
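Purely for illustration, the method summarized above can be sketched in software. The following Python sketch assumes unsigned M-bit elements and omits the signed-offset correction and the scale/round/saturate post-processing of the accumulators; all function and variable names are assumptions for the example, not part of the disclosure.

```python
# Behavioral sketch of the variable-precision matrix multiply described above.
import numpy as np

def digit_decompose(X, m, K):
    """Deinterleave an unsigned M-bit matrix into K m-bit digit matrices (LSB digit first)."""
    return [(X >> (m * k)) & ((1 << m) - 1) for k in range(K)]

def variable_precision_matmul(A, B, n, m, M):
    """Compute C = A*B using only n x n products of m-bit digit matrices."""
    N1, N = A.shape
    N2 = B.shape[1]
    K = M // m                                    # digits per element
    C = np.zeros((N1, N2), dtype=np.int64)
    for j in range(N2 // n):                      # output block column index
        for i in range(N1 // n):                  # output block row index
            ACC = np.zeros((n, n), dtype=np.int64)    # n x n accumulator bank
            for r in range(N // n):               # outer loop over the inner dimension
                Air = A[i*n:(i+1)*n, r*n:(r+1)*n]
                Brj = B[r*n:(r+1)*n, j*n:(j+1)*n]
                Ad = digit_decompose(Air, m, K)
                Bd = digit_decompose(Brj, m, K)
                for k0 in range(K):               # inner loop over digit positions of Air
                    for k1 in range(K):           # ... and of Brj
                        T = Ad[k0] @ Bd[k1]       # low-precision digit matrix multiply
                        S = 1 << (m * (k0 + k1))  # scale factor 2^(m*(k0+k1))
                        ACC += T * S
            # post-processing (scale, round, saturate) would be applied here
            C[i*n:(i+1)*n, j*n:(j+1)*n] = ACC
    return C

A = np.random.randint(0, 256, (12, 12), dtype=np.int64)
B = np.random.randint(0, 256, (12, 12), dtype=np.int64)
assert np.array_equal(variable_precision_matmul(A, B, n=4, m=2, M=8), A @ B)
```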
To further clarify the advantages and features of the present disclosure, a more particular description of the disclosure will follow by reference to specific embodiments thereof, which are illustrated in the appended figures. Note that these figures depict only typical embodiments of the disclosure and are therefore not to be considered limiting in scope. The disclosure will be described and explained with additional specificity and detail with the appended figures.
The disclosure will be described and explained with additional specificity and detail with the accompanying figures in which:
Further, those skilled in the art will appreciate that elements in the figures are illustrated for simplicity and may not have necessarily been drawn to scale. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the figures by conventional symbols, and the figures may show only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the figures with details that will be readily apparent to those skilled in the art having the benefit of the description herein.
For the purpose of promoting an understanding of the principles of the disclosure, reference will now be made to the embodiment illustrated in the figures and specific language will be used to describe them. It will nevertheless be understood that no limitation of the scope of the disclosure is thereby intended.
In the present document, the word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment or implementation of the present subject matter described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.
The terms “comprises”, “comprising”, or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process or method that comprises a list of steps does not include only those steps but may include other steps not expressly listed or inherent to such a process or method. Similarly, one or more devices or subsystems or elements or structures or components preceded by “comprises . . . a” does not, without more constraints, preclude the existence of other devices, subsystems, elements, structures, components, additional devices, additional subsystems, additional elements, additional structures or additional components. Appearances of the phrase “in an embodiment”, “in another embodiment” and similar language throughout this specification may, but not necessarily do, all refer to the same embodiment.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which this disclosure belongs. The system, methods, and examples provided herein are only illustrative and not intended to be limiting.
A computer system (standalone, client or server computer system) configured by an application may constitute a “module” (or “subsystem”) that is configured and operated to perform certain operations. In one embodiment, the “module” or “subsystem” may be implemented mechanically or electronically, so a module may comprise dedicated circuitry or logic that is permanently configured (within a special-purpose processor) to perform certain operations. In another embodiment, a “module” or “subsystem” may also comprise programmable logic or circuitry (as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured by software to perform certain operations.
Accordingly, the term “module” or “subsystem” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (hardwired) or temporarily configured (programmed) to operate in a certain manner and/or to perform certain operations described herein.
Referring now to the drawings, and more particularly to
The present disclosure formulates the matrix multiplication problem as several dot product computations. The solution is applied to block matrix multiplication. A large matrix is decomposed into blocks of smaller matrices, and each block is treated like a single piece of data, as shown in Table 1. A single block matrix multiply would be computed as follows: C0=A0*B0+A1*B3+A2*B6. Note that each block A0 . . . A8 has dimensions kb1×kb2 and each block B0 . . . B8 has dimensions kb2×kb3. Hence, each resulting block C0 . . . C8 has dimensions kb1×kb3. The present disclosure uses this concept of block decomposition to compute large matrix multiplications.
The higher-precision matrices are decomposed into lower-precision matrices through deinterleaving. The matrices are originally stored in natural order, each element in a row being stored contiguously. Through the deinterleaving process, data is converted so that each element in a row of a digit matrix is stored contiguously; data is then processed by the matrix multiplier in this deinterleaved form.
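A minimal sketch of this deinterleaving step is shown below, assuming unsigned 8-bit elements split into 2-bit digits; the names are illustrative. Reinterleaving the digit matrices recovers the original full-precision data, which is what makes the decomposition lossless.

```python
# Illustrative sketch of deinterleaving a full-precision matrix into digit matrices.
import numpy as np

def deinterleave(X, m=2, K=4):
    """Return K digit matrices; digit k holds bits [m*k, m*(k+1)) of every element."""
    return [(X >> (m * k)) & ((1 << m) - 1) for k in range(K)]

X = np.random.randint(0, 256, (4, 4), dtype=np.uint8)
digits = deinterleave(X)
# Reinterleaving (weighting digit k by 2^(m*k)) recovers the original matrix.
assert np.array_equal(sum(d.astype(np.uint32) << (2 * k) for k, d in enumerate(digits)), X)
```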
The system 100 further comprises a matrix multiply unit 102 that uses an unsigned digit matrix multiplier unit 108, which includes a systolic array of processing elements (PE's). Each PE computes a digit matrix multiply of a submatrix of smaller dimensions using an array of low-precision integer dot product units (DPU's). The low-precision integers may be 1-, 2- or 4-bit digits.
The unsigned digit matrix multiplier unit 108 calculates the product of two signed matrices and adjusts for the sign correction using the sum of all elements within each row of one matrix and the sum of all elements within each column of the other matrix. The row sums and the column sums are calculated online using dedicated hardware blocks.
A[i]*B[j]=A′[i]*B′[j]−α*ΣB[j]−β*ΣA[i]−α*β equation (2).
Here, the offset has three components: a weighted row sum of A[i], a weighted column sum of B[j], and a constant. In a combine stage of the matrix combination unit 206, the offsets are subtracted from the unsigned component using the row sum of A[i] and the column sum of B[j], which were both computed in the computation stage in parallel with the systolic array. Once the sign is adjusted for, the product is shifted to adjust its weight for the multi-precision decomposition. Finally, the product is accumulated in an accumulator array 806. The output from the combination stage is read from the accumulator array 806 and sent to the output stage.
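The sign-correction idea of equation (2) can be checked numerically with the following hedged sketch, assuming 8-bit signed operands and offsets α = β = 128; variable names are illustrative. Note that, written element-wise in this way, the constant offset term carries a factor of the inner dimension (k2 in the sketch).

```python
# Numerical check: a signed product recovered from an unsigned product plus
# weighted row/column sums and a constant offset.
import numpy as np

k1, k2, k3 = 4, 4, 4
alpha = beta = 128
A = np.random.randint(-128, 128, (k1, k2)).astype(np.int64)
B = np.random.randint(-128, 128, (k2, k3)).astype(np.int64)
A_u, B_u = A + alpha, B + beta            # unsigned equivalents seen by the multiplier

row_sum_A = A.sum(axis=1, keepdims=True)  # per-row sums of A
col_sum_B = B.sum(axis=0, keepdims=True)  # per-column sums of B

# Unsigned product corrected by the weighted row/column sums and a constant term.
signed_product = A_u @ B_u - alpha * col_sum_B - beta * row_sum_A - alpha * beta * k2
assert np.array_equal(signed_product, A @ B)
```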
At step 1002, the method comprises decomposing the first matrix A into a set {Air} of n×n sub-matrices of M-bit digits. The first matrix A is an N1×N matrix and is stored in the system memory 112. In this case:
i={0, . . . , N1/n−1} and r={0, . . . , N/n−1} equation (3).
The set {Air} is stored in the system memory 112.
At step 1004, the method comprises decomposing a second matrix B into a set {Brj} of n×n sub-matrices of the M-bit digits. The second matrix B is an N×N2 matrix and is stored in the system memory 112. In this case:
r={0, . . . , N/n−1} and j={0, . . . , N2/n−1} equation (4).
The set {Brj} is stored in the system memory 112.
At step 1006, the method further comprises initializing i and j to 0.
At step 1008, the method comprises computing the product sub-matrix Cij. At step 1008 the method further comprises initializing r to 0 and initializing an n×n bank of accumulators ACC to 0.
At step 1010, the method comprises retrieving the submatrix Air and the submatrix Brj from the system memory 112 and placing them in intermediate storage. Further, the method comprises computing the product Air×Brj.
At step 1012, the method comprises decomposing the sub-matrix Air into a set of K m-bit digit matrices {Airk0}. The digit position k0 ranges over {0, . . . , K−1}. In this case, Airk0 has a corresponding scale factor 2^(m*k0). The digit matrices {Airk0} are placed in the intermediate storage. Similarly, the method further comprises decomposing the sub-matrix Brj into a set of K m-bit digit matrices {Brjk1}. In this case, the digit position k1={0, . . . , K−1}. Brjk1 has a corresponding scale factor 2^(m*k1). The digit matrices {Brjk1} are placed in the intermediate storage.
At step 1014, the method comprises initializing k0 and k1 to 0 and setting the scale factor S to 1. At step 1016, the method comprises retrieving Airk0 and Brjk1 from the intermediate storage. Further, the method comprises computing an intermediate n×n digit matrix T by multiplying Airk0 and Brjk1 using the digit matrix multiplier functional unit 108.
At step 1018, the method comprises multiplying the matrix T by the scale factor S and adding the resulting product to the accumulator ACC. The accumulator ACC is updated with the resulting value:
ACC=ACC+T*S equation (5).
At step 1020, the method comprises incrementing k1 and setting the scale factor S equal to 2^(m*(k0+k1)).
At step 1022, the method comprises determining whether k1<K.
If k1 is not less than K, then at step 1024, the method comprises setting k1 to 0, incrementing k0 and setting the scale factor S equal to 2^(m*(k0+k1)). If k1 is less than K, then the loop goes back to step 1016.
At step 1026, it is determined whether k0<K.
If k0 is not less than K, then at step 1028, the method comprises incrementing the value of r. If k0 is less than K, then the loop goes back to 1016.
At step 1030, it is determined whether r<N/n.
If r is not less than N/n, then at step 1032, the method comprises post-processing of each accumulator element: each accumulator element of Cij is first multiplied by a defined scale factor, the scaled value is then rounded and saturated, and the resulting Cij is stored in the system memory 112. If r is less than N/n, then the loop goes back to step 1010.
At step 1034, the method comprises incrementing i.
At step 1036, it is determined whether i<N1/n.
If i is not less than N1/n, then at step 1038 the method comprises setting i to 0 and incrementing j. If i is less than N1/n, then the loop goes back to step 1008.
At step 1040, it is determined whether j<N2/n.
If j is not less than N2/n, then at step 1042 the method comprises outputting a product matrix C=A*B. The product matrix C is an N1×N2 matrix. If j is less than N2/n, then the loop goes back to step 1008.
The method comprises repeating steps from 1008 to 1038 if i is less than N1/n and if j is less than N2/n.
The method further comprises repeating steps from 1010 to 1028, if r is less than N/n.
The method further comprises repeating steps from 1016 to 1020, if k1 is less than K.
The method further comprises repeating steps from 1016 to 1024, if k0 is less than K.
In one exemplary embodiment, a method for implementing variable-precision matrix multiplications using a low-precision digit matrix multiplier is explained below. Consider two matrices A and B stored in the system memory 112. A is a 12×12 matrix of 8-bit digits. B is a 12×12 matrix of 8-bit digits. Each matrix has 12^2 = 144 elements, where each element is stored as an 8-bit number. A 4×4 digit matrix of 2-bit digits is used. A and B are each decomposed into 4×4 matrices of 2-bit digits.
The first step of multiplication of matrices A and B includes decomposing A into submatrices, as shown below:
A=[[A00 A01 A02], [A10 A11 A12], [A20 A21 A22]] equation (6)
Each submatrix Aij is a 4×4 matrix. For example, A00 is a 4×4 matrix of 16 scalar elements as shown below:
A00=[[a00 a01 a02 a03], [a10 a11 a12 a13], [a20 a21 a22 a23], [a30 a31 a32 a33]] equation (7)
Elements of a submatrix are full-precision (same precision as A's elements).
At step II, the matrix B is similarly decomposed into 4×4 block submatrices as shown below:
B=[[B00 B01 B02], [B10 B11 B12], [B20 B21 B22]] equation (8)
Each of the 9 submatrices is further decomposed into four 4×4 digit matrices of 2-bit digits as shown below:
B00=2^6×B003+2^4×B002+2^2×B001+2^0×B000 equation (9)
The same process is done for A.
At step III, the product C of A and B is decomposed. The 12×12 matrix product C=A×B is implemented with 4×4 sub-matrices as shown below:
[[C00 C01 C02], [C10 C11 C12], [C20 C21 C22]]=[[A00 A01 A02], [A10 A11 A12], [A20 A21 A22]]×[[B00 B01 B02], [B10 B11 B12], [B20 B21 B22]] equation (10)
Each output submatrix is calculated by multiplying 3 submatrices and adding the results together. For example, C00 can be calculated as shown below:
C00=A00×B00+A01×B10+A02×B20 equation (11)
Therefore, the matrix product A×B can be decomposed into 27 submatrix multiply and add operations. For each one of these 27 operations, the processor fetches a submatrix Air and a submatrix Brj, decomposes each into its constituent digit matrices (4 digit matrices for Air and 4 digit matrices for Brj), and implements 16 digit matrix multiply, scale and add operations. In total, 27×16=432 digit matrix multiply, scale and add operations are needed to implement A×B.
At step IV, a single submatrix multiply needed in step III, namely the product A00×B00, is computed in the following manner:
At the first sub-step of step IV, a processor fetches A00 and B00 from system memory 112. At this point, A00 and B00 are 4×4 matrices of 8-bit elements.
At the second sub-step of step IV, the processor decomposes A00 into four 4×4 digit matrices of 2-bit elements:
(A000, A001, A002 and A003) equation (12)
At the third sub-step of step IV, the processor similarly decomposes B00 into four 4×4 digit matrices of 2-bit elements:
(B000, B001, B002 and B003) equation (13)
At the fourth sub-step of step IV, the processor multiplies A000 and B000 and stores the resulting 4×4 matrix into a bank of 4×4 accumulators. Each element of the accumulator is wide enough to store the result of the largest matrix multiply supported by the system 100. While the result of A000×B000 would only need 6-bits to store each element, each element of the accumulator bank could be much wider such as 100 bits or more.
At the fifth sub-step of step IV, the processor multiplies A000 and B001. The processor multiplies the result by the scale of B001, which is 2^2, and adds the result to the accumulator. Therefore, the accumulator contains the result of
acc=A000×B000+2^2×A000×B001 equation (14)
At the sixth sub-step of step IV, the processor multiplies A000 and B002, applies a scale factor of 2^4 and adds the result to the accumulator. The accumulator thus contains the result of
acc=A000×B000+2^2×A000×B001+2^4×A000×B002 equation (15)
The seventh sub-step of step IV includes adding (A000×B003)×2^6. The resulting accumulator state is as shown below:
acc=A000×B000+2^2×A000×B001+2^4×A000×B002+2^6×A000×B003 equation (16)
At the eighth sub-step of step IV, the process is continued similarly for each remaining combination of the digit matrices A001 through A003 with B000 through B003:
acc=acc+2^2×A001×B000
acc=acc+2^4×A001×B001
acc=acc+2^6×A001×B002
acc=acc+2^8×A001×B003 equation (17)
acc=acc+2^4×A002×B000
acc=acc+2^6×A002×B001
acc=acc+2^8×A002×B002
acc=acc+2^10×A002×B003 equation (18)
acc=acc+2^6×A003×B000
acc=acc+2^8×A003×B001
acc=acc+2^10×A003×B002
acc=acc+2^12×A003×B003 equation (19)
Once all digit matrices for A00 and B00 have been multiplied, scaled and added to the accumulator, the accumulator contains the results for A00×B00.
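As a hedged numerical check of sub-steps four through eight, the following sketch assumes unsigned 8-bit elements and 2-bit digits; the digit-matrix names mirror A000–A003 and B000–B003 above, but the code itself is only an illustration.

```python
# Sanity check: accumulating all 16 scaled digit-matrix products reproduces A00 x B00.
import numpy as np

A00 = np.random.randint(0, 256, (4, 4), dtype=np.int64)
B00 = np.random.randint(0, 256, (4, 4), dtype=np.int64)
A_d = [(A00 >> (2 * k)) & 0x3 for k in range(4)]   # A000..A003
B_d = [(B00 >> (2 * k)) & 0x3 for k in range(4)]   # B000..B003

acc = np.zeros((4, 4), dtype=np.int64)
for k0 in range(4):
    for k1 in range(4):
        acc += (A_d[k0] @ B_d[k1]) << (2 * (k0 + k1))   # scale by 2^(2*(k0+k1))
assert np.array_equal(acc, A00 @ B00)                   # accumulator now holds A00 x B00
```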
At step V, the product A01×B10 is calculated. Once step IV is completed, the processor fetches A01 and B10 and repeats the second sub-step of step IV through the eighth sub-step of step IV on these submatrices. Each digit matrix product is scaled appropriately and added to the accumulator. At the end of step V, the accumulator contains:
A00×B00+A01×B10 equation (20)
At step VI, the product A02×B20 is calculated and accumulated. Step VI is identical to step V. At the end of step VI, the accumulator contains:
C00=A00×B00+A01×B10+A02×B20 equation (21)
Note that this is the same as equation (11).
At this point, 48 digit matrix multiply results have been added to the accumulator after appropriate scaling.
Step VII includes post-processing of the accumulator results. Once C00 has been calculated, the accumulator's results must be scaled, rounded and saturated to the desired output bit width, and the results must be committed back to system memory 112. The results stored in each element of the accumulator for this example occupy 20 bits, since each 8-bit×8-bit product occupies 16 bits and summing 12 such products adds 4 more bits of growth. The physical accumulator element may be much larger than 20 bits to accommodate the product of large matrices. Suppose, for example, that an 8-bit output precision is desired and the user specifies a scale factor of 1/1024. The data post-processing subsystem would shift the data right by 10 bits (equivalent to multiplying each accumulator element by 1/1024) and range-limit the result to 8 bits by rounding and saturating. The resulting 4×4 matrix of 8-bit elements is stored back into the system memory 112 to complete the process for C00.
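The post-processing just described can be sketched as follows, assuming (for illustration only) that the 1/1024 scale factor is applied as a 10-bit arithmetic right shift with round-to-nearest and saturation to a signed 8-bit output range; the names and sample values are not from the disclosure.

```python
# Illustrative scale/round/saturate stage for accumulator elements.
import numpy as np

def scale_round_saturate(acc, shift=10, out_bits=8):
    """Scale accumulator elements by 2**-shift, round to nearest, saturate to out_bits."""
    rounded = (acc + (1 << (shift - 1))) >> shift            # add half, then shift (round half up)
    lo, hi = -(1 << (out_bits - 1)), (1 << (out_bits - 1)) - 1
    return np.clip(rounded, lo, hi).astype(np.int8)           # saturate to the output range

acc = np.array([[123456, -700000], [512, -513]], dtype=np.int64)
print(scale_round_saturate(acc))   # yields 121, -128 (saturated), 1, -1
```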
Step VIII includes the calculation of C01 through C22. The process for calculating C01 through C22 is identical to the process described above for C00 in steps IV through VII. Overall, 48×9=432 digit matrix multiply operations are needed to calculate C.
In an additional embodiment, the present disclosure is applicable to both signed and unsigned arithmetic operations and multi-precision arithmetic operations.
Generally, in fixed point arithmetic, the number of bits needed to maintain full precision increases with each operation. When adding two fixed point numbers, the number of bits increases by 1. When multiplying two numbers, the number of bits doubles, since multiplication is essentially the same as adding a number to itself many times. When adding n numbers, the bit growth is log2(n) bits (rounded up). An unsigned n-bit number can hold values from 0 to 2^n−1. A signed n-bit number in 2's complement representation can hold values from −2^(n−1) to 2^(n−1)−1. One way to go from a signed number to an unsigned number is to add the offset 2^(n−1) so that it now has the range 0 to 2^n−1.
An embodiment of the present disclosure allows multiplication of two signed matrices using an unsigned matrix multiplier. This can be achieved by adding a fixed offset to all the numbers. Let A (with dimensions k1×k2) and B (with dimensions k2×k3) be two signed matrices, and let A′ (k1×k2) and B′ (k2×k3) be the unsigned equivalents of A and B, respectively. If α and β are scalars representing the offsets and 1 represents a matrix of ones (of the same dimensions as A and B, respectively; not the same as the identity matrix I), equation (22) below shows how the signed matrix multiplication is calculated in terms of the unsigned product:
A′*B′=(A+α1)(B+β1)
A′*B′=A*B+A*(β1)+(α1)*B+(α1)*(β1)
A*B=A′*B′−A*(β1)−(α1)*B−(α1)*(β1) Equation (22)
In an alternate embodiment, the two middle terms may be computed as shown below:
A*(β1)=β*ΣA[i]
(α1)*B=α*ΣB[j] equation (23)
Equation (23) depicts weighted row sum of A and weighted column sum of B. Essentially, the first term is a scaled version of a row sum of A and the second term is a scaled version of a column sum of B. Rather than compute the matrix multiplication with a matrix of ones, the present system 100 uses the row sum and column sum to simplify this calculation, as done in the present disclosure design.
This decomposition of a signed matrix multiplication into an unsigned multiplication and offset will be used in this design so that the matrix multiplier can compute an unsigned or signed matrix multiplication without requiring a separate signed and unsigned multiplier.
In the case of multi-precision arithmetic, an embodiment of the present disclosure allows the system 100 to break the data into digits of fixed precision so that a small fixed-precision multiplier can be used to multiply larger data. Suppose a and b are each 8-bit data and the multiplier takes data of 2-bit precision. a and b are each decomposed into 2-bit digits, so there are 4 digits of a and 4 digits of b. Each 2-bit digit of a needs to be multiplied with each 2-bit digit of b and shifted according to the bit positions of the two digits. The final product would be computed as shown below:
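A small, purely illustrative sketch of this digit decomposition (assuming unsigned 8-bit operands and 2-bit digits; the operand values are arbitrary) is:

```python
# Multi-precision scalar multiply built from 2-bit digit products.
a, b = 0xB7, 0x5C                                # any 8-bit operands
a_d = [(a >> (2 * k)) & 0x3 for k in range(4)]   # a0..a3, 2-bit digits of a
b_d = [(b >> (2 * k)) & 0x3 for k in range(4)]   # b0..b3, 2-bit digits of b

product = 0
for i in range(4):
    for j in range(4):
        product += (a_d[i] * b_d[j]) << (2 * (i + j))   # e.g. a3*b3 is shifted left by 12
assert product == a * b
```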
This same approach can be applied to matrix multiplication. Matrices would be broken up into digit matrices as shown in the accompanying figures.
This is significant because the multipliers in the present design are of fixed precision but can be used for multi-precision computations. A multiplier would compute each sub-product, such as A3*B3, and a shifter would adjust the weight, in this case shifting A3*B3 left by 12. This ability to process different sized data using a fixed size block is what makes it multi-precision.
Various embodiments of the present disclosure relate to a system using a low-precision digit matrix multiplier for operations on any multiple of the digit size. The present system handles a range of precisions, such as 2-bit, 4-bit, 6-bit, 12-bit, or 64-bit. In contrast to existing designs, the present system's design supports dynamic selection of precisions, making it easy to select a different precision for each individual operation and hence maximizing power efficiency. The number of multiply operations and cycles required for multiplying two 6-bit precision matrices in this design is a little over half of that required using 8-bit precision. The ability to dynamically select precision therefore improves power consumption and latency.
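The "a little over half" figure can be seen from a rough, illustrative count, assuming 2-bit digits and ignoring control overhead (the helper below is only a sketch, not part of the disclosure):

```python
# Count of digit-matrix products per operand pair, assuming 2-bit digits.
def digit_products(precision_bits, digit_bits=2):
    k = -(-precision_bits // digit_bits)   # ceiling division: digits per operand
    return k * k                           # every digit of A meets every digit of B

print(digit_products(6), digit_products(8))   # 9 vs 16: roughly 56%, a little over half
```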
Furthermore, the purpose of the present disclosure design is to have the ability to multiply two matrices of variable dimensions and precisions using a fixed sized matrix multiplier. The system 100 breaks up the two matrices into matrices of a fixed precision and dimension, feeding those new matrices to the matrix multiplier over time. The product of each is computed in a systolic array of block matrix multipliers, accumulated in a set of accumulators, and sent back to the register file 106. Over time, the result of the entire matrix multiply of the original matrices is computed.
The written description describes the subject matter herein to enable any person skilled in the art to make and use the embodiments. The scope of the subject matter embodiments is defined by the claims and may include other modifications that occur to those skilled in the art. Such other modifications are intended to be within the scope of the claims if they have similar elements that do not differ from the literal language of the claims or if they include equivalent elements with insubstantial differences from the literal language of the claims.
The embodiments herein can comprise hardware and software elements. The embodiments that are implemented in software include but are not limited to firmware, resident software, microcode, etc. The functions performed by various modules described herein may be implemented in other modules or combinations of other modules. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can comprise, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid-state memory, magnetic tape, a removable computer diskette, a random-access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk-read only memory (CD-ROM), compact disk-read/write (CD-R/W) and DVD.
A description of an embodiment with several components in communication with each other does not imply that all such components are required. On the contrary, a variety of optional components are described to illustrate the wide variety of possible embodiments of the invention. When a single device or article is described herein, it will be apparent that more than one device/article (whether or not they cooperate) may be used in place of a single device/article. Similarly, where more than one device or article is described herein (whether or not they cooperate), it will be apparent that a single device/article may be used in place of the more than one device or article or a different number of devices/articles may be used instead of the shown number of devices or programs. The functionality and/or the features of a device may be alternatively embodied by one or more other devices which are not explicitly described as having such functionality/features. Thus, other embodiments of the invention need not include the device itself.
The illustrated steps are set out to explain the exemplary embodiments shown, and it should be anticipated that ongoing technological development will change the manner in which particular functions are performed. These examples are presented herein for purposes of illustration, and not limitation. Further, the boundaries of the functional building blocks have been arbitrarily defined herein for the convenience of the description. Alternative boundaries can be defined so long as the specified functions and relationships thereof are appropriately performed. Alternatives (including equivalents, extensions, variations, deviations, etc., of those described herein) will be apparent to persons skilled in the relevant art(s) based on the teachings contained herein. Such alternatives fall within the scope and spirit of the disclosed embodiments. Also, the words “comprising,” “having,” “containing,” and “including,” and other similar forms are intended to be equivalent in meaning and be open-ended in that an item or items following any one of these words is not meant to be an exhaustive listing of such item or items or meant to be limited to only the listed item or items. It must also be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.
Finally, the language used in the specification has been principally selected for readability and instructional purposes, and it may not have been selected to delineate or circumscribe the inventive subject matter. It is therefore intended that the scope of the invention be limited not by this detailed description, but rather by any claims that issue on an application based here on. Accordingly, the embodiments of the present invention are intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.