The present invention relates to compute-in-memory (CIM) processors.
Matrix-matrix multiplications (i.e., CM×N=AM×K×BK×N) are frequently encountered in scientific and engineering fields. For instance, many of the calculations performed in neural network applications are matrix-matrix multiplications. In one example, the calculation CM×N=AM×K×BK×N computes a weighted activation matrix CM×N (containing M×N weighted activations) by multiplying an activation matrix AM×K (formed by M×K activations) with a weight matrix BK×N (formed by K×N weights). Compute-in-memory (CIM) technology proposes a new type of memory that stores the entries of BK×N and performs an inner-product operation between the stored entries of BK×N and the entries of AM×K input into the CIM memory. The CIM memory may further accumulate these inner products to complete the multiply-and-accumulate (MAC) calculations required in matrix-matrix multiplication. Matrix-matrix multiplication is thus accelerated through such a CIM memory.
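As an illustration of the MAC calculations described above, the following Python sketch computes CM×N=AM×K×BK×N as a sequence of multiply-and-accumulate steps. This is a software-only model (a real CIM memory performs these inner products inside the memory array), and the function name `matmul_mac` is ours, chosen for illustration.

```python
# Illustrative model: matrix-matrix multiplication C[M][N] = A[M][K] x B[K][N]
# expressed as multiply-and-accumulate (MAC) operations, the primitive that a
# CIM memory accelerates in hardware.

def matmul_mac(A, B):
    M, K = len(A), len(A[0])
    K2, N = len(B), len(B[0])
    assert K == K2, "inner dimensions must match"
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0
            for k in range(K):
                acc += A[i][k] * B[k][j]  # one MAC step
            C[i][j] = acc
    return C

A = [[1, 2], [3, 4]]       # 2x2 activation matrix
B = [[5, 6], [7, 8]]       # 2x2 weight matrix
print(matmul_mac(A, B))    # [[19, 22], [43, 50]]
```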
However, the matrices in scientific and engineering applications (e.g., some of those used in artificial intelligence (AI) model inference) may be large. To handle such large matrix-matrix multiplications, H. T. Kung, V. Natesh, and A. Sabot proposed the CAKE algorithm in their paper “CAKE: Matrix Multiplication Using Constant-Bandwidth Blocks” in SC21 (International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, MO, USA, 2021, pp. 1-13, doi: 10.1145/3458817.3476166). According to the CAKE matrix-matrix multiplication algorithm, the matrix AM×K is divided into A blocks, and the matrix BK×N is divided into B blocks. The huge calculation of AM×K×BK×N is thereby divided into small-sized matrix-matrix multiplications.
Referring to the left picture, each MM space unit has dimensions m×k×n. The slice of MM space presented in the right picture includes MM space units 1 to 9 of the middle picture. As shown, MM space units 1 to 9 relate to blocks A1 to A9, respectively (on the back side of the right picture but obscured). In addition, MM space units 1, 6, and 7 relate to a block B1, MM space units 2, 5, and 8 relate to a block B2, and MM space units 3, 4, and 9 relate to a block B3 (the B blocks are on the upper side of the picture). Three C blocks (three 2-D blocks) C1, C2, and C3 are presented on the right side of the slice of MM space. The block C1 is calculated from A1×B1+A2×B2+A3×B3. The block C2 is calculated from A4×B3+A5×B2+A6×B1, wherein the block B3 used in the final step of the C1 calculation is reused in the first step of the C2 calculation. The block C3 is calculated from A7×B1+A8×B2+A9×B3, wherein the block B1 used in the final step of the C2 calculation is reused in the first step of the C3 calculation. This block reuse concept is called inter-block reuse, and it reduces the number of memory (e.g., DRAM) accesses.
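The inter-block reuse order in the example above can be sketched as follows. The load-counting helper and its FIFO buffer are a hypothetical bookkeeping model, used only to show that the zigzag traversal (B1→B2→B3, then B3→B2→B1, then B1→B2→B3) needs fewer B-block loads than restarting from B1 for every C block.

```python
# Sketch of inter-block reuse: C1 = A1*B1 + A2*B2 + A3*B3,
# C2 = A4*B3 + A5*B2 + A6*B1, C3 = A7*B1 + A8*B2 + A9*B3.
# The B block used in the final step of one C block is reused in the
# first step of the next, so even a single-entry B buffer saves loads.

def count_b_loads(order, buffer_size=1):
    """Count B-block loads under a small FIFO buffer (hypothetical model)."""
    buf, loads = [], 0
    for b in order:
        if b not in buf:          # buffer miss: load the block from memory
            loads += 1
            buf.append(b)
            if len(buf) > buffer_size:
                buf.pop(0)        # evict the oldest block
    return loads

naive  = [1, 2, 3, 1, 2, 3, 1, 2, 3]   # every C block restarts at B1
zigzag = [1, 2, 3, 3, 2, 1, 1, 2, 3]   # reuse order from the figure

print(count_b_loads(naive))    # 9 loads
print(count_b_loads(zigzag))   # 7 loads: B3 and B1 are reused across C blocks
```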
The reuse concept may be implemented in various ways. In conventional designs, the reuse scheme depends only on the matrix sizes.
When the size of matrix AM×K is larger than the size of matrix BK×N (i.e., M>N), the first reuse scheme first_reuse_A (referring to
However, in some computing architectures, the conventional matrix-size-related reuse scheme may still result in considerable power consumption. There is a need in the art for an energy-efficient matrix-matrix multiplication technology.
The present invention aims to minimize the memory access energy to save power for matrix-matrix multiplication in a computing system.
A computing system in accordance with an exemplary embodiment of the invention includes a computing-in-memory (CIM) design. The computing system includes a CIM processor and a two-level memory system coupled to the CIM processor. The CIM processor includes a processor control unit, a CIM memory with CIM capability, and a register file. The two-level memory system includes a first-level (L1) memory and a second-level (L2) memory. The processor control unit is operable to load A blocks (divided from a matrix AM×K) from the L2 memory to the L1 memory, and to load B blocks (divided from a matrix BK×N) from the L2 memory to the CIM memory. The processor control unit is further operable to program the A blocks buffered in the L1 memory to the register file to be entered into the CIM memory. The CIM memory performs multiply-and-accumulate (MAC) calculations on the A blocks and the B blocks to generate C blocks. The C blocks form a matrix CM×N (=AM×K×BK×N). Based on the size of the matrix AM×K, the size of the matrix BK×N, the A block buffering capability (μ) of the L1 memory, and the B block buffering capability (L) of the CIM memory, the CIM processor selects a reuse scheme to reuse the A blocks buffered in the L1 memory and the B blocks buffered in the CIM memory.
In an exemplary embodiment, the number of A blocks divided from the matrix AM×K is Mb×Kb, where Mb=M/m and Kb=K/k. The number of B blocks divided from the matrix BK×N is Kb×Nb, where Nb=N/n. Mb, Kb, and Nb, representing the sizes of the matrix AM×K and the matrix BK×N, are used in selecting the reuse scheme.
In an exemplary embodiment, m=n=mc, and k=αmc, where α is greater than 0. In response to a situation wherein Mb>1 and Nb>1, the CIM processor judges Kb to select the reuse scheme based on a threshold function T(⋅), wherein the threshold function T(⋅) is a function of the A block buffering capability (μ) of the L1 memory and the sparsity (ss) of the matrix BK×N. In some exemplary embodiments, the threshold function T(⋅) further depends on Mb and Nb. In an exemplary embodiment, the threshold function T(⋅) is:
where 1≤μ≤Kb, and 0≤ss<1. Lmax denotes the upper limit on the number of B blocks the CIM memory can buffer. When min(Lmax, Kb)<T(⋅), the CIM processor sets the B block buffering capability, L, of the CIM memory to min(2, Kb), and selects a first reuse scheme first_reuse_A. According to the first reuse scheme first_reuse_A, the A blocks buffered in the L1 memory are reused even if the CIM memory has updated the B blocks buffered therein. When min(Lmax, Kb)≥T(⋅), the CIM processor sets the B block buffering capability, L, of the CIM memory to min(Lmax, Kb), and selects a second reuse scheme first_reuse_B. According to the second reuse scheme first_reuse_B, the B blocks buffered in the CIM memory are reused even if the L1 memory has updated the A blocks buffered therein.
A more suitable reuse scheme is thus selected according to the proposed technology. The foregoing concept can also be realized as a method for power saving of a compute-in-memory (CIM) design.
A detailed description is given in the following embodiments with reference to the accompanying drawings.
The present invention can be more fully understood by reading the subsequent detailed description and examples with references made to the accompanying drawings, wherein:
The following description is made for the purpose of illustrating the general principles of the invention and should not be taken in a limiting sense. The scope of the invention is best determined by reference to the appended claims.
The matrix-matrix multiplication is CM×N=AM×K×BK×N. The matrix-matrix multiplication (MM) space (referring to the middle cube presented in
The A blocks and B blocks are stored in the L2 memory 312. The matrix BK×N may have a sparsity ss (0≤ss<1) and can be compressed when its entries are static (e.g., the weights of a neural network layer are static during inference), so that the B blocks may be stored in the L2 memory 312 in a compressed format. The sparsity ss of the matrix BK×N is set to zero to indicate that compression does not apply to the B blocks, even if the matrix BK×N has zero-valued entries. The L2 memory controller 316 reads the A blocks and the (possibly compressed) B blocks from the L2 memory 312, decompresses the B blocks if necessary (i.e., when the matrix BK×N has zero-valued entries and compression applies to the B blocks, so that ss is greater than 0), and passes the A blocks and the decompressed B blocks (decompression is not performed on the B blocks if ss=0) to the L1 memory controller 314. The L1 memory controller 314 loads the A blocks (Ab) to the L1 memory 310, and loads the B blocks (Bb) to the CIM memory 306. The A block buffering capability of the L1 memory 310 is μ, where 1≤μ≤Kb; that is, up to μ A blocks can be buffered in the L1 memory 310. The B block buffering capability of the CIM memory 306 is L, where 1≤L≤Kb; that is, up to L B blocks can be buffered in the CIM memory 306. For a MAC calculation that generates one C block Cb, the A blocks read from the L1 memory 310 are programmed into the register file 308 to be entered into the CIM memory 306. In the CIM memory 306, the received A blocks are multiplied by the buffered B blocks (small-sized matrix-matrix multiplications) and the products are accumulated to form one C block Cb (where Cb+=Ab×Bb, referring to the right picture in
In such a computing architecture, the power consumption is due to: reading of the two-level memory system (the L1 and L2 memories 310 and 312); the multiply-and-accumulate (MAC) computing of the CIM memory 306; and accesses of the register file 308. This disclosure focuses on fully utilizing the A block buffer of the L1 memory 310 and the B block buffer of the CIM memory 306 to suppress the power consumption of reading the L1 memory 310 and the L2 memory 312.
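The block-level data flow described above (a row of A blocks and a column of B blocks producing one C block via Cb+=Ab×Bb) can be modeled schematically in software. The function names `block_matmul` and `accumulate_c_block` below are illustrative, not the actual hardware interfaces.

```python
# Schematic model (an assumption-laden sketch, not the hardware itself) of
# the data flow: A blocks pass through the L1 buffer and register file into
# the CIM memory, which accumulates Cb += Ab x Bb over the Kb block products.

def block_matmul(Ab, Bb):
    """Small m x k by k x n block product (stands in for the CIM MAC array)."""
    k = len(Bb)
    return [[sum(Ab[i][t] * Bb[t][j] for t in range(k))
             for j in range(len(Bb[0]))]
            for i in range(len(Ab))]

def accumulate_c_block(a_row_blocks, b_col_blocks):
    """Cb += Ab x Bb over all Kb block pairs, forming one C block."""
    m = len(a_row_blocks[0])
    n = len(b_col_blocks[0][0])
    Cb = [[0] * n for _ in range(m)]
    for Ab, Bb in zip(a_row_blocks, b_col_blocks):   # Kb MAC rounds
        P = block_matmul(Ab, Bb)
        for i in range(m):
            for j in range(n):
                Cb[i][j] += P[i][j]
    return Cb

# Kb = 2, with 1x1 blocks for brevity: Cb = A1*B1 + A2*B2 = 2*5 + 3*7
print(accumulate_c_block([[[2]], [[3]]], [[[5]], [[7]]]))  # [[31]]
```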
The first reuse scheme first_reuse_A (referring to
Different from the conventional technology, which selects the reuse scheme based only on the sizes of the matrix AM×K and the matrix BK×N, the proposed reuse scheme selection also considers the size of the A block buffer provided by the L1 memory 310 (μ) and the size of the B block buffer provided by the CIM memory 306 (L).
In an exemplary embodiment, the size of an MM space unit (m×n×k) has the following characteristics: m=n=mc, and k=αmc (α>0). A variable σ denotes the energy consumption of reading the L2 memory 312, with a constant scaling δ1. A variable β denotes the energy consumption of reading the L2 memory 312 and the L1 memory 310 (with constant scalings δ1 and δ2, respectively). The sparsity of the matrix BK×N is ss (0≤ss<1). When the first reuse scheme first_reuse_A is adopted, the energy consumption of reading the L1 and L2 memories 310 and 312 may be represented as: (β+σ−ssσ)MbKbNb+μσMb(1−Nb)+(1−ss)Lσ(1−Mb). When the second reuse scheme first_reuse_B is adopted, the energy consumption of reading the L1 and L2 memories 310 and 312 may be represented as: (β+σ−ssσ)MbKbNb+(1−ss)LσNb(1−Mb)+μσ(1−Nb). By substituting β=cσ (c>1) and neglecting σ in the energy consumption comparison, the energy consumption index values of the two reuse schemes are:
If either of Mb and Nb is 1, fA(Mb, Kb, Nb)=fB(Mb, Kb, Nb), regardless of the value of L (the B block buffering capability provided by the CIM memory 306). If Mb>1 and Nb>1, the comparison between fA(Mb, Kb, Nb) and fB(Mb, Kb, Nb) depends on the magnitudes of Mb, Nb, μ, and L.
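The two energy expressions above can be coded directly for experimentation. The sketch below is ours: σ and β are treated as symbolic unit costs (the default values are arbitrary placeholders, not measured energies), and the function names are chosen for illustration. It also demonstrates the observation that the two schemes cost the same whenever Mb or Nb is 1.

```python
# The two energy expressions from the text, coded term by term.
# sigma and beta are symbolic unit costs here (arbitrary defaults).

def energy_first_reuse_A(Mb, Kb, Nb, mu, L, ss, sigma=1.0, beta=2.0):
    return ((beta + sigma - ss * sigma) * Mb * Kb * Nb
            + mu * sigma * Mb * (1 - Nb)
            + (1 - ss) * L * sigma * (1 - Mb))

def energy_first_reuse_B(Mb, Kb, Nb, mu, L, ss, sigma=1.0, beta=2.0):
    return ((beta + sigma - ss * sigma) * Mb * Kb * Nb
            + (1 - ss) * L * sigma * Nb * (1 - Mb)
            + mu * sigma * (1 - Nb))

# When Mb = 1 (or Nb = 1), the two schemes cost the same, whatever L is:
a = energy_first_reuse_A(1, 4, 8, mu=2, L=3, ss=0.5)
b = energy_first_reuse_B(1, 4, 8, mu=2, L=3, ss=0.5)
print(a == b)   # True

# When Mb > 1 and Nb > 1, the cheaper scheme depends on Mb, Nb, mu, and L:
a = energy_first_reuse_A(8, 4, 8, mu=2, L=3, ss=0.5)
b = energy_first_reuse_B(8, 4, 8, mu=2, L=3, ss=0.5)
print(a < b)    # True for these parameters
```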
In an exemplary embodiment, a threshold function T(⋅) is proposed:
The threshold function T(⋅), as a function of μ (the A block buffering capability of the L1 memory 310) and ss (0≤ss<1, the sparsity of the matrix BK×N), is applied to judge the value of Kb when Mb>1 and Nb>1. In another exemplary embodiment, the threshold function is:
which further depends on Mb and Nb.
When Kb≥T(⋅), the B block buffering capability L provided by the CIM memory 306 is set to Kb, and the second reuse scheme first_reuse_B is selected to reuse the B blocks buffered in the CIM memory 306 even if the L1 memory 310 has updated the A blocks buffered therein. According to the second reuse scheme first_reuse_B, every Kb B blocks (e.g., a column of B blocks buffered in the CIM memory 306) are reused until the column of B blocks has been processed with all Mb rows of A blocks for matrix-matrix multiplication.
When Kb<T(⋅), the B block buffering capability L provided by the CIM memory 306 is set to 2, and the first reuse scheme first_reuse_A is selected to reuse the A blocks buffered in the L1 memory 310 even if the CIM memory 306 has updated the B blocks buffered therein. According to the first reuse scheme first_reuse_A, every μ A blocks buffered in the L1 memory 310 are reused until they have been processed with all related B blocks for matrix-matrix multiplication.
In simpler situations, the threshold function T(⋅) is not required. When Mb>Nb=1, the B block buffering capability L provided by the CIM memory 306 is set to a non-zero integer greater than or equal to 2, and the second reuse scheme first_reuse_B may be selected to reuse the B blocks buffered in the CIM memory 306. Although the first reuse scheme first_reuse_A results in the same energy consumption index value as the second reuse scheme first_reuse_B in this situation, some exemplary embodiments select only the second reuse scheme first_reuse_B here.
When Nb>Mb=1, the B block buffering capability L provided by the CIM memory 306 is set to a non-zero integer greater than or equal to 2, and the first reuse scheme first_reuse_A may be selected to reuse the A blocks buffered in the L1 memory 310. Although the second reuse scheme first_reuse_B results in the same energy consumption index value as the first reuse scheme first_reuse_A in this situation, some exemplary embodiments select only the first reuse scheme first_reuse_A here.
When Mb=Nb=1, the B block buffering capability L provided by the CIM memory 306 is set to a non-zero integer greater than or equal to 2. The first reuse scheme first_reuse_A and the second reuse scheme first_reuse_B result in the same power consumption. The reuse scheme can be kept at the previous setting.
Table 1 shows the aforementioned reuse strategy.
If the reuse scheme is selected based only on the matrix sizes, as taught in the conventional technology, without considering the A block buffering capability μ of the L1 memory 310 and the B block buffering capability L of the CIM memory 306, the selected reuse scheme may not be the best one. In some exemplary embodiments, the hardware limitation, Lmax, of the B block buffer provided by the CIM memory 306 is taken into consideration; that is, the upper limit of the number of B blocks buffered in the CIM memory 306 is Lmax.
When min(Lmax, Kb)≥T(⋅), the B block buffering capability L provided by the CIM memory 306 is set to min(Lmax, Kb), and the second reuse scheme first_reuse_B is selected to reuse the B blocks buffered in the CIM memory 306. When min(Lmax, Kb)<T(⋅), the B block buffering capability L provided by the CIM memory 306 is set to min(2, Kb), and the first reuse scheme first_reuse_A is selected to reuse the A blocks buffered in the L1 memory 310.
Table 2 shows the reuse strategy considering the hardware limitation.
In Table 2, when either of Mb and Nb is 1, the B block buffering capability L provided by the CIM memory 306 is set to min(Lmax, Kb). For example, the B block buffering capability L provided by the CIM memory 306 may be set to a non-zero integer greater than or equal to 2.
In step S402, the matrix AM×K is divided into blocks Ab, and the matrix BK×N is divided into blocks Bb. Each block Ab is a small matrix of dimension m×k, which is also named an A block. The total number of the A blocks Ab is Mb×Kb, where Mb=M/m and Kb=K/k. Each block Bb is a small matrix of dimension k×n, which is also named a B block. The total number of the B blocks Bb is Kb×Nb, where Nb=N/n. In particular, m=n=mc, and k=αmc (α>0). The A blocks Ab and the B blocks Bb are stored into the L2 memory 312.
In step S404, the values of Mb and Nb are checked. It is determined whether Mb and Nb both are greater than 1. If yes, step S406 is performed to compare min(Lmax, Kb) with the threshold function T(⋅), where Lmax is the hardware limitation of the B block buffer of the CIM memory 306, and the threshold function T(⋅) is:
where μ is the A block buffering capability provided by the L1 memory 310, and ss is the sparsity of the matrix BK×N (0≤ss<1).
If min(Lmax, Kb)≥T(⋅), step S408 is performed to set the B block buffering capability L to min(Lmax, Kb). The number of B blocks buffered in the CIM memory 306 at the same time is up to L. In step S410, the second reuse scheme first_reuse_B is selected. The B blocks buffered in the CIM memory 306 are reused until the CIM memory 306 completes the multiply-and-accumulate calculations between the buffered B blocks and all of their related A blocks. The L1 memory 310 is continuously updated with the required A blocks while the B blocks buffered in the CIM memory 306 are reused to complete the calculation of the target C block.
If step S406 determines that min(Lmax, Kb)<T(⋅), step S412 is performed to set the B block buffering capability L to min(2, Kb). In step S414, the first reuse scheme first_reuse_A is selected. The A blocks buffered in the L1 memory 310 are reused by the CIM memory 306 until the multiply-and-accumulate calculations between the reused A blocks and all of their related B blocks are finished. The CIM memory 306 is continuously updated with the required B blocks while the A blocks buffered in the L1 memory 310 are reused to complete the calculation of the target C block.
If Mb and Nb are not both greater than 1, step S416 is performed to further check which of Mb and Nb equals 1. If Mb=Nb=1, step S418 is performed, by which the B block buffering capability L provided by the CIM memory 306 is set to min(Lmax, Kb) (i.e., a non-zero integer greater than or equal to 2). There is no need to change the reuse scheme; it may be the first reuse scheme first_reuse_A or the second reuse scheme first_reuse_B.
If Mb>Nb=1, step S420 is performed, by which the B block buffering capability L provided by the CIM memory 306 is set to min(Lmax, Kb) (i.e., a non-zero integer greater than or equal to 2). In step S422, the second reuse scheme first_reuse_B is selected. The B blocks buffered in the CIM memory 306 are reused until the CIM memory 306 completes the multiply-and-accumulate calculations between the buffered B blocks and all of their related A blocks.
If Nb>Mb=1, step S424 is performed, by which the B block buffering capability L provided by the CIM memory 306 is set to min(Lmax, Kb) (i.e., a non-zero integer greater than or equal to 2). In step S426, the first reuse scheme first_reuse_A is selected. The A blocks buffered in the L1 memory 310 are reused by the CIM memory 306 until the multiply-and-accumulate calculations between the reused A blocks and all of their related B blocks are finished.
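The branching of steps S404 through S426 can be sketched as follows. Because the threshold function T(⋅) itself is not reproduced in the text above, `threshold_T` below is a hypothetical placeholder with an assumed form; only the branching structure follows the described flow, and the return value "keep_previous" is our sentinel for the case where the reuse scheme is left unchanged.

```python
# Sketch of the reuse-scheme selection flow (steps S404-S426).
# threshold_T is a PLACEHOLDER: the document's actual T(.) formula is not
# reproduced here, so an assumed form is used for illustration only.

def threshold_T(mu, ss, Mb=None, Nb=None):
    """Placeholder for the threshold function T(.) of mu and ss."""
    return mu / (1.0 - ss)   # assumed form, not the patented formula

def select_reuse_scheme(Mb, Nb, Kb, mu, ss, Lmax):
    """Return (scheme, L): the reuse scheme and B block buffering capability."""
    if Mb > 1 and Nb > 1:                                  # step S404
        if min(Lmax, Kb) >= threshold_T(mu, ss, Mb, Nb):   # step S406
            return "first_reuse_B", min(Lmax, Kb)          # steps S408/S410
        return "first_reuse_A", min(2, Kb)                 # steps S412/S414
    if Mb == 1 and Nb == 1:                                # step S418
        return "keep_previous", min(Lmax, Kb)              # scheme unchanged
    if Mb > Nb:                                            # Mb > Nb = 1
        return "first_reuse_B", min(Lmax, Kb)              # steps S420/S422
    return "first_reuse_A", min(Lmax, Kb)                  # Nb > Mb = 1, S424/S426

print(select_reuse_scheme(Mb=8, Nb=8, Kb=16, mu=4, ss=0.5, Lmax=8))
# ('first_reuse_B', 8) under the assumed T
```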
Based on the aforementioned concept, a method for power saving of a compute-in-memory (CIM) design is proposed in accordance with an exemplary embodiment of the present invention. The method includes loading A blocks divided from a matrix AM×K from the L2 memory 312 to the L1 memory 310. The method includes loading B blocks divided from a matrix BK×N from the L2 memory 312 to a CIM memory 306. The method includes programming the A blocks buffered in the L1 memory 310 to the register file 308 to be entered into the CIM memory 306. The CIM memory 306 performs multiply-and-accumulate (MAC) calculations on the A blocks and the B blocks to generate C blocks which form a matrix CM×N that is AM×K×BK×N. Based on the size of the matrix AM×K, the size of the matrix BK×N, the A block buffering capability, μ, of the L1 memory 310, and the B block buffering capability, L, of the CIM memory 306, the method includes selecting a reuse scheme to reuse the A blocks buffered in the L1 memory 310 and the B blocks buffered in the CIM memory 306. The computer readable medium 318 may be provided to configure the computing system 300 to perform the proposed method.
There may be many variations of the CAKE algorithm with the proposed novel reuse strategy. Any reuse strategy that selects a reuse scheme based not only on the sizes of the matrices but also on the buffering capabilities for the A blocks and the B blocks should be considered as being within the scope of the invention.
While the invention has been described by way of example and in terms of the preferred embodiments, it should be understood that the invention is not limited to the disclosed embodiments. On the contrary, it is intended to cover various modifications and similar arrangements (as would be apparent to those skilled in the art). Therefore, the scope of the appended claims should be accorded the broadest interpretation so as to encompass all such modifications and similar arrangements.
This application claims the benefit of U.S. Provisional Application No. 63/591,468, filed Oct. 19, 2023, the entirety of which is incorporated by reference herein.