The disclosure relates in general to an In-Memory-Computing memory device and an operation method thereof.
Artificial Intelligence (“AI”) has recently emerged as a highly effective solution in many fields. A key challenge in AI is that models involve large amounts of input data (for example, input feature maps) and weights, on which multiply-and-accumulate (MAC) operations are performed.
However, current AI architectures usually encounter an IO (input/output) bottleneck and an inefficient MAC operation flow.
In order to achieve high accuracy, MAC operations are performed on multi-bit inputs and multi-bit weights. However, this worsens the IO bottleneck and further lowers efficiency.
In-Memory-Computing (“IMC”) can accelerate MAC operations because IMC may reduce reliance on the complicated arithmetic logic unit (ALU) of the processor-centric architecture and provide large parallelism for MAC operations in memory.
In IMC, the unsigned integer multiplication operations and the signed integer multiplication operations are explained as below.
For example, two unsigned 8-bit integers a[7:0] and b[7:0] are multiplied. Eight single-bit multiplications are executed to generate eight partial products p0[7:0]˜p7[7:0]; each of the eight partial products corresponds to one bit of the multiplicand “a”. The eight partial products are expressed as below.
In order to generate the dot product, the eight partial products p0[7:0]˜p7[7:0] are accumulated as shown in
Here, P0=p0[0]+0+0+0+0+0+0+0, P1=p0[1]+p1[0]+0+0+0+0+0+0, and so on.
The product P[15:0] is generated by accumulating the partial products P0˜P15. The product P[15:0] is the 16-bit unsigned multiplication product generated from multiplying the two unsigned 8-bit integers.
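For illustration, the unsigned bit-level multiplication described above may be sketched as follows (a non-limiting Python sketch; the function name is hypothetical). Each loop iteration corresponds to one single-bit multiplication followed by a shift-and-add of the resulting partial product:

```python
def unsigned_mult_8bit(a, b):
    """Multiply two unsigned 8-bit integers by accumulating
    eight single-bit partial products."""
    assert 0 <= a < 256 and 0 <= b < 256
    product = 0
    for i in range(8):
        a_bit = (a >> i) & 1      # bit i of the multiplicand "a"
        partial = a_bit * b       # partial product p_i[7:0]
        product += partial << i   # align to bit position i and accumulate
    return product                # 16-bit result P[15:0]
```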
However, if the integer b is a signed integer, then before summation the partial products are sign-extended to the product width. Still further, if the integer “a” is also a signed integer, then the partial product p7 is subtracted from the final sum, rather than added to the final sum.
In executing IMC, improving the operation speed and lowering the memory capacity requirement both improve the overall IMC performance.
According to one embodiment, provided is a memory device including: a plurality of memory dies, each of the memory dies including a plurality of memory planes, a plurality of page buffers and an accumulation circuit, each of the memory planes including a plurality of memory cells. Wherein an input data is encoded; an encoded input data is sent to at least one page buffer of the page buffers; and the encoded input data is read out from the at least one page buffer in parallel; a first part and a second part of a weight data are encoded into an encoded first part and an encoded second part of the weight data, respectively, the encoded first part and the encoded second part of the weight data are written into the plurality of memory cells of the memory device, and the encoded first part and the encoded second part of the weight data are read out in parallel; the encoded input data is multiplied with the encoded first part and the encoded second part of the weight data respectively to generate a plurality of partial products in parallel; and the partial products are accumulated to generate an operation result.
According to another embodiment, provided is an operation method for a memory device. The operation method includes: encoding an input data, sending an encoded input data to at least one page buffer, and reading out the encoded input data from the at least one page buffer in parallel; encoding a first part and a second part of a weight data into an encoded first part and an encoded second part of the weight data, respectively, writing the encoded first part and the encoded second part of the weight data into a plurality of memory cells of the memory device, and reading out the encoded first part and the encoded second part of the weight data in parallel; multiplying the encoded input data with the encoded first part and the encoded second part of the weight data respectively to generate a plurality of partial products in parallel; and accumulating the partial products to generate an operation result.
In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. It will be apparent, however, that one or more embodiments may be practiced without these specific details. In other instances, well-known structures and devices are schematically shown in order to simplify the drawing.
Technical terms of the disclosure are based on general definition in the technical field of the disclosure. If the disclosure describes or explains one or some terms, definition of the terms is based on the description or explanation of the disclosure. Each of the disclosed embodiments has one or more technical features. In possible implementation, one skilled person in the art would selectively implement part or all technical features of any embodiment of the disclosure or selectively combine part or all technical features of the embodiments of the disclosure.
In step 220, weight data is encoded; the encoded weight data (which is a vector) is written into a plurality of memory cells of the memory device; and the encoded weight data is read out in parallel. In encoding, a most significant bit (MSB) part and a least significant bit (LSB) part of the weight data are independently encoded.
In step 230, the encoded input data is multiplied with the MSB part of the encoded weight data and the LSB part of the encoded weight data respectively to generate a plurality of partial products in parallel.
In step 240, the partial products are summed (accumulated) to generate multiply-and-accumulation (MAC) operation results or Hamming distance operation results.
One embodiment of the application discloses a memory device implementing digital MAC operations with error-bit-tolerance data encoding to tolerate error bits and reduce area requirements. The error-bit-tolerance data encoding uses input data duplication and weight data flattening techniques. Further, the sensing scheme in one embodiment of the application includes a standard single level cell (SLC) read and a logic AND function to implement bit multiplication for partial product generation. In another possible embodiment of the application, during the sensing procedure, the standard SLC read operation may be replaced by a selected-bit-line read or by a standard Multi-Level Cell (MLC)/Triple Level Cell (TLC)/Quad Level Cell (QLC) read operation if the page buffer does not remove the input data stored in the latch. Further, in one embodiment of the application, the digital MAC operations use a high-bandwidth weighted accumulator to generate results by reusing the fail-bit-count (FBC) circuits to implement weighted accumulation.
Another embodiment of the application discloses a memory device implementing Hamming distance computation with error-bit-tolerance data encoding, which aims to tolerate error bits. The error-bit-tolerance data encoding uses input data duplication and weight data flattening techniques. Further, in one embodiment of the application, the sensing scheme comprises the standard SLC read and a logic XOR function to implement bit multiplication for partial result generation. In another possible embodiment of the application, during the sensing procedure, the standard SLC read operation may be replaced by a selected-bit-line read or by a standard Multi-Level Cell (MLC)/Triple Level Cell (TLC)/Quad Level Cell (QLC) read operation if the page buffer does not remove the input data stored in the latch. Further, the logic XOR function may be replaced by the logic XNOR and the logic NOT functions. Further, in one embodiment of the application, the digital Hamming distance computation operations use a high-bandwidth unweighted accumulator to generate results by reusing the fail-bit-count (FBC) circuits to implement unweighted accumulation.
In
Each bit of the MSB vector and of the LSB vector of the 8-bit weight vector is encoded by unary coding (also called value format). For example, the bit Wi=0(7) of the MSB vector of the 8-bit weight vector is encoded into 8 bits (duplicated 8 times); the bit Wi=0(6) is encoded into 4 bits (duplicated 4 times); the bit Wi=0(5) is encoded into 2 bits (duplicated 2 times); the bit Wi=0(4) is encoded into 1 bit (duplicated 1 time); and a spare bit (0) is added after the bit Wi=0(4). The four-bit MSB vector of the 8-bit weight vector is thus encoded into 16 bits in unary coding.
Similarly, the four-bit LSB vector of the 8-bit weight vector is encoded into 16 bits in unary coding.
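For illustration, the unary coding of one weight nibble may be sketched as follows (a non-limiting Python sketch; the helper name is hypothetical). By construction, the number of “1” bits in the 16-bit code equals the value of the nibble, which is what later allows multiplication results to be accumulated by bit counting:

```python
def unary_encode_nibble(nibble):
    """Encode a 4-bit value into 16 bits of unary (value format) code:
    bit 3 duplicated 8 times, bit 2 4 times, bit 1 2 times, bit 0 once,
    plus one spare 0 bit. The popcount of the code equals the value."""
    assert 0 <= nibble < 16
    code = []
    for bit_pos, copies in ((3, 8), (2, 4), (1, 2), (0, 1)):
        code += [(nibble >> bit_pos) & 1] * copies
    code.append(0)   # spare bit
    return code      # 16 bits total
```

An 8-bit weight Wi would then be split into its MSB nibble Wi(7:4) and LSB nibble Wi(3:0), each encoded into 16 bits this way.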
In one embodiment of the application, via the encoding, the error-bit tolerance is improved.
As shown in
In cycle 1, the bit Xi(6) of the input data is multiplied by the MSB vector Wi(7:4) of the weight data to generate a second MSB partial product. Similarly, the bit Xi(6) of the input data is multiplied by the LSB vector Wi(3:0) of the weight data to generate a second LSB partial product. The second MSB partial product is shifted by four bits and added to the second LSB partial product to generate a second partial product. Further, the first partial product is shifted by one bit and added to the second partial product to update the second partial product. Operations of the other cycles (cycle 2 to cycle 7) are similar and thus are omitted here.
Thus, the 8-bit unsigned integer multiplication operation is completed in eight cycles.
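For illustration, the eight-cycle unsigned flow described above may be sketched as follows (a non-limiting Python sketch; names are hypothetical). Each cycle consumes one input bit, MSB first, and shift-accumulates the combined nibble products:

```python
def bit_serial_unsigned_mult(x, w):
    """Eight-cycle unsigned multiply: each cycle multiplies one input bit
    by the MSB and LSB nibbles of the weight, combines the two nibble
    products, and shift-accumulates into the running partial product."""
    assert 0 <= x < 256 and 0 <= w < 256
    w_msb, w_lsb = (w >> 4) & 0xF, w & 0xF
    acc = 0
    for cycle in range(8):             # cycle 0 uses Xi(7), cycle 1 uses Xi(6), ...
        x_bit = (x >> (7 - cycle)) & 1
        msb_pp = x_bit * w_msb         # partial product with Wi(7:4)
        lsb_pp = x_bit * w_lsb         # partial product with Wi(3:0)
        pp = (msb_pp << 4) + lsb_pp    # combine the two nibble products
        acc = (acc << 1) + pp          # shift previous result by one bit, add
    return acc
```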
As shown in
In cycle 1, a second MSB partial product is generated by summing (1) an inverted multiplication result of the bit Xi(6) of the input data with the MSB vector Wi(7) of the weight data and (2) a multiplication result of the bit Xi(6) of the input data with the MSB vector Wi(6:4) of the weight data. Similarly, the bit Xi(6) of the input data is multiplied by the LSB vector Wi(3:0) of the weight data to generate a second LSB partial product. The second MSB partial product is shifted by four bits and added to the second LSB partial product to generate a second partial product. Further, the first partial product is shifted by one bit and added to the second partial product to update the second partial product. Operations of the other cycles (cycle 2 to cycle 7) are similar and thus are omitted here.
Thus, 8-bit signed integer multiplication operation is completed in eight cycles.
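For illustration, the sign handling may be sketched at the bit level as follows (a non-limiting Python sketch; the function name is hypothetical). It relies on the two's-complement identity X = -x7·2^7 + Σ xi·2^i: every single-bit product involving exactly one sign bit enters the sum negatively, which is why the multiplication results with the sign-bit weight Wi(7) are inverted/subtracted in the flow above:

```python
def bit_level_signed_mult(x, w):
    """8-bit signed multiply built from single-bit products. A product
    term gets a negative sign when exactly one of its two bits is a
    two's-complement sign bit (bit 7)."""
    assert -128 <= x < 128 and -128 <= w < 128
    xb = [(x >> i) & 1 for i in range(8)]   # two's-complement bits of x
    wb = [(w >> i) & 1 for i in range(8)]   # two's-complement bits of w
    acc = 0
    for i in range(8):
        for j in range(8):
            sign = -1 if (i == 7) != (j == 7) else 1
            acc += sign * xb[i] * wb[j] * (1 << (i + j))
    return acc
```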
In the above example, it takes eight cycles to complete 8-bit signed integer multiplication operation and/or 8-bit unsigned integer multiplication operation.
In
In
In
In detail, the bit Xi(7) of the input data is multiplied with the MSB vector Wi(7:4) of the weight data to generate a first MSB partial product. The bit Xi(6) of the input data is multiplied with the MSB vector Wi(7:4) of the weight data to generate a second MSB partial product. And so on. The bit Xi(0) of the input data is multiplied with the MSB vector Wi(7:4) of the weight data to generate an eighth MSB partial product. For example, in
Similarly, the bit Xi(7) of the input data is multiplied with the LSB vector Wi(3:0) of the weight data to generate a first LSB partial product. The bit Xi(6) of the input data is multiplied with the LSB vector Wi(3:0) of the weight data to generate a second LSB partial product. And so on. The bit Xi(0) of the input data is multiplied with the LSB vector Wi(3:0) of the weight data to generate an eighth LSB partial product. All the LSB partial products are combined into an input stream L.
The first to the eighth MSB partial products and the first to the eighth LSB partial products are summed; and the number of “1” bits in the summation is counted to generate the MAC operation result of the unsigned multiplication operation.
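For illustration, this AND-then-count scheme may be sketched as follows (a non-limiting Python sketch; names are hypothetical). Because the popcount of a unary-coded nibble equals the nibble's value, counting the “1” bits of the per-bit AND results and weighting the counts (MSB-nibble counts by 16, input bit i by 2^i) reproduces the unsigned product:

```python
def popcount_mac(x, w):
    """Unsigned multiply via bitwise AND plus weighted bit counting over
    the unary-coded MSB/LSB weight nibbles."""
    def unary(nibble):   # 8/4/2/1 duplication plus one spare 0 bit
        bits = []
        for pos, copies in ((3, 8), (2, 4), (1, 2), (0, 1)):
            bits += [(nibble >> pos) & 1] * copies
        return bits + [0]
    msb_code = unary((w >> 4) & 0xF)
    lsb_code = unary(w & 0xF)
    total = 0
    for i in range(8):
        x_bit = (x >> i) & 1
        msb_count = sum(x_bit & c for c in msb_code)   # popcount of AND results
        lsb_count = sum(x_bit & c for c in lsb_code)
        total += ((msb_count << 4) + lsb_count) << i   # weight and accumulate
    return total
```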
In
In detail, the bit Xi(7) of the input data is multiplied with the MSB vector Wi(7:4) of the weight data to generate a first MSB partial product. The bit Xi(6) of the input data is multiplied with the MSB vector Wi(7:4) of the weight data to generate a second MSB partial product. And so on. The bit Xi(0) of the input data is multiplied with the MSB vector Wi(7:4) of the weight data to generate an eighth MSB partial product.
Similarly, the bit Xi(7) of the input data is multiplied with the LSB vector Wi(3:0) of the weight data to generate a first LSB partial product. The bit Xi(6) of the input data is multiplied with the LSB vector Wi(3:0) of the weight data to generate a second LSB partial product. And so on. The bit Xi(0) of the input data is multiplied with the LSB vector Wi(3:0) of the weight data to generate an eighth LSB partial product.
The first to the eighth MSB partial products and the first to the eighth LSB partial products are summed; and the number of “1” bits in the summation is counted to generate the MAC operation result of the signed multiplication operation.
The memory die 615 includes a plurality of memory planes (MP) 620, a plurality of page buffers (PB) 625 and an accumulation circuit 630. In
In each memory die 615, the accumulation circuit 630 is shared by the memory planes 620 and thus the accumulation circuit 630 sequentially performs the accumulation operations of the memory planes 620. Further, each memory die 615 may independently execute the above digital MAC operations and the digital Hamming distance operations.
The input data is input into the page buffers 625 via a plurality of word lines.
The page buffer 625 includes a sensing circuit 631, a plurality of latch units 633-641 and a plurality of logic gates 643 and 645.
The sensing circuit 631 is coupled to a bit line BL to sense the current on the bit line BL.
The latch units 633-641 are for example but not limited by, a data latch (DL) 633, a latch (L1) 635, a latch (L2) 637, a latch (L3) 639 and a common data latch (CDL) 641. Each of the latch units 633-641 is for example but not limited by, a one-bit latch.
The data latch 633 is for latching the weight data and outputting the weight data to the logic gates 643 and 645.
The latch (L1) 635 and the latch (L3) 639 are for decoding.
The latch (L2) 637 is for latching the input data and sending the input data to the logic gates 643 and 645.
The common data latch (CDL) 641 is for latching the output data from the logic gates 643 and 645.
The logic gates 643 and 645 are for example but not limited by, a logic AND gate and a logic XOR gate. The logic gate 643 performs logic AND operation on the input data and the weight data and writes the logic operation result to the CDL 641. The logic gate 645 performs logic XOR operation on the input data and the weight data and writes the logic operation result to the CDL 641. The logic gates 643 and 645 are controlled by enable signals AND_EN and XOR_EN, respectively. For example, in performing the digital MAC operations, the logic gate 643 is enabled by the enable signal AND_EN; and in performing the digital Hamming distance operations, the logic gate 645 is enabled by the enable signal XOR_EN.
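For illustration, the roles of the two logic gates may be sketched as follows (a non-limiting Python sketch; the function names are hypothetical). The AND path produces bit products for MAC operations, while the XOR path produces the per-bit mismatches whose unweighted count is a Hamming distance:

```python
def page_buffer_bitwise(weight_bit, input_bit, mode):
    """Per-bit page-buffer operation written to the CDL: logic AND for
    MAC partial products, logic XOR for Hamming-distance partial results."""
    if mode == "MAC":        # AND_EN asserted
        return weight_bit & input_bit
    elif mode == "HAMMING":  # XOR_EN asserted
        return weight_bit ^ input_bit
    raise ValueError(mode)

def hamming_distance(a_bits, b_bits):
    """Hamming distance = unweighted accumulation (popcount) of XOR results."""
    return sum(page_buffer_bitwise(a, b, "HAMMING")
               for a, b in zip(a_bits, b_bits))
```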
Taking
The accumulation circuit 630 includes a partial product accumulation unit 651, a single dimension product generation unit 653, a first multi-dimension accumulation unit 655, a second multi-dimension accumulation unit 657 and a weight accumulation control unit 659.
The partial product accumulation unit 651 is coupled to the page buffer 625 for receiving a plurality of logic operation results from the plurality of CDLs 641 of the page buffers 625 to generate a plurality of partial products.
For example, in
The single dimension product generation unit 653 is coupled to the partial product accumulation unit 651 for accumulating the partial products from the partial product accumulation unit 651 to generate a single dimension product.
For example, in
For example, in cycle 0, the product of the dimension <0> is generated by the single dimension product generation unit 653; and in cycle 1, the product of the dimension <1> is generated by the single dimension product generation unit 653, and so on.
The first multi-dimension accumulation unit 655 is coupled to the single dimension product generation unit 653 to accumulate the plurality of single dimension products from the single dimension product generation unit 653 for generating a multi-dimension product accumulation result.
For example but not limited by, the first multi-dimension accumulation unit 655 accumulates products of dimension <0> to dimension <7> from the single dimension product generation unit 653 for generating a product accumulation result of 8-dimension <0:7>. Also, the first multi-dimension accumulation unit 655 accumulates dimension <8> to dimension <15> products from the single dimension product generation unit 653 for generating a product accumulation result of 8-dimension <8:15>.
The second multi-dimension accumulation unit 657 is coupled to the first multi-dimension accumulation unit 655 to accumulate the plurality of multi-dimension products from the first multi-dimension accumulation unit 655 for generating an output accumulation value. For example but not limited by, the second multi-dimension accumulation unit 657 accumulates sixty-four 8-dimension products from the first multi-dimension accumulation unit 655 for generating a 512-dimension output accumulation value.
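For illustration, this two-stage accumulation may be sketched as follows (a non-limiting Python sketch; the function name and group size are illustrative). The first stage corresponds to the first multi-dimension accumulation unit summing groups of eight single-dimension products, and the second stage corresponds to the second multi-dimension accumulation unit summing the group results:

```python
def hierarchical_accumulate(dim_products, group_size=8):
    """Two-stage accumulation: sum groups of single-dimension products
    (e.g. eight per group), then sum the group results into the final
    output accumulation value (e.g. 64 groups -> a 512-dimension value)."""
    assert len(dim_products) % group_size == 0
    stage1 = [sum(dim_products[g:g + group_size])
              for g in range(0, len(dim_products), group_size)]
    return sum(stage1)
```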
The weight accumulation control unit 659 is coupled to the partial product accumulation unit 651, the single dimension product generation unit 653 and the first multi-dimension accumulation unit 655. Based on whether the digital MAC operation or the digital Hamming distance operation is performed, the weight accumulation control unit 659 is enabled or disabled. For example but not limited by, when the digital MAC operation is performed, the weight accumulation control unit 659 is enabled; and when the digital Hamming distance operation is performed, the weight accumulation control unit 659 is disabled. When enabled based on the weight accumulation enable signal WACC_EN, the weight accumulation control unit 659 outputs control signals to the partial product accumulation unit 651, the single dimension product generation unit 653 and the first multi-dimension accumulation unit 655.
The single page buffer 625 in
In the above description, the partial product accumulation unit 651 receives 128 bits in one cycle, the first multi-dimension accumulation unit 655 generates sixty-four 8-dimension products and the second multi-dimension accumulation unit 657 generates a 512-dimension output accumulation value. But the application is not limited thereto. In another possible embodiment, the partial product accumulation unit 651 receives 64 bits (2 bits in one set) in one cycle, the first multi-dimension accumulation unit 655 generates thirty-two 16-dimension products and the second multi-dimension accumulation unit 657 generates a 512-dimension output accumulation value.
In the conventional art, the operation takes a long time; but in one embodiment of the application, the parallel bit-multiplication generates (1) the partial products of the input vector and the MSB vector of the weight data and (2) the partial products of the input vector and the LSB vector of the weight data. Thus, in one embodiment of the application, the unsigned multiplication operation and/or the signed multiplication operation is completed in one cycle. Therefore, one embodiment of the application achieves a faster operation speed than the conventional art.
As described above, in one embodiment of the application, via the error-bit tolerance data encoding technology, the error bits are reduced, the accuracy is improved and the memory capacity requirement is also reduced.
Further, in one embodiment of the application, the digital MAC operation generates the output result by using a high-bandwidth weighted accumulator which implements weighted accumulation by reusing the fail-bit-count (FBC) circuit; thus the accumulation speed is improved.
Further, in one embodiment of the application, the digital Hamming distance operation generates the output result by using a high-bandwidth unweighted accumulator which implements unweighted accumulation by reusing the fail-bit-count (FBC) circuit; thus the accumulation speed is improved.
The embodiments of the application are applicable to NAND type flash memory, and also to other memory devices sensitive to error bits, for example but not limited by, NOR type flash memory, phase change memory, magnetic RAM or resistive RAM.
In one embodiment of the application, the accumulation circuit 630 receives 128 partial products from the page buffer 625, but in other embodiments of the application, the accumulation circuit 630 receives 2, 4, 8, 16, . . . or 512 (i.e. a power of 2) partial products from the page buffer 625, which is still within the spirit and the scope of the application.
In the above embodiment, the accumulation circuit 630 supports the addition function, but in other possible embodiment, the accumulation circuit 630 supports subtraction function, which is still within the spirit and the scope of the application.
In the above embodiment, the INT8 or UINT8 digital MAC operation is taken as an example, but other possible embodiments also support INT2, UINT2, INT4 or UINT4 digital MAC operations, which is still within the spirit and the scope of the application.
Although in the embodiments of the application the weight data is divided into the MSB vector and the LSB vector (i.e. two vectors), the application is not limited thereto. In other possible embodiments of the application, the weight data is divided into more vectors, which is still within the spirit and the scope of the application.
The embodiments of the application are applicable not only to AI model designs that need to perform MAC operations, but also to other AI techniques, such as fully-connected layers, convolution layers, multilayer perceptrons and support vector machines.
The embodiments of the application are applicable not only to computing usage but also to similarity search, analysis usage, clustering analysis and so on.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed embodiments. It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims and their equivalents.
This application claims the benefit of U.S. provisional application Ser. No. 63/281,734, filed Nov. 22, 2021, the subject matter of which is incorporated herein by reference.