The disclosure relates in general to a memory device and a computing in memory method thereof.
In deep learning training, data movement consumes a large portion of the energy cost. Ideally, computing in memory may reduce energy consumption by 25% because movement of weight values is reduced.
In computing in memory, taking a Convolutional Neural Network (CNN) as an example, stride operations usually take several cycles to complete. Performing stride operations (stride = 1) with a 3×3 array (a weight array) is taken as an example.
In a first cycle, input data I1-I3, I6-I8 and I11-I13 are input into the word lines WL1-WL9, respectively. The operations are as follows.
In a second cycle, three inputs are updated and thus input data I6-I8, I11-I13 and I16-I18 are input into the word lines WL1-WL9, respectively. The operations are as follows.
In a third cycle, three inputs are updated and thus input data I11-I13, I16-I18 and I21-I23 are input into the word lines WL1-WL9, respectively. The operations are as follows.
In a fourth cycle, the stride window moves to the next column and thus input data I2-I4, I7-I9 and I12-I14 are input into the word lines WL1-WL9, respectively. The operations are as follows.
In a fifth cycle, three inputs are updated and thus input data I7-I9, I12-I14 and I17-I19 are input into the word lines WL1-WL9, respectively. The operations are as follows.
In a sixth cycle, three inputs are updated and thus input data I12-I14, I17-I19 and I22-I24 are input into the word lines WL1-WL9, respectively. The operations are as follows.
In a seventh cycle, the stride window moves to the next column and thus input data I3-I5, I8-I10 and I13-I15 are input into the word lines WL1-WL9, respectively. The operations are as follows.
In an eighth cycle, three inputs are updated and thus input data I8-I10, I13-I15 and I18-I20 are input into the word lines WL1-WL9, respectively. The operations are as follows.
In a ninth cycle, three inputs are updated and thus input data I13-I15, I18-I20 and I23-I25 are input into the word lines WL1-WL9, respectively. The operations are as follows. The complete nine-cycle schedule is summarized in the sketch below.
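As a summary of the schedule above, the following is a minimal Python sketch (not part of the disclosure) that reproduces the nine cycles and counts how often each input value must be fed to a word line. It assumes a 5×5 input feature map I1-I25 stored row-major and a 3×3 weight array at stride = 1, consistent with the cycles listed above.

```python
# Minimal sketch of the conventional (one-kernel-per-bit-line) CIM schedule
# described above. Assumed sizes: 5x5 input feature map I1..I25 (row-major),
# 3x3 kernel, stride 1, inputs fed to word lines WL1-WL9 each cycle.
from collections import Counter

N = M = 5          # assumed input feature map size
k = l = 3          # kernel size (weights W1..W9)
stride = 1

def window(row, col):
    """1-based input indices fed to word lines WL1-WL9 for one output position."""
    return [r * M + c + 1 for r in range(row, row + k) for c in range(col, col + l)]

# Column-by-column scan order matching the nine cycles listed above.
cycles = [window(r, c)
          for c in range(0, M - l + 1, stride)
          for r in range(0, N - k + 1, stride)]

feeds = Counter(i for cyc in cycles for i in cyc)

for n, cyc in enumerate(cycles, 1):
    print(f"cycle {n}:", ", ".join(f"I{i}" for i in cyc))
print("total word-line feeds:", sum(feeds.values()))        # 81
print("distinct input values needed:", len(feeds))          # 25
print("times the center input I13 is fed:", feeds[13])      # 9
```

The count shows 81 word-line feeds for only 25 distinct input values, with the center input I13 fed in all nine cycles; this is the duplicate feeding discussed below.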
In traditional CIM (computing in memory), duplicate feeding of the input feature map occurs. This is because the stride operation generates many input windows with overlapping contents. The traditional CIM usually stores one kernel at one bit line, and accordingly duplicate feeding of the inputs is caused.
The duplicate feeding situation becomes worse as the input data becomes larger and the stride step becomes smaller. As is known, more duplicate feeding of inputs results in more data movement, higher energy consumption and reduced operation speed. Thus, reducing duplicate feeding of inputs is very important.
Thus, reducing data movement, lowering energy consumption and improving operation speed are important goals.
According to one embodiment, provided is a computing in memory method for a memory device. The computing in memory method includes: based on a stride parameter, unfolding a kernel into a plurality of sub-kernels and a plurality of complement sub-kernels; based on the sub-kernels and the complement sub-kernels, writing a plurality of weights into a plurality of target memory cells of a memory array of the memory device; inputting an input data into a selected word line of the memory array; performing a stride operation in the memory array; temporarily storing a plurality of partial sums; and summing the stored partial sums into a stride operation result when all operation cycles are completed.
According to one embodiment, provided is a memory device comprising: a memory array; and a controller coupled to the memory array, the controller being configured for: based on a stride parameter, unfolding a kernel into a plurality of sub-kernels and a plurality of complement sub-kernels; based on the sub-kernels and the complement sub-kernels, writing a plurality of weights into a plurality of target memory cells of the memory array; inputting an input data into a selected word line of the memory array; performing a stride operation in the memory array; temporarily storing a plurality of partial sums; and summing the stored partial sums into a stride operation result when all operation cycles are completed.
In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the disclosed embodiments. It will be apparent, however, that one or more embodiments may be practiced without these specific details. In other instances, well-known structures and devices are schematically shown in order to simplify the drawing.
Technical terms of the disclosure are based on their general definitions in the technical field of the disclosure. If the disclosure describes or explains one or some terms, the definition of those terms is based on the description or explanation in the disclosure. Each of the disclosed embodiments has one or more technical features. In possible implementations, a person skilled in the art may selectively implement some or all technical features of any embodiment of the disclosure, or selectively combine some or all technical features of the embodiments of the disclosure.
As shown in
As shown in
In general, the kernel includes an original weight matrix. When the original weight matrix is k×l (k and l both being a natural number) and the input data is N×M (N and M both being a natural number), for the stride parameter being equal to “1”, a total number of the sub-kernels is (N−k+1)×(M−l+1) and a total number of the complement sub-kernels is (N−k+1)×(M−l+1).
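As a quick check of the counts stated above, the following short Python snippet (not part of the disclosure) evaluates the formula for the stride parameter being equal to "1"; for example, a 3×3 original weight matrix and a 5×5 input yield nine sub-kernels and nine complement sub-kernels, which also equals the number of output positions.

```python
def unfold_counts(k, l, N, M, stride=1):
    """Number of sub-kernels and complement sub-kernels per the formula above (stride = 1)."""
    assert stride == 1, "the formula in the text is stated for a stride parameter of 1"
    positions = (N - k + 1) * (M - l + 1)   # equals the number of output positions
    return positions, positions

print(unfold_counts(k=3, l=3, N=5, M=5))    # (9, 9)
```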
In step 520, based on the sub-kernels and the complement sub-kernels, the weights are written into a plurality of target memory cells.
Taking the sub-kernel SK1 in
Taking the sub-kernel SK2 in
In step 530, the input data is input into the selected word lines.
In step 540, MAC operations are performed in the memory array.
In step 550, a respective partial sum is stored in a respective latch unit.
In step 560, it is determined whether the corresponding complement sub-kernel is calculated (or, it is determined whether all operation cycles are completed). If yes in step 560, then the flow proceeds to step 570 to sum the partial sums stored in the latch units to generate the MAC result. If no in step 560, then the flow returns to step 530.
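The loop of steps 530 through 570 can be pictured with the following Python sketch. It is only one plausible realization under stated assumptions, not the embodiment's exact weight placement (the sixteen operations of the figures are not reproduced): the 3×3 kernel is split row-wise into sub-kernel rows, one input row is fed to the word lines per operation cycle, each bit line holds the kernel row it needs aligned to its output column, and one latch unit per bit line accumulates the partial sums across cycles (equivalently, the per-cycle partial sums could be stored and then summed at the end, as steps 550-570 recite).

```python
# Hedged sketch of the accumulate-then-sum structure of steps 530-570.
# Assumed sizes: 5x5 input I1..I25, 3x3 weights W1..W9, stride 1.
import numpy as np

N = M = 5                     # assumed input feature map size
k = l = 3                     # kernel size
I = np.arange(1, N * M + 1, dtype=float).reshape(N, M)    # inputs I1..I25
W = np.arange(1, k * l + 1, dtype=float).reshape(k, l)    # weights W1..W9

out_rows, out_cols = N - k + 1, M - l + 1
latches = np.zeros(out_rows * out_cols)     # step 550: one latch unit per bit line

for row in range(N):                        # operation cycles: one input row per cycle
    word_lines = I[row]                     # step 530: each input value is fed only once
    for bl, (r, c) in enumerate(np.ndindex(out_rows, out_cols)):
        if r <= row < r + k:                # this bit line's window covers the fed row
            kernel_row = W[row - r]         # the sub-kernel row stored on this bit line
            latches[bl] += kernel_row @ word_lines[c:c + l]   # steps 540-550: MAC, latch

result = latches.reshape(out_rows, out_cols)               # step 570: stride operation result

# Sanity check against a direct stride-1 convolution (cross-correlation).
ref = np.array([[np.sum(I[r:r + k, c:c + l] * W) for c in range(out_cols)]
                for r in range(out_rows)])
assert np.allclose(result, ref)
print(result)
```

In this sketch the 25 input values are each driven onto the word lines only once (five cycles in total), instead of the 81 feeds of the conventional nine-cycle schedule in the background example; the number of cycles and the weight placement of the actual embodiment differ, but the input-reuse principle that steps 530-570 rely on is the same.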
In one embodiment of the application, in order to reduce data movement, the weights W1-W9 are written into target memory cells as indicated by sixteen operations shown in
As shown in operation (a) of
As shown in operation (b) of
As shown in operation (c) of
As shown in operation (d) of
As shown in operation (e) of
As shown in operation (f) of
As shown in operation (g) of
As shown in operation (h) of
As shown in operation (i) of
As shown in operation (j) of
As shown in operation (k) of
As shown in operation (l) of
As shown in operation (m) of
As shown in operation (n) of
As shown in operation (o) of
As shown in operation (p) of
As shown in
As shown in
As shown in
As shown in
For ease of understanding, the partial sums in the four cycles are added as follows (i.e., the outputs from the latch units L1 to L16 after four cycles):
As described above, the advantages of embodiments of the application include reduced data movement and faster execution time.
As described above, in embodiments of the application, in stride operations, the kernel (the weight matrix) of the deep learning model is unfolded into a plurality of sub-kernels and a plurality of complement sub-kernels. The weights are written into the target memory cells based on the sub-kernels and the complement sub-kernels. Thus, the input data is efficiently reused in the memory array, reducing operation time and data movement.
Embodiments of the application may be used in the AI (Artificial Intelligence) field or in any computing field involving many MAC operations, for example but not limited to, memory data search, image processing and voice detection.
Embodiments of the application may be used in different AI model designs, for example but not limited to, fully connected layer model design, convolution layer model design, multilayer perceptron model design and support vector machine model design.
Embodiments of the application may be applied to any volatile memory (SRAM, DRAM) or any non-volatile memory (resistive RAM, phase change memory, flash memory, magnetoresistive RAM, ferroelectric RAM and so on).
Further, in other possible embodiments of the application, the roles of the bit lines and the word lines are interchangeable; that is, the input data may be input into the bit lines, which is still within the spirit and the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed embodiments. It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims and their equivalents.
This application claims the benefit of U.S. provisional application Ser. No. 62/916,797, filed Oct. 18, 2019, the subject matter of which is incorporated herein by reference.