The present invention relates to methods and systems for processing read-modify-write requests, and more particularly relates to a memory system with a plurality of memory banks and other circuit components that are configured to process the read-modify-write requests.
A residual block (or residual unit) is an important architectural feature of many neural networks, particularly Convolutional Neural Networks (CNNs). In a residual block architecture, a tensor is passed through one or more convolutional layers (referred to as a “main path”), and the tensor also takes a “skip connection” bypassing those layers. The main path and skip connection tensors are then added element-wise. An activation function such as a Rectified Linear Unit (ReLU) may be applied to the result of this element-wise sum, and the result is stored back into memory for subsequent use in the neural network. Additional details of residual block architectures may be found in Kaiming He et al., “Identity Mappings in Deep Residual Networks,” Microsoft Research, arXiv:1603.05027v3 [cs.CV], 25 Jul. 2016.
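The residual-block computation described above can be sketched in a few lines of NumPy. This is a purely mathematical illustration, not the hardware implementation; the `main_path` callable stands in for the block's convolutional layers, and the function and variable names are illustrative only.

```python
import numpy as np

def relu(x):
    # Rectified Linear Unit: max(0, x), applied element-wise
    return np.maximum(x, 0)

def residual_block(x, main_path):
    # main_path: the convolutional layers of the block (any callable).
    # The skip connection passes x around the main path unchanged;
    # the two tensors are summed element-wise, then ReLU is applied.
    return relu(main_path(x) + x)

# Toy example: a stand-in "main path" that doubles each element
y = residual_block(np.array([-1.0, 2.0, 3.0]), lambda t: 2.0 * t)
```

With the input `[-1, 2, 3]`, the main path produces `[-2, 4, 6]`, the element-wise sum is `[-3, 6, 9]`, and ReLU clamps the negative entry to zero.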
In accordance with one embodiment of the invention, a read-modify-write request is generated to implement a residual block in a neural network. Each read-modify-write request can include both a read address and a write address along with a first operand (e.g., the “main path tensor”), and these requests are routed to the appropriate memory bank in the memory system based on the read and write addresses. A bank-specific buffer temporarily stores the write address and the first operand while a second operand (e.g., the “skip connection tensor”) is being read from the memory bank. A bank-specific combiner circuit performs the element-wise sum of the first and second operands, and a bank-specific activation circuit optionally applies the ReLU activation function. Finally, the result is written to one of the memory banks at the address specified by the write address. Each read-modify-write request may be processed independently (and concurrently) at each memory bank. In a preferred embodiment, the read address and write address of each read-modify-write request reside in the same memory bank.
An advantage provided by the hardware architecture is that the “main path tensor” does not need to be first stored in one of the memory banks prior to being combined with the “skip connection tensor” that is read from one of the memory banks. Instead, the “main path tensor” may be temporarily stored in a per-bank buffer while the “skip connection tensor” is being retrieved from one of the memory banks.
More generally, in one embodiment, a memory system comprises a plurality of memory sub-systems, each with a memory bank and other circuit components. For each of the memory sub-systems, a first buffer receives a read-modify-write request (with a read address, a write address and a first operand), a second operand is read from the memory bank at the location specified by the read address, a combiner circuit combines the first operand with the second operand, an activation circuit transforms the output of the combiner circuit, and the output of the activation circuit is stored in the memory bank at the location specified by the write address.
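The per-bank data path described above can be modeled behaviorally as follows. This is a software sketch under stated assumptions, not the claimed hardware: the class and method names are hypothetical, the combiner is fixed to an element-wise add, and the activation is fixed to ReLU for illustration.

```python
from collections import deque

class MemorySubSystem:
    """Behavioral sketch of one memory bank plus its circuit components.

    Illustrative only: a real sub-system would pipeline these phases
    and operate on vectors rather than single scalar values.
    """
    def __init__(self, size):
        self.bank = [0.0] * size   # the memory bank
        self.requests = deque()    # first buffer (FIFO of pending requests)

    def submit(self, read_addr, write_addr, operand1):
        self.requests.append((read_addr, write_addr, operand1))

    def step(self):
        # Process one read-modify-write request end to end.
        read_addr, write_addr, operand1 = self.requests.popleft()
        operand2 = self.bank[read_addr]        # read phase
        combined = operand1 + operand2         # combiner circuit
        activated = max(combined, 0.0)         # activation circuit (ReLU)
        self.bank[write_addr] = activated      # write phase

sub = MemorySubSystem(size=8)
sub.bank[3] = -5.0   # "skip connection" value already resident in the bank
sub.submit(read_addr=3, write_addr=4, operand1=2.0)   # "main path" value
sub.step()
```

Here the combiner produces 2.0 + (−5.0) = −3.0, and the ReLU stage stores 0.0 at the write address.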
For each of the memory banks, a second buffer may store the first operand while the second operand is being read from the memory bank. Additionally, for each of the memory banks, the second buffer may store the write address while the write data is being computed by the combiner circuit and the activation circuit.
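The role of the second buffer can be sketched as a simple two-phase pipeline: the write address and first operand are parked while the (multi-cycle, in hardware) bank read is in flight. The structure and names below are illustrative assumptions, not the actual circuit.

```python
from collections import deque

# Second buffer: holds (write_addr, operand1) pairs while the
# corresponding read from the memory bank is still in progress.
pending = deque()
bank = [0.0] * 8
bank[2] = 7.0   # "skip connection" value in the bank

def issue_read(request):
    # Phase 1: start the bank read; park the rest of the request.
    read_addr, write_addr, operand1 = request
    pending.append((write_addr, operand1))
    return read_addr

def complete(read_value):
    # Phase 2: once the read returns, combine, activate (ReLU), and write.
    write_addr, operand1 = pending.popleft()
    bank[write_addr] = max(operand1 + read_value, 0.0)

addr = issue_read((2, 5, 3.0))
complete(bank[addr])
```

Decoupling the two phases this way lets the first buffer accept new requests while earlier reads are still outstanding.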
In one embodiment, the output of the activation circuit may be first stored in the first buffer prior to being stored in the memory bank. In another embodiment, the output of the activation circuit may be stored in the memory bank directly, bypassing the first buffer. In such an embodiment, a controller may be needed to mediate access to the memory bank so that the writing of the output of the activation function circuit to the memory bank happens during a window of time in which the buffer is not also accessing the memory bank.
These and other embodiments of the invention are more fully described in association with the drawings below.
In the following detailed description of the preferred embodiments, reference is made to the accompanying drawings that form a part hereof, and in which are shown by way of illustration specific embodiments in which the invention may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present invention. Descriptions associated with any one of the figures may be applied to different figures containing like or similar components/steps.
A memory system is described below for processing general read-modify-write requests, and such system may be specifically adapted to implement residual block structures in the context of a CNN. It is noted that such system may also be adapted to implement residual block structures in other networks, such as a transformer neural network.
In one embodiment, the first operand is a vector of n values and the second operand is also a vector of n values. The combiner circuit 110 may perform an element-wise combination of the first and second operands to generate an output with n values. In one embodiment, the activation function circuit 112 may apply an identical mathematical transformation on each of the n input values so as to generate n output values. In one embodiment, n may equal 1, in which case the processing of the read-modify-write request reduces to scalar, rather than vector, operations.
The activation function circuit 112 may be an optional part of the memory sub-system 100. If the operation of activation function circuit 112 is not desired, the activation function circuit 112 may be set to an identity function (i.e., with the output set equal to the input), or the output of the combiner circuit 110 may bypass the activation function circuit 112 and be stored in the memory bank 102 at the location specified by the write address.
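The optional nature of the activation stage can be illustrated with a small sketch, in which passing no activation function models the bypass (or identity-function) configuration. The function names are illustrative assumptions, not part of the described hardware.

```python
def combine(operand1, operand2, activation=None):
    # Combiner stage followed by an optional activation stage.
    # activation=None models the bypass / identity configuration.
    combined = operand1 + operand2
    return combined if activation is None else activation(combined)

relu = lambda v: max(v, 0.0)
with_act = combine(2.0, -3.0, activation=relu)   # ReLU applied
bypassed = combine(2.0, -3.0)                    # identity / bypass
```

In both configurations the combiner output is −1.0; only the ReLU path clamps it to zero before the write.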
Lastly, it is noted that the use of solid signal lines and dashed signal lines was for the purpose of clarity (i.e., to allow the reader to better distinguish between separate signal lines in instances where there are intersecting signal lines). The intersection points between the solid and dashed signal lines are not electrically connected (i.e., are not shorted), but rather it is understood that one signal line merely crosses over another signal line.
Buffer 104a may receive and store a read-modify-write request 114a that includes a read address, a write address and a first operand. In one embodiment, buffer 104a may be a first-in-first-out (FIFO) buffer. A second operand is then read from the memory bank 102a from a location specified by the read address of the read-modify-write request 114a. The first operand is combined with the second operand by combiner circuit 110a, and the output of the combiner circuit 110a is provided to an activation function circuit 112a. The output of the activation function circuit 112a is then stored in the memory bank 102a at the location specified by the write address of the read-modify-write request 114a. The operation of buffer 106a and multiplexor 108a was previously explained above.
Buffer 104b may receive and store a read-modify-write request 114b that includes a read address, a write address and a first operand. In one embodiment, buffer 104b may be a first-in-first-out (FIFO) buffer. A second operand is then read from the memory bank 102b from a location specified by the read address of the read-modify-write request 114b. The first operand is combined with the second operand by combiner circuit 110b, and the output of the combiner circuit 110b is provided to an activation function circuit 112b. The output of the activation function circuit 112b is then stored in the memory bank 102b at the location specified by the write address of the read-modify-write request 114b. The operation of buffer 106b and multiplexor 108b was previously explained above.
Logic (not depicted) or the controller (not depicted) appropriately routes each of the read-modify-write requests 114a, 114b to one of the memory banks 102a, 102b, such that the read address and the write address reside in that memory bank 102a, 102b. For instance, the read address and the write address of the read-modify-write request 114a reside in memory bank 102a. Similarly, the read address and the write address of the read-modify-write request 114b reside in memory bank 102b. In one embodiment, the combiner circuit 110a generates its output data (in response to the read-modify-write request 114a) concurrently with the combiner circuit 110b generating its output data (in response to the read-modify-write request 114b).
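The routing behavior described above can be sketched as follows. The address-to-bank mapping shown (consecutive address ranges per bank) is a hypothetical choice for illustration; the actual routing logic is not specified here.

```python
def bank_of(address, bank_size):
    # Hypothetical mapping: consecutive address ranges map to
    # consecutive banks (the real mapping may differ, e.g. interleaving).
    return address // bank_size

def route(requests, num_banks, bank_size):
    # Send each request to the bank in which its addresses reside;
    # in the preferred embodiment the read and write addresses of a
    # request fall in the same bank, which is checked here.
    queues = [[] for _ in range(num_banks)]
    for read_addr, write_addr, operand in requests:
        bank = bank_of(read_addr, bank_size)
        assert bank == bank_of(write_addr, bank_size)
        queues[bank].append((read_addr, write_addr, operand))
    return queues

# Two requests landing in two different banks, processed independently
queues = route([(0, 1, 5.0), (16, 17, 6.0)], num_banks=2, bank_size=16)
```

Because each request lands in exactly one per-bank queue, the two banks can drain their queues concurrently, as the paragraph above notes.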
Thus, methods and systems for processing read-modify-write requests have been described. It is to be understood that the above-description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. The scope of the invention should, therefore, be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
This application is a Continuation application of U.S. application Ser. No. 17/818,876, filed on 10 Aug. 2022 (now issued as U.S. Pat. No. 11,630,605), incorporated by reference herein.
Number | Name | Date | Kind |
---|---|---|---|
5235693 | Chinnaswamy et al. | Aug 1993 | A |
5548744 | Ogura et al. | Aug 1996 | A |
6643212 | Jones, Jr. et al. | Nov 2003 | B1 |
8959292 | Ahn et al. | Feb 2015 | B1 |
9811263 | Teh | Nov 2017 | B1 |
9971540 | Herrero Abellanas | May 2018 | B2 |
10776668 | Dutta et al. | Sep 2020 | B2 |
10931588 | Matthews et al. | Feb 2021 | B1 |
11237905 | Chachad et al. | Feb 2022 | B2 |
20030120880 | Banno | Jun 2003 | A1 |
20080148108 | Barnum et al. | Jun 2008 | A1 |
20170075823 | Ward et al. | Mar 2017 | A1 |
20170206036 | Pax et al. | Jul 2017 | A1 |
20170358327 | Oh et al. | Dec 2017 | A1 |
20180075344 | Ma et al. | Mar 2018 | A1 |
20190042920 | Akin et al. | Feb 2019 | A1 |
20190042922 | Pillai | Feb 2019 | A1 |
20190057302 | Cho et al. | Feb 2019 | A1 |
20190205244 | Smith | Jul 2019 | A1 |
20190339980 | Sity et al. | Nov 2019 | A1 |
20200097406 | Ou | Mar 2020 | A1 |
20200104072 | Ngu et al. | Apr 2020 | A1 |
20200184001 | Gu et al. | Jun 2020 | A1 |
20200293319 | Lee et al. | Sep 2020 | A1 |
20200356305 | Kim | Nov 2020 | A1 |
20210110876 | Seo et al. | Apr 2021 | A1 |
20210173656 | Diamant | Jun 2021 | A1 |
20210209022 | Song | Jul 2021 | A1 |
20210209450 | Cassidy et al. | Jul 2021 | A1 |
20210216243 | Shin et al. | Jul 2021 | A1 |
20210225430 | O | Jul 2021 | A1 |
20210263739 | Norrie et al. | Aug 2021 | A1 |
20220076717 | Mathew et al. | Mar 2022 | A1 |
20220317923 | Balakrishnan | Oct 2022 | A1 |
Entry |
---|
Angizi; et al., “DIMA: A Depthwise CNN In-Memory Accelerator”, Association for Computing Machinery, ICCAD '18, Nov. 5-8, 2018, San Diego, CA, USA, 8 pgs. |
Gao; et al., “TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory”, ASPLOS '17, Apr. 8-12, 2017, Xi'an, China, 14 pgs. |
He; et al., “Identity Mappings in Deep Residual Networks”, Cornell University, arXiv:1603.05027v3 [cs.CV] Jul. 25, 2016, 15 pgs. |
Liu; et al., “Processing-in-Memory for Energy-efficient Neural Network Training: A Heterogeneous Approach”, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (Micro), Oct. 20-24, 2018, 14 pgs. |
Corrected Notice of Allowability mailed Mar. 8, 2023, for U.S. Appl. No. 17/818,876, filed Aug. 10, 2022, 6 pgs. |
Notice of Allowance mailed Feb. 24, 2023, for U.S. Appl. No. 17/818,876, filed Aug. 10, 2022, 9 pgs. |
Amendment filed Dec. 23, 2022, for U.S. Appl. No. 17/818,876, filed Aug. 10, 2022, 7 pgs. |
Final Office Action dated Dec. 1, 2022, for U.S. Appl. No. 17/818,876, filed Aug. 10, 2022, 13 pgs. |
Amendment filed Oct. 31, 2022, for U.S. Appl. No. 17/818,876, filed Aug. 10, 2022, 9 pgs. |
Non-Final Office Action dated Oct. 26, 2022, for U.S. Appl. No. 17/818,876, filed Aug. 10, 2022, 11 pgs. |
International Search Report and Written Opinion mailed Jun. 21, 2023, from the ISA/European Patent Office, for International Patent Application No. PCT/US2023/015130 (filed Mar. 13, 2023), 14 pgs. |
Written Opinion of the International Preliminary Examining Authority mailed Jul. 2, 2024, from the IPEA/European Patent Office, for International Patent Application No. PCT/US2023/015130 (filed Mar. 13, 2023), 8 pgs. |
Number | Date | Country | |
---|---|---|---|
20240053919 A1 | Feb 2024 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 17818876 | Aug 2022 | US |
Child | 18183034 | US |