This application is based on and claims priority under 35 U.S.C. § 119 to Korean Patent Application No. 10-2020-0128274, filed on Oct. 5, 2020, in the Korean Intellectual Property Office, the entire disclosure of which is incorporated by reference herein for all purposes.
The following description relates to a memory device that performs in-memory processing by using an in-memory arithmetic unit.
Applications such as graphics algorithm processing and neural network processing involve compute-intensive arithmetic operations and require a computing system with large-capacity arithmetic and memory capabilities. Memory devices have been developed that are capable of performing some arithmetic operations (computation operations) of a computing system as internal processing (or in-memory processing) of the memory devices. As such, the arithmetic operation burden of a computing system may be reduced by the internal processing of a memory device. However, when separate processing hardware for internal processing is added to a memory device, methods of efficiently performing the arithmetic operation processing of that hardware may be required.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Provided is a memory device that performs in-memory processing.
In one general aspect, a memory device configured to perform in-memory processing includes a plurality of in-memory arithmetic units each configured to perform in-memory processing of a pipelined arithmetic operation, and a plurality of memory banks allocated to the in-memory arithmetic units such that a set of n memory banks is allocated to each of the in-memory arithmetic units, each memory bank configured to perform an access operation of data requested from the in-memory arithmetic units while the pipelined arithmetic operation is performed. Each of the in-memory arithmetic units is configured to operate at a first operating frequency that is less than or equal to a product of n and a second operating frequency of each of the memory banks.
The arithmetic operation may be pipelined into multi-pipeline stages of sub-arithmetic units capable of being processed within a first operation cycle corresponding to the first operating frequency.
Each of the in-memory arithmetic units may be configured to access any number of the allocated n memory banks within a second operation cycle corresponding to the second operating frequency of each of the memory banks.
The memory device may include a pipeline register configured to buffer sub-arithmetic operation results of pipeline stages of the pipelined arithmetic operation.
The memory device may include a clock divider configured to generate, based on an externally provided clock signal, a first clock signal for the in-memory arithmetic units to operate at the first operating frequency and to distribute the first clock signal to the in-memory arithmetic units.
The memory device may include a bank selector configured to sequentially enable one or more of the n memory banks allocated to a first in-memory arithmetic unit, which is included in the plurality of in-memory arithmetic units; a multiplexer configured to provide the first in-memory arithmetic unit with data accessed from the one or more memory banks enabled by the bank selector; and a bank arbiter configured to control data to be output from the multiplexer.
The bank selector may be configured to operate based on the second operating frequency, and the bank arbiter is configured to operate based on the first operating frequency.
The n memory banks allocated to a first in-memory arithmetic unit, which is included in the plurality of in-memory arithmetic units, may include a first memory bank in which a first operand is stored and a second memory bank in which a second operand is stored, the memory device may include a first multiplexer for multiplexing the first operand and a second multiplexer for multiplexing the second operand, and the first multiplexer and the second multiplexer may be provided between the n memory banks allocated to a first in-memory arithmetic unit and the first in-memory arithmetic unit.
The first multiplexer and the second multiplexer may be configured to provide the first in-memory arithmetic unit with the first operand and the second operand within a first operation cycle corresponding to the first operating frequency.
In another general aspect, a memory device configured to perform in-memory processing includes a plurality of in-memory arithmetic units configured to perform in-memory processing of a pipelined arithmetic operation; a plurality of memory banks allocated to the in-memory arithmetic units such that a set of n memory banks is allocated to each of the in-memory arithmetic units, each memory bank configured to perform an access operation of data requested from the in-memory arithmetic units while the pipelined arithmetic operation is performed; and at least one multiplexer configured to provide each of the in-memory arithmetic units with data accessed from at least one memory bank that is enabled among the n memory banks allocated to each of the in-memory arithmetic units, wherein each of the in-memory arithmetic units is configured to operate at a first operating frequency that is less than or equal to a product of n and a second operating frequency of each of the memory banks.
Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.
Throughout the drawings and the detailed description, unless otherwise described or provided, the same drawing reference numerals will be understood to refer to the same elements, features, and structures. The drawings may not be to scale, and the relative size, proportions, and depiction of elements in the drawings may be exaggerated for clarity, illustration, and convenience.
The following detailed description is provided to assist the reader in gaining a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatuses, and/or systems described herein will be apparent after an understanding of the disclosure of this application. For example, the sequences of operations described herein are merely examples, and are not limited to those set forth herein, but may be changed as will be apparent after an understanding of the disclosure of this application, with the exception of operations necessarily occurring in a certain order. Also, descriptions of features that are known after an understanding of the disclosure of this application may be omitted for increased clarity and conciseness.
The features described herein may be embodied in different forms and are not to be construed as being limited to the examples described herein. Rather, the examples described herein have been provided merely to illustrate some of the many possible ways of implementing the methods, apparatuses, and/or systems described herein that will be apparent after an understanding of the disclosure of this application.
Throughout the specification, when a component is described as being “connected to,” or “coupled to” another component, it may be directly “connected to,” or “coupled to” the other component, or there may be one or more other components intervening therebetween. In contrast, when an element is described as being “directly connected to,” or “directly coupled to” another element, there can be no other elements intervening therebetween. Likewise, similar expressions, for example, “between” and “immediately between,” and “adjacent to” and “immediately adjacent to,” are also to be construed in the same way. As used herein, the term “and/or” includes any one and any combination of any two or more of the associated listed items.
Although terms such as “first,” “second,” and “third” may be used herein to describe various members, components, regions, layers, or sections, these members, components, regions, layers, or sections are not to be limited by these terms. Rather, these terms are only used to distinguish one member, component, region, layer, or section from another member, component, region, layer, or section. Thus, a first member, component, region, layer, or section referred to in examples described herein may also be referred to as a second member, component, region, layer, or section without departing from the teachings of the examples.
The terminology used herein is for describing various examples only and is not to be used to limit the disclosure. The articles “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms “comprises,” “includes,” and “has” specify the presence of stated features, numbers, operations, members, elements, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, numbers, operations, members, elements, and/or combinations thereof.
Unless otherwise defined, all terms, including technical and scientific terms, used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains and based on an understanding of the disclosure of the present application. Terms, such as those defined in commonly used dictionaries, are to be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the disclosure of the present application and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein. The use of the term “may” herein with respect to an example or embodiment (e.g., as to what an example or embodiment may include or implement) means that at least one example or embodiment exists where such a feature is included or implemented, while all examples are not limited thereto.
Referring to
The memory device 10 may be implemented as a memory chip or a memory module. The memory controller 20 may be implemented as part of a host, or the memory device 10 and the memory controller 20 may be arranged in one memory module. That is, the implementation form may vary and is not limited to any one configuration. Meanwhile, although not illustrated in
The memory controller 20 may control an overall operation of the memory device 10 by providing various signals to the memory device 10. For example, the memory controller 20 may control a memory access operation of the memory device 10, such as read or write. Specifically, the memory controller 20 may provide a command CMD and an address ADDR to the memory device 10 to write data DATA to the memory device 10 or read data DATA from the memory device 10. In addition, the memory controller 20 may further provide a clock signal CLK to the memory device 10.
The command CMD may include an active command for setting the memory banks 120 to an active state to read data or write data. The memory device 10 may activate rows, that is, word lines, included in the memory banks 120 in response to an active command. In addition, the command CMD may include a precharge command for switching the memory banks 120 from an active state to a standby state after data read or data write is completed. In addition, the command CMD may include a refresh command for controlling refresh operations of the memory banks 120. However, the types of the command CMD described herein are merely examples, and other types of the command CMD may be used.
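For illustration only, the following Python sketch models a generic sequence in which such commands might be issued for a single read access; the command names and the helper function read_sequence are assumptions for readability, not a definition of the device's actual interface.

    # Illustrative (non-normative) model of a command sequence a memory
    # controller might issue for one read: activate a row, read, precharge.
    def read_sequence(bank, row, col):
        return [
            ("ACTIVE", bank, row),      # open the word line (set the bank to the active state)
            ("READ", bank, col),        # read data from the addressed column
            ("PRECHARGE", bank, None),  # return the bank to the standby state
        ]

    for cmd in read_sequence(bank=0, row=0x1A2, col=0x3C):
        print(cmd)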
In addition, the memory controller 20 may control in-memory processing operations of the in-memory arithmetic units 110 by providing various signals to the memory device 10. For example, the memory controller 20 may provide the memory device 10 with a signal generated by combining the command CMD, the address ADDR, and/or the clock signal CLK to indicate in-memory processing operations of the in-memory arithmetic units 110.
The in-memory arithmetic units 110 may be implemented as processing elements (PEs) for performing arithmetic processing in the memory device 10. That is, the in-memory arithmetic units 110 may perform in-memory processing (or internal processing) in the memory device 10.
Specifically, the in-memory arithmetic units 110 may perform data operations on the data DATA stored in the memory banks 120 and/or the data DATA received from the memory controller 20, and may store data DATA of the arithmetic operation result in the memory banks 120 or provide the data to the memory controller 20. Accordingly, the in-memory arithmetic units 110 may also be referred to as a function-in-memory (FIM) or a processor in memory (PIM).
Each of the in-memory arithmetic units 110 may be an arithmetic logic unit (ALU) or a multiply-accumulate (MAC) unit. For example, the in-memory arithmetic units 110 may perform data operations such as data inversion, data shift, data swap, and data comparison, logical operations such as AND and exclusive OR (XOR), and mathematical operations such as addition and subtraction.
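As a purely illustrative sketch (not part of the disclosed hardware), the Python fragment below models the kind of multiply-accumulate operation such an in-memory arithmetic unit might perform on two operand vectors; the function name mac and the example operand values are hypothetical.

    # Hypothetical software model of a MAC operation that an in-memory
    # arithmetic unit might perform on data read from two memory banks.
    def mac(operand_a, operand_b):
        acc = 0
        for a, b in zip(operand_a, operand_b):
            acc += a * b  # multiply-accumulate per element pair
        return acc

    # Example: operand vectors that might be stored in separate memory banks.
    print(mac([1, 2, 3, 4], [5, 6, 7, 8]))  # prints 70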
The number of in-memory arithmetic units 110 included in the memory device 10 and the number of memory banks 120 included in the memory device 10 may be changed. In addition, n (where n is a natural number) memory banks may be allocated to one in-memory arithmetic unit.
For example, when the memory device 10 is a double data rate 4 dynamic random access memory (DDR4 DRAM) module, the number of memory banks 120 may be 16, and the number of in-memory arithmetic units 110 may be 8. In this case, the in-memory arithmetic units 110 and the memory banks 120 may be mapped at a ratio of 1:2 (n=2). Alternatively, the number of memory banks 120 may be 16, and the number of in-memory arithmetic units 110 may be 4, and in this case, the in-memory arithmetic units 110 and the memory banks 120 may be mapped at a ratio of 1:4 (n=4). That is, according to the various examples, the mapping ratio between the in-memory arithmetic units 110 and the memory banks 120 may be varied.
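For illustration only, the Python sketch below models the bank-to-unit allocation described above; the function allocate_banks and the even, contiguous grouping it uses are assumptions rather than a description of the actual mapping circuitry.

    # Hypothetical model of allocating n memory banks to each in-memory arithmetic
    # unit (e.g., 16 banks with 8 units gives n = 2; 16 banks with 4 units gives n = 4).
    def allocate_banks(num_banks, num_units):
        assert num_banks % num_units == 0, "banks must divide evenly among units"
        n = num_banks // num_units
        return {unit: list(range(unit * n, (unit + 1) * n)) for unit in range(num_units)}

    print(allocate_banks(16, 8))  # each unit mapped to 2 banks (ratio 1:2)
    print(allocate_banks(16, 4))  # each unit mapped to 4 banks (ratio 1:4)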
Each of the memory banks 120 may include a plurality of memory cells. Specifically, the memory cells of the memory banks 120 may be located at points where a plurality of word lines and a plurality of bit lines cross each other. The memory banks 120 may store in-memory processing data. Here, the in-memory processing data may include data on which arithmetic operations will be performed by the in-memory arithmetic units 110 and/or data generated as a result of the arithmetic operations of the in-memory arithmetic units 110.
The memory device 10 may include various types of memory, for example, a dynamic random access memory (DRAM) such as a double data rate synchronous dynamic random access memory (DDR SDRAM), a low power double data rate (LPDDR) SDRAM, a graphics double data rate (GDDR) SDRAM, or a Rambus dynamic random access memory (RDRAM). However, the various examples herein are not limited thereto, and the memory device 10 may include a nonvolatile memory such as a flash memory, a magnetic random access memory (MRAM), a ferroelectric RAM (FeRAM), a phase change RAM (PRAM), or a resistive RAM (ReRAM).
The memory device 10 illustrated in
As illustrated in
Alternatively, as illustrated in
That is, memory banks 120 included in the memory device 10 may share one in-memory arithmetic unit in units of n memory banks, and each of the in-memory arithmetic units may access n memory banks corresponding thereto to perform data operations. Here, when the memory device 10 corresponds to a DDR4 DRAM module, n may be any one of 2, 4, and 8 but is not limited thereto.
Referring to
According to
In contrast to this, according to
Therefore, if the operating frequency (first operating frequency) of the in-memory arithmetic unit is increased above the operating frequency (second operating frequency) of the memory bank, the performance of in-memory processing may be increased. Here, the first operating frequency and the second operating frequency may satisfy Equation 1 below.
second operating frequency × 1 < first operating frequency ≤ second operating frequency × n [Equation 1]
(n is the number of memory banks mapped to one in-memory arithmetic unit)
That is, the arithmetic operation performance of the in-memory arithmetic unit may be increased to twice Y giga-operations per second (GOPS), where Y GOPS is the arithmetic operation performance of a first arithmetic unit 411 in
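The relationship of Equation 1 and the resulting throughput scaling can be sketched numerically. The Python check below is only a reading aid under assumed placeholder figures (the frequencies, n, and Y are examples, not device specifications).

    # Equation 1: second operating frequency * 1 < first operating frequency
    #             <= second operating frequency * n
    def satisfies_equation_1(unit_freq_hz, bank_freq_hz, n):
        return bank_freq_hz < unit_freq_hz <= bank_freq_hz * n

    # Throughput scales with the ratio of the two operating frequencies
    # (placeholder numbers: Y GOPS achieved at the bank frequency).
    bank_freq, n, y_gops = 250e6, 2, 1.0
    unit_freq = 2 * bank_freq  # 500 MHz, i.e., twice the bank frequency
    assert satisfies_equation_1(unit_freq, bank_freq, n)
    print(y_gops * unit_freq / bank_freq)  # 2.0 -> corresponds to 2Y GOPS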
Referring to
According to
That is, if the operating frequency (first operating frequency) of the in-memory arithmetic unit is increased in proportion to the number of memory banks mapped to one in-memory arithmetic unit, more efficient arithmetic processing (for example, 4Y GOPS) of the in-memory arithmetic unit may be performed within the operation cycle of the memory bank.
Referring to
An operating frequency (first operating frequency) of the first arithmetic unit 611 of
As in
Pipelining is a technique for increasing arithmetic processing speed by dividing the process of performing an arithmetic operation into several stages and processing the respective stages in parallel (simultaneously).
Referring to
The pipelining processing of the in-memory arithmetic unit operating at exemplary operating frequencies (250 MHz and 500 MHz) illustrated in
When a given arithmetic operation is pipelined in units of an operation cycle of 4 ns corresponding to the operating frequency (that is, 250 MHz) of the in-memory arithmetic unit (first arithmetic unit 631) illustrated in
However, when a given arithmetic operation is pipelined in units of an operation cycle of 2 ns corresponding to the operating frequency of 500 MHz (2×250 MHz) of the in-memory arithmetic unit (first arithmetic unit 611) illustrated in
That is, the given arithmetic operation is pipelined into multiple pipeline stages of sub-arithmetic units (or sub-arithmetic operations) that may be processed within a first operation cycle corresponding to the first operating frequency of the in-memory arithmetic unit, and when the first operating frequency of the in-memory arithmetic unit is higher than the second operating frequency of the memory bank, the in-memory arithmetic unit may perform in-memory processing of more pipelined arithmetic operations within the same time. Here, the relationship between the first operating frequency and the second operating frequency is the same as the relationship represented by Equation 1 described above. In
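As an illustrative sketch only, splitting an operation into stages that each fit within the arithmetic unit's operation cycle might be modeled as below in Python; the total operation latency of 12 ns is an assumed placeholder chosen so that the stage counts match the 4 ns and 2 ns cycles discussed above.

    # Hypothetical sketch: divide one arithmetic operation into pipeline stages
    # that each complete within the in-memory arithmetic unit's operation cycle.
    def num_pipeline_stages(total_latency_ns, unit_cycle_ns):
        # ceiling division: every stage must fit within one cycle
        return -(-total_latency_ns // unit_cycle_ns)

    total_latency_ns = 12  # assumed latency of the whole arithmetic operation
    print(num_pipeline_stages(total_latency_ns, 4))  # 3 stages at 250 MHz (4 ns cycle)
    print(num_pipeline_stages(total_latency_ns, 2))  # 6 stages at 500 MHz (2 ns cycle)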
Referring to
The in-memory arithmetic unit (first arithmetic unit 111) is connected to the respective memory banks on the memory die 100. In this case, the memory die 100 may include a bank selector 140 that selects any one of the four memory banks BANK1, BANK2, BANK3, and BANK4 allocated to the first arithmetic unit 111, and a bank arbiter 160 for controlling a multiplexer (MUX) 150 (e.g., controlling data to be output from the multiplexer (MUX) 150) that provides the first arithmetic unit 111 with data accessed from a memory bank selected by the bank selector 140. Hardware configuration elements implemented on the memory die 100 may be connected to each other through a data bus 170. In an example, the bank selector 140 may sequentially enable one or more of the four memory banks BANK1, BANK2, BANK3, and BANK4 allocated to the first arithmetic unit 111.
The first arithmetic unit 111 may operate at a first operating frequency to perform in-memory processing of a pipelined arithmetic operation 810. Here, the arithmetic operation 810 may be pipelined into multi-pipeline stages of a sub-arithmetic unit (or sub-arithmetic operation) that may be processed within a first operation cycle corresponding to the first operating frequency of the first arithmetic unit 111. The first arithmetic unit 111 may include at least one pipeline register 1111 for buffering sub-arithmetic operation results in each pipeline stage of the pipelined arithmetic operation 810.
Each of the memory banks BANK1, BANK2, BANK3, and BANK4 may operate at a second operating frequency to perform an access operation of data requested from the first arithmetic unit 111 while the pipelined arithmetic operation is performed.
A clock divider 130 may distribute a clock signal BANK CLK provided from an external device (the memory controller 20 of
That is, the clock divider 130 may generate the first clock signal CLK1 from the clock signal BANK CLK based on a relationship between the first operating frequency and the second operating frequency and may distribute the first clock signal CLK1. For example, as described with reference to
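For illustration, a minimal Python model of deriving the arithmetic-unit clock from the bank clock is given below; the helper derive_clk1_period_ns and the fixed frequency ratio are assumptions about one possible behavior, not the actual circuit.

    # Hypothetical model of the clock divider 130: derive the first clock signal CLK1
    # for the in-memory arithmetic unit from the externally provided clock signal BANK CLK.
    def derive_clk1_period_ns(bank_clk_period_ns, ratio):
        # ratio = first operating frequency / second operating frequency,
        # where 1 < ratio <= n according to Equation 1
        return bank_clk_period_ns / ratio

    bank_clk_period_ns = 4.0  # 250 MHz bank clock (example value from the description)
    print(derive_clk1_period_ns(bank_clk_period_ns, 2))  # 2.0 ns -> 500 MHz CLK1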
The first arithmetic unit 111 operates at a first operating frequency that is twice the second operating frequency of each of the memory banks BANK1, BANK2, BANK3, and BANK4. As described above, n corresponds to the number of memory banks allocated to one in-memory arithmetic unit, and the relationship between the first operating frequency and the second operating frequency is represented by Equation 1.
Meanwhile, the bank selector 140 may operate at the second operating frequency in response to the clock signal BANK CLK in the same manner as the memory banks BANK1, BANK2, BANK3, and BANK4, and the bank arbiter 160 may operate at the first operating frequency in response to first clock signal CLK1 in the same manner as the first arithmetic unit 111.
The bank selector 140 may provide enable terminals EN of the memory banks BANK1 and BANK2 with a control signal for enabling the memory banks BANK1 and BANK2 within a certain operation cycle through a first terminal 1ST, and may provide enable terminals EN of the memory banks BANK3 and BANK4 with a control signal for enabling the memory banks BANK3 and BANK4 within a next operation cycle through a second terminal 2ND. The bank arbiter 160 may control the multiplexer (MUX) 150 so that the first arithmetic unit 111 may sequentially access two memory banks that are enabled within a certain operation cycle.
Referring to
During one cycle of the clock signal BANK CLK, the memory banks BANK1 and BANK2 are enabled by a control signal BANK Selector_1ST output through the first terminal 1ST of the bank selector 140. At this time, the memory banks BANK3 and BANK4 are in a disabled state.
Because the frequency of the first clock signal CLK Divider_output is twice the frequency of the clock signal BANK CLK, data DATA1 911 and data DATA1 912 of the memory banks BANK1 and BANK2 may be sequentially provided to the first arithmetic unit 111 during two cycles of the first clock signal CLK Divider_output. According to
During the next cycle of the clock signal BANK CLK, the data DATA1 913 and data DATA1 914 of the other memory banks BANK3 and BANK4 may be sequentially provided to the first arithmetic unit 111.
In this manner, during four cycles of the clock signal BANK CLK, the data DATA1 911, DATA1 912, DATA1 913, DATA1 914, DATA2 921, DATA2 922, DATA2 923, and DATA2 924 of the memory banks BANK1, BANK2, BANK3, and BANK4 may be sequentially provided as inputs to the first arithmetic unit 111.
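The access ordering described above can be sketched as a small simulation. The Python fragment below is only a reading aid and assumes a simple scheme in which one pair of banks is enabled per bank-clock cycle and the arithmetic unit reads one bank per faster clock cycle.

    # Hypothetical simulation of the bank selector / bank arbiter behavior:
    # two banks are enabled per BANK CLK cycle, and the arithmetic unit
    # (running at twice the bank frequency) reads one bank per CLK1 cycle.
    banks = ["BANK1", "BANK2", "BANK3", "BANK4"]
    access_order = []
    for data in ("DATA1", "DATA2"):            # two pieces of data per bank
        for pair_start in (0, 2):              # one pair of banks per BANK CLK cycle
            for bank in banks[pair_start:pair_start + 2]:
                access_order.append(f"{data} from {bank}")
    print(access_order)
    # -> DATA1 from BANK1, DATA1 from BANK2, DATA1 from BANK3, DATA1 from BANK4,
    #    DATA2 from BANK1, ... matching the sequence described above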
Referring to
The pipelined arithmetic operation 700 includes multi-pipeline stages of six sub-arithmetic operations STAGE 1-1 to STAGE 3-2. Each arithmetic operation OP # (e.g., OP1, OP2 or OP3) performed by the first arithmetic unit 111 illustrated in
The pipeline stage allocation 1020 in each timeline (cycle) corresponds to pipelining according to the in-memory processing of the first arithmetic unit 611 (operating frequency: 500 MHz and operation cycle: 2 ns) and the memory banks 621 to 624 (operating frequency: 250 MHz and operation cycle: 4 ns) described in
Specifically, during an Nth cycle, the first arithmetic unit 611 accesses the first memory bank 621 to perform a first arithmetic operation OP1 of the sub-arithmetic operation STAGE 1-1 and accesses the second memory bank 622 to perform the first arithmetic operation OP1 of the sub-arithmetic operation STAGE 1-1. Meanwhile, the first arithmetic unit 611 may access data of the first arithmetic operation OP1 from the first memory bank 621 and at the same time perform pipelining of the first arithmetic operation OP1 of the next sub-arithmetic operation STAGE 1-2. In this way, the first arithmetic unit 611 may alternately access the memory banks 621 to 624 even in the (N+5)th cycle illustrated in
Meanwhile, the pipeline stage allocation 1010 corresponds to pipelining according to in-memory processing of the first arithmetic unit 631 (operating frequency: 250 MHz and operation cycle: 4 ns) and the memory banks 641 to 644 (operating frequency: 250 MHz and operation cycle: 4 ns) described with reference to
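To illustrate the difference numerically, one can count how many pipeline stage slots each configuration provides within the same time window. The Python sketch below uses an assumed 24 ns observation window purely as a placeholder; it is a reading aid, not a measurement of the described hardware.

    # Hypothetical comparison: pipeline stage slots available in a fixed time window
    # for a 500 MHz arithmetic unit (2 ns cycle) versus a 250 MHz unit (4 ns cycle).
    def stage_slots(window_ns, unit_cycle_ns):
        return window_ns // unit_cycle_ns  # one pipeline stage advances per unit cycle

    window_ns = 24  # assumed observation window (six 4 ns bank cycles)
    print(stage_slots(window_ns, 2))  # 12 stage slots at 500 MHz
    print(stage_slots(window_ns, 4))  # 6 stage slots at 250 MHz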
Referring to
Referring to
According to the timing diagram of
Because the frequency of the first clock signal CLK Divider_output is twice the frequency of the clock signal BANK CLK, operand A data OPD_A1 of the first memory bank BANK1, operand A data OPD_A1 of the second memory bank BANK2, operand B data OPD_B1 of the third memory bank BANK3, and operand B data OPD_B1 of the fourth memory bank BANK4 may all be accessed during two cycles of the first clock signal CLK Divider_output. Here, because the multiplexers (MUXs) 1241 and 1242 may substantially simultaneously access the memory banks BANK1 and BANK2 or the memory banks BANK3 and BANK4 under the control of the operand A arbiter 1231 and the operand B arbiter 1232, the in-memory arithmetic unit 1250 may substantially simultaneously access operand A data OPD_A1, OPD_A2, OPD_A3, or OPD_A4 and operand B data OPD_B1, OPD_B2, OPD_B3, or OPD_B4. Accordingly, the in-memory arithmetic unit 1250 may perform an arithmetic operation between the operand A and the operand B within one operation cycle of the in-memory arithmetic unit 1250. In
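A minimal sketch of the two-operand fetch is given below in Python, assuming one operand A bank and one operand B bank are read in the same arithmetic cycle; the exact bank pairing, the data values, and the helper fetch_operands are assumptions for illustration only.

    # Hypothetical model: per arithmetic cycle, the operand A multiplexer and the
    # operand B multiplexer each deliver one value, so the arithmetic unit can
    # combine both operands within a single operation cycle.
    operand_a_banks = {"BANK1": [1, 3], "BANK2": [5, 7]}   # assumed operand A data
    operand_b_banks = {"BANK3": [2, 4], "BANK4": [6, 8]}   # assumed operand B data

    def fetch_operands(cycle):
        a_bank = "BANK1" if cycle % 2 == 0 else "BANK2"
        b_bank = "BANK3" if cycle % 2 == 0 else "BANK4"
        index = cycle // 2
        return operand_a_banks[a_bank][index], operand_b_banks[b_bank][index]

    for cycle in range(4):
        a, b = fetch_operands(cycle)
        print(f"cycle {cycle}: A={a}, B={b}, A+B={a + b}")  # one result per cycle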
Referring to
Specifically, a memory device that performs in-memory processing and is mounted as the RAM 1520 may include a plurality of in-memory arithmetic units that perform in-memory processing of a pipelined arithmetic operation, and a plurality of memory banks that are allocated to each in-memory arithmetic unit in units of n memory banks and perform an access operation of data requested from each in-memory arithmetic unit while the pipelined arithmetic operation is performed. Here, each of the in-memory arithmetic units may operate at the first operating frequency that is less than or equal to n times the second operating frequency of each of the memory banks.
The computing system 1500 includes a central processing unit (CPU) 1510, a RAM 1520, a user interface 1530, and a nonvolatile memory 1540, which are electrically connected to a bus 1550. The nonvolatile memory 1540 may include a mass storage device such as a solid state drive (SSD) or a hard disk drive (HDD).
As the memory device (or memory system) described above is applied to the computing system 1500, a memory device included in the RAM 1520 may perform in-memory processing.
While this disclosure includes specific examples, it will be apparent after an understanding of the disclosure of this application that various changes in form and details may be made in these examples without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered in a descriptive sense only, and not for purposes of limitation. Descriptions of features or aspects in each example are to be considered as being applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order, and/or if components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents. Therefore, the scope of the disclosure is defined not by the detailed description, but by the claims and their equivalents, and all variations within the scope of the claims and their equivalents are to be construed as being included in the disclosure.