The disclosed embodiments are generally directed to memory devices, and in particular, to memory device processing.
Modern memory devices, such as those based on dynamic random access memory (DRAM) and phase change memory (PCM) technology, consist of several independent banks, each of which is arranged as a two-dimensional array of memory cells. To access a row or block of data, a memory controller issues an activation command followed by multiple read or write commands over a memory channel to the memory device. In many situations, it might be useful to move data from one row of a bank to another row in the bank. These intra-bank, inter-row data transfers can be achieved in conventional systems by having the memory controller issue a series of multiple read commands followed by a series of multiple write commands over the memory channel.
In these conventional systems, the memory channel remains occupied for the entire duration of the intra-bank, inter-row data transfer. This bandwidth use can be viewed as a waste of bandwidth because the compute units do not benefit directly from this data transfer. These transfers also waste significant command and address bus bandwidth. Since memory bandwidth is one of the most important system resources, this can result in significant lost performance potential. In addition, the bank involved in the data transfer cannot service any other requests while it is occupied by the data transfer. This reduction in available bandwidth and the consequent contention for the banks arise even if other requests can be interleaved among the read or write requests. The data transfers across the interface to and from the DRAM also require potentially significant energy, particularly in high-frequency memory systems.
A method and apparatus for inter-row data transfer in memory devices are described. Data transfer from one physical location in a memory device to another is achieved without engaging the external input/output pins on the memory device. In some embodiments, a memory device is responsive to a row transfer (RT) command which includes a source row identifier and a target row identifier. The memory device activates a source row and stores source row data in a row buffer, latches the target row identifier into the memory device, activates a word line of a target row to prepare for a write operation, and stores the source row data from the row buffer into the target row.
A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
The processor 102 may include a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU. The memory 104 may be located on the same die as the processor 102, or may be located separately from the processor 102. The memory 104 may include a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 106 may include a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 108 may include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 110 may include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 112 communicates with the processor 102 and the input devices 108, and permits the processor 102 to receive input from the input devices 108. The output driver 114 communicates with the processor 102 and the output devices 110, and permits the processor 102 to send output to the output devices 110. It is noted that the input driver 112 and the output driver 114 are optional components, and that the device 100 will operate in the same manner if the input driver 112 and the output driver 114 are not present.
To access a block of data 340, a row of data is first brought into a series of sense-amplifiers and latches, (internal to the device 300, and one per bank 305 . . . 310), which constitute a row buffer 345. This operation is termed row activation and is performed by issuing a row activation command (ACT) along with an address for the specific row, (i.e. selecting a word line 347). To access the desired block 340, (which is a subset of the selected row), a column read (RD) or column write (WR) command is then issued along with a column address, (i.e. selecting a bit line 349), to respectively read from or write to the open row in the row buffer 345.
In the case of an RD command, the data from the columns selected by the column address is moved from the row buffers 345 to external input/output (I/O) pins (not shown) and subsequently driven over an external memory channel to a memory controller, (for example, over memory channel 208 to memory controller 204 as shown in
Intra-bank, inter-row data transfers can be achieved in conventional systems by having a master device issue a sequence of reads from the source row. On the first read, an activate command is issued with the source row address by the memory controller 204, following which the row's data is brought into the memory bank's row buffer 345. A series of RD commands is then issued, one for each block in the row buffer 345, and the data is returned over the data bus. For example, in a specific configuration of current double data rate type three (DDR3) systems, a single row activation in a 1 GB DRAM dual in-line memory module (DIMM) fetches 16 KB of data into the row buffers. This amounts to a total of 256 64-byte transfers.
The data transferred to the memory controller 204 is returned to the master device, where it must be buffered. The master device then issues a sequence of writes to write the buffered data to the target row. On the first write, the memory controller 204 issues a precharge command to the bank, which closes the open row, (i.e. the source row). The target row is then activated using the ACT command. A series of WR commands is then issued to write each block back to the row buffer 345. The target row holds all of the data once the write-recovery time has elapsed after the last block is transferred over the pins. Such a transfer may be broken up into smaller blocks of data to reduce the buffering in the host, at the cost of separate activations of the source and target rows for each block.
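The command traffic of the conventional transfer described above can be tallied with a short sketch. This is an illustration only: it uses the 16 KB row and 64-byte block sizes from the DDR3 example, ignores all timing constraints, refresh, and any final precharge, and the function name and command strings are hypothetical.

```python
# Illustrative tally of a conventional intra-bank row copy.
# Sizes come from the DDR3 example in the text; everything else is assumed.
ROW_BYTES = 16 * 1024   # data fetched by one row activation
BLOCK_BYTES = 64        # data moved by one RD or WR command


def conventional_copy_commands(row_bytes=ROW_BYTES, block_bytes=BLOCK_BYTES):
    """Return the command sequence for copying one row to another row."""
    blocks = row_bytes // block_bytes
    commands = ["ACT(source)"]                           # open the source row
    commands += [f"RD(col={i})" for i in range(blocks)]  # read every block
    commands += ["PRE", "ACT(target)"]                   # close source, open target
    commands += [f"WR(col={i})" for i in range(blocks)]  # write every block
    return commands


commands = conventional_copy_commands()
# Every RD and WR also moves one block across the external data bus.
data_transfers = sum(c.startswith(("RD", "WR")) for c in commands)
print(len(commands), data_transfers)
```

Under these assumptions a single row copy takes 515 commands, 512 of which move a 64-byte block over the external pins.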
As illustrated, the memory channel 208 remains occupied for the entire duration of the transfer, and the bank cannot service any other requests during that time. Since the compute units do not benefit directly from this transfer, this bandwidth use can be viewed as a waste of bandwidth. This can result in significant system slowdown since memory bandwidth is one of the most important system resources. This reduction in available bandwidth and the consequent contention for the banks arise even if other requests can be interleaved among the read or write requests.
Described herein is a method and apparatus for inter-row data transfer in memory devices without engaging the external I/O pins on the memory device. As described herein below, the master device is relieved from reading and writing each data block, (or word), the memory controller is relieved from performing each micro-operation, and the participation of the I/O pins is not required as the data is not moved out of the memory chip. The method and apparatus minimize overhead on the master devices orchestrating the transfer and streamline participation of the memory controller. There is no engagement of the I/O pins and memory data channel. In addition, there is a relatively quick turnaround with respect to the bank, which reduces the bank conflict overhead.
As illustrated herein below, the method and apparatus lower the absolute latency of the transfer operation since all bits of the transfer are read simultaneously and written simultaneously. This frees up the bank faster. This also frees up the channel bandwidth, which in turn allows data transfer to and from other banks and ranks on the channel while a row transfer (RT) command is executed on a bank as described herein below. Moreover, the extra energy consumption on the I/O pins for the wasteful to-and-fro data transfer is eliminated, extra storage in the master device is eliminated, and use of the address and command bus is minimized.
The method and apparatus may be useful in operations that require copying large amounts of data, (i.e. those that span entire memory rows). These operations may include application level operations, (for example, the duplication of large data structures) and system software operations, (for example, copy-on-write duplication of large memory pages).
In an embodiment, a memory controller is augmented to issue a new command, a row transfer (RT). The RT command requires that a bank identifier, a source row identifier and a target row identifier be specified. In some embodiments, issuing this command may occupy the memory interface for 2 cycles in contrast to the 1 cycle needed to issue conventional memory operations.
A memory device is correspondingly enhanced to perform a set of operations upon receipt of the RT command (405) as shown in
In another embodiment, the RT command permits a subset of a row to be duplicated into the corresponding subset of another row, (instead of duplicating the entire row as in the embodiment described hereinabove). In this embodiment, the RT command includes a specification of which columns within the rows are to be duplicated. For example, this can be specified by a (start, length) tuple at a power-of-2 granularity, (which reduces the encoding overhead), or a more generic encoding enabling more flexible regions to be copied at the expense of more command encoding overhead. In another example, a (start, end) tuple can be used.
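The (start, length) encoding described above can be illustrated with a minimal sketch. It assumes one possible reading of "power-of-2 granularity", namely that start and length are counted in units of a power-of-two number of columns; the function names, the granularity parameter, and the list-based row model are hypothetical.

```python
# Hypothetical sketch of a column mask derived from a (start, length) tuple.
# Assumption: start and length are expressed in units of `granularity`
# columns, and `granularity` is a power of two.

def column_mask(start, length, num_columns, granularity=1):
    """Return one boolean per column, True where the column is to be copied."""
    if granularity <= 0 or granularity & (granularity - 1):
        raise ValueError("granularity must be a power of two")
    first = start * granularity
    last = first + length * granularity
    if start < 0 or length < 0 or last > num_columns:
        raise ValueError("region falls outside the row")
    return [first <= col < last for col in range(num_columns)]


def masked_row_copy(source_row, target_row, mask):
    """Copy only the masked columns of source_row into a copy of target_row."""
    return [s if m else t for s, t, m in zip(source_row, target_row, mask)]


# A 16-column row, region encoded as start=1, length=2 in units of 4 columns,
# so columns 4 through 11 are selected.
mask = column_mask(start=1, length=2, num_columns=16, granularity=4)
result = masked_row_copy(list(range(16)), [-1] * 16, mask)
print(result)
```

A coarser granularity shortens the start and length fields in the command but restricts the regions that can be selected, which is the trade-off against the more generic encoding mentioned above.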
A memory device is correspondingly enhanced to perform a set of operations upon receipt of the RT command (505) as shown in
The above masked inter-row copy operation can be used to accelerate data copies in regions smaller than an entire memory row, (for example, 4 KB OS pages), as long as the copy operation is among aligned regions within the source and destination rows. The alignment may need to be ensured by OS techniques, (for example, on copy-on-write of 4 KB pages, the OS can ensure that both the source and destination pages are aligned within their memory rows and reside within the same memory bank).
In another embodiment, a memory controller may be augmented with a “PRECHARGE TO ADDRESSED ROW” (PREAR) command. The PREAR command has a bank identifier and row address. In some embodiments, the PREAR command may use the same format as an ACT command and therefore use the same command addressing interface. This embodiment eliminates the need for a state machine in the memory device. For example, in the RT embodiment, the memory device needs to maintain a state machine to track the flow of operations shown in
Table 1 shows an example operation to copy a row with the PREAR command. The memory controller issues an ACT command which copies the contents of the source row into the row buffer and destroys the data stored in the source row. The ACT command is followed by issuance of a PREAR command which writes data in the row buffer to the target row. The PREAR command in turn is followed by issuance of a precharge (PRE) command which writes the data in the row buffer back to the source row.
Table 2 shows an example operation to move a row with the PREAR command. The memory controller issues an ACT command which copies the contents of the source row into the row buffer and destroys the data stored in the source row. The ACT command is followed by issuance of a PREAR command which writes the data in the row buffer to the target row.
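The copy and move sequences described for the PREAR command can be modelled with a small behavioral sketch. It is a simplified illustration, not the device logic: a destroyed row is represented by None, the row buffer is assumed to keep its contents after PREAR so that a following PRE can restore the source row, and the class and function names are hypothetical.

```python
# Behavioral sketch of one bank handling ACT, PREAR and PRE.
# Assumptions: ACT is destructive (modelled as None), and PREAR leaves the
# row buffer contents intact.

class Bank:
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.row_buffer = None
        self.open_row = None

    def act(self, row):
        """ACT: move the row into the row buffer, destroying the stored copy."""
        self.row_buffer = list(self.rows[row])
        self.rows[row] = None
        self.open_row = row

    def prear(self, row):
        """PREAR: write the row buffer to the addressed row."""
        self.rows[row] = list(self.row_buffer)

    def pre(self):
        """PRE: write the row buffer back to the row opened by ACT."""
        self.rows[self.open_row] = list(self.row_buffer)
        self.row_buffer = None
        self.open_row = None


def copy_row(bank, source, target):
    bank.act(source)    # source data now lives only in the row buffer
    bank.prear(target)  # target row receives the data
    bank.pre()          # source row is restored


def move_row(bank, source, target):
    bank.act(source)    # source data now lives only in the row buffer
    bank.prear(target)  # target row receives the data; source stays destroyed


copied = Bank([[1, 2, 3], [0, 0, 0]])
copy_row(copied, source=0, target=1)
moved = Bank([[1, 2, 3], [0, 0, 0]])
move_row(moved, source=0, target=1)
print(copied.rows, moved.rows)
```

In this model the only difference between a copy and a move is the trailing PRE command, which mirrors the two sequences described above.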
Compared to the RT command embodiment, the PREAR command may require additional command interface bandwidth and increased complexity in the memory controller sequencing, (i.e. a complicated state machine). The benefit is a simpler memory device with less complex internal sequencing and simpler command processing.
In another embodiment, the PREAR command could be replaced by a sequence of commands consisting of a new command latch addressed row (LAR) and then a PRE command. Such an embodiment requires that the PRE command preserves row buffer contents. The LAR command uses a bank identifier and a target row identifier, and latches the row address of the target row to allow the bank's row decoder to address the intended row.
Table 3 shows an example operation to copy a row with the LAR command. The memory controller issues an ACT command which copies the contents of the source row into the row buffer and destroys the data stored in the source row. The ACT command is followed by issuance of a PRE command which writes data back to the source row. The PRE command in turn is followed by issuance of a LAR command which latches the row address of the target row. The LAR command in turn is followed by a PRE command which writes the data in the row buffer to the target row.
Table 4 shows an example operation to move a row with the LAR command. The memory controller issues an ACT command which copies the contents of the source row into the row buffer and destroys the data stored in the source row. The ACT command is followed by issuance of a LAR command which latches the row address of the target row. The LAR command is followed in turn by a PRE command which writes the data in the row buffer to the target row.
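The LAR sequences can be sketched in the same behavioral style. The model assumes that PRE writes the row buffer to whichever row address is currently latched and preserves the row buffer contents, as this embodiment requires; a destroyed row is again represented by None, and all names are hypothetical.

```python
# Behavioral sketch of one bank handling ACT, LAR and PRE.
# Assumptions: ACT latches the activated row's address, LAR replaces the
# latched address, and PRE writes to the latched row without clearing the
# row buffer.

class Bank:
    def __init__(self, rows):
        self.rows = [list(r) for r in rows]
        self.row_buffer = None
        self.latched_row = None

    def act(self, row):
        """ACT: move the row into the row buffer, destroying the stored copy."""
        self.row_buffer = list(self.rows[row])
        self.rows[row] = None
        self.latched_row = row

    def lar(self, row):
        """LAR: latch a new row address for the row decoder."""
        self.latched_row = row

    def pre(self):
        """PRE: write the row buffer to the currently latched row."""
        self.rows[self.latched_row] = list(self.row_buffer)


def copy_row(bank, source, target):
    bank.act(source)
    bank.pre()        # writes the data back to the source row
    bank.lar(target)
    bank.pre()        # writes the same data to the target row


def move_row(bank, source, target):
    bank.act(source)
    bank.lar(target)
    bank.pre()        # writes the data to the target row only


copied = Bank([[1, 2, 3], [0, 0, 0]])
copy_row(copied, source=0, target=1)
moved = Bank([[1, 2, 3], [0, 0, 0]])
move_row(moved, source=0, target=1)
print(copied.rows, moved.rows)
```

The copy takes four commands here against three for the PREAR sequence, which is consistent with the additional command traffic attributed to this approach.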
This approach may further increase command bandwidth usage, (as compared to the PREAR approach), but may simplify the internal circuitry and timing in the memory device.
In general, in accordance with some embodiments, a method for inter-row data transfer in a memory device is responsive to a row transfer (RT) command which includes a source row identifier and a target row identifier. The RT command may also include a bank identifier. The method includes performing the following actions upon receipt of the RT command. A source row is activated and source row data is stored in a row buffer. In some embodiments, a subset of the row buffer is stored in the target row. For example, the RT command may identify certain columns within the row buffer to be stored in the target row. This may be implemented by using start and length fields or start and end fields. The target row identifier is latched into the memory device and a word line of a target row is activated to prepare for a write operation. This may be done during activation of the source row. The source row data from the row buffer is stored into the target row.
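The whole-row form of the RT method summarized above can be sketched as follows. This is an abstract model under stated assumptions: banks are nested Python lists, the steps run sequentially rather than overlapping, the state of the source cells after the transfer is not modelled, and the RowTransfer type and execute_rt function are hypothetical names.

```python
# Abstract sketch of the steps a memory device performs for an RT command.
# Assumption: a bank is a list of rows and a row is a list of column values.
from dataclasses import dataclass


@dataclass(frozen=True)
class RowTransfer:
    bank: int
    source_row: int
    target_row: int


def execute_rt(banks, cmd):
    """Apply one RT command in place and return the row buffer contents."""
    bank = banks[cmd.bank]
    # Activate the source row and store its data in the row buffer.
    row_buffer = list(bank[cmd.source_row])
    # Latch the target row identifier into the device.
    latched_target = cmd.target_row
    # Activate the target word line and store the row buffer into that row.
    bank[latched_target] = list(row_buffer)
    return row_buffer


banks = [
    [[1, 2, 3, 4], [0, 0, 0, 0]],   # bank 0
    [[9, 9, 9, 9], [8, 8, 8, 8]],   # bank 1
]
execute_rt(banks, RowTransfer(bank=0, source_row=0, target_row=1))
print(banks[0])
```

No value crosses an external interface in this model: the controller supplies only the three identifiers, and the data moves entirely inside the addressed bank.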
In some embodiments, a method for inter-row data transfer in a memory device includes receiving an activation (ACT) command. The method further includes activating a source row and latching the source row data in a row buffer. A precharge to addressed row (PREAR) command is then received which includes a bank identifier and a row address. The source row data from the row buffer is then stored in a target row. A precharge (PRE) command may be received which writes the source row data in the row buffer to the source row.
In some embodiments, a method for inter-row data transfer in a memory device includes receiving an activation (ACT) command. The method further includes activating a source row and latching the source row data in a row buffer. A latch addressed row (LAR) command is then received which includes a bank identifier and a target row identifier. The row address of a target row is latched to allow a row decoder to address the target row. A precharge (PRE) command is then received which writes the source row data in the row buffer to the target row. An additional PRE command may be received which writes the source row data in the row buffer to the source row. In an embodiment, this additional PRE command is received after the ACT command and before the LAR command.
It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements.
The methods provided may be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors may be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer-readable medium). The results of such processing may be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.
The methods or flow charts provided herein, to the extent applicable, may be implemented in a computer program, software, or firmware incorporated in a computer-readable storage medium for execution by a general purpose computer or a processor. Examples of computer-readable storage media include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks and digital versatile disks (DVDs).