1. Field
The present disclosure generally relates to techniques for accessing data in a memory system. More specifically, the present disclosure relates to techniques for accessing data in a memory system that includes independently addressable memory chips.
2. Related Art
In a typical commodity memory system, multiple dynamic-random-access-memory (DRAM) devices are arranged in parallel to provide a fixed-width data interface to a memory controller or a processor. Because of limited pin and routing resources in memory modules, DRAM devices within a given rank are usually accessed in lockstep, using the same address provided on a shared bus. However, this memory-access technique prevents individual addressing of each memory chip in the memory modules, which can reduce the efficiency of memory operations.
Hence, what is needed is a memory-access technique without the problems described above.
One embodiment of the present disclosure provides a computer system for accessing rows and columns in a matrix that is stored in a memory system, which includes a set of independently addressable memory chips. During operation, the computer system receives a row-write request to write to a row N in the matrix. In response to the row-write request, the computer system rotates the row right by N elements, and writes the row in parallel to address N of the memory chips in the memory system. Then, the computer system receives a column-write request to write to column M in the matrix. In response to the column-write request, the computer system rotates the column right by M elements, and writes the column in parallel to the memory chips in the memory system. Note that, during the write operation, a memory chip C in the memory system is assigned address (M+C) mod the number of rows in the matrix.
In some embodiments, the computer system receives a row-read request to read from row N in the matrix. In response to the row-read request, the computer system reads the row in parallel from address N of the memory chips in the memory system, and rotates the row returned by the parallel read operation left by N elements. Moreover, the computer system may receive a column-read request to read column M from the matrix. In response to the column-read request, the computer system may read the column in parallel from the memory chips in the memory system (where, during the read operation, the memory chip C in the memory system is assigned address (M+C) mod the number of rows in the matrix), and may rotate the column returned by the parallel read operation left by M elements.
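To make the mapping above concrete, the following sketch (in Python, purely for illustration) models each memory chip as its own small address space and stores a square matrix whose dimension equals the number of chips. The helper names (write_row, read_column, and so on) are hypothetical, and the column address used on chip C, (C - M) mod the number of rows, is a sign convention chosen so that row and column accesses in this toy model reach the same stored elements; the disclosure states the mapping as (M+C) mod the number of rows.

```python
# Minimal sketch of a rotated ("skewed") matrix layout across independently
# addressable memory chips.  Assumptions for illustration: a square SIZE x SIZE
# matrix, one chip per column position, and element (r, c) stored on chip
# (r + c) % SIZE at address r.

SIZE = 4                                        # rows = columns = number of chips
chips = [[None] * SIZE for _ in range(SIZE)]    # chips[c][a]: chip c, address a

def rotate_right(seq, k):
    """Rotate a list right by k positions (element i moves to (i + k) % len)."""
    k %= len(seq)
    return seq[-k:] + seq[:-k] if k else list(seq)

def rotate_left(seq, k):
    """Rotate a list left by k positions (element i moves to (i - k) % len)."""
    k %= len(seq)
    return seq[k:] + seq[:k]

def write_row(n, row):
    """Rotate row N right by N, then write one element to address N of every chip."""
    for chip, value in enumerate(rotate_right(row, n)):
        chips[chip][n] = value                  # same address N on every chip

def read_row(n):
    """Read address N from every chip in parallel, then rotate left by N."""
    return rotate_left([chips[chip][n] for chip in range(SIZE)], n)

def write_column(m, col):
    """Rotate column M right by M; chip C writes its element at (C - M) % SIZE."""
    for chip, value in enumerate(rotate_right(col, m)):
        chips[chip][(chip - m) % SIZE] = value  # a distinct address on every chip

def read_column(m):
    """Read a distinct address from every chip, then rotate left by M."""
    raw = [chips[chip][(chip - m) % SIZE] for chip in range(SIZE)]
    return rotate_left(raw, m)

# Round trip: store a matrix row by row, then fetch one row and one column.
matrix = [[10 * r + c for c in range(SIZE)] for r in range(SIZE)]
for r, row in enumerate(matrix):
    write_row(r, row)
assert read_row(2) == matrix[2]
assert read_column(1) == [matrix[r][1] for r in range(SIZE)]
```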
Note that the rotating and writing operations may facilitate simultaneously accessing the elements of row N from the memory chips, and the rotating and writing operations may facilitate simultaneously accessing elements of column M from the memory chips.
Moreover, the memory chips may facilitate a configurable width for a memory operation.
Furthermore, the memory chips may be included in a ramp-stack chip package and/or a plank-stack chip package.
Additionally, frames of data stored in the memory chips may include corresponding error-correction information, where a frame has a pre-defined length and a pre-defined width, and the error-correction information facilitates identification and correction of errors in a given frame.
In some embodiments, the computer system writes data associated with a graph to the memory chips so that nodes in the graph are randomly distributed over the memory chips. Moreover, the computer system may access independent pages in the data concurrently on the memory chips.
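As a hedged illustration of such a random distribution (not the disclosure's actual mapping), the sketch below hashes a node identifier to pick a chip; the node_location helper and the per-chip address choice are assumptions made purely for this example.

```python
# Sketch: spread graph nodes (pseudo)randomly over NUM_CHIPS independently
# addressable chips.  The hash-based chip choice and the node_location()
# helper are illustrative assumptions, not the disclosure's mapping.

import hashlib

NUM_CHIPS = 8

def node_location(node_id):
    """Map a node ID to a (chip, local_address) pair."""
    digest = hashlib.blake2b(node_id.to_bytes(8, "little"), digest_size=4).digest()
    chip = int.from_bytes(digest, "little") % NUM_CHIPS
    # Using the node ID itself as the per-chip address is sparse but keeps the
    # sketch collision-free; a real allocator would pack nodes densely.
    return chip, node_id

# Nodes whose IDs hash to different chips can have their pages opened
# concurrently, one independent page per chip.
print([node_location(n) for n in range(8)])
```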
Another embodiment provides a method that includes at least some of the operations performed by the computer system.
Another embodiment provides a computer-program product for use with the computer system. This computer-program product includes instructions for at least some of the operations performed by the computer system.
Another embodiment provides an integrated circuit (such as a processor or a memory controller) that performs at least some of the operations performed by the computer system.
Another embodiment provides a computer system or a memory system that includes the integrated circuit.
Table 1 provides pseudocode for rotating elements in a matrix in accordance with an embodiment of the present disclosure.
Note that like reference numerals refer to corresponding parts throughout the drawings. Moreover, multiple instances of the same part are designated by a common prefix separated from an instance number by a dash.
Embodiments of a computer system, a computer-program product, an integrated circuit (such as a processor or a memory controller), a system that includes the integrated circuit, and methods for accessing rows and columns in a memory system are described. This memory system includes memory chips that can be individually addressed and accessed (for example, the memory chips may be included in a ramp-stack chip package and/or a plank-stack chip package). In order to leverage this capability, prior to performing a row-write or column-write request on the memory system, the computer system may transform the rows and the columns in a matrix. In particular, in response to receiving a row-write request to write to a row N in the matrix, the computer system rotates the row right by N elements, and writes the row in parallel to address N of the memory chips in the memory system. Similarly, in response to receiving a column-write request to write to column M in the matrix, the computer system rotates the column right by M elements, and writes the column in parallel to the memory chips in the memory system. Note that, during the write operation, a memory chip C in the memory system is assigned address (M+C) mod the number of rows in the matrix.
This memory-access technique may allow the elements of a given row or a given column of the matrix to be spread across different memory chips. In addition, the data may be mapped so that a given page is randomly distributed over the memory chips. In these ways, the memory chips can be independently and simultaneously accessed, which may facilitate a configurable width for the row-write operation and/or the column-write operation. Furthermore, because of this capability, the computer system may store frames of data in the memory chips with corresponding error-correction information so that errors in a given frame can be identified and corrected. These memory-access and storage techniques may allow the memory system to be used efficiently and/or may facilitate new memory operations.
We now describe embodiments of the memory system and the memory-access and storage techniques.
Activities of slave memory controllers 112 may be coordinated by an optional master memory controller 114 (which is sometimes referred to as an ‘integrated circuit’ in the discussion that follows). In particular, control logic 116 in optional master memory controller 114 may coordinate the activities of slave memory controllers 112 based on the desired access mode. Optional master memory controller 114 may also aggregate data returned from memory chips 110 on read operations and may distribute data to be written to memory chips 110 on write operations.
Alternatively or additionally, activities of slave memory controllers 112 may be coordinated and data to and from memory chips 110 may be aggregated, at least in part, by processor 118 (which is also sometimes referred to as the ‘integrated circuit’ in the discussion that follows). In particular, execution mechanism 120 in processor 118 may execute instructions associated with memory operations. In some embodiments, slave memory controllers 112 can be operated independently of optional master memory controller 114, which either may not be included in memory system 100 or may operate as a passthrough.
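A minimal sketch of this division of labor follows. The SlaveController and MasterController classes and their methods are hypothetical stand-ins, used only to show the master issuing an independent address to each slave and aggregating the returned data (here for a column access, using the same illustrative address convention as the earlier sketch).

```python
# Hedged sketch: a master controller fans one logical access out to per-chip
# slave controllers, each with its own address, and aggregates the results.
# Class and method names are illustrative, not part of the disclosure.

class SlaveController:
    """Fronts a single memory chip, modeled as an address -> data mapping."""
    def __init__(self, storage):
        self.storage = storage

    def read(self, address):
        return self.storage.get(address)

class MasterController:
    """Coordinates the slaves and undoes the layout rotation on the way back."""
    def __init__(self, slaves):
        self.slaves = slaves

    def read_column(self, m, num_rows):
        # Independent address per chip, so all chips are accessed in parallel.
        raw = [slave.read((chip - m) % num_rows)
               for chip, slave in enumerate(self.slaves)]
        m %= len(self.slaves)
        return raw[m:] + raw[:m]                # rotate left by M to restore order
```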
The memory operations, such as those in the memory-access and the storage techniques described below with reference to
As noted previously, memory system 100 may facilitate new memory operations, such as row/column matrix access. In general, applications such as matrix multiplication or database reads may benefit from the ability to efficiently read either a row or a column from a matrix (or a table). Because of address limitations, data in a traditional memory system can typically be organized to optimize for either row access or column access, but not both. In contrast, in a stacked memory system with individually addressable memory chips 110, such as memory system 100, data can be organized so that both rows and columns can be accessed efficiently.
As an illustration, consider storage of a matrix.
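For example, under the illustrative convention used in the sketch above (element (r, c) stored on chip (r + c) mod 4 at address r, an assumption of this example rather than the disclosure's exact formula), a 4x4 matrix with elements a(r,c) would be laid out across four chips as follows:

    Address 0:  chip 0: a(0,0)  chip 1: a(0,1)  chip 2: a(0,2)  chip 3: a(0,3)
    Address 1:  chip 0: a(1,3)  chip 1: a(1,0)  chip 2: a(1,1)  chip 3: a(1,2)
    Address 2:  chip 0: a(2,2)  chip 1: a(2,3)  chip 2: a(2,0)  chip 3: a(2,1)
    Address 3:  chip 0: a(3,1)  chip 1: a(3,2)  chip 2: a(3,3)  chip 3: a(3,0)

Any row occupies a single address on all four chips, and any column touches each chip exactly once at four distinct addresses (for example, column 1 appears at chip 1/address 0, chip 2/address 1, chip 3/address 2, and chip 0/address 3), so either can be transferred in one parallel access.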
In an alternative approach, the software executed by processor 118 (
Pseudocode illustrating the rotation of the rows and columns of the matrix prior to performing a row- or column-write request (or a row- or column-read request) on memory chips 110 is shown in Table 1.
Moreover, reading column M from the memory system may require processor 118 (
In some embodiments, the computer system optionally receives a row-read request to read from row N in the matrix (operation 518). In response to the row-read request, the computer system optionally reads the row in parallel from address N of the memory chips in the memory system, and optionally rotates the row returned by the parallel read operation left by N elements (operation 520). Moreover, the computer system may optionally receive a column-read request to read column M from the matrix (operation 522). In response to the column-read request, the computer system may optionally read the column in parallel from the memory chips in the memory system (where, during the read operation, the memory chip C in the memory system is assigned address (M+C) mod the number of rows in the matrix), and may optionally rotate the column returned by the parallel read operation left by M elements (operation 524).
Note that the rotating and writing operations may facilitate simultaneously accessing the elements of row N from the memory chips, and the rotating and writing operations may facilitate simultaneously accessing elements of column M from the memory chips.
Moreover, the memory chips may facilitate a configurable width for a memory operation.
Furthermore, the memory chips may be included in a ramp-stack chip package and/or a plank-stack chip package.
The ability to independently and simultaneously access data on different memory chips in the memory system also has consequences for error-correcting codes (ECC). Many enterprise applications require the use of ECC to reduce the failure rate of software due to memory read errors. Typically, this is implemented with an extra chip on the memory data bus that holds error-correction information obtained by passing the data words through a single-error-correcting, double-error-detecting (SECDED) code generator. When data is read, it is passed through a SECDED decoder to check whether it matches the original ECC information. This technique works well for typical memory systems because the memory system is always accessed the same way. However, for a configurable-width, individually addressable memory system such as memory system 100, the data is not always accessed the same way, so a different approach to the error-correction information may be needed.
One approach is to use an extra chip in the memory system to store ECC information for both rows and columns, i.e., for the two access modes. This approach may be sub-optimal, however, because redundant information is stored in the ECC chip. In particular, any given element in the data may be represented both in the ECC information for a row read and in the ECC information for a column read. This approach may also be challenging because any modification of the data requires that the ECC information for both the row access and the column access be re-computed. Thus, if a row is read and modified, the corresponding column may also have to be read so that its ECC information can be re-computed before the data is written back. Consequently, any write becomes a read-modify-write operation.
Another technique takes advantage of the inherent burst length of memory components. In DRAM, a read of a particular address usually results in an 8-cycle burst of data starting from that address. Moreover, a typical DRAM component is 8 bits wide (×8), so the burst results in a 64-bit (8-byte) transfer (which provides an illustration of a ‘frame’ having a pre-defined length and a pre-defined width). The last 8-bit word of the burst can be used as the ECC word. Because all transfers have this minimum granularity, the ECC information can always be computed and checked, regardless of the access mode. However, note that the 8 bits used for ECC in this example are less efficient than usual: 8 bits of ECC information typically protect 64 bits of data, whereas in this case an 8-bit ECC word protects only 56 bits of data. Consequently, this frame size is only an illustration of the storage technique; frames with different lengths and widths (and, thus, different ECC overhead) may be used.
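To illustrate the frame layout only, the sketch below packs seven data bytes and one check byte into each 8-byte burst. The check byte here is a plain XOR of the data bytes, an assumption made for brevity that can detect but not correct errors; an actual design would generate it with a SECDED encoder, as described above.

```python
# Toy sketch of the frame layout described above: each 8-byte burst carries
# 7 data bytes plus 1 check byte.  The check byte is a plain XOR of the data
# bytes (a placeholder), which only detects errors; a real design would use a
# SECDED code so that single-bit errors can also be corrected.

FRAME_BYTES = 8          # one DRAM burst: 8 transfers from an 8-bit-wide device
DATA_BYTES = FRAME_BYTES - 1

def encode_frame(data: bytes) -> bytes:
    """Append a check byte to 7 data bytes, forming one 8-byte frame."""
    assert len(data) == DATA_BYTES
    check = 0
    for b in data:
        check ^= b
    return data + bytes([check])

def check_frame(frame: bytes) -> bytes:
    """Verify the check byte and return the 7 data bytes."""
    assert len(frame) == FRAME_BYTES
    data, check = frame[:DATA_BYTES], frame[DATA_BYTES]
    recomputed = 0
    for b in data:
        recomputed ^= b
    if recomputed != check:
        raise ValueError("frame check failed")   # a SECDED code could also correct
    return data

frame = encode_frame(b"payload")                 # exactly 7 bytes of data
assert check_frame(frame) == b"payload"
```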
Another use of independently addressable memory chips is parallel graph traversal. Traversing a graph usually involves a significant amount of pointer-chasing, in which small reads are made from random memory locations. These reads are typically inefficient on traditional memory systems: the pointer being read generally fits within one page, yet a page is opened on each of the eight memory chips in the rank, so only one of the eight opened pages supplies useful data, a read efficiency of 12.5%. The configurable-width or individually addressable memory system shown in
Consider the exemplary graphs in
Because memory chips 110 (
Because the parallelism in this storage technique is limited by the number of independent memory chips, the graph-traversal technique may choose which nodes to process first. In this example, the pointers from node 1 may be processed first. The computer system may feed pointer 5 to memory chip 1, pointer 6 to memory chip 2, pointer 7 to memory chip 3, and pointer 8 to memory chip 0. These nodes are all accessed simultaneously, and it is determined that they are leaves, so no additional processing is necessary. Next, the graph-traversal technique returns to processing node 4 and sends pointer 9 to memory chip 1, returning the leaf node 9. Note that the page hit rate in this example is 100%, which means that every page opened in the memory system has data used from it. Thus, the individually addressable memory chips allow finer memory-access granularity, which makes applications that access small blocks of data more efficient.
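The scheduling decision described above can be sketched as follows. The greedy batching, the schedule_reads helper, and the chip mapping node mod 4 (which happens to reproduce the chip assignments in this example) are assumptions of the sketch, not the disclosure's scheduler.

```python
# Hedged sketch of scheduling pointer-chasing reads so that each round issues
# at most one access per chip.  The chip mapping (node_id % NUM_CHIPS) matches
# the worked example above; the disclosure's actual placement may instead be a
# random/hash-based distribution.

from collections import deque

NUM_CHIPS = 4

def schedule_reads(pending_node_ids):
    """Greedily batch node reads into rounds with one access per chip per round."""
    queue = deque(pending_node_ids)
    rounds = []
    while queue:
        busy_chips = set()
        this_round = []
        deferred = deque()
        while queue:
            node = queue.popleft()
            chip = node % NUM_CHIPS
            if chip in busy_chips:
                deferred.append(node)        # chip already used this round
            else:
                busy_chips.add(chip)
                this_round.append((chip, node))
        rounds.append(this_round)
        queue = deferred
    return rounds

# Pointers from nodes 1 and 4 in the worked example: 5, 6, 7, 8 and 9.
print(schedule_reads([5, 6, 7, 8, 9]))
# Round 1: chips 1, 2, 3, 0 serve nodes 5-8; round 2: chip 1 serves node 9.
```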
In some embodiments of methods 500 (
We now describe embodiments of the computer system.
Memory 1124 in computer system 1100 may include volatile memory and/or non-volatile memory. More specifically, memory 1124 may include: ROM, RAM, EPROM, EEPROM, flash memory, one or more smart cards, one or more magnetic disc storage devices, and/or one or more optical storage devices. Memory 1124 may store an operating system 1126 that includes procedures (or a set of instructions) for handling various basic system services for performing hardware-dependent tasks. Memory 1124 may also store procedures (or a set of instructions) in a communication module 1128. These communication procedures may be used for communicating with one or more computers and/or servers, including computers and/or servers that are remotely located with respect to computer system 1100.
Memory 1124 may also include multiple program modules (or sets of instructions), including: storage module 1130 (or a set of instructions). Note that one or more of these program modules (or sets of instructions) may constitute a computer-program mechanism.
During the accessing and/or the storage techniques, storage module 1130 may perform at least some of the operations in methods 500 (
Instructions in the various modules in memory 1124 may be implemented in: a high-level procedural language, an object-oriented programming language, and/or in an assembly or machine language. Note that the programming language may be compiled or interpreted, e.g., configurable or configured, to be executed by the one or more processors 1110.
Although computer system 1100 is illustrated as having a number of discrete items,
Components in computer system 1100 may be coupled by signal lines, links or buses. These connections may include electrical, optical, or electro-optical communication of signals and/or data. Furthermore, in the preceding embodiments, some components are shown directly connected to one another, while others are shown connected via intermediate components. In each instance, the method of interconnection, or ‘coupling,’ establishes some desired communication between two or more circuit nodes, or terminals. Such coupling may often be accomplished using a number of circuit configurations, as will be understood by those of skill in the art; for example, AC coupling and/or DC coupling may be used.
In some embodiments, functionality in these circuits, components and devices may be implemented in one or more application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), and/or digital signal processors (DSPs). Furthermore, functionality in the preceding embodiments may be implemented more in hardware and less in software, or less in hardware and more in software, as is known in the art. In general, the computer system may be at one location or may be distributed over multiple, geographically dispersed locations.
Note that computer system 1100 may include: a VLSI circuit, a switch, a hub, a bridge, a router, a communication system (such as a WDM communication system), a storage area network, a data center, a network (such as a local area network), and/or a computer system (such as a multiple-core processor computer system). Furthermore, the computer system may include, but is not limited to: a server (such as a multi-socket, multi-rack server), a laptop computer, a communication device or system, a personal computer, a work station, a mainframe computer, a blade, an enterprise computer, a data center, a tablet computer, a supercomputer, a network-attached-storage (NAS) system, a storage-area-network (SAN) system, a media player (such as an MP3 player), an appliance, a subnotebook/netbook, a smartphone, a cellular telephone, a network appliance, a set-top box, a personal digital assistant (PDA), a toy, a controller, a digital signal processor, a game console, a device controller, a computational engine within an appliance, a consumer-electronic device, a portable computing device or a portable electronic device, a personal organizer, and/or another electronic device.
Furthermore, the embodiments of the integrated circuit, the memory system and/or the computer system may include fewer components or additional components. Although these embodiments are illustrated as having a number of discrete items, the preceding embodiments are intended to be functional descriptions of the various features that may be present rather than structural schematics of the embodiments described herein. Consequently, in these embodiments two or more components may be combined into a single component, and/or a position of one or more components may be changed. In addition, functionality in the preceding embodiments of the integrated circuit, the memory system and/or the computer system may be implemented more in hardware and less in software, or less in hardware and more in software, as is known in the art.
An output of a process for designing an integrated circuit, or a portion of an integrated circuit, comprising one or more of the circuits described herein may be a computer-readable medium such as, for example, a magnetic tape or an optical or magnetic disk. The computer-readable medium may be encoded with data structures or other information describing circuitry that may be physically instantiated as an integrated circuit or portion of an integrated circuit. Although various formats may be used for such encoding, these data structures are commonly written in: Caltech Intermediate Format (CIF), Calma GDS II Stream Format (GDSII) or Electronic Design Interchange Format (EDIF). Those of skill in the art of integrated circuit design can develop such data structures from schematics of the type detailed above and the corresponding descriptions and encode the data structures on a computer-readable medium. Those of skill in the art of integrated circuit fabrication can use such encoded data to fabricate integrated circuits comprising one or more of the circuits described herein.
In the preceding description, we refer to ‘some embodiments.’ Note that ‘some embodiments’ describes a subset of all of the possible embodiments, but does not always specify the same subset of embodiments.
The foregoing description is intended to enable any person skilled in the art to make and use the disclosure, and is provided in the context of a particular application and its requirements. Moreover, the foregoing descriptions of embodiments of the present disclosure have been presented for purposes of illustration and description only. They are not intended to be exhaustive or to limit the present disclosure to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Additionally, the discussion of the preceding embodiments is not intended to limit the present disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.