The invention relates to a memory subsystem and, in particular, to merging local data onto a bus which contains data from other sources.
Computer memory subsystems have evolved over the years, but continue to retain many consistent attributes. Computer memory subsystems from the early 1980's, such as the one disclosed in U.S. Pat. No. 4,475,194 to LaVallee et al., of common assignment herewith, included a memory controller, a memory assembly (contemporarily called a basic storage module (BSM) by the inventors) with array devices, buffers, terminators and ancillary timing and control functions, as well as several point-to-point busses to permit each memory assembly to communicate with the memory controller via its own point-to-point address and data bus.
As shown in
One drawback to the use of a daisy chain bus is associated with the capturing and repowering of the signals between the memory modules. In daisy chained memory module structures, the latency of data transmission as it travels between the cascaded memory modules and back to the memory controller is critical to performance. Currently, the merging of local data from a memory module with data from other sources already on the memory bus delays the re-drive of data being transferred on the bus.
Exemplary embodiments of the present invention include a method for re-driving data in a memory subsystem. The method includes receiving controller interface signals and a forwarded interface clock associated with the controller interface signals at a memory module. The memory module is part of a cascaded interconnect system. The controller interface signals are sampled with the forwarded interface clock and the sampling results in the controller interface signals being latched into interface latches. The controller interface signals are then latched into local latches using a local clock on the memory module. The contents of the local latches along with the local clock are transmitted to another memory module or to the controller in the cascaded interconnect system.
Additional exemplary embodiments include a cascaded interconnect system. The system includes a memory controller, a memory bus and one or more memory modules. The memory controller and the memory modules are interconnected by a packetized multi-transfer interface via the memory bus. Each memory module includes interface latches, local latches and a local clock. Each memory module also includes instructions for receiving controller interface signals and a forwarded interface clock associated with the controller interface signals via the memory bus. Instructions are also included for sampling the controller interface signals with the forwarded interface clock, with the sampling resulting in the controller interface signals being latched into the interface latches. Further instructions are included for latching the controller interface signals into the local latches using the local clock and transmitting, via the memory bus, the contents of the local latches along with the local clock to another memory module or to the controller.
Further exemplary embodiments include a storage medium for re-driving data in a memory subsystem. The storage medium is encoded with machine readable computer program code for causing a computer to implement a method. The method includes receiving controller interface signals and a forwarded interface clock associated with the controller interface signals at a memory module. The memory module is part of a cascaded interconnect system. The controller interface signals are sampled with the forwarded interface clock and the sampling results in the controller interface signals being latched into interface latches. The controller interface signals are then latched into local latches using a local clock on the memory module. The contents of the local latches along with the local clock are transmitted to another memory module or to the controller in the cascaded interconnect system.
A further embodiment includes a dual inline memory module (DIMM) including a card and a plurality of individual local memory devices attached to the card. The card has a length of about 151.2 to about 151.5 millimeters and a key. A buffer device is attached to the card, with the buffer device configured for converting a packetized memory interface. The card includes at least 276 pins configured thereon with power pins and ground pins spanning the key.
Referring now to the drawings wherein like elements are numbered alike in the several FIGURES:
Exemplary embodiments of the present invention provide circuits and methods of transmitting information in a cascaded memory module structure with high bandwidth and low latency. Memory read operations that access memory modules further away, or downstream, from the memory controller take longer to return data than operations accessing memory modules nearer to the controller. Command information sent from the controller must be captured and re-powered by each of the cascaded memory modules in the channel (or memory bus) as it makes its way to the selected memory module. Further, the returned read data must also travel to and through each of the cascaded memory modules. For this reason, the latency through each individual cascaded controller interface connection is an important component of the overall system performance. A further requirement of the controller interface in a buffered memory module system is that it be capable of merging locally obtained read data into the read data that is received from the downstream memory modules. Merging the downstream read data with the locally obtained read data with a minimum amount of added latency presents a difficult timing problem to solve.
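The dependence of read latency on a module's position in the cascade can be illustrated with a minimal sketch. The function name, the parameters and the linear model are assumptions made for illustration; the description gives no delay figures, so the values in the usage note are arbitrary units.

```python
def read_round_trip(position: int, redrive_delay: float, access_time: float) -> float:
    """Estimate read latency for the module at `position` (1 = nearest the controller).

    Assumed model: the command is re-driven by each module ahead of the
    target on the way downstream, and the read data is re-driven by the
    same modules on the way back upstream.
    """
    if position < 1:
        raise ValueError("position must be 1 or greater")
    hops = position - 1  # modules that must capture and re-power the signals
    return 2 * hops * redrive_delay + access_time
```

Under this model every unit of latency removed from a single re-drive stage is recovered twice per intervening module, which is why the per-stage latency matters most for the modules farthest from the controller.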
Exemplary embodiments of the present invention include a memory subsystem using cascaded and fully buffered memory modules with controller interfaces to a controller and/or to other memory modules. The memory modules are connected by unidirectional, high speed signaling links referred to as memory busses. Forwarded clocks are utilized to re-drive data along the high speed memory busses. Forwarded clock refers to the data being received on the high speed links (e.g., the memory data busses) along with a clock for sampling the incoming data. The data from the high speed links is then put into the local clock domain and re-driven on the high speed links (upstream or downstream) with the local clock as the forwarded clock for the re-drive. Exemplary embodiments of the controller interfaces utilize single ended, or differential, high speed signaling, forwarded clocks, an elastic interface data capture macro, a phase locked loop (PLL) used to create local clock domains from the forwarded clock, as well as a double data rate (DDR) re-powered signal generator tightly coupled to the incoming data capture circuits. This combination produces an interface with a relatively high bandwidth per pin along with low latency per cascaded memory module.
Although point-to-point interconnects permit higher data rates, overall memory subsystem efficiency must be achieved by maintaining a reasonable number of memory modules 806 and memory devices per channel (historically four memory modules with four to thirty-six chips per memory module, but as high as eight memory modules per channel and as few as one memory module per channel). Using a point-to-point bus necessitates a bus re-drive function on each memory module to permit memory modules to be cascaded such that each memory module is interconnected to other memory modules, as well as to the memory controller 802.
An exemplary embodiment of the present invention includes two uni-directional busses between the memory controller 802 and memory module 806a (“DIMM #1”), as well as between each successive memory module 806b-d (“DIMM #2”, “DIMM #3” and “DIMM #4”) in the cascaded memory structure. The downstream memory bus 904 is comprised of twenty-two single-ended signals and a differential clock pair. The downstream memory bus 904 is used to transfer address, control, write data and bus-level error code correction (ECC) bits downstream from the memory controller 802, over several clock cycles, to one or more of the memory modules 806 installed on the cascaded memory channel. The upstream memory bus 902 is comprised of twenty-three single-ended signals and a differential clock pair, and is used to transfer read data and bus-level ECC bits upstream from the sourcing memory module 806 to the memory controller 802. Because the upstream memory bus 902 and the downstream memory bus 904 are unidirectional and operate independently, read data, write data and memory commands may be transmitted simultaneously. This increases effective memory subsystem bandwidth and may result in higher system performance. Using this memory structure, and a four to one data rate multiplier between the DRAM data rate (e.g., 400 to 800 Mb/s per pin) and the unidirectional memory bus data rate (e.g., 1.6 to 3.2 Gb/s per pin), the memory controller 802 signal pincount, per memory channel, is reduced from approximately one hundred and twenty pins to about fifty pins.
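The pin counts and data rates quoted above follow from simple arithmetic, sketched below. The constant and function names are illustrative only; the numeric values are the ones given in the description.

```python
# Pin-count and data-rate arithmetic using the figures quoted in the text.
DOWNSTREAM_SIGNALS = 22   # single-ended signals on the downstream bus
UPSTREAM_SIGNALS = 23     # single-ended signals on the upstream bus
CLOCK_PAIR_PINS = 2       # one differential clock pair per bus
RATE_MULTIPLIER = 4       # memory bus data rate : DRAM data rate


def bus_pins(single_ended_signals: int) -> int:
    """Pins for one unidirectional bus: data signals plus the clock pair."""
    return single_ended_signals + CLOCK_PAIR_PINS


def bus_rate_mbps(dram_rate_mbps: int) -> int:
    """Per-pin bus data rate for a given per-pin DRAM data rate."""
    return dram_rate_mbps * RATE_MULTIPLIER


downstream_pins = bus_pins(DOWNSTREAM_SIGNALS)   # 24 pins
upstream_pins = bus_pins(UPSTREAM_SIGNALS)       # 25 pins
channel_pins = downstream_pins + upstream_pins   # 49 pins, i.e. "about fifty"
```

With DRAM rates of 400 and 800 Mb/s per pin, `bus_rate_mbps` gives 1600 and 3200 Mb/s, matching the 1.6 to 3.2 Gb/s range stated for the unidirectional busses.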
The memory controller 802 interfaces to the memory modules 806 via a pair of high speed busses (or channels). The downstream memory bus 904 (outbound from the memory controller 802) interface has twenty-four pins and the upstream memory bus 902 (inbound to the memory controller 802) interface has twenty-five pins. The high speed channels each include a clock pair (differential), a spare bit lane, ECC syndrome bits and the remainder of the bits pass information (based on the operation underway). Due to the cascaded memory structure, all nets are point-to-point, allowing reliable high-speed communication that is independent of the number of memory modules 806 installed. Whenever a memory module 806 receives a packet on either bus, it re-synchronizes the command to the internal clock and re-drives the command to the next memory module 806 in the chain (if one exists).
As described previously, the memory controller 802 interfaces to the memory module 806 via a pair of high speed channels (i.e., the downstream memory bus 904 and the upstream memory bus 902). The downstream (outbound from the memory controller 802) interface has twenty-four pins and the upstream (inbound to the memory controller 802) has twenty-five pins. The high speed channels each consist of a clock pair (differential), as well as single ended signals. Due to the cascade memory structure, all nets are point to point, allowing reliable high-speed communication that is independent of the number of memory modules 806 installed. The differential clock received from the downstream interface is used as the reference clock for the buffer device PLL and is therefore the source of all local buffer device 1002 clocks. Whenever the memory module 806 receives a packet on either bus, it re-synchronizes it to the local clock and drives it to the next memory module 806 or memory controller 802, in the chain (if one exists).
The buffer device 1002 also includes a downstream to upstream functional block 1112 with data receivers and a clock receiver. Input to the downstream to upstream functional block 1112 includes controller interface signals 1124 and a controller interface bus clock 1108 via the upstream memory bus 902. Output from the downstream to upstream functional block 1112 includes a controller interface upstream clock signal 1104 and controller interface upstream data signals 1120 being sent via the upstream memory bus 902. The controller interface upstream data signals 1120 include any locally merged read data from the buffer device 1002.
Also included in the buffer device 1002 is a local clock functional block 1114, including delay reference/feedback, PLL and local distribution. The buffer device 1002 further includes a core logic functional block 1116 (containing the memory interface, etc.) which is driven off of the local clock.
The buffer device 1002 depicted in
The downstream IO clock domain runs off of the controller interface bus clock 1102 (i.e., the forwarded interface clock) from the memory controller 802. The controller interface bus clock 1102 is utilized to latch the controller interface signals 1118 into interface latches in the buffer device 1002. The data is latched into latches by the IO sampler portion of the upstream to downstream functional block 1110 in conjunction with signals from the IO clock distribution block 1128. The FIFO portion of the upstream to downstream functional block 1110 allows the transfer of the latched signals into the local clock domain. The IO clock distribution block 1128 in the upstream to downstream functional block 1110 samples the high speed interface from the memory controller 802. The IO clock distribution block 1128 takes the received controller interface bus clock 1102 (it may condition it to read more reliably) and delivers it to the latches in the IO sampler and FIFO portion of the upstream to downstream functional block 1110. Another function of the controller interface bus clock 1102 is that it is input into the local clock functional block 1114 in the local clock domain.
The local clock domain receives its reference oscillator from the controller interface bus clock 1102 which is input to the local clock functional block 1114. In the local clock functional block 1114, the IO clock may be modified by optional offsetting delay adjustments, passed through a PLL and then distributed as the local clock to all areas of the buffer device 1002 (e.g., to the local latch in the upstream to downstream functional block 1110 and the core logic functional block 1116) via the local distribution logic. The local clock arrives at the other areas of the buffer device 1002 at the offset delay time due to the feedback and circuits of the PLL. As is known in the art, the PLL is utilized, among other things, to remove the time difference between the controller interface bus clock 1102 and the local clock by using a feedback path from the local distribution to the delay reference block in the local clock functional block 1114. As a result, the local clock is nominally in phase with the controller interface bus clock 1102 but offset by a deterministic amount of delay.
As described previously, the received controller interface bus clock 1102 (i.e., the forwarded clock) is distributed to the “IO sampler” and to the FIFO block in the upstream to downstream functional block 1110. The controller interface signals 1118 are captured there and transferred into the local clock domain in the “spare and local data multiplexor” and “local latch” blocks in the upstream to downstream functional block 1110. Because all signals are transferred into the local clock domain, they can be easily merged with local data sources from the core logic functional block 1116 (e.g., local memory read data). The “DDR generator” block in the upstream to downstream functional block 1110 performs the merge function and then generates DDR signals to be driven out on the controller interface outputs. Both a controller interface downstream clock signal 1106 and controller interface downstream data signals 1122 are transmitted to the next memory module 806 (if any) in the cascaded memory subsystem. The same process, with the exception that the local clock is not driven by the IO clock distribution block 1126 in the downstream to upstream functional block 1112, is performed for data being received by the upstream memory bus 902 (i.e., controller interface signals 1124 and controller interface bus clock 1108).
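The capture and re-drive sequence described above can be sketched as a small behavioral model. This is a simplified software illustration and not a circuit description: the class `RedrivePath`, its method names and the use of integers for bus values are assumptions, and analog details such as the PLL, delay adjustment and DDR generation are not modeled.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Tuple


@dataclass
class RedrivePath:
    """Behavioral model of one re-drive direction in a buffer device."""
    name: str
    fifo: Deque[int] = field(default_factory=deque)  # crosses the clock domains
    interface_latch: Optional[int] = None            # forwarded-clock domain
    local_latch: Optional[int] = None                # local-clock domain

    def forwarded_clock_edge(self, signals: int) -> None:
        """Sample the incoming signals with the forwarded interface clock."""
        self.interface_latch = signals
        self.fifo.append(signals)

    def local_clock_edge(self) -> Optional[Tuple[int, str]]:
        """Move one entry into the local latch and re-drive it.

        Returns the re-driven value together with the name of the clock
        forwarded alongside it, or None when nothing is waiting.
        """
        if not self.fifo:
            return None
        self.local_latch = self.fifo.popleft()
        return self.local_latch, f"{self.name}.local_clock"
```

Chaining two instances shows the key property: each stage forwards its own local clock with the data, so the next stage always samples against a clock regenerated by the previous module.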
Because all driven signals (i.e., the controller interface downstream data signals 1122 and the controller interface upstream data signals 1120) are launched from latches in the local clock domain, their clocks have been cleaned up by the PLL in the buffer device 1002. This allows high bandwidth signaling by preventing accumulated noise effects, such as duty cycle distortion and jitter, from building up on the cascaded controller interfaces. Forwarded clocks allow high speed operation with a simple clock recovery mechanism.
Local data merging is accomplished by selecting between controller interface signals 1124 that are ready to be captured in the local clock domain and local data from the core logic functional block 1116. The selection is possible because both the data that came in on the controller interface signals 1124 and the local data are in the local clock domain (i.e., share the same local clock). To minimize latency, local data is given priority at the selector (i.e., the DDR generator in the downstream to upstream functional block 1112). Any non-local data arriving during a cycle in which local data is being driven will be lost. Collisions at the multiplexor are managed by the system memory controller 802 which schedules read data operations to avoid such conflicts. Except for the small gate delay added by the local data multiplexor, the data merging is performed without delaying the re-drive of data being transferred on the bus.
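The selection described above can be sketched as a two-input multiplexor in which local data takes priority. The function name and the use of `None` to mean "no data this cycle" are assumptions made for illustration.

```python
from typing import Optional, Tuple


def merge_select(bus_data: Optional[int],
                 local_data: Optional[int]) -> Tuple[Optional[int], bool]:
    """Choose what to re-drive in one local-clock cycle.

    Both inputs are assumed to be in the local clock domain already.
    Returns (value to drive, collision flag). The flag is True when bus
    data arrived in a cycle in which local data was driven, in which case
    the bus data is lost.
    """
    if local_data is not None:
        return local_data, bus_data is not None
    return bus_data, False
```

In normal operation the collision flag would never be raised, because the memory controller schedules read operations so that local and non-local data do not reach the selector in the same cycle.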
The memory module 806 depicted in
Exemplary embodiments of the present invention provide the ability for a buffer device on a memory module 806 to merge local data from memory devices on the memory module 806 onto a data bus in a cascaded memory subsystem. The data bus may already contain data from memory devices that are not located on the current memory module 806 for merging with the local data. The merging is performed without delaying the re-drive of data being transferred on the bus. In addition, exemplary embodiments of the present invention sample incoming data with a forwarded bus clock associated with the incoming data and then move the incoming data into a local clock domain. The incoming data is then transmitted to the next memory module in the chain in response to the local clock and the local clock is transmitted as the forward bus clock along with the data. In this manner the clock signals are corrected (e.g., for jitter and duty cycle distortion) between each transmission and may result in better clock signals.
Exemplary embodiments of the present invention include a flexible, high speed and high reliability memory system architecture and interconnect structure that includes a single-ended point-to-point interconnection between any two high speed communication interfaces. The memory subsystem may be implemented in one of several structures, depending on desired attributes such as reliability, performance, density, space, cost, component re-use and other elements. A bus-to-bus converter chip enables this flexibility through the inclusion of multiple, selectable memory interface modes. This maximizes the flexibility of the system designers in defining optimal solutions for each installation, while minimizing product development costs and maximizing economies of scale through the use of a common device. In addition, exemplary embodiments of the present invention provide a migration path that allows an installation to implement a mix of buffered memory modules and unbuffered and/or registered memory modules from a common buffer device.
Memory subsystems may utilize a buffer device to support buffered memory modules (directly connected to a memory controller via a packetized, multi-transfer interface with enhanced reliability features) and/or existing unbuffered or registered memory modules (in conjunction with the identical buffer device, on an equivalent bus, but programmed to operate in a manner consistent with the memory interface defined for those module types). A memory subsystem may communicate with buffered memory modules at one speed and with unbuffered and registered memory modules at another speed (typically a slower speed). Many attributes associated with the buffered module structure are maintained, including the enhanced high speed bus error detection and correction features and the memory cascade function. However, overall performance may be reduced when communicating with most registered and unbuffered DIMMs due to the net topologies and loadings associated with them.
The DRAM package outline is a combination of a tall/narrow (i.e., rectangular) DRAM package and a short/wide (i.e., squarish) DRAM package. Thus configured, a single card design may accommodate either “tall” or “wide” DRAM device/package combinations, consistent with historical and projected device trends. Moreover, the buffer device 1002 is rectangular in shape, thereby permitting a minimum distance between high-speed package interconnects and the DIMM tab pins, as well as reducing the distance the high-speed signals must travel under the package to reach an available high-speed pin, when an optimal ground referencing structure is used.
As is also shown in
Referring to
In addition to inputting the original or re-ordered signals to the bus sparing logic 1436, the bus sparing logic 1426 also inputs the original or re-ordered signals into a downstream bus ECC functional block 1420 to perform error detection and correction for the frame. The downstream bus ECC functional block 1420 operates on any information received or passed through the multi-mode buffer device 1002 from the downstream memory bus 904 to determine if a bus error is present. The downstream bus ECC functional block 1420 analyzes the bus signals to determine if they are valid. Next, the downstream bus ECC functional block 1420 transfers the corrected signals to a command state machine 1414. The command state machine 1414 inputs the error flags associated with command decodes or conflicts to a pervasive and miscellaneous functional block 1410. The downstream and upstream modules also present error flags and/or error data (if any) to the pervasive and miscellaneous functional block 1410 to enable reporting of these errors to the memory controller, processor, service processor or other error management unit.
Referring to
The command state machine 1414 also determines if the corrected signals (including data, command and address signals) are directed to and should be processed by the memory module 806. If the corrected signals are directed to the memory module 806, then the command state machine 1414 determines what actions to take and may initiate DRAM action, write buffer actions, read buffer actions or a combination thereof. Depending on the type of memory module 806 (buffered, unbuffered, registered), the command state machine 1414 selects the appropriate drive characteristics, timings and timing relationships. The write data buffers 1412 transmit the data signals to a memory data interface 1406 and the command state machine 1414 transmits the associated addresses and command signals to a memory command interface 1408, consistent with the DRAM specification. The memory data interface 1406 reads from and writes memory data 1442 to a memory device.
Data signals to be transmitted to the memory controller 802 may be temporarily stored in the read data buffers 1416 after a command, such as a read command, has been executed by the memory module 806, consistent with the memory device ‘read’ timings. The read data buffers 1416 transfer the read data into an upstream bus ECC functional block 1422. The upstream bus ECC functional block 1422 generates check bits for the signals in the read data buffers 1416. The check bits and signals from the read data buffers 1416 are input to the upstream data multiplexing functional block 1432. The upstream data multiplexing functional block 1432 merges the data on to the upstream memory bus 902 via the bus sparing logic 1438 and the driver functional block 1430. If needed, the bus sparing logic 1438 may re-direct the signals to account for a defective segment between the current memory module 806 and the upstream receiving module (or memory controller). The driver functional block 1430 transmits the original or re-ordered signals, via the upstream memory bus 902, to the next memory assembly (i.e., memory module 806) or memory controller 802 in the chain. In an exemplary embodiment of the present invention, the bus sparing logic 1438 is implemented using a multiplexor to shift the signals. The driver functional block 1430 provides macros and support logic for the upstream memory bus 902 and, in an exemplary embodiment of the present invention, includes support for a twenty-three bit, high speed, low latency cascade driver bus.
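A multiplexor-based shift of the kind mentioned for the bus sparing logic can be sketched as follows. The lane layout is an assumption made for illustration: N logical signals are carried on N+1 physical lanes, the last lane is the spare, and signals at or above a failed lane move up by one position. The function names are hypothetical.

```python
from typing import List, Optional


def steer_onto_lanes(signals: List[int], failed_lane: Optional[int]) -> List[int]:
    """Driver side: place N logical signals onto N+1 physical lanes.

    Each physical lane selects between logical signal `lane` and logical
    signal `lane - 1`, which is a 2:1 multiplexor per lane.
    """
    if failed_lane is None:
        return signals + [0]               # spare lane idle
    lanes = []
    for lane in range(len(signals) + 1):
        if lane < failed_lane:
            lanes.append(signals[lane])    # unaffected lanes
        elif lane == failed_lane:
            lanes.append(0)                # failed lane carries nothing useful
        else:
            lanes.append(signals[lane - 1])  # shifted toward the spare
    return lanes


def recover_from_lanes(lanes: List[int], failed_lane: Optional[int]) -> List[int]:
    """Receiver side: undo the shift and drop the unused lane."""
    if failed_lane is None:
        return lanes[:-1]
    return lanes[:failed_lane] + lanes[failed_lane + 1:]
```

For example, `steer_onto_lanes([1, 0, 1, 1], 1)` yields `[1, 0, 0, 1, 1]`, and `recover_from_lanes` with the same `failed_lane` returns the original four signals.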
Data, clock and ECC signals from the upstream memory bus 902 are also received by any upstream multi-mode buffer device 1002 in any upstream memory module 806. These signals need to be passed upstream to the next memory module 806 or to the memory controller 802. Referring to
In addition to passing the data and ECC signals to the upstream data multiplexing functional block 1432, the bus sparing functional block 1440 also inputs the original or re-ordered data and ECC signals to the upstream bus ECC functional block 1422 to perform error detection and correction for the frame. The upstream bus ECC functional block 1422 operates on any information received or passed through the multi-mode buffer device 1002 from the upstream memory bus 902 to determine if a bus error is present. The upstream bus ECC functional block 1422 analyzes the data and ECC signals to determine if they are valid. Next, the upstream bus ECC functional block 1422 transfers any error flags and/or error data to the pervasive and miscellaneous functional block 1410 for transmission to the memory controller 802. In addition, once a pre-defined threshold for the number or type of failures has been reached, the pervasive and miscellaneous functional block 1410, generally in response to direction of the memory controller 802, may substitute the spare segment for a failing segment.
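The threshold-based substitution of the spare segment can be sketched as a simple error counter. The class name, the per-segment counting policy and the single-spare restriction are assumptions made for illustration; the description states only that a pre-defined threshold for the number or type of failures triggers the substitution, generally at the direction of the memory controller.

```python
from collections import Counter
from typing import Optional


class SegmentFailureMonitor:
    """Track bus errors per segment and flag when sparing is warranted."""

    def __init__(self, threshold: int) -> None:
        if threshold < 1:
            raise ValueError("threshold must be 1 or greater")
        self.threshold = threshold
        self.errors: Counter = Counter()
        self.spared_segment: Optional[int] = None

    def record_error(self, segment: int) -> bool:
        """Count one error; return True when substitution should be requested."""
        self.errors[segment] += 1
        return (self.spared_segment is None
                and self.errors[segment] >= self.threshold)

    def substitute_spare(self, segment: int) -> None:
        """Commit the spare to `segment` (assumed to be directed by the controller)."""
        if self.spared_segment is not None:
            raise RuntimeError("spare segment already in use")
        self.spared_segment = segment
```

Separating `record_error` from `substitute_spare` reflects the description's split between detecting the condition in the buffer device and acting on it under the memory controller's direction.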
The block diagram in
As indicated in
The terms “net topology” in
Finally,
In an exemplary embodiment, each of the redundant pins is located behind the respective primary function pin for which it is redundant. For example, redundant service pins serv_ifc(1)_r and serv_ifc(2)_r (pins 142, 143) are located directly behind service pins serv_ifc(1) and serv_ifc(2) (pins 4, 5), respectively. In this manner, the DIMM is resistant to single point-of-fail memory outage (e.g., if the DIMM were warped or tilted toward one side or the other).
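The front-to-back pin relationship in the example above can be sketched as an offset calculation. The sketch assumes the 276 pins are numbered 1 to 138 on the front of the card and 139 to 276 on the back, which is inferred from the quoted pin pairs (4 and 142, 5 and 143) rather than stated explicitly; the function name is hypothetical.

```python
PIN_COUNT = 276
PINS_PER_SIDE = PIN_COUNT // 2  # 138, assuming an even front/back split


def pin_directly_behind(front_pin: int) -> int:
    """Return the back-side pin located directly behind a front-side pin."""
    if not 1 <= front_pin <= PINS_PER_SIDE:
        raise ValueError("front_pin must be between 1 and 138")
    return front_pin + PINS_PER_SIDE
```

Under this numbering, `pin_directly_behind(4)` and `pin_directly_behind(5)` give 142 and 143, the positions cited for the redundant service pins.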
Among the various functions included within the 276-pin layout are a pair of continuity pins (1, 138) and scope trigger pins (3, 141). As will be appreciated from an inspection of the pin assignment tables in
As will also be noted, for example in
As described above, the embodiments of the invention may be embodied in the form of computer-implemented processes and apparatuses for practicing those processes. Embodiments of the invention may also be embodied in the form of computer program code containing instructions embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other computer-readable storage medium, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. The present invention can also be embodied in the form of computer program code, for example, whether stored in a storage medium, loaded into and/or executed by a computer, or transmitted over some transmission medium, such as over electrical wiring or cabling, through fiber optics, or via electromagnetic radiation, wherein, when the computer program code is loaded into and executed by a computer, the computer becomes an apparatus for practicing the invention. When implemented on a general-purpose microprocessor, the computer program code segments configure the microprocessor to create specific logic circuits.
While the invention has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but that the invention will include all embodiments falling within the scope of the appended claims. Moreover, the use of the terms first, second, etc. does not denote any order or importance, but rather the terms first, second, etc. are used to distinguish one element from another.