Embodiments of the invention generally relate to the field of integrated circuits and, more particularly, to systems, methods and apparatuses for enabling an integrated memory controller to transparently work with defective memory devices.
The density of dynamic random access memory devices (DRAMs) has been growing at a substantial rate. In addition, the number of DRAMs on a memory module (and the number of memory modules in a computing system) has also been growing at a substantial rate. All of these manufactured components are subject to the same statistical yield patterns, and this means that as the DRAM density increases there is a corresponding increase in the risk of defective bits in the manufactured components. Current yields for DRAMs are around 90%. The components with defective bits are binned and sold as lower density chips if possible. On the other hand, the ever increasing memory footprint of computer operating systems and data processing needs continues to drive the need for larger memory subsystems in computing systems. In almost all segments the memory subsystem cost is becoming a significant part of the total cost of a computing system.
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings in which like reference numerals refer to similar elements.
Embodiments of the invention are generally directed to systems, methods, and apparatuses for enabling an integrated memory controller to transparently work with defective memory devices. In some embodiments, a marginal condition is imposed on a memory module during normal operations of the memory module. The term “marginal condition” refers to a condition that is out of compliance with a specified (or “normal”) operating condition for the memory module. The memory module may exhibit failures in response to the marginal conditions and compensating mechanisms may mitigate the failures.
Integrated circuit 102 includes logic to control the transfer of information with DRAM subsystem 104. In the illustrated embodiment, integrated circuit 102 includes processor cores 108 and logic 110. Processor cores 108 may be any of a wide range of processor cores including general processor cores, graphics processor cores, and the like. Logic 110 broadly represents a wide array of logic including, for example, a memory controller, an uncore, and the like.
In some embodiments, logic 110 also includes logic to impose a marginal condition on a memory module (or on another element of DRAM subsystem 104) and logic to compensate for the imposed marginal condition. The term “marginal condition” broadly refers to a condition that exceeds the bounds of normal operating conditions as defined by a (public or proprietary) specification, standard, protocol, and the like. For example, normal operating conditions for voltage, temperature, and refresh rate are typically defined for a memory module in a specification or standard. The phrase “imposing a marginal condition” refers to operating the device (e.g., the memory module) outside of the range of values that are considered “normal” for the device.
In some embodiments, “imposing a marginal condition” refers to imposing a voltage, a temperature, and/or a refresh rate that is outside of values that are considered “normal” (e.g., as defined by a specification, a standard, or the like). For example, in some embodiments, logic 110 imposes a refresh rate that is lower than the refresh rate specified for memory module 112. The advantages to imposing a lower refresh rate include an improvement in overall system performance since the system is spending less time with its memory in refresh. In addition, the power consumed by DRAM subsystem 104 may be reduced by reducing the refresh rate. Similarly, operating under lower voltage can yield power savings. In alternative embodiments, logic 110 may impose a different marginal condition on memory module 112 (and/or any other element of DRAM subsystem 104 and/or on interconnect 106).
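The performance argument for a reduced refresh rate can be illustrated with simple arithmetic. The following is a minimal sketch that estimates the fraction of time a memory device spends in refresh. The function name, the parameter names, and the numeric values (a 7.8 microsecond refresh interval and a 350 nanosecond refresh cycle time) are illustrative assumptions and are not taken from the embodiments; actual values depend on the device and the applicable specification.

```python
def refresh_overhead(refresh_interval_ns: float, refresh_cycle_ns: float) -> float:
    """Fraction of time spent in refresh, assuming one refresh command of
    length refresh_cycle_ns is issued every refresh_interval_ns."""
    if refresh_interval_ns <= 0 or refresh_cycle_ns < 0:
        raise ValueError("interval must be positive and cycle time non-negative")
    return refresh_cycle_ns / refresh_interval_ns


# Assumed, illustrative timing values (not from the embodiments).
NOMINAL_INTERVAL_NS = 7800.0
REFRESH_CYCLE_NS = 350.0

nominal = refresh_overhead(NOMINAL_INTERVAL_NS, REFRESH_CYCLE_NS)
# Marginal condition: halve the refresh rate, i.e. double the interval.
marginal = refresh_overhead(2 * NOMINAL_INTERVAL_NS, REFRESH_CYCLE_NS)

print(f"nominal refresh overhead:  {nominal:.2%}")
print(f"marginal refresh overhead: {marginal:.2%}")
```

Under these assumptions, halving the refresh rate halves the time the memory is unavailable due to refresh, at the cost of a higher risk that some locations lose their contents.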
The phrase “compensating for the imposed marginal condition” refers to detecting changes in the performance of DRAM subsystem 104 and/or compensating for those changes. For example, in some embodiments, logic 110 imposes a reduced refresh rate on memory module 112. Some of the memory locations in module 112 may exhibit defects in response to the reduced refresh rate. Logic 110 may detect those defects and compensate for them. For example, in some embodiments, logic 110 may move information that is stored in the “defective” memory locations to another location (e.g., one that is known to be operating properly). Aspects of logic 110 are further discussed below with reference to
DRAM subsystem 104 provides at least a portion of the main memory for system 100. In the illustrated embodiment, DRAM subsystem 104 includes one or more memory modules 112. Modules 112 may be any of a wide range of memory modules including dual inline memory modules (DIMMs), small outline DIMMs (SO-DIMMs), and the like. Each module 112 may have one or more DRAMs 114 (and possibly other elements such as registers, buffers, and the like). DRAMs 114 may be any of a wide range of devices including nearly any generation of double data rate (DDR) DRAMs.
The embodiment illustrated in
Marginal condition logic 201 includes logic to impose a marginal condition on one or more elements of a DRAM subsystem (e.g., DRAM subsystem 104 shown in
ECC logic 202 includes logic to detect and correct selected errors in information (e.g., data and/or code) that is read from the DRAM subsystem (e.g., DRAM subsystem 104, shown in
In some embodiments, hard error detect logic 204 determines whether a detected error is a hard error or a soft error. The term “soft error” refers to an error in stored information that is not the result of a hardware defect (e.g., an error due to an alpha particle strike). A “hard error” refers to an error that is due to a hardware defect. For example, bits that go bad due to a memory module operating in a marginal condition are hard errors. In some embodiments, logic 204 determines whether there are hard errors based on whether the error is persistent. For example, logic 204 may use replay logic to write to and read from a memory location a number of times to determine whether one or more bits are persistently bad. The replay logic may be preexisting replay logic (e.g., in a memory controller) or it may be replay logic that is part of logic 204.
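The persistence test described above can be sketched in software. The following is a minimal sketch, assuming a toy byte-wide memory model with stuck-at-zero bits; the class `SimulatedMemory`, the function `is_hard_error`, the test patterns, and the replay count are all hypothetical and chosen only for illustration. A real implementation would preserve the ECC-corrected data rather than the raw value read back.

```python
class SimulatedMemory:
    """Toy model: byte-wide cells plus per-address masks of bits stuck at 0."""

    def __init__(self):
        self.cells = {}          # address -> stored byte
        self.stuck_at_zero = {}  # address -> mask of bits that always read 0

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        mask = self.stuck_at_zero.get(addr, 0)
        return self.cells.get(addr, 0) & ~mask & 0xFF


def is_hard_error(mem, addr, patterns=(0x00, 0xFF, 0xAA, 0x55), replays=3):
    """Replay write/read cycles and report whether any bit fails every time.

    A bit that mismatches in every replay is treated as persistently bad
    (a hard error); a location that reads back correctly once rewritten
    is treated as having suffered only a soft error.
    """
    saved = mem.read(addr)  # simplification: raw read, not ECC-corrected data
    persistent_bad_bits = 0xFF
    for _ in range(replays):
        bad_bits = 0
        for pattern in patterns:
            mem.write(addr, pattern)
            bad_bits |= mem.read(addr) ^ pattern
        persistent_bad_bits &= bad_bits
    mem.write(addr, saved)
    return persistent_bad_bits != 0


mem = SimulatedMemory()
mem.stuck_at_zero[0x20] = 0x04   # bit 2 of address 0x20 is defective
mem.write(0x10, 0x3C)
mem.cells[0x10] ^= 0x01          # one-time bit flip, i.e. a soft error

print(is_hard_error(mem, 0x10))  # soft error: not persistent
print(is_hard_error(mem, 0x20))  # hard error: bit 2 fails on every replay
```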
In some embodiments, if logic 204 detects a “hard error” then relocation logic 206 moves the information stored in the defective memory location to another memory location (e.g., a reserved memory location that is operating normally). As used herein, the term “relocation” refers to moving information from a defective region to a known good region. Relocation may also include building and using memory map 208. For example, the process flow for relocation may include changing a pointer, changing a table entry, and the like. Memory map 208 is a logical structure that provides a mapping to relocated information and/or provides an indication of which memory locations are currently defective. Memory map 208 may be built and used during the normal operation of a system (e.g., during real time rather than manufacture time). As defective locations are identified and information is relocated, logic 206 builds and uses memory map 208. Relocation is performed before the “hard” error leads to a system failure or data corruption.
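One possible shape for such a memory map is a lookup table keyed by defective location. The following is a minimal sketch under that assumption; the class name `MemoryMap` and its methods are hypothetical and are not described in the embodiments, which leave the structure of the map open.

```python
class MemoryMap:
    """Maps defective locations to the known good locations that replace them."""

    def __init__(self):
        self._remap = {}  # defective location -> replacement location

    def relocate(self, defective, replacement):
        """Record that information at `defective` now lives at `replacement`."""
        if defective == replacement:
            raise ValueError("a location cannot replace itself")
        if replacement in self._remap:
            raise ValueError("replacement location is itself marked defective")
        self._remap[defective] = replacement
        # If an earlier relocation targeted the now-defective location,
        # redirect it so lookups never land on a defective location.
        for source, target in list(self._remap.items()):
            if target == defective:
                self._remap[source] = replacement

    def is_defective(self, location):
        return location in self._remap

    def translate(self, location):
        """Return the location that should actually be accessed."""
        return self._remap.get(location, location)


memory_map = MemoryMap()
memory_map.relocate(0x10, 0x100)
print(hex(memory_map.translate(0x10)))  # relocated access
print(hex(memory_map.translate(0x11)))  # untouched location maps to itself
```

Keeping both roles (translation and defect indication) in one table mirrors the description of the map as providing a mapping and/or an indication of defective locations; an implementation could equally split them.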
In some embodiments, at least a portion of the logic to compensate for the marginal condition is, optionally, performed in software. For example, some or all of the tasks associated with detecting a hard error, relocating information, and/or building/using a memory map may be performed by software 210. In some embodiments software 210 is a handler such as a system management interrupt handler (e.g., an SMI handler). In other embodiments, software 210 may be part of the operating system (OS) kernel.
Marginal condition logic (e.g., logic 201 or other logic) imposes a marginal condition at 304. In some embodiments, the marginal condition is a reduced refresh rate. In other embodiments, the marginal condition is a marginal operating voltage and/or a marginal temperature. In yet other embodiments, the marginal condition may be nearly any other condition that is in variance with the “normal” operating conditions for the DRAM subsystem.
Logic to compensate for the marginal condition performs an action at 306. In some embodiments, compensating for the marginal condition includes detecting hard errors and relocating information to a known good memory location. In some embodiments, the compensating logic uses a memory map to reference the new locations for the relocated data.
Referring to process block 402, an ECC code (e.g., ECC 202, shown in
If a hard error is detected, then relocation logic may move the data currently located in the “defective” memory location to a known good location (408). In some embodiments, the relocation logic may reserve one or more “spare” memory locations (e.g., rows, portions of rows, ranks, and the like) that are functioning normally. When a hard error is detected, the information in the defective memory locations may be moved to one of the spare locations. In some embodiments, the relocation logic uses a memory map to reference the new locations for the relocated data and to indicate where the defective locations are.
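The reservation of spare locations can be sketched as a small pool from which a replacement is drawn each time a hard error is found. This is a minimal sketch only; the names `SparePool` and `relocate_row`, the row numbers, and the use of plain dictionaries for the memory contents and the memory map are assumptions made for illustration.

```python
class SparePool:
    """Reserved rows that are known to be functioning normally."""

    def __init__(self, spare_rows):
        self._free = list(spare_rows)

    def allocate(self):
        if not self._free:
            raise RuntimeError("no spare rows left")
        return self._free.pop(0)

    def remaining(self):
        return len(self._free)


def relocate_row(rows, remap, spares, defective_row, corrected_data):
    """Move corrected data to a spare row and record the move in the map."""
    spare = spares.allocate()
    rows[spare] = corrected_data   # copy the information to the good row
    remap[defective_row] = spare   # later accesses are redirected via the map
    return spare


rows = {7: b"payload"}             # toy memory contents, keyed by row number
remap = {}                         # toy memory map: defective row -> spare row
spares = SparePool([1000, 1001])   # hypothetical reserved row numbers

new_row = relocate_row(rows, remap, spares, defective_row=7, corrected_data=rows[7])
print(new_row, remap, spares.remaining())
```

A pool that raises when exhausted makes the limit explicit: once no spares remain, the system would need another response, such as restoring the normal operating condition, which the sketch does not attempt to model.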
Elements of embodiments of the present invention may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, flash memory, optical disks, compact disk read-only memory (CD-ROM), digital versatile/video disk (DVD) ROM, random access memory (RAM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), magnetic or optical cards, propagation media, or other types of machine-readable media suitable for storing electronic instructions. For example, embodiments of the invention may be downloaded as a computer program which may be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) by way of data signals embodied in a carrier wave or other propagation medium via a communication link (e.g., a modem or network connection).
In the description above, certain terminology is used to describe embodiments of the invention. For example, the term “logic” is representative of hardware, firmware, software (or any combination thereof) to perform one or more functions. For instance, examples of “hardware” include, but are not limited to, an integrated circuit, a finite state machine, or even combinatorial logic. The integrated circuit may take the form of a processor such as a microprocessor, an application specific integrated circuit, a digital signal processor, a micro-controller, or the like.
It should be appreciated that reference throughout this specification to “one embodiment” or “an embodiment” means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment” or “one embodiment” or “an alternative embodiment” in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the invention.
Similarly, it should be appreciated that in the foregoing description of embodiments of the invention, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description.