The exemplary embodiments of this invention relate generally to computer memory and, more specifically, relate to error detection and correction in a memory system.
This section endeavors to supply a context or background for the various exemplary embodiments of the invention as recited in the claims. The content herein may comprise subject matter that could be utilized, but not necessarily matter that has been previously utilized, described or considered. Unless indicated otherwise, the content described herein is not considered prior art, and should not be considered as admitted prior art by inclusion in this section.
The following abbreviations are utilized herein:
CRC cyclic redundancy check
DDR double data rate
DED double-bit error detection
DIMM dual in-line memory module
DRAM dynamic random access memory
ECC error correction code
EEPROM electrically erasable programmable read only memory
HDD hard disk drive
IPL initial program load
LRC longitudinal redundancy check
NVS nonvolatile storage
RAID redundant array of inexpensive/independent disks
RAM random access memory
SDR single data rate
SDRAM synchronous dynamic random access memory
SEC single-bit error correction
UE uncorrectable error
XOR exclusive OR
Computer systems often require a considerable amount of high speed RAM to hold information such as operating system software, programs and other data while a computer is powered on and operational. This information is normally binary, composed of patterns of 1's and 0's known as bits of data. The bits of data are often grouped and organized at a higher level. A byte, for example, is typically composed of 8 bits, although it may be composed of additional bits (e.g., 9, 10, etc.) when the byte also includes information for use in the identification and/or correction of errors. This binary information is normally loaded into RAM from NVS such as HDDs during power on and IPL of the computer system (e.g., boot up). The data is also paged-in from and paged-out to NVS during normal computer operation. In general, not all of the programs and information utilized by a computer system can fit in the smaller, more costly DRAM; in addition, even if the data did fit, it would be lost when the computer system is powered off. At present, it is common for NVS systems to be built using a large number of HDDs.
Computer RAM is often designed with pluggable subsystems, often in the form of modules, so that incremental amounts of RAM can be added to a computer, as dictated by the specific memory requirements for the system and/or application. The acronym “DIMM” refers to dual in-line memory modules, a common type of memory module currently in use. A DIMM is a thin, rectangular card comprising one or more memory devices, and may also include one or more registers, buffers, hub devices, and/or non-volatile storage (e.g., EEPROM) as well as various passive devices (e.g., resistors and/or capacitors), all mounted to the card. DIMMs are often designed with dynamic memory chips or DRAMs that are regularly refreshed to prevent the data stored within from being lost. Originally, DRAM chips were asynchronous devices; however, contemporary chips, such as SDRAM (e.g., SDR, DDR, DDR2, DDR3, etc.), have synchronous interfaces to improve performance. DDR devices are available that use pre-fetching along with other speed enhancements to improve memory bandwidth and reduce latency. DDR3, for example, has a standard burst length of 8.
Memory device densities have continued to increase as computer systems have become more powerful. Currently it is not uncommon for the RAM content of a single computer to be composed of hundreds of trillions of bits. Unfortunately, the failure of just a portion of a single RAM device can cause the entire computer system to fail. Memory errors may be “hard” (repeating) or “soft” (one-time or intermittent) failures, and may occur as single cell, multi-bit, full chip or full DIMM failures; when they do, all or part of the system RAM may be unusable until the failure is repaired. Repair turn-around times can be hours or even days, which can have a substantial impact on a business dependent on the computer systems. The probability of encountering a RAM failure during normal operations has continued to increase as the amount of memory storage and complexity continues to grow in contemporary computers.
Techniques to detect and correct bit errors have evolved into an elaborate science over the past several decades. Perhaps the most basic detection technique is the generation of odd or even parity, where the bits of a data word are “exclusive OR-ed” (XOR-ed) together to produce a parity bit. For example, under even parity a data word with an even number of 1's will have a parity bit of 0 and a data word with an odd number of 1's will have a parity bit of 1, with this parity bit appended to the stored memory data. If there is a single error present in the data word during a read operation, it can be detected by regenerating parity from the data and then checking to see that it matches the stored (originally generated) parity.
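By way of a non-limiting illustration, the following C sketch shows the even-parity scheme just described; the function names and the 64-bit word width are illustrative assumptions rather than features of any particular memory design.

    #include <stdint.h>

    /* Generate the even-parity bit for a 64-bit data word by XOR-ing
     * all of its bits together: the result is 0 for an even number of
     * 1's and 1 for an odd number of 1's. */
    static uint8_t parity_bit(uint64_t word)
    {
        word ^= word >> 32;
        word ^= word >> 16;
        word ^= word >> 8;
        word ^= word >> 4;
        word ^= word >> 2;
        word ^= word >> 1;
        return (uint8_t)(word & 1);
    }

    /* On a read, a single-bit error is detected when the parity
     * regenerated from the data no longer matches the stored bit. */
    static int single_error_detected(uint64_t word, uint8_t stored_parity)
    {
        return parity_bit(word) != stored_parity;
    }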
Richard Hamming recognized that the parity technique could be extended to not only detect errors but also correct them, by appending an XOR field (e.g., an ECC field) to each code word. The ECC field is a combination of different bits in the word XOR-ed together so that errors (small changes to the data word) can be easily detected, pinpointed and corrected. The number of errors that can be detected and corrected is directly related to the length of the ECC field appended to the data word. The technique includes ensuring a minimum separation distance between valid data words and code word combinations. The greater the number of errors desired to be detected and corrected, the longer the code word, thus creating a greater distance between valid code words. The smallest distance between valid code words is known as the minimum Hamming distance.
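As a concrete, non-limiting sketch of this idea, the classic Hamming(7,4) code protects four data bits with three parity bits, and the syndrome computed on decode directly pinpoints a single flipped bit. The bit layout below is one common textbook convention, chosen purely for illustration.

    #include <stdio.h>
    #include <stdint.h>

    /* Encode a 4-bit value into a 7-bit Hamming(7,4) code word.
     * Bit layout, LSB first (positions 1..7): p1 p2 d1 p3 d2 d3 d4,
     * so a single-bit error at position k yields syndrome k. */
    static uint8_t hamming74_encode(uint8_t d)
    {
        uint8_t d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
        uint8_t p1 = d1 ^ d2 ^ d4;  /* covers positions 1, 3, 5, 7 */
        uint8_t p2 = d1 ^ d3 ^ d4;  /* covers positions 2, 3, 6, 7 */
        uint8_t p3 = d2 ^ d3 ^ d4;  /* covers positions 4, 5, 6, 7 */
        return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
                         (d2 << 4) | (d3 << 5) | (d4 << 6));
    }

    /* Correct a single-bit error, if any, and return the 4-bit value. */
    static uint8_t hamming74_decode(uint8_t c)
    {
        uint8_t s1 = (c ^ (c >> 2) ^ (c >> 4) ^ (c >> 6)) & 1;        /* p1 check */
        uint8_t s2 = ((c >> 1) ^ (c >> 2) ^ (c >> 5) ^ (c >> 6)) & 1; /* p2 check */
        uint8_t s3 = ((c >> 3) ^ (c >> 4) ^ (c >> 5) ^ (c >> 6)) & 1; /* p3 check */
        uint8_t syndrome = (uint8_t)(s1 | (s2 << 1) | (s3 << 2));     /* error position */
        if (syndrome)
            c ^= (uint8_t)(1 << (syndrome - 1));  /* flip the faulty bit */
        return (uint8_t)(((c >> 2) & 1) | (((c >> 4) & 1) << 1) |
                         (((c >> 5) & 1) << 2) | (((c >> 6) & 1) << 3));
    }

    int main(void)
    {
        uint8_t code = hamming74_encode(0xB);
        code ^= 1 << 4;  /* inject a single-bit error at position 5 */
        printf("recovered value: 0x%X\n", hamming74_decode(code));  /* prints 0xB */
        return 0;
    }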
These error detection and error correction techniques are commonly used to restore data to its original/correct form in noisy communication transmission media or for storage media where there is a finite probability of data errors due to the physical characteristics of the device. The memory devices generally store data as voltage levels representing a 1 or a 0 in RAM and are subject to both device failure and state changes due to high energy cosmic rays and alpha particles. Similarly, HDDs that store 1's and 0's as magnetic fields on a magnetic surface are also subject to imperfections in the magnetic media and other mechanisms that can cause undesired changes in the data pattern from what was originally stored.
In the 1980's, RAM memory device sizes first reached the point where they became sensitive to alpha particle hits and cosmic rays causing memory bits to flip. These particles do not damage the device but can create memory errors. These are known as soft errors, and most often affect just a single bit. Once identified, the bit failure can be corrected by simply rewriting the memory location. The frequency of soft errors has grown to the point that it has a noticeable impact on overall system reliability.
Memory ECCs, like those proposed by Hamming, use a combination of parity codes in various bit positions of the data word to allow detection and correction of errors. Every time data words are written into memory, a new ECC word needs to be generated and stored with the data, thereby allowing detection and correction of the data in cases where the data read out of memory includes an ECC code that does not match a newly calculated ECC code generated from the data being read.
The first ECCs were applied to RAM in computer systems in an effort to increase fault-tolerance beyond that allowed by previous means. Binary ECC codes were deployed that allowed for DED and SEC. This SEC/DED ECC also allowed for transparent recovery of single bit hard errors in RAM. Scrubbing routines were also developed to help reduce memory errors by locating soft errors through a complement/re-complement process so that the soft errors could be detected and corrected.
In one exemplary embodiment a method is provided comprising providing a plurality of random access memories having at least a first region, a second region and a third region; storing protected data on the first region on at least two of the random access memories, where the protected data is stored distributed among the at least two random access memories of the first region; storing parity information for the protected data on the second region on at least a third one of the random access memories; and storing unprotected data on the third region.
In another aspect, an example method comprises providing a plurality of random access memories including a first region, a second region and a third region; storing protected data on the first region on at least two of the random access memories; storing parity information for the protected data on the second region on at least a third one of the random access memories; storing unprotected data on the third region; writing new protected data to the at least two random access memories; computing updated parity information based on the new protected data; and writing the updated parity information to the second region of the plurality of random access memories.
In another aspect, an example method comprises providing a plurality of random access memories comprising a first region, a second region and a third region; storing protected data on the first region on at least two of the random access memories; storing parity information for the protected data on the second region on at least a third one of the random access memories; storing unprotected data on the third region; in response to a command to write new protected data to one of the random access memories that has failed of the at least two random access memories, reading other protected data from other ones of the at least two random access memories and reading the parity information from the second region; reconstructing missing protected data for the failed random access memory based on the other protected data and the parity information; determining new parity information based on the new protected data and the reconstructed missing protected data; and writing the new parity information to the second region.
The foregoing and other aspects of embodiments of this invention are made more evident in the following Detailed Description, when read in conjunction with the attached Drawing Figures, wherein:
Some storage manufacturers have used advanced ECC techniques, such as Reed-Solomon codes, to correct for full memory chip failures. Some memory system designs also have standard reserve memory chips (e.g., “spare” chips) that can be automatically introduced in a memory system to replace a faulty chip. These advancements have greatly improved RAM reliability, but as memory size continues to grow and customers' reliability expectations increase, further enhancements are needed. There is a need for systems to survive a complete DIMM failure and for the DIMM to be replaced concurrent with system operation. In addition, other failure modes must be considered which affect single points of failure between one or more DIMMs and the memory controller/embedded processor. For example, some of the connections between the memory controller and the memory device(s) may include one or more intermediate buffer(s) that may be external to the memory controller and reside on or separate from the DIMM; upon failure, such a buffer may have the effect of appearing as a portion of a single DIMM failure, a full DIMM failure, or a broader memory system failure.
Although there is a clear need to improve computer RAM reliability (also referred to as “fault tolerance”) by using even more advanced error correction techniques, attempts to do this have been hampered by various factors, including impacts to available customer memory, performance, space, and heat. Using redundancy by including extra copies (e.g., “mirroring”) of data or more sophisticated error coding techniques drives up costs, adds complexity to the design, incurs additional overhead, and may impact another key business measure: time-to-market. For example, the simple approach of memory mirroring has been offered as a feature by several storage manufacturing companies. The use of memory mirroring permits systems to survive more catastrophic memory failures, but acceptance has been very low because it generally requires a doubling of the memory size on top of the base SEC/DED ECC already present in the design, which generally leaves customers with less than 50% of the installed RAM available for system use. ECC techniques have been used to improve availability of storage systems by correcting HDD failures so that customers do not experience data loss or data integrity issues due to failure of a HDD, while further protecting them from more subtle failure modes.
Some suppliers of storage systems have successfully used RAID techniques to improve the availability of HDD-based storage. In many respects it is easier to recover from a HDD failure using RAID techniques than it would be for RAM, because it is much easier to isolate the failure in HDDs than it is in RAM. HDDs often have embedded checkers such as ECCs to detect bad sectors. In addition, CRCs and LRCs may be embedded in HDD electronics and/or disk adapters, or there may be checkers used by higher levels of code and applications to detect HDD errors. CRCs and LRCs are written coincident with data to help detect data errors. CRCs and LRCs are hashing functions used to produce a small, substantially unique bit pattern generated from the data. When the data is read from the HDD, the checksum is regenerated and compared to that stored on the platter. The signatures must match exactly to ensure that the data retrieved from the magnetic pattern encoded on the disk matches what was originally written to the disk.
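For illustration, an LRC may be realized as a running XOR over the payload bytes. The following C sketch (with illustrative function names) shows generation of the signature and its verification in the read path.

    #include <stddef.h>
    #include <stdint.h>

    /* LRC generation: XOR all payload bytes into a one-byte signature
     * that is written coincident with the data. */
    static uint8_t lrc_generate(const uint8_t *data, size_t len)
    {
        uint8_t lrc = 0;
        for (size_t i = 0; i < len; i++)
            lrc ^= data[i];
        return lrc;
    }

    /* On a read, the signature is regenerated and must match exactly
     * the one stored with the data. */
    static int lrc_check(const uint8_t *data, size_t len, uint8_t stored)
    {
        return lrc_generate(data, len) == stored;
    }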
RAID systems have been developed to improve performance and/or to increase the availability of disk storage systems. RAID distributes data across several independent HDDs. Many different RAID schemes have been developed, each having different characteristics along with different pros and cons. Performance, availability, and utilization/efficiency (the percentage of the disks that actually hold customer data) are among the most important aspects. The tradeoffs associated with various RAID schemes have to be carefully considered because improvements in one attribute can often result in reductions in another. For example, a RAID-1 system uses two exact copies (mirrors) of the data. Clearly, this has a negative impact on utilization/efficiency while providing additional reliability (e.g., a failure of one copy of the data is not fatal since the remaining copy can be used). As another example, a RAID-0 system (stripe set or striped volume) splits data evenly across two or more disks. This can improve performance (since the disks can be read concurrently, resulting in faster reads) while reducing reliability (since the failure of any single disk causes loss of the entire stripe set).
There is some inconsistency and ambiguity in RAID-related terminology used throughout the industry. The following definitions are what is implied by use of these terms in this disclosure unless otherwise stated. An array is a collection of hard disk drives on which one or more instances of a RAID erasure code is implemented. A symbol or an element is a fundamental unit of data or parity, the building block of the erasure codes. In coding theory, this is the data assigned to a bit within the symbol, and it is typically a set of sequential sectors. An element is composed of a fixed number of bytes. It is also common to define elements as a fixed number of blocks. A block is a fixed number of bytes. A stripe is a complete and connected set of data and parity elements that are dependently related by the parity computation relations. In coding theory, the stripe is the code word or code instance. A strip is a collection of contiguous elements on a single hard disk drive. A strip contains data elements, parity elements or both from the same disk and stripe. The terms strip and column are used interchangeably. In coding theory, the strip is associated with the code word and is sometimes called the stripe unit. The set of strips in a code word forms a stripe. It is most common for strips to contain the same number of elements. In some cases stripes may be grouped together to form a higher level construct known as a stride.
As noted above, RAID-0 is striping of data across multiple HDDs to improve performance. RAID-1 is mirroring of data, keeping two exact copies of the data on two different HDDs to improve availability and prevent data loss. Some RAID schemes can be used together to gain combined benefits. For example, RAID-10 is both data striping and mirroring across several HDDs in an array to improve both performance and availability.
RAID-3, RAID-4 and RAID-5 are very similar in that they use a single XOR checksum to correct for a single data element error. RAID-3 is byte-level striping with a dedicated parity HDD. RAID-4 uses block-level striping with a dedicated parity HDD. RAID-5 is block-level striping like RAID-4, but with distributed parity; there is no longer a dedicated parity HDD, as parity is distributed substantially uniformly across all the HDDs, thus eliminating the dedicated parity HDD as a performance bottleneck. The key attribute of RAID-3, RAID-4 and RAID-5 is that each can correct a single data element fault when the location of the fault can be pinpointed (e.g., through some independent means).
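This single-erasure correction property can be sketched in C as follows: because the parity element is the XOR of all data elements, a missing element is recoverable as the XOR of every surviving element of the stripe, parity included. The buffer layout and names below are assumptions made for illustration.

    #include <stddef.h>
    #include <stdint.h>

    /* Rebuild the element at index 'lost' of a stripe of 'n' elements
     * (the data elements plus one XOR parity element), each 'len' bytes
     * long. XOR-ing all surviving elements yields the missing one. */
    static void rebuild_element(const uint8_t *const stripe[], int n,
                                int lost, size_t len, uint8_t *out)
    {
        for (size_t b = 0; b < len; b++) {
            uint8_t acc = 0;
            for (int i = 0; i < n; i++)
                if (i != lost)
                    acc ^= stripe[i][b];
            out[b] = acc;
        }
    }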
There is no single universally accepted industry-wide definition for RAID-6. In general, RAID-6 refers to block or byte-level striping with dual checksums. An important attribute of RAID-6 is that it allows for correction of up to two data element faults when the faults can be pinpointed through some independent means. It also has the ability to pinpoint and correct a single failure when the location of the failure is not known.
Another very important computer system attribute that can easily be overlooked is that not all memory failures are equal. For example, DIMM, channel and buffer chip failures are single points of failure and, thus, are not protectable by usage of an ECC. As another example, a failure of a DIMM holding hypervisor data would cause a system failure, disrupting overall system operation.
One technique for protecting such high importance elements and/or partitions is to utilize selective memory mirroring. This technique selectively protects sensitive information or data at a comparatively high cost (e.g., 100% overhead in memory capacity due to the mirroring). Since the overhead is high, this technique may not be suitable for usage with all partitions. However, this technique may be suitable for important and/or critical elements (e.g., hypervisor-related elements).
There is a need in the art to improve failure detection and correction in memory systems. For example, it would be desirable for a memory system to be able to survive a complete DIMM failure and/or for the DIMM to be replaced concurrent with system operation.
The exemplary embodiments of the invention utilize a RAID-like structure in conjunction with parity data to provide comprehensive fault protection for a memory system. As an example, utilization of the exemplary embodiments of the invention, as discussed in further detail below, will enable continued operation even in the face of difficult faults, such as a DIMM failure, for example. Previously, such a failure, were it to occur for a DIMM holding sensitive or important information, could lead to system failure. Furthermore, the overhead required for providing such fault protection is much less than 100%, as will be illustrated herein.
In one exemplary embodiment of the invention, a number of memory modules (e.g., DIMMs) are coupled to a number of memory controllers via a number of channels. The channels, and corresponding memory modules, are separated into data channels and one parity channel. The parity channel/module stores parity information based on the data channels/modules. As an example, for a four channel system there may be three data channels and one parity channel, and the parity information may be obtained by XOR-ing the data channels. By storing parity information, a failure of any one module (e.g., DIMM) is recoverable, for example, by recomputing the lost data/information using the parity information. In addition, once the lost data/information is recomputed, normal operations can proceed, which may involve using the recomputed data/information to update the parity information and store the updated parity information. This arrangement is particularly useful given that in conventional systems and arrangements some critical faults (e.g., loss of an entire DIMM) would cripple the system and disallow further operations. Furthermore, no additional or special hardware is needed and the overhead incurred is less than the 100% for full mirroring of the data (e.g., 33% for the four-channel system noted above). The below descriptions, particularly with reference to the figures, provide further information concerning the various exemplary embodiments of the invention.
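As a non-limiting sketch of the four-channel example above (three data channels plus one parity channel), the parity line may be computed byte-wise as the XOR of the three corresponding data lines. The 64-byte line size and the names used here are illustrative assumptions.

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_BYTES 64  /* illustrative memory line size */

    /* For a 3+1 arrangement, the line stored on the parity channel is
     * the byte-wise XOR of the corresponding lines on the three data
     * channels. If any single channel/DIMM is lost, XOR-ing the other
     * two data lines with the parity line regenerates the lost line. */
    static void compute_parity_line(const uint8_t d0[LINE_BYTES],
                                    const uint8_t d1[LINE_BYTES],
                                    const uint8_t d2[LINE_BYTES],
                                    uint8_t parity[LINE_BYTES])
    {
        for (size_t b = 0; b < LINE_BYTES; b++)
            parity[b] = d0[b] ^ d1[b] ^ d2[b];
    }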
As shown in
While shown in
While shown in
In the exemplary system 300 of
It should be appreciated that while the exemplary embodiments of the invention are discussed herein with respect to protected information/data, the configuration is fully selectable such that any fraction of the total memory is protectable. For example, the parity information on the fourth DIMM 308-4 may only cover a portion (i.e., less than all) of the data stored on the DIMMs 308-1, 308-2, 308-3. In such a case, and by extension, it may be that not all of the fourth DIMM 308-4 is used for parity information and a portion of the fourth DIMM 308-4 may be used for data storage. In such a manner, and in accordance with some exemplary embodiments of the invention, the above-noted exemplary 33% overhead may constitute a maximum overhead with some cases having less than 33% overhead (e.g., if less than all of the information is protected with the parity information). As a non-limiting example, it may be the case that only important and/or critical information (e.g., hypervisor-related data) is protected with the parity information on the fourth DIMM 308-4.
During a normal read operation (
Next consider a read operation with a DIMM failure (
In further exemplary embodiments, the memory controller 304 can send an error signal and consider corrective/reconfiguration options (807), such as scrubbing and retrying the data, deallocating the faulty sector(s) or “calling home” (i.e., signaling higher level errors). The memory controller 304 may also mark a hard error (808). This will enable the memory controller 304 to skip the first two steps (i.e., the read sent to the faulty DIMM and the return of a UE) until the DIMM, which had the DIMM failure, is repaired or replaced. This is represented in
Note that the exemplary embodiments of the invention enable recreation of the data stored on the faulty DIMM 308-1 instead of trying to reread it. Furthermore, note that in the event of a DIMM failure (
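The read-with-failure flow described above may be sketched as follows, where channel_read is a hypothetical low-level accessor (assumed here purely for illustration, not an actual interface of the memory controller 304) that returns nonzero on a UE; on a UE, the line is recreated from the other data lines and the parity line rather than reread.

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_BYTES 64     /* illustrative memory line size */
    #define DATA_CHANNELS 3   /* channel DATA_CHANNELS holds parity */

    /* Hypothetical low-level read: returns 0 on success, -1 on a UE. */
    int channel_read(int chan, uint64_t addr, uint8_t line[LINE_BYTES]);

    /* Sketch of the protected read path: if the target channel returns
     * a UE, recreate the line by XOR-ing the lines from the remaining
     * data channels and the parity channel. */
    static int protected_read(int chan, uint64_t addr, uint8_t out[LINE_BYTES])
    {
        if (channel_read(chan, addr, out) == 0)
            return 0;  /* normal read, no additional overhead */

        uint8_t tmp[LINE_BYTES], acc[LINE_BYTES] = { 0 };
        for (int c = 0; c <= DATA_CHANNELS; c++) {  /* data + parity */
            if (c == chan)
                continue;
            if (channel_read(c, addr, tmp) != 0)
                return -1;  /* second fault: unrecoverable here */
            for (size_t b = 0; b < LINE_BYTES; b++)
                acc[b] ^= tmp[b];
        }
        for (size_t b = 0; b < LINE_BYTES; b++)
            out[b] = acc[b];
        return 0;
    }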
Next consider a normal write operation (
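A normal write to protected data may be sketched as a read-modify-write parity update, consistent with the overhead accounting given below (two line reads and two line writes). The helper below is an illustrative assumption, not a specific controller interface.

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_BYTES 64  /* illustrative memory line size */

    /* Read-modify-write parity update for a protected line: the new
     * parity is the old parity with the old data XOR-ed out and the
     * new data XOR-ed in. Reads: old data line and old parity line.
     * Writes: new data line and updated parity line. */
    static void write_protected_line(const uint8_t new_data[LINE_BYTES],
                                     uint8_t data_line[LINE_BYTES],   /* read #1, then written */
                                     uint8_t parity_line[LINE_BYTES]) /* read #2, then written */
    {
        for (size_t b = 0; b < LINE_BYTES; b++) {
            parity_line[b] ^= data_line[b] ^ new_data[b];
            data_line[b] = new_data[b];
        }
    }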
As illustrated in
Thus, in accordance with the exemplary embodiments of the invention, the overhead for a write operation on a failed DIMM 308-1 is three line reads and one line write. There is no need to write to the failed DIMM 308-1 unless trying to scrub and recreate, for example. While this may seem like substantial overhead, recall that in the absence of the parity information this write operation would not be possible at all. With conventional systems, the failed DIMM is usually declared “dead” and there cannot be any write operation for the data contained on the failed DIMM.
Note that if the DIMM is not already marked as failed, a few additional operations will occur, namely an attempt to write the data to the failed DIMM and the return of a UE (e.g., similar to the initial steps 801, 802 previously noted for reading from a failed DIMM that has not already been marked). Additional operations may be performed subsequent to the write operations of
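Under the same illustrative assumptions as above, a write targeting a failed DIMM may be sketched as follows, mirroring the flow just described: three line reads (the two good data lines and the old parity line) and one line write (the updated parity line), with nothing written to the failed DIMM itself.

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_BYTES 64  /* illustrative memory line size */

    /* Write of new data targeted at a failed DIMM: the old line on the
     * failed DIMM is first reconstructed by XOR from the survivors,
     * then XOR-ed out of the parity and replaced by the new data. */
    static void write_to_failed_dimm(const uint8_t new_data[LINE_BYTES],
                                     const uint8_t good0[LINE_BYTES],  /* line read #1 */
                                     const uint8_t good1[LINE_BYTES],  /* line read #2 */
                                     uint8_t parity[LINE_BYTES])       /* line read #3 */
    {
        for (size_t b = 0; b < LINE_BYTES; b++) {
            uint8_t old_lost = good0[b] ^ good1[b] ^ parity[b]; /* reconstruct */
            parity[b] ^= old_lost ^ new_data[b];  /* line write #1: new parity */
        }
    }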
The system 500 may include at least one communications component 514 that enables communication with at least one other component, system, device and/or apparatus. As non-limiting examples, the communications component 514 may include a transceiver configured to send and receive information, a transmitter configured to send information and/or a receiver configured to receive information. As a non-limiting example, the communications component 514 may comprise a modem and/or network card. The system 500 of
It should be noted that in accordance with the exemplary embodiments of the invention, one or more of the circuitry 502, processor(s) 504, memory 506, storage 508, program logic 510 and/or communications component 514 may store one or more of the various items (e.g., data, databases, tables, items, vectors, matrices, variables, equations, formula, operations, operational logic, logic) discussed herein. As a non-limiting example, one or more of the above-identified components may receive and/or store the data, information, parity information and/or instructions/operations/commands. As a further non-limiting example, one or more of the above-identified components may receive and/or store the function(s), operations, functional components and/or operational components, as described herein.
Further in accordance with the exemplary embodiments of the invention, the storage 508 may comprise one or more memory modules (e.g., memory cards, DIMMs) that are connected together in order to collectively function as described herein. For example, the storage 508 may comprise a plurality of cascaded interconnect memory modules (e.g., with unidirectional busses). In further exemplary embodiments, the processor(s) 504 and/or circuitry 502 may comprise one or more memory controllers. In some exemplary embodiments, a plurality of memory controllers is provided such that each memory controller oversees operations for at least one channel coupling the respective memory controller to a corresponding memory module (e.g., part or all of memory 506, a DIMM).
The exemplary embodiments of this invention may be carried out by computer software implemented by the processor 504 or by hardware, or by a combination of hardware and software. As a non-limiting example, the exemplary embodiments of this invention may be implemented by one or more integrated circuits. The memory 506 may be of any type appropriate to the technical environment and may be implemented using any appropriate data storage technology, such as optical memory devices, magnetic memory devices, semiconductor-based memory devices, fixed memory and removable memory, as non-limiting examples. The processor 504 may be of any type appropriate to the technical environment, and may encompass one or more of microprocessors, general purpose computers, special purpose computers and processors based on a multi-core architecture, as non-limiting examples.
In some exemplary embodiments, at IPL two address spaces may be initialized—one for the parity-protected data and one for the unprotected data. As a non-limiting example, for a total address space of size R, the parity-protected region may be of size 0.75 R. For example, the exemplary system discussed above in
In other exemplary embodiments, different values of N may be available, such as N=8, for example. In further exemplary embodiments, the parity space may be one eighth of the total space (e.g., 7 data channels and 1 parity channel for 7 data memory modules and 1 parity memory module). In some exemplary embodiments, the address spaces may overlap. In such a case, and by way of example, the hypervisor may be responsible for allocating the spaces so as to prevent any overlap in the used memory (i.e., versus allocated/initialized).
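By way of a further non-limiting worked example of these proportions: with one parity channel among N channels, the usable fraction of the total address space is (N-1)/N and the parity overhead relative to the protected data is 1/(N-1). For N=4 this yields the 0.75 R protected region and the roughly 33% overhead noted above; for N=8 it yields a 0.875 R region and roughly 14% overhead.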
In accordance with the exemplary embodiments of the invention, there is no additional overhead incurred on a normal read operation. Furthermore, conventional ECC, CRC and UE operations may be performed. For example, corrective measures may be implemented on a UE.
As noted above, for N=4 (e.g., four channels, four DIMMs, one channel/DIMM used for parity) write operations incur two line reads (parity line and original line) and two line writes (data line and updated parity line). Also as noted above, the overhead for writes to failed DIMMs includes three line reads (the two good data lines and the old parity line) and one line write (updated parity line). There is no need to write to the failed DIMM unless attempting to scrub and recreate, for example. While a DIMM failure will necessitate usage of three times the normal bandwidth (due to the three reads), the DIMMs in the array may be accessed in parallel to minimize any further delays.
Also note that since the scope of protection is fully configurable, the additional overhead need not be incurred for all data. For example, there is no additional overhead for any non-protected data/regions.
The exemplary embodiments of the invention afford a number of advantages and benefits, discussed herein by way of non-limiting examples. The exemplary embodiments are capable of covering all memory errors. That is, the exemplary embodiments provide continued operation of the system even in the face of what would otherwise be crippling errors (e.g., if the failed DIMM 308-1 stored hypervisor data). Furthermore, the coverage is comprehensive in that it protects against errors in the memory controller 304, the channels 306 and the memory modules 308 (e.g., DIMMs), as non-limiting examples. The scope of protection is fully configurable such that it may be used to protect as much or as little of the data as desired (e.g., enabling selective control of overhead). In such a manner, the incurred overhead is similarly configurable/selectable. In addition, no packaging changes, such as any extra memory channels, are needed for implementation. In at least some cases, the exemplary embodiments of the invention can be implemented using existing hardware (e.g., operating with different software and/or logic). Furthermore, there is no read overhead on a normal read operation.
It is observed that while conventional systems may utilize parity information in conjunction with permanent or long-term storage (e.g., NVS, certain RAID arrays), the exemplary embodiments of the invention utilize parity information in conjunction with volatile memory (e.g., RAM, DRAM) to enable continued system operation even in the face of critical memory errors (e.g., UEs). In modern computing, there is a focus on uptime and reliability for critical systems. The exemplary embodiments of the invention enable more robust systems that are capable of continued performance despite errors that would otherwise cripple conventional systems.
Below are further descriptions of various non-limiting, exemplary embodiments of the invention. The below-described exemplary embodiments are numbered separately for clarity purposes. This numbering should not be construed as entirely separating the various exemplary embodiments since aspects of one or more exemplary embodiments may be practiced in conjunction with one or more other aspects or exemplary embodiments.
In one exemplary embodiment, and as illustrated in
The method may further comprise dynamically varying a first amount of protected data and a second amount of unprotected data. A first size of the first region and a third size of the third region may be dynamically variable. The method may further comprise allocating a total memory space of the plurality of random access memories among a group consisting of the first region, the second region and the third region. The method may further comprise reallocating a total memory space of the plurality of random access memories among a group consisting of the first region, the second region, the third region and a fourth region of the plurality of random access memories, where the fourth region consists of a portion of the plurality of random access memories that has been determined to be inaccessible or unusable. The method may further comprise reallocating a total memory space of the plurality of random access memories among a group consisting of the first region, the second region, the third region and a fourth region of the plurality of random access memories, where the fourth region consists of a portion of the plurality of random access memories that is inaccessible or unusable. The first region, the second region and the third region might not overlap one another. The method may further comprise using the parity information to reconstruct a lost or inaccessible portion of the protected data. The method may further comprise, in response to an uncorrectable error occurring for one of the plurality of random access memories, continuing usage of remaining ones of the plurality of random access memories by using the parity information to reconstruct lost or inaccessible protected data. The parity information may enable reconstruction of a portion of protected data stored on a random access memory that fails. The method may further comprise writing new protected data to one of the plurality of random access memories; computing updated parity information based on the new protected data; and writing the updated parity information to the second region of the plurality of random access memories. The method may further comprise, in response to a command to write new protected data to a random access memory that has failed, reading other protected data from others of the plurality of random access memories and reading the parity information from the second region of the plurality of random access memories; reconstructing missing protected data for the failed random access memory based on the other protected data and the parity information; determining new parity information based on the new protected data and the reconstructed missing protected data; and writing the new parity information to the second region of the plurality of random access memories. The plurality of random access memories may consist of four memory modules. The plurality of random access memories may consist of eight memory modules.
In one example a computer-readable storage medium storing program instructions may be provided, execution of the program instructions resulting in operations comprising storing, by an apparatus, data on a first portion of a plurality of random access memories; and storing, by the apparatus, parity information for the stored data on a second portion of the plurality of random access memories.
The operations may further comprise dynamically varying a first amount of protected data and a second amount of unprotected data. A first size of the first region and a third size of the third region may be dynamically variable. The operations may further comprise allocating a total memory space of the plurality of random access memories among a group consisting of the first region, the second region and the third region. The operations may further comprise reallocating a total memory space of the plurality of random access memories among a group consisting of the first region, the second region, the third region and a fourth region of the plurality of random access memories, where the fourth region consists of a portion of the plurality of random access memories that has been determined to be inaccessible or unusable. The operations may further comprise reallocating a total memory space of the plurality of random access memories among a group consisting of the first region, the second region, the third region and a fourth region of the plurality of random access memories, where the fourth region consists of a portion of the plurality of random access memories that is inaccessible or unusable. The first region, the second region and the third region might not overlap one another. The operations may further comprise using the parity information to reconstruct a lost or inaccessible portion of the protected data. The operations may further comprise, in response to an uncorrectable error occurring for one of the plurality of random access memories, continuing usage of remaining ones of the plurality of random access memories by using the parity information to reconstruct lost or inaccessible protected data. The parity information may enable reconstruction of a portion of protected data stored on a random access memory that fails. The operations may further comprise writing new protected data to one of the plurality of random access memories; computing updated parity information based on the new protected data; and writing the updated parity information to the second region of the plurality of random access memories. The operations may further comprise, in response to a command to write new protected data to a random access memory that has failed, reading other protected data from others of the plurality of random access memories and reading the parity information from the second region of the plurality of random access memories; reconstructing missing protected data for the failed random access memory based on the other protected data and the parity information; determining new parity information based on the new protected data and the reconstructed missing protected data; and writing the new parity information to the second region of the plurality of random access memories.
In one type of example apparatus, the apparatus may comprise at least one memory controller; and a plurality of random access memories, where the at least one memory controller is configured to allocate the plurality of random access memories among at least a first portion and a second portion, where the first portion is configured to store data, where the second portion is configured to store parity information for the stored data.
The at least one memory controller may be configured to dynamically vary a first amount of protected data and a second amount of unprotected data. The at least one memory controller may be configured to allocate a total memory space of the plurality of random access memories among a group consisting of the first region, the second region and the third region. The at least one memory controller may be configured to reallocate a total memory space of the plurality of random access memories among a group consisting of the first region, the second region, the third region and a fourth region of the plurality of random access memories, where the fourth region consists of a portion of the plurality of random access memories that has been determined to be inaccessible or unusable. The at least one memory controller may be configured to reallocate a total memory space of the plurality of random access memories among a group consisting of the first region, the second region, the third region and a fourth region of the plurality of random access memories, where the fourth region consists of a portion of the plurality of random access memories that is inaccessible or unusable. The at least one memory controller may be configured to use the parity information to reconstruct a lost or inaccessible portion of the protected data. The at least one memory controller may be configured to, in response to an uncorrectable error occurring for one of the plurality of random access memories, continue usage of remaining ones of the plurality of random access memories by using the parity information to reconstruct lost or inaccessible protected data. The at least one memory controller may be configured to write new protected data to one of the plurality of random access memories; compute updated parity information based on the new protected data; and write the updated parity information to the second region of the plurality of random access memories. The at least one memory controller may be configured to, in response to a command to write new protected data to a random access memory that has failed, read other protected data from others of the plurality of random access memories and read the parity information from the second region of the plurality of random access memories; reconstruct missing protected data for the failed random access memory based on the other protected data and the parity information; determine new parity information based on the new protected data and the reconstructed missing protected data; and write the new parity information to the second region of the plurality of random access memories.
The exemplary embodiments of the invention, as discussed herein and as particularly described with respect to exemplary methods, may be implemented in conjunction with a program storage device (e.g., at least one memory) readable by a machine, tangibly embodying a program of instructions (e.g., a program or computer program) executable by the machine for performing operations. The operations comprise steps of utilizing the exemplary embodiments or steps of the method.
The blocks shown in
In addition, the arrangement of the blocks depicted in
That is, the exemplary embodiments of the invention shown in
Any use of the terms “connected,” “coupled” or variants thereof should be interpreted to indicate any such connection or coupling, direct or indirect, between the identified elements. As a non-limiting example, one or more intermediate elements may be present between the “coupled” elements. The connection or coupling between the identified elements may be, as non-limiting examples, physical, electrical, magnetic, logical or any suitable combination thereof in accordance with the described exemplary embodiments. As non-limiting examples, the connection or coupling may comprise one or more printed electrical connections, wires, cables, mediums or any suitable combination thereof.
Generally, various exemplary embodiments of the invention can be implemented in different mediums, such as software, hardware, logic, special purpose circuits or any combination thereof. As a non-limiting example, some aspects may be implemented in software which may be run on a computing device, while other aspects may be implemented in hardware.
Features as described herein may provide a selective redundant array of independent memory for a computer's main random access memory. This may utilize conventional RAM memory modules and striping algorithms to protect against the failure of any particular module and keep the memory system operating continuously. It may support several DRAM device error checking and correcting (ECC) computer memory technologies that protect computer memory systems from any single memory chip failure, as well as from multi-bit errors from any portion of a single memory chip, and from entire memory channel failures. The features as described herein may be much more robust than parity checking and ECC memory technologies, which cannot protect against many varieties of memory failures.
With features as described herein, not all of the data written or stored in the memory modules need be stored as protected data. The memory modules may store both protected data and unprotected data. Thus, not all of the data written to the memory modules needs to be provided with corresponding parity information. This may provide a “selective” redundancy for the array of independent memory where less than all of the data written or stored in the memory modules needs to have parity information also stored in the memory, and where the redundancy may be provided by data reconstruction of only the protected data (not the unprotected data) using parity information.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the best method and apparatus presently contemplated by the inventors for carrying out the invention. Various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications will still fall within the scope of the teachings of the exemplary embodiments of the invention.
An arrangement of four (4) memory modules is the exemplary implementation described above. However, a minimum configuration may comprise 3 memory modules (2 memory modules of data and 1 memory module of parity). The parity may rotate across the memory modules. For any given stripe of data in the four-module arrangement, there may be 3 memory modules of data and 1 memory module of parity, but different stripes may put the parity information on different memory modules.
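Such parity rotation, a RAID-5-like placement, may be sketched with a simple mapping in C (the names are illustrative): for stripe s across m modules, the parity strip is placed on module s mod m and the data strips occupy the remaining modules, so that no single module becomes a parity hot spot.

    /* For stripe 's' across 'm' modules, parity lives on module (s % m).
     * Returns the module holding data strip 'k' (0 <= k < m-1) of that
     * stripe, skipping over the parity module. */
    static int data_module(int s, int m, int k)
    {
        int parity = s % m;               /* module holding this stripe's parity */
        return (k < parity) ? k : k + 1;  /* skip over the parity module */
    }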
Furthermore, some of the features of the preferred embodiments of this invention could be used to advantage without the corresponding use of other features. As such, the foregoing description should be considered as merely illustrative of the principles of the invention, and not in limitation thereof.