This application incorporates by reference the following co-pending commonly owned U.S. patent applications: (i) “SYSTEMS AND METHODS FOR PROVIDING ERROR CORRECTION CODE TESTING FUNCTIONALITY,” application Ser. No. 10/435,149, filed May 9, 2003, in the name(s) of: Christopher M. Brueggen (U.S. Patent Application Publication No. 2004/0225943; published Nov. 11, 2004); (ii) “SYSTEMS AND METHODS FOR PROCESSING AN ERROR CORRECTION CODE WORD FOR STORAGE IN MEMORY COMPONENTS,” application Ser. No. 10/435,150, filed May 9, 2003, in the name(s) of: Christopher M. Brueggen (U.S. Patent Application Publication No. 2004/0225944; published Nov. 11, 2004); (iii) “RAID MEMORY SYSTEM,” application Ser. No. 10/674,262, filed Sep. 29, 2003, in the name(s) of: Larry Thayer, Eric McCutcheon Rentschler and Michael Kennard Tayler (U.S. Patent Application Publication No. 2005/0071554; published Mar. 31, 2005); and (iv) “MEMORY CORRECTION SYSTEM AND METHOD,” application Ser. No. 11/214,697, filed Sep. 30, 2005, in the name(s) of: Larry Thayer.
Electronic data storage utilizing commonly available memories (such as Dynamic Random Access Memory or DRAM) can be problematic. Specifically, there is a finite probability that, when data is stored in memory and subsequently retrieved, the retrieved data will suffer some corruption. For example, DRAM stores information in relatively small capacitors that may suffer a transient corruption due to a variety of mechanisms, e.g., charged particles or radiation (i.e., soft errors). Additionally, data corruption may occur as the result of hardware failures such as loose memory modules, blown chips, wiring defects, and/or the like. The errors caused by such failures are often referred to as repeatable errors, since the same physical mechanism repeatedly causes the same pattern of data corruption.
To address this problem, a variety of error detection and error correction algorithms have been developed. In general, error detection algorithms employ redundant data added to a string of data. The redundant data is calculated utilizing a checksum or cyclic redundancy check (CRC) operation. When the string of data and the original redundant data are retrieved, the redundant data is recalculated utilizing the retrieved data. If the recalculated redundant data does not match the original redundant data, data corruption in the retrieved data is detected.
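By way of illustration only, the detection flow described above may be sketched in Python. The sketch assumes a CRC-32 as the redundant data, computed with the standard library function `zlib.crc32`; the helper names `store` and `is_corrupted` are hypothetical and are not part of the disclosure.

```python
import zlib

def store(payload: bytes) -> tuple[bytes, int]:
    """Return the payload together with the CRC-32 calculated when it is stored."""
    return payload, zlib.crc32(payload)

def is_corrupted(retrieved: bytes, stored_crc: int) -> bool:
    """Recalculate the CRC over the retrieved data and compare it with the original."""
    return zlib.crc32(retrieved) != stored_crc

data, crc = store(b"hierarchical memory")
# Simulate a soft error by flipping one bit of the first byte.
flipped = bytes([data[0] ^ 0x01]) + data[1:]

print(is_corrupted(data, crc))     # False
print(is_corrupted(flipped, crc))  # True
```

A mismatch of this kind only signals that corruption occurred; it does not identify which bits are in error.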
Error correction code (ECC) algorithms operate in a manner similar to error detection algorithms. When data (or, payload) is stored, redundant data is calculated and stored in association with the data. When the data and the redundant data are subsequently retrieved, the redundant data is recalculated and compared to the retrieved redundant data. When an error is detected (e.g., the original and recalculated redundant data do not match), the original and recalculated redundant data may be used to correct certain categories of errors.
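The store, recalculate and correct sequence may be illustrated with a deliberately small code. The following Python sketch uses a Hamming(7,4) code, which is chosen here only for brevity and is not the ECC of the present disclosure; the function names are hypothetical.

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits as 7 bits laid out [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers bit positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers bit positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers bit positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(w: list[int]) -> list[int]:
    """Recompute the parity checks and repair at most one flipped bit."""
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]
    position = s1 + 2 * s2 + 4 * s3   # 0 means no error; otherwise 1-based position
    fixed = list(w)
    if position:
        fixed[position - 1] ^= 1
    return fixed

word = hamming74_encode([1, 0, 1, 1])
damaged = list(word)
damaged[4] ^= 1                       # single-bit error in the stored word
assert hamming74_correct(damaged) == word
```

The mismatch between the stored and recalculated parity bits does more than flag the error: read as a binary number, it gives the position of the flipped bit.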
Although current ECC solutions are known to be generally effective in addressing certain types of memory errors, higher levels of reliability are constantly being pursued in the design of memory systems.
A hierarchical error correction scheme operable with a memory system is set forth hereinbelow. In one embodiment, the memory system comprises a plurality of memory modules organized as a number of ECC domains, wherein each ECC domain includes a set of memory modules, each memory module comprising a plurality of memory devices. A first error correction engine is provided for correcting device-level errors associated with a specific memory device and a second error correction engine for correcting errors at a memory module level, wherein the first and second error correction engines are operable in association with a memory controller operably coupled to the plurality of memory modules.
Representative embodiments of the present patent disclosure will now be described with reference to various examples wherein like reference numerals are used throughout the description and several views of the drawings to indicate like or corresponding parts, and further wherein the various elements are not necessarily drawn to scale. Referring to
In one exemplary implementation, the memory controller complex 104 and associated hierarchical EDC module 106 may be operably coupled to the memory modules 108-1 through 108-N via any suitable interconnect topology 107 to form a memory system, wherein the interconnect topology 107 allows for the practice of the teachings set forth herein without regard to data bus widths (i.e., different data word sizes including redundant data for error correction), data bus segmentation, bandwidth capacities, clock speeds, etc., except the requirement that the interconnect topology 107 be preferably adaptable to operate with a variable number of memory modules that may be hierarchically organized into a number of logical levels. Conceptually, an embodiment of the hierarchical memory organization is envisioned to comprise at the lowest level a plurality of individual memory devices (not shown) that are grouped into a number of memory modules, e.g., memory modules 108-1 through 108-N, which in turn may be arranged as a plurality of ECC domains wherein each ECC domain includes a set of memory modules. Clearly, additional and/or alternative levels of hierarchical organization may be implemented in other arrangements. Regardless, the logic associated with the EDC module 106 is operable to isolate memory errors at each level (e.g., a chip-level error that may render an entire memory device inoperable, or a module-level error that may render an entire memory module inoperable), and apply suitable level-specific error correction engines that correct the multi-level errors in order to improve memory system reliability.
Referring now to
For purposes of one embodiment of the present patent disclosure, six memory modules 308-1, 308-2, 310-1, 310-2, 312-1, 312-2, are exemplified that are arranged as three pairs, wherein a pair of memory modules is operated as a particular ECC domain. Those skilled in the art should recognize that although only six memory modules are shown, there may be more than six modules in other embodiments. Additionally, there may be other arrangements with a plurality of ECC domains wherein more than two memory modules or portions thereof (i.e., a set of modules) are operated as a single ECC domain. Furthermore, a memory module may be generalized as a grouping of memory devices that are physically and/or logically treated as a single unit by the memory controller 302. Thus, where pair-based ECC domains are provided (i.e., a pair of memory modules defining each ECC domain), for a given P memory modules, P being an even number, the total number of ECC domains is P/2. In accordance with one embodiment of the present patent disclosure, module-level redundancy is provided with respect to ECC data storage (wherein each ECC word or sub-word includes a predetermined number of data bits as well as a predetermined number of ECC bits depending on memory system design and applicable ECC technique) such that for a set of ECC domains (e.g., a pair of ECC domains) there is provided an ECC domain that is operated as a redundancy domain associated therewith. For instance, where a redundant domain is provided for a pair of ECC domains, if the total number of ECC domains is P/2, then 2/3 of the ECC domains may be used for ECC-added data storage and the remaining 1/3 of the domains may be used as the redundancy domains (i.e., 2:1 ratio between the data storage domains and their corresponding redundancy domains) respectively associated therewith.
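The domain arithmetic described above may be checked with a short Python sketch. The function name `partition_domains` and its parameters are hypothetical; the defaults assume pair-based ECC domains and one redundancy domain per pair of data storage domains, as in the exemplary arrangement.

```python
def partition_domains(num_modules: int, modules_per_domain: int = 2,
                      data_domains_per_redundant: int = 2) -> dict[str, int]:
    """Split P memory modules into data storage domains and redundancy domains."""
    if num_modules % modules_per_domain:
        raise ValueError("modules must divide evenly into ECC domains")
    total_domains = num_modules // modules_per_domain        # P/2 for pair-based domains
    group = data_domains_per_redundant + 1                   # e.g., two data domains plus one redundancy domain
    if total_domains % group:
        raise ValueError("domains must divide evenly into redundancy groups")
    redundant = total_domains // group                       # 1/3 of the domains by default
    return {"total": total_domains,
            "data": total_domains - redundant,               # 2/3 of the domains by default
            "redundant": redundant}

print(partition_domains(6))   # {'total': 3, 'data': 2, 'redundant': 1}
```

With the six exemplary modules, this yields three ECC domains, of which two store ECC-added data and one holds redundancy.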
Accordingly, reference numerals 306-i and 306-j refer to two exemplary ECC domains operable as data storage domains which include memory module pairs 308-1/308-2 and 310-1/310-2 (labeled as E-DOMi and E-DOMj, respectively). Further, the redundant ECC domain is labeled as R-DOMk 306-k which contains parity data based on XORing of the contents of E-DOMi 306-i with the contents of E-DOMj 306-j. An XOR engine or circuit (not explicitly shown in
$$\text{R-DOMk} = \text{E-DOMi} \oplus \text{E-DOMj}$$
wherein the symbol $\oplus$ denotes the bit-wise Exclusive-OR operation performed with respect to the set of ECC domains, e.g., E-DOMi and E-DOMj domains.
Because of the module-level redundancy provided in the memory system architecture by way of XOR circuitry, similar circuitry may be used as a module-level error correction engine for recovering data from an ECC domain that is known to be faulty. For example, if the data in a module of the domain E-DOMj 306-j is determined to be faulty or corrupted for some reason, that data may be recovered by an XOR engine operable to effectuate the following processing:
$$\text{Corr}\{\text{E-DOMj}\} = \text{E-DOMi} \oplus \text{R-DOMk}$$
which can be executed independent of any lower level ECC processing for correcting errors that may concurrently occur elsewhere in the memory system 300 of
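The parity generation and module-level recovery described above may be sketched in Python as follows. The byte strings standing in for the domain contents and the helper name `xor_bytes` are assumptions made for illustration; an actual XOR engine would operate in hardware on ECC words.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bit-wise Exclusive-OR of two equal-length blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

# Hypothetical contents of the two data storage domains.
e_dom_i = bytes.fromhex("0123456789abcdef")
e_dom_j = bytes.fromhex("fedcba9876543210")

# Parity stored in the redundancy domain: R-DOMk = E-DOMi XOR E-DOMj.
r_dom_k = xor_bytes(e_dom_i, e_dom_j)

# If E-DOMj is known to be faulty, it is rebuilt from the surviving domains.
recovered_j = xor_bytes(e_dom_i, r_dom_k)
assert recovered_j == e_dom_j
```

Recovery works because XOR is its own inverse: combining E-DOMi with the parity cancels E-DOMi and leaves the contents of E-DOMj.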
One skilled in the art will recognize upon reference hereto that XOR engines described above may be embodied as a single module, e.g., module-level error correction module 305B, associated with the memory controller 302, although they may be implemented as separate circuits as well. As a further generalization, it should be realized that if the number of ECC data storage domains is M, then the maximum number of ECC domain pairs between which XOR processing may be effectuated will be:
$$\text{Max}\{\text{Number of XOR pairs}\} = {}^{M}C_{r} = \frac{M!}{r!\,(M-r)!}$$
where r=2. However, not all XOR pairs need to be processed and stored (which would be prohibitively inefficient) for purposes of providing module-level redundancy in accordance with the teachings of the present disclosure. In fact, where the number of ECC data storage domains, M, is an even number, then M/2 XOR pairs (thus, M/2 redundant domains) will be sufficient for providing complete module-level redundancy. On the other hand, to optimize storage, an expected rate of module failure may also be taken into account. That is, for example, if x modules are expected to fail and if the system is to recover from all x failing modules, then 2x redundant modules need to be provisioned.
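The counts discussed above may be compared numerically with a brief Python sketch. The function names are hypothetical; `math.comb` is the standard library binomial coefficient.

```python
import math

def xor_pair_counts(m: int) -> tuple[int, int]:
    """Return (maximum number of XOR pairs, pairs sufficient for redundancy) for M data domains."""
    if m < 2 or m % 2:
        raise ValueError("M is assumed to be an even number of at least 2")
    return math.comb(m, 2), m // 2

def redundant_modules_for(expected_failures: int) -> int:
    """Redundant modules to provision so that x failing modules can be recovered (2x)."""
    return 2 * expected_failures

print(xor_pair_counts(8))        # (28, 4)
print(redundant_modules_for(1))  # 2
```

For eight data storage domains, 28 distinct pairs exist, yet four XOR pairs (and thus four redundant domains) suffice.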
A more generalized treatment of the above concept may be appreciated as follows regardless of the number of memory modules per ECC domain. To correct N bits with 100% certainty, one would need about 2N redundant bits. Thus, within an ECC domain, two spare DRAM devices may be provided (one per each module) to correct up to one failing DRAM. Also, in similar fashion, two redundant memory modules may be provided to correct any failing module. Accordingly, a number of implementations are possible, e.g., one data module with two redundant modules; two data modules with two redundant modules; four data modules with two redundant modules; six data modules with two redundant modules, etc., each implementation with the capability to correct any one failing memory module.
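The storage cost of the implementations listed above may be compared with a short Python sketch. The function name `redundancy_fraction` is hypothetical, and the measure shown (redundant modules as a fraction of all modules) is merely one assumed way of expressing overhead.

```python
from fractions import Fraction

def redundancy_fraction(data_modules: int, redundant_modules: int = 2) -> Fraction:
    """Fraction of all provisioned modules that hold redundancy rather than data."""
    return Fraction(redundant_modules, data_modules + redundant_modules)

for data_modules in (1, 2, 4, 6):
    print(data_modules, redundancy_fraction(data_modules))
# 1 2/3
# 2 1/2
# 4 1/3
# 6 1/4
```

Each configuration retains the capability to correct any one failing module, while the relative overhead falls as more data modules share the same two redundant modules.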
As to errors that may afflict individual memory devices, e.g., DRAM chips, within the memory modules of the memory system 300, any known or heretofore unknown ECC technique may be implemented that is operable at the device level of a particular memory organization.
As pointed out earlier, the exemplary ECC arrangement associated with the memory domain 400 involves 16 bits of redundancy (i.e., equivalent to the output of four x4 DRAM devices) which may be localized in a particular subset of the memory devices of the ECC domain, or distributed or scattered anywhere in the two memory modules 402A, 402B. For purposes of illustration, four DRAMs 410-7, 410-8, 410-16 and 410-17 are highlighted to represent the amount of ECC/redundant data as a redundancy block 412 in
By way of example, representative embodiments of the ECC domain 400 may utilize a suitable Reed-Solomon burst error correction algorithm to effectuate byte correction capability. In Reed-Solomon algorithms, the code word comprises n m-bit numbers: $C = (c_{n-1}, c_{n-2}, \ldots, c_0)$. The ECC word may be represented mathematically by the following polynomial of degree at most n-1 with the coefficients (symbols) being elements in the finite Galois field GF($2^m$):
$$C(x) = c_{n-1}x^{n-1} + c_{n-2}x^{n-2} + c_{n-3}x^{n-3} + \cdots + c_0$$
The ECC code word is generated utilizing a generator polynomial (typically denoted by g(x)). Specifically, the payload data (denoted by u(x)) is encoded with the generator polynomial using systematic coding, wherein the original payload bits (i.e., actual data bits) are caused to appear explicitly in defined positions of the code word. The original payload bits are represented by $x^{n-k}u(x)$ and the redundancy information is represented by $x^{n-k}u(x) \bmod g(x)$.
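Systematic encoding of this kind may be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the encoder of the disclosure: it assumes GF(2^8) with the primitive polynomial 0x11D, a narrow-sense generator polynomial with three roots, and polynomials held as coefficient lists with the highest degree first. All function names are hypothetical.

```python
PRIM = 0x11D  # assumed primitive polynomial x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a: int, b: int) -> int:
    """Multiply two GF(2^8) elements (shift-and-add with modular reduction)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= PRIM
        b >>= 1
    return result

def poly_mul(p: list[int], q: list[int]) -> list[int]:
    """Multiply two polynomials with GF(2^8) coefficients."""
    out = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] ^= gf_mul(a, b)
    return out

def generator_poly(nsym: int) -> list[int]:
    """g(x) = (x + a^1)(x + a^2)...(x + a^nsym), with a = 2 (narrow-sense)."""
    g, root = [1], 2
    for _ in range(nsym):
        g = poly_mul(g, [1, root])
        root = gf_mul(root, 2)
    return g

def poly_mod(dividend: list[int], divisor: list[int]) -> list[int]:
    """Remainder of dividend / divisor; the divisor is assumed monic."""
    rem = list(dividend)
    for i in range(len(dividend) - len(divisor) + 1):
        coef = rem[i]
        if coef:
            for j, d in enumerate(divisor):
                rem[i + j] ^= gf_mul(d, coef)
    return rem[-(len(divisor) - 1):]

def rs_encode(payload: list[int], nsym: int) -> list[int]:
    """Code word = payload symbols followed by x^(n-k) u(x) mod g(x)."""
    shifted = list(payload) + [0] * nsym          # x^(n-k) u(x)
    return list(payload) + poly_mod(shifted, generator_poly(nsym))

payload = list(range(1, 34))                      # 33 payload symbols
codeword = rs_encode(payload, 3)                  # 36-symbol code word
print(codeword[:33] == payload)                   # True: payload appears explicitly
print(poly_mod(codeword, generator_poly(3)))      # [0, 0, 0]
```

The final line shows that a valid code word leaves a zero remainder when divided by g(x), which is the property later exploited to detect and locate errors.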
When the ECC code word (e.g., a 72-bit or 144-bit code word) is subsequently retrieved from memory, the retrieved code word may suffer data corruption due to a transient failure and/or a repeatable failure. The retrieved code word may be represented by the polynomial r(x). If r(x) includes data corruption, r(x) differs from C(x) by an error signal e(x). The redundancy information is recalculated from the retrieved ECC code word. Subsequently, the original redundancy information as stored in memory and the newly calculated redundancy information are combined utilizing an XOR operation to form what is known as the syndrome polynomial s(x) which is also related to the error signal. Using this relationship, several algorithms are operable to determine the error signal and thus correct the errors in the corrupted data represented by r(x). For example, these techniques include error-locator polynomial determination, root finding for determining the positions of error(s), and error value determination for determining the correct bit-pattern of the error(s). Additional details regarding the implementation of a byte error correction algorithm coupled with data scattering may be found in one or more of the following co-pending commonly owned U.S. patent applications: (i) “SYSTEMS AND METHODS FOR PROVIDING ERROR CORRECTION CODE TESTING FUNCTIONALITY,” application Ser. No. 10/435,149, filed May 9, 2003, in the name(s) of: Christopher M. Brueggen; and (ii) “SYSTEMS AND METHODS FOR PROCESSING AN ERROR CORRECTION CODE WORD FOR STORAGE IN MEMORY COMPONENTS,” application Ser. No. 10/435,150, filed May 9, 2003, in the name(s) of: Christopher M. Brueggen, each of which has been incorporated by reference hereinabove.
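The relationship between the syndrome polynomial and the error signal may be made explicit as follows. This is a sketch only: the symbols p(x), u'(x) and p'(x) are introduced here for illustration, denoting respectively the stored redundancy, the retrieved payload and the retrieved redundancy, and addition of polynomials over GF($2^m$) is the XOR operation.

```latex
\begin{align*}
C(x) &= x^{n-k}u(x) + p(x), \qquad p(x) = x^{n-k}u(x) \bmod g(x) \\
r(x) &= C(x) + e(x) = x^{n-k}u'(x) + p'(x) \\
s(x) &= p'(x) + \bigl( x^{n-k}u'(x) \bmod g(x) \bigr) \\
     &= r(x) \bmod g(x) \\
     &= e(x) \bmod g(x)
\end{align*}
```

The last step follows because C(x) is a multiple of g(x) by construction, so the syndrome depends only on the error signal and not on the payload.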
Furthermore, capabilities such as chip-kill correct techniques and chip erasure techniques may also be implemented at the device level hierarchy using related ECC algorithms and methodologies. Exemplary chip-kill correct techniques allow a memory system, e.g., memory system 300 shown in
In one representative embodiment, the ECC algorithm of a memory controller may implement the decoding procedure of a [36, 33, 4] shortened narrow-sense Reed-Solomon code (where the code word length is 36 symbols, the payload length is 33 symbols, and the minimum Hamming distance is 4 symbols) over a finite Galois field GF($2^8$) that defines the symbol length to be 8 bits. By adapting the ECC algorithm in this manner, the ECC algorithm may perform both randomly located single-byte correction as well as erasure correction which may involve correction of known multi-byte failures.
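The correction capability of such a code may be worked out from its parameters. The following uses the standard bound for Reed-Solomon codes, with t denoting the number of randomly located symbol errors and s the number of erasures (symbol failures at known locations); the symbols t and s are introduced here for illustration.

```latex
\begin{align*}
d_{\min} &= n - k + 1 = 36 - 33 + 1 = 4 \\
2t + s &\le d_{\min} - 1 = 3 \\
(t, s) &\in \{(1, 0),\ (1, 1),\ (0, 1),\ (0, 2),\ (0, 3)\}
\end{align*}
```

Thus the three check symbols allow one randomly located single-byte error to be corrected (optionally together with one erasure), or up to three erased bytes when their locations are known in advance.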
Those skilled in the art should recognize that in additional and/or alternative embodiments, other ECC algorithms such as, e.g., Bose-Chaudhuri-Hocquenghem (BCH) codes, Reed-Muller codes, binary Golay codes and Goppa codes, etc., may also be implemented for purposes of error correction at the device level in accordance with the teachings set forth herein. Accordingly, a device level error correction engine implemented within an embodiment of the hierarchical memory error correction scheme of the present disclosure is envisioned to comprehend all such ECC engines, including chip-kill correct and chip erasure capabilities in certain arrangements.
Although the invention has been described with reference to certain exemplary embodiments, it is to be understood that the forms of the invention shown and described are to be treated as illustrative only. Accordingly, various changes, substitutions and modifications can be realized without departing from the scope of the present invention as set forth in the following claims.
Number | Name | Date | Kind |
---|---|---|---|
6243845 | Tsukamizu et al. | Jun 2001 | B1 |
6493843 | Raynham | Dec 2002 | B1 |
6715116 | Lester et al. | Mar 2004 | B2 |
6785835 | MacLaren et al. | Aug 2004 | B2 |
6845472 | Walker et al. | Jan 2005 | B2 |
6883131 | Acton | Apr 2005 | B2 |
6918007 | Chang et al. | Jul 2005 | B2 |
20040225943 | Brueggen | Nov 2004 | A1 |
20040225944 | Brueggen | Nov 2004 | A1 |
20050027891 | Emmot et al. | Feb 2005 | A1 |
20050071554 | Thayer et al. | Mar 2005 | A1 |
20050080958 | Handgen et al. | Apr 2005 | A1 |
20050160329 | Briggs et al. | Jul 2005 | A1 |
Number | Date | Country | |
---|---|---|---|
20070047344 A1 | Mar 2007 | US |