Modern computer systems include ever increasing amounts of system memory in which the computer system stores programs and data that are currently in use. Even a typical personal computer system may now include several gigabytes of dynamic random access memory (DRAM), which typically forms the largest portion of system memory. Server computer systems, such as a Web server, may include hundreds of gigabytes or even terabytes of DRAM to store programs and data associated with a Web site or a corporate database.
The larger the storage capacity of a system memory the more likely that errors in data and programs stored in the memory will occur. Note that in the present discussion the term “data” will be used to refer to any type of data stored in the system memory, including program instructions and data generated and utilized by programs currently being executed by the computer system. For a system memory including a gigabyte of DRAM, there are eight billion (8,000,000,000) individual DRAM memory cells or locations (assuming 8 bits of data per byte), each location typically storing a single bit of data. With such a large number of memory locations, errors in data bits can occur for a variety of different reasons, such as electrical noise, thermal noise, and high-energy particles like neutrons and alpha particles impacting the memory locations. For example, a high-energy particle impacting a DRAM memory location may change the amount of charge stored by that location and thereby cause the stored data bit corresponding to the stored charge to change from a logic 1 to a logic 0, or vice versa.
Errors in the data stored in system memory can result in a program providing a user with erroneous results or can cause the computer system to crash. As a result, various approaches have been utilized in system memories to prevent data errors from adversely affecting the operation of the computer system. One such approach is known as “memory mirroring” in which a duplicate copy of data is stored in the system memory. With this approach, upon the detection of a data error in a primary copy of the data, the duplicate copy of the data is utilized. This approach is costly both in terms of dollars and in terms of physical space because it doubles the memory capacity that is actually required.
To actually detect and correct erroneous data bits, a variety of approaches are utilized in conventional system memories. The most common is the detection of erroneous data bits through the addition of a parity bit. A parity bit is a bit added to a byte of data to make the number of logic 1s in the byte and parity bit either even or odd, as will be understood by those skilled in the art. A more advanced approach to both detect and correct erroneous data bits is through the use of error correcting codes (ECCs). The most commonly utilized ECC is a code capable of detecting single and double bit data errors and capable of correcting single bit data errors. This type of ECC code is known as a single error correction double error detection (SECDED) code.
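The parity bit described above can be illustrated with a minimal sketch. The function names `parity_bit` and `parity_ok` are hypothetical and chosen only for this illustration; the sketch assumes an 8-bit byte and is not taken from any particular memory design.

```python
def parity_bit(byte: int, even: bool = True) -> int:
    """Return the bit that makes the number of 1s in (byte + parity bit) even or odd."""
    ones = bin(byte & 0xFF).count("1")
    bit = ones % 2              # 1 when the byte holds an odd number of 1s
    return bit if even else bit ^ 1


def parity_ok(byte: int, stored_parity: int, even: bool = True) -> bool:
    """Recompute the parity of a byte read back and compare it with the stored bit."""
    return parity_bit(byte, even) == stored_parity


stored = parity_bit(0b10110010)              # computed when the byte is written
single_flip_seen = not parity_ok(0b10110011, stored)   # one flipped bit is detected
double_flip_seen = not parity_ok(0b10110001, stored)   # two flipped bits cancel out
```

A single parity bit can only signal that an odd number of bits changed. It cannot identify which bit is wrong, and an even number of flipped bits goes unnoticed, which is why the text turns to error correcting codes next.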
In operation, the system memory utilizes the horizontal error correcting codes HECC to detect and correct erroneous data bits in the associated words DW. The specific way in which the codes HECC are calculated from the data bits in the corresponding data word DW along with the way the codes are utilized to detect and correct erroneous data bits in the associated words will be understood by those skilled in the art, and thus, for the sake of brevity, will only be described generally herein. Briefly, before each data word DW is stored in the system memory an algorithm is applied to the data bits in the data word DW to thereby generate the corresponding HECC code. The data word DW along with the HECC code are then stored in the system memory. As will be appreciated by those skilled in the art, data in the form of the data words DW and the HECC codes may only be written to and read from the system memory one row at a time.
To detect erroneous data bits in each data word DW, the data word along with the corresponding HECC code are read from the system memory and the same algorithm is once again applied to the data bits in the read data word to generate a newly calculated HECC code. If the newly calculated HECC code does not equal the HECC code read from the system memory, then an error in the data bits of the data word DW has occurred. The algorithm generates the HECC codes in such a way that the values of the codes allow a certain number of erroneous data bits in the data word DW to be corrected. The rows of memory locations ML are sequentially read one row at a time and this process is repeated for each read data word DW to detect and correct erroneous data bits in that data word.
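The generate, store, re-generate, and compare cycle described above can be sketched as follows. The text does not specify the algorithm, so this sketch assumes the textbook extended Hamming (SECDED) construction: each data bit is assigned a codeword position that is not a power of two, the check value is the XOR of the positions of all bits that are 1, and one overall parity bit is added. The names `data_positions`, `encode`, and `check_word` are hypothetical.

```python
def data_positions(n):
    """First n codeword positions that are not powers of two: 3, 5, 6, 7, 9, ..."""
    out, p = [], 1
    while len(out) < n:
        if p & (p - 1):             # nonzero only when p is not a power of two
            out.append(p)
        p += 1
    return out


def parity(x):
    return bin(x).count("1") & 1


def encode(bits):
    """Return the code (check, overall) for a data word given as a list of 0/1 bits."""
    check = 0
    for bit, pos in zip(bits, data_positions(len(bits))):
        if bit:
            check ^= pos
    overall = (sum(bits) & 1) ^ parity(check)
    return check, overall


def check_word(bits, code):
    """Return (status, bits); status is 'ok', 'corrected', or 'uncorrectable'."""
    stored_check, stored_overall = code
    new_check, _ = encode(bits)                 # same algorithm applied again
    syndrome = new_check ^ stored_check         # nonzero when the codes differ
    odd = (sum(bits) & 1) ^ parity(stored_check) ^ stored_overall
    if syndrome == 0 and odd == 0:
        return "ok", bits
    if odd:                                     # odd number of flips: assume one
        pos = data_positions(len(bits))
        if syndrome in pos:                     # syndrome names the bad data bit
            fixed = list(bits)
            fixed[pos.index(syndrome)] ^= 1
            return "corrected", fixed
        if syndrome & (syndrome - 1) == 0:      # the flip hit the stored code itself
            return "corrected", list(bits)
    return "uncorrectable", list(bits)          # double bit error detected


word = [1, 0, 1, 1, 0, 0, 1, 0]
code = encode(word)                             # stored alongside the word
damaged = list(word)
damaged[5] ^= 1                                 # simulate a single bit error
status, repaired = check_word(damaged, code)
```

In this sketch a mismatch between the stored and newly calculated codes yields a nonzero syndrome, and the overall parity bit distinguishes a correctable single bit error from a double bit error that can only be detected.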
The system memory performs this detection and correction on each of the words DW whenever that word is accessed during normal operation of the computer system containing the system memory. In addition, the system memory typically executes a process that will be referred to herein as “horizontal scrubbing.” Horizontal scrubbing is a background process periodically executed by the system memory in which each data word DW and the associated code HECC are accessed and any errors detected and corrected. Such horizontal scrubbing is done independent of whether the data word DW is accessed during normal operation of the computer system and is ideally done frequently enough to ensure that single bit errors in any of the words do not become double bit errors.
Typically, the HECC codes are Hamming SECDED codes, meaning that each code can detect and correct a single bit error in the associated word DW and can detect double bit errors in that word. Hamming is the particular type of code and defines the way in which these SECDED codes are generated, as will be understood by those skilled in the art. As can be seen from the memory diagram, the overall storage capacity of the illustrated system memory is N×(M+K) bits of data. Note that the HECC codes occupy N×K of this overall storage capacity. If more sophisticated error detection and correction is desired, such as the ability to correct double bit errors, the width K of the HECC codes becomes even greater. This greater width K means that these codes undesirably occupy a greater percentage of the overall storage capacity of the system memory. As a result, the overall storage capacity of the system memory must be increased, which undesirably increases the size and cost of the system memory.
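The storage cost of the HECC codes can be made concrete with a short calculation. The sketch below assumes a standard Hamming SECDED code, for which the number of check bits is the smallest r satisfying 2^r >= M + r + 1, plus one overall parity bit. The function names are hypothetical, and codes capable of correcting double bit errors would need a larger K than computed here.

```python
def secded_check_bits(m: int) -> int:
    """Width K of a Hamming SECDED code protecting an M-bit data word."""
    r = 0
    while (1 << r) < m + r + 1:     # Hamming bound for single error correction
        r += 1
    return r + 1                    # one extra bit for double error detection


def storage_overhead(n_rows: int, m: int):
    """Return (K, bits used by codes, fraction of total capacity used by codes)."""
    k = secded_check_bits(m)
    total_bits = n_rows * (m + k)   # overall capacity N x (M + K)
    code_bits = n_rows * k          # portion N x K occupied by the HECC codes
    return k, code_bits, code_bits / total_bits


k64, code_bits64, fraction64 = storage_overhead(1024, 64)
```

Under these assumptions a 64-bit data word needs K = 8, so the codes occupy 8 of every 72 stored bits, or roughly 11 percent of the overall storage capacity.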
In operation, the system memory utilizes the horizontal parity bits HP and vertical parity bits VP to detect and correct single bit errors. For example, if any of the horizontal parity bits HP indicates an erroneous bit in the corresponding data word DW, the system memory then checks the vertical parity bits VP. One of the vertical parity bits VP will indicate an error in the corresponding bits of the data words DW in that column. The vertical parity bit VP that indicates the error signals the specific location of the erroneous bit in the data word DW that was determined to have such an erroneous bit by the corresponding horizontal parity bit HP. For example,
While the approach illustrated in
Another error checking and correction approach involves distributing bits of data among the memory locations in such a way that the failure of one component in the system memory may still be detected and corrected. This approach may be generically referred to as “enhanced ECC” and is referred to using different names by different companies in the memory industry. For example, International Business Machines Corp. uses the trademark “Chipkill” and Hewlett-Packard Co. uses the trademark “Chip Spare” to refer to this ECC approach. DRAM system memory is commonly formed by a number of dual in-line memory modules (DIMMs), and with the enhanced ECC approach bits of data are distributed among the DIMMs. In this way, the computer system may access data words that include bits from a number of the DIMMs such that the failure of any one of the DIMMs can be detected and the bits in the data word from the failed DIMM corrected.
There is a need for an improved system and method of detecting and correcting multiple bit errors in system memories without excessively increasing the portion of such memory for storing error correcting codes required for such detection and correction.
According to one aspect of the present invention, a method and system detects and corrects errors in data bits of data words stored in a system memory. Each data word includes a plurality of data bits and the method includes generating a horizontal error correcting code for each data word and storing each horizontal error correcting code in the system memory. Vertical error correcting codes are generated, with each vertical error correcting code being generated using a particular bit from all of the data words. Each vertical error correcting code is stored in the system memory. Vertical scrubbing is performed using the vertical error correcting codes to detect errors in the data words and horizontal scrubbing is performed using the horizontal error correcting codes to detect and correct errors in the data words.
The vertical scrubbing may also correct detected errors. The horizontal and vertical error correcting codes may, for example, be SECDED codes, enabling detection and correction during both horizontal and vertical scrubbing. Alternatively, the horizontal error correcting code may be a SECDED code and the vertical error correcting code a parity bit, meaning vertical scrubbing detects errors and horizontal scrubbing corrects such detected errors. The vertical scrubbing may be done automatically either through suitable hardware contained on memory modules in the system memory or by a memory controller in the system memory.
In the following description, certain details are set forth in conjunction with the described embodiments of the present invention to provide a sufficient understanding of the invention. One skilled in the art will appreciate, however, that the invention may be practiced without these particular details. Furthermore, one skilled in the art will appreciate that the example embodiments described below do not limit the scope of the present invention, and will also understand that various modifications, equivalents, and combinations of the disclosed embodiments and components of such embodiments are within the scope of the present invention. Embodiments including fewer than all the components of any of the respective described embodiments may also be within the scope of the present invention although not expressly described in detail below. Finally, the operation of well known components and/or processes has not been shown or described in detail below to avoid unnecessarily obscuring the present invention.
In one embodiment of the present invention, both the horizontal error correcting codes HECC and vertical error correcting codes VECC are Hamming SECDED codes. Thus, each HECC code can correct a single bit error in the corresponding data word DW and detect a double bit error in that data word. Similarly, each VECC code can correct a single bit error in the corresponding column of data bits of the data words DW and can detect a double bit error in this column of data bits. Note that other types of error correcting codes may be utilized for the horizontal and vertical error correcting codes HECC and VECC in other embodiments of the present invention, including codes capable of detecting and correcting more erroneous data bits. Also, in one embodiment of the present invention each of the vertical error correcting codes VECC is a single parity bit, as will be discussed in more detail below.
A process utilizing the horizontal error correcting codes HECC and vertical error correcting codes VECC of
Assume initially that all of the data words DW1-DWN have been written into and stored in respective rows of the system memory, and that as part of this process the horizontal error correcting codes HECC1-HECCN for each data word were calculated and stored along with the data word. Also, as each of the data words DW was being written into the corresponding row in the system memory, the bits in that data word were also utilized in calculating the vertical error correcting codes VECC1-VECCM. Each calculated vertical error correcting code VECC1-VECCM is stored in columns of memory locations ML in the system memory as illustrated in
The process 400 begins in step 402 and proceeds immediately to step 404 in which the vertical error correcting codes VECC are utilized to detect single and double bit errors in the corresponding columns of data bits of the data words DW. The step 404 includes accessing each of the N data words DW stored in the system memory. When each data word DW is accessed, the respective bits in that data word are stored or otherwise applied to circuitry (not shown) in the system memory such that after all N data words have been accessed a new value for each of the vertical error correcting codes VECC1-VECCM may be calculated. The previously calculated values for the vertical error correcting codes VECC1-VECCM are also accessed, and each is compared to the newly calculated value for that code (e.g., the newly calculated value for VECC1 is compared to the previous value for VECC1 stored in the memory). From this comparison, the process determines whether single or double bit errors exist in any of the columns of data bits associated with the VECC codes. This process of utilizing the VECC codes to detect single and double bit errors in the corresponding columns of data bits of the data words DW may be referred to as “vertical scrubbing” in the following description.
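The detection step 404 can be sketched in software, although the text describes it as being performed by circuitry. The sketch below streams the N rows once, accumulates one code per bit column, and then compares each new code with the stored one. It assumes the textbook extended Hamming construction (XOR of row positions plus an overall parity bit) as a stand-in for the unspecified VECC algorithm, and the names `compute_vecc` and `vertical_scrub_detect` are hypothetical.

```python
def row_positions(n):
    """First n positions that are not powers of two, one per row: 3, 5, 6, 7, ..."""
    out, p = [], 1
    while len(out) < n:
        if p & (p - 1):
            out.append(p)
        p += 1
    return out


def parity(x):
    return bin(x).count("1") & 1


def compute_vecc(rows):
    """Access each row once and return one (check, overall) code per bit column."""
    m = len(rows[0])
    check, ones = [0] * m, [0] * m
    for pos, word in zip(row_positions(len(rows)), rows):   # one row at a time
        for col, bit in enumerate(word):
            if bit:
                check[col] ^= pos       # accumulate the column's check value
                ones[col] ^= 1          # accumulate the column's data parity
    return [(c, o ^ parity(c)) for c, o in zip(check, ones)]


def vertical_scrub_detect(rows, stored_vecc):
    """Compare new and stored codes; return 'ok', 'single', or 'double' per column."""
    report = []
    for (new_c, new_o), (old_c, old_o) in zip(compute_vecc(rows), stored_vecc):
        syndrome = new_c ^ old_c
        odd = new_o ^ old_o ^ parity(syndrome)   # parity over data plus stored code
        if syndrome == 0 and odd == 0:
            report.append("ok")
        elif odd:
            report.append("single")
        else:
            report.append("double")
    return report


rows = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
stored = compute_vecc(rows)             # computed when the words were written
rows[2][1] ^= 1                         # simulate one flipped bit in column 1
report = vertical_scrub_detect(rows, stored)
```

Because each column is checked independently, two erroneous bits in the same data word fall in different columns and appear here as two separate single bit errors.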
At this point, the process proceeds to step 406 and determines whether any single or double bit errors have been detected in step 404. When this determination is negative, the process goes immediately to step 408 and ends. No further action is needed because no erroneous data bits in the data words DW were detected using the VECC codes.
When the determination in step 406 is positive, meaning at least one single or double bit error was detected for the columns of data bits associated with at least one of the VECC codes, the process goes to step 410. In step 410, the process determines whether only single bit errors were detected in the data words DW via the VECC codes. If this is true, the process goes to step 412 and corrects all detected single bit errors using the appropriate VECC codes. For example, referring to
After all single bit errors have been corrected in step 412, the process proceeds to step 408 and terminates. There is no need to utilize the HECC codes in this situation because the VECC codes, which are SECDED Hamming codes in this example, have been used to detect and correct all single bit errors in the associated columns of data bits in the data words DW. Note that multiple VECC codes could indicate errors and thus multiple bits within a given data word DW could actually be corrected via the vertical error correcting codes. For example, the data word DW3 could include erroneous data bits DW3<2> and DW3<4> as shown in
Returning to step 410, when the determination in this step is negative this means that at least one double bit error was detected in step 404 using the VECC codes. For example, a double bit error is shown in
The previously calculated value for the horizontal error correcting code HECC that was read from the memory is then compared to the newly calculated value for that code (e.g., the newly calculated value for HECC2 is compared to the previous value for HECC2 read from memory). From this comparison, the process detects and corrects all single bit errors in each of the data words DW. Note that this description assumes no double bit errors exist in any of the data words DW, which could occur as illustrated in
The likelihood of a double bit error in one of the data words DW would typically be relatively low, so this situation does not present a serious limitation to the utility of the error correction process 400. To further reduce the likelihood of such an occurrence, the error correction process 400 may be modified such that all single bit errors detected using the VECC codes are corrected using these codes prior to performing horizontal scrubbing using the HECC codes. In this embodiment, the process corrects all single bit errors detected using the VECC codes. Only after this is done and at least one of the VECC codes has detected a double bit error does the process perform horizontal scrubbing using the HECC codes. Returning to
Other embodiments of the error correcting processes utilizing the codes HECC and VECC are possible, and such processes may vary depending on the application of the system memory in which the process is being utilized. Also, the type of codes utilized for the HECC and VECC codes may similarly vary depending on the application as previously mentioned, and the type of each code may also affect the specific process that is executed. For example,
The error correction process 500 begins in step 502 and proceeds immediately to step 504 in which the process utilizes the vertical error correcting codes VECC to detect single bit errors in the associated columns of bits in the data words DW. Because each of the VECC codes is a single parity bit in this embodiment, only single bit errors can be detected and no errors corrected using the VECC codes. In detecting single bit errors using the single parity bit VECC codes, each of the N data words DW stored in the system memory is accessed and the respective bits in that data word are stored or otherwise applied to circuitry (not shown) in the system memory such that after all N data words have been accessed a new value for each of the parity bit VECC1-VECCM codes may be calculated. The previously calculated values for the parity bits VECC1-VECCM are also accessed, and each previously calculated parity bit is compared to the newly calculated parity bit (e.g., the newly calculated parity bit VECC1 is compared to the previously calculated parity bit VECC1 stored in the memory). From this comparison, the process determines whether any single bit errors exist in any of the columns of data bits associated with the VECC codes.
After all parity bits VECC have been utilized to determine whether any single bit errors exist in the corresponding columns of data bits in step 504, the process proceeds to step 506 and determines whether any single bit errors have been detected. If the determination in step 506 is negative, the process proceeds to step 508 and terminates since there are no detected single bit errors and thus presumably no errors in the data bits of any of the data words DW1-DWN. When the determination in step 506 is positive, at least one single bit error exists in at least one of the data words DW and the process proceeds to step 510. In step 510 the process performs horizontal scrubbing of the data words DW using the horizontal error correcting codes HECC as previously described. Single bit errors in the data words DW are detected and corrected using the HECC codes during this horizontal scrubbing. Note that it is possible that one or more of the data words DW could include multiple bit errors that cannot be corrected in this embodiment. Since the HECC codes are Hamming SECDED codes in this example embodiment, during horizontal scrubbing any double bit errors in any of the data words DW may be detected but not corrected. Any such detected double bit errors once again would typically be reported to the operating system of the computer system of which the system memory is a part.
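The control flow of steps 504 through 510 can be summarized in a short sketch. It assumes even column parity for the single-bit VECC codes and takes the horizontal SECDED checker as a caller-supplied function, since the text does not fix a particular algorithm. The names `column_parity`, `process_500`, and `check_word`, and the three status strings, are hypothetical.

```python
from typing import Callable, List, Tuple

Word = List[int]
Checker = Callable[[Word, object], Tuple[str, Word]]


def column_parity(rows: List[Word]) -> List[int]:
    """One even-parity bit per bit column, accumulated one row at a time."""
    par = [0] * len(rows[0])
    for word in rows:
        for col, bit in enumerate(word):
            par[col] ^= bit
    return par


def process_500(rows: List[Word], hecc: list, vecc: List[int],
                check_word: Checker) -> List[int]:
    """Run the parity-then-SECDED flow; return the rows with uncorrectable errors.

    check_word(word, code) is assumed to return (status, word) with status
    'ok', 'corrected', or 'uncorrectable'. The rows list is repaired in place.
    """
    # Steps 504 and 506: recompute the column parity bits and compare.
    if column_parity(rows) == vecc:
        return []                        # step 508: nothing detected, terminate
    # Step 510: horizontal scrubbing of every data word using its HECC code.
    uncorrectable = []
    for i, (word, code) in enumerate(zip(rows, hecc)):
        status, fixed = check_word(word, code)
        if status == "corrected":
            rows[i] = fixed              # write the corrected word back
        elif status == "uncorrectable":
            uncorrectable.append(i)      # to be reported to the operating system
    return uncorrectable
```

One property of this arrangement follows directly from how parity works: an even number of flipped bits in the same column leaves that column's parity bit unchanged, so such errors would not trigger horizontal scrubbing in this sketch.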
In another error correction process according to another embodiment, the order in which the VECC and HECC codes are utilized to detect and correct errors as shown and described with reference to
Each of the DIMMs 610 includes address, data, and control buses that are collectively illustrated as a memory bus 616 in
The chipset 608 includes the memory controller 606, which is coupled to the DIMMs 610 through the memory bus 616. The memory controller 606 applies commands in the form of address, data, and control signals to the DIMMs 610 over the memory bus 616 to read data from and write data to the DIMMs. The memory controller 606 supplies these commands to the DIMMs 610 in response to requests from a processor 618 applied to the controller over a system bus 620. The memory controller 606 includes ECC logic 622 that performs the horizontal scrubbing of data words stored in the DIMMs 610 using HECC codes stored in these DIMMs as previously described with reference to
The computer system 600 further includes one or more output devices 624 coupled to the processor 618 through the chipset 608. Typical output devices 624 include a printer and a video terminal. One or more input devices 626 are also coupled to the processor 618 through the chipset 608, such as a keyboard and a mouse. Mass storage devices 628 are also typically coupled to the processor 618 through the chipset 608 to store and retrieve large amounts of data from external storage media (not shown). Examples of typical mass storage devices 628 include hard and floppy disks, tape cassettes, compact disk read-only (CD-ROM) and compact disk read-write (CD-RW) memories, and digital video disks (DVDs). The chipset 608 also performs all communications and control between the processor and the devices 624-628 and performs a variety of other functions, such as supplying video data to a video driver (not shown) that drives a video monitor corresponding to one of the output devices 624 and transferring data from the mass storage devices 628 to the memory subsystem 604, as will be appreciated by those skilled in the art.
In operation of the computer system 600, the processor 618 executes programs (not shown) to perform desired functions. When the processor 618 requires programming instructions or data stored in the memory subsystem 604, the processor applies an appropriate command to the memory controller 606 over the system bus 620. In response to the command, the memory controller 606 applies a corresponding command to the DIMMs 610 over the memory bus 616 to access the requested data. In response to this command from the memory controller 606, the DIMMs 610 access the corresponding data words and return the requested data words along with the corresponding HECC codes to the memory controller over the memory bus 616. The memory controller 606 then utilizes each HECC code to detect and correct any erroneous data bits in the corresponding data word and returns the data word over the system bus 620 to the processor 618. If the memory controller 606 detects any uncorrectable errors in data words, the controller typically reports such errors to an operating system (not shown) running on the processor 618. The operating system takes appropriate actions in response to such errors, such as notifying a user via one of the output devices 624 and terminating the execution of all programs on the processor 618.
To write data into the memory subsystem 604, the processor 618 applies an appropriate command along with the data word to be stored to the memory controller 606 over the system bus 620. In response to the command, the memory controller 606 generates the HECC code for the data word and applies a corresponding command along with the data word and HECC code to the DIMMs 610 over the memory bus 616. In response to this command from the memory controller 606, the DIMMs 610 access the appropriate memory locations and store the data word along with the HECC code in these memory locations.
During operation of the computer system 600, the ECC logic 614 contained in each of the DIMMs 610 operates as previously described with reference to
In the embodiment of
In one embodiment of the computer system 600, the ECC logic 614 in each DIMM 610 performs vertical scrubbing during a refresh cycle of the associated DRAM memory devices 612. As will be appreciated by those skilled in the art, during a refresh cycle each memory location in the array of memory locations collectively formed by all devices 612 on each DIMM 610 is accessed to restore the data stored in each memory location. Since each memory location is being accessed, this is an opportune time for the ECC logic 614 to perform vertical scrubbing of these data bits. Thus, in one embodiment the ECC logic 614 on each DIMM 610 automatically performs vertical scrubbing during each refresh cycle of the associated DRAM memory devices 612. The ECC logic 614 could also automatically perform vertical scrubbing in response to some other parameter, such as a time period other than a single refresh cycle (for example, once every X refresh cycles or Y times between refresh cycles), or in response to a command from the memory controller 606.
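The "once every X refresh cycles" option mentioned above amounts to a simple counter. The sketch below models it in software purely for illustration; the class name `ScrubScheduler` and its parameter are hypothetical, and in the described embodiment this decision would be made by the ECC logic on the DIMM rather than by a program.

```python
class ScrubScheduler:
    """Decides on which refresh cycles vertical scrubbing should be performed."""

    def __init__(self, every_n_refreshes: int = 1):
        if every_n_refreshes < 1:
            raise ValueError("every_n_refreshes must be at least 1")
        self.every = every_n_refreshes   # the X in 'once every X refresh cycles'
        self.count = 0

    def on_refresh(self) -> bool:
        """Call once per refresh cycle; True means scrub during this cycle."""
        self.count += 1
        if self.count >= self.every:
            self.count = 0
            return True
        return False


scheduler = ScrubScheduler(every_n_refreshes=3)
decisions = [scheduler.on_refresh() for _ in range(7)]
```

With the default value of 1 the scheduler reproduces the embodiment in which vertical scrubbing runs during every refresh cycle.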
Also note that the embodiments of the present invention are not limited to the type of memory contained in the system memory 602, and thus while the DIMMs 610 include the DRAM memory devices 612 in
Even though various embodiments and advantages of the present invention have been set forth in the foregoing description, the above disclosure is illustrative only, and changes may be made in detail and yet remain within the broad principles of the present invention. Moreover, the functions performed by the components 602-628 in the computer system 600 of