As the complexity of memory devices increases, the memory devices may become increasingly prone to data errors. For example, some types of data access patterns may cause leakage between word lines of a memory, resulting in loss or corruption of data. Manufacturers and/or vendors may be challenged to reduce the likelihood of data errors for the memory devices while minimizing latency and/or performance degradation of the memory devices.
The following detailed description references the drawings, wherein:
Specific details are given in the following description to provide a thorough understanding of embodiments. However, it will be understood that embodiments may be practiced without these specific details. For example, systems may be shown in block diagrams in order not to obscure embodiments in unnecessary detail. In other instances, well-known processes, structures and techniques may be shown without unnecessary detail in order to avoid obscuring embodiments.
Memory devices are increasing in complexity as the die feature size of the memory devices decreases and the storage capacity of the memory devices increases. As a result, failure mechanisms encountered in a memory device are becoming more complex as well. One type of problem encountered by the memory devices is “storms” of correctible, transient errors caused by leakage between word lines, which carry the row address information in a dynamic random access memory (DRAM). These error storms are caused by repeated accesses to a culprit word line, which may result in data being corrupted in word lines physically adjacent to the culprit word line. At a higher level, such as a system level where the memory devices are integrated, a user may have little to no control over stressful or malicious application behavior that exploits the memory device's weakness and causes such error storms.
A memory subsystem of the memory device may check for data errors periodically. Thus, these transient errors may be corrected by a chipset and/or a Basic Input/Output System (BIOS), but if the error storm continues, it may have the following negative effects on the system. For example, a user may be notified to replace hardware to eliminate the errors, which would result in system downtime and/or customer dissatisfaction. Further, the system may crash if too many transient errors cause an uncorrectable event. In a small number of cases, random transient errors may cause silent data corruption. Also, system performance may be impacted because a processor communicating to the memory device(s) may spend time correcting errors instead of executing applications.
Embodiments may disrupt data patterns that cause the error storms and increase system reliability by reducing an error rate associated with the word line leakage weakness in memory, such as DRAM, by dynamically changing a memory refresh rate. For example, a detection unit may count a number of cells of a random-access memory (RAM) that have errors. A threshold unit may determine a refresh rate of the RAM based on the number of cells having errors and an error threshold. The threshold unit may increase the refresh rate of the RAM if the number of errors is greater than the error threshold and the refresh rate is not at a maximum rate. The threshold unit may return the refresh rate of the RAM to a normal rate if the number of errors is less than or equal to the error threshold.
Increasing the memory refresh rate disrupts the memory access pattern that creates the error storm by inserting refresh cycles. Also, each refresh restores the cells in the RAM, such as DRAM, to a known good state and eliminates potentially harmful amounts of charge accumulated in the device substrate that can cause transient memory errors. Further, embodiments may limit a performance impact associated with an increased memory refresh rate by accounting for a tendency of error storms to be bursty. For example, the refresh rate is increased only for a period of time that is effective for lowering the number of errors, and then lowered back to a normal rate between error storms.
Thus, embodiments may reduce or eliminate memory errors associated with the word line leakage issue while reducing or minimizing a performance impact. Warranty costs and downtime may also be reduced for users who are exposed to the error storms associated with the word line leakage issue. At the same time, there will be no performance impact for the users who are not exposed to the word line leakage issue, because embodiments do not apply a broad-brush approach that would always increase the refresh rate and reduce performance for all users.
Instead, the performance impact is limited only to times when users experience bursty error storms by increasing the refresh rate only when necessary. In addition, embodiments may allow a system designer to work with a user who has an application that causes the word line leakage issue. For example, the increased refresh rate caused by embodiments can be detected. Then the application which causes the error storm can be detected and modified to reduce or eliminate the error storm.
Referring now to the drawings,
The term refresh rate may refer to a number of refresh cycles within a time period. Each memory refresh cycle refreshes a succeeding area of memory cells, thus refreshing all the cells in a round-robin fashion. The term refresh may refer to a process of periodically reading information from an area of the memory, such as DRAM, and immediately rewriting the read information to the same area without modification, for the purpose of preserving the information. In a DRAM chip, the refresh rate may refer to an interval between each row of DRAM being refreshed, such as one row every 7.8 microseconds (μs). While a refresh cycle is occurring the memory may not be available for normal read and write operations.
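By way of illustration only, the relationship between the per-row refresh interval and the time needed to refresh every row once in round-robin fashion may be sketched as follows. The row count of 8192 is an assumption chosen for the example and is not specified in this description; the function names are likewise hypothetical.

```python
def full_refresh_period_ms(row_interval_us: float, num_rows: int) -> float:
    """Time, in milliseconds, to refresh every row once when one row is
    refreshed per interval (round-robin)."""
    return row_interval_us * num_rows / 1000.0


def refresh_cycles_per_second(row_interval_us: float) -> float:
    """Number of row refresh cycles that fit into one second."""
    return 1_000_000.0 / row_interval_us


# Assumed example: 8192 rows, one row every 7.8 microseconds.
# The full pass over all rows then takes roughly 63.9 ms.
normal_period_ms = full_refresh_period_ms(7.8, 8192)

# Halving the interval (doubling the refresh rate) halves the full pass time.
doubled_period_ms = full_refresh_period_ms(3.9, 8192)
```

Under these assumed numbers, doubling the refresh rate doubles the number of refresh cycles competing with normal read and write operations, which is why the increase is applied only while an error storm persists.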
The detection and threshold units 110 and 120 may include, for example, a hardware device including electronic circuitry for implementing the functionality described below, such as control logic and/or memory. In addition or as an alternative, the detection and threshold units 110 and 120 may be implemented as a series of instructions encoded on a machine-readable storage medium and executable by a processor.
The detection unit 110 is to count a number of cells 152-1 to 152-n of a random-access memory (RAM) that have errors 112. For example, the detection unit 110 may detect the errors 112 by checking error-correcting codes (ECC) of the memory cells 152-1 to 152-n. The detection unit 110 may count the number of errors 112 according to, for example, a moving average and/or a total number of errors. The total number of errors may be recalculated after the refresh rate 122 is changed. For instance, if the number of errors 112 is calculated according to a moving average, a number of errors within the last 3 minutes may be used. However, if the number of errors 112 is calculated according to the total number of errors, the number of errors may continue to be counted until the refresh rate 122 changes. At this point, the number of errors 112 may be reset to start from zero again. The detected errors 112 may be soft, correctible errors that are detected while the device 100 is in an active state, as opposed to a sleep or an inactive state.
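The two counting approaches described above may be sketched as follows. This is a minimal software illustration, not the circuitry of the detection unit 110: the class and method names are hypothetical, and the moving average is interpreted here as a count of errors within a sliding time window (180 seconds, matching the 3 minute example).

```python
from collections import deque


class MovingWindowErrorCounter:
    """Counts errors observed within the most recent window_s seconds."""

    def __init__(self, window_s: float = 180.0) -> None:
        self.window_s = window_s
        self._timestamps = deque()  # one timestamp per detected error

    def record(self, now_s: float, errors: int = 1) -> None:
        for _ in range(errors):
            self._timestamps.append(now_s)

    def count(self, now_s: float) -> int:
        # Discard errors that have aged out of the window.
        while self._timestamps and now_s - self._timestamps[0] > self.window_s:
            self._timestamps.popleft()
        return len(self._timestamps)


class TotalErrorCounter:
    """Counts all errors since the refresh rate last changed."""

    def __init__(self) -> None:
        self.total = 0

    def record(self, errors: int = 1) -> None:
        self.total += errors

    def on_refresh_rate_change(self) -> None:
        # The count restarts from zero whenever the refresh rate changes.
        self.total = 0
```

The windowed counter forgets old errors on its own, whereas the total counter relies on the refresh rate change to reset it.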
The threshold unit 120 may determine a refresh rate 122 of the RAM 150 based on the number of cells 152-1 to 152-n having errors 112 and an error threshold 124. For example, the threshold unit 120 may increase the refresh rate 122 of the RAM 150 if the number of errors 112 is greater than an error threshold 124 and the refresh rate 122 has not yet reached a maximum rate 128. The error threshold 124 and the maximum rate 128 may depend on the chipset and/or BIOS capabilities and may be user defined. The error threshold 124 may be, for example, approximately between 10 and 100 errors. The maximum rate 128 may be based on a capability of a chipset (not shown) of the device 100.
The threshold unit 120 is to return the refresh rate 122 of the RAM 150 to a normal rate 126 if the number of errors 112 is less than or equal to the error threshold 124. The normal rate 126 may be, for example, one row every 7.8 μs. The normal rate 126 and/or the error threshold 124 may be set based on a user's performance requirements. The detection and threshold units 110 and 120 may operate autonomously and/or independently of a main processor (not shown) of the device 100. While the RAM 150 is shown to be external to the device 100, embodiments may also include the RAM 150 being internal to the device 100. By increasing the refresh rate 122 when a burst of errors is detected and resetting the refresh rate 122 after the burst of errors subsides, embodiments may reduce a number of errors caused by error storms while limiting an effect on performance.
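The decision made by the threshold unit 120 may be sketched as a small function. This is an illustrative sketch only: the `Action` enumeration and the `decide` function are hypothetical names, and the rates are assumed to be expressed in rows refreshed per 7.8 μs interval.

```python
from enum import Enum


class Action(Enum):
    RESET_TO_NORMAL = "reset_to_normal"  # errors at or below the threshold
    INCREASE = "increase"                # errors above threshold, headroom left
    HOLD_AT_MAX = "hold_at_max"          # errors above threshold, already at maximum


def decide(num_errors: int, error_threshold: int,
           refresh_rate: float, max_rate: float) -> Action:
    """Choose what to do with the refresh rate for the latest error count."""
    if num_errors <= error_threshold:
        return Action.RESET_TO_NORMAL
    if refresh_rate < max_rate:
        return Action.INCREASE
    return Action.HOLD_AT_MAX
```

For instance, with an assumed error threshold of 50 and an assumed maximum of 4 rows per interval, `decide(51, 50, 2.0, 4.0)` selects an increase, while `decide(50, 50, 2.0, 4.0)` selects a return to the normal rate.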
The CSR 230 and correction unit 240 may include, for example, a hardware device including electronic circuitry for implementing the functionality described below, such as control logic and/or memory. In addition or as an alternative, the CSR 230 and correction unit 240 may be implemented as a series of instructions or microcode encoded on a machine-readable storage medium and executable by a processor.
In
The threshold unit 220 may increase the refresh rate 122 according to various methods. In one embodiment, the threshold unit 220 may multiply the normal rate 126 by a threshold value 222 to increase the refresh rate 122. For example, if the normal rate 126 and the refresh rate 122 are both 1 row per 7.8 μs and the threshold value 222 is 2, the threshold unit 220 may multiply 1 row per 7.8 μs by 2 to increase the refresh rate 122 from 1 row every 7.8 μs to 2 rows every 7.8 μs.
In another embodiment, the threshold unit 220 may add a threshold rate 222 to the refresh rate 122 to increase the refresh rate 122. For example, if the refresh rate 122 is 1 row per 7.8 μs and the threshold rate 222 is 0.5 rows per 7.8 μs, the threshold unit 220 may add 0.5 rows per 7.8 μs to 1 row per 7.8 μs to increase the refresh rate 122 from 1 row every 7.8 μs to 1.5 rows every 7.8 μs.
After the RAM 150 has been refreshed at the increased refresh rate 122, the detection unit 210 may again count the number of errors 112. If the number of errors 112 is still greater than the error threshold 124 and the refresh rate 122 has not reached the maximum rate 128, the threshold unit 220 may further increase the refresh rate 122. In one instance, the threshold unit 220 may increase the threshold value 222, such as from 2 to 3. In this case, the threshold unit 220 may multiply the normal rate 126, such as 1 row per 7.8 μs, by 3 to increase the refresh rate 122 from 2 rows every 7.8 μs to 3 rows every 7.8 μs. In another instance, the threshold unit 220 may again add the threshold rate 222, such as 0.5 rows per 7.8 μs, to the existing refresh rate 122, such as 1.5 rows per 7.8 μs, to increase the refresh rate 122 to 2 rows every 7.8 μs.
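The multiplicative and additive approaches described above may be compared with a short sketch. The function names are hypothetical, rates are expressed in rows per 7.8 μs interval, and clamping the result to the maximum rate is an assumption made here for safety rather than a step stated in this description.

```python
def multiplicative_rate(normal_rate: float, threshold_value: int,
                        max_rate: float) -> float:
    """Refresh rate obtained by multiplying the normal rate by the threshold value."""
    return min(normal_rate * threshold_value, max_rate)  # clamp is an assumption


def additive_rate(refresh_rate: float, threshold_rate: float,
                  max_rate: float) -> float:
    """Refresh rate obtained by adding the threshold rate to the current rate."""
    return min(refresh_rate + threshold_rate, max_rate)  # clamp is an assumption


MAX_RATE = 4.0  # hypothetical maximum, in rows per 7.8 microsecond interval

# Multiplicative escalation: threshold value 2, then 3.
first_mult = multiplicative_rate(1.0, 2, MAX_RATE)    # 2 rows per interval
second_mult = multiplicative_rate(1.0, 3, MAX_RATE)   # 3 rows per interval

# Additive escalation: add 0.5 rows per interval each time.
first_add = additive_rate(1.0, 0.5, MAX_RATE)         # 1.5 rows per interval
second_add = additive_rate(first_add, 0.5, MAX_RATE)  # 2 rows per interval
```

The multiplicative form always scales from the normal rate 126, so only the threshold value needs to be stored, whereas the additive form builds on the existing refresh rate 122 and allows finer steps.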
However, the number of errors 112 may have instead decreased after the RAM 150 has been refreshed at the increased refresh rate 122. In this case, if the number of errors 112 is now less than or equal to the error threshold 124, the threshold unit 220 may reset the refresh rate 122 by resetting the threshold value 222, such as to 1, or overwriting the existing refresh rate 122 with the normal rate 126, such as 1 row every 7.8 μs.
In a situation where the number of errors 112 is greater than the error threshold 124 and the refresh rate 122 has reached the maximum rate 128, the threshold unit 220 may simply allow the correction unit 240 to correct the errors 112. This is because the errors 112 persisting in such a high number, even after the highest allowable refresh rate 122 has been reached, may indicate that the errors 112 are due to causes other than a transient error storm. In this case, the correction unit 240 may use a memory subsystem redundancy capability or mechanism to correct the errors 112, such as chip spare, rank spare, mirroring and the like.
The computing device 300 may be, for example, a secure microprocessor, a notebook computer, a desktop computer, an all-in-one system, a server, a network device, a controller, a wireless device, or any other type of device capable of executing the instructions 321, 323, 325, 327 and 329. In certain examples, the computing device 300 may include or be connected to additional components such as memories, controllers, etc.
The processor 310 may be at least one central processing unit (CPU), at least one semiconductor-based microprocessor, at least one graphics processing unit (GPU), a microcontroller, special purpose logic hardware controlled by microcode or other hardware devices suitable for retrieval and execution of instructions stored in the machine-readable storage medium 320, or combinations thereof. The processor 310 may fetch, decode, and execute instructions 321, 323, 325, 327 and 329 to implement changing the refresh rate of the RAM based on the number of errors. As an alternative or in addition to retrieving and executing instructions, the processor 310 may include at least one integrated circuit (IC), other control logic, other electronic circuits, or combinations thereof that include a number of electronic components for performing the functionality of instructions 321, 323, 325, 327 and 329.
The machine-readable storage medium 320 may be any electronic, magnetic, optical, or other physical storage device that contains or stores executable instructions. Thus, the machine-readable storage medium 320 may be, for example, Random Access Memory (RAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a storage drive, a Compact Disc Read Only Memory (CD-ROM), and the like. As such, the machine-readable storage medium 320 can be non-transitory. As described in detail below, machine-readable storage medium 320 may be encoded with a series of executable instructions for changing the refresh rate of the RAM based on the number of errors.
Moreover, the instructions 321, 323, 325, 327 and 329 when executed by a processor (e.g., via one processing element or multiple processing elements of the processor) can cause the processor to perform processes, such as, the process of
The RAM may be scanned again for errors after the refresh rate is increased. Further, the total number of errors may be compared to the error threshold after the refresh rate is increased. The refresh rate may be increased by a multiple of the normal rate. The multiple may increase in value if the total number of errors remains greater than the error threshold after the refresh rate is increased. For example, if the increase instructions 327 set the refresh rate to be double the normal rate but the subsequently calculated total number of errors remains greater than the error threshold, the increase instructions 327 may then set the refresh rate to be triple the normal rate, assuming the refresh rate is less than the maximum rate.
At block 410, a detection unit 110 of the device 200 scans a random-access memory (RAM) 150 for errors 112. Then, at block 420, the detection unit 110 counts a number of the errors 112 found in the scanned RAM 150 and transmits the number of errors 112 to a threshold unit 120 of the device 200. The threshold unit 120, at block 430, compares the number of errors 112 to an error threshold 124.
If the threshold unit 120 determines that the number of errors 112 is less than or equal to the error threshold 124 at block 430, the threshold unit 120 sets the refresh rate 122 to be a normal rate 126 (or maintains the refresh rate 122 if it is already at the normal rate 126), at block 440. Then, the method 400 flows back to block 410, where the detection unit 110 continues to scan the RAM 150 for errors.
On the other hand, if the threshold unit 120 determines that the number of errors 112 is greater than the error threshold 124 at block 430, then the threshold unit 120 compares the refresh rate 122 to a maximum rate 128, at block 450. If the threshold unit 120 determines that the refresh rate 122 is less than the maximum rate 128 at block 450, the threshold unit 120 increases the refresh rate 122 at block 460. However, if the threshold unit 120 determines that the refresh rate 122 is greater than or equal to the maximum rate 128 at block 450, the threshold unit 120 signals a correction unit 240. The correction unit 240 then corrects the errors 112 at block 470, such as via a memory subsystem redundancy mechanism. The method 400 flows back to block 410 after blocks 460 and 470.
Thus, the scanning and counting at blocks 410 and 420 are repeated after the increasing at block 460 and the correcting at block 470. Moreover, the increasing at block 460 is repeated if the number of errors 112 stays above the error threshold 124 at block 430 and the refresh rate 122 is less than the maximum rate 128 at block 450. Further, the scanning and the counting at blocks 410 and 420 are repeated at continuous intervals after the setting at block 440, if the number of errors 112 at block 430 remains below or equal to the error threshold 124.
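The flow of the method 400 may be summarized in a short simulation. This is a sketch under stated assumptions rather than a definitive implementation: the function name `run_method_400`, the default threshold of 50 errors, the default maximum of 3 rows per 7.8 μs interval, and the `correct` callback are all hypothetical, the multiplicative increase is used as one of the described options, and the scanning and counting of blocks 410 and 420 are replaced by a precomputed sequence of error counts.

```python
from typing import Callable, Iterable, List, Tuple


def run_method_400(error_counts: Iterable[int],
                   error_threshold: int = 50,
                   normal_rate: float = 1.0,
                   max_rate: float = 3.0,
                   correct: Callable[[int], None] = lambda n: None
                   ) -> List[Tuple[float, bool]]:
    """Return (refresh_rate, correction_invoked) after each pass of the loop."""
    refresh_rate = normal_rate
    multiple = 1
    trace = []
    for num_errors in error_counts:           # blocks 410 and 420 (simulated)
        corrected = False
        if num_errors <= error_threshold:     # block 430
            multiple = 1
            refresh_rate = normal_rate        # block 440
        elif refresh_rate < max_rate:         # block 450
            multiple += 1
            refresh_rate = min(normal_rate * multiple, max_rate)  # block 460
        else:
            correct(num_errors)               # block 470
            corrected = True
        trace.append((refresh_rate, corrected))
    return trace


# A short burst of errors followed by recovery.
trace = run_method_400([10, 80, 90, 95, 20])
```

In this run the refresh rate steps from 1 to 2 to 3 rows per interval while the count stays above the threshold, the correction callback is invoked once the maximum is reached, and the rate returns to normal when the count drops to 20.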
According to the foregoing, embodiments provide a method and/or device for disrupting data patterns that cause the error storms by reducing an error rate associated with the word line leakage weakness in memory, such as DRAM, based on dynamically increasing a memory refresh rate. Further, embodiments may limit a performance impact associated with the increased memory refresh rate by accounting for a tendency of error storms to be bursty. For example, the refresh rate is increased only for a period of time that is effective for lowering the number of errors, and then lowered back to a normal rate between error storms.
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US13/24233 | 1/31/2013 | WO | 00 |