This disclosure relates generally to one or more systems and methods for memory, and particularly to improved reliability, availability, and serviceability (RAS) in a memory device.
Memory integrity is a hallmark of modern computing. Memory systems are often equipped with hardware and/or software/firmware protocols that are configured to check the integrity of one or more memory sections and determine whether the data located therein is accessible to higher-level subsystems and whether the data is error-free. These methods fall under the RAS features of the memory, and they are essential for maintaining data persistence in the memory as well as data integrity.
The typical RAS infrastructure of a memory system may be configured to detect and fix errors in the system. For example, RAS features may include protocols for error-correcting codes. Such protocols are hardware features that can automatically correct memory errors once they are flagged by the RAS infrastructure. These errors may be due to noise, cosmic rays, hardware transients caused by sudden changes in power supply lines, or physical errors in the medium in which the data are stored.
One long-standing RAS feature that is used in volatile memories, such as random access memories (RAMs), is called patrol scrubbing. This protocol is achieved using a hardware engine that may be co-located with the memory system either as an adjacent module or within the memory itself. During run time, the patrol scrubber accesses memory addresses with a predetermined frequency, and it generates requests that do not interfere with the memory's actual functions and quality of service. Such requests are read requests to the memory addresses that are accessed, and they give the hardware the opportunity to read the data from the memory addresses and run an error-correcting code on the data. If the data is not correctable, the scrubber may report the memory location to the software to indicate that the data at that location is not correctable. The scrubber may be configured to work on single memory addresses, or it may work on predetermined address ranges. Furthermore, given enough time, the scrubber may access every memory location in the memory.
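The scrubbing pass described above can be sketched in a few lines of Python. This is an illustration only: a single even-parity bit stands in for a real error-correcting code (so it can detect, but not correct, a one-bit error), and the names `parity` and `patrol_scrub` and the memory layout are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def parity(word: int) -> int:
    """Return the even-parity bit of a non-negative integer word."""
    return bin(word).count("1") % 2


def patrol_scrub(memory: Dict[int, Tuple[int, int]],
                 addresses: Iterable[int]) -> List[int]:
    """Read each address once and return the addresses whose stored
    check bit no longer matches the data."""
    flagged = []
    for addr in addresses:
        data, check = memory[addr]      # read request issued by the scrubber
        if parity(data) != check:       # stand-in for the ECC check
            flagged.append(addr)        # location to report to software
    return flagged


# Hypothetical 4-word memory: address 2 holds a word with one flipped bit.
mem = {a: (0b1010, parity(0b1010)) for a in range(4)}
mem[2] = (0b1011, parity(0b1010))
print(patrol_scrub(mem, range(4)))
```

Passing a single address or a range to `patrol_scrub` corresponds to the two scrubber configurations mentioned above.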
Compute Express Link™ (CXL™) is a new technology that maintains memory coherence between CPU memory space and the memory of peripheral devices to allow resource sharing and reduced software stack complexity, which improves device speed and reduces overall system cost. In CXL™-mediated devices, a failure (e.g., corrupted data at a memory location) is intercepted by the patrol scrubber, and the system must immediately react to this failure to ensure that high-level RAS features are maintained. This may slow down the device and compromise CXL™ speed. As such, there is a need for new approaches to identifying and fixing errors in emerging architectures like CXL™.
The embodiments featured herein help solve or mitigate the above-noted issues as well as other issues known in the art. Specifically, there is provided a system and a method for managing a failure off-line once it is identified by the patrol scrubber of a memory system. The embodiments may manage this failure off-line in either one of two novel ways. The first method includes provisioning a “jolly,” which is a spare component or a spare part of a component (e.g., a bank, a section, or a row) in the memory system. The jolly can be used to temporarily replace the failed area in a manner that is transparent to the memory system in general. In this embodiment, valid data may be copied into the jolly area.
After the valid data is safe, memory addressing that is associated with the failed area is redirected to the jolly area. When the failure is no longer visible to higher-level systems, e.g., it has been fixed by typical fast cycling to promote retention and data integrity at the failed memory location, a recovery procedure may be undertaken. The recovery procedure may include re-mapping the content of the jolly to the failed area. In this exemplary scenario, areas around the failed area that are valid may also be copied to the jolly area in order to maintain normal system operation.
In another embodiment, the failure may be mitigated without a jolly. In this approach, the controller implementing the failure mitigation may require that the host retire the failed area. This may be done by removing the addresses of the failed area from the pool of valid addresses until the failed area has been sanitized. This is achieved with a custom protocol that notifies the host of the status of the retired area.
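As an illustration of retirement without a jolly, the following minimal Python sketch models the pool of valid addresses on the host side. The class `AddressPool` and its method names are hypothetical; the custom notification protocol itself is not modeled, and the calls to `retire` and `restore` simply stand in for the status messages a controller might send.

```python
class AddressPool:
    """Hypothetical host-side pool of valid addresses."""

    def __init__(self, addresses):
        self.valid = set(addresses)
        self.retired = set()

    def retire(self, addresses):
        """Remove failed addresses from the pool of valid addresses."""
        failed = set(addresses) & self.valid
        self.valid -= failed
        self.retired |= failed

    def restore(self, addresses):
        """Return sanitized addresses to the pool of valid addresses."""
        fixed = set(addresses) & self.retired
        self.retired -= fixed
        self.valid |= fixed


pool = AddressPool(range(8))
pool.retire([2, 3])          # controller reports a failure at addresses 2 and 3
print(sorted(pool.valid))    # addresses the host may still use
pool.restore([2, 3])         # controller reports the area as sanitized
print(sorted(pool.valid))
```

Keeping a separate `retired` set means that only addresses that were actually retired can be returned to service.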
Further, in one other example embodiment, there is provided a system for mitigating an error in a memory. The system can include a memory controller communicatively coupled to a host. The memory controller may be configured to receive information associated with a memory location. The information can indicate the error at the memory location. The controller may be configured to perform, upon receiving the information, certain operations. The operations can include copying data around the memory location and placing the copied data in a reserved area. The operations can further include outputting, to a central controller, a set of physical addresses associated with the reserved area, wherein the central controller is configured to modify the set of physical addresses to perform a recovery off-line.
In another example embodiment, there is provided a method for mitigating an error in a memory. The method can include receiving, by a memory controller communicatively coupled to a host, information associated with a memory location, the information indicating the error at the memory location. The method can further include copying data around the memory location and placing the copied data in a reserved area. The method can further include outputting, to a central controller, a set of physical addresses associated with the reserved area and modifying the set of physical addresses to conduct a data recovery off-line.
In yet another example embodiment, there is provided a method for mitigating an error in a memory. The method may include receiving, by a controller communicatively coupled to the memory, information associated with a memory location. The information may indicate an error at the memory location. The method may include copying, by the controller, data around the memory location, and placing, by the controller, the copied data in a reserved area. The method may further include returning, by the controller, a set of addresses to a host controller of the memory. The set of addresses may be associated with the reserved area, and the set of addresses may replace a corresponding set of addresses of the memory location that was flagged as having an error.
Additional features, modes of operations, advantages, and other aspects of various embodiments are described below with reference to the accompanying drawings. It is noted that the present disclosure is not limited to the specific embodiments described herein. These embodiments are presented for illustrative purposes only. Additional embodiments, or modifications of the embodiments disclosed, will be readily apparent to persons skilled in the relevant art(s) based on the teachings provided.
Illustrative embodiments may take form in various components and arrangements of components. Illustrative embodiments are shown in the accompanying drawings, throughout which like reference numerals may indicate corresponding or similar parts in the various drawings. The drawings are only for purposes of illustrating the embodiments and are not to be construed as limiting the disclosure. Given the following enabling description of the drawings, the novel aspects of the present disclosure should become evident to a person of ordinary skill in the relevant art(s).
While the illustrative embodiments are described herein for particular applications, it should be understood that the present disclosure is not limited thereto. Those skilled in the art and with access to the teachings provided herein will recognize additional applications, modifications, and embodiments within the scope thereof and additional fields in which the present disclosure would be of significant utility.
During operation, a patrol scrubber routine or protocol may be executed by the host 106. The patrol scrubber may scan the locations of the memory 102 in order to determine whether they include errors. In an example scenario illustrated in
Further, the memory element 129 may be a plurality of memory components where a unit in the memory element 129 may be a memory component like the memory 102. For example, and not by limitation, a memory component of the memory element 129 may be composed of 16 banks, and each bank may be composed of a number of sections. Each section may be composed of a number of rows.
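The bank, section, and row hierarchy described above can be illustrated with a short Python sketch that splits a flat row index into its three coordinates. Only the count of 16 banks comes from the example above; the number of sections per bank, the number of rows per section, and the function name `decode` are assumptions chosen for illustration.

```python
BANKS = 16               # from the example above
SECTIONS_PER_BANK = 4    # assumed for illustration
ROWS_PER_SECTION = 8     # assumed for illustration


def decode(row_index: int):
    """Split a flat row index into (bank, section, row)."""
    total_rows = BANKS * SECTIONS_PER_BANK * ROWS_PER_SECTION
    if not 0 <= row_index < total_rows:
        raise ValueError("row index out of range")
    bank, rest = divmod(row_index, SECTIONS_PER_BANK * ROWS_PER_SECTION)
    section, row = divmod(rest, ROWS_PER_SECTION)
    return bank, section, row


print(decode(45))
```

A decomposition of this kind makes it possible to name a failed area, or a spare area, at the granularity of a bank, a section, or a row.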
Furthermore, for example, and not by limitation, all of the management is transparent to the host 106. The host 106 does not observe any change in the behavior of the CXL™ device, because the central controller 124 remaps the affected areas by associating the logical addresses (used by the host) with different portions of physical locations (physical addresses). For instance, there may be a block in the central controller 124 that has as input the logical address (sent by the host 106) and as output a physical address that the central controller 124 can modify accordingly to perform off-line recovery.
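The translation block described above can be pictured as a lookup with per-address overrides. The following Python sketch is written under that assumption; the class `RemapTable`, its method names, and the addresses used are hypothetical and are not taken from the central controller 124.

```python
class RemapTable:
    """Hypothetical logical-to-physical translation block."""

    def __init__(self):
        # logical address -> substitute physical address
        self.overrides = {}

    def translate(self, logical: int) -> int:
        """Identity mapping unless the address has been redirected."""
        return self.overrides.get(logical, logical)

    def redirect(self, logical: int, physical: int) -> None:
        """Serve a logical address from a different physical location."""
        self.overrides[logical] = physical

    def restore(self, logical: int) -> None:
        """Drop the redirection once off-line recovery has finished."""
        self.overrides.pop(logical, None)


table = RemapTable()
table.redirect(0x10, 0xF0)   # failed location is now served from a spare area
print(hex(table.translate(0x10)), hex(table.translate(0x11)))
table.restore(0x10)          # original location is usable again
print(hex(table.translate(0x10)))
```

Because the host only ever presents logical addresses, changes to `overrides` are not visible to it, which is the sense in which the management is transparent.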
In one embodiment, referring to
In contrast, in the embodiment presented herein, the error is mitigated off-line without compromising access to the data in the flagged memory sections. Rather, these data are copied to one or more jolly sections that are provisioned to serve as placeholder locations for error mitigation. Once the valid data from the flagged sections are in their respective jolly sections, the host 106 can continue program execution by accessing the jolly locations if the data in the original memory locations are needed.
Meanwhile, the errors in the original memory sections are addressed off-line using typical countermeasures (error-correcting code, fast cycles, etc.). Once the memory sections that exhibited errors have been sanitized, their addresses are usable again and the jolly is cleared, since the host no longer accesses those data in the jolly but rather in the original memory locations.
The method 200 can begin at block 202. At block 204, the controller 104 may receive, from a patrol scrubber that is configured to scrub the memory 102, information indicating that a specific memory section includes one or more errors. One of ordinary skill in the art will recognize that such errors may not extend over the whole section, and that as such, despite the one or more errors, the memory section identified may still include valid data.
At block 206, the controller 104 may issue an instruction that causes the valid data in the memory section to be copied. Once the data are copied, the controller 104 may then issue a command for the copied data to be written into a jolly (block 208). The written data may include all the valid data as well as markers to indicate where the corrupted data are in the original memory location. Once the data are written into the jolly, the controller 104 may fetch the address of the jolly and return the address of the jolly to the host 106 (block 210).
This may be done with specific instructions to the host to replace the address of the original memory location with the jolly's address. In this scheme, program execution, i.e., host tasks, may continue unimpeded, and the data in the original memory location may now be addressed using the jolly's address, since the jolly now includes all the valid data of the original memory section that was flagged (block 212). As such, memory functions remain online.
Meanwhile, the original location is scheduled by the scrubber to be fixed using, for example and not by limitation, an error-correcting code (block 214). Alternatively, if the error is unrecoverable, the controller 104 may flag the memory section as being unusable. Thus, generally, the error is either fixed or mitigated. The method 200 includes waiting at block 214 if the error is not yet fixed or mitigated (decision block 216).
When the error is fixed or mitigated, the method 200 may include another decision block 218 to determine whether the error that was flagged was recoverable, i.e., correctable, or not. If the error was correctable, the jolly may be cleared (block 220), and the method 200 may end at block 220. If the error was not correctable, the controller 104 or the host 106 may issue a flag asserting that the specific addresses of the memory where the one or more errors occur are unusable since these memory locations include corrupted data or they are damaged (block 219). The method 200 may then end at block 221.
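The flow of the method 200 can be summarized in a short Python sketch. It is a simplification under stated assumptions: the wait at decision block 216 is collapsed into a single call to a caller-supplied `repair` function, the marker value `CORRUPT` is an assumed encoding, and the choice to keep serving data from the jolly when the error is unrecoverable is an assumption that the description above does not settle. The name `mitigate_with_jolly` is hypothetical.

```python
CORRUPT = None  # assumed marker written in place of corrupted words


def mitigate_with_jolly(section, bad_offsets, repair):
    """Walk through blocks 204 to 221 for one flagged memory section.

    section     -- list of words in the flagged section
    bad_offsets -- offsets reported by the patrol scrubber (block 204)
    repair      -- callable(section, bad_offsets) that attempts the
                   off-line fix and returns True if it succeeded
    """
    # Blocks 206 and 208: copy the valid data into the jolly and mark
    # the positions of the corrupted words.
    jolly = [CORRUPT if i in bad_offsets else word
             for i, word in enumerate(section)]
    served_from = "jolly"            # blocks 210 and 212: host uses the jolly

    # Blocks 214 to 218: off-line fix, then check whether it succeeded.
    if repair(section, bad_offsets):
        jolly.clear()                # block 220: the jolly is cleared
        served_from = "original"
        unusable = set()
    else:
        unusable = set(bad_offsets)  # block 219: flag addresses as unusable
    return served_from, jolly, unusable


print(mitigate_with_jolly([7, 8, 9], {1}, lambda s, b: True))
print(mitigate_with_jolly([7, 8, 9], {1}, lambda s, b: False))
```

The two calls exercise the recoverable and unrecoverable branches of decision block 218, respectively.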
At decision block 308, the controller 104 checks whether the host 106 has mitigated or fixed the error. If not, the controller 104 waits (block 310). When the error is mitigated or fixed, the controller 104 checks whether the error was recoverable or unrecoverable (decision block 312). If unrecoverable, the controller 104 notifies the host 106 that these memory locations must be retired permanently (block 314), and the method 300 ends at block 316. If the error was recoverable and corrected, the controller 104 sends a flag to the host 106 telling it to remove the memory locations from retirement (block 313). The method 300 then ends at block 315.
The controller 400 can be a stand-alone programmable system, or a programmable module included in a larger system. For example, the controller 400 can be included in a RAS hardware routine for a memory 102 connected to the controller 400. The controller 400 may include one or more hardware and/or software components configured to fetch, decode, execute, store, analyze, distribute, evaluate, and/or categorize information.
The processor 414 may include one or more processing devices or cores (not shown). In some embodiments, the processor 414 may be a plurality of processors, each having one or more cores. The processor 414 can execute instructions fetched from the memory 402, i.e., from one of memory modules 404, 406, 408, or 410. Alternatively, the instructions can be fetched from the storage medium 420, or from a remote device connected to the controller 400 via a communication interface 416. Furthermore, the communication interface 416 can also interface with the memory 102, for which RAS features are needed, and with the host 106. An I/O module 412 may be configured for additional communications to or from remote systems.
Without loss of generality, the storage medium 420 and/or the memory 402 can include a volatile or non-volatile, magnetic, semiconductor, tape, optical, removable, non-removable, read-only, random-access, or any other type of non-transitory computer-readable medium. The storage medium 420 and/or the memory 402 may include programs and/or other information usable by the processor 414. Furthermore, the storage medium 420 can be configured to log data processed, recorded, or collected during the operation of the controller 400.
The data may be time-stamped, location-stamped, cataloged, indexed, encrypted, and/or organized in a variety of ways consistent with data storage practice. By way of example, the memory modules 406 to 410 can together form a module that implements the off-line error mitigation described above. The instructions embodied in these memory modules can cause the processor 414 to perform certain operations consistent with the functions described above, i.e., off-line mitigation of errors flagged within one or more locations of the memory 102.
For example, and not by limitation, the operations executed by the processor 414 can include receiving, by the processor, information associated with a memory location within the memory 102. The information may indicate an error at the memory location. The operations may then include copying, by the processor, data around the memory location, and placing, by the processor, the copied data in a reserved area, i.e., in a jolly area which may be co-located with the memory 102. The operations may further include returning, by the processor, a set of addresses to the host 106. The set of addresses is associated with the reserved area, and the set of addresses replaces a corresponding set of addresses of the memory location that were flagged as having errors.
Having described several methods and application-specific embodiments consistent with the teachings presented herein, example general embodiments are now described. For instance, in one embodiment, there is provided a system for mitigating an error in a memory. The system can include a controller configured to receive information associated with a memory location. The information can indicate the error at the memory location.
The controller can be configured to perform, upon receiving the information, certain operations. The operations can include copying data around the memory location, placing the copied data in a reserved area, and returning a set of addresses to a host controller of the memory. The set of addresses may be associated with the reserved area. Furthermore, the set of addresses may replace a corresponding set of addresses of the memory location.
The system may be further configured to fix the error at the memory location using an error correcting code in an off-line mode. And the system may be further configured to operate unimpeded by using the set of addresses to retrieve data from the reserved area where the data correspond to uncorrupted data at the memory location. The controller may be configured to receive the information from a patrol scrubber, which may be associated with the memory system and with other memory systems.
The memory location may span a range of addresses, and one or more of the addresses may be specific to where one or more errors occur in the memory location. The system may be further configured to classify the error based on the received information. The controller may be configured to classify the error as recoverable or as unrecoverable. When the error is classified as unrecoverable, the controller may be configured to notify a host of the memory controller that the memory location has an unrecoverable error. The system may be further configured to remove one or more addresses corresponding to the unrecoverable error from a pool of valid addresses available to the host.
In another embodiment, there is provided a method for mitigating an error in a memory. The method may include receiving, by a controller communicatively coupled to the memory, information associated with a memory location. The information may indicate an error at the memory location. The method may include copying, by the controller, data around the memory location, and placing, by the controller, the copied data in a reserved area. The method may further include returning, by the controller, a set of addresses to a host controller of the memory. The set of addresses may be associated with the reserved area, and the set of addresses may replace a corresponding set of addresses of the memory location.
The method can further include fixing, by the system, the error at the memory location using an error correcting code in an off-line mode. Furthermore, the system can keep operating unimpeded by using the set of addresses to retrieve data from the reserved area, the data corresponding to uncorrupted data at the memory location. The method can further include receiving, by the controller, the information from a patrol scrubber. The memory location can span a range of addresses, and the range of addresses can include one or more specified addresses where the error is located.
The method can further include classifying, by the controller, the error based on the received information. The method can further include classifying the error as recoverable or as unrecoverable. When the error is classified as unrecoverable, the method can include notifying a host of the memory controller that the memory location has an unrecoverable error. The method can further include removing one or more addresses corresponding to the unrecoverable error from a pool of valid addresses available to the host.
Those skilled in the relevant art(s) will appreciate that various adaptations and modifications of the embodiments described above can be configured without departing from the scope and spirit of the disclosure. Therefore, it is to be understood that, within the scope of the appended claims, the disclosure may be practiced other than as specifically described herein.
This application claims priority to U.S. Provisional Application No. 63/301,027 filed on Jan. 19, 2022, titled “Off-line repairing and subsequent reintegration in the system,” which is hereby expressly incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63301027 | Jan 2022 | US