Current and future memories (e.g., dynamic random access memory (DRAM)) are susceptible to a variety of aging-based failures that are not predictable via error correcting code (ECC) logic. That is, they do not exhibit any known pattern of errors that can be detected or corrected by the ECC before a permanent failure occurs. An example of such a failure mechanism is a sub-wordline contact failure in DRAM due to electromigration. Certain types of fault modes can also evade detection and correction by the ECC when they occur, or require the use of codes with a high overhead.
A more detailed understanding can be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
Although the method and apparatus are expanded upon in further detail below, briefly, a method for predicting memory failure is described herein.
An embodiment of the invention includes an integrated prediction engine, implemented in silicon within a memory device, that predicts impending aging-based failures in the device. A prediction (generated by the prediction engine) is created from a combination of data collected from in-memory sensors (e.g., temperature and voltage sensors), memory error logs, and return-to-manufacturer data at the memory vendor that correlates runtime measurements to predict when a failure may occur.
There is a demonstrated correlation between temperature, voltage, and aging-based failure mechanisms. When a failure is predicted, the device conveys this information to a host device via logging/transparency mechanisms to trigger any remedial action scheme (RAS) actions (e.g., post-package repair). The prediction engine may be in communication with the host processor via an interface that allows the predictor to be updated via firmware updates. For example, such an update may be performed if the vendor identifies new failure modes and desires to update the prediction engine with these modes. The predictor may be implemented using machine learning techniques (e.g., a recurrent neural network (RNN) or regression), and the physical embodiment of the predictor may exist, for example, as a microcontroller, as custom logic in the base layer of the memory device, or as a memristive accelerator.
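As a rough illustration of a regression-style predictor whose parameters can be replaced by an update, the following minimal sketch uses a logistic regression over named features. The class name, feature names, weights, bias, and decision threshold are all hypothetical and are not taken from the description; a real prediction engine would be trained on vendor data and could equally be an RNN or custom logic.

```python
import math


class RegressionPredictor:
    """Hypothetical logistic-regression predictor with replaceable parameters."""

    def __init__(self, weights, bias, threshold=0.5):
        self.weights = dict(weights)  # feature name -> weight (illustrative values)
        self.bias = bias
        self.threshold = threshold    # probability at or above which a failure is predicted

    def probability(self, features):
        # Weighted sum of the known features; missing features count as 0.0.
        z = self.bias + sum(w * features.get(name, 0.0)
                            for name, w in self.weights.items())
        return 1.0 / (1.0 + math.exp(-z))

    def predicts_failure(self, features):
        return self.probability(features) >= self.threshold

    def apply_update(self, weights, bias):
        # Stand-in for a firmware update that installs new model parameters,
        # for example after the vendor identifies a new failure mode.
        self.weights = dict(weights)
        self.bias = bias


predictor = RegressionPredictor({"temperature_c": 0.5}, bias=-45.0)
hot = predictor.predicts_failure({"temperature_c": 100.0})
warm = predictor.predicts_failure({"temperature_c": 60.0})

# After an update with a less negative bias, the same warm reading is flagged.
predictor.apply_update({"temperature_c": 0.5}, bias=-25.0)
warm_after_update = predictor.predicts_failure({"temperature_c": 60.0})
```

Keeping the parameters separate from the scoring logic is one way to let an update change what the predictor recognizes without changing the interface seen by the host.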
Memory devices contain sensors that measure physical attributes, such as temperature, while the devices are operational in the field. Sensors for measuring additional attributes, such as voltage, have been published in the literature. Servers also implement ECC for memory and log errors that are detected and corrected while in use. These logs are collected on the device or system where the memory is integrated. Additionally, memory vendors perform testing of devices that have been returned to them (i.e., return-to-vendor devices) to assess or determine the root cause of any failures, and also plan to incorporate memory built-in self-test (MBIST) capabilities for failure diagnoses in the field.
A method for predicting and managing a device failure includes, responsive to a predicted failure of a memory device, the predicted failure based on sensor data associated with the memory device, determining a further action for the memory device.
An apparatus for predicting and managing a device failure includes a memory and a memory controller communicatively coupled with the memory. The memory controller, responsive to a predicted failure of a memory device, the predicted failure based on sensor data associated with the memory device, determines a further action for the memory device.
A non-transitory computer-readable medium for predicting and managing a device failure has instructions recorded thereon that, when executed by a processor, cause the processor to perform operations. The operations include, responsive to a predicted failure of a memory device, the predicted failure based on sensor data associated with the memory device, determining a further action for the memory device.
In various alternatives, the processor 102 includes a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core can be a CPU or a GPU. In various alternatives, the memory 104 is located on the same die as the processor 102, or is located separately from the processor 102. The memory 104 includes a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 106 includes a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 108 include, without limitation, a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 110 include, without limitation, a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 112 communicates with the processor 102 and the input devices 108, and permits the processor 102 to receive input from the input devices 108. The output driver 114 communicates with the processor 102 and the output devices 110, and permits the processor 102 to send output to the output devices 110. It is noted that the input driver 112 and the output driver 114 are optional components, and that the device 100 will operate in the same manner if the input driver 112 and the output driver 114 are not present.
The external memory 116 may be similar to the memory 104, and may reside in the form of off-chip memory. Additionally, the external memory may be memory resident in a server where the memory controller 115 communicates over a network interface to access the memory 116.
The analysis performed on the received data may include, for example, receiving one or more temperature readings from the sensors 118 and comparing the temperature readings to a threshold temperature that indicates a potential failure temperature of the device. In another example, one or more voltage readings may be received from the sensors 118 and compared against a threshold voltage; a reading that exceeds the threshold indicates a potential device failure. Another example set of data is the number of ECC events registered by the ECC logic 201. For example, if the number of ECC events exceeds a threshold number of events that indicates that a failure of the device is imminent, a failure may be predicted.
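The threshold comparisons described above can be sketched as follows. This is a minimal illustration only: the threshold values, the units, and the names `Thresholds` and `predict_failure` are assumptions made for the example, and treating a temperature reading above the threshold as the failure indication is likewise an assumption, since the description only says the readings are compared to a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Hypothetical placeholder limits; real values would come from the vendor.
    max_temperature_c: float = 95.0
    max_voltage_v: float = 1.35
    max_ecc_events: int = 100


def predict_failure(temperatures_c, voltages_v, ecc_event_count,
                    thresholds=Thresholds()):
    """Return the criteria that indicate a potential failure (empty if none)."""
    reasons = []
    if any(t > thresholds.max_temperature_c for t in temperatures_c):
        reasons.append("temperature")
    if any(v > thresholds.max_voltage_v for v in voltages_v):
        reasons.append("voltage")
    if ecc_event_count > thresholds.max_ecc_events:
        reasons.append("ecc_events")
    return reasons


# One temperature reading is above its limit; voltage and ECC count are not.
print(predict_failure([80.0, 97.5], [1.20], 3))  # prints ['temperature']
```

Returning the list of triggered criteria, rather than a single boolean, keeps the reason for the prediction available for any later logging.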
In accordance with the device 100 and memory controller 115 depicted in
In step 310, the memory controller 115 receives data from one or more of the sensors 118. The data received may include temperature data or voltage data, for example. In addition, the data received may include usage data (e.g., latency/bandwidth) and time data (e.g., the number of seconds of an operation). The data can be provided from DRAM or the processor 102, for example.
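One possible way to represent the kinds of data received in step 310 is a small tagged record. The type names, field names, units, and the `group_by_kind` helper below are hypothetical and chosen only for illustration; the description does not specify a data format.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    TEMPERATURE = "temperature"  # assumed unit: degrees Celsius
    VOLTAGE = "voltage"          # assumed unit: volts
    LATENCY = "latency"          # usage data
    BANDWIDTH = "bandwidth"      # usage data
    DURATION = "duration"        # time data, in seconds


@dataclass(frozen=True)
class Sample:
    source: str   # where the data came from, e.g. "dram" or "processor"
    kind: Kind
    value: float


def group_by_kind(samples):
    """Collect sample values per kind, preserving arrival order."""
    grouped = defaultdict(list)
    for sample in samples:
        grouped[sample.kind].append(sample.value)
    return dict(grouped)


received = [
    Sample("dram", Kind.TEMPERATURE, 71.0),
    Sample("dram", Kind.VOLTAGE, 1.21),
    Sample("processor", Kind.DURATION, 12.0),
    Sample("dram", Kind.TEMPERATURE, 73.5),
]
by_kind = group_by_kind(received)
```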
After receiving the data, the memory controller 115 analyzes the data (by the prediction engine 203) to predict whether a failure is likely to occur (step 320). In an exemplary embodiment, the prediction engine may be dedicated logic within the ECC logic 201 of the controller 115, dedicated logic separate from the ECC logic 201, a general purpose processor executing software or firmware, or a combination of dedicated logic and general purpose processing as described above in
In step 330, it is determined whether or not a device failure is predicted to occur. That is, if the temperature, voltage, ECC events, or other received data meet the criteria for a likely predicted failure, it is determined that a failure is likely to occur in step 330.
If it is determined in step 330 that a failure is likely to occur, the memory controller logs the prediction for additional action (step 340). For example, a log of the sensor data and ECC events is created for each identifiable device (e.g., memory device) in which a failure was predicted to occur. Further, the logs may be uploaded to a central database (e.g., the vendor database for the device) to track potential failures for action. The action may include providing a firmware update to the memory controller to update events and sensor data to identify more accurately when a device is going to fail. Additionally, the actions may include undertaking RAS actions, such as those described above, for example, post-package repair or a field replaceable unit (FRU) callout. At this point the method reverts to step 310.
If it is determined in step 330 that a failure is not likely to occur, then the memory controller continues normal operation (step 350) and the method reverts to step 310.
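The loop through steps 310 to 350 can be sketched as below. This is an illustrative outline under stated assumptions, not the controller's actual implementation: the function names, the dictionary-based data format, the in-memory list used as a log, and the fixed number of cycles are all hypothetical.

```python
from typing import Callable, Dict, List


def run_cycle(read_sensors: Callable[[], Dict[str, float]],
              predict: Callable[[Dict[str, float]], List[str]],
              prediction_log: List[dict]) -> str:
    data = read_sensors()        # step 310: receive data from the sensors
    reasons = predict(data)      # step 320: analyze the data
    if reasons:                  # step 330: is a failure predicted?
        # Step 340: log the prediction together with the data behind it.
        prediction_log.append({"data": data, "reasons": reasons})
        return "logged"
    return "normal"              # step 350: continue normal operation


def monitor(read_sensors, predict, prediction_log, cycles):
    # Each cycle returns to step 310; a real controller would loop indefinitely.
    return [run_cycle(read_sensors, predict, prediction_log)
            for _ in range(cycles)]


# Hypothetical stand-ins for the sensors and the prediction engine.
readings = iter([{"temperature_c": 70.0}, {"temperature_c": 99.0}])
log = []
outcomes = monitor(
    lambda: next(readings),
    lambda d: ["temperature"] if d["temperature_c"] > 95.0 else [],
    log,
    cycles=2,
)
```

Passing the sensor reader and the predictor in as callables keeps the flow independent of how the prediction itself is made.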
The inference engine itself operates in a manner that is opaque to the external interface. That is, when a specific failure mode is predicted, the device may convey this information to the host via logging/transparency mechanisms to trigger any actions to enhance availability and serviceability at the system level (e.g., post-package repair, FRU callout).
A memory vendor may identify newer fault modes based on its evolving dataset and hence may wish to update the prediction engine 203 (
The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable medium). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements features of the disclosure. Further, although the methods and apparatus described above are described in the context of controlling and configuring PCIe links and ports, the methods and apparatus may be utilized in any interconnect protocol where link width is negotiated.
The methods or flow charts provided herein can be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of non-transitory computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs). For example, the methods described above may be implemented in the processor 102 or on any other processor in the computer system 100.