The present invention relates to solid state disk (SSD) storage devices, and in particular, to detection of imminent NAND flash plane failures within SSD storage devices.
Solid state storage, in particular flash-based devices either in solid state disks (SSDs) or on flash cards, is quickly emerging as a credible tool for use in enterprise storage solutions. Ongoing technology developments have vastly improved performance and provided for advances in enterprise-class solid state reliability and endurance. As a result, solid state storage, specifically flash storage deployed in SSDs, is becoming vital for delivering higher performance to servers and storage systems, such as the data warehouse system illustrated in
SSDs are packaged similarly to HDDs in form factor modules that are typically 3.5 inches or 2.5 inches for enterprise and 1.8 inches for consumer drives. They communicate with an I/O controller, host adapter or storage controller in the same manner as HDDs via a standard I/O interface such as Fibre Channel (FC), Serial Attached SCSI (SAS) or Serial Advanced Technology Attachment (SATA). But access to data in SSD storage and data transfer rates for SSD storage are much faster than for HDD storage, as illustrated in
Most current SSD storage devices use NAND-based flash memory, which retains stored data even when no power is provided to the memory. However, NAND flash memory is susceptible to failure modes that are uniquely associated with the storage media technology, such as Column and Plane Failures. Due to the speed at which these failures can escalate, it is important to identify potential failures early to ensure the integrity of data at risk due to a NAND flash memory failure.
NAND Flash memory devices permanently store one or more data bits in a single memory cell, with billions of cells contained in a memory device. SSD technology economically leverages these memory devices along with robust management logic to provide immediate, direct access to the large amounts of non-volatile data stored within these memory devices.
A single NAND flash package comprises billions of flash cells that are organized in a hierarchical architecture, as depicted in
As stated above, NAND flash memory is susceptible to column and plane failures. A column failure is defined as a bit location that has a failure in every block on a plane. Error handling and correction features incorporated into the SSD can often correct these errors or move data to other memory locations to prevent data corruption.
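By way of illustration only, the column failure definition above can be expressed as a set intersection: a bit location qualifies only if it is in error in every block of the plane. The following is a minimal sketch, not the SSD firmware's method; the function name `find_column_failures` and the `block_errors` mapping are hypothetical and assume that failing bit locations have already been collected per block.

```python
from typing import Dict, Set


def find_column_failures(block_errors: Dict[int, Set[int]]) -> Set[int]:
    """Return bit locations that are in error in every block of one plane.

    block_errors maps a block index to the set of failing bit locations
    observed in that block (a hypothetical layout used for illustration).
    """
    if not block_errors:
        return set()
    error_sets = iter(block_errors.values())
    common = set(next(error_sets))  # copy so the caller's data is not modified
    for errors in error_sets:
        common &= errors  # keep only locations that fail in this block too
    return common


# Bit location 17 fails in all three blocks; 204 and 99 fail in only one each.
plane = {0: {17, 204}, 1: {17}, 2: {17, 99}}
column_failures = find_column_failures(plane)
```

In this example only bit location 17 satisfies the definition, since the other failing locations are isolated to a single block.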
A plane failure is similar to a column failure in that it affects all the blocks on a plane. However, a plane failure encompasses all even rows and columns or all odd rows and columns in a NAND die. A plane failure differs from a column failure in two key respects. The first is the scope of the errors: plane failure errors occur in the hundreds, and memory function progressively degrades over time. The second is that plane failure bit errors are not confined to a fixed bit location. Left unchecked, a plane failure will eventually overwhelm the ECC correction capability of the SSD, with the result that an entire plane appears to fail over a relatively short period of time.
To identify failures and avert unrecoverable data loss, a proactive tool known as Patrol Read is employed. Patrol Read, which is analogous to Background Media Scan in an HDD, is used to verify the flash media by periodically sending internally generated read commands to the media. A background scan of the entire SSD may take days to complete. Once the Patrol Read has checked all the data blocks in the NAND flash media, it repeats the check indefinitely.
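The repeating background scan described above may be sketched, for illustration only, as a generator that issues one internally generated read per block and restarts when it reaches the end of the media. The names `patrol_read` and `read_block`, and the use of a simple block index, are assumptions made for this sketch and do not reflect any particular SSD's firmware interface.

```python
from itertools import islice
from typing import Callable, Iterator, Tuple


def patrol_read(
    num_blocks: int, read_block: Callable[[int], bool]
) -> Iterator[Tuple[int, int, bool]]:
    """Scan blocks 0..num_blocks-1 repeatedly, yielding (pass, block, ok)."""
    if num_blocks <= 0:
        raise ValueError("num_blocks must be positive")
    pass_number = 0
    while True:  # the check repeats indefinitely
        for block in range(num_blocks):
            yield pass_number, block, read_block(block)
        pass_number += 1


# Hypothetical media with four blocks, where block 2 cannot be read cleanly.
bad_blocks = {2}
results = list(islice(patrol_read(4, lambda b: b not in bad_blocks), 8))
failed = [(p, b) for p, b, ok in results if not ok]
```

Here `islice` bounds the otherwise endless scan to two full passes, and `failed` records that block 2 was reported on each pass.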
The performance and response times of the SSD are not affected by Patrol Read, which executes in a highly parallelized manner and enables many commands to execute simultaneously. Since this is the normal mode of operation, the insertion of a relatively small number of read commands does not impact user commands. The result is that the response time of typical workloads is unaffected. Under very heavy workloads, priority is given to servicing the I/O workload over the Patrol Read function. Since Patrol Read provides so much extra user data protection with no impact on user performance, it is recommended that Patrol Read never be turned off, but remain enabled to continuously run in the background.
However, a breakdown of a NAND Flash plane remains a serious concern, as the rate of failure may exceed the capability of error detection because the Patrol Read function validates any given region of the media only periodically. Ideally, in a progressive failure mode, detection and reporting of the NAND Flash plane failure should occur before an Unrecoverable READ Error is observed by the user. The normal Patrol Read function scans the logical block addresses (LBAs) of the SSD storage media progressively from lowest to highest to validate that all of the data blocks can be successfully read, and relocates data if necessary.
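The periodicity concern can be made concrete with simple arithmetic: with a strictly sequential scan, the time between successive visits to the same LBA equals the time for one full pass. The sketch below uses purely hypothetical figures (a 400 GB device and a 2 MB/s background read rate) that are not taken from any device specification.

```python
def full_scan_seconds(capacity_bytes: int, patrol_bytes_per_second: float) -> float:
    """Time for one sequential Patrol Read pass over the whole device."""
    return capacity_bytes / patrol_bytes_per_second


# Hypothetical figures chosen only for illustration.
capacity = 400 * 10**9   # 400 GB device
rate = 2 * 10**6         # 2 MB/s of background reads
seconds = full_scan_seconds(capacity, rate)
days = seconds / 86400   # seconds per day
```

Under these assumed figures one pass takes 200,000 seconds, or roughly 2.3 days, so a plane that degrades faster than that could fail between two consecutive visits.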
The improved system and method for detecting potential flash memory failures described herein leverages the existing Patrol Read feature function NAND Flash plane failure algorithms, but enhances detection capability by establishing a data unit bit error threshold over unit time. When the observed bit errors meet or exceed the threshold, the Patrol Read feature is biased toward the NAND Flash plane that contains the data units that met or exceeded the threshold.
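One possible reading of the threshold mechanism is sketched below for illustration: bit errors observed on reads are accumulated per plane over a sliding time window, and a plane is flagged for biased Patrol Read once the accumulated count meets or exceeds the threshold. The class name `PlaneErrorMonitor`, the sliding-window interpretation of "over unit time", and the numeric values are all assumptions of this sketch, not details stated in the description.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class PlaneErrorMonitor:
    """Track bit errors per plane over a sliding time window (assumed model)."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        # plane index -> queue of (timestamp, bit_errors) observations
        self._events: Dict[int, Deque[Tuple[float, int]]] = defaultdict(deque)

    def record_read(self, plane: int, bit_errors: int, now: float) -> bool:
        """Record one data unit read; return True if the plane should be biased."""
        events = self._events[plane]
        events.append((now, bit_errors))
        # Discard observations that have aged out of the window.
        while events and now - events[0][0] > self.window_seconds:
            events.popleft()
        return sum(count for _, count in events) >= self.threshold


# Hypothetical values: 100 bit errors within 60 seconds flags the plane.
monitor = PlaneErrorMonitor(threshold=100, window_seconds=60.0)
observations = [(40, 0.0), (45, 10.0), (20, 20.0), (5, 200.0)]
results = [monitor.record_read(3, errors, t) for errors, t in observations]
```

In this run the third read brings the windowed total for plane 3 to 105 and flags it, while the last read arrives after the earlier observations have expired and does not. A caller would respond to a flag by scheduling Patrol Read on the flagged plane ahead of the normal lowest-to-highest LBA sequence.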
Typically, bad memory units are discovered through normal read and write accesses to the SSD. Each time that a data unit is retrieved from the NAND Flash array, the number of bits in error is observed and a determination is made as to whether correction is required and whether the error is correctable. The method of the present invention, illustrated in the flowchart of
The system and method provided by this invention enables earlier detection and reporting of an impending NAND Flash device plane failure, which enables a proactive service strategy of device replacement prior to data loss events that may occur as a result.
The foregoing description of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed.
Additional alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teaching. Accordingly, this invention is intended to embrace all alternatives, modifications, equivalents, and variations that fall within the spirit and broad scope of the attached claims.
This application claims priority under 35 U.S.C. §119(e) to the following co-pending and commonly-assigned provisional patent application, which is incorporated herein by reference: Provisional Patent Application Ser. No. 61/580,941, entitled “SYSTEM AND METHOD FOR SOLID STATE DISK FLASH PLANE FAILURE DETECTION” by Robert Kubo; filed on Dec. 28, 2011.
Number | Date | Country
---|---|---
61580941 | Dec 2011 | US