The present invention relates to disk array storage devices for computer systems and, more particularly, to an improved method for rebuilding data following a disk failure within a RAID storage system.
A disk array or RAID (Redundant Array of Inexpensive Disks) storage system comprises two or more computer system hard disk drives or solid state drives. Several disk array design alternatives were first described in an article titled “A Case for Redundant Arrays of Inexpensive Disks (RAID)” by David A. Patterson, Garth Gibson and Randy H. Katz; University of California Report No. UCB/CSD 87/391, December 1987. This article discusses disk arrays and the improvements in performance, reliability, power consumption and scalability that disk arrays provide in comparison to single large magnetic disks. Five disk array arrangements, referred to as RAID levels, are described. The simplest array, a RAID level 1 system, comprises one or more disks for storing data and an equal number of additional “mirror” disks for storing copies of the information written to the data disks. The remaining RAID levels, identified as RAID level 2, 3, 4 and 5 systems, segment the data into portions for storage across several data disks. One or more additional disks are utilized to store error check or parity information.
In 1993, these RAID levels were formalized in the first edition of the RAIDBook, published by the RAID Advisory Board, an association of manufacturers and consumers of disk array storage systems. In addition to the five RAID levels described by Patterson et al., the RAID Advisory Board now recognizes four additional RAID levels, including RAID level 0, RAID level 6, RAID level 10 and RAID level 53. RAID level 3, 5, and 6 disk array systems are illustrated in
In order to coordinate the operation of the multitude of disk or tape drives within an array to perform read and write functions, parity generation and checking, and data restoration and reconstruction, complex storage management techniques are required. Array operation can be managed through software routines executed by the host computer system or by a dedicated hardware controller constructed to control array operations.
RAID level 2 and 3 disk arrays are known as parallel access arrays.
Parallel access arrays require that all member disks (data and parity disks) be accessed, and in particular, written, concurrently to execute an I/O request. RAID level 4 and 5 disk arrays are known as independent access arrays. Independent access arrays do not require that all member disks be accessed concurrently in the execution of a single I/O request. Operations on member disks are carefully ordered and placed into queues for the member drives.
RAID level 2, 3, and 4 disk arrays include one or more drives dedicated to the storage of parity or error correction information. Referring to
RAID level 5 disk arrays are similar to RAID level 4 systems except that the parity information, in addition to the data, is distributed across the N+1 disks in each group. Each of the N+1 disks within the array includes some blocks for storing data and some blocks for storing parity information. The location at which parity information is stored within the array is controlled by an algorithm implemented by the user. As in RAID level 4 systems, RAID level 5 writes typically require access to two disks; however, no longer does every write to the array require access to the same dedicated parity disk. This feature provides the opportunity to perform concurrent write operations. Referring to
The relationship between the parity and data blocks in the RAID level 5 system illustrated in
PARITY Ap=(BLOCK A1)⊕(BLOCK A2)⊕(BLOCK A3)
PARITY Bp=(BLOCK B1)⊕(BLOCK B2)⊕(BLOCK B3)
PARITY Cp=(BLOCK C1)⊕(BLOCK C2)⊕(BLOCK C3)
PARITY Dp=(BLOCK D1)⊕(BLOCK D2)⊕(BLOCK D3)
As shown above, parity data can be calculated by performing a bit-wise exclusive-OR of corresponding portions of the data stored across the N data drives. Alternatively, because each parity bit is simply the exclusive-OR product of all the corresponding data bits from the data drives, new parity can be determined from the old data and the old parity as well as the new data in accordance with the following equation:
new parity=old data⊕new data⊕old parity.
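As a minimal illustration only, the two parity relationships above can be expressed in the following Python sketch; the block names and block contents (block_a1 through block_a3) are hypothetical and chosen solely for the example.

```python
# Minimal sketch of the parity relationships above, assuming equal-sized byte
# blocks; block names and contents (block_a1 ... block_a3) are hypothetical.

def xor_blocks(*blocks: bytes) -> bytes:
    """Bit-wise exclusive-OR of equal-length byte blocks."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Full parity generation: PARITY Ap = (BLOCK A1) xor (BLOCK A2) xor (BLOCK A3)
block_a1, block_a2, block_a3 = b"\x0f" * 8, b"\x33" * 8, b"\x55" * 8
parity_ap = xor_blocks(block_a1, block_a2, block_a3)

# Incremental update: new parity = old data xor new data xor old parity
new_a2 = b"\xaa" * 8
new_parity_ap = xor_blocks(block_a2, new_a2, parity_ap)

# The incremental result matches recomputing the parity from scratch.
assert new_parity_ap == xor_blocks(block_a1, new_a2, block_a3)
```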
RAID level 6 extends RAID level 5 by adding an additional parity block, using block-level striping with two parity blocks distributed across all member disks. Referring to
Parity-based RAID systems, e.g., RAID 3, 5, and 6 systems, incur a substantial performance impact under READ-dominant workloads when the RAID group has a failed drive. When the RAID group is degraded in this manner, every host READ operation issued to the failed drive within the group must instead be serviced by reading from the remaining drive members in the group and then regenerating the “missing” data (on the failed drive) from parity.
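A minimal sketch of this regeneration step is shown below, assuming a hypothetical four-drive RAID level 5 group with hypothetical block contents; the function name regenerate_missing is illustrative only.

```python
# Hedged sketch of the on-the-fly regeneration described above: the block that
# resided on the failed drive is recovered by XOR-ing the corresponding blocks
# read from every surviving member (data and parity).
from functools import reduce

def regenerate_missing(surviving_blocks: list[bytes]) -> bytes:
    """XOR together the corresponding blocks from all surviving drives."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), surviving_blocks)

# Hypothetical 4-drive RAID level 5 group (3 data blocks + parity), drive 2 failed.
d1, d3 = b"\x0f" * 4, b"\x55" * 4
d2 = b"\x33" * 4
parity = bytes(a ^ b ^ c for a, b, c in zip(d1, d2, d3))
assert regenerate_missing([d1, d3, parity]) == d2
```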
The impact of these "on-the-fly" data-rebuild operations is proportional to the size of the RAID group. If the RAID group has N drives, the number of discrete READ operations required to rebuild the missing data requested by the host is N−1. Likewise, the probability that a host READ to the RAID group will result in a costly on-the-fly rebuild operation is approximately 1/N.
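These two observations can be combined into a simple expected-cost estimate. The sketch below merely restates the N−1 reads and approximately 1/N hit probability described above; the group sizes are hypothetical and chosen only for illustration.

```python
# Sketch of the degraded-read cost model described above (no specific product assumed).
def degraded_read_cost(n_drives: int) -> dict:
    """For an N-drive parity RAID group with one failed member:
    - a host READ that lands on the failed drive needs N-1 back-end reads,
    - the chance a host READ lands on the failed drive is roughly 1/N."""
    hit_probability = 1.0 / n_drives
    reads_if_hit = n_drives - 1
    # Expected back-end reads per host READ (one read when the target drive survives).
    expected_reads = (1 - hit_probability) * 1 + hit_probability * reads_if_hit
    return {"hit_probability": hit_probability,
            "reads_if_hit": reads_if_hit,
            "expected_reads_per_host_read": expected_reads}

for n in (4, 8, 16):
    print(n, degraded_read_cost(n))
```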
The table below demonstrates the theoretical performance impact of a READ workload on a RAID level 6 group, assuming the access probability described above.
For many applications, the performance degradation levels summarized above are severe. It is therefore desirable for the storage array to perform a background RAID group rebuild to a "spare" drive as soon as possible. Not only does this action put the RAID group on a path back to an optimal state, but host READs associated with rebuilt RAID stripes no longer require "on-the-fly" data-rebuild operations to satisfy the associated host IO. As a result, even though the RAID group is still "degraded" during the background rebuild, the performance degradation actually experienced by the host is much lower for the IO workloads that are associated with "rebuilt" RAID stripes.
The problem with this conventional rebuild model is that full-capacity RAID rebuilds take a very long time to complete, on the order of several days for today's multi-TB drives, and current rebuild algorithms dictate a sequential rebuild of data from the lowest RAID group logical block address (LBA) to the last LBA of the group. If the host workload in question has a frequently accessed "working set" whose associated LBA range is logically "distant" from this rebuild process, considerable time will pass before the RAID stripes associated with the host workload are rebuilt by this background process. This relationship is illustrated in the Conventional Rebuild Model diagram of
Referring to
As discussed above, and shown in
The solution described herein details a new rebuild algorithm that greatly reduces the period during which host READs are impacted by "on-the-fly" data-rebuild operations, and therefore the period during which host performance is heavily degraded.
To achieve this result, it is proposed that the RAID controller keep track of the relative number of READ operations across the RAID group such that the most frequently read areas of the RAID group can be rebuilt before less frequently accessed areas.
MB_Count = RF * (MB_RI + Σ_{i=1}^{R}
Both of the methods above provide the means to differentiate the relative number of MBs read (MB_Count) per section. Using either of these approaches, the controller can create an effective “heat map” for READ frequency as a function of LBA range—where “hot” is considered a region of high READ frequency, and “cold” is considered a region of low READ frequency. An illustration of this relationship is shown in
The order in which the sections are rebuilt is governed by how frequently those sections were previously read by the host (their MB_Count values). By rebuilding the "hottest", or most frequently read, sections first, the time during which the host is significantly impacted by the rebuild process is limited to the time it takes to rebuild only those sections.
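One way such tracking and ordering could be sketched is shown below; the section size, block size, and names such as mb_count and rebuild_section are assumptions chosen only for illustration, and the exact weighting formula given above is not reproduced.

```python
# Hedged sketch: track host READ volume per RAID-group section, then rebuild
# the most frequently read ("hottest") sections first. Section size, block
# size, and all names here are illustrative assumptions, not the disclosed
# implementation.
from collections import defaultdict

SECTION_BLOCKS = 2_097_152          # hypothetical section size, in blocks
BLOCK_BYTES = 512                   # hypothetical block size, in bytes

mb_count = defaultdict(float)       # section index -> MBs read by the host

def record_host_read(start_lba: int, block_count: int) -> None:
    """Accumulate MBs read per section as host READs arrive (reads that span
    sections are attributed to the starting section for simplicity)."""
    section = start_lba // SECTION_BLOCKS
    mb_count[section] += (block_count * BLOCK_BYTES) / (1024 * 1024)

def rebuild_section(section: int) -> None:
    # Placeholder: regenerate every stripe in this section onto the spare drive.
    pass

def heat_ordered_rebuild(total_sections: int) -> None:
    """Rebuild the hottest sections first; never-read sections fall to the end."""
    order = sorted(range(total_sections),
                   key=lambda s: mb_count.get(s, 0.0),
                   reverse=True)
    for section in order:
        rebuild_section(section)
```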
As with conventional rebuilds, the proposed rebuild algorithm will require a few safeguards to ensure the controller always completes its rebuild operations, even if interrupted:
The above status attributes will allow the controller to complete the overall background rebuild process for the Active section, as well as the “in-queue” sections, following power-loss or other interruptions. Furthermore, this information will enable the controller to consider “rebuild-complete” sections as optimal, such that no burdensome “on-the-fly” data-rebuild operations are attempted for READ I/Os associated with the rebuilt sections of the LBA address range.
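A hedged sketch of such persisted per-section status follows; the attribute names (NOT_STARTED, ACTIVE, COMPLETE) and the JSON persistence format are assumptions chosen only for illustration.

```python
# Hedged sketch of persisted per-section rebuild status, so the controller can
# resume after a power loss and treat COMPLETE sections as optimal.
import json
from enum import Enum

class SectionStatus(str, Enum):      # attribute names are illustrative assumptions
    NOT_STARTED = "not_started"
    ACTIVE = "active"
    COMPLETE = "complete"

def save_status(path: str, status: dict) -> None:
    """Persist the per-section status map so it survives interruption."""
    with open(path, "w") as f:
        json.dump({str(k): v.value for k, v in status.items()}, f)

def needs_on_the_fly_rebuild(status: dict, section: int) -> bool:
    """READs to COMPLETE sections are served normally; all other sections still
    require on-the-fly regeneration from parity while the group is degraded."""
    return status.get(section, SectionStatus.NOT_STARTED) is not SectionStatus.COMPLETE
```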
The Figures and description of the invention provided above reveal a novel system and method for rebuilding data following a disk failure within a RAID storage system. The rebuild process keeps track of the relative number of READ operations across a RAID group so that following a RAID disk failure, the most frequently read areas of the RAID group can be rebuilt before less frequently accessed areas. Host READs to the rebuilt area will no longer necessitate on-the-fly rebuild from parity, and thus host performance will be much less impacted than with prior rebuild processes. The degree to which this rebuild algorithm will reduce the period of host degradation is a function of several factors.
For applications that present a fairly consistent workload over a modest percentage of the RAID group capacity that is "deep" within the RAID group's LBA address space, the benefit provided by this algorithm would be substantial.
Instructions of the various software routines discussed herein are stored on one or more storage modules in the system shown in
Data and instructions of the various software routines are stored in respective storage modules, which are implemented as one or more machine-readable storage media. The storage media include different forms of memory including semiconductor memory devices such as dynamic or static random access memories (DRAMs or SRAMs), erasable and programmable read-only memories (EPROMs), electrically erasable and programmable read-only memories (EEPROMs) and flash memories; magnetic disks such as fixed, floppy and removable disks; other magnetic media including tape; and optical media such as compact disks (CDs) or digital video disks (DVDs).
The instructions of the software routines are loaded or transported to each device or system in one of many different ways. For example, code segments including instructions stored on floppy disks, CD or DVD media, a hard disk, or transported through a network interface card, modem, or other interface device are loaded into the device or system and executed as corresponding software modules or layers.
The foregoing description of various embodiments of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed. Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teaching.
This application claims priority under 35 U.S.C. §119(e) to the following co-pending and commonly-assigned patent application, which is incorporated herein by reference: Provisional Patent Application Ser. No. 61/801,108, entitled “METHOD AND SYSTEM FOR REBUILDING DATA FOLLOWING A DISK FAILURE WITHIN A RAID STORAGE SYSTEM,” filed on Mar. 15, 2013, by Matthew Fischer.