The present invention relates to the field of data storage, and particularly to disk array systems. More specifically, this invention pertains to a method for constructing disk array systems that tolerate many combinations of failed storage devices without loss of data.
Computer systems utilize data redundancy schemes such as parity computation to protect against loss of data on a storage device. A redundancy value is computed by calculating a function of the data of a specific word size, also referenced as a data element, across a quantity of similar storage devices, also referenced as data drives. One example of such redundancy is exclusive OR (XOR) parity that is computed as the binary sum of the data.
The redundancy values, hereinafter referenced as parity values, are stored on a plurality of storage devices in locations referenced as parity elements. In the case of a storage device failure that causes a loss of parity element values, the parity values can be regenerated from data stored on one or more of the data elements. Similarly, in the case of a storage device failure that causes a loss of data element values, the data values can be regenerated from the values stored on one or more of the parity elements and possibly one or more of the other non-failed data elements.
In Redundant Array of Independent Disks (RAID) systems, data values and related parity values are striped across disk drives. In storage subsystems that manage hard disk drives as a single logical direct (DASD) or network attached (NASD) storage device, the RAID logic is implemented in an array controller of the subsystem. Such RAID logic may also be implemented in a host system in software or in some other device in a network storage subsystem.
Disk arrays, in particular RAID-3 and RAID-5 disk arrays, have become accepted designs for highly available and reliable disk subsystems. In such arrays, the XOR of data from some number of disks is maintained on a redundant disk (the parity drive). When a disk fails, the data on it can be reconstructed by exclusive-ORing the data and parity on the surviving disks and writing this data into a spare disk. Data is lost if a second disk fails before the reconstruction is complete.
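The XOR parity and single-disk rebuild described above can be sketched as follows. This is a minimal illustration only: the helper name `xor_parity` and the two-byte sample blocks are assumptions made for the example and do not come from any particular RAID implementation.

```python
from functools import reduce
from operator import xor

def xor_parity(blocks):
    """Return the bytewise XOR of equal-length byte blocks."""
    return bytes(reduce(xor, column) for column in zip(*blocks))

# Three hypothetical data "disks", each holding a two-byte block.
data = [b"\x0f\xa0", b"\x33\x55", b"\xf0\x0f"]

# The parity drive holds the XOR of all data blocks.
parity = xor_parity(data)

# Simulate losing disk 1: XOR the surviving data blocks with the parity block.
rebuilt = xor_parity([data[0], data[2], parity])
```

Because XOR is its own inverse, the same routine serves for both encoding and reconstruction; a second failure before the rebuild completes leaves two unknowns and only one equation, which is why data is then lost.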
RAID-6 is an extension of RAID-5 that protects against two drive failures. There are many other RAID algorithms that have been proposed to tolerate two drive failures: for example, Reed-Solomon [reference is made to I. S. Reed, et al., “Polynomial codes over certain finite fields,” Journal of the Society for Industrial and Applied Mathematics, vol. 8, pp. 300-304, 1960], Blaum-Roth [reference is made to M. Blaum, et al., “On lowest density MDS codes,” IEEE Transactions on Information Theory, vol. 45, pp. 46-59, 1999], EvenOdd [reference is made to M. Blaum, et al., “EVENODD: an efficient scheme for tolerating double disk failures in RAID architectures,” IEEE Transactions on Computers, vol. 44, pp. 192-202, 1995], Row-Diagonal Parity [reference is made to P. Corbett, et al., “Row-diagonal parity technique for enabling recovery from double failures in a storage array,” (U.S. patent application US 20030126523)], XCode [reference is made to L. Xu, et al., “X-code: MDS array codes with optimal encoding,” IEEE Transactions on Information Theory, pp. 272-276, 1999], ZZS [reference is made to G. V. Zaitsev, et al., “Minimum-check-density codes for correcting bytes of errors,” Problems in Information Transmission, vol. 19, pp. 29-37, 1983], BCP [reference is made to S. Baylor, et al., “Efficient method for providing fault tolerance against double device failures in multiple device systems,” (U.S. Pat. No. 5,862,158)], and LSI [reference is made to A. Wilner, “Multiple drive failure tolerant raid system,” (U.S. Pat. No. 6,327,672 B1)]. There have been a few additional extensions that protect against multiple drive failures: for example, Reed-Solomon [referenced above], EO+ [reference is made to M. Blaum, et al., “MDS array codes with independent parity symbols,” IEEE Transactions on Information Theory, vol. 42, pp. 529-542, 1996], and [reference is made to copending U.S. patent application Ser. No. 10/956,468, filed Sep. 30, 2004, which is incorporated by reference].
More recently, storage systems have been designed wherein the storage devices are nodes in a network (not simply disk drives). Such systems may also use RAID techniques for data redundancy and reliability. The present invention is applicable to these systems as well. Though the description herein is exemplified using the disk array, it should be clear to one skilled in the art how to extend the invention to the network node application or other systems built from storage devices other than disks.
Although conventional RAID technology has proven to be useful, it would be desirable to present additional improvements. As can be seen by the various conventional RAID techniques that have been used or proposed, none has been a perfect solution to the variety of requirements that the computer industry places on a storage subsystem. Many conventional systems are complex, requiring extensive computer overhead. Furthermore, many conventional systems have excessive disk IO requirements for certain operations. Others require a large number of drives in the system, and the use of more drives reduces overall system reliability. Many conventional codes that tolerate T failures (that is, all possible combinations of T drives failing) cannot tolerate any combination of more than T drives failing. Conventional RAID techniques that can tolerate additional combinations of failures beyond T have a higher reliability than those that do not.
What is therefore needed is a system, a computer program product and an associated method for enabling recovery from failures in a storage system that is simple, can handle a large range of failure cases, and has reasonable performance properties. The need for such a solution has heretofore remained unsatisfied.
The present invention satisfies this need, and presents a system, a computer program product, and an associated method (collectively referred to herein as “the system” or “the present system”) for tolerating multiple storage device failures in a storage system with constrained parity in-degree, thus enabling efficient implementation of operations required of a storage subsystem.
The present system can tolerate all combinations of T failures, for some parameter T. A feature of the invention is that every data element is redundantly stored in exactly T parity elements; that is, the data out-degree for each data element is exactly T. This is the optimum value for any code that tolerates all combinations of T failures. Consequently, the present system has optimal update IO requirements among codes of high fault-tolerance.
Another feature of the invention is that the array size can expand beyond a minimum without changing the complexity of the computer overhead.
A further feature of the invention is that every parity element is computed from some number K≤T data elements; that is, the parity in-degree is exactly K. This feature allows for a number of advantages of the present system. The set of data and parity elements that are directly related to a single data element through the parity formulas is limited by at most (K−1)T, which is independent of the array size. Consequently, an update occurring to one data element requires only a fixed subset of the array to be “locked” rather than the entire array, to prevent the storage system from attaining an undesirable inconsistent state. Such states can occur if a failure happens while two simultaneous updates are in progress to overlapped lock-zones.
Another advantage of the parity in-degree being exactly K is that parity computations require fewer resources because all parity equations have only K inputs (K is fixed independent of the array size). In contrast, all parity computations in a conventional RAID5 code require N−1 inputs where N is the array size. A further advantage of the parity in-degree being exactly K is that only a subset of the entire array may be needed to rebuild the lost data or parity when a failure occurs, thereby improving rebuild time and costs. Yet another advantage of the parity in-degree being exactly K is that certain combinations of more than T failures can be tolerated by some embodiments of the present system, improving overall system reliability compared to conventional systems.
In yet another feature of the present invention, both data elements and related parity elements are laid out on the disks together; that is, the data elements and parity elements from the same stripe, or code instance, both appear on each disk. The portion of the disk comprising both data elements and parity elements from the same stripe is referenced hereinafter as a strip. The term “vertical layout” is hereinafter used to refer to the property that strips within a stripe contain both data elements and parity elements. The term “horizontal layout” is hereinafter used to refer to the property that strips within a stripe contain either only data elements or only parity elements.
Some operations that the storage system needs to perform require reading or writing at least one data element and one parity element at the same time or in parallel. In yet a further feature of the present invention, the vertical layout of the present system enables certain of these read or write operations to be performed by a single disk IO seek. Consequently, the present system reduces the disk IO overhead of these operations.
The various features of the present invention and the manner of attaining them will be described in greater detail with reference to the following description, claims, and drawings, wherein reference numerals are reused, where appropriate, to indicate a correspondence between the referenced items, and wherein:
With further reference to
In
In general, system 10 requires that there be at least one data element and at least one parity element per strip and that the same number of data elements (parity elements, respectively) is designated for each strip in the stripe. Horizontally, the elements at the same offset within each strip form a row of elements in the stripe. In
It should be clear to those skilled in the art that system 10 can be implemented in multiple stripes on the same disks or on subsets or supersets of the disks of one implementation. It should also be clear that each implementation may have different values for the parameters N, T, K, R and Q.
The efficiency of a redundancy scheme is given by the ratio of the number of data elements to the total number of data elements and parity elements. In system 10, this value is given by the formula:
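As a hedged reconstruction of this ratio (the formula itself is not reproduced in this text), the efficiency can be written in terms of the number of data rows R and parity rows Q per strip, and then simplified using the constraint RT = QK stated as equation (4) later in this description. The symbol Eff is introduced here for illustration only.

```latex
\mathrm{Eff}
  = \frac{N R}{N R + N Q}
  = \frac{R}{R + Q}
  = \frac{R}{R + RT/K}
  = \frac{K}{K + T}
```

Under this reading, the efficiency depends only on the parity in-degree K and the fault tolerance T, not on the array size N.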
One matching for stripe 405 that can tolerate two failed disks is represented in
The matching of
Another property of the matching of system 10 represented in
The preferred embodiment of the LSI code [reference is made to A. Wilner, “Multiple drive failure tolerant raid system,” (U.S. Pat. No. 6,327,672 B1)] also has a simple pairing of data elements to parity elements as the preferred embodiment of
Another matching for the present invention, represented in stripe 405, can tolerate T failures with T≧2. As before, the embodiment of
As before, it is assumed that all subscripts in this representation are taken modulo N. The construction places this parity element Pj, 460, at some offset i−j from data element Di, 465. A cyclic application of this rule suffices to define one embodiment of system 10 of
Other configurations and embodiments of system 10 are possible.
Stripe 505 comprises strip 0, 510, strip 1, 515, strip 2, 520, to strip j, 525, and further to strip N−1, 530. Stripe 505 further comprises data row 0, 535, parity row 0, 540, parity row 1, 545, through parity row L, 550, and further to parity row Q−1, 555. Data element Dj, 560, touches parity elements in parity row 0, 540, in a sequential pattern moving to the left and starting at some offset S with wrap-around from left to right. In parity row 1, 545, data element Dj, 560, touches a sequence of parity elements starting one position to the left of the last parity element touched in parity row 0, 540, and moving again to the left, but skipping every other parity element, with wrap-around from left to right. Continuing, in parity row L, 550, data element Dj, 560, touches a sequence of parity elements starting one position to the left of the last parity element touched in the preceding row, moving again left, skipping to the (L+1) next parity element with wrap-around. This continues until the last parity row Q−1, 555, has had K parity elements touched by Dj according to this rule.
In mathematical terms, the data element Dj touches the parity elements given by
$P[L,\ j - (K-1)L(L+1)/2 - s(L+1) - S], \quad s = 1, \ldots, K. \qquad (2)$
In equivalent mathematical terms, parity P[L, j] is computed by the formulas:
In equation (2), the second index for P[L, j] is taken modulo N; in equation (3), the subscript of the symbol D is also taken modulo N. Table 2 shows a range of values for which these formulas produce valid T-fault-tolerant embodiments of system 10 of
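The touch pattern of equation (2) and its inversion into per-parity input lists can be sketched as below. The function names and the sample parameters (N=12, K=2, Q=2, S=0) are assumptions chosen for illustration; they are not taken from Table 2, and the sketch checks only the in-degree and out-degree, not fault tolerance.

```python
def touched(j, n, k, q, s_off):
    """Parity elements (row L, column) touched by data element D_j per equation (2)."""
    result = []
    for L in range(q):
        # L * (L + 1) is always even, so the integer division is exact.
        base = j - (k - 1) * L * (L + 1) // 2 - s_off
        for s in range(1, k + 1):
            result.append((L, (base - s * (L + 1)) % n))
    return result

def parity_inputs(n, k, q, s_off):
    """Invert the touch relation: map each parity element to the data indices it XORs."""
    inputs = {(L, col): [] for L in range(q) for col in range(n)}
    for j in range(n):
        for element in touched(j, n, k, q, s_off):
            inputs[element].append(j)
    return inputs

# Hypothetical parameters for illustration only.
N, K, Q, S = 12, 2, 2, 0
pattern = touched(0, N, K, Q, S)
inputs = parity_inputs(N, K, Q, S)
```

With a single data row, each data element touches K parity elements in each of the Q parity rows, so its out-degree is QK, and by the cyclic symmetry each parity element receives exactly K inputs.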
These examples show that system 10 can be used to build simple and highly fault-tolerant storage systems.
The data/parity element layout of yet another embodiment that is 3-fault-tolerant is given in
Similarly,
The parity placement equations for this embodiment of
$P_{3j} = D_{2j+2} \oplus D_{2j+4}$
$P_{3j+1} = D_{2j+3} \oplus D_{2j+5}$
$P_{3j+2} = D_{2j-3} \oplus D_{2j-2}$
The first equation shows that the left-neighbor pairing from the data row 0, 610, is placed one strip to the left of the left-neighbor and into parity row 0, 620. The second equation shows that the left-neighbor pairing from data row 1, 615, is placed one strip to the left of the left-neighbor and into parity row 1, 625. The third equation shows that the lower-left/upper-right neighbor pairing between data row 0, 610, and data row 1, 615, is placed into parity row 2, 630, on the strip to the right of the upper-right neighbor. It should be understood that subscripts for data elements are taken modulo 2N and for parity elements modulo 3N.
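The three placement equations can be exercised with a short sketch. It assumes that data element D with index 2j+r sits in data row r of strip j and that parity element P with index 3j+r sits in parity row r of strip j, which is how the subscripts read against the description above; the function names and the choice N=8 are illustrative only.

```python
from collections import Counter

def parity_equations(n):
    """Map each parity index to the pair of data indices it XORs (two data rows, three parity rows)."""
    eqs = {}
    for j in range(n):
        eqs[3 * j] = ((2 * j + 2) % (2 * n), (2 * j + 4) % (2 * n))
        eqs[3 * j + 1] = ((2 * j + 3) % (2 * n), (2 * j + 5) % (2 * n))
        eqs[3 * j + 2] = ((2 * j - 3) % (2 * n), (2 * j - 2) % (2 * n))
    return eqs

def encode(data, n):
    """Compute all 3n parity values from 2n data values (integers stand in for data elements)."""
    eqs = parity_equations(n)
    return [data[a] ^ data[b] for _, (a, b) in sorted(eqs.items())]

N = 8
eqs = parity_equations(N)
# Count how many parity equations each data element appears in (its out-degree).
out_degree = Counter(d for pair in eqs.values() for d in pair)
parity = encode(list(range(2 * N)), N)
```

Every parity element has two inputs (K=2) and every data element appears in three equations, which is consistent with the constraint RT = QK for R=2, Q=3, T=3.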
It can be seen that the data row 0, 610, and parity row 0, 620, comprise a subcode of the embodiment of
The embodiment of
The parity placement equations for
$P_{4j} = D_{2j+2} \oplus D_{2j+4}$
$P_{4j+1} = D_{2j+3} \oplus D_{2j+5}$
$P_{4j+2} = D_{2j-5} \oplus D_{2j-4}$
$P_{4j+3} = D_{2j-6} \oplus D_{2j-3}$
The first equation shows that the left-neighbor pairing from data row 0, 710, is placed one strip to the left of the left-neighbor and into parity row 0, 720. The second equation shows that the left-neighbor pairing from data row 1, 715, is placed one strip to the left of the left-neighbor and into parity row 1, 725. The third equation shows that the lower-left/upper-right neighbor pairing between data row 0, 710, and data row 1, 715, is placed into parity row 2, 730, on the strip that is two strips to the right of the upper-right neighbor. The fourth equation shows that the upper-left/lower-right neighbor pairing between data row 0, 710, and data row 1, 715, is placed into parity row 3, 735, two strips to the right of the lower-right data element of the pair. It should be understood that subscripts for data elements are taken modulo 2N and for parity elements modulo 4N.
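Whether a given set of failed strips is recoverable under these equations can be explored with a brute-force rank check over GF(2), sketched below. The sketch assumes that the fourth equation defines parity row 3 (index 4j+3), as its description indicates, and that data index 2j+r and parity index 4j+r both lie on strip j. All function names and the choice N=8 are hypothetical, and the sketch makes no claim about which failure combinations a particular N tolerates.

```python
from itertools import combinations

def parity_equations(n):
    """Map each parity index to the pair of data indices it XORs (two data rows, four parity rows)."""
    eqs = {}
    for j in range(n):
        eqs[4 * j] = ((2 * j + 2) % (2 * n), (2 * j + 4) % (2 * n))
        eqs[4 * j + 1] = ((2 * j + 3) % (2 * n), (2 * j + 5) % (2 * n))
        eqs[4 * j + 2] = ((2 * j - 5) % (2 * n), (2 * j - 4) % (2 * n))
        eqs[4 * j + 3] = ((2 * j - 6) % (2 * n), (2 * j - 3) % (2 * n))
    return eqs

def gf2_rank(rows):
    """Rank over GF(2) of a list of bitmask rows, using an XOR basis."""
    basis = []
    for row in rows:
        for b in basis:
            row = min(row, row ^ b)  # clears b's leading bit from row if it is set
        if row:
            basis.append(row)
    return len(basis)

def recoverable(n, failed_strips):
    """True if every lost data element is determined by the surviving parity equations."""
    failed = set(failed_strips)
    lost = [d for d in range(2 * n) if d // 2 in failed]
    column = {d: i for i, d in enumerate(lost)}
    rows = []
    for p, pair in parity_equations(n).items():
        if p // 4 in failed:
            continue  # this parity element is itself lost
        mask = 0
        for d in pair:
            if d in column:
                mask ^= 1 << column[d]
        if mask:
            rows.append(mask)
    return gf2_rank(rows) == len(lost)

def tolerated_fraction(n, t):
    """Fraction of all t-strip failure combinations that are recoverable."""
    cases = list(combinations(range(n), t))
    return sum(recoverable(n, c) for c in cases) / len(cases)
```

Calling `tolerated_fraction(n, t)` for increasing `t` gives a direct way to explore how much of the failure space a given array size covers; lost parity elements need no separate check, since they can be recomputed once all data elements are known.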
It can be seen that the data rows and the parity rows are similar to the embodiment of
The embodiment of
Yet further embodiments of system 10 are possible. System 10 with parameters N, T, R, Q requires that N be sufficiently large and that:
RT=QK (4)
By fixing T (the fault-tolerance level) and K (the parity in-degree), equation (4) constrains the possible choices of the number of data rows R and the number of parity rows Q. For example, if K=2 then either R or T must be even. The embodiments represented in
R=K/M and Q=T/M where M=gcd(T, K)
These equations hold for the exemplary embodiments of
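The row counts given by R=K/M and Q=T/M can be computed directly, as in the short sketch below. The function name `minimal_rows` is an assumption made for this example.

```python
from math import gcd

def minimal_rows(t, k):
    """Return (R, Q) with R = K/M and Q = T/M, where M = gcd(T, K)."""
    m = gcd(t, k)
    return k // m, t // m

# T=3, K=2 yields two data rows and three parity rows.
r, q = minimal_rows(3, 2)
```

Any pair obtained this way satisfies equation (4), since (K/M)·T = (T/M)·K; integer multiples of the pair, such as R=2 and Q=4 for T=4 and K=2, satisfy it as well.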
The constrained parity in-degree K and the rotational patterns allow for simple implementations of system 10, even for highly fault-tolerant systems. System 10 supports all array sizes N above some limit that depends on the specific embodiment. This, together with the constrained parity in-degree, provides write-lock regions whose size is bounded independently of N when updating a single data element or a logically contiguous set of data elements. These features also provide a rebuild zone (the set of elements in a stripe that are needed to reconstruct some lost elements) that is likewise bounded independently of N. Furthermore, as N increases, system 10 can tolerate certain, but not all, combinations of more than T faults, thereby providing greater system reliability than conventional systems.
The host system, 15, performs a write to k consecutive data elements Di+1, 825, and Di+2, 830, through Dk−1, 835, and Dk, 840. In conventional systems, this write process may require reading and writing at least 3k data and parity elements for a total IO seek cost as large as 6k. System 10 can implement this update by reading only two data elements Di, 820, and Dk+1, 845 (indicated by the symbol DR for data read). System 10 computes the new values of the parity elements that also need updating, then writes out the new data values to data elements Di+1, 825, and Di+2, 830, through Dk−1, 835, and Dk, 840 (indicated by the symbol DW for data write) and the parity elements Pi−1, 850, Pi, 855, Pi+1, 860, and Pi+2, 865, through Pk, 870 (indicated by the symbol PW for parity write).
In addition, the update of some pairs of data and parity elements can be performed in one IO seek. In
It is to be understood that the specific embodiments of the invention that have been described are merely illustrative of certain applications of the principle of the present invention. Numerous modifications may be made to the system and method for tolerating multiple storage device failures in a storage system with constrained parity in-degree described herein without departing from the spirit and scope of the present invention. Moreover, while the present invention is described for illustration purposes only in relation to a fault-tolerant disk array system, it should be clear that the invention is applicable as well, for example, to any system in which the data and parity element layout is given in
Number | Name | Date | Kind |
---|---|---|---|
5072378 | Manka | Dec 1991 | A |
5331646 | Krueger et al. | Jul 1994 | A |
5416915 | Mattson et al. | May 1995 | A |
5497457 | Ford | Mar 1996 | A |
5515502 | Wood | May 1996 | A |
5613085 | Lee et al. | Mar 1997 | A |
5634109 | Chen et al. | May 1997 | A |
5734813 | Yamamoto et al. | Mar 1998 | A |
5862158 | Baylor et al. | Jan 1999 | A |
5933834 | Aichelen | Aug 1999 | A |
6038570 | Hitz et al. | Mar 2000 | A |
6052759 | Stallmo et al. | Apr 2000 | A |
6079028 | Ozden et al. | Jun 2000 | A |
6327672 | Wilner | Dec 2001 | B1 |
6332177 | Humlicek | Dec 2001 | B1 |
6467023 | DeKoning et al. | Oct 2002 | B1 |
6574746 | Wong et al. | Jun 2003 | B1 |
6826711 | Moulton et al. | Nov 2004 | B2 |
6985995 | Holland et al. | Jan 2006 | B2 |
7000143 | Moulton et al. | Feb 2006 | B2 |
7076606 | Orsley | Jul 2006 | B2 |
7103716 | Nanda | Sep 2006 | B1 |
7353423 | Hartline et al. | Apr 2008 | B2 |
Number | Date | Country | |
---|---|---|---|
20060074995 A1 | Apr 2006 | US |