The present invention relates generally to a redundant array of inexpensive disks (RAID), and more particularly, to a method for simplified parity disk generation in a RAID system.
Virtual Tape Library
A Virtual Tape Library (VTL) provides a user with the benefits of disk-to-disk backup (speed and reliability) without having to invest in a new backup software solution. The VTL appears to the backup host to be some number of tape drives; an example of a VTL system 100 is shown in
The data is stored sequentially on the disks 110 to further increase performance by avoiding seek time. Space on the disk is given to the individual data “streams” in large contiguous sections referred to as allocation units. Each allocation unit is approximately one gigabyte (1 GB) in length. As each allocation unit is filled, load balancing logic selects the best disk 110 from which to assign the next allocation unit. Objects in the VTL 106 called data maps (DMaps) keep track of the sequence of allocation units assigned to each stream. Another object, called a Virtual Tape Volume (VTV), records the record lengths and file marks as well as the amount of user data.
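By way of illustration only, the following Python sketch shows one possible in-memory representation of this bookkeeping. The class and field names (AllocationUnit, DMap, VTV, and so on) are hypothetical and are not taken from the VTL 106 itself.

```python
# Illustrative sketch only: hypothetical structures for the DMap and VTV
# bookkeeping described above. Field names are assumptions, not the VTL's own.
from dataclasses import dataclass, field
from typing import List

ALLOCATION_UNIT_BYTES = 1 << 30  # each allocation unit is approximately 1 GB

@dataclass
class AllocationUnit:
    disk_id: int   # disk 110 chosen by the load balancing logic
    offset: int    # starting offset of this contiguous section on that disk

@dataclass
class DMap:
    stream_id: int
    units: List[AllocationUnit] = field(default_factory=list)  # sequence of units for the stream

@dataclass
class VTV:
    record_lengths: List[int] = field(default_factory=list)
    file_mark_positions: List[int] = field(default_factory=list)
    user_data_bytes: int = 0
```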
There is a performance benefit to using large writes when writing to disk. To realize this benefit, the VTL 106 stores the data in memory until enough data is available to issue a large write. An example of VTL memory buffering is shown in
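One way to realize this buffering, sketched here in Python with an assumed threshold (the size of a “large write” is not fixed above), is to accumulate host data in memory and hand it to a flush routine only once the threshold is reached.

```python
# Illustrative sketch: buffer host data in memory and flush it only as one
# large write. LARGE_WRITE_BYTES is an assumed threshold, not from the source.
LARGE_WRITE_BYTES = 512 * 1024

class WriteBuffer:
    def __init__(self, flush_fn):
        self._chunks = []          # accumulated host data
        self._size = 0
        self._flush_fn = flush_fn  # callable that issues the large disk write

    def append(self, data: bytes) -> None:
        self._chunks.append(data)
        self._size += len(data)
        if self._size >= LARGE_WRITE_BYTES:
            self._flush_fn(b"".join(self._chunks))
            self._chunks.clear()
            self._size = 0
```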
RAID4
RAID (redundant array of inexpensive disks) is a method of improving fault tolerance and performance of disks. RAID4 is a form of RAID where the data is striped across multiple data disks to improve performance, and an additional parity disk is used for error detection and recovery from a single disk failure.
A generic RAID4 initializes the parity disk when the RAID is first created. This operation can take several hours, due to the slow nature of the read-modify-write process (read data disks, modify parity, write parity to disk) used to initialize the parity disk and to keep the parity disk in sync with the data disks.
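The cost of this initialization can be seen in the following sketch, which is illustrative only; the disk objects and their read()/write() calls are placeholders for whatever I/O layer a generic RAID4 implementation would use.

```python
# Illustrative sketch of generic RAID4 parity initialization: every stripe of
# every data disk is read, the parity is computed, and the parity is written
# back. The disk objects and their read()/write() methods are placeholders.
def xor_blocks(blocks):
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)

def initialize_parity(data_disks, parity_disk, stripe_size, stripe_count):
    for s in range(stripe_count):
        offset = s * stripe_size
        stripes = [d.read(offset, stripe_size) for d in data_disks]  # read
        parity_disk.write(offset, xor_blocks(stripes))               # modify and write
```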
RAID4 striping is shown in
Performance is improved because each disk only has to record a fraction (in this case, one fourth) of the data. However, the time required to update and write the parity disk decreases performance. Therefore, a more efficient way to update the parity disk is needed.
Exclusive OR Parity
Parity in a RAID4 system is generated by combining the data on the data disks using exclusive OR (XOR) operations. Exclusive OR can be thought of as addition, but with the interesting attribute that if A XOR B=C then C XOR B=A, so it is a little like alternating addition and subtraction (see Table 1; compare the first and last columns).
Exclusive OR is a Boolean operator, returning true (1) if exactly one of the two values being operated on is true and returning false (0) if neither or both of those values are true. In the following discussion, the caret symbol (‘^’) will be used to indicate an XOR operation.
If more than two operands are being acted on, XOR is associative, so A^B^C=(A^B)^C=A^(B^C), as shown in Table 2. Notice also that the final result is true when A, B, and C have an odd number of 1s between them; this form of parity is also referred to as odd parity.
Exclusive OR is a bitwise operation; it acts on one bit. Since a byte is merely a collection of eight bits, one can perform an XOR of two bytes by doing eight bitwise operations at the same time. The same aggregation allows an XOR to be performed on any number of bytes. So if one is talking about three data disks (A, B, and C) and their parity disk P, one can say that A^B^C=P and, if disk A fails, A=P^B^C. In this manner, data on disk A can be recovered.
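The following short Python example works through this relationship for three small byte strings; the values are arbitrary and serve only to demonstrate that A^B^C=P and A=P^B^C hold byte for byte.

```python
# Worked example of byte-wise XOR parity: P = A ^ B ^ C, and a lost A is
# recovered as A = P ^ B ^ C. The byte values below are arbitrary.
def xor_bytes(*blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

A = b"\x0f\xaa\x01"
B = b"\xf0\x55\x02"
C = b"\xff\x00\x04"

P = xor_bytes(A, B, C)          # parity stored on the parity disk
assert xor_bytes(P, B, C) == A  # disk A's data is recovered from P, B, and C
```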
The present invention discloses a method and system for efficiently writing data to a RAID. A method for writing data to a RAID includes the steps of writing an entire slice to the RAID at one time, wherein a slice is a portion of the data to be written to each disk in the RAID; and maintaining information in the RAID for the slices that have been written to disk.
A system for writing data to a RAID includes a buffer, a parity generating device, transfer means, and a metadata portion in the RAID. The buffer is configured to receive data from a host and configured to accumulate data until a complete slice is accumulated, wherein a slice is a portion of the data to be written to each disk in the RAID. The parity generating device is configured to read data from the buffer and to generate parity based on the read data. The transfer means is used to transfer data from the buffer and the generated parity to the disks of the RAID. The metadata portion is configured to store information for slices that have been written to disk.
A computer-readable storage medium contains a set of instructions for a general purpose computer, the set of instructions including a writing code segment for writing an entire slice to a RAID at one time, wherein a slice is a portion of the data to be written to each disk in the RAID; and a maintaining code segment for maintaining information in the RAID for the slices that have been written to disk.
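At this level of description, the components can be pictured with a sketch such as the following; the Python class and names are illustrative only and do not correspond to any particular implementation of the claimed system.

```python
# Illustrative sketch only: a buffer that accumulates host data until a full
# slice is held, and a metadata record of which slices have been written.
class SliceBuffer:
    def __init__(self, slice_size: int):
        self._slice_size = slice_size
        self._data = bytearray()

    def add(self, chunk: bytes) -> bool:
        """Accumulate host data; returns True once a complete slice is buffered."""
        self._data.extend(chunk)
        return len(self._data) >= self._slice_size

    def take_slice(self) -> bytes:
        slice_data = bytes(self._data[:self._slice_size])
        del self._data[:self._slice_size]
        return slice_data

written_slices = set()  # metadata portion: identifiers of slices written to disk
```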
A more detailed understanding of the invention may be had from the following description of a preferred embodiment, given by way of example, and to be understood in conjunction with the accompanying drawings, wherein:
Improved Parity Generation
In a general purpose RAID such as the one shown in
A more efficient way to generate the parity is to use the method 400 shown in
To be able to use stripe C and the value A^B^C^D from the parity disk to modify parity efficiently, the parity disk has to have the value A^B^C^D on it before the write to stripe C is performed. This means that the parity disk has to be initialized when the RAID is defined and added to the system. There are two ways to initialize the parity disk: (1) read the data disks and generate the parity, or (2) write the data disks with a known pattern and write the parity of that pattern to the parity disk. Both of these initialization procedures require a relatively long time to complete.
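The dependence on a pre-initialized parity disk follows from the read-modify-write update rule itself, illustrated below with single-byte stripes (the values are arbitrary): the new parity is derived from the old parity, so the old parity must already be correct.

```python
# Illustrative example: read-modify-write parity update when stripe C changes.
# P_new = P_old ^ C_old ^ C_new, so P_old must already be valid (initialized).
def xor_bytes(x: bytes, y: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(x, y))

A, B, C_old, D = b"\x01", b"\x02", b"\x04", b"\x08"
P_old = xor_bytes(xor_bytes(A, B), xor_bytes(C_old, D))  # A ^ B ^ C_old ^ D

C_new = b"\x10"
P_new = xor_bytes(xor_bytes(P_old, C_old), C_new)        # read-modify-write update

assert P_new == xor_bytes(xor_bytes(A, B), xor_bytes(C_new, D))  # A ^ B ^ C_new ^ D
```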
Sparse RAID4
The VTL has two types of data that it records to disk: large amounts of user data written to disk sequentially and a small amount of metadata (a few percent of the total) written randomly. Rather than try to use the same type of RAID to handle both types of data, one aspect of the present invention separates the disks into two parts: a small mirrored section 512 for the metadata and a large RAID4 region 514 for the user data. The mirrored sections 512 are then striped together to form a single logical space 516 for metadata. As used hereinafter, the term “metadata portion” refers to both the mirrored sections 512 individually and the single logical space 516.
As aforementioned, data maps (DMaps) keep track of the sequence of allocation units (including any additionally allocated disk space) assigned to each stream. These DMaps are part of the metadata that is stored. It should be noted that other types of metadata may be stored without departing from the spirit and scope of the present invention. For example, the metadata may also include information stored by the aforementioned Virtual Tape Volume (VTV), which records the record lengths and file marks as well as the amount of user data. The metadata can be used to improve recovery performance in the event of a disk failure. Since the metadata tracks the slices that have been written to disk, recovery can be improved by only recovering those slices that have been previously written to disk. In an alternate embodiment, the metadata can be used to track which slices have not yet been written to disk.
In the RAID4 region 514, the allocation units tracked by the data maps are adjusted to be a multiple of the slice size. Since this data is recorded in large sequential blocks, the read-modify-write behavior of a generic RAID4 can be avoided. Each new sequence of writes from the backup host starts recording at the beginning of an empty slice. Once an entire slice of data has been accumulated, the parity is generated, and the individual stripes in the slice are queued to be written to the disks.
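A sketch of this whole-slice parity generation is given below; the stripe size and the number of data disks are assumptions chosen for illustration, not values fixed by the specification.

```python
# Illustrative sketch with assumed sizes: a full slice already contains one
# stripe for every data disk, so parity is computed from memory in one pass
# with no read of old data or old parity.
STRIPE_BYTES = 128 * 1024   # assumed stripe size
DATA_DISKS = 4              # assumed number of data disks
SLICE_BYTES = STRIPE_BYTES * DATA_DISKS

def split_slice(slice_data: bytes):
    """Carve a complete slice into one stripe per data disk."""
    assert len(slice_data) == SLICE_BYTES
    return [slice_data[i * STRIPE_BYTES:(i + 1) * STRIPE_BYTES]
            for i in range(DATA_DISKS)]

def slice_parity(stripes):
    parity = bytearray(STRIPE_BYTES)
    for stripe in stripes:
        for i, byte in enumerate(stripe):
            parity[i] ^= byte
    return bytes(parity)
```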
Memory Buffering
In an alternate embodiment, which can be used when the system is low on memory, the first stripe is written to disk and its buffer becomes the parity buffer. Subsequent stripe buffers are XOR'ed into that buffer until the entire slice is processed, and then the parity buffer is written out to disk.
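A minimal sketch of this low-memory variant follows; the write_stripe and write_parity callables are placeholders for the disk I/O layer.

```python
# Illustrative sketch of the low-memory variant: the first stripe's buffer is
# reused as the parity buffer, later stripes are XOR'ed into it as they are
# written, and the parity goes out once the whole slice has been processed.
def write_slice_low_memory(stripes, write_stripe, write_parity):
    parity = bytearray(stripes[0])     # first stripe's buffer becomes the parity buffer
    write_stripe(0, stripes[0])
    for n, stripe in enumerate(stripes[1:], start=1):
        for i, byte in enumerate(stripe):
            parity[i] ^= byte          # fold this stripe into the running parity
        write_stripe(n, stripe)
    write_parity(bytes(parity))
```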
If an entire slice has been filled (step 705), the current allocation unit is used to determine where on the disk to store the slice. If it is determined (step 706) that the current allocation unit is full, additional space is allocated and the DMap is updated (step 707) in the metadata portion. The slice is then queued to be written to the disks of the RAID (step 708). If the current allocation unit is not full and additional disk space is not required, step 707 is bypassed. Queuing the data for each stripe is a logical operation; no copying is performed. The parity is generated based on the data in the queued slice (step 710). Once the parity has been generated and the slice has been written successfully to disk (or is otherwise made persistent), the slice is considered to be valid. In a preferred embodiment, there is one parity buffer per slice, which improves performance by eliminating the need to read from the disks to generate the parity. The memory used for data transfer is organized as a large number of 128 KB buffers. The stripes can be aligned to the buffer boundaries to simplify the parity generation by avoiding having to handle multiple memory segments in a single stripe. The queued slice and the parity are written to disks (step 712) and the method terminates (step 714). To maintain good disk performance, writes to the disk are issued for four queued segments at a time.
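The flow just described can be condensed into the following sketch; the alloc_unit, dmap, and raid objects and their methods are hypothetical stand-ins for steps 705 through 712, not actual interfaces from the specification.

```python
# Illustrative sketch of the write path: check the allocation unit, update the
# DMap if new space is allocated, queue the slice, generate parity, and write.
# The alloc_unit, dmap, and raid objects and their methods are hypothetical.
def write_full_slice(slice_data, alloc_unit, dmap, raid):
    if alloc_unit.is_full():                            # step 706
        alloc_unit = raid.allocate_unit()               # allocate additional space
        dmap.append(alloc_unit)                         # step 707: record in metadata portion
    stripes = raid.queue_slice(alloc_unit, slice_data)  # step 708: logical queuing, no copy
    parity = raid.generate_parity(stripes)              # step 710: one parity buffer per slice
    raid.write(alloc_unit, stripes, parity)             # step 712
    return alloc_unit
```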
It should be noted that while the preferred embodiment stores the information about which slices are valid in the metadata portion of the RAID, this does not preclude storing that information anywhere within the RAID system 600.
Since there is no read-modify-write behavior, the parity disk 510 does not need to be initialized in advance, which saves time when the RAID is created. Due to the management by the VTL, a valid parity stripe is only expected for slices that have been validly written to disk. The parity will be valid only for the slices 608 that have been filled with user data and those slices 608 are part of the allocation units that the data maps track for each virtual tape.
Any error in writing the parity disk or the data disks invalidates that slice. An example of a failed write operation is as follows: data is written to stripes A, B, and C successfully, and the write to stripe D fails. Because the tracking is performed at the slice level, and not at the stripe level, if the write to stripe D fails, a failure for the slice is indicated since it is not possible to determine which stripe within the slice has failed. If tracking were performed at the stripe level, it would be possible to reconstruct stripe D from the remainder of the slice.
If one of the disks fails during the write of the slice, the system is in the same degraded state for that slice as it would be for all of the preceding slices and that slice could be considered successful. In general, it is better for the VTL to report the write failure to the backup application if the data is now one disk failure away from being lost. That will generally cause the backup application to retry the entire backup on another “tape” and the data can be written to a different, undegraded RAID group.
Verifying and Recovering RAID Data
It may be necessary to verify the data in the RAID on a periodic basis, to ensure the integrity of the disks. To perform a verification, all of the data stripes in a slice are read, and the parity is generated. Then the parity stripe is read from disk and compared to the generated parity. The slice is verified if the generated parity and the read parity stripe match. In a sparse RAID, only those slices that have been successfully written to disk need to be verified. Since the entire RAID does not need to be verified, this operation can be quickly performed in a sparse RAID.
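A sketch of the per-slice check is given below; the read_data_stripes and read_parity_stripe callables stand in for the actual disk reads.

```python
# Illustrative sketch of slice verification: regenerate parity from the data
# stripes and compare it with the parity stripe read back from the parity disk.
def verify_slice(read_data_stripes, read_parity_stripe) -> bool:
    stripes = read_data_stripes()           # read all data stripes of the slice
    expected = bytearray(len(stripes[0]))
    for stripe in stripes:
        for i, byte in enumerate(stripe):
            expected[i] ^= byte
    return bytes(expected) == read_parity_stripe()
```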
If a disk fails, the data that was on the failed disk can be reconstructed, via a recovery operation. The recovery operation is performed in a similar manner to a verification. As in a verification, only the slices that contain successfully written data need to be recovered, since only those slices are tracked through the VTL. The information from the data maps is used to identify the slices that need to be reconstructed. Since the data map is a “consumer” of space on the disk, the partial reconstruction is referred to as “consumer driven.” The benefit of reconstructing only the portions of the RAID that might have useful data varies depending on how full the RAID is. The time savings is more pronounced when less of the RAID is used, because there is less data to recover. As the RAID approaches being full, the time savings are not as significant.
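The corresponding reconstruction can be sketched as follows; the number of data disks and the read/write callables are assumptions for illustration only.

```python
# Illustrative sketch of consumer driven recovery: only the slices recorded in
# the data maps are rebuilt, and each missing stripe is recovered by XOR'ing
# the surviving stripes with the parity stripe. Four data disks are assumed.
DATA_DISKS = 4

def rebuild_stripe(surviving_stripes, parity_stripe):
    rebuilt = bytearray(parity_stripe)
    for stripe in surviving_stripes:
        for i, byte in enumerate(stripe):
            rebuilt[i] ^= byte
    return bytes(rebuilt)

def recover_disk(written_slices, failed_disk, read_stripe, read_parity, write_rebuilt):
    for slice_id in written_slices:          # slices identified from the data maps
        survivors = [read_stripe(slice_id, d)
                     for d in range(DATA_DISKS) if d != failed_disk]
        write_rebuilt(slice_id, rebuild_stripe(survivors, read_parity(slice_id)))
```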
While specific embodiments of the present invention have been shown and described, many modifications and variations could be made by one skilled in the art without departing from the scope of the invention. For example, a preferred embodiment of the present invention uses a RAID4 system, but the principles of the invention are applicable to other multi-volume data storage systems, such as other RAID methodologies or systems (e.g., RAID5). The above description serves to illustrate and not limit the particular invention in any way.