This application claims the benefit of priority of Singapore patent application No. 201306456-3, filed Aug. 27, 2013, the content of which is incorporated herein by reference in its entirety for all purposes.
Various embodiments disclosed herein relate to storage systems.
The technology of Redundant Array of Independent Disks (RAID) has been widely used in storage systems to achieve high data performance and reliability. By maintaining redundant information within an array of disks, RAID can recover the data in case one or more disk failures occur in the array. RAID systems are classified into different levels according to their structures and characteristics. RAID level 0 (RAID0) has no redundant data and cannot recover from any disk failure. RAID level 1 (RAID1) implements mirroring on a pair of disks, and therefore can recover from one disk failure in the pair of disks. RAID level 4 (RAID4) and RAID level 5 (RAID5) implement XOR parity on an array of disks, and can recover from one disk failure in the array through XOR computation. RAID level 6 (RAID6) is able to recover from any two concurrent disk failures in the disk array, and it can be implemented through various kinds of erasure codes, such as the Reed-Solomon codes.
The process of recovering data from disk failures in a RAID system is called data reconstruction. The data reconstruction process is very critical to both the performance and reliability of RAID systems. Take the RAID5 system as an example: when a disk fails in the array, the array enters degraded mode, and user I/O requests that fall on the failed disk have to reconstruct data on the fly, which is quite expensive and causes great performance overhead. Moreover, the user I/O processes and the reconstruction process run concurrently and compete with each other for disk bandwidth, which further degrades the system performance severely. On the other hand, when the RAID5 system is recovering from one disk failure, a second disk failure may occur, which will exceed the system's failure tolerance ability and cause permanent data loss. Thus, a prolonged data reconstruction process will introduce a long period of system vulnerability and severely degrade system reliability. For these reasons, the data reconstruction process should be shortened as much as possible, and seeking methods to optimize the data reconstruction of current RAID systems is of great importance and significance.
For data reconstruction, an ideal scenario is offline reconstruction, in which the array stops serving user I/O requests and lets the data reconstruction process run at its full speed. However, this scenario is not practical in most production environments, where RAID systems are required to provide uninterrupted data services even when they are recovering from disk failures. In other words, RAID systems in production environments perform online reconstruction, in which the reconstruction process and user I/O processes run concurrently. In previous work, several methods have been proposed to optimize the reconstruction process of RAID systems. The Workout method aims to redirect the user write data and cache popular read data to a surrogate RAID, and reclaim the write data to the original RAID when the reconstruction of the original RAID completes. By doing so, Workout tries to separate the reconstruction process from the user I/O processes and leave the reconstruction process undisturbed. Different from Workout, our proposed methods let the user I/O processes cooperate with the reconstruction process and contribute to the data reconstruction while serving user read/write requests. Another previous method is called Victim Disk First (VDF). VDF defines a system DRAM cache policy that caches the data of the failed disk with higher priority, so that the performance overhead of reconstructing the failed data on the fly can be minimized. Different from VDF, our methods include a policy to optimize the reconstruction sequence by utilizing the data in the NVM caches of the surviving disks in the array. A third previous work is called live block recovery. The method of live block recovery aims to recover only live file system data during reconstruction, skipping the unused data blocks. However, this method relies on passing file system information to the RAID block level, and thus requires significant changes to existing file systems.
Moreover, this method can only be applied to replication based RAID, such as RAID1, and cannot be applied to parity based RAID, such as RAID5 and RAID6. Our proposed method also aims at reconstructing only used data blocks, but our method works entirely at the block level and requires no modification to the file systems. Besides, our method can be applied to any RAID level, including parity based RAID systems.
A hybrid drive is a kind of hard disk drive which places a spinning magnetic disk medium together with an NVM cache inside one disk enclosure. In the normal mode, the NVM cache serves as a read/write cache for user I/O requests. In the reconstruction mode, the data in the NVM cache can be exploited to accelerate the reconstruction process. In the following description of our methods, we will illustrate how to optimize the reconstruction of RAID systems by exploiting NVM caches inside hybrid drives.
According to exemplary embodiments, methods for optimizing the reconstruction process of RAID systems composed of hybrid drives are disclosed. RAID5, for example, may be used as an example to illustrate the disclosed methods. It must be noted that these methods can also be applied to other RAID levels such as, but not limited to, RAID1, RAID4 and RAID6. Various methods in accordance with exemplary embodiments may include:
A corresponding exemplary method is illustrated in
A corresponding exemplary method is illustrated in
A corresponding exemplary method is illustrated in
In the drawings, like reference characters generally refer to like parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. In the following description, various embodiments of the invention are described with reference to the following drawings, in which:
The following detailed description refers to the accompanying drawings that show, by way of illustration, specific details and embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized and structural, logical, and electrical changes may be made without departing from the scope of the invention. The various embodiments are not necessarily mutually exclusive, as some embodiments can be combined with one or more other embodiments to form new embodiments.
Embodiments described in the context of one of the methods or devices are analogously valid for the other method or device. Similarly, embodiments described in the context of a method are analogously valid for a device, and vice versa.
Features that are described in the context of an embodiment may correspondingly be applicable to the same or similar features in the other embodiments. Features that are described in the context of an embodiment may correspondingly be applicable to the other embodiments, even if not explicitly described in these other embodiments. Furthermore, additions and/or combinations and/or alternatives as described for a feature in the context of an embodiment may correspondingly be applicable to the same or similar feature in the other embodiments.
In the context of various embodiments, the articles “a”, “an” and “the” as used with regard to a feature or element includes a reference to one or more of the features or elements.
In the context of various embodiments, the phrase “at least substantially” may include “exactly” and a reasonable variance.
In the context of various embodiments, the term “about” or “approximately” as applied to a numeric value encompasses the exact value and a reasonable variance.
As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items.
As used herein, the phrase of the form of “at least one of A or B” may include A or B or both A and B. Correspondingly, the phrase of the form of “at least one of A or B or C”, or including further listed items, may include any and all combinations of one or more of the associated listed items.
In accordance with exemplary embodiments, a parity stripe may refer to a unit for parity RAID systems to organize data. As shown in
Each block of a parity stripe may reside in a different disk. As shown in the example of
A block in a parity stripe may either be a data block or a parity block with a typical size of approximately 4 KB. A data block can hold user data. A parity block can hold parity value(s) computed from the data blocks of the parity stripe according to a certain parity algorithm, which may use XOR computation.
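To illustrate the XOR parity computation referred to above, the following sketch (hypothetical helper names; block sizes reduced from the typical 4 KB for brevity) shows how a parity block may be computed from the data blocks of a stripe, and how any single lost block may be recovered from the survivors:

```python
def xor_blocks(blocks):
    """XOR a list of equal-sized byte blocks together."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

def compute_parity(data_blocks):
    """The parity block of a stripe is the XOR of all its data blocks."""
    return xor_blocks(data_blocks)

def recover_block(surviving_blocks):
    """Any single lost block (data or parity) is the XOR of the survivors."""
    return xor_blocks(surviving_blocks)
```

Because XOR is its own inverse, the same routine serves both parity generation and single-failure recovery, which is the property RAID4/RAID5 reconstruction relies on.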
In accordance with exemplary embodiments, a space bitmap may be initialized at the start of data reconstruction, that is, after RAID creation. That is, when a data reconstruction process for a RAID system begins, the parity block for each parity stripe to be reconstructed can be checked. If the parity block is all zero, the space bitmap can be updated so as to indicate that the associated parity stripe is unused. If it is not all zero, the bitmap can be updated to indicate that the associated parity stripe is used.
For example, during a RAID creation process, all the data and parity blocks in the RAID system may be initialized to zero blocks. Thus, if a parity stripe is used, its parity block must be updated and thus become non-zero. However, if a parity stripe is never used, its parity block may remain an all-zero block.
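The all-zero parity check described above can be sketched as follows (a minimal illustration with a hypothetical helper name, assuming the parity block bytes of each stripe are available in stripe order):

```python
def build_space_bitmap(parity_blocks):
    """One bit per parity stripe: 1 if the stripe has been used (its
    parity block contains at least one non-zero byte), 0 if the stripe
    is unused (its parity block is still all zeros from RAID creation)."""
    return [1 if any(pb) else 0 for pb in parity_blocks]
```

Stripes whose bit is 0 can then be "reconstructed" by writing zeros to the replacement disk, skipping the expensive XOR rebuild entirely.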
In some exemplary embodiments, as previously disclosed, the parity blocks of associated parity stripes can be checked on the fly during reconstruction. Therefore, a space bitmap may not be needed to indicate whether a parity stripe is used or unused. Based on the on-the-fly checking of the parity block of the parity stripe for reconstruction, the parity stripe can be reconstructed by writing zeros to the replacement disk if the parity block is all zero. If the parity block is not all zero, the reconstruction process can proceed in accordance with embodiments herein.
In accordance with exemplary embodiments, systems and methods for optimizing a reconstruction process in a RAID system with either conventional HDDs or hybrid HDDs are disclosed herein.
In accordance with exemplary embodiments, one or more bitmaps (e.g., a metadata recording mechanism) may be used for reconstruction scheduling, reading/writing data, and even data caching after a disk drive has failed and the reconstruction process has started. In exemplary embodiments, two bitmaps may be built or generated at the start of a data reconstruction process. For example, one bitmap that may be used is a reconstruction bitmap, in which each bit represents the reconstruction status of a parity stripe. The reconstruction bitmap may be initialized to be all-zero, and when a parity stripe is reconstructed, a corresponding bit of the bitmap is set to 1.
Similarly, another bitmap that may be used for data reconstruction is a space bitmap, in which each bit represents whether a parity stripe (or a group of parity stripes) is used or not. For example, if a parity stripe is determined or identified as previously used, a typical normal reconstruction process proceeds. Otherwise, reconstructing the parity stripe may consist of simply writing zeros to the replacement drive/disk.
In accordance with exemplary embodiments, bitmaps used in the reconstruction process may be kept in system memory, in NVM, or in any other fast-access storage space.
In accordance with exemplary embodiments, a reconstruction scheduler, in a data reconstruction process, may use bitmap information and/or other information to determine a reconstruction sequence and/or how to reconstruct each parity stripe.
In accordance with exemplary embodiments, scheduling strategy to optimize a data reconstruction process in RAID system with conventional hard disk drives (HDDs) may include:
1. Determining if there is a request sent from any application, and if not, a reconstruction scheduler starts to schedule the reconstruction process by checking from the 1st bit in the reconstruction bitmap (associated with the 1st parity stripe). If it is 0 (indicating the parity stripe associated with the bit has not been reconstructed), the reconstruction scheduler will issue the commands to reconstruct the 1st parity stripe. The reconstruction scheduler may further check the 1st bit in the space bitmap. If it is 0 (indicating the parity stripe associated with the checked bit has not been used or allocated and contains all zeros), the parity stripe may be reconstructed by writing zeros to the replacement disk. Otherwise, if the checked bit of the space bitmap is 1 (indicating the stripe has been used/allocated), the parity stripe associated with the checked bit is reconstructed following the normal reconstruction procedure. After reconstruction of the parity stripe, the reconstruction scheduler may update the reconstruction bitmap and set the bit associated with the reconstructed parity stripe to 1. If the 1st bit value of the reconstruction bitmap is already 1, the reconstruction scheduler may skip the current parity stripe (for example, the 1st parity stripe) and proceed to check the 2nd bit value to see if the parity stripe associated with the 2nd bit of the reconstruction bitmap (the 2nd stripe) has been reconstructed already. That is, the reconstruction scheduler may continue and repeat this process until the last bit in the bitmap, assuming there is no interruption such as a request sent from one or more applications.
2. In exemplary embodiments, if there is a request sent out from an application to access the failed drive during the above mentioned process, based on a priority setting of the RAID system the reconstruction scheduler may first complete the reconstruction of the currently selected parity stripe, and then allow the system to serve the requesting application. For example, if the requesting application needs to write data to the failed drive, the reconstruction scheduler may write directly to the replacement drive and then update the reconstruction bitmap to indicate that the corresponding parity stripe has been reconstructed. If the requesting application needs to read data from a failed drive but the data has not been reconstructed yet, the reconstruction scheduler may issue a command to reconstruct the data by reading from the other available drives in the RAID group and reconstructing the data on the fly. The reconstruction scheduler may then write the data to the replacement drive and update the bit of the corresponding reconstruction stripe in the reconstruction bitmap to 1 to indicate that the stripe has been reconstructed. The bitmap can allow the reconstruction scheduler to avoid reconstructing a parity stripe again.
3. By checking the bitmap, the system can easily check whether the particular data the application requests to read has been reconstructed or not. If the data has already been reconstructed, the data may be read out directly from the replacement drive and sent back to the requesting application.
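The scheduling strategy in steps 1 to 3 above can be sketched in outline as follows (a simplified model with hypothetical callback names; an actual scheduler would issue disk commands rather than call Python functions):

```python
def schedule_reconstruction(reconstruction_bitmap, space_bitmap,
                            reconstruct_stripe, write_zero_stripe,
                            pending_request=lambda: None):
    """Walk the reconstruction bitmap, skipping stripes already rebuilt.
    A pending application request (if any) for a not-yet-rebuilt stripe
    is served first; unused stripes (space bit 0) are rebuilt by writing
    zeros, used stripes by the normal XOR procedure."""
    while 0 in reconstruction_bitmap:
        req = pending_request()
        if req is not None and reconstruction_bitmap[req] == 0:
            idx = req                             # prioritize the requested stripe
        else:
            idx = reconstruction_bitmap.index(0)  # next unreconstructed stripe
        if space_bitmap[idx]:
            reconstruct_stripe(idx)               # normal on-the-fly XOR rebuild
        else:
            write_zero_stripe(idx)                # unused stripe: just write zeros
        reconstruction_bitmap[idx] = 1            # mark as reconstructed
```

Setting the bit immediately after the stripe is written to the replacement disk is what lets later reads be served directly from the replacement drive, as described in step 3.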
In accordance with exemplary embodiments, in a RAID system with hybrid drives, similar to a RAID system with conventional HDDs, the aforementioned methods may be used.
1. In accordance with exemplary embodiments, in a RAID system with hybrid drives, when a hybrid drive fails, the system may first identify whether the NVM of the failed hybrid drive can be accessed or not. If yes, the data in the NVM may be read out and directly copied to the NVM of a replacement hybrid drive. After the copying has finished, the reconstruction bitmap may be updated by setting the bit values corresponding to the copied data to 1.
In accordance with exemplary embodiments, in a RAID system with hybrid drives, priority reconstruction may be scheduled based on the data in the NVMs. For example, if all the data required for the reconstruction of a parity stripe is available in the NVMs of the available hybrid drives, such parity stripes are reconstructed with high priority, and then the corresponding bit values in the reconstruction bitmap can be updated to 1. If only partial data is available, the remaining portion of the data required for reconstruction that is not in the NVMs can be prefetched or caused to be prefetched to the NVMs. Once the necessary data is in the NVMs, the scheduler can schedule these parity stripes for reconstruction.
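The NVM-based prioritization described above can be sketched as follows (a minimal sketch with hypothetical names; `pending_stripes` maps each not-yet-rebuilt stripe to the set of blocks its reconstruction requires, and `nvm_resident` maps each stripe to the subset of those blocks already held in the surviving drives' NVM caches):

```python
def prioritize_by_nvm(pending_stripes, nvm_resident, prefetch):
    """Split not-yet-rebuilt stripes into those whose required blocks are
    all NVM-resident (rebuild immediately, highest priority) and those
    only partially cached (issue a prefetch for the missing blocks and
    schedule the stripe once its data reaches the NVMs)."""
    ready, waiting = [], []
    for stripe, required in pending_stripes.items():
        cached = nvm_resident.get(stripe, set())
        if required <= cached:
            ready.append(stripe)                  # all blocks already in NVM
        else:
            prefetch(stripe, required - cached)   # warm the missing blocks
            waiting.append(stripe)
    return ready, waiting
```

Rebuilding the fully cached stripes first keeps the reconstruction pipeline busy with fast NVM reads while the slower magnetic-media prefetches complete in the background.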
In accordance with exemplary embodiments, before data reconstruction in a RAID system, bitmaps may be built or generated, for example, a reconstruction bitmap and a space bitmap. As previously disclosed, in the reconstruction bitmap, each bit may represent the reconstruction status of a parity stripe. After generation, the bits in the reconstruction bitmap may be initialized to be all-zero. Thus, when a parity stripe is reconstructed, its corresponding bit may be set to 1.
In a space bitmap, each bit may represent whether a parity stripe (or a group of parity stripes) is used/allocated or not. If a parity stripe was used or allocated, a data reconstruction process such as one disclosed herein may be implemented. If a parity stripe was not previously used or allocated, reconstructing the parity stripe may be accomplished by simply writing zeros to the replacement disk.
In accordance with exemplary embodiments, a space bitmap may be generated. For each parity/reconstruction stripe, the associated parity block can be checked. For example, if it is an all-zero block, then it can be indicated as unused in the bitmap (e.g., “0”); otherwise, it may be indicated as used (e.g., “1”). During initialization, all the data and parity blocks in a RAID system may be initialized to zero blocks. Thus, if a parity stripe is subsequently used, then its parity block must be updated and become non-zero. If a parity stripe is never used, its parity block must remain an all-zero block.
In accordance with some exemplary embodiments, a space bitmap may be avoided or not used. Instead, parity-block checking may be implemented on the fly during reconstruction, so that a space bitmap is not needed to record or indicate unused space. For example, before reconstructing each parity stripe, the parity block is first checked. If the parity block is all zero, this parity stripe is reconstructed by writing zeros to the replacement disk; otherwise, it is reconstructed following the normal reconstruction procedure.
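This bitmap-free variant can be sketched as follows (hypothetical callback names; `read_parity` returns the stripe's parity block bytes and `xor_rebuild` performs the normal XOR reconstruction):

```python
def reconstruct_on_the_fly(stripe, read_parity, xor_rebuild,
                           write_replacement, block_size=4096):
    """Inspect the stripe's parity block immediately before rebuilding it.
    An all-zero parity block means the stripe was never written, so zeros
    go straight to the replacement disk and the XOR rebuild is skipped;
    otherwise the stripe is rebuilt normally."""
    parity = read_parity(stripe)
    if not any(parity):
        write_replacement(stripe, bytes(block_size))    # unused: write zeros
    else:
        write_replacement(stripe, xor_rebuild(stripe))  # normal rebuild
```

The trade-off relative to a precomputed space bitmap is one extra parity-block read per stripe in exchange for not maintaining any bitmap state.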
In accordance with exemplary embodiments, the various exemplary RAID systems disclosed herein may include and/or be operatively coupled to one or more computing devices (not shown). The computing devices may, for example, include one or more processors and other suitable components such as memory and computer storage. For example, at least one RAID controller may be included with a RAID system and operatively connected to the storage drives constituting the RAID system. It should be understood that the processor may also comprise other forms of processors or processing devices, such as a microcontroller, or any other device that can be programmed to perform the functionality described herein.
Accordingly, the computing devices may execute software so as to implement, at least in part, one or more of the various methods, or aspects thereof, disclosed herein, such as the reconstruction scheduler processes, various input/output requests, etc. Such software may be stored on any appropriate or suitable non-transitory computer readable media so as to be executed by a processor(s). In other words, the computing devices may interact or interface with the various drives of the RAID systems disclosed herein. Accordingly, the computing devices may be used to create, update, access, etc., the tables disclosed herein (e.g., the space bitmap, the reconstruction bitmap, etc.). The tables may be stored as data in any suitable storage device, such as in any suitable computer storage device or memory.
In accordance with exemplary embodiments, a method for data reconstruction in a RAID storage system that includes a plurality of storage drives, one of which has failed, may include: selecting a parity stripe from a plurality of parity stripes for reconstruction; determining whether the selected parity stripe has been previously reconstructed by checking a reconstruction table, the reconstruction table comprising entries each indicating a reconstruction status corresponding to at least one of the plurality of parity stripes for reconstruction, wherein each reconstruction status indicates whether or not the at least one corresponding parity stripe has been previously reconstructed; determining whether the selected parity stripe has been previously allocated by checking a space table, the space table comprising entries indicating an allocation status corresponding to at least one of the plurality of parity stripes for reconstruction, wherein the allocation status indicates whether or not the at least one corresponding parity stripe has been previously allocated; and if the selected parity stripe has been determined to not have been previously reconstructed and if the selected parity stripe has been determined to have been previously allocated, the method further comprises reconstructing the selected parity stripe in a replacement disk and updating the reconstruction status in the reconstruction table corresponding to the selected parity stripe to indicate that the selected stripe has been reconstructed.
In accordance with exemplary embodiments, the method may further include writing a zero to the replacement disk for data corresponding to the selected parity stripe, if the selected parity stripe has been determined to not have been previously allocated.
In accordance with exemplary embodiments, the method may further include receiving an input/output request for data associated with a parity stripe before the selecting of a parity stripe; and wherein the selecting of a parity stripe includes selecting the parity stripe to which the input/output request for data is associated. In accordance with exemplary embodiments, if no input/output operation request is received, the selecting of a parity stripe may include selecting a parity stripe corresponding to a first entry of the reconstruction table that indicates reconstruction has not occurred. In accordance with exemplary embodiments, the reconstruction table may be a bitmap including a plurality of bits, each bit representing a reconstruction status of each of the plurality of parity stripes for reconstruction.
In accordance with exemplary embodiments, the space table may be a bitmap including a plurality of bits, each bit representing the allocation status of each of the plurality of parity stripes for reconstruction.
In accordance with exemplary embodiments, the method may further include selecting an additional parity stripe from the plurality of parity stripes for reconstruction.
In accordance with exemplary embodiments, the method may further include executing the received input/output request.
In accordance with exemplary embodiments, each of the plurality of storage drives may be a hard disk drive.
In accordance with exemplary embodiments, each of the plurality of storage drives may be a hybrid drive that includes a non-volatile memory (NVM) and a magnetic disk media. In accordance with exemplary embodiments, the method may further include determining whether data of a NVM of the failed drive is accessible before the selecting of a parity stripe for reconstruction; and copying the data from the NVM of the failed hybrid drive to a NVM of a replacement hybrid drive if the NVM of the failed hybrid drive is determined to be accessible.
In accordance with exemplary embodiments, the method may further include, before the selecting of a parity stripe for reconstruction, identifying one or more parity stripes for reconstruction for which all of the blocks needed for reconstruction are stored in the NVMs of non-failed disks, and reconstructing the one or more identified parity stripes in a replacement disk.
In accordance with exemplary embodiments, the method may further include, before the selecting of a parity stripe for reconstruction, identifying one or more additional parity stripes for reconstruction, the one or more additionally identified parity stripes having a portion of the blocks associated with the parity stripe stored in the one or more NVMs of non-failed hybrid drives and a portion of the blocks stored in the magnetic disk media of the non-failed hybrid drives; instructing one or more of the non-failed hybrid drives to fetch the portion of the blocks associated with the identified parity stripes from the magnetic disk media of the non-failed hybrid drives and store them in the respective NVM caches of the non-failed hybrid drives; and reconstructing the one or more identified additional parity stripes in a replacement disk.
While the invention has been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.
Number | Date | Country | Kind |
---|---|---|---|
201306456-3 | Aug 2013 | SG | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/SG2014/000406 | 8/27/2014 | WO | 00 |