The present invention relates to performance management for a data storage system, and more particularly, to a method and an apparatus for performing data recovery in a redundant storage system.
A redundant storage system with redundant storage ability such as a Redundant Array of Independent Disks (RAID) may combine a plurality of storage devices as a storage pool, and dispatch the redundant data into the different storage devices, in which the redundant data may help with data recovery when a single device is malfunctioning. However, when bit rot or silent data corruption occurs, the conventional storage system lacks an efficient mechanism to solve these problems. For example, in a situation where the RAID level of the conventional RAID is RAID 5, in order to check if the data of a data chunk A1 of one of the plurality of storage devices is correct, the corresponding data chunks A2, A3 and the parity chunk Ap are read from other storage devices for comparison (in particular, by comparing the original data of the data chunk A1 and the calculated data which is calculated according to the data chunks A2, A3 and the parity chunk Ap). This may greatly degrade the performance of randomly reading data. In addition, even when the comparison determines that the original data and the calculated data are different, the conventional RAID is not able to check which data is correct. In another example, in a situation where the RAID level of the conventional RAID is RAID 1, twice as much time will be taken to check if bit rot occurs.
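For illustration only, the RAID 5 comparison described above can be sketched as follows. The sketch assumes byte-wise XOR parity and 4-byte chunks; the helper `xor_chunks` and the chunk values are hypothetical and are not taken from any particular RAID implementation.

```python
from functools import reduce


def xor_chunks(*chunks: bytes) -> bytes:
    """XOR equal-length chunks byte by byte (RAID 5 style parity)."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*chunks))


# Hypothetical 4-byte chunks of one stripe.
a1 = bytes([0x11, 0x22, 0x33, 0x44])
a2 = bytes([0x55, 0x66, 0x77, 0x88])
a3 = bytes([0x99, 0xAA, 0xBB, 0xCC])
ap = xor_chunks(a1, a2, a3)  # parity chunk written together with the stripe

# Checking A1 requires reading A2, A3 and Ap from the other devices.
calculated_a1 = xor_chunks(a2, a3, ap)
stripe_consistent = calculated_a1 == a1

# Simulate bit rot: one bit of A1 flips on its device.
rotted_a1 = bytes([a1[0] ^ 0x01]) + a1[1:]
mismatch_detected = xor_chunks(a2, a3, ap) != rotted_a1

# The same kind of mismatch appears when A2 is the rotted chunk instead,
# so the comparison alone cannot tell which chunk holds the wrong data.
rotted_a2 = bytes([a2[0] ^ 0x01]) + a2[1:]
ambiguous = xor_chunks(rotted_a2, a3, ap) != a1
```

Every check costs three additional reads, and a mismatch only shows that the stripe is inconsistent, which corresponds to the two drawbacks noted above.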
Although the related arts provide some methods to solve these problems, other undesirable side effects may occur as a result. Therefore, a novel method and associated architecture are required.
One of the objects of the present invention is to provide a method and an associated apparatus for performing data recovery in a redundant storage system to solve the problems which exist in the related arts.
Another objective of the present invention is to provide a method and an associated apparatus for performing data recovery in a redundant storage system that can boost the performance of the redundant storage system.
According to at least one embodiment of the present invention, a method for performing data recovery in a redundant storage system is disclosed, in which the redundant storage system includes a plurality of storage devices. The method includes: determining a state of a cache block of a plurality of cache blocks, in which the plurality of storage devices includes a set of Hard Disk Drives (HDDs) and a set of Solid State Drives (SSDs), an SSD Redundant Array of Independent Disk (RAID) of the redundant storage system includes the set of SSDs, and an HDD RAID of the redundant storage system includes the set of HDDs, in which the SSD RAID is utilized as a cache system of the HDD RAID and includes the plurality of cache blocks; and performing a retry-read operation on at least one of the HDD RAID and the SSD RAID according to the state of the cache block, to obtain a correct version of data within the redundant storage system.
An apparatus for performing data recovery in a redundant storage system is also provided, in which the apparatus may include at least one portion of the redundant storage system (e.g. a portion or all of it). The apparatus may include: a control circuit located in a specific layer of a plurality of layers in the redundant storage system and coupled to a plurality of storage devices of the redundant storage system, in which the control circuit is arranged to control an operation of the redundant storage system. The step of controlling the operation of the redundant storage system includes: determining a state of a cache block of a plurality of cache blocks, in which the plurality of storage devices includes a set of HDDs and a set of SSDs, an SSD RAID of the redundant storage system includes the set of SSDs, and an HDD RAID of the redundant storage system includes the set of HDDs, in which the SSD RAID is utilized as a cache system of the HDD RAID and includes the plurality of cache blocks; and performing a retry-read operation on at least one of the HDD RAID and the SSD RAID according to the state of the cache block, to obtain a correct version of data within the redundant storage system.
The method and associated apparatus of the present invention may solve problems existing in the related arts without introducing unwanted side effects, or in a way that is less likely to introduce a side effect. In addition, the methods and associated apparatus of the present invention can efficiently boost the overall performance without wasting operation resources.
These and other objectives of the present invention will no doubt become obvious to those of ordinary skill in the art after reading the following detailed description of the preferred embodiment that is illustrated in the various figures and drawings.
Embodiments of the present invention provide a data recovery mechanism applied in a redundant storage system, in which the redundant storage system can be a storage system with redundant storage ability or a multilayer storage system stack composed of a plurality of storage systems with redundant storage ability. For example, the storage system can include at least one Redundant Array of Independent Disk (RAID) or at least one Distributed Replicated Block Device (DRBD), and the data recovery mechanism can be implemented in the storage system. In another example, the plurality of storage systems can include at least one RAID or at least one DRBD, and the data recovery mechanism can be implemented in any of the plurality of storage systems. Based on the data recovery mechanism of embodiments of the present invention, the redundant storage system can automatically recover or amend data. When the file system or application finds corrupted data via a checksum or a hash value, the data recovery mechanism can automatically perform a background data recovery operation to assure the user will not read the incorrect content. For clarity, the file system with built-in checking ability can be an example of the file system of the redundant storage system. According to an aspect of the present invention, the file system may be regarded as a layer within the redundant storage system, such as a topmost layer of a plurality of layers within the redundant storage system, and a plurality of storage elements (e.g. one or more Solid State Drives (SSDs), one or more Hard Disk Drives (HDDs), one or more RAIDs) may be located in remaining layer(s) within the plurality of layers. For example, the remaining layer(s) may comprise one or more RAIDs and the storage devices thereof (e.g. one or more HDDs and/or one or more SSDs).
As the architecture of the redundant storage system may vary, the redundant storage system may comprise one or more sub-systems under the file system (e.g. the topmost layer of the layers). Examples of the one or more sub-systems may include, but are not limited to, a generic storage system and a cache storage system. The cache storage system comprises an HDD RAID and an SSD RAID that is utilized as a cache system of this HDD RAID. The HDD RAID and the SSD RAID can be regarded as a lower layer below the file system, SSDs of the SSD RAID can be regarded as a lower layer (e.g. a bottommost layer) below the SSD RAID, and HDDs of the HDD RAID can be regarded as a lower layer (e.g. a bottommost layer) below the HDD RAID. In addition, the generic storage system comprises an HDD RAID, but does not comprise any SSD RAID that is utilized as a cache system of this HDD RAID. The HDD RAID can be regarded as a lower layer below the file system, and HDDs of the HDD RAID can be regarded as a lower layer (e.g. a bottommost layer) below the HDD RAID. Please note that a plurality of control modules for implementing the data recovery mechanism may be in at least one portion (e.g. a portion or all) of the layers to perform the background data recovery operation mentioned above, and a Retry-Read command may be utilized by an upper layer within the layers for obtaining redundant data from a lower layer within the layers, to correct data error(s) and/or provide the user with correct data content. The Retry-Read command can be applied to the generic storage system without considering caching behaviors such as that of the cache storage system. When the Retry-Read command is applied to the cache storage system, however, a proper design such as an adaptive control mechanism is required.
Normally, no matter what operating system is used to implement the file system 12, the layers of the redundant storage system 100 can use the following four basic commands:
The data recovery mechanism (e.g. the plurality of control modules, such as the control modules 14 and 114) can recognize and use these commands, and can use at least one additional command (e.g. one or more additional commands) including:
For example, in the file system 12 (e.g. Btrfs) coupled with the generic storage system 13, when 1-bit data error occurs, the file system 12 may detect it and restore the data with the aid of the control module 14 by the following operations:
In the cache storage system 113, the correct data may be stored in the SSD RAID 126 or the HDD RAID 116 depending on the state of the cache blocks. The control module 114 may operate in an efficient way to determine where the data recovery mechanism should be applied. More specifically, in the file system 12 coupled with the cache storage system 113, when the control module 114 accesses data from the storage media (e.g. from the lower layers thereof), the control module 114 may query the SSD RAID 126 first. If the SSD RAID 126 does not have the requested data, the control module 114 may query the HDD RAID 116 and return the data. In an embodiment, after the data is found in the HDD RAID 116, the data may be regarded as hot data and replicated to the SSD RAID 126. In addition to replicating data to the SSD RAID 126, when data is first written to the file system 12 coupled with the cache storage system 113, the data may be written into the SSD RAID 126, and such data may not be written into the HDD RAID 116 immediately. The written data (stored in a dirty block) in the SSD RAID 126 is synchronized into (e.g. written into) the HDD RAID 116 only when the file system 12 is less busy or when the dirty block percentage exceeds a predetermined percentage.
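For illustration only, the caching behavior described above may be sketched as a toy write-back cache. The class `CacheStorageSketch`, its `capacity` and `dirty_threshold` parameters, and the use of dictionaries in place of the SSD RAID 126 and the HDD RAID 116 are all assumptions made for this sketch; the "less busy" trigger and cache eviction are omitted.

```python
class CacheStorageSketch:
    """Toy model of an SSD cache in front of an HDD store (write-back policy)."""

    def __init__(self, dirty_threshold: float = 0.5, capacity: int = 4):
        self.hdd = {}        # block_index -> data (stands in for the HDD RAID)
        self.ssd = {}        # block_index -> data (stands in for the SSD RAID)
        self.dirty = set()   # cached blocks not yet written to the HDD store
        self.dirty_threshold = dirty_threshold
        self.capacity = capacity  # only used to compute the dirty percentage

    def read(self, block_index):
        if block_index in self.ssd:       # query the SSD cache first
            return self.ssd[block_index]
        data = self.hdd[block_index]      # miss: fall back to the HDD store
        self.ssd[block_index] = data      # regard as hot data and replicate
        return data

    def write(self, block_index, data):
        self.ssd[block_index] = data      # new data lands in the cache only
        self.dirty.add(block_index)
        if len(self.dirty) / self.capacity > self.dirty_threshold:
            self.flush()

    def flush(self):
        """Synchronize every dirty block into the HDD store."""
        for block_index in sorted(self.dirty):
            self.hdd[block_index] = self.ssd[block_index]
        self.dirty.clear()


cache = CacheStorageSketch(dirty_threshold=0.5, capacity=4)
cache.write(0, b"a")
cache.write(1, b"b")          # 2/4 does not exceed 0.5, nothing is flushed yet
pending = set(cache.dirty)
cache.write(2, b"c")          # 3/4 exceeds the threshold and triggers a flush

cache.hdd[7] = b"cold"
first_read = cache.read(7)    # miss in the cache, served from the HDD store
promoted = 7 in cache.ssd     # the block has been replicated to the cache
```

Between a write and the next flush, the only copy of a dirty block resides in the cache, which is why the state of the cache block matters when deciding where a retry-read should be directed.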
In some embodiments, if the file system 12 finds that the data is incorrect (e.g. data rot or one-bit error occurs), the data recovery mechanism may be initiated to perform the data recovery operation(s). For example, the data error may occur in the SSD RAID 126 or the HDD RAID 116, and the cache blocks may have different degrees of popularity (e.g. some of the cache blocks may have hot data and others of the cache blocks may have cold data) in the SSD RAID 126. In order to make sure all the data in the SSD RAID 126 and the HDD RAID 116 are correct, the retry-read recovery mechanism regarding the generic storage system 13 (e.g. the retry-read operations and the associated data recovery operations for the generic storage system 13) may be adapted for the cache storage system 113, where some associated implementation details are described in the following embodiments. Thus, the data recovery mechanism is compatible with both the generic storage system 13 and the cache storage system 113.
When the file system 12 finds that the data is incorrect, the Retry-Read command may be transmitted to the control module 114 by the file system 12. The control module 114 may be implemented as a software module programmed to perform operations of the data recovery mechanism, but the present invention is not limited thereto. In some embodiments, the control module 114 may be implemented as a dedicated and customized hardware circuit configured to perform the data recovery function (e.g. the operations of the data recovery mechanism).
In an embodiment, in addition to performing the operations of the data recovery mechanism, the control module 114 may further send input/output (IO) requests to the SSD RAID 126 or the HDD RAID 116, and manage the cache blocks (e.g. manage hot data and cold data).
The control module 114 may detect the state(s) of the cache blocks, and under control of the control module 114, the Retry-Read command may be performed in the HDD RAID 116 or the SSD RAID 126 with respect to the state(s) of the cache blocks. Regarding how operations associated to the Retry-Read command are performed according to the data recovery mechanism, some greater details are illustrated in the embodiment shown in
In Step 310, the control module 114 may receive the Retry-Read command. For example, the file system 12 may have found that an error (e.g. the one-bit data error) occurs and therefore may send the Retry-Read command such as the command Read_Retry(block_index). For the file system 12, the command Read_Retry(block_index) may be arranged to read the redundant data block corresponding to the index block_index from the storage system (e.g. the cache storage system 113) of the lower layers of the file system 12 to perform a retry-read operation. The command Read_Retry(block_index) may be further transmitted or forwarded to one or more layers of the lower layers of the file system 12, and more particularly, may be further transmitted or forwarded by the control module 114 within the cache storage system 113, to perform the retry-read operation with respect to the one or more layers. For the control module 114 in the cache storage system 113, the command Read_Retry(block_index) may be arranged to read the redundant data block corresponding to the index block_index from the storage system (e.g. the HDD RAID 116, the SSD RAID 126, etc.) or the storage device (e.g. the HDDs 118, the SSDs 128, etc.) of the lower layers of the control module 114 to perform a retry-read operation such as that mentioned above.
According to this embodiment, the control module 114 may manage the cache storage system 113 to serve the file system 12, and may receive the Retry-Read command from the upper layer thereof (i.e. the file system 12). The control module 114 may perform a plurality of preparation operations (e.g. one or more of the operations of Step 320, Step 330, Step 331, Step 340, Step 341, and Step 351) first, and then perform data recovery (e.g. one or more of the operations of Step 332, Step 342, Step 352, and Step 354) in response to the Retry-Read command to obtain the correct version of the data (e.g. the correct version to be found through the Retry-Read command). Please note that at least one portion (e.g. a portion or all) of the preparation operations is related to the state of the cache block.
In Step 320, the control module 114 may check the state of one or more cache blocks, and more particularly, may determine the state of a cache block of the aforementioned at least one portion (e.g. a portion or all) of the plurality of cache blocks. The cache block is within the one or more cache blocks. For example, the cache block may correspond to the block index of the Retry-Read command, such as the index block_index of the command Read_Retry(block_index).
In Step 330, the control module 114 may determine whether the data (e.g. the data to be read through the Retry-Read command) is found in the cache block. When the data is found in the cache block (e.g. the cached data is found), Step 340 is entered; otherwise (e.g. the cached data is not found), Step 331 is entered.
In Step 331, the control module 114 may prohibit the data (more particularly, the data in the corresponding block of the HDD RAID 116) from being replicated to any of the cache blocks. As the data is not found in the cache block, and as data recovery is required, it is unnecessary to cache from the HDD RAID 116 to the SSD RAID 126 since caching may be meaningless (e.g. incorrect data may be cached from the HDD RAID 116 to the SSD RAID 126 during caching). The control module 114 may save time by prohibiting the data in the corresponding block of the HDD RAID 116 from being replicated to any of the cache blocks.
In Step 332, the control module 114 may transmit the Retry-Read command (e.g. the command Read_Retry(block_index)) to the HDD RAID 116 to perform data recovery on the HDD RAID 116. For example, the HDD RAID 116 may forward and transmit the Retry-Read command to one or more HDDs within the HDDs 118 to perform the retry-read operation, and therefore may read a redundant data block (such as that corresponding to the index block_index in the command Read_Retry(block_index)) from the one or more HDDs for the control module 114. When the data of the redundant data block is returned from the one or more HDDs, the file system 12 may find the correct version of the data and write the correct version to the lower layers to recover the data (e.g. correct an erroneous block). According to some embodiments, when the data of the redundant data block is returned from the one or more HDDs, the HDD RAID 116 or the control module 114 may find the correct version of the data and write the correct version to the lower layer(s) thereof to recover the data.
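For illustration only, the retry-read operation and the subsequent write of the correct version may be sketched with a toy two-device mirror. The class `MirrorSketch`, the function `recover`, the use of CRC-32 as the checksum, and the rule that successive retry-reads walk through the devices are all assumptions made for this sketch.

```python
import zlib


class MirrorSketch:
    """Toy RAID-1 style mirror holding one copy of each block per device."""

    def __init__(self, copies):
        self.copies = copies   # list of dicts: block_index -> bytes
        self._next = {}        # block_index -> device to use on the next retry

    def read(self, block_index):
        return self.copies[0][block_index]

    def read_retry(self, block_index):
        """Return a redundant copy; successive calls walk through the devices."""
        i = self._next.get(block_index, 1) % len(self.copies)
        self._next[block_index] = i + 1
        return self.copies[i][block_index]

    def write(self, block_index, data):
        for device in self.copies:
            device[block_index] = data


def recover(mirror, block_index, expected_crc):
    """Upper-layer recovery: retry-read until a copy matches the checksum."""
    for _ in range(len(mirror.copies)):
        candidate = mirror.read_retry(block_index)
        if zlib.crc32(candidate) == expected_crc:
            mirror.write(block_index, candidate)  # overwrite the erroneous copy
            return candidate
    return None  # no redundant copy matched the checksum


good = b"payload"
mirror = MirrorSketch([{5: b"pAyload"}, {5: good}])  # first copy has a flipped bit
detected = zlib.crc32(mirror.read(5)) != zlib.crc32(good)
restored = recover(mirror, 5, zlib.crc32(good))
```

In this sketch the upper layer holds the checksum and decides which returned copy is correct, while the mirror only supplies redundant copies and accepts the corrected write.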
In Step 340, the control module 114 may determine whether the cache block is dirty (e.g. the cache block is a dirty block). When the cache block is dirty (which means the cache block is a dirty block), Step 341 is entered; otherwise (i.e. when the cache block is non-dirty, which means the cache block is a non-dirty block), Step 351 is entered.
In Step 341, the control module 114 may temporarily prohibit the cache block (i.e. the cache block mentioned in Step 340) from being swapped. Since the cache block is dirty, the version of the data in the SSD RAID 126 is newer than the version of the data in the HDD RAID 116, and the latest correct data may only exist in the SSD RAID 126. If the version of the data in the SSD RAID 126 were synchronized to HDD RAID 116 and swapped, then all versions of the data in the file system 12 would be incorrect, because the control module 114 would read an incorrect copy (or incorrect version) of the data from the SSD RAID 126 and synchronize it to the HDD RAID 116. As a result of performing the operation of Step 341, the control module 114 may temporarily prohibit the cache block from being swapped, to guarantee that the correct version of the data can be obtained.
In Step 342, the control module 114 may transmit the Retry-Read command (e.g. the command Read_Retry(block_index)) to the SSD RAID 126 to perform data recovery on the SSD RAID 126. For example, the SSD RAID 126 may forward and transmit the Retry-Read command to one or more SSDs within the SSDs 128 to perform the retry-read operation, and therefore may read a redundant data block (such as that corresponding to the index block_index in the command Read_Retry(block_index)) from the one or more SSDs for the control module 114. When the data of the redundant data block is returned from the one or more SSDs, the file system 12 may find the correct version of the data and write the correct version to the lower layers to recover the data (e.g. correct an erroneous block). According to some embodiments, when the data of the redundant data block is returned from the one or more SSDs, the SSD RAID 126 or the control module 114 may find the correct version of the data and write the correct version to the lower layer(s) thereof to recover the data.
In Step 351, the control module 114 may temporarily prohibit the cache block (i.e. the cache block mentioned in Step 340) from being swapped. Since the cache block is non-dirty, the version of the data in the HDD RAID 116 and the version of the data in the SSD RAID 126 have been synchronized, and the correct version of the data may exist in the SSD RAID 126 or in the HDD RAID 116. In case of the correct version of the data only existing in the SSD RAID 126, the control module 114 may temporarily prohibit the cache block from being swapped, to guarantee that the correct version of the data can be obtained.
In Step 352, the control module 114 may transmit the Retry-Read command (e.g. the command Read_Retry(block_index)) to the HDD RAID 116 to perform data recovery on the HDD RAID 116. For example, the HDD RAID 116 may forward and transmit the Retry-Read command to one or more HDDs within the HDDs 118 to perform the retry-read operation, and therefore may read a redundant data block (such as that corresponding to the index block_index in the command Read_Retry(block_index)) from the one or more HDDs for the control module 114. When the data of the redundant data block is returned from the one or more HDDs, the file system 12 may find the correct version of the data and write the correct version to the lower layers to recover the data (e.g. correct an erroneous block).
In Step 353, the control module 114 may determine whether the data recovery is successful. When the data recovery is successful, the working flow 300 comes to the end; otherwise, Step 354 is entered.
In Step 354, the control module 114 may transmit the Retry-Read command (e.g. the command Read_Retry(block_index)) to the SSD RAID 126 to perform data recovery on the SSD RAID 126. For example, the SSD RAID 126 may forward and transmit the Retry-Read command to one or more SSDs within the SSDs 128 to perform the retry-read operation, and therefore may read a redundant data block (such as that corresponding to the index block_index in the command Read_Retry(block_index)) from the one or more SSDs for the control module 114. When the data of the redundant data block is returned from the one or more SSDs, the file system 12 may find the correct version of the data and write the correct version to the lower layers to recover the data (e.g. correct an erroneous block).
According to some embodiments, the operation of Step 352 and the operation of Step 354 may be interchangeable (e.g. after the operation of Step 351 is performed, the operation of Step 354 is performed first, and then the operation of Step 353 is performed, and the operation of Step 352 may be performed when it is determined in Step 353 that the data recovery is not successful). Since most of the data in the SSD RAID 126 is a replicated version from the HDD RAID 116, it may be more efficient to send the Retry-Read command to the HDD RAID 116 in the first place.
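For illustration only, the decisions of Steps 330 through 354 may be sketched as the following function. The names `CacheBlockState` and `handle_retry_read`, the callables `retry_hdd` and `retry_ssd` (standing in for forwarding the command Read_Retry(block_index) to the HDD RAID 116 and the SSD RAID 126 and reporting whether recovery succeeded), and the `actions` list are hypothetical and serve only to make the order of operations visible.

```python
from dataclasses import dataclass


@dataclass
class CacheBlockState:
    cached: bool          # Step 330: is the data found in the cache block?
    dirty: bool = False   # Step 340: is the cache block dirty?


def handle_retry_read(state, retry_hdd, retry_ssd, actions):
    """Return True when data recovery succeeds; record each action taken."""
    if not state.cached:
        actions.append("prohibit_caching")   # Step 331
        actions.append("retry_hdd")          # Step 332
        return retry_hdd()
    actions.append("prohibit_swap")          # Step 341 or Step 351
    if state.dirty:
        actions.append("retry_ssd")          # Step 342
        return retry_ssd()
    actions.append("retry_hdd")              # Step 352
    if retry_hdd():                          # Step 353
        return True
    actions.append("retry_ssd")              # Step 354
    return retry_ssd()


# A clean cached block whose HDD-side recovery fails falls back to the SSD side.
clean_actions = []
clean_ok = handle_retry_read(
    CacheBlockState(cached=True, dirty=False),
    retry_hdd=lambda: False,
    retry_ssd=lambda: True,
    actions=clean_actions,
)

# A dirty cached block is only ever recovered from the SSD side.
dirty_actions = []
dirty_ok = handle_retry_read(
    CacheBlockState(cached=True, dirty=True),
    retry_hdd=lambda: True,
    retry_ssd=lambda: True,
    actions=dirty_actions,
)
```

Swapping the two branches after the "prohibit_swap" action for a non-dirty block would give the interchanged order of Step 352 and Step 354 mentioned above.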
According to an embodiment, the adaptive control mechanism of the control module 114 allows the cache storage system 113 to perform data correction efficiently and correctly. The cache block 221 shown in
As mentioned, the SSD RAID 126 may be illustrated with the RAID-1 architecture. It should be understood that the RAID type shown in the figure(s) of this document is not intended to limit the present invention. The RAID types of the HDD RAID 116 and/or the SSD RAID 126 may vary. Examples of the RAID types may include, but are not limited to: RAID-1, RAID-5, RAID-6, DRBD, or any other kinds of RAID types.
In Step 510, the SSD RAID 126 may start synchronizing internal dirty blocks.
In Step 520, the control module 114 may read the checking information and calculate the checking information of the read dirty block(s), such as one or more of the dirty blocks. For example, in an initial phase of data synchronization, the SSD RAID 126 may read the data of a dirty block from the lower layer thereof (e.g. the bottommost layer thereof, such as the SSDs 128), to provide the control module 114 with the data, such as both of the checking information (e.g. a checksum or a hash value) and the data content of the data of the dirty block. In addition, the control module 114 may read the checking information (e.g. the checksum or the hash value) of the data of the dirty block as the read checking information, and calculate the checking information of the data according to the data content of the data of the dirty block.
In Step 530, with regard to the dirty block, the control module 114 may determine whether the read checking information is the same as the calculated checking information. When the read checking information is the same as the calculated checking information, Step 540 is entered; otherwise, Step 550 is entered.
In Step 540, under control of the control module 114, the data (more particularly, the data of the dirty block, such as both of the data content and the checking information) is synchronized to (e.g. written into) the HDD RAID 116.
In Step 550, the control module 114 may perform the Retry-Read command to the SSD RAID 126 one or more times, so as to find the correct version of data of the dirty block. More specifically, the control module 114 may send the Retry-Read command to the SSD RAID 126 to trigger the SSD RAID 126 to perform the Retry-Read command, and more particularly, to read the redundant version(s) of the data in the lower layer thereof (e.g. the bottommost layer thereof, such as the SSDs 128) in response to the Retry-Read command sent from the control module 114.
For example, the operation of Step 520 and the subsequent operations in the loop coming after Step 520 (e.g. the operations of Step 530 and Step 540, or the operations of Step 530 and Step 550) may be repeated for any unread dirty block within the dirty blocks mentioned in Step 510.
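For illustration only, the loop of Steps 520 through 550 may be sketched as follows. The function `synchronize_dirty_blocks`, the use of CRC-32 as the checking information, the `read_retry` callable, and the choice to synchronize a recovered block once a matching version is found are assumptions made for this sketch; the sketch also assumes the stored checking information itself is intact.

```python
import zlib


def synchronize_dirty_blocks(dirty_blocks, read_retry, hdd):
    """Verify each dirty block before writing it to the HDD store.

    dirty_blocks: block_index -> (data, stored_crc) as read from the SSD side.
    read_retry(block_index): returns a redundant (data, stored_crc) pair,
        or None when no further redundant copy exists.
    hdd: dict standing in for the HDD RAID.
    Returns the indices that could not be verified and were not synchronized.
    """
    failed = []
    for block_index, (data, stored_crc) in dirty_blocks.items():  # Step 520
        while zlib.crc32(data) != stored_crc:                     # Step 530
            redundant = read_retry(block_index)                   # Step 550
            if redundant is None:
                break
            data, stored_crc = redundant
        if zlib.crc32(data) == stored_crc:
            hdd[block_index] = (data, stored_crc)                 # Step 540
        else:
            failed.append(block_index)
    return failed


good = b"dirty-data"
crc = zlib.crc32(good)
dirty_blocks = {1: (good, crc), 2: (b"dirty-dat\x00", crc)}  # block 2 is corrupted
redundant_copies = {2: [(good, crc)]}


def read_retry(block_index):
    copies = redundant_copies.get(block_index, [])
    return copies.pop(0) if copies else None


hdd = {}
failed = synchronize_dirty_blocks(dirty_blocks, read_retry, hdd)
```

Because a block is written to the HDD store only after its calculated checking information matches the read checking information, a corrupted dirty block is never propagated by the synchronization in this sketch.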
By applying the working flow 500 shown in
According to another embodiment in which the correction-in-advance mechanism is applied to the control module 114, since the data in the HDD RAID 116 is always correct, the operation of Step 354 shown in
In Step 610, the control module 114 may determine the state of the cache block of the plurality of cache blocks.
In Step 620, the control module 114 may perform the retry-read operation on the at least one of the HDD RAID 116 and the SSD RAID 126 according to the state of the cache block, to obtain the correct version of the data within the redundant storage system 100.
Some related implementation details of the method are described in the above embodiments. For brevity, similar descriptions for this embodiment are not repeated in detail here.
Based on the present invention method (e.g. the method mentioned above) and the associated apparatus (e.g. the redundant storage system 100, the generic storage system 13, the cache storage system 113, the control circuits 14 and 114, etc.), when the aforementioned one-bit data error occurs in any of the SSDs/HDDs of a RAID device (e.g. the HDD RAID 16, the HDD RAID 116, and the SSD RAID 126 for the cache purpose of the HDD RAID 116) utilized by the file system due to bit rot or some kind of hardware error, the one-bit data error can be detected and the data in the SSD(s)/HDD(s) can be corrected and restored. The one-bit data error means a single bit of the data is incorrect. More specifically, data is stored in the storage medium in the binary form. For example, the binary form of 5566 is 1010110111110. Suppose that there is an error such as the one-bit data error in the binary form of 5566, e.g. 1010110111110 being saved as 1000110111110, in which the third bit from the left (changed from “1” to “0”) can be taken as an example of the one-bit data error. When 1000110111110 is interpreted back to the decimal form, it becomes the number 4542, which is a totally different number from 5566. In addition, the data may have been incorrectly written into the RAID device when some kind of hardware error occurs. More specifically, the main components of an SSD are the controller and the flash memory for storing the data. If the controller malfunctions, the data cannot be written to the SSD correctly. The present invention method and the associated apparatus can correct the one-bit data error and enhance the overall performance of the redundant storage system.
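For illustration only, the one-bit data error in the number 5566 can be reproduced with a few lines of arithmetic. The variable names are hypothetical; the flipped bit is the one with weight 1024, which is the third bit from the left in the 13-bit binary form.

```python
original = 5566
stored = original ^ (1 << 10)   # flip the bit with weight 2**10 = 1024

original_bits = format(original, "b")   # "1010110111110"
stored_bits = format(stored, "b")       # "1000110111110"

# Positions (counted from the left, starting at 0) where the two forms differ.
flipped_positions = [
    i for i, (a, b) in enumerate(zip(original_bits, stored_bits)) if a != b
]
difference = original - stored          # 1024
```

A single flipped bit therefore changes the stored value by 1024, from 5566 to 4542.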
Those skilled in the art will readily observe that numerous modifications and alterations of the device and method may be made while retaining the teachings of the invention. Accordingly, the above disclosure should be construed as limited only by the metes and bounds of the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
105114847 | May 2016 | TW | national |
This application is a continuation-in-part application and claims the benefit of U.S. Non-provisional application Ser. No. 15/381,118, which was filed on Dec. 16, 2016, and is included herein by reference. In addition, this application claims the benefit of U.S. Provisional Application No. 62/441,561, which was filed on Jan. 3, 2017, and is included herein by reference.
Number | Date | Country |
---|---|---|
62441561 | Jan 2017 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 15381118 | Dec 2016 | US |
Child | 15491994 | | US |