The above and other objects, features and advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
Preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The present invention is supplementary to and compatible with existing destage algorithms, cache replacement algorithms and disk scheduling methods, and can be applied to various RAID levels, such as RAID level 0, RAID level 1, RAID level 5 and RAID level 6.
A principal feature of the present invention is the effective use of the characteristics of a disk shown in
The RAID Advisory Board describes RAID-5 in more detail using the terminology of both strip and stripe, as shown in
In the present invention, a stripe cache is used to improve the I/O performance. The size of a stripe cache unit used in the method proposed in the present invention is the size of the stripe. The cache consisting of the stripe cache units is managed in terms of stripes. In other words, each stripe cache unit corresponds to one stripe, which is the management unit for cache replacement and for destages. In the present invention, a matrix stripe cache (MSC) unit is a stripe cache unit that is managed by the proposed rxw-matrix for the proposed contiguity transform. The elements of the rxw-matrix correspond to blocks 47 and 48. The columns of the rxw-matrix correspond to the strips 41˜45.
Furthermore, in the present invention, the contiguity transform generates the rxw-matrix to destage a stripe and transforms two discontiguous reads or writes into a contiguous read or write by inserting additional reads and writes into the discontiguous region. The contiguity transform exploits rules for consistency and performance, which enable data to be consistent without filesystem dependency, data modification, and performance degradation. The cache, which is managed in terms of MSC units, provides easy implementation of a RAID system and efficient performance for sequential or bulky I/Os, and exploits spatial locality. Furthermore, the present invention improves read performance in a normal mode, which is the more important case, at the expense of read performance in a degraded mode. The present invention is performed in the RAID controller of
Although the present invention can be applied to various RAID levels, the application of the present invention to RAID level 5 will be described as an example.
Write data transferred to the RAID is cached in the memory 21 of the RAID control unit 20 shown in
Cache memory may be left unassigned for empty blocks of an MSC unit, or cache memory may be assigned to every block of an MSC unit in advance. The present invention is not limited to either of the two methods.
In the MSC unit 100 of
Furthermore, after an MSC unit 100 is selected for destaging by a destage method that decides what to destage, a basic rxw-matrix 110 is generated from the selected MSC unit 100, and then the contiguity transform converts the basic rxw-matrix 110 into the transformed rxw-matrix 140. Finally, reads from disks, XOR operations of the block caches, and writes to disks are performed using the transformed rxw-matrix 140.
All operations of RAID-5 can be categorized into six operations: r, t, w, x, xx, and _. In
We use these mnemonics to conveniently describe our work. To update one block with new data, it is necessary to (1) read all other blocks of the PBG to which the updated block belongs, unless they are cached; (2) XOR all data blocks; and (3) write the parity block and the new block. This operation requires (N−1−d−c) reads and (d+1) writes, for a total of (N−c) I/Os, where N is the number of disks, c is the number of cached clean blocks, and d is the number of dirty blocks to be updated. This process is known as a reconstruct-write cycle. When d=N−1, it is unnecessary to read any block; this case is known as a full-parity-block-group-write.
A read-modify-write cycle can be used to reduce the number of I/Os when N−c>2(1+d), as the reconstruct-write cycle requires (N−c) I/Os while the read-modify-write cycle requires 2(1+d) I/Os. This process does the following: (1) it copies the new data to the cache memory; (2) it reads the old parity block (r) and reads the old block to a temporary memory (t) simultaneously; (3) it XORs the new block with the old block (xx), and XORs the result with the old parity block (x) to generate the new parity block (_); and (4) it writes the new block (w) and writes the new parity block (w) simultaneously, as shown in
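By way of illustration only, the I/O counts of the two cycles and the selection condition N−c>2(1+d) can be sketched as follows. The function names are hypothetical and are not part of the described method; the formulas are taken directly from the text above.

```python
def reconstruct_write_ios(n_disks: int, dirty: int, clean: int) -> int:
    """I/Os of a reconstruct-write cycle: (N-1-d-c) reads plus (d+1) writes."""
    reads = n_disks - 1 - dirty - clean
    writes = dirty + 1
    return reads + writes  # always equals N - c


def read_modify_write_ios(dirty: int) -> int:
    """I/Os of a read-modify-write cycle: (1+d) reads plus (1+d) writes."""
    return 2 * (1 + dirty)


def choose_cycle(n_disks: int, dirty: int, clean: int) -> str:
    """Pick read-modify-write only when it needs strictly fewer I/Os."""
    if n_disks - clean > 2 * (1 + dirty):
        return "read-modify-write"
    return "reconstruct-write"
```

For a five-disk array with one dirty block and no clean blocks, the read-modify-write cycle needs 4 I/Os against 5 and is chosen; with one additional clean block both cycles need 4 I/Os and the reconstruct-write cycle is kept.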
For various cases of cache status,
In the block status in an MSC unit 100, “D” denotes a dirty block in which new data is in the cache but not yet written to a disk, “C” denotes a clean block in which data consistent with the disk is in the cache, and “E” denotes an empty block in which valid data is not in the cache. Let u be the number of blocks per strip, and let v be the number of disks constituting a RAID-5 array. The cache status of the blocks of an MSC unit 100 that is shown in
Before the actual execution of the read, XOR, and write for all blocks in a stripe, it is necessary to determine which blocks should be read, how the parity blocks should be made, and which blocks should be written, by generating a basic rxw-matrix, as shown in
A method of generating parity blocks for parity block groups in order to destage dirty blocks is described by the basic rxw-matrix 110. For example, in the first row of the basic rxw-matrix 110, a dirty block (z11) exists in the first column, which corresponds to disk D0, but all blocks (z12, z13, z14) of the other columns are empty. The read-modify-write cycle is therefore used: m11 of the basic rxw-matrix 110 becomes txxw, and m15, which is the parity block of the parity block group, becomes rx_w. In other words, m11 performs operation ‘t’, m15 performs operation ‘r’, an XOR operation is performed on the temporary and cache memory of m11 and the cache memory of m15, and then m11 and m15 perform operation ‘w’.
The third row of the basic rxw-matrix 110 is taken as an example. z33 is a dirty block, and z31 is a clean block. Accordingly, m31 becomes ‘x’, m32 becomes ‘rx’, m33 becomes ‘xw’, m34 becomes ‘rx’, and m35 becomes ‘_w’ by the reconstruct-write cycle.
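By way of illustration only, the generation of one row of a basic rxw-matrix from the cache status of a parity block group can be sketched as follows. The function name, the string encoding of the status, and the simplifications that the parity block occupies the last column and is never cached are assumptions made for this sketch; the element values follow the two rows described above.

```python
from __future__ import annotations


def basic_rxw_row(data_status: str) -> list[str]:
    """Build one row of a basic rxw-matrix.

    data_status holds one letter per data block: D (dirty), C (clean), E (empty).
    Assumption: the parity block is the last element and is not cached.
    """
    n_disks = len(data_status) + 1
    dirty = data_status.count("D")
    clean = data_status.count("C")
    if dirty == 0:
        return [""] * n_disks  # nothing to destage in this parity block group
    if n_disks - clean > 2 * (1 + dirty):
        # Read-modify-write cycle: only dirty blocks and the parity are touched.
        row = ["txxw" if s == "D" else "" for s in data_status]
        return row + ["rx_w"]
    # Reconstruct-write cycle: every data block takes part in the XOR.
    ops = {"D": "xw", "C": "x", "E": "rx"}
    return [ops[s] for s in data_status] + ["_w"]
```

Calling `basic_rxw_row("DEEE")` reproduces the first row of the example (txxw for m11 and rx_w for m15), and `basic_rxw_row("CEDE")` reproduces the third row.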
A read matrix 120, illustrating only the read operations in the basic rxw-matrix 110, is shown in
After the basic rxw-matrix 110 is generated from the block status of the MSC unit 100, the contiguity transform that consists of the read contiguity transform and the write contiguity transform is performed in order to produce the transformed rxw-matrix 140.
The fundamental principle of the read contiguity transform is as follows:
The rxw element mij, to which the read operation can be added, is in the discontiguous region between two discontiguous elements, maj and mbj, both of which include ‘r’ or ‘t’. In the discontiguous region, there exists no element that includes ‘r’ or ‘t’. In the case of the second column shown in
Meanwhile, when a read operation is added to an element mij, operation ‘t’ is used if mij does not correspond to a parity block and the corresponding cache status zij is dirty. Otherwise, operation ‘r’ is used.
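By way of illustration only, the read contiguity transform for a single column can be sketched as follows. The function names and the list encoding of a column are assumptions made for this sketch; no distance limit is applied here, and the added operation is written in front of the existing mnemonics purely as a notational choice.

```python
def has_read(element: str) -> bool:
    """True if the rxw element already contains a read ('r' or 't')."""
    return "r" in element or "t" in element


def read_contiguity_transform(column, status, is_parity=False):
    """Fill every gap between two reading elements of one column.

    column: rxw elements of one strip, top to bottom.
    status: cache status letters (D, C, E) of the same blocks.
    """
    out = list(column)
    readers = [i for i, e in enumerate(column) if has_read(e)]
    for a, b in zip(readers, readers[1:]):
        for i in range(a + 1, b):  # empty when a and b are adjacent
            # 't' protects the new data of a dirty data block; otherwise 'r'.
            op = "t" if (not is_parity and status[i] == "D") else "r"
            out[i] = op + out[i]
    return out
```

For a column `["rx", "", "xw", "rx"]` with status `["E", "E", "D", "E"]`, the sketch yields `["rx", "r", "txw", "rx"]`, so the whole column can be fetched with one contiguous read.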
The write contiguity transform follows the read contiguity transform. The fundamental principle of the write contiguity transform is as follows:
The rxw element mij, to which the write operation can be added, is in the discontiguous region between two discontiguous elements, maj and mbj, both of which include ‘w’. In the discontiguous region, there exists no element that includes ‘w’. In the case of the first column shown in
Furthermore, the write contiguity transform for the discontiguous region between maj and mbj is disallowed if the region contains at least one element mlj whose cache status zlj is empty and which does not contain ‘r’, because no valid data would be available in the cache to write for that block.
There are other limitations on the contiguity transform. If the stride distance between two discontiguous reads in the basic rxw-matrix 110 is greater than a predetermined “maximum read distance”, the read contiguity transform for the discontiguous reads is disallowed. In a similar way, if the stride distance between two discontiguous writes in the basic rxw-matrix 110 is greater than a predetermined “maximum write distance”, the write contiguity transform for the discontiguous writes is disallowed.
The predetermined maximum read distance is a stride distance that exhibits faster performance than a contiguous read, where the stride distance is defined by the number of blocks between two discontiguous I/Os. The predetermined maximum write distance is a stride distance that exhibits faster performance than a contiguous write. The maximum read distance and the maximum write distance are obtained from a member disk of a disk array by a stride benchmark, which is performed automatically when an administrator creates the disk array. The stride benchmark generates the workload of a stride pattern by varying the stride distance.
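By way of illustration only, the write contiguity transform for a single column, including the rule that disallows it, can be sketched as follows. The function name and the list encoding are assumptions made for this sketch, and no distance limit is applied here.

```python
def write_contiguity_transform(column, status):
    """Fill gaps between two writing elements of one column where allowed.

    column: rxw elements of one strip, top to bottom (ideally after the
            read contiguity transform, so that added reads are visible).
    status: cache status letters (D, C, E) of the same blocks.
    """
    out = list(column)
    writers = [i for i, e in enumerate(column) if "w" in e]
    for a, b in zip(writers, writers[1:]):
        gap = range(a + 1, b)
        # An empty block that is not read has no data that could be written.
        blocked = any(status[i] == "E" and "r" not in column[i] for i in gap)
        if blocked:
            continue
        for i in gap:
            out[i] += "w"
    return out
```

With status `["D", "E", "D"]`, the column `["txxw", "r", "txxw"]` becomes `["txxw", "rw", "txxw"]`, whereas `["xw", "", "xw"]` is left unchanged because the middle block is empty and is not read.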
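By way of illustration only, the distance limit can be sketched as a filter over the gaps of one column. The function name is hypothetical, and the sketch assumes that the stride distance of a gap is the number of blocks lying strictly between the two discontiguous I/Os, following the definition given above.

```python
def fillable_gaps(column, ops, max_distance):
    """Return the (a, b) index pairs whose gap may be filled.

    column: rxw elements of one strip, top to bottom.
    ops: mnemonics that count as an I/O, e.g. "rt" for reads or "w" for writes.
    max_distance: maximum read distance or maximum write distance, in blocks.
    """
    hits = [i for i, e in enumerate(column) if any(op in e for op in ops)]
    return [
        (a, b)
        for a, b in zip(hits, hits[1:])
        if 0 < b - a - 1 <= max_distance  # assumed: distance = blocks in between
    ]
```

For the column `["rx", "", "rx", "", "", "", "txxw"]` and a maximum read distance of 2, only the gap between rows 0 and 2 qualifies; the three-block gap between rows 2 and 6 is left discontiguous.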
The “maximum read distance” value and the “maximum write distance” value are stored in a non-volatile storage that can permanently store the values.
The transformed rxw-matrix 140 is generated by the contiguity transform for the basic rxw-matrix 110. How the contiguity transform is performed can be understood easily by comparing the read matrix 120 and the write matrix 130.
Finally, the actual reads, XORs, and writes to destage the MSC unit 100 are performed after generating the transformed rxw-matrix 140. All reads of the transformed rxw-matrix 140 are requested to disks simultaneously. After all of the requested reads are completed, XOR operations are performed and all writes of the transformed rxw-matrix 140 are requested to disks. When all of the requested writes are completed, destaging the MSC unit 100 is completed.
When the contiguity transform is performed as described above, a plurality of read or write commands is merged into a single disk command. Although the single disk command has a longer data length, it exhibits faster performance than the separate disk commands that it replaces.
A process resulting in the generation of the transformed rxw-matrix 140 of the MSC unit 100 is illustrated in the flowchart of
When the MSC unit 100 is determined to be destaged at step 200, a first step 201 of determining whether there are one or more dirty blocks in the MSC unit 100 is performed. If there are one or more dirty blocks in the MSC unit 100 at the first step 201, a second step 202 is performed; otherwise, the destage of the MSC unit 100 is terminated at step 211.
At the second step 202, the basic rxw-matrix 110 is generated using the read-modify-write cycle and the reconstruct-write cycle in order to destage dirty blocks for each row of the MSC unit 100.
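By way of illustration only, the effect on the number of disk commands can be sketched by grouping the contiguous blocks of one column that need an I/O. The function name and the (start, length) representation of a disk command are assumptions made for this sketch.

```python
def disk_commands(column, ops):
    """Group contiguous blocks needing an I/O into (start, length) commands.

    column: rxw elements of one strip, top to bottom.
    ops: mnemonics that count as an I/O, e.g. "rt" for reads or "w" for writes.
    """
    commands = []
    start = None
    for i, element in enumerate(column):
        needs_io = any(op in element for op in ops)
        if needs_io and start is None:
            start = i  # a new contiguous run begins
        elif not needs_io and start is not None:
            commands.append((start, i - start))
            start = None
    if start is not None:  # run that reaches the end of the strip
        commands.append((start, len(column) - start))
    return commands
```

The column `["rx", "", "rx", "rx"]` needs two read commands, `(0, 1)` and `(2, 2)`; once the gap is filled to give `["rx", "r", "rx", "rx"]`, a single command `(0, 4)` suffices.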
Thereafter, the read contiguity transform is performed at a third step 203. In the read contiguity transform, the rxw element mij to which the read operation is added, is in the discontiguous region between two discontiguous elements, maj and mbj, both of which include ‘r’ or ‘t’. In the discontiguous region, there exists no element that includes ‘r’ or ‘t’. However, if the stride distance between two discontiguous reads in the basic rxw-matrix 110 is greater than a predetermined “maximum read distance”, the read contiguity transform for the discontiguous reads is disallowed. When a read operation is added to an element mij, operation ‘t’ is used if mij does not correspond to a parity block and the corresponding cache status zij is dirty. Otherwise, operation ‘r’ is used.
Thereafter, the write contiguity transform is performed at a fourth step 204. In the write contiguity transform, the rxw element mij, to which the write operation can be added, is in the discontiguous region between two discontiguous elements, maj and mbj, both of which include ‘w’. In the discontiguous region, there exists no element that includes ‘w’. However, if the stride distance between two discontiguous writes in the basic rxw-matrix 110 is greater than a predetermined “maximum write distance”, the write contiguity transform for the discontiguous writes is disallowed. Furthermore, the write contiguity transform for the discontiguous region between maj and mbj is disallowed if the region contains at least one element mlj whose cache status zlj is empty and which does not contain ‘r’.
Thereafter, it is determined at a fifth step 205 whether the transformed rxw-matrix contains one or more read operations ‘r’ or ‘t’. If it contains no read operation at the fifth step 205, an eighth step 208 is performed. If it contains one or more read operations at the fifth step 205, all reads of the transformed rxw-matrix 140 are requested to disks simultaneously at a sixth step 206.
After all of the requested reads are completed at a seventh step 207, XOR operations for each row of the transformed rxw-matrix 140 are performed at the eighth step 208, and all writes of the transformed rxw-matrix 140 are requested to disks at a ninth step 209.
When all of the requested writes are completed at a tenth step 210, destaging the MSC unit 100 is completed at step 211.
If the maximum read distance is 1, there is no write contiguity transform that is assisted by the read contiguity transform. The maximum read distance may therefore be increased aggressively, beyond the value determined by the above rule, in order to increase the possibility of the write contiguity transform, thereby achieving better performance.
A method of improving read performance according to the present invention is described below. The read performance improvement scheme is independent of the write performance improvement scheme.
The read performance improvement scheme of the present invention can improve read performance in a normal mode by sacrificing read performance in a degraded mode. In the degraded mode, the read performance must be improved using a parity cache. Doing so, however, creates complicated dependencies between read requests, so the read implementation of a RAID becomes complex and incurs considerable read overhead.
In order to reduce such overhead, reads in a degraded mode are always performed over all blocks of a stripe. Thus, the complicated dependency between read requests can be reduced. However, this results in poor performance for small read operations.
The read performance improvement scheme is illustrated in the flowchart of
When a read request is generated and the read starts at step 300, a read request that ranges over two or more MSC units 100 is divided into several read requests for the respective MSC units at a first step 301.
Thereafter, a second step 302 of determining whether there is a failed disk is performed. If there is no failed disk at the second step 302, a third step 303 of reading empty blocks of the MSC unit 100 for the divided read requests is performed, and then the read is terminated at step 309.
At the third step 303, some of the requested blocks may hit the cache of the MSC unit, in which case only the remaining empty blocks are read. If all of the requested blocks hit the cache, the read request is terminated without any read operation.
Meanwhile, if there is a failed disk at the second step 302, it is determined at a fourth step 304 whether there is an empty block in the MSC unit 100. If there is no empty block in the MSC unit 100 at the fourth step 304, the read is terminated at step 309. If there is an empty block, it is determined at a fifth step 305 whether a read of all blocks of the MSC unit 100 is already in progress. If such a read is in progress at the fifth step 305, the read request is inserted into the blocking list of the MSC unit 100 at an eighth step 308, and the read request is then finished at the seventh step 307.
If no read of all blocks of the MSC unit 100 is in progress at the fifth step 305, a sixth step 306 is performed, in which an rxw-matrix is produced in order to read all of the empty blocks of the MSC unit 100, the read contiguity transform is performed on the rxw-matrix, and all blocks of the MSC unit are then read according to the rxw-matrix. The read contiguity transform may, however, be omitted. After the read of all blocks of the stripe has been completed, a seventh step 307 of finishing the read requests in the blocking list of the MSC unit 100 is performed, and the read is then terminated at step 309.
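By way of illustration only, the division of a read request at the first step 301 can be sketched as follows. The function name is hypothetical, and the sketch assumes that a request is given as a starting logical block and a length, and that each MSC unit covers a fixed number of consecutive logical blocks.

```python
def split_read_request(start_block, length, blocks_per_unit):
    """Divide a read request into one piece per MSC unit.

    Returns a list of (unit_index, first_block, block_count) tuples.
    Assumption: MSC unit k covers logical blocks
    [k * blocks_per_unit, (k + 1) * blocks_per_unit).
    """
    parts = []
    end = start_block + length
    while start_block < end:
        unit = start_block // blocks_per_unit
        unit_end = (unit + 1) * blocks_per_unit
        chunk_end = min(end, unit_end)  # stop at the unit boundary
        parts.append((unit, start_block, chunk_end - start_block))
        start_block = chunk_end
    return parts
```

A request for 10 blocks starting at block 6, with 8 blocks per MSC unit, is divided into `(0, 6, 2)` and `(1, 8, 8)`.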
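By way of illustration only, the decisions of steps 302 to 308 for one divided read request can be sketched as follows. The class, the function names, and the returned action labels are hypothetical; the sketch models only the control flow and the blocking list, not the disk I/O itself, and it assumes that every empty block becomes clean once the read of the whole unit completes.

```python
from dataclasses import dataclass, field


@dataclass
class MSCUnit:
    status: list  # one letter per block: D, C or E
    full_read_in_progress: bool = False
    blocking_list: list = field(default_factory=list)


def handle_read(unit, request, disk_failed):
    """Return the action taken for one per-unit read request."""
    if not disk_failed:
        return "read-empty-blocks"  # step 303 (normal mode)
    if "E" not in unit.status:
        return "served-from-cache"  # step 304: nothing left to read
    if unit.full_read_in_progress:
        unit.blocking_list.append(request)  # step 308
        return "blocked"
    unit.full_read_in_progress = True  # step 306: read the whole unit
    return "read-whole-unit"


def complete_full_read(unit):
    """Step 307: finish the requests that waited for the whole-unit read."""
    unit.status = ["C" if s == "E" else s for s in unit.status]
    unit.full_read_in_progress = False
    finished, unit.blocking_list = unit.blocking_list, []
    return finished
```

In this sketch a second request that arrives while the whole-unit read is outstanding is only queued, and it is returned by `complete_full_read` once that read finishes, so no request depends on any I/O other than the single whole-unit read.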
As described above, the present invention can improve the performance of discontiguously sequential writes of disk arrays with sophisticated fault-tolerant schemes such as RAID-5, RAID-6 and so on.
Number | Date | Country | Kind |
---|---|---|---|
10-2006-0055355 | Jun 2006 | KR | national |