The present invention relates generally to enhancing the performance of a storage adapter for a Redundant Array of Independent Disks (RAID) and, more particularly, to systems and methods for optimizing host reads and cache destages in a RAID subsystem.
Computing systems may include one or more host computers (“hosts”) for processing data and running application programs, storage for storing data, and a storage adapter for controlling the transfer of data between the hosts and the storage. The storage may include a Redundant Array of Independent Disks (RAID) storage device. Storage adapters, also referred to as control units or storage directors, may manage access to the RAID storage devices, which may be comprised of numerous Hard Disk Drives (HDDs) that maintain redundant copies of data (e.g., “mirror” the data or maintain parity data). A storage adapter may be described as a mechanism for managing access to a hard drive for read and write request operations, and a hard drive may be described as a storage device. Hosts may communicate Input/Output (I/O) requests to the storage device through the storage adapter.
A storage adapter and storage subsystem may contain a write cache to enhance performance. The write cache may be non-volatile (e.g., battery backed or Flash memory) and may be used to mask the “write penalty” introduced by a redundant array of independent disks (RAID) system such as RAID-5 and RAID-6 systems. A write cache may also improve performance by coalescing multiple host operations placed in the write cache into a single destage operation, which may then be processed by the RAID layer and disk devices.
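By way of illustration only, the coalescing of multiple host operations into a single destage operation may be sketched as follows. The function name `coalesce_writes` and the representation of a cached write as a `(start_lba, block_count)` pair are assumptions made for this example and are not taken from the described adapter.

```python
def coalesce_writes(writes):
    """Merge cached host writes into contiguous destage extents.

    Each write is a hypothetical (start_lba, block_count) pair.
    Overlapping or adjacent writes are merged into one extent.
    """
    extents = []
    for start, count in sorted(writes):
        end = start + count
        if extents and start <= extents[-1][1]:
            # Overlaps or abuts the previous extent: extend that extent.
            extents[-1][1] = max(extents[-1][1], end)
        else:
            # Gap before this write: start a new extent.
            extents.append([start, end])
    return [(s, e - s) for s, e in extents]


# Four host writes placed in the write cache, in arrival order.
host_writes = [(100, 8), (108, 8), (300, 4), (104, 8)]
destages = coalesce_writes(host_writes)
print(destages)
```

In this sketch, three of the four host writes touch one contiguous region, so they may be destaged together as a single extent, while the fourth remains a separate destage.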
Write command data sent by the host may be placed in cache memory to be destaged later to disk via the RAID layer. When using RAID levels such as RAID-5 and RAID-6, many of these cache destage operations may result in multiple pairs of Read-XOR/Write operations, where both operations of a pair are directed to the same logical block addresses (LBAs) on a disk. Each Read-XOR/Write pair may result from the need to either: 1) read old data, XOR this old data with new data to produce a change mask, and then write the new data; or 2) read old parity, XOR this old parity with a change mask to produce new parity, and then write the new parity. In both cases, the Read-XOR operation may need to complete successfully at the disk before the Write operation can be performed.
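The two Read-XOR/Write cases described above may be illustrated with a short sketch. The helper names `xor_blocks` and `small_write`, the four-byte block size, and the three-data-disk stripe are assumptions chosen only to keep the example small.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(old_data: bytes, new_data: bytes, old_parity: bytes):
    """Return (change_mask, new_parity) for a single-block update."""
    # Case 1: old data XOR new data produces the change mask.
    change_mask = xor_blocks(old_data, new_data)
    # Case 2: old parity XOR change mask produces the new parity.
    new_parity = xor_blocks(old_parity, change_mask)
    return change_mask, new_parity


# A hypothetical stripe with three data blocks and one parity block.
d0 = bytes([0x11] * 4)
d1 = bytes([0x22] * 4)
d2 = bytes([0x44] * 4)
parity = xor_blocks(xor_blocks(d0, d1), d2)

# The host rewrites d1 only.
new_d1 = bytes([0xA5] * 4)
mask, new_parity = small_write(d1, new_d1, parity)

# The incrementally updated parity matches a full recomputation.
full_parity = xor_blocks(xor_blocks(d0, new_d1), d2)
print(new_parity == full_parity)
```

The sketch shows why the other data blocks of the stripe need not be read: the change mask carries all of the information the parity update requires.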
In “A Case for Redundant Arrays of Inexpensive Disks (RAID)”, Proc. of ACM SIGMOD International Conference on Management of Data, pp. 109-116, 1988, incorporated herein by reference, D. A. Patterson, G. Gibson and R. H. Katz describe five types of disk arrays classified as RAID levels 1 through 5. Of particular interest are disk arrays with an organization of RAID level 5, because the parity blocks in such a RAID type are distributed evenly across all disks, and therefore cause no bottleneck problems.
One shortcoming of the RAID environment may be that a disk write operation may be far less efficient than on a single disk, because a data write on RAID may require as many as four disk access operations, as compared with one disk access operation on a single disk. Whenever the disk controller in a RAID organization receives a request to write a data block, it may not only update (i.e., read and write) the data block, but may also update (i.e., read and write) the corresponding parity block to maintain consistency. For instance, if data block D1 in
Therefore, the following four disk access operations may be required: (1) read the old data block D1; (2) read the old parity block P0; (3) write the new data block D1; and (4) write the new parity block P0. The reads may need to be completed before the writes can be started.
In “Performance of Disk Arrays in Transaction Processing Environments”, Proc. of International Conference on Distributed Computing Systems, pp. 302-309, 1992, J. Menon and D. Mattson teach that caching or buffering storage blocks at the disk controller may improve the performance of a RAID disk array subsystem. If there is a disk cache, the pre-reading from the disk array of a block to be replaced may be avoided if the block is in the cache. Furthermore, if the parity block for each parity group is also stored in the cache, then both reads from the disk array may be avoided if the parity block is in the cache.
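The effect of caching on the four disk accesses described above may be expressed as a simple count. The function name `disk_accesses_for_small_write` is hypothetical, and the sketch assumes that the two writes (new data and new parity) always go to disk.

```python
def disk_accesses_for_small_write(data_cached: bool, parity_cached: bool) -> int:
    """Count disk accesses for one small write under the stated assumptions."""
    # A pre-read is needed only for a block that is not already in the cache.
    reads = (0 if data_cached else 1) + (0 if parity_cached else 1)
    # New data and new parity are both written to disk.
    writes = 2
    return reads + writes


for data_cached in (False, True):
    for parity_cached in (False, True):
        n = disk_accesses_for_small_write(data_cached, parity_cached)
        print(data_cached, parity_cached, n)
```

With neither block cached, the count is the four accesses noted earlier; with both the old data block and the parity block cached, only the two writes remain.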
A Read command sent by the host may not be satisfied by data in the write cache. Unlike a host Write command, which may not need to wait for a disk access (as long as there is space in the write cache), a host Read command may have to wait for a disk to perform a Read operation. However, if the write cache is full, a host Write command may also need to wait for disk accesses (cache destages) to complete.
Performance of the storage subsystem may be greatly influenced by controlling the interaction of the disk Read operations (resulting from host Reads) and disk Read-XOR/Write operations (resulting from cache destages).
According to an aspect of the invention, a method of a storage adapter controlling a redundant array of independent disks (RAID) may be provided. The method may include examining performance curves of a storage adapter with a write cache, determining if an amount of data entering the write cache of the storage adapter has exceeded a threshold, and implementing a strategy based on the determining operation. The strategy may include one of coupling Read-XOR/Write operations and providing priority reordering of Read operations over the Read-XOR/Write operations in order to minimize host read response time if data entering the write cache is less than the threshold, and allowing all Read operations and Read-XOR/Write operations to be queued at the device using simple tags in order to achieve maximum throughput if data entering the write cache is greater than the threshold.
According to another aspect of the invention, a storage adapter controlling a redundant array of independent disks (RAID) may be provided. The storage adapter may include a write cache, and storage adapter logic. The storage adapter logic may be configured to examine performance curves of the storage adapter, determine if an amount of data entering the write cache of the storage adapter has exceeded a threshold, and implement a strategy based on the determining operation. The strategy may include one of coupling Read-XOR/Write operations and providing priority reordering of Read operations over the Read-XOR/Write operations in order to minimize host read response time if data entering the write cache is less than the threshold, and allowing all Read operations and Read-XOR/Write operations to be queued at the device using simple tags in order to achieve maximum throughput if data entering the write cache is greater than the threshold.
According to another aspect of the invention, a system including a storage adapter controlling a redundant array of independent disks (RAID) may be provided. The system may include a cache memory coupled to the storage adapter and the RAID, the cache memory having a threshold, and storage adapter logic to examine performance curves of the storage adapter and the capacity of the cache memory. The storage adapter logic may determine if an amount of data entering the cache memory of the storage adapter has exceeded a threshold. The storage adapter logic may implement a strategy based on the determination of exceeding the threshold. The strategy may include one of coupling Read-XOR/Write operations and providing priority reordering of Read operations over the Read-XOR/Write operations in order to minimize host read response time if data entering the write cache is less than the threshold, and allowing all Read operations and Read-XOR/Write operations to be queued at the device using simple tags in order to achieve maximum throughput if data entering the write cache is greater than the threshold.
The foregoing and other aspects will become apparent from the following detailed description when considered in conjunction with the accompanying drawing figures.
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the like elements throughout. The embodiments are described below to explain the present invention by referring to the figures.
As used in this application, the terms “a”, “an” and “the” may refer to one or more than one of an item. The terms “and” and “or” may be used in the conjunctive or disjunctive sense and will generally be understood to be equivalent to “and/or”. For brevity and clarity, a particular quantity of an item may be described or shown while the actual quantity of the item may differ.
Systems and methods according to various aspects or embodiments of the present invention may provide an improved process for controlling the interaction of the disk Read operations (resulting from host Reads) and disk Read-XOR/Write operations (resulting from cache destages) in a redundant array of independent disk (RAID) system. An aspect of the present invention may be practiced on the disk array storage adapter shown in
The storage adapter 110 may contain an algorithm for managing the write cache 105 to enhance performance of the RAID system 120. The write cache 105 may be non-volatile (e.g., battery backed or Flash memory) and may be used to mask the “write penalty” introduced by a RAID system 120. The “write penalty” may be the delay in disk access due to the time it takes to complete a host 102 write command to the RAID system 120. The write cache 105 may contain random access memory (RAM) that acts as a buffer between the host computer 102 and the RAID system 120. This may allow for a more efficient process for writing data to the RAID system 120. A write cache 105 may also improve performance by coalescing multiple host operations into a single destage operation, which may then be processed by the RAID system 120. These destage operations may include multiple pairs of Read-XOR/Write operations, such as reading old data, XORing (comparing) this old data with new data to produce a change mask, and then writing the new data. These destage operations may also include reading old parity data, XORing this old parity data with a change mask to produce new parity data, and then writing the new parity data. Data from host write commands may be placed in the write cache 105, giving the host a relatively quick response. Host Read commands, however, may not be satisfied by data in the write cache 105. Therefore, a host Read command may have to wait for a disk to perform a Read operation, or other activities caused by a cache destage, before the command is processed.
Prior art systems typically had to compromise in improving the disk Read operations (resulting from host reads) in order to minimize host response time versus improving disk Write or disk Read-XOR/Write operations (resulting from cache destages) in order to maximize overall throughput. An embodiment of the present invention seeks to provide the most desirable of both worlds. For example, as illustrated in
A key idea of an embodiment of the invention is to be able to detect when it may be acceptable to be somewhat less efficient in order to provide the host a more desirable response time and when it may be necessary to be the most efficient possible in order to provide the greatest throughput. In examining the typical performance curves shown in
At throughputs lower than the “knee” 200, it may be desirable to use tight coupling of Read-XOR/Write operations along with priority reordering of Reads (resulting from host Reads) ahead of Read-XOR/Write operations (resulting from cache destages) in order to minimize host response time. At throughputs greater than the “knee” 230, it may be desirable to allow all Reads (resulting from host Reads) and Read-XOR/Write operations (resulting from cache destages) to be queued at the device using simple tags in order to achieve maximum throughput, similar to that shown in 220. Simple tags allow the device to reorder operations as needed to, for example, maximize the number of operations per second that may be performed.
Detection of the “knee” 230 may be implemented in several ways. A basic idea is to determine whether write cache 31 has adequate free space (i.e., it is being kept at or below its established threshold) 320 or whether the write cache 31 threshold or a per-device cache threshold has been exceeded. Switching strategies when write cache 31 begins to exceed its threshold may then help prevent the cache 31 from encountering a cache full condition 300 and may extend throughput (e.g., as shown by 210).
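One possible form of this threshold check may be sketched as follows. The names `Strategy` and `choose_strategy`, the parameters, and the use of block counts as the unit of cache occupancy are all assumptions introduced for illustration; the actual detection logic of the storage adapter is not limited to this form.

```python
from enum import Enum


class Strategy(Enum):
    # Tight coupling of Read-XOR/Write pairs, Reads prioritized ahead.
    MIN_RESPONSE_TIME = "min_response_time"
    # All operations queued at the device using simple tags.
    MAX_THROUGHPUT = "max_throughput"


def choose_strategy(cache_used, cache_threshold,
                    device_used=None, device_threshold=None):
    """Pick a strategy from write cache occupancy (hypothetical units: blocks)."""
    over_cache = cache_used > cache_threshold
    # The per-device check is optional and applies only when both values are given.
    over_device = (device_used is not None
                   and device_threshold is not None
                   and device_used > device_threshold)
    if over_cache or over_device:
        return Strategy.MAX_THROUGHPUT
    # At or below threshold: the cache has adequate free space.
    return Strategy.MIN_RESPONSE_TIME


print(choose_strategy(cache_used=400, cache_threshold=700))
print(choose_strategy(cache_used=800, cache_threshold=700))
```

In this sketch, occupancy exactly at the threshold is treated as adequate free space, consistent with the "at or below" wording above.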
An embodiment of the present invention may solve this problem by providing a RAID storage adapter 110 that may dynamically change the interaction of the disk Read operations (resulting from host computer Reads) and disk Read-XOR/write operations (resulting from cache destages) in order to maximize overall throughput and minimize host response time. This may apply to normal parity updates for RAID levels such as RAID-5 and RAID-6. Still another embodiment of the present invention may dynamically change the interaction of the disk Read operations (resulting from host Reads) and disk Write operations (resulting from cache destages) in order to maximize overall throughput and minimize host response time. This may apply to RAID levels such as RAID-0 and RAID-1 and to stripe writes with RAID-5 and RAID-6.
The dynamic operation of the storage adapter 110 is illustrated in the flow diagrams of
The tight coupling of the Read-XOR/Write pair operations 420 may prevent other operations from occurring before the Write of a Read-XOR/Write pair is completed. This may be done by treating the Read-XOR/Write operations as though they were Untagged (i.e., letting other tagged operations finish to the disk, performing the Read-XOR/Write, and then dispatching other operations that may have since been queued in the adapter). Alternatively, Ordered tags may be used with the Read-XOR/Write operations to tightly couple Read-XOR/Write pairs.
As already noted, under conditions where write cache 31 has adequate free space (at or below threshold), it may be undesirable, from the standpoint of host response time, to make a Read operation wait behind a Read-XOR/Write operation. It may be preferable under these conditions to force the Read-XOR/Write operations to be prioritized behind the Read operations 420 in order to minimize the response time for host Read commands.
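The difference between the two modes of operation may be illustrated by the order in which an adapter might issue commands to a disk. The function `dispatch_order`, the operation names, and the tag strings are hypothetical. The returned list models issue order only: in either mode the Write of a pair would still be issued only after its Read-XOR has completed, and in the simple-tag mode the device remains free to reorder what it receives.

```python
def dispatch_order(ops, below_threshold):
    """Return the issue order as (command, lba, tag) tuples.

    ops: arrival-ordered list of ("read", lba) for host Reads or
         ("destage", lba) for cache destages needing a Read-XOR/Write pair.
    """
    issued = []
    if below_threshold:
        # Host Reads are reordered ahead of all pending destages.
        for kind, lba in ops:
            if kind == "read":
                issued.append(("READ", lba, "simple"))
        # Each pair is kept together; Ordered tags are one option for this.
        for kind, lba in ops:
            if kind == "destage":
                issued.append(("READ_XOR", lba, "ordered"))
                issued.append(("WRITE", lba, "ordered"))
    else:
        # Above threshold: everything is queued in arrival order with simple tags.
        for kind, lba in ops:
            if kind == "read":
                issued.append(("READ", lba, "simple"))
            else:
                issued.append(("READ_XOR", lba, "simple"))
                issued.append(("WRITE", lba, "simple"))
    return issued


pending = [("destage", 10), ("read", 50), ("destage", 20), ("read", 60)]
low_latency = dispatch_order(pending, below_threshold=True)
throughput = dispatch_order(pending, below_threshold=False)
for entry in low_latency:
    print(entry)
```

In the below-threshold case, both host Reads are issued before either destage, and no other operation falls between a Read-XOR and its Write.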
Allowing the storage adapter 110 to dynamically change the interaction of the disk Read operations and the disk Read-XOR/Write operations may allow the system to both maximize overall throughput and minimize host response time.
The foregoing description discloses only exemplary embodiments of the invention. Modifications of the above-disclosed embodiments of the present invention that fall within the scope of the invention will be readily apparent to those of ordinary skill in the art. For instance, although RAID 5 may be discussed in some embodiments, the system may be applicable to RAID levels 3, 4, 6, 10, 50, 60, etc.
Accordingly, while the present invention has been disclosed in connection with exemplary embodiments thereof, it should be understood that other embodiments may fall within the spirit and scope of the invention as defined by the following claims.