The invention itself, as well as a preferred mode of use, further objects, and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
The present invention provides a method, system and processor for substantially reducing the write penalty (or latency) associated with writes and/or destaging operations within a RAID 5 array and/or RAID 6 array. When a write or destaging operation is initiated, i.e., when modified data is to be evicted from the cache, an existing data selection mechanism first selects the track of data to be evicted from the cache. The data selection mechanism then triggers a data track grouping (DTG) utility, which executes a thread to group data tracks, in order to maximize full stripe writes. Once the DTG algorithm has grouped enough data tracks to complete a full stripe, a full-stripe write is performed, and parity is generated without requiring a read from the disk(s). In this manner, the write penalty is substantially reduced, and the overall write performance of the processor is significantly improved.
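The benefit of the full-stripe write can be illustrated with a brief sketch (the function names and block sizes below are illustrative assumptions, not taken from the invention itself): when every data block of a stripe is present, RAID 5 parity is simply the XOR of the new data, so no old data or old parity needs to be read from disk.

```python
# Illustrative sketch: full-stripe writes avoid the RAID 5 write penalty
# because parity is computed entirely from the new data. All names and
# sizes here are assumptions for demonstration purposes.

def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks into a parity block."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def full_stripe_write(data_blocks):
    """Full-stripe write: parity comes entirely from the new data,
    so no disk reads are required (zero write penalty)."""
    parity = xor_blocks(data_blocks)
    return data_blocks + [parity]   # one write per disk, zero reads

def partial_write_penalty():
    """A sub-stripe (read-modify-write) update instead needs: read old
    data + read old parity + write new data + write new parity."""
    return {"reads": 2, "writes": 2}
```

A two-disk-plus-parity example: writing blocks `b"\x01\x01"` and `b"\x06\x00"` as a full stripe yields parity `b"\x07\x01"` with no reads at all, whereas updating either block alone would cost two extra reads and an extra write.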
In one embodiment, each rank within the cache is broken into sub-ranks, with each sub-rank being assigned a different thread to complete the grouping of data sets within that sub-rank. With this embodiment, each thread performing a grouping at a particular sub-rank is scheduled within a DTG queue of that sub-rank. The DTG algorithm then sequentially provides each scheduled thread with access to a respective, specific sub-rank to complete a grouping of data tracks within that sub-rank. The currently scheduled thread is provided a lock on the particular sub-rank until the thread completes its grouping operations. Then, when a stripe within the sub-rank includes all its data tracks, the data within the stripe is evicted as a full stripe write.
In the following detailed description of illustrative embodiments of the invention, specific illustrative embodiments in which the invention may be practiced are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, architectural, programmatic, mechanical, electrical and other changes may be made without departing from the spirit or scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.
Within the descriptions of the figures, similar elements are provided similar names and reference numerals as those of the previous figure(s). Where a later figure utilizes the element in a different context or with different functionality, the element is provided a different leading numeral representative of the figure number (e.g., 1xx for FIG. 1 and 2xx for FIG. 2).
It is also understood that the use of specific parameter names is for example only and is not meant to imply any limitations on the invention. The invention may thus be implemented with different nomenclature/terminology utilized to describe the parameters herein, without limitation.
With reference now to the figures and in particular with reference to
In the depicted example, disk array 120 includes multiple disks, of which disk 0, disk 1, disk 5, and disk 6 are illustrated. However, more or fewer disks may be included in the disk array within the scope of the present invention. For example, a disk may be added to the disk array, such as disk X in
With reference now to
In the depicted example, storage adapter 210, local area network (LAN) adapter 212, and expansion bus interface 214 are connected to PCI local bus 206 by direct component connection. In contrast, audio adapter 216, graphics adapter 218, and audio/video adapter 219 are connected to PCI local bus 206 by add-in boards inserted into expansion slots. Expansion bus interface 214 provides a connection for a keyboard and mouse adapter 221, modem 222, and additional memory 224. Storage adapter 210 provides a connection for RAID 220, which comprises hard disk drives, such as disk array 120 in
Depending on implementation, RAID 220 in data processing system 200 may be (a) SCSI (“scuzzy”) disks connected via a SCSI controller (not specifically shown), (b) IDE disks connected via an IDE controller (not specifically shown), or (c) iSCSI disks connected via network cards. The disks are configured as one or more RAID 5 disk volumes. In alternate embodiments of the present invention, the RAID 5 controller and cache controller, enhanced by the various features of the invention, are implemented as software programs running on a host processor. Alternatively, the present invention may be implemented as a storage subsystem that serves other computers, in which the host processor is dedicated to the RAID controller and cache controller functions enhanced by the invention.
An operating system runs on processor 202 and is used to coordinate and provide control of various components within data processing system 200 in
Those of ordinary skill in the art will appreciate that the hardware depicted in
Turning now to
Notably, the illustrative embodiment provides the DTG mechanism with the ability to logically divide the larger rank 240 into sub-ranks (e.g., a-d) up to a per-stripe granularity. In an alternate embodiment in which each rank is handled as a single block and provided a single DTG queue (236) (referred to as rank-level grouping, as opposed to the sub-rank-level grouping of the illustrative embodiment), only a single thread is provided to group tracks for destage on the entire rank. With this implementation, the grouping process (thread) takes a lock for the entire rank while grouping data tracks. Of course, a potential problem with this approach is that as rank sizes get larger and larger, all destages for a particular rank must contend for the lock. Thus, while the features of the invention are fully applicable to completing the grouping function at the rank level, this rank-level implementation may potentially result in the locking mechanism itself becoming a bottleneck for destage grouping, and potentially for any destage, for the rank.
Thus, according to the illustrative embodiment, sub-rank-level grouping is provided, and the “grouping in progress” indication, described in greater detail below, refers to a sub-rank granularity for grouping data tracks. For example, the lower-numbered half of the tracks in an example rank could have a separate “grouping in progress” indicator from the higher-numbered half. The embodiment illustrated by
As mentioned above, the granularity of the sub-ranks may be as small as a single stripe. In the single stripe case, there is an array of “grouping in progress” indicators, with each indicator in the array corresponding to a single stripe in the rank, and functioning as the indicator for the grouping process for that stripe. The determination of how granular the indication should be takes into consideration the amount of space used by the array of indicators (index 238) and associated DTG queue 236, as well as the bandwidth of the storage adapter. In general, the illustrative embodiments provide a pre-calculated number of separate indicators to saturate the storage adapter with grouped destages.
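The sizing consideration described above can be sketched as a simple calculation (the parameter names and example numbers below are assumptions, not values from the invention): provision roughly as many "grouping in progress" indicators as there can be concurrent grouped destages before the storage adapter's bandwidth is saturated.

```python
# Illustrative sizing sketch (all names and figures are assumptions):
# each in-flight grouped destage sustains roughly stripe_mb /
# destage_latency_s of bandwidth; provide enough sub-ranks (one
# indicator and one DTG queue each) to cover the adapter's bandwidth.

def indicators_needed(adapter_mb_per_s, stripe_mb, destage_latency_s):
    """Number of concurrent grouped destages (hence indicators) needed
    to saturate the storage adapter, rounded up."""
    per_destage_mb_per_s = stripe_mb / destage_latency_s
    count = -(-adapter_mb_per_s // per_destage_mb_per_s)  # ceiling division
    return int(count)

# Example: an 800 MB/s adapter, 1 MB stripes, 10 ms per destage
# -> 8 concurrently grouped destages keep the adapter busy.
```

Finer granularity than this buys little extra bandwidth while costing more index and queue space, which matches the trade-off described in the text.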
Grouping on/off index 238 is utilized to indicate whether or not a grouping operation is being conducted on a particular sub-rank within rank 240 (cache 208). According to the illustrative embodiment, grouping on/off index 238 is an array with single bits assigned to each sub-rank that may be subject to track grouping operations. A value of “1” indicates that grouping is occurring for the particular sub-rank, while a value of “0” indicates that no grouping operation is being conducted at that sub-rank. As shown, sub-ranks b and c have ongoing grouping operations, and thus have a value of 1 within their respective index, while sub-ranks a and d have a value of 0 within their respective index, indicating that sub-ranks a and d do not have ongoing grouping operations.
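A minimal sketch of the grouping on/off index just described may help fix the idea (the class and method names are illustrative assumptions; only the one-bit-per-sub-rank behavior comes from the text):

```python
# Sketch of grouping on/off index 238: one bit per sub-rank, set to 1
# while a grouping operation is in progress. Class/method names are
# assumptions; the a-d sub-rank labels mirror the example in the text.

class GroupingIndex:
    def __init__(self, sub_ranks):
        self.bits = {sr: 0 for sr in sub_ranks}

    def start_grouping(self, sub_rank):
        self.bits[sub_rank] = 1

    def finish_grouping(self, sub_rank):
        self.bits[sub_rank] = 0

    def in_progress(self, sub_rank):
        return self.bits[sub_rank] == 1

idx = GroupingIndex(["a", "b", "c", "d"])
idx.start_grouping("b")
idx.start_grouping("c")
# Matches the figure: b and c read 1; a and d read 0.
```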
Entries within DTG queue 236 correspond to the thread or threads assigned to perform the grouping of data within a particular sub-rank. As shown, both sub-ranks b and c, which have grouping index values of 1, have at least one thread within their respective queue. Within the figure, queues run horizontally from right to left, with the leftmost position in each queue being the position that holds the currently executing thread. During operation of the invention, this first position also represents the thread with a current lock on the specific sub-rank. This thread with the lock is considered the grouping owner and is the sole thread able to perform grouping on the particular data set while that thread holds the lock. DTG queue 236 may be a FIFO (first-in, first-out) queue, in which each new thread assigned to perform grouping for that sub-rank is scheduled behind the last thread entered within that queue.
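The FIFO discipline of DTG queue 236 can be sketched as follows (the `DTGQueue` class and its method names are illustrative assumptions; the head-of-queue-owns-the-lock behavior is from the text):

```python
# Sketch of a per-sub-rank DTG queue: a FIFO whose head entry holds
# the sub-rank lock and is the sole "grouping owner". Names are
# assumptions, not taken from the invention.

from collections import deque

class DTGQueue:
    def __init__(self):
        self.queue = deque()

    def schedule(self, thread_id):
        """New threads are scheduled behind the last entered thread."""
        self.queue.append(thread_id)

    def owner(self):
        """The head of the queue holds the lock on this sub-rank."""
        return self.queue[0] if self.queue else None

    def release(self):
        """Owner finished grouping: remove it so the next thread in
        FIFO order inherits the lock."""
        return self.queue.popleft()

q = DTGQueue()
q.schedule("T1")
q.schedule("T2")
assert q.owner() == "T1"   # T1 is the grouping owner
q.release()
assert q.owner() == "T2"   # T2 inherits ownership in FIFO order
```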
Notably, DTG queue 236 is shown with a depth of four (4) entries, and multiple threads are illustrated within the queues associated with sub-ranks b and c. However, it is contemplated that alternate embodiments of the invention may be provided in which a single-entry queue is utilized or a different number of entries is provided within DTG queue 236. Also, while four queues are illustrated for the particular rank 240, corresponding to the four sub-ranks, different numbers of queues will be utilized within alternative embodiments in which different numbers of sub-ranks are provided. The number of queues provided is limited only by the total number of stripes within rank 240, as the smallest grouping size is that required for a full stripe write. The granularity provided is thus based on system design, given the added real estate required for maintaining a larger number of queues and associated grouping on/off indices when a rank is logically divided into multiple sub-ranks.
Utilizing the structures within cache controller 230 and cache 208 in a system, such as data processing system 200, write operations and/or destaging operations, by which a data line (stripe) is removed from the cache, are completed with decreased latency. The processes provided when implementing the invention to remove modified data from the cache are depicted within
A major component of the invention is the enhancement of the cache controller to include the above-described components, namely, DTG mechanism 234, DTG queue 236, and grouping on/off index 238. DTG mechanism 234 utilizes the other components within an algorithm that enables the grouping of tracks, in order to maximize full stripe writes. Thus, whenever the processor or cache controller initiates the process for evicting modified data from the cache, the DTG mechanism for grouping data tracks provides the enhancements that result in decreased latency when completing the write (data eviction) operation.
Within cache subsystem 250, several different triggers may cause the eviction of modified data from the cache. These triggers are known in the art, are only tangential to the invention, and thus are not described in detail herein. Once an eviction is triggered, however, the cache controller implements a selection mechanism to determine which data should be evicted from the cache. There are several known selection mechanisms, and the invention is described with specific reference to the Least Recently Used (LRU) algorithm, which is illustrated within
The invention provides a DTG algorithm that is utilized in conjunction with selection mechanisms, such as the LRU algorithm, for determining data tracks to be evicted. During implementation, the processor or cache controller first determines a data track to be evicted, via the existing selection algorithm. Based on this data track, the DTG algorithm then attempts to construct a full stripe of data tracks. According to the invention, the primary rationale for triggering/implementing this grouping of data is to provide full stripes for eviction, since with full stripes, parity is generated without having to perform a read from the disk. There is thus no write penalty, and write performance is significantly improved.
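The grouping idea just described can be sketched briefly (the function name, cache representation, and track numbering below are illustrative assumptions): starting from the LRU-selected track, gather every other cached track belonging to the same stripe so the eviction can go out as a full-stripe write.

```python
# Hedged sketch of grouping around an LRU-selected track. The cache is
# modeled as a set of cached track numbers; real track/stripe mapping
# and data structures would differ. All names here are assumptions.

def group_full_stripe(evict_track, cached_tracks, tracks_per_stripe):
    """Return all cached tracks in evict_track's stripe, plus a flag
    indicating whether the whole stripe is present (full-stripe write)."""
    stripe_id = evict_track // tracks_per_stripe
    first = stripe_id * tracks_per_stripe
    stripe_tracks = [t for t in range(first, first + tracks_per_stripe)
                     if t in cached_tracks]
    full = len(stripe_tracks) == tracks_per_stripe
    return stripe_tracks, full

# With 4 tracks per stripe, tracks 4-7 form stripe 1. If the LRU
# algorithm picks track 5 and 4, 6, 7 are also cached, the whole
# stripe can be destaged with no parity read.
cache = {4, 5, 6, 7, 9}
tracks, full = group_full_stripe(5, cache, 4)
assert full and tracks == [4, 5, 6, 7]
```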
With these assumptions,
If there is no grouping in progress, then the DestageThread is provided with a lock for (i.e., ownership of) the sub-rank to complete the DestageThread's grouping of data, as indicated at block 326. If, however, the indicator indicates that there is a grouping in progress, then a next decision is made at block 320 whether the DestageThread has the lock on (or is the current owner of) the sub-rank for grouping purposes. If the DestageThread is not the current owner when there is a grouping in progress, the DTG utility places the thread in the DTG queue 236 for that sub-rank, as shown at block 322. A periodic check is made at block 324 whether the DestageThread has reached the top of the queue. When the DestageThread reaches the top of the DTG queue 236, the DestageThread is provided the lock for the corresponding sub-rank at block 326. Notably, in a multi-threading environment, the invention provides a plurality of locks for a single rank, when divided into sub-ranks that may be subject to concurrent grouping of tracks within respective sub-ranks.
Returning to decision block 320, when the DestageThread is the current owner of the sub-rank, the DestageThread removes the track from the destage list and locates the stripe for the track, as provided at block 328. Then, for all other tracks in the particular stripe, a check is made at block 330 whether these other tracks are also in the cache. If any of these other tracks are in the cache, these other tracks are added to the stripe for destaging, as shown at block 332.
Once the DestageThread completes the adding of the various tracks for destaging, the DestageThread checks at block 334 whether there are any more tracks to destage. When there are no more tracks to destage, the DTG utility performs the destage process for the particular stripe(s), as shown at block 336.
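The flow through blocks 318-336 described above can be summarized in one hedged sketch (the data structures below are simplified stand-ins chosen for illustration; the block numbers in the comments refer to the figure):

```python
# Simplified rendering of the DestageThread flow. The index is a dict of
# sub-rank -> grouping-in-progress bit, the DTG queue a plain list, and
# the cache a set of track numbers; all are illustrative assumptions.

def destage_thread(track, sub_rank, index, dtg_queue, cached_tracks,
                   tracks_per_stripe):
    if index.get(sub_rank):                 # grouping already in progress?
        dtg_queue.append(track)             # block 322: wait in DTG queue
        return "queued"
    index[sub_rank] = 1                     # block 326: take sub-rank lock
    stripe_id = track // tracks_per_stripe  # block 328: locate the stripe
    first = stripe_id * tracks_per_stripe
    stripe = [t for t in range(first, first + tracks_per_stripe)
              if t in cached_tracks]        # blocks 330-332: add cached tracks
    index[sub_rank] = 0                     # grouping done: release the lock
    return stripe                           # block 336: destage these tracks
```

When the sub-rank is free, the thread locks it, gathers the stripe-mates from the cache, and destages them; when another grouping holds the lock, the thread simply queues behind it, mirroring blocks 320-324.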
Actual performance of the destage process following the compiling (adding) of tracks to the stripe(s) for destaging is provided by
In one implementation, the thread that is provided ownership is actually removed from the queue so that another thread may occupy the pole position within the queue, but that thread is not yet provided the lock and corresponding ownership of the particular sub-rank. Once the thread is removed, the DTG utility then schedules the DestageThread. Finally, the DTG algorithm (via respective threads) destages all the tracks that were added with the destaging processes.
The described invention provides a novel method of grouping tracks for destaging in order to minimize the write penalty in the case of RAID 5 and RAID 6 arrays. The present invention provides a technique to improve performance while destaging data tracks. Specifically, the invention allows one thread a lock on a sub-rank to attempt to group tracks within the sub-rank, without providing any hint as to which tracks may be fully in the cache. The invention enables the data set to be chosen so that the write penalty is minimized. This selective choosing of the data set substantially improves the overall performance of the first and the second methods of completing a write to a RAID 5 or RAID 6 array.
As provided by the claims, the invention generally provides a method, cache subsystem, and data processing system for completing a series of processes during data eviction from a cache to ensure that a smallest write penalty is incurred. The cache controller determines (via existing methods known in the art) when modified data is to be evicted from the cache. The cache controller activates an existing selection mechanism (e.g., LRU) to identify the particular unit of modified data to be evicted. The selection mechanism then triggers a destage thread grouping (DTG) utility, which initiates a data grouping process that groups individual units of data into a larger unit of data that incurs a smallest amount of write penalty when completing a write operation from the cache. The particular data is a member of the larger unit of data along with the other individual units of data, and each of the individual units of data incurs a larger write penalty than the larger unit of data.
The DTG utility completes the grouping process by first generating a thread to perform the data grouping process. The DTG utility then determines whether there is a previous grouping process ongoing for the specific portion of a cache array in which the particular data exists. The determination is completed by checking a respective bit of the grouping on/off indicator that is maintained by the DTG utility within the cache controller. When there is no ongoing grouping process (i.e., the lock is available for that portion of the cache array), the DTG algorithm provides the thread with a lock on the portion of the cache array to complete said grouping process. The thread then initiates the grouping process within that portion of the cache array. The cache controller (via the thread) performs the write operation of the larger unit of data generated from the grouping process. The larger unit of data is written to the storage device. Once the thread completes the grouping process, the thread stops executing and releases the lock on that portion of the cache array.
When the DTG utility determines that a previous grouping process is ongoing for the portion of the cache array in which the particular data exists, however, DTG utility places the thread into a DTG queue 236 corresponding to that portion of the cache array. The DTG utility monitors for when the thread reaches the top of the DTG queue 236. Then, once the previous grouping process has completed and the previous thread releases the lock, the DTG utility provides the thread with the lock on the portion of the cache array, and the thread begins the data grouping process for the particular data.
In addition to the above features, the DTG utility also provides a granular approach to completing the grouping process. Thus, the DTG utility divides the rank of data into a plurality of sub-ranks, where each sub-rank represents the portion of the array that may be targeted by a destage grouping thread. Then, the DTG utility granularly assigns specific destage grouping threads to one or more sub-ranks, each with different ones of the particular data to be evicted. Thus, the DTG utility initiates different grouping processes within the one or more sub-ranks. The DTG utility then performs the grouping process and subsequent write operation on a sub-rank level, and the specific destage grouping thread assigned to a particular sub-rank groups the larger unit of data solely within the particular sub-rank.
As a final matter, it is important to note that while an illustrative embodiment of the present invention has been, and will continue to be, described in the context of a fully functional computer system with installed software, those skilled in the art will appreciate that the software aspects of an illustrative embodiment of the present invention are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the present invention applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution. Examples of signal bearing media include recordable type media such as floppy disks, hard disk drives, CD ROMs, and transmission type media such as digital and analogue communication links.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.