The invention relates generally to storage systems using direct access storage devices, relates more particularly to storage systems using SAS (serial attached SCSI) wide-port interfaces, and relates most particularly to the difficult problem of working out how to keep all of the paths within the SAS wide-port interface busy as much of the time as possible, while avoiding adding unneeded complexity to storage adapter firmware and permitting load allocation among SAS engines.
In more recent times, makers of storage systems and components of storage systems have developed SAS (serial attached SCSI) where SCSI means “small computer system interface”. SAS links permit data transfers of very high bandwidth as compared with older SCSI links. SAS links permit addressing of more devices as compared with older SCSI links. SAS cabling is narrower and thus does not interfere with circulation of cooling air as much as older SCSI cabling. The SAS arrangement of
SAS offers the designer an opportunity to set up “narrow” SAS ports and “wide” SAS ports. A wide SAS port can contain “wide” SAS links, while a narrow SAS port cannot contain a wide SAS link. A wide SAS link is able to contain more than one physical link. In a storage system, each physical link is driven by an “SAS engine.” From the point of view of maximizing the performance of the storage system, it is desirable to keep all of the physical links within the wide SAS port busy as much of the time as possible—or stated differently, it is desirable not to hold back unnecessarily any traffic from the wide port. From the point of view of the system designer this goal reduces to the goal of trying to keep each of several SAS engines busy as much of the time as possible.
For many users of storage systems, it is important to achieve high performance as well as high reliability, and to this end many storage systems use RAID (redundant array of inexpensive disks). With RAID it is possible to reduce greatly the risk of loss of data even in the event of loss of one disk drive, and depending on the level of RAID employed it can even be possible to minimize the loss of data in the event of loss of two disk drives. RAID striping can also improve the performance of the storage system as compared with a system that does not use striping.
While RAID offers these and other benefits, RAID imposes burdens on the designer of the part of the system that interfaces with the storage devices (e.g. disks). There are many “write” transactions that involve writing to two or more drives at a time, and there may well be “read” transactions interleaved with the “write” transactions. Each “write” transaction may be composed of a number of smaller operations, and there are many situations where the operations need to be executed in strict sequence.
With some DASD link types that predate SAS, the order of execution of operations is automatically (that is, necessarily) preserved and nothing about the link presents any risk of operations being executed out of order. With SAS, however, there can be several SAS engines communicating with any of a (possibly) large number of SAS devices, and the unwary designer could end up with a situation where the system design would permit operations (perhaps passed to distinct SAS engines) to get executed out of order. Such out-of-order execution would inexorably lead to corruption of stored data and a system that permitted such out-of-order execution would be wholly unacceptable to any actual user.
Today, maintaining and guaranteeing the order of execution of commands to SAS Devices is implemented using an OSQ (operation sequence queue) with a single headpointer and tailpointer. This may be analogized with a FIFO (first-in first-out) stack. Each new entry is added to the stack at the “head” and the headpointer is “bumped” or moved to show that the new entry is at the headpointer. Each entry that is deleted from the stack is deleted at the “tail” and the tailpointer is bumped to show that the entry that used to be second from the tail is now at the tail. In a RAID system there will be RAID firmware which has the task of making RAID happen (receiving data that is to be written, calculating parity or redundancy values, breaking up the to-be-stored information into the stripe or stripes to which it is to be stored, and issuing commands or operations to make the storage happen. The RAID firmware likewise has the task of rebuilding a drive in the event of drive failure, drawing upon redundant information in the other drives of the system, and in doing so it again issues operations to make the rebuilding happen. (It will be appreciated that while this discussion often uses the term “drive”, the benefits of the invention offer themselves for systems composed of any of a variety of types of storage devices.) It is thus the firmware that issues commands or operations, and that inserts them into the OSQ, generally at the “head” end. Each time firmware inserts another operation into the OSQ it bumps the headpointer accordingly.
It is important that the firmware be able to know where the tailpointer is as well. The OSQ is typically treated as a wraparound memory (e.g. a ring or loop) so that the head “chases the tail.” In an OSQ with space for 256 entries (which is typical) then firmware will know that the OSQ is full (and thus will know not to store any new entries) by noticing that the headpointer has reached the tailpointer. (Hopefully in a well dimensioned system the OSQ will rarely or never always get full, and there will always be at least a few unused entries between the headpointer and the tailpointer.) In
In a typical system there is logic (306 in
The headpointer is incremented by hardware after a new command is added to the “Operation Sequence Queue” (OSQ), whereas the tailpointer is incremented after the command is transmitted for the current OSQ entry.
This prior-art approach employing a single headpointer and a single tailpointer serves the desired purposes discussed above in the special case of a system having a single SAS engine. Part of why this prior-art approach does not come to ruin is that because there is only a single engine, operations cannot get out of order. Also in the case where there is only a single engine there is only one pipeline to keep full, or to put it differently, the question of keeping each of several pipelines busy all the time does not even arise.
But when the system designer departs from a single-engine system and chooses to use an SAS wide-port design, the use of a single headpointer and single tailpointer turns out to be far from ideal. With a single headpointer and single tailpointer, commands are executed in a single-threaded fashion (rather than multi-threaded) and this leads to poor performance—some of the pipelines in the wide port will go idle while another pipeline in the wide port is busy.
Faced with the problem of trying to get close to the maximum potential bandwidth out of the wide port (that is, to minimize how often any one or more of the pipelines is idle or less than fully used), system designers have tried two approaches.
A first approach for keeping multiple SAS engines busy. A first approach is to have as many dedicated OSQs as there are SAS engines. Each OSQ would contain operations to be performed by its respective SAS engine, and in this way each SAS engine could be kept busy, thereby maximizing the benefit derived from the fact of the SAS port being “wide”. In this first approach, each OSQ array would then employ the above mentioned single head/tail pointer implementation.
One drawback to this approach, however, is that this approach dramatically reduces and limits the number of operations that can be outstanding at any given time for each SAS engine. For example, a 256-entry OSQ serving four engines will need to be split into four OSQs with 64 entries each.
A second drawback to this approach is that depending on the workload, it might be desired to be able sometimes to queue more than (say) 64 commands to a particular SAS engine, yet it would not be possible to do so, as the rest of the OSQ space is dedicated to various other SAS engines.
A second approach for keeping multiple SAS engines busy. The other known approach is to have a single headpointer and multiple tailpointers, with a respective tailpointer dedicated to each engine. With this approach, logic 306 (
A chief drawback to this approach is that it creates added burden and complication whenever firmware decides to manipulate the queue. Stated differently, the interface between the OSQ and the RAID firmware would need to be much more complicated than it would normally need to be, to be able to cope with more than one (e.g. four) tailpointers.
There is thus a great need for an approach which would preserve order of execution of operations, and which would minimize loss of bandwidth on an SAS wide port, all without adding to the complexity of the pre-existing interface with RAID firmware. It would be desirable as well if the approach could be simple and readily accomplished in hardware logic. It would be desirable if the approach, in addition to satisfying these demands, could also perform “lookaheads” to work on entries, maximizing the number of operations per engine.
In an operation sequence queue, a single headpointer is used with two tailpointers to identify operations to be passed to SAS engines in a wide-port environment. The order of execution of commands is preserved despite being performed in an SAS wide-port RAID controller having multiple SAS engines.
The invention will be described with respect to a drawing in several figures.
Turning first to
Returning to
The invention provides a new queuing service algorithm called the Ripple Queuing Service that guarantees command sequencing and that simultaneously services multiple SAS engines. It will be seen that this new algorithm reduces the hardware complexity and gives good performance in multi-engine designs, as compared with other past approaches. In its essence, the Ripple Queuing Service addresses the starving of engines by looking ahead and finding a valid entry on the queue to execute in parallel on different engines. It adapts a single headpointer (401 in
Let us assume we have a list of entries queued in the OSQ 305 that is 256 entries in size as depicted in
How
How
The discussion now turns briefly away from prior-art approaches and returns to the invention.
Given the particular entries that were arbitrarily chosen for
How
Returning now to the invention, the inventive approach using two tailpointers will be described in considerable detail.
Returning to
Returning to
Returning to
Logic 306 next obtains the engine address of the second entry (entry 7,x) i.e. path number 7, and it concludes that this entry, too, belongs to SAS device X. At this point in time, if logic 306 were to dequeue this entry (that is, if logic 306 were to extract the entry from the OSQ 305 and insert it as a valid task into buffer 307) the result would be a failure to guarantee the sequenced order of operation, namely that the (6,x) task is supposed to be executed prior to the (7,x) task. This means that logic 306, in keeping with the invention, will not extract the (7,x) task from the OSQ 305 and will not place it into the buffer. Rather, the Ripple Queuing Service increments only the tailpointer_1 as seen in
Now, the engine address of the entry that the tailpointer_1 is pointing to i.e. path number 11 (task 11,x) is obtained by logic 306 and overwritten on buffer space 307 as a candidate task that may or may not be given to an SAS engine for execution. Logic 306 then carefully inspects the buffer 307 to see whether any SAS engine is presently carrying out any task with respect to address X. There are two possible outcomes.
Outcome number 1. Assuming that the condition of the buffer 307 continues to be in the state shown in
Outcome number 2. The other possibility could be that when logic 306 checks for the validity of this third entry (that is, checks to see whether buffer 307 shows any SAS engine to be carrying out some operation on the address X), logic 306 may find that the valid bit for the first entry (6,x) is no longer valid (is deasserted). This would mean that the SAS engine that had been working on the (6,x) task had finished its work and had cleared the “valid” bit in the buffer 307. In that case the logic 306 would move the tailpointer 403 back to the same place as the tailpointer 402 so as to start all over again with looking for the oldest operation not yet executed. But that is not what happens in the present example.
In the present example, we assume that the valid bit in the buffer 307 for the first entry (6,x) is still valid. Stated differently, this means that the SAS engine that is working on the task (6,x) is not yet done performing the task. In this case, logic 306 we again increment the tailpointer_1403 as seen in
It will be appreciated that at this point, three of the four SAS engines is idle and has been idle since the queuing began. Only one SAS engine, the one that is working on the (6,x) task, is doing anything. Most of the bandwidth of the wide port (carrying SAS wide link 104 in
Now, the engine address of the entry that the tailpointer_1 is pointing to i.e. path number 4 (operation 4,y) is obtained by logic 306 and is and overwritten on buffer space 307 as a candidate task that might be given to an SAS engine. This time when logic 306 does the comparison of buffer space 307 with all the other addresses we see that it is exclusive. Stated differently, the SAS address Y does not appear anywhere else in the buffer 307 (the only other address in the buffer 307 is the x of operation (6,x) as shown in
Therefore, logic 306 reflects this operation (4,y) as depicted in
It will be appreciated, however, that the removal of the entry (4,y) is not done by bumping the tailpointer 0402 (as would be done in a simple FIFO approach) for the simple reason that the entry (4,y) is not the “last” or “oldest” entry.
In accordance with this embodiment of the invention, logic 306 proceeds as will now be described. The entry that has been moved to buffer 307 (namely the 4,y entry) is moved in the OSQ to a place that is after the recent valid entry i.e. 6,x. The logic 306 must then bring about a ripple-effect move for the remaining entries i.e. path numbers 7,x and 11,x. Logic 306 then increments tailpointer_0 and equates tailpointer_1 to tailpointer_0 which is seen in
Now, the process of finding the next valid entry for a different SAS Device follows the above illustrated process.
Eventually, all of the buffer spaces in buffer 307 will get filled, which means that all of the SAS engines 308 will have work to do. At that point the wide link 104 of the wide port will get put fully to use and the maximum theoretical bandwidth will be obtained from the link.
It is instructive, the above discussion having taken place, to elaborate upon the interface between this queuing system (OSQ 305, logic 306, and buffer 307) and the firmware of microprocessor 303 and firmware 304.
In this embodiment, firmware 304 is given the flexibility to manipulate the queue (OSQ 305) to add or remove entries. Of course such manipulation cannot be permitted to take place at the same time that logic 306 is manipulating the OSQ 305. To manipulate the Queue (OSQ 305), firmware 304 sets a bit in a register in hardware, which halts logic 306. By “halting logic 306” is meant that the above process of finding the next valid entry in the queue (discussed in connection with
Problems Solved by the Invention
From the discussion above, it may be seen that the invention solves several problems.
Single-threading of operations in a SAS wide-port design resulting in poor performance. The simplest way in which a system designer might try to ensure the preservation of the correct order of execution of operations is, as mentioned above, to single-thread the operations. This does indeed preserve the order of operations, but loses any prospect of gaining the entire bandwidth which a wide port could offer, and in fact loses most of that bandwidth.
Limitation of number of operations for each SAS engine. Yet another way of preserving operation order, as mentioned above, might have been to set up a distinct OSQ for each SAS engine. This means, however, that each OSQ is perhaps one-fourth as big. That limits the number of operations that can be queued up for the engine corresponding to a queue.
Difficulty of manipulation and control of the “Operation Sequence Queue” by the Firmware. If multiple tailpointers are used (one for each of the SAS engines) then the firmware interface to the OSQ is much more complicated than usual.
In summary, “Ripple Queuing Service” (RQS) employs two different tail pointers that enables SAS wide ports to execute multiple commands simultaneously, while nonetheless maintaining the order of the commands. Also, the operation completion rate of the SAS RAID adapter is enhanced as compared with a single-threaded approach.
The approach according to the invention supports the dynamic swapping of entries. This feature allows the queue to execute valid entries (namely entries destined for different engines) in parallel. This idea promotes the execution of more operations in a single-queue design without having to wait for the current entry to finish its operations before dequeuing the next entry.
This invention allows providing firmware a dynamic yet simple interface to work with. The firmware will have access only to the headpointer and the first tailpointer along with a ‘freeze’ feature to manipulate the contents as deemed necessary. It will be appreciated that the firmware does not need to have any access to (or knowledge of) the second tailpointer 403.
It will be appreciated that when firmware clears the freeze bit, the logic 306 must reset the tail 1403 to match the new tail 0402. Stated differently, logic 306 cannot be permitted to rely upon anything about the tail 1403 after a freeze has come and gone.
It will be appreciated that those skilled in the art will have no difficulty at all in devising myriad obvious improvements and variants of the embodiments disclosed here, all of which are intended to be embraced by the claims which follow.
This application is a continuation of international application number PCT/IB2005/053220, filed Sep. 30, 2005, designating the United States, which application is incorporated herein by reference for all purposes. International application number PCT/IB2005/053220 claims priority from U.S. application No. 60/595,681, filed Jul. 27, 2005, which application is incorporated herein by reference for all purposes.
Number | Date | Country | |
---|---|---|---|
60595681 | Jul 2005 | US |
Number | Date | Country | |
---|---|---|---|
Parent | PCT/IB05/53220 | Sep 2005 | US |
Child | 11163348 | Oct 2005 | US |