The invention relates generally to computer systems and, more particularly, to computer storage systems and load balancing of storage traffic.
In most computer systems, data is stored in a device such as a hard disk drive. This device is connected to the CPU either by an internal bus or through an external connection such as Serial Attached SCSI or Fibre Channel. In order for a host software application to access stored data, it typically passes commands through a software driver stack (see example in
Software drivers interact with the storage at various levels of abstraction. Different types of storage can be connected without changes to the file system or software application. As commands move up a software driver stack, the representation of the data becomes more and more abstract. Lower layers of the software stack, performing block level I/O, have much more detailed information about the physical layout of the data than do the OS, file system or host application, for example.
Many high performance storage systems use a technology called RAID, which stands for Redundant Array of Independent Disks. RAID technology generally refers to the division of data across multiple hard disk drives. The performance of parity-based RAID is dependent on the types of storage commands issued. Since parity calculations are performed on fixed-sized boundaries, the size and offset of I/O commands can cause wide variations in RAID performance. The performance of parity-based RAID is also dependent on the order of storage commands received and the type of caching in use by the RAID algorithm.
Computer storage systems which communicate using the SCSI Architecture Model (SAM) utilize a set of attributes known collectively as tagged command queuing. With tagged command queuing, each I/O command has a queuing policy attribute that specifies how a target storage device is to order the command for execution. Command tags can specify SIMPLE, ORDERED or HEAD OF QUEUE. I/O commands with the HEAD OF QUEUE task attribute must be started immediately, before any dormant ORDERED or SIMPLE commands are executed. I/O commands with the ORDERED tag must be executed in order, after any I/O commands with the HEAD OF QUEUE attribute but before any I/O commands with the SIMPLE attribute. I/O commands with the SIMPLE task attribute must wait for HEAD OF QUEUE and ORDERED tasks to complete, and may be reordered at the target.
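By way of illustration and not limitation, the following sketch (written in Python, with a hypothetical class name and command representation that do not appear in the SCSI standards) shows how a target might order commands according to these task attributes:

    from collections import deque

    class TaggedQueue:
        """Illustrative ordering of commands by SAM task attribute."""
        def __init__(self):
            self.head_of_queue = deque()   # started before any dormant command
            self.ordered = deque()         # executed strictly in arrival order
            self.simple = []               # may be reordered by the target

        def submit(self, cmd, tag):
            if tag == "HEAD OF QUEUE":
                self.head_of_queue.append(cmd)
            elif tag == "ORDERED":
                self.ordered.append(cmd)
            else:                          # SIMPLE
                self.simple.append(cmd)

        def next_batch(self):
            # HEAD OF QUEUE first, then ORDERED in arrival order, then SIMPLE
            # in any order the target finds efficient (here, sorted by logical
            # block address to mimic elevator-style reordering).
            return (list(self.head_of_queue)
                    + list(self.ordered)
                    + sorted(self.simple, key=lambda c: c["lba"]))

    q = TaggedQueue()
    q.submit({"op": "read", "lba": 500}, "SIMPLE")
    q.submit({"op": "write", "lba": 10}, "ORDERED")
    q.submit({"op": "read", "lba": 20}, "SIMPLE")
    print(q.next_batch())   # ORDERED write first, then the SIMPLE reads sorted by LBA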
The overall latency of an I/O command is dependent on queuing attributes attached to the command. Many I/O commands sent by a computer system to a block-based storage device are issued with the SIMPLE tag, giving the target storage device control over the latency of each I/O command.
Many existing host applications issue large, serialized read and write commands and only have a small number of storage commands outstanding at one time, leaving most of the storage connections underutilized.
Broadly, the invention comprises a system, method and mechanism for dividing file system I/O commands into I/O subcommands. In certain aspects, the size and number of I/O subcommands created is determined based on, or as a function of, a number of factors, including in certain embodiments storage connection characteristics and/or the physical layout of data on target storage devices. In certain aspects, I/O subcommands may be issued concurrently over a plurality of storage connections, decreasing the transit time of each I/O command and resulting in an increase of overall throughput.
In other aspects of the invention, by splitting storage commands into a number of I/O subcommands, a host system can create numerous outstanding commands on each connection, take advantage of the bandwidth of all storage connections, and provide effective management of command latency. Splitting into I/O subcommands may also take advantage of dissimilar connections by creating the precise number of outstanding I/O subcommands for the given connection parameters. Overlapped commands may also be issued, fully utilizing storage command pipelining and data caching technologies in use by many targets.
Algorithms for splitting commands may be based on a number of dynamic factors. Certain aspects of the present invention provide visibility into the entire storage subsystem, and facilities for creating I/O subcommands based on dynamic criteria, such as equipment failures, weighted paths and dynamically adjusted connection speeds.
Certain aspects of the invention comprise criteria for splitting storage commands that can be customized to take advantage of the physical layout of the data on the target storage. The performance of storage commands in a RAID environment can degrade drastically based on a number of factors, such as the size of the storage command, offsets into the physical storage, and the RAID algorithm used. In some aspects of the invention, the creation of I/O subcommands may take these factors into account, resulting in substantially higher system performance. The use of these attributes may be particularly effective when the physical layout of the storage is determined automatically, allowing novice users to optimize the performance of a multipath storage system, for example.
In one aspect, the invention provides a method of processing I/O commands in a computer storage system having a host device capable of issuing I/O commands, a software driver residing on said host device capable of receiving and processing said I/O commands, a plurality of associated storage devices, and a plurality of I/O connections between said host device and said associated storage devices, comprising: receiving an I/O command from a host device which specifies a data transfer between the host and a storage device; determining the amount of data to be transferred; comparing the amount of data to a threshold data size; if said amount of data exceeds the threshold, generating a plurality of I/O subcommands, each comprising a portion of the I/O command; and sending the I/O subcommands concurrently over a plurality of I/O connections.
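By way of illustration and not limitation, the splitting step of this method may be sketched as follows; the 1 MB threshold, the dictionary representation of a command, and the even division across connections are assumptions chosen solely for the example:

    def split_io(command, connections, threshold=1 << 20):
        # `command` is a dict with 'offset' and 'length' in bytes; commands at
        # or below the threshold pass through unchanged.
        if command["length"] <= threshold:
            return [command]
        n = len(connections)                     # one subcommand per connection
        chunk = -(-command["length"] // n)       # ceiling division
        subs = []
        for i in range(n):
            offset = command["offset"] + i * chunk
            length = min(chunk, command["length"] - i * chunk)
            if length > 0:
                subs.append({"offset": offset, "length": length,
                             "connection": connections[i]})
        return subs

    # e.g. an 8 MB command over four connections yields four 2 MB subcommands
    print(split_io({"offset": 0, "length": 8 << 20}, ["A", "B", "C", "D"]))

The subcommands so generated may then be issued concurrently, one per connection.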
Other aspects of the invention include determining the number of outstanding I/O subcommands on the I/O connections, wherein the number of I/O subcommands generated is determined as a function of the number of outstanding I/O subcommands; computing the average time to complete an I/O subcommand on I/O connections, wherein the number or size of I/O subcommands generated is determined as a function of that average time; determining the weighted average of I/O connection throughput, wherein the I/O subcommands are generated as a function of the weighted average; and/or determining the logical characteristics of associated storage devices and determining the number or size of I/O subcommands generated as a function of such logical characteristics.
Another aspect comprises receiving responses from one or more of the I/O subcommands, aggregating those responses into a single aggregated response; and sending a single aggregated response to the requestor or issuer of the initial I/O command. Yet another aspect includes determining dynamic I/O throughput, wherein threshold data size is calculated as a function of the dynamic I/O throughput. Still another aspect comprises measuring the I/O throughput of each I/O connection over time, wherein the size of I/O subcommands generated is determined as a function of the I/O throughput for a corresponding I/O connection and the I/O subcommands generated are of different sizes. In another aspect, the invention includes determining the offset of I/O subcommands from the start of the original I/O command and generating a queuing policy for I/O subcommands as a function of said offset. Alternatively, a queuing policy is generated for I/O subcommands as a function of time; or as a function of logical block addresses of one or more I/O subcommands. Further aspects include determining a logical block address distance between subsequent I/O subcommands, comparing the logical block address distance to a predetermined threshold, and, if the predetermined threshold is exceeded, generating a queuing policy for the I/O subcommands such that they are executed in order. Criteria for generating I/O subcommands may be user configurable through a graphical user interface, configuration files or command line interface. Another aspect of the invention comprises determining the number of I/O connections which are active, issuing a notification each time the number changes, and storing the notifications in host memory; and determining the number or size of I/O subcommands generated as a function of those notifications.
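By way of illustration and not limitation, one possible reading of the logical block address distance rule above is sketched below; the 2048-block threshold and the per-subcommand tagging are assumptions for the example:

    def assign_queue_tags(subcommands, lba_gap_threshold=2048):
        # Each subcommand is a dict with 'lba' and 'blocks'.  A large gap from
        # the previous subcommand forces in-order execution via an ORDERED tag;
        # otherwise a SIMPLE tag leaves the target free to reorder.
        tagged, prev_end = [], None
        for sub in subcommands:
            gap = 0 if prev_end is None else abs(sub["lba"] - prev_end)
            tag = "ORDERED" if gap > lba_gap_threshold else "SIMPLE"
            tagged.append((sub, tag))
            prev_end = sub["lba"] + sub["blocks"]
        return tagged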
In another aspect, the invention provides a method of processing I/O commands in a storage system having a host device capable of issuing I/O commands, a software driver residing on said host device capable of receiving and processing said I/O commands, a plurality of associated storage devices, and a plurality of I/O connections between said host device and said associated storage devices, comprising: receiving an I/O command from a host device; generating a plurality of I/O subcommands, each I/O subcommand comprising a portion of the I/O command; determining the offset of at least one of the I/O subcommands, measured from the start of the original I/O command; generating a queuing policy for the generated I/O subcommands as a function of the offset; and issuing the I/O subcommands concurrently over a plurality of I/O connections in accordance with the queuing policy. The method may include some or all of the following steps: generating a queuing policy for I/O subcommands as a function of time; determining the logical block address of an I/O subcommand, generating a queuing policy for I/O subcommands as a function of the logical block address, and issuing I/O subcommands concurrently over a plurality of I/O connections according to the queuing policy; and/or sending an I/O subcommand using ORDERED tagging to limit the maximum latency of I/O subcommands.
Other aspects of the invention include systems for processing I/O commands in a computer storage system with a host device capable of issuing I/O commands, said host device coupled to a plurality of storage devices via a plurality of I/O connections; and software drivers, host memory driver stack(s), memory, controller(s), storage device(s), disk drive(s), disk drive array(s), RAID array(s), host storage adapters and other component(s) and/or device(s) for performing the foregoing methods and method steps.
Some benefits and advantages which may be provided by the present invention have been described above with regard to specific embodiments. These benefits and advantages, and any elements or limitations that may cause them to occur or to become more pronounced, are not to be construed as critical, required, or essential features of any or all of the claims. Other objects and advantages of the invention may become apparent upon reading the following detailed description and upon reference to the accompanying drawings.
While the invention is subject to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the detailed description. It should be understood, however, that the detailed description is not intended to limit the invention to the particular embodiment which is described. This disclosure is instead intended to cover all modifications, equivalents and alternatives falling within the scope of the present invention.
At the outset, it should be clearly understood that like reference numerals are intended to identify the same parts, elements or portions consistently throughout the several drawing figures, as such parts, elements or portions may be further described or explained by the entire written specification, of which this detailed description is an integral part. The following description of the preferred embodiments of the present invention is exemplary in nature and is not intended to restrict the scope of the present invention, the manner in which the various aspects of the invention may be implemented, or their applications or uses.
Generally, the invention comprises systems and methods for dividing I/O commands into smaller commands (I/O subcommands), after which the I/O subcommands are sent over multiple connections to target storage. In one embodiment, responses to the storage I/O subcommands are received over multiple connections and aggregated before being returned to the requestor. In one aspect, this I/O command division and response aggregation occurs in software within the host software driver stack. The size and number of I/O subcommands are determined in one embodiment based on a set of criteria gathered by the I/O splitting software. Examples of such criteria include, without limitation, the speed and number of connections to the target storage, errors on a target storage connection, the type of storage being accessed, the host application issuing the commands, the file system, and target storage parameters such as the RAID algorithm, the number of drives in use and the RAID interval size.
An exemplary system consists of a CPU communicating with a disk array through a plurality of hardware connections via a host storage adapter (as in the example illustrated in
For example, the system illustrated in
Another embodiment of the invention includes a method or means of keeping count of active connections to the target storage. When a connection to storage changes state between online and offline, the driver software issues a notification that the number of connections has changed. These notifications are stored in a list in host computer memory. The number of entries in this list determines the number and size of I/O subcommands to be generated to satisfy the initial storage command. If a connection is added, removed, or encounters too many errors to be considered for active use, the count can be adjusted. Subsequent large I/O commands will be divided into I/O subcommands using the adjusted number of connections. For example, using the system illustrated in
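A minimal sketch of such a tracking mechanism follows; the class and field names, and the simple rule of one I/O subcommand per active connection, are assumptions for illustration:

    class ConnectionTracker:
        def __init__(self):
            self.notifications = []    # state-change events kept in host memory
            self.active = set()

        def on_state_change(self, conn_id, online):
            # Called by the driver whenever a connection goes online or offline.
            if online:
                self.active.add(conn_id)
            else:
                self.active.discard(conn_id)
            self.notifications.append((conn_id, "online" if online else "offline"))

        def subcommand_count(self):
            # Subsequent large commands are divided using the adjusted count.
            return max(1, len(self.active))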
In another embodiment, the system keeps track of a number of metrics, such as the number of outstanding commands on each connection, average time to complete a command on a particular connection, weighted average of connection throughput, whether the command is a read or write, etc. These metrics are stored in host memory in a metric status table. The number of I/O subcommands generated for a single storage command is determined based on a real-time analysis of the stored metrics and the current state of the system. For example, the system may track the size of the data transfers outstanding on each connection. In a system with four connections as illustrated in
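For illustration only, the fragment below assumes the metric status table records the bytes outstanding on each connection and dispatches each new I/O subcommand to the least-loaded connection; the connection identifiers and byte counts are hypothetical:

    def pick_connection(outstanding_bytes):
        # `outstanding_bytes` maps connection id -> bytes currently in flight.
        return min(outstanding_bytes, key=outstanding_bytes.get)

    metrics = {"A": 4 << 20, "B": 1 << 20, "C": 6 << 20, "D": 2 << 20}
    print(pick_connection(metrics))   # -> "B", the connection with the least data in flight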
Another embodiment of the invention includes a method or means of determining the number of I/O subcommands by applying a weighted formula to the number of active connections to the target storage. The formula generates the number of I/O subcommands needed to satisfy the specified weighting. For example, if two connections exist, but one command is to be sent on connection A for every two commands on connection B, the number of I/O subcommands to be generated from each command will be a multiple of three.
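The arithmetic of this example may be sketched as follows; the weights and the round-up-to-a-multiple rule are illustrative assumptions:

    def weighted_subcommand_count(weights, minimum):
        total = sum(weights.values())          # e.g. A=1, B=2 gives a total of 3
        return -(-minimum // total) * total    # round up to a multiple of the total

    print(weighted_subcommand_count({"A": 1, "B": 2}, minimum=4))   # -> 6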
In some embodiments, the size of the I/O subcommands is determined by attributes of the physical layout of the data on the target storage. There are a number of attributes which may be considered, such as the RAID parity algorithm used, the number of target drives, the RAID interval size, the RAID stripe size and others known to those skilled in the art. The size and number of I/O subcommands can also be determined by a combination of the number of connections, a weighted connection formula, and the physical layout of the target storage. In some cases the physical layout of the data may preclude the splitting of commands, since split commands may force the RAID algorithm to perform extra work to calculate parity. In one embodiment, the physical layout of the data is queried from the target storage by use of SCSI INQUIRY and MODE PAGE requests. The physical layout is then analyzed, and if these cases are detected, the software will avoid splitting the commands.
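By way of illustration and not limitation, layout-aware splitting may be sketched as follows: subcommand boundaries are aligned to the RAID interval reported by the target, and a command that already fits within one interval is left whole so the array is not forced into extra read-modify-write parity work. The 64 KB interval is an assumed value:

    def split_on_raid_intervals(offset, length, interval=64 * 1024):
        if length <= interval:
            return [(offset, length)]        # splitting would only add overhead
        subs, cur, end = [], offset, offset + length
        while cur < end:
            boundary = (cur // interval + 1) * interval   # next interval boundary
            nxt = min(boundary, end)
            subs.append((cur, nxt - cur))
            cur = nxt
        return subs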
Another embodiment contains a means of creating I/O subcommands of different sizes at specific offsets into a single command. These different sized I/O subcommands may be generated based on the number and speed of connections to the storage, a weighted connection formula, attributes of the physical layout of the data on the target storage, or a combination of these factors. The system illustrated in
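Purely as an illustration of different-sized I/O subcommands, the sketch below apportions a single 1 MB command between two connections in proportion to assumed relative speeds of 3:1:

    def split_by_speed(length, speeds):
        # Sizes are proportional to connection speed; the last connection
        # absorbs any rounding remainder.
        total, sizes, assigned = sum(speeds.values()), {}, 0
        for i, (conn, speed) in enumerate(speeds.items()):
            size = length - assigned if i == len(speeds) - 1 else length * speed // total
            sizes[conn] = size
            assigned += size
        return sizes

    print(split_by_speed(1 << 20, {"A": 3, "B": 1}))   # 768 KB on A, 256 KB on B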
Another embodiment comprises a method for manipulating the queuing policy attributes of the I/O subcommands based on characteristics of the original command and/or the target storage. Characteristics of the original command include logical block address, command size and the requested queuing policy attributes, for example. Characteristics of the target storage include, but are not limited to, RAID algorithm, RAID interval size and number of drives in the RAID group. In an example of this embodiment, a host application sends two 8 MB commands using the system illustrated in
Another example of queuing policy manipulation of I/O subcommands is the use of ORDERED tagging to constrain the maximum latency of a group of I/O subcommands. If a number of I/O subcommands are sent using SIMPLE tagging, one of the I/O subcommands may be delayed such that its associated application level command will take a long time to complete. This latency, caused by the RAID engine, may be unacceptable to the host application. Periodically sending a subcommand using ORDERED tagging, irrespective of the subcommand's address, can control overall command latency in the system while still allowing the RAID engine to execute most I/O subcommands by the most efficient means possible.
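A minimal sketch of this latency-bounding technique follows; the interval of one ORDERED subcommand per eight is an assumption chosen for the example:

    def tag_with_latency_bound(subcommands, every_n=8):
        # Every Nth subcommand is sent ORDERED, acting as a barrier that bounds
        # how long earlier SIMPLE subcommands can remain outstanding; the rest
        # stay SIMPLE so the target can still reorder them efficiently.
        return [(sub, "ORDERED" if (i + 1) % every_n == 0 else "SIMPLE")
                for i, sub in enumerate(subcommands)]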
In some aspects of this embodiment, connections to the storage are designated as read-only or write-only connections. The number and size of I/O subcommands generated for a storage command may be based on the number of available read-only or write-only connections. For example,
Further, a weighting formula can be specified by the user through configuration files, driver registry files, or a graphical user interface (GUI). The specified weighting formula is used to generate different numbers of I/O subcommands based on a ratio of read commands to write commands, or of read bandwidth to write bandwidth, used per storage connection. In
In one aspect of this embodiment, the criteria for dividing storage commands into I/O subcommands are configured manually via user input such as a graphical user interface, configuration files, or a command line interface. Manually configured command division criteria, such as the physical layout of the data, the parity algorithm used, weighting, and the number of connections, may reside on the host system and be combined with the dynamic status of the system to determine the size and number of I/O subcommands to be generated.
In other embodiments, some or all of the criteria for dividing storage commands may be automatically configured by host software. Automatic configuration can take place by querying the host system for the number and speeds of connections, querying the storage for the attributes of the physical layout and monitoring connections for parameters such as connection throughput, number of errors on a connection and connection failure.
While there has been described what is believed to be the preferred embodiment of the present invention, those skilled in the art will recognize that other and further changes and modifications may be made thereto without departing from the spirit or scope of the invention. Therefore, the invention is not limited to the specific details and representative embodiments shown and described herein and may be embodied in other specific forms. The present embodiments are therefore to be considered as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes, alternatives, modifications and embodiments which come within the meaning and range of the equivalency of the claims are therefore intended to be embraced therein. In addition, the terminology and phraseology used herein is for purposes of description and should not be regarded as limiting.
The present application claims priority to U.S. Provisional Patent Application No. 61/191,856, filed Sep. 12, 2008.