1. Field of the Invention
This invention relates generally to redundant arrays of independent disks (RAIDs) and particularly to RAIDs made of solid state disks.
2. Description of the Prior Art
Large-capacity storage is commonly employed for a variety of purposes, for example on-line transactions and searches. A redundant array of independent disks (RAID), as its name suggests, provides storage of large capacity with redundancy.
In some applications, solid state disks (SSDs) are grouped together to create a RAID group within a storage system that may support many RAID groups. Initially, a predetermined number of RAID groups is placed into a storage system and, at a later time, additional RAID groups may be added to expand storage capacity and/or increase throughput.
SSDs are typically costly. Further, they dissipate heat thereby affecting power consumption and management. Throughput is another factor in storage systems employing SSDs. To better describe the foregoing, a RAID group with a multitude of SSDs is placed in a storage system for maintaining large quantities of data. Additional space is typically made available for increasing storage capacity, as needed by a user, by adding more RAID groups. Each RAID group operates at a certain throughput and standard. For example, different generations of Peripheral Component Interconnect Express (PCIe)-compliant RAID groups may be employed for various applications. Further, different throughput rates may be required per RAID group. However, these requirements are typically limited to the RAID group's particular capability. That is, a RAID group that is only built to function as a GEN 2 RAID group cannot be made to operate as a GEN 3 RAID group. Similarly, a RAID group that is built to function at a certain speed cannot be employed to function at a higher speed.
Power consumption typically depends on throughput in that the higher the rate at which a RAID group operates, the higher its power consumption.
Currently, there is no mechanism for optimizing employment of RAID groups within a storage system. More specifically, cost, throughput, and power management are issues facing users of storage systems employing RAID groups.
Thus, there is a need for a storage system using RAIDs to have near optimal throughput and power management while reducing cost.
Briefly, a method of managing a plurality of redundant array of independent disks (RAID) groups in a storage system includes determining wear of each of the RAID groups, computing a weight for each of the RAID groups based on the wear, and striping data across at least one of the RAID groups based on the weight of each of the RAID groups.
These and other objects and advantages of the invention will no doubt become apparent to those skilled in the art after having read the following detailed description of the various embodiments illustrated in the several figures of the drawing.
In the following description of the embodiments, reference is made to the accompanying drawings that form a part hereof, and in which are shown, by way of illustration, specific embodiments in which the invention may be practiced. It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the invention. It should be noted that the figures discussed herein are not drawn to scale and thicknesses of lines are not indicative of actual sizes.
Referring now to
In the embodiment of
PCIe, or PCI Express, is a high-speed serial bus standard designed for high-throughput systems with a lower input/output (IO) pin count and better throughput scaling. The PCIe link between two devices can currently consist of anywhere from 1 to 32 lanes. Throughput of a PCIe-based system scales with overall link width. The link width, that is, the number of PCIe lanes between two connected devices, is automatically negotiated during device initialization to the highest mutually supported lane count and PCIe generation, and can be restricted by either device. The PCIe standard allows devices to have anywhere from 1 lane, for cost-sensitive applications with lower throughput, to 32 lanes, for throughput-critical applications. PCIe 3.0 is the latest standard in production, with PCIe 2.0 and 1.1 still being widely employed. The data transfer rate per lane for PCIe 1.1, PCIe 2.0, and PCIe 3.0 is 2.5 gigatransfers per second (GT/s), 5 GT/s, and 8 GT/s, respectively. Throughput of a PCIe device with 8 lanes of PCIe 1.1, PCIe 2.0, or PCIe 3.0 is approximately 2,000 megabytes per second (MB/s), 4,000 MB/s, or 8,000 MB/s, respectively.
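By way of illustration only, the 8-lane figures above can be reproduced from the per-lane transfer rates once line encoding is taken into account. The following is a minimal sketch, not part of the described embodiments; the table name, the function name, and the inclusion of encoding efficiency (8b/10b for PCIe 1.1 and 2.0, 128b/130b for PCIe 3.0) are assumptions made for the example. With 128b/130b encoding included, the PCIe 3.0 result comes out slightly below the rounded 8,000 MB/s figure.

```python
# Per-lane signalling rate in GT/s and line-encoding efficiency per generation.
# The table layout and names are illustrative assumptions.
PCIE_GENERATIONS = {
    "1.1": (2.5, 8 / 10),     # 8b/10b encoding
    "2.0": (5.0, 8 / 10),     # 8b/10b encoding
    "3.0": (8.0, 128 / 130),  # 128b/130b encoding
}


def link_throughput_mb_s(generation: str, lanes: int) -> float:
    """Approximate one-direction bandwidth of a PCIe link in MB/s."""
    if not 1 <= lanes <= 32:
        raise ValueError("a PCIe link has between 1 and 32 lanes")
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    # GT/s * efficiency gives usable Gbit/s per lane;
    # dividing by 8 gives GB/s, multiplying by 1000 gives MB/s.
    return rate_gt_s * efficiency / 8 * 1000 * lanes


for gen in ("1.1", "2.0", "3.0"):
    print(gen, "x8:", round(link_throughput_mb_s(gen, 8)), "MB/s")
```

Protocol overhead above the line encoding (packet headers, flow control) is ignored here, so real devices would be expected to deliver somewhat less.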
The storage processor 10 is shown to include a CPU subsystem 14, a PCIe switch 16, a network interface card (NIC) 18, and memory 20. The memory 20 is shown to maintain RAID group configuration information 40 and self-monitoring analysis and reporting technology (SMART) attributes 24. The storage processor 10 is further shown to include an interface 34 and an interface 32. The RAID group configuration 40 is information regarding characteristics of the RAID groups of the storage pool 26. This information includes the generation type of the RAID group, the rate at which the RAID group is capable of operating, and the PCIe lanes the RAID group can support, in addition to other types of information. The RAID group configuration 40 also includes information regarding the current status of the RAID groups, for example, the rate at which a RAID group is currently operating, the current generation of the RAID group, the PCIe lanes currently being used by a RAID group, and the like.
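As a hedged illustration of how the RAID group configuration 40 might be organized, the sketch below keeps a capability record and a current-status record per RAID group. All class and field names are hypothetical and chosen for the example; the source does not specify a layout.

```python
from dataclasses import dataclass


@dataclass
class LinkSetting:
    generation: str  # PCIe generation, e.g. "1.1", "2.0", "3.0"
    lanes: int       # number of PCIe lanes, e.g. 1, 2, 4, 8


@dataclass
class RaidGroupConfig:
    group_id: int
    capability: LinkSetting  # highest generation and lane count supported
    current: LinkSetting     # generation and lane count in use now

    def at_full_capability(self) -> bool:
        """True when the group already runs at its highest supported setting."""
        return self.current == self.capability


# Example: a group built for PCIe 3.0 x8 that currently runs at PCIe 2.0 x4.
group = RaidGroupConfig(232, LinkSetting("3.0", 8), LinkSetting("2.0", 4))
print(group.at_full_capability())
```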
Referring still to
In an embodiment of the invention, part or all of the memory 20 is volatile, such as, without limitation, dynamic random access memory (DRAM). In other embodiments, part or all of the memory 20 is non-volatile, such as, without limitation, flash, magnetic random access memory (MRAM), spin transfer torque magnetic random access memory (STTMRAM), resistive random access memory (RRAM), or phase change memory (PCM). In still other embodiments, the memory 20 is made of both volatile and non-volatile memory.
The storage system 8 comprises one or more RAID groups 36 through 38. A RAID group uses multiple disks that appear as a single device to a user who may wish to increase storage capacity, improve overall throughput, and provide fault tolerance. The storage system 8 is operable with as few as one RAID group. Additional RAID groups may be added later, when the existing RAID groups in the system are maximally utilized and additional capacity is required. When additional RAID groups are added to the storage system 8, the throughput of the storage system 8 increases substantially since there are now additional SSDs across which data can be stored. The process of saving segments of data across a number of SSDs is typically referred to as striping.
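Striping can be sketched as follows. This is a simplified round-robin example without parity; the function name, the segment size, and the in-memory representation of the SSDs are assumptions made for illustration.

```python
def stripe(data: bytes, num_ssds: int, segment_size: int) -> list:
    """Split data into fixed-size segments and deal them out round-robin.

    Returns one list of segments per SSD.
    """
    if num_ssds < 1 or segment_size < 1:
        raise ValueError("num_ssds and segment_size must be positive")
    ssds = [[] for _ in range(num_ssds)]
    for offset in range(0, len(data), segment_size):
        segment_index = offset // segment_size
        ssds[segment_index % num_ssds].append(data[offset:offset + segment_size])
    return ssds


# Eight bytes in 2-byte segments over three SSDs: the fourth segment wraps
# around to the first SSD.
layout = stripe(b"abcdefgh", num_ssds=3, segment_size=2)
print(layout)
```

With more SSDs available, consecutive segments land on different devices and can be written concurrently, which is why adding RAID groups raises throughput.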
Storage system 8 may employ different RAID architectures depending on the desired balance between throughput and fault tolerance. These architectures are called “levels”. Level 0, for example, is a striped disk array without fault tolerance which indicates that the SSDs do not use parity. Level 4 is a striped disk array with SSDs having dedicated parity and level 5 is a striped disk array with distributed parity across the SSDs. Level 6 is similar to level 5 with the exception of having double parity distributed across the SSDs.
During operation, the host 12 issues a read or a write command, along with data in the case of the latter. Information from the host is normally transferred between the host 12 and the processor 10 through the interfaces 32 and/or 34. For example, information is transferred through the interface 34 between the processor 10 and the NIC 18. Information between the host 12 and the PCIe switch 16 is transferred using the interface 34 and under the direction of the CPU subsystem 14.
In the case where data is to be stored, i.e., a write operation is performed, the CPU subsystem 14 receives the write command and accompanying data from the host 12, through the PCIe switch 16, for storage in the storage pool 26.
Under the direction of the CPU subsystem 14, the received data is eventually saved in the memory 20. The storage processor 10 or the CPU subsystem 14 then stripes the data across the SSDs 28 of RAID groups 36 through 38. The throughput of the storage system 8 depends, at least in part, on the number of SSDs in the system and hence on the number of RAID groups. As RAID groups are added to the storage system 8, the throughput of the storage system 8 also increases because the storage processor 10 can stripe data across more SSDs. A storage system with only one RAID group will most likely have half the throughput of a storage system with two RAID groups if all of the RAID groups are configured the same.
In order to increase the throughput of a partially populated storage system, the populated RAID groups have to operate at a higher throughput to compensate for the missing RAID groups. This requires the RAID groups to be configurable to operate at different throughputs based on the number of RAID groups in the storage system.
Referring now to
As mentioned earlier, even though the storage system 8 supports a plurality of RAID groups, it is operable with one or more RAID groups. New RAID groups are added to existing RAID groups when required. When, during an exemplary operation, the storage system 8 employs fewer than a certain number of RAID groups, its throughput is not at its optimum since there are not enough RAID groups and, thus, there is a shortage of SSDs across which to stripe data. To increase the throughput of the storage system 8 with partially populated RAID groups, the SSDs of the RAID groups have to be configurable to provide different throughput levels. When the storage system 8 is not fully populated, the SSDs of the RAID groups are configured to operate at a higher throughput and when additional RAID groups are added, the RAID groups can be reconfigured to operate at a lower throughput. The throughput of the SSDs therefore depends on the number of RAID groups in the storage system.
In most storage systems, the throughput of the system mostly depends on the number of RAID groups in the storage system and the number of SSDs within a RAID group. The throughput of the storage system 8 scales up with the number of RAID groups up to a certain point and saturates thereafter. It is desirable to operate the SSDs at a higher level of throughput when the storage system is not fully populated such that the storage system can provide close to its highest throughput. However, it is not desirable to operate the SSDs at a higher throughput when the higher throughput does not contribute to overall throughput of the storage system 8. A factor that is taken into account is that the SSDs 28 consume more power and dissipate more heat when they operate at the higher throughput.
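The scaling-then-saturation behavior described above can be modeled, under simplifying assumptions, as a linear sum capped by a system limit. The sketch below is illustrative only; the function names and the example numbers are hypothetical, and all RAID groups are assumed to be configured identically.

```python
def system_throughput(num_groups: int, per_group_mb_s: float,
                      system_max_mb_s: float) -> float:
    """System throughput scales with the group count until it saturates."""
    return min(num_groups * per_group_mb_s, system_max_mb_s)


def per_group_target(num_groups: int, system_max_mb_s: float,
                     group_max_mb_s: float) -> float:
    """Lowest per-group throughput that still reaches the system limit,
    capped by what a single group is capable of."""
    if num_groups < 1:
        raise ValueError("at least one RAID group is required")
    return min(system_max_mb_s / num_groups, group_max_mb_s)


# Hypothetical system that saturates at 8,000 MB/s with groups capable of
# up to 4,000 MB/s each.
for n in (1, 2, 4):
    target = per_group_target(n, 8000, 4000)
    print(n, "group(s):", target, "MB/s each,",
          system_throughput(n, target, 8000), "MB/s total")
```

In this model, running four groups faster than 2,000 MB/s each adds power and heat without adding system throughput, which is the situation the text advises against.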
Referring back to
In some embodiments, each of the RAID interfaces is an aggregate of all of the PCIe lanes of a RAID group. In storage systems requiring higher throughput, typically, an additional number of PCIe lanes is employed, for example 4-lane (X4) or 8-lane (X8), and/or a higher PCIe generation is employed, such as PCIe 2.0 or PCIe 3.0. This requires at least some of the RAID groups in the storage system 8 to be configurable and to have the means to support higher throughput than the remaining RAID groups.
When a RAID group; such as RAID group 234 in
Accordingly, not all of the individual SSDs in a RAID group need operate at their maximum throughput to provide the throughput that the storage system 8 requires. Each SSD in the RAID group can operate at a lower throughput and the storage system 8 will still deliver the requisite throughput.
Operating each SSD in a RAID group at its maximum throughput will also unnecessarily generate more heat, which has to be dissipated by the storage system 8, without contributing to the throughput of the storage system. Most storage systems are designed to dissipate a predetermined amount of heat when fully configured, meaning when employing a maximum number of RAID groups and delivering a certain throughput.
In an exemplary method of the invention, a predetermined amount of heat dissipation is allocated per SSD 28 when the storage system 8 is fully populated. In a storage system that is not fully populated, it is acceptable and desirable to operate the SSDs in the RAID groups at a higher throughput than would be the case if the storage system were fully populated. More specifically, each SSD can be allowed to consume more energy and generate more heat, since the heat dissipation mechanism of the storage system is generally designed for a fully populated storage system.
As the number of RAID groups in the storage system 8 increases and approaches the maximum number of the RAID groups the system can support, operating the SSDs in the RAID groups at their highest throughput will not equate to a higher system throughput. The extra heat that is generated will make the storage system 8 overheat thereby preventing proper function of the storage system. The storage processor 10 reconfigures the throughput of SSDs in each RAID group based on predetermined system throughput requirements, heat dissipation for which the storage system 8 is designed, and the number of RAID groups in the storage system.
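One way to picture the reconfiguration decision is as a selection among candidate settings, given the three inputs named above: the required system throughput, the heat budget, and the number of RAID groups. The sketch below is an assumption-laden illustration, not the claimed method; the selection policy (coolest setting that meets the requirement, otherwise the fastest setting within budget) and the throughput and heat figures are invented for the example.

```python
from typing import NamedTuple


class Setting(NamedTuple):
    name: str
    throughput_mb_s: float  # per RAID group
    heat_w: float           # per RAID group (made-up figures)


def choose_setting(settings, num_groups, required_mb_s, heat_budget_w):
    """Pick one setting to apply to every RAID group in the system."""
    # Keep only settings whose total heat stays within the system budget.
    feasible = [s for s in settings if s.heat_w * num_groups <= heat_budget_w]
    if not feasible:
        raise ValueError("no setting fits the heat budget")
    # Prefer the coolest setting that still meets the throughput requirement.
    meeting = [s for s in feasible
               if s.throughput_mb_s * num_groups >= required_mb_s]
    if meeting:
        return min(meeting, key=lambda s: s.heat_w)
    # Otherwise fall back to the fastest setting the budget allows.
    return max(feasible, key=lambda s: s.throughput_mb_s)


SETTINGS = [
    Setting("low", 1000, 10),
    Setting("mid", 2000, 18),
    Setting("high", 4000, 30),
]

for n in (1, 2, 4):
    print(n, choose_setting(SETTINGS, n, required_mb_s=4000,
                            heat_budget_w=80).name)
```

With these hypothetical numbers, a single group runs at the high setting and the setting steps down as groups are added, mirroring the behavior described in the text.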
In one embodiment of the invention, the throughput of the RAID groups is based on the number of SSDs in the RAID groups and the throughput of the individual SSDs. Throughput of the SSDs is configured through the number of PCIe lanes, such as X2, X4, or X8, and the PCIe generation, such as PCIe 1.1, PCIe 2.0, or PCIe 3.0. In an embodiment of the invention, the storage processor 10 re-configures the configuration register(s) of the PCIe SSDs 28 by changing the width of the bus (the number of PCIe lanes), the PCIe generation, or a combination of both. The storage processor 10 will re-initialize the SSDs 28, or the SSDs 28 will re-initialize themselves on their own initiative in response to the re-configuration, to re-initiate link training between the SSDs 28 in the RAID group and the PCIe switch 16 to the new highest mutually supported lane count and PCIe generation. In another embodiment, the storage processor re-configures the PCIe switch 16 registers associated with the RAID interface that is coupled to the SSDs of the RAID group, such as the RAID group 232 and the RAID interface 204. Subsequently, the storage processor 10 will re-initialize the SSDs 28, or the SSDs 28 will re-initialize themselves on their own initiative in response to the re-configuration, to re-initiate link training between the SSDs 28 in the RAID group and the PCIe switch 16 to the new highest mutually supported lane count and PCIe generation.
In another embodiment of the invention, not all of the RAID groups need to have the capability of operating at what is considered a high throughput. Rather, only one of the RAID groups needs to be capable of operating at the highest throughput and of having a RAID interface with the highest number of lanes and generation. In this case, controllers of the SSDs in the one RAID group with the capability of operating at the highest throughput need to support the maximum number of PCIe lanes and PCIe generations to achieve the required throughput of the storage system. The throughput of the rest of the RAID groups need not be as high as that of the first RAID group and in fact their throughput can be reduced over time. For example, the second RAID group can have a lower throughput than the first RAID group and the third RAID group can have a lower throughput than the second RAID group and so forth.
Furthermore, the first RAID group can operate at higher throughputs in comparison to the second RAID group or the third RAID group. Lower throughputs of the second RAID group or the third RAID group translate to a lower number of PCIe lanes and inexpensive SSD controllers as well as a considerably reduced cost of the storage system 8.
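The effect of restricting one side of the link and then re-initiating link training can be sketched as taking the lower of the two advertised maxima for both generation and lane count. This is a simplified model made for illustration; the names are hypothetical, and real link training negotiates among the specific widths each device supports rather than an arbitrary minimum.

```python
GENERATIONS = ["1.1", "2.0", "3.0"]  # ordered oldest to newest


def train_link(ssd, switch_port):
    """Return the highest mutually supported (generation, lanes).

    Both arguments are (generation, lanes) tuples giving the maximum each
    side currently advertises.
    """
    generation = min(ssd[0], switch_port[0], key=GENERATIONS.index)
    lanes = min(ssd[1], switch_port[1])
    return generation, lanes


switch_port = ("3.0", 8)

# SSD advertising its full capability.
full = train_link(("3.0", 8), switch_port)

# After the storage processor restricts the SSD to PCIe 2.0 x4 and the link
# is retrained, the link settles at the restricted setting.
restricted = train_link(("2.0", 4), switch_port)

print(full, restricted)
```

The same model covers the alternative embodiment in which the switch-side registers are restricted instead: lowering either tuple lowers the negotiated result.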
In the example of
Referring still to the example of
In the example of
Referring now to
In one embodiment of this invention, RAID group 232 has the highest throughput relative to the remaining RAID groups and as such supports the maximum number of PCIe lanes and/or the highest PCIe generation. RAID group 234 has a lower throughput in comparison to RAID group 232 but it has a higher throughput than that of the RAID groups 236 and 238. RAID groups 236 and 238 have lower throughputs relative to that of the RAID groups 232 and 234 and thus have a lower number of PCIe lanes and/or a lower PCIe generation.
In an embodiment of the invention, the PCIe switch 314 connected to the higher-throughput RAID groups 232 and 234 is a PCIe 2.0 switch and the PCIe switch 316 connected to the lower-throughput RAID groups 236 and 238 is a PCIe 1.1 type of switch. PCIe 1.1 switches cost substantially less than PCIe 2.0 switches, which reduces the cost of the storage system 8 while maintaining the throughput required of the storage system 8.
In the example of
In this example, the PCIe switch 314 is a PCIe 2.0 switch but the PCIe switch 316 can be either a PCIe 1.1 or a PCIe 2.0 switch. PCIe 1.1 switches cost substantially less than PCIe 2.0 switches. Furthermore, the number of PCIe lanes or the PCIe generation requirements for the lower-throughput RAID groups are substantially less than those for the higher-throughput RAID groups, which also translates to lower-cost SSD controllers as well as a less complex motherboard design.
In one embodiment of the invention, the storage system 8 and/or the SSDs of the RAID groups may have temperature sensors scattered throughout the critical sections of the system and SSDs. The storage processor 10 may identify the temperature of the RAID groups in the storage system 8 by periodically reading the sensors, and may reconfigure the RAID groups accordingly. If, for example, one or more of the RAID groups are operating at a temperature that is higher than a predetermined threshold, set based on a heat budget, the storage processor 10 will reconfigure these over-heated RAID groups to operate at a lower throughput. Operating at a lower throughput causes the SSDs of these RAID groups to dissipate less heat and eventually reach a temperature that is below their respective thresholds.
In one embodiment, once the temperature of the over-heated RAID groups operating at a lower throughput is below the threshold, the RAID groups may be reconfigured back to the throughput at which they operated before their throughput was lowered.
In another embodiment of the invention, the storage processor 10 may utilize the RAID groups operating at a higher temperature less often than the RAID groups operating at a lower temperature, by scheduling fewer commands for the particular RAID groups or by scheduling commands to the particular RAID groups less often. In one embodiment, the storage processor 10 may schedule only read commands and no write commands to the RAID groups operating at a temperature above their respective threshold.
In yet another embodiment of the invention, the storage processor 10 may identify the temperature of the RAID groups in the storage system 8 by reading the self-monitoring analysis and reporting technology (SMART) attributes, and specifically the ‘Temperature’ attribute, of the SSDs in the RAID groups. SMART is a standard interface protocol that allows a disk to check its status and report it to a host system. SMART information consists of ‘attributes’, each one of which describes some particular aspect of the drive condition, such as ‘temperature’. Each drive manufacturer may define its own set of attributes but they mostly try to adhere to the standard for interoperability.
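The throttle-and-restore behavior of the two preceding paragraphs can be sketched as a small piece of per-group state that remembers the throughput in effect before throttling. The class name, the single shared threshold, and the fixed reduced throughput are assumptions made for this example.

```python
class ThermalGovernor:
    """Lower a RAID group's throughput while it is above a temperature
    threshold and restore the earlier throughput once it cools down."""

    def __init__(self, threshold_c: float, reduced_mb_s: float):
        self.threshold_c = threshold_c
        self.reduced_mb_s = reduced_mb_s
        self._saved = {}  # group id -> throughput before throttling

    def update(self, group_id, temp_c: float, current_mb_s: float) -> float:
        """Return the throughput the group should run at after a reading."""
        throttled = group_id in self._saved
        if temp_c > self.threshold_c and not throttled:
            # Over-heated: remember the present throughput, then lower it.
            self._saved[group_id] = current_mb_s
            return min(current_mb_s, self.reduced_mb_s)
        if temp_c < self.threshold_c and throttled:
            # Cooled below the threshold: go back to the earlier throughput.
            return self._saved.pop(group_id)
        return current_mb_s


governor = ThermalGovernor(threshold_c=70, reduced_mb_s=1000)
rate = 4000
for reading in (65, 75, 72, 68):
    rate = governor.update(232, reading, rate)
    print(reading, "C ->", rate, "MB/s")
```

A practical implementation would likely restore at a temperature somewhat below the throttling threshold to avoid oscillating around it; the text describes a single threshold, so the sketch follows that.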
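A temperature-aware scheduling policy of this kind might look like the following sketch. The function names and the choice of directing writes to the coolest eligible RAID group are assumptions made for illustration; the text only requires that hotter groups be used less often and that groups above their threshold receive no writes.

```python
def may_schedule(command: str, temp_c: float, threshold_c: float) -> bool:
    """Reads are always allowed; writes only at or below the threshold."""
    if command == "read":
        return True
    if command == "write":
        return temp_c <= threshold_c
    raise ValueError("command must be 'read' or 'write'")


def pick_write_group(temps_c: dict, threshold_c: float):
    """Return the coolest RAID group that may accept a write, or None."""
    candidates = {group: temp for group, temp in temps_c.items()
                  if may_schedule("write", temp, threshold_c)}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical temperature readings per RAID group, in degrees Celsius.
temperatures = {232: 75, 234: 60, 236: 55}
print(pick_write_group(temperatures, threshold_c=70))
```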
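Deriving a RAID group temperature from per-SSD SMART attributes could be sketched as below. The dictionary layout, the attribute key, and the use of the hottest SSD as the group temperature are assumptions made for this example; how the attributes are actually read from a drive is device-specific and is not shown.

```python
def raid_group_temperature(ssd_attributes: list) -> float:
    """Return the highest 'temperature' attribute among a group's SSDs.

    Each element is a dict of SMART attribute names to values for one SSD.
    SSDs that do not report a temperature are skipped.
    """
    temps = [attrs["temperature"] for attrs in ssd_attributes
             if "temperature" in attrs]
    if not temps:
        raise ValueError("no SSD in the group reports a temperature")
    return max(temps)


# Hypothetical attribute sets for three SSDs of one RAID group.
group_attrs = [
    {"temperature": 58, "power_on_hours": 1200},
    {"temperature": 63, "power_on_hours": 1195},
    {"power_on_hours": 40},  # this drive does not report a temperature
]
print(raid_group_temperature(group_attrs))
```

Using the maximum is a conservative choice, since a single over-heated SSD is enough to warrant lowering the group's throughput; an average would be another reasonable policy.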
Although the invention has been described in terms of specific embodiments, it is anticipated that alterations and modifications thereof will no doubt become apparent to those skilled in the art. It is therefore intended that the following claims be interpreted as covering all such alterations and modifications as fall within the true spirit and scope of the invention.