1. Field of the Invention
Embodiments of the present invention generally relate to stripe mapping for two levels of RAID (Redundant Array of Inexpensive Disks/Drives), also known as hierarchical secondary RAID (HSR), and more specifically to configurations implementing two levels of RAID 5.
2. Description of the Related Art
Conventional RAID systems configured for implementing RAID 5 store data in stripes, with each stripe including parity information. A stripe is composed of multiple strips (also known as elements; the strip size is sometimes called the chunk size), with each strip located on a separate hard disk drive. The location of the parity information is rotated for each stripe to load balance accesses for reading and writing data and reading and writing the parity information.
RAID array 130 includes one or more storage devices, specifically N hard disk drives 150(0) through 150(N-1), that are configured to store data and are each directly coupled to storage controller 140 to provide a high bandwidth interface for reading and writing the data. The granularity (sometimes referred to as the rank) of the RAID array is the value of N or, equivalently, the number of hard disk drives. The data and parity are distributed across disks 150 using block level striping conforming to RAID 5.
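By way of illustration only, the following sketch shows how such a conventional RAID 5 controller could assign data strips and rotated parity across N disks. The specific rotation variant shown (often called "left-symmetric") is an assumption for the sketch and is not taken from the related art described above:

    # Illustrative sketch of conventional RAID 5 block-level striping with
    # rotated parity ("left-symmetric" variant assumed). One strip per
    # stripe holds parity "P"; data strips fill the remaining disks.
    def raid5_layout(num_disks, num_stripes):
        layout = [["-"] * num_disks for _ in range(num_stripes)]
        strip = 0
        for stripe in range(num_stripes):
            parity_disk = (num_disks - 1 - stripe) % num_disks  # rotate left
            for offset in range(num_disks - 1):
                # data fills the remaining disks, starting after the parity disk
                disk = (parity_disk + 1 + offset) % num_disks
                layout[stripe][disk] = str(strip)
                strip += 1
            layout[stripe][parity_disk] = "P"
        return layout

    for row in raid5_layout(5, 5):
        print("\t".join(row))

Because the parity disk changes from stripe to stripe, parity reads and writes are spread across all N drives rather than concentrated on a single drive.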
When different disk configurations are used in a RAID system, other methods and systems for mapping data and parity are needed for load-balancing and to facilitate sequential access for read and write operations.
A two-level, hierarchical secondary RAID architecture achieves a reduced mean time to data loss compared with a single-level RAID architecture.
Various embodiments of the invention provide a method for configuring storage devices in a hierarchical redundant array of inexpensive disks (RAID) system. The method includes configuring an array including a primary granularity of storage bricks that each include a secondary granularity of hard disk drive storage devices that store data, primary parity, and secondary parity in stripes in the hierarchical RAID system. Secondary parity for each one of the storage bricks is computed from the data that is stored in the secondary stripe within the storage brick. The secondary parity is mapped to one strip of each secondary stripe of the hard disk drives in each one of the storage bricks using a rotational allocation. Primary parity for each primary stripe of the storage bricks is computed from the data that is stored in the primary stripe. The primary parity is mapped to distribute portions of the primary parity to each one of the hard disk drives within each one of the storage bricks.
Various embodiments of the invention provide a system for configuring storage devices in a hierarchical redundant array of inexpensive disks (RAID) system. The system includes an array of storage bricks, each of which includes a secondary controller that is separately coupled to a set of hard disk drive storage devices configured to store data, primary parity, and secondary parity in stripes, and a primary storage controller that is separately coupled to each one of the secondary controllers in the array of storage bricks. The primary storage controller and secondary storage controllers are configured to map the secondary parity for storage in one of the hard disk drives in each of the storage bricks for each secondary stripe using a rotational allocation, wherein the primary parity for each stripe is mapped for storage in one of the hard disk drives in one of the storage bricks.
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments of this invention and are therefore not to be considered limiting of its scope, for the invention may admit to other equally effective embodiments.
In the following, reference is made to embodiments of the invention. However, it should be understood that the invention is not limited to specific described embodiments. Instead, any combination of the following features and elements, whether related to different embodiments or not, is contemplated to implement and practice the invention. Furthermore, in various embodiments the invention provides numerous advantages over the prior art. However, although embodiments of the invention may achieve advantages over other possible solutions and/or over the prior art, whether or not a particular advantage is achieved by a given embodiment is not limiting of the invention. Thus, the following aspects, features, embodiments and advantages are merely illustrative and, unless explicitly recited in a claim, are not considered elements or limitations of the appended claims.
Primary storage controller 240 is configured to function as a RAID 5 controller and is coupled to CPU 220 via a high bandwidth interface. In some embodiments of the present invention, the high bandwidth interface is a conventional interface such as peripheral component interconnect (PCI). A conventional RAID 5 configuration of storage bricks 235 includes a distributed parity drive and block (or chunk) level striping. In this case, there are N storage bricks 235 and N is the granularity of the primary storage. In other embodiments of the present invention, the I/O interface, bridge device, or primary storage controller 240 may include additional ports such as universal serial bus (USB), accelerated graphics port (AGP), InfiniBand, and the like. In other embodiments of the present invention, the primary storage controller 240 could also be host software that executes on CPU 220. Additionally, primary storage controller 240 may be configured to function as a RAID 6 controller in other embodiments of the present invention.
Each storage brick 235 includes a secondary storage controller 245 that is separately coupled to storage devices, specifically M hard disk drives 250(0) through 250(M-1), where M is the granularity of the secondary storage. Secondary storage controller 245 provides a high bandwidth interface for reading and writing the data and parity stored on disks 250. Secondary storage controller 245 may be configured to function as a RAID 5 or a RAID 6 controller in various embodiments of the present invention.
If the primary storage controller 240 and secondary storage controller 245 both implement RAID 5, this is referred to as HSR 55; if the primary storage controller 240 implements RAID 5 and secondary storage controller 245 implements RAID 6, this is referred to as HSR 56; if the primary storage controller 240 implements RAID 6 and secondary storage controller 245 implements RAID 5, this is referred to as HSR 65; and if the primary storage controller 240 implements RAID 6 and secondary storage controller 245 implements RAID 6, this is referred to as HSR 66. In summary, primary storage controller 240 and secondary storage controller 245 can be configured to implement the same RAID levels for HSR 55 and HSR 66 or different RAID levels for HSR 65 and HSR 56.
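The naming convention can be summarized by the following trivial sketch (illustrative only; the function name is hypothetical):

    # HSR naming: the first digit is the primary RAID level and the second
    # digit is the secondary RAID level, per the convention described above.
    def hsr_name(primary_level, secondary_level):
        assert primary_level in (5, 6) and secondary_level in (5, 6)
        return f"HSR {primary_level}{secondary_level}"

    print(hsr_name(5, 5), hsr_name(5, 6), hsr_name(6, 5), hsr_name(6, 6))
    # prints: HSR 55 HSR 56 HSR 65 HSR 66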
Each storage device within HSR 230, e.g. bricks 235 and disks 250, may be replaced or removed, so at any particular time, system 200 may include fewer or more storage devices. Primary storage controller 240 and secondary storage controller 245 facilitate data transfers between CPU 220 and disks 250, including transfers for performing parity functions. Additionally, parity computations are performed by primary storage controller 240 and secondary storage controller 245.
In some embodiments of the present invention, primary storage controller 240 and secondary storage controller 245 perform block striping and/or data mirroring based on instructions received from storage driver 212. Data and/or parity are passed between secondary storage controller 245 and each disk 250 via a bi-directional bus. Each disk 250 includes drive electronics that control storing and reading of data and parity within the individual storage device and that are capable of mapping out failed portions of the storage capacity based on bad sector information.
System memory 210 stores programs and data used by CPU 220, including storage driver 212. Storage driver 212 communicates between the operating system (OS) and primary storage controller 240 and secondary storage controller 245 to perform RAID management functions such as detection and reporting of storage device failures, maintaining state data, e.g. bad sectors, address translation information, and the like, for each storage device within storage bricks 235, and transferring data between system memory 210 and HSR 230.
An advantage of a two-level or multi-level hierarchical architecture, such as system 200, is improved reliability compared with a conventional single-level system using RAID 5 or RAID 6. Additionally, storage bricks 235 may be used with conventional storage controllers that implement RAID 5 or RAID 6 since each storage brick 235 appears to primary storage controller 240 as a virtual disk drive. Primary storage controller 240 provides an interface to CPU 220 and additional RAID 5 or RAID 6 parity protection. Secondary storage controller 245 aggregates multiple disks 250 and applies RAID 5 or RAID 6 parity protection. As an example, when five disks 250 (the secondary granularity) are included in each storage brick 235 and five storage bricks 235 (the primary granularity) are included in HSR 230, the capacity equivalent to 16 useful disks of the 25 total disks 250 is available for data storage.
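The 16-of-25 figure follows from the parity overhead at each level: RAID 5 consumes one disk of M in each brick for secondary parity and the equivalent of one brick of N for primary parity, leaving (M-1)(N-1) disk-equivalents for data. A minimal sketch of this arithmetic (the formula is inferred from the example above, assuming RAID 5 at both levels):

    # Usable capacity of an HSR 55 array, in disk-equivalents.
    def hsr55_usable_disks(num_bricks, disks_per_brick):
        per_brick = disks_per_brick - 1        # one disk per brick lost to secondary parity
        return per_brick * (num_bricks - 1)    # one brick-equivalent lost to primary parity

    print(hsr55_usable_disks(5, 5))  # 16 of 25 disks hold user data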
Each column of data and primary parity 301 corresponds to the sequence of strips that is sent to each secondary storage brick 235(0) through 235(4) and mapped into the rows of storage bricks 235(0) through 235(4). Each column of data, primary parity, and secondary parity in storage bricks 235(0) through 235(4) is mapped to a separate disk 250. The rows of storage bricks 235(0) through 235(4) are the secondary stripes, and secondary parity is computed for each one of the secondary stripes. Secondary storage controller 245 applies conventional RAID 5 mapping using a "left parity rotation" to the sequence of strips from data and primary parity 301 sent from primary storage controller 240, and computes the secondary parity as shown by the hashed pattern of secondary parity 306. The primary and secondary parity mapping pattern shown for each storage brick 235(0) through 235(4) represents a single secondary mapping cycle that is repeated for the remaining storage in each storage brick 235. When a column of data and primary parity 301 is mapped to one of storage bricks 235, the primary parity is aligned in a single disk 250 within each storage brick 235(0) through 235(4). For example, in storage brick 235(0) the primary parity is aligned in the disk corresponding to the rightmost column. The disks 250 that store the primary parity are hot spots for primary parity updates and do not contribute to data reads. Therefore, the read and write access performance is reduced compared with a mapping that distributes the primary and secondary parity amongst all of disks 250.
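The alignment described above can be reproduced with a short simulation (a simplified sketch only; it assumes left parity rotation at both levels and one strip per brick per primary stripe, which are illustrative choices rather than details taken from the drawings):

    N = 5     # storage bricks (primary granularity)
    M = 5     # disks per brick (secondary granularity)
    ROWS = 5  # secondary stripes in one mapping cycle

    layouts = [[["-"] * M for _ in range(ROWS)] for _ in range(N)]
    for b in range(N):
        for j in range(ROWS * (M - 1)):       # strips arriving at brick b
            row, offset = divmod(j, M - 1)
            q_disk = (M - 1 - row) % M        # left-rotated secondary parity slot
            disk = (q_disk + 1 + offset) % M  # data slots follow the Q slot
            p_brick = (N - 1 - j) % N         # primary parity rotates across bricks
            layouts[b][row][disk] = "P" if p_brick == b else "d"
        for r in range(ROWS):
            layouts[b][r][(M - 1 - r) % M] = "Q"

    for b in range(N):
        print("brick", b)
        for r in range(ROWS):
            print(" ", " ".join(layouts[b][r]))

Running the sketch shows every "P" strip of brick 0 landing in the rightmost disk column, and in a single, different column for each of the other bricks, which is precisely the hot-spot behavior described above.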
TABLE 1 shows the layout of data as viewed by the primary storage controller 240, with the numbers corresponding to the order of the data strips sent to it by the CPU 220 and “P” corresponding to the primary parity strips. The first 5 columns correspond to storage bricks 235(0) through 235(4).
TABLE 2 shows the clustered parity layout for HSR 230 in greater detail. The first 5 columns correspond to storage brick 235(0), with columns 0 through 4 corresponding to the five disks 250. The next five columns correspond to storage brick 235(1), and so on. The secondary parity is shown as "Q." Five hundred strips are allocated in five storage bricks 235, resulting in 20 cycles of primary mapping. The primary parity is stored in a cluster, as shown in the bottom five rows (corresponding to the secondary stripes in disks 250) of TABLE 1. The primary parity is stored in locations 16-19, 36, 56, 76, 96, 116, 136, 156, and so on, as shown in TABLE 2. In this example, since the granularity of the primary storage is 5, the primary parity is computed for every 4 original strips, and the notation on the parity at the bottom of TABLE 2 is shortened to denote the first strip in the primary parity; thus 36 denotes the primary parity for strips 36-39.
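A conceptual sketch of the clustered placement follows (the exact addressing of TABLE 2 is not reproduced; the row counts and labels are assumed for illustration): each mapping cycle first fills secondary stripes with data and then stores all of the cycle's primary parity together in the final secondary stripe, with the secondary parity "Q" still rotating through every row.

    def clustered_cycle(disks, data_rows, parity_rows):
        # One mapping cycle for a single brick: data rows ("d") on top, a
        # cluster of primary parity rows ("P") at the bottom, and a
        # left-rotated secondary parity slot ("Q") in every row.
        rows = []
        for r in range(data_rows + parity_rows):
            kind = "d" if r < data_rows else "P"
            row = [kind] * disks
            row[(disks - 1 - r) % disks] = "Q"
            rows.append(row)
        return rows

    for row in clustered_cycle(disks=5, data_rows=4, parity_rows=1):
        print(" ".join(row))

With five disks, each cycle then holds 16 data strips, 4 primary parity strips, and 5 secondary parity strips, consistent with one primary parity strip computed for every 4 original strips.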
In step 315 the primary parity is mapped in one or more clusters, i.e., adjacent secondary stripes, to each of the disks 250 in storage bricks 235. In step 320 the data is mapped to the remaining locations in each of the disks 250 in storage bricks 235 for the current secondary mapping cycle. In step 325 the round-robin count is incremented, and in step 330 the method determines if the round-robin count (RRC) equals the number of disks 250 (M) in each storage brick 235. If the RRC does equal the number of disks 250, then the mapping is complete. Otherwise, the method returns to step 315 to map the primary parity and data for another secondary mapping cycle.
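A minimal sketch of this loop is shown below (the step numbers follow the method above; the cluster shape and cycle length are assumptions for illustration):

    M = 5     # disks 250 per storage brick 235
    ROWS = 5  # secondary stripes in one secondary mapping cycle

    layout = []
    rrc = 0   # round-robin count
    while True:
        cycle = [["d"] * M for _ in range(ROWS)]
        for r in range(ROWS):
            cycle[r][(M - 1 - r) % M] = "Q"   # secondary parity, left rotation
        for r in range(ROWS):                 # step 315: map the primary parity
            if cycle[r][rrc] == "d":          # cluster to the disk selected by rrc
                cycle[r][rrc] = "P"
        layout.extend(cycle)                  # step 320: data keeps the "d" slots
        rrc += 1                              # step 325: increment the count
        if rrc == M:                          # step 330: every disk has held the
            break                             # cluster once; mapping is complete

    for row in layout:
        print(" ".join(row))

Because rrc selects a different disk for the primary parity cluster in each cycle, the primary parity is rotated across all M disks over M cycles instead of dwelling on a single disk.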
Separate round-robin pointers are used for the mapping of data and primary parity during steps 305 and 315 of the method.
TABLE 3 shows the right round-robin rotation allocation parity layout for storage brick 235(0) in greater detail. The five columns correspond to the five disks 250 in storage brick 235(0). The secondary parity is shown as "Q." The primary parity is stored in rotationally allocated locations 4, 9, 14, 19, 24, 29, and so on.
TABLE 4 shows the right round-robin rotation allocation parity layout for storage brick 235(0) when six disks 250 are included in storage bricks 235. The six columns correspond to the six disks 250 in storage brick 235(0). The secondary parity is shown as "Q." The primary parity is stored in rotationally allocated locations 4, 9, 14, 19, 24, 29, and so on.
TABLE 5 shows the right round-robin rotation allocation parity layout corresponding to another embodiment of the present invention.
TABLE 6 shows the right round-robin rotation allocation parity layout corresponding to yet another embodiment of the present invention.
The method of mapping the data and primary parity is performed using separate pointers for each disk 250. Pseudo code describing the algorithm for updating the data pointer is shown in TABLE 7, where DP is the device pointer for the data that points to the location to which the next data is mapped. N is the number of secondary storage controllers 245.
Pseudo code describing the algorithm for updating the primary parity pointer is shown in TABLE 8, where PP is the device pointer for the primary parity that points to the location to which the next primary parity is mapped.
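Since TABLES 7 and 8 are not reproduced here, the following is a hedged reconstruction of the two pointer updates. The modular advancement rule and the skip condition are assumptions consistent with the round-robin scheme described above, not the actual pseudo code of the tables, and the role that N (the number of secondary storage controllers 245) plays in TABLE 7 is omitted from this simplified sketch:

    M = 5  # disks 250 per storage brick 235

    def next_data_disk(dp, reserved):
        # Advance DP to the next disk to the right, wrapping at the end of
        # the brick and skipping slots reserved in this secondary stripe for
        # primary parity or secondary parity. Assumes len(reserved) < M so
        # the loop terminates.
        dp = (dp + 1) % M
        while dp in reserved:
            dp = (dp + 1) % M
        return dp

    def next_parity_disk(pp):
        # Advance PP one disk to the right, wrapping, so that successive
        # primary parity strips rotate across all disks in the brick.
        return (pp + 1) % M

    pp = 0
    for _ in range(8):      # eight successive primary parity placements
        pp = next_parity_disk(pp)
        print(pp, end=" ")  # prints: 1 2 3 4 0 1 2 3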
HSR 230 is used to achieve a reduced mean time to data loss compared with a single-level RAID architecture. The new data, primary parity, and secondary parity mapping technique provides load-balancing between the disks in the hierarchical secondary RAID architecture and facilitates sequential access by distributing the data, primary parity, and secondary parity amongst disks 250.
One embodiment of the invention may be implemented as a program product for use with a computer system. The program(s) of the program product define functions of the embodiments (including the methods described herein) and can be contained on a variety of computer-readable storage media. Illustrative computer-readable storage media include, but are not limited to: (i) non-writable storage media (e.g. read-only memory devices within a computer such as CD-ROM disks readable by a CD-ROM drive, flash memory, ROM chips or any type of solid-state non-volatile semiconductor memory) on which information is permanently stored; and (ii) writable storage media (e.g. floppy disks within a diskette drive or hard-disk drive or any type of solid-state random-access semiconductor memory) on which alterable information is stored.
While the foregoing is directed to embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. The foregoing description and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. The listing of steps in method claims does not imply performing the steps in any particular order, unless explicitly stated in the claim.