To protect against potential loss of data in a storage system, it is often advantageous to implement a replication scheme. Current replication schemes are only able to sustain a limited number of errors before data within the storage system can no longer be read.
Specific embodiments of the technology will now be described in detail with reference to the accompanying figures. In the following detailed description of embodiments of the technology, numerous specific details are set forth in order to provide a more thorough understanding of the technology. However, it will be apparent to one of ordinary skill in the art that the technology may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description.
In general, embodiments of the technology relate to a method and system for replicating data using a multi-dimensional RAID scheme. More specifically, embodiments of the technology provide a method and system for implementing a 2D RAID scheme and a 3D RAID scheme in a storage array in which one or more storage modules are not present in the storage array. For example, consider a scenario in which the storage array may include up to 36 storage modules; however, when the storage array is initially implemented only 18 storage modules are used. Embodiments of the technology enable more efficient calculation of parity values (e.g., P and Q) in such scenarios. Said another way, embodiments of the technology enable implementation of a multi-dimensional RAID scheme in a storage array in which not all of the storage modules that the storage array is designed to accommodate are present. Implementations of various embodiments of the technology may result in fewer cache memory loads (i.e., less data being loaded into cache memory), fewer computation cycles, and, as a result, increased performance of the storage array. In various embodiments of the technology, performance of the storage array may improve by up to 46%.
For purposes of this technology, the term “RAID” as used herein refers to “Redundant Array of Independent Disks.” While “RAID” refers to any array of independent disks, embodiments of the technology may be implemented using any type of persistent storage device where the RAID grid locations (see e.g.,
Using a 2D RAID scheme, the data stored within a RAID grid implementing such a RAID scheme may be recovered when there are more than two errors in a given RAID stripe. Similarly, using a 3D RAID scheme, the data stored within a RAID cube implementing such a RAID scheme may be recovered when there are more than two errors in a given RAID stripe.
In one or more embodiments of the technology, an independent fault domain (IFD) corresponds to a failure mode which results in the data at a given location being inaccessible. Each IFD corresponds to an independent mode of failure in the storage array. For example, if the data is stored in NAND flash, where the NAND flash is part of a storage module (which includes multiple NAND dies), then the IFDs may be (i) the storage module, (ii) the channel (i.e., the channel used by the storage module controller (not shown) in the storage module to write data to the NAND flash), and/or (iii) the NAND die.
In one embodiment of the technology, a client (100A, 100M) is any system or process executing on a system that includes functionality to issue a read request or a write request to the RAID controller (104). In one embodiment of the technology, the clients (100A, 100M) may each include a processor (not shown), memory (not shown), and persistent storage (not shown). In one embodiment of the technology, the RAID controller (104) is configured to implement the multi-dimensional RAID scheme, which includes writing data to the storage array in a manner consistent with the multi-dimensional RAID scheme (see
In one embodiment of the technology, the RAID controller (104) is operatively connected to memory (106). The memory (106) may be any volatile memory including, but not limited to, Dynamic Random-Access Memory (DRAM), Synchronous DRAM, SDR SDRAM, and DDR SDRAM. In one embodiment of the technology, the memory (106) is configured to temporarily store various data (including parity data) prior to such data being stored in the storage array.
In one embodiment of the technology, the FPGA (102) (if present) includes functionality to calculate P and/or Q parity information for purposes of storing data in the storage array (108) and/or functionality to perform various calculations necessary to recover corrupted data stored using the multi-dimensional RAID scheme. The RAID controller (104) may use the FPGA (102) to offload the processing of various data in accordance with one or more embodiments of the technology. In one embodiment of the technology, the storage array (108) includes a number of individual persistent storage devices including, but not limited to, magnetic memory devices, optical memory devices, solid state memory devices, phase change memory devices, any other suitable type of persistent memory device, or any combination thereof.
In one embodiment of the technology, the cache (110) is volatile memory that is configured to temporarily store various data (including parity values). The cache (110) is configured to store less data (including parity values) than the memory (106); however, the cache (110) has a lower read and write latency than the memory (106). In one embodiment of the technology, the cache (110) is a multi-level cache. While
Those skilled in the art will appreciate that while
Referring to row (214), in one embodiment of the technology, the data stored in the RAID grid location denoted as Pr in row (214) is calculated by applying a P parity function to all RAID grid locations in the row (214) that include data (e.g., Pr=fP (D1, D2, D3, D4)). Similarly, in one embodiment of the technology, the data stored in the RAID grid location denoted as Qr in row (214) is calculated by applying a Q parity function to all RAID grid locations in the row (214) that include data (e.g., Qr=fQ (D1, D2, D3, D4)).
Referring to column (216), in one embodiment of the technology, data stored in the RAID grid location denoted as Pc in column (216) is calculated by applying a P parity function to all RAID grid locations in the column (216) that include data (e.g., Pc=fP (D5, D2, D6, D7)). Similarly, in one embodiment of the technology, data stored in the RAID grid location denoted as Qc in column (216) is calculated by applying a Q parity function to all RAID grid locations in the column (216) that include data (e.g., Qc=fQ (D5, D2, D6, D7)).
Referring to the intersection parity group (212), in one embodiment of the technology, the data stored in the RAID grid location denoted as Ir1 may be calculated by applying a P parity function to all RAID grid locations in the row P Parity Group (204) or by applying a P parity function to all RAID grid locations in the column P Parity Group (208). For example, Ir1=fP (Pr1, Pr2, Pr3, Pr4) or Ir1=fP (Pc5, Pc6, Pc7, Pc8).
In one embodiment of the technology, the data stored in the RAID grid location denoted as Ir2 may be calculated by applying a P parity function to all RAID grid locations in the row Q Parity Group (206) or by applying a Q parity function to all RAID grid locations in the column P Parity Group (208). For example, Ir2=fP (Qr1, Qr2, Qr3, Qr4) or Ir2=fQ (Pc5, Pc6, Pc7, Pc8).
In one embodiment of the technology, the data stored in the RAID grid location denoted as Ir3 may be calculated by applying a P parity function to all RAID grid locations in the column Q Parity Group (210) or by applying a Q parity function to all RAID grid locations in the row P Parity Group (204). For example, Ir3=fP (Qc5, Qc6, Qc7, Qc8) or Ir3=fQ (Pr1, Pr2, Pr3, Pr4).
In one embodiment of the technology, the data stored in the RAID grid location denoted as Ir4 may be calculated by applying a Q parity function to all RAID grid locations in the column Q Parity Group (210) or by applying a Q parity function to all RAID grid locations in the row Q Parity Group (206). For example, Ir4=fQ (Qc5, Qc6, Qc7, Qc8) or Ir4=fQ (Qr1, Qr2, Qr3, Qr4).
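By way of a non-limiting illustration, the sketch below (in Python, with made-up grid values) checks the intersection-parity relationship for Ir1 when the P parity function is the byte-wise XOR function described below: applying fP to the row P Parity Group and applying fP to the column P Parity Group both reduce to the XOR of the entire data grid, so the two formulas for Ir1 agree.

```python
from functools import reduce
from operator import xor

# Illustrative P parity function: byte-wise XOR of its inputs
# (see the discussion of the P parity function below).
def f_p(values):
    return reduce(xor, values)

# Hypothetical 4x4 data grid; rows 1-4 and columns 5-8 follow the
# row/column numbering used in the text.
data = [
    [0x11, 0x22, 0x33, 0x44],   # row 1
    [0x55, 0x66, 0x77, 0x88],   # row 2
    [0x99, 0xaa, 0xbb, 0xcc],   # row 3
    [0xdd, 0xee, 0xff, 0x01],   # row 4
]

# Row P Parity Group: Pr1..Pr4, one value per row.
p_rows = [f_p(row) for row in data]
# Column P Parity Group: Pc5..Pc8, one value per column.
p_cols = [f_p(col) for col in zip(*data)]

# Ir1 computed from the row P Parity Group equals Ir1 computed from the
# column P Parity Group: both reduce to the XOR of the entire data grid.
assert f_p(p_rows) == f_p(p_cols)
print(hex(f_p(p_rows)))
```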
In one embodiment of the technology, the P and Q parity functions used to calculate the values for all of the parity groups may correspond to any P and Q parity functions used to implement RAID 6.
As discussed above, the RAID grid (200) shown in
The RAID controller (or another entity in the system) may determine to which physical addresses in the storage array each of the RAID grid locations is written. This determination may be made prior to receiving any of the data (denoted as “D”) for a particular RAID grid from the client. Alternatively, the determination may be made prior to writing the RAID grid locations to the storage array.
Those skilled in the art will appreciate that while
In one embodiment of the technology, the P parity value is a Reed-Solomon syndrome and, as such, the P Parity function may correspond to any function that can generate a Reed-Solomon syndrome. In one embodiment of the technology, the P parity function is an XOR function.
In one embodiment of the technology, the Q parity value is a Reed-Solomon syndrome and, as such, the Q Parity function may correspond to any function that can generate a Reed-Solomon syndrome. In one embodiment of the technology, a Q parity value is a Reed-Solomon code. In one embodiment of the technology, Q = g^0·D0 + g^1·D1 + g^2·D2 + . . . + g^(n-1)·D(n-1), where Q corresponds to any one of the Q parity values defined with respect to
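The following sketch illustrates one common (but non-normative) realization of the P and Q functions described above: P as a byte-wise XOR and Q as the syndrome Q = g^0·D0 + g^1·D1 + . . . over GF(2^8). The choice of GF(2^8), generator g = 2, and reduction polynomial 0x11d are typical of RAID 6 implementations and are assumptions here, not requirements of the technology; the function names and example stripe are illustrative only.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with reduction polynomial 0x11d."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result & 0xff

def p_parity(blocks):
    """P parity: byte-wise XOR of all data blocks in the stripe."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def q_parity(blocks):
    """Q parity: Q = g^0*D0 + g^1*D1 + ... + g^(n-1)*D(n-1) over GF(2^8)."""
    out = bytearray(len(blocks[0]))
    coeff = 1  # g^0, with g = 2
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= gf_mul(coeff, byte)
        coeff = gf_mul(coeff, 2)  # advance to the next power of g
    return bytes(out)

# Example stripe of four data blocks (D0..D3).
stripe = [bytes([0x11, 0x22]), bytes([0x33, 0x44]),
          bytes([0x55, 0x66]), bytes([0x77, 0x88])]
print(p_parity(stripe).hex(), q_parity(stripe).hex())
```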
Those skilled in the art will appreciate that while the RAID grid in
In certain scenarios, the storage array may not include all of the storage modules that it is designed to accommodate. Continuing with the above example, while the storage array may be designed to accommodate six storage modules, only four storage modules may be present. For example, referring to
In one embodiment of the technology, the RAID controller (
In one embodiment of the technology, each column (or row) in the RAID grid may be associated with a particular storage module bay. For example, the storage module inserted into storage module bay 1 may be identified as SM 1 and associated with RAID grid locations in column 1 of the RAID grid (e.g., L1-L6 in
Turning to
Based on the above scenario, RAID grid locations L1-L6 are associated with SM 1, RAID grid locations L13-L18 are associated with SM 3, RAID grid locations L25-L30 are associated with SM 5, and RAID grid locations L31-L36 are associated with SM 6. Further, RAID grid locations L7-L12 and L19-L24 are flagged, signifying that they are not associated with a storage module that is present in the storage array.
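A minimal sketch of this flagging step is shown below, assuming (as in the scenario above) a six-column grid whose columns map to storage module bays 1-6, with bays 2 and 4 empty; the function and variable names are hypothetical.

```python
def flag_missing_module_locations(total_bays, rows, present_bays):
    """Associate each grid column with a storage module bay; flag the
    locations in columns whose bay has no storage module present."""
    associations = {}   # (row, column) -> bay number of the installed module
    flagged = set()     # (row, column) pairs with no storage module present
    for column in range(1, total_bays + 1):
        for row in range(1, rows + 1):
            if column in present_bays:
                associations[(row, column)] = column
            else:
                flagged.add((row, column))
    return associations, flagged

# Scenario from the text: bays 1, 3, 5, and 6 are populated; bays 2 and 4 are empty.
associations, flagged = flag_missing_module_locations(
    total_bays=6, rows=6, present_bays={1, 3, 5, 6})
print(len(flagged))  # 12 flagged locations (two empty columns of six rows each)
```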
In one embodiment of the technology, the RAID controller includes a data structure that tracks the mappings between data provided by the client and the physical address of such data in the storage array. In one embodiment of the technology, the RAID controller tracks the aforementioned information using a mapping between a logical address, e.g., <object, offset> (500), which identifies the data from the perspective of the client, and a physical address (502), which identifies the location of the data within the storage array. In one embodiment of the technology, the mapping may instead be between a hash value, derived by applying a hash function (e.g., MD5, SHA-1) to <object, offset>, and the physical address. Those skilled in the art will appreciate that any form of logical address may be used without departing from the technology.
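As a small, hypothetical illustration of the hash-based form of the logical address, the snippet below derives a key from <object, offset> with SHA-1 and uses it to look up a physical address; the layout of the physical address tuple is an assumption.

```python
import hashlib

def logical_key(obj, offset):
    """Derive a fixed-size key from <object, offset> (SHA-1 here; MD5 also works)."""
    return hashlib.sha1(f"{obj}:{offset}".encode()).hexdigest()

# Map the hashed logical address to a physical address in the storage array
# (here a hypothetical <storage module, channel, NAND die, block offset> tuple).
address_map = {logical_key("obj-42", 4096): (1, 0, 7, 4096)}
print(address_map[logical_key("obj-42", 4096)])
```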
In one embodiment of the technology, the RAID controller includes a data structure that tracks how each RAID grid location (504) (see e.g.,
Further, the RAID controller includes a data structure that tracks whether the RAID grid location (504) is associated with a flag (516). The RAID grid location may be associated with a flag when the RAID grid location corresponds to a physical location on a storage module that is not currently present in the storage array. In such cases, the RAID grid location (504) may be associated with a flag (516) but may not be associated with a physical address (502).
In one embodiment of the technology, the RAID controller includes a data structure that tracks which RAID grid (including RAID grids in the data portion and the parity portion) (508) is associated with which RAID cube (506) (assuming that the RAID controller is implementing a 3D RAID scheme) and also which RAID grid locations (504) are associated with each RAID grid (508).
In one embodiment of the technology, the RAID controller includes a data structure that tracks the state (510) of each RAID grid location (504). In one embodiment of the technology, the state (510) of a RAID grid location may be set as filled (denoting that data (or parity data) has been written to the RAID grid location) or empty (denoting that no data (or parity data) has been written to the RAID grid location). In one embodiment of the technology, the RAID controller may also set the state of the RAID grid location to “filled” if the RAID controller has identified data in the RAID controller to write to the RAID grid location.
In one embodiment of the technology, the RAID controller includes a data structure that tracks the RAID grid geometry. In one embodiment of the technology, the RAID grid geometry may include, but is not limited to, the size of RAID grid and the IFD associated with each dimension of the RAID grid. This data structure (or another data structure) may also track the size of the RAID cube and the IFD associated with each dimension of the RAID cube.
In one embodiment of the technology, the RAID controller includes a data structure that tracks the location of each P and Q parity value (including parity values within the intersection parity group (see e.g.,
In one embodiment of the technology, the RAID controller includes a data structure that tracks which RAID grid locations in the data portion of the RAID cube are used to calculate each of the P and Q parity values in the P Parity RAID grid and Q parity RAID grid, respectively.
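The sketch below gathers the tracking data structures described above into one illustrative, intentionally simplified form: a per-location record holding the physical address, flag, state, and RAID grid/RAID cube membership, plus grid geometry and parity bookkeeping. All names, types, and example values are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple

class State(Enum):
    EMPTY = "empty"    # no data (or parity data) written to the location
    FILLED = "filled"  # data (or parity data) written, or identified for writing

@dataclass
class GridLocation:
    raid_grid: int                     # RAID grid this location belongs to
    raid_cube: int                     # RAID cube the grid belongs to (3D scheme)
    physical_address: Optional[Tuple]  # None while the location is flagged
    flagged: bool                      # storage module not currently present
    state: State

@dataclass
class RaidGridGeometry:
    rows: int
    columns: int
    ifds: Tuple[str, str]              # IFD associated with each grid dimension

# Per-location tracking table (keys are RAID grid location ids).
locations = {
    1: GridLocation(0, 0, (1, 0, 7, 4096), False, State.FILLED),
    7: GridLocation(0, 0, None, True, State.EMPTY),   # flagged: SM absent
}

# Geometry and parity bookkeeping for grid 0 (values are illustrative).
geometry = {0: RaidGridGeometry(rows=6, columns=6,
                                ifds=("storage module", "NAND die"))}
parity_locations = {0: {"Pr1": 5, "Qr1": 6}}   # parity value -> grid location id
parity_inputs = {0: {"Pr1": [1, 2, 3, 4]}}     # data locations feeding Pr1
```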
Referring to
In step 604, the RAID controller updates one or more of the data structures (see e.g.,
The method shown in
Once the data grid portion of the RAID grid has been filled, the parity values in the RAID grid are calculated. Specifically, the following parity values may be calculated: Pr1, Pr2, Pr3, Pr4, Pc5, Pc6, Pc7, Pc8, Qr1, Qr2, Qr3, Qr4, Qc5, Qc6, Qc7, and Qc8.
Each of the parity values may be calculated in accordance with the method in
In one embodiment of the technology, the data associated with the data grid portion of the RAID grid (see e.g.,
In step 610, a RAID grid location in the data grid is selected. The selected RAID grid location is a RAID grid location in the data grid that may be associated with data to be used in the parity value calculation. Said another way, the selected RAID grid location is a location in the data grid that is associated with data or that is flagged.
In step 612, a determination is made about whether the selected RAID grid location is flagged. If the RAID grid location is flagged, the process proceeds to step 616; otherwise, the process proceeds to step 614.
If the RAID grid location is not flagged, then in step 614, data associated with the RAID grid location is loaded from the memory (
If the RAID grid location is flagged, then in step 616, a determination is made about whether the parity value being calculated is a P parity value. If the parity value being calculated is a P parity value (e.g.,
More specifically, for purposes of parity value calculations, any RAID grid location that is flagged is associated with a value of zero. A value of zero in the P parity value calculation does not change the resulting P parity value that is calculated. Accordingly, if the parity value being calculated is a P parity value, then the flagged RAID grid location may be ignored (or otherwise not considered) in the P parity value calculation. However, if the parity value being calculated is a Q parity value (e.g.,
In step 618, the flagged RAID grid location is tracked. Said another way, tracking information is generated and/or updated. The tracking information may be maintained in the cache, in a hardware register associated with the processor, and/or using any other mechanism that may be used to store the tracking information. The process then proceeds to step 620.
In step 620, a determination is made about whether there are any additional RAID grid locations to be processed in order to calculate the parity value. If there are additional RAID grid locations to process, the method proceeds to step 610; otherwise, the method proceeds to step 622.
In step 622, if the parity value being calculated is a P parity value, then the P parity value is calculated using the data loaded into the cache. However, if the parity value being calculated is a Q parity value, then the Q parity value is calculated using the data loaded into the cache in combination with the tracking information. More specifically, in one embodiment of the technology, the RAID controller includes functionality to use the tracking information to appropriately calculate the Q parity value by using the tracking information to identify which RAID grid locations should be associated with a value of zero for purposes of the Q parity value calculation.
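The following sketch follows the flow of steps 610-622 under the assumptions used earlier (P as a byte-wise XOR; Q as a GF(2^8) syndrome with g = 2 and reduction polynomial 0x11d): flagged locations are neither loaded nor considered for P, while for Q the tracking information is used to pair each loaded block with the power of g corresponding to its original grid position. Function and variable names are hypothetical.

```python
def gf_mul(a, b):
    # GF(2^8) multiply, reduction polynomial 0x11d (a common RAID-6 choice).
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return result & 0xff

def gf_pow(g, e):
    result = 1
    for _ in range(e):
        result = gf_mul(result, g)
    return result

def calculate_parity(row, kind):
    """row: list of data blocks, with None marking a flagged RAID grid location.
    kind: 'P' or 'Q'. Flagged locations are treated as holding all zeros."""
    loaded = []       # data blocks "loaded into cache", in row order (step 614)
    tracking = set()  # positions of flagged locations (step 618)
    for position, data in enumerate(row):
        if data is None:
            tracking.add(position)
        else:
            loaded.append(data)

    parity = bytearray(len(loaded[0]))
    if kind == 'P':
        # Step 622, P case: a zero value never changes an XOR, so the flagged
        # locations (and the tracking information) can simply be ignored.
        for data in loaded:
            for i, byte in enumerate(data):
                parity[i] ^= byte
    else:
        # Step 622, Q case: the tracking information recovers each loaded
        # block's original grid position so that it is paired with the
        # correct power of g.
        positions = [p for p in range(len(row)) if p not in tracking]
        for position, data in zip(positions, loaded):
            coeff = gf_pow(2, position)  # g^position, with g = 2
            for i, byte in enumerate(data):
                parity[i] ^= gf_mul(coeff, byte)
    return bytes(parity)

# Example row with the second and fourth locations flagged (storage modules absent).
row = [bytes([0x11, 0x22]), None, bytes([0x33, 0x44]), None]
print(calculate_parity(row, 'P').hex(), calculate_parity(row, 'Q').hex())
```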
Consider a scenario in which the storage array corresponds to the storage array shown in
Turning to the example, when the storage array is initialized, the RAID controller detects that only four of the six storage modules are present. Based on this detection, when the RAID controller initializes the RAID grid (700), it flags one set of RAID grid locations (708, 710) as not being associated with any storage module. Further, the RAID controller associates the non-flagged RAID grid locations with one of the storage modules (e.g., with one of SM 1, SM 3, SM 5, and SM 6).
Once the data grid has been filled (i.e., all non-flagged RAID grid locations in the data grid (712) have been associated with data), the RAID controller initiates the calculation of parity values. This example focuses on the parity value calculation of Pr1 and Qr1.
With respect to calculating Pr1, the RAID grid locations (L1-L4) in the same row (702) as Pr1 are processed in accordance with
With respect to calculating Qr1, the RAID grid locations (L1-L4) in the same row (702) as Qr1 are processed in accordance with
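Continuing the example under the same GF(2^8) assumptions (g = 2, reduction polynomial 0x11d), and assuming purely for illustration that L3 is the flagged location in row (702) and that the data values are single bytes, the check below shows why the tracking information matters for Qr1: renumbering the loaded values changes the result, whereas keeping them at their tracked original positions reproduces the value obtained by treating the flagged location as zero. Pr1, by contrast, is unaffected by simply skipping the flagged location.

```python
def gf_mul(a, b):
    # GF(2^8) multiply, reduction polynomial 0x11d.
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r & 0xff

def gf_pow(g, e):
    r = 1
    for _ in range(e):
        r = gf_mul(r, g)
    return r

def q_of(indexed_values):
    # Q = XOR over (i, v) of g^i * v, with g = 2 (single-byte toy version).
    q = 0
    for i, v in indexed_values:
        q ^= gf_mul(gf_pow(2, i), v)
    return q

L1, L2, L3, L4 = 0x0a, 0x1b, None, 0x2c   # L3 assumed flagged; values hypothetical

p_r1 = L1 ^ L2 ^ L4                       # Pr1: the flagged location is simply ignored

# Reference Qr1: treat the flagged location as zero and use every position.
q_reference = q_of([(0, L1), (1, L2), (2, 0), (3, L4)])
# Wrong: compact the loaded values and renumber their positions.
q_wrong = q_of([(0, L1), (1, L2), (2, L4)])
# Right: keep the loaded values at their tracked original positions.
q_tracked = q_of([(0, L1), (1, L2), (3, L4)])

assert q_tracked == q_reference
assert q_wrong != q_reference              # holds for these sample values
print(hex(p_r1), hex(q_reference))
```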
Embodiments of the technology enable the calculation of P and Q parity values in a manner that limits the number of cache loads and computation cycles required by the processor. The limited number of required cache loads is enabled by the detection of the presence and/or absence of various storage modules in combination with optimizations related to P and Q parity value calculations.
Those skilled in the art will appreciate that while various examples of the technology have been described with respect to storing data in a storage array along IFDs and/or storing data in NAND flash, embodiments of the technology may be implemented on any multi-dimensional disk array without departing from the technology. For example, one or more embodiments of the technology may be implemented using a two-dimensional array of disks (magnetic, optical, solid state, or any other type of storage device), where data for each RAID grid location in a RAID grid is stored on a separate disk.
Further, in one embodiment of the technology, in the event that the RAID controller is implementing a 3D RAID scheme using a two-dimensional array of disks, the RAID controller may store data for each of the RAID grid locations using the following n-tuple: <disk x, disk y, logical block address (LBA) z>, where x and y are the dimensions of the disk array. Further, within a given RAID grid, the LBA is the same for each RAID grid location; however, the LBA differs across the RAID grids in the RAID cube.
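A minimal sketch of the <disk x, disk y, LBA z> addressing described above is shown below, assuming a hypothetical layout in which each RAID grid in the cube is assigned its own LBA; the helper name and the blocks-per-grid parameter are illustrative.

```python
def physical_address(grid_index, row, column, base_lba=0, blocks_per_grid=1):
    """Map a RAID grid location in a RAID cube onto a 2D disk array.

    Each RAID grid location is stored on a separate disk identified by
    <disk x, disk y>; the LBA is the same for every location in one RAID
    grid and differs between the grids in the cube."""
    disk_x = column
    disk_y = row
    lba = base_lba + grid_index * blocks_per_grid
    return (disk_x, disk_y, lba)

# All locations of grid 0 share LBA 0; grid 1 uses LBA 1, and so on.
print(physical_address(0, row=2, column=3))  # (3, 2, 0)
print(physical_address(1, row=2, column=3))  # (3, 2, 1)
```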
The above examples for implementing embodiments of the technology using a two-dimensional disk array are not intended to limit the scope of the technology.
Those skilled in the art will appreciate that while the technology has been described with respect to a 2D RAID scheme and a 3D RAID scheme, embodiments of the technology may be extended to any multi-dimensional RAID scheme.
One or more embodiments of the technology may be implemented using instructions executed by one or more processors in the system. Further, such instructions may correspond to computer readable instructions that are stored on one or more non-transitory computer readable mediums.
While the technology has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this disclosure, will appreciate that other embodiments can be devised which do not depart from the scope of the technology as disclosed herein. Accordingly, the scope of the technology should be limited only by the attached claims.