This invention pertains to array-based distributed storage systems with parity functionality.
Array-based distributed storage systems are well known. These systems distribute data over two or more different disks to improve data access times, provide fault tolerance, or both. Distributed storage systems can employ different RAID configurations, as described in “A Case for Redundant Arrays of Inexpensive Disks (RAID),” by David Patterson et al., Proceedings of the ACM SIGMOD Conference, pp. 109-116 (1988), which is herein incorporated by reference.
One high performance distributed storage system is sold by Avid Technology, Inc. of Tewksbury, Mass. under the Unity ISIS® trade name. This system is described in more detail in U.S. Pat. Nos. 7,111,115 and 6,785,768 as well as in published application numbers 2007/0083723 and 2007/0136484, which are all herein incorporated by reference. In the ISIS® system a redundant copy of all data is stored on a different drive in an array. If a drive fails, therefore, the redundant copies can be used to reconstruct it.
In one general aspect, the invention features a data access method that includes directing data block write requests from different clients to different data storage servers based on a map. Data blocks referenced in the data block write requests are stored in the data storage servers. Data from the data write requests are also relayed to a parity server, and parity information is derived and stored for the blocks.
In preferred embodiments the method can further include independently generating the map by each of the clients. The step of independently generating the map by each of the clients can use a same predetermined permutation seed. The step of independently generating the map by each of the clients can generate a repetitive map. The step of independently generating the map by each of the clients can generate a map that is at least as long as the least common multiple of a number of data storage servers and a number of blocks for a super block for which parity is computed. The data storage servers and the parity server can be members of a group of storage servers and with the map defining which of the group members are data storage servers and which of the group members is a parity server for particular write requests. The map can change which members of the group are used as a parity server to distribute load resulting from the step of deriving. The steps of directing, storing, relaying and deriving can operate on a block size of a power of two Kilobytes. The method can further include the step of maintaining file system information that associates the blocks with files in a file system. The step of deriving parity information can operate according to a row-diagonal parity coding scheme. The steps of relaying and deriving can operate according to a single parity element. The steps of relaying and deriving can operate according to row and diagonal parity elements. Both row and diagonal parity can be calculated on one of the parity servers with the non-native parity being forwarded to the other parity server.
In another general aspect, the invention features an array-based distributed storage system with clients that each include map generation logic and a communication interface. A plurality of storage servers is also provided, which each include a communication interface responsive to the clients, data storage logic responsive to the communication interface, parity logic responsive to the communication interface, selection logic operative to determine whether to enable the data storage logic or the parity logic for a particular data block based on results of the map generation logic for that block, and relaying logic operative to relay a copy of a block to another of the servers in response to a determination by the selection logic that the data storage logic should be enabled for that block.
In preferred embodiments, the storage servers can each further include a local copy of the same map generation logic as do the clients, with the selection logic for each of the servers being responsive to its local map generation logic to determine whether to enable the data storage logic or the parity logic for a particular data block. The map generation logic can be operative to generate a map that distributes parity loading across the servers. The parity logic can operate according to a row-diagonal parity scheme. The parity logic can include native parity logic operative to derive and store a native parity block and non-native parity logic operative to derive and forward a non-native parity block to another one of the servers.
In a further general aspect, the invention features an array-based distributed storage system that includes means for directing a series of different data block write requests from one of a series of different clients to a plurality of different data storage servers based on a map, means for storing data blocks referenced in the data block write requests in the data storage servers, means for relaying the data from the data write requests to a parity server, and means for deriving and storing parity information for the series of blocks.
Systems according to the invention can provide for efficient storage access by providing a simple storage and parity server mapping method. Because this mapping method can be replicated on different servers, the need for inter-server communication can be reduced. And the method can be scaled across an arbitrary number of servers.
Systems according to the invention may also be advantageous in that they can distribute the load of parity determinations across an array of servers. This can improve the performance of file transfers and can eliminate a single RAID controller as a central bottleneck.
Referring to
Each of the servers 14 can act as a data storage server or a parity server. As is well known, the server's parity functionality provides redundant information for error correction in the case of a storage failure. In this embodiment, the parity functionality determines parity based on a well-known method described in “EVENODD: An Optimal Scheme for Tolerating Double Disk Failures in RAID Architectures,” by Mario Blaum et al., IEEE (1994), which is herein incorporated by reference.
The use of a single parity server by itself is sufficient to implement a system based on RAID-5, which is intended to tolerate the failure of a single storage server (known as a “blade”). In the event of such a failure, read access requests from the clients are serviced with data reconstructed from the parity data. An optional secondary parity server may also be provided in the case of a RAID-6 configuration.
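For concreteness, when the single parity block is computed as the XOR of the data blocks in its group, any one lost block can be rebuilt as the XOR of the parity block and the surviving blocks. The following is a minimal sketch of that recovery (Python; the function names are illustrative assumptions, not the system's actual interfaces):

```python
# Minimal sketch of single-parity (RAID-5-style) degraded-read recovery.
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of a list of equal-length blocks."""
    return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

def reconstruct(surviving_data, parity):
    """A lost data block equals the XOR of the parity block and the
    surviving data blocks of the same parity group."""
    return xor_blocks(surviving_data + [parity])

# Example: three data blocks and their parity; lose d1 and rebuild it.
d0, d1, d2 = b"\x01\x02", b"\x0f\x0f", b"\x10\x20"
parity = xor_blocks([d0, d1, d2])
assert reconstruct([d0, d2], parity) == d1
```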
In one embodiment, the servers are implemented with blades interconnected by an IP switch fabric, although they could of course also use a different communication protocol. Each of the blades includes a LINUX-based processor running custom software that controls two 512-gigabyte or 1-terabyte disk drives, although the system can handle disks of a variety of sizes. The system could of course also be based on other operating systems or even dedicated hardware, or a combination of both.
Referring to
Write access can be provided in a manner that seamlessly replaces a mirrored configuration. Specifically, duplicate writes normally directed to a mirrored server can be simply directed to the parity server(s) instead, without any significant changes to the client software.
A client normally writes to or reads from a primary server, and in an error case it will fail over to the parity server. In the write case, a primary server will forward the data from the client to the parity server for that given set of data. Once a parity server has all of the blocks required to generate parity data, it will do so and write the parity data to its internal store, although it can also store and manage partially complete parity blocks. In the case of RAID-6, one parity device will calculate both row and diagonal parity, write the “native” block to its internal store, and forward the other parity block to the secondary parity device, where it will be stored.
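The following sketch illustrates this write path (Python; the class and method names are assumptions, and the diagonal-parity computation is stubbed rather than implementing an actual row-diagonal code such as EVENODD):

```python
# Sketch of the forwarding flow: a parity server accumulates blocks relayed
# by primary servers, computes and stores its "native" row parity, and (in
# RAID-6) derives and forwards a second, "non-native" parity block.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class SecondaryParityServer:
    def __init__(self):
        self.disk = {}                        # stand-in for the internal store

    def store_parity(self, sb: int, block: bytes) -> None:
        self.disk[("diag", sb)] = block       # non-native parity is just stored

class ParityServer:
    def __init__(self, blocks_per_super_block: int, secondary=None):
        self.n = blocks_per_super_block
        self.secondary = secondary            # present only in RAID-6
        self.pending = {}                     # super block id -> received blocks
        self.disk = {}

    def accept(self, sb: int, data: bytes) -> None:
        """Called by a primary server relaying a copy of a client write."""
        self.pending.setdefault(sb, []).append(data)
        if len(self.pending[sb]) == self.n:   # all blocks for this super block
            blocks = self.pending.pop(sb)
            row = blocks[0]
            for b in blocks[1:]:
                row = xor(row, b)
            self.disk[("row", sb)] = row      # native parity stays local
            if self.secondary is not None:    # RAID-6: derive and forward
                self.secondary.store_parity(sb, self.diagonal(blocks))

    def diagonal(self, blocks):
        # Stub: a real implementation would compute diagonal parity per a
        # row-diagonal scheme such as EVENODD; elided in this sketch.
        raise NotImplementedError
```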
Referring to
Referring to
Maps are generated on a per-file basis and must satisfy the requirements and properties discussed below. The following properties are defined:
S=number of servers in the system
N=number of parity blocks per super block (1 for RAID-5, 2 for RAID-6)
D=number of data blocks per super block
M=number of blocks in a super block including parity (D+N)
B=number of elements in the B field (S*M)
A map is generated by first obtaining a random permutation over the available servers using the same pseudo random technique used for ISIS® map generation (see US published application no. US2007/0073990, entitled “Distribution of Data in a Distributed Shared Storage System,” published Mar. 29, 2007, which is herein incorporated by reference). This permutation is based on a seed obtained from the system director 18 at power-up and can be called P[0 . . . S−1]. A B field can then be defined to consist of M of these permutations laid down one after the other in order. To ensure even distribution of parity, this B field is replicated M times and will assign the nth element of each super block to parity (and possibly the n+1 element as well in the RAID-6 case), where n is the B field replication number from 0 . . . M−1 (see
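The construction can be illustrated as follows (Python; a seeded Fisher-Yates shuffle stands in for the ISIS® permutation technique, which is not reproduced here):

```python
# Sketch: build the permutation P[0..S-1] from the shared seed, then lay M
# copies end to end to form the B field. Because the seed is common, every
# client and server can regenerate the identical map with no communication.
import random

def make_permutation(seed: int, S: int) -> list:
    P = list(range(S))
    random.Random(seed).shuffle(P)   # stand-in for the ISIS technique
    return P

def make_b_field(P: list, M: int) -> list:
    return P * M                     # M permutations, one after the other

# Example: S=9 servers and M=5 gives a B field of length 45, matching B=45
# in the worked example below (seed=42 is an arbitrary illustration).
B_field = make_b_field(make_permutation(seed=42, S=9), M=5)
assert len(B_field) == 45
```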
With the extended map consisting of M*B elements, the following equations enable a client to locate the data and parity servers for a given file block offset F.
SB=super block number=F/D=file block offset/data blocks per super block
SBR=super block remainder=F%D=offset of this data block with respect to the other data blocks in its super block
BBN=big block number=((SB*M)/B)%M
A=absolute offset in super block including parity=(SBR<BBN) ? SBR : SBR+N (a special case is required where N=2 on the last big block (BBN=M−1): A=SBR+1)
O=offset into permutation for this server=((SB*M)+A)%S
P=offset into permutation for first parity server=((SB*M)+BBN)%S
P2=offset into permutation for second parity server (if N>1)=((SB*M)+BBN+1)%S
So, the server holding the data block associated with F would be:
P[((F/D*M) + ((F%D < ((F/D*M)/B)%M) ? F%D : F%D+N)) % S]
And the server holding the first (row) parity block for the super block in which F lies would be:
P[((F/D*M) + (((F/D*M)/B)%M)) % S]
For an illustrative RAID-5 map for a set of 9 servers with a RAID block size of 5 (4 data+1 parity):
S=9, N=1, M=5, D=4, B=45 (see FIG. 5)
To find the server associated with file block 25:
SB=F/D=25/4=6
SBR=F%D=25%4=1
BBN=((SB*M)/B)%M=((6*5)/45)%5=0
A=(SBR<BBN)?SBR:SBR+N=(1<0)?1:1+1=2
O=((SB*M)+A)%S=((6*5)+2)%9=5
The data block will therefore be stored on server P[5].
To find the parity server associated with file block 25:
P=((SB*M)+BBN)%S=((6*5)+0)%9=3
The parity block would therefore be stored on server P[3].
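The equations above translate directly into code. The following sketch (Python; // denotes integer division, and the symbol names follow the text) reproduces this worked example:

```python
# The mapping equations, transcribed directly (illustrative sketch).

def data_server_offset(F, S, M, N, D, B):
    """O: offset into the permutation P of the server holding file block F."""
    SB = F // D                       # super block number
    SBR = F % D                       # offset among the super block's data blocks
    BBN = ((SB * M) // B) % M         # big block number
    if N == 2 and BBN == M - 1:       # special case: parity wraps on last BBN
        A = SBR + 1
    elif SBR < BBN:
        A = SBR                       # before the parity slot(s)
    else:
        A = SBR + N                   # skip over the parity slot(s)
    return ((SB * M) + A) % S

def parity_server_offset(F, S, M, D, B, second=False):
    """P (or P2 when second=True): offset into the permutation of the first
    (row) or second (diagonal) parity server for F's super block."""
    SB = F // D
    BBN = ((SB * M) // B) % M
    return ((SB * M) + BBN + (1 if second else 0)) % S

# Worked example from the text: S=9, N=1, M=5, D=4, B=45, file block F=25.
assert data_server_offset(25, S=9, M=5, N=1, D=4, B=45) == 5    # data at P[5]
assert parity_server_offset(25, S=9, M=5, D=4, B=45) == 3       # parity at P[3]
```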
A test program was run using the method presented above for S=9, M=5, N=1, for a set of 1000 blocks.
A test program was run using the method presented above for S=50, M=6, N=2, for a set of 1000 blocks.
The rules for redistribution are as follows:
1. Any replaced block must be moved to a server that does not already hold another block from the same super block
2. Data movement and subsequent parity generation should be minimized
3. In systems where disks are replaceable without removal of a micro-server, replacement is desirable over redistribution
a. Redistribution should not be automatic if RAID is enabled
b. Replacement should be done with minimal communication with other servers
4. In systems where disks and micro-servers are bound together in a field replaceable unit (FRU), redistribution would be desirable over replacement (remove server first, re-distribute data, replace server later).
a. Redistribution could be automatic or not depending on customer requirements.
b. The system would be able to restore itself to a fully protected state very quickly when a server is removed. Adding a server back would be slower, but this operation is not time critical (no chance of losing data).
Referring to
When a server receives a block (step 60) it first determines whether it has been assigned to act as a data storage server or a parity server. It can make this determination based on a version of the map that it derives locally (step 62), or it can examine header information that the client provides based on its map. Once it has determined that it is a data storage server, it stores the block and copies it to the appropriate parity server (step 64). The location of the appropriate parity server can be determined from the map or from header data.
When a server receives a copied block (step 70) it first determines whether it has been assigned to act as a data storage server or a parity server. It can make this determination based on a version of the map that it derives locally (step 72), or it can examine header information that the client provides based on its map. Once it has determined that it is a parity server, it determines and stores the parity information for the block (step 74). These operations are completed for each block in a full or partial super block (see step 76). In a RAID-6 implementation, the parity server calculates both row and diagonal parity and forwards the diagonal parity to the appropriate second parity server.
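Expressed in code, and reusing the shared permutation P and the offset functions sketched earlier, the local role determination might look as follows (an illustrative assumption, not the system's actual interface):

```python
# Sketch: a server decides its role for file block F by re-deriving the map
# locally (reuses data_server_offset/parity_server_offset from above).
def role_for_block(my_server_id, P, F, S, M, N, D, B):
    if my_server_id == P[parity_server_offset(F, S, M, D, B)]:
        return "parity"        # derive and store parity for this super block
    if my_server_id == P[data_server_offset(F, S, M, N, D, B)]:
        return "data"          # store the block and copy it to the parity server
    return "uninvolved"
```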
In the illustrative embodiment, the servers maintain a linked list of partially complete parity blocks. Entries in the list are created when the first block in a super block is received, and they are removed from the list when the parity block is complete and ready to be stored on disk. Partially complete parity blocks are stored after entries remain on the list for longer than a specified period.
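The following is a minimal sketch of that bookkeeping (Python; an OrderedDict stands in for the linked list, and the names and flush policy details are assumptions):

```python
# Sketch of tracking partially complete parity blocks: an entry is created on
# the first block of a super block, removed when parity is complete, and
# flushed to disk if it lingers past a time limit.
import time
from collections import OrderedDict

class PartialParityList:
    def __init__(self, blocks_per_super_block, max_age_seconds):
        self.n = blocks_per_super_block
        self.max_age = max_age_seconds
        self.partial = OrderedDict()  # super block id -> (created, parity, count)

    def add_block(self, sb, data):
        created, parity, count = self.partial.get(
            sb, (time.monotonic(), bytes(len(data)), 0))
        parity = bytes(x ^ y for x, y in zip(parity, data))
        if count + 1 == self.n:                  # parity complete: store it
            self.partial.pop(sb, None)
            self.write(sb, parity, complete=True)
        else:
            self.partial[sb] = (created, parity, count + 1)

    def flush_stale(self):
        """Store partially complete parity entries older than the limit."""
        now = time.monotonic()
        for sb in [k for k, (t, _, _) in self.partial.items()
                   if now - t > self.max_age]:
            _, parity, _ = self.partial.pop(sb)
            self.write(sb, parity, complete=False)

    def write(self, sb, parity, complete):
        pass  # stand-in for the server's internal store
```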
The flowcharts presented above represent an overview of the operation of the illustrative embodiment. But one of ordinary skill in the art would recognize that other approaches to implementing the inventive concepts in this application could result in somewhat different breakdowns of steps without departing from the spirit and scope of the invention. A server could use parallelized hardware, for example, to simultaneously send different blocks to different servers based on a single map derivation step. Other minor features and optimizations, such as the details of handling partial blocks, are not shown because one of ordinary skill would readily be able to implement them without undue experimentation.
Referring to
The present invention has now been described in connection with a number of specific embodiments thereof. However, numerous modifications which are contemplated as falling within the scope of the present invention should now be apparent to those skilled in the art. It is therefore intended that the scope of the present invention be limited only by the scope of the claims appended hereto. In addition, the order of presentation of the claims should not be construed to limit the scope of any particular term in the claims.
| Number | Name | Date | Kind |
|---|---|---|---|
| 5210866 | Milligan et al. | May 1993 | A |
| 5371882 | Ludman et al. | Dec 1994 | A |
| 5469453 | Glider et al. | Nov 1995 | A |
| 5473362 | Fitzgerald et al. | Dec 1995 | A |
| 5511177 | Kagimasa et al. | Apr 1996 | A |
| 5537567 | Galbraith et al. | Jul 1996 | A |
| 5644720 | Boll et al. | Jul 1997 | A |
| 5712976 | Falcon, Jr. et al. | Jan 1998 | A |
| 5734925 | Tobagi et al. | Mar 1998 | A |
| 5757415 | Asamizuya et al. | May 1998 | A |
| 5790773 | DeKoning et al. | Aug 1998 | A |
| 5829046 | Tzelnic et al. | Oct 1998 | A |
| 5911046 | Amano | Jun 1999 | A |
| 5915094 | Kouloheris et al. | Jun 1999 | A |
| 5926649 | Ma et al. | Jul 1999 | A |
| 5933603 | Vahalia et al. | Aug 1999 | A |
| 5949948 | Krause et al. | Sep 1999 | A |
| 5950015 | Korst et al. | Sep 1999 | A |
| 5959860 | Styczinski | Sep 1999 | A |
| 5978863 | Dimitrijevic et al. | Nov 1999 | A |
| 6021408 | Ledain et al. | Feb 2000 | A |
| 6061732 | Korst et al. | May 2000 | A |
| 6070191 | Narendran et al. | May 2000 | A |
| 6134596 | Bolosky et al. | Oct 2000 | A |
| 6138221 | Korst et al. | Oct 2000 | A |
| 6185621 | Romine | Feb 2001 | B1 |
| 6282670 | Kalman et al. | Aug 2001 | B1 |
| 6374336 | Peters et al. | Apr 2002 | B1 |
| 6415373 | Peters et al. | Jul 2002 | B1 |
| 6449688 | Peters et al. | Sep 2002 | B1 |
| 6646576 | Delvaux et al. | Nov 2003 | B1 |
| 6760808 | Peters et al. | Jul 2004 | B2 |
| 6785768 | Peters et al. | Aug 2004 | B2 |
| 7111115 | Peters et al. | Sep 2006 | B2 |
| 7487309 | Peters et al. | Feb 2009 | B2 |
| 20020124137 | Ulrich et al. | Sep 2002 | A1 |
| 20030097518 | Kohn et al. | May 2003 | A1 |
| 20030126523 | Corbett et al. | Jul 2003 | A1 |
| 20030149750 | Franzenburg | Aug 2003 | A1 |
| 20050055521 | Saika | Mar 2005 | A1 |
| 20050097270 | Kleiman et al. | May 2005 | A1 |
| 20050165617 | Patterson et al. | Jul 2005 | A1 |
| 20060212625 | Nakagawa et al. | Sep 2006 | A1 |
| 20060248378 | Grcanac et al. | Nov 2006 | A1 |
| 20080147678 | Peters et al. | Jun 2008 | A1 |
| 20100020820 | Jones | Jan 2010 | A1 |
| Number | Date | Country |
|---|---|---|
| 0780765 | Aug 2003 | EP |
| Entry |
|---|
| Asami, Satoshi et al., “The Design of Large-Scale, Do-It-Yourself RAIDs”, Nov. 10, 1995, pp. 1-30. |
| Birk, Yitzhak, “Random RAIDs with Selective Exploitation of Redundancy for High Performance Video Servers”, EE Dept. of Israel Institute of Technology, 1997 IEEE, pp. 13-23. |
| Brubeck et al., “Hierarchical Storage Management in a Distributed VOD System”, IEEE Multimedia 1996, pp. 37-47. |
| Chen, Peter et al., “RAID: High Performance, Reliable Secondary Storage”, ACM Computing Surveys, vol. 26, No. 2, pp. 145-185, Jun. 1994. |
| Massiglia, Paul, The Raidbook, “A Source Book for Disk Array Technology”, Fourth Ed., Aug. 1994, pp. ii-45. |
| Stephenson et al., “Mass Storage Systems for Image Management and Distribution”, IEEE Symposium on Mass Storage Systems, 1993, pp. 233-240. |
| Ying-Dar Lin et al., “A Hierarchical Network Storage Architecture for Video-on-Demand Services”, IEEE Transactions on Computers, 1996, pp. 355-364. |
| Chen, Peter et al., “Striping in a RAID Level 5 Disk Array”, ACM Computing Surveys, 1995, pp. 136-145. |
| Pease et al., “IBM Storage Tank, A Distributed Storage System”, IBM Almaden Research Center; R.C. Burns: Johns Hopkins Univ.; D.D.E. Long: Univ. of California, Santa Cruz, Jan. 24, 2002, pp. 1-7. |
| Drapeau, A.L. et al., “Striped Tape Arrays”, IEEE Symposium on Mass Storage Systems, 1993, pp. 257-265. |
| Liu et al., “Performance of a Storage System For Supporting Different Video Types and Qualities”, IEEE Journal on Selected Areas in Communications, 1996, pp. 1314-1331. |
| Birk, Y., “Deterministic Load-Balancing Schemes For Disk-Based Video-on-Demand Storage Servers”, IEEE Symposium on Mass Storage Systems, 1995, pp. 17-25. |
| Triantafillou et al., “Overlay Striping and Optimal Parallel I/O For Modern Applications”, Parallel Computing, 1997, pp. 21-43. |
| Buddhikot et al., “Design of Large Scale Multimedia Storage Server”, Computer Networks and ISDN Systems, 1994, pp. 503-517. |
| Flynn, R., et al., “Disk Striping and Block Replication Algorithms for Video File Servers”, IEEE Proceedings of Multimedia Applications, 1996, pp. 590-597. |
| Ganger, G.R. et al., “Disk Subsystem Load Balancing: Disk Striping vs Conventional Data Placement”, IEEE Transactions on Computers, 1993, pp. 40-49. |
| Tewari, R., et al., “High Availability in Clustered Multimedia Servers”, IEEE Transactions on Computers, 1996, pp. 645-654. |
| “Method to Deliver Scalable Video Across A Distributed Computer System”, IBM Technical Disclosure, May 1994, pp. 251-256. |
| Number | Date | Country |
|---|---|---|
| 20090216832 A1 | Aug 2009 | US |