The present invention relates to the field of digital data storage and more specifically to an encoding scheme for distributed encoding and storage systems, such as RAID storage systems.
Commercial mass data storage has become a vital part of the modern economy. Thousands of companies rely on secure, fault-proof data storage to serve their customers.
Data storage in commercial settings typically provides for some form of data protection to guard against accidental data loss due to component and device failures, and man-made or natural disasters. The simplest form of protection is known as redundancy. Redundancy involves making multiple copies of the same data and then storing the copies on separate physical drives. If a failure occurs on one drive, the data may be recovered simply by accessing it on another drive. This is obviously costly in terms of physical storage requirements.
More advanced recovery systems use a redundant array of independent disks (RAID). RAID systems typically utilize erasure encoding to mitigate accidental loss of data. Erasure encoding breaks a block of data into n equal-sized symbols and adds m parity symbols. Thus, a RAID system stores n+m symbols and can tolerate the failure of any m of those symbols.
In such RAID storage systems, given k information disks or storage devices, a straightforward RAID encoding involves generating one parity disk—a (k+1)th disk—as an XOR of the bits in identical positions in the k storage devices. If any one of the k disks fails, it can be reconstructed by XOR-ing the contents of the remaining (k−1) disks and the parity disk. The code is maximum-distance separable (MDS) in the sense that the number of disks that can be reconstructed equals the number of parity disks which, in this case, is one. The well-known Reed Solomon (RS) code preserves the MDS property, that is, it allows reconstruction of as many disks as the number of parity disks used, but does not rely solely on XOR operations for data reconstruction.
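The single-parity XOR scheme described above can be sketched as follows; the function names and the byte-sized "disks" are illustrative only, not part of any claimed embodiment.

```python
# Minimal sketch of single-parity XOR RAID, assuming one byte per "disk".
def make_parity(disks):
    """Parity is the XOR of the bits in identical positions across all disks."""
    p = 0
    for d in disks:
        p ^= d
    return p

def reconstruct(disks, parity, failed):
    """Rebuild the failed disk by XOR-ing the surviving disks with the parity disk."""
    v = parity
    for i, d in enumerate(disks):
        if i != failed:
            v ^= d
    return v

data = [0x5A, 0x13, 0xC7, 0x2E]   # k = 4 data disks
parity = make_parity(data)
assert reconstruct(data, parity, 2) == 0xC7
```

Because XOR is its own inverse, XOR-ing the survivors with the parity cancels every term except the failed disk's contribution.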
Erasure encoding techniques such as Reed-Solomon codes require substantial computational resources, because they rely on arithmetic on symbols from a finite field of size 2m (where m is the number of bits in each symbol), also called an extension field GF(2m), instead of arithmetic on the bits {0,1} of a base field GF(2), from which the extension field is formed. The advantage of arithmetic based on a GF(2) field is that the arithmetic operations can be performed using simple XOR gates.
It would be desirable to encode data using an encoding technique that adheres to the three desirable properties described above, i.e., the code is MDS, it is able to correct multiple disk failures in a storage system, and it avoids the use of complex arithmetic.
The embodiments described herein relate to an apparatus, system and method for data encoding, storage, retrieval and decoding. In one embodiment, a distributed data encoding and storage method is described that provides for XOR-only decoding, comprising generating an information vector from received data, the information vector comprising information symbols, generating a codeword from the information vector, the codeword comprising information symbols and parity symbols, and distributing the information symbols and the parity symbols to a plurality of storage mediums, respectively, wherein the parity symbols are formed by multiplying the information vector by a portion of a binary encoder matrix, the binary encoder matrix comprising a binary representation of an encoding matrix in extension field form.
In another embodiment, a method for data recovery in a distributed data storage system is described, comprising retrieving a plurality of information symbols and a plurality of parity symbols from a plurality of storage mediums, the plurality of information symbols and plurality of parity symbols comprising a codeword formed from a binary information vector and a binary encoder matrix, wherein the binary encoder matrix comprises a binary representation of an identity matrix concatenated with a Cauchy matrix, determining that at least one of the information symbols failed, identifying a sub-matrix within the Cauchy matrix, the sub-matrix identified in accordance with an identity of the failed information symbols, computing an inverse matrix from the sub-matrix, generating a column vector from codeword symbols that did not fail, and multiplying the inverse matrix by the column vector.
The features, advantages, and objects of the present invention will become more apparent from the detailed description as set forth below, when taken in conjunction with the drawings in which like referenced characters identify correspondingly throughout, and wherein:
Certain aspects and embodiments of this disclosure are provided below. Some of these aspects and embodiments may be applied independently and some of them may be applied in combination as would be apparent to those of skill in the art. In the following description, for the purposes of explanation, specific details are set forth in order to provide a thorough understanding of embodiments of the invention. However, it will be apparent that various embodiments may be practiced without these specific details. The figures and description are not intended to be restrictive.
The ensuing description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the ensuing description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing an exemplary embodiment. It should be understood that various changes may be made in the function and arrangement of elements without departing the spirit and scope of the invention as set forth in the appended claims.
Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, circuits, systems, networks, processes, and other components may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known circuits, processes, algorithms, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments.
Also, it is noted that individual embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed, but could have additional steps not included in a figure. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, its termination can correspond to a return of the function to the calling function or the main function.
The terms “computer-readable medium”, “memory”, and “storage medium” include, but are not limited to, portable or non-portable storage devices, optical storage devices, and various other mediums capable of storing, containing, or carrying instructions and/or data. These terms each may include a non-transitory medium in which data can be stored and that does not include carrier waves and/or transitory electronic signals propagating wirelessly or over wired connections. Examples of a non-transitory medium may include, but are not limited to, a magnetic disk or tape, optical storage media such as compact disk (CD) or digital versatile disk (DVD), flash memory, RAM, ROM, disk drives, etc. A computer-readable medium or the like may have stored thereon code and/or machine-executable instructions that may represent a procedure, a function, a subprogram, a program, a routine, a subroutine, a module, a software package, a class, or any combination of instructions, data structures, or program statements. A code symbol may be coupled to another code symbol or a hardware circuit by passing and/or receiving information, data, arguments, parameters, or memory contents. Information, arguments, parameters, data, etc. may be passed, forwarded, or transmitted via any suitable means including memory sharing, message passing, token passing, network transmission, or the like.
Furthermore, embodiments may be implemented by hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code, i.e., “processor-executable code”, or code symbols to perform the necessary tasks (e.g., a computer-program product) may be stored in a computer-readable or machine-readable medium. A processor(s) may perform the necessary tasks.
The embodiments described herein provide specific improvements to a data storage and retrieval system. For example, the embodiments allow the storage and retrieval system to recover data stored in one or more storage mediums in the event of erasures, or errors, due to, for example, media failures or noise, using only XOR arithmetic. Using XOR arithmetic avoids the use of complex arithmetic, such as polynomial calculations rooted in Galois field theory, as is the case with traditional error decoding techniques such as Reed-Solomon. Limiting the calculations to only XOR arithmetic improves the functionality of a data storage and retrieval system, because it allows the use of cheaper, less-powerful processors, and results in faster storage and retrieval than techniques known in the art.
Given k storage mediums, referred to herein as “disk drives” or simply “disks”, prior art RAID encoding involves generating a parity data symbol from an XOR of respective codeword symbols stored in identical positions in the k storage devices, and storing the parity data symbol on a (k+1)th disk (a parity disk). If any one of the k disks fails, data stored in that disk can be reconstructed by XOR-ing the contents of the disks that did not fail and the parity disk. This simple XOR coding technique is maximum distance separable (MDS) in the sense that the number of disks that can be reconstructed equals the number of parity disks, which is one. However, increasing the correction capabilities of such a prior art RAID storage system to more than one disk, while preserving the MDS property, requires complex computations that slow access times. For example, the well-known Reed-Solomon (RS) code preserves the MDS property—that is, it allows reconstruction of as many disks as the number of parity disks that are used—but does not rely solely on XOR operations for reconstruction. By treating each sequence of m bits—for a chosen integer m—as one of the possible 2m symbols, the Reed-Solomon decoding algorithm relies on calculations on symbols from a finite field of size 2m, also called an extension field GF(2m), instead of simple arithmetic using values of “1” and “0” in a base field GF(2), from which the extension field is formed. This complex coding scheme requires expensive, computationally intensive processing while also increasing the time needed to both encode and decode data.
Encoder 206 receives the blocks from input/output data transfer logic 204 and encodes each block using a special coding technique that preserves an MDS property, allows for data recovery in the event of multiple disk failures, and does not rely on complex mathematical equations, such as polynomials in an extension field, as prior-art coding techniques do. As such, encoding (and decoding) may be performed by simple logic gates, and decoding may be performed without having to use complex mathematical equations.
Encoding each block yields a codeword, comprising equal-sized information symbols and parity symbols. The symbols are then distributed (i.e., stored) across an equal number of storage mediums 108. In one embodiment, the codeword is systematic, meaning that the information symbols of the codeword are identical to the block from which it is formed, and the parity symbols are separate and appended to the information symbols. In this embodiment, each data symbol and parity symbol is stored, respectively, in a respective storage medium 108.
For example, a systematic MDS codeword of length n=(q−1) may be defined, where q=2m, and m is equal to the number of bits per symbol. In coding terminology, a systematic codeword of length n comprises k information symbols followed by (n−k) parity symbols. So, if m is chosen, for example, equal to 4, each codeword will have a length of n=15 symbols. If k is chosen as 12 information symbols, the number of parity symbols is (n−k)=15−12=3, and the number of storage mediums is equal to n, in this case 15. Using this example, a decoder can then recover up to 3 failed information symbols in each retrieved codeword from the plurality of storage mediums 108. In another example, in order to recover 4 failed information symbols, (n−k)=4, and thus k=15−4=11 information symbols in each codeword. In general, a decoder can correct up to as many failed information symbols as there are parity symbols in each codeword.
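The parameter relationships in the example above can be sketched as follows; the variable names are illustrative only.

```python
# Parameter relationships for the running example (illustrative names).
m = 4                # bits per symbol
n = 2 ** m - 1       # codeword length in symbols: 15
k = 12               # information symbols
parity = n - k       # parity symbols, and the number of recoverable failures
assert (n, parity) == (15, 3)

# To tolerate 4 failures instead, n stays 15 and k shrinks:
assert n - 4 == 11   # k becomes 11 information symbols
```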
Encoding the blocks may comprise converting each symbol in a block into binary form, creating a binary information vector. Then, the binary information vector is multiplied by a binary encoder matrix formed from an encoding matrix, the encoding matrix comprising an identity matrix concatenated with a special Cauchy matrix. A Cauchy matrix is a matrix where every square sub-matrix is invertible, and where every square sub-matrix is itself a Cauchy matrix. The elements of both the identity matrix and the special Cauchy matrix comprise members of an extension field GF(2m). Such an encoding matrix 300 is shown in
In one embodiment, the binary encoder matrix is generated by processor 200 from the encoding matrix 300, and the binary encoder matrix is stored in memory 202. Alternatively, a separate computer generates the binary encoder matrix from the encoding matrix 300, and it is then provided to data storage server 104 for storage in memory 202. Generation of the elements in the encoding matrix and the binary encoder matrix is provided in greater detail later herein.
After one or more codewords have been generated and stored across storage mediums 108a-108n by encoder 206, at some time later, a request to retrieve data may be received by data storage and retrieval server 104 from one of the hosts 102. In response, for each codeword, decoder 208 retrieves codeword symbols and one or more parity symbols in parallel from the storage mediums 108a-108n. The data and parity symbols are assembled to form a retrieved codeword, and then the retrieved codeword is decoded by decoder 208 using XOR arithmetic, avoiding complex arithmetic related to the use of polynomial extension fields, as is the case with traditional error decoding techniques used in such error correction codes such as Reed-Solomon. Limiting the calculations to only XOR arithmetic improves the functionality of data storage server 104, because it allows the use of cheaper, less-powerful processors, and results in faster storage and retrieval than techniques known in the art. Data storage and retrieval server 104 can tolerate simultaneous storage medium failures, up to the number of parity symbols used in each codeword, in this example, 3 storage medium failures. The decoding process is described in greater detail later herein.
In general, each of the functional blocks shown in
At block 400, various coding parameters are defined, taking into account things such as a number of storage mediums 108 available to data storage and retrieval system 100, the cost of such storage mediums, a desired encoding/storage speed, a desired retrieval/decoding speed, processing capabilities of a given processor 200, encoder 206 and decoder 208, and other constraints. For the remainder of this discussion, the parameters used in the example above with respect to
At block 402, an encoding matrix is defined, for example, the encoding matrix G as shown in
The identity matrix 302 comprises elements “0” and “−1”, which refer to powers of a primitive element α (with “−1” denoting the zero element, as explained below). Given an integer m (i.e., the number of bits in each symbol), an extension field of size 2m, denoted as GF(2m), may be formed from a binary alphabet having elements {0,1}, or the base field GF(2). The extension field consists of 2m m-bit vectors, and each vector may be represented by a power of a primitive element α of a primitive polynomial, ranging from −1 to (2m−2), or from −1 to 14 in this example, where the power “−1” is reserved for the all-zero vector. A table of 4-bit vectors as powers of a primitive element α in an extension field GF(24) formed from the primitive polynomial 1+x+x4 is shown in
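The generation of such a table of powers of α can be sketched as follows; the bit ordering (coefficient ci in bit i, i.e., the vector [c0 c1 c2 c3] read low coefficient first) is an assumption consistent with the addition example given below.

```python
# Sketch: generate the 15 nonzero elements of GF(2^4) as successive powers of
# alpha, using the primitive polynomial 1 + x + x^4 (so alpha^4 = alpha + 1).
# Bit i of each integer holds coefficient ci, i.e. the vector [c0 c1 c2 c3].
def gf16_exp_table():
    exp, v = [], 1
    for _ in range(15):
        exp.append(v)
        v <<= 1             # multiply by alpha
        if v & 0x10:        # degree 4 appeared: substitute x^4 = x + 1
            v ^= 0x13       # clears bit 4 and XORs in (x + 1)
    return exp

exp = gf16_exp_table()
assert exp[5] == 0b0110     # alpha^5 -> vector [0 1 1 0]
assert exp[6] == 0b1100     # alpha^6 -> vector [0 0 1 1]
assert len(set(exp)) == 15  # alpha is primitive: all 15 powers are distinct
```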
At block 404, the special Cauchy matrix 304 is determined, using the field element addition, multiplication, inverse, and division operations described below.
Addition of Field Elements:
The addition of two field elements is a bit-by-bit XOR addition of the corresponding vectors. As an example, the addition of the two field elements α5+α6 is a bit-by-bit XOR addition of the two corresponding vectors [0 1 1 0] and [0 0 1 1]=[0 1 0 1], which in turn, refers to α9. Thus, α5+α6=α9. Another way to do addition is to take the matrix representation of each field element and perform an XOR addition of the two matrices. The resulting matrix can be mapped back either to the power of α, or to its corresponding vector. Again as an example,
Given that the addition operation is straightforwardly defined as above, a table of pre-computed field element additions may be stored, in one embodiment in memory 202, comprising a matrix or table of size 2m×2m, where the table elements represent the additions between every pair of field elements in
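A pre-computed addition table of the kind described can be sketched as follows; indexing by powers of α, with −1 marking the zero element, is one possible layout and is an assumption of this sketch.

```python
# Sketch of a pre-computed field-addition table indexed by powers of alpha.
# Entry add_pow[i][j] is the power of (alpha^i + alpha^j); -1 marks the zero
# element, which results whenever i == j (since x XOR x = 0).
def build_add_table():
    exp, log, v = [0] * 15, {}, 1
    for i in range(15):
        exp[i], log[v] = v, i
        v <<= 1
        if v & 0x10:
            v ^= 0x13   # primitive polynomial 1 + x + x^4
    return [[-1 if i == j else log[exp[i] ^ exp[j]] for j in range(15)]
            for i in range(15)]

add_pow = build_add_table()
assert add_pow[5][6] == 9   # alpha^5 + alpha^6 = alpha^9, as in the example
```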
Multiplication of Field Elements:
Unlike addition, multiplication is performed using the power-of-α representation itself. The powers are simply added together, and the sum is reduced modulo (2m−1). As an example, referring to
Thus, there are three possible representations of an extension field element: (1) as a power of α, (2) as a vector, and (3) as a matrix. Moreover, there is more than one way to perform addition and multiplication between field elements.
The products between pairs of field elements may be pre-computed to create a 2m×2m matrix—or table—that may simply be looked up whenever multiplication between field elements is needed.
Inverse of a Field Element:
In an extension field, αp=1 where p=2m−1. This observation may be used to define an inverse of a non-zero field element αi to be αj, where j is such that α(i+j)=1. With respect to
Division of Field Elements:
Using the notion of the inverse of a field element, division between two field elements may be defined as αi/αj=αi·(1/αj), i.e., the multiplication of αi by the inverse of αj.
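The multiplication, inverse, and division rules above operate directly on the exponents, and can be sketched as follows (function names are illustrative):

```python
P = 15  # 2^m - 1 for m = 4; alpha^15 = 1 in GF(2^4)

def mul_pow(i, j):
    # multiplication adds the powers modulo (2^m - 1)
    return (i + j) % P

def inv_pow(i):
    # the inverse of alpha^i is alpha^j with (i + j) mod 15 == 0
    return (P - i) % P

def div_pow(i, j):
    # alpha^i / alpha^j = alpha^i * (1 / alpha^j)
    return mul_pow(i, inv_pow(j))

assert mul_pow(5, 6) == 11          # alpha^5 * alpha^6 = alpha^11
assert mul_pow(9, inv_pow(9)) == 0  # alpha^9 * alpha^6 = alpha^0 = 1
```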
The special Cauchy matrix 304 comprises (n−k) rows and k columns, in this case, 3 rows and 12 columns. In one embodiment, two separate arrays x and y are formed such that x has k members from GF(2m) and y has (n−k) members from GF(2m), while ensuring that no member that is in x is also in y. Then the special Cauchy matrix 304 is formed as M(i,j)=1/(yi+xj), where i is the ith row and j is the jth column in the special Cauchy matrix 304, for 1≤i≤(n−k); 1≤j≤k.
As an example, with (n−k)=3 and k=12, array x comprises {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11} and array y={12, 13, 14}, where the entries in array x and array y are the powers of α from FIG. 5. Using the field operations described above, each element of the 3×12 special Cauchy matrix 304 is computed as M(i,j)=1/(yi+xj) to be:
An important property of the special Cauchy matrix 304 is that any square sub-matrix formed by taking any number of arbitrary rows and an equal number of arbitrary columns from those rows is invertible. For example, taking rows 1 and 3, and columns 2 and 3 from those two rows, we can form
As another example of a square sub-matrix, take rows 1, 2, 3 and columns 2, 5, 7 from those rows:
Once the special Cauchy matrix 304 has been determined, it is concatenated with identity matrix 302, to yield the encoding matrix shown in
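The construction of the special Cauchy matrix at block 404 can be sketched as follows; this is a sketch assuming GF(24) with the primitive polynomial 1+x+x4, with entries kept as powers of α.

```python
# Sketch: build the special Cauchy matrix M(i, j) = 1/(y_i + x_j) for the
# example arrays x = {0..11} and y = {12, 13, 14} (all powers of alpha).
def gf16_tables():
    exp, log, v = [0] * 15, {}, 1
    for i in range(15):
        exp[i], log[v] = v, i
        v <<= 1
        if v & 0x10:
            v ^= 0x13   # primitive polynomial 1 + x + x^4
    return exp, log

def special_cauchy(x, y):
    exp, log = gf16_tables()
    # field addition is bit-wise XOR of the vector forms; the inverse of
    # alpha^p is alpha^(15 - p); x and y are disjoint, so no sum is zero
    return [[(15 - log[exp[xj] ^ exp[yi]]) % 15 for xj in x] for yi in y]

M = special_cauchy(list(range(12)), [12, 13, 14])
assert len(M) == 3 and all(len(row) == 12 for row in M)
```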
At block 406, a binary encoder matrix 700 is formed from the encoding matrix 300, as shown in
In the encoding matrix 300, element “−1” is represented by an all-zero matrix of size 4×4. For the remaining elements, referring to
An m-bit vector in GF(2m) is referred to herein by the power of α that corresponds to that vector. For example, the vector [1 1 0 0] in
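One way to realize the element-to-binary-matrix conversion used to form the binary encoder matrix is sketched below. It uses the fact that multiplication by αp is a linear map, so column j of the 4×4 matrix is the vector form of α(p+j); the element “−1” maps to the all-zero matrix. The function names are illustrative.

```python
# Sketch: 4x4 binary matrix representation of a field element given as a
# power p of alpha (p = -1 denotes the zero element -> all-zero matrix).
def gf16_exp():
    exp, v = [], 1
    for _ in range(15):
        exp.append(v)
        v <<= 1
        if v & 0x10:
            v ^= 0x13   # primitive polynomial 1 + x + x^4
    return exp

def bin_matrix(p):
    if p == -1:
        return [[0] * 4 for _ in range(4)]
    exp = gf16_exp()
    cols = [exp[(p + j) % 15] for j in range(4)]   # alpha^p * alpha^j
    # entry (i, j) is bit i (coefficient ci) of column j's vector form
    return [[(cols[j] >> i) & 1 for j in range(4)] for i in range(4)]

# alpha^0 = 1 multiplies as the identity, so its matrix is the 4x4 identity
assert bin_matrix(0) == [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
```

Multiplying such a matrix by the 4-bit vector of a field element (over GF(2), i.e., with XOR additions) yields the vector of the field product, which is why the binary encoder matrix reproduces the extension-field encoding using only XOR gates.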
At block 408, input/output data transfer logic receives data from one of the hosts 102 and, in response, generates a 48-bit binary information vector u.
At block 410, encoder 206 generates a systematic, binary codeword vbin by performing matrix multiplication on the binary information vector and binary encoder matrix 700. Matrix multiplication comprises XOR additions, as described above and, thus, complex arithmetic is avoided. The resultant codeword is 60 bits in length, comprising 48 information bits that are identical to the bits in the binary information vector with 12 parity bits appended to the end of the information bits.
In one embodiment, encoder 206 does not multiply the binary information vector by the entire binary encoder matrix 700, because the codeword is systematic. That is, the first 48 bits in the 60-bit binary codeword vbin are the same as the information bits in the vector ubin. Thus, only the last 12 bits—i.e., the parity bits—need to be generated and appended to the information bits. In this embodiment, then, encoder 206 multiplies the binary information vector by a binary representation of the special Cauchy matrix 304, i.e., the last 12 rows and all 48 columns of the binary encoder matrix 700, to generate the 12 parity bits.
As an example, the 48-bit binary information vector u may comprise ubin=[0 1 0 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 1]. The binary codeword vbin is 60-bits long, where the first 48 bits are the same as ubin. The last 12 bits of vbin—which are the parity bits—are generated by encoder 206 computing the matrix product Mbin*ubin, (where Mbin is the binary representation of the special Cauchy matrix), resulting in 1 0 1 0 1 0 1 1 1 0 0 1. By appending these 12 parity bits to ubin, the 60-bit codeword vbin=[0 1 0 1 0 0 1 0 1 1 0 1 1 0 0 0 0 0 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 0 0 0 0 0 0 1 1 0 1 0 1 0 1 1 1 0 0 1] is formed.
At block 412, the codeword is apportioned into a number of codeword symbols, each symbol 4 bits in length, yielding 12 information symbols and 3 parity symbols.
At block 414, each codeword symbol is stored by encoder 206 in a respective one of the storage mediums 108, in this case there are 15 storage mediums.
At block 416, a request to retrieve data (i.e., one or more codewords) is received by input/output data transfer logic 204.
At block 418, in response to receiving the request to retrieve data, decoder 208 retrieves a respective codeword symbol from each of the storage media 108a-108o (i.e., 15 storage mediums, the first 12 mediums storing codeword symbols representing the information bits, and the last 3 mediums storing codeword symbols representing the parity bits). However, one or more of the codeword symbols may not be available, due to a malfunction of one or more of the storage media 108, or a communication problem between one or more storage mediums and decoder 208. For purposes of discussion for the remainder of the steps in this method, it will be assumed that 3 storage mediums failed during a retrieval by decoder 208 and, specifically, that storage mediums 108j, 108k and 108l failed, representing the last 12 bits of the binary information vector.
Referring back to the 15×12 encoding matrix G shown in
For ease of discussion, the following blocks 420-426 are described in terms of extension field elements in encoding matrix G, rather than the bits in binary encoder matrix 700. It should be understood that in one embodiment, encoding matrix G is not stored in memory 202 and, therefore, not available to processor 200 or decoder 208 during the calculations described in blocks 420-426. However, binary encoder matrix 700 is stored in memory 202 or some other memory, and so, in practice, processor 200 and/or decoder 208 uses the 4×4 binary matrix representations of the extension field elements to perform the calculations described in blocks 420-426. In another embodiment, the encoding matrix G is also stored in memory 202 or some other memory along with the binary encoder matrix 700, and blocks 420-426 are performed as described below.
At block 420, decoder 208 defines an array xt={9, 10, 11} and an array yt={12, 13, 14}. The entries in xt are identical to the failed symbol numbers, and the entries in yt are identical to the parity symbol numbers, because of the way the special Cauchy matrix 304 was formed. If a different set of values in the x and y arrays were chosen, there would not be a straightforward correspondence between xt and the failed symbol numbers, or between yt and the parity symbol numbers, and two tables would need to be defined and stored in memory 202: a first table for mapping column numbers to the entries in x to generate xt, and a second table for mapping parity row numbers to the entries in y to generate yt.
At block 422, decoder 208 generates a square sub-matrix from the encoding matrix G that corresponds to the rows denoted in yt and the columns denoted in xt. In this example, this square sub-matrix shall be referred to as sub-matrix D and, therefore:
Instead of three storage medium failures, if only two failures were encountered when retrieving data from the storage mediums, D would be formed from the respective two columns in the encoding matrix G and any two of the three parity rows in those two columns. For example, if symbols 10 and 11 failed, any two of the three parity rows in the encoding matrix G would be selected by decoder 208: either rows 12 and 13, rows 12 and 14, or rows 13 and 14. For example, if rows 12 and 14 are selected, D is calculated to be:
At block 424, decoder 208 generates an inverse of D, referred to herein as a D−1 matrix, as follows. If a is defined as the number of entries in xt and yt, then for k=1, . . . , a:

ak=Πi<k(xti+xtk)·Πk<j(xtk+xtj)

bk=Πi<k(yti+ytk)·Πk<j(ytk+ytj)

ek=Πi=1 . . . a(xtk+yti)

fk=Πi=1 . . . a(ytk+xti)

Once the above quantities are calculated, the entries in D−1 are computed as

dij=(ei·fj)/((xti+ytj)·ai·bj)

for 1≤i≤a; 1≤j≤a, where the additions, multiplications, divisions, and inverses are the field operations described above. After performing the above calculations, in this example, D−1 is equal to:
In one embodiment, rather than compute the D−1 matrix as described in blocks 420-424, a plurality of D−1 matrices could be stored in memory 202 or some other storage device, each D−1 matrix associated with a particular combination of disks or codeword symbols that failed. In the present example, with 12 information symbols/disks and a tolerance of up to 3 disk failures, the number of unique D−1 matrices that would need to be stored in memory 202 would be 220 (i.e., the number of ways of choosing 3 failed disks out of 12). Then, processor 200 would select a particular D−1 matrix from the plurality of D−1 matrices depending on which combination of disks/symbols failed. Each of the plurality of D−1 matrices could be stored in either extension field form or binary form. If stored in extension field form, processor 200 converts a selected D−1 matrix to binary form for use in the last step of block 426, described below.
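The sub-matrix inversion of blocks 420-424 can be sketched as follows; the closed-form Cauchy inverse is reconstructed here from the ak, bk, ek, fk products defined above, and the sketch verifies itself by multiplying D−1 by D to obtain the identity matrix.

```python
# Sketch: invert the 3x3 Cauchy sub-matrix D(i, j) = 1/(yt_i + xt_j) using
# the a_k, b_k, e_k, f_k products, with all arithmetic in GF(2^4).
def tables():
    exp, log, v = [0] * 15, {}, 1
    for i in range(15):
        exp[i], log[v] = v, i
        v <<= 1
        if v & 0x10:
            v ^= 0x13   # primitive polynomial 1 + x + x^4
    return exp, log

EXP, LOG = tables()
mul = lambda a, b: 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % 15]
inv = lambda a: EXP[(15 - LOG[a]) % 15]

def prod(vals):
    r = 1
    for v in vals:
        r = mul(r, v)
    return r

xt = [EXP[p] for p in (9, 10, 11)]    # failed-symbol columns
yt = [EXP[p] for p in (12, 13, 14)]   # parity rows
a = 3
D = [[inv(y ^ x) for x in xt] for y in yt]

ak = [prod([xt[i] ^ xt[k] for i in range(a) if i != k]) for k in range(a)]
bk = [prod([yt[i] ^ yt[k] for i in range(a) if i != k]) for k in range(a)]
ek = [prod([xt[k] ^ y for y in yt]) for k in range(a)]
fk = [prod([yt[k] ^ x for x in xt]) for k in range(a)]

# d_ij = (e_i * f_j) / ((xt_i + yt_j) * a_i * b_j)
Dinv = [[mul(mul(ek[i], fk[j]), inv(mul(xt[i] ^ yt[j], mul(ak[i], bk[j]))))
         for j in range(a)] for i in range(a)]

# Dinv * D (rows of Dinv indexed by xt, columns of D by xt) is the identity
I3 = [[1 if r == c else 0 for c in range(a)] for r in range(a)]
prod_mat = [[0] * a for _ in range(a)]
for r in range(a):
    for c in range(a):
        s = 0
        for t in range(a):
            s ^= mul(Dinv[r][t], D[t][c])
        prod_mat[r][c] = s
assert prod_mat == I3
```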
At block 426, decoder 208 generates the failed information symbols as follows.
First, decoder 208 stores a representation of the codeword symbols/storage mediums that did not fail in an array I, and stores a representation of the parity rows selected at block 424 in an array J. The two arrays are generally stored in memory 202. Referring to the above example, I={0, 1, 2, 3, 4, 5, 6, 7, 8} and J={12, 13, 14}.
Next, decoder 208 selects an entry j from J, and selects that row number in the encoding matrix G. Then, for each entry i in I, decoder 208 selects an element from G(j,i) and computes its matrix representation, as described previously herein. In practice, each element G(j,i) is already in 4×4 binary matrix form, as only binary encoder matrix 700 is typically stored in memory 202 or some other memory. Next, decoder 208 multiplies that matrix with a vector representation of symbol i. This will result in a 4×1 vector. This operation is performed for all i in I.
After performing this operation for all i in I, |I| 4×1 vectors will have been generated, where |I| denotes the number of elements in I, in this case 9. Each of the 4×1 vectors may be stored in memory 202.
Next, decoder 208 performs a bit-by-bit XOR addition of all of the 4×1 vectors. The result is a 4×1 vector, which is then XOR-added with the bit vector representation of the jth parity symbol in the codeword, which can be obtained from the binary codeword vbin stored in memory 202. The result is a 4×1 vector which will be referred to as bj.
The above procedure is repeated for each element in J, resulting in |J| 4×1 column vectors bj, where |J| denotes the number of elements in J, or the number of failed storage mediums, in this case 3.
Next, decoder 208 concatenates, or stacks, the resulting bj vectors, one below another. In the current example, this will generate a 12×1 bit column vector, referred to herein as E.
Next, decoder 208 replaces each member in D−1 by its corresponding 4×4 bit matrix, as explained previously. In practice, each member in D−1 is already in 4×4 binary matrix form, as only binary encoder matrix 700 is typically stored in memory 202 or some other memory. Therefore, this step may not actually be performed by processor 200. In the current example, decoder 208 generates a 12×12 bit matrix, referred to herein as Dinvbin.
Finally, the failed codeword symbols are generated as the product of Dinvbin and E (i.e., R=Dinvbin*E) in bit vector form, stacked one below another in a column. In the current example, R is a 12×1 bit column vector, where the first 4 bits are the recovered 9th codeword symbol, the next 4 bits are the recovered 10th codeword symbol, and the last 4 bits are the recovered 11th codeword symbol.
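Blocks 418-426 can be sketched end-to-end in field-symbol form, which is equivalent to the stored bit-matrix form. The symbol values correspond to the example information vector above under the low-bit-first packing assumption; the sketch erases symbols 9, 10, and 11 and recovers them from the survivors and the parity symbols.

```python
# Sketch of recovery: erase information symbols 9, 10, 11, then rebuild them
# from the surviving symbols and the three parity symbols, all in GF(2^4).
def tables():
    exp, log, v = [0] * 15, {}, 1
    for i in range(15):
        exp[i], log[v] = v, i
        v <<= 1
        if v & 0x10:
            v ^= 0x13   # primitive polynomial 1 + x + x^4
    return exp, log

EXP, LOG = tables()
mul = lambda a, b: 0 if 0 in (a, b) else EXP[(LOG[a] + LOG[b]) % 15]
inv = lambda a: EXP[(15 - LOG[a]) % 15]

def prod(vals):
    r = 1
    for v in vals:
        r = mul(r, v)
    return r

x = [EXP[j] for j in range(12)]
y = [EXP[j] for j in (12, 13, 14)]
M = [[inv(yi ^ xj) for xj in x] for yi in y]   # special Cauchy matrix

# twelve information symbols (vector forms) of the running example
u = [10, 4, 11, 1, 12, 6, 15, 15, 3, 15, 1, 8]
parity = [0, 0, 0]
for i in range(3):
    for j in range(12):
        parity[i] ^= mul(M[i][j], u[j])

failed = [9, 10, 11]
xt, yt = [x[f] for f in failed], y
a = len(failed)

# syndromes: each parity symbol minus the surviving symbols' contributions
b = []
for i in range(3):
    s = parity[i]
    for j in range(12):
        if j not in failed:
            s ^= mul(M[i][j], u[j])
    b.append(s)

ak = [prod([xt[i] ^ xt[k] for i in range(a) if i != k]) for k in range(a)]
bk = [prod([yt[i] ^ yt[k] for i in range(a) if i != k]) for k in range(a)]
ek = [prod([xt[k] ^ yi for yi in yt]) for k in range(a)]
fk = [prod([yt[k] ^ xi for xi in xt]) for k in range(a)]
Dinv = [[mul(mul(ek[i], fk[j]), inv(mul(xt[i] ^ yt[j], mul(ak[i], bk[j]))))
         for j in range(a)] for i in range(a)]

recovered = [0] * a
for i in range(a):
    for j in range(a):
        recovered[i] ^= mul(Dinv[i][j], b[j])
assert recovered == [u[f] for f in failed]
```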
At block 428, decoder 208 arranges the codeword symbols that were retrieved successfully with the recovered codeword symbols to form the original codeword. The parity bits may be stripped, and the information bits of the codeword are provided to input/output data transfer logic, where the information is provided to a host 102 that requested the information.
In some data center applications, a disk-duplication scheme is used to provide redundancy in the event of disk failure. For example, a source disk may be replicated onto 3 other disks, and such a system can tolerate up to three simultaneous disk failures. However, this approach is costly in terms of storage, because the overhead is 75% (overhead may be defined as the number of additional disks divided by the total number of disks, in this case ¾). The system described herein, on the other hand, comprises an overhead of (n−k)/n, which is typically much less than traditional systems. For example, when m=4, n=15. If it is desired to tolerate up to 3 disk failures, 3 parity disks are used. Therefore, in such a system, the overhead is 3/15=20%. If 4 parity disks are allocated, up to 4 disk failures may be tolerated, and such a system comprises an overhead of only 4/15=26.67% vs. an overhead of 4/5=80% for a traditional replication system.
The number of disks needed to reconstruct a failed disk may be referred to as the repair bandwidth. In the method described in
At block 800, blocks 400-410 of the method described above are performed, i.e., defining system parameters, defining an encoding matrix comprising an identity matrix concatenated with a special Cauchy matrix, converting the encoding matrix into a binary encoder matrix, receiving data from one or more hosts, and generating a binary information vector ubin 48 bits in length. However, a fourth parity disk 108n+1 (as shown in
At block 802, encoder 206 creates the first and second parity symbols, as described above at block 414, by multiplying the binary information vector with rows 49-56 of the binary encoder matrix 700. However, rather than creating a single third parity symbol by multiplying the binary information vector with rows 57-60, encoder 206 creates both a third parity symbol and a fourth parity symbol, each using the last four rows of binary encoder matrix 700. The third and fourth parity symbols are created by processor 200 as follows.
As shown in
At block 804, encoder 206 multiplies the first vector 902 by the last four rows of the binary encoder matrix 700 to form the third parity symbol, and multiplies the second vector 904 by the last four rows of the binary encoder matrix 700 to form the fourth parity symbol. It should be understood that although, in this example, the third and fourth parity symbols were created from the last four rows of the binary encoder matrix 700, in other embodiments, any group of four parity rows of the binary encoder matrix 700 could be used.
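Assuming a 48-bit vbin and treating the last four rows of the binary encoder matrix as a 4×48 bit matrix, blocks 802-804 can be sketched as follows. The zero-padding of each half-vector mirrors the description of vectors 902 and 904; the function names, and the matrix values used to exercise the code, are illustrative:

```python
def gf2_matvec(matrix, vector):
    """Bit-matrix times bit-vector over GF(2)."""
    return [sum(m & v for m, v in zip(row, vector)) % 2 for row in matrix]

def half_parities(p_rows, vbin):
    """Form third and fourth parity symbols from the two halves of vbin.

    p_rows: the last four rows of the binary encoder matrix (4 x 48 bits).
    The first vector keeps bits 1-24 and zeroes bits 25-48; the second
    vector is its complement, per the described split of vbin.
    """
    first = vbin[:24] + [0] * 24    # vector 902: first half, zero-padded
    second = [0] * 24 + vbin[24:]   # vector 904: second half, zero-padded
    third = gf2_matvec(p_rows, first)
    fourth = gf2_matvec(p_rows, second)
    return third, fourth
```

Because the two half-vectors sum (over GF(2)) to vbin, XOR-ing the two resulting parity symbols reproduces the parity that multiplying vbin directly by the same four rows would give.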
At block 806, a systematic codeword is formed from the 48 information bits of the binary information vector vbin 900, concatenated with first and second parity symbols, generated as described by the method of
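Assembly of the systematic codeword at block 806 is a concatenation of the twelve 4-bit information symbols with the parity symbols; a minimal sketch (names are illustrative):

```python
def form_codeword(vbin, parities, symbol_size=4):
    """Concatenate information bits (as 4-bit symbols) with parity symbols.

    vbin: 48-bit information vector; parities: list of 4-bit parity symbols.
    Returns the systematic codeword as a list of 4-bit symbols.
    """
    symbols = [vbin[i:i + symbol_size]
               for i in range(0, len(vbin), symbol_size)]
    return symbols + list(parities)
```

With four parity symbols, the resulting codeword comprises sixteen symbols: twelve information symbols followed by the first, second, third, and fourth parity symbols.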
At block 808, at some later time, the systematic codeword is retrieved from the storage mediums 108 by decoder 208.
At block 810, decoder 208 determines if any symbols of the codeword were erased, i.e., not provided by one or more of the storage mediums 108 due to, for example, a hardware failure of one of the storage mediums or a disconnect between a storage medium and decoder 208.
At block 812, if a single symbol was erased or otherwise unavailable, decoder 208 determines which codeword symbol failed out of the 12 information symbols of the codeword. For example, decoder 208 may determine that information symbol 3 failed, corresponding to storage medium 108c.
At block 814, decoder 208 recovers the failed information symbol as described in block 428 above, using the third parity symbol from storage medium 108, if the failed information symbol is from storage mediums 108a-108f, or using the fourth parity symbol from storage medium 108n+1 if the failed information symbol is from storage mediums 108g-108l. However, the array I comprises representations of only the intact information symbols in whichever half of the set of storage mediums the failed information symbol belongs to. Referring to the current example, if the failed information symbol was information symbol 3, which had been stored in storage medium 108c, array I comprises {1, 2, 4, 5, 6}. If the failed information symbol was the 10th information symbol of the codeword, array I would comprise {7, 8, 9, 11, 12}.
In other embodiments, a different "mapping" of failed symbols to parity symbols may be defined, such as using the third parity symbol when odd-numbered information symbols fail, and using the fourth parity symbol when even-numbered information symbols fail. In these alternative embodiments, each of the third and fourth parity symbols is derived in accordance with the alternative mapping scheme, so that each parity symbol is generated from the same information symbols it is later used to recover, consistent with the first-half/second-half scheme described above. Continuing with the odd/even scheme just described, the third parity symbol is generated from the odd groupings of 4 bits each of the binary information vector vbin 900 (i.e., bits {1, 2, 3, 4}, {9, 10, 11, 12}, {17, 18, 19, 20}, etc.), inserting 4 zeros between successive odd groupings, while the fourth parity symbol is generated from the even groupings of 4 bits each of the binary information vector vbin 900 (i.e., bits {5, 6, 7, 8}, {13, 14, 15, 16}, {21, 22, 23, 24}, etc.), also inserting 4 zeros between successive even groupings.
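The grouping step can be sketched as a mask that zeroes every other 4-bit grouping of vbin, keeping either the odd-numbered or the even-numbered groupings (the function name is illustrative):

```python
def grouped_vector(vbin, keep, symbol_size=4):
    """Zero out every other 4-bit grouping of vbin.

    keep=0 retains the odd-numbered groupings (bits 1-4, 9-12, ...);
    keep=1 retains the even-numbered groupings (bits 5-8, 13-16, ...).
    """
    out = []
    for i, bit in enumerate(vbin):
        group = i // symbol_size          # 0-based grouping index
        out.append(bit if group % 2 == keep else 0)
    return out
```

The two masked vectors have disjoint support and sum (over GF(2)) to vbin, just as the first-half and second-half vectors do in the scheme above.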
At block 816, if more than one storage medium fails, decoder 208 XORs the third and fourth parity symbols together, re-creating the original parity symbol, i.e., the parity symbol that would have been created by multiplication of the binary information vector vbin 900 with the last four rows of the binary encoder matrix 700, as described in
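Because encoding is linear over GF(2) and the two half-vectors sum to vbin, XOR-ing the third and fourth parity symbols necessarily reproduces the parity that direct encoding of vbin would yield. A small self-check of that identity, using an illustrative 4×8 bit matrix standing in for the last four encoder rows:

```python
def gf2_matvec(matrix, vector):
    """Bit-matrix times bit-vector over GF(2)."""
    return [sum(m & v for m, v in zip(row, vector)) % 2 for row in matrix]

# Illustrative values only (not the actual binary encoder matrix).
rows = [[1, 0, 1, 1, 0, 1, 0, 0],
        [0, 1, 1, 0, 1, 0, 1, 0],
        [1, 1, 0, 0, 0, 1, 1, 1],
        [0, 0, 1, 1, 1, 0, 0, 1]]
vbin = [1, 0, 1, 1, 0, 0, 1, 0]

first = vbin[:4] + [0, 0, 0, 0]     # first half, zero-padded
second = [0, 0, 0, 0] + vbin[4:]    # second half, zero-padded
third = gf2_matvec(rows, first)
fourth = gf2_matvec(rows, second)
original = gf2_matvec(rows, vbin)

# XOR of the two half parities equals the directly encoded parity.
assert [a ^ b for a, b in zip(third, fourth)] == original
```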
At block 818, the failed codeword symbols are re-created, using the decoding method of
This modification to the method of
The methods or algorithms described in connection with the embodiments disclosed herein may be embodied directly in hardware or embodied in processor-readable instructions executed by a processor. The processor-readable instructions may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components.
Accordingly, an embodiment of the invention may comprise a computer-readable medium embodying code or processor-readable instructions to implement the teachings, methods, processes, algorithms, steps and/or functions disclosed herein.
It is to be understood that the decoding apparatus and methods described herein may also be used in other communication situations and are not limited to RAID storage. For example, compact disk technology also uses erasure and error-correcting codes to handle the problem of scratched disks and would benefit from the use of the techniques described herein. As another example, satellite systems may use erasure codes in order to trade off power requirements for transmission, purposefully allowing for more errors by reducing power; chain-reaction coding would be useful in that application. Also, erasure codes may be used in wired and wireless communication networks, such as mobile telephone/data networks, local-area networks, or the Internet. Embodiments of the current invention may, therefore, prove useful in other applications such as the above examples, where codes are used to handle the problems of potentially lossy or erroneous data.
While the foregoing disclosure shows illustrative embodiments of the invention, it should be noted that various changes and modifications could be made herein without departing from the scope of the invention as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the embodiments of the invention described herein need not be performed in any particular order. Furthermore, although elements of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.