The field relates generally to data storage systems, and more particularly to techniques for data encoding in such data storage systems.
The increasing amount of data available in digital format requires developing appropriate data storage systems. In many cases, the amount of data to be stored exceeds the capacity of a single disk drive. Furthermore, the reliability of a single drive may not be sufficient for a particular application. This motivates the development of redundant disk arrays such as, for example, a Redundant Array of Independent Disks or RAID. The size of such arrays may vary from a few disks to a few thousand disks. A major challenge in their design is the development of efficient algorithms for calculating parity data, as well as for recovering data in case of disk failures. Such algorithms are based on encoding the data with some error/erasure correcting code (i.e., calculating parity data from the payload data), and storing different symbols of a codeword on different disks. That is, K disks are typically used to store the payload data, while N−K disks are used to store parity data, where N is the total number of disks in the RAID group. Numerous codes have been suggested for usage in storage applications, such as Reed-Solomon codes, Hamming codes, Row-Diagonal Parity (RDP), Even/Odd, Zigzag codes, etc. In general, the fraction (N−K)/N of parity disks needed to achieve a given reliability decreases with the number of payload disks K. This motivates designing large disk arrays.
It is desirable to employ maximum distance separable (MDS) codes in the design of disk arrays. The crucial property of these codes is that they enable one to recover any combination of up to N−K erasures (disk failures). However, construction of MDS codes requires employing large alphabets. For example, Reed-Solomon codes and codes based on Cauchy matrices are defined over large finite fields $GF(2^m)$. The cost of the multiplication operation in such fields, which is needed for encoding and decoding of these codes, is much higher than that of the summation operation (exclusive-or, i.e., XOR). On the other hand, array codes are typically defined over the vector alphabet $GF(2)^m$. Their encoding and decoding algorithms require only the XOR operation, but the size of one codeword (stripe size) is m times larger than that of codes over $GF(2^m)$. In a practical system, this results in more frequent partial stripe update operations, which considerably degrade RAID performance. As such, improved data encoding techniques are needed that utilize non-MDS codes.
Embodiments of the invention provide improved techniques for data encoding in data storage systems.
For example, in one embodiment, a method comprises the following steps. Data is obtained at a data storage system. Codewords are generated from the obtained data. The codewords are computed using a generalized concatenated code and each codeword comprises symbols, wherein the symbols comprise information symbols and check symbols. The codewords are stored on an array of disks associated with the data storage system. In one example, i-th symbols of the generated codewords are stored on an i-th disk of the array of disks.
In another embodiment, a computer program product is provided which comprises a processor-readable storage medium having encoded therein executable code of one or more software programs. The one or more software programs when executed by a processor implement one or more steps of the above-described method.
In yet another embodiment, an apparatus comprises a memory and a processor operatively coupled to the memory and configured to perform one or more steps of the above-described method.
In a further embodiment, a data storage system comprises an array controller operatively coupled to an array of disks. The array controller is configured to perform one or more steps of the above-described method.
Advantageously, the use of generalized concatenated codes according to embodiments of the invention provides significant fault tolerance in RAID-based data storage systems while reducing computational complexity as compared with conventional encoding techniques.
These and other features and advantages of the present invention will become more readily apparent from the accompanying drawings and the following detailed description.
Embodiments of the present invention will be described herein with reference to exemplary computing systems, data storage systems, and associated servers, computers, storage devices and other processing devices. It is to be appreciated, however, that embodiments of the invention are not restricted to use with the particular illustrative system and device configurations shown. Moreover, the phrases “computing system,” “processing platform,” “data storage system,” and “data storage system environment” as used herein with respect to various embodiments are intended to be broadly construed, so as to encompass, for example, private or public cloud computing or storage systems, or parts thereof, as well as other types of systems comprising distributed virtual infrastructure and those not comprising virtual infrastructure. However, a given embodiment may more generally comprise any arrangement of one or more processing devices.
Communications medium 18 provides network connections between the hosts 10 and the data storage system 30. Communications medium 18 may implement a variety of protocols such as Transmission Control Protocol/Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Asynchronous Transfer Mode (ATM), Ethernet, Fibre Channel, Small Computer System Interface (SCSI), combinations thereof, and the like. Furthermore, communications medium 18 may include various components (e.g., cables, switches/routers, gateways/bridges, Network Attached Storage/Storage Area Network (NAS/SAN) appliances/nodes, interfaces, etc.). Moreover, the communications medium 18 is capable of having a variety of topologies (e.g., hub-and-spoke, ring, backbone, multi-drop, point-to-point, irregular, combinations thereof, and so on).
Array controller 12 is constructed and arranged to convert blocks of payload data 16 into various codewords 24 in generator module 32. As shown, generator module 32 includes an outer coder 32A, an inner coder 32B, a load balancing controller 32C, and a reverse unit 32D. Outer code and inner code generation, as well as load balancing and code reversing, will be further explained below. Array controller 12 is also constructed and arranged to send codewords to disks 20(1) through 20(N) of RAID array 14. In one example, array controller 12 is a server, although in some arrangements, array controller 12 may be a dedicated unit of a server, a personal computer, a laptop computer, or the like.
Generator module 32 is constructed and arranged to generate codewords 24 from blocks of payload data 16. More particularly, generator module 32 encodes blocks of payload data 16 while generating parity data for this payload data. The encoding operations are performed using the outer coder 32A, inner coder 32B, load balancing controller 32C, and reverse unit 32D, as will be explained in further detail below. In one example, generator module 32 is software running on the array controller 12, although in some arrangements, generator module 32 is a stand-alone piece of hardware, or some combination of hardware and software.
RAID array 14 is constructed and arranged to store codewords 24 in disks 20(1) through 20(N) of the RAID array 14. During operation, array controller 12 receives payload data 16 over communications medium 18. Payload data 16 is broken into blocks of length K; in some arrangements, array controller 12 breaks payload data into blocks. In turn, generator module 32 takes in each block of payload data and applies a generator matrix of an outer code (from outer coder 32A) to it to create a codeword 24 of length N. This is done for each block. The obtained blocks are further encoded with an inner code via inner coder 32B. Load balancing and code reversing are also applied via 32C and 32D, respectively. It is to be understood that each codeword is typically comprised of information symbols corresponding to the payload data and check symbols corresponding to the parity data.
Array controller 12 then sends codewords 24 to RAID array 14 to be stored in particular stripes across disks 20(1) through 20(N). In this example, array controller 12 stores (or causes to be stored) symbols 1 through N of the codewords that were encoded by generator module 32 across stripes 28(1) and 28(2). Note that a given stripe is spread across multiple disks, e.g., stripes 28(1) and 28(2) reside across disks 20(1) through 20(N). Further details of array controller 12 are described below with respect to
Memory 46 is configured to store program code 48 that contains instructions configured to cause processor 44 to carry out methodologies described herein. For example, for array controller 12, program code 48 contains instructions for applying outer/inner codes (see
Processor 44 may take the form of, but is not limited to, one or more central processing units, one or more microprocessors, and a single core or multi-cores each running single or multiple threads. In some arrangements, processor 44 is one of several processors working together. Processor 44 is configured to carry out methodologies and algorithms described herein by executing program code 48. Processor 44 includes generator module 32, although in some arrangements, generator module 32 may be a stand-alone hardware module or software residing in memory. The processor, memory and data interface shown in
Given the illustrative data storage system described above, we now describe encoding algorithms according to embodiments of the invention. More particularly, we describe improved methodologies for generating codewords that may, for example, be employed in the generator module 32 of array controller 12. Embodiments of the invention realize that the partial stripe update problems mentioned above in the background section, as well as others, may be avoided by employing non-MDS codes over small alphabets which are able to correct large fractions of erasures. One example of such a non-MDS code is a generalized concatenated code (GCC). Accordingly, embodiments of the invention employ GCCs for RAID-based data storage systems. As will be explained in detail below, data is stored on disks of the RAID group in such a way that the i-th disk contains the i-th symbols of codewords of a GCC. In the following detailed description, methods for code construction, systematic encoding and load balancing will be described. We first generally describe GCCs.
A GCC is constructed using a family of $(n, k_i, d_i)$ outer codes $\mathcal{A}_i$ over GF(2), $1 \le i \le t$, and a family of nested inner $(v, t-i+1, q_i)$ codes $\mathcal{B}_i$ over GF(2). Note that, for the sake of simplicity, in one or more embodiments, we consider only the case of GCCs based on binary linear codes (however, other GCCs may be employed in alternate embodiments). This results in an $(N = nv,\ K = \sum_{i=1}^{t} k_i,\ D = \min(d_1 q_1, \ldots, d_t q_t))$ linear block code. Codes $\mathcal{B}_i$, $i > 1$, induce a recursive decomposition of code $\mathcal{B}_1$ into a number of cosets, so that:
$$\mathcal{B}_i = \{\, c + u_1 g_i \mid c \in \mathcal{B}_{i+1},\ u_1 \in \{0, 1\} \,\},$$
where $g_i$ denotes the rows of the generator matrix of $\mathcal{B}_1$.
For the description of an illustrative encoding algorithm, we assume that the generator matrices of outer codes are given in the canonical form $G^{(i)} = (E \mid A^{(i)})$, $i = 1, \ldots, t$, where E is the $k_i \times k_i$ identity matrix. Generator matrices for inner codes $B^{(i)}$, $i = 1, \ldots, v$, can be arbitrary.
As illustrated in methodology 300 of
1. Encoding with outer codes 320, i.e., computation of check symbols for codewords of outer codes (note that in outer codes 320, the shaded symbols are check symbols and the non-shaded symbols are information symbols):
$$a_{i,\,k_i+1 \ldots n} = a_{i,\,1 \ldots k_i}\, A^{(i)}, \quad i = 1, \ldots, t. \tag{1}$$
2. Encoding with the first inner code 330. Let $\Lambda = B^{(1)}$. Then:
$$c_{1 \ldots v,\,1 \ldots n} = \Lambda\, a_{1 \ldots t,\,1 \ldots n}. \tag{2}$$
The obtained sequence $(c_{1,\,1 \ldots n}, \ldots, c_{v,\,1 \ldots n})$ is a codeword 340 of the GCC. That is, a GCC codeword 340 can be considered as a $v \times n$ table, where each column is a codeword of $\mathcal{B}_1$. This table is obtained by encoding each column of another $t \times n$ table. The i-th row of the latter table is a codeword of $\mathcal{A}_i$, as shown in
Another way to obtain codeword c (considered as a vector of length vn) is to multiply (a1,1 . . . k
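Then, the generator matrix for GCC is given by:
$$G = \begin{pmatrix} b_{1,1} G^{(1)} & b_{1,2} G^{(1)} & \cdots & b_{1,v} G^{(1)} \\ b_{2,1} G^{(2)} & b_{2,2} G^{(2)} & \cdots & b_{2,v} G^{(2)} \\ \vdots & \vdots & & \vdots \\ b_{t,1} G^{(t)} & b_{t,2} G^{(t)} & \cdots & b_{t,v} G^{(t)} \end{pmatrix}, \tag{3}$$
where $b_{i,j}$ is the entry in row i and column j of the generator matrix of the first inner code, $G^{(i)}$ is the $k_i \times n$ generator matrix of the i-th outer code, a block with $b_{i,j} = 0$ equals $0_{k_i,n}$, and $0_{a,b}$ is the $a \times b$ matrix filled with zeros. That is, the 1's in the i-th row of the generator matrix of the first inner code are replaced with the generator matrix of the i-th outer code, and 0's are replaced with $k_i \times n$ zero matrices.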
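By way of a non-limiting illustration, the following Python sketch builds a GCC generator matrix by the block-replacement rule just described and checks that it agrees with two-step encoding (outer codes on rows, first inner code on columns). The matrices `OUTER` and `INNER` and all function names are hypothetical choices made for this example only, and the inner step is assumed to multiply each column, taken as a row vector, by the inner generator matrix.

```python
def vec_mat(u, M):
    """Row vector times matrix over GF(2)."""
    return [sum(u[r] & M[r][c] for r in range(len(u))) % 2
            for c in range(len(M[0]))]

# Hypothetical outer generator matrices of length n = 4, canonical form (E | A).
OUTER = [
    [[1, 1, 1, 1]],                               # (4, 1) repetition code
    [[1, 0, 1, 1], [0, 1, 1, 0]],                 # (4, 2) code
    [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]],   # (4, 3) single parity check
    [[1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 1]],   # (4, 3) single parity check
]
# Hypothetical generator matrix of the first inner code (t = v = 4).
INNER = [[1, 0, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0], [1, 1, 1, 1]]

def encode_two_step(blocks):
    """Outer-encode each row, then inner-encode each column; returns a v x n table."""
    a = [vec_mat(u, G) for u, G in zip(blocks, OUTER)]        # t x n table
    n = len(a[0])
    cols = [vec_mat([row[m] for row in a], INNER) for m in range(n)]
    return [[cols[m][j] for m in range(n)] for j in range(len(INNER[0]))]

def gcc_generator():
    """Replace each entry b of row i of INNER by the block b * OUTER[i]."""
    G = []
    for i, Gi in enumerate(OUTER):
        for row in Gi:
            G.append([INNER[i][j] & x
                      for j in range(len(INNER[0])) for x in row])
    return G
```

With these example matrices the result is a 9 × 16 generator matrix, and reading the v × n table row by row gives the same vector as multiplying the concatenated information symbols by that matrix.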
Embodiments of the invention employ GCC codes for RAID-based data storage systems. As mentioned above and as will be further explained below, data is stored on disks of the RAID group in such way so that the i-th disk contains the i-th symbols of codewords of a GCC. In the following detailed description, methods for code construction, systematic encoding and load balancing will be described. Note that these methods may, for example, be employed in the data storage system 30 in accordance with array controller 12 (including generator module 32) and RAID array 14.
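The placement rule stated above (the i-th disk holds the i-th symbols of the codewords) may be sketched as follows. The function name `stripe_codewords` and the in-memory list representation of disks are assumptions made for illustration and are not part of the described system.

```python
def stripe_codewords(codewords, num_disks):
    """Distribute codewords so that disk i receives symbol i of every codeword."""
    disks = [[] for _ in range(num_disks)]
    for cw in codewords:
        if len(cw) != num_disks:
            raise ValueError("codeword length must equal the number of disks")
        for i, symbol in enumerate(cw):
            disks[i].append(symbol)
    return disks
```

For example, `stripe_codewords([[1, 0, 1], [0, 0, 1]], 3)` places the first symbols of both codewords on the first disk, the second symbols on the second disk, and so on.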
For given length N and dimension K, appropriate inner and outer codes are identified. In one embodiment, polar codes are used as inner codes and binary linear block codes as outer codes. The generator matrix of a $(v = 2^s, m)$ polar code is given by some m rows of the matrix
$$F_v = \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix}^{\otimes s},$$
where $\otimes s$ denotes the s-times Kronecker product of a matrix with itself. So, the number of outer codes t in the proposed construction is equal to the length of inner codes v. We propose to construct the generator matrix of the first inner code as $P F_v P^T$, where P is a permutation matrix corresponding to re-arrangement of the rows of $F_v$ in ascending order of their weights. The same permutation is applied to the columns of $F_v$, so that the obtained matrix is in lower-triangular form, as needed by a fast encoding algorithm presented below. The generator matrix $B^{(i)}$ of the i-th inner code consists of the last $(v - i + 1)$ rows of $P F_v P^T$, i.e., those with the largest weights.
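A minimal Python sketch of this construction is given below, assuming the standard 2 × 2 polar kernel with rows (1, 0) and (1, 1). The function names are hypothetical, and ties between rows of equal weight are broken by keeping their original order (a stable sort), which is an assumption since the order among equal-weight rows is not specified above.

```python
def kron_power(s):
    """s-fold Kronecker power of the kernel [[1, 0], [1, 1]] over GF(2)."""
    F = [[1]]
    for _ in range(s):
        top = [row + [0] * len(row) for row in F]
        bottom = [row + row for row in F]
        F = top + bottom
    return F

def weight_sorted(F):
    """Apply the same weight-ascending permutation to rows and columns (P F P^T)."""
    v = len(F)
    order = sorted(range(v), key=lambda i: sum(F[i]))  # stable sort
    M = [[F[order[a]][order[b]] for b in range(v)] for a in range(v)]
    return M, order

def inner_generator(M, i):
    """Last v - i + 1 rows of the permuted matrix (i is 1-based)."""
    return M[i - 1:]
```

Under these assumptions the permuted matrix is lower-triangular with units on the diagonal, because a row of the Kronecker power can only have ones in columns whose own rows have smaller or equal weight.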
For outer codes, we take any linear binary codes of length n=N/v and dimensions $k_i$ such that $\sum_{i=1}^{v} k_i = K$. In particular, it is advantageous to use optimal linear block codes, i.e., codes with the largest possible minimum distance $d_i$ for a given length n and dimension $k_i$. In order to implement systematic encoding, the generator matrices for outer codes are represented as $G^{(i)} = (E \mid A^{(i)})$, $i = 1, \ldots, v$.
One way to implement decoding of generalized concatenated codes is via a multistage decoding algorithm, which successively tries to recover the symbols of a codeword at level i by using a decoding algorithm for the inner code $\mathcal{B}_i$, and invokes a decoder of the outer code $\mathcal{A}_i$ in order to recover those symbols which were not recoverable by the inner decoder. After reconstruction of the codeword of the i-th outer code, each symbol of this codeword is multiplied by the i-th row of the generator matrix of $\mathcal{B}_1$, and the obtained codeword is subtracted from the vector being decoded. Then, decoding proceeds at level i+1 in the same way. Let $\sigma_i$ be the event of successful decoding at the i-th level. Then, the probability of successful GCC decoding $\Pr\{\sigma_1, \ldots, \sigma_v\}$ can be expressed via probabilities of successful decoding at levels 1, …, v as:
$$\Pr\{\sigma_1, \ldots, \sigma_v\} = \Pr\{\sigma_v \mid \sigma_1, \ldots, \sigma_{v-1}\}\, \Pr\{\sigma_{v-1} \mid \sigma_1, \ldots, \sigma_{v-2}\} \cdots \Pr\{\sigma_2 \mid \sigma_1\}\, \Pr\{\sigma_1\}.$$
This implies that a GCC decoding failure probability is given by:
$$1 - \Pr\{\sigma_1, \ldots, \sigma_v\} = 1 - \prod_{i=1}^{v} (1 - P_i),$$
where $P_i = 1 - \Pr\{\sigma_i \mid \sigma_1, \ldots, \sigma_{i-1}\}$ is the decoding failure probability at the i-th decoding level. One practical method for construction of GCCs is to select $k_i$ so that the $P_i$ are approximately equal, i.e.:
$$P_1 \approx P_2 \approx \cdots \approx P_v \approx P. \tag{4}$$
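As a small numerical illustration of the chain-rule factorization above, the following Python sketch combines per-level failure probabilities into an overall GCC decoding failure probability. The function name and the sample values are hypothetical.

```python
from math import prod

def gcc_failure_probability(level_failures):
    """Overall failure probability 1 - prod(1 - P_i) for per-level failures P_i."""
    return 1.0 - prod(1.0 - p for p in level_failures)
```

When all levels have the same failure probability P, as targeted by expression (4), the result is 1 − (1 − P)^v, which is approximately vP for small P.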
In one embodiment, an algorithm or methodology such as is illustrated in program code 400 in
This methodology assumes the ability to calculate a decoding failure probability for outer and inner codes of different dimensions. Such calculation is given by:
where p is a channel erasure probability (i.e., disk failure probability), $A_j(s)$ is the number of uncorrectable erasure configurations of weight j for an outer code of dimension s, and $B_j(i)$ is the number of uncorrectable erasure configurations of weight j for the first information symbol for the i-th inner code. An erasure configuration is uncorrectable for some code C if there exist two distinct codewords $c, c' \in C$ for which $c_j = c'_j$ for all non-erased symbols j. An erasure configuration is uncorrectable for the first information symbol if there exist at least two codewords, corresponding to different values of the first information symbol, which agree on all non-erased symbols.
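For short codes, the number of uncorrectable erasure configurations of each weight can be found by brute force. The Python sketch below does this for a single binary linear code taken on its own, using the fact that an erasure configuration is uncorrectable exactly when the generator matrix restricted to the surviving positions has rank less than k. It then evaluates the resulting failure probability under the assumption that symbols are erased independently with probability p. All names are hypothetical, and the sketch does not reproduce the combined inner/outer expression referred to above.

```python
from itertools import combinations

def gf2_rank(rows):
    """Rank over GF(2) of vectors packed as integer bitmasks."""
    basis = []
    for r in rows:
        for b in basis:
            r = min(r, r ^ b)   # clears the leading bit of b in r if it is set
        if r:
            basis.append(r)
    return len(basis)

def uncorrectable_counts(G):
    """counts[j] = number of weight-j erasure configurations the code cannot correct."""
    k, n = len(G), len(G[0])
    counts = [0] * (n + 1)
    for j in range(n + 1):
        for erased in combinations(range(n), j):
            kept = [c for c in range(n) if c not in erased]
            # Each generator row restricted to surviving columns, as a bitmask.
            rows = [sum(G[r][c] << idx for idx, c in enumerate(kept))
                    for r in range(k)]
            if gf2_rank(rows) < k:
                counts[j] += 1
    return counts

def failure_probability(counts, p):
    """Sum over weights j of counts[j] * p^j * (1 - p)^(n - j)."""
    n = len(counts) - 1
    return sum(a * p**j * (1 - p)**(n - j) for j, a in enumerate(counts))
```

For the (4, 3) single parity check code, every configuration of two or more erasures is uncorrectable, so the counts by weight 0 through 4 are 0, 0, 6, 4 and 1.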
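An illustrative embodiment of systematic encoding will now be described under an assumption that the generator matrix of the first inner code $B^{(1)}$ is a lower-triangular one with units on the diagonal, and generator matrices of outer codes are given in the canonical form $G^{(i)} = (E \mid A^{(i)})$, $i = 1, \ldots, v$.
The encoding algorithm for the case of non-systematic GCC includes two steps given by expressions (1) and (2) above. Let the symbols $c_{i,\,1 \ldots k_i}$, $i = 1, \ldots, v$, be the information symbols. By expression (2), the i-th outer codeword is $a_{i,\,1 \ldots n} = (\Lambda^{-1})_{i,\,i \ldots v}\, c_{i \ldots v,\,1 \ldots n}$, so expression (1) requires:
$$(\Lambda^{-1})_{i,\,i \ldots v}\, c_{i \ldots v,\,k_i+1 \ldots n} = \left( (\Lambda^{-1})_{i,\,i \ldots v}\, c_{i \ldots v,\,1 \ldots k_i} \right) A^{(i)}.$$
Since $\Lambda^{-1}$ has units on the diagonal, $(\Lambda^{-1})_{i,\,i \ldots v}\, c_{i \ldots v,\,k_i+1 \ldots n} = c_{i,\,k_i+1 \ldots n} + (\Lambda^{-1})_{i,\,i+1 \ldots v}\, c_{i+1 \ldots v,\,k_i+1 \ldots n}$, and hence:
$$c_{i,\,k_i+1 \ldots n} = \left( (\Lambda^{-1})_{i,\,i \ldots v}\, c_{i \ldots v,\,1 \ldots k_i} \right) A^{(i)} - (\Lambda^{-1})_{i,\,i+1 \ldots v}\, c_{i+1 \ldots v,\,k_i+1 \ldots n}.$$
The rows are processed in the order i = v, v−1, …, 1, so that all symbols on the right-hand side are already known when row i is computed.
Consider systematic encoding for a (16, 9) GCC with outer codes (4, 1), (4, 2), (4, 3), (4, 3).
The symbols $c_{1,1}$, $c_{2,\,1 \ldots 2}$, $c_{3,\,1 \ldots 3}$, $c_{4,\,1 \ldots 3}$ are information symbols as illustrated in symbol chart 500 in
1. $c_{4,4} = c_{4,\,1 \ldots 3}\, A^{(4)}$,
2. $c_{3,4} = (c_{3,\,1 \ldots 3} + c_{4,\,1 \ldots 3})\, A^{(3)} - c_{4,4}$,
3. $c_{2,\,3 \ldots 4} = (c_{2,\,1 \ldots 2} + c_{4,\,1 \ldots 2})\, A^{(2)} - c_{4,\,3 \ldots 4}$,
4. $c_{1,\,2 \ldots 4} = (c_{1,1} + c_{2,1} + c_{3,1} + c_{4,1})\, A^{(1)} - c_{2,\,2 \ldots 4} - c_{3,\,2 \ldots 4} - c_{4,\,2 \ldots 4}$.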
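The four steps of the (16, 9) example may be sketched in Python as follows. The parity parts `A` of the outer generator matrices are hypothetical choices (the source does not list them), and `L_INV` is the inverse matrix implied by the coefficients appearing in steps 1 through 4. Over GF(2), subtraction coincides with XOR.

```python
# Hypothetical parity parts A^(i) of outer generator matrices (E | A^(i)), n = 4.
A = [
    [[1, 1, 1]],          # (4, 1): k_1 = 1
    [[1, 1], [1, 0]],     # (4, 2): k_2 = 2
    [[1], [1], [1]],      # (4, 3): k_3 = 3
    [[1], [1], [1]],      # (4, 3): k_4 = 3
]
# Inverse of Lambda as implied by steps 1-4 of the example.
L_INV = [[1, 1, 1, 1], [0, 1, 0, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
N_OUTER = 4

def systematic_encode(info):
    """info[i] holds the k_i information symbols of row i; returns the 4 x 4 table."""
    v = len(A)
    c = [list(info[i]) + [0] * (N_OUTER - len(info[i])) for i in range(v)]
    for i in reversed(range(v)):          # rows v, v-1, ..., 1
        k = len(A[i])
        # Information part of the i-th outer codeword, built from rows i..v.
        s = [sum(L_INV[i][j] & c[j][m] for j in range(i, v)) % 2
             for m in range(k)]
        for m in range(k, N_OUTER):
            check = sum(s[r] & A[i][r][m - k] for r in range(k)) % 2
            rest = sum(L_INV[i][j] & c[j][m] for j in range(i + 1, v)) % 2
            c[i][m] = check ^ rest
    return c

def outer_rows(c):
    """Recover the outer codewords a = Lambda^{-1} c for verification."""
    v = len(c)
    return [[sum(L_INV[i][j] & c[j][m] for j in range(v)) % 2
             for m in range(N_OUTER)] for i in range(v)]
```

The information symbols appear unchanged in the leading positions of each row, and each row returned by `outer_rows` is a codeword of the corresponding outer code, which is the property the derivation above relies on.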
The complexity (XORs number) of a systematic encoding algorithm is equal to the complexity of the original non-systematic encoding algorithm of GCC (e.g.,
where a is a sparsity coefficient for generator matrices of outer codes. Since $\Lambda^{-1}$ is obtained by permuting rows and columns of the Arikan matrix, multiplication by this matrix can be implemented with complexity $O(v \log v)$.
This is much less than the complexity of multiplication of a vector by a generic lower-triangular matrix, which is given by $O(v^2)$. It is to be noted that n such multiplications have to be performed. Hence, the total complexity of a systematic encoding algorithm is given by:
In the case of high rate code, this expression is dominated by the first term.
Recall that this fast systematic encoding algorithm is suitable for the case of a lower-triangular generator matrix for the first inner code. Additional complexity reduction is achieved by employing polar codes as inner codes.
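The complexity reduction available with polar inner codes comes from the recursive structure of the Arikan matrix. The following Python sketch multiplies a binary vector by the unpermuted s-fold Kronecker power of the kernel with rows (1, 0) and (1, 1) using (v/2)·log2(v) XORs, and includes a direct reference computation for comparison. The function names are hypothetical, and the row and column permutation used for the first inner code is not applied here.

```python
def arikan_transform(u):
    """Compute u * F over GF(2), where F is the Kronecker power of [[1,0],[1,1]]."""
    x = list(u)
    v = len(x)
    if v == 0 or v & (v - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < v:
        for start in range(0, v, 2 * h):
            for j in range(start, start + h):
                x[j] ^= x[j + h]      # one butterfly XOR
        h *= 2
    return x

def naive_transform(u):
    """Reference: F[i][j] = 1 exactly when the set bits of j are set bits of i."""
    v = len(u)
    return [sum(u[i] for i in range(v) if i & j == j) % 2 for j in range(v)]
```

The butterfly version performs log2(v) passes of v/2 XORs each, whereas the reference computation inspects every entry of the v × v matrix.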
If the length of inner codes is equal to the length of outer codes, i.e., $v = n = \sqrt{N}$, then the expression for complexity (9) reduces to
In a RAID-based data storage system, the symbols of a codeword are typically stored on separate disks. If one needs to update some information symbol $u_i$, then the corresponding check symbols, which depend on $u_i$, must also be updated to preserve consistency. Hence, check symbols are updated much more frequently than information symbols. This may cause significant disk load imbalance and increase wear of the disks used to store check symbols.
In one embodiment, a load balancing method for GCC includes two parts: (1) load balancing for outer code codewords; and (2) load balancing for inner code codewords.
We first consider load balancing for outer codes. In one embodiment, the methodology is based on load balancing for linear block codes. Let us consider some (n, k) code generated by matrix G. The update rate for the j-th codeword symbol is given by:
$$\lambda_j = r\, w_j, \quad j = 1, \ldots, n,$$
where r is the update rate of information symbols, and $w_j \ge 1$ is the number of non-zero elements in the j-th column of G.
To avoid significant disk load imbalance, different generator matrices for the same code should be used for different stripes. By employing appropriate row operations, it is possible to construct L generator matrices $G_l$, $l = 1, \ldots, L$, containing the (k×k) identity submatrix E on different positions; the set of such positions is called an information set. This results in a disk update rate being given by:
$$\lambda_j = \frac{r}{L} \sum_{l=1}^{L} w_j^{(l)}, \quad j = 1, \ldots, n,$$
where $w_j^{(l)}$ is the number of non-zero elements in the j-th column of $G_l$, i.e., disk load imbalance is averaged over multiple stripes. By appropriate construction of matrices $G_l$, it is possible to make the $\lambda_j$ approximately the same, thus achieving uniform disk load.
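The effect of rotating generator matrices across stripes can be illustrated with the following Python sketch. It assumes that each of the supplied generator matrices is used for an equal share of the stripes, so that per-position update rates are the plain average of the column weights scaled by r. The function names and the example matrices `G1` and `G2` (two generator matrices of one hypothetical (6, 3) code) are illustrative only.

```python
def column_weights(G):
    """Number of non-zero entries in each column of a binary matrix."""
    return [sum(row[j] for row in G) for j in range(len(G[0]))]

def update_rates(generators, r=1.0):
    """Average update rate per codeword position over equally used generator matrices."""
    n = len(generators[0][0])
    totals = [0] * n
    for G in generators:
        for j, w in enumerate(column_weights(G)):
            totals[j] += w
    return [r * t / len(generators) for t in totals]

# Hypothetical (6, 3) code: identity on positions 0-2 (G1) and on positions 1-3 (G2).
G1 = [[1, 0, 0, 1, 1, 0], [0, 1, 0, 0, 1, 1], [0, 0, 1, 1, 0, 1]]
G2 = [[0, 1, 0, 0, 1, 1], [1, 0, 1, 0, 1, 1], [1, 0, 0, 1, 1, 0]]
```

Here `G2` is obtained from `G1` by row operations only, so both matrices generate the same code while placing the information symbols on different positions.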
Consider a (n=6, k=3) binary linear code with generator matrix
It can be verified that the following matrices are also generator matrices for this code:
It is to be understood that, in general, an arbitrary combination of k positions within a codeword does not represent an information set. That is, one cannot store the information symbols anywhere within a codeword. The reason is that obtaining an identity submatrix in a generator matrix on these positions requires the corresponding columns of the generator matrix to be linearly independent. This is not always the case.
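Whether a given set of k positions is an information set can be tested by checking that the corresponding columns of a generator matrix are linearly independent over GF(2). The Python sketch below does so with a small rank routine. The function names and the (6, 3) example matrix are hypothetical.

```python
def gf2_rank(vectors):
    """Rank over GF(2) of vectors packed as integer bitmasks."""
    basis = []
    for x in vectors:
        for b in basis:
            x = min(x, x ^ b)   # clears the leading bit of b in x if it is set
        if x:
            basis.append(x)
    return len(basis)

def is_information_set(G, positions):
    """True if the columns of G at the given positions are linearly independent."""
    k = len(G)
    if len(positions) != k:
        return False
    cols = [sum(G[r][c] << r for r in range(k)) for c in positions]
    return gf2_rank(cols) == k

# Hypothetical (6, 3) binary code in canonical form.
G_EXAMPLE = [[1, 0, 0, 1, 1, 0], [0, 1, 0, 0, 1, 1], [0, 0, 1, 1, 0, 1]]
```

For this example, positions 0, 1, 2 and positions 1, 2, 3 are information sets, while positions 3, 4, 5 are not, because the three corresponding columns sum to zero.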
In order to evaluate disk load imbalance, the following metric is suggested:
where β is average of λj, j=1, . . . , n over family Q of information sets. Minimization of F(Q) provides approximately the same update rates for all symbols of a codeword, which corresponds to improved load balancing.
We could identify a family of information sets $I^{(i)}$ for each outer code $\mathcal{A}_i$, $i = 1, \ldots, v$, and use them for storing information symbols. However, information sets for outer codes cannot be chosen independently, because rows of a GCC codeword depend on each other. In order to use a systematic encoding algorithm, information sets for outer codes should be nested. Indeed, in this case for given families of information sets $I^{(v)} \subset I^{(v-1)} \subset \ldots \subset I^{(1)}$, it is possible to perform the same reordering of columns of generator matrices for all outer codes and obtain generator matrices in the following form: $\tilde{G}^{(i)} = (E \mid \tilde{A}^{(i)})$, i.e., $\tilde{G}^{(i)} = G^{(i)} \tilde{P}$, where $G^{(i)}$ is the generator matrix of the i-th outer code and $\tilde{P}$ is an $n \times n$ permutation matrix. This operation corresponds to the multiplication of a generator matrix of GCC (see equation (3) above) by a permutation matrix, for example, if v=4, then:
If information sets I(1), . . . , I(v) are not nested, then it is not possible to construct such a permutation matrix for generator matrix G.
We next consider load balancing for inner codes. The above-described approach based on employing a number of different information sets of outer codes provides load balancing inside each row of a GCC codeword. However, load balancing for columns of a GCC codeword is also needed, since the number of information symbols corresponding to the i-th row is always $k_i$ and $k_1 \le k_2 \le \ldots \le k_v$. In one embodiment, load balancing for columns of a GCC codeword is implemented by arranging symbols of inner code codewords in reverse order. This results in another (N, K) GCC which is given by the same set of outer codes $\mathcal{A}_i$ and the set of inner codes $\mathcal{B}'_i$, $i = 1, \ldots, v$, where $\mathcal{B}'_i$ is the code generated by the matrix $(B^{(i)})^z$, $B^{(i)}$ is the generator matrix of the i-th inner code, and $W^z$ is the matrix vertically symmetric to the matrix W (i.e., W with the order of its columns reversed), i.e.:
$$(W^z)_{i,j} = W_{i,\,v-j+1}.$$
For v=4, this leads to the following modification of the generator matrix for GCC (see expression (10)).
One half of the data stored in the disks is encoded with the original GCC generated by expression (10) and the second half belongs to the modified GCC generated by expression (11). In this case, on average, the i-th row of a GCC codeword contains $(k_i + k_{v-i+1})/2$ information symbols (due to properties of GCC, $(k_1 + k_v) \approx (k_2 + k_{v-1}) \approx \ldots$).
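The averaging effect of mixing the original and the reversed GCC can be illustrated with a short Python sketch. It assumes that reversing the order of inner codeword symbols corresponds to reversing the columns of the inner generator matrix, and that exactly half of the stored data uses each of the two codes. The function names and the example dimensions are hypothetical.

```python
def reverse_columns(W):
    """Mirror a matrix left to right (reverse the order of codeword symbols)."""
    return [row[::-1] for row in W]

def average_info_symbols(k):
    """Average information symbols per row when half the data uses the reversed code."""
    v = len(k)
    return [(k[i] + k[v - 1 - i]) / 2 for i in range(v)]
```

For outer code dimensions 1, 2, 3, 3 (as in the (16, 9) example), the per-row averages become 2, 2.5, 2.5 and 2, which is considerably more uniform than the original distribution.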
Note that, in one or more embodiments, the above-described outer code generation, inner code generation, load balancing, and code reversing, are respectively performed via outer coder 32A, inner coder 32B, load balancing controller 32C, and reverse unit 32D of generator module 32 in
Accordingly, as illustrated above, error correcting codes are used in the design of redundant disk arrays. The smallest redundancy is achieved by employing maximum distance separable (MDS) codes. The simplest construction (known as RAID-4 and RAID-5) is based on the single-parity check code, which provides protection against single disk failure. If one needs to implement protection against multiple disk failures, then codes with higher redundancy are needed.
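For reference, the single-parity check construction mentioned above can be sketched in a few lines of Python. The function names are hypothetical, and blocks are represented as equal-length lists of bits for simplicity.

```python
from functools import reduce
from operator import xor

def parity(blocks):
    """Position-wise XOR of equal-length blocks."""
    return [reduce(xor, column) for column in zip(*blocks)]

def recover_missing(surviving_blocks):
    """Rebuild the single missing block from all surviving blocks (data and parity)."""
    return parity(surviving_blocks)
```

Because the XOR of all data blocks and the parity block is zero in every position, any one missing block equals the XOR of the remaining ones. A second simultaneous failure cannot be repaired this way, which is why codes with higher redundancy are needed.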
For small values of redundancy, N−K=2, it is possible to construct MDS array codes with quite small encoding and decoding complexity. For example, in the case of RDP and Even/Odd, the matrix representation of a codeword is used and values of redundant symbols are calculated according to different diagonals of the matrix. However, the extension of these schemes to the case of higher N−K is not straightforward and, in general, requires increasing symbol size and results in significant encoding and decoding complexity.
For a construction based on a linear code with generator matrix G=(E|B), where B is a K×(N−K) Cauchy matrix over field 2
Reed-Solomon codes have also been suggested for use in storage systems. Reed-Solomon codes require employing arithmetic operations in large finite fields, which induces high encoding and decoding complexity.
RAID-2 represents the first application of non-MDS codes in the design of disk arrays. It is based on Hamming codes, which can provide protection against up to 3 disk failures. However, the required number of parity disks is given by $\lceil \log N \rceil$, where N is the number of disks in the RAID.
Encoding of GCCs, as in the case of any other linear block codes, can be implemented via multiplication of a payload data vector by the corresponding generator matrix. Many applications can use systematic encoding which corresponds to employing generator matrices given by G=(E|B) for some matrix B. For a binary linear code of length N and dimension K, the average complexity of a straightforward implementation of this approach is K(N−K)/γ operations, where γ is sparsity coefficient.
The complexity of the systematic encoding algorithm embodiments of the invention is given by
where v is the length of inner codes, n is the length of outer codes, and $k_i$, $i = 1, \ldots, v$, are the dimensions of outer codes. The length of a GCC is N=nv (usually $v \approx n \approx \sqrt{N}$) and the dimension is $K = \sum_{i=1}^{v} k_i$. In the case of high-dimensional codes, it is dominated by the first term which is approximately equal to
This is much less than the complexity of straightforward multiplication by the generator matrix, which costs O(K(N−K)).
Table 700 in
Table 800 in
Although polar codes provide asymptotically optimal erasure correcting capability, for small block length (i.e., small number of disks), improved codes can be constructed. The theory of GCCs enables one to easily obtain appropriate long codes from shorter codes, together with the corresponding efficient encoding algorithm.
Disk load imbalance for a GCC-based RAID can be characterized by the metric $\sum_{j=1}^{N} |\lambda_j - \beta| / (\beta N)$, where $\lambda_j$ is the update rate for the j-th symbol of a codeword and β is the average of $\lambda_j$, $j = 1, \ldots, N$. In graph 900 in
While various embodiments of the invention have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
For example, it should be understood that some embodiments are directed to array controller 12, which is constructed and arranged to update data stored on a redundant array of disks having an array controller in a storage system, each disk of the redundant array of disks including a disk controller apart from the array controller. Some embodiments are directed to a process of updating data stored on a redundant array of disks having an array controller in a storage system, each disk of the redundant array of disks including a disk controller apart from the array controller. Also, some embodiments are directed to a computer program product which enables computer logic to update data stored on a redundant array of disks having an array controller in a storage system, each disk of the redundant array of disks including a disk controller apart from the array controller.
It should also be understood that some embodiments are directed to array controller 12, which is constructed and arranged to store data in a redundant disk array that employs a code which transforms an information vector of information symbols of length K into a codeword of code symbols of length N. Embodiments are directed to a process of storing data in a redundant disk array that employs a code which transforms an information vector of information symbols of length K into a codeword of code symbols of length N.
In other arrangements, array controller 12 is implemented by a set of processors or other types of control/processing circuitry running software. In such arrangements, the software instructions can be delivered, within array controller 12, either in the form of a computer program product 120 (see
Number | Date | Country | Kind |
---|---|---|---|
2013128346 | Jun 2013 | RU | national |