The present disclosure relates generally to storage systems and more specifically to a methodology for generating Help-By-Transfer (HBT) Minimum Storage Regenerating (MSR) erasure codes for high-rate erasure coding in a storage device.
In a large-scale distributed storage system, individual storage nodes will commonly fail or become unavailable from time to time. Therefore, storage systems typically implement some type of recovery scheme for recovering data that has been lost, degraded or otherwise compromised due to node failure or otherwise. One such scheme is known as erasure coding. Erasure coding generally involves the creation of codes used to introduce data redundancies (also called “parity data”) that are stored along with the original data (also referred to as “systematic data”), thereby encoding the data in a prescribed manner. If any systematic data or parity data becomes compromised, such data can be recovered through a series of mathematical calculations.
At a basic level, erasure coding for a storage system involves splitting a data file of size M into X chunks, each of the same size M/X. An erasure code is then applied to each of the X chunks to form A encoded data chunks, which again each have the size M/X. The effective size of the data is A*M/X, which means the original data file M has been expanded A/X times, with the condition that A≥X. Any X chunks of the available A encoded data chunks can then be used to recreate the original data file M. The erasure code applied to the data is denoted as (n, k), where n represents the total number of nodes across which all encoded data chunks will be stored and k represents the number of systematic nodes (i.e., nodes that store only systematic data) employed. The number of parity nodes (i.e., nodes that store parity data) is thus n−k=r. Erasure codes following this construction are referred to as maximum distance separable (MDS) if, for a loss of at most r nodes, the lost data is recoverable using the data stored on any k of the remaining nodes.
A simple example of a (4, 2) erasure code applied to a data file M is illustrated below.
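The original drawing for this example is not reproduced here. By way of non-limiting illustration, the following minimal Python sketch realizes one possible (4, 2) MDS erasure code over the small prime field GF(5); the generator matrix and field size are assumptions chosen purely for readability and are not taken from the original example:

    # A (4, 2) MDS erasure code over GF(5): any 2 of the 4 encoded
    # chunks suffice to recover the 2 original data chunks.
    P = 5  # small prime field, for illustration only

    # Generator matrix: rows 0-1 reproduce the data (systematic chunks);
    # rows 2-3 produce parity chunks. Every 2x2 submatrix of rows is
    # invertible mod 5, which is exactly the MDS property for (4, 2).
    G = [[1, 0], [0, 1], [1, 1], [1, 2]]

    def encode(data):  # data = [d0, d1], the two original chunks
        return [(g[0] * data[0] + g[1] * data[1]) % P for g in G]

    def recover(i, j, ci, cj):  # any two surviving chunks ci, cj
        a, b, c, d = G[i][0], G[i][1], G[j][0], G[j][1]
        inv = pow((a * d - b * c) % P, P - 2, P)  # Fermat inverse mod P
        return [(inv * (d * ci - b * cj)) % P, (inv * (a * cj - c * ci)) % P]

    data = [3, 4]
    enc = encode(data)
    assert recover(2, 3, enc[2], enc[3]) == data  # repair from parity only
    assert recover(0, 3, enc[0], enc[3]) == data

Because every pair of rows of G is invertible, any two surviving chunks, whether systematic or parity, suffice to recreate the data file, which is the MDS property described above.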
Disk failure (or unavailability) occurs frequently in large-scale distributed storage systems. While some commonly employed MDS codes, such as the Reed-Solomon code, are very good in terms of requiring reduced storage overhead, they can impose a significant burden on the storage system I/O when recovering a failed or unavailable disk. In other words, a significant amount of disk I/O must be dedicated to recovering the failed or unavailable disk, which consumes system resources and impacts performance. Minimum-storage regenerating (MSR) codes are a class of MDS codes that in theory promise a significant reduction in disk I/O during repair. At the same time, these codes compromise neither storage overhead nor reliability when compared to the Reed-Solomon code.
In an MSR coding scheme, every storage node of a storage network contains a set of data, represented in coding theory as “symbols”; the number of symbols stored per node is referred to as the “sub-packetization” level. MSR codes require minimum storage space per storage node. To recover a particular failed storage node, only a subset of all symbols stored on each surviving storage node must be accessed (e.g., transferred to the new or repaired node) to regenerate the data set that was lost. This number of symbols is known to be close to the information-theoretic minimum. In other words, if each storage node stores α symbols, only β of those symbols need be obtained from each of d surviving storage nodes to recover a failed storage node.
The amount of data that must be transferred to the new or repaired node to regenerate the data set lost when a node fails or becomes unavailable is known as the “repair bandwidth.” The repair bandwidth dβ is thus a function of the amount of data β accessed at each surviving node (referred to as a “helper” node) and the number of helper nodes d that must be contacted. A so-called “help-by-transfer” regenerating code is one that does not require computation at the helper node before the data is transmitted. It follows that a help-by-transfer code possessing minimum sub-packetization is access optimal (AO), meaning that during a recovery process each surviving storage node needs to transmit only the β symbols that it accesses. See, I. Tamo, Z. Wang, and J. Bruck, “Access vs. Bandwidth in Codes for Storage,” IEEE International Symposium on Information Theory (ISIT 2012), July 2012, pp. 1187-1191, which is incorporated herein by reference.
The “code rate” for an (n, k) erasure code is defined as k/n or k/(k+r), which represents the proportion of systematic data in the total amount of stored data (i.e., systematic data plus parity data). An erasure code having a code rate k/n>0.5 is deemed to be a high-rate erasure code. This means that the coding scheme will employ a relatively large number of systematic nodes k as compared to parity nodes r. Conversely, a low-rate (k/n≤0.5) erasure code will employ a relatively small number of systematic nodes k as compared to parity nodes r. High-rate erasure codes can thus be desirable because they require less storage overhead than low-rate erasure codes for a given set of systematic data.
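By way of illustration only, the code rate classification above reduces to a one-line computation (a minimal sketch; the helper name is arbitrary):

    def code_rate(n, k):
        """Proportion of systematic data in the total stored data."""
        return k / n

    assert code_rate(9, 6) > 0.5   # the (9, 6) code developed below is high-rate
    assert code_rate(4, 2) <= 0.5  # the (4, 2) example above is low-rate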
It has been shown that the lower bound of sub-packetization for AO high-rate erasure codes is equal to r^(k/r). See, I. Tamo, Z. Wang, and J. Bruck, “Access vs. Bandwidth in Codes for Storage,” IEEE International Symposium on Information Theory (ISIT 2012), July 2012, pp. 1187-1191, which is incorporated herein by reference. Attempts have recently been made to develop MSR erasure codes that achieve this minimum sub-packetization bound r^(k/r). See, B. Sasidharan, G. K. Agarwal, and P. V. Kumar, “An Alternate Construction of an Access-Optimal Regenerating Code with Optimal Sub-Packetization Level,” arXiv:1501.04760v1, 20 Jan. 2015, which is incorporated herein by reference. In that particular work, the authors demonstrated the construction of MSR codes following an iterative approach. As exemplified by that work, known methods for constructing MSR codes rely on abstract mathematical approaches. While some prior works have proven the existence of high-rate MSR codes, there has yet to be demonstrated any practical approach for constructing a high-rate MSR code that can be applied to a distributed storage system.
What is needed, therefore, is a relatively simple way to construct help-by-transfer high-rate MSR erasure codes that achieve the minimum sub-packetization bound r^(k/r) and have practical application to distributed storage systems.
The various embodiments described herein provide a conceptually simple approach for generating high-rate MSR erasure codes that have practical application in data storage systems. Embodiments include methods, systems and corresponding computer-executable instructions for generating tree structures and assigning indices to the nodes thereof, according to certain rules. The nodes in the tree structures represent systematic storage nodes that store original systematic data and parity storage nodes that store redundant parity data. Systematic symbols for a high-rate MSR erasure code can then easily be added to construct the parity symbols of the codeword array representing the desired high-rate MSR erasure code. Each parity symbol will be a linear combination of systematic symbols. When a systematic node fails, parity symbols from β rows of each of the parity nodes will provide the linear equations needed to solve for and recover the lost systematic symbols. By traversing the tree structures described herein in certain ways to be described, the parity symbols for the codeword array can be determined. When forming the linear combinations of the parity symbols for the codewords, random number coefficients or other coefficients (generated by some other technique, e.g., maintaining interference alignment) can be used for certain of the parity symbols, which will ensure a high-rate MSR erasure code that has practical application in storage systems.
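As a non-limiting sanity check of the recovery claim above, the following illustrative arithmetic uses the (9, 6) example developed below:

    # Repairing one systematic node must recover alpha lost symbols.
    # Reading beta rows from each of the r parity nodes yields exactly
    # alpha linear equations, one per parity symbol read.
    n, k = 9, 6
    r = n - k                 # 3 parity nodes
    alpha = r ** (k // r)     # sub-packetization, 9
    beta = alpha // r         # rows read per parity node, 3
    assert r * beta == alpha  # 9 equations for 9 lost symbols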
MSR codes may be expressed as “codeword arrays,” which are tables showing the systematic and parity symbols to be stored in each of the systematic and parity nodes used.
The symbol array 202 referenced in this description illustrates one such codeword array, with one column for each storage node and one row for each symbol index.
The following example, used throughout the remainder of this description, is defined by the following parameters (these relations are computed in the sketch following this list):
n=9
k=6
r=n−k=3
m=2
α=r^m=9
β=α/r=3
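By way of illustration only, these parameter relations can be computed as follows (a minimal sketch; variable names are arbitrary):

    n, k = 9, 6
    r = n - k            # r = 3 parity nodes
    m = k // r           # m = 2 r-Ary trees
    alpha = r ** m       # sub-packetization: 9 rows per node
    beta = alpha // r    # 3 symbols read per helper node
    # alpha equals the AO lower bound r^(k/r) = 3^2 = 9 noted above
    repair_bandwidth = (n - 1) * beta   # (n-1)*beta = 24 symbols per repair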
Next, at step 306, indices are created for the root node of each tree. Each of the m r-Ary trees is given a root node index i, where i={0, . . . , m−1}. Thus, in the present example where m=2, the two trees are given root node indices i=0 and i=1.
Next, at step 312, the indices for each r-Ary tree with root node index i≥1 are created. For these trees, each leaf node index, a base-r m-digit number, is obtained by applying a right-shift-rotation operation i times to the corresponding leaf node index of the i=0 tree.
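By way of non-limiting illustration, the following minimal sketch, which assumes one straightforward reading of the rule above, generates the leaf node indices for the example trees with r=3 and m=2:

    r, m = 3, 2

    def to_digits(x):
        """Base-r, m-digit form of x, most significant digit first."""
        return [(x // r**p) % r for p in reversed(range(m))]

    def from_digits(digits):
        val = 0
        for d in digits:
            val = val * r + d
        return val

    def rotate_right(digits, i):
        """Right-shift-rotate the digit list i times (last digit to front)."""
        i %= len(digits)
        return digits[-i:] + digits[:-i]  # for i = 0 this is a no-op

    def leaf_indices(i):
        """Decimal leaf node indices of the r-Ary tree with root index i."""
        return [from_digits(rotate_right(to_digits(x), i)) for x in range(r**m)]

    print(leaf_indices(0))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
    print(leaf_indices(1))  # [0, 3, 6, 1, 4, 7, 2, 5, 8]

The first list corresponds to the i=0 tree read left to right; the second shows how the right-shift-rotation reorders the same base-3 labels for the i=1 tree.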
After the m r-Ary trees are constructed and all node indices are assigned, the trees are complete. Then, based on the completed trees, a codeword array for the high-rate MSR code can easily be obtained. As described above, such a codeword array can be represented as an array with α rows and n columns. The columns represent the systematic nodes and the parity nodes {N0, N1, . . . , Nk-1, P0, P1, . . . , Pr-1}. The rows represent the symbols to be stored in each of the systematic nodes and the parity nodes.
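By way of illustration only (the letter-per-node symbol naming is an assumption adopted for readability), the systematic portion of such an array can be laid out as follows:

    # Skeleton of the codeword array for the (9, 6) example:
    # alpha = 9 rows, n = 9 columns (6 systematic + 3 parity).
    alpha, k, r = 9, 6, 3
    letters = "abcdef"  # one letter per systematic node N0..N5

    # array[s][j]: systematic columns hold one named symbol per row;
    # parity columns will later hold lists of symbols to combine linearly.
    array = [[None] * (k + r) for _ in range(alpha)]
    for j in range(k):
        for s in range(alpha):
            array[s][j] = f"{letters[j]}{s}"  # e.g., "a0" .. "f8"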
In the (9, 6) example discussed herein, each row s of the codeword array can be written as a tuple R of n symbols, the first k of which are the systematic symbols {as, bs, cs, . . . } to be stored in the systematic nodes {N0, N1, . . . , Nk-1}.
The last r symbols in the tuple R represent the symbols for the parity nodes {P0, P1, . . . , Pr-1}. In accordance with certain embodiments, the parity symbols for a high-rate MSR erasure code (also referred to as an “HMSR erasure code”) must be designed so as to enable successful recovery from the failure of any one systematic node {N0, N1, . . . , Nk-1}. The desired HMSR erasure code will thus be resilient to the failure of any one systematic node, for which the data to be downloaded for recovery is (n−1)β symbols, fulfilling the MSR requirement.
To begin the determination of parity symbols, the method 300 first populates the first parity node P0 with simple row parity: each symbol ps0 of parity node P0 is a linear combination of the k systematic symbols {as, bs, cs, . . . } from the sth row.
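Continuing the illustrative sketch (coefficients are omitted and symbols are merely collected by name; the snippet also pre-populates the row-symbol portion of the remaining parity nodes described in the next paragraph):

    # Row parity for the (9, 6) example: every parity node P_t starts
    # with the k systematic symbols of its row; P0 remains exactly this
    # row parity, while P1..P(r-1) receive additional symbols later.
    alpha, k, r = 9, 6, 3
    letters = "abcdef"

    # parity[t][s]: the symbols combined to form p_st, row s of node P_t.
    parity = [[[f"{letters[j]}{s}" for j in range(k)]
               for s in range(alpha)]
              for t in range(r)]

    print(parity[0][0])  # ['a0', 'b0', 'c0', 'd0', 'e0', 'f0'] -> p_00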
The parity symbols for the remaining parity nodes pst (for t={1, 2, . . . , r−1}) are a combination of the k systematic symbols {as, bs, cs, . . . } from the sth row and an additional m systematic symbols from rows other than the sth row, as illustrated in the completed symbol array discussed below.
To this point the symbol array 702 has rather simply been populated with systematic symbols for each systematic node, row parity symbols for the first parity node and row parity symbols for the remaining parity nodes. However, the remaining m symbols must now be determined for each row of the parity nodes for all but the first parity node P0 before the symbols pst for t={1, 2, . . . , r−1} are complete. These additional m symbols are determined using the m r-Ary tree structures discussed above.
Next, at step 508, symbols are determined and added to the symbol array for the parity node Pt (which is P1 in the first iteration). The symbols are added to the rows in the symbol array having the same indices as the leaf node indices identified in step 506. Each symbol is expressed as the letter designating the node Nj sub-tree and the decimal forms of the leaf node indices of a different sub-tree under the root node with index i. In some embodiments, the different sub-tree may be selected in a left to right manner, with the sub-tree to the immediate right of the node Nj sub-tree being the first chosen and returning to the first sub-tree under root node i after reaching the last sub-tree under root node i. In other embodiments, the different sub-tree may be any other sub-tree under root node i (i.e., selecting the different sub-tree in a left to right order is optional). With reference to the example, for node N0 under the root node with index i=0, the sub-tree to its immediate right has leaf node indices 10, 11 and 12 (decimal 3, 4 and 5), so symbols a3, a4 and a5 are added to rows 0, 1 and 2 of parity node P1.
After adding the symbols in step 508, the method moves to step 510 where it is determined whether there is another different sub-tree under the root node with index i (which for the first iteration remains set at 0). If so, the parity node index t is incremented by 1 (i.e., t=t+1) at step 512 and the method then returns to step 508 where symbols are determined and added to the symbol array for the parity node Pt (which is now P2 in this iteration). Again, the symbols are added to the rows in the symbol array having the same indices as the leaf node indices identified in step 506. The symbols are expressed as the letter designating the node Nj sub-tree and the decimal forms of the leaf node indices of a different sub-tree under the root node with index i. Following the example, the next different sub-tree for node N0 has leaf node indices 20, 21 and 22 (decimal 6, 7 and 8), so symbols a6, a7 and a8 are added to rows 0, 1 and 2 of parity node P2.
When it is determined at step 510 that there are no other different sub-trees under the root node with index i, the method advances to step 514 where the parity node index t is again set to 1 and the systematic node index j is incremented by 1. Next, a determination is made at step 516 as to whether the systematic node Nj is under the root node with index i. In the example, systematic nodes N0, N1 and N2 are under the root node with index i=0, and systematic nodes N3, N4 and N5 are under the root node with index i=1, so steps 508 to 516 are repeated for each systematic node under the current root node.
When it is finally determined at step 516 that node Nj (after incrementing j by 1 at step 514) is not under the root node with index i, the method moves to step 518 where the root node index i is incremented by 1 before again returning to step 504 for more iterations. Thus, in the example, after the sub-trees of root node i=0 (corresponding to nodes N0, N1 and N2) have been processed, the method proceeds in the same manner for the sub-trees of root node i=1 (corresponding to nodes N3, N4 and N5).
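By way of non-limiting illustration, the following minimal sketch implements one plausible reading of steps 504 to 518 for the (9, 6) example. The assignment of nodes N0-N2 to root tree 0 and N3-N5 to root tree 1, the left-to-right choice of sibling sub-trees and the in-order pairing of target rows with source indices are assumptions consistent with, but not fixed by, the description above:

    r, m, k = 3, 2, 6
    alpha = r ** m
    letters = "abcdef"  # illustrative symbol names, as before

    def rotate_right(digits, i):
        i %= len(digits)
        return digits[-i:] + digits[:-i]

    def leaf_rows(i, subtree):
        """Decimal leaf node indices of the given sub-tree under root tree i."""
        rows = []
        for x in range(r ** (m - 1)):  # beta leaves per sub-tree
            digits = [subtree] + [(x // r**p) % r for p in reversed(range(m - 1))]
            val = 0
            for d in rotate_right(digits, i):
                val = val * r + d
            rows.append(val)
        return rows

    # extra[t][s] collects the m additional symbols for row s of parity
    # node P_t (t = 1 .. r-1), to be combined with the row parity symbols.
    extra = {t: {s: [] for s in range(alpha)} for t in range(1, r)}

    for j in range(k):                # systematic node N_j
        i, c = divmod(j, r)           # its root tree i and sub-tree position c
        targets = leaf_rows(i, c)     # the beta rows receiving extra symbols
        for t in range(1, r):         # parity nodes P1 .. P(r-1)
            sources = leaf_rows(i, (c + t) % r)  # a different sub-tree
            for row, src in zip(targets, sources):
                extra[t][row].append(f"{letters[j]}{src}")

    print(extra[1][0])  # ['a3', 'd1'] under the assumptions above

Under these assumptions, each row of each parity node P1 through Pr-1 receives exactly m=2 additional symbols, so every parity symbol pst (t≥1) combines k+m systematic symbols, consistent with the description above.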
As shown in the completed symbol array 702, each parity symbol pst for t={1, 2, . . . , r−1} is thus a linear combination of the k systematic symbols of its own row and the m additional systematic symbols determined from the tree traversal.
During the iterations of steps 504 to 518, when it is finally determined at step 504 that j&lt;k is not true, the method will end at step 520. For instance, at the end of the third iteration for the case of i=1 in the example of an HMSR (9, 6) erasure code, the systematic node index j will be incremented to 6 at step 514 and the root node index i will be incremented to 2 at step 518. Then, upon returning to step 504, it will be determined that j&lt;k is not true, which will cause the method to end at step 520.
Completion of method 300 (together with the parity symbol determination of steps 504 to 520 described above) yields the completed symbol array, and thus the codeword array, for the desired HMSR erasure code.
The exemplary system described below includes one or more hosts 802a, 802b, a plurality of storage nodes 804a, 804b, . . . 804n and a client device 806, all communicatively coupled directly and/or via one or more networks 810.
Each host 802a, 802b may include one or more CPUs 812, such as a microprocessor, microcontroller, application-specific integrated circuit (“ASIC”), state machine, or other processing device. The CPU 812 executes computer-executable program code comprising computer-executable instructions for causing the CPU 812, and thus the host 802a, 802b, to perform certain methods and operations. For example, the computer-executable program code can include computer-executable instructions for causing the CPU to execute a storage operating system and at least some of the methods described herein for constructing HMSR erasure codes and for encoding, storing, retrieving and decoding data chunks in the various storage nodes 804a, 804b, . . . 804n. The CPU 812 may be communicatively coupled to a memory 814 via a bus 816 for accessing program code and data stored in the memory 814.
The memory 814 can comprise any suitable non-transitory computer readable media that stores executable program code and data. For example, the computer-readable medium can include any electronic, optical, magnetic, or other storage device capable of providing a processor with computer-readable instructions or other program code. Non-limiting examples of a computer-readable medium include a floppy disk, CD-ROM, DVD, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, optical storage, magnetic tape or other magnetic storage, or any other medium from which a computer processor can read instructions. The program code or instructions may include processor-specific instructions generated by a compiler and/or an interpreter from code written in any suitable computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, Python, Perl, JavaScript, and ActionScript. Although not shown as such, the memory 814 could also be external to a particular host 802a, 802b, e.g., in a separate device or component that is accessed through a dedicated communication link and/or via the network(s) 810. A host 802a, 802b may also comprise any number of external or internal devices, such as input or output devices. For example, host 802a is shown with an input/output (“I/O”) interface 818 that can receive input from input devices and/or provide output to output devices.
A host 802a, 802b can also include at least one network interface 819. The network interface 819 can include any device or group of devices suitable for establishing a wired or wireless data connection to one or more of the networks 810 or directly to a network interface 829 of a storage node 804a, 804b, . . . 804n and/or a network interface 839 of a client device 806. Non-limiting examples of a network interface 819, 829, 839 can include an Ethernet network adapter, a modem, and/or the like to establish a TCP/IP connection with a storage node 804a, 804b, . . . 804n, or a SCSI interface, USB interface, or FireWire interface to establish a direct connection with a storage node 804a, 804b, . . . 804n.
Each storage node 804a, 804b, . . . 804n may include similar components to those shown and described for the hosts 802a, 802b. For example, storage nodes 804a, 804b, . . . 804n may include a CPU 822, memory 824, a network interface 829, and an I/O interface 828 all communicatively coupled via a bus 826. The components in storage node 804a, 804b, . . . 804n function in a similar manner to the components described with respect to the hosts 802a, 802b. By way of example, the CPU 822 of a storage node 804a, 804b, . . . 804n may execute computer-executable instructions for storing, retrieving and processing data in memory 824, which may include multiple tiers of internal and/or external memories.
Each of the hosts 802a, 802b can be coupled to one or more storage node(s) 804a, 804b, . . . 804n. Each of the storage nodes 804a, 804b, . . . 804n could be an independent memory bank. Alternatively, storage nodes 804a, 804b, . . . 804n could be interconnected, thus forming a large memory bank or a subcomplex of a large memory bank. Storage nodes 804a, 804b, . . . 804n may be, for example, storage disks, magnetic memory devices, optical memory devices, flash memory devices, combinations thereof, etc., depending on the particular implementation and embodiment. In some embodiments, each storage node 804a, 804b, . . . 804n may include multiple storage disks, magnetic memory devices, optical memory devices, flash memory devices, etc. Each of the storage nodes 804a, 804b, . . . 804n can be configured, e.g., by a host 802a, 802b or otherwise, to serve as a systematic node or a parity node in accordance with the various embodiments described herein.
A client device 806 may also include similar components to those shown and described for the hosts 802a, 802b. For example, a client device 806 may include a CPU 832, memory 834, a network interface 839, and an I/O interface 838, all communicatively coupled via a bus 836. The components in a client device 806 function in a similar manner to the components described with respect to the hosts 802a, 802b. By way of example, the CPU of a client device 806 may execute computer-executable instructions for allowing a storage network architect, administrator or other user to design the m r-Ary tree structures, symbol arrays and/or codeword arrays for HMSR erasure codes, as described herein. Such computer-executable instructions and other instructions and data may be stored in the memory 834 of the client device 806 or in any other internal or external memory accessible by the client device. In some embodiments, the user of the client device may interact with the program(s) executing on the client device 806, for example with input and output devices, to design and construct desired tree structures, symbol arrays and codeword arrays. In other embodiments, the execution of the program code may cause the desired tree structures, symbol arrays and codeword arrays to be designed and constructed in an automated fashion. As noted, host(s) may alternatively or additionally execute such program(s) for designing and constructing tree structures, symbol arrays and codeword arrays for HMSR erasure codes according to the methods described herein.
It will be appreciated that the depicted hosts 802a, 802b, storage nodes 804a, 804b, . . . 804n and client device 806 are represented and described in relatively simplistic fashion and are given by way of example only. Those skilled in the art will appreciate that actual hosts, storage nodes, client devices and other devices and components of a storage network may be much more sophisticated in many practical applications and embodiments. In addition, the hosts 802a, 802b and storage nodes 804a, 804b, . . . 804n may be part of an on-premises system and/or may reside in cloud-based systems accessible via the networks 810.
Numerous specific details are set forth herein to provide a thorough understanding of the claimed subject matter. However, those skilled in the art will understand that the claimed subject matter may be practiced without these specific details. In other instances, methods, apparatuses, or systems that would be known by one of ordinary skill have not been described in detail so as not to obscure claimed subject matter.
Unless specifically stated otherwise, it is appreciated that throughout this specification discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” and “identifying” or the like refer to actions or processes of a computing device, such as one or more computers or a similar electronic computing device or devices, that manipulate or transform data represented as physical electronic or magnetic quantities within memories, registers, or other information storage devices, transmission devices, or display devices of the computing platform.
Some embodiments described herein may be conveniently implemented using a conventional general purpose or a specialized digital computer or microprocessor programmed according to the teachings herein, as will be apparent to those skilled in the computer art. Some embodiments may be implemented by a general purpose computer programmed to perform method or process steps described herein. Such programming may produce a new machine or special purpose computer for performing particular method or process steps and functions (described herein) pursuant to instructions from program software. Appropriate software coding may be prepared by programmers based on the teachings herein, as will be apparent to those skilled in the software art. Some embodiments may also be implemented by the preparation of application-specific integrated circuits or by interconnecting an appropriate network of conventional component circuits, as will be readily apparent to those skilled in the art. Those of skill in the art will understand that information may be represented using any of a variety of different technologies and techniques.
Some embodiments include a computer program product comprising a computer readable medium (media) having instructions stored thereon/in that, when executed (e.g., by a processor), cause the executing device to perform the methods, techniques, or embodiments described herein, the computer readable medium comprising instructions for performing various steps of the methods, techniques, or embodiments described herein. The computer readable medium may comprise a non-transitory computer readable medium. The computer readable medium may comprise a storage medium having instructions stored thereon/in which may be used to control, or cause, a computer to perform any of the processes of an embodiment. The storage medium may include, without limitation, any type of disk including floppy disks, mini disks (MDs), optical disks, DVDs, CD-ROMs, micro-drives, and magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices (including flash cards), magnetic or optical cards, nanosystems (including molecular memory ICs), RAID devices, remote data storage/archive/warehousing, or any other type of media or device suitable for storing instructions and/or data thereon/in.
Stored on any one of the computer readable medium (media), some embodiments include software instructions for controlling both the hardware of the general purpose or specialized computer or microprocessor, and for enabling the computer or microprocessor to interact with a human user and/or other mechanism using the results of an embodiment. Such software may include without limitation device drivers, operating systems, and user applications. Ultimately, such computer readable media further includes software instructions for performing embodiments described herein. Included in the programming (software) of the general-purpose/specialized computer or microprocessor are software modules for implementing some embodiments.
The various illustrative logical blocks, modules, and circuits described in connection with the embodiments disclosed herein may be implemented or performed with a general-purpose processing device, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processing device may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processing device may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.
Aspects of the methods disclosed herein may be performed in the operation of such processing devices. The order of the blocks presented in the figures described above can be varied—for example, some of the blocks can be re-ordered, combined, and/or broken into sub-blocks. Certain blocks or processes can be performed in parallel.
The use of “adapted to” or “configured to” herein is meant as open and inclusive language that does not foreclose devices adapted to or configured to perform additional tasks or steps. Additionally, the use of “based on” is meant to be open and inclusive, in that a process, step, calculation, or other action “based on” one or more recited conditions or values may, in practice, be based on additional conditions or values beyond those recited. Headings, lists, and numbering included herein are for ease of explanation and are not meant to be limiting.
While the present subject matter has been described in detail with respect to specific examples thereof, it will be appreciated that those skilled in the art, upon attaining an understanding of the foregoing, may readily produce alterations to, variations of, and equivalents to such aspects and examples. Accordingly, it should be understood that the present disclosure has been presented for purposes of example rather than limitation, and does not preclude inclusion of such modifications, variations, and/or additions to the present subject matter as would be readily apparent to one of ordinary skill in the art.