Erasure coding, such as MDS (Maximum-Distance Separable) erasure coding, is a known technique that is used to improve the reliability of data communication and data storage systems. In general, erasure coding allows lost data to be recovered in a manner that provides redundancy without needing to exactly replicate each piece of data. Classical erasure coding theory focuses on the tradeoff between redundancy and error tolerance.
Known erasure coding techniques are not always adequate to use in modern storage systems and computing environments. In particular, when recovering lost data, other metrics need to be considered with respect to whether a recovery mechanism meets the needs of its users. For example, when erasure-coded backup data is accessible over a network, network traffic may make recovery very inefficient. Similarly, disk input and output (I/O) that is incurred when recovering from failures may be a very large factor in recovery time.
This Summary is provided to introduce a selection of representative concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used in any way that would limit the scope of the claimed subject matter.
Briefly, various aspects of the subject matter described herein are directed towards a technology by which data blocks are coded into erasure coded blocks in a two-stage, two-level processing operation. In a first processing stage, such as via MDS coding, original blocks are coded into a first level of coded blocks. The first level may be generated via a systematic code, in which case the output coded blocks include the original blocks and one or more parity blocks (i.e., blocks that are functions of the original blocks); alternatively, the first level may be generated via a non-systematic code. In the second processing stage of fork codes, the first level blocks are partitioned into groups, and those groups are used to generate a second level of parity blocks. The first level coded blocks in each group, together with the second-level parity blocks generated from them, are said to form a coding group. The blocks are maintained among a plurality of storage nodes.
In one aspect, recovery of a failed data block (corresponding to its node) is accomplished by determining whether the other data blocks associated with the failed data block's coding group are sufficient to perform the recovery. If so, only those data blocks need to be accessed to perform the recovery. In this manner, recovery is often significantly more efficient since fewer blocks need to be accessed to perform the recovery than with conventional erasure coding techniques.
Other advantages may become apparent from the following detailed description when taken in conjunction with the drawings.
The present invention is illustrated by way of example and not limited in the accompanying figures in which like reference numerals indicate similar elements and in which:
Various aspects of the technology described herein are generally directed towards a class of erasure codes, referred to herein as fork codes, which may be used in data recovery of blocks of data (as used herein, a “block” is any amount of data, e.g., a file, a volume, a unit of storage, or any fixed amount such as three kilobytes, two megabytes, four gigabytes, and so forth). In general, with fork-based erasure coding, the recovery time is ordinarily much faster than with traditional erasure coding, at the cost of some additional storage. As will be understood, a fork code achieves a good tradeoff among several performance metrics that are important to a storage system, including redundancy, erasure tolerance, and/or recovery complexity.
It should be understood that any of the examples described herein are non-limiting examples. As such, the present invention is not limited to any particular embodiments, aspects, concepts, structures, functionalities or examples described herein. Rather, any of the embodiments, aspects, concepts, structures, functionalities or examples described herein are non-limiting, and the present invention may be used in various ways that provide benefits and advantages in computing and data storage and/or recovery in general.
By way of background on erasure coding, consider two original information blocks, A and B. A straightforward way to store information to recover these blocks is to replicate them among storage nodes, e.g., as two copies of the two original blocks, A and B, at four nodes. If any single node fails, at least one copy of each block remains; however, if two nodes fail, it may be that both are B nodes, for example, whereby no copy of B can be recovered. Note that a storage node, or simply “node” as used herein, may be any storage device or part thereof, and nodes are typically arranged such that each is independent of the others with respect to failing.
Erasure coding is a more advanced way of adding redundancy. In the previous example, erasure coding allows recovery without data loss even if any two nodes fail (unlike the replication situation, in which data is lost if both nodes containing B blocks fail). This is accomplished, for example, by storing a copy of each block along with mathematical combinations of the data from each block, e.g., by storing block A at node 1, block B at node 2, a parity block of A+B at node 3, and a parity block of A+2B at node 4, where the addition and multiplication operations are performed in a finite field. This erasure coded system can tolerate any two node failures without data loss, e.g., if the first and fourth nodes fail, the system can recover A and B from B and A+B; this can be done by computing (A+B)−B=A. Erasure coding thus provides redundancy and better node failure tolerance in an efficient way.
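The four-node example above can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not name the finite field, so the prime field GF(257) is assumed here, each block is reduced to a single field symbol, and the names `encode`, `recover`, and `COEFFS` are hypothetical.

```python
P = 257  # assumed prime modulus; the text only says "a finite field"

# Coefficients (of A, of B) held by nodes 1..4: A, B, A+B, A+2B.
COEFFS = [(1, 0), (0, 1), (1, 1), (1, 2)]


def encode(a, b):
    """Return the four symbols stored at nodes 1..4 (0-based list)."""
    return [(ca * a + cb * b) % P for ca, cb in COEFFS]


def recover(survivors):
    """Recover (A, B) from exactly two surviving nodes.

    survivors maps a 0-based node index to the symbol stored there.
    The 2x2 linear system is solved with Cramer's rule over GF(P).
    """
    (i, x), (j, y) = sorted(survivors.items())
    (p, q), (r, s) = COEFFS[i], COEFFS[j]
    det = (p * s - q * r) % P
    inv = pow(det, -1, P)  # modular inverse, Python 3.8+
    a = ((s * x - q * y) * inv) % P
    b = ((p * y - r * x) * inv) % P
    return a, b


if __name__ == "__main__":
    stored = encode(42, 200)
    # Nodes 1 and 4 fail; only B (node 2) and A+B (node 3) remain.
    print(recover({1: stored[1], 2: stored[2]}))
```

Every pair of coefficient rows has a nonzero determinant modulo 257, which is why any two surviving nodes are enough in this sketch.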
In erasure coding in general, a number of information blocks (typically denoted by k) are encoded into a number of coded blocks (typically denoted by n) by a linear transformation defined in a finite field. An erasure code can be specified by an n×k generator matrix G, which describes how each coded block is formed as a linear combination of the original information blocks. In the above example of A, B, A+B and A+2B, the generator matrix is:
$$G = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ 1 & 2 \end{pmatrix}$$
with the four rows in G corresponding to these four coded blocks, respectively. An (n,k) erasure code is said to be systematic if the first k coded blocks are the original k information blocks; the above code is thus a systematic code. Note that in erasure decoding, the original information blocks are recovered by applying another linear transformation to the available data blocks; erasure decoding essentially amounts to solving linear equations.
Classical erasure coding theory focuses on the tradeoff between redundancy and error tolerance. Under this criterion, as is known, an MDS (Maximum-Distance Separable) erasure code is optimal, where an (n,k) MDS code encodes k information blocks into n coded blocks (of the same length as the original blocks) such that any k coded blocks can be used to reconstruct the original k information blocks. For example, a systematic (7, 5) MDS code has seven output blocks: five blocks that are the five original information blocks, namely A, B, C, D and E, plus two parity coded blocks that are functions of the five original information blocks, namely A+B+C+D+E and A+2B+3C+4D+5E in one systematic (7, 5) MDS code example.
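The "any k of n" property of the systematic (7, 5) example can be checked mechanically, and decoding can be carried out as the solution of a linear system. The following is a minimal sketch, assuming the prime field GF(257) (the text does not specify the field) and single-symbol blocks; `solve_mod`, `is_mds`, and `decode` are hypothetical helper names.

```python
from itertools import combinations

P = 257  # assumed prime field size
K = 5

# Generator matrix: identity rows for A..E, then the two parity rows.
G = [[1 if c == r else 0 for c in range(K)] for r in range(K)]
G.append([1, 1, 1, 1, 1])   # A+B+C+D+E
G.append([1, 2, 3, 4, 5])   # A+2B+3C+4D+5E


def encode(info):
    """Multiply the generator matrix by the information vector over GF(P)."""
    return [sum(g * x for g, x in zip(row, info)) % P for row in G]


def solve_mod(rows, rhs):
    """Solve rows * x = rhs over GF(P); return None if rows is singular."""
    k = len(rows)
    aug = [[v % P for v in r] + [b % P] for r, b in zip(rows, rhs)]
    for col in range(k):
        pivot = next((r for r in range(col, k) if aug[r][col]), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, P)
        aug[col] = [(v * inv) % P for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col]:
                f = aug[r][col]
                aug[r] = [(a - f * b) % P for a, b in zip(aug[r], aug[col])]
    return [row[k] for row in aug]


def is_mds(gen):
    """True if every choice of k rows of gen is invertible over GF(P)."""
    k = len(gen[0])
    return all(
        solve_mod([gen[i] for i in idx], [0] * k) is not None
        for idx in combinations(range(len(gen)), k)
    )


def decode(available):
    """Recover the information blocks from {row index: coded symbol}."""
    idx = sorted(available)[:K]
    return solve_mod([G[i] for i in idx], [available[i] for i in idx])


if __name__ == "__main__":
    coded = encode([10, 20, 30, 40, 50])
    survivors = {i: coded[i] for i in (0, 2, 4, 5, 6)}  # blocks B and D lost
    print(is_mds(G), decode(survivors))
```

Note that decoding here always reads five surviving blocks, even when only one block was lost.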
In the second stage, the coded blocks 106 output by the first stage are partitioned into multiple disjoint sets, shown in
Also shown for completeness in
Thus, in the example of
The benefits of a fork code can be seen from its properties. Recovering from a single failure is simple, because each node can be recovered from the other nodes in the same coding group. For example, in
Thus, one purpose of the fork in the second stage is to enable fast recovery from single failures (or multiple isolated failures where each fork coding group has only a single failure). In general, a recovery system recovers from a failure by only accessing other nodes involved in the same coding group, when possible.
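The "same coding group when possible" policy can be sketched as a small planning function. The layout below is hypothetical (node numbers, group names, and the `plan_recovery` function are assumptions for illustration), and the sketch assumes each coding group forms an MDS code, so a group with L second-level parity blocks can repair up to L failures on its own.

```python
def plan_recovery(groups, failed):
    """Decide, per coding group, whether failures can be repaired locally.

    groups maps a group name to {"data": [...], "parity": [...]}, listing
    the nodes that hold first-level blocks and second-level parity blocks.
    Returns {group name: ("local", nodes_to_read)} when the surviving nodes
    of the group suffice, or ("global", None) when recovery must fall back
    to blocks outside the group.
    """
    failed = set(failed)
    plan = {}
    for name, g in groups.items():
        members = g["data"] + g["parity"]
        lost = [n for n in members if n in failed]
        if not lost:
            continue  # nothing to repair in this group
        alive = [n for n in members if n not in failed]
        if len(lost) <= len(g["parity"]):
            # Any J surviving blocks of a (J+L, J) MDS group are enough.
            plan[name] = ("local", alive[: len(g["data"])])
        else:
            plan[name] = ("global", None)
    return plan


if __name__ == "__main__":
    layout = {
        "g1": {"data": [0, 1, 2, 3], "parity": [7]},
        "g2": {"data": [4, 5, 6], "parity": [8]},
    }
    print(plan_recovery(layout, failed=[2]))
    print(plan_recovery(layout, failed=[0, 1]))
```

In this sketch, isolated failures in different groups are each planned as local repairs, while two failures in a single-parity group are flagged for the slower path.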
Another property is that with proper configuration parameters, the fork code can achieve a good tradeoff of overall erasure protection capability and fast recovery from single failures. For instance, in
It should be noted that the example of
In the second stage, additional coded blocks are computed from the p coded blocks. More particularly, as represented by step 304, the p blocks output by the first stage are partitioned into multiple disjoint sets.
Via steps 306-309, for each such disjoint set, e.g., S={X1, . . . , XJ}, one or more additional coded blocks, e.g., Y1, . . . , YL, are generated as a function of the blocks in the set S. In one implementation, Y1, . . . , YL are such that X1, . . . , XJ, Y1, . . . , YL form an MDS code. Following the grouping and the coding of the groups, the first-level coded blocks and the second-level coded blocks generated for the groups are stored among the nodes.
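The two stages can be sketched end to end as follows. This is an illustration under assumptions, not the layout of any particular figure: the first stage reuses the systematic (7, 5) example over an assumed prime field GF(257), the partition in `GROUPS` is hypothetical, each group receives a single second-level parity block (L = 1, the sum of the group, which makes each group a (J+1, J) MDS code), and blocks are single field symbols.

```python
P = 257  # assumed prime field size

# Hypothetical partition of the seven first-level blocks into two disjoint sets.
GROUPS = [[0, 1, 2, 3], [4, 5, 6]]


def first_stage(info):
    """Systematic (7, 5) code: five originals plus two parity blocks."""
    a = [x % P for x in info]
    p1 = sum(a) % P                                   # A+B+C+D+E
    p2 = sum((i + 1) * x for i, x in enumerate(a)) % P  # A+2B+3C+4D+5E
    return a + [p1, p2]


def second_stage(level1, groups=GROUPS):
    """One second-level parity block per group: the sum of the group."""
    return [sum(level1[i] for i in grp) % P for grp in groups]


def recover_local(index, level1, parities, groups=GROUPS):
    """Rebuild first-level block `index` using only its coding group.

    level1 holds None at positions whose nodes have failed.
    """
    for grp, y in zip(groups, parities):
        if index in grp:
            others = [level1[i] for i in grp if i != index]
            if any(v is None for v in others):
                raise ValueError("more than one failure in the coding group")
            return (y - sum(others)) % P
    raise KeyError(index)


if __name__ == "__main__":
    level1 = first_stage([10, 20, 30, 40, 50])
    parities = second_stage(level1)
    damaged = list(level1)
    damaged[2] = None  # the node holding block C fails
    print(recover_local(2, damaged, parities))
```

With this hypothetical partition, repairing block C reads four blocks (three group members plus the group parity) instead of the five needed by the first-stage code alone; larger configurations would widen that gap.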
The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to: personal computers, server computers, hand-held or laptop devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and so forth, which perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in local and/or remote computer storage media including memory storage devices.
With reference to
The computer 510 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 510 and includes both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer 510. Communication media typically embodies computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above may also be included within the scope of computer-readable media.
The system memory 530 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 531 and random access memory (RAM) 532. A basic input/output system 533 (BIOS), containing the basic routines that help to transfer information between elements within computer 510, such as during start-up, is typically stored in ROM 531. RAM 532 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 520. By way of example, and not limitation,
The computer 510 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media, described above and illustrated in
The computer 510 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 580. The remote computer 580 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 510, although only a memory storage device 581 has been illustrated in
When used in a LAN networking environment, the computer 510 is connected to the LAN 571 through a network interface or adapter 570. When used in a WAN networking environment, the computer 510 typically includes a modem 572 or other means for establishing communications over the WAN 573, such as the Internet. The modem 572, which may be internal or external, may be connected to the system bus 521 via the user input interface 560 or other appropriate mechanism. A wireless networking component 574 such as comprising an interface and antenna may be coupled through a suitable device such as an access point or peer computer to a WAN or LAN. In a networked environment, program modules depicted relative to the computer 510, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
An auxiliary subsystem 599 (e.g., for auxiliary display of content) may be connected via the user interface 560 to allow data such as program content, system status and event notifications to be provided to the user, even if the main portions of the computer system are in a low power state. The auxiliary subsystem 599 may be connected to the modem 572 and/or network interface 570 to allow communication between these systems while the main processing unit 520 is in a low power state.
While the invention is susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the invention to the specific forms disclosed, but on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the invention.