Peer-to-peer “p2p” distributed storage and delivery systems are highly useful in providing scalability, self-organization, and reliability. Such systems have demonstrated the viability of p2p networks as media for large-scale storage applications. In particular, p2p networks can be used to provide backup for files if the data is stored redundantly at the peers.
A p2p network is a popular environment for streaming data. A p2p network is one in which peer machines are networked together and maintain the state of the network via records on the participant machines. In p2p networks, any end host can initiate communications, and thus p2p networks are also sometimes referred to as “endhost” networks. Typical p2p networks generally lack a central server for administration, although hybrid networks do exist. Thus, generally speaking, the term p2p refers to a set of technologies that allows a group of computers to directly exchange data and/or services. The distinction between p2p networks and other network technologies is more about how the member computers communicate with one another than about the network structure itself. For example, end hosts in a p2p network act as both clients and servers in that they both consume data from and serve data to their peers.
In p2p distributed file sharing, pieces of a file are widely distributed across a number of peers. Then, whenever a client requests a download of that file, the request is serviced by a plurality of peers rather than directly by the server. For example, one such scheme, referred to as “Swarmcast™,” spreads the load placed on a web site offering popular downloadable content by breaking files into much smaller pieces. Once a user has installed the Swarmcast client program, their computers automatically cooperate with other users' computers by passing around (i.e., serving) pieces of data that they have already downloaded, thereby reducing the overall serving load on the central server. Another scheme, BitTorrent®, operates on similar principles. In particular, when under low load, a web site that serves large files using the BitTorrent scheme will behave much like a typical http server, since it performs most of the serving itself. However, when the server load reaches some relatively high level, BitTorrent will shift to a state in which most of the upload burden is borne by the downloading clients themselves, which service other downloading clients. Schemes such as Swarmcast and BitTorrent are very useful for distributing pieces of files, dramatically increasing effective server capacity as a function of the p2p network size.
The mechanisms used by such schemes may vary. In the simplest case, a subject file may be copied many times, each time onto a different peer. This approach is wasteful, since storing many full copies requires far more extra storage than necessary. A more space-optimal approach employs erasure codes. Erasure codes are codes that work on any erasure channel (a communication channel that only introduces errors by deleting symbols, not by altering them). In this approach, e.g., a file F is separated into fragments F1, F2, . . . , Fk. A coding scheme is applied to these fragments that produces new fragments E1, E2, . . . , En, where n>k, with the property that retrieving any k out of the n fragments Ei is sufficient to reconstruct the file. The coding cost of this approach is O(n|F|) word operations for the encoding and O(k^3 + k|F|) for the decoding. For most practical purposes k and n are of similar order, so the cubic decoding term generally forces the number of generated fragments n to be small.
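By way of illustration only, the k-of-n property described above can be sketched with a toy Reed-Solomon-style code over the prime field GF(257). The field choice and function names here are illustrative assumptions, not part of any scheme described above; practical systems use GF(2^8) so every symbol fits in one byte.

```python
# Toy Reed-Solomon-style (n, k) erasure code over GF(257); illustrative only.
P = 257  # prime modulus; each data symbol is one byte (0..255)

def _lagrange(xs, ys, t):
    """Evaluate, at point t, the unique degree-(len(xs)-1) polynomial that
    passes through the points (xs[i], ys[i]), with all arithmetic mod P."""
    total = 0
    for m in range(len(xs)):
        num = den = 1
        for l in range(len(xs)):
            if l != m:
                num = num * (t - xs[l]) % P
                den = den * (xs[m] - xs[l]) % P
        total = (total + ys[m] * num * pow(den, P - 2, P)) % P  # Fermat inverse
    return total

def encode(data, k, n):
    """Systematic encoding: fragments 0..k-1 carry the data itself, fragments
    k..n-1 carry redundancy, and any k of the n fragments rebuild the data."""
    assert len(data) % k == 0, "pad the data to a multiple of k first"
    step = len(data) // k
    frags = [list(data[i * step:(i + 1) * step]) for i in range(k)]
    for x in range(k, n):  # redundant symbols may equal 256, so store as ints
        frags.append([_lagrange(range(k), [frags[i][j] for i in range(k)], x)
                      for j in range(step)])
    return frags

def decode(points, frags, k):
    """Rebuild the data from any k fragments; points[i] is fragment i's index."""
    step = len(frags[0])
    out = bytearray()
    for i in range(k):            # recover source fragment i ...
        for j in range(step):     # ... one column at a time
            out.append(_lagrange(points[:k], [frags[m][j] for m in range(k)], i))
    return bytes(out)
```

Because the code is systematic, the first k fragments are the data itself and only the remaining n−k carry redundancy, so reads of a fully available file need no decoding at all.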
It is sometimes difficult in practical p2p backup schemes to keep the number of fragments small, because if the number of fragments is, e.g., 100 and the original file is of size 10 Gb, then each fragment is 100 Mb long. It is generally unlikely that a peer would be online long enough for a 100 Mb fragment to be uploaded to it. This encourages the use of smaller fragments; however, these in turn make the coding and decoding costs prohibitive.
One approach to get around the problem is to separate the large file F into a number of smaller files F1, . . . , Fm and then erasure code each one of these files. But this has the disadvantage that, to reconstruct the file F, it is necessary to reconstruct F1, then reconstruct F2, . . . , and finally reconstruct Fm. The probability that all of these reconstructions succeed becomes quite small even for moderate values of m.
This Background is provided to introduce a brief context for the Summary and Detailed Description that follow. This Background is not intended to be an aid in determining the scope of the claimed subject matter nor to be viewed as limiting the claimed subject matter to implementations that solve any or all of the disadvantages or problems presented above.
The arrangements presented here provide for storing and delivering files in a p2p system using hierarchical erasure coding. In other words, the erasure coding is performed in hierarchical stages. At the first stage, the original file is erasure coded or otherwise broken up into a first plurality of fragments. At the second stage, each fragment of the first plurality is erasure coded to produce a second plurality of fragments. Successive stages are performed similarly. The process may be visualized as a tree whose root is the original file and whose leaves are the fragments that are eventually streamed to a peer. The leaves may be streamed in a random fashion to peers.
The arrangements also provide a way to evaluate the failure probability of a file, that is, the probability, given a number of peers and their respective availabilities, that the original file cannot be faithfully reconstructed. The failure probability may be calculated using a recursive algorithm that may depend on the property that each peer should receive a random leaf in the hierarchical erasure-coding scheme.
The arrangements further provide a disk-efficient process of streaming fragments. An encoded file is created which is a transpose representation of that created in the usual encoding process. In this way, a single pass through the file can generate the fragment that will be sent to a peer. To produce a random leaf in a hierarchical encoding, enough top-level bytes are read to be able to produce an initial segment of a random child of the root, and the process may continue inductively until the entire leaf has been read.
This Summary is provided to introduce a selection of concepts in a simplified form. The concepts are further described in the Detailed Description section. Elements or steps other than those described in this Summary are possible, and no element or step is necessarily required. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended for use as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.
Corresponding reference characters indicate corresponding parts throughout the drawings.
Arrangements are provided for hierarchical erasure coding of files for p2p backup and other purposes. A probabilistic estimate may be calculated for the likelihood of successfully reconstructing the file from online peers. The arrangements may perform the erasure coding in a disk-efficient manner.
Referring to
According to the arrangement described above, the erasure coding is performed in hierarchical stages. At stage 0, the subject file F=F(0) is erasure coded into fragment files F1(0), . . . , Fn(0). The parameters n and k of the erasure coding may be chosen such that the stage 0 decomposition can be performed rapidly. At later stages, e.g., at stage t, each stage t−1 fragment Fi(t−1) is erasure coded to produce n stage t fragments. In this way, after t stages, n^t fragments will have been produced, each of size |F|/k^t.
The process may be visualized as a tree whose root is the subject file F and whose leaves are the fragments that are eventually streamed to a peer. It is noted that only leaves may be distributed to peers, and a single peer may store multiple leaves.
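As a hedged sketch of the staged coding described above, the tree structure can be illustrated in a few lines. Purely for brevity, each node here uses a trivial (k+1, k) XOR-parity code that tolerates the loss of one child, rather than a general (n, k) erasure code; the function names are illustrative.

```python
import functools
import operator

def xor_parity_encode(data: bytes, k: int) -> list[bytes]:
    """A minimal (k+1, k) erasure code: k data pieces plus one XOR parity
    piece, tolerating the loss of any single piece."""
    assert len(data) % k == 0, "pad to a multiple of k first"
    step = len(data) // k
    pieces = [data[i * step:(i + 1) * step] for i in range(k)]
    parity = bytes(functools.reduce(operator.xor, col) for col in zip(*pieces))
    return pieces + [parity]

def hierarchical_encode(data: bytes, k: int, height: int) -> list[bytes]:
    """Stage-wise coding: the root is the file, each node is coded into k+1
    children, and the (k+1)^height leaves are what is streamed to peers."""
    if height == 0:
        return [data]
    leaves = []
    for fragment in xor_parity_encode(data, k):
        leaves.extend(hierarchical_encode(fragment, k, height - 1))
    return leaves
```

In this sketch the file length must be divisible by k^height; a real implementation would pad. Substituting a true (n, k) code at each node yields the n^t fragments of the arrangement.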
Any of the erasure-coding steps may include a step of reading the subject file or fragment files in a transposed manner (step 34) so as to reduce the number of disk seeks, thus allowing the reading to be performed in a disk-efficient way. One way of implementing this reading in a transposed manner is described below in connection with
The last-created plurality of fragment files is then transmitted to the peer systems (step 36). A failure probability may be calculated and displayed at any time subsequent to construction of the final plurality (step 38), and the calculation may include use of a Fourier Transform (e.g., a fast Fourier transform or “FFT”) (step 42).
success probability=1−failure probability
Thus, if a system calculates one, it is trivial to calculate the other.
To outline this arrangement, the failure probability calculation includes a first step of associating a polynomial with each peer (step 44). A next step is to calculate a product of these polynomials (step 46). A sum is then calculated of the coefficients of the product of the polynomials (step 48). Finally, a failure probability is associated with the result of the summing step (step 52).
This arrangement is described below in additional detail. A subject file F is separated into a first plurality of fragment files F0, F1, . . . , Fk-1. These k fragment files are erasure-coded into n fragments E0, E1, . . . , En-1. Collecting any k of these fragments allows the reconstruction of the subject file F. It is noted above that the hierarchical erasure-coding arrangement may employ multiple erasure-coding steps. For simplicity and clarity, the calculation of failure probability will be described with respect to the Ei. It will be understood that the arrangement may apply similarly to any order of erasure-coded Ei.
Ei is transmitted to a peer Pi, and the likelihood that Pi is online is pi. The algorithm for computing the failure probability assumes that the event that Pi is online is independent of whether any other peer or set of peers is online. Generally, if this assumption does not hold, the failure probability cannot be determined in less than time exponential in the number of peers constituting the p2p network. With multiple steps of erasure coding, n may be made to rise and the file fragment size to decrease.
For each peer Pi, a polynomial Pi(X) = qi + piX is associated, where qi = 1 − pi. For the first polynomials:

P0(X) = q0 + p0X

P0(X)P1(X) = q0q1 + (q0p1 + q1p0)X + p0p1X^2

and so on.
Thus, in general, P(X) may be expressed as a polynomial:

P(X) = P0(X)P1(X) . . . Pn-1(X) = α0 + α1X + α2X^2 + . . . + αnX^n
In this case, αi, the coefficient of X^i, is the probability that exactly i peers are online. As k fragments are needed for reconstruction, the failure probability is the sum of these coefficients for the terms below the kth:

failure probability = α0 + α1 + . . . + αk-1
It may be calculated that the probability of failure with n peers can be determined in time on the order of n^2, i.e., O(n^2).
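The polynomial-product calculation outlined above can be sketched directly; the schoolbook multiplication below is the O(n^2) variant, and the function names are illustrative:

```python
def poly_mul(a, b):
    """Schoolbook product of two coefficient lists (an FFT would be faster)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def failure_probability(availabilities, k):
    """Multiply the per-peer polynomials qi + pi*X; the coefficient of X^i in
    the product is Pr[exactly i peers online], so summing the coefficients
    of the terms below X^k gives Pr[fewer than k fragments available]."""
    poly = [1.0]
    for p in availabilities:
        poly = poly_mul(poly, [1.0 - p, p])
    return sum(poly[:k])
```

For example, with two peers each online with probability 0.5 and k = 1, the product is 0.25 + 0.5X + 0.25X^2 and the failure probability is the constant term, 0.25.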
However, if a file is first deconstructed into k fragments and those fragments are then erasure coded into n fragments, such that the ith peer Pi receives ti fragments, then the polynomial becomes:

P(X) = (q0 + p0X^t0)(q1 + p1X^t1) . . . (qn-1 + pn-1X^tn-1)
The sum of the coefficients of this polynomial for the terms X^r with r&lt;k gives the failure probability for reconstruction of the subject file. The computation of this product can be performed in less time than O(n^2); rather, it may be performed in time O(n log^2(n)). In particular, it can be shown that, given two polynomials f and g of degree n, their product may be computed in time O(n log n) using an FFT. A corollary is that the product of the n peer polynomials,

P(X) = P0(X)P1(X) . . . Pn-1(X),

may be computed in time O(n log^2(n)) by multiplying the polynomials pairwise in a balanced fashion, again employing the FFT.
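A sketch of the generalized calculation follows, with each peer contributing a factor qi + pi*X^ti and the product formed pairwise in a balanced fashion. The balanced structure is what an FFT-based multiply would exploit; the plain convolution used here for brevity remains quadratic. Function names are illustrative.

```python
def poly_mul(a, b):
    """Schoolbook convolution; an FFT-based multiply would be O(d log d)."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def failure_probability_multi(availabilities, counts, k):
    """Peer i holds counts[i] fragments, so it contributes qi + pi*X^ti.
    The factors are multiplied pairwise in a balanced fashion, the structure
    that yields O(n log^2 n) overall when an FFT multiply is substituted."""
    polys = []
    for p, t in zip(availabilities, counts):
        poly = [0.0] * (t + 1)
        poly[0], poly[t] = 1.0 - p, p
        polys.append(poly)
    while len(polys) > 1:  # balanced pairwise products keep degrees small
        polys = [poly_mul(polys[i], polys[i + 1]) if i + 1 < len(polys)
                 else polys[i]
                 for i in range(0, len(polys), 2)]
    return sum(polys[0][:k])  # Pr[fewer than k fragments available]
```

For instance, two peers with availability 0.5 holding 2 and 1 fragments respectively give the product 0.25 + 0.25X + 0.25X^2 + 0.25X^3, so with k = 2 the failure probability is 0.5.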
The time savings are significant, as the following table demonstrates:
As noted above, the erasure coding may be performed such that n and k are not too large, as large values increase the time cost of encoding. In particular, the encoding time is O(nk|Fi|), while the decoding time is O(k^3 + k^2|Ei|). Similarly, fragment sizes should generally not be too large, as a peer is unlikely to remain online long enough for a large fragment to be transferred in either direction.
In one implementation, the failure probability may be calculated as follows. First, it is noted that if erasure coding is applied with the same parameters (n,k) to each level, then the probability that the file can be reconstructed depends in part on how the leaves are distributed. If the assignment of leaves is performed arbitrarily, then computing the probability requires exponential time. However, if the assignment of leaves is performed randomly, then significantly less time is required.
If Pi is available with probability pi and stores ti fragments, then:

Pr[t fragments available] = coefficient of X^t in (q0 + p0X^t0)(q1 + p1X^t1) . . . (qn-1 + pn-1X^tn-1)

The table of these probabilities may be calculated in time O(nf log^2(nf)), where nf is the number of fragments. Correspondingly, using a balls-in-bins analysis, it can be shown that:
At = Pr[file can be recovered | t fragments online]

can be computed in time O(h·nf^2·log^2(nf)), where h is the height of the tree, i.e., the number of levels of erasure coding that were performed.
Thus Pr[file can be recovered] = Σt At·Pr[t fragments available], using the quantities provided above. By using other techniques, e.g., concentration results, one can calculate even better approximations to this probability, e.g., in time O(h·nf^1.5·log^2(nf)).
For higher levels of encoding, the method generalizes in a straightforward manner by mathematical induction.
While the description above sets forth a process whereby a probability is calculated given a set of parameters, e.g., n and k, it should be noted that the converse relationship may also be employed. For example, given that a user desires a 99% chance of reconstructing the file, the process may be employed to calculate how many fragments need to be generated to accomplish this goal.
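As an illustrative sketch of this converse use, the snippet below assumes, for simplicity, identical peers each holding one fragment, so the number of online fragments is binomial; the function name and the simple linear search are assumptions, not part of the arrangement.

```python
from math import comb

def fragments_needed(p, k, target=0.99, n_max=10_000):
    """Smallest n for which Pr[at least k of n fragments are online] >= target,
    assuming each fragment sits on a distinct peer that is online with
    probability p, independently. Returns None if n_max is not enough."""
    for n in range(k, n_max + 1):
        success = sum(comb(n, i) * p**i * (1.0 - p)**(n - i)
                      for i in range(k, n + 1))
        if success >= target:
            return n
    return None
```

With p = 0.5 and k = 1, for example, seven fragments suffice: the chance that all seven peers are offline is 1/128, below the 1% failure budget.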
For hierarchical erasure coding of files, the arrangement 50 of
A transmission module 62 transmits the fragments to the peer systems 60, and this may be performed using any manner of transmission, including streaming each fragment as soon as it is created, storing and then transmitting the fragment, or the like. Finally, a failure probability calculation module 66 may be employed to determine the likelihood of being able to reconstruct the subject file.
For the reconstruction of the subject file, it is noted that each of the erasure-coded leaves also carries, as metadata, the name of the leaf. When the fragments are received, they are deposited into the appropriate leaf. As soon as enough fragments have been received to reconstruct a leaf, the leaf is reconstructed, and a higher-level fragment is thus obtained. This process may proceed level-by-level until the root level is decoded. Note that to perform a successful decoding, the tree structure used to encode the file in the first place must be remembered. This is not a copious amount of data if a regular structure, such as a full tree with the same branching factor at each level, is used.
To perform erasure coding, the fragments generally include parts of each section of the file, e.g., a part of F1, a part of F2, etc. Reading from each section requires multiple, non-optimal disk seeks. For example, to construct the first erasure-coded fragment E1, each byte bi,1 (for i = 1, . . . , n) would have to be examined, requiring n time-consuming disk seeks. If instead the file is re-interpreted as representing b1,1, . . . , bn,1, b1,2, . . . , bn,2, b1,3, . . . , bn,3, . . . , b1,m, . . . , bn,m, as shown by array 76, then E1 can be generated by reading the first portion of the file, i.e., reading consecutive bytes without seeking, as shown by the columns depicted in array 76′. This technique may be applied at multiple levels of the erasure-coding tree. In some instances, the technique may involve re-writing the transposed version onto the disk.
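A minimal sketch of the transposition itself, treating the file as n sections of m bytes each (the function name is illustrative):

```python
def transpose_layout(data: bytes, n: int) -> bytes:
    """Re-order bytes from section-major order (all of section 1, then all of
    section 2, ...) to position-major order, so that a coded fragment drawing
    one byte from each section becomes a single sequential read."""
    assert len(data) % n == 0
    m = len(data) // n  # bytes per section
    sections = [data[i * m:(i + 1) * m] for i in range(n)]
    return bytes(sections[i][j] for j in range(m) for i in range(n))
```

After this rewrite, the n bytes needed for the first coded fragment occupy the first n positions of the file, replacing n seeks with one sequential read.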
As shown, operating environment 80 includes processor 84, computer-readable media 86, and computer-executable instructions 88. One or more internal buses 82 may be used to carry data, addresses, control signals, and other information within, to, or from operating environment 80 or elements thereof.
Processor 84, which may be a real or a virtual processor, controls functions of the operating environment by executing computer-executable instructions 88. The processor may execute instructions at the assembly, compiled, or machine-level to perform a particular process.
Computer-readable media 86 may represent any number and combination of local or remote devices, in any form, now known or later developed, capable of recording, storing, or transmitting computer-readable data, such as computer-executable instructions 88 which may in turn include user interface functions 92, failure calculation functions 94, erasure-coding functions 96, or storage functions 97. In particular, the computer-readable media 86 may be, or may include, a semiconductor memory (such as a read only memory (“ROM”), any type of programmable ROM (“PROM”), a random access memory (“RAM”), or a flash memory, for example); a magnetic storage device (such as a floppy disk drive, a hard disk drive, a magnetic drum, a magnetic tape, or a magneto-optical disk); an optical storage device (such as any type of compact disk or digital versatile disk); a bubble memory; a cache memory; a core memory; a holographic memory; a memory stick; a paper tape; a punch card; or any combination thereof. The computer-readable media may also include transmission media and data associated therewith. Examples of transmission media/data include, but are not limited to, data embodied in any form of wireline or wireless transmission, such as packetized or non-packetized data carried by a modulated carrier signal.
Computer-executable instructions 88 represent any signal processing methods or stored instructions. Generally, computer-executable instructions 88 are implemented as software components according to well-known practices for component-based software development, and are encoded in computer-readable media. Computer programs may be combined or distributed in various ways. Computer-executable instructions 88, however, are not limited to implementation by any specific embodiments of computer programs, and in other instances may be implemented by, or executed in, hardware, software, firmware, or any combination thereof.
Input interface(s) 98 are any now-known or later-developed physical or logical elements that facilitate receipt of input to operating environment 80.
Output interface(s) 102 are any now-known or later-developed physical or logical elements that facilitate provisioning of output from operating environment 80.
Network interface(s) 104 represent one or more physical or logical elements, such as connectivity devices or computer-executable instructions, which enable communication between operating environment 80 and external devices or services, via one or more protocols or techniques. Such communication may be, but is not necessarily, client-server type communication or p2p communication. Information received at a given network interface may traverse one or more layers of a communication protocol stack.
Specialized hardware 106 represents any hardware or firmware that implements functions of operating environment 80. Examples of specialized hardware include encoders/decoders, decrypters, application-specific integrated circuits, clocks, and the like.
The methods shown and described above may be implemented in one or more general, multi-purpose, or single-purpose processors.
Functions/components described herein as being computer programs are not limited to implementation by any specific embodiments of computer programs. Rather, such functions/components are processes that convey or transform data, and may generally be implemented by, or executed in, hardware, software, firmware, or any combination thereof.
It will be appreciated that particular configurations of the operating environment may include fewer, more, or different components or functions than those described. In addition, functional components of the operating environment may be implemented by one or more devices, which are co-located or remotely located, in a variety of ways.
Although the subject matter herein has been described in language specific to structural features and/or methodological acts, it is also to be understood that the subject matter defined in the claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
It will further be understood that when one element is indicated as being responsive to another element, the elements may be directly or indirectly coupled. Connections depicted herein may be logical or physical in practice to achieve a coupling or communicative interface between elements. Connections may be implemented, among other ways, as inter-process communications among software processes, or inter-machine communications among networked computers. The word “exemplary” is used herein to mean serving as an example, instance, or illustration. Any implementation or aspect thereof described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other implementations or aspects thereof.
As it is understood that embodiments other than the specific embodiments described above may be devised without departing from the spirit and scope of the appended claims, it is intended that the scope of the subject matter herein will be governed by the following claims.