As is known in the art, edge computing systems and applications which utilize such systems are an emerging distributed computing paradigm. In edge computing, users (or “clients”) interact with servers on an edge of a network. Such servers are thus said to form a first layer of servers. The edge servers, in turn, interact with a second layer of servers in a backend of the edge computing system, which are thus referred to as “backend layer servers.” While the edge servers are typically in geographic proximity to clients, the backend layer of servers is often provided as part of a data center or a cloud center which is typically geographically distant from the clients and edge servers. The geographic proximity of the edge servers to clients permits high speed operations between clients and the edge layer, whereas communication between the edge servers and the backend is typically much slower. Such decentralized edge processing or computing is considered to be a key enabler for Internet of Things (IoT) technology.
As is also known, providing consistent access to stored data is a fundamental problem in distributed computing systems, in general, and in edge computing systems, in particular. Irrespective of the actual computation involved, application programs (also referred to simply as “applications” or more simply “apps”) in edge computing systems must typically write and read data. In settings where several writers attempt to concurrently or simultaneously update stored data, there is potential confusion on the version of data that should be stored during write operations and returned during read operations. Thus, implementation of strong consistency mechanisms for data access is an important problem in edge computing systems and is particularly important in those systems which handle massive amounts of data from many users.
To reduce, and ideally minimize, potential confusion with respect to different versions of the same data, consistency policies (or rules) may be imposed and implemented to deal with problems which arise because of concurrent access of data by clients. One well-known, widely acknowledged, and most desirable form of consistency policy is known as “atomicity” or “strong consistency” which, at an application level, gives users of a distributed system (e.g. an edge computing system) the impression of a single machine executing concurrent read and write operations as if the executions take place one after another (i.e. sequentially). Thus, atomicity, in simple terms, gives the users of a data service the impression that the various concurrent read and write operations take place sequentially. An ideal consistency solution should complete client operations via interaction only with the edge layer, whenever possible, thereby incurring low latency.
This is not possible, however, in all situations since practical edge servers have finite resources such as finite storage capacity and in some systems and/or uses the edge layer servers may be severely restricted in their total storage capacity as well as in other resources. For example, in situations where several thousands of files are being serviced, the edge servers typically do not have the capacity to store all the files all the time. In such situations, the edge servers rely upon the backend layer of servers for permanent storage of files that are less frequently accessed. Thus, the servers in the first layer act as virtual clients of the second layer servers.
Although various consistency policies (often weaker than strong consistency) are widely implemented and used in conventional processing systems, there is a lack of efficient implementations suitable for edge-computing systems. One important challenge in edge-computing systems, as described above, is reducing the cost of operation of the backend layer servers. Communication between the edge layer servers and backend layer servers, and persistent storage in the backend layer, contribute to the cost of operation of the backend layer. Thus, cost reduction may be accomplished by making efficient use of the edge layer servers.
Described herein are concepts, systems and techniques directed toward a layered distributed storage (LDS) system and related techniques. In one embodiment, a two-layer erasure-coded fault-tolerant distributed storage system offering atomic access for read and write operations is described. Such systems and techniques find use in distributed storage systems including in edge computing systems having distributed storage.
The systems and techniques described herein address the edge computing challenges of: (1) reducing the cost of operation of backend layer servers by making efficient use of edge layer servers by: (a) controlling communication between edge layer servers and backend layer servers; and (b) controlling persistent storage in the backend layer; (2) enforcing/controlling consistency (e.g. atomic access for read and write operations); and (3) completing client operations via interaction only with the edge layer servers, whenever possible.
The described systems and techniques enable atomically consistent data storage in edge computing systems for read and write operations while maintaining a desirable level of speed for users. In embodiments, the advantages of the concepts come from the usage of erasure codes. In embodiments, minimum bandwidth regenerating (MBR) codes may be used. In embodiments, random linear network codes (RLNC) may be used.
Since the techniques and systems described herein can be specifically adapted for edge-computing systems, a number of features can be provided. For example, as may be required by some edge-computing systems, the LDS technique described herein ensures that clients interact only with the edge servers and not with backend servers. In some embodiments, this may be an important requirement for applying the LDS technique to edge-computing systems. By ensuring that clients interact only with the edge servers and not with backend servers, the LDS techniques described herein allow completion of client operations by interacting only with the edge layer (i.e. a client need only interact with one or more edge layer servers). Specifically, a client write-operation (i.e. a client writes data) stores an updated file into the edge layer and terminates. The client write-operation need not wait for the edge layer to offload the data to the backend layer. Such a characteristic may be particularly advantageous in embodiments which include high speed links (e.g. links which provide a relatively low amount of network latency) between clients and edge layer servers. For a read operation, the edge layer may effectively act as a proxy cache that holds the data corresponding to frequently updated files. In such situations, data required for a read may be directly available at the edge layer, and need not be retrieved from the backend layer.
Also, the LDS system and techniques described herein efficiently use the edge-layer to improve (and ideally optimize) the cost of operation of the backend layer. Specifically, the LDS technique may use a special class of erasure codes known as minimum bandwidth regenerating (MBR) codes to simultaneously improve (and ideally optimize) communication cost between the two layers, as well as storage cost in the backend layer.
Further still, the LDS technique is fault-tolerant. In large distributed systems, the individual servers are usually commodity servers, which are prone to failures due to a variety of reasons, such as power failures, software bugs and hardware malfunctions. Systems operating in accordance with LDS techniques described herein, however, are able to continue to serve the clients with read and write operations despite the fact that some fraction of the servers may crash at unpredictable times during the system operation. Thus, the system is available as long as the number of crashes does not exceed a known threshold.
The underlying mechanism used for fault-tolerance is a form of redundancy. Usually, simple redundancy such as replication increases storage cost, but at least some embodiments described herein use erasure codes to implement such redundancy. The LDS techniques described herein achieve fault-tolerance and low storage and/or communication costs all at the same time.
In accordance with one aspect of the concepts described herein, a layered distributed storage (LDS) system includes a plurality of edge layer servers coupled to a plurality of backend layer servers. Each of the edge layer servers includes an interface with which to couple to one or more client nodes, a processor for processing read and/or write requests from the client nodes and for generating tag-value pairs, a storage for storing lists of tag-value pairs and a backend server layer interface for receiving tag-value pairs from said processor and for interfacing with one or more of the plurality of backend servers. Each of the backend layer servers includes an edge-layer interface for communicating with one or more servers in the edge layer, a processor for generating codes and a storage having stored therein coded versions of tag-value pairs. In some cases, the tag-value pairs may be coded via erasure coding, MBR coding or random linear network coding techniques. The backend layer servers are responsive to communications from the edge layer servers.
In preferred embodiments, the storage in the edge-layer servers is temporary storage and the storage in the backend layer servers is persistent storage.
With this particular arrangement, a system and technique which enables atomic consistency in edge computing systems is provided. Since users (or clients) interact only with servers in the edge layer, the system and technique becomes practical for use in edge computing systems, where the client interaction needs to be limited to the edge. By separating the functionality of the edge layer servers and backend servers, a modular implementation for atomicity using storage-efficient erasure-codes is provided. Specifically, the protocols needed for consistency implementation are largely limited to the interaction between the clients and the edge layer, while those needed to implement the erasure code are largely limited to the interaction between the edge and backend layers. Such modularity results in a system having improved performance characteristics and which can be used in applications other than in edge-computing applications.
The LDS technique described herein thus provides a means to advantageously use regenerating codes (e.g. storage-efficient erasure codes) for consistent data storage.
It should be appreciated that in prior art systems, use of regenerating codes is largely limited to storing immutable data (i.e. data that is not updated). For immutable data, these codes provide good storage efficiency and also reduce network bandwidth for operating the system.
Using the techniques described herein, however, the advantages of good storage efficiency and reduced network bandwidth possible via regenerating codes can be achieved even for data undergoing updates and where strong consistency is a requirement. Thus, the LDS techniques described herein enable the use of erasure codes for storage of frequently updated data. Such systems for supporting frequently updated data are scalable for big-data applications. Accordingly, the use of erasure codes as described herein provides edge computing systems having desirable efficiency and fault-tolerance characteristics.
It is recognized that consistent data storage implementations involving high volume data are needed in applications such as networked online gaming, and even applications in virtual reality. Thus, such applications may now be implemented via the edge-computing system and techniques described herein.
In accordance with a further aspect of the concepts described herein, it has been recognized that in systems which handle millions of files (which may be represented as objects), edge servers in an edge computing system do not have the capacity to store all the objects for the entire duration of execution. In practice, at any given time, only a fraction of all objects (and in some cases, a very small fraction of all objects) undergo concurrent accesses; in the system described herein, the limited storage space in the edge layer may act as a temporary storage for those objects that are being accessed. The backend layer of servers provides permanent storage for all objects for the entire duration of execution. The servers in the edge layer may thus act as virtual clients of the second layer backend.
As noted above, an important requirement in edge-computing systems is to reduce the cost of operation of the backend layer. As also noted, this may be accomplished by making efficient use of the edge layer. Communication between the edge and backend layers, and persistent storage in the backend layer contribute to the cost of operation of the second layer. These factors are addressed via the techniques described herein since the layered approach to implementing an atomic storage service carries the advantage that, during intervals of high concurrency from write operations on any object, the edge layer can be used to retain the more recent versions that are being (concurrently) written, while filtering out the outdated versions. The ability to avoid writing every version of data to the backend layer decreases the overall write communication cost between the two layers. The architecture described thus permits the edge layer to be configured as a proxy cache layer for data that are frequently read, to thereby avoid the need to read from the backend layer for such data.
In embodiments, storage of data in the backend layer may be accomplished via the use of codes including, but not limited to erasure codes and random linear network codes. In some embodiments, a class of erasure codes known as minimum bandwidth regenerating codes may be used. From a storage cost view-point, these may be as efficient as popular erasure codes such as Reed-Solomon codes.
It has been recognized in accordance with the concept described herein that use of regenerating codes, rather than Reed-Solomon codes for example, provides the extra advantage of reducing read communication cost when desired data needs to be recreated from coded data stored in the backend layer (which may, for example, correspond to a cloud layer). It has also been recognized that minimum bandwidth regenerating (MBR) codes may be utilized for simultaneously optimizing read and storage costs.
Accordingly, the system and techniques described herein may utilize regenerating codes for consistent data storage. The layered architecture described herein naturally permits a layering of the protocols needed to implement atomicity and erasure codes (in a backend layer, e.g. a cloud layer). The protocols needed to implement atomicity are largely limited to interactions between the clients and the edge servers, while those needed to implement the erasure code are largely limited to interactions between the edge and backend (or cloud) servers. Furthermore, the modularity of the implementation described herein makes it suitable even for situations that do not necessarily require a two-layer system.
The layered distributed storage (LDS) concepts and techniques described herein enable a multi-writer, multi-reader atomic storage service over a two-layer asynchronous network.
In accordance with one aspect of the techniques described herein, a write operation completes after writing an object value (i.e. data) to the first layer. It does not wait for the first layer to store the corresponding coded data in the second layer.
For a read operation, concurrency with write operations increases the chance of content being served directly from the first layer. If the content (or data) is not served directly from the first layer, servers in the first layer regenerate coded data from the second layer, which is then relayed to the reader.
In embodiments, servers in the first layer interact with those of the second layer via so-called write-to-backend-layer (“write-to-L2”) operations and regenerate-from-backend-layer (“regenerate-from-L2”) operations for implementing the regenerating code in the second layer.
In a system having first and second layers, with the first layer having n1 servers and the second layer having n2 servers, the described system may tolerate a number of failures f1, f2 in the first and second layers, respectively, where f1<n1/2 and f2<n2/3.
In a system with n1=Θ(n2), f1=Θ(n1) and f2=Θ(n2), the write and read costs are respectively given by Θ(n1) and Θ(1)+n1·I(δ>0), where δ is a parameter closely related to the number of write or internal write-to-L2 operations that are concurrent with the read operation. Note that I(δ>0) is an indicator which equals 1 if δ>0 and 0 if δ=0. Note also that the notation a=Θ(b), in the context of any two variable parameters a and b, is used to mean that the value of a is comparable to b and differs from it only by a constant factor. The ability to reduce the read cost to Θ(1) when δ=0 comes from the usage of minimum bandwidth regenerating (MBR) codes. In order to ascertain the contribution of temporary storage cost to the overall storage cost, a multi-object (say N) analysis may be performed, where each of the N objects is implemented by an independent instance of the LDS technique. The multi-object analysis assumes bounded latency for point-to-point channels. The conditions on the total number of concurrent write operations per unit time are identified, such that the permanent storage cost in the second layer dominates the temporary storage cost in the first layer, and is given by Θ(N). Further, bounds on completion times of successful client operations under bounded latency may be computed.
The use of regenerating codes enables efficient repair of failed nodes in distributed storage systems. For the same storage overhead and resiliency, the communication cost for repair (also referred to as “repair-bandwidth”) is substantially less than what is needed by codes such as Reed-Solomon codes. In one aspect of the techniques described herein, internal read operations are cast by virtual clients in the first layer as repair operations, and this enables a reduction in the overall read cost. In one aspect of the techniques described herein, MBR codes, which offer exact repair, are used. A different class of codes known as Random Linear Network Codes (RLNC) may also be used. RLNC codes permit implementation of regenerating codes via functional repair. RLNC codes offer probabilistic guarantees, and permit near optimal operation of regenerating codes for choices of operating point.
An edge layer server comprising:
A backend layer server comprising:
In a system having a layered architecture for coded consistent distributed storage, a method of reading data comprising:
a server sj in the edge layer 1 reconstructs coded data cj using content from a backend layer 2, wherein coded data cj may be considered as part of the code C, and the coded data cj is reconstructed via a repair procedure invoked by the server sj in the edge layer 1 where d helper servers belong to the backend layer 2.
The foregoing features may be more fully understood from the following description of the drawings in which:
Referring now to
It should be appreciated that although only four edge layer servers 14a-14d are illustrated in this particular example, the system 11 may include any number of edge layer servers 14. Similarly, although only five backend layer servers 16a-16e are illustrated in this particular example, the system 11 may include any number of backend layer servers 16. In general edge layer 1 may include n1 servers while backend layer 2 may include n2 servers.
A plurality of client nodes 12 (also sometimes referred to herein as “clients” or “users”) are coupled to the edge layer servers 14. For clarity, writer clients (i.e., client nodes which want to write content (or data) v1, v2 to consistent storage in the backend layer 16) are identified with reference numbers 18a, 18b and reader clients (i.e., client nodes which want to read content or data) are identified with reference numerals 20a-20d.
When system 11 is provided as part of an edge computing system, high speed communication paths (i.e. communication paths which provide low network latency between clients 12 and servers 14) may exist between clients 12 and servers 14 in the edge layer 1.
Further, backend layer servers 16 may be provided as part of a data center or a cloud center and are typically coupled to the edge layer servers 14 via one or more communication paths 23 which are typically slower than high speed paths 19, 21 (in terms of network latency).
As illustrated in
Similarly, one or more of reader clients 20a-20d may each independently request the latest versions of desired content from the edge layer servers 14. In a manner to be described below in detail, edge layer servers 14 provide the most recent version of the content (in this case version v2 of the content) to appropriate ones of the reader clients 20a-20d. Such content is sometimes provided directly from one or more of the edge layer servers 14 and sometimes edge layer servers 14 communicate with backend layer servers 16 to retrieve and deliver information needed to provide the requested content to one or more of the reader clients 20a-20d.
Referring now to
Referring now to
Referring now to
Before describing write and read operations which may take place in layered distributed storage (LDS) system (in conjunction with
Each process has a unique id, and the ids are totally ordered. Client (reader/writer) interactions are limited to servers in the edge layer 1, and the servers in the edge layer 1 in turn interact with servers in the backend layer 2. Further, the servers in the edge layer 1 and the backend layer 2 are denoted by {s_1, s_2, . . . , s_{n1}} and {s_{n1+1}, s_{n1+2}, . . . , s_{n1+n2}}, respectively.
It is also assumed that the clients are well-formed, i.e., a client issues a new operation only after completion of its previous operation, if any. As will be described in detail below, the interaction between layer 1 and layer 2 happens via the well-defined actions write-to-L2 and regenerate-from-L2. These actions are sometimes referred to herein as internal operations initiated by the servers in the edge layer 1.
Also, a crash failure model is assumed for processes. Thus, once a process crashes, it does not execute any further steps for the rest of the execution.
The LDS technique described herein is designed to tolerate fi crash failures in layer i, i=1, 2, where f1<n1/2 and f2<n2/3. Any number of readers and writers can crash during the execution. The above bounds arise from making sure sufficient servers in each of the layers of servers are active to guarantee a sufficient number of coded elements for a tag in order to allow decoding of the corresponding value. Communication may be modeled via reliable point-to-point links between any two processes. This means that as long as the destination process is non-faulty, any message sent on the link is guaranteed to eventually reach the destination process. The model allows the sender process to fail after placing the message in the channel; message delivery depends only on whether the destination is non-faulty.
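The failure bounds stated above can be checked numerically. The following is a minimal sketch only; the function names are hypothetical and are not part of the described system.

```python
def max_tolerated_crashes(n1: int, n2: int) -> tuple[int, int]:
    """Largest (f1, f2) satisfying f1 < n1/2 and f2 < n2/3."""
    f1 = (n1 - 1) // 2  # largest integer strictly below n1/2
    f2 = (n2 - 1) // 3  # largest integer strictly below n2/3
    return f1, f2


def within_bounds(n1: int, n2: int, f1: int, f2: int) -> bool:
    """True if the given crash counts respect both bounds."""
    return 2 * f1 < n1 and 3 * f2 < n2


# Example sizes: four edge layer servers and five backend layer servers.
f1_max, f2_max = max_tolerated_crashes(4, 5)
```

With four edge layer servers and five backend layer servers, this sketch yields at most one tolerated crash in each layer.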
With respect to liveness and atomicity characteristics, one object, say x, is implemented via the LDS algorithm supporting read/write operations. For multiple objects, multiple instances of the LDS algorithm are executed. The object value v comes from the set V. Initially v is set to a distinguished value v0 (∈V). Reader R requests a read operation on object x. Similarly, a write operation is requested by a writer W. Each operation at a non-faulty client begins with an invocation step and terminates with a response step. An operation π is incomplete in an execution when the invocation step of π does not have the associated response step; otherwise the operation π is complete. In an execution, an operation (read or write) π1 precedes another operation π2 if the response step for operation π1 precedes the invocation step of operation π2. Two operations are concurrent if neither precedes the other.
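The precedence and concurrency relations defined above may be illustrated by the following sketch. It assumes, purely for illustration, that invocation and response steps carry numeric timestamps from a global clock; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Operation:
    invoked: float                     # time of the invocation step
    responded: Optional[float] = None  # None means the operation is incomplete


def precedes(a: Operation, b: Operation) -> bool:
    """a precedes b if a's response step occurs before b's invocation step."""
    return a.responded is not None and a.responded < b.invoked


def concurrent(a: Operation, b: Operation) -> bool:
    """Two operations are concurrent if neither precedes the other."""
    return not precedes(a, b) and not precedes(b, a)


write_1 = Operation(invoked=0.0, responded=2.0)
read_1 = Operation(invoked=1.0, responded=3.0)   # overlaps write_1
read_2 = Operation(invoked=4.0)                  # incomplete, starts after write_1
```

In this example, write_1 and read_1 are concurrent, while write_1 precedes read_2.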
“Liveness” refers to the characteristic that during any well-formed execution of the LDS technique, any read or write operation initiated by a non-faulty reader or writer completes, despite the crash failure of any other clients and up to f1 server crashes in the edge layer 1, and up to f2 server crashes in the backend layer 2. Atomicity of an execution refers to the characteristic that the read and write operations in the execution can be arranged in a sequential order that is consistent with the order of invocations and responses.
With respect to the use of regenerating codes, a regenerating-code framework is used in which a file of size B symbols is encoded and stored across n nodes such that each node stores α symbols. The symbols are assumed to be drawn from a finite field F_q, for some q. The content from any k nodes (kα symbols) can be used to decode the original file.
For repair of a failed node, the replacement node contacts any subset of d≥k surviving nodes in the system, and downloads β symbols from each of the d nodes. The β symbols from a helper node are possibly a function of the α symbols in the node. The parameters of the code, say C, will be denoted as {(n, k, d)(α; β)} having a file-size B upper bounded by B ≤ Σ_{i=0}^{k−1} min(α, (d−i)β).
Two extreme points of operation correspond to the minimum storage overhead (MSR) operating point, with B=kα, and the minimum repair bandwidth (MBR) operating point, with α=dβ. In embodiments, codes at the MBR operating point may be used. The file-size at the MBR point may be given by B_MBR = Σ_{i=0}^{k−1} (d−i)β.
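As a worked illustration of these parameters, the following sketch evaluates the standard file-size bound for regenerating codes, B ≤ Σ_{i=0}^{k−1} min(α, (d−i)β), and its value at the MBR point where α=dβ. The function names and the example parameter values are hypothetical.

```python
def file_size_bound(k: int, d: int, alpha: int, beta: int) -> int:
    """Upper bound on file size B: sum over i = 0..k-1 of min(alpha, (d - i) * beta)."""
    return sum(min(alpha, (d - i) * beta) for i in range(k))


def mbr_file_size(k: int, d: int, beta: int) -> int:
    """File size at the MBR point (alpha = d * beta): sum of (d - i) * beta."""
    return sum((d - i) * beta for i in range(k))


def repair_download(d: int, beta: int) -> int:
    """Total symbols downloaded to repair one node: beta from each of d helpers."""
    return d * beta


# Example: k = 2, d = 3, beta = 1, so alpha = d * beta = 3 at the MBR point.
k, d, beta = 2, 3, 1
alpha = d * beta
b_mbr = mbr_file_size(k, d, beta)  # (3 - 0) * 1 + (3 - 1) * 1 = 5
```

At the MBR point each term min(α, (d−i)β) equals (d−i)β, so the two functions agree, and the repair download dβ equals the per-node storage α.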
In some embodiments, it may be preferable to use exact-repair codes, meaning that the content of a replacement node after repair is substantially identical to what was stored in the node before crash failure. A file corresponds to the object value v that is written. In other embodiments, it may be preferable to use codes which are not exact repair codes such as random linear network codes (RLNCs).
In embodiments (and as will be illustrated in conjunction with
The usage of these three codes is as follows. Each server in the edge layer 1, having access to the object value v (at an appropriate point in the execution), encodes the object value v using code C2 and sends coded data c_{n1+i} to server s_{n1+i} in the backend layer 2, 1≤i≤n2. During a read operation, a server sj in the edge layer 1 can potentially reconstruct the coded data cj using content from the backend layer 2. Here, coded data cj may be considered as part of the code C, and the coded portion cj is reconstructed via a repair procedure (invoked by server sj in the edge layer 1) where the d helper servers belong to the backend layer 2. By operating at the MBR point, it is possible to reduce and ideally minimize the cost needed by the server sj to reconstruct cj. Finally, in the LDS technique described herein, the possibility that the reader receives k coded data elements from k servers in the edge layer 1 during a read operation is permitted. In this case, the reader uses the code C1 to attempt decoding an object value v.
An important property of one MBR code construction, which is needed in one embodiment of the LDS technique described herein, is the fact that a helper node only needs to know the index of the failed node while computing the helper data, and does not need to know the set of other d−1 helpers whose helper data will be used in repair. It should be noted that not all regenerating code constructions, including those of MBR codes, have this property. In embodiments, a server sj in the edge layer 1 requests help from all servers in the backend layer 2, and does not know a priori the subset of d servers in the backend layer 2 that will form the helper nodes. In this case, it is preferred that each of the helper nodes be able to compute its β symbols without the knowledge of the other d−1 helper servers.
In embodiments, internal read operations may be cast by virtual clients in the first layer as repair operations, and this enables a reduction in the overall read cost.
With respect to storage and communication costs, the communication cost associated with a read or write operation is the (worst-case) size of the total data that gets transmitted in the messages sent as part of the operation. While calculating write-cost, costs due to internal write-to-L2 operations initiated as a result of the write may be included, even though these internal write-to-L2 operations do not influence the termination point of the write operation. The storage cost at any point in the execution is the worst-case total amount of data that is stored in the servers in the edge layer 1 and the backend layer 2. The total data in the edge layer 1 contributes to temporary storage cost, while that in the backend layer 2 contributes to permanent storage cost. Costs contributed by meta-data (data for bookkeeping such as tags, counters, etc.) may be ignored while ascertaining either storage or communication costs. Further, the costs may be normalized by the size of the object value v; in other words, costs are expressed as though the size of the object value v is 1 unit.
A write operation will be described below in conjunction with
Referring now to
Referring now to
In embodiments, the ideal goal is to store the respective coded content in all the backend servers. With this goal in mind, the respective coded elements are sent to all backend servers. It is satisfactory if n2−f2 responses are received back (i.e., the internal write operation is considered complete once it is known that the respective coded elements are written to at least n2−f2 backend layer servers).
Referring now to
During the first phase (also sometimes referred to as the “get tag” phase), the writer 18 determines a new tag for the value to be written. A tag comprises a pair of values: a natural number, and an identifier, which can be simply a string of digits or numbers, for example (3, “id”). One tag is considered to be larger or more recent than another if either the natural number part of the first tag is larger than the other, or if they are equal, the identifier of the first tag is lexicographically larger (or later) than that of the second tag. Therefore, for any two distinct tags there is a larger one, and in the same vein, in a given set of tags there is a tag that is the largest of all. Note that such a tag is used in lieu of an actual timestamp.
In the second phase (also referred to as the “put data” phase), the writer sends the new tag-value pair to all servers in the edge layer 1, which add the incoming pair to their respective local lists (e.g. one or more lists in temporary storage 36 as shown in
It is important to note that the writer is not kept waiting for completion of the internal write-to-L2 operation. That is, no communication with the backend layer 2 is needed to complete a write operation. Rather, the writer terminates as soon as it receives a threshold number of acknowledgments (e.g. f1+k acknowledgments) from the servers in the edge layer 1. Once a server (e.g. server 14a) completes the internal write-to-L2 operation, the value associated with the write operation is removed from the temporary storage of the server 14a (e.g. removed from storage 36 in
In the techniques described herein, a broadcast primitive 56 is used for certain meta-data message delivery. The primitive has the property that if the message is consumed by any one server in the edge layer 1, the same message is eventually consumed by every non-faulty server in the edge layer 1. One implementation of the primitive, on top of reliable communication channels, is described in co-pending application Ser. No. 15/838,966 filed on Dec. 12, 2017 and incorporated herein by reference in its entirety. In this implementation, the process that invokes the broadcast protocol first sends, via point-to-point channels, the message to a fixed set S_{f1+1} of f1+1 servers in the edge layer 1. Each of these servers, upon reception of the message for the first time, sends the message to all the servers in the edge layer 1 before consuming the message itself. The primitive helps in the scenario when the process that invokes the broadcast protocol crashes before sending the message to all edge layer servers.
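The relay behavior of this broadcast primitive may be sketched as a small simulation. This is an illustrative model only and not the referenced implementation: it assumes servers are indexed 0..n1−1, that the fixed relay set consists of the first f1+1 indices, and that crashed servers crash before taking any step. All names are hypothetical.

```python
from collections import deque


def broadcast(n1: int, f1: int, crashed: set[int], reached: set[int]) -> set[int]:
    """Return the set of edge servers that consume the message.

    `reached` is the subset of the fixed relay set that the invoking process
    managed to contact before (possibly) crashing.
    """
    relays = set(range(f1 + 1))  # fixed set of f1 + 1 servers (assumed indices)
    consumed: set[int] = set()
    inbox = deque(s for s in sorted(reached) if s in relays)
    while inbox:
        s = inbox.popleft()
        if s in crashed or s in consumed:
            continue  # crashed servers take no steps; repeats are ignored
        if s in relays:
            # First reception at a relay: forward to every other server first.
            inbox.extend(t for t in range(n1) if t != s)
        consumed.add(s)  # then consume the message
    return consumed


# The invoker reaches relays 0 and 1 and then crashes; servers 0 and 3 are faulty.
result = broadcast(n1=5, f1=2, crashed={0, 3}, reached={0, 1})
```

In this run every non-faulty server (1, 2 and 4) consumes the message even though the invoking process contacted only two relays, one of which had crashed.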
Referring now to
In the technique described herein, each server in the backend layer 2 stores coded data corresponding to exactly one tag at any point during the execution. A server in the backend layer 2 that receives a tag-coded-element pair (t,c) as part of an internal write-to-L2 operation replaces the local tag-coded-element pair (tl,cl) with the incoming pair if the new tag value t is more recent than the local tag value tl (i.e. t>tl). The write-to-L2 operation initiated by a server s in the edge layer 1 terminates after it receives acknowledgments from f1+d servers in the backend layer 2. It should be appreciated that in this approach no value is stored forever in any non-faulty server in the edge layer 1. The equations for selection of k, d are provided above.
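The replace-if-newer rule at a backend server can be sketched as follows. This is an illustrative sketch only: the class and function names, the initial tag, the representation of tags as (number, identifier) tuples, and the choice to acknowledge stale pairs are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BackendServer:
    tag: Tuple[int, str] = (0, "")  # assumed initial tag
    coded: bytes = b""              # coded element for the stored tag

    def on_write_to_l2(self, t, c):
        # Keep exactly one pair: replace it only if the incoming tag is newer.
        if t > self.tag:
            self.tag, self.coded = t, c
        return "ACK"  # assumption: a stale pair is still acknowledged


def write_to_l2(backends, t, coded_elements, ack_threshold):
    """Send one coded element per backend server; True once enough ACKs arrive."""
    acks = 0
    for server, c in zip(backends, coded_elements):
        if server.on_write_to_l2(t, c) == "ACK":
            acks += 1
    return acks >= ack_threshold
```

Python compares tuples element by element, which matches the tag ordering described earlier (number first, identifier second).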
Referring now to
As illustrated in
The broadcast primitive serves at least two important purposes. First, it permits servers 14 in edge layer 1 to delay an internal write-to-L2 operation until after sending an acknowledgment (ack) to the writer. Second, the broadcast primitive avoids the need for a “writing back” of values in a read operation, since the system instead writes back only tags. This is important to reduce the cost of reading from servers in the backend layer 2 to O(1) (since MBR codes alone are not enough). Here, O(1) refers to a quantity that is independent of system size parameters such as n1 or n2.
Referring now to
In one embodiment, the LDS technique for a writer w∈W and a reader r∈R includes the writer executing a “get-tag” operation, which includes sending a QUERY-TAG message to servers in the edge layer 1. The writer then waits for responses from f1+k servers and selects the most recent (or highest) tag t. The writer also performs a “put-data” operation, which includes creating a new tag tw=(t.z+1, w) and sending (PUT-DATA, (tw, v)) to servers in the edge layer 1. The writer then waits for responses from f1+k servers in the edge layer 1, and terminates.
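The two writer phases can be sketched with in-memory stand-ins for the edge servers. This is a hypothetical illustration: `StubEdgeServer`, its methods, and the synchronous call style are assumptions, and the tag a server reports in reply to QUERY-TAG is simply supplied by the caller.

```python
class StubEdgeServer:
    """Hypothetical stand-in for an edge layer server."""

    def __init__(self, reported_tag):
        self.reported_tag = reported_tag  # tag returned for QUERY-TAG
        self.received = []                # PUT-DATA payloads seen so far

    def query_tag(self):
        return self.reported_tag

    def put_data(self, tag, value):
        self.received.append((tag, value))
        return "ACK"


def write(value, writer_id, servers, f1, k):
    quorum = f1 + k
    # get-tag phase: use the first f1+k responses and pick the highest tag.
    tags = [s.query_tag() for s in servers][:quorum]
    if len(tags) < quorum:
        raise RuntimeError("not enough QUERY-TAG responses")
    t = max(tags)
    tw = (t[0] + 1, writer_id)  # new tag (t.z + 1, w)
    # put-data phase: send to all servers, finish after f1+k acknowledgments.
    acks = [s.put_data(tw, value) for s in servers]
    if acks.count("ACK") < quorum:
        raise RuntimeError("not enough PUT-DATA acknowledgments")
    return tw
```

Note that the writer never contacts the backend layer in this sketch, which mirrors the point that a write completes using the edge layer alone.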
It should be appreciated that in one embodiment, tags are used for version control of the object values. A tag t is defined as a pair (z,w), where z∈ℕ and w∈W is the ID of a writer, or as a null tag, which we denote by ⊥. We use T to denote the set of all possible tags. For any two tags t1, t2∈T we say t2>t1 if (i) t2.z>t1.z, or (ii) t2.z=t1.z and t2.w>t1.w, or (iii) t1=⊥ and t2≠⊥.
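The three ordering rules can be written out directly. In this minimal sketch, a tag is modeled as a `(z, w)` tuple and the null tag ⊥ as `None`; both representations, and the helper names `tag_greater` and `max_tag`, are assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

Tag = Optional[Tuple[int, str]]  # None stands in for the null tag


def tag_greater(t2: Tag, t1: Tag) -> bool:
    """Return True when t2 > t1 under rules (i) to (iii)."""
    if t2 is None:
        return False          # the null tag is never greater than any tag
    if t1 is None:
        return True           # rule (iii): any non-null tag exceeds the null tag
    if t2[0] != t1[0]:
        return t2[0] > t1[0]  # rule (i): compare the natural-number part
    return t2[1] > t1[1]      # rule (ii): tie broken by writer ID


def max_tag(tags: Iterable[Tag]) -> Tag:
    """Return the largest tag in a collection (None if all are null)."""
    best: Tag = None
    for t in tags:
        if tag_greater(t, best):
            best = t
    return best
```

Because any two distinct tags are comparable, `max_tag` always identifies a single most recent tag.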
Each server s in the edge layer 1 maintains the following state variables: a) a list L⊆T×V, which forms a temporary storage for tag-value pairs received as part of write operations; b) Γ⊆R×T, which indicates the set of readers currently being served, where the pair (r, treq)∈Γ indicates that the reader r requested tag treq during the read operation; c) tc, the committed tag at the server; and d) K, a key-value set used by the server as part of internal regenerate-from-L2 operations. The keys belong to R, and the values belong to T×H. Here H denotes the set of all possible helper data corresponding to coded data elements {cs(v), v∈V}. Entries of H belong to 𝔽_q^β. In addition to these, the server also maintains three counter variables for various operations. The state variable for a server in the backend layer 2 comprises one (tag, coded-element) pair. For any server s, the notation s.y is used to refer to its state variable y. Thus, the notation s.y|T represents the value of s.y at point T of the execution. It should be appreciated that an execution fragment of the technique is simply an alternating sequence of (the collection of all) states and actions. An “action” refers to a block of code executed by any one process without waiting for further external inputs.
Alternatively, the processing and decision blocks may represent steps performed by functionally equivalent circuits such as a digital signal processor (DSP) circuit or an application specific integrated circuit (ASIC). The flow diagrams do not depict the syntax of any particular programming language but rather illustrate the functional information one of ordinary skill in the art requires to fabricate circuits or to generate computer software to perform the processing required of the particular apparatus. It should be noted that many routine program elements, such as initialization of loops and variables and the use of temporary variables, may be omitted for clarity. The particular sequence of blocks described is illustrative only and can be varied without departing from the spirit of the concepts, structures, and techniques sought to be protected herein. Thus, unless otherwise stated, the blocks described below are unordered, meaning that, when possible, the functions represented by the blocks can be performed in any convenient or desirable order.
Referring now to
Phase II of the write operation begins in processing block 74 in which the writer sends the new tag-value pair (t,v) to servers in the edge layer 1. Preferably the tag-value pair is sent to all servers in the edge layer 1. Processing then proceeds to processing block 76 in which each server in the edge layer that receives the tag-value pair (t,v) sends a data reception broadcast message (e.g. metadata) to all servers in the edge layer 1.
Processing then proceeds to decision block 78 in which it is determined whether the tag-value pair corresponds to a new tag-value pair for that server (i.e. whether the tag of the newly received tag-value pair is more recent than the already committed tag tc in that server). If a decision is made that the tag-value pair is not a new tag-value pair, then an acknowledgment is sent to the writer as shown in processing block 96.
If in decision block 78 a decision is made that the tag pair does correspond to a new tag pair, then processing proceeds to block 80 in which the servers in the edge-layer add the incoming tag-value pair (t,v) to their respective local lists (e.g. as stored in each layer server storage 36 in
In response to an edge layer server receiving broadcast messages from a predetermined number of servers, processing proceeds to processing block 84 in which each such edge layer server sends an acknowledgment back to the writer. The writer needs f1+k acknowledgments (ACKs), so at least f1+k servers must send an ACK.
Processing then proceeds to decision block 86 in which a decision is made as to whether the tag is more recent than an already existing tag in the server (i.e. a committed tag tc) and whether the tag-value pair (t,v) is still in the tag-value pair list for that edge layer server.
If the tag is not more recent or if the tag is not still in the list, then processing ends.
Otherwise, processing proceeds to processing block 88 in which the committed tag tc is updated to the tag t, and all outstanding read requests having a treq value which is less than or equal to the committed tag value tc are served with the tag-value pair (tc,vc). As also illustrated in processing block 88, these reads are removed from the list of outstanding read requests. Further, and as also illustrated in processing block 88, the values associated with tag-value pairs in the list for tags which are less than the value of tc are removed. Processing then proceeds to processing block 90 in which the edge layer server offloads the tag-value pair to permanent storage in the backend layer 2. This may be accomplished, for example, by the server initiating a write-to-L2 action as described above.
Processing then proceeds to decision block 92 in which a decision is made as to whether the server completed the internal write-to-L2 operation. Although block 92 is illustrated as a decision block which implements a loop, in practice, this would be implemented as an interrupt driven procedure and thus processing block 94 would be implemented only upon completion of an internal write-to-L2 operation.
Once the server completes the internal write-to-L2 operation, processing flows to processing block 94 in which the edge layer server removes the value associated with the write operation from its temporary storage. Optionally, the server may also clear any old entries from its list. Processing then ends.
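The edge-server side of the write flow (blocks 76 through 96) can be summarized as an event-driven sketch. The class below is hypothetical: the method names, the message tuples, the `outbox` list used in place of a network, and the choice to keep a tag in the list with its value set to `None` once the value is removed are all assumptions made for illustration.

```python
class EdgeServer:
    def __init__(self, quorum):
        self.quorum = quorum   # e.g. f1 + k broadcast messages
        self.tc = (0, "")      # committed tag (assumed initial value)
        self.L = {}            # temporary storage: tag -> value (or None)
        self.readers = {}      # outstanding readers: reader id -> treq
        self.seen = {}         # tag -> servers whose broadcast has arrived
        self.outbox = []       # (destination, message) pairs, for inspection

    def on_put_data(self, writer, t, v):
        self.outbox.append(("broadcast", ("COMMIT-TAG", t)))  # block 76
        if t > self.tc:                                       # block 78
            self.L[t] = v                                     # block 80
        else:
            self.outbox.append((writer, ("ACK", t)))          # block 96

    def on_broadcast(self, sender, writer, t):
        self.seen.setdefault(t, set()).add(sender)
        if len(self.seen[t]) != self.quorum:
            return  # act exactly once, when the quorum is first reached
        self.outbox.append((writer, ("ACK", t)))              # block 84
        if t > self.tc and self.L.get(t) is not None:         # block 86
            self.tc, v = t, self.L[t]                         # block 88
            for r, treq in list(self.readers.items()):
                if treq <= self.tc:
                    self.outbox.append((r, (t, v)))
                    del self.readers[r]
            for old in [x for x in self.L if x < self.tc]:
                self.L[old] = None  # drop values of older tags
            self.outbox.append(("L2", ("WRITE-TO-L2", t, v)))  # block 90

    def on_write_to_l2_done(self, t):
        if t in self.L:
            self.L[t] = None  # block 94: value leaves temporary storage
```

The completion handler `on_write_to_l2_done` plays the role of the interrupt-driven procedure mentioned for block 92, so no polling loop is needed.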
A read operation is described below in conjunction with
Referring now to
Referring now to
During the first or “get committed tag” phase, the reader identifies the minimum tag, treq, whose corresponding value it can return at the end of the operation.
During the second or “get-data” phase, the reader sends treq to all the servers in the edge layer 1 and awaits responses from f1+k distinct servers such that 1) at least one of the responses contains a tag-value pair, say (tr,vr), or 2) at least k of the responses contain coded elements corresponding to some fixed tag, say tr, such that tr≥treq. In the latter case, the reader uses the code C2 to decode the value vr corresponding to tag tr. A server s in the edge layer 1, upon reception of the get-data request, checks whether either (treq,vreq) or (tc,vc), with tc>treq, is in its list; in this case, s responds immediately to the reader with the corresponding pair. Otherwise, s adds the reader to its list of outstanding readers and initiates a regenerate-from-L2 operation in which s attempts to regenerate a tag-coded data element pair (t′,c′s), with t′≥treq, via a repair process taking help from servers in the backend layer 2. If regeneration from the backend layer 2 fails, the server s simply sends (⊥,⊥) back to the reader.
It should be noted that irrespective of whether the regeneration operation succeeds, the server does not remove the reader from its list of outstanding readers. In the LDS technique, the server s is also allowed to respond to a registered reader with a tag-value pair during the broadcast-resp action. It is possible that while the server awaits responses from the backend layer 2 towards regeneration, a new tag t gets committed by s via the broadcast-resp action; in this case, if t≥treq, server s sends (t,v) to r, and also unregisters r from its outstanding reader list.
Referring now to
A second possibility is that the server regenerates a tag-coded-element pair (t,c1) such that t≥treq. In this case, the server sends the tag-coded-element pair (t,c1) to the reader and does not unregister the reader.
A third possibility is that the server regenerates a tag-coded-element pair (t,c1) such that t<treq. In this case, the server sends the null pair (⊥,⊥) to the reader and does not unregister the reader.
A fourth possibility is that the server does not regenerate any tag-coded-element pair (tag, coded-element) due to concurrent write-to-L2 actions. In this case, the server sends the null pair (⊥,⊥) to the reader and does not unregister the reader.
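The second through fourth possibilities reduce to a single comparison on the outcome of the regeneration attempt. In the sketch below, the function name, the use of `None` for a failed regeneration, and the `(None, None)` stand-in for (⊥,⊥) are assumptions; in every branch the reader is assumed to stay registered.

```python
NULL_PAIR = (None, None)  # stands in for the null pair


def regeneration_response(regenerated, treq):
    """Choose what an edge server sends to the reader after regenerate-from-L2.

    regenerated: a (tag, coded_element) pair, or None if nothing was regenerated.
    treq: the tag requested by the reader.
    """
    if regenerated is None:
        return NULL_PAIR      # fourth possibility: regeneration failed
    t, c = regenerated
    if t >= treq:
        return (t, c)         # second possibility: usable coded element
    return NULL_PAIR          # third possibility: regenerated tag too old
```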
It should be appreciated that the reader expects responses from a predetermined number of servers (e.g. f1+k servers) such that either at least one of the responses is a tag-value pair (tag, value) in which tag≥treq, or a predetermined number of the responses (e.g. k of them) are tag-coded-element pairs for the same tag, with tag≥treq (in which case the reader decodes the value).
Referring now to
Referring now to
Referring now to
If in decision block 142 a decision is made that no overlap exists, then processing proceeds to processing block 146 in which one or more servers in the edge layer 1 regenerate tag-coded-element pairs (tag, coded-element). In this scenario, the edge layer servers utilize information from the backend layer servers. This may be accomplished, for example, via regenerate-from-L2 operations as described above in conjunction with
Processing then proceeds to processing block 150 where the reader decodes the value V using the appropriate code. Processing then ends.
Referring now to
The phase II processing begins in processing block 164 in which the reader sends the minimum tag value treq to all of the servers in the edge layer.
Processing then proceeds to decision block 166 in which a decision is made as to whether the reader received responses from a predetermined number of distinct edge layer servers (including itself) such that at least one of the following conditions is true: (A) at least one of the responses contains a tag-value pair (tr,vr), or (B) at least k of the responses contain coded elements corresponding to some fixed tag tr. In either case, the tag is greater than or equal to the requested tag (and may or may not be the requested tag itself). A response containing a tag-value pair means that the tag-value pair was stored in that server's local storage, whereas a response containing coded elements means that no appropriate tag-value pair was stored in that server's local storage and thus the server had to communicate with the backend layer 2 to get coded elements. In embodiments, the predetermined number of distinct edge layer servers may be at least f1+k distinct edge layer servers.
Once one of the conditions is true, decision blocks 170 and 173 determine which of the conditions A or B is true. If in decision block 170 a decision is made that condition A is not true, then condition B must be true and processing proceeds to block 176 in which the reader uses the coded elements to decode the value vr corresponding to tag tr. Processing then proceeds to block 175 where the reader writes back the tag tr corresponding to value vr, and ensures that at least f1+k servers in the edge layer 1 have their committed tags at least as high as tr before the read operation completes.
If in decision block 170 a decision is made that condition A is true, processing proceeds to block 172 where a tag-value pair (t,v) is selected corresponding to the most recent (or “maximum”) tag. Processing then proceeds to decision block 173 in which a decision is made as to whether condition B is also true. If condition B is not also true, then the selected tag-value pair (t,v) is taken as the tag-value pair (tr,vr). Processing then proceeds to block 175 as above.
If in decision block 173 a decision is made that condition B is also true, then processing proceeds to block 174 where the reader uses the code C2 to decode the value vr corresponding to tag tr and, if the tag t is more recent than the tag tr, the tag-value pair (t,v) is renamed as (tr,vr) (i.e. if t>tr, rename (t,v) as (tr,vr)).
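The reader's choice between conditions A and B can be sketched as a single resolution step over the collected responses. The response encoding (`"value"`, `"coded"`, `"null"` triples), the function name, and the caller-supplied `decode` function standing in for the code C2 are assumptions made for this example.

```python
from collections import defaultdict


def resolve_read(responses, k, treq, decode):
    """Return (tr, vr), or None if neither condition A nor B holds yet.

    responses: list of (kind, tag, payload) with kind in
        {"value", "coded", "null"}.
    decode: hypothetical decoder mapping k coded elements to a value.
    """
    values = [(t, x) for kind, t, x in responses
              if kind == "value" and t >= treq]
    coded_by_tag = defaultdict(list)
    for kind, t, x in responses:
        if kind == "coded" and t >= treq:
            coded_by_tag[t].append(x)
    decodable = [t for t, cs in coded_by_tag.items() if len(cs) >= k]

    best = None
    if decodable:                     # condition B: k elements for one tag
        tr = max(decodable)
        best = (tr, decode(coded_by_tag[tr][:k]))
    if values:                        # condition A: at least one tag-value pair
        t, v = max(values, key=lambda pair: pair[0])
        if best is None or t > best[0]:
            best = (t, v)             # rename (t, v) as (tr, vr)
    return best
```

When both conditions hold, the pair with the more recent tag wins, matching the renaming step in block 174.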
Referring now to
If the condition in decision block 194 is not true, then processing proceeds to block 198 in which the server s adds the reader to its list of outstanding readers, along with treq. Processing then proceeds to processing block 200 in which server s initiates a regenerate-from-L2 operation in which the server s attempts to regenerate a tag-coded data element pair (tl,cl), with tl≥treq, via a repair process taking help from servers in the backend layer 2.
Processing then proceeds to decision block 202 in which a decision is made as to whether regeneration from the backend layer 2 failed. If the regeneration failed, then processing flows to block 204 in which the server s simply sends a null pair (⊥,⊥) back to the reader. It should be noted that irrespective of whether regeneration succeeds, the server does not remove the reader from its list of outstanding readers. That is, even if individual regenerations succeed, regeneration might not succeed in a collection of k servers in the edge layer such that all these servers regenerate the same tag. This happens because of concurrent write operations. In such a situation, by not removing the reader from the list of outstanding readers of a server, we allow the server to relay a value directly to the reader (even after individual successful regeneration, but collective failure) so that the read operation eventually completes. Phase two processing then ends.
If in decision block 202 a decision is made that the regeneration did not fail, then processing flows to block 206 in which edge layer 1 regenerated tag-coded-element pairs are sent to the reader. Phase two processing then ends.
Below are described several interesting properties of the LDS technique. These may be found useful in proving the liveness and atomicity properties of the algorithm. The notation Sa, with |Sa|=f1+k, is used to denote the set of f1+k servers in the edge layer 1 that never crash fail during the execution. The lemmas below apply only to servers that are alive at the concerned point(s) of execution appearing in the lemmas.
For every operation π in Π corresponding to a non-faulty reader or writer, there exists an associated (tag, value) pair that is denoted as (tag(π), value(π)). For a write operation π, the (tag(π), value(π)) pair may be defined as the message (tw,v) which the writer sends in the put-data phase. If π is a read, the (tag(π), value(π)) pair is defined as (tr,v), where v is the value that gets returned and tr is the associated tag. In a similar manner, tags may also be defined for those failed write operations that at least managed to complete the first round of the write operation. This is simply the tag tw that the writer would use during a put-data phase, if it were alive. As described, writes that failed before completion of the first round are ignored.
For any two points T1, T2 in an execution of LDS, we say T1<T2 if T1 occurs earlier than T2 in the execution. The following three lemmas describe properties of committed tag tc, and tags in the list.
Lemma IV.1 (Monotonicity of committed tag). Consider any two points T1 and T2 in an execution of LDS, such that T1<T2. Then, for any server s in the edge layer 1, s.tc|T1≤s.tc|T2.
Lemma IV.2 (Garbage collection of older tags). For any server s in the edge layer 1, at any point T in an execution of LDS, if (t,v)∈s.L, we have t≥s.tc.
Lemma IV.3 (Persistence of tags corresponding to completed operations). Consider any successful write or read operation φ in an execution of LDS, and let T be any point in the execution after φ completes. For any set S′ of f1+k servers in the edge layer 1 that are active at T, there exists s∈S′ such that s.tc|T≥tag(φ) and max{t:(t,*)∈s.L|T}≥tag(φ).
The following lemma shows that an internal regenerate-from-L2 operation respects previously completed internal write-to-L2 operations. Our assumption that f2<n2/3 is used in the proof of this lemma.
Lemma IV.4 (Consistency of Internal Reads with respect to Internal Writes). Let σ2 denote a successful internal write-to-L2(t,v) operation executed by some server in the edge layer 1. Next, consider an internal regenerate-from-L2 operation π2, initiated after the completion of σ2 by a server s in the edge layer 1, such that a tag-coded-element pair, say (t′,c′), was successfully regenerated by the server s. Then, t′≥t; i.e., the regenerated tag is at least as high as what was written before the read started.
The following three lemmas are central to proving the liveness of read operations.
Lemma IV.5 (If internal regenerate-from-L2 operation fails). Consider an internal regenerate-from-L2 operation initiated at point T of the execution by a server s1 in the edge layer 1 such that s1 failed to regenerate any tag-coded-element pair based on the responses. Then, there exists a point T̃ in the execution such that the following statement is true: There exists a subset Sb of Sa such that |Sb|=k, and ∀s′∈Sb, (t̃,ṽ)∈s′.L|T̃, where t̃ is the maximum of s.tc|T̃ over all servers s in the edge layer 1.
Lemma IV.6 (If internal regenerate-from-L2 operation regenerates a tag older than the request tag). Consider an internal regenerate-from-L2 operation initiated at point T of the execution by a server s1 in the edge layer 1 such that s1 only manages to regenerate (t,c) based on the responses, where t<treq. Here treq is the tag sent by the associated reader during the get-data phase. Then, there exists a point T̃ in the execution such that the following statement is true: There exists a subset Sb of Sa such that |Sb|=k, and ∀s′∈Sb, (t̃,ṽ)∈s′.L|T̃, where t̃ is the maximum of s.tc|T̃ over all servers s in the edge layer 1.
Lemma IV.7 (If two internal regenerate-from-L2 operations regenerate differing tags). Consider internal regenerate-from-L2 operations initiated at points T and T′ of the execution, respectively by servers s and s′ in the edge layer 1. Suppose that s and s′ regenerate tags t and t′ such that t<t′. Then, there exists a point T̃ in the execution such that the following statement is true: There exists a subset Sb of Sa such that |Sb|=k, and ∀s′∈Sb, (t̃,ṽ)∈s′.L|T̃, where t̃ is the maximum of s.tc|T̃ over all servers s in the edge layer 1.
Theorem IV.8 (Liveness). Consider any well-formed execution of the LDS algorithm, where at most f1<n1/2 and f2<n2/3 servers crash fail in layers 1 and 2, respectively. Then every operation associated with a non-faulty client completes.
Theorem IV.9 (Atomicity). Every well-formed execution of the LDS algorithm is atomic.
Described next are the storage and communication costs associated with read/write operations. A latency analysis of the algorithm is also carried out, in which estimates for durations of successful client operations are provided. We also analyze a multi-object system, under bounded latency, to ascertain the contribution of temporary storage toward the overall storage cost. We calculate costs for a system in which the number of nodes in the two layers are of the same order, i.e., n1=Θ(n2). We further assume that the parameters k, d of the regenerating code are such that k=Θ(n2) and d=Θ(n2). This assumption is consistent with usages of codes in practical systems.
In this analysis, we assume that corresponding to any failed write operation π, there exists a successful write operation π′ such that tag(π′)>tag(π). This essentially avoids pathological cases where the execution is a trail of only unsuccessful writes. Note that this restriction on the nature of the execution was not imposed while proving liveness or atomicity.
Lemma V.1 (Temporary Nature of L1 Storage). Consider a successful write operation π∈β. Then, there exists a point of execution Te(π) in β such that for all T′≥Te(π) in β, we have s.tc|T′≥tag(π) and (t,v)∉s.L|T′, for all servers s in the edge layer 1 and all tags t≤tag(π).
For a failed write operation π∈β, let π′ be the first successful write in β such that tag(π′)>tag(π). Then, it is clear that for all T′≥Te(π′) in β, we have (t,v)∉s.L|T′ for all servers s in the edge layer 1 and all tags t≤tag(π), and thus Lemma V.1 indirectly applies to failed writes as well. Further, for any failed write π∈β, we define the termination point Tend(π) of π as the point Te(π′) obtained from Lemma V.1, where π′ is the first successful write defined above.
Definition 1 (Extended write operation). Corresponding to any write operation π∈β, we define a hypothetical extended write operation πe such that tag(πe)=tag(π), Tstart(πe)=Tstart(π) and Tend(πe)=max(Tend(π), Te(π)), where Te(π) is as obtained from Lemma V.1.
The set of all extended write operations in β shall be denoted by Πe.
Definition 2 (Concurrency Parameter δρ). Consider any successful read operation ρ∈β, and let πe denote the last extended write operation in β that completed before the start of ρ. Let Σ={σe∈Πe : tag(σe)>tag(πe) and σe overlaps with ρ}. We define the concurrency parameter δρ as the cardinality of the set Σ.
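Definition 2 can be made concrete by counting over a list of extended writes. The sketch below rests on assumptions not stated in the definition: each extended write is a `(tag, start, end)` triple with numeric times, "last" means latest end time, two operations overlap when their closed time intervals intersect, and if no extended write completed before the read then every overlapping write is counted.

```python
def concurrency_parameter(extended_writes, read_start, read_end):
    """Count extended writes that overlap the read and carry a newer tag.

    extended_writes: list of (tag, start, end) triples.
    """
    completed = [w for w in extended_writes if w[2] < read_start]
    # Tag of the last extended write that finished before the read began.
    base_tag = max(completed, key=lambda w: w[2])[0] if completed else None
    count = 0
    for tag, start, end in extended_writes:
        overlaps = start <= read_end and end >= read_start
        newer = base_tag is None or tag > base_tag
        if overlaps and newer:
            count += 1
    return count
```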
Lemma V.2 (Write, Read Cost). The communication cost associated with any write operation in β is given by n1+n1n2·2d/(k(2d−k+1))=Θ(n1). The communication cost associated with any successful read operation ρ in β is given by n1(1+n2/d)·2d/(k(2d−k+1))+n1I(δρ>0)=Θ(1)+n1I(δρ>0). Here, I(δρ>0) is 1 if δρ>0, and 0 if δρ=0.
It should be noted that the ability to reduce the read cost to Θ(1) in the absence of concurrency from extended writes comes from the usage of regenerating codes at MBR point. Regenerating codes at other operating points are not guaranteed to give the same read cost, depending on the system parameters. For instance, in a system with equal number of servers in either layer, also with identical fault-tolerance (i.e., n1=n2; f1=f2), it can be shown that usage of codes at the MSR point will imply that read cost is Ω(n1) even if δρ=0.
Lemma V.3 (Single Object Permanent Storage Cost). The (worst case) storage cost in the backend layer 2 at any point in the execution of the LDS algorithm is given by 2dn2/(k(2d−k+1))=Θ(1).
Remark 2. Usage of MSR codes, instead of MBR codes, would give a storage cost of n2/k=Θ(1). For fixed n2, k, d, the storage cost due to MBR codes is at most twice that of MSR codes. As long as we focus on order-results, MBR codes do well in terms of both storage and read costs; see Remark 1 as well.
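The order results in Lemmas V.2 and V.3 and Remark 2 can be checked numerically. The sketch below is illustrative and relies on an assumption: for a unit-size value, a coded element at the MBR point is taken to have size 2d/(k(2d−k+1)), the standard normalization for MBR regenerating codes, and at the MSR point size 1/k. The function names are hypothetical.

```python
from fractions import Fraction


def mbr_alpha(k, d):
    """Assumed coded-element size at the MBR point for a unit-size value."""
    return Fraction(2 * d, k * (2 * d - k + 1))


def write_cost(n1, n2, k, d):
    # Value sent to n1 edge servers, each of which writes n2 coded elements.
    return n1 + n1 * n2 * mbr_alpha(k, d)


def read_cost(n1, n2, k, d, concurrent):
    base = n1 * (1 + Fraction(n2, d)) * mbr_alpha(k, d)
    return base + (n1 if concurrent else 0)  # extra n1 only under concurrency


def storage_mbr(n2, k, d):
    return n2 * mbr_alpha(k, d)


def storage_msr(n2, k):
    return Fraction(n2, k)
```

For example, with n1=n2=100 and k=d=80, the read cost without concurrency evaluates to 50/9 (about 5.6) while the write cost is roughly 347, illustrating the Θ(1) versus Θ(n1) gap under these assumptions.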
For bounded latency analysis, the delays on the various point-to-point links are assumed to be upper bounded as follows: 1) T1, for any link between a client and a server in the edge layer 1; 2) T2, for any link between a server in the edge layer 1 and a server in the backend layer 2; and 3) T0, for any link between two servers in the edge layer 1. We also assume that local computations on any process take negligible time compared to the delay on any of the links. In edge computing systems, T2 is typically much higher than both T1 and T0.
Lemma V.4 (Write, Read Latency). A successful write operation in β completes within a duration of 4T1+2T0. The associated extended write operation completes within a duration of max(3T1+2T0+2T2, 4T1+2T0). A successful read operation in β completes within a duration of max(6T1+2T2, 5T1+2T0+T2).
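The bounds of Lemma V.4 are simple enough to evaluate directly for a given set of link delays. The function names below are hypothetical, and the sample delays used afterwards are illustrative values, not measurements.

```python
def write_latency(T0, T1):
    """Upper bound on a successful write: 4*T1 + 2*T0."""
    return 4 * T1 + 2 * T0


def extended_write_latency(T0, T1, T2):
    """Upper bound on the associated extended write."""
    return max(3 * T1 + 2 * T0 + 2 * T2, 4 * T1 + 2 * T0)


def read_latency(T0, T1, T2):
    """Upper bound on a successful read."""
    return max(6 * T1 + 2 * T2, 5 * T1 + 2 * T0 + T2)
```

With T0=1, T1=2 and T2=50 (arbitrary time units), the write bound is 10 while the extended write and read bounds are 108 and 112, showing how the client-visible write latency stays independent of the slow edge-to-backend link.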
1) Impact of Number of Concurrent Write Operations on Temporary Storage, via Multi-Object Analysis: Consider implementing N atomic objects in the two-layer storage system described herein, via N independent instances of the LDS algorithm. The value of each of the objects is assumed to have size 1. Let θ denote an upper bound on the total number of concurrent extended write operations experienced by the system within any duration of T1 time units. Under appropriate conditions on θ, it may be shown that the total storage cost is dominated by that of permanent storage in the backend layer 2. The following simplifying assumptions are made: 1) the system is symmetrical so that n1=n2 and f1=f2 (⇒k=d); 2) T0=T1; and 3) all the invoked write operations are successful. It should be noted that it is possible to relax any of these assumptions and give a more involved analysis. Also, let μ=T2/T1.
Lemma V.5 (Relative Cost of Temporary Storage). At any point in the execution, the worst case storage costs in the edge layer 1 and the backend layer 2 are upper bounded by ⌈5+2μ⌉θn1 and 2Nn2/(k+1), respectively. Specifically, if θ<<Nn2/(kn1μ), the overall storage cost is dominated by that of permanent storage in the backend layer 2, and is given by Θ(N).
Described above is a two-layer model for strongly consistent data-storage which supports read/write operations. The system and LDS techniques described herein were motivated by the proliferation of edge computing applications. In the system, the first layer is closer (in terms of network latency) to the clients and the second layer stores bulk data. In the presence of frequent read and write operations, most of the operations are served without the need to communicate with the backend layer, thereby decreasing the latency of operations. In that regard, the first layer behaves as a proxy cache. As described herein, in one embodiment regenerating codes are used to simultaneously optimize storage and read costs. In embodiments, it is possible to carry out repair of erasure-coded servers in the backend layer 2. The modularity of implementation possibly makes the repair problem in the backend layer 2 simpler than in prior art systems. Furthermore, it is recognized that the modularity of implementation could be advantageously used to implement a different consistency policy, such as regularity, without affecting the implementation of the erasure codes in the backend. Similarly, other codes from the class of regenerating codes, including but not limited to random linear network codes (RLNCs), may also be used in the backend layer without substantially affecting client protocols.
This application claims the benefit of U.S. Provisional Application No. 62/509,390 filed May 22, 2017, titled “LAYERED DISTRIBUTED STORAGE SYSTEM AND TECHNIQUES FOR EDGE COMPUTING SYSTEM,” which application is incorporated by reference herein in its entirety.
This invention was made with Government support under Grant Nos. FA9550-13-1-0042 and FA9550-14-1-0403 awarded by the Air Force Office of Scientific Research. The Government has certain rights in the invention.
Number | Date | Country
---|---|---
62509390 | May 2017 | US