1. Field of the Invention
The present invention broadly relates to computerized methods and systems for reading from and writing to a distributed, asynchronous and fault-tolerant storage system, the storage system including storage nodes and communicating with clients. A storage node of the storage system is, for example, a remote web service accessed over the Internet.
2. Description of the Related Art
Recent years have seen an explosion of Internet-scale applications, ranging from web search to social networks. These applications are typically implemented with many machines running in multiple data centers. In order to coordinate their operations, these machines access some shared storage. In this context, a prominent storage model is the key-value store (KVS). A KVS offers a range of simple functions for manipulating unstructured data objects (called values), each identified by a unique key, and has become widely used as a shared storage solution for Internet-scale distributed applications. KVSs are used as storage services directly (such as in Amazon® Simple Storage Service and Windows Azure® Storage) or indirectly, as non-relational (NoSQL) databases (as shown in A. Lakshman and P. Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35-40, 2010, and Project Voldemort: A distributed database. http://project-voldemort.com/). While different services and systems offer various extensions to the KVS interface, the common denominator of existing KVS services implements an associative array: a client may store a value by associating the value with a key, retrieve a value associated with a key, list the keys that are currently associated, and remove a value associated with a key.
Storage services provide reliability using replication and tolerate the failure of individual data replicas. However, when all data replicas are managed by the same entity, there are naturally common system components, and therefore failure modes common to all replicas. A failure of these components may lead to data becoming unavailable or even being lost, as recently witnessed during an Amazon S3 outage and Google's temporary loss of email data.
Therefore, a client can increase data reliability by replicating it among several storage services, using the guarantees offered by robust distributed storage algorithms (for example, D. K. Gifford. Weighted voting for replicated data. In Symposium on Operating System Principles (SOSP), 1979, 150-162, and H. Attiya, A. Bar-Noy, and D. Dolev. Sharing memory robustly in message-passing systems. J. ACM, 42(1):124-142, 1995).
In order to overcome these deficiencies, the present invention provides a method for reading from and writing to a distributed, asynchronous and fault-tolerant storage system including storage nodes, the storage nodes communicating with clients, the method including: from a first client writing an object to the storage system: retrieving from a first set of nodes amongst the storage nodes previous transient metadata relating to a previously written version of the object; and storing a new version of the object together with new transient metadata identifying the new version on a second set of nodes amongst the storage nodes, wherein the new transient metadata are metadata computed based on the previous transient metadata; and from a second client reading the object from the storage system: retrieving a set of transient metadata from a third set of nodes amongst the storage nodes; determining from the set of transient metadata retrieved a specific version of the object as stored on the storage system; and retrieving the specific version of the corresponding object from a fourth set of nodes amongst the storage nodes, wherein two sets of nodes amongst the first, second, third and fourth sets have at least one node in common.
According to another aspect, the present invention provides a computer program product for causing one or more clients communicating with a distributed, asynchronous and fault-tolerant storage system including storage nodes to perform a method, the computer program product including: a computer readable storage medium having computer readable non-transient program code embodied therein, the computer readable program code including: computer readable program code configured to perform the steps of the method, including: from a first client writing an object to the storage system: retrieving from a first set of nodes amongst the storage nodes previous transient metadata relating to a previously written version of the object; and storing a new version of the object together with new transient metadata identifying the new version on a second set of nodes amongst the storage nodes, wherein the new transient metadata are metadata computed based on the previous transient metadata; and from a second client reading the object from the storage system: retrieving a set of transient metadata from a third set of nodes amongst the storage nodes; determining from the set of transient metadata retrieved a specific version of the object as stored on the storage system; and retrieving the specific version of the corresponding object from a fourth set of nodes amongst the storage nodes, wherein two sets of nodes amongst the first, second, third and fourth sets have at least one node in common.
According to yet another aspect, the present invention provides a distributed, asynchronous and fault-tolerant storage system including: storage nodes, wherein the storage nodes communicate with clients; a first client that writes an object to the storage system, wherein the first client retrieves from a first set of nodes amongst the storage nodes previous transient metadata relating to a previously written version of the object and stores a new version of the object together with new transient metadata identifying the new version on a second set of nodes amongst the storage nodes, wherein the new transient metadata are metadata computed based on the previous transient metadata; and a second client that reads the object from the storage system, wherein the second client retrieves a set of transient metadata from a third set of nodes amongst the storage nodes, determines from the set of transient metadata retrieved a specific version of the object as stored on the storage system, and retrieves the specific version of the corresponding object from a fourth set of nodes amongst the storage nodes, wherein two sets of nodes amongst the first, second, third and fourth sets have at least one node in common.
First, general aspects of methods according to embodiments of the invention are discussed, together with high-level variants thereof (section 1). Next, in section 2, more specific embodiments are described.
In reference to
Furthermore, as depicted in
In essence, the present methods allow clients communicating with the storage system to implement the following steps.
First, let us consider a first client 10, which is in the process of writing an object 201-202, as shown in
The client 10 shall typically perform the following two steps:
In step S11,
Next, in step S13,
The new TMD are computed based on the previous TMD. For instance, a simple way is to increment a version number. Interestingly, the above TMD update mechanism provides a possible definition for the corresponding object (i.e., the set of versions previously written or still to be written). In other words, an object is preferably defined by its corresponding TMD sequence, rather than by the semantic contents of the object versions, which can vary tremendously from one version to another. The new TMD can be computed anywhere. It may for instance be computed at the client 10 itself, as illustrated in step S12,
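By way of illustration, a minimal sketch of the writer-side steps S11-S13 is given below, under the assumption that transient metadata are (sequence number, writer identifier) pairs ordered lexicographically; the node operations `list_tmd` and `put` and the quorum handling are hypothetical names introduced only for this sketch.

```python
def write(nodes, client_id, obj_key, value, quorum):
    # S11: retrieve previous TMD for the object from a first set (quorum) of nodes
    previous = []
    for node in nodes[:quorum]:
        previous.extend(node.list_tmd(obj_key))      # e.g. [(3, "c1"), (2, "c7")]

    # S12: compute the new TMD from the previous ones, here by incrementing
    # the largest sequence number seen and appending this client's identifier
    max_seq = max((seq for seq, _ in previous), default=0)
    new_tmd = (max_seq + 1, client_id)

    # S13: store the new version together with its TMD on a second set of nodes
    for node in nodes[:quorum]:
        node.put((obj_key, new_tmd), value)
    return new_tmd
```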
Now, let us consider a second client 20 attempting to read the same object (or in fact, any other object) from the storage system. This second client, which may in fact be the same as the first client, shall typically perform the following steps:
In step S31 (
In step S32 (
Note that the TMD set retrieved may actually correspond to all versions of all objects as stored on the system 1. In that case, upon client request, a node blindly returns all TMD corresponding to all objects stored thereon. In an embodiment, the TMD set returned can be restricted to a given class (or even to a single object) and the request can be made correspondingly, so as to minimize the volume of TMD traffic and the subsequent work at the requesting client for determining the desired version.
Next, in steps S33-S39, the second client 20 retrieves the specific version of the corresponding object from a fourth set (here also represented as set 12) of nodes. Again, a “Get” command can be used and the fourth set does not need to exactly correspond to the previous set of nodes. Rather, two sets of nodes (any pair of sets) amongst the first, second, third and fourth sets shall have at least one node 114 in common, as illustrated in
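A reader-side counterpart of the above steps can be sketched as follows, under the same assumptions as the writer sketch above (`list_tmd` and `get` are hypothetical node operations, and the "specific version" is simply taken to be the one with the largest TMD):

```python
def read(nodes, obj_key, quorum):
    # S31: retrieve the TMD set from a third set (quorum) of nodes
    tmd_set = []
    for node in nodes[:quorum]:
        tmd_set.extend(node.list_tmd(obj_key))

    # S32: the client, not the node, determines the desired version
    specific_tmd = max(tmd_set)

    # S33-S39: retrieve that specific version from a fourth set of nodes
    for node in nodes[:quorum]:
        value = node.get((obj_key, specific_tmd))
        if value is not None:
            return specific_tmd, value
    return specific_tmd, None    # version not found, e.g. already garbage-collected
```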
Also illustrated in
A method such as described above enables computation-free storage nodes, at variance with prior art methods. For example, in typical prior art methods, a client makes a read request and what is returned by a node to the client is a pair {metadata, object}. Thus, if the client makes a first read request at time t1 and a second read request at time t2, then two versions of the object are returned, one after each read request. In the present case, the actual object data (i.e., the data corresponding to the desired version) are returned only once the client has determined which version it wants.
Additional advantages reside in the fact that the “intelligence” can be delocalized to the clients. For example, the clients can implement garbage collection schemes, instead of having them implemented at the nodes (garbage collection schemes aim at removing obsolete object versions, as known per se).
In contrast, in prior art methods, the nodes were able to, and thus required to, determine what the desired (specific) version was. For example, the nodes were equipped with the necessary intelligence to keep only the most recent version of the stored objects. In the present methods, this is done by the client, which determines which version it wants.
In sections 1.3 and 2 below, two classes of embodiments are contemplated, examples of which are respectively captured in
To start with, embodiments of the present invention may specifically address issues related to garbage collection (and more generally, the alteration of the stored object versions).
For example, a third client 30 (e.g., possibly any client which is ‘aware’ of outdated versions) shall take steps to retrieve (S16,
As will be discussed now, the above methods can be optimally implemented when storage nodes are equipped with key-value storage interfaces or any suitable interfaces which support the client-driven operations. Such an interface can be defined as a convention by which a client interacts with a storage node. In that case, a node (such as node 114 depicted in
This is illustrated in
Storing thereon a pair including a key (e.g., 301) and a corresponding object 201 (see also step S13 in
Next, retrieving from the node (e.g., “Get” command) an object (e.g., object 201) for a given key (e.g., key 301) (see also
In addition, providing nodes with KVS interfaces allows, at the step of retrieving specific TMD 302 (step S31,
Preferably, the interfaces are further provided with an operation allowing a client to delete (e.g., “Remove” command) a pair (a key and an object) by providing a key to the interface, so as to implement a delocalized garbage collection process.
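As an illustration only, the following sketch shows how such delocalized garbage collection could look on top of a plain Put/Get/List/Remove interface; the key encoding "&lt;object&gt;/&lt;sequence&gt;_&lt;writer&gt;" and the node methods `list` and `remove` are assumptions made for this sketch, not features mandated by the method.

```python
def tmd_to_key(obj_key, tmd):
    seq, writer = tmd
    # zero-padded sequence number so that string order matches version order
    return f"{obj_key}/{seq:020d}_{writer}"

def key_to_tmd(key):
    seq, writer = key.rsplit("/", 1)[1].split("_", 1)
    return (int(seq), writer)

def garbage_collect(node, obj_key, latest_tmd):
    # The client, not the node, decides which versions are obsolete:
    # every version-named key older than the latest known TMD is removed.
    for key in node.list():
        if key.startswith(obj_key + "/") and key_to_tmd(key) < latest_tmd:
            node.remove(key)
```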
1.3 Two Classes of Embodiments: Universal vs. Reservation Metadata
Embodiments of
Here, when the first client stores a new object version (step S13,
In practice, the UMD shall allow a client to access the second copy when the first is not available, e.g., because the corresponding object has been removed by a quicker garbage collector or superseded by a writer, bearing in mind that the present system is asynchronous.
For example, when retrieving a specific version of an object, the reader 20 attempts (step S33,
Preferably, the second copy is stored together with both the UMD and the new TMD. Thus, when accessing the second copy, the reader 20 can compare (step S37) the associated TMD (i.e., stored together with the second copy) with the specific TMD retrieved earlier, so as to verify that the retrieved copy is consistent with the initial request.
The likely scenario is that the first copy is normally available, step S25, in which case the reader 20 can access it for any subsequent use (read, parse in RAM or replication, etc.), step S34. However, if the first copy is not available, then there is still the possibility for the reader to access the second copy.
Now, if the reader needs to ascertain that the second copy is the version actually desired, an additional comparison step, step S38, is needed in step S37. Typically, the additional comparison is carried out in order to determine whether the second copy corresponds to the most recent version. Technically, the reader checks whether the second copy has been written “no earlier than” the first copy, i.e., the copy corresponding to the retrieved specific TMD. To do that, the reader may for instance compare the TMD stored together with the second copy with the retrieved specific TMD. Nonetheless, other criteria might be used, e.g., version compatibility for collaborative work, etc.
In all cases, if an outcome of the comparison step does not conform to expectations, then the reader 20 may proceed to repeat the previous steps S31-S34, and if necessary step S35, etc., until the comparison leads to a satisfactory outcome, steps S38-S39. This way, a complete solution is offered which allows for reading from and writing to a distributed, asynchronous and fault-tolerant storage system.
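A sketch of this fallback logic (steps S33-S39) is given below. The key names "&lt;object&gt;/&lt;tmd&gt;" and "&lt;object&gt;/eternal", as well as the `get` operation, are hypothetical; only the try-temporary-key-then-eternal-key shape follows the description above.

```python
def read_version(nodes, obj_key, specific_tmd):
    # S33/S34: first copy, stored under the temporary, version-named key
    for node in nodes:
        value = node.get(f"{obj_key}/{specific_tmd}")
        if value is not None:
            return value
    # S35-S38: second copy, stored under the eternal key together with its TMD
    for node in nodes:
        entry = node.get(f"{obj_key}/eternal")
        if entry is not None:
            stored_tmd, value = entry
            if stored_tmd >= specific_tmd:   # written "no earlier than" the requested copy
                return value
    # S38-S39: no satisfactory copy found; the caller repeats steps S31-S34
    return None
```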
Here the paradigm is different. In the embodiment described below, the reader 20 “reserves” a version when reading it. An exemplary scenario is the one illustrated in
As described before, when the second client 20 wishes to access an object from system 1, it shall first retrieve in step S31a of
Next, the second client 20 proceeds to store (steps S33a and S22a) reservation metadata (or RMD) for the specific version 202 it wants to access, on the storage system 1. Again, the RMD can be stored on any set of nodes having at least one node in common with any other set of nodes mentioned above.
Then, the specific version can be retrieved from any set of nodes, for subsequent use by the reader, as explained before. Upon completion, the reader can instruct in step S34a that the corresponding RMD be removed, for example, from one or more nodes of the system 1, and in practice from as many as possible. In step S24a, the corresponding RMD is deleted.
Now, consider a client 10 or 30, which wants to write or access the same object, which is assumed to be concurrently accessed by the second client 20. This client 10 or 30 starts with retrieving (step S11a) previous TMD, i.e., relating to previously written versions 201 of this object, just as described earlier. In addition, this client shall inquire about reservation metadata, if any, which are associated with any version 202 of that object. If, as assumed in
The client 10 or 30 then knows that the reserved version needs special treatment. For example, in step S13a it will refrain from deleting the reserved version. Otherwise, in the absence of a reservation, the version can be safely deleted in step S14a.
This scheme is particularly useful for avoiding collisions, inasmuch as reservation metadata are placed by the reader before accessing the corresponding version. Again, the garbage collection processes or, more generally, the decisions taken as to the stored files are delegated to the clients rather than the nodes. Such a scheme offers improved security for clients who do not want to jeopardize the security of files managed directly by the storage system.
Notwithstanding, an issue with this second embodiment class is that the reader may not have the permission to “write”, and thus might not be able to “reserve” the files being accessed, i.e., to store reservation metadata. To that extent, the first embodiment class described in section 1.3.1 is preferred.
Finally, here again, the reservation metadata are preferably computed based on reservation metadata and/or transient metadata as previously stored for the object being concurrently accessed, whereby an easy scheme is provided to manage the version numbers.
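Purely for illustration, the reservation-based variant (steps S31a-S34a on the reader side and S11a-S14a on the writer/garbage-collector side) could be sketched as follows; the reservation key format and the node operations used are assumptions of this sketch only.

```python
def reserve_and_read(nodes, obj_key, reader_id, specific_tmd):
    rsv_key = f"{obj_key}/rsv/{specific_tmd}/{reader_id}"
    for node in nodes:                          # S33a/S22a: place the reservation first
        node.put(rsv_key, specific_tmd)
    try:
        for node in nodes:                      # retrieve the reserved version
            value = node.get(f"{obj_key}/{specific_tmd}")
            if value is not None:
                return value
        return None
    finally:
        for node in nodes:                      # S34a/S24a: release the reservation
            node.remove(rsv_key)

def delete_if_unreserved(node, obj_key, tmd):
    # S11a-S14a: a writer or garbage collector first inquires about reservations
    reserved = any(key.startswith(f"{obj_key}/rsv/{tmd}/") for key in node.list())
    if not reserved:
        node.remove(f"{obj_key}/{tmd}")         # S14a: safe to delete
    # S13a: otherwise, refrain from deleting the reserved version
```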
The following provides details as to specific, fault-tolerant, wait-free and efficient algorithms that emulate a multi-reader multi-writer register from a set of KVS replicas in an asynchronous environment. Implementations serve an unbounded number of clients that use the storage, and tolerate crashes of a minority of the KVSs and crashes of any number of clients. These algorithms can be regarded as detailed variants of the methods discussed in section 1.3.1 above. Nonetheless, the skilled person may appreciate that details given hereafter can be applied as variants to the methods discussed in section 1.3.2.
As known in the art, a client can increase data reliability by replicating it among several storage services using the guarantees offered by robust distributed storage algorithms. Such an algorithm uses multiple storage providers (e.g., storage nodes as introduced earlier), and emulates a single, more reliable shared storage abstraction, which can be modeled as a read/write register. Such a register can be designed to tolerate asynchrony, concurrency, and faults among the clients and the storage nodes.
Many well-known robust distributed storage algorithms exist. Perhaps surprisingly, none of them directly exploits key-value stores as storage nodes. The problem arises because existing solutions are either (1) unsuitable for KVSs since they rely on storage nodes that perform custom computation, which a KVS cannot do, or (2) prohibitively expensive, in the sense that they require as many storage nodes as there are clients.
First, the challenges behind running robust storage algorithms over a set of KVS nodes are described.
Many existing robust register emulations are based on versioning, in the sense that they associate each stored value with a version (sometimes called a timestamp) that increases over time. Consider the classical multi-writer emulation of a fault-tolerant register. A writer first determines the largest version from some majority of the storage nodes, derives a larger version, and then stores the new value together with the larger version at a majority of storage nodes. The storage node then performs computation and actually stores the new value only if it comes with a larger version than the one it stores locally. However, a KVS does not offer such an operation.
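For contrast, the node-side conditional store that such a classical replica performs, and that a plain KVS interface cannot perform on the client's behalf, might look like the following sketch (names are illustrative only):

```python
class ClassicalReplica:
    def __init__(self):
        self.version = (0, "")   # (sequence number, writer identifier)
        self.value = None

    def conditional_store(self, version, value):
        # The replica itself compares versions and keeps only the newer value;
        # a KVS offers no such read-modify-write operation.
        if version > self.version:
            self.version, self.value = version, value
```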
Similar to existing emulations, a robust storage solution is desired which is wait-free, such that every correct client may proceed independently of the speed or failure of other clients (or more precisely, every operation invoked by a correct client eventually completes).
If a classical algorithm is cast blindly into the KVS context without adjustment, all values are stored under the same key. This may cause a larger version and an associated, recently written value to be overwritten by a smaller version and an outdated value. This shall be referred to as “the old-new overwrite problem”. Another equally naive solution is to store each version under a separate key; such a KVS accumulates all versions that have ever been stored and takes up unbounded space. As a remedy for this, one could remove small versions from a KVS after a value with a larger version has been stored. But this might, in turn, jeopardize wait-freedom. Consider a read operation that lists the existing keys and then retrieves the value with the largest version. If this version is removed between the time when the KVS executes the list operation and the time when the client retrieves it from the KVS, the read operation will fail. This can be referred to as “the garbage-collection race problem”.
First, a formal definition of a KVS is provided. A key-value store as used in embodiments is an associative array that allows storage and retrieval of values in a set V associated with keys in a set K. The values are typically much larger than the keys, so the values in V cannot be translated to elements of K and stored as keys.
A KVS typically supports the following operations: (1) associating a value with a key (Put(Key, Value)), (2) retrieving a value associated with a key (Get(Key)), (3) listing the keys that are currently associated (List()), and (4) removing a value associated with a key (Remove(Key)). A possible formal sequential specification of the KVS is given in algorithm 1 shown in illustration 1 below:
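The Algorithm 1 listing itself is not reproduced in this text. As a stand-in, the following is a minimal sketch of such a sequential specification, modeling the KVS state as an associative array; the class and method names simply mirror the operations listed above.

```python
class KVS:
    """Sequential specification sketch of a key-value store."""

    def __init__(self):
        self.store = {}              # state: a partial map from keys to values

    def put(self, key, value):       # Put(Key, Value): associate value with key
        self.store[key] = value

    def get(self, key):              # Get(Key): return the value associated with key, if any
        return self.store.get(key)

    def list(self):                  # List(): return the keys currently associated
        return list(self.store.keys())

    def remove(self, key):           # Remove(Key): drop the association for key, if present
        self.store.pop(key, None)
```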
Having noted this, two types of robust, asynchronous, and efficient emulations of a register over a set of fault-prone KVS replicas are particularly preferred, as described above. The reader may appreciate that both emulations can be designed for an unbounded number of clients, which may all read from and write to the register (i.e., the emulations implement a multi-writer multi-reader register). This makes them appropriate for Internet-scale systems. Also, both emulations may provide a multi-writer regular register. They may further be implemented so as to be wait-free and optimally resilient, i.e., the algorithm tolerates crash-stop failures of any minority of the KVS replicas and of any number of clients.
However, both emulations differ in their requirements. The first one (using universal metadata) does not require read operations to write to KVSs (that is, to change the state of a KVS by storing a value), in contrast with the second one. Precluding readers from storing values is practically appealing, since the clients may belong to different domains and not all of them should be permitted to write to the shared memory. But this runs into the garbage-collection race problem described previously. Thus, methods according to this first emulation instruct a write operation to store the same value twice, under different keys: once under an eternal key (universal metadata), which is never removed by garbage collection but is vulnerable to an old-new overwrite, and a second time under a temporary key, named according to the version, as discussed earlier in reference to
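A sketch of this dual-key write is given below; the value is stored once under an eternal key (together with its version, so that readers can perform the comparison of steps S37-S38) and once under a temporary, version-named key. The key names are assumptions for illustration only.

```python
def dual_key_write(nodes, obj_key, new_tmd, value):
    for node in nodes:
        # eternal key: never garbage-collected, but exposed to old-new overwrites
        node.put(f"{obj_key}/eternal", (new_tmd, value))
        # temporary key: unique per version, later removed by garbage collection
        node.put(f"{obj_key}/{new_tmd}", value)
```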
Entities (computers) in the nodes of the system 1 are configured to store and replicate data, and return data as requested by clients, which are computerized units as well. The nodes themselves (e.g., clouds) preferably implement interfaces such as is described above (e.g., KVS interfaces). In all cases, the entities and clients can be regarded as a computerized unit or a set of computerized units such as is depicted in
Such computerized units are designed for implementing aspects of the present invention as described above. In that respect, it will be appreciated that the methods described herein are largely non-interactive and automated. In embodiments, the methods described herein can be implemented either in an interactive, partly-interactive or non-interactive system. The methods described herein can be implemented in software (e.g., firmware), hardware, or a combination thereof. In embodiments, the methods described herein are implemented in software, as an executable program, the latter executed by special digital computers (at the clients and/or at the node entities). More generally, embodiments of the present invention can be implemented using general-purpose digital computers, such as personal computers, workstations, etc.
The system 100 depicted in
The processor 105 is a hardware device for executing software, particularly software stored in memory 110. The processor 105 can be any custom made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer 101, a semiconductor based microprocessor (in the form of a microchip or chip set), or generally any device for executing software instructions.
The memory 110 can include any one or combination of volatile memory elements (e.g., random access memory) and nonvolatile memory elements. Moreover, the memory 110 may incorporate electronic, magnetic, optical, and/or other types of storage media. Obviously, the memory 110 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor 105.
The software in memory 110 may include one or more separate programs, each of which includes an ordered listing of executable instructions for implementing logical functions. In the example of
The methods described herein can be in the form of a source program, executable program (object code), script, or any other entity including a set of instructions to be performed. When in a source program form, the program needs to be translated via a compiler, assembler, interpreter, or the like, which may or may not be included within the memory 110, so as to operate properly in connection with the OS 111. Furthermore, the methods can be written in an object oriented programming language, which has classes of data and methods, or in a procedural programming language, which has routines, subroutines, and/or functions.
Possibly, a conventional keyboard 150 and mouse 155 can be coupled to the input/output controller 135. In addition, the I/O devices 140-155 may further include devices that communicate both inputs and outputs. The system 100 can further include a display controller 125 coupled to a display 130. The system 100 typically includes a network interface or transceiver 160 for coupling to a network 165, and thereby to the storage system 1 (
The network 165 transmits and receives data between the unit 101 and other entities (nodes/clients). The network 165 can be a packet-switched network such as the Internet. Embodiments can also be contemplated which apply to local area networks, wide area networks, or other types of network environments. The many possible types of technologies that can be involved here (e.g., fixed wireless network, wireless local area network, wireless wide area network) are known per se and do not need to be further described.
If the unit 101 is a PC, workstation, intelligent device or the like, the software in the memory 110 may further include a basic input output system (BIOS) (omitted for simplicity). The BIOS is stored in ROM so that the BIOS can be executed when the computer 101 is activated.
When the unit 101 is in operation, the processor 105 is configured to execute software stored within the memory 110, to communicate data to and from the memory 110, and to generally control operations of the computer 101 pursuant to the software. The methods described herein and the OS 111, in whole or in part are read by the processor 105, typically buffered within the processor 105, and then executed.
When the systems and methods described herein are implemented in software, the methods can be stored on any computer readable medium, such as storage 120, for use by or in connection with any computer related system or method.
As will be appreciated by one skilled in the art, aspects of the present invention can be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment (computerized system performing the steps of the present methods), an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects. Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable non-transient program code embodied thereon, executed at the client and/or node sides.
Any combination of one or more computer readable medium(s) can be utilized. The computer readable medium can be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium can be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable non-transient program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium can be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Non-transient program code embodied on a computer readable medium can be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer non-transient program code for carrying out operations for aspects of the present invention can be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The non-transient program code may execute entirely on the unit 101, partly thereon, or partly on the unit 101 and partly on another, similar or dissimilar, unit. It may execute partly on a first computer and partly on a second computer, or entirely on one of the client's computers, etc.
Aspects of the present invention are described above with reference to flowchart illustrations and/or block diagrams of methods (
The computer program instructions may also be loaded onto one or more computer(s), other programmable data processing apparatus(es), or other devices to cause a series of operational steps to be performed to produce computer implemented processes such that the instructions being executed provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in
While the present invention has been described with reference to certain embodiments, it will be understood by those skilled in the art that various changes can be made and equivalents can be substituted without departing from the scope of the present invention. In addition, many modifications can be made to adapt a particular situation to the teachings of the present invention without departing from its scope. Therefore, it is intended that the present invention not be limited to the particular embodiment disclosed, but that the present invention will include all embodiments falling within the scope of the appended claims. For example, many modifications to the interfaces implemented by the nodes can be used for the purpose of allowing clients to execute special functions at the nodes, etc.
Number | Date | Country | Kind
---|---|---|---
11165040.4 | May 2011 | EP | regional
This application is a continuation of and claims priority from U.S. application Ser. No. 13/463,933, filed on May 4, 2012, which in turn claims priority under 35 U.S.C. 119 from European Application 11165040.4, filed May 6, 2011; the entire contents of both applications are incorporated herein by reference.
 | Number | Date | Country
---|---|---|---
Parent | 13463933 | May 2012 | US
Child | 13596682 | | US