Various of the disclosed embodiments concern a method and apparatus for achieving durability for stored data objects.
The pervasiveness of the Internet and advancements in network speed have enabled a wide variety of applications on storage devices. For example, cloud storage, or more specifically, network-distributed data storage, has become a popular approach for safekeeping data as well as making large amounts of data accessible to a variety of clients. As the use of cloud storage has grown, cloud service providers aim to address problems that are prominent in conventional file storage systems and methods, such as scalability, global accessibility, rapid deployment, user account management, and utilization data collection. In addition, the system's robustness must not be compromised while providing these functionalities.
Among different distributed data storage systems, an object storage system employs a storage architecture that manages data as objects, as opposed to other storage architectures like file systems which manage data as a file hierarchy, and block storage which manages data as blocks within sectors and tracks. Generally, object storage systems allow relatively inexpensive, scalable and self-healing retention of massive amounts of unstructured data. Object storage is used for diverse purposes such as storing photos and songs on the Internet, or files in online collaboration services.
In a distributed storage system, data redundancy techniques can be employed to provide for high availability. One technique includes replication of the data. Replication involves generating one or more full copies of an original data object and storing the copies on different machines in case the original copy gets damaged or lost. While effective at preventing data loss, replication carries a high storage overhead in that each stored object takes up at least 2× more space than it normally would. Another technique includes erasure coding (EC), which involves applying mathematical functions to a data object and breaking the data object down into a number of fragments such that the original object can be reconstructed from fewer than all of the generated fragments.
Introduced herein are techniques for achieving durability of a data object stored in a network storage system including a proxy server communicatively coupled to one or more storage nodes. In an embodiment, the proxy server receives a request from a client to store a data object in a network storage system. In response to the request, the proxy server encodes the data object into fragments, wherein the original object is recoverable from fewer than all of the fragments. The encoding, in some embodiments, can include buffering segments of the data object as they are received from the client and individually encoding each segment using erasure coding into data fragments and parity fragments. The data fragments and parity fragments are transmitted to the storage nodes where they are concatenated into erasure code fragment archives. Having transmitted the fragments to the storage nodes, the proxy server waits for acknowledgments indicating that the fragments have been successfully stored at the storage nodes. If the proxy server receives successful write responses from a sufficient number of the storage nodes, the proxy server can report the durable storage of the data object to the client and can place a marker on at least one of the storage nodes indicating that the data object has been durably stored in the network storage system.
One or more embodiments of the present disclosure are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.
Various example embodiments will now be described. The following description provides certain specific details for a thorough understanding and enabling description of these examples. One skilled in the relevant technology will understand, however, that some of the disclosed embodiments may be practiced without many of these details.
Likewise, one skilled in the relevant technology will also understand that some of the embodiments may include many other obvious features not described in detail herein. Additionally, some well-known structures or functions may not be shown or described in detail below, to avoid unnecessarily obscuring the relevant descriptions of the various examples.
The terminology used below is to be interpreted in its broadest reasonable manner, even though it is being used in conjunction with a detailed description of certain specific examples of the embodiments. Indeed, certain terms may even be emphasized below; however, any terminology intended to be interpreted in any restricted manner will be overtly and specifically defined as such in this Detailed Description section.
From the foregoing, it will be appreciated that specific embodiments of the invention are described herein for purposes of illustration, but that various modifications may be made without deviating from the scope of the invention. Accordingly, the invention is not limited except as by the appended claims.
In distributed object storage systems, Erasure Coding (EC) is a popular method for achieving data durability of stored objects. Erasure Coding is a mechanism where complex mathematics can be applied to a stored data object such that it can be broken down into N fragments, some of which consist of raw data and some of which consist of the results of said mathematical operations, which data is typically referred to as parity or 'check data.' Erasure Coding technology also allows for the reconstruction of the original object without requiring all fragments; exactly how many are needed, and what the mix of data versus 'check data' is, depends on the erasure code scheme selected.
Erasure Coding, however, stops short of defining a means for managing these fragments within the storage system. For a truly shared-nothing, distributed, scale-out storage system, as is typically deployed for Big Data applications in a Software Defined Storage manner, tracking and managing these fragments efficiently and transparently to applications accessing the storage system is a challenging problem, especially when considering that an eventually consistent system, i.e. one that favors availability over consistency, can store a fragment on just about any storage node in the cluster. Without a lightweight means for coordination between nodes to determine when all fragments, on some or all nodes, are stored, an individual storage node may easily wind up with a data fragment that is never deleted and never read. This can happen if a small enough subset of fragments is written to storage nodes, such that the object cannot be reconstructed. In this scenario, the individual storage node has no knowledge of the status of fragments at other nodes, so it cannot easily determine whether a subsequent request for the object should be fulfilled with that particular fragment, or if that particular fragment is part of a partial set that can never be rebuilt.
Described herein are example embodiments that solve these issues by providing mechanisms for placing a marker at a storage node that indicates the state of a stored object and provides the storage node with knowledge of the status of other fragments stored at other nodes. For example, in some embodiments, a proxy server acting as a central agent for a plurality of storage nodes waits for a sufficient number (quorum) of success responses indicating that each responding storage node has successfully stored its component of a data object, and then places a marker on at least one of the storage nodes indicating that the data object is durably stored across a distributed storage system.
Network storage system 100 can represent an object storage system (e.g., OpenStack Object Storage system, also known as "Swift"), which is a multitenant, highly scalable, and durable object storage system designed to store large amounts of unstructured data. Network storage system 100 is highly scalable because it can be deployed in configurations ranging from a few nodes and a handful of drives to thousands of machines with tens of petabytes of storage. Network storage system 100 can be designed to be horizontally scalable so there is no single point of failure. Storage clusters can scale horizontally simply by adding new servers. If a server or hard drive fails, network storage system 100 automatically replicates its content from other active nodes to new locations in the cluster. Therefore, network storage system 100 can be used by businesses of various sizes, service providers, and research organizations worldwide. Network storage system 100 can be used to store unstructured data such as documents, web and media content, backups, images, virtual machine snapshots, etc. Data objects can be written to multiple disk drives spread throughout servers in multiple data centers, with system software being responsible for ensuring data replication and integrity across the cluster.
Some characteristics of the network storage system 100 differentiate it from some other storage systems. For instance, in some embodiments, network storage system 100 is not a traditional file system or a raw block device; instead, network storage system 100 enables users to store, retrieve, and delete data objects (with metadata associated with the objects) in logical containers (e.g., via a RESTful HTTP API). Developers can, for example, either write directly to an application programming interface (API) of network storage system 100 or use one of the many client libraries that exist for popular programming languages (such as Java, Python, Ruby, and C#). Other features of network storage system 100 include being natively designed to store and serve content to many concurrent users, being able to manage storage servers with no additional vendor specific hardware needed, etc. Also, because, in some embodiments, network storage system 100 uses software logic to ensure data replication and durability across different devices, inexpensive commodity hard drives and servers can be used to store the data.
Referring back to
As illustrated in
The proxy servers 171-174 can function as an interface of network storage system 100, as proxy servers 171-174 can communicate with external clients. As a result, proxy servers 171-174 can be the first and last to handle an API request from, for example, an external client, such as client user 150, which can include any computing device associated with a requesting user. Client user 150 can be one of multiple external client users of network storage system 100. In some embodiments, all requests to and responses from proxy servers 171-174 use standard HTTP verbs (e.g. GET, PUT, DELETE, etc.) and response codes (e.g. indicating successful processing of a client request). Proxy servers 171-174 can use a shared-nothing architecture, among others. A shared-nothing architecture is a distributed computing architecture in which each node is independent and self-sufficient and there is no single point of contention in the system. For example, none of the nodes in a shared-nothing architecture share memory or disk storage. Proxy servers 171-174 can be scaled as needed based on projected workloads. In some embodiments, a minimum of two proxy servers are deployed for redundancy—should one proxy server fail, a second proxy server can take over. However, fewer or more proxy servers than shown in
In general, storage nodes 181-184 are responsible for the storage of data objects on their respective storage devices (e.g. hard disk drives). Storage nodes can respond to forwarded requests from proxy servers 171-174, but otherwise may be configured with minimal processing capability beyond the background processes required to implement such requests. In some embodiments, data objects are stored as binary files on the drive using a path that is made up in part of its associated partition and the timestamp of an operation associated with the object, such as the timestamp of the upload/write/put operation that created the object. A path can be, e.g., the general form of the name of a file/directory/object/etc. The timestamp may allow, for example, the object server to store multiple versions of an object while providing the latest version for a download/get request. In other embodiments, the timestamp may not be necessary to provide the latest copy of the object during a download/get. In these embodiments, the system can return the first object returned regardless of timestamp. The object's metadata (standard and/or custom) can be stored in the file's extended attributes (xattrs), and the object's data and metadata can be stored together and copied as a single unit.
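The on-disk layout described above can be pictured with a short sketch. The directory structure (partition directory, object hash directory, timestamp-named file), the helper names, and the xattr key are illustrative assumptions rather than the exact format of any particular system, and the extended-attribute call requires a filesystem and operating system that support xattrs (e.g. Linux with ext4 or XFS).

```python
import json
import os
import time

def store_object(devices_root, partition, object_hash, data, metadata):
    """Illustrative sketch: write an object as a binary file whose path
    includes its partition, named by the timestamp of the PUT operation,
    with metadata kept in the file's extended attributes (xattrs)."""
    timestamp = "%.5f" % time.time()           # timestamp string used as the file name
    obj_dir = os.path.join(devices_root, partition, object_hash)
    os.makedirs(obj_dir, exist_ok=True)
    path = os.path.join(obj_dir, timestamp + ".data")
    with open(path, "wb") as f:
        f.write(data)                          # object data
    # Store standard/custom metadata alongside the data as a single unit.
    os.setxattr(path, b"user.object.metadata",
                json.dumps(metadata).encode("utf-8"))
    return path

def latest_version(obj_dir):
    """Return the newest stored version: timestamped file names sort naturally."""
    return sorted(os.listdir(obj_dir))[-1]
```

A download/get handler could then open the file named by latest_version() and read the metadata back with os.getxattr(), keeping data and metadata together as one unit.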
Although not illustrated in
In some embodiments, network storage system 100 optionally utilizes a switch 120. In general, switch 120 is used to distribute workload among the proxy servers. In some embodiments, switch 120 is capable of prioritizing TCP and UDP traffic. Further, switch 120 can distribute requests for sessions among a number of resources in distributed storage cluster 110. Switch 120 can be provided as one of the services run by a node or can be provided externally (e.g. via a round-robin DNS, etc.).
Illustrated in
In some embodiments, within regions, network storage system 100 allows availability zones to be configured to, for example, isolate failure boundaries. An availability zone can be a distinct set of physical hardware whose failure would be isolated from other zones. In a large deployment example, an availability zone may be configured as a unique facility in a large data center campus. In a single datacenter deployment example, each availability zone may be a different rack. In some embodiments, a cluster has many zones. A globally replicated cluster can be created by deploying storage nodes in geographically different regions (e.g., Asia, Europe, Latin America, America, Australia, or Africa). The proxy servers can be configured to have an affinity to a region and to optimistically write to storage nodes based on the storage nodes' region. In some embodiments, the client can have the option to perform a write or read that goes across regions (i.e., ignoring local affinity).
With the above elements of the network storage system 100 in mind, a scenario illustrating operation of network storage system 100 is introduced as follows. In this example, network storage system 100 is a storage system of a particular user (e.g. an individual user or an organized entity) and client user 150 is a computing device (e.g. a personal computer, mobile device, etc.) of the particular user. When a valid read/retrieve request (e.g. GET) is sent from client user 150, through firewall 140, to distributed storage cluster 110, switch 120 can determine to which proxy 171-174 in distributed storage cluster 110 to route the request. The selected proxy node (e.g. proxy 171-174) verifies the request, determines on which of the storage nodes 181-184 the requested object is stored (based on a hash of the object name), and sends the request to the storage node(s). If one or more of the primary storage nodes is unavailable, the proxy can choose an appropriate hand-off node to which to send the request. The node(s) return a response, and the proxy in turn returns the first received response (and data, if it was requested) to the requester. A proxy server process can look up multiple locations because a storage system, such as network storage system 100, can provide data durability by writing multiple (in some embodiments, a target of 3) complete copies of the data and storing them in distributed storage cluster 110. Similarly, when a valid write request (e.g. PUT) is sent from client user 150, through firewall 140, to distributed storage cluster 110, switch 120 can determine to which proxy 171-174 in distributed storage cluster 110 to route the request. The selected proxy node (e.g. proxy 171-174) verifies the request, determines on which of the storage nodes 181-184 to store the data object, and sends the request along with the data object to the storage node(s). If one or more of the primary storage nodes is unavailable, the proxy can choose an appropriate hand-off node to which to send the request.
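The hash-based lookup the proxy performs when routing a request can be sketched as follows. The modulo-based partition mapping, the replica count of 3, and the node-rotation placement are simplifying assumptions standing in for a full consistent-hashing ring; they illustrate the idea of mapping an object name to primary and hand-off nodes rather than defining the placement algorithm of any particular system.

```python
import hashlib

REPLICA_COUNT = 3          # target number of complete copies (assumption)
PARTITION_POWER = 10       # 2**10 partitions (assumption)

def object_partition(account, container, obj):
    """Map an object's full path to a partition via a hash of its name."""
    digest = hashlib.md5(f"/{account}/{container}/{obj}".encode()).hexdigest()
    return int(digest, 16) % (2 ** PARTITION_POWER)

def primary_nodes(partition, storage_nodes):
    """Pick REPLICA_COUNT primary nodes for a partition; the remaining nodes
    can serve as hand-off nodes if a primary is unavailable."""
    ordered = [storage_nodes[(partition + i) % len(storage_nodes)]
               for i in range(len(storage_nodes))]
    return ordered[:REPLICA_COUNT], ordered[REPLICA_COUNT:]

# Usage: route a GET/PUT for an object to its primary storage nodes.
nodes = ["node-181", "node-182", "node-183", "node-184"]
part = object_partition("acct", "photos", "beach.jpg")
primaries, handoffs = primary_nodes(part, nodes)
print(part, primaries, handoffs)
```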
The replication scheme described with respect to
In a replication scheme, a single write request (e.g. PUT) with a single acknowledgment is all that is required between the proxy and each individual storage node. From the perspective of any of the storage nodes, the operation is complete when it acknowledges the PUT to the proxy as it now has a complete copy of the object and can fulfill subsequent requests without involvement from other storage nodes.
Data replication provides a simple and robust form of redundancy to shield against most failure scenarios. Data replication can also ease scheduling compute tasks on locally stored data blocks by providing multiple replicas of each block to choose from. However, even in a limited triple replication scheme, the cost in storage space is high. Three full copies of each data object are stored across the distributed computing cluster, introducing a 200% storage space overhead. As will be described, storing fragments of a data object, for example through the use of erasure coding (EC), can alleviate this strain on storage space while still maintaining a level of durability in storage.
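The overhead comparison can be made concrete with a few lines of arithmetic; the 4+2 and 10+2 parameter choices below are merely examples and are not prescribed by any embodiment.

```python
def replication_overhead_percent(copies):
    """Extra space consumed beyond one copy, e.g. 3 copies -> 200% overhead."""
    return (copies - 1) * 100

def ec_size_factor(data_fragments, parity_fragments):
    """Total stored size relative to the original object, e.g. 4+2 -> 1.5x."""
    return (data_fragments + parity_fragments) / data_fragments

print(replication_overhead_percent(3))   # 200 (% overhead for triple replication)
print(ec_size_factor(4, 2))              # 1.5x for a 4+2 erasure code
print(ec_size_factor(10, 2))             # 1.2x for a 10+2 erasure code
```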
Erasure Coding (EC) is a mechanism where complex mathematics can be applied to data (e.g. a data object) such that it is broken down into a number of fragments. Specifically, in some embodiments, an EC codec can operate on units of uniformly sized data cells. The codec takes as an input the data cells and outputs parity cells based on mathematical calculations. Accordingly, the resulting fragments of data after encoding include data fragments, which are the raw portions or segments of the original data, and "parity fragments" or "check data," which are the results of the mathematical calculations. The resulting parity fragments are what make the raw data fragments resistant to data loss. Erasure Coding technology allows for the reconstruction of the original data object without requiring all fragments; exactly how many are needed, and what the mix of data versus 'check data' is, depends on the erasure code scheme selected. For example, in a standard 4+2 erasure coding scheme, an original data object is encoded into six fragments: four data fragments including portions of the raw data from the original data object, and two parity fragments based on mathematical calculations applied to the raw data. In such a scheme, the original data object can be reconstructed using any four of the six fragments. For example, the data object can obviously be reconstructed from the four data fragments that include the raw data, but if two of the data fragments are missing, the original data object can still be reconstructed as long as the two parity fragments are available.
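The relationship between data fragments, parity fragments, and recoverability can be illustrated with a deliberately simplified code. The sketch below uses a single XOR parity fragment (a 4+1 scheme) rather than the 4+2 scheme discussed above, because single XOR parity can recover exactly one missing fragment with only the standard library; a production system would use a full erasure-coding library (e.g. a Reed-Solomon implementation) rather than this toy code.

```python
def encode(data: bytes, k: int = 4):
    """Split data into k equal-size data fragments plus one XOR parity fragment."""
    data = data.ljust(-(-len(data) // k) * k, b"\0")   # pad to a multiple of k
    size = len(data) // k
    fragments = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = fragments[0]
    for frag in fragments[1:]:
        parity = bytes(a ^ b for a, b in zip(parity, frag))
    return fragments, parity

def recover_missing(fragments, parity, missing_index):
    """Rebuild one lost data fragment by XOR-ing the parity with the survivors."""
    rebuilt = parity
    for i, frag in enumerate(fragments):
        if i != missing_index:
            rebuilt = bytes(a ^ b for a, b in zip(rebuilt, frag))
    return rebuilt

frags, parity = encode(b"an original data object to protect!!")
lost = 2
assert recover_missing(frags, parity, lost) == frags[lost]
```

The same principle scales to schemes with more parity fragments: any subset containing the minimum required number of fragments (four of six in the 4+2 example above) suffices to reconstruct the original object.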
Use of erasure coding in a distributed storage context has the benefit of reducing storage overhead (e.g. to 1.2× or 1.5× as opposed to 3×) while maintaining high availability through resistance to storage node failure. However, the process for storing data described with respect to
Embodiments described herein solve this problem by introducing an extension to the process involving the initial write request.
Once the data object is encoded into the plurality of fragments (i.e. the data fragments and parity fragments), the proxy server 170 at step 304 transmits (e.g. through simultaneous PUT statements) the plurality of fragments to one or more of the plurality of storage nodes in a distributed storage cluster. For example in
After transmitting the fragments, the proxy server 170 determines if a specified criterion is satisfied. Specifically, at step 306 proxy server 170 waits to receive a sufficient number of success responses from the storage nodes 180(1)-180(y) indicating that each storage node has successfully stored its fragment of the data object. However, as described earlier, any given storage node 180(1)-180(y) does not know the complete state of storage of the data object across the distributed storage system. Only a central agent (i.e. proxy server 170) that has received a sufficient number (i.e. quorum) of acknowledgments from the storage nodes knows if the data object is durably stored. The number of successful responses needed for quorum can be user defined and can vary based on implementation, but generally is based on the erasure code scheme used for durable storage. In other words, quorum can depend on the number of fragments needed to recover the data object. Specifically, in some embodiments, quorum is calculated based on the minimum number of data and parity fragments required to be able to guarantee a specified fault tolerance, which is the number of data elements supplemented by the minimum number of parity elements required by the chosen erasure coding scheme. For example, in a Reed-Solomon EC scheme, the minimum number of parity elements required for a particular specified fault tolerance may be 1, and thus quorum is the number of data fragments+1. Again, the number of encoded fragments needed to recover a given data object will depend on the deployed EC scheme.
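A sketch of how such a quorum threshold might be computed from the erasure code parameters follows; the function and parameter names are illustrative assumptions that mirror, rather than define, the calculation described above.

```python
def ec_quorum(num_data_fragments: int, min_parity_for_fault_tolerance: int = 1) -> int:
    """Minimum number of successful fragment writes the proxy waits for
    before considering the data object durably stored.  For a Reed-Solomon
    style scheme the minimum number of parity elements needed may be 1,
    so quorum is the number of data fragments plus one."""
    return num_data_fragments + min_parity_for_fault_tolerance

# Example: a 4+2 scheme with one required parity element -> quorum of 5,
# i.e. 5 of the 6 fragment writes must be acknowledged.
print(ec_quorum(4))       # 5
print(ec_quorum(10, 1))   # 11
```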
In response to determining that the specified criterion is satisfied, the proxy server 170 places a marker on at least one of the storage nodes indicating the state of the data object at the time of writing. For example, if the proxy server 170 receives a quorum of successful write responses from storage nodes 180(1)-180(y), it knows that the data object 340 is durably stored. In other words, even if not all of the transmissions of fragments completed successfully, the data object 340 is still recoverable. Accordingly, to share this knowledge with the storage nodes 180(1)-180(y), the proxy server at step 308 sends a message to and/or places a marker on the storage nodes 180(1)-180(y) indicating a state of the written data object. Preferably, a message/marker is sent to all the storage nodes 180(1)-180(y) that have stored fragments of the data object; however, in some embodiments only one storage node need receive the message/marker. This message/marker can take the form of a zero-byte file using, for example, a time/date stamp and a notable extension, e.g. .durable, and can indicate to the storage node that enough of this data object has been successfully stored in the distributed storage cluster to be recoverable; in other words, that the data object is durably stored. With this information, a given storage node can make decisions on whether to purge a stored fragment and how to fulfill subsequent data retrieval requests.
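A minimal sketch of placing and checking such a marker on a storage node's local filesystem is shown below; the directory layout and helper names are assumptions made for illustration, and only the idea of a zero-byte, timestamped file with a notable extension (e.g. .durable) comes from the description above.

```python
import os

def place_durable_marker(object_dir: str, write_timestamp: str) -> str:
    """Drop a zero-byte '<timestamp>.durable' file next to the stored fragment
    so the storage node knows the object is recoverable cluster-wide."""
    os.makedirs(object_dir, exist_ok=True)
    marker_path = os.path.join(object_dir, write_timestamp + ".durable")
    open(marker_path, "wb").close()   # zero-byte file; the name carries the information
    return marker_path

def is_durably_stored(object_dir: str) -> bool:
    """A storage node can use the marker's presence to decide whether to keep
    its fragment and whether to serve it for subsequent GET requests."""
    return any(name.endswith(".durable") for name in os.listdir(object_dir))
```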
Following acknowledgment of this second phase at step 310 from a sufficient number (i.e. quorum) of the storage nodes 180(1)-180(y), the proxy server can at step 312 report successful storage of the data object 340 back to the client user 150.
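Putting the two phases together, the proxy-side write path might be organized as in the following sketch. The callables put_fragment and put_marker are hypothetical stand-ins for whatever HTTP/RPC transport a deployment uses, and the thread-pool fan-out is only one possible concurrency model; the sketch shows the control flow (transmit fragments, wait for a quorum of write acknowledgments, place the durable marker, wait for a quorum of marker acknowledgments, then report success), not a definitive implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def durable_put(fragments, storage_nodes, quorum,
                put_fragment, put_marker, timestamp):
    """Two-phase write: returns True once the object is durably stored.
    put_fragment(node, fragment) and put_marker(node, timestamp) are
    hypothetical callables that return True on a successful response."""
    with ThreadPoolExecutor(max_workers=len(storage_nodes)) as pool:
        # Phase 1: fan the encoded fragments out to the storage nodes.
        write_acks = list(pool.map(put_fragment, storage_nodes, fragments))
        if sum(write_acks) < quorum:
            return False                      # not enough fragments stored
        # Phase 2: tell the acknowledging nodes the object is durable.
        marker_futures = [pool.submit(put_marker, node, timestamp)
                          for node, ok in zip(storage_nodes, write_acks) if ok]
        if sum(f.result() for f in marker_futures) < quorum:
            return False                      # durability not acknowledged
    return True                               # safe to report success to the client
```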
As shown in
Having buffered the first segment 442 of data object 440, the proxy server 170 encodes the segment 442 using an EC encoder 470. EC encoder 470 can be a combination of software and/or hardware operating at proxy server 170. As shown in
As shown in
Although not shown, in some embodiments fragments of a data object can be replicated for added redundancy across a distributed storage system. For example, in some embodiments upon encoding a particular fragment (e.g. Seg. 1, Frag. 1 shown in
After transmitting the replicated fragments, a proxy server and/or storage node can wait for responses indicating successful write of the replicated fragments. Upon receiving responses from a quorum of the storage nodes to which the replicated fragments were transmitted, the proxy server and/or storage node can place a marker on at least one of the storage nodes indicating that the particular fragment is fully replicated.
In some embodiments, the proxy server 170 can at step 506 conditionally read/retrieve the data object 540 from the storage nodes only if the marker is present. Because the data object is stored as a set of fragments (e.g. erasure code fragment archives), proxy server 170 can at step 508 read and decode the fragment archives using EC decoder 570 and then at step 510 transmit the now decoded data object 540 to the client 150. As described with respect to
For illustrative purposes the series of storage nodes 180(1)-180(y) are shown in
As mentioned, in some embodiments, a storage node 180(1)-180(y) can receive from a proxy server (e.g. proxy server 171-174 in
Consider an example in which storage node 180(3) for whatever reason does not have an available ".durable" marker. In some embodiments, in order to conserve storage space, storage node 180(3) may delete EC fragment archive 640(3) if, after a period of time, storage node 180(3) still has not received the marker from the proxy server. Here, from the storage node's perspective, because the marker is not present, the data object is not durably stored (i.e. not recoverable) in the network storage system, so there is no utility in maintaining the fragment associated with the object in its storage. Alternatively, if storage node 180(3) has not received the marker from the proxy server within the period of time, storage node 180(3) can communicate with other storage nodes (e.g. nodes 180(2) and 180(4)) to determine if they have received the marker. If storage node 180(3) determines that one or more other storage nodes have received the marker, the storage node can conclude with reasonable certainty that the data object is durably stored despite the absence of the marker in its local storage and can generate its own marker indicating that the data object is durably stored.
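A storage node's side of this reconciliation could look roughly like the sketch below; the grace period, the file naming, and the peers_have_marker helper are illustrative assumptions, not part of the described embodiments.

```python
import os
import time

MARKER_TIMEOUT_SECONDS = 24 * 3600     # example grace period (assumption)

def reconcile_fragment(object_dir, peers_have_marker):
    """Decide what to do with a locally stored fragment whose .durable marker
    has not arrived.  peers_have_marker() is a hypothetical callable that asks
    other storage nodes whether they hold the marker."""
    names = os.listdir(object_dir)
    if any(n.endswith(".durable") for n in names):
        return "keep"                                   # object known to be durable
    data_files = [n for n in names if n.endswith(".data")]
    if not data_files:
        return "nothing-to-do"
    age = time.time() - os.path.getmtime(os.path.join(object_dir, data_files[0]))
    if age < MARKER_TIMEOUT_SECONDS:
        return "wait"                                   # marker may still arrive
    if peers_have_marker():
        # A peer saw the marker, so the object is durable; generate a local
        # marker rather than deleting the fragment.
        open(os.path.join(object_dir, data_files[0][:-5] + ".durable"), "wb").close()
        return "keep"
    os.remove(os.path.join(object_dir, data_files[0]))  # never made durable; purge
    return "purged"
```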
Consider another example in which storage node 180(4) for whatever reason does not have fragment archive 640(4) available. Here, storage node 180(4) may have the ".durable" marker available and, with the knowledge that the data object is durably stored, communicate with the other storage nodes (e.g. storage nodes 180(y) and 180(3)) to reconstruct fragment archive 640(4). Recall that if the data object is durably stored (i.e. the minimum number of fragments is available), the entire object (including any one of the fragments) is recoverable.
The mechanism for placing a marker on a storage device that indicates a state of stored data at write time can be applied to other applications as well. Recall that in some embodiments, in response to determining that a specified criterion is satisfied, a proxy server can place a marker on a storage node that indicates a state of the data (e.g. a data object) at the time of writing. This innovative feature has been described in the context of durable storage using erasure coding, but is not limited to this context.
For example, the aforementioned innovations can be applied in a non-repudiation context to ensure authenticity of stored data. Consider an example of storing a data object in a network storage system. Here the specified criterion may be satisfied if the proxy server receives an indication that authenticates the data object to be stored. For example, the proxy server may wait for review and an authentication certificate from a trusted third party. This trusted third party may be a service provided outside of the network storage system 100 described with respect to
As another example, the aforementioned innovations can be applied in a data security context. Again, consider an example of storing a data object in a network storage system. Here, the specified criterion may be satisfied if the proxy server receives an indication that the data object is successfully encrypted. For example, in one embodiment, the proxy server may encrypt individual fragments before transmitting them to the respective storage nodes. So that the storage nodes have knowledge of the state of the data, the proxy server may additionally transmit an encrypted marker to the storage nodes along with the fragments. Alternatively, encryption may be handled at the storage nodes. Here, the proxy server may wait for a quorum of successful encryption responses from the storage nodes before reporting to the client and placing a marker at the storage nodes indicating that the data object is securely stored in the network storage system.
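For the encryption variant, a minimal sketch is given below; it assumes the third-party cryptography package for the encryption step (any encryption mechanism would do), and the ".encrypted" marker name and helper names are hypothetical.

```python
import os
from cryptography.fernet import Fernet   # third-party package (assumption)

def encrypt_and_store_fragment(fragment: bytes, object_dir: str,
                               timestamp: str, key: bytes) -> str:
    """Encrypt a fragment before it is written, then drop a zero-byte
    '<timestamp>.encrypted' marker so the node knows the state of the data."""
    os.makedirs(object_dir, exist_ok=True)
    path = os.path.join(object_dir, timestamp + ".data")
    with open(path, "wb") as f:
        f.write(Fernet(key).encrypt(fragment))
    open(os.path.join(object_dir, timestamp + ".encrypted"), "wb").close()
    return path

# Usage sketch: generate or obtain a key, encrypt the fragment, place the marker.
key = Fernet.generate_key()
encrypt_and_store_fragment(b"fragment bytes", "/tmp/obj-123", "0000001.00000", key)
```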
Further, as in the durable storage context, data can be conditionally retrieved/read based on whether the storage nodes include the marker. For example, in a non-repudiation context, the lack of at least one marker may indicate that the data has been tampered with or overwritten by an unauthorized entity since the initial write to storage. Given this conclusion, a storage node and/or proxy server may decline to transmit the existing data object to the client or may at least include a message with the returned data object that the authenticity cannot be verified. Similarly, in a data security context, the lack of at least one marker may indicate that the data was not properly encrypted at the time of write. Again, given this conclusion, a storage node and/or proxy server may decline to transmit the existing data object to the client or may at least include a message with the returned data object that the data was not properly encrypted.
In the illustrated embodiment, the computer processing system 700 includes one or more processors 710, memory 711, one or more communications devices 712, and one or more input/output (I/O) devices 713, all coupled to each other through an interconnect 714. The interconnect 714 may be or include one or more conductive traces, buses, point-to-point connections, controllers, adapters and/or other conventional connection devices. The processor(s) 710 may be or include, for example, one or more central processing units (CPU), graphical processing units (GPU), other general-purpose programmable microprocessors, microcontrollers, application specific integrated circuits (ASICs), programmable gate arrays, or the like, or any combination of such devices. The processor(s) 710 control the overall operation of the computer processing system 700. Memory 711 may be or include one or more physical storage devices, which may be in the form of random access memory (RAM), read-only memory (ROM) (which may be erasable and programmable), flash memory, miniature hard disk drive, or other suitable type of storage device, or any combination of such devices. Memory 711 may be or include one or more discrete memory units or devices. Memory 711 can store data and instructions that configure the processor(s) 710 to execute operations in accordance with the techniques described above. The communication device 712 represents an interface through which computer processing system 700 can communicate with one or more other computing systems. Communication device 712 may be or include, for example, an Ethernet adapter, cable modem, Wi-Fi adapter, cellular transceiver, Bluetooth transceiver, or the like, or any combination thereof. Depending on the specific nature and purpose of the computer processing system 700, the I/O device(s) 713 can include various devices for input and output of information, e.g., a display (which may be a touch screen display), audio speaker, keyboard, mouse or other pointing device, microphone, camera, etc.
Unless contrary to physical possibility, it is envisioned that (i) the methods/steps described above may be performed in any sequence and/or in any combination, and that (ii) the components of respective embodiments may be combined in any manner.
The techniques introduced above can be implemented by programmable circuitry programmed/configured by software and/or firmware, or entirely by special-purpose circuitry, or by any combination of such forms. Such special-purpose circuitry (if any) can be in the form of, for example, one or more application-specific integrated circuits (ASICs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), etc.
Software or firmware to implement the techniques introduced here may be stored on a machine-readable storage medium and may be executed by one or more general-purpose or special-purpose programmable microprocessors. A “machine-readable medium”, as the term is used herein, includes any mechanism that can store information in a form accessible by a machine (a machine may be, for example, any computing device or system including elements similar to as described with respect to computer processing system 700). For example, a machine-accessible medium includes recordable/non-recordable media (e.g., read-only memory (ROM); random access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; etc.), etc.
In this description, references to “an embodiment”, “one embodiment” or the like, mean that the particular feature, function, structure or characteristic being described is included in at least one embodiment of the technique introduced here. Occurrences of such phrases in this specification do not necessarily all refer to the same embodiment. Note that any and all of the embodiments described above can be combined with each other, except to the extent that it may be stated otherwise above or to the extent that any such embodiments might be mutually exclusive in function and/or structure.
Although the disclosed technique has been described with reference to specific exemplary embodiments, it will be recognized that the technique is not limited to the embodiments described, but can be practiced with modification and alteration within scope of the appended claims. Accordingly, the specification and drawings are to be regarded in an illustrative sense rather than a restrictive sense.
This application claims the benefit of U.S. Provisional Patent Application No. 62/293,653, filed on Feb. 10, 2016, entitled “METHOD AND APPARATUS FOR ACHIEVING DATA DURABILITY IN STORED OBJECTS”, which is hereby incorporated by reference in its entirety.