The present disclosure relates to a hybrid distributed storage system that can dynamically modify storage overhead and improve access performance.
In a distributed storage system, data is usually replicated on several storage nodes to ensure reliability when failures occur. One of the main costs of distributed storage systems is the raw storage capacity. This cost increases when the quality and performance of the devices used increase. Furthermore, the global amount of data produced and stored by mankind increases faster than the average storage device capacity at equal cost.
A storage cluster that includes multiple storage nodes can be employed to store data. To provide business continuity and disaster recovery, data may be stored in several storage clusters such that if one of the clusters fails, the data may still be accessed from the other cluster. Nevertheless, the challenge remains to reduce storage overhead in a distributed storage system.
Techniques are provided for storing data in a distributed storage system. An object is stored according to a first storage policy in the distributed storage system that includes a plurality of storage nodes. Storing the object according to the first storage policy results in a first storage overhead for the object. A triggering event associated with the object is received, and the triggering event changes an attribute of the object. In response to the triggering event, a second storage policy for the object is identified. Storing the object according to the second storage policy results in a second storage overhead for the object different from the first storage overhead.
Generally, there are two main techniques to protect stored data. According to a first technique, data is replicated to several locations with a configurable replication factor, which according to industry standard is at least three in order to be resilient to two simultaneous random failures. The storage overhead of a distributed storage system is defined as the ratio of the storage capacity used to store an object to the size of the object itself. If an object is replicated to produce three copies, the storage overhead is three. According to a second technique, data can be erasure coded. In erasure coding, the data is split into smaller chunks that are encoded to produce parity chunks. For example, one common erasure coding configuration calls for splitting the data into ten chunks and producing four parity chunks. In this case, the storage overhead is 1.4.
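By way of a non-limiting illustration, the two overhead figures above follow directly from the definition; the short sketch below (in Python, with arbitrary function names) computes the overhead of a replication scheme and of an erasure code with a given number of data and parity chunks.

    def replication_overhead(num_replicas):
        # Each replica stores the full object, so the overhead equals the replica count.
        return float(num_replicas)

    def erasure_coding_overhead(data_chunks, parity_chunks):
        # The object is split into data_chunks fragments and parity_chunks parity
        # fragments are generated, so the overhead is (data + parity) / data.
        return (data_chunks + parity_chunks) / data_chunks

    print(replication_overhead(3))          # 3.0 for three replicas
    print(erasure_coding_overhead(10, 4))   # 1.4 for a 10-4 erasure code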
When data is erasure coded, the storage overhead is generally lower (1.4 for a standard 10-4 erasure code instead of 3 for three replicas). However, writing and retrieving the data when employing erasure coding techniques can be slower for two reasons. First, encoding the data chunks takes computation time. Second, more storage nodes need to participate in the storage process. For example, a write operation with a 10-4 erasure code involves fourteen nodes instead of the three nodes used for replication. The read operation is also more complex for the erasure coding model, since ten nodes (out of fourteen) must participate instead of one node (out of three) in the replication model. The performance difference between the erasure coding and replication models also depends on the chosen storage technology, such as non-volatile random-access memory (NVRAM), solid state drive (SSD), spinning disk, etc.
To balance the tradeoff between performance and storage overhead, techniques disclosed herein provide storage systems that can be dynamically configured to store objects by replication, erasure coding, or a combination thereof.
In one embodiment, the proposed storage system is configured to dynamically adapt the storage of each individual object to achieve high storage efficiency: most objects have a low storage overhead, while the most popular objects have a higher storage overhead but are fast to access.
In one embodiment, a policy-driven distributed storage system is employed in which objects are initially stored by erasure coding or replication depending on the policy. Policies can be generated based on static information, such as an object size, an object type, an object reliability requirement, an object nature, an application-requested quality of service (QoS) for the object, a predetermined read or write performance, etc., and/or based on dynamic information, such as an object popularity, an object update rate, a time since object creation, a cluster load, a storage node load, etc.
Reference is made first to
Each of the clients 104 can send a request for storing an object to one of the servers 102. Each of the clients 104 can send a request for retrieving an object to the server that manages the storage of, and access to, the object. In some embodiments, the system 100 is configured to handle a per-object storage configuration. In system 100, each object is assigned to a single server, which is responsible for determining how and where to store the object. In one embodiment, the storage decisions can be made according to information about the object and information gathered from all the servers that host the object.
In some embodiments, each of servers 102 maintains a policy database that stores policies for storing objects in the storage clusters 106 and 108. For example, the client 104-1 may send a request 110 for storing object A to the server 102-1. Upon receipt of the request, the server 102-1 may extract one or more attributes of the object A and find a policy in the policy database based on the extracted attributes to store the object A. For example, an attribute of the object A can be its size or popularity. A policy, which can be a default policy for any object managed by the server 102-1, may define a cost structure for storing objects. Upon receipt of the object, the server 102-1 assigns a cost score to the object A based on the cost structure. Based on the cost score of the object A, the server 102-1 determines a storage method for storing the object.
In one embodiment, the default policy for server 102-1 may indicate that if the size of an object exceeds a predetermined size threshold, the object is to be stored by erasure coding, and that if the size of the object is equal to or less than the size threshold, the object is to be stored by replication, e.g., three replicas of the object stored on different nodes. For example, after receiving the request for storing the object A from the client 104-1, the server 102-1 determines the size of the object A and compares it to the size threshold. If the size of the object A exceeds the size threshold, the server 102-1 uses erasure coding to, for example, split the object A into ten fragments and generate four parity fragments, and stores the fragments at, for example, fourteen different storage nodes in cluster 106, resulting in a storage overhead of 1.4. Other erasure coding mechanisms may be employed, which may result in a different storage overhead greater or less than 1.4. If the size of the object A is equal to or less than the size threshold, the server 102-1 stores the object A as three replicas at three different storage nodes in, for example, cluster 108, resulting in a storage overhead of 3. In general, the storage overhead of an erasure coding mechanism is less than that of three replicas.
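The default size-based policy described above may be sketched, purely for illustration, as follows; the size threshold, the 10-4 code, and the replica count are example values.

    SIZE_THRESHOLD = 1 * 1024 * 1024  # example threshold of 1 MiB

    def choose_storage_method(object_size):
        if object_size > SIZE_THRESHOLD:
            # Large object: erasure code into 10 data + 4 parity fragments on
            # 14 storage nodes, giving an overhead of 14/10 = 1.4.
            return {"method": "erasure_coding", "data": 10, "parity": 4, "overhead": 1.4}
        # Small object: three full replicas on three storage nodes, overhead 3.
        return {"method": "replication", "replicas": 3, "overhead": 3.0}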
In some embodiments, policies may indicate that when an object is stored by erasure coding, the node or nodes that will store the fragments of the object have a slower response speed, whereas when an object is stored by replication, the node or nodes that will store the replicas have a higher response speed. For example, each of the servers 102 may maintain a performance and capacity database of the storage nodes. An entry in the performance and capacity database for a storage node may include the processor speed, the size of the storage medium, the free space on the storage medium, the type of the storage medium (e.g., SSD or hard drive), or the load of the storage node. When an object is to be stored by erasure coding, the server can select fourteen nodes that have a slower response speed based on the performance and capacity database. Moreover, when an object is to be stored by replication, the server can select three nodes to store the replicas that have a response speed faster than that of the nodes used for erasure coding, based on the performance and capacity database.
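A possible node-selection sketch based on such a performance and capacity database is shown below; the database schema (field names such as "response_ms" and "free_bytes") is assumed only for illustration.

    def select_nodes(perf_db, count, prefer_fast):
        # perf_db: list of entries such as
        #   {"node": "n1", "medium": "ssd", "free_bytes": ..., "load": ..., "response_ms": ...}
        candidates = [n for n in perf_db if n["free_bytes"] > 0]
        # Sort by response time: ascending when fast nodes are preferred (replicas),
        # descending when slower nodes are acceptable (erasure-coded fragments).
        candidates.sort(key=lambda n: n["response_ms"], reverse=not prefer_fast)
        return candidates[:count]

    # e.g., fourteen slower nodes for fragments and three faster nodes for replicas:
    # fragment_nodes = select_nodes(perf_db, 14, prefer_fast=False)
    # replica_nodes  = select_nodes(perf_db, 3, prefer_fast=True)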
In some embodiments, a default policy may indicate that when an object is stored in the system, it is initially erasure coded so that the default configuration results in a low storage overhead, e.g., 1.4. The default policy can be overridden for some classes of objects. For example, when performance is important at the application level, an object may be stored by replication or on a fast storage medium even though the default policy calls for erasure coding.
In some embodiments, a default policy may be based on the popularity of the objects. For example, an object received for storage may include its popularity information, e.g., a popularity score. A default policy may dictate that if the popularity score of the object is less than a threshold, the object is to be stored by erasure coding, and if the popularity score of the object is equal to or greater than the threshold, the object is to be stored by replication. In one embodiment, the popularity of an object can be determined by a least recently used (LRU) or least recently/frequently used (LRFU) index. The servers 102 in the storage system may maintain an LRFU structure of the objects for which they are responsible. The servers 102 keep a record of which objects are in this structure, as well as when and how many times the objects have been accessed since they entered the structure. In one embodiment, the structure that stores the popularity scores of the objects may include two or more classes for determining storage methods for the objects. For example, a server may maintain a popularity database that records changes in the popularity scores of the objects the server manages. A policy for storing the objects may define a first popularity threshold. If the popularity score of an object is less than the first popularity threshold, the object is assigned to a less-popular class such that the object is to be stored by erasure coding, resulting in a lower storage overhead, e.g., 1.4. If the popularity score of an object is equal to or greater than the first popularity threshold, the object is assigned to a popular class such that the object is to be stored by replication, resulting in a greater storage overhead, e.g., 3.
In some embodiments, the policy for storing the objects may further define a second popularity threshold greater than the first popularity threshold such that the objects are assigned to three different classes. When a popularity score of an object is equal to or greater than the second popularity threshold, the object is assigned to a most popular class. For example, objects in the least popular class are stored by erasure coding having a storage overhead of 1.4, objects in the popular class are stored by both erasure coding and replication (one replica) having a storage overhead of 2.4, and objects in the most popular class are stored by replication (three replicas) having a storage overhead of 3.
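The three-class popularity policy may be sketched as follows; the threshold values are placeholders, and the overhead figures follow the 10-4 code and replica counts used in the examples above.

    FIRST_POPULARITY_THRESHOLD = 10    # example value
    SECOND_POPULARITY_THRESHOLD = 100  # example value

    def classify(popularity_score):
        if popularity_score >= SECOND_POPULARITY_THRESHOLD:
            # Most popular class: three full replicas, overhead 3.
            return ("most_popular", {"erasure_coded": False, "replicas": 3}, 3.0)
        if popularity_score >= FIRST_POPULARITY_THRESHOLD:
            # Popular class: erasure-coded copy plus one replica, overhead 1.4 + 1 = 2.4.
            return ("popular", {"erasure_coded": True, "replicas": 1}, 2.4)
        # Least popular class: erasure coding only, overhead 1.4.
        return ("least_popular", {"erasure_coded": True, "replicas": 0}, 1.4)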
In some embodiments, the threshold(s) is/are configurable. For example, based on a triggering event in the storage system, a server may modify a policy to change the threshold(s) or identify a new policy that indicates a different threshold. After an object is initially stored in a storage cluster according to the default policy, the server may receive a triggering event associated with the object to identify or generate a new policy to store the object. The triggering event changes one or more attributes of the object. In response to the triggering event, the server identifies or generates a new storage policy for storing the object. The object is then stored according to the new policy.
For example, the triggering event changes an attribute of the object such that the attribute of the object is greater or less than a predetermined threshold. In response to the triggering event, the object is stored according to the new storage policy. In one embodiment, the server may store an additional copy or delete an existing copy of the object according to the new policy. In one embodiment, the additional copy of the object is stored by replication at a node having a response speed greater than a node that stores the existing copy of the object. These techniques allow the server to dynamically manage the storage of objects to reduce storage overhead and/or improve performance of the storage system.
In one embodiment, the client 204 may send to the server 202 a request 212 for retrieving an object x. The object x is stored by erasure coding according to the default storage policy such that the object x is split, for example, into ten fragments stored on ten different storage nodes 206. In response to the request 212, the server 202 transmits a response 214 to the requesting client 204 that includes fragment identifiers and the identities and/or addresses of the storage nodes that store the fragments. The server 202 also updates the popularity database 210 based on the request 212. For example, the request 212 is the first request ever for retrieving the object x since the object was stored by the server 202 at the storage nodes 206. In response to the request 212, the server 202 saves an entry (x,1), shown at 210-5, in the popularity database 210. In one embodiment, the server 202 deletes entry 210-1 for the object v because entry 210-1 indicates that object v is as popular as object x (both have been retrieved once) but entry 210-1 is older than entry 210-5. The client 204, based on the response 214, can retrieve (at 216) object x from the storage nodes that store the fragments of the object x.
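The bookkeeping illustrated above may be sketched as follows, assuming a bounded popularity structure whose capacity and entry layout are chosen arbitrarily: a retrieval increments an object's count, and when the structure is full the oldest entry among the least popular ones (here, the entry for object v) is evicted.

    import time

    CAPACITY = 4        # example capacity of the popularity structure
    popularity_db = {}  # maps object id -> (retrieval_count, last_update_timestamp)

    def record_retrieval(obj_id):
        count, _ = popularity_db.get(obj_id, (0, 0.0))
        if obj_id not in popularity_db and len(popularity_db) >= CAPACITY:
            # Evict the entry with the lowest count; among ties, the oldest one.
            victim = min(popularity_db,
                         key=lambda k: (popularity_db[k][0], popularity_db[k][1]))
            del popularity_db[victim]
        popularity_db[obj_id] = (count + 1, time.time())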
As the popularity of the object grows, the server 202 may generate or identify a new policy different from the default policy to store the object. An example is shown in
In one embodiment, the server 202 may instruct (at 226) the storage node 224 to retrieve (at 227) the fragments of the object x from storage nodes 206 to construct a replica of the object x. Once the replica is stored at the storage node 224, the object x is stored by both erasure coding and replication due to its increased popularity. When the object becomes popular, a replica of the object can be stored in the system 200 so that clients can directly access it. As a result, while the object is popular, its storage overhead is 2.4 (one replica plus a copy in erasure coding) instead of 1.4 (erasure coding). Furthermore, the storage node 224 chosen to host (store) the full replica can be selected among nodes that have a lower than average cluster request load or a faster storage device. That is, the storage node 224 can act as a system-wide cache for the object. In the meantime, based on the response 222, the client 204 starts to retrieve (at 228) object x from the storage nodes 206 that store the fragments of the object x.
Referring to
In some embodiments, an object previously popular may become less popular, and the server that manages the object may dynamically change the method to store the object. Reference is made to
In some embodiments, a popularity policy may set a second popularity threshold above the popularity threshold (a first popularity threshold) for providing an improved user experience for extremely popular objects. When the popularity of an object is equal to or greater than the first popularity threshold, a first replica of the object is added, and when the popularity of the object is equal to or greater than the second popularity threshold, one additional (second) replica of the object is added to the storage system 200.
It is to be understood that although one client, one server, and a limited number of storage nodes are illustrated in
In one embodiment, when the server detects that the load of the storage nodes exceeds a predetermined level, i.e., a triggering event, the server may increase a popularity threshold such that fewer objects are stored by replication. In another embodiment, when the server detects that the load of the storage nodes is less than the predetermined level, the server may decrease the popularity threshold such that more objects are stored by replication, which facilitates the clients' access to the objects.
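This load-driven adjustment may be sketched as follows; the load limit and step size are arbitrary example values.

    LOAD_LIMIT = 0.8      # fraction of capacity regarded as the predetermined level
    THRESHOLD_STEP = 5    # example adjustment step

    def adjust_popularity_threshold(current_threshold, cluster_load):
        if cluster_load > LOAD_LIMIT:
            return current_threshold + THRESHOLD_STEP  # fewer objects replicated
        if cluster_load < LOAD_LIMIT:
            return current_threshold - THRESHOLD_STEP  # more objects replicated
        return current_threshold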
In some embodiments, a storage policy may indicate that by default objects are replicated and stored in the system. When their popularity decreases, they are erasure coded and stored on slower devices. In some embodiments, a storage policy may indicate that objects requiring high performance have a replica on a fast storage device and are erasure coded on slower devices while objects that do not require high performance have a replica on regular devices and are erasure coded on slower devices. As such, a server can dynamically manage the storage of the objects to improve performance and reduce storage overheads.
The memory 304 may include read only memory (ROM), random access memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, electrical, optical or other physical/tangible memory storage devices. The memory 304 stores dynamic information 304a such as a performance and capacity database for storage nodes 304a1, an object popularity, a popularity database 304a2, an object update rate, a time since object creation, a cluster load, a storage node load, etc. for identifying or generating policies for objects; static information 304b such as an object size, an object type, an object reliability, object nature, an application-requested QoS for objects, predetermined read or write performance, etc. for identifying or generating policies for objects; a storage policy database 304c that includes policies for storing objects; policy generating/identifying software 304d for generating a new policy or identifying a suitable policy in response to a triggering event; and access control software 304e configured to manage client requests for accessing/retrieving objects.
The functions of the processor 302 may be implemented by logic encoded in one or more tangible (non-transitory) computer-readable storage media (e.g., embedded logic such as an application specific integrated circuit, digital signal processor instructions, software that is executed by a processor, etc.), wherein the memory 304 stores data used for the operations described herein and stores software or processor executable instructions that are executed to carry out the operations described herein.
In one embodiment, the processor 302 may be embodied by digital logic gates in a fixed or programmable digital logic integrated circuit, which digital logic gates are configured to perform storage and accessing control operations described herein. In general, the policy generating/identifying software 304d and the access control software 304e may be embodied in one or more computer-readable storage media encoded with software comprising computer executable instructions and when the software is executed operable to perform the operations described herein.
The communication interface 306 is configured to transmit communications to, and receive communications from, a computer network for the server 300. In one example, the communication interface 306 may take the form of one or more network interface cards.
At 416, the server 402 assigns a cost score to the object based on the cost structure of the policy. For example, when the size of the object is small, the server 402 assigns a low cost score to the object, and when the size of the object is large, the server 402 assigns a high cost score. At 418, based on the cost score of the object, the server 402 identifies a method to store the object. For example, if the cost score of the object is equal to or greater than a threshold, the object is to be stored by erasure coding. If the cost score of the object is less than the threshold, the object is to be stored by replication. In some embodiments, a policy may include two different thresholds for determining a storage method for the object. For example, if the cost score of the object is less than a lower threshold, the object is to be stored by replication. If the cost score of the object is between the lower threshold and a higher threshold, the object is to be stored by both replication and erasure coding. If the cost score of the object is equal to or greater than the higher threshold, the object is to be stored by erasure coding to reduce storage cost. At 420, once the server 402 determines a method to store the object, the server 402 transmits to the storage cluster 406 instructions for storing the object. For example, the instructions may include the determined method (erasure coding, replication, or a combination thereof), a performance requirement (fast or slow response speed), a hardware requirement (solid state drive (SSD) or hard drive), identities of designated storage nodes to store the object, a storage overhead, etc., for the storage cluster 406 to successfully store the object. At 422, the storage cluster 406 stores the object based on the received instructions. In one embodiment, the server 402 may broadcast to the network that it has stored the object.
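The two-threshold cost-score decision at 416-418 may be sketched as follows; the size-based cost structure and the threshold values are examples only.

    LOWER_COST_THRESHOLD = 10    # example value
    HIGHER_COST_THRESHOLD = 100  # example value

    def cost_score(object_size_bytes):
        # Example cost structure: larger objects cost more to keep fully replicated.
        return object_size_bytes / (1024 * 1024)  # score expressed in MiB

    def storage_method(score):
        if score < LOWER_COST_THRESHOLD:
            return "replication"
        if score < HIGHER_COST_THRESHOLD:
            return "replication_and_erasure_coding"
        return "erasure_coding"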
At 424, a triggering event is received at the server 402. The triggering event can be a request for retrieving an object, transmitted from the client 404 or a different client to the server 402. Other dynamic parameters received at the server 402 can also serve as triggering events; in general, any event that changes one or more attributes of the object can be a triggering event. A reported downtime of the storage cluster 406 or a change in the cluster load or storage node load can also be a triggering event. In some embodiments, a triggering event associated with a dynamic parameter may come from any one of the server 402, the client 404, or the storage cluster 406.
Based on the triggering event, at 426 the server 402 modifies one or more attributes associated with the object. For example, based on a request for retrieving an object, the server 402 may update information such as a popularity score of the object, the latest time the object was requested, an identity of the requester of the object, a location of the requester, etc. In one embodiment, when the request is to retrieve the object, the server may increase a popularity score of the object as explained above in connection with
At 428, the server 402 generates a new policy or identifies a policy (second storage policy) based on the modified attribute(s) associated with the object. For example, when a popularity score of the object is modified, the server 402 generates a new policy or identifies a policy that governs the storage of objects based on their popularity scores. The popularity policy may include one or more threshold values for selecting a method (erasure coding, replication, or a hybrid of the two) for storing an object. At 430, the server 402 determines whether the modified attribute associated with the object is greater or less than a predetermined threshold of the second storage policy. For example, the server 402 determines whether the increased or decreased popularity score of the object is greater or less than a popularity threshold. An increased popularity score could result in the popularity score of the object moving from below the popularity threshold to above the popularity threshold. Conversely, a decreased popularity score could result in the popularity score of the object moving from being greater than the popularity threshold to being less than the popularity threshold.
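Steps 424 through 430 may be sketched as follows, where the event format and the one-point score increment are assumptions made for illustration.

    def handle_trigger(event, popularity_scores, popularity_threshold):
        # event examples: {"type": "retrieve", "object": "x"} or {"type": "cluster_load", "load": 0.9}
        obj = event.get("object")
        if event["type"] == "retrieve" and obj is not None:
            before = popularity_scores.get(obj, 0)
            after = before + 1
            popularity_scores[obj] = after
            crossed_threshold = before < popularity_threshold <= after
            return obj, crossed_threshold
        # Other events (cluster load changes, reported downtime, etc.) are handled separately.
        return None, False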
At 432, based on the determination at 430, the server 402 determines a method to store the object. For example, based on the determination at 430, the server 402 determines whether the original method to store the object is still effective in cost and performance. If the server 402 determines that the original method to store the object is still effective, the server 402 determines the original method should be maintained. In some embodiments, the server 402 may determine that a new (different) storage method is to be employed to reduce cost or improve performance. For example, the server 402 may store an additional copy of the object at a fast response storage node if the increased popularity score of the object is equal to or greater than the popularity threshold. The server 402 may also delete an existing copy of the object if the reduced popularity score of the object is less than the popularity threshold. In one embodiment, if the increased popularity score of the object is equal to or greater than a second popularity threshold indicating that the object is extremely popular, the server 402 may store an additional copy of the object by replication at a fast response storage node.
At 434, once the server 402 determines a new method to store the object, the server 402 transmits to the storage cluster 406 instructions for storing the object. The instructions indicate which new storage method or methods are to be used to store the object. At 436, the storage cluster 406 employs the new method(s) to store the object based on the received instructions.
At 438, the server 402 transmits a response to the client 404 if the triggering event is a request for retrieving the object. The response may include an address or identifier of the storage cluster 406 so that the client 404 can retrieve the object from the storage cluster 406. In some embodiments, the response may include addresses of the storage nodes that store the object. At 440, based on the address or identifier of the storage cluster 406, the client 404 sends a request to retrieve the object to the storage cluster 406. At 442, in response to the request, the storage cluster 406 returns the requested object to the client 404.
If the popularity score of the object exceeds the first popularity threshold (Yes at 804), at 820 the server stores a copy of the object by erasure coding and a copy of the object by replication. In some embodiments, the erasure-coded copy may be stored at storage nodes having a slow response speed and the replica of the object may be stored at a node having a fast response speed.
At 822, the server receives another triggering event associated with the object. Similar to the process at 808, the triggering event may be a request for retrieving the object or a different object. At 824, in response to receiving the triggering event, the server updates the popularity score of the object. At 826, if the triggering event is a request for retrieving a different object, the server decreases the popularity score of the object. The process then returns to 804 for the server to again determine whether the decreased popularity score of the object still exceeds the first popularity threshold. At 828, if the triggering event is a request for retrieving the object, the server increases the popularity score of the object, and the process moves to 830.
At 830, the server determines whether the increased popularity score of the object exceeds a second popularity threshold higher than the first popularity threshold. If the increased popularity score of the object does not exceed the second popularity threshold, the process returns to 822 to wait for another triggering event. If the increased popularity score of the object exceeds the second popularity threshold, at 832 the server deletes a copy of the object stored by erasure coding and stores a second additional copy of the object by replication. At this point, the storage system has two replicas of the object and no erasure-coded copy of the object, resulting in a storage overhead of 2 for the object. In some embodiments, at 832 the server may delete a copy of the object stored by erasure coding and store two additional copies of the object by replication. When this occurs, the storage system has three replicas of the object and no erasure-coded copy of the object, resulting in a storage overhead of 3 for the object.
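The transition at 830-832 may be sketched as follows for the two-replica variant, where the representation state fields are assumptions: the erasure-coded copy is deleted and replicas are added until two replicas remain, giving an overhead of 2.

    def apply_second_threshold_transition(state):
        # state example: {"erasure_coded": True, "replicas": 1}
        actions = []
        if state["erasure_coded"]:
            actions.append("delete_erasure_coded_copy")
            state["erasure_coded"] = False
        while state["replicas"] < 2:
            actions.append("store_additional_replica")
            state["replicas"] += 1
        return actions, state

    # Starting from erasure coding plus one replica (overhead 2.4):
    # apply_second_threshold_transition({"erasure_coded": True, "replicas": 1})
    # -> (["delete_erasure_coded_copy", "store_additional_replica"],
    #     {"erasure_coded": False, "replicas": 2})   # overhead 2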
The techniques presented herein allow a server to dynamically and reactively improve the efficiency of distributed storage systems, for example, by adapting to heterogeneous object popularity. In some embodiments, a server may determine the popularity of objects with a cost function that may be based on well-established mechanisms such as those used in web caches and content delivery networks. The server is configured to manage the internal cluster representation of objects according to the determined popularity or to any other metric. In some embodiments, the server may dynamically manage the manner by which objects are stored in the distributed storage system to maintain a low storage overhead. In some embodiments, when 20% of the objects stored in the cluster account for 80% of the requests, the techniques applied with erasure codes provide an overall storage overhead of 1.6 while guaranteeing the same quality of service for object retrieval as a storage cluster with a storage overhead of 3 in 80% of the cases.
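This figure can be reproduced under the assumption that the popular 20% of the objects are held as an erasure-coded copy plus one full replica (overhead 2.4) while the remaining 80% are erasure coded only (overhead 1.4): the aggregate overhead is then 0.2 × 2.4 + 0.8 × 1.4 = 0.48 + 1.12 = 1.6, and the 80% of requests that target the popular objects are served from a full replica.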
In some embodiments, the techniques provide a way to dynamically adjust the internal representation of objects in distributed storage clusters according to one or more policies. A policy could have dynamic adaptability to, for example, object popularity. A storage cluster may guarantee high performance for the vast majority of requests while maintaining a low storage overhead, resulting in a higher average performance/cost ratio.
In some embodiments, the techniques employ a cost function associated with one or more attributes of an object, including but not limited to popularity. The cost function may be domain specific and may also depend on other object characteristics, such as the size or data type of the objects. The techniques associate a storage policy with categories of objects. The storage policy controls the default object representation (erasure coding, replicas, or both) and reactively switches between object representations based on triggering events. For example, a policy may define that an object is initially stored in an erasure-coded representation with an additional full replica to maximize read performance and to save the computing and network resources that would be required to access the object in erasure-coded form. The particular event that triggers the transition between the different representations, or the coexistence of different representations, can be defined by the policy.
In one form, a method is provided, which includes: storing, by a server, an object according to a first storage policy in a distributed storage system that includes a plurality of storage nodes, wherein storing the object according to the first storage policy results in a first storage overhead for the object; receiving a triggering event associated with the object, wherein the triggering event changes an attribute of the object; in response to the triggering event, identifying a second storage policy for the object; and storing the object according to the second storage policy that results in a second storage overhead for the object different from the first storage overhead.
In some embodiments, the first storage policy is a default storage policy that defines a cost structure for storing objects maintained by the plurality of storage nodes. The method further includes: upon receipt of the object by the server, assigning a cost score to the object based on the cost structure; and based on the cost score of the object, determining a storage method for storing the object.
In some embodiments, the first storage policy indicates: when a size of the object is greater than a size threshold, the object is stored by erasure coding; and when the size of the object is equal to or less than the size threshold, the object is stored by replication.
In some embodiments, the triggering event changes the attribute of the object such that the attribute of the object is greater or less than a predetermined threshold; and in response to the triggering event, storing the object according to the second storage policy includes the server storing an additional copy or deleting an existing copy of the object.
In some embodiments, the additional copy of the object is stored at a first storage node having a response speed greater than a second storage node that stores the existing copy.
In some embodiments, the triggering event is a client request to retrieve the object. The attribute of the object is a popularity score of the object. Receiving the client request increases the popularity score of the object such that the popularity score of the object exceeds a popularity threshold. In response to the client request, storing the object according to the second storage policy for the object includes the server storing an additional copy of the object by replication such that the second storage overhead for the object is greater than the first storage overhead.
In some embodiments, the client request is a first client request and the popularity threshold is a first popularity threshold. The method further includes receiving a second client request to retrieve the object, wherein receiving the second client request increases the popularity score of the object such that the popularity score of the object exceeds a second popularity threshold, and wherein, in response to the second client request, storing the object according to the second storage policy for the object includes the server deleting a copy of the object stored by erasure coding and storing a second additional copy of the object by replication such that the second storage overhead for the object is greater than the first storage overhead.
In some embodiments, the object is a first object; the triggering event is a client request to retrieve a second object different from the first object; and the attribute of the object is a popularity score of the first object. Receiving the client request decreases the popularity score of the first object such that the popularity score of the first object is less than the popularity threshold. In response to the client request, storing the object according to the second storage policy for the object includes the server deleting a copy of the first object stored by replication such that the second storage overhead for the object is less than the first storage overhead.
In another form, an apparatus is provided. The apparatus includes a network interface that enables network communications, a processor, and a memory to store data and instructions executable by the processor. The processor is configured to execute the instructions to: store an object according to a first storage policy in a distributed storage system that includes a plurality of storage nodes, wherein storing the object according to the first storage policy results in a first storage overhead for the object; receive a triggering event associated with the object, wherein the triggering event changes an attribute of the object; in response to the triggering event, identify a second storage policy for the object; and store the object according to the second storage policy that results in a second storage overhead for the object different from the first storage overhead.
In yet another form, a non-transitory computer-readable storage medium is provided. The non-transitory computer-readable storage medium is encoded with software comprising computer executable instructions which, when executed by a processor, cause the processor to: store an object according to a first storage policy in a distributed storage system that includes a plurality of storage nodes, wherein storing the object according to the first storage policy results in a first storage overhead for the object; receive a triggering event associated with the object, wherein the triggering event changes an attribute of the object; in response to the triggering event, identify a second storage policy for the object; and store the object according to the second storage policy that results in a second storage overhead for the object different from the first storage overhead.
The above description is intended by way of example only. Although the present disclosure has been described in detail with reference to particular arrangements and configurations, these example configurations and arrangements may be changed significantly without departing from the scope of the present disclosure. Moreover, certain components may be combined, separated, eliminated, or added based on particular needs and implementations. Although the techniques are illustrated and described herein as embodied in one or more specific examples, they are nevertheless not intended to be limited to the details shown, since various modifications and structural changes may be made within the scope and range of equivalents of this disclosure.