A distributed data store employs replica placement techniques in which k hash functions are used to compute k potential locations for a data item, and a subset of r of these locations is chosen for storing replicas. These replica placement techniques provide a system designer with the freedom to choose r of the k locations, are structured in that placements are determined by a straightforward functional form, and are diffuse in that the replicas of the items on one server are scattered over many other servers. The resulting storage system exhibits excellent storage balance and request load balance in the presence of incremental system expansions, server failures, and load changes, and facilitates fast parallel recovery. These benefits translate into savings in server provisioning, higher system availability, and better user-perceived performance.
Techniques are provided for placing and accessing items in a distributed storage system that satisfy these goals with efficient resource utilization. Having multiple choices for placing data items and replicas in the storage system is combined with load balancing algorithms, leading to efficient use of resources. Once the server architecture is created, and the k potential locations for a data item are determined along with the r locations for storing replicas, data items may be created, read, and updated, and network load may be balanced in the presence of both reads and writes. Create, update, and read operations pertaining to data items are described herein, e.g., with respect to
At step 20, k hash functions are generated or obtained, one for each set of servers. At step 30, a replication factor r is chosen, where r<k.
Thus, the servers are divided into k groups, where servers in different groups do not share network switches, power supplies, etc. A separate hash function for each group maps a data item to a unique server in that group. Any data item is stored at r of its k possible servers. The parameters k and r are not picked each time a data item arrives; rather, they are determined ahead of time in the design of the server architecture and organization.
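As an illustration only, the following is a minimal sketch of how a front-end might compute the k candidate servers for an item, assuming one independent hash function per segment. The names hash_locations and segments, and the salted-SHA-256 construction, are illustrative assumptions, and a plain modulus stands in for the per-segment (e.g., linear) hashing described later in this section.

```python
import hashlib

def hash_locations(item_key, segments):
    """Return one candidate server per segment: k candidates in total.

    Each segment gets its own hash function, derived here by salting a
    base hash with the segment index; any k independent hash functions
    would serve the same purpose.
    """
    candidates = []
    for i, servers in enumerate(segments):
        digest = hashlib.sha256(f"{i}:{item_key}".encode()).digest()
        value = int.from_bytes(digest[:8], "big")
        candidates.append(servers[value % len(servers)])  # unique server in segment i
    return candidates

# Example with k=4 segments of 8 servers each; replicas are then
# stored at r of these k candidate servers.
segments = [[f"s{i}-{j}" for j in range(8)] for i in range(4)]
print(hash_locations("item-42", segments))
```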
The choice of k and r significantly influences the behavior of the system. In practice, r is chosen based on the reliability requirement for the data. A larger r provides better fault tolerance and offers the potential for better query load balancing (due to the increased number of choices), but incurs higher overhead. In typical scenarios, r is chosen between 3 and 5.
The gap between k and r determines the degree of freedom. The larger the gap, the more freedom the scheme has. This translates into better storage balancing and into faster re-balancing after incremental expansion. A larger gap also offers more choices of locations at which new replicas can be created when servers fail. In particular, even with k-r failures among the k hash locations, there still exist r hash locations at which to store the replicas. However, a larger k with a fixed r incurs a higher cost of finding which hash locations hold a data item: without a front-end cache of the mapping from data items to their locations, all k hash locations are probed.
More particularly, regarding storage balancing described with respect to the flow diagram of
In a typical setting, as shown in
New machines may be added to the system from time to time for incremental expansion. Assume that new servers are added to the segments in a round-robin fashion so that the sizes of segments remain approximately the same. A hash function for a segment accommodates the addition of new servers so that some data items are mapped to those servers. Any dynamic hashing technique may be used. For example, linear hashing may be used within each segment for this purpose.
A fixed base hash function is distinguished from an effective hash function. The effective hash function relies on the base hash function, but changes with the number of servers to be mapped to. For example, as described with respect to the diagram of
The number of servers increases at step 420. At step 430, more bits of the hashed value are used to cover all the servers. For example, for cases where n=2^l for some integer l, the effective hash function is h_b(d) mod n for any key d. For 2^l<n<2^(l+1), the first and the last n-2^l servers will use the lower l+1 bits of the hashed value, while the remaining servers will use the lower l bits.
Note that servers 0, 1, 4, and 5 now each control only half the hash-value space compared to that of server 2 or 3. This is generally true whenever 2^l<n<2^(l+1) holds. In other words, linear hashing inherently suffers from hash-space imbalance for most values of n. However, this may be corrected by favoring the choice of replica locations at less-utilized servers.
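For concreteness, a minimal sketch of this effective address computation follows; the function and parameter names are illustrative, with base_hash_value standing for the integer h_b(d) and n for the current number of servers in the segment.

```python
def effective_address(base_hash_value, n):
    """Map a base hash value onto n servers with linear hashing.

    For n=2^l this is simply base_hash_value mod n.  For
    2^l<n<2^(l+1), servers that have already been split (the first
    and last n-2^l of them) are addressed with the lower l+1 bits,
    and the remaining servers with the lower l bits.
    """
    l = n.bit_length() - 1                             # largest l with 2^l <= n
    addr = base_hash_value & ((1 << l) - 1)            # lower l bits
    if addr < n - (1 << l):                            # this bucket was split:
        addr = base_hash_value & ((1 << (l + 1)) - 1)  # use lower l+1 bits
    return addr

# With n=6 (l=2), servers 0, 1, 4, and 5 each cover half the hash
# space covered by server 2 or 3, matching the imbalance noted above.
```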
Regarding high performance, storage balance is achieved through the controlled freedom to choose less-utilized servers for the placement of new replicas. Request load balance is achieved by sending read requests to the least-loaded replica server. Because the replica layout is diffuse, excellent request load balance is achieved. Balanced use of storage and network resources ensures that the system provides high performance until all the nodes reach full capacity and delays the need for adding new resources.
Regarding scalability, incremental expansion is achieved by running k independent instances of linear hashing. This approach by itself may compromise balance, but the controlled freedom mitigates this. The structured nature of the replica location strategy, where data item locations are determined by a straightforward functional form, ensures that the system need not consistently maintain any large or complex data structures during expansions.
Regarding availability and reliability, basic replication ensures continuous availability of data items during failures. The effect of correlated failures is alleviated by using hash functions that have disjoint ranges. Servers mapped by distinct hash functions do not share network switches and power supply. Moreover, recovery after failures can be done in parallel due to the diffuse replica layout and results in rapid recovery with balanced resource consumption.
Replication is used to tolerate failures. Replicas are guaranteed to be on different segments, and segments are desirably designed or arranged so that intersegment failures have low correlation. Thus, data will not become unavailable due to typical causes of correlated failures, such as the failure of a rack's power supply or network switch.
When servers fail, the number of remaining replicas for certain data items falls below r. Fast restoration of the redundancy level is crucial to reducing the probability of data loss. Because k>r holds, unused hash locations exist. It is desirable to re-create the failed replicas at those unused hash locations to preserve the invariant that all replicas of a data item are placed at its hash locations, thereby eliminating the need for any bookkeeping or for consistent meta-data updates.
Due to the pseudo-random nature of the hash functions, as well as their independence, data items on a failed server are likely to have their remaining replicas and their unused hash locations spread across servers of the other segments. The other hash locations are by definition in other segments. This leads to fast parallel recovery that involves many different pairs of servers, which has been shown effective in reducing recovery time.
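A hedged sketch of how such parallel recovery might be planned follows; the inputs item_index (mapping each server to the set of items it stores), hash_locations, and load are assumed helpers for illustration rather than part of the described system.

```python
def plan_recovery(failed_server, item_index, hash_locations, load):
    """Pair each under-replicated item with a source and a destination.

    Because the hash functions are independent, the (source,
    destination) pairs are spread over many servers in other segments,
    so the copies can proceed in parallel.
    """
    transfers = []
    for d in item_index[failed_server]:
        alive = [s for s in hash_locations(d) if s != failed_server]
        remaining = [s for s in alive if d in item_index[s]]   # surviving replicas
        unused = [s for s in alive if d not in item_index[s]]  # spare hash locations
        if remaining and unused:
            src = min(remaining, key=load)   # read from a lightly loaded replica
            dst = min(unused, key=load)      # re-create at a spare hash location
            transfers.append((d, src, dst))
    return transfers
```

Re-creating replicas only at an item's own hash locations, as above, preserves the placement invariant described earlier, so no bookkeeping is required.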
New front-end machines may also be added during incremental expansion. Failed front-end machines should be replaced promptly. The amount of time it takes to introduce a new front-end machine depends mainly on the amount of state the new front-end must have before it can become functional. The state is desirably reduced to a bare minimum. Because the hash locations may be determined from the system configuration (including the number of segments and their membership), the front-end does not need to maintain a mapping from data items to servers: each back-end server maintains the truth of its inventory. Compared to storing an explicit map of the item locations, this greatly reduces the amount of state on the front-end, and removes any requirements for consistency on the front-ends. Moreover, front-ends may cache location data if they wish. Such data can go stale without negative consequences: the cost of encountering a stale entry is little more than a cache miss, which involves computing k hash functions and querying k locations.
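As a sketch of this front-end lookup path (cache, hash_locations, and probe_server are hypothetical helpers assumed for illustration):

```python
def locate(d, cache, hash_locations, probe_server):
    """Find the servers currently holding replicas of item d.

    A stale cache entry costs little more than a miss: the front-end
    simply recomputes the k hash locations and probes them all.
    """
    cached = cache.get(d)
    if cached and all(probe_server(s, d) for s in cached):
        return cached                          # cached entries still valid
    holders = [s for s in hash_locations(d) if probe_server(s, d)]
    cache[d] = holders                         # refresh; no consistency required
    return holders
```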
The popularity of data items can vary dramatically, both spatially (i.e., among data items) and temporally (i.e., over time). Load balancing desirably accommodates such variations and copes with changes in system configuration (e.g., due to server failures or server additions). Depending on the particular system configuration, one or more resources on servers could become the bottleneck, causing client requests to queue up.
In cases where the network on a server becomes a bottleneck, it is desirable to have the request load evenly distributed among all servers in the system. Having r replicas to choose from can greatly mitigate such imbalance. In cases where the disk becomes the bottleneck, server-side caching is beneficial, and it becomes desirable not to unnecessarily duplicate items in the server caches.
Instead of using locality-aware request distribution, for a request on a given data item d, a front-end may pick the least-loaded server among those storing a replica of d. Placement of data items and their replicas influences the performance of load balancing in a fundamental way: a server can serve requests on a data item only if it stores a replica of that data item. Due to the use of independent hash functions, the data items on a particular server are likely to have their replicas dispersed across many different servers. Thus, such dispersed or diffuse replica placement makes it easier to find a lightly loaded server to take load off an overloaded server.
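This request-routing rule can be sketched as follows, where replica_holders and load are assumed callables supplied for illustration:

```python
def route_read(d, replica_holders, load):
    """Send a read for item d to the least-loaded server holding a replica."""
    return min(replica_holders(d), key=load)

# Example: three replicas of "item-42"; the request goes to server "b".
holders = {"item-42": ["a", "b", "c"]}
loads = {"a": 0.9, "b": 0.2, "c": 0.6}
print(route_read("item-42", holders.get, loads.get))
```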
Re-balancing after reconfiguration may be performed, in which data items may be moved from one server to another to achieve a more desirable configuration. For example, a data item may be moved from a server to a less heavily loaded server. Re-balancing may be performed when a predetermined condition is met (e.g., when a new data item is received, at a particular time, when the average load reaches a certain threshold).
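One possible re-balancing step, as a hedged sketch only; the trigger condition, the placement map, and the move primitive are illustrative assumptions.

```python
def rebalance_once(placement, hash_locations, utilization, threshold):
    """Move one replica off the most-utilized server, if over threshold.

    placement maps each item to the set of servers holding it.  A
    replica may only move to another of the item's k hash locations,
    preserving the placement invariant.
    """
    servers = {s for holders in placement.values() for s in holders}
    src = max(servers, key=utilization)
    if utilization(src) < threshold:
        return None                                  # nothing to do
    for d, holders in placement.items():
        if src not in holders:
            continue
        spare = [s for s in hash_locations(d) if s not in holders]
        if spare:
            dst = min(spare, key=utilization)
            if utilization(dst) < utilization(src):
                holders.remove(src)
                holders.add(dst)
                return (d, src, dst)                 # one move per invocation
    return None
```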
A flow diagram of an example method of balancing network bandwidth during the creation or writing of a received data item on a number of servers is described with respect to
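Although the referenced flow diagram is not reproduced here, such a create operation might be sketched as follows; utilization and write_replica are assumed primitives, and selecting the r least-utilized of the k hash locations is one plausible balancing rule.

```python
def create_item(d, value, hash_locations, utilization, r, write_replica):
    """Create item d with r replicas at its least-utilized hash locations."""
    candidates = hash_locations(d)                     # the k candidate servers
    targets = sorted(candidates, key=utilization)[:r]  # controlled freedom: r of k
    for server in targets:
        write_replica(server, d, value)                # assumed write primitive
    return targets
```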
A flow diagram of an example method of reading a data item, while maintaining network bandwidth balancing, is described with respect to
To read a data item, the front-end must first identify the highest version stored by polling at least k-r+1 of the hash locations. Because a completed write reaches r of the k hash locations, any set of k-r+1 locations is guaranteed to intersect a hash location that received the last completed version.
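This version-resolution step can be sketched as follows; version_at is an assumed helper that returns the highest version of d a server stores, or None if it holds no replica.

```python
def highest_version(d, hash_locations, version_at, k, r):
    """Find the latest completed version of d by polling k-r+1 locations.

    A completed write reached r of the k hash locations, so any
    k-r+1 of them must include at least one server that saw it.
    """
    probed = hash_locations(d)[: k - r + 1]
    versions = [version_at(s, d) for s in probed]      # None if s has no copy
    known = [v for v in versions if v is not None]
    return max(known) if known else None
```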
A flow diagram of an example method of balancing network bandwidth during the updating of a received data item on a number of servers is described with respect to
An update creates a new version of the same data item, which is inserted into the distributed storage system as a new data item. Although the new version has the same hash locations to choose from, it might end up being stored on a different subset from the old one based on the current storage utilization on those servers. Depending on the needs of the application, the storage system can choose to delete the old versions when appropriate.
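A brief sketch of an update under this scheme; the versioned key format and the deletion policy are assumptions made for illustration.

```python
def update_item(d, new_value, next_version, hash_locations, utilization, r,
                write_replica, delete_version=None):
    """Insert a new version of item d as if it were a new data item."""
    versioned_key = (d, next_version)                  # assumed versioning scheme
    candidates = hash_locations(d)                     # same k hash locations
    targets = sorted(candidates, key=utilization)[:r]  # may differ from old subset
    for server in targets:
        write_replica(server, versioned_key, new_value)
    if delete_version is not None:
        delete_version(d, next_version - 1)            # application-dependent cleanup
    return targets
```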
Numerous other general purpose or special purpose computing system environments or configurations may be used. Examples of well known computing systems, environments, and/or configurations that may be suitable for use include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, embedded systems, distributed computing environments that include any of the above systems or devices, and the like.
Computer-executable instructions, such as program modules, being executed by a computer may be used. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Distributed computing environments may be used where tasks are performed by remote processing devices that are linked through a communications network or other data transmission medium. In a distributed computing environment, program modules and other data may be located in both local and remote computer storage media including memory storage devices.
With reference to
Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CDROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.
The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation,
The computer 110 may also include other removable/non-removable, volatile/nonvolatile computer storage media. By way of example only,
The drives and their associated computer storage media discussed above and illustrated in
The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110, although only a memory storage device 181 has been illustrated in
When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation,
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.