The present invention relates to the field of data storage and, more particularly, to fault tolerant data replication.
Enterprise-class data storage systems differ from consumer-class storage systems primarily in their requirements for reliability. For example, a feature commonly desired for enterprise-class storage systems is that the storage system should not lose data or stop serving data in circumstances that fall short of a complete disaster. To fulfill these requirements, such storage systems are generally constructed from customized, very reliable, hot-swappable hardware components. Their firmware, including the operating system, is typically built from the ground up. Designing and building the hardware components is time-consuming and expensive, and this, coupled with relatively low manufacturing volumes, is a major factor in the typically high prices of such storage systems. Another disadvantage of such systems is the lack of scalability of a single system. Customers typically pay a high up-front cost for even a minimum disk array configuration, yet a single system can support only a finite capacity and performance. Customers may exceed these limits, resulting in poorly performing systems or having to purchase multiple systems, both of which increase management costs.
It has been proposed to increase the fault tolerance of off-the-shelf or commodity storage system components through the use of data replication. However, this solution requires coordinated operation of the redundant components and synchronization of the replicated data.
Therefore, what is needed are improved techniques for storage environments in which redundant devices are provided or in which data is replicated. It is toward this end that the present invention is directed.
The present invention provides techniques for assignment and layout of redundant data in a data storage system. In one aspect, the data storage system stores a number M of replicas of the data. Nodes that have sufficient resources available to accommodate a requirement of data to be assigned to the system are identified. When the number of nodes is greater than M, the data is assigned to M randomly selected nodes from among those identified. The data to be assigned may include a group of data segments and, when the number of nodes is less than M, the group is divided to form a group of data segments having a reduced requirement. Nodes are then identified that have sufficient resources available to accommodate the reduced requirement.
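The assignment aspect summarized above can be illustrated with a minimal sketch. The `Node` class, the capacity units, the halving rule used to divide a group, and the treatment of exactly M qualifying nodes as sufficient are all assumptions made for illustration; the text specifies only random selection of M qualifying nodes and division of the group when too few nodes qualify.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical storage node with a free-capacity counter."""
    name: str
    free: int
    groups: list = field(default_factory=list)


def assign_group(nodes, segments, m, rng=random):
    """Place M replicas of a group of segment sizes.

    Returns a list of (segments, node_names) placements. If fewer than M
    nodes can hold the whole group, the group is halved (an assumed rule)
    and each half is placed separately. Earlier placements are not rolled
    back if a later one fails.
    """
    need = sum(segments)
    candidates = [n for n in nodes if n.free >= need]
    if len(candidates) >= m:  # assumption: exactly M candidates also suffices
        chosen = rng.sample(candidates, m)
        for n in chosen:
            n.free -= need
            n.groups.append(list(segments))
        return [(list(segments), [n.name for n in chosen])]
    if len(segments) == 1:
        raise RuntimeError("fewer than M nodes can hold a single segment")
    mid = len(segments) // 2
    return (assign_group(nodes, segments[:mid], m, rng)
            + assign_group(nodes, segments[mid:], m, rng))
```

With four nodes of 8 free units each, a group of two 5-unit segments and M = 2, no node can hold the whole group, so the sketch places each segment on its own pair of nodes.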
In another aspect, a new storage device node is added to a data storage system having a plurality of existing storage device nodes. An existing node is identified that is heavily loaded in comparison to other ones of the existing nodes. Data stored at the identified existing node is moved to the new node. A determination is made whether the new node is sufficiently loaded in comparison to the existing nodes. When the new node is not sufficiently loaded, the identification and movement are repeated until the new node is sufficiently loaded.
In yet another aspect, data is removed from a storage device node in a data storage system. Data at the storage device node from which data is to be removed is selected. Other nodes of the data storage system having sufficient resources available to accommodate a requirement of the data are identified. The data is moved to a randomly selected node from among those identified. The selection, identification and movement are repeated until the storage device node to be removed is empty. The empty node may then be removed from the system.
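As a rough illustration of adding a node, the following sketch repeatedly takes data from the most heavily loaded existing node. The load measure (sum of item sizes) and the stopping rule (the new node reaches the system-wide average load) are assumptions; the text does not define what "sufficiently loaded" means.

```python
def fill_new_node(existing, new_items=None):
    """Move items from heavily loaded nodes onto a new node.

    existing: dict mapping a node name to a list of item sizes.
    Returns the list of item sizes now held by the new node.
    """
    new_items = [] if new_items is None else new_items
    total = sum(sum(items) for items in existing.values()) + sum(new_items)
    target = total / (len(existing) + 1)  # assumed threshold: average load
    while sum(new_items) < target:
        # Identify the existing node with the highest current load.
        donor = max(existing, key=lambda name: sum(existing[name]))
        if not existing[donor]:
            break  # nothing left to move anywhere
        new_items.append(existing[donor].pop())
    return new_items
```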
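The removal aspect can be sketched in the same spirit. The dictionary layout and the `free` counter are hypothetical, and the sketch does not check whether the chosen target already holds another replica of the same data, which a real system would presumably need to consider.

```python
import random


def drain_node(nodes, victim, rng=random):
    """Empty `victim` by moving each item to a randomly chosen node with room.

    nodes: dict mapping a node name to {"free": int, "items": list of sizes}.
    The emptied node is removed from `nodes` at the end.
    """
    while nodes[victim]["items"]:
        size = nodes[victim]["items"][-1]  # select data to move
        candidates = [name for name, state in nodes.items()
                      if name != victim and state["free"] >= size]
        if not candidates:
            raise RuntimeError("no node has room for the selected data")
        target = rng.choice(candidates)
        nodes[victim]["items"].pop()
        nodes[target]["items"].append(size)
        nodes[target]["free"] -= size
    del nodes[victim]
```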
In further aspects, program storage media readable by a machine may tangibly embody a program of instructions executable by the machine to perform methods of assigning data, adding a node to a system or removing data from a node, as summarized above.
These and other aspects of the invention are explained in more detail herein.
The present invention provides improved techniques for storage environments in which redundant devices are provided or in which data is replicated. An array of storage devices provides the reliability and performance of enterprise-class storage systems, but at lower cost and with improved scalability. Each storage device may be constructed of commodity components while their operation is coordinated in a decentralized manner. From the perspective of applications requiring storage services, the array presents a single, highly available copy of the data, though the data is replicated in the array. In addition, techniques are provided for accommodating failures and other behaviors, such as disk delays of several seconds, as well as different performance characteristics of devices, in a manner that is transparent to applications requiring storage services.
Preferably, each storage device 102 is composed of off-the-shelf or commodity parts so as to minimize cost. However, it is not necessary that each storage device 102 is identical to the others. For example, they may be composed of disparate parts and may differ in performance and/or storage capacity.
To provide fault tolerance, data is replicated within the storage system 100. In a preferred embodiment, for each data element, such as a block or file, at least two different storage devices 102 in the system 100 are designated for storing replicas of the data, where the number of designated storage devices and, thus, the number of replicas, is given as “M.” For a write operation, a value (e.g., for a data block) is stored at a majority of the designated devices 102 (e.g., in at least two devices 102 where M is two or three). For a read operation, the value stored in a majority of the designated devices is returned.
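The majority rule described above can be made concrete with a short sketch. The function names are hypothetical; only the rule itself (a strict majority of the M designated devices) comes from the text.

```python
from collections import Counter


def majority(m):
    """Smallest number of devices that forms a strict majority of M."""
    return m // 2 + 1


def read_majority(values, m):
    """Return the value held by a majority of the M designated devices.

    values: the values reported by the devices that replied (possibly
    fewer than M). Returns None when no value reaches a majority.
    """
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count >= majority(m) else None
```

Note that `majority(2)` and `majority(3)` both evaluate to 2, matching the example in which at least two devices are needed when M is two or three.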
For coordinating actions among the designated storage devices 102, timestamps are employed. In one aspect, a timestamp is associated with each data block at each storage device that indicates the time at which the data block was last updated (i.e. written to). In addition, a log of pending updates to each of the blocks is maintained which includes a timestamp associated with each pending write operation. An update is pending where a write operation has been initiated, but not yet completed. Thus, for each block of data at each storage device, two timestamps may be maintained.
For generating the timestamps, each storage device 102 includes a clock. This clock may either be a logical clock that reflects the inherent partial order of events in the system 100 or it may be a real-time clock that reflects “wall-clock” time at each device. If using real-time clocks, these clocks are synchronized across the storage devices 102 so as to have approximately the same time, though they need not be precisely synchronized. Synchronization of the clocks may be performed by the storage devices 102 exchanging messages with each other or by a centralized application (e.g., at one or more of the servers 106) sending messages to the devices 102. For example, each timestamp may include an eight-byte value that indicates the current time and a four-byte identifier that is unique to each device 102 so as to prevent identical timestamps from being generated.
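One possible encoding of such a twelve-byte timestamp is sketched below. The big-endian layout, the use of nanoseconds, and the function names are assumptions; the text specifies only the field sizes. Placing the time field first in big-endian order means that comparing the raw bytes orders timestamps by time and breaks ties by device identifier.

```python
import struct
import time


def make_timestamp(device_id, now_ns=None):
    """Pack an 8-byte time value and a 4-byte device id into 12 bytes."""
    if now_ns is None:
        now_ns = time.time_ns()  # assumed real-time clock source
    return struct.pack(">QI", now_ns, device_id)


def parse_timestamp(raw):
    """Return (time_value, device_id) from a 12-byte timestamp."""
    return struct.unpack(">QI", raw)
```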
In one aspect, the present invention provides a technique for performing coordinated read operations. A read request may be received by any one of the storage devices 102 of the storage system 100, such as from any of the clients 106. Even if the storage device 102 that receives the request is not a designated device for storing the requested block of data, that device preferably acts as the coordinator for the request, as explained herein. While the device that receives the request may also be a designated device for storing the data, this is not necessary. Thus, any of the devices 102 may receive the request. So that each device 102 has information regarding the locations of data within the system 100, each may store, or otherwise have access to, a data locations table (
Each of the three vertical lines 302, 304 and 306 in
The leftmost vertical line 302 represents the storage device 102 that is acting as coordinator for the read operation, whereas the other lines 304 and 306 represent the other designated devices. The read request is illustrated in
Each of the three storage devices 102 stores a value for the requested data block, given as “val” in
In response to the read request message 308, the first of the three storage devices 102 checks its update timestamp valTS1 for the requested data and forwards messages 310 and 312 to the other two storage devices 102. As shown in
In response to the messages 310 and 312, each of the other designated storage devices compares the values of its local valTS and logTS timestamps to the valTS timestamp value received from the coordinator storage device. If the local valTS timestamp is equal to the valTS timestamp received from the coordinator device, this indicates that both devices have the same version of the data block. Otherwise, not all of the versions may have been updated during a previous write operation, in which case, the versions may be different. Thus, by comparing the timestamps rather than the data itself, the devices 102 can determine whether the data is the same. It will be apparent that the data itself (or a representation thereof, such as a hash value) may be compared rather than the timestamps.
Also, if the local logTS is less than or equal to the valTS timestamp of the coordinator, this indicates that there is not a more recent update to the data that is currently pending. If the local logTS is greater than valTS, this indicates that the coordinator may not have the most recent version of the data available.
If the above two conditions are satisfied, the storage device returns an affirmative response (“yes” or “true”) to the coordinator device. The above may be represented by the following expression:
Referring to the example of
Because the coordinator storage device and the third storage device have the same valTS timestamp (and there is not a pending update), this indicates that the coordinator and the third storage device have the same version of the requested data. Thus, in the example, a majority (i.e. two) of the designated devices (of which there are three) have the same data. Thus, in response to receiving the message 314, the coordinator sends a reply message 316 that includes the requested data stored at the coordinator. The reply message 316 is routed to the requesting server 106.
The requested data may come from one of the designated devices that is not the coordinator (e.g., the coordinator may not have a local copy of the data or the coordinator may have a local copy, but obtains the data from another device anyway). In this case, the coordinator appoints one of the designated devices as the one to return data. The choice of device may be random, or may be based on load information. For example, load can be shifted away from a heavily loaded device to its neighbors, which can further shift the load to their neighbors and so forth, such that the entire load on the system 100 is balanced. Thus, storage devices with heterogeneous performance can be accommodated for load balancing, and load balancing can be performed despite some storage devices experiencing faults.
The coordinator then asks for <data,valTS,status> from the designated device and <valTS,status> from the others by sending different messages to each (e.g., in place of messages 310 and 312). The devices then return their valTS timestamps to the coordinator so that the coordinator can check the timestamps. The status information (a “yes” or “no” response) indicates whether logTS is less than or equal to valTS at the devices. If the designated device is not part of the quorum (e.g., because it is down or because it does not respond in time) or a quorum is not detected, the coordinator may initiate a repair operation (also referred to as a “recovery” operation) as explained herein (i.e., the coordinator considers the read to have failed). If the designated device does respond, and a quorum of affirmative responses is received, the coordinator declares success and returns the data from the designated device.
Thus, the coordinator may determine whether a majority of the designated storage devices 102 have the same version of the data by examining only the associated timestamps, rather than having to compare the data itself. In addition, once the coordinator determines from the timestamps that at least a majority of the devices have the same version of the data, the coordinator may reply with the data without having to wait for a “yes” or “no” answer from all of the designated storage devices.
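The timestamp comparison used for reads can be sketched as follows. Integer timestamps, the dictionary layout, and the assumption that the coordinator is itself a designated device that counts toward the majority are illustrative choices, not details taken from the text.

```python
def replica_read_ok(local_val_ts, local_log_ts, coord_val_ts):
    """A device answers yes when it holds the coordinator's version and
    has no pending update newer than that version."""
    return local_val_ts == coord_val_ts and local_log_ts <= coord_val_ts


def coordinator_read(coord, others, m):
    """Return the coordinator's value if a majority agrees, else None.

    coord and each entry of others: dict with keys "val", "valTS", "logTS".
    A None result is where a repair operation would be initiated.
    """
    yes = 1 if coord["logTS"] <= coord["valTS"] else 0
    for dev in others:
        if replica_read_ok(dev["valTS"], dev["logTS"], coord["valTS"]):
            yes += 1
    return coord["val"] if yes >= m // 2 + 1 else None
```

A production coordinator could return as soon as the count of affirmative replies reaches a majority rather than examining every reply; the loop above examines all replies only to keep the sketch short.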
Returning to the example of
As described above, the read operation allows the data (as opposed to the timestamps) to be read from any of the designated devices.
In another aspect, the present invention provides a technique for performing coordinated write operations. In general, write operations are performed in two phases including a “prewrite” phase and a write phase. In the prewrite phase, the logTS timestamp for the data to be written is updated and, then, in the write phase, the data and the valTS timestamp are updated. A partial or incomplete write operation is one in which not all of the storage devices designated to store a data block receive an update to the block. This may occur for example, where a fault occurs that affects one of the devices or when a fault occurs before all of the devices have received the update. By maintaining the two timestamps, partial or incomplete writes can be detected and addressed.
A write request may be received by any one of the storage devices 102 of the storage system 100, such as from any of the servers 106. The storage device 102 that receives the request preferably acts as the coordinator, even if it is not a designated device for storing the requested block of data. In an alternate embodiment, that device may forward the request to one of the devices 102 that is so designated, which then acts as coordinator for the write request. Similarly to the read operation, any of the designated devices may receive the write request; however, the device that receives the request then acts as coordinator for the request.
Each of the three vertical lines 402, 404 and 406 in
In the example of
In response to the write request message 408, the coordinator forwards a new timestamp value, newTS, of “8” as a new value for the logTS timestamps to the other two storage devices via messages 410 and 412. This new timestamp value is preferably representative of the current time at which the write request is initiated. As shown in
Then, in response to the messages 410 and 412, each of the other designated storage devices compares the current value of its local logTS timestamp and the value of its local valTS timestamp to the newTS timestamp value received from the coordinator storage device. If both the local logTS timestamp and the local valTS timestamp are lower than the newTS timestamp received from the coordinator device, this indicates that there is not currently another pending or completed write operation that has a later logTS timestamp. In this case, the storage device updates its local logTS timestamp to the new value and returns an affirmative or “yes” response message to the coordinator.
Otherwise, if there is a more recent write operation in progress, the storage device responds with a negative or “no” response. If a majority of the designated devices have a higher value for either of their timestamps, this indicates that the current write operation should be aborted in favor of the later one since the data for the later write operation is likely more up-to-date. In this case, the coordinator receives a majority of “no” responses and the current write operation is aborted. The coordinator may then retry the operation using a new (later) timestamp.
The above may be represented by the following expression:
Referring to the example of
At this point, the prewrite phase is complete and all three of the designated storage devices are initialized to perform the second phase of the write operation, though this second phase can proceed with a majority of the devices. Thus, in the example, the second phase could proceed even if one of the designated devices had returned a “no” response.
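The prewrite phase described above can be sketched as follows. The `Replica` class, integer timestamps, and the treatment of the coordinator as one of the replicas are assumptions made for illustration.

```python
class Replica:
    """Hypothetical per-block state held by one designated device."""

    def __init__(self, val, val_ts, log_ts):
        self.val = val
        self.val_ts = val_ts
        self.log_ts = log_ts

    def prewrite(self, new_ts):
        """Accept the pending write only if no later write is known."""
        if self.log_ts < new_ts and self.val_ts < new_ts:
            self.log_ts = new_ts
            return True
        return False


def prewrite_phase(replicas, new_ts):
    """Run the prewrite on every replica; succeed on a majority of yes votes."""
    yes = sum(r.prewrite(new_ts) for r in replicas)
    return yes >= len(replicas) // 2 + 1
```

With three replicas, the phase still succeeds when one replica votes no, which corresponds to the observation that the second phase can proceed with a majority of the devices.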
To perform the second phase, the coordinator device sends a message type “Write” indicating the second phase of the write operation that includes the new version of the data and the timestamp newTS to each of the other designated devices. These messages are shown in
Then, in response to the messages 418 and 420, each of the other designated storage devices preferably compares the current value of its local logTS timestamp and the value of its local valTS timestamp to the newTS timestamp value received in the “Write” message from the coordinator storage device. This comparison ensures that there is not currently another pending or completed write operation that has a later logTS timestamp, as may occur if another write operation was initiated before the completion of the current operation.
More particularly, if the local valTS timestamp is lower than the newTS timestamp received from the coordinator device and the local logTS timestamp is less than or equal to the newTS timestamp, this indicates that there is not currently another pending or completed write operation that has a later timestamp. In this case, the storage device updates the data to the new value. In addition, the storage device preferably updates its local valTS timestamp to the value of the newTS timestamp and returns an affirmative or “yes” response message to the coordinator.
Otherwise, if there is a more recent write operation in progress, the storage device responds with a “no” response. If the coordinator receives a majority of “no” responses, the current write operation is aborted.
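The acceptance test that each device applies in the write phase, as described in the preceding paragraphs, might look like the sketch below. The dictionary layout and integer timestamps are assumptions.

```python
def apply_write(state, new_val, new_ts):
    """Apply a write-phase message to one device's block state.

    state: dict with keys "val", "valTS", "logTS". Returns True (a yes
    reply) if the data and valTS were updated, False (a no reply) otherwise.
    """
    if state["valTS"] < new_ts and state["logTS"] <= new_ts:
        state["val"] = new_val
        state["valTS"] = new_ts
        return True
    return False
```

The logTS comparison allows equality because a device that accepted the prewrite has already set its logTS to newTS.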
The above may be represented by the following expression:
Referring to the example of
In addition, once the coordinator has determined that a majority of the storage devices have returned a “yes” answer for the second phase of the write operation, the coordinator sends a reply message to the requestor. As shown in
In another aspect, the invention provides a technique for performing repair operations. Assume that a write operation is unsuccessful because the coordinator device for the write operation experienced a fault after sending a prewrite message, but before completing the write operation. In this case, the storage devices designated for storing the data (e.g., a block) for which the unsuccessful write operation had been attempted will have a logTS timestamp that is higher than the valTS timestamp of the coordinator. In another example, a communication error may have prevented a storage device from receiving the prewrite and write messages for a write operation. In this case, that storage device will have a different valTS timestamp for this block of data from that of the other storage devices designated to store that block of data. In either case, when a read operation is requested for the data, the coordinator device for the read operation will detect these faults when the devices return a “no” reply in response to the read messages sent by the coordinator. In this case, the coordinator that detects this fault may initiate a repair operation to return the data block to consistency among the devices designated to store the block. Because repair operations are performed only when an attempt is made to read the data, this aspect of the present invention avoids unnecessary operations, such as repairing data that is not thereafter needed.
In sum, the repair operation is performed in two phases. In an initialization phase, a coordinator for the repair operation determines which of the designated devices has the newest version of the data block. In a second phase, the coordinator writes the newest version of the data to the devices. The timestamps for the block at the designated devices are updated as well.
Each of the three vertical lines 502, 504 and 506 in
In the example of
The repair operation may be initiated when the coordinator device detects a failed read operation. Referring to
In response to the repair initiation messages 508 and 510, each of the other designated storage devices compares the current value of its local logTS timestamp and the value of its local valTS timestamp to the new timestamp value newTS received from the coordinator storage device. If both the local logTS timestamp and the local valTS timestamp are lower than the newTS timestamp received from the coordinator device, this indicates that there is not currently a pending or completed write operation that has a later timestamp. In this case, the storage device updates its local logTS timestamp to the value of the newTS timestamp and returns an affirmative or “yes” response message to the coordinator. In addition, each storage device returns the current version of the data block to be corrected and its valTS timestamp.
Otherwise, if there is a more recent write operation in progress, the storage device responds with a negative or “no” response. If a majority of the designated devices have a higher value for either of their timestamps, this indicates that the repair operation should be aborted in favor of the later-occurring write operation since the data for the later write operation is likely more up-to-date. In this case, the coordinator receives a majority of “no” responses and the current repair operation is aborted (though the original read operation may be retried).
The above may be represented by the following expression:
Thus, as shown in
The coordinator then determines which storage device has the most-current version of the data. This is preferably accomplished by the coordinator comparing the valTS timestamps received from the other devices, as well as its own, to determine which valTS timestamp is the most recent. The coordinator then initiates a write operation in which the most recent version of the data replaces any inconsistent versions. In the example, the most recent valTS timestamp is “5,” which is the valTS timestamp of the coordinator and the third storage device. The second device has an older timestamp of “4” and a different version of the data, “x.” The version of the data associated with the valTS timestamp of “5” is “v.” Accordingly, the version “v” is preferably selected by the coordinator to replace the version “x” at the second storage device.
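Selecting the most recent version reduces to taking the maximum over the reported valTS timestamps. The sketch below uses the values from the example (timestamps 5, 4 and 5 with data “v”, “x” and “v”); the function name and tuple layout are hypothetical.

```python
def newest_version(replies):
    """Pick the (valTS, value) pair with the most recent valTS.

    replies: list of (valTS, value) pairs, including the coordinator's own.
    """
    return max(replies, key=lambda reply: reply[0])


# Coordinator, second device, third device from the example.
replies = [(5, "v"), (4, "x"), (5, "v")]
selected = newest_version(replies)
```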
The write operation is accomplished by the coordinator device sending a message type “Write” that includes the new version of the data and the timestamp newTS to each of the other designated devices. These messages are shown in
Then, similarly to the second phase of the write operation of
Referring to the example of
Once the coordinator has determined that a majority of the storage devices have returned a “yes” answer for the second phase of the repair operation, the coordinator may send a reply message 524 to the requester that includes the data value “v.” This reply is preferably sent where the repair operation was initiated in response to a failed read operation. The reply 524 thus returns the data requested by the read operation. As shown in
Assume that two timestamps, valTS and logTS, are associated with each block of data and that each of these timestamps is 12 bytes long. As mentioned, each timestamp may include a value that indicates the current time and an identifier that is unique to each device 102 so as to avoid identical timestamps from being generated. Assume also that each data block is 1 KB (1 kilobyte) and that the storage system 100 of
Thus, in accordance with an aspect of the invention, techniques are provided for managing the timestamps so as to reduce the required storage capacity for them. More particularly, for the read, write and repair operations described above, it can be noted that the timestamps are used to disambiguate concurrent updates to the data (as in the case of logTS) and to detect and repair results of failures (as in the case of valTS). Thus, in one aspect, where all of the replicas of a data block are functional, timestamps may be discarded after each device 102 holding a replica of the data has acknowledged an update. Thus, for write and repair operations, a third phase may be performed in which the coordinator instructs the designated devices to discard the timestamps for a data block after all of the designated devices have replied. Alternately, each device 102 may determine whether its valTS timestamp is equal to its logTS timestamp and, if so, it can delete one of them (e.g., the logTS timestamp).
Thus, each storage device 102 need only maintain timestamps for data blocks that are actively being updated. If a failure affects one or more of the replicas, the other devices 102 maintain their timestamps for the data until the data is repaired or failure is otherwise taken care of (e.g., the system 100 is reconfigured).
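Both ways of discarding timestamps can be sketched briefly. The table layout (a dictionary keyed by block) and the function names are assumptions made for illustration.

```python
def discard_if_fully_acked(timestamps, block, acks, m):
    """Third-phase rule: drop a block's timestamps once all M devices replied."""
    if acks == m:
        timestamps.pop(block, None)
        return True
    return False


def drop_redundant_log_ts(timestamps):
    """Local rule: when valTS equals logTS, keep only valTS."""
    for entry in timestamps.values():
        if entry.get("logTS") == entry["valTS"]:
            del entry["logTS"]
```

Under the first rule, a block whose update was acknowledged by only some devices keeps its timestamps, so a later read can still detect the inconsistency.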
In another aspect, because a single write request typically updates multiple data blocks, each of these data blocks will have the same timestamp. Accordingly, timestamps may be maintained for ranges of data blocks, rather than for each data block. For example, if eight contiguous data blocks “Block1” through “Block8” are updated by the same write request, a single timestamp entry may be maintained for all eight blocks rather than maintaining eight timestamp entries, one for each block. The timestamps may be maintained as entries in a data structure. Each entry may have the following form:
[start, end, timestamp(s)] (5)
where start identifies the beginning of the range, end identifies the end of the range and timestamp(s) applies to all of the blocks of the range. In the example, a single entry for two timestamps may take the form:
[Block1, Block9, valTS1-9, logTS1-9]. (6)
In this case, a single data structure may be maintained for both the valTS timestamp and the logTS timestamp. Alternately, two entries may be maintained, one for each of the two timestamps. In this case, two data structures may be maintained, one for each of two timestamps. In the example, the two entries may take the form:
[Block1, Block9, valTS1-9] (7)
and
[Block1, Block9, logTS1-9]. (8)
Note that the end of the range in the exemplary entries above is identified by the next block after the eight blocks that are within the range. Thus, entry (6) above includes “Block9” which signifies the ninth block, whereas, only eight blocks are within the range for the associated timestamps. An alternate convention may be employed, such as where the end included in the entry is the last block within the range. For example, entry (6) above would instead take the form:
[Block1, Block8, valTS1-8, logTS1-8]. (9)
where “Block8” signifies the eighth block which is the last block in the range.
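The two range conventions can be compared in a small sketch. Block numbers are represented as integers and the `RangeEntry` type is hypothetical; the entry below corresponds to the form in which the end is the next block after the range.

```python
from typing import NamedTuple


class RangeEntry(NamedTuple):
    """Timestamp entry for blocks start .. end-1 (end is exclusive)."""
    start: int
    end: int
    val_ts: int
    log_ts: int


def covers(entry, block):
    """True if the block falls within the entry's range."""
    return entry.start <= block < entry.end


def to_inclusive(entry):
    """Convert to the alternate convention in which end is the last block."""
    return (entry.start, entry.end - 1, entry.val_ts, entry.log_ts)


entry = RangeEntry(start=1, end=9, val_ts=8, log_ts=8)  # Block1 .. Block8
```

With the exclusive convention, the number of blocks in a range is simply `end - start`, and two adjacent ranges share a boundary value without overlapping.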
In a preferred embodiment, the timestamp entries above are maintained in an interval tree-like data structure, particularly, a B-Tree data structure.
A data structure 600 is preferably maintained by each storage device 102 for maintaining timestamps for data blocks stored by the storage device 102. The data structure 600 is preferably stored in NV-RAM 116 (
The following operations may be used for manipulating the data structure 600:
find-largest (base): given a value for base, an entry is returned having the largest key in the data structure such that key ≤ base. If no such entry is present in the data structure, the operation may return the entry having the smallest key larger than base. In accordance with the present invention, start may be used as the base for this operation to locate timestamp entries having an equal start or a next lowest start and, if no such entry is in the data structure, to locate a timestamp entry having a next highest start. Such entries may potentially overlap a new entry to be inserted into the data structure. If no entries are stored in the data structure, this operation preferably returns an end-of-list indicator.
find-next (base): given a value for base, an entry is returned where the key is the smallest key such that key > base. In accordance with the present invention, start may be used as the base for this operation to locate timestamp entries having a next highest start. If no such entry is present in the data structure, this operation preferably returns an end-of-list indicator.
insert (entry): an entry is inserted in the data structure at a location identified by a key. In accordance with the present invention, this operation may be used to insert an entry of the form [start, end, timestamp] into the data structure.
replace (entry): an entry identified by a key is replaced with entry. In accordance with the present invention, this operation may be used to replace an entry of the form [start, end, timestamp] with an entry having a different end and/or timestamp.
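The four operations listed above can be illustrated with a sorted list searched by bisection. This is a deliberately simplified stand-in for the B-Tree mentioned earlier, and the class name, the tuple layout, and the use of None as the end-of-list indicator are assumptions.

```python
import bisect

END_OF_LIST = None  # assumed end-of-list indicator


class TimestampIndex:
    """Entries of the form (start, end, timestamp), keyed by start."""

    def __init__(self):
        self._keys = []      # sorted list of start values
        self._entries = {}   # start -> entry

    def find_largest(self, base):
        """Entry with the largest key <= base, else the smallest key > base."""
        if not self._keys:
            return END_OF_LIST
        i = bisect.bisect_right(self._keys, base)
        key = self._keys[i - 1] if i > 0 else self._keys[0]
        return self._entries[key]

    def find_next(self, base):
        """Entry with the smallest key > base, or the end-of-list indicator."""
        i = bisect.bisect_right(self._keys, base)
        if i == len(self._keys):
            return END_OF_LIST
        return self._entries[self._keys[i]]

    def insert(self, entry):
        """Insert an entry; an existing entry with the same start is overwritten."""
        if entry[0] not in self._entries:
            bisect.insort(self._keys, entry[0])
        self._entries[entry[0]] = entry

    def replace(self, entry):
        """Replace the entry that has the same start as the given entry."""
        if entry[0] not in self._entries:
            raise KeyError(entry[0])
        self._entries[entry[0]] = entry
```

A sorted list makes insertion linear in the number of entries, which is acceptable for a sketch; a balanced tree keeps all four operations logarithmic.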
When a write or repair operation is performed, the timestamps for a range of data blocks will generally need to be updated in the data structure 600 to keep the data structure 600 current. The method 700 is preferably performed each time a timestamp is to be updated for a data block or a range of data blocks. For example, the initialization phase for a write operation, as described above in reference to
In a step 704, a find-largest(base) operation may be performed using start from the new entry generated in step 702 as the base. As mentioned, the find-largest(base) operation locates an entry in the data structure having an equal start or a next lowest start and, if no such entry is in the data structure, the operation locates a timestamp entry having a next highest start. Where an entry is located in step 704, it is referred to herein as the “current entry” and may be given as: [cur_start, cur_end, timestamp(s)].
In step 706, a determination may be made as to whether the current entry is the last entry in the data structure 600. This determination may be accomplished, for example, by checking if the current start (i.e. “cur_start” or “CS”) is associated with an end-of-list indicator for the data structure 600. If so, this indicates a stopping condition for the method has been reached. This stopping condition may occur during a first pass through the step 706 if the data structure 600 initially has no entries. In this case, the find-largest(base) operation will return the end-of-list indicator. Otherwise, this stopping condition may occur in a subsequent pass through the step 706, in which case, program flow may terminate in a step 748.
In addition, in step 706, a determination may be made as to whether start for the entry to be added to the data structure 600 is smaller than end for the entry to be added. This will generally be the case for a first pass through the step 706. However, in a subsequent pass through the step 706, insertions or replacement operations performed in accordance with other steps of the method 700 may have reduced the range 802 such that start is equal to end (i.e. all data blocks have been processed and added to the data structure).
If, in a first pass through the step 706, the data structure 600 initially has no entries, program flow moves to a step 708. In step 708 the new entry [start, end, timestamp(s)] is inserted into the tree. This may be accomplished using the insert(entry) operation. Program flow may then terminate in step 710.
However, if in a first pass through the step 706, the data structure does have one or more entries, program flow moves to a step 712. In step 712, a determination is made as to whether cur_start is greater than start.
If the condition of step 714 is not satisfied, then the relationship between the ranges 802 and 804 is as shown in
Recall that in step 712, a determination was made as to whether cur_start was greater than start. If this condition is not satisfied, the relationship between the ranges 802 and 804 may be shown as in one of
Recall that in step 724, a determination was made as to whether cur_start is equal to start. If this condition is not satisfied, the relationship between the ranges 802 and 804 may be shown as in
Recall that in step 734 a determination was made as to whether cur_end is less than or equal to start. If this condition is satisfied, the relationship between the ranges 802 and 804 may be shown as in
Recall also that in step 726, a determination was made as to whether end is greater than or equal to cur_end. If this condition is satisfied, the ranges 802 and 804 may be shown as in
Recall also that in step 736, a determination was made as to whether cur_end is less than or equal to start. If this condition is satisfied, the ranges 802 and 804 may be as shown in
This process continues until the program terminates in one of the end states 710, 718, 732, 744 or 748. In sum, the method of
Thus, techniques have been described for managing timestamps in a computer system having multiple storage devices for storing redundant data.
It may be desired to assign data to storage devices such as the devices 102 of
Initially, the data stores to be assigned to the system 100 are broken into smaller elements. For example, in step 802, the data stores to be assigned to the system 100 may each be divided into a plurality of contiguous pieces, referred to as “segments.” Each segment may be of a predetermined data capacity, such as 8 gigabytes per segment, though it will be apparent that another capacity or different capacities may be selected.
Then, in step 804, the segments may be arranged in groups, where each group includes a plurality of segments. The groups may each include a predetermined number of segments, such as 128 segments per group, though it will be apparent that another number or different numbers of segments may be assigned to each group.
In step 804, the segments may be grouped sequentially, according to their positions within the stores. Alternatively, the segments may be assigned to groups based on load balancing considerations. For example, an expected data throughput (i.e. total accesses per unit time) may be known for each store. It may be assumed that each segment in the store will have a throughput that is proportional to the segment's share of the capacity of the store. The segments may then be assigned to the groups, such that each group is expected to have a throughput that is equal to that of the other groups.
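By way of illustration only, the segmenting and grouping of steps 802 and 804 may be sketched as follows. The names Segment, split_store, group_sequentially and group_by_throughput are hypothetical and do not appear in the description above. The greedy "heaviest segment to lightest group" heuristic is likewise an assumption; it is only one of many ways in which groups of approximately equal expected throughput might be formed.

```python
from dataclasses import dataclass

SEGMENT_GB = 8            # example segment capacity given in the text
SEGMENTS_PER_GROUP = 128  # example group size given in the text


@dataclass(frozen=True)
class Segment:
    store: str         # hypothetical identifier of the originating store
    index: int         # position of the segment within its store
    capacity_gb: int   # capacity of this segment
    throughput: float  # expected accesses per unit time, prorated by capacity


def split_store(store, capacity_gb, throughput, segment_gb=SEGMENT_GB):
    """Step 802: divide one store into contiguous segments."""
    segments, offset, index = [], 0, 0
    while offset < capacity_gb:
        size = min(segment_gb, capacity_gb - offset)
        # Assumption from the text: a segment's throughput is its
        # capacity share of the throughput of the whole store.
        share = throughput * size / capacity_gb
        segments.append(Segment(store, index, size, share))
        offset += size
        index += 1
    return segments


def group_sequentially(segments, per_group=SEGMENTS_PER_GROUP):
    """Step 804, first alternative: group by position within the stores."""
    return [segments[i:i + per_group]
            for i in range(0, len(segments), per_group)]


def group_by_throughput(segments, num_groups):
    """Step 804, second alternative (assumed greedy heuristic): place each
    segment, heaviest first, into the group with the lowest total so far."""
    groups = [[] for _ in range(num_groups)]
    loads = [0.0] * num_groups
    for seg in sorted(segments, key=lambda s: s.throughput, reverse=True):
        lightest = loads.index(min(loads))
        groups[lightest].append(seg)
        loads[lightest] += seg.throughput
    return groups
```

For example, split_store("s", 20, 100.0) would yield segments of 8, 8 and 4 gigabytes, the last segment being smaller because the store capacity is not a multiple of the segment capacity.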
In step 806, a group is selected. A group may be selected in sequential order or randomly (“random” selection, as referred to herein, also encompasses pseudo-random selection). In step 808, storage device nodes 102 (
Preferably, all such devices 102 that meet the capacity requirement, and possibly additional requirements, are identified in step 808.
As explained herein, data is stored redundantly in the system 100. For example, three or more replicas of each data block are stored in the system 100, where the number of replicas is equal to M. In step 810, a determination is made as to whether at least M nodes 102 were identified in step 808 as able to accommodate copies of the group selected in step 806. If not, this means that the required number M of replicas of the data for the group cannot be assigned to different nodes 102 in the system 100 unless the group is made smaller. The groups are generally divisible into smaller groups because each includes a plurality of segments. Thus, if the determination of step 810 is negative, step 804 may be repeated by further dividing the group so that the resulting group has lower requirements than previously. This may be accomplished by dividing the group into two or more smaller groups or by reassigning one or more of the segments of the group to a different group. Then, in a next pass through the step 808, it can be expected that there will be more nodes 102 that can accommodate the group than previously. This process is repeated until at least M nodes 102 are found that can accommodate the group.
Then, in step 812, the group is assigned to M nodes 102. If more than M nodes were identified in step 808, a subset of the identified nodes 102 is selected in step 812 for the group. This selection is preferably performed randomly. By performing this selection randomly for all of the groups, it is expected that the assignments of all of the groups will be balanced across the devices 102, reducing the incidence of “hotspots” in which storage operations are concentrated at a small number of the devices 102.
Once the group has been assigned, an entry into a data locations table is preferably made for keeping track of the assignments of the data stores to the nodes 102.
As shown in
In step 814, a determination is made as to whether all of the groups have been assigned to the system 100. If not, the process described above is repeated by returning to step 806 in which a next group of segments is selected. Nodes are then identified for accommodating this next group in step 808 and when at least M nodes are identified in step 810, this group is assigned in step 812 to selected nodes 102 of the system 100. Once all of the groups have been assigned in this way, program flow may terminate in a step 816.
Thus, a technique has been described for assigning data to storage device nodes 102 in the system 100. In sum, this technique involves qualifying nodes 102 to determine whether they are able to accommodate a collection of data (e.g., a group of segments), and, then, randomly selecting from among those nodes 102 that are qualified. This technique combines aspects of a deterministic assignment (by qualifying the nodes) and random assignment (by randomly selecting from among qualified nodes). The deterministic aspect ensures that the nodes are appropriately qualified for an assignment before the assignment is made, which avoids potentially having to reassign data. As mentioned, the random aspect is expected to result in a balanced assignment. This aspect of the present invention thus contrasts with prior techniques that are either purely deterministic or purely random.
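By way of illustration only, the qualify-then-randomly-select technique of steps 806 through 816 may be sketched as follows. The sketch assumes that storage capacity is the only requirement, that a group is represented as a list of segment sizes, and that a group is split into two halves when fewer than M nodes qualify. The function names, the dictionary of free capacities and the list used as a data locations table are hypothetical.

```python
import random


def qualified_nodes(free, requirement):
    """Step 808: identify every node able to accommodate the requirement.
    Assumption: the requirement is storage capacity only."""
    return [node for node, cap in free.items() if cap >= requirement]


def assign_groups(groups, free, m, rng=None):
    """Assign each group to m nodes.

    groups: list of groups, each a list of segment sizes (assumed shape).
    free:   dict mapping node name to free capacity; updated in place.
    Returns a list of (group, chosen_nodes) pairs, standing in for the
    data locations table.
    """
    rng = rng or random.Random()
    pending = list(groups)
    table = []
    while pending:                                # step 814: groups remain
        group = pending.pop(0)                    # step 806: sequential order
        need = sum(group)
        candidates = qualified_nodes(free, need)  # step 808
        if len(candidates) < m:                   # step 810: too few nodes
            if len(group) == 1:
                raise RuntimeError("a single segment cannot be placed")
            mid = len(group) // 2                 # assumed split: two halves
            pending[:0] = [group[:mid], group[mid:]]
            continue
        chosen = rng.sample(candidates, m)        # step 812: random subset
        for node in chosen:
            free[node] -= need
        table.append((group, chosen))
    return table
```

The qualification in qualified_nodes corresponds to the deterministic aspect of the technique, and the call to rng.sample corresponds to the random aspect. Because the choice is random, a later group may fail to find M nodes even though a different earlier choice would have left room; the sketch reports this case by raising an error rather than reassigning data.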
In step 1002, a storage device node 102 is newly added to the system 100 of
In step 1008, a group of segments assigned to the existing node selected in step 1006 is selected and reassigned to the newly-added node. This may be accomplished, for example, by selecting the largest group assigned to the existing node, though the group may be selected based on another criterion, such as the group having the highest one or more performance requirements, such as throughput. The group may be selected in step 1008 based on availability of storage capacity or of other performance parameters at the newly-added node. For example, if the newly-added node has 50 units of storage capacity, a group that requires less than 50 units of capacity is selected in step 1006. In addition, the table 900 (
Then, in step 1010, a determination is made as to whether the newly-added node is now sufficiently loaded. For example, the amount of loading determined for each existing node in step 1004 (e.g., capacity utilization or utilization for a combination of parameters) may be determined for the newly-added node. This loading may then be compared to an average (e.g., a statistical mean or median) loading for all the other nodes and if the loading of the newly-added node is at least as great as the average loading, then the newly-added node may be considered sufficiently loaded in step 1010. It will be apparent, however, that the sufficiency of loading of the newly-added node may be determined in other ways. For example, its loading may be compared to a range bounded by the lowest and highest loading of the existing nodes such that its loading is considered sufficient if it falls within this range.
Preferably, the loading of the existing nodes is determined taking into account the reassignment of groups to the newly-added node. Thus, where a group is reassigned from an existing node, its loading will generally be reduced. To take this reduced loading into account, the loading for this node may then be recomputed.
If the loading for the newly-added node determined in step 1010 is based on parameters other than storage capacity, the newly-added node will also be considered sufficiently loaded if the storage capacity required for the data assigned to it exceeds a predetermined portion (e.g., ninety percent) of its total storage capacity. For example, if the throughput utilization of the newly-added node is lower than any of the existing nodes, but its storage capacity utilization is over ninety percent, the node will be considered sufficiently loaded.
If the newly-added node is determined in step 1010 to be not sufficiently loaded, the steps of identifying a heavily-loaded node (step 1004), selecting data at the heavily-loaded node (step 1006) and reassigning the selected data (step 1008) are repeated until the newly-added node is sufficiently loaded. Because the reduced loading of any node from which a group has been reassigned is preferably taken into account after the group has been reassigned to the newly-added node, the existing node identified in each pass through the step 1004 will generally be different from the node identified in the prior pass through the step 1004.
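By way of illustration only, the sufficiency determination of step 1010 may be sketched as follows. Loadings are assumed to be expressed as fractions between 0 and 1, and the function and parameter names are hypothetical.

```python
from statistics import mean, median


def is_sufficiently_loaded(new_load, existing_loads,
                           capacity_used=0.0, capacity_threshold=0.9,
                           average=mean):
    """Step 1010: compare the new node's loading with an average of the
    loadings of the existing nodes.

    new_load:       loading of the newly-added node (assumed fraction 0..1)
    existing_loads: loadings of the existing nodes, measured the same way
    capacity_used:  storage capacity utilization of the newly-added node
    average:        statistics.mean or statistics.median
    """
    # Capacity override: a node whose storage is nearly full is treated
    # as sufficiently loaded regardless of its other parameters.
    if capacity_used > capacity_threshold:
        return True
    return new_load >= average(existing_loads)


def within_existing_range(new_load, existing_loads):
    """Alternative test: loading falls between the lowest and highest
    loadings of the existing nodes."""
    return min(existing_loads) <= new_load <= max(existing_loads)
```

Passing average=median selects the median in place of the mean. Where the loading is itself based on storage capacity, capacity_used may simply be left at its default.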
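By way of illustration only, the repeated identification, selection and reassignment described above may be sketched as follows. The sketch assumes that loading is measured as storage capacity utilization, that the newly-added node is sufficiently loaded once its loading reaches the mean loading of the existing nodes, and that at least one existing node is present. The data shapes and the name rebalance_onto_new_node are hypothetical.

```python
def rebalance_onto_new_node(assignments, capacity, new_node):
    """Move groups from heavily loaded nodes onto a newly-added node.

    assignments: dict node -> dict of group id -> required capacity
                 (assumed shape); updated in place.
    capacity:    dict node -> total storage capacity.
    Returns the list of (group id, donor node) moves that were made.
    """
    def load(node):
        # Assumption: loading is storage capacity utilization.
        return sum(assignments[node].values()) / capacity[node]

    assignments.setdefault(new_node, {})
    existing = [n for n in assignments if n != new_node]
    moved = []
    while True:
        # Loadings are recomputed on every pass, so a donor's reduced
        # loading is taken into account after each reassignment.
        average = sum(load(n) for n in existing) / len(existing)
        if load(new_node) >= average:
            break                      # sufficiently loaded
        free = capacity[new_node] - sum(assignments[new_node].values())
        # Visit the most heavily loaded existing node first.
        for donor in sorted(existing, key=load, reverse=True):
            fitting = {g: size for g, size in assignments[donor].items()
                       if size <= free and g not in assignments[new_node]}
            if fitting:
                break
        else:
            break                      # no remaining group fits
        group = max(fitting, key=fitting.get)  # largest group that fits
        assignments[new_node][group] = assignments[donor].pop(group)
        moved.append((group, donor))
    return moved
```

The check that a group is not already present at the newly-added node is an added assumption, intended to avoid placing two replicas of the same group on one node.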
Once the newly-added node is sufficiently loaded, the method 1000 of
At some point, it may be desired to remove data from a node in the system 100. For example, a node may develop a fault or may become obsolete over time and, thus, the node may need to be taken out of service or removed.
In a step 1102, a node 102 existing in the system 100 is selected for removal. In step 1104, a group of segments stored at the node selected in step 1102 is selected for reassignment to another, existing node. Then, in step 1106, storage device nodes 102 that are able to accommodate the selected group are identified. Similarly to step 808 of
In step 1108, a determination is made as to whether at least one node was identified in step 1106. If not, this means that the data for the group cannot be assigned to an existing node 102 in the system 100 unless the group is made smaller. As mentioned, the groups are generally divisible into smaller groups because each includes a plurality of segments. Thus, if the determination of step 1108 is negative, the group may then be split into two or more smaller groups in step 1110 so that the resulting groups have lower requirements than previously. Then, in a next pass through the step 1106 for each of these smaller groups, it can be expected that there will be more nodes 102 that can accommodate the group than previously. This process is repeated until at least one node is found that can accommodate the group.
Then, in step 1112, if one node was identified in step 1106, the group is moved to the identified node. If more than one node was identified, one of the nodes is selected from among those identified. Similarly to step 812 of the method 900, this selection is preferably performed randomly. In addition, the table 900 (
In step 1114, a determination is made as to whether all of the groups at the node to be removed have been reassigned. If any groups remain, the steps of selecting a group (step 1104), identifying which nodes 102 can accommodate the group (step 1106), splitting the group if necessary (step 1110) and reassigning the group (step 1112) may then be repeated until all of the groups have been reassigned.
Once all of the groups have been reassigned, the node may be removed in step 1116 if desired. Program flow may then terminate in a step 1118. Thus, a technique has been described for removing data from a storage device node in the system 100 and reassigning data from the node to existing nodes.
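By way of illustration only, the removal technique of steps 1102 through 1118 may be sketched as follows. The sketch assumes that storage capacity is the only requirement, that a group is a list of segment sizes, and that a group is split into two halves when no node can accommodate it. For simplicity it does not check whether the receiving node already holds another replica of the same group. The name drain_node and the data shapes are hypothetical.

```python
import random


def drain_node(assignments, free, leaving, rng=None):
    """Reassign every group held by `leaving` to the remaining nodes.

    assignments: dict node -> list of groups, each group a list of
                 segment sizes (assumed shape); updated in place.
    free:        dict node -> free capacity; updated in place.
    """
    rng = rng or random.Random()
    pending = list(assignments.pop(leaving))   # step 1102: node selected
    free.pop(leaving, None)
    while pending:                             # step 1114: groups remain
        group = pending.pop()                  # step 1104: select a group
        need = sum(group)
        candidates = [n for n, cap in free.items()
                      if cap >= need]          # step 1106
        if not candidates:                     # step 1108: no node fits
            if len(group) == 1:
                raise RuntimeError("a single segment cannot be placed")
            mid = len(group) // 2              # step 1110: assumed halving
            pending += [group[:mid], group[mid:]]
            continue
        target = rng.choice(candidates)        # step 1112: random choice
        assignments[target].append(group)
        free[target] -= need
    return assignments
```

Once drain_node returns, no group remains assigned to the departing node, and the node may then be taken out of service.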
It will be apparent that modifications may be made to the techniques for data assignment described herein. For example, as described, selecting the set of M nodes in
While the foregoing has been with reference to particular embodiments of the invention, it will be appreciated by those skilled in the art that changes in these embodiments may be made without departing from the principles and spirit of the invention, the scope of which is defined by the following claims.
This application is a division of prior U.S. application Ser. No. 10/440,570, filed May 16, 2003, now abandoned. This application is related to U.S. application Ser. No. 10/440,531 (now U.S. Pat. No. 7,152,077), and Ser. No. 10/440,548 (U.S. Publication No. 2004/0230624), filed on May 16, 2003, the contents of which are hereby incorporated by reference.
Number | Date | Country |
---|---|---|
20080046779 A1 | Feb 2008 | US |

Relation | Number | Date | Country |
---|---|---|---|
Parent | 10440570 | May 2003 | US |
Child | 11827973 | | US |