This invention relates, in general, to distributed communications environments, and in particular, to reconciling independently updated distributed data of distributed communications environments.
Distributed communications environments include highly available, scalable systems that are utilized in various situations, including those situations that require a high-throughput of work or continuous or nearly continuous availability of the systems.
One example of a distributed environment is a clustered environment having one or more clusters. A cluster includes, for instance, a plurality of operating system instances that share resources and collaborate with each other to perform system tasks. In a clustered environment, information is often replicated, so that identical information is available on all members of the cluster. Maintaining the consistency of this data is difficult, as members may be updated individually or in groups when not all members are present. Further, sundering (i.e., splitting) of such members into subgroups, which are not in communication, requires that updates made on different subgroups be reconciled when the sundering is repaired.
Currently, in order to allow updates, a centralized update log or a centralized data server is provided, or consistency is maintained by only allowing updates when a quorum of members is present. However, clustered members often wish to maintain distributed consistent information without reliance on a centralized store or a primary member. Further, the requirement of quorum is deficient, as it prevents processing when the quorum is not reached or is lost.
Based on the foregoing, a need exists for an enhanced capability for allowing updates and for reconciling independently updated distributed data in the absence of a central store or a quorum requirement.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a method of facilitating reconciliation of data of a distributed communications environment. The method includes, for instance, determining whether one set of distributed data of the distributed communications environment and another set of distributed data of the distributed communications environment are consistent, wherein the determining includes employing one or more locally monotonically increasing values in the determining; and updating at least one of the one set of distributed data and the another set of distributed data, in response to the determining, to reconcile one or more inconsistencies between the one set of distributed data and the another set of distributed data.
In another aspect of the present invention, a method of facilitating reconciliation of data of a clustered communications environment is provided. The method includes, for instance, initiating by a joining member a join to a cluster of the clustered communications environment, the cluster including at least one current member; providing by a current member of the at least one current member a current membership data structure to the joining member; determining by the joining member a set of deltas, the set of deltas including zero or more data inconsistencies between the joining member and the current member, the determining employing at least one locally monotonically increasing value; providing by the joining member the set of deltas and a joining member's membership data structure to the current member; determining by the current member a set of deltas, the set of deltas including zero or more data inconsistencies between the current member and the joining member, the determining employing at least one locally monotonically increasing value; resolving by the current member zero or more conflicts between the joining member's set of deltas and the current member's set of deltas to provide a resolved set of deltas; and providing the resolved set of deltas to one or more members of the cluster, including the joining member.
System and computer program products corresponding to the above-summarized methods are also described and claimed herein.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.
One or more aspects of the present invention are particularly pointed out and distinctly claimed as examples in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
a depicts one example of a cluster of the communications environment of
b depicts one example of replicated information stored on a member and reconciled, in accordance with an aspect of the present invention;
a depicts one embodiment of the logic associated with a joining member determining deltas, in accordance with an aspect of the present invention;
b depicts one embodiment of the logic associated with a current member determining deltas, in accordance with an aspect of the present invention;
In accordance with an aspect of the present invention, a capability is provided for facilitating reconciliation of independently updated distributed data of a communications environment. To reconcile the data, locally monotonically increasing values are employed. Examples of such values include a local counter, a local timestamp that is monotonically increasing, etc. It is assumed that a future value or timestamp does not precede an earlier value or timestamp (i.e., time does not flow backward).
One example of a communications environment incorporating and using one or more aspects of the present invention is depicted in
Communications environment 100 is, for instance, a clustered environment. In a clustered environment, there are one or more clusters formed of cluster members. The cluster members of a cluster share resources and collaborate with each other to perform tasks. Aspects of one example of a clustered environment are described in, for instance, a patent application, entitled “A Method, System And Program Products For Managing A Clustered Computing Environment,” Novaes et al., (Docket No. POU920000004US1), Ser. No. 09/583,677, filed May 31, 2000, which is hereby incorporated herein by reference in its entirety.
As one example, communications environment 100 includes a cluster 200 (
In cluster 200, information is replicated on all the members of the cluster. Thus, each member 202 includes, for instance, a membership table 204 (
One example of membership table 204 is described with reference to
One example of data table 206 is described with reference to
Reconciliation component 208 (
As a further example, reconciliation is employed when members of a cluster are active at disjoint points in time. For instance, assume Nodes 1 and 2 of a cluster of five nodes, Nodes 1-5, are active and then go down. Then, Nodes 3-5 of the cluster are active. When all of the members are active, reconciliation is used to reconcile the updates made by the two subgroups of the cluster, which had not communicated with one another, since updates were made by the separate subgroups.
To become an active member of a cluster, the member joins the cluster. As used herein, the term “join” refers to a member becoming active in the cluster at any time. As examples, it includes the member becoming active in the cluster for the first time since a reset of the system; and/or the member becoming active after a splitting of the cluster (i.e., sundering). Other examples may also exist and are considered to be incorporated within the definition of join and one or more aspects of the present invention.
In one embodiment, multiple members with differing changes may join simultaneously. In this situation, a current member of the cluster determines its changes with regard to all of the joining members.
The reconciliation component includes logic used to reconcile the inconsistent information. For instance, it includes join logic to join the member to the cluster, delta determining logic to determine any changes that are to be reconciled, logic to determine if there are any conflicts to be resolved, replicate logic to replicate the changes to all the members of the cluster (or a subset of the members, in another embodiment), and apply logic to apply the deltas that have been determined. Both data tables and membership tables are reconciled, if necessary. Further details regarding reconciliation are described below with reference to join processing.
A member joins a cluster to become an active member of the cluster. For instance, Member 1 may be a member of Cluster A, but it may not be in communication with other members of the cluster, and therefore, it is an inactive member of the cluster. To become an active member, Member 1 joins the cluster. The member can request to join the cluster any time the member is capable of communicating with the other members of the cluster. Thus, even if the member joins and then fails, it can join again.
During join processing, the member requesting to join the cluster determines what information it has that is inconsistent with the cluster's information. This is referred to as delta processing. Also, the cluster determines whether there is any information that the cluster has that is inconsistent with the joiner. Again, this is referred to as delta processing.
One embodiment of the logic associated with join processing is described with reference to
In response to the join request, a current member of the cluster (e.g., the cluster leader) sends the current membership table to the joining member, STEP 502. The joining member uses the membership table to determine its set of deltas relative to the tables currently being used in the cluster it is joining, STEP 504. For instance, the joining member uses the membership table it received to determine for each row in each data table maintained by the joining member whether the row is a delta (e.g., change) that is to be processed by the existing members.
Deltas are determinable in many ways. In one example, processing is performed, as described with reference to
If the current member was present when the row was last changed, then the row is not a delta, STEP 604. However, if the current member was not present when the row was last changed, then the row is a delta that is sent to the current member, STEP
The current member maintains this delta in a joining member's delta set to be used in further processing, as described below.
Subsequently, a determination is made as to whether there are more rows to be processed, INQUIRY 610. If so, processing returns to STEP 600. Otherwise, delta processing for the joining member is complete, STEP 612.
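The joining member's delta loop can be sketched as follows. This is a minimal illustration rather than the claimed logic: the `Row` type, the function name `joiner_deltas`, and the representation of the membership table as a mapping from member id to LCTS are all assumptions. The presence test is also assumed to be a comparison of the row's CTS against the LCTS that the received membership table records for the row's change leader, with an unknown change leader treated as LCTS 0.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    key: str   # row key, e.g. "RG45" (hypothetical layout)
    data: str  # row payload
    cl: str    # change leader that made the last change
    cts: int   # changed timestamp assigned by that change leader


def joiner_deltas(rows, received_mt):
    """Return the rows that the current member has presumably not seen.

    received_mt maps member id -> LCTS, as taken from the membership table
    sent by the current member. A row counts as a delta when its CTS is
    later than the LCTS recorded for the row's change leader (assumed test).
    """
    deltas = []
    for row in rows:
        known_lcts = received_mt.get(row.cl, 0)  # unknown leader -> 0
        if row.cts > known_lcts:
            deltas.append(row)  # current member was not present for this change
    return deltas


if __name__ == "__main__":
    mt_from_current = {"N1": 10, "N4": 7}
    rows = [Row("RG12", "a", "N1", 10), Row("RG45", "b", "N4", 9)]
    print([r.key for r in joiner_deltas(rows, mt_from_current)])  # ['RG45']
```

In the usage above, RG12 is not a delta because its CTS equals the LCTS recorded for N1, while RG45 carries a CTS later than the LCTS recorded for N4.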
Returning to
In addition to the above, the current member determines whether it has any deltas with regard to the joining member, STEP 508. One embodiment of this logic is described with reference to
Thereafter, or if there was no delta, a determination is made as to whether there are more rows to be processed, INQUIRY 658. If there are more rows to be processed, then processing continues with STEP 650. Otherwise, processing associated with the current member determining its deltas is complete, STEP 660.
Returning to
In one example, to determine if there is a conflict, a row is selected from the current member's delta set, STEP 700 (
However, if the row is in the joining member's delta set, INQUIRY 702, then a further determination is made as to whether the data of that row is the same as the data of the joining member's row, INQUIRY 706. If the data is the same in both rows, then the row is not a delta, and the row can be removed from both of the delta sets, STEP 708.
Returning to INQUIRY 706, if the data in both rows is different, then there is a conflict, and a determination is made as to whether the conflict is resolved by selecting the current member's delta or the joining member's delta, INQUIRY 710. A row conflict can be handled in a number of different ways, including arbitrarily, in which one of the two rows is selected as the current row; or programmatically, in which a program is written that determines the row to be used. The program references, for instance, the data tables on other members to make the decision. In one example, in selecting the row to use, a determination is made as to whether the change leaders of the two rows are the same. If so, a further determination is made as to whether the LCTS of the current member or the joining member is greater. The row associated with the greater LCTS is selected.
If it is determined that the current member's delta row is to be selected, then the row is removed from the joining member's delta set, STEP 714. However, if it is determined that the conflict is to be resolved by selecting the joining member's delta row, then the row is removed from the current member's delta set, STEP 712.
Subsequent to STEPs 704, 708, 712 or 714, a determination is made as to whether there are more rows to be processed, INQUIRY 716. If there are more rows to be processed, processing continues with STEP 700. Otherwise, conflict processing is complete.
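The conflict pass over the two delta sets can be sketched as below. The function name, the dictionary layout (row key mapped to a `(data, cl, cts)` tuple), and the selection policy are assumptions made for illustration: when both rows share a change leader the later timestamp wins, with the row's CTS standing in for the LCTS comparison, and otherwise the current member's row is kept arbitrarily. A row that appears only in the current member's delta set is presumed to stay there unchanged.

```python
def resolve_conflicts(current_deltas, joiner_deltas):
    """Prune two delta sets so that no row key is claimed by both.

    Each argument maps row key -> (data, cl, cts). Returns pruned copies
    as (current, joiner); the inputs are not modified.
    """
    current = dict(current_deltas)
    joiner = dict(joiner_deltas)
    for key in list(current):
        if key not in joiner:
            continue  # only the current member changed this row; keep it
        c_data, c_cl, c_cts = current[key]
        j_data, j_cl, j_cts = joiner[key]
        if c_data == j_data:
            # Same data on both sides: not a delta, drop it from both sets.
            del current[key]
            del joiner[key]
        elif c_cl == j_cl and j_cts > c_cts:
            # Same change leader and the joiner's change is later: joiner wins.
            del current[key]
        else:
            # Assumed fallback policy: keep the current member's row.
            del joiner[key]
    return current, joiner
```

Passing a caller-supplied selection function instead of the hard-coded policy would match the programmatic option described above; the fixed policy is used here only to keep the sketch short.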
Returning to
Referring to
A row is selected from the joining member's delta set, STEP 806, and that row is applied to the current member's tables, STEP 808. Additionally, the change is added to the accumulated delta set, STEP 810.
A determination is then made as to whether there are more rows of the joining member's delta set to be processed, INQUIRY 812. If there are more rows, then processing continues with STEP 806; otherwise, the accumulated delta set is sent to all of the members of the cluster, including the joining and joined members, STEP 814.
Returning to
Initially, a delta row is selected from the accumulated delta set, STEP 900, and a determination is made as to whether the row exists in a table of the member performing this processing, INQUIRY 902. If the row does exist in a table, then a further determination is made as to whether the delta is a deletion, INQUIRY 904. If it is a deletion, then the row is deleted, STEP 906; otherwise, the row is changed, STEP 908. Thereafter, processing continues with INQUIRY 912, as described below.
Returning to INQUIRY 902, if the row does not exist in a table, then a determination is made as to whether the delta is a deletion, INQUIRY 909. If the delta is a deletion and the row does not exist, then no processing of the row is needed, and processing continues with INQUIRY 912, as described below. However, if the row does not exist in the table and it is not a deletion, then the row is added, STEP 910, and processing continues with INQUIRY 912.
At INQUIRY 912, a determination is made as to whether the membership table is to be updated. As one example, a check is made as to whether the delta's change leader timestamp is greater than the last change timestamp of the change leader in the membership table (i.e., CL.CTS>CL.LCTS). If so, then the membership table is updated with the change leader's last changed timestamp, STEP 914. Thereafter, or if the membership table need not be updated, a determination is made as to whether there are more rows of the accumulated delta set to be applied, INQUIRY 916. If there are more rows to be applied, then processing continues with STEP 900; otherwise, the apply delta logic is complete. Thus, a two-phase commit process is used to commit the changes at all the members, thereby completing the join processing, assuming the commit is successful.
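The apply logic above (INQUIRY 902 through INQUIRY 916) can be sketched as a single loop. The names and data layout are assumptions: the data table is a mapping from row key to row data, the membership table is a mapping from member id to LCTS, and each delta is a dictionary with the hypothetical fields `key`, `data`, `cl`, `cts`, and `deleted`. The commit step is not modeled.

```python
def apply_deltas(table, membership, deltas):
    """Apply an accumulated delta set to a local table and membership table.

    table:      dict mapping row key -> row data
    membership: dict mapping member id -> LCTS
    deltas:     iterable of dicts with keys key, data, cl, cts, deleted
    Both table and membership are updated in place.
    """
    for d in deltas:
        key = d["key"]
        if key in table:
            if d["deleted"]:
                del table[key]          # row exists and delta is a deletion
            else:
                table[key] = d["data"]  # row exists and delta is a change
        elif not d["deleted"]:
            table[key] = d["data"]      # row is missing and delta adds it
        # A deletion of a row that does not exist needs no table update.

        # Update the membership table when CL.CTS > CL.LCTS.
        if d["cts"] > membership.get(d["cl"], 0):
            membership[d["cl"]] = d["cts"]
```

For example, applying a change to an existing row with `cl="N1"` and `cts=7` against a membership table holding `{"N1": 5}` replaces the row data and raises the recorded LCTS for N1 to 7.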
As described above, with the join processing, the change leader and changed timestamp information of the data tables combined with the membership table allows the determination of the relative sequencing of table changes. This provides a mechanism for ignoring rows which have been previously incorporated in a merged table.
In one embodiment, the current member replicates the updated tables whether or not the joining member had any deltas. This accounts for a joining member not having a data table or rows in the current member's data tables that are not in the joining member's data tables. If a member's membership table is deleted inadvertently (or maliciously), the join process still works. In this case, the member joins with no membership table. When the member joins a cluster, the join operation effectively causes entire tables to be transferred. This is also the case for a newly installed member freshly added to the cluster. If the member, however, is the first member to come up, the other members determine that their entire tables are deltas. In another embodiment, it can be determined that there is no data in the current member's tables, and instead, the joiner's tables are replicated.
In a further embodiment, there is more than one joining member. In this case, each joining member performs processing, as described herein; however, the current member determines if it has a delta relative to any of the joining members. This introduces an outer loop in the current member figure where the current member's data row is checked against each of the joining members' membership tables in determining the deltas, and a loop to examine each of the joining members' delta sets in determining conflicts and in removing the deltas not chosen after conflict resolution, as well as in removing the deltas from multiple joiners' delta sets when the deltas are the same.
Further, in applying the deltas, prior to changing a row, the member (current or joiner) is to check whether the change applies. Thus, if its membership table (MT) satisfies MT[row.CL].LCTS==row.CTS, the change need not be processed and it can be skipped (as the one selected is from this node). This is done on every member during the processing as any of the current members or joining members may be in this forgetful state.
The steps involved in joining a cluster may vary depending on processing that has taken place, since the joining member has been inactive. Thus, various examples of the processing that takes place to effect a join are described below.
This describes a sunder fix where the subclusters both select a current member to do the processing.
Five members: {N1-N5}. Change Leader=N1.
The network sunder is repaired and Subclusters A and B merge back into a full cluster {N1-N5}, CL=N1. Assume A's master, N1, is the current member. N1 sends its MT to the selected joining member (N4). N4 then computes a delta, which it sends to N1. N1 then determines, for each row in N4's delta, whether there is a conflict with that row.
This is described below:
If the cluster {N1, N2, N3, N4, N5} has fully replicated data and membership tables, and then sunders into subcluster A {N1, N2, N3} and subcluster B {N4, N5}, each side knows the values of LCTS that were valid when the split occurred. Thus, if side A makes a change and defines RG12, that entry would have a CTS value greater than the previous LCTS for the row's CL. If side B (N4, N5) changes a previous entry for RG45, that entry would have a CTS that is also greater than the previous LCTS for the CL.
If the sunder were repaired and the two subclusters were to merge, side A would have a delta set of 1 row (RG12) and side B would have a delta set of 1 row (RG45). The side B delta when applied to side A will replace the row similar to the simple case above.
However, assume that N5 fails before the sunder is repaired. Then, when the sunder is fixed, member N4 would merge into subcluster A, so that the membership was {N1-N4}. The row delta would still be discovered and applied.
If N4 then dies, leaving a membership of {N1-N3} again, and row RG45 is modified again, it would receive an updated CTS. More importantly, the CL value for RG45 would change from N4 to N1. Note that there are now 3 separate sets of row RG45: those of membership {N1-N3}, N4, and N5, and all have distinct CTS values.
If N5 then joins the cluster, N5 will discover that it has no deltas relative to the cluster, as the change to RG45 was applied when N4 joined. The cluster will discover that it has two changes (RG12 and RG45) relative to N5 that have to be applied. This example is further described below.
The network sunders into Subcluster A={N1-N3}, CL=N1 and Subcluster B={N4, N5}, CL=N4.
The MT LCTS for N4 is the same as the CTS for N4 in row RG45, thus the row is not a delta.
Although examples of join processing are described herein, variations to this processing are possible without departing from the spirit of the present invention. As one example, an optimization is provided in which the information processed and/or replicated is kept to a minimum. For instance, a determination is made at an early stage of processing (e.g., at an initial stage of join processing) as to whether there are any deltas to be processed. If there are no deltas, then processing ceases quickly. This is described in further detail with reference to
Referring to
If it is, then the current member may have deltas, STEP 1012. However, if the current member's LCTS is less than or equal to the joining LCTS, then the joining member may have deltas to be processed, STEP 1014. Should there be deltas that may need to be processed, STEPs 1006, 1014, then processing continues with determining if there are any deltas, as described above with reference to
In the above embodiment, if there are no deltas to be processed, then processing is complete. If either the joiner or the current member has deltas to be processed, then processing continues. The current member sends its tables' metadata (e.g., member id, CL, and LCTS) to the joiner, and the joiner sends its tables' metadata to the current member. Each metadata value is compared separately per table. Only those tables which have deltas are processed on either the current member or the joining member.
In the examples described above, it has been assumed that the presence of a member in the cluster indicates that any changes in which that member was the change leader were already merged. However, this would not be the case, if the data was lost on a particular member. For example, if the membership table on a member (e.g., node) was lost in its entirety by, for instance, table deletion or replacement, the member has forgotten the changes that were made when it was present as the change leader. This member is referred to as a forgetful member.
With a forgetful member, metadata is exchanged to allow this condition to be determined. When a change is made to a table, the timestamp for that change is saved as metadata in the table (CL.CTS), and as the metadata for the entire set of tables. Thus, each table records the last timestamp applied, and the membership table records the last timestamp applied to any table. When a member joins, this set of metadata is exchanged with the current cluster, and thus, it is possible to determine if it has forgotten any of its changes. For example, if a table (or tables) has been removed, the metadata timestamp will be exchanged as CL=0, CTS=0. This immediately implies that the tables contained on the cluster members should be utilized without regard to any deltas. If the member is the first one to join the cluster, joining members would detect that the entire data table or tables are to be treated as deltas and exchanged with the current member.
To handle the case where a table was moved or restored from a different member, the table metadata also contains the member identification. Thus, on boot, the member is able to determine if the table was created on itself, and if not, the table is ignored and removed—to be replaced with the current information contained in the cluster of the joining members.
If a member's system time was reset, such that the time values are no longer monotonically increasing, this is discovered as well, so that changes are not made with incorrect timestamps. To handle this, the metadata for the membership table also contains an entry for the LCTS made by the member itself. Thus, on boot, if the system time is less than the metadata LCTS, the value to be used is the LCTS+1 until such time as the system time either is reset to a correct value or the system time catches up with the change counter. This ensures that the values remain monotonically increasing.
If both conditions occur (the system time is reset and the tables are restored to a prior image), some changes for which the member was the change leader may be present in the cluster with a CTS value that is greater than the monotonically determined value in use by the member. In this case, the resolved membership table, which is replicated, includes an LCTS value, which is later than the monotonically increasing value that is in use on the member. When this is detected, the value is reset to the LCTS+1 value contained in the replicated membership table for its entry.
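A minimal sketch of a change timestamp source that stays monotonically increasing even when the system clock has been set back is shown below. The class name `ChangeClock`, the integer-second resolution, and the injectable `now` function are assumptions made for illustration; the only behavior taken from the text is that LCTS+1 is used whenever the system time has not yet passed the recorded LCTS.

```python
import time


class ChangeClock:
    """Issues changed timestamps (CTS) that never move backward."""

    def __init__(self, stored_lcts, now=time.time):
        # stored_lcts: the LCTS this member recorded for itself in the
        # membership table metadata before the reboot.
        self.lcts = stored_lcts
        self._now = now  # injectable clock, so the sketch can be tested

    def next_cts(self):
        candidate = int(self._now())
        if candidate <= self.lcts:
            # The system clock is behind the last recorded change, so act
            # as a counter until the system time catches up.
            candidate = self.lcts + 1
        self.lcts = candidate
        return candidate
```

With a stored LCTS of 120 and a system clock reading 100, the first call returns 121 and later calls keep counting upward until the clock reading exceeds the counter.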
Further, if the tables were restored to a prior version and the timer was set backwards, the received membership table is checked to see if the LCTS is greater than the LCTS recorded and the timestamp is set to the greater value. For multiple joiners, this processing is performed relative to all membership tables, the current member and all of the joiners. It is detected during delta processing because the later timestamp is in a delta. Thus, the logic is: if row.CL==this member && row.CTS>this member's MT [this member], update timer to be monotonically increasing. (Note that this restores the timestamp to the last CTS for which the member was CL. It is possible that changes were lost when the data table was replaced, if no other member has them recorded.)
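The timer check stated above (row.CL equal to this member and row.CTS greater than this member's own membership table entry) can be sketched as a small helper. The function name and the `(cl, cts)` tuple form of a delta are hypothetical.

```python
def recover_own_lcts(me, own_lcts, deltas):
    """Return the latest timestamp known for changes led by `me`.

    deltas is an iterable of (cl, cts) pairs seen during delta processing.
    If any delta names this member as change leader with a CTS later than
    the locally recorded LCTS, the member has forgotten that change, and
    its timer is moved forward to that CTS.
    """
    for cl, cts in deltas:
        if cl == me and cts > own_lcts:
            own_lcts = cts
    return own_lcts
```

A member would then issue its next change with a timestamp later than the returned value, so that new changes do not collide with changes it made before its tables were restored.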
An example of processing associated with a forgetful member is described below.
Similar to the above examples, when N2 computes its delta, both C1 and C2 are in the delta. Because the restored membership table predates the changes, the LCTS values for both N1 and N2 are from before the first backup occurred. Changes C1 and C2 were made after that backup, so when N2 joins, N2 will correctly place C1 and C2 in its delta. N1 accepts both changes because the CTS for each row is greater than its LCTS for that row's CL, and N1's CTS for those rows is less than the LCTS that N2 has for those rows' CL.
Described in detail above is reconciliation processing used to provide consistent information between independently updated entries, such as cluster members. In addition to the above, there are two situations addressed herein. One situation is when a row in a table is deleted and another situation is handling a deleted member.
When a row in a data table is deleted, it is marked as having been deleted, but is maintained in the data table. One way of handling such deleted rows is to disallow changes once a row has been deleted. Thus, if delta changes are detected, the changes are ignored and the row remains as being deleted. A row which has been deleted is retained until all members of the cluster are present at which time these “pending delete” rows are removed from the data tables which are replicated to the joining members. If a row is deleted while all members are present, there is no need to retain the row as “pending” and the row may be removed.
A deleted row is handled like any other data row. It includes a timestamp indicating the time the deletion is made, and is detected as a delta and processed during the join processing. One potential mechanism is to treat a deletion as taking precedence over the change. Thus, if a delta change is detected to the row during the conflict resolution, the delete is processed, instead of a change.
If a deleted row is added back into a table (the key for the row is the same as the key of the deleted row), the row is retained, if no member has recorded a pending delete for that row. Other mechanisms exist which allow the correct processing in these cases by allowing the conflict resolution process to determine which row should be retained.
In one embodiment, the deleted rows are accumulated until each member has joined the cluster. If all the members are present in the cluster, then it is known that each of the members has seen the deletions and the deletes can be discarded.
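The retention rule for deleted rows can be sketched as follows, assuming each row is a dictionary carrying a hypothetical `deleted` flag and that the sets of defined and present members are known to the caller. This is an illustration of the rule, not the claimed implementation.

```python
def purge_pending_deletes(table, defined_members, present_members):
    """Discard rows marked deleted once every defined member is present.

    table maps row key -> dict with at least a boolean 'deleted' field.
    Returns True when the purge ran, False when some member is still
    absent and the pending deletes must be retained.
    """
    if not set(defined_members) <= set(present_members):
        return False  # an absent member may not have seen the deletions
    for key in [k for k, row in table.items() if row["deleted"]]:
        del table[key]
    return True
```

Until the function returns True, the marked rows continue to take part in delta processing like any other row.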
When a member is deleted, it is marked as deleted in the membership table and it is kept in the membership table until all the members have joined. The deleted member is maintained in the membership table, since its member id and timestamp may be in the data tables, and that information is used to determine deltas. The deleted member, however, is not counted in the number of members needed to discard either a deleted row or the deleted member.
When all defined members have joined the cluster, the deleted member is discarded from the membership table. The data tables are processed to remove the deleted member's entries. The rows having the deleted member as the change leader value, have their timestamps replaced to indicate that the current cluster member processing the join is the change leader and the current timestamp is the changed timestamp, prior to the table being replicated, so that no data table rows refer to the deleted member. This processing is performed atomically, such that the member is deleted from the membership table and the timestamp labels are changed atomically. If one of those fails, then they both fail.
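One way to picture the all-or-nothing removal of a deleted member is to build updated copies of both tables and hand them back together, leaving the originals untouched if anything fails. The function name, the row layout (`data`, `cl`, `cts` fields), and the copy-then-swap approach are assumptions; raising the processing member's own LCTS to the new timestamp is also an assumption, made so that the relabeled rows are not later mistaken for deltas.

```python
import copy


def discard_deleted_member(membership, table, deleted, current, now_cts):
    """Return new (membership, table) with `deleted` removed and relabeled.

    membership maps member id -> LCTS; table maps row key -> dict with
    'data', 'cl', and 'cts'. The inputs are never modified, so a failure
    part-way through leaves both original tables intact.
    """
    new_mt = copy.deepcopy(membership)
    new_table = copy.deepcopy(table)
    del new_mt[deleted]  # KeyError here aborts before anything is returned
    for row in new_table.values():
        if row["cl"] == deleted:
            row["cl"] = current   # processing member becomes change leader
            row["cts"] = now_cts  # current timestamp becomes the CTS
    # Assumed bookkeeping: record the new timestamp for the processing member.
    new_mt[current] = max(new_mt.get(current, 0), now_cts)
    return new_mt, new_table
```

The caller replaces both of its tables with the returned pair in one step, or keeps the originals if an exception was raised.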
In a further embodiment, the case of added members is also handled. Here, the membership table on one member has more entries than on another member. In this case, if the timestamp is for a change leader that is not present in the membership table, it is assumed to be a delta. The entry is added to the membership table, when the membership table entries are resolved.
In yet a further embodiment, instead of separate nodes individually joining, two subclusters that were created by a sunder are coming back together after the sunder is fixed. In this scenario, one of the members from one of the subclusters is considered the current member, and one of the members in the other subcluster is considered the joiner. All members in the joining subcluster do not need to process the merge steps, but do process the delta application and commit. (Note, it is possible that more than two subclusters are merging, and thus, there can still be multiple joiners—one from each subcluster other than the one the chosen current member is in.)
Described in detail above is a capability for facilitating reconciliation of independently updated distributed data of a communications environment. To reconcile the data, locally monotonically increasing values are employed. One example of such a value is a local timestamp. These timestamps are monotonically increasing, so that it is guaranteed that a future timestamp that is used in the updating does not precede an earlier timestamp. By using locally monotonically increasing values, advantageously updates are reconciled without the need for a centralized update log, a centralized data server, shared disk or other shared medium, or hardware time clocks. Further, the reconciliation is performed without the requirement of a quorum.
In one or more aspects of the present invention, persistent data is replicated among a plurality of communicating nodes in such a manner that allows for the nodes to correctly identify both unambiguously updated data entries and ambiguously updated entries when the data may have been updated by any subset of the nodes not in communication with the remaining nodes over any time period including nodes that are deleted from the set of nodes, data entries deleted on some of the nodes, and detection of the lost data from a node.
Although examples are described above, many variations can be made without departing from the spirit of the present invention. For example, environments other than those described herein may incorporate and use one or more aspects of the present invention. For instance, although one or more aspects of the present invention are described with reference to a clustered environment, this is only one example. Any environment that has independently updated data that is to be reconciled can use one or more aspects of the present invention. Further, the members of the cluster or other environments can be other than nodes, such as virtual machines or other types of entities. Moreover, in other embodiments, one or more optimizations or other changes may be made in order to perform the join processing or other processing used to reconcile the independently updated data. In yet other examples, additional, different or less logic may be a part of the reconciliation component. Many other variations are possible.
Further, although the examples use timestamps, any monotonically increasing values may be used, including, but not limited to counters. Timestamps are just one example.
The capabilities of one or more aspects of the present invention can be implemented in software, firmware, hardware or some combination thereof.
One or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has therein, for instance, computer readable program code means or logic (e.g., instructions, code, commands, etc.) to provide and facilitate the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
Additionally, at least one program storage device readable by a machine embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
Although preferred embodiments have been depicted and described in detail herein, it will be apparent to those skilled in the relevant art that various modifications, additions, substitutions and the like can be made without departing from the spirit of the invention and these are therefore considered to be within the scope of the invention as defined in the following claims.