Claims
- 1. A computer-implemented method, comprising:
maintaining cluster operational data on a replica set comprising a plurality of replica members that are each independent of any node of a server cluster;
representing the cluster at a node if the number of replica members controlled by the node comprises at least a majority of the total number of replica members configured to operate in the cluster; and
determining which of the replica members of the replica set has operational data that is most updated, and replicating at least some of that operational data to the other replica members of the replica set.
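The majority condition in claim 1 can be sketched as a simple predicate. This is an illustrative sketch only (the function name and signature are assumptions, not language from the patent): a node may represent the cluster only when the replica members it controls outnumber half of the configured total.

```python
def has_quorum(controlled_members: int, total_configured: int) -> bool:
    """True when the replica members controlled by a node form at
    least a majority of the configured total -- the condition under
    which, per claim 1, the node may represent the cluster."""
    return controlled_members > total_configured // 2
```

With five configured replica members, controlling three is a quorum; with four configured, two is not (strict majority is required, which is why an odd member count avoids ties).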
- 2. The method of claim 1 wherein determining which of the replica members of the replica set has the most updated operational data includes, maintaining an epoch number in association with each replica member.
- 3. The method of claim 2 wherein the size of each epoch number indicates a relative state of the cluster operational data on its respective replica member, and wherein determining which of the replica members of the replica set has operational data that is most updated includes determining which of the epoch numbers from each member is the largest.
- 4. The method of claim 3 wherein at least two members have epoch numbers that equal the largest epoch number, and wherein determining which of the replica members of the replica set has the most updated operational data includes, maintaining a sequence number in association with the cluster operational data, and determining the largest sequence number from the replica members that have epoch numbers that equal the largest.
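The epoch/sequence comparison of claims 2 through 4 amounts to an ordered two-key maximum. A minimal sketch, assuming each replica member is represented as a dict with hypothetical `epoch` and `sequence` keys:

```python
def most_updated(members):
    """Return the replica member whose operational data is most
    current: the largest epoch number wins (claim 3), and among
    members tied on epoch, the largest sequence number breaks the
    tie (claim 4)."""
    return max(members, key=lambda m: (m["epoch"], m["sequence"]))
```

Python's tuple comparison orders by epoch first and consults the sequence number only on equal epochs, mirroring the claim structure.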
- 5. The method of claim 1 further comprising, evaluating a last record logged on a replica member to which data is being replicated, against at least one record of the replicated data, to determine whether to discard the last record.
- 6. The method of claim 5 further comprising, evaluating a second-to-last record logged on the replica member to which data is being replicated, against at least one record of the replicated data, to determine whether to discard the second-to-last record.
- 7. The method of claim 1 further comprising, detecting the availability of a new replica member that is configured to operate in the cluster, and reconciling the cluster operational data of the new replica member.
- 8. The method of claim 1 further comprising, detecting the unavailability of a replica member that was operational, determining whether the majority of replica members still exists, and if not, halting updates to the cluster configuration data.
- 9. The method of claim 8 further comprising, executing a recovery process to attempt to obtain control of a majority of replica members.
- 10. The method of claim 1 wherein maintaining the cluster operational data includes storing information indicative of the total number of replica members configured in the cluster.
- 11. The method of claim 1 wherein maintaining the cluster operational data includes storing the state of at least one other storage device of the cluster.
- 12. The method of claim 1 wherein the node controls the majority of replica members by arbitrating for exclusive ownership of each member.
- 13. The method of claim 12 wherein arbitrating for exclusive ownership includes executing a mutual exclusion algorithm.
- 14. The method of claim 1 wherein the node controls the majority of replica members by arbitrating for exclusive ownership of each member of the replica set using a mutual exclusion algorithm, and exclusively reserving each member of the replica set successfully arbitrated for.
- 15. The method of claim 1 wherein the node controls the majority of replica members by arbitrating for exclusive ownership of each member, including, issuing a reset command, delaying for a period of time, and issuing a reserve command.
- 16. The method of claim 1 wherein the node controls the majority of replica members by arbitrating for exclusive ownership of each member, including, issuing a reset command.
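Claims 15 and 16 describe a reset/delay/reserve arbitration sequence against each replica member's storage device. The sketch below is an assumption-laden illustration: `FakeDisk` is a hypothetical stand-in for a device exposing SCSI-style reset and reserve operations, and the delay value is arbitrary. Breaking any existing reservation, waiting long enough for a live owner to defend, and then reserving yields exclusive ownership only if no other node reasserts itself during the delay.

```python
import time

class FakeDisk:
    """Hypothetical stand-in for a replica member's storage device,
    exposing SCSI-style reset and reserve operations."""
    def __init__(self):
        self.calls = []
    def reset(self):
        self.calls.append("reset")
    def reserve(self):
        self.calls.append("reserve")

def arbitrate(device, delay_seconds=2.0):
    """Claim 15's sequence: break any existing reservation, pause so
    a live owner has a chance to defend, then take an exclusive
    reservation."""
    device.reset()            # break any existing reservation
    time.sleep(delay_seconds) # window for a live owner to defend
    device.reserve()          # claim exclusive ownership
```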
- 17. A computer-readable medium having computer-executable instructions for performing the method of claim 1.
- 18. A system for providing consistent operational data of a previous server cluster to a new server cluster, comprising, a plurality of nodes, a plurality of replica members, each of the replica members maintaining an epoch number indicative of a state of the cluster operational data, at least one replica member having updated cluster operational data stored thereon by a first node including information indicative of a quorum requirement of a number of replica members needed to form a cluster, and a cluster service on a second node configured to 1) obtain control of a replica set of a number of replica members, 2) compare the number of replica members in the replica set with the quorum requirement, 3) form the new server cluster if the quorum requirement is met by the number of replica members in the replica set, and 4) determine which of the replica members of the replica set has data that is most updated.
- 19. The system of claim 18 wherein the cluster service determines which available replica member of the replica set has the most updated data based on a comparison of the epoch numbers in the available replica members.
- 20. The system of claim 18 wherein the cluster service determines which available replica member of the replica set has the most updated data based on a comparison of the epoch numbers in the available replica members, and if a determination cannot be made by the comparison, by comparing a sequence number of a record maintained on each of at least two replica members.
- 21. The system of claim 18 wherein the cluster service prevents updates to the cluster operational data if the number of available replica members falls below the quorum requirement.
- 22. The system of claim 18 wherein the cluster service terminates the cluster if the number of operational replica members falls below the quorum requirement.
- 23. The system of claim 18 wherein the second node obtains control of the replica set by arbitrating with at least one other node for control of each replica member.
- 24. The system of claim 18 wherein each replica member is independent of any node of the server cluster.
- 25. The system of claim 18 wherein each replica member is independent of any node of the server cluster, and wherein the second node obtains control of the replica set by arbitrating with at least one other node for control of each replica member.
- 26. A computer-implemented method of operating a server cluster of at least three nodes, comprising:
storing cluster operational data on a replica set of at least one replica member, each replica member being independent from any node;
at a first node, arbitrating with at least two other nodes for control of the replica set, the arbitration being performed for each replica member and comprising, attempting to obtain a right to exclusively reserve that replica member, and if the attempt is successful, exclusively reserving that replica member; and
representing the cluster at the first node if the replica set is controlled thereby and has consistent cluster operational data with respect to a previous cluster.
- 27. The method of claim 26 wherein the replica set comprises a plurality of replica members, and wherein the replica set is controlled and has consistent cluster operational data with respect to the previous cluster when a majority of replica members is exclusively reserved.
- 28. The method of claim 26 wherein attempting to obtain a right to exclusively reserve that replica member includes, executing a mutual exclusion algorithm.
- 29. The method of claim 26 wherein attempting to obtain a right to exclusively reserve that replica member includes, attempting to write a unique identifier to a location on the replica member, delaying, and reading from the location to determine whether the unique identifier is unchanged.
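The write-delay-read-back step of claim 29 is a classic disk-based mutual-exclusion challenge. A minimal sketch, where `sector` is a hypothetical dict-like view of the replica member's arbitration location (the key name `"owner"` is an assumption):

```python
import time

def challenge(sector, node_id, delay_seconds=0.5):
    """Claim 29's mutual-exclusion step: write this node's unique
    identifier to the arbitration location, wait, then read it back.
    An unchanged identifier means no competing node overwrote it, so
    this node has won the right to exclusively reserve the member."""
    sector["owner"] = node_id
    time.sleep(delay_seconds)
    return sector.get("owner") == node_id
```

If two nodes challenge concurrently, the later writer's identifier survives the delay, so at most one challenger reads back its own identifier and proceeds to reserve.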
- 30. The method of claim 26 wherein arbitration is performed to form the cluster.
- 31. The method of claim 26 wherein arbitration is performed by challenging at the first node for ownership of the replica set when the first node does not represent the cluster.
- 32. The method of claim 26 further comprising, defending exclusive ownership of the replica set at the first node after control of the cluster is achieved.
- 33. The method of claim 26 further comprising determining which of the replica members of the replica set has cluster operational data that is most updated, and replicating that operational data to the other replica members of the replica set.
- 34. The method of claim 26 wherein arbitrating for each replica member includes breaking a reservation of the replica member by another node.
- 35. The method of claim 26 wherein arbitrating for each replica member includes, issuing a reset command for the replica member, delaying for a period of time, and issuing a reserve command for the replica member.
- 36. A computer-readable medium having computer-executable instructions for performing the method of claim 26.
- 37. A computer-readable medium having computer-executable instructions, comprising:
representing a cluster by obtaining exclusive control of a majority of replica members in an available set thereof;
detecting a status change of one replica member with respect to the available set; and
taking action in response to the changed status to ensure that the replica members are consistent with respect to any update logged thereto.
- 38. The computer-readable medium of claim 37 wherein detecting a status change includes detecting that the one replica member is online, and wherein taking action in response to the changed status includes running a recovery process to make the replica members consistent.
- 39. The computer-readable medium of claim 37 wherein taking action in response to the changed status includes running a recovery process to make the replica members consistent.
- 40. The computer-readable medium of claim 39 wherein running a recovery process to make the replica members consistent includes increasing an epoch number maintained on each available replica member.
- 41. The computer-readable medium of claim 39 wherein running a recovery process to make the replica members consistent includes looking for a non-committed update that was not committed before a subsequent committed update on at least one available replica member, and discarding each such non-committed update found.
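Claim 41's pruning rule can be illustrated over a simplified log. This sketch assumes each record is a `(sequence, committed)` pair, which is not the patent's on-disk format: a non-committed update that precedes a later committed update can never have taken effect cluster-wide and is discarded, while trailing non-committed records are kept for the last-record evaluation of claims 5 and 6.

```python
def prune_log(records):
    """Discard each non-committed update that was not committed
    before a subsequent committed update (claim 41). Records are
    (sequence, committed) pairs; trailing non-committed records
    are retained."""
    # Index of the last committed record, or -1 if none exist.
    last_committed = max(
        (i for i, (_, committed) in enumerate(records) if committed),
        default=-1,
    )
    # Keep committed records, and non-committed ones only if they
    # come after the final committed record.
    return [r for i, r in enumerate(records) if r[1] or i > last_committed]
```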
- 42. The computer-readable medium of claim 39 wherein running a recovery process to make the replica members consistent includes selecting a leader replica member, and propagating records from the leader replica member to non-leader replica members.
- 43. The computer-readable medium of claim 42 wherein selecting a leader replica member includes determining which replica member has data that is most updated with respect to other replica members, and selecting that replica member as the leader.
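Claims 42 and 43 combine leader selection with record propagation. A minimal sketch, assuming members are dicts with hypothetical `epoch`, `sequence`, and `log` keys: the most-updated member becomes the leader and its log is copied to every other member of the replica set.

```python
def reconcile(members):
    """Select the most-updated replica member as the leader
    (claim 43) and propagate its records and epoch/sequence state
    to the non-leader members (claim 42), making the set
    consistent."""
    leader = max(members, key=lambda m: (m["epoch"], m["sequence"]))
    for m in members:
        if m is not leader:
            m["log"] = list(leader["log"])
            m["epoch"], m["sequence"] = leader["epoch"], leader["sequence"]
    return leader
```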
- 44. The computer-readable medium of claim 37 wherein detecting a status change includes detecting that the one replica member is offline, and wherein taking action in response to the changed status includes determining whether a majority of replica members still exists.
- 45. The computer-readable medium of claim 44 wherein a majority of replica members does not still exist, and wherein taking action in response to the changed status further includes preventing updates from being written to replica members that remain available.
- 46. The computer-readable medium of claim 37 wherein detecting a status change includes attempting to write an update to each available replica member, receiving success or failure information for each attempted write, and determining whether a majority of replica members still exists by evaluating a number of successful writes against a number required for a majority.
- 47. The computer-readable medium of claim 46 further comprising, reporting that the update succeeded if the number of successful writes is greater than or equal to the number required for a majority.
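Claims 46 and 47 gate update success on the count of successful writes. The sketch below is illustrative only: `Member` is a hypothetical fake whose `write()` reports success or failure, and the majority is taken over the configured member total, so losing too many members causes the update (and further updates, per claim 48) to be refused.

```python
class Member:
    """Hypothetical replica member whose write() reports success."""
    def __init__(self, healthy):
        self.healthy = healthy
    def write(self, update):
        return self.healthy

def replicate_update(available_members, update, total_configured):
    """Attempt the write on every available replica member, count
    successes, and report success only when at least a majority of
    the configured members took the update (claims 46-47)."""
    successes = sum(1 for m in available_members if m.write(update))
    return successes >= total_configured // 2 + 1
```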
- 48. The computer-readable medium of claim 46 further comprising preventing further updates unless the number of successful writes is greater than or equal to the number required for a majority.
CROSS-REFERENCE TO RELATED APPLICATION
[0001] The present application is a continuation-in-part of U.S. patent application Ser. No. 09/277,450, filed Mar. 26, 1999.
Continuation in Parts (1)
|  | Number | Date | Country |
| --- | --- | --- | --- |
| Parent | 09277450 | Mar 1999 | US |
| Child | 09895810 | Jul 2001 | US |