1. Technical Field
This invention relates to election of a cluster leader in a storage area network More specifically, the invention relates to reliable election of a cluster leader subsequent to loss of a prior cluster leader or loss of communication with the prior cluster leader.
2. Description of the Prior Art
A storage area network (“SAN”) is an increasingly popular storage technology.
Each cluster of nodes has a cluster leader that owns certain tasks for which member nodes in the cluster require communication with the leader to support a desired service. A loss of operation of the cluster leader or loss of communication between one or more nodes in the cluster and the cluster leader requires a new leader to be elected to ensure cluster integrity. The leader election procedure needs to meet four criteria: (1) reliability or near-certainty of electing a leader, (2) uniqueness of cluster leader, (3) presenting optimal capacity and availability from the cluster to the clients, and (4) choosing a leader in the shortest duration of time. The cluster only needs one leader for correctness of service that the cluster provides, of which the leader needs to be elected with near certainty to avoid cluster unavailability and disruption of service to the clients. Efficient and effective operation of the cluster requires the capacity supported by the cluster to include the maximum number of nodes that can reliably provide service to the clients.
Prior art solutions for leader election fail to meet the four criteria outlined above. Some cluster leader solutions choose the node(s) that first discovered the loss of the leader or loss of connectivity with the leader as the candidate(s) for the new leadership position. Most monitoring techniques for clusters involve one or two nodes that are adjacent to the leader as the nodes to monitor the connectivity with the cluster leader. In this example, the reliability of electing a cluster leader reduces as a result of fault scenarios under which the monitoring nodes might also be handicapped along with the previous leader at about the same time as the leader. In addition, the monitoring nodes may not be well connected to a majority of the nodes. This would result in reducing the chances of optimal capacity being provided to the clients of the cluster. Accordingly, there are limitations associated with this prior art technique of selecting the nodes to monitor connectivity with the cluster leader, in which the selected nodes would also function as subsequent cluster leader candidates in the event of loss of connectivity with the cluster leader.
Another known cluster leader election solution is known as a backoff protocol. There are two variations in this protocol. In both variations, one node tells the remaining nodes to backoff from undertaking the subsequent leader election protocol. If a node does not receive a single backoff message in the random-backoff case or is biased in favor relative to the node sending it a backoff, then the node proceeds to undertake the subsequent leader election protocol. This node may undergo a fault, thus reducing reliability. Accordingly, the backoff protocol does not ensure high reliability for leader election, does not guarantee optimal cluster capacity, and does not mitigate time to converge on a new cluster leader.
Another known prior art solution is known as the majority vote protocol. There are two variations to this protocol a single voting phase protocol and a mulit-phase voting protocol. Both variations require that a new cluster leader receive votes from a majority of the nodes based upon the original quantity of nodes in the cluster. Either variation of the majority voting protocol could be preceded by nomination of a candidate for leader election by predefined or dynamic methods, of which the dynamic methods include the prior art solutions discussed in the preceding paragraphs. These solutions cannot tolerate faults during the protocol or the protocol takes a long time to converge. Accordingly, this process does not ensure high availability of leader election, cluster leader availability under all circumstances, or time efficient for cluster leader election.
Another known leader election solution is the quorum resource lock protocol. There are several variations to this protocol of which one variation uses the quorum resource as an additional vote in the majority vote protocol. Another variation is known as a challenge defense protocol wherein the entire SCSI bus is reset to unlock the quorum resource. The SCSI bus reset is disruptive to all nodes, and the algorithm also take a long time to converge on the leader. The challenge defense protocol utilizes algorithms that require time to converge with multiple nodes attempting to acquire the lock. As such the challenge defense protocol is both disruptive and slow to converge.
Finally, another known prior art solution combines the quorum resource lock and majority vote protocols to provide an extra vote for the node that owns the quorum resource lock to break a tie during a network partition that evenly split the cluster of nodes. However, this solution neither to keeps the cluster available for the newly elected leader before concluding the protocol, nor does it take into account cluster availability via client reachability.
The prior art solutions for electing a new cluster leader in the event of loss of the leader or loss of communication between the nodes and the leader do not satisfy all of the requirements of a cluster election algorithm. Accordingly, a fast and reliable method and system for the election of a single and unique cluster leader with as many of the remaining nodes participating in such a multi-node cluster environment is desired.
This invention comprises an algorithm for election of a cluster leader subsequent to a fault in the cluster.
In a first aspect, a method is provided for leader election in a multi-node storage area network The method includes each node communicating to all nodes within a cluster of storage area network nodes of loss of connectivity between a node in the cluster and a cluster leader. A quantity of cluster leader candidates is pruned in response to the loss of connectivity. Approval of the node leadership election is validated within the cluster of nodes to function as a new cluster leader. The validation step includes biasing cluster reformation for election of the new cluster leader based upon a majority grouping of nodes with the cluster of nodes, and/or connectivity with a select group of clients in communication with the cluster.
In a second aspect of the invention, a storage area network system is provided with a group of storage area network nodes including one node adapted to function as a cluster leader. A communication manager is provided to enable each node to inform all nodes within a cluster of nodes of loss of connectivity between a node in the cluster and the cluster leader. A pruning protocol adapted to mitigate a quantity of cluster leader candidates is provided in response to the loss of connectivity. A validation protocol that is adapted to approve a new cluster leader candidate in response to the pruning protocol is also provided. The validation protocol preferably biases cluster leader election from a majority grouping of nodes within the cluster of nodes and/or connectivity with a select group of clients in communication with the cluster.
In a third aspect of the invention, an article in a computer-readable signal-bearing medium is provided Means in the medium are provided for informing all nodes within a cluster of storage area network nodes of loss of communication between a node in the cluster and the cluster leader. Means in the medium are provided for mitigating a quantity of cluster leader candidates responsive to the loss of communication. In addition, means in the medium are provided for validating election of a new cluster leader in response to the mitigation of cluster leader candidates. The means for validation election of a new cluster leader preferably biases cluster leader election from a majority grouping of nodes within the cluster of nodes and/or connectivity with a select group of clients in communication with the cluster.
Other features and advantages of this invention will become apparent from the following detailed description of the presently preferred embodiment of the invention, taken in conjunction with the accompanying drawings.
A cluster of nodes typically has two or more nodes, wherein each node may operate under a single or multiple operating system instances. Each node in a cluster has a unique identifier, known as a node identifier, in the form of a distinct non-negative number. The node identifier satisfies an ordering property in the cluster. The process of electing a new cluster leader subsequent to a loss of communication with a former cluster leader invokes the use of the node identifiers in an ordering protocol. In addition, a two pass system is utilized to ensure that in the event of a partition of the cluster, a new cluster leader may be elected from either a majority or minority grouping of nodes.
Following a cluster fault, each node in the cluster or the cluster partition, will have an opportunity to become the new cluster leader through a process for selection of a cluster leader candidate that utilizes node identifiers as a tool in the selection process, thus increasing the reliability of leader election. In order to mitigate the time for election of a new cluster leader, a pruning algorithm is invoked.
The pruning process is initiated by each node determining the need to send a refrain message to other nodes in the system 62, and then selecting a first node in the cluster as a recipient of the refrain message 64. Following the selection process at step 64, a test is conducted to determine if the sender node has received a refrain message 66. If the response to the test at step 66 is negative, a subsequent query is conducted to determine if the sender node identifier is less than the selected node identifier 68. A positive response to the test at step 68 will result in the sender node sending a message to the selected node to refrain from vying for the position as the new cluster leader 70. Similarly, if the response to the test at step 66 is positive, this is indicative that the sending node has received a message from a second sender node. A subsequent query is conducted to determine if the sending node identifier is less than the second sender node identifier 72. A positive response to the test at step 72 will result in the sending node sending a message to the second sender node to refrain from vying for the position as the new cluster leader 70. However, a negative response to either the query at steps 68 or 72, is evidence that the sender node is not a cluster leader candidate 76. A node that is determined not to be a cluster leader candidate will become a participant in the voting process initiated by a leader candidate selected from the pruning protocol. Alternatively, following steps 70 and 74, the sending node will wait for a defined time interval 78 before continuing through the pruning protocol Upon conclusion of the time interval at step 78, a test is conducted to determine if the node selected to receive a message at step 64 is the final node in the cluster 80. A negative response to the test at step 80, will result in the sending node selecting a subsequent node in the cluster as a recipient of a refrain message 82. Thereafter, the node proceeds to step 66 to determine if the node selected at step 82 should receive a refrain message. Alternatively, if the response to the test at step 80 is positive, the sending node is determined to be the cluster leader candidate from the grouping of nodes in which the sending node continues to maintain communication 84. Accordingly, the process for selection of a cluster leader candidate utilizes the node identifiers as a tool in the selection process.
Following the process of pruning the quantity of nodes for the position of new cluster leader candidate, a cluster leader must be established.
Majority Grouping=[Truncate(N/2)]+1 Equation 1
,wherein N is the quantity of nodes in the original cluster of nodes. Thereafter, a first pass of a vote for election of a new cluster leader is invoked. This process establishes that a leader of a grouping of nodes from the process illustrated in
The cluster leader election process allows for a maximum of two passes through the voting process. A negative response to the test at step 112 in
However, if at step 112 a cluster leader candidate received a majority vote, the cluster leader candidate must then determine if it has connectivity with a select group of clients which the cluster has been or is intended to service 118. A positive response to the determination at step 118 will allow the cluster leader candidate to proceed to a quorum disk lock phase. However, a negative response to the determination at step 118 results in a subsequent query to determine if the vote at step 106 was the first pass or second pass of the election 120. If the vote at step 106 was the first pass, then the cluster leader candidate is a failed candidate 122. However, if the vote at step 106 was a second pass, the election protocol proceeds to a quorum disk lock phase. Accordingly, the election process accounts for a determination as to whether the cluster leader candidate has received votes from a majority grouping of nodes, as well as whether the cluster leader candidate continues to have connectivity with a select group of clients.
The process of election of a new cluster leader following a cluster fault provides increased reliability of leader election and cluster reformation. A pruning protocol based upon a hierarchical system of the node identifiers is used to elect a new leader candidate for a grouping of nodes in a short duration. Thereafter, a two pass system is invoked to optimize a higher capacity cluster subset that has connectivity with a select group of clients, if possible, and to provide a highly diminished cluster subset in the event of unavailability of the former. The two pass system favors the majority grouping that also has good client connectivity as this would increase cluster capacity that is available to its clients. However, in the event a cluster leader is elected from a minority grouping of nodes, this ensures that a cluster leader is elected and the cluster can function and operate, although on a less efficient basis. Accordingly, the pruning protocol together with the two pass system ensures operation of the cluster with a cluster leader in a reliable and efficient manner following a fault in the cluster.
It will be appreciated that, although specific embodiments of the invention have been described herein for purposes of illustration, various modifications may be made without departing from the spirit and scope of the invention. In particular, the quorum disk is provided in a shared storage system in which the grouping nodes communicate for data. The algorithm for election of a cluster leader in the event of a cluster fault is a shared protocol. Any correct and reliable algorithm may be used for the quorum disk lock protocol. The candidate for cluster leader has an exclusive hold of the quorum disk resource for a required time period. In addition, this cluster leader election algorithm is applicable to any cluster environment in communication with a shared storage media in which the nodes in the cluster have access to the shared storage. Accordingly, the scope of protection of this invention is limited only by the following claims and their equivalents.