Availability prediction method for high availability cluster

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. §119 to Korean Patent Application No. P2007-127904, filed on Dec. 10, 2007, the disclosure of which is incorporated herein by reference in its entirety.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present disclosure relates to an availability prediction method for a high availability cluster, and more particularly, to a method for predicting an availability of a high availability cluster, which can determine an optimal number of nodes meeting a predetermined required availability level and a method for operating the same.

This work was supported by the IT R&D program of MIC/IITA [2007-S-016-01, A Development of Cost Effective and Large Scale Global Internet Service Solution].

2. Description of the Related Art

Generally, a cluster system refers to a system that manages a virtual image program in an integrated manner by grouping a plurality of similar nodes.

Much research is in progress in various fields such as high availability (HA), load balancing, high performance computing, and grid computing. In particular, high availability is an important aspect of cluster technology for providing services without failure upon a user's request in today's Internet environment.

A high availability cluster includes one or more nodes to prepare for a failure on any one of them. Moreover, the high availability cluster checks the state of each individual node at any time to dynamically remove a failed node from the cluster, allows other nodes to perform the corresponding task on behalf of the failed node, and allows a recovered node to join the cluster again.

FIG. 1A is a block diagram illustrating a configuration of a related art asymmetric cluster system.

Referring to FIG. 1A, the asymmetric cluster system 100 includes a head node 110, a switch node 120, and a compute node 130. The head node 110 monitors the compute node 130. The switch node 120 is placed between the head node 110 and the compute node 130. The compute node 130 fulfills a user's request by the head node 110.

The head node 110 distributes cluster-related software, monitors a failure on the compute node 130, and recovers the failed node to optimal system availability. It is very important to minimize failures of the nodes in the practical operation of the cluster system.

The head node 110 includes two Ethernet devices. One fulfills a user's request through a private network connected to the compute node 130 via the switch node 120, and the other fulfills the user's request through a public network.

The switch node 120 is connected to the private network and provides the head node 110 with a path to the compute node 130.

The compute node 130 is connected to the private network and carries out a certain operation according to a command of the head node 110.

FIG. 1B is a block diagram illustrating a configuration of a related art high availability cluster system.

Referring to FIG. 1B, the high availability cluster system includes two head nodes 211 and 212, two switches 221 and 222, and m compute nodes 230_1 to 230_m.

The head nodes 211 and 212 are duplexed. Accordingly, when one of the head nodes 211 and 212 fails, the failed node may be replaced with the other node.

In this case, since two switches 221 and 222 are used, each of the head nodes 211 and 212 includes three Ethernet devices.

FIG. 2 is a block diagram illustrating a configuration of a general high availability cluster system.

Referring to FIG. 2, the high availability cluster system includes a plurality of head nodes 250_1 to 250_n, a plurality of switches 260_1 to 260_1, and a plurality of compute nodes 270_1 to 270_m. The “number of nines”, which is a value indicating availability, varies with the number n of the head nodes 250_1 to 250_n.

The availability value varies with the number n of the nodes, the number of active and passive nodes, or the configuration of the nodes (e.g., whether the cluster system is constituted of only head nodes, or of head nodes and switch nodes).

The larger the number of nodes, the higher the availability probability of the high availability cluster system. However, since the number of nodes is limited, it is necessary to design the system in consideration of the availability probability for a given number of nodes prior to system construction.

SUMMARY

Therefore, an object of the present invention is to provide an availability prediction method for a high availability cluster, which can determine an optimal number of nodes and an operating method meeting a predetermined level of an availability probability.

To achieve these and other advantages and in accordance with the purpose(s) of the present invention as embodied and broadly described herein, a method for predicting an availability of a high availability cluster in accordance with an aspect of the present invention includes: calculating a basic survival probability that the other node survives until a failure on one node of two nodes constituting a cluster is fixed; and determining an optimal number of nodes meeting a preset reference availability probability by calculating an availability probability for a predetermined range of the number of nodes on the basis of the basic survival probability.

To achieve these and other advantages and in accordance with the purpose(s) of the present invention, a method for predicting an availability of a high availability cluster in accordance with another aspect of the present invention includes: enumerating all configurations of a cluster system capable of being constituted of one or more active nodes and a number of passive nodes obtained by subtracting the number of active nodes from the total number of nodes; calculating an availability probability for each of the enumerated configurations; and determining, as an optimal configuration of the cluster, the combination of nodes for which the availability probability is the maximum value.

To achieve these and other advantages and in accordance with the purpose(s) of the present invention, a method for predicting an availability of a high availability cluster in accordance with another aspect of the present invention includes: enumerating all configurations of a cluster system variable with the number of head nodes and the number of switches; calculating an availability probability for each of the enumerated configurations; and determining a combination between the head nodes and the switches when the availability probability is the maximum value as an optimal configuration of the cluster.

The foregoing and other objects, features, aspects and advantages of the present invention will become more apparent from the following detailed description of the present invention when taken in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention.

FIG. 1A is a block diagram illustrating a configuration of a related art asymmetric cluster system;

FIG. 1B is a block diagram illustrating a configuration of a related art high availability cluster system;

FIG. 2 is a block diagram illustrating a configuration of a general high availability cluster system;

FIG. 3 is a table illustrating an availability probability of a node according to an embodiment of the present invention;

FIG. 4 is a table illustrating an availability probability of an availability cluster system according to an embodiment of the present invention;

FIG. 5 is a flowchart illustrating a process of obtaining the number n of nodes of a high availability cluster system having an availability probability P_n larger than a reference availability probability according to an embodiment of the present invention;

FIG. 6 is a flowchart illustrating a process of determining the number of active and passive nodes among n number of nodes according to an embodiment of the present invention;

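FIGS. 7 through 9 are diagrams illustrating a process of carrying out an availability probability operation of a cluster system using Markov chain according to an embodiment of the present invention;

FIG. 10 is a table illustrating an availability probability prediction according to variation of MTTF in a variable active and passive node establishment according to an embodiment of the present invention;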

FIG. 11 is a flowchart illustrating a process of determining the number of head nodes and the number of switches when an availability probability is maximum value regarding a cluster system according to an embodiment of the present invention; and

FIG. 12 is a diagram illustrating an availability probability prediction of a cluster system by means of a continuous-time Markov chain (CTMC) according to an embodiment of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS

Hereinafter, specific embodiments will be described in detail with reference to the accompanying drawings. The present invention may, however, be embodied in different forms and should not be construed as limited to the embodiments set forth herein.

FIG. 3 is a table illustrating an availability probability of a node according to an embodiment of the present invention. The table expresses the availability probability of a cluster of five nodes in terms of the survival probability.

Referring to FIG. 3, a node state 310 of whether each of nodes is operating or not is expressed in a binary code of 1-bit.

That is, an active node is expressed in a binary code “0”, and a down node is expressed in a binary code “1”.

On the other hand, it is apparent that the active node and the down node may be expressed in a binary code “1” and “0”, respectively, and the node state 310 may be expressed in a binary code of 2 or more bits.

A survival state 320 and an availability probability 330 are illustrated by enumerating the sixteen (two to the fourth power) cases in which the states of four nodes vary, on the assumption that one node is a down node.

The survival state 320 indicates whether the current cluster system survives or not, which may be expressed, for example, as “Success” or “Fail”.

That is, if the number of active nodes meets the quorum number, the survival state 320 is the success state. If the number of active nodes falls short of the quorum, the survival state 320 is the fail state.

The availability probability (P_n) 330 is a probability that the other nodes survive until a failed node in a cluster system including (n+1) nodes is recovered, and may be calculated by the following equation (1):

$\begin{matrix} P_{n} = P_{1}^{n - 1} + \sum_{k = Q_{n}}^{n - 2} P_{1}^{k} (1 - P_{1})^{n - k - 1} \frac{(n - 1)!}{(n - k - 1)!}, \quad Q_{n} = \left[ \frac{n + 1}{2} \right] & (1) \end{matrix}$

As in the above equation (1), the availability probability (P_n) 330 may be obtained from a survival probability P₁ that, when a failure occurs on one node of two nodes, the other node survives until the failed node is recovered.

That is, provided there are two nodes in the cluster system, the availability of the entire system becomes 0% unless the failed node is recovered while the other node still survives, which occurs with the survival probability P₁.

In the cluster system including two nodes, the availability probability P_n for the two nodes may be considered the survival probability P₁ of the cluster system.

Referring again to FIG. 3, since all four nodes are active in the first case, the availability probability P_n 330 is P₁⁴, that is, P₁ to the fourth power.

In the second case, since three nodes are active nodes and one node is a down node, the availability probability P_n 330 is the product of the survival probability P₁³ of the three nodes and the probability (1−P₁) that the other node is down.

As in the fourth case, when the survival state 320 of the system is Fail, the availability probability 330 is 0%. The availability probability 330 for the remaining cases may be calculated in the manner described above.
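As a non-limiting illustration, the enumeration of FIG. 3 may be sketched in Python as follows. The sketch is not part of the described embodiment: the function name `enumerate_states` is hypothetical, the value P₁ = 0.9975 is borrowed from the FIG. 4 discussion, and the requirement that at least three of the four remaining nodes be active is an assumption chosen so that the fourth case (two down nodes) is a fail state.

```python
from itertools import product


def enumerate_states(others: int, quorum: int, p1: float):
    """Return (bits, survival_state, probability) for every state of the remaining nodes.

    bits follows the FIG. 3 convention: 0 = active node, 1 = down node.
    """
    rows = []
    for bits in product((0, 1), repeat=others):
        active = bits.count(0)
        down = others - active
        if active >= quorum:
            # Surviving configuration: p1 per active node, (1 - p1) per down node.
            rows.append((bits, "Success", p1 ** active * (1 - p1) ** down))
        else:
            # Quorum lost: the configuration contributes nothing.
            rows.append((bits, "Fail", 0.0))
    return rows


# Assumed example values (hypothetical): 4 remaining nodes, quorum of 3.
rows = enumerate_states(others=4, quorum=3, p1=0.9975)
total = sum(prob for _, _, prob in rows)
for bits, state, prob in rows[:4]:
    print("".join(map(str, bits)), state, f"{prob:.8f}")
print(f"sum over all 16 cases = {total:.8f}")
```

Summing the sixteen per-case probabilities gives the availability probability for the five-node cluster under these assumptions.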

FIG. 4 is a table illustrating an availability probability of an availability cluster system according to an embodiment of the present invention. The table illustrates availability probabilities P₁ to P_n when the number of nodes varies from 2 to n, and a quorum number varies from 2 to Q_n.

More particularly, when the number of the nodes is 2 and the quorum number is 0, the availability probability is 0.99750000=P₁. When the number of the nodes is 3 and the quorum number is 2, the availability probability is 0.99500625=P₁². When the number of the nodes is 4 and the quorum number is 3, the availability probability is 0.99251873=P₁³. When the number of the nodes is 5 and the quorum number is 3, the availability probability is 0.99996262=P₁⁴+(1−P₁)*P₁³*4. When the number of the nodes is 6 and the quorum number is 4, the availability probability is 0.99993781=P₁⁵+(1−P₁)*P₁⁴*5. When the number of the nodes is n and the quorum number is Q_n, the availability probability is P_n.
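The values quoted above may be checked with a short Python sketch. This is an interpretation rather than a statement of the claimed formula: it assumes that the combinatorial factor in the sum is the binomial coefficient C(n−1, k) and that the bracket in Q_n rounds up, because those two choices reproduce the coefficients 4 and 5 and the quorum numbers 3 and 4 given for five and six nodes. The function names `quorum` and `availability` are hypothetical.

```python
from math import comb


def quorum(n: int) -> int:
    # Assumption: Q_n = [(n + 1) / 2] is rounded up, i.e. ceil((n + 1) / 2).
    return (n + 2) // 2


def availability(n: int, p1: float) -> float:
    """Availability probability P_n of an n-node cluster for basic survival probability p1."""
    total = p1 ** (n - 1)  # all n - 1 remaining nodes survive
    for k in range(quorum(n), n - 1):  # k = Q_n, ..., n - 2 surviving nodes
        # Assumption: C(n - 1, k) ways to choose which k nodes survive.
        total += comb(n - 1, k) * p1 ** k * (1 - p1) ** (n - k - 1)
    return total


for n in range(2, 7):
    print(n, quorum(n), f"{availability(n, 0.9975):.8f}")
```

With P₁ = 0.9975, the sketch reproduces the availability probabilities listed for two through six nodes.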

FIG. 5 is a flowchart illustrating a process of obtaining the optimal number n of nodes of a high availability cluster system having an availability probability P_n larger than a reference availability probability according to an embodiment of the present invention. Hereinafter, the process will be described with reference to FIG. 5.

In operation S510, a probability P₁ that, after a failure occurs on one node of two nodes constituting a cluster, the other node survives is obtained. P₁ may be referred to as a basic survival probability and varies with the system environment. In operation S520, the reference availability probability is determined. It is apparent that the reference availability probability may be determined prior to operation S510. In this way, after a certain reference availability probability is determined, the optimal number of nodes meeting the reference availability probability may be calculated.

In operation S530, the number of nodes is initialized to the minimum value of a range by setting n to 2. Thereafter, in operation S540, an availability probability P_n for that number of nodes is calculated as in equation (1).

P_n is a probability that the other nodes survive after a failure occurs on one node of the entire nodes.

It is determined in operation S550 whether the calculated probability P_n is larger than the reference availability probability.

According to the determination, if the calculated probability P_n is the same as or larger than the reference availability probability, the number n of nodes is output in operation S560.

On the other hand, if the calculated probability P_n is smaller than the reference availability probability in operation S550, the number of nodes is increased by a certain unit value (e.g., 1) in operation S570.

Operations S530 through S570 are repeatedly performed until the calculated probability P_n becomes larger than the reference availability probability. Furthermore, even when the calculated probability P_n is larger than the reference availability probability, operations S530 through S570 are repeatedly performed, increasing the number of nodes by the certain unit value, until the number of nodes reaches the maximum value of the range.

The determined optimal number of the nodes n is used as the number of head nodes constituting an asymmetric cluster.
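The search of FIG. 5 may be sketched, purely by way of example, as the following Python loop. The names `availability` and `optimal_node_count` and the parameter `n_max` (the maximum value of the range) are hypothetical, and the availability function again assumes a binomial coefficient and an upward-rounded quorum, which is an interpretation of equation (1) rather than a restatement of it.

```python
from math import comb


def availability(n: int, p1: float) -> float:
    # Assumed reading of equation (1): binomial factor, quorum rounded up.
    q = (n + 2) // 2
    return p1 ** (n - 1) + sum(
        comb(n - 1, k) * p1 ** k * (1 - p1) ** (n - k - 1)
        for k in range(q, n - 1)
    )


def optimal_node_count(p1: float, reference: float, n_max: int):
    """Return (n, P_n) for the smallest n in [2, n_max] meeting the reference, else None."""
    n = 2  # S530: start at the minimum of the range
    while n <= n_max:
        p_n = availability(n, p1)  # S540
        if p_n >= reference:  # S550
            return n, p_n  # S560
        n += 1  # S570: increase by the unit value
    return None


print(optimal_node_count(p1=0.9975, reference=0.9999, n_max=10))
```

The sketch stops at the first node count that meets the reference; continuing to `n_max` and collecting every qualifying count would correspond to the variant in which the loop runs over the whole range.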

FIG. 6 is a flowchart illustrating a process of determining the number of active and passive nodes among n number of nodes according to an embodiment of the present invention. Hereinafter, the process will be described with reference to FIG. 6.

On the assumption that the total number n of nodes was determined through the process illustrated in FIG. 5, the number of active nodes and the number of passive nodes are determined.

First, in operation S610, all configurations of a cluster system constituted of one or more active nodes and the corresponding number of passive nodes (n minus the number of active nodes) are enumerated.

Thereafter, in operation S620, an availability probability for each enumerated configuration is calculated.

In this case, the availability probability according to a survival state and a state transition of each node in the corresponding configuration is calculated using the Markov chain.

In operation S630, a configuration of the cluster system when the availability probability is the maximum value is determined as an optimal configuration according to the result of the calculation.

Thereafter, the head nodes constitute the asymmetric cluster according to the determined optimal configuration.

FIGS. 7 through 9 are diagrams illustrating a process of carrying out an availability probability operation of a cluster system using Markov chain according to an embodiment of the present invention.

The Markov chain is a mathematical modeling technique for various management systems and, at the same time, a technique for sequentially predicting changes of future states by understanding the dynamic properties of various parameters in a system on the basis of changes of past states.

An availability probability in case where one active node exists is illustrated using the Markov chain in FIG. 7.

In this case, there exist two states. A first state is a state that a node is active, and a second state is a state that the node is down.

Assuming that a probability of staying in the first state is π1, a probability of staying in the second state is π2, a probability of shifting from the first state to the second state is α1, and a probability of shifting from the second state to the first state is β1, an availability probability A becomes π1. A mathematical relationship between the probabilities may be expressed by the following equation (2):

π1+π2=1
π2*β1=π1*α1
π1=1/(1+α1/β1)
π2=1/(1+β1/α1) (2)

where α1 is the failure rate, that is, the reciprocal of the mean time to failure (MTTF), and β1 is the repair rate, that is, the reciprocal of the mean time to repair (MTTR).
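As a minimal sketch of equation (2), the following Python function evaluates π1 and π2 from an MTTF and an MTTR. It assumes that α1 and β1 are rates, that is, the reciprocals of MTTF and MTTR, in the same way that λ and δ are defined for equation (5). The function name and the numeric values (1000 and 10 time units) are hypothetical.

```python
def two_state_availability(mttf: float, mttr: float):
    """Return (pi1, pi2) for the one-node, two-state chain of FIG. 7."""
    alpha1 = 1.0 / mttf  # assumed failure rate
    beta1 = 1.0 / mttr   # assumed repair rate
    pi1 = 1.0 / (1.0 + alpha1 / beta1)  # node is active; this is the availability A
    pi2 = 1.0 / (1.0 + beta1 / alpha1)  # node is down
    return pi1, pi2


# Hypothetical example: MTTF of 1000 time units, MTTR of 10 time units.
pi1, pi2 = two_state_availability(mttf=1000.0, mttr=10.0)
print(f"A = pi1 = {pi1:.6f}, pi2 = {pi2:.6f}")
```

Under this assumption π1 reduces to MTTF/(MTTF+MTTR), the familiar steady-state availability of a single repairable unit.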

An availability probability in case where one node is an active node and the other node is a passive node is illustrated using the Markov chain in FIG. 8.

In this case, there exist four states. A first state is a state in which the active node and the passive node are both active, a second state is a state in which the active node is down and the passive node is converted into the active node, a third state is a state in which the passive node serves as the active node, and a fourth state is a state in which both nodes are down.

Probabilities with respect to each state are expressed by the following equation (3):

π1+π2+π3+π4=1
π3*β1=π1*α1
π1*α1=π2*γ1
π4*β2=π3*α2
π1=1/(1+α1/γ1+α1/β1+α1*α2/(β1*β2))
π2=1/(γ1/α1+1+γ1/β1+α2*γ1/(β1*β2))
π3=1/(β1/α1+β1/γ1+1+α2/β2)
π4=1/(β2/α2+β1*β2/(γ1*α2)+β1*β2/(α1*α2)+1) (3)

where the availability probability A is expressed as A=π1+π3.
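Equation (3) may be evaluated numerically as in the sketch below, given only as an illustration. Each state probability is built from the balance relations relative to π1 and then normalized so that the four probabilities sum to one. The sketch assumes that the balance relation for the fourth state uses β2 (π4·β2 = π3·α2); the function name and the rate values are hypothetical, and γ1 is used only as it appears in equation (3).

```python
def active_passive_availability(alpha1, alpha2, beta1, beta2, gamma1):
    """Return ([pi1, pi2, pi3, pi4], A) for the active/passive chain of FIG. 8."""
    w1 = 1.0
    w2 = alpha1 / gamma1      # from pi1 * alpha1 = pi2 * gamma1
    w3 = alpha1 / beta1       # from pi3 * beta1  = pi1 * alpha1
    w4 = w3 * alpha2 / beta2  # assumed: pi4 * beta2 = pi3 * alpha2
    total = w1 + w2 + w3 + w4
    pi = [w / total for w in (w1, w2, w3, w4)]  # enforce pi1 + ... + pi4 = 1
    return pi, pi[0] + pi[2]  # A = pi1 + pi3


# Hypothetical rates, chosen only to exercise the formula.
pi, a = active_passive_availability(
    alpha1=0.001, alpha2=0.002, beta1=0.1, beta2=0.05, gamma1=1.0
)
print([f"{p:.6f}" for p in pi], f"A = {a:.6f}")
```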

An availability probability in a case where both nodes are active nodes is illustrated using the Markov chain in FIG. 9.

In this case, there exist five states. The probabilities of each state are expressed by the following equation (4):

π1+π2+π3+π4+π5=1
π3*β1=π5*γ2=π1*α1=π2*γ1
(α2+β1)*π3=π4*β2+π2*γ1
π3*α2=π4*β2
π1=1/(1+α1/β1+α1/γ1+α1/γ2+α1*α2/(β1*β2))
π2=1/(1+γ1/α1+γ1/β1+γ1/γ2+α2*γ1/(β1*β2))
π3=1/(1+β1/α1+β1/γ1+α2/β2+β1/γ2)
π4=1/(1+β2/α2+β1*β2/(α1*α2)+β1*β2/(α2*γ1)+β1*β2/(α2*γ2))
π5=1/(1+γ2/α1+γ2/β1+γ2/γ1+α2*γ2/(β1*β2)) (4)

where the availability probability A is expressed as A=π1+π3+½(π2+π5).
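A corresponding sketch for equation (4) is given below as an illustration only. It assumes that the last balance relation reads π3·α2 = π4·β2, which is the reading under which the relation (α2+β1)·π3 = π4·β2 + π2·γ1 also holds. The function name and rate values are hypothetical, and γ1 and γ2 are used only as they appear in equation (4).

```python
def active_active_availability(alpha1, alpha2, beta1, beta2, gamma1, gamma2):
    """Return ([pi1..pi5], A) for the two-active-node chain of FIG. 9."""
    w1 = 1.0
    w2 = alpha1 / gamma1      # pi2 * gamma1 = pi1 * alpha1
    w3 = alpha1 / beta1       # pi3 * beta1  = pi1 * alpha1
    w5 = alpha1 / gamma2      # pi5 * gamma2 = pi1 * alpha1
    w4 = w3 * alpha2 / beta2  # assumed: pi3 * alpha2 = pi4 * beta2
    total = w1 + w2 + w3 + w4 + w5
    pi = [w / total for w in (w1, w2, w3, w4, w5)]
    # A = pi1 + pi3 + (pi2 + pi5) / 2, as stated for equation (4).
    return pi, pi[0] + pi[2] + 0.5 * (pi[1] + pi[4])


# Hypothetical rates, chosen only to exercise the formula.
pi, a = active_active_availability(
    alpha1=0.001, alpha2=0.002, beta1=0.1, beta2=0.05, gamma1=1.0, gamma2=2.0
)
lhs = (0.002 + 0.1) * pi[2]
rhs = pi[3] * 0.05 + pi[1] * 1.0
print(f"A = {a:.6f}, balance check: {lhs:.9f} vs {rhs:.9f}")
```

The printed balance check compares both sides of the third relation of equation (4) for the chosen rates.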

FIG. 10 is a table illustrating an availability probability prediction according to variation of MTTF in a variable active and passive node establishment according to an embodiment of the present invention. FIG. 10 may be considered a table summarizing the cases as illustrated in FIGS. 7 through 9 and other cases.

Referring to FIG. 10, a case 1035 where there are five nodes is less sensitive to small MTTF than cases 1032, 1033 and 1034 where there are four or less nodes. The availability probability (number of nines) 1010 of the case 1035 is larger than that of the cases 1032, 1033 and 1034.

Referring to FIG. 11, in operation S1110, all configurations of cluster systems constituted of N_hn head nodes and N_sw switches are enumerated.

When the total number of nodes is U (1≦U), N_hn satisfies 1≦N_hn≦U, N_sw satisfies 1≦N_sw≦U, and N_hn and N_sw may be the same value.

Thereafter, in operation S1120, an availability probability P_hn-sw for each configuration of the cluster systems is calculated by the following equation (5), in which the availability probability according to a survival state and a state transition of each head node and each switch in the corresponding configuration is calculated using the Markov chain:

$\begin{matrix} P_{hn - sw} = \sum_{k \in U} \pi_{k} = \frac{T}{E} \\ T = \sum_{i = 0}^{N_{sw} - 1} \frac{N_{sw}!}{(N_{sw} - i)!} \left( \frac{\lambda_{sw}}{\delta_{sw}} \right)^{i} \left( 1 + \sum_{j = 1}^{N_{hn} - 1} \frac{N_{hn}!}{(N_{hn} - j)!} \left( \frac{\lambda_{hn}}{\delta_{hn}} \right)^{j} \right) \\ E = T + \sum_{i = 0}^{N_{sw} - 1} \frac{N_{sw}!}{(N_{sw} - i)!} \left( \frac{\lambda_{sw}}{\delta_{sw}} \right)^{i} N_{hn}! \left( \frac{\lambda_{hn}}{\delta_{hn}} \right)^{N_{hn}} + \sum_{j = 1}^{N_{hn} - 1} \frac{N_{hn}!}{(N_{hn} - j)!} \left( \frac{\lambda_{hn}}{\delta_{hn}} \right)^{j} N_{sw}! \left( \frac{\lambda_{sw}}{\delta_{sw}} \right)^{N_{sw}} \\ \lambda_{\theta} = \frac{1}{MTTF}, \quad \delta_{\theta} = \frac{1}{MTTR}, \quad \theta \in \{ hn, sw \} & (5) \end{matrix}$

where T is a probability that the system survives, E is an entire probability of the system, and π_k is a probability for each state.

According to the result of the calculation, in operation S1130, a combination between the head nodes and the switches when the availability probability is the maximum value is determined as an optimal cluster combination.

The optimal number of the nodes is determined by checking if the availability probability P_hn-sw for the number of nodes of a certain range meets a pre-established reference availability probability.

FIG. 12 is a diagram illustrating an availability probability prediction of a cluster system by means of a continuous-time Markov chain (CTMC) according to an embodiment of the present invention. Hereinafter, the availability probability prediction will be described with reference to FIG. 12 and Table 1:

TABLE 1

State	Head node	Switch	System state
State 1	2	1	active
State 2	1	1	active
State 3	0	1	down
State 4	2	0	down
State 5	1	0	down

In Table 1 above, the first state is a state in which two head nodes and one switch survive, the second state is a state in which one head node and one switch survive, the third state is a state in which only one switch survives, the fourth state is a state in which only two head nodes survive, and the fifth state is a state in which only one head node survives.

In this case, the first and second states are active states, and the third, fourth and fifth states are down states.

Hereinafter, each probability for each state and state variations will be described using the CTMC.

Since two head nodes survive in the first state, the transition rate from the first state to the second state is 2λ_hn. Furthermore, since one head node survives in the second state, the transition rate from the second state to the third state is λ_hn. Similarly, the transition rates from the third state to the second state and from the second state to the first state are δ_hn.

Furthermore, since one switch survives, the transition rates from the first state to the fourth state and from the second state to the fifth state are λ_sw, and the transition rates from the fourth state to the first state and from the fifth state to the second state are δ_sw.
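The five-state chain described above may be evaluated with the following Python sketch, offered as an illustration under stated assumptions. Because the transitions listed above connect the states as a tree (1↔2, 2↔3, 1↔4, 2↔5), the steady-state flow across each pair of neighbouring states balances, so each probability can be written relative to the first state and then normalized. The function name and the MTTF/MTTR values used for the rates are hypothetical.

```python
def table1_steady_state(lam_hn, delta_hn, lam_sw, delta_sw):
    """Return ({state: probability}, availability) for the five states of Table 1."""
    w = {1: 1.0}
    w[2] = w[1] * 2 * lam_hn / delta_hn  # 1 -> 2 at 2*lam_hn, 2 -> 1 at delta_hn
    w[3] = w[2] * lam_hn / delta_hn      # 2 -> 3 at lam_hn,   3 -> 2 at delta_hn
    w[4] = w[1] * lam_sw / delta_sw      # 1 -> 4 at lam_sw,   4 -> 1 at delta_sw
    w[5] = w[2] * lam_sw / delta_sw      # 2 -> 5 at lam_sw,   5 -> 2 at delta_sw
    total = sum(w.values())
    pi = {state: weight / total for state, weight in w.items()}
    return pi, pi[1] + pi[2]  # states 1 and 2 are the active states


# Hypothetical rates: head node MTTF 1000 and MTTR 10; switch MTTF 5000 and MTTR 10.
lam_hn, delta_hn, lam_sw, delta_sw = 1 / 1000, 1 / 10, 1 / 5000, 1 / 10
pi, availability = table1_steady_state(lam_hn, delta_hn, lam_sw, delta_sw)
print({s: round(p, 8) for s, p in pi.items()}, f"availability = {availability:.8f}")
```

For larger configurations, whose transition graph need not be a tree, the same probabilities would be obtained by solving the full set of balance equations instead.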

According to the present invention, an optimal configuration of nodes is possible because the availability probability of a cluster system can be predicted according to the number of nodes and the node components.

In a method for predicting an availability of a high availability cluster according to embodiments of the present invention, the optimal number of nodes in the high availability cluster system can be determined according to a required availability level.

In addition, since a configuration of a relatively high availability system can be predicted using a predetermined number of nodes, it is possible to accomplish an optimal high availability cluster system.

Furthermore, a future state of the system can effectively be predicted from a past state of the system by predicting the availability of the high availability cluster by Markov chain.

As the present invention may be embodied in several forms without departing from the spirit or essential characteristics thereof, it should also be understood that the above-described embodiments are not limited by any of the details of the foregoing description, unless otherwise specified, but rather should be construed broadly within its spirit and scope as defined in the appended claims, and therefore all changes and modifications that fall within the metes and bounds of the claims, or equivalents of such metes and bounds are therefore intended to be embraced by the appended claims.

Claims

1. A method for predicting an availability of a high availability cluster, the method comprising: calculating, by a computer of a cluster system, a basic survival probability that the other node survives until a failure on one node of two nodes constituting a cluster is fixed; and determining, by the computer of the cluster system, an optimal number of nodes meeting a preset reference availability probability by calculating an availability probability for a predetermined range of the number of nodes on the basis of the basic survival probability, wherein the availability probability, which is a probability that other nodes survive until a failure on one node of entire nodes is fixed, is calculated by the following equation:

2. A method for predicting an availability of a high availability cluster, the method comprising: enumerating, by a computer of a cluster system, all configurations of a cluster system variable with the number of head nodes and the number of switches; calculating, by the computer of the cluster system, an availability probability for each of the enumerated configurations; and determining, by the computer of the cluster system, a combination between the head nodes and the switches when the availability probability is the maximum value as an optimal configuration of the cluster, wherein the availability probability according to a survival state and a state transition for each head node and each switch in the corresponding combination is calculated using a Markov chain, wherein the availability probability is obtained by omitting a survival probability of the system from an entire probability of the system, and wherein the survival probability of the system is calculated by the following equation:

3. A method for predicting an availability of a high availability cluster, the method comprising: enumerating, by a computer of a cluster system, all configurations of a cluster system variable with the number of head nodes and the number of switches; calculating, by the computer of the cluster system, an availability probability for each of the enumerated configurations; and determining, by the computer of the cluster system, a combination between the head nodes and the switches when the availability probability is the maximum value as an optimal configuration of the cluster, wherein the availability probability according to a survival state and a state transition for each head node and each switch in the corresponding combination is calculated using a Markov chain, wherein the availability probability is obtained by omitting a survival probability of the system from an entire probability of the system, and wherein the entire probability of the system is calculated by the following equation:

Priority Claims (1)

Number	Date	Country	Kind
10-2007-0127904	Dec 2007	KR	national

US Referenced Citations (2)

Number	Name	Date	Kind
7228460	Pomaranski et al.	Jun 2007	B2
20060136772	Guimbellot et al.	Jun 2006	A1

Foreign Referenced Citations (3)

Number	Date	Country
10-0404906	Oct 2003	KR
10-2006-0068873	Jun 2006	KR
10-0693663	Mar 2007	KR

Related Publications (1)

	Number	Date	Country
	20090150717 A1	Jun 2009	US
