The present invention relates generally to fault-tolerance and more specifically to determining whether a node in a network can be safely removed without adversely affecting the remainder of the network.
Fault tolerant systems, by definition, are systems which can survive the failure of one or more components. These failures may happen alone, in an isolated fashion, or together, with one fault triggering a cascade of additional faults among separate components. The faults may be caused by a variety of factors, including software errors, power interruptions, mechanical failures or shocks to the system, electrical shorts, or user error.
When an individual component fails in a typical computer system, the entire computer system frequently fails. In a fault-tolerant system, however, such system-wide failure must be prevented. Failures must be isolated, to the extent possible, and should be repairable without taking the fault tolerant system offline.
In addition, administrators of fault tolerant systems must have the ability to safely remove interchangeable modules within the system for routine inspection, cleaning, maintenance, and replacement. Ideally, fault tolerant systems would continue operating, even with some modules removed.
Towards that end, it would be useful to determine which components are critical to the continued operation of a fault tolerant system, and which components may fail or be removed by administrators without jeopardizing the stability of the entire system. Thus, a need exists for solutions capable of determining whether or not a fault-tolerant system would be adversely affected by the removal or failure of each component within that system.
In satisfaction of that need, the claimed invention provides systems and methods which assess the criticality of each component in a fault-tolerant system and determine whether any individual component may safely fail or be removed, that is, whether the component is safe to pull.
In one aspect, the claimed invention includes a method for determining whether a node in a non-recursive network can be removed. The method includes the steps of executing a reachability algorithm for a resource of a system upon initialization of the system. The resource is accessible to the system upon the initialization. A safe to pull manager evaluates the reachability algorithm for each node situated on the network to determine whether the node can be removed without interrupting resource accessibility to the system.
In one embodiment, the method includes updating the reachability algorithm when the network is updated. The method also includes adding a new node, removing a node, and recognizing a node failure. In yet another embodiment, the method includes signaling when the node can be removed from the network and when the node is unsafe to remove from the network. The signaling can include using a first indicator when a node is unsafe to remove and using a second indicator when a node is safe to remove. The evaluating of whether the node can be removed also includes determining whether the node is a root node and whether a threshold number of parent nodes exist for the node. The evaluating of whether the node can be removed can also include simulating a failure of the node. In one embodiment, the simulating of the failure of the node includes setting a variable in the reachability algorithm that corresponds with the node to a predetermined number.
In another aspect, a network includes a computer system having a safe to pull manager, a resource in communication with the computer system upon initialization of the system, and nodes connected between the resource and the system, wherein the safe to pull manager executes a reachability algorithm for the resource and for each node to determine whether a node can be removed without interrupting resource communication with the system.
In one embodiment, the nodes represent devices. Further, the computer system may execute a program that can access one or more resources. In one embodiment, the determination of whether the node can be removed includes simulating a failure of the node in the reachability expression. In some embodiments, the system is a power grid system. Moreover, the nodes may represent power sinks. In other embodiments, the system is a telephone system and each node represents a telephone transceiver. In some embodiments, the resource includes a disk volume, a network adapter, a physical disk, a program and a network. Further, the node can include a disk mirror, a Small Computer System Interface (SCSI) adapter, a disk, a central processing unit (CPU), an input/output (I/O) board, and a network interface card (NIC).
The advantages of the invention described above, together with further advantages, may be better understood by referring to the following description taken in conjunction with the accompanying drawings. In the drawings, like reference characters generally refer to the same parts throughout the different views. Also, the drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention.
Referring to
Each node 108 comprises a physical device connecting the computer system 104 to the resource 112. Thus, each node 108 may comprise a device such as an I/O board, a SCSI adapter, or any other physical device connecting the computer system 104 to the resource 112. Although the network 100 is shown with only a single computer system 104, it can have any number of computer systems, each comprising a plurality of resources 112 and nodes 108.
The computer system 104 preferably executes an instruction set 116, such as an application or operating system. At initialization (e.g., boot up) of the system 104, the system verifies that the resource 112 is accessible to the system 104 through the nodes 108. If there are multiple paths between the computer system 104 and the resource 112, then some nodes 108 may be considered safe to pull. A node 108 will be deemed safe to pull if it can fail, be disconnected, or be removed without disconnecting the resource 112 from the computer system 104. Conversely, if a node 108 is required for continued operation of the network 100 and connectivity between the computer system 104 and the resource 112, then it will be deemed unsafe to pull or remove from the network 100.
In one embodiment, one or more of the nodes 108 represent the physical devices of the resource 112. Examples of various nodes 108 include a disk mirror, a Small Computer System Interface (SCSI) adapter, a disk, a central processing unit (CPU), an input/output (I/O) board, and/or a network interface card (NIC). The node 108 can represent a software and/or a hardware device.
Thus, the resource 112 can be a disk volume organized in a RAID 5 configuration implemented on a set of physical disks (i.e., nodes 108) so that a single disk failure does not prevent the computer system's access to the volume 112. The resource 112 can also be a physical disk that can be accessed via two adapters (dual initiated disks) (i.e., two nodes 108) so that even if one adapter (108) fails, the disk (112) can be accessed by the other adapter (108). In another embodiment, the processing capability of the computer system 104 can be supported by two CPUs executing in lock-step so that if one CPU fails, the other CPU continues the processing transparently.
Preferably, a safe to pull manager 124 generates a reachability expression 120 (determined through the use of a reachability algorithm) during initialization of the computer system 104. The safe to pull manager 124 recursively evaluates the reachability algorithm 120 to determine whether each node 108 can be removed without interrupting or eliminating the availability of the resource 112 to the system 104, that is, whether the node 108 is safe to pull.
A node 108 may be reachable in one of two ways. First, if the node 108 is part of the programs executing on the computer system 104, the node 108 is deemed reachable. Alternatively, if a sufficient number of the node's parents are deemed reachable, the node 108 is deemed reachable. For example, and as described in more detail below, a dual initiated disk 108 requires only a single parent to be reachable, whereas a plex (i.e., one mirror of a disk volume) organized in RAID 5 on five disks requires four parents to be reachable. The number of parents that must be reachable in order for the node 108 itself to be reachable is referred to below as the threshold number. The threshold number may be stored in a property of the node 108 or resource 112. The safe to pull manager 124 can set the threshold number of a resource 112 to zero in order to ignore that resource 112; because every node 108 or resource 112 has zero or more parents, a threshold of zero is always satisfied.
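By way of illustration only, the reachability rule described above may be sketched in a few lines of Python. The sketch is not the claimed implementation; it assumes the graph is supplied as plain dictionaries mapping each node name to its parent nodes and to its threshold number, with the root given a threshold of zero so that it is always reachable:

# Illustrative sketch only: a node is reachable if its threshold is zero
# (e.g., the root) or if at least `threshold` of its parents are reachable.
# The dictionary-based representation and the node names are assumptions.

def reachable(node, parents, thresholds, broken=frozenset()):
    """Return True if `node` is reachable from the root.

    parents    -- dict mapping each node to a list of its parent nodes
    thresholds -- dict mapping each node to its threshold number
    broken     -- set of nodes that have failed or been removed
    """
    if node in broken:
        return False
    threshold = thresholds.get(node, 1)
    if threshold == 0:
        return True
    count = sum(1 for p in parents.get(node, [])
                if reachable(p, parents, thresholds, broken))
    return count >= threshold


# Example: a dual initiated disk needs only one of its two adapters.
parents = {"disk": ["adapter_a", "adapter_b"],
           "adapter_a": ["root"], "adapter_b": ["root"], "root": []}
thresholds = {"disk": 1, "adapter_a": 1, "adapter_b": 1, "root": 0}

print(reachable("disk", parents, thresholds))                        # True
print(reachable("disk", parents, thresholds, broken={"adapter_a"}))  # True

Placing a node in the broken set in this sketch corresponds to the failed or removed state discussed below.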
As illustrated, the program resource 208 needs either the first CPU 260 or the second CPU 264 to be operational in order to be reachable by the root 204. Similarly, the first volume 212 needs either the first plex 224 or the second plex 228 to be operational to be reachable. The first plex 224 requires the first disk 232 to be operational, and the first disk 232 needs the first SCSI adapter 240 to be operational. Arrows, such as a first arrow 276, illustrate the remaining dependencies of the resources 112 and nodes 108.
In this graph 200, when the devices associated with the nodes 108 are online, all devices are safe to pull. For example, the first disk 232 can be removed because all of its resources 112 (i.e., the first volume 212 and the second volume 216) have another path to the root 204 (via the second plex 228, the second disk 236, the second SCSI adapter 244, and the second I/O board 256). If the first disk 232 fails or is removed, then all of the devices represented by the path of the second plex 228, the second disk 236, the second SCSI adapter 244, and the second I/O board 256 become unsafe to pull or remove because if one of these devices fails, the first volume 212 and the second volume 216 are no longer reachable by the root 204.
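By way of example and not limitation, the dependency graph 200 described above can be encoded as parent lists pointing toward the root 204. The encoding below is an approximation for purposes of illustration: the connection of the first SCSI adapter 240 to a first I/O board, and the connections of the CPUs 260, 264 and the I/O boards to the root 204, are assumptions, since the figure itself is not reproduced here:

# Illustrative encoding of graph 200 as parent lists (toward the root 204).
# Identifiers and the connections marked "assumed" are illustrative only.

parents_200 = {
    "program_208": ["cpu_260", "cpu_264"],
    "cpu_260": ["root_204"],                 # assumed
    "cpu_264": ["root_204"],                 # assumed
    "volume_212": ["plex_224", "plex_228"],
    "volume_216": ["plex_224", "plex_228"],
    "plex_224": ["disk_232"],
    "plex_228": ["disk_236"],
    "disk_232": ["scsi_adapter_240"],
    "disk_236": ["scsi_adapter_244"],
    "scsi_adapter_240": ["io_board_first"],  # first I/O board; numeral not given above
    "scsi_adapter_244": ["io_board_256"],
    "io_board_first": ["root_204"],          # assumed
    "io_board_256": ["root_204"],            # assumed
    "root_204": [],
}

# Each of these nodes needs only one reachable parent; the root needs none.
thresholds_200 = {name: 1 for name in parents_200}
thresholds_200["root_204"] = 0

A structure of this kind can be supplied directly to a reachability check such as the one sketched above; in keeping with the discussion above, every device is then reported as safe to pull while both paths are intact, and the devices along the second path are reported as unsafe to pull once the first disk 232 is removed.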
As described above, the safe to pull manager 124 preferably generates a reachability algorithm 120 having variables that represent the nodes 108 in the graph 200. Each variable takes a value of 1 if the node 108 is not broken (e.g., has not failed and has not been removed) and takes a value of 0 if the node 108 is broken (e.g., has failed or has been removed). The value of each variable is independent of the states of the other nodes 108.
A reachability expression R can be defined recursively as follows: for a node N having parents P1, P2, . . . , Pn and a required minimum number of parents T, R(N)=N(R(P1)+R(P2)+ . . . +R(Pn)):T, where R evaluates to 1 for the root node 204.
The required minimum number of parents T is the threshold number described above. The expression "(A+B+ . . . ):T" evaluates to 1 if and only if "A+B+ . . . " is greater than or equal to T. Thus, for example, (1+0+1):2 evaluates to 1, (1+0+0):2 evaluates to 0, and any such expression with T equal to zero evaluates to 1.
To test whether a node 108 is safe to pull for a given resource 112, the safe to pull manager 124 sets the corresponding variable to 0, thereby simulating a pull or failure of the device, and evaluates the reachability expression. If the reachability expression returns a value of 1, the node 108 is safe to pull for the given resource 112. The device is safe to pull if, for all resources 112, the reachability expression returns a value of 1 when the variable representing the corresponding node 108 is set to 0.
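By way of illustration only, this test can be sketched as follows. The sketch assumes the same dictionary representation used above and is not the claimed implementation; a node is reported as safe to pull when every resource remains reachable with that node's variable forced to 0:

# Illustrative sketch of the safe to pull test: treat one node as broken
# (its variable forced to 0) and check that every resource stays reachable.

def reachable(node, parents, thresholds, broken=frozenset()):
    if node in broken:
        return False
    threshold = thresholds.get(node, 1)
    if threshold == 0:
        return True
    count = sum(1 for p in parents.get(node, [])
                if reachable(p, parents, thresholds, broken))
    return count >= threshold

def safe_to_pull(node, resources, parents, thresholds):
    """A node is safe to pull if every resource stays reachable without it."""
    return all(reachable(r, parents, thresholds, broken={node})
               for r in resources)

# Dual initiated disk: the disk resource stays reachable through either
# adapter, so each adapter is safe to pull.
parents = {"disk": ["adapter_a", "adapter_b"],
           "adapter_a": ["root"], "adapter_b": ["root"], "root": []}
thresholds = {"disk": 1, "adapter_a": 1, "adapter_b": 1, "root": 0}

for name in ("adapter_a", "adapter_b"):
    print(name, safe_to_pull(name, ["disk"], parents, thresholds))  # both True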
Referring to
As shown, each of the disk nodes G and H has a single parent node. The resource V has a threshold of 1 because it is replicated on two mirrored plexes and is operational as long as one of its parents is operational. Each of the plexes I and J has a threshold of 2 because each plex is stored on two disks in striping mode; if one disk is lost, the entire plex is lost. The other nodes 108 have a threshold of 1.
Thus, the safe to pull manager 124 generates a reachability expression for the volume V. Specifically, this reachability expression is R(V)=V(R(I)+R(J)):1. This expression can be expanded by substituting the reachability expressions of the plexes I and J and, in turn, of their underlying disks and adapters; when all of the nodes are online, every variable is 1 and the expression evaluates to 1.
Furthermore, the expression also evaluates to 1 even when any single device fails. This can be tested by setting any single variable to 0 and evaluating the expression. Therefore, the system is fault tolerant; it may continue running despite any single point of failure. The expression can also be used to evaluate how the RAID 10 system behaves with two or more points of failure. For example, if both nodes C and B fail, the resource V is no longer reachable. If, however, G and H fail, the resource V is still reachable.
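The same evaluation extends to two or more simultaneous failures simply by forcing more than one variable to 0. By way of illustration only, the sketch below uses a hypothetical RAID 10 layout: the assignment of the disks G and H to a single plex, the disks K and L, and the adapter names are assumptions made for this example and are not taken from the figure:

# Illustrative sketch: evaluating the expression with several variables
# forced to 0 at once.  Volume V is mirrored on plexes I and J, each plex
# striped over two disks (threshold 2).  Names other than V, I, J, G and H
# are hypothetical.

def reachable(node, parents, thresholds, broken=frozenset()):
    if node in broken:
        return False
    threshold = thresholds.get(node, 1)
    if threshold == 0:
        return True
    count = sum(1 for p in parents.get(node, [])
                if reachable(p, parents, thresholds, broken))
    return count >= threshold

parents = {
    "V": ["I", "J"],
    "I": ["G", "H"],                       # plex I striped over disks G and H (assumed)
    "J": ["K", "L"],                       # plex J striped over two hypothetical disks
    "G": ["adapter_1"], "H": ["adapter_1"],
    "K": ["adapter_2"], "L": ["adapter_2"],
    "adapter_1": ["root"], "adapter_2": ["root"], "root": [],
}
thresholds = {"V": 1, "I": 2, "J": 2, "root": 0}

print(reachable("V", parents, thresholds, broken={"G", "H"}))   # True: plex J survives
print(reachable("V", parents, thresholds,
                broken={"adapter_1", "adapter_2"}))             # False: both plexes lost

In this hypothetical layout, losing both disks of one plex leaves the mirrored plex intact, whereas losing both adapters severs every path, which parallels the two-failure behavior discussed above.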
Referring to
When all nodes are online, all corresponding variables are 1 and the expression evaluates to 1. This means that the resource V is reachable. If, however, A or C is 0, R(V)=0. This means that the physical devices corresponding to the nodes A and C are not safe to pull. Thus, the safe to pull manager 124 uses the reachability expression to illustrate that a RAID 5 system does not work well with single initiated disks.
When the nodes 108 are present and not broken, all corresponding variables are 1 and the reachability expression evaluates to 1, so the resource V is reachable. Moreover, the expression still evaluates to 1 under any single point of failure, so all devices are safe to pull. To continue operating, however, the following pairs of devices must never fail together: (E, F), (E, G), (F, G), (C, D), (A, B), (C, B), (A, D). Thus, RAID 5 systems work well with dual initiated disks.
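Pairs of this kind can be derived mechanically from the reachability expression by simulating every pair of failures among the individually pullable nodes. By way of illustration only, the following sketch performs such an enumeration on a small hypothetical layout (a RAID 5 plex on three dual initiated disks); the node names and the layout are assumptions and do not correspond to the lettered nodes of the figure:

# Illustrative sketch: enumerate pairs of individually pullable devices whose
# joint failure would make some resource unreachable.  The layout and names
# below are hypothetical.

from itertools import combinations

def reachable(node, parents, thresholds, broken=frozenset()):
    if node in broken:
        return False
    threshold = thresholds.get(node, 1)
    if threshold == 0:
        return True
    count = sum(1 for p in parents.get(node, [])
                if reachable(p, parents, thresholds, broken))
    return count >= threshold

def critical_pairs(resources, parents, thresholds):
    """Pairs of individually pullable nodes that must never fail together."""
    def ok(broken):
        return all(reachable(r, parents, thresholds, broken) for r in resources)
    pullable = [n for n in parents
                if n not in resources and thresholds.get(n, 1) != 0 and ok({n})]
    return [pair for pair in combinations(sorted(pullable), 2)
            if not ok(set(pair))]

# Hypothetical RAID 5 plex on three disks, each disk dual initiated.
parents = {
    "V": ["plex"],
    "plex": ["d1", "d2", "d3"],                                  # any two of three disks suffice
    "d1": ["a1", "a2"], "d2": ["a1", "a2"], "d3": ["a1", "a2"],  # dual initiated
    "a1": ["root"], "a2": ["root"], "root": [],
}
thresholds = {"V": 1, "plex": 2, "root": 0}

print(critical_pairs(["V"], parents, thresholds))
# [('a1', 'a2'), ('d1', 'd2'), ('d1', 'd3'), ('d2', 'd3')]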
Referring to
In more detail, each node 108 contains Boolean variables that record its state for the determinations described below.
In one embodiment, the safe to pull manager's determination of whether a node 108 is reachable from the root 204 (step 604) and its determination of whether the node 108 belongs to a path from a reachable resource 112 (step 608) are accomplished in three steps. In particular and also referring to
The safe to pull manager's computation of the safe to pull state of pullable nodes 108, as described above in step 612 of
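Although the figures and the step listing referenced above are not reproduced here, the per-node Boolean state and the multi-pass computation described in this embodiment can be sketched as follows, by way of illustration only; the flag names, the notion of a pullable node, and the exact order of the passes are assumptions consistent with steps 604, 608 and 612 as summarized above:

# Illustrative sketch of per-node Boolean variables and a three-pass
# computation: (1) mark nodes reachable from the root (cf. step 604),
# (2) mark nodes lying on a path from a reachable resource (cf. step 608),
# and (3) compute the safe to pull state of the pullable nodes (cf. step 612).
# Flag names and pass structure are assumptions, not the original listing.

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    parents: list = field(default_factory=list)   # parent node names (toward the root)
    threshold: int = 1                             # required reachable parents; 0 = always reachable
    broken: bool = False                           # the 0/1 variable of the reachability expression
    reachable: bool = False                        # set in pass 1
    on_path: bool = False                          # set in pass 2
    safe_to_pull: bool = False                     # set in pass 3

def evaluate(nodes, resources):
    """nodes: dict of name -> Node; resources: names of the resources 112."""
    def is_reachable(name, also_broken=frozenset()):
        n = nodes[name]
        if n.broken or name in also_broken:
            return False
        if n.threshold == 0:
            return True
        return sum(is_reachable(p, also_broken) for p in n.parents) >= n.threshold

    # Pass 1: reachability from the root.
    for n in nodes.values():
        n.reachable = is_reachable(n.name)

    # Pass 2: a node is on a path if some reachable resource depends on it.
    def mark_path(name):
        if nodes[name].on_path:
            return
        nodes[name].on_path = True
        for p in nodes[name].parents:
            mark_path(p)
    for r in resources:
        if nodes[r].reachable:
            mark_path(r)

    # Pass 3: a pullable node is safe to pull if every resource remains
    # reachable when that node alone is treated as broken.
    for n in nodes.values():
        if n.on_path and n.threshold != 0 and n.name not in resources:
            n.safe_to_pull = all(is_reachable(r, also_broken={n.name})
                                 for r in resources)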
It will be appreciated, by those skilled in the art, that various omissions, additions and modifications may be made to the methods and systems described above without departing from the scope of the invention. All such modifications and changes are intended to fall within the scope of the invention as illustrated by the appended claims.