These and other features of this invention will be more readily understood from the following detailed description of the various aspects of the invention taken in conjunction with the accompanying drawings in which:
The drawings are merely schematic representations, not intended to portray specific parameters of the invention. The drawings are intended to depict only typical embodiments of the invention, and therefore should not be considered as limiting the scope of the invention. In the drawings, like numbering represents like elements.
An illustrative multi-tier system 10 employing a failure detection methodology in accordance with an embodiment of the present invention is depicted in
Each tier 12, 16 of the multi-tier system 10 further includes a local failure detection system. In particular, as shown in
As further illustrated in
Each tier 12, 16 further includes a central aggregation point comprising a high availability manager (HAM) 32, 34, respectively, for overseeing the operation of the local failure detection system 24, 28 in the tier. With regard to tier 12, for example, the HAM 32 is configured to obtain and report the failure status of the members 20-1, 20-2, . . . , 20-N of the component cluster 14, as provided by the local failure detection system 24, and to respond to application requests accordingly. Similarly, with regard to tier 16, the HAM 34 is configured to obtain and report the failure status of the members 22-1, 22-2, . . . , 22-N of the component cluster 18, as provided by the local failure detection system 28, and to respond to application requests accordingly. Although the HAMs 32, 34 are depicted in
A data connection 36 is provided between the HAMs 32, 34. The location of the HAM 32 in the tier 12 is communicated to the tier 16, and the location of the HAM 34 in the tier 16 is communicated to the tier 12. In general, in accordance with the present invention, the location of the HAM in each tier of a multi-tier system is communicated to the HAM in each other tier of the multi-tier system. This ensures that each HAM can provide component status information to each other HAM.
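By way of illustration only, the exchange of HAM locations between tiers can be sketched as follows. This is a minimal sketch rather than the claimed implementation: the names `HamLocation`, `Ham`, `learn_peer`, and `exchange_locations`, as well as the host names and port numbers, are hypothetical and do not appear in the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class HamLocation:
    """Hypothetical address record for the HAM of one tier."""
    tier: str
    host: str
    port: int


@dataclass
class Ham:
    """Hypothetical HAM that remembers where the HAMs of other tiers are."""
    location: HamLocation
    peers: Dict[str, HamLocation] = field(default_factory=dict)

    def learn_peer(self, peer: HamLocation) -> None:
        # Record only the locations of HAMs in *other* tiers.
        if peer.tier != self.location.tier:
            self.peers[peer.tier] = peer


def exchange_locations(hams: List[Ham]) -> None:
    """Communicate the location of every HAM to the HAM of every other tier."""
    for ham in hams:
        for other in hams:
            if other is not ham:
                ham.learn_peer(other.location)


# Assumed example values; the description does not specify hosts or ports.
ham_32 = Ham(HamLocation(tier="tier-12", host="ham32.example", port=9001))
ham_34 = Ham(HamLocation(tier="tier-16", host="ham34.example", port=9002))
exchange_locations([ham_32, ham_34])
```

After the exchange, each HAM holds one entry per remote tier, which is the precondition for sending status information over a data connection such as the data connection 36.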
When the heartbeating mechanism 26 of the local failure detection system 24 determines that a member 20-1, 20-2, . . . , 20-N of the component cluster 14 in the tier 12 has failed, the failure status of that member is communicated by the HAM 32 over the data connection 36 to the HAM 34 in the tier 16. The HAM 34 then communicates information regarding the failure to each member 22-1, 22-2, . . . , 22-N of the component cluster 18, each of which then takes appropriate clean-up actions in response to the failure. Similarly, when the heartbeating mechanism 30 of the local failure detection system 28 determines that a member 22-1, 22-2, . . . , 22-N of the component cluster 18 in the tier 16 has failed, the failure status of that member is communicated by the HAM 34 over the data connection 36 to the HAM 32 in the tier 12. The HAM 32 then communicates information regarding the failure to each member 20-1, 20-2, . . . , 20-N of the component cluster 14, each of which then takes appropriate clean-up actions in response to the failure. To this extent, status changes (e.g., failure data) are communicated inter-tier via the data connection 36 only when needed.
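The inter-tier propagation described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the classes `Member` and `Ham`, the methods `local_failure_detected` and `receive_failure`, and the `sent` counter are hypothetical, and the data connection is modeled as a direct method call rather than a network link.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Member:
    """Hypothetical cluster member that records the failures it cleaned up after."""
    name: str
    alive: bool = True
    cleaned_up: List[str] = field(default_factory=list)

    def on_remote_failure(self, failed_name: str) -> None:
        # Placeholder for the "appropriate clean-up actions" of the description.
        self.cleaned_up.append(failed_name)


@dataclass
class Ham:
    """Hypothetical HAM: aggregates local status and forwards changes to its peer."""
    tier: str
    members: Dict[str, Member] = field(default_factory=dict)
    peer: Optional["Ham"] = None
    sent: int = 0  # number of messages sent over the (modeled) data connection

    def add_member(self, name: str) -> None:
        self.members[name] = Member(name)

    def local_failure_detected(self, name: str) -> None:
        """Called when the local heartbeating mechanism reports a failed member."""
        member = self.members[name]
        if not member.alive:
            return  # status unchanged, so nothing is sent inter-tier
        member.alive = False
        if self.peer is not None:
            self.sent += 1
            self.peer.receive_failure(name)

    def receive_failure(self, failed_name: str) -> None:
        """Relay a remote failure to each member of the local component cluster."""
        for member in self.members.values():
            member.on_remote_failure(failed_name)


ham_32, ham_34 = Ham("tier-12"), Ham("tier-16")
ham_32.peer, ham_34.peer = ham_34, ham_32
for n in ("20-1", "20-2"):
    ham_32.add_member(n)
for n in ("22-1", "22-2"):
    ham_34.add_member(n)

ham_32.local_failure_detected("20-1")
ham_32.local_failure_detected("20-1")  # repeated report; no second message
```

In this sketch, heartbeat traffic stays inside each tier, and only a change in failure status crosses the tier boundary, which reflects the statement that status changes are communicated inter-tier only when needed.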
An illustrative sample runtime sequence 50 depicting the interaction of the components in
The computer system 104 is shown as including a processing unit 108, a memory 110, at least one input/output (I/O) interface 114, and a bus 112. Further, the computer system 104 is shown in communication with at least one external device 116 and a storage system 118. In general, the processing unit 108 executes computer program code, such as the failure detection system 130, that is stored in memory 110 and/or storage system 118. While executing computer program code, the processing unit 108 can read and/or write data from/to the memory 110, storage system 118, and/or I/O interface(s) 114. Bus 112 provides a communication link between each of the components in the computer system 104. The at least one external device 116 can comprise any device (e.g., display 120) that enables a user (not shown) to interact with the computer system 104 or any device that enables the computer system 104 to communicate with one or more other computer systems.
In any event, the computer system 104 can comprise any general purpose computing article of manufacture capable of executing computer program code installed by a user (e.g., a personal computer, server, handheld device, etc.). However, it is understood that the computer system 104 and the failure detection system 130 are only representative of various possible computer systems that may perform the various process steps of the invention. To this extent, in other embodiments, the computer system 104 can comprise any specific purpose computing article of manufacture comprising hardware and/or computer program code for performing specific functions, any computing article of manufacture that comprises a combination of specific purpose and general purpose hardware/software, or the like. In each case, the program code and hardware can be created using standard programming and engineering techniques, respectively.
Similarly, the computer infrastructure 102 is only illustrative of various types of computer infrastructures that can be used to implement the invention. For example, in one embodiment, the computer infrastructure 102 comprises two or more computer systems (e.g., a server cluster) that communicate over any type of wired and/or wireless communications link, such as a network, a shared memory, or the like, to perform the various process steps of the invention. When the communications link comprises a network, the network can comprise any combination of one or more types of networks (e.g., the Internet, a wide area network, a local area network, a virtual private network, etc.). Regardless, communications between the computer systems may utilize any combination of various types of transmission techniques.
As previously mentioned, the failure detection system 130 enables the computer system 104 to detect the failure of a member of a component cluster in a tier of a multi-tier system (see, e.g.,
It is understood that some of the various systems shown in
While shown and described herein as a method and system for failure detection, it is understood that the invention further provides various alternative embodiments. For example, in one embodiment, the invention provides a computer-readable medium that includes computer program code to enable a computer infrastructure to provide failure detection. To this extent, the computer-readable medium includes program code, such as the failure detection system 130, which implements each of the various process steps of the invention. It is understood that the term “computer-readable medium” comprises one or more of any type of physical embodiment of the program code. In particular, the computer-readable medium can comprise program code embodied on one or more portable storage articles of manufacture (e.g., a compact disc, a magnetic disk, a tape, etc.), on one or more data storage portions of a computer system, such as the memory 110 and/or storage system 118 (e.g., a fixed disk, a read-only memory, a random access memory, a cache memory, etc.), and/or as a data signal traveling over a network (e.g., during a wired/wireless electronic distribution of the program code).
In another embodiment, the invention provides a business method that performs the process steps of the invention on a subscription, advertising, and/or fee basis. That is, a service provider could offer to provide failure detection as described above. In this case, the service provider can create, maintain, support, etc., a computer infrastructure, such as the computer infrastructure 102, that performs the process steps of the invention for one or more customers. In return, the service provider can receive payment from the customer(s) under a subscription and/or fee agreement and/or the service provider can receive payment from the sale of advertising space to one or more third parties.
In still another embodiment, the invention provides a method of failure detection. In this case, a computer infrastructure, such as the computer infrastructure 102, can be obtained (e.g., created, maintained, having made available to, etc.) and one or more systems for performing the process steps of the invention can be obtained (e.g., created, purchased, used, modified, etc.) and deployed to the computer infrastructure. To this extent, the deployment of each system can comprise one or more of (1) installing program code on a computer system, such as the computer system 104, from a computer-readable medium; (2) adding one or more computer systems to the computer infrastructure; and (3) incorporating and/or modifying one or more existing systems of the computer infrastructure, to enable the computer infrastructure to perform the process steps of the invention.
As used herein, it is understood that the terms “program code” and “computer program code” are synonymous and mean any expression, in any language, code or notation, of a set of instructions intended to cause a computer system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and (b) reproduction in a different material form. To this extent, program code can be embodied as one or more types of program products, such as an application/software program, component software/a library of functions, an operating system, a basic I/O system/driver for a particular computing and/or I/O device, and the like.
The foregoing description of the preferred embodiments of this invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously, many modifications and variations are possible.