The present invention is generally related to computer cluster environments, and data and resource management in such environments, and is particularly related to a system and method for use with a data grid cluster to support death detection.
Modern computing systems, particularly those employed by larger organizations and enterprises, continue to increase in size and complexity. Particularly, in areas such as Internet applications, there is an expectation that millions of users should be able to simultaneously access that application, which effectively leads to an exponential increase in the amount of content generated and consumed by users, and transactions involving that content. Such activity also results in a corresponding increase in the number of transaction calls to databases and metadata stores, which have a limited capacity to accommodate that demand.
In order to meet these requirements, a distributed data management and cache service can be run in the application tier so as to run in-process with the application itself, e.g., as part of an application server cluster. However, from time to time, one or more of the server machines in the application server cluster can be shut down, and/or the processes running on top of the server machines can be dysfunctional. There is a need to quickly detect such an event when it happens. This is the general area that embodiments of the invention are intended to address.
In accordance with an embodiment, a system and method are described for use with a data grid cluster to support death detection. A network ring is formed by connecting a plurality of process nodes in the data grid, wherein each node in the network ring watches another node. A death of a first process node in the network ring can be detected by a second process node, when the second process node notices that its connection to the first process node has closed. The second process node then informs the other process nodes in the network ring that the first process node is dead. In accordance with an embodiment, machine level death detection can also be supported in the data grid cluster by using an Internet Protocol (IP) monitor.
In accordance with an embodiment, as referred to herein a “data grid cluster”, or “data grid”, is a system comprising a plurality of computer servers which work together to manage information and related operations, such as computations, within a distributed or clustered environment. The data grid cluster can be used to manage application objects and data that are shared across the servers. Preferably, a data grid cluster should have low response time, high throughput, predictable scalability, continuous availability and information reliability. As a result of these capabilities, data grid clusters are well suited for use in computationally intensive, stateful middle-tier applications. Some examples of data grid clusters, e.g., the Oracle Coherence data grid cluster, can store the information in-memory to achieve higher performance, and can employ redundancy in keeping copies of that information synchronized across multiple servers, thus ensuring resiliency of the system and the availability of the data in the event of server failure. For example, Coherence provides replicated and distributed (partitioned) data management and caching services on top of a reliable, highly scalable peer-to-peer clustering protocol, with no single points of failure, and can automatically and transparently fail over and redistribute its clustered data management services whenever a server becomes inoperative or disconnected from the network.
Death Detection
In accordance with an embodiment, the system employs death detection as a cluster mechanism that can quickly detect whenever a cluster member in a data grid cluster has failed. Failed cluster members can be removed from the cluster, and the remaining cluster members notified about the departed member. By using death detection, a data grid cluster can differentiate between an actual member failure and a temporarily unresponsive member, such as the case when a Java Virtual Machine (JVM) conducts a full garbage collection. Using death detection, the system can achieve (1) instantaneous detection of failed processes; (2) quick death detection of failed machines; (3) a reduced chance of garbage-collection-based false positives; and (4) improved cluster stability.
In accordance with an embodiment, the data grid cluster can use a network ring, such as a Transmission Control Protocol (TCP) ring, for fast process-level death detection. Furthermore, the data grid cluster can use an Internet Protocol (IP) Monitor to detect machine or network-related deaths. Additionally, a packet timeout approach can be used as a catch-all death detection for process, machine, and network failures.
TCP Ring
In accordance with an embodiment, a data grid cluster can use a TCP ring for fast process-level death detection, which is useful in those cases when a process in the data grid cluster is abruptly terminated and does not have time to perform any shutdown logic. The remainder of the cluster can then identify that the process has departed from the cluster. Additionally, the TCP ring can be used to reduce false positives caused by long garbage collection pauses, swapping, and under-provisioned deployment environments, and can reduce death detection time.
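The connection-based detection described above can be illustrated with a minimal Python sketch. It assumes that the ring connection carries no application data, so that a read returning zero bytes means the watched peer has closed the connection. The function name `peer_has_closed` and its timeout parameter are hypothetical and are not part of any data grid API.

```python
import socket


def peer_has_closed(conn: socket.socket, timeout_s: float = 1.0) -> bool:
    """Return True if the watched peer has closed its end of the connection.

    Assumption: the ring connection is used for liveness only, so no payload
    bytes are expected and reading one byte does not discard useful data.
    """
    conn.settimeout(timeout_s)
    try:
        # A close by the peer (or by its operating system after the process
        # dies) shows up as a zero-length read.
        return conn.recv(1) == b""
    except socket.timeout:
        # Nothing arrived within the timeout: the connection is still open.
        return False
    except (ConnectionResetError, BrokenPipeError):
        # The connection was reset by the peer's side.
        return True
```

In a ring, each node would run a check of this kind on its connection to the node it watches, and would notify the remaining members when the check reports a closed connection.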
In accordance with an embodiment, the process nodes can be linked in the data grid cluster using a plurality of network connections 111-118 based on a network protocol, such as TCP. The linked process nodes can form a network ring, such as a TCP ring 110.
In accordance with an embodiment, different ordering schemes can be used to arrange the connection of the process nodes in a TCP ring. An exemplary ordering scheme can be based on seniority of each process node in order to decide a monitoring order for the process nodes in the TCP ring. For example, each process node can be monitored by a next senior process node in the TCP ring, while the senior-most node is monitored by the youngest node.
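The seniority-based ordering above can be sketched as follows. This is one reading of the scheme, in which "next senior" means the node immediately more senior than the watched node; the function name `build_watch_map` and the member names are hypothetical.

```python
from typing import Dict, List


def build_watch_map(members_by_seniority: List[str]) -> Dict[str, str]:
    """Map each watched node to the node that monitors it.

    `members_by_seniority` is ordered from the senior-most node to the
    youngest node. A ring needs at least two members.
    """
    count = len(members_by_seniority)
    if count < 2:
        return {}
    watch_map: Dict[str, str] = {}
    for index, watched in enumerate(members_by_seniority):
        # The immediately more senior node is the watcher. For the
        # senior-most node (index 0) this wraps around to the youngest node.
        watcher = members_by_seniority[(index - 1) % count]
        watch_map[watched] = watcher
    return watch_map
```

For members listed as `["a", "b", "c"]` from senior-most to youngest, node `b` is watched by `a`, node `c` by `b`, and the senior-most node `a` by the youngest node `c`.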
In accordance with an embodiment, each secondary TCP ring can target a particular or special group of processes in the cluster, and each special group of processes can be identified by a role.
Different groups of processes in the cluster can be associated with different properties and configurations. For example, a high priority process such as a cache server can be configured with a very short garbage collection time, while a low priority process such as a management server can be configured using a relatively longer garbage collection time. If a management server is used to watch a cache server, then the cache server may terminate while the monitoring management server is experiencing garbage collection, and the death would go unnoticed until the garbage collection completes. Thus, it is preferable to have a high priority node watching another high priority node, for example using a high priority cache server to watch another high priority cache server in a server TCP ring.
In accordance with an embodiment, the data grid cluster can automatically assign a process to an existing network ring in the cluster based on a role associated with the process node. Additionally, users can define customized roles for process nodes in a data grid. The user-defined roles can be used to create user-defined network rings in the grid.
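Role-based ring assignment can be sketched as a simple grouping step. The sketch below assumes that each member declares exactly one role and that members within a role ring are ordered by seniority; the function names and the role strings in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def group_by_role(members: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (name, role) pairs into one ring per role.

    `members` is ordered from senior-most to youngest, and that order is
    preserved inside each ring.
    """
    rings: Dict[str, List[str]] = defaultdict(list)
    for name, role in members:
        rings[role].append(name)
    return dict(rings)


def watcher_of(rings: Dict[str, List[str]], role: str, name: str) -> Optional[str]:
    """Return the same-role node that watches `name`, or None if it is alone."""
    ring = rings[role]
    if len(ring) < 2:
        return None
    index = ring.index(name)
    # The immediately more senior member of the same role is the watcher.
    return ring[(index - 1) % len(ring)]
```

With two cache servers and one management server, the cache servers watch each other in their own ring, while the single management server has no same-role watcher and would have to be covered by another ring or mechanism.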
Machine Level Death Detection Using IP Monitor
In accordance with an embodiment, a data grid cluster can use a death detection mechanism such as an IP monitor to detect machine or network related deaths. The IP monitor death detection feature can use a timeout-based mechanism (such as failed pings) to monitor machine death itself.
In accordance with an embodiment, each server machine in the cluster wakes up periodically, for example every second, and randomly selects another server machine to ping 411-416. The pinged machine is then expected to respond to the ping within a particular (short) amount of time. If a series of pings goes unanswered, then the server machine is declared to have shut down and all nodes on that machine are declared dead. For example, in accordance with one embodiment, the default configuration can be set to a 15-second timeout, where each ping is waited on for 2 seconds.
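The timeout-based declaration of machine death can be sketched as a small bookkeeping class. This is one possible interpretation, in which a machine is declared dead once an unbroken streak of failed pings has lasted for the overall timeout (15 seconds in the example above); whether a single ping succeeded within its own wait (2 seconds in the example) is decided by the caller. The class name `PingTracker` and its methods are hypothetical.

```python
from typing import Dict


class PingTracker:
    """Track streaks of unanswered pings per machine."""

    def __init__(self, machine_timeout_s: float = 15.0) -> None:
        self.machine_timeout_s = machine_timeout_s
        # Machine name -> time at which its current failure streak began.
        self._streak_start: Dict[str, float] = {}

    def record(self, machine: str, responded: bool, now_s: float) -> bool:
        """Record one ping result; return True if the machine is declared dead."""
        if responded:
            # Any answered ping ends the failure streak.
            self._streak_start.pop(machine, None)
            return False
        # Keep the start of an existing streak, or begin a new one now.
        start = self._streak_start.setdefault(machine, now_s)
        return now_s - start >= self.machine_timeout_s
```

Passing the current time explicitly keeps the sketch deterministic; a real monitor would read a clock and would treat every node hosted on a dead machine as dead.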
In accordance with an embodiment, quorum policies can also be used to determine whether the data grid cluster can kill a particular machine, or a fraction of the server machine cluster, in order to prevent an undesirable situation in the cluster such as a split-brain scenario. Additionally, the quorum policies can be user defined policies or vendor supplied policies. These quorum policies can take higher precedence over the data grid cluster itself. For example, a quorum policy can define that the data grid cluster cannot kill ten nodes at once, since that can cause a severe shortage of resources in the cluster. Additional descriptions of various embodiments of using quorum policies in a data grid cluster are provided in U.S. patent application Ser. No. 13/352,203, filed Jan. 17, 2012, entitled “SYSTEM AND METHOD FOR USING CLUSTER LEVEL QUORUM TO PREVENT SPLIT BRAIN SCENARIO IN A DATA GRID CLUSTER” and U.S. patent application Ser. No. 13/352,209, filed Jan. 17, 2012, entitled “SYSTEM AND METHOD FOR SUPPORTING SERVICE LEVEL QUORUM IN A DATA GRID CLUSTER”, each of which applications is herein incorporated by reference.
Packet Timeout
In accordance with an embodiment, a packet timeout approach, which acts as a catch-all death detection feature (process, machine, and network), can be used with a data grid cluster to support death detection when other death detection features are disabled in the data grid. For example, a server machine or computer in the cluster can use a packet publisher's resend timeout interval to determine that another member has stopped responding to UDP packets. Each time a packet is transmitted across the cluster and no acknowledgement is received, the packet is re-sent. In one example, the default timeout interval can be set to 5 minutes. If no acknowledgement is received for a certain number of consecutive re-transmissions, then the node can be declared dead.
Additionally, quorum policies can be used as part of the packet timeout approach. As described above, these quorum policies can take higher precedence over the data grid cluster itself. A voting process can also be used in the packet timeout approach, where the cluster nodes in the data grid cluster can conduct a vote to decide which nodes should be ousted from the cluster.
Death Detection Configuration
In accordance with an embodiment, death detection can be configured to work by creating a ring of TCP connections between all cluster members. The TCP communication can be sent on the same port that is used for cluster UDP communication. Each cluster member issues a unicast heartbeat, and the most senior cluster member issues a cluster heartbeat, which is a broadcast message. Each cluster member uses the TCP connection to detect the death of another node within the heartbeat interval. The death detection feature can be enabled by default, and can be configured within, e.g., a <tcp-ring-listener> element, or in a configuration file, such as an operational override file. Settings can be used to change the default behavior of the TCP-ring listener. This includes changing the number of attempts and the amount of time allowed before determining that a computer that is hosting cluster members has become unreachable. For example, the default setting can be 3 attempts and 15 seconds, respectively. The TCP/IP server socket backlog queue can also be set, and defaults to the value used by the operating system.
Listing 1 illustrates a configuration file that can be used to change the settings, in accordance with an embodiment.
In accordance with an embodiment, a system property can be used to specify a timeout, instead of using the operational override file. For example, this system property can be set to 20 seconds by configuring “-Dtangosol.coherence.ipmonitor.pingtimeout=20s”. Additionally, the values of the <ip-timeout> and <ip-attempts> elements should be set high enough to insulate against allowable temporary network outages.
In accordance with an embodiment, the death detection heartbeat interval can be changed. A higher interval alleviates network traffic but also prolongs detection of failed members. The default heartbeat value is 1 second. Listing 2 illustrates how to change the death detection heartbeat interval from within an operational override file, in accordance with an embodiment.
In accordance with an embodiment, death detection can be enabled by default, and/or can be explicitly disabled. Disabling death detection can alleviate network traffic, but also prolongs the detection of failed members. Listing 3 illustrates how to disable death detection from within an operational override file, in accordance with an embodiment.
In accordance with an embodiment, the packet resend timeout interval specifies the maximum amount of time, in milliseconds, that a packet continues to be resent if no ACK packet is received. After this timeout expires, a determination is made as to whether the recipient is to be considered terminated. This determination takes additional data into account, such as whether other nodes are still able to communicate with the recipient. The default value is 300000 milliseconds. For production environments, the recommended value is the greater of 300000 and two times the maximum expected full GC duration. Listing 4 illustrates how to change the packet resend timeout interval within the operational override file, in accordance with an embodiment.
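The sizing rule above reduces to simple arithmetic, sketched here in Python. The function names are hypothetical, and treating the resend window as expired exactly at the timeout boundary is an assumption of this sketch rather than something the description specifies.

```python
DEFAULT_RESEND_TIMEOUT_MS = 300_000  # default stated above (5 minutes)


def recommended_resend_timeout_ms(max_full_gc_ms: int) -> int:
    """Return the greater of the default and twice the longest expected full GC."""
    return max(DEFAULT_RESEND_TIMEOUT_MS, 2 * max_full_gc_ms)


def resend_window_expired(
    first_sent_ms: int,
    now_ms: int,
    timeout_ms: int = DEFAULT_RESEND_TIMEOUT_MS,
) -> bool:
    """Return True once a packet has gone unacknowledged for the full timeout.

    Expiry only triggers the follow-up determination; it does not by itself
    mean that the recipient is dead.
    """
    return now_ms - first_sent_ms >= timeout_ms
```

For example, with a longest expected full GC of 60 seconds the default of 300000 milliseconds is retained, while a longest expected full GC of 200 seconds raises the recommended value to 400000 milliseconds.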
Throughout the various contexts described in this disclosure, the embodiments of the invention further encompass computer apparatus, computing systems and machine-readable media configured to carry out the foregoing systems and methods. In addition to an embodiment consisting of specifically designed integrated circuits or other electronics, the present invention may be conveniently implemented using a conventional general purpose or a specialized digital computer or microprocessor programmed according to the teachings of the present disclosure, as will be apparent to those skilled in the computer art.
Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those skilled in the software art. The invention may also be implemented by the preparation of application specific integrated circuits or by interconnecting an appropriate network of conventional component circuits, as will be readily apparent to those skilled in the art.
The various embodiments include a computer program product which is a storage medium (media) having instructions stored thereon/in which can be used to program a general purpose or specialized computing processor(s)/device(s) to perform any of the features presented herein. The storage medium can include, but is not limited to, one or more of the following: any type of physical media including floppy disks, optical discs, DVDs, CD-ROMs, microdrives, magneto-optical disks, holographic storage, ROMs, RAMs, PRAMS, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs); paper or paper-based media; and any type of media or device suitable for storing instructions and/or information. The computer program product can be transmitted in whole or in parts and over one or more public and/or private networks wherein the transmission includes instructions which can be used by one or more processors to perform any of the features presented herein. The transmission may include a plurality of separate transmissions. In accordance with certain embodiments, however, the computer storage medium containing the instructions is non-transitory (i.e. not in the process of being transmitted) but rather is persisted on a physical device.
The foregoing description of the preferred embodiments of the present invention has been provided for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations can be apparent to the practitioner skilled in the art. Embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the relevant art to understand the invention. It is intended that the scope of the invention be defined by the following claims and their equivalents.
This application claims the benefit of priority to U.S. Provisional Patent Application No. 61/437,542, titled “DEATH DETECTION IN A DATA GRID CLUSTER”, filed Jan. 28, 2011, which application is herein incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20120198055 A1 | Aug 2012 | US |
Number | Date | Country | |
---|---|---|---|
61437542 | Jan 2011 | US |