This application relates to U.S. patent application Ser. No. 10/773,803, now U.S. Pat. No. 7,406,537, entitled “Dynamic Subscription and Message Routing on a Topic Between Publishing Nodes and Subscribing Nodes,” filed Feb. 6, 2004, which is incorporated herein by reference. This application also relates to U.S. Utility patent application Ser. No. 10/304,992, now U.S. Pat. No. 7,039,671, entitled “Dynamically Routing Messages between Software Application Programs Using Named Routing Nodes and Named Message Queues” filed on Nov. 26, 2002, which is incorporated herein by reference in its entirety.
1. Field of the Invention
The present invention relates to systems and methods for sending and receiving messages. In particular, the present invention relates to a system and method for sending and receiving messages using fault-tolerant or high availability architecture.
2. Description of the Background Art
The use and proliferation of distributed computing networks is ever increasing. With the advent and business use of the Internet, the need for more efficient distributed computing systems has become critical. The business use of the Internet has forced the integration of disparate computing environments with distributed computing systems that enable data transfer between such disparate systems. This in turn has created a need for better messaging systems that can handle the amounts of data and communication needed to let disparate systems operate together effectively and share information.
Such distributed processing and messaging systems are now used for a variety of applications. As they have been used for more applications, there is increasing demand for systems that are fault-tolerant such that the messaging systems can be used for financial transactions, equity trades and other messaging that demands high availability. However, there are very few such systems that can provide such fault tolerance, and those that have fault tolerance do so with a penalty in cost, performance, and/or hardware requirements.
A typical prior art approach for providing fault tolerance is shown in
However, the prior art systems 100 suffer from a number of shortcomings. First, there is no live or hot recovery. Any failover requires that server A′ perform recovery from disk, which takes time during which pending transactions will be lost. Second, additional software is required to manage the two servers 102, 104 during start up and back up. This software is not used anywhere else for operation of the messaging systems or servers. Third, hardware locks are used to detect the failure of a server. Such hardware locks are difficult to distribute to the tens or hundreds of servers that may be part of a messaging system. Fourth, server A′ typically cannot be used for any function other than backing up server A, and therefore the prior art effectively doubles the hardware costs to provide fault tolerance.
Therefore, what is needed is a system and method for implementing a fault-tolerant messaging system that overcomes the limitations found in the prior art.
The present invention overcomes the deficiencies and limitations of the prior art by providing a fault-tolerant messaging system. In one embodiment, the fault-tolerant messaging system comprises a primary broker, a first network, a back up broker, and a second network. The primary broker and the back up broker are coupled to the first network for communication with clients, thus creating a messaging system. The primary broker and the back up broker are also coupled to the second network for replicating state between the primary broker and the back up broker, and also for sending transaction events immediately to maintain synchronization between the primary broker and the back up broker. The brokers preferably further comprise a replication module for communicating state between the primary broker and the back up broker, a recovery module for performing recovery on the back up broker upon failure of the primary broker, and a fault-tolerant connection module for establishing a fault-tolerant connection between the primary broker and the back up broker over the second network. In an alternate embodiment, the recovery module of the primary broker may maintain a log of transactions and send them to the back up broker over the second network in batches of transactions.
The present invention also includes a number of novel methods including: a method for performing fault tolerance; a method for replication of broker state from a primary broker to a back up broker; a method for maintaining or synchronizing the state of a primary broker and a back up broker; a method for performing recovery to a back up broker after a failure; a method for operation of a fault-tolerant connection between a client and a broker; and a method for dynamic synchronization between a primary broker and a back up broker according to the present invention.
The invention is illustrated by way of example, and not by way of limitation in the figures of the accompanying drawings in which like reference numerals refer to similar elements.
A system and method for fault-tolerant messaging is described. More specifically, the present invention is a software solution that does not require additional hardware or hardware changes and can operate on a variety of platforms to provide a fault-tolerant messaging system. In one embodiment, the fault-tolerant messaging system provides real-time message event back up so that recovery in the middle of a transaction can be accomplished. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the invention. It will be apparent, however, to one skilled in the art that the invention can be practiced without these specific details. In other instances, structures and devices are shown in block diagram form in order to avoid obscuring the invention. For example, the present invention is described primarily with reference to failover from a primary broker to an associated back up broker. However, the present invention applies to any distributed computing system that provides fault tolerance for the servers or nodes running brokers, and such a system may include significantly more brokers, servers and clusters.
Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.
Some portions of the detailed descriptions that follow are presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. The steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. It has proven convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like.
It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing” or “computing” or “calculating” or “determining” or “displaying” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission or display devices.
The present invention also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but is not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, and magnetic-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description below. In addition, the present invention is not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the invention as described herein.
Moreover, the present invention claimed below is operating on or working in conjunction with an information system. Such an information system as claimed may be an entire messaging system or only portions of such a system. For example, the present invention can operate with an information system that need only be a broker in the simplest sense to process messages. Thus, the present invention is capable of operating with any information system from those with minimal functionality to those providing all the functionality disclosed herein.
The present invention describes a fault-tolerant messaging system 200, and in doing so uses a number of terms some of which are defined below as follows:
Active Broker: Of the primary/back up broker pair, only one broker is actively processing messaging activity at any one moment, and that broker is said to be the active broker. The active and standby roles are dynamic and they change during failover.
Back up Broker: A back up broker is associated with a primary broker, and is a broker that becomes active upon failure of the primary broker or failure of all the primary broker's replication connection(s) to the back up broker(s). The primary and back up brokers communicate over a replication connection to replicate state, and over a public or service network to provide a service to clients.
Failover: The process beginning with the failure of an active broker, through the transfer of the active role to another broker in the same fault tolerant set, and ending with the reconnection of any clients to the new active broker and the completion of any pending operations.
Fault tolerant broker: a broker configured for fault tolerance; whether or not an operation against a fault tolerant broker is actually protected from failure depends further on the state of the broker, and on the type of connection.
Fault tolerant set: a fault tolerant broker and any back up broker(s) deployed to protect it from failure. The brokers in a fault tolerant set share the same identity with respect to clients and other brokers in a cluster; only one broker in a fault tolerant set may be actively processing client and cluster requests at any one time. A fault tolerant set preferably consists of a primary broker and a single back up broker, but alternatively multiple back up brokers may be supported in the future.
K-resilient: a system or architecture tolerant of up to k concurrent failures. For example, in the context of the brokers of the present invention, a 1-resilient fault tolerance architecture provides uninterrupted service to clients in the event of the failure of one broker, but not if the back up broker also fails before the original broker is returned to service.
Partition: a network failure that leaves two processes running but unable to communicate; to each process, a partition failure is indistinguishable from a failure of the other process. The Active-Standby fault-tolerance architecture of the present invention does not deal well with partition failures because they lead to multiple processes assuming the active role at the same time.
Primary Broker: A primary broker is a messaging broker configured to replicate to a back up broker at any point, without reinitializing storage. The primary broker, if operating in a fault tolerant mode, has an associated back up broker coupled by a replication connection.
Replication Connection: A configured network path between the primary broker and the back up broker, specifying the network endpoints and attributes required to establish a replication connection. The replication connection is used to replicate state between the primary broker and the back up broker and to allow the brokers to monitor each other's status to detect failures.
Runtime Synchronization: Runtime synchronization is the process of synchronizing two brokers while one is actively servicing messaging operations. It is triggered automatically when one broker starts up, connects to the other and finds it active.
Standby Broker: Of the primary/back up broker pair, only one broker is actively processing messaging activity at any one moment, and the broker that is not actively processing messaging activity is said to be the standby broker. The active and standby roles are dynamic and they change during failover.
Storage Synchronization: Storage synchronization is the process of synchronizing the state of the two brokers (updating the state of one broker to match the state of the other broker) while they are both down. It is an administrative operation analogous to storage initialization, and requires that the broker on which it is invoked have access to the recovery logs of both brokers.
System Overview
Referring now to
While the present invention will now be described throughout this patent application in the context of a primary broker and one back up broker, those skilled in the art will recognize that there may be a variety of different configurations for providing fault tolerance, and that a primary broker 202 could have any number of back up brokers from 1 to n. Furthermore, while the following descriptions of the present invention describe fault tolerance and failover as happening from a primary to a back up broker (it is assumed below that the primary broker is in the active state and the back up broker is in the standby state), it is the runtime states of ‘active’ and ‘standby’ that determine the replication and failover roles, and the primary and back up brokers can act in either of these roles. There could also be a variety of orderings in which the primary broker fails over to the n back up brokers. For example, this could be a static sequential order in which the primary broker fails over to another back up, or it could change dynamically depending on other uses of the back up brokers, as will be understood by those skilled in the art.
Fault-Tolerant Clusters
While the systems 200, 300 described above include the functionality of distributed computing systems including messaging capability, these descriptions have been simplified for ease of understanding of the present invention. The systems 200, 300 also include full functionality of a dynamic routing architecture for messages and publish/subscribe messaging capabilities as detailed in co-pending U.S. patent application Ser. No. 10/773,803, now U.S. Pat. No. 7,406,537, entitled “Dynamic Subscription and Message Routing on a Topic Between Publishing Nodes and Subscribing Nodes,” filed Feb. 6, 2004; and U.S. Utility patent application Ser. No. 10/304,992, now U.S. Pat. No. 7,039,671, entitled “Dynamically Routing Messages between Software Application Programs Using Named Routing Nodes and Named Message Queues” filed on Nov. 26, 2002, both of which are incorporated herein by reference in their entirety.
Server
Referring now to
Control unit 450 may comprise an arithmetic logic unit, a microprocessor, a general purpose computer, a personal digital assistant or some other information appliance equipped to provide electronic display signals to display device 410. In one embodiment, control unit 450 comprises a general purpose computer having a graphical user interface, which may be generated by, for example, a program written in Java running on top of an operating system like WINDOWS® or UNIX® based operating systems. In one embodiment, one or more application programs are executed by control unit 450 including, without limitation, word processing applications, electronic mail applications, financial applications, and web browser applications.
Still referring to
Processor 402 processes data signals and may comprise various computing architectures including a complex instruction set computer (CISC) architecture, a reduced instruction set computer (RISC) architecture, or an architecture implementing a combination of instruction sets. Although only a single processor is shown in
Main memory 404 stores instructions and/or data that may be executed by processor 402. The instructions and/or data may comprise code for performing any and/or all of the techniques described herein. Main memory 404 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, or some other memory device known in the art. The memory 404 is described in more detail below with reference to
Data storage device 406 stores data and instructions for processor 402 and comprises one or more devices including a hard disk drive, a floppy disk drive, a CD-ROM device, a DVD-ROM device, a DVD-RAM device, a DVD-RW device, a flash memory device, or some other mass storage device known in the art.
System bus 408 represents a shared bus for communicating information and data throughout control unit 450. System bus 408 may represent one or more buses including an industry standard architecture (ISA) bus, a peripheral component interconnect (PCI) bus, a universal serial bus (USB), or some other bus known in the art to provide similar functionality. Additional components coupled to control unit 450 through system bus 408 include the display device 410, the keyboard 412, the cursor control device 414, the network controller 416 and the I/O device(s) 418.
Display device 410 represents any device equipped to display electronic images and data as described herein. Display device 410 may be, for example, a cathode ray tube (CRT), liquid crystal display (LCD), or any other similarly equipped display device, screen, or monitor. In one embodiment, display device 410 is equipped with a touch screen in which a touch-sensitive, transparent panel covers the screen of display device 410.
Keyboard 412 represents an alphanumeric input device coupled to control unit 450 to communicate information and command selections to processor 402.
Cursor control 414 represents a user input device equipped to communicate positional data as well as command selections to processor 402. Cursor control 414 may include a mouse, a trackball, a stylus, a pen, a touch screen, cursor direction keys, or other mechanisms to cause movement of a cursor.
Network controller 416 links control unit 450 to a network that may include multiple processing systems. The network of processing systems may comprise a local area network (LAN), a wide area network (WAN) (e.g., the Internet), and/or any other interconnected data path across which multiple devices may communicate. The control unit 450 also has other conventional connections to other systems such as a network for distribution of files (media objects) using standard network protocols such as TCP/IP, http, https, and SMTP as will be understood to those skilled in the art.
One or more I/O devices 418 are coupled to the system bus 408. For example, the I/O device 418 may be an audio input/output device 418 equipped to receive audio input via a microphone and transmit audio output via speakers. In one embodiment, audio device 418 is a general purpose audio add-in/expansion card designed for use within a general purpose computer system. Optionally, I/O audio device 418 may contain one or more analog-to-digital or digital-to-analog converters, and/or one or more digital signal processors to facilitate audio processing.
It should be apparent to one skilled in the art that control unit 450 may include more or fewer components than those shown in
Primary Broker
The operating system 502 is preferably one of a conventional type such as WINDOWS®, SOLARIS® or LINUX® based operating systems. Although not shown, the memory unit 404a may also include one or more application programs including, without limitation, word processing applications, electronic mail applications, financial applications, and web browser applications.
The publish/subscribe module 504 is used to establish a subscription to a topic for a node and to unsubscribe from a topic. It is also used to identify subscribers to a topic and transmit messages to subscribers dynamically. The publish/subscribe module 504 also includes other topic tables, queues and other elements necessary to implement publish/subscribe messaging on the system 200.
The message queue 506 stores messages that have been received from other servers or nodes and that need to be forwarded to other nodes or distributed locally to subscribing applications. The message queue 506 is accessible to the broker 404a.
The broker module 508 is used to create instances of brokers 202, 204 with the functionality that has been described above. The broker module 508 manages the creation and deletion of broker instances to ensure the proper operation of the fault tolerant system 200. Each of the brokers has the full functionality of brokers as has been noted above and detailed in the related U.S. patent application Ser. Nos. 10/773,803, now U.S. Pat. No. 7,406,537, and 10/304,992, now U.S. Pat. No. 7,039,671, both of which are incorporated herein by reference in their entirety.
The replication module 510 manages and maintains a copy of the state of the primary broker 202 on the back up broker 204. At the primary broker 202, the replication module 510 works with a corresponding replication module 510 on the back up broker 204. The replication module 510 replicates data and monitors the status of the back up broker 204 by communicating over a replication connection. In particular, the replication module 510 replicates storage state such that a message database 522 of the primary broker 202 is synchronized with a back up message database 604 stored on and used by the back up broker 204. The replication module 510 also maintains run-time synchronization by either: 1) immediately sending events being processed by the primary broker 202 to the back up broker 204 and using a guaranteed acknowledgement to ensure each event is recorded in the recovery log 602 in the back up broker 204, or 2) storing events in a recovery log 516/520 at the primary broker 202 and sending the events from the recovery log 516/520 to a corresponding recovery log 602 at the back up broker 204 periodically or as needed when the recovery log 516 becomes full. Each transaction generates one or more events, and by storing the events, the transactions can be recovered. In one embodiment, the present invention may also include a fast log storage mechanism where a plurality of recovery logs 516, 520 are used and alternately read out to the back up broker 204, while another of the plurality of recovery logs 516, 520 is used to store transactions. The functions of the replication module 510 are described in more detail below with reference to
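By way of a non-limiting illustration, the following Java sketch shows the two run-time synchronization strategies described above: immediate forwarding with a guaranteed acknowledgement, or batching events until the local recovery log fills. The class and interface names (EventReplicator, ReplicationLink) are hypothetical and are not taken from the present invention or any product API.

```java
// Hedged sketch only; names and batching policy are assumptions for illustration.
import java.util.ArrayList;
import java.util.List;

public class EventReplicator {
    public interface ReplicationLink {
        // Sends events to the standby and blocks until it acknowledges
        // that they are recorded in its recovery log.
        void sendAndAwaitAck(List<byte[]> events) throws Exception;
    }

    private final ReplicationLink link;
    private final List<byte[]> pendingBatch = new ArrayList<>();
    private final int batchLimit;
    private final boolean immediateMode;

    public EventReplicator(ReplicationLink link, boolean immediateMode, int batchLimit) {
        this.link = link;
        this.immediateMode = immediateMode;
        this.batchLimit = batchLimit;
    }

    // Called by the active broker for each transaction event it processes.
    public synchronized void replicate(byte[] event) throws Exception {
        if (immediateMode) {
            // Strategy 1: forward the event at once and wait for the
            // guaranteed acknowledgement from the standby.
            link.sendAndAwaitAck(List.of(event));
        } else {
            // Strategy 2: buffer events locally and ship them in batches
            // when the local buffer (standing in for the recovery log) fills up.
            pendingBatch.add(event);
            if (pendingBatch.size() >= batchLimit) {
                link.sendAndAwaitAck(new ArrayList<>(pendingBatch));
                pendingBatch.clear();
            }
        }
    }
}
```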
The replication connection module 512 manages and maintains a replication connection between the primary broker 202 and the back up broker 204. The primary broker 202 and the back up broker 204 replicate data, maintain synchronization and monitor each other's status by communicating over replication connections, which define pairs of network endpoints over which the brokers 202, 204 communicate to replicate data and to detect failures. The replication connection is preferably a secure connection and may include a plurality of channels. The replication connection module 512 manages and maintains a replication connection including definition and management of multiple replication connections in order to make use of multiple redundant network paths if available. Both brokers 202, 204 actively connect and maintain all defined replication connections, and regularly heartbeat replication connections in both directions to detect failures. Only one connection is used for replication at a time, but the brokers 202, 204 can switch to another connection without interrupting replication if one connection fails. Furthermore, the replication connection module 512 performs the following functions: initiating the connection of the first replication channel, selecting the active channel for replication based on channel metrics, monitoring the health of the replication channels via heartbeats, initiating retry attempts to re-establish a failed channel, reporting any channel failure and generating the corresponding notification event, and implementing the acknowledgement exchange protocol to ensure no duplicate messages or missing acknowledgements.
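The following Java sketch illustrates, under stated assumptions, one way redundant replication channels could be tracked with heartbeats and switched on failure, as the paragraph above describes. The Channel type, timeout field and selection policy are invented for illustration only.

```java
// Illustrative sketch; not the actual replication connection module.
import java.util.List;

public class ReplicationConnection {
    public static final class Channel {
        final String endpoint;
        volatile long lastHeartbeatMillis;
        Channel(String endpoint) { this.endpoint = endpoint; }
        boolean isAlive(long timeoutMillis) {
            return System.currentTimeMillis() - lastHeartbeatMillis < timeoutMillis;
        }
    }

    private final List<Channel> channels;      // all configured replication channels
    private final long heartbeatTimeoutMillis; // assumed failure-detection window
    private volatile Channel active;

    public ReplicationConnection(List<Channel> channels, long heartbeatTimeoutMillis) {
        this.channels = channels;
        this.heartbeatTimeoutMillis = heartbeatTimeoutMillis;
        this.active = channels.get(0);
    }

    // Called whenever a heartbeat arrives on a channel, in either direction.
    public void onHeartbeat(Channel c) {
        c.lastHeartbeatMillis = System.currentTimeMillis();
    }

    // Periodically invoked: keep the active channel while it is healthy,
    // otherwise fail over to any other live channel without stopping replication.
    public Channel selectActiveChannel() {
        if (!active.isAlive(heartbeatTimeoutMillis)) {
            for (Channel c : channels) {
                if (c != active && c.isAlive(heartbeatTimeoutMillis)) {
                    active = c;
                    break;
                }
            }
        }
        return active;
    }
}
```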
The fault-tolerant (FT) connection module 514 is used to establish a fault-tolerant connection between the client 210 and the primary broker 202. The operation of the fault-tolerant (FT) connection module 514 is described below with reference to FIG. 13. The fault tolerant connection is a connection in which the client will attempt to re-establish the connection before failing over to the back up broker 204. This module in the broker only attempts to re-establish a connection to the client and maintains the context of the client connection for a configurable amount of time to facilitate successful client reconnect and/or failover. The fault-tolerant (FT) connection module 514 is also used to send acknowledgement signals such that a “once and only once” messaging architecture can be used for communication between the client 210 and the primary broker 202. The fault-tolerant (FT) connection module 514 is used to generate the signals necessary to complete the handshaking process with the client as shown in
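A minimal Java sketch of the broker-side behavior described above, assuming a simple table that retains the context of a dropped client connection for a configurable window so the client can reconnect or fail over. All names are hypothetical placeholders, not an API of the described system.

```java
// Hedged sketch of retaining client connection context for a configurable time.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FaultTolerantConnectionTable {
    public static final class ClientContext {
        final String clientId;
        final long disconnectedAtMillis;
        ClientContext(String clientId, long disconnectedAtMillis) {
            this.clientId = clientId;
            this.disconnectedAtMillis = disconnectedAtMillis;
        }
    }

    private final Map<String, ClientContext> retained = new ConcurrentHashMap<>();
    private final long retentionMillis;   // configurable reconnect window (assumption)

    public FaultTolerantConnectionTable(long retentionMillis) {
        this.retentionMillis = retentionMillis;
    }

    // Called when a client connection is lost: keep its context for a while.
    public void onDisconnect(String clientId) {
        retained.put(clientId, new ClientContext(clientId, System.currentTimeMillis()));
    }

    // Called when a client reconnects: restore its context if still within the window.
    public ClientContext onReconnect(String clientId) {
        ClientContext ctx = retained.remove(clientId);
        if (ctx == null) return null;
        boolean expired = System.currentTimeMillis() - ctx.disconnectedAtMillis > retentionMillis;
        return expired ? null : ctx;
    }
}
```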
The first recovery log A 516 provides an area to store transactions as the primary broker 202 processes them. In another embodiment, a second recovery log B 520 is provided for the same function. The first recovery log A 516 stores transactions so that they can be replicated to the back up broker 204. The transactions are preferably appended to the end of the log file so that they provide a sequential listing of transactions that can be used for recovery. The recovery module 524 is responsible for storing the transactions in the recovery log A 516 as they are received and processed by the primary broker 202. The recovery module 524 provides them to the replication module 510 for transmission and processing at the back up broker 204. In a second embodiment, the first recovery log A 516 and the second recovery log B 520 are used for fast log recovery. The second embodiment is a circular cataloguing system of events and synchronization points using the first recovery log A 516 and the second recovery log B 520. The recovery module 524 preferably writes events to one recovery log file at a time. When that recovery log file reaches a configured maximum size, the recovery module 524 will begin writing to the second recovery log file. A synchronization point also occurs when switching between the first recovery log A 516 and the second recovery log B 520 or vice versa. The recovery module 524 can write both transaction events and synchronization points to the recovery logs 516, 520. This process continues for the lifetime of the broker 202.
A “synchronization point” logs all information that is currently necessary to continue reliable broker messaging. Any information that is no longer necessary is discarded. This allows recovery log files to retain a reasonable size. The synchronization point may consist of many syncpoint events. These events begin with a “SyncBegin” event and end with a “SyncEnd” event. After the “SyncEnd” event is complete, database updates occur, including recording the position in the log file where the last SyncBegin was logged. After the database updates are complete, the next broker recovery will begin at the “SyncBegin” event that is logged. Interleaved among the sync events are new log events. These new events are from new messages or state changes that occur within the broker 202. The recovery module 524 does not attempt to halt new activity while the syncpoint is occurring. A synchronization point is very important to the log file system. Without a complete synchronization point, the system 200 cannot guarantee reliable messaging in the event of a broker failure.
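The two paragraphs above describe a circular, dual-file recovery log with synchronization points. The Java sketch below illustrates the rotation and syncpoint markers under simplifying assumptions; the file names, on-disk format, and the exact ordering of rotation versus the syncpoint are invented for illustration, and the interleaving of new events and the subsequent database updates are omitted.

```java
// Hedged sketch of dual recovery logs with syncpoints on rotation.
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class FastRecoveryLog {
    private final String[] logFiles = { "recoveryA.log", "recoveryB.log" }; // hypothetical names
    private final long maxBytes;
    private int current = 0;          // which of the two logs is being written
    private long written = 0;
    private FileOutputStream out;

    public FastRecoveryLog(long maxBytes) throws IOException {
        this.maxBytes = maxBytes;
        this.out = new FileOutputStream(logFiles[current], true);
    }

    // Append a transaction event; switch logs and record a syncpoint when full.
    public synchronized void append(String event) throws IOException {
        if (written >= maxBytes) {
            out.close();
            current = 1 - current;                         // rotate to the other recovery log
            out = new FileOutputStream(logFiles[current], false);
            written = 0;
            // A synchronization point is recorded on rotation; the real process
            // interleaves new events between SyncBegin and SyncEnd and then
            // performs database updates, which this sketch omits for brevity.
            writeRecord("SyncBegin");
            writeRecord("SyncEnd");
        }
        writeRecord(event);
    }

    private void writeRecord(String record) throws IOException {
        byte[] bytes = (record + "\n").getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        written += bytes.length;
    }
}
```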
The primary broker 202 includes one or more configuration registers 518. These configuration registers are used to identify the operating mode of the primary broker 202, the identification and address of a back up broker 204, the channels to use when communicating with and monitoring the back up broker 204, and other parameters necessary for establishing a fault-tolerant connection with a client and maintenance of state with a back up broker 204.
The message database 522 is a database of messages and state information for operation of the broker 202. The message database 522 preferably includes messages received, sent, and other state signals. The message database 522 is preferably stored on non-volatile storage such as a hard disk. The message database 522 is coupled to operate with the broker module 508 in a conventional manner for a messaging system. The messaging database 522 is also accessible by the replication module 510 and the recovery module 524 for performance of their functions as has been described above.
A recovery module 524 is also included in the primary broker memory 404a for storing transactions or events in process by the primary broker 202. The recovery module is responsible for storing data necessary for recovery in the recovery log and between synchronization points to the message database 522. As noted above, the recovery module works in cooperation with the replication module 510. The recovery module 524 also includes other processing for recovering the primary broker after failure or upon start up to bring the primary broker 202 up in a predefined state. The operation of the recovery module 524 will be described in more detail below with reference to
Finally, the memory unit 404a includes a transaction manager 526. The transaction manager 526 is used to track the state of transactions. The transaction manager 526 keeps track of transaction state, and keeps transactions open until complete. If a transaction is open during failover, the transaction manager 526 will have sent state information for the transaction to the back up broker 204 such that the back up broker is able to continue and complete it after failover. The transaction manager 526 receives messages and events and maintains transaction state, since the transaction manager 526 of the primary broker 202 (the active broker in this case) is coupled for communication with a corresponding transaction manager 526 of the back up broker 204 (the standby broker in this case). The transaction manager 526 is also coupled to bus 408 for communication with other components of the primary broker memory unit 404a.
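The following Java sketch shows, as an assumption-laden illustration, how open transactions could be tracked and their state forwarded to the standby so that an open transaction can be completed after failover. The StandbyLink interface and the string-based state are placeholders, not the described transaction manager 526.

```java
// Hedged sketch of transaction-state tracking with replication to the standby.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class TransactionManager {
    public interface StandbyLink {
        void replicateTransactionState(String txId, String state);
    }

    private final Map<String, String> openTransactions = new ConcurrentHashMap<>();
    private final StandbyLink standby;

    public TransactionManager(StandbyLink standby) { this.standby = standby; }

    public void begin(String txId) {
        update(txId, "OPEN");
    }

    public void update(String txId, String state) {
        openTransactions.put(txId, state);
        // Forward every state change so the standby can continue the
        // transaction if failover happens while it is still open.
        standby.replicateTransactionState(txId, state);
    }

    public void commit(String txId) {
        standby.replicateTransactionState(txId, "COMMITTED");
        openTransactions.remove(txId);
    }
}
```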
Those skilled in the art will recognize that, although the various processes and their functionality have been described with respect to particular embodiments, these processes and/or their functionality can be combined into a single process or into any combination of multiple processes. The processes can also be provided using a combination of built-in functions of one or more commercially available software application programs and/or in combination with one or more custom-designed software modules.
Back up Broker
The memory 404b for the back up broker 204 preferably comprises an operating system 502, a publish/subscribe module 504, a message queue 506, a broker module 508, a replication module 510, a replication connection module 512, a fault-tolerant (FT) connection module 514, a recovery log 602, one or more configuration registers 518, a message database 604, and a recovery module 524.
The operating system 502, publish/subscribe module 504, message queue 506, and broker module 508 have the same functionality as has been described above, but for the back up broker 204.
The replication module 510 of the back up broker 204 synchronizes the state of the back up broker 204 to that of the primary broker 202. In particular, the replication module 510 of the back up broker 204 communicates with the replication module 510 of the primary broker 202 for both storage synchronization and run-time synchronization. The replication module 510 of the back up broker 204 is communicatively coupled to the recovery log 602 of the back up broker 204 and the message database 604 of the back up broker 204. The recovery log 602 and the message database 604 include all events, transactions and state of the back up broker 204 and, in addition, all events, transactions and state of the primary broker 202. These are mirror copies of the message database 522 and the recovery log 516 of the primary broker 202. The replication module 510 of the back up broker 204 stores transactions or events and the database state in the recovery log 602 and the message database 604, respectively. For example, in one embodiment of the present invention, the replication module 510 of the back up broker 204 processes six types of events from the primary broker 202: 1) replicated events that are logged on the backup broker 204, 2) in-memory events that are non-logged informational events generated by the primary broker 202 to synchronize in-memory state, 3) database events that result in database add, delete and update operations on the backup broker 204, 4) operational events that utilize java reflection to execute logic on the backup broker 204, 5) fault tolerant events which represent commands executed on the primary broker 202 that need to be followed by the backup broker 204, and 6) transaction events that represent messages that have been written to transaction files on the primary broker 202. The replication module 510 of the back up broker 204 is able to communicate with the replication module 510 of the primary broker 202, and can receive transactions or events whether the replication module 510 of the primary broker 202 is operating in the mode of: 1) sending transactions immediately, 2) buffering transactions in a single recovery log 516, or 3) buffering transactions using the fast logging method described above. Essentially, the replication module 510 of the back up broker 204 receives and accepts information regarding events, transactions and messages, and modifies its state information based on the received information, but does not process the information because that is handled by the primary broker 202. Upon failure of the primary broker 202, the back up broker 204 can then continue and also process the information.
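As an illustration of the six replicated event categories enumerated above, the sketch below dispatches each type to a handler on the backup. The enum constants, method names, and handler bodies are hypothetical; the point is only that the backup applies each event to its own state without otherwise processing the underlying message.

```java
// Minimal, hedged dispatcher mirroring the six event categories listed above.
public class BackupEventProcessor {
    public enum EventType {
        REPLICATED, IN_MEMORY, DATABASE, OPERATIONAL, FAULT_TOLERANT, TRANSACTION
    }

    public void process(EventType type, byte[] payload) {
        switch (type) {
            case REPLICATED:      appendToRecoveryLog(payload); break;   // logged on the backup
            case IN_MEMORY:       updateInMemoryState(payload); break;   // non-logged informational event
            case DATABASE:        applyDatabaseChange(payload); break;   // add/delete/update in the message database
            case OPERATIONAL:     executeOperation(payload);    break;   // e.g. executed via reflection on the backup
            case FAULT_TOLERANT:  followCommand(payload);       break;   // command already executed on the primary
            case TRANSACTION:     recordTransaction(payload);   break;   // message written to a transaction file
        }
    }

    private void appendToRecoveryLog(byte[] p) { /* write to the backup's recovery log */ }
    private void updateInMemoryState(byte[] p) { /* adjust in-memory structures */ }
    private void applyDatabaseChange(byte[] p) { /* update the backup's message database */ }
    private void executeOperation(byte[] p)    { /* invoke the replicated operation */ }
    private void followCommand(byte[] p)       { /* repeat the primary's command */ }
    private void recordTransaction(byte[] p)   { /* store replicated transaction state */ }
}
```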
The replication connection module 512 of the back up broker 204 performs similar functions as has been described above for the replication connection module 512 of the primary broker 202. The replication connection module 512 sets up a replication connection with the primary broker 202 for the back up broker 204. The replication connection module 512 manages and maintains a replication connection, including definition and management of multiple replication connections in order to make use of multiple redundant network paths if available, but for the back up broker 204.
A fault-tolerant (FT) connection module 514 can be included in the back up broker 204. The fault-tolerant (FT) connection module 514 has similar functionality and coupling as has been described above, but for the back up broker 204. Since the back up broker 204 can also operate in the active mode upon failover, the fault-tolerant (FT) connection module 514 is used to establish fault-tolerant connections between the back up broker 204 and clients 210 of the primary broker 202 upon failover.
The memory 404b of the back up broker 204 includes one or more configuration registers 518. These configuration registers identify the operational mode of the back up broker 204, the identity of the primary broker 202 that the back up broker is backing up, information for establishing a replication connection with the primary broker 202 and other conventional configuration information.
The recovery log 602 of the back up broker 204 is a recovery log for storing transactions and events that mirrors the recovery log 516 of the primary broker 202. This recovery log 602 preferably stores events and transactions replicated over from the primary broker 202. The replication module 510 maintains the recovery log 602 of the back up broker 204. The memory 404b of the back up broker 204 also includes the message database 604. Again, this message database 604 is a mirror copy of the message database 522 of the primary broker 202 and is maintained by the replication module 510. Upon failure of the primary broker 202, the recovery log 602 and message database 604 are used to start the back up broker 204 with the same state as the primary broker 202 had before failure.
The recovery module 524 of the back up broker 204 performs similar functions as has been described above for the recovery module 524 of the primary broker 202. The recovery module 524 of the back up broker 204 is coupled for communication with the recovery log 602 and the message database 604 of the back up broker 204. Upon failure of the primary broker 202, the recovery module 524 can restore the back up broker 204 to the state of the primary broker 202 before it failed. Using the recovery log 602 of the back up broker 204 and the message database 604 of the back up broker 204, the back up broker 204 can be restored such that it continues the operations of the primary broker 202. These recovery operations are described in more detail below with reference to
Client
Referring now to
The memory 404c for the client 210 preferably comprises an operating system 502, a publish/subscribe module 504, a point-to-point module (queue) 506, a fault-tolerant (FT) connection module 514, a fault detection module 708, a primary/standby configuration register 710, and a client recovery module 712. The operating system 502, publish/subscribe module 504, point-to-point module 506, and a fault-tolerant connection module 514 have the same functionality as has been described above, but for the client 210.
The point-to-point module (queue) 506 is responsible for message queue type messaging and is coupled to bus 408 for communication similar to that described above for the message queue of the primary broker 202 and the back up broker 204.
The client 210 uses the primary/standby configuration register 710 to store an identification number specifying the primary broker 202 and the back up broker 204. This identification information and connection information are provided so that the client 210 may connect to a primary broker 202, and in the event of failure, know which broker it should communicate with and how to make a connection to the back up broker 204.
The client 210 also includes a fault detection module 708. The fault detection module 708 provides the client 210 with the capability to detect a fault that will cause failover, and to begin communication with the back up broker 204. This is particularly advantageous because there are multiple types of failure that may occur. A failure of the primary broker 202 will be known because the back up broker 204 is monitoring the primary broker through a replication connection. The fault detection module 708 of the client 210 detects when a connection to the primary broker 202 has failed, and works with the fault-tolerant (FT) connection module 514 to reconnect to the primary broker 202 before failing over to the back up broker 204. This process is detailed below with reference to
The client recovery module 712 works with the back up broker 204 to establish a connection to the back up broker 204 after failure of the primary broker 202. In particular, the client recovery module 712 communicates with the recovery module 524 of the back up broker 204. Upon failover, the client recovery module 712 communicates with the recovery module 524 so that the back up broker 204 can continue any messaging operation, transaction, or any other broker operation started by the primary broker before failure. This is particularly advantageous because it provides continuous availability and quality of service in the event of primary broker 202 failures.
Active and Standby Modes For Brokers
The states are grouped into two main “roles,” the active role 820 and the standby role 822. The terms active and standby usually refer to the general role rather than the individual state, and indicate which of the brokers 202/204 is servicing operations from clients 210, and which one is not. As shown, while in the WAITING state 808 the broker 202/204 is not in either role—it is waiting to resolve its role.
WAITING State. Each broker 202/204 begins in the waiting state 808. In the waiting state 808, a broker is starting up and waiting to determine which role it should take. A primary or back up broker 202/204 is in the waiting state 808 at startup until it connects to the other broker to determine who has the most recent broker state. While in the waiting state 808, a broker does not accept connections from clients 210 or from other brokers in its cluster. By default, when a primary broker 202 and back up broker 204 are started, the first one to come up will go into the waiting state 808 until the second comes up and they establish a replication connection. Once in the waiting state 808, there are three ways to transition to another state. First, if a replication connection is established, and the other broker is in the standalone state 802, the broker will transition to the standby sync state 810 and begin runtime synchronization (Connect to STANDALONE Peer). Second, if a replication connection is established, and the other broker is also in the WAITING state, the brokers choose roles based on their previous role and synchronization state, and one of the two waiting brokers is activated (Activate Waiting Broker). Third, a broker in the waiting state 808 may transition directly to the standalone state 802 in response to a setting for the broker to start in the active role or to be a primary without a back up broker (Start_Active, or Primary w/o Back up).
STANDALONE state. In the STANDALONE state 802, the broker 202 is available to clients 210, but it is not connected to another broker 204 to which it can replicate. A failure of the broker 202 while in this state 802 will interrupt service, since there is no standby ready to fail over. Brokers not configured for fault tolerance are always in this state while running. A primary broker 202 is in the STANDALONE state if it is actively servicing client and cluster operations but no standby broker is running, or if a standby broker is running but is not in the STANDBY state. While in the STANDALONE state 802, if a replication connection is established, and the other broker is in the STANDALONE state, both brokers must have become active during a partition 814. The brokers may have performed inconsistent operations, and may have inconsistent state that cannot be resolved at runtime; if both brokers have accepted connections from clients or other brokers, both brokers will shut down. If only one broker has accepted connections while partitioned, that broker will remain in the standalone state and the other broker will shut down. If a replication connection is established, and the other broker is in the WAITING state 808, this broker will transition to the ACTIVE SYNC state and begin to drive the runtime synchronization process while continuing to service clients. A primary broker with no configured back up (i.e. not configured for fault tolerance) transitions to the STANDALONE state 802 immediately on startup, and remains in this state indefinitely.
ACTIVE SYNC state. In the ACTIVE SYNC state 804, the broker 202 is driving the runtime synchronization process from the active role 820, to update the state of the standby, while also servicing client operations. If both brokers connected while in the WAITING state, and they were storage-synchronized prior to starting up, runtime synchronization is trivial and completes immediately. While in this ACTIVE SYNC state 804, completion of the runtime synchronization protocol (Sync Complete) causes a transition to the ACTIVE state 806. While in this ACTIVE SYNC state 804, the loss of all replication connections (Peer Failure Detected) indicates a failure of the standby broker, and this broker returns to the STANDALONE state 802.
ACTIVE state. A fault-tolerant broker is in the ACTIVE state 806 if it is actively servicing client and cluster operations, and it is replicating operations to a standby broker 204 that is currently in the STANDBY state 812. While in the ACTIVE state 806, a broker 202 is protected from failure since a standby broker 204 is present and ready to take over. While in this state, the failure of all replication connections (Peer Failure Detected) indicates a failure or partition from the other broker, and causes a transition to the STANDALONE state 802.
STANDBY SYNC state. In the STANDBY SYNC state 810, the broker 204 is undergoing runtime synchronization from the standby role 822, and its storage and memory state are being updated to reflect the state of the active broker 202. When runtime synchronization completes (Sync Complete), the broker 204 transitions to the standby state 812. If the brokers 202, 204 are connected while still both in the WAITING state 808, and they were storage-synchronized prior to starting up, runtime synchronization is trivial and completes immediately. A broker in the STANDBY SYNC state 810 does not take over for an ACTIVE broker even if it detects a failure, since it does not have the context to continue client operations. While in this state, the loss of all replication connections (Peer Failure Detected) cancels runtime synchronization and causes a transition back into the WAITING state 808.
STANDBY state. In the STANDBY state 812, the broker 204 has completed runtime synchronization; it is processing replication data “live” from the active; and it is ready to fail over to the active role if it detects a failure of the active broker. If a failure is detected (Peer Failure Detected) while in the STANDBY state 812, the broker 204 will switch to the active role 820 and begin accepting failover connections as well as normal connections from new clients. While in this state, the loss of all replication connections (Peer Failure Detected) indicates a failure of the other broker, and causes a transition to the STANDALONE state 802. This is the transition usually referred to as broker failover. When an active broker fails, the standby broker becomes active as soon as it detects and confirms the failure, and it is ready to accept connections from any clients and cluster peer brokers that were connected to the previously active broker when it failed; any clients can continue operating without losing the context of operations that were pending when the previous broker failed.
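The state descriptions above can be summarized by the following Java sketch of the broker state machine. The enum constants and event names paraphrase the transition labels in the text; only the transitions named above are modeled, and administrative shutdown paths and role-selection details are omitted, so this is an illustrative simplification rather than the claimed state machine.

```java
// Hedged sketch of the WAITING/STANDALONE/ACTIVE SYNC/ACTIVE/STANDBY SYNC/STANDBY transitions.
public class BrokerStateMachine {
    public enum State { WAITING, STANDALONE, ACTIVE_SYNC, ACTIVE, STANDBY_SYNC, STANDBY }
    public enum Event { CONNECT_TO_STANDALONE_PEER, CONNECT_TO_WAITING_PEER, ACTIVATE_WAITING_BROKER,
                        START_ACTIVE_OR_NO_BACKUP, SYNC_COMPLETE, PEER_FAILURE_DETECTED }

    private State state = State.WAITING;

    public synchronized State onEvent(Event e) {
        switch (state) {
            case WAITING:
                if (e == Event.CONNECT_TO_STANDALONE_PEER) state = State.STANDBY_SYNC;
                else if (e == Event.ACTIVATE_WAITING_BROKER) state = State.ACTIVE_SYNC;
                else if (e == Event.START_ACTIVE_OR_NO_BACKUP) state = State.STANDALONE;
                break;
            case STANDALONE:
                // Peer found in WAITING: drive runtime synchronization while serving clients.
                if (e == Event.CONNECT_TO_WAITING_PEER) state = State.ACTIVE_SYNC;
                break;
            case ACTIVE_SYNC:
                if (e == Event.SYNC_COMPLETE) state = State.ACTIVE;
                else if (e == Event.PEER_FAILURE_DETECTED) state = State.STANDALONE;
                break;
            case ACTIVE:
                if (e == Event.PEER_FAILURE_DETECTED) state = State.STANDALONE;
                break;
            case STANDBY_SYNC:
                if (e == Event.SYNC_COMPLETE) state = State.STANDBY;
                else if (e == Event.PEER_FAILURE_DETECTED) state = State.WAITING;
                break;
            case STANDBY:
                // Broker failover: the standby takes over the active role.
                if (e == Event.PEER_FAILURE_DETECTED) state = State.STANDALONE;
                break;
        }
        return state;
    }
}
```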
Referring now to
Referring now to
In an alternate embodiment depicted in
Referring now to
Referring now to
On the other hand, if in step 1314 it is determined that the attempt to reconnect to the primary broker 202 was not successful, the method tries to connect 1316 with the back up broker 204. The method determines 1318 whether the client 210 was able to connect to the back up broker 204. If so, the method continues to perform steps 1322 and 1324 as has been described above, but for the back up broker 204, before returning to step 1308 and monitoring for a connection failure. If the client 210 was not able to connect to the back up broker 204 in step 1316, the method proceeds from step 1318 to step 1320 where an error and connection failure are signaled before the method ends.
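The client-side sequence described above can be illustrated with the following hedged Java sketch: retry the primary broker first, fall back to the back up broker, and signal an error only if both attempts fail. The Connection, Connector, and BrokerAddress types are placeholders, not a real client API, and the retry counts and timeouts of the actual method are omitted.

```java
// Minimal sketch of client reconnect-then-failover, assuming placeholder interfaces.
public class ClientFailover {
    public interface BrokerAddress { String url(); }
    public interface Connection { /* messaging operations would go here */ }
    public interface Connector { Connection connect(BrokerAddress broker) throws Exception; }

    private final Connector connector;
    private final BrokerAddress primary;
    private final BrokerAddress backup;

    public ClientFailover(Connector connector, BrokerAddress primary, BrokerAddress backup) {
        this.connector = connector;
        this.primary = primary;
        this.backup = backup;
    }

    // Invoked when the fault detection module notices the connection has failed.
    public Connection reconnect() throws Exception {
        try {
            return connector.connect(primary);    // first try to re-establish with the primary
        } catch (Exception primaryFailure) {
            try {
                return connector.connect(backup); // then fail over to the back up broker
            } catch (Exception backupFailure) {
                // Neither broker reachable: surface the connection failure to the application.
                throw new Exception("connection failure: primary and back up brokers unreachable", backupFailure);
            }
        }
    }
}
```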
The foregoing description of the embodiments of the present invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the present invention to the precise form disclosed. Many modifications and variations are possible in light of the above teaching. It is intended that the scope of the present invention be limited not by this detailed description, but rather by the claims of this application. As will be understood by those familiar with the art, the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Likewise, the particular naming and division of the modules, routines, features, attributes, methodologies and other aspects are not mandatory or significant, and the mechanisms that implement the present invention or its features may have different names, divisions and/or formats. Furthermore, as will be apparent to one of ordinary skill in the relevant art, the modules, routines, features, attributes, methodologies and other aspects of the present invention can be implemented as software, hardware, firmware or any combination of the three. Of course, wherever a component, an example of which is a module, of the present invention is implemented as software, the component can be implemented as a standalone program, as part of a larger program, as a plurality of separate programs, as a statically or dynamically linked library, as a kernel loadable module, as a device driver, and/or in every and any other way known now or in the future to those of ordinary skill in the art of computer programming. Additionally, the present invention is in no way limited to implementation in any specific programming language, or for any specific operating system or environment. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting, of the scope of the present invention, which is set forth in the following claims.
Number | Name | Date | Kind |
---|---|---|---|
5245616 | Olson | Sep 1993 | A |
5392398 | Meyer | Feb 1995 | A |
5596720 | Hamada et al. | Jan 1997 | A |
5758354 | Huang et al. | May 1998 | A |
5765033 | Miloslavsky | Jun 1998 | A |
5805825 | Danneels et al. | Sep 1998 | A |
5822526 | Waskiewicz | Oct 1998 | A |
5850525 | Kalkunte et al. | Dec 1998 | A |
5857201 | Wright, Jr. et al. | Jan 1999 | A |
5870605 | Bracho et al. | Feb 1999 | A |
5878056 | Black et al. | Mar 1999 | A |
5951648 | Kailash | Sep 1999 | A |
6016515 | Shaw et al. | Jan 2000 | A |
6061559 | Eriksson et al. | May 2000 | A |
6112323 | Meizlik et al. | Aug 2000 | A |
6128646 | Miloslavsky | Oct 2000 | A |
6145781 | Kawabe et al. | Nov 2000 | A |
6167445 | Gai et al. | Dec 2000 | A |
6185695 | Murphy et al. | Feb 2001 | B1 |
6289212 | Stein et al. | Sep 2001 | B1 |
6298455 | Knapman et al. | Oct 2001 | B1 |
6304882 | Strellis et al. | Oct 2001 | B1 |
6336119 | Banavar et al. | Jan 2002 | B1 |
6397352 | Chandrasekaran et al. | May 2002 | B1 |
6452934 | Nakata | Sep 2002 | B1 |
6453346 | Garg et al. | Sep 2002 | B1 |
6484198 | Milovanovic et al. | Nov 2002 | B1 |
6513154 | Porterfield | Jan 2003 | B1 |
6643682 | Todd et al. | Nov 2003 | B1 |
6647544 | Ryman et al. | Nov 2003 | B1 |
6728715 | Astley et al. | Apr 2004 | B1 |
6732175 | Abjanic | May 2004 | B1 |
6782386 | Gebauer | Aug 2004 | B1 |
6785678 | Price | Aug 2004 | B2 |
6792460 | Oulu et al. | Sep 2004 | B2 |
6801604 | Maes et al. | Oct 2004 | B2 |
6807636 | Hartman et al. | Oct 2004 | B2 |
6816898 | Scarpelli et al. | Nov 2004 | B1 |
6874138 | Ziegler et al. | Mar 2005 | B1 |
6877107 | Giotta et al. | Apr 2005 | B2 |
6898556 | Smocha et al. | May 2005 | B2 |
6901447 | Koo et al. | May 2005 | B2 |
6944662 | Devine et al. | Sep 2005 | B2 |
6983479 | Salas et al. | Jan 2006 | B1 |
7007278 | Gungabeesoon | Feb 2006 | B2 |
7028089 | Agarwalla et al. | Apr 2006 | B2 |
7039701 | Wesley | May 2006 | B2 |
7096263 | Leighton et al. | Aug 2006 | B2 |
7103054 | Novaes | Sep 2006 | B2 |
7177929 | Burbeck et al. | Feb 2007 | B2 |
7251689 | Wesley | Jul 2007 | B2 |
7287097 | Friend et al. | Oct 2007 | B1 |
7296073 | Rowe | Nov 2007 | B1 |
7302634 | Lucovsky et al. | Nov 2007 | B2 |
7334022 | Nishimura et al. | Feb 2008 | B2 |
7359919 | Cohen et al. | Apr 2008 | B2 |
7379971 | Miller et al. | May 2008 | B2 |
7386630 | Liong et al. | Jun 2008 | B2 |
7395349 | Szabo et al. | Jul 2008 | B1 |
7406440 | Napier et al. | Jul 2008 | B2 |
7406537 | Cullen | Jul 2008 | B2 |
7418501 | Davis et al. | Aug 2008 | B2 |
7433835 | Frederick et al. | Oct 2008 | B2 |
7464154 | Dick et al. | Dec 2008 | B2 |
7467196 | Di Luoffo et al. | Dec 2008 | B2 |
7487510 | Carr | Feb 2009 | B1 |
7496637 | Han et al. | Feb 2009 | B2 |
7512957 | Cohen et al. | Mar 2009 | B2 |
7516191 | Brouk et al. | Apr 2009 | B2 |
7533172 | Traversat et al. | May 2009 | B2 |
7539656 | Fratkina et al. | May 2009 | B2 |
7543280 | Rosenthal et al. | Jun 2009 | B2 |
7603358 | Anderson et al. | Oct 2009 | B1 |
7702636 | Sholtis et al. | Apr 2010 | B1 |
7752604 | Genkin et al. | Jul 2010 | B2 |
7801946 | Bearman | Sep 2010 | B2 |
7801976 | Hodges et al. | Sep 2010 | B2 |
7881992 | Seaman et al. | Feb 2011 | B1 |
7887511 | Mernoe et al. | Feb 2011 | B2 |
7895262 | Nielsen et al. | Feb 2011 | B2 |
7941542 | Broda et al. | May 2011 | B2 |
8001232 | Saulpaugh et al. | Aug 2011 | B1 |
8060553 | Mamou et al. | Nov 2011 | B2 |
20010007993 | Wu | Jul 2001 | A1 |
20020010781 | Tuatini | Jan 2002 | A1 |
20020026473 | Gourraud | Feb 2002 | A1 |
20020107992 | Osbourne et al. | Aug 2002 | A1 |
20020161826 | Arteaga et al. | Oct 2002 | A1 |
20030005174 | Coffman et al. | Jan 2003 | A1 |
20030009511 | Giotta et al. | Jan 2003 | A1 |
20030014733 | Ringseth et al. | Jan 2003 | A1 |
20030041178 | Brouk et al. | Feb 2003 | A1 |
20030055920 | Kakadia et al. | Mar 2003 | A1 |
20030061404 | Atwal et al. | Mar 2003 | A1 |
20030074579 | Della-Libera et al. | Apr 2003 | A1 |
20030093500 | Khodabakchian et al. | May 2003 | A1 |
20030101210 | Goodman et al. | May 2003 | A1 |
20030120665 | Fox et al. | Jun 2003 | A1 |
20030145281 | Thames et al. | Jul 2003 | A1 |
20030167293 | Zhu et al. | Sep 2003 | A1 |
20030204644 | Vincent et al. | Oct 2003 | A1 |
20040030947 | Aghili et al. | Feb 2004 | A1 |
20040088140 | O'Konski et al. | May 2004 | A1 |
20040133633 | Fearnley et al. | Jul 2004 | A1 |
20040186817 | Thames et al. | Sep 2004 | A1 |
20040193703 | Loewy et al. | Sep 2004 | A1 |
20040216127 | Datta et al. | Oct 2004 | A1 |
20040225724 | Pavlik et al. | Nov 2004 | A1 |
20050027853 | Martin et al. | Feb 2005 | A1 |
20050038708 | Wu | Feb 2005 | A1 |
20050203915 | Zhu et al. | Sep 2005 | A1 |
20050262515 | Dinh et al. | Nov 2005 | A1 |
20060031481 | Patrick et al. | Feb 2006 | A1 |
20060173985 | Moore | Aug 2006 | A1 |
20060195819 | Chory et al. | Aug 2006 | A1 |
20060206440 | Anderson et al. | Sep 2006 | A1 |
20060224702 | Schmidt et al. | Oct 2006 | A1 |
20060224750 | Davies et al. | Oct 2006 | A1 |
20060230432 | Lee et al. | Oct 2006 | A1 |
20070174393 | Bosschaert et al. | Jul 2007 | A1 |
20080059220 | Roth et al. | Mar 2008 | A1 |
20080148346 | Gill et al. | Jun 2008 | A1 |
20080172270 | Eckenroth | Jul 2008 | A1 |
20090319832 | Zhang et al. | Dec 2009 | A1 |
20090326997 | Becker et al. | Dec 2009 | A1 |
20100017853 | Readshaw | Jan 2010 | A1 |
20100030718 | Anderson et al. | Feb 2010 | A1 |
20100304992 | An et al. | Dec 2010 | A1 |
Number | Date | Country |
---|---|---
1420562 | May 2004 | EP |
WO-9922288 | May 1999 | WO |