The present invention relates to the field of data processing systems, particularly of the distributed type, such as computer networks. More specifically, the present invention relates to a fault-tolerant distributed data processing system.
Computer networks are made up of several data processing apparatuses (computers, workstations, peripherals, storage devices and the like) connected together by a data communications network. Computer networks may vary in size from small networks, like LANs (Local Area Networks), to very large networks, possibly composed of several smaller, interconnected networks (as is for example the case of the Internet).
Computers in a computer network communicate with each other by exchanging messages, whose format depends on the protocol or suite of protocols adopted.
From a logical architecture viewpoint, a computer network may be subdivided into several groups of computers, or network nodes, called “network domains”; computers in the same network domain are logically associated with one another, being for example administered as a common unit with common rules and procedures. A domain manager computer or node typically manages the network domain: for example, all communications to and from the other computers in the domain, particularly messages received from, or directed to, (the domain managers of) other domains of the network, may have to be routed through the domain manager of that domain. The network domains may be structured hierarchically: for example, a generic network domain may include one or more subordinate domains; each subordinate domain is managed by a respective domain manager computer, which is subordinated to the domain manager computer managing the upper-level domain.
By way of example, the commercially-available workload scheduling products suite known under the name “Tivoli Workload Scheduler” by IBM Corporation treats a computer network, for example the production environment of a company or a government agency, as a workload scheduler network containing at least one workload scheduler domain, the so-called “master domain”; the master domain manager computer forms the management hub. The workload scheduler network may be structured so as to contain a single domain, represented by the master domain, or as a multi-domain network: in the former case, the master domain manager maintains communications with all the computers of the network; in the multi-domain case, the master domain is the topmost domain of a hierarchical tree of domains: the master domain manager communicates with the computers in its domain and with subordinate domain manager computers, which manage the subordinate domains. The subordinate domain managers in turn communicate with the computers in their domains and, possibly, with further subordinate domain managers, and so on.
Structuring the computer network in a plurality of domains is advantageous because it reduces network traffic: communications involving the master domain manager are in fact fewer in number, since, for example, communications between two computers in the same subordinate domain are handled by the respective domain manager and need not pass through the master domain manager.
An important feature of a data processing system is its tolerance to faults, i.e., the ability of the system to continue more or less normal operation despite the occurrence of hardware or software faults. In a computer network, ensuring an adequate tolerance to faults includes inter alia implementing a message dispatch/routing mechanism adapted to tolerate network faults, like failures of one or more network nodes.
Fault tolerance may be implemented by assigning to some computers in the network the role of backup computers, which take over the responsibilities normally assigned to other computers of the network, in case such other computers face a failure. In particular, in a network structured in domains, the backup computers have to take over the responsibility of dispatching/routing messages to the proper destinations.
For example, in the above-mentioned example of the workload scheduler network, fault tolerance at the level of the master domain may be achieved by assigning to a computer of the network the role of backup of the master domain manager (the backup computer may for example be a domain manager of a domain subordinate to the master domain, or another computer in the master domain); similarly, in subordinate domains, one computer of the domain may be assigned the role of backup of the respective domain manager computer.
In particular, in every domain (be it the master domain or a subordinate domain, at whichever level of the domain hierarchy), or at least in those domains which are considered more critical, a backup computer can be defined, adapted to take over the responsibilities of the respective domain manager.
The backup computers need to have available at any time the same level of information as that possessed by the respective domain manager, so that in case the latter faces a failure, the associated backup computer can effectively take over its responsibilities and perform the tasks that were intended to be performed by the domain manager. In particular, having the same level of information means being able to reproduce the messages that would have been dispatched/routed by the domain manager, had the latter not failed. To this purpose, the network may be structured so that every message received by the domain manager is also received in copy by the associated backup computer.
However, once the role of the generic domain manager is taken over by the respective backup computer, the latter needs to determine which of the messages have already been dispatched/routed by the domain manager before experiencing the failure, and which have not. In case the computers of the network work in a cluster configuration, or at least some kind of storage (e.g., disk) sharing exists, the above goal can be achieved for example by having the domain manager exploit a persistent queue to store outgoing messages: once the domain manager dispatches a generic message, that message is removed from the queue; provided that the backup computer can access the queue, it can at any time determine which messages are still to be dispatched. However, if the computers do not work in a cluster configuration, or no possibility of disk sharing exists, the backup computer cannot know which messages have already been dispatched by the domain manager before the failure.
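By way of a purely illustrative sketch of this shared-queue approach, assuming the domain manager and its backup share a file system, and using hypothetical file and function names, the persistent outgoing queue might be handled as follows:

```python
# Illustrative sketch of the shared persistent queue; the file name and the
# helper functions are hypothetical, and a shared disk is assumed.
import json
import os

QUEUE_FILE = "outgoing_queue.json"  # assumed location on the shared disk

def _load():
    if os.path.exists(QUEUE_FILE):
        with open(QUEUE_FILE) as f:
            return json.load(f)
    return []

def _save(queue):
    with open(QUEUE_FILE, "w") as f:
        json.dump(queue, f)

def enqueue(message):
    """Domain manager persists a message before attempting to dispatch it."""
    queue = _load()
    queue.append(message)
    _save(queue)

def mark_dispatched(message_id):
    """Once dispatched, the message is removed; whatever remains is still pending."""
    _save([m for m in _load() if m["id"] != message_id])

def pending_messages():
    """The backup, sharing the disk, reads at takeover time what is still to be sent."""
    return _load()
```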
The Applicant has tackled the problem of improving current implementations of fault-tolerant computer networks.
According to an aspect of the present invention, a method as set forth in appended claim 1 is provided.
The method comprises:
having a message originator computer, upon originating a generic message for a destination computer, label the message by means of a message identifier adapted to uniquely identify the originated message;
having the message destination computer, upon receipt of a generic message, record the respective message identifier in a list of identifiers of received messages;
in case a backup computer takes over the role of a manager computer provided for managing the distribution of messages from the message sender to message destination computers, having the backup computer retrieve, from the message destination computer, the list of identifiers of received messages; and
based on the retrieved list of identifiers of received messages, having the backup computer dispatch to the destination computer messages directed thereto that have been received by the backup computer but that were not received by the destination computer.
Thanks to the method according to the present invention, it is ensured that, in case of failure of the manager computer, the messages that were not yet delivered to the message destination computers are delivered thereto, in the correct chronological order and without any message repetition.
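As a minimal sketch of the two data items the method relies on (the names below are illustrative and are not taken from the claims), the message identifier can be modelled as a pair formed by the originator identity and a per-originator sequence number, while each destination keeps a log of the identifiers it has received:

```python
# Minimal sketch of the data items used by the method; names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class MessageId:
    originator: str   # uniquely identifies the originating node in the network
    sequence: int     # progressive number, unique among that node's messages

class ReceivedLog:
    """Kept by every destination computer: identifiers of the messages it received."""
    def __init__(self):
        self.ids = []

    def record(self, msg_id: MessageId):
        self.ids.append(msg_id)
```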
The features and advantages of the present invention will be made apparent by the following detailed description of an embodiment thereof, provided merely by way of non-limitative example, a description that will be conducted making reference to the attached drawings, wherein:
With reference in particular to
In particular, the exemplary data processing system 100 has a distributed architecture, based on a data communications network 105, which may typically consist of an Ethernet LAN (Local Area Network), a WAN (Wide Area Network), or the Internet. The data processing system 100 may for example be the information infrastructure, i.e., the so-called “production environment”, of a SOHO (Small Office/Home Office) environment or of an enterprise, a corporation, a government agency or the like.
In the data processing system 100, several data processors 110, for example personal computers or workstations (hereinafter, for the sake of conciseness, simply referred to as “computers”), are connected to the data communications network 105 in a computer network configuration.
As shown in
Merely by way of example (but this is not to be construed as a limitation of the present invention, which has more general applicability), a workload scheduling tool like the previously cited Tivoli Workload Scheduler by IBM Corporation is installed in the computers 110 of the data processing system 100; once installed, the Tivoli Workload Scheduler forms a workload scheduler network. Referring to
The network is fault-tolerant. In particular, in an embodiment of the present invention, in every domain (be it the master domain or a subordinate domain, at whichever level of the domains' tree hierarchy) a backup node is defined, adapted to take over the responsibilities of the respective domain manager in case the latter experiences a failure. In particular, in the example shown in the drawing, the leaf node 210 in the master domain is a backup master, acting as the backup of the master domain manager 200, and the leaf nodes 220 and 230 in the subordinate domains act as backups of the respective subordinate domain managers 205 and 215, respectively. Dashed-line arrows in the drawing represent backup logical connections between the network nodes, which are provided in addition to the normal connections, so as to enable the backup nodes to perform their function of backup of the respective (master or subordinate) domain managers, particularly to communicate with the same nodes of the network with which the respective domain managers communicate. For example, as depicted schematically in
Referring again to the Tivoli Workload Scheduler example, without entering into excessive details, the master domain manager 200 is the network node that contains centralized database files used to document scheduling objects, that creates production plans at the beginning of each day, and that performs all logging and reporting operations for the network. The backup master 210 is a network node capable of taking over the responsibilities of the master domain manager 200 for automatic workload recovery. A generic network node may be fault-tolerant or not. A fault-tolerant node (“FTN” in the drawings, or “Fault-Tolerant Agent”—FTA) is a computer capable of resolving local dependencies and of launching its jobs even in the absence of the domain manager; a backup node is typically a fault-tolerant node. A node that is not fault-tolerant is also referred to as a “standard node” (“SN” in the drawings, or “Standard Agent”—SA).
Before the start of each working day, the master domain manager 200 creates a production control file, and the workload scheduler is then restarted in the workload scheduler network. The master domain manager 200 sends a copy of the production control file to each of the fault-tolerant nodes directly linked thereto, in the example the leaf node 210 and the two subordinate domain managers 205 and 215. This process is iterated: the domain managers 205 and 215 send a copy of the received production control file to the respective subordinate domain managers (if any) and fault-tolerant nodes directly linked thereto, in the example the leaf nodes 220, 225 and 230. Once the workload scheduling network has been started, scheduling messages like job starts and completions are passed from the agents (SAs or FTAs) to their respective domain managers, and the domain managers route the messages up to the master domain manager; the latter broadcasts messages through the hierarchical tree (through the domain managers down to the leaf nodes) to update the copies of the production control file held by the subordinate domain managers and leaf nodes (particularly, the FTAs).
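Purely as an illustration of this hierarchical routing (the class and method names below are invented for the example and do not describe the product's internals), the upward reporting of scheduling events and the downward broadcast of updates could be sketched as follows:

```python
# Illustrative sketch of hierarchical routing in a domain tree; not product code.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent           # domain manager above this node (None for master)
        self.children = []             # subordinate managers / agents below this node
        if parent is not None:
            parent.children.append(self)

    def report_up(self, event):
        """Pass a scheduling event up through the domain managers to the master."""
        if self.parent is None:        # this node is the master domain manager
            print(f"master {self.name} logs event: {event}")
            self.broadcast_down(event)
        else:
            self.parent.report_up(event)

    def broadcast_down(self, update):
        """The master pushes updates down through the managers to the leaf nodes."""
        for child in self.children:
            print(f"{self.name} -> {child.name}: {update}")
            child.broadcast_down(update)

# Example tree: a master domain with one subordinate domain and one agent.
master = Node("MASTER")
dm_a = Node("DM_A", parent=master)
fta_1 = Node("FTA_1", parent=dm_a)
fta_1.report_up("job JOB42 completed")
```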
Referring to
According to an embodiment of the present invention, the message compiler module 415 is adapted to insert in the message to be sent 410a a message identifier or message tag 425, adapted to univocally identify the generic message issued by the network node 225. In particular, the message identifier 425 includes a first identifier field 425a and a second identifier field 425b; the first identifier field 425a is adapted to univocally identify, among all the nodes of the network, the network node 225 that has generated the message; the second identifier field 425b is in turn adapted to univocally identify that message among all the messages generated by that network node. In an embodiment of the present invention, the first identifier field 425a includes for example a code corresponding to the name 430 (“NODE ID” in the drawing) assigned to the computer 225 for identifying it in the network, stored for example in a file on the computer's hard disk 168. The second identifier field 425b is for example a code, e.g. a progressive integer, which in
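A minimal sketch of this sender-side labelling, with class names loosely mirroring the modules described above and otherwise assumed for the example, could be:

```python
# Illustrative sketch of the sender-side labelling; class names loosely mirror
# the modules described in the text, the code itself is hypothetical.
import itertools

class ProgressiveCodeGenerator:
    """Counterpart of the progressive code generator 435: one new code per message."""
    def __init__(self, start=1):
        self._counter = itertools.count(start)

    def next_code(self):
        return next(self._counter)

class MessageCompiler:
    """Counterpart of the message compiler module 415."""
    def __init__(self, node_id, generator):
        self.node_id = node_id          # first identifier field 425a (node name 430)
        self.generator = generator

    def compile(self, payload):
        # The message identifier 425 pairs the node name with a progressive code
        # (second identifier field 425b), making the message unique network-wide.
        return {"id": {"node": self.node_id, "seq": self.generator.next_code()},
                "payload": payload}

compiler = MessageCompiler("FTA_225", ProgressiveCodeGenerator())
message = compiler.compile("job status update")
# e.g. {'id': {'node': 'FTA_225', 'seq': 1}, 'payload': 'job status update'}
```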
Similarly to the message sender module 420, a message receiver module 440 manages the receipt of incoming messages 410b (handling in particular the lower-level aspects of the message receipt from the data communications network 105). The received message is passed to a message identifier extractor module 445, which is adapted to parse the received message and to extract the respective message identifier 425. The message identifier extractor module 445 puts the message identifier 425 extracted from the received message into a message identifier table 450, which is adapted to contain the message identifiers of the messages received by the network node 225. The message identifier table 450 may be stored on the computer's hard disk 168, as in the shown example, or it may be saved in a portion of the working memory 159; in this latter case, a mechanism may be implemented adapted to save the message identifier table on the hard disk when, for example, the computer is shut down or the process is terminated. The message identifier table 450 may be adapted to store a prescribed maximum number of message identifiers and to implement a “first-in, first-out” policy, so that the identifiers of obsolete messages are removed to free space when the table is full. In particular, the message identifier table 450 may be adapted to retain the identifier(s) of the last message(s) which the node 225 received from each other network node. From the message identifier extractor module 445, the message is passed to a management message recognizor module 455, adapted to ascertain whether the received message is a backup management message; for the purposes of the present description, by backup management message there is in particular meant a message not intended to be used by the application software 405, but instead relating to the management of the network fault tolerance in case of failure of a network node. If the received message is not a backup management message, the management message recognizor module 455 passes it to the application software module 405; otherwise, i.e. in case the received message is recognized to be a management message, the management message recognizor module 455 is adapted to retrieve the list of message identifiers contained in the message identifiers table 450 and to pass it to the message compiler module 415, to be sent to the competent backup domain manager, as will be explained later on in the present description. As for the generic message issued by the application software 405, the message compiler module 415 inserts the message identifier 425, and the message is sent by the message sender module 420. It is observed that, in alternative embodiments of the invention, it may be foreseen that the message compiler module, in addition to inserting the message identifier in the generic message to be sent, also logs the message identifier in the table 450: in this case, the table will contain not only the message identifiers of the received messages, but also those of the messages issued by the network node 225.
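A corresponding receiver-side sketch, again with assumed names, showing the bounded identifier table and the handling of backup management messages, could be:

```python
# Illustrative receiver-side sketch; class and flag names are assumed.
from collections import OrderedDict, deque

class MessageIdentifierTable:
    """Counterpart of the message identifier table 450: bounded, first-in first-out."""
    def __init__(self, max_entries=1000):
        self.entries = deque(maxlen=max_entries)   # oldest identifiers dropped first
        self.last_per_node = OrderedDict()         # last identifier seen from each node

    def record(self, msg_id):
        self.entries.append(msg_id)
        self.last_per_node[msg_id["node"]] = msg_id

    def snapshot(self):
        """List of identifiers returned in reply to a backup management message."""
        return list(self.entries)

    def last_identifiers(self):
        """Variant retaining only the last identifier received from each node."""
        return list(self.last_per_node.values())

def on_message(message, table, application):
    """Counterpart of modules 445 and 455: extract the identifier, then dispatch."""
    table.record(message["id"])              # always log the received identifier
    if message.get("management"):            # backup management message?
        return table.snapshot()              # content to be sent back to the backup
    application(message["payload"])          # normal message: hand to the application
    return None
```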
Similarly to
The structure of the generic domain manager, like for example the master domain manager 200 and the subordinate domain managers 205 and 215, is not explicitly shown; however, similarly to the backup node just described, the generic domain manager manages the dispatch/routing of the messages to the intended destinations, i.e. to the network nodes linked thereto. According to an embodiment of the present invention, the generic domain manager, like the generic leaf node and backup node, also implements a mechanism for labelling all the messages it generates, particularly the message compiler module 415, adapted to insert into the generic message to be sent the message identifier 425 univocally identifying the message, composed of the network node name 430 and a progressive code generated by the progressive code generator 435.
The operation of the fault-tolerant computer network will be hereinafter described, according to an embodiment of the invention.
In particular, the schematic and simplified flowchart of
The node 225 periodically checks whether there are messages (generated by the application software 405) waiting to be sent (decision block 605).
In the affirmative case (exit branch Y of decision block 605), the message compiler module 415 gets the identifier of the network node (NODE ID) 430 (block 610), asks the progressive code generator 435 to generate a new progressive code (block 615), uses the network node identifier and the generated progressive code to compose the message identifier 425, and adds the message identifier 425 to the message to be sent (block 620); the composed message is then sent (block 625), and the operation flow jumps back to the beginning.
If no message waits to be sent (exit branch N of decision block 605), the node 225 checks whether there are incoming messages (decision block 630).
In the affirmative case (exit branch Y of decision block 630), the message identifier extractor 445 extracts the message identifier 425 from the received message 410b (block 635), and puts the extracted message identifier 425 into the message identifiers table 450 (block 640).
The management message recognizor module 455 then checks whether the received message is a management message, requesting the node 225 to provide the content of the respective received message identifiers table 450 (decision block 645), or rather a normal message directed to the application software 405.
In the negative case (exit branch N of decision block 645) the message is passed over to the application software 405 for processing (block 650).
In the affirmative case (exit branch Y of decision block 645), the management message recognizor module 455 retrieves the content of the received message identifiers table 450 (block 655), and provides it to the message compiler module 415 (block 660), which will then prepare a message (or, possibly, more messages) to be sent to the backup node 220 (in a way similar to that described above). The operation flow then jumps back to the beginning.
Back to decision block 630, if no incoming messages are waiting to be served (exit branch N), the operation flow jumps back to the beginning, unless the computer is shut down (decision block 699, exit branch Y), in which case the operations end.
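Tying the above steps together, a hedged sketch of the leaf node's main loop, with the compiler, identifier table, application callback and transport injected as assumed helpers, could read:

```python
# Hedged sketch of the leaf node's main loop, mirroring decision blocks 605,
# 630, 645 and 699; compiler, table, application, send and shutdown_requested
# are assumed helpers injected from the outside.
import time

def node_main_loop(outbox, inbox, compiler, table, application, send, shutdown_requested):
    while not shutdown_requested():                    # decision block 699
        if outbox:                                     # block 605: message waiting to be sent?
            payload = outbox.pop(0)
            send(compiler.compile(payload))            # blocks 610-625: label and send it
        elif inbox:                                    # block 630: incoming message?
            message = inbox.pop(0)
            table.record(message["id"])                # blocks 635-640: log its identifier
            if message.get("management"):              # decision block 645
                send(compiler.compile(table.snapshot()))   # blocks 655-660: reply with the list
            else:
                application(message["payload"])        # block 650: pass to the application
        else:
            time.sleep(0.1)                            # idle until there is work to do
```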
The backup node 220 periodically checks whether the respective domain manager in respect of which it acts as a backup, in the shown example the domain manager 205, is experiencing a failure (decision block 705); for example, the failure detector module 520 may check whether a system manager operator has instructed the backup node 220 to take over the role of the domain manager.
In the negative case (exit branch N of decision block 705), the backup node 220 checks whether there are incoming messages waiting to be served (decision block 710). It is pointed out that the generic backup node receives a copy of every message sent by the respective domain manager to the nodes of its domain, as well as a copy of every message sent to the respective domain manager by the nodes (other than the backup node) in the domain. In other words, the backup node is aware of all the message traffic in the domain to which it belongs.
In the affirmative case (exit branch Y of decision block 710), the message identifier extractor 445 extracts the message identifier 425 from the received message (block 715), and puts the extracted message identifier 425 into the message identifiers table 450 (block 720).
The message destination analyzer module 505 then checks whether the message is addressed to one of the linked nodes (i.e. the domain manager, or the other nodes of the domain) (decision block 725).
In the negative case (exit branch N of decision block 725), i.e., in case the incoming message is addressed to the backup node, the message is passed over to the application software 405 for processing (block 730); the operation flow then jumps back to the beginning (connector J1).
If instead the message is ascertained to be addressed to one of the linked nodes, i.e. to one of the network nodes linked to the corresponding domain manager (exit branch Y of decision block 725), the message destination analyzer 505 determines which is the destination linked node (block 735). If the domain manager 205 in respect of which the backup node 220 acts as a backup is not facing a failure (exit branch N of decision block 740), the message is simply put into the proper message queue 515 (block 745); no further action is undertaken, and the operation flow jumps back to the beginning (connector J1): in fact, in case of normal operation, it is the domain manager 205 that is in charge of the task of dispatching/routing the messages to the proper destinations. If, on the contrary, the domain manager is currently facing a failure (a condition signaled for example by a failure flag, set when the failure detector module 520 detects a domain manager failure condition; exit branch Y of decision block 740), the backup node, in addition to putting the message in the proper message queue, also dispatches/routes the message to the proper destination (block 750).
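A minimal sketch of this backup-node message handling, with invented names and a dispatch() callback standing in for the actual sending mechanism, could be:

```python
# Illustrative sketch of the backup node's handling of incoming messages
# (blocks 715-750); names are invented and dispatch() stands in for the real
# sending mechanism.
from collections import defaultdict

class BackupNode:
    def __init__(self, own_name, linked_nodes, dispatch, application):
        self.own_name = own_name
        self.linked_nodes = set(linked_nodes)        # destinations table 510
        self.queues = defaultdict(list)              # message queues 515, one per linked node
        self.received_ids = []                       # message identifiers table 450
        self.failure_flag = False                    # set by the failure detector 520
        self.dispatch = dispatch
        self.application = application

    def on_message(self, message, destination):
        self.received_ids.append(message["id"])      # blocks 715-720
        if destination == self.own_name:             # decision block 725, branch N
            self.application(message["payload"])     # block 730
        elif destination in self.linked_nodes:       # decision block 725, branch Y
            self.queues[destination].append(message)     # block 745: keep a copy queued
            if self.failure_flag:                    # decision block 740, branch Y
                self.dispatch(destination, message)  # block 750: act in place of the manager
```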
In case no incoming message waits to be processed (exit branch N of decision block 710), it is ascertained whether there are messages (generated by the application software 405) waiting to be sent (decision block 753).
In the affirmative case (exit branch Y of decision block 753), similarly to what has been described in the foregoing in connection with the leaf node 225, the message compiler module 415 gets the node's ID 430 (block 755), asks the progressive code generator 435 to generate a new progressive code (block 760), composes the message identifier 425 with the network node identifier and the generated progressive code, and adds the message identifier 425 to the message to be sent (block 765); the composed message is then sent (block 770), and the operation flow jumps back to the beginning (connector J1). Also in this case, alternative embodiments of the present invention may foresee that the message compiler module 415 logs the message identifier into the table 450.
If instead there are no messages waiting to be sent (exit branch N of decision block 753), the operation flow jumps back to the beginning, unless for example a shutdown takes place (decision block 799, exit branch Y).
Actions similar to those described up to now are performed by the generic domain manager, like the master domain manager 200 and the domain managers 205 and 215, which, in addition, unconditionally dispatches the received messages to the intended destinations.
Let now the case be considered of the domain manager 205 experiencing a failure: when the (failure detector 520 of the) backup node 220 detects this, for example by receiving an instruction from a system manager operator (exit branch Y of decision block 705, and connector J2), the backup node 220 has to take over the responsibilities of the failed domain manager 205. The failure detector 520 sets a failure flag (block 775); as described in the foregoing, the failure flag is exploited by the backup node 220 for deciding whether a generic incoming message addressed to one of the linked nodes has to be dispatched to the intended destination, or simply put in the respective message queue 515 (decision block 740). Then, based on the destinations table 510, for the generic one of the linked network nodes (i.e., in the considered example, for the leaf node 225 and the master domain manager 200), the (linked node asker 525 of the) backup node 220 requests the content of the respective received message identifiers table 450 (block 780). Once the list of received message identifiers has been retrieved from that linked node (block 785), the (message selector 530 of the) backup node 220 selects from the message queue 515 corresponding to that linked node the messages that, based on the retrieved list of received message identifiers, turn out not to have been received by that network node (block 790); for example, before its failure the domain manager 205 may have received messages addressed to one of the network nodes in its domain, but the domain manager 205 did not have enough time to route these messages to the proper destination before the failure occurred. For example, let it be assumed that the retrieved list contains the following message identifiers:
and let it be assumed that, in the message queue 515 in respect of the network node under consideration, there are the messages labelled by the following identifiers:
The message selector 530 can thus determine that the messages from the node with identifier Ida and numbered 101, 102 and 103 were not received by the network node under consideration (which received messages up to the one numbered 100), whereas all the messages from the network node with identifier Idb have been received.
The (message selector 530 of the) backup node 220 accordingly causes the selected messages to be sent to the proper destination node (block 791). In particular, it is possible to ensure that the messages are sent to the intended destination node in the proper chronological sequence: if a generic node issues two messages in sequence, it is ensured that the two messages are also received in the correct sequence; this is a feature that may be a prerequisite for the correct operation of the data processing system. These operations are repeated for all the linked nodes (decision block 793), as specified in the destinations table 510. Before jumping back to the beginning (connector J1), it is ascertained whether the functionality of the domain manager has in the meanwhile been reestablished (decision block 795), in which case the failure flag is reset (block 797).
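A hedged sketch of the takeover-time selection, using assumed values consistent with the worked example above (node Ida received up to sequence number 100 while the queue holds messages up to 103, node Idb fully received), could be:

```python
# Hedged sketch of the takeover-time selection (blocks 780-791); the concrete
# numbers below are assumptions chosen to match the worked example in the text.
def select_undelivered(queued_messages, received_ids):
    """Return the queued messages whose identifiers the destination never reported,
    keeping each originator's messages in their original (sequence) order."""
    last_received = {}
    for msg_id in received_ids:                      # list retrieved from the linked node
        node, seq = msg_id["node"], msg_id["seq"]
        last_received[node] = max(seq, last_received.get(node, 0))
    undelivered = [m for m in queued_messages
                   if m["id"]["seq"] > last_received.get(m["id"]["node"], 0)]
    return sorted(undelivered, key=lambda m: (m["id"]["node"], m["id"]["seq"]))

# Assumed data: node Ida was received up to sequence 100, node Idb completely
# (here, up to sequence 57 by assumption); the queue holds Ida messages up to 103.
received = [{"node": "Ida", "seq": 100}, {"node": "Idb", "seq": 57}]
queue = [{"id": {"node": n, "seq": s}, "payload": "..."}
         for n, s in [("Ida", 99), ("Ida", 100), ("Ida", 101), ("Ida", 102),
                      ("Ida", 103), ("Idb", 56), ("Idb", 57)]]
print([(m["id"]["node"], m["id"]["seq"]) for m in select_undelivered(queue, received)])
# -> [('Ida', 101), ('Ida', 102), ('Ida', 103)]; nothing is outstanding for Idb.
```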
Thanks to the described solution, it is ensured that, in case of failure of a domain manager, messages which were sent but not yet received by subordinate network nodes (for example, messages that were waiting to be routed) do not get lost. The backup node is put in a condition to know exactly which messages need to be sent to each of the linked network nodes, avoiding any dispatch duplication or delay, and to send the messages in the proper chronological order.
It is also possible to define different categories of messages, for example having different criticality, and to manage in the way previously described only those messages that are considered more critical for the network operation.
The implementation of the present invention has been described making reference to an exemplary embodiment thereof, however those skilled in the art will be able to envisage modifications to the described embodiment, as well as to devise different embodiments, without however departing from the scope of the invention as defined in the appended claims.
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc. Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of the present description, a computer-usable or computer-readable medium can be any apparatus, device or element that can contain, store, communicate, propagate, or transport the program for use by or in connection with the computer or instruction execution system.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor storage medium, network or propagation medium. Examples of a storage medium include a semiconductor memory, fixed storage disk, removable floppy disk, magnetic tape, and an optical disk. Current examples of optical disks include compact disk—read only memory (CD-ROM), compact disk—read/write (CD-R/W) and digital versatile disk (DVD). Examples of a propagation medium include wires, optical fibers, and wireless transmission.
The invention can be applied in a data processing system having a different architecture or based on equivalent elements; each computer can have another structure or it can be replaced with any data processing entity (such as a PDA, a mobile phone, and the like).
Number | Date | Country | Kind |
---|---|---|---
05300432.1 | May 2005 | EP | regional |