The present invention relates generally to communications networks and multiprocessing systems or networks having a shared communications fabric. More particularly, the invention relates to efficient techniques for correlating a link event to particular nodes in a multi-path network to facilitate updating of status of source-destination routes in the affected nodes, as well as to a technique for consolidating multiple substantially simultaneous link events within the network to facilitate updating of status of source-destination routes in the affected nodes.
Parallel computer systems have proven to be an expedient solution for achieving greatly increased processing speeds heretofore beyond the capabilities of conventional computational architectures. With the advent of massively parallel processing machines such as the IBM® RS/6000® SP1™ and the IBM® RS/6000® SP2™, volumes of data may be efficiently managed and complex computations may be rapidly performed. (IBM and RS/6000 are registered trademarks of International Business Machines Corporation, Old Orchard Road, Armonk, N.Y., the assignee of the present application.)
A typical massively parallel processing system may include a relatively large number, often in the hundreds or even thousands, of separate, though relatively simple, microprocessor-based nodes which are interconnected via a communications fabric comprising a high speed packet switch network. Messages in the form of packets are routed over the network between the nodes, enabling communication therebetween. As one example, a node may comprise a microprocessor and associated support circuitry such as random access memory (RAM), read only memory (ROM), and input/output (I/O) circuitry which may further include a communications subsystem having an interface for enabling the node to communicate through the network.
Among the wide variety of packet network forms currently available, perhaps the most traditional architecture implements a multi-stage interconnected arrangement of relatively small cross point switches, each switch typically being an N-port bidirectional router, where N is usually either 4 or 8, whose N ports are internally interconnected via a cross point matrix. For purposes herein, the switch may be considered an 8 port router switch. In such a network, each switch in one stage, beginning at one side (the so-called input side) of the network, is interconnected through a unique path (typically a byte-wide physical connection) to a switch in the next succeeding stage, and so forth until the last stage is reached at an opposite side (the so-called output side) of the network. The bi-directional router switch included in this network is generally available as a single integrated circuit (i.e., a “switch chip”) which is operationally non-blocking, and accordingly a popular design choice. Such a switch chip is described in U.S. Pat. No. 5,546,391 entitled “A Central Shared Queue Based Time Multiplexed Packet Switch With Deadlock Avoidance” by P. Hochschild et al., issued on Aug. 31, 1996.
A switching network typically comprises a number of these switch chips organized into interconnected stages, for example, a four switch chip input stage followed by a four switch chip output stage, all eight switch chips being included on a single switch board. With such an arrangement, a message passing between any two ports on different switch chips in the input stage would first be routed through the switch chip in the input stage that contains the source (input) port to any of the four switch chips comprising the output stage; the switch chip in the output stage would then route the message back (i.e., the message packet would reverse its direction) to the switch chip in the input stage containing the destination (output) port for the message. Alternatively, in larger systems comprising a plurality of such switch boards, messages may be routed from a processing node, through a switch chip in the input stage of the switch board, to a switch chip in the output stage of the switch board, and from the output stage switch chip to another interconnected switch board (and thereon to a switch chip in the input stage). Within an exemplary switch board, switch chips that are directly linked to nodes are termed node switch chips (NSCs) and those which are connected directly to other switch boards are termed link switch chips (LSCs).
Switch boards of the type described above may simply interconnect a plurality of nodes, or alternatively, in larger systems, a plurality of interconnected switch boards may have their input stages connected to nodes and their output stages connected to other switch boards; such boards are termed node switch boards (NSBs). Even more complex switching networks may comprise intermediate stage switch boards which are interposed between and interconnect a plurality of NSBs. These intermediate switch boards (ISBs) serve as a conduit for routing message packets between nodes coupled to switches in a first and a second NSB.
Switching networks are described further in U.S. Pat. Nos. 6,021,442; 5,884,090; 5,812,549; 5,453,978; and 5,355,364, each of which is hereby incorporated herein by reference in its entirety.
Various techniques have been used for generating routes in a multi-path network. While some techniques generate routes dynamically, others generate static routes based on the connectivity of the network. Dynamic methods are often self-adjusting to variations in traffic patterns and tend to achieve as even a flow of traffic as possible. Static methods, on the other hand, are pre-computed and do not change during the normal operation of the network. Further, routes for transmitting packets in a multistage packet switched network can be either source based or destination based. In source based routing, the source determines the route along which the packet is to be sent and sends the route along with the packet. The intermediate switching points route the packet according to the passed route information. Alternatively, in destination based routing, the source places the destination identifier in the packet and injects it into the network. The switching points either contain a routing table or employ logic to determine how the packet should be sent onward. In either case, the method to determine the route can be static or dynamic, or some combination of static and dynamic routing.
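By way of illustration only, the difference between the two styles can be sketched in a few lines of Python; the Packet class and the send callback below are assumptions for this sketch and do not appear in any of the referenced systems:

    class Packet:
        def __init__(self, destination, route=None):
            self.destination = destination  # destination identifier
            self.route = list(route or [])  # exit ports, for source routing

    def forward_source_routed(packet, send):
        # Source-based: the packet carries its route; the switching point
        # simply consumes the next hop (an exit port number).
        out_port = packet.route.pop(0)
        send(out_port, packet)

    def forward_destination_routed(packet, routing_table, send):
        # Destination-based: the packet carries only a destination id; the
        # switching point looks the exit port up in its local table.
        out_port = routing_table[packet.destination]
        send(out_port, packet)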
One common technique for sending packets between source-destination pairs in a multi-path network is static, source-based routing. For example, reference the above-incorporated co-pending applications, as well as the High-Performance Switch (HPS) released by International Business Machines Corporation, one embodiment of which is described in “An Introduction to the New IBM eServer pSeries® High Performance Switch,” SG24-6978-00, December 2003, which is hereby incorporated herein by reference in its entirety. As described in these co-pending applications, a suitable algorithm is employed to generate routes to satisfy certain pre-conditions, and these routes are stored in node tables, which grow with the size of the network. When a packet is to be sent from a source node to a destination, the source node references its route tables, selects a route to the destination and sends the route information along with the packet into the network. Each intermediate switching point looks at the route information and determines the port through which the packet should be routed at that point.
In a multi-stage network, any given link in the network will be part of routes between a set of source-destination pairs, which themselves will be a subset of all source-destination pairs of the network. If reliable message transfer is to be maintained, an approach is needed to efficiently and quickly identify routes affected by a link event and take appropriate action dependent on the event. The present invention addresses this need both in the case of a single link event and in that of multiple substantially simultaneous link events.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a communications network which includes a network of interconnected nodes. The nodes are at least partially interconnected by links, and are adapted to communicate by transmitting packets over the links. Each node has an associated network interface which defines a plurality of routes for transferring packets from that node as source node to a destination node, and further includes path status indicators for indicating whether a route is usable or is unusable as being associated with a fault. The network further includes a network manager for monitoring the network of interconnected nodes and noting a link event therein. Responsive to the presence of a link event, the network manager determines, with reference to an ascertained link level within the network of the link event, path status indicator updates to be provided to the respective network interfaces of affected nodes in the network of interconnected nodes.
In another aspect, a method of maintaining communication among a plurality of nodes in a network is provided. The method includes: defining a plurality of static routes for transferring a packet from a respective node as source node to a destination node in the network; monitoring the network to identify a link event within the network; providing path status indicators to at least some nodes of the plurality of nodes for indicating whether a source-destination route is usable or is unusable as being associated with a link fault; and employing a network manager to monitor the network for link events, and upon noting a link event, for determining, with reference to an ascertained link level within the network of the link event, path status indicator updates to be provided to the respective network interfaces of affected nodes of the network of interconnected nodes.
In a further aspect, at least one program storage device is provided readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform a method of maintaining communication among a plurality of nodes in a network of interconnected nodes. The method again includes: defining a plurality of static routes for transferring a packet from a respective node as source node to a destination node in the network; monitoring the network to identify a link event within the network; providing path status indicators to at least some nodes of the plurality of nodes for indicating whether a source-destination route is usable or is unusable as being associated with a link fault; and employing a network manager to monitor the network for link events, and upon noting a link event, for determining, with reference to an ascertained link level within the network of the link event, path status indicator updates to be provided to the respective network interfaces of affected nodes of the network of interconnected nodes.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.
One or more aspects of the present invention are particularly pointed out and distinctly claimed as examples in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
Generally stated, this invention relates in one aspect to the handling of situations in systems in which one or more faults, each requiring one or many repair actions, may occur. The repair actions themselves span a set of hierarchical steps in which a higher level action encompasses levels beneath that level. A centralized manager, referred to herein as the network manager, is notified by network hardware of a link fault occurring within the network, and is required to direct or take appropriate actions. At times, these repair actions can become time-critical. One particular type of system which faces such time-critical repair actions is a networked computing cluster. This disclosure utilizes the example of a cluster of hosts interconnected by a regular, multi-stage interconnection network and illustrates solutions to the problem.
As noted above, a common method for sending packets between source-destination pairs in a multi-stage packet switched network of interconnected nodes is static, source-based routing. When such a method is employed, sources maintain statically computed route tables which identify routes to all destinations in the network. Typically, a suitable algorithm is used to generate routes that satisfy certain preconditions, and these routes are stored in tables which grow with the size of the network. When a packet is to be sent to a destination, the source will look up the route table, select a route to that destination and send the route information along with the packet. Intermediate switching points examine the route information and determine the port through which the packet should be routed at each point.
In a multi-stage network, any given link in the network may be part of multiple routes between a set of source-destination pairs, which will be a subset of all the source-destination pairs of the network. If reliable message transfer is to be maintained, it becomes necessary to quickly identify the routes affected by a link event, such as a link failure, and take appropriate recovery action. The recovery action may be to replace the failed route with a good route, or simply to avoid the failed route, as described further in the above-referenced incorporated, co-pending application entitled “Reliable Message Transfer Over an Unreliable Network.” For any recovery action, it is important to identify the set of failed routes so that the recovery action can be performed efficiently.
When the routes themselves are compactly encoded in a form that can be understood by logic at the switching points, they may not contain the identity of the links used by each hop in the route. One direct method to identify routes affected by a link failure is to reverse map the routes onto the links and maintain this information in the network manager that is responsible for the initial route generation, as well as for the repair action. Such reverse maps, however, would require a very large amount of storage for large networks. Thus, a technique that avoids the creation and maintenance of reverse maps to support repair actions would be desirable.
Further, when multiple faults, each requiring one of many recovery actions, occur substantially simultaneously (i.e., within a defined time interval of each other), then the recovery or repair actions are preferably consolidated. It is possible that the repair actions themselves span a set of hierarchical steps in which a higher level action encompasses all levels beneath that level. An efficient technique is thus described herein to consolidate repair actions and maintain packet transport when multiple link failures occur substantially simultaneously.
The present invention relates in one aspect to a method to quickly and efficiently correlate and consolidate link events, i.e., a faulty link or a recovered link, occurring in a network of interconnected nodes, without maintaining a reverse map of the source-destination routes employed by the nodes. The solution presented herein recognizes the connection between the level of links in a multi-stage network and the route generation algorithm employed; it identifies the source nodes whose routes are affected by a given link event and categorizes them in terms of the extent of repair action needed. It also presents a technique to collect fault data relating to multiple faults occurring close to each other, and to analyze that data to derive a consolidated repair action that will be completed within a stipulated time interval.
Referring to the drawings,
Packets are injected into and retrieved from the cluster network using switch network interfaces 228, or specially designed adapters, between the hosts and the cluster network. Each switch network interface 228 comprises a plurality of route tables, preferably three or more. Each route table is indexed by a destination identifier. In particular, each entry in the route table defines a unique route that will move an incoming packet to the destination defined by its index. The routes typically span one or more switching elements and two or more links in the cluster network. The format of the route table is determined by the network architecture. In an exemplary embodiment, four predetermined routes are selected from among the plurality of routes available between a source and destination node-pair. A set of routes thus determined between a source and all other destinations in the network is placed on the source in the form of route tables. During cluster operation, when a source node needs to send a packet to a specific destination node, one of the (e.g., four) routes from the route table is selected as the path for sending the packet.
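The following minimal Python sketch shows one way such per-destination route tables and path status indicators might be organized; the RouteTable class and the rotating selection policy are illustrative assumptions, not the actual adapter design:

    ROUTES_PER_DESTINATION = 4  # exemplary number of routes per node-pair

    class RouteTable:
        # One row of routes per destination, with a parallel list of path
        # status bits; a bit that is off marks its route as unusable.
        def __init__(self):
            self.routes = {}       # destination id -> list of encoded routes
            self.path_status = {}  # destination id -> list of booleans

        def add_destination(self, dest, routes):
            self.routes[dest] = list(routes)
            self.path_status[dest] = [True] * len(routes)

        def select_route(self, dest, counter):
            # Rotate over the usable paths only; a path whose status bit
            # is off is skipped without ever being probed.
            usable = [r for r, up in
                      zip(self.routes[dest], self.path_status[dest]) if up]
            if not usable:
                raise RuntimeError("no usable path to destination %s" % dest)
            return usable[counter % len(usable)]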
In an exemplary embodiment, as illustrated in
In accordance with an aspect of the present invention, the network manager identifies faults in the network in order to determine which of the routes, if any, on any of the hosts are affected by a failure within the network. In an exemplary embodiment, the switch network interface 128 (see
Another advantage of the technique for providing reliable message transfer in accordance with aspects of the present invention is that the global knowledge of the network status is maintained by the network manager 230 (see
Yet another advantage of the present invention is that all paths that fail due to a link failure are marked unusable by the network manager by turning their path status bits off. While prior methods rely on message delivery failure to detect a failed path, the present invention has the capability to detect failed paths and avoid them before a message transmission fails.
Still a further advantage of the present invention is that when a failed path becomes usable again, the network manager merely turns the appropriate path status bits back on. This is in contrast to prior methods that require testing the path before path usage is reinstated. Such testing by attempting message transmission is not needed in accordance with the present invention.
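Using the hypothetical RouteTable sketched above, both the repair and the restoration reduce to flipping status bits, with no probe traffic:

    def set_path_status(table, dest, path_index, usable):
        # Turning a bit off withdraws the path; turning it back on
        # reinstates the path immediately, with no test transmission.
        table.path_status[dest][path_index] = usable

    # On a link failure:       set_path_status(table, 1024, 2, False)
    # When the link recovers:  set_path_status(table, 1024, 2, True)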
Aspects of the present invention are illustratively described herein in the context of a massively parallel processing system, and particularly within a high performance communication network employed within the IBM® RS/6000® SP™ and IBM eServer pSeries® families of Scalable Parallel Processing Systems manufactured by International Business Machines (IBM) Corporation of Armonk, N.Y.
As briefly noted, the correlation and consolidation facility of the present invention is described herein, by way of example, in connection with a multi-stage packet-switch network. In one embodiment, the network may comprise the switching network employed in IBM's SP™ systems. The nodes in an SP system are interconnected by a bi-directional multi-stage network. Each node sends and receives messages from other nodes in the form of packets. The source node incorporates the routing information into packet headers so that the switching elements can forward the packets along the right path to a destination. A Route Table Generator (RTG) implements the IBM SP2™ approach to computing multiple paths (the standard is four) between all source-destination pairs. The RTG is conventionally based on a breadth-first search algorithm.
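A breadth-first-search route generator of this general kind might be sketched as follows; the adjacency-list graph representation and the enumeration of up to four shortest paths are illustrative assumptions, and the actual RTG enforces additional preconditions not modeled here:

    from collections import deque

    def hop_counts_to(graph, dest):
        # Standard BFS giving each node's hop distance to 'dest';
        # 'graph' maps a node to the list of its neighbors.
        dist = {dest: 0}
        queue = deque([dest])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    def k_shortest_paths(graph, source, dest, k=4):
        # Enumerate up to k shortest paths by following only edges that
        # step one BFS level closer to the destination.
        dist = hop_counts_to(graph, dest)
        paths, stack = [], [[source]]
        while stack and len(paths) < k:
            path = stack.pop()
            u = path[-1]
            if u == dest:
                paths.append(path)
                continue
            for v in graph[u]:
                if dist.get(v) == dist.get(u, 0) - 1:
                    stack.append(path + [v])
        return paths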
Before proceeding further, certain terms employed in this description are defined:
SP System: For the purpose of this document, IBM's SP™ system means generally a set of nodes interconnected by a switch fabric.
Node: The term node refers to, e.g., processors that communicate amongst themselves through a switch fabric.
N-way System: An SP system is classified as an N-way system, where N is the maximum number of nodes that the configuration can support.
Switch Fabric: The switch fabric is the set of switching elements or switch chips interconnected by communication links. Not all switch chips on the fabric are connected to nodes.
Switch Chip: A switch chip is, for example, an eight port cross-bar device with bi-directional ports that is capable of routing a packet entering through any of the eight input channels to any of the eight output channels.
Switch Board: Physically, a Switch Board is the basic unit of the switch fabric. It contains in one example eight switch chips. Depending on the configuration of the systems, a certain number of switch boards are linked together to form a switch fabric. Not all switch boards in the system may be directly linked to nodes.
Link: The term link is used to refer to a connection between a node and a switch chip, or two switch chips on the same board or on different switch boards.
Node Switch Board: Switch boards directly linked to nodes are called Node Switch Boards (NSBs). Up to 16 nodes can be linked to an NSB.
Intermediate Switch Board: Switch boards that link NSBs in large SP systems are referred to as Intermediate Switch Boards (ISBs). A node cannot be directly linked to an ISB. Systems with ISBs typically contain 4, 8 or 16 ISBs. An ISB can also be thought of generally as an intermediate stage.
Route: A route is a path between any pair of nodes in a system, including the switch chips and links as necessary.
One embodiment of a switch board, generally denoted 300, is depicted in
The correlation and consolidation facility disclosed herein categorizes links into various levels ranging from 0 to n−1. The links connecting the network hosts to the peripheral switches are level 0 links; the on-board links on the peripheral switches are level 1 links; the links between the peripheral switches and the next stage of switches are level 2 links; level 3 links are on the intermediate switch boards; level 4 links are between the blocks of 256 endpoints and the secondary switch boards (SSBs); and level 5 links are the links on the secondary switch boards themselves. Depending upon its level, a given link has the potential to carry routes from or to specific sets of host nodes. Identification of the set of host nodes reduces the routes to be examined to a definite subset. Having found that subset, various methods described below can be used to identify the specific routes that are passing through the link.
In the example network of
In this sample network, links at level 0, the ones that connect to the hosts, carry all the routes to and from the attached host nodes. The next level, i.e., level 1 links, are the links on board the NSBs. When a link at this level fails, one route to all off-chip destinations from the four hosts or sources connected to the chip will fail. Also, one route from all off-chip sources to the four destinations on this chip will fail. The next level is level 2, the links between NSBs and ISBs. When a link at this level fails, one route to all off-board destinations from the 16 sources on the NSB connected to the faulty link will fail. Also, one route from all off-board sources to the 16 destinations on this NSB will fail. A level 3 fault will affect routes to and from 64 host nodes; a level 4 fault will affect routes to and from 256 host nodes; a level 5 fault, being at the center of the network, will affect routes between the 1024 hosts on one side and the 1024 hosts on the other side of the link.
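Because the topology is regular and hosts can be numbered consecutively, the set of hosts on the near side of a faulty link follows directly from the link level. In a brief sketch (the numbering convention here is an assumption for illustration):

    # Hosts per chip/board/block at each link level, per the text: 4 per
    # switch chip, 16 per NSB, 64 per NSB group, 256 per block, and 1024
    # per side of a level 5 link in the 2048-way example.
    HOSTS_BEHIND_LEVEL = {0: 1, 1: 4, 2: 16, 3: 64, 4: 256, 5: 1024}

    def near_side_hosts(level, index):
        # 'index' numbers the chip/board/block on the host side of the
        # faulty link; hosts are numbered consecutively within it.
        span = HOSTS_BEHIND_LEVEL[level]
        return range(index * span, (index + 1) * span)

One route from each near-side host to every off-side destination, and one route from every off-side source to each near-side host, must then be examined; for example, near_side_hosts(2, 0) yields hosts 0 through 15.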
Table 1 illustrates the number of source-destination pairs for the two modification types for each link level in a 2048-node network.
For illustration, reference
The top right block of 256 contains NSBs 65-80. Assume that a level 2 link between one of these NSBs and an ISB of the block is also faulty and is handled next. The link level query will branch to level 2 (
As noted below with reference to
Before describing the link event correlation and consolidation facility further, the repair process of the reliable message transfer described in the above-incorporated application entitled “Reliable Message Transfer Over an Unreliable Network” is reviewed.
Continuing with discussion of the link event correlation and consolidation facility,
When static routes are generated and stored in route tables on the host nodes, only a few of the many possible routes between a source-destination pair will be selected. In the cluster implementation, four such routes may be selected. Thus, not all hosts in the selected subsets will have routes to or from them passing through the failed link. In the repair action phase it is necessary to identify the routes which have the potential to be affected. The path table bit corresponding to any affected route should then be turned “off”. Similarly, the path table bits corresponding to any restored routes are turned “on” when links come back up. One direct method to find the routes passing through the failed link is to trace all routes to or from the hosts in the selected set hop-by-hop and determine those that pass through the link. A second method is to use the routing algorithm to identify the hosts whose routes pass through the failed link; because of the regularity of the network, these are determinable algebraically. A third method is to create a route mask, built utilizing the specific connectivity and the structure of route words, which is then applied to all routes in the selected list to identify those passing through the failed link.
Consider host node 0 in the example network. Destinations 1024-1279 will be in its destination list. Choosing destination 1024, the possible routes from 0 to 1024 will each contain 10 hops, with a hop being defined by a port number through which the packet traveling on that route will exit a chip. All possible routes between 0 and 1024 can be represented by the set:
(4,5,6,7)-(0,1,2,3)-(4,5,6,7)-(0,1,2,3)-(4,5,6,7)-0-4-0-4-0
Four of these would have been placed in the route table. The network manager maintains a database of the links and devices in the network that contains their status and interconnectivity. In implementing the first method, the status of each of these ports is checked in the database while walking through the route. A route is declared good if all intervening links between the source node and the destination node along the route are good. If any one link is bad, the route is declared bad and the corresponding path table bit is turned “off”. Of the two bad links in the above example, the first has the potential of being in the 6th hop of the route between 0 and 1024. If it is found, the corresponding route is deemed bad.
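The first method can thus be sketched as a simple walk along the route; the database interface (link_up) and the route encoding (a list of exit ports) below are assumptions for illustration:

    def route_is_good(source_chip, route_ports, topology, link_up):
        # 'topology[(chip, port)]' gives the chip reached by exiting
        # 'chip' through 'port'; 'link_up((chip, port))' reports the
        # status of that link in the network manager's database.
        chip = source_chip
        for port in route_ports:
            if not link_up((chip, port)):
                return False              # the route crosses a bad link
            chip = topology[(chip, port)]
        return True

Any route for which such a check fails has its corresponding path table bit turned “off”.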
When multiple links at different levels fail at the same time, a host may end up requiring multiple portions of its route tables to be examined. Identifying a superset of these sets allows a single action to be taken. Thus, disclosed herein is a technique to collect and consolidate link events, and to analyze them to derive repair actions that can be completed within a stipulated time interval. The collection of link event data commences upon receipt of a first link event notification, and a time interval is set, within the total available time, to collect any other faults/recoveries in the system. All gathered data is then analyzed and a unique set of repair actions is arrived at such that all collected link events are handled.
In describing the consolidation facility with reference to
The steps in the implementation of
The network manager (NM) receives a link outage event 1200, thus entering the link collection phase, and pushes the link onto a Status Change List of links 1210;
The network manager waits T seconds to see if there is any more link outage event in the centralized message queue 1220 (T being, e.g., 5 seconds);
If there is one, then the network manager waits until no further event has arrived within the last T-second period;
Since it is possible for a link to have recovered during this time, the network manager looks for any pending link up events for T seconds 1230;
If found, the network manager collects all pending link up events 1240 until there are no more of them for T seconds;
The network manager will then go back to check for pending link down events, without waiting for any amount of time 1250;
If there are, then the link event information is pushed into the status change list 1260 and processing returns to determine whether another new link event of the same type has been received in the next T seconds 1220. Otherwise, the network manager enters the analysis phase 1270.
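The collection phase can be sketched as follows; the get_event interface, which returns the next pending event of a given kind or None after a timeout, is an assumption standing in for the centralized message queue:

    def collect_link_events(first_down_event, get_event, T=5.0):
        # Collect link-down and link-up events until the network has been
        # quiet for T seconds, then hand off to the analysis phase.
        status_change_list = [first_down_event]
        while True:
            # Keep collecting link-down events until T quiet seconds pass.
            while (ev := get_event("down", timeout=T)) is not None:
                status_change_list.append(ev)
            # Links may have recovered meanwhile: collect link-up events
            # the same way, until T quiet seconds pass.
            while (ev := get_event("up", timeout=T)) is not None:
                status_change_list.append(ev)
            # Recheck for link-down events without waiting; if none are
            # pending, the analysis phase can begin.
            ev = get_event("down", timeout=0)
            if ev is None:
                return status_change_list
            status_change_list.append(ev)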
If the link event is at level 2, then the network manager determines whether the link's board is connected to the particular host node 1540, and if “yes”, sets the host node's status to modification type FULL 1545. Otherwise, the manager determines whether the host modification type is already FULL 1550, and if “yes”, processing is complete. If the host modification type is not already FULL, then it is set to PARTIAL 1555.
If the link event is at level 3, then the network manager determines whether there are any secondary switch boards 1560. If “no”, then the host modification type is set to FULL 1565. If there are secondary switch boards, a determination is made whether the link's block is connected to the particular host node 1570, and if “yes”, then that host is set to modification type FULL 1575. Otherwise, the network manager determines whether the host is already modification type FULL, and if “yes”, transition step processing is complete for the particular node. If not already FULL, the host's modification type is set to PARTIAL 1585.
If the link event is at level 4, the network manager again inquires whether there are any secondary switch boards 1590, and if “no”, sets the host modification type to FULL 1595. Otherwise, the network manager determines whether the link's chip is connected to the host block 1600, and if “yes”, the host is set to modification type FULL 1605. If the link's chip is not connected to the host block, then a determination is made whether the host is already at modification type FULL, and if so, transition step processing for the particular host node is complete. Otherwise, the host modification type is set to PARTIAL 1615.
Finally, if the link event is at level 5 of the 5 level network of interconnected nodes depicted in
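For levels 2 through 4, the transition step reduces to a short decision per host, which might be sketched as follows; the containment tests (board_connected_to and the like) are hypothetical names for the topology queries the text relies on, and the remaining levels follow analogous rules:

    FULL, PARTIAL = "FULL", "PARTIAL"

    def update_modification_type(host, link, level, has_ssbs):
        # Escalate the host's modification type; FULL is never downgraded.
        def mark_partial_unless_full():
            if host.mod_type != FULL:
                host.mod_type = PARTIAL

        if level == 2:
            if link.board_connected_to(host):
                host.mod_type = FULL
            else:
                mark_partial_unless_full()
        elif level == 3:
            if not has_ssbs or link.block_connected_to(host):
                host.mod_type = FULL
            else:
                mark_partial_unless_full()
        elif level == 4:
            if not has_ssbs or link.chip_connected_to(host.block):
                host.mod_type = FULL
            else:
                mark_partial_unless_full()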
When a node is in modification type FULL, the entire path table is processed for the repair action, whereas when the modification type of a host is PARTIAL, only the particular destinations in the destination list for that host are processed. Whatever the type of modification required, the potentially affected routes can be examined in one of the three ways noted above, i.e., checking hop-by-hop for faulty links on the route, algebraically examining the routes using the routing algorithm, or constructing a route mask for the combination of faulty links and applying the mask to the potentially affected routes.
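As one illustration of the route mask method, a mask can be represented as a set of required exit ports per constrained hop position; this format is an assumption for the sketch, since the actual mask is built from the specific connectivity and the structure of the route words:

    def route_matches_mask(route_ports, mask):
        # 'mask' maps a hop index to the set of exit ports that would
        # place a packet on a faulty link at that hop; any match flags
        # the route so that its path table bit can be turned off.
        return any(i < len(route_ports) and route_ports[i] in ports
                   for i, ports in mask.items())

    # Example: flag routes whose 6th hop (index 5) exits through port 0,
    # as for the first bad link in the example above.
    assert route_matches_mask([4, 0, 4, 0, 4, 0, 4, 0, 4, 0], {5: {0}})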
The capabilities of one or more aspects of the present invention can be implemented in software, firmware, hardware or some combination thereof.
One or more aspects of the present invention can be included in an article of manufacture (e.g., one or more computer program products) having, for instance, computer usable media. The media has therein, for instance, computer readable program code means or logic (e.g., instructions, code, commands, etc.) to provide and facilitate the capabilities of the present invention. The article of manufacture can be included as a part of a computer system or sold separately.
Additionally, at least one program storage device readable by a machine embodying at least one program of instructions executable by the machine to perform the capabilities of the present invention can be provided.
The flow diagrams depicted herein are just examples. There may be many variations to these diagrams or the steps (or operations) described therein without departing from the spirit of the invention. For instance, the steps may be performed in a differing order, or steps may be added, deleted or modified. All of these variations are considered a part of the claimed invention.
Although preferred embodiments have been depicted and described in detail herein, it will be apparent to those skilled in the relevant art that various modifications, additions, substitutions and the like can be made without departing from the spirit of the invention and these are therefore considered to be within the scope of the invention as defined in the following claims.
This application contains subject matter which is related to the subject matter of the following co-pending applications, each of which is assigned to the same assignee as this application. Each of the below-listed applications is hereby incorporated herein by reference in its entirety: “Fanning Route Generation Technique for Multi-Path Networks”, Ramanan et al., Ser. No. 09/993,268, filed Nov. 19, 2001; “Divide and Conquer Route Generation Technique for Distributed Selection of Routes Within A Multi-Path Network”, Aruna V. Ramanan, Ser. No. 11/141,185, filed May 31, 2005; and “Reliable Message Transfer Over An Unreliable Network”, Bender et al., Ser. No. ______, filed Aug. 24, 2005 (Attorney Docket No. POU920050041US1).