The present invention relates generally to computer systems, and more particularly to dynamic detachment of node(s) in a multi-node system.
A multi-node system is one in which a plurality of nodes are interconnected. An example multi-node system is the xSeries® eServer™ x440 from the International Business Machines Corporation (“IBM”). (“xSeries” is a registered trademark, and “eServer” is a trademark, of IBM.) Multi-node systems provide massive redundancy and processing power, and therefore improve system availability, performance, and scalability.
A multi-node system might comprise, for example, 4 interconnected nodes, where each node comprises 8 processors, such that the overall system effectively offers 32 processors. Each node typically contributes memory resources that are shareable among the interconnected nodes.
Multi-node systems commonly use a system management interrupt architecture, referred to herein as “system management interrupt”, or “SMI”. When an interrupt vector is written to an SMI register, an SMI interrupt is generated. The interrupt is then handled by an SMI interrupt handler.
In one aspect, the present invention provides node detach in a multi-node system, comprising detecting an interrupt, by an interrupt handler of a particular one of the nodes of the multi-node system, and entering the interrupt handler to process the interrupt. Upon determining that the interrupt indicates that the particular node is to be detached from the multi-node system, this aspect further comprises: transparently hosting in-use memory of the particular node at a different one of the nodes which has available memory, such that subsequent references to the in-use memory are transparently resolved to the different one of the nodes; and then detaching the particular node from the multi-node system by not exiting from the interrupt handler.
In this aspect, the transparently hosting preferably further comprises: copying contents of the in-use memory to the different one of the nodes; creating a mapping between a location of the in-use memory at the particular node and a new location of the copied contents at the different node, wherein the mapping enables the transparent resolution for the subsequent references; marking unused memory at the particular node as unavailable; and marking the new location at the different node as unavailable.
In another aspect, the present invention provides node detach in a multi-node system comprising a plurality of interconnected nodes, wherein each of the nodes has associated therewith an interrupt handler for detecting and processing interrupts. This aspect preferably comprises: detecting, by the interrupt handler associated with a particular one of the nodes, an interrupt; entering the interrupt handler to process the interrupt; and nondisruptively detaching the node, responsive to determining that the interrupt indicates that the particular node is to be detached from the multi-node system.
In this aspect, the nondisruptive detach preferably further comprises: copying contents of in-use memory of the particular node to a different one of the nodes which has available memory; creating a mapping between a location of the in-use memory at the particular node and a new location of the copied contents at the different node, wherein the mapping enables transparent resolution of subsequent references to the in-use memory; marking unused memory at the particular node as unavailable; marking the new location at the different node as unavailable; and then detaching the particular node from the multi-node system by not exiting from the interrupt handler.
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined by the appended claims, will become apparent in the non-limiting detailed description set forth below.
The present invention will be described with reference to the following drawings, in which like reference numbers denote the same element throughout.
Preferred embodiments are directed toward dynamically detaching one or more nodes in a multi-node environment (e.g., responsive to an error situation). Using techniques disclosed herein, a node can be detached without adversely impacting the operating system or others of the nodes. This node detach operation may be referred to as a “hot detach”; that is, it occurs dynamically, while the overall system continues to function. The node detach may be performed, for example, because the node is failing. Each node of the multi-node system contributes memory, which may be shared by other nodes at any particular point in time. If contents presently stored in the detaching node's memory were simply to disappear during a node detach, the system would likely crash as a result; in addition, losing the memory contents may lead to unpredictable results. To avoid this undesirable situation, the contents of in-use memory of the node being detached are copied to another node, and a memory map is updated to make the copy transparent to the operating system for subsequent memory accesses. Furthermore, the copied-to memory locations are programmatically blocked to prevent accidentally overwriting the copy.
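By way of illustration only, the transparent resolution of memory references described above might be sketched in Python as follows. The `Remap` record, the `resolve` function, and the use of plain integer addresses are hypothetical names and simplifications chosen for this sketch; they are not taken from the disclosed embodiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Remap:
    """Hypothetical record: [old_start, old_start + length) on old_node
    has been copied to new_node, beginning at new_start."""
    old_node: int
    old_start: int
    length: int
    new_node: int
    new_start: int


def resolve(remaps, node, addr):
    """Return the (node, address) that currently holds the referenced contents."""
    for r in remaps:
        if r.old_node == node and r.old_start <= addr < r.old_start + r.length:
            # The reference falls inside a copied range: redirect it,
            # preserving the offset within the range.
            return r.new_node, r.new_start + (addr - r.old_start)
    # Not remapped: the reference resolves to its original location.
    return node, addr
```

In this sketch a caller never needs to know whether a range was moved; it simply passes every reference through `resolve`, which is one plausible way to keep the copy transparent to the referencing component.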
A so-called “north bridge” component 115, 170 may be present in each node. A north bridge component is present in a chipset architecture commonly known as “north bridge, south bridge”. In this architecture, the north bridge component communicates with a processor 105, 155 over a bus (see reference numbers 108, 158 in
Embodiments of the present invention are not limited to this north bridge, south bridge chipset, however, and thus the depiction in
A scalability chip 120, 165 comprises one or more control fields, and is leveraged by preferred embodiments to enable information to be communicated among the nodes 100, 150 of the multi-node system (as will be described in more detail).
Each node of the multi-node system further comprises an SMI interrupt handler 110, 160. As noted earlier, when SMI interrupts are generated, they are handled by an SMI interrupt handler.
A shortcoming of prior art multi-node systems is that there is no way to bring down a single node without bringing down the operating system and the other nodes in the multi-node system. Any of a variety of error conditions might occur at a particular node, for example, in response to which the particular node should be detached from (i.e., cease participating in) the multi-node system. These error conditions include, by way of illustration only, detecting that the node is overheating and detecting that the node is experiencing a memory leak. Disadvantages of shutting down an entire multi-node system because of conditions pertaining only to a single one of the nodes include reduced system availability and reduced system throughput.
Prior art multi-node systems synchronously enter system management mode, or “SMM”, at all nodes whenever any one of the nodes receives an SMI interrupt. In this mode, normal processing at all of the nodes is halted while the SMI interrupt handler evaluates the interrupt in an attempt to determine its cause. If the error is catastrophic, the SMI handler will typically generate a machine check, forcing a reboot of all of the nodes. However, in many cases, the causing event need not affect the other nodes. In these cases, rebooting those nodes needlessly wastes time and resources.
Preferred embodiments of the present invention enable the SMI interrupt handlers at the nodes to operate independently, such that an individual node can detach from the multi-node system in a non-disruptive way. Using techniques disclosed herein, the processors of a node to be detached enter system management mode, under control of the node's SMI interrupt handler, while the processors on other nodes continue normal operation. Notably, the other nodes can continue functioning after the detaching node is detached, and memory resources in use at the detaching node can be transparently mapped to different memory locations such that executing components do not lose access to contents of the memory from the detaching node.
SMI interrupts in a prior art multi-node system are typically propagated, across the interconnections that connect the nodes together, to the SMI handler for each node. In these systems, an SMI interrupt that impacts one node therefore impacts all nodes, causing them all to stop normal processing and enter their interrupt handlers. This is inefficient and can have undesirable effects on the overall system. Preferred embodiments leverage the scalability chip in the nodes, as noted earlier, to inhibit propagation of SMI interrupts among the nodes, thereby providing for node independence with regard to SMI interrupt handling. The hot detach operation provided by the present invention can therefore be isolated to detaching a single node.
Referring now to
When a node detects that an SMI interrupt has been generated (Block 210), the interrupt handler of only the detecting node is involved. Once invoked (Block 215), this SMI interrupt handler evaluates the interrupt to determine whether the interrupt indicates that the node needs to detach from the system (Block 220).
If the test in Block 220 has a positive result, then at Block 225, the interrupt handler sends a message, preferably using a shared memory structure, to a memory controller referred to herein as a “daemon” that runs under control of the operating system. This message instructs the daemon that the node is about to detach. After the node signals the daemon, it then exits its SMI interrupt handler (Block 230), and the daemon processes the node detach operations (as discussed below with reference to
Once the daemon has finished, it generates another SMI interrupt to the local node. This interrupt is detected by the detaching node at Block 210, and the interrupt handler is entered again at Block 215. This time, the test in Block 220 has a negative result, and processing continues to Block 235, which tests whether the interrupt is a “daemon finished” signal, by which the daemon notifies the detaching node that the daemon has finished the detach processing.
If the test in Block 235 has a positive result, then control reaches Block 240, where the SMI interrupt handler of the detaching node does no further processing, and in particular, does not exit. The node is thus effectively removed from the system (although contents of the node's memory continue to be available, in the copied-to location(s), as discussed below with reference to
While many SMI interrupts may be properly isolated to a single node, there may be other scenarios where one node generates an SMI interrupt that should be propagated among the nodes to prevent system misbehavior. To account for scenarios in which a node detects an SMI interrupt that should be propagated among the interconnected nodes, preferred embodiments implement logic as will now be described with reference to
If the test at Block 245 has a negative result, then the interrupt that was detected at Block 210 is an interrupt that is to be processed by the local node only (Block 250), using techniques which do not form part of the inventive concepts disclosed herein. Following completion of that processing, control returns to Block 205 to await the next SMI interrupt at this node.
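The decision sequence performed by the SMI interrupt handler (the tests at Blocks 220, 235, and 245) might be sketched in Python as follows, by way of illustration only. The `Cause` and `Action` enumerations and the `dispatch` function are hypothetical names introduced for this sketch, and the sketch assumes that Block 245 tests whether the interrupt is one that should be propagated, as the surrounding text suggests.

```python
from enum import Enum, auto


class Cause(Enum):
    """Hypothetical classification of a detected SMI interrupt."""
    DETACH_REQUEST = auto()
    DAEMON_FINISHED = auto()
    NEEDS_PROPAGATION = auto()
    LOCAL_ONLY = auto()


class Action(Enum):
    """Hypothetical outcomes of the handler's tests."""
    SIGNAL_DAEMON_AND_EXIT = auto()  # Blocks 225 and 230
    REMAIN_IN_HANDLER = auto()       # Block 240: the node is detached
    PROPAGATE = auto()               # Block 255 onward
    PROCESS_LOCALLY = auto()         # Block 250


def dispatch(cause):
    """Map the evaluated interrupt cause to the handler's next step."""
    if cause is Cause.DETACH_REQUEST:      # test at Block 220
        return Action.SIGNAL_DAEMON_AND_EXIT
    if cause is Cause.DAEMON_FINISHED:     # test at Block 235
        return Action.REMAIN_IN_HANDLER
    if cause is Cause.NEEDS_PROPAGATION:   # test at Block 245 (assumed meaning)
        return Action.PROPAGATE
    return Action.PROCESS_LOCALLY
```

Under these assumptions, a node detach appears as two passes through `dispatch`: the first returns `SIGNAL_DAEMON_AND_EXIT`, and the second, triggered by the daemon's interrupt, returns `REMAIN_IN_HANDLER`.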
When control reaches Block 255, an interrupt has been detected that needs to be propagated from the local node to the other interconnected nodes. Accordingly, SMI interrupt propagation is (re)enabled at Block 255. This preferably comprises resetting the control field in the scalability chip and initializing a shared memory area where the SMI interrupt handlers of the other nodes will communicate with this node. The local node then forces a soft SMI interrupt condition to occur (Block 260). Triggering this interrupt causes the interrupt that was detected at Block 210 to be propagated from the local node to the interconnected nodes. As a result, each of those nodes will detect the interrupt and then enter its SMI interrupt handler. Those SMI interrupt handlers will query the shared memory area as to the cause of the interrupt, and will then take appropriate action, depending on their configuration. Each node that finishes processing this interrupt records status in the shared memory area to indicate that it is finished. As indicated at Block 265, the local node may also take action to process this SMI interrupt locally.
The local node then monitors the shared memory area (Block 270) to determine whether the other interconnected nodes have finished their processing of the propagated interrupt. If all of the nodes have finished, then the test at Block 275 has a positive result, and control preferably returns to Block 200, where the local node again disables SMI interrupt propagation and awaits subsequent interrupts. Otherwise, when the test at Block 275 has a negative result, the local node continues to monitor the shared memory area at Block 270.
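The shared memory handshake of Blocks 255 through 275 might be sketched in Python as follows, by way of illustration only. The `SharedArea` class, the `propagate` function, and the `max_polls` limit are hypothetical, and the remote nodes are simulated in-process (one node records its completion per polling pass) rather than running their own interrupt handlers.

```python
class SharedArea:
    """Hypothetical stand-in for the shared memory area initialized at Block 255."""

    def __init__(self, cause, expected_nodes):
        self.cause = cause                   # queried by the remote handlers
        self.expected = set(expected_nodes)  # nodes that must report completion
        self.finished = set()                # nodes that have recorded status

    def mark_finished(self, node_id):
        self.finished.add(node_id)

    def all_finished(self):
        return self.expected <= self.finished


def propagate(local_id, node_ids, cause, max_polls=100):
    """Simulate propagation from the local node and the monitoring loop."""
    others = [n for n in node_ids if n != local_id]
    area = SharedArea(cause, others)   # Block 255: initialize the shared area
    pending = list(others)             # Block 260: nodes reached by the soft SMI
    polls = 0
    while not area.all_finished():     # Blocks 270 and 275: monitor and test
        if polls >= max_polls:
            raise TimeoutError("remote nodes did not finish")
        polls += 1
        if pending:
            # Simulated remote node: handles the interrupt, then records status.
            area.mark_finished(pending.pop(0))
    # The caller would now disable propagation again (Block 200).
    return area, polls
```

The polling limit is an addition of this sketch; the text above describes the local node as continuing to monitor until all nodes have finished.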
Turning now to
When the daemon detects that a node has signaled it to perform a node detach (Block 300), it determines how much memory is currently in use at the detaching node (Block 305). The daemon then searches for available memory on others of the nodes in the multi-node system (Block 310). Preferably, this comprises consulting a memory map that records what memory is currently available to the multi-node system. (Refer to
The memory map is then revised (Block 325) to mark all currently unused memory locations on the detaching node as being unavailable, and (Block 330) to mark the copied-to location on the one or more other nodes as being unavailable. (Refer to
Finally, the daemon generates a soft SMI interrupt (Block 335), thereby signalling the detaching node that the daemon has finished its operations for detaching the node (i.e., that the memory copying and remapping operations are finished). The daemon then exits the processing of
In
The daemon determines, in this example scenario, that all of the currently-used memory from node 1 can be copied to a contiguous block of node 2 memory, from address 128M through address 256M.
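The daemon's memory map operations might be sketched in Python as follows, by way of illustration only. The `Region` record, the `detach` function, the state names, and the first-fit search are hypothetical choices made for this sketch. The sample layout in the usage note is likewise an assumption: it places 128M of in-use memory at the bottom of node 1 and a free block at 128M on node 2, which is consistent with the example scenario but is not confirmed by the text. The actual copying of memory contents is omitted.

```python
from dataclasses import dataclass

M = 1 << 20  # one megabyte


@dataclass
class Region:
    """Hypothetical memory map entry."""
    node: int
    start: int
    length: int
    state: str  # "in_use", "free", or "unavailable"


def detach(memory_map, detaching):
    """Remap in-use memory of the detaching node and revise the memory map."""
    remaps = []
    in_use = [r for r in memory_map if r.node == detaching and r.state == "in_use"]
    for used in in_use:
        # Block 310: search other nodes for available memory (first fit).
        target = next((r for r in memory_map
                       if r.node != detaching and r.state == "free"
                       and r.length >= used.length), None)
        if target is None:
            raise MemoryError("no other node has enough available memory")
        # Record the mapping from the old location to the copied-to location.
        remaps.append((used.node, used.start, target.node, target.start, used.length))
        if target.length > used.length:
            # Keep any remainder of the target region available.
            memory_map.append(Region(target.node, target.start + used.length,
                                     target.length - used.length, "free"))
            target.length = used.length
        target.state = "unavailable"   # Block 330: protect the copy
    for r in memory_map:
        if r.node == detaching and r.state == "free":
            r.state = "unavailable"    # Block 325: block unused memory
    return remaps
```

With an assumed map of `[Region(1, 0, 128*M, "in_use"), Region(1, 128*M, 128*M, "free"), Region(2, 0, 128*M, "in_use"), Region(2, 128*M, 128*M, "free")]`, calling `detach(memory_map, 1)` yields a single mapping from node 1 address 0 to node 2 address 128M, covering 128M, which matches the contiguous block from 128M through 256M described above.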
As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as methods, systems, and/or computer program products comprising computer-readable program code. Accordingly, the present invention may take the form of an entirely software embodiment, an entirely hardware embodiment, or an embodiment combining software and hardware aspects. In a preferred embodiment, the invention is implemented in software, which includes (but is not limited to) firmware, resident software, microcode, etc.
Furthermore, embodiments of the invention may take the form of a computer program product accessible from computer-usable or computer-readable media providing program code for use by, or in connection with, a computer or any instruction execution system. For purposes of this description, a computer-usable or computer-readable medium may be any apparatus that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device.
The medium may be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, removable computer diskette, random access memory (“RAM”), read-only memory (“ROM”), rigid magnetic disk, and optical disk. Current examples of optical disks include compact disk with read-only memory (“CD-ROM”), compact disk with read/write (“CD-R/W”), and DVD.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims shall be construed to include preferred embodiments and all such variations and modifications as fall within the spirit and scope of the invention. Furthermore, it should be understood that use of “a” or “an” in the claims is not intended to limit embodiments of the present invention to a singular one of any element thus introduced.