1. Technical Field
The present invention generally relates to computer systems and in particular to handling component failure within computer systems. Still more particularly, the present invention relates to failover protocols for handling component failure within computer systems.
2. Description of the Related Art
Higher-end computer systems are often designed with redundant components, with one component serving (or designated) as a primary component and the other component serving as a backup component. The computer system initially operates with the primary component. When/if the primary component fails, the backup component is activated to take over the responsibilities/functions of the primary component. The process by which operations are switched from a primary component over to a backup component is referred to as a “failover.” Conventionally, component redundancy and failover processes were provided for only certain component-configurations, including, for example, (1) an operating system (OS) providing access control over redundant I/O controllers and (2) an I/O controller providing access control over redundant I/O devices coupled to the I/O controller.
With conventional failover methods, any one of the redundant components or the associated access-controller is able to asynchronously initiate a failover process, and each component-initiated failover is completed independent of the failover of the other components. Thus, failover in response to failure of a primary component may be initiated from multiple sources running asynchronously, which results in multiple concurrent failover requests. The existence of these multiple concurrent failovers poses challenges in resynchronizing the three associated components (i.e., the access controller and the two redundant devices).
To overcome the synchronization problems with overlapping/concurrent failovers from multiple sources, an alternate failover method is provided by which a single source is pre-assigned to drive/control all failover processes. For example, for a system that provides redundant I/O controllers controlled by an OS, the failover is always controlled by the OS, which represents the single source at the access control level. With the second described component-configuration, having a single I/O controller and redundant devices, this single source is the I/O controller.
Additional failover methods simply switch the bus control signals from one component to the redundant (backup) component, leaving the access controller (e.g., the OS for redundant I/O controllers or the I/O controller for redundant devices) unaware that a failover has occurred. However, this failover method does not work within system topologies where the redundant components (e.g., I/O controllers) have buses which do not pass through some sort of multiplexer (MUX). With such topologies, complete failover cannot be accomplished by simply switching which controller's set of signals are selected as the output from the MUX.
Disclosed is a method and system for enabling failover of redundant service processors. A data processing system (e.g., a server) is designed with redundant service processors. Both service processors are capable of performing the full set of service processor functions, with one service processor (SP) assuming the role of a primary SP and the other SP designated as the backup SP. The primary SP performs the initialization, monitoring and control of the system resources. The backup SP is available to take over the role of the primary SP at any time the primary SP fails, goes offline, or relinquishes its primary role.
As a part of the initialization of the two SPs, the SPs negotiate between themselves which SP is the primary and which SP is the backup. The SPs then communicate these roles to the hypervisor (or firmware). After a primary SP communicates its role to the hypervisor, the hypervisor ensures that the configuration data stored on the primary SP is up-to-date. The hypervisor indicates this up-to-date status of the primary SP's configuration data by acknowledging the role message received from the primary SP.
During system operation with the primary SP, the backup SP and system hypervisor monitor the primary SP for any indication that the primary SP is no longer performing properly or is exhibiting operating characteristics associated with a “failure”. The particular failure conditions/characteristics being monitored for are preset by the system designer. In the event of a failure of the primary SP, any one (but only one) of the three components, from among the backup SP, hypervisor, or even the primary SP itself, may initiate the failover to the backup SP.
To provide the capability for each one of these multiple components to initiate a failover between SPs without causing overlapping/concurrent failovers, the backup SP always checks whether there is an ongoing failover operation (of the SPs) before the backup SP initiates a new failover. Only when there is no ongoing failover (of the SPs) within the system does the backup SP allow a new failover to begin. Any existing/ongoing failover is permitted to complete on the system, without any overlapping or concurrent failover.
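By way of illustration only, the check performed by the backup SP before a new failover is started may be sketched as follows. The class name, method names, and the use of a lock are assumptions made for this sketch and are not taken from the described embodiments.

```python
import threading


class BackupSP:
    """Hypothetical model of the backup SP's 'one failover at a time' check."""

    def __init__(self):
        self._lock = threading.Lock()
        self.failover_in_progress = False
        self.initiator = None

    def request_failover(self, source):
        """Return True if this request starts a new failover, else False."""
        with self._lock:
            if self.failover_in_progress:
                # An ongoing failover is left to complete; no new one starts.
                return False
            self.failover_in_progress = True
            self.initiator = source
            return True

    def complete_failover(self):
        """Mark the ongoing failover as finished."""
        with self._lock:
            self.failover_in_progress = False
            self.initiator = None


backup = BackupSP()
# Three components request a failover for the same failure.
results = [backup.request_failover(src)
           for src in ("hypervisor", "backup_sp", "primary_sp")]
print(results)
```

In this sketch only the first request is accepted; the later requests are declined until `complete_failover()` is called.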
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.
The invention itself, as well as a preferred mode of use, further objects, and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
The present invention provides a method and system for enabling synchronous failover of redundant service processors with no overlapping failover operations. A data processing system (e.g., a server) is designed with redundant service processors and a hypervisor. Both service processors are capable of performing the full set of service processor functions, with one service processor (SP) registering itself as a primary SP with the system firmware/hypervisor and the other SP registering as the backup SP. The primary SP performs the initialization, monitoring and control of system resources. The backup SP and hypervisor monitor the primary SP for indications that the primary SP is failing. In the event of a failure of the primary SP, any one of the three components, the backup SP, hypervisor, or even the primary SP itself, is able to initiate a failover to the backup SP. During failover conditions, the backup SP checks the system to ensure that there is no ongoing failover before a new failover is initiated.
In the following detailed description of exemplary embodiments of the invention, specific exemplary embodiments in which the invention may be practiced are described in sufficient detail to enable those skilled in the art to practice the invention, and it is to be understood that other embodiments may be utilized and that logical, architectural, programmatic, mechanical, electrical and other changes may be made without departing from the spirit or scope of the present invention. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the present invention is defined only by the appended claims.
Within the descriptions of the figures, similar elements are provided similar names and reference numerals as those of the previous figure(s). Where a later figure utilizes the element in a different context or with different functionality, the element is provided a different leading numeral representative of the figure number (e.g., 4xx for
It is further understood that the use of specific parameter names is for example only and is not meant to imply any limitations on the invention. The invention may thus be implemented with different nomenclature/terminology utilized to describe the above parameters, without limitation.
With reference now to the figures,
As utilized herein, the term hypervisor refers to a schema within the computer system that allows multiple operating systems to run, unmodified, at the same time, and which provides a measure of robustness and stability to the system. Each OS within the hypervisor operates independently of the others, such that if one operating system crashes, the other OSes continue working without interruption. Certain embodiments of the invention are described from the perspective of a hypervisor providing access control to the SPs. Alternatively, in other embodiments the access control component is referred to as the system firmware. Those skilled in the art appreciate that the hypervisor is a major portion of the system firmware and references herein to functions being completed by system firmware may interchangeably be referred to as functions completed by the hypervisor, and vice versa. Notably also, the features of the invention are applicable to a computer system with a single OS.
Returning to
System firmware 480 provides SP firmware data 440, 450 stored in system firmware area of respective SPs 420, 425. Generally, system firmware 480 is loaded into host memory 412 and executed on host processor nest 405, while functions of system firmware 480 that include SP firmware data 440, 450 may be executed by a dedicated processor (not shown) of respective SPs 420, 425.
Both SPs 420, 425 operate on standby power, while CEC 400 operates on bulk power. SPs 420, 425 and CEC 400 thus exist in separate power domains. Those skilled in the computer arts are familiar with these power terms, as utilized herein. When the computer system's AC (alternating current) line cord is plugged into the electrical socket, only the standby power domain is activated. When standby power is applied to either SP (420/425), the SP (420/425) begins to initialize itself. As part of this initialization, the SP (420/425) determines if the SP (420/425) has a sibling SP connected via communications link 427. If a sibling SP exists, then both SPs 420, 425 decide which is the primary SP and which is the backup SP based on pre-established SP initialization methods. The actual methods by which one of the two SPs (420/425) is selected as primary with the other delegated as backup are not relevant to the discussion of the actual invention, which assumes that the roles of primary SP and backup SP have already been assigned before the failover processing is triggered. The primary SP takes on the role of controlling specific hardware resources of computer system 490 (or CEC 400) and communicating with system firmware 480. The backup SP waits in a state from which the backup SP is able to take over as primary SP in case the current primary SP fails.
To provide the capability for multiple entities to initiate a failover between service processors, the service processors first negotiate between themselves which service processor is the primary and which is the backup. The SPs then communicate these roles to the system firmware/hypervisor. After an SP communicates its role to the hypervisor, the hypervisor ensures that the configuration data stored on the primary SP is up-to-date, and the hypervisor indicates this up-to-date status by acknowledging the role message from the primary SP.
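For illustration, the role message exchange described above may be sketched as follows. The class names, the integer `config_version` field, and the acknowledgement returned for the backup SP's role message are assumptions of this sketch; the description only states that the hypervisor acknowledges the primary SP's role message once the primary SP's configuration data is up-to-date.

```python
class ServiceProcessor:
    """Hypothetical SP holding a version number for its configuration data."""

    def __init__(self, name, config_version):
        self.name = name
        self.config_version = config_version


class Hypervisor:
    """Hypothetical hypervisor that records roles and syncs the primary SP."""

    def __init__(self, config_version):
        self.config_version = config_version
        self.roles = {}

    def receive_role_message(self, sp, role):
        self.roles[sp.name] = role
        if role == "primary" and sp.config_version != self.config_version:
            # Bring the primary SP's configuration data up to date
            # before the role message is acknowledged.
            sp.config_version = self.config_version
        # Assumption: the backup SP's role message is acknowledged as well.
        return "ack"


hyp = Hypervisor(config_version=7)
sp_a = ServiceProcessor("SP-A", config_version=5)
sp_b = ServiceProcessor("SP-B", config_version=5)
ack_primary = hyp.receive_role_message(sp_a, "primary")
ack_backup = hyp.receive_role_message(sp_b, "backup")
```

Here the acknowledgement to the primary SP is only returned after its configuration data matches the hypervisor's copy, which mirrors the ordering described above.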
As described in greater detail below, during system operation under the control of the primary SP, the backup SP and system hypervisor monitor the primary SP for any indication that the primary SP is no longer performing properly or is exhibiting operating characteristics associated with a failure. The specific failure condition (or operating characteristics) being monitored for is preset by the system designer/engineer. Also, any failing SP indicates its role (primary or backup) and its failover capability to the system firmware. As the roles are determined and then communicated across the various components, surveillance/monitoring between the different components is activated/initiated.
The invention provides multiple implementations on how ownership of MUX 505 may be transferred between SPs (420, 425). However, the illustrative embodiments are described from the perspective that ownership of MUX 505 may only be taken if one of three events occurs: (1) the current primary SP gives up ownership, perhaps voluntarily or in response to an external trigger; (2) watchdog timer 516 (which is maintained by hypervisor 515, in the illustrative embodiment) expires because the primary SP has failed to service watchdog timer 516 within a specified amount of time; or (3) the primary SP fails to provide a “live” signal to the backup SP on communications link 427. For the second condition, the primary SP is assumed to automatically update watchdog timer 516 within a pre-set period unless the primary SP is failing or is giving up the role of primary SP for administrative reasons.
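The three events listed above may be expressed as a simple predicate, sketched below for illustration. The data structure, field names, and timeout values are hypothetical; the description does not specify how the elapsed times are measured or what limits are used.

```python
from dataclasses import dataclass


@dataclass
class PrimaryObservation:
    """Hypothetical snapshot of what is currently known about the primary SP."""
    gave_up_ownership: bool
    seconds_since_watchdog_service: float
    seconds_since_live_signal: float


def takeover_reasons(obs, watchdog_limit, live_limit):
    """Return the events (if any) that permit taking ownership of the MUX."""
    reasons = []
    if obs.gave_up_ownership:                                # event (1)
        reasons.append("primary released ownership")
    if obs.seconds_since_watchdog_service > watchdog_limit:  # event (2)
        reasons.append("watchdog timer expired")
    if obs.seconds_since_live_signal > live_limit:           # event (3)
        reasons.append("no live signal on communications link")
    return reasons


healthy = PrimaryObservation(False, 1.0, 0.5)
stalled = PrimaryObservation(False, 12.0, 0.5)
healthy_reasons = takeover_reasons(healthy, watchdog_limit=10.0, live_limit=5.0)
stalled_reasons = takeover_reasons(stalled, watchdog_limit=10.0, live_limit=5.0)
```

An empty list means none of the three events has occurred and ownership of the MUX stays with the current primary SP.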
Three different embodiments are provided by the invention, depending on which of the three components, from among the hypervisor, the backup SP, and the primary SP, detects the failure condition and triggers/initiates the failover. Each embodiment is delineated in sections A-C below and illustrated by respective flow charts
In a first embodiment, when system firmware detects a surveillance failure or communications loss with the current primary SP, the hypervisor instructs the backup SP to become the new primary. The backup SP determines if a failover is already in progress, and when a failover is already in progress, the backup SP acknowledges to the hypervisor that a failover is already ongoing. If a failover is not already in progress, then the backup SP determines if the backup SP is able to gain control of the shared hardware resources. When the backup SP is able to gain control of the shared hardware resources, the backup SP acknowledges to the hypervisor that the backup SP is initiating a new failover.
However, when the backup SP is not able to gain control of the shared hardware resource, then the backup SP waits a pre-specified amount of time for control of the shared hardware resources to become available. Once the shared hardware resources become available, the backup SP asserts the reset line to the primary SP to force the release of the shared hardware and continue with the failover. During the failover, the backup SP starts up any necessary applications to monitor the system (e.g., applications to provide power and thermal monitoring), and then the backup SP notifies the hypervisor that the backup SP has now taken the role as the new primary SP. The hypervisor ensures that the configuration data is synchronized on the new primary SP, and the hypervisor acknowledges the role message. The new primary now begins surveillance with hypervisor and starts “listening” to determine if the old primary comes out of its failure state (e.g., as in the case of a recovered reset/reload).
In order to more clearly describe the assignments of primary SP and backup SP and subsequent switching of roles among the SPs, specific illustrative embodiments of the invention are described from the perspective of SP 420 being primary SP (hereinafter primary SP 420) and SP 425 being backup SP (hereinafter backup SP 425). Additionally, once failover occurs, primary SP 420 is also referred to as “previous” primary SP or “new” backup SP, while backup SP 425 is also referred to as “previous” backup SP or “new” primary SP. Each designation is meant to refer to the current role of the particular SP, relative to either the SP's previous role or the previous role of the other SP.
If a failover is not already in progress, then backup SP 425 determines at block 115 if backup SP 425 is able to take control of the system hardware. If not, then backup SP 425 sends an acknowledgement to system firmware 480 that backup SP 425 is not able to take control of the system hardware and therefore cannot failover, as indicated at block 125.
If, however, backup SP 425 is able to take control of the system hardware, then backup SP 425 acknowledges to system firmware 480 that backup SP 425 is failing over to the primary role, as shown at block 130. At block 135, backup SP 425 starts the necessary applications needed to assume the primary role, and backup SP sends a role message to system firmware 480 indicating that backup SP 425 has now assumed the primary role, as provided at block 140.
System firmware 480 determines, at block 145, if system firmware data 450 on new primary SP 425 is up-to-date. If system firmware data 450 is not up-to-date, system firmware 480 updates system firmware data 450, as shown at block 150, prior to sending new primary SP 425 an acknowledgement to the role message, as indicated at block 155. System firmware 480 and new primary SP 425 then initiate surveillance/monitoring to detect any future communications loss, as indicated at block 160. Also, new primary SP 425 listens on communications link 427 for its sibling SP (now backup SP 420) to become active again, as shown at block 165. Failover processing initiated from system firmware 480 then ends.
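The firmware-initiated sequence described above may be summarized by the following sketch, which returns the ordered list of steps for a given set of conditions. The function name, parameter names, and step strings are hypothetical; the block numbers in the comments are those given in the description.

```python
def firmware_initiated_failover(failover_in_progress, can_take_hardware,
                                data_up_to_date):
    """Return the ordered steps of a failover requested by system firmware."""
    log = []
    if failover_in_progress:
        log.append("backup: ack, failover already in progress")
        return log
    if not can_take_hardware:                                    # block 115
        log.append("backup: ack, cannot take control of hardware")  # block 125
        return log
    log.append("backup: ack, failing over to primary role")      # block 130
    log.append("backup: start primary-role applications")        # block 135
    log.append("backup: send role message (primary)")            # block 140
    if not data_up_to_date:                                      # block 145
        log.append("firmware: update SP firmware data")          # block 150
    log.append("firmware: ack role message")                     # block 155
    log.append("firmware and new primary: start surveillance")   # block 160
    log.append("new primary: listen for sibling on link")        # block 165
    return log


steps = firmware_initiated_failover(
    failover_in_progress=False, can_take_hardware=True, data_up_to_date=False)
for step in steps:
    print(step)
```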
In the second embodiment, the backup SP detects the failure of the primary SP and initiates the failover. When the backup SP detects a surveillance failure with the primary SP, the backup SP will first determine if a failover is already in progress. If a failover is already in progress, then the backup SP takes no action, and the current failover is allowed to continue. However, if a failover is not already in progress, then the backup SP will determine if the backup SP is able to gain control of the shared hardware resources. Assuming the backup SP is able to gain control of the shared hardware resources, the backup SP then notifies the hypervisor that the backup SP is initiating the failover.
On receipt of this notification from the “new primary” SP, the hypervisor ensures that the configuration data is synchronized on the new primary, and the hypervisor acknowledges the role message. The new primary now begins surveillance with the hypervisor and starts “listening” to determine if the old primary comes out of its failure state. The old primary may come out of a failure state following a recovered reset/reload and registers itself as the new backup SP. In one embodiment, if the system firmware is not running, i.e. the system is at standby or initializing, then the backup SP completes the failover to the new primary role without communicating with the system firmware.
When the backup SP is not able to gain control of the shared hardware resources, then the backup SP continues to monitor for the availability of control over the shared hardware resource. This monitoring by the backup SP continues until the backup SP is instructed to failover by the hypervisor. This second failover request (from hypervisor) is provided to handle situations where the backup SP may have itself failed, or situations where the failure may be within the communications link between the two SPs.
Returning to block 205, if a failover is not already in progress, then backup SP 425 determines at block 215 if backup SP 425 is able to take control of system hardware. If not, then backup SP 425 waits for a pre-specified amount of time, as indicated by block 225, before retesting whether backup SP 425 is able to take control of the hardware, as determined at block 227. If backup SP 425 is still not able to take control of the hardware, then the failover process ends without a change in roles of primary SP and backup SP. In this situation, an assumption is made that primary SP 420 is still functioning, but communication link 427 has completely failed.
If backup SP 425 is able to take control of the hardware, either after the initial test (block 215) or the retest (block 227), then backup SP 425 notifies system firmware 480 that backup SP is failing over to the primary role, as provided at block 230. Backup SP 425 starts the necessary applications needed to assume the primary role, as indicated at block 235, and backup SP 425 sends a role message to system firmware 480 indicating that backup SP 425 has now assumed the primary role, as shown at block 240. System firmware 480 determines, at block 245, if system firmware data 450 on new primary SP 425 is up-to-date. If system firmware data 450 is not up-to-date, system firmware 480 updates system firmware data 450, as shown at block 250, prior to sending new primary SP 425 an acknowledgement to the role message, as provided at block 255. System firmware 480 and new primary SP 425 begin surveillance with each other to detect any future communications loss, as shown at block 260. New primary SP 425 listens on communications link 427 for its sibling SP—new backup SP 420—to become active again, as indicated at block 265, and the process ends.
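The test, wait, and retest of hardware control in the backup-initiated flow may be sketched as follows. The function and its callable parameters (`can_take_hardware`, `wait`) are hypothetical and are injected so that the sketch runs without real hardware or real delays.

```python
def backup_initiated_failover(failover_in_progress, can_take_hardware, wait):
    """Return the outcome of a failover attempt started by the backup SP."""
    if failover_in_progress:                  # block 205
        return "no_action"
    if not can_take_hardware():               # block 215: initial test
        wait()                                # block 225: pre-specified delay
        if not can_take_hardware():           # block 227: retest
            return "ended_without_role_change"
    return "failing_over"                     # block 230 onward


# Example: hardware control is refused on the first test, granted on the retest.
answers = iter([False, True])
waits = []
outcome = backup_initiated_failover(
    failover_in_progress=False,
    can_take_hardware=lambda: next(answers),
    wait=lambda: waits.append("waited"),
)
```

In this example the outcome is `"failing_over"` after a single wait, which corresponds to the path through block 227 into block 230.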
With the above embodiment, since both SPs (420, 425) operate on standby power, both SPs (420, 425) may be operating when system firmware 480 is not. For the present embodiment, backup SP 425 is still able to detect a failure in primary SP 420, while the system is at standby or initializing, and the process in
Notably, with the above two embodiments, when/if previous primary SP 420 recovers from the failure that resulted in the failover to new primary SP 425, previous primary SP 420 first communicates with new primary SP 425 to learn that previous primary SP 420 has been delegated to a new role as new backup SP. New backup SP 420 then sends a role message to system firmware/hypervisor 480 to indicate that previous primary, now new backup SP 420 is now the backup and is failover capable. If system firmware/hypervisor 480 is not running, then previous primary SP 420 assumes the backup role and makes itself failover capable without communicating with system firmware/hypervisor 480.
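The recovery behavior of the previous primary SP may be sketched as below. The function name and action strings are hypothetical, and the sketch covers only the case described above, in which the sibling SP reports that it now holds the primary role.

```python
def rejoin_after_recovery(sibling_role, firmware_running):
    """Return the actions a recovered SP takes after contacting its sibling."""
    actions = []
    if sibling_role != "primary":
        # Other cases are outside what this sketch models.
        return actions
    actions.append("assume backup role")
    if firmware_running:
        actions.append("send role message: backup, failover capable")
    else:
        # Firmware/hypervisor not running: no role message is sent.
        actions.append("mark self failover capable")
    return actions


with_firmware = rejoin_after_recovery("primary", firmware_running=True)
without_firmware = rejoin_after_recovery("primary", firmware_running=False)
```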
In the third embodiment, the primary SP is itself able to initiate a failover when the primary SP detects/determines that there is an error/failure related to one of the primary SP's connections to the rest of the system and/or that the primary SP is not able to continue to properly manage the system. Alternatively, the primary SP may be instructed to failover for administrative reasons. Possible administrative reasons include, for example, (a) the result of a code update where the primary SP has to be reset to activate a new code level and (b) a reset prior to a concurrent maintenance action. When the primary SP initiates the failover, the primary SP will first quiesce all applications running on the primary SP and also synchronize any necessary data to the backup SP. The primary SP then gives up control of any shared hardware resources, and instructs the backup SP to failover to the primary role.
On receipt of this instruction from the primary SP, the backup SP first determines if a failover is already in progress. If there is already a failover in progress, the backup SP informs the primary SP that a failover is already in progress, and the existing/ongoing failover is allowed to continue. If a failover is not already in progress, then the backup SP determines if the backup SP is able to gain control of the shared hardware resources. Once the backup SP is able to gain control of the shared hardware resources, the backup SP notifies the hypervisor that the backup SP is failing over. The backup SP then notifies the primary SP that the backup SP is proceeding to assume the role of “new primary” SP. The previous primary SP notifies the hypervisor of its new role as new backup SP. The previous primary SP also notifies the hypervisor that it is either failover capable or not failover capable, depending on the existing condition(s) leading to the failover. Assuming the previous primary SP remains operational, the previous primary SP then begins surveillance/monitoring on the new primary SP.
If the primary SP's failover is for administrative reasons, i.e., not due to some failure with respect to the previous primary SP and/or system connections, the new primary also revalidates all of the system connections to the new primary. However, if the new primary is not able to validate a connection, the new primary returns the primary role to the previous primary SP, and thus the previous backup SP declares itself as not failover capable. The hypervisor ensures that the configuration data is synchronized on the new primary SP and acknowledges the role message. The new primary SP now begins surveillance with the hypervisor and notifies the old primary SP that the failover is complete. Again, if the hypervisor is not running, then the new primary SP and backup SP will failover without communicating with the hypervisor.
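For an administrative failover, the revalidation step and the return of the primary role on a failed validation may be sketched as follows. The function name, the connection names, and the `probe` callable are hypothetical placeholders; the description does not state how a connection is validated.

```python
def revalidate_after_admin_failover(connections, probe):
    """Decide which SP holds the primary role after connection revalidation."""
    failed = [name for name in connections if not probe(name)]
    if failed:
        # Validation failed: the primary role is returned to the previous
        # primary SP, and the previous backup SP is not failover capable.
        return {
            "primary": "previous_primary",
            "failed": failed,
            "previous_backup_failover_capable": False,
        }
    return {"primary": "previous_backup", "failed": []}


# Hypothetical connection names and probe results.
status = {"link_a": True, "link_b": False}
result = revalidate_after_admin_failover(
    ["link_a", "link_b"], lambda name: status[name])
```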
Backup SP 425 determines at block 325 if a failover is already in progress. If a failover is already in progress, backup SP 425 signals primary SP 420 that a failover is already in progress, as indicated at block 330. Following this, primary SP 420 sends a role message to system firmware 480 indicating that primary SP 420 is now assuming the backup role, as shown at block 335. The ongoing (in-progress) failover is allowed to complete without any further action being taken by backup SP 425 with regard to the failover request from primary SP 420.
If, at decision block 325, a failover is not already in progress, then backup SP 425 determines whether backup SP 425 is able to take control of the system hardware, as shown at block 340. If backup SP 425 is not able to take control, then backup SP 425 signals to primary SP 420 that backup SP 425 is not able to take over the system hardware, as indicated at block 345, and then the failover process ends without a change in primary and backup roles. If, however, backup SP 425 is able to take control of the system hardware, then backup SP 425 notifies system firmware 480 that backup SP 425 is failing over to the primary role, as provided at block 350. Additionally, backup SP 425 sends an acknowledgement to previous primary SP 420 that backup SP 425 is failing over, as indicated at block 355. At block 360, previous primary SP 420 sends a message to system firmware 480 that previous primary SP 420 has now assumed the role of new backup SP 420.
Backup SP 425 starts the necessary applications needed to assume the primary role, as shown at block 365, and backup SP 425 sends a role message to system firmware 480 at block 370 indicating that backup SP 425 has now assumed the primary SP role. System firmware 480 determines, at block 375, if system firmware data 450 on new primary SP 425 is up-to-date. If system firmware data is not up-to-date, system firmware 480 updates system firmware data 450, as provided at block 380, prior to sending an acknowledgement of the role message to new primary SP 425, as shown at block 385. System firmware 480 and new primary SP 425 begin surveillance with each other to detect any future communications loss, as indicated at block 390. Also, new primary SP 425 listens on communications link 427 for its sibling SP (420) to become active again, as shown at block 395. Then the process ends.
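The messages exchanged in the primary-initiated flow may be sketched as an ordered list of (sender, receiver, message) tuples. The function name, parameter names, and message strings are hypothetical; the block numbers in the comments are those given in the description. Local actions that send no message, such as starting applications at block 365, are noted in comments only.

```python
def primary_initiated_failover(failover_in_progress, can_take_hardware,
                               data_up_to_date):
    """Return the messages exchanged after the primary SP requests failover."""
    msgs = []
    if failover_in_progress:                                           # block 325
        msgs.append(("backup", "primary", "failover already in progress"))  # 330
        msgs.append(("primary", "firmware", "role: backup"))           # block 335
        return msgs
    if not can_take_hardware:                                          # block 340
        msgs.append(("backup", "primary", "cannot take over hardware"))  # 345
        return msgs
    msgs.append(("backup", "firmware", "failing over to primary"))     # block 350
    msgs.append(("backup", "primary", "ack: failing over"))            # block 355
    msgs.append(("primary", "firmware", "role: backup"))               # block 360
    # Block 365: the backup SP starts its primary-role applications (local).
    msgs.append(("backup", "firmware", "role: primary"))               # block 370
    if not data_up_to_date:                                            # block 375
        msgs.append(("firmware", "backup", "update firmware data"))    # block 380
    msgs.append(("firmware", "backup", "ack role message"))            # block 385
    # Blocks 390 and 395: surveillance starts and the new primary SP
    # listens for its sibling on the communications link.
    return msgs


messages = primary_initiated_failover(
    failover_in_progress=False, can_take_hardware=True, data_up_to_date=True)
for sender, receiver, text in messages:
    print(f"{sender} -> {receiver}: {text}")
```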
As with the previous embodiment, since both SPs 420, 425 operate on standby power, both SPs 420, 425 may be operating when system firmware 480 is not. For the present embodiment, primary SP 420 is still able to detect/determine a failure in itself, and the process in
As a final matter, it is important to note that while an illustrative embodiment of the present invention has been, and will continue to be, described in the context of a fully functional computer system with installed software, those skilled in the art will appreciate that the software aspects of an illustrative embodiment of the present invention are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the present invention applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution. Examples of signal bearing media include recordable type media such as floppy disks, hard disk drives, CD ROMs, and transmission type media such as digital and analogue communication links.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.