The present invention relates to the field of firmware and error handling and, more particularly, to check-stopping firmware implemented virtual communication channels without disabling all firmware functions.
Physical communication channels in System z sometimes experience hardware errors that cannot be repaired by recovery actions. When these types of errors occur, the channels are check-stopped in order to isolate the problem and prevent damage to the entire system. In System z, this is performed by setting a permanent error in a channel control block. These permanent errors remain in effect until troublesome hardware is fixed, after which the permanent errors are released and the disabled communication channels can again be used.
A special type of virtual communication channel is available in System z that is called Hipersockets. At present, there is no hardware based check-stop that can be applied to these virtual communication channels, since they are implemented at a layer of abstraction above the hardware layer. Occasionally, errors occur in Hipersockets Firmware that make it no longer safe to use the virtual communication channels. These channels cannot currently be isolated and disabled without disabling the entire firmware (central electronic complex (CEC) firmware) in which the virtual communication channels are implemented. The CEC firmware is a critical system component which performs many functions beyond those related to the virtual communication channels. Thus, severe errors in the virtual communication channels can cause all CEC functions to be disabled, which adversely affects normal system operations.
The present invention discloses a solution for check-stopping firmware implemented virtual communication channels without disabling all firmware functions. In the solution, a virtual input/output subsystem of firmware can be selectively isolated from other portions of the firmware. This permits the virtual input/output subsystem to be disabled when severe errors occur involving virtual communication channels, without affecting other portions and functions of the firmware. Check-stopping the virtual channel subsystem can be performed using existing mechanisms for handling and reporting permanent errors including channel control blocks, channel report words, and the like. The subsystem can be reactivated in response to a firmware patch which can be an automated or manual procedure.
As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Any combination of one or more computer usable or computer readable medium(s) may be utilized. The computer usable or computer readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. Note that the computer usable or computer readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance, via optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer usable or computer readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer usable medium may include a propagated data signal with the computer usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
As used herein, a check-stop can be a firmware state causing the suspension and/or deactivation of one or more firmware components. The check-stop state can be triggered in response to one or more reoccurring and/or unrecoverable errors occurring within firmware 130. Reoccurring and/or unrecoverable errors can include component failures, communication channel failures, corrupted data structures, corrupted executable code, and the like. Channels 138 can be firmware-implemented communication channels able to act as a transport layer for one or more message passing protocols (e.g., TCP/IP).
In virtual channel subsystem 134, a set of virtual communication channels 138 within firmware 130 can facilitate message passing between subsystem images 132, 136. The virtual communication channels can have no direct linkage to hardware-level network interface adaptors, which makes detecting and correcting problems with the virtual communication channels 138 challenging. Traditionally, when errors were detected with virtual channel subsystem 134, the entire firmware 130, including all firmware subsystems, was disabled. System 100 permits a disablement switch 180 to disable the virtual channel subsystem 134 without affecting other portions of the firmware 130. Images 132, 136 can be locally distributed server images executing programmatic code such as applications, operating systems 142, and the like. In one embodiment, images 132, 136 can include virtualized operating systems 142 executing within mainframe 110. In the embodiment, images 132, 136 can communicate using hardware-abstracted firmware channels 138. The images 132, 136 and their included operating systems 142 can be implemented at the software 140 level.
In mainframe 110, a virtual channel subsystem 134 can be placed into a check-stop state in response to one or more fatal errors occurring within subsystem 134 (e.g., channel failure 160). Using existing hardware mechanisms, subsystem 134 can be isolated from other firmware 130 subsystems 135 during a subsystem 134 failure. A check-stop state already exists for halting CPU 122 during a hardware malfunction, and the same check-stop state can be used to perform the same action on subsystem 134. That is, subsystem 134 can be selectively disabled, allowing mainframe 110 components 120, 130, 140 to function normally except for the functionality provided by subsystem 134.
Firmware controller 150 can be used to direct firmware 130 activity and manage detector 152 and reporter 154 actions. Controller 150 can permit the configuration of behavior for components 152, 154, and 131. Controller 150 can configure error threshold behavior for triggering a check-stop state. For example, if an error occurs more than three times within thirty seconds, a check-stop state can be enacted. Further, controller 150 can allow an administrative agent to manually enable or disable controller 131 by setting disablement switch 180.
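By way of illustration only, the error threshold behavior described above can be sketched as a sliding-window counter. The class name `ErrorThreshold`, its parameters, and the use of Python are assumptions made for this sketch and are not part of the described firmware; the defaults mirror the example of more than three errors within thirty seconds.

```python
from collections import deque


class ErrorThreshold:
    """Hypothetical sliding-window counter for triggering a check-stop state."""

    def __init__(self, max_errors=3, window_seconds=30.0):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record_error(self, now):
        """Record an error at time `now` (in seconds).

        Returns True when more than `max_errors` errors have occurred
        within the last `window_seconds`, i.e. a check-stop is warranted.
        """
        self._timestamps.append(now)
        # Discard errors that have aged out of the window.
        while now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors
```

In this sketch the caller supplies the timestamp, so the same logic can be driven by a firmware clock or by test values.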
Disablement switch 180 can be a stored state value used to determine the state and/or condition of subsystem 134. In one embodiment, disablement switch 180 can indicate a check-stop state for subsystem 134. For instance, disablement switch 180 set to a value of “one” can indicate subsystem 134 is deactivated. In an alternative embodiment, each channel 160-163 can be associated with an individual disablement switch 180. In the embodiment, a single failed channel 160 can be selectively deactivated without affecting the entire channel subsystem 134. Alternatively, disablement switch 180 can be one or more portions of executable code able to terminate subsystem 134 activity in response to fatal errors.
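The two embodiments above, a single switch for the whole subsystem versus one switch per channel, can be contrasted in a short sketch. The class `VirtualChannelSubsystem` and its methods are hypothetical names chosen for illustration; a switch value of 1 is taken to mean deactivated, following the example in the text.

```python
class VirtualChannelSubsystem:
    """Hypothetical model holding one disablement switch per channel."""

    def __init__(self, channel_ids):
        # 0 = active, 1 = deactivated (check-stopped).
        self.switches = {cid: 0 for cid in channel_ids}

    def disable_channel(self, channel_id):
        """Per-channel embodiment: deactivate a single failed channel."""
        self.switches[channel_id] = 1

    def disable_all(self):
        """Subsystem-wide embodiment: deactivate every channel."""
        for cid in self.switches:
            self.switches[cid] = 1

    def active_channels(self):
        return [cid for cid, value in self.switches.items() if value == 0]
```

With per-channel switches, disabling channel 160 leaves channels 161 through 163 available, whereas `disable_all` corresponds to check-stopping the entire subsystem.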
Failure detector 152 can detect failures associated with failed channels, malfunctioning channels, corrupt data structures in subsystem 134, corruption in channel controller 131, and the like. Detection of failures can be performed by detector 152 based on checksums, verifiable hashes, system test checks, and the like. For instance, software error checking program code can be employed to determine an error event within subsystem 134. Detector 152 can determine and identify a failed channel 160 within subsystem 134 and take appropriate action. One action can include conveying a failure event and information about the failure to failure reporter 154.
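As one possible illustration of checksum-based detection, the following sketch stores a CRC-32 value alongside a data structure and later recomputes it to detect corruption. The function names `seal` and `is_corrupt` are hypothetical, and CRC-32 is an assumed choice; the text does not specify which checksum or hash detector 152 uses.

```python
import zlib


def seal(payload):
    """Return the payload together with its CRC-32 checksum."""
    return payload, zlib.crc32(payload)


def is_corrupt(payload, stored_checksum):
    """Report corruption when the recomputed checksum no longer matches."""
    return zlib.crc32(payload) != stored_checksum
```

A detector built along these lines would call `is_corrupt` on each channel data structure and convey a failure event to the reporter when it returns True.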
Failure reporter 154 can perform data gathering and failure report generation useful in diagnosis of channel 160 error. Reporter 154 can utilize data gained from first failure data capture (FFDC), attempted recovery actions, and the like. Reporter 154 can notify system components of the failure such as operating system 142, subsystem 134, firmware controller 150, and the like. Notification can permit affected system 110 components to adjust functionality in response to the failure. Additionally, reporter 154 can generate administrative notifications permitting an administrative agent to address the failure.
Virtual channel controller 131 can be responsible for maintaining and handling subsystem 134 functionality. Controller 131 can be a separate component from subsystem 134 in firmware 130 or can be present within subsystem 134. Controller 131 can convey message 170 to subsystem 134 which can force a check-stop on subsystem 134 when one or more reoccurring and/or unrecoverable errors occur within subsystem 134. Message 170 can be a check-stop directive instructing the deactivation of subsystem 134. For instance, subsystem 134 can disable all channels 160-163 upon receipt of message 170.
Once a check-stop state has been reached, firmware 130 can continue to function normally in the absence of subsystem 134 functionality. Application of a firmware patch implemented to resolve the failure can trigger subsystem 134 to be reactivated. If the applied firmware patch fails to resolve the reoccurring and/or unrecoverable failure, the firmware controller 150 can reinstate the check-stop state for subsystem 134 and can force a firmware rollback, reverting the firmware 130 to the previous un-patched state.
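The patch, reactivate, and rollback sequence can be sketched as follows. The names `Subsystem`, `apply_patch`, and the `failure_resolved` callback are assumptions introduced for illustration; how the firmware actually verifies that a patch resolved the failure is not specified in the text.

```python
class Subsystem:
    """Hypothetical stand-in for a check-stopped virtual channel subsystem."""

    def __init__(self, firmware_version):
        self.firmware_version = firmware_version
        self.check_stopped = True


def apply_patch(subsystem, patch_version, failure_resolved):
    """Apply a patch and reactivate; roll back if the failure persists.

    `failure_resolved` is a caller-supplied function that inspects the
    patched subsystem and returns True when the failure no longer occurs.
    """
    previous_version = subsystem.firmware_version
    subsystem.firmware_version = patch_version
    subsystem.check_stopped = False  # the patch triggers reactivation
    if failure_resolved(subsystem):
        return True
    # Patch did not help: reinstate the check-stop and revert the firmware.
    subsystem.check_stopped = True
    subsystem.firmware_version = previous_version
    return False
```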
Drawings presented herein are for illustrative purposes only and should not be construed to limit the invention in any regard. Firmware implemented virtual channels are not limited to mainframe 110 implementations and can be present within any computing device capable of executing a firmware. Components in firmware 130 can be distributed differently from drawings presented herein, provided that the functionality described is maintained.
In step 205, a failure in the virtual channel or in the virtual channel subsystem can be detected. The virtual channel can be a communication channel (e.g., TCP/IP) emulated within firmware responsible for message passing. The failure can include corrupt subsystem data structures, malfunctioning channels, and the like. In step 210, if the failure is reoccurring and/or unrecoverable, the method can proceed to step 220; otherwise, the method can continue to step 215. In step 215, appropriate recovery actions can be performed based on the nature of the failure. A recovery action can be performed automatically by hardware components, firmware components, software components, and the like. Alternatively, an administrator can manually execute a recovery action in response to the subsystem failure.
In step 220, all virtual channels within the firmware can be disabled to maintain system stability. In step 225, the firmware (e.g., error reporting component) can notify the appropriate system components of subsystem failure and deactivation. Notification of components can include software components such as host/guest operating systems, software I/O subsystem, and the like. In step 230, if a firmware patch is available, the method can proceed to step 240, else continue to step 235. In step 235, a system administrator can be optionally notified that a patching action is required due to subsystem failure. In step 240, a firmware patch can be applied which can be an automated patch procedure or manual patching action. In step 245, all virtual channels within firmware can be enabled in response to the patching action. The activation can be an automatic or manual procedure once the patch is applied.
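Steps 205 through 245 can be summarized in a minimal sketch that records which steps are taken for a given failure. The names `FlowResult` and `run_method` are hypothetical, and the sketch assumes that the method pauses after step 235 until a patch becomes available, which the text leaves open.

```python
from dataclasses import dataclass, field


@dataclass
class FlowResult:
    steps: list = field(default_factory=list)
    channels_enabled: bool = True
    admin_notified: bool = False


def run_method(unrecoverable, patch_available):
    """Walk the flow for one detected failure and return the outcome."""
    result = FlowResult()
    result.steps.append(205)  # failure detected
    result.steps.append(210)  # reoccurring and/or unrecoverable?
    if not unrecoverable:
        result.steps.append(215)  # perform recovery actions
        return result
    result.steps.append(220)  # disable all virtual channels
    result.channels_enabled = False
    result.steps.append(225)  # notify system components
    result.steps.append(230)  # firmware patch available?
    if not patch_available:
        result.steps.append(235)  # optionally notify administrator
        result.admin_notified = True
        return result  # assumption: wait here until a patch exists
    result.steps.append(240)  # apply firmware patch
    result.steps.append(245)  # re-enable all virtual channels
    result.channels_enabled = True
    return result
```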
As used herein, a Hipersocket can be one or more virtual communication channels enabling the transmission of messages to and from System z 305 components. Corrupt channel 324 can be a virtual communication channel experiencing one or more failures which can generate an error message and/or action.
In CEC firmware 310, detector 312 can be a firmware component able to detect and identify one or more failures within Hipersocket channel subsystem 322. Detector 312 can detect failures associated with failed channels, malfunctioning channels, corrupt data structures in subsystem 322, and the like. Detection of failures can be performed by detector 312 based on one or more system test checks. For instance, corrupt channel 324 can be detected using a checksum identified by detector 312. In one embodiment, detector 312 can use Hipersocket channel control block 320 to perform a subsystem check-stop on subsystem 322.
Detection of corrupt channel 324 can cause the failure detector 312 to communicate a message 330 to subsystem 322 invoking a check-stop on all channels 324-328. Message 330 can be a check-stop message directing subsystem 322 to exit any currently executing code and disable channels 324-328. In one embodiment, message 330 can be an interrupt level message triggering a check-stopped state in subsystem 322.
Disablement switch 321 can be one or more stored values within control block 320 indicating the status of block 320 and/or subsystem 322. In one embodiment, disablement switch 321 can be used to set a permanent error flag within control block 320. Disablement switch 321 can be tied to control block 320 indicating the corrupt channel(s) in subsystem 322. In another embodiment, switch 321 can denote the nature of the error in subsystem 322, type of error, channel path identifier (CHPID), affected image identifier (IID), and the like. Alternatively, the switch 321 can be used to control one instance of subsystem 322 when multiple instances of Hipersocket channel subsystem 322 are present.
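For illustration, a control block carrying a permanent error flag together with the descriptive fields mentioned above might be modeled as shown below. The class `ChannelControlBlock`, its field names, and its methods are hypothetical and do not reflect the actual System z control block layout.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChannelControlBlock:
    """Hypothetical control block with a disablement switch and error details."""

    permanent_error: bool = False
    error_type: Optional[str] = None
    chpid: Optional[int] = None      # channel path identifier
    image_id: Optional[int] = None   # affected image identifier (IID)

    def check_stop(self, error_type, chpid, image_id):
        """Set the permanent error and record what failed."""
        self.permanent_error = True
        self.error_type = error_type
        self.chpid = chpid
        self.image_id = image_id

    def release(self):
        """Clear the permanent error, e.g. after a successful patch."""
        self.permanent_error = False
        self.error_type = None
        self.chpid = None
        self.image_id = None

    def usable(self):
        return not self.permanent_error
```

Keeping the error details next to the flag lets an error report be assembled from the control block alone.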
When a check-stop state is initiated for Hipersocket channel subsystem 322, CEC firmware can convey an error report to software level processes. This can permit software 340 level actions to be taken in response to the check-stop state. Actions can include notifying locally executing server images of the check-stopped state, notifying administrators of subsystem failure, and the like. For instance, channel report words 350 can be communicated to an executing instance of z/OS 342 or other operating system (zLinux, z/VM, z/VSE, etc.).
Drawings presented herein are for illustrative purposes only and should not be construed to limit the invention in any regard. Component 312 within firmware 310 can be implemented within System z 305 hardware and/or software. Implementation details for System z 305 components 320-350 can vary from drawings presented herein. Although System z 305 is presented, other systems implementing Hipersocket functionality are contemplated.
The flowchart and block diagrams in the