Method for managing environmental conditions of a distributed processor system

Information

  • Patent Grant
  • 6249885
  • Patent Number
    6,249,885
  • Date Filed
    Wednesday, October 1, 1997
  • Date Issued
    Tuesday, June 19, 2001
Abstract
A network of microcontrollers for monitoring and diagnosing the environmental conditions of a computer is disclosed. The network of microcontrollers provides a management system by which computer users can accurately gauge the health of their computer, including the ability to detect system fan speeds, internal temperatures and voltage levels. The invention is designed not only to be resilient to faults, but also to allow for system maintenance, modification, and growth without downtime. Additionally, the present invention allows users to replace failed components and to add new functionality, such as new network interfaces, disk interface cards and storage, without impacting existing users. One of the primary roles of the present invention is to manage the environment without outside involvement. This self-management allows the system to continue to operate even though components have failed.
Description




PRIORITY CLAIM




The benefit under 35 U.S.C. § 119(e) of the following U.S. provisional application(s) is hereby claimed:


















Title | Application No. | Filing Date
“Remote Access and Control of Environmental Management System” | 60/046,397 | May 13, 1997
“Hardware and Software Architecture for Inter-Connecting an Environmental Management System with a Remote Interface” | 60/047,016 | May 13, 1997
“Self Management Protocol for a Fly-By-Wire Service Processor” | 60/046,416 | May 13, 1997
“Computer System Hardware Infrastructure for Hot Plugging Single and Multi-Function PC Cards Without Embedded Bridges” | 60/046,398 | May 13, 1997
“Computer System Hardware Infrastructure for Hot Plugging Multi-Function PCI Cards With Embedded Bridges” | 60/046,312 | May 13, 1997














APPENDICES




Appendix A, which forms a part of this disclosure, is a list of commonly owned copending U.S. patent applications. Each one of the applications listed in Appendix A is hereby incorporated herein in its entirety by reference thereto.




COPYRIGHT RIGHTS




A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.




BACKGROUND OF THE INVENTION




1. Field of the Invention




The invention relates to the field of fault tolerant computer systems. More particularly, the invention relates to a managing and diagnostic system for evaluating and controlling the environmental conditions of a fault tolerant computer system.




2. Description of the Related Technology




As enterprise-class servers become more powerful and more capable, they are also becoming ever more sophisticated and complex. For many companies, these changes lead to concerns over server reliability and manageability, particularly in light of the increasingly critical role of server-based applications. While in the past many systems administrators were comfortable with all of the various components that made up a standards-based network server, today's generation of servers can appear as an incomprehensible, unmanageable black box. Without visibility into the underlying behavior of the system, the administrator must “fly blind.” Too often, the only indicator the network manager has of the relative health of a particular server is whether or not it is running.




It is well-acknowledged that there is a lack of reliability and availability of most standards-based servers. Server downtime, resulting either from hardware or software faults or from regular maintenance, continues to be a significant problem. By one estimate, the cost of downtime in mission critical environments has risen to an annual total of $4.0 billion for U.S. businesses, with the average downtime event resulting in a $140 thousand loss in the retail industry and a $450 thousand loss in the securities industry. It has been reported that companies lose as much as $250 thousand in employee productivity for every 1% of computer downtime. With emerging Internet, intranet and collaborative applications taking on more essential business roles every day, the cost of network server downtime will continue to spiral upward. Another major cost of system downtime is that of the administrators needed to diagnose and fix the system. Corporations are looking for systems which do not require real time service upon a system component failure.




While hardware fault tolerance is an important element of an overall high availability architecture, it is only one piece of the puzzle. Studies show that a significant percentage of network server downtime is caused by transient faults in the I/O subsystem. Transient failures are those which make a server unusable, but which disappear when the server is restarted, leaving no information which points to a failing component. These faults may be due, for example, to the device driver, the adapter card firmware, or hardware which does not properly handle concurrent errors, and often cause servers to crash or hang. The result is hours of downtime per failure, while a system administrator discovers the failure, takes some action and manually reboots the server. In many cases, data volumes on hard disk drives become corrupt and must be repaired when the volume is mounted. A dismount-and-mount cycle may result from the lack of hot pluggability in current standards-based servers. Diagnosing intermittent errors can be a frustrating and time-consuming process. For a system to deliver consistently high availability, it should be resilient to these types of faults.




Modern fault tolerant systems have the functionality to monitor the ambient temperature of a storage device enclosure and the operational status of other components such as the cooling fans and power supply. However, a limitation of these server systems is that they do not contain self-managing processes to correct malfunctions. Thus, if a malfunction occurs in a typical server, the one corrective measure taken by the server is to give notification of the error-causing event via a computer monitor to the system administrator. If the system error caused the system to stop running, the system administrator might never know the source of the error. Traditional systems are lacking in detail and sophistication when notifying system administrators of system malfunctions. System administrators are in need of a graphical user interface for monitoring the health of a network of servers. Administrators need a simple point-and-click interface to evaluate the health of each server in the network. In addition, existing fault tolerant servers rely upon operating system maintained logs for error recording. These systems are not capable of maintaining information when the operating system is inoperable due to a system malfunction.




Existing systems also do not have an interface to control the changing or addition of an adapter. Since any user on a network could be using a particular device on the server, system administrators need a software application that will control the flow of communications to a device before, during, and after a hot plug operation on an adapter.




Also, in the typical fault tolerant computer system, the control logic for the diagnostic system is associated with a particular processor. Thus, if the environmental control processor malfunctioned, then all diagnostic activity on the computer would cease. In traditional systems, there is no monitoring of fans, and no means to make up cooling capacity lost when a fan fails. Some systems provide a processor located on a plug-in PCI card which can monitor some internal systems, and control turning power on and off. If this card fails, obtaining information about the system, and controlling it remotely, is no longer possible. Further, these systems are not able to affect fan speed or cooling capacity.




Therefore, a need exists for improvements in server management which will result in greater reliability and dependability of operation. Server users are in need of a management system by which the users can accurately gauge the health of their system. Users need a high availability system that should not only be resilient to faults, but should allow for maintenance, modification, and growth without downtime. System users should be able to replace failed components, and add new functionality, such as new network interfaces, disk interface cards and storage, without impacting existing users. As system demands grow, organizations must frequently expand, or scale, their computing infrastructure, adding new processing power, memory, storage and I/O capacity. With demand for 24-hour access to critical, server-based information resources, planned system downtime for system service or expansion has become unacceptable.




SUMMARY OF THE INVENTION




Embodiments of the inventive monitoring and management system provide system administrators with new levels of client/server system availability and management. They give system administrators and network managers a comprehensive view into the underlying health of the server in real time, whether on-site or off-site. In the event of a failure, the invention enables the administrator to learn why the system failed, why the system was unable to boot, and to control certain functions of the server.




One embodiment of the invention is a method for monitoring and diagnosing a computer, comprising: providing a computer connected to a microcontroller network; requesting conditions of the computer from the microcontroller network; sensing the conditions of the computer with the microcontroller network; receiving the sensed conditions in the microcontroller network; and communicating the sensed conditions from the microcontroller network to the source of the request.
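The request, sense, receive and communicate steps of this method can be sketched in a few lines. This is a minimal illustration only: the class name, the condition names and the sensor values are all hypothetical stand-ins and are not taken from the specification.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class MicrocontrollerNetwork:
    """Hypothetical stand-in for the microcontroller network."""

    # Maps a condition name to a function that senses it (assumed layout).
    sensors: Dict[str, Callable[[], float]] = field(default_factory=dict)

    def handle_request(self, condition: str) -> float:
        # Sense the requested condition; the reading is received inside
        # the network and then communicated back to the requester.
        reading = self.sensors[condition]()
        return reading


# Made-up sensors and values, for illustration only.
network = MicrocontrollerNetwork(
    sensors={
        "fan_rpm": lambda: 3200.0,
        "temperature_c": lambda: 41.5,
    }
)

print(network.handle_request("fan_rpm"))
```

The requester only names a condition; which microcontroller actually holds the sensor stays hidden behind the network object.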











BRIEF DESCRIPTION OF THE DRAWINGS





FIG. 1 is one embodiment of a top-level block diagram showing a fault tolerant computer system of the invention, including mass storage and network connections.

FIG. 2 is one embodiment of a block diagram showing a first embodiment of a multiple bus configuration connecting I/O adapters and a network of microcontrollers to the clustered CPUs of the fault tolerant computer system shown in FIG. 1.

FIG. 3 is one embodiment of a block diagram showing a second embodiment of a multiple bus configuration connecting canisters containing I/O adapters and a network of microcontrollers to the clustered CPUs of the fault tolerant system shown in FIG. 1.

FIG. 4 is one embodiment of a top-level block diagram illustrating the microcontroller network shown in FIGS. 2 and 3.

FIGS. 5A-5C are detailed block diagrams showing one embodiment of the microcontroller network shown in FIG. 4, illustrating the signals and values monitored by each microcontroller, and the control signals generated by the microcontrollers.

FIG. 6 is one embodiment of a flowchart showing the process by which a remote user can access diagnostic and managing services of the microcontroller network shown in FIGS. 4 and 5A-5C.

FIG. 7 is one embodiment of a block diagram showing the connection of an industry standard architecture (ISA) bus to the microcontroller network shown in FIGS. 4 and 5A-5C.

FIG. 8 is one embodiment of a flowchart showing the master to slave communications of the microcontrollers shown in FIGS. 4 and 5A-5C.

FIG. 9 is one embodiment of a flowchart showing the slave to master communications of the microcontrollers shown in FIGS. 4 and 5A-5C.

FIGS. 10A and 10B are flowcharts showing one process by which the System Interface, shown in FIGS. 4 and 5A-5C, gets commands and relays commands from the ISA bus to the network of microcontrollers.

FIGS. 11A and 11B are flowcharts showing one process by which a Chassis microcontroller, shown in FIGS. 4 and 5A-5C, manages and diagnoses the power supply to the computer system.

FIG. 12 is a flowchart showing one process by which the Chassis controller, shown in FIGS. 4 and 5A-5C, monitors the addition and removal of a power supply from the fault tolerant computer system.

FIG. 13 is a flowchart showing one process by which the Chassis controller, shown in FIGS. 4 and 5A-5C, monitors temperature.

FIGS. 14A and 14B are flowcharts showing one embodiment of the activities undertaken by the CPU A controller, shown in FIGS. 4 and 5A-5C.

FIG. 15 is a detailed flowchart showing one process by which the CPU A controller, shown in FIGS. 4 and 5A-5C, monitors the fan speed for the system board of the computer.

FIG. 16 is a flowchart showing one process by which the CPU B controller, shown in FIGS. 4 and 5A-5C, scans for system faults.

FIG. 17 is a flowchart showing one process by which a Canister controller, shown in FIGS. 4 and 5A-5C, monitors the speed of the canister fan of the fault tolerant computer system.

FIG. 18 is a flowchart showing one process by which the System Recorder, shown in FIGS. 4 and 5A-5C, resets the NVRAM located on the backplane of the fault tolerant computer system.











DETAILED DESCRIPTION OF THE INVENTION




The following detailed description presents a description of certain specific embodiments of the present invention. However, the invention can be embodied in a multitude of different ways as defined and covered by the claims. In this description, reference is made to the drawings wherein like parts are designated with like numerals throughout.





FIG. 1 is one embodiment of a block diagram showing a fault tolerant computer system of the present invention. Typically the computer system is one server in a network of servers and communicates with client computers. Such a configuration of computers is often referred to as a client-server architecture. A fault tolerant server is useful for mission critical applications such as the securities business, where any computer downtime can result in catastrophic financial consequences. A fault tolerant computer allows a fault to be isolated so that it does not propagate through the system, thus causing minimal or no disruption to continuing operation. Fault tolerant systems also provide redundant components such as adapters so service can continue even when one component fails.




The system includes a fault tolerant computer system 100 connecting to external peripheral devices through high speed I/O channels 102 and 104. The peripheral devices communicate and are connected to the high speed I/O channels 102 and 104 by mass storage buses 106 and 107. In different embodiments of the invention, the bus system 106, 107 could be Peripheral Component Interconnect (PCI), Microchannel, Industry Standard Architecture (ISA) or Extended ISA (EISA) architectures. In one embodiment of the invention, the buses 106, 107 are PCI. Various kinds of peripheral controllers 108, 112, 116, and 128 may be connected to the buses 106 and 107, including mass storage controllers, network adapters and communications adapters. Mass storage controllers attach to data storage devices such as magnetic disk, tape, optical disk and CD-ROM. These data storage devices connect to the mass storage controllers using one of a number of industry standard interconnects, such as Small Computer System Interface (SCSI), IDE, EIDE or SMD. Peripheral controllers and I/O devices are generally off-the-shelf products. For instance, sample vendors for a magnetic disk controller 108 and magnetic disks 110 include Qlogic and Quantum, respectively. Each magnetic disk may hold multiple Gigabytes of data.




A client server computer system typically includes one or more network interface controllers (NICs) 112 and 128. The network interface controllers 112 and 128 allow digital communication between the fault tolerant computer system 100 and other computers (not shown) such as a network of servers via a connection 130. For LAN embodiments of the network adapter, the network media used may be, for example, Ethernet (IEEE 802.3), Token Ring (IEEE 802.5), Fiber Distributed Data Interface (FDDI) or Asynchronous Transfer Mode (ATM).




In the computer system 100, the high speed I/O channels, buses and controllers (102-128) may, for instance, be provided in pairs. In this example, if one of these should fail, another independent channel, bus or controller is available for use until the failed one is repaired.




In one embodiment of the invention, a remote computer 130 is connected to the fault tolerant computer system 100. The remote computer 130 provides some control over the fault tolerant computer system 100, such as requesting system status.





FIG. 2 shows one embodiment of the bus structure of the fault tolerant computer system 100. A number ‘n’ of central processing units (CPUs) 200 are connected through a host bus 202 to a memory controller 204, which allows for access to semiconductor memory by the other system components. In one embodiment of the invention, there are four CPUs 200, each being an Intel Pentium® Pro microprocessor. A number of bridges 206, 208 and 209 connect the host bus to three additional bus systems 212, 214, and 216. These bridges correspond to high speed I/O channels 102 and 104 shown in FIG. 1. The buses 212, 214 and 216 correspond to the buses 106 and 107 shown in FIG. 1. The bus systems 212, 214 and 216, referred to as PC buses, may be any standards-based bus system such as PCI, ISA, EISA and Microchannel. In one embodiment of the invention, the bus systems 212, 214, 216 are PCI. In another embodiment of the invention a proprietary bus is used.




An ISA Bridge 218 is connected to the bus system 212 to support legacy devices such as a keyboard, one or more floppy disk drives and a mouse. A network of microcontrollers 225 is also interfaced to the ISA bus 226 to monitor and diagnose the environmental health of the fault tolerant system. Further discussion of the network will be provided below.




A bridge 230 and a bridge 232 connect PC buses 214 and 216 with PC buses 234 and 236 to provide expansion slots for peripheral devices or adapters. Separating the devices 238 and 240 on PC buses 234 and 236 reduces the potential that a device or other transient I/O error will bring the entire system down or stop the system administrator from communicating with the system.





FIG. 3 shows an alternative bus structure embodiment of the fault tolerant computer system 100. The two PC buses 214 and 216 contain bridges 242, 244, 246 and 248 to PC bus systems 250, 252, 254, and 256. As with the PC buses 214 and 216, the PC buses 250, 252, 254 and 256 can be designed according to any type of bus architecture including PCI, ISA, EISA, and Microchannel. The PC buses 250, 252, 254, and 256 are connected, respectively, to a canister 258, 260, 262 and 264. The canisters 258, 260, 262, and 264 are casings for a detachable bus system and provide multiple slots for adapters. In the illustrated canister, there are four adapter slots.




Referring now to FIG. 4, the present invention for monitoring and diagnosing environmental conditions may be implemented by using a network of microcontrollers 225 located on the fault tolerant computer system 100. In one embodiment some of the microcontrollers are placed on a system board or motherboard 302 while other microcontrollers are placed on a backplane 304. Furthermore, in the embodiment of FIG. 3, some of the microcontrollers such as Canister controller A 324 may reside on a removable canister.





FIG. 4 illustrates that the network of microcontrollers 225 is connected to one of the CPUs 200 by an ISA bus 308. The ISA bus 308 interfaces the network of microcontrollers 225, which are connected on the microcontroller bus 310, through a System Interface 312. In one embodiment of the invention, the microcontrollers communicate through an I²C serial bus, also referred to as a microcontroller bus 310. The document “The I²C Bus and How to Use It” (Philips Semiconductor, 1992) is hereby incorporated by reference. The I²C bus is a bi-directional two-wire bus and operates at a 400 kbps rate in the present embodiment. However, other bus structures and protocols could be employed in connection with this invention. In other embodiments, IEEE 1394 (Firewire), IEEE 422, IEEE 488 (GPIB), RS-185, Apple ADB, Universal Serial Bus (USB), or Controller Area Network (CAN) could be utilized as the microcontroller bus. Control on the microcontroller bus is distributed. Each microcontroller can be a sender (a master) or a receiver (a slave) and each is interconnected by this bus. A microcontroller directly controls its own resources, and indirectly controls resources of other microcontrollers on the bus.




Here are some of the features of the I²C-bus:




Only two bus lines are required: a serial data line (SDA) and a serial clock line (SCL).




Each device connected to the bus is software addressable by a unique address and simple master/slave relationships exist at all times; masters can operate as master-transmitters or as master-receivers.




The bus is a true multi-master bus including collision detection and arbitration to prevent data corruption if two or more masters simultaneously initiate data transfer.




Serial, 8-bit oriented, bi-directional data transfers can be made at up to 400 kbit/second in the fast mode.




Two wires, serial data (SDA) and serial clock (SCL), carry information between the devices connected to the I²C bus. Each device is recognized by a unique address and can operate as either a transmitter or receiver, depending on the function of the device. Further, each device can operate from time to time as both a transmitter and a receiver. For example, a memory device connected to the I²C bus could both receive and transmit data. In addition to transmitters and receivers, devices can also be considered as masters or slaves when performing data transfers (see Table 1). A master is the device which initiates a data transfer on the bus and generates the clock signals to permit that transfer. At that time, any device addressed is considered a slave.












TABLE 1
Definition of I²C-bus terminology

Term | Description
Transmitter | The device which sends the data to the bus
Receiver | The device which receives the data from the bus
Master | The device which initiates a transfer, generates clock signals and terminates a transfer
Slave | The device addressed by a master
Multi-master | More than one master can attempt to control the bus at the same time without corrupting the message. Each device at separate times may act as a master.
Arbitration | Procedure to ensure that, if more than one master simultaneously tries to control the bus, only one is allowed to do so and the message is not corrupted
Synchronization | Procedure to synchronize the clock signal of two or more devices














The I²C-bus is a multi-master bus. This means that more than one device capable of controlling the bus can be connected to it. As masters are usually microcontrollers, consider the case of a data transfer between two microcontrollers connected to the I²C-bus. This highlights the master-slave and receiver-transmitter relationships to be found on the I²C-bus. It should be noted that these relationships are not permanent, but only depend on the direction of data transfer at that time. The transfer of data between microcontrollers is further described in FIG. 8.




The possibility of connecting more than one microcontroller to the I²C-bus means that more than one master could try to initiate a data transfer at the same time. To avoid the conflict that might ensue from such an event, an arbitration procedure has been developed. This procedure relies on the wired-AND connection of all I²C interfaces to the I²C-bus.




If two or more masters try to put information onto the bus, as long as they put the same information onto the bus, there is no problem. Each monitors the state of the serial data line (SDA). If a microcontroller expects to find that the SDA line is high, but finds that it is low, the microcontroller assumes it lost the arbitration and stops sending data. The clock signals during arbitration are a synchronized combination of the clocks generated by the masters using the wired-AND connection to the SCL line.
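The arbitration rule described above can be illustrated with a small simulation. This is a sketch under simplifying assumptions: every master starts at the same instant, frames are equal-length lists of bits, and clock synchronization is ignored. The function name `arbitrate` is hypothetical.

```python
def arbitrate(frames):
    """Return the set of master indices still transmitting at the end.

    frames: list of equal-length bit sequences (0 or 1), one per master,
    all starting transmission at the same time (simplifying assumption).
    """
    contenders = set(range(len(frames)))
    for bit in range(len(frames[0])):
        # Wired-AND: the data line reads low if any contender drives it low.
        line = min(frames[i][bit] for i in contenders)
        # A master that sent high but reads low has lost and stops sending.
        contenders = {i for i in contenders if frames[i][bit] == line}
    return contenders


# Master 2 loses at the second bit, master 0 at the third bit.
winners = arbitrate([[1, 0, 1, 1], [1, 0, 0, 1], [1, 1, 0, 0]])
print(winners)
```

Masters that send identical frames never observe a mismatch, so more than one index can remain in the result; that matches the observation that there is no problem as long as the masters put the same information onto the bus.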




Generation of clock signals on the I²C-bus is always the responsibility of master devices. Each master microcontroller generates its own clock signals when transferring data on the bus.




In one embodiment, the command, diagnostic, monitoring and history functions of the microcontroller network 102 are accessed using a global network memory and a protocol has been defined so that applications can access system resources without intimate knowledge of the underlying network of microcontrollers. That is, any function may be queried simply by generating a network “read” request targeted at the function's known global network address. In the same fashion, a function may be exercised simply by “writing” to its global network address. Any microcontroller may initiate read/write activity by sending a message on the I²C bus to the microcontroller responsible for the function (which can be determined from the known global address of the function). The network memory model includes typing information as part of the memory addressing information.
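A minimal sketch of such a global network memory follows. The 16-bit address layout (4 bits of type, 4 bits of owning microcontroller, 8 bits of offset), the type codes and all class names are assumptions made for illustration; the actual encoding is defined by the header file referenced in the specification and is not reproduced here.

```python
from enum import IntEnum


class DataType(IntEnum):
    """Hypothetical type codes carried in the address."""
    BIT = 1
    BYTE = 2


def make_address(data_type, owner, offset):
    # Assumed layout: 4 bits type | 4 bits owner id | 8 bits offset.
    return (int(data_type) << 12) | (owner << 8) | offset


def split_address(address):
    return DataType(address >> 12), (address >> 8) & 0xF, address & 0xFF


class Microcontroller:
    """Holds only the functions it is directly responsible for."""

    def __init__(self, ident):
        self.ident = ident
        self.memory = {}


class Network:
    def __init__(self, controllers):
        self.by_id = {c.ident: c for c in controllers}

    def read(self, address):
        # The owner is determined from the global address alone.
        _, owner, offset = split_address(address)
        return self.by_id[owner].memory.get(offset, 0)

    def write(self, address, value):
        data_type, owner, offset = split_address(address)
        # The type information in the address bounds the stored value.
        mask = 1 if data_type is DataType.BIT else 0xFF
        self.by_id[owner].memory[offset] = value & mask


net = Network([Microcontroller(2), Microcontroller(3)])
FAN_SPEED = make_address(DataType.BYTE, owner=3, offset=0x10)  # made-up
net.write(FAN_SPEED, 200)
print(net.read(FAN_SPEED))
```

The caller never names a microcontroller directly; routing falls out of the address, which is the property the paragraph above describes.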




Referring to FIG. 4, in one embodiment of the invention, the network of microcontrollers 310 includes ten processors. One of the purposes of the microcontroller network 225 is to transfer messages to the other components of the server system 100. The processors or microcontrollers include: a System Interface 312, a CPU A controller 314, a CPU B controller 316, a System Recorder 320, a Chassis controller 318, a Canister A controller 324, a Canister B controller 326, a Canister C controller 328, a Canister D controller 330 and a Remote Interface controller 332. The System Interface controller 312, the CPU A controller 314 and the CPU B controller 316 are located on a system board 302 in the fault tolerant computer system 100. Also located on the system board are one or more central processing units (CPUs) or microprocessors 164 and the Industry Standard Architecture (ISA) bus 296 that connects to the System Interface Controller 312. The CPUs 200 may be any conventional general purpose single-chip or multi-chip microprocessor such as a Pentium®, Pentium® Pro or Pentium® II processor available from Intel Corporation, a MIPS® processor available from Silicon Graphics, Inc., a SPARC processor from Sun Microsystems, Inc., a Power PC® processor available from Motorola, or an ALPHA® processor available from Digital Equipment Corporation. In addition, the CPUs 200 may be any conventional special purpose microprocessor such as a digital signal processor or a graphics processor.




The System Recorder 320 and Chassis controller 318, along with a data storage such as a non-volatile random access memory (NVRAM) 322 that connects to the System Recorder 320, are located on a backplane 304 of the fault tolerant computer system 100. The data storage 322 may be independently powered and may retain its contents when power is unavailable. The data storage 322 is used to log system status, so that when a failure of the computer 100 occurs, maintenance personnel can access the storage 322 and search for information about what component failed. An NVRAM is used for the data storage 322 in one embodiment but other embodiments may use other types and sizes of storage devices.
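The logging role of the data storage can be sketched as a small bounded log that maintenance personnel search after a failure. The fixed capacity, the overwrite-oldest policy, the record fields and the class name are all assumptions for illustration; the specification does not state how the log is organized.

```python
class SystemLog:
    """Hypothetical bounded status log, standing in for the NVRAM contents."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []

    def record(self, timestamp, source, message):
        # Assumption: when the log is full, the oldest entry is overwritten.
        if len(self.entries) == self.capacity:
            self.entries.pop(0)
        self.entries.append((timestamp, source, message))

    def search(self, keyword):
        # Return every entry whose message mentions the keyword.
        return [entry for entry in self.entries if keyword in entry[2]]


log = SystemLog(capacity=2)
log.record(100, "chassis", "temperature warning")
log.record(105, "cpu_a", "fan 2 speed low")
log.record(110, "canister_a", "fan failure")
print(log.search("fan"))
```

Because the log lives outside the operating system, it remains readable in this model even when the host has stopped running, which is the point made in the paragraph above.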




The System Recorder 320 and Chassis controller 318 are the first microcontrollers to power up when server power is applied. The System Recorder 320, the Chassis controller 318 and the Remote Interface microcontroller 332 are the three microcontrollers that have an independent bias 5 Volt power supplied to them if main server power is off. This independent bias 5 Volt power is provided by a Remote Interface Board (not shown). The Canister controllers 324-330 are not considered to be part of the backplane 304 because each is mounted on a card attached to the canister.





FIGS. 5A-5C are one embodiment of a block diagram that illustrates some of the signal lines that are used by the different microcontrollers. Some of the signal lines connect to actuators and other signal lines connect to sensors. In one embodiment of the invention the microcontrollers in the network are commercially available microcontrollers. Examples of off-the-shelf microcontrollers that could be utilized are the PIC16c65 and the PIC16c74 available from Microchip Technology Inc., the 8051 from Intel Corporation, the 8751 available from Atmel, and the P80CL580 microprocessor available from Philips.




The Chassis controller 318 is connected to a set of temperature detectors 502, 504, and 506 which read the temperature on the backplane 304 and the system board 302. FIG. 5 also illustrates the signal lines that connect the System Recorder 320 to the NVRAM 322 and a timer chip 520. In one embodiment of the invention, the System Recorder 320 is the only microcontroller that can access the NVRAM 322. The Canister controller 324 is connected to a Fan Tachometer Signal Mux 508 which is used to detect the speed of the fans. The CPU A controller 314 also is connected to a fan mux 508 which gathers the fan speed of system fans. The CPU A controller 314 displays errors to a user by writing to an LCD display 512. Any microcontroller can request the CPU A controller 314 to write a message to the LCD display 512. The System Interface 312 is connected to a response buffer 514 which queues outgoing response signals in the order that they are received. Similarly, a request signal buffer 516 is connected to the System Interface 312 and stores, or queues request signals in the order that they are received.
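The first-in, first-out behavior of the request and response buffers can be sketched with two queues. The class and method names are hypothetical, and the handler passed in merely stands in for whatever microcontroller services the request.

```python
from collections import deque


class SystemInterfaceBuffers:
    """Hypothetical model of the request and response buffers."""

    def __init__(self):
        self.requests = deque()   # incoming requests, oldest first
        self.responses = deque()  # outgoing responses, oldest first

    def enqueue_request(self, request):
        self.requests.append(request)

    def service_next(self, handler):
        # Take the oldest pending request and queue its response.
        request = self.requests.popleft()
        self.responses.append(handler(request))

    def next_response(self):
        return self.responses.popleft()


buffers = SystemInterfaceBuffers()
buffers.enqueue_request("read fan speed")
buffers.enqueue_request("read temperature")
buffers.service_next(str.upper)  # stand-in handler
buffers.service_next(str.upper)
first = buffers.next_response()
second = buffers.next_response()
print(first, "|", second)
```

Keeping both queues strictly ordered means a response can always be matched to the request that produced it without tagging either one.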




Software applications can access the network of microcontrollers


225


by using the software program header file that is listed at the end of the specification in the section titled “Header File for Global Memory Addresses.” This header file provides a global memory address for each function of the microcontroller network


225


. By using the definitions provided by this header file, applications can request and send information to the microcontroller network


225


without needing to know where a particular sensor or actuator resides in the microcontroller network.





FIG. 6

is one embodiment of a flowchart illustrating the process by which under one implementation of the present invention, a remote application connected, say, through the connection of

FIG. 1

, can access the network of microcontrollers


225


. Starting at state


600


, a remote software application, such as a generic system management application like Hewlett-Packard OpenView, or an application specific to this computer system, retrieves a management information base (MIB) object by reading and interpreting a MIB file, or by an application's implicit knowledge of the MIB object's structure. This retrieval could be the result of an operator using a graphical user interface (GUI), or the result of some automatic system management process. The MIB is a description of objects, which have a standard structure, and contain information specific to the MIB object ID associated with a particular MIB object. At a block


602


, the remote application builds a request for information that references a particular MIB object by its object ID, and sends the request to the target computer using a protocol called SNMP (Simple Network Management Protocol). SNMP is a protocol in the TCP/IP suite. Moving to state


604


, the remote software sends the SNMP packet to a local agent, Microsoft WinSNMP for example, which is running on the fault tolerant computer system


100


, which includes the network of microcontrollers


225


(FIG.


4


). The agent is a specialized program which can interpret MIB object IDs and objects. The local agent software runs on one of the CPUs


200


of

FIGS. 2 and 3

.




The local agent examines the SNMP request packet (state


606


). If the local agent does not recognize the request, the local agent passes the SNMP packet to an extension SNMP agent. Proceeding to state


608


, the extension SNMP agent dissects the object ID. The extension SNMP agent is coded to recognize from the object ID, which memory mapped resources managed by the network of microcontrollers need to be accessed (state


608


). The agent then builds the required requests for the memory mapped information in the command protocol format understood by the network of microcontrollers


225


. The agent then forwards the request to a microcontroller network device driver (state


610


).




The device driver then sends the information to the network of microcontrollers


225


at state


612


. The network of microcontrollers


225


provides a result to the device driver in state


614


. The result is returned to the extension agent, which uses the information to build the MIB object, and returns it to the local SNMP agent (state


616


). The local SNMP agent forwards the MIB object via SNMP to the remote agent (state


616


). Finally, in state


620


, the remote agent forwards the result to the remote application software.




For example, if a remote application needs to know the speed of a fan, the remote application reads a file to find the object ID for fan speed. The object ID for the fan speed request may be “837.2.3.6.2”. Each set of numbers in the object ID represents a hierarchical group of data. For example, the number “3” of the object ID represents the cooling system. The “3.6” portion of the object ID represents the fans in the cooling system. All three numbers “3.6.2” indicate the speed of a particular fan in a particular cooling group.
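The hierarchical reading of the object ID can be illustrated with a minimal Python sketch. The function name `oid_prefixes` and the `GROUPS` dictionary are hypothetical and exist only for illustration; the labels are taken from the example above.

```python
def oid_prefixes(object_id, depth=3):
    """Return the nested groups formed by the last `depth` numbers of an object ID."""
    parts = object_id.split(".")
    tail = parts[-depth:]
    return [".".join(tail[:i]) for i in range(1, len(tail) + 1)]

# Labels follow the example in the text; the dictionary itself is illustrative.
GROUPS = {
    "3": "cooling system",
    "3.6": "fans in the cooling system",
    "3.6.2": "speed of a particular fan",
}

for prefix in oid_prefixes("837.2.3.6.2"):
    print(prefix, "->", GROUPS[prefix])
```

Each successive prefix narrows the request from a subsystem to a single reading, which is what lets the extension agent decide which memory mapped resource to query.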




In this example, the remote application creates a SNMP packet containing the object ID to get the fan speed on the computer


100


. The remote application then sends the SNMP packet to the local agent. Since the local agent does not recognize the fan speed object ID, the local agent forwards the SNMP packet to the extension agent. The extension agent parses the object ID to identify which specific memory mapped resources of the network of microcontrollers


225


are needed to build the MIB object whose object ID was just parsed. The extension agent then creates a message in the command protocol required by the network of microcontrollers


225


. A device driver which knows how to communicate requests to the network of microcontrollers


225


takes this message and relays the command to the network of microcontrollers


225


. Once the network of microcontrollers


225


finds the fan speed, it relays the results to the device driver. The device driver passes the information to the extension agent. The agent takes the information supplied by the microcontroller network device driver and creates a new SNMP packet. The local agent forwards this packet to the remote agent, which then relays the fan speed which is contained in the packet to the remote application program.
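The hand-off from the local agent to the extension agent and on to the device driver might be pictured as follows. This is a toy Python sketch, not an SNMP implementation: all function names are hypothetical, and the driver stub returns a made-up reading instead of talking to hardware.

```python
FAN_SPEED_OID = "837.2.3.6.2"  # example object ID from the text

def driver_read_fan_speed():
    """Stand-in for the microcontroller network device driver (made-up reading)."""
    return 65

def extension_agent(oid):
    """Map a recognized object ID to a driver call; return None if unknown."""
    handlers = {FAN_SPEED_OID: driver_read_fan_speed}
    handler = handlers.get(oid)
    return None if handler is None else handler()

def local_agent(oid, builtin_objects):
    """Answer from built-in objects, else pass the request to the extension agent."""
    if oid in builtin_objects:
        return builtin_objects[oid]
    return extension_agent(oid)

print(local_agent(FAN_SPEED_OID, {}))
```

The local agent stays generic; only the extension agent needs to know which object IDs belong to the microcontroller network.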





FIG. 7

is one embodiment of a block diagram of the interface between the network of microcontrollers


225


and the ISA bus


308


of

FIGS. 2 and 3

. The interface to the network of microcontrollers


225


includes a System Interface processor


312


which receives event and request signals, processes these signals, and transmits command, status and response signals to the operating system of the CPUs


200


. In one embodiment, the System Interface processor


312


is a PIC16C65 controller chip, available from Microchip Technology Inc., which includes an event memory (not shown) organized as a bit vector, having at least sixteen bits. Each bit in the bit vector represents a particular type of event. Writing an event to the System Interface processor


312


sets a bit in the bit vector that represents the event. Upon receiving an event signal from another microcontroller, the System Interface


312


interrupts CPUs


200


. Upon receiving the interrupt, the CPUs


200


will check the status of the System Interface


312


to ascertain that an event is pending. Alternatively, the CPUs


200


may periodically poll the status of the System Interface


312


to ascertain whether an event is pending. The CPUs


200


may then read the bit vector in the System Interface


312


to ascertain the type of event that occurred and thereafter notify a system operator of the event by displaying an event message on a monitor connected to the fault tolerant computer


100


or another computer in the server network. After the system operator has been notified of the event, as described above, she may then obtain further information about the system failure which generated the event signal by accessing the NVRAM


322


.
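The event memory described above behaves like a small bit set. The following Python sketch models it under two assumptions that the text does not state: event types are numbered 0 through 15, and reading the vector clears it.

```python
class EventVector:
    """Model of an event memory organized as a bit vector."""

    WIDTH = 16  # "at least sixteen bits" per the text

    def __init__(self):
        self.bits = 0

    def post(self, event):
        """Record an event by setting the bit that represents its type."""
        if not 0 <= event < self.WIDTH:
            raise ValueError("unknown event type")
        self.bits |= 1 << event

    def pending(self):
        """Status check a CPU would make after an interrupt or while polling."""
        return self.bits != 0

    def read_and_clear(self):
        """Return the event types that occurred (clearing is an assumption)."""
        events = [i for i in range(self.WIDTH) if (self.bits >> i) & 1]
        self.bits = 0
        return events

vector = EventVector()
vector.post(3)
vector.post(9)
print(vector.read_and_clear())
```

Because each event type owns one bit, repeated events of the same type collapse into a single pending flag, and details must be fetched elsewhere, for example from the NVRAM log.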




The System Interface


312


communicates with the CPUs


200


by receiving request signals from the CPUs


200


and sending response signals back to the CPUs


200


. Furthermore, the System Interface


312


can send and receive status and command signals to and from the CPUs


200


. For example, a request signal may be sent from a software application inquiring as to whether the System Interface


312


has received any event signals, or inquiring as to the status of a particular processor, subsystem, or operating parameter. The following discussion explains in further detail how, at the state


612


, the device driver sends the request to the network of microcontrollers, and then how the network of microcontrollers returns the result (state


614


). A request signal buffer


516


is connected to the System Interface


312


and stores, or queues, request signals in the order that they are received, first in-first out (FIFO). Similarly, a response buffer


514


is connected to the System Interface


312


and queues outgoing response signals in the order that they are received (FIFO). These queues are one byte wide (messages on the I2C bus are sequences of 8-bit bytes, transmitted bit-serially on the SDA line).




A message data register (MDR)


707


is connected to the request and response buffers


516


and


514


and controls the arbitration of messages to and from the System Interface


312


via the request and response buffers


516


and


514


. In one embodiment, the MDR


707


is eight bits wide and has a fixed address which may be accessed by the server's operating system via the ISA bus


226


connected to the MDR


707


. As shown in

FIG. 7

, the MDR


707


has an I/O address of 0CC0h. When a software application running on one of the CPUs


200


desires to send a request signal to the System Interface


312


, it does so by writing a message one byte at a time to the MDR


707


. The application then indicates to the system interface processor


312


that the command has been completely written, and may be processed.




The system interface processor


312


writes the response one byte at a time to the response queue, then indicates to the CPU (via an interrupt or a bit in the status register) that the response is complete, and ready to be read. The CPU


200


then reads the response queue one byte at a time by reading the MDR


707


until all bytes of the response are read.
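The byte-at-a-time exchange through the MDR can be modeled with two FIFO queues. In this Python sketch the class and function names are hypothetical, and `echo_interface` is a stand-in that simply copies the request to the response queue; the real System Interface processor would forward the message and queue the actual reply.

```python
from collections import deque

class MessageDataRegister:
    """One register in front of two byte-wide FIFOs."""

    def __init__(self):
        self.request_fifo = deque()   # models request buffer 516
        self.response_fifo = deque()  # models response buffer 514

    def write(self, byte):
        """A CPU write to the MDR loads one byte into the request queue."""
        self.request_fifo.append(byte & 0xFF)

    def read(self):
        """A CPU read of the MDR unloads one byte from the response queue."""
        return self.response_fifo.popleft()

def echo_interface(mdr):
    """Hypothetical stand-in for the System Interface: echoes the request."""
    while mdr.request_fifo:
        mdr.response_fifo.append(mdr.request_fifo.popleft())

mdr = MessageDataRegister()
for b in bytes.fromhex("02830300FF"):
    mdr.write(b)
echo_interface(mdr)
reply = bytes(mdr.read() for _ in range(5))
print(reply.hex())
```

Using a single register address for both directions keeps the ISA-side interface small; the direction of the access selects the queue.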




The following is one embodiment of the command protocol used to communicate with the network of microcontrollers


225


.












TABLE 2









Command Protocol Format



























































The following is a description of each of the fields in the command protocol.












TABLE 3











Description of Command Protocol Fields












FIELD                  DESCRIPTION

Slave Addr             Specifies the processor identification code. This field is 7 bits wide. Bit [7 . . . 1].

LSBit                  Specifies what type of activity is taking place. If LSBit is clear (0), the master is writing to a slave. If LSBit is set (1), the master is reading from a slave.

MSBit                  Specifies the type of command. It is bit 7 of byte 1 of a request. If this bit is clear (0), this is a write command. If it is set (1), this is a read command.

Type                   Specifies the data type of this command, such as bit or string.

Command ID (LSB)       Specifies the least significant byte of the address of the processor.

Command ID (MSB)       Specifies the most significant byte of the address of the processor.

Length (N)

  Read Request         Specifies the length of the data that the master expects to get back from a read response. The length, which is in bytes, does not include the Status, Check Sum, and Inverted Slave Addr fields.

  Read Response        Specifies the length of the data immediately following this byte, that is byte 2 through byte N + 1. The length, which is in bytes, does not include the Status, Check Sum, and Inverted Slave Addr fields.

  Write Request        Specifies the length of the data immediately following this byte, that is byte 2 through byte N + 1. The length, which is in bytes, does not include the Status, Check Sum, and Inverted Slave Addr fields.

  Write Response       Always specified as 0.

Data Byte 1 to         Specifies the data in a read request and response, and a write request.
Data Byte N

Status                 Specifies whether or not this command executes successfully. A non-zero entry indicates a failure.

Check Sum              Specifies a direction control byte to ensure the integrity of a message on the wire.

Inverted Slave Addr    Specifies the Slave Addr, which is inverted.














The System Interface


312


further includes a command and status register (CSR)


709


which initiates operations and reports on status. The operation and functionality of CSR


709


is described in further detail below. Both synchronous and asynchronous I/O modes are provided by the System Interface


312


. During a synchronous mode of operation, the device driver waits for a request to be completed. During an asynchronous mode of operation the device driver sends the request, and asks to be interrupted when the request completes. To support asynchronous operations, an interrupt line


711


is connected between the System Interface


312


and the ISA bus


226


and provides the ability to request an interrupt when asynchronous I/O is complete, or when an event occurs while the interrupt is enabled. As shown in

FIG. 7

, in one embodiment, the address of the interrupt line


711


is fixed and indicated as IRQ


15


which is an interrupt address number used specifically for the ISA bus


226


.




The MDR


707


and the request and response buffers


516


and


514


, respectively, transfer messages between a software application running on the CPUs


200


and the failure reporting system of the invention. The buffers


516


and


514


have two functions: (1) they store data in situations where one bus is running faster than the other, i.e., the different clock rates, between the ISA bus


226


and the microcontroller bus


310


; and (2) they serve as interim buffers for the transfer of messages—this relieves the System Interface


312


of having to provide this buffer.




When the MDR


707


is written to by the ISA bus


226


, it loads a byte into the request buffer


516


. When the MDR


707


is read from the ISA bus


226


, it unloads a byte from the response buffer


514


. The System Interface


312


reads and executes messages from buffer


516


when a message command is received in the CSR


709


. A response message is written to the response buffer


514


when the System Interface


312


completes executing the command. The system operator receives a completed message over the microcontroller bus


310


. A software application can read and write message data to and from the buffers


516


and


514


by executing read and write instructions through the MDR


707


.




The CSR


709


has two functions. The first is to initiate commands, and the second is to report status. The System Interface commands are usually executed synchronously. That is, after issuing a command, the microcontroller network device driver should continue to poll the CSR


709


status to confirm command completion. In addition to synchronous I/O mode, the microcontroller network device driver can also request an asynchronous I/O mode for each command by setting an “Asyn Req” bit in the command. In this mode, an interrupt is generated and sent to the ISA bus


226


, via the interrupt line


711


, after the command has completed executing.
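The synchronous mode amounts to a bounded polling loop on the CSR. The Python sketch below simulates it; `DONE_BIT` is a hypothetical bit position, since the CSR layout is not given in this passage, and `make_fake_csr` is a test double rather than real port I/O.

```python
DONE_BIT = 0x01  # hypothetical bit position; the CSR layout is not given here

def wait_for_done(read_csr, max_polls=1000):
    """Synchronous mode: poll the CSR until the done bit is set or polls run out."""
    for _ in range(max_polls):
        if read_csr() & DONE_BIT:
            return True
    return False

def make_fake_csr(ready_after):
    """Return a read function whose done bit appears after `ready_after` reads."""
    state = {"reads": 0}

    def read_csr():
        state["reads"] += 1
        return DONE_BIT if state["reads"] > ready_after else 0

    return read_csr

print(wait_for_done(make_fake_csr(3)))
```

In asynchronous mode the driver would skip this loop and instead wait for the interrupt raised on completion, which frees the CPU while the microcontroller network works on the request.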




In the described embodiment, the interrupt is asserted through IRQ


15


of the ISA programmable interrupt controller (PIC). The ISA PIC interrupts the CPU


200


s when a signal transitioning from high to low, or from low to high, is detected at the proper input pin (edge triggered). Alternatively, the interrupt line


711


may connect to a level-triggered input. A level-triggered interrupt request is recognized by keeping the signal at the same level, or changing the level of a signal, to send an interrupt. The microcontroller network device driver can either enable or disable interrupts by sending “Enable Ints” and “Disable Ints” commands to the CSR


709


. If the interrupt


711


line is enabled, the System Interface


312


asserts the interrupt signal IRQ


15


of the PIC to the ISA bus


226


, either when an asynchronous I/O is complete or when an event has been detected.




In the embodiment shown in

FIG. 2

, the System Interface


312


may be a single-threaded interface. Since messages are first stored in the queue, then retrieved from the queue by the other side of the interface, a device driver should write one message, containing a sequence of bytes, at a time. Thus, only one message should be in progress at a time using the System Interface


312


. Therefore, a program or application must allocate the System Interface


312


for its use before using it, and then de-allocate the interface


312


when its operation is complete. The CSR


709


indicates which operator is allocated access to the System Interface


312


.
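The allocate, use, de-allocate discipline maps naturally onto a scoped guard. The following Python sketch assumes a stub object in place of the real System Interface and uses made-up owner IDs; it shows only the ownership rule, not the CSR commands themselves.

```python
from contextlib import contextmanager

class SystemInterfaceStub:
    """Tracks which owner, if any, currently holds the interface."""

    def __init__(self):
        self.owner = None

    def allocate(self, owner_id):
        if self.owner is not None:
            return False  # another message is already in progress
        self.owner = owner_id
        return True

    def deallocate(self, owner_id):
        if self.owner == owner_id:
            self.owner = None

@contextmanager
def allocated(iface, owner_id):
    """Hold the interface for one message and always release it afterwards."""
    if not iface.allocate(owner_id):
        raise RuntimeError("System Interface is already allocated")
    try:
        yield iface
    finally:
        iface.deallocate(owner_id)

iface = SystemInterfaceStub()
with allocated(iface, owner_id=1):
    print("owner while in use:", iface.owner)
print("owner after release:", iface.owner)
```

Releasing in a `finally` clause reflects the requirement that the interface be de-allocated when the operation is complete, even if the operation fails partway.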




Referring to

FIGS. 2 and 7

, an example of how messages are communicated between the System Interface


312


and CPUs


200


in one embodiment of the invention is as follows (all byte values are provided in hexadecimal numbering). A system management program (not shown) sends a command to the network of microcontrollers


225


to check temperature and fan speed. To read the temperature from CPU A controller


314


the program builds a message for the device driver to forward to the network of microcontrollers


225


. First, the device driver on CPUs


200


allocates the interface by writing the byte “01” to the CSR


709


. If another request was received, the requestor would have to wait until the previous request was completed. To read the temperature from Chassis controller


318


the device driver would write into the request queue


516


through the MDR


707


the bytes “02 83 03 00 FF”. The first byte “02” would signify to the System Interface


312


that a command is intended for the Chassis controller


318


. The first bit of the second byte “83” is set, which Table 3 defines as a read command. The last or least significant three bits of the byte “83” indicate the data type of the request. The third and fourth bytes “03 00” indicate that the read temperature function of the Chassis controller


318


is being requested. The final byte “FF” is the checksum.
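The example request can be taken apart field by field. This Python sketch follows the description above and Table 3: bit 7 of the second byte is the read/write command flag and the low three bits are the data type. The function name `decode_request` is hypothetical, the first byte is treated as an opaque controller identifier as in the example, and the checksum is reported but not verified because its algorithm is not given here.

```python
def decode_request(message):
    """Split a request into the fields described in the text."""
    slave, control, id_lsb, id_msb = message[:4]
    return {
        "slave": slave,                        # controller the command is for
        "is_read": bool(control & 0x80),       # MSBit: set means read command
        "data_type": control & 0x07,           # least significant three bits
        "command_id": id_lsb | (id_msb << 8),  # LSB is sent first
        "checksum": message[-1],               # not verified in this sketch
    }

fields = decode_request(bytes.fromhex("02830300FF"))
print(fields)
```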




After writing the bytes to the MDR


707


, a “13” (message command) is written by the device driver to the CSR


709


, indicating the command is ready to be executed. The System Interface processor


312


passes the message bytes to the microcontroller bus


310


, receives a response, and puts the bytes into the response FIFO


514


. Since there is only one system interface processor


312


, there is no chance that message bytes will get intermingled.




After all bytes are written to the response FIFO, the System Interface processor


312


sets a bit in the CSR


709


indicating message completion. If directed to do so by the device driver, the system interface


312


asserts an interrupt on IRQ


15


upon completion of the task.




The CPUs


200


would then read from the response buffer


514


through the MDR


707


the bytes “02 05 27 3C 27 26 27 00”. The first byte in the string is the slave address shown as Byte


0


in the Read Response Format. The first byte


02


indicates that the Chassis controller


318


was the originator of the message. The second byte “05” indicates the number of temperature readings that follow. The second Byte “05” maps to Byte


1


of the Read Response Format. In this example, the Chassis controller


318


returned five temperatures. The second reading, byte “3C” (60 decimal) is above normal operational values. The last byte “00” is a check sum which is used to ensure the integrity of a message.
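A short Python sketch shows how a driver might unpack this response and spot the high reading. The function name is hypothetical; the 55 degree warning threshold is the one quoted later in this example, and whatever follows the data bytes is returned unverified as a trailer.

```python
WARNING_THRESHOLD_C = 55  # warning threshold quoted in the text

def parse_read_response(message):
    """Return (slave, data bytes, trailing bytes) for a read response."""
    slave = message[0]
    length = message[1]
    data = list(message[2:2 + length])
    trailer = message[2 + length:]
    return slave, data, trailer

slave, temps, trailer = parse_read_response(bytes.fromhex("0205273C27262700"))
hot = [t for t in temps if t > WARNING_THRESHOLD_C]
print(temps, hot)
```

Only the second reading, 3C hexadecimal or 60 decimal, exceeds the threshold, which matches the observation in the text.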




The CPUs


200


agent and device driver request the fan speed by writing the bytes “03 83 04 00 FF” to the network of microcontrollers


225


. Each byte follows the read request format specified in Table 2. The first byte “03” indicates that the command is for the CPU A Controller


314


. The second byte “83” indicates that the command is a read request of a string data type.




A response of “03 06 41 43 41 42 41 40 00” would be read from MDR


707


by the device driver. The first byte “03” indicates to the device driver that the command is from the CPU A controller


314


. The speed bytes “41 43 41 42 41 40” indicate the revolutions per second of a fan in hexadecimal. The last byte read from the MDR


707


“00” is the checksum.




Since one of the temperatures is higher than the warning threshold, 55° C., and fan speed is within normal (low) range, a system administrator or system management software may set the fan speed to high with the command bytes “03 01 01 00 01 01 FF”. The first byte “03” indicates that the command is for the CPU A controller


314


. The second byte “01” indicates that a write command is requested. The third and fourth bytes, which correspond to byte


2


and


3


of the write request format, indicate a request to increase the fan speed. The fifth byte, which corresponds to byte


4


of the write request format indicates to the System Interface


312


that one byte is being sent. The sixth byte contains the data that is being sent. The last byte “FF” is the checksum.
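The write request above can be assembled from its fields as in the following Python sketch. The function name and parameter names are hypothetical, and the checksum is passed in as a given value because its algorithm is not described in this passage.

```python
def build_write_request(slave, data_type, command_id, data, checksum):
    """Assemble a write request: header, length, data bytes, checksum."""
    control = data_type & 0x07  # MSBit left clear, which marks a write command
    header = bytes([
        slave,
        control,
        command_id & 0xFF,         # Command ID (LSB)
        (command_id >> 8) & 0xFF,  # Command ID (MSB)
        len(data),                 # Length (N)
    ])
    return header + bytes(data) + bytes([checksum])

msg = build_write_request(0x03, 0x01, 0x0001, b"\x01", 0xFF)
print(msg.hex(" "))
```

The result reproduces the seven bytes of the example, with the single data byte carrying the new fan speed setting.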





FIG. 8

is one embodiment of a flowchart describing the process by which a master microcontroller communicates with a slave microcontroller. Messages between microcontrollers can be initiated by any microcontroller on the microcontroller bus


310


(FIG.


4


). A master microcontroller starts out in state


800


.




In state


802


, the microcontroller arbitrates for the start bit. If a microcontroller sees a start bit on the microcontroller bus


310


, it cannot gain control of the microcontroller bus


310


. The master microcontroller proceeds to state


804


. In the state


804


, the microcontroller increments a counter every millisecond. The microcontroller then returns to state


800


to arbitrate again for the start bit. If at state


806


the count reaches 50 ms, the master has failed to gain the bus (states


808


and


810


). The microcontroller then returns to the state


800


to retry the arbitration process.
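The arbitration wait can be sketched as a counter that advances once per millisecond while a start bit is visible. This Python model uses simulated time instead of a real delay; `busy_for` is a test double, and the sketch reports failure to the caller rather than looping back to retry on its own.

```python
def arbitrate(start_bit_seen, timeout_ms=50):
    """Return milliseconds waited on success, or None if the bus stayed busy."""
    waited_ms = 0
    while start_bit_seen():
        waited_ms += 1  # counter incremented every millisecond
        if waited_ms >= timeout_ms:
            return None  # failed to gain the bus; caller may retry
    return waited_ms

def busy_for(polls):
    """Test double: report a start bit for the first `polls` checks."""
    remaining = [polls]

    def seen():
        if remaining[0] > 0:
            remaining[0] -= 1
            return True
        return False

    return seen

print(arbitrate(busy_for(3)), arbitrate(busy_for(100)))
```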




If in the state


802


, no start bit is seen on the microcontroller bus


310


, the microcontroller bus


310


is assumed to be free (i.e., the microcontroller has won arbitration for the microcontroller bus


310


). The microcontroller sends a byte at a time on the microcontroller bus


310


(state


812


). After the microcontroller has sent each byte, the microcontroller queries the microcontroller bus


310


to ensure that the microcontroller bus


310


is still functional. If the SDA and SCL lines of the microcontroller bus


310


are not low, the microcontroller is sure that the microcontroller bus


310


is functional and proceeds to state


816


. If the SDA and SCL lines are not drawn high, then the microcontroller starts to poll the microcontroller bus


310


to see if it is functional. Moving to state


819


, the microcontroller waits 22 microseconds and increments a counter Y. If the counter Y is less than five milliseconds (state


820


), the state


814


is reentered and the microcontroller bus


310


is checked again. If the SDA and SCL lines are low for 5 milliseconds (indicated when, at state


820


, the counter Y exceeds 5 milliseconds), the microcontroller enters state


822


and assumes there is a microcontroller bus error. The microcontroller then terminates its control of the microcontroller bus


310


(state


824


).
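The stuck-bus check after each byte can be modeled as a poll every 22 microseconds with a 5 millisecond limit. In this Python sketch time is simulated by adding to a counter, and `lines_low` is a hypothetical callable that reports whether SDA and SCL are both held low.

```python
POLL_INTERVAL_US = 22  # wait between checks, per the text
TIMEOUT_US = 5000      # 5 milliseconds

def bus_is_functional(lines_low):
    """Return False if the lines stay low for more than 5 ms, else True."""
    elapsed_us = 0
    while lines_low():
        elapsed_us += POLL_INTERVAL_US
        if elapsed_us > TIMEOUT_US:
            return False  # bus error: the master gives up control of the bus
    return True

print(bus_is_functional(lambda: False), bus_is_functional(lambda: True))
```

With these numbers a bus error is declared after roughly 228 consecutive low readings.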




If in the state


814


, the SDA/SCL lines do not stay low (state


816


), the master microcontroller waits for a response from a slave microcontroller (state


816


). If the master microcontroller has not received a response, the microcontroller enters state


826


. The microcontroller starts a counter which is incremented every one millisecond. Moving to state


828


, if the counter reaches fifty milliseconds, the microcontroller enters state


830


indicating a microcontroller bus error. The microcontroller then resets the microcontroller bus


310


(state


832


).




Returning to state


816


, if the master microcontroller does receive a response in state


816


, the microcontroller enters state


818


and receives the data from the slave microcontroller. At state


820


, the master microcontroller is finished communicating with the slave microcontroller.





FIG. 9

is one embodiment of a block diagram illustrating the process by which a slave microcontroller communicates with a master microcontroller. Starting in state


900


, the slave microcontroller receives a byte from a master microcontroller. The first byte of an incoming message always contains the slave address. This slave address is checked by all of the microcontrollers on the microcontroller bus


310


. Whichever microcontroller matches the slave address to its own address handles the request.




At a decision state


902


, an interrupt is generated on the slave microcontroller. The microcontroller checks if the byte received is the first received from the master microcontroller (state


904


). If the current byte received is the first byte received, the slave microcontroller sets a bus time-out flag (state


906


). Otherwise, the slave microcontroller proceeds to check if the message is complete (state


908


). If the message is incomplete, the microcontroller proceeds to the state


900


to receive the remainder of bytes from the master microcontroller. If at state


908


, the slave microcontroller determines that the complete message has been received, the microcontroller proceeds to state


909


.




Once the microcontroller has received the first byte, the microcontroller will continue to check if there is an interrupt on the microcontroller bus


310


. If no interrupt is posted on the microcontroller bus


310


, the slave microcontroller will check to see if the bus time-out flag is set. The bus time-out flag is set once a byte has been received from a master microcontroller. If in the decision state


910


the microcontroller determines that the bus time-out flag is set, the slave microcontroller will proceed to check for an interrupt every 10 milliseconds up to 500 milliseconds. For this purpose, the slave microcontroller increments the counter every 10 milliseconds (state


912


). In state


914


, the microcontroller checks to see if the microcontroller bus


310


has timed out. If the slave microcontroller has not received additional bytes from the master microcontroller, the slave microcontroller assumes that the microcontroller bus


310


is hung and resets the microcontroller bus


310


(state


916


). Next, the slave microcontroller aborts the request and awaits further requests from other master microcontrollers (state


918


).




Referring to the state


909


, the bus timeout bit is cleared, and the request is processed and the response is formulated. Moving to state


920


, the response is sent a byte at a time. At state


922


, the same bus check is made as was described for the state


814


. States


922


,


923


and


928


form the same bus check and timeout as states


814


,


819


and


820


. If in state


928


this check times out, a bus error exists, and this transaction is aborted (states


930


and


932


).





FIGS. 10A and 10B

are flow diagrams showing one process by which the System Interface


312


handles requests from other microcontrollers in the microcontroller network and the ISA bus


226


(FIGS.


4


and


5


). The System Interface


312


relays messages from the ISA bus


226


to other microcontrollers in the network of microcontrollers


225


. The System Interface


312


also relays messages from the network of microcontrollers to the ISA bus


226


.




Referring to

FIGS. 10A and 10B

, the System Interface


312


initializes all variables and the stack pointer (state


1000


). Moving to state


1002


, the System Interface


312


starts its main loop in which it performs various functions. The System Interface


312


next checks the bus timeout bit to see if the microcontroller bus


310


has timed-out (decision state


1004


). If the microcontroller bus


310


has timed-out, the System Interface


312


resets the microcontroller bus


310


in state


1006


.




Proceeding to a decision state


1008


, the System Interface


312


checks to see if any event messages have been received. An event occurs when the System Interface


312


receives information from another microcontroller regarding a change to the state of the system. At state


1010


, the System Interface


312


sets the event bit in the CSR


709


to one. The System Interface


312


also sends an interrupt to the operating system if the CSR


709


has requested interrupt notification.




Proceeding to a decision state


1012


, the System Interface


312


checks to see if a device driver for the operating system has input a command to the CSR. If the System Interface


312


does not find a command, the System Interface


312


returns to state


1002


. If the System Interface does find a command from the operating system, the System Interface parses the command. For the “allocate command”, the System Interface


312


resets the queue to the ISA bus


226


, resets the done bit in the CSR


709


(state


1016


) and sets the CSR Interface Owner ID (state


1016


). The Owner ID bits identify which device driver owns control of the System Interface


312


.




For the “de-allocate command”, the System Interface


312


resets the queue to the ISA bus


226


, resets the done bit in the CSR


709


, and clears the Owner ID bits (state


1018


).




For the “clear done bit command” the System Interface


312


clears the done bit in the CSR


709


(state


1020


). For the “enable interrupt command” the System Interface


312


sets the interrupt enable bit in the CSR


709


(state


1022


). For the “disable interrupt command,” the System Interface


312


clears the interrupt enable bit in the CSR


709


(state


1024


). For the “clear interrupt request command”, the System Interface


312


clears the interrupt request bit in the CSR


709


(state


1026


).
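The CSR command handling described above can be summarized as a small state model. This Python sketch is an interpretation, not the patented firmware: the command names are hypothetical strings, the disable command is read as clearing the interrupt enable bit, and the clear interrupt request command is read as clearing a separate request flag.

```python
class CsrModel:
    """State touched by the CSR commands handled in the main loop."""

    def __init__(self):
        self.done = False
        self.int_enable = False
        self.int_request = False
        self.owner = 0  # 0 means no owner

    def handle(self, command, owner_id=0):
        if command == "allocate":
            self.done = False
            self.owner = owner_id
        elif command == "deallocate":
            self.done = False
            self.owner = 0
        elif command == "clear_done":
            self.done = False
        elif command == "enable_ints":
            self.int_enable = True
        elif command == "disable_ints":
            self.int_enable = False
        elif command == "clear_int_request":
            self.int_request = False
        else:
            raise ValueError("unknown CSR command: " + command)

csr = CsrModel()
csr.handle("allocate", owner_id=2)
csr.handle("enable_ints")
print(csr.owner, csr.int_enable)
```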




If the request from the operating system was not meant for the System Interface


312


, the command is intended for another microcontroller in the network


225


. The only valid command remaining is the “message command.” Proceeding to state


1028


, the System Interface


312


reads message bytes from the request buffer


516


. From the state


1028


, the System Interface


312


proceeds to a decision state


1030


in which the System Interface


312


checks whether the command was for itself. If the command was for the System Interface


312


, moving to state


1032


, the System Interface


312


processes the command. If the ID did not match an internal command address, the System Interface


312


relays the command to the appropriate microcontroller (state


1034


) by sending the message bytes out over the microcontroller bus


310


.





FIGS. 11A and 11B

are flowcharts showing an embodiment of the functions performed by the Chassis controller


318


. Starting in the state


1100


, the Chassis controller


318


initializes its variables and stack pointer.




Proceeding to state


1102


, the Chassis controller


318


reads the serial numbers of the microcontrollers contained on the system board


302


and the backplane


304


. The Chassis controller


318


also reads the serial numbers for the Canister controllers


324


,


326


,


328


and


330


. The Chassis controller


318


stores all of these serial numbers in the NVRAM


322


.




Next, the Chassis controller


318


starts its main loop in which it performs various diagnostics (state


1104


). The Chassis controller


318


checks to see if the microcontroller bus


310


has timed-out (state


1106


). If the bus has timed-out, the Chassis controller


318


resets the microcontroller bus


310


(state


1108


). If the microcontroller bus


310


has not timed out the Chassis controller proceeds to a decision state


1110


in which the Chassis controller


318


checks to see if a user has pressed a power switch.




If the Chassis controller


318


determines a user has pressed a power switch, the Chassis controller changes the state of the power to either on or off (state


1112


). Additionally, the Chassis controller logs the new power state into the NVRAM


322


.




The Chassis controller


318


proceeds to handle any power requests from the Remote Interface


332


(state


1114


). As shown in

FIG. 9

, a power request message to this microcontroller is received when the arriving message interrupts the microcontroller. The message is processed and a bit is set indicating that a request has been made to toggle power. At state


1114


, the Chassis controller


318


checks this bit. If the bit is set, the Chassis controller


318


toggles the system power, i.e., off-to-on or on-to-off, and logs a message into the NVRAM


322


that the system power has changed state (state


1116


).
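The split between the interrupt that records a power request and the main-loop step that acts on it (states 1114 and 1116) can be sketched as follows. The class and attribute names are hypothetical, a Python list stands in for the NVRAM 322, and clearing the bit after it is serviced is an assumption not stated in the text.

```python
class ChassisPower:
    """Illustrative model of the power-request bit handled in states 1114/1116."""

    def __init__(self):
        self.power_on = False
        self.toggle_requested = False
        self.log = []  # stand-in for messages logged to the NVRAM

    def on_power_request(self):
        """Interrupt-time handler: only records that a toggle was requested."""
        self.toggle_requested = True

    def service_power_request(self):
        """Main-loop step: toggle power and log if the request bit is set."""
        if not self.toggle_requested:
            return False
        self.toggle_requested = False  # assumption: the bit is cleared once serviced
        self.power_on = not self.power_on
        self.log.append("power on" if self.power_on else "power off")
        return True
```

Deferring the toggle to the main loop keeps the interrupt handler short, which matches the description of a bit being set and checked later.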




Proceeding to state


1118


, the Chassis controller


318


checks the operating system watch dog counter for a time out. If the Chassis controller


318


finds that the operating system has failed to update the timer, the Chassis controller


318


proceeds to log a message in the NVRAM


322


(state


1120


). Additionally, the Chassis controller


318


sends an event to the System Interface


312


and the Remote Interface


332


.
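The watchdog check of states 1118 and 1120 can be sketched as a comparison between the current time and the last time the operating system updated the timer. This is a sketch under stated assumptions: the function name, the timestamp representation, and the event name are hypothetical, and the specification does not say how the counter is actually implemented.

```python
def check_os_watchdog(last_update, now, timeout, log, send_event):
    """Return True and report if the OS has not updated the watchdog in time."""
    if now - last_update < timeout:
        return False                       # OS is still updating the timer
    log.append("OS watchdog timeout")      # state 1120: log to the NVRAM stand-in
    for destination in ("System Interface", "Remote Interface"):
        send_event(destination, "os_watchdog_timeout")
    return True
```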




Since it takes some time for the power supplies to settle and produce stable DC power, the Chassis controller delays before proceeding to check DC (state


1122


).




The Chassis controller


318


then checks for changes in the canisters


258


-


264


(state


1124


), such as a canister being inserted or removed. If a change is detected, the Chassis controller


318


logs a message to the NVRAM


322


(state


1126


). Additionally, the Chassis controller


318


sends an event to the System Interface


312


and the Remote Interface


332


.




The Chassis controller


318


proceeds to check the power supply for a change in status (state


1128


). The process by which the Chassis controller


318


checks the power supply is described in further detail in the discussion for FIG.


12


.




The Chassis controller then checks the temperature of the system (state


1132


). The process by which the Chassis controller


318


checks the temperature is described in further detail in the discussion for FIG.


13


.




At state


1136


, the Chassis controller


318


reads all of the voltage level signals. The Chassis controller


318


saves these voltage level values in an internal register for reference by other microcontrollers.




Next, the Chassis controller


318


checks the power supply signals for AC/DC changes (state


1138


). If the Chassis controller


318


detects a change in these signals, the Chassis controller


318


logs a message to the NVRAM


322


(state


1140


). Additionally, the Chassis controller


318


sends an event to the System Interface


312


and the Remote Interface


332


that an AC/DC signal has changed. The Chassis controller


318


then returns to state


1104


to repeat the monitoring process.





FIG. 12

is a flowchart showing one process by which the Chassis controller


318


checks the state of the redundant power supplies termed number


1


and


2


. These power supplies are monitored and controlled by the chassis controller


318


through the signal lines shown in FIG.


5


A. When a power supply fails or requires maintenance, the other supply maintains power to the computer


100


. To determine whether a power supply is operating properly, its status (inserted, or removed by maintenance personnel) should be ascertained. Furthermore, a change in status should be recorded in the NVRAM


322


.

FIG. 12

describes in greater detail the state


1128


shown in FIG.


11


B.




Starting in state


1202


, the Chassis controller


318


checks the power supply bit. If the power supply bit indicates that a power supply should be present, the Chassis controller checks whether power supply “number


1


” has been removed (state


1204


). If power supply number


1


has been removed, the chassis microcontroller


318


checks whether its internal state indicates power supply number one should be present. If the internal state was determined to be present, then the slot is checked to see whether power supply number


1


is still physically present (state


1204


). If power supply number


1


has been removed, the PS_PRESENT#1 bit is changed to not present (state


1208


). The Chassis controller


318


then logs a message in the NVRAM


322


.




Referring to state


1206


, if the PS_PRESENT#1 bit indicates that power supply number


1


is not present, the Chassis controller


318


checks whether power supply number


1


has been inserted (i.e., checks to see if it is now physically present) (state


1206


). If it has been inserted, the Chassis controller


318


then logs a message into the NVRAM


322


that the power supply number


1


has been inserted (state


1210


) and changes the value of PS_PRESENT#1 to present.




After completion, states


1204


,


1206


,


1208


, and


1210


proceed to state


1212


to monitor power supply number


2


. The Chassis controller


318


checks whether the PS_PRESENT#2 bit is set to present. If the PS_PRESENT#2 bit indicates that power supply “number


2


” should be there, the Chassis controller


318


proceeds to state


1224


. Otherwise, the Chassis controller


318


proceeds to state


1226


. At state


1224


, the Chassis controller


318


checks if power supply number


2


is still present. If power supply number


2


has been removed, the Chassis controller


318


logs in the NVRAM


322


that power supply number


2


has been removed (state


1228


). The Chassis controller also changes the value of the PS_PRESENT#2 bit to not present.




Referring to decision state


1226


, if the PS_PRESENT#2 bit indicates that no power supply number


2


is present, the Chassis controller


318


checks if power supply number


2


has been inserted. If so, the Chassis controller


318


then logs a message into the NVRAM


322


that power supply number


2


has been inserted and changes the value of PS_PRESENT#2 to present (state


1230


). After completion of states


1224


,


1226


,


1228


, and


1230


, the chassis controller


318


proceeds to state


1232


to monitor the AC/DC power supply changed signal.
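The presence tracking described for both power supplies follows the same pattern: compare the recorded PS_PRESENT bit with the physical state of the slot, and log only on a transition. A minimal sketch, assuming a hypothetical function name and a Python list in place of the NVRAM 322:

```python
def update_supply_presence(recorded_present, physically_present, supply_no, log):
    """Return the new PS_PRESENT value, logging insert and remove transitions."""
    if recorded_present and not physically_present:
        log.append(f"power supply {supply_no} removed")
        return False                      # bit changed to not present
    if not recorded_present and physically_present:
        log.append(f"power supply {supply_no} inserted")
        return True                       # bit changed to present
    return recorded_present               # no change, nothing logged
```

Calling the function once per supply on each pass of the main loop reproduces the sequence of checks for supply number 1 followed by supply number 2.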




If in decision state


1234


the Chassis controller


318


finds that the AC/DC power supply changed signal from the power supplies is asserted, the change in status is recorded in state


1236


. The Chassis controller


318


continues the monitoring process by proceeding to the state


1132


in FIG.


11


B.





FIG. 13

is a flowchart showing one process by which the Chassis controller


318


monitors the temperature of the system. As shown in

FIG. 5A

, the Chassis controller


318


receives temperature detector signal lines from five temperature detectors located on the backplane and the motherboard. If either component indicates it is overheating, preventative action may be taken manually, by a technician, or automatically by the network of microcontrollers


225


.

FIG. 13

describes in greater detail the state


1132


shown in FIG.


11


B.




To read the temperature of the Chassis, the Chassis controller


318


reads the temperature detectors


502


,


504


, and


506


(state


1300


). In the embodiment of the invention shown in

FIG. 13

there are five temperature detectors (two temperature detectors not shown). Another embodiment includes three temperature detectors as shown.




The Chassis controller


318


checks the temperature detector


502


to see if the temperature is less than −25° C. or if the temperature is greater than or equal to 55° C. (state


1308


). Temperatures between these two limits are considered normal operating temperatures. Of course, other embodiments may use other temperature ranges. If the temperature is inside normal operating boundaries, the Chassis controller


318


proceeds to state


1310


. If the temperature is outside normal operating boundaries, the Chassis controller


318


proceeds to state


1312


. At state


1312


, the Chassis controller


318


evaluates the temperature a second time to check if the temperature is greater than or equal to 70° C. or less than or equal to −25° C. If the temperature falls outside these threshold values, the Chassis controller proceeds to state


1316


. Temperatures in this range are considered so far outside normal operating temperatures that the computer


100


should be shut down. Of course, other temperature ranges may be used in other embodiments.




Referring to state


1316


, if the temperature level reading is critical, the Chassis controller


318


logs a message in the NVRAM


322


that the system was shut down due to excessive temperature. The Chassis controller


318


then proceeds to turn off power to the system in state


1320


, but may continue to operate from a bias or power supply.




Otherwise, if the temperature is outside normal operating temperatures, but only slightly deviant, the Chassis controller


318


sets a bit in the temperature warning status register (state


1314


). Additionally, the Chassis controller


318


logs a message in the NVRAM


322


that the temperature is reaching dangerous levels (state


1318


).
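The two-stage temperature test can be sketched as a classification function plus a pass over all detectors. The threshold constants below use the values quoted in the text; the function names, the return values, and the list used as a log are hypothetical.

```python
NORMAL_LOW_C = -25     # below this, the reading is outside the normal range
NORMAL_HIGH_C = 55     # at or above this, the reading is outside the normal range
CRITICAL_LOW_C = -25   # at or below this, the reading is critical
CRITICAL_HIGH_C = 70   # at or above this, the reading is critical


def classify_temperature(temp_c):
    """Classify one reading as 'normal', 'warning', or 'critical'."""
    if not (temp_c < NORMAL_LOW_C or temp_c >= NORMAL_HIGH_C):
        return "normal"                            # state 1308 passes
    if temp_c >= CRITICAL_HIGH_C or temp_c <= CRITICAL_LOW_C:
        return "critical"                          # state 1312 leads to 1316
    return "warning"                               # state 1312 leads to 1314


def scan_detectors(readings_c, log):
    """Return (warning_bits, shutdown) after one pass over all detectors."""
    warning_bits = 0
    for n, temp_c in enumerate(readings_c):
        level = classify_temperature(temp_c)
        if level == "critical":
            log.append(f"detector {n}: shutdown due to excessive temperature")
            return warning_bits, True
        if level == "warning":
            warning_bits |= 1 << n                 # one warning bit per detector
            log.append(f"detector {n}: temperature reaching dangerous levels")
    return warning_bits, False
```

With the thresholds exactly as quoted, any reading below −25° C. satisfies both tests and is therefore classified as critical; only the high side has a distinct warning band (55° C. up to, but not including, 70° C.).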




The Chassis controller


318


follows the aforementioned process for each temperature detector on the system. Referring back to state


1310


, which was entered after determining a normal temperature from one of the temperature detectors, the Chassis controller


318


checks a looping variable “N” to see if all the sensors were read. If all sensors were not read, the Chassis controller


318


returns to state


1300


to read another temperature detector. Otherwise, if all temperature detectors were read, the Chassis controller


318


proceeds to state


1322


. At state


1322


, the Chassis controller


318


checks a warning status register (not shown). If no bit is set in the temperature warning status register, the Chassis controller


318


returns to the state


1136


in FIG.


11


B. If the Chassis controller


318


determines that a bit in the warning status register was set for one of the sensors, the Chassis controller


318


proceeds to recheck all of the sensors (state


1324


). If the temperatures of the sensors are still at a dangerous level, the Chassis controller


318


maintains the warning bits in the warning status register. The Chassis controller


318


then proceeds to the state


1136


(FIG.


11


B). At state


1324


, if the temperatures of the sensors are now at normal operating values, the Chassis controller


318


proceeds to clear all of the bits in the warning status register (state


1326


). After clearing the register, the Chassis controller


318


proceeds to state


1328


to log a message in the NVRAM


322


that the temperature has returned to normal operational values, and the Chassis controller


318


proceeds to the state


1136


(FIG.


11


B).
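The handling of the warning status register in states 1322 through 1328 can be sketched as follows. The function name and the `is_normal` callback are hypothetical; the register is modeled as an integer bit mask.

```python
def recheck_warnings(warning_bits, readings_c, is_normal, log):
    """Return the updated warning register after rechecking all sensors."""
    if warning_bits == 0:
        return 0                                   # state 1322: nothing to recheck
    if all(is_normal(t) for t in readings_c):      # state 1324: recheck every sensor
        log.append("temperature returned to normal")   # state 1328
        return 0                                   # state 1326: clear all bits
    return warning_bits                            # still dangerous: keep the bits
```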





FIGS. 14A and 14B

are flowcharts showing the functions performed by one embodiment of the CPU A controller


314


. The CPU A controller


314


is located on the system board


302


and conducts diagnostic checks for: a microcontroller bus timeout, a manual system board reset, a low system fan speed, a software reset command, general faults, a request to write to flash memory, the system flag status, and a system fault.




The CPU A controller


314


, starting in state


1400


, initializes its variables and stack pointer. Next, in state


1402


the CPU A controller


314


starts its main loop in which it performs various diagnostics which are described below. At state


1404


, the CPU A controller


314


checks the microcontroller bus


310


for a time out. If the microcontroller bus


310


has timed out, the CPU A controller


314


resets the microcontroller bus


310


(state


1406


). From either state


1404


or


1406


, the CPU A controller


314


proceeds to check whether the manual reset switch (not shown) is pressed on the system board


302


(decision state


1408


). If the CPU A controller


314


determines that the manual reset switch is pressed, the CPU A controller resets the system board by asserting a reset signal (state


1410


).




From either state


1408


or


1410


, the CPU A controller


314


proceeds to check the fan speed (decision state


1412


). If the speed of any of a number of fans is low (see FIG.


15


and discussion below), the CPU A controller


314


logs a message to NVRAM


322


(state


1414


). Additionally, the CPU A controller


314


sends an event to the Remote Interface


332


and the System Interface


312


. The CPU A controller


314


next proceeds to check whether a software reset command was issued by either the computer


100


or the remote computer


132


(state


1416


). If such a command was sent, the CPU A controller


314


logs a message in NVRAM


322


that system software requested the reset command (state


1418


). Additionally, the CPU A controller


314


resets the system bus


202


.




From either state


1416


or


1418


, the CPU A controller


314


checks the flag bits (not shown) to determine if a user-defined system fault occurred (state


1420


). If the CPU A controller


314


determines that a user-defined system fault occurred, the CPU A controller


314


proceeds to display the fault on an LCD display


512


(

FIG. 5B

) (state


1422


).




From either state


1420


or


1422


the CPU A controller


314


proceeds to a state


1424


to check the flash enable bit maintained in memory on the CPU B controller


316


. If the flash enable bit is set, the CPU A controller


314


displays a code for flash enabled on the LCD display


512


. The purpose of the flash enable bit is further described in the description for the CPU B controller


316


(FIG.


16


).




From either state


1424


or


1426


(if the flash bit was not enabled), the CPU A controller


314


proceeds to state


1428


and checks for system faults. If the CPU A controller


314


determines that a fault occurred, the CPU A controller


314


displays the fault on the LCD display


512


(state


1430


). From state


1428


if no fault occurred, or from state


1430


, the CPU A controller


314


proceeds to check the system status flag located in the CPU A controller's memory (decision state


1432


). If the status flag indicates an error, the CPU A controller


314


proceeds to state


1434


and displays error information on the LCD display


512


.




From either state


1432


or


1434


, the CPU A controller proceeds to state


1402


to repeat the monitoring process.





FIG. 15

is a flowchart showing one process by which the CPU A controller


314


monitors the fan speed.

FIG. 15

is a more detailed description of the function of state


1412


in

FIG. 14A

. Starting in state


1502


, the CPU A controller


314


reads the speed of each of the fans


1506


,


1508


, and


1510


. The fan speed is processed by a Fan Tachometer Signal Mux


508


(also shown in

FIG. 5B

) which updates the CPU A controller


314


. The CPU A controller


314


then checks to see if a fan speed is above a specified threshold (state


1512


). If the fan speed is above the threshold, the CPU A controller


314


proceeds to state


1514


. Otherwise, if the fan is operating below the specified low speed limit, the CPU A controller


314


proceeds to state


1522


.




On the other hand, when the fan is operating above the low speed limit at state


1514


, the CPU A controller


314


checks the hot_swap_fan register (not shown) to determine whether the particular fan was hot swapped. If the fan was hot swapped, the CPU A controller


314


proceeds to clear the fan's bit in both the fan_fault register (not shown) and the hot_swap_fan register (state


1516


). After clearing these bits, the CPU A controller


314


checks the fan fault register (state


1518


). If the fan fault register is all clear, the CPU A controller


314


proceeds to set the fan to low speed (state


1520


) and logs a message to the NVRAM


322


. The CPU A controller


314


then proceeds to state


1536


to check for a temperature warning.




Now, referring back to state


1522


, if a fan speed is below a specified threshold limit, the CPU A controller


314


checks to see if the fan's speed is zero. If the fan's speed is zero, the CPU A controller


314


sets the bit in the hot_swap_fan register in state


1524


to indicate that the fan has a fault and should be replaced. If the fan's speed is not zero, the CPU A controller


314


will proceed to set a bit in the fan_fault register (state


1526


). Moving to state


1528


, the speed of any fans still operating is increased to high, and a message is written to the NVRAM


322


.
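The per-fan bookkeeping of states 1512 through 1528 can be sketched with two bit-mask registers. This sketch follows the text literally and rests on several assumptions: the function name and the dictionary layout are hypothetical, a speed exactly at the limit is treated as low, and the fan fault register is examined only after a hot-swap clear, which is one possible reading of the flowchart description.

```python
def check_fan(index, rpm, low_limit, regs, log):
    """Update fan_fault / hot_swap_fan bits and the commanded speed for one fan."""
    bit = 1 << index
    if rpm > low_limit:                          # state 1512: fan is fast enough
        if regs["hot_swap_fan"] & bit:           # state 1514: fan was hot swapped
            regs["hot_swap_fan"] &= ~bit         # state 1516: clear both bits
            regs["fan_fault"] &= ~bit
            if regs["fan_fault"] == 0:           # state 1518: no remaining faults
                regs["speed"] = "low"            # state 1520
                log.append("fans set to low speed")
        return
    if rpm == 0:                                 # state 1522: fan has stopped
        regs["hot_swap_fan"] |= bit              # state 1524: fan should be replaced
    else:
        regs["fan_fault"] |= bit                 # state 1526: fan is slow
    regs["speed"] = "high"                       # state 1528: speed up remaining fans
    log.append("fan speed low, fans set to high")
```

Keeping one bit per fan in each register lets the controller decide with a single comparison against zero whether any fault remains before returning the fans to low speed.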




In one alternative embodiment, the system self-manages temperature as follows: from either state


1520


or


1528


, the CPU A controller


314


moves to state


1536


and checks whether a message was received from the Chassis controller


318


indicating a temperature warning. If a temperature warning is indicated, and if there are no fan faults involving fans in the cooling group associated with the warning, the speed of fans in that cooling group is increased to provide more cooling capacity (state


1538


).




Proceeding to state


1530


from either state


1536


or


1538


, the CPU A controller


314


increments a fan counter stored inside of microcontroller memory. If at state


1531


, there are more fans to check, the CPU A controller


314


returns to state


1502


to monitor the speed of the other fans. Otherwise, the CPU A controller


314


returns to state


1416


(FIG.


14


).





FIG. 16

is one embodiment of a flow diagram showing the functions performed by the CPU B controller


316


. The CPU B controller


316


scans for system faults, scans the microcontroller bus


310


, and provides flash enable. The CPU B controller


316


, starting at state


1600


, initializes its variables and stack pointer.




After initializing its internal state, the CPU B controller


316


enters a diagnostic loop at state


1602


. The CPU B controller


316


then checks the microcontroller bus


310


for a time out (decision state


1604


). If the microcontroller bus


310


has timed out, the CPU B controller


316


resets the microcontroller bus


310


in state


1606


. If the microcontroller bus


310


has not timed out (state


1604


) or after state


1606


, the CPU B controller


316


proceeds to check the system fault register (not shown) (decision state


1608


).




If the CPU B controller


316


finds a system fault, the CPU B controller


316


proceeds to log a message into the NVRAM


322


stating that a system fault occurred (state


1610


). The CPU B controller


316


then sends an event to the System Interface


312


and the Remote Interface


332


. Additionally, the CPU B controller


316


turns on one of a number of LED indicators


518


(FIG.


5


B).




If no system fault occurred, or from state


1610


, the CPU B controller


316


scans the microcontroller bus


310


(decision state


1612


). If the microcontroller bus


310


is hung then the CPU B controller


316


proceeds to flash the LCD display


512


to indicate that the microcontroller bus


310


is hung (state


1614


). Otherwise, if the bus is not hung, the CPU B controller


316


then proceeds to state


1624


.




The CPU B controller


316


proceeds to check for a bus stop bit time out (decision state


1624


). If the stop bit has timed out, the CPU B controller


316


generates a stop bit on the microcontroller bus for error recovery in case the stop bit is inadvertently being held low by another microcontroller (state


1626


).




From either state


1624


or


1626


, the CPU B controller


316


proceeds to check the flash enable bit to determine if the flash enable bit (not shown) is set (state


1628


). If the CPU B controller


316


determines that the flash enable bit is set (by previously having received a message requesting it), the CPU B controller


316


proceeds to log a message to the NVRAM


322


(state


1630


). A flash update is performed by the BIOS if the system boot disk includes code to update a flash memory (not shown). The BIOS writes new code into the flash memory only if the flash memory is enabled for writing. A software application running on the CPUs


200


can send messages requesting that BIOS flash be enabled. At state


1630


, the 12 Volts needed to write the flash memory is turned on or left turned on. If the flash enable bit is not on, control passes to state


1629


, where the 12 Volts is turned off, disabling writing of the flash memory.
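The flash enable decision of states 1628 through 1630 reduces to gating the 12 Volt programming supply on a single bit. A minimal sketch, with a hypothetical function name and a list standing in for the NVRAM 322:

```python
def flash_supply_state(flash_enable_bit, log):
    """Return True if the 12 V flash programming supply should be on."""
    if flash_enable_bit:
        log.append("flash enable requested")   # state 1630: logged, 12 V on or left on
        return True
    return False                               # state 1629: 12 V off, writes disabled
```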




From either state


1629


or


1630


, the CPU B controller


316


proceeds to repeat the aforementioned process of monitoring for system faults (state


1602


).





FIG. 17

is one embodiment of a flowchart showing the functions performed by the Canister controllers


324


,


326


,


328


and


330


shown in

FIGS. 4 and 5

. The Canister controllers


324


,


326


,


328


and


330


examine canister fan speeds, control power to the canister, and determine which canister slots contain cards. The Canister controllers


324


-


330


, starting in state


1700


, initialize their variables and stack pointers.




Next, in state


1702


the Canister controllers


324


-


330


start their main loop in which they perform various diagnostics, which are further described below. The Canister controllers


324


-


330


check the microcontroller bus


310


for a time out (state


1704


). If the microcontroller bus


310


has timed out, the Canister controllers


324


-


330


reset the microcontroller bus


310


in state


1706


. After the Canister controllers


324


-


330


reset the microcontroller bus


310


, or if the microcontroller bus


310


has not timed out, the Canister controllers


324


-


330


proceed to examine the speed of the fans (decision state


1708


). As determined by tachometer signal lines connected through a fan multiplexer


508


(FIG.


5


), if either of two canister fans is below the lower threshold, the event is logged, an event is sent to the System Interface


312


and, in a self-management embodiment, the fan speed is set to high. The Canister controllers


324


-


330


check the fan speeds again, and if they are still low, the Canister controllers


324


-


330


signal a fan fault and register an error message in the NVRAM


322


(state


1710


).
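The check-boost-recheck sequence for the canister fans (states 1708 and 1710) can be sketched as follows. The function name, the callbacks, the event name, and the return values are hypothetical; `read_rpms` is assumed to return the current speeds of the canister fans each time it is called.

```python
def check_canister_fans(read_rpms, low_limit, set_fans_high, log, send_event):
    """Return 'ok', 'recovered', or 'fault' after checking the canister fans."""
    if all(rpm >= low_limit for rpm in read_rpms()):
        return "ok"
    log.append("canister fan speed low")               # log the event
    send_event("System Interface", "canister_fan_low")
    set_fans_high()                                    # self-management embodiment
    if all(rpm >= low_limit for rpm in read_rpms()):   # second check after the boost
        return "recovered"
    log.append("canister fan fault")                   # state 1710
    return "fault"
```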




If the Canister controller received a request message to turn on or off canister power, a bit would have been previously set. If the Canister controllers


324


-


330


find this bit set (state


1712


), they turn the power to the canister on, and light the canister's LED. If the bit is cleared, power to the canister is turned off, as is the LED (state


1714


).




Next, the Canister controllers


324


-


330


read a signal for each slot which indicates whether the slot contains an adapter (state


1716


). The Canister controllers


324


-


330


then return to the state


1702


, to repeat the aforementioned monitoring process.





FIG. 18

is one embodiment of a flowchart showing the functions performed by the System Recorder controller


320


. The System Recorder controller


320


maintains a system log in the NVRAM


322


. The System Recorder


320


starting in state


1800


initializes its variables and stack pointer.




Next, at state


1802


the System Recorder


320


starts its main loop in which the System Recorder


320


performs various functions, which are further described below. First, the System Recorder


320


checks the microcontroller bus


310


for a time out (state


1804


). If the microcontroller bus


310


has timed out, the System Recorder


320


resets the microcontroller bus


310


in state


1806


. After the System Recorder


320


resets the bus, or if the microcontroller bus


310


has not timed out, the System Recorder


320


checks to see if another microcontroller had requested the System Recorder


320


to reset the NVRAM


322


(state


1808


). If requested, the System Recorder


320


proceeds to reset all the memory in the NVRAM


322


to zero (decision state


1810


). After resetting the NVRAM


322


, or if no microcontroller had requested such a reset, the System Recorder


320


proceeds to get the real time clock every second from a timer chip


520


(

FIG. 5A

) (decision state


1812


).




From time to time, the System Recorder


320


will be interrupted by the receipt of messages. When these messages are for storing data in the NVRAM


322


, they are carried out as they are received and the messages are stored in the NVRAM


322


. Thus, there is no state in the flow of

FIG. 18

to explicitly store messages. The System Recorder then returns to the state


1802


to repeat the aforementioned monitoring process.




While the above detailed description has shown, described, and pointed out the fundamental novel features of the invention as applied to various embodiments, it will be understood that various omissions and substitutions and changes in the form and details of the system illustrated may be made by those skilled in the art, without departing from the intent of the invention.




Appendix A




Incorporation by Reference of Commonly Owned Applications




The following patent applications, commonly owned and filed Oct. 1, 1997, are hereby incorporated herein in their entirety by reference thereto:



















Title | Application No. | Attorney Docket No.
“System Architecture for Remote Access and Control of Environmental Management” | 08/942,160 | MNFRAME.002A1
“Method of Remote Access and Control of Environmental Management” | 08/942,215 | MNFRAME.002A2
“System for Independent Powering of Diagnostic Processes on a Computer System” | 08/942,410 | MNFRAME.002A3
“Method of Independent Powering of Diagnostic Processes on a Computer System” | 08/942,320 | MNFRAME.002A4
“Diagnostic and Managing Distributed Processor System” | 08/942,402 | MNFRAME.005A1
“System for Mapping Environmental Resources to Memory for Program Access” | 08/942,222 | MNFRAME.005A3
“Method for Mapping Environmental Resources to Memory for Program Access” | 08/942,214 | MNFRAME.005A4
“Hot Add of Devices Software Architecture” | 08/942,309 | MNFRAME.006A1
“Method for The Hot Add of Devices” | 08/942,306 | MNFRAME.006A2
“Hot Swap of Devices Software Architecture” | 08/942,311 | MNFRAME.006A3
“Method for The Hot Swap of Devices” | 08/942,457 | MNFRAME.006A4
“Method for the Hot Add of a Network Adapter on a System Including a Dynamically Loaded Adapter Driver” | 08/943,072 | MNFRAME.006A5
“Method for the Hot Add of a Mass Storage Adapter on a System Including a Statically Loaded Adapter Driver” | 08/942,069 | MNFRAME.006A6
“Method for the Hot Add of a Network Adapter on a System Including a Statically Loaded Adapter Driver” | 08/942,465 | MNFRAME.006A7
“Method for the Hot Add of a Mass Storage Adapter on a System Including a Dynamically Loaded Adapter Driver” | 08/962,963 | MNFRAME.006A8
“Method for the Hot Swap of a Network Adapter on a System Including a Dynamically Loaded Adapter Driver” | 08/943,078 | MNFRAME.006A9
“Method for the Hot Swap of a Mass Storage Adapter on a System Including a Statically Loaded Adapter Driver” | 08/942,336 | MNFRAME.006A10
“Method for the Hot Swap of a Network Adapter on a System Including a Statically Loaded Adapter Driver” | 08/942,459 | MNFRAME.006A11
“Method for the Hot Swap of a Mass Storage Adapter on a System Including a Dynamically Loaded Adapter Driver” | 08/942,458 | MNFRAME.006A12
“Method of Performing an Extensive Diagnostic Test in Conjunction with a BIOS Test Routine” | 08/942,463 | MNFRAME.008A
“Apparatus for Performing an Extensive Diagnostic Test in Conjunction with a BIOS Test Routine” | 08/942,163 | MNFRAME.009A
“Configuration Management Method for Hot Adding and Hot Replacing Devices” | 08/941,268 | MNFRAME.010A
“Configuration Management System for Hot Adding and Hot Replacing Devices” | 08/942,408 | MNFRAME.011A
“Apparatus for Interfacing Buses” | 08/942,382 | MNFRAME.012A
“Method for Interfacing Buses” | 08/942,413 | MNFRAME.013A
“Computer Fan Speed Control Device” | 08/942,447 | MNFRAME.016A
“Computer Fan Speed Control Method” | 08/942,216 | MNFRAME.017A
“System for Powering Up and Powering Down a Server” | 08/943,076 | MNFRAME.018A
“Method of Powering Up and Powering Down a Server” | 08/943,077 | MNFRAME.019A
“System for Resetting a Server” | 08/942,333 | MNFRAME.020A
“Method of Resetting a Server” | 08/942,405 | MNFRAME.021A
“System for Displaying Flight Recorder” | 08/942,070 | MNFRAME.022A
“Method of Displaying Flight Recorder” | 08/942,068 | MNFRAME.023A
“Synchronous Communication Interface” | 08/943,355 | MNFRAME.024A
“Synchronous Communication Emulation” | 08/942,004 | MNFRAME.025A
“Software System Facilitating the Replacement or Insertion of Devices in a Computer System” | 08/942,317 | MNFRAME.026A
“Method for Facilitating the Replacement or Insertion of Devices in a Computer System” | 08/942,316 | MNFRAME.027A
“System Management Graphical User Interface” | 08/943,357 | MNFRAME.028A
“Display of System Information” | 08/942,195 | MNFRAME.029A
“Data Management System Supporting Hot Plug Operations on a Computer” | 08/942,129 | MNFRAME.030A
“Data Management Method Supporting Hot Plug Operations on a Computer” | 08/942,124 | MNFRAME.031A
“Alert Configurator and Manager” | 08/942,005 | MNFRAME.032A
“Managing Computer System Alerts” | 08/943,356 | MNFRAME.033A
“Computer Fan Speed Control System” | 08/940,301 | MNFRAME.034A
“Computer Fan Speed Control System Method” | 08/941,267 | MNFRAME.035A
“Black Box Recorder for Information System Events” | 08/942,381 | MNFRAME.036A
“Method of Recording Information System Events” | 08/942,164 | MNFRAME.037A
“Method for Automatically Reporting a System Failure in a Server” | 08/942,168 | MNFRAME.040A
“System for Automatically Reporting a System Failure in a Server” | 08/942,384 | MNFRAME.041A
“Expansion of PCI Bus Loading Capacity” | 08/942,404 | MNFRAME.042A
“Method for Expanding PCI Bus Loading Capacity” | 08/942,223 | MNFRAME.043A
“System for Displaying System Status” | 08/942,347 | MNFRAME.044A
“Method of Displaying System Status” | 08/942,071 | MNFRAME.045A
“Fault Tolerant Computer System” | 08/942,194 | MNFRAME.046A
“Method for Hot Swapping of Network Components” | 08/943,044 | MNFRAME.047A
“A Method for Communicating a Software Generated Pulse Waveform Between Two Servers in a Network” | 08/942,221 | MNFRAME.048A
“A System for Communicating a Software Generated Pulse Waveform Between Two Servers in a Network” | 08/942,409 | MNFRAME.049A
“Method for Clustering Software Applications” | 08/942,318 | MNFRAME.050A
“System for Clustering Software Applications” | 08/942,411 | MNFRAME.051A
“Method for Automatically Configuring a Server after Hot Add of a Device” | 08/942,319 | MNFRAME.052A
“System for Automatically Configuring a Server after Hot Add of a Device” | 08/942,331 | MNFRAME.053A
“Method of Automatically Configuring and Formatting a Computer System and Installing Software” | 08/942,412 | MNFRAME.054A
“System for Automatically Configuring and Formatting a Computer System and Installing Software” | 08/941,955 | MNFRAME.055A
“Determining Slot Numbers in a Computer” | 08/942,462 | MNFRAME.056A
“System for Detecting Errors in a Network” | 08/942,169 | MNFRAME.058A
“Method of Detecting Errors in a Network” | 08/940,302 | MNFRAME.059A
“System for Detecting Network Errors” | 08/942,407 | MNFRAME.060A
“Method of Detecting Network Errors” | 08/942,573 | MNFRAME.061A

Claims
  • 1. A method of monitoring and diagnosing a computer connected to a microcontroller network, the method comprising: requesting conditions of the computer from the microcontroller network; sensing the conditions of the computer with the microcontroller network; receiving the sensed conditions in the microcontroller network; and communicating the sensed conditions from the microcontroller network to the source of the request wherein the controlling of said sensed conditions includes increasing the speed of a fan in the computer when the temperature of the computer is above a threshold temperature.
  • 2. The method of claim 1, wherein the source of the requesting conditions is the computer.
  • 3. The method of claim 1, additionally comprising providing a client computer connected to the computer wherein the source of the requesting conditions is the client computer.
  • 4. The method of claim 1, wherein the requesting conditions of the computer includes requesting the speed of a system fan.
  • 5. The method of claim 1, wherein the requesting conditions of the computer includes requesting the temperature of a sensor.
  • 6. The method of claim 1, wherein the requesting conditions of the computer includes requesting the status of a watchdog timer.
  • 7. The method of claim 1, wherein the requesting conditions of the computer includes requesting the state of a microcontroller bus in the microcontroller network.
  • 8. The method of claim 1, wherein the requesting conditions of the computer includes requesting the presence status of a canister containing a plurality of adapter slots.
  • 9. The method of claim 1, wherein the requesting conditions of the computer includes requesting the status of the system voltage.
  • 10. A method of monitoring system functions of a computer, the method comprising: interconnecting a plurality of microcontrollers via a microcontroller bus; controlling a plurality of environmental conditions of the computer with the interconnected microcontrollers wherein the controlling said plurality of environmental conditions includes increasing the speed of a fan in the computer when the temperature of the computer is above a threshold temperature; connecting at least one of interconnected microcontrollers to a system bus of the computer; receiving a message sent from the system bus to the interconnected microcontrollers, the message requesting a change in a selected one of the plurality of environmental conditions; and sending a message from the interconnected microcontrollers to the system bus, the message indicating a change in the selected one of the plurality of environmental conditions.
  • 11. The method of claim 10, wherein the requesting message requests the interconnected microcontrollers to check the presence of a power supply.
  • 12. The method of claim 10, wherein the requesting message requests the interconnected microcontrollers to write a flash memory in the computer with a new basic input/output system (BIOS) program.
  • 13. The method of claim 10, wherein the requesting message requests the interconnected plurality of microcontrollers to send a message to a system log.
  • 14. The method of claim 10, wherein the requesting message includes requesting notification of a system fault.
  • 15. The method of claim 10, wherein the requesting message requests the interconnected microcontrollers to disable power to a canister containing a plurality of adapter slots.
  • 16. The method of claim 10, wherein the requesting message requests the interconnected microcontrollers to enable power to a canister containing a plurality of adapter slots.
  • 17. The method of claim 10, wherein the requesting message requests the interconnected microcontrollers to reset a watchdog timer.
  • 18. The method of claim 10, wherein the microcontroller bus comprises an I2C bus.
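The request, sense, receive, and communicate steps of claim 1, together with the fan-speed limitation it shares with claim 10, can be illustrated with a minimal sketch. This is not the patented implementation: the class names, the 55 °C threshold, and the 4000 RPM boosted speed are all hypothetical values chosen for illustration, and the microcontroller network is modeled as a single in-process object rather than devices on a bus.

```python
from dataclasses import dataclass


@dataclass
class Computer:
    """Hypothetical stand-in for the monitored computer's sensors and fan."""
    temperature_c: float
    fan_rpm: int
    voltage_v: float


class MicrocontrollerNetwork:
    """Toy model of the microcontroller network (illustrative only)."""

    def __init__(self, computer, threshold_c=55.0, boosted_rpm=4000):
        self.computer = computer
        self.threshold_c = threshold_c  # assumed threshold temperature
        self.boosted_rpm = boosted_rpm  # assumed fan speed used when too hot

    def sense(self):
        """Read the current conditions of the computer."""
        c = self.computer
        return {
            "temperature_c": c.temperature_c,
            "fan_rpm": c.fan_rpm,
            "voltage_v": c.voltage_v,
        }

    def control(self, sensed):
        """Increase the fan speed when the temperature is above the threshold."""
        if sensed["temperature_c"] > self.threshold_c:
            self.computer.fan_rpm = max(self.computer.fan_rpm, self.boosted_rpm)

    def handle_request(self, names):
        """Sense, apply control, and return the requested sensed conditions."""
        sensed = self.sense()
        self.control(sensed)
        return {name: sensed[name] for name in names}


if __name__ == "__main__":
    server = Computer(temperature_c=60.0, fan_rpm=2000, voltage_v=5.0)
    network = MicrocontrollerNetwork(server)
    print(network.handle_request(["temperature_c", "fan_rpm"]))
    print(server.fan_rpm)
```

In this sketch the reply carries the values as they were sensed, and the fan adjustment takes effect afterward, so a later request would report the higher speed. The source of the request (the computer itself in claim 2, a client computer in claim 3) is simply whichever caller invokes `handle_request`.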
RELATED APPLICATIONS

This application is related to U.S. application Ser. No. 08/942,402, entitled “DIAGNOSTIC AND MANAGING DISTRIBUTED PROCESSOR SYSTEM”, U.S. application Ser. No. 08/942,222, entitled “SYSTEM FOR MAPPING ENVIRONMENTAL RESOURCES TO MEMORY FOR PROGRAM ACCESS”, and U.S. application Ser. No. 08/942,214, entitled “METHOD FOR MAPPING ENVIRONMENTAL RESOURCES TO MEMORY FOR PROGRAM ACCESS”, which are being filed concurrently herewith on Oct. 1, 1997.

US Referenced Citations (243)
Number Name Date Kind
4057847 Lowell et al. Nov 1977
4100597 Fleming et al. Jul 1978
4449182 Rubinson et al. May 1984
4672535 Katzman et al. Jun 1987
4695946 Andreasen et al. Sep 1987
4707803 Anthony, Jr. et al. Nov 1987
4769764 Levanon Sep 1988
4774502 Kimura Sep 1988
4821180 Gerety et al. Apr 1989
4835737 Herrig et al. May 1989
4894792 Mitchell et al. Jan 1990
4949245 Martin et al. Aug 1990
4999787 McNally et al. Mar 1991
5006961 Monico Apr 1991
5007431 Donehoo, III Apr 1991
5051720 Kittirutsunetorn Sep 1991
5073932 Yossifor et al. Dec 1991
5103391 Barrett Apr 1992
5118970 Olson et al. Jun 1992
5123017 Simpkins et al. Jun 1992
5136715 Hirose et al. Aug 1992
5157663 Major et al. Oct 1992
5210855 Bartol May 1993
5222897 Collins et al. Jun 1993
5247683 Holmes et al. Sep 1993
5253348 Scalise Oct 1993
5261094 Everson et al. Nov 1993
5265098 Mattson et al. Nov 1993
5266838 Gerner Nov 1993
5269011 Yanai et al. Dec 1993
5272382 Heald et al. Dec 1993
5272584 Austruy et al. Dec 1993
5276814 Bourke et al. Jan 1994
5277615 Hastings et al. Jan 1994
5280621 Barnes et al. Jan 1994
5283905 Saadeh et al. Feb 1994
5307354 Cramer et al. Apr 1994
5311397 Harshberger et al. May 1994
5311451 Barrett May 1994
5317693 Cuenod et al. May 1994
5329625 Kannan et al. Jul 1994
5337413 Lui et al. Aug 1994
5351276 Doll, Jr. et al. Sep 1994
5367670 Ward et al. Nov 1994
5379184 Barraza et al. Jan 1995
5379409 Ishikawa Jan 1995
5386567 Lien et al. Jan 1995
5402431 Saadeh et al. Mar 1995
5404494 Garney Apr 1995
5423025 Goldman et al. Jun 1995
5430717 Fowler et al. Jul 1995
5430845 Rimmer et al. Jul 1995
5432946 Allard et al. Jul 1995
5440748 Sekine et al. Aug 1995
5448723 Rowett Sep 1995
5455933 Schieve et al. Oct 1995
5460441 Hastings et al. Oct 1995
5471617 Farrand et al. Nov 1995
5471634 Giorgio et al. Nov 1995
5473499 Weir Dec 1995
5483419 Kaczeus, Sr. et al. Jan 1996
5485607 Lomet et al. Jan 1996
5487148 Komori et al. Jan 1996
5491791 Glowny et al. Feb 1996
5493574 McKinley Feb 1996
5493666 Fitch Feb 1996
5513339 Agrawal et al. Apr 1996
5515515 Kennedy et al. May 1996
5517646 Piccirillo et al. May 1996
5519851 Bender et al. May 1996
5526289 Dinh et al. Jun 1996
5528409 Cucci et al. Jun 1996
5530810 Bowman Jun 1996
5533193 Roscoe Jul 1996
5542055 Amini et al. Jul 1996
5546272 Moss et al. Aug 1996
5555510 Verseput et al. Sep 1996
5559764 Chen et al. Sep 1996
5559958 Farrand et al. Sep 1996
5559965 Oztaskin et al. Sep 1996
5560022 Dunstan et al. Sep 1996
5564024 Pemberton Oct 1996
5566299 Billings et al. Oct 1996
5566339 Perholtz et al. Oct 1996
5568610 Brown Oct 1996
5568619 Blackledge et al. Oct 1996
5572403 Mills Nov 1996
5577205 Hwang et al. Nov 1996
5579491 Jeffries et al. Nov 1996
5579528 Register Nov 1996
5581712 Herrman Dec 1996
5581714 Amini et al. Dec 1996
5586250 Carbonneau et al. Dec 1996
5588121 Reddin et al. Dec 1996
5588144 Inoue et al. Dec 1996
5592611 Midgely et al. Jan 1997
5598407 Bud et al. Jan 1997
5602758 Lincoln et al. Feb 1997
5604873 Fite et al. Feb 1997
5606672 Wade Feb 1997
5608876 Cohen et al. Mar 1997
5615207 Gephardt et al. Mar 1997
5621159 Brown et al. Apr 1997
5622221 Genga, Jr. et al. Apr 1997
5628028 Michelson May 1997
5632021 Jennings et al. May 1997
5636341 Matsushita et al. Jun 1997
5638289 Yamada et al. Jun 1997
5644470 Benedict et al. Jul 1997
5644731 Liencres et al. Jul 1997
5651006 Fujino et al. Jul 1997
5652832 Kane et al. Jul 1997
5652892 Ugajin Jul 1997
5655081 Bonnell et al. Aug 1997
5655148 Richman et al. Aug 1997
5659682 Devarakonda et al. Aug 1997
5664119 Jeffries et al. Sep 1997
5666538 DeNicola Sep 1997
5668943 Attanasio et al. Sep 1997
5671371 Kondo et al. Sep 1997
5675723 Ekrot et al. Oct 1997
5680288 Carey et al. Oct 1997
5684671 Hobbs et al. Nov 1997
5689637 Johnson et al. Nov 1997
5696895 Hemphill et al. Dec 1997
5696899 Kalwitz Dec 1997
5696949 Young Dec 1997
5696970 Sandage et al. Dec 1997
5701417 Lewis et al. Dec 1997
5704031 Mikami et al. Dec 1997
5708775 Nakamura Jan 1998
5708776 Kikinis Jan 1998
5712754 Sides et al. Jan 1998
5717570 Kikinis Feb 1998
5721935 DeSchepper et al. Feb 1998
5724529 Smith et al. Mar 1998
5726506 Wood Mar 1998
5737708 Grob et al. Apr 1998
5740378 Rehl et al. Apr 1998
5742514 Bonola Apr 1998
5742833 Dea et al. Apr 1998
5747889 Raynham et al. May 1998
5748426 Bedingfield et al. May 1998
5752164 Jones May 1998
5754396 Felcman et al. May 1998
5754449 Hoshal et al. May 1998
5754797 Takahashi May 1998
5758352 Reynolds et al. May 1998
5761033 Wilhelm Jun 1998
5761045 Olson et al. Jun 1998
5761085 Giorgio Jun 1998
5761462 Neal et al. Jun 1998
5761707 Aiken et al. Jun 1998
5764924 Hong Jun 1998
5764968 Ninomiya Jun 1998
5765008 Desai et al. Jun 1998
5765198 McCrocklin et al. Jun 1998
5767844 Stoye Jun 1998
5768541 Pan-Ratzlaff Jun 1998
5768542 Enstrom et al. Jun 1998
5771343 Hafner et al. Jun 1998
5774645 Beaujard et al. Jun 1998
5774741 Choi Jun 1998
5777897 Giorgio Jul 1998
5778197 Dunham Jul 1998
5781703 Desai et al. Jul 1998
5781716 Hemphill et al. Jul 1998
5781767 Inoue et al. Jul 1998
5781798 Beatty et al. Jul 1998
5784576 Guthrie et al. Jul 1998
5787459 Stallmo et al. Jul 1998
5790775 Marks et al. Aug 1998
5790831 Lin et al. Aug 1998
5793948 Asahi et al. Aug 1998
5793987 Quackenbush et al. Aug 1998
5794035 Golub et al. Aug 1998
5796185 Takata et al. Aug 1998
5796580 Komatsu et al. Aug 1998
5796981 Abudayyeh et al. Aug 1998
5798828 Thomas et al. Aug 1998
5799036 Staples Aug 1998
5799196 Flannery Aug 1998
5801921 Miller Sep 1998
5802269 Poisner et al. Sep 1998
5802305 McKaughan et al. Sep 1998
5802324 Wunderlich et al. Sep 1998
5802393 Begun et al. Sep 1998
5802552 Fandrich et al. Sep 1998
5803357 Lakin Sep 1998
5805804 Laursen et al. Sep 1998
5805834 McKinley et al. Sep 1998
5809224 Schultz et al. Sep 1998
5809256 Najemy Sep 1998
5809555 Hobson Sep 1998
5812748 Ohran et al. Sep 1998
5812750 Dev et al. Sep 1998
5812757 Okamoto et al. Sep 1998
5812858 Nookala et al. Sep 1998
5815117 Kolanek Sep 1998
5815652 Ote et al. Sep 1998
5821596 Miu et al. Oct 1998
5822547 Boesch et al. Oct 1998
5826043 Smith et al. Oct 1998
5835719 Gibson et al. Nov 1998
5835738 Blackledge, Jr. et al. Nov 1998
5838932 Alzien Nov 1998
5841964 Yamaguchi Nov 1998
5841991 Russell Nov 1998
5845061 Miyamoto et al. Dec 1998
5845095 Reed et al. Dec 1998
5850546 Kim Dec 1998
5852720 Gready et al. Dec 1998
5852724 Glenn, II et al. Dec 1998
5857074 Johnson Jan 1999
5864653 Tavallaei et al. Jan 1999
5864713 Terry Jan 1999
5867730 Leyda Feb 1999
5875308 Egan et al. Feb 1999
5875310 Buckland et al. Feb 1999
5878237 Olarig Mar 1999
5878238 Gan et al. Mar 1999
5881311 Woods Mar 1999
5884049 Atkinson Mar 1999
5886424 Kim Mar 1999
5892898 Fujii et al. Apr 1999
5898846 Kelly Apr 1999
5905867 Giorgio May 1999
5907672 Matze et al. May 1999
5909568 Nason Jun 1999
5911779 Stallmo et al. Jun 1999
5913034 Malcolm Jun 1999
5922060 Goodrum Jul 1999
5930358 Rao Jul 1999
5935262 Barrett et al. Aug 1999
5936960 Stewart Aug 1999
5938751 Tavallaei et al. Aug 1999
5941996 Smith et al. Aug 1999
5964855 Bass et al. Oct 1999
5983349 Kodama et al. Nov 1999
5987554 Liu et al. Nov 1999
5987627 Rawlings, III Nov 1999
6012130 Beyda et al. Jan 2000
6038624 Chan et al. Mar 2000
Foreign Referenced Citations (5)
Number Date Country
0 866 403 A1 Sep 1998 EP
04 333 118 Nov 1992 JP
05 233 110 Sep 1993 JP
07 093 064 Apr 1995 JP
07 261 874 Oct 1995 JP
Non-Patent Literature Citations (26)
Entry
NetFrame, “NetFrame Clustered Multiprocessing Software”, Doc. No. 78-100226-01, pp. 1-2, 5-8, 359-404, and 471-512, Apr. 1996.*
ftp.cdrom.com/pub/os2/diskutil/, PHDX software, phdx.zip download, Mar. 1995, “Parallel Hard Disk Xfer.”
Cmasters, Usenet post to microsoft.public.windowsnt.setup, Aug. 1997, “Re: FDISK switches.”
Hildebrand, N., Usenet post to comp.msdos.programmer, May 1995, “Re: Structure of disk partition into.”
Lewis, L., Usenet post to alt.msdos.batch, Apr. 1997, “Re: Need help with automating FDISK and Format.”
Netframe, ://www.netframe-support.com/technology/datasheets/data.htm, before Mar. 1997, “Netframe ClusterSystem 9008 Data Sheet.”
Simos, M., Usenet post to comp.os.msdos.misc, Apr. 1997, “Re: Auto FDISK and Format.”
Wood, M. H., Usenet post to comp.os.netware.misc, Aug. 1996, “Re: Workstation duplication method for WIN95.”
Lyons, Computer Reseller News, Issue 721, pp. 61-62, Feb. 3, 1997, “ACC Releases Low-Cost Solution for ISPs.”
M2 Communications, M2 Presswire, 2 pages, Dec. 19, 1996, “Novell IntranetWare Supports Hot Pluggable PCI from NetFrame.”
Rigney, PC Magazine, 14(17): 375-379, Oct. 10, 1995, “The One for the Road (Mobile-aware capabilities in Windows 95).”
Shanley, and Anderson, PCI System Architecture, Third Edition, p. 382, Copyright 1995.
Gorlick, M., Conf. Proceedings: ACM/ONR Workshop on Parallel and Distributed Debugging, pp. 175-181, 1991, “The Flight Recorder: An Architectural Aid for System Monitoring.”
IBM Technical Disclosure Bulletin, 92A=62947, pp. 391-394, Oct. 1992, Method for Card Hot Plug Detection and Control.
Shanley and Anderson, PCI System Architecture, Third Edition, Chapters 15 & 16, pp. 297-328, CR 1995.
PCI Hot-Plug Specification, Preliminary Revision for Review Only, Revision 0.9, pp. i-vi, and 1-25, Mar. 5, 1997.
SES SCSI-3 Enclosure Services, X3T10/Project 1212-D/Rev 8a, pp. i, iii-x,1-76, and I-1 (index), Jan. 16, 1997.
Compaq Computer Corporation, Technology Brief, pp. 1-13, Dec. 1996, “Where Do I Plug the Cable? Solving the Logical-Physical Slot Numbering Problem.”
NetFrame Systems Incorporated, News Release, 3 pages, referring to May 9, 1994, “NetFrame's New High-Availability ClusterServer Systems Avoid Scheduled as well as Unscheduled Downtime.”
NetFrame Systems Incorporated, datasheet, 2 pages, Feb. 1996, “NF450FT Network Mainframe.”
NetFrame Systems Incorporated, datasheet, 9 pages, Mar. 1996, “NetFrame Cluster Server 8000.”
Herr, et al., Linear Technology Magazine, Design Features, pp. 21-23, Jun. 1997, “Hot Swapping the PCI Bus.”
Mark Lockareff, “Lonworks—An Introduction”, HTINews, Dec., 1996, 2 pp.
M. J. Schofield, “Controller Area Network—How CAN Works”, mschofield@cix.compulink.co.uk, Sep. 23, 1997, 4 pp.
“CAN: Technical Overview”, NRTT, Ltd., Sep. 23, 1997, 15 pp.
Product Brochure of NetFrame, “NF450FT Network Mainframe”, Feb. 1992, 14 pp.
Provisional Applications (5)
Number Date Country
60/046397 May 1997 US
60/047016 May 1997 US
60/046416 May 1997 US
60/046398 May 1997 US
60/046312 May 1997 US