System for automatically reporting a system failure in a server

Abstract
A system for reporting a failure condition in a server system which includes: a controller which monitors the server system for system failures, and generates an event signal and failure information if a system failure is detected; a system interface, coupled to the controller, which receives the event signal; a central processing unit, coupled to the system interface, wherein, upon receiving the event signal, the system interface reports an occurrence of an event to the central processing unit; and a system log which stores the failure information.
Description




APPENDICES




Appendix A, which forms a part of this disclosure, is a list of commonly owned copending U.S. patent applications. Each one of the applications listed in Appendix A is hereby incorporated herein in its entirety by reference thereto.




COPYRIGHT RIGHTS




A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.




BACKGROUND OF THE INVENTION




1. Field of the Invention




The invention relates to the reporting of problems and/or failure conditions in electronic systems. More particularly, the invention relates to a system and method for automatically reporting failure conditions in a server system.




2. Description of the Related Technology




In the computer industry, the fast and efficient detection of system errors and/or failures, and the subsequent correction of such failures, is critical to providing quality performance and product reliability to the users and buyers of computer systems. Particularly with respect to server computers which are accessed and utilized by many end users, early detection and notification of system problems and failures is an extremely desirable performance characteristic, especially for users who depend on the server to obtain data and information in their daily business operations, for example.




Typically, after a server has failed, users trying to access that server do not know that a problem exists or what the nature of the problem is. If a user experiences undue delay in connecting to the server or accessing a database through the server, the user typically does not know whether there is something wrong with the server, something wrong with his or her connection line, or whether both problems exist. In this scenario, the user must wait for a system operator, at the site where the server is located, to detect the error or failure and correct it. Hours can elapse before the failure is corrected. Often, a system operator or administrator will not discover the failure until users experience problems and start complaining. In the meantime, an important event may be missed and time is wasted, leading to user dissatisfaction with the server system.




Therefore, what is needed is a method and system for early detection of system failures or problems and prompt notification to a system operator or control center of the failure condition so that remedial actions may be quickly taken. In addition, for servers which may be remotely located from a control center, for example, a method and system for notifying the control center at a remote location is needed.




SUMMARY OF THE INVENTION




The invention addresses the above and other needs by providing a method and system for detecting a system failure and automatically reporting the failure to a system operator who may be located at or near the site where the server is present, or remotely located from the server such that the system operator communicates with the server via a modem connection. As used herein, the terms “failure”, “system failure”, “system failure condition” and any combination or conjugation of these terms refers to any problem, error, fault, or out of tolerance operating condition or parameter which may be detected in a computer and/or server system. Additionally, these terms may refer to a change in a status or condition of the server system, or a component or subsystem thereof.




In one embodiment of the invention, a system for reporting a failure condition in a server system, includes: a controller which monitors the server system for system failures, and generates an event signal and failure information if a system failure is detected; a system interface, coupled to the controller, which receives the event signal; a central processing unit, coupled to the system interface, wherein, upon receiving the event signal, the system interface reports an occurrence of an event to the central processing unit; and a system log which receives failure information communicated from the system interface and stores said failure information.




In another embodiment, the system described above further includes a system recorder, coupled between the controller and the system log, for receiving the failure information from the controller, assigning a time value to the failure information, and subsequently storing the failure information with the time value into the system log.




In another embodiment, a failure reporting system for a server system, includes the following: a controller which monitors the server system for system failures and generates an event signal and failure information if a system failure is detected; a system recorder, coupled to the controller, which receives failure information and assigns a time value to the failure information; a system log which stores failure information received from the system recorder; and a system interface, coupled to the controller, which receives and stores the event signal, and reports an occurrence of an event to a central processing unit which is coupled to the system interface, wherein the central processing unit executes a software program which allows a system operator to access the system log to read failure information stored therein.




In a further embodiment, the system described above further includes a remote interface, coupled to the controller, which receives the event signal and reports the occurrence of an event to a computer external to the server system.




In yet another embodiment, a failure reporting system for a server system, includes: a controller which monitors the server system for system failures and generates an event signal and failure information if a system failure is detected; a system recorder, coupled to the controller, which receives the failure information and assigns a date and time to the failure information; a system log which stores the failure information; a system interface, coupled to the controller, which receives and stores the event signal and reports an occurrence of an event to a central processing unit, coupled to the system interface, wherein the central processing unit executes a software program which allows a system operator to access the system log to read failure information stored therein; a remote interface, coupled to the controller, which receives the event signal and reports the occurrence of an event to a computer external to the server system; and a switch, coupled to the remote interface, which switches connectivity to the remote interface between a first computer and a second computer, wherein the first computer is a local computer, coupled to the switch via a local communications line, and the second computer is a remote computer, coupled to the switch via a modem connection.




In a further embodiment, a failure reporting system in a server system, includes: means for detecting a system failure condition; means for transmitting failure information related to the failure condition to a system recorder; means for storing the failure information; and means for reporting an occurrence of an event to a central processing unit of the server system.




In another embodiment, the invention is a program storage device which stores instructions that when executed by a computer perform a method, wherein the method comprises: detecting a system failure condition; transmitting failure information related to the failure condition to a system recorder; storing the failure information in a system log; and reporting an occurrence of an event to a central processing unit of the server system.











BRIEF DESCRIPTION OF THE DRAWINGS





FIG. 1 is a block diagram of a server having a failure reporting system for detecting, recording and reporting a system failure in accordance with one embodiment of the invention.

FIG. 2 is a system block diagram of one embodiment of a system interface which is used to transfer data between the server's operating system and the server's failure reporting system, in accordance with the invention.

FIG. 3A is a table illustrating one embodiment of a data format for a read request signal communicated by the system interface and/or the remote interface of FIG. 1 in accordance with the invention.

FIG. 3B is a table illustrating one embodiment of a data format for a write request signal communicated by the system interface and/or the remote interface of FIG. 1 in accordance with the invention.

FIG. 3C is a table illustrating one embodiment of a data format for a read response signal communicated by the system interface and/or the remote interface of FIG. 1 in accordance with the invention.

FIG. 3D is a table illustrating one embodiment of a data format for a write response signal communicated by the system interface and/or the remote interface of FIG. 1 in accordance with the invention.

FIG. 4 is a system block diagram of one embodiment of the remote interface of FIG. 1.

FIGS. 5A-5C illustrate one embodiment of a data format for a request, a response, and an interrupt signal, respectively, which are received and transmitted by the remote interface of FIG. 1.

FIG. 6 is a system block diagram of one embodiment of the system recorder of FIG. 1.

FIGS. 7A-7D together form a flowchart diagram of one embodiment of a process of storing information in the system log and retrieving information from the system log.

FIGS. 8A-8D together form a flowchart illustrating one embodiment of a process for detecting and reporting system failures in accordance with the invention.











DETAILED DESCRIPTION OF THE INVENTION




The invention is described in detail below with reference to the figures, wherein like elements are referenced with like numerals throughout.




Referring to FIG. 1, a block diagram of one embodiment of a server system 100 is illustrated. The server system 100 includes a central processing unit (CPU) 101 which executes the operating system (OS) software, which controls the communications protocol of the server system 100. The CPU 101 is coupled to an Industry Standard Architecture bus (ISA bus) 103 which transfers data to and from the CPU 101. The ISA bus 103 and its functionality are well-known in the art. Coupled to the ISA bus 103 is a system interface 105 which receives event signals from one or more microcontrollers that monitor and control various subsystems and components of the server system 100. As described in further detail below, an event signal sent to the system interface 105 indicates that a system failure or error has occurred. The various microcontrollers which monitor the server system 100 are also described in further detail below. As used herein, the term “event” may refer to the occurrence of any type of system failure. The structure and functionality of the system interface 105 is described in greater detail below with respect to FIG. 2. Additionally, as used herein, the terms “signal,” “command” and “data,” and any conjugations and combinations thereof, are used synonymously and interchangeably and refer to any information or value that may be transmitted, received or communicated between two electronic entities.




Coupled to the system interface 105 is a system bus 107. In one embodiment, the system bus 107 is an Inter-IC control bus (I2C bus), which transfers data to and from the various controllers and subsystems mentioned above. The I2C bus and the addressing protocol in which data is transferred across the bus are well-known in the art. One embodiment of a messaging protocol used in this I2C bus architecture is discussed in further detail below with reference to FIGS. 3A-3D. The command, diagnostic, monitoring, and logging functions of the failure reporting system of the invention are accessed through the common I2C bus protocol. In one embodiment, the I2C bus protocol uses addresses typically stored in a first byte of a data stream, as the means of identifying the various devices and commands to those devices. Any function can be queried by generating a “read” request, which has its address as part of its protocol format. Conversely, a function can be executed by “writing” to an address specified in the protocol format. Any controller or processor connected to the bus can initiate read and write requests by sending a message on the I2C bus to the processor responsible for that function.
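
For illustration purposes only, the following C sketch shows how the first byte of such a request might be composed, with the 7-bit slave address carried in bits [7..1] and the least significant bit selecting the read or write direction, in the spirit of the Slave Addr and LSBit fields summarized later with reference to FIGS. 3A-3D. The function name and the example address are assumptions, not details taken from the actual implementation.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: compose the first byte of an I2C-style request.
 * The 7-bit slave address occupies bits [7..1]; the least significant
 * bit selects the direction (0 = master writes, 1 = master reads). */
static uint8_t address_byte(uint8_t slave_addr, int master_reads)
{
    return (uint8_t)((slave_addr << 1) | (master_reads ? 1u : 0u));
}

int main(void)
{
    uint8_t chassis_ctrl = 0x12;                     /* assumed 7-bit address   */
    uint8_t query  = address_byte(chassis_ctrl, 1);  /* "read" a function       */
    uint8_t invoke = address_byte(chassis_ctrl, 0);  /* "write" a function      */

    printf("read request header:  0x%02X\n", (unsigned)query);
    printf("write request header: 0x%02X\n", (unsigned)invoke);
    return 0;
}
```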




Coupled to the system bus 107 are a CPU A controller 109, a CPU B controller 111, a chassis controller 112 and four canister controllers 113. These controllers monitor and control various operating parameters and/or conditions of the subsystems and components of the server system 100. For example, the CPU A controller 109 may monitor the system fan speeds, the CPU B controller 111 may monitor the operating temperature of the CPU 101, the chassis controller 112 may monitor the presence of various circuit boards and components of the server system, and each of the canister controllers 113 may monitor the presence and other operating conditions of “canisters” connected to the server system 100. A “canister” is a detachable module which provides expandability to the number of peripheral component interface (PCI) devices that may be integrated into the server system 100. In one embodiment, each canister is capable of providing I/O slots for up to four PCI cards, each capable of controlling and arbitrating access to a PCI device, such as a CD-ROM disk drive, for example. A more detailed description of a canister can be found in a co-pending and commonly owned patent application entitled “Network Server With Network Interface, Data Storage and Power Modules That May Be Removed and Replaced Without Powering Down the Network,” which is listed in Appendix A attached hereto.




If one or more of the various controllers detects a failure, the respective controller sends an event signal to the system interface 105 which subsequently reports the occurrence of the event to the CPU 101. In one embodiment, the controllers 109, 111 and 113 are PIC16C65 microcontroller chips manufactured by Microchip Technologies, Inc. and the chassis controller 112 is a PIC16C74 microcontroller chip manufactured by Microchip Technologies, Inc.




Upon detecting a failure condition, a controller (109, 111, 112 or 113) not only transmits an event signal to the system interface 105, but also transmits failure information associated with the failure condition to a system recorder 115 connected to the system bus 107. The system recorder 115 then assigns a time stamp to the failure information and logs the failure by storing the failure information, along with its time stamp, into a system log 117. The operation and functionality of the system recorder 115 is described in further detail below with reference to FIG. 6. In one embodiment, the system log 117 is a non-volatile random access memory (NVRAM), which is well-known for its characteristics in maintaining the integrity of data stored within it, even when power to the memory cells is cut off for extended periods of time as a result of a system shut-down or power failure. The following are examples of various monitoring functions performed by some of the controllers described above. However, it is understood that the invention is not limited to these monitoring functions which serve only as examples.




In one embodiment, the controller 109 may be coupled to a system fan unit (not shown) and periodically monitor the speed of the fan. In one version, the fan unit transmits a pulse waveform to the controller 109, the frequency of which is proportional to the rate of rotation of the fan. The controller 109 checks the frequency of the pulse waveform on a periodic basis and determines whether the frequency is within a specified range of acceptable fan speeds. If a measured frequency is either too slow or too fast, the controller 109 detects a fan failure condition and sends an event signal to the system interface 105. The controller 109 also sends failure information to the system recorder 115 which assigns a time value to the failure information and stores the failure information with its time stamp into the system log 117. After the system interface 105 receives an event signal, it reports the occurrence of the event to the CPU 101.
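
For illustration purposes only, the fan-speed check described above can be thought of as a comparison of the measured pulse frequency against an acceptable range, as in the following C sketch. The threshold values and function names are assumptions and are not taken from the actual controller firmware.

```c
#include <stdio.h>

/* Hypothetical thresholds: acceptable fan-pulse frequency window in Hz. */
#define FAN_FREQ_MIN_HZ  50.0
#define FAN_FREQ_MAX_HZ 200.0

/* Returns 1 if the measured pulse frequency indicates a fan failure. */
static int fan_failure(double measured_hz)
{
    return measured_hz < FAN_FREQ_MIN_HZ || measured_hz > FAN_FREQ_MAX_HZ;
}

int main(void)
{
    double sample_hz = 32.0;   /* e.g., a fan that is spinning too slowly */

    if (fan_failure(sample_hz)) {
        /* In the system described above, this is where the controller would
         * send an event signal to the system interface and forward failure
         * information to the system recorder for time stamping. */
        printf("fan failure detected at %.1f Hz\n", sample_hz);
    }
    return 0;
}
```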




As another example, the controller 111 may monitor a system temperature parameter. For example, a temperature sensor (not shown) may be coupled to the CPU 101 for monitoring its operating temperature. In one embodiment, the temperature sensor generates a voltage which is proportional to a measured operating temperature of the CPU 101. This voltage may then be converted by well-known means into a digital data signal and subsequently transmitted to the controller 111. The controller 111 then determines whether the measured temperature falls within specified limits. If the measured temperature is either too low or too high, a temperature failure condition is detected and an event signal is transmitted to the system interface 105, which subsequently reports the event to the CPU 101, and an entry is written to the system log 117 by the system recorder 115.




In another embodiment, multiple temperature sensors (not shown) are coupled to a temperature bus (not shown). The temperature readings of all the sensors on the temperature bus are monitored every second and are read by Dallas Inc. temperature transducers (not shown) connected to the system bus 107. In one embodiment, the temperature transducers are model no. DS1621 digital thermometers, made by Dallas Semiconductor Corp. of Dallas, Tex. The temperature sensors are read in address order. The criteria for detecting a temperature fault are provided by two types of temperature limits: a shutdown limit, which is initialized to 70° C., and lower and upper warning limits, which are set at −25° C. and 55° C., respectively. Each sensor reading is compared to the shutdown limit. If any temperature exceeds this limit, the system is powered off. If it is lower than the shutdown limit, each sensor reading is then compared to the warning limits. If any temperature is below −25° C. or above 55° C., a warning condition is created, a temperature LED is set, a temperature event signal is sent to the system interface 105, and an entry is written to the system log 117 by the system recorder 115.
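
For illustration purposes only, the two-tier comparison described above (a 70° C. shutdown limit followed by the −25° C. and 55° C. warning limits) may be expressed as in the following C sketch; the sample readings are invented.

```c
#include <stdio.h>

enum temp_action { TEMP_OK, TEMP_WARNING, TEMP_SHUTDOWN };

/* Limits taken from the description above. */
#define SHUTDOWN_LIMIT_C   70
#define WARN_LOW_LIMIT_C  (-25)
#define WARN_HIGH_LIMIT_C  55

/* Each sensor reading is checked against the shutdown limit first,
 * then against the lower and upper warning limits. */
static enum temp_action classify(int reading_c)
{
    if (reading_c > SHUTDOWN_LIMIT_C)
        return TEMP_SHUTDOWN;                 /* power the system off          */
    if (reading_c < WARN_LOW_LIMIT_C || reading_c > WARN_HIGH_LIMIT_C)
        return TEMP_WARNING;                  /* LED, event signal, log entry  */
    return TEMP_OK;
}

int main(void)
{
    int readings[] = { 41, 58, 72 };          /* assumed sample readings */
    for (int i = 0; i < 3; i++)
        printf("sensor %d: %d C -> %d\n", i, readings[i], (int)classify(readings[i]));
    return 0;
}
```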




The chassis controller 112 can monitor the presence of power supplies, for example. In one embodiment, power supplies may be detected and identified by a signal line coupling each power supply to a one-wire serial bus (not shown) which is in turn connected to a serial number chip (not shown) for identifying the serial number of each power supply. In one embodiment, the serial number chip is a DS2502 1 Kbit Add-only memory, manufactured by Dallas Semiconductor Corp. In order to detect the presence of a power supply, a trigger pulse may be sent by the chassis controller 112 to detect a power supply presence pulse. If there is a change in the presence of a power supply, a presence bit is updated and a power supply event is sent to the system interface 105. The power supply data is then written to the system log 117. If a power supply is removed from the system, no further action takes place. The length of the serial number string for that power supply address is set to zero. However, if a power supply is installed, its serial number is read by the Dallas Semiconductor Corp. one-wire protocol and written to the system log 117.




As shown in FIG. 1, the server system 100 further includes a remote interface 119 that is also connected to the system bus 107. The remote interface 119 also receives event signals from the various controllers 109, 111, 112 and/or 113 when a failure condition has been detected. The remote interface 119 is a link to the server system 100 for a remote client. In one embodiment, the remote interface 119 encapsulates messages in a transmission packet to provide error-free communications and link security. This method establishes a communication protocol in which data is transmitted to and from the remote interface 119 by using a serial communication protocol known as “byte stuffing.” In this communication method, certain byte values in the data stream always have a particular meaning. For example, a certain byte value may indicate the start or end of a message, an interrupt signal, or any other command. A byte value may indicate the type or status of a message, or even be the message itself. However, the invention is not limited to any particular type of communication protocol and any protocol which is suitable may be used by the remote interface 119 in accordance with the invention. The remote interface 119 is described in further detail below with reference to FIG. 4.




Through the remote interface 119, a failure condition may be reported to a local system operator or to a remote operator. As used herein, the term “local” refers to a computer, system, operator or user that is not located in the same room as the hardware of the server system 100 but may be located nearby in a different room of the same building, for example. The term “remote” refers to a computer, system or operator that may be located in another city or state, for example, and is connected to the server system via a modem-to-modem connection. The remote operator is typically a client who is authorized to access data and information from the server system 100 through a remote computer 125.




Coupled to the remote interface 119 is a switch 121 for switching connectivity to the remote interface 119 between a local computer 123 and a remote computer 125. As shown in FIG. 1, the local computer 123 is connected to the remote interface 119 via a local communications line 127. The local communications line 127 may be any type of communication line, e.g., an RS232 line, suitable for transmitting data. The remote computer 125 is connected to the remote interface via a modem-to-modem connection established by a client modem 129 coupled to a server modem 131. The client modem 129 is connected to the server modem 131 by a telephone line 133.




The system interface 105, the system bus 107, the controllers 109, 111, 112 and 113, the system recorder 115, the system log 117, and the remote interface 119 are part of a network of controllers and processors which form the failure reporting system of the invention. One embodiment of this failure reporting system is known as the Intrapulse System™, designed and manufactured by Netframe, Inc., located at Milpitas, Calif. In FIG. 1, the Intrapulse System is that portion of the components surrounded by the dashed lines. The Intrapulse System monitors the status and operational parameters of the various subsystems of the server system 100 and provides system failure and error reports to the CPU 101 of the server system 100. Upon the reporting of the occurrence of an event to the CPU 101, the CPU 101 executes a software program which allows a system operator to access further information regarding the system failure condition and thereafter take appropriate steps to remedy the situation.




Referring to FIG. 2, a block diagram of one embodiment of the system interface 105 is shown surrounded by dashed lines. The system interface 105 is the interface used by the server system 100 to report failure events to the CPU 101. Furthermore, a system operator can access failure information related to a detected system failure by means of the system interface 105. A software program executed by the operating system of the CPU 101 allows the CPU 101 to communicate with the system interface 105 in order to retrieve information stored in the system log 117, as described above. In one embodiment, this software program is the Maestro Central program, manufactured by Netframe, Inc. The operating system of the CPU 101 may be an operating system (OS) driver program, such as Windows NT™ or Netware™ for Windows, for example.




The system interface 105 includes a system interface processor 201 which receives event and request signals, processes these signals, and transmits command, status and response signals to the operating system of the CPU 101. In one embodiment, the system interface processor 201 is a PIC16C65 controller chip which includes an event memory (not shown) organized as a bit vector having at least sixteen bits. Each bit in the bit vector represents a particular type of event. Writing an event to the system interface processor 201 sets a bit in the bit vector that represents the event. Upon receiving an event signal from the controller 109 (FIG. 1), for example, the system interface 105 reports the occurrence of an event to the CPU 101 by sending an interrupt to the CPU 101. Upon receiving the interrupt, the CPU 101 will check the status of the system interface 105 in order to ascertain that an event is pending. Alternatively, the reporting of the occurrence of an event may be implemented by programming the CPU 101 to periodically poll the status of the system interface 105 in order to ascertain whether an event is pending. The CPU 101 may then read the bit vector in the system interface processor 201 to ascertain the type of event that occurred and thereafter notify a system operator of the event by displaying an event message on a monitor coupled to the CPU 101. After the system operator has been notified of the event, as described above, he or she may then obtain further information about the system failure which generated the event signal by accessing the system log 117. This capability is also provided by the Maestro Central software program.
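
For illustration purposes only, the sixteen-bit event memory described above behaves like a simple bit vector: a controller's event sets one bit, and the CPU later reads and clears the vector to learn which event types are pending. The following C sketch models that behavior; the event names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical event types, one bit per type, as in the bit vector above. */
enum { EVT_FAN = 0, EVT_TEMP = 1, EVT_POWER = 2 };

static uint16_t event_vector;                       /* the event memory */

static void post_event(int type)      { event_vector |= (uint16_t)(1u << type); }
static int  event_pending(void)       { return event_vector != 0; }

/* The CPU reads the vector to learn which event types occurred,
 * then clears it, roughly as described for the system interface. */
static uint16_t read_and_clear(void)
{
    uint16_t v = event_vector;
    event_vector = 0;
    return v;
}

int main(void)
{
    post_event(EVT_TEMP);                           /* controller reports a failure */
    if (event_pending()) {
        uint16_t v = read_and_clear();
        if (v & (1u << EVT_TEMP))
            printf("temperature event pending\n");  /* notify the operator */
    }
    return 0;
}
```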




The system interface 105 communicates with the CPU 101 by receiving request signals from the CPU 101 and sending response signals back to the CPU 101. Furthermore, the system interface 105 can send and receive status and command signals to and from the CPU 101. For example, a request signal may be sent from a system operator enquiring as to whether the system interface 105 has received any event signals, or enquiring as to the status of a particular processor, subsystem, operating parameter, etc. A request signal buffer 203 is coupled to the system interface processor 201 and stores, or queues, request signals in the order that they are received. Similarly, a response buffer 205 is coupled to the system interface processor 201 and queues outgoing response signals in the order that they are received.




A message data register (MDR) 207 is coupled to the request and response buffers 203 and 205. In one embodiment, the MDR 207 is eight bits wide and has a fixed address which may be accessed by the server's operating system via the ISA bus 103 coupled to the MDR 207. As shown in FIG. 2, the MDR 207 has an I/O address of 0CC0h. When a system operator desires to send a request signal to the system interface processor 201, he or she must first access the MDR 207 through the operating system of the server, which knows the address of the MDR 207.




One embodiment of a data format for the request and response signals is illustrated in FIGS. 3A-3D. FIG. 3A shows a data format for a read request signal. FIG. 3B shows a similar data format for a write request signal. FIG. 3C shows a data format for a read response signal, and FIG. 3D shows a data format for a write response signal.

The following is a summary of the data fields shown in FIGS. 3A-3D:
















Slave Addr: Specifies the processor identification code. This field is 7 bits wide. Bit [7 . . . 1].

LSBit: Specifies what type of activity is taking place. If LSBit is clear (0), the master is transmitting to a slave. If LSBit is set (1), the master is receiving from a slave.

MSBit: Specifies the type of command. It is bit 7 of byte 1 of a request. If this bit is clear (0), this is a write command. If it is set (1), this is a read command.

Type: Specifies the data type of this command, such as bit or string.

Command ID (LSB): Specifies the least significant byte of the address of the processor.

Command ID (MSB): Specifies the most significant byte of the address of the processor.

Length (N):
    Read Request: Specifies the length of the data that the master expects to get back from a read response. The length, which is in bytes, does not include the Status, Check Sum, and Inverted Slave Addr fields.
    Read Response: Specifies the length of the data immediately following this byte, that is, byte 2 through byte N + 1. The length, which is in bytes, does not include the Status, Check Sum, and Inverted Slave Addr fields.
    Write Request: Specifies the length of the data immediately following this byte, that is, byte 2 through byte N + 1. The length, which is in bytes, does not include the Status, Check Sum, and Inverted Slave Addr fields.
    Write Response: Always specified as 0.

Data Byte 1 . . . Data Byte N: Specifies the data in a read request and response, and a write request.

Status: Specifies whether or not this command executes successfully. A non-zero entry indicates a failure.

Check Sum: Specifies a direction control byte to ensure the integrity of a message on the wire.

Inverted Slave Addr: Specifies the Slave Addr, which is inverted.














Referring again to FIG. 2, it is seen that the system interface 105 further includes a command and status register (CSR) 209 which controls operations and reports on the status of commands. The operation and functionality of the CSR 209 is described in further detail below. Both synchronous and asynchronous I/O modes are provided by the system interface 105. Thus, an interrupt line 211 is coupled between the system interface processor 201 and the ISA bus 103 and provides the ability to request an interrupt when asynchronous I/O is complete, or when an event occurs while the interrupt is enabled. As shown in FIG. 2, in one embodiment, the address of the interrupt line 211 is fixed and indicated as IRQ 15, which is an interrupt address number used specifically for the ISA bus 103.




The MDR 207 and the request and response buffers 203 and 205, respectively, transfer messages between a system operator or client and the failure reporting system of the invention. The buffers 203 and 205 are configured as first-in first-out (FIFO) buffers. That is, in these buffers, the next message processed is the one that has been in the queue the longest time. The buffers 203 and 205 have two functions: (1) they match speeds between the high-speed ISA bus 103 and the slower system bus 107 (FIG. 1); and (2) they serve as interim buffers for the transfer of messages. This relieves the system interface processor 201 of having to provide this buffering.
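
For illustration purposes only, a speed-matching FIFO of the kind used for the request and response buffers may be modeled as a small ring buffer, as in the C sketch below. The sixteen-byte depth is an assumption; the actual buffer sizes are not specified here.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed depth; the actual buffer sizes are not specified in the text. */
#define FIFO_DEPTH 16

struct fifo {
    uint8_t  data[FIFO_DEPTH];
    unsigned head, tail, count;
};

/* Enqueue a byte (e.g., the MDR loading the request buffer). */
static int fifo_put(struct fifo *f, uint8_t b)
{
    if (f->count == FIFO_DEPTH) return -1;          /* full */
    f->data[f->head] = b;
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count++;
    return 0;
}

/* Dequeue the oldest byte (first in, first out). */
static int fifo_get(struct fifo *f, uint8_t *out)
{
    if (f->count == 0) return -1;                   /* empty */
    *out = f->data[f->tail];
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    f->count--;
    return 0;
}

int main(void)
{
    struct fifo req = {0};
    uint8_t b;
    fifo_put(&req, 0xA5);                           /* fast bus writes bytes    */
    fifo_put(&req, 0x3C);
    while (fifo_get(&req, &b) == 0)                 /* slow bus drains them later */
        printf("dequeued 0x%02X\n", (unsigned)b);
    return 0;
}
```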




When the MDR 207 is written to by the ISA bus 103, it loads a byte into the request buffer 203. When the MDR 207 is read from the ISA bus 103, it unloads a byte from the response buffer 205. The system interface processor 201 reads and executes the request from the request buffer 203 when a message command is received in the CSR 209. A response message is written to the response buffer 205 when the system interface processor 201 completes executing the command. The system operator or client can read and write message data to and from the buffers 203 and 205 by executing read and write instructions through the MDR 207.




The CSR 209 has two functions. The first is to issue commands, and the second is to report on the status of execution of a command. The commands in the system interface 105 are usually executed synchronously. That is, after issuing a command, the client must continue to poll the CSR status to confirm command completion. In addition to the synchronous I/O mode, the client can also request an asynchronous I/O mode for each command by setting an “Asyn Req” bit in the command. In this mode, an interrupt is generated and sent to the ISA bus 103, via the interrupt line 211, after the command has completed executing.
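
For illustration purposes only, the following C sketch contrasts the two I/O modes: synchronous completion detected by polling a status bit, and asynchronous completion signaled by a flag that an interrupt handler would set. The register layout and bit positions are invented for the example.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical CSR bits; the real register layout is not given here. */
#define CSR_DONE     0x01u
#define CSR_ASYN_REQ 0x80u

static volatile uint8_t csr;           /* stands in for the hardware CSR   */
static volatile int     io_complete;   /* set by an interrupt handler      */

static void issue_command(uint8_t cmd, int async)
{
    csr = (uint8_t)(cmd | (async ? CSR_ASYN_REQ : 0u));
    /* A real device would execute the command here; we simulate completion. */
    csr |= CSR_DONE;
    if (async)
        io_complete = 1;               /* what the ISR would do on IRQ 15  */
}

int main(void)
{
    /* Synchronous mode: poll the CSR until the command completes. */
    issue_command(0x02, 0);
    while (!(csr & CSR_DONE))
        ;                              /* busy-wait on status              */
    printf("synchronous command done\n");

    /* Asynchronous mode: continue working, then check the ISR flag. */
    io_complete = 0;
    issue_command(0x03, 1);
    if (io_complete)
        printf("asynchronous command completed via interrupt\n");
    return 0;
}
```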




The interrupt line 211 may use an ISA IRQ 15 protocol, as mentioned above, which is well-known in the art. Alternatively, the interrupt line 211 may utilize a level-triggered protocol. A level-triggered interrupt request is recognized by keeping the signal at the same level, or changing the level of a signal, to send an interrupt. In a system which utilizes the level-triggered interrupt, it is a particular level of a signal, either high or low, which represents the interrupt signal. In contrast, an edge-triggered interrupt, for example, is recognized by the signal level transition. That is, an interrupt is detected when the signal changes from either a high level to a low level, or vice versa, regardless of the resulting signal level. A client can either enable or disable the level-triggered interrupt by sending “Enable Ints” and “Disable Ints” commands. If the interrupt line is enabled, the system interface processor sends an interrupt signal to the ISA bus 103, either when an asynchronous I/O is complete or when an event has been detected.




In the embodiment shown in FIG. 2, the system interface 105 may be a single-threaded interface. That is, only one client, or system operator, is allowed to access the system interface 105 at a time. Therefore, a program or application must allocate the system interface 105 for its use before using it, and then deallocate the interface 105 when its operation is complete. The CSR 209 indicates which client or operator is allocated access to the system interface 105 at a particular time.




A further discussion of the structure and operation of the system interface 105 may be found in a copending and commonly owned patent application entitled “I2C To ISA Bus Interface,” which is listed in Appendix A attached hereto.





FIG. 4 illustrates a system block diagram of one embodiment of the remote interface 119 of FIG. 1. As described above, the remote interface 119 serves as an interface which handles communications between the server system 100 (FIG. 1) and an external computer, such as a local computer 123 or a remote computer 125. The local computer 123 is typically connected to the remote interface 119 via a local communication line 127, such as an RS232 line, and the remote computer 125 is typically connected to the remote interface 119 by means of a modem connection line 133 which connects the client modem 129 to the server modem 131.




As shown within the dashed lines in FIG. 4, the remote interface 119 comprises a remote interface processor 401, a remote interface memory 403, a transceiver 405 and an RS232 port 407. The remote interface processor 401 is coupled to the system bus 107 and receives an event signal from the controller 109 (FIG. 1) when a failure condition has been detected. In one embodiment, the remote interface processor 401 is a PIC16C65 controller chip which includes an event memory (not shown) organized as a bit vector having at least sixteen bits. Each bit in the bit vector represents a particular type of event. Writing an event to the remote interface processor 401 sets a bit in the bit vector that represents the event. The remote interface memory 403 is coupled to the remote interface processor 401 for receiving and storing event data, commands, and other types of data transmitted to the remote interface 119. In one embodiment, the remote interface memory 403 is a static random access memory (SRAM).




In order to communicate with external devices, the remote interface 119 further includes the transceiver 405, coupled to the remote interface processor 401, for receiving and transmitting data between the remote interface processor 401 and a local PC 123 or a remote/client PC 125, in accordance with a specified communication protocol. One embodiment of such a communication protocol is described in further detail below. In one embodiment, the transceiver 405 is an LT1133A signal processing chip. Coupled to the transceiver 405 is an RS232 communication port which is well-known in the art for providing data communications between computer systems in a computer network. One of the functions of the transceiver 405 is to transpose signal levels from the remote interface processor 401 to RS232 signal protocol levels.




The remote interface 119 is coupled to a switch 121 for switching access to the remote interface 119 between a local computer 123 and a remote PC 125. The switch 121 receives command signals from the remote interface processor 401 and establishes connectivity to the RS232 communication port 407 based on these command signals. Upon receiving an event signal, the remote interface processor 401 will set the connectivity of the switch 121 based on criteria such as the type of event that has been detected. If the switch 121 is set to provide communications between the local PC 123 and the remote interface 119, after receiving an event signal, the remote interface processor 401 transmits a Ready To Receive (RTR) signal to the local computer 123. A software program which is stored and running in the local computer 123 recognizes the RTR signal and sends back appropriate commands in order to interrogate the remote interface processor 401. In one embodiment, the software program which is stored and executed by the local computer 123 is the Maestro Recovery Manager software program, manufactured by Netframe, Inc. Upon interrogating the remote interface processor 401, the local computer 123 detects that an event signal has been received by the remote interface 119. The local computer 123 may then read the bit vector in the remote interface processor 401 to ascertain the type of event that occurred and thereafter notify a local user of the event by displaying an event message on a monitor coupled to the local computer 123. After the local user has been notified of the event, as described above, he or she may then obtain further information about the system failure which generated the event signal by accessing the system log 117 (FIG. 1) from the local computer 123 via the remote interface 119. This capability is also provided by the Maestro Recovery Manager software program.




If the switch 121 is set to provide connectivity to the remote/client computer 125 via a modem-to-modem connection, the server modem 131 will dial the modem number (telephone number) corresponding to the client modem 129 in order to establish a communication link with the remote computer 125. In one embodiment, the number of the client modem 129 is stored in the system log 117 (FIG. 1) and accessed by the remote interface processor 401 upon receiving specified event signals. When the client modem 129 receives “a call” from the server modem 131, the remote computer 125 will send back appropriate commands and/or data in order to interrogate the remote interface processor 401 in accordance with a software program running on the remote computer 125. In one embodiment, this software program is the Maestro Recovery Manager software program manufactured by Netframe, Inc. Upon interrogating the processor 401, the remote computer 125 will detect that an event signal has been transmitted to the remote interface 119. The remote computer 125 may then read the bit vector in the remote interface processor 401 to ascertain the type of event that occurred and thereafter notify a remote user of the event by displaying an event message on a monitor coupled to the remote computer 125. At this point, a remote user, typically a client authorized to have access to the server system 100, may obtain further information about the failure condition which generated the event signal by accessing the system log 117 (FIG. 1) from the remote computer 125 via the remote interface 119.




In one embodiment, the remote interface communication protocol is a serial protocol that communicates messages across a point-to-point serial link. This link is between the remote interface processor 401 and a local or remote client. The protocol encapsulates messages in a transmission packet to provide error-free communication and link security and further uses the concept of “byte stuffing” in which certain byte values in a data stream always have a particular meaning. Examples of bytes that have a special meaning in this protocol are:

SOM: Start of a message
EOM: End of a message
SUB: The next byte in the data stream must be substituted before processing.
INT: Event Interrupt
Data: An entire Message
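
For illustration purposes only, the following C sketch frames a payload with start-of-message and end-of-message markers and escapes any payload byte that collides with a reserved value, which is the essence of byte stuffing. The particular byte values and the substitution rule used here are assumptions, not the values used by the remote interface.

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical reserved byte values for the framing example. */
#define SOM 0x02   /* start of message          */
#define EOM 0x03   /* end of message            */
#define SUB 0x10   /* next byte is substituted  */

/* Frame a payload, escaping reserved values by prefixing SUB and
 * XOR-ing the byte with 0x20 (an assumed substitution rule). */
static size_t frame(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    out[o++] = SOM;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == SOM || in[i] == EOM || in[i] == SUB) {
            out[o++] = SUB;
            out[o++] = (uint8_t)(in[i] ^ 0x20);
        } else {
            out[o++] = in[i];
        }
    }
    out[o++] = EOM;
    return o;
}

int main(void)
{
    uint8_t payload[] = { 0x41, 0x02, 0x7F };   /* 0x02 must be escaped */
    uint8_t packet[16];
    size_t len = frame(payload, sizeof payload, packet);
    for (size_t i = 0; i < len; i++)
        printf("%02X ", (unsigned)packet[i]);
    printf("\n");
    return 0;
}
```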




The remote interface serial protocol uses two types of messages: (1) requests, which are sent by remote management systems (PCs) to the Remote Interface; and (2) responses, which are returned to the requester by the Remote Interface. The formats of these messages are illustrated in FIGS. 5A-5C.

The following is a summary of the fields within each of the messages shown in FIGS. 5A-5C:


















SOM: A special data byte value marking the start of a message.

EOM: A special data byte value marking the end of a message.

Seq. #: A one-byte sequence number, which is incremented on each request. It is stored in the response.

TYPE: One of the following types of requests:
    IDENTIFY: Requests the remote interface to send back identification information about the system to which it is connected. It also resets the next expected sequence number. Security authorization does not need to be established before the request is issued.
    SECURE: Establishes secure authorization on the serial link by checking password security data provided in the message with the server system password.
    UNSECURE: Clears security authorization on the link and attempts to disconnect it. This requires security authorization to have been previously established.
    MESSAGE: Passes the data portions of the message to the remote interface for execution. The response from the remote interface is sent back in the data portion of the response. This requires security authorization to have been previously established.
    POLL: Queries the status of the remote interface. This request is generally used to determine if an event is pending in the remote interface.

STATUS: One of the following response status values:
    OK: Everything relating to communication with the remote interface is successful.
    OK_EVENT: Everything relating to communication with the remote interface is successful. In addition, there are one or more events pending in the remote interface.
    SEQUENCE: The sequence number of the request is neither the current sequence number (a retransmission request) nor the next expected sequence number (a new request). Sequence numbers may be reset by an IDENTIFY request.
    CHECK: The check byte in the request message is received incorrectly.
    FORMAT: Something about the format of the message is incorrect. Most likely, the type field contains an invalid value.
    SECURE: The message requires that security authorization be in effect. Or, if the message has a TYPE value of SECURE, the security check failed.

Check: Indicates a message integrity check byte. Currently the value is 256 minus the sum of the previous bytes in the message. For example, adding all bytes in the message, up to and including the check byte, should produce a result of zero (0).

INT: A special one-byte message sent by the remote interface when it detects the transition from no events pending to one or more events pending. This message can be used to trigger reading events from the remote interface. Events should be read until the return status changes from OK_EVENT to OK.
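
For illustration purposes only, the check-byte rule quoted above (256 minus the sum of the preceding bytes, so that summing every byte of the message, including the check byte, yields zero modulo 256) can be verified with the following C sketch; the sample message bytes are arbitrary.

```c
#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

/* Check byte: 256 minus the sum of the preceding message bytes (mod 256). */
static uint8_t check_byte(const uint8_t *msg, size_t n)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum = (uint8_t)(sum + msg[i]);
    return (uint8_t)(0x100 - sum);        /* equivalently, -sum mod 256 */
}

int main(void)
{
    uint8_t msg[] = { 0x01, 0x7E, 0x10, 0x33 };   /* arbitrary example bytes */
    uint8_t chk = check_byte(msg, sizeof msg);

    /* Verification: adding all bytes including the check byte gives zero. */
    uint8_t total = chk;
    for (size_t i = 0; i < sizeof msg; i++)
        total = (uint8_t)(total + msg[i]);

    printf("check byte = 0x%02X, total = 0x%02X\n", (unsigned)chk, (unsigned)total);
    return 0;
}
```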














In one embodiment, the call-out protocol of the remote interface is controlled by software code called a Callout Script. The Callout Script controls actions taken by the remote interface 119 when it is requested to make a callout to a local or remote computer, 123 or 125, respectively. The script is a compact representation of a simple scripting language that controls the interaction between a modem and a remote system. Because the script keyword fields are bytes, it requires a simple compiler to translate from text to the script. The script is stored in the system recorder 115 (FIG. 1) and is retrieved by the remote interface 119 when needed. The following is a summary of some of the fields of the callout script:

















Field      Data                      Function
Label      Label Value               Establishes a label in the script.
Goto       Label Value               Transfers control to a label.
Speed      Speed Value               Sets the remote interface speed to the specified value.
Send       Data String               Sends the data string to the serial interface.
Test       Condition, label          Tests the specified condition and transfers to label if the test is true.
Trap       Event, label              Establishes or removes a trap handler address for a given event.
Search     Data string, label value  Searches for a specific data string in the receiving buffer. If the data string is found, removes the data up to and including this string from the buffer, then transfers to label.
Control    Control                   Takes the specified control action.
Wait       .1-25.5 sec.              Delays execution of the script for the specified time.
Exit       OK, Fail                  Terminates script processing and exits with a status and log result.














A further description of the remote interface 119 can be found in a copending and commonly owned U.S. patent application entitled “System Architecture For Remote Access And Control of Environmental Management,” which is listed in Appendix A attached hereto.




Referring to FIG. 6, a block diagram of one embodiment of the system recorder 115 of FIG. 1 is illustrated. The system recorder 115 is enclosed by the dashed lines and includes a system recorder processor 601 and a real-time clock chip 603. In one embodiment, the system recorder processor is a PIC chip, part no. PIC16C65, manufactured by Microchip Technologies, Inc., and the real-time clock chip 603 is a Dallas 1603 IC chip, manufactured by Dallas Semiconductor, Inc. of Dallas, Tex., which includes a four-byte counter that is incremented every second. Since there are 32 bits, the real-time clock chip 603 has the capacity of recording the time for more than 100 years without having to be reset. It also has battery backup power, so if the power goes off, it continues to “tick.” The real-time clock chip 603 records “absolute” time. In other words, it does not record time in terms of the time of day in a particular time zone, nor does it reset when the time in the real world is reset forward or back one hour for daylight savings. The operating system must get a reference point for its time by reading the real-time clock chip 603 and then synchronizing it with real world time.
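
For illustration purposes only, the synchronization step described above (pairing one reading of the free-running seconds counter with a known wall-clock reference and deriving later times by offset) may be sketched in C as follows; the reference values are invented.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* One reference pairing: counter value observed at a known wall-clock time.
 * Both numbers here are invented for illustration. */
static const uint32_t ref_counter   = 123456789u;
static const time_t   ref_wallclock = 868060800;   /* some known epoch time */

/* Convert a later counter reading to wall-clock time by offsetting
 * from the reference pairing, as described above. */
static time_t counter_to_time(uint32_t counter_now)
{
    return ref_wallclock + (time_t)(counter_now - ref_counter);
}

int main(void)
{
    uint32_t now_counter = ref_counter + 3600u;     /* one hour later */
    time_t t = counter_to_time(now_counter);
    printf("wall-clock time: %s", ctime(&t));
    return 0;
}
```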




The system recorder processor 601 is coupled to the system bus 107. When a failure condition is detected by the controller 109 (FIG. 1), the controller 109 transmits failure information related to the detected failure condition to the system recorder processor 601. This failure information may include the values of out-of-tolerance operational parameters such as fan speed or a system temperature, for example. Upon receiving this failure information, the system recorder processor 601 queries the real-time clock chip 603 for a time value which is stored in the 8-byte field within the chip 603. The real-time clock chip 603 transmits the value of this 8-byte field to the processor 601, whereupon the processor 601 “stamps” the failure information with this time value. The time value is included as part of the failure information which is subsequently stored in the system log 117.




In order to store data into the system log 117, the system recorder processor 601 must obtain the address of the next available memory space within the system log 117 and set a pointer to that address. The system recorder processor 601 is coupled to the system log 117 by means of an address bus 606 and a data bus 607. Prior to storing or retrieving data from the system log, the processor 601 communicates with the system log 117 in order to ascertain the addresses of relevant memory locations in or from which data is to be either stored or retrieved. Upon receiving an address, the processor 601 can proceed to store or retrieve data from the corresponding memory space, via the data bus 607. FIGS. 7A-7D illustrate a flowchart of one embodiment of a process of reading data from and writing data to the system log.




Referring now to FIGS. 7A-7D, a flow chart illustrates one embodiment of a method by which the system recorder 115 (FIG. 1) stores and retrieves information from the system log 117. In the embodiment discussed below, the system log 117 is a non-volatile random access memory (NVRAM) and is referred to as NVRAM 117. In FIG. 7A, at step 700, the system recorder 115 is typically in an idle state, i.e., waiting for commands from other microcontrollers in the network. At step 702, the system recorder 115 determines if an interrupt command is detected from other microcontrollers. If no interrupt command is detected, then at step 704, the system recorder 115 checks if a reset command is pending. A reset command is a request to clear all memory cells in the NVRAM 117. If a reset command is detected, then at step 706, the system recorder 115 clears all memory cells in the NVRAM 117 and returns to its idle state at step 700, and the entire process repeats itself. If a reset command is not detected, then at step 708, the system recorder 115 updates the time stored in the real-time clock chip 603 (FIG. 6) every one second. At this step, the system recorder 115 reads the real-time clock and saves the real time in a local register (not shown).




If, at step 702, an interrupt command is detected from other microcontrollers, the system recorder 115 determines the type of data in the interrupt command at step 710. For the purpose of logging message events in the NVRAM 117, the log data and event data types are pertinent. As noted above, the log data type is used to write a byte string to a circular log buffer, such as the NVRAM 117. The log data type records system events in the NVRAM 117. The maximum number of bytes that can be written in a log entry is 249 bytes. The system recorder 115 adds a total of six bytes at the beginning of the interrupt command: a two-byte identification code (ID), and a four-byte timestamp for recording the real time of the occurrence of the system event.
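
For illustration purposes only, a log entry of the kind described, a two-byte identification code and a four-byte timestamp prepended to a payload of at most 249 bytes, might be laid out as in the following C sketch. The structure and helper function are illustrative assumptions, not the format used by the actual firmware.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

#define MAX_LOG_PAYLOAD 249   /* maximum bytes in one log entry, per above */

/* Illustrative in-memory layout of a logged entry: a six-byte header
 * (two-byte ID plus four-byte timestamp) followed by the payload. */
struct log_entry {
    uint16_t id;                        /* two-byte identification code    */
    uint32_t timestamp;                 /* four-byte real-time stamp       */
    uint8_t  len;                       /* payload length, at most 249     */
    uint8_t  payload[MAX_LOG_PAYLOAD];
};

/* Build an entry from raw failure information; returns 0 on success. */
static int make_entry(struct log_entry *e, uint16_t id, uint32_t ts,
                      const uint8_t *data, size_t n)
{
    if (n > MAX_LOG_PAYLOAD) return -1;
    e->id = id;
    e->timestamp = ts;
    e->len = (uint8_t)n;
    memcpy(e->payload, data, n);
    return 0;
}

int main(void)
{
    struct log_entry e;
    uint8_t info[] = { 0x21, 0x05 };    /* e.g., fan number and fault code */
    if (make_entry(&e, 0x0109, 123456u, info, sizeof info) == 0)
        printf("logged event id=0x%04X ts=%lu len=%u\n",
               (unsigned)e.id, (unsigned long)e.timestamp, (unsigned)e.len);
    return 0;
}
```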




With special firmware, the NVRAM 117 is divided into two blocks: a first block having 64 kbytes of memory space, and a second block having 64 kbytes of memory space. The first block of the NVRAM 117 is a fixed-variable memory block which stores ID codes of the devices installed in the network as well as other information. The second block is a memory block which stores message codes in connection with events occurring in the network. The NVRAM 117 may be based upon devices manufactured by Dallas Semiconductor Corporation, e.g., the DS1245Y/AB 1024K Nonvolatile SRAM.




Based on the interpretation of the data type at step 712, the system recorder 115 determines whether the interrupt command is intended to be sent to the first block or the second block of the NVRAM 117. If the interrupt command is intended to be sent to the first block of the NVRAM 117, then the process described in FIG. 7B is followed. If the interrupt command is not intended to be sent to the first block of the NVRAM 117, then it is intended to be sent to the second block of the NVRAM 117. At step 714, the system recorder 115 determines whether the interrupt command is a read or write command for the second block. If the interrupt command is a read command, then the process described in FIG. 7C is followed. If the interrupt command is not a read command, then it is a write command and the process described in FIG. 7D is followed.




Referring to FIG. 7B, a flow chart is provided for describing the steps of performing a read from and/or write to the first block of the NVRAM 117. As noted above, the first block of the NVRAM 117 is a 64-kbyte memory block. The first block is a fixed-variable memory block which stores ID codes of the devices installed in the network. Hence, a command addressed to the first block is typically generated by a controller (e.g., chassis controller 112 of FIG. 1) responsible for updating the presence or absence of devices in the network. The process described in FIG. 7B is followed when, at step 712 (shown in FIG. 7A), the system recorder 115 determines that the interrupt command is intended to be sent to the first block of the NVRAM 117.




As shown in FIG. 7B, at step 718, the system recorder 115 determines whether the interrupt command is to read from or write to the NVRAM 117. If the interrupt command is a read command, then at step 720, the system recorder 115 loads the address pointer at the intended address location in the NVRAM 117. At step 722, the system recorder 115 reads the intended message from the address location in the NVRAM 117, and forwards the read data to the master device (i.e., the device requesting the read operation) in the network. After the read operation is complete, at step 728, the system recorder 115 issues an interrupt return command to return to its idle state at step 700 (shown in FIG. 7A).




If at step 718 the system recorder 115 determines that the interrupt command is a write command, then at step 724, the system recorder 115 loads the address pointer at the intended address location in NVRAM 117. The system recorder 115 preferably checks on the availability of memory space in NVRAM 117 prior to executing a write operation (see FIG. 7D for details). At step 726, the system recorder 115 writes the event message to the address location in the NVRAM 117, and forwards a confirmation to the master device in the network. After the write operation is complete, at step 728, the system recorder 115 issues an interrupt return command to return to its idle state at step 700 (shown in FIG. 7A).
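
A compact, self-contained sketch of the FIG. 7B flow is given below (independent of the dispatch sketch above). It assumes the first block is visible to the firmware as a plain byte array and that i2c_send_to_master() is a hypothetical helper used to return read data or a write confirmation to the requesting master device; neither of these is defined by the text.

    #include <stdint.h>
    #include <string.h>

    #define NVRAM_BLOCK_SIZE 0x10000UL

    /* Hypothetical in-memory view of the first (fixed-variable) 64-kbyte block. */
    static uint8_t nvram_block1[NVRAM_BLOCK_SIZE];

    /* Assumed helper: return a reply (read data or a write confirmation)
     * to the master device over the I2C bus.                                */
    void i2c_send_to_master(const uint8_t *data, uint16_t len);

    /* Steps 718-728 of FIG. 7B: read from or write to the first block.      */
    void block1_read_write(int is_read, uint16_t addr,
                           const uint8_t *data, uint16_t len)
    {
        if ((uint32_t)addr + len > NVRAM_BLOCK_SIZE)
            return;                               /* out-of-range request    */

        if (is_read) {
            /* Steps 720-722: read the message at the intended address and
             * forward the data to the requesting master device.             */
            i2c_send_to_master(&nvram_block1[addr], len);
        } else {
            /* Steps 724-726: write the event message, then confirm.         */
            static const uint8_t ack = 0x00;      /* illustrative confirmation */
            memcpy(&nvram_block1[addr], data, len);
            i2c_send_to_master(&ack, 1);
        }
        /* Step 728: issue the interrupt return and go back to idle state 700. */
    }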




Referring now to FIG. 7C, a flow chart is provided for describing the steps of performing a read operation from the second block of the NVRAM 117. As noted above, the second block of the NVRAM 117 is a 64-kbyte memory block. The second block is a memory block which stores event messages in connection with events occurring in the network. Hence, a command addressed to the second block is typically generated by a controller responsible for updating the occurrence of such events. The process described in FIG. 7C is followed when, at step 714 (shown in FIG. 7A), the system recorder 115 determines that the interrupt command is a read command intended for the second block of the NVRAM 117.




As shown in FIG. 7C, if the system recorder 115 determines that the interrupt command is a read operation, then at step 730, the system recorder 115 loads an address pointer to the intended address in the second block of NVRAM 117. At step 732, the system recorder 115 performs a read operation of the first logged message from the NVRAM 117, commencing with the intended address location. For a read operation, it is preferable that only the 65534 (FFFEh) and 65533 (FFFDh) addresses be recognized. The address 65534 specifies the address of the oldest valid message. The address 65533 specifies the address of the next message following the last message read from the log in NVRAM 117. The last address in the second block of the NVRAM 117 is 65279 (FEFFh). This is also the address at which the system recorder 115 performs a pointer wrap operation (see FIG. 7D for details). In doing so, the system recorder 115 redirects the address pointer to the beginning of the second block of the NVRAM 117. Hence, the address of the next message after the 65279 address is 0. To read the entire second block in chronological order, the timestamp is read first. Then, the message logged at address 65534 is read; this message constitutes the first (oldest) logged message. Then, the message logged at address 65533 is read; this message is the next logged message. Reading at address 65533 is repeated to retrieve all subsequently logged messages, and terminates when the status field returns a non-zero value, such as 07H, for example.
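
A sketch of how a master device might walk the log in chronological order using the two pseudo-addresses described above follows. Here recorder_read_at() is a hypothetical helper that issues one read command to the system recorder and fills in a status byte, and the message structure is an assumption; only the 0xFFFE/0xFFFD addressing and the non-zero-status termination come from the text.

    #include <stdint.h>

    #define NVRAM_READ_OLDEST 0xFFFEu   /* 65534: oldest valid message         */
    #define NVRAM_READ_NEXT   0xFFFDu   /* 65533: next message after last read */

    typedef struct {
        uint8_t  status;                /* non-zero (e.g., 0x07) == end of log */
        uint16_t id;
        uint32_t timestamp;
        uint8_t  payload[249];
        uint16_t payload_len;
    } log_msg_t;

    /* Assumed helper: ask the system recorder for the message at 'addr'. */
    int recorder_read_at(uint16_t addr, log_msg_t *out);

    /* Read the whole second block in chronological order, as described for
     * FIG. 7C: first the oldest message at 0xFFFE, then repeated reads at
     * 0xFFFD until the status field comes back non-zero.                    */
    int dump_event_log(void (*emit)(const log_msg_t *))
    {
        log_msg_t msg;

        if (recorder_read_at(NVRAM_READ_OLDEST, &msg) != 0)
            return -1;
        while (msg.status == 0) {
            emit(&msg);
            if (recorder_read_at(NVRAM_READ_NEXT, &msg) != 0)
                return -1;
        }
        return 0;   /* status became non-zero: no more logged messages */
    }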




At step 734, the system recorder 115 determines whether the address location has reached the end of the second block in the NVRAM 117. If the address location has not reached the end of the second block, then at step 736, the system recorder 115 performs a read operation of the next logged message using the addressing scheme described above. The system recorder 115 transmits all read messages to the master device via the I2C bus. If the address location has reached the end of the second block, then the system recorder 115 returns to its idle state 700 (shown in FIG. 7C).




Referring now to FIG. 7D, a flow chart is provided for describing the steps of performing a write operation to the second block of the NVRAM 117. Typically, a command addressed to the second block is generated by a controller (e.g., chassis controller 222) responsible for updating the occurrence of such events. The process described in FIG. 7D is followed when, at step 714 (shown in FIG. 7A), the system recorder 115 determines that the interrupt command is a write command directed to the second block of the NVRAM 117.




As shown in FIG. 7D, if the system recorder 115 determines that the interrupt command is a write command, then at step 740, the system recorder 115 loads an address pointer to the intended address in the second block of NVRAM 117. At step 742, the system recorder 115 determines whether memory space is available in the second block of NVRAM 117 to perform the requested write operation. If memory space is not available in the second block, then at step 744, the system recorder 115 performs a pointer wrap operation. In doing so, the system recorder 115 redirects the address pointer to the beginning of the second block of the NVRAM 117. The system recorder 115 erases the memory space corresponding to a single previously logged message which occupies that memory space. Additional previously logged messages are erased only if more memory space is required to perform the present write operation.
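
The space check at step 742 and the wrap-and-erase behavior at step 744 can be sketched as a small circular-log helper. The sketch below simplifies the block into fixed message slots rather than the byte-level layout the text implies, so the structure, the slot count and the function names are all assumptions; what it is meant to illustrate is only the stated policy of wrapping the pointer to the start of the block and erasing one previously logged message at a time, as needed.

    #include <stdint.h>
    #include <stdbool.h>

    #define MAX_ENTRIES 256u             /* illustrative capacity of the block */

    typedef struct {
        bool     valid;
        uint16_t id;
        uint32_t timestamp;
        uint8_t  data[249];
        uint16_t len;
    } slot_t;

    /* Illustrative circular-log state for the second block. */
    typedef struct {
        slot_t   slot[MAX_ENTRIES];
        uint16_t next;                   /* index where the next write will go */
    } event_log_t;

    /* Steps 742-744: if the slot the pointer has reached is still occupied,
     * erase that single previously logged (oldest) message; further old
     * messages are erased only when later writes actually need their space.
     * Reaching the end of the block wraps the pointer back to index 0.      */
    uint16_t make_space(event_log_t *log)
    {
        if (log->next >= MAX_ENTRIES)
            log->next = 0;                       /* pointer wrap operation   */
        if (log->slot[log->next].valid)
            log->slot[log->next].valid = false;  /* erase one old message    */
        return log->next;
    }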




If the system recorder 115 determines that memory space is available in the second block of the NVRAM 117, then at step 746, the system recorder 115 fetches the time from the real-time clock 603 and stamps (i.e., appends) the real time to the message being written. As noted above, the real time comprises a four-byte field (i.e., 32 bits) which is appended to the message being written. At step 748, the system recorder 115 writes the time-stamped message to the second block of the NVRAM 117. At step 750, the system recorder 115 issues an interrupt return command to return to its idle state 700 (shown in FIG. 7A).
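
Putting steps 740 through 750 together, a write to the event block might look like the following sketch. The event_log_t/make_space() helpers are the hypothetical ones introduced in the previous sketch, and rtc_read_seconds() is an assumed accessor for the real-time clock 603; the actual firmware interface is not specified by the text.

    #include <stdint.h>
    #include <string.h>
    #include <stdbool.h>

    /* Types and helpers assumed from the previous sketch. */
    typedef struct { bool valid; uint16_t id; uint32_t timestamp;
                     uint8_t data[249]; uint16_t len; } slot_t;
    typedef struct { slot_t slot[256]; uint16_t next; } event_log_t;
    uint16_t make_space(event_log_t *log);       /* steps 742-744            */
    uint32_t rtc_read_seconds(void);             /* real-time clock 603      */

    /* Steps 746-750 of FIG. 7D: timestamp the message and write it.         */
    int log_event(event_log_t *log, uint16_t id,
                  const uint8_t *msg, uint16_t len)
    {
        if (len > sizeof(log->slot[0].data))
            return -1;                           /* exceeds 249-byte limit   */

        uint16_t idx = make_space(log);          /* steps 742-744            */
        slot_t *s = &log->slot[idx];

        s->id = id;
        s->timestamp = rtc_read_seconds();       /* step 746: 4-byte stamp   */
        memcpy(s->data, msg, len);               /* step 748: write message  */
        s->len = len;
        s->valid = true;

        log->next = (uint16_t)(idx + 1);         /* advance for next write   */
        return 0;                                /* step 750: return to idle */
    }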




A further description of the system recorder 115 and the NVRAM 117 can be found in a copending and commonly owned U.S. patent application entitled, “Black Box Recorder For Information System Events,” which is listed in Appendix A attached hereto.




FIGS. 8A-8D illustrate a flowchart of one embodiment of the process of reporting system failures in accordance with the invention. As the process is described below, reference is also made to FIG. 1, which illustrates a block diagram of one embodiment of the server system 100 which carries out the process shown in FIGS. 8A-8D.




Referring to FIG. 8A, the process starts at location 800 and proceeds to step 801 wherein a controller 109 monitors the server 100 for system failures. In step 803, a determination is made as to whether any system failures have been detected. If in step 803 no failures have been detected, the process moves back to step 801 and the controller 109 continues to monitor for system failures. If in step 803 a failure is detected, the process moves to step 805 in which the failure information is sent to the system recorder 115. In this step, the controller 109 sends failure information, such as the value of measured operation parameters which have been determined to be out of tolerance, to the system recorder 115 which assigns a time stamp to the failure event. Next, in step 807, the system recorder 115 logs the failure by storing the failure information, along with its time stamp, in the system log 117. In step 809, an event signal is sent to the system interface 105 and to the remote interface 119. The process then moves to step 811 as shown in FIG. 8B.
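
The loop of steps 800 through 809 can be summarized in a short sketch. The polling and signalling functions below are hypothetical stand-ins for the controller 109, system recorder 115, system interface 105 and remote interface 119, since the text describes these blocks only at the architectural level.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical stand-ins for the hardware blocks of FIG. 1. */
    bool controller_check_failure(uint16_t *id, uint8_t *info, uint16_t *len); /* 109     */
    void recorder_log_failure(uint16_t id, const uint8_t *info, uint16_t len); /* 115/117 */
    void system_interface_raise_event(uint16_t id);                            /* 105     */
    void remote_interface_raise_event(uint16_t id);                            /* 119     */

    /* Steps 800-809 of FIG. 8A: monitor, log, then signal both interfaces. */
    void failure_monitor_loop(void)
    {
        uint16_t id, len;
        uint8_t  info[249];

        for (;;) {                                    /* steps 800-803: monitor */
            if (!controller_check_failure(&id, info, &len))
                continue;                             /* no failure detected    */
            recorder_log_failure(id, info, len);      /* steps 805-807: log     */
            system_interface_raise_event(id);         /* step 809               */
            remote_interface_raise_event(id);         /* step 809               */
        }
    }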




Referring to FIG. 8B, in step 811, an interrupt signal is sent to the CPU 101 of the server system. Alternatively, the CPU 101 may be periodically monitoring the system interface 105, in which case the CPU 101 will detect that an event signal has been received by the system interface 105. In step 813, the CPU 101 reads the event from the system interface 105. Thereafter, in step 815, the CPU 101 notifies a system operator or administrator of the event, who may then take appropriate measures to correct the failure condition. In one embodiment, the CPU 101 may notify a system operator by displaying an error or event message on a monitor coupled to the CPU 101, or the CPU 101 may simply illuminate a light emitting diode (LED) which indicates that a system failure has been detected. At this point, the system operator may decide to ignore the event message or obtain more information about the event by accessing the system log 117 for the failure information which was stored in it in step 807. By means of operating system software executed by the CPU 101 and the communications protocol established by the system interface 105, the system operator can access this failure information from the system log 117. Additionally, the CPU 101 may take remedial actions on its own initiative (programming). For example, if a critical system failure has been detected, e.g., a system temperature is above a critical threshold, the CPU 101 may back up all currently running files (core dump into back-up memory space) and then shut down the server system.
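
On the CPU side, either the interrupt of step 811 or periodic polling of the system interface leads to the same handling path. The sketch below assumes hypothetical accessor functions for the system interface's status register and bit vector, and a hypothetical bit assignment for a critical-temperature event; the text defines none of these names.

    #include <stdint.h>

    /* Assumed accessors for system interface 105 (names are illustrative). */
    uint8_t  sysif_read_status(void);        /* non-zero when an event arrived */
    uint32_t sysif_read_bit_vector(void);    /* one bit per failure type       */
    void     notify_operator(uint32_t events);
    void     emergency_backup_and_shutdown(void);

    #define EVT_TEMP_CRITICAL  (1u << 0)     /* hypothetical bit assignment    */

    /* Steps 811-815: invoked from the interrupt handler, or from a periodic
     * poll of the status register when interrupts are not used.              */
    void cpu_handle_event(void)
    {
        if (sysif_read_status() == 0)
            return;                                   /* nothing pending       */

        uint32_t events = sysif_read_bit_vector();    /* step 813: read event  */
        notify_operator(events);                      /* step 815: message/LED */

        if (events & EVT_TEMP_CRITICAL)               /* example remedial step */
            emergency_backup_and_shutdown();
    }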




In step 817, the CPU 101 decides whether to call out to a local or remote computer in order to notify it of the event. Particular types of events may warrant a call-out to either a local or remote computer in order to notify important personnel or administrators of a particular problem, while other types of events may not. If in step 817 it is determined that the particular event does not warrant a call-out to a local or remote computer, the process ends at step 819. On the other hand, if the CPU 101 decides that a call-out is warranted, the process moves to step 821 as shown in FIG. 8C.




Referring to FIG. 8C, in step 821, the CPU 101 will determine whether the call-out is to be made to a local computer 123, connected to the server system 100 via a local communication line 127 such as an RS232 line, or to a remote computer 125, connected to the server system 100 via a modem-to-modem connection. If in step 821 it is determined that a call-out to a local computer 123 is to be made, the function of step 823 is implemented wherein the operating system sets the call-out switch 121 to the local connection mode. In step 825, the remote interface 119 notifies the local computer 123 that an event signal has been received. Thereafter, in step 827, the local computer reads the event message from the remote interface 119. Upon reading the event message, in step 829, the local computer 123 may notify a local user of the event condition and/or take other appropriate measures. Depending on the software program running on the operating system of the local computer, the local computer 123 may notify the local user by displaying an error or event message on a monitor of the local computer 123, or the local computer 123 may simply illuminate a light emitting diode (LED) which indicates that a system failure has been detected. At this point, the local user may decide to ignore the event message or obtain more information about the event by accessing the system log for the failure information which was stored in it in step 807. The local user may then contact appropriate personnel located at the site where the server is located and inform and/or instruct such personnel to remedy the problem. Or, the local user may travel to the site himself or herself in order to fix the problem. The process then ends at step 819.
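
Steps 821 through 831 amount to selecting a connection mode on the call-out switch and then pushing the notification through the remote interface. The sketch below models that decision; the switch-control and notification functions are assumptions rather than an interface defined in the text, and remote_callout() stands in for the FIG. 8D path described next.

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { CALLOUT_LOCAL, CALLOUT_REMOTE } callout_mode_t;

    /* Assumed helpers for call-out switch 121 and remote interface 119.   */
    void callout_switch_set(callout_mode_t mode);          /* step 823/831 */
    void remote_interface_notify(uint32_t events);         /* step 825/841 */
    bool remote_callout(uint32_t events);                  /* FIG. 8D path */

    /* Steps 821-829 (local path) and hand-off to FIG. 8D (remote path).   */
    bool perform_callout(bool use_local, uint32_t events)
    {
        if (use_local) {
            callout_switch_set(CALLOUT_LOCAL);    /* step 823                  */
            remote_interface_notify(events);      /* steps 825-829: local path */
            return true;
        }
        callout_switch_set(CALLOUT_REMOTE);       /* step 831                  */
        return remote_callout(events);            /* steps 833-845, FIG. 8D    */
    }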




If in step 821 it is determined that a call-out is to be made to a remote computer, the process proceeds to step 831 wherein the call-out switch 121 is set to a remote connection mode. The process then moves to step 833 as shown in FIG. 8D. In step 833, the CPU 101 of the server system determines whether the remote computer 125 has security authorization to receive the event information and access the system log. This function may be accomplished by receiving a password from the remote computer, or receiving an encrypted identification signal from the remote computer, and verifying that it matches the server's password or identification signal. However, other methods of providing secure transmissions between a host system and a remote system which are known in the art may be utilized in accordance with the invention. If in step 833 security authorization has not been established, the process ends at step 819. However, if in step 833 security authorization is established, the process proceeds to step 835, wherein the remote interface 119 dials out through the modem-to-modem connection to establish a communication link with the remote computer 125. The dial-out number is automatically provided to the remote interface 119 by the CPU 101, and in one embodiment a list of dial-out numbers may be stored in the system log 117.




In step 837, the remote interface 119 checks whether a good communication link has been established by determining whether data set ready (DSR) and data carrier detect (DCD) signals have been communicated between a server modem 131 and a remote modem 129. The DSR and DCD signals are common signals used in modem-to-modem handshake protocols. However, any protocol for verifying an active modem-to-modem communication link which is known in the art may be utilized in accordance with the invention. If in step 837 it is determined that a good communication link cannot be established, the process proceeds to step 839 wherein the CPU 101 reports that the call-out failed. The process then ends in step 819.
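
A sketch of the remote path of FIG. 8D follows, covering the authorization check of step 833, the dial-out of step 835 and the DSR/DCD link check of step 837. All of the modem- and security-related functions here are hypothetical, since the text deliberately leaves the exact handshake and authentication mechanisms open.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed modem/security helpers (not defined by the text). */
    bool remote_is_authorized(void);                 /* step 833: password/ID */
    bool modem_dial(const char *number);             /* step 835: dial out    */
    bool modem_dsr_asserted(void);                   /* step 837: DSR         */
    bool modem_dcd_asserted(void);                   /* step 837: DCD         */
    void remote_interface_notify(uint32_t events);   /* step 841              */
    void report_callout_failed(void);                /* step 839              */

    /* Steps 833-841 of FIG. 8D for one dial-out number.                     */
    bool remote_callout_one(const char *number, uint32_t events)
    {
        if (!remote_is_authorized())                 /* step 833              */
            return false;

        if (!modem_dial(number))                     /* step 835              */
            goto fail;

        /* Step 837: a good link requires both handshake signals.            */
        if (!(modem_dsr_asserted() && modem_dcd_asserted()))
            goto fail;

        remote_interface_notify(events);             /* step 841              */
        return true;

    fail:
        report_callout_failed();                     /* step 839              */
        return false;
    }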




If in step 837 it is determined that a good communication link has been established, the remote interface 119, in step 841, notifies the remote computer 125 that an event signal has been received. In step 843, the remote computer reads the event from the remote interface 119 by reading a bit vector within the remote interface 119. In step 845, after reading the event in step 843, the remote computer 125 notifies a remote user of the event condition and/or takes other appropriate measures. Depending on the software program running on the operating system of the remote computer 125, the remote computer 125 may notify a remote user by displaying an error or event message on a monitor of the remote computer 125, or the remote computer 125 may simply illuminate a light emitting diode (LED) which indicates that a system failure has been detected. At this point, the remote user may decide to ignore the event message or obtain more information about the event by accessing the system log for the failure information which was stored in it in step 807. The process then ends at step 819.




As described above, the invention provides a fast and efficient method of detecting system failures and/or events and reporting such failures and events to a client, system operator, or control center of a server system. By logging failure information into a system log, a system operator or client can ascertain the nature of a particular problem and thereafter make an informed decision as to what steps may be required to correct the system error or failure. By providing this type of failure reporting system, the invention alleviates much confusion and frustration on the part of system users which would otherwise result. Additionally, by quickly reporting such failures, the amount of downtime of the server system is reduced.




The invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims, rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.




Appendix A




Incorporation by Reference of Commonly Owned Applications




The following patent applications, commonly owned and filed on the same day as the present application, are hereby incorporated herein in their entirety by reference thereto:


















Title                                                                          Attorney Docket No.

“System Architecture for Remote Access and Control of Environmental Management”   MNFRAME.002A1
“Method of Remote Access and Control of Environmental Management”   MNFRAME.002A2
“System for Independent Powering of Diagnostic Processes on a Computer System”   MNFRAME.002A3
“Method of Independent Powering of Diagnostic Processes on a Computer System”   MNFRAME.002A4
“Diagnostic and Managing Distributed Processor System”   MNFRAME.005A1
“Method for Managing a Distributed Processor System”   MNFRAME.005A2
“System for Mapping Environmental Resources to Memory for Program Access”   MNFRAME.005A3
“Method for Mapping Environmental Resources to Memory for Program Access”   MNFRAME.005A4
“Hot Add of Devices Software Architecture”   MNFRAME.006A1
“Method for The Hot Add of Devices”   MNFRAME.006A2
“Hot Swap of Devices Software Architecture”   MNFRAME.006A3
“Method for The Hot Swap of Devices”   MNFRAME.006A4
“Method for the Hot Add of a Network Adapter on a System Including a Dynamically Loaded Adapter Driver”   MNFRAME.006A5
“Method for the Hot Add of a Mass Storage Adapter on a System Including a Statically Loaded Adapter Driver”   MNFRAME.006A6
“Method for the Hot Add of a Network Adapter on a System Including a Statically Loaded Adapter Driver”   MNFRAME.006A7
“Method for the Hot Add of a Mass Storage Adapter on a System Including a Dynamically Loaded Adapter Driver”   MNFRAME.006A8
“Method for the Hot Swap of a Network Adapter on a System Including a Dynamically Loaded Adapter Driver”   MNFRAME.006A9
“Method for the Hot Swap of a Mass Storage Adapter on a System Including a Statically Loaded Adapter Driver”   MNFRAME.006A10
“Method for the Hot Swap of a Network Adapter on a System Including a Statically Loaded Adapter Driver”   MNFRAME.006A11
“Method for the Hot Swap of a Mass Storage Adapter on a System Including a Dynamically Loaded Adapter Driver”   MNFRAME.006A12
“Method of Performing an Extensive Diagnostic Test in Conjunction with a BIOS Test Routine”   MNFRAME.008A
“Apparatus for Performing an Extensive Diagnostic Test in Conjunction with a BIOS Test Routine”   MNFRAME.009A
“Configuration Management Method for Hot Adding and Hot Replacing Devices”   MNFRAME.010A
“Configuration Management System for Hot Adding and Hot Replacing Devices”   MNFRAME.011A
“Apparatus for Interfacing Buses”   MNFRAME.012A
“Method for Interfacing Buses”   MNFRAME.013A
“Computer Fan Speed Control Device”   MNFRAME.016A
“Computer Fan Speed Control Method”   MNFRAME.017A
“System for Powering Up and Powering Down a Server”   MNFRAME.018A
“Method of Powering Up and Powering Down a Server”   MNFRAME.019A
“System for Resetting a Server”   MNFRAME.020A
“Method of Resetting a Server”   MNFRAME.021A
“System for Displaying Flight Recorder”   MNFRAME.022A
“Method of Displaying Flight Recorder”   MNFRAME.023A
“Synchronous Communication Interface”   MNFRAME.024A
“Synchronous Communication Emulation”   MNFRAME.025A
“Software System Facilitating the Replacement or Insertion of Devices in a Computer System”   MNFRAME.026A
“Method for Facilitating the Replacement or Insertion of Devices in a Computer System”   MNFRAME.027A
“System Management Graphical User Interface”   MNFRAME.028A
“Display of System Information”   MNFRAME.029A
“Data Management System Supporting Hot Plug Operations on a Computer”   MNFRAME.030A
“Data Management Method Supporting Hot Plug Operations on a Computer”   MNFRAME.031A
“Alert Configurator and Manager”   MNFRAME.032A
“Managing Computer System Alerts”   MNFRAME.033A
“Computer Fan Speed Control System”   MNFRAME.034A
“Computer Fan Speed Control System Method”   MNFRAME.035A
“Black Box Recorder for Information System Events”   MNFRAME.036A
“Method of Recording Information System Events”   MNFRAME.037A
“Method for Automatically Reporting a System Failure in a Server”   MNFRAME.040A
“System for Automatically Reporting a System Failure in a Server”   MNFRAME.041A
“Expansion of PCI Bus Loading Capacity”   MNFRAME.042A
“Method for Expanding PCI Bus Loading Capacity”   MNFRAME.043A
“System for Displaying System Status”   MNFRAME.044A
“Method of Displaying System Status”   MNFRAME.045A
“Fault Tolerant Computer System”   MNFRAME.046A
“Method for Hot Swapping of Network Components”   MNFRAME.047A
“A Method for Communicating a Software Generated Pulse Waveform Between Two Servers in a Network”   MNFRAME.048A
“A System for Communicating a Software Generated Pulse Waveform Between Two Servers in a Network”   MNFRAME.049A
“Method for Clustering Software Applications”   MNFRAME.050A
“System for Clustering Software Applications”   MNFRAME.051A
“Method for Automatically Configuring a Server after Hot Add of a Device”   MNFRAME.052A
“System for Automatically Configuring a Server after Hot Add of a Device”   MNFRAME.053A
“Method of Automatically Configuring and Formatting a Computer System and Installing Software”   MNFRAME.054A
“System for Automatically Configuring and Formatting a Computer System and Installing Software”   MNFRAME.055A
“Determining Slot Numbers in a Computer”   MNFRAME.056A
“System for Detecting Errors in a Network”   MNFRAME.058A
“Method of Detecting Errors in a Network”   MNFRAME.059A
“System for Detecting Network Errors”   MNFRAME.060A
“Method of Detecting Network Errors”   MNFRAME.061A













Claims
  • 1. A system for reporting a failure condition in a server system, comprising:a controller which monitors the server system for system failures, and generates an event signal and failure information if a system failure is detected; a system interface, coupled to the controller, which receives the event signal and failure information; a central processing unit, coupled to the system interface, wherein, upon receiving the event signal, the system interface reports an occurrence of an event to the central processing unit; and a system log which receives failure information communicated from the system interface and stores said failure information.
  • 2. The system of claim 1 wherein the system log is a nonvolatile random access memory.
  • 3. The system of claim 1 wherein the system interface comprises a bit vector, having a plurality of bits, which receives the event signal and stores a value corresponding to the event signal, wherein the event signal changes the value of at least one bit of the bit vector.
  • 4. The system of claim 1 further comprising a system recorder, coupled between the controller and the system log, for receiving the failure information from the controller, assigning a time value to the failure information, and subsequently storing the failure information with the time value into the system log.
  • 5. The system of claim 1 wherein the central processing unit executes a software program which allows a system operator to access the system log to read the failure information.
  • 6. The system of claim 5 further comprising a monitor coupled to the central processing unit for displaying a message to the system operator.
  • 7. The system of claim 1 further comprising a remote interface, coupled to the controller, for receiving the event signal and reporting an occurrence of an event to a computer external to the server system.
  • 8. The system of claim 7 wherein the remote interface comprises a bit vector, having a plurality of bits, which receives the event signal and stores a value corresponding to the event signal, wherein the event signal changes the value of at least one bit of the bit vector.
  • 9. The system of claim 7 wherein the computer stores and executes a software program which allows a user of the computer to access the system log to read the failure information.
  • 10. The system of claim 7 further comprising a switch, coupled to the remote interface, for switching connectivity to the remote interface between a first computer and a second computer.
  • 11. The system of claim 10 wherein the first computer is a local computer, coupled to the switch via a local communications line, and the second computer is a remote computer, coupled to the switch via a modem-to-modem connection.
  • 12. A failure reporting system for a server system, comprising:a controller which monitors the server system for system failures and generates an event signal and failure information if a system failure is detected; a system recorder, coupled to the controller, which receives failure information and assigns a time value to the failure information; a system log which stores failure information received from the system recorder; and a system interface, coupled to the controller, which receives and stores the event signal, and reports an occurrence of an event to a central processing unit which is coupled to the system interface, wherein the central processing unit executes a software program which allows a system operator to access the system log to read failure information stored therein.
  • 13. The system of claim 12 wherein the system log is a nonvolatile random access memory.
  • 14. The system of claim 12 wherein the system interface comprises a bit vector which receives the event signal and stores a value corresponding to the event signal, wherein the event signal changes the value of at least one bit of the bit vector.
  • 15. The system of claim 12 further comprising a remote interface, coupled to the controller, which receives the event signal and reports the occurrence of an event to a computer external to the server system.
  • 16. The system of claim 15 wherein the remote interface comprises a bit vector which receives the event signal and stores a value corresponding to the event signal, wherein the event signal sets at least one bit of the bit vector to indicate that a system failure has occurred.
  • 17. The system of claim 15 further comprising a switch, coupled to the remote interface, which switches connectivity to the remote interface between a first computer and a second computer.
  • 18. The system of claim 17 wherein the first computer is a local computer, coupled to the switch via a local communications line, and the second computer is a remote computer, coupled to the switch via a modem connection.
  • 19. A failure reporting system for a server system, comprising:a controller which monitors the server system for system failures and generates an event signal and failure information if a system failure is detected; a system recorder, coupled to the controller, which receives the failure information and assigns a date and time to the failure information; a system log which stores the failure information; a system interface, coupled to the controller, which receives and stores the event signal and reports an occurrence of an event to a central processing unit, coupled to the system interface, wherein the central processing unit executes a software program which allows a system operator to access the system log to read failure information stored therein; a remote interface, coupled to the controller, which receives the event signal and reports the occurrence of an event to a computer external to the server system; and a switch, coupled to the remote interface, which switches connectivity to the remote interface between a first computer and a second computer, wherein the first computer is a local computer, coupled to the switch via a local communications line, and the second computer is a remote computer, coupled to the switch via a modem connection.
  • 20. A failure reporting system in a server system, comprising:means for detecting a system failure condition; means for transmitting failure information related to the failure condition to a system recorder; means for storing the failure information; and means for reporting an occurrence of an event to a central processing unit of the server system.
  • 21. The system of claim 20 further comprising means for notifying a human operator of the system failure.
  • 22. The system of claim 21 wherein the means for notifying a human operator comprises means for displaying a message on a monitor coupled to the central processing unit.
  • 23. The system of claim 21 further comprising means for accessing the system log to read the failure information from the system log.
  • 24. The system of claim 20 further comprising means for determining a time when the failure condition occurred and means for storing the time with the failure information.
  • 25. The system of claim 20 wherein the means for reporting the occurrence of the event to the central processing unit comprises:means for sending an event signal to a system interface, coupled to the central processing unit; means for setting a bit in a bit vector within the system interface, wherein the setting of the bit corresponds to a specified type of system failure; and means for sending an interrupt signal to the central processing unit after the bit is set, wherein, upon receiving the interrupt signal the central processing unit reads a status register within the system interface to ascertain that the event signal has been received by the system interface.
  • 26. The system of claim 25 further comprising means for reading the bit vector to ascertain the type of system failure.
  • 27. The system of claim 20 wherein the means for reporting the occurrence of the event to the central processing unit comprises: means for sending an event signal to a system interface, coupled to the central processing unit; means for setting a bit in a bit vector within the system interface, wherein the setting of the bit corresponds to a specified type of system failure; and means for setting a status of a status register within the system interface to indicate the occurrence of the event, wherein the central processing unit monitors the status register within the system interface at specified periodic intervals.
  • 28. The system of claim 27 further comprising means for reading the bit vector to ascertain the type of system failure.
  • 29. A system for reporting a failure condition in a server system, comprising:means for detecting the failure condition; means for generating and transmitting failure information related to the failure condition to a system recorder; means for assigning a time value to the failure information; means for storing the failure information and its time value into a system log; means for reporting an occurrence of an event to a local computer coupled to the server system via a remote interface; means for accessing the system log; and means for reading the failure information.
  • 30. The system of claim 29 wherein the means for reporting the occurrence of the event to the local computer comprises:means for sending an event signal to the remote interface; means for setting a bit in a bit vector within the remote interface, wherein the setting of the bit corresponds to a specified type of system failure; and means for notifying the local computer that the event signal has been received by the remote interface.
  • 31. The system of claim 30 wherein the means for notifying the local computer comprises means for transmitting a ready-to-read signal to the local computer, wherein, upon receiving the ready-to-read signal, the local computer interrogates the remote interface to ascertain that the bit in the bit vector has been set.
  • 32. The system of claim 31 further comprising means for notifying a local operator, who is using the local computer, of the system failure.
  • 33. The system of claim 32 wherein the means for notifying the local operator comprises means for displaying a message on a monitor coupled to the local computer.
  • 34. A system for reporting a failure condition in a server system, comprising:means for detecting the failure condition; means for generating and transmitting failure information related to the failure condition across a control bus from a first microcontroller to a system recorder microcontroller; means for assigning a time value to the failure information; means for storing the failure information and its time value into a system log; means for reporting an occurrence of an event to a remote computer coupled to the server system via a remote interface, wherein the remote computer is connected to the remote interface via a modem connection; means for accessing the system log via the system recorder microcontroller; and means for reading the failure information.
  • 35. The system of claim 34 wherein the means for reporting the occurrence of the event to the remote computer comprises:means for sending an event signal to the remote interface; means for setting a bit in a bit vector within the remote interface, wherein the setting of the bit corresponds to a specified type of system failure; and means for notifying the remote computer that the event signal has been received by the remote interface.
  • 36. The system of claim 35 wherein the means for notifying the remote computer comprises:means for automatically calling a modem number corresponding to a modem coupled to the remote computer, wherein, upon receiving the call, the remote computer interrogates the remote interface to ascertain that the bit in the bit vector has been set.
  • 37. The system of claim 36 further comprising:means for verifying that the remote computer is authorized to access the server system via the remote interface; and means for verifying that a communication link has been established between the remote computer and the remote interface.
  • 38. The system of claim 34 further comprising means for notifying a remote operator, who is using the remote computer, of the system failure.
  • 39. The system of claim 38 wherein the means for notifying the remote operator comprises means for displaying a message on a monitor coupled to the remote computer.
  • 40. A program storage device storing instructions that when executed by a computer perform a method, wherein the method comprises:detecting a system failure condition; transmitting failure information related to the failure condition to a system recorder; storing the failure information in a system log; and reporting an occurrence of the failure condition to a central processing unit.
  • 41. The device of claim 40 wherein the method further comprises notifying an operator of the system failure.
  • 42. The device of claim 41 wherein the act of notifying an operator comprises displaying a message on a monitor coupled to the central processing unit.
  • 43. The device of claim 41 wherein the method further comprises accessing the system log to read the failure information from the system log.
  • 44. The device of claim 40 wherein the method further comprises determining when the failure condition occurred and storing a representation of when the failure condition occurred in the system log.
  • 45. The device of claim 40 wherein the act of reporting the occurrence of the failure condition to the central processing unit comprises:sending an event signal to a system interface, coupled to the central processing unit; setting a bit in a bit vector within the system interface, wherein the setting of the bit corresponds to a specified type of system failure; and sending an interrupt signal to the central processing unit after the bit is set, wherein, upon receiving the interrupt signal the central processing unit reads a status register within the system interface to ascertain that the event signal has been received by the system interface.
  • 46. The device of claim 45 wherein the method further comprises reading the bit vector to ascertain a type of event.
  • 47. The device of claim 40 wherein the act of reporting the occurrence of the failure condition to the central processing unit comprises:sending an event signal to a system interface, coupled to the central processing unit; setting a bit in a bit vector within the system interface, wherein the setting of the bit corresponds to a specified type of system failure; and setting a status of a status register within the system interface to indicate the occurrence of the event, wherein the central processing unit monitors the status register within the system interface at specified periodic intervals.
  • 48. The device of claim 47 wherein the method further comprises reading the bit vector to ascertain a type of event.
  • 49. The device of claim 40 wherein the method further comprises reporting the occurrence of the failure condition to a local computer connected to the server system via a remote interface.
  • 50. The device of claim 49 wherein the act of reporting the occurrence of the failure condition to the local computer comprises:sending an event signal to the remote interface; setting a bit in a bit vector within the remote interface, wherein the setting of the bit corresponds to a specified type of system failure; and notifying the local computer that the event signal has been received by the remote interface.
  • 51. The device of claim 50 wherein the act of notifying the local computer comprises transmitting a ready-to-read signal to the local computer, wherein, upon receiving the ready-to-read signal, the local computer interrogates the remote interface to ascertain that the bit in the bit vector has been set.
  • 52. The device of claim 51 wherein the method further comprises notifying a local operator, who is using the local computer, of the system failure.
  • 53. The device of claim 52 wherein the act of notifying the local operator comprises displaying a message on a monitor coupled to the local computer.
  • 54. The device of claim 52 wherein the method further comprises accessing the system log through the local computer to read the failure information.
  • 55. The device of claim 40 wherein the method further comprises reporting the occurrence of the failure condition to a remote computer connected to the server system via a remote interface, wherein the remote computer is connected to the remote interface via a modem-to-modem connection.
  • 56. The device of claim 55 wherein the act of reporting the occurrence of the failure condition to the remote computer comprises:sending an event signal to the remote interface; setting a bit in a bit vector within the remote interface, wherein the setting of the bit corresponds to a specified type of system failure; and notifying the remote computer that the event signal has been received by the remote interface.
  • 57. The device of claim 56 wherein the act of notifying the remote computer comprises:automatically calling a phone number corresponding to a modem coupled to the remote computer, wherein, upon receiving the call, the remote computer interrogates the remote interface to ascertain that the bit in the bit vector has been set.
  • 58. The device of claim 57 wherein the method further comprises:verifying that the remote computer is authorized to access the server system via the remote interface; and verifying that a communication link has been established between the remote computer and the remote interface.
  • 59. The device of claim 57 wherein the method further comprises notifying a remote operator, who is using the remote computer, of the system failure.
  • 60. The device of claim 59 wherein the act of notifying the remote operator comprises displaying a message on a monitor coupled to the remote computer.
  • 61. The device of claim 59 wherein the method further comprises accessing the system log through the remote computer to read the failure information.
PRIORITY CLAIM

The benefit under 35 U.S.C. § 119(e) of the U.S. provisional applications listed below is hereby claimed. This application is related to U.S. application Ser. No. 08/942,168, entitled “Method For Automatically Reporting A System Failure In A Server,” which is being filed concurrently herewith.

Provisional Applications (3)
Number Date Country
60/046397 May 1997 US
60/047016 May 1997 US
60/046416 May 1997 US