The invention relates to the automated monitoring of the status of computer systems on a network through a central apparatus, and to detecting and evaluating errors for such computer systems in particular.
In the past, enterprise computer networks were primarily mainframes with terminals at remote locations. The network manager had complete control over the network and processing, as it was performed at the mainframe in these systems. Now, personal computers (PCs) provide major processing power, replacing terminals on the desktop, and are most often tied into networks. This has presented some challenges in maintaining operation of such computers and networks. Unlike mainframe managers, the manager of a PC network has limited control over and reduced information concerning the respective client computers on the network. This problem has grown as the number of computers on networks has increased, at times numbering over 10,000 client PCs on a single network.
With personal computers replacing terminals on a network, far more can go wrong that is outside the network manager's knowledge or control. The flexibility and power of personal computers are valuable to the end user but limit the ability of the network manager to perform and control network maintenance and support. Network managers have less information available and on hand when maintaining a remote client PC network which is regularly unattended. It has become more difficult to manage and control such PC networks as personal computer complexity and power increase. Moreover, clients are becoming increasingly dependent on their systems and are less tolerant of downtime. Compounding these issues, today's powerful systems are difficult for most users to manage or repair when abnormal operation such as a fault, a system interruption or another error occurs.
Accordingly, a method and apparatus to track the status or state of an increasing number of networked client PCs has become necessary. It would be beneficial to automate such tracking and provide additional information concerning the network addition, connection status, activity status, error status and test status of individual client PCs on a network. It would also be beneficial to improve the knowledge base of such networked PC clients. This will improve response time of personnel responsible for servicing a particular networked PC and will improve the ability of service personnel to detect and evaluate possible errors and state changes for the registered PC clients.
In addition, it would be beneficial to manage such a network and its individual client PCs from a central apparatus or database, use information gathered through remote databases as well as from common conditions of individual PCs on the network, and provide automatic notification to a variety of users in specific subsets on a PC by PC basis. By automating an apparatus, method and computer program to detect an individual machine's status or state using a certain set of criteria, we can effectively and efficiently manage the vast number of computers on a large and growing network.
The present invention provides a method, apparatus and computer program product for monitoring a group of attached devices or machines such as client computers or personal computers (computers, PCs or machines including but not limited to tape drives, disk subsystems and similar storage devices) on a network. Monitoring is achieved through notifying a central database of the current state of each of the computers, machines and the like coupled to the database through the network. The method and apparatus of the present invention include software and hardware for obtaining the status or state of each of the networked computers or machines, determining if a new computer has been connected to the network, setting a state for each new computer and determining if the current state of any one of the plurality of computers has not been obtained within a prescribed interval of time or if any one of the networked computers has not notified the central database of its current state within a prescribed interval of time. The method and apparatus also include hardware and software for determining if any one of the networked computers is in an inactive state, determining if any one of the networked computers has changed its state to inactive and updating the central database to reflect the current state of each of the networked computers. A list of users to be notified of the state of each PC, along with the reason for any change in state of each computer, is compiled from the data in the central database and can be sent automatically to each of the users on the list.
Each of the computers on the network can send a signal to the database at a predetermined time interval and this signal is used by the database to detect when one or more of the networked computers is not operating or not operating properly. If one or more of the PCs has not notified the central database of its current state within a prescribed period of time, the method and apparatus of the present invention can adapt or vary the prescribed period of time to allow for a longer period of time or a shorter period of time as required by the networked PC. The system of the present invention will then update the central database and use either the longer period of time or the shorter period of time as a new prescribed interval for determining if any one of the networked computers has not notified the central database of its current state.
The method, apparatus and computer program product of the present invention will determine the reason for any change in state of each of the networked computers and send an automated alert to a list of users. This automated alert can take the form of an automated formatted email to said list of users. The list of users can be managed to provide a subset of the networked computers and a subset of users from the list of users to notify through the automated formatted email.
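By way of illustration only, the automated formatted email described above might be assembled as in the following minimal sketch. The function name `build_alert`, the subject and body layout, and the placeholder sender address are assumptions made for this example and are not taken from the specification; the sketch only formats the message and does not send it.

```python
from email.message import EmailMessage
from typing import List


def build_alert(machine_id: str, new_state: str, reason: str,
                recipients: List[str],
                sender: str = "monitor@example.invalid") -> EmailMessage:
    """Format one alert email for a single machine's change of state."""
    msg = EmailMessage()
    msg["From"] = sender  # hypothetical sender address
    msg["To"] = ", ".join(recipients)
    msg["Subject"] = f"[call home] {machine_id} is now {new_state}"
    # Plain-text body carrying the state and the reason for the change.
    msg.set_content(
        f"Machine: {machine_id}\n"
        f"State: {new_state}\n"
        f"Reason: {reason}\n"
    )
    return msg
```

The resulting message could then be handed to any mail transport; restricting `recipients` to a subset of the list of users corresponds to the managed subsets described above.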
The method, apparatus and computer code of the present invention also include the ability to check an outside database or compare the similarities of a subset of computers on the network to determine the reason for any change in state of one or more of the computers on the network. As an alternative or supplement to checking an outside database, one of the computers on the network can be compared to another networked computer to assist in determining the reason for any change in state or status of any computer or group of computers on the network.
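One simple way to compare the similarities of a subset of computers is to look for configuration attributes that all of the affected machines share. The following sketch assumes each machine's configuration is available as a flat mapping of attribute names to values; the function name and the attribute names used in the usage note are hypothetical, and a shared attribute is only a candidate reason, not a diagnosis.

```python
from typing import Dict, List


def shared_attributes(configs: List[Dict[str, str]]) -> Dict[str, str]:
    """Return the attribute/value pairs common to every given configuration."""
    if not configs:
        return {}
    common = set(configs[0].items())
    for cfg in configs[1:]:
        # Keep only the pairs that also appear in this machine's configuration.
        common &= set(cfg.items())
    return dict(common)
```

For instance, if every machine that recently changed to an inactive state reports the same firmware level but different sites, the firmware level is returned as a common condition worth investigating.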
Presently preferred implementations for the invention will now be described in detail with reference to the drawings wherein:
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that these specific details need not be used to practice the present invention. In other instances, well known structures, interfaces, and processes have not been shown in detail in order not to unnecessarily obscure the present invention.
Monitoring the status of a group of attached devices or machines such as client computers, personal computers or storage devices, including but not limited to tape subsystems, disk subsystems and the like, will identify computers, machines or devices that have not reported or called home over the network for any number of reasons. In addition, such monitoring will detect and evaluate errors for such computers, machines or devices. By routinely monitoring the status of the networked machines, the present invention will reduce the response time to remedy the inactivity, error or other fault by informing the correct individuals about the situation. Referring to
Each data package 102 sent can be considered a call home instance which provides information about the status or state of one or more machines 100 on the network. A call home instance occurs when a machine or group of machines sends a set of data packages 102 to a central apparatus or database 108 to be collected, parsed, and stored. One or more of three types of call home data packages 102 are transmitted over the network to monitor system status of one or more PCs on the network and implement the present invention—a heartbeat data package, an error initiated data package, and a test data package.
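The three types of call home data package can be represented, for illustration, as a small typed record. The sketch below is an assumption about how a received package might look once parsed; the field names (`machine_id`, `type`, `payload`) and the function `parse_call_home` are hypothetical and are not defined by the specification.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class PackageType(Enum):
    HEARTBEAT = "heartbeat"
    ERROR = "error"
    TEST = "test"


@dataclass(frozen=True)
class CallHome:
    machine_id: str
    package_type: PackageType
    received_at: datetime
    payload: dict


def parse_call_home(raw: dict, received_at: datetime) -> CallHome:
    """Turn a raw record into a typed call home instance."""
    return CallHome(
        machine_id=raw["machine_id"],
        # Enum lookup raises ValueError for any type other than the three above.
        package_type=PackageType(raw["type"]),
        received_at=received_at,
        payload=raw.get("payload", {}),
    )
```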
A heartbeat data package is sent at a specific time interval to report that the machine or group of machines 100 is performing properly and that it is still connected to the network. A heartbeat data package is a sequence of messages which are sent by a machine or group of machines 100 to a receiving source such as a database 108 or the like. The receiving database 108 includes an interface or computer code to measure the time interval between receiving successive heartbeat data packages from a machine or a group of machines 100. If database 108 does not receive a heartbeat data package prior to the expiration of a predetermined length of time, then the machine or group of machines is suspected of not being connected to the network. One way to determine if a machine has not called home in a given amount of time is to look at the dates of each machine's last call home, or query the database for all of the machines that have not called home over a given period of time. It is the regularity or the specific timing of the report of healthy data and the measurability thereof that allows the method and apparatus of the present invention to infer that there is a problem with a given machine. This problem can be inferred due to the regularity of the expected data package 102 even with no direct connection, indication or report of a problem from a given machine.
An error initiated data package is sent when one of many errors occurs on one or more machines 100. Upon initiation of an error data package, the machines 100 collect the status of all of the appropriate devices and send a response which is collected, parsed and stored within database 108 for use by support personnel. This will allow the support personnel to evaluate the problem remotely and choose the correct maintenance action.
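The inference described above, that silence beyond the expected interval indicates a problem, can be sketched as a comparison of each machine's last call home date against a cutoff. The function name and the in-memory mapping below are assumptions for illustration; an actual implementation would more likely query database 108 directly.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def overdue_machines(last_call_home: Dict[str, datetime],
                     max_silence: timedelta,
                     now: datetime) -> List[str]:
    """Return IDs of machines whose last call home is older than max_silence."""
    cutoff = now - max_silence
    # A machine is suspect when its most recent call home predates the cutoff.
    return sorted(m for m, seen in last_call_home.items() if seen < cutoff)
```

Note that no report of a fault is needed from the machine itself; the absence of an expected heartbeat is the only input.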
A test data package is initiated and sent by a computer engineer or other network maintenance personnel to ensure that a machine or a group of machines 100 is successfully connected to the network and that all of the features of the machine or the group are performing correctly. The test data package collects the status of all of the polled machines 100 and their respective devices and sends a response which is received, parsed and stored within database 108 for use by the system of the present invention to inform support personnel of any errors or other faults on any of the network machines 100. This response may indicate a change in status of one or more machines 100 which is used to report any problems with any machine 100 on the network.
Referring to
Machines 100 that are determined to be overdue for a call home will cause the automated process at step 210 to contact a plurality of registered parties including support personnel. In certain cases, a machine may not be registered to any party. In that case, no one will be alerted that an expected communication entry is missing. In other cases, a single party will have registered to monitor the machine and will be alerted of the missing or problematic transmission. In a third case, the machine will appear in multiple registries being monitored by multiple parties. In this case, the alerts will need to be sent to various unrelated parties across registries. The automated alerting mechanism is able, given a machine, to determine who, if anyone, needs to be contacted from a dynamically generated set of registered users at step 210. Users are able to opt in and out of monitoring any number of machines at any time and the alerting mechanism is able to handle the distribution of alerts to a dynamically changing registry of concerned parties.
In evaluating field units or networked machines 100 for inactive status, the method, apparatus and computer code of the present invention consider the particular configuration of each machine reporting. Given a machine's configuration, such as stand-alone versus network attached, a different method for determining inactive status may be used. The present invention is able to decide from among a plurality of criteria for being overdue or inactive in evaluating each machine. If a heartbeat or call home interval is changed to allow for a longer or shorter period of time as determined by the process described above, the system will detect this change in the heartbeat or call home interval and will use the new interval in assessing the machine's status. Machines on a network may have call home or heartbeat intervals which range from one to fourteen (14) days. The system accounts for these variations and determines the status of each machine as modified through the iteration process of the present invention.
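The three registration cases described above (no party, a single party, or multiple parties across registries) can be illustrated with a small opt-in registry. The class `AlertRegistry` and the helper `recipients_across` are hypothetical names introduced for this sketch; the specification does not prescribe a data structure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


class AlertRegistry:
    """Tracks which users have opted in to alerts for which machines."""

    def __init__(self) -> None:
        self._watchers: Dict[str, Set[str]] = defaultdict(set)

    def opt_in(self, user: str, machine_id: str) -> None:
        self._watchers[machine_id].add(user)

    def opt_out(self, user: str, machine_id: str) -> None:
        # discard() is a no-op if the user was never registered.
        self._watchers[machine_id].discard(user)

    def recipients(self, machine_id: str) -> List[str]:
        # .get() avoids creating an entry for an unregistered machine.
        return sorted(self._watchers.get(machine_id, ()))


def recipients_across(registries: Iterable[AlertRegistry],
                      machine_id: str) -> List[str]:
    """Union of recipients when a machine appears in several registries."""
    users: Set[str] = set()
    for registry in registries:
        users.update(registry.recipients(machine_id))
    return sorted(users)
```

Because the recipient list is computed at the time of the alert, users who opt in or out between checks are handled without any separate bookkeeping.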
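A per-machine interval that replaces the previous criterion whenever it changes might be sketched as follows. The class name `IntervalTable` and the seven-day fallback are assumptions for illustration only; the specification states only that intervals may range from one to fourteen days.

```python
from datetime import datetime, timedelta
from typing import Dict

DEFAULT_INTERVAL = timedelta(days=7)  # assumption: fallback when no interval is on record


class IntervalTable:
    """Per-machine call home intervals, replaced whenever an interval changes."""

    def __init__(self) -> None:
        self._intervals: Dict[str, timedelta] = {}

    def set_interval(self, machine_id: str, interval: timedelta) -> None:
        if interval <= timedelta(0):
            raise ValueError("interval must be positive")
        # The new interval becomes the criterion for all later assessments.
        self._intervals[machine_id] = interval

    def is_overdue(self, machine_id: str, last_seen: datetime, now: datetime) -> bool:
        interval = self._intervals.get(machine_id, DEFAULT_INTERVAL)
        return now - last_seen > interval
```

With a fourteen-day interval, a machine that has been silent for ten days is not overdue; if its interval is later shortened to one day, the same silence is assessed as overdue.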
The method, apparatus and computer code of the present invention runs a script at a given interval to check the database 108 for the state or status of all machines 100. Referring to
As the list of inactive machines increases, it becomes important to manage the number of machines that are set to an inactive state. The list of inactive machines will hold historic data relating to the inactive machines that have stopped calling home for any reason.
The method, apparatus and computer code of the present invention will also use network intelligence to check with other systems to ensure that the list is accurate. For example, it will check with outside databases and compare state or status with other networked machines to determine the validity and reason for the status as reported (see
The following code sample provides one example of the steps and features of the present invention as claimed herein:
Format update statements to account for the inactive machines and machines without the call home status interval
Run script to format the emails and send them
Remove the list of new, inactive, changed to inactive machines and machines without a call home period to ensure the accuracy.
Print end time
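The update step listed above can be illustrated with the following sketch, which is not the code of the specification. It assumes a hypothetical `machines` table with columns `id`, `last_call_home` (ISO 8601 text, never null), `interval_days` and `state`, and it assumes that machines without a call home interval are marked with a `no_interval` state; the email and cleanup steps are omitted.

```python
import sqlite3
from datetime import datetime, timedelta


def run_status_check(conn: sqlite3.Connection, now: datetime):
    """One pass of the check; returns (changed_to_inactive, no_interval) ID lists."""
    changed, no_interval, updates = [], [], []
    rows = conn.execute(
        "SELECT id, last_call_home, interval_days, state "
        "FROM machines ORDER BY id"
    ).fetchall()
    for machine_id, last, days, state in rows:
        if days is None:
            # No call home interval on record for this machine.
            no_interval.append(machine_id)
            updates.append(("no_interval", machine_id))
        elif (state != "inactive"
              and now - datetime.fromisoformat(last) > timedelta(days=days)):
            # Overdue and not yet marked: this machine changes to inactive.
            changed.append(machine_id)
            updates.append(("inactive", machine_id))
    # Parameterized UPDATE statements stand in for the formatted update statements.
    conn.executemany("UPDATE machines SET state = ? WHERE id = ?", updates)
    conn.commit()
    return changed, no_interval
```

The returned lists are local to a single run, so nothing carries over to the next run; a caller could pass them to an email step and then print the end time.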
The steps and computer code shown in
The method, apparatus and computer code of the present invention may be performed by a computer program. The computer program can exist in a variety of forms both active and inactive. For example, the computer program can exist as software possessing program instructions or statements in source code, object code, executable code or other formats; firmware program(s); or hardware description language (HDL) files. Any of the above can be embodied on computer readable media, which include storage devices and signals, in compressed or uncompressed form. Such computer readable storage devices include conventional computer RAM (random access memory), ROM (read only memory), EPROM (erasable, programmable ROM), EEPROM (electrically erasable, programmable ROM), and magnetic or optical disks or tapes. Computer readable signals, whether modulated using a carrier or not, can include heartbeat data packages, error data packages, test data packages and the like, all described above. It will be understood by those skilled in the art that a computer system hosting or running the computer program can be configured to access a variety of signals, including but not limited to signals downloaded through the Internet or other networks. Such access may include distribution of executable software program(s) over a network, distribution of computer programs on a CD ROM or via Internet download and the like.
The invention has been described with reference to preferred implementations thereof but it will be appreciated that variations and modifications within the scope of the claimed invention will be suggested to those skilled in the art. For example, the invention may be implemented on networks including Ethernet, token ring and the like or used to control other aspects of a system. The method, apparatus and computer code of the present invention may be extended to monitor other devices which exhibit a plurality of operational modes.
While this invention has been described in conjunction with the specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Any such changes may be made without departing from the spirit and scope of the invention.