The present invention generally relates to information technology, and, more particularly, to the field of network and systems management.
During the operation of most computer networks and distributed systems, several faults and failures inevitably arise. In some cases, the faults and failures are transient and can be corrected automatically by the network or distributed system. However, in many cases, a human administrator must intervene to understand the cause of the fault or failure, and then undertake corrective action. Often, the corrective action involves changing the configuration of system devices and components.
Current systems that report network faults are distinct and isolated from systems used to make configuration changes. The human administrator(s) typically monitor network faults from a network management system and, after determining the root cause, use a separate configuration system to fix the problem.
Thus, a need exists for improved network management and configuration systems.
One embodiment of the present invention provides an automated means to link together network configuration systems and fault monitoring systems to leverage the individual strengths of the systems. Embodiments of the present invention include methods and systems for managing a network by retrieving network fault event information for a target node and enhancing the network fault event information with certain network topology information. The enhanced network fault event information can incorporate network topology information such as: information on path nodes in a routing path of one or more nodes identified in the network fault event; information on one or more adjacent nodes to one or more nodes identified in the network fault event; and/or information on one or more nodes that perform a service function for one or more nodes identified in the network fault event. Past network configuration change information is then retrieved based on the enhanced network fault event information; and the enhanced network fault event information is then correlated with the past network configuration change information for the target node. In one example, the correlated past network configuration change information is used to reconfigure a network node to correct a network fault.
Another aspect of the present invention captures the human intelligence involved in solving past network problems and applies it to automatically resolve future problems by building a knowledge base that can assist operators (in some cases automatically without operator involvement) to address subsequent faults. For example, in some embodiments, a set of network operation guidance rules based on correlated past network configuration change information is created and stored. The set of network operation guidance rules may be ordered based on a frequency of occurrence of the correlated past network configuration change information. In some embodiments, network reconfiguration recommendations are made to a network administrator, or a network node may be automatically reconfigured, based on the stored set of network operation guidance rules.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which are to be read in connection with the accompanying drawings, in which:
As discussed above, current computer network infrastructures for fault management and for network configuration are distinct and not integrated. One embodiment of the present invention provides an automated means to link together configuration systems and the fault monitoring system and leverage the individual strengths of the systems to significantly reduce the workload on the network administrator. For example, in one embodiment, a fault generated by one node of a network management system can be automatically correlated with a second node that must be reconfigured to correct the fault and which (second node) is described only in a separate configuration management system of the same network.
Another aspect of the present invention captures the human intelligence involved in solving past network problems and applies it to automatically resolve future problems. For example, the present invention includes features for correlating information contained in a network fault management system of an enterprise with information contained in a configuration management system of that enterprise. The correlated information can be used to build a knowledge base that can assist operators (in some cases automatically without operator involvement) to address subsequent faults. For example, by learning the correlation between a configuration action and a network fault, some embodiments of the present invention may deploy policies which could cause an automated program to take corrective action when the system encounters a fault for which a configuration change solution has been learned.
In some embodiments of the invention, graph-theoretical techniques are used to determine relationships between the different components of the network, both for fault reporting and for configuration management, and the relationships can be used to determine which configuration actions are recommended to remedy a reported fault. This information can also be summarized and used to guide an operator when similar faults occur in the future.
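By way of illustration only, the graph-based enhancement of a fault event may be sketched as follows. The sketch assumes an undirected topology held as adjacency lists and uses a breadth-first search to find the path nodes between a monitoring node and the fault node; the topology, the node names other than S4, and the function names `shortest_path` and `enhance_event` are hypothetical and are not taken from any particular product.

```python
from collections import deque

# Hypothetical topology: undirected adjacency lists keyed by node name.
TOPOLOGY = {
    "S1": ["R1"],
    "S2": ["R1"],
    "R1": ["S1", "S2", "R2"],
    "R2": ["R1", "S4"],
    "S4": ["R2"],
}


def shortest_path(topology, source, target):
    """Breadth-first search; returns the nodes from source to target, or []."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in topology.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return []


def enhance_event(event, topology, monitor):
    """Return a copy of the fault event annotated with topology information."""
    enhanced = dict(event)
    node = event["node"]
    enhanced["adjacent_nodes"] = sorted(topology.get(node, []))
    enhanced["path_nodes"] = shortest_path(topology, monitor, node)
    return enhanced


# Assumed example: the monitor runs on S4 and reports S1 as unreachable.
enhanced = enhance_event({"node": "S1", "problem": "Unreachable"}, TOPOLOGY, "S4")
```

Nodes that perform a service function for the fault node could be added to the enhanced event in the same manner, given a mapping from each node to the nodes that serve it.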
Referring now to the Figures, an example of a current infrastructure for managing computer network faults is depicted in
An example of such a Fault Monitoring System is that sold by International Business Machines Corporation under the trademark IBM TIVOLI NETCOOL OMNIBUS. The FMS 103 may periodically ping and check connectivity among different servers in the Network 101. Such probe-based monitoring is well known, and an example of such a probe is that sold by International Business Machines Corporation under the trademark IBM TIVOLI NETCOOL OMNIBUS IP PROBE. In this example, assume that FMS 103 is running on server S4, and is monitoring (by pinging) the liveness of the other servers.
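By way of illustration only, one polling pass of such probe-based monitoring may be sketched as follows. The sketch does not reproduce the interface of any commercial probe; the function `poll_liveness`, the event fields, and the injected `is_reachable` check (which stands in for an actual ping) are assumptions made for the example.

```python
from datetime import datetime, timezone


def poll_liveness(nodes, is_reachable, now=None):
    """Run one polling pass and return a fault event for each unreachable node.

    is_reachable is a caller-supplied function (hypothetical stand-in for a
    ping) that takes a node name and returns True if the node responded.
    """
    now = now or datetime.now(timezone.utc)
    return [
        {"time": now, "node": node, "problem": "Unreachable"}
        for node in nodes
        if not is_reachable(node)
    ]


# Assumed example: S2 fails to respond during this pass.
events = poll_liveness(
    ["S1", "S2", "S3"],
    is_reachable=lambda node: node != "S2",
    now=datetime(2008, 1, 7, 9, 0, tzinfo=timezone.utc),
)
```

Injecting the reachability check keeps the sketch independent of any particular network library; the resulting events would be written to the fault database.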
An example of a Fault Database 105 is a component of the product sold by International Business Machines Corporation under the trademark TIVOLI NETCOOL OMNIBUS, wherein a hypothetical and simplified set of fault events may be represented as follows:
Based on the root cause analysis, the administrator can take corrective action, which usually requires making configuration changes in the Network 101 as will be described in more detail with reference to
Those skilled in the art will appreciate that the scheme of
An example of a current configuration management system is that sold by International Business Machines Corporation under the trademark TIVOLI PROVISIONING MANAGER. The CMS 203 can enable different types of configuration changes to be made to the Network 201. The NMS 207 identifies the changes and records them in CMDB 209. A simplified example record of network node configuration changes may be represented as follows:
Current Fault Management Systems (
As will be described in more detail below, the present invention includes features that advantageously enhance, correlate and leverage the information and functionality of the heretofore uncoordinated Configuration Management System (
Those skilled in the art will appreciate that the Operator Guidance System 315 can be implemented in various ways to provide guidance and suggestions on how to handle subsequent faults occurring in the Network 301. For example, referring now to an embodiment depicted in
In step 607, a history of past configuration changes made to the node can be obtained from the Configuration Database 209. An example of the results of a query of the configuration database CMDB 209 for the time periods relevant to the faults could be represented as follows:
In step 609, the enhanced event information (from step 605) is correlated to configuration changes that were performed to correct similar past fault events (from step 607). In the above example, current data mining techniques would reveal that the past configuration action taken to correct a node that generates an “Unreachable” problem code was to reboot the node adjacent to the fault node. In step 611, the correlated information is used to develop guidance rules for the operators and administrators of the network system. The fault and configuration logs were shown in the above examples in a somewhat simplified way so as to not obscure features of the present invention. Those skilled in the art will appreciate that in practice there may be multiple causes for an event and different corrective actions taken, so the guidance rules developed will be probabilistic. The process then terminates in step 613.
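By way of illustration only, steps 609 and 611 may be sketched as follows. The sketch substitutes a simple time-window match for the data mining techniques mentioned above: a configuration change is paired with a fault when it was made to the fault node or an adjacent node within an assumed window after the fault. The window length, the record fields, the sample data, and the function names `correlate` and `build_rules` are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def correlate(fault_events, config_changes, window=timedelta(hours=1)):
    """Pair each enhanced fault with changes made shortly after it (step 609)."""
    pairs = []
    for fault in fault_events:
        related = {fault["node"], *fault.get("adjacent_nodes", [])}
        for change in config_changes:
            delay = change["time"] - fault["time"]
            if change["node"] in related and timedelta(0) <= delay <= window:
                relation = ("fault node" if change["node"] == fault["node"]
                            else "adjacent node")
                pairs.append((fault["problem"], change["action"], relation))
    return pairs


def build_rules(pairs):
    """Turn correlated pairs into frequency-ordered guidance rules (step 611)."""
    counts = defaultdict(Counter)
    for problem, action, relation in pairs:
        counts[problem][(action, relation)] += 1
    rules = {}
    for problem, counter in counts.items():
        total = sum(counter.values())
        # most_common() orders the actions by how often they were observed.
        rules[problem] = [(action, relation, n / total)
                          for (action, relation), n in counter.most_common()]
    return rules


# Hypothetical logs: three "Unreachable" faults and the changes that followed.
faults = [
    {"time": datetime(2008, 1, 7, 9, 0), "node": "S1",
     "problem": "Unreachable", "adjacent_nodes": ["R1"]},
    {"time": datetime(2008, 2, 3, 14, 0), "node": "S2",
     "problem": "Unreachable", "adjacent_nodes": ["R1"]},
    {"time": datetime(2008, 3, 1, 8, 0), "node": "S1",
     "problem": "Unreachable", "adjacent_nodes": ["R1"]},
]
changes = [
    {"time": datetime(2008, 1, 7, 9, 10), "node": "R1", "action": "reboot"},
    {"time": datetime(2008, 2, 3, 14, 5), "node": "R1", "action": "reboot"},
    {"time": datetime(2008, 3, 1, 8, 20), "node": "S1",
     "action": "restart interface"},
]
rules = build_rules(correlate(faults, changes))
```

Each rule carries the observed fraction of past faults that were followed by the action, which reflects the probabilistic nature of the guidance and allows the rules to be presented in order of frequency.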
As was discussed above for example with reference to
In addition to providing guidance to the operator, in some embodiments, the system can be modified so that some of the faults in the network are automatically corrected. For example, a set of policies can be defined governing when one or more corrective actions can be taken automatically, and when the corrective action requires the involvement of a human administrator. By way of example only, a policy can be defined as a Boolean combination of conditions under which the corrective action can be taken automatically, e.g., if the operator guidance system shows that a specific action has successfully corrected a fault more than 80% of the time (or some other appropriate predetermined threshold), then the policy can state that the action can be applied automatically. Another set of policies may, for example, use a different threshold during the night-time hours.
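By way of illustration only, such a policy may be sketched as a conjunction of a time-of-day condition and a success-rate condition, with the policies themselves combined by disjunction. The 80% threshold follows the example above; the hours that define night-time, the 95% night-time threshold, and the names `Policy` and `may_auto_apply` are assumptions made for the sketch.

```python
from dataclasses import dataclass
from datetime import time


@dataclass(frozen=True)
class Policy:
    min_success_rate: float  # action must have succeeded more often than this
    start: time              # policy applies from this time of day ...
    end: time                # ... up to, but not including, this time

    def covers(self, at):
        if self.start <= self.end:
            return self.start <= at < self.end
        return at >= self.start or at < self.end  # interval wraps past midnight


# Hypothetical policy set: a stricter threshold applies at night.
POLICIES = [
    Policy(0.80, time(6, 0), time(22, 0)),
    Policy(0.95, time(22, 0), time(6, 0)),
]


def may_auto_apply(success_rate, at, policies=POLICIES):
    """True if any policy allows the action to be applied without an operator."""
    return any(p.covers(at) and success_rate > p.min_success_rate
               for p in policies)
```

When `may_auto_apply` returns False, the recommended action would instead be presented to the human administrator for confirmation.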
Those skilled in the art will appreciate that one or more embodiments of the invention can be implemented in the form of a computer product including a computer usable medium with computer usable program code implementing the inventive process; and that one or more embodiments or aspects of the invention can be implemented in the form of a computer system including at least one processor that is coupled to a memory storing computer usable program code operative to perform exemplary process steps. For example,
It is to be further understood that, because some of the constituent system components and method steps depicted in the accompanying figures may be implemented in software, the actual connections between the system components (or the process steps) may differ depending upon the manner in which the system is programmed. Given the teachings of the present disclosure provided herein, one of ordinary skill in the related art will be able to contemplate these and similar implementations or configurations of the present disclosure.
Now that illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be made by one skilled in the art without departing from the scope or spirit of the invention, which is properly defined by the claims appended hereto.
This invention was made with Government support under Contract W911NF-06-3-0001 awarded by the U.S. Army. The Government has certain rights in this invention.
Number | Date | Country |
---|---|---|
20100023604 A1 | Jan 2010 | US |