1. Technical Field
The invention disclosed and claimed herein generally relates to a method and apparatus for monitoring a network to detect faults, in order to determine the impact the faults have on prespecified services running on the network. More particularly, the invention pertains to a method of the above type for automatically discovering devices, or nodes, in the network that are coupled to a particular operator device, and also for discovering services configured to run on the discovered nodes. Even more particularly, the invention pertains to a method of the above type that alerts network operators of the effects that network outages or faults will have on the discovered services.
2. Description of Related Art
A business system disposed to operate in connection with a network such as the Internet typically requires a server that runs a particular server program, or service. Moreover, it is very common for a business system to use a server that is running one or more services in addition to the particular service. For example, a business system such as a catalog ordering system could require a server running services such as data processing systems, and also web application services. Moreover, the additional services could in turn rely on network communications with yet other services, in order to implement the business system in its entirety. Accordingly, it is seen a number of services operating at different network nodes may be required in order to implement a business system.
An operator of a business system of the above type will generally be very familiar with the particular server used to access the Internet or other network. However, the operator likely will not be aware of all the other network devices, or of the services respectively running thereon, that are required to operate the business system as described above. Thus, the impact that a network fault or outage could have on these services would also not be known to the operator. Accordingly, it would be desirable to give operators of business systems visibility into the effects of network outages, and what services are made unavailable thereby. This information would assist operators in correcting service problems caused by network outages. For example, if two server machines being operated by an operator both stopped responding, and the operator was alerted that one machine had DB2 service and the other had no services running on it, the operator could prioritize fixing the server running the DB2 service first.
In the prior art, a business systems manager is available that may show line of business impact to a operator. One such system is the Tivoli® Business Systems Manager, Tivoli® being a proprietary trademark of International Business Machines Corporation (IBM) and registered in the United States. These systems provide a higher level of service impact based on network outages. However, this prior art system requires an operator to manually define relationships among the network components required for a business system. Thus, no completely automated solution to the above problem, whereby a operator is automatically informed of the impact that a network fault has on necessary services, appears to be available at the present time.
By means of the invention, the service impact of node (end system) and network faults or outages is reported to network operators, based on correlating the network outages with services automatically discovered to be running on the nodes. This enables an operator to prioritize correction of service problems caused by the network outage events, based on the comparative impact of an outage on respective services. One useful embodiment of the invention is directed to a method for providing the operator of a specified network device with notice of the impact of a network fault on one or more services running in association with the specified device. The method comprises the steps of discovering one or more devices in the network that are respectively connected to the specified device, to assist in performing an intended task, and then discovering each service that is running on each of the discovered devices, likewise in support of task performance. The method further comprises monitoring the status of respective discovered devices at prespecified intervals, in order to detect the occurrence of a fault in the network. Upon detecting a fault, an alert is generated to indicate the impact of the detected fault on respective discovered services.
The novel features believed characteristic of the invention are set forth in the appended claims. The invention itself, however, as well as a preferred mode of use, as well as further objectives and advantages thereof, will best be understood by reference to the following detailed description of an illustrative embodiment when read in conjunction with the accompanying drawings, wherein:
Referring to
To illustrate an embodiment of the invention, it is assumed that an operator operates server 102 to establish a business system to carry out a specified task, such as catalog ordering or the like. It is further assumed that services running on server 102 for this propose must rely on other services in order to implement the entire business system. Accordingly, the operating system of server 102 establishes a connection with server 120. Server 120 is configured to run services 136 and 138, which are both required to implement the business system. A connection is also established between server 102 and server 114 of LAN 112, which is configured to run another required service 140.
Referring to
Network monitor 206 is adapted to send an ICMP (Internet Control Message Protocol) network request to server 102 over network 100, at the server IP address. The ICMP response or lack thereof, enables the monitor 206 to determine whether a machine is active on the IP address or not. Further information about the device is retrieved through SNMP (Simple Network Management Protocol) protocol requests. Thus, network monitor 206 is able to determine or discover the respective connected devices, including servers 120 and 114, as well as any other servers, routers, and work stations. Each of these discovered devices, or nodes, is then listed in a database 210 residing in network management tool 202.
After respective devices connected to server 102 have been discovered and listed in database 210, network monitor 206 continues to assess or monitor the availability status of each discovered device, at intervals, which are configurable by the operator. Thus, the network monitor 206 is able to determine when either a node (i.e. a server or workstation), or an entire network that includes any of the discovered nodes, becomes unavailable because of some fault.
It is understood that the term “network”, as used herein, may refer to both a large global network such as network 100, as well as to sections thereof and smaller networks connected thereto that include discovered devices.
Referring further to
As is known to those of skill in the art, a port is used in accordance with the TCP/IP protocol to designate a particular server program, or service, running on a network computer or the like. Thus, in order to discover a service running on a particular one of the discovered devices, the service monitor 208 is connected to the network 100, at the IP address of the particular device. The monitor 208 then attempts to connect to a port of a particular number, to determine whether or not a service associated with the particular port number is running on the particular discovered device. If a service is discovered on a particular device at the particular port number, this information is stored or listed in database 210. Thereafter, the status of the listed service will be continually monitored by service monitor 208, to determine whether or not it remains on the particular device.
After attempting to connect on the particular port number, service monitor 210 is operated to attempt to connect to other port numbers, on the same IP address of the particular device, in order to discover any other services running on such device. In like manner, service monitor 208 is operated to discover the services configured to run on each of the other discovered devices. At the conclusion of this process, database 210 will contain a complete list of all nodes or devices of network 100 that are connected to server 102 in support of the business system, as described above. Database 210 will also contain a list of all services discovered to be running on the respective discovered devices, likewise in support of the business system. Moreover, the list of discovered nodes and services is continually updated in database 210, at very frequent intervals, by operating network monitor 206 and service monitor 208 to continually monitor the status of respective nodes and services.
In other embodiments of the invention, application programmable interfaces (APIs) may also be used to discover services running on devices connected to server 102.
When the network management tool 202 discovers a network fault or outage during the continual status monitoring procedures described above, the network management system 200 will also determine whether a service on any of the network nodes is affected. In the case of a fault at a node (e.g., an end station or workstation), the network management system 200 searches the database 210 to see if any services are known to be running on the node in question. If so, these services will be affected by the network fault at this node. Accordingly, the network management tool 202 of network management system 200 is operated, to generate an alert setting forth the impact of the node fault event on these services. This alert is then sent to the management console (not shown) of the operator or operator of server 102.
In the case of an outage or fault affecting an entire network, the database 210 is searched to determine if there are any nodes within the particular network which have services running on them. If there are, then these nodes will be affected by the network fault, so that the services on these nodes will also be affected. In this case, network management system 202 generates an alert setting forth the impact of the network fault event on these services. This alert is likewise sent to the management console of the operator of server 102.
By furnishing alerts as described above to the operator of server 102, the operator is enabled to set priorities in correcting the service problems resulting from the faults.
Referring to
Referring further to
Referring to
The description of the present invention has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention, the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.