A network management system can be associated with communication networks, with the purpose of collecting alarms from network equipment, forming a summary of the collected alarms, particularly using correlation methods, and displaying this alarm summary to an operator so that the operator can implement corrective action in the case of a failure of the network equipment. The concept of a “failure” or “fault” is understood to be a very general term for any type of hardware and/or software malfunction. Network equipment and/or software that is no longer operational in some manner is considered to have a failure. Likewise, an improper configuration of network equipment and/or software is considered to have a failure.
Network management systems can be used to configure network equipment and/or software. The operator can input new parameters using a man-machine interface and the network management system applies these new parameters to the network equipment and/or software. In this way, the operator can correct a network failure in reaction to an alarm.
Such a centralized analysis depends on collection of a large amount of data and alarms from many elements in the communication system. These elements may be network equipment, such as for example, routers, switches, computer servers, networking cards and other components of computer servers, inclusive of software.
Due to the many interactions between network elements, a single failure can generate a substantial number of alarms. Thus, a failure on a router may generate an alarm from other network equipment connected to one of the ports on the router. It is therefore difficult for the operator to determine which is the genuine failure among the large number of generated alarms, and even more so to determine the corrective action to be undertaken.
Nevertheless, the operator has to take action with each failure to determine the corrective action(s) to be undertaken and to undertake the corrective action(s). The operator then needs to reconfigure the network equipment using the network management system or to manually connect to one or more of the network equipment and send the appropriate CLI (command line interface) commands.
The foregoing and other objectives, features, and advantages of the invention may be more readily understood upon consideration of the following detailed description of the invention, taken in conjunction with the accompanying drawings.
Referring to
The network devices 100 and the management system 120 may be interconnected with one another using any protocol. For example, a simple network management protocol (SNMP) may be used for collecting and organizing information about managed devices and software on an Internet protocol network and for modifying that information to change the network device and/or software behavior. SNMP may be used to expose management data in the form of variables on devices and/or software to be managed. Normally, SNMP enables the variables to be remotely queried, and often manipulated, by the management system 120. Each of the network devices 100 includes a respective agent 140 which reports information via SNMP to the management system 120. The agent 140 may permit unidirectional (read-only) or bidirectional (read and write) access to network device specific information. The agent 140 is a network management software module that resides on the respective network device and has local knowledge of the management information and translates that information to and/or from a SNMP specific form. The information from the respective agent 140 may be polled and/or pushed to the management system 120. In this manner, the management system 120 receives information from each of the respective agents 140, either on a regular basis or in response to a request. The agents 140 may further provide alerts to the management system 120 of a failure of the corresponding network device and/or software 100.
Referring to
For example, a router card may experience a failure. The management system 120 may receive a fault notification together with additional information from a corresponding agent 140 for the router card. Based upon the additional information a support engineer may attempt to diagnose the source of the fault notification. Initially, the support engineer may determine it is desirable to initiate a rebooting of the router card to attempt to remedy the fault condition. If the router card, as a result of rebooting the router card, operates properly then the corrective action was successful.
For example, a manifest delivery controller is a software application running on a computer server for modifying video manifests to enable server-side dynamic advertisement insertion, content personalization, and analytics for Internet protocol-based video. The management system 120 may receive a fault notification together with additional information from a corresponding agent 140 for the manifest delivery controller that has failed. Based upon the additional information a support engineer may attempt to diagnose the source of the fault notification. Initially, the support engineer may determine it is desirable to initiate a rebooting of the manifest delivery controller to attempt to remedy the fault condition. If the manifest delivery controller, as a result of rebooting the manifest delivery controller, fails to operate properly then the support engineer needs to further examine the logs to attempt to determine an appropriate course of action. Unfortunately, it can be rather time consuming to determine an appropriate course of action.
Referring to
The management system 120 may include a log file acquisition process 420 that retrieves the log files from the corresponding network devices 100 upon a fault being detected, or otherwise periodically receives and updates the log files from the network devices 100 on a continual basis. In this manner, when a fault is triggered for one or more network devices 100 by a corresponding one or more agents 140, the log files have already been received by the log file acquisition process 420 or otherwise received by the log file acquisition process 420 in response to receiving one or more faults. A mitigation process 430 receives the fault indication 440 and, based upon the corresponding log files from the log file acquisition module 420, processes the log files using the trained machine learning process 400. In response, the mitigation process 430 suggests an appropriate manner of mitigating the fault. Based upon any suitable criteria, the mitigation process 430 may automatically perform the determined one or more mitigation activities. If as a result of the automatic mitigation activities, such as restarting the device and/or software process or reinstalling and/or reconfiguring the device and/or software process, the fault remains then the fault may be elevated to an appropriate support engineer with supporting documentation regarding the fault, including appropriate suggestions from the machine learning process 400 based upon previous encounters with the same or similar faults.
The support engineer may go through the log files that have been retrieved by the log file acquisition process 420, together with examination of additional data remaining on the network devices 100, if desired, to make an analysis of what is the likely root cause for the fault.
Referring to
Referring to
Referring to
Referring to
Referring to
In many cases, there is a lot of effort involved by a front-line engineer involved to analyze and process a fault from the system and/or a customer. Referring to
Referring to
The operator may select an off-online mode 1650. In the off-line mode 1650, the management system 120 may obtain log files 1660 from the customer through a network interconnection, such as the Internet. The log files 1660 are preferably provided by the customer in some manner, such as using shared cloud-based storage. The log files, for example, may be provided using a simple network management protocol or a file transfer protocol. One or more of the log files may be provided to the machine learning process 1662 for processing. The machine learning process 1662 may perform a multitude of processing steps. An initial step the machine learning process 1662 may perform is reading the log files 1664. The machine learning process 1662 may identify issues 1666 based upon the log files. The machine learning process 1662 may determine corrective actions 1668 to be taken based upon the identified issues 1666. Based upon the determined correction actions 1668, the management system 120 may provide an indication of the actions 1670 to be performed by the customer. The customer may perform the actions that are indicated 1672. After performing the actions that are indicated 1672, the management system 120 may perform verification 1674 to ensure that the issues have been resolved.
As it may be observed, the dual option system using an on-line mode and an off-line mode, permits the management system 120 to efficiently and accurately process log files that include faults in a manner that documents what is performed for future reference together with resolving the issues in a verifiable manner. Referring to
Moreover, each functional block or various features in each of the aforementioned embodiments may be implemented or executed by a circuitry, which is typically an integrated circuit or a plurality of integrated circuits. The circuitry designed to execute the functions described in the present specification may comprise a general-purpose processor, a digital signal processor (DSP), an application specific or general application integrated circuit (ASIC), a field programmable gate array (FPGA), or other programmable logic devices, discrete gates or transistor logic, or a discrete hardware component, or a combination thereof. The general-purpose processor may be a microprocessor, or alternatively, the processor may be a conventional processor, a controller, a microcontroller, or a state machine. The general-purpose processor or each circuit described above may be configured by a digital circuit or may be configured by an analogue circuit. Further, when a technology of making into an integrated circuit superseding integrated circuits at the present time appears due to advancement of a semiconductor technology, the integrated circuit by this technology is also able to be used.
It will be appreciated that the invention is not restricted to the particular embodiment that has been described, and that variations may be made therein without departing from the scope of the invention as defined in the appended claims, as interpreted in accordance with principles of prevailing law, including the doctrine of equivalents or any other principle that enlarges the enforceable scope of a claim beyond its literal scope. Unless the context indicates otherwise, a reference in a claim to the number of instances of an element, be it a reference to one instance or more than one instance, requires at least the stated number of instances of the element but is not intended to exclude from the scope of the claim a structure or method having more instances of that element than stated. The word “comprise” or a derivative thereof, when used in a claim, is used in a nonexclusive sense that is not intended to exclude the presence of other elements or steps in a claimed structure or method.
The terms and expressions which have been employed in the foregoing specification are used therein as terms of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding equivalents of the features shown and described or portions thereof, it being recognized that the scope of the invention is defined and limited only by the claims which follow.
The present application is a continuation of U.S. patent application Ser. No. 17/584,839, filed Jan. 26, 2022, which claims the benefit of U.S. Provisional Patent Application No. 63/142,789 filed Jan. 28, 2021.
Number | Date | Country | |
---|---|---|---|
63142789 | Jan 2021 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 17584839 | Jan 2022 | US |
Child | 18380125 | US |