This disclosure relates to computers and networks.
Network monitoring involves outputting alarms to network operators to allow operators to troubleshoot and optimize the operations of a computer network.
Alarms are often outputted in a list that an operator can sort and filter to find alarms that need attention. However, it is often the case that the list grows large and that a large number of alarms are ignored or filtered out based on operator experience and other factors. This can cause important alarms to be missed or can make the operators' jobs more difficult, which may result in degraded network performance.
The drawings illustrate, by way of example only, embodiments of the present disclosure.
The present invention provides systems, processes, and other techniques to solve at least one of the problems of the prior art.
The present invention allows an operator to select or define an alarm that is used as a basis to rank or otherwise output other alarms. Alarms that are similar to the one the operator is interested in are brought to the attention of the operator. The present invention also identifies affinities among operators based on their past actions in dealing with past alarms and suggests similar new alarms to similar operators. These aspects of the present invention when used separately or in combination can increase productivity of network operators and thereby increase network efficiency and performance.
In this embodiment, the system 10 is configured to operate according to the Simple Network Management Protocol (SNMP). In other embodiments, the system 10 can be configured to operate according to other protocols. For sake of illustration, the SNMP will be referenced for various examples discussed herein, but this should not be taken as limiting.
The network 14 is connected to another network 16 via a firewall 18 or similar device. The network 16 can include any computer network or combination of networks, such as a LAN, an intranet, a WAN, a wireless network, a cellular data network, the Internet, and similar. The firewall 18 is configured to prevent unauthorized access to the network 14.
The system 10 includes monitoring agent devices 20 (also termed probes), a monitoring manager 22, and one or more remote administrator terminals 24. The monitoring agent devices 20 receive status and alarm data from associated network devices 12 and report such to the monitoring manager 22, which processes such data for output and/or storage.
The monitoring agent devices 20 are remote data processing devices configured to operate within the network 14. In this embodiment, the monitoring agent devices 20 are SNMP managers configured to monitor the operation of network devices 12 (often termed managed devices in SNMP) by receiving input data 30, such as status data and alarm data, from the network devices 12, which act as sources of input data in a production environment. Input data 30 is sometimes termed a “management information base” or “MIB”. Each monitoring agent device 20 is associated with one or more network devices 12 from which it collects data. Such associations may be created, destroyed, and maintained by the monitoring manager 22 and monitoring agent devices 20. The monitoring agent devices 20 are configured to process input data received from the network devices 12 into processed data 32 for output to the monitoring manager 22. For example, SNMP responses and SNMP Traps are mapped into performance and alarm attributes in the system database. A monitoring agent device 20 may be a computer that has a processor and memory specifically and exclusively configured to operate as an SNMP manager. A monitoring agent device 20 may be a plug computer, such as a SheevaPlug device available from Marvell. Alternatively, a monitoring agent device 20 may be a managed network device 12 executing a monitoring agent program. A monitoring agent device 20 may include a processor 42 and memory 43 cooperative to execute instructions 44 to realize functionality discussed herein. Instructions may be stored in a non-transitory computer-readable medium, such as a hard drive, solid-state memory device, flash memory, random-access memory, and the like.
The monitoring manager 22 is connected to the network 16. The monitoring manager 22 is configured to receive processed data 32 from the monitoring agent devices 20 and to format the processed data 32 for output and/or storage as output data 34. The monitoring manager 22 may be further configured to perform further processing, such as normalization and aggregation, on the received processed data 32. In this embodiment, the monitoring agent device 20 is an SNMP manager configured to send SNMP Requests to and receive SNMP Responses and Traps bearing data 30 from the network devices 12. This information is transformed by the monitoring agent device 20 into processed data 32 and sent to the monitoring manager 22. The monitoring manager 22 may include one or more computers which may be termed servers. The monitoring manager 22 may include a processor 45 and memory 46 cooperative to execute instructions 47 to realize functionality discussed herein. Instructions may be stored in a non-transitory computer-readable medium, such as a hard drive, solid-state memory device, flash memory, random-access memory, and the like.
The further processing performed by the monitoring agent device 20 on the data 30 received from the network devices 12 may include normalization of network device status data and alarms. For example, a particular router's load may be outputted as an integer and another router's load may be outputted as multiple floating-point values representative of averages over predetermined times. The monitoring agent devices 20 are configured to recognize such values as load metrics, normalize them, and send them to the monitoring manager 22. The monitoring agent device 20 can be configured to normalize the load values to a consistent range, such as a percentage, for output and/or storage. The monitoring manager 22 may be configured to aggregate data from several networks 14.
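By way of non-limiting illustration, the load normalization described above might be sketched as follows. The function name `normalize_load`, the assumed integer scale of 0 to 255, and the use of the first floating-point average divided by a core count are all hypothetical choices made for this sketch and are not taken from any particular network device.

```python
from typing import Sequence, Union


def normalize_load(
    raw: Union[int, Sequence[float]],
    *,
    int_scale: int = 255,  # assumed full-scale value for integer loads
    cores: int = 1,        # assumed divisor for load averages
) -> float:
    """Map a raw load reading to a percentage in the range 0 to 100."""
    if isinstance(raw, int) and not isinstance(raw, bool):
        # Integer form: scale against the assumed full-scale value.
        pct = 100.0 * raw / int_scale
    else:
        # Multiple floating-point averages: use the first (assumed to be
        # the shortest averaging window) and divide by the core count.
        pct = 100.0 * raw[0] / cores
    # Clamp so that every device reports within a consistent range.
    return max(0.0, min(100.0, pct))
```

With such a function, `normalize_load(255)` and `normalize_load((3.0, 1.0, 1.0), cores=2)` both report a fully loaded device as 100 percent, even though the raw formats differ.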
Client terminals 40 are connected to one or both of the networks 14, 16. The client terminals 40 are configured to connect to the monitoring manager 22 to display output data provided by the monitoring manager 22. Client terminals 40 may be used by administrators of the network 14 to monitor the network's operations, performance, and health.
Communications among the monitoring manager 22, the monitoring agent devices 20, and the terminals 24, 40 can be facilitated by various protocols and techniques, including Transmission Control Protocol/Internet Protocol (TCP/IP), Secure Sockets Layer (SSL), Secure Shell (SSH), SSH tunneling, a Virtual Private Network (VPN), a combination of any of such, and similar. Establishing and maintaining associations between the monitoring agent devices 20 and the network devices 12, as well as communications there-between, can also be achieved using available techniques.
In operation, during monitoring of one or more networks 14, status and/or alarm data from network devices 12 is collected by associated monitoring agent devices 20. Each monitoring agent device 20 processes collected data and sends processed data to the monitoring manager 22, which formats the processed data for display and/or storage as output data. Administrators of the networks 14 can monitor the operation, performance, and health of their networks 14 by using client terminals 40 to connect to the monitoring manager 22 to view and/or download the output data.
When a monitoring agent device 20 detects an alarm or an alarm condition in data 30 received from a network device 12, the monitoring agent device 20 outputs an alarm 36 to the monitoring manager 22. The alarm 36 may be included in processed data 32 or may be a separate entity.
With reference to
As shown in
As shown in
Example alarm dimensions include a distance in a container hierarchy 52 between a network device 12 originating the selected alarm and a network device 12 originating the target alarm, a difference in a type 54 of network device 12 originating the selected alarm and a type 54 of network device 12 originating the target alarm, a difference between a model 56 of network device 12 originating the selected alarm and a model 56 of network device 12 originating the target alarm, a difference between a source 58 of the selected alarm and a source 58 of the target alarm, a Levenshtein distance (textual difference) between a message 60 of the selected alarm and a message 60 of the target alarm, a concurrency in a start time 62 of the selected alarm and a start time 62 of the target alarm, a concurrency in an end time 64 of the selected alarm and an end time 64 of the target alarm, a difference in a severity 66 of the selected alarm and a severity 66 of the target alarm, a difference in an assigned network operator 68 for the selected alarm and an assigned network operator 68 of the target alarm, and a difference in a rating 70 of the selected alarm and a rating 70 of the target alarm. Other examples are also contemplated.
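For illustration only, the Levenshtein distance between two alarm messages 60 can be computed with the standard dynamic-programming recurrence, as sketched below. The helper `message_distance`, which divides by the longer message length to give a value between 0 and 1, is an assumption of this sketch rather than a requirement of the disclosure.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def message_distance(a: str, b: str) -> float:
    """Hypothetical normalization: edit distance scaled into 0..1."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Scaling the edit distance in this way is one possible approach to keeping the message dimension comparable in magnitude to other alarm dimensions.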
Computing a value for each alarm dimension being considered can use any suitable methodology. An example for device type 54 is shown in
Similar methodologies are used for each alarm dimension considered. A consistent sense is used among methodologies, such as higher values equating to greater differences, greater distances, and lesser degrees of concurrency.
Further, each alarm dimension can be assigned a weighting. Computing the total distance between the selected alarm's data 50-1 and a target alarm's data 50-N can thus include computing a weighted sum of all alarm dimensions. If weightings are not used, a simple sum can be computed instead.
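One possible sketch of the total distance computation is given below, assuming that a value has already been computed for each alarm dimension. The function name `total_distance`, the dimension names, and the default weighting of 1.0 for any dimension without an explicit weighting are hypothetical.

```python
from typing import Mapping, Optional


def total_distance(
    dimension_values: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Combine per-dimension values into one total distance."""
    if weights is None:
        # No weightings configured: a simple sum is used.
        return sum(dimension_values.values())
    # Weighted sum; dimensions lacking a weighting default to 1.0
    # (an assumption of this sketch).
    return sum(weights.get(name, 1.0) * value
               for name, value in dimension_values.items())
```

For example, dimension values of 1.0 for type, 2.0 for severity, and 0.5 for message give a simple sum of 3.5, or 5.0 under weightings of 2.0, 0.5, and 4.0 respectively.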
The values for various alarm dimensions and weightings can be made configurable via the user interface so that customized similarity computations can be developed.
The monitoring engine can include a computation engine for determining the total distances and an alarm output engine for generating the user interface or otherwise outputting alarms based on computed total distance.
As shown in
For instance, as shown in
The computed total distances can also be compared to one or more threshold distances, which may be made user-configurable, to trigger additional actions beyond outputting indications of the alarms. Such actions include assigning a network operator to an alarm, transmitting a notification to a network operator, transmitting a query to the network device that triggered the alarm, and similar.
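As a non-limiting sketch, threshold comparison might be implemented as below. The action names and the convention that an action is triggered when the total distance is at or below its threshold (that is, when the target alarm is sufficiently similar to the selected alarm) are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple


def actions_for(
    distance: float,
    thresholds: Sequence[Tuple[float, str]],
) -> List[str]:
    """Return the actions whose threshold distance is not exceeded."""
    # Assumed convention: a smaller distance means a more similar alarm,
    # so an action fires when distance <= its threshold.
    return [action for limit, action in thresholds if distance <= limit]


# Hypothetical user-configurable thresholds.
THRESHOLDS = [(1.0, "assign_operator"), (3.0, "notify_operator")]
```

Under these example thresholds, a total distance of 2.0 would trigger only the notification, while a total distance of 5.0 would trigger no additional action.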
As shown in
The set of past alarms 112 and undertaken actions 114 for each operator 110 in a sense defines that operator's job, at least historically. That is, each network operator's preferences, behaviour, and duties can be elicited from the historical alarm data 108.
The monitoring manager 22 can include an affinity engine 120 that is configured to process historical alarm data 108 to obtain statistical operator affinity data 122 that identifies similar network operators. In this way, similarities between network operators can be determined and can be used when assigning new alarms. Operators who have worked on similar alarms can be assigned similar alarms in the future.
The affinity engine 120 is configured to compute statistical affinities in historical alarm data 108 using operator identifier (e.g., ID number, email address, name) as the basis. Any suitable statistical methodology can be used. The result is statistical operator affinity data 122 that, in one example, identifies similar operators. The table shown in
In the statistical computation, similar alarms can be identified as described above in relation to the alarm dimensions. Similar actions can be defined by a lookup table (or matrix). In one example, similar actions are identical actions. In one example computation, each alarm 112 for each operator 110 is compared to each other alarm 112 of the other operators 110, the comparison yielding a total distance (discussed above), or other measure of similarity, between each pair of compared alarms 112. Then, for each pair of compared alarms 112, the actions 114 taken are compared and the total distance, or other measure of similarity, is refined. In one example, the same action 114 preserves the total distance, or other measure of similarity, while different actions nullify the total distance, or other measure of similarity. That is, an operator who ignores a certain type of alarm will be determined to be less similar to an operator who completes the same type of alarm. Finally, a total affinity is computed for all pairs of compared alarms 112 and actions 114 for each pair of operators 110 by, for example, summing the total distances, or other measure of similarity. The statistical operator affinity data 122 can then be obtained as the computed total affinity for each pair of operators 110, an indication of such affinity (e.g., “1” or “0”, as shown in the table) if the total affinity passes a threshold affinity, or similar.
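The example computation above might be sketched as follows. This is illustrative only: the sketch assumes that a caller-supplied `similarity` function returns a larger value for more similar alarms (for example, a value derived from the total distance), that similar actions are identical actions, and that the history is held as a mapping from operator identifier to a list of alarm and action pairs. All names are hypothetical.

```python
from itertools import combinations
from typing import Any, Callable, Dict, List, Tuple

History = Dict[str, List[Tuple[Any, str]]]  # operator id -> [(alarm, action)]


def operator_affinity(
    history: History,
    similarity: Callable[[Any, Any], float],
) -> Dict[Tuple[str, str], float]:
    """Sum alarm similarities over each pair of operators."""
    affinity = {}
    for op_a, op_b in combinations(sorted(history), 2):
        total = 0.0
        for alarm_a, action_a in history[op_a]:
            for alarm_b, action_b in history[op_b]:
                # Same action preserves the similarity; a different
                # action nullifies it (contributes nothing).
                if action_a == action_b:
                    total += similarity(alarm_a, alarm_b)
        affinity[(op_a, op_b)] = total
    return affinity


def affinity_indicators(
    affinity: Dict[Tuple[str, str], float],
    threshold: float,
) -> Dict[Tuple[str, str], int]:
    """Reduce total affinities to 1 or 0 against a threshold affinity."""
    return {pair: int(value >= threshold) for pair, value in affinity.items()}
```

In this sketch, two operators who both complete router alarms accumulate affinity, whereas an operator who ignores router alarms accumulates none with either of them.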
The historical data 108 used when computing operator affinity can be limited by age, so that alarms and actions older than a specific age (e.g., 1 year) are not considered or are weighted less than newer alarms and actions. This allows operator affinity to degrade with age, so that, for example, network operators whose jobs diverge for other reasons cease seeing similar alarms.
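For illustration, age-based limiting could be realized with a weighting function such as the one sketched below. The linear decay and the function name `age_weight` are assumptions of this sketch; any suitable decay, or a hard cutoff alone, could be used instead.

```python
from datetime import datetime, timedelta


def age_weight(
    event_time: datetime,
    now: datetime,
    max_age: timedelta = timedelta(days=365),
) -> float:
    """Return a weight in 0..1 that falls as an alarm or action ages."""
    age = now - event_time
    if age <= timedelta(0):
        return 1.0  # current (or future-dated) events carry full weight
    if age >= max_age:
        return 0.0  # older than the specific age: not considered
    # Assumed linear decay between full weight and the cutoff.
    return 1.0 - age / max_age
```

Each pairwise similarity contributing to operator affinity could be multiplied by such a weight so that older alarms and actions count for less.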
The monitoring manager 22 can include an alarm output engine 124 that references the statistical operator affinity data 122 when outputting new alarms. Among operators that have historical affinity, actions taken on new alarms are tracked and similar new alarms are outputted to such operators as being alarms of potential interest. That is, considering a first network operator and a second network operator, based on a statistical affinity between the first and second operators, a new second alarm for the second network operator is selected after the first network operator has taken action on a new first alarm that is similar to the new second alarm. Groups of similar operators can thus be dynamically defined and continually updated based on past alarms and actions, and new alarms that are taken up by a group member cause similar new alarms to be promoted to other group members. The alarm output engine 124 can be configured to identify similar alarms using the techniques discussed herein (e.g., total distance).
In another illustrative example, if two operators are determined to have historic affinity because they both complete router alarms consistently and then one of the operators begins completing VoIP telephone alarms, then the alarm output engine 124 will begin to recommend VoIP telephone alarms to the other operator. This illustrates how the present invention allows operators with similar behaviour to learn from each other.
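The promotion of similar new alarms to operators having affinity might be sketched as follows. The representation of affinity as a set of operator pairs, the `distance` callable, the `max_distance` cutoff, and the function name `recommend` are hypothetical and are chosen only to keep the sketch self-contained.

```python
from typing import Any, Callable, Dict, FrozenSet, List, Sequence, Set


def recommend(
    actor: str,
    acted_alarm: Any,
    open_alarms: Sequence[Any],
    affinities: Set[FrozenSet[str]],
    distance: Callable[[Any, Any], float],
    max_distance: float,
) -> Dict[str, List[Any]]:
    """Select new alarms to promote to operators similar to the actor."""
    # Operators that share a statistical affinity with the acting operator.
    peers = {op for pair in affinities if actor in pair
             for op in pair if op != actor}
    # New alarms sufficiently similar to the alarm that was acted upon.
    similar = [alarm for alarm in open_alarms
               if alarm is not acted_alarm
               and distance(acted_alarm, alarm) <= max_distance]
    return {peer: list(similar) for peer in sorted(peers)}
```

Applied to the example above, when one operator completes a VoIP telephone alarm, open VoIP telephone alarms would be returned for the other operator, while router alarms that exceed the distance cutoff would not.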
The alarm output engine 124 can be configured to output a list of alarms for each operator and rank alarms higher in the list when affinity is determined. Other techniques to bring such alarms to the attention of operators, such as icons and ratings, can additionally or alternatively be used.
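As one non-limiting sketch of such ranking, alarms promoted on the basis of affinity can be moved to the top of an operator's list while the remaining alarms keep their existing order. The alarm field name `id` and the function name `rank_alarms` are assumptions of this sketch.

```python
from typing import Any, Dict, List, Set


def rank_alarms(
    alarms: List[Dict[str, Any]],
    promoted_ids: Set[Any],
) -> List[Dict[str, Any]]:
    """Place affinity-promoted alarms first, otherwise keep the order."""
    # sorted() is stable, and False sorts before True, so promoted
    # alarms rise to the top without reshuffling the other alarms.
    return sorted(alarms, key=lambda alarm: alarm["id"] not in promoted_ids)
```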
As discussed above, the present invention has at least several advantages over the prior art. Alarms having greater relevance can be brought to the attention of network operators using machine intelligence and learning in an adaptive and dynamic manner.
While the foregoing provides certain non-limiting example embodiments, it should be understood that combinations, subsets, and variations of the foregoing are contemplated. The monopoly sought is defined by the claims.
This application claims the benefit of U.S. 62/325,126, filed Apr. 20, 2016, the entirety of which is incorporated herein by reference.
Number | Date | Country
---|---|---
62325126 | Apr 2016 | US