Event storms are common in any large-scale push-based monitoring system due to misconfiguration of monitoring agents or to noisy devices. Current monitoring systems stall or crash in the face of huge event storms and require user intervention to remedy the condition. To alleviate such performance degradation, some systems allow users to specify simple threshold-based policies and drop packets that do not satisfy the policies.
Embodiments of a network system and associated operating methods manage event storms. The network system comprises an event analysis and control engine that detects and manages events occurring on a network. The event analysis and control engine receives events from a plurality of agents, and analyzes the events according to policies specified in a policies templates database. The event analysis and control engine processes raw network packets directly with less than full packet parsing to generate a filtered stream of events based on the analysis. The event analysis and control engine propagates the filtered stream of events to a monitoring system. In at least some embodiments, the event analysis and control engine also reconfigures the end-agents, where possible, to reduce the event rate.
Embodiments of the invention relating to both structure and method of operation may best be understood by referring to the following description and accompanying drawings:
System and method embodiments of a scalable event analysis and control engine manage event traffic from multiple sources and can handle event storms.
Embodiments of a scalable event analysis and control engine can monitor event streams with a small memory and computation footprint, enable users to specify one or more of multiple different policies on monitored event streams, and shape the event traffic so that a monitoring system does not crash or stall. The depicted event analysis and control engine also can reconfigure end-agents to reduce event traffic. For scalability, the event analysis and control engine enables selection of efficient approximate counting algorithms that can compute statistics over events with a small memory footprint.
Embodiments of a network system can be configured with a capability to handle event storms using a closed-loop architecture that increases reliability and scalability of a network manager.
Embodiments of a network system can implement an efficient analysis algorithm with a small memory footprint to quickly locate misbehaving or misconfigured event generators. The network system can efficiently track offending event sources, thereby improving overall system reliability and enabling immunity to a large number of offending sources overrunning a system.
The disclosed event analysis and control engine and associated operating methods can address several aspects of functionality by analyzing an event traffic profile in near real-time and reporting on results of the analysis, and shaping trap traffic as appropriate to ensure that a monitoring system is not overwhelmed. Users can thus better control event generation.
The disclosed event analysis and control engine and associated operating methods can be implemented without using large buffers or file queues, thus enabling a memory-efficient approach which reduces memory footprint. The illustrative systems and techniques can enable memory and computation efficiency by event traffic shaping, thereby selectively controlling which events or event types pass to a monitoring system.
Referring to
The policies specify aspects of multiple options within the network system such as which statistics are computed, what thresholds are used, how traffic shaping is performed, what events to report to the monitoring system, how to reconfigure the agents, and the like. For example, a policy for traffic shaping can be “drop all events from end-agent A.” Similarly, a policy for statistical computation can be “compute Top-K sources which send more than 100 events per second.”
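The two example policies above can be represented as small data records. The following is a minimal sketch only; the class names `Event`, `DropPolicy`, and `TopKPolicy`, the `shape` helper, the event-type strings, and the choice of K=5 are hypothetical and are not taken from the policies templates database described herein.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Event:
    source: str       # end-agent that emitted the event
    event_type: str   # e.g. a trap or message type (names here are illustrative)

@dataclass(frozen=True)
class DropPolicy:
    """Traffic-shaping policy: drop events matching a source and/or a type."""
    source: Optional[str] = None      # None matches any source
    event_type: Optional[str] = None  # None matches any event type

    def matches(self, event: Event) -> bool:
        return (self.source in (None, event.source)
                and self.event_type in (None, event.event_type))

@dataclass(frozen=True)
class TopKPolicy:
    """Statistics policy: report the K busiest sources above a rate threshold."""
    k: int
    min_events_per_second: float

def shape(events, drop_policies):
    """Return only the events that no drop policy matches."""
    return [e for e in events if not any(p.matches(e) for p in drop_policies)]

# "drop all events from end-agent A"
drop_a = DropPolicy(source="A")
# "compute Top-K sources which send more than 100 events per second" (K=5 assumed)
top_k = TopKPolicy(k=5, min_events_per_second=100)

stream = [Event("A", "linkDown"), Event("B", "linkDown"), Event("A", "coldStart")]
filtered = shape(stream, [drop_a])  # only the event from end-agent B remains
```

Keeping shaping policies and statistics policies as separate record types mirrors the distinction drawn in the text between how traffic is shaped and which statistics are computed.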
The network system 100 can further comprise the policies templates database 110 which can be coupled to the event analysis and control engine 102, for example either directly or via a network link. The policies templates database 110 supplies policies templates for analysis. The network system 100 can further comprise the monitoring system 114 coupled to the event analysis and control engine 102 that receives filtered events and analysis events modified by shaping by the event analysis and control engine 102.
In some arrangements, the network system 100 can further comprise one or more agents 108 coupled to the event analysis and control engine 102 that receive a configuration from and communicate events to the event analysis and control engine 102. The agents 108 can be connected to the event analysis and control engine 102 by a network or other communication link, or by direct connection.
In an illustrative embodiment, the event analysis and control engine 102 can manage temporal concentrations of events by informing the monitoring system 114 and users about elevated event occurrence levels via analysis events 116. The event analysis and control engine 102 can then modify traffic by filtering the events 106 then forwarding the filtered events 112 to the monitoring system 114. The event analysis and control engine 102 then can reconfigure event-sending agents to reduce the number of events that are sent.
The event analysis and control engine 102 can be configured for conserving memory and computation consumption by leveraging optimized approximate counting data structures. In an example implementation, the counting data structures can be leveraged for continuously detecting event concentrations, for example by determining one or more statistics over the stream of events. If suitable, the statistics can be computed at different time scales. Window-based approximate counting algorithms can be used to compute the statistics.
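One simple way to compute a statistic at several time scales with bounded memory is a bucketed sliding window. The sketch below is an illustration under stated assumptions and is not the specific window-based algorithm of any embodiment: the class name `BucketedWindowCounter` is hypothetical, timestamps are assumed to be non-decreasing, and the count is approximate because the oldest, partially expired bucket is counted in full.

```python
from collections import deque

class BucketedWindowCounter:
    """Approximate count of events in the last `window` seconds.

    Time is divided into fixed-width buckets and only per-bucket totals are
    kept, so memory is about window / bucket_width entries whatever the rate.
    """

    def __init__(self, window: float, bucket_width: float):
        self.window = window
        self.bucket_width = bucket_width
        self.buckets = deque()  # (bucket_index, count), oldest first

    def _expire(self, now: float) -> None:
        # Drop buckets that lie entirely before the start of the window.
        oldest_kept = int((now - self.window) // self.bucket_width)
        while self.buckets and self.buckets[0][0] < oldest_kept:
            self.buckets.popleft()

    def add(self, now: float, n: int = 1) -> None:
        self._expire(now)
        index = int(now // self.bucket_width)
        if self.buckets and self.buckets[-1][0] == index:
            self.buckets[-1] = (index, self.buckets[-1][1] + n)
        else:
            self.buckets.append((index, n))

    def count(self, now: float) -> int:
        self._expire(now)
        return sum(c for _, c in self.buckets)

# Two time scales over the same stream: last minute and last hour.
last_minute = BucketedWindowCounter(window=60, bucket_width=10)
last_hour = BucketedWindowCounter(window=3600, bucket_width=60)
for t in (0, 5, 15, 65):          # event arrival times in seconds
    last_minute.add(t)
    last_hour.add(t)
```

A coarser bucket width lowers memory use at the cost of a larger over-count at the trailing edge of the window.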
The network system 100 can further comprise a user interface 118 coupled to the event analysis and control engine 102 that enables a user to select monitoring of different statistics at selected fine-grain and coarse-grain time scales over incoming events.
The event analysis and control engine 102 can also be configured for monitoring event streams for anomalies using analysis algorithms and by determining event traffic shaping based on the observed anomalies. Event traffic shaping can be implemented using one or more of several techniques that can be selectively activated. Example techniques can include dropping uniformly random events, dropping all events from a selected source, dropping all events of a selected event type, informing of anomalies via analysis of events with no events dropped, configuring at least one agent using database templates to reduce events from the at least one agent, and the like. Multiple of the event traffic shaping methods can be performed simultaneously.
In various implementations and/or conditions, the event analysis and control engine 102 can further be configured for analyzing and controlling event traffic in a push-based monitoring system. Similarly, the event analysis and control engine 102 can be configured for analyzing and controlling event traffic in a pull-based monitoring system wherein agents at end devices are queried for events from a central management server.
Referring to
Referring to
Since the network system 300 can automatically configure, where possible, the end-agents 308 and thus control the event rate at the sources, the configuration becomes a closed-loop control system.
The network system 300 can further comprise the policies templates database 310 coupled to the event analysis and control engine 302 that supplies policies templates for analysis. A monitoring system 314 can be coupled to the event analysis and control engine 302 and receives filtered events and analysis events which are modified by shaping by the event analysis and control engine 302.
The network system 300 can further comprise one or more agents 308 coupled to the event analysis and control engine 302 that receive a configuration from and communicate events to the event analysis and control engine 302.
The event analysis and control engine 302 can be configured to detect anomalies and selectively respond to detection by temporarily terminating receipt of traps from a source agent of the anomaly, temporarily terminating receipt of a specified event from a source agent, enabling a user to control behavior according to the analysis, and spawning additional trap processors according to the analysis.
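The first two responses, temporarily terminating receipt of all traps from a source or of one specified event from a source, can be sketched as a suppression table with expiry times. This is a minimal sketch only; the class `TemporarySuppression`, its method names, and the use of caller-supplied timestamps are assumptions made for illustration.

```python
class TemporarySuppression:
    """Tracks sources, or (source, event type) pairs, whose events are
    refused until a recorded expiry time."""

    def __init__(self):
        self._until = {}  # (source, event_type or None) -> expiry time

    def suppress(self, source, event_type=None, *, now, duration):
        # event_type=None suppresses every event from the source.
        self._until[(source, event_type)] = now + duration

    def accepts(self, source, event_type, now):
        for key in ((source, None), (source, event_type)):
            expiry = self._until.get(key)
            if expiry is not None:
                if now < expiry:
                    return False
                del self._until[key]  # suppression has lapsed
        return True

gate = TemporarySuppression()
gate.suppress("A", now=0, duration=30)               # all traps from agent A
gate.suppress("B", "coldStart", now=0, duration=30)  # one event type from B
```

Because each entry expires on its own, receipt resumes without user intervention once the suppression period ends.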
Referring to
Referring to
Referring to
Referring to
Referring to
In an example implementation, the one or more statistics can be selected from parameters regarding entities including top-K sources, event-types, (source, event)-tuples of the data structures, sources with an event rate extending past a predetermined threshold, event-types with an event rate extending past a predetermined threshold, (source, event)-tuples of the data structures with an event rate extending past a predetermined threshold, and the like.
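The statistics listed above can be illustrated with exact counters over a single window. The sketch below uses exact counting via `collections.Counter` purely for clarity, whereas the embodiments contemplate approximate counting data structures; the function name `summarize`, the dictionary keys, and the sample event names are hypothetical.

```python
from collections import Counter

def summarize(events, window_seconds, k, rate_threshold):
    """events: list of (source, event_type) tuples seen in one window."""
    by_source = Counter(source for source, _ in events)
    by_type = Counter(event_type for _, event_type in events)
    by_tuple = Counter(events)

    def over_threshold(counter):
        # Keys whose event rate in this window exceeds the threshold.
        return sorted(key for key, n in counter.items()
                      if n / window_seconds > rate_threshold)

    return {
        "top_k_sources": by_source.most_common(k),
        "hot_sources": over_threshold(by_source),
        "hot_event_types": over_threshold(by_type),
        "hot_tuples": over_threshold(by_tuple),
    }

window = ([("A", "linkDown")] * 300
          + [("B", "linkDown")] * 20
          + [("A", "coldStart")] * 5)
stats = summarize(window, window_seconds=2, k=1, rate_threshold=100)
```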
Different statistics can be monitored at selected fine-grain and coarse-grain time scales over incoming events.
Referring to
In various embodiments, event traffic can be shaped 456 using one or more techniques such as dropping uniformly random events, dropping all events from a selected source, dropping all events of a selected event type, informing of anomalies via analysis of events with no events dropped, configuring at least one agent using database templates to reduce events from the at least one agent, and the like. Multiple event traffic shaping methods can be performed simultaneously.
In some embodiments, the technique for analyzing and controlling event traffic can be implemented in a push-based monitoring system in which agents on the monitored devices or local aggregators push system monitoring data as events to a central management server.
In other embodiments or selected conditions, the technique for analyzing and controlling event traffic can be implemented in a pull-based monitoring system wherein agents at end devices are queried for events from a central management server.
Clusters of event traffic on a network system, which can be called event storms, can occur in monitoring systems such as push-based monitoring systems in which agents on the monitored devices or local aggregators push system monitoring data as events to a central management server. Examples of events can include alarms or traps as in a network manager software installation or messages as in an operations product installation. For example, in the network manager context, several scenarios can result in large event storms. An event storm can result when a wide area network (WAN) router fails and many (for example, several hundred) edge routers connected to the Internet via the WAN router generate alerts simultaneously. An event storm can also occur when a router is incorrectly configured with threshold values for generating alerts that are too low. A further cause of event storms is noisy devices that emit a large number of traps of little value to a monitoring system.
In an operations context, a scenario for occurrence of event storms is application agents that lose connection to a management server, for example due to network problems, buffer all generated messages, and then flood the server with the buffered messages once connectivity is re-established.
As shown in
Handling of large-scale event storms is a challenge for current monitoring systems. Monitoring systems that do not address event storms may crash in the face of such storms either due to running out of available memory for processing or CPU thrashing that occurs with event overload. For example, in the case of a persistent storm as shown in
Dropping events during storms is a common solution employed by some management products. For example, event reduction techniques in network manager and operations management applications can include an event correlation service circuit that allows suppression of events from specified devices. However, the strategy of simply suppressing events without any analysis to combat event storms has several disadvantages. Information in the events that would give insight into the cause of the event storms is lost. With no analysis, event suppression can drop not only events that should be dropped but also important events occurring during storms. Suppression of events without analysis can alleviate problems at the central server, yet the event storms can still disrupt other traffic on the network. Event suppression alone is not a suitable long-term solution since information relating to the profile of trap traffic in the operative environment and conditions is valuable to a user, and simple suppression does not give any such information.
Referring again to
In some conditions and/or embodiments, the system 100 can implement automatic remote reconfiguration of an agent 108, which is enabled by the agent 108 exposing interfaces and by the event analysis and control engine 102 being allocated access to templates to perform reconfiguration. In the illustrative example shown in
One aspect that can be implemented in an event analysis and control engine embodiment is a very small footprint with respect to both memory and computation consumption. For example, naive counting methods that maintain exact counts of events for each event source or for each event type can quickly fill memory space in a large-scale system (O(N) memory footprint for N distinct items). The illustrative system 100 can be implemented to leverage optimized approximate counting data structures such as count-sketch as described by M. Charikar, K. Chen, and M. Farach-Colton in “Finding Frequent Items in Data Streams,” in International Colloquium on Automata, Languages, and Programming, 2002. The count-sketch algorithm has a lower memory footprint than traditional counting methods because in the illustrative scheme only a constant number of counters are maintained, in contrast to counting methods in which a counter is maintained for every unique item. The data structure can be used to determine Top-K sources, event-types, and (source, event type)-tuples to detect the prolific event sources continuously. A top-K query requests the K tuples ranked highest according to a specific ranking function that combines values from multiple attributes. In addition, to supply statistics at different time scales (for example, Top-K in the last minute, last hour, last day), window-based approximate counting algorithms can be leveraged. These techniques enable monitoring of different statistics at fine-grain to coarse-grain time scales over the incoming events.
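The fixed-memory property of a count-sketch can be illustrated as follows. This is a simplified sketch under stated assumptions rather than the implementation of any embodiment: the class name `CountSketch`, the default depth and width, and the source names are hypothetical, and a salted `hashlib.blake2b` digest stands in for the pairwise-independent hash functions of the published algorithm.

```python
import hashlib
from statistics import median

class CountSketch:
    """Fixed-size frequency estimator: depth x width counters, independent
    of the number of distinct items seen."""

    def __init__(self, depth: int = 5, width: int = 256):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, item: str, row: int):
        # One salted hash per row yields both the bucket and the +/-1 sign.
        digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                                 salt=row.to_bytes(8, "little")).digest()
        value = int.from_bytes(digest, "little")
        return (value >> 1) % self.width, (1 if value & 1 else -1)

    def update(self, item: str, count: int = 1) -> None:
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, row)
            self.table[row][bucket] += sign * count

    def estimate(self, item: str):
        # Median across rows damps the error caused by colliding items.
        per_row = []
        for row in range(self.depth):
            bucket, sign = self._bucket_and_sign(item, row)
            per_row.append(sign * self.table[row][bucket])
        return median(per_row)

sketch = CountSketch()
sketch.update("router-7", 1000)   # one prolific source
for i in range(50):               # fifty quiet sources, one event each
    sketch.update(f"edge-{i}")
heavy = sketch.estimate("router-7")
```

The estimate for the prolific source can be off only by the events of other sources that collide with it, so here it lies within 50 of the true count of 1000 while the table stays at 5 x 256 counters. A Top-K report would additionally keep a small heap of the K largest estimates seen so far.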
As the analysis algorithms monitor the event stream for anomalies, the control engine decides how the traffic is shaped based on the observed anomalies. Depending on the policies, the control engine might (i) drop uniformly random events (note that a strategy that uses buffers and drops all events once the buffer fills is not a uniformly random drop, since only packets at the tail are dropped in case of bursts), (ii) drop all events from a source, or of an event type, etc., (iii) just inform the monitoring system/user about the anomalies via analysis events and not drop any events, or (iv) configure one or more agents using templates in the database to reduce the events from those agents.
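The distinction drawn in option (i) between a uniformly random drop and a buffer-based tail drop can be made concrete with a short sketch. The function names, the keep probability, the buffer size, and the simplification that the buffer does not drain during the burst are all assumptions made for illustration.

```python
import random

def uniform_random_drop(events, keep_probability, rng):
    """Keep each event independently with the same probability, so the
    survivors are spread across the whole burst."""
    return [e for e in events if rng.random() < keep_probability]

def tail_drop(events, buffer_size):
    """Buffer-based shaping: once the buffer is full, every later event
    is lost (the buffer is assumed not to drain during the burst)."""
    return events[:buffer_size]

burst = list(range(1000))    # events numbered in arrival order
rng = random.Random(0)       # seeded only to make the sketch repeatable
sampled = uniform_random_drop(burst, keep_probability=0.1, rng=rng)
truncated = tail_drop(burst, buffer_size=100)
```

Both approaches pass roughly a tenth of the burst, but `truncated` contains only the first 100 arrivals, whereas `sampled` retains events from throughout the burst and therefore preserves the traffic profile.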
Referring to
The analysis engine can be implemented as an augmentation to a monitoring system in a network manager application.
In further embodiments and applications, a control loop can be implemented that includes the event analysis and control engine using the Top-K statistics from analysis algorithms to reconfigure certain agents to reduce the number of events. The illustrative techniques are also applicable to any monitoring system that employs a push-based approach in which agents at end devices push events to a central management server. Accordingly, the illustrative system and techniques are applicable to other monitoring applications including Telecom event management systems and operations management systems.
Functionality of the event analysis and control engine and associated techniques extends beyond setting of rules for detection of simple event storm events, counting of the events of a type, checking for counts beyond a threshold in a specified time window, and enablement of users to write rules for dropping events on detection of storms. Functionality of the event analysis and control engine and associated techniques is greatly enhanced to support control functions to reconfigure the agents that send the events and includes an optimized analysis engine for detecting storms.
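A control loop of this kind can be sketched in a few lines. The example is a toy simulation under stated assumptions: the `Agent` class, its `set_rate_limit` interface, the `control_step` function, and the numeric rates are hypothetical and do not correspond to any real agent interface or template.

```python
class Agent:
    """Stand-in for an end-agent that exposes a reconfiguration interface."""

    def __init__(self, natural_rate):
        self.natural_rate = natural_rate  # events/sec when unconstrained
        self.limit = None

    def set_rate_limit(self, limit):      # hypothetical interface
        self.limit = limit

    def rate(self):
        if self.limit is None:
            return self.natural_rate
        return min(self.natural_rate, self.limit)

def control_step(agents, k, max_rate):
    """Reconfigure those of the current top-k sources that exceed max_rate;
    return the names of the agents that were reconfigured."""
    rates = {name: agent.rate() for name, agent in agents.items()}
    top_k = sorted(rates.items(), key=lambda kv: kv[1], reverse=True)[:k]
    changed = []
    for name, rate in top_k:
        if rate > max_rate:
            agents[name].set_rate_limit(max_rate)
            changed.append(name)
    return changed

agents = {"A": Agent(500), "B": Agent(20), "C": Agent(300)}
history = [control_step(agents, k=1, max_rate=100) for _ in range(3)]
total_rate = sum(agent.rate() for agent in agents.values())
```

Each pass acts only on the current top source, so the loop throttles agent A, then agent C, and then makes no further change once no top source exceeds the limit.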
Terms “substantially”, “essentially”, or “approximately”, that may be used herein, relate to an industry-accepted tolerance to the corresponding term. Such an industry-accepted tolerance ranges from less than one percent to twenty percent and corresponds to, but is not limited to, functionality, values, process variations, sizes, operating speeds, and the like. The term “coupled”, as may be used herein, includes direct coupling and indirect coupling via another component, element, circuit, or module where, for indirect coupling, the intervening component, element, circuit, or module does not modify the information of a signal but may adjust its current level, voltage level, and/or power level. Inferred coupling, for example where one element is coupled to another element by inference, includes direct and indirect coupling between two elements in the same manner as “coupled”.
The illustrative block diagrams and flow charts depict process steps or blocks that may represent modules, segments, or portions of code that include one or more executable instructions for implementing specific logical functions or steps in the process. Although the particular examples illustrate specific process steps or acts, many alternative implementations are possible and commonly made by simple design choice. Acts and steps may be executed in different order from the specific description herein, based on considerations of function, purpose, conformance to standard, legacy structure, and the like.
While the present disclosure describes various embodiments, these embodiments are to be understood as illustrative and do not limit the claim scope. Many variations, modifications, additions and improvements of the described embodiments are possible. For example, those having ordinary skill in the art will readily implement the steps necessary to provide the structures and methods disclosed herein, and will understand that the process parameters, materials, and dimensions are given by way of example only. The parameters, materials, and dimensions can be varied to achieve the desired structure as well as modifications, which are within the scope of the claims. Variations and modifications of the embodiments disclosed herein may also be made while remaining within the scope of the following claims.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/US08/79889 | 10/14/2008 | WO | 00 | 4/11/2011 |