The present invention relates to a method and a network node for diagnosing one or more faults in a multi-layer communications network.
Fast failure detection and failure diagnosis is an important area in network management. After a failure is detected and data switched to alternative paths, there is a need to quickly localize the failure so that measures may be taken to replace or repair the faulty network element. A communications network involves a large set of distributed hardware and software components. Errors may occur in each device in the network despite best practices in design, implementation, and testing. Root causes can also be affected by external factors.
The multi-layer network comprises network elements in several network protocol layers. An Ethernet E-line service on top of a packet-optical integrated network is an example of a network service provided in a multi-layer network, but other large-scale distribution networks may also be configured as multi-layer networks. Fault localization in a multi-layer network is generally difficult. Path-trace capabilities, such as IP trace-route, are not available in the optical layer.
In multi-layer networks, network elements with measurement capability in the network layers generate measurement records on network traffic. These records may be collected and used for statistical and/or reporting purposes. When faults are experienced in a multi-layer network, the collected information is used to detect failures and performance degradations that exist in the network. Typically, the information is gathered in a network management system (NMS) that handles fault management and performance management.
Known fault localization methods aim at finding a correlation between a network fault and one or more fault carrying events, i.e., events occurring in response to the network fault. This is usually a difficult task due to the relatively large amount of fault carrying events caused by a network fault. Events may in this context be the issuance of messages informing about something happening in the network, such as the occurrence of an alarm, a performance parameter increment, or a service/action request made by one node to itself or to another node.
A disadvantage of these methods is that they are limited in the scope of correlation of fault carrying events. In a complex communication system, processing of fault carrying events exclusively may not be sufficient to succeed in fault localization. The fault carrying events may be an effect of a fault but may also occur as a result of symptoms of the real fault. These symptoms may be localized far away from the actual location of the fault.
There are existing techniques to search for dependencies between fault carrying event(s) and the problem causing the fault carrying events, within one or more different subsystems. Dependencies between for instance an alarm in a certain subsystem and the cause of the problem, if the cause resides in a different subsystem, are thus not considered. Network performance monitoring, automated failure localization and diagnosis are critical to service providers of large distribution networks, due to the increases in scale, diversity and complexity of the application services. There are significant network management challenges that arise from the combination of rapid growth and increased complexity of the network. Failure and performance degradation diagnosis is an important area in network management.
The complexity and prevalence of communication networks, systems, and associated services have increased over the past several years. Many new applications are increasingly complex in terms of the resources required to operate and deliver the applications, the application functions, and storage architecture, for example. The resources necessary to conceive, develop, activate, and eventually to provide increasingly complex applications continue to increase. In addition to the increasing complexity of applications and services, there is increased demand for applications and services that traverse various network technologies and systems.
From a network management standpoint, these various networks and network devices often report operational information in different ways. For example, the networks and network devices may employ particular network management approaches and technologies for monitoring operation of the network system, and network management personnel associated with particular networks and network devices may rely upon specific, and varied, network management systems and methods. Furthermore, modern networks increasingly rely upon third party vendors to provide hardware and/or software for offered services. These hardware and software devices frequently operate and report according to systems, methods, and even protocols that are not the same as the network providing the services.
None of the techniques for localizing faults and failure diagnosis discussed above are applicable in a large scale multi-layer network. Thus, there is a need for a method to support diagnosis of failure and performance degradation instances in a multi-layer network.
It is an object of the present invention to provide such a method for diagnosis of failure in a multi-layer network. This object is achieved through a method for root cause diagnosis of network failures and performance degradations in a multi-layer network, the method identifying the network elements that are the source of the failure as well as the type of fault in these elements.
In an embodiment of a method according to the invention, a first set of performance measurements occurring in response to the fault and representing at least a first and a second layer in a multi-layer communications network is received in the network management system. The fault is localized by identifying a probable set of network elements affected by the fault. A type of fault is inferred from one or more symptoms in the first set of performance measurements. The root cause of the failure is identified from a combination of the information on the probable set of network elements and the inferred type of fault.
In another embodiment of a method according to the invention, the root cause analysis is performed in a pre-generated fault type inference graph.
In a further embodiment of a method according to the invention, additional measurements, a second set of measurements, are triggered when the analysis of localization of a fault and/or the fault type is non-conclusive. The steps of localizing the fault by identifying a probable set of network elements affected by the fault, inferring the type of fault from one or more symptoms incurred from one or more performance measurements, and determining the root cause by combining the information on the probable set of network elements with the inferred type of fault are repeated for the additional measurements.
It is another object of the invention to provide an arrangement in a network management node in a multi-layer network for carrying out the inventive method. The arrangement includes a receiver configured to receive a first set of performance measurements representing performance of network elements on at least a first and a second layer in the multi-layer communications network. The arrangement further includes a root cause analyzer arranged to detect affected network elements and fault type upon network failure. The root cause analyzer further comprises a spatial localization unit configured to identify a probable set of network elements affected by the fault based on the first set of performance measurements, and a fault type inference unit configured to infer the type of fault from one or more symptoms incurred from one or more performance measurements in the set of performance measurements. A transmitter is configured to output the information from the root cause analyzer.
Failure diagnosis in multi-layer networks requires the ability to localize faults in a large set of distributed components.
In a first step of the illustrated embodiment of the inventive method, a first set of performance measurements is received 210, representing elements within at least a first and a second layer in the multi-layer communications network. The performance measurements are preferably collected in measurement tools implemented on routers, switches or on hosts, capturing network performance, statistics of traffic performance metrics, e.g., optical bit-error-rate, MPLS delay, Ethernet bandwidth, or any other suitable metric from any protocol layer element, in the multi-layer network. In an embodiment of the invention the first set of performance measurements is represented by input events, wherein a set of network elements is associated with each event. This set of elements is referred to as ‘Elements of e’. The resolution in terms of elements associated with an event may vary between different events. For example, for an event e indicating a long delay on a path, the associated set of elements includes all network components that are part of the path. A router can be one of these components. However, the operator may also need to make the resolution finer and include measurements that have not previously been included. This could be accomplished by adding all involved routers' interfaces to the Element set.
In a next step 220, fault localization is performed identifying a probable set of network elements affected by the fault. The goal of the fault localization is to find all elements that could be the source of the failure/degradation. The fault localization is performed based on an assumption that at any given time, there is only a single fault leading to many input events and corresponding performance measurement anomalies from the elements associated with the input events. In the fault localization step, one or more elements that are part of all the input events are identified.
The first set of performance measurements is represented by input events e(i). A set of network elements e(i).elements is associated with each input event, where a network element may be associated with a plurality of input events. The fault localization process further involves identifying the network elements, common-elements, that are associated with every one of the input events.
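The identification of common-elements can be illustrated with a set intersection. The following is a minimal sketch under the single-fault assumption stated above; the function name, the dictionary layout and the example element names are hypothetical and are not taken from the disclosure.

```python
def common_elements(events):
    """Return the elements that appear in every input event.

    `events` maps a (hypothetical) event name to the collection of
    network elements associated with it, i.e. e(i).elements.
    """
    element_sets = [set(elements) for elements in events.values()]
    if not element_sets:
        # No events, so no candidate fault location.
        return set()
    return set.intersection(*element_sets)


# Illustrative input events; element names are made up for the example.
events = {
    "e1": {"R1", "R2", "link-12"},
    "e2": {"R2", "R3", "link-12"},
    "e3": {"R2", "R4", "link-12"},
}
print(sorted(common_elements(events)))
```

An empty result would indicate that no single element explains all events, which is one situation in which further measurements might be requested.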
In step 230 of the fault analysis method, the type of failure is assessed. In a root cause analysis, inferring the type of the root cause is independent of determining the location of the fault. Different types of faults result in different groups of measurement events, e.g., an optical link problem results in a first group of measurement events, and congestion at the MPLS layer results in a second group of measurement events that does not overlap with the first group. Congestion at the Ethernet layer, on the other hand, will result in a third group of measurement events that may in part overlap with a fourth group of measurement events following a situation of near congestion at the MPLS layer. In the step of assessing the fault type, the possible combination of events is tracked through a graph of events. The step of inferring the type of failure is preferably performed in sequence after the step of localizing the fault.
In step 240, the root cause is determined by combining information on the location of the fault from step 220 and the type of fault from step 230.
In the disclosed inference graph, each oval represents a measurement event and the rims of the ovals represent a state transition. The boxes map to possible causes, fault types, associated with the measurement event linked to the fault type in the inference graph. The graph contains types of measurements and is not specific to any location in the network. For each group of events, the most likely locations are identified using the Fault-Localization Algorithm [1], prior to inferring the type of root cause. The output of the two consecutive steps will be an identification of the fault location and the type of fault.
The identification of measurement events is preferably threshold based, with thresholds set to detect anomalies in the performance measurement results. The invention can be applied with different event detection mechanisms.
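A threshold-based event detector of the kind described can be sketched as follows. The record layout, the field names and the example threshold value are assumptions made for illustration only; the disclosure does not prescribe a data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputEvent:
    """Hypothetical record for one measurement event."""
    event_type: str       # e.g. "mpls_loss"
    elements: frozenset   # network elements associated with the event
    value: float          # measured value
    anomaly: bool         # True when the value exceeds the threshold


def detect_event(event_type, elements, value, threshold):
    """Flag the measurement as anomalous when it exceeds the threshold."""
    return InputEvent(event_type, frozenset(elements), value, value > threshold)


# Example: a loss measurement on a path through R1, R2 and R3,
# compared against an illustrative loss threshold L = 0.01.
event = detect_event("mpls_loss", {"R1", "R2", "R3"}, value=0.05, threshold=0.01)
print(event.anomaly)
```

Other detection mechanisms could replace the comparison in `detect_event` without changing the rest of the analysis.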
In an embodiment of the invention, the step of inferring a fault type involves a search process starting at a lower level and moving to a higher level in the networking protocol stack. Events in higher layers are often symptoms of events in lower layers. Thus, starting the search at a lower level may lead to identification of a fault type faster than performing the fault type inference with a starting point at a higher level. However, the invention is not limited to a lower level starting point; the search may equally start at a higher level and move downwards.
When deciding on starting points within the same protocol layer, the search process checks measurement events with lower measurement overhead first. Execution of performance measurements in the network introduces a measurement load in the multilayer network, also known as the measurement overhead in the multilayer network. Starting with lower overhead measurements, the search process will be initiated for those measurements that may be performed with little impact on the overall performance of the multi-layer network; thus, unnecessary measurement load may be reduced in the system. Low overhead measurements are usually conducted more frequently and may thus provide more accurate information to help isolate the fault type faster.
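The ordering of the search can be expressed as a two-key sort: layer first, measurement overhead second. The sketch below is illustrative only; the layer indices (including the relative position of the Ethernet and MPLS layers) and the overhead figures are assumed values, not data from the disclosure.

```python
# Hypothetical measurement descriptors: (event type, layer index, overhead).
# Layer index 0 is the lowest layer; overhead is an arbitrary relative cost.
measurements = [
    ("mpls_traceroute", 2, 5.0),
    ("ethernet_bandwidth", 1, 8.0),
    ("optical_ber", 0, 1.0),
    ("ethernet_loss", 1, 2.0),
]


def search_order(candidates):
    """Sort by layer (lowest first), then by overhead (cheapest first)."""
    return sorted(candidates, key=lambda m: (m[1], m[2]))


ordered = [name for name, _, _ in search_order(measurements)]
print(ordered)
```

Reversing the first key would give the top-down search that the text also allows.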
The inference type search uses a pre-generated inference graph as input. The graph is traversed from a root node to a root cause node. For each node, a search is performed in an event set E for all events with the same type as the node. The search stops when a root cause node is detected.
In the step of inferring the fault type, the first set of performance measurements are represented by input events e(i). An event set E is formed by a set of input events e(i) forming a sequence of input events. Root cause analysis is performed for event set E, determining one or more symptoms.
In an embodiment of the invention, a pre-generated inference graph is used as an input, mapping a type of fault to the symptoms. The graph is traversed from the root node. Starting in a root node, a search is performed in the event set E for all events with the same type as the current node. If any event of the current type has an anomaly, traversing will continue to the next node with the “Y” branch. If there is no anomaly, traversing will continue through the “X” branch. If the current node is a root cause node, the search stops. If there are no unambiguous results relating to a root cause at this point of the search, e.g., when the search results in two or more possible root causes, an on-demand algorithm for generating further measurements should be triggered at this point.
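The traversal described above can be sketched as a simple loop over a dictionary-based graph. The graph contents, node names, event types and fault labels below are hypothetical; only the branch labels “Y” and “X” follow the text. The sketch does not cover the ambiguous case that triggers on-demand measurements.

```python
def infer_fault_type(graph, root, event_set):
    """Walk the inference graph from `root` until a root cause node is hit.

    Each non-terminal node has a "type" and two successors, "Y" (taken when
    any event of that type shows an anomaly) and "X" (taken otherwise).
    A root cause node carries a "cause" entry instead.
    """
    node = root
    visited = set()
    while True:
        entry = graph[node]
        if "cause" in entry:
            return entry["cause"]
        if node in visited:
            raise ValueError(f"cycle detected at node {node!r}")
        visited.add(node)
        anomalous = any(
            e["anomaly"] for e in event_set if e["type"] == entry["type"]
        )
        node = entry["Y"] if anomalous else entry["X"]


# Illustrative graph, searched from the lowest layer upwards.
graph = {
    "ber": {"type": "optical_ber", "Y": "c_optical", "X": "mpls"},
    "mpls": {"type": "mpls_loss", "Y": "c_mpls", "X": "eth"},
    "eth": {"type": "eth_bandwidth", "Y": "c_eth", "X": "c_none"},
    "c_optical": {"cause": "optical link problem"},
    "c_mpls": {"cause": "MPLS congestion"},
    "c_eth": {"cause": "Ethernet congestion"},
    "c_none": {"cause": None},
}
event_set = [
    {"type": "optical_ber", "anomaly": False},
    {"type": "mpls_loss", "anomaly": True},
    {"type": "eth_bandwidth", "anomaly": True},
]
print(infer_fault_type(graph, "ber", event_set))
```

In this example the Ethernet anomaly is never inspected, because the MPLS anomaly already leads to a root cause node.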
In order to accurately identify the root cause, a complete set of information on performance at each layer and at each network element is required. However, the completeness of the set of performance information must be balanced against the concerns for measurement overhead. In the disclosed scenario, measurement capability is available involving the routers R1-R4 in the three layer network. In response to detection of some type of malfunction in the network, performance measurements are collected in measurement tools implemented on the routers. The performance measurements are represented by input events. Each input event represents measurement results from one or more measurement tools. In the disclosed simplified scenario, a group of input events e1-e4, b1, b2 and m1, m2, respectively representing the optical layer, the Ethernet layer and the MPLS layer, are identified. The resolution in terms of elements associated with an event varies between the events. The input events e1-e4 are based on bit error rate measurements in the routers; the input events b1, b2 concern Ethernet bandwidth and include measurements relating to multiple routers, b1:{R1, R2, R3}, b2:{R2, R3, R4}; the same applies for input events m1, m2 that concern MPLS trace-route, m1:{R1, R2, R3}, m2:{R2, R3, R4}.
A receiver 360 in an arrangement 300 in a network management system receives a first set of input events b1, b2, m1, m2, e1-e4. The performance measurements and the corresponding input events represent the Ethernet layer {b1, b2}, the MPLS layer {m1, m2}, and the optical layer {e1, e2, e3, e4}. The input events are generated by an event detection module in response to a malfunction in the disclosed three-layer network. The event identification is usually performed as a threshold based method, wherein thresholds based on time T, packet loss L and bandwidth B are defined. An Ethernet bandwidth measurement is represented by an input event b1 defined as Ethernet bandwidth>B. A further event b2 is also defined as Ethernet bandwidth>B. An MPLS traceroute measurement is represented by an input event m1 defined as MPLS loss>L; a further event in the MPLS layer is identified as input event m2, also defined as MPLS loss>L. In the optical layer, events e1-e4 represent bit error rate measurements in the routers R1-R4.
The fault is localized by identifying the set of routers most commonly represented across all of the events. A search for the routers associated with each event is performed. In the illustrated scenario, the result of this search is a set of most likely locations for each event.
Based on the likely locations for b1, b2, m1 and m2, the most common set among all locations is identified as locations R2, R3.
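The result for the illustrated scenario can be reproduced by counting how often each router appears across the events. The sketch below uses the element sets given for b1, b2, m1 and m2 in the scenario; the function name and the counting approach are illustrative assumptions rather than the disclosed algorithm.

```python
from collections import Counter

# Element sets for the Ethernet and MPLS events of the disclosed scenario.
likely_locations = {
    "b1": {"R1", "R2", "R3"},
    "b2": {"R2", "R3", "R4"},
    "m1": {"R1", "R2", "R3"},
    "m2": {"R2", "R3", "R4"},
}


def most_common_locations(locations):
    """Return the routers that occur in the largest number of events."""
    counts = Counter(r for routers in locations.values() for r in routers)
    if not counts:
        return set()
    top = max(counts.values())
    return {router for router, n in counts.items() if n == top}


print(sorted(most_common_locations(likely_locations)))  # ['R2', 'R3']
```

R2 and R3 each appear in all four events, while R1 and R4 appear in only two, which matches the locations named above.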
The type of fault is inferred through a fault analysis performed as a root cause analysis in an inference graph. Following the inference graph illustrated in
As previously disclosed for
The root cause inference graph can be used for improving the accuracy of root cause analysis, especially when sufficient measurements for an actual fault location and type determination are lacking. In an embodiment of the invention, a second set of measurements is triggered based on the results of root cause analysis. Three types of triggering conditions are foreseen: when sufficient measurements to draw a conclusion on the root cause are missing, when sufficient measurements to identify the location of the fault are missing, and when there is no strong symptom of service degradation but an indication of future congestion.
An additional second set of measurements may be triggered following a root cause analysis based on the first set of measurements. During the first root cause analysis, it is possible to identify a set of additional events required in order to form a conclusion on a fault type. A second set of performance measurements could represent measurements of a type not previously included in the performance measurements due to measurement overhead. However, in a situation where a fault exists in the multi-layer network and where fault-localization attempts have failed, such increased overhead will be acceptable. New types of measurement could include high-frequency loss measurement, more accurate/more aggressive bandwidth measurement, jitter measurement, etc. When encountering possible fault types without matching events, a triggering command initiates additional measurements to generate a second set of measurements. The second set of measurements is represented by input events and included in the step of inferring a fault type through root cause analysis in the fault inference graph.
A second set of measurements may also be triggered when a set of available measurements does not provide sufficient information to localize the fault.
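One way to picture the triggering step is to compare the event types each candidate fault type needs against the event types already observed. This is a minimal sketch; the mapping from fault types to required events and all names in it are hypothetical.

```python
def missing_event_types(required_by_fault, observed_types):
    """Return, per candidate fault type, the event types not yet observed."""
    observed = set(observed_types)
    missing = {}
    for fault, required in required_by_fault.items():
        gap = set(required) - observed
        if gap:
            missing[fault] = sorted(gap)
    return missing


# Hypothetical requirements: which events are needed to confirm each fault.
required_by_fault = {
    "mpls_congestion": {"mpls_loss", "mpls_jitter"},
    "optical_link_problem": {"optical_ber"},
}
observed = ["optical_ber", "mpls_loss"]

to_trigger = missing_event_types(required_by_fault, observed)
print(to_trigger)  # {'mpls_congestion': ['mpls_jitter']}
```

The returned event types would be the candidates for the second set of measurements; fault types whose required events are all present do not appear in the result.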
In order to prevent overloading of the network and the devices, measuring is usually performed in a low-rate manner. When failure or performance degradation occurs, the frequency of measuring may be increased to obtain more accurate measurement results. In order to be able to identify and initiate preventive actions prior to network failure, it is possible to increase the measurement frequency. A second set of measurements representing higher frequency measurements will be triggered when one or more detected events are close to a defined threshold for triggering alarms. Similarly, when the fault type diagnosis suggests that the network is close to congestion, more frequent measurements are required in order to detect or predict future congestion.
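The rate adjustment can be sketched as a check of how close a measured value is to its alarm threshold. The 10% margin and the two measurement intervals below are assumed values chosen for illustration, and the sketch assumes a positive threshold that is exceeded from below.

```python
def should_increase_rate(value, threshold, margin=0.1):
    """True when the value is within `margin` of the threshold or above it."""
    return value >= threshold * (1.0 - margin)


def measurement_interval(value, threshold, normal_s=60.0, fast_s=5.0):
    """Pick a hypothetical measurement interval in seconds."""
    return fast_s if should_increase_rate(value, threshold) else normal_s


print(measurement_interval(0.95, 1.0))  # close to the threshold: 5.0
print(measurement_interval(0.50, 1.0))  # far from the threshold: 60.0
```

Values that have already crossed the threshold also select the fast interval, reflecting the statement that the measurement frequency may be increased once a failure or degradation has occurred.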
In the disclosed embodiments for triggering additional measurements, the process of choosing which measurements to require for the second set of measurements is based on data flagged as missing by the root cause analysis algorithm. The process of initiating generation of the second set of measurements is automated and enables an improved method for localizing faults and identifying fault types in a multi-layer network. It is of course possible to generate a second set of measurements including measurements from different paths, as well as measurements of another type or an increased frequency. However, the analysis of the second set of measurements would then preferably be carried out for each individual subset.
A benefit of the disclosed method for localizing and inferring type of fault is that the method considers performance measurements at different protocol layers in the network. The performance measurements are collected by one or more measurement tools. The root cause of a fault is inferred from a set of symptoms extracted from measurements. The method can be deployed without access to the network devices.
Working with on-demand measurements, triggered only when additional measurements are required to draw a conclusion on the localization and type of faults, reduces measurement overhead in the fault analysis process while enabling improved inference accuracy.
The disclosed method may operate in a packet-optical integrated network with Ethernet services, e.g., a Metro-Ethernet Self Organizing Network (MESON). The disclosed method can be performed without integration with software deployment and billing systems. The method may automatically determine the configuration of the measurement tools required for gathering the data required in the root cause analysis process.
| Filing Document | Filing Date | Country | Kind | 371c Date |
|---|---|---|---|---|
| PCT/SE2012/050028 | 1/16/2012 | WO | 00 | 6/20/2014 |

| Number | Date | Country |
|---|---|---|
| 61578419 | Dec 2011 | US |