This disclosure relates to computer networks, and more particularly, to management of network devices.
A computer network is a collection of interconnected computing devices that can exchange data and share resources. A variety of devices operate to facilitate communication between the computing devices. For example, a computer network may include routers, switches, gateways, firewalls, and a variety of other devices to provide and facilitate network communication.
These network devices typically include mechanisms, such as management interfaces, for locally or remotely configuring the devices. By interacting with the management interface, a client can perform configuration tasks as well as perform operational commands to collect and view operational data of the managed devices. For example, the clients may configure interface cards of the device, adjust parameters for supported network protocols, specify physical components within the device, modify routing information maintained by a router, access software modules and other resources residing on the device, and perform other configuration tasks. In addition, the clients may allow a user to view current operating parameters, system logs, information related to network connectivity, network activity or other status information from the devices as well as view and react to event information received from the devices.
Network configuration services may be performed by multiple distinct devices, such as routers with service cards and/or dedicated service devices. Such services include connectivity services such as Layer Three Virtual Private Network (L3VPN), Virtual Private Local Area Network Service (VPLS), and Peer to Peer (P2P) services. Other services include network configuration services, such as Dot1q VLAN Service. Network management systems (NMSs) and NMS devices, also referred to as controllers or controller devices, may support these services such that an administrator can easily create and manage these high-level network configuration services.
In particular, user configuration of devices may be referred to as “intents.” An intent-based networking system lets administrators describe the intended network/compute/storage state. User intents can be categorized as business policies or stateless intents. Business policies, or stateful intents, may be resolved based on the current state of a network. Stateless intents may be fully declarative ways of describing an intended network/compute/storage state, without concern for a current network state.
Intents may be represented as intent data models, which may be modeled using unified graphs. Intent data models may be represented as connected graphs, so that business policies can be implemented across intent data models. For example, data models may be represented using connected graphs having vertices connected with has-edges and reference (ref) edges. Controller devices may model intent data models as unified graphs, so that the intent models can be represented as connected. In this manner, business policies can be implemented across intent data models. When intents are modeled using a unified graph model, adding support for a new intent requires extending the graph model and the compilation logic.
In order to configure devices to perform the intents, a user (such as an administrator) may write translation programs that translate high-level configuration instructions (e.g., instructions according to an intent data model, which may be expressed as a unified graph model) to low-level configuration instructions (e.g., instructions according to a device configuration model). As part of configuration service support, the user/administrator may provide the intent data model and a mapping between the intent data model and a device configuration model.
In order to simplify the mapping definition for the user, controller devices may be designed to provide the capability to define the mappings in a simple way. For example, some controller devices provide the use of Velocity Templates and/or Extensible Stylesheet Language Transformations (XSLT). Such translators contain the translation or mapping logic from the intent data model to the low-level device configuration model. Typically, a relatively small number of changes in the intent data model impact a relatively large number of properties across device configurations. Different translators may be used when services are created, updated, and deleted from the intent data model.
In general, this disclosure describes techniques for managing network devices. A network management system (NMS) device, also referred to herein as a controller device, may configure network devices using low-level (that is, device-level) configuration data, e.g., expressed in Yet Another Next Generation (YANG) data modeling language. According to the techniques described herein, the controller device may configure the network devices at individual component level or individual service level. The controller device implements a programmable network diagnosis model to provide root cause analysis (RCA) for events (e.g., faults) detected over the network. The programmable network diagnosis model of this disclosure applies model traversal techniques over a resource definition graph that accounts for device resources, service resources provided by the network devices, and the interdependencies between the various resources.
The programmable network diagnosis model permits programming of cause and effect relationships between resource events, and initialization of telemetry rules for both device resources and service-associated device resources. Additionally, the programmable network diagnosis model enables forward chaining-based RCA by automatically deriving inference rules, and accounts for temporal relations between network events. The programmability of the network diagnosis model enables the controller device to perform the forward chaining-based RCA techniques of this disclosure while accommodating dynamic network changes. In this way, the programmable network diagnosis model is scalable in that the controller device may program the model to accommodate changes to the size or configuration of the network, and to support numerous resources implemented by the network devices.
In one example, this disclosure is directed to a method of monitoring a device group of a network. The method includes receiving, by a programmable diagnosis service running on a controller device that manages the device group, a programming input and forming, by the programmable diagnosis service, based on the programming input, a resource definition graph that models interdependencies between a plurality of resources supported by the device group. The method further includes detecting, by the programmable diagnosis service, an event affecting a first resource of the plurality of resources, and identifying, based on the interdependencies modeled in the resource definition graph formed based on the programming input, a root cause event that caused the event affecting the first resource, the root cause event occurring at a second resource of the plurality of resources.
In another example, this disclosure is directed to a controller device for managing a device group of a network. The controller device includes a network interface, a memory, and processing circuitry in communication with the memory. The processing circuitry is configured to receive, using a programmable diagnosis service executed by the processing circuitry, a programming input, and to form, using the programmable diagnosis service, based on the programming input, a resource definition graph that models interdependencies between a plurality of resources supported by the device group. The processing circuitry is further configured to detect, using the programmable diagnosis service, an event affecting a first resource of the plurality of resources, and to identify, using the programmable diagnosis service, based on the interdependencies modeled in the resource definition graph formed based on the programming input, a root cause event that caused the event affecting the first resource, the root cause event occurring at a second resource of the plurality of resources.
In another example, this disclosure is directed to a controller device for managing a device group of a network. The controller device includes means for receiving, using a programmable diagnosis service executed by the processing circuitry, a programming input, and means for forming, using the programmable diagnosis service, based on the programming input, a resource definition graph that models interdependencies between a plurality of resources supported by the device group. The controller device further includes means for detecting, using the programmable diagnosis service, an event affecting a first resource of the plurality of resources, and means for identifying, using the programmable diagnosis service, based on the interdependencies modeled in the resource definition graph formed based on the programming input, a root cause event that caused the event affecting the first resource, the root cause event occurring at a second resource of the plurality of resources.
In another example, this disclosure is directed to a non-transitory computer-readable medium encoded with instructions. When executed, the instructions cause processing circuitry of a controller device for managing a device group of a network to receive, using a programmable diagnosis service executed by the processing circuitry, a programming input, to form, using the programmable diagnosis service, based on the programming input, a resource definition graph that models interdependencies between a plurality of resources supported by the device group, to detect, using the programmable diagnosis service, an event affecting a first resource of the plurality of resources, and to identify, using the programmable diagnosis service, based on the interdependencies modeled in the resource definition graph formed based on the programming input, a root cause event that caused the event affecting the first resource, the root cause event occurring at a second resource of the plurality of resources.
The programmable network diagnosis model of this disclosure provides several technical improvements over existing RCA technology. Networks are dynamic with respect to their structures and components (e.g., structures and/or configurations thereof). The programmable network diagnosis model enables administrators to adapt the correlation system to accommodate changes in the network topology, the component types and versions, and the services offered. Because the offered services can change and grow in number because of potential differences between customers or entities catered to, the programmability of the network diagnosis model enables integration of new services with respect to the forward chaining-based RCA techniques of this disclosure. In this way, the programmable network diagnosis model of this disclosure provides scalable and reliable error resilience over networks that incorporate diverse devices and support diverse resources.
The details of one or more examples are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Fault diagnosis (sometimes referred to as “root cause analysis” or “RCA”) is a process to identify the initiating condition or event that triggers a network component failure from a set of possible candidate events/conditions that are generated or present within a discrete time window. RCA is a critical task for operators to maintain a properly functioning network. A few possible techniques to perform RCA include a model traversing technique and a dependency graph technique.
The model traversing technique uses object models to determine fault propagation. The network is represented using various components and relationships between the components. Based on this model representing the network, fault dependencies can be inferred and used to identify the root cause of an issue. Model traversing techniques do not specify fault dependencies directly, but instead, derive the fault dependencies from the model during run-time. These techniques are suitable for a network that changes frequently. However, by themselves, model traversing techniques cannot deal with more complex fault propagation scenarios (e.g., basing fault propagation on an assumption that only one issue happens at a time, etc.).
The dependency graph technique uses a directed graph to model dependencies between the object events. Nodes represent network elements (e.g., hosts). An edge from node A: event to node B: event indicates that failures in node A can cause failures in node B. Dependency graphs are often used in networks with infrequent changes. In networks with frequent changes, the dependencies need to be updated frequently. Network complexity is on the increase, particularly in light of the rapid increase in the number of connected devices, the relatively complex topology of distributed networks, and increasing internet of things (IoT) adoption. These factors also contribute to the heterogeneity of networks, due to the differences in device capabilities and configurations. For example, one network can be overlaid on top of another network. For example, virtual private networks (VPNs) are overlaid on internet protocol (IP) networks, which the VPNs use as a transport layer. Network troubleshooters need a mechanism by which to correlate the issues across layers with a generic model-driven solution that can be applied to any network and service topology, that can support networks with frequent changes, and that can support multiple concurrent faults.
Because networks are dynamic with respect to their structures and components, adaptability of the correlation system to ongoing changes in the network topology, component types and versions, and the services offered represents a technical improvement over existing RCA technologies. Programmable diagnosis services of this disclosure provide scalability and response times that enable reliable RCA over dynamic, heterogeneous networks. The programmable diagnosis model of this disclosure enables network administrators to program the network and device resources including service resources, device resources, and resource dependencies therebetween. Additionally, the programmable diagnosis model of this disclosure enables network administrators to program cause-and-effect relationships between resource events that may occur within the network.
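As a minimal sketch of the dependency graph technique described above, the following Python example stores cause-to-effect edges between events and, given a set of observed events, reports those observed events that no other observed event can explain. The event names, the edge list, and the function names are hypothetical and chosen only for illustration; they are not taken from this disclosure.

```python
from collections import defaultdict

# Hypothetical dependency edges: (cause, effect) means a failure in
# `cause` can produce a failure in `effect`.
EDGES = [
    ("nodeA:link_down", "nodeB:session_down"),
    ("nodeB:session_down", "nodeC:route_withdrawn"),
    ("nodeA:link_down", "nodeD:tunnel_down"),
]


def build_reverse_graph(edges):
    """Map each effect event to the set of events that can cause it."""
    reverse = defaultdict(set)
    for cause, effect in edges:
        reverse[effect].add(cause)
    return reverse


def root_cause_candidates(observed, edges):
    """Return the observed events that have no observed upstream cause."""
    reverse = build_reverse_graph(edges)
    observed = set(observed)
    roots = set()
    for event in observed:
        stack, seen = [event], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            # Only follow causes that were actually observed.
            upstream = reverse.get(current, set()) & observed
            if not upstream:
                roots.add(current)
            stack.extend(upstream)
    return roots


single_fault = root_cause_candidates(
    ["nodeC:route_withdrawn", "nodeB:session_down", "nodeA:link_down"], EDGES
)
concurrent_faults = root_cause_candidates(
    ["nodeC:route_withdrawn", "nodeD:tunnel_down"], EDGES
)
```

Because the edges are written out explicitly, any change to the network requires the edge list to be edited, which illustrates why this technique by itself is better suited to networks with infrequent changes.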
The programmable diagnosis model of this disclosure enables network administrators to initialize telemetry rules, either with device resource properties in the case of device resources, or via service association inheritance in the case of service-associated device resources. Based on the model programmed in this way, the controller may automatically derive inference rules with respect to resource event interrelationships. The controller may continually update the inference rules, and may implement the inference rules to perform RCA based on forward chaining of network resource events. Additionally, the programmable diagnosis model of this disclosure enables the incorporation of temporal relationships between resource events to perform RCA among potentially interrelated events. The inference rules are augmented with temporal constraints to enable temporal-based RCA.
Aspects of the underlying element and service models are described in U.S. patent application Ser. No. 16/731,372 filed on 31 Dec. 2019, the entire content of which is incorporated herein. The Network Model Aware Diagnosis technique of the present disclosure uses element models, service models, and multi-layer models. The element model accounts for network devices that use various resources (e.g., a packet forwarding engine (PFE), a line card, interfaces, chassis, CPUs, etc.), captures the relationships between these resources, and captures dependencies between various network resource events.
The service model accounts for services spread across the devices (e.g., layer-3 (L3) VPN/virtual private LAN services (VPLS), label-switched path (LSP) tunnels, etc.). The service model comprises various events captured at the service level. The service model captures (i) service and service endpoint associations, (ii) connectivity links (paths) between various endpoints (e.g., a VPN service with endpoints Node A, B, C contains a tunnel between Node A and Node B and a tunnel between Node A and Node C, etc.), (iii) dependencies across service events, (iv) dependencies across the endpoint events, and (v) dependencies between device events and service events. Networks are layered, and as such, a broken link in an underlying layer or any other problem in the lower layer services causes many higher layer services to fail, even when these services are not directly connected to the failing components. The multi-layer model captures (i) service to service dependencies, (ii) service link to service link dependencies, and (iii) dependencies across service events.
The enterprise network 102 is shown coupled to a public network 118 (e.g., the Internet) via a communication link. The public network 118 may include, for example, one or more client computing devices. The public network 118 may provide access to web servers, application servers, public databases, media servers, end-user devices, and other types of network resource devices and content.
The controller device 110 is communicatively coupled to the elements 114 via the enterprise network 102. The controller device 110, in some examples, forms part of a device management system, although only one device of the device management system is illustrated for purpose of example in
In common practice, the controller device 110, also referred to as a network management system (NMS) or NMS device, and the elements 114 are centrally maintained by an information technology (IT) group of the enterprise. The administrators 112 interact with the controller device 110 to remotely monitor and configure the elements 114. For example, the administrators 112 may receive alerts from the controller device 110 regarding any of the elements 114, view configuration data of the elements 114, modify the configuration data of the elements 114, add new network devices to the enterprise network 102, remove existing network devices from the enterprise network 102, or otherwise manipulate the enterprise network 102 and network devices therein. Although described herein with respect to an enterprise network as an example use case, it will be understood that the techniques of this disclosure are also applicable to other network types, public and private, including LANs, VLANs, VPNs, and the like.
In some examples, the administrators 112 use the controller device 110 or a local workstation to interact directly with the elements 114, e.g., through telnet, secure shell (SSH), or other such communication sessions. That is, the elements 114 generally provide interfaces for direct interaction, such as command line interfaces (CLIs), web-based interfaces, graphical user interfaces (GUIs), or the like, by which a user can interact with the devices to directly issue text-based commands. For example, these interfaces typically allow a user to interact directly with the device, e.g., through a telnet, secure shell (SSH), hypertext transfer protocol (HTTP), or other network session, to enter text in accordance with a defined syntax to submit commands to the managed element. In some examples, the user initiates an SSH session 115 with one of the elements 114, e.g., element 114F, using the controller device 110, to directly configure element 114F. In this manner, a user can provide commands in a format for execution directly to the elements 114.
Further, the administrators 112 can also create scripts that can be submitted by the controller device 110 to any or all of the elements 114. For example, in addition to a CLI interface, the elements 114 also provide interfaces for receiving scripts that specify the commands in accordance with a scripting language. In a sense, the scripts may be output by the controller device 110 to automatically invoke corresponding remote procedure calls (RPCs) on the managed elements 114. The scripts may conform to, e.g., extensible markup language (XML) or another data description language.
The administrators 112 use the controller device 110 to configure the elements 114 to specify certain operational characteristics that further the objectives of the administrators 112. For example, the administrators 112 may specify for an element 114 a particular operational policy regarding security, device accessibility, traffic engineering, quality of service (QoS), network address translation (NAT), packet filtering, packet forwarding, rate limiting, or other policies. The controller device 110 uses one or more network management protocols designed for management of configuration data within the managed network elements 114, such as SNMP or the Network Configuration Protocol (NETCONF), or a derivative thereof, such as the Juniper Device Management Interface, to perform the configuration. The controller device 110 may establish NETCONF sessions with one or more of the elements 114.
Controller device 110 may be configured to compare a new intent data model to an existing (or old) intent data model, determine differences between the new and existing intent data models, and apply the reactive mappers to the differences between the new and old intent data models. In particular, the controller device 110 determines whether the new data model includes any additional configuration parameters relative to the old intent data model, as well as whether the new data model modifies or omits any configuration parameters that were included in the old intent data model.
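The comparison described above can be sketched as a dictionary difference. The example below is a minimal illustration only, assuming that an intent data model has been flattened into a Python dictionary keyed by parameter path; the function name `diff_intent`, the key names, and the sample values are hypothetical and do not come from this disclosure.

```python
def diff_intent(old, new):
    """Compare two flattened intent models keyed by parameter path.

    Returns the parameters that the new model adds, the parameters it
    omits, and the parameters whose values it modifies.
    """
    added = {k: new[k] for k in new.keys() - old.keys()}
    omitted = {k: old[k] for k in old.keys() - new.keys()}
    modified = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "omitted": omitted, "modified": modified}


# Hypothetical old and new intent models.
old_intent = {"vpn/name": "blue", "vpn/mtu": 1500, "vpn/site-a": "10.0.0.1"}
new_intent = {"vpn/name": "blue", "vpn/mtu": 9000, "vpn/site-b": "10.0.0.2"}

changes = diff_intent(old_intent, new_intent)
```

Only the three changed parameters appear in `changes`; the unchanged `vpn/name` entry is absent, so a translation step that consumes this result would touch only the differences.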
The intent data model may be a unified graph model, while the low-level configuration data may be expressed in YANG, which is described in (i) Bjorklund, “YANG - A Data Modeling Language for the Network Configuration Protocol (NETCONF),” Internet Engineering Task Force, RFC 6020, October 2010, available at tools.ietf.org/html/rfc6020, and (ii) Clemm et al., “A YANG Data Model for Network Topologies,” Internet Engineering Task Force, RFC 8345, March 2018, available at tools.ietf.org/html/rfc8345 (sometimes referred to as “RFC 8345”). In some examples, the intent data model may be expressed in YAML Ain't Markup Language (YAML). The controller device 110 may include various reactive mappers for translating the intent data model differences. These reactive mappers are functions configured to accept the intent data model (which may be expressed as structured input parameters, e.g., according to YANG or YAML). The functions are also configured to output respective sets of low-level device configuration data model changes, e.g., device configuration additions and removals. That is, y1=f1(x), y2=f2(x), . . . yN=fN(x).
The controller device 110 may use YANG modeling for intent data model and low-level device configuration models. This data may contain relations across YANG entities, such as list items and containers. As discussed in greater detail below, the controller device 110 may convert a YANG data model into a graph data model, and convert YANG validations into data validations. Techniques for managing network devices using a graph model for high level configuration data are described in “CONFIGURING AND MANAGING NETWORK DEVICES USING PROGRAM OVERLAY ON YANG-BASED GRAPH DATABASE,” U.S. patent application Ser. No. 15/462,465, filed on 17 Mar. 2017, the entire content of which is incorporated herein by reference.
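The relation y1=f1(x), y2=f2(x), . . . yN=fN(x) can be illustrated with a short sketch in which each mapper is an ordinary function that accepts the same structured input x and returns its own set of configuration additions and removals. The mapper names, the input keys, and the path strings below are hypothetical placeholders, not an actual device configuration schema.

```python
def vlan_mapper(x):
    """f1: hypothetical mapper for VLAN-related intent differences."""
    return {
        "add": [f"/vlans/vlan={v}" for v in x.get("vlans_added", [])],
        "remove": [f"/vlans/vlan={v}" for v in x.get("vlans_removed", [])],
    }


def interface_mapper(x):
    """f2: hypothetical mapper for interface-related intent differences."""
    return {
        "add": [f"/interfaces/interface={i}" for i in x.get("interfaces_added", [])],
        "remove": [f"/interfaces/interface={i}" for i in x.get("interfaces_removed", [])],
    }


MAPPERS = [vlan_mapper, interface_mapper]


def apply_mappers(x, mappers=MAPPERS):
    """Compute y1 = f1(x), y2 = f2(x), ..., yN = fN(x)."""
    return [f(x) for f in mappers]


# Hypothetical intent difference used as the shared input x.
intent_diff = {"vlans_added": [100], "interfaces_removed": ["port-1"]}
results = apply_mappers(intent_diff)
```

Each function reads only the keys it understands and ignores the rest, so under this assumption a new mapper can be appended to the list without changing the existing ones.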
Controller device 110 may receive data from any of administrators 112 representing any or all of create, update, and/or delete actions with respect to the unified intent data model. The controller device 110 may be configured to use the same compilation logic for each of create, update, and delete as applied to the graph model.
In general, controllers, such as controller device 110, use a hierarchical data model for intents, low-level data models, and resources. The hierarchical data model can be based on YANG or YAML. The hierarchical data model can be represented as a graph, as discussed above. Modern systems have supported intents to ease the management of networks. Intents are declarative. To realize intents, the controller device 110 attempts to select optimal resources.
In accordance with aspects of this disclosure, controller device 110 implements a programmable diagnosis model that facilitates RCA when one or more of the network elements 114 exhibits a failure (e.g., packet loss, or other failure). The programmable diagnosis model constructs the network resources and inter-resource dependencies in the form of a resource definition graph. The resource definition graph is a construct that can be programmed, in such a way that it specifies a set of objects (resources) which include: (i) attribute(s); (ii) state(s); and (iii) link(s) to other object(s) (resource(s)). A particular instance of a resource definition graph defines the relationships that characterize a particular corresponding network context, which can be a network domain, a network device, a network service, etc. The programmable diagnosis service 224 discovers resources (instances) based on the constructed resource definition graph.
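One possible in-memory shape for such a resource definition graph is sketched below, with each resource object carrying attributes, states, and links to other resources. The class names, field names, and sample resources are assumptions made for illustration; this disclosure does not prescribe a particular data structure.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceDefinition:
    """One vertex: a resource with attributes, states, and links."""
    name: str
    attributes: dict = field(default_factory=dict)
    states: dict = field(default_factory=dict)
    links: list = field(default_factory=list)  # names of resources depended upon


class ResourceDefinitionGraph:
    def __init__(self):
        self.resources = {}

    def add(self, resource):
        self.resources[resource.name] = resource

    def dependencies(self, name):
        """Return every resource reachable through links, nearest first."""
        seen, order = set(), []
        queue = list(self.resources[name].links)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.add(current)
            order.append(current)
            queue.extend(self.resources[current].links)
        return order


# Hypothetical device-level resources and their dependencies.
graph = ResourceDefinitionGraph()
graph.add(ResourceDefinition("chassis", states={"oper": "up"}))
graph.add(ResourceDefinition("line_card", attributes={"slot": 0}, links=["chassis"]))
graph.add(ResourceDefinition("pfe", links=["line_card"]))
graph.add(ResourceDefinition("interface", attributes={"mtu": 1500}, links=["pfe"]))

interface_deps = graph.dependencies("interface")
```

In this sketch the same graph could be populated for a different network context, such as a service, by adding service resources whose links point at the device resources that carry them.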
Control unit 202 represents any combination of hardware, hardware implementing software, and/or firmware for implementing the functionality attributed to the control unit 202 and its constituent modules and elements. When control unit 202 incorporates software or firmware, control unit 202 further includes any necessary hardware for storing and executing the software or firmware, such as one or more processors or processing units. In general, a processing unit may include one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), fixed function circuitry, programmable processing circuitry, or any other equivalent integrated or discrete logic circuitry, as well as any combinations of such components. A processing unit is generally implemented using fixed and/or programmable logic circuitry.
User interface 206 represents one or more interfaces by which a user, such as the administrators 112 of
Functionality of the control unit 202 may be implemented as one or more processing units in fixed or programmable digital logic circuitry. Such digital logic circuitry may include one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), fixed function circuitry, programmable logic circuitry, field programmable gate arrays (FPGAs), or any other equivalent integrated or discrete logic circuitry, as well as any combination of such components. When implemented as programmable logic circuitry, the control unit 202 may further include one or more computer readable storage media storing hardware or firmware instructions to be executed by processing unit(s) of control unit 202.
In this example, control unit 202 includes a user interface module 208, network interface module 210, and management module 212. Control unit 202 executes user interface module 208 to receive input from and/or provide output via user interface 206. Control unit 202 also executes network interface module 210 to send and receive data (e.g., in packetized form) via network interface 204. The user interface module 208, the network interface module 210, and the management module 212 may again be implemented as respective hardware units, or in software or firmware implemented by appropriate hardware infrastructure, or a combination thereof.
The control unit 202 executes a management module 212 to manage various network devices, e.g., the elements 114 of
The management module 212 is configured to receive intent unified-graph-modeled configuration data for a set of managed network devices from a user, such as the administrators 112. Such intent unified-graph-modeled configuration data may be referred to as an “intent data model.” Over time, the user may update the configuration data, e.g., to add new services, remove existing services, or modify existing services performed by the managed devices. The unified intent data model may be structured according to, e.g., YANG or YAML. The graph model may include a plurality of vertices connected by edges in a hierarchical fashion. In YANG, edges of graph models are represented through “leafref” elements. In the case of YAML, such edges may be represented with a “ref” edge. Similarly, parent-to-child vertex relations can be represented with a “has” edge. For example, a vertex for Element A that refers to a vertex for Element B using a has-edge can be understood to mean, “Element A has Element B.”
The configuration database 214 generally includes information describing the managed network devices, e.g., the elements 114. The configuration database 214 may include information indicating device identifiers (such as MAC and/or IP addresses), device type, device vendor, device species (e.g., router, switch, bridge, hub, etc.), or the like. The configuration database 214 also stores current configuration information (e.g., intent data model, or in some cases, both intent data model and low-level configuration information) for the managed devices (e.g., the elements 114).
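The has-edge and ref-edge relations described above can be sketched with a small edge list. The class name `IntentGraph`, its method names, and the vertex names are hypothetical and serve only to show how the two edge types might be told apart when the graph is queried.

```python
class IntentGraph:
    """Vertices connected by typed edges: 'has' (parent-child) or 'ref'."""

    def __init__(self):
        self.edges = []  # (source, edge_type, target) triples

    def add_has(self, parent, child):
        self.edges.append((parent, "has", child))

    def add_ref(self, source, target):
        self.edges.append((source, "ref", target))

    def children(self, vertex):
        """Vertices that `vertex` has, i.e. its children."""
        return [t for s, kind, t in self.edges if s == vertex and kind == "has"]

    def references(self, vertex):
        """Vertices that `vertex` points to through ref-edges."""
        return [t for s, kind, t in self.edges if s == vertex and kind == "ref"]


intent = IntentGraph()
intent.add_has("Element A", "Element B")    # "Element A has Element B"
intent.add_has("Element A", "Element C")
intent.add_ref("Element C", "Element D")    # Element C refers to Element D
```

Keeping the edge type on every triple lets a traversal follow only the hierarchy (has-edges), only the cross references (ref-edges), or both.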
The model database 216 includes the models configured by a user, via the configuration module 222, that describe the structure of the network 102. As described below, the model database includes a network aware diagnosis model that is used by programmable diagnosis service 224 to perform root cause analysis to find the malfunctioning element 114 that is a source of an event, even when the event is not the direct/immediate result of the malfunction, but is instead a cascading downstream effect of the malfunction.
The diagnosis model 304 captures the cause and effect relationships (sometimes referred to herein as “correlations”) between various resources. For example, the diagnosis model 304 may reflect cause-and-effect relationships across events that occur over network 102. The cause and effect relationships are defined between resources and resource alarms/events. When the cause and effect relationship is defined between resources, any critical alarm/event on a resource causes an effect on “supporting resources.” When the cause and effect relationship is defined between resource alarms/events, an event on a resource causes an effect on “supported resource” events.
The programmable diagnosis model 300 is used by the programmable diagnosis service 224 to perform forward-chained RCA in accordance with aspects of this disclosure. To aid in identifying the root cause of a fault or other event while accommodating dynamic changes in the topology of network 102, the programmable diagnosis model 300 enables administrators to update aspects of the diagnosis model 304 by providing programming input 310 via the controller device 110. The programmable diagnosis service 224 uses the programming input 310 to construct a resource definition graph that models network resources and interdependencies therebetween. Based on the model constructed in this way, programmable diagnosis service 224 discovers the resources from network 102 and builds the relations across the discovered resources.
Individual vertices of the resource definition graph include one or more “playbooks” that define respective telemetry rule(s) that enable the programmable diagnosis service 224 to fetch state information from network 102. The resource definition graph constructed by the programmable diagnosis service 224 captures both network model and device model information, as well as corresponding rules of the telemetry rules 306. The resource definition graph also includes the diagnosis model 304, which provides cause and effect relationship information across events detected within network 102. A given vertex of the resource definition graph (including resource model information along with telemetry rule information) enables the programmable diagnosis service 224 to discover network and device resource instances of each object that exist on network 102, to collect the data required to fill and update the value of the object attributes, and to compute the actual value of the “state” attributes defined.
The programmable diagnosis model 300 also includes temporal metadata 308. The temporal metadata 308 includes timing information for events detected among the elements 114 of the network 102. The temporal metadata 308 may include exact times, approximate times, or relative times measured with respect to discrete events detected within the network 102. Based on criteria provided in the programming input 310 or based on other criteria, the programmable diagnosis service 224 may apply the portions of the temporal metadata 308 that pertain to potentially interrelated events to perform RCA with respect to a downstream event. In one example, the programmable diagnosis service 224 may retain or eliminate an event as a possible upstream cause based on whether or not the event occurred within a threshold time frame of causality with respect to the downstream event.
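By way of illustration only, the retain-or-eliminate test described above may be sketched in Python as follows. The `Event` record, the helper names, the resource names, and the 30-second threshold are assumptions introduced for this sketch and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; the field names are assumptions."""
    resource: str
    kind: str
    timestamp: float  # seconds since an arbitrary epoch


def within_causality_window(candidate, downstream, threshold_s):
    """True if the candidate occurred within threshold_s seconds of the downstream event."""
    return abs(downstream.timestamp - candidate.timestamp) <= threshold_s


def retain_possible_causes(candidates, downstream, threshold_s=30.0):
    """Keep only the candidates that are close enough in time to the downstream event."""
    return [c for c in candidates
            if within_causality_window(c, downstream, threshold_s)]


loss = Event("VRF1", "packet-loss", 100.0)
candidates = [
    Event("ge-0/0/3", "interface-down", 95.0),  # 5 s away: retained
    Event("ge-0/0/7", "interface-down", 10.0),  # 90 s away: eliminated
]
retained = retain_possible_causes(candidates, loss)
```

In this sketch the window is symmetric, so a candidate detected shortly after the downstream event is also retained; a stricter one-sided window could be substituted where detection order is reliable.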
Using the combination of the network resource model(s) 302, the diagnosis model 304 formed or updated with the programming input 310, the telemetry rules 306, and the temporal metadata 308, the programmable diagnosis service 224 forms one or more of the inference rules stored to the inference database 218. In turn, the programmable diagnosis service 224 applies those inference rules of the inference database 218 that are applicable to the particular event under RCA to run the programmable diagnosis model 300. The output produced by running the programmable diagnosis model 300 is shown in
More specifically, the programmable diagnosis service 224 uses the programmed model (a version of diagnosis model 304 formed using programming input 310) to automatically derive the relevant inference rules of the inference database 218. In accordance with aspects of this disclosure, the inference rules stored to the inference database 218 are subject to one or more temporal constraints, which are described in greater detail below with respect to the application of temporal metadata. The programmable diagnosis service 224 applies the derived inference rules to identify the source of the fault under RCA. The inference engine 226 maintains the event under RCA in cache memory for a predetermined time interval, and generates an inference upon receiving a dependent event. Upon correlating the events, the inference engine 226 generates a smart event with an RCA tree and a root cause event to be output as part of forward-chained RCA output 312. In some examples, the programmable diagnosis service 224 saves the forward-chained RCA output 312 to an analytics database, which may be implemented locally at the controller device 110, at a remote location, or in a distributed manner.
In the example of
Resource definition graph 402A captures network model information, device model information, and corresponding telemetry rules for the resources shown. Using the information available from resource definition graph 402A, controller device 110 may discover the various instances of the objects described in resource definition graph 402A included in a particular device group of network 102. Based on the causality link between IFD 502 and IFL 504, controller device 110 may determine that a fault occurring at IFD 502 potentially affects the functioning of IFL 504. Based on the causality link, programmable diagnosis service 224 may include IFD 502 in the discovery process with respect to fault investigation for IFL 504. In this way, programmable diagnosis service 224 may obtain object properties and service properties for the device group under discovery based on the causality links included in resource definition graph 402A.
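As a non-limiting sketch, the use of causality links to widen the discovery scope may be expressed in Python as follows. The `ResourceDefinitionGraph` class, its method names, and the IFL-to-VRF link are hypothetical and are introduced only for this example.

```python
class ResourceDefinitionGraph:
    """Hypothetical graph: a link (cause, effect) means a fault at
    `cause` may affect the functioning of `effect`."""

    def __init__(self):
        self._causes_of = {}  # effect -> set of direct causes

    def add_causality_link(self, cause, effect):
        self._causes_of.setdefault(effect, set()).add(cause)

    def upstream_of(self, resource):
        """Return every resource whose faults could reach `resource`,
        directly or through intermediate resources."""
        seen, stack = set(), [resource]
        while stack:
            current = stack.pop()
            for cause in self._causes_of.get(current, ()):
                if cause not in seen:
                    seen.add(cause)
                    stack.append(cause)
        return seen


graph = ResourceDefinitionGraph()
graph.add_causality_link("IFD", "IFL")  # link described in the text
graph.add_causality_link("IFL", "VRF")  # assumed link, for illustration
to_discover = graph.upstream_of("IFL")
```

Here a fault investigation for the IFL would include the IFD in discovery, and an investigation for the assumed VRF resource would include both the IFL and the IFD.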
In examples in which IFD 502 has multiple interfaces, programmable diagnosis service 224 may run programmable diagnosis model 300 to derive an inference rule that associates the particular interface of IFD 502 with the dependent event (e.g., packet loss or other fault) occurring at IFL 504. Programmable diagnosis service 224 further tunes the inference rule using one or more temporal constraints formed based on temporal metadata 308. If the fault discovered at IFL 504 fits the temporally compliant inference rule, programmable diagnosis service 224 generates forward-chained RCA output to identify the fault at IFD 502 as either the root cause or as an intermediate cause (which leads to the root cause) of the fault discovered at IFL 504.
To obtain forward-chained RCA output 312, programmable diagnosis service 224 may use diagnosis model 304 (formed or modified using programming input 310) to automatically derive the relevant inference rules of inference database 218. Again, programmable diagnosis service 224 may derive the inference rules to comport with temporal constraints for causality as derived from temporal metadata 308. In turn, programmable diagnosis service 224 uses the inference rules stored to inference database 218 to identify the source of the detected event (e.g., a fault). Inference engine 226 may maintain an event in cache storage for a specified time interval and generate an inference when a potentially dependent (e.g., downstream effect) event arrives. Upon generating an event correlation, programmable diagnosis service 224 may generate a “smart event” with an RCA tree and an identified root cause event. Programmable diagnosis service 224 stores the smart event and the identified root cause event to an analytics database that may be implemented locally at controller device 110, at a remote location, or in a distributed manner.
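The cache-then-correlate behavior described above may be sketched, under stated assumptions, in Python as follows. The `EventCache` class, the `correlate` function, the tuple layout, and the 60-second retention interval are hypothetical and are not taken from the disclosure.

```python
class EventCache:
    """Hypothetical cache that holds events for a fixed retention interval."""

    def __init__(self, retention_s):
        self.retention_s = retention_s
        self._events = []  # (timestamp, resource, kind) tuples

    def add(self, timestamp, resource, kind):
        self._events.append((timestamp, resource, kind))

    def live_events(self, now):
        """Drop expired events, then return the events still retained."""
        self._events = [e for e in self._events
                        if now - e[0] <= self.retention_s]
        return list(self._events)


def correlate(cache, dependencies, timestamp, resource, kind):
    """On arrival of an event, return the cached events on resources that
    the arriving event's resource depends on, then cache the new event."""
    possible = dependencies.get(resource, set())
    causes = [e for e in cache.live_events(timestamp) if e[1] in possible]
    cache.add(timestamp, resource, kind)
    return causes


dependencies = {"IFL": {"IFD"}}  # effect -> possible causes (assumed)
cache = EventCache(retention_s=60.0)
first = correlate(cache, dependencies, 0.0, "IFD", "interface-down")
second = correlate(cache, dependencies, 20.0, "IFL", "packet-loss")
late = correlate(cache, dependencies, 200.0, "IFL", "packet-loss")
```

In this sketch the second event arrives while the first is still cached and is therefore correlated with it, whereas the third event arrives after the retention interval has lapsed and yields no correlation.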
More specifically, to generate the alerts stored to alerts event database 616, smart event generator 610 uses inference rules formulated by inference engine 226. Inference engine 226 also stores events received from the programmable network diagnosis service 224 to event cache 614. Inference engine 226 implements a knowledge-based generation mechanism with respect to the inference rules stored to inference database 218.
At a high level, model 700 may capture the following: (i) a resource model for network and device resources; (ii) resource dependencies (including (a) parent and child resources and (b) unidirectional and bidirectional dependencies); (iii) cause-and-effect dependencies between resource events; and (iv) a mapping of a telemetry playbook to model 700. As shown in
In the example of
YANG code for a data model corresponding to model 700 is presented below:
The YANG data model above includes various constructs. Resource fields define attributes of the corresponding resource. State fields define operational states of the corresponding resource. Dependencies capture inter-resource dependencies. Resource-rule mapping fields capture mappings between the resource field and the rule field, along with triggers to the resource state mapping.
A YANG model corresponding to the YANG code above is presented below:
Descriptions for various data model fields are presented below in Table 1.
Based on the association of resources, programmable diagnosis service 224 may apply configuration information model 700. Programmable diagnosis service 224 may collect additional state information based on a service association to a resource. To collect the additional state information, programmable diagnosis service 224 may apply additional telemetry rules (e.g., telemetry rules 306 or other telemetry rules) based on service-to-resource associations. For example, an interface associated with a VPN service may require additional telemetry rule(s) to be run on the associated interface. The application of this telemetry rule is given below:
Execution of the code above will add “interface-status.rule” to interfaces that are associated to resource “VRF.”
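The effect described above may be illustrated with a minimal Python sketch. The function name, the dictionary layout, the interface names, and the pre-existing “interface-errors.rule” entry are hypothetical; only “interface-status.rule” and the “VRF” resource name come from the description.

```python
def apply_rule_to_associated(associations, rules_by_resource, service, rule):
    """Attach `rule` to every interface associated with `service`,
    without adding the same rule twice."""
    for interface in associations.get(service, []):
        rules = rules_by_resource.setdefault(interface, [])
        if rule not in rules:
            rules.append(rule)
    return rules_by_resource


# Assumed service-to-interface associations and existing rule assignments.
associations = {"VRF": ["ge-0/0/1", "ge-0/0/3"]}
rules = {"ge-0/0/1": ["interface-errors.rule"]}

apply_rule_to_associated(associations, rules, "VRF", "interface-status.rule")
```

Interfaces with no association to the service are left untouched, so the additional telemetry is collected only where the service association calls for it.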
Model update 800 may be described as “network model decoration” or “event decoration” with respect to events with a network model under diagnosis, in accordance with aspects of this disclosure. An analytics engine operated by controller device 110 may capture a stream of events from network 102 and feed the event stream into the programmable diagnosis service 224. Model update 606 may decorate every event with model dependency information.
The analytics engine may collect certain state information based on service associations. For example, if a VPN is associated with a particular interface, the analytics engine may fetch state information for that interface. Programmable diagnosis service 224 may execute an interface status rule if there is an association between a VPN instance and an interface instance. “VPN1” shown in model 800 is such a VPN instance. In the case of model 800, model update 606 may decorate device events with “vpn1 instance” information. In the case of the events shown in
Programmable diagnosis service 224 constructs a diagnosis dependency model that captures cause-and-effect relationships across various resources in a device group of network 102. Programmable diagnosis service 224 may include various types of cause-and-effect relationships in the diagnosis dependency model, such as cause-and-effect relationships between resources and/or cause-and-effect relationships between resource alarms/events. If a cause-and-effect relationship is between resources, any critical alarm/event on a resource can cause an effect on “supported resources”. That is, a user may provide, as part of programming input 310, a dependency definition linking an event on one resource to a causal event on another resource. If a cause-and-effect relationship is between resource alarms/events, an event detected on a resource can cause an effect on a supported resource event.
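The two kinds of cause-and-effect relationship described above may be contrasted in a short Python sketch. The table names, the event kinds, and the severity values are assumptions made for illustration and do not reproduce the dependency definitions of the disclosure.

```python
# Resource-level dependencies: any critical event on the key resource
# may affect each listed supported resource (assumed example data).
RESOURCE_DEPENDENCIES = {"IFD": ["IFL"]}

# Event-level dependencies: a specific event on a resource may cause a
# specific event on a supported resource (assumed example data).
EVENT_DEPENDENCIES = {
    ("IFL", "interface-down"): [("VRF", "packet-loss")],
}


def possible_effects(resource, kind, severity):
    """Return (resource, event kind) pairs that the given event may affect.
    A kind of None means any event on that supported resource."""
    effects = []
    if severity == "critical":
        effects += [(r, None) for r in RESOURCE_DEPENDENCIES.get(resource, [])]
    effects += EVENT_DEPENDENCIES.get((resource, kind), [])
    return effects


critical_effects = possible_effects("IFD", "link-flap", "critical")
minor_effects = possible_effects("IFD", "link-flap", "minor")
event_effects = possible_effects("IFL", "interface-down", "minor")
```

The resource-level table applies only to critical events, while the event-level table applies to the named event regardless of severity, mirroring the distinction drawn in the description.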
Dependency and contains edges introduce cause-and-effect relationships between the resources in the diagnosis dependency model. Dependencies between resource alarms are shown in the code below:
Inference engine 226 represents an expert system that can be described as a form of finite state machine with a cycle consisting of three action states. The three action states are “match rules,” “select rules,” and “execute rules.” Inference engine 226 may apply rules to a set of facts that are active in memory. Inference engine 226 requires facts upon which to operate. Inference engine 226 runs a fact model that captures network event information. The fact model is denoted by:
As described above, programmable diagnosis service 224 generates temporally based inference rules by applying temporal metadata 308. That is, inference engine 226 applies temporal metadata 308 to generate all of the inference rules stored to inference database 218 with temporal constraints. The techniques of this disclosure are based on a realization that temporal relations are important in handling relationships between network events. Without applying temporal constraints, event correlation may include inaccuracies because the time elapsed between events is disregarded. Because timing information for events is relative to each other, a purely date/time representation without relative timing deltas may be insufficient with respect to applying temporal constraints to improve the accuracy of RCA. Two example temporal operators are “before” and “after” operators. For instance, in the example of model 800, the “ge-0/0/3 interface down” event happened before the VRF1 packet loss event.
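The “before” and “after” operators may be sketched in Python using relative timing deltas, as follows. The function signatures, the `max_gap_s` parameter, and the timestamps are assumptions chosen for this example.

```python
def before(a_ts, b_ts, max_gap_s):
    """True if event a occurred strictly before event b,
    by no more than max_gap_s seconds."""
    return 0 < b_ts - a_ts <= max_gap_s


def after(a_ts, b_ts, max_gap_s):
    """True if event a occurred strictly after event b,
    by no more than max_gap_s seconds."""
    return before(b_ts, a_ts, max_gap_s)


# Assumed detection times, in seconds, for the two events of the example.
interface_down_ts = 1000.0  # "ge-0/0/3 interface down"
packet_loss_ts = 1004.5     # VRF1 packet loss

down_before_loss = before(interface_down_ts, packet_loss_ts, 10.0)
loss_after_down = after(packet_loss_ts, interface_down_ts, 10.0)
too_far_apart = before(interface_down_ts, packet_loss_ts, 2.0)
```

Because each operator works on the difference between two timestamps rather than on absolute dates, the same constraint applies regardless of when the events occur.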
Based on the dependency model created for resources of network 102, inference engine 226 auto-generates inference rules to be stored to inference database 218. Inference engine 226 uses a rule template that accepts, as input, the cause-and-effect dependencies (defined in the network resource model) and generates the inference rules based on these cause-and-effect dependencies. Inference engine 226 generates the templates to account for causal and consequent events to be detected in any order, such as the consequent event being detected after the causal event (expected), or the consequent (“target”) event being detected before the causal event (unexpected). In some examples, the causal event may be detected after the target event because of latency or other system constraints. For instance, an interface down event may cause VPN packet loss, but the packet loss may be detected before or after the interface down event is detected, in different use case scenarios. Inference engine 226 may generate the rule template to accommodate both scenarios. Inference engine 226 may assign a different inference rule to each of these scenarios. Rule template generation is shown by way of example in the code below, in which {{source event type}} and {{target event type}} are template variables:
Programmable diagnosis service 224 performs RCA based on inferences through forward chaining, in accordance with the techniques of this disclosure. As used herein, forward chaining is the logical process of inferring unknown truths from known data, and moving forward using determined conditions and rules to identify a solution. A generic example, based on transitive properties, can be stated as “if ‘a’ causes ‘b’ and ‘b’ causes ‘c’, then ‘a’ is the root cause of ‘c’.” As part of inference formation, programmable diagnosis service 224: (i) merges the causes and effects based on the generated inferences to form one or more inference rules; and (ii) generates an RCA tree (which can be represented as a graph of related events) as part of a chaining process.
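The transitive example above may be worked through in a small Python sketch of forward chaining. The function names and the dictionary representation are assumptions; the sketch illustrates the general technique and is not the rule set of inference engine 226.

```python
def forward_chain(direct_causes):
    """Starting from known direct causes (effect -> set of causes), keep
    deriving new cause facts until no further fact can be inferred."""
    inferred = {effect: set(causes) for effect, causes in direct_causes.items()}
    changed = True
    while changed:
        changed = False
        for causes in inferred.values():
            for cause in list(causes):
                # If `cause` itself has causes, they are also causes here.
                extra = inferred.get(cause, set()) - causes
                if extra:
                    causes |= extra
                    changed = True
    return inferred


def root_causes(inferred, effect):
    """Root causes are the ancestors that have no known cause of their own."""
    return {c for c in inferred.get(effect, set()) if not inferred.get(c)}


facts = {"b": {"a"}, "c": {"b"}}  # 'a' causes 'b', and 'b' causes 'c'
inferred = forward_chain(facts)
roots_of_c = root_causes(inferred, "c")
```

Running the sketch on the two known facts derives the additional fact that 'a' is a cause of 'c', and identifies 'a' as the only root cause of 'c'.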
Inference engine 226 may persist the RCA tree in event cache 614 (or another event DB) for further event analysis. Inference engine 226 generates an inference model that captures the inferred information from the events stored to event cache 614. The inference model contains causes and a list of effects. An “inference” class declaration (including a list of effects) is presented below:
Examples of forward-chaining rules are presented in the code below:
An example use case of the third (merging) rule is an interface down event, which may cause VPN packet loss (over potentially numerous VPNs) as well as customer latency and/or customer connectivity failures. The interface down event may be a direct parent of both consequent events, or may be an ancestor event via the transitive property of fault causality. Inference engine 226 may clear one or more inferences upon clearing one or more corresponding events from event cache 614. Upon clearing an event from event cache 614, inference engine 226 may (i) delete, from inference database 218, all facts and inferences related to the cleared event; and (ii) reactivate all correlated events which were part of the deleted inference to create new inferences. Code denoting three different rules relating to inference clearing is presented below:
Smart event generator 610 correlates events and identifies root cause events as part of the forward chaining inference-drawing process. Smart event generator 610 generates a smart event per root cause event along with a set of impacted events. Smart event generator 610 persists the smart event in an analytics database (e.g., alerts/events database 616) to enable a user to initiate further actions, such as one or more of remediation actions 620.
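Grouping impacted events under a single root cause event may be sketched in Python as follows. The `SmartEvent` dataclass, the `build_smart_events` function, and the event names are hypothetical, and persistence to a database is omitted from the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class SmartEvent:
    """Hypothetical smart event: one root cause plus the events it impacts."""
    root_cause: str
    impacted: list = field(default_factory=list)


def build_smart_events(cause_of):
    """Produce one smart event per root cause. `cause_of` maps each
    event to its direct cause, or to None if it has no known cause."""

    def root(event):
        seen = set()  # guards against cycles in the input
        while cause_of.get(event) is not None and event not in seen:
            seen.add(event)
            event = cause_of[event]
        return event

    smart = {}
    for event in cause_of:
        r = root(event)
        entry = smart.setdefault(r, SmartEvent(root_cause=r))
        if event != r:
            entry.impacted.append(event)
    return list(smart.values())


cause_of = {
    "interface-down": None,
    "vpn-packet-loss": "interface-down",
    "customer-latency": "vpn-packet-loss",
    "fan-failure": None,
}
smart_events = build_smart_events(cause_of)
```

In this example the interface down event yields one smart event listing the packet loss and latency events as impacted, while the unrelated fan failure yields a separate smart event with no impacted events.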
Programmable diagnosis service 224 may detect an event affecting a first resource of the resources supported by the device group managed by controller device 110 (956). In turn, programmable diagnosis service 224 may identify a root cause event that caused the event affecting the first resource based on the interdependencies modeled in the respective resource definition graph 402 formed previously (958). Programmable diagnosis service 224 may identify the root cause event as occurring at a second resource of the resources supported by the device group managed by controller device 110.
In some examples, to identify the root cause event that caused the event affecting the first resource, programmable diagnosis service 224 may apply the respective resource definition graph 402 to at least a subset of the supported resources to generate one or more inference rules with respect to the supported resources, and may perform a forward-chained RCA by applying the one or more inference rules to events detected over the supported resources. In some examples, programmable diagnosis service 224 may initialize one or more telemetry rules that enable controller device 110 to monitor state information for one or more components of the device group and/or to instigate one or more alarms in response to detecting threshold events occurring within the supported resources. In some examples, to initialize the one or more telemetry rules, programmable diagnosis service 224 may configure first cause-and-effect relationships between device resources supported by the device group and second cause-and-effect relationships between service resources supported by the device group based on the programming input.
In some examples, to form the resource definition graph that models the interdependencies between the resources supported by the device group, programmable diagnosis service 224 may apply one or more temporal constraints to the modeled interdependencies. In some examples, the one or more temporal constraints include a constraint according to which the event affecting the first resource occurs after the root cause event occurring at the second resource. In some examples, the one or more temporal constraints include a constraint according to which the event affecting the first resource occurs before the root cause event occurring at the second resource. In some examples, the supported resources include one or more network resources, and programmable diagnosis service 224 may configure at least a subset of the one or more network resources.
In some examples, the supported resources include one or more service resource models, and programmable diagnosis service 224 may configure at least a subset of the one or more service resources. In some examples, the supported resources include one or more device resource models, and programmable diagnosis service 224 may configure at least a subset of the one or more device resources.
The techniques described in this disclosure may be implemented, at least in part, in hardware, software, firmware, or any combination thereof. For example, various aspects of the described techniques may be implemented within one or more processors, including one or more microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), fixed-function circuitry, programmable circuitry, or any other equivalent integrated or discrete logic circuitry, as well as any combination of such components. The term “processor” or “processing circuitry” may generally refer to any of the foregoing logic circuitry, alone or in combination with other logic circuitry, or any other equivalent circuitry. A control unit comprising hardware may also perform one or more of the techniques of this disclosure.
Such hardware, software, and firmware may be implemented within the same device or within separate devices to support the various operations and functions described in this disclosure. In addition, any of the described units, modules or components may be implemented together or separately as discrete but interoperable logic devices. Depiction of different features as modules or units is intended to highlight different functional aspects and does not necessarily imply that such modules or units must be realized by separate hardware or software components. Rather, functionality associated with one or more modules or units may be performed by separate hardware or software components, or integrated within common or separate hardware or software components.
The techniques described in this disclosure may also be embodied or encoded in a computer-readable medium, such as a computer-readable storage medium, containing instructions. Instructions embedded or encoded in a computer-readable medium may cause a programmable processor, or other processor, to perform the method, e.g., when the instructions are executed. Computer-readable media may include non-transitory computer-readable storage media and transient communication media. Computer-readable storage media, which are tangible and non-transitory, may include random access memory (RAM), read only memory (ROM), programmable read only memory (PROM), erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), flash memory, a hard disk, a CD-ROM, a floppy disk, a cassette, magnetic media, optical media, or other computer-readable storage media. The term “computer-readable storage media” refers to physical storage media, and not signals, carrier waves, or other transient media.
Various examples have been described. These and other examples are within the scope of the following claims.
Number | Date | Country | Kind |
---|---|---|---|
202041004313 | Jan 2020 | IN | national |
This application is a continuation of U.S. application Ser. No. 16/821,745, filed Mar. 17, 2020, which claims benefit of priority from India Provisional Application No. 202041004313 filed on 31 Jan. 2020, the entire contents of each of which are incorporated herein by reference.
Number | Date | Country | |
---|---|---|---|
20230208701 A1 | Jun 2023 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 16821745 | Mar 2020 | US |
Child | 18066407 | US |