In the modern computer age, businesses rely on an electronic network to function properly. Computer network management and troubleshooting are complex. There are thousands of shell scripts and applications for different network problems. The available but poorly documented solutions can be overwhelming for junior network engineers. Most network engineers learn troubleshooting by reading the manufacturer's manual or internal documentation from the company's documentation department, but the effectiveness varies. For instance, the troubleshooting knowledge captured in a document can only be helpful if the information is accurate and the user correctly identifies the problem. Many companies have to conduct extensive training for junior engineers. The conventional way of network troubleshooting requires a network professional to manually run a set of standard commands and processes for each device. However, becoming familiar with those commands, along with each of their parameters, takes years of practice. Also, complicated troubleshooting methodology is often hard to share and transfer. Therefore, even though a similar network problem happens again and again, each instance of troubleshooting may still have to start from scratch. Meanwhile, networks are getting more and more complex, and it is increasingly difficult to manage them efficiently with traditional methods and tools.
Network management teams provide two functions: to deliver the services required by the business and to ensure minimal downtime. The first function may be dominated by projects, such as data centers, cloud migration, or implementing quality of service (QoS) for a voice or video service. The second function, minimizing downtime, may be more critical in impacting a company's revenue and reputation. Ensuring minimal downtime can include preventing outages from happening and resolving outages as soon as possible. Two measurements for an outage may include Mean Time Between Failures (MTBF) and Mean Time to Repair (MTTR).
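By way of a non-limiting illustration, the two measurements may be computed from a log of outages as sketched below. The `Outage` record, the sample values, and the 30-day observation window are assumptions made only for this example; MTTR is taken as the average outage duration and MTBF as total uptime divided by the number of failures.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start_hour: float  # hours since the start of the observation window
    end_hour: float    # hour at which service was restored


def mttr(outages):
    """Mean Time to Repair: average outage duration, in hours."""
    return sum(o.end_hour - o.start_hour for o in outages) / len(outages)


def mtbf(outages, window_hours):
    """Mean Time Between Failures: total uptime divided by failure count."""
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    return (window_hours - downtime) / len(outages)


# Hypothetical outage log for one 30-day (720-hour) window.
OUTAGES = [Outage(100.0, 102.0), Outage(400.0, 401.0), Outage(650.0, 653.0)]
WINDOW_HOURS = 720.0

print("MTTR (hours):", mttr(OUTAGES))
print("MTBF (hours):", mtbf(OUTAGES, WINDOW_HOURS))
```

In this sketch, shortening any outage lowers MTTR, while preventing an outage altogether raises MTBF, which mirrors the two ways of minimizing downtime described above.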
Network management may utilize new methodologies and processes to accommodate the global shift to digital technologies. Tactical, manual approaches that rely on legacy mechanisms to build, operate, and troubleshoot the network may need to improve if the network is to be managed efficiently.
This disclosure relates generally to network management automation using network intent or adaptive monitoring automation. Network intent (NI) represents a network design and baseline configuration for that network or network devices with an ability to diagnose deviation from the baseline configuration. The NI can be automated to update and replicate the diagnosis. The monitoring of the network can be adapted to capture network problems in advance with adaptive monitoring automation.
In one embodiment, a method for automating network management includes creating a network intent for a network device with a baseline configuration for the network device; establishing a diagnosis for the network device that includes a comparison with the baseline configuration; monitoring variables for the network device; comparing variables for the network device with the baseline configuration based on the diagnosis; identifying a deviation from the baseline configuration based on the comparing; updating the network intent based on the diagnosis and the deviation; and utilizing, iteratively, the updated network intent for the network device with the monitoring and the comparing. The network intent is associated with the network device and other network devices have other network intent with variables for those other network devices. The updated network intent is applied to a second network device. The utilizing includes outputting at least one of a diagnosis note, device status code, a network intent status code, or a baseline intent. The modifying comprises updating the network intent and iteratively applying the network intent for the one or more baseline configurations. The baseline configuration is saved as the network intent, and the monitored variables comprise current data, which is compared with previous data. The method includes parsing, with a visual parser, the monitored variables, wherein the monitoring is based on the parsed variables. The visual parser parses the monitored variables with a text parser, a variable parser, a paragraph parser, or a table parser. The visual parser comprises a reuse parser that applies to other network devices other than the network device. The network intent establishes design rules, security rules, or establishes repetitive problems.
In one embodiment, a method for network management includes establishing a network intent that comprises one or more baseline configurations for a network; monitoring variables in real time; comparing the monitored variables with the one or more baseline configurations; diagnosing a deviation from the one or more baseline configurations, which indicates one or more network problems; modifying the network intent based on the diagnosing, such that the network intent can be automatically applied to future deviations; and applying the modified network intent for subsequent instances of the monitoring. The network intent is associated with a network device and the variables are for that network device. A second network intent is established for a second network device. The applying further comprises iteratively performing the comparing, the diagnosing, and the modifying for the subsequent instances. The method includes providing an alert when the deviation is diagnosed. The method includes parsing, with a visual parser, the monitored variables, wherein the diagnosing is based on the parsed variables. The modifying includes outputting at least one of a diagnosis note, device status code, a network intent status code, or a baseline intent. The modifying comprises updating the network intent and iteratively applying the network intent for the one or more baseline configurations. The baseline configuration is saved as the network intent and the monitored variables comprise current data, which is compared with previous data. The monitoring comprises an adaptive monitoring automation using a primary flash probe and a secondary flash probe.
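As a minimal, non-limiting sketch of the establish, monitor, compare, diagnose, and modify cycle described above, the following example stores a baseline as a dictionary of variables. The names `NetworkIntent`, `diagnose`, and `run_cycle`, the `accept` parameter, and the device and variable names are hypothetical and are not drawn from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkIntent:
    device: str
    baseline: dict                                # variable name -> expected value
    notes: list = field(default_factory=list)     # accumulated diagnosis notes


def diagnose(intent, current):
    """Return {variable: (expected, observed)} for each deviation from baseline."""
    return {
        name: (expected, current.get(name))
        for name, expected in intent.baseline.items()
        if current.get(name) != expected
    }


def run_cycle(intent, current, accept=()):
    """One compare/diagnose/modify pass over freshly monitored variables.

    Variables listed in `accept` are treated as intended changes, so the
    baseline is updated and later passes no longer flag them.
    """
    deviations = diagnose(intent, current)
    for name, (expected, observed) in deviations.items():
        intent.notes.append(
            f"{intent.device}: {name} expected {expected!r}, observed {observed!r}"
        )
        if name in accept:
            intent.baseline[name] = observed
    return deviations


intent = NetworkIntent("core-sw1", {"mtu": 1500, "duplex": "full"})
monitored = {"mtu": 9000, "duplex": "full"}

first = run_cycle(intent, monitored, accept={"mtu"})   # deviation found, baseline updated
second = run_cycle(intent, monitored)                  # same data, no deviation now
```

The second pass illustrates the iterative aspect: once the intent has been modified, the same monitored values are compared against the updated baseline.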
In one embodiment, a method for automating network management includes performing monitoring of a network, wherein the monitoring is adaptive to network problems and adaptive to a workload; establishing a primary flash probe that is used to detect a deviation based on the monitoring; establishing one or more secondary flash probes for the primary flash probe that are triggered when the primary flash probe detects the deviation; and generating a flash alert when the primary flash probe or the one or more secondary flash probes detect the deviation. The method includes running a network automation at a device level based on the generated flash alert. The network automation includes running a diagnosis for the network device that includes a comparison with the baseline configuration. The network automation is the network intent. The monitoring comprises a back-end automation without reliance on a user to run automation. The primary flash probe or the one or more secondary flash probes perform a device level check or an interface level check. The method includes establishing a flash probe that performs a network anomaly detection on a single device. The method includes establishing a built-in flash probe that is triggered for detection of a configuration change, or when SNMP or CLI is unreachable. The primary flash probe or the one or more secondary flash probes is triggered by an event or by an API. The method includes providing a dashboard displaying a summary of probes and the generated flash alerts that includes a distribution of those for each network device. The dashboard displays an execution tree with results from the probes and the generated flash alerts. The dashboard displays a map of the network devices and the probes for each of the network devices on the map.
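The relationship between a primary flash probe and its secondary flash probes may be sketched as follows, purely as an illustration. The `FlashProbe` class, the `run_probe` function, the thresholds, and the sample values are assumptions for this example; the only behavior taken from the description above is that secondary probes run when the primary probe detects a deviation and that each detecting probe generates an alert.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FlashProbe:
    name: str
    check: Callable[[dict], bool]                     # True when a deviation is detected
    secondaries: list = field(default_factory=list)   # probes triggered by this probe


def run_probe(probe, device, sample):
    """Run a probe; on a deviation, raise an alert and trigger its secondaries."""
    alerts = []
    if probe.check(sample):
        alerts.append(f"{device}: {probe.name}")
        for secondary in probe.secondaries:
            alerts.extend(run_probe(secondary, device, sample))
    return alerts


# Hypothetical device-level checks with illustrative thresholds.
bgp_flap = FlashProbe("bgp_flap", lambda s: s["bgp_flaps"] > 0)
memory_high = FlashProbe("memory_high", lambda s: s["mem"] > 85)
cpu_high = FlashProbe("cpu_high", lambda s: s["cpu"] > 90, [bgp_flap, memory_high])

busy = run_probe(cpu_high, "edge-r1", {"cpu": 95, "bgp_flaps": 3, "mem": 40})
calm = run_probe(cpu_high, "edge-r1", {"cpu": 20, "bgp_flaps": 3, "mem": 99})
```

In the second call the secondary checks are never evaluated because the primary probe does not detect a deviation, which is one way the monitoring workload can be kept proportional to the observed problems.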
In one embodiment, a network management system includes a network intention (NI) management configured to define and execute the NI; adaptive monitoring automation configured to utilize one or more flash probes in a backend process, wherein the one or more flash probes create an alert and trigger the NI execution; and a dashboard for displaying network devices with corresponding results of the flash probes. The system includes an execution tree with results from the flash probes and the generated flash alerts. When the alert occurs, the triggered automation is executed. The flash probe comprises at least one of a primary probe, a secondary probe, or an external probe. The dashboard displays a summary of the flash probes and the generated alerts that includes a distribution of those for each of the network devices. The dashboard displays an execution tree with results from the flash probes and the generated alerts. The dashboard displays a map of the network devices and the flash probes for each of the network devices. The system includes a visual parser using a grammar to turn device command output or configuration file text into programmable variables, wherein the visual parser is configured to parse a configuration file and CLI command output for automation problem resolutions, further wherein the visual parser comprises variables comprising text, single variables, paragraph, and table. The NI comprises at least one of a name, a description, a target device, a tag, a configuration, or a variable.
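To illustrate how table-style command output might be turned into programmable variables, the following sketch parses a whitespace-delimited table into row dictionaries. It is not the visual parser itself; the `parse_table` function and the sample text are hypothetical, and the sketch assumes a single header row and cells that contain no spaces.

```python
def parse_table(text):
    """Turn a whitespace-delimited table into a list of dicts keyed by header."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    headers = lines[0].split()
    return [dict(zip(headers, line.split())) for line in lines[1:]]


# Hypothetical command output; real device output varies by vendor and version.
SAMPLE = """
Interface  Status  Protocol
Gi0/1      up      up
Gi0/2      down    down
"""

rows = parse_table(SAMPLE)
down = [row["Interface"] for row in rows if row["Status"] != "up"]
```

Once the output is held as variables, a comparison against a baseline can be expressed as ordinary data operations rather than as manual reading of the command output.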
The system and method may be better understood with reference to the following drawings and description. Non-limiting and non-exhaustive embodiments are described with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. In the drawings, like referenced numerals designate corresponding parts throughout the different views.
A new model requires closed-loop mechanisms to achieve continuous improvement and self-documenting workflow automation. This shift to a business-centric and intent-based mindset is automation-friendly, analytical, and proactive. Network diagnostic work may move from sequential, CLI-focused methods to multi-threaded, integrated automation.
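As one hedged illustration of moving from sequential to multi-threaded diagnostics, the sketch below runs the same check against several devices concurrently using a thread pool from the Python standard library. The `check_device` and `run_checks` functions and the `SIMULATED_STATUS` table are hypothetical stand-ins; an actual check would query the device rather than read a dictionary.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for live device state; a real check would poll the device.
SIMULATED_STATUS = {"core-sw1": "ok", "edge-r1": "ok", "dist-sw2": "deviation"}


def check_device(name):
    """Run one diagnostic against one device and return (device, result)."""
    return name, SIMULATED_STATUS[name]


def run_checks(devices, workers=4):
    """Run the diagnostic for all devices concurrently rather than one at a time."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(check_device, devices))


results = run_checks(list(SIMULATED_STATUS))
deviating = sorted(name for name, status in results.items() if status != "ok")
```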
Network management automation may rely on administrative tasks or failure prevention monitoring, such as redundancy verifications, device hardening verifications, or compliance audits. The automation described below augments network operations, improves MTTR and MTBF, and prevents the inherent risks within networks that cause outages. Network engineering and architecture teams were traditionally the main stewards of this use case, where their jobs are to roll out new services, deliver redundancy, and reduce inherent risks. Reducing MTTR has an equal, if not greater, impact on the overall target of reducing downtime. The automation embodiments can enable infrastructure teams to become more efficient in this role. The added complexity of new networking technologies, combined with the sheer volume of network devices and the fragmentation of subject matter expertise, may lead to longer troubleshooting times. The automation embodiments can augment network management and improve MTTR.
By way of introduction, the disclosed embodiments relate to systems and methods for network management automation using network intent or adaptive monitoring automation. Network intent (NI) represents a network design and baseline configuration for that network or network devices with an ability to diagnose deviation from the baseline configuration. The NI can be automated to update and replicate the diagnosis. The monitoring of the network can be adapted to capture network problems in advance with adaptive monitoring automation.
Network Intention (NI) is a network-based solution with an executable automation element to document and verify a network design. NIs can be monitored proactively to prevent violation. The system can send an alert for an NI violation. The NI system may include Network Intention Management as a subsystem to define, manage and manually execute NI. The NI system may include a Feature Intent Definition or Network Intent Cluster as a subsystem to automatically create NIs from a template. The NI system may include Adaptive Monitoring Automation as a backend process to poll the network's status via a Flash Probe. When a flash alert occurs, the triggered automation is executed, such as Network Intent. The NI system may include a Decision Tree as a view to present the Flash Probe's results with the Flash Alert and associated triggered automation and further recommend automation elements based on a device and/or a tag that shows a troubleshooting scenario.
Reference will now be made in detail to exemplary embodiments of the invention, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts. The numerous innovative teachings of the present application will be described with particular reference to presently preferred embodiments (by way of example, and not of limitation). The present application describes several inventions, and none of the statements below should be taken as limiting the claims generally.
For simplicity and clarity of illustration, the drawing figures illustrate the general manner of construction, and description and details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the invention. Additionally, elements in the drawing figures are not necessarily drawn to scale, and some areas or elements may be expanded to help improve understanding of embodiments of the invention.
The word ‘couple’ and similar terms do not necessarily denote direct and immediate connections, but also include connections through intermediate elements or devices. For purposes of convenience and clarity only, directional (up/down, etc.) or motional (forward/back, etc.) terms may be used with respect to the drawings. These and similar directional terms should not be construed to limit the scope in any manner. It will also be understood that other embodiments may be utilized without departing from the scope of the present disclosure, and that the detailed description is not to be taken in a limiting sense, and that elements may be differently positioned, or otherwise noted as in the appended claims without requirements of the written description being required thereto.
The terms “first,” “second,” “third,” “fourth,” and the like in the description and the claims, if any, may be used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable. Furthermore, the terms “comprise,” “include,” “have,” and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, article, apparatus, or composition that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, apparatus, or composition.
The aspects of the present disclosure may be described herein in terms of functional block components and various processing steps. It should be appreciated that such functional blocks may be realized by any number of hardware and/or software components configured to perform the specified functions. For example, these aspects may employ various integrated circuit components, e.g., memory elements, processing elements, logic elements, look-up tables, and the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices.
Similarly, the software elements of the present disclosure may be implemented with any programming or scripting languages such as C, C++, Java, COBOL, assembler, PERL, Python, or the like, with the various algorithms being implemented with any combination of data structures, objects, processes, routines, or other programming elements. Further, it should be noted that the present disclosure may employ any number of conventional techniques for data transmission, signaling, data processing, network control, and the like.
The particular implementations shown and described herein are for explanatory purposes and are not intended to otherwise be limiting in any way. Furthermore, the connecting lines shown in the various figures contained herein are intended to represent exemplary functional relationships and/or physical couplings between the various elements. It should be noted that many alternative or additional functional relationships or physical connections may be present in a practical system implemented in accordance with the disclosure.
As will be appreciated by one of ordinary skill in the art, aspects of the present disclosure may be embodied as a method or a system. Furthermore, these aspects of the present disclosure may take the form of a computer program product on a tangible computer-readable storage medium having computer-readable program-code embodied in the storage medium. Any suitable computer-readable storage medium may be utilized, including hard disks, CD-ROM, optical storage devices, magnetic storage devices, and/or the like. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the functions specified in the flowchart block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks.
As used herein, the terms “user,” “network engineer,” “network manager,” “network developer” and “participant” shall interchangeably refer to any person, entity, organization, machine, hardware, software, or business that accesses and uses the system of the disclosure. Participants in the system may interact with one another either online or offline.
Communication between participants in the system of the present disclosure is accomplished through any suitable communication means, such as, for example, a telephone network, intranet, Internet, extranet, WAN, LAN, personal digital assistant, cellular phone, online communications, off-line communications, wireless network communications, satellite communications, and/or the like. One skilled in the art will also appreciate that, for security reasons, any databases, systems, or components of the present disclosure may consist of any combination of databases or components at a single location or at multiple locations, wherein each database or system includes any of various suitable security features, such as firewalls, access codes, encryption, de-encryption, compression, decompression, and/or the like.
In network troubleshooting, a network engineer may use a set of commands, methods, and tools, either standard or proprietary. For example, these commands, methods, and tools may include the following items:
The Command Line Interface (CLI): network devices often provide CLI commands to check the network status or statistics. For example, in a Cisco IOS switch, the command “show interface” can be used to show the interface status, such as input errors.
Configuration management: a tool used to find differences of configurations of network devices in a certain period. This is important since about half of the network problems are caused by configuration changes.
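The two items above may be illustrated with a short sketch: extracting a single counter from command output with a regular expression, and listing the lines that differ between two configuration snapshots with the standard `difflib` module. The sample output and configurations are hypothetical, the "input errors" line follows the general shape of a "show interface" display, and the function names `input_errors` and `config_changes` are assumptions for this example.

```python
import difflib
import re

# Hypothetical excerpt of interface status output.
SAMPLE_OUTPUT = """\
GigabitEthernet0/1 is up, line protocol is up
     1234 packets input, 567890 bytes, 0 no buffer
     17 input errors, 3 CRC, 0 frame, 0 overrun, 0 ignored
"""


def input_errors(output):
    """Return the input error count, or None if the counter is not present."""
    match = re.search(r"(\d+) input errors", output)
    return int(match.group(1)) if match else None


def config_changes(before, after):
    """Return only the removed ('- ') and added ('+ ') configuration lines."""
    diff = difflib.ndiff(before.splitlines(), after.splitlines())
    return [line for line in diff if line.startswith(("+ ", "- "))]


# Hypothetical configuration snapshots taken at two points in time.
BEFORE = "hostname r1\ninterface Gi0/1\n mtu 1500\n"
AFTER = "hostname r1\ninterface Gi0/1\n mtu 9000\n"

errors = input_errors(SAMPLE_OUTPUT)
changes = config_changes(BEFORE, AFTER)
```

Here `errors` holds the parsed counter and `changes` holds the old and new MTU lines, so a change in configuration can be set beside a change in interface statistics during troubleshooting.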
The term “Object” refers to the term used in computer technology, with the same meaning as in “object oriented” programming languages (such as Java, Common Lisp, Python, C++, Objective-C, Smalltalk, Delphi, Swift, C#, Perl, Ruby, and PHP). It is an abstract computer logic entity that envelops or mimics an entity in the real physical world, usually possessing an interface, data properties and/or methods.
The term “Device” refers to a data object representing a physical computer machine (e.g., printer, router) connected in a network or an object (e.g., computer instances or database instances on a server) created by computer logic functioning in a computer network.
The term “Q-map” or “Qmap” refers to a map of network devices created by the computer technology of NetBrain Technologies, Inc. that uses visual images and graphic drawings to represent the topology of a computer network with interface property and device property displays through a graphical user interface (GUI). Typically, a computer network is created with a map-like structure where a device is represented with a device image and is linked with other devices through straight lines, pointed lines, dashed lines and/or curved lines, depending on their interfaces and connection relationship. Along the lines, also displayed are the various data properties of the device or connection.
The term “Qapp” refers to a built-in or user-defined independently executable script or procedure generated through a graphical user interface as per technology available from NETBRAIN TECHNOLOGIES, INC.
The term “GUI” refers to a graphical user interface and includes a visual paradigm that offers users a plethora of choices. GUI paradigm or operation relies on windows, icons, mouse, pointers and scrollbars to display graphically the set of available files and applications. In a GUI-based system, a network structure may be represented with graphic features (icons, lines and menus) that represent corresponding features in a physical network in a map. The map system may be referred to as a Qmap and is further described with respect to U.S. Pat. Nos. 8,386,593, 8,325,720, and 8,386,937, the entire disclosure of each of which is hereby incorporated by reference. After a procedure is created, it can be run in connection with any network system. Troubleshooting with a proposed solution may take just a few minutes instead of the hours or days traditionally required. The troubleshooting and network management automation may be combined with the mapping of the network along with the NETBRAIN QAPP (Qapp) system. The Qapp system is further described with respect to U.S. Pat. Nos. 9,374,278, 9,438,481, U.S. Pat. Pub. Nos. 2015/0156077, 2016/0359687, and 2016/0359688, the entire disclosure of each of which is hereby incorporated by reference.
The term “Step” refers to a single independently executable computer action represented by a GUI element, that obtains, or causes, a network result from, or in, a computer network; a Step can take a form of a Qapp, a system function, or a block of plain text describing an external action to be executed manually by a user, such as a suggestion of action, “go check the cable.” Each Step is thus operable and re-usable by a GUI operation, such as a mouse cursor drag-and-drop or a mouse click.
The network manager 112 may be a computing device for monitoring or managing devices in a network, including performing automation tasks for the management, including network intent analysis and adaptive monitoring automation. In other embodiments, the network manager 112 may be referred to as a network intent analyzer or adaptive monitor for a user 102. The network manager 112 may include a processor 120, a memory 118, software 116 and a user interface 114. In alternative embodiments, the network manager 112 may be multiple devices to provide different functions, and it may or may not include all of the user interface 114, the software 116, the memory 118, and/or the processor 120.
The user interface 114 may be a user input device or a display. The user interface 114 may include a keyboard, keypad or cursor control device, such as a mouse, joystick, touch screen display, remote control or any other device operative to allow a user or administrator to interact with the network manager 112. The user interface 114 may communicate with any of the network devices in the network 104, and/or the network manager 112. The user interface 114 may include a user interface configured to allow a user and/or an administrator to interact with any of the components of the network manager 112. The user interface 114 may include a display coupled with the processor 120 and configured to display output from the processor 120. The display (not shown) may be a liquid crystal display (LCD), an organic light emitting diode (OLED), a flat panel display, a solid state display, a cathode ray tube (CRT), a projector, a printer or other now known or later developed display device for outputting determined information. The display may act as an interface for the user to see the functioning of the processor 120, or as an interface with the software 116 for providing data.
The processor 120 in the network manager 112 may include a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP) or other type of processing device. The processor 120 may be a component in any one of a variety of systems. For example, the processor 120 may be part of a standard personal computer or a workstation. The processor 120 may be one or more general processors, digital signal processors, application specific integrated circuits, field programmable gate arrays, servers, networks, digital circuits, analog circuits, combinations thereof, or other now known or later developed devices for analyzing and processing data. The processor 120 may operate in conjunction with a software program (i.e., software 116), such as code generated manually (i.e., programmed). The software 116 may include the Data View system and tasks that are performed as part of the management of the network 104, including the generation and usage of Data View functionality. Specifically, the Data View may be implemented from software, such as the software 116.
The processor 120 may be coupled with the memory 118, or the memory 118 may be a separate component. The software 116 may be stored in the memory 118. The memory 118 may include, but is not limited to, computer readable storage media such as various types of volatile and non-volatile storage media, including random access memory, read-only memory, programmable read-only memory, electrically programmable read-only memory, electrically erasable read-only memory, flash memory, magnetic tape or disk, optical media and the like. The memory 118 may include a random access memory for the processor 120. Alternatively, the memory 118 may be separate from the processor 120, such as a cache memory of a processor, the system memory, or other memory. The memory 118 may be an external storage device or database for storing recorded tracking data, or an analysis of the data. Examples include a hard drive, compact disc (“CD”), digital video disc (“DVD”), memory card, memory stick, floppy disc, universal serial bus (“USB”) memory device, or any other device operative to store data. The memory 118 is operable to store instructions executable by the processor 120.
The functions, acts or tasks illustrated in the figures or described herein may be performed by the programmed processor executing the instructions stored in the software 116 or the memory 118. The functions, acts or tasks are independent of the particular type of instruction set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firm-ware, micro-code and the like, operating alone or in combination. Likewise, processing strategies may include multiprocessing, multitasking, parallel processing and the like. The processor 120 is configured to execute the software 116.
The present disclosure contemplates a computer-readable medium that includes instructions or receives and executes instructions responsive to a propagated signal, so that a device connected to a network can communicate voice, video, audio, images or any other data over a network. The user interface 114 may be used to provide the instructions over the network via a communication port. The communication port may be created in software or may be a physical connection in hardware. The communication port may be configured to connect with a network, external media, display, or any other components in system 100, or combinations thereof. The connection with the network may be a physical connection, such as a wired Ethernet connection or may be established wirelessly, as discussed below. Likewise, the connections with other components of the system 100 may be physical connections or may be established wirelessly.
Any of the components in the system 100 may be coupled with one another through a (computer) network, including but not limited to one or more network(s) 104. For example, the network manager 112 may be coupled with the devices in the network 104 through a network or the network manager 112 may be a part of the network 104. Accordingly, any of the components in the system 100 may include communication ports configured to connect with a network. The network or networks that may connect any of the components in the system 100 to enable data communication between the devices may include wired networks, wireless networks, or combinations thereof. The wireless network may be a cellular telephone network, a network operating according to a standardized protocol such as IEEE 802.11, 802.16, 802.20, published by the Institute of Electrical and Electronics Engineers, Inc., or WiMax network. Further, the network(s) may be a public network, such as the Internet, a private network, such as an intranet, or combinations thereof, and may utilize a variety of networking protocols now available or later developed including, but not limited to TCP/IP based networking protocols. The network(s) may include one or more of a local area network (LAN), a wide area network (WAN), a direct connection such as through a Universal Serial Bus (USB) port, and the like, and may include the set of interconnected networks that make up the Internet. The network(s) may include any communication method or employ any form of machine-readable media for communicating information from one device to another.
The network manager 112 may act as the operating system (OS) of the entire network 104. The network manager 112 provides automation for the users 102, including automated documentation, automated troubleshooting, automated change, and automated network defense. In one embodiment, the users 102 may refer to network engineers who have a basic understanding of networking technologies, and are skilled in operating a network via a device command line interface and are able to interpret a CLI output. The users 102 may rely on the network manager 112 for controlling the network 104, such as with network intent analysis functionality or for adaptive monitoring automation.
Potential delays and unpredictable variability exist in the Identify phase, which may require the most effort to resolve. Being highly unpredictable, the Identify phase may have the most considerable impact on the cost of an outage. Without a means to methodically tackle this variability, we cannot measurably improve the most significant portion of MTTR. Hence, the most considerable reduction in MTTR will come from reducing the Mean Time to Identify (MTTI). An effective automation strategy must enable teams to obtain and analyze data faster to isolate the root cause.
While the Fix phase can be very brief, efforts to reduce the inherent risk of pushing a change and to integrate this phase into a full incident response workflow are desired. The postmortem is an optional fourth phase of MTTR. Once the current incident is resolved, the question remains whether a similar event could recur later or whether this is a commonly recurring event. In network management postmortems, the lessons learned can be made executable for next time.
Fault Detected: A network monitoring tool detects a fault, and then some automated event correlation may occur, and a ticket is automatically generated. Now, an investigation must begin to determine the root cause. While fault detection is mostly automated, the transition from detection to examination is typically not automated and is a cause of delay.
Idle Time: There is a waiting period after an event has been detected and is ongoing, but before an incident response investigation has begun. A ticket may sit idle for an hour or more while potentially critical diagnostic information vanishes.
First Response: This is often the most time-consuming stage and where MTTR can be reduced most. It is critical to have the correct data and the right know-how. Hugely variable, this stage can potentially take several hours or more depending on the complexity of the issue.
Escalation: If the first engineer is unable to resolve the issue, escalation is needed. The common flaw at this step is duplication of effort. The escalation engineer will inevitably repeat the first engineer's work before moving on to more advanced diagnostics.
Remediation: The goal here is to ensure that we push safe changes, do no additional harm, and verify that the fix was successful. Automation is the safest way to push out changes during this high-stress period of incident response.
Postmortem: Implementing lessons learned to “do better next time” may be critical yet exceedingly challenging to enact successfully.
Traditionally, the movement between the stages of incident response and the diagnostics during an investigation are manual. Therefore, MTTR reduction depends on people. Improving MTTR without automation would require either more people or a better network, both of which may be difficult to achieve. Advanced automation across each phase of the incident response workflow delivers a scalable methodology. MTTR reduction can be achieved by increasing automation at every stage of the incident response workflow and through proactive automation at the postmortem stage following every incident.
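As a rough illustration of how the stages add up, the sketch below sums per-stage durations into a total repair time and reports each stage's share. The stage names follow the workflow above; the durations are invented for illustration and are not measurements.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    minutes: float  # hypothetical duration, for illustration only


def mttr_breakdown(stages):
    """Return total repair time and each stage's fraction of it."""
    total = sum(s.minutes for s in stages)
    shares = {s.name: s.minutes / total for s in stages}
    return total, shares


# Invented numbers for a single incident.
incident = [
    Stage("idle", 60),
    Stage("first_response", 120),
    Stage("escalation", 45),
    Stage("remediation", 15),
]

total, shares = mttr_breakdown(incident)
slowest = max(shares, key=shares.get)
```

Averaging `total` over many incidents gives MTTR; the per-stage shares indicate where automation would remove the most time.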
Triggered Automation: Automate First Response
When a fault occurs within the network, the first challenge is the resulting idle time. If the ticket sits unworked, diagnostic data may clear before an investigation can begin, particularly for intermittent issues. Automation augments this process and initiates the diagnosis of the event. Triggered automation closes the gap between the detection of the fault and the action of investigating. For triggered automation to be successful, full network management workflow integration may be used. A network's event detection system or ITSM must communicate with the NetOps automation system to trigger an automatic diagnosis.
Automation may be designed to augment people. Rather than sequentially parsing through the CLI outputs of every piece of network equipment in an affected segment, the engineer leverages pre-built operational runbooks that retrieve contextual diagnostic data from every device at the click of a button. This helps provide repeatable and predictable outcomes, ensures that relevant data is accurately retrieved, and dramatically reduces the time the diagnostic process takes.
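A minimal sketch of triggered automation, assuming the monitoring or ITSM integration calls a handler as soon as a fault is reported. The function names, the event fields, and the canned diagnostic output are hypothetical; a real integration would invoke a runbook against the device.

```python
from datetime import datetime, timezone


def collect_diagnostics(device):
    """Hypothetical stand-in for a runbook: returns canned CLI output."""
    return {"show interfaces": f"(output captured from {device})"}


def on_fault_event(event, tickets):
    """Create a ticket and capture diagnostics at event time."""
    ticket = {
        "id": len(tickets) + 1,
        "device": event["device"],
        "summary": event["summary"],
        "opened_at": datetime.now(timezone.utc).isoformat(),
        # Captured immediately, not when an engineer picks up the ticket,
        # so intermittent symptoms are less likely to have cleared.
        "diagnostics": collect_diagnostics(event["device"]),
    }
    tickets.append(ticket)
    return ticket


tickets = []
ticket = on_fault_event(
    {"device": "core-rtr-1", "summary": "BGP neighbor flapping"}, tickets
)
```

The point of the design is that the ticket is never created without diagnostics attached, which removes the idle gap between detection and investigation.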
The diagnostics may be scalable. Once the first engineer responds to an incident and begins the initial triage and investigation, the priority is to obtain the correct data quickly and perform accurate, efficient analysis, typically involving manual digging through CLI. The goal is to accelerate this diagnosis using automation. Knowing what data to get, retrieving it rapidly, and leveraging expert know-how to analyze this data is required. Automation may also provide enhanced data analytic functions to enable activities such as historical data comparisons to know “what has changed” or baseline analysis to understand “is this normal.” When combined with live data, an engineer can obtain the correct data and use these comparisons of past, current, and ideal network conditions to perform the analysis much faster. The first level of support can resolve some issues, but many problems require escalation. Collaboration may fail during incident response, with data not adequately conveyed to the next-level engineer or diagnostics not captured and saved. The escalation engineer may duplicate the work of the first engineer before moving on to more advanced diagnostics. A network automation solution should record the collected diagnostics and troubleshooting notes of every person assigned to the ticket, so everyone working on the problem has the same data. When it comes to the fix, the goal is to push out the change safely and verify that the fix resolved the issue. A well-designed change automation system ensures the fix is successful. The solution automates the full mitigation sequence, including change deployment, the before and after quality assurance, and validation that the problem has cleared. The network management automation embodiments may ensure that mitigation is safely executed, no additional harm has occurred, and reliable post-fix verification is performed.
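The "what has changed" comparison can be sketched as a dictionary diff between a stored baseline and live data. The variable names and values below are hypothetical and stand in for whatever the platform actually collects.

```python
def what_changed(baseline, live):
    """Return {key: (baseline_value, live_value)} for every differing key."""
    keys = baseline.keys() | live.keys()
    return {
        key: (baseline.get(key), live.get(key))
        for key in sorted(keys)
        if baseline.get(key) != live.get(key)
    }


# Hypothetical per-device variables.
baseline = {"mtu": 1500, "ospf_neighbors": 4, "qos_policy": "VIDEO-OUT"}
live = {"mtu": 1500, "ospf_neighbors": 3, "qos_policy": None}

changes = what_changed(baseline, live)
```

Only the deviations are returned, so an engineer (or an automated diagnosis) can focus on the variables that no longer match the known-good state.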
To see continual improvement over time requires more issues to be near-instantly diagnosed with the root cause identified. In other words, the automation strategy should focus on moving increasingly more issues to near-zero time to a resolution until you can resolve practically every ticket with automation. As more problems occur with proper postmortem reviews, a NetOps team would classify recurring issue types into a “known problem” category and develop operational runbooks for these problems.
As more known problem operational runbooks are fed to the machine, more known issues will have fully automated diagnoses. This process continuously pushes MTTR lower. With proactive automation, we convert lessons learned into repeatable and executable diagnostic automation tasks. More than just documenting that lesson, the goal is to implement an automated diagnostic that checks for this problem the next time there is a similar incident.
To achieve these proactive automation goals, the automation platform may:
When designing a knowledge management framework and network automation strategy, the objective may be to enable junior engineers to leverage senior-level expertise. From the view of an escalation chain, the goal will be to shift knowledge from senior staff, logically residing on the right side of the operational flow, towards the first responders working on the flow's left side, effectively shifting knowledge to the left. This downstream flow of knowledge enables the diagnostic work previously performed by a Tier-1 engineer to be handled by the automation system. The Tier-1 team can now take on advanced work once performed by escalation engineers. This may provide the following benefits:
There are several times when knowledge should be fed back into the automation platform, but two examples are operational handoff and following an incident. Operational Handoff is when a team has implemented a new network design (e.g., MPLS). A consistent, easy-to-follow method for documenting operational procedures related to new designs or new technology is required to ensure that everyone on the team knows how to troubleshoot the new environment. Building an operational runbook for the new design may be part of the handoff from the architect to the operator. Following an Incident means that the team may get together for a postmortem review after resolving an incident. The goal is to do better next time. This feedback process creates a closed-loop mechanism for continual improvement, capturing knowledge at these two critical and ordinary moments. Combining knowledge management with no-code runbook automation leads to the automated resolution of every ticket and can achieve continuous MTTR reduction over time. This feedback mechanism may be referred to as Proactive Automation.
Automation Platform
The automation may have two types of users: consumers and creators of executable knowledge. This solves the challenges of resolving network tickets and maintaining a network, as shown in the following example network incident. The network's monitoring systems have detected a low video quality issue between the Boston and New York site locations. The network team's application performance monitor notifies their ITSM system and generates a new trouble ticket. Here, workflow integration comes into play. The network management system provides a mechanism to integrate with ITSM systems, which enables (1) creating a contextual Dynamic Map of the problem area at the time of ticket creation, and (2) enriching the trouble ticket with diagnostic data obtained from Executable Runbooks at the time of the event (Just in Time Automation). In the example video quality incident, the Dynamic Map visualizes relevant data about the network: topology data, configuration and design data, baseline data across thousands of data points, and even data from integrated third-party solutions. This map provides instant visualizations of the problem area. Triggered automation has now occurred, and valuable data has been automatically gathered at the start of the event using an Executable Runbook. A first response engineer can then review these automated diagnostics. The data retrieved includes essential device health, QoS parameters, access-control lists, and other relevant collected logs. What used to be a manual effort is now a zero-touch mechanism, ensuring that every ticket is enriched with a contextual map and diagnostic data.
The root cause of the poor video quality issue can then be determined. The engineer has reviewed the map of the problem and the collected diagnostics but still needs to drill down further to determine the root cause. To aid in the diagnosis, the scalability of the automation platform may be used. Additional diagnostics or more advanced design reviews may be needed to determine the root cause. The engineer now leverages the automated drill-down capabilities of the network management automation platform to do further analysis and historical comparisons and compare this data with previous baselines. The know-how and operational procedures from previous incident responses by the network management team may be converted into Executable Runbooks, which allow large swaths of contextual data to be pulled, parsed, analyzed, and displayed on the console at the push of a button by any engineer on the team, regardless of experience.
In the low video quality example, the network management team has identified the issue to be a misconfigured QoS parameter on a router. The misconfiguration has been successfully remediated with a configuration fix using the network management automation platform. By adding this issue to the list of known problems, the team ensures that they can identify and remediate the problem much faster if it happens again. With the network management automation platform, the additional diagnostic commands used to resolve the issue are added to the existing Executable Runbook automatically to enrich the Runbook without requiring any coding. Should the event reoccur, the system will trigger an automated diagnosis using the updated Runbook. The root cause will be determined instantly, with a near-zero Time to Repair for this repeat occurrence. This process also helps to rule out possible known issues in unrelated incidents automatically. It creates a “virtuous cycle”: the more known problems and scenarios for which an Executable Runbook is built, the further MTTR is reduced.
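The known-problem loop described above might be modeled as follows. This is a sketch under stated assumptions: the `Runbook` class, the `known_problems` registry, and the problem signature string are hypothetical names, and a runbook is reduced to a list of CLI commands.

```python
class Runbook:
    """Hypothetical runbook: a named, ordered list of CLI commands."""

    def __init__(self, name, commands):
        self.name = name
        self.commands = list(commands)

    def enrich(self, extra_commands):
        """Append commands used during an incident, skipping duplicates."""
        for command in extra_commands:
            if command not in self.commands:
                self.commands.append(command)


known_problems = {}  # problem signature -> Runbook


def record_postmortem(signature, runbook, extra_commands):
    """Fold the commands used in this incident back into the runbook."""
    runbook.enrich(extra_commands)
    known_problems[signature] = runbook


def on_event(signature):
    """Return commands to run automatically, or None if the problem is new."""
    runbook = known_problems.get(signature)
    return runbook.commands if runbook else None


video = Runbook("video-quality", ["show interfaces", "show policy-map interface"])
record_postmortem(
    "low-video-quality", video, ["show policy-map interface", "show class-map"]
)
```

After the postmortem, a repeat of the same event signature retrieves the enriched command list without any manual step, which is the mechanism behind the shrinking repair time for known problems.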
Intent-Based Automation
Dynamic Mapping and Executable Runbook are used for automating network troubleshooting. The Runbook digitalizes the troubleshooting procedure and, once written, can be executed anywhere by anyone. Network device vendors have published vast numbers of troubleshooting playbooks. Enterprises also create many best-practice playbooks to troubleshoot the problems common to their unique networks. Executable Runbook can codify these playbooks. However, one difficulty in codifying these runbooks is that they try to solve a common problem and require coding skills. Some Runbooks can be complicated with many forks depending on human decisions (the diamond node in the sample playbook), making them hard to execute in the backend processes without human intervention. Since Runbook is a template-based solution designed to solve a common problem for many networks, it may not contain the baseline data for a specific network, which is the most useful information while troubleshooting.
Accordingly, Network Intention (NI) can be used to solve these issues. NI may also be referred to as Network Intent. NI is an Automation Unit that can represent an actual network design (with Baseline) and include the logic to diagnose the intent deviation and replicate diagnosis logic across the entire network (with Network Intent Cluster technology). NI is a network-based solution with an executable automation element to document and verify a network design. In an ideal network, all NIs should not be violated. NIs can be monitored proactively, and the system should send an alert for an NI violation. The NI system may include the following components:
NI may be used in a preventative use case. There may not be problems, but periodic checkups are run to ensure the network is running normally. In another example, when there are problems (e.g., a ticket system reports that an application is down), tests may need to be run, so the automation runs the tests that determine why the application is down. The cause may be that an NI is down.
Referring to
In some embodiments, the Parser for Config and CLI commands can be defined. A Visual Parser supports at least four types of variables: text, single variables, Paragraph, and Table. The Text parser is used to match specified lines of text. For example, to verify that the specific configuration or CLI command output does not change in the future, you can define a text parser to parse specified lines of text and compare the live data with the baseline. A Variable or Keyword parser is used to parse a single-value variable (such as version number) by anchoring keywords before and after the variable. Each Variable Line Pattern in a keyword parser can parse a variable within the full-text range or parse multiple variables in one text line. A paragraph parser is used to extract the essential data in recurring text lines and place it into a tabular shape. The parsed variables of a paragraph parser are a table. The variables defined in ID line patterns, variable line patterns, and parent line patterns (optional) will be formed as table columns. A table parser may be used to parse table-formatted text, such as NDP table, VRF table, OSPF neighbors, etc. With a table parser, you can address the line of table headers in the raw text and then leverage the column separator to adjust the Table's column width manually.
In some embodiments, a note, diagnosis, and status code are added. A common diagnosis can be as simple as: if the variable is not equal to a specific value (the baseline value), then create an alert. The status code describes NI execution results (Error or Normal). Clicking Edit Diagnosis opens the diagnosis pane. On a Define Diagnosis tab, click Add Diagnosis to enter a diagnosis name and select a defined anchor. An if/then condition can be set, and there is an option to select the Set as Status Code for Network Intent check box to add a status code at the NI level.
When adding a diagnosis to a Network Intent, the user can define various diagnosis logics to make the NI more flexible and verify the network design more accurately. A Table/Paragraph Variable is one example. A variable can be a single variable such as $state or a table (Paragraph) variable. For the table/paragraph variable, the user can select Loop Table Rows for the system to loop through each row of the table. Using the OSPF neighbor table as an example, there may be multiple lines of neighbor information in the CLI result, and the diagnosis execution will determine, line by line, whether the state contains “full”. Different types of variables have various operations such as Equals, Not Equal to, Contains, etc. For each variable defined in diagnosis, the user can select its data sources/type:
The diagnosis may compare the current state with the baseline state or compare the current CRC value with the last CRC value. The user can compare the variables from the different devices, such as an MTU from two neighbor devices. There may be multiple simple conditions, and users can combine them into a Boolean expression (and/or). The diagnosis logic may have flexible settings to support simple and complex diagnosis logic and output.
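The diagnosis logic described above (per-row looping, operators such as Equals and Contains, and and/or combination) can be sketched as below. The operator names, the row format, and the sample neighbor data are assumptions made for illustration, not the system's actual data model.

```python
import operator

# Operator names mirror the ones mentioned in the text; the mapping is assumed.
OPS = {
    "equals": operator.eq,
    "not_equal": operator.ne,
    "contains": lambda value, expected: expected in value,
}


def check(row, variable, op, expected):
    """Evaluate one simple condition against one table row."""
    return OPS[op](row[variable], expected)


def diagnose_rows(rows, conditions, combine=all):
    """Loop over table rows; return rows that fail the combined condition.

    combine=all behaves like a Boolean AND, combine=any like an OR.
    """
    alerts = []
    for row in rows:
        results = (check(row, var, op, exp) for var, op, exp in conditions)
        if not combine(results):
            alerts.append(row)
    return alerts


# Hypothetical parsed OSPF neighbor table.
neighbors = [
    {"neighbor": "10.0.0.1", "state": "FULL/DR"},
    {"neighbor": "10.0.0.2", "state": "INIT/DROTHER"},
]
alerts = diagnose_rows(neighbors, [("state", "contains", "FULL")])
```

Comparing against a baseline works the same way: the expected value in each condition is taken from the stored baseline instead of being a constant.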
NI can be used to enforce design rules or security rules. For example, it can check for route leaking between DMZ, Enterprise, and Production networks. This automation logic can be replicated to all networks by NIC. In another embodiment, NI can be used to troubleshoot repetitive problems, such as an interface error. In another embodiment, NI can be used to diagnose application path problems by considering:
Visual Parser
The system provides at least five types of parser rules that can be applied to a parser or a parser group:
A single-line rule (line pattern) represents a type of expression serving parse variables in one or multiple text lines. The system adopts line-pattern-matching syntax to apply the given line patterns to identify and parse variables. The line pattern may be in the following types of parser and parser components:
A simple line pattern is an example line pattern to parse one or more variables. A variable always starts with $, and it is a string by default. For example, the pattern “one minute: $string:cpu1; five minutes: $cpu2” asks the system to find the keyword “one minute:” and assign the word between “one minute:” and “;” to the variable $cpu1. The variable name may include a combination of letters, numbers, and underscores and must start with a letter or an underscore. Variables of the same level in the same Parser may not be allowed to have the same name. The following table introduces sample pairs of raw text and simple line patterns for each variable type:
The following two characters can be used in a simple line pattern to match the start/end of a line:
The system may provide an option to match lines by variable patterns to get multiple raw CLI text lines of specified multiple variables with the following detailed rules:
The system provides an option to match lines by keyword patterns to parse multiple lines of raw CLI text for the specified pattern by following the rules:
The system provides a specific regex pattern using regular expression (regex for short). Starting with a specific keyword: regex or mregex, the regex pattern declares all the required variables (separated by a comma) in a pair of square brackets, followed by a colon (:) and regex that can parse text lines. Each pair of parentheses in a regex represents a capturing group to group listed characters to form a sub-pattern. Their matched values will be assigned to each variable defined inside the pair of square brackets by sequence. The following two types of regex patterns define a visual parser:
When there is no keyword before and after a target variable in one line of raw text, that is, the variable is the only string in that line, you can use the character ^ to represent the start of a line and use the character $ to represent the end of a line when defining the line pattern.
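One way to picture the simple line pattern syntax is as a translation into an ordinary regular expression with named groups. The sketch below is an assumption about how such a translation could work, not the system's implementation; it handles only the `$name`, `$string:name`, and `$mstring:name` forms plus the `^` and `$` line anchors.

```python
import re

# $name, $string:name or $mstring:name; names start with a letter or underscore.
VAR = re.compile(r"\$(?:(string|mstring):)?([A-Za-z_][A-Za-z0-9_]*)")
# Assumed meaning: string = one word, mstring = text that may contain spaces.
TYPE_REGEX = {"string": r"\S+", "mstring": r".+"}


def compile_line_pattern(pattern):
    """Translate a simple line pattern into a compiled regex."""
    at_start = pattern.startswith("^")
    at_end = pattern.endswith("$")
    body = pattern[(1 if at_start else 0):(-1 if at_end else None)]
    parts, pos = [], 0
    for match in VAR.finditer(body):
        parts.append(re.escape(body[pos:match.start()]))  # literal keyword
        var_type = match.group(1) or "string"  # string is the default type
        parts.append(f"(?P<{match.group(2)}>{TYPE_REGEX[var_type]})")
        pos = match.end()
    parts.append(re.escape(body[pos:]))
    return re.compile(("^" if at_start else "") + "".join(parts) + ("$" if at_end else ""))


def parse_line(pattern, line):
    """Return the parsed variables as a dict, or None if the line does not match."""
    match = compile_line_pattern(pattern).search(line)
    return match.groupdict() if match else None


raw = "CPU utilization for five seconds: 7%; one minute: 5%; five minutes: 3%"
cpu = parse_line("one minute: $string:cpu1; five minutes: $cpu2", raw)
```

Here `cpu` holds the two values found after the keywords, and a pattern such as `^$string:version$` matches only a line that consists of a single word.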
In some embodiments, a special character can be used for an exact match or a special character to avoid a mismatch. Setting a start line, end line, or both helps narrow down the range of text lines to apply a parser and get more accurate results. The matching scope includes the full-text range when there is no start/end line configured in a parser. There is an option to select either of the following ways to add a start line or end line: directly selecting a line or using the line of a selected variable.
Text replacement may be a flexible way to automate text pre-processing so that the text can be parsed as expected. When you want to search and replace any string in the raw text, you can define a text replacement. Text replacement can be defined on multiple levels. At a global level, a string is searched for and replaced in the whole range of sample text. At a parser level, a string is searched for and replaced in the given range of text that a specific parser's definition has matched. When defining a text replacement, there is an option to add multiple replacement rules. Each rule may include a Find What and a Replace With. The Find What is the text you are searching for in the given range. It may include:
Text replacement may include the following use cases:
A text parser is used when you only want to use a portion of the configuration file or CLI command outputs to validate network design and check changes. Take the parsing of the configuration file as an example. You can define a text parser to parse the configuration file: 1) Retrieve sample text of configurations; 2) Select a parser type by clicking Add Text to add a variable Text1; 3) Select lines of text in the Sample area, and click the arrow ( ) to duplicate it as the content to match Text1; and 4) Preview the parsed result of sample text, and then click OK to save the text parser. Multiple paragraphs of lines can be selected in the Sample area and assembled in one text variable. Users can also add multiple text variables in one text parser to parse different paragraphs of lines.
The system adopts an exact match to compare the selected lines of text when applying a text parser, following these rules:
A simple variable parser is used to parse a single-value variable (such as version number, etc.) by anchoring keywords before and after the variable. Each Variable Line Pattern in a keyword parser can parse a variable within the full-text range or parse multiple variables in one text line.
A variable can be defined visually by highlighting the text inside the variable group, and a Line pattern will automatically be created for this variable. The rules to fill the keywords before and after this variable are:
If the highlighted text is the beginning of the current line, the line pattern will start with “^”. If the highlighted text is the end of the current line, the line pattern will end with “$”.
The variable type will be auto-created according to the context of the highlighted text. The variable name is created by the following rules:
When two or more texts are highlighted, they are regarded as one variable. The variable's value is all content (including spaces) between the keywords before and after the highlighted texts. The corresponding variable type is $mstring. In the example below, the variable is defined as $mstring:uptime.
A paragraph parser is used to extract the essential data in recurring text lines and place it into a tabular shape. A paragraph parser can convert variables across multiple sections of the raw text into a table data structure, so a diagnosis can be executed against each row. Using the parsing of interface information as an example, a paragraph parser to parse interface information can be defined by:
A table parser is used to parse table-formatted text, such as NDP table, VRF table, OSPF neighbors, etc. With a table parser, users can address the line of table headers in the raw text and then leverage the column separator to adjust the table's column width manually. Using the parsing of the VRF table as an example, a table parser to parse the VRF table can include:
The Visual Parser is designed to be visible so that users can understand the relationship between the parser variable and the original data through the WYSIWYG (What You See is What You Get) and learn how to define a Parser quickly. Multiple parsers can be created from an original text. However, only one Parser can be expanded for edit or view at a time so that the relationship between the parser variable and the original data can be visually displayed. Each parser will have the Start Line and End Line properties, and these lines will be displayed in the original data by default.
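The table parser idea described earlier (locate the header line, then split each row at column boundaries) can be sketched as follows. This assumes left-aligned columns separated by at least two spaces and derives the boundaries from the header automatically, whereas the text describes adjusting them manually; the sample neighbor table is made up.

```python
import re


def parse_table(raw_text, header_keyword):
    """Parse fixed-width table text into a list of row dictionaries."""
    lines = raw_text.splitlines()
    header_index = next(i for i, line in enumerate(lines) if header_keyword in line)
    header = lines[header_index]
    # A header cell is one or more words separated by single spaces.
    columns = [(m.group(), m.start()) for m in re.finditer(r"\S+(?: \S+)*", header)]
    starts = [start for _, start in columns] + [None]
    rows = []
    for line in lines[header_index + 1:]:
        if not line.strip():
            break  # a blank line ends the table
        rows.append({
            name: line[starts[i]:starts[i + 1]].strip()
            for i, (name, _) in enumerate(columns)
        })
    return rows


# Made-up sample text, formatted so each column is 16 characters wide.
sample = "\n".join([
    "show ip ospf neighbor",
    f"{'Neighbor ID':<16}{'State':<16}Interface",
    f"{'10.0.0.1':<16}{'FULL/DR':<16}Gi0/0",
    f"{'10.0.0.2':<16}{'INIT/DROTHER':<16}Gi0/1",
])
rows = parse_table(sample, "Neighbor ID")
```

Each resulting row is a dictionary keyed by column header, which is the shape a per-row diagnosis needs.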
The visual parser may support reuse: a parser defined on one device in an NI can be copied and applied to other devices in the same NI. This may be referred to as a copy parser.
In Debug mode, the NI creator can run an NI step by step and check each step's input and output value. The system executes NI in four levels:
When triggered by a flash probe, the NI can be installed as triggered automation of the flash probe as part of Adaptive Monitoring Automation, which is a backend process to monitor the whole network's status periodically. When a flash alert occurs, the system will further execute NIs. The triggered NI results can be viewed with the flash probe via the Preventive Automation Dashboard. When an alert occurs on the flash probe, you can trigger the NI to execute automatically. An NI can be installed to a flash probe.
In another embodiment, a third-party system can trigger network management Runbook Template execution, including an NI node. For example, a ticket is created because a BGP neighbor of a core device is flapping, which triggers an API call to the NetBrain system, and the device name and “BGP” are sent to the network management system as keywords. A Runbook can filter NIs related to this device and BGP and execute these NIs.
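The filtering step in this example can be sketched as a lookup over stored NIs by device name and keyword. The NI record layout (`name`, `devices`, `tags`) and the sample entries are hypothetical; the actual API call and NI storage are not shown in the source.

```python
def select_intents(intents, device, keyword):
    """Return the NIs that cover `device` and are tagged with `keyword`."""
    keyword = keyword.lower()
    return [
        ni for ni in intents
        if device in ni["devices"]
        and keyword in (tag.lower() for tag in ni["tags"])
    ]


# Hypothetical NI inventory.
intents = [
    {"name": "core-bgp-peering", "devices": ["core-rtr-1"], "tags": ["BGP"]},
    {"name": "core-qos-video", "devices": ["core-rtr-1"], "tags": ["QoS"]},
    {"name": "edge-bgp-peering", "devices": ["edge-rtr-7"], "tags": ["BGP"]},
]

# Keywords as they might arrive from the ticketing system.
to_run = select_intents(intents, device="core-rtr-1", keyword="bgp")
```

Only the NIs relevant to both the device and the technology are executed, which keeps the triggered diagnosis focused on the reported fault.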
Referring back to
Feature Intent Definition (FID)
FID may also be referred to as Network Intent Cluster (NIC). FID and NIC may be used interchangeably throughout. A large network can have millions of NIs, and it may be time-consuming to add these NIs manually. The FID or NIC system can discover and create these NIs automatically based on a template file. A Feature Intent Template (FIT) may declare the network management resources that can be created and run based on device feature match. The template's main contents may be stored as a text file whose format complies with the YAML standard. Using the config line pattern, various network technologies can be decoded from device configuration files, the devices of interest can be matched exactly, and the key parameters in the line pattern can be stored for further use. This significantly helps identify the devices running certain network technologies (BGP, QoS, Multicasting, etc.) across the entire network. Further, the system creates the related NIs and defines the running methods (scheduled run or triggered by flash probes).
Network Intent is device-based automation for end-users to define and use. It can be defined with deep automation analysis logic applicable to any scenario. To scale to other devices with similar intents, engineers build the intent-based automation device-by-device and intent-by-intent. It may be time-consuming to build intent-based automation for a large network with complex technologies. A Feature Intent Template or NIC template may be used for automation and may include:
The Feature Intent Template (FIT), defined inside a YAML-format Feature Intent Definition file (FID file), is a set of automation technologies for defining NetBrain automation across the entire network. Using the config line pattern, you can decode various network technologies from device configuration files, exactly match the devices you are interested in, and store the key parameters in the line pattern for further use. This significantly helps you identify the devices running certain network technologies (BGP, QoS, Multicasting, etc.) across your entire network, create the related automation resources in the system (Network Intent for BGP design, Flash Probe for BGP flapping check, etc.), and further define the execution methods (triggered run by the system or interactively run by users). In one embodiment, the purpose of the feature intent template is to decode network features and build/install automation across the entire network to support the reference workflow.
The Feature Intent Template (FIT) definition includes two parts:
Network troubleshooting may require a deep understanding of the different network technologies configured on each device, such as HSRP, QoS, or BGP. The knowledge and automation needed for further troubleshooting differ based on network features, so the first step in automating the creation of the automation assets required for troubleshooting is to understand network features. The line pattern concept is used to find the devices that match a specific feature from the device configuration files. One simple example is to find whether the BGP routing protocol is configured on a Cisco IOS device by searching for config lines. Each line may include two types of data: network keywords, which do not change, and variables. If we take the first line as an example, “router” and “bgp” are network keywords, while “2” is a variable. As different routers may configure different routing processes, we need to combine the keyword with the variable to determine whether BGP is configured for a device. By combining keywords and variables into a single line, we have created a unique line pattern that serves as the feature decode unit. In NetBrain's implementation, the variable is represented by $<variable type>:<variable name>.
The configuration for a specific network technology differs in various embodiments. To use the line pattern to find the match for the specific feature while matching the configuration file line as much as possible, you can use the needed line (which may also be referred to as a must-have line in some embodiments) and optional line concept to tag your line pattern. Let's take the following configuration file snippet as an example:
To find the device with the HSRP configured and match the configlet as much as possible, we can define the following lines as needed lines:
The needed lines are the key line patterns that identify whether the device indeed has the HSRP configured. At the same time, you may or may not have the priority field configured by the standby group, so in this case, you'll need to make the following line an optional line: standby 1 priority 150.
To specify whether a line is a needed line or an optional line, you can use M or O as a flag ahead of the line patterns. Putting them together, you'll have the following line pattern you can use to match devices.
Devices that include all the needed lines sequentially will be recognized as a match, so using the optional line here can help you match devices with or without priority defined for the standby group. If you need to match devices with priority explicitly defined, you can make the last line a needed line. Since the default behavior of the line property is a needed line, you can leave the needed lines untagged, and the system will recognize the line as the needed line. The following pattern means the first three lines are needed lines while only the last one is the optional line:
The configuration must match the line pattern definition sequentially for the line pattern definition to be identified as a match. If any line of the configurations does not match the line pattern defined, it will not be recognized as a match. The following modified configlet is not recognized as a match for the line pattern we just defined as the lines cannot be matched by exact order.
Because line patterns are matched exactly and in order, you will sometimes need to match repeated occurrences of a line pattern to find all the matching config lines. To support this, the group concept is introduced: several lines can be grouped into a single matching unit. The line pattern we just defined can be treated as a single group, and we can give it a simple group name, group1, to identify it:
By grouping these line patterns, you can find all interfaces with HSRP configured within configuration files and extract them. Another reason to divide your line patterns into different groups is to use each group as a unit to match separately. A simple example is finding the OSPF configuration in the configuration files of Cisco devices while also finding all interfaces with OSPF configured. The line pattern will look something like the following:
Group1:
As the previous rule states, if you put all these lines into a single group, the system will look for the configuration lines that match the whole group and then look for the next match. So, a configuration file with multiple OSPF interfaces configured may be matched only once. To support this case, you can use the group logic to divide the line pattern into different OSPF groups as below:
Group1:
Group2:
By dividing the line patterns into two groups, the system searches for each group's exact match separately. Each interface in a configuration file that includes multiple interfaces can match the group1 definition, while the global OSPF configuration is matched by group2. Please note that the sequence of the groups does not matter: if the defined pattern starts with group1 and then group2, while the real configuration file starts with group2 and then group1, the device will still be recognized as a match.
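The group behavior can be illustrated with a short Python sketch. The patterns in `OSPF_GROUPS`, the function names, and the rule that a device matches only when every group matches at least once are assumptions made for this example; each group is searched independently, so the order of the groups in the configuration file is irrelevant.

```python
import re

def group_matches(group_patterns, lines):
    """Return start indexes where all patterns of one group match consecutive lines."""
    rxs = [re.compile(p) for p in group_patterns]
    hits = []
    for start in range(len(lines) - len(rxs) + 1):
        window = lines[start:start + len(rxs)]
        if all(rx.fullmatch(line.strip()) for rx, line in zip(rxs, window)):
            hits.append(start)
    return hits

def match_device(groups, config):
    """Search every group separately; assume a device matches if each group hits."""
    lines = config.splitlines()
    result = {name: group_matches(pats, lines) for name, pats in groups.items()}
    return all(result.values()), result

# Hypothetical OSPF patterns: group1 is per interface, group2 is global.
OSPF_GROUPS = {
    "group1": [r"interface \S+", r"ip address \S+ \S+", r"ip ospf \d+ area \S+"],
    "group2": [r"router ospf \d+", r"network \S+ \S+ area \S+"],
}
```

In this sketch a configuration with two OSPF interfaces yields two group1 hits and one group2 hit, even when the `router ospf` block appears before the interfaces.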
Device feature decoding through configuration files provides a powerful way to figure out network features from your network devices. But that requires massive calculations across all devices. In some cases, you may need lightweight methods to find devices quickly, so you can use the device properties that are already displayed in network management, or use the regex as a qualification to achieve this, as explained below:
The qualification section allows you to use all device GDR properties to filter the related devices. The regex section allows you to define one or more conditions to match related devices. Mregex is supported here. Using the qualification and regex rule as preliminary filters can significantly improve the accuracy and performance. In some use cases, you may only need to define the qualification and regex match without using the config line pattern for feature decoding, and that is fine. Still, you'll need to make sure you have at least one of the three matching methods defined for the system to match devices and execute properly.
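As a rough sketch of how the three matching methods could be layered, the Python below runs the cheap property check first, then the regex conditions, and only then an optional line-pattern decode. The device record layout, the `select_devices` name, and the choice that all regex conditions must match are assumptions made for illustration.

```python
import re

def select_devices(devices, qualification=None, regexes=None, line_matcher=None):
    """Return names of devices that pass every matching method that is defined."""
    if not (qualification or regexes or line_matcher):
        raise ValueError("at least one matching method must be defined")
    compiled = [re.compile(r, re.MULTILINE) for r in (regexes or [])]
    selected = []
    for dev in devices:
        props = dev["properties"]
        if any(props.get(k) != v for k, v in (qualification or {}).items()):
            continue    # cheapest check: device property filter
        if not all(rx.search(dev["config"]) for rx in compiled):
            continue    # next: plain regex conditions on the config text
        if line_matcher and not line_matcher(dev["config"]):
            continue    # most expensive: full line-pattern decode
        selected.append(dev["name"])
    return selected

# Hypothetical inventory used only to exercise the sketch.
devices = [
    {"name": "R1", "properties": {"vendor": "Cisco"},
     "config": "interface Gi0/1\n standby 1 ip 10.1.1.1\n"},
    {"name": "R2", "properties": {"vendor": "Cisco"},
     "config": "interface Gi0/1\n"},
    {"name": "J1", "properties": {"vendor": "Juniper"},
     "config": "standby 1 ip 10.1.1.1\n"},
]
hsrp_devices = select_devices(devices, {"vendor": "Cisco"}, [r"^\s*standby \d+ ip "])
```

Ordering the checks from cheapest to most expensive is what lets the preliminary filters reduce the number of configuration files that need a full decode.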
The previous section explains the feature decode basics and how you can use the line patterns to match the configlet from configuration files. This section will explain how you can further divide the feature intent into sub-feature intent (SubFI for short) and generate default network intent by using the sub-feature intent. Network Intent can include very complex automation logic defining how to check the desired status. It can also include only the basic configlets and CLI commands without automation logic, which is what we mean by the default Network Intent. The decoded configuration files can be used to populate the configlet displayed in the network pane and the configlet of the network intent detail pane if there's no automation logic defined for the network intent.
You can also define the CLI commands to be used for feature verification, and this CLI command will be passed to the network intent CLI command when the default network intent is created based on feature intent.
Besides the general CLI commands without parameters, CLI commands with parameters can reference variables from the line patterns. In the above example, we use the show standby interface {$intName1} command, which means that we use the line pattern to match the interfaces with the HSRP configuration in the configuration files, and then we check the HSRP status only for these interfaces. By specifying parameters taken from the line patterns, we can significantly improve the CLI command accuracy.
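The parameter substitution can be sketched as follows. The `{$intName1}` placeholder and the command text come from the example above; the regular expression, the sample configuration, and the `expand_cli` helper are hypothetical and only show one way the captured variable could be filled in.

```python
import re

PLACEHOLDER = re.compile(r"\{\$(\w+)\}")

def expand_cli(template, instances):
    """Fill {$name} placeholders once per matched instance, dropping duplicates."""
    commands = []
    for variables in instances:
        cmd = PLACEHOLDER.sub(lambda m: variables[m.group(1)], template)
        if cmd not in commands:
            commands.append(cmd)
    return commands

# Hypothetical line pattern capturing the interface name of HSRP interfaces only.
hsrp_rx = re.compile(r"interface (?P<intName1>\S+)\n standby \d+ ip \S+")
config = (
    "interface Gi0/1\n standby 1 ip 10.1.1.1\n"
    "interface Gi0/2\n standby 2 ip 10.1.2.1\n"
    "interface Gi0/3\n ip address 10.1.3.1 255.255.255.0\n"
)
instances = [m.groupdict() for m in hsrp_rx.finditer(config)]
commands = expand_cli("show standby interface {$intName1}", instances)
```

Here only Gi0/1 and Gi0/2 produce a command, because Gi0/3 has no HSRP configuration to match.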
Feature Intent stands for all configuration lines matched for line patterns. Often you could match many repetitive patterns and want to divide the Feature Intent into sub Feature Intent for further network intent creation. Let's take a simple example of the line patterns we created for the HSRP feature:
The above pattern is the HSRP feature pattern to match devices that have HSRP configured on their interfaces. Still, one interface may have multiple HSRP groups configured, each with its ip address and priority. The following example shows a configuration file with two HSRP groups configured on a single interface, and we need to split the groups into two different network intents.
To divide different HSRP groups into different sub Feature Intents and then create network intents for specific HSRP groups, we can split the feature intent into SubFIs using the split keys parameter in YAML. A line pattern can match multiple instances in the configuration file, so some line pattern variables may have multiple possible values. Listing a variable as a split key ensures that the variable has only one value in each SubFI. In the above sample, since we want to split the feature intent by group names, we can specify the split_keys as follows:
By defining the split_keys, assuming we only have this interface with the HSRP configured, the subFIs are:
The previous example only contains one group in the pattern field. In case you have multiple groups in the pattern, and you want to group them, you will need to have the relation defined. The relation is used to filter and keep the SubFI matching the relation definition. The only function you can use is equals($var1, $var2) which means they should be the same.
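Both ideas, splitting by key and filtering by relation, can be sketched in a few lines of Python. The variable names (`group`, `pid1`, `pid2`) and the record layout are hypothetical; only the `split_keys` concept and the `equals($var1, $var2)` relation come from the description above.

```python
from collections import defaultdict
from itertools import product

def split_by_keys(instances, split_keys):
    """Group matched instances so each SubFI holds one value per split key."""
    subfis = defaultdict(list)
    for inst in instances:
        subfis[tuple(inst[k] for k in split_keys)].append(inst)
    return dict(subfis)

def filter_by_relation(group_a, group_b, var_a, var_b):
    """Keep only cross-group pairs for which equals($var_a, $var_b) holds."""
    return [(a, b) for a, b in product(group_a, group_b) if a[var_a] == b[var_b]]

# Two HSRP groups on one interface, split by the (assumed) variable "group".
hsrp_instances = [
    {"intName1": "Gi0/1", "group": "1", "line": "standby 1 ip 10.1.1.1"},
    {"intName1": "Gi0/1", "group": "1", "line": "standby 1 priority 150"},
    {"intName1": "Gi0/1", "group": "2", "line": "standby 2 ip 10.1.2.1"},
]
subfis = split_by_keys(hsrp_instances, ["group"])

# Two pattern groups related by an equal OSPF process ID.
interfaces = [{"intf": "Gi0/1", "pid1": "1"}, {"intf": "Gi0/2", "pid1": "2"}]
processes = [{"pid2": "1"}]
related = filter_by_relation(interfaces, processes, "pid1", "pid2")
```

In the sketch, the split produces one SubFI per HSRP group, and the relation keeps only the interface whose process ID equals that of the global OSPF process.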
By default, if you use multiple groups or define the split_keys, NetBrain will generate multiple SubFIs according to your definition. However, in some cases, even if you find all related configlets, you still want to generate a single Feature Intent instead of multiple subFIs. In this case, you can use the generate_one_FI_groups flag. You can define whether you want to create one instance for single or multiple groups. If you want all groups to be generated as a single Feature Intent, list all group names here so the system will generate only one FI here.
Once we have generated the FI and SubFIs for multiple devices, we need to group them to generate the FI group. An FI group contains a couple of devices with a network relationship. The following are two examples of FI groups:
To generate FI groups across multiple devices, we must find unique characteristics for these devices. From the networking perspective, the above examples can be explained by:
The unique characteristics that each device contributes to generating an FI group are denoted by “Eigen” variables, identified with the following statements:
There are different ways to define the Eigen variable expression:
You can define one or more Eigen variables for device clustering. One key Eigen variable is used for cross-device grouping, and the others are used for complementary verification. The qualification field is used to further filter out unwanted SubFIs based on Eigen variables. In this case, we only want to generate an FI group if the devices are within the same site, and we don't want to generate an FI group for devices that have not been allocated to a site, since that may introduce inaccuracy. We can use $site as the qualification to filter out devices that don't belong to any site.
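A minimal Python sketch of this clustering step is shown below. It assumes that each SubFI carries a dictionary of Eigen values, that the HSRP virtual IP (here `vip`) is the key Eigen variable, and that groups are scoped by the qualification variables so they never span sites; these names and choices are illustrative assumptions, not the system's definition.

```python
from collections import defaultdict

def build_fi_groups(subfis, key_eigen, qualification):
    """Cluster SubFIs from different devices that share the same key Eigen value."""
    groups = defaultdict(list)
    for subfi in subfis:
        eigen = subfi["eigen"]
        if not all(eigen.get(name) for name in qualification):
            continue    # e.g. drop devices that are not assigned to any site
        # Scope the key by the qualification values so a group stays in one site.
        scope = tuple(eigen[name] for name in qualification)
        groups[(scope, eigen[key_eigen])].append(subfi["device"])
    return dict(groups)

# Hypothetical SubFIs: SW1 and SW2 share an HSRP virtual IP in the same site.
subfis = [
    {"device": "SW1", "eigen": {"vip": "10.1.1.1", "site": "NYC"}},
    {"device": "SW2", "eigen": {"vip": "10.1.1.1", "site": "NYC"}},
    {"device": "SW3", "eigen": {"vip": "10.1.1.1", "site": ""}},
    {"device": "SW4", "eigen": {"vip": "10.2.2.1", "site": "BOS"}},
]
fi_groups = build_fi_groups(subfis, "vip", ["site"])
```

SW3 is excluded because it has no site, which mirrors the use of $site as a qualification.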
The last keyword, group_type, defines the method to group devices into the same FI group, and there are two types:
Once we have the SubFIs created based on Eigen variables, we can further convert the FI groups into Network Intents. There are two ways to convert FI group into network intents:
To generate default network intent, we need to define the related network intent contents:
The path field specifies where you want to put the newly generated default network intents. To make each network intent unique, we attach the CrossRelation field to the network intent name. The conflict_mode section defines the behavior if a network intent with the same name already exists. In this case, you can either overwrite the existing network intent using an override flag or skip it. As a network intent can be locked to prevent others from modifying it, you can set this field accordingly. Please note that once you set this field to true, the feature intent template can no longer update the network intent because it is locked. This setting takes priority over the conflict_mode field, so the network intent will not be updated in any case. The create_default_NI field specifies which type of network intent to generate. In this section, we'll set this field to true to generate default network intent without automation logic. Cli_baseline_update_type: This field specifies how you would like to set the CLI baseline data. There are two ways to add the CLI command output to the network intent:
Network Intent can be created from an NI template. This is the creation of Network Intent with automation logic. As the network intent is device-based automation, a user must select specific devices and then define the network intent automation. So, if you have many devices with similar network technology and need to define similar automation logic, it is tedious to manually replicate the logic to all other devices. FIT can find devices automatically and apply the automation analysis logic to the new network intents. A network intent template may be the same as a network intent: if the template variables are set up correctly, you can use any network intent as a network intent template. To set up the network intent template variables, you can open the edit mode of any network intent and click on the Define Template Variables hyperlink to open the Define Template Variables window. All devices defined in this network intent may be listed along with the CLI commands. There are at least two different ways to create new network intent:
The exact device count match is used when the network technology requires every network intent to have the same device count, for example when the analysis logic needs to differentiate between the devices of the network intent. HSRP is an example of this case because it requires two devices: the active device and the standby device. In the network intent's automation check logic, you may define different checking logic depending on the device's status (active or standby). Using this network intent as a template to duplicate the automation logic to other devices may require that the device count in the new network intent be the same.
The adjustable device count match does not require an exact device count match and can be used when devices are grouped for a network technology that doesn't require the same device count. A simple example is an IPsec design for WAN connections, where the network intent covers the sites connected through IPsec tunnels and consists of several devices. You can define the universal checking logic for all these devices, and if you want to create network intents for other WAN connections, the device count can be different, but the automation check logic can be re-used.
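The difference between the two modes can be reduced to a small check, sketched here in Python. The mode names `exact` and `adjustable` and the function `can_reuse_template` are labels chosen for this illustration only.

```python
def can_reuse_template(template_devices, fi_group_devices, mode):
    """Decide whether an FI group can be instantiated from an NI template.

    mode 'exact': the FI group must have as many devices as the template.
    mode 'adjustable': any non-empty FI group is accepted.
    """
    if mode == "exact":
        return len(fi_group_devices) == len(template_devices)
    if mode == "adjustable":
        return len(fi_group_devices) > 0
    raise ValueError(f"unknown mode: {mode}")

# HSRP template with an active and a standby device (hypothetical names).
hsrp_template = ["active-sw", "standby-sw"]
# IPsec WAN template built from three sites.
ipsec_template = ["hq-rtr", "branch1-rtr", "branch2-rtr"]
```

An HSRP pair fits the exact mode, while an IPsec design with a different number of sites can still reuse the adjustable template.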
The next step may be to define how devices and related show commands are replaced by the new FI group. The replaced parameters may be the following:
With the Network Intent Template Parameters defined, a user can further define logic to map the feature intent template's variable. After defining the template variables, this Network Intent can be used as a template to generate further network intent, which can be shared/exported.
Adaptive Monitoring Automation
The Adaptive Monitoring Automation is a backend automation system that runs hundreds of thousands of automation tasks without human intervention. The system may utilize a Flash Probe. A Flash Probe defines an entity that performs network anomaly detection on one or more devices. In one example, a flash probe runs on a single device. For example, to detect whether a single device R1 has a high CPU, you can define a Flash Probe (Alert Name) as CPU High. The alert generated when a Flash Probe detects an anomaly is a Flash Alert. If a flash alert occurs in a network device, the system further runs the drill-down automation (Network Intent) to identify the potential root cause. The flash probe polls the live network device and discovers any anomaly. The system can also integrate with other 3rd party monitoring systems instead of directly polling the live network data. The Adaptive Automation System may include:
By executing device-level automation (triggered by the flash probe), the system may handle massive automation resources. The system scalability, as a result, can be enhanced. Since the automation is executed in the backend in a fully automatic manner, users can create their automation resources and upload them to the backend system. The automation results can be viewed across the entire company via the Preventive Automation Dashboard or the monitoring data view. An alert message may be displayed in the Preventive Automation Dashboard and sent via email for notification purposes. In one embodiment, a user can type $ to reference the variables defined in the alert message.
Referring back to
For Adaptive Monitoring (AM) guidance, network problems (i.e., symptoms) should be tracked by a primary flash probe (e.g., a single device, primarily SNMP data, with a few basic CLI data, e.g., show interface). There may be transient problems, such as: 1) an Interface Performance Issue: link utilization spike, interface flapping; 2) a device performance issue (e.g., CPU/MEM spike); 3) a route table entry anomaly; 4) a firewall failover or a device configuration change. After the primary flash probe detects a symptom, it will trigger the secondary flash probe to further detect the network problem, such as with a single device with CLI data or by a more specific alert based on the primary alert (e.g., BGP, OSPF Neighbor Check, BGP route table check, BGP config check). The network problem may be further tracked by Network Intents such as an HSRP check, QOS check, and BGP route reflector design check.
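The escalation from primary probe to secondary probe to Network Intent can be sketched as follows. The dictionary layout, the `run_cycle` function, and the thresholds are hypothetical, and the sketch assumes that the Network Intents linked to a probe run whenever that probe raises an alert.

```python
def run_cycle(data, primary, secondaries, intents_by_probe):
    """Run one polling cycle and return the ordered trail of alerts and intents."""
    trail = []
    if not primary["check"](data):
        return trail                    # no symptom: nothing else is triggered
    trail.append(("alert", primary["name"]))
    trail += [("intent", ni) for ni in intents_by_probe.get(primary["name"], [])]
    for probe in secondaries:
        # A secondary probe runs only when one of its primary probes has alerted.
        if primary["name"] in probe["triggered_by"] and probe["check"](data):
            trail.append(("alert", probe["name"]))
            trail += [("intent", ni) for ni in intents_by_probe.get(probe["name"], [])]
    return trail

# Hypothetical probes: a CPU symptom escalates to a BGP neighbor check.
primary = {"name": "CPU High", "check": lambda d: d["cpu"] > 90}
secondary = {
    "name": "BGP Neighbor Check",
    "triggered_by": ["CPU High"],
    "check": lambda d: d["bgp_neighbors_down"] > 0,
}
intents = {"BGP Neighbor Check": ["BGP route reflector design check"]}
```

Because the secondary probe and the intents run only after a symptom is detected, most polling cycles cost a single lightweight check per device.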
As shown in
Adaptive monitoring results can be viewed from a Preventive Automation (PA) Dashboard or a monitoring data view, as shown in
The primary flash probe is defined with basic info, such as the name, display name, and description. The Device/Interface level selection can also be defined, such as specifying whether this flash probe detects device level anomalies (CPU high, BGP neighbor change, etc.) or interface level anomalies (interface flapping, interface traffic usage high, etc.). Variables that are selected can be used for defining alert rules. This may include selecting the parser variables. The user can also use the compound variable computation to create complex variables for alert definition. Alert rules are defined for the condition to trigger an alert and the alert message for the flash probe. The flash probe is enabled by default once defined, and it will be executed based on the current device's primary frequency. To adjust the flash probe frequency, the user can click its frequency settings and modify them accordingly.
The primary flash probe can be applied to other devices. The system will check whether the selected devices are valid for the application according to the logic below:
The primary flash probe can be enabled/disabled on other devices. When a flash probe is enabled, it will trigger respective tasks to retrieve data and perform an error check periodically. If you want to enable or disable the flash probe for multiple devices, right-click a flash probe and choose from the following two options:
The functions, features, and properties of the primary flash probes may also apply to the secondary flash probes. In some embodiments, the secondary flash probes may only be triggered by the primary flash probe and cannot be run periodically. To specify the desired primary flash probe(s) to trigger the secondary flash probe, a user can select one or more primary flash probes from the Triggered By section in Secondary Flash Probe Details. Like the primary flash probe, the user can apply the secondary flash probe to other devices. Since the secondary flash probe needs to be triggered by the primary flash probe, the system will check whether the target devices have the corresponding primary flash probe that triggers the secondary flash probe. If not, the system will first apply the primary flash probe to those devices and then apply the secondary flash probe.
There may be at least three types of flash probes. The Primary Probe can be polled with a particular frequency, such as an alert-based Flash Probe where an anomaly generated by devices triggers the probe, or a timer-based Flash Probe where the probe can be triggered by a timer and can be used for further scheduled CLI and NI tasks. A secondary probe can only be triggered by primary probes. An external probe is used for integration with other monitoring systems. The alert generated by 3rd party systems can implicitly generate external flash probes.
Flash probes can be set at different levels to capture different types of anomalies. For example, at the device level for an anomaly that is related to specific devices and not specific interfaces. Device-level flash probes may include CPU high, device config change, BGP neighbor flapping, etc. In another example, the anomaly may be at the interface level when the anomaly is related to specific interfaces. Interface level flash probes may include interface flapping, interface error increase, etc. The flash probe can be configured at multiple interfaces of the same device. In this embodiment, the system will check the selected interfaces one by one to determine whether an anomaly exists. If an anomaly exists in any of the selected interfaces, the flash probe's result will generate an alert and trigger the respective NIs.
Parser variables can be added. When a user adds parser variables to the target device, the system will use the target device type as the filter and list only applicable parsers for the user to select. Multiple parser variables can be selected for the further alert check. The variables may differ based on the flash probe level. For example, only device variables will be available to select if the flash probe is set at the device level. If the flash probe is set at the interface level, both device level and interface level variables will be available to select. Even if the plan is to check the interface level anomaly, there may still be a device level variable as its condition.
A compound variable may be added. Compound variables may be designed to perform bulk operations on multiple parser variables or use function calls to retrieve certain values. For example, CRC_Increase_Count=$crc-GetLastValue($crc). In this example, a compound variable is used to get the CRC error increase count. In another example: BGP_Neighbor_Change_Count=abs(GetTableRowCount($bgp_nbrs)-GetLastRowCount($bgp_nbrs)). In this example, the statement above can get the BGP neighbor change count compared to last time's data retrieval.
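The two compound variables above can be mimicked in Python to show the arithmetic. The `ParserVariable` class is a hypothetical stand-in for the system's stored history; `GetLastValue`, `GetTableRowCount`, and `GetLastRowCount` are represented by attribute access and `len` rather than by any real API.

```python
class ParserVariable:
    """Holds the current and the previously retrieved value of one parser variable."""

    def __init__(self):
        self.current = None
        self.last = None

    def update(self, value):
        """Record a new retrieval, keeping the previous one as the last value."""
        self.last, self.current = self.current, value

def crc_increase_count(crc):
    """CRC_Increase_Count = $crc - GetLastValue($crc)."""
    return crc.current - crc.last

def bgp_neighbor_change_count(bgp_nbrs):
    """abs(GetTableRowCount($bgp_nbrs) - GetLastRowCount($bgp_nbrs))."""
    return abs(len(bgp_nbrs.current) - len(bgp_nbrs.last))

# Two consecutive retrievals of each variable (sample values).
crc = ParserVariable()
crc.update(10)
crc.update(17)

bgp_nbrs = ParserVariable()
bgp_nbrs.update([{"peer": "10.0.0.1"}, {"peer": "10.0.0.2"}, {"peer": "10.0.0.3"}])
bgp_nbrs.update([{"peer": "10.0.0.1"}, {"peer": "10.0.0.2"}])
```

With these sample retrievals, the CRC counter increased by 7 and the BGP neighbor table changed by one row.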
The alert definition may define the condition to create the alert. The following example operations may be supported to build the condition: Equals to, Does not equal to, Is none, Is not none, Greater than, Less than, Greater than or equals to, Less than or equals to, or Range. To compare the current value of a parser variable with its previously retrieved value, a user can select the desired parser variable and use the keyword LastValue as the comparison object. In one embodiment, an entire table can be set as the baseline for the alert check. The loop table rows function may be designed to check the values of a specific column for granular control purposes. The user first selects at least one table and then selects the desired column for looping over the table rows. The system will loop through each row to check whether the defined alert rule is matched. If any row matches the alert definition, an alert will be triggered, and the system will stop checking more rows for performance considerations.
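The condition operators and the early-exit row loop can be sketched in Python. The operator names are taken from the list above; the mapping to Python functions, the inclusive treatment of Range, and the `first_alerting_row` helper are assumptions made for this illustration.

```python
import operator

OPERATORS = {
    "Equals to": operator.eq,
    "Does not equal to": operator.ne,
    "Greater than": operator.gt,
    "Less than": operator.lt,
    "Greater than or equals to": operator.ge,
    "Less than or equals to": operator.le,
    "Is none": lambda value, _: value is None,
    "Is not none": lambda value, _: value is not None,
    # Range is assumed to be inclusive on both ends.
    "Range": lambda value, bounds: bounds[0] <= value <= bounds[1],
}

def first_alerting_row(rows, column, op_name, operand):
    """Loop over table rows and return the first row that satisfies the rule."""
    test = OPERATORS[op_name]
    for row in rows:
        if test(row[column], operand):
            return row      # stop early: one matching row is enough to alert
    return None

# Sample interface table with a CRC error column.
rows = [
    {"intf": "Gi0/1", "crc": 0},
    {"intf": "Gi0/2", "crc": 12},
    {"intf": "Gi0/3", "crc": 40},
]
hit = first_alerting_row(rows, "crc", "Greater than", 10)
```

A LastValue comparison fits the same table of operators: the previously retrieved value is simply passed as the operand, for example `OPERATORS["Does not equal to"](current, last)`.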
Variables are then monitored. To optimize the performance, the system offers users the ability to select the parser variables they deem critical to their intended usage (instead of unselectively storing all historical data). Specific monitoring variables can be selected, so only parser variables' most critical historical data will be stored in the database and later be visualized in the monitoring data view. An alert message may be displayed in the Preventive Automation Dashboard and sent via email for notification purposes. In one embodiment, a user can type $ to reference the variables defined in the alert message field's alert rule statement.
In some embodiments, there may be built-in flash probes. Examples are shown in the following table:
The built-in flash probe examples include the configuration change, which polls the configuration and generates an alert if there's any change. The SNMP Unreachable generates alerts if a device cannot be accessed via SNMP. CLI Unreachable generates alerts if the device cannot be accessed via CLI.
There may be an application programming interface triggering a flash alert that uses the existing APM/monitoring/logging system to trigger NI analysis. In one example, this may complement monitored data with high-frequency SNMP data while leveraging a CLI parser variable data for low-frequency monitoring. There may be a correlation between all monitoring alerts on a map.
There may be the installation of the automation.
The trigger rule can be defined for how the system executes automation and has the following options:
There may be a Preventive Automation (PA) dashboard and/or execution/decision tree. The PA Dashboard provides an overview of the network health status and statistics for the entire or partial network. Also, the PA dashboard offers the ability to further drill down to any device to view its alert and execution details. The PA dashboard may be a display for adaptive monitoring and may include a decision tree. The PA dashboard includes four components: PA dashboard summary, alert distribution, execution tree and alert history of probe and NI.
The PA dashboard summary shows PA statistics: the number of devices, the number of probes, the number of triggered Network Intents, and the number of devices with no alerts, probe alerts, and intent alerts. In addition, users can customize the device scope (the whole network, a site, a device group, or the devices of the current map) and the time range. The alert distribution shows the total number of probes with alerts and NIs with alerts. In addition, users can select a specific device to view its execution details from the following two categories: devices with Network Intent alerts and devices with probe alerts. The execution tree or decision tree shows the detailed results of probes and triggered Network Intents for a specified device. The results are displayed with different color codes to highlight the network parameters in abnormal states. The alert history of probes and NI shows all historical alert results. In addition, users can view all alerts generated by a probe or a Network Intent.
The PA dashboard may be customizable. By default, the PA dashboard demonstrates the alert results for all domain devices. Users can specify the device scope by the following filter conditions:
Clicking each alert type in a pie chart, the corresponding device info will be visualized in an alert distribution table. A user can create a default PA dashboard view by defining the default network and period.
PA Dashboard results can be viewed or displayed in a map. After creating or opening a map, a user can set the device scope of the PA dashboard to the current map and view the alert distribution for all devices on this map. From the alert distribution table, the user can select a device to pin the execution tree and the map side by side. Then a user can add an NI into the current runbook to execute the NI interactively as in
The system and process described above may be encoded in a signal bearing medium, a computer readable medium such as a memory, programmed within a device such as one or more integrated circuits, one or more processors or processed by a controller or a computer. That data may be analyzed in a computer system and used to generate a spectrum. If the methods are performed by software, the software may reside in a memory resident to or interfaced to a storage device, synchronizer, a communication interface, or non-volatile or volatile memory in communication with a transmitter, that is, a circuit or electronic device designed to send data to another location. The memory may include an ordered listing of executable instructions for implementing logical functions. A logical function or any system element described may be implemented through optic circuitry, digital circuitry, through source code, through analog circuitry, through an analog source such as an analog electrical, audio, or video signal or a combination. The software may be embodied in any computer-readable or signal-bearing medium, for use by, or in connection with an instruction executable system, apparatus, or device. Such a system may include a computer-based system, a processor-containing system, or another system that may selectively fetch instructions from an instruction executable system, apparatus, or device that may also execute instructions.
A “computer-readable medium,” “machine readable medium,” “propagated-signal” medium, and/or “signal-bearing medium” may comprise any device that includes, stores, communicates, propagates, or transports software for use by or in connection with an instruction executable system, apparatus, or device. The machine-readable medium may selectively be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. A non-exhaustive list of examples of a machine-readable medium would include: an electrical connection “electronic” having one or more wires, a portable magnetic or optical disk, a volatile memory such as a Random Access Memory “RAM”, a Read-Only Memory “ROM”, an Erasable Programmable Read-Only Memory (EPROM or Flash memory), or an optical fiber. A machine-readable medium may also include a tangible medium upon which software is printed, as the software may be electronically stored as an image or in another format (e.g., through an optical scan), then compiled, and/or interpreted or otherwise processed. The processed medium may then be stored in a computer and/or machine memory.
The illustrations of the embodiments described herein are intended to provide a general understanding of the structure of the various embodiments. The illustrations are not intended to serve as a complete description of all of the elements and features of apparatus and systems that utilize the structures or methods described herein. Many other embodiments may be apparent to those of skill in the art upon reviewing the disclosure. Other embodiments may be utilized and derived from the disclosure, such that structural and logical substitutions and changes may be made without departing from the scope of the disclosure. Additionally, the illustrations are merely representational and may not be drawn to scale. Certain proportions within the illustrations may be exaggerated, while other proportions may be minimized. Accordingly, the disclosure and the figures are to be regarded as illustrative rather than restrictive.
One or more embodiments of the disclosure may be referred to herein, individually and/or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any particular invention or inventive concept. Moreover, although specific embodiments have been illustrated and described herein, it should be appreciated that any subsequent arrangement designed to achieve the same or similar purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all subsequent adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the description.
The phrase “coupled with” is defined to mean directly connected to or indirectly connected through one or more intermediate components. Such intermediate components may include both hardware and software based components. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional, different or fewer components may be provided.
The above disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover all such modifications, enhancements, and other embodiments, which fall within the true spirit and scope of the present invention. Thus, to the maximum extent allowed by law, the scope of the present invention is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description. While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the invention. Accordingly, the invention is not to be restricted except in light of the attached claims and their equivalents.
This application claims priority to Provisional Patent Application No. 63/179,782, filed on Apr. 26, 2021, entitled INTENT-BASED NETWORK AUTOMATION, and claims priority to Provisional Patent Application No. 63/311,679, filed on Feb. 18, 2022, entitled PROBLEM DIAGNOSIS AUTOMATION SYSTEM (PDAS) INCLUDING NETWORK INTENT CLUSTER (NIC), TRIGGERED DIAGNOSIS, AND PERSONAL MAP, the entire disclosures of both of which are herein incorporated by reference.
Number | Date | Country
---|---|---
63179782 | Apr 2021 | US
63311679 | Feb 2022 | US