In the modern computer age, businesses rely on electronic networks to function properly. Computer network management and troubleshooting are complex. There are thousands of shell scripts and applications for different network problems. The available but poorly documented solutions can be overwhelming for junior network engineers. Most network engineers learn troubleshooting by reading the manufacturer's manual or internal documentation from the company's documentation department, but the effectiveness varies. For instance, the troubleshooting knowledge captured in a document can only be helpful if the information is accurate and the user correctly identifies the problem. Many companies have to conduct extensive training for junior engineers. The conventional way of network troubleshooting requires a network professional to manually run a set of standard commands and processes for each device. However, becoming familiar with those commands, along with each of their parameters, takes years of practice. Also, complicated troubleshooting methodology is often hard to share and transfer. Therefore, even though a similar network problem happens repeatedly, each troubleshooting instance may still have to start from scratch. Meanwhile, networks are becoming more complex, and it is increasingly difficult to manage them efficiently with traditional methods and tools.
Network management teams provide two functions: delivering services required by the business and minimizing downtime. The first function may be dominated by projects, such as data centers, cloud migration, or implementing quality of service (QoS) for a voice or video service. The second function, minimizing downtime, may be more critical in impacting a company's revenue and reputation. Minimizing downtime can include preventing outages from happening and resolving outages as soon as possible. Two measurements for an outage may include Mean Time Between Failures (MTBF) and Mean Time to Repair (MTTR).
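By way of a non-limiting illustration, the two measurements named above may be computed from a list of outage records as sketched below. The record layout, the function names, the sample dates, and the observation window are assumptions made only for this example; the sketch uses the common definitions of MTTR as total repair time divided by the number of failures and MTBF as total uptime divided by the number of failures.

```python
from datetime import datetime, timedelta


def total_downtime(outages):
    """Sum the duration of every (start, end) outage record."""
    return sum(((end - start) for start, end in outages), timedelta())


def mttr(outages):
    """Mean Time to Repair: average outage duration."""
    return total_downtime(outages) / len(outages)


def mtbf(outages, window_start, window_end):
    """Mean Time Between Failures: uptime in the window per failure."""
    uptime = (window_end - window_start) - total_downtime(outages)
    return uptime / len(outages)


# Hypothetical outage records for a 30-day observation window.
outages = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 11, 0)),   # 1 hour
    (datetime(2024, 1, 20, 8, 0), datetime(2024, 1, 20, 11, 0)),  # 3 hours
]
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 1, 31)

repair_time = mttr(outages)
time_between_failures = mtbf(outages, window_start, window_end)
```

Preventing outages raises MTBF, while resolving outages faster lowers MTTR, so the two values track the two halves of the downtime function separately.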
Network management may utilize new methodologies and processes to accommodate the global shift to digital technologies. Tactical, manual approaches that rely on legacy mechanisms to build, operate, and troubleshoot the network may need to improve for the network to be managed efficiently.
This disclosure generally relates to a Problem Diagnosis Automation System (PDAS) for network management automation using network intent (NI). Network intent (NI) represents a network design and baseline configuration for that network or network devices with the ability to diagnose deviation from the baseline configuration. The Problem Diagnosis Automation System (PDAS) automates the diagnosis of repetitive problems and the enforcement of preventive measures across a network. Automation assets across the network include Network Intent (NI) or Executable Runbook (RB) inside the no-code platform. Automation is executed in response to an external symptom in three successive methods, namely interactive, triggered, and preventive. Execution output is organized inside an incident pane for each incident.
A Network Intent Cluster (NIC) clones a NI across the network to create a group of NIs (member NIs) with the same design or logic. NIC may be created from a seed NI via no coding process. In PDAS, a subset of Member NIs can be automatically executed according to the user-defined condition based on the member device, the member NI tags, or signature variables.
A Triggered Automation Framework (TAF) matches the incoming API calls from a 3rd party system to current incidents and installs the automation (e.g., NI/NIC) to be triggered for each call. It may include: Integrated IT System defining the scope and data of the incoming API calls; Incident Type to match a call to an Incident; and Triggered Diagnosis to define what and how the NIC/NI is executed.
In one embodiment, a method for network management automation includes defining one or more input devices and variables; identifying one or more network intent (NI) seeds; generating member NI based on the one or more NI seeds and based on the defined one or more input devices; and triggering a network intent cluster to run for the generated member NI. The method includes classifying the one or more input devices when subject to network commands; and grouping the one or more input devices by eigen-value based on the network commands. The generating the member NI is based on the grouping. The method includes selecting the NI seed; and testing the selected NI seed against a live network, wherein the generating the member NI occurs only when the NI seed passes the testing. The defining the input devices further comprises identifying the one or more input devices based on Site, Device Group, Device, Path, or by Map. The defining comprises uploading a file with device properties. The NI seed comprises one or more devices with NI to be replicated. The member NI comprises one or more devices with the NI seed, wherein the one or more devices are from the defined one or more input devices. The generating member NI is: by map, by site, by device group, by path, by device, or by neighbor. The triggering is from an external source.
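By way of a non-limiting illustration, the generation of member NIs from an NI seed described above may be sketched as follows. The class names, the field names, and in particular the way the eigen-value is derived (here, the set of commands for which a device returned output) are assumptions made only for this example and are not taken from the disclosure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    name: str
    site: str
    command_outputs: tuple  # hypothetical: (command, returned_output) pairs


@dataclass(frozen=True)
class NetworkIntent:
    name: str
    device: str
    checks: tuple


def eigen_value(device):
    """Assumed classification key: the commands that returned output."""
    return frozenset(cmd for cmd, ok in device.command_outputs if ok)


def group_by_eigen_value(devices):
    """Group the input devices that classify the same way."""
    groups = defaultdict(list)
    for device in devices:
        groups[eigen_value(device)].append(device)
    return dict(groups)


def generate_member_nis(seed, seed_device, input_devices, seed_passed_test):
    """Replicate the seed NI onto input devices in the seed device's group."""
    if not seed_passed_test:  # members are generated only if the seed passed
        return []
    group = group_by_eigen_value(input_devices).get(eigen_value(seed_device), [])
    return [NetworkIntent(f"{seed.name}@{d.name}", d.name, seed.checks) for d in group]


# Hypothetical devices: R1 and R2 answer the BGP command, SW1 does not.
bgp_router = (("show ip bgp summary", True), ("show ip ospf neighbor", False))
r1 = Device("R1", "Boston", bgp_router)
r2 = Device("R2", "Boston", bgp_router)
sw1 = Device("SW1", "Boston", (("show ip bgp summary", False),))

seed = NetworkIntent("bgp-neighbor-up", "R1", ("show ip bgp summary",))
members = generate_member_nis(seed, r1, [r2, sw1], seed_passed_test=True)
```

In this sketch only R2 receives a member NI, because SW1 falls into a different group than the seed device.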
In another embodiment, a method for a Problem Diagnosis Automation System (PDAS) includes receiving an incident via a ticket system for a network; identifying a device and signature variables based on the incident; and triggering a network intent cluster (NIC) to create and run a member NI. The method includes reviewing a reference library for past incidents from the ticket system; and performing an automated network intent runbook analysis. The method includes performing an automated diagnosis of the problem based on the automated network intent runbook analysis; and outputting results of the automated diagnosis for troubleshooting and data sharing. The method includes classifying the input device when subject to different commands; grouping the classifying by eigen-value; and comparing, for each of the groupings, a NI for the input device with the identified NI seed. The output comprises an incident pane as a graphical user interface (GUI). The incident pane displays results from a network intent diagnosis. The incident pane displays a recommended diagnosis for the incident.
In another embodiment, a method for network intent (NI) includes cloning a NI with a Network Intent Cluster (NIC); and seeding the NI across a network to create a group of NIs based on the design for the NIC. A subset of the NIs can be automatically executed according to a user-defined condition based on a member device, member NI tags, or other signature variables. The NI includes at least one of a name, a description, a target device, a tag, a configuration, or a variable.
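By way of a non-limiting illustration, an NI carrying the elements listed above (name, description, target device, tag, configuration, variable) may be modeled as sketched below. The class layout and the line-by-line comparison against a running configuration are assumptions made only for this example; they show one simple way a deviation from a baseline configuration might be detected.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkIntent:
    name: str
    description: str
    target_device: str
    tags: set = field(default_factory=set)
    configuration: tuple = ()  # baseline lines expected on the device
    variables: dict = field(default_factory=dict)

    def deviations(self, running_config):
        """Return baseline lines that are missing from the running config."""
        present = {line.strip() for line in running_config.splitlines()}
        return [line for line in self.configuration if line.strip() not in present]


# Hypothetical NI and device configuration.
ni = NetworkIntent(
    name="management-plane-hardening",
    description="SSH v2 only, HTTP server disabled",
    target_device="R1",
    tags={"security"},
    configuration=("ip ssh version 2", "no ip http server"),
)
running = "hostname R1\nip ssh version 2\nip http server\n"
missing = ni.deviations(running)
```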
In one embodiment, a method for automating network management includes enabling a network intent (NI) or a network intent cluster (NIC) to be triggered based on input parameters for an incident; defining conditions for the triggering of the NI or the NIC; and identifying member NIs to be executed. The method includes executing the member NIs. The input parameters for an incident comprise a name, description, type, or selection. The type comprises the NI or NIC. The conditions comprise triggered conditions.
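By way of a non-limiting illustration, defining a triggered condition, identifying the member NIs to be executed, and executing them may be sketched as follows. The class names, the rule that an empty field is not checked, and the `run` callback are assumptions made only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class MemberNI:
    name: str
    device: str
    tags: frozenset = frozenset()
    signature: dict = field(default_factory=dict)


@dataclass
class TriggerCondition:
    """User-defined condition; a field left empty is not checked."""
    devices: frozenset = frozenset()
    tags: frozenset = frozenset()
    signature: dict = field(default_factory=dict)

    def matches(self, ni):
        if self.devices and ni.device not in self.devices:
            return False
        if self.tags and not (self.tags & ni.tags):
            return False
        return all(ni.signature.get(k) == v for k, v in self.signature.items())


def identify_members(members, condition):
    """Select the subset of member NIs that satisfy the condition."""
    return [ni for ni in members if condition.matches(ni)]


def execute(members, run):
    """Run each selected member NI and collect results by NI name."""
    return {ni.name: run(ni) for ni in members}


# Hypothetical member NIs of one cluster.
members = [
    MemberNI("ospf@R1", "R1", frozenset({"ospf"}), {"interface": "Gi0/1"}),
    MemberNI("ospf@R2", "R2", frozenset({"ospf"}), {"interface": "Gi0/2"}),
    MemberNI("bgp@R1", "R1", frozenset({"bgp"}), {"neighbor": "10.0.0.2"}),
]
condition = TriggerCondition(devices=frozenset({"R1"}), tags=frozenset({"ospf"}))
selected = identify_members(members, condition)
results = execute(selected, lambda ni: "executed")
```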
In another embodiment, a method for network management includes receiving an incident via a ticket system for a network; analyzing the incident; performing an automated diagnosis of the incident based on the analysis, wherein the automated diagnosis comprises implementing a Triggered Automation Framework (TAF); and outputting results of the automated diagnosis for troubleshooting and data sharing. The automated diagnosis further includes: performing a self-service diagnosis; performing an interactive automation; and performing preventative automation via a probe. The TAF includes: matching incoming application program interface (API) calls; and installing automation to be triggered for each of the API calls. The installing comprises a triggered diagnosis to define execution of a network intent (NI). The installing comprises a triggered diagnosis to define execution of a network intent cluster (NIC). The outputting results comprises an incident pane as a graphical user interface (GUI). The incident pane displays results from a network intent (NI) diagnosis. Results from the TAF are displayed on the incident pane.
In another embodiment, a method for network automation includes: receiving a network incident; classifying the incident; triggering a diagnosis for the incident based on the classifying; and displaying the diagnosis in an incident pane. The receiving comprises a ticket identifying the incident. The classifying comprises classifying an incident error, an incident type, or a device for the incident. The classifying comprises an Application Programming Interface (API) call. The triggering comprises a triggered diagnosis that automatically executes based on the classifying. The execution comprises a Network Intent Cluster (NIC) that updates logic based on the classifying. The incident pane comprises a graphical user interface (GUI) that displays a triggered diagnosis center. The incident pane comprises a triggered diagnosis log.
The system and method may be better understood with reference to the following drawings and descriptions. Non-limiting and non-exhaustive embodiments are described with reference to the following drawings. The components in the drawings are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. In the drawings, like referenced numerals designate corresponding parts throughout the different views.
Network problems may be organized by a Ticket System in the form of incidents. Those network problems may be repetitive: identical or similar problems happen repeatedly and are diagnosed the same way each time. Often those problems are preventable, caused by misconfiguration, performance degradation, or security violations. However, the lack of automated methods to enforce design rules, best practices, or security policy may prevent effective remediation of those problems.
Problem Diagnosis Automation System (PDAS) may address those issues. Specifically, PDAS automates the diagnosis of repetitive problems and the enforcement of preventive measures across the entire network.
A Network Intent Cluster (NIC) clones a NI across the network to create a group of NIs (member NIs) with the same design or logic. NIC may be created from a seed NI via no coding process. In PDAS, a subset of Member NIs can be automatically executed according to the user-defined condition based on the member device, the member NI tags, or signature variables.
A Triggered Automation Framework (TAF) matches the incoming API calls from a 3rd party system to current incidents and installs the automation (e.g., NI/NIC) to be triggered for each call. It may include: Integrated IT System defining the scope and data of the incoming API calls; Incident Type to match a call to an Incident; and Triggered Diagnosis to define what and how the NIC/NI is executed.
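By way of a non-limiting illustration, the three parts of the TAF named above may be sketched as follows: the accepted fields stand in for the Integrated IT System definition, each incident type carries a matching rule, and the triggered diagnosis names the NI or NIC to execute. All class names, field names, payload keys, and the automation label are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class IncidentType:
    name: str
    matches: Callable[[dict], bool]  # rule that matches a call to an incident


class TriggeredAutomationFramework:
    def __init__(self, accepted_fields):
        # Integrated IT System: scope and data accepted from incoming calls.
        self.accepted_fields = set(accepted_fields)
        self.incident_types = []
        self.triggered_diagnoses = {}  # incident type name -> NI/NIC to run
        self.log = []

    def add_incident_type(self, incident_type, automation):
        self.incident_types.append(incident_type)
        self.triggered_diagnoses[incident_type.name] = automation

    def handle_call(self, payload):
        """Match an incoming API call and return the automation to trigger."""
        data = {k: v for k, v in payload.items() if k in self.accepted_fields}
        for incident_type in self.incident_types:
            if incident_type.matches(data):
                automation = self.triggered_diagnoses[incident_type.name]
                self.log.append((incident_type.name, automation))
                return automation
        return None  # no incident type matched; nothing is triggered


taf = TriggeredAutomationFramework(accepted_fields=["device", "summary"])
taf.add_incident_type(
    IncidentType("interface-down", lambda d: "down" in d.get("summary", "").lower()),
    automation="NIC:interface-health",
)
triggered = taf.handle_call({"device": "R1", "summary": "Gi0/1 is DOWN", "caller": "itsm"})
ignored = taf.handle_call({"device": "R1", "summary": "High CPU"})
```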
Reference will now be made in detail to exemplary embodiments of the invention, examples of which are illustrated in the accompanying drawings. When appropriate, the same reference numbers are used throughout the drawings to refer to the same or like parts. The numerous innovative teachings of the present application will be described with particular reference to presently preferred embodiments (by way of example, and not of limitation). The present application describes several inventions, and none of the statements below should be taken as limiting the claims generally.
For simplicity and clarity of illustration, the drawing figures illustrate the general manner of construction, and description and details of well-known features and techniques may be omitted to avoid unnecessarily obscuring the invention. Additionally, elements in the drawing figures are not necessarily drawn to scale, and some areas or elements may be expanded to help improve understanding of embodiments of the invention.
The word ‘couple’ and similar terms do not necessarily denote direct and immediate connections, but also include connections through intermediate elements or devices. For purposes of convenience and clarity only, directional (up/down, etc.) or motional (forward/back, etc.) terms may be used with respect to the drawings. These and similar directional terms should not be construed to limit the scope in any manner. It will also be understood that other embodiments may be utilized without departing from the scope of the present disclosure, and that the detailed description is not to be taken in a limiting sense, and that elements may be differently positioned, or otherwise noted as in the appended claims without requirements of the written description being required thereto.
The terms “first,” “second,” “third,” “fourth,” and the like in the description and the claims, if any, may be used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable. Furthermore, the terms “comprise,” “include,” “have,” and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, article, apparatus, or composition that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, apparatus, or composition.
The aspects of the present disclosure may be described herein in terms of functional block components and various processing steps. It should be appreciated that such functional blocks may be realized by any number of hardware and/or software components configured to perform the specified functions. For example, these aspects may employ various integrated circuit components, e.g., memory elements, processing elements, logic elements, look-up tables, and the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices.
Similarly, the software elements of the present disclosure may be implemented with any programming or scripting languages such as C, C++, Java, COBOL, assembler, PERL, Python, or the like, with the various algorithms being implemented with any combination of data structures, objects, processes, routines, or other programming elements. Further, it should be noted that the present disclosure may employ any number of conventional techniques for data transmission, signaling, data processing, network control, and the like.
The particular implementations shown and described herein are for explanatory purposes and are not intended to otherwise be limiting in any way. Furthermore, the connecting lines shown in the various figures contained herein are intended to represent exemplary functional relationships and/or physical couplings between the various elements. It should be noted that many alternative or additional functional relationships or physical connections may be present in a practical system implemented in accordance with the disclosure.
As will be appreciated by one of ordinary skill in the art, aspects of the present disclosure may be embodied as a method or a system. Furthermore, these aspects of the present disclosure may take the form of a computer program product on a tangible computer-readable storage medium having computer-readable program-code embodied in the storage medium. Any suitable computer-readable storage medium may be utilized, including hard disks, CD-ROM, optical storage devices, magnetic storage devices, and/or the like. These computer program instructions may be loaded onto a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions which execute on the computer or other programmable data processing apparatus create means for implementing the functions specified in the flowchart block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart block or blocks. The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks.
As used herein, the terms “user,” “network engineer,” “network manager,” “network developer” and “participant” shall interchangeably refer to any person, entity, organization, machine, hardware, software, or business that accesses and uses the system of the disclosure. Participants in the system may interact with one another either online or offline.
Communication between participants in the system of the present disclosure is accomplished through any suitable communication means, such as, for example, a telephone network, intranet, Internet, extranet, WAN, LAN, personal digital assistant, cellular phone, online communications, off-line communications, wireless network communications, satellite communications, and/or the like. One skilled in the art will also appreciate that, for security reasons, any databases, systems, or components of the present disclosure may consist of any combination of databases or components at a single location or at multiple locations, wherein each database or system includes any of various suitable security features, such as firewalls, access codes, encryption, decryption, compression, decompression, and/or the like.
In network troubleshooting, a network engineer may use a set of commands, methods, and tools, either standard or proprietary. For example, these commands, methods, and tools may include the following items:
The Command Line Interface (CLI): network devices often provide CLI commands to check the network status or statistics. For example, in a Cisco IOS switch, the command “show interface” can be used to show the interface status, such as input errors.
Configuration management: a tool used to find differences of configurations of network devices in a certain period. This is important since about half of the network problems are caused by configuration changes.
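By way of a non-limiting illustration, finding the differences between two configuration snapshots of a device may be sketched with the Python standard library `difflib` module as follows. The function name and the sample configurations are assumptions made only for this example.

```python
import difflib


def config_diff(before, after):
    """Return only the removed ('-') and added ('+') configuration lines."""
    lines = list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
            n=0,  # no unchanged context lines
        )
    )
    # Skip the two file header lines, then drop the '@@' hunk markers.
    return [line for line in lines[2:] if not line.startswith("@@")]


# Hypothetical snapshots of the same device taken at two points in time.
before = "interface Gi0/1\n mtu 1500\n no shutdown"
after = "interface Gi0/1\n mtu 9000\n no shutdown"
changes = config_diff(before, after)
```

Each returned line keeps the one-character diff prefix in front of the original configuration line, so the indentation of the configuration is preserved.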
The term “Object” refers to the term used in computer technology, with the same meaning as in “object oriented” programming languages (such as Java, Common Lisp, Python, C++, Objective-C, Smalltalk, Delphi, Swift, C#, Perl, Ruby, and PHP). It is an abstract computer logic entity that envelops or mimics an entity in the real physical world, usually possessing an interface, data properties and/or methods.
The term “Device” refers to a data object representing a physical computer machine (e.g., printer, router) connected in a network or an object (e.g., computer instances or database instances on a server) created by computer logic functioning in a computer network.
The term “Q-map” or “Qmap” refers to a map of network devices created by the computer technology of NetBrain Technologies, Inc. that uses visual images and graphic drawings to represent the topology of a computer network with interface property and device property displays through a graphical user interface (GUI). Typically, a computer network is created with a map-like structure where a device is represented with a device image and is linked with other devices through straight lines, pointed lines, dashed lines and/or curved lines, depending on their interfaces and connection relationship. Along the lines, also displayed are the various data properties of the device or connection.
The term “Qapp” refers to a built-in or user-defined independently executable script or procedure generated through a graphical user interface as per technology available from NETBRAIN TECHNOLOGIES, INC.
The term “GUI” refers to a graphical user interface and includes a visual paradigm that offers users a plethora of choices. GUI paradigm or operation relies on windows, icons, mouse, pointers, and scrollbars to display the set of available files and applications graphically. In a GUI-based system, a network structure may be represented with graphic features (icons, lines and menus) that represent corresponding features in a physical network in a map. The map system may be referred to as a Qmap and is further described with respect to U.S. Pat. Nos. 8,386,593, 8,325,720, and 8,386,937, the entire disclosure of each of which is hereby incorporated by reference. After a procedure is created, it can be run in connection with any network system. Troubleshooting with a proposed solution may just take a few minutes instead of hours or days traditionally. The troubleshooting and network management automation may be with the mapping of the network along with the NETBRAIN QAPP (Qapp) system. The Qapp system is further described with respect to U.S. Pat. Nos. 9,374,278, 9,438,481, U.S. Pat. Pub. No. 2015/0156077, U.S. Pat. Pub. No. 2016/0359687, and U.S. Pat. Pub. No. 2016/0359688, the entire disclosure of each of which is hereby incorporated by reference.
The term “Step” refers to a single independently executable computer action represented by a GUI element, that obtains, or causes, a network result from, or in, a computer network; a Step can take a form of a Qapp, a system function, or a block of plain text describing an external action to be executed manually by a user, such as a suggestion of action, “go check the cable.” Each Step is thus operable and re-usable by a GUI operation, such as mouse cursor drag-and-drop or a mouse click.
The network manager 112 may be a computing device for monitoring or managing devices in a network, including performing automation tasks for the management, including network intent analysis and adaptive monitoring automation. In other embodiments, the network manager 112 may be referred to as a network intent analyzer or adaptive monitor for a user 102. The network manager 112 may include a processor 120, a memory 118, software 116 and a user interface 114. In alternative embodiments, the network manager 112 may be multiple devices to provide different functions, and it may or may not include all of the user interface 114, the software 116, the memory 118, and/or the processor 120.
The user interface 114 may be a user input device or a display. The user interface 114 may include a keyboard, keypad, or cursor control device, such as a mouse, joystick, touch screen display, remote control, or any other device operative to allow a user or administrator to interact with the network manager 112. The user interface 114 may communicate with any of the network devices in the network 104, and/or the network manager 112. The user interface 114 may include a user interface configured to allow a user and/or an administrator to interact with any of the components of the network manager 112. The user interface 114 may include a display coupled with the processor 120 and configured to display output from the processor 120. The display (not shown) may be a liquid crystal display (LCD), an organic light emitting diode (OLED), a flat panel display, a solid state display, a cathode ray tube (CRT), a projector, a printer or other now known or later developed display device for outputting determined information. The display may act as an interface for the user to see the functioning of the processor 120, or as an interface with the software 116 for providing data.
The processor 120 in the network manager 112 may include a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), or other types of processing devices. The processor 120 may be a component in any one of a variety of systems. For example, the processor 120 may be part of a standard personal computer or a workstation. The processor 120 may be one or more general processors, digital signal processors, application specific integrated circuits, field programmable gate arrays, servers, networks, digital circuits, analog circuits, combinations thereof, or other now known or later developed devices for analyzing and processing data. The processor 120 may operate in conjunction with a software program (i.e., software 116), such as code generated manually (i.e., programmed). The software 116 may include the Data View system and tasks that are performed as part of the management of the network 104, including the generation and usage of Data View functionality. Specifically, the Data View may be implemented from software, such as the software 116.
The processor 120 may be coupled with the memory 118, or the memory 118 may be a separate component. The software 116 may be stored in the memory 118. The memory 118 may include, but is not limited to, computer readable storage media such as various types of volatile and non-volatile storage media, including random access memory, read-only memory, programmable read-only memory, electrically programmable read-only memory, electrically erasable read-only memory, flash memory, magnetic tape or disk, optical media and the like. The memory 118 may include a random access memory for the processor 120. Alternatively, the memory 118 may be separate from the processor 120, such as a cache memory of a processor, the system memory, or other memory. The memory 118 may be an external storage device or database for storing recorded tracking data, or an analysis of the data. Examples include a hard drive, compact disc (“CD”), digital video disc (“DVD”), memory card, memory stick, floppy disc, universal serial bus (“USB”) memory device, or any other device operative to store data. The memory 118 is operable to store instructions executable by the processor 120.
The functions, acts or tasks illustrated in the figures or described herein may be performed by the programmed processor executing the instructions stored in the software 116 or the memory 118. The functions, acts or tasks are independent of the particular type of instruction set, storage media, processor or processing strategy and may be performed by software, hardware, integrated circuits, firmware, micro-code and the like, operating alone or in combination. Likewise, processing strategies may include multiprocessing, multitasking, parallel processing and the like. The processor 120 is configured to execute the software 116.
The present disclosure contemplates a computer-readable medium that includes instructions or receives and executes instructions responsive to a propagated signal, so that a device connected to a network can communicate voice, video, audio, images or any other data over a network. The user interface 114 may be used to provide the instructions over the network via a communication port. The communication port may be created in software or may be a physical connection in hardware. The communication port may be configured to connect with a network, external media, display, or any other components in system 100, or combinations thereof. The connection with the network may be a physical connection, such as a wired Ethernet connection or may be established wirelessly, as discussed below. Likewise, the connections with other components of the system 100 may be physical connections or may be established wirelessly.
Any of the components in the system 100 may be coupled with one another through a (computer) network, including but not limited to one or more network(s) 104. For example, the network manager 112 may be coupled with the devices in the network 104 through a network or the network manager 112 may be a part of the network 104. Accordingly, any of the components in the system 100 may include communication ports configured to connect with a network. The network or networks that may connect any of the components in the system 100 to enable data communication between the devices may include wired networks, wireless networks, or combinations thereof. The wireless network may be a cellular telephone network, a network operating according to a standardized protocol such as IEEE 802.11, 802.16, 802.20, published by the Institute of Electrical and Electronics Engineers, Inc., or WiMax network. Further, the network(s) may be a public network, such as the Internet, a private network, such as an intranet, or combinations thereof, and may utilize a variety of networking protocols now available or later developed including, but not limited to TCP/IP based networking protocols. The network(s) may include one or more of a local area network (LAN), a wide area network (WAN), a direct connection such as through a Universal Serial Bus (USB) port, and the like, and may include the set of interconnected networks that make up the Internet. The network(s) may include any communication method or employ any form of machine-readable media for communicating information from one device to another.
The network manager 112 may act as the operating system (OS) of the entire network 104. The network manager 112 provides automation for the users 102, including automated documentation, automated troubleshooting, automated change, and automated network defense. In one embodiment, the users 102 may refer to network engineers who have a basic understanding of networking technologies, are skilled in operating a network via a device command line interface, and are able to interpret a CLI output. The users 102 may rely on the network manager 112 for controlling the network 104, such as with network intent analysis functionality or for adaptive monitoring automation.
Along with the flows in
The diagnostics may be scalable. Once the first engineer responds to an incident and begins the initial triage and investigation, the priority is to obtain the correct data quickly and perform accurate, efficient analysis, typically involving manual digging through CLI. The goal is to accelerate this diagnosis using automation. Knowing what data to get, retrieving it rapidly, and leveraging expert know-how to analyze it is required. Automation may also provide enhanced data analytic functions to enable activities such as historical data comparisons to know “what has changed” or baseline analysis to understand “is this normal.” When combined with live data, an engineer can obtain the correct data and use these comparisons of past, current, and ideal network conditions to perform the analysis much faster. The first level of support can resolve some issues, but many problems require escalation. Collaboration may fail during incident response, with data not adequately conveyed to the next-level engineer or diagnostics not captured and saved. The escalation engineer may duplicate the work of the first engineer before moving on to more advanced diagnostics. A network automation solution should record the collected diagnostics and troubleshooting notes of every person assigned to the ticket so everyone working on the problem has the same data. When it comes to the fix, the goal is to push out the change safely and verify that the fix resolves the issue. A well-designed change automation system ensures the fix is successful. The solution automates the full mitigation sequence, including change deployment, before and after quality assurance, and validation that the problem has cleared. The network management automation embodiments may ensure that mitigation is safely executed, no additional harm has occurred, and reliable post-fix verification is performed.
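By way of a non-limiting illustration, the historical comparison ("what has changed") and the baseline analysis ("is this normal") described above may be sketched as follows. The function names, the metric names, and the representation of a baseline as a (low, high) range per metric are assumptions made only for this example.

```python
def what_has_changed(previous, current):
    """Historical comparison: map each changed key to (old value, new value)."""
    keys = previous.keys() | current.keys()
    return {
        key: (previous.get(key), current.get(key))
        for key in sorted(keys)
        if previous.get(key) != current.get(key)
    }


def is_this_normal(current, baseline):
    """Baseline analysis: return the metrics that fall outside their range."""
    abnormal = {}
    for metric, (low, high) in baseline.items():
        if metric in current and not (low <= current[metric] <= high):
            abnormal[metric] = current[metric]
    return abnormal


# Hypothetical snapshots of one interface and an assumed baseline.
previous = {"status": "up", "input_errors": 0, "cpu_percent": 30}
current = {"status": "up", "input_errors": 120, "cpu_percent": 35}
baseline = {"input_errors": (0, 10), "cpu_percent": (0, 80)}

changed = what_has_changed(previous, current)
abnormal = is_this_normal(current, baseline)
```

In this sketch the CPU value changed but remains within its baseline, while the input error count both changed and left its baseline, which is the kind of distinction that lets an engineer focus on the relevant data.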
To see continual improvement over time requires more issues to be near-instantly diagnosed with the root cause identified. In other words, the automation strategy should focus on moving increasingly more issues to near-zero time to a resolution until you can resolve practically every ticket with automation. As more problems occur with proper postmortem reviews, a NetOps team would classify recurring issue types into a “known problem” category and develop operational runbooks for these problems.
As more known problem operational runbooks are fed to the machine, more known issues will have fully automated diagnoses. This process continuously pushes MTTR lower. With proactive automation, we convert lessons learned into repeatable and executable diagnostic automation tasks. More than just documenting that lesson, the goal is to implement an automated diagnostic that checks for this problem the next time there is a similar incident.
When a fault occurs within the network, the first challenge is the resulting idle time. The ticket may sit unworked and, in the case of intermittent issues, potential diagnostic data may even clear before an investigation can begin. Automation augments this process and initiates the diagnosis of the event. Triggered automation closes the gap between the detection of the fault and the action of investigating. For triggered automation to be successful, full network management workflow integration may be used. A network's event detection system or ITSM must communicate with the NetOps automation system to trigger an automatic diagnosis.
Knowledge may be fed back into the automation platform at several points; two examples are operational handoff and following an incident. Operational Handoff is when a team has implemented a new network design (e.g., MPLS). A consistent, easy-to-follow method for documenting operational procedures related to new designs or new technology is required to ensure that everyone on the team knows how to troubleshoot the new environment. Building an operational runbook for the new design may be part of the handoff from the architect to the operator. Following an Incident means that the team may get together for a postmortem review after resolving an incident. The goal is to do better next time. This feedback process creates a closed-loop mechanism for continual improvement, capturing knowledge at these two critical and ordinary moments. Combining knowledge management with no-code runbook automation leads to the automated resolution of every ticket and can achieve continuous MTTR reduction over time. This feedback mechanism may be referred to as Proactive Automation.
Automation may have two types of users: consumers and creators of executable knowledge. This solves the challenges of resolving network tickets and maintaining a network, as shown in the following example network incident. The network's monitoring systems have detected a low video quality issue between the Boston and New York site locations. The network team's application performance monitor notifies their ITSM system and generates a new trouble ticket. Here, workflow integration comes into play. The network management system provides a mechanism to integrate with ITSM systems, which enables (1) creating a contextual Dynamic Map of the problem area at the time of ticket creation, and (2) enriching the trouble ticket with diagnostic data obtained from Executable Runbooks at the time of the event (Just in Time Automation). In the example video quality incident, the Dynamic Map visualizes relevant data about the network: topology data, configuration and design data, baseline data across thousands of data points, and even data from integrated third-party solutions. This map provides instant visualizations of the problem area. Triggered automation has now occurred, and valuable data has been automatically gathered at the start of the event using an Executable Runbook. A first response engineer can then review these automated diagnostics. The data retrieved includes essential device health, QoS parameters, access-control lists, and other relevant collected logs. What used to be a manual effort is now a zero-touch mechanism, ensuring that every ticket is enriched with a contextual map and diagnostic data.
The root cause of the poor video quality issue can then be determined. The engineer has reviewed the map of the problem and the collected diagnostics but still needs to drill down further to determine the root cause. To aid in the diagnosis, the scalability of the automation platform may be used. Additional diagnostics or more advanced design reviews may be needed to determine the root cause. The engineer now leverages the automated drill-down capabilities of the network management automation platform to do further analysis and historical comparisons and compare this data with previous baselines. The know-how and operational procedures from previous incident responses by the network management team may be converted into Executable Runbooks, which allow large swaths of contextual data to be pulled, parsed, analyzed, and displayed on the console at the push of a button by any engineer on the team, no matter their experience.
In the low video quality example, the network management team has identified the issue to be a misconfigured QoS parameter on a router. The misconfiguration has been successfully remediated with a configuration fix using the network management automation platform. By adding this issue to the list of known problems, the team ensures that they can identify and remediate the problem much faster if it happens again. With the network management automation platform, the additional diagnostic commands used to resolve the issue are added to the existing Executable Runbook automatically to enrich the Runbook without requiring any coding. Should the event reoccur, the system will trigger an automated diagnosis using the updated Runbook. The root cause will be determined instantly, with a near-zero Time to Repair for this repeat occurrence. This process also helps to rule out possible known issues in unrelated incidents automatically. It creates a “virtuous cycle”: the more known problems and scenarios for which an Executable Runbook is built, the further MTTR is reduced.
Dynamic Mapping and Executable Runbook are used for automating network troubleshooting. The Runbook digitalizes the troubleshooting procedure and, once written, can be executed anywhere by anyone. Network device vendors publish vast numbers of troubleshooting playbooks. Enterprises also create many best-practice playbooks to troubleshoot problems common to their unique networks. Executable Runbook can codify these playbooks. However, one difficulty in codifying these runbooks is that they try to solve a common problem and require coding skills. Some Runbooks can be complicated, with many forks depending on human decisions, making them hard to execute in the backend processes without human intervention. Since Runbook is a template-based solution designed to solve a common problem for many networks, it may not contain the baseline data for a specific network, which is the most useful information while troubleshooting.
Accordingly, Network Intention (NI) can be used to solve these issues. NI is an Automation Unit that can represent an actual network design (with Baseline) and include the logic to diagnose the intent deviation and replicate diagnosis logic across the entire network (with Network Intent Cluster technology). NI is a network-based solution with an executable automation element to document and verify a network design. In an ideal network, no NI is violated. NIs can be monitored proactively, and the system should send an alert for an NI violation. The NI system may include the following components:
NI may be used in a preventative use case. There may not be problems, but periodic checkups ensure the network runs normally. In another example, when there is a problem (e.g., a ticket system reports that the application is down), tests may need to be run, and the automation runs the tests to determine why the application is down. The cause may be that an NI is violated.
NIC expands Network Intent (NI) scope from a specific network design to one type of network design with similar diagnosis logic. A large network can have millions of NIs, and it may be time-consuming to add these NIs manually. The NIC system can discover and create these NIs automatically.
While NI effectively documents and validates a network design, it may apply to at least one network device or a set of devices at a time. Therefore, it can take many repetitive efforts to create NIs for a large network. NIC may be designed to expand the logic of a NI (seed NI) from one or a set of devices to the whole network. Furthermore, NIC may be triggered to run in the Triggered Automation Framework (TAF), and its results can significantly reduce the MTTR. NIC may not require coding skills and provides an intuitive user interface for creating and debugging. For example, a NI may monitor whether a failover occurs between a pair of network devices (the failover may cause performance issues such as slow applications). Upon identification, NIC can replicate the logic to all other pairs of network devices in the network without any coding.
Referring back to
The first step shown in the example process of
This step may be referred to as an Input Devices node. In the Input Devices node, users select the devices to expand the NI. There may be at least three ways to choose devices: 1) Select Sites: Select all devices of this site; 2) Select Device Groups: select all devices of this device group; or 3) Select Devices: select devices manually. Sites and Device Group may help deal with dynamic devices.
The second step shown in the example process of
A Seed NI node may select a NI whose logic is to be expanded. The seed devices may have default aliases, such as D1, D2, etc. Users can change an alias to an intuitive name, such as this device and neighbor device. In some embodiments, one NI can be selected for a NIC. The seed NI may support macro variables. For example, users can create a NI to check the MTU mismatch between two specific neighbor interfaces using the CLI command show interface e0/0.
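The MTU mismatch check mentioned above can be sketched as follows. This is a minimal illustration and not the system's actual parser: the function names are hypothetical, and the sample CLI text is an abbreviated, illustrative output that assumes the interface reports its MTU in the form "MTU 1500 bytes".

```python
import re

def parse_mtu(show_interface_output):
    """Extract the MTU value from 'show interface' text, or None if absent."""
    match = re.search(r"MTU (\d+) bytes", show_interface_output)
    return int(match.group(1)) if match else None

def mtu_mismatch(output_a, output_b):
    """Return True when the two neighbor interfaces report different MTUs."""
    mtu_a, mtu_b = parse_mtu(output_a), parse_mtu(output_b)
    if mtu_a is None or mtu_b is None:
        raise ValueError("MTU not found in CLI output")
    return mtu_a != mtu_b

# Illustrative, abbreviated CLI output for the two seed devices (D1 and D2).
d1_output = "Ethernet0/0 is up, line protocol is up\n  MTU 1500 bytes, BW 10000 Kbit/sec"
d2_output = "Ethernet0/0 is up, line protocol is up\n  MTU 1400 bytes, BW 10000 Kbit/sec"

print(mtu_mismatch(d1_output, d2_output))  # True
```

A seed NI built this way contains one diagnosis for one pair of interfaces; the remaining steps describe how that diagnosis is replicated to other pairs.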
The third step shown in the example process of
Seed Logic may be used to select the logic replicating from the seed NI to the input devices. There may be three types of logic:
Neighbor-level logic may have three types of replication logic designed for the different types of real-world cases: 1) full mesh; 2) sparse mode; 3) hub-spoke mode. Full mesh mode may take any two input devices in an eigen group to replicate the Diagnosis. So, if there are n input devices in an eigen group, NIC may generate a maximum of n*(n−1)/2 diagnoses in a member NI. Full mesh mode can be used to check a parameter across each neighbor pair to ensure the parameter for each device is unique. For example, check the router ID within an OSPF autonomous system to ensure that all router IDs configured within the same autonomous system are unique.
In the second example of neighbor-level logic, there may be a sparse mode. Sparse mode takes the input devices of an eigen group as a list and replicates the Diagnosis for any two adjacent devices. So, if there are n input devices, NIC may generate a maximum of (n−1) diagnoses in a member NI. Sparse mode can check a parameter across each neighbor pair to ensure that the parameter is the same across the devices selected. Because equality is transitive, checking adjacent pairs is sufficient. For example, check the EIGRP K values for the same EIGRP AS number to ensure that all EIGRP K values within the same EIGRP AS number are the same. The seed NI checks the K values of two devices to ensure they are the same. If the K values are not the same, the system will raise an alert. To expand the logic to all devices within the same EIGRP system, sparse mode replication logic may be used to define the seed logic.
In the third example of neighbor-level logic, there may be a hub-spoke mode. Hub-spoke mode may be applied to a pair of devices with different roles. For example, one is a P device, and the other is a PE device. Hub-spoke mode may divide the input devices of an eigen group into two groups according to their roles and take one device from each group to replicate the Diagnosis. For example, if there are m P devices and n PE devices, NIC may generate a maximum of m*n diagnoses in a member NI (for this eigen group). A NI may be created to check the connectivity between a P and a PE device to ensure their connectivity is working. In this mode, the check logic is then expanded to all connections between P devices and PE devices. For example, a seed NI checks the connectivity between P and PE devices. The system will raise an alert if there is a connectivity issue between the P and PE devices. To expand the logic to all devices within the same BGP AS Number, hub-spoke replication logic may be used to define the seed logic.
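The three neighbor-level replication modes differ only in which device pairs receive a copy of the seed diagnosis. The following sketch, with hypothetical function names and device names, enumerates the pairs for each mode and shows the maximum counts stated above.

```python
from itertools import combinations, product

def full_mesh(devices):
    """Every unordered pair of devices: n*(n-1)/2 pairs."""
    return list(combinations(devices, 2))

def sparse(devices):
    """Adjacent pairs in list order: n-1 pairs."""
    return list(zip(devices, devices[1:]))

def hub_spoke(hubs, spokes):
    """One device from each role group: m*n pairs."""
    return list(product(hubs, spokes))

routers = ["R1", "R2", "R3", "R4"]
p_devices = ["P1", "P2"]
pe_devices = ["PE1", "PE2", "PE3"]

print(len(full_mesh(routers)))                # 6
print(sparse(routers))                        # [('R1', 'R2'), ('R2', 'R3'), ('R3', 'R4')]
print(len(hub_spoke(p_devices, pe_devices)))  # 6
```

Full mesh suits uniqueness checks, where every pair must be compared, while sparse mode suits equality checks, where comparing neighbors in a chain covers the whole group with far fewer diagnoses.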
Group-level logic may replicate the seed NI's logic to groups that contain the same number of devices as the seed NI. For example, a typical remote site of a network includes one router and two switches. A seed NI is created to check the configuration compliance for a particular site. The group-level logic can be used to expand the same logic to all remote sites having the same deployment and setup.
The fourth step shown in the example process of
The Device Classifier node can put devices into different classifiers based on the device types so each classifier can use the same CLI command(s) to retrieve the data or use the same system. Users can also classify by device properties and configuration files rather than the device type. Users can define multiple classifiers, for example, one classifier per vendor, which can be useful for an NI that supports multi-vendor networks.
The fifth step shown in the example process of
The Group by Eigen Value node groups devices with the same eigen value into an Eigen Group, and these devices will be in the same Member Network Intent. For example, for a single-device diagnosis, users can select the device property hostname as the eigen value and put each input device into its own eigen group. Users can add variables from the Parser library, built-in system data, and CSV input variables, or users can create a new Parser. Under the system data, users can select the device property, interface property, and topology data.
In another example, there may be compound variables and/or an instruction to ignore the variable order. While expanding a NI to check MTU mismatch between two neighbor interfaces to the whole network, users can select the topology data under the system data as the eigen variables, including four variables: this device, local interface, neighbor device, and neighbor interface. Furthermore, users can add compound variables built from the currently selected variables. For example, users can create a compound variable this_device_info with the formula $thisDevice+$localInteface. This compound variable may identify a local interface across the network if the device hostname is unique. In some embodiments, users can create the compound variable neigbor_device_info and set both compound variables as the eigen variables. The system may create two eigen groups for a pair of this_device_info and neigbor_device_info as (R1e0, R2e0) and (R2e0, R1e0). However, the order does not matter for an MTU mismatch check, so these two groups may be treated as one. Users can ignore the variable order by adding the Ignore Variable Order setting and checking the corresponding variables.
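The effect of compound variables and the Ignore Variable Order setting can be sketched as a grouping key. This is an illustrative sketch only: the dictionary keys and function names are hypothetical, and sorting the order-insensitive values is one possible way to make (R1e0, R2e0) and (R2e0, R1e0) fall into the same eigen group.

```python
from collections import defaultdict

def eigen_key(record, eigen_vars, ignore_order=()):
    """Build a hashable group key from the eigen variables.

    Values of variables listed in ignore_order are sorted, so swapping
    them between two records still yields the same key.
    """
    ordered = tuple(record[v] for v in eigen_vars if v not in ignore_order)
    unordered = tuple(sorted(record[v] for v in ignore_order))
    return ordered, unordered

def group_by_eigen(records, eigen_vars, ignore_order=()):
    """Group records that share the same eigen key."""
    groups = defaultdict(list)
    for record in records:
        groups[eigen_key(record, eigen_vars, ignore_order)].append(record)
    return dict(groups)

# Topology rows seen from each end of the same link.
rows = [
    {"this_device": "R1", "local_interface": "e0", "neighbor_device": "R2", "neighbor_interface": "e0"},
    {"this_device": "R2", "local_interface": "e0", "neighbor_device": "R1", "neighbor_interface": "e0"},
]
# Compound variables: device name concatenated with interface name.
for row in rows:
    row["this_device_info"] = row["this_device"] + row["local_interface"]
    row["neighbor_device_info"] = row["neighbor_device"] + row["neighbor_interface"]

eigen_vars = ["this_device_info", "neighbor_device_info"]
strict = group_by_eigen(rows, eigen_vars)
relaxed = group_by_eigen(rows, eigen_vars, ignore_order=eigen_vars)
print(len(strict), len(relaxed))  # 2 1
```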
In another example, there may be a merge group via variable. The system may create the eigen group by default if all eigen variables are the same. In some embodiments, users may want to group devices even if some eigen variables are different. For example, a NI is created to check the neighbor relationship between P and PE devices. For this purpose, the P device and its PE devices are put into an Eigen Group. Four eigen variables are added: $name, $BGP_as_number, $nbr_device, $local_IP. And the Ignore Variable Order is added to ignore the order of name and nbr_device. Four eigen groups are created for each pair of P and PE neighbors. To merge all eigen groups into one group, we can enable the merge variables function and select the variable as_number so that the devices with the same as_number will be merged into one group.
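Merging eigen groups by a variable can be sketched as a second grouping pass over the groups produced by default. The data layout below (a group key stored as a tuple of name/value pairs) and the device names and AS numbers are assumptions made for illustration.

```python
from collections import defaultdict

def merge_groups(eigen_groups, merge_var):
    """Merge eigen groups whose keys share the same value for merge_var.

    eigen_groups maps a key, stored as a tuple of (name, value) pairs,
    to the list of devices in that group.
    """
    merged = defaultdict(set)
    for key, devices in eigen_groups.items():
        merged[dict(key)[merge_var]].update(devices)
    return {value: sorted(devices) for value, devices in merged.items()}

# One eigen group per P/PE neighbor pair, as created by default.
groups = {
    (("as_number", 65001), ("name", "P1"), ("nbr_device", "PE1")): ["P1", "PE1"],
    (("as_number", 65001), ("name", "P1"), ("nbr_device", "PE2")): ["P1", "PE2"],
    (("as_number", 65002), ("name", "P2"), ("nbr_device", "PE3")): ["P2", "PE3"],
}

print(merge_groups(groups, "as_number"))
# {65001: ['P1', 'PE1', 'PE2'], 65002: ['P2', 'PE3']}
```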
The sixth step shown in the example process of
The Target Seed node may define how to match the input devices to a seed device. For example, an NI may be created to check the failover status of a primary and backup HSRP device. The seed devices may be the primary and backup devices. The target seed logic can be defined by: if $state contains Active, match the primary seed device; if $state contains standby, match the backup seed device.
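The target seed logic above amounts to an ordered list of "contains" rules. A minimal sketch follows; the function name and rule format are hypothetical, and the case-insensitive comparison is an assumption, since the example rules mix "Active" and "standby".

```python
def match_seed(state, rules, default=None):
    """Return the seed alias of the first rule whose keyword appears in state."""
    text = state.lower()
    for keyword, seed_alias in rules:
        if keyword.lower() in text:
            return seed_alias
    return default

# if $state contains Active -> primary seed; if $state contains standby -> backup seed
rules = [("Active", "primary"), ("standby", "backup")]

print(match_seed("State is Active", rules))   # primary
print(match_seed("State is Standby", rules))  # backup
print(match_seed("State is Init", rules))     # None
```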
The seventh step shown in the example process of
Member NI generates the Member NIs with the following additional functions:
The following Table summarizes the example process shown in
The NIC can then be executed. Member NIs of a NIC can be run manually. In some embodiments, NIC is triggered by an external ticket, which requires adding NIC to the triggered Diagnosis of the Triggered Automation Framework (TAF) system (discussed below), or an internal probe, which may require installation of NIC to the probe.
The example NIC process described above includes seven steps. In alternative embodiments, there may be more or fewer steps. In one embodiment referred to as Auto Mode, that process may include three steps.
For the first step, the input devices are selected. Users can select the input devices by Site, by Device Group, by Device, by Path, or by Map. When users select the input devices by Device, they may select the method to create the group, which can be per device, per VLAN group, per subnet, device and its L3 neighbors, device and its L2 neighbors, or all in one group. For the second step, the seed NIs are selected. The auto mode may support the single device diagnosis. The system can ask users to disable the auto mode if the Seed Intent contains a cross-device diagnosis. For the third step, the member NIs are created. The member NIs will be created by the type of input devices or the method to create the group:
The system can automatically create other nodes (e.g., nodes/steps 3-6 from
In
This Auto Test option can also apply to and simplify the definition of Device Classifiers (step/node 4) and Group by Eigen Values (step/node 5) when multiple vendors or commands are involved. With this option enabled, the user can use the default device classifier, and the system then determines which device type or commands are used by testing the data against the live network. In other words, Auto Test provides functionality for Auto Mode, where the system determines several nodes/steps.
Triggered Automation Framework (TAF) is a framework for an incident such as a ServiceNow ticket to trigger the related network automation such as Network Intent and Runbook.
The first step in integrating an IT system is to define the API call signature (or identification) from that system to the NetBrain system. This may be done by defining an Integrated IT System at the system management level. An Integrated IT system describes the types of API calls (categories) and the data included in these API calls. In addition, the system provides a mechanism to support multi-tenant and domain deployment for Managed Service Providers (MSP) and other customers with such deployments. An Integrated IT system has the following fields:
Each category may correspond to the different types of API calls from this ticket system, which usually has various data fields or parameters. For example, there may be one category for the Incident ticket and another category for the Change Request ticket. To define a category, a user can enter a unique name and add the data fields. There are at least two ways to add the data fields:
If multiple categories are defined for an IT system, TAF may match an API call to a category by looking for a particular data field, category, of the API call. As a result, a user can add this particular data field to all categories. Otherwise, a user can define a condition of a category used by TAF to tell which category an incoming API call from this ticket system belongs to. To define a simple condition, a user can select a data field of the API call, an operator (contains, does not contain, matches, does not match), and enter a keyword. Users can combine multiple simple conditions with the standard Boolean AND/OR operations.
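The category matching described above can be sketched as follows. The class and function names are hypothetical, the ticket numbers are made up, and treating "matches" as a regular-expression search is an assumption; the description does not specify the matching semantics.

```python
import re
from dataclasses import dataclass

@dataclass
class Condition:
    data_field: str
    operator: str  # "contains", "does not contain", "matches", "does not match"
    keyword: str

    def evaluate(self, call):
        value = str(call.get(self.data_field, ""))
        if self.operator == "contains":
            return self.keyword in value
        if self.operator == "does not contain":
            return self.keyword not in value
        if self.operator == "matches":  # assumed to be a regex search
            return re.search(self.keyword, value) is not None
        if self.operator == "does not match":
            return re.search(self.keyword, value) is None
        raise ValueError(f"unknown operator: {self.operator}")

def all_of(*conditions):
    """Combine simple conditions with Boolean AND."""
    return lambda call: all(c.evaluate(call) for c in conditions)

def any_of(*conditions):
    """Combine simple conditions with Boolean OR."""
    return lambda call: any(c.evaluate(call) for c in conditions)

def classify(call, categories):
    """Use the explicit 'category' field if present, else the first matching condition."""
    if call.get("category") in categories:
        return call["category"]
    for name, predicate in categories.items():
        if predicate(call):
            return name
    return None

categories = {
    "Incident": all_of(Condition("number", "matches", r"^INC\d+")),
    "Change Request": any_of(
        Condition("number", "contains", "CHG"),
        Condition("type", "contains", "change"),
    ),
}

print(classify({"number": "INC0012"}, categories))  # Incident
print(classify({"number": "CHG0040"}, categories))  # Change Request
```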
Managed Service Provider (MSP) customers may have multiple tenant systems, one tenant for each client. To support multi-tenant and domain deployment, an API call may include a particular data field, scope, and the user may define mappings between scopes and Domains for Integrated NetBrain systems. The TAF framework may forward the API call to the matched domain.
For each category of the incoming API call from the Integrated IT Systems, TAF will further classify them into NetBrain Incident types. The Incident Type defines: 1) the condition to put an API call into this Incident Type; 2) the signature variables to decide whether to merge the API call into an existing Incident or create a new Incident; and 3) the Incident message and Guidebook, which may be displayed in the Incident Pane.
There may also be an Incident Merging Setting. Multiple tickets may be related and share the same root cause. For example, if a monitoring system detects that an interface is down, it may create multiple tickets. TAF allows a user to merge API calls for all these tickets into one Incident instead of creating a new incident for each of these calls. The setting to merge may be defined as follows: if an API call has the same signature value as a previous API call within a specific time range, then do not create a new incident. Instead, append a new Incident message to the Incident created in the last call.
There may be an option to Match Existing Incident. With this option enabled, API calls belonging to this Incident type will be discarded if no existing incident matches this API call. This option may be disabled so that a new incident will be created if no incident matches the API call. However, a user may temporarily enable this option to avoid creating many new incidents.
There may be an option to Set New Incident Subject. The default incident subject may be {source}-{triggered time}. A user can customize this subject by typing any text and inserting any data field from the category and built-in special fields ({Incident Type}, {source}, {category}, and {triggered time}). For example, a user can create a subject as: Interface {interface name} of device {device} is down: from {source} on {triggered time}.
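Rendering a customized subject is a placeholder substitution. The sketch below is illustrative: the function name and field values are hypothetical, and leaving unknown placeholders untouched is a design assumption rather than documented behavior.

```python
import re

def render_subject(template, fields):
    """Replace each {name} placeholder with its value; leave unknown ones intact."""
    def substitute(match):
        name = match.group(1)
        return str(fields.get(name, match.group(0)))
    return re.sub(r"\{([^{}]+)\}", substitute, template)

fields = {
    "interface name": "Gi0/1",
    "device": "BOS-R1",
    "source": "ServiceNow",
    "triggered time": "2022-02-18 09:30",
}

default_subject = render_subject("{source}-{triggered time}", fields)
custom_subject = render_subject(
    "Interface {interface name} of device {device} is down: from {source} on {triggered time}",
    fields,
)
print(default_subject)  # ServiceNow-2022-02-18 09:30
print(custom_subject)   # Interface Gi0/1 of device BOS-R1 is down: from ServiceNow on 2022-02-18 09:30
```

A regular expression is used instead of str.format because the field names in the example contain spaces.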
There may be an option to Merge Incident by Signature. The user can select one or multiple data fields (or custom variables covered later) as the signature to merge the tickets to an incident. One signature value can have multiple variables, for example, value 1=$device or $cmdb_ci_name. The use case for the multiple variables is that a ticket may use either $device or $cmdb_ci_name as the name of the device reporting this Incident. The system may use the first variable with a non-empty value for the comparison.
The user can define multiple values for the signature. For example, for Interface Error incident type, we may define $device_name as value 1 and $interface_name as value 2. The tickets will be merged if both $device_name and $interface_name are the same.
There may be an option for a Custom Variable. The value used for the signature may be a part of a data field such as $description and $detail_message. A user can create a custom variable to retrieve the value from the data field by regular expression.
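A custom variable can be sketched as a regular expression with one capture group applied to a data field. The function name, the sample description text, and the patterns are all hypothetical.

```python
import re

def custom_variable(call, source_field, pattern):
    """Return the first capture group of pattern in call[source_field], or None."""
    match = re.search(pattern, call.get(source_field, ""))
    return match.group(1) if match else None

call = {"description": "Interface GigabitEthernet0/1 on BOS-R1 changed state to down"}

interface_name = custom_variable(call, "description", r"Interface (\S+) on")
device_name = custom_variable(call, "description", r" on (\S+) changed")
print(interface_name, device_name)  # GigabitEthernet0/1 BOS-R1
```

The extracted values can then serve as signature values in the same way as ordinary data fields.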
There may be an option for Merge Incident by Time. The user can define how long TAF should look back to find the incident candidate to be merged. The user can define this time range by Incident Creation Time and/or Updated Time. When both times are selected, the system will use AND logic. If neither is selected, the system will search for all incidents.
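The merging options described above (signature values with alternative variables, the time range, and Match Existing Incident) can be combined in one sketch. All names are hypothetical and the in-memory list stands in for the incident store; this illustrates the decision logic only, not the TAF implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

def first_non_empty(call, names):
    """Return the first non-empty value among the alternative field names."""
    for name in names:
        if call.get(name):
            return call[name]
    return None

def build_signature(call, signature_values):
    """Build the signature tuple; each entry lists alternative field names."""
    return tuple(first_non_empty(call, alts) for alts in signature_values)

@dataclass
class Incident:
    signature: tuple
    created: datetime
    updated: datetime
    messages: list = field(default_factory=list)

def in_window(incident, now, created_within, updated_within):
    """AND the selected time checks; with neither selected, every incident qualifies."""
    ok_created = created_within is None or now - incident.created <= created_within
    ok_updated = updated_within is None or now - incident.updated <= updated_within
    return ok_created and ok_updated

def handle_call(call, now, incidents, signature_values,
                created_within=None, updated_within=None, match_existing_only=False):
    """Append to a matching incident, or create one unless match_existing_only is set."""
    sig = build_signature(call, signature_values)
    for incident in incidents:
        if incident.signature == sig and in_window(incident, now, created_within, updated_within):
            incident.messages.append(call["message"])
            incident.updated = now
            return incident
    if match_existing_only:
        return None  # the call is discarded
    incident = Incident(sig, now, now, [call["message"]])
    incidents.append(incident)
    return incident

# value 1 = $device or $cmdb_ci_name, value 2 = $interface_name
SIGNATURE = [["device", "cmdb_ci_name"], ["interface_name"]]
incidents = []
t0 = datetime(2022, 2, 18, 9, 0)
window = timedelta(minutes=30)

first = handle_call({"device": "R1", "interface_name": "e0/0", "message": "interface down"},
                    t0, incidents, SIGNATURE, updated_within=window)
second = handle_call({"cmdb_ci_name": "R1", "interface_name": "e0/0", "message": "still down"},
                     t0 + timedelta(minutes=10), incidents, SIGNATURE, updated_within=window)
print(first is second, len(incidents))  # True 1
```

The second call names the device through a different field, but both calls resolve to the same signature and fall within the 30-minute window, so the second message is appended to the first incident.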
Under the Triggered Diagnosis option of a Triggered Diagnosis Center, a user can install an NI or NIC for an Incident Type. The installed NI and NIC can be run automatically (i.e., triggered Diagnosis) by the incoming API call or displayed in Incident Pane for the user to execute manually (self-service). The Diagnosis results and NI status codes may be shown in an Incident Pane and the Integrated IT system.
Besides the name and description, a user can select a NI or NIC for the Diagnosis. In some embodiments, a user can choose NIC unless the Incident Type is specific to certain devices. For example, a user can select the NIC (e.g., BGP Flapping Examination). The NIC may be set to run automatically if the triggered condition is satisfied or displayed in the Incident Portal for the end-user to run it (self-service) manually.
In one embodiment, there may be a guide for interactive automation. For example, the user can select a guidebook or a Runbook Template to guide the end-user to run the recommended automation in the Incident Portal.
In another embodiment, there may be a subscription to preventative automation. A diagnosis can be configured to collect the alerts from Flash Probes and/or NIs. The user can define the time range (e.g., next one day), filter tag (e.g., BGP probe or NI), and alert type from Intent. The system may collect alerts from the Flash Probes or NIs on all incident devices in the configured time range and display them in the Incident Pane.
Each task may have the following fields:
Referring back to
There may be an Incident-based Collaboration Flow. First, users may open a ServiceNow ticket during Troubleshooting. An incident can be created automatically for a ServiceNow ticket based on the TAF definition. Users can open a ServiceNow ticket and find the related link to the Incident. Second, the incident pane is opened to provide a view of the messages, showing the triggered process and details. Third,
Solving a problem may require multi-person cooperation and various data types (such as map, NI, probe, and Runbook). The solution may be through troubleshooting and reviewing. Preventive Automation (Adaptive Monitoring) data subscription allows users to see all diagnosis results related to current network problems in the most recent time, which helps users locate and solve problems faster.
The system and process described above may be encoded in a signal bearing medium, a computer readable medium such as a memory, programmed within a device such as one or more integrated circuits, one or more processors or processed by a controller or a computer. That data may be analyzed in a computer system and used to generate a spectrum. If the methods are performed by software, the software may reside in a memory resident to or interfaced to a storage device, synchronizer, communication interface, or non-volatile or volatile memory in communication with a transmitter. A transmitter is a circuit or electronic device designed to send data to another location. The memory may include an ordered listing of executable instructions for implementing logical functions. A logical function or any system element described may be implemented through optic circuitry, digital circuitry, through source code, through analog circuitry, through an analog source such as an analog electrical, audio, or video signal or a combination. The software may be embodied in any computer-readable or signal-bearing medium, for use by, or in connection with an instruction executable system, apparatus, or device. Such a system may include a computer-based system, a processor-containing system, or another system that may selectively fetch instructions from an instruction executable system, apparatus, or device that may also execute instructions.
A “computer-readable medium,” “machine readable medium,” “propagated-signal” medium, and/or “signal-bearing medium” may comprise any device that includes, stores, communicates, propagates, or transports software for use by or in connection with an instruction executable system, apparatus, or device. The machine-readable medium may selectively be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. A non-exhaustive list of examples of a machine-readable medium would include: an electrical connection “electronic” having one or more wires, a portable magnetic or optical disk, a volatile memory such as a Random Access Memory “RAM”, a Read-Only Memory “ROM”, an Erasable Programmable Read-Only Memory (EPROM or Flash memory), or an optical fiber. A machine-readable medium may also include a tangible medium upon which software is printed, as the software may be electronically stored as an image or in another format (e.g., through an optical scan), then compiled, and/or interpreted or otherwise processed. The processed medium may then be stored in a computer and/or machine memory.
The illustrations of the embodiments described herein are intended to provide a general understanding of the structure of the various embodiments. The illustrations are not intended to serve as a complete description of all of the elements and features of apparatus and systems that utilize the structures or methods described herein. Many other embodiments may be apparent to those of skill in the art upon reviewing the disclosure. Other embodiments may be utilized and derived from the disclosure, such that structural and logical substitutions and changes may be made without departing from the scope of the disclosure. Additionally, the illustrations are merely representational and may not be drawn to scale. Certain proportions within the illustrations may be exaggerated, while other proportions may be minimized. Accordingly, the disclosure and the figures are to be regarded as illustrative rather than restrictive.
One or more embodiments of the disclosure may be referred to herein, individually and/or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any particular invention or inventive concept. Moreover, although specific embodiments have been illustrated and described herein, it should be appreciated that any subsequent arrangement designed to achieve the same or similar purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all subsequent adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the description.
The phrase “coupled with” is defined to mean directly connected to or indirectly connected through one or more intermediate components. Such intermediate components may include both hardware and software based components. Variations in the arrangement and type of the components may be made without departing from the spirit or scope of the claims as set forth herein. Additional, different or fewer components may be provided.
The above disclosed subject matter is to be considered illustrative, and not restrictive, and the appended claims are intended to cover all such modifications, enhancements, and other embodiments, which fall within the true spirit and scope of the present invention. Thus, to the maximum extent allowed by law, the scope of the present invention is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing detailed description. While various embodiments of the invention have been described, it will be apparent to those of ordinary skill in the art that many more embodiments and implementations are possible within the scope of the invention. Accordingly, the invention is not to be restricted except in light of the attached claims and their equivalents.
This application claims priority to Provisional Patent Application Number 63/311,679, filed on Feb. 18, 2022, entitled PROBLEM DIAGNOSIS AUTOMATION SYSTEM (PDAS) INCLUDING NETWORK INTENT CLUSTER (NIC), TRIGGERED DIAGNOSIS, AND PERSONAL MAP, and claims priority as a Continuation-in-part to U.S. application Ser. No. 17/729,275, filed on Apr. 26, 2022, entitled NETWORK ADAPTIVE MONITORING, and to U.S. application Ser. No. 17/729,182, filed on Apr. 26, 2022, entitled NETWORK INTENT MANAGEMENT AND AUTOMATION, both of which claim priority to Provisional Patent Application No. 63/179,782, filed on Apr. 26, 2021, entitled INTENT-BASED NETWORK AUTOMATION, the entire disclosures of each are herein incorporated by reference.
Number | Date | Country
---|---|---
63311679 | Feb 2022 | US
63179782 | Apr 2021 | US
63179782 | Apr 2021 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 17729275 | Apr 2022 | US
Child | 18172044 | | US
Parent | 17729182 | Apr 2022 | US
Child | 17729275 | | US