The disclosure relates generally to computer networks and, more specifically, to monitoring and troubleshooting computer networks.
Commercial premises or sites, such as offices, hospitals, airports, stadiums, or retail outlets, often install complex wireless network systems, including a network of wireless access points (APs), throughout the premises to provide wireless network services to one or more wireless client devices (or simply, “clients”). APs are physical, electronic devices that enable other devices to wirelessly connect to a wired network using various wireless networking protocols and technologies, such as wireless local area networking protocols conforming to one or more of the Institute of Electrical and Electronics Engineers (IEEE) 802.11 standards (i.e., “WiFi”), Bluetooth/Bluetooth Low Energy (BLE), mesh networking protocols such as ZigBee or other wireless networking technologies.
Many different types of wireless client devices, such as laptop computers, smartphones, tablets, wearable devices, appliances, and Internet of Things (IoT) devices, incorporate wireless communication technology and can be configured to connect to wireless access points when the device is in range of a compatible wireless access point in order to access a wired network. In the case of a client device running a cloud-based application, such as voice over Internet Protocol (VOIP) applications, streaming video applications, gaming applications, or video conference applications, data is exchanged during an application session from the client device through one or more APs and one or more wired network devices, e.g., switches, routers, and/or gateway devices, to reach the cloud-based application server.
In general, this disclosure describes one or more techniques for using a machine learning model to detect one or more network failures in an external network. In some examples, a network management system (NMS) may provide network management services, such as Wide Area Network (WAN) assurance services, for an internal network. The internal network may include a plurality of network devices that operate to provide WAN interconnectivity between devices of one or more external networks, such as one or more client devices of a customer network with one or more service provider devices of a cloud services provider network. That is, a client device may initiate a session with a service provider device via the plurality of network devices of the internal network. An administrator of the internal network may control a plurality of pathways through the internal network. The pathways may facilitate the flow of network traffic through the internal network. For example, a client device may connect to an access point and initiate a session with a service provider device such as a server of a video streaming platform. A pathway through one or more network devices of the internal network may be part of an end-to-end pathway connecting the client device with the service provider device.
When a client device communicates with a service provider device via a network connection within the plurality of network devices of the internal network, one or more network failures may occur outside of the internal network that interrupt a session between the client device and the service provider device. For example, the client device may connect to the network devices of the internal network via one or more access points or other computing devices of an external network. Additionally, or alternatively, the service provider device may connect to the network devices of the internal network via one or more access points or other computing devices of external networks. Network failures may occur at access points, computing devices, or interfaces located in an external network such that even if a network path through one or more network devices of the internal network is fully operational, a session between a client device and a service provider device is interrupted.
For example, a client device may establish a session with a service provider device via an end-to-end pathway that includes a pathway through the internal network. If a network failure located between the internal network and the service provider device occurs during the session, a network device within the internal network may attempt to send network traffic to the service provider device without success. Due to the network failure, the service provider device may not receive the network traffic from the network device of the internal network or may fail to send return network traffic back to the internal network for forwarding to the client device. This means that data indicating transmitted and received network traffic at a network device of the internal network may indicate whether there is a network failure associated with the service provider device. Other kinds of data that may indicate network failure include transmission control protocol (TCP) data indicating a flow of TCP packets at a network device and new session data indicating a number of new sessions established via the network device.
The techniques of this disclosure may provide one or more improvements to the computer-related field of computer networking that are integrated within a practical application. For example, a network management system may use a machine learning model to detect one or more network failures outside of an internal network based on network data received from network devices of the internal network. Since the network data may include information indicating network traffic and new sessions at a network device, the network management system may process network data using the machine learning model to determine whether one or more network devices within the internal network are failing to exchange network traffic or establish new sessions with a device outside of the internal network. By using a machine learning model to detect network failures outside of an internal network, the network management system may detect network failures more accurately as compared with systems that do not analyze network data using a machine learning model.
In one example, a network management system includes a memory configured to store a machine learning model; and processing circuitry in communication with the memory. The processing circuitry is configured to: receive network data from a plurality of network devices of an internal network managed by the network management system, wherein the internal network is connected to an external network isolated from the network management system; apply the machine learning model to detect, based on the network data, a network failure within the external network; and perform a corrective action to remediate the network failure within the external network.
In another example, a method includes receiving, by processing circuitry of a network management system, network data from a plurality of network devices of an internal network managed by the network management system, wherein the internal network is connected to an external network isolated from the network management system, and wherein the processing circuitry is in communication with a memory of the network management system configured to store a machine learning model; applying, by the processing circuitry, the machine learning model to detect, based on the network data, a network failure within the external network; and performing, by the processing circuitry, a corrective action to remediate the network failure within the external network.
In another example, a computer-readable medium comprises instructions that, when executed by processing circuitry, cause the processing circuitry to: receive network data from a plurality of network devices of an internal network managed by a network management system, wherein the internal network is connected to an external network isolated from the network management system; apply a machine learning model to detect, based on the network data, a network failure within the external network; and apply a corrective action to remediate the network failure within the external network.
The details of one or more examples of the techniques of this disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the techniques will be apparent from the description and drawings, and from the claims.
Each site of sites 102 includes a plurality of network access server (NAS) devices, such as access points (APs), switches, or routers (not shown in
Each site of sites 102 also includes a plurality of client devices, otherwise known as user equipment devices (UEs), referred to generally as UEs or client devices, representing various wireless-enabled devices within each site. For example, a plurality of UEs 148A-1 through 148A-K are currently located at site 102A. Similarly, a plurality of UEs 148N-1 through 148N-K are currently located at site 102N. UEs located at sites 102 may be collectively referred to as “UEs 148.” In some examples, UEs may be referred to as “client devices.” Each UE of UEs 148 may be any type of wireless client device, including, but not limited to, a mobile device such as a smart phone, tablet or laptop computer, a personal digital assistant (PDA), a wireless terminal, a smart watch, smart ring, or other wearable device. UEs 148 may also include wired client-side devices, e.g., internet of things (IoT) devices such as printers, security devices, environmental sensors, or any other device connected to the wired network and configured to communicate over one or more wireless networks 106.
In order to provide wireless network services to UEs 148 and/or communicate over the wireless networks 106, APs 142 and the other wired client-side devices at sites 102 are connected, either directly or indirectly, to one or more network devices (e.g., switches, routers, or the like) via physical cables, e.g., Ethernet cables. In the example of
Example network system 100 also includes various networking components for providing networking services within the wired network including, as examples, an Authentication, Authorization and Accounting (AAA) server 110 for authenticating users and/or UEs 148, a Dynamic Host Configuration Protocol (DHCP) server 116 for dynamically assigning network addresses (e.g., IP addresses) to UEs 148 upon authentication, a Domain Name System (DNS) server 118 for resolving domain names into network addresses, a plurality of switches 126A-126Y (collectively, “switches 126”), a plurality of servers 128A-128X (collectively “servers 128”) (e.g., web servers, database servers, file servers, service provider servers, and the like), and NMS 130. In some examples, servers 128 may represent service provider devices that are configured to provide one or more services to UEs 148. For example, external network 125 may include switches 126 and servers 128 and may be configured for providing services to sites 102. In some examples, external network 125 is managed by a third party other than NMS 130 or an administrator of NMS 130. As shown in
In some examples, network devices 122 may comprise a plurality of network devices that form a plurality of pathways through network 120, which is an internal network. Each pathway of the plurality of pathways through the internal network may be part of one or more end-to-end pathways between UEs 148 and servers 128. In some examples, an administrator device 111 of network 120 may control a configuration of the pathways through network 120. Since the plurality of pathways through the internal network formed by network devices 122 may, in some cases, each be part of one or more end-to-end pathways that connect one or more UEs of UEs 148 with one or more servers of servers 128, the administrator of network 120 may control a configuration of the end-to-end pathways by controlling the configuration of the pathways through the internal network. For example, a first pathway through the internal network including network devices 122 may be part of a first end-to-end pathway that connects UE 148A-1 with server 128B and a second pathway through the internal network including network devices 122 may be part of a second end-to-end pathway that connects UE 148N-1 with server 128X. This means that the configuration of the first pathway and the configuration of the second pathway may affect the configuration of the first end-to-end pathway and the configuration of the second end-to-end pathway, respectively.
In the example of
The administrator and administrator device 111 may comprise IT personnel and an administrator computing device associated with one or more of sites 102. Administrator device 111 may be implemented as any suitable device for presenting output and/or accepting user input. For instance, administrator device 111 may include a display. Administrator device 111 may be a computing system, such as a mobile or non-mobile computing device operated by a user and/or by the administrator. Administrator device 111 may, for example, represent a workstation, a laptop or notebook computer, a desktop computer, a tablet computer, or any other computing device that may be operated by a user and/or present a user interface in accordance with one or more aspects of the present disclosure. Administrator device 111 may be physically separate from and/or in a different location than NMS 130 such that administrator device 111 may communicate with NMS 130 via network 120 or other means of communication.
In some examples, one or more of the NAS devices, e.g., APs 142, switches 146, or routers, may connect to edge devices 150A-150N (collectively, “edge devices 150”) via physical cables, e.g., Ethernet cables. Edge devices 150 comprise cloud-managed, wireless LAN controllers. Each of edge devices 150 may comprise an on-premises device at a site 102 that is in communication with NMS 130 to extend certain microservices from NMS 130 to the on-premises NAS devices while using NMS 130 and its distributed software architecture for scalable and resilient operations, management, troubleshooting, and analytics.
Each one of the network devices of network system 100, e.g., servers 110, 116, 118 and/or 128, network devices 122, APs 142, UEs 148, switches 126 and/or 146, and any other servers or devices attached to or forming part of network system 100, may include a system log or an error log module wherein each one of these network devices records the status of the network device including normal operational status and error conditions. Throughout this disclosure, one or more devices of network system 100, e.g., servers 110, 116, 118 and/or 128, switches 126, APs 142, UEs 148, and switches 146, may be considered “third-party” network devices when owned by and/or associated with a different entity than NMS 130 such that NMS 130 does not receive, collect, or otherwise have access to the recorded status and other data of the third-party network devices. In some examples, edge devices 150 may provide a proxy through which the recorded status and other data of the third-party network devices may be reported to NMS 130. In some examples, one or more devices of network system 100, e.g., servers 110, 116, 118 and/or 128, switches 126, APs 142, UEs 148, and switches 146, may be owned by and/or associated with the same entity that is associated with NMS 130. In some examples, network devices 122 may be owned by and/or associated with the same entity that is associated with NMS 130.
NMS 130 may include processing circuitry 132 and memory 133. Processing circuitry 132 may include fixed function circuitry and/or programmable processing circuitry. Processing circuitry 132 may include any one or more of a microprocessor, a controller, a digital signal processor (DSP), graphics processing unit (GPU), tensor processing unit (TPU), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or equivalent discrete or analog logic circuitry. In some examples, processing circuitry 132 may include multiple components, such as any combination of one or more microprocessors, one or more controllers, one or more DSPs, GPUs, TPUs, one or more ASICs, or one or more FPGAs, as well as other discrete or integrated logic circuitry, which may be physically located in one or more devices in one or more physical locations.
Processing circuitry 132 may be capable of processing instructions stored in memory 133. In some examples, memory 133 includes a computer-readable medium that includes instructions that, when executed by processing circuitry 132, cause NMS 130 and processing circuitry 132 to perform various functions attributed to them herein. Memory 133 may include any volatile, non-volatile, magnetic, optical, or electrical media, such as a random-access memory (RAM), read-only memory (ROM), non-volatile RAM (NVRAM), electrically erasable programmable ROM (EEPROM), ferroelectric RAM (FRAM), dynamic random-access memory (DRAM), flash memory, or any other digital media.
In computer networking, a “black hole” is a kind of network failure where the network discards, or “drops,” incoming or outgoing network traffic along an end-to-end pathway between a source device and a destination device without alerting the source device that the network traffic did not reach its intended recipient. Network failures such as black holes may interrupt a session between a source device and a destination device such that the destination device ceases to provide one or more services associated with the network traffic to the source device. Many factors may contribute to network failures such as black holes, including, as examples, a misconfiguration of one or more forwarding path elements (e.g., next hops) of a network device, a failure of a WAN link, a failure of a server, or a failure of any connection between nodes of the network.
In some examples, processing circuitry 132 is configured to use virtual network assistant (VNA) 134 and machine learning model 135 to identify one or more network failures occurring outside of the networks managed by NMS 130, such as within external network 125 not managed by NMS 130. Network devices 122 and NMS 130 may be part of the same internal network in that an administrator of the internal network controls both network devices 122 and NMS 130. In some examples, sites 102 including APs 142, switches 146, and UEs 148 are outside of the internal network corresponding to network devices 122 managed by NMS 130. One or more of switches 146 may, in some cases, be within the internal network corresponding to network devices 122 and NMS 130. In some examples, switches 126 and servers 128 may be located outside of the internal network corresponding to network devices 122 and NMS 130. In some examples, one or more of switches 126 may be located within the internal network corresponding to network devices 122 and NMS 130.
An administrator of the internal network may control network devices 122 to form a plurality of network paths that connect one or more UEs 148 with one or more servers 128. Since servers 128 may be configured to provide one or more services to UEs 148, it may be beneficial for network devices 122 to form the plurality of network paths to connect servers 128 with UEs 148. An end-to-end pathway between a UE of UEs 148 and a server of servers 128 may include one or more pathways through the internal network including network devices 122 and one or more pathways through external networks that are isolated from the internal network corresponding to network devices 122 and NMS 130. In the example where sites 102, switches 126, and servers 128 are located outside of the internal network corresponding to network devices 122 and NMS 130, an end-to-end pathway between UE 148A-1 and server 128A may include one or more pathways through external networks isolated from the internal network and between UE 148A-1 and network devices 122 via APs 142 and switch 146A, a pathway through network devices 122 within the internal network, and one or more pathways outside of the internal network between network devices 122 and server 128A via switches 126 and connections 129.
Although network devices 122 may form the plurality of network paths through network 120 to connect servers 128 with UEs 148, one or more network failures may occur outside of the internal network 120 associated with network devices 122 and NMS 130 that interrupt one or more sessions between servers 128 and UEs 148 via network devices 122. That is, a network failure may occur in a device, interface, or network connection that is not controlled by the administrator of the internal network associated with network devices 122 and NMS 130. For example, switches 126 and servers 128 may be outside of the internal network associated with network devices 122 and NMS 130. This means that switches 126, servers 128, and connections 129A-129D between switches 126 and servers 128 may be outside of the internal network. Network failures occurring at switches 126, servers 128, or connections 129 may interrupt one or more sessions between servers 128 and UEs 148 via the plurality of network connections formed by network devices 122.
In some examples, processing circuitry 132 of NMS 130 may be configured to apply machine learning model 135 of VNA 134 to identify one or more network failures located outside of the internal network 120 associated with network devices 122 and NMS 130. For example, processing circuitry 132 may apply machine learning model 135 to identify one or more network failures occurring at switches 126, servers 128, connections 129, or any other device or connection outside of the internal network associated with network devices 122 and NMS 130 based on data received from within the internal network 120. By applying machine learning model 135 to identify network failures occurring outside of the internal network 120 based on data received from inside the internal network 120, NMS 130 may improve a manner in which network devices 122 facilitate sessions between servers 128 and UEs 148 as compared with network systems that do not use a machine learning model to identify network failures outside of the internal network.
NMS 130 may leverage machine learning model 135 and network features to detect network failures such as black holes and remediate or avoid a detected network failure. An example feature may relay to customers what routing of a packet looks like, e.g., how the packet is flowing, next hop, services, etc. Using machine learning model 135 may improve a speed and/or an accuracy at which NMS 130 detects one or more network failures as compared with systems that do not use a machine learning model to detect network failures. For example, systems that use heartbeat mechanisms to detect network failures might not detect network failures as quickly or as accurately or to as high a degree of confidence as NMS 130 detects network failures using machine learning model 135. This is because machine learning model 135 may detect network failures by analyzing traffic and session data 136 received from network devices 122, which includes significantly more data than merely heartbeat signals. Furthermore, machine learning model 135 may be trained using a large cache of training data 138 so that machine learning model 135 is configured to recognize a plurality of network data patterns associated with network failure and/or a plurality of network data patterns associated with healthy network operation.
To apply machine learning model 135 to detect a network failure, processing circuitry 132 may be configured to determine that one or more network devices of network devices 122 are unable to exchange network traffic with a server of servers 128 located outside of the internal network. For example, processing circuitry 132 may apply machine learning model 135 to determine that network device 122I is unable to communicate with server 128A via switch 126A and connection 129A. When network device 122I is not able to communicate with server 128A, this means that there may be a network failure corresponding to switch 126A, server 128A, connection 129A, or any combination thereof. In some examples, processing circuitry 132 may be configured to apply machine learning model 135 to identify a location of a network failure outside of the internal network. For example, machine learning model 135 may output information indicating that the network failure is present at connection 129A, switch 126A, or another location. In some examples, when NMS 130 applies machine learning model 135 to detect a network failure, the machine learning model 135 may output a detected network failure with a confidence score indicating a confidence that the machine learning model 135 accurately detected the network failure.
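As one illustrative, non-limiting sketch of how a detection might be paired with a confidence score, the following Python example scores the traffic between one network device and one server with a simple logistic function. The weights, bias, feature names, and the `Detection` and `detect` identifiers are hypothetical stand-ins chosen for illustration; they are not the trained machine learning model 135, whose structure this disclosure does not limit.

```python
import math
from dataclasses import dataclass


@dataclass
class Detection:
    device_id: str      # reporting device inside the internal network
    server_id: str      # destination outside the internal network
    failure: bool       # True when the score meets the threshold
    confidence: float   # score in [0, 1]; higher means failure is more likely


# Hypothetical weights; a trained model would learn its own parameters.
WEIGHTS = {"rx_tx_ratio": -6.0, "syn_ack_ratio": -4.0}
BIAS = 4.0


def detect(device_id, server_id, tx, rx, syn_sent, syn_ack_received, threshold=0.5):
    """Score one device/server pair from Tx/Rx and TCP handshake counters."""
    # Ratios near 1.0 suggest the server is answering; ratios near 0.0 suggest it is not.
    rx_tx_ratio = rx / tx if tx else 1.0
    syn_ack_ratio = syn_ack_received / syn_sent if syn_sent else 1.0
    z = (BIAS
         + WEIGHTS["rx_tx_ratio"] * rx_tx_ratio
         + WEIGHTS["syn_ack_ratio"] * syn_ack_ratio)
    score = 1.0 / (1.0 + math.exp(-z))  # logistic function
    return Detection(device_id, server_id, score >= threshold, score)


suspect = detect("122I", "128A", tx=9000, rx=0, syn_sent=12, syn_ack_received=0)
healthy = detect("122I", "128B", tx=4000, rx=3800, syn_sent=5, syn_ack_received=5)
```

In this sketch, traffic that is transmitted but never answered yields a score near one, while a responsive server yields a score near zero. The 0.5 threshold is likewise an assumption.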
A network failure located outside of the internal network may, in some examples, be located along one or more pathways through external networks that are part of one or more end-to-end pathways that also include a pathway through the internal network including network devices 122. That is, network traffic may travel through a pathway of the internal network including network devices 122 to a pathway through an external network that includes a network failure which prevents the network traffic from traversing an end-to-end pathway that includes both the internal network pathway and the external network pathway. Even though an administrator of the internal network corresponding to network devices 122 and NMS 130 might have control over network devices 122 and the internal network pathways without having control over the external network pathways, it might be possible for the administrator to reroute network traffic through network devices 122 to avoid a network failure within an external network pathway and/or reconfigure internal network pathways so that the network traffic avoids the network failure within the external network pathway. NMS 130 is not limited to rerouting network traffic in response to detecting a network failure. In some examples, NMS 130 may, in response to detecting a network failure, control network devices 122 to avoid the forwarding of network traffic to an address associated with the network failure, withdraw a route to an address associated with the network failure, send a diagnostic message to an address associated with the network failure to confirm the network failure within the external network, output an indication or notification of the network failure to an administrator, or any combination thereof.
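The corrective actions listed above could, as one hedged illustration, be selected by a small policy function. In the Python sketch below, the `Action` enumeration mirrors the actions named in this description, while the `choose_actions` function, its confidence thresholds (0.5 and 0.9), and the mapping from confidence to actions are assumptions introduced only for illustration.

```python
from enum import Enum


class Action(Enum):
    NOTIFY_ADMIN = "notify_admin"          # output a notification to an administrator
    SEND_DIAGNOSTIC = "send_diagnostic"    # probe the address to confirm the failure
    AVOID_FORWARDING = "avoid_forwarding"  # stop forwarding to the affected address
    WITHDRAW_ROUTE = "withdraw_route"      # withdraw the route to the affected address
    REROUTE = "reroute"                    # steer traffic onto another pathway


def choose_actions(confidence, alternate_path_available):
    """Pick corrective actions for a detected failure (hypothetical policy).

    Below 0.5 nothing is done; between 0.5 and 0.9 the failure is reported
    and probed; at 0.9 or above traffic is rerouted if possible, otherwise
    the route is withdrawn. AVOID_FORWARDING is listed for completeness but
    is not used by this particular policy.
    """
    if confidence < 0.5:
        return []
    actions = [Action.NOTIFY_ADMIN]
    if confidence < 0.9:
        actions.append(Action.SEND_DIAGNOSTIC)
    elif alternate_path_available:
        actions.append(Action.REROUTE)
    else:
        actions.append(Action.WITHDRAW_ROUTE)
    return actions
```

Other policies are equally possible, including applying any combination of the actions regardless of confidence.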
In some examples, it may be beneficial to reroute network traffic to server 128A or another server via an external network pathway that does not have a network failure. For example, when processing circuitry 132 applies machine learning model 135 to determine that there is a network failure at connection 129A which prevents network device 122I from communicating with server 128A, processing circuitry 132 may route network traffic from network device 122I to server 128A via switch 126B and connection 129B instead of routing network traffic via switch 126A and connection 129A. In another example, the same service provider may provide at least some of the same services to UEs 148 using both server 128A and server 128B. In this example, based on determining that there is a network failure at connection 129A, processing circuitry 132 may route network traffic from network device 122I to server 128B via switch 126C and connection 129C instead of routing network traffic to server 128A via switch 126A and connection 129A. In any case, based on applying machine learning model 135 to identify one or more network failures located outside of the internal network, NMS 130 may control network devices 122 to reroute traffic to avoid network failures outside of the internal network, thereby avoiding or reducing interruptions of one or more sessions between UEs 148 and servers 128.
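The two rerouting examples above may be sketched as a path-selection routine. The following Python example is a minimal illustration under stated assumptions: the `select_path` function, the dictionary layout of each candidate pathway, and the use of reference numerals as string identifiers are all hypothetical and are not drawn from any particular implementation.

```python
def select_path(candidates, failed, preferred_server, equivalent_servers=()):
    """Return the first candidate pathway that avoids every failed element.

    candidates: list of dicts with "switch", "connection", and "server" keys.
    failed: set of identifiers the model has associated with a failure.
    Pathways to the preferred server are tried first, then pathways to
    servers assumed to offer the same services. Returns None if none is usable.
    """
    def is_healthy(path):
        elements = {path["switch"], path["connection"], path["server"]}
        return not (elements & failed)

    for server in (preferred_server, *equivalent_servers):
        for path in candidates:
            if path["server"] == server and is_healthy(path):
                return path
    return None


# External pathways reachable from network device 122I in the examples above.
candidates = [
    {"switch": "126A", "connection": "129A", "server": "128A"},
    {"switch": "126B", "connection": "129B", "server": "128A"},
    {"switch": "126C", "connection": "129C", "server": "128B"},
]

same_server = select_path(candidates, {"129A"}, "128A")
other_server = select_path(candidates, {"129A", "129B"}, "128A", ("128B",))
```

With a failure at connection 129A only, the sketch selects the pathway through switch 126B and connection 129B. When connection 129B has also failed, it falls back to server 128B through switch 126C and connection 129C.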
Rerouting network traffic is not the only remedial action that NMS 130 is configured to take in response to identifying one or more network failures. In some examples, NMS 130 may output one or more messages in response to identifying one or more network failures. For example, when NMS 130 identifies a network failure associated with server 128A, NMS 130 may output a message (e.g., a ping) to server 128A and/or a service provider associated with server 128A indicating that NMS 130 detected a network failure. In some examples, NMS 130 may save information indicating the detected network failure to memory 133.
In some examples, NMS 130 receives network traffic and session data 136 from network devices 122. In some examples, the term “network traffic and session data” may be referred to herein as “network data.” Network traffic and session data 136 may, in some examples, include network traffic and session data corresponding to each network device of network devices 122. In some examples, the network traffic and session data corresponding to each network device of network devices 122 may include network traffic transmission (Tx) data indicating an amount of network traffic transmitted from an interface of the network device and network traffic reception (Rx) data indicating an amount of network traffic received by the network interface of the network device. Additionally, or alternatively, the network traffic and session data corresponding to each network device of network devices 122 may include transmission control protocol (TCP) data indicating an amount of TCP synchronize (SYN) network traffic received by and/or transmitted from the network device, an amount of TCP synchronize-acknowledge (SYN-ACK) network traffic received by and/or transmitted from the network device, and an amount of TCP acknowledge (ACK) network traffic received by and/or transmitted from the network device. Additionally, or alternatively, the network traffic and session data corresponding to each network device of network devices 122 may include new session data indicating a number of new sessions established via the interface of the network device, or any combination thereof.
If any internet protocol (IP) address within network system 100 is unreachable, this may cause a change in network traffic transmission data, network traffic reception data, TCP data, new session data, or any combination thereof. For example, when Tx data is increasing and Rx data is decreasing or nonexistent with respect to a device associated with the IP address, this may indicate one or more network failures corresponding to the device. In another example, when there is an increase in one or more TCP resets with respect to a device associated with the IP address, this may indicate one or more network failures corresponding to the device. In some examples, one or more failures to establish a new session involving a device associated with the IP address may indicate one or more network failures corresponding to the device.
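The indications described above can be illustrated with a few hand-written checks over counters sampled in two consecutive intervals. The Python sketch below is not machine learning model 135; it only shows the kind of signals such a model might take as input. The `failure_signals` function and the counter names (`tx`, `rx`, `tcp_resets`, `session_attempts`, `new_sessions`) are hypothetical.

```python
def failure_signals(prev, curr):
    """Return names of signals suggesting that a destination is unreachable.

    prev and curr are dicts of counters for one destination IP address,
    sampled over two consecutive intervals (hypothetical keys).
    """
    signals = []
    # Tx increasing while Rx is decreasing or nonexistent.
    if curr["tx"] > prev["tx"] and (curr["rx"] == 0 or curr["rx"] < prev["rx"]):
        signals.append("tx_up_rx_down")
    # Increase in TCP resets.
    if curr["tcp_resets"] > prev["tcp_resets"]:
        signals.append("tcp_resets_up")
    # Sessions were attempted but none was established.
    if curr["session_attempts"] > 0 and curr["new_sessions"] == 0:
        signals.append("no_new_sessions")
    return signals


previous = {"tx": 1000, "rx": 900, "tcp_resets": 1,
            "session_attempts": 10, "new_sessions": 10}
degraded = {"tx": 1500, "rx": 0, "tcp_resets": 6,
            "session_attempts": 12, "new_sessions": 0}
steady = {"tx": 1100, "rx": 1000, "tcp_resets": 1,
          "session_attempts": 10, "new_sessions": 9}

degraded_signals = failure_signals(previous, degraded)
steady_signals = failure_signals(previous, steady)
```

Any one signal alone may have a benign cause, which is one reason a trained model that weighs several signals together may be preferable to fixed rules.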
In some examples, all of the edge routers within an internal network corresponding to network devices 122 and NMS 130 use the same gateway IP address to reach a destination device (e.g., a server of servers 128). When machine learning model 135 determines that a connection, device, or server outside of the internal network is associated with a network failure, NMS 130 may notify devices within the internal network to stop using the connection, device, or server associated with the IP address to communicate with a destination device. In some examples, NMS 130 may reroute network traffic to the destination device via another external network pathway that is not associated with an identified network failure. In some examples, NMS 130 may reroute network traffic to another server that is configured to provide the same or similar services as the destination device associated with the network failure.
In some examples, the network traffic and session data received from a network device of network devices 122 may be organized according to a respective server of servers 128 associated with the data. For example, network traffic and session data received from network device 122I may include a set of data including Tx data indicating an amount of network traffic transmitted from an interface of network device 122I to server 128A and Rx data indicating an amount of network traffic received by network device 122I from server 128A. Additionally, or alternatively, network traffic and session data received from network device 122I may include TCP data indicating a number of TCP SYN packets sent by network device 122I to server 128A, a number of TCP SYN-ACK packets received by network device 122I from server 128A, a number of TCP ACK packets sent by network device 122I to server 128A, or any combination thereof. In some examples, network traffic and session data received from network device 122I may include new session data indicating a number of new sessions established with server 128A via network device 122I according to any communication session protocol including TCP, UDP, TLS, ICMP, or any combination thereof.
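As a minimal sketch of how such per-device, per-server data might be organized, the following Python example defines a record type and groups records by server. The names NetworkDataRecord and group_by_server and all field names are hypothetical and are not taken from this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NetworkDataRecord:
    """One report from a network device about its traffic toward one server.

    All field names are illustrative assumptions.
    """
    device_id: str          # reporting network device, e.g. "122I"
    server_id: str          # server the data is associated with, e.g. "128A"
    tx_bytes: int           # traffic transmitted toward the server
    rx_bytes: int           # traffic received from the server
    syn_sent: int           # TCP SYN packets sent to the server
    syn_ack_received: int   # TCP SYN-ACK packets received from the server
    ack_sent: int           # TCP ACK packets sent to the server
    new_sessions: int       # new sessions established with the server


def group_by_server(records):
    """Organize records according to the server each record describes."""
    grouped = {}
    for record in records:
        grouped.setdefault(record.server_id, []).append(record)
    return grouped
```

Grouping by server keeps all observations about one destination together, so that reports from several network devices can be compared for the same server.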
Since network traffic and session data 136 may include a variety of data including data indicating levels of network traffic and amounts of new sessions, network traffic and session data 136 may indicate a manner and extent to which each network device of network devices 122 communicates with each server of servers 128 via each end-to-end pathway of a plurality of end-to-end pathways between servers 128 and UEs 148 via network devices 122. In some examples, network devices 122 and NMS 130 may be located within the same internal network and the plurality of internal network pathways through network devices 122 may interface with a plurality of external network pathways outside of the internal network to form a plurality of end-to-end pathways between servers 128 and UEs 148. That is, an end-to-end pathway between a server of servers 128 and a UE of UEs 148 may include one or more internal network pathways through network devices 122 and one or more external network pathways outside of the internal network.
To apply machine learning model 135, processing circuitry 132 of NMS 130 may use network traffic and session data 136 as an input to machine learning model 135 and machine learning model 135 may generate an output based on network traffic and session data 136 that indicates one or more network failures outside of the internal network corresponding to network devices 122 and NMS 130. That is, machine learning model 135 may identify one or more network failures outside of the internal network based on data received from network devices 122 located within the internal network.
Machine learning model 135 may, in some cases, identify a network failure based on one or more trends in network traffic and session data 136. For example, when a trend in the Tx data and the Rx data of the network traffic and session data 136 indicates high levels of network traffic transmission from one or more of network devices 122 to server 128B and low levels of network traffic reception by the one or more of network devices 122 from server 128B, this may indicate that there is a network failure between the one or more of network devices 122 and server 128B. In another example, when a trend in the TCP data of network traffic and session data 136 indicates high levels of TCP SYN messages transmitted to server 128B and low levels of TCP SYN-ACK messages received from server 128B, this may indicate that a TCP session between network devices 122 and server 128B failed to establish, which may indicate a network failure. In another example, when a trend in the new session data of network traffic and session data 136 indicates a low number of new sessions established between server 128B and UEs 148 via network devices 122 of the internal network, this may indicate that there is a network failure between network devices 122 and server 128B.
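The three trends described above can be illustrated with a simple rule-based check. This sketch is not the machine learning model of this disclosure; it is a hand-written stand-in that shows which signals such a model might rely on. The function name failure_signals and all threshold values are assumptions chosen for illustration only.

```python
def failure_signals(tx_bytes, rx_bytes, syn_sent, syn_ack_received,
                    new_sessions, min_rx_ratio=0.05,
                    min_handshake_ratio=0.5, min_new_sessions=1):
    """Return the names of failure indicators present for one server.

    Thresholds are illustrative assumptions, not values from the disclosure.
    """
    signals = []
    # High transmission toward the server with little or no reception from it.
    if tx_bytes > 0 and rx_bytes / tx_bytes < min_rx_ratio:
        signals.append("tx_rx_asymmetry")
    # Many TCP SYN messages sent, few SYN-ACK messages received in response.
    if syn_sent > 0 and syn_ack_received / syn_sent < min_handshake_ratio:
        signals.append("unanswered_syn")
    # Sessions were attempted, but few or none were established.
    if syn_sent > 0 and new_sessions < min_new_sessions:
        signals.append("few_new_sessions")
    return signals
```

For example, calling failure_signals(10000, 0, 50, 2, 0) reports all three indicators, while a server that answers normally reports none.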
In some examples, machine learning model 135 may detect one or more network failures that are likely to occur at a future time based on network traffic and session data 136. For example, machine learning model 135 may determine that a probability that a network failure will occur at connection 129A is greater than a probability threshold. NMS 130 may reroute network traffic based on determining that the probability that a network failure will occur at connection 129A is greater than the probability threshold.
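The threshold comparison and rerouting decision might be sketched as follows. The function select_pathway, the default threshold of 0.8, and the dictionary of predicted probabilities are hypothetical; the disclosure does not specify a threshold value or a selection order.

```python
def select_pathway(failure_probability, pathways, threshold=0.8):
    """Return the first candidate pathway whose predicted failure
    probability does not exceed the threshold, or None if every
    candidate exceeds it.

    failure_probability maps a pathway identifier to a predicted
    probability; pathways with no prediction are treated as 0.0
    (an assumption made for this sketch).
    """
    for pathway in pathways:
        if failure_probability.get(pathway, 0.0) <= threshold:
            return pathway
    return None
```

With predictions of 0.93 for a first connection and 0.10 for a second, the function skips the first and returns the second.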
Memory 133 of NMS 130 may be configured to store training data 138. In some examples, training data 138 comprises a plurality of sets of training network data. In some examples, each set of training network data of the plurality of sets of training network data may include data similar in kind to the network traffic and session data 136 received from network devices 122. For example, each set of training network data of the plurality of sets of training network data may include network traffic transmission data, network traffic reception data, TCP data, new session data, or any combination thereof corresponding to a network device. In some examples, each set of training network data of the plurality of sets of training network data is associated with network status information indicating whether the set of training network data is associated with one or more network failure events.
Processing circuitry 132 of NMS 130 may be configured to train, using supervised learning, machine learning model 135 based on the plurality of sets of training network data and the network status information corresponding to each set of training network data of the plurality of sets of training network data. In some examples, to train machine learning model 135, processing circuitry 132 may train machine learning model 135 to recognize one or more patterns in the training data that are associated with network failure and/or one or more patterns in the training data that are not associated with network failure.
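One way to picture supervised training on labeled sets of network data is the small logistic regression sketch below, written in plain Python. The choice of logistic regression, the two features (an Rx/Tx ratio and a SYN-ACK/SYN ratio), and the function names are all assumptions made for illustration; the disclosure does not state which model type or features are used.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, epochs=2000, learning_rate=0.5):
    """Fit a logistic regression by batch gradient descent.

    features: list of feature vectors; labels: 1 for sets of training
    network data associated with a network failure, 0 otherwise.
    """
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_failure_probability(model, x):
    """Return the modeled probability that x is associated with a failure."""
    weights, bias = model
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
```

Trained on a few labeled examples in which failures show low ratios and healthy operation shows ratios near one, the model assigns a high failure probability to new low-ratio inputs.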
NMS 130 may train machine learning model 135 using historical data of route parameters on network device interfaces (e.g., gateway routers and/or destination IP addresses) at an internal network-level in order to determine baseline or average levels of one or more route parameters (e.g., Rx data, Tx data, TCP data, and new session data) for the internal network. In some examples, NMS 130 may train machine learning model 135 to identify spikes in route parameters as compared with baseline levels that are indicative of network failures including black holes. This means that machine learning model 135 may leverage patterns associated with network failures that machine learning model 135 learns when trained with training data 138 to identify network failures. For example, to learn a pattern corresponding to network failures such as black holes, machine learning model 135 may analyze average uplink and downlink at one or more points in time, analyze new session data, analyze TCP resets, or any combination thereof.
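The comparison of a route parameter against its baseline can be sketched with a z-score test. This is a simple statistical stand-in, not the trained model described above; the function name is_spike and the threshold of three standard deviations are assumptions.

```python
from statistics import mean, pstdev


def is_spike(history, current, z_threshold=3.0):
    """Report whether current deviates from the historical baseline
    by more than z_threshold population standard deviations."""
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return current != baseline
    return abs(current - baseline) / spread > z_threshold
```

For a history of roughly ten TCP resets per interval, a reading of 40 is flagged as a spike, while a reading of 11 is not.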
Processing circuitry 132 may, in some examples, receive information indicating whether the machine learning model 135 accurately detected a network failure based on network traffic and session data 136. Processing circuitry 132 may augment the training data 138 with the network data and network status information indicating whether machine learning model 135 accurately detected the network failure based on the network traffic and session data 136. In some cases, processing circuitry 132 may re-train, using supervised learning, the machine learning model 135 based on the augmented training data. That is, processing circuitry 132 may continuously re-train machine learning model 135 based on predictions and determinations that machine learning model 135 makes.
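The augmentation step might look like the following sketch, in which feedback about a prediction is converted into a labeled training example. The function name and the tuple layout of a training example are hypothetical.

```python
def augment_training_data(training_data, network_data,
                          predicted_failure, prediction_was_correct):
    """Return a new training set extended with one labeled example.

    The label is the actual network status: the prediction itself when
    the feedback confirms it, and its opposite otherwise.
    """
    actual_failure = (predicted_failure if prediction_was_correct
                      else not predicted_failure)
    return training_data + [(network_data, actual_failure)]
```

Returning a new list leaves the original training data unchanged, so a re-training run can be compared against the earlier data set.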
In accordance with one specific implementation, a computing device is part of NMS 130. In accordance with other implementations, NMS 130 may comprise one or more computing devices, dedicated servers, virtual machines, containers, services, or other forms of environments for performing the techniques described herein. Similarly, computational resources and components implementing VNA 134 may be part of the NMS 130, may execute on other servers or execution environments, or may be distributed to nodes within network 120 (e.g., routers, switches, controllers, gateways, and the like).
Although the techniques of the present disclosure are described in this example as performed by NMS 130, techniques described herein may be performed by any other computing device(s), system(s), and/or server(s), and the disclosure is not limited in this respect. For example, one or more computing device(s) configured to execute the functionality of the techniques of this disclosure may reside in a dedicated server or be included in any other server in addition to or other than NMS 130, or may be distributed throughout network system 100, and may or may not form a part of NMS 130.
As described herein, NMS 130 provides an integrated suite of management tools and implements various techniques of this disclosure. In general, NMS 130 may provide a cloud-based platform for wireless network data acquisition, monitoring, activity logging, reporting, predictive analytics, network anomaly identification, and alert generation. For example, NMS 130 may be configured to proactively monitor and adaptively configure network system 100 so as to provide self-driving capabilities. Moreover, VNA 134 includes a natural language processing engine to provide AI-driven support and troubleshooting, anomaly detection, artificial intelligence (AI)-driven location services, and AI-driven radio frequency (RF) optimization with reinforcement learning. For example, VNA 134 includes a machine learning model 135 configured to identify one or more network failures based on network traffic and session data 136 stored in memory 133 of NMS 130. In some examples, processing circuitry 132 may train machine learning model 135 using training data 138.
As illustrated in the example of
In some examples, underlying routers of SD-WAN 177 may implement a stateful, session-based routing scheme in which the routers 187A, 187B dynamically modify contents of original packet headers sourced by UEs 148 to steer traffic along selected paths, e.g., path 189, toward application services 181 without requiring use of tunnels and/or additional labels. In this way, routers 187A, 187B may be more efficient and scalable for large networks since the use of tunnel-less, session-based routing may enable routers 187A, 187B to conserve considerable network resources by obviating the need to perform encapsulation and decapsulation at tunnel endpoints. Moreover, in some examples, each router 187A, 187B may independently perform path selection and traffic engineering to control packet flows associated with each session without requiring use of a centralized SDN controller for path selection and label distribution. In some examples, routers 187A, 187B implement session-based routing as Secure Vector Routing (SVR), provided by Juniper Networks, Inc.
In some examples, SD-WAN 177 including routers 187A, 187B and path 189 may be an example of network devices 122 of
In some examples, NMS 130 may enable intent-based configuration and management of network system 100, including enabling construction, presentation, and execution of intent-driven workflows for configuring and managing devices associated with wireless networks 106, wired LANs 175, and/or SD-WAN 177. For example, declarative requirements express a desired configuration of network components without specifying an exact native device configuration and control flow. By utilizing declarative requirements, what should be accomplished may be specified rather than how it should be accomplished. Declarative requirements may be contrasted with imperative instructions that describe the exact device configuration syntax and control flow to achieve the configuration. By utilizing declarative requirements rather than imperative instructions, a user and/or user system is relieved of the burden of determining the exact device configurations required to achieve a desired result of the user/system. For example, it is often difficult and burdensome to specify and manage exact imperative instructions to configure each device of a network when various different types of devices from different vendors are utilized. The types and kinds of devices of the network may dynamically change as new devices are added and device failures occur. Managing various different types of devices from different vendors with different configuration protocols, syntax, and software versions to configure a cohesive network of devices is often difficult to achieve. Thus, by only requiring a user/system to specify declarative requirements that specify a desired result applicable across various different types of devices, management and configuration of the network devices becomes more efficient.
Typically, NMS 130 may provide network management services, such as management, monitoring and automated oversight services, for SD-WAN 177, based on network data received from the components of SD-WAN 177, such as routers 187. Conventionally, an NMS may not receive network data for external or third-party networks, such as from a cloud-service provider network supporting cloud-based application services 181. Therefore, a conventional NMS may not have insight into the causes of network failures within such an external network, and may therefore have limited ability to remediate failures occurring within such external networks that adversely affect the forwarding of traffic within SD-WAN 177, such as the blackholing of customer traffic occurring within such external networks.
In accordance with the techniques described in this disclosure, NMS 130 is configured to receive network traffic and session data 136 from the plurality of routers including routers 187A, 187B. In some examples, network traffic and session data 136 may include sets of data from each of the plurality of routers including routers 187A, 187B within SD-WAN 177. Processing circuitry 132 may apply machine learning model 135 of VNA 134 to identify one or more network failures occurring outside of SD-WAN 177, such as within a network supporting cloud-based application services 181, based on network traffic and session data 136 received from the plurality of routers including routers 187A, 187B within the internal network. Processing circuitry 132 of NMS 130 may be configured to train machine learning model 135 based on training data 138.
Network 121A is depicted as having two network devices, network 121B is depicted as having two network devices, network 121C is depicted as having two network devices, and network 121D is depicted as having three network devices for ease of illustration, but each of networks 121 may include any number of network devices. The configuration of sites 102, network devices 122, switches 126, and servers 128 illustrated in
Connections 123 may form a plurality of internal network pathways through network devices 122. For example, one internal network pathway exists between sites 102 and switch 126A via connection 123A, network device 122A, connection 123B, network device 122D, connection 123D, network device 122E, network device 122F, connection 123F, network device 122G, network device 122H, network device 122I, and connection 123G. Another internal network pathway exists between sites 102 and switch 126C via connection 123A, network device 122A, connection 123C, network device 122C, connection 123E, network device 122E, and connection 123I. As seen in
In some examples, internal network pathways through network devices 122 via connections 123 may be located within an internal network corresponding to network devices 122 and NMS 130. That is, an administrator of internal networks 121 (or NMS 130) may configure network devices 122 and the internal network pathways through network devices 122 via connections 123. When one or more network failures happen at network devices 122 and/or connections 123, the administrator of internal networks 121 (or NMS 130) may resolve these network failures by controlling network devices 122 and/or connections 123. When one or more network failures occur outside of internal networks 121 (such as behind switches 126 with respect to network devices 122), it may be beneficial for NMS 130 to detect these failures and re-route network traffic through network devices 122 to avoid network failures outside of the internal network. This is because the administrator (or NMS 130) of internal networks 121 and network devices 122 might not have control over one or more devices or connections that are a source of a network failure outside of the internal network, but the administrator does have control over network traffic traveling through internal networks 121.
In some examples, sites 102, switches 126, servers 128, and connections 129 may be located outside of the internal network corresponding to network devices 122 and NMS 130. Sites 102, switches 126, servers 128, and connections 129 may, in some examples, form a plurality of external network pathways that, together with the internal network pathways through network devices 122, form a plurality of end-to-end pathways between UEs 148 and servers 128. For example, an end-to-end pathway between UE 148A-1 and server 128B may include an external network pathway extending from UE 148A-1 to network devices 122 via APs 142 and switch 146A, an internal network pathway through network devices 122 via connection 123A, network device 122A, connection 123C, network device 122C, connection 123E, network device 122E, connection 123I, switch 126C, and connection 129C. There are many other examples of end-to-end pathways that connect sites 102 and servers 128 as illustrated in
NMS 130 may apply machine learning model 135 of
Control unit 202 may comprise routing engine 204 and packet forwarding engine 222. Routing engine 204 operates as the control plane for network device 200 and includes an operating system that provides a multi-tasking operating environment for execution of a number of concurrent processes. Routing engine 204 communicates with other routers, e.g., such as network devices 122 of
RIB 206 may describe a topology of the computer network in which network device 200 resides, and may also include routes through the shared trees in the computer network. RIB 206 describes various routes within the computer network, and the appropriate next hops for each route, i.e., the neighboring routing devices along each of the routes. Routing engine 204 analyzes information stored in RIB 206 and generates forwarding information for packet forwarding engine 222, stored in forwarding information base (FIB) 224. FIB 224 may associate, for example, network destinations with specific next hops and corresponding IFCs 226 and physical output ports for output links 230. FIB 224 may be a radix tree programmed into dedicated forwarding chips, a series of tables, a complex database, a linked list, a flat file, or various other data structures.
FIB 224 may also include lookup structures. Lookup structures may, given a key, such as an address, provide one or more values. In some examples, the one or more values may be one or more next hops. A next hop may be implemented as microcode, which when executed, performs one or more operations. One or more next hops may be “chained,” such that a set of chained next hops perform a set of operations for respective different next hops when executed. Examples of such operations may include applying one or more services to a packet, dropping a packet, and/or forwarding a packet using an interface identified by the one or more next hops.
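A lookup structure that maps an address to a chain of next-hop operations can be sketched as below. This example uses a linear scan with longest-prefix matching from Python's standard ipaddress module rather than a radix tree or forwarding-chip microcode, and the class Fib, the operation functions, and the interface name are hypothetical.

```python
import ipaddress


def count_packet(packet):
    """Example service applied to a packet before forwarding."""
    packet["counted"] = True
    return packet


def forward_via(interface):
    """Build an operation that forwards a packet using the given interface."""
    def operation(packet):
        packet["out_interface"] = interface
        return packet
    return operation


def drop(packet):
    """Operation that drops the packet."""
    return None


class Fib:
    def __init__(self):
        self._routes = []  # list of (network, chain of operations)

    def add_route(self, prefix, chain):
        self._routes.append((ipaddress.ip_network(prefix), chain))

    def lookup(self, address):
        """Return the chain for the longest matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [(net, chain) for net, chain in self._routes if addr in net]
        if not matches:
            return None
        return max(matches, key=lambda m: m[0].prefixlen)[1]


def process(fib, packet):
    """Run the chained next-hop operations for the packet's destination."""
    chain = fib.lookup(packet["dst"])
    if chain is None:
        return None
    for operation in chain:
        packet = operation(packet)
        if packet is None:  # a drop ends the chain
            return None
    return packet
```

Each chain is an ordered list, so a service such as counting runs before the forwarding operation, and a drop operation ends processing for that packet.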
Network traffic transmission data 232 may include information corresponding to an amount of network traffic that network device 200 forwards corresponding to each endpoint device of a set of endpoint devices. For example, network traffic transmission data 232 may include information corresponding to an amount of network traffic that network device 200 forwards along one or more end-to-end pathways where server 128A of
Network traffic reception data 234 may include information corresponding to an amount of network traffic that network device 200 receives from each endpoint device of a set of endpoint devices. For example, network traffic reception data 234 may include information corresponding to an amount of network traffic that network device 200 receives from server 128A along one or more end-to-end pathways where server 128A of
Network traffic transmission data 232 and network traffic reception data 234 may indicate one or more network failures associated with one or more endpoint devices of a set of endpoint devices. For example, if network traffic transmission data 232 indicates that a high volume of network traffic is forwarded along one or more end-to-end pathways associated with an endpoint device, and network traffic reception data 234 indicates that a low volume or no volume of network traffic is received from the endpoint device along the one or more end-to-end pathways, this may indicate that a network failure associated with the endpoint device is preventing network traffic from reaching the endpoint device and/or preventing the endpoint device from sending network traffic along the one or more end-to-end pathways.
In some examples, to establish a session between a client device (e.g., one of UEs 148 of
TCP data 236 may include information indicating a number of TCP SYN packets forwarded via network device 200 along one or more end-to-end pathways associated with an endpoint device, a number of TCP SYN-ACK packets forwarded via network device 200 along one or more end-to-end pathways associated with the endpoint device, a number of TCP ACK packets forwarded via network device 200 along one or more end-to-end pathways associated with the endpoint device, or any combination thereof. TCP data 236 may include TCP data corresponding to each endpoint device of a plurality of endpoint devices (e.g., each server of servers 128A-128X of
When one or more TCP handshakes involving an endpoint device are not successfully completed, this may indicate that there is a network failure corresponding to the endpoint device. For example, when network device 200 forwards a high volume of TCP SYN packets along one or more end-to-end pathways to server 128A via connection 123G, switch 126A, and connection 129A of
New session data 238 may indicate a number of new sessions that failed to establish between each service provider device of a plurality of service provider devices (e.g., each server of servers 128) and one or more of a plurality of client devices (e.g., UEs 148) via network device 200. As used herein, the term “new session” may refer to a new session according to TCP and/or a new session according to one or more other communication session protocols (e.g., UDP, TLS, and ICMP). For example, new session data 238 may indicate a number of new sessions that failed to establish between server 128A and one or more of UEs 148 via network device 200, a number of new sessions created between server 128B and one or more of UEs 148 via network device 200, and so on. A “number” of new sessions that failed to establish may refer to a total number of new sessions that failed to establish, a rate at which new sessions fail to establish, a number of new sessions that failed to establish within a period of time, or any combination thereof. In some examples, new session data 238 may indicate a number of new sessions established between each service provider device of a plurality of service provider devices (e.g., each server of servers 128) and one or more of a plurality of client devices (e.g., UEs 148) via network device 200.
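The three senses of a “number” of failed new sessions (a total, a count within a period of time, and a rate) can be computed from a list of failure timestamps, as in this sketch. The function name, the use of seconds as the time unit, and the choice to compute the rate over the same window are assumptions.

```python
def failed_session_counts(failure_timestamps, now, window_seconds):
    """Summarize failed session attempts from their timestamps (seconds).

    Returns the total count, the count within the most recent window,
    and the average failure rate per second over that window.
    """
    total = len(failure_timestamps)
    in_window = sum(1 for t in failure_timestamps
                    if now - window_seconds < t <= now)
    return {
        "total": total,
        "in_window": in_window,
        "rate_per_second": in_window / window_seconds,
    }
```

For failures at 5, 50, 95, and 100 seconds, evaluated at 100 seconds with a 60-second window, the total is four and three fall within the window.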
Network device 200 may output network traffic transmission data 232, network traffic reception data 234, TCP data 236, new session data 238, or any combination thereof to a network management system (e.g., NMS 130 of
Since network device 200 may represent one network device of network devices 122 of
In some examples, NMS 300 and network devices 122 are located within an internal network controlled by an administrator. That is, administrator device 111 may output one or more instructions to control, configure, or otherwise manage network devices 122 and/or NMS 300 based on receiving instructions from an administrator. For example, administrator device 111 may output one or more instructions to configure a plurality of internal network pathways through network devices 122 that are part of a plurality of end-to-end pathways. In some examples, each end-to-end pathway of the plurality of end-to-end pathways may connect one or more UEs 148 with one or more servers of servers 128. But at least a portion of an end-to-end pathway between a UE of UEs 148 and a server of servers 128 may be located outside of the internal network corresponding to network devices 122 and NMS 300. For example, switches 126, servers 128, connections 129, APs 142, and switches 146 may be located outside of the internal network. This means that although an end-to-end pathway between UE 148A-1 and server 128A traverses an internal network pathway through network devices 122 within the internal network, the end-to-end pathway between UE 148A-1 and server 128A may in some cases traverse one or more external network pathways including AP 142A-1, switch 146A, switch 126A, and connection 129A which are located outside of the internal network.
NMS 300 includes processing circuitry 306, memory 308, user interface 310, communication interface 312, and database 318. Memory 308 is configured to store application programming interface (API) 322 and VNA 360 including machine learning model 362 and model training engine. Communication interface 312 includes a receiver 324 and a transmitter 326. Database 318 is configured to store network traffic and session data 370 and training data 380. Network traffic and session data 370 includes network traffic transmission data 372, network traffic reception data 374, TCP data 376, and new session data 378. Although database 318 is illustrated in
The various elements are coupled together via a bus 314 over which the various elements may exchange data and information. Processing circuitry 306 may be an example of processing circuitry 132 of
In some examples, NMS 300 receives data from one or more of network devices 122, switches 126, servers 128, APs 142, switches 146, UEs 148, routers 187 and other network nodes within network system 100, which may be used to calculate one or more metrics corresponding to network system 100. NMS 300 may analyze this data for cloud-based monitoring and/or management of wireless networks 106 at sites 102, monitoring and/or management of network devices 122, and monitoring and/or management of switches 126 and servers 128. In some examples, NMS 300 may be part of another server shown in
Processing circuitry 306 may include fixed function circuitry and/or programmable processing circuitry. Processing circuitry 306 may include any one or more of a microprocessor, a controller, a DSP, GPU, TPU, an ASIC, an FPGA, or equivalent discrete or analog logic circuitry. In some examples, processing circuitry 306 may include multiple components, such as any combination of one or more microprocessors, one or more controllers, one or more DSPs, GPUs, TPUs, one or more ASICs, or one or more FPGAs, as well as other discrete or integrated logic circuitry, which may be physically located in one or more devices in one or more physical locations. Processing circuitry 306 may execute software instructions, such as those used to define a software or computer program, stored to a computer-readable storage medium (such as memory 308), such as non-transitory computer-readable mediums including a storage device (e.g., a disk drive, or an optical drive) or a memory (such as Flash memory or RAM) or any other type of volatile or non-volatile memory, that stores instructions to cause the processing circuitry 306 to perform the techniques described herein.
Processing circuitry 306 may be capable of processing instructions stored in memory 308. In some examples, memory 308 includes a computer-readable medium that includes instructions that, when executed by processing circuitry 306, cause NMS 300 and processing circuitry 306 to perform various functions attributed to them herein. Memory 308 may include any volatile, non-volatile, magnetic, optical, or electrical media, such as a RAM, ROM, NVRAM, EEPROM, FRAM, DRAM, flash memory, or any other digital media. Memory 308 may include one or more devices configured to store programming modules and/or data associated with operation of NMS 300. For example, memory 308 may include a computer-readable storage medium, such as a non-transitory computer-readable medium including a storage device (e.g., a disk drive, or an optical drive) or a memory (such as Flash memory or RAM) or any other type of volatile or non-volatile memory, that stores instructions to cause the processing circuitry 306 to perform the techniques described herein.
A user, such as an administrator, may interact with NMS 300 through user interface 310. User interface 310 may include a display such as liquid crystal display (LCD), light-emitting diode (LED) display, or other type of screen, with which processing circuitry 306 may present information related to NMS 300, network devices, or other devices of network system 100. In addition, user interface 310 may include an input mechanism to receive input from the user. The input mechanisms may include, for example, any one or more of buttons, a keypad (e.g., an alphanumeric keypad), a peripheral pointing device, a touch screen, or another input mechanism that allows the user to navigate through user interfaces presented by processing circuitry 306 of NMS 300 and provide input. In other examples, user interface 310 also includes audio circuitry for providing audible notifications, instructions or other sounds to the user, receiving voice commands from the user, or both. Memory 308 may include instructions for operating user interface 310.
Communication interface 312 may include, for example, an Ethernet interface. Communication interface 312 couples NMS 300 to a network and/or the Internet, such as any of network 120 as shown in
The data and information received by NMS 300 via communication interface 312 may include, for example, network traffic and session data 370. As seen in
In some examples, network traffic transmission data 372 may indicate an amount of network traffic transmitted by each network device of network devices 122 via each internal network pathway of a plurality of internal network pathways through network devices 122. In some examples, network traffic reception data 374 may indicate an amount of network traffic received by each network device of network devices 122 via each internal network pathway of the plurality of internal network pathways through network devices 122. In some examples, TCP data 376 may indicate a number of TCP SYN packets, TCP SYN-ACK packets, and/or TCP ACK packets forwarded by each network device of network devices 122 via each internal network pathway of the plurality of internal network pathways through network devices 122. In some examples, new session data 378 may indicate a number of new sessions that failed to establish via each network device of network devices 122 corresponding to each internal network pathway of the plurality of internal network pathways.
NMS 300 may, in some examples, receive network traffic and session data 370 continuously as network devices 122 receive network traffic, transmit network traffic, forward TCP handshake packets, and establish new sessions. In some examples, NMS 300 may output a request for network traffic and session data 370 and, in response to the request, receive network traffic and session data 370 from network devices 122. In some examples, NMS 300 may receive network traffic and session data 370 in a sequence of blocks, each block of the sequence of blocks arriving according to a scheduled frequency (e.g., each second, minute, or hour).
Processing circuitry 306 may apply machine learning model 362 to detect, based on network traffic and session data 370, a network failure outside of the internal network corresponding to NMS 300 and network devices 122. That is, machine learning model 362 may process network traffic and session data 370 to generate an output indicating the network failure. In some examples, to apply machine learning model 362 to detect the network failure, processing circuitry 306 is configured to determine that one or more network devices of network devices 122 are unable to exchange network traffic with a server of servers 128 via one or more external network pathways. In some examples, to apply machine learning model 362 to detect the network failure, processing circuitry 306 is further configured to determine that the network failure is associated with a network connection of a plurality of network connections (e.g., connections 129) outside of the internal network along an external network pathway, wherein the network connection is located between the server of servers 128 and the network devices 122.
In some examples, machine learning model 362 may generate an output that indicates a precise location of a detected network failure. Since NMS 300 receives network traffic and session data 370 from a plurality of network devices (e.g., each of network devices 122), machine learning model 362 may analyze network traffic and session data 370 corresponding to a plurality of internal network pathways through the plurality of network devices. This plurality of internal network pathways through the plurality of network devices may be associated with a plurality of external network pathways outside of the internal network corresponding to NMS 300 and network devices 122. A plurality of end-to-end pathways may exist between UEs 148 and servers 128, each pathway of the plurality of end-to-end pathways including one or more internal network pathways and one or more external network pathways. Machine learning model 362 may identify a location outside of the internal network corresponding to a network failure based on network traffic and session data 370 received from within the internal network.
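The idea of narrowing down a failure location from many pathways can be illustrated with a simple set-intersection heuristic. This is not the machine learning model of the disclosure; it is a minimal sketch, assuming that each end-to-end pathway is known as a list of external elements and that each pathway has already been classified as failing or healthy. The function name and the element names are hypothetical.

```python
def localize_failure(failing_paths, healthy_paths):
    """Return the external elements that appear in every failing pathway
    and in no healthy pathway (a heuristic, not a trained model)."""
    if not failing_paths:
        return set()
    # Elements common to all failing pathways are the initial suspects.
    suspects = set(failing_paths[0])
    for path in failing_paths[1:]:
        suspects &= set(path)
    # Any element that also carries healthy traffic is cleared.
    for path in healthy_paths:
        suspects -= set(path)
    return suspects


# Hypothetical topology: two failing and two healthy external pathways.
failing = [["link-1", "switch-A", "link-3"], ["link-2", "switch-A", "link-3"]]
healthy = [["link-1", "switch-A", "link-4"], ["link-2", "switch-B", "link-5"]]
suspected = localize_failure(failing, healthy)
```

In this hypothetical topology, only `link-3` is shared by both failing pathways and absent from the healthy ones, so it is the single remaining suspect.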
When network traffic and session data 370 indicates that one or more sessions with server 128A of
Machine learning model 362 may identify a network failure by identifying one or more trends associated with network failures that machine learning model 362 learned when machine learning model 362 was trained using training data 380. For example, machine learning model 362 may identify a trend in the network traffic transmission data 372 and the network traffic reception data 374 indicating high levels of network traffic transmission to a server of servers 128 outside of the internal network and low levels of network traffic reception from the server. Machine learning model 362 is not limited to identifying levels of network traffic transmission and/or reception only associated with a server. Machine learning model 362 may analyze network traffic based on one or more external network pathways to the server. For example, machine learning model 362 may determine that there is a high level of network traffic transmission to server 128A and a low level of network traffic reception from server 128A via one or more pathways including connection 123H, switch 126B, and connection 129B of
In some examples, machine learning model 362 may identify a network failure by identifying one or more trends associated with TCP data 376. TCP is a protocol where a session may be established between two endpoint devices. To establish a TCP session, a first endpoint device may send a TCP SYN packet to a second endpoint device, the second endpoint device may send a TCP SYN-ACK packet to the first endpoint device, and the first endpoint device may send a TCP ACK packet to the second endpoint device. This exchange of SYN, SYN-ACK, and ACK packets is sometimes referred to as a “three-way handshake.” When a TCP session fails, one or more endpoint devices may try to reestablish the session by initiating another three-way handshake. This may be referred to as a “TCP reset.” Machine learning model 362 may identify a trend indicating that there is an increased number of TCP resets associated with a server of servers 128 and/or an increased number of TCP resets associated with one or more external network pathways associated with the server of servers 128. In some examples, machine learning model 362 may determine that there is an increased volume of TCP SYN packets sent to a server of servers 128 via one or more external network pathways and a decreased volume or no volume of TCP SYN-ACK packets received from the server in response to the SYN packets. Machine learning model 362 may determine that a network failure exists in the one or more external network pathways based on identifying this trend.
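The SYN versus SYN-ACK trend described above can be expressed as a simple ratio feature. The following is a minimal sketch under stated assumptions: the function names and the thresholds `min_syn` and `max_ratio` are hypothetical illustration values, and a trained model would learn such a decision boundary from data rather than use fixed constants.

```python
def syn_ack_response_ratio(syn_count, syn_ack_count):
    """Fraction of forwarded SYN packets answered by a SYN-ACK,
    or None when no SYN packets were observed."""
    if syn_count == 0:
        return None
    return syn_ack_count / syn_count


def handshake_stalled(syn_count, syn_ack_count, min_syn=20, max_ratio=0.1):
    """Flag a pathway when many SYNs are sent but few SYN-ACKs return.
    min_syn and max_ratio are assumed thresholds for illustration."""
    ratio = syn_ack_response_ratio(syn_count, syn_ack_count)
    if ratio is None or syn_count < min_syn:
        return False  # too little traffic to draw a conclusion
    return ratio <= max_ratio


stalled = handshake_stalled(100, 2)   # many SYNs, almost no replies
normal = handshake_stalled(100, 95)   # most SYNs answered
```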
Machine learning model 362 is not limited to analyzing TCP three-way handshakes to identify network failures. In some examples, machine learning model 362 may be configured to analyze new session data 378 corresponding to one or more protocols other than TCP to identify network failures, such as UDP, TLS, ICMP, or any combination thereof. That is, new session data 378 may indicate, for each network device of network devices 122, a number of new sessions that failed to establish according to any communication session protocol via the network device corresponding to each server of servers 128. In some examples, machine learning model 362 may identify a pattern indicating that a number of new sessions established between a server of servers 128 and UEs 148 via one or more internal network pathways and/or external network pathways is decreased from baseline levels or near zero. Machine learning model 362 may determine that a network failure exists at server 128A or the one or more external network pathways connecting to server 128A based on identifying this trend.
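A drop in new sessions relative to a baseline can be sketched as a comparison against a historical mean. This is only an illustration: the function name, the use of a plain mean as the baseline, and the `drop_fraction` threshold are assumptions and are not specified in the disclosure.

```python
from statistics import mean


def sessions_below_baseline(history, current, drop_fraction=0.8):
    """Return True when the current count of newly established sessions
    has dropped by at least drop_fraction relative to the historical mean.
    drop_fraction is an assumed threshold for illustration."""
    if not history:
        return False  # no baseline available yet
    baseline = mean(history)
    if baseline <= 0:
        return False  # a zero baseline cannot drop further
    return current <= baseline * (1.0 - drop_fraction)


near_zero = sessions_below_baseline([100, 110, 90], 5)
typical = sessions_below_baseline([100, 110, 90], 80)
```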
Processing circuitry 306 of NMS 300 may apply model training engine 364 to train machine learning model 362 using training data 380. In some examples, training data 380 may include a plurality of sets of training data. In some examples, each set of training data of the plurality of sets of training data may include network traffic and session data that includes at least one of network traffic transmission data, network traffic reception data, TCP data, and new session data. In this way, each set of training data of the plurality of sets of training data may include data that is similar in kind to data that is present in the network traffic and session data 370 received from network devices 122 of
Each set of training data of the plurality of sets of training data of training data 380 may include information indicating whether the set of training data is associated with a network failure. In some examples, each set of training data of the plurality of sets of training data of training data 380 may be associated with information in addition to or alternatively to whether the set of training data is associated with a network failure. For example, sets of training data may be associated with information indicating a type of network failure associated with the data (e.g., black hole or another type of network failure), a location of a network failure associated with the data (e.g., at a server, at a switch, at a connection), a severity of the network failure (e.g., total network failure or partial network failure), or any combination thereof.
Based on the information associated with each set of training data of the plurality of sets of training data and based on the plurality of sets of training data of training data 380, processing circuitry 306 may apply model training engine 364 to use supervised learning to train machine learning model 362. Supervised machine learning involves using a labeled set of training data to train a machine learning model based on one or more patterns between the labels of the training data and the training data itself. For example, information indicating whether a set of training data is associated with a network failure is a label. In the process of training machine learning model 362, model training engine 364 may identify one or more patterns in training data 380 associated with network failure and/or identify one or more patterns in training data 380 associated with a healthy network. Machine learning model 362 may learn these patterns as it is trained by model training engine 364 such that machine learning model 362 may detect network failure based on these patterns once trained.
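Supervised training on labeled sets can be illustrated with a deliberately small example. The sketch below uses a nearest-centroid classifier, chosen here only because it fits in a few lines of standard-library Python; the disclosure does not state that this algorithm is used. The feature layout (a reception-to-transmission ratio and a SYN-ACK-to-SYN ratio), the label strings, and the sample values are hypothetical.

```python
def train_centroids(samples, labels):
    """Compute the mean feature vector (centroid) for each label."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: [v / counts[label] for v in acc] for label, acc in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Hypothetical labeled training sets: [rx/tx ratio, SYN-ACK/SYN ratio].
training_samples = [[0.9, 0.95], [1.1, 0.9], [0.05, 0.0], [0.1, 0.05]]
training_labels = ["healthy", "healthy", "failure", "failure"]

model = train_centroids(training_samples, training_labels)
prediction = predict(model, [0.08, 0.02])
```

The labels play the role described above: the classifier learns one pattern for sets labeled as failures and one for sets labeled as healthy, then assigns new data to whichever pattern it resembles most.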
Model training engine 364 is not limited to using supervised learning to generate machine learning model 362. Example machine learning techniques that may be employed to generate machine learning model 362 include various learning styles, such as supervised learning, unsupervised learning, and semi-supervised learning. Example types of algorithms include Bayesian algorithms, Markov models, Hawkes processes, clustering algorithms, decision-tree algorithms, regularization algorithms, regression algorithms, instance-based algorithms, artificial neural network algorithms, deep learning algorithms, dimensionality reduction algorithms, and the like. Various examples of specific algorithms include Bayesian Linear Regression, Boosted Decision Tree Regression, Neural Network Regression, Back Propagation Neural Networks, self-attention models, Convolutional Neural Networks (CNNs), Long Short-Term Memory networks (LSTMs), the Apriori algorithm, K-Means Clustering, k-Nearest Neighbors (KNN), Learning Vector Quantization (LVQ), Self-Organizing Map (SOM), Locally Weighted Learning (LWL), Ridge Regression, Least Absolute Shrinkage and Selection Operator (LASSO), Elastic Net, Least-Angle Regression (LARS), Principal Component Analysis (PCA), and Principal Component Regression (PCR).
In some examples, when NMS 300 uses machine learning model 362 to identify a network failure based on network traffic and session data 370, NMS 300 may determine whether machine learning model 362 accurately identified the network failure. For example, when NMS 300 determines that machine learning model 362 correctly identified the network failure, correctly identified a location of the network failure, incorrectly identified the network failure, incorrectly identified the location of the network failure, or any combination thereof, NMS 300 may save this information along with the network traffic and session data 370 used to identify the network failure. In some examples, NMS 300 may save the network traffic and session data 370 used to identify the network failure and the information indicating the accuracy of the identification to training data 380.
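One way to picture saving an identification together with its accuracy is a small feedback record that is appended to the stored training sets as a newly labeled example. The record type, field names, and helper function below are hypothetical; the disclosure does not describe a storage format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DetectionFeedback:
    """Hypothetical record pairing the data used for a detection with the
    model output and the outcome that was later confirmed."""
    features: List[float]    # data the model used for the detection
    predicted_failure: bool  # what the model reported
    actual_failure: bool     # what was later confirmed

    @property
    def prediction_correct(self) -> bool:
        return self.predicted_failure == self.actual_failure


def save_feedback(training_sets: List[Tuple[List[float], bool]],
                  feedback: DetectionFeedback) -> None:
    """Append the data with its confirmed label so that later training
    can use it as an additional labeled set."""
    training_sets.append((feedback.features, feedback.actual_failure))


stored_sets: List[Tuple[List[float], bool]] = []
false_alarm = DetectionFeedback([0.05, 0.0], predicted_failure=True,
                                actual_failure=False)
save_feedback(stored_sets, false_alarm)
```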
Model training engine 364 may, in some examples, re-train the machine learning model 362 based on the training data 380 augmented with the network traffic and session data 370 used to identify a network failure and the information indicating the accuracy of the identification. In some examples, NMS 300 may continuously receive new network traffic and session data 370 received from network devices 122 of
UE device 400 includes a wired interface 430, wireless interfaces 420A-420C, processing circuitry 406, memory 408, and a user interface 410. The various elements are coupled together via a bus 414 over which the various elements may exchange data and information. Wired interface 430 represents a physical network interface and includes a receiver 432 and a transmitter 434. Wired interface 430 may be used, if desired, to couple, either directly or indirectly, UE device 400 to a wired network device, such as one of switches 146 of
First, second and third wireless interfaces 420A, 420B, and 420C include receivers 422A, 422B, and 422C, respectively, each including a receive antenna via which UE device 400 may receive wireless signals from wireless communications devices, such as APs 142 of
Processing circuitry 406 may include fixed function circuitry and/or programmable processing circuitry. Processing circuitry 406 may include any one or more of a microprocessor, a controller, a DSP, GPU, TPU, an ASIC, an FPGA, or equivalent discrete or analog logic circuitry. In some examples, processing circuitry 406 may include multiple components, such as any combination of one or more microprocessors, one or more controllers, one or more DSPs, GPUs, TPUs, one or more ASICs, or one or more FPGAs, as well as other discrete or integrated logic circuitry, which may be physically located in one or more devices in one or more physical locations. Processing circuitry 406 may execute software instructions, such as those used to define a software or computer program, stored to a computer-readable storage medium (such as memory 408), such as non-transitory computer-readable mediums including a storage device (e.g., a disk drive, or an optical drive) or a memory (such as Flash memory or RAM) or any other type of volatile or non-volatile memory, that stores instructions to cause the processing circuitry 406 to perform the techniques described herein.
Processing circuitry 406 may be capable of processing instructions stored in memory 408. In some examples, memory 408 includes a computer-readable medium that includes instructions that, when executed by processing circuitry 406, cause UE device 400 and processing circuitry 406 to perform various functions attributed to them herein. Memory 408 may include any volatile, non-volatile, magnetic, optical, or electrical media, such as a RAM, ROM, NVRAM, EEPROM, FRAM, DRAM, flash memory, or any other digital media. Memory 408 may include one or more devices configured to store programming modules and/or data associated with operation of UE device 400. For example, memory 408 may include a computer-readable storage medium, such as a non-transitory computer-readable medium including a storage device (e.g., a disk drive, or an optical drive) or a memory (such as Flash memory or RAM) or any other type of volatile or non-volatile memory, that stores instructions to cause the processing circuitry 406 to perform the techniques described herein.
A user may interact with UE device 400 through user interface 410. User interface 410 may include a display such as a liquid crystal display (LCD), light-emitting diode (LED) display, organic LED (OLED) display, or other type of screen, with which processing circuitry 406 may present information related to UE device 400 and/or one or more services received by UE device 400. In addition, user interface 410 may include an input mechanism to receive input from the user. The input mechanism may include, for example, any one or more of buttons, a keypad (e.g., an alphanumeric keypad), a peripheral pointing device, a touch screen, or another input mechanism that allows the user to navigate through user interfaces presented by processing circuitry 406 of UE device 400 and provide input. In other examples, user interface 410 also includes audio circuitry for providing audible notifications, instructions or other sounds to the user, receiving voice commands from the user, or both. Memory 408 may include instructions for operating user interface 410.
In this example, memory 408 includes an operating system 440, applications 442, a communications module 444, configuration settings 450, and data 454. Communications module 444 includes program code that, when executed by processing circuitry 406, enables UE device 400 to communicate using any of wired interface(s) 430, wireless interfaces 420A-420B and/or cellular interface 420C. Configuration settings 450 include any device settings for UE device 400, including settings for each of wireless interface(s) 420A-420B and/or cellular interface 420C.
Data 454 may include, for example, a status/error log including a list of events specific to UE device 400. The events may include a log of both normal events and error events according to a logging level based on instructions from NMS 130. Data 454 may store any data used and/or generated by UE device 400, such as data used to calculate one or more metrics or identify relevant behavior data, that is collected by UE device 400 and either transmitted directly to NMS 130 or transmitted to any of APs 142 in a wireless network of wireless networks 106 for further transmission to NMS 130.
As described herein, UE device 400 may measure and report network data from data 454 to NMS 130. The network data may comprise event data, telemetry data, and/or other data. In some examples, the network data may include data corresponding to one or more sessions between UE device 400 and a service provider device such as a server of servers 128. For example, UE device 400 may form one or more sessions with a service provider device such as a server of a video streaming service. Data 454 may include information corresponding to the one or more sessions including information indicating a quality of one or more sessions and/or data corresponding to one or more failed sessions. The network data may include various parameters indicative of the performance and/or status of the wireless network.
In this example, network node 500 includes a wired interface 502, e.g., an Ethernet interface, processing circuitry 506, a memory 508, and input/output 510, e.g., display, buttons, keyboard, keypad, touch screen, mouse, etc., coupled together via a bus 514 over which the various elements may interchange data and information. Wired interface 502 couples the network node 500 to a network, such as an enterprise network. Though only one interface is shown by way of example, network nodes may, and usually do, have multiple communication interfaces and/or multiple communication interface ports. Wired interface 502 includes a receiver 520 and a transmitter 522.
Processing circuitry 506 may include fixed function circuitry and/or programmable processing circuitry. Processing circuitry 506 may include any one or more of a microprocessor, a controller, a DSP, GPU, TPU, an ASIC, an FPGA, or equivalent discrete or analog logic circuitry. In some examples, processing circuitry 506 may include multiple components, such as any combination of one or more microprocessors, one or more controllers, one or more DSPs, GPUs, TPUs, one or more ASICs, or one or more FPGAs, as well as other discrete or integrated logic circuitry, which may be physically located in one or more devices in one or more physical locations. Processing circuitry 506 may execute software instructions, such as those used to define a software or computer program, stored to a computer-readable storage medium (such as memory 508), such as non-transitory computer-readable mediums including a storage device (e.g., a disk drive, or an optical drive) or a memory (such as Flash memory or RAM) or any other type of volatile or non-volatile memory, that stores instructions to cause the processing circuitry 506 to perform the techniques described herein.
Processing circuitry 506 may be capable of processing instructions stored in memory 508. In some examples, memory 508 includes a computer-readable medium that includes instructions that, when executed by processing circuitry 506, cause network node 500 and processing circuitry 506 to perform various functions attributed to them herein. Memory 508 may include any volatile, non-volatile, magnetic, optical, or electrical media, such as a RAM, ROM, NVRAM, EEPROM, FRAM, DRAM, flash memory, or any other digital media. Memory 508 may include one or more devices configured to store programming modules and/or data associated with operation of network node 500. For example, memory 508 may include a computer-readable storage medium, such as a non-transitory computer-readable medium including a storage device (e.g., a disk drive, or an optical drive) or a memory (such as Flash memory or RAM) or any other type of volatile or non-volatile memory, that stores instructions to cause the processing circuitry 506 to perform the techniques described herein.
Memory 508 stores executable software applications 532, data 534, operating system 540, and NMS agent 544. Data 534 may include a system log and/or an error log that stores event data, including behavior data, for network node 500. In examples where network node 500 comprises a “third-party” network device, the same entity does not own or have access to both the APs or wired client-side devices and network node 500. As such, in the example where network node 500 is a third-party network device, NMS 130 does not receive, collect, or otherwise have access to the network data from network node 500.
In examples where network node 500 comprises a server, network node 500 may receive data and information, e.g., including operation related information, e.g., handshake request, session request, registration request, AAA services, DHCP requests, Simple Notification Service (SNS) lookups, and Web page requests via receiver 520, and send data and information, e.g., including configuration information, authentication information, web page data, services, etc. via transmitter 522.
In examples where network node 500 comprises a wired network device, network node 500 may be connected via wired interface 502 to one or more APs or other wired client-side devices, e.g., IoT devices. For example, network node 500 may include multiple wired interfaces 502 and/or wired interface 502 may include multiple physical ports to connect to multiple APs or the other wired-client-side devices within a site via respective Ethernet cables. In some examples, each of the APs or other wired client-side devices connected to network node 500 may access the wired network via wired interface 502 of network node 500. In some examples, one or more of the APs or other wired client-side devices connected to network node 500 may each draw power from network node 500 via the respective Ethernet cable and a Power over Ethernet (POE) port of wired interface 502.
In examples where network node 500 comprises a session-based router that employs a stateful, session-based routing scheme, network node 500 may be configured to independently perform path selection and traffic engineering. The use of session-based routing may enable network node 500 to eschew the use of a centralized controller, such as an SDN controller, to perform path selection and traffic engineering, and eschew the use of tunnels. In some examples, network node 500 may implement session-based routing as Secure Vector Routing (SVR), provided by Juniper Networks, Inc. In the case where network node 500 comprises a session-based router operating as a network gateway for a site of an enterprise network (e.g., router 187A of
In examples where network node 500 comprises a packet-based router, network node 500 may employ a packet- or flow-based routing scheme to forward packets according to defined network paths, e.g., established by a centralized controller that performs path selection and traffic engineering. In the case where network node 500 comprises a packet-based router operating as a network gateway for a site of an enterprise network (e.g., router 187A of
The data collected and reported by network node 500 may include periodically reported data and event-driven data. Network node 500 is configured to collect logical path statistics via bidirectional forwarding detection (BFD) probing and data extracted from messages and/or counters at the logical path (e.g., peer path or tunnel) level. In some examples, network node 500 is configured to collect statistics and/or sample other data according to a first periodic interval, e.g., every 3 seconds, every 5 seconds, etc. Network node 500 may store the collected and sampled data as path data, e.g., in a buffer.
In some examples, network node 500 optionally includes an NMS agent 544. NMS agent 544 may periodically create a package of the statistical data according to a second periodic interval, e.g., every 3 minutes. In some examples, the package of statistical data may also include details about clients connected to network node 500 and the associated client sessions. NMS agent 544 may then report the package of statistical data to NMS 130 in the cloud. In other examples, NMS 130 may request, retrieve, or otherwise receive the package of statistical data from network node 500 via an API, an open configuration protocol, or another communication protocol. The package of statistical data created by NMS agent 544 or another module of network node 500 may include a header identifying network node 500 and the statistics and data samples for each of the logical paths from network node 500. In still other examples, NMS agent 544 reports event data to NMS 130 in the cloud in response to the occurrence of certain events at network node 500 as the events happen.
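The two-interval scheme described above (frequent sampling into a buffer, less frequent packaging) can be sketched as a function that groups buffered samples by logical path under a header. The function name, dictionary keys, and sample fields are hypothetical; the actual package format is not given in the disclosure. With the example intervals of 3 seconds and 3 minutes, each package would hold 180 / 3 = 60 samples per logical path.

```python
from collections import defaultdict


def build_package(node_id, buffered_samples):
    """Group buffered (path_id, sample) pairs by logical path and attach
    a header identifying the reporting node."""
    per_path = defaultdict(list)
    for path_id, sample in buffered_samples:
        per_path[path_id].append(sample)
    return {"header": {"node_id": node_id}, "paths": dict(per_path)}


# Hypothetical buffer contents collected during one packaging interval.
buffer = [
    ("peer-path-1", {"latency_ms": 10}),
    ("peer-path-2", {"latency_ms": 30}),
    ("peer-path-1", {"latency_ms": 12}),
]
package = build_package("node-500", buffer)

# Samples per path per package for the example intervals in the text.
samples_per_package = (3 * 60) // 3
```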
In the example of
First and second wireless interfaces 620A and 620B represent wireless network interfaces and include receivers 622A and 622B, respectively, each including a receive antenna via which AP device 600 may receive wireless signals from wireless communications devices, such as UEs 148 of
Processing circuitry 606 may include fixed function circuitry and/or programmable processing circuitry. Processing circuitry 606 may include any one or more of a microprocessor, a controller, a DSP, GPU, TPU, an ASIC, an FPGA, or equivalent discrete or analog logic circuitry. In some examples, processing circuitry 606 may include multiple components, such as any combination of one or more microprocessors, one or more controllers, one or more DSPs, GPUs, TPUs, one or more ASICs, or one or more FPGAs, as well as other discrete or integrated logic circuitry, which may be physically located in one or more devices in one or more physical locations. Processing circuitry 606 may execute software instructions, such as those used to define a software or computer program, stored to a computer-readable storage medium (such as memory 608), such as non-transitory computer-readable mediums including a storage device (e.g., a disk drive, or an optical drive) or a memory (such as Flash memory or RAM) or any other type of volatile or non-volatile memory, that stores instructions to cause the processing circuitry 606 to perform the techniques described herein.
Processing circuitry 606 may be capable of processing instructions stored in memory 608. In some examples, memory 608 includes a computer-readable medium that includes instructions that, when executed by processing circuitry 606, cause AP device 600 and processing circuitry 606 to perform various functions attributed to them herein. Memory 608 may include any volatile, non-volatile, magnetic, optical, or electrical media, such as a RAM, ROM, NVRAM, EEPROM, FRAM, DRAM, flash memory, or any other digital media. Memory 608 may include one or more devices configured to store programming modules and/or data associated with operation of AP device 600. For example, memory 608 may include a computer-readable storage medium, such as a non-transitory computer-readable medium including a storage device (e.g., a disk drive, or an optical drive) or a memory (such as Flash memory or RAM) or any other type of volatile or non-volatile memory, that stores instructions to cause the processing circuitry 606 to perform the techniques described herein.
In this example, memory 608 stores executable software including an API 640, a communications manager 642, configuration settings 650, a device status log 652, data 654, and log controller 655. Device status log 652 includes a list of events specific to AP device 600. The events may include a log of both normal events and error events such as, for example, memory status, reboot or restart events, crash events, cloud disconnect with self-recovery events, low link speed or link speed flapping events, Ethernet port status, Ethernet interface packet errors, upgrade failure events, firmware upgrade events, configuration changes, etc., as well as a time and date stamp for each event. Log controller 655 determines a logging level for the device based on instructions from NMS 130. Data 654 may store any data used and/or generated by AP device 600, including data collected from UEs 148, such as data used to calculate one or more metrics, that is transmitted by AP device 600 for cloud-based management of wireless networks 106 by NMS 130.
Input/output (I/O) 610 represents physical hardware components that enable interaction with a user, such as buttons, a display, and the like. Although not shown, memory 608 typically stores executable software for controlling a user interface with respect to input received via I/O 610. Communications manager 642 includes program code that, when executed by processing circuitry 606, allows AP device 600 to communicate with UEs 148 and/or network 120 via any of wired interface 630 and/or wireless interfaces 620A-620B. Configuration settings 650 include any device settings for AP device 600 such as radio settings for each of wireless interfaces 620A-620B. These settings may be configured manually or may be remotely monitored and managed by NMS 130 to optimize wireless network performance on a periodic (e.g., hourly or daily) basis.
As described herein, AP device 600 may measure and report network data from device status log 652 to NMS 130. The network data may comprise event data, telemetry data, and/or other data. The network data may include various parameters indicative of the performance and/or status of the wireless network. The parameters may be measured and/or determined by one or more of the UE devices and/or by one or more of the APs in a wireless network. NMS 130 may determine one or more metrics based on the data received from the APs in the wireless network and store the metrics in a memory. In examples where AP device 600 comprises a “third-party” network device, the same entity does not own or have access to both NMS 130 and AP device 600. As such, in the example where AP device 600 is a third-party network device, NMS 130 does not receive, collect, or otherwise have access to the network data from AP device 600.
NMS 130 may be configured to receive network data from network devices 122 of an internal network managed by NMS 130 (702). Network devices 122 may, in some examples, connect UEs 148 with servers 128 via one or more pathways through the internal network corresponding to network devices 122. In some examples, the network data may include network traffic data and/or session data (e.g., network traffic and session data 136 of
Processing circuitry 132 of NMS 130 may be configured to apply a machine learning model to detect, based on the network data, a network failure within an external network (704). For example, machine learning model 135 may process the network data to identify the network failure. Machine learning model 135 may be trained to recognize one or more patterns in network data corresponding to network failure and/or recognize one or more patterns in network data corresponding to healthy network operation. For example, machine learning model 135 may be trained using supervised learning based on a set of labeled training data. The training data may include data similar in kind to the network data received from network devices 122 and one or more labels indicating information corresponding to the training data. The training data may include a label indicating whether the data is associated with network failure or is associated with a healthy network.
In some examples, an external network comprises one or more external network devices (e.g., one or more of servers 128) and an edge device (e.g., a switch of switches 126) connecting the one or more external network devices to the internal network. To detect the network failure, processing circuitry 132 may in some examples be configured to apply machine learning model 135 to detect a failure of a link between the edge device and one of the one or more external network devices (e.g., connection 129A between switch 126A and server 128A). Additionally, or alternatively, to detect the network failure, processing circuitry 132 is configured to apply the machine learning model to detect a failure of one of the external network devices (e.g., one of servers 128).
The external network may, in some examples, represent a first external network (e.g., an external network including switch 126A, switch 126B, and server 128A). To detect the network failure, processing circuitry 132 may be configured to apply machine learning model 135 to detect blackholing of network traffic by a first edge device (e.g., switch 126A) of the first external network. In some examples, a second edge device (e.g., switch 126B) of the first external network is not associated with blackholing. In some examples, a second external network (e.g., an external network including switch 126C and server 128B) is not associated with blackholing.
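One signal that could distinguish a blackholing edge device from its healthy neighbors is the fraction of sessions that send packets through the device but receive nothing back. The following sketch computes that fraction per edge device. It is a hand-written heuristic offered only to illustrate the kind of pattern a model might learn, not the disclosed machine learning model 135. The record layout, the device names, the threshold of 0.9, and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical session records: (edge_device, packets_sent, packets_received).
SESSIONS = [
    ("switch-126A", 120, 0),
    ("switch-126A", 80, 0),
    ("switch-126A", 45, 0),
    ("switch-126B", 100, 97),
    ("switch-126B", 60, 58),
    ("switch-126C", 75, 75),
    ("switch-126C", 30, 0),
    ("switch-126C", 90, 88),
]


def silent_fraction_by_edge(sessions):
    """Per edge device, the fraction of active sessions that got no reply."""
    totals = defaultdict(int)
    silent = defaultdict(int)
    for edge, sent, received in sessions:
        if sent == 0:
            continue  # nothing was sent, so silence carries no information
        totals[edge] += 1
        if received == 0:
            silent[edge] += 1
    return {edge: silent[edge] / totals[edge] for edge in totals}


def suspected_blackholes(sessions, threshold=0.9):
    """Edge devices whose silent-session fraction meets the assumed threshold."""
    fractions = silent_fraction_by_edge(sessions)
    return sorted(edge for edge, frac in fractions.items() if frac >= threshold)


print(suspected_blackholes(SESSIONS))
```

In this sample data only the first edge device is flagged; the second edge device of the same external network and the edge device of the second external network fall below the threshold, mirroring the scenario described above.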
NMS 130 may perform a corrective action to remediate the network failure within the external network (706). Even though the network failure may be located in an external network, NMS 130 may be configured to reroute network traffic within the internal network such that the network traffic does not reach the network failure when the network traffic leaves the internal network. For example, when a network failure at connection 129A of
Processing circuitry 132 of NMS 130 is configured to train, with training data 138, machine learning model 135 to detect one or more network failures associated with respective sets of a plurality of sets of training network data (802). Each set of training network data of the plurality of sets of training network data may be associated with network status information indicating whether the set of training network data is associated with one or more network failure events. For example, the network status information may include information indicating whether the set of training network data is associated with a network failure or is associated with healthy network operation, information indicating a nature of a network failure associated with the set of training network data (e.g., black hole or other kind of network failure), information indicating a location of a network failure associated with the set of training network data (e.g., a location relative to one or more network devices), or any combination thereof.
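The network status information described above can be pictured as a small record attached to each set of training network data. The sketch below is one possible representation under stated assumptions: the class names `NetworkStatus`, `TrainingSet`, and `FailureKind`, the field names, the location strings, and the feature values are all hypothetical and are not drawn from the disclosure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class FailureKind(Enum):
    """Nature of a failure (hypothetical categories)."""
    BLACK_HOLE = "black_hole"
    OTHER = "other"


@dataclass(frozen=True)
class NetworkStatus:
    """Label attached to one set of training network data."""
    is_failure: bool
    kind: Optional[FailureKind] = None  # nature of the failure, if known
    location: Optional[str] = None      # location relative to network devices

    def __post_init__(self):
        # A healthy label must not carry failure details.
        if not self.is_failure and (self.kind or self.location):
            raise ValueError("healthy sets cannot carry failure details")


@dataclass(frozen=True)
class TrainingSet:
    """One set of training network data plus its status information."""
    features: Sequence[float]
    status: NetworkStatus


def split_by_label(sets):
    """Partition training sets into (failure sets, healthy sets)."""
    failures = [s for s in sets if s.status.is_failure]
    healthy = [s for s in sets if not s.status.is_failure]
    return failures, healthy


EXAMPLES = [
    TrainingSet((0.40, 0.90),
                NetworkStatus(True, FailureKind.BLACK_HOLE, "edge switch 126A")),
    TrainingSet((0.02, 0.01), NetworkStatus(False)),
    TrainingSet((0.03, 0.04), NetworkStatus(False)),
]

failures, healthy = split_by_label(EXAMPLES)
print(len(failures), len(healthy))
```

Keeping the nature and location fields optional reflects the "or any combination thereof" language: a set may be labeled only as failure or healthy, or may carry additional detail.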
By training machine learning model 135, processing circuitry 132 of NMS 130 may cause machine learning model 135 to recognize one or more training network data patterns associated with network failure events. Since each set of training network data of the plurality of sets of training data 138 includes network status information indicating whether the set of training network data is associated with one or more network failure events, processing circuitry 132 of NMS 130 may train machine learning model 135 to identify one or more patterns in the training data 138 associated with sets of training network data labeled as being associated with network failure events. Additionally, or alternatively, processing circuitry 132 of NMS 130 may train machine learning model 135 to identify one or more patterns in the training data 138 associated with sets of training network data labeled as being associated with healthy network operation.
Processing circuitry 132 of NMS 130 may apply machine learning model 135 to detect a network failure based on network data received from a plurality of network devices 122 of an internal network (804). In some examples, at block 806, processing circuitry 132 may perform the techniques of blocks 702 and 704 of the example operation of
The techniques described herein may be implemented in hardware, software, firmware, or any combination thereof. Various features described as modules, units or components may be implemented together in an integrated logic device or separately as discrete but interoperable logic devices or other hardware devices. In some cases, various features of electronic circuitry may be implemented as one or more integrated circuit devices, such as an integrated circuit chip or chipset.
If implemented in hardware, this disclosure may be directed to an apparatus such as a processor or an integrated circuit device, such as an integrated circuit chip or chipset. Alternatively, or additionally, if implemented in software or firmware, the techniques may be realized at least in part by a computer-readable data storage medium comprising instructions that, when executed, cause a processor to perform one or more of the methods described above. For example, the computer-readable data storage medium may store such instructions for execution by a processor.
A computer-readable medium may form part of a computer program product, which may include packaging materials. A computer-readable medium may comprise a computer data storage medium such as RAM, ROM, NVRAM, EEPROM, flash memory, magnetic or optical data storage media, and the like. In some examples, an article of manufacture may comprise one or more computer-readable storage media.
In some examples, the computer-readable storage media may comprise non-transitory media. The term “non-transitory” may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. In certain examples, a non-transitory storage medium may store data that can, over time, change (e.g., in RAM or cache).
The code or instructions may be software and/or firmware executed by processing circuitry including one or more processors, such as one or more DSPs, general purpose microprocessors, ASICs, FPGAs, or other equivalent integrated or discrete logic circuitry. Accordingly, the term “processor,” as used herein may refer to any of the foregoing structure or any other structure suitable for implementation of the techniques described herein. In addition, in some respects, functionality described in this disclosure may be provided within software modules or hardware modules.