The present disclosure pertains to anomaly detection methods in a network and more specifically to a method of detecting anomalies in network protocol processes utilizing probabilistic statistical models.
In a data center or large-scale network running a large number of switches, it is often difficult to detect whether a particular switch is performing erratically. For example, it may be difficult to detect whether the switch is having an issue with the implementation of protocols or whether the switch is showing signs of malfunction due to an attack on the system. Often, these kinds of issues are not detected properly and, if left unaddressed, the switch can shut down or disrupt the rest of the network, eventually resulting in a major network outage.
There are shown in the drawings embodiments that are presently preferred, it being understood that the disclosure is not limited to the arrangements and instrumentalities shown, wherein:
Various embodiments of the disclosure are discussed in detail below. While specific implementations are discussed, it should be understood that this is done for illustration purposes only. A person skilled in the relevant art will recognize that other components and configurations may be used without departing from the spirit and scope of the disclosure.
In one aspect of the disclosure, a method is provided where the method includes collecting data sample points to form a first data set, each of the data sample points representing a network feature variable, each network feature variable associated with a corresponding network feature, calculating a standard deviation and a mean value of the network feature variables for each network feature, performing normalization of the network feature variables to obtain normalized network feature variables, calculating, using the standard deviation and the mean value for each network feature, a probability value (p-value) for each normalized network feature variable, and determining if an anomaly exists with respect to each network feature based at least upon the p-value for each normalized network feature variable.
In another aspect, a system is provided where the system includes a processor, and a computer-readable storage medium having stored therein instructions which, when executed by the processor, cause the processor to perform a series of operations. These operations include collecting data sample points to form a first data set, each of the data sample points representing a network feature variable, each network feature variable associated with a corresponding network feature, calculating a standard deviation and a mean value of the network feature variables for each network feature, performing normalization of the network feature variables to obtain normalized network feature variables, calculating, using the standard deviation and the mean value for each network feature, a probability value (p-value) for each normalized network feature variable, and determining if an anomaly exists with respect to each network feature based at least upon the p-value for each normalized network feature variable.
Yet another aspect provides a non-transitory computer-readable storage medium having stored therein instructions which, when executed by a processor, cause the processor to perform a series of operations. The operations include collecting data sample points to form a first data set, each of the data sample points representing a network feature variable, each network feature variable associated with a corresponding network feature, calculating a standard deviation and a mean value of the network feature variables for each network feature, performing normalization of the network feature variables to obtain normalized network feature variables, calculating, using the standard deviation and the mean value for each network feature, a probability value (p-value) for each normalized network feature variable, and determining if an anomaly exists with respect to each network feature based at least upon the p-value for each normalized network feature variable.
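By way of a non-limiting illustration, the operations above can be sketched in a few lines of Python. This is a minimal sketch only, assuming the data sample points are gathered into a NumPy matrix with one row per observation and one column per network feature; the function name, the threshold value, and the use of NumPy are illustrative assumptions and not part of the claimed operations.

    import numpy as np

    def detect_anomalies(samples, threshold=0.01):
        # samples: (num_observations, num_features) matrix of network
        # feature variables; illustrative sketch only.
        samples = np.asarray(samples, dtype=float)
        mu = samples.mean(axis=0)            # mean value per network feature
        sigma = samples.std(axis=0) + 1e-12  # standard deviation (guarded against zero)
        z = (samples - mu) / sigma           # normalized network feature variables
        latest = z[-1]                       # most recent observation
        # Gaussian density of each normalized variable; a low density
        # means the value lies far from the mean.
        p = np.exp(-latest ** 2 / 2.0) / np.sqrt(2.0 * np.pi)
        return p < threshold                 # True where an anomaly is indicated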
A computer network is a geographically distributed collection of nodes, or switches, interconnected by communication links and segments for transporting data between endpoints, such as personal computers and workstations. Many types of networks are available, with the types ranging from local area networks (LANs) and wide area networks (WANs) to overlay and software-defined networks, such as virtual extensible local area networks (VXLANs).
LANs typically connect nodes over dedicated private communications links located in the same general physical location, such as a building or campus. WANs, on the other hand, typically connect geographically dispersed nodes over long-distance communications links, such as common carrier telephone lines, optical lightpaths, synchronous optical networks (SONET), or synchronous digital hierarchy (SDH) links. LANs and WANs can include layer 2 (L2) and/or layer 3 (L3) networks and devices.
The Internet is an example of a WAN that connects disparate networks throughout the world, providing global communication between nodes on various networks. The nodes typically communicate over the network by exchanging discrete frames or packets of data according to predefined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP). In this context, a protocol can refer to a set of rules defining how the nodes or switches interact with each other. Computer networks may be further interconnected by an intermediate network node, such as a router, to extend the effective “size” of each network. Further, a controller, such as a Fabric Controller (“FC”) can be configured to control the interaction between nodes/switches.
The interfaces 168 are typically provided as interface cards (sometimes referred to as “line cards”). Generally, they control the sending and receiving of data packets over the network and sometimes support other peripherals used with the router 110. Among the interfaces that may be provided are Ethernet interfaces, frame relay interfaces, cable interfaces, DSL interfaces, token ring interfaces, and the like. In addition, various very high-speed interfaces may be provided, such as fast token ring interfaces, wireless interfaces, Ethernet interfaces, Gigabit Ethernet interfaces, ATM interfaces, HSSI interfaces, POS interfaces, FDDI interfaces, and the like. Generally, these interfaces may include ports appropriate for communication with the appropriate media. In some cases, they may also include an independent processor and, in some instances, volatile RAM. The independent processors may control such communications-intensive tasks as packet switching, media control, and management. By providing separate processors for the communications-intensive tasks, these interfaces allow the master microprocessor 162 to efficiently perform routing computations, network diagnostics, security functions, etc.
Although the system described above is one specific network device architecture, it is by no means the only architecture on which the concepts of the present disclosure can be implemented, and other types of interfaces and media could also be used with router 110.
Regardless of the network device's configuration, it may employ one or more memories or memory modules (including memory 161) configured to store program instructions for the general-purpose network operations and mechanisms for roaming, route optimization and routing functions described herein. The program instructions may control the operation of an operating system and/or one or more applications, for example. The memory or memories may also be configured to store tables such as mobility binding, registration, and association tables, etc.
In a data center switch or any network switch running network protocols, there is often a level of correlation in the activity levels of switches deployed in the same cluster. Further, each protocol instance's processing load correlates with how many protocol data units it is processing or generating. Periodic collection of sample statistics using distributed mechanisms can be initiated from a central Fabric Controller (FC) 110. While the present disclosure focuses on analyzing statistics gathered from routing protocols, the concepts disclosed herein may be applied to any other protocol processes.
The present disclosure provides techniques for detecting anomalies in protocol processes which may occur at a switch or between switches in a network fabric.
Leaf nodes 204 can reside at the edge of fabric 212, and can thus represent the physical network edge. In some cases, leaf nodes 204 can be top-of-rack (“ToR”) switches configured according to a ToR architecture. In other cases, leaf nodes 204 can be aggregation switches in any particular topology, such as end-of-row (EoR) or middle-of-row (MoR) topologies.
Network connectivity in fabric 212 can flow through leaf nodes 204. Here, leaf nodes 204 can provide servers, resources, endpoints, external networks, or VMs access to fabric 212, and can connect leaf nodes 204 to each other. In some cases, leaf nodes 204 can connect EPGs to fabric 212 and/or any external networks. Each EPG can connect to fabric 212 via one of the leaf nodes 204, for example.
Endpoints 210A-E (collectively “210”) can connect to fabric 212 via leaf nodes 204. For example, endpoints 210A and 210B can connect directly to leaf node 204A, which can connect endpoints 210A and 210B to fabric 212 and/or any other one of leaf nodes 204. Similarly, endpoint 210E can connect directly to leaf node 204C, which can connect endpoint 210E to fabric 212 and/or any other of the leaf nodes 204. On the other hand, endpoints 210C and 210D can connect to leaf node 204B via L2 network 206. Similarly, the wide area network (WAN) can connect to leaf node 204C or any other leaf node 204 via L3 network 208.
Endpoints 210 can include any communication device, such as a computer, a server, a switch, a router, etc. Although fabric 212 is illustrated and described herein as an example leaf-spine architecture, one of ordinary skill in the art will readily recognize that the subject technology can be implemented based on any network fabric, including any data center or cloud network fabric. Indeed, other architectures, designs, infrastructures, and variations are contemplated herein.
Each leaf node 204 is connected to each spine node 202 in fabric 212. During instances of link-state routing protocol updates, one or more leaf nodes 204 can detect the occurrence of network transitions, such as, for example, failure of one or more spine nodes 202. Examples of link-state routing protocol updates include Intermediate System-to-Intermediate System (“IS-IS”) updates and other intra-domain link-state routing protocol updates, such as Open Shortest Path First (“OSPF”) updates. The present disclosure is not limited to any particular type of routing update protocol.
A controller in network fabric 212, i.e., Fabric Controller 110, can be configured to detect anomalies in various network features occurring at a switch or between switches in fabric 212 using the methods described herein. The network features can include any network processes, including routing processes, the following of which are exemplary: (i) the CPU percentage of a protocol in the last sampled window; (ii) the number of bytes of new protocol data units produced; (iii) the number of new Shortest Path First (SPF) computations; (iv) the number of link state packets (LSPs) generated; (v) the number of LSPs flooded; (vi) the number of link state transitions; (vii) the amount of memory used by the process; (viii) the role of a switch (i.e., leaf switch or spine switch); (ix) the historic mean CPU usage during the last sampled window; and (x) the historic mean memory usage during the last sampled window. As mentioned above, these processes are merely exemplary, and the methods disclosed herein may be used to determine anomalies occurring for any other network processes. Further, although this disclosure focuses on determining if an anomaly exists at a switch during routing processes, the methods disclosed herein can also be applied to other non-routing processes.
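For illustration only, the ten exemplary features above can be thought of as forming a per-switch feature vector. The names below are hypothetical labels for the listed features, not identifiers used by the disclosure:

    # Hypothetical feature vector assembled from the ten exemplary features.
    FEATURE_NAMES = [
        "cpu_pct_last_window",     # (i)    CPU percentage of a protocol
        "new_pdu_bytes",           # (ii)   bytes of new protocol data units
        "spf_computations",        # (iii)  new SPF computations
        "lsps_generated",          # (iv)   LSPs generated
        "lsps_flooded",            # (v)    LSPs flooded
        "link_state_transitions",  # (vi)   link state transitions
        "process_memory",          # (vii)  memory used by the process
        "switch_role",             # (viii) leaf or spine
        "historic_mean_cpu",       # (ix)   historic mean CPU usage
        "historic_mean_memory",    # (x)    historic mean memory usage
    ]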
FC 110 can periodically receive and store data representing these various protocol variables and analyze the data to determine if any anomalies exist at a particular switch 203, using the methods disclosed herein. Typically, a spike or abnormality in any of these variables indicates an anomaly of some kind. For example, if there is any problem with LSP flooding, a routing protocol will usually resort to re-transmitting at an aggressive rate, resulting in excessive traffic generation. In one embodiment of the methods disclosed herein, statistics regarding network feature variables are collected at each switch 203 in fabric 212, where the role of the switch 203 is taken into account. This is because feature variable behavior at a spine switch 202 differs from feature variable behavior at a leaf switch 204.
In one embodiment, FC 110 collects, at periodic intervals, data representing various network feature variables from some or all of the switches 203 in network fabric 212. The period of data collection can be any period of time. For example, data from each switch 203 can be collected every 60 seconds, or data from spine switches 202 can be collected every 30 seconds and data from leaf switches 204 every 60 seconds. In one embodiment, FC 110 can be configured to run two types of anomaly detection processes. A first anomaly detection process determines anomalies between switches 203 having the same or similar roles on an instance of data, in order to find out if any one switch 203 is behaving anomalously when compared to other switches 203 having the same or similar role (i.e., one or more of spine switches 202, or one or more of leaf switches 204). A second anomaly detection process determines if any anomalies exist relative to the historic patterns of a given switch 203. Over a period of data collection, FC 110 can calculate a normalized parameter vector and collect all non-anomalous instances as a training set from which to learn the probabilistic model. Because the distributions of these variables are approximately normal, a high deviation from the mean indicates the existence of an anomaly.
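A brief sketch of assembling such a training set follows. It assumes NumPy and a simple three-standard-deviation screening criterion, both of which are illustrative choices rather than requirements of the disclosure:

    import numpy as np

    def build_training_set(samples, limit=3.0):
        # Keep only non-anomalous instances: rows whose normalized
        # features all lie within `limit` standard deviations of the
        # per-feature mean (illustrative criterion only).
        samples = np.asarray(samples, dtype=float)
        mu = samples.mean(axis=0)
        sigma = samples.std(axis=0) + 1e-12
        keep = (np.abs((samples - mu) / sigma) <= limit).all(axis=1)
        return samples[keep]   # training set for the probabilistic model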
FC 110 then performs a normalization procedure for each network feature variable, at step 330. Normalization is performed on each variable for each network feature because different variables occur at different rates; for example, the LSP rate is different from the “hello packet” rate. Normalizing these variables provides a more accurate determination as to whether any of the network feature values represent an anomaly, i.e., whether they fall outside of an expected range or threshold. In one embodiment, normalization of each network feature variable is accomplished by the formula:
(X(i,j) − Mean(X(j))) / StdDev(X(j)), where X(i,j) is the value of the j-th feature for the i-th observation (data sample point).
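This is a standard z-score normalization. A minimal sketch, assuming the data sample points are held in a NumPy matrix X with one row per observation and one column per network feature, might read:

    import numpy as np

    def normalize(X):
        # X[i, j] is the j-th feature's value for the i-th observation.
        mean = X.mean(axis=0)        # Mean(X(j)) for each feature j
        std = X.std(axis=0) + 1e-12  # StdDev(X(j)), guarded against zero
        return (X - mean) / std      # (X(i,j) - Mean(X(j))) / StdDev(X(j))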
FC 110 then calculates the probability value (p-value) for each network feature, at step 340. The calculation compares each of the normalized variables for each network feature to the calculated mean and standard deviation values for that network feature to determine if any of the normalized variables fall outside of the expected range, which may indicate that an anomaly with regard to that network feature exists. In one embodiment, a multivariate distribution function is used to determine the p-value. The multivariate distribution function assumes that there is some correlation between at least two network protocol variables. For example, in one scenario, CPU utilization at a switch 203 might be correlated to the number of LSPs generated by that switch 203. These variables may also be correlated to the number of LSPs flooded by the switch 203. Because some of the variables of one network feature may have a correlation to variables of another network feature, a multivariate distribution function is used to determine the p-value for each normalized variable.
In one embodiment, the multivariate distribution function used is a Gaussian multiplicative distribution function, where X[i] represents the i-th network feature, Mu[i] represents the mean value of that feature, and StdDev[i] is the standard deviation of the feature. For example, the network feature could be the CPU load at a particular switch 203 relating to a certain protocol during a sampling instance. The probability that the measured (normalized) value of this feature falls within an acceptable range is then governed by the following equation:
Probability of X[i] given Mu[i] and StdDev[i] = (1 / (sqrt(2*pi) * StdDev[i])) * e^(−(X[i] − Mu[i])^2 / (2 * StdDev[i]^2))
A computer program either internal to or external from FC 110 can be trained to identify an anomaly by checking whether the normalized sampled instance of the feature (in this case, the CPU load) at a particular switch 203 falls within an acceptance range, based upon the multiplicative probability of the feature variables. In this fashion, an anomaly can be identified. The p-value calculation can be performed for all other normalized network feature values, and FC 110 can flag an anomaly whenever the normalized sampled instance of a feature value falls outside the acceptance range. As mentioned above, a dual algorithm can be modeled as two independent models for anomaly detection (current behavior of all switches, to see if one switch is behaving anomalously, and historic behavior of the same switch). In one embodiment, an anomaly detected in both models for a given switch's protocol parameter vector can be used as a more definitive measure of the existence of an anomaly.
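A sketch of this p-value computation and acceptance-range check follows. The function names, the threshold epsilon, and the use of NumPy are illustrative assumptions; the equation implemented is the one given above:

    import numpy as np

    def gaussian_probability(x, mu, std):
        # Probability of X[i] given Mu[i] and StdDev[i], per the equation above.
        return np.exp(-((x - mu) ** 2) / (2.0 * std ** 2)) / (np.sqrt(2.0 * np.pi) * std)

    def flag_anomaly(x, mu, std, epsilon=1e-3):
        # Multiply the per-feature probabilities; a multiplicative
        # probability below the acceptance threshold flags an anomaly.
        return np.prod(gaussian_probability(x, mu, std)) < epsilon

In the dual-model arrangement described above, flag_anomaly could be evaluated once against the peer-group statistics and once against the switch's own historic statistics, with an anomaly reported more definitively when both checks agree.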
Once an anomaly has been detected, remedial steps can be taken to address the anomaly, at step 350. For example, more detailed internal debug logs and/or event history information can be collected to help analyze the issues that caused the anomaly. Alerts can be generated and sent to a network operator, or remedial action can be taken automatically; for example, the switch where the anomaly has occurred can be quarantined using a draining mechanism, in which a link's cost is marginally increased to reduce its preference so that traffic is steered away from that link. The present disclosure is not limited to any particular type of remedial procedure.
As discussed above, a probability value is determined for each network feature value by running a multivariate distribution model for each normalized network feature value to determine if any of these values fall outside the expected range, which may indicate the existence of an anomaly. In instances where it is desirable to analyze many network features (e.g., 20 or 30), it may be desired to reduce the number of computations, since such computations could tax the computing and processing resources involved. One way to reduce the number of computations is to reduce the redundancy in the collected data by identifying variables that have correlations with each other.
In one embodiment, FC 110 accomplishes this by performing a principal component analysis (PCA) on the normalized network feature variables to reduce the dimensionality of the data set before the probability computation is performed.
PCA serves not only to reduce the dimensionality but also results in a series of principal components that have no correlation with each other. In other words, the components or variables input into the probability model are uncorrelated since, in essence, PCA has already taken into account the correlation of the various variables with each other. Thus, rather than using a multivariate Gaussian distribution function to determine the p-values for each normalized network feature variable, an independent distribution function may be used; an independent Gaussian function is one example. The independent Gaussian model assumes that the input variables are independent of one another. By multiplying each of the p-values calculated by the independent Gaussian model, an aggregate probability can be obtained. This aggregate probability is compared to a threshold value, and if the aggregate probability falls below the threshold value, an anomaly can be flagged.
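A sketch of this PCA-plus-independent-Gaussian variant follows, assuming NumPy and scikit-learn's PCA implementation; the number of components and the threshold are illustrative parameters, not values prescribed by the disclosure:

    import numpy as np
    from sklearn.decomposition import PCA

    def fit_pca_model(X_train, n_components=5):
        # Project the training data onto uncorrelated principal
        # components and learn per-component mean and deviation.
        pca = PCA(n_components=n_components)
        Z = pca.fit_transform(X_train)
        return pca, Z.mean(axis=0), Z.std(axis=0) + 1e-12

    def aggregate_probability(pca, mu, std, x):
        # Product of independent Gaussian densities over the components.
        z = pca.transform(np.asarray(x, dtype=float).reshape(1, -1))[0]
        p = np.exp(-((z - mu) ** 2) / (2.0 * std ** 2)) / (np.sqrt(2.0 * np.pi) * std)
        return np.prod(p)

An anomaly can then be flagged whenever aggregate_probability falls below the chosen threshold value.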
In another embodiment, instead of each data sample point representing a discrete measurement of a network variable at an instance of time, each data point can represent a number of network variable measurements over a window of time. This might be desired to detect the occurrence of an anomaly based not just upon a one-time occurrence or “spike” but over a period of time. Thus, for example, 10 measurements can be taken measuring CPU utilization at a given switch 203; collectively, these measurements can represent the “sample data point” for the CPU utilization variable in the probability computation. This can also be done for some or all of the other network variables. So, for example, if there are 30 network feature variables, and 10 measurements are taken for each network feature variable, a sample size of 300 variables can be used to form a data set. This allows for detection of a contextual anomaly for each network feature, i.e., a series of anomalous measurements over a period of time, as opposed to a point anomaly, i.e., the occurrence of a single anomalous measurement, which may in some instances occur for reasons other than an actual anomaly. The present disclosure is not limited to collecting data for each network feature variable over any specific window of time or for any specific number of measurements.
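A sketch of forming such windowed sample points, under the same illustrative NumPy assumption as the earlier fragments, might read:

    import numpy as np

    def windowed_sample(measurements, window=10):
        # measurements: (num_timesteps, num_features) raw samples.
        # Concatenate the last `window` rows into one flat data sample
        # point, e.g., 30 features x 10 measurements -> 300 variables.
        recent = np.asarray(measurements, dtype=float)[-window:]
        return recent.reshape(-1)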
For clarity of explanation, in some instances the present technology may be presented as including individual functional blocks including functional blocks comprising devices, device components, steps or routines in a method embodied in software, or combinations of hardware and software.
In some embodiments the computer-readable storage devices, mediums, and memories can include a cable or wireless signal containing a bit stream and the like. However, when mentioned, non-transitory computer-readable storage media expressly exclude media such as energy, carrier signals, electromagnetic waves, and signals per se.
Methods according to the above-described examples can be implemented using computer-executable instructions that are stored or otherwise available from computer readable media. Such instructions can comprise, for example, instructions and data which cause or otherwise configure a general purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. Portions of computer resources used can be accessible over a network. The computer executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, firmware, or source code. Examples of computer-readable media that may be used to store instructions, information used, and/or information created during methods according to described examples include magnetic or optical disks, flash memory, USB devices provided with non-volatile memory, networked storage devices, and so on.
Devices implementing methods according to these disclosures can comprise hardware, firmware and/or software, and can take any of a variety of form factors. Typical examples of such form factors include laptops, smart phones, small form factor personal computers, personal digital assistants, rackmount devices, standalone devices, and so on. Functionality described herein also can be embodied in peripherals or add-in cards. Such functionality can also be implemented on a circuit board among different chips or different processes executing in a single device, by way of further example.
The instructions, media for conveying such instructions, computing resources for executing them, and other structures for supporting such computing resources are means for providing the functions described in these disclosures.
Although a variety of examples and other information was used to explain aspects within the scope of the appended claims, no limitation of the claims should be implied based on particular features or arrangements in such examples, as one of ordinary skill would be able to use these examples to derive a wide variety of implementations. Further, although some subject matter may have been described in language specific to examples of structural features and/or method steps, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to these described features or acts. For example, such functionality can be distributed differently or performed in components other than those identified herein. Rather, the described features and steps are disclosed as examples of components of systems and methods within the scope of the appended claims.