This invention relates generally to communications in computer networks. More particularly, this invention is directed toward monitoring network performance of virtualized resources.
Networks continue to grow in size and line speed. This results in challenging network administration tasks since the volume of information to be analyzed is overwhelming. Existing techniques for generating warnings regarding potentially hazardous network activity result in many false positives. This is very distracting to network administrators.
Thus, there is a need for improved network monitoring techniques, including the monitoring of virtualized resources within a network.
A machine has a processor and a memory connected to the processor. The memory stores instructions executed by the processor to observe network packet exchanges between virtualized resources. Key performance indicators characterizing packet information and connection information are generated from the packet exchanges. The key performance indicators are routed to a network connected device.
The invention is more fully appreciated in connection with the following detailed description taken in conjunction with the accompanying drawings, in which:
Like reference numerals refer to corresponding parts throughout the several views of the drawings.
In one embodiment, each of the network monitoring devices 202 may monitor and analyze traffic in a corresponding network 100, such as a data center network. Referring to
The network monitoring devices 202 are connected to a management station 204 across a network 206. The network 206 may be a wide area network, a local area network, or a combination of wide area and/or local area networks. For example, the network 206 may represent a network that spans a large geographic area. The management station 204 may monitor, collect, and display traffic analysis data from the network monitoring devices 202, and may provide control commands to the network monitoring devices 202. In this way, the management station 204 may enable an operator, from a single location, to monitor and control network monitoring devices 202 deployed worldwide.
The components discussed up to this point are disclosed in U.S. Pat. No. 9,407,518, which is owned by the current applicant. U.S. Pat. No. 9,407,518 is incorporated herein by reference. The current application builds upon this architecture by utilizing a management station 204 with new features disclosed in connection with the discussion of
In addition, the system 200 includes one or more container based network monitoring devices 214A-214N. Each container based network monitoring device 214 includes interfaces 216A-216N, which may be of the type discussed in connection with network device 202. The container based network monitoring device 214 is more fully disclosed in connection with the discussion of
The system 200 also includes one or more forensic network devices 218A-218N. Each forensic network device 218 includes interfaces 220A-220N, which may be of the type discussed in connection with network device 202. The forensic network device 218 is more fully characterized in connection with the discussion of
As discussed in previously incorporated U.S. Pat. No. 9,407,518, each network monitoring device 202 provides real-time, high resolution (i.e., nanosecond resolution) deep packet inspection data for every bit in every packet at line speed. Each device 202 generates packet level Key Performance Indicators (KPIs), which are continuously fed into the time series database 322. As discussed in more detail below, this facilitates distributed monitoring of a network.
Packet collector 500 observes every packet exchange between virtual machines 502A-502N. Similarly, packet collector 600 observes every packet exchange between containers 602A-602N. Virtual machines 502A-502N and containers 602A-602N are virtualized resources. The term virtualized resources is used herein to cover both virtual machines and containers. Each packet collector processes all the packets it captures and creates relevant KPIs based on these packets. The KPIs capture significant network activity while effectively condensing the amount of information that must be forwarded to other network connected devices, such as the time series database 322 of the management station 204.
The KPIs may include packet information, such as Ethernet type, internet protocol type, packet length, and higher layer protocol information, such as Dynamic Host Configuration Protocol (DHCP) information, Hypertext Transfer Protocol (HTTP) information, HTTP Secure (HTTPS) information and the like. The KPIs may also include connection information. Each packet collector keeps track of connections for connection oriented protocols such as Transmission Control Protocol (TCP) and Session Initiation Protocol (SIP), which allows for the creation of KPIs such as session length, session time, session failures (for example, retransmission timeouts) and the like. Each packet collector maintains these KPIs internally and can report them to the time series database 322. In addition, each packet collector maintains local storage of the actual packets captured in a circular buffer such that one or more consumers can retrieve these packets when needed. This methodology allows a network to be managed and monitored efficiently, without overwhelming the network by sending all the packets to a single centralized server for analysis. In other words, the disclosed techniques provide a fully distributed, scalable solution for monitoring of virtualized resources.
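By way of illustration only, the following Python sketch shows one way a packet collector might condense observed packets into KPIs while retaining the raw packets in a circular buffer. The class name, method names, chosen KPIs and buffer capacity are hypothetical and are not taken from the disclosed embodiments.

```python
from collections import Counter, deque


class PacketCollectorSketch:
    """Hypothetical collector: keeps compact KPIs plus a circular buffer of raw packets."""

    def __init__(self, capacity=4):
        # A deque with maxlen silently discards the oldest packet once full,
        # which gives circular-buffer behavior.
        self.packets = deque(maxlen=capacity)
        self.packets_by_ethertype = Counter()
        self.bytes_total = 0

    def observe(self, ethertype, payload):
        """Record one packet: store it locally and update the KPIs."""
        self.packets.append((ethertype, payload))
        self.packets_by_ethertype[ethertype] += 1
        self.bytes_total += len(payload)

    def kpi_report(self):
        """Condensed summary that could be forwarded to a time series database."""
        return {
            "packets_by_ethertype": dict(self.packets_by_ethertype),
            "bytes_total": self.bytes_total,
        }

    def recent_packets(self):
        """Raw packets still held locally, oldest first, for on-demand retrieval."""
        return list(self.packets)
```

In this sketch only the small dictionary returned by kpi_report() would cross the network; the raw packets stay on the collector until a consumer asks for them.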
Attention now turns to the data collected by the time series database 322. The following terms are used to characterize this data.
Data may be loaded into the time series database 322 using a variety of techniques. For example, a command line and an application interface may be used. Below is an example insert command:
Below are exemplary keywords and values that may be used in accordance with embodiments of the invention.
Below are exemplary queries that may be expressed against the time series database 322.
Tag values may be expressed on per-second or sub-second levels. Each time frame has an associated indicator. Below is a list of tag values that may be associated with indicators.
Below is a description of data points that may be collected in connection with indicators.
Below are examples of fields for different data points.
The analytics module 324 processes data in the time series database 322. In one embodiment, the analytics module 324 defines baseline network behavior and produces analytics and alerts based upon the baseline network behavior. The analytics may be displayed by the visualization module 326 (e.g., the visualization module 326 renders a visualization, which is displayed on a monitor connected to the input/output ports 312).
Many network administrators report being overwhelmed by data. They do not need more raw data. They need a more intelligent summary of the large volume of data that represents network activity.
As previously discussed, the network device 202 captures network traffic at line rate on each monitored link and generates performance analytics (and complete packet inspection) in real-time for network administrators. Therefore, the network device 202 captures a large amount of raw data. In addition, VM based network monitoring devices 210A-210N, container based network monitoring devices 214A-214N and forensic network devices 218A-218N may be generating data.
The data alone is not very useful to the network administrators that are already overwhelmed by data. Therefore, there is a need to distill this data into useful, actionable information.
Given the ability of a network monitoring device 202 to capture network traffic at line rate and generate analytics from this traffic, there is an opportunity to analyze and forecast the traffic in a network. This allows one to extract meaningful information from the line-rate data collected from the network monitoring devices 202A-202N and other devices of
The analytics module 324 creates baselines from historical network traffic. These baselines can be used to determine when the network traffic is behaving as expected or exhibiting unusual characteristics. In the case of unusual characteristics, one can look for abnormal network behaviors that might indicate an attack or other potential issue.
Often network traffic exhibits a weekly pattern. Think of a business network. The network will experience reduced traffic over the weekends and during weekday nights when employees are at home. The network traffic will pick up each morning as employees arrive at work and decrease as employees go home for the day. Therefore, the traditional time series approach of correlating the future traffic with the previous short time period (seconds to hours) completely ignores the fundamental forces driving the network traffic.
Most authors use time series analysis to model and predict network traffic. This correlates the future traffic with the traffic of the recent past. In some cases, authors add a seasonal component to their traffic. Often this seasonal component is short (from minutes up to a day). Sometimes this seasonal component is annual.
The analytics module 324 utilizes a weekly pattern and assumes that it is going to be significant for a large percentage of the networks deploying network monitoring devices 202A-202N. Therefore, rather than looking at a sliding window of time (employing a single time series analysis of the network traffic), traffic is sliced into time segments per weekday. This leads to multiple time series, each with a weekly time step.
Prior art models network traffic with a single time series. Rather than create a time series out of the microsecond to second data, as is commonly found in the literature, an embodiment of the invention aggregates data into longer time samples (for example, between 10 and 20 minutes and, in one embodiment, 15 minute time intervals). These time samples are then treated as a time series with time steps of one week. This process creates multiple “parallel” time series.
For example, if one aggregates data into 15 minute samples, then one will have 96 time series per day (96=60*24/15), giving a total of 672 individual time series per week (672=7*96). Each time series incorporates data from the previous weeks. This historical data is used to predict the traffic for the same time slot in the next week. As data is captured for the current day, it is compared to the baseline (calculated the previous week) to determine what actions to take, if any.
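The slot arithmetic above can be sketched in Python as follows. The function name weekly_slot and the convention that Monday is the first day of the week are assumptions made for illustration.

```python
from datetime import datetime

SLOT_MINUTES = 15
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES  # 96
SLOTS_PER_WEEK = 7 * SLOTS_PER_DAY       # 672


def weekly_slot(ts: datetime) -> int:
    """Return the index (0..671) of the weekly time series a timestamp belongs to."""
    slot_in_day = (ts.hour * 60 + ts.minute) // SLOT_MINUTES
    # datetime.weekday() returns 0 for Monday through 6 for Sunday.
    return ts.weekday() * SLOTS_PER_DAY + slot_in_day
```

Two timestamps exactly one week apart map to the same index, so each index accumulates one sample per week and forms its own time series.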
There are many approaches to calculating the baseline for the time interval in the next week. The baseline can be calculated using a simple moving average, an exponential moving average, Holt-Winters exponential smoothing, a trend plus an autoregressive process, an autoregressive-moving-average model, or a more complicated detrended time series model (ARIMA, GARCH, neural networks, etc.).
It is believed that there is a strong correlation between the network traffic for the previous weeks and the network traffic for the current week. Therefore, relatively simple models perform adequately (moving average, exponentially weighted moving average, Holt-Winters exponential smoothing or an autoregressive process plus trend).
All of these models (mentioned above) require an initial phase to get started. For the first couple of weeks of collecting data, one can initialize the baseline with a simple average. Once enough data has been collected, one can calculate the chosen model from the existing data. For a straightforward autoregressive model, one needs to extract the trend, and also choose the model order and the number of weeks of data to use for fitting the autoregressive model to the data.
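As a minimal sketch of this start-up behavior, the following Python function initializes the baseline for a single weekly time slot with a simple average and then switches to an exponentially weighted moving average, which is one of the simpler models mentioned above. The function name, the smoothing factor alpha and the two-week warm-up length are assumed values chosen for illustration.

```python
def next_week_baseline(history, alpha=0.3, warmup_weeks=2):
    """Baseline for next week's slot from past weekly values (oldest first).

    During the warm-up phase the baseline is a simple average. Afterwards an
    exponentially weighted moving average is seeded with the warm-up average.
    """
    if not history:
        raise ValueError("need at least one week of data")
    if len(history) <= warmup_weeks:
        return sum(history) / len(history)
    baseline = sum(history[:warmup_weeks]) / warmup_weeks
    for value in history[warmup_weeks:]:
        # Each new week pulls the baseline toward the most recent observation.
        baseline = alpha * value + (1 - alpha) * baseline
    return baseline
```

Calling the function again after each week's sample is appended to history yields a baseline that changes from week to week rather than being computed once and frozen.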
The Holt-Winters model incorporates both a linear trend and a seasonal trend in the model (and many of the other models can also include seasonal components). Since the word “seasonal” does not explicitly appear above, one might ask why include the Holt-Winters exponential smoothing model as an option. The answer is that the weekly data will potentially show both a weekly trend and a yearly seasonal trend (“Black Friday,” for example). Hence, embodiments of the invention include a yearly seasonal trend in models. However, the impact of the yearly seasonal trend is not available for the baseline calculation until the start of the second year of data collection.
Note that the weekly time series models are not calculated once and then frozen for all future baseline calculations. Each week the time series models are updated based upon the network traffic received on the current day. The newly updated models are used to calculate the baseline for the following week. This means that the time series models used to calculate the baselines will most likely differ each week.
In one embodiment, each device 202A-202N stores aggregated per-second data in the time series database 322. Using the maximum value of the collected data tends to be uninteresting. The maximum moves up toward the line rate and then stays there. In addition, the average value is often too small to capture the bursts in the traffic. The average is usually orders of magnitude lower than the actual bursts on the link.
Using a percentile of the maximum values, such as the 70th percentile of the maximum values, shows a behavior that appears to be more predictable than the maximum bit rate or average bit rate. Therefore, an embodiment of the baselining code uses the 70th percentile of the maximum per-second data stored in the time series database 322. For instance, if the 70th percentile of the maximum per-second traffic for the current day exceeds the maximum of the 70th percentiles for the previous N weeks, then it is known that the network traffic for the current day is abnormal relative to the recent history. A similar statement can be made if the 70th percentile of the maximum per-second traffic for the current day drops below the minimum of the 70th percentiles for the previous N weeks.
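The comparison described above might be sketched in Python as follows. The nearest-rank definition of the percentile is an assumption, since the description does not specify how the percentile is computed, and the function names and return labels are hypothetical.

```python
def percentile_70(values):
    """70th percentile using the nearest-rank method (an assumed definition)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no samples")
    # Integer ceiling of 0.70 * n, avoiding floating point rounding.
    rank = (70 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def classify_traffic(current_maxima, previous_weeks_maxima):
    """Compare the current 70th percentile with those of the previous N weeks.

    current_maxima: per-second maxima for the current period.
    previous_weeks_maxima: one list of per-second maxima per previous week.
    """
    now = percentile_70(current_maxima)
    history = [percentile_70(week) for week in previous_weeks_maxima]
    if now > max(history):
        return "above"   # abnormally high relative to recent history
    if now < min(history):
        return "below"   # abnormally low relative to recent history
    return "normal"
```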
Sometimes a non-recurring event might happen that significantly impacts the network traffic. In this case, it might be inappropriate to include the data collected during this event into the baseline calculation. For this reason, the analytics module 324 is configured to allow one to specify days (and time intervals within days) to be excluded from the baseline calculations.
In addition to calculating a baseline, it is desirable to provide the network administrator with an estimate for the quality of the baseline. There are a variety of approaches one could take to estimate the accuracy of the baseline. A simple estimate of the accuracy is to take a moving average (or weighted moving average) of the previous absolute prediction errors (absolute differences between the measured data and the corresponding baseline).
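A minimal Python sketch of this simple accuracy estimate is shown below. The function name and the default window of four samples are assumptions made for illustration.

```python
def baseline_error_estimate(measured, baseline, window=4):
    """Moving average of the most recent absolute prediction errors.

    measured and baseline are equal-length sequences for one time slot,
    oldest first. A smaller result indicates a more accurate baseline.
    """
    errors = [abs(m - b) for m, b in zip(measured, baseline)]
    recent = errors[-window:]
    if not recent:
        raise ValueError("no prediction errors available yet")
    return sum(recent) / len(recent)
```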
When using an autoregressive model to calculate the baseline, one can use the accompanying theory of linear predictors to estimate the prediction error of the baseline by calculating the mean squared prediction error for the autoregressive model. However, the standard calculation of the mean squared prediction error is an optimistic lower bound on the prediction error, not a good estimate of the prediction error. Since the variance of the process is an upper bound on the mean squared prediction error, one can approximate the quality of the baseline by estimating the variance of the weekly data values.
The analytics module 324 is configured to generate alerts in response to material deviations from baseline behavior. The expected baseline behavior is presented to the user as an envelope around the baseline function. The envelope comprises a function above the baseline and a function below the baseline that together estimate the range expected to contain most of the future network traffic. A reference to the network behavior baseline contemplates either the actual network behavior baseline alone or the network behavior baseline together with the envelope. The analytics module 324 is configurable to define a deviation threshold, such as a 10% deviation threshold from the network behavior baseline, a 15% deviation threshold from the network behavior baseline, or a 20% deviation threshold from the network behavior baseline. The analytics module 324 may, at the user's option, compare either the raw network traffic or a smoothed version of the network traffic to the network behavior baseline. The user may also choose a minimum amount of time the traffic needs to exceed the deviation threshold from the network behavior baseline in order to trigger an alert. The analytics module 324 is also configurable to define material deviations in the context of known events that may impact the baseline behavior. For example, an expected blockbuster media release may be used to specify greater thresholds for what are considered deviations from baseline behavior.
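For illustration, the deviation threshold and the minimum duration might be combined as in the following Python sketch. The function name, the representation of the minimum duration as a count of consecutive samples, and the treatment of a zero baseline are assumptions rather than details of the described analytics module 324.

```python
def should_alert(traffic, baseline, threshold=0.10, min_samples=3):
    """Return True when traffic deviates from the baseline by more than
    threshold (a fraction, e.g. 0.10 for 10%) for at least min_samples
    consecutive samples.
    """
    run = 0
    for observed, expected in zip(traffic, baseline):
        if expected == 0:
            # Assumed convention: any traffic against a zero baseline deviates.
            deviating = observed != 0
        else:
            deviating = abs(observed - expected) / expected > threshold
        run = run + 1 if deviating else 0
        if run >= min_samples:
            return True
    return False
```

Passing a smoothed series as traffic instead of the raw samples corresponds to the option of comparing a smoothed version of the network traffic to the baseline, and a larger threshold could be supplied around a known event.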
The analytics module 324 is configured to generate an alert in response to current network behavior that exceeds a deviation threshold. The alert may be a signal applied to network 206, such as an email or text, which is directed toward one or more designated individuals, such as network administrators. The analytics module 324 is also configurable to adjust the severity of the alert as a function of the severity of the deviation from baseline behavior.
An embodiment of the present invention relates to a computer storage product with a computer readable storage medium having computer code thereon for performing various computer-implemented operations. The media and computer code may be those specially designed and constructed for the purposes of the present invention, or they may be of the kind well known and available to those having skill in the computer software arts. Examples of computer-readable media include, but are not limited to: magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs, DVDs and holographic devices; magneto-optical media; and hardware devices that are specially configured to store and execute program code, such as application-specific integrated circuits (“ASICs”), programmable logic devices (“PLDs”) and ROM and RAM devices. Examples of computer code include machine code, such as produced by a compiler, and files containing higher-level code that are executed by a computer using an interpreter. For example, an embodiment of the invention may be implemented using JAVA®, C++, or other object-oriented programming language and development tools. Another embodiment of the invention may be implemented in hardwired circuitry in place of, or in combination with, machine-executable software instructions.
The foregoing description, for purposes of explanation, used specific nomenclature to provide a thorough understanding of the invention. However, it will be apparent to one skilled in the art that specific details are not required in order to practice the invention. Thus, the foregoing descriptions of specific embodiments of the invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications; they thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. It is intended that the following claims and their equivalents define the scope of the invention.
This application is related to concurrently filed and commonly owned U.S. Ser. No. ______, filed May ______, 2017.