The present disclosure generally relates to the classification of data streams using behavioral methods.
Internet service providers (ISPs) typically attempt to classify at least some of the data traffic supported by their networks. Traffic classification enables an ISP to prioritize or deprioritize network traffic (based on service tiers, net neutrality, etc.), as well as to identify malicious traffic (e.g., worms) and/or potentially illegal traffic (e.g., copyright violations). Currently, most traffic classification in ISP networks is performed using deep packet inspection (DPI). In DPI, the data payload of each packet is inspected and searched for patterns that match known character strings from a continuously updated database of identifiers. Accordingly, DPI is only appropriate for the classification of non-encrypted traffic. However, the percentage of encrypted traffic in ISP networks is increasing, thereby limiting the usefulness of DPI for classifying such traffic.
The present invention will be understood and appreciated more fully from the following detailed description, taken in conjunction with the drawings in which:
A method for video traffic flow behavioral classification is implemented on a computing device and includes: receiving coarse flow data from a network router, where the coarse flow data includes summary statistics for data flows on the router, classifying the summary statistics to detect video flows from among the data flows, requesting fine flow data from the network router for each of the detected video flows, where the fine flow data includes information on a per packet basis, receiving the fine flow data from the network router, and classifying each of the detected video flows per video service provider in accordance with the information.
A method implemented on a network router includes: instructing a coarse flow generator on the network router to generate summary statistics for network traffic flows, forwarding the summary statistics to a network data center for classification of the network traffic flows, receiving a request from the network data center to generate packet based information for at least one of the network traffic flows in accordance with the classification, instructing a fine flow generator on the network router to generate the packet based information, and forwarding the packet based information to the network data center, wherein the instructing of the coarse and fine flow generators is implemented via a script interpreted by an embedded event manager (EEM) on the network router.
The inventors of the present invention have realized that Over The Top (OTT) video flows, such as those provided by Netflix and YouTube, may be particularly suitable for classification by shallow packet inspection (SPI) methods that do not require inspection of data payloads and are therefore not impacted by encryption. OTT video flows are typically persistent (compared to typical web traffic): a movie may last for hours. During that time, the flows are also fairly similar and predictable. By way of illustration,
It will be appreciated by one of skill in the art that OTT video is currently the dominant type of traffic in Internet service provider (ISP) networks. Typically, up to 60% of downstream traffic is OTT video. Furthermore, the percentage of OTT video in downstream traffic has been growing and is believed by the inventors of the present invention to be likely to continue to grow. Accordingly, a method for classifying encrypted OTT video may enable an ISP to classify a significant portion of all of its network traffic, regardless of whether or not it is encrypted.
Reference is now made to
It will be appreciated by one of skill in the art that both routers 100 and data center 200 may comprise other functional components that in the interests of clarity are not shown in
EEM 120 may be operative to instruct coarse flow generator 130 and fine flow generator 140 to generate network flow data for provision to data center 200. Coarse flow generator 130 may be configured to generate coarse flow data based on low frequency analysis of data flows sampled by router 100. Fine flow generator 140 may be configured to generate fine flow data based on high frequency analysis of data flows sampled by router 100.
In accordance with embodiments of the present disclosure, the functionality of routers 100 may be provided by leveraging currently existing network technology without adding additional hardware to network 10. For example, IVN script 110, EEM 120, coarse flow generator 130 and fine flow generator 140 may be implemented using existing, commercially available, traditional and flexible versions of Cisco IOS NetFlow. NetFlow classifies network packets into "flows" and summarizes characteristics of these flows. The original version of NetFlow, now referred to as traditional NetFlow, classifies flows based on a fixed set of seven key fields: source IP, destination IP, source port, destination port, protocol type, type of service (ToS) and logical interface. Traditional NetFlow's flow characteristics, such as total bytes and total packets, are (generally speaking) based on the lifetime of the flow or a one minute sample. The data retrieved is highly generalized and is therefore appropriate for low frequency analysis without requiring added processing downstream. Accordingly, coarse flow generator 130 may be implemented using traditional NetFlow per a suitably configured IVN script 110 input to EEM 120.
Flexible NetFlow supports many additional features, including shorter sample periods and configurable key fields to define flows. With support for configurable key fields, a flow may be defined by criteria other than the seven key fields used by traditional NetFlow. Accordingly, new combinations of packet fields may be used to classify packets into unique flows that may bear little resemblance to those created by traditional NetFlow. In accordance with embodiments of the present disclosure, a sequence approach may be used with flexible NetFlow to capture details on an almost per-packet level, as opposed to the typical generalization provided by traditional NetFlow. The sequence approach is predicated on including the TCP sequence number as a key. With the TCP sequence number included as a key, most packets (except for retransmits) will be treated as unique flows, since the overall combination of key fields (source IP, destination IP, source port, destination port, TCP sequence number, and others) typically creates a unique combination for each packet. The resulting flows will therefore typically represent a single packet, causing flexible NetFlow's flow summary to accurately report per-packet details including reception time and packet length, thereby providing high frequency analysis. Accordingly, fine flow generator 140 may be implemented using flexible NetFlow per a suitably configured IVN script 110 input to EEM 120. In order to provide per-packet details for a video flow, fine flow generator 140 may therefore generate a series of summary reports, one for every packet in the sample population.
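By way of non-limiting illustration, the following Python sketch (not NetFlow itself; the packet tuples and the reduced key layout are hypothetical, with only five of the seven traditional key fields shown for brevity) demonstrates why adding the TCP sequence number to the flow key yields effectively one flow record per packet:

```python
# Illustrative sketch of flow-key aggregation; not an actual NetFlow API.
from collections import defaultdict

packets = [
    # (src_ip, dst_ip, src_port, dst_port, proto, tcp_seq, length)
    ("10.0.0.1", "198.51.100.7", 51200, 443, "tcp", 1001, 1500),
    ("10.0.0.1", "198.51.100.7", 51200, 443, "tcp", 2501, 1500),
    ("10.0.0.1", "198.51.100.7", 51200, 443, "tcp", 4001, 760),
]

def aggregate(packets, key_fn):
    """Summarize packets into flows keyed by key_fn (byte and packet counts)."""
    flows = defaultdict(lambda: {"bytes": 0, "packets": 0})
    for p in packets:
        f = flows[key_fn(p)]
        f["bytes"] += p[6]
        f["packets"] += 1
    return dict(flows)

# Traditional-style key: 5-tuple only.
coarse = aggregate(packets, lambda p: p[:5])
# Sequence approach: TCP sequence number added to the key.
fine = aggregate(packets, lambda p: p[:6])

print(len(coarse))  # 1 flow  -> a single aggregate summary
print(len(fine))    # 3 flows -> effectively per-packet records
```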
It will be appreciated by one of skill in the art that since this sequence approach may generate a significant amount of data, it may be appropriate for shorter durations, i.e., less than one second, with an appropriately sized cache to ensure that the collection process has acceptable impact on the IOS device. It will similarly be appreciated by one of skill in the art that the present disclosure is not limited solely to using NetFlow to implement the functionality of routers 100. Any other known product or service providing generally the same functionality may also be used. Alternatively, or in addition, additional software and/or hardware components may be added as necessary to an existing router 100 and/or data center 200 to provide the data collection and analysis provided by NetFlow.
Reference is now made to
Flow director 220 is operative to maintain proper operation of IVN data flows from routers 100. It may use SNMP to request (step 330) that specific routers 100 initiate coarse flow generation, per the participating routers 100 listed in endpoint database 215. It will be appreciated that steps 310-330 may not necessarily be performed each time the processing loop of process 300 is executed. For example, for any given execution of the processing loop, there may be no new notifications to be received in step 310.
Collector 230 may collect (step 340) coarse flow data forwarded from router 100 and save it in coarse and fine flow database 240. The coarse flow data may represent short aggregated summaries of a sampling of all of the flow data on router 100. Alternatively, coarse flow generator 130 may be implemented to filter out data for flows that are unlikely to be video flows. For example, very short data flows may be excluded on the assumption that they are not video flows. Such filtering may be implemented by controlling and configuring flexible NetFlow functionality via IVN script 110 for the generation of the coarse flow by coarse flow generator 130. It will be appreciated that the coarse flow data is generated by coarse flow generator 130 and forwarded to data center 200 using UDP. It will be appreciated by one of skill in the art that other transport protocols may be similarly suitable to implement this functionality.
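A minimal Python sketch of such a pre-filter is shown below; the byte and duration thresholds are illustrative assumptions, not values from the present disclosure:

```python
# Hypothetical pre-filter for coarse flow records unlikely to be video.
MIN_BYTES = 1_000_000     # assumed: video flows move megabytes of data
MIN_DURATION_S = 10.0     # assumed: video flows persist for many seconds

def likely_video_candidates(coarse_flows):
    """Keep only flows whose volume and duration are consistent with video."""
    return [f for f in coarse_flows
            if f["bytes"] >= MIN_BYTES and f["duration_s"] >= MIN_DURATION_S]
```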
Coarse classifier 250 may classify (step 350) coarse flows retrieved from coarse and fine flow database 240 in accordance with previously defined rules and/or training data in rules and training database 255. The rules in rules and training database 255 may be defined in accordance with heuristic analysis of how different media services may operate their platforms. Analysis of OTT sessions from real service providers may yield features such as audio/video bitrates, chunk gaps and buffer sizes.
For example, per recent analysis, Netflix may generally use one of two inter-chunk packet gaps and only one audio bitrate. Reasonable confidence in this analysis may derive from the fact that some findings are associated with a limited set of expected values. For example, audio bitrates are normally 64, 128, 192, 256, etc. kbps, and inter-chunk packet gaps are normally integer values. Assuming such values are correct, further assumptions may be made regarding the correctness of other derived values (e.g., video bitrates) as well. Tests using this approach in a limited number of network environments have yielded identification success rates exceeding 98%. However, it will be appreciated by one of skill in the art that in a real-world environment, such an approach may underperform these results, since it may be difficult to heuristically learn and adapt to changes in provider services and ambient network conditions.
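The following Python sketch illustrates such plausibility checks; the set of standard audio bitrates and the tolerances are assumptions for illustration:

```python
# Hypothetical plausibility checks on values derived from coarse flow data.
STANDARD_AUDIO_KBPS = {64, 96, 128, 160, 192, 256, 320}  # assumed set

def plausible_audio_bitrate(kbps, tol=2.0):
    """True if a derived audio bitrate lands near a standard value."""
    return any(abs(kbps - std) <= tol for std in STANDARD_AUDIO_KBPS)

def plausible_chunk_gap(gap_s, tol=0.05):
    """True if an inter-chunk gap is near an integer number of seconds."""
    return abs(gap_s - round(gap_s)) <= tol

# If both derived values look plausible, confidence in other derived
# values (e.g., the video bitrate) increases correspondingly.
```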
If, per step 350, it is likely that the coarse flow represents a video flow (step 360), coarse classifier 250 instructs flow director 220 to request (step 365) that fine flows be generated by router 100. Otherwise, control may return to the start of process 300.
Collector 230 may receive (step 370) the associated fine flows from router 100 and store them in coarse and fine flow database 240. It will be appreciated that, as discussed hereinabove, the fine flow data is generated by fine flow generators 140. It will be appreciated that at any one time there may be more than one active video flow candidate on router 100; an instance of fine flow generator 140 may be executed for each active video flow candidate. The fine flow data comprises more finely grained information than the coarse flow data. For example, timestamp and packet size may be captured for all messages in a short time window (e.g., 250 ms) for forwarding to collector 230. It will be appreciated that such high resolution sampling may be resource intensive; accordingly, the sampling time window may be relatively short, and flow director 220 may limit such requests to contain overhead for network 10.
Fine classifier 260 may classify (step 380) fine flows retrieved from coarse and fine flow database 240 according to provider (e.g., Netflix, YouTube, etc.) per training data in training database 265. The results of step 380 may be stored in classified flow database 270. Dashboard 280 may use the data from classified flow database 270 to generate (step 390) a notification report for the classified fine flows. In accordance with some embodiments of the present application, the notification report may be presented on an operator's online console or dashboard. Alternatively, or in addition, the notification report may be stored electronically for future reference. Alternatively, or in addition, the notification report may be forwarded via email and/or other suitable vehicle for input to online and/or offline review and/or control processes.
It will be appreciated by one of skill in the art that such a notification report, in any of its possible forms, may serve as input to processes for the management of network 10. For example, video flows as detected by process 300 may be assigned a different priority than other data flows in network 10. A higher or lower priority level may be assigned to video flows in general, based on technical and/or functional considerations. Routers 100 may be instructed by data center 200 to prioritize video flows in relation to other data flows based on such a priority level. Classified video flows may also be assigned different priorities according to video service provider. The different priorities may be based on technical and/or functional considerations, and routers 100 may thereby also be instructed to discriminate between video flows according to video service provider.
In accordance with embodiments of the present disclosure, manifold learning diffusion maps may be used to implement coarse classifier 250 and/or fine classifier 260. A manifold is a space in which every point has a neighborhood that locally resembles Euclidean space, but in which the global structure may be more complicated; e.g., the Earth's surface may be assumed locally flat, but globally it is a two-dimensional manifold embedded in a three-dimensional space.
Manifold learning is a formal framework for many different machine learning techniques based on the assumptions that the original data actually lies on a lower dimensional manifold embedded in a high dimensional ambient space (the manifold assumption) and that data distributions show natural clusters separated by regions of low density (the cluster assumption). The underlying geometric structure of the data may therefore be discovered given the high dimensional observations. The input data, although defined in a high dimensional ambient space, may be described using fewer parameters while preserving relevant information and the intrinsic semantics of the source dataset; dimensionality reduction techniques are used to transform a dataset X with dimensionality D into a new dataset Y with dimensionality d, while retaining the geometry of the data.
Diffusion Maps is a manifold learning methodology that preserves the local similarity of the high dimensional dataset while constructing the low dimensional representation of the underlying unknown manifold, using non-linear techniques based on graph theory and differential geometry. The distance between two data points is estimated via a fictive diffusion process, simulated with a Markov random walk on an associated undirected graph that approximates the manifold.
The Euclidean distance between points in the embedded space (the transformed space) is approximately the diffusion distance between those points in the ambient space (the original space). Variation of physical parameters along the original manifold is approximately preserved in the new data space as long as the Euclidean distances are preserved.
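By way of a standard formulation (stated here for illustration and not reproduced from the present disclosure; $p_t(x_i, \cdot)$ denotes the probability distribution of a $t$-step random walk started at $x_i$, and $\phi_0$ denotes the stationary distribution of the walk), the diffusion distance at scale $t$ may be written as:

$$D_t^2(x_i, x_j) = \sum_{u} \frac{\bigl(p_t(x_i, u) - p_t(x_j, u)\bigr)^2}{\phi_0(u)} \approx \lVert \Psi_t(x_i) - \Psi_t(x_j) \rVert^2,$$

i.e., the diffusion distance in the ambient space is approximated by the Euclidean distance between the embedded points $\Psi_t(x_i)$ and $\Psi_t(x_j)$.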
Accordingly, taking two data points $x_i$ and $x_j$ in a high dimensional ambient space, a local similarity matrix W may be defined to reflect the degree to which points are near to one another. Imagining a random walk starting at $x_i$ that moves only to immediately adjacent points, the number of steps the walk takes to reach $x_j$ reflects the distance between $x_i$ and $x_j$ along the given direction. The similarity of the data in the context of this fictive diffusion process is retained in a low-dimensional non-linear parameterization useful for uncovering the relations within the feature space. Moreover, the embedding may be robust to random noise in the data, as long as the points in the ambient space keep their relatedness to adjacent points in the presence of noise.
Reference is now made to FIG. 4, which illustrates a diffusion map learning process 400 to be performed by coarse classifier 250 and/or fine classifier 260 in accordance with embodiments of the present disclosure to generate training data and/or to process input data flows received from routers 100. Process 400 employs a combination of graph theory and differential geometry. The elements of a subject dataset are related to each other in a structured manner through similarities or dependencies between the data elements, represented with an undirected weighted graph in which the data elements correspond to nodes, the relations between elements are represented by edges, and the strength or significance of relations is reflected by the edge weights.
In the interests of simplicity of reference, process 400 will be discussed hereinbelow as performed by fine classifier 260. It will be appreciated that process 400 may be performed by either or both of coarse classifier 250 and fine classifier 260. Alternatively, or in addition, a dedicated training module may be used to generate the training data. Fine classifier 260 receives (step 410) input data. When executed in training mode, the input data represents captures of labeled video streaming service samples. In operation, the input data is received as either coarse flow or fine flow data from routers 100.
It will be appreciated by one of skill in the art that video network traffic may be described by a number of observable data or feature vectors that are the points $\{x_i\}_{i=1}^{N}$ in the high dimensional ambient space. A feature may be indicative of the type of application that generated the traffic, based on the statistical characteristics of the application protocols, but without using payload information that may be encrypted. Classifiers 250 and 260 are trained to associate the sets of features with known video streaming services, and the trained classifier is then applied to classify unknown traffic using the previously learned rules.
It has been observed that different applications generally have distinct packet size distributions (PSDs) and that instances of the same application generally have similar packet inter-arrival times (IATs). Process 300 may therefore use PSDs and IATs as indicators for application classification. The PSD of an application can be obtained from observation of the relevant TCP connections. For training, the traces of each application may be generated manually and recorded in coarse and fine flow database 240. Such manual generation, typical of supervised classification methods, provides the advantage of building a consistent ground-truth dataset in which the application that generated a given flow is well known. Alternatively, it is possible to use a mix of labeled and unlabeled samples, as is typical of semi-supervised classification methods.
In accordance with an exemplary implementation of process 400, the generated data may be based on an average capture duration of approximately 240 seconds from video streaming services such as, for example, Netflix, Lovefilm, YouTube, Hulu, Metacafe and Dailymotion. Examples of PSD histograms generated for each of these video streaming services may be seen in
It will be appreciated by one of skill in the art that a transport layer protocol such as TCP may be responsible for the reliable, in-order delivery of data packets between two communicating applications. The inter-arrival time between two consecutive packets of a network flow transmitted by a host is a function of at least the application traffic generation rate, the transport layer protocol in use, queuing delays at the host and on the intermediate nodes in the network, the medium access protocol, and finally a random amount of jitter. In accordance with an exemplary implementation of process 400, the IAT histograms may also be based on an average capture duration of approximately 240 seconds from video streaming services such as, for example, Netflix, Lovefilm, YouTube, Hulu, Metacafe and Dailymotion. Examples of IAT histograms generated for each of these video streaming services may be seen in
For each sample point, fine classifier 260 may construct (step 420) a corresponding histogram for the PSD and the average IAT to capture the overall statistical traffic behavior. Each histogram may be represented as a point in the feature space.
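By way of non-limiting illustration, step 420 might be sketched in Python as follows, assuming each sample is given as packet timestamps (seconds) and packet sizes (bytes) taken from fine flow data; the bin counts and histogram ranges are illustrative assumptions:

```python
# Hypothetical construction of PSD and IAT histograms (step 420).
import numpy as np

def psd_histogram(sizes, n_bins=64, max_size=1514):
    """Packet size distribution over [0, max_size], normalized to sum to 1."""
    hist, _ = np.histogram(sizes, bins=n_bins, range=(0, max_size))
    return hist / max(hist.sum(), 1)

def iat_histogram(timestamps, n_bins=64, max_iat=0.5):
    """Inter-arrival time distribution over [0, max_iat] seconds."""
    iats = np.diff(np.sort(np.asarray(timestamps)))
    hist, _ = np.histogram(iats, bins=n_bins, range=(0.0, max_iat))
    return hist / max(hist.sum(), 1)
```

Each normalized histogram then serves as one point in its respective feature space.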
It will, however, be appreciated by one of ordinary skill in the art that using a single feature for classification may be insufficient; two different applications may well have similar PSDs or IATs. For example, as shown in FIGS. 5B, 5H, 5J and 5L, while not identical, the PSD histograms for Netflix, Hulu, Metacafe and Dailymotion are fairly similar. Accordingly, process 400 may be configured to use two or more features.
Fine classifier 260 may therefore be configured to determine (step 430) the joint similarity between the PSD and IAT distributions. In accordance with embodiments of the present disclosure, manifold alignment methods may be employed by fine classifier 260 to create a more powerful representation of the manifold, aligning (combining) multiple datasets into a fusion multi-kernel support. Manifold alignment views each individual dataset as belonging to a larger dataset. Accordingly, since the datasets may have the same manifold structure, the Laplacians associated with the individual datasets are all discrete approximations of the same manifold, and they may be combined into a joint Laplacian to construct an embedding that integrates the features provided by the different datasets. Accordingly, the fusion multi-kernel of the kernels $W_{IAT}$ and $W_{PSD}$ for the IAT and PSD distributions in their respective feature spaces may be derived as a Bhattacharyya kernel according to $W_{IVN} = \sqrt{W_{IAT}} \cdot \sqrt{W_{PSD}}$ (taken element-wise), such that the fusion multi-kernel $W_{IVN}$ is a measure of joint similarity between the IAT and PSD distributions.
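A minimal Python sketch of this fusion, assuming (as an illustration) that w_iat and w_psd are N×N kernel matrices computed over the same N samples:

```python
# Hypothetical element-wise Bhattacharyya-style fusion of two kernels.
import numpy as np

def fusion_kernel(w_iat, w_psd):
    """W_IVN = sqrt(W_IAT) * sqrt(W_PSD), taken element-wise."""
    return np.sqrt(w_iat) * np.sqrt(w_psd)
```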
The $W_{IVN}$ dataset may be represented as an $N \times D$ matrix consisting of N feature vectors with dimensionality D. Each instance is represented as a point in the ambient space $\mathbb{R}^D$, and $s(x_i, x_j)$ represents the distance between a pair of adjacent data points. In accordance with embodiments of the present invention, the Jensen-Shannon divergence (JSD) may be used to measure the distance $s(x_i, x_j)$; it will be appreciated that any other suitable method may be used in other embodiments. Fine classifier 260 may construct (step 440) a data adjacency matrix W on a weighted undirected graph for the observed data $\{x_i\}_{i=1}^{N}$, where the elements $W(x_i, x_j)$ of the symmetric matrix W are defined by the Gaussian kernel:
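In one standard formulation (the specific form below is an assumption of this illustration, with $\varepsilon$ a scale parameter controlling the neighborhood size):

$$W(x_i, x_j) = \exp\!\left(-\frac{s(x_i, x_j)^2}{\varepsilon}\right).$$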
Fine classifier 260 may construct (step 450) the Laplacian matrix L for the graph, where the diagonal degree matrix is given by $D(x_i, x_i) = \sum_j W(x_i, x_j)$, and set $L = D - W$.
Fine classifier 260 may then compute (step 460) the eigenmap that solves the generalized eigenvalue problem $L\psi = \lambda D\psi$, equivalently the eigendecomposition of the normalized matrix $P = D^{-1}W$ (conjugate to the symmetric matrix $D^{-1/2} W D^{-1/2}$), with eigenvalues $1 = \lambda_0 \geq \lambda_1 \geq \cdots$ and corresponding eigenvectors $\psi_0, \psi_1, \ldots$. The matrix P has rows summing to one and may be interpreted as a stochastic matrix defining a random walk on the graph. The constant eigenvector $\psi_0$ with the top eigenvalue $\lambda_0 = 1$ may be discarded, while the first d dominant eigenvalues $\lambda_1, \ldots, \lambda_d$ and eigenvectors $\psi_1, \ldots, \psi_d$ are kept. The embedding of the manifold is then given by the vector in the embedded space $x_i \mapsto \Psi_t(x_i) = \left(\lambda_1^t \psi_1(x_i), \ldots, \lambda_d^t \psi_d(x_i)\right)$, where $d \ll D$ is the dimension of the embedded space.
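The following Python sketch outlines steps 440 through 470 under the formulation above, using only numpy and scipy; the scale parameter eps, the diffusion time t, and the embedded dimension d are illustrative assumptions:

```python
# Hypothetical diffusion map pipeline (steps 440-470).
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import jensenshannon  # returns sqrt of the JSD

def diffusion_map(histograms, d=3, t=1, eps=0.1):
    """Embed N histograms into d diffusion coordinates."""
    X = np.asarray(histograms, dtype=float)        # N x D histogram matrix
    N = len(X)
    S = np.array([[jensenshannon(X[i], X[j]) for j in range(N)]
                  for i in range(N)])              # pairwise JSD distances
    W = np.exp(-(S ** 2) / eps)                    # Gaussian kernel (step 440)
    deg = W.sum(axis=1)                            # degrees D(x_i, x_i)
    # Symmetric conjugate of the random-walk matrix D^-1 W (steps 450-460):
    P_sym = W / np.sqrt(np.outer(deg, deg))
    vals, vecs = eigh(P_sym)                       # eigenvalues, ascending
    vals, vecs = vals[::-1], vecs[:, ::-1]         # reorder to descending
    psi = vecs / np.sqrt(deg)[:, None]             # right eigenvectors of D^-1 W
    # Discard the constant psi_0 (lambda_0 = 1); keep the d dominant pairs:
    return (vals[1:d + 1] ** t) * psi[:, 1:d + 1]  # N x d embedding (step 470)
```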
It will be appreciated that if the data points $x_i$ and $x_j$ are adjacent when measured by W, then they should similarly be very near on the manifold. Conversely, if the points $\Psi_t(x_i)$ and $\Psi_t(x_j)$ are adjacent when measured in the embedded space, the diffusion distance between the corresponding points in the ambient space should be similarly small. Fine classifier 260 may embed (step 470) the results in the embedded space.
In accordance with embodiments of the present invention, classification of the training data may be performed in a supervised/semi-supervised manner. Reference is now made briefly to
Once the clusters have been computed in the embedded space using the labeled samples, a new unlabeled sample may be added to the training set. Instead of computing a new embedded space for each new sample, Nyström extension may be used to estimate the extended eigenvector in the previous embedded space. It will be appreciated that the same method may be employed for processing data flows in operation.
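A hedged Python sketch of such a Nyström extension, reusing the conventions of the previous sketch (psi and vals being the eigenvectors and eigenvalues computed on the training set; eps and d are the same illustrative assumptions):

```python
# Hypothetical Nystrom extension: embed a new sample without recomputing
# the eigendecomposition of the training kernel.
import numpy as np
from scipy.spatial.distance import jensenshannon

def nystrom_extend(x_new, X_train, psi, vals, d=3, eps=0.1):
    """Estimate the embedding of x_new in the existing embedded space."""
    s = np.array([jensenshannon(x_new, x) for x in X_train])
    w = np.exp(-(s ** 2) / eps)              # kernel row for the new point
    p = w / w.sum()                          # transition probabilities
    # Nystrom: psi_k(x_new) ~ (1 / lambda_k) * sum_j p(x_new, x_j) * psi_k(x_j)
    psi_new = (p @ psi[:, 1:d + 1]) / vals[1:d + 1]
    return vals[1:d + 1] * psi_new           # diffusion coordinates at t = 1
```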
The classification of an unlabeled sample uses weighted neighborhood schemes, such as random forest or k-NN (k-nearest neighbor) algorithms, to count the number of training points of the same class within a minimal distance from the centroids. For illustration, the crosses in the circled application clusters in
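By way of non-limiting illustration, the k-NN variant might be realized as follows; the embedded coordinates and labels are hypothetical stand-ins for the outputs of the diffusion-map training stage:

```python
# Hypothetical k-NN classification in the embedded space.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

embedded_train = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])
labels_train = ["Netflix", "Netflix", "YouTube", "YouTube"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(embedded_train, labels_train)

embedded_new = np.array([[0.85, 0.15]])   # e.g., from the Nystrom extension
print(knn.predict(embedded_new))          # -> ['Netflix']
```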
In accordance with embodiments of the present disclosure, alternatively or in addition, deep learning techniques may be used to implement coarse classifier 250 and/or fine classifier 260. Deep learning may be characterized as machine learning techniques that receive raw data as input and automatically generate optimal feature extractors. Any suitable deep learning technique that includes generative models representing a deeper model of the structure underlying the data may be used to implement coarse classifier 250 and/or fine classifier 260. Non-limiting examples of such implementations include de-noising auto-encoders, restricted Boltzmann machines and convolutional networks.
In accordance with embodiments of the present disclosure, coarse classifier 250 may be implemented by modeling the types of system noise and affine transformations that are expected in the field and dynamically introducing simulated artifacts based on this model during system training. While this may be resource intensive during the training phase, it may yield high-speed classification during operation, since the classification code may consist of a few relatively simple matrix operations.
Reference is now made to
Coarse classifier 250 may receive (step 510) vectorized IAT/PSD pairs as they are streamed into the system. Coarse classifier 250 may transform (step 520) the input data so that it has a mean of 0 and a standard deviation of 1. Coarse classifier 250 may reduce (step 530) the dimensionality of the transformed data. In accordance with embodiments of the present disclosure, principal component analysis (PCA) may be used to perform step 530. However, it will be appreciated that any suitable analysis may be used for step 530. The analysis may maintain a configurable amount of variance to help reduce the input layer size if necessary. Whitened PCA or ZCA (zero component analysis) may be used to reduce the redundancy of the input data.
Based on a configuration parameter, coarse classifier 250 may perform regularization in order to attenuate (step 540) extremely large numerical values, thus helping to provide numerical stability. The preprocessed data may then be classified (step 550) by the trained deep learning based classifier.
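A minimal Python sketch of steps 520 through 540, using scikit-learn; the retained variance fraction and the clipping threshold are illustrative assumptions:

```python
# Hypothetical preprocessing pipeline for the deep learning classifier.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def preprocess(iat_psd_vectors, variance=0.99, clip=6.0):
    """Standardize, reduce/whiten, and clip the vectorized IAT/PSD pairs."""
    X = StandardScaler().fit_transform(iat_psd_vectors)            # step 520
    X = PCA(n_components=variance, whiten=True).fit_transform(X)   # step 530
    return np.clip(X, -clip, clip)                                 # step 540
```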
In accordance with embodiments of the present disclosure, both deep learning and manifold diffusion maps may be used in conjunction by data center 200 to perform process 300. For example, coarse classifier 250 may be implemented using deep learning, thereby taking advantage of the high-speed classification provided by deep learning for the relatively large volume of coarse flow classifications. Fine classifier 260 may be implemented using manifold diffusion maps, thereby designating the more resource intensive processing for the relatively lower volume of fine flow classifications.
It will be appreciated by one of skill in the art that the methods described hereinabove may also be implemented to address non-video traffic. In accordance with embodiments of the present invention, the methods may be applied to the classification of any persistent network traffic based on behavioral methods to capture flow information without inspecting the packet payload or using additional hardware. For example, BitTorrent and/or Spotify traffic may be classified using generally similar methods.
It is appreciated that software components of the present invention may, if desired, be implemented in ROM (read only memory) form. The software components may, generally, be implemented in hardware, if desired, using conventional techniques. It is further appreciated that the software components may be instantiated, for example: as a computer program product or on a tangible medium. In some cases, it may be possible to instantiate the software components as a signal interpretable by an appropriate computer, although such an instantiation may be excluded in certain embodiments of the present invention.
It is appreciated that various features of the invention which are, for clarity, described in the contexts of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the invention which are, for brevity, described in the context of a single embodiment may also be provided separately or in any suitable subcombination.
It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described hereinabove. Rather, the scope of the invention is defined by the appended claims and equivalents thereof.