The present invention relates to monitoring traffic in an industrial computer network.
Industrial computer networks are extensively utilized in manufacturing, energy, and transportation sectors to facilitate communication and data exchange between various devices and systems. These networks are crucial in ensuring the seamless operation of industrial processes and efficient resource management.
Monitoring the network traffic within industrial networks is vital for maintaining the network's integrity, availability, and security. By analyzing the traffic between devices, it is possible to identify communication patterns and detect any anomalies or deviations from normal behavior. Such anomalies could indicate device malfunctions, network failures, or potential security breaches.
In the past, various methods and systems have been developed for monitoring network traffic in industrial networks. These include network monitoring tools that capture and analyze network packets, intrusion detection systems that detect and prevent unauthorized access, and anomaly detection systems that identify abnormal network behavior.
However, these existing solutions have limitations. They often require complex configurations, extensive manual intervention, and a deep understanding of the network infrastructure. Additionally, they cannot effectively handle the unique characteristics and requirements of industrial networks, such as the large number of devices, the diversity of communication protocols, and the real-time nature of industrial processes.
The prior art approach of applying a specific machine learning module provides versatility of use of the monitoring system, but cannot serve as a universal approach applicable to a wide range of various industrial computer networks.
There is a need to provide a method and system for monitoring traffic in an industrial computer network that is versatile in use and applicable to many different types of computer networks in which various types of devices operate.
The object of the invention is a computer-implemented method for monitoring network traffic, the method comprising: receiving network traffic packets from at least one network probe; pre-processing network traffic packets to generate pre-processed network traffic data in a form of a time series containing at least one statistical value for at least one parameter for a plurality of packets, each network traffic data item of the time series being representative of network traffic packets from a selected time window; providing a plurality of machine learning models configured to predict network traffic data for upcoming traffic based on the pre-processed network traffic data of traffic received so far, wherein each machine learning model is dedicated to a distinct statistical value for a distinct parameter; training the machine learning models with the pre-processed network traffic data and deriving an evaluation score for each machine learning model; and based on the evaluation score, selecting at least one machine learning model of the trained machine learning models for monitoring of upcoming network traffic; monitoring the upcoming network traffic to detect anomalous events by comparing the actual network state with a network state simulation generated using the selected at least one machine learning model; and generating a warning if the actual network state does not match the generated network state simulation, wherein the network state is represented by a statistical value for a parameter specific to the selected model.
The method may comprise pre-processing the network traffic packets individually for each connection, wherein a connection is identified by at least one of: a sender identifier, a recipient identifier, a communication port and a transmission protocol.
The pre-processing may comprise determining, for each packet, at least one parameter corresponding to a time until arrival of a next packet, a payload length or an entropy, and determining, for each parameter of a plurality of packets, at least one statistical value corresponding to a minimal value, a maximal value, a mean value, a standard deviation value or a sum value, to obtain at least one statistical value for at least one parameter for a plurality of packets.
The method may further comprise outputting the warning signal via a graphical user interface or an application program interface.
The method may further comprise determining an optimal time window length from a plurality of time window lengths within predefined limits.
The method may further comprise determining whether the statistical values have a standard distribution and if so, selecting at least one statistical machine learning model, and otherwise selecting at least one autoregressive machine learning model for training.
The method may further comprise including a number of packets within a time window in the pre-processed network traffic data.
At least one parameter for a plurality of packets can be a time until arrival of the next packet, a length of a payload carried by a packet, an entropy of the payload or a hash function of the payload.
The invention also relates to a network monitoring system comprising: a data interface for receiving network traffic packets from at least one network probe; a data pre-processor for pre-processing network traffic packets to generate pre-processed network traffic data in a form of a time series containing statistics for the network traffic packets, each network traffic data item of the time series being representative of network traffic packets from a selected time window; a configurator comprising a plurality of machine learning models configured to predict network traffic data for upcoming traffic based on the pre-processed network traffic data of traffic received so far, wherein each machine learning model is dedicated to a distinct statistical value for a distinct parameter, and a controller configured to: train the machine learning models with the pre-processed network traffic data; derive an evaluation score for each machine learning model; and based on the evaluation score, select at least one machine learning model of the trained machine learning models for monitoring of upcoming network traffic; and an anomalies detector configured to: monitor the upcoming network traffic to detect anomalous events by comparing the actual network state with a network state simulation generated using the selected at least one machine learning model and generate a warning if the actual network state does not match the generated network state simulation, wherein the network state is represented by a statistical value for a parameter specific to the selected model.
The system may be specifically configured to perform any of the steps of the method as described herein.
Various embodiments are herein described, by way of example only, with reference to the accompanying drawings, wherein:
The following detailed description is of the best currently contemplated modes of carrying out the invention. The description is not to be taken in a limiting sense, but is made merely for the purpose of illustrating the general principles of the invention.
The present invention is applicable to industrial network monitoring systems that operate based on the analysis of communication between devices. The system monitors transfer of data (network packets) to detect anomalous events in network traffic. The solution is based on the idea of comparing the actual network state with a simulation generated using a defined set of generic machine learning models. If the actual state does not match the simulation, the system responds by generating a warning.
A data pre-processor 120 is configured to pre-process, in step 720, the network traffic packets by clustering packets (step 726) to generate network traffic data in a form of a time series containing statistics for the network traffic packets. For example, the time series may have a form such as shown in
A configurator of artificial intelligence, CAI, 130 comprises a plurality of machine learning models 132, which are provided in step 730 and configured to predict network traffic data for upcoming traffic based on the pre-processed network traffic data of traffic received so far, and a controller 131 configured to train, in step 740, the machine learning models 132 and to derive an evaluation score for each machine learning model 132. The machine learning models 132 used in the system can be supervised learning models and/or unsupervised learning models. Information on the best performing machine learning model 132 can be stored in a database 133 (along with information on the network traffic data for which the particular model was particularly effective) and used to select, in step 750, at least one model for actual monitoring of the upcoming network traffic during operation of the system. The CAI configurator can therefore operate (distinctly or simultaneously) in two modes: a training mode, wherein the machine learning models 132 are trained and selected, and a prediction mode, wherein the outputs of the selected machine learning models 132 are provided for use in detecting anomalies.
The controller 131 can operate in the training mode continuously, in parallel with a prediction mode, in order to continuously monitor which machine learning model 132 is the most effective for the current traffic. Alternatively, the controller 131 can operate in the training mode periodically (for example, once per hour or once per day) and/or on demand (for example, upon detecting a significant change in the characteristics of network traffic or upon changes to the network infrastructure such as when a new device or a new connection is installed within the network).
An anomalies detector 150 is configured to monitor, in step 760, the upcoming network traffic to detect anomalous events by comparing the actual network state with a network state simulation generated using the selected at least one machine learning model 132, and to generate a warning if the actual network state does not match the generated network state simulation. The anomalies detector contains a plurality of comparators 151, each associated in a one-to-one or one-to-many relationship with the selected machine learning models 132. In other words, the output of each of the selected machine learning models 132, i.e. the network state simulation (in a form of a predicted statistical value for a parameter specific to that machine learning model), is input to one or more of the comparators 151. Each comparator 151 is configured to compare the statistical value predicted by one or more machine learning models 132 with the statistical value generated for the actual network traffic by the data pre-processor 120. In case of an anomaly, the anomalies detector 150 generates a warning signal in step 770.
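A comparator of this kind can be illustrated as follows. The relative-tolerance mismatch rule is an assumption made for illustration only, since the description does not specify how a mismatch between the actual and simulated network state is quantified; all names are hypothetical.

```python
# Illustrative comparator sketch; the 20% relative tolerance is an assumed
# mismatch rule, not a value taken from the description.
from dataclasses import dataclass


@dataclass
class Comparator:
    parameter: str        # e.g. "iat" (inter-arrival time)
    statistic: str        # e.g. "mean"
    rel_tolerance: float  # assumed mismatch threshold, e.g. 0.2 = 20%

    def check(self, predicted: float, actual: float) -> bool:
        """Return True (i.e. raise a warning) if the actual statistic
        deviates from the model's prediction by more than the tolerance."""
        scale = max(abs(predicted), 1e-9)  # avoid division by zero
        return abs(actual - predicted) / scale > self.rel_tolerance


comp = Comparator("iat", "mean", rel_tolerance=0.2)
```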
A user interface 140 is provided to allow the user to configure the operation of the network monitoring system and of the anomalies detector and view results of their operation, such as the selection of the machine learning models 132 made in the CAI module 130 or warning signals generated by the anomalies detector 150, in step 780. The user interface 140 may have a form of a graphical user interface (GUI) or an application program interface (API) via which data can be pushed or pulled for cooperation with other modules.
The network packets collected by the network probes 10 can be arranged according to a particular connection (step 721), wherein the packets corresponding to a particular connection are packets that have in common at least one of: a sender identifier, a recipient identifier, a communication port, or a transmission protocol.
The system can monitor all individual connections or connections that have a minimum predetermined number of packets transmitted in total (such as at least 5000 packets). Therefore, for each connection an independent set of simulation models will be created. Consequently, once a connection has collected a sufficient number of packets, the CAI module 130 can proceed to the consecutive analysis steps (for a given connection) to analyze the collected set of packets.
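The arrangement of packets by connection and the minimum-packet threshold can be illustrated as follows; the field names and helper function are hypothetical, not part of the described system.

```python
# Sketch of arranging captured packets by connection (step 721), keyed by
# sender, recipient, port and protocol, keeping only connections that have
# accumulated a minimum number of packets.
from collections import defaultdict

MIN_PACKETS = 5000  # threshold given in the description


def group_by_connection(packets, min_packets=MIN_PACKETS):
    """packets: iterable of dicts with 'src', 'dst', 'port', 'proto' keys.
    Returns {connection_key: [packets]} for sufficiently large connections."""
    connections = defaultdict(list)
    for pkt in packets:
        key = (pkt["src"], pkt["dst"], pkt["port"], pkt["proto"])
        connections[key].append(pkt)
    return {k: v for k, v in connections.items() if len(v) >= min_packets}
```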
For each connection to be analyzed, additional information on packets is collected, such as the arrival time of the first packet and the last packet in the set of packets.
In addition, for each packet pi of the connection the system can determine parameters (step 722) such as a time until arrival of the next packet, a length of the payload carried by the packet, an entropy of the payload, or a hash function of the payload.
The machine learning models are trained to predict statistical values for parameters of future network traffic based on knowledge of its state at a given moment.
In one approach, the machine learning models can be trained to analyze individual packets to predict the timing (and possible characteristics) of the appearance of the next packet of a given type. In this case, the training sample can comprise a collection of unprocessed packets.
In order to improve the training efficiency and anomaly detection results, the inventors found that it is effective to analyze data in time windows, as illustrated in
The packets 301 are clustered (step 724) according to the time window 302 in which they were received. For each window, statistical values related to each parameter (such as time, payload length, entropy) of packets within the particular time window are determined (step 723), such as a minimum value, a maximum value, a mean value, a standard deviation value, a sum of values.
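By way of a non-limiting illustration, the clustering of packets into time windows and the per-window statistics described above can be sketched as follows; the Packet record and function names are hypothetical, not part of the described system.

```python
# Sketch of steps 723/724: cluster packets into fixed time windows and
# compute min/max/mean/std/sum per parameter, plus the packet count.
import math
from collections import Counter
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Packet:
    timestamp_us: int   # arrival time in microseconds
    payload: bytes


def shannon_entropy(payload: bytes) -> float:
    """Entropy (bits per byte) of the payload; 0.0 for an empty payload."""
    if not payload:
        return 0.0
    counts, total = Counter(payload), len(payload)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def window_statistics(packets, window_us: int):
    """Cluster packets into time windows of length window_us and compute,
    per window, the statistics of inter-arrival time ("iat"), payload
    length and payload entropy, plus the number of packets."""
    packets = sorted(packets, key=lambda p: p.timestamp_us)
    windows = {}
    for prev, cur in zip(packets, packets[1:]):
        key = cur.timestamp_us // window_us
        windows.setdefault(key, []).append(
            {"iat": cur.timestamp_us - prev.timestamp_us,
             "length": len(cur.payload),
             "entropy": shannon_entropy(cur.payload)})
    series = []
    for key in sorted(windows):
        rows = windows[key]
        item = {"count": len(rows)}
        for param in ("iat", "length", "entropy"):
            vals = [r[param] for r in rows]
            item[param] = {"min": min(vals), "max": max(vals),
                          "mean": mean(vals), "std": pstdev(vals),
                          "sum": sum(vals)}
        series.append(item)
    return series
```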
These statistical values are subsequently used as inputs (hyperparameters) for the machine learning models. Specifically, a dedicated machine learning model is created and assigned for each generated statistical value. Hence, if there are three parameters (such as time, payload length, entropy) and five statistical values (min, max, mean, std, sum) for a particular connection, 3*5=15 machine learning models are created, each corresponding to a particular statistical value of a particular parameter. In addition, an additional (16th) machine learning model can be created to analyze the number of packets within a time window. In other words, each machine learning model 132 is dedicated to a unique statistical value for a unique parameter or to the number of packets within a time window.
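The resulting grid of per-statistic models can be illustrated as follows; make_model stands in for whatever estimator type is selected, and all names are illustrative only.

```python
# Sketch of the 3x5 (+1) model grid described above: one model per
# (parameter, statistic) pair, plus an optional 16th model for the
# packet count within a time window.
PARAMETERS = ("time", "payload_length", "entropy")
STATISTICS = ("min", "max", "mean", "std", "sum")


def build_model_grid(make_model):
    """Return a dict mapping each (parameter, statistic) pair to a
    freshly created model instance."""
    models = {(p, s): make_model() for p in PARAMETERS for s in STATISTICS}
    models[("packet_count", None)] = make_model()  # optional 16th model
    return models
```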
Alternatively, some or each of these 16 sources of data (which can be referred to as data flows) can be simultaneously input to different types of machine learning models. For example, if (as will be explained in detail below) the system can handle 6 different autoregressive machine learning model types (such as KRR, SVR, PAA, MLP, DRM, DRRM), the system may generate a total of 16*6=96 different autoregressive machine learning models.
This method facilitates easier identification of the attack vector and provides more precise information about the type of attack (anomalies in different statistics correspond to different forms of network attack). More precise information about the attack leads to quicker operator response, potentially reducing damage.
A single time window is defined by a number (in μs) that determines its duration. The length of the time window can be selected individually for each type of statistic and can be the result of an optimization procedure. The limits for optimization can be predefined. The time window length can be defined using two methods.
In the first method, known as dynamic window selection, the limiting values for window length (irrespective of the type of statistics) can be predetermined, for example, as 16 (minimum value) and 64 (maximum value) times an average length of time until the arrival of the next packet for a particular connection for the training set of packets.
In the second method, the limiting values can be predefined as specific values, such as 1000000 μs (minimum) and 5000000 μs (maximum).
The length of the time window can significantly impact the system's performance. A short window reduces the likelihood of detecting “silent” long-running attacks containing a small number of packets per unit time. Conversely, a long time window increases detection time (the system must wait to collect the data required to execute the query). The desired time window may therefore vary depending on specific scenarios and domain knowledge.
Time window optimization is based on two main steps: evaluating the quality of a given time window definition (for a given statistic) and selecting the time window definition with the highest rating.
An example will be presented for a statistic related to the average value of the parameter related to the time until the arrival of the next packet. A plurality of time window lengths are defined and for each length of the time window, the following procedure is performed.
First, an average value is calculated for the parameter for all packets within consecutive time windows. Next, a score for the particular time window length is calculated by:
The scoring procedure is repeated for the plurality of different values of time window lengths. The goal is to find the length of the time window within the limiting values with the highest score.
The above described procedure, if it is to be performed for each connection, for each statistic of each parameter considered and for each time window length considered, can be relatively complicated if it is performed as a grid search (wherein a predefined grid of uniformly distributed parameter values is tested).
This can be further optimized by using a Tree-structured Parzen Estimator (TPE) algorithm, which uses the so-called expected improvement criterion and is explained in more detail in the article “Algorithms for hyper-parameter optimization” (Bergstra, James, et al., Advances in Neural Information Processing Systems 24, 2011). In each step of the TPE optimization, the locally best value of the time window length is obtained by maximizing this criterion. In each step of the algorithm the criterion is improved, allowing iteratively better values of the optimized parameter (time window length) to be obtained. The assumed configuration space in this case is not an n-dimensional cube as in the case of grid search, but a tree (an acyclic connected undirected graph).
Furthermore, the calculations can be further simplified by averaging the values of statistics in the time windows. Averaging involves transforming a sequence of values in time windows into a new sequence, wherein each new value is an average of N adjacent values in the original sequence. The purpose of averaging is to smooth the distribution of values and is helpful in optimizing the model training process. Preferable values for the parameter N were found to be between 1 and 20.
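The averaging described above can be sketched as a sliding mean. The use of overlapping groups of N adjacent values is an assumption for illustration; the description would equally admit non-overlapping blocks of N values.

```python
# Averaging of window statistics: each new value is the mean of N adjacent
# values of the original sequence (here implemented as a sliding mean).
def smooth(values, n):
    """Transform a sequence into means of n adjacent values (n between
    1 and 20 per the description). Returns len(values) - n + 1 values."""
    if n < 1 or n > len(values):
        raise ValueError("n must be between 1 and len(values)")
    return [sum(values[i:i + n]) / n for i in range(len(values) - n + 1)]
```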
The controller 131 is configured to generate and test a plurality of machine learning models 132, such as to select one or more machine learning models 132 that will be used as estimators of upcoming network traffic. If more than one model is selected, they can operate in parallel and their estimation results can be averaged to minimize the error rate.
The system can be configured to support at least two types of machine learning models 132, which can be selected depending on the type of distribution of packets (step 725): statistic machine learning models, selected when the statistical values have a standard distribution, and autoregressive machine learning models, selected otherwise.
The best criterion for assessing the randomness of the sequence was found to be the Kolmogorov-Smirnov test. This test, in the simplest case, checks how closely a sequence of statistical values resembles a normal distribution. Alternatively, randomness can be determined using the Ljung-Box test, which checks whether there is any correlation in the sequence of statistical values.
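A simplified version of such a normality check, using only the standard library, could look as follows. The critical value 1.36/sqrt(n) is a textbook approximation at a significance level of about 0.05, not a value from the description, and the check compares against a normal distribution fitted to the sample.

```python
# Simplified one-sample Kolmogorov-Smirnov check against a normal
# distribution fitted to the sample (illustrative sketch only).
import math
from statistics import mean, pstdev


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))


def looks_normal(values, crit=1.36):
    """Return True if the KS statistic against N(mean, std) stays below
    the approximate critical value crit / sqrt(n)."""
    xs = sorted(values)
    n = len(xs)
    mu, sigma = mean(xs), pstdev(xs) or 1e-12
    d = max(max(abs((i + 1) / n - normal_cdf(x, mu, sigma)),
                abs(i / n - normal_cdf(x, mu, sigma)))
            for i, x in enumerate(xs))
    return d < crit / math.sqrt(n)
```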
In addition, the system can be configured to support machine learning models 132 configured to analyze individual packets, by performing deep packet inspection in order to analyze packet payload. A plurality of these models can be tested in order to select one that is most suitable for the payload of packets for a particular connection.
This class of machine learning models can comprise one or more of the following models: Kernel Ridge Regression (KRR), Support Vector Regression (SVR), Passive Aggressive Algorithm (PAA), Multi-Layer Perceptron regressor (MLP), Deep Regression Model (DRM), Deep Recurrent Regression Model (DRRM).
Each available type of machine learning model is tested for at least one statistical value for at least one particular parameter in order to determine whether that model is suitable for predicting traffic data for upcoming network traffic. The models are tested on a collection of traffic data packets, which is divided, for example, at a proportion of 7:3 into a training data set (70%) and a validation data set (30%). Data packets are then pre-processed by performing one or more of the following:
Such pre-processed data can be used as input data for the models and as reference predicted values needed to calculate a loss function. An example will be discussed below, wherein the input data comprises a series of statistical values denoted as t1, t2, . . . , tN, wherein N is the length of the training sample and the predicted value is denoted as y. It can be assumed that ti+1 is a function of only the n previous statistics ti−n, . . . , ti, in other words ti+1=ƒ(ti−n, . . . , ti).
The function ƒ represents one of the available models (such as the six models listed above), and n is a parameter selected (optimized) by the configurator module. The value of n is optimized for all models and is not considered as a parameter for any model. It affects the selection of input data and is not dependent on the type of the model. The input data and the predicted values can be illustrated as shown in
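The construction of input vectors and predicted values can be sketched as follows. Here each target is predicted from the n immediately preceding values; the exact off-by-one convention relative to ti−n, . . . , ti is an assumption, and the function name is illustrative.

```python
# Building autoregressive training pairs: each target value follows a
# window of n consecutive statistic values that serve as model input.
def make_training_pairs(series, n):
    """Return (inputs, targets): inputs[i] is a list of n consecutive
    statistic values and targets[i] is the value that follows them."""
    inputs = [series[i:i + n] for i in range(len(series) - n)]
    targets = [series[i + n] for i in range(len(series) - n)]
    return inputs, targets
```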
After the input data vector is created, each model is trained and then the goodness of fit of the model is calculated as a score value. The score for each model is calculated for the validation data set using the following algorithm:
The optimization algorithm finds a combination of the value of parameter n and a particular statistical value (hyperparameter for the model) that maximizes the score (in practice, by minimizing its negative). The number of optimization steps can be different for each model type; for example, it can be equal to 100 for the KRR, SVR, PAA and MLP models, and between 1 and 16 for the DRM and DRRM models. Since the efficiency of optimization is proportional to the number of steps and inversely proportional to the number of optimized hyperparameters, the best results can be achieved for the KRR and MLP models (100 optimization steps for 2 optimized parameters) and the worst for the DRRM model (usually 4 optimization steps for 6 optimized parameters).
The hyperparameters found in the optimization process are filtered: each set of hyperparameters is required to correspond to a model with a score value above 0.9. Then, a number of sets (for example, up to 2 sets) of hyperparameters per model type (KRR, MLP, etc.), connection and statistic are stored in a database. Any two stored sets of hyperparameters should differ from each other by at least a predetermined amount.
The statistic machine learning models are used for training data sets that have a normal distribution, which can be assessed, as explained above, using the Kolmogorov-Smirnov test or the Ljung-Box test.
Data for the statistic machine learning models is also processed by dividing them into time windows, as explained above. The statistical values should represent a stationary process within the time window, to get values for which the mean and variance do not change over time and which have a distribution that is close to normal, and wherein the individual values are independent of each other.
Various statistic machine learning models for statistical process control (SPC) can be used, and two examples will be explained in detail herein: a first model, herein called SPCxr, which is based on controlling the value of a mean and a variance of a variable; and a second model, herein called SPCxs, which is based on controlling the value of the mean and a standard deviation. Variance in the monitored processes can arise for two main reasons. The first reason is the naturally occurring variance of a random variable, which creates a certain range within which fluctuations occur. The second reason can be related to special events (such as a breakdown of a device within the network); such variance is undesirable and requires analysis. The main task of the statistic machine learning models is to establish the range of natural variance and to monitor whether a change in a given value has occurred naturally or is the result of a special event, and thus classify it as an anomaly.
Regardless of the statistical value (hyperparameter) that is analyzed, three important lines can be defined, as shown in a plot 601: an upper control limit (UCL), a centerline, and a lower control limit (LCL).
The statistical machine learning models can be configured to report anomalies when the controlled value exceeds a lower or an upper limit. This is the simplest and most effective approach in detecting deviations from the norm.
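This simplest rule can be illustrated as follows, assuming the conventional three-sigma limits around the mean of the training statistics; the function names are illustrative.

```python
# Minimal SPC-style check: limits are mean +/- 3 standard deviations of the
# training statistics, and a value outside [LCL, UCL] is an anomaly.
from statistics import mean, pstdev


def control_limits(training_values):
    """Return (LCL, centerline, UCL) computed from training statistics."""
    center = mean(training_values)
    sigma = pstdev(training_values)
    return center - 3 * sigma, center, center + 3 * sigma


def is_anomaly(value, lcl, ucl):
    """Report an anomaly when the controlled value leaves the limits."""
    return value < lcl or value > ucl
```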
However, other and more sophisticated rules can also be implemented for detecting unlikely events. For example, anomalies could be reported when:
Each of the available types of statistic machine learning model is tested for at least one statistical value for at least one particular parameter that has been confirmed as having a normal distribution. The models are tested on the whole collection of traffic data packets. Data packets are then pre-processed by performing one or more of the following:
Next, limits are determined for each statistic machine learning model.
For the SPCxr model, three control lines are determined for the mean value (X) and three control lines are determined for the variance value (R).
The central line for the mean value (X) is determined by the equation:
wherein m is the number of time windows in the training set and
Next, a gap R is determined for each window as the difference between the maximum value and the minimum value of statistic values within that window. An average gap for all m time windows is calculated as:
A standard deviation estimator can be determined as:
wherein d2 is a predetermined constant defined as follows, depending on the size of the moving average window m, as d2(m): d2(2)=1.128; d2(3)=1.693; d2(4)=2.059; d2(5)=2.326; d2(6)=2.534; d2(7)=2.704; d2(8)=2.847; d2(9)=2.970; d2(10)=3.078; d2(11)=3.173; d2(12)=3.258; d2(13)=3.336; d2(14)=3.407; d2(15)=3.472; d2(16)=3.532; d2(17)=3.588; d2(18)=3.640; d2(19)=3.689; d2(20)=3.735; d2(21)=3.778; d2(22)=3.819; d2(23)=3.858; d2(24)=3.895; d2(25)=3.931.
Finally, the UCL, centerline and LCL values for the mean value (X) can be determined based on the following equations:
The three lines for the variance value (R) are determined in a similar way. The standard deviation estimator for the R value is calculated as:
wherein d3 is a predetermined constant defined as follows, depending on the size of the moving average window m, as d3(m): d3(2)=0.853; d3(3)=0.888; d3(4)=0.880; d3(5)=0.864; d3(6)=0.848; d3(7)=0.833; d3(8)=0.820; d3(9)=0.808; d3(10)=0.797; d3(11)=0.787; d3(12)=0.778; d3(13)=0.770; d3(14)=0.763; d3(15)=0.756; d3(16)=0.750; d3(17)=0.744; d3(18)=0.739; d3(19)=0.734; d3(20)=0.729; d3(21)=0.724; d3(22)=0.720; d3(23)=0.716; d3(24)=0.712; d3(25)=0.708.
Finally, the UCL, centerline and LCL values for the variance value (R) can be determined based on the following equations:
The SPCxs model is trained in a manner equivalent to the SPCxr model, with the exception that the estimator constant is changed. For the mean value (X) a standard deviation is calculated as:
wherein si is the standard deviation of the i-th window. The UCL, centerline and LCL values for the mean value (X) can be determined based on the following equations:
wherein c4 is a predetermined constant defined as follows, depending on the size of the moving average window m, as c4(m): c4(2)=0.7979; c4(3)=0.8862; c4(4)=0.9213; c4(5)=0.940; c4(6)=0.9515; c4(7)=0.9594; c4(8)=0.965; c4(9)=0.9693; c4(10)=0.9727; c4(11)=0.9754; c4(12)=0.9776; c4(13)=0.9794; c4(14)=0.981; c4(15)=0.9823; c4(16)=0.9835; c4(17)=0.9845; c4(18)=0.9854; c4(19)=0.9862; c4(20)=0.9869; c4(21)=0.9876; c4(22)=0.9882; c4(23)=0.9887; c4(24)=0.9892; c4(25)=0.9896.
In turn, the UCL, centerline and LCL values for the standard deviation value (S) can be determined based on the following equations:
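The equations referenced in this section are not reproduced in the text as retrieved. For reference, the constants d2, d3 and c4 tabulated above belong to the standard textbook Shewhart control-chart limits, which can be written as follows; this is a reconstruction under the assumption that the models follow the classical X̄–R and X̄–S charts, not verbatim from the source.

```latex
% Grand mean, mean range and sigma estimate (X-bar/R, "SPCxr"):
\bar{\bar{X}} = \frac{1}{m}\sum_{i=1}^{m}\bar{X}_i, \qquad
\bar{R} = \frac{1}{m}\sum_{i=1}^{m}R_i, \qquad
\hat{\sigma} = \frac{\bar{R}}{d_2}

% X-bar chart limits (n = number of samples per window):
\mathrm{UCL}_X = \bar{\bar{X}} + \frac{3\hat{\sigma}}{\sqrt{n}}, \qquad
\mathrm{CL}_X = \bar{\bar{X}}, \qquad
\mathrm{LCL}_X = \bar{\bar{X}} - \frac{3\hat{\sigma}}{\sqrt{n}}

% R chart limits, using \hat{\sigma}_R = d_3\,\hat{\sigma}:
\mathrm{UCL}_R = \bar{R} + 3 d_3 \hat{\sigma}, \qquad
\mathrm{CL}_R = \bar{R}, \qquad
\mathrm{LCL}_R = \max\left(0,\ \bar{R} - 3 d_3 \hat{\sigma}\right)

% X-bar/S chart ("SPCxs"), with \bar{s} the mean window standard deviation:
\hat{\sigma} = \frac{\bar{s}}{c_4}, \qquad
\mathrm{UCL}_S = \bar{s} + \frac{3\bar{s}}{c_4}\sqrt{1 - c_4^2}, \qquad
\mathrm{CL}_S = \bar{s}, \qquad
\mathrm{LCL}_S = \max\left(0,\ \bar{s} - \frac{3\bar{s}}{c_4}\sqrt{1 - c_4^2}\right)
```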
When the statistic models are trained, no additional hyperparameters need to be selected; therefore, the creation of statistical models is much faster compared to autoregressive models and does not require as large an amount of computation. The SPCxr and SPCxs models are trained separately, wherein for each model appropriate limits are selected for two graphs (X, R or X, S) per model. Then each model is evaluated by assigning it a score according to the formula:
wherein the moving average score is the value of the moving average, and the precision is the value expressed as a percentage, indicating the proportion of the samples that fell within the three sigma interval. The minimum score for the model to be accepted is 0.9. Since there are no additional hyperparameters in the model itself, only one best configuration is obtained.
Although the models selected during the hyperparameter optimization stage accurately predict network traffic, it is possible that some models perform better than others. To select the best models, they can be chosen based on the calculated model scores. The selection can be made to eliminate models that do not allow online operation, i.e., those for which the time of window analysis is longer than the time window length itself. The ratio of these times can be referred to as a “resource” of the model. A requirement can be set so that the sum of resources for the final selected models is less than a certain constant, hereinafter referred to as MAX_RESOURCES.
The selection can be made according to the solution of the Knapsack Problem, wherein score is the value of an object and resource is its volume. This can be solved as:
This procedure yields a set of models that substantially evenly analyze all connections and whose performance allows online analysis of packets.
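The knapsack-style selection can be sketched as follows. The integer scaling of the fractional resource values is an implementation assumption needed for a dynamic-programming table, and all names are illustrative.

```python
# Sketch of model selection as a 0/1 knapsack: each trained model has a
# score (the object's value) and a resource (the ratio of window-analysis
# time to window length, the object's volume); pick the subset maximizing
# total score subject to total resource < MAX_RESOURCES.
def select_models(models, max_resources, scale=100):
    """models: list of (name, score, resource) tuples.
    Returns the names of the selected models."""
    capacity = int(max_resources * scale)
    # scale fractional resources to integers; oversized items are skipped
    weights = [min(int(m[2] * scale), capacity + 1) for m in models]
    # dp[w] = (best total score, chosen indices) within resource budget w
    dp = [(0.0, [])] * (capacity + 1)
    for i, (name, score, _) in enumerate(models):
        w = weights[i]
        for budget in range(capacity, w - 1, -1):
            cand = (dp[budget - w][0] + score, dp[budget - w][1] + [i])
            if cand[0] > dp[budget][0]:
                dp[budget] = cand
    return [models[i][0] for i in dp[capacity][1]]
```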
Alternatively, other selection procedures are possible, such as:
While the invention has been described with respect to a limited number of embodiments, it will be appreciated that many variations, modifications and other applications of the invention can be made. Therefore, the claimed invention as recited in the claims that follow is not limited to the embodiments described herein.
Number | Date | Country | Kind
---|---|---|---
23209511.7 | Nov 2023 | EP | regional

 | Number | Date | Country
---|---|---|---
Parent | 18389237 | Nov 2023 | US
Child | 18945880 | | US