Industrial control systems that operate physical systems (e.g., associated with power turbines, jet engines, locomotives, autonomous vehicles, grid infrastructure, medical equipment (e.g., tools for ultrasounds, CAT scans, MRIs, etc.), etc.) are increasingly connected to the Internet. As a result, these control systems have been increasingly vulnerable to threats, such as cyber-attacks (e.g., associated with a computer virus, malicious software, etc.) that could disrupt electric power generation and distribution, damage engines, inflict vehicle malfunctions, etc. Current methods primarily consider attack detection in Information Technology (“IT,” such as computers that store, retrieve, transmit, and manipulate data) and Operation Technology (“OT,” such as direct monitoring devices and communication bus interfaces). Cyber-attacks can still penetrate through these protection layers and reach the physical “domain.” Such attacks can diminish the performance of a control system and may cause a total shutdown or even catastrophic damage to a plant. In some cases, multiple attacks may occur simultaneously (e.g., more than one actuator, sensor, or parameter inside control system devices might be altered maliciously by an unauthorized party at the same time). Note that some subtle consequences of cyber-attacks, such as stealthy attacks occurring at the domain layer, might not be readily detectable (e.g., when only one monitoring node, such as a sensor node, is used in a detection algorithm). Existing approaches to protect an industrial control system may include machine learning models which may help predict an attack. However, these approaches do not necessarily analyze a relationship between variables, which may be indicative of a cyber-attack. It would therefore be desirable to have additional systems and processes for automatically protecting a cyber-physical system from cyber-attacks.
According to some embodiments, a system is provided including a memory storing processor-executable steps; and a processor to execute the processor-executable steps to cause the system to: receive a first data distribution for a first variable; determine a first data optimum number of bins for the first data distribution; create a first model for the first data distribution using the first data optimum number of bins; receive a second data distribution for a second variable; determine a second data optimum number of bins for the second data distribution; create a second model for the second data distribution using the second data optimum number of bins; apply the first model to the second data distribution to calculate a smallest descriptive size of the second data distribution given the first model; apply the second model to the first data distribution to calculate a smallest descriptive size of the first data distribution given the second model; and determine a causal direction between the first variable and the second variable based on the application of the first model and the second model.
According to some embodiments, a method is provided including receiving a first data distribution for a first variable; determining a first data optimum number of bins for the first data distribution; creating a first model for the first data distribution using the first data optimum number of bins; receiving a second data distribution for a second variable; determining a second data optimum number of bins for the second data distribution; creating a second model for the second data distribution using the second data optimum number of bins; applying the first model to the second data distribution to calculate a smallest descriptive size of the second data distribution given the first model; applying the second model to the first data distribution to calculate a smallest descriptive size of the first data distribution given the second model; and determining a causal direction between the first variable and the second variable based on the application of the first model and the second model.
Some technical advantages of some embodiments disclosed herein are improved systems and methods to protect one or more cyber-physical systems (“CPS”) from abnormalities, such as cyber-attacks, in an automatic manner. Embodiments provide a causality module that determines an optimal number of bins in a histogram and uses this process to determine a causality direction in random variable pairs (e.g., does “X” cause “Y” or does “Y” cause “X”). Embodiments provide a causality module that is effective and efficient at determining causal directional relationships between sensor variables with the goal of identifying causal features for use in machine learning models, and noting shifts in causal relationships of windowed data in order to spot attacks, thereby providing increased cyber protection for the CPS. Embodiments may also reduce the computational complexity required to determine causality, as compared to conventional causal techniques, by a factor of 10, and improve the accuracy of the causal inference through improved compression. For example, conventional causal techniques may require more than ten seconds to determine an optimal binning scheme for a single variable of 1000 observations, which would not be practical for real time use. Embodiments, on the other hand, compute an optimal binning encoding for 1000 observations in less than ten milliseconds, depending on the level of quantization desired (i.e., if less precision is allowable, the algorithm can run faster). Embodiments may provide a causality algorithm that implements a dyadic (power of two) search for optimal binning, as well as a simplified method for encoding error costs. Embodiments provide for further computational efficiency by learning optimal histogram encodings/binnings offline and applying them online to a window of data, making real-time detection feasible through the use of a reduced window size. 
The causality module may also, in embodiments, provide for the creation of a grammar based code and the use of this code with time-series sequential data distributions to determine causality in random variable pairs, and determine optimal causal delay (e.g., X causes Y at a delay of 2 seconds). The use of the grammar based code may provide a means of applying this process on windowed data to enable real-time detection of cyber-attacks. These improvements, together, provide for the determination of causal relationships between variables in a cyber-physical system, and the monitoring of these causal relationships to detect cyber-attacks.
With this and other advantages and features that will become hereinafter apparent, a more complete understanding of the nature of the invention can be obtained by referring to the following detailed description and to the drawings appended hereto.
Other embodiments are associated with systems and/or non-transitory computer readable mediums storing instructions to perform any of the methods described herein.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of embodiments. However, it will be understood by those of ordinary skill in the art that the embodiments may be practiced without these specific details. In other instances, well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the embodiments.
One or more specific embodiments of the present invention will be described below. In an effort to provide a concise description of these embodiments, all features of an actual implementation may not be described in the specification. It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.
As described above, an industrial asset (e.g., power turbines, electric motors, aircraft engines, locomotives, hydroelectric power plants, grid infrastructure, medical equipment (e.g., tools for ultrasounds, CAT scans, MRIs, etc.)) with critical infrastructure may be operated by an industrial control system. As a result, a key challenge with these industrial assets is preventing a cyber-attack on the industrial control system by identifying and addressing any vulnerabilities in the system and/or quickly identifying an event as a cyber-attack. The cyber-attack may manipulate the system by changing sensor values, actuators (e.g., valves that affect the flow into a system), rotational speed, air flow, etc., and/or by issuing false commands for the system.
Central to localizing cyber-attack vectors in control networks is determination of what sensor is the prime mover in a spoofing attack. Due to underlying physics, causal relationships will necessarily exist between some sensors. As a non-exhaustive example, temperature in one node will cause pressure in another node and vice versa. Some parameters may be more deterministically caused by some variables than others. However, when an attack occurs, these relationships may reverse or change in magnitude. Continuing with the non-exhaustive example, if the temperature at a given node in a gas network tends to causally determine the pressure at another node, and that temperature node is spoofed, the causal determination between pressure and temperature will for a time be disturbed. The control system may react to this disturbance, and restore causality in some sense, but until the control system catches up, a causal distortion may be observable, and this observation may be a means of localizing the attack. Methods for causality determination may include the use of Minimum Description Length (MDL) principles from the theory of Kolmogorov Complexity and Algorithmic Information Theory to infer causality. These methods utilize compression in some sense to estimate Kolmogorov Complexity, and to determine causal direction by quantifying the descriptive cost of one variable given another.
Embodiments apply the concept of Kolmogorov Complexity to sensor data to determine causal relationships between sensor variables by noting shifts in causal relationships of windowed data. These relationships may then be used to detect attacks. Compared to conventional Kolmogorov Complexity algorithms, which may not support continuous data types and thus lose the essence of the shape of data, embodiments provide a streamlined process that keeps the computational complexity tractable.
While machine learning models may help make predictions based on associations, they may not necessarily determine cause. A famous example is that ice cream consumption and drownings increase in the summer, but eating ice cream does not cause you to drown. In other words, correlation is not causation. One or more embodiments determine whether variable X determines variable Y or vice versa. In terms of cyber-security, if someone is spoofing a sensor so that it appears to be sending incorrect pressure values, for example, in a case that the pressure causes temperature in another part of the system, the temperature may not change in response to the spoof, which in turn may indicate an attack exists, and may help to indicate the location of the attack. Embodiments may determine whether variable X determines variable Y or vice versa, and then identify an inversion of the relationship.
Additionally, use of the method described herein to identify causal relationships may also be used as a means of feature selection for traditional machine learning techniques, where only highly causal features are utilized as inputs. As a non-exhaustive example, if you have 1000 features, the method described herein may be used to determine that twenty of those 1000 features are the most causal features in terms of predictions. Then the machine learning model may be built from those twenty features.
The present invention provides significant technical improvements to facilitate causality determinations between variables and cyber-attack detection. The present invention is directed to more than merely a computer implementation of a routine or conventional activity previously known in the industry as it significantly advances the technical efficiency between devices by implementing a specific new method and system as defined herein. The present invention is a specific advancement in the area of directional causality and cyber-attack detection by providing benefits in reduced computational complexity, improved accuracy, and detection and localization of a cyber-attack, and such advances are not merely a longstanding commercial practice. The present invention provides improvement beyond a mere generic computer implementation as it involves the processing and conversion of significant amounts of data in a new beneficial manner.
The system 100 may also include a causality module 120 configured for causality determination and including a pre-processor 122 and a compressor 126. The causality module 120 of a causality platform 124 may generate a causality determination 128 for two variables. The causality determination 128 may be the determination that one variable in a system causes another random variable. In one or more embodiments, this determination may be made using a minimum description length binning/histogram process, described further below with respect to
The pre-processor 122 may also be configured to apply a sliding window protocol to the input data/stream that segments or divides the input data stream into discrete or separate portions of sequential information. Input data streams of various lengths may be supported such as, for example, input data streams of at least 1 KB in length. In various embodiments, the pre-processor 122 may filter the input data stream 112 by removing from consideration input data known to not be useful for determining causality.
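As a rough illustration of the sliding window protocol, the following Python sketch segments an input stream into windows of sequential values. The function name sliding_windows and the size and step parameters are hypothetical and are not taken from the described embodiments.

```python
from typing import Iterator, List, Sequence


def sliding_windows(values: Sequence[float], size: int, step: int = 1) -> Iterator[List[float]]:
    """Yield consecutive windows of `size` samples, advancing by `step` samples."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    # The last window starts where a full `size` samples are still available.
    for start in range(0, len(values) - size + 1, step):
        yield list(values[start:start + size])


# Illustrative sensor readings (made-up values).
stream = [0.1, 0.4, 0.3, 0.9, 1.2, 1.1, 0.8]
windows = list(sliding_windows(stream, size=4, step=2))
```

A step smaller than the window size produces overlapping windows, which trades extra computation for finer time resolution when looking for shifts between windows.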
In one or more embodiments, the compressor 126 may be configured to perform a Minimum Description Length (MDL) Compression (MDLcompress) algorithm to generate a grammar based code that estimates the Kolmogorov complexity of a variable. It is noted that MDLcompress may achieve general compression of a variable by creating a grammar and using that “grammar based code” to compress the variable. As used herein, the term “grammars” refers to a set of rules and relationships that are associated with particular data sequences. Additionally, the term “model” or “compression model” as used herein may refer to a set of one or more grammars with a probability distribution being associated with each grammar.
The system 100 may also include a database 114. Database 114 may store data used by at least the causality module 120. For example, database 114 may store data values (e.g., sensor values, training values, etc.) that may be used by the causality module 120 during the execution thereof.
Database 114 may comprise any query-responsive data source or sources that are or become known, including but not limited to a structured-query language (SQL) relational database management system. Database 114 may comprise a relational database, a multi-dimensional database, an eXtensible Markup Language (XML) document, or any other data storage system storing structured and/or unstructured data. The data of database 114 may be distributed among several relational databases, dimensional databases, and/or other data sources. Embodiments are not limited to any number or types of data sources.
In some embodiments, the data of database 114 may comprise one or more of conventional tabular data, row-based data, column-based data, and object-based data. Moreover, the data may be indexed and/or selectively replicated in an index to allow fast searching and retrieval thereof. Database 114 may support multi-tenancy to separately support multiple unrelated clients by providing multiple logical database systems which are programmatically isolated from one another.
Database 114 may implement an “in-memory” database, in which a full database is stored in volatile (e.g., non-disk-based) memory (e.g., Random Access Memory). The full database may be persisted in and/or backed up to fixed disks (not shown). Embodiments are not limited to an in-memory implementation. For example, data may be stored in Random Access Memory (e.g., cache memory for storing recently-used data) and one or more fixed disks (e.g., persistent memory for storing their respective portions of the full database).
All processes mentioned herein may be executed by various hardware elements and/or embodied in processor-executable program code read from one or more of non-transitory computer-readable media, such as a hard drive, a floppy disk, a CD-ROM, a DVD-ROM, a Flash drive, Flash memory, a magnetic tape, and solid state Random Access Memory (RAM) or Read Only Memory (ROM) storage units, and then stored in a compressed, uncompiled and/or encrypted format. In some embodiments, hard-wired circuitry may be used in place of, or in combination with, program code for implementation of processes according to some embodiments. Embodiments are therefore not limited to any specific combination of hardware and software.
Initially, at S210, a first data distribution 116 for a first variable (“X”) may be received from a first monitoring node 110 as a series of continuous current monitoring node values 112. The distribution may be time dependent or non-time dependent. It is noted that while the grammar-based method, described further below, may be better at determining causality in time dependent variables, the histogram method described herein may also be used to determine causality by applying the histogram on moving time windows.
Then at S212 an optimum number of bins 301 (
In some embodiments, the optimum number of bins 301 may be the smallest binning partition that describes the data distribution. As described herein, the optimum number of bins (e.g., smallest compressed size of “X”, where “X” represents the first variable) may be referred to as K(X), where “K” is the Kolmogorov complexity (i.e., minimum length of a program such that a universal computer can generate a specific sequence). K(X) may be determined via a Markov Decision Process (MDP) or any other suitable optimization process. The causality module 120 may extract bin frequencies and compute Shannon codes for the extracted bin frequencies. Shannon coding is a variable-length encoding technique for lossless data compression in which each symbol is assigned a code based on its probability of occurrence, so that the codes assigned to the symbols vary in length. Data compression, also known as source coding, encodes or converts data in such a way that it consumes less memory space, thereby reducing the amount of resources required to store and transmit data. In one or more embodiments, an entropy code (e.g., a Shannon code or, similarly, a Huffman code) may be calculated for: (1) a model cost 304, as the total number of bins and a code length for each bin; (2) a code length cost 306, where for each point, an entropy code is encoded for the bin the point belongs to; and (3) an error cost 308, where for each point, an entropy code is encoded for the error between the point's value and a mean of all points in the bin. For example, as shown in
For example,
After the optimum number of bins 301 to describe the first data distribution 116 is determined in S212, a model 314 is created for the first data distribution 116 using the optimum number of bins in S214. The model 314 may be a histogram, or any other suitable model. As shown in
The encoding (e.g., model) of the optimal number of bins determined in S214 may then be applied to a second data distribution 118 for a second variable (“Y”) to determine causality (i.e., whether the first causes the second or vice versa). To that end, in S216, a second data distribution 118 for a second variable (“Y”) may be received from a second monitoring node 110 as a series of continuous monitoring node values 112, represented in
It is noted that, in some embodiments, causality may be determined between two variables during an offline learning phase. Then, during an online detection phase, one of the models may be applied (e.g., the model for the first distribution is applied to the second distribution) to avoid the computational costs associated with generating a second model.
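The offline/online split can be sketched as follows, under several assumptions: the stored model is reduced to a value range and smoothed bin probabilities, only the bin-label code length is measured, and all names (bin_of, learn_offline, bits_per_sample) and sample values are hypothetical. It is a minimal sketch of reusing a stored model on a window, not the embodiments' actual detection procedure.

```python
import math
from collections import Counter


def bin_of(value, lo, hi, n_bins):
    """Index of the bin containing `value`, clamped to the learned range."""
    if hi <= lo:
        return 0
    index = int((value - lo) / (hi - lo) * n_bins)
    return min(max(index, 0), n_bins - 1)


def learn_offline(train, n_bins):
    """Offline phase: learn the range and smoothed bin probabilities once."""
    lo, hi = min(train), max(train)
    counts = Counter(bin_of(v, lo, hi, n_bins) for v in train)
    # Add-one smoothing (an assumption) keeps unseen bins encodable.
    probs = [(counts[b] + 1) / (len(train) + n_bins) for b in range(n_bins)]
    return lo, hi, probs


def bits_per_sample(window, model):
    """Online phase: average code length of a window under the stored model."""
    lo, hi, probs = model
    total = sum(-math.log2(probs[bin_of(v, lo, hi, len(probs))]) for v in window)
    return total / len(window)


# Made-up offline data and two online windows.
offline_values = [0.0, 1.0] + [0.1] * 80 + [0.2] * 80 + [0.9] * 10
model = learn_offline(offline_values, n_bins=4)
typical_bits = bits_per_sample([0.1, 0.2, 0.1, 0.2], model)
shifted_bits = bits_per_sample([0.5, 0.6, 0.5, 0.6], model)
```

Because no binning search runs online, the per-window work is a single pass over the window. A window whose code length rises well above the typical value is one that the stored model no longer describes well.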
Next, the model 404 created for the first data distribution 402 is applied to the second data distribution 406 in S218, as shown in
In some instances, the model for the first data distribution may not be efficiently applied to the second data distribution, and as such may not be optimal. For example, as shown in
After K(Y|X) is determined at S220, the binning algorithm 123 may in S222 determine an optimum number of bins to describe the second data distribution 118 for the second variable (“Y”), in a same manner as described above in S212 for determining the optimum number of bins to describe the first data distribution for the first variable (“X”). Then, in S224, a model is created for the second data distribution using the optimum number of bins determined for the second distribution in S222 in a manner, as described above for the model created for the first data distribution in S214. For example, as shown in
The first data distribution 116 for the first variable (“X”) is shown in
It is noted that while S222-S228 is described as occurring after S210-S220, these processes may be occurring at a same time or at a substantially same time.
Next, in S230, the causality module 120 executes the causality algorithm 121 to determine whether the first variable (X) caused the second variable (Y) or vice versa. In a case that K(X)+K(Y|X) is less than K(Y)+K(X|Y), then the first variable (X) caused the second variable (Y). In a case that K(Y)+K(X|Y) is less than K(X)+K(Y|X), then the second variable (Y) caused the first variable (X).
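The overall flow of S210-S230 can be sketched in Python as shown below. This is a simplified illustration under stated assumptions rather than the claimed algorithm: both variables are rescaled to [0, 1] so that one variable's bins can be applied to the other, bin probabilities use add-one smoothing, the model cost charges one count per bin, and errors are quantized at a fixed precision before their empirical entropy is taken. All function names and the synthetic data are hypothetical.

```python
import math
import random
from collections import Counter


def normalize(data):
    """Rescale values to [0, 1] so one variable's bins can be applied to another."""
    lo, hi = min(data), max(data)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in data]


def to_bins(data, n_bins):
    """Map each normalized value to a bin index in 0..n_bins-1."""
    return [min(max(int(v * n_bins), 0), n_bins - 1) for v in data]


def fit_histogram(data, n_bins):
    """Histogram model: smoothed bin probabilities plus the mean of each bin."""
    bins = to_bins(data, n_bins)
    counts = Counter(bins)
    sums = Counter()
    for v, b in zip(data, bins):
        sums[b] += v
    probs = [(counts[b] + 1) / (len(data) + n_bins) for b in range(n_bins)]
    means = [sums[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
             for b in range(n_bins)]
    return {"n_bins": n_bins, "probs": probs, "means": means}


def data_cost(data, model, precision=0.01):
    """Bits for bin labels plus bits for quantized errors around each bin mean."""
    bins = to_bins(data, model["n_bins"])
    label_bits = -sum(math.log2(model["probs"][b]) for b in bins)
    errors = [round((v - model["means"][b]) / precision) for v, b in zip(data, bins)]
    n = len(errors)
    error_bits = -sum(c * math.log2(c / n) for c in Counter(errors).values())
    return label_bits + error_bits


def model_cost(model, n_points):
    """Bits for the bin count plus one count per bin (a simplifying assumption)."""
    return math.log2(model["n_bins"]) + model["n_bins"] * math.log2(n_points + 1)


def best_model(data, max_power=6):
    """Dyadic search: try 1, 2, 4, ..., 2**max_power bins and keep the cheapest."""
    candidates = []
    for power in range(max_power + 1):
        model = fit_histogram(data, 2 ** power)
        total = model_cost(model, len(data)) + data_cost(data, model)
        candidates.append((total, model))
    return min(candidates, key=lambda c: c[0])


def causal_direction(x, y):
    """Compare K(X)+K(Y|X) with K(Y)+K(X|Y) using the histogram estimates."""
    x, y = normalize(x), normalize(y)
    k_x, model_x = best_model(x)
    k_y, model_y = best_model(y)
    forward = k_x + data_cost(y, model_x)    # estimate of K(X) + K(Y|X)
    backward = k_y + data_cost(x, model_y)   # estimate of K(Y) + K(X|Y)
    if forward < backward:
        return "X causes Y"
    if backward < forward:
        return "Y causes X"
    return "undetermined"


# Synthetic data for illustration only.
rng = random.Random(0)
x = [rng.random() for _ in range(500)]
y = [v ** 2 + 0.05 * rng.random() for v in x]
result = causal_direction(x, y)
```

Restricting the search to powers of two keeps the number of candidate binnings logarithmic in the finest resolution considered, which is the point of a dyadic search. The precision argument plays the role of the quantization level: a coarser precision yields fewer distinct error symbols to encode.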
A non-exhaustive example in the field of Solar Power may be used to describe the process 200 detailed herein. The non-exhaustive example, as shown in
In the example, treating the solar distribution as the first data distribution for the first variable (X), the optimal number of bins for the solar distribution is determined to be 15, with a complexity (K(X)) of 13250, determined at S212, and the model 702 is determined per S214. The complexity (K(Y|X)) of power (Y) given solar (X) is determined to be 13366 at S218. Although not shown herein, one or more iterations may have been performed per S228.
Then, treating the power distribution as the second data distribution (Y), the optimal number of bins for the power distribution is determined to be 12, with a complexity (K(Y)) for the second variable of 12623, as determined at S222. A model for the power distribution 704 is generated at S224. The complexity (K(X|Y)) of solar (X) given power (Y) is determined to be 15402 at S226. The causality module 120 then executes the causality algorithm 121 at S230
K(Solar)+K(Power|Solar)<K(Power)+K(Solar|Power)
13250+13366<12623+15402
to determine that solar intensity causes solar power and not vice versa.
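The comparison in the solar example can be checked directly. The short Python sketch below uses only the four complexity values reported above; the variable names are illustrative.

```python
# Complexity values reported for the solar example.
k_solar = 13250              # K(X), solar intensity
k_power_given_solar = 13366  # K(Y|X)
k_power = 12623              # K(Y), solar power
k_solar_given_power = 15402  # K(X|Y)

forward = k_solar + k_power_given_solar    # 26616
backward = k_power + k_solar_given_power   # 28025

if forward < backward:
    direction = "solar intensity causes solar power"
elif backward < forward:
    direction = "solar power causes solar intensity"
else:
    direction = "undetermined"
```

The forward description is shorter by 1409 units (28025 minus 26616), which is why the solar-to-power direction is selected.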
In one or more embodiments, the causality module 120 may, in an offline environment, generate a causal dependency matrix 800 (
In one or more embodiments, the causality module 120 may, in an online/near-real-time environment, generate a causal matrix from current pipeline configuration/operations conditions without a significant pre-construction of a causal matrix. For example, the model may be applied to current operating conditions for a specific variable without creating a causal matrix for all of the variables associated with the CPS.
Turning to
In one or more embodiments, the causality module 120 may apply a grammar engine 130 to a sequential time series of data prior to application of the causality algorithm 121. In some embodiments, the grammar engine 130 may find patterns and motifs useful for compressing unknown data sets via a grammar-based Minimum Description Length (MDL) compression algorithm (“compressor”) 126 to generate grammars. The compressor 126 may use a grammar-based coding technique that compresses through inferring an algorithmic minimum sufficient statistic in a stochastic gradient manner. As used herein, the term “grammars” refers to a set of rules and relationships that are associated with particular data sequences. Furthermore, the term “compression model” as used herein refers to a set of one or more grammars with a probability distribution being associated with each grammar. After the grammar code is generated, the causality algorithm 121 may be applied, as described above, to the grammar code to determine a causality direction (i.e., does X cause Y or does Y cause X) between two variables.
The grammar process 1000 (
Next in S1016, the grammar is applied to the data distribution for the second variable—wind power (Y), and the wind power sequence is compressed using the grammar to generate K(Y|X), as in S226 of
Consider the string “a_rose_is_a_rose_is_a_rose”
A single first grammar rule could capture the phrase:
S1->“a_rose”
Resulting in the string: S1_is_S1_is_S1
The recursive nature of MDLcompress would create an additional rule:
S2->S1_is_, resulting in the final string S2S2S1
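The two substitutions in this example can be reproduced with plain string replacement, as in the Python sketch below. It takes the second rule to include the trailing separator so that the final string is S2S2S1, and it applies hand-written rules rather than inferring them; it does not implement MDLcompress itself. The helper names apply_rules and expand are hypothetical.

```python
def apply_rules(text, rules):
    """Replace each phrase with its rule symbol, in the order the rules were created."""
    for symbol, phrase in rules:
        text = text.replace(phrase, symbol)
    return text


def expand(text, rules):
    """Undo the substitutions in reverse order to recover the original string."""
    for symbol, phrase in reversed(rules):
        text = text.replace(symbol, phrase)
    return text


# S1 captures the repeated phrase; S2 is built on top of S1.
rules = [("S1", "a_rose"), ("S2", "S1_is_")]
original = "a_rose_is_a_rose_is_a_rose"

after_first_rule = apply_rules(original, rules[:1])  # "S1_is_S1_is_S1"
compressed = apply_rules(original, rules)            # "S2S2S1"
restored = expand(compressed, rules)
```

This toy version assumes the rule symbols never occur in the raw data. A real grammar-based coder would use symbols outside the data alphabet and would also charge for the cost of storing the rules.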
The process of S1010-S1016 is repeated for the wind power variable (Y) as in S222-S228 of
Then the direction of causality is determined in S1026, as in S230, described above. In particular, if K(X)+K(Y|X)<K(Y)+K(X|Y), it is inferred that X causes Y.
In one or more embodiments, it may further be determined at S1028 whether a delay is indicated by the data distribution. To determine whether there is a delay, two data sets may be merged 1202, as shown in
The merged sequence of X=ADABACADABAD and Y=_ZRZQZYZRZQZR at a delay of 1 provides the sequence AZDRAZBQAZCYAZDRAZBQAZDR, in which every symbol of X is followed by its causal counterpart in Y (A with Z, D with R, B with Q, C with Y), so that it compresses well compared to other delays.
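A minimal Python sketch of the delay search follows. It interleaves X with Y shifted by each candidate delay and scores the merged sequence by the empirical entropy of its pairs, which is used here only as a stand-in for the compressed size that a grammar-based compressor would report. The function names and the range of candidate delays are assumptions.

```python
import math
from collections import Counter


def merge_at_delay(x, y, delay):
    """Interleave each symbol of x with the symbol of y that occurs `delay` steps later."""
    return "".join(a + b for a, b in zip(x, y[delay:]))


def bits_per_pair(merged):
    """Empirical entropy of the (x, y) pairs in a merged sequence, in bits per pair."""
    pairs = [merged[i:i + 2] for i in range(0, len(merged), 2)]
    counts = Counter(pairs)
    n = len(pairs)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


x = "ADABACADABAD"
y = "_ZRZQZYZRZQZR"

# Lower entropy means the pairing is more regular and should compress better.
scores = {delay: bits_per_pair(merge_at_delay(x, y, delay)) for delay in range(4)}
best_delay = min(scores, key=scores.get)
```

For these two sequences the lowest score occurs at a delay of 1, where only four distinct pairs appear; the other delays tried produce between five and seven distinct pairs.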
Note that the embodiments described herein may be implemented using any number of different hardware configurations. For example,
The processor 1310 also communicates with a storage device 1330. The storage device 1330 may comprise any appropriate information storage device, including combinations of magnetic storage devices (e.g., a hard disk drive), optical storage devices, mobile telephones, and/or semiconductor memory devices. The storage device 1330 stores a program 1312 and/or causality engine 1314 for controlling the processor 1310. The processor 1310 performs instructions of the programs 1312, 1314, and thereby operates in accordance with any of the embodiments described herein. For example, the processor 1310 may receive from the nodes a first distribution for a first variable and a second distribution for a second variable, where the first variable is different from the second variable, and then apply the causality algorithm to determine a direction of causality between the two variables.
The programs 1312, 1314 may be stored in a compressed, uncompiled and/or encrypted format. The programs 1312, 1314 may furthermore include other program elements, such as an operating system, clipboard application, a database management system, and/or device drivers used by the processor 1310 to interface with peripheral devices.
As used herein, information may be “received” by or “transmitted” to, for example: (i) the causality platform 1300 from another device; or (ii) a software application or module within the causality platform 1300 from another software application, module, or any other source.
Although specific hardware and data configurations have been described herein, note that any number of other configurations may be provided in accordance with embodiments of the present invention (e.g., some of the information associated with the databases described herein may be combined or stored in external systems). Moreover, although some embodiments are focused on gas turbines or solar/wind systems, any of the embodiments described herein could be applied to other types of cyber-physical systems including power grids, dams, locomotives, airplanes, and autonomous vehicles (including automobiles, trucks, drones, submarines, etc.).
The present invention has been described in terms of several embodiments solely for the purpose of illustration. Persons skilled in the art will recognize from this description that the invention is not limited to the embodiments described but may be practiced with modifications and alterations limited only by the spirit and scope of the appended claims.