The present invention relates to computer and network security and, more particularly, to alert ranking and attack scenario reconstruction for anomaly detection.
Enterprise networks are key systems in corporations and they carry the vast majority of mission-critical information. As a result of their importance, these networks are often the targets of attack. Communications on enterprise networks are therefore frequently monitored and analyzed to detect anomalous network communication as a step toward detecting attacks.
In particular, advanced persistent threat (APT) attacks, which persistently use multiple complex phases to penetrate a targeted network and steal confidential information, have become major threats to enterprise information systems. Existing rule/feature-based approaches for APT detection may only discover isolated phases of an attack. As a result, these approaches may suffer from a high false-positive rate and cannot provide a high-level picture of the whole attack.
In such enterprise networks, multiple detectors may be deployed to monitor computers and other devices. These detectors generate different kinds of alerts based on the monitored data. Reconstructing attack scenarios involves determining which alerts are important and which represent false positives.
A method for detecting security intrusions includes detecting alerts in monitored system data. Temporal dependencies are determined between the alerts based on a prefix tree formed from the detected alerts. Content dependencies between the alerts are determined based on a distance between alerts in a graph representation of the detected alerts. The alerts are ranked, using a processor, based on an optimization problem that includes the temporal dependencies and the content dependencies. A security management action is performed based on the ranked alerts.
A system for detecting security intrusions includes a detector module configured to detect alerts in monitored system data. A temporal dependency module is configured to determine temporal dependencies between the alerts based on a prefix tree formed from the detected alerts. A content dependency module is configured to determine content dependencies between the alerts based on a distance between alerts in a graph representation of the detected alerts. A ranking module includes a processor configured to rank the alerts based on an optimization problem that includes the temporal dependencies and the content dependencies. A security module is configured to perform a security management action based on the ranked alerts.
These and other features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.
The disclosure will provide details in the following description of preferred embodiments with reference to the following figures wherein:
In accordance with the present principles, the present embodiments provide alert ranking, discover the underlying correlations between different alerts, and reconstruct attack scenarios. The present alert ranking therefore addresses the challenges presented by alert heterogeneity, temporal and content differences, false positives, the need for real-time responsiveness, a lack of training data, and non-linear alert correlations.
Referring now in detail to the figures in which like numerals represent the same or similar elements and initially to
Each agent 10 includes an agent manager 11, an agent updater 12, and agent data 13, which in turn may include information regarding active processes, file access, net sockets, number of instructions per cycle, and host information. The backend server 20 includes an agent updater server 21 and surveillance data storage. Analysis server 30 includes intrusion detection 31, security policy compliance assessment 32, incident backtrack and system recovery 33, and centralized threat search and query 34.
Referring now to
The detectors that feed the intrusion detection system 31 may report alerts with very different semantics. For example, network detectors monitor the topology of network connections and report an alert if a suspicious client suddenly connects to a stable server. Meanwhile, process-file detectors may generate an alert if an unseen process accesses a sensitive file. The intrusion detection system 31 integrates alerts regardless of their respective semantics to overcome the problem of heterogeneity.
Furthermore, real security incidents (e.g., hacker attacks, malware infections, etc.) are likely to cause multiple alerts for different detectors. However, particularly in an advanced persistent threat (APT) scenario, the alerts might be widely spaced in time, with heterogeneous system entity information. The alert ranking and attack scenario reconstruction module 46 therefore integrates alerts with both temporal and content differences.
Due to the complexity of enterprise systems, the accuracy of a single detector is usually low, where the majority of alerts being generated are false positives. The false positives are therefore filtered out, with only meaningful ranking results being output. Furthermore, this processing takes place in real-time to address the high potential for damage that can develop rapidly.
Because of the large scale of data collection in enterprise systems, it can be difficult to obtain useful training data for an analysis system. The manual labeling of large sets of reported alerts to create training data is costly and error-prone. Furthermore, most real alerts are unknown attacks, where the end user has no knowledge about the alert pattern and cannot define a useful model in advance. As such, the present embodiments learn models to detect attacks as the attacks unfold.
APT attacks usually include a series of sequential, interacting process events. Such non-linear cooperative interactions between system events can often generate sequences or patterns of alerts. As a result, the present embodiments discover the underlying relationship between different alerts and rank the alerts based on interactions between the processes.
Referring now to
Block 304 performs alert encoding. Alert encoding determines the raw alert sequence under an appropriate granularity. Each alert may be considered unique if all attributes are considered, making it difficult to capture the temporal dependency between alerts. However, because each alert can be represented as the co-occurrence of a set of entities when the time-related attribute is excluded, a set of representatives, Σ, is used to create ensembles of co-occurrences. The number of representatives can be too large to be manipulated if all non-time-related entities are considered. As such, only important entities are considered, with examples including the source and destination entities representing each alert. Block 304 enumerates all possible alerts in the symbol set Σ.
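The encoding step of block 304 can be illustrated with a minimal sketch. It assumes each alert is a dictionary with hypothetical field names `time`, `src`, and `dst`; these names, and the choice of integer symbols, are illustrative assumptions rather than part of the described embodiments.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical alert record: field names are illustrative assumptions.
Alert = Dict[str, Any]


def encode_alerts(alerts: List[Alert],
                  keys: Tuple[str, ...] = ("src", "dst")):
    """Map each alert to a symbol built from selected non-time entities."""
    symbol_of: Dict[Tuple[Any, ...], int] = {}
    sequence: List[int] = []
    # Sort by timestamp so the symbol sequence preserves temporal order.
    for alert in sorted(alerts, key=lambda a: a["time"]):
        representative = tuple(alert[k] for k in keys)
        if representative not in symbol_of:
            symbol_of[representative] = len(symbol_of)
        sequence.append(symbol_of[representative])
    return sequence, symbol_of


alerts = [
    {"time": 3, "src": "bash", "dst": "/etc/passwd"},
    {"time": 1, "src": "ssh", "dst": "10.0.0.5"},
    {"time": 7, "src": "bash", "dst": "/etc/passwd"},
]
sequence, symbols = encode_alerts(alerts)
```

Two alerts that differ only in their time attribute receive the same symbol, which is what allows recurring temporal structure to be observed in the encoded sequence.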
Block 306 then performs temporal dependency modeling on the alerts. To model temporal dependency in alert sequences, a prefix tree is used to preserve the temporal structure between alerts and to learn the long-term dependencies between alerts using Bayesian hierarchical modeling. Block 306 then applies a breadth-first search on the prefix tree to identify a set of patterns such that alerts in each pattern are highly correlated.
Block 308 performs content dependency modeling, either before, during, or after the temporal dependency modeling of block 306. Each alert is associated with heterogeneous types of entities, such as the user, time, source/destination process, and folder. These entities, viewed as content information, are useful for aggregating low-level alerts into a high-level view of an attacker's behavior.
Block 310 then performs ranking based on both the temporal structures and content similarities determined by blocks 306 and 308, identifying alerts and alert patterns that maximize the consensus between temporal and content dependencies. It should be noted that an alert pattern is a sequence of alerts that may represent multiple steps or phases of an abnormal system or user activity. Block 310 sorts the confidences of alerts and alert patterns simultaneously by integrating the temporal and content dependencies into an optimization problem. The output of block 310 is a set of ranked alerts. Block 312 then prunes the untrustworthy alerts and alert patterns by, e.g., removing alerts and alert patterns having a confidence score below a threshold value or having a rank below a threshold rank.
Referring now to
A sequence of alerts is formally expressed herein as $s_{1:T} = \{s_1, \ldots, s_T\}$, where each $s_i$ takes a value in the set of entities Σ. The joint distribution over the sequence can be estimated by:

$$P(s_{1:T}) = \prod_{i=1}^{T} P(s_i \mid s_{1:i-1})$$
where the prediction of symbol si is conditioned on all of its preceding symbols s1:i−1. When the prediction of the next variable is only related to the values taken by at most the preceding n variables, this problem can be approximated by an nth order Markov model. When n is not truncated to some fixed value, the model is non-Markovian.
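As a point of comparison, the fixed-order approximation mentioned above can be sketched as follows. This is a minimal illustration of an nth order Markov model that scores a sequence by the chain rule; the add-one smoothing and the function name `markov_log_prob` are assumptions made for the example and are not taken from the described embodiments.

```python
import math
from collections import Counter


def markov_log_prob(train, test, n, alphabet_size):
    """Log-probability of `test` under an nth order Markov model fit on `train`."""
    context_counts = Counter()
    ngram_counts = Counter()
    for i, sym in enumerate(train):
        # The context is at most the n symbols preceding position i.
        ctx = tuple(train[max(0, i - n):i])
        context_counts[ctx] += 1
        ngram_counts[(ctx, sym)] += 1

    log_p = 0.0
    for i, sym in enumerate(test):
        ctx = tuple(test[max(0, i - n):i])
        # Add-one smoothing (an assumption) keeps unseen events non-zero.
        numerator = ngram_counts[(ctx, sym)] + 1
        denominator = context_counts[ctx] + alphabet_size
        log_p += math.log(numerator / denominator)
    return log_p
```

Truncating the context to n symbols is what makes this model Markovian; the non-Markovian model described here instead conditions on the entire preceding sequence.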
To learn such a model from the data, a predictive distribution of the next symbol, given each possible context, is learned. Given a finite sequence of symbols s, the predictive distribution of the next symbol conditioned on s is written as G[s]. G[s] is a discrete distribution that can be represented as a probability vector with latent variables: $G_{[s]}(u) = p(s_{T+1} = u \mid s)$, ∀u∈Σ.
Estimating probability vectors independently relies on adequate training sequences that represent the true distribution. However, because attack scenarios are rare and have a low recurrence or signal observation, it is difficult to estimate a whole probability vector that generalizes in any reasonable way. Block 402 therefore creates a prefix tree representation that hierarchically ties together the vector of predictive probabilities in a particular context to vectors of probabilities in related, shorter contexts. Block 404 then builds a hierarchical Bayesian model to address the problem of insufficient training data, using observations that occur in very long contexts to recursively inform the estimation of the predictive probabilities for related, shorter contexts and vice versa. Block 406 then searches for attack patterns.
For a given sequence s having T symbols, the number of predictive distributions conditioned on a context can be intractable when the length T goes to infinity. The only variables that will have observations associated with them are the ones corresponding to the contexts that are prefixes of s:
The prefix tree representation created by block 402 therefore includes a set of nodes that represent a prefix (e.g., a sequence of nodes) and its probability vector. Each node depends only on its ancestors in the prefix tree, which correspond to the suffixes of the context. Thus, the only variables for which inference is needed are precisely those that correspond to contexts which are contiguous subsequences of s:
The prefix tree representation of a sequence may be constructed from an input string in $O(T^2)$ time and space. The prefix tree representation can further be improved by marginalizing out the non-branching interior nodes. The marginalized prefix tree can also be directly built from an input sequence in linear time and space complexity. The resulting prefix tree retains the nodes (variables) of interest, eliminating all non-branching nodes by allowing each edge label to be a sequence of symbols (or meta-symbols), rather than a single symbol.
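The quadratic-time construction can be sketched as below. This is a simplified, count-based illustration under stated assumptions: every suffix of each observed context records the next symbol, contexts are traversed from the most recent symbol backwards so that a node's parent is the context with its earliest symbol dropped, and non-branching nodes are not marginalized out. The class and function names are hypothetical.

```python
from typing import Dict, List


class ContextNode:
    """One context in the tree, with counts of the symbols that followed it."""

    def __init__(self) -> None:
        self.children: Dict[int, "ContextNode"] = {}
        self.counts: Dict[int, int] = {}


def build_context_tree(seq: List[int]) -> ContextNode:
    """Build the un-marginalized tree in O(T^2) time and space."""
    root = ContextNode()  # the empty context
    for i, nxt in enumerate(seq):
        node = root
        node.counts[nxt] = node.counts.get(nxt, 0) + 1
        # Walk the context seq[:i] from its most recent symbol backwards.
        for sym in reversed(seq[:i]):
            node = node.children.setdefault(sym, ContextNode())
            node.counts[nxt] = node.counts.get(nxt, 0) + 1
    return root
```

A linear-time variant would collapse each chain of single-child nodes into one edge labeled with a sequence of symbols, as described above; that step is omitted here for brevity.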
Block 404 uses a hierarchical Bayesian model to approximate the probability vectors in the prefix tree generated by block 402, based on the assumption that predictive distributions conditioned on similar preceding contexts will be similar. A hierarchical Bayesian prior is placed over the set of probability vectors. The prior probability vector for G[s] is written herein as H[s]. Before observing any data, the next symbol conditioned on s should occur according to the probability H[s](u), ∀u∈Σ. The hierarchical Bayesian priors regard the distribution on each node as a prior that informs the distributions on its descendants. The hierarchical structure can be expressed as H[s]=G[π(s)], where π(s) denotes the suffix of s having all but the earliest symbol, corresponding to the parent of node s in the prefix tree. A Pitman-Yor process is then applied to capture the hierarchical structure.
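The way a parent context informs its descendants can be illustrated with a simplified back-off estimator. The sketch below is not a full Pitman-Yor sampler: it assumes a single discount and strength shared by all nodes and treats every observed symbol type as one "table", which resembles interpolated discounting. The parameter names `discount` and `strength` and both function names are assumptions made for this example.

```python
from collections import defaultdict


def context_counts(seq, max_len):
    """Count next symbols for every context of length 0..max_len."""
    counts = defaultdict(lambda: defaultdict(int))
    for i, nxt in enumerate(seq):
        for n in range(min(i, max_len) + 1):
            counts[tuple(seq[i - n:i])][nxt] += 1
    return counts


def predictive(u, context, counts, alphabet_size,
               discount=0.5, strength=1.0):
    """Estimate P(next = u | context), backing off to shorter contexts."""

    def recurse(ctx):
        # The parent drops the earliest symbol: H[s] = G[pi(s)].
        # Above the empty context, fall back to a uniform distribution.
        parent = recurse(ctx[1:]) if ctx else 1.0 / alphabet_size
        c = counts.get(ctx)
        if not c:
            return parent  # no observations: rely on the prior alone
        total = sum(c.values())
        distinct = len(c)
        seen = max(c.get(u, 0) - discount, 0.0)
        backoff_mass = strength + discount * distinct
        return (seen + backoff_mass * parent) / (strength + total)

    return recurse(tuple(context))
```

Contexts that were never observed inherit their parent's distribution unchanged, which is how sparse observations in long contexts are compensated by better-supported shorter ones.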
Based on the predictive distributions learned by the Bayesian hierarchical modeling of block 404, block 406 finds a set of highly correlated alert patterns. Given an alert pattern of length L, denoted herein as u={su
The more likely a pattern is to be observed in the sequence, the stronger the temporal dependency of the pattern is. To identify the set of patterns that have probability larger than a threshold ε and an arbitrary length smaller than Lmax from the Bayesian hierarchical modeling, block 406 uses a breadth-first search to find alert patterns on the prefix tree.
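The breadth-first search of block 406 can be sketched as follows. The example assumes that the probability of a pattern is the chain-rule product of conditional probabilities supplied by a callable `cond_prob(symbol, context)`; that callable, the function name `find_patterns`, and the toy transition table are illustrative assumptions.

```python
from collections import deque


def find_patterns(alphabet, cond_prob, eps, max_len):
    """Return patterns with probability > eps and length < max_len."""
    results = {}
    queue = deque([((), 1.0)])  # start from the empty pattern
    while queue:
        pattern, prob = queue.popleft()
        if len(pattern) + 1 >= max_len:
            continue  # extending would reach the length limit
        for sym in alphabet:
            p = prob * cond_prob(sym, pattern)
            # Extensions can only lower the probability, so a pattern at or
            # below the threshold is pruned together with all its extensions.
            if p > eps:
                extended = pattern + (sym,)
                results[extended] = p
                queue.append((extended, p))
    return results


# Toy conditional distribution that depends only on the previous symbol.
table = {(0, 1): 0.9, (0, 0): 0.1, (1, 0): 0.8, (1, 1): 0.2}


def toy_cond_prob(sym, context):
    return 0.5 if not context else table[(context[-1], sym)]


patterns = find_patterns([0, 1], toy_cond_prob, eps=0.3, max_len=4)
```

Because each level of the search only expands patterns that already exceed the threshold, the number of candidates stays small when ε is chosen conservatively.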
Referring now to
In particular, the entities of a kth type in alert i and alert j are written as vik and vjk, respectively, each of which is a member of Vk. The distance between the two entities is written as dis(vik, vjk). The distance between alerts, dis(ai, aj), can be naturally derived from the convention of the Lθ-norm distance, which is the sum of the Lθ distance along each dimension:
In practice, θ is set to either 1 or 2, which resemble the Hamming and Euclidean distances, respectively. Since the dependent alerts always occur within a certain time span, a time decay function can be further incorporated into the distance measurements. The times of occurrence for alerts ai and aj are written herein as ti and tj, with the time difference between them being Δt=|ti−tj|. When the time difference between two alerts is greater than a threshold δ, the dependency decays exponentially with Δt. Otherwise the dependency does not decay. Thus:
where c2 is a constant that controls the decay rate and where:
This reduces the problem to finding the distance between each pair of entities. Because categorical data does not have any intrinsic distance measurement, co-occurrence has been widely used to quantify the relationship between entities. Co-occurrence measures the closeness of entities by the frequency with which they appear together, but is limited by its intransitive nature. For example, if the entities a and b do not co-occur, then based on the co-occurrence statistics they are not close to one another. However, if both a and b are indirectly connected by the entity c, they would share a certain degree of similarity. This similarity would be missed because a and b did not co-occur in the alert data.
To measure the dependency between alerts, the present embodiments also capture the transitive distance between entities. Block 502 therefore creates a d-partite graph G=(V, E), with the vertex set V being made up of all entities and the edge set E indicating the co-occurrence structure among the entities. The graph is a d-partite graph with each part representing one type of entity, because entities belonging to the same type do not co-occur in the alert data.
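A minimal sketch of the graph construction in block 502 follows. It assumes each alert is a dictionary mapping an entity type to a single entity value, so that a vertex is a `(type, value)` pair; the field names in the example and the use of co-occurrence counts as edge weights are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations


def build_cooccurrence_graph(alerts):
    """Return {(vertex_a, vertex_b): co-occurrence count} for a d-partite graph."""
    edges = defaultdict(int)
    for alert in alerts:
        # Each vertex is a (type, value) pair; sorting gives a canonical order
        # so that an undirected edge is always stored under the same key.
        vertices = sorted(alert.items())
        for a, b in combinations(vertices, 2):
            edges[(a, b)] += 1
    return dict(edges)


alerts = [
    {"user": "alice", "process": "bash", "file": "/etc/passwd"},
    {"user": "bob", "process": "bash", "file": "/tmp/x"},
]
graph = build_cooccurrence_graph(alerts)
```

Because an alert carries at most one entity per type in this representation, no edge ever joins two entities of the same type, which is exactly the d-partite property.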
Based on the graph representation, block 504 measures the similarity between alerts and block 506 measures the pairwise distance between entities using a proximity measures approach, which provides a systematic way to augment the initial entity relation by collectively considering an entity's relation with other entities. Entities of the same type can then be related to one another by transiting their connection with entities of other types.
Block 502 represents each node in the graph as a vector of 1s and 0s, with each element recording the occurrence of the ith entity in all alerts, denoted as $v_i \in \{0,1\}^{T \times 1}$. Based on the vector representation, the weights are estimated using the proximities listed in Table 1 below. The similarity measurements need to be further transformed to a distance using transfer functions to obtain the shortest-path distance.
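The binary occurrence vectors can be sketched as below. Cosine similarity is used purely as an example of a proximity measure; it is an assumption for illustration, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def occurrence_vectors(alerts):
    """Map each (type, value) entity to a 0/1 vector over all T alerts."""
    vectors = defaultdict(lambda: [0] * len(alerts))
    for t, alert in enumerate(alerts):
        for entity in alert.items():
            vectors[entity][t] = 1
    return dict(vectors)


def cosine_similarity(x, y):
    """Cosine of the angle between two occurrence vectors (0 if either is empty)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0
```

Two entities that never appear in the same alert have orthogonal vectors and therefore zero similarity under this measure, which is the situation the shortest-path distance is meant to address.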
When the similarity is non-zero, the transfer function may take the form of, e.g.,
etc., and may be infinite when the similarity is zero. Considering all co-occurrences may result in a dense, noisy graph, so block 502 prunes the noisy edges by removing connections that are not within the k nearest neighbors, where k is a parameter that controls the sparsity of the graph. The distance between any pair of entities can be directly computed from some proximity measures, such as the Hamming and Euclidean distances, where entities with zero occurrence can still have some finite distance between them. These measures can also be less robust, however, because the distance measure they provide is intransitive and may not faithfully reflect the proximities between entities. By only connecting correlated entities and then using the shortest path to link less-correlated entities, a more robust proximity measure is achieved.
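The pruning and shortest-path steps can be sketched as follows. The example assumes a reciprocal transfer function (distance = 1/similarity) and a symmetric k-nearest-neighbor graph in which an edge is kept if either endpoint ranks the other among its k nearest; both choices, and the function names, are assumptions made for illustration.

```python
import heapq
from collections import defaultdict


def knn_graph(similarity, k):
    """similarity: {(a, b): sim}. Return adjacency {node: {neighbor: distance}}."""
    candidates = defaultdict(list)
    for (a, b), sim in similarity.items():
        if sim > 0:  # zero similarity corresponds to an infinite distance
            dist = 1.0 / sim  # assumed transfer function
            candidates[a].append((dist, b))
            candidates[b].append((dist, a))
    adj = defaultdict(dict)
    for node, cand in candidates.items():
        for dist, other in sorted(cand)[:k]:
            adj[node][other] = dist
            adj[other][node] = dist  # keep the graph undirected
    return adj


def shortest_path_distance(adj, source, target):
    """Dijkstra's algorithm; returns infinity if target is unreachable."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > best.get(node, float("inf")):
            continue  # stale heap entry
        for other, weight in adj.get(node, {}).items():
            nd = d + weight
            if nd < best.get(other, float("inf")):
                best[other] = nd
                heapq.heappush(heap, (nd, other))
    return float("inf")
```

With this structure, two entities that are only weakly related directly are linked through a chain of strongly related intermediaries, which gives the transitive distance described above.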
By sorting all entities in a fixed order, the pairwise distances between the entities can be represented as a θ-norm symbolic distance matrix, $S^{\theta}$, with each element $S^{\theta}_{pq}$ representing the distance between v(p) and v(q): $S^{\theta}_{pq} = \mathrm{dis}(v_{(p)}, v_{(q)})^{\theta}$, where θ is a power parameter. Using the distance measurement and the transfer function of similarity, a pairwise similarity matrix between alerts is generated by block 502 and is denoted as S.
Having the temporal and content dependencies from blocks 306 and 308, block 310 ranks the alerts. The set of alert patterns extracted from the temporal model are denoted as M1, . . . , ML with corresponding anomaly scores p1, . . . , pL. Each alert pattern Ml is associated with a set of processes
The pattern structures among alerts are given by an affinity matrix $F \in \mathbb{R}^{T \times L}$, where T is the number of alerts and L is the number of patterns. Each element of the affinity matrix, Fil, indicates whether an alert ai is included in the pattern Ml. The value of Fil is 1 if the process conducting ai exists in pattern Ml, and is 0 otherwise.
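The affinity matrix can be built directly from this definition. The sketch below assumes that each alert is identified by the name of the process that conducted it and that each pattern is given as a set of process names; both representations are illustrative assumptions.

```python
from typing import List, Set


def affinity_matrix(alert_processes: List[str],
                    pattern_processes: List[Set[str]]) -> List[List[int]]:
    """F[i][l] is 1 if the process conducting alert i appears in pattern l."""
    return [
        [1 if process in members else 0 for members in pattern_processes]
        for process in alert_processes
    ]


F = affinity_matrix(
    ["bash", "ssh", "bash"],
    [{"bash", "curl"}, {"ssh"}],
)
```

Each row corresponds to one alert and each column to one pattern, so an alert may belong to several patterns or to none.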
Each alert can either correspond to a true intrusion or to a false positive. The probability of each alert ai (with i=1, . . . , T) corresponding to a true intrusion is $\hat{P}(a_i = \text{true positive})$. As noted above, T is the number of symbols in a training sequence, where the number of symbols in the sequence is the same as the number of all alerts. The number of unique symbols and the number of unique alerts would be different, because symbols are used to represent the alerts based on the values of some important entities of the alert, such that different alerts can have the same symbol.
Block 310 ranks alerts based on these estimated probabilities. Each alert ai is therefore assigned a score ui that represents the probability of being a true positive. Due to the presence of false positives, each alert pattern Ml may be a mixture of true positives and false positives that does not correspond to intrusion behavior. The confidence that each alert pattern is an intrusion, P(Ml=true positive), is assigned a score vl. Therefore, maximizing the consensus between temporal and content dependencies is equivalent to estimating the scores of alerts and alert patterns that satisfy the following conditions:
1. The score of each alert pattern is correlated to the pattern's anomaly score.
2. The score of each alert pattern depends on the probabilities of its associated alerts being true positives.
3. Similar alerts tend to have similar probabilities of being true positives.
The optimization problem solved by block 310 therefore estimates the confidence of alerts and alert patterns based on their anomaly scores and incorporates the content and temporal structures:
where the first term of the objective function maximizes the correlation between the confidence of alert patterns and their anomaly scores, and the second and third terms provide two regularizations that control similarities between the scores over temporal and content structures. The second term ensures closeness between each alert pattern and its associated alerts, and the third term incorporates the alerts' similarity estimated from content dependency modeling, as the similarity matrix S, to regularize the deviation between alert probabilities. The parameters λ1 and λ2 are tuning parameters that control the degree to which probability vectors are similar. Larger values for the tuning parameters impose a stronger regularization effect on the estimate. The first constraint controls the number of true positive alerts in the solution, with larger values of K indicating more true positives. K is a pre-defined integer that roughly controls the number of alerts with non-zero scores in the constraint. The remaining constraints are added to ensure the non-negativity and normalization of parameters.
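The roles of the three terms can be made concrete with a small sketch. The quadratic form used below is an assumed illustration that matches the verbal description of the terms; it is not necessarily the exact objective of the embodiments, and it omits the constraints on K, non-negativity, and normalization. The function name `objective` is hypothetical.

```python
def objective(u, v, p, F, S, lam1, lam2):
    """Illustrative objective (to be minimized) over alert scores u and pattern scores v."""
    n_alerts, n_patterns = len(u), len(v)
    # Term 1: reward agreement between pattern scores and anomaly scores.
    fit = -sum(v[l] * p[l] for l in range(n_patterns))
    # Term 2: keep each pattern score close to the scores of its alerts.
    temporal = sum(
        F[i][l] * (u[i] - v[l]) ** 2
        for i in range(n_alerts) for l in range(n_patterns)
    )
    # Term 3: similar alerts (large S[i][j]) should have similar scores.
    content = sum(
        S[i][j] * (u[i] - u[j]) ** 2
        for i in range(n_alerts) for j in range(n_alerts)
    )
    return fit + lam1 * temporal + lam2 * content
```

Under this assumed form, a score assignment in which two similar alerts of the same pattern receive very different scores is penalized by both regularization terms, so consistent assignments yield a lower objective value.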
Block 310 solves this optimization problem using, e.g., quadratic programming. The top-k alerts and alert patterns are those having the top-k values for v and u. Block 312 removes any alerts and alert patterns that are not within the top-k.
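The pruning performed by block 312 can be sketched as a simple selection over the solved scores. The function name `prune`, the dictionary representation of scores, and the optional `min_score` threshold are assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


def prune(scores: Dict[str, float], k: int,
          min_score: Optional[float] = None) -> List[Tuple[str, float]]:
    """Keep the k highest-scoring items, optionally dropping low scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = ranked[:k]
    if min_score is not None:
        kept = [(name, s) for name, s in kept if s >= min_score]
    return kept


alert_scores = {"a1": 0.7, "a2": 0.1, "a3": 0.4}
trusted = prune(alert_scores, k=2)
```

The same function can be applied separately to the alert scores u and the pattern scores v, since both are ranked in the same way.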
Embodiments described herein may be entirely hardware, entirely software, or may include both hardware and software elements. In a preferred embodiment, the present invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. A computer-usable or computer readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device. The medium can be magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. The medium may include a computer-readable storage medium such as a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk, etc.
Each computer program may be tangibly stored in a machine-readable storage media or device (e.g., program memory or magnetic disk) readable by a general or special purpose programmable computer, for configuring and controlling operation of a computer when the storage media or device is read by the computer to perform the procedures described herein. The inventive system may also be considered to be embodied in a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner to perform the functions described herein.
A data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution. Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) may be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
Referring now to
A detector module 606 interfaces with the detectors in the enterprise system, collecting alert information from every detector and storing the alert information in the memory 604. The temporal dependency module 608 and the content dependency module 610 process the stored alert information to identify the dependencies between the various heterogeneous alerts so that ranking module 612 can determine which alerts and alert patterns are trustworthy and represent true positives.
Based on the outcome of the ranking module 612, a security module 614 performs manual or automated security actions in response to the ranked alerts and alert patterns. In particular, the security module 614 may have rules and policies that trigger when alerts indicate certain kinds of attacker behavior. Upon such triggers, the security module 614 may automatically trigger security management actions such as, e.g., shutting down devices, stopping or restricting certain types of network communication, raising alerts to system administrators, changing a security policy level, and so forth. The security module 614 may also accept instructions from a human operator to manually trigger certain security actions in view of analysis of the alerts and alert patterns.
Referring now to
A first storage device 722 and a second storage device 724 are operatively coupled to system bus 702 by the I/O adapter 720. The storage devices 722 and 724 can be any of a disk storage device (e.g., a magnetic or optical disk storage device), a solid state magnetic device, and so forth. The storage devices 722 and 724 can be the same type of storage device or different types of storage devices.
A speaker 732 is operatively coupled to system bus 702 by the sound adapter 730. A transceiver 742 is operatively coupled to system bus 702 by network adapter 740. A display device 762 is operatively coupled to system bus 702 by display adapter 760.
A first user input device 752, a second user input device 754, and a third user input device 756 are operatively coupled to system bus 702 by user interface adapter 750. The user input devices 752, 754, and 756 can be any of a keyboard, a mouse, a keypad, an image capture device, a motion sensing device, a microphone, a device incorporating the functionality of at least two of the preceding devices, and so forth. Of course, other types of input devices can also be used, while maintaining the spirit of the present principles. The user input devices 752, 754, and 756 can be the same type of user input device or different types of user input devices. The user input devices 752, 754, and 756 are used to input and output information to and from system 700.
Of course, the processing system 700 may also include other elements (not shown), as readily contemplated by one of skill in the art, as well as omit certain elements. For example, various other input devices and/or output devices can be included in processing system 700, depending upon the particular implementation of the same, as readily understood by one of ordinary skill in the art. For example, various types of wireless and/or wired input and/or output devices can be used. Moreover, additional processors, controllers, memories, and so forth, in various configurations can also be utilized as readily appreciated by one of ordinary skill in the art. These and other variations of the processing system 700 are readily contemplated by one of ordinary skill in the art given the teachings of the present principles provided herein.
The foregoing is to be understood as being in every respect illustrative and exemplary, but not restrictive, and the scope of the invention disclosed herein is not to be determined from the Detailed Description, but rather from the claims as interpreted according to the full breadth permitted by the patent laws. It is to be understood that the embodiments shown and described herein are only illustrative of the principles of the present invention and that those skilled in the art may implement various modifications without departing from the scope and spirit of the invention. Those skilled in the art could implement various other feature combinations without departing from the scope and spirit of the invention. Having thus described aspects of the invention, with the details and particularity required by the patent laws, what is claimed and desired protected by Letters Patent is set forth in the appended claims.
This application is a continuation-in-part of co-pending application Ser. No. 15/098,861, filed on Apr. 14, 2016, which in turn claims priority to provisional application Ser. No. 62/148,232, filed on Apr. 16, 2015, both of which are incorporated herein by reference in their entirety. This application further claims priority to provisional application Ser. No. 62/407,024, filed on Oct. 12, 2016, and 62/411,911, filed on Oct. 24, 2016, both of which are incorporated herein in their entirety.
Number | Date | Country | |
---|---|---|---|
20180034836 A1 | Feb 2018 | US |
Number | Date | Country | |
---|---|---|---|
62148232 | Apr 2015 | US | |
62407024 | Oct 2016 | US | |
62411911 | Oct 2016 | US |
Relation | Number | Date | Country |
---|---|---|---|
Parent | 15098861 | Apr 2016 | US |
Child | 15729030 | | US |