The present disclosure relates generally to computer security and, more particularly, to techniques for detecting intrusions in a computing environment.
Malicious code can be classified as a virus, worm, Trojan horse, etc. Regardless of the function each type of malicious code performs, it follows certain patterns of behavior that should be considered abnormal in a system. For example, a typical worm scans for ports. It may also send out numerous emails in a short period of time.
Because many attacks arrive through the network, much work has been done on inspecting network traffic, such as detecting port scans and examining the contents of packets. This approach, however, cannot detect a worm or virus loaded with third-party software before it tries to propagate itself through the network.
Since all system activities are recorded in system log files, many researchers perform intrusion detection by auditing the system log files. However, the delay between the emergence of an intrusion and its detection through auditing of log files can be undesirable. Since system activities can be modeled as statistical processes, approaches based on statistical methods and machine learning methods have been explored. The drawback of statistical methods is their computational complexity. This may not be critical on desktop systems. In embedded systems, however, resources can be scarce and complexity can be a major issue. In this disclosure, an intrusion detection system is proposed that aims to solve the complexity problem without sacrificing effectiveness.
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
A method is provided for detecting intrusions to a computing environment. The method includes: monitoring service requests in the computing environment over a defined period of time; constructing a vector which represents the occurrence of different system calls during the defined time period; and comparing the vector to a plurality of stored vectors, where each of the stored vectors represents system calls made in a potential intrusion.
If a potential intrusion is detected at this stage, then a more complicated evaluation may be performed by a second detection scheme. For instance, the second detection scheme may assess the temporal sequence in which the system calls were made and/or the system files accessed by the system calls.
Further areas of applicability will become apparent from the description provided herein. It should be understood that the description and specific examples are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
The drawings described herein are for illustration purposes only and are not intended to limit the scope of the present disclosure in any way.
A system call is the mechanism used by an application program to request service from the operating system. System calls often use a special machine code instruction which causes the processor to change mode (e.g. to “supervisor mode” or “protected mode”). This allows the operating system to perform restricted actions such as accessing hardware devices or the memory management unit. System calls can be used to detect malicious attacks in a computing environment. However, an individual system call does not provide sufficient information. Therefore, the first stage detector examines a collection of system calls which are made within a defined period of time (e.g., 1 millisecond).
In operation, the first stage detector 12 monitors in real-time the system calls made in the computing environment. Most operating systems provide some type of system call interface. For example, in Linux, the system call dispatcher Calls.S may be used by the detector 12 to monitor system calls. In Linux, if the intrusion detection system is implemented as a Linux Security Module, the Security Module places hooks in the system call interface which can be used to monitor system calls. It is understood that this is an implementation detail and that various techniques may be used to monitor system calls in a given computing environment.
The first stage detector 12 constructs a vector which represents the occurrence of different system calls made during a defined time period.
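As a minimal sketch of this step, the following Python builds an occurrence vector for one time window. It assumes each element of the vector is a single bit recording whether a given system call occurred at least once; the width `NUM_SYSCALLS`, the function name, and the sample call numbers are hypothetical and chosen only for illustration.

```python
from typing import Iterable

# Hypothetical vector width: one bit per tracked system call number.
NUM_SYSCALLS = 256


def build_occurrence_vector(syscall_numbers: Iterable[int]) -> int:
    """Return a bitmask with bit n set if system call n occurred in the window."""
    vector = 0
    for number in syscall_numbers:
        # Calls outside the tracked range are ignored in this sketch.
        if 0 <= number < NUM_SYSCALLS:
            vector |= 1 << number
    return vector


# System calls observed in one made-up time window; repeats set the same bit.
window = [3, 4, 4, 55, 184, 3]
vector = build_occurrence_vector(window)
observed = [n for n in range(NUM_SYSCALLS) if (vector >> n) & 1]
print(observed)  # [3, 4, 55, 184]
```

Packing the vector into one integer keeps the later comparison to a single equality test, which may suit resource-constrained systems.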
Upon reaching the end of the defined time period, the first stage detector 12 proceeds to compare the constructed vector to a plurality of vectors residing in a first data store 14. Each vector in the first data store 14 is formulated in the same manner as described above and represents system calls made during a known malicious intrusion. In the exemplary embodiment, a binary comparison is performed between the constructed vector and the vectors stored in the first data store. Although the comparison is preferably made in real-time, broader aspects of this disclosure envision comparing the constructed vector at some later time.
In addition, the first stage detector 12 continues to monitor in real-time the system calls made in the computing environment. For each subsequent time period, the first stage detector 12 builds another vector and compares the vector to the vectors residing in the first data store in the manner described above. In this way, the intrusion detection system is continually monitoring the computing environment for suspicious intrusions.
Various techniques may be used to improve the comparison process. For example, vectors in the first data store can be pre-sorted so that vectors indicative of more frequently occurring intrusions are sorted to the top of the data store. Once a match is found between the constructed vector and one of the stored vectors, first stage comparison is terminated and processing moves to the second stage.
In another example, the format for the vector may be defined so that system calls which more frequently occur in known intrusions are positioned in the more significant bits of the array. For instance, element one may correlate to system call 55 and element two may correlate to system call 184, where these two system calls are made most often in a malicious intrusion. Once a mismatch is found between the constructed vector and one of the stored vectors, the comparison process can move on to the next vector stored in the data store.
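The two optimizations just described, a pre-sorted data store and an element order that exposes mismatches early, could be sketched as follows. The vectors are represented here as lists of 0/1 elements, and all names and signature values are made up for illustration; this is one possible arrangement, not the disclosed implementation.

```python
from typing import List, Optional, Sequence

Vector = Sequence[int]  # one 0/1 element per tracked system call


def vectors_equal(a: Vector, b: Vector) -> bool:
    """Element-wise binary comparison that stops at the first mismatch."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False  # mismatch: caller moves on to the next stored vector
    return True


def first_stage_match(constructed: Vector, store: List[Vector]) -> Optional[int]:
    """Return the index of the first matching stored vector, else None.

    The store is assumed to be pre-sorted so that signatures of the most
    frequently occurring intrusions come first.
    """
    for index, signature in enumerate(store):
        if vectors_equal(constructed, signature):
            return index  # stop searching; the second stage would take over
    return None


# Made-up signatures; element 0 stands for the call seen most often in intrusions.
store = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
]
print(first_stage_match([1, 0, 1, 0], store))  # 1
print(first_stage_match([0, 0, 0, 1], store))  # None
```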
In yet another example, simplified regular expression matching can be employed to perform the necessary vector matching. A regular expression, represented as a string or a set of binary tokens, can be used by the monitor to detect an intrusion. An expression provides a concise description of one or more intrusion patterns without the need to scan for each pattern separately.
To construct the regular expression, the formalism may provide operations for grouping, quantification, and alternation, which can be combined to form complex expressions that describe the intrusion patterns. In addition, the regular expression syntax offers a set of special tokens to describe vectors or groups of vectors. For example, the vocabulary and syntax of the string-based regular expression could be based on the traditional Unix regular expression syntax, where the syntax might include, but is not limited to:
[^\P1]+\i*\W0
where [^\P1]+ describes all processes that do not have ID 1 (ID 1 could denote the password management application); \i* skips irrelevant vectors, if any; and \W0 defines the write access vector to the file with ID 0 (in this example, file ID 0 is the password file).
The comparison process can be implemented using state machines by compiling regular expressions into binary representations. The vectors are used as input to the state machine for it to advance to different states. Once it arrives at a state that indicates a possible intrusion, further processing is performed by the second stage detector. The advantage of this approach is that only one state per process needs to be stored. Additionally, it is not necessary to store vector information since vectors are encoded into the state machines.
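To illustrate, the example expression [^\P1]+\i*\W0 can be hand-compiled into a small state machine. The sketch below is an assumption-laden illustration: the token encoding (`("P", pid)`, `("i", None)`, `("W", file_id)`), the state names, and the sample inputs are all hypothetical, and a real implementation would generate the transition table from the expression rather than write it by hand.

```python
from enum import Enum
from typing import Iterable, Optional, Tuple

# Hypothetical token encoding: ("P", pid), ("i", None), ("W", file_id).
Token = Tuple[str, Optional[int]]


class State(Enum):
    START = 0
    FOREIGN_PROCESS = 1  # matched [^\P1]+ (a process other than ID 1)
    SKIPPING = 2         # inside \i* (irrelevant vectors)
    ALERT = 3            # matched \W0 (write to file ID 0)


def step(state: State, token: Token) -> State:
    """Advance the machine by one token; only the current state is kept."""
    kind, ident = token
    if state is State.ALERT:
        return state
    if kind == "P" and ident != 1:
        return State.FOREIGN_PROCESS  # start or continue a candidate match
    if state in (State.FOREIGN_PROCESS, State.SKIPPING):
        if kind == "i":
            return State.SKIPPING
        if kind == "W" and ident == 0:
            return State.ALERT
    return State.START  # anything else abandons the candidate match


def detects_intrusion(tokens: Iterable[Token]) -> bool:
    state = State.START
    for token in tokens:
        state = step(state, token)
        if state is State.ALERT:
            return True  # the second stage detector would be invoked here
    return False


print(detects_intrusion([("P", 7), ("i", None), ("W", 0)]))  # True
print(detects_intrusion([("P", 1), ("W", 0)]))               # False
```

Because the pattern is encoded in `step`, the detector keeps only the current `State` value between vectors rather than the vectors themselves.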
To further increase performance, a simple hash algorithm can be applied to the vectors being compared. If two vectors are equal, then the hash values for the vectors are also equal. Accordingly, a hash algorithm can be applied to the constructed vector, and the same hash algorithm can be applied to the vectors in the first data store so that their hash values are stored therein. In this case, the first stage detector performs a binary comparison of hash values. Other techniques for improving the comparison process also fall within the scope of this disclosure.
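A rough sketch of the hash-based comparison follows. CRC-32 from Python's standard `zlib` module stands in for the "simple hash algorithm"; that choice, the packed-bytes vector representation, and the sample signatures are assumptions made for illustration. Since unequal vectors can occasionally share a hash value, a hit is treated here only as a trigger for further evaluation.

```python
import zlib
from typing import Iterable, Set


def vector_hash(vector: bytes) -> int:
    """Hash a packed vector; CRC-32 is an illustrative stand-in."""
    return zlib.crc32(vector)


def build_hash_store(signatures: Iterable[bytes]) -> Set[int]:
    """Precompute the hash of every known-intrusion vector once, offline."""
    return {vector_hash(signature) for signature in signatures}


def first_stage_hit(constructed: bytes, hash_store: Set[int]) -> bool:
    """Compare hash values instead of full vectors."""
    return vector_hash(constructed) in hash_store


# Made-up packed signatures, two bytes (sixteen tracked system calls) each.
signatures = [bytes([0b10100000, 0b00000001]), bytes([0b01100000, 0b00010000])]
hash_store = build_hash_store(signatures)

print(first_stage_hit(bytes([0b10100000, 0b00000001]), hash_store))  # True
```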
In an alternative approach,
In operation, the first stage detector 12 may construct the second type of vector as it monitors in real-time the system calls made in the computing environment. When the first stage detector finds a match for the first type of vector, it invokes the second stage detector to further evaluate the second type of vector. If the first stage detector does not find a match for the first type of vector, the computational cost associated with the second stage detection scheme is avoided.
When invoked, the second stage detector 12 compares the second type of constructed vector to a plurality of the vectors residing in a second data store 18. Each vector in the second data store 18 is formulated in the same manner as the second type of vector and represents the temporal sequence in which system calls are made and what files are accessed by each system call during a known malicious intrusion. Although the comparison is preferably made in real-time, broader aspects of this disclosure envision comparing the constructed vector at some later time.
In an exemplary embodiment, the second stage detector 12 may employ a maximum entropy classifier to evaluate the second type of vector. A maximum entropy classifier models what is known while assuming nothing about what is unknown. The principle of the maximum entropy classifier is to find the most uniformly distributed model that conforms to the known constraints. Unlike a Bayesian classifier, the maximum entropy classifier does not require the features to be completely independent.
Given a set of training samples T={(x1, y1), (x2, y2), . . . , (xN, yN)}, where xi is a real-valued feature vector and yi is the corresponding target, the maximum entropy principle states that the data T should be summarized with a model that is maximally noncommittal with respect to missing information. Among distributions consistent with the constraints imposed by T, there exists a unique model with highest entropy in the domain of exponential models of the form:
$$P_\Lambda(y \mid x) = \frac{1}{Z_\Lambda(x)} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(x, y)\right)$$
where Λ={λ1, λ2, . . . , λn} are parameters of the model, fi(x,y)'s are arbitrary feature functions of the model, and
$$Z_\Lambda(x) = \sum_{y} \exp\left(\sum_{i=1}^{n} \lambda_i f_i(x, y)\right)$$
is the normalization factor to ensure PΛ(y|x) is a probability distribution. The target of the classifier is to find the model that maximizes the conditional entropy:
$$H(P_\Lambda) = -\sum_{x, y} \tilde{p}(x)\, P_\Lambda(y \mid x) \log P_\Lambda(y \mid x)$$
where \tilde{p}(x) denotes the empirical distribution of x in T.
In this application, the second type of constructed vector serves as the feature vector for the classifier. The classifier is designed to output a probability that the vector is indicative of a malicious intrusion. When the output probability exceeds some predetermined threshold, further actions may be invoked to identify the particular type of intrusion or otherwise address the intrusion.
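A minimal sketch of how such a classifier might score a feature vector is given below, using the standard conditional exponential form in which P(y|x) is proportional to exp(Σ λi fi(x, y)). The labels, feature functions, weights, and the threshold of 0.9 are all hypothetical values for illustration; in practice the weights would be learned from training data.

```python
import math
from typing import Callable, Dict, Sequence

FeatureFn = Callable[[Sequence[float], str], float]

LABELS = ("benign", "intrusion")  # hypothetical target labels
THRESHOLD = 0.9                   # hypothetical decision threshold


def conditional_probabilities(
    x: Sequence[float],
    weights: Sequence[float],
    features: Sequence[FeatureFn],
) -> Dict[str, float]:
    """Return P(y|x) for each label under the exponential model."""
    scores = {
        y: math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))
        for y in LABELS
    }
    z = sum(scores.values())  # normalization factor Z(x)
    return {y: score / z for y, score in scores.items()}


# Made-up feature functions: each fires only for the "intrusion" label.
features = [
    lambda x, y: x[0] if y == "intrusion" else 0.0,
    lambda x, y: x[1] if y == "intrusion" else 0.0,
]
weights = [2.0, 1.5]  # made-up parameters (lambda_1, lambda_2)

probs = conditional_probabilities([1.0, 1.0], weights, features)
print(probs["intrusion"] > THRESHOLD)  # True
```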
N-grams have proved to be an effective feature extraction tool in high dimensionality feature spaces. An n-gram is a sub-sequence of n items from a given sequence. By converting a sequence of items to a set of n-grams, the sequence can be embedded in a vector space, thereby allowing it to be compared to other sequences in an efficient manner. In an exemplary embodiment, an n-gram sequence may be derived from the second type of constructed vector. For example, a tri-gram formed from the vector in
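As a rough illustration of n-gram extraction in general, the following Python derives tri-grams from a made-up sequence of system call numbers and counts them, so that the counts can act as a sparse feature vector. The function name and the sample sequence are assumptions and are not taken from the figures of this disclosure.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(sequence: Sequence[int], n: int) -> List[Tuple[int, ...]]:
    """Return every contiguous sub-sequence of length n, in order."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


# Made-up temporal sequence of system call numbers.
calls = [5, 3, 4, 3, 4, 6]

trigrams = ngrams(calls, 3)
print(trigrams)  # [(5, 3, 4), (3, 4, 3), (4, 3, 4), (3, 4, 6)]

# Counting n-grams gives a sparse vector keyed by n-gram.
counts = Counter(trigrams)
print(counts[(3, 4, 3)])  # 1
```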
The above description is merely exemplary in nature and is not intended to limit the present disclosure, application, or uses. For instance, it is envisioned that either the first stage detection scheme or the second stage detection scheme may be employed independently of the other stage as a basis for detecting intrusions.