1. Field
This disclosure is generally related to the detection of malicious insiders. More specifically, this disclosure is related to a system that detects malicious insiders by modeling behavior changes and consistencies.
2. Related Art
The detection of malicious insiders plays a very important role in preventing disastrous incidents caused by insiders in a large organization, such as a corporation or a government agency. By detecting anomalous behaviors of an individual, the organization may intervene or prevent the individual from committing a crime that may harm the organization or society at large. For example, an intelligence agency may monitor behaviors of its employees and notice that a particular person may exhibit signs of discontent with respect to certain government policies. Early intervention, such as preventing the person from accessing sensitive information, for example confidential government documents, may prevent the person from leaking the sensitive information to outside parties. The detected anomalies are often presented to an analyst, who will conduct further investigation.
One embodiment of the present invention provides a system for identifying anomalies. During operation, the system obtains work practice data associated with a plurality of users. The work practice data includes a plurality of user events. The system further categorizes the work practice data into a plurality of domains based on types of the user events, models user behaviors within a respective domain based on work practice data associated with the respective domain, and identifies at least one anomalous user based on modeled user behaviors from the multiple domains.
In a variation on this embodiment, the plurality of domains includes one or more of: a logon domain, an email domain, a Hyper Text Transfer Protocol (HTTP) domain, a file domain, and a device domain.
In a variation on this embodiment, modeling the user behaviors within the respective domain involves constructing feature vectors for the plurality of users based on the work practice data associated with the respective domain, and applying a clustering algorithm to the feature vectors, wherein a subset of users are clustered into a first cluster.
In a further variation, the system further calculates an anomaly score associated with a respective user within a second domain based on a probability that the user is clustered into a second cluster into which other users within the subset of users are clustered.
In a variation on this embodiment, modeling the user behaviors within a respective domain further involves modeling changes in the user behaviors within the respective domain by clustering users within the respective domain based on work practice data associated with a time instance.
In a further variation, modeling the changes in the user behaviors further involves calculating a probability of a user transitioning from a first cluster at a time instance to a second cluster at a subsequent time instance.
In a variation on this embodiment, identifying at least one anomalous user involves calculating a weighted sum of anomaly scores associated with the at least one anomalous user from the plurality of domains.
The following description is presented to enable any person skilled in the art to make and use the embodiments, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present invention is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.
Embodiments of the present invention provide a solution for detecting malicious insiders based on large amounts of work practice data. More specifically, the system monitors the users' behaviors and detects two types of anomalous activities: the blend-in anomalies (where malicious insiders try to behave similarly to a group to which they do not belong), and the unusual change anomalies (where malicious insiders exhibit changes in their behaviors that are different from their peers' behavior changes). Users' activities are divided into different domains, and each domain is modeled based on features describing the activities within the domain. During operation, the system observes users' activities and clusters the users into different peer groups based on their activities in each domain. The system detects unusual behavior changes by comparing a user's behavior changes with behavior changes of his peers. The system can also detect peer-group inconsistency of a user by monitoring the user's peer group over time, and across all domains.
Malicious insiders pose significant threats to information security. Employees authorized to access internal information may cause harm to the organization by leaking sensitive information to outside parties or by performing sabotage operations. Detection of anomalous behaviors plays an important role in identifying potentially malicious insiders, making it possible to defuse the potential threat before damage is done. In order to detect the anomalous behaviors, many approaches make use of the readily available work practice data, which can include users' various work-related activities on their company-issued or personal computers, such as logging on/off, accessing websites, sending and receiving emails, accessing external devices or files, etc. Each type of activity may include multiple attributes, which can provide a more detailed description of each activity. For example, the logging-on activity may include attributes such as “the number of after-hours logons” and “the number of logons from a non-user PC;” and the receiving-email activity may include attributes such as “number of recipients” and “number of emails.”
Note that here the term “computers” may be used to refer to various types of computing devices, including but not limited to: a work station, a desktop computer, a laptop computer, a tablet computer, a smartphone, a personal digital assistant (PDA), etc.
The prevalence of the computers, especially the mobile devices, and the diversity of applications running on those computers make the work practice data vast, diverse, and heterogeneous. Data in different categories often exhibits drastically different behaviors, and demands different processing and analysis techniques. Combining data from different categories can be technically challenging. For example, certain models may attempt to concatenate different feature vectors from different categories into a single feature vector. However, such an approach may not work because features from different categories may have different ranges or scales. The lack of proper scaling prevents the model from distinguishing among different types of activities, and limits the model's ability to treat and draw conclusions about different activity types appropriately. In addition, a large number of features can compromise model accuracy due to overfitting or excessive model complexity, and can lead to performance degradation and scalability issues.
To overcome such problems, in some embodiments of the present invention, different types of work practice data (or different types of user activities) are categorized into different domains, with attributes associated with each activity type treated as an independent set of domain features. For example, attributes associated with the logging on/off activities may include number of logons, number of computers with logons, number of after-hours logons, number of logons on a dedicated computer, and number of logons on other employees' dedicated computers, etc. These attributes can be included in a feature set for the logon/logoff domain. Once the attributes are defined for each domain, the anomaly-detection system uses a per-domain modular approach that treats each domain independently.
The modular approach can provide a number of advantages that include, but are not limited to: the per-domain clustering ability, the per-domain learning ability, the per-domain modeling and analysis ability, the adaptability to new data, increased scalability, the ability to fuse information from multiple domains, and the ability to establish a global, cross-domain model.
In some embodiments, the work practice data are divided into six domains, including a logon domain, an HTTP domain, an email-sent domain, an email-received domain, a device domain, and a file domain. The logon domain includes logon and logoff events. The feature set associated with the logon domain may include features such as the number of logons, the number of computers with logon activities, the number of after-hours logons, the number of logons on the user's dedicated computer, the number of logons on other employees' dedicated computers, etc. The HTTP domain includes HTTP (Hypertext Transfer Protocol) access events, such as web browsing or uploading/downloading. The feature set associated with the HTTP domain may include features such as the number of web visits, the number of computers with web visits, the number of uniform resource locators (URLs) visited, the number of after-hours web visits, the number of URLs visited from other employees' dedicated computers, etc. The email-sent domain includes email-sending events. The feature set associated with the email-sent domain may include features such as the number of emails, the number of distinct recipients, the number of internal emails sent, the number of emails sent after hours, the number of emails sent with attachments, the number of emails sent from computers dedicated to other employees, etc. The email-received domain includes email-receiving events. The feature set associated with the email-received domain is similar to the one associated with the email-sent domain. In some embodiments, the email-sent domain and the email-received domain may be combined to form an email domain. The device domain includes events related to usages of removable devices, such as USB drives or removable hard disks. 
The feature set associated with the device domain may include features such as the number of device accesses, the number of computers with device accesses, the number of after-hours device accesses, the number of device accesses on the user's dedicated computer, the number of device accesses on other employees' dedicated computers, etc. The file domain includes file access events, such as creating, copying, moving, modifying, renaming, and deleting of files. The feature set associated with the file domain may include features such as the number of file accesses, the number of computers with file accesses, the number of distinct files, the number of after-hours file accesses, the number of file accesses on the user's dedicated computer, the number of file accesses on other employees' dedicated computers, etc.
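The logon-domain feature set described above can be illustrated with a minimal sketch. The `LogonEvent` record, the `logon_features` function, the business-hour boundaries, and the `dedicated_pc` mapping are all hypothetical names and values introduced for illustration; the disclosure does not prescribe a particular data layout.

```python
from dataclasses import dataclass
from datetime import datetime

# Assumed business hours (illustrative only; not specified in the disclosure).
WORK_START_HOUR = 8
WORK_END_HOUR = 18


@dataclass(frozen=True)
class LogonEvent:
    """Hypothetical record of a single logon event."""
    user: str
    computer: str
    timestamp: datetime


def logon_features(events, user, dedicated_pc):
    """Build the logon-domain feature set for one user.

    dedicated_pc maps each user to that user's dedicated computer.
    """
    own = [e for e in events if e.user == user]
    own_pc = dedicated_pc.get(user)
    other_pcs = {pc for u, pc in dedicated_pc.items() if u != user}
    return {
        "num_logons": len(own),
        "num_computers": len({e.computer for e in own}),
        "num_after_hours": sum(
            1 for e in own
            if not (WORK_START_HOUR <= e.timestamp.hour < WORK_END_HOUR)
        ),
        "num_on_dedicated": sum(1 for e in own if e.computer == own_pc),
        "num_on_others_dedicated": sum(
            1 for e in own if e.computer in other_pcs
        ),
    }
```

Feature sets for the other domains could be built the same way, each from its own event type, so that every domain keeps an independent feature vector.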
Existing anomaly-detection approaches often ignore the inhomogeneity of the work practice data and only focus on statistical outliers. For example, certain techniques define a probability distribution over the work practice data and classify data points with abnormally small probabilities as anomalies or outliers. Sometimes the anomalies are identified separately in each domain, and are combined in an ad-hoc manner (i.e., they are determined manually, rather than learned automatically from the data). For example, users who are outliers in only one domain might be ignored or be flagged as anomalous for having the most extreme anomaly score in such a domain.
While these techniques can be successful in detecting outliers in separate domains, there are limitations. Notably, users who are not outliers in any single domain will never be labeled as outliers by these techniques, even if they are malicious. For example, consider a scenario where a user logs on to multiple machines each day. Such behavior is normal if the user is a system administrator who is supposed to log on to multiple machines each day and send emails about system administration issues; the same behavior is abnormal if the user is a software engineer, whose normal behavior is to log on to a single machine and send emails about software development. However, using the aforementioned techniques, this behavior will never be labeled as anomalous, because such techniques examine the logon domain separately from the email domain and do not treat logging on to multiple machines as an outlier. Similarly, when data in the email domain is examined, no anomaly will be detected. Therefore, a malicious software engineer who logs on to multiple machines daily searching for vulnerable data will remain undetected if each domain is analyzed separately.
To solve such problems, some embodiments of the present invention build a global model for the entire set of available domains, and find outliers in that global model. Note that, as described previously, when establishing the global model, the different domains remain separate at the feature construction (input treatment) stage. It is at the modeling (learning and inference) and scoring (output/decision) stages when the multiple domains are combined. There are two advantages to this modeling strategy. First, the anomaly scores from multiple domains are combined not in an ad-hoc manner, but rather in a data-driven manner. Second, this strategy allows detection of anomalous behaviors that are not by themselves anomalous in any single domain.
Network 102 can generally include any type of wired or wireless communication channel capable of coupling together computing nodes. This includes, but is not limited to, a local area network (LAN), a wide area network (WAN), an enterprise's intranet, a virtual private network (VPN), and/or a combination of networks. In one embodiment of the present invention, network 102 includes the Internet. Network 102 may also include telephone and cellular networks, such as Global System for Mobile Communications (GSM) networks or Long Term Evolution (LTE) networks.
Client machines 104-110 can generally include any nodes on a network with computational capability and a mechanism for communicating across the network. General users, such as users 116 and 118, perform their daily activities on these client machines. The clients can include, but are not limited to: a workstation, a personal computer (PC), a laptop computer, a tablet computer, a smartphone, and/or other electronic computing devices with network connectivity. Furthermore, the client machines may couple to network 102 using wired and/or wireless connections. In one embodiment, each client machine includes a mechanism that is configured to record activities performed by the general users.
Work practice database 112 can generally include any type of system for storing data associated with the electronically recorded activities in non-volatile storage. This includes, but is not limited to, systems based upon magnetic, optical, and magneto-optical storage devices, as well as storage devices based on flash memory and/or battery-backed up memory. In one embodiment, the client machines 104-110 send their recorded work practice data to work practice database 112 via network 102.
Anomaly-detection server 114 includes any computational node having a mechanism for running anomaly-detection algorithms. In addition, anomaly-detection server 114 is able to output a suspect list, which identifies individuals with abnormal behaviors. In some embodiments, anomaly-detection server 114 is capable of outputting a list that ranks all users based on their anomaly scores. A security analyst can view the list and determine which individuals need to be investigated further.
Embodiments of the present invention provide a solution that is capable of detecting malicious insiders based on three types of anomalies: the stand-alone anomalies, the blend-in anomalies, and the anomalies due to temporal inconsistencies.
From
To detect blend-in anomalies, the system needs to analyze work practice data in all domains. However, instead of having a single top-down model that includes all features from all domains, which can result in difficulty of inference due to the large size of the data set, separate models for each domain are built. Each domain is first analyzed separately, and the system then analyzes interdependence among the various domains. In some embodiments, the anomaly-detection system can use a two-stage modeling process. The first stage is to build single-domain models within each individual domain. Note that building a single-domain model can include obtaining the maximum likelihood estimate (MLE) for model parameters in the corresponding domain. In further embodiments, the single-domain models are based on a Gaussian mixture model (GMM), where the maximum a posteriori probability (MAP) values for the cluster to which each user belongs within each domain are obtained. The second stage is to use the single-domain model parameters in a global model as if they were fixed. Note that if the data in each domain is relatively unambiguous (i.e., each single-domain model can be determined with sufficient accuracy), the loss in accuracy is small. In some embodiments, the global cross-domain model is based on the MAP cluster indices. In the end, information from multiple domains is fused to generate an output.
The multi-domain anomaly-detection system detects anomalous users based on the assumption that an anomalous user is the one who exhibits inconsistent behaviors across the multiple domains. In general, a user's activity should reflect the user's job role in any domain, and users with similar job roles should exhibit similar behaviors within each domain. As shown in
Subsequently, the system constructs feature vectors for each domain (operation 308), and clusters users based on the constructed feature vectors within each domain (operation 310). Note that the feature set for each domain includes domain-specific attributes. Given that the users' job roles are unknown to the system, such a clustering provides modeling of those hidden job roles. As discussed previously, users with similar job roles tend to behave similarly, and hence would belong to the same cluster within each domain. In some embodiments, the system applies a k-means clustering technique to the feature vectors. Other clustering techniques are also possible. As described in the previous section, the single-domain model can be based on a Gaussian mixture model (GMM). Note that the advantage of this per-domain learning scheme is to provide a simpler model with lower levels of errors due to variance in learning, thus improving the model's accuracy and reducing the risk of overfitting. The per-domain learning scheme also enhances the model's interpretability. Moreover, treating each activity domain separately provides more flexibility, since a different type of model can be used for different activity domains as appropriate. For example, some models make certain assumptions about correlations of features. Such assumptions can be violated in some, but not all, domains.
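The per-domain clustering step can be sketched as follows. This is a plain k-means written with the standard library only, under the assumptions that feature vectors are lists of numbers, that the same cluster count `k` is used in every domain, and that initial centroids are sampled with a fixed seed; the function names `kmeans` and `cluster_vectors` are illustrative, and an embodiment could equally use a GMM or another clustering technique.

```python
import math
import random


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_vectors(domain_features, k):
    """Cluster each domain independently.

    domain_features: {domain: {user: feature vector}}
    returns: {user: {domain: cluster index}}
    """
    result = {}
    for domain, per_user in domain_features.items():
        users = sorted(per_user)
        labels = kmeans([per_user[u] for u in users], k)
        for user, label in zip(users, labels):
            result.setdefault(user, {})[domain] = label
    return result
```

Because each domain is clustered on its own features, differences in range or scale between domains never enter a single distance computation.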
Once per-domain clustering is achieved, the system calculates a predictability of a certain user in a certain domain to detect the multi-domain inconsistency (operation 312). The maximum a posteriori probability (MAP) cluster indices from the single-domain models for each user u form a cluster vector cu, where cu
When detecting the multi-domain inconsistency, the system may establish various models to measure the predictability of a cluster index in a target domain. In some embodiments, three different models, a discrete model, a hybrid model, and a continuous model, can be used to measure the predictability. The difference among these three models lies in the granularity of the cluster information used as features for learning and evaluation.
For example, the discrete model uses discrete features and provides a discrete evaluation outcome. More specifically, the discrete model uses cluster labels (indices) from the observed domains as features for learning, and predicts cluster labels to evaluate user predictability. The predictability is measured as the Hamming distance between the prediction and the observation (i.e., 0 if the prediction is correct, and 1 otherwise). The hybrid model also uses cluster labels from the observed domains as features for learning, and predicts cluster labels to evaluate user predictability. However, unlike the discrete model, in the hybrid model the evaluation is not based just on whether or not the true cluster is predicted, but instead on how well the true cluster is predicted. This is, in essence, a density-estimation problem. The predictability is measured as 1 minus the likelihood of observing the true cluster index given the cluster indices of the user's peers. In other words, the hybrid model uses discrete features and provides a continuous evaluation outcome. On the other hand, the continuous model uses continuous features and provides a continuous evaluation outcome. More specifically, the continuous model uses a vector of cluster probabilities as features, and also predicts the cluster probability vector for the target domain.
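The discrete and hybrid measures can be illustrated with a minimal sketch. The predictor used here is an assumption made for illustration, not the learner of the disclosure: a user's peers are taken to be the other users whose cluster labels match in every observed domain (falling back to all other users if there are none), and the predicted distribution for the target domain is the empirical distribution of the peers' labels. The function names are hypothetical.

```python
from collections import Counter


def peer_label_counts(cluster_vectors, user, target):
    """Count target-domain labels among the user's peers.

    cluster_vectors: {user: tuple of cluster indices, one per domain}
    target: position of the target domain within each tuple
    """
    own = cluster_vectors[user]
    observed = [i for i in range(len(own)) if i != target]
    others = [c for u, c in cluster_vectors.items() if u != user]
    peers = [c for c in others if all(c[i] == own[i] for i in observed)]
    if not peers:  # assumed fallback: use the whole remaining population
        peers = others
    return Counter(c[target] for c in peers)


def discrete_score(cluster_vectors, user, target):
    """Hamming-style score: 0 if the predicted label is correct, else 1."""
    counts = peer_label_counts(cluster_vectors, user, target)
    predicted = counts.most_common(1)[0][0]
    return 0 if predicted == cluster_vectors[user][target] else 1


def hybrid_score(cluster_vectors, user, target):
    """1 minus the estimated likelihood of the user's true label."""
    counts = peer_label_counts(cluster_vectors, user, target)
    true_label = cluster_vectors[user][target]
    return 1.0 - counts[true_label] / sum(counts.values())
```

Under both measures a higher score means the user's cluster in the target domain is harder to predict from the other domains, which is the signature of a blend-in anomaly.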
Returning to
In addition to the blend-in anomalies that can be detected using the aforementioned multi-domain cross-validation technique, it is also desirable to detect anomalies that exhibit temporal inconsistency. Note that while a particular behavior may not be suspicious, a change in behavior that is rare can be. Conventional anomaly-detection approaches often rely on detecting temporal anomalies that correspond to a sudden change in a user's behavior when compared to his past behavior. For example, if a user suddenly starts to work a lot after hours, he may be labeled as an anomaly by the conventional approach. However, such a behavior change may be normal if the user is facing a deadline or takes up a new responsibility. Hence, conventional approaches that analyze users independently can have a high false positive rate, which can increase investigation costs and distract attention from actual malicious insiders.
To avoid mistakenly flagging users who change their behavior in a non-malicious manner, in some embodiments, the system models the activity changes of similar subsets of the population (e.g., users with similar job roles), and evaluates how well a particular user conforms to change patterns that are most likely to occur within the user's subpopulation. In other words, to decide whether a user is suspicious, the system compares each user's activity changes to activity changes of his peer group.
The problem of detecting temporal inconsistency can be defined as follows. An anomalous user is one who exhibits changes in behavior that are unusual compared to his peers. The intuition is that user activity should reflect the user's job role in any domain, and users with similar job roles should exhibit similar behavior changes within each domain, over time. Although peers are not expected to exhibit similar changes in behavior at exactly the same time, they are expected to do so over longer time intervals. In some embodiments, the model considers that peers are expected to experience similar changes; however, those changes do not necessarily have to take place at the same time.
Similar to the approach that detects blend-in anomalies, here users are also clustered based on their activities, such that the cluster that a user is assigned to indicates the type of behavior this user exhibits. In addition, a change in user behavior is indicated by a change in the cluster that this user gets assigned to. Over a relatively long period of time, peers are expected to transition among the same subset of clusters. For example, engineers may be seen to transition between clusters 2 and 4 in the logon domain, and among clusters 3, 4 and 5 in the email domain. So an engineer who transitions between clusters 2 and 5 in the logon domain is considered suspicious. The less likely this transition is among the engineer's peers, the more suspicious it is.
To build a temporal model, some embodiments of the present invention use a day as the time unit, and the work practice data (which includes a large number of event records) are binned into (user, day) records. For each (user, day) pair, the system can construct a feature vector for each domain using domain-specific attributes.
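The binning step can be sketched as follows, assuming each event is a `(user, timestamp, domain)` tuple; that layout and the function name `bin_by_user_day` are illustrative. The sketch only counts events per domain within each bin, whereas a full embodiment would compute the complete domain-specific feature vector for every (user, day) pair.

```python
from collections import Counter, defaultdict
from datetime import datetime


def bin_by_user_day(events):
    """Group events into (user, day) bins and count events per domain.

    events: iterable of (user, timestamp, domain) tuples
    returns: {(user, date): Counter mapping domain -> event count}
    """
    bins = defaultdict(Counter)
    for user, timestamp, domain in events:
        bins[(user, timestamp.date())][domain] += 1
    return dict(bins)
```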
Subsequently, the system clusters the users based on the constructed feature vectors (operation 610). Note that unlike the previous approach where the clustering is performed on features over the entire time span, here the clustering is performed on the users' daily behavior features. Moreover, the system constructs a transition probability matrix $Q_d$ for each domain $d$ (operation 612). In some embodiments, the system computes $Q_d$ by computing the transition probability $q_d(c_k, c_m)$ between each possible cluster pair $(c_k, c_m)$, counting the number of such changes aggregated over all users and all time instances.
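Constructing the transition probability matrix for one domain can be sketched as below. The input layout (one list of daily cluster indices per user) and the function name are assumptions; normalizing each row of counts so that it sums to 1 is also an assumption, since the text states only that transitions are counted over all users and time instances.

```python
def transition_matrix(sequences, num_clusters):
    """Estimate the transition probabilities for a single domain.

    sequences: {user: list of daily cluster indices, in time order}
    returns: matrix[k][m] = estimated probability of moving from
             cluster k on one day to cluster m on the next day
    """
    counts = [[0] * num_clusters for _ in range(num_clusters)]
    for seq in sequences.values():
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A cluster that is never left gets an all-zero row.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix
```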
The system then models users' behavior changes and detects temporal anomalies in each domain by calculating a transition score (operation 614). Note that the behavior changes are modeled within each domain separately. For each domain, the system determines the cluster to which a user belongs each day, and then computes the likelihood of transitions between clusters from one day to the next. For example, the system may determine that a user belongs to cluster 1 on a particular day, and that the same user has a 20% chance to move to cluster 2 the next day. In some embodiments, the system applies a Markov model to model the user's behavior change. More specifically, the system models the user behavior over time as a Markov sequence, where a user belongs to one cluster (or state) each day, transitioning between clusters (or states) on a daily basis. The system detects unusual changes based on rare transitions given the total likelihood of transitions. For each user, the total likelihood of all transitions made by the user over the entire time span can be computed using $Q_d$, and the transition score $s_d^u$ for each user $u$ within domain $d$ can be calculated by estimating the user's total transition likelihood. In some embodiments, $s_d^u$ can be calculated as $s_d^u = p_d(c_0) \prod_{t=1}^{n-1} q_d(c_t^u, c_{t+1}^u)$, where $p_d(c_0)$ is the prior probability of being in state $c_0$, which is the start state for user $u$. Note that users are ranked based on their transition scores; the lower the transition score, the higher the anomaly ranking. Hence, a user with the rarest transitions compared with her peers would be the most suspicious. In some embodiments, the system penalizes a user for the least likely transition and computes the anomaly score using that rarest transition. Here, $s_d^u$ can be calculated as $s_d^u = \min_t q_d(c_t^u, c_{t+1}^u)$.
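Both transition scores can be sketched as follows. The sketch assumes the user's daily cluster indices are given as a list whose first element is the start state, that the transition matrix and the prior over start states are already estimated, and that the product runs over every consecutive pair of days in the list; the function names are illustrative.

```python
def product_score(seq, matrix, prior):
    """Total transition likelihood: prior of the start state times the
    probability of every day-to-day transition in the sequence."""
    score = prior[seq[0]]
    for current, following in zip(seq, seq[1:]):
        score *= matrix[current][following]
    return score


def min_score(seq, matrix):
    """Probability of the single least likely transition.

    Returns 1.0 for a sequence with no transitions (an assumed convention).
    """
    return min(
        (matrix[a][b] for a, b in zip(seq, seq[1:])),
        default=1.0,
    )


def rank_by_score(scores):
    """Order users from most to least suspicious (lowest score first)."""
    return sorted(scores, key=lambda user: scores[user])
```

For long time spans the product of many probabilities can underflow, so an implementation might sum log-probabilities instead; the ranking of users would be unchanged.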
Once anomaly scores for the same set of users within each domain are obtained, the system can combine this information from the different domains to generate a final score for each user (operation 616). In some embodiments, the final score is computed based on a user's worst rank (i.e., the smallest transition score) from all the domains: $s_{\text{final}}^u = \min_d(s_d^u)$. The final ranking for each user thus reflects the highest suspicious indicator score across all the domains.
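Taking the worst per-domain score can be sketched in a few lines. The nested-dictionary layout and the function names are assumptions, and the sketch requires at least one domain.

```python
def final_scores(per_domain_scores):
    """Keep each user's smallest transition score across domains.

    per_domain_scores: {domain: {user: transition score}}
    returns: {user: minimum score over all domains}
    """
    tables = list(per_domain_scores.values())
    # Only users scored in every domain are kept.
    users = set.intersection(*(set(table) for table in tables))
    return {u: min(table[u] for table in tables) for u in users}


def rank_most_suspicious_first(scores):
    """Lower final score means a higher anomaly ranking."""
    return sorted(scores, key=lambda user: scores[user])
```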
Data-input layer 702 handles receiving the work practice data set for a population. In some embodiments, the data may be received from the company, which has recorded work practice data of its employees, as a data package. In some embodiments, data-input layer 702 may directly couple to a server that is configured to record work practice data in real time.
Single-domain modeling layer 704 includes a number of independent branches, depending on the number of domains being analyzed. In
Global modeling layer 706 performs multi-domain cross-validation to identify blend-in anomalies. In some embodiments, for each domain, global modeling layer 706 may use cluster labels from all but one domain as features for learning, and evaluate the predictability of the target domain. In addition, the evaluated results from all domains are combined to generate a combined result. In addition to multi-domain cross-validation, global modeling layer 706 also detects temporal inconsistency among users. Note that to establish a temporal model, the data going from data-input layer 702 to single-domain modeling layer 704 should also be sorted based on timestamps. Depending on the granularity, data within a time unit, such as a day, a week, or a month, can be placed into the same bin. The subsequent feature-extraction and clustering operations in single-domain modeling layer 704 should be performed for each bin in turn. Global modeling layer 706 then models users' behavior changes over time based on how a user transitions between clusters from one day to the next. Users with the rarest transitions are often identified as anomalies. Based on the multi-domain cross-validation result and the temporal inconsistency detection result, global modeling layer 706 can output a suspect list that may include all different types of anomalies, including but not limited to: the statistical outliers, the blend-in anomalies, and the anomalies due to temporal inconsistency.
Note that by allowing per-domain feature extraction and clustering, embodiments of the present invention allow per-domain analysis, thus enabling more sophisticated reasoning and concrete conclusions by providing a detailed explanation about why and how each malicious activity is detected. This provides benefits that go beyond merely detecting malicious activities. Moreover, the per-domain analysis facilitates per-domain evaluation, including which activity domain can detect what types of malicious activity, and at what level of accuracy and fault rate, etc. In addition, the per-domain modeling also provides adaptability to various data types. When dealing with massive amounts of data, it is typical to keep receiving more data, and these additional data may include new activity domains, or new features within an existing domain. The per-domain modularity allows the system to adapt to and include new data in the analysis without necessarily having to repeat every step (of data treatment, learning, modeling and analysis) on the entire available dataset. In other words, new data can be considered after running previous models, and the results can be integrated without necessarily having to rerun all models on all previously existing domain data. The per-domain modularity also makes it possible to process data, learn and apply models, and run the analysis, on a separate machine for each domain, thereby addressing scalability issues and boosting machine performance. When combining results from the multiple domains or sources, the system weights each domain output differently. The weighting can be based on the relevance and/or utility of each domain to the problem, and based on the quality of data available for each domain. Moreover, domains can be disregarded if strong correlation with other domains is observed.
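The weighted combination of per-domain outputs can be sketched as below. The weights, the convention that a higher score means more anomalous, and the normalization by total weight are assumptions for illustration; the text does not fix how weights are chosen beyond relevance, utility, and data quality. A domain can be disregarded by giving it a weight of zero.

```python
def weighted_anomaly_scores(domain_scores, weights):
    """Combine per-domain anomaly scores into one score per user.

    domain_scores: {domain: {user: anomaly score}}
    weights: {domain: non-negative weight}, with a positive total
    returns: {user: weighted average of that user's domain scores}
    """
    total_weight = sum(weights[d] for d in domain_scores)
    combined = {}
    for domain, scores in domain_scores.items():
        for user, score in scores.items():
            combined[user] = combined.get(user, 0.0) + weights[domain] * score
    return {user: s / total_weight for user, s in combined.items()}
```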
The data structures and code described in this detailed description are typically stored on a computer-readable storage medium, which may be any device or medium that can store code and/or data for use by a computer system. The computer-readable storage medium includes, but is not limited to, volatile memory, non-volatile memory, magnetic and optical storage devices such as disk drives, magnetic tape, CDs (compact discs), DVDs (digital versatile discs or digital video discs), or other media capable of storing computer-readable media now known or later developed.
The methods and processes described in the detailed description section can be embodied as code and/or data, which can be stored in a computer-readable storage medium as described above. When a computer system reads and executes the code and/or data stored on the computer-readable storage medium, the computer system performs the methods and processes embodied as data structures and code and stored within the computer-readable storage medium.
Furthermore, methods and processes described herein can be included in hardware modules or apparatus. These modules or apparatus may include, but are not limited to, an application-specific integrated circuit (ASIC) chip, a field-programmable gate array (FPGA), a dedicated or shared processor that executes a particular software module or a piece of code at a particular time, and/or other programmable-logic devices now known or later developed. When the hardware modules or apparatus are activated, they perform the methods and processes included within them.
The foregoing descriptions of various embodiments have been presented only for purposes of illustration and description. They are not intended to be exhaustive or to limit the present invention to the forms disclosed. Accordingly, many modifications and variations will be apparent to practitioners skilled in the art. Additionally, the above disclosure is not intended to limit the present invention.
This invention was made with U.S. government support under Contract No. W911NF-11-C-0216 (3729) awarded by the Army Research Office. The U.S. government has certain rights in this invention.