This invention relates generally to security analytics in computer networks, and, more specifically, to classifying user accounts as human accounts or service accounts based on keys from an identity management system.
For user behavior modeling in IT network security analytics, it is critical to leverage contextual information to improve alert accuracy. For example, contextual information can be used to construct and evaluate context-specific rules. Some contextual information is factual, and some is derived statistically. An example of factual contextual information is the geolocation from which a current VPN event comes. An example of statistically-derived contextual information is a conclusion that an asset (e.g., a laptop or phone) is likely to belong to an executive based on historical data.
Whether an account is a human user account or a service account is useful contextual information in network security analytics. For example, if during a login session, an account is behaving as a service account, but it is known that it is a human user account, the login session may be a good candidate for an alert.
An identity management system (e.g., Open LDAP, Active Directory) maintains a directory of all accounts on an enterprise network. Each account is described by a collection of key-value pairs. “Keys” are akin to fields, but are dynamic in that some can be specified by the enterprise. The types of keys used to describe an account are not always consistent across departments and certainly not across enterprises.
Currently, classifying an account as a human user account or a service account is done manually and requires significant human effort. An analyst reads the organizational unit key from an identity management system and decides whether the key value pertains to a service account. This environment-specific effort is laborious and at best finds a subset of service accounts, potentially leaving other service accounts undiscovered. Furthermore, the process needs to be repeated as new accounts are added to the network. It would be desirable to leverage the manually-found service accounts to construct an automated classifier that probabilistically infers, using a textual readout of keys from an identity management system, the status of new accounts or existing, unclassified accounts.
The present disclosure describes a system, method, and computer program for automatically classifying user accounts within an entity's computer network, using machine-learning-based modeling and keys from an identity management system. The method is performed by a computer system (the “system”).
Using machine-learning-based modeling, the system creates a statistical model that maps individual keys or sets of keys to a probability of being associated with a first type of user account. The model is trained using a set of inputs and a target variable. The inputs are keys from an identity management data structure associated with user accounts manually classified as the first type of user account or a second type of user account, and the target variable is whether the user account is the first type of user account.
Once the statistical model is created, the system uses the model to automatically determine whether an unclassified user account is the first type of user account. To classify an unclassified user account, the system identifies identity management keys associated with the unclassified user account. The system then creates an N-dimensional vector of the keys, wherein N is the number of the keys associated with the unclassified user account.
The system inputs the N-dimensional vector into the statistical model to calculate a probability that the unclassified user account is the first type of user account. In response to the probability exceeding a first threshold, the system classifies the unclassified user account as the first type of user account.
In certain embodiments, there is one threshold (i.e., the first threshold). If the probability is below the first threshold, the account is classified as the second type of account.
In certain embodiments, there are two thresholds. If the probability is below a lower, second threshold, the account is classified as the second type of account. If the probability is between the first and second thresholds, the system concludes that the classification of the user account is undetermined.
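The one-threshold and two-threshold decisions described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `classify_probability`, the label strings, and the handling of a probability exactly equal to a threshold are assumptions made for the example.

```python
def classify_probability(probability, first_threshold, second_threshold=None):
    """Map a model probability to 'first', 'second', or 'undetermined'.

    first_threshold  -- upper threshold (hypothetical parameter name)
    second_threshold -- optional lower threshold; None selects the
                        single-threshold embodiment
    """
    if second_threshold is None:
        # Single threshold: anything not exceeding it is the second type.
        # Treating an exact tie as the second type is an assumption.
        return "first" if probability > first_threshold else "second"
    if second_threshold > first_threshold:
        raise ValueError("second (lower) threshold must not exceed the first")
    if probability > first_threshold:
        return "first"
    if probability < second_threshold:
        return "second"
    # Between the two thresholds the classification is left undetermined.
    return "undetermined"


one = classify_probability(0.9, 0.8)
two = classify_probability(0.5, 0.8, 0.2)
```

With two thresholds, accounts in the undetermined band could be routed to an analyst rather than forced into either class.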
In certain embodiments, the first type of user account is a service user account and the second type of user account is a human user account. In certain embodiments, the first type of user account is a human user account and the second type of account is a service user account.
In certain embodiments, the statistical model is constructed using Bernoulli Naïve Bayes modeling.
In certain embodiments, the keys for the unclassified user account are identified by parsing an output text file from the identity management system that corresponds to the unclassified user account.
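As an illustration of identifying keys by parsing a text export, the sketch below collects the set of attribute names for each record. It assumes an LDIF-style layout (one `name: value` pair per line, records separated by blank lines, each record starting with a `dn` line). The function name `keys_per_account` and the sample records are hypothetical and not taken from any real directory.

```python
def keys_per_account(ldif_text):
    """Return {dn: set of attribute names} for each record in the text."""
    accounts = {}
    dn, keys = None, set()
    # The appended "" guarantees the last record is flushed.
    for line in ldif_text.splitlines() + [""]:
        if not line.strip():
            if dn is not None:
                accounts[dn] = keys
            dn, keys = None, set()
            continue
        if line[0] in "# ":
            continue  # skip comments and folded continuation lines
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line
        if name.lower() == "dn":
            # A base64-encoded dn ("dn:: ...") is kept in encoded form here.
            dn = value.lstrip(":").strip()
        else:
            keys.add(name)
    return accounts


# Hypothetical sample export used only for illustration.
SAMPLE = """\
dn: CN=svc_backup,OU=Service Accounts,DC=example,DC=com
objectClass: user
sAMAccountName: svc_backup
servicePrincipalName: backup/host01

dn: CN=Jane Doe,OU=Staff,DC=example,DC=com
objectClass: user
sAMAccountName: jdoe
mail: jdoe@example.com
telephoneNumber: 555-0100
"""

parsed = keys_per_account(SAMPLE)
```

Only the key names are retained, not the values, which matches the use of keys as model inputs.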
In certain embodiments, the system performs the automated classification steps on each of the manually-classified user accounts used to train the statistical model in order to identify any mismatches between the automated classifications and the manual classifications.
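The mismatch check on the training accounts might look like the following sketch. It assumes the model's probabilities are already available per account and that a single threshold is used; the names `find_mismatches`, `probabilities`, and `manual_labels`, and the account identifiers, are hypothetical.

```python
def find_mismatches(probabilities, manual_labels, threshold):
    """List training accounts whose automated class differs from the manual one.

    probabilities -- {account: probability of being a service account}
    manual_labels -- {account: "service" or "human"}
    """
    mismatches = []
    for account, manual in manual_labels.items():
        automated = "service" if probabilities[account] > threshold else "human"
        if automated != manual:
            mismatches.append((account, manual, automated))
    return mismatches


# Illustrative values only.
probabilities = {"svc_backup": 0.97, "jdoe": 0.03, "scanner01": 0.91}
manual_labels = {"svc_backup": "service", "jdoe": "human", "scanner01": "human"}
mismatches = find_mismatches(probabilities, manual_labels, threshold=0.5)
```

An account such as `scanner01` above, manually defaulted to human but scored high by the model, is a candidate for review as a service account that the manual pass missed.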
Referring to
The model is created using a supervised learning algorithm. The preferred algorithm is the Bernoulli Naïve Bayes algorithm, but other supervised learning algorithms, such as a logistic regression algorithm, could be used. The model is trained using a set of inputs and a target variable. The inputs used to train the statistical model are identity management keys associated with manually-classified accounts. In the embodiment illustrated in
The system obtains the keys for the training data from an identity management system. Examples of identity management systems are MICROSOFT ACTIVE DIRECTORY and OPEN LDAP. In one embodiment, the system parses an LDIF text file from the identity management system to find the keys associated with training accounts. A human then provides a list of accounts known to be service accounts based on knowledge of the enterprise's account naming convention or IT records. Alternatively, a human may review the keys to manually classify service accounts from the keys. For example, an administrator may review the keys for a set of accounts to identify the accounts with a key or set of keys known to be specific only to service accounts at an entity. The identified accounts within the training set are manually classified as service accounts, and the other accounts within the set are manually classified, by default, as human user accounts. Whether through a list of known service accounts or through manual review, the administrator likely will find only a subset of the service accounts this way, but, if the identified set of service accounts is large enough, the model will be sufficiently reliable. Furthermore, as described below, an iterative process may be used to improve the reliability of the model. Using a supervised learning algorithm, the system leverages the manually classified accounts to “learn” and build the statistical model.
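To make the training step concrete, the following is a minimal from-scratch sketch of a Bernoulli Naïve Bayes model over key presence. It is offered under stated assumptions rather than as the disclosed implementation: each account is represented as a set of key names, the feature vector spans every key seen in training, Laplace smoothing with `alpha=1.0` is an assumed choice, and the function names and training data are hypothetical.

```python
import math


def train_bernoulli_nb(accounts, labels, alpha=1.0):
    """accounts: list of sets of key names; labels: list of bool (True = service)."""
    vocab = sorted(set().union(*accounts))
    model = {"vocab": vocab, "prior": {}, "p_key": {}}
    for cls in (True, False):
        members = [a for a, y in zip(accounts, labels) if y == cls]
        if not members:
            raise ValueError("training data must contain both account types")
        model["prior"][cls] = len(members) / len(accounts)
        # Smoothed probability that a key is present on an account of this class.
        model["p_key"][cls] = {
            k: (sum(k in a for a in members) + alpha) / (len(members) + 2 * alpha)
            for k in vocab
        }
    return model


def service_probability(model, keys):
    """Posterior probability that an account with these keys is a service account."""
    log_post = {}
    for cls in (True, False):
        lp = math.log(model["prior"][cls])
        for k in model["vocab"]:
            p = model["p_key"][cls][k]
            # Bernoulli likelihood: absent keys contribute (1 - p).
            lp += math.log(p if k in keys else 1.0 - p)
        log_post[cls] = lp
    top = max(log_post.values())  # subtract the max for numerical stability
    num = math.exp(log_post[True] - top)
    return num / (num + math.exp(log_post[False] - top))


# Hypothetical manually classified training accounts.
train_keys = [
    {"objectClass", "servicePrincipalName"},
    {"objectClass", "servicePrincipalName", "description"},
    {"objectClass", "servicePrincipalName"},
    {"objectClass", "mail", "telephoneNumber"},
    {"objectClass", "mail"},
    {"objectClass", "mail", "manager"},
]
train_labels = [True, True, True, False, False, False]
nb_model = train_bernoulli_nb(train_keys, train_labels)
p_service_like = service_probability(nb_model, {"objectClass", "servicePrincipalName"})
p_human_like = service_probability(nb_model, {"objectClass", "mail"})
```

In this sketch, keys that never appeared in training are ignored at prediction time, which is one simple way to handle keys that vary across departments.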
The table in
Once the model is created, it can be used to automatically determine whether an unclassified account is a service account or human user account. It also can be used to reclassify, in an automated manner, the manually-classified training data. Referring to
Referring again to
In an alternate embodiment, there are two probability thresholds, an upper threshold and a lower threshold, as illustrated in
In
In the embodiment described with respect to
If the positive scenario is service accounts (i.e., the system predicts the probability that an account is a service account), then the false positive (FP) rate is the percentage of human user accounts misclassified as service accounts by the system, assuming that the manual classifications were correct. Likewise, the false negative (FN) rate is the percentage of service accounts misclassified by the system as human user accounts, assuming the manual classifications were correct. If only one threshold will be used by the system to classify accounts (e.g., threshold 210), the threshold probability is the probability at the equal error rate (EER), i.e., the point at which the FP and FN rates are equal (step 430). If two thresholds are used (e.g., thresholds 305, 310), the lower threshold is set a certain amount below the EER probability, and the upper threshold is set a certain amount above the EER probability (step 430). In such case, the lower threshold yields an FN rate of x and the upper threshold yields an FP rate of y, wherein x and y are separately controllable based on the EER. For example, assume that, at the EER, the FP and FN rates are 12%. The upper and lower thresholds are set to probability scores on the x axis respectively associated with lower FP and FN rates. The lower threshold might be set to an FN rate of 6% and the upper threshold might be set to an FP rate of 6%. The percentages need not be equal.
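The threshold selection described above can be sketched as a sweep over candidate probabilities. This is an illustrative approximation only: the grid of 101 candidate thresholds, the function names, and the sample scores are assumptions, and both account types are assumed to be present in the scored set.

```python
def error_rates(scores, labels, threshold):
    """Return (FP rate, FN rate) when scores above threshold mean 'service'.

    labels are the manual classifications (True = service), taken as correct.
    """
    humans = [s for s, y in zip(scores, labels) if not y]
    services = [s for s, y in zip(scores, labels) if y]
    fp = sum(s > threshold for s in humans) / len(humans)
    fn = sum(s <= threshold for s in services) / len(services)
    return fp, fn


def eer_threshold(scores, labels, steps=100):
    """Candidate threshold at which the FP and FN rates are closest to equal."""
    def gap(t):
        fp, fn = error_rates(scores, labels, t)
        return abs(fp - fn)
    return min((i / steps for i in range(steps + 1)), key=gap)


def two_thresholds(scores, labels, eer, target_fp, target_fn, steps=100):
    """Move away from the EER threshold until each target rate is met."""
    grid = [i / steps for i in range(steps + 1)]
    upper = next((t for t in grid
                  if t >= eer and error_rates(scores, labels, t)[0] <= target_fp), eer)
    lower = next((t for t in reversed(grid)
                  if t <= eer and error_rates(scores, labels, t)[1] <= target_fn), eer)
    return lower, upper


# Hypothetical model scores for ten manually classified accounts.
scores = [0.05, 0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9, 0.95]
labels = [False] * 5 + [True] * 5
eer = eer_threshold(scores, labels)
lower, upper = two_thresholds(scores, labels, eer, target_fp=0.0, target_fn=0.0)
```

Because the FP rate falls as the threshold rises and the FN rate falls as it drops, the two targets can be tuned independently on either side of the EER threshold.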
An iterative classification process may be used to lower the EER and increase the accuracy of the statistical model.
The methods described with respect to
The account classification results may be used in context-specific rules in security analytics for computer networks. For example, an alert may be raised if an account classified by the methods herein as a human user account is behaving like a service account.
As will be understood by those familiar with the art, the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Accordingly, the above disclosure is intended to be illustrative, but not limiting, of the scope of the invention, which is set forth in the following claims.
This application is a continuation of U.S. patent application Ser. No. 15/058,034, filed on Mar. 1, 2016, and titled “System, Method, and Computer Program for Automatically Classifying User Accounts in a Computer Network Using Keys from an Identity Management System,” the contents of which are incorporated by reference as if fully disclosed herein.
Number | Date | Country | |
---|---|---|---|
20220006814 A1 | Jan 2022 | US |
Number | Date | Country | |
---|---|---|---|
Parent | 15058034 | Mar 2016 | US |
Child | 17478805 | US |