The current disclosure relates to predictive models for identifying security threats in an enterprise network, and in particular to systems and methods for providing custom predictive models and machine learning use cases for detecting electronic security threats within an enterprise computer network.
Machine learning using predictive models has two stages. First, a predictive model is trained, and then the trained model may be used for scoring potential security threats. Training occurs when, given a data set, the predictive model algorithm learns to adapt its parameters to conform to the input data set provided. Scoring occurs when a fully trained predictive model is used to make predictions, such as to predict a risk score associated with a set of behaviors in the data set. There are two flavors of machine learning: on-line, where the training and scoring both happen automatically in software within the same environment, and off-line, where the training is done separately from the scoring, typically through a manual process led by a data scientist.
Most current security analytics solutions perform offline machine learning where model development and training is performed by a data scientist outside of the main product to allow insights such as, for example, “The average amount of data copied to a USB drive by employees is 2 GB.” These models, once trained, are then deployed as scoring algorithms or simple threshold-based rules to provide security insights or alerts such as, for example “Alert me whenever an employee copies more than 5 GB of data to a USB drive”. While these offline models may be useful, it is difficult or impossible to account for variances in the population at scale. For example, while the average amount of data copied to a USB drive may be 2 GB, an employee working in a data intensive area, for example video editing, may exceed this average continually.
In contrast to off-line training, on-line learning and scoring is done automatically in the security system without requiring human expertise. For example, the security system learns automatically the average amount of data copied to a USB drive by each individual employee. The security system can then determine how unusual it is for any given employee when they copy a specific amount of data to a USB key.
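The per-employee learning and scoring described above can be illustrated with a minimal sketch. It assumes a running mean and standard deviation (Welford's online algorithm) and a z-score as the measure of unusualness; the class names, the `score_and_learn` method and the sample values are hypothetical and are not taken from the disclosure.

```python
import math
from collections import defaultdict


class RunningStats:
    """Welford's online algorithm for the mean and variance of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        # Sample standard deviation; 0.0 until at least two observations exist.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


class UsbCopyBaseline:
    """Hypothetical per-employee baseline of gigabytes copied to USB."""

    def __init__(self):
        self.stats = defaultdict(RunningStats)

    def score_and_learn(self, employee, gigabytes):
        """Score against the employee's own history, then learn from the value."""
        s = self.stats[employee]
        std = s.std()
        z = (gigabytes - s.mean) / std if std > 0 else 0.0
        s.update(gigabytes)
        return z


baseline = UsbCopyBaseline()
for gb in [40.0, 42.0, 38.0, 41.0, 39.0]:  # video editor: large copies are routine
    baseline.score_and_learn("editor", gb)
for gb in [1.0, 2.0, 1.5, 2.5, 2.0]:  # employee with small, occasional copies
    baseline.score_and_learn("analyst", gb)

# The same 41 GB copy is scored against each employee's own baseline.
z_editor = baseline.score_and_learn("editor", 41.0)
z_analyst = baseline.score_and_learn("analyst", 41.0)
```

In this sketch the same copy size yields a small score for the video editor and a very large one for the other employee, which is the variance a single population-wide threshold cannot capture.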
While on-line models can provide advantageous results, it may be more difficult for end users to create on-line models that are adapted to their own needs.
The present disclosure will be better understood with reference to the drawings, in which:
In accordance with the present disclosure there is provided a method of processing a custom predictive security model comprising: retrieving a custom predictive security model definition defining: input data from one or more available data sources providing security related information, the input data used in the predictive security model; and model logic providing a scoring function used to compute a predicted output value from the input data; ingesting, from the available data sources, the input data specified in the retrieved custom predictive security model; loading the ingested input data into the scoring function of the custom predictive security model; and outputting a predicted value from the scoring function based on the ingested input data.
In accordance with a further embodiment of the method, the custom predictive security model definition further defines one or more data aggregations to be applied to one or more input data, the method further comprises aggregating the input data according to the one or more data aggregations during ingestion.
In accordance with a further embodiment, the method further comprises processing one or more native models, each of the native models providing a respective output predictive value based on ingested input data.
In accordance with a further embodiment, the method further comprises processing one or more supermodels, each of the supermodels specifying two or more models and one or more Boolean operators joining the two or more models to provide a predictive value for the supermodel.
In accordance with a further embodiment of the method, one or more of the supermodels further define one or more trigger conditions for triggering processing of the respective supermodel.
In accordance with a further embodiment, the method further comprises providing a user interface for creating the one or more supermodels comprising: trigger selection functionality for selecting one or more triggering events from available triggering events; model selection functionality for selecting one or more of available native models, custom models and supermodels; Boolean operator definition functionality for defining Boolean operators joining selected models; and output definition functionality for defining an output of the supermodel.
In accordance with a further embodiment, the method further comprises: creating one or more supermodels using the user interface for creating the one or more supermodels; and storing the created one or more supermodels.
In accordance with a further embodiment, the method further comprises providing a user interface for creating the one or more custom models, the user interface comprising: import functionality for importing a data schema; and import functionality for importing model logic.
In accordance with a further embodiment of the method, the import functionality for importing model logic imports model logic defined using Predictive Model Markup Language (PMML).
In accordance with a further embodiment, the method further comprises: creating one or more custom models using the user interface for creating the one or more custom models; and storing the created one or more custom models.
In accordance with the present disclosure there is provided a computing system for processing a custom predictive security model comprising: a processor for executing instructions; a memory storing instructions, which when executed by the processor configure the computing system to: retrieve a custom predictive security model definition defining: input data from one or more available data sources providing security related information, the input data used in the predictive security model; and model logic providing a scoring function used to compute a predicted output value from the input data; ingest, from the available data sources, the input data specified in the retrieved custom predictive security model; load the ingested input data into the scoring function of the custom predictive security model; and output a predicted value from the scoring function based on the ingested input data.
In accordance with a further embodiment of the computing system, the custom predictive security model definition further defines one or more data aggregations to be applied to one or more input data, and wherein the instructions stored in memory, when executed by the processor, further configure the computing system to aggregate the input data according to the one or more data aggregations during ingestion.
In accordance with a further embodiment of the computing system, the instructions stored in memory, when executed by the processor, further configure the computing system to process one or more native models, each of the native models providing a respective output predictive value based on ingested input data.
In accordance with a further embodiment of the computing system, the instructions stored in memory, when executed by the processor, further configure the computing system to process one or more supermodels, each of the supermodels specifying two or more models and one or more Boolean operators joining the two or more models to provide a predictive value for the supermodel.
In accordance with a further embodiment of the computing system, one or more of the supermodels further define one or more trigger conditions for triggering processing of the respective supermodel.
In accordance with a further embodiment of the computing system, the instructions stored in memory, when executed by the processor, further configure the computing system to provide a user interface for creating the one or more supermodels comprising: trigger selection functionality for selecting one or more triggering events from available triggering events; model selection functionality for selecting one or more of available native models, custom models and supermodels; Boolean operator definition functionality for defining Boolean operators joining selected models; and output definition functionality for defining an output of the supermodel.
In accordance with a further embodiment of the computing system, the instructions stored in memory, when executed by the processor, further configure the computing system to: create one or more supermodels using the user interface for creating the one or more supermodels; and store the created one or more supermodels.
In accordance with a further embodiment of the computing system, the instructions stored in memory, when executed by the processor, further configure the computing system to provide a user interface for creating the one or more custom models, the user interface comprising: import functionality for importing a data schema; and import functionality for importing model logic.
In accordance with a further embodiment of the computing system, the import functionality for importing model logic imports model logic defined using Predictive Model Markup Language (PMML).
In accordance with a further embodiment of the computing system, the instructions stored in memory, when executed by the processor, further configure the computing system to: create one or more custom models using the user interface for creating the one or more custom models; and store the created one or more custom models.
In accordance with the present disclosure there is provided a non-transitory computer readable memory, storing instructions, which when executed by a processor of a computing system, configure the computing system to: retrieve a custom predictive security model definition defining: input data from one or more available data sources providing security related information, the input data used in the predictive security model; and model logic providing a scoring function used to compute a predicted output value from the input data; ingest, from the available data sources, the input data specified in the retrieved custom predictive security model; load the ingested input data into the scoring function of the custom predictive security model; and output a predicted value from the scoring function based on the ingested input data.
It is desirable to allow a customer to customize the security predictive models used in security products. Larger corporations, or other security product users, may have their own data science teams. As a result, they have the technical ability to develop their own statistically valid machine learning models using data science tools such as SPSS™, R, Python™, etc. Additionally, some security product customers may have a need for a specific machine learning algorithm or model but are unable to share specific details or data sets with the security product producer. Although customers may have the data science teams needed to create statistically valid machine learning models, they may not have the required data engineering abilities for developing and deploying the model in a big data production deployment, as that often involves different technologies, skills and experiences.
As described further herein, it is possible to provide a system to allow data scientists to define a custom machine learning model using data science tools they are familiar with. The custom model can then be deployed into an on-line predictive security platform for additional customization and production deployment, all without involving any software engineers or any custom development. These custom models can be run in isolation or in combination with existing native models as an augmentation of an existing system.
Current solutions may require custom development (e.g. in Java™ or Scala™) by solution or product teams to add new models or algorithms to the set of available predictive models that can be trained and scored online. This means that the availability of new, custom predictive models requires a new release of the underlying software. Further, the producer of the underlying software may analyze data to develop baselines of normal behavior for entities across an organization and then surface behavioral anomalies. This is done out-of-the-box with hundreds of analytical models and for many users, and this approach is effective when paired with tuning. However, for certain customers, it is desirable to have a fast, easy, and flexible way to add new models without requiring significant expertise in software development and deployment.
In addition to providing a system to allow data scientists to easily deploy new models into the predictive security system, the system also allows users, who may not be data scientists, to easily re-combine or customize both the existing native models, along with any custom models to provide machine learning use cases from the above, with an intuitive user experience (UX). The customized machine learning use cases, including the learning and scoring components of the customized machine learning use cases, can then be deployed and executed automatically by the system.
Current security solutions have, at best, a UX to customize rules or policies, which do not have a component of online machine learning and therefore do not need to handle the same underlying complexity. There is no security solution UX that allows customization of online predictive models without requiring data science expertise.
Computer 102 and server 108 each comprise a processor, memory, non-volatile (NV) storage and one or more input/output (I/O) interfaces. The NV storage may provide for long term storage, even in the absence of power. The NV storage may be provided by a hard disk drive, a solid state drive or flash memory. The I/O interfaces may allow additional components to be operatively coupled to the host computers, which may include network interface cards, input control devices such as keyboards and mice, as well as output devices such as monitors. The processors execute instructions stored in memory in order to configure the host computer 102 or server 108 to provide model creation and execution.
As depicted, different model components 116, 118, 120 may define the operation of the different components of the input processing 202 and model processing 206.
Although only a single custom model is depicted, a number of custom models and native models may be stored. As described above, the models may be stored separately from the data type definitions. A data store library may store the schema definitions for both native, built-in data types (e.g. Active Directory™, NetFlow, Perforce™ and other common, standard data types) and custom data types (e.g. output from a home-grown authentication system, output from a custom human resources (HR) database). As described above, the data types for both native and custom data types may be specified using a standard declarative language, such as Apache Avro™, or a set of named column identifiers and column types. A model library may store the definitions for a model's input columns, transformations and aggregations, model algorithms, model parameters, and output column. A model's input columns, transformations, algorithms, parameters and output columns can be specified using a standard declarative language, such as Predictive Model Markup Language (PMML). A model's associated aggregation requirements can be specified using a standard declarative language, such as OLAP MDX.
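As a rough illustration of a custom data type specified as a set of named column identifiers and column types, the following sketch declares a schema and checks a row against it. The schema name, the column names and the `validate_row` helper are hypothetical; the sketch uses plain Python structures and does not reproduce the Avro format.

```python
# Hypothetical schema for a custom data type, given as named columns and types.
CUSTOM_AUTH_SCHEMA = {
    "name": "custom_auth_event",
    "columns": [
        ("timestamp", int),
        ("user", str),
        ("source_ip", str),
        ("success", bool),
    ],
}


def validate_row(schema, row):
    """Return a list of problems found in `row`; an empty list means it conforms."""
    errors = []
    for name, col_type in schema["columns"]:
        if name not in row:
            errors.append(f"missing column: {name}")
        elif not isinstance(row[name], col_type):
            errors.append(f"column {name}: expected {col_type.__name__}")
    return errors


good_row = {"timestamp": 1523836800, "user": "alice",
            "source_ip": "10.0.0.5", "success": True}
bad_row = {"timestamp": "yesterday", "user": "bob", "success": False}

good_errors = validate_row(CUSTOM_AUTH_SCHEMA, good_row)
bad_errors = validate_row(CUSTOM_AUTH_SCHEMA, bad_row)
```

Keeping the schema as data rather than code is what would let a data type library of this kind be extended with a new custom source without a software release.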
Data ingest functionality 210 interfaces with the raw data sources 204 and ingests the data for processing, for both native and custom models. The data sources 204 provide security related information that is useful for detecting electronic security threats within an enterprise computer network. The data sources may include for example, Active Directory sources, NetFlow sources, Perforce sources, building access systems, human resources information system (HRIS) sources as well as other data sources that provide information that may provide insight into potential security risks to an organization. Metadata required for data ingest of a custom data source is read from the Data Types Library. During the data ingest, raw data can be cleaned and normalized. The ingested data may be stored to a message queue.
Data transformation functionality 212 performs row-level transformations, if required, of the incoming data, to result in additional columns to be added to the row. This is sometimes required to generate the appropriate inputs into a model. For example, a predictive model may require the logarithm of a column's value, rather than the actual value itself. A special case of data transformation is to take the values of the row and use them as input into a predictive model from the model library, to create additional columns to be added to the row which are actually predictions. This is sometimes described as “data enrichment”. For example, a predictive model may look at the metadata associated with a network flow record, and predict the most probable network protocol associated with that flow record. As another example, a predictive model may look at a DNS record, and predict whether this is a malicious connection using a Domain Generation Algorithm (DGA). Metadata required for all data transformations may be read from the model library (for example, PMML supports data transformation specifications).
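The two kinds of row-level transformation described above, a derived column and a prediction used as enrichment, might be sketched as follows. The function names, the column names and the port-based `guess_protocol` stand-in are hypothetical; a real deployment would read the transformation from the model library rather than hard-code it.

```python
import math


def add_log_column(row, source, target):
    """Return a copy of `row` with `target` set to the natural log of `source`."""
    enriched = dict(row)
    enriched[target] = math.log(row[source])  # assumes a strictly positive value
    return enriched


def add_prediction_column(row, predict, target):
    """Return a copy of `row` with `target` set to a model's prediction."""
    enriched = dict(row)
    enriched[target] = predict(row)
    return enriched


def guess_protocol(row):
    """Toy stand-in for a predictive model: guess the protocol from the port."""
    return {53: "dns", 443: "https"}.get(row["dst_port"], "unknown")


original = {"dst_port": 443, "bytes": 1024}
enriched = add_log_column(original, "bytes", "log_bytes")
enriched = add_prediction_column(enriched, guess_protocol, "predicted_protocol")
```

Each helper returns a new row instead of mutating its input, so the raw ingested record remains available to other models.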
Data aggregation functionality 214 performs aggregation operations, if required, across collections of rows, to result in aggregate values that are sometimes required as inputs into a model. For example, a predictive model may require the total sum of a column's value for every hour or day in the dataset. Metadata required for all data aggregation may be read from the Model Library (for example, MDX may be used to describe the required aggregations).
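An hourly sum of the kind mentioned above can be sketched in a few lines. The sketch assumes timestamps in epoch seconds and groups by an entity column; `hourly_sum` and the column names are hypothetical, and the aggregation is written directly in Python rather than derived from an MDX description.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600


def hourly_sum(rows, key_col, value_col, ts_col="timestamp"):
    """Sum `value_col` per (entity, hour), with hours keyed by their start time."""
    totals = defaultdict(float)
    for row in rows:
        hour_start = row[ts_col] - row[ts_col] % SECONDS_PER_HOUR
        totals[(row[key_col], hour_start)] += row[value_col]
    return dict(totals)


rows = [
    {"timestamp": 0, "user": "alice", "gb_copied": 1.0},
    {"timestamp": 1800, "user": "alice", "gb_copied": 2.0},
    {"timestamp": 3600, "user": "alice", "gb_copied": 4.0},
    {"timestamp": 10, "user": "bob", "gb_copied": 0.5},
]
totals = hourly_sum(rows, "user", "gb_copied")
```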
Model training functionality 216 performs any model training, if required. Metadata to describe the model training algorithms may be read from the Model Library (for example, PMML may be used to enumerate the model algorithms). Examples of machine learning model algorithms include logistic regression and neural networks.
Model scoring functionality 218 performs any model scoring, which outputs predictions from the models that may then be used to automate system responses depicted schematically as output functionality 220, such as to automatically generate tickets in an incident response management system to investigate a high risk machine that was detected from a custom model. The scoring function may be read from the model library (for example, the model scoring function is described in a PMML file). This may be implemented, for example, using a PMML library that executes the scoring function across an Apache Spark™ cluster.
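The hand-off from scoring to an automated response can be sketched as below. The scoring function is passed in as a plain callable standing in for one loaded from the model library; `toy_scoring_fn`, the 0.8 threshold and the ticket fields are hypothetical, and no real incident response API is called.

```python
def score_and_respond(rows, scoring_fn, threshold=0.8):
    """Score each row and build a ticket for every row at or above `threshold`."""
    tickets = []
    for row in rows:
        risk = scoring_fn(row)
        if risk >= threshold:
            tickets.append(
                {"entity": row["machine"], "risk": risk, "action": "investigate"}
            )
    return tickets


def toy_scoring_fn(row):
    """Hypothetical scoring function: risk grows with failed logins, capped at 1."""
    return min(1.0, row["failed_logins"] / 10)


machines = [
    {"machine": "ws-1", "failed_logins": 2},
    {"machine": "ws-2", "failed_logins": 9},
]
tickets = score_and_respond(machines, toy_scoring_fn)
```

Because the scoring function is an argument, the same response logic applies unchanged to native and custom models.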
It can be appreciated that the system above is useful even with a subset of the components. For example, if no data transformations, data aggregation, or model training is required, the system continues to provide utility with just data ingest and model scoring capabilities.
In addition to the predictive analytics models described above, the system 200 may include a rules engine 222 for processing data. The rules engine may output events that match one or more rules. The matched events may be stored for example using Apache HBase™ or other data store 224. The matched events may be used as triggers for the model processing. The stored events may be synched to other systems including for example Elasticsearch™ for presentation to users 226. As another example, the rules engine may be used to trigger automated responses to specific predictive events, such as to quarantine a machine from the network when a malicious DNS query has been predicted.
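A rules engine that matches events and triggers a response to a predicted malicious DNS query might look like the following sketch. The rule name, the event fields, the 0.9 cut-off and the returned action dictionary are all hypothetical; the sketch keeps rules in memory and does not model HBase or Elasticsearch.

```python
# Hypothetical rule set: each rule is a named predicate over an event dictionary.
RULES = {
    "malicious_dns": lambda e: e.get("type") == "dns"
    and e.get("dga_score", 0.0) > 0.9,
}


def match_rules(event, rules):
    """Return the names of all rules the event satisfies."""
    return [name for name, predicate in rules.items() if predicate(event)]


def respond(event, matched):
    """Map matched rules to an automated response, or None if no action applies."""
    if "malicious_dns" in matched:
        return {"action": "quarantine", "host": event["host"]}
    return None


suspicious = {"type": "dns", "host": "ws-7", "dga_score": 0.97}
benign = {"type": "dns", "host": "ws-8", "dga_score": 0.05}

suspicious_response = respond(suspicious, match_rules(suspicious, RULES))
benign_response = respond(benign, match_rules(benign, RULES))
```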
The model 300 also comprises model logic 308 that specifies the particular logic used by the model. As depicted, the model logic 308 may be specified using a predictive model markup language (PMML), although the logic may be specified in other ways as well. Regardless of how the model logic is specified, it defines the particular predictive models or rules that are used to provide the model output. The model logic may define the attributes used by the model as well as the model logic and possibly training or configuration information for the model. The attributes specified in the model logic correspond to the attributes or fields specified in the data schema. The model logic may be viewed as a predicate, p=f(x), where the model f is a characteristic function of a probability distribution that may return, for example, a probability in [0,1], along with other predicates that are useful for security purposes such as predicting if an event has occurred within a particular time frame. Although a wide number of model logic algorithms may be used, examples include regression models, neural networks, support vector machines (SVM), clustering models, decision trees, naïve Bayes classifiers as well as other model algorithms. Predictive model predicates may be described using PMML or other suitable languages. Other model predicates may be described in the model logic using any standard grammar, such as tokens described via a Backus-Naur Form (BNF) notation.
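To make the view of model logic as a predicate p=f(x) concrete, the following sketch uses a logistic regression, one of the regression models listed above, as f. The weights, bias and attribute names are invented for illustration; in the described system the parameters would come from the imported model logic rather than from code.

```python
import math


def logistic_score(x, weights, bias):
    """Return p = f(x) in [0, 1] for a logistic regression over named attributes."""
    z = bias + sum(weights[name] * x[name] for name in weights)
    # Numerically stable sigmoid: never exponentiates a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Hypothetical parameters; the attribute names must match the data schema.
WEIGHTS = {"gb_copied": 0.5, "off_hours": 2.0}
BIAS = -4.0

p_low = logistic_score({"gb_copied": 1.0, "off_hours": 0}, WEIGHTS, BIAS)
p_high = logistic_score({"gb_copied": 10.0, "off_hours": 1}, WEIGHTS, BIAS)
```

The dictionary keys play the role of the attributes that the model logic shares with the data schema: a missing attribute raises a `KeyError` instead of silently scoring on incomplete input.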
The above has described the creation and processing of custom models. While creating custom models may be advantageous, it should be done by data scientists having the necessary knowledge to create statistically meaningful models. It is desirable to provide the ability for users without the required knowledge to create their own models using the available statistically valid models.
Models may be selected from the model store 614. The models may be grouped together into different model types or families. For example, for anomaly models, as depicted, different anomaly types or families 616, 618 may be determined using different models 620, 622. For example, in a cybersecurity application, an anomaly model of “employee is copying an unusual amount of data” may be determined by different models, including for example a model that compares the amount to the employee's historical data copying amounts, and another model that compares the amount to other employees in the same department. In selecting an anomaly type, the underlying models of the selected anomaly type may be used and combined together with underlying models of other selected anomaly types. The joining conditions may specify Boolean operators for combining the different anomaly models of the selected anomaly type and trigger events together. The output definition may provide a new type of custom anomaly. New trigger events may fire from a rules engine, such as Storm™. The combined models provide ML use cases that may aggregate on any number of “model families” or types and/or trigger events generated from a rules engine.
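The anomaly family in the example above, joined with a trigger event, can be sketched as follows. The sketch assumes that a family fires when any of its underlying models fires and that each model is a simple multiple-of-the-mean test; both choices, along with the function names and the factor of 3, are illustrative assumptions rather than details from the disclosure.

```python
def mean(values):
    return sum(values) / len(values)


def unusual_vs_own_history(amount_gb, own_history_gb, factor=3.0):
    """Model 1: compare the amount to the employee's own past copies."""
    return amount_gb > factor * mean(own_history_gb)


def unusual_vs_department(amount_gb, peer_amounts_gb, factor=3.0):
    """Model 2: compare the amount to other employees in the same department."""
    return amount_gb > factor * mean(peer_amounts_gb)


def unusual_copy_family(amount_gb, own_history_gb, peer_amounts_gb):
    # Assumption: the family fires when any underlying model fires.
    return unusual_vs_own_history(amount_gb, own_history_gb) or unusual_vs_department(
        amount_gb, peer_amounts_gb
    )


def custom_anomaly(family_fired, trigger_fired):
    """Joining condition: the family AND a trigger event from the rules engine."""
    return family_fired and trigger_fired


# 45 GB is normal for this employee's history but far above department peers.
family_fired = unusual_copy_family(45.0, [40.0, 42.0, 38.0], [2.0, 1.5, 2.5])
quiet_employee = unusual_copy_family(2.0, [1.5, 2.0, 2.5], [2.0, 1.5, 2.5])
```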
The security risk profiling functionality 812 provides user interface functionality 814 that allows end users to interact with various potential risk models 816. As depicted, the models may include native models 816a that are provided as part of the security risk profiling functionality, custom models 816b that are defined by end users or other third parties, and supermodels 816c. The user interface functionality 814 may include dashboard user interface functionality 818 that displays results of the processed models to an end user. The dashboard interface presented by the dashboard user interface functionality 818 allows end users to investigate potential security risks based on the results of processing one or more of the models 816. For example, a model may indicate that a particular user presents a high risk of potential data theft. The interface functionality 814 may further comprise model selection and tuning functionality 820 that allows an end user to select one or more models to execute, or process. The selection of the models may be provided in various ways, including for example listing all available models, subsets of models, predefined listing or groupings of models or other ways of selecting models. The model selection may allow the user to select any of the native models 816a, custom models 816b or supermodels 816c. The model selection and tuning functionality 820 may also allow an end user to tune or configure selected models. For example, parameter values, thresholds or other settings of the selected models may be set or adjusted. The user interface functionality 814 may also include custom model creation user interface functionality 822 that allows an end user to create custom models. The custom model creation interface may allow the end user to create the custom model in various ways including importing model functionality defined in other tools or languages.
The custom model creation user interface functionality 822 allows end users who may be familiar with creating statistically valid predictive models, but are not familiar with programming or otherwise creating models for the security risk profiling functionality 812, to easily import the predictive models they created with other tools or languages. The user interface functionality 814 may also provide machine learning use case creation interface functionality 824 that allows end users who may not be familiar with creating statistically valid models to create new use cases by selecting and combining existing models 816.
The security risk profiling functionality 812 may further comprise execution management functionality 826. The execution management functionality 826 may control the processing of selected models. The model selection and tuning functionality 820 may provide selected model(s), or an indication of the selected model(s), to the execution management functionality 826, which may configure the security risk profiling functionality 812 for processing the models. The models may be configured to be processed periodically, such as every hour, day, week, etc., or the models may be processed on demand when selected. The execution management functionality 826 may retrieve the data schema information, and possibly any aggregation information, from selected models and configure input processing functionality 828 in order to ingest, and aggregate if required, any input data required by the selected models. The input processing functionality 828 may store the ingested data for access by other functionality. For example, the input processing functionality 828 may pass the ingested data to a message queue 830. The execution management functionality 826 may also configure model processing functionality 832 as well as rule processing functionality 834 according to the model logic.
In addition to configuring the input processing functionality 828, the model processing functionality 832 and the rule processing functionality 834, the execution management 826 may also control the processing of supermodels 816c. As described above, supermodels may comprise a plurality of models that are joined together using Boolean operators. The execution management functionality 826 may receive a supermodel and configure the components according to the individual models of the supermodel. The execution management functionality 826 may combine the results from the individual models together according to the Boolean operators of the supermodel. The execution management functionality 826 may retrieve individual model results from the message queues 830 and combine the results together and store the output to the message queues.
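One way to picture how results from individual models are combined according to a supermodel's Boolean operators is a small expression evaluator over results drained from a queue. The nested-tuple representation of a supermodel, the model names and the use of `collections.deque` as a stand-in for the message queues are all assumptions made for this sketch.

```python
from collections import deque


def evaluate(expr, results):
    """Evaluate a supermodel expression against per-model Boolean results.

    `expr` is either a model name or a tuple ("AND" | "OR" | "NOT", operand, ...).
    """
    if isinstance(expr, str):
        return results[expr]
    op, *operands = expr
    if op == "NOT":
        return not evaluate(operands[0], results)
    values = [evaluate(operand, results) for operand in operands]
    if op == "AND":
        return all(values)
    if op == "OR":
        return any(values)
    raise ValueError(f"unknown operator: {op}")


# Hypothetical supermodel: USB anomaly AND (off-hours login OR DGA DNS query).
SUPERMODEL = ("AND", "usb_anomaly", ("OR", "offhours_login", "dga_dns"))

# Stand-in for the message queue holding individual model results.
queue = deque([("usb_anomaly", True), ("offhours_login", False), ("dga_dns", True)])

results = {}
while queue:
    model_name, fired = queue.popleft()
    results[model_name] = fired

supermodel_fired = evaluate(SUPERMODEL, results)
queue.append(("supermodel", supermodel_fired))  # store the output back to the queue
```

Because an operand may itself be a tuple, a supermodel can nest other combinations, which matches the model selection functionality allowing supermodels to be selected alongside native and custom models.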
The security risk profiling functionality 812 described above provides a system that allows the creation and execution of custom predictive models for use in detecting potential security risks within an organization. Additionally, the security risk profiling functionality 812 provides a system that allows end users to combine existing models together to create new machine learning use cases that may be used in identifying potential security threats.
Although certain components and steps have been described, it is contemplated that individually described components, as well as steps, may be combined together into fewer components or steps or the steps may be performed sequentially, non-sequentially or concurrently. Further, although described above as occurring in a particular order, one of ordinary skill in the art having regard to the current teachings will appreciate that the particular order of certain steps relative to other steps may be changed. Similarly, individual components or steps may be provided by a plurality of components or steps. One of ordinary skill in the art having regard to the current teachings will appreciate that the system and method described herein may be provided by various combinations of software, firmware and/or hardware, other than the specific implementations described herein as illustrative examples.
Some embodiments are directed to a computer program product comprising a computer-readable medium comprising code for causing a computer, or multiple computers, to implement various functions, steps, acts and/or operations, e.g. one or more or all of the steps described above. Depending on the embodiment, the computer program product can, and sometimes does, include different code for each step to be performed. Thus, the computer program product may, and sometimes does, include code for each individual step of a method, e.g., a method of operating a communications device, e.g., a wireless terminal or node. The code may be in the form of machine, e.g., computer, executable instructions stored on a computer-readable medium such as a RAM (Random Access Memory), ROM (Read Only Memory) or other type of storage device. In addition to being directed to a computer program product, some embodiments are directed to a processor configured to implement one or more of the various functions, steps, acts and/or operations of one or more methods described above. Accordingly, some embodiments are directed to a processor, e.g., CPU, configured to implement some or all of the steps of the method(s) described herein. The processor may be for use in, e.g., a communications device or other device described in the present application.
This application claims priority from U.S. Provisional Application No. 62/658,228, filed Apr. 16, 2018, the entirety of which is hereby incorporated by reference for all purposes.
Number | Date | Country |
---|---|---|
20190318203 A1 | Oct 2019 | US |
Number | Date | Country |
---|---|---|
62658228 | Apr 2018 | US |