This application generally relates to monitoring scripts, including but not limited to systems and methods for identifying scripts by coding styles.
Information technology administrators and developers in companies may write scripts to automate their routine tasks. The scripts can be launched with administrator privileges and can perform almost any operation on an endpoint. Meanwhile, cyber attackers may use scripts to disable computers, steal data, or use a breached computer program to launch additional attacks, since scripts can be deployed easily.
Information technology (IT) administrators and developers of a company may struggle to balance convenience and security. The IT administrators and developers may use command-line shell policies (e.g., AllSigned or RemoteSigned) to block unknown scripts. On the other hand, security product vendors may design different components to detect different types of scripts. However, some IT developers may often write scripts for proof of concept (PoC) testing. These scripts can change frequently and may not be signed every time.
For a malicious command-line shell, attackers may obfuscate scripts to escape signature detection. A machine learning based method can detect an obfuscated command-line shell at a landing stage of an attack. However, once the attackers take control of an endpoint, the attackers may perform attacks with scripts that are not obfuscated, such as port scanning, installing an autorun entry, or dumping credentials. Such scripts are usually disguised as normal scripts so that they can persist on an endpoint of a victim for a long time. More and more attack tools (e.g., Nmap or Mimikatz) can be rewritten in a command-line shell (e.g., PowerShell). Obfuscated command-line detection may therefore not be sufficient to defend against endpoint attacks.
The systems and methods of this disclosure can address these technical problems by identifying a unique script coding style in each company and detecting an abnormal script before the abnormal script is launched. Employees of each company often follow the same coding style guide. This technical solution may be applied to any kind of script.
An aspect of this disclosure can be directed to a system. The system can include one or more processors, coupled to memory. The one or more processors can identify a script for execution by a computing device of an entity. The one or more processors can determine, via a model trained with machine learning based on a plurality of scripts established by a plurality of entities, a classification of the script prior to execution of the script by the computing device. The one or more processors can control execution of the script responsive to the classification of the script.
The data processing system can be intermediary to the computing device and one or more servers. The one or more processors can be configured to intercept the script transmitted from the one or more servers to the computing device prior to receipt by the computing device of the script. The one or more processors can be configured to determine, via the model and prior to forwarding the script to the computing device, that the classification indicates the script is authorized for execution by the computing device. The one or more processors can be configured to forward the script to the computing device for execution responsive to the script being authorized for execution by the computing device.
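The intercept, classify, and forward flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the names Classification and handle_intercepted_script are hypothetical, and the classify and forward callables stand in for the trained model and for delivery to the computing device.

```python
from enum import Enum
from typing import Callable


class Classification(Enum):
    INTERNAL = "internal"   # script matches the entity's coding style
    EXTERNAL = "external"   # script appears to come from another entity


def handle_intercepted_script(
    script: str,
    classify: Callable[[str], Classification],
    forward: Callable[[str], None],
) -> bool:
    """Classify a script before it reaches the computing device.

    Returns True if the script was forwarded, False if it was withheld.
    Both callables are hypothetical hooks supplied by the caller.
    """
    if classify(script) is Classification.INTERNAL:
        forward(script)      # authorized: pass the script along
        return True
    return False             # not authorized: the script is never delivered


# Placeholder classifier, assumed purely for illustration.
def toy_classify(script: str) -> Classification:
    if "Get-Inventory" in script:
        return Classification.INTERNAL
    return Classification.EXTERNAL


delivered = []
first = handle_intercepted_script("Get-Inventory -All", toy_classify, delivered.append)
second = handle_intercepted_script("Invoke-Dump", toy_classify, delivered.append)
```

Because the decision is made before forwarding, a script classified as external never reaches the computing device in this sketch.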
The computing device may comprise the data processing system. The one or more processors can be configured to receive, via a network, the script from a remote device configured to remotely manage the computing device, the script compatible with a plurality of different platforms and configured with a command-line shell. The one or more processors can be configured to determine, responsive to receipt of the script from the remote device and prior to execution of the script on the computing device, the classification of the script. The one or more processors can be configured to control, responsive to the classification, execution of the script to prevent execution of the script or allow execution of the script on the computing device. In some embodiments, the one or more processors can be configured to prevent execution of the script by the computing device responsive to the classification comprising an indication that the script was developed by a second entity that is different from the entity. The computing device can be managed by the entity.
The one or more processors can be configured to determine, via the model, the classification of the script as one of developed internal to the entity or developed external to the entity. The one or more processors can be configured to scan the script to identify a plurality of values for a plurality of features of the script. The one or more processors can be configured to input the plurality of values for the plurality of features into the model to determine the classification.
The model can be trained with a plurality of features of the plurality of scripts developed by the plurality of entities. The plurality of features may comprise one or more features indicative of a coding style, a file attribute, or a code quality. The one or more processors can be configured to determine, for the entity, a plurality of features established in the model for the entity. The plurality of features may comprise at least one of a naming convention, bracket position, maximum line length, trailing whitespace, space around keywords, style of cmdlet, or indentation. The one or more processors can be configured to scan the script to identify a plurality of values for the plurality of features established for the entity. A first value for a first feature of the plurality of features corresponding to the naming convention may indicate a number of words in the script that use camel case or snake case. A second value for a second feature of the plurality of features corresponding to the bracket position may indicate that a bracket is located at an end of a line in the script or the bracket is located at a head of the line in the script.
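As an illustration of the first and second values described above, the following sketch counts camel-case and snake-case words and records where opening brackets sit. The function names, regular expressions, and sample script are assumptions made for this example; a production scanner would likely need a real tokenizer for the target shell.

```python
import re

# Hypothetical patterns: lowerCamelCase and snake_case identifiers.
CAMEL = re.compile(r"^[a-z]+(?:[A-Z][a-z0-9]*)+$")
SNAKE = re.compile(r"^[a-z0-9]+(?:_[a-z0-9]+)+$")
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def naming_counts(script: str) -> dict:
    """Count words in the script that use camel case or snake case."""
    words = IDENT.findall(script)
    return {
        "camel_case": sum(1 for w in words if CAMEL.match(w)),
        "snake_case": sum(1 for w in words if SNAKE.match(w)),
    }


def bracket_positions(script: str) -> dict:
    """Count opening brackets at the head of a line versus the end of a line."""
    head = end = 0
    for line in script.splitlines():
        stripped = line.strip()
        if stripped.startswith("{"):
            head += 1
        elif stripped.endswith("{"):
            end += 1
    return {"end_of_line": end, "head_of_line": head}


sample = (
    "function getUserName {\n"
    "    $user_name = 'x'\n"
    "}\n"
    "if ($ok)\n"
    "{\n"
    "    $maxRetryCount = 3\n"
    "}\n"
)
names = naming_counts(sample)
brackets = bracket_positions(sample)
```

For the sample script, two words are camel case (getUserName, maxRetryCount), one is snake case (user_name), and one opening bracket falls in each position.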
The one or more processors can be configured to receive a second plurality of scripts from a third-party repository. The one or more processors can be configured to train an initial model based on a plurality of features of the second plurality of scripts. The plurality of features may comprise one or more features indicative of a coding style, a file attribute, or a code quality. The one or more processors can be configured to receive a third plurality of scripts developed by the entity. The one or more processors can be configured to classify, via the initial model, the third plurality of scripts of the entity into a category in the initial model trained based on the second plurality of scripts from the third-party repository. The one or more processors can be configured to train, based on the initial model and the category, the model as a binary classifier to output the classification as one of internal or external. The machine learning may comprise at least one of a support vector machine, a linear kernel function, or a radial basis kernel function.
An aspect of the present disclosure can be directed to a method for identifying abnormal scripts by coding style. The method can include identifying, by a data processing system comprising one or more processors coupled with memory, a script for execution by a computing device of an entity. The data processing system may determine a classification of the script prior to execution of the script by the computing device via a model trained with machine learning based on a plurality of scripts established by a plurality of entities. The data processing system may control execution of the script responsive to the classification of the script.
The data processing system can be intermediary to the computing device and one or more servers. The data processing system may intercept the script transmitted from the one or more servers to the computing device prior to receipt by the computing device of the script. The data processing system may determine, via the model and prior to forwarding the script to the computing device, that the classification indicates the script is authorized for execution by the computing device. The data processing system may forward the script to the computing device for execution responsive to the script being authorized for execution by the computing device.
The computing device may comprise the data processing system. The data processing system may receive, via a network, the script from a remote device configured to remotely manage the computing device. The script may be compatible with a plurality of different platforms and configured with a command-line shell. The data processing system may determine, responsive to receipt of the script from the remote device and prior to execution of the script on the computing device, the classification of the script. The data processing system may control, responsive to the classification, execution of the script to prevent execution of the script or allow execution of the script on the computing device.
The data processing system may prevent execution of the script by the computing device responsive to the classification comprising an indication that the script was developed by a second entity that is different from the entity. The computing device can be managed by the entity. The data processing system may determine, via the model, the classification of the script as one of developed internal to the entity or developed external to the entity. The data processing system may scan the script to identify a plurality of values for a plurality of features of the script. The data processing system may input the plurality of values for the plurality of features into the model to determine the classification. The data processing system may receive a second plurality of scripts from a third-party repository. The data processing system may train, via the machine learning, an initial model based on a plurality of features of the second plurality of scripts. The plurality of features may comprise one or more features indicative of a coding style, a file attribute, or a code quality. The machine learning may comprise at least one of a support vector machine, a linear kernel function, or a radial basis kernel function. The data processing system may receive a third plurality of scripts developed by the entity. The data processing system may classify, via the initial model, the third plurality of scripts of the entity into a category in the initial model trained based on the second plurality of scripts from the third-party repository. The data processing system may train, based on the initial model and the category, the model as a binary classifier to output the classification as one of internal or external.
An aspect of the present disclosure can be directed to a non-transitory computer-readable medium storing program instructions. The non-transitory computer-readable medium can store instructions that, when executed by one or more processors, cause the one or more processors to identify a script for execution by a computing device of an entity. The instructions can include instructions to determine, via a model trained with machine learning based on a plurality of scripts established by a plurality of entities, a classification of the script prior to execution of the script by the computing device. The instructions can include instructions to control execution of the script responsive to the classification of the script. The model may comprise a plurality of features comprising at least one of a naming convention, bracket position, maximum line length, trailing whitespace, space around keywords, style of cmdlet, or indentation.
These and other aspects and implementations are discussed in detail below. The foregoing information and the following detailed description include illustrative examples of various aspects and implementations, and provide an overview or framework for understanding the nature and character of the claimed aspects and implementations. This Summary is not intended to identify key features or essential features, nor is it intended to limit the scope of the claims included herewith. The drawings provide illustration and a further understanding of the various aspects and implementations, and are incorporated in and constitute a part of this specification. Aspects can be combined and it will be readily appreciated that features described in the context of one aspect of the invention can be combined with other aspects. Aspects can be implemented in any convenient form, for example, by appropriate computer programs, which may be carried on appropriate carrier media (computer readable media), which may be tangible carrier media (e.g., disks) or intangible carrier media (e.g., communications signals). Aspects may also be implemented using suitable apparatus, which may take the form of programmable computers running computer programs arranged to implement the aspect. As used in the specification and in the claims, the singular forms ‘a’, ‘an’, and ‘the’ include plural referents unless the context clearly dictates otherwise.
Objects, aspects, features, and advantages of embodiments disclosed herein will become more fully apparent from the following detailed description, the appended claims, and the accompanying drawing figures in which like reference numerals identify similar or identical elements. Reference numerals that are introduced in the specification in association with a drawing figure may be repeated in one or more subsequent figures without additional description in the specification in order to provide context for other features, and not every element may be labeled in every figure. The drawing figures are not necessarily to scale, emphasis instead being placed upon illustrating embodiments, principles and concepts. The drawings are not intended to limit the scope of the claims included herewith.
The features and advantages of the present solution will become more apparent from the detailed description set forth below when taken in conjunction with the drawings, in which like reference characters identify corresponding elements throughout. In the drawings, like reference numbers generally indicate identical, functionally similar, and/or structurally similar elements.
For purposes of reading the description of the various embodiments below, the following descriptions of the sections of the specification and their respective contents may be helpful:
Section A describes a computing environment which may be useful for practicing embodiments described herein; and
Section B describes systems and methods for identifying scripts by coding styles.
Prior to discussing the specifics of embodiments of the systems and methods of an appliance and/or client, it may be helpful to discuss the computing environments in which such embodiments may be deployed.
As shown in
Computer 100 as shown in
Communications interfaces 115 may include one or more interfaces to enable computer 100 to access a computer network such as a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or the Internet through a variety of wired and/or wireless or cellular connections.
In described embodiments, the computing device 100 may execute an application on behalf of a user of a client computing device. For example, the computing device 100 may execute a virtual machine, which provides an execution session within which applications execute on behalf of a user or a client computing device, such as a hosted desktop session. The computing device 100 may also execute a terminal services session to provide a hosted desktop environment. The computing device 100 may provide access to a computing environment including one or more of one or more applications, one or more desktop applications, and one or more desktop sessions in which one or more applications may execute.
Referring to
In some embodiments, the computing environment 160 may provide client 165 with one or more resources provided by a network environment. The computing environment 160 may include one or more clients 165a-165n, in communication with a cloud 175 over one or more networks 170. Clients 165 may include, e.g., thick clients, thin clients, and zero clients. The cloud 175 may include back-end platforms, e.g., servers, storage, server farms or data centers. The clients 165 can be the same as or substantially similar to computer 100 of
The users or clients 165 can correspond to a single organization or multiple organizations. For example, the computing environment 160 can include a private cloud serving a single organization (e.g., enterprise cloud). The computing environment 160 can include a community cloud or public cloud serving multiple organizations. In some embodiments, the computing environment 160 can include a hybrid cloud that is a combination of a public cloud and a private cloud. For example, the cloud 175 may be public, private, or hybrid. Public clouds 175 may include public servers that are maintained by third parties to the clients 165 or the owners of the clients 165. The servers may be located off-site in remote geographical locations as disclosed above or otherwise. Public clouds 175 may be connected to the servers over a public network 170. Private clouds 175 may include private servers that are physically maintained by clients 165 or owners of clients 165. Private clouds 175 may be connected to the servers over a private network 170. Hybrid clouds 175 may include both the private and public networks 170 and servers.
The cloud 175 may include back-end platforms, e.g., servers, storage, server farms or data centers. For example, the cloud 175 can include or correspond to a server or system remote from one or more clients 165 to provide third party control over a pool of shared services and resources. The computing environment 160 can provide resource pooling to serve multiple users via clients 165 through a multi-tenant environment or multi-tenant model with different physical and virtual resources dynamically assigned and reassigned responsive to different demands within the respective environment. The multi-tenant environment can include a system or architecture that can provide a single instance of software, an application or a software application to serve multiple users. In some embodiments, the computing environment 160 can provide on-demand self-service to unilaterally provision computing capabilities (e.g., server time, network storage) across a network for multiple clients 165. The computing environment 160 can provide an elasticity to dynamically scale out or scale in responsive to different demands from one or more clients 165. In some embodiments, the computing environment 160 can include or provide monitoring services to monitor, control, and/or generate reports corresponding to the provided shared services and resources.
In some embodiments, the computing environment 160 can include and provide different types of cloud computing services. For example, the computing environment 160 can include Infrastructure as a service (IaaS). The computing environment 160 can include Platform as a service (PaaS). The computing environment 160 can include server-less computing. The computing environment 160 can include Software as a service (SaaS). For example, the cloud 175 may also include a cloud based delivery, e.g., Software as a Service (SaaS) 180, Platform as a Service (PaaS) 185, and Infrastructure as a Service (IaaS) 190. IaaS may refer to a user renting the use of infrastructure resources that are needed during a specified time period. IaaS providers may offer storage, networking, servers, or virtualization resources from large pools, allowing the users to quickly scale up by accessing more resources as needed. Examples of IaaS include AMAZON WEB SERVICES provided by Amazon.com, Inc., of Seattle, Washington; RACKSPACE CLOUD provided by Rackspace US, Inc., of San Antonio, Texas; Google Compute Engine provided by Google Inc. of Mountain View, California; or RIGHTSCALE provided by RightScale, Inc., of Santa Barbara, California. PaaS providers may offer functionality provided by IaaS, including, e.g., storage, networking, servers or virtualization, as well as additional resources such as, e.g., the operating system, middleware, or runtime resources. Examples of PaaS include WINDOWS AZURE provided by Microsoft Corporation of Redmond, Washington; Google App Engine provided by Google Inc.; and HEROKU provided by Heroku, Inc., of San Francisco, California. SaaS providers may offer the resources that PaaS provides, including storage, networking, servers, virtualization, operating system, middleware, or runtime resources. In some embodiments, SaaS providers may offer additional resources including, e.g., data and application resources. 
Examples of SaaS include GOOGLE APPS provided by Google Inc.; SALESFORCE provided by Salesforce.com Inc. of San Francisco, California; or OFFICE 365 provided by Microsoft Corporation. Examples of SaaS may also include data storage providers, e.g., DROPBOX provided by Dropbox, Inc., of San Francisco, California; Microsoft SKYDRIVE provided by Microsoft Corporation; Google Drive provided by Google Inc.; or Apple ICLOUD provided by Apple Inc. of Cupertino, California.
Clients 165 may access IaaS resources with one or more IaaS standards, including, e.g., Amazon Elastic Compute Cloud (EC2), Open Cloud Computing Interface (OCCI), Cloud Infrastructure Management Interface (CIMI), or OpenStack standards. Some IaaS standards may allow clients access to resources over HTTP, and may use Representational State Transfer (REST) protocol or Simple Object Access Protocol (SOAP). Clients 165 may access PaaS resources with different PaaS interfaces. Some PaaS interfaces use HTTP packages, standard Java APIs, JavaMail API, Java Data Objects (JDO), Java Persistence API (JPA), Python APIs, web integration APIs for different programming languages including, e.g., Rack for Ruby, WSGI for Python, or PSGI for Perl, or other APIs that may be built on REST, HTTP, XML, or other protocols. Clients 165 may access SaaS resources through the use of web-based user interfaces, provided by a web browser (e.g., GOOGLE CHROME, Microsoft INTERNET EXPLORER, or Mozilla Firefox provided by Mozilla Foundation of Mountain View, California). Clients 165 may also access SaaS resources through smartphone or tablet applications, including, e.g., Salesforce Sales Cloud or Google Drive app. Clients 165 may also access SaaS resources through the client operating system, including, e.g., Windows file system for DROPBOX.
In some embodiments, access to IaaS, PaaS, or SaaS resources may be authenticated. For example, a server or authentication server may authenticate a user via security certificates, HTTPS, or API keys. API keys may include various encryption standards such as, e.g., Advanced Encryption Standard (AES). Data resources may be sent over Transport Layer Security (TLS) or Secure Sockets Layer (SSL).
For a malicious command-line shell, attackers may obfuscate scripts to escape signature detection. Some machine learning based methods may detect an obfuscated command-line shell at a landing stage of an attack. However, once the attackers take control of an endpoint, the attackers may perform attacks with scripts that are not obfuscated (e.g., port scanning, installing an autorun entry, or dumping credentials). Obfuscated command-line detection may not be sufficient to defend against endpoint attacks. Another choice is a behavior-based detection method. A detection engine may monitor behaviors of a command-line script at runtime and may take an action according to pre-defined policies. The main flaw of the behavior-based detection method is that, by the time a script is detected, it is hard to estimate how much damage the script has already done.
The systems and methods of this technical solution can identify a unique script coding style in each company and detect an abnormal script before the abnormal script is launched. The technology leverages a machine learning method to identify a script coding style within a company and to block suspicious scripts with different coding styles before the scripts are launched in the company's computing environment. This technical solution may be applied to any kind of script.
Referring to
The network 210 can include computer networks such as the Internet, local, wide, metro or other area networks, intranets, satellite networks, other computer networks such as voice or data mobile phone communication networks, and combinations thereof. The network 210 may be any form of computer network that can relay information between the one or more components of the system 200. The network 210 can relay information between server(s) 220 and one or more information sources, such as web servers or external databases, amongst others. In some implementations, the network 210 may include the Internet and/or other types of data networks, such as a local area network (LAN), a wide area network (WAN), a cellular network, a satellite network, or other types of data networks. The network 210 may also include any number of computing devices (e.g., computers, servers, routers, network switches, etc.) that are configured to receive and/or transmit data within the network 210. The network 210 may further include any number of hardwired and/or wireless connections. Any or all of the computing devices described herein (e.g., server 220, computing device 230, or data processing system 240) may communicate wirelessly (e.g., via WiFi, cellular, radio, etc.) with a transceiver that is hardwired (e.g., via a fiber optic cable, a CAT5 cable, etc.) to other computing devices in the network. Any or all of the computing devices described herein (e.g., server 220, computing device 230, or data processing system 240) may also communicate wirelessly with the computing devices of the network via a proxy device (e.g., a router, network switch, or gateway). In some implementations, the network 210 can be similar to or can include the network 170 or a computer network accessible to the computer 100 described herein above in conjunction with
The system 200 can include or interface with at least one server 220. The server 220 may be referred to as a host system, a cloud device, a remote device, a remote entity, or a physical machine. The server 220 can include or correspond to a node, remote devices, remote entities, application servers, or backend server endpoints. The server 220 can be composed of hardware or software components, or a combination of both hardware and software components. The server 220 can include resources for executing one or more applications, such as SaaS applications, network applications, or other applications within a list of available resources maintained by the server 220. The server 220 can include one or more features or functionalities of at least resource management services (e.g., resource management services) or other components within the cloud computing environment (e.g., cloud computing environment). The server 220 can communicate with the computing device 230 via a communication channel established by the network 210, for example.
The system 200 can include or interface with at least one computing device 230. The computing device 230 can include at least one processor and a memory, e.g., a processing circuit. The computing device 230 can include various hardware or software components, or a combination of both hardware and software components. The computing device 230 can be constructed with hardware or software components and can include features and functionalities similar to the client devices 165 described hereinabove in conjunction with
The system 200 can include at least one data processing system 240. The data processing system 240 can include various components to manage a script identification process. The data processing system 240 can include at least one model generator 242. The data processing system 240 can include at least one rule engine 244. The data processing system 240 can include at least one script evaluator 246. The data processing system 240 can include at least one identification engine 248. The data processing system 240 can include at least one control executor 250. The data processing system 240 can include at least one action feedback manager 252. The data processing system 240 can include at least one data repository 254. Individual components (e.g., model generator 242, rule engine 244, script evaluator 246, identification engine 248, control executor 250, action feedback manager 252, or data repository 254) of the data processing system 240 can be composed of hardware, software, or a combination of hardware and software components. Individual components of the data processing system 240 can be in electrical communication with each other. For instance, the model generator 242 can exchange data or communicate with the rule engine 244, script evaluator 246, identification engine 248, control executor 250, action feedback manager 252, or data repository 254. The one or more components of the data processing system 240 can be used to perform features or functionalities, such as identifying at least one script, determining at least one classification, or controlling execution. The data processing system 240 can operate remotely from the server 220, the computing device 230, or other devices in the system 200.
In some cases, the data processing system 240 can be a part of the server 220 or the computing device 230, such as an integrated device, embedded device, a server-operated device, or a device accessible by the administrator of the server 220. For example, the data processing system 240 can perform operations local or on-premise to the computing device 230 or the server 220. One or more components (e.g., model generator 242, rule engine 244, script evaluator 246, identification engine 248, control executor 250, action feedback manager 252, or data repository 254) of the data processing system 240 can be executed on the server 220 or the computing device 230. The data processing system 240 can be a part of or correspond to a virtual machine of the server 220 executing an application for the computing device 230. For example, the operations of the data processing system 240 can be performed by the virtual machine assigned to the respective computing device 230. In some cases, one or more components or functions of the data processing system 240 can be packaged into a script, agent, or bot configured to execute on the server 220 or computing device 230.
In some embodiments, the model generator 242 may train an initial model based on one or more scripts. The initial model can be a machine learning baseline model (e.g., eXtreme Gradient Boosting (XGBoot) model). The XGBoot may transform multiple decision trees (hundreds or thousands of decision trees) into a single born-again decision tree that approximates same decision function. The one or more scripts may be developed by one or more entities (e.g., internal entities of a company, external entities of the company, or third-party entities). In certain embodiments, the one or more scripts can be from public source code repositories. The one or more scripts may include one or more features. The one or more features may comprise an indicator of a coding/writing style, a file attribute, or a code quality. The code quality may be defined by a defect metric or a complexity metric. The defect metric can be a number of defects (or severity of the defects) in the script. The complexity metric may include a number of linearly independent paths with the script or a testing time of the script. The model generator 242 may train an initial model (e.g., machine learning baseline model) to “learn” a coding style by leveraging machine learning. The model generator 242 may extract the one or more features (e.g., writing styles, functions, file attributes, or code qualities) from a script to train the machine learning model with a coding style. There can be several definitions/features to train the machine learning model.
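The complexity metric mentioned above, the number of linearly independent paths through a script, can be approximated by counting branch points and adding one. The sketch below is an assumption-laden illustration: the keyword list is deliberately incomplete, the function name approx_path_count is hypothetical, and keywords inside strings or comments are not excluded.

```python
import re

# Keywords assumed to open a new branch in a PowerShell-like script.
BRANCH = re.compile(r"\b(?:if|elseif|for|foreach|while|catch)\b", re.IGNORECASE)


def approx_path_count(script: str) -> int:
    """Approximate the number of linearly independent paths as branches + 1."""
    return len(BRANCH.findall(script)) + 1


sample = "if ($a) { foreach ($x in $xs) { } } elseif ($b) { }"
paths = approx_path_count(sample)
```

The sample contains three branch keywords (if, foreach, elseif), so the approximation yields four paths.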
The model generator 242 may receive one or more scripts from a third-party repository. The model generator 242 may train an initial model based on one or more features of the one or more scripts. The one or more features may indicate a coding style, a file attribute, or a code quality. The model generator 242 may receive one or more scripts developed by an entity. The initial model may classify the one or more scripts of the entity into a category in the initial model trained based on the one or more scripts from the third-party repository. For example, the initial model may include categories A, B, C, and D. The model generator 242 may classify a first script developed by the entity into category C, and may classify a second script from the third-party repository into category A, B, or D, according to the training processes. The model generator 242 may train the model as a binary classifier to output the classification as one of internal or external based on the initial model and the category. The binary classifier may categorize new observations into one of two class labels (e.g., spam or not spam) via a supervised learning algorithm. After the initial model is trained, the model generator 242 may generate a model.
Different administrators may write command-line shell codes in different styles.
“Naming Conventions”, “Bracket Positions”, “Max Line Length”, “Trailing Whitespace”, “Space around Keywords”, “Style of Cmdlet”, or “Indentation” can be extracted to identify a coding style of a script, as shown in Table 1.
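As a minimal sketch of how a few of these style features might be measured, the following Python function scans a script's text and returns ratios for naming convention, bracket position, trailing whitespace, and indentation, plus the maximum line length. The function name, the feature names, and the regular expressions are illustrative assumptions and are not the encoding formulas of Table 1.

```python
import re

def style_features(source: str) -> dict:
    """Compute a few illustrative coding-style features for one script."""
    lines = source.splitlines()
    # Identifiers: PowerShell-style variables such as $user_name or $userName.
    names = re.findall(r"\$([A-Za-z_][A-Za-z0-9_]*)", source)
    snake = sum(1 for n in names if "_" in n)
    camel = sum(1 for n in names if re.search(r"[a-z][A-Z]", n))
    # Opening brackets: at the end of a line vs. alone at the head of a line.
    stripped = [ln.strip() for ln in lines]
    end_brackets = sum(1 for s in stripped if len(s) > 1 and s.endswith("{"))
    head_brackets = sum(1 for s in stripped if s == "{")

    def ratio(part, whole):
        return part / whole if whole else 0.0

    return {
        "snake_case_ratio": ratio(snake, len(names)),
        "camel_case_ratio": ratio(camel, len(names)),
        "bracket_end_of_line_ratio": ratio(end_brackets, end_brackets + head_brackets),
        "max_line_length": max((len(ln) for ln in lines), default=0),
        "trailing_whitespace_ratio": ratio(
            sum(1 for ln in lines if ln != ln.rstrip()), len(lines)),
        "tab_indent_ratio": ratio(
            sum(1 for ln in lines if ln.startswith("\t")),
            sum(1 for ln in lines if ln[:1] in (" ", "\t"))),
    }

# Hypothetical sample script used only to exercise the function.
SAMPLE = (
    "function Get-Total {\n"
    "    $item_count = 3 \n"
    "    $item_price = 2\n"
    "    if ($item_count -gt 0) {\n"
    "        return $item_count * $item_price\n"
    "    }\n"
    "}\n"
)
features = style_features(SAMPLE)
```

For the sample above, every variable uses snake-case and every opening bracket sits at the end of a line, so both ratios evaluate to 1.0.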
A computing device can execute a script to manage a system resource by updating Group Policy Objects (GPOs), modifying the Registry, or changing files. For example, some operations, cmdlets, or native application programming interfaces (APIs) can be used frequently in some companies according to business and working scenarios. In contrast, some functions may seldom be used in those companies. In such a case, “Percentage of Registry operation”, “Percentage of File operation”, “Percentage of Process operation”, “Percentage of Network operation”, or “Number of DllImport” can be extracted to identify a coding style of a script, as shown in Table 2.
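The operation percentages could be approximated by counting cmdlet names per category, as in the sketch below. The keyword-to-category mapping, the function name, and the sample script are assumptions made for illustration; a real deployment would likely use a far fuller list and a proper parser rather than substring counts.

```python
import re

# Hypothetical keyword-to-category mapping (an assumption, not from Table 2).
OPERATION_KEYWORDS = {
    "registry": ("Get-ItemProperty", "Set-ItemProperty", "New-ItemProperty"),
    "file": ("Get-Content", "Set-Content", "Copy-Item", "Remove-Item"),
    "process": ("Start-Process", "Stop-Process", "Get-Process"),
    "network": ("Invoke-WebRequest", "Invoke-RestMethod", "Test-Connection"),
}

def operation_features(source: str) -> dict:
    """Return the share of each operation category and the DllImport count."""
    lowered = source.lower()  # cmdlet names are case-insensitive
    counts = {
        category: sum(lowered.count(k.lower()) for k in keywords)
        for category, keywords in OPERATION_KEYWORDS.items()
    }
    total = sum(counts.values())
    features = {
        f"pct_{category}": (n / total if total else 0.0)
        for category, n in counts.items()
    }
    features["num_dllimport"] = len(re.findall(r"\bDllImport\b", source))
    return features

# Hypothetical sample: two file operations, one registry, one process.
SAMPLE = r"""
Get-Content C:\logs\app.log
Copy-Item a.txt b.txt
Set-ItemProperty -Path HKLM:\Software\Demo -Name Mode -Value 1
Start-Process notepad.exe
[DllImport("user32.dll")]
"""
features = operation_features(SAMPLE)
```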
Some administrators may save their scripts in specific folders or repositories. A size of each script may also reflect a coding style so that the model generator can extract more features from file attributes. In such case, “Path”, “Size”, “Attribute”, “Zone.Identifier”, “Editor”, or “Timestamp” can be extracted to identify a coding style of a script, as shown in Table 3. “Editor” may mean/indicate a last process name that edits or creates the file. If the administrator edits the file, the Editor can be “notepad” or “vscode.” If the file is downloaded, the Editor can be “chrome” or “outlook.”
Lint or other static scan tools may provide an evaluation of code quality, which can also be used to identify a coding style of a script, as shown in Table 4.
The above features and encoding formulas are provided purely as examples. The model generator may use more features and more complex encoding formulas for model training.
The model generator 242 may determine or create one or more clusters from the scripts. A cluster can be a group of data objects with one or more similar features. For example, the model generator 242 may scan one or more scripts from open-source repositories and may extract the above coding style features (e.g., bracket positions and naming conventions) from the one or more scripts. An unsupervised learning approach can be applied to the data set to train a cluster model M. A K-means clustering algorithm may be used. The source codes of the data set may be clustered into K categories (e.g., 3 categories). Each group or cluster may have a threshold value for each feature. In some embodiments, the Elbow method or the Silhouette method can be used to determine an optimal value of K. The scripts with one or more similar features (e.g., similar bracket positions and similar naming conventions) may be grouped in a same group. For example, script A has 95% of brackets located at the end of a line and 97% of names in snake-case, script B has 90% of brackets located at the head of a line and 98% of names in camel-case, and script C has 80% of brackets located at the end of a line and 90% of names in snake-case. The script A and the script C can be clustered together since the cluster has a threshold of 70% of brackets located at the end of a line and 80% of names in snake-case.
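The threshold comparison in the example of scripts A, B, and C can be sketched as follows. This shows only the membership check against a cluster's per-feature thresholds, not K-means training itself. The values for script B are assumed complements (brackets at the head of a line treated as not at the end of a line, camel-case treated as not snake-case), and all names are illustrative.

```python
# Per-feature thresholds of one cluster, taken from the example in the text.
CLUSTER_THRESHOLDS = {"bracket_end_of_line": 0.70, "snake_case": 0.80}

SCRIPTS = {
    "A": {"bracket_end_of_line": 0.95, "snake_case": 0.97},
    # Assumed complements of "90% at head of line" and "98% camel-case".
    "B": {"bracket_end_of_line": 0.10, "snake_case": 0.02},
    "C": {"bracket_end_of_line": 0.80, "snake_case": 0.90},
}

def meets_thresholds(features: dict, thresholds: dict) -> bool:
    """True when every feature value reaches the cluster's threshold."""
    return all(features[name] >= minimum for name, minimum in thresholds.items())

members = sorted(
    name for name, features in SCRIPTS.items()
    if meets_thresholds(features, CLUSTER_THRESHOLDS)
)
```

Under these assumptions `members` contains scripts A and C, matching the grouping described above.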
The model generator 242 may utilize the one or more clusters (e.g., A, B, and C clusters) to classify the one or more scripts. Most scripts from a certain company may be classified into one or more clusters (e.g., C cluster) of the one or more clusters. The model generator 242 may match or mark a script from the certain company to one or more clusters (e.g., C cluster). The model generator 242 may generate a category for the script from the certain company. The category may include one cluster or a combination of different clusters. The category can be “internal” or “external.” “Internal” may mean or indicate that a coding style of scripts in this category is widely used in the certain company. “External” may mean or indicate that a coding style of scripts in this category is unfamiliar to the certain company.
The model generator 242 may extract the one or more features from the one or more scripts to build multi-dimensional feature vectors. The model generator 242 may build a supervised machine learning model according to the multi-dimensional feature vectors. For example, the model generator may train a binary classification supervised learning model using a support vector machine (SVM) algorithm. The machine learning may comprise at least one of a support vector machine, a linear kernel function, or a radial basis kernel function. The support vector machine can be a supervised machine learning model that uses classification algorithms for two-group classification problems. The linear kernel function and the radial basis kernel function may take data as input and transform the data into a required form. Specifically, the radial basis kernel function may have a localized and finite response along the entire x-axis.
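For illustration, the two kernel functions named above can be written in their standard textbook forms: the linear kernel as a dot product, and the radial basis function (RBF) kernel as exp(-gamma * ||x - y||^2). The function names and the default `gamma` value are assumptions of this sketch.

```python
import math

def linear_kernel(x, y):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=1.0):
    """exp(-gamma * squared Euclidean distance); always in (0, 1]."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# The RBF response is localized: it peaks at 1.0 for identical vectors
# and decays toward 0 as the vectors move apart.
same = rbf_kernel([0.95, 0.97], [0.95, 0.97])
near = rbf_kernel([0.95, 0.97], [0.80, 0.90])
far = rbf_kernel([0.95, 0.97], [0.10, 0.02])
```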
Referring back to
The rule engine 244 may trigger the model generator 242 to train a new configuration/model according to a defined event or defined time information. The defined event may comprise a detection error rate (e.g., a false positive rate) of malicious scripts being above a threshold. The defined time information may include a periodic time interval. For example, the rule engine may adjust the configuration periodically (e.g., every month).
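A retraining trigger of this kind might look like the sketch below, which fires when either the false positive rate exceeds a threshold or a periodic interval has elapsed. The 5% threshold, the 30-day interval, and the function name are assumed values chosen only for illustration.

```python
from datetime import datetime, timedelta

def should_retrain(false_positive_rate, last_trained, now,
                   max_false_positive_rate=0.05,
                   interval=timedelta(days=30)):
    """Return True on a defined event (error rate) or defined time (interval)."""
    error_event = false_positive_rate > max_false_positive_rate
    time_event = now - last_trained >= interval
    return error_event or time_event

# Hypothetical timestamps used to exercise both trigger conditions.
trained_at = datetime(2024, 1, 1)
triggered_by_error = should_retrain(0.08, trained_at, datetime(2024, 1, 10))
triggered_by_time = should_retrain(0.01, trained_at, datetime(2024, 2, 15))
not_triggered = should_retrain(0.01, trained_at, datetime(2024, 1, 10))
```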
The rule engine 244 may store a rule (e.g., access permission) of a category/classification (e.g., internal or external) of a script. For example, a script with a classification of “internal” may be authorized for execution by a computing device, and a script with a classification of “external” may be blocked or prevented from execution by a computing device. In some embodiments, the rule engine may adjust the rule in response to an action feedback. For example, an action feedback may indicate that a certain category may be allowed to execute on the computing device. The rule engine may adjust the rule to allow the certain category to execute on the computing device according to the action feedback.
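One way to picture the stored rules and their adjustment by action feedback is a small mapping from classification to permission, as sketched below. The class name, method names, and the choice to block unknown categories by default are assumptions and do not describe the rule engine 244 itself.

```python
class RuleEngineSketch:
    """Maps a script classification to an execution permission."""

    def __init__(self):
        # Default rules from the text: internal is authorized, external blocked.
        self.rules = {"internal": "allow", "external": "block"}

    def decision(self, classification: str) -> str:
        # Assumption: categories without a rule are blocked.
        return self.rules.get(classification, "block")

    def apply_feedback(self, classification: str, allowed: bool) -> None:
        """Adjust the rule for a category according to action feedback."""
        self.rules[classification] = "allow" if allowed else "block"

engine = RuleEngineSketch()
before = engine.decision("external")
engine.apply_feedback("external", allowed=True)
after = engine.decision("external")
```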
In some embodiments, the script evaluator 246 may determine a classification (e.g., a binary classification) of a script via a model trained with machine learning. The model may be trained with machine learning based on one or more scripts established by one or more entities. The binary classification may be “internal” or “external.” “Internal” may indicate that the script is developed internal to an entity. “External” may indicate that the script is developed external to the entity. When there is a new or unknown script, the script evaluator 246 may scan the new script and may determine a classification (e.g., internal or external) of the new script prior to execution of the new script by the computing device. For example, the script evaluator may classify a new script as “internal” based on a trained support vector machine (SVM) algorithm. The script evaluator 246 can be used to detect an abnormal script before the abnormal script is launched by the computing device. Since a company often follows the same coding style guide, the script evaluator 246 may detect any scripts using an unfamiliar coding style.
The script evaluator 246 may determine a rating/score of a script via a model trained with machine learning (e.g., a trained machine learning baseline model). The script evaluator 246 may determine a classification of the script based on the rating/score of the script. For example, the script evaluator 246 may assign a script a “good” rating that is lower than 50%. The script evaluator 246 may classify the script as “external” based on the “good” rating being lower than 50%.
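The mapping from a rating to a classification can be sketched as a single threshold comparison. The text states only that a “good” rating lower than 50% yields “external”; treating a rating of 50% or higher as “internal” is an assumption of this sketch, as is the function name.

```python
def classify_by_rating(good_rating: float, threshold: float = 0.5) -> str:
    """Classify a script from its "good" rating in the range 0.0 to 1.0."""
    # Assumption: ratings at or above the threshold count as "internal".
    return "external" if good_rating < threshold else "internal"

low = classify_by_rating(0.35)
high = classify_by_rating(0.82)
```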
The script evaluator 246 may evaluate one or more values for one or more features by the trained model to determine a classification. For instance, the script evaluator 246 may obtain a first value (e.g., 70%) of greater than 50% using snake-case and a second value (e.g., 34%) of less than 40% using a full name for a script. The script evaluator 246 may compare the first value and the second value of the script with different thresholds from different clusters trained by the model. The script evaluator may decide that the first value and the second value of the script are closest to cluster D (e.g., an internal cluster). The script evaluator may determine that the classification of the script is “internal.”
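A nearest-cluster decision for the two values in this example might be sketched as below. The cluster centers and labels are hypothetical numbers invented for illustration, and Euclidean distance to a center is an assumed stand-in for the threshold comparison described above.

```python
import math

# Hypothetical cluster centers: (snake-case ratio, full-name ratio).
CLUSTERS = {
    "A": {"center": (0.20, 0.90), "label": "external"},
    "B": {"center": (0.95, 0.85), "label": "external"},
    "D": {"center": (0.75, 0.30), "label": "internal"},
}

def classify(values):
    """Return the closest cluster name and that cluster's label."""
    name = min(CLUSTERS, key=lambda c: math.dist(values, CLUSTERS[c]["center"]))
    return name, CLUSTERS[name]["label"]

# First value 70% snake-case, second value 34% full names.
cluster_name, classification = classify((0.70, 0.34))
```

With these assumed centers, the script lands in cluster D and is classified as internal.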
The identification engine 248 may receive a script for execution by a computing device of an entity. In some cases, the identification engine 248 may intercept a script transmitted from the server to the computing device prior to receipt by the computing device of the script. For example, the identification engine may receive one or more scripts from a third-party repository. As another example, the identification engine may receive one or more scripts developed by a certain entity. The identification engine may receive the script from a remote device configured to remotely manage the computing device. The script can be compatible with one or more different platforms and configured with a command-line shell.
The identification engine 248 may identify the script for execution. The identification engine 248 can identify a script responsive to receiving the script from a remote source, or responsive to receiving an instruction to execute the script. The script can be configured to perform a function or execute a process on the computing device 230. In some other cases, a script can be used for creating a hooking process for a software component. The identification engine 248 may determine a purpose of a script and forward the script to a destination according to the purpose. The identification engine 248 may identify the script based on a source (e.g., a path or a URL address) of the script. The identification engine 248 can identify a unique identifier for the script, a filepath for the script, a developer of the script, or a source of the script. After the identification, the identification engine 248 may forward the script to a corresponding engine in the data processing system. For example, if the identification engine identifies a new script for execution, the identification engine may forward the new script to the script evaluator. As another example, if the identification engine identifies a script (or one or more scripts) for model training, the identification engine may forward the script (or the one or more scripts) to the model generator.
In some embodiments, the control executor 250 may control execution of the script responsive to the classification of the script. For instance, if a script is classified as “external,” the control executor may prevent execution of the script on the computing device. The control executor 250 may send an alert notification or an analytic result to a user interface (e.g., a user and entity behavior analytics (UEBA) dashboard) of an application. An administrator may make a final decision after reviewing the script. The control executor 250 may terminate the script from execution using a command line. When a script is authorized for execution by the computing device, the control executor may forward the script to the computing device for execution.
The action feedback manager 252 may receive feedback from an entity to improve an accuracy of the trained model. After the control executor 250 sends the analytic result to the user interface of the application, the administrator may provide feedback using the user interface. For example, the action feedback manager 252 may receive feedback, from an administrator, indicating that an “external” script is wrongly classified (e.g., a false alarm), and may update the script with an “internal” tag. The script may be put back into the “internal” data set for continual retraining. The action feedback manager 252 may update the trained machine learning model according to the feedback. As another example, the action feedback manager 252 may receive feedback, from the administrator, indicating that an “internal” script is correctly classified. The action feedback manager 252 may regularly notify the rule engine 244 of the feedback to improve an accuracy of the trained model.
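The relabeling step could be sketched as follows: when the administrator marks a result as a false alarm, the script moves to the opposite data set so it is available for the next retraining. The function name, the set-based data sets, and the script identifiers are illustrative assumptions.

```python
def apply_feedback(datasets: dict, script_id: str, predicted: str,
                   is_false_alarm: bool) -> str:
    """Place the script in the data set matching the administrator's verdict."""
    opposite = {"internal": "external", "external": "internal"}
    label = opposite[predicted] if is_false_alarm else predicted
    # Remove the script from any data set before filing it under its label.
    for members in datasets.values():
        members.discard(script_id)
    datasets[label].add(script_id)
    return label

# Hypothetical data sets keyed by tag.
datasets = {"internal": {"backup.ps1"}, "external": {"poc_test.ps1"}}
new_label = apply_feedback(datasets, "poc_test.ps1", "external", is_false_alarm=True)
```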
In some embodiments, the data repository 254 may be referred to as a central storage or memory of the data processing system 240. The data repository 254 may include one or more storages (e.g., script storage, model storage, feedback storage, rule storage, or execution storage) that can be accessed, modified, or interacted with by one or more components (e.g., model generator 242, rule engine 244, script evaluator 246, identification engine 248, control executor 250, or action feedback manager 252) of the data processing system 240. In some cases, the one or more storages of the data repository 254 can be accessed by one or more other authorized devices of the system 200, such as the server 220. The data repository 254 can include other storages to store additional data from one or more components of the data processing system 240 or data from other devices of the system 200, for example.
At ACT 502, the method 500 can include the data processing system identifying or obtaining a plurality of scripts. The data processing system may prepare or establish a training set and a testing set for supervised machine learning. The data processing system may receive the scripts from a third-party repository or an enterprise repository. In a non-limiting example, the scripts may be from a certain entity that intends to use the method 500. The scripts may be received from a remote device. The scripts may be compatible with one or more platforms. The data processing system may use the plurality of scripts as training samples for training a machine learning model.
At ACT 504, the method 500 can include the data processing system performing feature extraction. The data processing system may extract features from the training samples to build feature vectors of, for example, eight or more dimensions. The feature vectors may include information on writing styles, functions, file attributes, or code qualities. The feature extraction may reduce a number of features in a dataset by creating new features from the existing ones. The reduced set of features (e.g., the feature vectors) may summarize most of the information in the original set of features. The reduced set of features can be created from a combination of the original set of features.
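Assembling extracted features into a fixed-order vector might look like the sketch below. The eight feature names and their order are hypothetical, missing features default to 0.0 by assumption, and the sketch does not perform the feature reduction mentioned above.

```python
# Hypothetical fixed ordering of eight features (an assumption).
FEATURE_ORDER = (
    "snake_case_ratio", "bracket_end_of_line_ratio", "max_line_length",
    "trailing_whitespace_ratio", "pct_registry", "pct_file",
    "file_size_bytes", "lint_warning_count",
)

def to_feature_vector(features: dict) -> list:
    """Return the features as floats in FEATURE_ORDER; missing values are 0.0."""
    return [float(features.get(name, 0.0)) for name in FEATURE_ORDER]

vector = to_feature_vector({
    "snake_case_ratio": 0.97,
    "bracket_end_of_line_ratio": 0.95,
    "max_line_length": 80,
    "file_size_bytes": 2048,
})
```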
At ACT 506, the method 500 can include the data processing system performing supervised learning to generate or update a classification model. The data processing system may build a supervised machine learning model by analyzing the training samples. The supervised machine learning model may produce an inferred function, which can be used for mapping new samples (e.g., scripts). For example, the data processing system can use eXtreme Gradient Boosting (XGBoost) for this task. The XGBoost model may comprise multiple decision trees (hundreds or thousands of decision trees), which may be transformed into a single born-again decision tree that approximates the same decision function. The result can be a well-trained machine learning model.
At ACT 508, the method 500 can include the data processing system receiving or identifying a new script. By hooking process creation, the data processing system can block a new script process before it launches in order to evaluate the new script and determine whether to allow or authorize execution of the new script. The data processing system can parse a command line of the new script to get a path of the script or cmdlets. The data processing system may input the path and cmdlets of the new script into the trained model.
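Parsing the command line for a script path or cmdlets could be sketched as below. The sketch assumes a PowerShell-style invocation with an optional `-File` argument and Verb-Noun cmdlet names; the function name, the regular expressions, and the sample command lines are illustrative only, and the process-creation hook itself is outside the scope of the sketch.

```python
import re

def parse_command_line(command_line: str) -> dict:
    """Extract a script path and any Verb-Noun cmdlets from a command line."""
    # Split on whitespace while keeping double-quoted arguments together.
    tokens = [t.strip('"') for t in re.findall(r'"[^"]*"|\S+', command_line)]
    script_path = None
    for index, token in enumerate(tokens):
        if token.lower() == "-file" and index + 1 < len(tokens):
            script_path = tokens[index + 1]
            break
    if script_path is None:
        # Fall back to the first argument that looks like a script file.
        script_path = next((t for t in tokens if t.lower().endswith(".ps1")), None)
    cmdlets = re.findall(r"\b[A-Z][a-z]+-[A-Z][A-Za-z]+\b", command_line)
    return {"script_path": script_path, "cmdlets": cmdlets}

# Hypothetical command lines.
from_file = parse_command_line(
    r'powershell.exe -ExecutionPolicy Bypass -File "C:\Scripts\nightly backup.ps1"')
inline = parse_command_line('powershell.exe -Command "Get-Process | Stop-Process"')
```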
At ACT 510, the method 500 can include the data processing system performing feature extraction on the newly identified script. The data processing system may extract features from the newly identified script to build higher-dimensional feature vectors. The feature vectors may include information about the newly identified script (e.g., writing styles, functions, file attributes, or code qualities). The data processing system may forward the result of feature extraction of the newly identified script to a trained classification model.
At ACT 512, the method 500 can include the data processing system inputting the extracted features into the classification model generated or updated at ACT 506. The data processing system may use the trained model to get a rating of the newly identified script. The classification model may determine a classification of the newly identified script based on the rating of the newly identified script. For example, the classification model may classify the newly identified script as “external” based on the “good” rating being lower than 50%.
At ACT 514, the method 500 can include the data processing system controlling an action. Controlling an action can refer to or include the data processing system sending an alert notification to a user interface of an application. For example, when a rating of “good” is lower than 50%, the data processing system can notify other security components to take an action. An administrator may provide feedback in response to the notification. The data processing system may improve an accuracy of the classification model based on the feedback.
Referring to
In method 900, the data processing system may input real-time data into the binary classification model B 808. The real-time data input can come from either Windows Event Log or online captures from network appliances 702. The data processing system may extract features 704 and may send the features to the trained model B 808 when a new file is dropped or captured. At ACT 902, the data processing system may get a classified result “internal.” At ACT 904, the data processing system may get a classified result “external.” The output “external” 904 may indicate that the style of this script is not a standard style in the company. At ACT 906, the data processing system may send an alert notification 906 based on the output via an interface of an application. At ACT 908, after receiving feedback from an administrator corresponding to the alert notification, the data processing system can forward the feedback to the binary classification model B. For example, the administrator 908 may mark the output as a false alarm if the output is not correct. The sample may be put back into the “internal” 802 or “external” 804 data set for continual retraining.
Still referring to
At operation 1210, the one or more processors can determine, via a model trained with machine learning based on a plurality of scripts established by a plurality of entities, a classification of the script prior to execution of the script by the computing device. The one or more processors may classify the script as “internal” or “external” via the trained model. “Internal” may indicate that the script is developed internal to an entity. “External” may indicate that the script is developed external to the entity. The one or more processors may detect any “external” scripts using an unfamiliar coding style.
At operation 1215, the one or more processors can control execution of the script responsive to the classification of the script. For instance, if a script is classified as “internal,” the one or more processors may allow execution of the script on the computing device. In another case, if the script is classified as “external,” the one or more processors may prevent execution of the script on the computing device. The one or more processors may send an alert notification or an analytic result to a user interface of an application. A user may make a final decision (e.g., to delete or keep the script) after reviewing the script.
The following examples pertain to further embodiments, from which numerous permutations and configurations will be apparent.
Example 1 includes a system, comprising: one or more processors, coupled to memory, to: identify a script for execution by a computing device of an entity. The one or more processors can determine, via a model trained with machine learning based on a plurality of scripts established by a plurality of entities, a classification of the script prior to execution of the script by the computing device. The one or more processors can control execution of the script responsive to the classification of the script.
Example 2 includes the subject matter of Example 1, wherein the data processing system can be intermediary to the computing device and one or more servers. The one or more processors can be configured to intercept the script transmitted from the one or more servers to the computing device prior to receipt by the computing device of the script. The one or more processors can be configured to determine, via the model and prior to forwarding the script to the computing device, that the classification indicates the script is authorized for execution by the computing device. The one or more processors can be configured to forward the script to the computing device for execution responsive to the script being authorized for execution by the computing device.
Example 3 includes the subject matter of any of Examples 1 and 2, wherein the computing device may comprise the data processing system. The one or more processors can be configured to receive, via a network, the script from a remote device configured to remotely manage the computing device, the script compatible with a plurality of different platforms and configured with a command-line shell. The one or more processors can be configured to determine, responsive to receipt of the script from the remote device and prior to execution of the script on the computing device, the classification of the script. The one or more processors can be configured to control, responsive to the classification, execution of the script to prevent execution of the script or allow execution of the script on the computing device.
Example 4 includes the subject matter of any of Examples 1 through 3, wherein the data processing system is further configured to: prevent execution of the script by the computing device responsive to the classification comprising an indication that the script was developed by a second entity that is different from the entity, wherein the computing device is managed by the entity.
Example 5 includes the subject matter of any of Examples 1 through 4, wherein the data processing system is further configured to: determine, via the model, the classification of the script as one of developed internal to the entity or developed external to the entity.
Example 6 includes the subject matter of any of Examples 1 through 5, wherein the data processing system is further configured to: scan the script to identify a plurality of values for a plurality of features of the script; and input the plurality of values for the plurality of features into the model to determine the classification.
Example 7 includes the subject matter of any of Examples 1 through 6, wherein the model is trained with a plurality of features of the plurality of scripts developed by the plurality of entities, the plurality of features comprising one or more features indicative of a coding style, a file attribute, or a code quality.
Example 8 includes the subject matter of any of Examples 1 through 7, wherein the data processing system is further configured to: determine, for the entity, a plurality of features established in the model for the entity, the plurality of features comprising at least one of a naming convention, bracket position, maximum line length, trailing whitespace, space around keywords, style of cmdlet, or indentation; and scan the script to identify a plurality of values for the plurality of features established for the entity.
Example 9 includes the subject matter of any of Examples 1 through 8, wherein: a first value for a first feature of the plurality of features corresponding to the naming convention indicates an amount of words in the script that use a camel-case or a snake-case; and a second value for a second feature of the plurality of features corresponding to the bracket position indicates that a bracket is located at an end of a line in the script or the bracket is located at a head of the line in the script.
Example 10 includes the subject matter of any of Examples 1 through 9, wherein the data processing system is further configured to: receive a second plurality of scripts from a third-party repository; train an initial model based on a plurality of features of the second plurality of scripts, the plurality of features comprising one or more features indicative of a coding style, a file attribute, or a code quality; receive a third plurality of scripts developed by the entity; classify, via the initial model, the third plurality of scripts of the entity into a category in the initial model trained based on the second plurality of scripts from the third-party repository; and train, based on the initial model and the category, the model as a binary classifier to output the classification as one of internal or external.
Example 11 includes the subject matter of any of Examples 1 through 10, wherein the machine learning comprises at least one of a support vector machine, a linear kernel function, or a radial basis kernel function.
Example 12 includes a method, comprising: identifying, by a data processing system comprising one or more processors coupled with memory, a script for execution by a computing device of an entity; determining, by the data processing system via a model trained with machine learning based on a plurality of scripts established by a plurality of entities, a classification of the script prior to execution of the script by the computing device; and controlling, by the data processing system, execution of the script responsive to the classification of the script.
Example 13 includes the subject matter of Example 12, wherein the data processing system is intermediary to the computing device and one or more servers, comprising: intercepting, by the data processing system, the script transmitted from the one or more servers to the computing device prior to receipt by the computing device of the script; determining, by the data processing system via the model and prior to forwarding the script to the computing device, that the classification indicates the script is authorized for execution by the computing device; and forwarding, by the data processing system, the script to the computing device for execution responsive to the script being authorized for execution by the computing device.
Example 14 includes the subject matter of any of Examples 12 and 13, wherein the computing device comprises the data processing system, comprising: receiving, by the data processing system via a network, the script from a remote device configured to remotely manage the computing device, the script compatible with a plurality of different platforms and configured with a command-line shell; determining, by the data processing system responsive to receipt of the script from the remote device and prior to execution of the script on the computing device, the classification of the script; and controlling, by the data processing system responsive to the classification, execution of the script to prevent execution of the script or allow execution of the script on the computing device.
Example 15 includes the subject matter of any of Examples 12 through 14, comprising: preventing, by the data processing system, execution of the script by the computing device responsive to the classification comprising an indication that the script was developed by a second entity that is different from the entity, wherein the computing device is managed by the entity.
Example 16 includes the subject matter of any of Examples 12 through 15, comprising: determining, by the data processing system via the model, the classification of the script as one of developed internal to the entity or developed external to the entity.
Example 17 includes the subject matter of any of Examples 12 through 16, comprising scanning, by the data processing system, the script to identify a plurality of values for a plurality of features of the script; and inputting, by the data processing system, the plurality of values for the plurality of features into the model to determine the classification.
Example 18 includes the subject matter of any of Examples 12 through 17, comprising receiving, by the data processing system, a second plurality of scripts from a third-party repository; training, by the data processing system via the machine learning, an initial model based on a plurality of features of the second plurality of scripts, the plurality of features comprising one or more features indicative of a coding style, a file attribute, or a code quality, wherein the machine learning comprises at least one of a support vector machine, a linear kernel function, or a radial basis kernel function; receiving, by the data processing system, a third plurality of scripts developed by the entity; classifying, by the data processing system via the initial model, the third plurality of scripts of the entity into a category in the initial model trained based on the second plurality of scripts from the third-party repository; and training, by the data processing system, based on the initial model and the category, the model as a binary classifier to output the classification as one of internal or external.
Example 19 includes a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to: identify a script for execution by a computing device of an entity; determine, via a model trained with machine learning based on a plurality of scripts established by a plurality of entities, a classification of the script prior to execution of the script by the computing device; and control execution of the script responsive to the classification of the script.
Example 20 includes the subject matter of Example 19, wherein the model comprises a plurality of features comprising at least one of a naming convention, bracket position, maximum line length, trailing whitespace, space around keywords, style of cmdlet, or indentation.
Various elements, which are described herein in the context of one or more embodiments, may be provided separately or in any suitable subcombination. For example, the processes described herein may be implemented in hardware, software, or a combination thereof. Further, the processes described herein are not limited to the specific embodiments described. For example, the processes described herein are not limited to the specific processing order described herein and, rather, process blocks may be re-ordered, combined, removed, or performed in parallel or in serial, as necessary, to achieve the results set forth herein.
It should be understood that the systems described above may provide multiple ones of any or each of those components and these components may be provided on either a standalone machine or, in some embodiments, on multiple machines in a distributed system. The systems and methods described above may be implemented as a method, apparatus or article of manufacture using programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof. In addition, the systems and methods described above may be provided as one or more computer-readable programs embodied on or in one or more articles of manufacture. The term “article of manufacture” as used herein is intended to encompass code or logic accessible from and embedded in one or more computer-readable devices, firmware, programmable logic, memory devices (e.g., EEPROMs, ROMs, PROMs, RAMs, SRAMs, etc.), hardware (e.g., integrated circuit chip, Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), etc.), electronic devices, or a computer readable non-volatile storage unit (e.g., CD-ROM, USB Flash memory, hard disk drive, etc.). The article of manufacture may be accessible from a file server providing access to the computer-readable programs via a network transmission line, wireless transmission media, signals propagating through space, radio waves, infrared signals, etc. The article of manufacture may be a flash memory card or a magnetic tape. The article of manufacture includes hardware logic as well as software or programmable code embedded in a computer readable medium that is executed by a processor. In general, the computer-readable programs may be implemented in any programming language, such as LISP, PERL, C, C++, C#, PROLOG, or in any byte code language such as JAVA. The software programs may be stored on or in one or more articles of manufacture as object code.
References to “or” can be construed as inclusive so that any terms described using “or” can indicate any of a single, more than one, and all of the described terms. References to at least one of a conjunctive list of terms can be construed as an inclusive OR to indicate any of a single, more than one, and all of the described terms. For example, a reference to “at least one of ‘A’ and ‘B’” can include only ‘A’, only ‘B’, as well as both ‘A’ and ‘B’. Such references used in conjunction with “comprising” or other open terminology can include additional items.
While various embodiments of the methods and systems have been described, these embodiments are illustrative and in no way limit the scope of the described methods or systems. Those having skill in the relevant art can effect changes to form and details of the described methods and systems without departing from the broadest scope of the described methods and systems. Thus, the scope of the methods and systems described herein should not be limited by any of the illustrative embodiments and should be defined in accordance with the accompanying claims and their equivalents.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/CN2022/120187 | 9/21/2022 | WO | |