The subject matter described herein relates to a platform for characterizing trustworthiness of files uploaded into a file system (e.g., cloud service serving multiple tenants, on-premise solution, etc.).
Cybersecurity threats such as ransomware are designed to evade modern security tools by delivering, into a computing environment, files that include code which, when executed, implements various malicious activities. Given the increasing sophistication of these threats, such files may bypass tools within the computing environment, resulting in problematic files being stored or accessed.
In one aspect, a notification message is received indicating an upload of a file to a cloud service. An analysis engine (which can execute one or more machine learning models or other analysis operations) can generate information that characterizes the file and which can be indicative of a level of trustworthiness for the file. In response to the generated information, each of a plurality of judges is notified to commence or revisit a judging process. In response to the notifications, the judges (which can execute one or more machine learning models or other analysis operations) retrieve the generated information and determine a respective trustworthiness score for the file. These scores can be stored in a corresponding judge database, and/or data characterizing the determined trustworthiness scores can be provided to a consuming application or process.
The generated information can characterize different aspects of the file such as attributes or capabilities which can, in turn, be used by the analysis engine to determine an intent of the file (e.g., administrative file, ransomware, etc.). The attributes can indicate, for example, whether the file: is packed, is signed, is encrypted, includes code causing other files to be encrypted, includes code causing deletion of files, or includes code causing files to be uploaded.
In some cases, the plurality of judges are associated with a single endpoint (i.e., computing device, etc.), process, service or session and comprise a subset of available judges while other judges are associated with one or more other endpoints, processes, services and/or sessions. In other cases, the plurality of judges can be associated with a pre-defined group of endpoints, processes, services, and/or sessions and comprise a subset of available judges while other judges are associated with groups of one or more other endpoints, processes, services, and/or sessions. In some variations, the plurality of judges are associated with a single tenant (e.g., a single cloud customer accessing shared computing resources) and comprise a subset of available judges while other judges are associated with one or more other tenants (e.g., other cloud customers sharing those same computing resources).
The new file notification message can be a simple queue service (SQS) message.
Each of the judges can comprise or execute a different type of machine learning model. In some variations, at least two of the judges can comprise or execute a same type of machine learning model, with each such model being uniquely trained.
The consuming application or process can initiate a remediation action in response to at least one of the provided determined trustworthiness scores. The remediation action can include, for example, quarantining the file, deleting the file, preventing access to the file, or initiating one or more antiransomware obfuscation processes.
A worker can process the file notification message for ingestion by a pipeline. The pipeline can coordinate workflows with each of a plurality of analyzers.
The cloud service can serve multiple tenants and the determined trustworthiness scores can be stored on a tenant-by-tenant basis.
In an interrelated aspect, a query is received requesting a score for a file stored by a file management system (e.g., cloud service, on-premise storage, etc.). Thereafter, a tenant identification (ID) is determined for the query. A judge database associated with the tenant ID is queried for the score and this score is returned to the requestor (e.g., endpoint, process, service, session, etc.).
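The tenant-scoped lookup described above can be illustrated with a minimal sketch. The in-memory dictionaries, the requestor and tenant identifiers, and the function names `resolve_tenant_id` and `query_score` are all hypothetical stand-ins chosen for illustration; an actual judge database and tenant resolution mechanism may differ.

```python
from typing import Dict, Optional

# Hypothetical stand-in for per-tenant judge databases:
# tenant ID -> {file hash -> trustworthiness score}.
JUDGE_DATABASES: Dict[str, Dict[str, float]] = {
    "tenant-a": {"abc123": 0.92},
    "tenant-b": {"abc123": 0.35},
}

# Hypothetical mapping from a requestor (endpoint, process, service,
# session) to the tenant it belongs to.
REQUESTOR_TENANTS: Dict[str, str] = {
    "endpoint-1": "tenant-a",
    "endpoint-2": "tenant-b",
}


def resolve_tenant_id(requestor_id: str) -> str:
    """Determine the tenant ID for the requestor issuing the query."""
    return REQUESTOR_TENANTS[requestor_id]


def query_score(requestor_id: str, file_hash: str) -> Optional[float]:
    """Return the score stored for this file in the requestor's tenant
    database, or None if that tenant has no score for the file."""
    tenant_id = resolve_tenant_id(requestor_id)
    return JUDGE_DATABASES.get(tenant_id, {}).get(file_hash)
```

In this sketch the same file hash yields different scores for different tenants, which mirrors the tenant-by-tenant storage of scores.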
Non-transitory computer program products (i.e., physically embodied computer program products) are also described that store instructions, which when executed by one or more data processors of one or more computing systems, cause at least one data processor to perform operations herein. Similarly, computer systems are also described that may include one or more data processors and memory coupled to the one or more data processors. The memory may temporarily or permanently store instructions that cause at least one processor to perform one or more of the operations described herein. In addition, methods can be implemented by one or more data processors either within a single computing system or distributed among two or more computing systems. Such computing systems can be connected and can exchange data and/or commands or other instructions or the like via one or more connections, including but not limited to a connection over a network (e.g., the Internet, a wireless wide area network, a local area network, a wide area network, a wired network, or the like), via a direct connection between one or more of the multiple computing systems, etc.
The subject matter described herein provides many technical advantages. For example, the current subject matter provides enhanced techniques for characterizing the trustworthiness of a file which can be triggered when such files are uploaded to a cloud storage service.
The details of one or more variations of the subject matter described herein are set forth in the accompanying drawings and the description below. Other features and advantages of the subject matter described herein will be apparent from the description and drawings, and from the claims.
Like reference symbols in the various drawings indicate like elements.
The current subject matter is directed to a file analysis and scoring system which allows for automated and on-demand processing of executable files to extract intelligence and to decide on what action to take on devices (e.g., nodes, computing devices, etc.) that attempt to access or otherwise find such files. This system can be used to identify files having malicious content and, in particular, to identify files likely to contain or initiate undesired actions such as the deployment of malware such as ransomware. Trustworthiness scores for a file can be determined using different judges which can be specifically tailored to the requestor (e.g., endpoint or other computing device, process, service, session, tenant, etc.).
The analysis engine 110 can comprise a worker module which utilizes a plurality of workers to perform tasks in parallel to extract messages (e.g., SQS messages) and route such messages through a pipeline for processing by a particular processor. The messages, for example, can be from a message queuing service which routes messages for consumption by various software components. The processors can, for example, each correspond to a different analyzer or group of analyzers for extracting attributes or other features from the file. The pipeline can act as a central broker which coordinates workflows that use external processing units and causes the files to be processed by a respective processor while keeping track of processor availability and results.
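The worker and pipeline arrangement can be sketched as follows. This is an illustrative approximation only: a local `queue.Queue` stands in for the message queuing service, the two analyzers are trivial placeholders, and the names `Pipeline`, `worker`, and `run_workers` are assumptions. Tracking of processor availability is omitted for brevity.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


# Hypothetical analyzers ("processors"); each extracts one kind of attribute.
def size_analyzer(data: bytes) -> dict:
    return {"size": len(data)}


def header_analyzer(data: bytes) -> dict:
    return {"is_pe": data[:2] == b"MZ"}


class Pipeline:
    """Central broker: runs every registered processor on a file and
    keeps track of the results per file."""

    def __init__(self, processors: Dict[str, Callable[[bytes], dict]]):
        self.processors = processors
        self.results: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def process(self, message: dict) -> None:
        merged = {name: proc(message["content"])
                  for name, proc in self.processors.items()}
        with self._lock:
            self.results[message["file_id"]] = merged


def worker(inbox: queue.Queue, pipeline: Pipeline) -> None:
    """Pull messages until the queue is empty and route each to the pipeline."""
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            return
        pipeline.process(message)


def run_workers(messages: List[dict], pipeline: Pipeline, num_workers: int = 4) -> None:
    inbox: queue.Queue = queue.Queue()
    for message in messages:
        inbox.put(message)
    # Leaving the "with" block waits for all workers to finish.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for _ in range(num_workers):
            pool.submit(worker, inbox, pipeline)
```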
The file scoring API 140 provides an interface for access to cloud-based data assets including uploading of files and/or accessing files through queries and the like. The file scoring API 140 can take various configurations including a metadata management and governance system which facilitates gathering, processing, and maintaining metadata about cloud-based data assets. The file scoring API 140 can be used by remote computing devices to get scores generated by the judgment engine 130 which are associated with the requestor (endpoint, tenant, etc.). These scores can then be used by such remote computing devices to make a determination of whether or not to access the file. In some variations, the query itself causes the judgment engine 130 to score files responsive to the query.
In response to a trigger (e.g., expiration of an amount of time, a message, etc.), judges 340, 350, 360 access any files in the corresponding queue 310, 320, 330 and generate, for each file, a score (also referred to as a judgment). This score can, for example, be binary (malicious/safe) or it can be a numerical indicator. These scores can be stored in a judgments database 380. The judges 340, 350, 360 can execute the same or different models depending on the desired configuration. With the former, the models may be trained differently, causing different outputs. With the latter, the models can differ such that, for example, one of the judges is more aggressive towards AutoIT files or other anomalies, which results in different actions being implemented.
In some cases, certain files can be certified as being part of an allowlist and stored in a corresponding certificate allowlist database 370. This certification can be used to segregate or otherwise insulate certain files from the judgment process. Additionally or alternatively, files having a score above a pre-defined level can be added to the certificate allowlist database 370. Other files (i.e., those that are not part of the certificate allowlist) can be stored in the analysis database 120. In some cases, the analysis database 120 can include tables for all analyzed files.
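As a simplified illustration of how differently configured judges can reach different judgments on the same file, consider the sketch below. The rule-based `ThresholdJudge` is a hypothetical stand-in for a machine learning model; the attribute names, weights, and threshold are assumptions, and the numeric score here is treated as a risk indicator in which higher values are more suspicious.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ThresholdJudge:
    """Hypothetical rule-based stand-in for a judge's model."""
    name: str
    autoit_penalty: float          # how aggressively AutoIT files are penalized
    malicious_threshold: float = 0.5

    def numeric_score(self, analysis: Dict[str, bool]) -> float:
        """Numerical indicator in [0, 1]; higher means more suspicious."""
        risk = 0.0
        if analysis.get("is_packed"):
            risk += 0.25
        if analysis.get("is_autoit"):
            risk += self.autoit_penalty
        return min(risk, 1.0)

    def binary_verdict(self, analysis: Dict[str, bool]) -> str:
        """Binary judgment derived from the numerical indicator."""
        if self.numeric_score(analysis) >= self.malicious_threshold:
            return "malicious"
        return "safe"


lenient = ThresholdJudge(name="lenient", autoit_penalty=0.125)
strict = ThresholdJudge(name="strict", autoit_penalty=0.5)
sample = {"is_packed": True, "is_autoit": True}
```

With these assumed weights, the lenient judge scores the sample at 0.375 and deems it safe, while the strict judge scores it at 0.75 and deems it malicious.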
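One possible way to model the allowlist behavior is sketched below. The class name `AllowlistGate`, the use of file hashes as keys, and the threshold value are hypothetical, and the sketch assumes that a higher score indicates greater trustworthiness.

```python
from typing import Dict, Set

# Hypothetical pre-defined level above which a file is added to the allowlist.
ALLOWLIST_THRESHOLD = 0.9


class AllowlistGate:
    """Keeps allowlisted files out of the judgment process."""

    def __init__(self) -> None:
        self.allowlist: Set[str] = set()        # stand-in for the allowlist database
        self.analysis_db: Dict[str, dict] = {}  # stand-in for the analysis database

    def route(self, file_hash: str, analysis: dict) -> str:
        """Return "skip_judgment" for allowlisted files; otherwise store
        the analysis and return "judge"."""
        if file_hash in self.allowlist:
            return "skip_judgment"
        self.analysis_db[file_hash] = analysis
        return "judge"

    def record_score(self, file_hash: str, trust_score: float) -> None:
        """Add sufficiently trusted files to the allowlist."""
        if trust_score >= ALLOWLIST_THRESHOLD:
            self.allowlist.add(file_hash)
```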
The analysis engine 110, as part of the generated analysis information, can perform operations to classify or otherwise characterize the attributes and capabilities of the file. As an example, the analysis engine 110 can determine that a particular file is an administrative tool based on the API set that the file uses. As another example, the analysis engine 110 can determine that the file is ransomware based on the operations that it performs when one or more analyzers are run. The analysis engine 110, as part of the generated analysis information, can additionally or alternatively extrapolate the intent and purpose of the file. The analysis information can also characterize one or more aspects indicative of ransomware such as whether the file is packed, is signed, is encrypted (or includes code causing encryption), includes code causing deletion of files, and/or includes code causing files to be uploaded. The analysis information can also specify categories or functionality that is inferred by a model (i.e., one of the analyzers within the analysis engine 110) and/or provide information regarding the presence of certain content within the file, similarity hashes regarding the file, certificates used to sign the file, etc.
When the analysis is complete, the analysis engine 110, at 430, sends a notification to select judges 340-370 that a new file is ready to be judged. Judges 340-370 can be assigned to tenants or devices. In some variations, the judges 340-370 can be assigned to tenants based on the best set of detections that can be had for that specific tenant. As an example, it may be reasonable for a Chicago office-only organization to use a judge (e.g., a machine learning model) 340-370 that has a high false positive rate on Chinese executables, but it would be unreasonable to expect that same judge to work well for an organization that has an office within China. Additionally, some organizations require grayware to run because there is no universal standard for security hygiene that specifies things to this level of granularity. On the device level, there may be a device that the organization allows to be out of sync with its normal mode of operation. In such a case, the device itself can be assigned to a different judge 340-370.
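The analysis information can be pictured as a structured record from which an intent is inferred. The sketch below is a deliberately simplified assumption: the field names, the rule that encryption plus deletion suggests ransomware, and the small set of service-control API names used to flag administrative tools are illustrative only and are not drawn from any particular analyzer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnalysisInfo:
    """Hypothetical record of attributes and capabilities for one file."""
    file_hash: str
    is_packed: bool = False
    is_signed: bool = False
    is_encrypted: bool = False
    encrypts_files: bool = False   # includes code causing encryption
    deletes_files: bool = False    # includes code causing deletion of files
    uploads_files: bool = False    # includes code causing files to be uploaded
    imported_apis: List[str] = field(default_factory=list)


# Illustrative API set treated here as typical of administrative tools.
ADMIN_APIS = {"OpenSCManagerW", "CreateServiceW"}


def infer_intent(info: AnalysisInfo) -> str:
    """Toy rule set standing in for the engine's intent determination."""
    if info.encrypts_files and info.deletes_files:
        return "ransomware"
    if ADMIN_APIS & set(info.imported_apis):
        return "administrative tool"
    return "unknown"
```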
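The assignment of judges to tenants, with a device-level override, might be expressed along the following lines. All tenant, device, and judge identifiers are invented for illustration, and the precedence rule (device assignment wins over tenant assignment) is an assumption consistent with the example of a device that is allowed to deviate from the organization's normal mode of operation.

```python
from typing import Dict, List, Optional

# Hypothetical tenant-level assignments.
TENANT_JUDGES: Dict[str, List[str]] = {
    "chicago-only-org": ["judge-strict-on-foreign-executables"],
    "multinational-org": ["judge-general"],
}

# Hypothetical device-level overrides for devices allowed to deviate
# from the organization's normal mode of operation.
DEVICE_JUDGES: Dict[str, List[str]] = {
    "lab-device-7": ["judge-grayware-tolerant"],
}

DEFAULT_JUDGES: List[str] = ["judge-general"]


def select_judges(tenant_id: str, device_id: Optional[str] = None) -> List[str]:
    """Pick the judges to notify; a device assignment takes precedence
    over the tenant assignment."""
    if device_id is not None and device_id in DEVICE_JUDGES:
        return DEVICE_JUDGES[device_id]
    return TENANT_JUDGES.get(tenant_id, DEFAULT_JUDGES)
```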
Thereafter, at 440, a notification system creates a new task for each judge 340-370 in its respective queue 310-330. Each judge 340-370, in parallel and at 450, then takes the requested file hash from its queue 310-330 and analyzes the file hash to create a verdict. Each judge 340-370, at 460, then issues a verdict on each file and causes the results to be stored in the judgments database 380.
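The fan-out from notification to stored verdicts can be approximated with the sketch below. The `Judge` class, the per-judge `deque` queues, and the dictionary used as a judgments database are hypothetical simplifications; the decision functions are trivial placeholders for whatever models the judges execute.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Deque, Dict, List, Tuple


class Judge:
    """Hypothetical judge with its own task queue."""

    def __init__(self, name: str, decide: Callable[[dict], str]):
        self.name = name
        self.decide = decide
        self.queue: Deque[str] = deque()

    def drain(self, analysis_db: Dict[str, dict]) -> Dict[Tuple[str, str], str]:
        """Take each queued file hash and issue a verdict on it."""
        verdicts = {}
        while self.queue:
            file_hash = self.queue.popleft()
            verdicts[(self.name, file_hash)] = self.decide(analysis_db[file_hash])
        return verdicts


def notify_judges(judges: List[Judge], file_hash: str) -> None:
    """Create a new task for each selected judge in that judge's queue."""
    for judge in judges:
        judge.queue.append(file_hash)


def run_judges(judges: List[Judge], analysis_db: Dict[str, dict]) -> Dict[Tuple[str, str], str]:
    """Run the judges in parallel and collect verdicts into one store."""
    judgments_db: Dict[Tuple[str, str], str] = {}
    with ThreadPoolExecutor() as pool:
        for verdicts in pool.map(lambda j: j.drain(analysis_db), judges):
            judgments_db.update(verdicts)
    return judgments_db
```

Keying the stored verdicts by judge and file hash keeps each judge's result separately retrievable, which fits the idea that different requestors may rely on different judges.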
The verdicts/trustworthiness scores can be consumed by a downstream application or process. In some variations, such application or process can trigger or otherwise initiate a remediation action in response to one or more of the verdicts/trustworthiness scores. These remediation actions can take various forms including quarantining the file, deleting the file, blocking access to the file, and/or initiating one or more antiransomware obfuscation processes. A particular value for the verdict/trustworthiness score can trigger different remediation actions. Stated differently, the relative value of trustworthiness (e.g., low risk, medium risk, high risk, etc.) can cause different remediation actions (e.g., remediation actions commensurate with the associated risk).
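A consuming application could map scores to risk-commensurate remediation actions roughly as follows. The score range, the threshold values, and the particular pairing of actions with risk levels are assumptions made for illustration; the sketch assumes a trustworthiness score in [0, 1] where lower values indicate higher risk.

```python
from typing import Dict, List

# Hypothetical pairing of risk levels with remediation actions.
REMEDIATIONS: Dict[str, List[str]] = {
    "high": ["quarantine_file", "block_access"],
    "medium": ["block_access"],
    "low": [],
}


def risk_level(trust_score: float) -> str:
    """Bucket a trustworthiness score into a relative risk level
    using illustrative thresholds."""
    if trust_score < 0.3:
        return "high"
    if trust_score < 0.7:
        return "medium"
    return "low"


def remediation_for(trust_score: float) -> List[str]:
    """Return the remediation actions for the score's risk level."""
    return REMEDIATIONS[risk_level(trust_score)]
```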
Various implementations of the subject matter described herein may be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor (e.g., CPU, GPU, etc.), which may be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. Further, while the description above is largely focused on ransomware countermeasures, the current subject matter is applicable to analysis of files to prevent them from causing any undesired behavior including other actions associated with different types of malware.
These computer programs (also known as programs, software, software applications or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the subject matter described herein may be implemented on a computing device having a display device (e.g., an LED, OLED, or LCD screen/monitor) for displaying information to the user and a keyboard and an input device (e.g., mouse, trackball, touchpad, touchscreen, etc.) by which the user may provide input to the computing device. Other kinds of devices may be used to provide for interaction with a user as well; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The subject matter described herein may be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user may interact with an implementation of the subject matter described herein), or any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (“LAN”), a wide area network (“WAN”), and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In the descriptions above and in the claims, phrases such as “at least one of” or “one or more of” may occur followed by a conjunctive list of elements or features. The term “and/or” may also occur in a list of two or more elements or features. Unless otherwise implicitly or explicitly contradicted by the context in which it is used, such a phrase is intended to mean any of the listed elements or features individually or any of the recited elements or features in combination with any of the other recited elements or features. For example, the phrases “at least one of A and B;” “one or more of A and B;” and “A and/or B” are each intended to mean “A alone, B alone, or A and B together.” A similar interpretation is also intended for lists including three or more items. For example, the phrases “at least one of A, B, and C;” “one or more of A, B, and C;” and “A, B, and/or C” are each intended to mean “A alone, B alone, C alone, A and B together, A and C together, B and C together, or A and B and C together.” In addition, use of the term “based on,” above and in the claims is intended to mean, “based at least in part on,” such that an unrecited feature or element is also permissible.
The subject matter described herein can be embodied in systems, apparatus, methods, and/or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and subcombinations of the disclosed features and/or combinations and subcombinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and/or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Other implementations may be within the scope of the following claims.