Code poisoning aims to access source code, build processes, or update mechanisms and infect legitimate apps in order to distribute malware. Hence, end-users will perceive that malware as safe and trustworthy software and will therefore be more likely to download it. An illustrative example is the Codecov attack, where a backdoor concealed within a Codecov uploader script was widely downloaded. In April 2021, attackers compromised a Codecov server to inject malicious code into a bash uploader script. Codecov customers then downloaded this script for two months. When executed, the script exfiltrated sensitive information, including keys, tokens, and credentials, from those customers' Continuous Integration/Continuous Delivery (CI/CD) environments. Using this data, the Codecov attackers reportedly breached hundreds of customer networks, including HashiCorp, Twilio, Rapid7, Monday.com, and e-commerce giant Mercari.
These types of attacks are becoming increasingly popular and harmful due, in part, to modern development procedures that use open-source packages and public repositories. These procedures are efficient and cost-effective, accelerate development, and are therefore popular among many developers. There was a 73% growth in open-source software component downloads in 2021 compared to 2020, and a reported 77% increase in the use of open-source software from 2021 to 2022 among various companies.
Additionally, Red Hat predicts an 8% decline over the next two years in the use of proprietary software within software already in use in respondents' organizations. Over the same period, it expects enterprise open source to increase by 5% and community-based open source to increase by 3%, resulting in open-source technologies being adopted more than any other technology. Development procedures involving those packages and repositories are mostly automatic, or at least semi-automatic, for example, when a developer installs an open-source package.
As a result of this growth, popular packages, development communities, lead contributors, and more can be considered attractive targets for software supply chain attacks. These kinds of attacks can make dependent software projects more vulnerable. In 2021, OWASP considered software supply chain threats to be one of the Top-10 security issues worldwide. A leading example of such attacks was the ua-parser-js attack, where in October 2021 the attacker gained ownership of the package through an account takeover and published three malicious versions. At that time, ua-parser-js was a highly popular package with more than seven million weekly downloads. Logic bombs also pose a threat (see https://www.csoonline.com/article/510947/logic-bomb.html).
In recent years, a vast research field has emerged to deal with this threat. This field is researched by academia and is part of the application security market, which has been valued at 6.2 billion USD. This research field includes many aspects that depend on various parameters, such as the programming language (PL). Different PLs have different security issues. For example, Python has assert statements that control the application logic or program execution, which can lead to the retrieval of incorrect results, introduce security risks, or cause program failure. In C++, it is more common to commit buffer overruns by writing input to buffers that are too small. A second important parameter to consider is the scope of the functionalities being examined (function, class, script, etc.). For example, there are attacks targeting central locations in a package, e.g., the installation phase or fundamental functions.
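As an illustration of the Python assert pitfall mentioned above: an assert used as a runtime guard disappears entirely when the interpreter runs with the -O flag, so the check silently stops protecting the logic. The function below is a toy example, not taken from the study.

```python
def withdraw(balance, amount):
    # Risky pattern: assert used as a security check. When Python runs
    # with the -O flag, assert statements are stripped, so this guard
    # silently disappears and overdrafts are no longer rejected.
    assert amount <= balance, "insufficient funds"
    return balance - amount

new_balance = withdraw(100, 30)  # 70
```

Running the same code under `python -O` would let `withdraw(5, 10)` return a negative balance instead of raising.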
The subject matter regarded as the embodiments of the disclosure is particularly pointed out and distinctly claimed in the concluding portion of the specification. The embodiments of the disclosure, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:
Any reference to “may be” should also refer to “may not be”.
In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the one or more embodiments of the disclosure. However, it will be understood by those skilled in the art that the present one or more embodiments of the disclosure may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present one or more embodiments of the disclosure.
It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
Because the illustrated embodiments of the disclosure may for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained in any greater extent than that considered necessary as illustrated above, for the understanding and appreciation of the underlying concepts of the present one or more embodiments of the disclosure and in order not to obfuscate or distract from the teachings of the present one or more embodiments of the disclosure.
Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.
Any reference in the specification to a system and any other component should be applied mutatis mutandis to a method that may be executed by a system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.
Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to a method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.
Any combination of any module or unit listed in any of the figures, any part of the specification and/or any claims may be provided. Especially any combination of any claimed feature may be provided.
There is provided an MSDT algorithm for detecting malicious code injection within functions' source code by static analysis.
Firstly, the inventors used the PY150 dataset to train a deep neural architecture model.
Secondly, by utilizing that model, the inventors were able to embed every function in the CodeSearchNet (CSN) Python dataset, which is used for experimental evaluation, into the representation space of the model's encoding part.
Thirdly, the inventors applied a clustering algorithm over every function type implementation to detect anomalies by outlier analysis. Lastly, the inventors ranked the anomalies by their distance from the nearest cluster's border points: the farther the point is, the higher the score.
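The clustering and ranking steps described above can be sketched as follows. The vectors here are synthetic stand-ins for Code2Seq embeddings of one function type, and the DBSCAN parameter values are illustrative rather than the tuned values used in the study.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Synthetic stand-ins for Code2Seq context vectors of one function type:
# a dense cluster of benign implementations plus three injected outliers.
benign = rng.normal(0.0, 0.05, size=(50, 8))
injected = rng.normal(1.0, 0.05, size=(3, 8))
vectors = np.vstack([benign, injected])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(vectors)
outliers = np.where(labels == -1)[0]      # DBSCAN marks noise as -1
clustered = vectors[labels != -1]

# Rank each anomaly by its distance to the nearest clustered point
# (a stand-in for the "nearest cluster border point" used by MSDT).
def anomaly_score(idx):
    return np.min(np.linalg.norm(clustered - vectors[idx], axis=1))

ranked = sorted(outliers, key=anomaly_score, reverse=True)
```

In this toy setting the three injected vectors (indices 50-52) fall outside every dense region and are returned first in the ranking.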
The inventors conducted extensive experiments to evaluate MSDT's performance. The inventors started by randomly injecting five different real-world malicious codes into the top 100 common functions, using Code2Seq as the deep neural model and DBSCAN for the clustering algorithm.
Next, the inventors measured the precision at k (precision@k) (for various k values) of MSDT's ability to match functions classified as malicious with their proper tagging (see the Experiments section). The precision@k test result values were as high as 0.909. For example, MSDT achieved this result when k=20 for the different implementations of the get function. These implementations were randomly injected as part of a real-world attack.
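The precision@k metric itself is straightforward to compute; the following minimal sketch uses an illustrative ranking rather than the study's actual results.

```python
def precision_at_k(ranked_is_malicious, k):
    """precision@k: the fraction of the top-k ranked anomalies
    that are truly injected implementations."""
    return sum(ranked_is_malicious[:k]) / k

# A ranking (highest anomaly score first) whose top 10 entries
# contain 9 real injections:
ranking = [True] * 9 + [False] + [True] * 5
p_at_10 = precision_at_k(ranking, 10)  # 0.9
```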
Additionally, the inventors empirically evaluated MSDT on a real-world attack and succeeded in detecting it. Lastly, the inventors empirically compared MSDT against widely used static analysis tools, which operate only on files. As MSDT works on functions, it can more precisely detect an injection in a given function.
In addition to the MSDT algorithm itself, the inventors also described and shared their open, curated dataset of 607,461 functions, some of which were injected with several real-world malicious codes in this work. This dataset can be used in future works within the field of code injection detection.
In recent years, the awareness of the threats regarding public repositories and open-source packages has increased. As a result, many studies point out two main security issues with the usage of those packages: (1) vulnerable packages and (2) malicious intent in packages. Vulnerable packages contain a flaw in their design, an unhandled code error, or other bad practices that could be a future security risk. Communities and commercial companies have vastly researched this widespread threat (e.g., Snyk and Mend). Usually, this threat is based on Common Vulnerabilities and Exposures (CVEs). Those vulnerabilities allow a malicious actor, with prior knowledge of the package usage location, to achieve its goal with a few actions. Packages with malicious intent contain bad design, unhandled code errors, or code that does not serve the main functionality of the program. Those elements are created to be exploited or triggered during some phase of the package's life cycle (installation, test, runtime, etc.).
Studies have shown a rise in malicious functionalities appearing in public repositories and highly used packages. These studies have shown that there are common injection methods for malicious actors to infect packages. As Ohm et al. (Marc Ohm, Henrik Plate, Arnold Sykosch, Michael Meier, “Backstabber's Knife Collection: A Review of Open Source Supply Chain Attacks,” International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment, pages 23-24, Springer 2020) demonstrated, to inject malicious code into a package, an attacker may either infect an existing package or create a new one similar to the original (the latter is often called dependency confusion).
A new malicious package developed and published by a malicious actor has to follow several principles: (1) to be a proper replacement for the targeted package, it has to contain semi-identical functionality; and (2) it has to be attractive, ending up in the targeted users' dependency tree. To get the new package used, one of the following methods can be employed: naming the malicious package in a similar manner to the original one (typosquatting), creating a trojan in the package, or taking over an unmaintained package or user account (use after free).
The second injection strategy infects existing packages through one of the following methods: (1) injection into the source of the original package by a pull request or social engineering; (2) the open-source project owner adding malicious functionality out of ideology, such as political ideology; (3) injection during the build process; and (4) injection through the repository system.
It was demonstrated that the malicious intent in packages can be categorized by several parameters: targeted operating system (OS), PL, the actual malicious activity, the location of the malicious functionality within the package (where it is injected), and more. Additionally, it was shown that the majority of the maliciousness is associated with persistence purposes, which can be categorized into several major groups: backdoors, droppers, and data exfiltration.
The current application addresses the second security issue, specifically in dynamic programming languages (PLs), with Python as a test case, owing to their usage popularity and the popularity of injection-oriented attacks within those PLs' repositories (Node.js, Python, etc.).
These injections are often related to the PLs' dynamicity features, such as exposing the running functionalities only at runtime (e.g., exec(“print('Hello world!')”)) and configurable dependencies and imports of packages (e.g., importing from a local package instead of a global one).
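A toy illustration of this dynamicity (not an actual attack): the executed statement exists only as an encoded blob, so a naive static scan of the file text never sees the call that runs.

```python
import base64

# The executed statement exists only as an encoded blob; a naive
# static scan of this file never sees the print call that runs.
source = "print('Hello world!')"
blob = base64.b64encode(source.encode())
exec(base64.b64decode(blob).decode())  # prints: Hello world!
```

In a real attack the blob would ship pre-encoded, so the plaintext `source` line would not appear in the file at all.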
The described use of the PLs' dynamicity features is the most common among the known attacks. A leading example of this kind of attack involved a malicious package named “pytz3-dev,” which appeared in the central repository of Python packages, the Python Package Index (PyPI), and was downloaded by many. This package contained malicious code in the initialization module that searched for a Discord authentication token stored in an SQLite database. If found, the code exfiltrated the token. This attack went unnoticed for seven months, and the package was downloaded by 3,000 users in 3 months.
These features, and many more, are used by attackers, making this one of the most common attack techniques associated with supply chain attacks, as covered by NIST.
Detection methods of malicious intent in source code include static analysis and dynamic analysis. Static analysis finds irregularities in a program without executing it and is safer than dynamic analysis.
Various detection analysis techniques were recognized to be faulty, as described below.
The feature-based technique uses the occurrence counts of known problematic functionalities. For example, this technique uses a classifier with a given labeled dataset and several extracted features (function appearances, length of the script, etc.) to predict the maliciousness of a script. The main drawback of this technique is that it is strongly bound to reversing research that points to features related to the attack, which may lead to detection that overfits the attacks that have already been revealed and learned. Furthermore, potential attackers could evade detection by several methods, such as not using, or not using in the expected manner, the searched-for features in the code. An example of such a static analysis tool is Bandit. Bandit is a widespread tool designed to find common security issues in Python files using hard-coded rules. This tool uses the AST form of the source code to better examine the rule set. In addition, Bandit's detection method includes the following metrics: the severity of the issues detected and the confidence of detection for a given issue. Those metrics are divided into three values: low, medium, and high. Each rule manually obtains its severity and confidence values from the Bandit community.
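A minimal sketch of feature extraction in this spirit, using Python's stdlib ast module. The chosen feature set and call names are illustrative, not Bandit's actual rule set.

```python
import ast

# Illustrative set of call names often flagged by feature-based scanners.
RISKY_CALLS = {"exec", "eval", "compile", "__import__"}

def extract_features(source: str) -> dict:
    """Toy feature extraction: counts of known-risky call names
    plus script length, as used by feature-based classifiers."""
    tree = ast.parse(source)
    counts = {name: 0 for name in RISKY_CALLS}
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in counts:
                counts[node.func.id] += 1
    counts["n_lines"] = len(source.splitlines())
    return counts

features = extract_features("x = eval(input())\nexec(x)\n")
```

A downstream classifier would consume such feature dictionaries as its input rows; the overfitting drawback noted above comes from the fixed `RISKY_CALLS` list.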
Signature-based detection (in the case of malware detection) is a process where a set of rules (based on a reversing procedure) defines the maliciousness level of the program. Rules generated for static analysis purposes are often a set of functionalities or opcodes in a specific order that matches the researched code behavior; for example, YARA is a commonly used static signature tool. Rules generated for dynamic analysis purposes are often a set of executed operations, memory states, and register values. The main drawback of this technique is that it applies only to known maliciousness.
A further approach compares packages to known CVEs (see the Open-source packages' security issues section). On the one hand, static analysis tends to scale well over many PL classes (with a given grammar), efficiently operating on large corpora. It often identifies well-known security issues and, in many cases, is explainable. On the other hand, this kind of analysis suffers from a high number of false positives and poor detection of configuration issues.
Dynamic Analysis. This type of analysis finds irregularities in a program after its execution and determines its maliciousness, where gathered data, such as system calls, variable values, and IO access, are often used for anomaly detection or classification problems. There are several drawbacks to using dynamic analysis on source code: (a) data gathering difficulties: the procedure of extracting data is hard to automate, as the package needs to be activated and execute its functionality; and (b) scalability: each learned and tested program must be activated in its entirety, and the desired data has to be extracted from each. Therefore, in this study, the inventors have chosen to focus on advanced static analysis.
Deep Learning Methods for Analyzing Source Code
In recent years, there has been an increasing need to use machine learning (ML) methods in code intelligence for productivity and security improvement. As a result, many studies construct statistical models for code intelligence tasks. Recently, pre-trained models were constructed by learning from big PL corpora, such as CodeBERT and CodeX. These pre-trained models are commonly based on models from the natural language processing (NLP) field (such as BERT and GPT), including improvements of the original Transformer architecture and the original self-attention mechanisms presented by Vaswani et al. This development not only led to improvements in code understanding and generation problems, but also enlarged the number of tasks and their necessities, such as clone detection and code completion. Those tasks include several challenges, such as capturing semantic essence, syntax resemblance, and execution flow. For every challenge, some models fit better than others. For example, for code translation between PLs, algorithms including a “Cross-lingual Language Model” with masked-token preprocessing are superior at capturing the semantic essence.
Over the years, several ML methods have been researched within the context of code analysis tasks. In 2012, the use of techniques from the classic text analysis field was shown, for example, using an SVM on a bag-of-words (BOW) representation of simple tokenization (lexing by the PL grammar) of Java source code. In 2016, techniques were shown that obtain context for the extracted tokens using, for example, the output of a recurrent neural network (RNN) trained over tokenized (lexed) code. However, it was shown that RNN-based sequence models have several shortcomings regarding source code representation: first, they inaccurately represent the non-sequential structure of source code; second, RNN-based models may be inefficient for very long sequences; third, those models lack the ability to grasp the syntactic and semantic information of the source code.
In this study, the inventors used the Code2Seq model, which is a deep neural architecture developed by Alon et al. (Uri Alon, Shaked Brody, Omer Levy, Eran Yahav, “code2seq: Generating Sequences from Structured Representations of Code,” arXiv:1808.01400), similar to Nagar et al. The inventors selected this model over others because it outperforms the mentioned code embedding models on similar tasks, such as code search and code captioning. Additionally, the Code2Seq model has fewer parameters compared to other models. The inventors trained the model using the PY150 dataset. This dataset contains Python functions in the form of ASTs (see Datasets). In this architecture, a function is represented as an AST, whose internal nodes represent the program's construction with known rules, as described in the given grammar. The tree's leaves represent information regarding the program variables, such as names, types, and values.
Code2seq can be integrated into many applications, such as code search, where the input is a sentence describing code and the output is the desired code. For example, Nagar et al. used the Code2seq model to generate comments for collected code snippets. The candidate code snippets and corresponding machine-generated comments were stored in a database from which, eventually, the code snippets whose comments were similar to natural language queries were retrieved.
Results
This section presents the experimental results obtained by the MSDT algorithm (see The proposed method section) when applied to the constructed function types dataset that contained both injected and benign implementations (see the Injection simulation section). It is worth noting that this study used a server with 8 GB RAM and 8 CPU cores to evaluate the algorithm. The runtime of the process was about 10 minutes for 48,627 different implementations.
The constructed dataset includes the 100 most common function types from the CSN dataset (see Datasets section). From the function types implementations distribution (see
The first experiment included parameter tuning of the DBSCAN method mentioned in the Anomaly detection on representation section, which the inventors applied to the CSN dataset without the 100 most common function types.
The inventors received the following best results 30 (see
The second experiment included the evaluation of MSDTDBSCAN on every function type against every attack type and every k in the range of 1 to 10 percent of the implementations. For every iteration of k, the inventors measured precision@k. The inventors found that MSDTDBSCAN detects well when applied to several functions and attacks. See examples 41, 42 and 43 of
In addition, the inventors discovered that the measured Spearman's rank correlation between MSDT's detection rate and the number of implementations is equal to ρ=0.539, indicating a correlation between the detection rate and the number of implementations. The inventors also tested MSDTEcod on the same experimental settings described in the Code2seq representation section. Following the mentioned evaluation (see the Evaluation Process section), the inventors measured the precision@k for every k ranging from 1 to 30. It can be observed that, generally, MSDTEcod detects the top two ranked anomalies and is less successful for the following k values (see examples 51, 52 and 53 of
The third experiment included detecting injected malicious implementations of multiply by applying MSDTDBSCAN. By visualizing the PCA (2 components) of the collected samples (see example 60 of
The fourth experiment emphasizes the relations between malicious and benign implementations. In the following visualization, the inventors observed (see example 70 of
Based on their analysis of the results presented in the Results section and the figures above, the inventors can observe the following:
This study introduces MSDT, a novel algorithm to statically detect code injection in functions' source code by utilizing a deep neural translation model named Code2Seq and applying anomaly detection techniques on Code2Seq's representation of each function type. The inventors comprehensively described MSDT's steps, starting with collecting and preprocessing a dataset. After injecting five malicious functionalities into random implementations, the inventors extracted an embedding for each implementation of the function type. Based on these embeddings, the inventors applied an anomaly detection technique, resulting in anomalies that the inventors eventually ranked by their distance from the nearest cluster border point.
This evaluation of MSDT on the constructed dataset demonstrates that MSDT succeeded for cases when: (1) the functions have a repetitive functionality; and (2) the injected code has a limited number of lines. However, MSDT was less successful when: (1) the injected code contains a relatively large number of lines; and (2) the functions have a more abstract functionality.
For MSDT to use the Code2Seq embedding, it is necessary to convert every function to an AST representation. According to an embodiment, a more comprehensive representation is used for the code, one that includes the semantic, syntactic, and execution flow data of the program, for instance, using execution paths in a control flow graph constructed statically from the program, or using a program dependence graph (PDG).
According to an embodiment, MSDT is configured to support any textual PL. This can be done using the proper grammar and a deep neural architecture (Code2Seq) to embed functions' source code.
According to an embodiment, models other than Code2Seq are used for source code embeddings, like Seq2Seq, CodeBERT, and CodeX.
According to an embodiment, other outlier detection models are used on this high-dimension clustering problem.
The primary goal of this study is to detect code injection by applying static analysis to the source code. This section describes the static analysis algorithm the inventors developed and their experiments to test and evaluate their proposed method, MSDT (see the Experiments section).
There are several datasets that include labeled function implementations for various purposes. In this study, the inventors used 607,461 public Python function implementations with simulated test cases and real-world, observed attacks. Additionally, this study combines an embedding layer based on a deep neural translation model, Code2Seq. Lastly, this study showcases traditional anomaly detection techniques over the Code2Seq representation, based on DBSCAN, compared to another anomaly detection technique based on Ecod.
Datasets
In this study, the inventors utilized three datasets: (1) the Eth PY150 dataset is used for training Code2Seq, as the presented Code2Seq model was trained on a Java dataset. The Eth PY150 is a Python corpus with 150,000 files. Each file contains up to 30,000 AST nodes from open-source projects with non-viral licenses such as MIT. For the training procedure, the inventors randomly sampled the PY150 dataset into validation/test/train sets of 10K/20K/120K files; (2) the CodeSearchNet (CSN) Python dataset is used to perform the different experiments to prevent data leakage from the training procedure, where CSN is a Python corpus containing 457,461 <docstring, code> pairs from open-source libraries, of which the inventors use only the code; and (3) the Backstabber's Knife Collection is used for the malicious functionalities injected during the simulations. The Backstabber's Knife Collection is a dataset of manual analyses of malicious code from 174 packages that were used by real-world attackers. Namely, the inventors used five different malicious code injections from this collection to inject into the 100 most common functions within the CSN corpus. The inventors chose those specific malicious codes for their straightforward integration within the injected function and their download popularity.
As mentioned above, the input to the Code2seq model is an AST representation of a function. To get this representation for each function, the inventors extracted tokens using fissix and tree-sitter, which allowed them to normalize the code and obtain a consistent encoding. With the normalized output code, the inventors then generated an AST using fissix.
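For illustration, a comparable per-function AST extraction can be done with Python's stdlib ast module; the study itself used fissix and tree-sitter, whose APIs are not reproduced here.

```python
import ast

def function_asts(source: str):
    """Parse a module and yield (name, AST dump) for each function,
    analogous to the per-function AST inputs Code2seq expects."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            yield node.name, ast.dump(node)

code = "def multiply(a, b):\n    return a * b\n"
asts = dict(function_asts(code))
```

The dumped tree exposes the internal construction nodes (e.g., `BinOp` for the multiplication) and the leaves carrying variable names, matching the AST structure described above.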
Injection Simulation
The inventors randomly selected up to 10% of the implementations from each of the top 100 common functions to be injected with code, to simulate the real-world number of code injections. To find the 100 most common functions, the inventors counted the number of implementations for each function in the CSN dataset and took the 100 most frequent functions. The total number of implementations of the 100 most common functions was 48,627. The injected functionalities were five malicious samples collected from the Backstabber's Knife Collection.
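The selection-and-injection procedure can be sketched as follows; the payload line and function bodies are harmless placeholders, not the real samples from the Backstabber's Knife Collection.

```python
import random

# Harmless placeholder standing in for a real malicious sample.
MALICIOUS_SNIPPET = "    __import__('os')  # stand-in for a real payload\n"

def inject(implementations, rate=0.10, seed=0):
    """Prepend a payload line to up to `rate` of the implementations,
    mirroring the up-to-10% injection simulation described above.
    Returns the modified sources and a ground-truth label per item."""
    rng = random.Random(seed)
    n = max(1, int(len(implementations) * rate))
    chosen = set(rng.sample(range(len(implementations)), n))
    out, labels = [], []
    for i, body in enumerate(implementations):
        if i in chosen:
            header, _, rest = body.partition("\n")
            out.append(header + "\n" + MALICIOUS_SNIPPET + rest)
            labels.append(True)
        else:
            out.append(body)
            labels.append(False)
    return out, labels

impls = [f"def get(x):\n    return x + {i}\n" for i in range(50)]
injected, labels = inject(impls)
```

The returned labels give the ground truth needed later for precision@k evaluation.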
Those injections illustrated several attack types:
The functionalities were injected at the beginning of the randomly selected implementations of those popular function types, as observed by Ohm et al. and similar to the attacks mentioned above.
Code2seq Representation
In this study, the inventors used the result vectors of the attention procedure (see the Deep learning methods for analyzing source code section), named context vectors, with 320 dimensions; this is the representation space of the model for code snippets. At each decoding step, the probability of the next target token depends on the previous tokens.
The inventors used the same parameters presented by Alon et al. Additionally, the inventors trained the model on the Eth PY150 train set (as mentioned in the Datasets section) for 20 epochs or until there was no improvement after ten iterations. Eventually, the inventors tested their Code2seq model on the Eth PY150 test set (as mentioned in the Datasets section) and achieved a recall of 47%, a precision of 64%, and an F1 of 54% on the mentioned randomly sampled test set.
Anomaly Detection on Representation
In this step, the inventors used their Code2Seq representation (see the Code2seq representation section) for the given injected and non-injected functions of the same type. Then, the inventors used the DBSCAN method (referred to as MSDTDBSCAN), as density-based clustering algorithms are known to perform better at finding outliers. The inventors achieved this by tuning the following parameters of the DBSCAN method:
For each iteration, 10-fold cross-validation is applied, measuring the following metrics as means over the different folds: the true positive rate (TPR) and average precision (AP) of outlier detection.
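A simplified sketch of such parameter tuning: a grid search scored by outlier TPR on synthetic vectors. The study's actual procedure uses 10-fold cross-validation and also measures AP; the grids and data here are illustrative.

```python
from itertools import product

import numpy as np
from sklearn.cluster import DBSCAN

def tune_dbscan(vectors, labels, eps_grid, min_samples_grid):
    """Grid-search DBSCAN parameters, scoring each setting by the
    true-positive rate of its outlier predictions (a simplification
    of the 10-fold evaluation described above)."""
    best = (None, -1.0)
    for eps, ms in product(eps_grid, min_samples_grid):
        pred = DBSCAN(eps=eps, min_samples=ms).fit_predict(vectors) == -1
        tpr = (pred & labels).sum() / max(labels.sum(), 1)
        if tpr > best[1]:
            best = ((eps, ms), tpr)
    return best

rng = np.random.default_rng(1)
# Dense benign cluster plus four injected outliers, as ground truth.
vectors = np.vstack([rng.normal(0, 0.05, (40, 4)),
                     rng.normal(1, 0.05, (4, 4))])
labels = np.array([False] * 40 + [True] * 4)
params, tpr = tune_dbscan(vectors, labels, [0.3, 0.5, 1.0], [5, 10])
```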
Evaluation Process
The performance of the anomalies detected by MSDT was measured by a precision at k (precision@k) study, where precision@k stands for the true positive rate (TPR) of the results that occur within the top k of the ranking. The inventors ranked the anomalies by their Euclidean distance from the nearest cluster's border points. Eventually, the inventors measured the precision@k metric for each function type with the mentioned code injection attacks and compared it to a RandomClassifier, to show the performance of MSDT relative to a random decision, as there are no other methods that work on functions to use for comparison (see the Introduction and Background sections). To better understand how MSDT detects attacks, the inventors examined the correlation between the detection rate and the number of implementations among the various function types. Therefore, the inventors measured the average precision@k for every attack, and for every function type, the inventors calculated the average of the average detection rates of the various attacks. The inventors used Spearman's rank correlation (ρ) to measure the correlation between the mentioned average of the function types and their number of implementations.
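The Spearman correlation step can be sketched as follows, with toy detection rates and implementation counts; the study's reported value of ρ=0.539 comes from the real data, not from these numbers.

```python
from scipy.stats import spearmanr

# Toy data: average detection rate per function type alongside its
# number of implementations (illustrative values only).
detection_rate = [0.2, 0.5, 0.55, 0.8, 0.9]
n_implementations = [30, 120, 100, 400, 800]

# Spearman's rho compares the rank orderings of the two lists.
rho, p_value = spearmanr(detection_rate, n_implementations)
```

A positive rho, as in the study, indicates that function types with more implementations tend to be detected at a higher rate.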
The inventors compared MSDTDBSCAN's performance to another outlier detection baseline method named Ecod (referred to as MSDTEcod) over the mentioned representation (see the Anomaly detection on representation section). The inventors chose Ecod because it outperformed several widely used outlier detection methods, such as KNN. The inventors used Ecod to detect outliers as follows: firstly, the inventors applied Ecod on every function type for every attack type (analogously to MSDTDBSCAN). Secondly, the inventors measured the anomaly score of each implementation; the Ecod algorithm calculates this score such that the more distant the vector, the higher its score. Thirdly, the inventors extracted the precision@k, where k indicates the anomalies in descending order, i.e., precision@2 is the precision of the two most highly ranked anomalies.
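A simplified, illustrative version of Ecod-style scoring: per-dimension empirical tail probabilities aggregated by negative log. The full Ecod implementation (e.g., in the pyod library) differs in its details; this is only a sketch of the idea that extreme values in many dimensions yield high scores.

```python
import numpy as np

def ecod_scores(X):
    """Simplified Ecod-style anomaly score: per dimension, take the
    smaller empirical tail probability of each value and sum its
    negative log across dimensions (higher = more anomalous)."""
    n, d = X.shape
    scores = np.zeros(n)
    for j in range(d):
        ranks = X[:, j].argsort().argsort()   # ranks 0..n-1
        left = (ranks + 1) / n                # empirical P(x <= value)
        right = (n - ranks) / n               # empirical P(x >= value)
        tail = np.minimum(left, right)
        scores += -np.log(tail)
    return scores

# 60 inlier vectors plus one extreme point in every dimension.
X = np.vstack([np.random.default_rng(2).normal(0, 1, (60, 5)),
               np.full((1, 5), 8.0)])
scores = ecod_scores(X)
```

Sorting implementations by descending score yields the ranking from which precision@k is read off.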
To evaluate their method on real-world injections, the inventors applied MSDTDBSCAN on a real-world case taken from the Backstabber's Knife Collection. The case was a sample of malicious functionality injected into a multiply calculation functionality that loaded a file by Popen, as mentioned above in the Injection simulation section. The inventors collected 48 implementations of multiply-related functions from the mentioned datasets (see the Datasets section). The inventors did so to compare the injected multiply function against the benign implementations, and thus applied MSDTDBSCAN on this multiply case.
Additionally, the inventors compared MSDT with the mentioned MSDTEcod method and two of the well-known static analysis tools named Bandit and Snyk (see the Static Analysis section). Namely, the inventors evaluated those static analysis tools on the origin file where the malicious implementation of multiply appeared.
Lastly, to emphasize the relations between the malicious and the benign implementations, the inventors visualized the achieved embeddings of the get and log functions with the injected code. The inventors produced this visualization by applying PCA (2 components) to the Code2Seq context vectors (see the Code2seq representation section). See examples 80 and 90 of
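The visualization step can be sketched as follows, with random vectors standing in for the 320-dimensional context vectors of benign and injected implementations.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Stand-ins for 320-dimensional Code2Seq context vectors: a benign
# cluster and a small offset group of injected implementations.
benign = rng.normal(0.0, 0.1, (40, 320))
injected = rng.normal(0.5, 0.1, (4, 320))
vectors = np.vstack([benign, injected])

# Project to two principal components for a 2-D scatter plot.
coords = PCA(n_components=2).fit_transform(vectors)
```

In this synthetic setting the first component captures the benign-versus-injected offset, which is the kind of separation the get and log visualizations illustrate.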
Method 100 may be executed by a processing circuit or more than a single processing circuit.
The processing circuit may be implemented as a central processing unit (CPU), and/or one or more other integrated circuits such as application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), full-custom integrated circuits, etc., or a combination of such integrated circuits.
According to an embodiment, method 100 is applied on a source code for a function. The source code may be of any size.
According to an embodiment, method 100 includes step 110 of obtaining, by a processing circuit, an embedding of a source code for a function. The obtaining may include at least one of generating the embedding or receiving the embedding.
According to an embodiment, step 110 includes calculating the embedding or receiving the embedding.
An embedding may be generated in different manners, for example, by different deep learning models. The calculating of the embedding may include selecting a deep learning model out of multiple deep learning models. The selected deep learning model is applied on the source code in step 120.
The selection may be based on at least one of a length of the source code, available computational resources, available memory resources, and the like.
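A hypothetical selection rule illustrating this embodiment; the thresholds and model names below are invented for illustration only and are not part of the disclosed method.

```python
def select_embedding_model(source_len, mem_gb):
    """Hypothetical selection rule: prefer a larger model when the
    source is long and memory allows. Thresholds and model names
    are illustrative placeholders, not disclosed values."""
    if source_len > 5000 and mem_gb >= 16:
        return "codebert"
    if mem_gb >= 4:
        return "code2seq"
    return "bag-of-tokens"
```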
This application claims priority from U.S. provisional patent 63/395,880 filing date 8/8/2022 which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
63395880 | Aug 2022 | US