Software development often involves developers writing new source code in a high level programming language that uses software libraries. Software libraries may in turn use native libraries that perform specific operations. The native libraries compile into a native binary, which is precompiled code that is compiled for the hardware platform on which the native binary will execute. Native binaries are often compiled from source code written in a lower level programming language, such as C or C++, and are used by libraries.
By way of an example, consider a programmer who writes a gaming web application that uses one or more software libraries. The software library is written in the Python programming language. The gaming application interacts with users, processes the interactions, and provides visual output. One aspect of the gaming application is rendering to provide the visual output. Because the library and the gaming application are written in a higher level programming language, the library or the application would be slow and inefficient in performing the rendering. Thus, a native binary may be used that performs the rendering. Namely, the application with the library gets the user input and passes the result of the input to the binary, and the binary communicates with the graphics processing unit (GPU) to render the output.
Thousands of libraries exist which use native libraries, and, correspondingly, native binaries. For example, the Elasticsearch library has forty six native binaries in it. Native binaries may have vulnerabilities. Because of the number of libraries and native binaries, a difficulty exists in tracking the native binaries that are used in an application.
For large software projects, many software libraries and software library versions may be used in a single software project. Further, after software libraries are released and used on software projects, vulnerabilities may be discovered in the software libraries and the native binaries used in the software libraries. For example, vulnerabilities may be discovered months or even years after users have integrated the software libraries and corresponding native binaries into their software projects. A vulnerability is a weakness in the software that may allow an unauthorized user or code to attack software. Thus, malicious code and users may perform denial of service attacks, install malware, access sensitive data, or perform other nefarious actions by exploiting the vulnerability.
In general, in one aspect, one or more embodiments relate to a method that includes disassembling a reference binary of a library to generate a control flow graph of the reference binary, normalizing the control flow graph to generate a normalized graph, traversing the normalized graph to generate execution traces from the normalized graph, and generating library vector embeddings. Generating library vector embeddings includes, for each execution trace of at least a subset of the execution traces, processing the execution trace by a vector embedding model to generate a library vector embedding of the execution trace. The method further includes relating, in storage, a library identifier of the library to the plurality of library vector embeddings as a fingerprint of the library.
In general, in one aspect, one or more embodiments relate to a system that includes storage and a computer processor comprising computer readable program code for causing a computing system to perform operations. The operations include disassembling a reference binary of a library to generate a control flow graph of the reference binary, normalizing the control flow graph to generate a normalized graph, traversing the normalized graph to generate execution traces from the normalized graph, and generating library vector embeddings. Generating library vector embeddings includes, for each execution trace of at least a subset of the execution traces, processing the execution trace by a vector embedding model to generate a library vector embedding of the execution trace. The operations further include relating, in storage, a library identifier of the library to the plurality of library vector embeddings as a fingerprint of the library.
In general, in one aspect, one or more embodiments relate to a method that includes disassembling target software to generate a control flow graph, normalizing the control flow graph to generate a normalized graph, traversing the normalized graph to generate execution traces from the normalized graph, and processing the execution traces to generate target vector embeddings of the target software. The method further includes selecting, from multiple libraries, a library in which the target vector embeddings match a threshold of a library fingerprint of the library to obtain a selected library and processing the target software based on the selected library.
Other aspects of the invention will be apparent from the following description and the appended claims.
Like elements in the various figures are denoted by like reference numerals for consistency.
In general, embodiments are directed to detecting native binaries in software libraries. Specifically, one or more embodiments generate fingerprints that are vector embeddings from execution traces of a binary of a native library. The set of vector embeddings forms a fingerprint of the native library. Specifically, a binary is decomposed into a control flow graph, which is then normalized. The normalization removes hardware specific portions of the control flow graph. The normalized graph is traversed to obtain execution traces for the binary. At least a subset of the execution traces is processed through a specifically trained vector embedding model to generate library vector embeddings. The specifically trained vector embedding model is trained on assembly language. Each library vector embedding is a vector embedding of a corresponding execution trace. Thus, the collection of library vector embeddings generated from the execution traces forms the fingerprint for a library.
When target software is being analyzed, the binary of the target software undergoes a similar process to identify execution traces and corresponding vector embeddings therefrom of the target software. A matching process is then performed to determine whether the vector embeddings generated from the target software satisfy a threshold of matching vector embeddings in the binary of the native library. If the threshold is satisfied, then the native library is detected as being in the target software. Accordingly, the target software may be updated to address the vulnerability.
Notably, large target software can use a number of libraries. The libraries often have large codebases and may, in turn, rely on several native binaries. Developers writing target software using libraries may be unaware, when writing the target software, of the various native binaries used by the libraries. Thus, when vulnerabilities are discovered in the native binaries, the target software becomes vulnerable. On the vulnerability side, large data repositories store vulnerability listings independently, in both time and location, of the repositories having the libraries and target software. Thus, when the native binaries generated from libraries that are used by high level libraries and ultimately used by target software are unknown, identifying vulnerabilities in the target software due to the native binaries is a technical challenge.
Turning to the Figures,
The various repositories shown in
Software library repositor(ies) (102) are one or more repositories having at least one software library. A software library is a set of instructions that may be incorporated into other software (e.g., target software (114), another software library). A software library may be composed of a set of code files. A code file is a file having software code. The software code in the code file may be source code and/or executable code. The set of code files may be distributed across different repositories. For example, one software library repository may store source code and another software library repository may store the executable code. The executable code is a compiled version of the source code. For example, the executable code may be any compiled code or other executable form of software. Within the software code are code sections. A code section is stored in a single code file, whereby the single code file may have one or more code sections. A code section is a compilation unit or other block of code. A code section may be a method or class. Although the terms method and class are used, the terms method and class refer to equivalent units of code in other programming languages. For example, a class is any programming language structure that encapsulates code. The code encapsulated by the class is equivalent to a method. A code section may include one or more other code sections in addition to other instructions. The inclusion is complete. Namely, a first code section that is included in a second code section is completely included in the second code section. For example, a code section that is a class includes one or more code sections that are methods.
One or more of the software libraries in the software library repositories may be a native library. The term, “native,” refers to the code being architecture specific. A native library is a library written for a particular computer hardware architecture. As such, a native library is platform specific. The source code version of the native library may be in a low level language, such as C or C++. The source code version of the native library may be available as open source software (OSS). The compiled version of a native library is a native binary. For example, the native binary may be compiled into assembly language.
The native binary may be incorporated into other software libraries. For example, other libraries in the software library repositories (102) may use the native binary to perform certain functions. The incorporation of the native binaries into other software libraries may be performed for performance during execution. The native binaries that are incorporated into a software library may not be prominently exposed to developers of target software incorporating the software library.
Vulnerability record repositor(ies) (104) are one or more storage repositories having vulnerability records. A vulnerability record is a record of a vulnerability that is reported as being present in a software component. For example, a vulnerability repository may be in accordance with the Common Vulnerabilities and Exposures (CVE) system. CVE is an established system for publicly disclosing vulnerabilities in software products (e.g., applications or libraries). CVE is used by the US National Vulnerability Database (NVD). CVE sets forth a process of uniquely identifying a vulnerability and disclosing the vulnerability publicly, usually after the vulnerability is fixed. The disclosure or reporting of the vulnerability includes a standard set of metadata, such as severity. The vulnerability record may be cloned by other public and commercial providers of vulnerability databases, e.g., GitHub Security Advisories (GHSA). A vulnerability record may include various types of information. For example, the vulnerability record may include the unique identifier, a description of the vulnerability, one or more links to references, and other metadata.
Patches to mitigate or even correct vulnerabilities may be published in one or more code repositor(ies). A patch is a set of instructions that update a software component. A set of patches may be associated with a same vulnerability. For example, different versions of a software component may have different patches. Further, a patch may be a commit, which is an individual, addressable change to the software component. Multiple commits may exist, which are each applicable to the same software component. Thus, a set of patches may include one or more patches. Because the vendor of target software may not know which native binaries are included in a software library, the vulnerabilities in the target software due to the native binaries may be unknown. Thus, even when patches are available, a challenge exists in deploying the patches to the target software.
Continuing with
The library version level (204) serves as an identifier of the library version amongst the set of library versions of the library. The library version level (204) has, individually for each library version, a library version fingerprint related, in storage of the fingerprint repository (110), to the library version identifier that uniquely identifies the library version. For example, the library X version M fingerprint (214) identifies version M of library X amongst the versions of library X. Library X version M fingerprint (214) is related to library X version M identifier (222). Library X version N fingerprint (216) is related to library X version N identifier (224). Library Y version Q fingerprint (218) is related to library Y version Q identifier (226). Library Y version R fingerprint (220) is related to library Y version R identifier (228).
Returning to
As shown in
The fingerprint generator (130) is configured to generate fingerprints for libraries and library versions. Specifically, the fingerprint generator (130) is configured to generate a library fingerprint and a library version fingerprint.
The reference binary selector (132) is software that is configured to select a reference binary for a library. The reference binary is a version of the library from which the library fingerprint is generated. Namely, the reference binary is a compiled version of the library that is selected as being representative of the library.
The source code parser (134) is software that is configured to parse source code. In one or more embodiments, the source code parser (134) may further be configured to determine the compiler agnostic portion of source code. A compiler agnostic portion of source code is a portion of source code that is not dependent on the compiler. For example, the compiler agnostic portion may be the header, license information, and other such information that is in the compiled version, but is independent of the compiler.
The binary disassembler (136) is a program that is configured to disassemble a binary and create a control flow graph for the binary. The control flow graph of a binary is a graph that indicates how code blocks of the binary are connected to other code blocks of the binary. Code blocks are blocks of binary code that do not include multiple possible execution paths within the code block (e.g., no conditional statements within the code block) and do not overlap in instructions with other code blocks. An example of a binary disassembler (136) is Ghidra. However, other disassemblers, such as the Interactive Disassembler (IDA) or a custom disassembler, may be used without departing from the scope of the technology.
The normalizer (138) is configured to normalize instructions in a control flow graph and generate a normalized graph. The normalization replaces hardware specific values with placeholders representing the type of value. Specifically, the normalization may replace operands with a placeholder identifying the type of operand while keeping the operator of the instruction as is. By way of an example, a specific register may be changed to “Reg” and a specific memory address may be changed to “Mem”.
A vector embedding model (140) is a machine learning model that is configured to transform execution traces into vector embeddings. In one or more embodiments, the vector embedding model may be pretrained for programming languages and then subsequently trained, such as by using the technique described in
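As an illustration of the control flow graph described above, the following minimal sketch models code blocks and the connections between them. The names `CodeBlock` and `ControlFlowGraph` and the instruction strings are hypothetical and chosen for this example only; a disassembler such as Ghidra exposes its own representation, which is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class CodeBlock:
    """A straight-line run of instructions with no internal branching."""
    block_id: str
    instructions: list
    successors: list = field(default_factory=list)  # ids of blocks that may run next


@dataclass
class ControlFlowGraph:
    """Code blocks keyed by identifier, plus the identifier of the entry block."""
    entry: str
    blocks: dict = field(default_factory=dict)

    def add_block(self, block):
        self.blocks[block.block_id] = block

    def exits(self):
        # A block with no successors ends a possible path of execution.
        return [b.block_id for b in self.blocks.values() if not b.successors]


# Invented three-block example: b0 branches to b1 or b2, and b1 falls through to b2.
cfg = ControlFlowGraph(entry="b0")
cfg.add_block(CodeBlock("b0", ["cmp eax, 0", "je b2"], ["b1", "b2"]))
cfg.add_block(CodeBlock("b1", ["add eax, 1"], ["b2"]))
cfg.add_block(CodeBlock("b2", ["ret"]))
```

The conditional jump ends block `b0` rather than sitting inside it, which is what keeps each code block free of multiple internal execution paths.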
A software manager program (112) is a software program that is configured to analyze the target software (114) and identify libraries included in the target software (114). In one or more embodiments, the software manager program (112) identifies the native binaries in the target software (114). The software manager program (112) may be further configured to fix or update the target software (114) based on the vulnerable code sections.
In Block 304, reference binaries in the selected libraries are obtained in accordance with one or more embodiments. The selected libraries may have published binaries for the libraries. As another example, source code, if available, may be compiled into a binary. Each binary is for a particular version of the library. Distinct versions may exist for different architectures and/or as modified over time. One of the binaries is selected as a reference binary. In some embodiments, the reference binary may be selected, for example, based on being a most recent version of the library, a number of changes as compared to other versions, or based on other criteria.
In Block 306, for each reference binary, a library fingerprint is generated. Generally, the library fingerprint is generated by obtaining an execution trace of the library and processing the execution trace through the vector embedding model to generate a vector embedding. The vector embedding is then linked to the library identifier as one of the set of vector embeddings in the fingerprint. A technique for generating a library fingerprint is described in
In Block 308, source code for each library version is obtained in one or more embodiments. For selected libraries that are open source or otherwise have source code available, library versions are selected. For resource management, only a subset of the library versions may be selected. For example, the subset may be the library versions created within a threshold amount of time from a current time. Other criteria may be used to select the library version. For each selected library version, the source code for the library version is obtained.
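The time-based selection of library versions might look like the following minimal sketch. The function name `select_recent_versions`, the version strings, the release dates, and the 730-day window are all assumptions made for illustration; the description above does not fix any of them.

```python
from datetime import datetime, timedelta


def select_recent_versions(release_dates, now, max_age):
    """Return version identifiers released within max_age of now, newest first."""
    recent = [
        (released, version)
        for version, released in release_dates.items()
        if timedelta(0) <= now - released <= max_age
    ]
    return [version for released, version in sorted(recent, reverse=True)]


# Hypothetical release dates for three versions of one library.
release_dates = {
    "1.0.0": datetime(2020, 1, 15),
    "1.1.0": datetime(2023, 6, 1),
    "2.0.0": datetime(2024, 2, 10),
}
selected = select_recent_versions(
    release_dates, now=datetime(2024, 6, 1), max_age=timedelta(days=730)
)
```

Here the oldest version falls outside the window and is skipped, so source code would be obtained only for the two newer versions.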
In Block 310, a library version fingerprint is generated for each source code in one or more embodiments. Generating the library version fingerprint for a particular library is described in
In Block 312, target software is processed using fingerprints to detect libraries and library versions in the target software. The system analyzes target software. The target software may be another library or an application. Analyzing the target software is presented in
In one or more embodiments, when the set of libraries and possibly library versions that are included in the target software are determined, the vulnerability record repository may be queried with the set of libraries and library versions to obtain vulnerability records. The vulnerability records may link to patches that correct the vulnerability. The patches may be applied to the target software. For example, the target software may be updated to reference a newer version of the library.
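The query described above can be sketched as a lookup of detected (library, library version) pairs against a collection of vulnerability records. This is a simplified in-memory stand-in, not the interface of any real vulnerability database; the `VulnerabilityRecord` fields, the record identifiers, and the library names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VulnerabilityRecord:
    identifier: str               # unique identifier of the record
    library: str                  # library the record applies to
    affected_versions: frozenset  # library versions reported as vulnerable
    description: str


def find_vulnerabilities(records, detected):
    """Return records whose library and an affected version were detected."""
    detected = set(detected)
    return [
        record
        for record in records
        if any((record.library, version) in detected
               for version in record.affected_versions)
    ]


# Placeholder records; identifiers and versions are invented for illustration.
records = [
    VulnerabilityRecord("EXAMPLE-0001", "libfoo", frozenset({"1.0", "1.1"}), "buffer overflow"),
    VulnerabilityRecord("EXAMPLE-0002", "libbar", frozenset({"2.0"}), "integer overflow"),
]
hits = find_vulnerabilities(records, [("libfoo", "1.1"), ("libbar", "3.0")])
```

Only the first record is returned, because the detected version of the second library is not among its affected versions.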
In one or more embodiments, a separate control flow graph is generated for each code section. For example, each compilation unit or function or method may have an individual control flow graph generated for the compilation unit. In such a scenario, the process described below is repeated for each control flow graph and the vector embeddings generated therefrom are added to the set of library fingerprints.
In Block 404, the control flow graph is normalized to generate a normalized graph. Normalizing the control flow graph iterates through the instructions to replace architecture specific terms with a placeholder representing a type of term. The normalization process thus removes architecture specific language to transform the control flow graph to be architecture independent. Thus, even though the reference binary may be architecture specific, the library fingerprint for the library may be architecture independent. Performing the normalization may be based on a mapping structure. The mapping structure maps tokens or portions of tokens to the placeholder for the token. Specifically, the instructions in the control flow graph are parsed to generate tokens, and each token that has a mapping in the mapping structure may be replaced with the placeholder mapped to the token. For example, the mapping structure may indicate that any token starting with REG should be replaced with REG for register. Thus, REG1 is replaced with REG. Similarly, the mapping structure may indicate that tokens satisfying a regular expression should be replaced with a corresponding placeholder. For example, a token satisfying a regular expression for memory address may be replaced with MEM for memory. In one or more embodiments, the analysis is performed on the operands of the instructions. In some embodiments, both the operands and the operators of the instructions are replaced.
In Block 406, the normalized graph is traversed to generate execution traces from the normalized graph. Each execution trace follows a path through the control flow graph starting from the starting node of the normalized graph to an ending node of the normalized graph. Thus, the path is a possible flow of execution through a portion of the library. Execution traces may be overlapping, and may optionally start at a same starting node and end at a same or different ending node. Depth first search traversal starting at the starting node may be performed to traverse the normalized graph. When a conditional expression is identified in which the current node has two or more children, the current execution trace may be cloned for each child of the current node. Each clone continues along its respective child node. When a node of the normalized graph is visited during traversal, the instructions in the node are added to the execution trace. The result of traversing the normalized graph is a set of execution traces that track the possible paths through the reference binary.
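The mapping-structure approach to normalization can be sketched as follows. This is a minimal illustration under stated assumptions: the regular expressions cover only a few x86-64 style register names, the `IMM` placeholder for immediate constants is an addition not named in the description, and the function name `normalize_instruction` is hypothetical. Only operands are replaced; the operator is kept as is.

```python
import re

# Hypothetical mapping structure: each entry pairs a regular expression with
# the placeholder that replaces any operand token the expression matches.
MAPPING = [
    (re.compile(r"^\[.*\]$"), "MEM"),                  # bracketed memory reference
    (re.compile(r"^(?:r|e)?[a-d]x$|^r\d+$"), "REG"),   # a few x86-64 style registers
    (re.compile(r"^(?:0x[0-9a-fA-F]+|\d+)$"), "IMM"),  # immediate constant (assumed placeholder)
]


def normalize_instruction(instruction):
    """Keep the operator and replace each mapped operand with its placeholder."""
    operator, _, rest = instruction.strip().partition(" ")
    if not rest:
        return operator  # instruction without operands, e.g. "ret"
    normalized = []
    for operand in (token.strip() for token in rest.split(",")):
        for pattern, placeholder in MAPPING:
            if pattern.match(operand):
                operand = placeholder
                break
        normalized.append(operand)
    return f"{operator} {', '.join(normalized)}"


examples = [normalize_instruction(i)
            for i in ("mov eax, [0x4005d0]", "add rbx, 8", "jmp r10", "ret")]
```

Operands that match no entry are left unchanged, so the mapping structure can be extended per architecture without touching the replacement loop.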
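The depth first traversal with trace cloning can be sketched as below. The graph encoding (a dictionary from block identifier to a pair of instructions and successor identifiers) and the function name `generate_traces` are hypothetical. The sketch also makes one assumption that the description does not state: a successor already on the current path is skipped, so that loops in the graph cannot produce unbounded traces.

```python
def generate_traces(graph, start):
    """Enumerate paths from start to blocks with no unvisited successors.

    graph maps a block id to (instructions, successor ids).
    """
    traces = []
    # Each stack entry holds: block id, instructions so far, blocks on this path.
    stack = [(start, [], frozenset())]
    while stack:
        node, trace, on_path = stack.pop()
        instructions, successors = graph[node]
        trace = trace + instructions  # new list, so each branch owns its clone
        on_path = on_path | {node}
        # Assumption: ignore back edges so that loops terminate.
        next_nodes = [s for s in successors if s not in on_path]
        if not next_nodes:
            traces.append(trace)
            continue
        # Push in reverse so the first child is explored first.
        for child in reversed(next_nodes):
            stack.append((child, trace, on_path))
    return traces


# Invented normalized graph: b0 branches to b1 or b2, and b1 continues to b2.
graph = {
    "b0": (["cmp REG, IMM", "je MEM"], ["b1", "b2"]),
    "b1": (["add REG, IMM"], ["b2"]),
    "b2": (["ret"], []),
}
traces = generate_traces(graph, "b0")
```

Because `trace + instructions` builds a fresh list, pushing the same trace for several children behaves like cloning the current execution trace at a conditional node.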
Although
In Block 408, for each of at least a subset of execution traces, the execution trace is processed by a vector embedding model to generate a library vector embedding of the execution trace. The library vector embedding is a vector embedding that is part of the library fingerprint. The vector embedding model processes each token of the execution traces in order to encode the execution traces. Execution traces that are semantically similar have vector embeddings with a high cosine similarity, and execution traces that are semantically different have vector embeddings with a low cosine similarity.
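For reference, the standard definitions of cosine similarity and cosine distance between two vector embeddings are given below. The symbols u and v (the two embeddings) and d (their dimension) are notation introduced here for illustration.

```latex
\[
\operatorname{sim}(u, v)
  = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
  = \frac{\sum_{i=1}^{d} u_i v_i}
         {\sqrt{\sum_{i=1}^{d} u_i^{2}} \, \sqrt{\sum_{i=1}^{d} v_i^{2}}},
\qquad
\operatorname{dist}(u, v) = 1 - \operatorname{sim}(u, v)
\]
```

Similarity ranges from -1 to 1, so embeddings pointing in nearly the same direction have a similarity near 1 and a distance near 0.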
When generating the execution traces, the library identifier is related, in storage, to the library vector embeddings as a library fingerprint in Block 410. The set of library vector embeddings form the library fingerprint for the library.
In Block 506, the compiler agnostic portion is processed through a vector embedding model to generate a version vector embedding of the library version. The vector embedding model may be the same or different than the vector embedding model used to generate the vector embedding for the library. If the compiler agnostic portion has multiple separate portions, such as in different code sections, then a vector embedding may be generated for each separate portion.
In Block 508, the library version identifier is related in storage to the version vector embedding as the library version fingerprint. Specifically, the set of one or more version vector embeddings are linked to the library version identifier.
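The two relations described above, library identifier to library vector embeddings and library version identifier to version vector embeddings, can be sketched with a small in-memory structure. The class `FingerprintRepository`, its method names, and the identifiers and embedding values are hypothetical; actual storage could be a database or any other repository.

```python
from collections import defaultdict


class FingerprintRepository:
    """In-memory stand-in for storage that relates identifiers to embeddings."""

    def __init__(self):
        self._library_level = defaultdict(list)  # library id -> embeddings
        self._version_level = defaultdict(list)  # (library id, version id) -> embeddings

    def add_library_embedding(self, library_id, embedding):
        self._library_level[library_id].append(tuple(embedding))

    def add_version_embedding(self, library_id, version_id, embedding):
        self._version_level[(library_id, version_id)].append(tuple(embedding))

    def library_fingerprint(self, library_id):
        """The set of library vector embeddings recorded for one library."""
        return list(self._library_level.get(library_id, []))

    def version_fingerprints(self, library_id):
        """All version fingerprints recorded for one library, keyed by version id."""
        return {
            version_id: list(embeddings)
            for (lib, version_id), embeddings in self._version_level.items()
            if lib == library_id
        }


repo = FingerprintRepository()
repo.add_library_embedding("library-x", [0.1, 0.9])
repo.add_library_embedding("library-x", [0.7, 0.3])
repo.add_version_embedding("library-x", "version-m", [0.2, 0.8])
```

Keeping the two levels separate lets a lookup first narrow to a library and only then compare against that library's version fingerprints.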
For clarity of terminology, the library vector embedding is the vector embedding of the library. The version vector embedding is the vector embedding of the library version. The target vector embedding is the vector embedding of the target software.
In Block 610, libraries in which the target vector embeddings match a threshold of the library fingerprint are selected to obtain selected libraries. In one or more embodiments, the target vector embeddings are compared to the library vector embeddings to identify matching vector embeddings. In one or more embodiments, matching is an exact match (e.g., identical vector embeddings). In other embodiments, matching is based on a distance threshold on cosine distance. Namely, cosine distance between the library vector embedding and the target vector embedding may be calculated. If the cosine distance satisfies the distance threshold, then the library vector embedding is determined to match the target vector embedding.
The libraries having more than a threshold number or a threshold percentage of library vector embeddings that match the target vector embeddings are selected. The threshold may be, for example, eighty percent of the set of library vector embeddings. However, other thresholds may be used without departing from the scope of the invention.
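The two-stage matching described above, a distance threshold per embedding followed by a percentage threshold per library, can be sketched as follows. The function names, the distance threshold of 0.05, the two-dimensional embeddings, and the library names are assumptions for illustration; the sketch also treats the eighty percent threshold as inclusive, which is one possible reading.

```python
import math


def cosine_distance(a, b):
    """One minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def fraction_matched(library_fingerprint, target_embeddings, distance_threshold):
    """Fraction of library embeddings with at least one close target embedding."""
    if not library_fingerprint:
        return 0.0
    matched = sum(
        1
        for lib_vec in library_fingerprint
        if any(cosine_distance(lib_vec, t) <= distance_threshold
               for t in target_embeddings)
    )
    return matched / len(library_fingerprint)


def select_libraries(fingerprints, target_embeddings,
                     distance_threshold=0.05, match_threshold=0.8):
    """Return ids of libraries whose matched fraction reaches match_threshold."""
    return [
        library_id
        for library_id, fingerprint in fingerprints.items()
        if fraction_matched(fingerprint, target_embeddings,
                            distance_threshold) >= match_threshold
    ]


# Invented fingerprints and target embeddings.
fingerprints = {
    "libfoo": [(1.0, 0.0), (0.0, 1.0)],
    "libbar": [(-1.0, 0.0), (0.0, -1.0)],
}
target = [(0.99, 0.01), (0.02, 0.98), (0.5, 0.5)]
selected = select_libraries(fingerprints, target)
```

Comparing every library embedding against every target embedding is quadratic; at scale an approximate nearest neighbor index would likely replace the inner loop, though that choice lies outside this sketch.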
In Block 612, library versions of selected libraries in which the target vector embeddings match a threshold of the library version fingerprint are selected. In one or more embodiments, for each selected library in Block 610, the library version fingerprints of the library versions of the library are obtained. A same or similar process as described above with reference to Block 610 is performed. In one or more embodiments, the target vector embeddings are compared to the library vector embeddings to identify matching vector embeddings and a determination is made whether the matching vector embeddings satisfy the threshold.
In Block 614, the target software is processed based on selected libraries and selected library versions being included in target software. Processing the target software may be performed as described above with reference to Block 312 of
In Block 702, portions of training execution traces generated from binaries are masked to generate a masked training dataset. The training execution traces may be execution traces from a subset of the libraries that are normalized. One or more tokens in each of the training execution traces are masked, or hidden. The vector embedding model is then trained based on the masked training dataset in Block 704. Training the vector embedding model may be based on sentence completion training. In sentence completion training, the vector embedding model is trained to complete the rest of a sentence based on the first part of the sentence. In the present case, the sentence is the training execution trace. The completion of the sentence is the masked portion of the training execution trace. The vector embedding is the last hidden state of the vector embedding model. Thus, by performing sentence completion training, the vector embedding model is trained to output, as a final hidden state, a similar vector embedding for correct prediction of masked portions and dissimilar vector embeddings for incorrect prediction of masked portions.
In Block 706, the execution traces are grouped by functions into function training datasets. Each function is a code section. In Block 708, the vector embedding model is trained based on the function training datasets. The training of the vector embedding model in Block 708 is based on next sentence prediction. In next sentence prediction, the vector embedding model outputs true when a sentence is predicted as being the next sentence and false otherwise. In the present case, given two execution traces from the same function, the training causes the vector embedding model to predict true, indicating the same function, and false otherwise. As discussed above, the final hidden state of the vector embedding model is the vector embedding of the execution trace.
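Building a masked training example from a normalized execution trace might look like the sketch below. It covers only the data preparation step of Block 702, not the training itself. The mask token `[MASK]`, the 15 percent masking fraction, and the function name `mask_trace` are assumptions borrowed from common masked language model practice rather than values given in the description.

```python
import random

MASK = "[MASK]"  # assumed mask token


def mask_trace(tokens, mask_fraction=0.15, rng=None):
    """Hide a random subset of tokens; return the masked copy and the hidden tokens.

    The second return value maps each masked position to the original token,
    which is what the model would be trained to predict.
    """
    if not tokens:
        return [], {}
    rng = rng or random.Random()
    count = min(len(tokens), max(1, round(len(tokens) * mask_fraction)))
    positions = rng.sample(range(len(tokens)), count)
    masked = list(tokens)
    labels = {}
    for position in positions:
        labels[position] = masked[position]
        masked[position] = MASK
    return masked, labels


# Invented normalized trace, already split into tokens.
trace = "mov REG , MEM add REG , IMM ret".split()
masked, labels = mask_trace(trace, rng=random.Random(0))
```

Passing a seeded `random.Random` makes the masked dataset reproducible across runs, which helps when comparing training configurations.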
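The function training data for Block 708 can be pictured as labeled pairs of execution traces: true when both traces come from the same function and false otherwise. The sketch below builds such pairs exhaustively; the function name `build_pairs`, the function names in the example, and the short trace strings are hypothetical, and a practical pipeline would likely sample negative pairs rather than enumerate all of them.

```python
from itertools import combinations


def build_pairs(traces_by_function):
    """Return (trace_a, trace_b, label) triples for same-function prediction."""
    pairs = []
    functions = sorted(traces_by_function)
    # Positive pairs: two distinct traces from the same function.
    for name in functions:
        for a, b in combinations(traces_by_function[name], 2):
            pairs.append((a, b, True))
    # Negative pairs: one trace from each of two different functions.
    for f, g in combinations(functions, 2):
        for a in traces_by_function[f]:
            for b in traces_by_function[g]:
                pairs.append((a, b, False))
    return pairs


# Invented traces grouped by the function they were generated from.
traces_by_function = {
    "parse": ["mov REG, MEM ret", "mov REG, MEM add REG, IMM ret"],
    "emit": ["push REG call MEM ret"],
}
pairs = build_pairs(traces_by_function)
```

With two traces in one function and one trace in another, the example yields one positive pair and two negative pairs.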
The process of
Turning to
Libraries may be referenced by application code (804). For example, the application code may include a link to a library for backward compatibility. The application code may include references to multiple libraries. During the build process, the application code is compiled and binaries of the libraries are added to create a target application (808). Thus, as part of the target application (808), binary libraries such as foo.so and snappy.so are included. However, the libraries to which foo.so and snappy.so correspond may be unknown.
Embodiments may be implemented on a computing system specifically designed to achieve an improved technological result. When implemented in a computing system, the features and elements of the disclosure provide a significant technological advancement over computing systems that do not implement the features and elements of the disclosure. Any combination of mobile, desktop, server, router, switch, embedded device, or other types of hardware may be improved by including the features and elements described in the disclosure. For example, as shown in
The input devices (1010) may include a touchscreen, keyboard, mouse, microphone, touchpad, electronic pen, or any other type of input device. The input devices (1010) may receive inputs from a user that are responsive to data and messages presented by the output devices (1012). The inputs may include text input, audio input, video input, etc., which may be processed and transmitted by the computing system (1000) in accordance with the disclosure. The communication interface (1008) may include an integrated circuit for connecting the computing system (1000) to a network (not shown) (e.g., a local area network (LAN), a wide area network (WAN) such as the Internet, mobile network, or any other type of network) and/or to another device, such as another computing device.
Further, the output devices (1012) may include a display device, a printer, external storage, or any other output device. One or more of the output devices may be the same or different from the input device(s). The input and output device(s) may be locally or remotely connected to the computer processor(s) (1002). Many different types of computing systems exist, and the aforementioned input and output device(s) may take other forms. The output devices (1012) may display data and messages that are transmitted and received by the computing system (1000). The data and messages may include text, audio, video, etc., and include the data and messages described above in the other figures of the disclosure.
Software instructions in the form of computer readable program code to perform embodiments may be stored, in whole or in part, temporarily or permanently, on a non-transitory computer readable medium such as a CD, DVD, storage device, a diskette, a tape, flash memory, physical memory, or any other computer readable storage medium. Specifically, the software instructions may correspond to computer readable program code that, when executed by a processor(s), is configured to perform one or more embodiments, which may include transmitting, receiving, presenting, and displaying data and messages described in the other figures of the disclosure.
The computing system (1000) in
The nodes (e.g., node X (1022), node Y (1024)) in the network (1020) may be configured to provide services for a client device (1026), including receiving requests and transmitting responses to the client device (1026). For example, the nodes may be part of a cloud computing system. The client device (1026) may be a computing system, such as the computing system shown in
The computing system of
As used herein, the term “connected to” contemplates multiple meanings. A connection may be direct or indirect (e.g., through another component or network). A connection may be wired or wireless. A connection may be a temporary, permanent, or semi-permanent communication channel between two entities.
The various descriptions of the figures may be combined and may include or be included within the features described in the other figures of the application. The various elements, systems, components, and steps shown in the figures may be omitted, repeated, combined, and/or altered from what is shown in the figures. Accordingly, the scope of the present disclosure should not be considered limited to the specific arrangements shown in the figures.
In the application, ordinal numbers (e.g., first, second, third, etc.) may be used as an adjective for an element (i.e., any noun in the application). The use of ordinal numbers is not to imply or create any particular ordering of the elements nor to limit any element to being only a single element unless expressly disclosed, such as by the use of the terms “before”, “after”, “single”, and other such terminology. Rather, the use of ordinal numbers is to distinguish between the elements. By way of an example, a first element is distinct from a second element, and the first element may encompass more than one element and succeed (or precede) the second element in an ordering of elements.
Further, unless expressly stated otherwise, “or” is an “inclusive or” and, as such, includes “and.” Further, items joined by an “or” may include any combination of the items with any number of each item unless expressly stated otherwise.
In the above description, numerous specific details are set forth in order to provide a more thorough understanding of the disclosure. However, it will be apparent to one of ordinary skill in the art that the technology may be practiced without these specific details. In other instances, well-known features have not been described in detail to avoid unnecessarily complicating the description. Further, other embodiments not explicitly described above can be devised which do not depart from the scope of the claims as disclosed herein. Accordingly, the scope should be limited only by the attached claims.