The described technology is directed to the field of software development, deployment, and evaluation.
Over the history of software development, software development techniques and technology have advanced significantly, specifically with the use of iterative development (Agile), reusable code (libraries, frameworks and open source), and remote infrastructure (cloud and API services) technologies and methodologies. In addition, the corporate and development culture has also changed, and modern software is now often built by distributed teams that comprise employees (often in different locations), contractors, vendors, and offshore engineers working together.
Thus, both the techniques and approaches used and the development environments being used have changed, with the result that it is not uncommon for the development of a complex software application to be conducted by multiple teams distributed in different locations worldwide, and using software elements (such as libraries, APIs, functional modules, open source code, algorithms, etc.) that are obtained from other sources and not developed in-house by those teams.
The use of software elements from disparate sources in developing an application can create a risk in that these software elements may contain a virus, an intentionally placed piece of malware, or another form of potentially damaging code that is, as a result, incorporated into the application. Even in the absence of a specifically-known risk, a software element may possess a known vulnerability, so that its use creates a source of risk to a software application or to the development environment. And while much has changed in the area of software development methods, relatively little has changed in the area of software security with regards to the way that security is taken into account when developing applications that incorporate software elements developed by other parties. In this regard, developers typically focus on conducting a testing cycle after software is complete. This is expensive and ineffective, and as recognized by the inventors, may cause the development environment to be exposed to harmful or improperly tested software elements prior to testing of the constructed software. This both creates an inherent risk and is inefficient since the same potentially damaging software element may be incorporated into multiple places in the final software product.
The number of software development languages, frameworks, libraries and APIs available to be used by today's developers has become quite large, and the number of available software elements that may be incorporated into a software application continues to grow. As a result, in order to be aware of potential risks, software developers need to be able to understand and/or track a vast amount of security data related to the code, libraries, and other software elements that they may use in developing an application. Yet application development security teams are rarely able to keep up with the ever increasing volume of software elements, security data, and related information.
One aspect of preventing the introduction of potentially harmful software elements into a development environment is being able to identify whether an element that is being considered for use (or is being developed) has a known vulnerability or is instead expected to be safe. In the case of software components that are in bytecode form, this assessment must be done with respect to the bytecode contents of the software component. As an example, consider Java components that are typically deployed using a jar file. Where a component File.jar is used in an application, the problem is to detect if this component matches any of the components in the catalog of known vulnerable components.
A hardware and/or software facility is described (“the facility”) that generates a fingerprint or other form of identifier for a software element, such as a bytecode software element, that may be used by a developer to construct an application. In some cases, the fingerprint is referred to as a “computer bytecode fingerprint,” or “CBF.” CBF makes use of a uniform format based on bytecode of different platforms like Java, Android, and .NET. CBF contains information about the classes, methods, and fields used in the component; this information is extracted from the bytecode of the component. The fingerprint may then be used to assist in determining if a software element that a developer wishes to introduce into a development environment is known to possess a vulnerability, potentially damaging code, or other form of undesirable aspect. The facility permits the characterization of software elements in a form that permits comparison between such elements to determine whether they are the same or substantially similar, such as to identify suspect elements and preventing their use in a development environment. As a result, the facility can be used to assist a group of developers to reduce the risk to the development process from externally created software elements such as APIs, code, functional modules, open source code, etc. that the developers may wish to incorporate into their software application.
As noted, one purpose of fingerprinting is to allow the comparison or matching of libraries against a larger dataset of known vulnerable libraries (i.e., those known to have a vulnerability or to contain potentially damaging code). When a match occurs, a system/platform that is responsible for managing the access to and integration of software elements into a development environment can alert developers, management, or other appropriate people of a potential risk in using the identified library so that corrective action can be taken (such as by prohibiting use of that library and removing it from consideration for future use).
For the purpose of this description, a “code library” may be one or more of a singular computer file, or other body of code.
In some embodiments, the facility may be used as part of managing the access to and use of software elements in a software development environment used to develop a software application. The facility is used by the system or platform to identify software elements with a known vulnerability and, in response, prevent the incorporation of those elements into an application module being developed within a software development environment. The management function(s) may be implemented as a system or platform which includes processes for generating or deriving a “fingerprint” or “fingerprints” for one or more software elements that a developer desires to use, and compares that fingerprint or fingerprints to a record of the fingerprints of suspect elements (such as a “blacklist” of the fingerprints of elements having a known or suspected vulnerability). Thus, when searching for a “match” the system may perform a many-to-many comparison, with only a single match being required for positive identification of a software element. The fingerprinting process may be provided in any suitable format, independently or as part of a software management platform, and may be implemented by any suitable computing or data processing device (e.g., web-service, cloud-computing service, Software-as-a-Service business model, or as a dedicated server or computing device located in one or more locations, etc.). In one example embodiment, the facility is implemented as part of a multi-tenant cloud-based data processing platform.
As noted, in some embodiments, the facility is implemented in the context of a multi-tenant, “cloud” based environment (such as a multi-tenant data processing platform), typically used to develop and provide web services for end users. This exemplary implementation environment will be described with reference to
The distributed computing service/platform (which may also be referred to as a multi-tenant data processing platform) 108 may include multiple processing tiers, including a user interface tier 116, an application server tier 120, and a data storage tier 124. The user interface tier 116 may maintain multiple user interfaces 117, including graphical user interfaces and/or web-based interfaces. The user interfaces may include a default user interface for the service to provide access to applications and data for a user or “tenant” of the service (depicted as “Service UI” in the figure), as well as one or more user interfaces that have been specialized/customized in accordance with user-specific requirements (e.g., represented by “Tenant A UI”, . . . , “Tenant Z UI” in the figure, and which may be accessed via one or more APIs). The default user interface may include components enabling a tenant to administer the tenant's participation in the functions and capabilities provided by the service platform, such as accessing data, causing the execution of specific data processing operations, specifying software elements that a developer desires to have access to, creating and/or implementing a software control policy, initiating a process to fingerprint a software element and compare it to a list of suspect elements, etc. Each processing tier shown in the figure may be implemented with a set of computers and/or computer components including computer servers and processors, and may perform various functions, methods, processes, or operations as determined by the execution of a software application or set of instructions. The data storage tier 124 may include one or more data stores, which may include a service data store 125 and one or more tenant data stores 126.
Each tenant data store 126 may contain tenant-specific data that is used as part of providing a range of tenant-specific services or functions, including but not limited to software module management, software development environment access control, characterization of software elements, storage of utilized software elements, generation and storage of software module usage policies, etc. Data stores may be implemented with any suitable data storage technology, including structured query language (SQL) based relational database management systems (RDBMS).
In accordance with one embodiment of the facility, distributed computing service/platform 108 may be multi-tenant, and service platform 108 may be operated by an entity in order to provide multiple tenants with a set of related software development applications, data storage, and functionality. These applications and functionality may include ones that a software development business uses to manage various aspects of its application development operations. For example, the applications and functionality may include providing web-based access to software development information systems, thereby allowing a user with a browser and an Internet or intranet connection to view, enter, process, or modify certain types of information.
The integrated system shown in
As noted,
The application layer 210 may include one or more application modules 211, each having one or more sub-modules 212. Each application module 211 or sub-module 212 may correspond to a particular function, method, process, or operation that is implemented by the module or sub-module. Such function, method, process, or operation may include those used to implement one or more aspects of the facility, such as for:
The application modules and/or sub-modules may include any suitable computer-executable code or set of instructions (e.g., as would be executed by a suitably programmed processor, microprocessor, or CPU), such as computer-executable code corresponding to a programming language. For example, programming language source code may be compiled into computer-executable code. Alternatively, or in addition, the programming language may be an interpreted programming language such as a scripting language or bytecode. Each application server (e.g., as represented by element 122 of
The data storage layer 220 may include one or more data objects 222 each having one or more data object components 221, such as attributes and/or behaviors. For example, the data objects may correspond to tables of a relational database, and the data object components may correspond to columns or fields of such tables. Alternatively, or in addition, the data objects may correspond to data records having fields and associated services. Alternatively, or in addition, the data objects may correspond to persistent instances of programmatic data objects, such as structures and classes. Each data store in the data storage layer may include each data object. Alternatively, different data stores may include different sets of data objects. Such sets may be disjoint or overlapping.
Note that the example computing environments depicted in
In accordance with one embodiment of the facility, the system, apparatus, methods, processes, functions, and/or operations for generating an identifier for a software element may be wholly or partially implemented in the form of a set of instructions executed by one or more programmed computer processors such as a central processing unit (CPU) or microprocessor. Such processors may be incorporated in an apparatus, server, client or other computing device operated by, or in communication with, other components of the system. As an example,
It should be understood that the facility as described above can be implemented in the form of control logic using computer software in a modular or integrated manner. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will know and appreciate other ways and/or methods to implement the facility using hardware and a combination of hardware and software.
Any of the software components, processes or functions described in this application may be implemented as software code to be executed by a processor using any suitable computer language such as, for example, Java, Javascript, C++ or Perl using, for example, procedural, object oriented and functional programming techniques. The software code may be stored as a series of instructions, or commands on a computer readable medium, such as a random access memory (RAM), a read only memory (ROM), a magnetic medium such as a harddrive or a floppy disk, or an optical medium such as a CD-ROM. Any such computer readable medium may reside on or within a single computational apparatus, and may be present on or within different computational apparatuses within a system or network.
All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and/or were set forth in its entirety herein.
Returning to
Returning to
Those skilled in the art will appreciate that the steps shown in
Returning to
Returning to
Let C(file1) be the set of classes in file file1, and C(file2) be the set of classes in file file2, respectively. Comparing files file1.CBF and file2.CBF:
CBSM(file1, file2)=(ΣCBSM_class(c1, c2))/|C(file1)∪C(file2)| (1)
Let M(c1) and M(c2) be the set of methods in class c1 and c2 respectively,
And let F(c1) and F(c2) be the set of fields in class c1 and c2 respectively,
Then, for each class c1==c2 that is present in both file file1 and file file2:
CBSM_class(c1, c2)=(w1*ΣCBSM_method(m1, m2)/|M(c1)∪M(c2)|)+w2*(|F(c1)∩F(c2)|)/(|F(c1)∪F(c2)|) (2)
Where, w1 and w2 are the weights assigned to give importance to matching methods and fields respectively. This is done to take into account the fact that a match based on the entire method is more important than a match between fields. (e.g. w1=0.8 and w2=0.2 says that method match contributes 80% of the total matching while fields contribute only 20%)
Let I(m1) and I(m2) be the set of instructions in method ml and method m2 respectively, ignoring any operands or arguments,
Then,
CBSM_method(m1, m2)=|I(m1)∩I(m2)|/|I(m1)∪I(m2)| (3)
In step 406, if the CBSM calculated in step 405 exceeds a confidence threshold, then the facility continues in step 407, else the facility continues in step 408. In step 407, the facility identifies the application component as vulnerable. In some embodiments, the confidence threshold is user-configurable. In some embodiments, the confidence threshold is 80%. After step 407, the facility continues in step 409. In step 408, if additional vulnerable components remain to be processed, the facility continues in step 404 to process the next vulnerable component, else the facility continues in step 409. In step 409, if additional application components remain to be processed, then the facility continues in step 403 to process the next application component, else these steps conclude.
An example in which a CBSM is calculated for a pair of components follows below:
Consider the following Person Component (Comp1) below in Table 1:
Further consider a second implementation of Person Component (Comp2) below in Table 2:
Based on Equation (2), the facility calculates the similarity metric between the two components as follows:
Thus the overall CBSM (Component Bytecode Similarity Metric)=0.61 (as each component has only 1 class here)
While the foregoing has described fingerprints as being generated for code resources received in bytecode form, in various embodiments, the facility generates bytecodes for software resources received in a variety of forms. As one example, in some embodiments, the facility generates fingerprints for code resources received in the source code form.
In some such embodiments, the facility uses a code translator to convert from the form in which a code resource was received into bytecode form, then generates a fingerprint from the bytecode form. Where a code resource is received in source code form, the facility performs this conversion by compiling the code resource in source code form.
In some such embodiments, the facility generates a fingerprint from the code resource in its original form. In the case of a code resource that is received in source code form, the facility generates a fingerprint by extracting a textual hierarchy from the abstract syntax tree of the program. In some embodiments, the facility performs certain kinds of translation between fingerprints generated from code resources in one form for comparison to fingerprints generated from code resources of another form.
In some embodiments, the facility generates and compares fingerprints from software resources that are in a uniform form other than bytecode, such as source code.
It will be appreciated by those skilled in the art that the above-described facility may be straightforwardly adapted or extended in various ways. While the foregoing description makes reference to particular embodiments, the scope of the invention is defined solely by the claims that follow and the elements recited therein.