The present disclosure relates in general to the field of computer software development, and more specifically, to assessing spatial security of pointers within source code.
Software programs may be written in any one of a variety of programming languages, with programs consisting of software components written in source code according to one or more of these languages. Development environments exist for producing, managing and compiling these programs. For instance, an integrated development environment (IDE) may be used, which includes a set of integrated programming tools such as code editors, compilers, linkers, and debuggers. The specific development of a software system to be secure may play an integral role in securing computing systems more generally, including the invaluable and vast data and code being hosted on these systems.
According to one aspect of the present disclosure, source code is accessed and parsed by a computer-implemented tool to automatically detect a plurality of pointers in the source code. A pointer is an address in memory at which the program stores data of a certain type, and this data could contain other pointers, i.e., addresses, to other data. A usage type of each of the plurality of pointers is determined, and from it a graph structure is generated for the source code, with the graph structure including a plurality of nodes corresponding to the plurality of pointers. Generating the graph structure includes determining whether one or more of the plurality of pointers are wild pointers, which may be subject to insecure usage, determining corresponding types for the plurality of pointers from their usage types, determining a plurality of type edges based on the determined corresponding types to couple nodes in the plurality of nodes, and determining from the usage types of the plurality of pointers a plurality of checked edges to couple a subset of the nodes in the plurality of nodes.
According to another aspect of the present disclosure, a graph structure (such as the above) may be accessed, which includes a plurality of nodes to represent a plurality of pointers within the source code, a plurality of type edges connecting nodes in the plurality of nodes within the graph structure to identify a flow of a program to be implemented using the source code, and a plurality of checked edges based on respective usage of pointers in the plurality of pointers. A computer-implemented tool may take the graph structure as an input and use the graph structure to determine whether one or more of the plurality of pointers comprise wild pointers, and determine pointer types for at least a portion of the plurality of pointers. In some instances, the source code may be automatically annotated or modified by the tool based on the determined wild pointers and/or pointer types.
Like reference numbers and designations in the various drawings indicate like elements.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc.) or in a combination of software and hardware implementations that may all generally be referred to herein as a “circuit,” “module,” “component,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
Any combination of one or more computer readable media may be utilized. The computer readable media may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an appropriate optical fiber with a repeater, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB.NET, Python or the like, conventional procedural programming languages, such as the “C” programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS).
Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that when executed can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions when stored in the computer readable medium (e.g., a non-transitory storage medium) produce an article of manufacture including instructions which when executed, cause a computer to implement the function/act specified in the flowchart and/or block diagram block or blocks. The computer program instructions may also be loaded onto a computer, other programmable instruction execution apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatuses or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
Referring now to
In some implementations, a development system 105 may be implemented as a hosted or local computing system, through which users (e.g., programmers, software testers, software engineers, etc.) may interact with the tools directly (e.g., through corresponding user interfaces (e.g., graphical user interfaces)). For instance, the development system 105 may be implemented as an IDE and/or compiler that is installed and run on a personal computer, among other examples. In other implementations, the development system 105 may be provided as a software as a service (SaaS) or cloud-based service offering, hosted on one or more network-connected computing systems, which other users (e.g., users of client systems 125, 130) may access over one or more networks (e.g., 120).
Other software systems and services (e.g., 110) may also communicate and interoperate with a development system 105 (e.g., a cloud-based implementation) over one or more local or wide-area networks (e.g., the Internet). For instance, a software system 110 may be utilized to host, test, secure, develop, or otherwise build or manage software code. Additionally, one or more repositories (e.g., 115) of software code may be provided and may be accessible over a network (e.g., by software system 110 or development system 105). For instance, tools and functionality provided by an improved development system (e.g., 105), such as discussed in more detail herein, may be utilized to automatically annotate or convert various pieces of software code developed, hosted, or managed by software system 110, repository system 115, or personal computing client system 125, 130, among other examples. For instance, a piece of code may be accessed by the development system 105 over network 120 and processed by the development system 105 to determine opportunities to convert pointers to checked versions of the pointers. The development system 105, in some implementations, may further generate a secured, or checked, version of the same piece of code and provide the improved version as an output (e.g., delivered over network 120) to a source of the code (e.g., other connected systems 110, 115, 125, 130), among other example implementations.
In general, “servers,” “clients,” “computing devices,” “network elements,” “database systems,” “user devices,” and “systems,” etc. (e.g., 105, 110, 115, 125, 130, etc.) in example computing environment 100, can include electronic computing devices operable to receive, transmit, process, store, or manage data and information associated with the computing environment 100. As used in this document, the term “data processing apparatus,” “computer,” “processor,” “processor device,” or “processing device” is intended to encompass any suitable processing device. For example, elements shown as single devices within the computing environment 100 may be implemented using a plurality of computing devices and processors, such as server pools including multiple server computers. Further, any, all, or some of the computing devices may be adapted to execute any operating system, including Linux, UNIX, Microsoft Windows, Apple OS, Apple iOS, Google Android, Windows Server, etc., as well as virtual machines adapted to virtualize execution of a particular operating system, including customized and proprietary operating systems.
Further, servers, clients, network elements, systems, and computing devices (e.g., 105, 110, 115, 125, 130, etc.) can each include one or more processors, computer-readable memory, and one or more interfaces, among other features and hardware. Servers can include any suitable software component or module, or computing device(s) capable of hosting and/or serving software applications and services, including distributed, enterprise, or cloud-based software applications, data, and services. For instance, in some implementations, a development system (e.g., 105), software system 110 (e.g., hosting one or more software applications), a repository system (e.g., 115), or other system within computing environment 100 can be at least partially (or wholly) cloud-implemented, web-based, or distributed to remotely host, serve, or otherwise manage data, software services and applications interfacing, coordinating with, dependent on, or used by other services and devices in environment 100. In some instances, a server, system, subsystem, or computing device can be implemented as some combination of devices that can be hosted on a common computing system, server, server pool, or cloud computing environment and share computing resources, including shared memory, processors, and interfaces.
While
Software system security continues to grow in importance. Many discovered and reported vulnerabilities, some of which are exploited by attackers and result in severe downstream effects, may be avoided through careful construction of the software. However, programmers seldom possess the level of expertise to appreciate the myriad ways in which design and coding choices may introduce vulnerabilities into the code, which may later serve as the gateway for an attack or exploitation of the resulting computing program. As an example, the C programming language provides a single type constructor to describe pointers to memory, but this constructor is tasked with characterizing four distinct patterns of use: (1) pointers to exactly one data item; (2) pointers to zero or more data items (an array); (3) pointers to an unknown number of non-NULL data items and concluding with a NULL; and (4) pointers to unstructured (“wild”) data. This ambiguity, and the programmer confusion that it causes, is the source of a large and pernicious class of security vulnerabilities based on illegal memory accesses. As such, it is desirable to make these distinct pointer usage patterns manifest in C programs, e.g., as distinct type constructors. Doing so is the basis of several efforts to build pre- and post-execution analyzers for C program source code, and/or to extend the C language itself with new types and annotations. Indeed, vulnerabilities that compromise memory safety are at the heart of many attacks. Memory safety has two aspects. Temporal safety is ensured when memory is never used after it is freed. Spatial safety is ensured when any pointer dereference is always within the memory allocated to that pointer. Buffer overruns—a spatial safety violation—still constitute a frequent and pernicious source of vulnerability, despite their long history.
Solutions have been attempted to address memory safety. In particular, several efforts have been made to make C programs safe. Static analysis tools aim to find vulnerabilities by looking at a program's source code pre-deployment, but may miss bugs, have trouble scaling, or emit too many alarms. Security mitigations can mute the impact of vulnerabilities by making them harder to exploit, but provide no guarantees (e.g., data leaks and mimicry attacks may still be possible). Some efforts have aimed to provide spatial safety by adding code that performs run-time checks during deployment, but such checks tend to add substantial overhead and can complicate interoperability with legacy code when pointer representations are changed. In sum, despite the multiple solutions that have been attempted, existing approaches remain deficient.
In one implementation, spatially safe programming languages have been developed, allowing incremental conversion while balancing control, interoperability, and high performance. In some implementations, a spatially safe programming language may represent all pointers in the code in their normal or legacy form (e.g., the form used in the standard, non-spatially safe version of the same programming language). The spatially safe programming language may explicitly specify the legal boundaries of pointed-to memory to enhance human readability and maintainability while supporting efficient compilation and running time. The spatially safe version of a programming language may support pointers of various types, and these types and bounds of the pointers may be used by the spatially safe compiler to either prove that an access is safe, or else to insert a run-time bounds check when such a proof is too difficult, among other example features. In some implementations, a spatially safe version of a programming language may be implemented as an extension of a compiler for a legacy or standard version of the programming language, among other examples. Indeed, in some implementations, such compilers or other tools may incorporate features and functionality to allow software code to be automatically converted, at least partially, into a spatially safe, or checked, version of the code, among other example features.
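By way of non-limiting illustration, the choice between proving an access safe and inserting a run-time bounds check may be sketched as follows. This is a toy model only, written in Python rather than as part of any compiler; the function names `plan_access` and `checked_read`, and the assumption that a proof is available exactly when the index is a compile-time constant within the declared bounds, are hypothetical and are not taken from the present disclosure.

```python
from typing import List, Optional


def plan_access(length: int, constant_index: Optional[int]) -> str:
    """Decide how the access p[i] might be compiled for p bounded to `length` items.

    `constant_index` is the index when it is known at compile time, else None.
    """
    if constant_index is not None and 0 <= constant_index < length:
        return "proved safe"  # no run-time cost is needed
    return "insert run-time check"  # no proof is available in this toy model


def checked_read(buffer: List[int], index: int) -> int:
    """Emulate an inserted check: fail instead of reading outside the bounds."""
    if not 0 <= index < len(buffer):
        raise IndexError(f"index {index} outside bounds [0, {len(buffer)})")
    return buffer[index]
```

In this sketch the pointer representation is unchanged; only the access is guarded, which reflects the stated goal of keeping pointers in their normal or legacy form.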
At least some of the systems described in the present disclosure, such as the systems of
In some implementations, a source code converter 210 may be provided, which may automatically inspect a piece of code (e.g., source code 230, 230′) to identify opportunities to improve the spatial safety of the code and automatically convert and/or annotate the code with new, improved code that resolves pointer-based vulnerabilities detected in the code (e.g., 230, 230′). In one example, the source code converter 210 may include a parser 240 to parse the code to identify pointers within the code and identify how the pointers are used within the code. Data may be generated by the parser to describe these pointers. A type checker 245 may be provided to identify, from the results of the parser 240, respective types of the identified pointers within the code. A constraint builder 250 may generate a multigraph model 225 for the code to identify whether the pointers are spatially safe. A constraint solver 255 may utilize the multigraph model 225 to determine, from the constraints, opportunities to annotate or convert the code to address spatial safety issues. A code annotator 260 may modify or annotate the code to implement or identify these opportunities (e.g., in annotated source code 235 generated by the source code converter), such as described in more detail herein. A user (e.g., human programmer) may utilize the annotated source code 235 to improve upon or attempt to “fix” the issues identified by the source code converter. Such annotated or converted versions of the source code may be provided (e.g., via network 120) to one or more remote repositories and the resulting code may be utilized to implement various components (e.g., 280) of a software application developed or improved utilizing source code converter 210, among other example uses.
As one example of a spatially safe programming language, a spatially safe version of C and C-based programming languages may be developed, such as Checked C. Checked C may support pointers to single objects, arrays, and NULL-terminated arrays. Checked C is designed to support incremental porting from legacy C. Programs may consist of a mix of checked and legacy pointers, and fully ported code can be annotated as within a checked region, which can be held blameless for any spatial safety violation. This guarantee is made possible by restricting any use of unchecked pointers and casts within the region. To allow existing unchecked code to be accessed by checked regions and with checked pointers, Checked C allows unchecked code to be annotated with bounds-safe interfaces. These describe the expected behavior and requirements of the code and can be added to parameters and return values of function declarations/definitions, function and record types, and global variables. Such interfaces support modular porting and use of legacy libraries. In a Checked C implementation, programmers can add safety with each use of a checked pointer, and then extend the safety guarantee by expanding the scope of checked regions. As a result, each step of the porting process may yield a working software artifact. Ultimately, a fully-ported program is assuredly safe, and in the meantime scrutiny can be focused on any unchecked regions, making the process of debugging significantly simpler and more directed. Indeed, a checked version of a programming language, such as Checked C, may allow the base programming language (e.g., C) to be extended with bounds-enforced checked pointer types. These pointers may be backward binary-compatible with legacy C pointers and may co-exist with them, ensuring efficiency and allowing a program's continued use while it is retrofitted for security.
In some implementations, a source code converter or other tool (e.g., a compiler including the functionality of an example source code converter) may be provided to enable retrofitting of a program written in a particular programming language in accordance with a checked version of the programming language. For instance, a development environment, code checker, compiler, or other tool may be provided with source code conversion functionality implemented as a static analysis-based tool to ease the retrofitting process. In one example, a source code converter module or utility may include logic executable by a computing device to automatically convert legacy pointers to checked versions of the pointers (or “checked pointers”). Checked pointers may be represented as system-level memory words, with no “fattening” metadata attached. Regions of code using only checked pointers may enjoy local spatial safety, in that any run-time spatial safety violation cannot be blamed on code in a checked region. Checked pointers confer safety benefits. Files, functions, and even single blocks of code that use only checked pointers and avoid certain unsafe idioms (e.g., variadic function calls) can be designated as checked regions, such that the region is sure to be spatially safe in the sense that any run-time safety violation cannot be blamed on code in that checked region. Placing an entire program in a checked region ensures it is wholly spatially safe.
In some instances, the source code converter may be run iteratively or repeatedly on a piece of code to guide a human programmer in securing code against vulnerabilities related to pointers within the code. In some instances, completely porting an existing C program to Checked C by converting its pointers to a checked type may not be realistic. Accordingly, a source code converter may be utilized to automatically and effectively assist a human programmer in iteratively refactoring their program, interspersing uses of the tool with manual changes.
In one example implementation, the source code converter tool may be equipped with logic to determine which legacy pointers can be converted into checked pointers. In one example embodiment, pointers in the programming language may be one of multiple different types. For instance, in the example of C/Checked C, each pointer may be one of three possible types, _Ptr<T>, _Array_ptr<T>, or _Nt_array_ptr<T> (ptr, arr, and ntarr for short). These types represent a pointer to a single element (e.g., ptr), array of elements (e.g., arr), and null-terminated array (e.g., ntarr) of elements of type T, respectively. In this particular example, the array-based pointer examples arr and ntarr have associated bounds annotations. Here are the three different ways to specify the bounds for a pointer p; the corresponding memory region is at the right:
The interpretation of an ntarr's bounds is similar, but the range can extend further to the right, until a NULL terminator is reached (the NULL is not within the bounds). Roughly speaking, checked pointers have a subtyping relationship ntarr<arr<ptr, in that an ntarr can be used where an arr (of the same or lesser size) is expected, which can be used where a ptr is expected (as long as the array's size is at least 1). This ordering or hierarchy can be used because the side conditions about bounds tend to hold and can be fixed manually with ease by a human programmer in the event the code converter gets it wrong.
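By way of non-limiting illustration, the subtyping relationship ntarr<arr<ptr and its side conditions on size may be sketched as a simple compatibility test. The sketch is written in Python for brevity; the function name `usable_as` and the representation of sizes as element counts are hypothetical choices made for this example only.

```python
# Rank of each checked pointer type in the ordering ntarr < arr < ptr.
ORDER = {"ntarr": 0, "arr": 1, "ptr": 2}


def usable_as(actual: str, expected: str,
              actual_size: int = 1, expected_size: int = 1) -> bool:
    """Return True if a pointer of type `actual` may stand in for `expected`.

    Sizes are element counts; a ptr is treated as pointing to one element.
    """
    if ORDER[actual] > ORDER[expected]:
        return False  # e.g., a ptr cannot be used where an arr is expected
    if expected == "ptr":
        return actual_size >= 1  # the array must hold at least one element
    # An array may stand in for an array of the same or lesser size.
    return actual_size >= expected_size
```

For instance, under these assumptions `usable_as("ntarr", "arr", 8, 4)` holds, whereas `usable_as("arr", "ptr", 0)` does not, because an empty array cannot supply the single element a ptr requires.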
In the example illustrated in
Turning to the simplified flow diagram 300 of
In one example, the source code converter 210 may infer checked pointer types, as a kind of type qualifier inference.
Nodes in the graph generated by an example constraint builder represent respective pointers identified (e.g., by the parser) within the code 230. Edges of the graph represent relationships between pointers and constraints on their use. A pointer's pattern of use is determined by a choice (or “solution”) 510 that satisfies the constraints. The graph representation and solving algorithm can serve as the basis of a program analyzer that can find defects in the source code, due to inconsistent pointer use, and/or automatically modify or annotate the code to make manifest the program's use of these various patterns and whether the use represents potential vulnerabilities within the code. For instance, constraint solver 255 may take the data embodying the graph 505 to determine annotations and potential changes for pointers implemented within the code. The solution data generated by the constraint solver 255 may be passed to a code annotator block 555 to implement the annotations or changes to the code to generate the annotated version of the source code 235.
In some implementations, the graph (referred to herein also as a multigraph) generated by the constraint builder of an example source code converter may be a graph-based representation of constraints on pointer usage within the code. The graph structure is referred to as a multigraph, as it is generated to include two kinds of edges. One kind of edge constrains when a pointer must be wild, and the other kind characterizes the form of the pointed-to memory (e.g., single, array, NULL-terminated). The setup of wild/non-wild edges may be leveraged by the constraint solver to produce modular results, which are more understandable and maintainable for programmers. If an analyzed function internally uses only non-wild pointers, the constraint solver may identify this and present this result to the user/programmer (e.g., through the generated annotations) even if callers of the function elsewhere in the program pass it wild pointers. As such, programmers can trust that functions are safe if they have only non-wild pointers in their interface, so code reviewing effort can focus elsewhere.
As introduced above, constraint solving may utilize the multigraph as an input to determine a best solution for given pointers appearing within the code from the constraints for each pointer modeled within the multigraph structure. For instance, sometimes a pointer's memory may allow multiple characterizations. As an example, a pointer to an array of size 1 may also be a pointer to a single value. The question engaged by the constraint solver engine is which characterization to use for each pointer detected within the code, so as to communicate most effectively to the programmer and to admit the greatest flexibility for future program maintenance. In one example, the possible use patterns (e.g., four use patterns) may be organized as a mathematical lattice. For instance, in one implementation, the use patterns may include NULL-terminated arrays (NT), arrays (ARR), single-term pointers (PTR), and wild or unstructured pointers (WILD) ordered as NT<ARR<PTR<WILD. The solving algorithm implemented by the constraint solver may include multiple phases, first choosing the least solution (in the lattice order) for function parameters which are pointers and then choosing the greatest solution for function returns that are pointers. Exceptions may be considered and implemented in the latter case based on whether there are bounds owing to local usage; function-internal pointers also depend on certain bounds.
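By way of non-limiting illustration, the lattice ordering and the choice between least and greatest solutions may be sketched as follows. The sketch assumes that an edge x→y of the kind characterizing pointed-to memory is read as requiring x to be no greater than y in the lattice order, and it covers only pointers already known to be non-wild. The names `Kind`, `bounds`, and `choose` are hypothetical, and the exceptions for local usage noted above are not modeled.

```python
from enum import IntEnum


class Kind(IntEnum):
    """Use patterns in lattice order NT < ARR < PTR < WILD."""
    NT = 0
    ARR = 1
    PTR = 2
    WILD = 3


def _reach(start, edges):
    """Return every node reachable from `start` along directed `edges`."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def bounds(var, type_edges):
    """Lower and upper bound for `var`, reading an edge x -> y as x <= y."""
    reversed_edges = [(dst, src) for src, dst in type_edges]
    below = [n for n in _reach(var, reversed_edges) if isinstance(n, Kind)]
    above = [n for n in _reach(var, type_edges) if isinstance(n, Kind)]
    # With no constraint, a non-wild pointer may range from NT up to PTR.
    return max(below, default=Kind.NT), min(above, default=Kind.PTR)


def choose(var, type_edges, role):
    """Pick the least solution for a parameter, the greatest for a return."""
    lower, upper = bounds(var, type_edges)
    if lower > upper:
        raise ValueError(f"inconsistent use of {var}")
    return lower if role == "param" else upper


# Hypothetical pointer "buf": a string (NT -> buf) that is also indexed (buf -> ARR).
edges = [(Kind.NT, "buf"), ("buf", Kind.ARR)]
as_param = choose("buf", edges, "param")
as_return = choose("buf", edges, "return")
```

Here `as_param` evaluates to `Kind.NT` and `as_return` to `Kind.ARR`, showing how the same constraints may admit more than one characterization and how the phase determines which one is selected.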
The multigraph structure generated by an example constraint builder engine provides a graph-based representation of pointers within the code. In some implementations, this graph may not only be a machine-consumable data structure (e.g., usable by constraint solver engines to automatically annotate or modify code, such as discussed herein), but may also be rendered as a graphical representation of code for presentation to a user/programmer (e.g., for reference during software development or troubleshooting). The multigraph may be presented graphically to allow this representation to be visualized alongside the code to assist in program understanding. It can be efficiently updated even as a program is being modified. For instance, edges characterize dependence, so only those parts of the graph that are affected by a change need to be changed. In this manner, the multigraph may be regenerated/updated in substantially real-time to reflect changes to the code made by the programmer, and these updates may be likewise presented to the user, among other example features.
Turning to
In
In
In general, the constraint builder builds the multigraph 705 by generating a node in the graph for each pointer x in the program. For each pointer x: If the pointer x is used as an array (e.g., indexed as x[e] for some e), a type edge is added: x→ARR. If the pointer x is assigned to a pointer y, the constraint builder adds a type edge x→y and two checked edges x→y and y→x (or a bidirectional checked edge x↔y) to the multigraph. If the pointer x is used unsafely, the constraint builder adds a checked edge WILD→x to the multigraph. If the pointer x is returned from a function and assigned to pointer y, the constraint builder adds a type edge x→y and a checked edge y→x. The reversed direction of the checked edge for function calls is noteworthy as it may serve to model the localized unsafe usage of particular pointers within a piece of code, among other example advantages. Further, if the pointer x is assigned to by a call from a memory allocation (malloc), which creates multiple objects, the constraint builder adds a type edge ARR→x. If the pointer x is assigned to by a call from a memory allocation (malloc), which creates a single object, the constraint builder adds a type node pointer (PTR) and generates a corresponding type edge PTR→x. For each array x, if the array is a string, the constraint builder may add a type node for a NUL-terminated array NTARR to the multigraph and generate a type edge NTARR→x for that string (rather than a type edge ARR→x). For each function pointer ƒ, if the function has pointer arguments then a call to ƒ is handled as described above, where edges are produced between the arguments to the call and the parameters of the called function pointer. If the function pointer is passed to another function, its own parameters/returns match up with those of the called function's parameter. 
If the function ƒ has parameter x and the function being called has a function pointer parameter whose own parameter is y then the constraint builder reverses the flows of a normal call, by adding a type edge y→x and a checked edge x→y. If a function pointer is assigned to/from another function pointer/function, the constraint builder also reverses the typed-edge flow, but checked edges are bidirectional. For each integer z, a type edge PTR→&z may be added by the constraint builder if the expression &z appears in the program. Through this combination of assessments, the constraint builder may build corresponding sections of the multigraph for each pointer detected within the code to generate a complete multigraph (e.g., 705) with corresponding type and checked edges.
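By way of non-limiting illustration, several of the edge-generation rules described above may be sketched as a small builder that consumes pre-extracted usage facts. The fact names (e.g., `indexed`, `assigned_to`), the `Multigraph` class, and the representation of nodes as strings are hypothetical and chosen only for this example; parsing of source code and the handling of function pointers are omitted.

```python
from typing import Iterable, Set, Tuple

WILD, PTR, ARR, NTARR = "WILD", "PTR", "ARR", "NTARR"
Edge = Tuple[str, str]


class Multigraph:
    """Nodes plus two separate edge sets, one per edge kind."""

    def __init__(self) -> None:
        self.nodes: Set[str] = set()
        self.type_edges: Set[Edge] = set()
        self.checked_edges: Set[Edge] = set()

    def add_type(self, src: str, dst: str) -> None:
        self.nodes.update((src, dst))
        self.type_edges.add((src, dst))

    def add_checked(self, src: str, dst: str) -> None:
        self.nodes.update((src, dst))
        self.checked_edges.add((src, dst))


def build(facts: Iterable[Tuple[str, ...]]) -> Multigraph:
    """Translate usage facts into type edges and checked edges."""
    graph = Multigraph()
    for kind, *ptrs in facts:
        x = ptrs[0]
        if kind == "indexed":            # x is used as x[e]
            graph.add_type(x, ARR)
        elif kind == "assigned_to":      # x is assigned to y
            y = ptrs[1]
            graph.add_type(x, y)
            graph.add_checked(x, y)
            graph.add_checked(y, x)
        elif kind == "unsafe_use":       # x is used unsafely
            graph.add_checked(WILD, x)
        elif kind == "returned_to":      # x is returned and assigned to y
            y = ptrs[1]
            graph.add_type(x, y)
            graph.add_checked(y, x)      # checked edge is reversed
        elif kind == "malloc_many":      # allocation of multiple objects
            graph.add_type(ARR, x)
        elif kind == "malloc_one":       # allocation of a single object
            graph.add_type(PTR, x)
        elif kind == "string":           # x is a string
            graph.add_type(NTARR, x)
        elif kind == "address_of":       # the expression &z appears
            graph.add_type(PTR, "&" + x)
        else:
            raise ValueError(f"unknown fact kind: {kind}")
    return graph
```

Keeping the two edge kinds in separate sets reflects the multigraph's division of labor: one set may later be traversed to decide wildness, the other to decide the form of the pointed-to memory.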
As introduced above, a constraint solver engine may take, as an input, a multigraph data structure generated by a constraint builder engine to determine constraints of the individual pointers. Specifically, the constraint solver engine may use the multigraph to map each node in the graph to WILD or SAFE and, for the pointers determined to be SAFE, map each node to a respective checked pointer type (e.g., PTR, ARR, NTARR, etc.). In one example, the constraint solver may begin by utilizing the multigraph to determine whether each pointer is WILD or SAFE. For instance, the constraint solver may traverse the graph from the WILD type node (e.g., 636) along (and in the direction of) the checked edges constructed within the multigraph to determine which nodes should also be regarded as wild. If a node cannot be reached by traversing the checked edges, it (and its corresponding pointer) may be considered safe by the constraint solver. For the nodes determined to be safe, the constraint solver may then determine the pointer type of each of the pointers (or at least each of the safe pointers) based on the respective type edges of the corresponding nodes.
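The WILD/SAFE determination described above amounts to a reachability search from the WILD node along checked edges, in edge direction. A minimal sketch follows, assuming checked edges are given as `(src, dst)` string pairs; the function name `classify` and the node names are hypothetical.

```python
from collections import deque


def classify(nodes, checked_edges):
    """Mark every node reachable from "WILD" via checked edges as WILD."""
    successors = {}
    for src, dst in checked_edges:
        successors.setdefault(src, []).append(dst)

    wild = set()
    queue = deque(["WILD"])
    while queue:
        current = queue.popleft()
        for nxt in successors.get(current, ()):
            if nxt not in wild:
                wild.add(nxt)
                queue.append(nxt)

    # Anything not reached is treated as SAFE.
    return {n: ("WILD" if n in wild else "SAFE") for n in nodes if n != "WILD"}


result = classify(
    {"a", "b", "c", "d"},
    {("WILD", "a"), ("a", "b"), ("c", "a"), ("d", "c")},
)
```

In this example `c` stays SAFE even though it has an edge into the wild node `a`, because traversal follows edge direction only.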
Continuing with the examples of
After determining whether each identified pointer in a piece of code is (potentially) wild or safe, the constraint solver may then determine, using the type edges in the multigraph, respective pointer types to be assigned to each of the pointers. The pointer type solutions may be based on an ordering of preference of supported pointer types from most to least specific. For instance, in an implementation supporting pointer types NTARR, ARR, and PTR, the ordering may be: NTARR<ARR<PTR. A type edge x→y within the multigraph may be interpreted by the constraint solver to indicate that x≤y per the pointer type ordering defined for the language. The solutions developed by the constraint solver are to respect these constraints in the sense that substituting a solution for its variable satisfies the constraints. To illustrate, in the example of
In determining pointer type solutions for pointers using an example constraint solver (e.g., as in the example of
In one example, to compute the greatest solution for variables used as parameters to functions within the code, the constraint solver may initialize all solutions to the greatest pointer type in the ordering (e.g., PTR). The constraint solver may then identify any type nodes in the multigraph.
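One way to picture the greatest-solution step is a worklist fixpoint that starts every variable at PTR and lowers it only when a type edge demands it. The sketch below is an assumption-laden illustration, not the disclosed algorithm: it reads each type edge `(x, y)` as x≤y, treats the strings `"NTARR"`, `"ARR"` and `"PTR"` as constant type nodes, assumes every other node is listed in `variables`, and omits detection of unsatisfiable constraints.

```python
from collections import defaultdict, deque

RANK = {"NTARR": 0, "ARR": 1, "PTR": 2}  # NTARR < ARR < PTR
NAME = {rank: name for name, rank in RANK.items()}


def greatest_solution(variables, type_edges):
    """Greatest assignment satisfying x <= y for every type edge (x, y)."""
    sol = {v: RANK["PTR"] for v in variables}  # start at the top of the ordering

    preds = defaultdict(list)
    for x, y in type_edges:
        preds[y].append(x)

    def value(node):
        return RANK[node] if node in RANK else sol[node]

    # Consider edges *into* each node taken from the worklist.
    worklist = deque(preds)
    while worklist:
        node = worklist.popleft()
        for x in preds[node]:
            if x in RANK:
                continue  # constant type nodes are never changed
            lowered = min(sol[x], value(node))  # greatest lower bound
            if lowered != sol[x]:
                sol[x] = lowered
                worklist.append(x)
    return {v: NAME[r] for v, r in sol.items()}


solution = greatest_solution(
    ["p", "q", "r"],
    [("p", "q"), ("q", "ARR"), ("ARR", "r")],
)
```

Here `q` is lowered to ARR by its array use, `p` follows because p≤q, and `r` stays at PTR because ARR≤r imposes only a lower bound.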
In the particular example of
This yields solutions:
The constraint solver may then use these solutions and continue to the second step in the algorithm to determine least solutions for returns for functions in the code. Finding the least solution may be carried out in a manner similar to finding the greatest solution, except that initially unsolved variables are initialized to the lowest solution in the lattice, edges are considered from the taken node and not to it, and constraints are based on a least upper bound instead of a greatest lower bound. As with the greatest solutions, if a solution is not possible for a given pointer, then the pointer may be assigned a WILD status, and wildness may traverse from this node to other nodes via pointer type edges. The following pseudocode summarizes an example implementation of this solution:
To illustrate the solving of least solutions for returns, per the above, consider the following example based on the multigraph 1205 of
which yields solutions at this stage of:
In a third step, the constraint solver may further utilize the multigraph to compute a greatest solution for unbounded return and local variables. Unbounded returns may refer to a subset of the function returns that are not connected (via traversal of type edges) to any of the constant type nodes or other nodes for which a solution has already been found in the multigraph, leaving the solution unbounded and thus artificially driven to an overspecific solution. Accordingly, in the second step of determining a least solution for function returns, only the solutions for bounded returns are preserved, while unbounded returns are reinitialized and solved in this third solution step. The solution step begins by initializing the unbounded returns and local variables to the greatest solution and using solution values determined for pointers in the preceding two steps, along with any constant type nodes. The while loop is entered again for the unbounded return and local variables to iteratively determine a solution set that satisfies the lattice ordering constraints, as in the first step. For instance, in the example of
Which in the example of
In sum, per the discussion above, an example constraint solving approach may be three-phased by first computing the greatest solution for function parameters, then computing the least solution for (bounded) returns, and then computing the greatest solution for what remains (e.g., local variables, structured fields, and unbounded returns). This is carried out by first computing the greatest solution overall, but then resetting the solutions for returns, local variables, unbounded returns, etc. The solver then computes the least solution, resets the solutions for locals, etc., and finishes by solving with the greatest solution for the remaining variables/pointers. Solving is linear time for each step. This approach may be configured to provide optimal flexibility and accuracy in the code. For instance, the greatest solution for parameters ensures the greatest flexibility for future callers. The higher the solution on the lattice ordering, the more callers can be admitted (e.g., an NTARR can still be passed when an ARR is expected). Similarly, generality can be gained by using the least solution on returns. While the greatest solution allows flexibility, using a greatest solution on returns potentially drops information that could be useful. For instance, while PTR may work for a particular program, it may limit future uses of the function that may like to know that it returns more specific data (e.g., ARR data). As discussed above, using least solutions for returns is limited to bounded returns, or returns that are constrained, directly or transitively, by a constant (e.g., ARR from array usage, or from interacting with an itype) or an already solved parameter qualifier, etc. Otherwise, unbounded returns are to be solved to the greatest solution, as a completely unconstrained return is typically a singleton pointer (e.g., PTR), rather than an artificially specific pointer type (e.g., NTARR).
Similarly, the greatest solution for locals and structured fields provides the best generality.
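The contrast between greatest and least solutions can be made concrete with a small fixpoint that runs in either direction. This is a simplified sketch under stated assumptions: type edges `(x, y)` are read as x≤y, the names `solve`, `p` and `ret` are hypothetical, the loop is a naive iteration rather than the linear-time procedure mentioned above, and the three phases are not chained together here.

```python
RANK = {"NTARR": 0, "ARR": 1, "PTR": 2}  # NTARR < ARR < PTR
NAME = {rank: name for name, rank in RANK.items()}


def solve(variables, type_edges, greatest):
    """Greatest (start at PTR, lower) or least (start at NTARR, raise) fixpoint."""
    start = RANK["PTR"] if greatest else RANK["NTARR"]
    sol = {v: start for v in variables}

    def value(node):
        return RANK[node] if node in RANK else sol[node]

    changed = True
    while changed:
        changed = False
        for x, y in type_edges:
            if greatest and x in sol and sol[x] > value(y):
                sol[x] = value(y)  # lower x to meet x <= y
                changed = True
            elif not greatest and y in sol and sol[y] < value(x):
                sol[y] = value(x)  # raise y to meet x <= y
                changed = True
    return {v: NAME[r] for v, r in sol.items()}


# A parameter used as an array (p <= ARR): the greatest solution keeps it
# as general as the usage allows.
param_greatest = solve(["p"], [("p", "ARR")], greatest=True)

# A return fed by a multi-object allocation (ARR <= ret): the least
# solution preserves the fact that the result is an array.
ret_least = solve(["ret"], [("ARR", "ret")], greatest=False)
ret_greatest = solve(["ret"], [("ARR", "ret")], greatest=True)
```

With these inputs the return solves to ARR under the least solution but to PTR under the greatest solution, which is the loss of information the passage above warns about.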
In some implementations, the three-phase solving approach may be used to produce more general solutions to other sorts of qualifier constraints, which are also organized as a lattice. For example, instead of NTARR<ARR<PTR, other example lattices may be used, such as a lattice UNTAINTED<TAINTED, which determines whether data so labeled is to be deemed trustworthy (UNTAINTED) or not (TAINTED). Constraints could be generated as described above, based on usage of library functions, e.g., functions such as getenv() would return TAINTED data, while functions such as system() would expect data to be UNTAINTED. The three-phase solving approach would improve the generality of the code to which the solution is applied, among other example advantages and use cases.
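As a rough illustration of how constraints over the UNTAINTED<TAINTED lattice might be checked, the sketch below propagates taint forward as a least solution and reports values that reach a sink expecting UNTAINTED data. It does not implement the three-phase approach itself; the function name `tainted_sinks`, the variable names, and the encoding of flows as `(src, dst)` pairs read as src≤dst are assumptions for the example.

```python
def tainted_sinks(flows):
    """Return variables that are TAINTED yet flow into an UNTAINTED sink."""
    tainted = {"TAINTED"}
    changed = True
    while changed:
        changed = False
        for src, dst in flows:
            # TAINTED <= src <= dst forces dst to TAINTED as well.
            if src in tainted and dst not in tainted and dst != "UNTAINTED":
                tainted.add(dst)
                changed = True
    return sorted(
        src
        for src, dst in flows
        if dst == "UNTAINTED" and src in tainted and src != "TAINTED"
    )


flows = [
    ("TAINTED", "env"),       # env = getenv(...): the return is TAINTED
    ("env", "cmd"),           # cmd = env
    ("cmd", "UNTAINTED"),     # system(cmd): the argument must be UNTAINTED
    ("lit", "safe_cmd"),      # safe_cmd = a literal string
    ("safe_cmd", "UNTAINTED"),
]
violations = tainted_sinks(flows)
```

In this example only `cmd` is reported, since it carries data from `getenv()` into `system()`.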
When the wild/safe and pointer type solutions have been determined for each pointer in the code by a constraint solver engine, corresponding checked language syntax may be identified and either added, automatically, to the code, or output as suggestions via annotations to the code.
In some implementations, the constraint solving process performed by a constraint solver engine may identify regions (blocks) of code that are completely safe and free from any spatial safety violations. Such regions of code may be referred to as checked regions, characterized by the complete use of safe pointers (no wild pointers) within the region of code. In some implementations, the source code converter may additionally annotate or modify code to identify checked regions of code, determined from the solutions calculated by the constraint solver engine. For instance,
In some implementations, the conversion of code can be carried out as a kind of qualifier inference, where each unannotated legacy C pointer (e.g., in an example of conversion to Checked C) is associated with a qualifier variable q whose solution is a pair (k, p), where k and p are qualifier constants. These constants come from two distinct lattices: k∈{chk, wild} indicates a checked vs. non-checked pointer, with chk<wild; p indicates a type of checked pointer (if the pointer is not wild), i.e., p∈{ntarr, arr, ptr}, where ntarr<arr<ptr. Qualifier constraints will apply to either the k or p portion of the solution (k, p).
These two different forms of constraints, kind (k) and p-type (p) constraints, induce two different constraint graphs. Qualifier constraints may arise from several sources. Most directly, they come from subtyping constraints, such as leveraging checked types that appear in the source program. One source of checked-type qualifiers comes from interface types, or itypes, which give a legacy function an as-if-checked type. For example, the itype of strtok's return type is ntarr, because strtok returns a NUL-terminated array pointer. Thus, for the code char*s=strtok( . . . ), the return type from strtok induces constraint chk≤qs in the kind graph and ntarr≤qs in the p-type graph.
Continuing with this example, if an extern function does not have an itype, then its parameters and return are deemed unchecked. Constraints may also arise from pointer usages. For instance, if a particular pointer qs is used as an array, it induces a constraint qs≤arr. Similarly, if a pointer type is cast unsafely, then a constraint wild≤qs is generated, similar to the principles discussed above. After generating and solving the constraints, if a particular pointer's qualifier is k=wild, the source code converter may leave this pointer alone in the output program (or rewritten code). Alternatively, if k=chk, then the source code converter may automatically rewrite it according to the determined solution for the pointer (e.g., according to a corresponding Checked C syntax).
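The rewriting decision just described can be sketched as a lookup from a solved pair (k, p) to a declaration string. The helper name `rewrite_declaration` is hypothetical, and the mapping of ptr, arr and ntarr to the Checked C spellings `_Ptr<T>`, `_Array_ptr<T>` and `_Nt_array_ptr<T>` is an assumption of this example rather than something stated in the text; bounds declarations that array types would normally need are omitted.

```python
# Assumed Checked C spellings for each p-type qualifier.
CHECKED_TEMPLATES = {
    "ptr": "_Ptr<{t}>",
    "arr": "_Array_ptr<{t}>",
    "ntarr": "_Nt_array_ptr<{t}>",
}


def rewrite_declaration(base_type, name, solution):
    """Render a declaration for a pointer whose qualifier solved to (k, p)."""
    kind, ptype = solution
    if kind == "wild":
        # Wild pointers are left as ordinary C pointers.
        return f"{base_type} *{name}"
    return f"{CHECKED_TEMPLATES[ptype].format(t=base_type)} {name}"


checked_decl = rewrite_declaration("char", "s", ("chk", "ntarr"))
wild_decl = rewrite_declaration("char", "s", ("wild", "ptr"))
```

For k=chk and p=ntarr this yields `_Nt_array_ptr<char> s`, while a wild solution leaves `char *s` unchanged.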
As discussed above, a constraint builder engine may be configured to generate checked edges in addition to type edges to form a multigraph. The checked edges may be generated so as to limit or contain overinclusive wildness based on type edges in the graph. This may be utilized effectively where the presumption is that a source code converter will be used iteratively by a programmer/developer, such that root causes of wildness will be addressed by the programmer between iterations of the source code converter.
Checked edges, for instance, may prevent wildness from propagating (perhaps incorrectly) from callers to callees, as is more typical in flow models. Such overinclusive wildness may cause developers to waste time looking at functions that have unchecked parameters (and returns) but are actually safe, rather than focusing on those functions that contain (or call) unsafe code. By allowing the type on the function to signal whether it deserves attention (e.g., based on a rewriting by the source code converter based on constraint solving), these designations identify that a corresponding function is internally safe, but has unsafe callers that should be considered. Accordingly, in some implementations, an example constraint builder may implement checked edges, which reverse the parameter kind constraints on function calls. In traditional qualifier inference, for a function f with parameter p, the call y=f(x) would induce constraints qx≤qp, qret≤qy. In the improved constraint builder discussed herein, such parameter edges are reversed: qp≤qx, qret≤qy. In this way, if f uses p in a way that causes it to be wild, this wildness will propagate back to callers. The same happens (as is typical) with the return. On the other hand, if the function uses p safely internally, then it would be acceptable for the caller's argument x to be wild (since chk<wild), given p's itype.
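The effect of reversing the parameter constraint can be seen by solving the kind constraints for the call y=f(x) both ways. In this sketch each pair `(a, b)` is read as a≤b under chk<wild, so wildness spreads from a to b; the function name `wild_kinds` is hypothetical, and the scenario (the caller casts x unsafely while f uses p safely) is an assumption chosen for illustration.

```python
def wild_kinds(constraints):
    """Return the qualifier variables forced to wild by a <= b constraints."""
    wild = {"wild"}
    changed = True
    while changed:
        changed = False
        for a, b in constraints:
            if a in wild and b not in wild:
                wild.add(b)
                changed = True
    return wild - {"wild"}


# The caller's argument x is cast unsafely, so wild <= qx.
caller_body = [("wild", "qx")]

# Traditional inference: qx <= qp, qret <= qy.
traditional = wild_kinds(caller_body + [("qx", "qp"), ("qret", "qy")])

# Reversed parameter edge: qp <= qx, qret <= qy.
reversed_edges = wild_kinds(caller_body + [("qp", "qx"), ("qret", "qy")])

# If instead f itself misuses p, the reversed edge carries that back to x.
callee_unsafe = wild_kinds([("wild", "qp"), ("qp", "qx"), ("qret", "qy")])
```

Under the traditional constraints the safe parameter p is dragged to wild by its caller, whereas with the reversed edge only x is wild and f keeps a checked parameter.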
Internal to a function, the source code converter may unify kind constraints; e.g., x=y induces constraints qy≤qx and qx≤qy, represented as a bidirectional edge qx↔qy. Doing so avoids casts to/from checked pointers within a function. A statement return a also induces a bidirectional edge, so that the value “flows out” of the function. Accordingly, a constraint builder engine may generate a fresh qualifier variable node at each call site to represent the returned value, such as in other examples discussed above.
If not all pointers are converted to checked after running a source code converter, such as described above, the developer may strive to manually correct as many wild pointers as possible based on the results returned by the source code converter (or other tool including a constraint builder engine and/or constraint solver engine). Specifically, the source code converter may generate an output that identifies code that is a root cause of wildness, meaning that it is responsible for a direct checked edge WILD→q in the multigraph. This is a place where, for example, an unsafe cast occurs or where an external function's parameters or return were made wild. Fixing a root cause may result in positive downstream effects, such as discussed above. Upon making manual adjustments to the code, the code may be resubmitted to the source code converter and reanalyzed. Doing so may result in additional annotated or modified code generated by the source code converter, and potentially an indication that all wildness has been effectively eradicated from the code.
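A report of root causes could be derived directly from the checked edges: each direct successor of WILD is a root cause, and the nodes reachable from it indicate what fixing it might affect. The following is a minimal sketch with a hypothetical function name `root_causes` and made-up node names; note that a node listed under one root cause may still be wild through another.

```python
def root_causes(checked_edges):
    """Map each direct successor of WILD to the nodes reachable from it."""
    successors = {}
    for src, dst in checked_edges:
        successors.setdefault(src, set()).add(dst)

    report = {}
    for root in sorted(successors.get("WILD", ())):
        seen, stack = {root}, [root]
        while stack:
            node = stack.pop()
            for nxt in successors.get(node, ()):
                if nxt not in seen and nxt != "WILD":
                    seen.add(nxt)
                    stack.append(nxt)
        report[root] = sorted(seen - {root})
    return report


report = root_causes({
    ("WILD", "cast_p"),    # e.g., an unsafe cast
    ("cast_p", "a"),
    ("a", "cast_p"),
    ("a", "b"),
    ("WILD", "ext_ret"),   # e.g., return of an external function with no itype
    ("ext_ret", "c"),
})
```

Sorting root causes by the size of their downstream set is one possible way to suggest which manual fix to attempt first between iterations.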
An example source code converter may be equipped with additional example features and functionality. For instance, multigraphs generated by the source code converter may be generated to automatically link to corresponding source code, such that the multigraph may be co-presented or even serve as a graphical user interface element within an IDE or other software development tool. As another example, additional analysis may be performed during constraint solving, such as determining or inferring array bounds for solutions that are array-based (e.g., ARR, NTARR, etc.), among other example functionality.
Turning to
Turning to the example of
The flowcharts and block diagrams in the Figures illustrate the architecture, functionality, algorithms, and operation of possible implementations of systems, methods and computer program products according to various aspects of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting of the disclosure. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The corresponding structures, materials, acts, and equivalents of any means or step plus function elements in the claims below are intended to include any disclosed structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.