The present invention pertains to the field of software defect identification, and in particular to deep learning for software defect identification.
Various techniques are known in the art for analysing computer software for defects. For example, static code analysis techniques may be used to analyse the source code of a software program to detect either or both of syntax and logic errors. This can be done without actually executing the software. In addition, techniques such as execution logging can be used to track the evolving state of a set of program variables during execution of the software, to detect unexpected operations.
Both of these techniques suffer limitations in that they depend on predefined rule sets to detect errors. These rules are typically defined by a human operator, and are often based on patterns that are readily recognizable and easily checkable. Accordingly, they tend to be very effective at detecting commonly occurring defects (such as uninitialized pointers), for which robust rules have been developed. However, they tend to be far less effective at detecting rarely occurring and/or complex defects that manifest during run-time (such as stack overflow).
This background information is provided to reveal information believed by the applicant to be of possible relevance to the present invention. No admission is necessarily intended, nor should be construed, that any of the preceding information constitutes prior art against the present invention.
An object of embodiments of the present invention is to provide methods and systems for software defect identification that overcome at least some of the limitations of the prior art.
Accordingly, an aspect of the present invention provides a neural network for identifying defects in source code of computer software. The neural network comprises: at least one convolutional layer configured to generate one or more feature abstractions associated with an input segment associated with the source code; at least one recurrent layer configured to identify within the one or more feature abstractions a pattern indicative of a defect in the source code; and at least one mapping layer configured to generate a mapping between the identified pattern and a location of the indicated defect in the source code.
Further features and advantages of the present invention will become apparent from the following detailed description, taken in combination with the appended drawings, in which:
It will be noted that throughout the appended drawings, like features are identified by like reference numerals.
Typically, each node implements a transfer function which maps input signals received through each of its input ports to output signals that are transmitted through each of its output ports. The transfer function applies weights to each of the input values received at a given node, and then combines them (typically through a set of operations such as addition, subtraction, multiplication or division) to create an output value. The output value is then transmitted on each of the output ports. The weighting factors applied to the inputs (along with the particular transfer function itself) control the propagation of signals through the neural network 100 between the input layer 104A and the output layer 104C. In some embodiments, the process of “training” the neural network (also referred to as supervised learning) comprises adjusting the weighting factors so that a predetermined input data set will produce a desired output data set.
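By way of a purely illustrative sketch (the function name, the summation, and the sigmoid transfer function below are assumptions for illustration, and do not form part of any described embodiment), a single node's transfer function may be expressed as:

```python
import math

def node_output(inputs, weights, bias=0.0):
    """Apply a weight to each input value, combine the weighted values by
    summation, and pass the result through a sigmoid transfer function to
    produce the node's output value."""
    combined = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-combined))
```

Training then amounts to adjusting `weights` (and `bias`) until a predetermined input data set produces the desired output data set.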
As in the simple neural network 100 of
The present invention provides methods and systems that exploit the pattern-recognition capability of neural networks to identify defects in computer software.
Typically, a parse tree is implemented as a string of the form: Symbol1:=Operation(Symbol2,Symbol3), where “Operation” is a functional operation of a particular computing language, Symbol2 and Symbol3 are values upon which “Operation” acts, and Symbol1 is a value that receives the result of “Operation” acting on Symbol2 and Symbol3. It will be appreciated that either one or both of Symbol2 and Symbol3 may themselves be values that receive the results of other operations.
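For illustration only (the helper name and the dictionary layout are assumptions, not part of any embodiment), a parse-tree string of this form could be decomposed as follows:

```python
import re

# Matches strings of the form  Symbol1:=Operation(Symbol2,Symbol3)
PARSE_TREE = re.compile(r"^(\w+):=(\w+)\((\w+),(\w+)\)$")

def parse_tree_string(s):
    """Split a parse-tree string into its result symbol, operation, and operands."""
    m = PARSE_TREE.match(s)
    if m is None:
        raise ValueError(f"not a parse-tree string: {s!r}")
    result, operation, a, b = m.groups()
    return {"result": result, "operation": operation, "operands": (a, b)}
```

Nesting is represented indirectly: an operand symbol may itself be the result symbol of another parse-tree string.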
Following completion of the Semantic Analysis 210, a multi-pass process is typically implemented to convert the parse trees into the Executable code 204. In the example of
Intermediate Code Generation 212 normally involves replacing the parse trees with corresponding machine Opcodes that can be executed on a processor. In many cases this process may be implemented as a simple replacement operation, as each parse tree string is replaced by its Opcode equivalent.
Machine Independent Optimization 214 typically involves selecting an order of operation of the Opcodes selected by the Intermediate Code Generation 212, for example to maximize a speed of execution of the executable code 204.
Machine code generation 216 normally involves replacing the machine Opcodes with corresponding Machine Code that can be executed on a specific processor.
Machine Dependent Optimization 218 typically involves selecting an order of operation of the Machine Code generated by the Machine code generation 216 stage, for example to exploit pipelining to maximize performance of the executable code 204.
As illustrated in
Embodiments of the present invention provide methods and systems for detecting defects in the source code 202 associated with the logic of the source code, or associated with errors that would not otherwise be detectable through conventional Lexical, Syntactic, or Semantic analysis. Example defects of the type that can be detected using methods in accordance with the present invention include logic errors, stack overflow errors, improperly terminated loops, etc. In some embodiments, parse trees output from the semantic analysis process 210 of a compiler may be analysed to detect defects. In other embodiments, one or more intermediate representations 220 may be analysed to detect defects.
In some embodiments, the input array may include information that can be used to identify a location at which a defect is detected. For example, each parse tree may include a respective identifier. Similarly, each intermediate representation 220 may include line numbers or other identifiers. When a logic defect is detected by the neural network 306, a description of the defect can be inserted into the defect report 308 along with an identifier indicating the location (e.g. the parse tree, or intermediate representation line) at which the defect was detected. In some embodiments, the identifier included in the defect report may be mapped to a corresponding location in the source code 202, for example during a post-processing step (not shown).
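The post-processing step described above may be sketched, for illustration only (the function name, dictionary keys, and mapping structure are assumptions), as:

```python
def build_defect_report(detections, id_to_source_line):
    """Assemble a defect report, mapping each detected defect's intermediate
    identifier (e.g. a parse tree or intermediate representation line) back to
    a location in the source code where such a mapping is available.

    detections: list of (identifier, description) pairs from the neural network
    id_to_source_line: mapping from intermediate identifiers to source locations
    """
    report = []
    for identifier, description in detections:
        report.append({
            "description": description,
            "intermediate_id": identifier,
            # May be None if no source mapping exists for this identifier.
            "source_line": id_to_source_line.get(identifier),
        })
    return report
```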
In an alternative embodiment, the symbol table 410 may be loaded into a first column of the input array 304, rather than the first row. In still other embodiments, the symbol table 410 may be loaded into both the first row and the first column of the input array 304. In this latter embodiment, each operation 404 can be loaded into the input array, for example at the cell in the row and column corresponding to the symbols 408 associated with that operation.
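The variant in which the symbol table 410 occupies both the first row and the first column may be sketched as follows (a purely illustrative sketch; the function name, the use of `None` for empty cells, and the tuple format for operations are assumptions):

```python
def build_input_array(symbols, operations):
    """Build an input array whose first row and first column hold the symbol
    table, with each operation placed at the cell whose row and column
    correspond to the symbols associated with that operation.

    symbols: list of symbol names
    operations: list of (operation, row_symbol, column_symbol) tuples
    """
    n = len(symbols)
    index = {s: i for i, s in enumerate(symbols)}
    array = [[None] * (n + 1) for _ in range(n + 1)]
    for s, i in index.items():
        array[0][i + 1] = s      # symbol table along the first row
        array[i + 1][0] = s      # symbol table down the first column
    for op, a, b in operations:
        array[index[a] + 1][index[b] + 1] = op
    return array
```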
In very general terms, the input array 304 (as an intermediate representation of the source code) may be supplied to the input layer of the neural network 306. However, it may be appreciated that in many cases the size of the input array 304 may not match the input vector length of the neural network 306, which will normally have a predetermined upper bound. Accordingly, the input array 304 may be processed (for example using methods known in the art) to generate a plurality of input segments that match the input vector length of the neural network 306. In some embodiments, processing the input array 304 to generate the input segments may comprise allocating predetermined portions (such as a set of one or more rows or columns) of the input array 304 to each input segment. The size of each input segment may be set by the system, or it may be based on elements (such as recognized comments and flags, for example) in the source code.
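One way of allocating predetermined portions of the input array to fixed-size input segments may be sketched as follows (the function name, row-wise allocation, and zero-padding of the final segment are illustrative assumptions):

```python
def segment_rows(array, rows_per_segment, pad_value=0):
    """Allocate consecutive blocks of rows of the input array to input
    segments of a fixed size, padding the final segment if needed so that
    every segment matches the network's input vector length."""
    segments = []
    for start in range(0, len(array), rows_per_segment):
        seg = array[start:start + rows_per_segment]
        while len(seg) < rows_per_segment:
            seg.append([pad_value] * len(array[0]))
        segments.append(seg)
    return segments
```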
The convolutional layers 502 are configured to receive and process input segments 508 to perform feature detection and reduce the complexity of the input data. In some embodiments, the convolutional layers 502 may generate one or more feature abstractions associated with each input segment 508. Each feature abstraction may correspond with a programming feature of the source code, which may include any one or more of: selections (such as if/then/else, switch); repetitions (such as for, while, do while); flow controls (such as break, continue, goto, call, return, exception handling); expressions (such as assignment, evaluation); compound statements (such as atomic/synchronized blocks); and events or event triggers.
In some embodiments, each convolutional layer may include a convolutional sublayer and a pooling sublayer. The convolutional sublayer may be configured to recognize features (such as loops, function calls, object references, and events or event triggers) in the source code. The pooling sublayer may be configured to sub-sample the output of the convolution (using either average or maximum pooling, for example) to reduce the dimensionality of the data passed to subsequent layers.
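The convolution and maximum-pooling operations described above may be illustrated in one dimension as follows (a minimal sketch; the function names and the choice of a one-dimensional signal are assumptions made for clarity):

```python
def conv1d(signal, kernel):
    """Slide the kernel across the signal; each output is the weighted sum of
    one window of inputs, so the layer responds to local feature patterns."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool(signal, window):
    """Sub-sample the convolution output by keeping only the maximum value
    in each window, reducing the dimensionality of the data."""
    return [max(signal[i:i + window]) for i in range(0, len(signal), window)]
```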
The recurrent layers 504 are configured to receive and process feature abstractions generated by the convolutional layers 502, to identify patterns which span multiple input segments. For example, the pattern associated with an individual feature (or feature abstraction) may indicate the presence of a defect in the source code based on how the pattern appears, or based on how features are identified with respect to each other.
Recurrent layers 504 may have a shared memory (not shown), so that patterns can be detected across input segments.
Specific types of recurrent layers 504 include long short-term memory (LSTM) units and gated recurrent units (GRUs).
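The way a recurrent layer carries a shared memory across input segments may be illustrated with a simple recurrent cell (a sketch only: real LSTM or GRU cells add gating, and the function names and weight layout below are assumptions):

```python
import math

def rnn_step(x, h, Wx, Wh, b):
    """One step of a simple recurrent cell: the new hidden state depends on
    both the current input x and the previous hidden state h."""
    return [math.tanh(sum(Wx[i][j] * x[j] for j in range(len(x))) +
                      sum(Wh[i][j] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(h))]

def run_over_segments(segments, h0, Wx, Wh, b):
    """The hidden state h acts as the shared memory: it is carried from one
    input segment to the next, allowing patterns that span multiple segments
    to be detected."""
    h = h0
    for x in segments:
        h = rnn_step(x, h, Wx, Wh, b)
    return h
```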
The functionally fully connected layers 506 are configured to generate a mapping between the identified pattern and a location of the indicated defect in the source code. For example, when the pattern of a defect in the source code is identified by the recurrent layers 504, there is a likelihood of a problem in a particular feature of the source code. The functionally fully connected layers 506 operate to map this (potentially defective) feature to a location in the source code. This location might be as fine-grained as a range of lines within the source code that contains a defect (e.g. a variable used in a small range that will generate an overflow error), or it might identify a relatively broad section of the source code and indicate a likely problem with a given type of structure (e.g. in a large block of code, there may be a loop that will not properly terminate, or a variable that will be assigned a value during a loop that results in an overflow error).
Effectively, the purpose of this layer is to pull the recognized errors back together with the source code, in order to facilitate correction of the defect.
As may be appreciated, a fully connected layer is one where each output of one layer is connected to an input of the next layer. In some embodiments, the functionally fully connected layers 506 are configured as fully interconnected layers. In other embodiments, two or more layers of the functionally fully connected layers 506 are not, in fact, fully interconnected, but are nevertheless configured to yield the same results as a fully connected layer. In such embodiments, such layers trade off the breadth of the connections between layers (which results in each layer having to move a large amount of data at once) for a less broad set of connections between layers, but with an increase in the depth of the number of layers. The term “functionally” fully interconnected layers is used herein to refer to both layers that are in fact fully interconnected, and layers that are not fully interconnected but are configured to yield the same results as fully interconnected layers.
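The sense in which a deeper stack of layers can yield the same results as a single fully connected layer may be illustrated as follows. This is a sketch under the simplifying assumption of linear, bias-free layers (real layers typically add non-linear activations, which limits this exact equivalence); the function names are illustrative:

```python
def dense(x, W):
    """One fully connected (linear) layer: every output depends on every input."""
    return [sum(W[i][j] * x[j] for j in range(len(x))) for i in range(len(W))]

def matmul(A, B):
    """Matrix product: composing two linear layers collapses into one."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]
```

Applying `dense` with `W1` and then with `W2` produces the same output as a single layer with weights `matmul(W2, W1)`: breadth of connection in one layer is traded against depth across several layers.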
In the embodiment of
In embodiments of the present invention, the version history stored in the Code Repository 602, and the corresponding history of defects detected in and changes made to each version of the application stored in the Change Request Database 604 are processed to extract blocks of source code that contain defects, and information describing those defects. In the illustrated example, the defect description information includes a problem classification that describes the defect, and a line number that identifies a location in the source code block at which the defect is located. The extracted software blocks, and the corresponding defect description information are used to define a training set for the neural network 306. For example, a selected block of source code may be processed as described above with reference to
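The extraction of a training set from the Code Repository 602 and the Change Request Database 604 may be sketched, for illustration only (the function name, dictionary keys, and record layout are assumptions), as:

```python
def build_training_set(code_repository, change_requests):
    """Pair each defective source code block with its defect description,
    forming (input, label) examples for training the neural network.

    code_repository: {(file, version): source_text} from the version history
    change_requests: records with "file", "version", "classification", "line"
    """
    training_set = []
    for cr in change_requests:
        source = code_repository.get((cr["file"], cr["version"]))
        if source is None:
            continue  # no matching version stored for this change request
        label = (cr["classification"], cr["line"])
        training_set.append((source, label))
    return training_set
```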
Although the present invention has been described with reference to specific features and embodiments thereof, it is evident that various modifications and combinations can be made thereto without departing from the invention. The specification and drawings are, accordingly, to be regarded simply as an illustration of the invention as defined by the appended claims, and are contemplated to cover any and all modifications, variations, combinations or equivalents that fall within the scope of the present invention.