During the development of a program or software, a range of measures is taken to ensure that the program is tested prior to the release and distribution of the program. These measures are aimed at reducing the number of bugs in the program in order to improve the quality of the program. A bug in a source code program is an unintended state in the executing program that results in undesired behavior. Regardless of these measures, the program may still contain bugs.
Software maintenance applies the corrective measures needed to fix software bugs after the bugs are reported by end users. Fixing the software bugs after deployment of the program hampers the usability of the deployed program and increases the cost of the software maintenance services. A better solution would be to detect and fix the software bugs prior to release of the program.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
A machine learning model is trained to predict the probability of a software bug in a source code file. The model is trained during a training phase that mines source code repositories for source code files having source code statements with and without software bugs. Features associated with the syntactic structure or context of the source code file are then extracted for analysis in order to generate feature vectors that train the machine learning model. The feature vectors may represent syntactic information from each line of source code, from each method in a source code file, and for each class of a source code file and/or any combination thereof. The feature vectors are used to train a machine learning model to determine the likelihood that a source code bug is present in a target source code file.
These and other features and advantages will be apparent from a reading of the following detailed description and a review of the associated drawings. It is to be understood that both the foregoing general description and the following detailed description are explanatory only and are not restrictive of aspects as claimed.
Overview
The subject matter disclosed herein discloses a mechanism for predicting software bugs in a source code file. The mechanism analyzes various source code files to extract features that represent patterns indicative of a software bug and patterns without a software bug. The features selected best capture the context in which a software bug exists and does not exist in order to train a machine learning model to learn the patterns that identify a software bug. The mechanism described herein utilizes a context that is based on the syntactical structure of the source code. Hence, the machine learning model learns the existence of a software bug from the context where the software bug exists and does not exist.
The subject matter disclosed herein utilizes several different techniques for extracting features representative of the context of software bugs and the context of bug-free source code. In one aspect, each element in a line of source code is converted into a token that represents the element. The line of source code is then represented as a sequence of tokens. The sequence of tokens is then grouped into a window or group that includes sequences of tokens in an aggregated collection of contiguous source code statements. The sequences in a window are then transformed into a binary representation which forms a feature vector that trains a machine learning model, such as a long short-term memory (LSTM) model.
In another aspect, a source code file is partially tokenized with each line of source code including a combination of tokens and source code. Each line of source code is analyzed on a character-by-character or chunk-by-chunk basis to identify characters or chunks that are associated with and without a software bug. A chunk has a predetermined number of characters. Contiguous chunks of source code are grouped into a window which is then converted into a binary representation that forms feature vectors that train a machine learning model, such as a recurrent neural network (RNN).
In yet another aspect, metrics representing a measurement of various syntactical elements of a source code file are collected. The metrics may include the number of variables, the number of mathematical operations, the number of a particular data type referenced, the number of loop constructs, the usage of a particular method, and the usage of a particular data type. These metrics may be collected for each line of source code, for each method in a source code file, for each class in a source code file, and/or other groupings deemed appropriate. The metrics are then converted into a binary representation that forms feature vectors which are used to train a potentially simpler machine learning model such as an artificial neural network (ANN).
The feature vectors are constructed from a combination of source code files having a software bug and source code files without a software bug. The feature vectors are then split into data that is used to train the machine learning model and data that is used to test the machine learning model. When the machine learning model is trained to meet a desired level of accuracy, the model is then used to predict the probability of a software bug in a source code file.
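As an illustration only, the split between training data and testing data could be sketched as follows. The function name `split_feature_vectors`, the 20% test fraction, and the fixed seed are assumptions made for this sketch and are not specified in the description.

```python
import random


def split_feature_vectors(vectors, labels, test_fraction=0.2, seed=0):
    """Shuffle labeled feature vectors and split them into train and test sets.

    The test fraction and seed are illustrative assumptions.
    """
    if len(vectors) != len(labels):
        raise ValueError("vectors and labels must have the same length")
    indices = list(range(len(vectors)))
    random.Random(seed).shuffle(indices)  # deterministic shuffle for repeatability
    n_test = int(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [(vectors[i], labels[i]) for i in train_idx]
    test = [(vectors[i], labels[i]) for i in test_idx]
    return train, test
```

The held-out test set would then be used to check whether the trained model meets the desired level of accuracy before it is applied to a target source code file.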
A visualization technique is used to display the probabilistic output from the machine learning model in several ways. A visualization engine may be utilized to display each line, method, and/or class of a target source code file with a corresponding probability of a software bug. The probability may be displayed as a numeric value, as an icon, by highlighting portions of the source code in various colors or shading the portion of the source code in a particular style and so forth. In addition, the probabilities may be displayed when they exceed a threshold. However, the subject matter disclosed herein is not constrained to any particular visualization technique, style or format and other formats, styles and techniques may be utilized as desired.
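One possible way to display a numeric probability next to a line only when it exceeds a threshold is sketched below. The function name `annotate_lines`, the text layout, and the default threshold of 0.5 are hypothetical choices for this sketch.

```python
def annotate_lines(source_lines, probabilities, threshold=0.5):
    """Prefix each source line with its bug probability when above the threshold.

    Lines at or below the threshold get a blank marker of the same width.
    """
    annotated = []
    for number, (line, prob) in enumerate(zip(source_lines, probabilities), start=1):
        marker = f"[{prob:.2f}]" if prob > threshold else " " * 6
        annotated.append(f"{marker} {number:>3}: {line}")
    return annotated


if __name__ == "__main__":
    lines = ["int[] fib = new int[n];", "fib[0] = 0;"]
    for row in annotate_lines(lines, [0.91, 0.10]):
        print(row)
```

An editor or IDE integration could replace the text marker with a highlight color or an icon, as the description contemplates.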
The detection of a software bug differs from performing type checking which uses the syntax of the programming language to find syntax errors. The software bugs referred to herein refer to semantic and logic errors. Semantic errors occur when the syntax of the source code is correct but the semantics or meaning of a portion of the source code is not what is intended. A logic error occurs when the syntax of the source code is correct but the flow of instructions does not perform or produce an intended result. Hence, a software bug affects the behavior of the source code and results in an unintended state and undesired behavior.
Attention now turns to a discussion of the methods, systems, and devices that implement this technique in various aspects.
Source Code Bug Prediction
The code analysis engine 110 analyzes the syntactic structure of the source code files at different granularities to find patterns indicative of a software bug and indicative of no software bugs. In one aspect, lines of source code from source code files with bugs and without bugs are analyzed. Each element in a line of source code is replaced with a token that is based on the grammar of the underlying programming language. The tokens in a window of a contiguous set of source code statements are aggregated to form a feature vector that trains the machine learning model.
In another aspect, a source code file is partially tokenized with each line of source code including a combination of tokens and source code. Each line of source code is analyzed on a character-by-character or chunk-by-chunk basis to identify characters or chunks that are associated with and without software bugs. A chunk is a predetermined number of characters. Certain elements in the source code file are replaced or concatenated with tokens. Contiguous chunks of source code are grouped into a window and the window is then converted into a binary representation or feature vectors that train a machine learning model, such as a recurrent neural network (RNN).
In another aspect, the lines of a source code file can be analyzed with respect to various metrics that measure the number of variables in a line of source code, the number of mathematical operations in a line of source code, the number of a particular data type of elements referenced in a line of source code, the number of loop constructs in a line of source code, the usage of a particular method in a line of source code, and the usage of a particular data type in a line of source code. These features are then used to form feature vectors that train the machine learning model. This technique is simple to implement and has the advantage of allowing a developer to add, delete, and modify the metrics to accommodate the nature of the source code being analyzed.
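A minimal sketch of per-line metric collection is shown below, assuming a C-like target language. The keyword subset, the regular expressions, and the metric names are assumptions made for illustration; counting distinct non-keyword identifiers is only a crude stand-in for the number of variables.

```python
import re

# Assumed keyword subset for a C-like language (illustrative only).
KEYWORDS = {"int", "for", "while", "do", "if", "else", "return", "new"}
LOOP_KEYWORDS = {"for", "while", "do"}
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")
# "++" and "--" are counted as a single operation each.
MATH_OPERATOR = re.compile(r"\+\+|--|[+\-*/%]")


def line_metrics(line):
    """Return a dictionary of simple syntactic metrics for one line of code."""
    words = IDENTIFIER.findall(line)
    return {
        "variables": len({w for w in words if w not in KEYWORDS}),
        "math_operations": len(MATH_OPERATOR.findall(line)),
        "loop_constructs": sum(1 for w in words if w in LOOP_KEYWORDS),
        "int_type_uses": words.count("int"),
    }
```

Because each metric is a separate dictionary entry, a developer can add, delete, or modify metrics without touching the rest of the pipeline.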
In yet another aspect of the subject matter disclosed herein, the methods and/or classes in a source code file may be analyzed instead of the lines of source code. Each method and/or class may be analyzed for metrics identifying the type of elements in each method/class, the number of variables in a line of source code, the number of mathematical operations in a line of source code, the number of a particular data type referenced in a line of source code, the number of loop constructs in a line of source code, the usage of a particular method in a line of source code, and the usage of a particular data type in a line of source code, and any combination thereof. These features are then converted into a binary representation or feature vectors that train the machine learning model.
In one aspect of the subject matter described herein, the visualization engine can be part of a source code editor or an integrated development environment (IDE). In another aspect, the visualization may be part of a user interface, a browser, or other type of application configured to present the source code file and model output in a visual manner.
Attention now turns to a description of the operations for the aspects of the subject matter described with reference to various exemplary methods. It may be appreciated that the representative methods do not necessarily have to be executed in the order presented, or in any particular order, unless otherwise indicated. The exemplary methods may be representative of some or all of the operations executed by one or more aspects described herein, and the methods can include more or fewer operations than those described. Moreover, various activities described with respect to the methods can be executed in serial or parallel fashion, or any combination of serial and parallel operations. The methods can be implemented using one or more hardware elements and/or software elements of the described embodiments or alternative embodiments as desired for a given set of design and performance constraints.
In aspects where the data mining engine searches a version control repository, the version control system may track changes made to a source code file in a change history or metadata that is recorded in the repository. Alternatively, the data mining engine may collect all the data in a source code file regardless of any modifications made to the source code file to fix a bug. Furthermore, if the history of changes made to the source code file is voluminous, recent changes may be selected in order to reduce the analysis time. If major changes were made to the source code file, only those changes made after the major changes may be considered.
The change history may indicate that the source code file was changed due to a bug fix. The data mining engine searches the change history for those source code files having changes made due to a bug fix. The change history may indicate in which source code statement the bug is located. Based on this search, the data mining engine chooses different source code files in which a change was made for a bug fix and those not having software bugs (block 204). The data mining engine tags each line of a source code file with a flag that identifies whether the line of source code includes a bug or not (block 206). These annotated programs are then input to the code analysis engine.
The source code repository may track these changes and attribute them to bug fixes. Differential code 306 illustrates the differences between the original source code file 302 and the modified source code file 304 where the source code statement “int[ ] fib=new int[n]” is annotated with the “−” symbol indicating that the associated code statement was altered. In addition, differential code 306 shows the source code statement “int[ ] fib=new int[n+1]” annotated with a “+” symbol indicating that the associated code statement is the modification. The data mining engine reads the tracked changes of a source code file (i.e., change sets) and annotates the source code file with a flag that indicates whether or not each source code statement contains a bug. Mined data 308 represents the original source code file 302 annotated with a flag at each line, where the flag “FALSE” denotes that there is no bug in a source code statement and the flag “TRUE” denotes a software bug is in the source code statement. This mined data 308 is then input to the code analysis engine.
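The annotation of an original file from a bug-fix change set can be sketched as follows. This sketch uses Python's standard `difflib` module in place of a version control system's change history, which is an assumption; the function name `label_buggy_lines` is hypothetical.

```python
import difflib


def label_buggy_lines(original_lines, fixed_lines):
    """Flag each original line True if the bug fix altered or removed it.

    Returns a list of (line, flag) pairs, mirroring the TRUE/FALSE
    annotation of the mined data.
    """
    matcher = difflib.SequenceMatcher(a=original_lines, b=fixed_lines, autojunk=False)
    flags = [False] * len(original_lines)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):  # lines changed or dropped by the fix
            for i in range(i1, i2):
                flags[i] = True
    return list(zip(original_lines, flags))
```

For the example above, only the statement that allocates the array with size n would be flagged, since it is the only line the fix modifies.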
The code analysis engine optionally filters out certain tokens deemed to be insignificant, such as comments, whitespace, etc., and code changes that are not of interest (block 404). Each element in a line is replaced with a corresponding token thereby transforming the source code statement into a sequence of tokens where each token corresponds to an element in the original source code statement (block 406).
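The filtering and token replacement described above might look like the following sketch. The token names, the keyword list, and the regular-expression tokenizer are assumptions for illustration; a real implementation would presumably rely on the grammar of the underlying programming language.

```python
import re

# Hypothetical token classes for a C-like language, tried in order.
TOKEN_SPEC = [
    ("COMMENT", r"//[^\n]*"),
    ("WHITESPACE", r"\s+"),
    ("KEYWORD", r"\b(?:int|new|for|if|else|return|while)\b"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("NUMBER", r"\d+"),
    ("OPERATOR", r"[+\-*/%=<>!]+"),
    ("PUNCTUATION", r"[\[\](){};,.]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
IGNORED = {"COMMENT", "WHITESPACE"}  # tokens deemed insignificant


def tokenize_line(line):
    """Replace each element of a source line with the name of its token class."""
    tokens = []
    for match in MASTER.finditer(line):
        if match.lastgroup not in IGNORED:
            tokens.append(match.lastgroup)
    return tokens
```

Each source code statement thereby becomes a sequence of tokens with one token per remaining element, which is the form consumed by the later windowing and encoding steps.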
In one aspect of the subject matter disclosed herein, the method utilizes a long short term memory (LSTM) neural network as the model for source code bug prediction. It should be noted that this aspect is not constrained to an LSTM neural network and that other probabilistic machine learning techniques may be utilized. The LSTM architecture includes an input layer, one or more hidden layers in the middle with recurrent connections between the hidden layers at different times, and an output layer. Each layer represents a set of nodes and the layers are connected with weights. The input layer xt represents an input at time t and the output layer yt produces a probability distribution. The hidden layers ht maintain a representation of the history of the training data. Gating units are used to modulate the input, output, and hidden-to-hidden transitions in order to keep track of a longer history of the training data.
Typical LSTM architectures implement the following operations:
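$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
where $i_t$, $o_t$, $f_t$ are the input, output, and forget gates respectively,
$c_t$ is the memory cell activity,
$x_t$ and $h_t$ are the input and output of the LSTM respectively,
$\odot$ is an element-wise product, and
$\sigma$ is the sigmoid function.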
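A single time step of these operations can be sketched in plain Python as follows, assuming the standard LSTM form with peephole connections from the memory cell to the gates. The parameter dictionary `p`, its key names (for example `W_xi` and `b_i`), and the helper functions are naming assumptions for this sketch; weights are lists of rows and vectors are flat lists.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def tanh(v):
    return [math.tanh(x) for x in v]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]


def hadamard(a, b):
    """Element-wise product."""
    return [x * y for x, y in zip(a, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """Compute (h_t, c_t) from the input, previous output, and previous cell state."""
    i_t = sigmoid(add(matvec(p["W_xi"], x_t), matvec(p["W_hi"], h_prev),
                      matvec(p["W_ci"], c_prev), p["b_i"]))
    f_t = sigmoid(add(matvec(p["W_xf"], x_t), matvec(p["W_hf"], h_prev),
                      matvec(p["W_cf"], c_prev), p["b_f"]))
    candidate = tanh(add(matvec(p["W_xc"], x_t), matvec(p["W_hc"], h_prev), p["b_c"]))
    c_t = add(hadamard(f_t, c_prev), hadamard(i_t, candidate))
    # The output gate peeks at the updated cell state c_t.
    o_t = sigmoid(add(matvec(p["W_xo"], x_t), matvec(p["W_ho"], h_prev),
                      matvec(p["W_co"], c_t), p["b_o"]))
    h_t = hadamard(o_t, tanh(c_t))
    return h_t, c_t
```

In practice a numerical library would be used for these operations; the pure-Python form is chosen here only to keep each term of the gate equations visible.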
The training engine transforms the windows of the raw training data (e.g., sequences of training data) into a binary representation that is used as the feature vectors. The training engine uses the feature vectors to determine the appropriate weights and parameters for the LSTM model.
In one aspect of the subject matter disclosed herein, each line of the training data is optionally limited to a fixed length of 250 tokens. Lines with fewer than 250 tokens are padded with EndOfLine tokens. The code analysis engine utilizes only 439 tokens of the grammar of the underlying programming language in order to exclude trivial and superfluous elements. Each token in a line is represented by a bit pattern that includes 439 bits, where each bit represents a particular token. Each line of source code is then represented by 109,750 bits (i.e., 250 tokens multiplied by 439 bits).
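The line encoding can be sketched as a padded one-hot representation. The constant names, the choice of index 0 for the EndOfLine token, and the truncation of over-long lines are assumptions for this sketch; the 250-token length and 439-token vocabulary come from the description above.

```python
MAX_TOKENS_PER_LINE = 250
VOCABULARY_SIZE = 439
END_OF_LINE_ID = 0  # assumed index of the EndOfLine padding token


def encode_line(token_ids):
    """Encode a line of token indices as 250 one-hot patterns of 439 bits each."""
    padded = list(token_ids[:MAX_TOKENS_PER_LINE])  # truncate over-long lines (assumption)
    padded += [END_OF_LINE_ID] * (MAX_TOKENS_PER_LINE - len(padded))
    bits = []
    for token_id in padded:
        one_hot = [0] * VOCABULARY_SIZE
        one_hot[token_id] = 1  # exactly one bit set per token
        bits.extend(one_hot)
    return bits
```

Every encoded line has the same length of 250 × 439 = 109,750 bits regardless of how many tokens the original line contained.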
A size of a window may be determined by analyzing the success of the model with the testing data. Alternatively, a window may be of a reasonable size based on the available computational resources. The window size may be one. In an exemplary aspect, the window comprises seven (7) source code lines. The window includes a current line along with the immediately preceding three (3) lines and the immediately succeeding three (3) lines. Special padding will be used for both the first and last three (3) lines of source code which may not have immediately preceding/following lines of code. Each window will be labeled with a flag indicating whether the current line has a software bug or not. The training data may include a relatively equal number of lines having bugs and not having bugs in order to reduce potential bias and increase the accuracy of the model. Alternatively, instead of ensuring a relatively equal number of lines having bugs and not having bugs, the ultimate outcome of the model can be scaled by a factor determined by the proportion of lines having bugs to those not having bugs. Lines with fewer than three elements before padding are ignored since they do not contain a sufficient amount of data that can be of significance. With the window size of seven lines, a single feature vector will contain 768,250 bits (i.e., 109,750 bits per line multiplied by a window size of 7).
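A sketch of the windowing step follows. The function name `build_windows`, the `radius` parameter, and the use of `None` as the special padding value are assumptions; the filtering of lines with fewer than three elements and the class balancing are omitted for brevity.

```python
BITS_PER_LINE = 250 * 439  # 109,750 bits per encoded line
WINDOW_LINES = 7
BITS_PER_WINDOW = BITS_PER_LINE * WINDOW_LINES  # 768,250 bits per feature vector


def build_windows(lines, flags, radius=3, pad=None):
    """Pair each line with its surrounding context and the line's bug flag.

    Each window holds the current line, `radius` preceding lines, and
    `radius` following lines; missing neighbors are filled with `pad`.
    """
    padded = [pad] * radius + list(lines) + [pad] * radius
    windows = []
    for index, flag in enumerate(flags):
        window = padded[index:index + 2 * radius + 1]
        windows.append((window, flag))
    return windows
```

With `radius=3` each window spans seven lines, so concatenating the encoded lines of one window yields the 768,250-bit feature vector described above.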
Each character or chunk of characters of the annotated source code file is then flagged as either being associated with a software bug or not. For example, as shown in table 806 each character in the annotated source code file 804 is associated with a flag. The flag may have values “F” or “T” where “F” indicates that the corresponding character is not associated with a software bug and “T” indicates that the character is associated with a software bug. If chunks are used, then the table would identify whether each chunk is associated with a software bug or not.
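The character-level and chunk-level flagging can be sketched as below. Two assumptions are made that the description does not state: every character inherits the flag of the line it belongs to, and a chunk is flagged “T” when any of its characters is flagged “T”. The function names are hypothetical.

```python
def flag_characters(annotated_lines):
    """Expand (text, has_bug) lines into a list of (character, 'T' or 'F') pairs."""
    flagged = []
    for text, has_bug in annotated_lines:
        flag = "T" if has_bug else "F"  # assumption: characters inherit the line flag
        flagged.extend((ch, flag) for ch in text)
    return flagged


def flag_chunks(char_flags, chunk_size):
    """Group flagged characters into chunks of a predetermined number of characters."""
    chunks = []
    for start in range(0, len(char_flags), chunk_size):
        piece = char_flags[start:start + chunk_size]
        text = "".join(ch for ch, _ in piece)
        # assumption: one buggy character marks the whole chunk
        flag = "T" if any(f == "T" for _, f in piece) else "F"
        chunks.append((text, flag))
    return chunks
```

The final chunk may be shorter than `chunk_size` when the number of characters is not an exact multiple of it.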
The annotated source code file 804 is then input into the training engine 114 which transforms the annotated source code statements into a binary representation or feature vectors that train a machine learning model, such as a recurrent neural network (RNN). The training engine 114 groups contiguous source code statements preceding a particular source code statement using a window of a certain size into feature vectors which are then used to train the RNN.
Referring to
The extracted features 926 are then input to the training engine 928 which transforms them into a binary representation or feature vectors that train a machine learning model 930. The training engine 928 contains a feature vector generation engine 123 and model generation engine 117 as shown in the training engine 114 of
When the ANN has been trained and tested to meet a suitable threshold, the model 930 is ready.
Attention now turns to a discussion of the visualization techniques employed by the visualization engine.
Visualization
In one aspect of the subject matter discussed herein, the output from the model execution engine 124 may be input to a visualization engine 128 that visualizes the results from the model. In one aspect, the visualization engine 128 displays a portion of the source code file with certain lines of code highlighted in different shades or colors. The different shaded and/or highlighted lines indicate different probabilities that a corresponding line contains a bug. For example, as shown in
In addition, icons can be affixed next to a particular line of source code where the icons indicate different probabilities of the associated line containing a software bug. For example, in
It should be noted that the subject matter disclosed herein is not limited to a particular visualization technique or format and that other techniques and formats may be utilized to visualize the output of the machine learning model.
Attention now turns to a discussion of the different applications in which the source code bug prediction technique may be utilized.
Applications
The source code bug prediction technique described herein is utilized to analyze source code files to extract features indicative of patterns that can be used to train a machine learning model to predict the likelihood of the existence of a software bug. However, this technique may be applied to different applications or scenarios to achieve an intended objective.
In one aspect, the techniques described herein may be applied to a specific set of source code files, such as the source code files of a specific developer or group of developers, such as members on the same programming team or project. The source code files written by a particular developer or group of developers may be selected to train a customized model suited for a particular developer, group of developers, and/or team. Each customized model learns the programming habits of the developer, group of developers, and team. In an execution phase, a target source code file can be analyzed by each customized model, that is, by each developer's model, the team's model, or any combination thereof. The results of each customized model can then be visualized with the target source code file. Additionally, the results of each model can be aggregated into a single result. The results of one or more of the models can be weighted so that the results of certain models are given a higher weight than the results of other models. The results of some models can be excluded as well. The application of the various models to a target source code can avoid the detection of issues specific to one developer being incorrectly detected in the source code of another developer.
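The aggregation of results from several customized models into a single result could be sketched as a weighted average. The function name `aggregate_predictions`, the dictionary layout, and the rule that a model is excluded by omitting it or giving it a weight of zero are assumptions for this sketch.

```python
def aggregate_predictions(model_outputs, weights):
    """Combine per-line probabilities from several models into one weighted average.

    model_outputs maps a model name to its list of per-line probabilities.
    weights maps a model name to a non-negative weight; models that are
    missing from weights or weighted zero are excluded.
    """
    active = {name: w for name, w in weights.items() if w > 0 and name in model_outputs}
    if not active:
        raise ValueError("at least one model must have a positive weight")
    total = sum(active.values())
    first = next(iter(active))
    n_lines = len(model_outputs[first])
    return [
        sum(active[name] * model_outputs[name][i] for name in active) / total
        for i in range(n_lines)
    ]
```

For example, weighting a developer's own model three times as heavily as the team model pulls the combined probability toward that developer's habits while still reflecting team-wide patterns.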
In yet another aspect, the techniques described herein may be applied to detect hardware bugs in a hardware description language (HDL). It should be noted that the subject matter disclosed herein is not limited to a software bug in source code and may be applied to detect bugs in other languages that adhere to a grammar.
Technical Effect
Aspects of the subject matter disclosed herein pertain to the technical problem of determining the probability of software bugs in a source code file in a more relevant and meaningful manner. The technical features associated with addressing this problem involve a technique that models the context or syntactic structure of portions of a source code file (i.e., source code statements, methods, classes) with and without software bugs in order to generate a machine learning model to predict the probability of a source code file containing software bugs. Accordingly, aspects of the disclosure exhibit technical effects with respect to detecting a software bug in a portion of a source code file by extracting, from source code files, significant syntactic features that yield patterns that can be learned to predict a likelihood of the existence of a software bug.
Exemplary Operating Environment
Attention now turns to
The computing device 1102 may include one or more processors 1104, a communication interface 1106, one or more storage devices 1108, one or more input devices 1110, one or more output devices 1112, and a memory 1114. A processor 1104 may be any commercially available or customized processor and may include dual microprocessors and multi-processor architectures. The communication interface 1106 facilitates wired or wireless communications between the computing device 1102 and other devices. A storage device 1108 may be a computer-readable medium that does not contain propagating signals, such as modulated data signals transmitted through a carrier wave. Examples of a storage device 1108 include without limitation RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD), or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, all of which do not contain propagating signals, such as modulated data signals transmitted through a carrier wave. There may be multiple storage devices 1108 in the computing device 1102. The input devices 1110 may include a keyboard, mouse, pen, voice input device, touch input device, etc., and any combination thereof. The output devices 1112 may include a display, speakers, printers, etc., and any combination thereof.
The memory 1114 may be any non-transitory computer-readable storage media that may store executable procedures, applications, and data. The computer-readable storage media does not pertain to propagated signals, such as modulated data signals transmitted through a carrier wave. It may be any type of non-transitory memory device (e.g., random access memory, read-only memory, etc.), magnetic storage, volatile storage, non-volatile storage, optical storage, DVD, CD, floppy disk drive, etc. that does not pertain to propagated signals, such as modulated data signals transmitted through a carrier wave. The memory 1114 may also include one or more external storage devices or remotely located storage devices that do not pertain to propagated signals, such as modulated data signals transmitted through a carrier wave.
The memory 1114 may contain instructions, components, and data. A component is a software program that performs a specific function and is otherwise known as a module, program, application, and the like. The memory 1114 may include an operating system 1120, a source code repository 1122, a data mining engine 1124, a code analysis engine 1126, a training engine 1128, a model execution engine 1130, a visualization engine 1132, mined data 1134, training data 1136, a source code editor 1138, an integrated development environment (IDE) 1140, a model generation engine 1142, and other applications and data 1144.
The subject matter described herein may be implemented, at least in part, in hardware or software or in any combination thereof. Hardware may include, for example, analog, digital or mixed-signal circuitry, including discrete components, integrated circuits (ICs), or application-specific ICs (ASICs). Aspects may also be implemented, in whole or in part, in software or firmware, which may cooperate with hardware.
In addition, the logic flows depicted in the figures do not require the particular order shown, or sequential order, to achieve desirable results. Other steps may be provided, or steps may be eliminated, from the described flows, and other components may be added to, or removed from, the described systems and devices. Accordingly, other implementations are within the scope of the following claims.
In accordance with aspects of the subject matter described herein, a computer system can include one or more processors and a memory connected to one or more processors. At least one processor is configured to obtain a plurality of source code statements from at least one source code file, where at least one source code file contains a software bug and at least one source code file does not contain a software bug. The source code statements are transformed into a plurality of features with at least one feature representing the context of a software bug and at least one feature representing a context not having a software bug. These features are transformed into feature vectors that train a machine learning model to recognize patterns indicative of a software bug. The machine learning model is used to generate probabilities of a software bug for a target source code file.
The system transforms the plurality of source code statements into a sequence of tokens, where each token is associated with a grammar of the source code file. The system may also transform the plurality of source code statements into features by converting and/or concatenating at least one element of the source code into a token along with the elements of the source code statement. The system may also transform the plurality of source code statements into features by converting each source code statement into a sequence of metrics wherein a metric is associated with a measurement of a syntactic element of a source code statement. The machine learning model may be implemented as an LSTM model, RNN, or ANN.
The system visualizes the output of the machine learning model in various ways. The system may visualize one or more source code statements from a target source code file with a corresponding probability for one or more of the source code statements. The visualization may include highlighting a source code statement in accordance with its probability, altering a font size or text color in accordance with its probability, annotating a source code statement with a numeric probability value, and/or annotating a source code statement with an icon representing a probability value. The output of the visualization may be displayed when the probability exceeds a threshold value.
A device can include at least one processor and a memory connected to the at least one processor. The device includes a data mining engine, a code analysis engine, a training engine, and a visualization engine. The data mining engine searches a source code repository for source code files. The code analysis engine converts a portion of a source code file having a software bug and a portion of a source code file not having a software bug into a sequence of syntactic elements that represent a context in which a software bug exists and fails to exist. The visualization engine generates a visualization identifying at least one portion of a target source code file having a likelihood of a software bug. The visualization may include a portion of a target source code file and the probabilities associated therewith.
The training engine uses the sequence of syntactic elements to train a machine learning model to predict a likelihood of a software bug in a target source code file. The training engine aggregates a contiguous set of sequences of syntactic elements into a window to generate a feature vector. The contiguous set of sequences includes a number of sequences of syntactic elements preceding and following a select sequence. The portion of the source code file may include one or more lines of a source code file and/or classes of the source code file.
A method of using a system and device, such as the system and device described above, can include operations such as obtaining a plurality of source code files with and without software bugs. The source code files are mined from change records of a source code repository. Portions of the source code files are converted into a sequence of metrics, where a metric represents a measurement of a syntactic element. The metrics are used to train a machine learning model to predict the likelihood of a software bug in a portion of a target source code file. The portion of a target source code file may include a source code statement, a method and/or a class. The metrics may include one or more of a number of variables, a number of mathematical operations, a number of referenced elements of a particular data type, a number of loop constructs, a usage of a particular method and a usage of a particular data type.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.