Not applicable.
The present invention generally relates to analysis of software, and more particularly to the detection and reporting of defects in source code.
It is well known that software source code contains problems that make it difficult to add functionality to the software, or to modify existing functionality. Examples of such problems include errors in the source code, the structure of the code being inadequate for the desired changes, and source code that is correct when executed by a computer but is nonetheless confusing for a human reader. As it is estimated that a majority of the time spent developing software is spent reading and understanding existing source code, detecting and addressing readability problems is of paramount importance in software development.
Many analysis tools that detect such problems have been created. These tools can detect problems in the source code without requiring the code to be executed, and can report the problems in order to improve the code.
However, a common problem with source code analysis tools is that a large number of results are typically reported on most source code. Due to limitations of source code analysis, some of these results can be incorrect. In addition, even when the problems that are detected are correct, a user may not consider the problems relevant.
For example, a tool may report that a part of the source code would be difficult for a human to read and modify, but unless the user intends to modify that part of the code, the reported problem is of no use to the user. An example of a problem that is useful to report only if the code is modified is a single function that is too long. Many guidelines for writing good code recommend that a single function should consist of at most 200 lines of source code, and it is useful to report violations of these guidelines to a user. However, this is only relevant if the function is located in a part of the code that the user intends to modify in some way.
In another example, a tool may report problems in a part of the source code that is not developed by the user. One situation in which this may occur is if the source code includes some open-source components that are used, but are developed by a different set of developers. In this case, problems in the open-source components are typically not of interest to a user of the tool.
It is difficult for users of a source code analysis tool to find which of the problems reported by a tool are most important to them and therefore should be fixed most urgently.
Systems and methods are provided that take a collection of problems in source code that are detected by a source code analysis tool, and produce a smaller collection that includes some, but not all, of the problems reported by the tool. This is achieved by determining, for each problem reported by the source code analysis tool, whether it should be included or discarded. The resulting collection of problems can be presented to a user in a variety of ways.
The methods for choosing which of the problems to include and which to discard ensure two important properties: first, the resulting set of problems is typically significantly smaller than the set of problems originally reported by the source code analysis tool; and second, the problems that are reported would be considered relevant by a user of the tool.
The methods for choosing relevant problems are able to make use of the source code itself, as well as other important information such as the dates and times at which parts of the source code have been modified. This information helps adapt the choice of relevant problems by detecting which parts of the source code are actively being modified by a user at the time that a tool is used to detect problems. By only showing a user those problems in parts of the source code that are being modified, a smaller and more relevant set of problems is identified. Such information can be made available by a version control system.
In addition, if desired, the methods for choosing relevant problems can also take as input any external components that are part of the source code but are not of interest to the user. In this case, the methods described herein can detect any problems that occur in these components in order to discard these problems. Because it is common for the source code of external components to be modified, the method for filtering is robust in that it can detect which parts of the source code have been modified, and which parts have been used without modification.
It should be understood that these embodiments are only examples of the many advantageous uses of the innovative teachings herein. In general, statements made in the specification of the present application do not necessarily limit any of the various claimed inventions. Moreover, some statements may apply to some inventive features but not to others. In general, unless otherwise indicated, singular elements may be in the plural and vice versa with no loss of generality.
The novel system, computer program product, and method disclosed filter the results of a source code analysis tool to present a user with a small subset of the tool's results, so that all the problems presented to the user are relevant to them. The filtering module disclosed herein uses both criteria about the source code (e.g., its age, whether it is third-party code, and how many unique users have edited it) and the source code itself.
A source code file is any textual file that can be interpreted by a computer program to cause the program to execute any instructions described in the file, or that can be translated into a binary representation that can be executed by a computer. Source code files may contain text as well as instructions; an example of this is a web page containing text as well as executable code.
A codebase is any set of source code files.
A file is a portion of source code for a computer program.
A source code analysis tool is any computer program that takes as input source code files, possibly with some other information, and outputs a collection of messages that are associated with particular locations in the source code files.
A source code analysis result is any message associated with a location in a source code file that is produced by a source code analysis tool.
A version control system is a computer program, or a component of a computer program, that stores files, allows users to retrieve or modify the files, and keeps a history of the changes that were made to the files.
Version history is the list of modifications recorded by a version control system.
Third-party source code refers to any source code files that are part of a codebase but have been written by people other than the authors of the rest of the codebase.
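The definitions above can be made concrete with a small data model. The following is a minimal sketch only: the names AnalysisResult, FilterModule, and apply_filter are hypothetical and do not appear in the specification, and the choice to represent a location as a range of whole lines is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class AnalysisResult:
    """One message tied to a location in a source code file (hypothetical model)."""
    file: str        # path of the source code file within the codebase
    start_line: int  # first line of the location (1-based, inclusive)
    end_line: int    # last line of the location (1-based, inclusive)
    message: str     # text reported by the source code analysis tool


# In this sketch, a filtering module is any function that decides whether
# a single result is kept (True) or rejected (False).
FilterModule = Callable[[AnalysisResult], bool]


def apply_filter(results: Iterable[AnalysisResult],
                 keep: FilterModule) -> List[AnalysisResult]:
    """Return the subset of results that the filtering module keeps."""
    return [r for r in results if keep(r)]


results = [
    AnalysisResult("src/main.c", 10, 10, "unused variable"),
    AnalysisResult("vendor/zlib.c", 42, 45, "function too long"),
]
# Example module: reject everything under a (hypothetical) vendor directory.
own_code = apply_filter(results, lambda r: not r.file.startswith("vendor/"))
```

Modeling a filtering module as a predicate over one result keeps each module independent of how results are stored or presented.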
Architecture Overview
Several implementations of the filtering module depicted in (108) will be described in more detail in the following text. Note first that the architecture may be extended to allow several filtering modules to be used as depicted in
Filtering Modules
Each of
The filtering module in
To decide whether to keep or reject the problem (306), the filtering module first locates the file that contains the problem (304). It then retrieves from the version control system (308) the date of the last change made to the file. Retrieving the date from the version control system can be achieved by one of: running a program that is part of the version control system, using a library, or inspecting the log files produced by the version control system. The problem is kept if the date of the last change is close enough to the current date when the filtering module is run: in the example figure, this is shown as the last change date being within 30 days of the current date, but the number of days can be changed, either by being configured by a user or in an implementation of the filtering module.
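The date-based decision described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the filtering module: the function name filter_by_recency, the (file, message) representation of a problem, and the last_change callable that stands in for the version control system are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Tuple

# A problem is modeled here as a (file path, message) pair (an assumption).
Problem = Tuple[str, str]


def filter_by_recency(problems: List[Problem],
                      last_change: Callable[[str], datetime],
                      now: datetime,
                      max_age_days: int = 30) -> List[Problem]:
    """Keep problems whose file was last changed within max_age_days of now."""
    window = timedelta(days=max_age_days)
    cache: Dict[str, datetime] = {}  # query version control once per file
    kept = []
    for path, message in problems:
        if path not in cache:
            cache[path] = last_change(path)
        if now - cache[path] <= window:
            kept.append((path, message))
    return kept


# Usage with a stubbed history in place of a real version control system.
history = {
    "parser.c": datetime(2024, 5, 20),
    "legacy.c": datetime(2021, 1, 4),
}
problems = [("parser.c", "function too long"), ("legacy.c", "unused variable")]
recent = filter_by_recency(problems, history.__getitem__, now=datetime(2024, 6, 1))
print(recent)  # [('parser.c', 'function too long')]
```

Passing the lookup as a callable leaves open which of the three retrieval options is used. With Git, for instance, the callable could wrap a command such as `git log -1 --format=%cI -- <path>` and parse its output.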
The architecture of the filtering module in
This filtering module addresses the problem of third-party code: if a codebase contains some source code files that are derived from third-party code and are not considered part of the code, then problems in these files are not relevant. Furthermore, if a codebase contains files that are partially identical to third-party source files, then problems in the identical parts are not relevant, but problems in the parts that differ are relevant.
To illustrate this filtering module, consider as an example a codebase with three files A, B and C. Suppose further that A and B have been copied from an open-source project, but C was written from scratch. Finally, suppose that after being copied, B was modified in part. The filter will reject all problems identified in A; reject problems in B only if they are located on lines that have a corresponding line in the original version of B (before modifications); and keep all problems in C.
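The A, B, C example can be reproduced with a short sketch. The matching procedure of the specification is not reproduced here; as a stand-in, this sketch uses the standard-library difflib.SequenceMatcher to decide which lines of the current file have a corresponding line in the original. The names matched_lines and filter_third_party, and the (file, line) representation of a problem, are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Set, Tuple

# A problem is modeled here as a (file name, 1-based line number) pair.
Problem = Tuple[str, int]


def matched_lines(original: List[str], current: List[str]) -> Set[int]:
    """1-based line numbers of `current` that have a counterpart in `original`.

    Assumption: difflib's longest-matching-block alignment stands in for the
    matching procedure of the specification.
    """
    matcher = SequenceMatcher(a=original, b=current, autojunk=False)
    lines: Set[int] = set()
    for block in matcher.get_matching_blocks():
        lines.update(range(block.b + 1, block.b + block.size + 1))
    return lines


def filter_third_party(problems: List[Problem],
                       codebase: Dict[str, List[str]],
                       third_party: Dict[str, List[str]]) -> List[Problem]:
    """Reject problems on lines that are unchanged from third-party originals."""
    kept = []
    for name, line in problems:
        original = third_party.get(name)
        if original is None:
            kept.append((name, line))  # file is not derived from third-party code
        elif line not in matched_lines(original, codebase[name]):
            kept.append((name, line))  # line differs from the original
    return kept


third_party = {"A": ["int a;", "int b;"], "B": ["x = 1", "y = 2", "z = 3"]}
codebase = {
    "A": ["int a;", "int b;"],            # copied without modification
    "B": ["x = 1", "y = 20", "z = 3"],    # copied, then line 2 modified
    "C": ["print('hi')"],                 # written from scratch
}
problems = [("A", 1), ("B", 1), ("B", 2), ("C", 1)]
kept = filter_third_party(problems, codebase, third_party)
print(kept)  # [('B', 2), ('C', 1)]
```

Only the problem on the modified line of B and the problem in C survive, which mirrors the behavior described for files A, B and C.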
To achieve this, the filter in
A refinement of the matching procedure described in
Using the matching procedure described above, the filtering module of
A refinement of this filtering module is required if problems can span several lines. Each problem has a corresponding location in the source, which consists of all or part of one or more lines. In one example, if the location of a problem contains parts of several lines, then the problem is rejected only if all lines have a matching line as described above. In another example, the problem is rejected if any of the lines have a matching line as described above. It will readily be seen that variations on these criteria can be made without affecting the spirit of the invention.
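The two criteria for problems that span several lines can be sketched as a single function with a policy switch. This is a minimal illustration: the name reject_span and the policy labels "all" and "any" are hypothetical, and the set of matched lines is assumed to have been computed beforehand by whatever matching procedure is in use.

```python
from typing import Set


def reject_span(start_line: int, end_line: int,
                matched: Set[int], policy: str = "all") -> bool:
    """Decide whether a problem spanning start_line..end_line is rejected.

    matched holds the line numbers that have a matching third-party line.
    policy "all": reject only if every line of the span is matched.
    policy "any": reject as soon as one line of the span is matched.
    """
    if policy not in ("all", "any"):
        raise ValueError(f"unknown policy: {policy!r}")
    if end_line < start_line:
        raise ValueError("end_line must not precede start_line")
    hits = (line in matched for line in range(start_line, end_line + 1))
    return all(hits) if policy == "all" else any(hits)


matched = {10, 11, 13}
print(reject_span(10, 12, matched, "all"))  # False: line 12 has no match
print(reject_span(10, 12, matched, "any"))  # True: lines 10 and 11 match
```

The "all" policy is the more conservative of the two, since a problem that touches even one modified line is still shown to the user.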
This filtering module takes as input the history of detected problems (708). This is the list of the problems that were detected each time the source code analysis tool was run on the same codebase (704). For instance, if the source code analysis tool was run each day for three days in a row, the history of detected problems would contain the problems detected on each of the three days. The filtering module compares the number of detected problems for each day (706, 708), and if any new problem was detected in the file in the last 5 days, then all problems are kept (712); otherwise all problems are rejected (710). The duration of 5 days is an illustration, and the user can select any duration.
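The history-based decision can be sketched as follows. Two assumptions are made that go beyond the description: each problem carries an identifier that is stable across runs, and a problem counts as new when it is reported by a run but not by the run immediately before it. The description speaks of comparing the number of detected problems, so this set-based comparison is a variation rather than the described method. The names Run, file_has_new_problem, and filter_by_history are hypothetical.

```python
from datetime import date, timedelta
from typing import Dict, List, Set, Tuple

# One run of the analysis tool: the day it ran and, for each file, the set
# of problem identifiers reported in that file (identifiers are hypothetical).
Run = Tuple[date, Dict[str, Set[str]]]


def file_has_new_problem(history: List[Run], path: str,
                         today: date, window_days: int = 5) -> bool:
    """True if a run within the window reported a problem in `path` that the
    run immediately before it did not report. The earliest run has no
    predecessor, so it never contributes new problems."""
    runs = sorted(history, key=lambda run: run[0])
    cutoff = today - timedelta(days=window_days)
    for (_, before), (day, after) in zip(runs, runs[1:]):
        if day >= cutoff and after.get(path, set()) - before.get(path, set()):
            return True
    return False


def filter_by_history(history: List[Run], today: date,
                      window_days: int = 5) -> Dict[str, Set[str]]:
    """Keep all latest problems of files with a recent new problem; reject the rest."""
    if not history:
        return {}
    latest = max(history, key=lambda run: run[0])[1]
    return {path: probs for path, probs in latest.items()
            if file_has_new_problem(history, path, today, window_days)}


history = [
    (date(2024, 6, 1), {"a.c": {"p1"}, "b.c": {"q1"}}),
    (date(2024, 6, 2), {"a.c": {"p1"}, "b.c": {"q1"}}),
    (date(2024, 6, 3), {"a.c": {"p1", "p2"}, "b.c": {"q1"}}),
]
kept = filter_by_history(history, today=date(2024, 6, 3))
print(kept)  # a.c is kept with both of its problems; b.c is rejected
```

In this example a.c gained problem p2 on the third day, so all of its problems are kept, while b.c reported nothing new within the window and its problems are rejected.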
While the above description of the invention applies to software source code, the invention can be used to provide the same filtering functionality for problems detected in artifacts other than source code. One example of such a use is to filter the results of a text analysis tool (such as a spelling checker) running on a textual document, such as the documentation of software source code.
Overall, the present invention can be realized in hardware or a combination of hardware and software. The processing system according to one example can be realized in a centralized fashion in one computer system, or in a distributed fashion where different elements are spread across several interconnected computer systems and image acquisition sub-systems. Any kind of computer system—or other apparatus adapted for carrying out the methods described herein—is suited. A typical combination of hardware and software is a general-purpose computer system with a computer program that, when loaded and executed, controls the computer system such that it carries out the methods described herein.
In one example, the present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods described herein, and which, when loaded in a computer system, is able to carry out these methods. Computer program means or computer programs in the present context mean any expression, in any language, code, or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: a) conversion to another language, code, or notation; and b) reproduction in a different material form.
Computer system (800) also optionally includes a communications interface (824). Communications interface (824) allows software and data to be transferred between computer system (800) and external devices. Examples of communications interface (824) include a modem, a network interface (such as an Ethernet card), a communications port, a PCMCIA slot and card, etc. Software and data transferred via communications interface (824) are in the form of signals which may be, for example, electronic, electromagnetic, optical, or other signals capable of being received by communications interface (824). These signals are provided to communications interface (824) via a communications path (i.e., channel) (826). This channel (826) carries signals and is implemented using wire or cable, fiber optics, a phone line, a cellular phone link, an RF link, and/or other communications channels.
Although specific embodiments of the invention have been disclosed, those having ordinary skill in the art will understand that changes can be made to the specific embodiments without departing from the spirit and scope of the invention. The scope of the invention is not to be restricted, therefore, to the specific embodiments. Furthermore, it is intended that the appended claims cover any and all such applications, modifications, and embodiments within the scope of the present invention.