The present invention relates to the field of computer software analysis in general, and in particular to the detection of code patterns in software applications.
Computer software is typically composed of a “code base” of programs containing lines of code, written in a computer language such as Java® or C++, which are compiled and executed on a host computer. Software engineers often structure the code hierarchically by placing lines of code in methods that are nested in classes, which are distributed among files. Software applications themselves may be organized into hierarchies, where low-level applications communicate between themselves on the same or different host computers under the control of a high-level application. Understanding the underlying structure of a distributed software system is a valuable tool in maintaining these complex systems.
A top down approach may be used to determine the structure of a code base based on the assumption that the code base was constructed in a structured manner. For example, high-level modeling languages, such as UML, enable software architects to design a well-structured software system. Moreover, the modeling language may even generate the low-level code, such as C++ code. However, this approach requires that the high-level representation be continuously synchronized with the low-level code, should changes be introduced in the low-level code. This is something that is difficult to do in practice.
Alternatively, a bottom up approach may be used to determine the code structure by analyzing the low-level code directly and attempting to detect patterns in the code based on a set of pre-defined heuristics. For example, code dependencies may be found by detecting static references to methods and variables in the code, so that when a usage of a variable appears in multiple program files, it may indicate a dependency between those program files. However, this approach is not well suited for determining the overall code structure, typically due to subtle complex relationships between segments of code, such as function call invocations that depend on certain variable values.
Some dependencies are relatively easy to discover, such as when one component invokes a method of another component, or when component relationships are defined in a deployment descriptor. Other dependencies are more complicated and less direct, such as when a relationship is the result of a sequence of calls in a module's code, i.e., a call pattern, from which additional indirect dependencies may be inferred. In J2EE, for example, modules communicate through their containers. When one EJB wants to access another EJB, it invokes the lookup method on a javax.naming.Context object. If the lookup invocation is found, and assuming that the EJB name associated with the JNDI name can be resolved, it can be inferred that these two EJBs are communicating and that there is a dependency between them. In this example, the pattern to be found is a single instruction, namely the lookup invocation. In other situations, the code pattern is more complex, involving a sequence of method invocations. In fact, to identify an EJB lookup more reliably, it is better to also look for an RMI narrow invocation following the lookup invocation, since a lookup can be for any type of component, such as a data source, and not just an EJB.
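The lookup-then-narrow pattern described above can be illustrated with a minimal sketch. It assumes the code has already been reduced to a flat list of invocation records; the `Call` class, its fields, and the `LookupNarrowDetector` name are hypothetical and introduced only for illustration. The sketch also ignores branching and variable reassignment, which an actual detector would have to take into account.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified record of one method invocation found in a code base.
final class Call {
    final String receiverType; // declared type of the receiver object
    final String method;       // invoked method name
    final String resultVar;    // variable receiving the return value, or null
    final List<String> args;   // argument variable names

    Call(String receiverType, String method, String resultVar, List<String> args) {
        this.receiverType = receiverType;
        this.method = method;
        this.resultVar = resultVar;
        this.args = args;
    }
}

final class LookupNarrowDetector {
    // Returns the index of every Context.lookup call whose result is later
    // passed as an argument to a narrow call (assumption: straight-line code).
    static List<Integer> findEjbLookups(List<Call> code) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < code.size(); i++) {
            Call lookup = code.get(i);
            boolean isLookup = "lookup".equals(lookup.method)
                    && "javax.naming.Context".equals(lookup.receiverType)
                    && lookup.resultVar != null;
            if (!isLookup) {
                continue;
            }
            for (int j = i + 1; j < code.size(); j++) {
                Call later = code.get(j);
                if ("narrow".equals(later.method) && later.args.contains(lookup.resultVar)) {
                    hits.add(i);
                    break;
                }
            }
        }
        return hits;
    }
}
```

A lookup whose result never reaches a narrow invocation, such as a lookup of a data source, is not reported by this sketch, which mirrors the reasoning given above for looking at the sequence rather than at the single instruction.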
It would be advantageous to define an inference engine that takes not only the found patterns into consideration, but also other environmental and domain information, such as deployment descriptors, environment variables, etc., such that other high-level relationships might then be deduced for study by the programmer.
The present invention discloses a system and method for defining code patterns and for searching for the patterns in a code base.
In one aspect of the present invention a code pattern detector is provided including at least one pattern definition expressed in a pattern language, and a code analyzer operative to employ the pattern definition to analyze a code base, the code analyzer including a representation builder operative to construct a representation of the code base, a pattern detector operative to process the representation in conjunction with the pattern definition to find a pattern within the representation, and an inference engine operative to express any of the found patterns as an abstract relationship within the code base.
In another aspect of the present invention the code analyzer is operative to employ the pattern definition to analyze the code base and create at least one inference therefrom.
In another aspect of the present invention the code pattern detector further includes an operand resolver operative to resolve a value of any variables in the code base related to any of the patterns found within the representation.
In another aspect of the present invention the pattern definition describes a potential dependency in the code base.
In another aspect of the present invention the representation builder is operative to emulate the execution environment of the code base and express the representation as any of a call graph, a control flow graph, a cross language dependency graph, and a data flow graph.
In another aspect of the present invention the pattern definition defines a sequence of instructions and at least one relationship between any of the instructions.
In another aspect of the present invention the pattern definition is constructed as a set of tags within a document.
In another aspect of the present invention the pattern definition is constructed as a set of XML tags within an XML document.
In another aspect of the present invention the tags include at least one parent tag that defines an instruction sequence, and at least one child tag that defines either of a characteristic of the instruction sequence and a characteristic of any of the instructions within the instruction sequence.
In another aspect of the present invention the relationship is a control flow relationship describing the order in which instructions are executed.
In another aspect of the present invention the relationship is a data flow relationship describing the flow and manipulation of data between two instructions in the instruction sequence.
In another aspect of the present invention the representation is a control flow graph, and the pattern detector is operative to search the control flow graph to verify a sequence of instructions specified by the pattern definition.
In another aspect of the present invention the pattern detector is operative to verify that a data flow in the pattern definition corresponds to a data flow detected in the found pattern.
In another aspect of the present invention the operand resolver is operative to determine from the pattern definition which of the variables to resolve, determine from the pattern definition a scope for any of the variables, determine which segment of the code base to emulate based on the found pattern, and resolve any of the variables.
In another aspect of the present invention the operand resolver is operative to resolve any of the variables by emulating a segment of the code base corresponding to the variable, and create a resolved pattern therewith.
In another aspect of the present invention the code analyzer is operative to identify a relationship between a source including the code base, a configuration file, and the resolved pattern, and a target.
In another aspect of the present invention a method is provided for detecting a code pattern, the method including expressing at least one pattern definition in a pattern language, constructing a representation of a code base, processing the representation in conjunction with the pattern definition to find a pattern within the representation, and expressing any of the found patterns as an abstract relationship within the code base.
In another aspect of the present invention the method further includes resolving a value of any variables in the code base related to any of the patterns found within the representation.
In another aspect of the present invention the first expressing step includes describing a potential dependency in the code base.
In another aspect of the present invention the constructing step includes emulating the execution environment of the code base and expressing the representation as any of a call graph, a control flow graph, a cross language dependency graph, and a data flow graph.
In another aspect of the present invention the first expressing step includes defining a sequence of instructions and at least one relationship between any of the instructions.
In another aspect of the present invention the first expressing step includes constructing the pattern definition as a set of tags within a document.
In another aspect of the present invention the first expressing step includes constructing the pattern definition as a set of XML tags within an XML document.
In another aspect of the present invention the first expressing step includes constructing the pattern definition using at least one parent tag that defines an instruction sequence, and at least one child tag that defines either of a characteristic of the instruction sequence and a characteristic of any of the instructions within the instruction sequence.
In another aspect of the present invention the first expressing step includes defining a control flow relationship describing the order in which instructions are executed.
In another aspect of the present invention the first expressing step includes defining a data flow relationship describing the flow and manipulation of data between two instructions in the instruction sequence.
In another aspect of the present invention the constructing step includes constructing a control flow graph, and the processing step includes searching the control flow graph to verify a sequence of instructions specified by the pattern definition.
In another aspect of the present invention the processing step includes verifying that a data flow in the pattern definition corresponds to a data flow detected in the found pattern.
In another aspect of the present invention the resolving step includes determining from the pattern definition which of the variables to resolve, determining from the pattern definition a scope for any of the variables, determining which segment of the code base to emulate based on the found pattern, and resolving any of the variables.
In another aspect of the present invention the resolving step includes resolving any of the variables by emulating a segment of the code base corresponding to the variable, and creating a resolved pattern therewith.
In another aspect of the present invention the method further includes identifying a relationship between a source including the code base, a configuration file, and the resolved pattern, and a target.
In another aspect of the present invention a computer program is provided embodied on a computer-readable medium, the computer program including a first code segment operative to employ a pattern definition expressed in a pattern language to analyze a code base, a second code segment operative to construct a representation of the code base, a third code segment operative to process the representation in conjunction with the pattern definition to find a pattern within the representation, and a fourth code segment operative to express any of the found patterns as an abstract relationship within the code base.
The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the appended drawings in which:
Reference is now made to
Code analyzer 120 preferably employs a representation builder 125 to construct a representation of code base 100. Representation builder 125 preferably emulates the execution environment of code base 100 and constructs representative data, such as a call graph, control flow, and data flow. Such representative data are described in more detail below with reference to
Reference is now made to
Control flow relationships typically describe the order in which instructions are executed, are typically defined within the space of all execution paths, and need not be limited in their scope to a flow of control as ascertained through textual analysis of code base 100, but may be a function of actual execution flow as well. For example, a pattern definition that describes a control flow may include a prioritization of the control flow, such as by specifying that a first instruction must be executed prior to a second instruction.
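One way to check a constraint such as "a first instruction must be executed prior to a second instruction" over the space of all execution paths is sketched below. The sketch assumes the control flow graph is available as a map from each node to its successors; the class name, method name, and node labels are hypothetical. It is one possible check, not necessarily the one used by the pattern detector described herein. The check searches for a path from the entry node to the second instruction that avoids the first; if no such path exists, the first instruction precedes the second on every path.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ControlFlowOrder {
    // Returns true if every path from 'entry' to 'second' passes through 'first'.
    // If 'second' is unreachable, the constraint holds vacuously.
    static boolean alwaysPrecedes(Map<String, List<String>> cfg, String entry,
                                  String first, String second) {
        if (entry.equals(first)) {
            return true; // every path starts at 'first'
        }
        Set<String> seen = new HashSet<>();
        Deque<String> work = new ArrayDeque<>();
        seen.add(entry);
        work.push(entry);
        while (!work.isEmpty()) {
            String node = work.pop();
            if (node.equals(second)) {
                return false; // reached 'second' on a path that avoided 'first'
            }
            for (String next : cfg.getOrDefault(node, Collections.emptyList())) {
                // Never step into 'first': only paths avoiding it are explored.
                if (!next.equals(first) && seen.add(next)) {
                    work.push(next);
                }
            }
        }
        return true;
    }
}
```

Because the search is over graph edges rather than over the textual order of statements, a branch that bypasses the first instruction is detected even when the first instruction appears earlier in the source text.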
Data flow relationships typically describe the flow and manipulation of data between two instructions in an instruction sequence. A sequence of instructions may have an inherent value chain with respect to the data flow, where certain instructions when executed prior to others may build value in the data. For example, given two invocations:
Numerous data flows may exist and may be particular to the programming language employed. For example, in the Java® language, the following six types of data flow may be identified:
b=a.foo( )
b.bar( )

a.foo( )
a.bar( )

a.foo(c,d)
c.bar( )

c=a.foo( )
b.bar(c,d)

a.foo( )
b.bar(a,d)

a.foo(c,d)
b.bar(c,e)
Preferably, the pattern language provides a mechanism for describing all possible code dependencies, such as those described hereinabove.
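The six invocation pairs listed above can be told apart by asking which role a shared variable plays in the first invocation (return object, receiver object, or parameter) and which role it plays in the second (receiver object or parameter). The following sketch classifies a pair on that basis. The `Stmt` model and the role labels other than "Return Object" and "Receiver Object" are assumptions made for illustration, as is the reading of the six pairs in terms of these roles.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of "result = receiver.method(args...)"; result may be null.
final class Stmt {
    final String result;
    final String receiver;
    final String method;
    final List<String> args;

    Stmt(String result, String receiver, String method, String... args) {
        this.result = result;
        this.receiver = receiver;
        this.method = method;
        this.args = Arrays.asList(args);
    }
}

final class DataFlowClassifier {
    // Lists every data flow from 'first' to 'second' as "source role -> target role".
    static List<String> classify(Stmt first, Stmt second) {
        Set<String> flows = new LinkedHashSet<>(); // keeps order, drops duplicates
        addFlows(flows, "Return Object", first.result, second);
        addFlows(flows, "Receiver Object", first.receiver, second);
        for (String arg : first.args) {
            addFlows(flows, "Parameter", arg, second);
        }
        return new ArrayList<>(flows);
    }

    private static void addFlows(Set<String> flows, String sourceRole,
                                 String variable, Stmt second) {
        if (variable == null) {
            return; // e.g. the return value of 'first' was discarded
        }
        if (variable.equals(second.receiver)) {
            flows.add(sourceRole + " -> Receiver Object");
        }
        if (second.args.contains(variable)) {
            flows.add(sourceRole + " -> Parameter");
        }
    }
}
```

For instance, classifying `b=a.foo( )` followed by `b.bar( )` yields the single flow "Return Object -> Receiver Object", since `b` is returned by the first invocation and is the receiver of the second.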
In the example shown in
The control flow shown in the pattern definition of
A portion of the data flow shown in the code base of
(Return Object→Receiver Object)
e.g.,
b=a.foo( )
b.bar( )
The data flow in this example is defined using the <TargetDependent> tag, which defines which invocation built the data before the data is used as a Receiver object for the current invocation. In our example, the “forward” invocation is target-dependent on the “get_Dispatcher1” invocation.
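A pattern definition of this kind could be read with the standard Java XML parser, as in the sketch below. Only the <TargetDependent> tag and the invocation names "forward" and "get_Dispatcher1" come from the description above; the enclosing `Invocation` element, its `id` attribute, and the class and method names are assumptions made for illustration, since the full schema of the pattern language is not reproduced here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class PatternDefinitionReader {
    // Maps each invocation id to the id of the invocation that built its receiver.
    // Assumes a hypothetical layout: <Invocation id="..."><TargetDependent>id</TargetDependent></Invocation>
    static Map<String, String> targetDependencies(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, String> deps = new LinkedHashMap<>();
            NodeList invocations = doc.getElementsByTagName("Invocation");
            for (int i = 0; i < invocations.getLength(); i++) {
                Element invocation = (Element) invocations.item(i);
                NodeList targets = invocation.getElementsByTagName("TargetDependent");
                if (targets.getLength() > 0) {
                    deps.put(invocation.getAttribute("id"),
                            targets.item(0).getTextContent().trim());
                }
            }
            return deps;
        } catch (Exception e) {
            throw new IllegalArgumentException("Malformed pattern definition", e);
        }
    }
}
```

Given a definition in which the "forward" invocation carries a <TargetDependent> child naming "get_Dispatcher1", the returned map contains the single entry forward=get_Dispatcher1, which a detector could then compare against the data flow found in the code base.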
Pattern definitions may include any combination of instruction relationships, including a combined control and data flow relationship between instructions, such as where a first instruction is executed prior to a second instruction and the data of the second instruction is constructed using the result of the first instruction.
Reference is now made to
For example, given the code base and pattern definition shown in
Reference is now made to
In this example, the data embedded in the object ‘dist’ is primed with information retrieved from the object ‘myRequest’ dependent on the data in the object ‘value’. Thus, while the value chain of the data starts with ‘value’, goes through ‘myRequest’, and ends with ‘dist’, the value chain is conditional on variables ‘a’ and ‘b’. In some cases the value is important as it defines the pattern role. In the present example it is the target Servlet to be invoked. Operand resolver 130 preferably detects all the variables that may affect the value chain and determines possible value chains for these variables. In the example presented hereinabove, operand resolver 130 may build a value chain for ‘a>b’ and for ‘a<b’, and consequently build the following two value chains:
Operand resolver 130 may conclude that there are two possible values in the getRequestDispatcher invocation, “myServlet1” and “myServlet2”.
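The idea of building one value chain per branch outcome can be sketched as follows. The sketch assumes the relevant code segment has been modeled as a function from the outcomes of its branch conditions to the resulting value; this modeling, the class and method names, and the mapping of ‘a>b’ to "myServlet1" in the usage note are assumptions for illustration, and the sketch does not reproduce how operand resolver 130 actually emulates code.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Function;

final class OperandResolverSketch {
    // Emulates 'segment' once for every combination of branch outcomes and
    // collects each distinct value the resolved variable may take.
    static Set<String> resolve(int conditionCount, Function<boolean[], String> segment) {
        Set<String> values = new LinkedHashSet<>();
        for (int mask = 0; mask < (1 << conditionCount); mask++) {
            boolean[] outcomes = new boolean[conditionCount];
            for (int i = 0; i < conditionCount; i++) {
                outcomes[i] = (mask & (1 << i)) != 0; // bit i selects outcome of condition i
            }
            values.add(segment.apply(outcomes));
        }
        return values;
    }
}
```

With a single condition standing for ‘a>b’, a call such as `OperandResolverSketch.resolve(1, o -> o[0] ? "myServlet1" : "myServlet2")` collects both servlet names. Exhaustive enumeration grows exponentially with the number of conditions, so it is practical only when few variables affect the value chain.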
Operand resolver 130 preferably determines which variables to resolve as well as their scope, or the valid value ranges for the variable, from pattern definition 110. Next, operand resolver 130 may determine which segment of code base 100 to emulate based on the detected patterns found by pattern detector 127, as described hereinabove. Finally, operand resolver 130 resolves the variables, typically by emulating the relevant segment of code base 100, to create a set of resolved patterns. A resolved pattern may take the form of a detected pattern realized within a particular solution space of a variable. For example, an invocation may access one of two types of documents dependent on the value of a variable, such as the invocation on line 6 shown in
Reference is now made to
In another example, given software that includes the following code base:
and a deployment description that includes a configuration file with the following properties:
It is appreciated that one or more of the steps of any of the methods described herein may be omitted or carried out in a different order than that shown, without departing from the true spirit and scope of the invention.
While the methods and apparatus disclosed herein may or may not have been described with reference to specific computer hardware or software, it is appreciated that the methods and apparatus described herein may be readily implemented in computer hardware or software using conventional techniques.
While the present invention has been described with reference to one or more specific embodiments, the description is intended to be illustrative of the invention as a whole and is not to be construed as limiting the invention to the embodiments shown. It is appreciated that various modifications may occur to those skilled in the art that, while not specifically shown herein, are nevertheless within the true spirit and scope of the invention.