1. Field of the Invention
The present invention relates generally to the field of hardware architecture data management. More particularly, it concerns optimization of Software-Managed Manycore (SMM) Architectures.
2. Description of Related Art
When programs run on a computer or other electronic device, computer code is compiled from programming languages down to machine-level commands and binary code that are executed on hardware. A computer program consists of many functions, and each function may contain many lines of code that are executed when it is called by the program or by another function. In complicated code that supports user inputs and variable branch operations, large amounts of code are produced that might never need to execute on hardware. For example, many lines of code may be present to support doing Y if the user does action X, while also supporting performing B when the user does action A. In this basic example, if the user never performs action A, then the code for performing B is not necessary. The compiler that translates computer code is responsible for recognizing this and ensuring that the superfluous code does not waste any processing resources. Compilers are also tasked with maximizing efficiency, such as recognizing patterns that may be performed in fewer steps and controlling the flow of data and programs to efficiently utilize hardware resources.
Manycore systems provide a unique challenge for efficient code handling. In these systems, software (via the compiler and other applications) controls program process flow and manages hardware resources. When running applications on a manycore system, different portions of an application or program must be tasked to different portions of the system to maintain optimal utilization of each core in the manycore system. One benefit of manycore systems is that they maintain local Scratch Pad Memory (SPM) for each of the cores. This reduces the latency that each core otherwise incurs when relying on memory fetches from distant or global memory. SPM may be effectively utilized by managing which portions of an application are saved to each portion of SPM and its corresponding cores. This application management requires a detailed understanding of how a given application flows, including information about function calls, loops, switches, and conditional statements. The prior art describes several solutions for mapping application flow and structure to be used in software management of multicore and manycore systems; however, these solutions often provide either too little or too much information about an application's structure. Too little information limits how well the SPM and the processing cores in the system are utilized. Too much information about application structure bogs down the SPM and limits efficiency gains in a software managed system.
For example, Control Flow Graphs (CFGs) have long been an essential tool in recognizing potential compiler efficiencies. CFGs represent code and process flow to facilitate creating compiler algorithms that remove “dead code” and implement efficiencies where certain code or patterns are recognized. Compilers use CFGs to transform or run code more efficiently. Efficient program management becomes increasingly complex with manycore processors, which are at the leading edge of computer architectures. To maximize utilization of each core, the compiler must have a succinct yet complete view of a program's flow so that individual functions, along with the appropriate memory allocations, may be mapped to cores without being limited by the program or function flow.
Scaling the memory architecture in software managed systems is a major challenge when transitioning from a few cores to manycore processors. Experts believe that coherent cache architectures will not scale to hundreds and thousands of cores (Heinrich, et al., 1999; Bournoutian & Orailoglu, 2011; Choi, et al., 2011; Xu, et al., 2011), not only because the hardware overheads of providing coherency increase rapidly with core count, but also because caches consume a lot of power. One promising option for a more power-efficient and scalable memory hierarchy is to use raw, “uncached” memory (commonly known as Scratch Pad Memory or SPM) in the cores. Since SPMs do not have the hardware for address lookup and translation, they occupy 30% less area and consume 30% less power than a direct-mapped cache of the same effective capacity (Banakar, et al., 2002). In these types of systems, the coherence of memory addressing has to be provided in software so that the hardware is more power-efficient and scalable. A multicore/manycore architecture in which each core has an SPM instead of hardware caches is called a Software Managed Multicore (SMM) architecture (Bai, et al., 2013; Lu, et al., 2013). The Cell processor (Flachs, et al., 2006) (used in the PlayStation 3) is a good example of an SMM architecture. Thanks to the SMM architecture, the peak power-efficiency of the Cell processor is 5 GFlops per Watt (Flachs, et al., 2006). Contrast this to the Intel i7 4-core Bloomfield 965 XE with a power-efficiency of 0.5 GFlops per Watt (“Raw Performance . . . ” 2010; “Intel Core . . . ” 2010), both fabricated in the 65 nm technology node.
The main challenge in the SMM architecture is that several tasks, like data management (movement of data between the SPMs of the cores and the main memory) and inter-core communication (movement of data between the SPMs of cores), which were originally done by the hardware (more specifically, the cache hierarchy), now have to be done explicitly in software, and that may cause overheads. Recent research results have been quite encouraging. Techniques have been proposed to manage all kinds of data efficiently on the SPMs of the cores: code (Bai, et al., 2013; Jung, et al., 2010), stack (Lu, et al., 2013; Bai, et al., 2011), and heap (Bai, et al., 2010; Bai, et al., 2013; Bai, et al., 2011b; Udayakumaran, et al., 2006). In fact, Bai, et al., 2013 and Lu, et al., 2013 show that the overhead of code and stack management on SPMs is lower than on a cache-based architecture. Thus SMMs are emerging as a strong contender for processor architecture in the manycore era. All the state-of-the-art data management techniques that have been developed to date for SMM architectures are inter-procedural code transformations and require extensive inter-procedural analysis. One of the fundamental challenges is finding the right representation of a whole program, such that it captures the required information, and yet is not too big and cumbersome. For example, the call graph of a program depicts which functions call other functions, but it does not contain information about the loops and if-then-elses present in the program or within the functions. Also, it does not contain information about the order in which the functions are called. All this information is vital for the code transformations required for SMM architectures. The Control Flow Graph (CFG) contains detailed information about all the control flow, but only for a single function.
Researchers have tried to stitch together the CFGs of various functions by essentially pasting the CFG of the called function at the site of its call, producing what is named the Global CFG (Udayakumaran, et al., 2006; Whitham, et al., 2012; Polychronopoulos, 1991). The Global CFG is a detailed representation of the whole program, but it grows too big, and discovering the information SMM transformations need from this graph is time consuming and cumbersome at best. The prior art lacks a succinct representation of a whole program that contains function call information, as well as the important control flow structures of the program, e.g., loops and if-then-elses.
Previous research on developing code transformations for SMM architectures have used Global Call Control Flow Graph or GCCFG (Baker, et al., 2010; Jung, et al., 2010; Bai, et al., 2013; Lu, et al., 2013). GCCFG is a whole program representation, and is a hybrid between a call graph and control flow graph. GCCFG is a hierarchical representation of the program that abstracts away straight line code, and captures the function call and control flow information of the program. For example, a GCCFG is shown in
While previous research has informally defined and used GCCFG, it has not shown how to construct it for any given program. While constructing GCCFG for simple programs is relatively straightforward, there are several very commonly found program structures for which constructing GCCFG is complicated, for example, loops that have multiple exits (commonly caused by continue, break, and goto statements, and by loops that return values), intertwined loops (caused by goto statements), switch statements, and if-then-else statements with exit statements. GCCFGs have the potential to greatly enhance hardware utilization, but the prior art has been incapable of handling these and other complicated cases.
The present disclosure defines GCCFGs and shows the construction of GCCFGs in several complicated cases to facilitate the use of GCCFGs for compiler enhancements in SMM architectures. As opposed to Global CFG, GCCFG is a more succinct representation, and results in more efficient implementations of the inter-procedural analysis required for code and stack data management on SMM architectures. Experiments conducted on MiBench benchmarks (Guthaus, et al., 2001) demonstrate that the compilation time of a state-of-the-art code management technique can be improved by an average of 5×, and that of stack management can be improved by 4×, through GCCFG as compared to Global CFG.
A. GCCFGs enable optimal code mapping for SMM architectures. In an SMM architecture, a task is mapped to a core, and the core has access to only its local Scratch Pad Memory or SPM. All code and data that the task uses must come from the SPM. If the code of the application requires more memory than is available in the SPM, then the whole code can reside in the large global memory and may be brought into the SPM with a piecemeal approach. To facilitate this, SMM architectures allow users to divide the code part of the SPM into regions, and functions in the program can be mapped to the regions in the SPM. For example,
B. Information for optimal code mapping. Some embodiments require certain information to solve the optimal code mapping problem. In some embodiments this includes an estimate of the amount of data transfers that will happen when two functions are mapped to the same region. For example, in the program shown in
C. Why a call graph is not enough. The function execution sequence cannot be derived from a call graph. The first problem is that the call graph lacks the order in which functions are called. The GCCFGs of the present disclosure preserve the function call order. For example, in some embodiments, the left child of a node is called before the right child. Second, call graphs lose all control flow information, so it is unknown whether a function is being called in a loop, in an if-then-else, or just in straight-line code. Each of these structures has a very significant impact on the sequence of the function executions, and therefore on the number of times a function has to be brought into the SPM. In fact, even annotating the call graph with how many times a function is called is not enough, because it still does not capture the context in which the functions are called. For example
Data management optimizations require both the function call information and the control flow information. Since control flow information is so important, data management cannot be performed using just the information in the call graph. The call graph only has information about which functions call other functions; it does not show an ordering of those calls or any control flow information. Data management techniques can use Global CFG instead (Udayakumaran, et al., 2006; Whitham, et al., 2012; Polychronopoulos, 1991). Global CFG is a graph in which all CFGs are inlined to make a single large graph. Indeed, Global CFG contains all the information needed; however, the information is arranged in such a manner that it is compute-intensive to dig out the information needed for optimal code mapping.
Other program-level graphs have been defined and used in other contexts. The System Dependence Graph (SDG) was designed by Horowitz, et al., 1990. In the SDG, nodes can be split at the sub-basic-block level to represent individual statements as well as procedure entry and exit points. The SDG also requires different types of edges between the nodes: there are edges that represent control dependence, as well as flow and def-order dependence. In order to maintain context between procedure calls, a grammar is used to model the call structure of each procedure. While the SDG could be used as input for data management schemes, it is not succinct. The fact that it breaks basic blocks into smaller parts and introduces edges between them makes it quite large. The present disclosure abstracts away the straight-line code to provide a succinct representation of the whole program. Udayakumaran, et al., 2006, proposed the Data-Program Relationship Graph (DPRG). They start with a call graph, and then append loop and condition nodes for control flow information. However, DPRG does not maintain ordering information; it must use a depth-first search to get a general ordering. Also, DPRG requires extra nodes for then and else statements, instead of just one node for if-then-else statements, making it less than a succinct representation of the program. Whitham, et al., 2012 propose a graph called a Control Flow Tree (CFT). They derive this data structure from a control flow graph that has been converted into a tree and then compressed into the final CFT. The graph proposed in their work maintains ordering by making sure that a parent node is always called before a child node. However, they must maintain a list of back edges to keep information about when a leaf of the tree needs to return to an earlier parent. The CFT is not a succinct representation of a program, since it needs multiple data structures to represent control flow.
To facilitate data management optimizations on SMM architectures, Lee, et al., 2012 used regular expressions, called path expressions, to represent the way control flows through a program, where the Kleene closure (*) represents a loop, the union operator (|) represents a condition, and concatenation (•) represents the next segment to be executed. This information reveals the alternating behavior between functions in a program, so that an efficient mapping of function code to memory can be made. The information present in the regular expression is also present in GCCFG; however, it is much easier to annotate GCCFG with more information, like the number of times a loop executes or a branch probability, than to annotate a regular expression with more information.
The state-of-the-art data management schemes (Baker, et al., 2010; Bai, et al., 2013; Lee, et al., 2012; Lu, et al., 2013) for SMM architectures have used GCCFG or GCCFG-like data structures, but the construction of GCCFG has not been shown. The present disclosure formally defines GCCFGs and describes the algorithm to construct them.
Another representation is the Hierarchical Task Graph (HTG) (Polychronopoulos, et al., 1991); however, an HTG covers only one function. The HTG is a hierarchical representation of the program control flow, and is derived by separating the control flow graph into hierarchies at the loop level. The present disclosure expands the HTG concepts to create an inter-procedural graph, called GCCFG. However, GCCFG construction can become quite challenging when the program has ill-formed control flow, e.g., poorly formed loops, switch statements, and hard-to-find convergence points of conditions, and the present disclosure provides solutions to correctly construct GCCFGs in these cases.
One embodiment of the present methods, comprises receiving at least one function executed by a computer program; extracting at least one loop in the computer program involving the at least one function; building a global call control flow graph based, at least in part, on the extracted at least one loop and the at least one function of the computer program, wherein the global call control flow graph represents interconnectivity in the computer program; and analyzing a code complexity of the computer program based, at least in part, on the global call control flow graph to manage data on a manycore system. In some embodiments, the step of building the global call control flow graph comprises building at least one hierarchical flow graph based on the at least one loop; building a call control flow graph based on the at least one hierarchical flow graph; and joining the at least one call control flow graph to create the global call control flow graph. Some embodiments further comprise receiving at least one condition executed by the computer program, wherein the step of building the global call control flow graph is further based, at least in part, on the received at least one condition. In some embodiments, the at least one loop comprises one of a poorly formed loop, a switch, a convergence of at least two conditions, a recursive procedure, or at least one function pointer.
In some embodiments, at least one exit block is added to the hierarchical flow graph. In some embodiments, at least one place holder block is added to the hierarchical flow graph. In some embodiments, at least three distinct representational units are used in building the at least one call control flow graph, the at least three distinct representational units indicating at least a function, a loop, and a condition.
In some embodiments, analyzing the code complexity comprises calculating a total interference of the computer program. Some embodiments further comprise transforming the computer program to reduce the total interference of the computer program.
One embodiment of the present computer program products, comprises a non-transitory computer readable medium comprising code for performing the steps of: receiving at least one function executed by a computer program; extracting at least one loop in the computer program involving the at least one function; building a global call control flow graph based, at least in part, on the extracted at least one loop and the at least one function of the computer program, wherein the global call control flow graph represents interconnectivity in the computer program; and analyzing a code complexity of the computer program based, at least in part, on the global call control flow graph to manage data on a manycore system. In some embodiments of the present computer program products, the step of building the global call control flow graph comprises: building at least one hierarchical flow graph based on the at least one loop; building a call control flow graph based on the at least one hierarchical flow graph; and joining the at least one call control flow graph to create the global call control flow graph. In some embodiments, the non-transitory computer readable medium further comprises code for performing the step of receiving at least one condition executed by the computer program, wherein the step of building the global call control flow graph is further based, at least in part, on the received at least one condition.
In some embodiments, the at least one loop comprises one of a poorly formed loop, a switch, a convergence of at least two conditions, a recursive procedure, or at least one function pointer. In some embodiments, at least one exit block is added to the hierarchical flow graph. In some embodiments, at least one place holder block is added to the hierarchical flow graph. In some embodiments, at least three distinct representational units are used in building the at least one call control flow graph, the at least three distinct representational units indicating at least a function, a loop, and a condition.
In some embodiments of the present computer program products, analyzing the code complexity comprises calculating a total interference of the computer program. In some embodiments, the non-transitory computer readable medium further comprises code for performing the step of transforming the computer program to reduce the total interference of the computer program.
Some embodiments of the present apparatuses comprise a memory; and a processor coupled to the memory, wherein the processor is configured to execute the steps of: receiving at least one function executed by a computer program; extracting at least one loop in the computer program involving the at least one function; building a global call control flow graph based, at least in part, on the extracted at least one loop and the at least one function of the computer program, wherein the global call control flow graph represents interconnectivity in the computer program; and analyzing a code complexity of the computer program based, at least in part, on the global call control flow graph to manage data on a manycore system. In some embodiments, the step of building the global call control flow graph comprises: building at least one hierarchical flow graph based on the at least one loop; building a call control flow graph based on the at least one hierarchical flow graph; and joining the at least one call control flow graph to create the global call control flow graph. In some embodiments, the processor coupled to the memory is further configured to execute the step of receiving at least one condition executed by the computer program, wherein the step of building the global call control flow graph is further based, at least in part, on the received at least one condition.
In some embodiments, the at least one loop comprises one of a poorly formed loop, a switch, a convergence of at least two conditions, a recursive procedure, or at least one function pointer. In some embodiments, at least one exit block is added to the hierarchical flow graph. In some embodiments, at least one place holder block is added to the hierarchical flow graph. In some embodiments, at least three distinct representational units are used in building the at least one call control flow graph, the at least three distinct representational units indicating at least a function, a loop, and a condition.
In some embodiments, the step of analyzing the code complexity comprises calculating a total interference of the computer program. In some embodiments the processor coupled to the memory is further configured to execute the step of transforming the computer program to reduce the total interference of the computer program.
As used herein in the specification, “a” or “an” may mean one or more. As used herein in the claim(s), when used in conjunction with the word “comprising”, the words “a” or “an” may mean one or more than one.
The use of the term “or” in the claims is used to mean “and/or” unless explicitly indicated to refer to alternatives only or the alternatives are mutually exclusive, although the disclosure supports a definition that refers to only alternatives and “and/or.” As used herein “another” may mean at least a second or more.
Throughout this application, the term “about” is used to indicate that a value includes the inherent variation of error for the device, the method being employed to determine the value, or the variation that exists among the study subjects.
Other objects, features and advantages of the present invention will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.
The following drawings form part of the present specification and are included to further demonstrate certain aspects of the present invention. The invention may be better understood by reference to one or more of these drawings in combination with the detailed description of specific embodiments presented herein.
The Global Call Control Flow Graph, or GCCFG, of the present disclosure is a whole program view of structures within an application. Specifically, GCCFG identifies three different types of structures that are commonly found in programs: function calls, loop blocks, and if-then-else condition blocks.
Definition 1: (GCCFG) Let G = (V, E) be a DAG, where V = Vf ∪ Vc ∪ Vl is a set of vertices representing program structures and E = {e = (v1, v2) : v1 ∈ V ∧ v2 ∈ V} is a set of directed edges. Then G is a GCCFG of a program, where the vertices identify the three program structures (function calls, loops, and if-then-else conditions, respectively) and the edges represent the change of control flow in a program structure; the program code corresponding to v2 is called from inside the program code corresponding to v1.
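The structure of Definition 1 can be sketched as a small data structure. The following is a minimal, illustrative Python sketch (all class, variable, and vertex names are hypothetical, not part of the disclosure): vertices are tagged as function, loop, or condition; edges are directed; and an ordered child list preserves call order, with the left child called before the right child.

```python
# Hypothetical sketch of Definition 1: vertex kinds F (function call),
# L (loop), C (if-then-else condition), plus directed edges.
FUNCTION, LOOP, CONDITION = "F", "L", "C"

class GCCFG:
    def __init__(self):
        self.kind = {}        # vertex id -> "F", "L", or "C"
        self.edges = []       # (v1, v2): code of v2 is reached from inside v1
        self.children = {}    # v1 -> ordered children (leftmost is called first)

    def add_vertex(self, v, kind):
        assert kind in (FUNCTION, LOOP, CONDITION)
        self.kind[v] = kind
        self.children.setdefault(v, [])

    def add_edge(self, v1, v2):
        assert v1 in self.kind and v2 in self.kind
        self.edges.append((v1, v2))
        self.children[v1].append(v2)

# Example: main() calls f() inside a loop, then g() under a condition.
g = GCCFG()
g.add_vertex("main", FUNCTION)
g.add_vertex("L1", LOOP)
g.add_vertex("f", FUNCTION)
g.add_vertex("I1", CONDITION)
g.add_vertex("g", FUNCTION)
g.add_edge("main", "L1")
g.add_edge("L1", "f")
g.add_edge("main", "I1")
g.add_edge("I1", "g")
```

Because the child lists are ordered, the left-to-right traversal of a vertex's children recovers the call order that a plain call graph would lose.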
The three types of program structures represented by the vertex set in GCCFG are distinguished in the following ways: A vertex v ∈ Vf represents a function call in a program, has only one set of outgoing edges, and is represented, in some embodiments, by a circle shape in the final GCCFG graph, e.g. 404 of
The GCCFG of the example program in
A. How to Construct GCCFG. One embodiment of the present GCCFGs may be constructed as shown in
1) Step 1: Extracting Loop Information: A program P comprises a set of CFGs, where each CFG H = (B, E) represents a single function in the program. B is the set of basic blocks in the function, and E is the set of edges between basic blocks (Ferrante, et al., 1987).
Given a set of CFGs, which in some embodiments may have already been performed in step 1002 of
Definition 2: (LOOP) Each loop Li ⊂ H, where Li = (BLi, ELi) and H = (B, E) represents the control flow graph of a function. BLi ⊂ B are the blocks in the loop and the loops nested within it. ELi ⊂ E are the edges between the blocks in the loop.
Definition 2 explains that a loop is a set of blocks and a set of edges that are subsets of the blocks and edges, respectively, in a CFG.
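As an illustration of Definition 2, loops can be extracted from a CFG with the textbook back-edge method. The following is a hedged sketch under an assumed dictionary encoding of the CFG; it is not the disclosure's extraction step. Each back edge found during a depth-first search identifies a loop header, and the loop body BLi is gathered by walking predecessors backward from the back edge's tail.

```python
# Hedged sketch: natural-loop extraction from a CFG given as
# {block: [successor blocks]}. Returns a list of (header, body-set) pairs.
def find_loops(cfg, entry):
    preds = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].append(b)
    # Iterative DFS; an edge to a block still "active" on the stack is a back edge.
    state, stack, back_edges = {entry: "active"}, [(entry, iter(cfg[entry]))], []
    while stack:
        node, it = stack[-1]
        for succ in it:
            if state.get(succ) == "active":
                back_edges.append((node, succ))
            elif succ not in state:
                state[succ] = "active"
                stack.append((succ, iter(cfg[succ])))
                break
        else:
            state[node] = "done"
            stack.pop()
    loops = []
    for tail, header in back_edges:
        body, work = {header, tail}, [tail]
        while work:  # backward reachability, stopping at the header
            for p in preds[work.pop()]:
                if p not in body:
                    body.add(p)
                    work.append(p)
        loops.append((header, body))
    return loops
```

For a CFG with entry → h, h → {a, exit}, a → h, this finds one loop with header h and body {h, a}.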
2) Step 2: Constructing Hierarchical Flow Graph: The next step after extracting the loop information is to separate the loops into hierarchical levels. All nested loops are subsets of the loops they are nested inside, so we identify which loop is one level of nesting below another loop. Therefore, ∀L, LV(L) → level, where the function LV finds the nesting level of the loop.
In this work L1 will always have the highest level hierarchy. So following the example in
By separating all of the loops in a CFG and identifying the hierarchy where they appear, we can use the loop information to build a new graph called Hierarchical Flow Graph (HFG), e.g. step 1012 of
Definition 3: (Hierarchical Flow Graph) An HFG L′ = (B′L, E′L) is a DAG, where B′L represents all of the basic blocks in a loop, plus one LPH for each highest-level nested loop and one FPH for each function call in the loop. E′L is the set of edges between the blocks in B′L. An HFG has either an LPH or an FPH as its root block to denote whether it is one of the loops in a function or the level above the highest-level loop.
Algorithm 1 (below) explains how to separate nested loops into different graphs. The algorithm starts by copying all blocks and edges to the sets B′L and E′L, respectively, in lines 1 and 2. It then cycles through all nested loops that are at the first level of nesting below the loop L. It finds a nested loop K, where K is a proper subset of L and its nesting level is one more than that of L, so in lines 4 and 5 it removes the blocks and edges that are in K and also in L′. In line 6 the algorithm examines each edge in the original loop L; if the head or tail is in K, then the edge is removed from L′ and a new edge is added to L′. The new edge connects to a new node that is an LPH, where the single entry edge to the loop K now connects to the LPH and the exit from K also connects to the LPH. The complexity of Algorithm 1 depends on the number of loops that are nested within one level of another loop. Therefore the time complexity of this algorithm is O(n*b), where b is the number of blocks in the outer loop and n is the number of loops nested in it.
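The flattening that Algorithm 1 performs can be sketched in a few lines of Python. This is a simplified illustration under assumed inputs (block sets and edge sets), not the algorithm verbatim: each immediately nested loop K collapses to a single LPH node, internal edges of K are dropped, and entry and exit edges are redirected to the LPH.

```python
# Hedged sketch of Algorithm 1's effect: replace each immediately nested
# loop (given as a block set) with a single LPH placeholder node.
def build_hfg(loop_blocks, loop_edges, nested_loops):
    """nested_loops: list of block sets, one per loop nested one level below."""
    blocks, edges = set(loop_blocks), set(loop_edges)
    for i, k_blocks in enumerate(nested_loops):
        lph = f"LPH{i}"
        blocks -= k_blocks          # remove K's blocks (lines 4-5)
        blocks.add(lph)
        new_edges = set()
        for (u, v) in edges:
            if u in k_blocks and v in k_blocks:
                continue            # edge internal to K: dropped
            u2 = lph if u in k_blocks else u
            v2 = lph if v in k_blocks else v
            new_edges.add((u2, v2)) # entry/exit edges now touch the LPH
        edges = new_edges
    return blocks, edges
```

For an outer loop {b1, b2, b3} with a nested single-block loop {b2}, the nested loop collapses to LPH0 and the self edge b2 → b2 disappears.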
What is needed to move beyond this stage is a forest of DAGs, so that the HFG information can be used to build a more condensed graph. The first step is to remove any back edges and to add a new root block. If the HFG is a loop, its root block becomes an LPH, and if it is the highest-level HFG, its root block becomes an FPH.
3) Step 3: Building Call Control Flow Graph: We traverse the HFGs in a Depth First Search (DFS) and condense the information present in each HFG into a graph called the Call Control Flow Graph (CCFG). A CCFG is a proper subset of GCCFG; it is constructed when a block of interest is found in an HFG, at which point we apply a set of rules to construct the proper vertices and edges in the CCFG. A block of interest is a block with two outgoing edges, an LPH, an FPH, or a block with a function call in its code.
Definition 4: (Call Control Flow Graph) A CCFG is a graph G′ = (V′, E′), where, given a GCCFG G, G′ ⊂ G. V′ is a set of vertices representing the program structures of loops, function calls, and if-then-else conditions. E′ is the set of edges connecting the program structures.
Algorithm 2 and Algorithm 3 explain how to build a CCFG given the information present in an HFG. First, Algorithm 2 gives three of the rules for building a CCFG by showing the cases for building vertices. The first rule is found at line 2 of Algorithm 2: where there is a condition in the program, a condition vertex is added to the set of vertices in the CCFG G′. The second rule is found at line 5: where a loop is found, a loop vertex is added to the vertex set in G′. Finally, if the block contains a call to another function or is an FPH (line 8), a function vertex is added to the vertex set in G′. At lines 4, 7, and 10 a mapping between the block in L′ and the vertex in G′ is created, as shown in
Algorithm 3, called from Algorithm 2, is a recursive function that describes the remaining rules for building a CCFG. First, at line 1, we locate a condition that is also mapped to a vertex in V′. Then we examine all true and all false paths through the graph that appear after the condition diverges and before it converges. The fourth rule for creating a CCFG appears at lines 6 and 7: if another block mapped to the CCFG is found, then a true edge is added to the CCFG. The fifth rule appears at lines 6 and 9, which, like the previous rule, adds an edge, but now adds a false edge in the CCFG. This case is illustrated in
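The vertex rules of Algorithm 2 can be illustrated with a short sketch. The block records and field names below are assumptions made for illustration only; the disclosure's algorithms operate on richer structures. The condition rule (a block with two outgoing edges), the loop rule (an LPH), and the function rule (an FPH or a block containing a call) correspond to the three cases described above, and the returned mapping mirrors the block-to-vertex mapping created at lines 4, 7, and 10.

```python
# Hedged sketch of Algorithm 2's vertex rules: walk an HFG in DFS order and
# emit a CCFG vertex for each "block of interest".
def build_ccfg_vertices(hfg, entry):
    """hfg: block -> {"succs": [...], "kind": "LPH" | "FPH" | "call" | None}."""
    vertices, mapping, seen = [], {}, set()

    def dfs(b):
        if b in seen:
            return
        seen.add(b)
        info = hfg[b]
        if len(info["succs"]) == 2:            # rule 1: two-way branch -> condition
            mapping[b] = ("C", b)
            vertices.append(("C", b))
        elif info["kind"] == "LPH":            # rule 2: loop place holder -> loop
            mapping[b] = ("L", b)
            vertices.append(("L", b))
        elif info["kind"] in ("FPH", "call"):  # rule 3: call site or FPH -> function
            mapping[b] = ("F", b)
            vertices.append(("F", b))
        for s in info["succs"]:
            dfs(s)

    dfs(entry)
    return vertices, mapping
```

Blocks that match none of the three rules (plain straight-line code) receive no vertex, which is how the representation stays succinct.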
4) Final step: Integrating Call Control Flow Graphs: After we have built a CCFG for every HFG in the program, we can glue the CCFGs together.
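The gluing step can be sketched as follows (the per-function record format is an assumption for illustration): each function-call vertex in a caller's CCFG is connected by an edge to the root of the callee's CCFG, yielding a single whole-program graph.

```python
# Hedged sketch of CCFG integration: link each call vertex to the root of
# the callee's CCFG. Record format is illustrative, not from the disclosure.
def integrate(ccfgs, main="main"):
    """ccfgs: name -> {"root": vertex, "edges": [(v1, v2)], "calls": {vertex: callee}}."""
    gccfg_edges = []
    for func, g in ccfgs.items():
        gccfg_edges.extend(g["edges"])
        for call_vertex, callee in g["calls"].items():
            # Control flows from the call vertex into the callee's subgraph.
            gccfg_edges.append((call_vertex, ccfgs[callee]["root"]))
    return ccfgs[main]["root"], gccfg_edges
```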
B. Challenge cases in GCCFG construction. Thus far, we have explained the definition and construction of GCCFG in a typical setting. However, input program graphs are often ill-formed, and that makes the task of building GCCFG challenging. Challenge cases include a program that is constructed with poorly formed loops, a program that contains switch statements, finding the convergence point of some conditions, representing recursive procedures, and representing function pointers. These problems must be addressed to successfully build GCCFG.
1) Poorly formed loops: The first challenge to address is that of poorly formed loops. These include loops that have multiple exits (commonly caused by continue, break, and goto statements) and intertwined loops (caused by goto statements). Both of these types of loop problems must be removed before transforming the basic blocks into a final HFG.
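One common normalization for multi-exit loops, sketched below under an assumed edge-set encoding (the block and edge names are illustrative), is to route every edge that leaves the loop body through a single synthetic exit block, so the loop again has one well-defined exit before HFG construction.

```python
# Hedged sketch: redirect every loop-leaving edge through one synthetic
# exit block, giving a multi-exit loop a single unique exit.
def add_unique_exit(edges, loop_blocks, exit_name="loop_exit"):
    fixed = set()
    for (u, v) in edges:
        if u in loop_blocks and v not in loop_blocks:
            fixed.add((u, exit_name))   # break/goto-style exit enters the exit block
            fixed.add((exit_name, v))   # exit block forwards to the old target
        else:
            fixed.add((u, v))
    return fixed
```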
2) Switch statements: The second challenge to address is programs with switch statements. While switches are not poor programming practice, the challenge is that a single block's children cannot be broken down into true and false children. The present disclosure applies, in some embodiments, a transformation to distinguish true and false children by adding an intermediate block to the graph. The top part in
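The intermediate-block idea can be sketched as lowering an n-way switch into a chain of two-way tests, so that each block again has exactly one true child and one false child. The block-naming scheme below is hypothetical.

```python
# Hedged sketch: lower an n-way switch into a chain of two-way tests with
# intermediate blocks, so true/false children are well defined.
def lower_switch(switch_block, cases, default):
    """cases: list of (value, target). Returns (test_block, value, true_target, false_target)."""
    lowered = []
    current = switch_block
    for i, (value, target) in enumerate(cases):
        nxt = default if i == len(cases) - 1 else f"{switch_block}_test{i + 1}"
        # True edge goes to the case body; false edge goes to the next
        # intermediate test block (or to the default on the last case).
        lowered.append((current, value, target, nxt))
        current = nxt
    return lowered
```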
3) Finding a convergence point: Another challenge to address is finding the convergence point of a condition with exit statements. These are mostly caused by error conditions that exit the program immediately. In the corresponding CFG, the block with the exit has no descendants, so it is not clear what the corresponding convergence block would be. If there are nested conditions within this condition on the true or false paths, the issue is further confused, as the convergence point of the nested condition may appear to occur after the convergence of the parent condition. The solution to this challenge is similar to the loop problem in that each CFG must have a unique exit block.
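The unique-exit-block fix can be sketched as redirecting every block that has no successors to one synthetic exit. This is a minimal Python sketch under that assumption; `add_unique_exit` and the `__exit__` block id are hypothetical names, not from the disclosure.

```python
# Hypothetical sketch of the unique-exit transformation: every block with no
# successors (e.g. one ending in exit()) is redirected to a single synthetic
# exit block, so every condition has a well-defined convergence point.
def add_unique_exit(succs, exit_id="__exit__"):
    """succs: dict mapping block -> list of successors (empty list = dead end)."""
    out = {b: list(s) for b, s in succs.items()}
    for b, s in out.items():
        if not s:
            s.append(exit_id)  # dead ends now converge at the synthetic exit
    out[exit_id] = []
    return out
```

After the transformation, a condition whose error branch called exit() converges at the synthetic exit block like any other path.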
C. Recursion and Function Pointers. Up to this point, call sites have been represented as an edge from one vertex to a function vertex. However, when a program contains recursion, some embodiments need to be able to represent that control has been given back to the entry point of the recursive procedure. Continuing the GCCFG convention of having a unique function vertex for control to move to would either not adequately represent the control flow contained within the recursive procedure or would require duplicating all of this information in the graph. Therefore, we introduce a back edge in GCCFG. Any back edge in the graph represents a recursive procedure call, where the edge starts at the call site and ends at the recursive procedure's function vertex. It is important to note that there is no structure in Global CFG to handle recursive programs: the Global CFG requires that when a function call occurs, the function's CFG is inserted in its place, and this is not possible for recursive calls.
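The distinction between ordinary call edges and back edges can be sketched as follows. This is a minimal Python sketch, not the disclosure's construction: `classify_call_edges` and the call-graph encoding are hypothetical, and shared (non-recursive) callees are simply revisited from each call site.

```python
# Hypothetical sketch: while wiring call edges, a call whose target is already
# on the current call chain is recorded as a back edge (recursion) instead of
# being expanded again.
def classify_call_edges(calls, root):
    """calls: dict function -> list of called functions, in call-site order.
    Returns (forward_edges, back_edges)."""
    forward, back = [], []
    def visit(f, chain):
        for g in calls.get(f, []):
            if g in chain:
                back.append((f, g))        # recursive call: back edge in GCCFG
            else:
                forward.append((f, g))
                visit(g, chain | {g})
    visit(root, {root})
    return forward, back
```

For a program where `f` calls itself and also calls `g`, the self-call is reported as a back edge while the other calls remain forward edges.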
Determining the set of functions that a pointer can point to requires program analysis, where the most conservative results, which run quickly, will give a much larger set than a more accurate analysis, which will take longer. The tradeoff lies in choosing between accuracy and speed for pointer analysis. GCCFG needs to be generated quickly, as a benefit to doing more data management analysis at compile time, and needs to be succinct for those analyses. Some embodiments use the pointer analysis presented in (Milanova, et al., 2004), where a less accurate model gives enough information to determine which pointers will be equivalent at compile time, placing the corresponding functions and their pointers into an equivalence class. This relationship between pointers and the functions they may point to is used to generate an edge between the call site where the pointer exists and each function in the equivalence class. In GCCFG these special edges are represented by dotted lines.
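Given the equivalence classes, generating the dotted edges is a simple cross product, sketched below in Python. The names (`pointer_call_edges`, `call_sites`, `classes`) are hypothetical; the sketch assumes the equivalence classes have already been computed by the pointer analysis.

```python
# Hypothetical sketch: given equivalence classes from a conservative pointer
# analysis, add a dotted edge from each call-through-pointer site to every
# function in the pointer's equivalence class.
def pointer_call_edges(call_sites, classes):
    """call_sites: dict call-site id -> pointer name;
    classes: dict pointer name -> set of functions it may point to."""
    return [(site, fn, "dotted")
            for site, ptr in call_sites.items()
            for fn in sorted(classes.get(ptr, ()))]  # sorted for determinism
```

A call site through pointer `p`, whose class contains `f` and `g`, thus gets two dotted edges, one to each candidate function.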
A. Experimental Setup. Experiments were conducted to demonstrate the need for, and usefulness of, GCCFG over Global CFG. The experiments cover code management and stack data management optimizations in SMM architectures. To do that, an embodiment of the present disclosure was implemented to construct GCCFG in an LLVM compiler. Since a pass can only be applied to a single source code file, the llvm-link utility was used to merge several bitcode files into one bitcode file. A Function Pass in LLVM, which operates on each function of a program, was implemented. The function pass extracted control flow and loop information from each function and stored it in a data structure. After all the passes had finished, the extracted information was combined into a GCCFG. GCCFG nodes and edges were annotated with information necessary for code and stack data management. For comparison purposes, Global CFGs were also generated. The code and stack management implementations get information about the program from GCCFG (or Global CFG) through functions, like estimateInterferenceCost, that can be computed using both GCCFG and Global CFG. Next, LLVM passes were run for code (Bai, et al., 2013) and stack data management (Lu, et al., 2013). A compiler was run on benchmarks from the MiBench suite (Guthaus, et al., 2001) to compare the compilation time.
B. GCCFG makes code management 5× faster as compared to Global CFG.
C. GCCFG makes stack management 4× faster as compared to Global CFG.
D. GCCFG is a succinct representation of the program.
Experiments also analyzed the algorithms that generate the information required for code and stack data management, to compare the complexity of those algorithms. This was done to illustrate how succinct a representation GCCFG is.
A. Interference calculation using GCCFG. Algorithm 4 shows how code mapping determines the total interference of a program based on a mapping of functions to regions (Baker, et al., 2010). Interference is the amount of data transfers that will take place between the local SPM and the global main memory for a given mapping. The input M is a mapping of functions to regions. At lines 1 and 2, the algorithm iterates over all pairs of functions in the program that are mapped to the same region in memory. At lines 3 and 4, if function i is the Lowest Common Ancestor (LCA) of the two functions, then the total interference is increased by the number of times the function is called during execution plus the number of times the first function on the path from i to j is called. Lines 5 and 6 do the same thing as lines 3 and 4, except j is the LCA of the two functions. Finally, at line 8, if neither i nor j is the LCA of the other, the number of times the actual LCA of the two functions is executed is added to the total interference. The total running time using GCCFG would be O((n*l)/2*n^2), i.e., O(n^3), where n is the number of function vertices and l is the number of loop nodes in the GCCFG. Note that we will need to traverse the height of the graph twice to find the LCA of two given nodes.
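The pairwise LCA logic above can be sketched as follows. This is a minimal Python sketch of the case analysis, not the disclosure's Algorithm 4: the names (`total_interference`, `lca`, `first_on_path`) are hypothetical, and the LCA and path queries are passed in as callables rather than computed over a real GCCFG.

```python
# Hypothetical sketch of Algorithm 4: for every pair of functions mapped to
# the same SPM region, add an interference cost depending on which function,
# if either, is the Lowest Common Ancestor (LCA) of the pair in the GCCFG.
def total_interference(mapping, exec_count, lca, first_on_path):
    """mapping: function -> region; exec_count: function -> times called;
    lca(i, j): LCA function vertex of i and j in the GCCFG;
    first_on_path(a, b): first function on the path from a down to b."""
    funcs = sorted(mapping)
    cost = 0
    for x, i in enumerate(funcs):
        for j in funcs[x + 1:]:
            if mapping[i] != mapping[j]:
                continue                      # only same-region pairs interfere
            a = lca(i, j)
            if a == i:                        # lines 3-4: i is the LCA
                cost += exec_count[i] + exec_count[first_on_path(i, j)]
            elif a == j:                      # lines 5-6: j is the LCA
                cost += exec_count[j] + exec_count[first_on_path(j, i)]
            else:                             # line 8: neither is the LCA
                cost += exec_count[a]
    return cost
```

With stub `lca` and `first_on_path` callables for a tiny call tree, the three branches can be exercised directly.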
B. Interference calculation using Global CFG. Algorithm 5 shows how to determine the interference cost if we used the Global CFG. To calculate the total interference between any two functions in a program, there are two main loops at lines 1 and 3. First we must cycle through every basic block in the whole program until we find one, i, that is an entry to or exit from a function, as this is where a swap can occur in memory. Next we need to find another block, j, that is an entry into a function, is mapped to the same region as i, and lies in a separate function from i. At line 5 we then need to do a depth first search to find the state of the memory (the active functions in memory) and compare that to the function containing i; a clash means we have a conflict that will increase the cost of the interference. Line 7 determines if both blocks are inside a loop, because this will increase the total cost by the number of times a loop iterates during the execution of the program. Otherwise we only add the cost of the number of times blocks i and j are executed to the total interference cost. The total running time for Algorithm 5 is O(b^2*(b+2L)), where b is the number of basic blocks in a program and L is the maximum number of basic blocks in a loop. b can be approximated as n*B, where n is the number of functions in the whole program and B is the maximum number of basic blocks in a function. Further, in a Global CFG, if a function is called multiple times, then it is necessary to make a copy of the basic blocks for each call and inline those blocks into the graph. We can represent this inline factor as c, a multiplication factor for how many times a function is inlined. Therefore the total running time for Algorithm 5 is actually O((n*B*c)^3), compared to O(n^3) for Algorithm 4.
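For contrast with the GCCFG version, the block-level scan can be sketched as below. This is a deliberately simplified Python sketch, not the disclosure's Algorithm 5: all names are hypothetical, the DFS memory-state check is abstracted into a `conflicts` callable, and the quadratic pass over basic-block pairs is what makes the Global CFG approach expensive.

```python
# Hypothetical, simplified sketch of Algorithm 5 over the Global CFG: every
# ordered pair of function-boundary blocks mapped to the same region is
# checked, and the cost is weighted by loop trip counts when both blocks lie
# inside loops.
def global_cfg_interference(boundary_blocks, region_of, in_loop,
                            block_count, trip_count, conflicts):
    """boundary_blocks: list of (block, function) pairs at function entry/exit;
    conflicts(f, g): DFS-based check that f is in memory when g is entered;
    block_count: block -> execution count; trip_count: block -> loop iterations."""
    cost = 0
    for bi, fi in boundary_blocks:
        for bj, fj in boundary_blocks:
            if fi == fj or region_of[fi] != region_of[fj]:
                continue                     # only cross-function, same-region pairs
            if not conflicts(fi, fj):
                continue                     # memory state check found no clash
            if in_loop.get(bi) and in_loop.get(bj):
                cost += trip_count[bi]       # loop pairs weigh by iterations
            else:
                cost += block_count[bi] + block_count[bj]
    return cost
```

Even in this toy form, the double loop runs over all basic blocks rather than over function vertices, which is the source of the O((n*B*c)^3) versus O(n^3) gap described above.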
Since coherent cache architectures will not scale for long, researchers are on the lookout for a new memory architecture that can scale to hundreds and thousands of cores. Software Managed Manycore (SMM) architectures, in which each core has only a scratch pad memory, are a promising solution, since the hardware is simpler, more scalable, and more power-efficient. However, in SMM architectures the code and data of the tasks must be explicitly managed in software by the compiler. State-of-the-art compiler techniques for SMM architectures require inter-procedural information and analysis, and they have used the GCCFG (Global Call Control Flow Graph) for that. GCCFG is a whole-program representation that captures control-flow as well as function call information in a succinct way. However, how to construct GCCFG has not previously been shown, and there are several commonly occurring cases where constructing GCCFG is not straightforward. This disclosure provides graph transformations that allow correct construction of GCCFG in nearly all cases. Experiments show that by using the succinct representation (GCCFG) rather than the elaborate representation (Global CFG), the compilation time of a state-of-the-art code management technique (Bai, et al., 2013) can be improved by an average of 5×, and that of stack management (Lu, et al., 2013) by an average of 4×.
Turning now to the figures,
In some embodiments, the at least one loop comprises one of a poorly formed loop (see, e.g., 900a or 910a in
In some embodiments, at least three distinct representational units are used in building the at least one call control flow graph, the at least three distinct representational units indicating at least a function, a loop, and a condition. For example, in
In some embodiments, analyzing the code complexity comprises calculating a total interference of the computer program. Some embodiments further comprise transforming the computer program to reduce the total interference of the computer program.
One embodiment of the present computer program products comprises a non-transitory computer readable medium comprising code for performing the steps of: receiving at least one function executed by a computer program; extracting at least one loop in the computer program involving the at least one function; building a global call control flow graph based, at least in part, on the extracted at least one loop and the at least one function of the computer program, wherein the global call control flow graph represents interconnectivity in the computer program; and analyzing a code complexity of the computer program based, at least in part, on the global call control flow graph to manage data on a manycore system. In some embodiments of the present computer program products, the step of building the global call control flow graph comprises: building at least one hierarchical flow graph based on the at least one loop; building a call control flow graph based on the at least one hierarchical flow graph; and joining the at least one call control flow graph to create the global call control flow graph. In some embodiments, the non-transitory computer readable medium further comprises code for performing the step of receiving at least one condition executed by the computer program, wherein the step of building the global call control flow graph is further based, at least in part, on the received at least one condition.
In some embodiments, at least one loop comprises one of a poorly formed loop, a switch, a convergence of at least two conditions, a recursive procedure, or an at least one function pointer. In some embodiments, an at least one exit block is added to the hierarchical flow graph. In some embodiments, an at least one place holder block is added to the hierarchical flow graph. In some embodiments, at least three distinct representational units are used in building the at least one call control flow graph, the at least three distinct representational units indicating at least a function, a loop, and a condition.
In some embodiments of the present computer program products, analyzing the code complexity comprises calculating a total interference of the computer program. In some embodiments, the non-transitory computer readable medium further comprises code for performing the step of transforming the computer program to reduce the total interference of the computer program.
Some embodiments of the present apparatuses comprise: a memory; and a processor coupled to the memory, wherein the processor is configured to execute the steps of: receiving at least one function executed by a computer program; extracting at least one loop in the computer program involving the at least one function; building a global call control flow graph based, at least in part, on the extracted at least one loop and the at least one function of the computer program, wherein the global call control flow graph represents interconnectivity in the computer program; and analyzing a code complexity of the computer program based, at least in part, on the global call control flow graph to manage data on a manycore system. In some embodiments, the step of building the global call control flow graph comprises: building at least one hierarchical flow graph based on the at least one loop; building a call control flow graph based on the at least one hierarchical flow graph; and joining the at least one call control flow graph to create the global call control flow graph. In some embodiments, the processor coupled to the memory is further configured to execute the step of receiving at least one condition executed by the computer program, wherein the step of building the global call control flow graph is further based, at least in part, on the received at least one condition.
In some embodiments, the at least one loop comprises one of a poorly formed loop, a switch, a convergence of at least two conditions, a recursive procedure, or an at least one function pointer. In some embodiments, an at least one exit block is added to the hierarchical flow graph. In some embodiments, an at least one place holder block is added to the hierarchical flow graph. In some embodiments, at least three distinct representational units are used in building the at least one call control flow graph, the at least three distinct representational units indicating at least a function, a loop, and a condition.
In some embodiments, the step of analyzing the code complexity comprises calculating a total interference of the computer program. In some embodiments the processor coupled to the memory is further configured to execute the step of transforming the computer program to reduce the total interference of the computer program.
All of the methods disclosed and claimed herein can be made and executed without undue experimentation in light of the present disclosure. While the compositions and methods of this invention have been described in terms of preferred embodiments, it will be apparent to those of skill in the art that variations may be applied to the methods and in the steps or in the sequence of steps of the method described herein without departing from the concept, spirit and scope of the invention. More specifically, it will be apparent that certain agents which are both chemically and physiologically related may be substituted for the agents described herein while the same or similar results would be achieved. All such similar substitutes and modifications apparent to those skilled in the art are deemed to be within the spirit, scope and concept of the invention as defined by the appended claims.
The following references, to the extent that they provide exemplary procedural or other details supplementary to those set forth herein, are specifically incorporated herein by reference.
This application claims priority to U.S. Provisional Application No. 62/092,079 filed Dec. 15, 2014, which is specifically incorporated herein by reference without disclaimer.
The invention was made with government support under Grant No. 0916652 awarded by the National Science Foundation. The government has certain rights in the invention.
Number | Date | Country
---|---|---
62092079 | Dec 2014 | US