The present invention generally concerns computer programming. More particularly, the invention concerns a system, methods, and apparatus for source code compilation.
The progression of the computer industry in recent years has illustrated the need for more complex processor architectures capable of processing large volumes of data and executing increasingly complex software. A number of systems resort to multiple processing cores on a single processor. Other systems include multiple processors in a single computing device. Additionally, many of these systems utilize multiple threads per processing core. One limitation that these architectures experience is that currently available commercial compilers cannot efficiently take advantage of the increase in computational resources.
In the software design and implementation process, compilers are responsible for translating the abstract operational semantics of the source program into a form that makes efficient use of a highly complex heterogeneous machine. Multiple architectural phenomena occur and interact simultaneously; this requires the optimizer to combine multiple program transformations. For instance, there is often a tradeoff between exploiting parallelism and exploiting locality to reduce the ever widening disparity between memory bandwidth and the frequency of processors: the memory wall. Indeed, the speed and bandwidth of the memory subsystems have always been a bottleneck, one that worsens when going to multi-core. Since optimization problems are associated with huge and unstructured search spaces, this combinatorial task is poorly achieved by current compilers, resulting in weak scalability and disappointing sustained performance.
Even when programming models are explicitly parallel (threads, data parallelism, vectors), they usually rely on advanced compiler technology to relieve the programmer from scheduling and mapping the application to computational cores, and from understanding the memory model and communication details. Even provided with enough static information or annotations (OpenMP directives, pointer aliasing, separate compilation assumptions), compilers have a hard time exploring the huge and unstructured search space associated with these mapping and optimization challenges. Indeed, the task of the compiler can hardly be called optimization anymore, in the traditional meaning of reducing the performance penalty entailed by the level of abstraction of a higher-level language. Together with the run-time system (whether implemented in software or hardware), the compiler is responsible for most of the combinatorial code generation decisions to map the simplified and ideal operational semantics of the source program to the highly complex and heterogeneous machine.
The polyhedral model promises to be a powerful framework to unify coarse grained and fine-grained parallelism extraction with locality and communication optimizations. To date, this promise remains unfulfilled, as no existing affine scheduling and fusion techniques can perform all these optimizations in a unified (i.e., non-phase ordered) and unbiased manner. Typically, parallelism optimization algorithms optimize for degrees of parallelism, but cannot be used to optimize locality or communication. In like manner, algorithms used for locality optimization cannot be used for extracting parallelism. Additional difficulties arise when optimizing source code for the particular architecture of a target computing apparatus.
Therefore there exists a need for improved source code optimization methods and apparatus that can optimize both parallelism and locality.
The present invention provides a system, apparatus and methods for overcoming some of the difficulties presented above. Various embodiments of the present invention provide a method, apparatus, and computer software product for optimization of a computer program on a first computing apparatus for execution on a second computing apparatus.
In an exemplary provided method, computer program source code is received into a memory on a first computing apparatus. In this embodiment, the first computing apparatus' processor contains at least one multi-stage execution unit. The source code contains at least one arbitrary loop nest. The provided method produces program code that is optimized for execution on a second computing apparatus. In this method the second computing apparatus contains at least two multi-stage execution units. With these units there is an opportunity for parallel operations. In its optimization of the code, the first computing apparatus takes into account the opportunity for parallel operations and locality and analyzes the tradeoff of execution costs between parallel execution and serial execution on the second computing apparatus. In this embodiment, the first computing apparatus minimizes the total costs and produces code that is optimized for execution on the second computing apparatus.
In another embodiment, a custom computing apparatus is provided. In this embodiment, the custom computing apparatus contains a storage medium, such as a hard disk or solid state drive, a memory, such as a Random Access Memory (RAM), and at least one processor. In this embodiment, the at least one processor contains at least one multi-stage execution unit. In this embodiment, the storage medium is customized to contain a set of processor executable instructions that, when executed by the at least one processor, configure the custom computing apparatus to optimize source code for execution on a second computing apparatus. The second computing apparatus, in this embodiment, is configured with at least two multi-stage execution units. This configuration allows the execution of some tasks in parallel across the at least two execution units, and of others in serial on a single execution unit. In the optimization process the at least one processor takes into account the tradeoff between the cost of parallel operations on the second computing apparatus and the cost of serial operations on a single multi-stage execution unit in the second computing apparatus.
In a still further embodiment of the present invention, a computer software product is provided. The computer software product contains a computer readable medium, such as a CDROM or DVD medium. The computer readable medium contains a set of processor executable instructions that, when executed by a multi-stage processor within a first computing apparatus, configure the first computing apparatus to optimize computer program source code for execution on a second computing apparatus. As in the above-described embodiments, the second computing apparatus contains at least two execution units. With at least two execution units there is an opportunity for parallel operations. The configuration of the first computing apparatus includes a configuration to receive computer source code in a memory on the first computing apparatus and to optimize the costs of parallel execution and serial execution of tasks within the program, when executed on the second computing apparatus. The configuration minimizes these execution costs and produces program code that is optimized for execution on the second computing apparatus.
Various embodiments of the present invention taught herein are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
FIGS. 10(a) and 10(b) illustrate an embodiment of a provided method;
FIGS. 12(a) and 12(b) illustrate an embodiment of a provided method; and
It will be recognized that some or all of the Figures are schematic representations for purposes of illustration and do not necessarily depict the actual relative sizes or locations of the elements shown. The Figures are provided for the purpose of illustrating one or more embodiments of the invention with the explicit understanding that they will not be used to limit the scope or the meaning of the claims.
In the following paragraphs, the present invention will be described in detail by way of example with reference to the attached drawings. While this invention is capable of embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the invention and not intended to limit the invention to the specific embodiments shown and described. That is, throughout this description, the embodiments and examples shown should be considered as exemplars, rather than as limitations on the present invention. Descriptions of well known components, methods and/or processing techniques are omitted so as to not unnecessarily obscure the invention. As used herein, the “present invention” refers to any one of the embodiments of the invention described herein, and any equivalents. Furthermore, reference to various feature(s) of the “present invention” throughout this document does not mean that all claimed embodiments or methods must include the referenced feature(s).
Embodiments of the present invention provide a custom computing apparatus, illustrated in
Various embodiments of the present invention are directed to processors containing multi-stage execution units, and in some embodiments multiple execution units. By way of example and not limitation to the particular multi-stage execution unit,
A further illustration of a multiple execution unit system is depicted in
The following code example illustrates loop fusion. Given the following code:
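The original listing is not reproduced here; by way of illustration only, two consecutive loops of the following general form (the array names, initializations and bounds are assumptions, chosen to match the access order a[0] . . . a[100] then b[0] . . . b[100] discussed below) are representative:

    for (int i = 0; i <= 100; i++) {
      a[i] = 1;                    /* first loop: writes a[0] through a[100] */
    }
    for (int i = 0; i <= 100; i++) {
      b[i] = 2;                    /* second loop: writes b[0] through b[100] */
    }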
The effect of loop fusion is to interleave the execution of the first loop with the execution of the second loop.
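An illustrative fused form of the sketch above (again an assumption, not the original listing) is:

    for (int i = 0; i <= 100; i++) {
      a[i] = 1;                    /* the two loop bodies now execute within the same iteration */
      b[i] = 2;
    }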
A consequence of loop fusion is that memory locations a[i] and b[i] referenced by the former 2 loops are now accessed in an interleaved fashion. In the former code, memory locations were accessed in the order a[0], a[1], . . . a[100] then b[0], b[1], . . . b[100]. In the code comprising the fused loops, the memory locations are now accessed in the order a[0], b[0], a[1], b[1], . . . a[100], b[100]. Loop fusion can lead to better locality when multiple loops access the same memory locations. It is common general knowledge in the field of compilers that better locality reduces the time a processing element must wait for the data resident in memory to be brought into a local memory such as a cache or a register. In the remainder of this document, we shall say that loops are fused or equivalently that they are executed together when such a loop fusion transformation is applied to the received program to produce the optimized program.
Loop fusion can change the order in which memory locations of a program are accessed and require special care to preserve original program semantics:
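The code referred to here is not reproduced; an illustrative loop pair consistent with the dependence described in the next paragraph (each b[i] reading a[i+1]) might look like:

    for (int i = 0; i <= 100; i++) {
      a[i] = 1;                    /* produces a[i], including the a[i+1] values needed below */
    }
    for (int i = 0; i < 100; i++) {
      b[i] = 2 + a[i + 1];         /* consumes a[i+1] produced by the first loop */
    }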
In the previous program, the computation of b[i] depends on the previously computed value of a[i+1]. Simple loop fusion in that case is illegal. If we consider the value computed for b[0]=2+a[1], in the following fused program b[0] will read a[1] at iteration i=0, before a[1] is computed at iteration i=1.
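An illustrative, and illegal, naively fused form of that sketch is the following; at iteration i=0 it reads a[1] before the statement that writes a[1] has executed:

    for (int i = 0; i < 100; i++) {
      a[i] = 1;
      b[i] = 2 + a[i + 1];         /* reads a[i+1], which has not yet been written */
    }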
It is common general knowledge in the field of high-level compiler transformations that enabling transformations such as loop shifting, loop peeling, loop interchange, loop reversal, loop scaling and loop skewing can be used to make fusion legal.
The problem of parallelism extraction is related to the problem of loop fusion in the aspect of preserving original program semantics. A loop in a program can be executed in parallel if there are no dependences between its iterations. For example, the first program loop below can be executed in parallel, while the second loop must be executed in sequential order:
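The two loops referred to above are not reproduced; illustrative loops with the corresponding dependence structure (array names and bounds are assumptions) are:

    for (int i = 0; i < 100; i++) {
      a[i] = b[i] + 1;             /* no dependence between iterations: parallel */
    }
    for (int i = 1; i < 100; i++) {
      c[i] = c[i - 1] + a[i];      /* each iteration reads the previous one: sequential */
    }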
It is common knowledge in the field of high-level compiler transformations that the problems of fusion and parallelism heavily influence each other. In some cases, fusing 2 loops can force them to be executed sequentially.
Loop permutability is another important property of program optimizations. A set of nested loops is said to be permutable if their order in the loop nest can be interchanged without altering the semantics of the program. It is common knowledge in the field of high-level compiler optimization that loop permutability also means the loops in the permutable set of loops dismiss the same set of dependences. It is also common knowledge that such dependences are forward only when the loops are permutable. This means the multi-dimensional vector of the dependence distances has only non-negative components. Consider the following set of loops:
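The loop nest itself is not reproduced; an illustrative nest consistent with the dependence vectors (1,1) and (1,0) derived in the next paragraph is (bounds assumed):

    for (int i = 1; i < 100; i++) {
      for (int j = 1; j < 100; j++) {
        a[i][j] = a[i - 1][j - 1] + a[i - 1][j];   /* statement S */
      }
    }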
There are 2 flow dependences between the statement S and itself. The two-dimensional dependence vectors are: (i−(i−1), j−(j−1))=(1,1) and (i−(i−1), j−j)=(1,0). The components of these vectors are non-negative for all possible values of i and j. Therefore the loops i and j are permutable and the loop interchange transformation preserves the semantics of the program. If loop interchange is applied, the resulting program is:
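For the illustrative nest sketched above, the interchanged form would be:

    for (int j = 1; j < 100; j++) {
      for (int i = 1; i < 100; i++) {
        a[i][j] = a[i - 1][j - 1] + a[i - 1][j];   /* statement S, loops i and j interchanged */
      }
    }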
Loop permutability is important because it allows loop tiling (alternatively named loop blocking). Loop tiling is a transformation that changes the order of the iterations in the program and ensures all the iterations of a tile are executed before any iteration of the next tile. When tiling by sizes (i=2, j=4) is applied to the previous code, the result is:
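One possible tiled form of the illustrative nest, assuming rectangular tiles of size 2 along i and 4 along j and with boundary handling simplified, is:

    for (int jj = 1; jj < 100; jj += 4) {
      for (int ii = 1; ii < 100; ii += 2) {
        for (int j = jj; j < jj + 4 && j < 100; j++) {
          for (int i = ii; i < ii + 2 && i < 100; i++) {
            a[i][j] = a[i - 1][j - 1] + a[i - 1][j];   /* all iterations of a tile run before the next tile */
          }
        }
      }
    }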
Consider the memory locations written by the statement S. Before tiling, the locations are written in this order: a[1][1], a[1][2] . . . a[1][99], a[2][1], a[2][2] . . . a[2][99], a[3][1] . . . . After tiling, the new order of writes is the following: a[1][1], a[2][1], a[1][2], a[2][2] . . . a[1][4], a[2][4], a[4][1], a[5][1], a[4][2], a[5][2] . . . a[4][4], a[5][4]. . . . It is additionally common knowledge that loop tiling results in better locality when the same memory locations are written and read multiple times during the execution of a tile.
Loop tiling is traditionally performed with respect to tiling hyperplanes. In this example, the tiling hyperplanes used are the trivial (i) and (j) hyperplanes. In the general case, any linearly independent combination of hyperplanes may be used for tiling, provided it does not violate program semantics. For example, (i+j) and (i+2*j) could as well be used and the resulting program would be much more complex.
Another important loop transformation is loop skewing. It is common knowledge that loop permutability combined with loop skewing results in the production of parallelism. In the following permutable loops, the inner loop can be executed in parallel after loop skewing:
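The permutable loops referred to above are not reproduced; an illustrative nest in which both loops carry a dependence, so that the inner loop is not directly parallel, is (bounds, the size n and array names are assumptions):

    for (int i = 1; i < n; i++) {
      for (int j = 1; j < n; j++) {
        a[i][j] = a[i - 1][j] + a[i][j - 1];   /* dependence distances (1,0) and (0,1) */
      }
    }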
After loop skewing the code is the following and the inner loop j is marked for parallel execution:
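For the illustrative nest above, a skewed form in which the inner loop j becomes parallel is sketched below; the wavefront variable t=i+j and the bound computations are assumptions:

    for (int t = 2; t <= 2 * (n - 1); t++) {
      int jlo = (t - (n - 1) > 1) ? t - (n - 1) : 1;
      int jhi = (t - 1 < n - 1) ? t - 1 : n - 1;
      /* doall: for a fixed t, the iterations of j are independent */
      for (int j = jlo; j <= jhi; j++) {
        int i = t - j;
        a[i][j] = a[i - 1][j] + a[i][j - 1];
      }
    }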
The skewing transformation helps extract parallelism at the inner level when the loops are permutable. It is also common knowledge that loop tiling and loop skewing can be combined to form parallel tiles that increase the amount of parallelism and decrease the frequency of synchronizations and communications in the program.
The problem of jointly optimizing parallelism and locality by means of loop fusion, parallelism, loop permutability, loop tiling and loop skewing is a non-trivial tradeoff. It is one of the further objects of this invention to jointly optimize this tradeoff.
When considering high-level loop transformations, it is common practice to represent dependences in the form of affine relations. The first step is to assign to each statement in the program an iteration space and an iteration vector. Consider the program composed of the 2 loops below:
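The two-loop program referred to above is not reproduced; an illustrative statement S consistent with the iteration domain and the dependence relations derived in the next paragraph (reading the transposed element a[j][i] and the previous element a[i][j−1]) is:

    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= n; j++) {
        a[i][j] = a[j][i] + a[i][j - 1];   /* statement S */
      }
    }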
The iteration domain of the statement S is D={[i, j] in Z2 | 1≦i≦n, 1≦j≦n}. The second step is to identify when two operations may be executed in parallel or when a producer-consumer relationship prevents parallelism. This is done by identifying the set of dependences in the program. In this example, the set of dependences is: R={[[i, j], [i′, j′]] | i=i′, j=j′−1, [i, j] in D, [i′, j′] in D, <S, [i, j]> << <S, [i′, j′]>} union {[[i, j], [i′, j′]] | i=j′, j=i′, [i, j] in D, [i′, j′] in D, <S, [i, j]> << <S, [i′, j′]>}, where << denotes multi-dimensional lexicographic ordering. This relationship can be rewritten as: a[i,j]→a[j,i]: {([i, j], [j, i]) | 1≦j, i≦n, −j+i−1≧0} union a[i,j]→a[i,j−1]: {([i, j+1], [i, j]) | 1≦j≦n−1, 0≦i≦n}.
It is common practice to represent the dependence relations using a directed dependence graph, whose nodes represent the statements in the program and whose edges represent the dependence relations. In the previous example, the dependence graph has 1 node and 2 edges. It is common practice to decompose the dependence graph into strongly connected components. Usually, strongly connected components represent loops whose semantics require them to be fused in the optimized code. There are many possible cases, however, and one of the objects of this invention is also to perform the selective tradeoff of which loops to fuse at which depth. It is common knowledge that a strongly connected component of a graph is a maximal set of nodes that can be reached from any node of the set when following the directed edges in the graph.
One-Dimensional Affine Fusion
One embodiment incorporates fusion objectives into affine scheduling constraints. Affine fusion, as used herein, means not just merging two adjacent loop bodies together into the same loop nest, but also includes loop shifting, loop scaling, loop reversal, loop interchange and loop skewing transformations. In the α/β/γ convention this means that we would like to have the ability to modify the linear part of the schedule, α, instead of just β and γ. Previous fusion works are mostly concerned with adjusting the β component (fusion only) and sometimes both the β and γ components (fusion with loop shifting). One embodiment of the invention computes a scheduling function used to assign a partial execution order between the iterations of the operations of the optimized program and to produce the resulting optimized code respecting this partial order.
As a simple motivational example demonstrating the power of affine fusion, consider the example above. Dependencies between the loop nests prevent the loops from being fused directly, unless loop shifting is used to peel extra iterations off the first and second loops. The resulting transformation is shown below.
On the other hand, affine fusion gives a superior transformation, as shown above. In this transformation, the fusion-preventing dependencies between the loop nests are broken with a loop reversal rather than loop shifting, and as a result, no prologue and epilogue code is required. Furthermore, the two resulting loop nests are permutable. Thus we can further apply tiling and extract one degree of parallelism out of the resulting loop nests.
Many prior art algorithms cannot find this transformation because of their restrictions. Some of the restrictions prune the solution space of loop reversals, and thus these algorithms can only find the loop-shifting based solutions. Another important criterion is that fusion should not be too greedy, i.e., aggressive fusion that destroys parallelism should be avoided. On the other hand, fusion that can substantially improve locality may sometimes be preferred over an extra degree of parallelism, if we have already obtained sufficient degrees of parallelism to fill the hardware resources. For instance, consider the combined matrix multiply example. This transformation is aggressive, and it gives up an additional level of synchronization-free parallelism that may be important on some highly parallel architectures. It is a further object of this invention to properly model the tradeoff between benefits of locality and parallelism for different hardware configurations.
The code below shows the result of applying fusion that does not destroy parallelism. The two inner i-loops are fissioned in this transformation, allowing a second level of synchronization-free parallelism.
Affine Fusion Formulation
The tension between fusion and scheduling implies that fusion and scheduling should be solved in a unified manner. For any loop p, we compute a cost wp which measures the slowdown in execution if the loop is executed sequentially rather than in parallel. Similarly, for each pair of loop nests (p, q), we estimate upq, the cost in performance if the two loops p and q remain unfused. The cost wp can be interpreted to be the difference between sequential and parallel execution times, and the cost upq can be interpreted as the savings due to cache- or communication-based locality. In one embodiment, the cost wp is related to a difference in execution speed between sequential operations of the at least one loop on a single execution unit in the second computing apparatus and parallel operations of the at least one loop on more than one of the at least two execution units in the second computing apparatus. In another embodiment, the cost upq is related to a difference in execution speed between operations where the pair of loops are executed together on the second computing apparatus, and where the pair of loops are not executed together on the second computing apparatus.
In an illustrative example, let the Boolean variable Δp denote whether the loop p is executed in sequence, and let the variable fpq denote whether the two loops p and q remain unfused, i.e., Δp=0 means that p is executed in parallel, and fpq=0 means that the loops p and q have been fused. Then by minimizing the weighted sum
we can optimize the total execution cost pertaining to fusion and parallelism. In some embodiments, the variable Δp specifies if the loop is executed in parallel in the optimized program. In another embodiment, the variable fpq specifies if the pair of loops are executed together in the optimized program.
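The weighted sum itself is not reproduced in the text above; assuming the parallelism and fusion cost terms combine linearly over the loops p and the loop pairs (p, q), it plausibly has the form:

    minimize Σp wp·Δp + Σ(p,q)∈E′ upq·fpq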
In some embodiments, the value of the cost wp is determined by a static evaluation of a model of the execution cost of the instructions in the loop. In another embodiment, the value of the cost wp is determined through the cost of a dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop. In a further embodiment, the value of the cost wp is determined by an iterative process consisting of at least one static evaluation of a model of the execution cost and at least one dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop.
In some embodiments, the value of the cost upq is determined by a static evaluation of a model of the execution cost of the instructions in the loop pair. In another embodiment, the value of the cost upq is determined through the cost of a dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop pair. In a further embodiment, the value of the cost upq is determined by an iterative process consisting of at least one static evaluation of a model of the execution cost and at least one dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop pair.
The optimization can be formulated as follows. In one embodiment, we divide the generalized dependence graph, GDG G=(V, E), into strongly connected components (SCCs) and consider each SCC to be a separate fusible “loop” candidate. Let G′=(V′, E′) denote the SCC induced subgraph where V′ denotes the SCCs and E′ the edges between SCCs. Given a node v ∈ V, let scc(v) denote the component to which v belongs in the SCC decomposition. Given (p, q) ∈ E′, let the Boolean variable fpq denote whether two SCCs have been fused, i.e., fpq=0 denotes that the loops corresponding to p and q have been fused.
fpq ∈ {0,1}, (5)
(p,q) ∈ E′ (6)
There are multiple possible strategies to encode the restrictions implied by E′. In one embodiment, we directly encode the transitivity relation of E′ as constraints, i.e., (i) given edges (p,q), (q,r) and (p,r), if loops (p,q) or (q,r) are not fused then (p,r) cannot be fused, and (ii) if (p,q) and (q,r) are fused then (p,r) must be fused:
fpq, fqr ≦ fpr, (p,q),(q,r),(p,r) ∈ E′ (7)
fpq + fqr ≧ fpr, (p,q),(q,r),(p,r) ∈ E′ (8)
One potential deficiency of this strategy is that up to O(|V′|3) constraints are required. A second embodiment that we adopt involves encoding the β schedule coordinates directly in the constraints. In this encoding, βp=βq implies that loops p and q have been fused:
βp ∈ {0, . . . , |V′|−1} (9)
βp ≧ βq + fpq, (p,q) ∈ E′ (10)
βq − βp ≧ −|V′| fpq, (p,q) ∈ E′ (11)
With the constraints on fpq in place, we can now provide a suitable modification to the schedule constraints. The constraints are divided into two types: the first involves edges within the same SCC, and the second involves edges crossing different SCCs:
Fpq(y) = fpq(y1+ . . . +yk+1) (14)
Here, the term −N∞Fpq(y) is defined in such a way that −N∞Fpq(y)=0 when fpq=0, and is equal to a sufficiently large negative function when fpq=1. Thus, φs(e)(j,y)−φt(e)(i,y)≧0 only needs to hold if the edge e has been fused or is a loop-carried edge. The final set of constraints is to enforce the restriction that δp(y)=δq(y) if (p, q) has been fused. The constraints encoding this are as follows:
δp(y) − δq(y) + N∞Fpq(y) ≧ 0, (p, q) ∈ E′ (15)
δq(y) − δp(y) + N∞Fpq(y) ≧ 0, (p, q) ∈ E′ (16)
δpq(y) − δp(y) + N∞Fpq(y) ≧ 0, (p, q) ∈ E′ (17)
Some embodiments additionally specify that a schedule dimension at a given depth must be linearly independent from all schedule dimensions already computed. Such an embodiment computes the linear algebraic kernel of the schedule dimensions found so far. In such an embodiment, for a given statement S, h denotes the linear part of φS, the set of schedule dimensions already found, and J denotes a subspace linearly independent of h. A further embodiment derives a set of linear independence constraints that represent the additional condition Jh≠0 and does not restrict the search to Jh>0. Such linear independence constraints may be used to ensure successive schedule dimensions are linearly independent. In particular, such an embodiment, which does not restrict the search to Jh>0, exhibits an optimization process that can reach any legal multidimensional affine scheduling of the received program, including combinations of loop reversal.
In some embodiments the set of conditions preserving semantics is the union of all the constraints of the form φs(e)(j,y)−φt(e)(i,y)≧0. In another embodiment, the optimizing search space that encompasses all opportunities in parallelism and locality is the conjunction of all the constraints (5)-(17).
In further embodiments, the set of affine constraints (12) and (13) is linearized using the affine form of Farkas lemma and is based on at least one strongly connected component of the generalized dependence graph.
In other embodiments, the constraints of the form (12) are used to enforce that dimensions of schedules of loops belonging to the same strongly connected component are permutable.
In further embodiments, the constraints of the form (13) are used to ensure that dimensions of schedules of loops that are not executed together in the optimized program do not influence each other. In such embodiments, the constraints of the form (13) use a large enough constant to ensure that dimensions of schedules of loops that are not executed together in the optimized program do not influence each other.
In some embodiments, the linear weighted sum described above can be optimized directly with the use of an integer linear programming mathematical solver such as CPLEX. In other embodiments, a non-linear optimization function such as a convex function may be optimized with the use of a convex solver such as CSDP. Further embodiments may devise non-continuous optimization functions that may be optimized with a parallel satisfiability solver.
Boolean Δ Formulation
The embodiments described so far depend on a term (or multiple terms) δ(y) which bounds the maximal dependence distance. Another embodiment may opt for the following simpler formulation. First, we assign each SCC p in the GDG a Boolean variable Δp, where Δp=0 means a dependence distance of zero (i.e., parallel), and Δp=1 means some non-zero dependence distance:
Δp ∈ {0,1}, p ∈ V′ (18)
Define the functions Δp(y) and Δpq(y) as:
Δp(y)=Δp×(y1+ . . . +yk+1) (19)
Δpq(y)=Δpq×(y1+ . . . +yk+1) (20)
Then the affine fusion constraints can be rephrased as follows:
Multi-Dimensional Affine Fusion
The affine fusion formulation is a depth-by-depth optimization embodiment. A further embodiment described in
The variables and their interpretations are:
The following constraints ensure that pek=0 only if εek-1=1 and εek=1:
pek ∈ {0,1}, e ∈ E (30)
εek−1 + εek + 2pek ≧ 2 (31)
The next constraints encode the β component of the schedules.
The next set of constraints ensures that all δak(y) terms are the same for all nodes a which belong to the same loop nest:
δs(e)k(y) − δek(y) ≦ N∞(βs(e)k − βt(e)k), e ∈ E (34)
δek(y) − δs(e)k(y) ≦ N∞(βs(e)k − βt(e)k), e ∈ E (35)
δt(e)k(y) − δek(y) ≦ N∞(βs(e)k − βt(e)k), e ∈ E (36)
δek(y) − δt(e)k(y) ≦ N∞(βs(e)k − βt(e)k), e ∈ E (37)
δs(e)k(y) − δt(e)k(y) ≦ N∞(βs(e)k − βt(e)k), e ∈ E (38)
δt(e)k(y) − δs(e)k(y) ≦ N∞(βs(e)k − βt(e)k), e ∈ E (39)
Similarly, the next set of constraints ensures that all pak are identical for all nodes a which belong to the same loop nest.
ps(e)k − pek ≦ N∞(βs(e)k − βt(e)k), e ∈ E (40)
pek − pt(e)k ≦ N∞(βs(e)k − βt(e)k), e ∈ E (41)
ps(e)k − pt(e)k ≦ N∞(βs(e)k − βt(e)k), e ∈ E (42)
pt(e)k − ps(e)k ≦ N∞(βs(e)k − βt(e)k), e ∈ E (43)
In some embodiments, the strong satisfaction variable E_{k,e} assigned to each schedule dimension k and each edge e of the at least one strongly connected component is εek, which is equal to 1 when the schedule difference at dimension k strictly satisfies edge e (i.e., when φs(e)k(i,y)−φt(e)k(j,y)≧1, e ∈ E), and 0 otherwise. In other embodiments, the loop permutability Boolean variable p_{k,e} assigned to each schedule dimension and each edge e of the at least one strongly connected component is pek.
In a further embodiment, the statement permutability Boolean variable p_{k,a} assigned to each schedule dimension and each statement a of the at least one strongly connected component is pak. In another embodiment, constraints of the form (27), (28) and (29) are added to ensure dimensions of schedules of statements linked by a dependence edge in the generalized dependence graph do not influence each other at depth k if the dependence has been strongly satisfied up to depth k−1. In a further embodiment, constraints of the form (30) and (31) are added to link the strong satisfiability variables to the corresponding loop permutability Boolean variables. In another embodiment, constraints of the form (34) to (43) are added to ensure statement permutability Boolean variables are equal for all the statements in the same loop nest in the optimized program. In a further embodiment, the conjunction of the previous constraints forms a single multi-dimensional convex affine search space of all legal multi-dimensional schedules that can be traversed exhaustively or with a speed-up heuristic to search for schedules that optimize any global cost function.
One example of an embodiment tailored for successive parallelism and locality optimizations is provided for an architecture with coarse grained parallel processors, each of them featuring fine grained parallel execution units such as SIMD vectors. One such architecture is the Intel Pentium E 5300. The following example illustrates how an embodiment of the invention computes schedules used to devise multi-level tiling hyperplanes and how a further embodiment of the invention may compute different schedules for different levels of the parallelism and memory hierarchy of the second computing apparatus. Consider the following code representing a 3-dimensional Jacobi iteration stencil. In a first loop, the array elements A[i][j][k] are computed by a weighted sum of the 7 elements, B[i][j][k], B[i−1][j][k], B[i+1][j][k], B[i][j−1][k], B[i][j+1][k], B[i][j][k−1] and B[i][j][k+1]. In a symmetrical second loop, the array elements B[i][j][k] are computed by a weighted sum of 7 elements of A. The computation is iterated Titer times.
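The stencil code itself is not reproduced above; an illustrative form (the coefficients c0 and c1, the size N and the loop bounds are assumptions) is:

    for (int t = 0; t < Titer; t++) {
      /* first loop: update A from the 7-point neighborhood of B */
      for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
          for (int k = 1; k < N - 1; k++)
            A[i][j][k] = c0 * B[i][j][k]
                       + c1 * (B[i-1][j][k] + B[i+1][j][k] + B[i][j-1][k]
                             + B[i][j+1][k] + B[i][j][k-1] + B[i][j][k+1]);
      /* symmetrical second loop: update B from the 7-point neighborhood of A */
      for (int i = 1; i < N - 1; i++)
        for (int j = 1; j < N - 1; j++)
          for (int k = 1; k < N - 1; k++)
            B[i][j][k] = c0 * A[i][j][k]
                       + c1 * (A[i-1][j][k] + A[i+1][j][k] + A[i][j-1][k]
                             + A[i][j+1][k] + A[i][j][k-1] + A[i][j][k+1]);
    }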
When computing a schedule for the first level of parallelism (the multiple cores) our invention may produce the following optimized code in which permutable loops are marked as such.
In this form, the loops have been fused at the innermost level on loop l and the locality is optimized. Loop tiling by tiling factors (16, 8, 8, 1) may be applied to further improve locality and the program would have the following form, where the inner loops m, n, o are permutable.
Without further optimization, the loops are fused on all loops i, j, k, l, m, n and o. The program does not take advantage of fine grained parallelism on each processor along the loops m, n and o. Our innovation allows the optimization of another selective tradeoff to express maximal innermost parallelism at the expense of fusion. This selective tradeoff assigns a much higher cost to parallelism than to locality, and our innovation may find a different schedule for the intra-tile loops that results in a program that may display the following pattern:
The innermost doall dimensions may further be exploited to produce vector like instructions while the outermost permutable loops may be skewed to produce multiple dimensions of coarse grained parallelism.
In a further embodiment, the schedules that produce the innermost doall dimensions may be further used to produce another level of multi-level tiling hyperplanes. The resulting code may have the following structure:
In the following example, dependencies between the loop nests prevent the loops from being fused directly, unless loop shifting is used to peel extra iterations off the first and second loops. The resulting transformation is illustrated in the code below.
On the other hand, affine fusion (i.e., fusion combined with other affine transformations) gives a superior transformation, as shown below. In this transformation, the fusion-preventing dependencies between the loop nests are broken with a loop reversal rather than loop shifting, and as a result, no prologue or epilogue code is required. Furthermore, the two resulting loop nests are permutable. In some embodiments, tiling and extraction of one degree of parallelism out of the resulting loop nests is performed.
In some embodiments loop fusion is limited so as not to be too greedy, i.e., aggressive fusion that destroys parallelism should be avoided. On the other hand, fusion that can substantially improve locality may sometimes be preferred over an extra degree of parallelism, if we have already obtained sufficient degrees of parallelism to exploit the hardware resources. For example, given the following code:
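The code referred to here is not reproduced; an illustrative pair of consecutive matrix multiplications, in the spirit of the combined matrix multiply example mentioned earlier (array names and sizes are assumptions), is:

    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++) {
        C[i][j] = 0;
        for (int k = 0; k < n; k++)
          C[i][j] += A[i][k] * B[k][j];    /* first product: C = A * B */
      }
    for (int i = 0; i < n; i++)
      for (int j = 0; j < n; j++) {
        E[i][j] = 0;
        for (int k = 0; k < n; k++)
          E[i][j] += C[i][k] * D[k][j];    /* second product: E = C * D, consuming C */
      }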
If fusion is applied too aggressively, it gives up an additional level of synchronization-free parallelism.
The below code illustrates the result of only applying fusion that does not destroy parallelism. The two inner j-loops are fissioned in this transformation, exposing a second level of synchronization-free parallelism.
The above illustrates that this tension between fusion and scheduling implies that fusion and scheduling should be solved in a unified manner. Turning now to
A provided method 150 for source code optimization is illustrated in
As used herein, “executed together” means fused in the sense of the code examples (0032)-(0037). Specifically, “executed together” means that loops that are consecutive in the original program become interleaved in the optimized program. In particular, loops that are not “executed together” in the sense of loop fusion can be executed together on the same processor in the more general sense. In the second optimization path illustrated in
In another embodiment, the second cost is determined through a dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the at least one loop pair. In a further embodiment, the cost is determined through an iterative refining process consisting of at least one static evaluation of a model of the execution cost and at least one dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the at least one loop pair. Flow then continues to decision block 200 where it is determined if additional unassigned loop pairs exist. If additional unassigned loop pairs exist, flow continues back to block 170 and the process iterates until no additional unassigned loop pairs are found. When decision block 200 determines no additional loop pairs are present, flow continues to decision block 220. If in decision block 220 it is determined that additional unassigned loops exist, flow continues back to block 160 and the process iterates until no additional unassigned loops may be identified. Flow then continues to block 230 where a selective tradeoff is created for locality and parallelism during the execution on second computing apparatus 10(b). Flow then continues to block 130 where a scheduling function is produced that optimizes the selective tradeoff. Flow then continues to block 140 where optimized code is produced.
The flow of a further provided embodiment of a method 240 for source code optimization is illustrated in
The flow of a further provided method is illustrated in
In the second illustrated embodiment, flow continues from block 260 to block 300(b) where an element is selected from the search space. Flow continues to block 310(b) where a potential scheduling function is derived for the element. Flow then continues to block 320(b) where the performance of the potential scheduling function is evaluated. Flow then continues to block 340 where the search space is refined using the performance of evaluated schedules. Flow then continues to decision block 330(b) where it is determined if additional elements exist in the search space. If additional elements are present, flow continues back to block 300(b) and the process iterates until no other elements exist in the search space. When no additional elements exist in the search space, flow then continues to block 370 where the element with the best evaluated performance is selected.
In the third illustrated embodiment, flow continues from block 260 to block 350 where the tradeoff is directly optimized in the search space with a mathematical problem solver. Flow then continues to block 360 where an element is selected that is a result of the direct optimization. Flow then continues to block 320(c) where the performance of the selected element is evaluated. Flow then continues to block 370 where the element with the best evaluated performance is selected. As illustrated, some embodiments may utilize more than one of these paths in arriving at an optimal solution. From selection block 370, flow then continues to block 280 where the scheduling function is derived from the optimized tradeoff. Flow then continues to block 140 where optimized code is produced.
The flow of a further provided embodiment of a method 380 for optimization of source code on a first custom computing apparatus 10(a) for execution on a second computing apparatus 10(b) is illustrated in
On a first path, flow continues to block 260 where a search space is derived that meets the conditions for semantic correctness. In this embodiment, the search space characterizes all parallelism and locality opportunities that meet the conditions of semantic correctness. Flow then continues to block 410 where a weighted parametric tradeoff is derived and optimized on the elements of the search space. On the second path, flow begins with block 160 where an unassigned loop is identified. Flow then continues on two additional paths. In a first path, flow continues to block 180 where a first cost function is assigned. This first cost function is related to a difference in execution speed between parallel and sequential operations of the statements within the identified loop on second computing apparatus 10(b). Flow then continues to block 210 where a decision variable is assigned to the loop under consideration, this decision variable indicating whether the loop is to be executed in parallel in the optimized program. In some embodiments the cost is determined through static evaluation of a model of the execution cost of the instructions in the loop under consideration. In other embodiments, the cost is determined through a dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop under consideration. In a further embodiment, the cost is determined by an iterative refining process consisting of at least one static evaluation of a model of the execution cost and at least one dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the loop under consideration. Flow then continues to decision block 220 where it is determined if there are additional unassigned loops.
Returning to block 160 where an unassigned loop is identified. On the second path flow continues to block 170 where an unassigned loop pair is identified. Flow then continues to block 175 where a second cost function is assigned for locality optimization. This second cost function is related to a difference in execution speed between operations where the loops of the pair of loops are executed together on the second computing apparatus, and where the loops of the pair of loops are not executed together on the second computing apparatus. Flow then continues to block 190 where a decision variable is assigned for locality. This second decision variable specifies whether the loops of the loop pair under consideration are to be executed together in the optimized program. In one embodiment, the second cost is determined through static evaluation of a model of the execution cost of the instructions in the at least one loop pair. In another embodiment, the second cost is determined through a dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the at least one loop pair. In a further embodiment, the cost is determined through an iterative refining process consisting of at least one static evaluation of a model of the execution cost and at least one dynamic execution on the second computing apparatus of at least a set of instructions representative of the code in the at least one loop pair. Flow then continues to decision block 200 where it is determined if additional unassigned loop pairs exist. If additional unassigned loop pairs exist, flow continues back to block 170 and the process iterates until no additional unassigned loop pairs are found. When decision block 200 determines no additional loop pairs are present, flow continues to decision block 220. If in decision block 220 it is determined that additional unassigned loops exist, flow continues back to block 160 and the process iterates until no additional unassigned loops may be identified. Flow then continues to block 230 where a selective trade-off is created for locality and parallelism during the execution on second computing apparatus 10(b).
In this embodiment, flow then continues to block 410 where as discussed, a weighted parametric tradeoff is derived and optimized on the elements of the search space. Flow then continues to block 420 where a multi-dimensional piecewise affine scheduling function is derived that optimizes the code for execution on second computing apparatus 10(b). Flow then continues to block 140 where the optimized program is produced.
The operational flow of a further provided method 430 for source code optimization is illustrated in
The operational flow of a further provided method 500 for source code optimization is illustrated in
In the first path, flow continues to block 540 where a set of affine constraints are derived using the affine form of Farkas lemma. On the second path, flow continues to block 550 where linear independence constraints are derived and used to ensure the successive scheduling dimensions are linearly independent. In some embodiments, these linear independence constraints are derived using orthogonally independent subspaces. In another embodiment, these constraints are formed using a Hermite Normal form decomposition. In the third path, flow continues to block 560 where a set of schedule difference constraints are derived and used to enforce that dimensions of schedules of loops belonging to the same strongly connected component are permutable. In the last path, a set of loop independence constraints are derived and used to ensure that dimensions of schedules of loops that are not executed together do not influence each other. In one embodiment, this set of constraints includes a large enough constant to cancel the effect of constraints on statements that are not executed together in the optimized program.
Flow then continues to block 580 where these derived constraints are added to the search space. Flow then continues to decision block 590 where it is determined if there are additional strongly connected components. If there are additional strongly connected components, flow continues back to block 530 and the process iterates until there are no further strongly connected components. Flow then continues to block 260 where a search space is derived that characterizes all parallelism and locality opportunities that meet the conditions of semantic correctness. Flow then proceeds to block 600 where a weighted parametric tradeoff is optimized on the elements of the search space. Flow continues to block 420 where a multi-dimensional piecewise affine scheduling function is derived from the optimization and to block 140 where this function is used to create an optimized program for execution on second computing apparatus 10(b). In one embodiment, the optimization can reach any legal multi-dimensional affine scheduling of the received program. In another embodiment, the legal multi-dimensional affine scheduling of the received program includes loop reversals.
The operational flow of a further provided method 610 for source code optimization is illustrated in
If decision block 620 determines that there are additional scheduling dimensions, flow continues to block 630 where the generalized dependence graph is decomposed into at least one strongly connected component. Flow continues to block 640 where a strongly connected component is selected. Flow then continues to block 650 where affine constraints using the affine form of Farkas lemma, linear independence constraints, permutability constraints, and independence constraints are derived as previously discussed. Flow then continues to block 660 where these constraints are added to the search space. Flow then continues to decision block 670 where it is determined if additional strongly connected components exist. If others exist, flow continues back to block 640 and the process iterates until there are no remaining strongly connected components.
When decision block 670 indicates that there are no remaining strongly connected components, flow continues to block 730 where a weighted parametric tradeoff function is optimized on the search space. Flow then continues to decision block 690 where it is determined if new independent permutable schedule dimensions exist. If they exist flow continues to block 700 where an existing scheduling dimension is selected. Flow continues to block 720 where additional constraints are added to the search space for independence and linear independence. From block 720 flow continues to block 730 where a weighted parametric tradeoff function is optimized on the search space. Flow then continues back to decision block 690 and this part of the process iterates until no new independent permutable schedule dimensions are found. Flow then continues to block 740 where satisfied edges are removed from the dependence graph and to block 750 where the remaining edges and nodes are partitioned into smaller dependence graphs. Flow then continues back to block 390 and the process is iterated on these smaller dependence graphs until decision block 620 determines there are no additional dimensions to schedule.
The flow of a further provided embodiment of a method 760 for optimization of source code on a first custom computing apparatus 10(a) for execution on a second computing apparatus 10(b) is illustrated in
On the second path flow continues to block 790(b) where an element of the search space is selected. Flow then continues to block 800(b) where a scheduling function is derived for the selected element. Flow then continues to block 810(b) where the performance of the scheduling function is evaluated. Flow then continues to block 830 where the search space is refined using the performance of evaluated schedules. Flow then continues to decision block 820(b). If there are additional elements remaining in the search space flow continues back to block 790(b) and another element is selected from the search space. The process iterates until there are no remaining elements in the search space.
On the third path flow continues to block 840 where the selective tradeoff is directly optimized using a mathematical solver. Flow then continues to block 850 where an element is selected from the search space that is a solution to the optimization. Flow then continues to block 860 where the performance of the selected element is evaluated. Flow then continues to block 870 which selects the element with the best evaluated performance for all of its inputs. Flow then continues to block 880 which produces a scheduling function from the selective tradeoff and the selected element. Flow then continues to block 890 where the scheduling function is used to assign a partial order to the statements of the source code and an optimized program is produced.
An exemplary embodiment of block 770 is illustrated in
On the second path, flow continues from block 390 to block 970 where a node N is selected. Flow continues to block 980 where a statement permutability variable is assigned to node N at dimension K. Block 980 receives dimension K from block 1010. Flow continues to decision block 990. If there are remaining nodes in the dependence graph flow continues back to block 970 where another node N is selected. The process iterates until no additional nodes exist in the graph. Block 950 receives input from blocks 920 and 980 and assigns constraints to link edge permutability variable and statement permutability variable at dimension K. Flow then continues to block 960 where constraints to equate statement permutability variables for source and sink of edge E at dimension K are assigned. Flow then continues to decision block 1000. If additional scheduling dimensions exist, flow continues back to block 1010 the next scheduling dimension is selected and the entire process repeated for all dimensions. When all dimensions have been scheduled, flow continues to block 1020 where a single multi-dimensional convex affine space is constructed from all of the legal schedules.
The flow of another provided method 1070 for program code optimization is illustrated in
Thus, it is seen that methods and an apparatus for optimizing source code on a custom first computing apparatus for execution on a second computing apparatus are provided. One skilled in the art will appreciate that the present invention can be practiced by other than the above-described embodiments, which are presented in this description for purposes of illustration and not of limitation. The specification and drawings are not intended to limit the exclusionary scope of this patent document. It is noted that various equivalents for the particular embodiments discussed in this description may practice the invention as well. That is, while the present invention has been described in conjunction with specific embodiments, it is evident that many alternatives, modifications, permutations and variations will become apparent to those of ordinary skill in the art in light of the foregoing description. Accordingly, it is intended that the present invention embrace all such alternatives, modifications and variations as fall within the scope of the appended claims. The fact that a product, process or method exhibits differences from one or more of the above-described exemplary embodiments does not mean that the product or process is outside the scope (literal scope and/or other legally-recognized scope) of the following claims.
This application is related to and claims the benefit of priority to U.S. Provisional Application Ser. No. 61/097,799, entitled “STATIC SOFTWARE TOOLS TO OPTIMIZE BMD RADAR TO COTS HARDWARE”, filed Sep. 17, 2008, the entirety of which is hereby incorporated by reference. This application is additionally related to the subject matter contained in co-owned, co-pending U.S. patent application Ser. No. 12/365,780 entitled “METHODS AND APPARATUS FOR LOCAL MEMORY COMPACTION” filed Feb. 4, 2009 which claims priority to U.S. Provisional Application Ser. No. 61/065,294 both of which are additionally incorporated by reference herein in their entirety.
Portions of this invention were made with U.S. Government support under SBIR contract/instrument W9113M-08-C-0146. The U.S. Government has certain rights.