The embodiments of the invention relate generally to partitioning code and more particularly to affine partitioning code onto multiple processing units.
From the very beginning of the computer industry, there has been a constant demand for improving the performance of systems in order to run applications faster than before, or to run applications that can produce results in an acceptable time frame. One method for improving the performance of computer systems is to have the system run applications or portions of an application (e.g., a thread) in parallel with one another on a system having multiple processors. In order to run an application or thread in parallel, the application or thread must be independent, that is, it cannot depend on the results produced by another application or thread.
The process of specifying which applications or threads can be run in parallel with one another may be referred to as partitioning. One particular type of partitioning used by current systems is referred to as affine partitioning. An affine partition may be used to uniformly represent many program transformations, such as loop interchange, loop reversal, loop skewing, loop fusion, and statement reordering. Further, space partitioning in an affine partitioning framework may be used to parallelize code for multiprocessor systems.
In general, an affine partition typically comprises a linear transformation and a translation of a vector or matrix operation within one or more loops, including transformation and translation of loop index variables. Various loop manipulations may be performed using affine partitioning. For example, loop interchange, loop reversal and loop skewing may be represented by linear transformations and translations performed in affine partitioning. The affine partition is an extension to unimodular transformation in use by current compilers. The affine partition extends the concept of unimodular transformation by:
While affine partitioning has provided benefits in producing code that can be run in parallel on multiprocessor systems, there remain significant issues. For example, successive iterations of a loop in a program will often make contiguous accesses to memory. Before partitioning, memory accesses of successive iterations of a loop may be near one another, resulting in a high likelihood that a memory reference will be available in faster cache memory. However, in affine partitioning, instances of instructions in a loop may be divided across multiple processors, and code may be transformed such that memory access patterns are much different than prior to partitioning. As a result, it is more likely that memory accesses will no longer be contiguous or near one another, and in fact memory accesses may be quite far from one another. In this case, there is a higher likelihood of a cache miss, thereby increasing the time required to access memory. Thus some or all of the performance gains realized by executing instructions in parallel may be lost because cache misses increase memory access times.
In the following detailed description of exemplary embodiments of the invention, reference is made to the accompanying drawings that form a part hereof, and in which is shown by way of illustration specific exemplary embodiments in which the inventive subject matter may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the various embodiments of the invention, and it is to be understood that other embodiments may be utilized and that logical, mechanical, electrical and other changes may be made without departing from the scope of the inventive subject matter. The following detailed description is, therefore, not to be taken in a limiting sense.
In the Figures, the same reference number is used throughout to refer to an identical component which appears in multiple Figures. Signals and connections may be referred to by the same reference number or label, and the actual meaning will be clear from its use in the context of the description.
It should be noted that while four processing units 110 are illustrated in
Compiler 102 operates to read a source code stream 106 and to translate the source code stream 106 into object code that can run on one or more of processing units 110. Source code stream 106 may be written in any of a number of programming languages such as C, C++, C#, Fortran, ADA, Pascal, PL1 etc. that are now available or may be developed in the future, with compiler 102 typically configured to operate on one type of programming language. The embodiments of the invention are not limited to any particular programming language.
In some embodiments, compiler 102 includes a partitioning module 104 that analyzes portions of source code stream 106 to determine if portions of the source code can be partitioned into code segments that can be parallelized, that is, segments that can be run simultaneously on the multiple processing units 110. Partitioning module 104 may be part of an optimizer component or a front-end component of compiler 102. Partitioning module 104 operates to perform affine partitioning of the portions of code stream 106. In general, affine partitioning divides instances of an instruction into partitions 116. Code segments 106 (also referred to as threads) contain instances of the instructions and are assigned a partition 116. The code segments of a partition are then assigned to run on one of the processors of the multi-core or multi-processor system. Partitioning across a set of processors may be referred to as space partitioning, while partitioning across a set of time stages may be referred to as time partitioning.
Affine partitioning uses affine transformations to partition the instances. Generally speaking, an affine transformation comprises a linear transformation followed by a translation.
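By way of illustration only, the following Python sketch shows an affine transformation applied to a loop index vector as a linear transformation followed by a translation. The function name `affine_map`, the three example matrices, and the loop bound of 10 implied by the reversal offset are assumptions introduced for this example and do not appear in the embodiments described herein.

```python
def affine_map(matrix, offset, index):
    """Map an iteration vector: index -> matrix * index + offset."""
    return tuple(
        sum(a * x for a, x in zip(row, index)) + c
        for row, c in zip(matrix, offset)
    )

# Hypothetical transformations of a 2-deep loop nest with index vector (i, j).
INTERCHANGE = ([[0, 1], [1, 0]], [0, 0])  # (i, j) -> (j, i)
REVERSAL = ([[-1, 0], [0, 1]], [9, 0])    # (i, j) -> (9 - i, j), assuming 0 <= i <= 9
SKEW = ([[1, 0], [1, 1]], [0, 0])         # (i, j) -> (i, i + j)

if __name__ == "__main__":
    for name, (matrix, offset) in [
        ("interchange", INTERCHANGE),
        ("reversal", REVERSAL),
        ("skew", SKEW),
    ]:
        print(name, affine_map(matrix, offset, (2, 5)))
```

Only the reversal uses a non-zero translation in this sketch; interchange and skewing are purely linear.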
It should be noted that although four partitions are shown in
In some embodiments, compiler 102 includes an OpenMP API (Application Program Interface) 114. OpenMP is a specification for a set of compiler directives, library routines, and environment variables that can be used to specify shared memory parallelism in Fortran and C/C++ programs. Compiler 102 (or a front-end for compiler 102) may produce code including directives and/or function calls to the OpenMP API. The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran on numerous hardware and software architectures, including Unix, Linux, and Microsoft Windows based platforms. Further details may be found in “OpenMP Application Program Interface” version 2.5 published May 2005 by the OpenMP Architecture Review Board.
Further details on the operation of the system described above are provided below with reference to
The outermost loops of the code in
where φ1 is the linear transformation for statement S1 and φ2 is the linear transformation for statement S2. After the transformation, statement instances with the same affine mapping results are grouped into the same loop iteration of the outermost loop P of the example code in
It should be noted that while the code shown in
In some embodiments, the method begins by performing locality optimization for sequential code (block 302).
Next, the system receives a code stream portion of the code being compiled (block 304). The code stream will typically comprise one or more nested loops, with one or more statements within the various body portions of the nested loops.
The system determines an affine partitioning result set (block 306). The affine partitioning will typically be a space partitioning designed to parallelize the code stream across multiple processing units. The result set may comprise a 1-order affine partitioning result set in the case where all of the processing units form a 1-dimensional processor space (e.g., the processing units share a memory bus). Higher order affine partitioning result sets may be generated when different groups of processing units share different memory interconnects. It is desirable that the affine partitions P be in the form of:
$$P = A_{i1} I_{i1} + A_{i2} I_{i2} + \cdots + A_{ik} I_{ik} \tag{2}$$
where, for the i-th statement, I is the induction variable and A is the affine transformation applied to the induction variable. In addition, P0 will be used to refer to the minimal value of P and PL will be used to refer to the maximal value of P.
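As a minimal sketch, and not as a limitation, the value of P for a statement instance and the extreme values P0 and PL may be computed as follows. The sketch assumes a rectangular iteration space with inclusive integer bounds; the function names `partition_value` and `partition_range` are hypothetical and are used only for this example.

```python
from itertools import product

def partition_value(coeffs, index):
    """P = A_i1*I_i1 + ... + A_ik*I_ik for one statement instance."""
    return sum(a * i for a, i in zip(coeffs, index))

def partition_range(coeffs, bounds):
    """Return (P0, PL), the minimal and maximal P over the iteration space.

    bounds is a list of inclusive (low, high) pairs, one per induction
    variable. A rectangular iteration space is an assumption of this sketch.
    """
    values = [
        partition_value(coeffs, index)
        for index in product(*(range(lo, hi + 1) for lo, hi in bounds))
    ]
    return min(values), max(values)

if __name__ == "__main__":
    # Hypothetical partition P = i - j over 0 <= i, j <= 3.
    print(partition_range((1, -1), [(0, 3), (0, 3)]))
```

The exhaustive enumeration is chosen for clarity; for a linear P over a rectangular space the extremes could equally be read off the corner points.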
The system then divides the range of P (i.e. values from P0 to PL) into L portions that will be associated with L code segments:
where L is the number of processing units in the target architecture (block 308). It is desirable that the portions be divided such that each code segment contains a similar amount of code.
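One possible way to divide the range of P into L portions is sketched below. Because the dividing formula is not reproduced in this text, the sketch assumes equal-width, half-open integer portions and an upper boundary one past the maximal value of P so that the maximal value is included; both choices, and the name `split_range`, are assumptions of the example.

```python
def split_range(p_min, p_max, num_units):
    """Return num_units + 1 integer boundaries covering [p_min, p_max].

    Portion t holds the values P with boundaries[t] <= P < boundaries[t + 1].
    Equal-width splitting is an assumption of this sketch.
    """
    span = p_max - p_min + 1
    return [p_min + (span * t) // num_units for t in range(num_units + 1)]

if __name__ == "__main__":
    # Hypothetical range P0 = -3, PL = 3 on four processing units.
    print(split_range(-3, 3, 4))
```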
The system then processes the L code segments. For the t-th segment, a copy of the original code stream is placed into the code segment (block 310). In addition, a conditional is determined for each statement (block 312). The conditional is determined by taking the condition $P_t \le P < P_{t+1}$ for each statement in the code and replacing P with the corresponding linear formula of the induction variables. For example, assuming $A_{ik} > 0$, the condition $P_t \le P < P_{t+1}$ becomes
$$\frac{P_t - A_{i1} I_{i1} - A_{i2} I_{i2} - \cdots - A_{i,k-1} I_{i,k-1}}{A_{ik}} \le I_{ik} < \frac{P_{t+1} - A_{i1} I_{i1} - A_{i2} I_{i2} - \cdots - A_{i,k-1} I_{i,k-1}}{A_{ik}} \tag{4}$$
In some embodiments, the conditional is then inserted into the code segment to provide for conditional execution of the affected statement or statements.
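The following sketch illustrates how the conditional for one code segment may be turned into bounds on the innermost induction variable. It assumes integer induction variables, a positive coefficient on the innermost variable, and a hypothetical n-by-n loop nest containing a single statement; the names `ceil_div`, `inner_bounds`, and `run_segment` are introduced only for this example.

```python
def ceil_div(x, a):
    """Ceiling of x / a for integers, assuming a > 0."""
    return -(-x // a)

def inner_bounds(coeffs, outer, p_lo, p_hi):
    """Half-open range [lo, hi) of the innermost index with p_lo <= P < p_hi.

    coeffs holds the coefficients A_i1..A_ik and outer holds the values of
    the first k - 1 induction variables. Assumes coeffs[-1] > 0.
    """
    rest = sum(a * i for a, i in zip(coeffs[:-1], outer))
    a_k = coeffs[-1]
    return ceil_div(p_lo - rest, a_k), ceil_div(p_hi - rest, a_k)

def run_segment(coeffs, n, p_lo, p_hi):
    """Run one segment: a copy of a hypothetical n x n nest, guarded by the
    segment's conditional. Returns the (i, j) instances it executes."""
    executed = []
    for i in range(n):
        lo, hi = inner_bounds(coeffs, (i,), p_lo, p_hi)
        # Intersect the conditional with the original loop bounds 0 <= j < n.
        for j in range(max(lo, 0), min(hi, n)):
            executed.append((i, j))
    return executed

if __name__ == "__main__":
    boundaries = [0, 1, 3, 5, 7]  # hypothetical portions of P = i + j
    for t in range(len(boundaries) - 1):
        print(t, run_segment((1, 1), 4, boundaries[t], boundaries[t + 1]))
```

In this sketch the conditional is folded into the inner loop bounds rather than tested inside the loop body; either form executes the same statement instances, and across all segments each instance is executed exactly once.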
It should be noted that it is desirable to produce ranges that result in spreading the amount of code relatively evenly across the partitions while preserving memory access independence. In the example above, a diagonal partitioning of the space is assumed for the ranges in formula (5). Those of skill in the art will appreciate that other partitions are possible and within the scope of the embodiments of the invention.
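One hypothetical heuristic for choosing ranges that spread statement instances relatively evenly is sketched below. It counts the instances that map to each value of P and starts a new portion whenever the running count reaches the next equal share. The heuristic, the rectangular iteration space, and the name `balanced_boundaries` are assumptions of this example and are not the ranges of the embodiments; the sketch also does not check memory access independence.

```python
from collections import Counter
from itertools import product

def balanced_boundaries(coeffs, bounds, num_units):
    """Pick portion boundaries so each portion holds a similar number of
    statement instances. May return fewer portions than num_units when P
    takes only a few distinct values."""
    counts = Counter(
        sum(a * i for a, i in zip(coeffs, index))
        for index in product(*(range(lo, hi + 1) for lo, hi in bounds))
    )
    values = sorted(counts)
    total = sum(counts.values())
    boundaries = [values[0]]
    running = 0
    for p in values:
        # Start a new portion once the running count reaches the next share.
        target = total * len(boundaries) / num_units
        if running >= target and len(boundaries) < num_units:
            boundaries.append(p)
        running += counts[p]
    boundaries.append(values[-1] + 1)  # exclusive upper boundary
    return boundaries

if __name__ == "__main__":
    # Diagonal partition P = i + j over a hypothetical 4 x 4 iteration space.
    print(balanced_boundaries((1, 1), [(0, 3), (0, 3)], 4))
```

For the diagonal partition of a square iteration space, the middle diagonals hold more instances than the corners, so boundaries chosen this way are narrower near the middle of the range than equal-width boundaries would be.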
As illustrated, the code segment in
Returning to
In some embodiments of the invention, the system may create OpenMP sections (block 316). An OpenMP section may be created for each processing unit in the target architecture. Then each code segment generated as described above is placed in the corresponding OpenMP section.
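The arrangement of one section per processing unit can be illustrated with the following Python stand-in. It does not use OpenMP; it substitutes a thread pool from the Python standard library, with one submitted task playing the role of each section. The loop nest, the partition P = i + j, the boundary values, and the names `make_segment` and `run_sections` are assumptions of the example.

```python
from concurrent.futures import ThreadPoolExecutor

def make_segment(p_lo, p_hi, n):
    """Build one code segment: a copy of a hypothetical n x n loop nest whose
    statement is guarded by the conditional p_lo <= i + j < p_hi."""
    def segment():
        return [
            (i, j)
            for i in range(n)
            for j in range(n)
            if p_lo <= i + j < p_hi
        ]
    return segment

def run_sections(segments):
    """Run each segment as its own task, one task per processing unit.
    This is a thread-pool analogy to sections, not OpenMP itself."""
    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        futures = [pool.submit(segment) for segment in segments]
        return [future.result() for future in futures]

if __name__ == "__main__":
    boundaries = [0, 1, 3, 5, 7]  # hypothetical portions for four units
    segments = [
        make_segment(boundaries[t], boundaries[t + 1], 4)
        for t in range(len(boundaries) - 1)
    ]
    for t, instances in enumerate(run_sections(segments)):
        print(t, len(instances))
```

Threads in the standard CPython interpreter typically do not execute Python code on several cores at once, so this sketch shows only the structure of placing each code segment in its own section, not the resulting speedup.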
Thus as can be seen from the above, instead of merging iterations of the transformed code (as shown in
Systems and methods for affine partitioning code streams for parallelizing code in architectures having multiple processing units have been described. This application is intended to cover any adaptations or variations of the embodiments of the invention. It is to be understood that the above description is intended to be illustrative, and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. Therefore, it is manifestly intended that the inventive subject matter be limited only by the following claims and equivalents thereof.
Number | Date | Country | |
---|---|---|---|
20070079303 A1 | Apr 2007 | US |