1. Technical Field
The present disclosure relates generally to information processing systems and, more specifically, to determining a loop unrolling factor for software loops.
2. Background Art
Software pipelining (SWP) is a compilation technique for scheduling non-dependent instructions from different logical iterations of a program loop to execute concurrently. Overlapping instructions from different independent logical iterations of the loop increases the amount of instruction level parallelism (ILP) in the program code. Code having high levels of ILP uses the execution resources available on modern, superscalar processors more effectively.
A loop is software-pipelined by organizing the instructions of the loop body into stages of one or more instructions each. These stages form a software-pipeline having a pipeline depth equal to the number of stages (the “stage count” or “SC”) of the loop body. The instructions for a given loop iteration enter the software-pipeline stage by stage, on successive initiation intervals (II), and new loop iterations begin on successive initiation intervals until all iterations of the loop have been started. Each loop iteration is thus processed in stages through the software-pipeline in much the same way that an instruction is processed in stages through a processor pipeline. When the software-pipeline is full, stages from SC sequential loop iterations are in process concurrently, and one loop iteration completes every initiation interval. Various methods for implementing software-pipelined loops are discussed, for example, in B. R. Rau, M. S. Schlansker, and P. P. Tirumalai, “Code Generation Schema for Modulo Scheduled Loops,” IEEE MICRO Conference, 1992 (Portland, Oreg.), and in B. R. Rau, M. Lee, P. P. Tirumalai, and M. S. Schlansker, “Register Allocation for Software-pipelined Loops,” Proceedings of the SIGPLAN '92 Conference on Programming Language Design and Implementation (San Francisco, 1992).
The present invention may be understood with reference to the following drawings in which like elements are indicated by like numbers. These drawings are not intended to be limiting but are instead provided to illustrate selected embodiments of a method and system for determining a loop unrolling factor.
Described herein are selected embodiments of a method and system to determine a loop unrolling factor for software loops. While the embodiments are described in the context of software-pipelined loops, the determination of a loop unrolling factor may also be practiced for systems that do not perform software pipelining. Embodiments of the described method may be performed for resource-bound software loops, even if software pipelining is not performed, in order to determine a loop unrolling factor for the loop. In the following description, numerous specific details such as pseudocode instruction sequences, control flow ordering, execution resources, and the like have been set forth to provide a more thorough understanding of the present invention. It will be appreciated, however, by one skilled in the art that the embodiments may be practiced without such specific details. Additionally, some well-known structures, circuits, and the like have not been shown in detail to avoid unnecessarily obscuring the embodiments discussed herein.
Embodiments of the present invention are illustrated using instructions from the IA64™ Instruction Set Architecture (ISA) of Intel Corporation, but these embodiments may be implemented in other ISAs as well. The IA64 ISA is described in detail in the Intel® IA64 Architecture Software Developer's Guide, Volumes 1-4, which is published by Intel® Corporation of Santa Clara, Calif.
Disclosed herein are embodiments for a method and apparatus for determining an unrolling factor for software loops. For at least one embodiment, the method is performed for resource-bound loops (discussed below). For some embodiments, the method may be performed for resource-bound loops that are software pipelined. Such embodiments of the method may be better understood with reference to standard software pipelining techniques, which are discussed immediately below.
A pseudo code representation of a counted Do loop is:
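The original pseudocode figure is not reproduced here. The following is a hypothetical Python rendering consistent with the description that follows; the instruction names a, b, and e and the initialize/update/test operations are taken from the text, and the concrete values (LMAX = 3, unit increments) are illustrative assumptions only.

```python
# Hypothetical sketch of the counted DO loop described below.
# LMAX and the bookkeeping trace are illustrative assumptions.
LMAX = 3
trace = []

def a(L): trace.append(("a", L))   # loop-body instruction "a"
def b(L): trace.append(("b", L))   # loop-body instruction "b"
def e(): trace.append("e")         # instruction reached on loop exit

L = 1                              # initialize(L)
while True:                        # DO( )
    a(L)
    b(L)
    L = L + 1                      # update(L)
    if L == LMAX + 1:              # test(L), e.g. L == LMAX exceeded
        break                      # ENDDO: exit when test(L) is true
e()                                # control passes to instruction "e"
```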
In this example, “DO( )” is the loop instruction, instructions “a” and “b” form the loop body, and “ENDDO” terminates the loop. The loop variable, L, tracks the number of iterations of Loop(I); initialize(L) represents its initial value, and update(L) indicates how L is modified on each iteration of the loop. Test(L) is a logical function of L, e.g., L==LMAX, that terminates Loop(I) when it is true, passing control to instruction “e”. Other types of loops, e.g., “WHILE” and “FOR” loops, follow a similar pattern, although they may not explicitly specify an initial value, and the loop variable may be updated by instructions in the loop body.
During a prolog 160, the software pipeline 100 is filled. Thus, at cycle 140(1), instruction A is executed using the operands appropriate for L=1, e.g., A(1). At cycle 140(2), instructions A and B are executed using operands appropriate for L=2 and L=1, respectively, e.g., A(2), B(1). At 140(3), A(3), B(2), and C(1) are executed. During prolog 160, resources associated with instructions B and/or C are not utilized. For example, if A, B and C are floating point instructions and Loop (I) is executed in a processor having four floating point units (FPUs), three FPUs are idle at cycle 140(1), and two are idle at cycle 140(2).
At cycle 140(3), the software pipeline 100 is filled, and instructions A, B and C are evaluated concurrently for different values of L through cycle 140(N). For cycles 140(3) through 140(N), the slots of software pipeline 100 are full. These cycles are referred to as the kernel phase 164 of the software pipeline 100. At cycle 140(N), instruction A has been evaluated for all N iterations of Loop(I).
During kernel phase 164 of the pipeline 100, resources of the processor may remain idle. For example, if A, B and C are FP instructions and Loop(I) is executed on a processor having four FPUs, one FPU remains idle even during the kernel phase 164 cycles of the software pipeline 100. As such, we say that Loop(I) has a “fractional II”, because only a fraction of the execution resources are utilized during each execution cycle of the loop.
At cycles 140(N+1) and 140(N+2), software pipeline 100 empties as instructions B and C complete their N iterations of Loop(I). These cycles form an epilog 170 of software pipeline 100, during which resources associated first with A and then with B are idled.
The initiation interval (II) for a software loop represents the number of processor clock cycles (“cycles”) between the start of successive iterations of the software loop. The minimum II for a loop is the larger of a resource II (RSII) and a recurrence II (RCII) for the loop. The RSII is determined by the availability of execution units for the different instructions of the loop. For example, a loop that includes three integer instructions has a RSII of at least two cycles on a processor that provides only two integer execution units. The RCII reflects cross-iteration or loop-carried dependencies among the instructions of the loop and their execution latencies. If the three integer instructions of the above-example have one cycle latencies and depend on each other as follows, inst1→inst2→inst3→inst1, the RCII is at least three cycles.
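The two bounds described above can be sketched as a short computation. This is a minimal illustration of the worked example in the text (three integer instructions, two integer units, and a one-cycle dependence cycle inst1→inst2→inst3→inst1); the helper names are ours, not the document's.

```python
import math

def resource_ii(instr_count, units_available):
    # RSII: cycles needed just to issue instr_count instructions of one
    # type when only units_available execution units handle that type
    return math.ceil(instr_count / units_available)

# Three integer instructions on two integer units -> RSII of 2 cycles
rsii = resource_ii(3, 2)
# Dependence cycle inst1 -> inst2 -> inst3 -> inst1, 1-cycle latencies
rcii = 3
# The minimum II for the loop is the larger of the two bounds
min_ii = max(rsii, rcii)
print(rsii, min_ii)   # prints: 2 3
```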
Software loops are considered to be “resource-bound” if their RSII≧RCII. For example, a loop having twelve non-dependent ALU instructions is resource-bound on a processor that can only execute six ALU instructions per cycle. Even with software pipelining, it takes two cycles to execute each iteration of the loop. In this example, resource limitations (the number of available execution units) determine the number of cycles needed to perform each iteration of the loop. As it happens, all six of the ALU units are utilized during each of the two cycles performed for each iteration of the software-pipelined loop. Thus, this example resource-bound loop does not have a fractional II.
However, some resource-bound loops do not fully utilize available processor resources during a given cycle, even after software pipelining. That is, some available execution units of the processor may remain unutilized during a cycle that executes instructions of a software-pipelined loop iteration. For example, consider a processor that is capable of processing two load instructions and two store instructions during a given cycle. For a loop that has one load instruction and one store instruction in its loop body, execution of a loop iteration, even after software pipelining, leaves one load unit and one store unit idle. As is indicated above, we refer to a software-pipelined loop that leaves execution resources idle during an execution cycle of the loop as having a “fractional II.” A sample loop, set forth in Example Loop 1, below, illustrates such a loop with pseudocode:
Example Loop 1:
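The original pseudocode for Example Loop 1 is not reproduced here. A hypothetical stand-in consistent with the surrounding description (one load, one floating-point add, and one store per iteration) is sketched below; the array and variable names are illustrative assumptions.

```python
# Hypothetical stand-in for Example Loop 1: each iteration performs one
# load (a[i]), one floating-point add (+ c), and one store (into b[i]).
def example_loop_1(a, b, c, n):
    for i in range(n):
        b[i] = a[i] + c   # load, fp add, store
    return b
```

Per the text, on a machine with two load, two floating-point, and two store units, each iteration of this loop leaves half of each resource idle.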
Referring back to a previous example, Loop (I) also illustrates a loop having a fractional II. Referring to
In such cases, it may be helpful to “unroll” the loop before it is software-pipelined in order to more fully utilize execution resources. Such unrolling may help to optimally use the width of the processor. Consider again Example Loop 1, described above, which has one store instruction, one floating point add instruction, and one load instruction in its loop body. Without unrolling, the II for Example Loop 1 is 1 cycle per iteration. Assuming a processor that is able to process two load instructions, two floating point instructions, and two store instructions per cycle, each iteration of the loop utilizes only ½ of the processor's load, floating point, and store execution resources. Accordingly, the II for Example Loop 1 is fractional, and unrolling may improve processor resource utilization.
Example Loop 2, below, illustrates the loop from Example Loop 1, after it has been unrolled by a factor of two. The unrolled loop now has two load instructions, two floating point instructions, and two store instructions. After unrolling and pipelining, two iterations of the loop may be executed per cycle and each cycle fully utilizes the two load, two floating point, and two store execution resources of our hypothetical processor. As such, the unrolled loop may more fully utilize the processor resources during each cycle. (One of skill in the art will realize that multiple store instructions of Example Loop 2 are shown in order to illustrate the reduced II for an unrolled loop; code optimizations that might otherwise be utilized have been eliminated for purposes of illustration).
Example Loop 2:
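The original pseudocode for Example Loop 2 is likewise not reproduced here. A hypothetical sketch consistent with the description (the body of Example Loop 1 repeated twice: two loads, two floating-point adds, two stores per iteration) follows; the even-trip-count assumption and the names are ours.

```python
# Hypothetical stand-in for Example Loop 2: Example Loop 1 unrolled by a
# factor of two. n is assumed even for this sketch; a remainder loop
# that would normally handle odd trip counts is omitted for clarity.
def example_loop_2(a, b, c, n):
    for i in range(0, n, 2):
        b[i]     = a[i]     + c   # load, fp add, store (iteration i)
        b[i + 1] = a[i + 1] + c   # load, fp add, store (iteration i+1)
    return b
```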
By unrolling Example Loop 1 by a factor of two, we achieve an unrolled loop (Example Loop 2) for which the II is no longer fractional. After unrolling, the loop that originally had only one load instruction, one floating point instruction, and one store instruction now has two load instructions, two floating point instructions, and two store instructions in its loop body. Both of the load execution resources as well as both of the floating point execution resources and both of the store execution resources can now be utilized during each execution cycle for Example Loop 2. Accordingly, after unrolling Example Loop 1 by a factor of 2, the II is still one cycle, but now two iterations of the original loop may be performed during each cycle. The amount of work performed during each execution cycle for the loop has thus been improved (by 100%).
Devising a formula to determine an unrolling factor for software loops, whether they are to be software-pipelined or not, poses an interesting challenge. Traditionally, the degree of loop unrolling has been determined using an ad hoc method or has been based on heuristics. A simple formula for determining an efficient unrolling factor for software-pipelined loops would be welcome. The methods and system disclosed herein address these and other issues associated with unrolling of software loops.
For the embodiments discussed herein, it is assumed that the target processor supports at least two general instruction types. Accordingly, the formula illustrated in
Processing then proceeds to block 207. At block 207, the loop unrolling factor, which was calculated at block 204, is applied to unroll the original loop. Processing then proceeds to optional block 214. At block 214, the unrolled loop is software-pipelined. The optional nature of block 214 is denoted with broken lines in
In the flowchart of
For at least one embodiment, the method 300 illustrated in
One constraint is that the number of instructions of a particular instruction type issued in II cycles cannot exceed the maximum number of instructions of that type that can be executed by the particular processor during II cycles. For instance, the number of load instructions issued during U iterations of the loop is constrained to a number of such instructions that can be executed during II cycles. Accordingly, U*L should be less than or equal to Lmax*II (that is, U*L≦Lmax*II). Similarly, this instruction-type issue constraint is applicable to the other instruction types supported by the processor: U*S≦Smax*II; U*M≦Mmax*II; U*A≦Amax*II; U*F≦Fmax*II.
For example, consider an unrolled loop having an II of 2 cycles. Assume that the processor can execute four load instructions per cycle (Lmax=4). In such case, a maximum of eight load instructions may be issued for each iteration of the unrolled loop (II*Lmax=8). Accordingly, U*L for the unrolled loop should not exceed the eight-instruction limitation.
Continuing with the above example, consider a loop having three load instructions in the original loop body before unrolling. If the original loop is unrolled by a factor of two (U=2), then the unrolled loop contains U*L load instructions: 2*3=6 load instructions. Since six is less than eight, the constraint is satisfied. Stated another way, the following constraint is satisfied: U*L≦Lmax*II.
As another example, consider an unrolled loop having an II of 4 cycles. Assume again that the processor can execute four load instructions per cycle (Lmax=4). In such case, a maximum of sixteen load instructions may be issued for each iteration of the unrolled loop (II*Lmax=16). Accordingly, U*L for the unrolled loop should not exceed the sixteen-instruction limitation.
Continuing with the above example, consider a loop having six load instructions in the original loop body (L=6). If the loop is unrolled by a factor of three (U=3), then the unrolled loop body includes 18 load instructions (U*L=3*6=18). For this example, then, the instruction-type issue constraint of U*L≦Lmax*II is not satisfied, because U*L (18) is greater than II*Lmax (16).
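The instruction-type issue constraint and the two worked examples above can be sketched as follows; the function name is ours, not the document's.

```python
def issue_constraint_ok(U, count, per_cycle_max, II):
    # U*count instructions of one type must fit within II cycles:
    # U * count <= per_cycle_max * II
    return U * count <= per_cycle_max * II

# First worked example: II=2, Lmax=4, L=3, U=2 -> 6 <= 8, satisfied
print(issue_constraint_ok(2, 3, 4, 2))   # True
# Second worked example: II=4, Lmax=4, L=6, U=3 -> 18 <= 16, violated
print(issue_constraint_ok(3, 6, 4, 4))   # False
```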
For a processor having the instruction types discussed above, the instruction-type issue constraint can be generalized to all instruction types. That is, for a processor having the five instruction types discussed above, five constraints should be satisfied when an unrolling factor is being determined:
U*L≦Lmax*II
U*S≦Smax*II
U*M≦Mmax*II
U*A≦Amax*II
U*F≦Fmax*II
In addition, for a processor that includes additional instructions types X . . . Y, the following additional instruction-type issue constraints should also be satisfied:
U*X≦Xmax*II
U*Y≦Ymax*II
Each of these constraints can be simplified as follows:
U*L≦Lmax*II=>U/II≦Lmax/L
U*S≦Smax*II=>U/II≦Smax/S
U*M≦Mmax*II=>U/II≦Mmax/M
U*A≦Amax*II=>U/II≦Amax/A
U*F≦Fmax*II=>U/II≦Fmax/F
U*X≦Xmax*II=>U/II≦Xmax/X
U*Y≦Ymax*II=>U/II≦Ymax/Y
Another constraint reflected in the formula utilized at block 304 is that, for at least one embodiment, the processing at block 304 further computes the unrolling factor such that the number of instructions per cycle for the unrolled loop does not exceed the processor's issue width, W. The issue width is the maximum number of instructions that can be issued in a single cycle for a given processor.
Consider, for example, a processor that can issue six instructions per cycle (that is, W=6). Assume, for purposes of example, that such processor includes six ALU execution units (Amax=6) and two floating point execution units (Fmax=2). In theory, then, without consideration of W, the processor could issue six ALU instructions and two floating point instructions per cycle. However, if W=6, then the processor can only execute six, rather than eight, instructions per cycle.
Consider, for example, a loop that includes eight instructions in its loop body—six ALU instructions and two floating point instructions—on a processor for which Amax=6 and Fmax=2. Although, individually, the number of ALU instructions in the loop body is not more than Amax and the number of floating point instructions in the loop body is not more than Fmax, all instructions of the loop body cannot be executed in a single cycle because the number of instructions in the loop body exceeds W.
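The issue-width limitation described above can be sketched as a small computation; the function name is an illustrative assumption.

```python
import math

def cycles_for_issue_width(total_instrs, W):
    # At most W instructions issue per cycle, so a loop body of
    # total_instrs instructions needs at least this many cycles.
    return math.ceil(total_instrs / W)

# Six ALU + two FP instructions on a W=6 processor: even though each
# per-type limit (Amax=6, Fmax=2) is respected, the eight instructions
# cannot issue in one cycle.
print(cycles_for_issue_width(6 + 2, 6))   # 2
```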
The set of parameter values illustrated at block 304 are based on the constraints discussed above. The first five parameters of the “Min” function illustrated at block 304 are based on the five simplified instruction-type issue constraints discussed above:
U/II≦Lmax/L
U/II≦Smax/S
U/II≦Mmax/M
U/II≦Amax/A
U/II≦Fmax/F
The final parameter of the “Min” function illustrated at block 304 takes the target processor's issue width into account. Over an initiation interval of II cycles, the processor can issue at most II*W instructions. II*W thus reflects the maximum number of instructions (of all instruction types) that can be executed by the processor during II cycles, and the total number of instructions in an unrolled loop iteration should be less than or equal to II*W. This constraint is referred to herein as the issue width constraint.
For at least one embodiment, the total number of instructions in an unrolled loop body, where N represents the number of instructions in the original loop body, is represented by U*N. However, for most embodiments, an unrolled loop includes not only the instructions of the loop body but also includes at least one branch instruction. The branch instruction at the end of the loop body determines whether control should remain in the loop (branch back to the beginning of the loop body) or should branch out of the loop. In an unrolled loop, this branch instruction is not repeated U times, but remains as a single instruction at the end of the loop body. Accordingly, the number of instructions in an unrolled loop is represented, for at least one embodiment, as U*N+1.
Assuming that those N instructions are of instruction types supported by the processor, N can be further broken down into the count for each type of instruction. Assuming that the processor supports two major classes of instructions, such as ALU instructions and FP instructions as discussed above, N=A+F. Accordingly, for at least one embodiment the number of instructions in an unrolled loop may be represented as U*(A+F)+1.
The issue width constraint, discussed above, states that the total number of instructions per cycle for the unrolled loop should be constrained by the total number of instructions that can be executed in II cycles. The issue width constraint may be expressed as: U*(A+F)+1≦W*II. Dividing both sides by II*(A+F), such expression may be simplified to: U/II≦W/(A+F)−1/(II*(A+F)). Such expression may be utilized as the sixth term for the “Min” function shown at block 304 of
Accordingly, block 304 illustrates that U/II for a subject loop may be determined as: U/II=Min (Lmax/L, Smax/S, Mmax/M, Amax/A, Fmax/F, . . . , W/(A+F)−1/(II*(A+F))). The determination of U/II may be simplified if one considers that the goal for at least one embodiment is to unroll as much as possible while conforming to the six constraints discussed above. Accordingly, it is desirable to have a large U value. As the value of U goes up, the value of II also increases.
If we let II tend to infinity, then the term “1/(II*(A+F))” vanishes: as II→∞, 1/(II*(A+F))→0. Accordingly, the determination of U/II becomes: U/II=Min (Lmax/L, Smax/S, Mmax/M, Amax/A, Fmax/F, . . . , W/(A+F)). Such determination is based on an assumption, which is discussed immediately below.
The embodiment illustrated in
Also, one should note that only those terms of the Min function that are applicable to the loop of interest need be evaluated. For example, if the loop of interest does not include any store instructions, then the parameter for store instructions, Smax/0, is not defined. Accordingly, parameters are only considered at block 304 for those instruction types that are present in the loop of interest.
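The block-304 formula, including the rule that terms for absent instruction types are skipped, can be sketched as follows. This is an illustrative implementation under the text's assumptions: the 1/(II*(A+F)) term is dropped per the large-II simplification, a single instruction may count toward several types (e.g., a load counts as L, M, and A), and the loop's total instruction count (A+F in the two-class discussion) is passed separately. The function and parameter names are ours.

```python
from fractions import Fraction

def unroll_ratio(type_counts, type_maxima, total, W):
    # type_counts: instructions of each type in the loop body; a single
    # instruction may count in several types (a load is L, M, and A).
    # total: number of instructions in the body (A + F in the text).
    # Terms for instruction types absent from the loop are skipped.
    terms = [Fraction(type_maxima[t], c)
             for t, c in type_counts.items() if c > 0]
    terms.append(Fraction(W, total))   # issue-width term, W/(A+F)
    return min(terms)

# Example Loop 3, floating-point case: one load (L, M, A), one FP add (F)
r_fp = unroll_ratio({"L": 1, "S": 0, "M": 1, "A": 1, "F": 1},
                    {"L": 4, "S": 2, "M": 4, "A": 6, "F": 2},
                    total=2, W=6)
print(r_fp)    # 2  -> Min(4/1, 4/1, 6/1, 2/1, 6/2)

# Example Loop 3, integer case: one integer load (L, M, A), one ALU add (A)
r_int = unroll_ratio({"L": 1, "S": 0, "M": 1, "A": 2, "F": 0},
                     {"L": 2, "S": 2, "M": 4, "A": 6, "F": 2},
                     total=2, W=6)
print(r_int)   # 2  -> Min(2/1, 4/1, 6/2, 6/2)
```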
At block 306, the loop is unrolled by a whole number P, where P is the whole-number value for U/II that was calculated at block 304. If software-pipelined, such loop will have an II of 1 cycle. In other words, each iteration of the original loop executes in 1/P cycles. From block 306, processing proceeds to block 314.
As an example to illustrate the processing of blocks 304, 305 and 306 in further detail, consider the following sample pseudocode, to be performed on a processor having an issue width of six instructions that can execute four floating-point load instructions per cycle and two floating-point arithmetic instructions per cycle:
Example Loop 3:
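The original pseudocode for Example Loop 3 is not reproduced here. A hypothetical reconstruction consistent with the description (a “for” loop whose body translates to one load of sum[i] and one floating-point add, plus the loop-closing branch) is:

```python
# Hypothetical stand-in for Example Loop 3: accumulate the elements of
# the array `sum_arr`; each iteration performs one load (sum_arr[i]) and
# one floating-point add, and the range check is the loop-closing branch.
def example_loop_3(sum_arr, n):
    s = 0.0
    for i in range(n):
        s += sum_arr[i]
    return s
```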
Assuming that sum is an array of floating-point values, the “for” loop set forth in Example Loop 3 could translate into one load instruction and one floating point addition instruction, in addition to the loop-closing branch. As is stated above, Lmax for our sample processor is four, Fmax is two and W is six. Also assume that the processor can execute four memory instructions per cycle (Mmax=4), six ALU instructions per cycle (Amax=6), and two store instructions (Smax=2) per cycle.
At block 304, U/II is calculated for Example Loop 3 as: U/II=Min (Lmax/L, Smax/S, Mmax/M, Amax/A, Fmax/F, W/(A+F)). The Smax/S parameter is not applicable because Example Loop 3 does not include any store instructions. It is assumed that the load instruction is an ALU instruction (A), as well as a memory instruction (M) and a load instruction (L). The expression evaluates to: U/II=Min (4/1, 4/1, 6/1, 2/1, 6/(1+1))=2/1. Thus, the unrolling factor calculated at block 304 for Example Loop 3 is 2. The throughput for the original loop, given this unrolling factor, is 0.5 cycles for each iteration of the original loop. That is, each iteration of the original loop is executed in ½ cycles.
As another example, refer again to Example Loop 3 and assume that sum is an array of integer values. The “for” loop set forth in Example Loop 3 could then translate into one integer load instruction and one ALU addition instruction, in addition to the loop-closing branch. Assume that our sample processor can process only two integer load instructions per cycle: Lmax=2. Also assume that W=6, and that the processor can execute four memory instructions per cycle (Mmax=4), two FP instructions per cycle (Fmax=2), and two store instructions (Smax=2) per cycle.
At block 304, U/II is calculated for Example Loop 3 (integer) as: U/II=Min (Lmax/L, Smax/S, Mmax/M, Amax/A, Fmax/F, W/(A+F)). The Smax/S and Fmax/F parameters are not applicable because Example Loop 3 (integer) does not include any store instructions or floating point instructions. Again, it is assumed that the load instruction is an ALU instruction (A), as well as a memory instruction (M) and a load instruction (L). The expression evaluates to: U/II=Min (2/1, 4/1, 6/2, 6/(2+0))=2/1. Thus, the unrolling factor calculated at block 304 for Example Loop 3 (integer) is 2. The throughput for the original loop, given this unrolling factor, is 0.5 cycles for each iteration of the original loop. That is, each iteration of the original loop is executed in ½ cycles.
At block 310, the loop is unrolled by a factor P, where P is the numerator of the value calculated at block 304. For example, if the value U/II calculated at block 304 is a fraction, represented by P/Q, then the loop is unrolled P times. Stated another way, the value U/II=P/Q is calculated at block 304, and the loop is unrolled P times at block 310. As a result, the unrolled loop has an II of Q cycles. That is, each iteration of the original loop executes in Q/P cycles.
As an example to illustrate the processing of blocks 304, 305 and 310 in further detail, consider a loop, referred to herein as Example Loop 4, which includes nine ALU instructions for its loop body. Assume that the ALU instructions of Example Loop 4 are to be performed on a processor having an issue width of six instructions and that can execute six ALU instructions per cycle. This loop will be unrolled by 2 using our method, and the resultant unrolled loop will have an II of 3 cycles, i.e. P=2, and Q=3.
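The fractional result for Example Loop 4 can be verified with exact rational arithmetic; the values below (A=9, Amax=6, W=6) are those stated in the text.

```python
from fractions import Fraction

# Example Loop 4: nine ALU instructions (A = 9), Amax = 6, W = 6.
# Only the ALU and issue-width terms apply:
# U/II = Min(Amax/A, W/A) = Min(6/9, 6/9) = 2/3, so P = 2 and Q = 3:
# unroll twice, and the unrolled loop has an II of 3 cycles.
ratio = min(Fraction(6, 9), Fraction(6, 9))
print(ratio.numerator, ratio.denominator)   # prints: 2 3
```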
From block 310, processing proceeds to block 314. At block 314, the unrolled loop is software pipelined. Again, the optional nature of block 314 is denoted with broken lines in
The foregoing discussion discloses selected embodiments of a formula-based method for determining a loop unrolling factor for a software loop. Such embodiments may be utilized on a processing system such as the processing system 400 illustrated in
Embodiments of the methods disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Software embodiments of the methods may be implemented as computer programs executing on programmable systems comprising at least one processor, a data storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Program code may be applied to input data to perform the functions described herein and generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this disclosure, a processing system includes any system that has a processor, such as, for example; a network processor, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.
The programs may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The programs may also be implemented in assembly or machine language, if desired. In fact, the methods described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
The programs may be stored on a storage medium or device (e.g., hard disk drive, floppy disk drive, read only memory (ROM), CD-ROM device, flash memory device, digital versatile disk (DVD), or other storage device) accessible by a general or special purpose programmable processing system. The instructions, accessible to a processor in a processing system, provide for configuring and operating the processing system when the storage medium or device is read by the processing system to perform the actions described herein. Embodiments of the invention may also be considered to be implemented as a machine-readable storage medium, configured for use with a processing system, where the storage medium so configured causes the processing system to operate in a specific and predefined manner to perform the functions described herein.
An example of one such type of processing system is shown in
Processing system 400 includes a memory 422 and a processor 414. Memory system 422 may store instructions 410 and data 412 for controlling the operation of the processor 414. Memory system 422 is intended as a generalized representation of memory and may include a variety of forms of memory, such as a hard drive, CD-ROM, random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory and related circuitry.
Memory system 422 may store instructions 410 and/or data 412 represented by data signals that may be executed by the processor 414. The instructions 410 may include a compiler 408. For at least one embodiment, a compiler 408 performs methods 200 (
In the preceding description, various embodiments of a method and system for determining a loop unrolling factor for loops are disclosed. For purposes of explanation, specific numbers, examples, systems and configurations were set forth in order to provide a more thorough understanding. However, it is apparent to one skilled in the art that the described embodiments of a system and method may be practiced without the specific details. It will be obvious to those skilled in the art that changes and modifications can be made without departing from the present invention in its broader aspects.
For example, the methods 200 (
While particular embodiments of the present invention have been shown and described, the appended claims are to encompass within their scope all such changes and modifications that fall within the true scope of the present invention.