The present invention relates to a program conversion device, and in particular relates to a program conversion device for a processor which has an instruction set including an instruction that waits for a predetermined response from an outside source when the instruction is executed.
In recent years, the processing speed of processors has improved significantly, whereas the access speed of main memory has improved only slightly by comparison, and the speed gap between the two continues to grow year by year. For this reason, it has long been pointed out that memory access is a bottleneck in the high-speed processing performed by an information processing device.
In order to solve this problem, a cache has been used from the perspective of the storage hierarchy. When the cache is used, data requested by a processor is transferred beforehand (namely, prefetched) from main memory into the high-speed cache. This makes it possible to respond quickly to memory accesses from the processor.
However, when the processor attempts to access data that is not present in the cache, a cache miss will occur. Due to this, it will take time for the data to be transferred from the main memory into the cache.
If a user performs programming without keeping the cache in mind, such cache misses will frequently occur when that program is executed. As a result, the penalties due to the cache misses significantly degrade the performance of the processor. For this reason, a compiler needs to perform optimization in consideration of the cache.
One of the techniques for cache optimization is to insert prefetch instructions. A prefetch instruction is used for having data of a specific memory address previously transferred from main memory into a cache before the memory address is referenced. In the optimization employing the insertion of prefetch instructions, a prefetch instruction is to be inserted into a cycle slightly ahead of the cycle in which the memory address is referenced.
For example, in the case of loop processing shown in
Problems that Invention is to Solve
In code shown in
In other words, one prefetch can cover 32 references, meaning that the remaining 31 prefetches are performed in vain. That is to say, prefetch instructions for the same line end up being issued repeatedly.
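As a purely illustrative sketch (not the code of the figure referenced above), assume a 128-byte cache line and an array of 4-byte int elements, so that one line holds 32 elements, and let dpref() denote the prefetch instruction used elsewhere in this description. A single loop of the following form issues a dpref in every iteration, even though each prefetched line serves 32 consecutive references:

    for (i = 0; i < 1024; i++) {
        dpref(&A[i + 32]);   /* issued every iteration; 31 out of every 32 issues target a line that is already being fetched */
        sum += A[i];
    }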
On some processors, when a data transfer according to one dpref instruction is still in progress and the next dpref instruction is to be executed, the next dpref instruction is issued before the data transfer from the main memory to the cache according to the previous dpref instruction has finished. As a result, an interlock occurs, even though the dpref instruction was inserted to avoid such an interlock in the first place.
On that account, when one iteration of a loop is short and the interval between two dpref instructions is short as described above, the time (latency) taken for the data to be transferred from the main memory to the cache according to the dpref instruction becomes conspicuous, further deteriorating the performance.
Also, aside from the dpref instruction, any instruction that causes some kind of response wait after it is issued, such as a memory access instruction, has a possibility of causing an interlock.
The present invention was conceived in view of the problem described above, and has an object of providing a program conversion device and a program conversion method that improve the processing speed of a program execution without needlessly issuing instructions that have a possibility of causing an interlock.
Moreover, the present invention has an object of providing a program conversion device and a program conversion method that improve the processing speed of a program execution without needlessly issuing instructions that cause a response waiting of some kind after the instruction is issued.
Furthermore, the present invention has an object of providing a program conversion device and a program conversion method that cause no interlocks during the program execution.
Means to Solve the Problems
The stated object can be achieved by a program conversion device of the present invention for a processor which has an instruction set including an instruction that waits for a predetermined response from an outside source when the instruction is executed, the program conversion device being composed of: a loop structure transforming unit operable to perform double looping transformation so as to transform a structure of a loop, which is included in an input program and whose iteration count is x, into a nested structure where a loop whose iteration count is y is an inner loop and a loop whose iteration count is x/y is an outer loop; and an instruction placing unit operable to convert the input program into an output program including the instruction by placing the instruction in a position outside the inner loop.
With this, as shown in
More specifically, the present invention allows a loop to be transformed into a double loop so that an instruction having a possibility of causing an interlock is executed outside an inner loop. Consequently, the processing speed of the program execution can be improved without needless issues of the instruction.
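For instance, under the same illustrative assumptions of a 128-byte line and 4-byte elements (so y = 32), a loop whose iteration count is x = 1024 might be transformed into the following sketch, in which the prefetch is issued only once per inner loop, that is, once per cache line:

    for (j = 0; j < 1024; j += 32) {       /* outer loop: x/y = 32 iterations */
        dpref(&A[j + 32]);                 /* issued once per line, outside the inner loop */
        for (i = j; i < j + 32; i++) {     /* inner loop: y = 32 iterations */
            sum += A[i];
        }
    }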
Moreover, by means of the double loop, it becomes possible to ensure the number of cycles taken from the issue of an instruction that has a possibility of causing an interlock to the issue of the next instruction that has a possibility of causing another interlock. Thus, interlocks are less likely to occur during the program execution.
It should be noted that the program conversion device can be realized as a compiler, an OS (Operating System), or an integrated circuit, such as a CPU.
Response wait instructions include an instruction that waits for a predetermined response from an outside source when the instruction is executed and an instruction that has a possibility of causing an interlock, like the above-mentioned dpref instruction, as well as an instruction that may or may not wait for a response depending on the case.
It should be noted here that the present invention may be realized not only as the program conversion device provided with such characteristic units, but also as: a program conversion method having steps corresponding to the characteristic units provided in the program conversion device; and a program that causes a computer to function as the program conversion device. Also, it should be understood that such a program can be distributed via a recording medium such as a CD-ROM (Compact Disc-Read Only Memory), or via a transmission medium such as the Internet.
The present invention can improve the processing speed of a program execution.
Moreover, interlocks are less likely to occur during the program execution.
[System Construction]
The compiler 149 is a program whose target processor is a CPU (Central Processing Unit) of a computer provided with a cache and which converts the source program 141 into an assembler file 143 described in assembler language. When converting the source program 141 into the assembler file 143, the compiler 149 performs optimizing processing based on a cache parameter 142 that is information regarding a cache line size, a latency cycle, etc. and on profile data 147 described later, and then outputs the assembler file 143.
The assembler 150 is a program that converts the assembler file 143 described in assembler language into an object file 144 described in machine language. The linker 151 is a program which links a plurality of object files 144 to generate the execution program 145.
As development tools for the execution program 145, a simulator 152 and a profiler 153 are prepared. The simulator 152 is a program which simulates the execution program 145 and outputs various sets of execution log data 146 obtained during the execution. The profiler 153 is a program which analyzes the execution log data 146 and outputs the profile data 147 obtained by analyzing an execution sequence of the program.
[Construction of Compiler]
The syntax analyzing unit 182 is a processing unit which receives the source program 141 as input and outputs a program in an intermediate language after performing the syntax analysis processing.
The optimization information analyzing unit 183 is a processing unit which reads and analyzes the information required to perform the optimizing processing on the intermediate code, namely the cache parameter 142, the profile data 147, a compile option, and a pragma. The general optimizing unit 184 is a processing unit which performs general optimizing processing on intermediate code. The instruction scheduling unit 185 is a processing unit which performs instruction scheduling by optimizing a sequence of instructions. Both the compile option and the pragma are directives to the compiler.
The loop structure transforming unit 186 is a processing unit which transforms a single loop into a double loop. The instruction optimum placing unit 187 is a processing unit which places prefetch instructions in the transformed double loop. The code outputting unit 188 is a processing unit which converts a program in the optimized intermediate language into a program described in assembler language and outputs the assembler file 143.
[Processing Flow]
Next, a flow of the processing executed by the compiler 149 is explained.
The syntax analyzing unit 182 performs syntax analysis on the source program 141 and generates intermediate code (S1). The optimization information analyzing unit 183 analyzes the cache parameter 142, the profile data 147, the compile option, and the pragma (S2). The general optimizing unit 184 performs the general optimization for the intermediate code in accordance with the analysis result given by the optimization information analyzing unit 183 (S3). The instruction scheduling unit 185 performs the instruction scheduling (S4). The loop structure transforming unit 186 focuses on the loop structure included in the intermediate code and transforms a single loop structure into a double loop structure if necessary (S5). The instruction optimum placing unit 187 inserts an instruction into the intermediate code for prefetching data to be referenced within the loop structure (S6). The code outputting unit 188 converts the intermediate code into assembler code, and outputs it as the assembler file 143 (S7).
The syntax analyzing processing (S1), the optimization information analyzing processing (S2), the general optimizing processing (S3), the instruction scheduling processing (S4), and the assembler code outputting processing (S7) are each the same as the corresponding commonly known processing. Thus, detailed explanations about them are omitted here.
The following are detailed explanations about the loop structure transforming processing (S5) and the prefetch instruction placing processing (S6).
When the loop count is non-fixed (NO in S11), a judgment is made, based on the pragma or the compile option, as to whether the minimum loop count is designated, or whether it is designated that the loop count be dynamically judged and the loop be split during the program execution (S12).
When either directive is present (YES in S12) or the loop count is a fixed value (YES in S11), a judgment is made as to whether or not a subscript of an array referenced within the loop is analyzable (S13). To be more specific, when the value of the loop counter varies with a certain regularity, the subscript is judged to be analyzable. Conversely, when the value of the loop counter may be rewritten within an iteration, for example, the subscript is judged not to be analyzable.
When the subscript is analyzable (YES in S13), the number of bytes of the elements to be referenced in one iteration is obtained for each array referenced during the loop processing, and a minimum value LB among the obtained numbers is derived (S14).
Next, a judgment is made as to whether or not a value derived by dividing the cache line size CS by the value LB is greater than one (S15). When the value of CS/LB is greater than one (YES in S15), a judgment is made as to whether or not the arrays of the loop processing are aligned (S17). Whether or not the arrays are aligned is judged by whether it is designated by the pragma or the compile option that the arrays are aligned.
When the arrays are not aligned (NO in S17), a judgment is made as to whether or not “LB*LC/IC” is greater than CS (S16). Here, LC represents the number of latency cycles, and IC represents the number of cycles per iteration. Also, “LC/IC” represents the loop count for each loop when the loop is split into a plurality of innermost loops, and “LB*LC/IC” represents the access capacity of the loop.
When “LB*LC/IC” is greater than the line size CS (YES in S16), the elements corresponding to a size of one line or more are referenced in each loop processing after the splitting. As such, the cycle is considered as a split factor, and a loop count DT of the innermost loop is derived according to the following expression (1) for a case where each loop processing is transformed into a double loop (S18).
DT=(LC−1)/IC+1 (1)
When “LB*LC/IC” is smaller than the line size CS (NO in S16) or the arrays are aligned (YES in S17), the size is considered as a split factor and the loop count DT of the innermost loop is derived according to the following expression (2) for a case where each loop processing is transformed into a double loop (S19).
DT=(CS−1)/LB+1 (2)
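As a purely illustrative computation (the actual values depend on the target processor and the cache parameter 142, and an integer division is assumed here), with a latency of LC = 100 cycles and IC = 4 cycles per iteration, expression (1) gives DT = (100 − 1)/4 + 1 = 25, and with a line size of CS = 128 bytes and LB = 4 bytes, expression (2) gives DT = (128 − 1)/4 + 1 = 32.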
After the processing of deriving the loop count DT of the innermost loop (S18 or S19), a judgment is made as to whether or not the loop count DT of the innermost loop is greater than one (S20). When DT is one (NO in S20), the loop does not need to be structurally transformed into a double loop since the loop count DT of the innermost loop is one. Thus, the loop structure transforming processing (S5) is terminated.
When the loop count DT of the innermost loop is two or more (YES in S20), an outer loop structure is generated for a case where the loop is transformed into a double loop (S21). When generating the outer loop structure, a judgment is made as to whether or not the peeling processing is necessary (S22). A method of judging whether or not the peeling processing is necessary is described later on.
When the peeling processing is necessary (NO in S22), the peeling processing is performed and peeling code is generated (S24). Following this, a judgment is made as to whether or not a directive by the compile option "-O" or "-Os" is present (S25). Here, the compile option "-O" is a directive for having the compiler output the assembler code that has the average program size and execution processing speed. The compile option "-Os" is a directive for having the compiler output the assembler code with a high regard for a reduction in the program size.
When the peeling processing is unnecessary (YES in S22) or there is no directive by the compile option "-O" or "-Os" (NO in S25), a conditional expression is generated for the loop count of the inner loop (innermost loop) (S23).
When the directive by the compile option "-O" or "-Os" is present (YES in S25), the loop processing peeled off is folded into a double loop and a conditional expression is generated for the loop count of the innermost loop (S26).
After the processing of generating the loop count condition of the innermost loop (S23 or S26), a judgment is made as to whether or not the number of target arrays to be referenced within the innermost loop is one (S27). When the number of target arrays to be referenced within the innermost loop is one (YES in S27), the loop structure transforming processing (S5) is terminated.
When the number of target arrays to be referenced within the innermost loop is two or more (NO in S27), the number of splits of the innermost loop is derived and a ratio of the loop counts of the innermost loops after the splitting is determined (S28). Following this, a judgment is made as to whether or not a value obtained by dividing the innermost loop count DT by the number of splits is greater than one (S29). To be more specific, when this value is one or less (NO in S29), there is no point in splitting since each loop count after the splitting would be one or less. Thus, the loop structure transforming processing (S5) is terminated.
When this value is greater than one (YES in S29), each loop count after the splitting is two or more. In this case, a judgment is made as to whether or not a directive by the compile option "-O" or "-Ot" is present (S30). The compile option "-Ot" is a directive for having the compiler output the assembler code with a high regard for an improvement in the execution processing speed.
When the directive by the compile option "-O" or "-Ot" is present (YES in S30), copy-type inner loop splitting processing, which is described later, is executed with a high regard for an improvement in the execution processing speed (S31). Then, the loop structure transforming processing (S5) is terminated.
When the directive by the compile option "-O" or "-Ot" is not present (NO in S30), condition-type inner loop splitting processing, which is described later, is executed with a high regard for a reduction in the program size (S32). Then, the loop structure transforming processing (S5) is terminated.
A value obtained by dividing the loop count DT of the innermost loop by the number of splits is set as the post-subdividing inner loop count (S41). Next, the inner loop is copied the number of times corresponding to the number of splits so as to generate the inner loops (S42). Following this, each inner loop count after the subdividing is modified to the post-subdividing inner loop count (S43). Moreover, the remainder left over when DT is divided by the number of splits is added to the loop count of the post-subdividing head loop (S44), and the copy-type inner loop splitting processing is terminated.
A value obtained by dividing the loop count DT of the innermost loop by the number of splits is set as the post-subdividing inner loop count (S51). Next, an inner loop count condition switch table is generated (S52). To be more specific, a switch statement, as it is called in C language, is generated so that the inner loop count will be sequentially switched. It should be noted that an if statement may be used instead.
After the generation of the table, each inner loop count condition after the subdividing is modified to the post-subdividing inner loop count (S53). Following this, the remainder left over when DT is divided by the number of splits is added to the loop count condition of the post-subdividing head loop (S54), and the condition-type inner loop splitting processing is terminated.
In the prefetch instruction placing processing, the following processing is repeated for all the loops (loop A). First, it is checked whether the loop in question is a target loop for instruction insertion (S61). Information as to whether it is a target loop for instruction insertion is obtained from the analysis result given by the loop structure transforming unit 186.
In the case of a target loop for instruction insertion (YES in S61), a judgment is made as to whether the condition-type loop splitting has been performed on the loop in question (S62). When the condition-type loop splitting has been performed (YES in S62), a position for the instruction insertion is analyzed for each conditional statement (S63), and then a prefetch instruction is inserted (S64). When the condition-type loop splitting has not been performed on the target loop for the instruction insertion (NO in S62), a judgment is made as to whether the copy-type loop splitting has been performed on the present loop (S65). When the copy-type loop splitting has been performed (YES in S65), the position of the instruction insertion before the present loop is analyzed (S66). After this, the prefetch instruction is inserted (S67). When the present loop is a peeled loop (YES in S68), the position of the instruction insertion is analyzed so that the instruction will be inserted before the present loop (S69), and the prefetch instruction is inserted into that position (S70).
In the instruction inserting processing, the following is repeated until an information list composed of an insertion instruction, an insertion position, and an insertion address becomes empty (loop B).
A judgment is made as to whether or not the array elements for which the prefetch instruction is to be inserted are aligned (S72). When they are not aligned (NO in S72), a judgment is made as to whether the loop splitting was performed in accordance with the cycle factor or in accordance with the size factor (S73).
When they are aligned (YES in S72) or the loop splitting has been performed in accordance with the cycle factor (YES in S73), an instruction for prefetching data one line ahead is inserted (S74). When they are not aligned and the loop splitting has been performed in accordance with the size factor (NO in S73), an instruction for prefetching data two lines ahead is inserted (S75). Finally, the analyzed information is deleted from the information list (S76).
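For illustration only, with a 128-byte line and 4-byte elements, prefetching data one line ahead might take a form such as dpref(&A[i + 32]), whereas prefetching data two lines ahead might take a form such as dpref(&A[i + 64]); the additional line of margin covers the case where the elements referenced in one inner loop straddle a line boundary because they are not aligned.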
[Compile Option]
In the compiler system 148, an option "-fno-loop-tiling-dpref" is prepared as a compile option for the compiler. When this option is designated, the structure transformation will not be performed on the loop regardless of any directive by the pragma. When this option is not designated, whether or not to execute the structure transformation is determined in accordance with the presence or absence of a directive by the pragma.
[Directive by Pragma]
The present directive applies to the immediately subsequent loop.
When a variable is designated by the pragma “#pragma _loop_tiling_dpref variable name [, variable name]”, the loop splitting is performed with attention being paid only to the variable designated by the pragma. The variable to be designated may be an array or a pointer.
When a loop is designated by the pragma "#pragma _loop_tiling_dpref_all", the structure transformation is performed with attention being paid to all the arrays to be referenced within the loop.
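By way of example, designating the array A for the immediately subsequent loop might be written as follows (the loop body is illustrative):

    #pragma _loop_tiling_dpref A
    for (i = 0; i < 1024; i++) {
        sum += A[i];
    }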
The following is an explanation about the loop splitting processing in some specific phases. It should be noted that although in the following processing the program is described in C language for the sake of simplicity, the actual optimizing processing is performed in the intermediate language.
[Simple Loop Splitting]
Consideration is given to a case where a source program 282 shown in
FIGS. 12 to 15 are diagrams explaining the progression of the intermediate language in the simple loop splitting processing in which peeling is unnecessary.
As with
Consideration is given to a case where a source program 292 shown in
In such a case, as shown by a program 294 in
[Case Where a Plurality of Array Accesses are Present (Peeling is Unnecessary)]
Consideration is given to a case where a source program 301 shown in
For the double-loop structure where a plurality of array accesses are present, there are two kinds of optimizations which are: optimization called copy-type for improving the execution processing speed; and optimization called condition-type for reducing the program size.
First, the copy-type optimization is explained. The loop count of the innermost loop included in the program 302 is split according to a size ratio between the elements of the arrays A and B. Here, the sizes of the elements of the arrays A and B are the same. Thus, as shown by a program 303 in
By inserting the loop processing between the prefetch instructions in this way, the prefetch instructions for different arrays will not be issued in a row. As such, a latency caused by the execution of the prefetch instruction can be hidden. Consequently, the execution processing speed can be improved.
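Since the program 303 is shown only in a figure, the following is a minimal sketch of the kind of copy-type code described above, assuming 4-byte elements, a 128-byte line (32 elements per line), a total loop count of 1024, and an illustrative loop body; the innermost loop of 32 iterations is split at a ratio of 1:1 into two loops of 16 iterations, each preceded by the prefetch for a different array:

    for (j = 0; j < 1024; j += 32) {
        dpref(&A[j + 32]);                  /* prefetch the next line of array A */
        for (i = j; i < j + 16; i++) {
            C[i] = A[i] + B[i];
        }
        dpref(&B[j + 32]);                  /* prefetch for array B, separated from the previous dpref by 16 iterations */
        for (i = j + 16; i < j + 32; i++) {
            C[i] = A[i] + B[i];
        }
    }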
Next, the condition-type optimization is explained. As is the case with the copy-type optimization, the loop count of the innermost loop is split according to a size ratio between the elements of the arrays A and B in the condition-type optimization. However, the two innermost loops are not arranged in the manner shown in the program 303. As shown by a program 305 in
By setting the number of innermost loops at one and varying the loop count and the prefetch instructions of the innermost loop using the conditional branch expressions, the program size of machine language instructions that are eventually generated can be reduced. However, due to the conditional branch processing, there is a possibility that the processing speed may be slightly slower as compared with the case of the copy-type optimization.
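As a hedged sketch of the corresponding condition-type code (the program 305 itself appears only in a figure), a single innermost loop is kept and the array to be prefetched is switched by a conditional variable, here called K for illustration; because the element sizes of A and B are equal in this example, the two inner loop counts coincide, so only the prefetch target changes between conditions:

    K = 0;
    for (j = 0; j < 1024; j += 16) {
        switch (K) {
        case 0:
            dpref(&A[j + 32]);              /* prefetch for array A */
            break;
        case 1:
            dpref(&B[j + 32]);              /* prefetch for array B */
            break;
        }
        for (i = j; i < j + 16; i++) {
            C[i] = A[i] + B[i];
        }
        K = (K + 1) % 2;                    /* alternate the prefetch target every 16 iterations */
    }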
[Case Where a Plurality of Array Accesses are Present (Peeling is Necessary)]
Consideration is given to a case where a source program 311 shown in
Accordingly, when structurally transforming the source program 311 into a double-loop structure, a program 312 shown in
When the copy-type optimization is performed, the innermost loop is split according to a size ratio between the elements of the arrays A and B. As a result, a program 313 shown in
When the condition-type optimization is performed, the peeling folding processing is performed on the program 312. As a result, a program 315 shown in
In this way, when peeling is necessary, the peeled part is made into a loop separate from the double loop in the case of the copy type, whereas the value of the loop counter after the peeling processing is varied according to the conditional branch expression in the case of the condition type. Accordingly, when a plurality of array accesses are present in the loop and peeling is necessary, optimization can be performed in consideration of the latency caused by the prefetching.
[Case Where a Plurality of Array Accesses with Different Sizes are Present (Peeling is Unnecessary)]
Consideration is given to a case where a source program 321 shown in
In this case, attention is paid to the array B which has smaller element size and the loop structure transformation is performed corresponding to the elements of the array B. To be more specific, as shown by a program 322 in
Thus, when the copy-type optimization is performed, the innermost loop is split into three as shown by a program 323 in
When the condition-type optimization is performed, the variable K is updated within a range of values from zero to two during one set of the innermost loop processing and the loop count N of the innermost loop is set at one of 22, 21, and 21 through the conditional branch processing in accordance with the value of the variable K, as shown by a program 325 in
[Case Where a Plurality of Array Accesses with Different Sizes are Present (Peeling is Necessary)]
A source program 331 shown in
[Case Where a Plurality of Array Accesses with Different Strides are Present]
A stride refers to a value of an increment (an access width) of array elements in the loop processing. Consideration is given to a case where a source program 341 shown in
Accordingly, when the copy-type optimization is performed, the innermost loop is divided into three as shown by a program 343 in
On the other hand, when the condition-type optimization is performed, the variable K is updated within a range of values from zero to two during one set of the innermost loop processing and the loop count N of the innermost loop is set at one of 11, 11 and 10 through the conditional branch processing in accordance with the value of the variable K, as shown by a program 345 in
[Case Where the Loop Count is Non-Fixed]
Consideration is given to a case where a source program 351 shown in
In accordance with the pragma directive, the loop processing is split into the former loop processing that is performed 128 times and the latter loop processing that is performed the number of times corresponding to the loop count specified by the variable Val. As is the case with the simple loop, each processing is transformed into a double loop, so that a program 352 shown in
When the copy-type optimization is performed, a prefetch instruction (dpref(&A[i+32])) for prefetching the elements of the array A one line ahead is inserted immediately before the innermost loop of the program 352. As a result, a program 353 shown in
When the condition-type optimization is performed, the peeling folding processing is performed on the latter loop processing. Then, a branch instruction is inserted so that the innermost loop count is set at 32 until the outermost loop count reaches 128 and that the innermost loop count afterward is set at a count derived from (Val-128). As a result, a program 354 shown in
Finally, a prefetch instruction (dpref(&A[i+32])) is inserted prior to the execution of the innermost loop. As a result, a program 355 shown in
Consideration is given to a case where a source program 361 shown in
Even if the optimization through the loop structure transformation is performed on loop processing whose loop count is small, the effect of the optimization is unlikely to appear. For this reason, in order to heighten the effect of the optimization, the optimized loop is executed in such a case only when the loop count is greater than a certain threshold value, whereas the normal loop processing is executed otherwise. For example, suppose that the threshold value is 1024. As shown by a program 362 in
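A minimal sketch of such threshold code, assuming the threshold value of 1024 mentioned above and a loop whose count is given by a variable Val, might look like this (boundary handling for a count that is not a multiple of 32 is omitted for brevity):

    if (Val > 1024) {
        for (j = 0; j < Val; j += 32) {     /* optimized double loop */
            dpref(&A[j + 32]);
            for (i = j; i < j + 32 && i < Val; i++) {
                sum += A[i];
            }
        }
    } else {
        for (i = 0; i < Val; i++) {         /* normal loop processing */
            sum += A[i];
        }
    }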
[Case Where Loop Splitting is Unnecessary]
Moreover, even when the number of processing cycles in the loop is greater than the number of processing cycles required to execute the prefetch instruction, the double looping is unnecessary. In this case, even if the prefetch instruction is simply inserted at the head of the loop, the latency caused by the prefetch instruction can be hidden.
[Case Where the Elements to be Accessed are Misaligned]
However, in general, the compiler has no way to know whether the elements are aligned or not before the execution. On account of this, the compiler needs to perform the optimization on the precondition that the elements to be accessed in the loop are not appropriately aligned in the main memory.
To be more specific, when a source program 381 shown in
When a source program 391 shown in
[Structure Transformation Splitting by Insertion of Dynamic Alignment Analyzing Code]
Predetermined bits of the head address of the array A (the address of the element A[0]) indicate a cache line, and, out of these bits, other predetermined bits indicate an offset from the head of the line. Thus, through a bitwise logical operation expressed as "A&Mask", the offset from the head of the line can be derived. Here, the value of Mask is predetermined. By shifting the offset value derived from the head address of the array A to the right by a predetermined correction value Cor, the position of the head element A[0] of the array A in relation to the head of one line can be determined. Thus, the number of elements n which are not aligned on the line can be derived according to the following expression (3).
n = 32 − ((A & Mask) >> Cor)    (3)
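As an illustrative sketch only, with a 128-byte line and 4-byte elements, plausible (but not prescribed) values would be Mask = 0x7F and Cor = 2, and expression (3) might then be evaluated at run time as follows:

    unsigned long addr = (unsigned long)&A[0];
    int n = 32 - (int)((addr & 0x7F) >> 2);   /* expression (3): elements from A[0] up to the next line boundary */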
More specifically, as shown in
Thus, as shown by a program 402 in
Then, the folding processing is performed on a peeled loop 405, so that a program 403 shown in
[Structure Transformation Splitting Using Profile Information]
[Structure Transformation Performed on the Loop aside from the Innermost Loop]
Consideration is given to a case where a source program 421 shown in
[Variable Directive by Pragma “#pragma _loop_tiling_dpref variable name [, variable name]”]
As described so far, according to the compiler system of the present embodiment, the loop processing is transformed into a double loop and the prefetch instruction is executed outside the innermost loop. This can prevent needless prefetch instructions from being issued, thereby improving the processing speed of the program execution. Moreover, the double loop processing ensures the required number of cycles between the executions of one prefetch instruction and a next prefetch instruction. On account of this, the latency can be hidden, and interlocks can be prevented from occurring.
Up to this point, the compiler system of the embodiment of the present invention has been explained on the basis of the present embodiment. However, the present invention is not limited to the present embodiment.
For example, an instruction placed by the instruction optimum placing unit 187 is not limited to a prefetch instruction. The instruction may be: an ordinary memory access instruction; a response wait instruction, such as an instruction that waits for a processing result after activating external processing; an instruction that may result in an interlock after being executed; or an instruction that requires a plurality of cycles until a predetermined resource becomes referable. The response wait instructions include an instruction that may or may not wait for a response depending on the case, as well as an instruction that always waits for a response.
Moreover, the system may be a compile system whose target processor is a CPU of a computer having no caches and which outputs such code that hides latencies caused by various kinds of processes and prevents interlocks from occurring.
Furthermore, the system may be realized as an OS (Operating System) which sequentially interprets machine instructions to be executed by the CPU and executes processing such as the loop structure transformation described in the present embodiment.
In addition, the present invention is applicable to instructions that have no possibility of causing interlocks, such as a PreTouch instruction described below. A PreTouch instruction is an instruction that merely allocates, in advance, an area in the cache for storing a variable designated by an argument. The following is an explanation about processing in which the loop structure transformation is performed and a PreTouch instruction is inserted.
[Simple Loop Splitting]
Consideration is given to a case where a source program 502 shown in
Thus, as shown by a program 504 in
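Although the program 504 appears only in a figure, a minimal sketch of a double loop in which the area allocating instruction is placed outside the innermost loop, again assuming 4-byte elements and a 128-byte line, might be as follows; whether the instruction targets the current line or a line ahead is an illustrative choice here:

    for (j = 0; j < 1024; j += 32) {
        PreTouch(&A[j]);                    /* allocate, in advance, the cache line that will hold A[j]..A[j+31] */
        for (i = j; i < j + 32; i++) {
            A[i] = i;                       /* the stores then hit the pre-allocated line */
        }
    }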
Consideration is given to a case where a source program 512 shown in
In such a case, as shown by a program 514 in
[Structure Transformation Splitting by Insertion of Dynamic Alignment Analyzing Code]
Predetermined bits of the head address of the array A (the address of the element A[0]) indicate a cache line, and, out of these bits, other predetermined bits indicate an offset from the head of the line. Therefore, through a bitwise logical operation expressed as "A&Mask", the offset from the line head can be derived. Note that the value of Mask is predetermined, and that it is set as Mask=0x7F in the present case. By subtracting the offset value, which is derived from the address of the element of the array A that is to be accessed first in the loop, from the value of Mask and then shifting the result to the right by a predetermined correction value Cor, the position of the element A[X] of the array A in relation to the head of one line can be determined. Thus, the number of misaligned elements PRLG in the line can be derived according to the following expression (4).
PRLG = (Mask − ((&A[X]) & Mask)) >> Cor    (4)
Moreover, according to the following expression (5), the position of the element (A[Y]) which follows the element (A[Y−1]) of the array A to be referenced lastly in the loop can be derived, with the position being determined in relation to the head of one line. Accordingly, the number of elements EPLG which do not fully fill one line can be derived.
EPLG = ((&A[Y]) & Mask) >> Cor    (5)
Furthermore, the loop count KRNL with which the processing for one line is performed without leaving a remainder can be derived according to the following expression (6).
KRNL = (Y − X) − (PRLG + EPLG)    (6)
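Purely as a sketch, again assuming Mask = 0x7F and Cor = 2 for 4-byte elements on a 128-byte line, expressions (4) to (6) might be evaluated as follows for a loop that accesses A[X] through A[Y − 1]:

    unsigned long mask = 0x7F;
    int PRLG = (int)((mask - ((unsigned long)&A[X] & mask)) >> 2);   /* expression (4): misaligned elements at the head */
    int EPLG = (int)(((unsigned long)&A[Y] & mask) >> 2);            /* expression (5): elements that do not fill a full line at the tail */
    int KRNL = (Y - X) - (PRLG + EPLG);                              /* expression (6): elements processed in whole-line units */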
To be more specific, as shown by a program 524 in
Accordingly, processing such as calculation according to the expression (4) to obtain the number of misaligned elements PRLG of the array A is performed as shown by the program 524 in
After this, the folding processing is performed on the peeled loop. As a result, a program 526 shown in
Note, however, that the area allocating instruction is inserted only in the aligned area and only for the innermost loop that uses an entire cache line.
The present invention is applicable to processing executed by a compiler, an OS, and a processor, each of which controls issues of an instruction that has a possibility of causing an interlock.
Foreign Application Priority Data: 2004-035430, filed February 2004, JP (national).
PCT Filing Data: PCT/JP05/01670, filed 2/4/2005, WO, 371(c) date 1/23/2006.