This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2013-105537, filed on May 17, 2013, the entire contents of which are incorporated herein by reference.
The embodiments discussed herein are related to a compiler and a compiling method.
A compiler generates an object code by reading and optimizing a source code described in a programming language. In the compiler, loop fusion is utilized as an optimization technique to improve data locality, reduce the cost required for repeating determination processing for loops, and speed up the execution performance. The loop fusion is executed by fusing the loops in the case where an initial value, a final value, and an incremental value of the adjacent loops are identical and dependency between the loops does not collapse when the loops are fused, thereby reducing the number of determination times for a multiple-loop processing structure existing inside a source program.
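As a minimal illustration of the transformation, the following Python sketch shows two adjacent loops whose initial value, final value, and incremental value are identical, together with a fused equivalent. The function names and the arrays a, b, c, and d are hypothetical and do not appear in the embodiments.

```python
def unfused(a, b):
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    # Loop 1: initial value 0, final value n, incremental value 1.
    for i in range(0, n, 1):
        c[i] = a[i] + b[i]
    # Loop 2: same bounds; iteration i reads only c[i], which loop 1
    # wrote in its own iteration i, so fusing keeps the dependency intact.
    for i in range(0, n, 1):
        d[i] = c[i] * 2.0
    return c, d


def fused(a, b):
    n = len(a)
    c = [0.0] * n
    d = [0.0] * n
    # Fused loop: one loop-termination test per iteration instead of two,
    # and c[i] is consumed right after it is produced.
    for i in range(0, n, 1):
        c[i] = a[i] + b[i]
        d[i] = c[i] * 2.0
    return c, d
```

Both functions return the same arrays for the same inputs; only the loop structure differs.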
Patent Literature 1: Japanese Laid-open Patent Publication No. 09-114675
Patent Literature 2: Japanese Laid-open Patent Publication No. 62-35944
Patent Literature 3: Japanese Laid-open Patent Publication No. 2009-104422
However, there may be a problem of causing data access latency or arithmetic processing latency after the loop fusion only with the above-described conditions of loop fusion, inducing ineffective loop fusion.
For instance, even when the loops are adjacent to each other and have identical initial values, final values, and incremental values, fusing loops in which the number of data accesses exceeds the operands, i.e., the number of arithmetic operations, causes data access latency, so that the performance is not improved. In the same manner, fusing loops in which the operands exceed the number of data accesses causes arithmetic processing latency, so that the performance is not improved.
According to an aspect of the embodiments, a computer-readable recording medium stores therein a compile controlling program causing a computer to execute a compile process. The compile process includes determining executability of loop fusion, for each of a plurality of loops existing in a code to be processed, based on performance information of a system where the code to be processed is executed and based on operands and number of data transfers executed in each of the loops, and executing fusion of loop processing in accordance with a determination result on executability of the loop fusion.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
Preferred embodiments will be explained with reference to accompanying drawings. Note that the present invention is not to be limited by the embodiments. Each of the embodiments can be combined as appropriate as long as there is no inconsistency.
As illustrated in
The storage unit 11 is a storage device that stores a source program 11a, an intermediate language 11b, an object file 11c, and an executable file 11d. Examples of the storage unit 11 are a memory, a hard disk and so on. A plurality of object codes to be processed is described in the source program 11a.
The compiler execution unit 12 compiles the source code. The compiler execution unit 12 is, for example, a compiler executed by a processor. The compiler execution unit 12 includes a source program input unit 13, an input/output control unit 14, an intermediate language generating unit 15, an optimizing unit 16, a code generating unit 17, and an object file output unit 18.
The source program input unit 13 opens a source program 11a designated by the compiler execution unit 12. For instance, the source program input unit 13 reads the source program stored in the storage unit 11 and outputs the source program to the input/output control unit 14 upon receipt of an instruction to start compiling from the compiler execution unit 12.
The input/output control unit 14 selects various kinds of processing in accordance with options and file types. For instance, when the source program 11a is received from the source program input unit 13, the input/output control unit 14 outputs the source program 11a to the intermediate language generating unit 15. Further, when an assembly language is received from the code generating unit 17, the input/output control unit 14 outputs the assembly language to the object file output unit 18.
The intermediate language generating unit 15 generates the intermediate language 11b from the source program 11a received from the input/output control unit 14, and stores the intermediate language on the memory. More specifically, the intermediate language generating unit 15 converts the source program 11a to an intermediate code utilized in the optimizing unit 16, i.e., a code to be utilized inside the compiler. Further, the intermediate language generating unit 15 stores the converted intermediate language 11b on the storage unit 11 and the like.
The optimizing unit 16 executes optimization such as loop fusion in order to speed up execution of the source program 11a. The optimizing unit 16 includes a source analysis unit 16a, a combination extraction unit 16b, an information extraction unit 16c, a ratio calculation unit 16d, a determination unit 16e, and a fusion unit 16f.
The source analysis unit 16a analyzes the intermediate language 11b. For instance, the source analysis unit 16a reads out the intermediate language 11b from the storage unit 11, executes line reconstruction, lexical analysis, syntax analysis, semantic analysis, etc., and outputs the results thereof to the combination extraction unit 16b.
The combination extraction unit 16b extracts a combination of the loops for which the loop fusion is executable. More specifically, the combination extraction unit 16b determines the combination for which the loop processing is fused (hereinafter may be referred to as “virtual loop”) based on conditions of each of a plurality of the loops existing inside the source program 11a or in the intermediate language 11b in accordance with the analysis results by the source analysis unit 16a.
For instance, the combination extraction unit 16b extracts a virtual loop based on an initial value, a final value, and an incremental value of the loop.
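The first of these conditions can be sketched as follows. This is only an assumed model: the LoopHeader type, its field names, and the function names are hypothetical, and the sketch checks only that the initial value, final value, and incremental value match. The data dependence and nesting conditions are not modeled.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class LoopHeader:
    """Hypothetical record of a loop's iteration space."""
    name: str
    initial: int
    final: int
    step: int


def same_iteration_space(x, y):
    # Two loops qualify only when all three header values are identical.
    return (x.initial, x.final, x.step) == (y.initial, y.final, y.step)


def candidate_pairs(loops):
    # Enumerate every pair of loops once and keep the matching pairs.
    return [(x.name, y.name)
            for x, y in combinations(loops, 2)
            if same_iteration_space(x, y)]
```

For example, given loops "1" and "3" running from 1 to 100 in steps of 1 and a loop "2" starting at 2, candidate_pairs returns only the pair ("1", "3").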
Additionally, the combination extraction unit 16b extracts the virtual loop based on data dependence between the loops.
On the other hand, in the case of right illustration of
Further, the combination extraction unit 16b extracts the virtual loop based on whether the loops are tightly nested. More specifically, the combination extraction unit 16b determines, as a candidate combination, the loops containing an operation only in an innermost loop of a multiple-loop. In the case where any operation is contained in a halfway loop, the combination extraction unit 16b deems the loops below the loop containing the operation are tight loops, and exempts these loops from the candidate combination.
Returning to
For instance, the information extraction unit 16c counts the number of arithmetic instructions, such as a MULT instruction and an ADD instruction, which execute floating-point operations on a register, and calculates the operands for each of the combinations. Also, the information extraction unit 16c counts the number of cache lines in the memory which the respective loops access, and calculates the number as the number of streams for each of the combinations.
Further, the information extraction unit 16c counts the number of instructions such as a “LOAD” instruction for loading data in the register or a “STORE” instruction for storing the data in the memory for each of the combinations. Further, the information extraction unit 16c calculates the number of data transfers, i.e., the number of data accesses based on the number of the respective instructions and the number of bytes of the respective instructions to be referred. Here, note that the number of data transfers of the “STORE” instruction becomes double because the instruction is once loaded in the register and then stored in the loaded area.
The ratio calculation unit 16d calculates, for each of the combinations, a ratio of the number of data transfers to the operands by using the various kinds of data extracted by the information extraction unit 16c. More specifically, the ratio calculation unit 16d calculates the following ratio: the total number of data transfers inside the virtual loop/total number of operands inside the virtual loop=B (Byte)/F (FLOP) value. For example, in the case where the number of data transfers is “78” and the operands is “156”, the ratio calculation unit 16d calculates the B/F value as “0.5”.
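The ratio above can be written as a one-line function. The function name and the guard against a zero operation count are assumptions added for illustration; the embodiments do not state how a loop without floating-point operations is handled.

```python
def bf_value(bytes_transferred, flops):
    """Return the B/F value: data transfers (bytes) per floating-point operation."""
    if flops == 0:
        # Assumption: treat a loop with no floating-point operations as an error.
        raise ValueError("B/F value is undefined when there are no floating-point operations")
    return bytes_transferred / flops
```

With the figures quoted above, bf_value(78, 156) evaluates to 0.5.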
The determination unit 16e determines whether the loop fusion is executable for each of the combinations based on the B/F value calculated by the ratio calculation unit 16d. More specifically, the determination unit 16e determines that the loop fusion is executable in the case where the B/F value of each of the combinations is in an optimum state in which arithmetic performance of the processor is fully utilized and a memory bandwidth is fully used.
For example, the determination unit 16e determines that the loop fusion is executable when the combination has a B/F value within a predetermined range. Now, a description will be given for a value to be the threshold. For example, in a computer that operates having performance specification of 16 GFLOPS, 8 core, an operand value of the floating-point operations executed per second is 16×8=128 GFLOPS. Additionally, in the case where theoretical throughput of the memory is 64 GB/s, the ideal B/F value to fully use system resources of the computer is “64/128=0.5”. This value depends on the specification of a machine, and may fluctuate depending on the memory throughput and the FLOPS value. In this exemplary case, the range of the threshold is set to 0.3<B/F value<0.6 based on the ideal B/F value.
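The derivation of the ideal B/F value and the range check can be sketched as below. The function and parameter names are hypothetical, and the sketch assumes that the 16 GFLOPS figure is a per-core value, as the multiplication 16×8 suggests.

```python
def ideal_bf(gflops_per_core, cores, memory_gb_per_s):
    # Peak arithmetic throughput of the whole machine in GFLOPS.
    peak_gflops = gflops_per_core * cores
    # Bytes the memory can deliver per floating-point operation at peak.
    return memory_gb_per_s / peak_gflops


def within_range(bf, lower, upper):
    # Strict inequalities, matching "lower < B/F value < upper".
    return lower < bf < upper
```

For the machine described above, ideal_bf(16, 8, 64) gives 0.5, and within_range(0.5, 0.3, 0.6) is true. Because the bounds depend on the machine specification, they are passed in as parameters rather than fixed in the code.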
Incidentally, in the case where the same loop exists in a plurality of combinations for which the loop fusion is determined to be executable, the determination unit 16e selects, for fusion, the loops corresponding to the combination having the minimum difference between the B/F value and the ideal value. Further, the determination unit 16e executes the same determination for combinations that do not include any of the loops already selected for fusion.
Also, the determination unit 16e may determine executability of the loop fusion considering the number of streams and the number of instructions. For instance, when the number of streams or the number of instructions exceeds the threshold, the determination unit 16e determines that the loop fusion is not executable for a virtual loop even though the virtual loop satisfies the conditions of the B/F value.
The fusion unit 16f fuses the loops for which the determination unit 16e has determined that the loop fusion is executable.
The code generating unit 17 generates an assembly language from the intermediate language optimized by the optimizing unit 16. Subsequently, the code generating unit 17 outputs the generated assembly language to the input/output control unit 14.
Upon receipt of the assembly language from the input/output control unit 14, the object file output unit 18 generates an object file 11c from the assembly language and stores the object file in the storage unit 11.
The linker 20 reads out the object file 11c generated by the object file output unit 18 from the storage unit 11, and generates the executable file 11d by linking the object file 11c to a library file. Subsequently, the linker 20 stores the generated executable file 11d in the storage unit 11.
Processing Flow
Next, a processing flow executed by the information processor will be described. Here, an entire flow of the loop fusion and data generation processing executed in the entire flow will be described.
Entire Flow
Subsequently, the intermediate language generating unit 15 generates the intermediate language 11b from the source program 11a that has been read out by the source program input unit 13 (S103). After that, analysis by the source analysis unit 16a is executed.
Then, the combination extraction unit 16b extracts virtual loops representing candidate combinations for the loop fusion (S104). Subsequently, the information extraction unit 16c and the ratio calculation unit 16d select one virtual loop from the extracted virtual loops (S105) and the data generation processing is executed (S106).
Further, after completing the data generation processing for the selected virtual loop, the information extraction unit 16c and the ratio calculation unit 16d determine whether the processing has been completed for all of the virtual loops (S107). Here, in the case where the processing has not yet been executed for some of the virtual loops (S107: No), the information extraction unit 16c and the ratio calculation unit 16d return to step S105 and repeat the processing thereafter.
On the other hand, in the case where it is determined that the processing has been completed for all of the virtual loops (S107: Yes), the determination unit 16e extracts a virtual loop having the B/F value satisfying the conditions (S108). At this point, the determination unit 16e may extract the virtual loop in light of the number of streams or the number of instructions.
Subsequently, the determination unit 16e determines, as a fusing target, the virtual loop having the B/F value closest to the ideal value among the virtual loops satisfying the conditions (S109). Then, the determination unit 16e exempts, from the fusing target, the loop belonging to the virtual loop that has been determined as the fusing target (S110). In other words, the determination unit 16e exempts, from the fusing target, other virtual loops including the loop determined to be fused.
After that, in the case of determining that there is another virtual loop satisfying the conditions (S111: Yes), the determination unit 16e returns to step S109. On the other hand, in the case where the determination unit 16e determines that there is no other virtual loop satisfying the conditions (S111: No), the fusion unit 16f fuses the loops belonging to each virtual loop determined as the target of loop fusion (S112). After that, general compile processing is executed.
Data Generation Processing Flow
As illustrated in
Subsequently, in the case of determining that the searched instruction is the “STORE” instruction (S203: Yes), the information extraction unit 16c determines whether streams which the respective loops inside the target virtual loop access are different (S204).
Then, in the case of determining that the streams which the respective loops access are different (S204: Yes), the information extraction unit 16c counts the number of data transfers, i.e., the number of data accesses (S205), and returns to step S202 to repeat the processing thereafter. At this point, the information extraction unit 16c also counts the number of streams to be accessed inside the virtual loop. Meanwhile, in the case of determining that the streams which the respective loops access are not different (S204: No), the information extraction unit 16c returns to step S202 to repeat the processing thereafter without counting the number of data transfers.
On the other hand, in the case of determining that the searched instruction is not the “STORE” instruction in step S203 (S203: No) but is the “LOAD” instruction (S206: Yes), the information extraction unit 16c executes the processing in step S207. More specifically, the information extraction unit 16c determines whether the streams which the respective loops inside the target virtual loop access are different (S207).
Then, in the case of determining that the streams which the respective loops access are different (S207: Yes), the information extraction unit 16c counts the number of data transfers, i.e., the number of data accesses (S208) and returns to S202 to repeat the processing thereafter. At this point, the information extraction unit 16c also counts the number of streams to be accessed inside the virtual loop. Meanwhile, in the case of determining that the streams which the respective loops access are not different (S207: No), the information extraction unit 16c returns to step S202 to repeat the processing thereafter without counting the number of data transfers.
On the other hand, in the case of determining that the searched instruction is not the “LOAD” instruction in step S206 (S206: No) but is an instruction of four arithmetic operations such as ADD (S209: Yes), the information extraction unit 16c executes the processing in step S210. In other words, the information extraction unit 16c determines whether the searched instruction of four arithmetic operations is a floating-point type instruction.
Then, in the case of determining that the searched instruction of four arithmetic operations is the floating-point type instruction (S210: Yes), the information extraction unit 16c counts the operands, i.e., the number of floating-point operations (S211) and returns to step S202 to repeat the processing thereafter.
Meanwhile, in the case of determining that searched instruction of four arithmetic operations is not the floating-point type instruction (S210: No), the information extraction unit 16c returns to step S202 to repeat the processing thereafter without counting the operands.
Also, in the case of determining that the searched instruction is not the four arithmetic operations instruction (S209: No), the information extraction unit 16c returns to step S202 to repeat the processing thereafter without counting the operands.
Further, in the case where the information extraction unit 16c executes the processing in steps S203 to S211 and then determines that there is no unsearched instruction remaining in step S202 (S202: No), the information extraction unit 16c registers the number of data transfers and the number of streams calculated through steps S203 to S211, correlating to the respective virtual loops (S212).
Subsequently, the information extraction unit 16c also registers the number of floating-point operations calculated through steps S203 to S211, correlating to the information in step S212 (S213). Further, the ratio calculation unit 16d calculates a B/F value from the calculated number of data transfers and the number of floating-point operations, and registers the B/F value, correlating to the information in steps S212 and S213 (S214). Thus, the optimizing unit 16 calculates the operands, the number of data transfers, the number of instructions, the number of streams, and the B/F value for each of the virtual loops.
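The data generation flow of steps S202 to S214 can be sketched roughly as follows. The Instr record and its fields are hypothetical. The sketch also assumes, as one possible reading of the stream check in S204 and S207, that a transfer is counted only the first time a stream is accessed inside the virtual loop; the embodiments may define the check differently.

```python
from dataclasses import dataclass
from typing import Optional

ARITHMETIC = {"ADD", "SUB", "MULT", "DIV"}  # four arithmetic operations


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record."""
    op: str
    stream: Optional[str] = None  # memory stream accessed, if any
    nbytes: int = 0               # access width in bytes
    is_float: bool = False        # floating-point type instruction?


def generate_data(instructions):
    transfers = 0
    flops = 0
    seen_streams = set()
    for ins in instructions:                      # S202: next unsearched instruction
        if ins.op in ("STORE", "LOAD"):           # S203 / S206
            if ins.stream not in seen_streams:    # S204 / S207 (assumed meaning)
                seen_streams.add(ins.stream)
                # A STORE counts double: load into the register, then store back.
                factor = 2 if ins.op == "STORE" else 1
                transfers += factor * ins.nbytes  # S205 / S208
        elif ins.op in ARITHMETIC and ins.is_float:  # S209 / S210
            flops += 1                            # S211
    # S212 to S214: register the counts and the B/F value together.
    return {
        "transfers": transfers,
        "streams": len(seen_streams),
        "flops": flops,
        "instructions": len(instructions),
        "bf": transfers / flops if flops else None,
    }
```

The function expects a list of Instr records for one virtual loop and returns the five values registered for that virtual loop.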
Concrete Example
Next, a concrete example of the above loop fusion will be described.
Extracting Combination
First, the combination extraction unit 16b extracts candidate combinations for the loop fusion, i.e., virtual loops, from the intermediate language of the source code illustrated in
First, the loop 1 is used as the trigger. For instance, the combination extraction unit 16b determines that the loop 1 and the loop 2 are not combinable because these loops have different initial values.
Subsequently, the combination extraction unit 16b determines that the loop 1 and the loop 3 are combinable because these loops have the same initial value, final value, and incremental value, A(j) and A(j+1) are dependent in the forward direction, and the loop 2 and the loop 3 have no dependency. Also, the combination extraction unit 16b determines that the loop 1 and the loop 4 are combinable because these loops have the same initial value, final value, and incremental value, and the loop 4 is dependent on the data of the loop 3 in the forward direction and therefore does not break the data dependency. Further, the combination extraction unit 16b determines that the loop 1 and the loop 5 are not combinable because these loops have different initial values.
Next, the loop 2 is used as the trigger. The combination extraction unit 16b determines that the loop 2 and the loop 3 are not combinable because these loops have different initial values. Subsequently, the combination extraction unit 16b determines that the loop 2 and the loop 4 are not combinable because these loops have different initial values. Further, the combination extraction unit 16b determines that the loop 2 and the loop 5 are combinable because these loops have the same initial value, final value, and incremental value, and the loop 5 does not depend on the data of the loop 4 and does not break the dependency.
Next, the loop 3 is used as the trigger. The combination extraction unit 16b determines that the loop 3 and the loop 4 are combinable because these loops have the same initial value, final value, and incremental value, and C(j) and C(j+1) are dependent in the forward direction. Further, the combination extraction unit 16b determines that the loop 3 and the loop 5 are not combinable because these loops have different initial values. Next, the loop 4 is used as the trigger. The combination extraction unit 16b determines that the loop 4 and the loop 5 are not combinable because these loops have different initial values.
The combinations thus extracted are illustrated in
Subsequently, the combination extraction unit 16b creates a loop fusion determination list illustrated in
Extracting Information
Next, an example of extracting the “number of data transfers, number of floating-point operations, B/F value, number of instructions, and number of streams” from each of the virtual loops will be described.
Extracting Information: Number of Instructions
First, extracting the number of instructions will be described. The information extraction unit 16c counts the number of instructions executed in each of the loop 1 and the loop 3. In
Extracting Information: Number of Floating-Point Operations
Next, extracting the number of floating-point operations will be described. The information extraction unit 16c counts, for each of the loop 1 and the loop 3, the number of floating-point operations based on the respective instructions executed in each of the loops. In
Extracting Information: Number of Streams
Next, extracting the number of streams will be described. Here, a concept of the same stream will be described.
Extracting Information: Number of Data Transfers
Next, extracting the number of data transfers will be described. The information extraction unit 16c counts, for the loop 1 and the loop 3, the number of times that each of the instructions accesses. As for the loop 1, the information extraction unit 16c extracts the LOAD instruction for each of mem01 to mem06. Here, accessing mem01 to mem03 is executed in four-byte units, and accessing mem04 to mem08 is executed in eight-byte units. As a result, the information extraction unit 16c calculates the number of accesses of the LOAD instruction as “4 (mem01)+4 (mem02)+4 (mem03)+8 (mem04)+8 (mem05)+8 (mem06)=36”.
Additionally, the information extraction unit 16c extracts the STORE instructions for mem07 and mem08. Here, accessing mem07 and mem08 is executed in eight-byte units. Further, note that the number of data transfers is twice because the STORE instructions are stored in the area where the STORE instructions have been loaded. As a result thereof, the information extraction unit 16c calculates the number of accesses of the STORE instruction as “2×8 (mem07)+2×8 (mem08)=32”.
Therefore, the information extraction unit 16c calculates the number of data transfers for the loop 1 as follows: “36” for the LOAD instructions + “32” for the STORE instructions = “68”. In the same manner, the information extraction unit 16c calculates the number of data transfers for the loop 3 as “10”. As a result thereof, the information extraction unit 16c calculates the number of data transfers of the virtual loop including the loop 1 and the loop 3 as “68+10=78”, and stores the value in the loop fusion determination list.
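The arithmetic of this example can be reproduced with the short sketch below. The variable names are hypothetical, and the value of 10 for the loop 3 is taken as given from the description rather than derived.

```python
# Access widths in bytes for the LOAD targets of the loop 1.
LOAD_BYTES = {"mem01": 4, "mem02": 4, "mem03": 4,
              "mem04": 8, "mem05": 8, "mem06": 8}
# Access widths in bytes for the STORE targets of the loop 1.
STORE_BYTES = {"mem07": 8, "mem08": 8}

load_total = sum(LOAD_BYTES.values())                     # 4+4+4+8+8+8
store_total = sum(2 * n for n in STORE_BYTES.values())    # each STORE counts twice
loop1_transfers = load_total + store_total
loop3_transfers = 10                                      # value stated in the description
virtual_loop_transfers = loop1_transfers + loop3_transfers
```

The totals come out as 36 for the LOAD instructions, 32 for the STORE instructions, 68 for the loop 1, and 78 for the virtual loop.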
Calculating B/F Value
Next, calculation of the B/F value will be described.
Determining Executability of Loop Fusion
Next, a description will be given for an example of determining executability of the loop fusion, using the results generated by the information extraction unit 16c and the ratio calculation unit 16d.
In the same way, it is assumed that the information extraction unit 16c and the ratio calculation unit 16d generate “78, 130, 0.6, 450, 13” as the “number of data transfers, number of floating-point operations, B/F value, number of instructions, number of streams” for the virtual loop “1, 4”.
Further, it is assumed that the information extraction unit 16c and the ratio calculation unit 16d generate “83, 281, 0.295, 550, 15” as the “number of data transfers, number of floating-point operations, B/F value, number of instructions, number of streams” for the virtual loop “1, 3, 4”.
Also, it is assumed that the information extraction unit 16c and the ratio calculation unit 16d generate “15, 276, 0.054, 350, 13”, as the “number of data transfers, number of floating-point operations, B/F value, number of instructions, number of streams” for the virtual loop “3, 4”.
Further, it is assumed that the information extraction unit 16c and the ratio calculation unit 16d generate “24, 145, 0.165, 540, 10” as the “number of data transfers, number of floating-point operations, B/F value, number of instructions, number of streams” for the virtual loop “2, 5”.
In this case, the determination unit 16e extracts the virtual loop “1, 3” and the virtual loop “1, 4” as the virtual loops whose B/F values satisfy the threshold condition “0.25<B/F value<0.75”. The determination unit 16e determines that the loop fusion is not executable for the rest of the virtual loops.
Then, the determination unit 16e selects the virtual loop “1, 3” having the B/F value closer to the center of the threshold range because the B/F value of the virtual loop “1, 3” is “0.5” and the B/F value of the virtual loop “1, 4” is “0.6”.
Subsequently, the determination unit 16e exempts, from the fusing target, any other virtual loop that includes the loop 1 or the loop 3, which have been determined as the fusing target. More specifically, the determination unit 16e exempts, from the fusing target, the virtual loop “1, 4” that has been extracted for having the B/F value within the threshold. Thus, the determination unit 16e determines the virtual loop “1, 3”, namely the loop 1 and the loop 3, as the fusing target. The fusion unit 16f fuses the loop 1 and the loop 3 thereafter.
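The selection described here can be sketched as a greedy procedure over the values quoted in this example. The function name and the data layout are hypothetical, and the optional thresholds on the number of instructions and the number of streams are not modeled.

```python
# (number of data transfers, number of floating-point operations) per virtual loop,
# using the figures quoted in this example.
VIRTUAL_LOOPS = {
    (1, 3): (78, 156),
    (1, 4): (78, 130),
    (1, 3, 4): (83, 281),
    (3, 4): (15, 276),
    (2, 5): (24, 145),
}


def select_fusion_targets(virtual_loops, lower, upper, ideal):
    # Keep virtual loops whose B/F value lies strictly inside the range.
    eligible = {}
    for loops, (transfers, flops) in virtual_loops.items():
        bf = transfers / flops
        if lower < bf < upper:
            eligible[loops] = bf
    selected = []
    used = set()
    # Visit candidates from closest to farthest from the ideal B/F value and
    # skip any candidate sharing a loop with an already selected target.
    for loops in sorted(eligible, key=lambda k: abs(eligible[k] - ideal)):
        if used.isdisjoint(loops):
            selected.append(loops)
            used.update(loops)
    return selected
```

Calling select_fusion_targets(VIRTUAL_LOOPS, 0.25, 0.75, 0.5) returns [(1, 3)]. Note that, with the figures quoted above, the virtual loop “1, 3, 4” (B/F value of about 0.295) also lies numerically inside the range; in this sketch it is dropped at the overlap step because it shares loops with the selected target, so the final result is unchanged.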
Incidentally, in the case where the number of instructions or the number of streams exceeds the threshold, the determination unit 16e may determine to exempt the virtual loop determined as the fusing target from the fusing target. As a result, it is possible to reduce fusing of the loops containing a large number of instructions and fusing of the loops causing inefficient memory access. Therefore, deterioration of execution performance caused by the loop fusion can be avoided.
As described above, at the time of determining executability of the fusion of loop processing for the plurality of the loops included in a code to be processed, the information processor 10 calculates the ratio of the number of data accesses to the operands after the fusion. The information processor 10 determines that the loop fusion is executable when it is clear that utilization of the system is improved, thereby achieving the effective loop fusion.
In other words, the information processor 10 calculates the B/F value by using the number of floating-point operations and the number of data transfers. Then, the information processor 10 makes groups of the loops fully utilizing arithmetic performance of the processor and fully using the memory bandwidth based on the calculated B/F value. After that, the information processor 10 may improve the execution performance by fusing the grouped loops.
As a result, the information processor 10 can reduce the fusion between the loops having more data accesses than operands as well as the fusion between the loops having more operands than data accesses. Therefore, data access latency and arithmetic processing latency after the loop fusion can be reduced, thereby achieving the effective loop fusion.
While the embodiment of the present invention has been described above, the embodiments are not limited thereto and various modifications may be made besides the above-described embodiment. Accordingly, a different embodiment will be described below.
Target Program
In the first embodiment, an example of extracting various kinds of data for determining executability of the loop fusion from an intermediate language has been described, but the embodiment is not limited thereto. For instance, the data may be extracted from the source program 11a, or a virtual loop may be identified by using the source program 11a and the data may be extracted using the intermediate language 11b.
Operand
In the first embodiment, MULT and ADD are exemplified as floating-point operations, but the floating-point operations are not limited thereto. For instance, processing can be executed even with a SUB instruction or a DIV instruction. Additionally, in the first embodiment, an example in which the operands and the number of data transfers are calculated after generation of the virtual loops has been described. However, the embodiment is not limited thereto, and the virtual loops may be generated after calculating the operands and the number of data transfers.
Optimization
In the first embodiment, an example in which the loop fusion is executed as an example of optimization has been described. However, optimizing processes other than the above-described loop fusion may also be executed.
Hardware
The HDD 103 stores a program and respective DB for operating functions illustrated in
An example of the communication interface 104 is a network interface card. An example of the input device 105 is a keyboard, and the display device 106 is a device for displaying various kinds of information, such as a touch panel or a display.
The CPU 101 performs a process to execute respective functions described in
Thus, the information processor 10 operates as the information processor that performs a compiling method by reading and executing the program. Also, the information processor 10 reads out the above program from the recording medium via a medium reading device, and the functions same as the above-described embodiment can be executed by executing the mentioned program that has been read out. Note that execution of the program is not limited to the information processor 10 according to this embodiment. For instance, when a computer or a server executes the program, or both the computer and the server cooperatively execute the program, the present invention is also applicable in the same manner.
System
Additionally, among the respective processing described in this embodiment, all or any part of the processing that has been described as being performed automatically may be performed manually as well. Alternatively, all or any part of the processing that has been described as being performed manually may be performed automatically by adopting a known method. Moreover, the processing procedure, controlling procedure, concrete names, and information including various kinds of data and parameters described in the above description and drawings may be suitably modified unless otherwise specified.
Also, the respective components in the respective units are illustrated in view of functional concept and are not necessarily physically configured as illustrated in the drawings. In other words, specific forms of separating or integrating the respective units are not limited to those illustrated in the drawings. That is, all or any part of the devices may be functionally or physically separated or integrated in optional units depending on a variety of loads, use conditions, and so on. Furthermore, all or any part of the respective processing functions performed in the respective units may be implemented by a CPU and programs analyzed and executed by the CPU, or implemented as hardware by wired logic.
According to the embodiments, effective loop fusion is executable.
All examples and conditional language recited herein are intended for pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2013-105537 | May 2013 | JP | national |