This application is based upon and claims the benefit of priority from prior Japanese Patent Application P2005-055023 filed on Feb. 28, 2005; the entire contents of which are incorporated by reference herein.
1. Field of the Invention
The present invention relates to an instruction generator, a method for generating an instruction, and a computer program product for executing an application for the instruction generator, capable of generating a single instruction multiple data (SIMD) instruction.
2. Description of the Related Art
The same operations are often executed on a large amount of data in multimedia applications designed for image or audio processing. Accordingly, a processor embedding a multimedia extended instruction of the SIMD type, which executes multiple operations with a single instruction, is used for the purpose of improving processing efficiency. To shorten the development period of a program and to enhance program portability, it is desirable to automatically generate SIMD instructions from a source program described in a high-level language.
A multimedia extended instruction of the SIMD type may require special operation processes as shown in (1) to (5) below: (1) a special operator such as saturating arithmetic, the absolute value of a difference, or the high-order word of a multiplication is involved; (2) different data sizes are mixed; (3) the same instruction can treat multiple sizes in a register-to-register transfer instruction (a MOV instruction), a logical operation, and the like, since a 64-bit operation can be interpreted as eight 8-bit operations or four 16-bit operations; (4) the input size may differ from the output size; and (5) there is an instruction that changes only some of the operands.
A compiler that analyzes instructions in a C-language program applicable to parallel execution and generates SIMD instructions for executing addition-subtraction, multiplication-division, and other operations has been known as a SIMD instruction generating method for a SIMD arithmetic logic unit incorporated in a processor. There is also a known technique for allocating processing of a nested for-loop included in a C-language description to an N-way very long instruction word (VLIW) instruction, thereby allocating the operations of the respective nests to a processor array. A technique for producing a VLIW operator in consideration of sharing multiple instruction operation resources has also been reported.
However, there is no instruction generating method for generating an appropriate SIMD instruction when a SIMD arithmetic logic unit is embedded as a coprocessor, independently of a processor core, for the purpose of speeding up processing. Therefore, it has been desired to establish a method capable of generating an appropriate SIMD instruction for a SIMD coprocessor.
An aspect of the present invention inheres in an instruction generator configured to generate an object code for a processor core and a single instruction multiple data (SIMD) coprocessor cooperating with the processor core, the instruction generator comprising, a storage device configured to store a machine instruction function incorporating both an operation definition defining a program description in a source program targeted for substitution to a SIMD instruction, and the SIMD instruction, a parallelism analyzer configured to analyze the source program so as to detect operators applicable to parallel execution, and to generate parallelism information indicating the set of operators applicable to parallel execution, a SIMD instruction generator configured to perform a matching determination between an instruction generating rule for the SIMD instruction and the parallelism information, and to read the machine instruction function out of the storage device in accordance with a result of the matching determination, and a SIMD compiler configured to generate the object code by substituting the program description coinciding with the operation definition in the source program, for the SIMD instruction, based on the machine instruction function.
Another aspect of the present invention inheres in a method for generating an instruction configured to generate an object code for a processor core and a SIMD coprocessor cooperating with the processor core, the method comprising, analyzing a source program so as to detect operators applicable to parallel execution, generating parallelism information indicating the set of operators applicable to the parallel execution, performing a matching determination between an instruction generating rule for a SIMD instruction and the parallelism information, acquiring a machine instruction function incorporating both an operation definition defining a program description in a source program targeted for substitution to the SIMD instruction, and the SIMD instruction, in accordance with a result of the matching determination, and generating the object code by substituting the program description coinciding with the operation definition in the source program, for the SIMD instruction, based on the machine instruction function.
Still another aspect of the present invention inheres in a computer program product for executing an application for an instruction generator configured to generate an object code for a processor core and a SIMD coprocessor cooperating with the processor core, the computer program product comprising, instructions configured to analyze a source program so as to detect operators applicable to parallel execution, instructions configured to generate parallelism information indicating the set of operators applicable to the parallel execution, instructions configured to perform a matching determination between an instruction generating rule for a SIMD instruction and the parallelism information, instructions configured to acquire a machine instruction function incorporating both an operation definition defining a program description in a source program targeted for substitution to the SIMD instruction, and the SIMD instruction, in accordance with a result of the matching determination, and instructions configured to generate the object code by substituting the program description coinciding with the operation definition in the source program, for the SIMD instruction, based on the machine instruction function.
Various embodiments of the present invention will be described with reference to the accompanying drawings. It is to be noted that the same or similar reference numerals are applied to the same or similar parts and elements throughout the drawings, and the description of the same or similar parts and elements will be omitted or simplified.
As shown in
The instruction generating apparatus shown in
The processor core 71 includes, for instance, a decoder 712, an arithmetic logic unit (ALU) 713, and a data RAM 714, in addition to the RAM 711. A control bus 73 and a data bus 74 connect the processor core 71 and the coprocessor 72.
When the source program stored in the storage device 2 includes repetitive processing as shown in
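As a purely illustrative example of such repetitive processing (the array names, element types, and loop bound are hypothetical and not taken from the figures), the DAG generator 111 may receive a loop of the following form.

```c
/* A purely hypothetical fragment of the kind of repetitive processing meant
   here; the names N, ar, ai, br, bi, xr, xi are illustrative only. */
#define N 64

short ar[N], ai[N], br[N], bi[N];   /* 16-bit signed inputs  */
int   xr[N], xi[N];                 /* 32-bit signed results */

void complex_multiply(void)
{
    for (int i = 0; i < N; i++) {
        xr[i] = ar[i] * br[i] - ai[i] * bi[i];  /* independent 16-bit multiplications      */
        xi[i] = ar[i] * bi[i] + ai[i] * br[i];  /* feeding 32-bit additions and subtractions */
    }
}
```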
Furthermore, the parallelism analyzer 11a shown in
The dependence analyzer 112 traces the DAG and thereby checks the data dependence of the operands of each operation on the DAG. In the DAG, operators and variables are expressed by nodes, and a directed edge between nodes indicates an operand (an input).
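As a minimal sketch, assuming nothing beyond what is stated here about the internal data structures, such a DAG node could be represented as follows.

```c
#include <stddef.h>

/* Minimal, hypothetical representation of a DAG node: operators and variables
   are nodes, and the operand edges point from an operation to its inputs.    */
enum node_kind { NODE_VARIABLE, NODE_OPERATION };

struct dag_node {
    enum node_kind   kind;
    const char      *name;         /* e.g. "ar0", "*", "+"              */
    int              bit_width;    /* e.g. 16 or 32                     */
    int              is_signed;    /* nonzero for signed data           */
    size_t           num_operands; /* 0 for a variable node             */
    struct dag_node *operands[2];  /* directed edges to the input nodes */
};
```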
To be more precise, the dependence analyzer 112 checks whether an input of a certain operation is the output of an operation that is a parallelism target. In addition, when the output of the operation is indicated by a pointer variable, the dependence analyzer 112 checks whether that variable is an input of the operation of the parallelism target. As a consequence, the presence of dependence between the inputs and the outputs of the parallelism candidate operations is analyzed. When two or more arbitrary operations are selected and there is dependence between operands of those operations, it is impossible to process those operations in parallel. Accordingly, a sequence of the operations is determined.
The dependence analyzer 112 starts the analysis from ancestral operation nodes (a node group C2 on the third tier from the bottom) of the DAG shown in
The graph is traced further on the operands ar0 and br0. As indicated with dotted lines in
Next, data dependence between the multiplication ml1 and a multiplication ml3 is checked. Specifically, dependence between the operand ar0 and an operand ar1 is checked by tracing. The multiplication ml1 and the multiplication ml3 are applicable to parallelism if ancestral nodes of the operand ar0 and the operand ar1 are not respective parent nodes (+:xr1, +:xr0) of the multiplication ml3 and the multiplication ml1. However, the ancestral node p1 of the operand ar0 is connected to a child node +:xr1 in
In this way, data dependence is checked similarly for all pairs of multiplications, including the pair of the multiplication ml1 and a multiplication ml4, the pair of the multiplication ml1 and a multiplication ml5, and so forth. When there is no data dependence between the operands of the multiplication ml1 and the multiplication ml5, these two multiplications are deemed applicable to parallelism. Moreover, the multiplication ml1 and the multiplication ml2 are applicable to parallelism as described previously. Therefore, the multiplication ml1, the multiplication ml2, and the multiplication ml5 are deemed applicable to parallelism.
After completing the data dependence analyses for the multiplications, a parallelism analysis is performed on the addition nodes (a node group C1), which are child nodes of the multiplications. The operands of an addition ad1 are the multiplication ml1 and the multiplication ml2, which are applicable to parallelism as described above. Accordingly, it is determined that the multiplication ml1, the multiplication ml2, and the addition ad1 are applicable to compounding. Meanwhile, because the variable xr0 that is the substitution target has the data type int, this addition is regarded as a 32-bit signed addition (hereinafter expressed as "add32s"); that is, the result of the addition is assigned to a variable of type int. However, when the variable xr0 is declared as long, the addition is regarded as a 64-bit signed addition.
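For illustration only, the data type of the substitution target determines the classification in the manner sketched below; the variable names extend the example above and are otherwise hypothetical, and long is assumed to be 64 bits on the target.

```c
short ar0, br0, ar1, br1;  /* 16-bit signed operands (names follow the example above)       */
int   xr0;                 /* int target: the addition is classified as add32s              */
long  yr0;                 /* long target (assumed 64 bits here): 64-bit signed addition    */

void classification_example(void)
{
    xr0 = ar0 * br0 + ar1 * br1;               /* compound multiply-add with add32s  */
    yr0 = (long)ar0 * br0 + (long)ar1 * br1;   /* the same pattern with a 64-bit sum */
}
```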
Thereafter, operands of the addition ad1 and an addition ad2 are traced. An output node of the addition ad2 is connected to the terminal node p1 of the addition ad1. Accordingly, it is determined that these two additions are inapplicable to parallelism. Then, operands are traced similarly on all additions to analyze data dependence between an output and an operand of a candidate operation for parallelism.
Further, the parallelism information generator 113 generates parallelism information as shown in
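The notation of the parallelism information itself appears only in the figures; as one hypothetical in-memory rendering, a "parallel { }" entry could be held as follows, using the operators ml1, ml2, ml5, and ad1 identified above.

```c
/* Hypothetical in-memory form of one "parallel { }" entry of the parallelism
   information; the field names and the textual operator labels are
   illustrative only. */
struct parallel_group {
    const char *operators[8];   /* operators judged applicable to parallel execution */
    int         count;
};

/* e.g. the multiplications ml1, ml2, ml5 and the compoundable addition ad1 */
static const struct parallel_group example_group = {
    { "mul16s:ml1", "mul16s:ml2", "mul16s:ml5", "add32s:ad1" },
    4
};
```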
In the example shown in
The SIMD instruction generator 12 shown in
For example, a size of a 32-bit signed multiplier for executing the 16-bit signed multiplication mul16s in two-way parallel is stored as 800 gates, a size of an adder for realizing the 32-bit signed addition add32s is stored as 500 gates, a size of a 32-bit signed multiplier-adder is stored as 1200 gates, and a size of a 48-bit signed multiplier is stored as 1100 gates.
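For illustration, these gate counts could be held in a small lookup table of the following hypothetical form for the arithmetic logic unit area calculator 121 to consult.

```c
/* The gate counts quoted above, held in a hypothetical lookup table. */
struct alu_area_entry {
    const char *operator_name;
    int         gates;
};

static const struct alu_area_entry alu_area_info[] = {
    { "mul16s_x2",  800 },  /* 32-bit signed multiplier, two-way mul16s */
    { "add32s",     500 },  /* adder for the 32-bit signed addition     */
    { "mad32s",    1200 },  /* 32-bit signed multiplier-adder           */
    { "mul48s",    1100 },  /* 48-bit signed multiplier                 */
};
```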
Moreover, as shown in
The determination module 122 generates the machine instruction function for each "parallel { }" description in the parallelism information, based on an instruction generating rule. As shown in
The RULEmad32s in
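Although the actual notation of the instruction generating rule RULEmad32s and of the corresponding machine instruction function is given only in the figures, the following C-style sketch conveys the idea: the function body plays the role of the operation definition, the comment marked "inline" stands for the inline clause, and the mnemonic mad32s and the operand names are assumptions.

```c
/* Purely illustrative sketch of a machine instruction function such as
   cpmad32s; it is not the notation disclosed in the embodiment. */
static inline int cpmad32s(short a0, short b0, short a1, short b1)
{
    /* operation definition: two 16-bit signed multiplications whose products
       feed one 32-bit signed addition (the ml1, ml2, ad1 pattern above) */
    return a0 * b0 + a1 * b1;
    /* inline { mad32s crd, crs1, crs2 }  <- hypothetical SIMD coprocessor
       instruction substituted for the matched pattern */
}
```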
A parser 131 shown in
A code generator 132 generates SIMD instructions by substituting program descriptions in the source program for SIMD instructions within the range that satisfies a coprocessor area constraint, and then converts the result into assembler descriptions. The syntax tree generated from the source program may include one or more syntax trees identical to syntax trees generated from the operation definitions in the machine instruction functions. The SIMD instruction in the inline clause of the machine instruction function is allocated to each of the matched syntax trees of the source program. However, the hardware scale becomes too large if a SIMD arithmetic logic unit as well as input and output registers of the operator are prepared for each of the machine instruction functions. For this reason, one SIMD arithmetic logic unit is shared by multiple SIMD operations.
For example, when there are three machine instruction functions cpmad32s, two multiplexers (MUX_32_3) for combining three 32-bit inputs into one input and one demultiplexer (DMUX_32_3) for splitting one 32-bit output into three 32-bit outputs are used for one mad32s operator 92, as shown in
Here, assume that there are three or more machine instruction functions cpmad32s to be allocated, that the SIMD arithmetic logic unit is shared, and that the MUX and the DMUX are allocated. The code generator 132 of the SIMD compiler 13 acquires the above-described arithmetic logic unit area macro definition. When the coprocessor area constraint is set to 1350 gates, the code generator 132 allocates three machine instruction functions cpmad32s. In this case, the total number of gates of the signed 32-bit multiplier-adder, the MUX_32_3, and the DMUX_32_3 is calculated as 1200+(50×2)+45=1345, which satisfies the constraint of 1350 gates. On the other hand, when there are three or more machine instruction functions cpmul32s and the coprocessor area constraint is set to 1000 gates, the code generator 132 allocates three machine instruction functions cpmul32s. The number of gates in this case is calculated as 800+(50×2)+45=945, which satisfies the coprocessor area constraint. The details of the code generator 132 will be described later.
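The following short program merely recomputes the check described above; only its structure is an assumption, while the gate counts and the 1350-gate constraint are the values given in the description.

```c
#include <stdio.h>

/* Recomputation of the area check quoted above: one shared mad32s operator,
   two MUX_32_3, and one DMUX_32_3 against a 1350-gate constraint. */
int main(void)
{
    const int mad32s_gates = 1200;
    const int mux_32_3     = 50;
    const int dmux_32_3    = 45;
    const int constraint   = 1350;  /* coprocessor area constraint in gates */

    const int total = mad32s_gates + 2 * mux_32_3 + dmux_32_3;  /* = 1345 */

    printf("total = %d gates, constraint %s\n", total,
           total <= constraint ? "satisfied" : "exceeded");
    return 0;
}
```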
The storage device 2 includes a source program storage 21, an arithmetic logic unit area information storage 22, a machine instruction function storage 23, a coprocessor area constraint storage 24, a parallelism information storage 25, a SIMD instruction information storage 26, and an object code storage 27. The source program storage 21 previously stores the source program. The arithmetic logic unit area information storage 22 stores the arithmetic logic unit area information. The machine instruction function storage 23 previously stores sets of the instruction generating rule and the machine instruction function. The coprocessor area constraint storage 24 previously stores the coprocessor area constraint. The parallelism information storage 25 stores the parallelism information generated by the parallelism information generator 113. The SIMD instruction information storage 26 stores the machine instruction function received from the determination module 122. The object code storage 27 stores the object code including the SIMD instructions generated by the code generator 132.
The instruction generator shown in
A keyboard, a mouse, a recognition unit such as an optical character reader (OCR), a graphical input unit such as an image scanner, and/or a special input unit such as a voice recognition device can be used as the input unit 3 shown in
Next, the procedure of a method for generating an instruction according to the first embodiment of the present invention will be described by referring to a flow chart shown in
In step S01, the DAG generator 111 shown in
In step S02, the dependence analyzer 112 analyzes data dependence of an operand on each operation on the DAG. That is, the dependence analyzer 112 checks whether an input of a certain operation is an output of an operation of a parallelism target.
In step S03, the parallelism information generator 113 generates the parallelism information for operators having no data dependence. The generated parallelism information is stored in the parallelism information storage 25.
In step S04, the arithmetic logic unit area calculator 121 calculates the entire arithmetic logic unit area by reading the circuit scales of the operators required for executing the respective parallelism information out of the arithmetic logic unit area information storage 22.
In step S05, the determination module 122 performs the matching determination between the instruction generating rules stored in the machine instruction function storage 23 and the parallelism information, and reads the machine instruction function out of the machine instruction function storage 23 in accordance with a result of the matching determination.
In step S06, the parser 131 acquires the source program from the source program storage 21, and executes a lexical analysis and a syntax analysis to the source program. As a result, the source program is converted into a syntax tree.
In step S07, the code generator 132 compares the syntax tree generated in step S06 with the operation definition of each machine instruction function. The code generator 132 replaces the syntax tree with the instruction sequence of the inline clause when the syntax tree and the operation definition correspond.
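As a minimal sketch of the comparison in step S07 (the node layout and the exact matching policy are assumptions), a structural tree match could proceed as follows.

```c
#include <string.h>

/* Minimal sketch of the structural comparison performed in step S07. */
struct ast_node {
    const char      *op;           /* "+", "*", or a leaf label such as "var" */
    int              num_children;
    struct ast_node *children[2];
};

static int trees_match(const struct ast_node *src, const struct ast_node *def)
{
    if (strcmp(src->op, def->op) != 0 || src->num_children != def->num_children)
        return 0;
    for (int i = 0; i < src->num_children; i++)
        if (!trees_match(src->children[i], def->children[i]))
            return 0;
    return 1;  /* matched: replace with the instruction sequence of the inline clause */
}
```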
Next, the procedure of the instruction generating rule determination process shown in
In step S51, the determination module 122 reads the “parallel { }” description of the parallelism information out of the parallelism information storage 25.
In step S52, the determination module 122 determines the conformity between the instruction generating rule and the "parallel { }" description. The procedure goes to step S54 when the instruction generating rule and the "parallel { }" description correspond. When they do not correspond, the procedure goes to step S53, and the next instruction generating rule is selected.
In step S54, the determination module 122 selects a machine instruction function corresponding to the instruction generating rule, and adds an arithmetic logic unit area macro definition to the machine instruction function.
In step S55, the determination module 122 determines whether the matching determination about all “parallel { }” descriptions is completed. When it is determined that the matching determination about all “parallel { }” descriptions is not completed, the next “parallel { }” description is acquired in step S51.
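The loop structure of steps S51 to S55 can be summarized by the following sketch, in which every helper function is a hypothetical stand-in for an operation of the determination module 122.

```c
#include <stdio.h>

/* Schematic rendering of steps S51 to S55; the helpers are hypothetical stubs. */
static int rule_matches(int rule, int group)                 /* S52: conformity check */
{
    return rule == group % 2;                                /* placeholder criterion */
}

static void select_machine_instruction_function(int rule)    /* S54 */
{
    printf("select machine instruction function for rule %d\n", rule);
}

static void add_area_macro_definition(int rule)              /* S54 */
{
    printf("add arithmetic logic unit area macro definition for rule %d\n", rule);
}

void determine_instruction_generating_rules(int num_groups, int num_rules)
{
    for (int g = 0; g < num_groups; g++) {          /* S51/S55: each "parallel { }" description */
        for (int r = 0; r < num_rules; r++) {       /* S53: otherwise try the next rule         */
            if (rule_matches(r, g)) {
                select_machine_instruction_function(r);
                add_area_macro_definition(r);
                break;                              /* S52: rule and description correspond     */
            }
        }
    }
}
```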
Next, the procedure of the object code generation process will be described by referring to a flow chart shown in
In step S71, the code generator 132 generates the object code (machine code) from the syntax tree. The code generator 132 also converts the operation definitions in the machine instruction functions stored in the SIMD instruction information storage 26 into machine code.
In step S72, the code generator 132 determines whether the machine code sequence generated from the source program corresponds to or resembles the converted operation definition. When it is determined that the machine code sequence corresponds to or resembles the converted operation definition, the procedure goes to step S73. When it is determined that the machine code sequence does not correspond to or resemble the converted operation definition, the procedure goes to step S74.
In step S73, the code generator 132 replaces the machine code sequence corresponding or similar to the converted operation definition with the SIMD instruction in the inline clause. The code generator 132 cumulatively adds the arithmetic logic unit area required for executing the substituted SIMD instruction, based on the arithmetic logic unit area macro definition.
In step S74, the code generator 132 determines whether the matching determination between all the machine code generated from the source program and the converted operation definitions is completed. When it is determined that the matching determination is completed, the procedure goes to step S75. When it is determined that the matching determination is not completed, the procedure returns to step S72.
In step S75, the code generator 132 determines whether a result of the cumulative addition is less than or equal to the coprocessor area constraint. When it is determined that the result of the cumulative addition is less than or equal to the coprocessor area constraint, the procedure is completed. When it is determined that the result of the cumulative addition is more than the coprocessor area constraint, the procedure goes to step S76.
In step S76, the code generator 132 determines whether an operator can execute a plurality of SIMD instructions. That is, the code generator 132 determines whether the coprocessor area constraint can be satisfied by sharing ALUs. When it is determined that the coprocessor area constraint can be satisfied by sharing ALUs, the procedure is completed. When it is determined that the coprocessor area constraint cannot be satisfied by sharing ALUs, the procedure goes to step S77. In step S77, an error message is reported to the user, and the procedure is completed.
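A compact rendering of the decision in steps S75 to S77 is given below; the parameterization of the shared-area estimate is an assumption.

```c
#include <stdio.h>

/* Schematic form of steps S75 to S77: compare the cumulative operator area
   with the coprocessor area constraint, and fall back to sharing the SIMD
   arithmetic logic unit. */
int coprocessor_area_ok(int accumulated_gates, int shared_gates, int constraint_gates)
{
    if (accumulated_gates <= constraint_gates)   /* S75: constraint already satisfied */
        return 1;
    if (shared_gates <= constraint_gates)        /* S76: satisfied by sharing ALUs    */
        return 1;
    fprintf(stderr, "error: coprocessor area constraint cannot be satisfied\n");
    return 0;                                    /* S77: report an error to the user  */
}
```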
As described above, according to the first embodiment, it is possible to provide the instruction generating apparatus and the instruction generating method capable of generating the appropriate SIMD instructions for the SIMD coprocessor. Moreover, the determination module 122 is configured to acquire the machine instruction functions by using, as parameters, the name of the instruction applicable to parallelism, the number of bits of data to be processed by the instruction, and information on the presence or absence of a sign. In this way, the code generator 132 can generate the SIMD instructions, based on the acquired machine instruction functions, so as to retain the accuracy required for an operator of the coprocessor and the accuracy attributable to a restriction of the description of the program language. Meanwhile, the code generator 132, which allocates the SIMD instructions, can allocate them in consideration of sharing the SIMD arithmetic logic unit so as to satisfy the area constraint of the coprocessor.
As shown in
Next, the procedure of a method for generating an instruction according to the second embodiment will be described with reference to a flow chart shown in
In step S10, the compiler 10 shown in
In step S01, the DAG generator 111 performs a lexical analysis of the assembly description and then executes constant propagation, constant folding, dead code elimination, and the like to generate the DAG.
As described above, according to the second embodiment, the DAG generator 111 can generate the DAG from the assembly description. Therefore, it becomes possible to deal with the C++ language or the FORTRAN language without being limited to the C language.
Various modifications will become possible for those skilled in the art after receiving the teachings of the present disclosure without departing from the scope thereof.
For example, the instruction generator according to the first and second embodiments may acquire data, such as the source program, the arithmetic logic unit area information, the instruction generating rule, the machine instruction function, and the coprocessor area constraint, via a network. In this case, the instruction generator includes a communication controller configured to control a communication between the instruction generator and the network.