The present invention relates to Very Long Instruction Word (VLIW) processors, and especially to variable length instruction bundle processors.
Such processors include a CPU having multiple processing units for executing multiple instructions in parallel. VLIW instruction bundles may include multiple elementary instructions targeting the different processing units of the CPU. Thus an instruction bundle for such a processor may reach a length typically between 64 and 128 bits, or even 256 bits or more.
In some VLIW processors, the distribution of elementary instructions of an instruction bundle between the CPU processing units is always performed in the same order. A processing unit that has no instruction to execute in a given bundle must then be given a NOP (no operation) instruction in its slot.
It follows that a substantial portion of the memory containing the program may be occupied unnecessarily by NOP instructions, and that a substantial portion of the data flow on the processor instruction bus carries such NOP instructions. These drawbacks lead to poor use of CPU processing resources and to increased power consumption.
Some processors are designed to not constrain allocations of VLIW bundle instructions to processing units. With such a configuration, it is not necessary to insert NOP instructions in the instruction bundles. To this end, each instruction input of each processing unit comprises a multiplexer whose inputs are connected to input registers receiving the elementary instructions of the VLIW bundle.
It is possible to significantly reduce the number of multiplexer inputs by ordering the elementary instructions in the VLIW instruction bundle according to ranks assigned to the processing units and by padding the shorter elementary instructions (shorter than the maximum number of syllables) with NOP instructions, so that all the elementary instructions in the instruction bundle, padded where necessary, have the same length, i.e. the same number of syllables.
The need to insert NOP instructions in the VLIW instruction bundles can be avoided by maintaining an order of elementary instructions in the instruction bundle, corresponding to the order assigned to the processing units.
It is desirable to simplify the input circuit of the processing units of a VLIW processor without penalizing the use of CPU processing resources and the use of program memory, and without increasing power consumption. It is also desirable to improve the efficiency of this input circuit for transmitting the elementary instructions of an instruction bundle to the various processing units.
Embodiments relate to a compiling method of a program for a processor having multiple processing units capable of executing instruction bundles having several elementary instructions, the method comprising steps of gathering in an instruction bundle multiple elementary instructions each targeting a respective processing unit of the processor, each elementary instruction including one or more syllables each having a rank in the elementary instruction, and distributing the syllables of the elementary instructions in the instruction bundle by groups of syllables of same rank, a group of syllables of first rank of all elementary instructions of the instruction bundle being inserted in the instruction bundle before a group of syllables of second rank of all the elementary instructions of the instruction bundle, the syllables in each group being ordered according to the target processing unit of each syllable.
In an embodiment, the syllables of same rank are ordered in the instruction bundle so that a high priority processing unit receives first its respective elementary instruction syllable.
In an embodiment, each syllable of each instruction bundle includes an end code indicating whether the syllable is the last in the instruction bundle.
In an embodiment, each syllable of each instruction bundle includes a processing unit code indicating the processing unit to which the syllable is to be transmitted.
In an embodiment, each syllable of each instruction bundle includes a code indicating the rank of the syllable.
In an embodiment, each instruction bundle includes fewer syllables than the number of processing units times the maximum number of syllables per elementary instruction.
In an embodiment, the syllables of all elementary instructions have the same length.
Embodiments also relate to a processor comprising multiple processing units for processing multiple elementary instructions in parallel, the elementary instructions including one or more syllables, each having a rank in the elementary instruction, and an input circuit configured to receive an instruction bundle including multiple elementary instructions, and to transmit to the processing units all syllables of first rank of the elementary instructions of the instruction bundle faster than syllables of second rank of the elementary instructions of the instruction bundle, the syllables of same rank being ordered according to the target processing unit of each syllable.
In an embodiment, the syllables of the instruction bundle are distributed in groups of syllables of same rank, a group of syllables of first rank being placed at the beginning of the instruction bundle, wherein the syllables in each group are ordered in an order assigned to the processing units.
In an embodiment, the input circuit is configured to transmit to the processing units the syllables of each group of the instruction bundle, respecting the order assigned to the processing units.
In an embodiment, the input circuit includes registers each configured to receive a respective syllable of the instruction bundle, the number of registers being at most equal to the number of processing units of the processor, times a maximum number of syllables per elementary instruction.
In an embodiment, the number of registers is less than the number of processing units of the processor, times a maximum number of syllables per elementary instruction.
In an embodiment, each syllable of each instruction bundle includes an end code indicating whether the syllable is the last in the instruction bundle.
In an embodiment, each syllable of each instruction bundle includes a processing unit code indicating the processing unit to which the syllable is to be transmitted.
In an embodiment, each syllable of each instruction bundle includes a code indicating the rank of the syllable.
Other advantages and features will become more clearly apparent from the following description of particular embodiments of the invention provided for exemplary purposes only and represented in the appended drawings, in which:
According to an embodiment, VLIW instruction bundles, each comprising one or more elementary instructions, each formed of one or more syllables, are produced by a compiler by ordering the elementary instructions in each VLIW instruction bundle starting with the first syllables of the elementary instructions. Thus, each VLIW instruction bundle includes the syllable of first rank of all the elementary instructions of the bundle, followed by possible syllables of second rank, and so on. The syllables of each rank are ordered according to an order assigned to the processing units. Thus, only the order of the syllables is imposed in a VLIW instruction bundle, but not the strict position of the syllables in the bundle.
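The ordering rule described above can be illustrated with a minimal sketch. It assumes a hypothetical representation in which each elementary instruction is a list of syllable labels keyed by its target unit; the names `pack_bundle` and `UNIT_ORDER` are illustrative only and do not come from the description.

```python
from typing import Dict, List, Sequence


def pack_bundle(instrs: Dict[str, List[str]], unit_order: Sequence[str]) -> List[str]:
    """Lay out syllables rank by rank; within a rank, follow unit_order."""
    max_rank = max((len(s) for s in instrs.values()), default=0)
    bundle: List[str] = []
    for rank in range(max_rank):
        for unit in unit_order:
            syllables = instrs.get(unit, [])
            if rank < len(syllables):
                # No NOP is emitted for an absent unit or a missing syllable.
                bundle.append(syllables[rank])
    return bundle


# Hypothetical order assigned to four processing units.
UNIT_ORDER = ["PU1", "PU2", "PU3", "PU4"]

instrs = {
    "PU3": ["P3[1]"],
    "PU1": ["P1[1]", "P1[2]"],
    "PU4": ["P4[1]", "P4[2]"],
}
bundle = pack_bundle(instrs, UNIT_ORDER)
print(bundle)  # ['P1[1]', 'P3[1]', 'P4[1]', 'P1[2]', 'P4[2]']
```

Only the relative order of the syllables is fixed here: the position of `P3[1]` in the bundle depends on which other units are used, which is why the input circuit needs multiplexers rather than fixed wiring.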
The register R21 is connected directly to the first input of the processing unit PU1. The multiplexer MX71 is connected by its inputs to registers R22-R25, and by its output to the second input of the processing unit PU1. The multiplexer MX62 is connected by its inputs to the registers R21 and R22, and by its output to the first input of the processing unit PU2. The multiplexer MX72 is connected by its inputs to registers R22-R26, and by its output to the second input of the processing unit PU2. The multiplexer MX63 is connected by its inputs to registers R21-R23, and by its output to the first input of the processing unit PU3. The multiplexer MX73 is connected by its inputs to registers R22-R27, and by its output to the second input of the processing unit PU3. The multiplexer MX64 is connected by its inputs to registers R21-R24, and by its output to the first input of the processing unit PU4. The multiplexer MX74 is connected by its inputs to registers R22-R28, and by its output to the second input of the processing unit PU4.
The number of inputs of the multiplexers MX62-MX64, MX71-MX74 is thus limited to 31 (2+3+4 for MX62-MX64 and 4+5+6+7 for MX71-MX74), and it is not necessary to insert NOP instructions in the VLIW instruction bundles. In addition, the multiplexers MX62-MX64 receiving the first syllables of the instruction bundles include a reduced number of inputs, and the first syllable input of the processing unit PU1 is directly connected to the register R21. It follows that the processing units PU1-PU4 receive the first syllables before the second syllables. Since the second syllables are not required for decoding the first syllables, the input circuit INC has the feature of initiating the processing of the elementary instructions of the instruction bundle, and thus activates all the processing units PU1-PU4 involved, before transmission to the processing units of the possible subsequent syllables of the elementary instructions, which may be present in registers R22-R28.
minimum: Pj[1] with j=1, 2, 3 or 4
maximum: P1[1]-P2[1]-P3[1]-P4[1]-P1[2]-P2[2]-P3[2]-P4[2]
Where Pj[1] represents the first syllable (or syllable of rank 1) of an elementary instruction Pj executable by processing unit PUj (j=1, 2, 3 or 4), and Pj[2] represents a second syllable (syllable of rank 2) of the elementary instruction Pj. All possible contents of an instruction bundle can be obtained by removing one or more syllables from the maximum content, maintaining the order of the syllables of the maximum content, assuming an instruction bundle cannot have a second syllable Pj[2] without the corresponding first syllable Pj[1].
The possible contents of registers R21-R28 are shown in Table 1 below:
The diagonal from top left to bottom right of Table 1 is the maximum content of the instruction bundle IW. Each column of Table 1 represents one of multiplexers MX62-MX64 and MX71-MX74, indicated in the last line MUX of the table. Each column of Table 1 defines by its non-empty boxes the registers R21-R28 at the input of the corresponding multiplexer MX62-MX64 and MX71-MX74.
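The register sets feeding each multiplexer can be derived mechanically from the ordering rule. The following sketch, given for illustration only, enumerates every admissible bundle and records the positions at which each syllable can land. The function name `mux_inputs` and the use of 1-based positions (position n standing for register R2n) are assumptions made for this example.

```python
from itertools import product
from typing import Dict, List, Set, Tuple


def mux_inputs(max_syllables: List[int]) -> Dict[Tuple[int, int], Set[int]]:
    """Map (unit, rank) to the set of 1-based bundle positions it can occupy.

    max_syllables[u] is the longest elementary instruction accepted by unit u+1.
    """
    positions: Dict[Tuple[int, int], Set[int]] = {}
    # lengths[u] == 0 means that unit u+1 has no instruction in the bundle;
    # a second syllable is never present without its first syllable.
    for lengths in product(*(range(m + 1) for m in max_syllables)):
        pos = 0
        for rank in range(1, max(max_syllables) + 1):
            for unit, n in enumerate(lengths, start=1):
                if n >= rank:
                    pos += 1
                    positions.setdefault((unit, rank), set()).add(pos)
    return positions


# Four units, each accepting one- or two-syllable instructions.
table = mux_inputs([2, 2, 2, 2])
for (unit, rank), regs in sorted(table.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(f"PU{unit} input {rank}: positions {sorted(regs)}")
```

Under these assumptions the first syllable of PU1 can only sit at position 1, which corresponds to a direct connection, while the second syllable of PU4 can sit anywhere from position 2 to position 8.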
According to an embodiment, the order assigned to the processing units is defined according to their time criticality. For example, a branch control unit BCU may be considered the most critical in time because it needs to know as soon as possible the instruction to execute and the associated operands, in order to execute the branch quickly and thus minimize the time necessary for executing the branch operation.
According to an embodiment, each syllable includes a processing unit code specifying which processing unit is to process it.
The syllable Pj[2] of elementary instruction Pj includes an instruction field IT1, the field EW indicating whether the syllable is the last of the VLIW instruction bundle, the processing unit code field PUC0 set to 0, and a processing unit code field PUC1 for the second syllable. The processing unit code fields PUC0, PUC1 may be used to control the multiplexers MX62-MX64 and MX71-MX74.
Of course, other fields may be provided in the syllables Pj[i].
The transfer of all the elementary instructions from registers R21-R28 to the processing units PU1-PU4 is triggered by writing the last syllable of an instruction bundle in one of the registers (presence of an EW field indicating the last syllable of the bundle). The arrival time of a syllable in a processing unit depends on the size of the corresponding multiplexer MX62-MX64, MX71-MX74. Therefore, the units PU1-PU4 may first receive a first syllable, in the order PU1, PU2, PU3, PU4. Since the multiplexers MX71 and MX64 have the same number of inputs, the unit PU1 can receive its second syllable at the same time as the unit PU4 receives its first syllable. The processing units PU2 to PU4 may then successively receive a second syllable.
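The triggering on the end code can be modelled in software as follows. This is a behavioural sketch only: it does not model propagation times through the multiplexers, and the `Syllable` record with its `unit`, `rank`, `last` and `payload` fields is a hypothetical stand-in for the actual syllable format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Syllable:
    unit: int     # target processing unit, 1-based (hypothetical field)
    rank: int     # 1 for a first syllable, 2 for a second syllable
    last: bool    # models the EW end-of-bundle flag
    payload: str  # stands in for the instruction field


def bundles(stream: Iterable[Syllable]) -> Iterator[List[Syllable]]:
    """Fill the input registers and release them once the EW flag is seen."""
    regs: List[Syllable] = []
    for s in stream:
        regs.append(s)
        if s.last:
            yield regs
            regs = []


def dispatch(bundle: List[Syllable]) -> Dict[Tuple[int, int], str]:
    """Route each syllable to a (unit, input) pair, the role of the multiplexers."""
    return {(s.unit, s.rank): s.payload for s in bundle}


stream = [
    Syllable(1, 1, False, "P1[1]"),
    Syllable(3, 1, False, "P3[1]"),
    Syllable(1, 2, True, "P1[2]"),
    Syllable(2, 1, True, "P2[1]"),
]
routed = [dispatch(b) for b in bundles(stream)]
print(routed)
```

The stream above contains two bundles: a three-syllable bundle targeting PU1 and PU3, followed by a single-syllable bundle targeting PU2.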
According to an embodiment, the processor includes one or more branch control units BCU, one or more arithmetic and logic units ALU, one or more load and store units LSU, and one or more fixed or floating point multiply and accumulate units MAU.
BC[1]-A0[1]-A1[1]-MA[1]-LS[1]-A0[2]-A1[2]-MA[2]-LS[2]
In which BC[1], A0[1], A1[1], MA[1] and LS[1] are first-rank syllables of elementary instructions respectively targeting the units BCU, ALU0, ALU1, MAU and LSU, and A0[2], A1[2], MA[2] and LS[2] are second-rank syllables respectively targeting the units ALU0, ALU1, MAU and LSU.
The possible contents of the registers R31 to R39 are shown in Table 2 below:
Each column of Table 2 defines by non-empty cells the registers R31 to R39 at the input of one of the multiplexers MX82-MX85 and MX92-MX95, as identified in the last line MUX of the table.
The register R31 is connected directly to the single input of the processing unit BCU. The multiplexer MX82 is connected by its inputs to the registers R31 and R32, and by its output to the first input of the processing unit ALU0. The multiplexer MX92 is connected by its inputs to the registers R32-R36, and by its output to the second input of the processing unit ALU0. The multiplexer MX83 is connected by its inputs to the registers R31-R33, and by its output to the first input of the processing unit ALU1. The multiplexer MX93 is connected by its inputs to the registers R32-R37, and by its output to the second input of the processing unit ALU1. The multiplexer MX84 is connected by its inputs to the registers R31-R34, and by its output to the first input of the processing unit MAU. The multiplexer MX94 is connected by its inputs to the registers R32-R38, and by its output to the second input of the processing unit MAU. The multiplexer MX85 is connected by its inputs to the registers R31-R35, and by its output to the first input of the processing unit LSU. The multiplexer MX95 is connected by its inputs to the registers R32-R39, and by its output to the second input of the processing unit LSU. The multiplexers MX82-MX85 and MX92-MX95 have a total of 40 entries.
Since the input of the BCU unit is directly connected to the register R31, the processing of the branch instructions may start first. Due to the number of multiplexer inputs to which they are connected, the processing units ALU0, ALU1, MAU and LSU may then successively receive a first syllable. The ALU0 unit can receive a second syllable while the LSU receives a first syllable. The units ALU1, MAU and LSU may then successively receive a second syllable.
According to an embodiment, the compiler generates instruction bundles IW having eight syllables at most. As a result, the register R39 may be omitted and the multiplexer MX95 has only seven inputs instead of eight. Thus, the multiplexers of the input circuit INC′ have 3 control bits at most. This arrangement also reduces the propagation time of a second syllable to the LSU processing unit. This simplification of the input circuit INC′ has negligible side effects because the probability for the compiler to generate an instruction bundle IW with nine syllables is low.
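The effect of capping the bundle length on the multiplexer sizes can be checked with a short calculation. The sketch below takes the register ranges listed above for the input circuit INC′, expressed as 1-based positions within R31-R39; the helper names `input_counts` and `select_bits` are illustrative only.

```python
# (first position, last position) feeding each multiplexer of INC',
# where position n stands for register R3n.
MUX_RANGES = {
    "MX82": (1, 2), "MX83": (1, 3), "MX84": (1, 4), "MX85": (1, 5),
    "MX92": (2, 6), "MX93": (2, 7), "MX94": (2, 8), "MX95": (2, 9),
}


def input_counts(ranges, max_syllables):
    """Inputs per multiplexer when bundles hold at most max_syllables syllables."""
    return {name: min(hi, max_syllables) - lo + 1 for name, (lo, hi) in ranges.items()}


def select_bits(n_inputs):
    """Control bits needed to choose one of n_inputs."""
    return (n_inputs - 1).bit_length()


full = input_counts(MUX_RANGES, 9)
capped = input_counts(MUX_RANGES, 8)
print(sum(full.values()), sum(capped.values()))                      # 40 39
print(capped["MX95"], max(select_bits(n) for n in capped.values()))  # 7 3
```

With bundles limited to eight syllables, only MX95 changes: it loses the input that would have come from R39, and the widest multiplexer of the circuit then has seven inputs.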
Since the processor has two similar processing units ALU0, ALU1, the PUC0 field need not differentiate these two processing units, whereby this field may be encoded with only 2 bits for four types of processing units. If an instruction bundle includes an elementary instruction for one of the ALU units, it will be forwarded to the unit ALU0. The unit ALU1 will be activated only if the instruction bundle IW includes two elementary instructions to be processed by an ALU. Since the BCU unit receives no second syllable, only the four processing units ALU0, ALU1, MAU, LSU are likely to receive a second syllable. The PUC1 field may therefore also include only 2 bits. In case the syllables contain PUC and RNG fields, only 4 bits are needed to encode a processing unit number (3 bits for five processing units) and the rank of the syllable (1 bit for one or two-syllable elementary instructions).
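As an illustration of the 4-bit variant with PUC and RNG fields, the following sketch packs a processing unit number and a rank into a single tag. The bit positions (PUC in the upper three bits, RNG in the lowest bit) and the numbering of the units are assumptions chosen for this example; the description does not specify them.

```python
# Hypothetical numbering of the five processing units (fits in 3 bits).
UNITS = ["BCU", "ALU0", "ALU1", "MAU", "LSU"]


def encode_tag(unit: str, rank: int) -> int:
    """Pack a unit number (3 bits) and a rank (1 bit) into a 4-bit tag."""
    if rank not in (1, 2):
        raise ValueError("rank must be 1 or 2")
    puc = UNITS.index(unit)  # 0..4
    rng = rank - 1           # 0 for a first syllable, 1 for a second
    return (puc << 1) | rng


def decode_tag(tag: int):
    """Recover (unit, rank) from a 4-bit tag produced by encode_tag."""
    return UNITS[tag >> 1], (tag & 1) + 1


tag = encode_tag("MAU", 2)
print(tag, decode_tag(tag))  # 7 ('MAU', 2)
```

Every valid (unit, rank) pair maps to a value below 16 and decodes back to itself, which is consistent with the stated budget of 4 bits.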
It will be apparent to those skilled in the art that the present invention is susceptible to various alternatives and applications. In particular, the invention is not limited to instruction bundles in which syllables are grouped by rank and ordered according to ranks assigned to the processing units. It is sufficient that the transfer of syllables to the processing units is performed starting with the first rank of the syllables. Placing the syllables of first rank at the start of the instruction bundles merely allows the syllables of first rank located in the input registers to be transmitted first to the processing units.
Moreover, it is not necessary that each syllable of each instruction bundle include an end code indicating whether the syllable is the last in the instruction bundle. Indeed, each instruction bundle may include, for example, a header or tail field specifying the number of syllables in the instruction bundle. The fields PUC0, PUC1 specifying a processing unit to which a syllable is to be transmitted and the rank RNG of each syllable, or an input number of a processing unit to which each syllable is to be transmitted, can also be inserted in such a header or tail field of an instruction bundle.
Number | Date | Country | Kind |
---|---|---|---|
1454638 | May 2014 | FR | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/FR2015/051134 | 4/27/2015 | WO | 00 |