The present invention relates to a method and apparatus for decoding a sequence of at least two instructions of a data processing program into a sequence of code words used to control a data path, e.g. of a digital signal processor or the like.
Computer architectures consist of a fixed data path which is controlled by a control word. This control word consists of several parts, each of which controls parts of the data path. These parts may comprise multiplexers, register addresses, operation codes for arithmetic logic units (ALUs) or other functional units, etc. Each instruction issued generates a new control word, usually by means of an instruction decoder which translates the binary format of the instruction into the corresponding control word(s), or by means of a micro store, i.e. a memory which contains the control words directly.
When a processor executes a loop function, the corresponding procedure may be interpreted not only as a sequence of instructions, but also as a pipelined circuit which is emulated in time. In such a pipelined circuit, data moves down edges of a data flow graph at each clock period, wherein the edges of the graph are implemented by registers. On the other hand, the data flow graph may be implemented by a program where the different operations are executed in a certain order, as determined by an instruction scheduler. The instructions may either be executed one-by-one or with some parallelism involved, in case the processor architecture supports such parallel processing. To execute these operations, respective control words in the processor will configure a part of the data flow graph at each instruction, while the edges are implemented by general processor registers.
However, the cost of embedded systems in terms of power dissipation and area is to a large extent determined by the memories required for program storage. High-level compilation in general leads to a penalty in terms of code size compared to handcrafted assembly code. To keep the code size within acceptable bounds, efficient instruction encoding and code compaction are of key importance in the design of a new processor core. Furthermore, exploitation of the different constraints regarding performance of different parts of a program allows for a trade-off between code density and performance.
Document U.S. Pat. No. 5,862,384 discloses a compiler optimizing device provided with an optimizing section arranged to transfer loop invariant expressions out of the loop based on update variables recognized in an update variable control table. This reduces the total number of executed statements and leads to an effective optimization of processing and accelerated processing speed.
However, certain instruction sequences, such as those representing a loop, will result in a sequence of control words which have a fixed part and a standard part. Although the fixed part of the control word does not change, a configuration memory context is reloaded during a large number of cycles determined by the execution time of such an invariant sequence. This time penalty leads to decreased processing performance and increased power consumption.
It is therefore an object of the present invention to provide a method and apparatus for decoding an instruction sequence, by means of which an increased processing performance and a decreased power consumption can be obtained.
This object is achieved by a method as claimed in claim 1 and by an apparatus as claimed in claim 20.
Accordingly, a way in between the two extremes of fixed circuit implementation and program-based execution implementation is provided to configure a part of the data flow graph before entering the loop, so that it will be fixed during the loop. This corresponds to fixing a part of the data path, i.e. fixing some parts of the control word. Since this leads to a narrower control word to be issued each cycle, both power consumption and (micro) code size decrease. This technique can be regarded as a kind of control word hoisting, wherein the compiler term “hoisting” means moving loop-invariant control signals (e.g. opcodes, addresses, etc.) out of the loop. However, with respect to the compiler, this technique has little impact, as the instruction schedule is not altered. Only the encoding of the instructions within the respective code word sequence, e.g. loop, is affected. In particular, the invariant portion, i.e. the fixed part, of the control word no longer comes from the instruction decoder or the micro store, but may be derived elsewhere, e.g. from a separate memory. As an alternative, re-configurable hardware, e.g. embedded Field Programmable Gate Arrays (FPGAs), can be used to take care of the wiring of the invariant part.
Due to the fact that the invariant part of the control word is no longer redundantly replicated in every control word, the necessary micro code memory size is reduced.
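By way of a non-limiting illustration, the hoisting of the invariant part may be sketched as follows. The sketch assumes an 8-bit control word and a hypothetical four-word loop; the function name `hoist_invariant_bits` and the example values are illustrative only and do not appear in the described embodiments.

```python
def hoist_invariant_bits(control_words, width):
    """Split a loop's control words into a hoisted invariant part
    and narrower per-cycle variant words."""
    full = (1 << width) - 1
    first = control_words[0]

    # A bit position is variant if it differs from the first word anywhere.
    changed = 0
    for word in control_words:
        changed |= word ^ first
    invariant_mask = ~changed & full          # 1 = loop-invariant position
    invariant_value = first & invariant_mask  # hoisted out of the loop

    # Pack the remaining (variant) bits of each word, lowest position first.
    variant_positions = [p for p in range(width)
                         if not (invariant_mask >> p) & 1]
    variant_words = []
    for word in control_words:
        packed = 0
        for k, pos in enumerate(variant_positions):
            packed |= ((word >> pos) & 1) << k
        variant_words.append(packed)
    return invariant_mask, invariant_value, variant_words


# Hypothetical loop body: only bit positions 2 and 3 ever change.
loop_words = [0b10110010, 0b10110110, 0b10111010, 0b10110010]
mask, value, variant = hoist_invariant_bits(loop_words, width=8)

# Storage for the loop body, ignoring the mask itself (an assumption).
variant_width = 8 - bin(mask).count("1")
bits_before = 8 * len(loop_words)
bits_after = 8 + variant_width * len(variant)
```

In this example the per-cycle word shrinks from eight bits to two, and the invariant value is stored only once for the whole loop.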
In the dedicated memory alternative, the dedicated memory may be implemented as a look-up table or a content addressable memory. The dedicated memory may be shared among a plurality of functional units of the data paths, or a plurality of dedicated memories may be allocated to respective functional units of the data paths.
Furthermore, the invariant code word portion may be used to address the dedicated memory, so as to obtain an expanded new code word which is used to control the data path. This may be achieved by using the same connections as the original sequence of code words. Then, the connections may be selected via a multiplexing function.
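The addressing of the dedicated memory may be pictured with the following minimal sketch. It assumes a look-up table holding 16-bit entries whose low byte is reserved for the per-cycle variant bits; the table contents, the widths and the name `expand_code_word` are hypothetical.

```python
# Hypothetical look-up table: each entry is a wide, loop-invariant
# control-word portion; the low byte is left free for the variant bits.
DEDICATED_MEMORY = (0xB200, 0x4C00, 0x9100)


def expand_code_word(address, variant_bits):
    """Use the short invariant portion as an address into the dedicated
    memory and merge the fetched entry with the variant bits."""
    if not 0 <= address < len(DEDICATED_MEMORY):
        raise IndexError("address outside the dedicated memory")
    if variant_bits & ~0xFF:
        raise ValueError("variant bits must fit in the low byte")
    return DEDICATED_MEMORY[address] | variant_bits


control_word = expand_code_word(0, 0x3A)
```

The instruction thus carries only a short address, while the data path receives the expanded word.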
The bit positions of the invariant code word portion may be specified in a code word by using a configuration mask. In particular, a control word loading function may be controlled based on the configuration mask so as to direct the bits of the code word to appropriate positions of a control register. This may be achieved by using the configuration mask to generate switching signals for connecting to the control register. The size of the dedicated memory may correspond to the size of the instruction register.
In the re-configurable hardware alternative, a configuration context may be selected based on the invariant code word portion. Furthermore, the number of control bits controlling the circuitry mapped onto the re-configurable hardware may be reduced by embedding part of the invariant control into the configuration itself. Thereby, the invariant part of the code words can be hoisted as well, reducing the instruction word width necessary to control the circuitry mapped onto the re-configurable hardware. Moreover, those instructions with the largest invariant code word portions may be selected from the at least one instruction, and the selected instructions may be used for mapping onto the re-configurable hardware. Hence, the re-configurable hardware is used for those instructions or instruction sequences which can benefit most from hoisting.
Other advantageous developments are defined in the dependent claims.
In the following, the present invention will be described in greater detail based on preferred embodiments with reference to the drawings, in which:
The preferred embodiments will now be described on the basis of the decoding and processing of a loop function, as an example for a code sequence in a digital signal processor. According to the preferred embodiments, a part of the data flow graph is configured before entering the loop, so that it will be fixed during the loop. This corresponds to fixing a part of the data path, i.e. fixing some parts of the control word.
In general, fixing a part of the control word for a loop-invariant data flow graph may mean:
The generation of the fixed part of the control word, associated with an arbitrary loop-invariant data flow graph, takes place during code generation or instruction decoding for that loop. Then, during execution of the loop, the standard or variable part of the control word is obtained from a local instruction decoder or a micro store. The fixed or invariant part of the control word is used to configure a part of the data path to be fixed during the sequence of code words.
According to the first preferred embodiment, the fixed part of the code word is obtained from a dedicated memory block implemented as a look-up table or a content addressable memory (CAM). Depending on the instruction hierarchy in the considered architecture, the dedicated memory block may be placed globally or locally in the processor architecture.
According to a second preferred embodiment, the fixed part of the control word may be used to fix re-configurable aspects of a re-configurable hardware or logic unit provided in the processor architecture.
In the following, implementation examples for the first preferred embodiment are explained with reference to FIGS. 1 to 6.
According to the second possible approach indicated in the right lower portion of
Since the loop-invariant or fixed part 12 of the control word 10 is separated from the loop-variant or standard part 11, each loop of an application will have a loop-invariant part or configuration per se. The remaining part may change during loop execution. This reduces the necessary micro code memory size, since the configuration part is not redundantly replicated in every control word within the loop. In practice, the control bits that do not change during loop execution may be distributed arbitrarily over the control word. The problem during execution is then to load the proper configuration and standard bits from the micro code memory into the appropriate bit positions of the control register at the instruction decoder 24, based on which the resources of the processor data path are controlled. This may be solved by providing, for each loop in the application, a configuration mask which specifies which bit positions in the control register are loop-invariant and which are not. Then, a control word loader will direct the bits in the configuration part to the appropriate positions of the control register under control of the configuration mask. The execution of the loop can then start. For every control word derived from the standard parts 11, the control word loader will direct each bit to its appropriate position in the control register, without changing the pre-loaded configuration of that loop, as defined by the dedicated memory block 25 or 26.
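The behaviour of the control word loader may be sketched as follows, purely as an illustration. It is assumed here that a mask bit of 1 marks a loop-invariant position, that the control register is 8 bits wide, and that packed bits are distributed lowest position first; the function names are hypothetical.

```python
def load_configuration(mask, config_bits, width):
    """Before the loop: scatter the packed configuration bits into the
    loop-invariant positions of the control register (mask bit = 1)."""
    register = 0
    k = 0
    for pos in range(width):
        if (mask >> pos) & 1:
            register |= ((config_bits >> k) & 1) << pos
            k += 1
    return register


def load_standard(register, mask, standard_bits, width):
    """Every cycle: scatter the packed standard bits into the remaining
    positions, leaving the pre-loaded configuration untouched."""
    result = register & mask  # keep only the loop-invariant bits
    k = 0
    for pos in range(width):
        if not (mask >> pos) & 1:
            result |= ((standard_bits >> k) & 1) << pos
            k += 1
    return result


MASK = 0b11110011  # positions 2 and 3 are loop-variant (assumed example)
control_register = load_configuration(MASK, 0b101110, width=8)
cycle_1 = load_standard(control_register, MASK, 0b01, width=8)
cycle_2 = load_standard(control_register, MASK, 0b10, width=8)
```

The configuration is loaded once, after which each cycle supplies only the two standard bits.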
Therefore,
It is noted that the pattern of zeros and ones in the configuration mask register 63 does not constitute control signals, but simply specifies which bit positions in the control word are loop-invariant and which are not. On the other hand, the zeros and ones in the loop configuration 62 are indeed the control signals that will control the processor's data path, e.g. opcodes for functional units, register addresses, multiplexer select inputs, etc.
The programmable cross-point switch 73 may consist of a multiplexer chain, each multiplexer being controlled by a set of bits coming from a mask interpretation logic 71. In particular, the programmable cross-point switch 73 may also have controllable tri-state buffers, while the collection of all individual multiplexer and buffer control bits makes up switch control signals 72. The mask interpretation logic 71 is arranged to translate the contents of the configuration mask register 63 into the switch control signals 72. The use of the mask interpretation logic 71 provides the following two advantages. Firstly, it de-couples the particular switch architecture from the semantics of the loop configuration mask 61. This allows, for instance, for a switch upgrade preserving binary code compatibility with the previous version. Secondly, the number of bits in the configuration mask register 63 will typically be much smaller than the number of bits making up the switch control signals 72, allowing for a reduction in the memory capacity required for the loop configurations 61. Nevertheless, the system can be arranged such that neither a loop configuration mask 63 nor a mask interpretation logic 71 is provided. Then, the bits of the loop configuration 61 directly form the switch control signals 72.
As mentioned before, in the first preferred embodiment the pattern of zeros and ones in the configuration mask encodes which bit positions in the control word are loop-invariant and which are not. In fact, since the mask interpretation logic 71 de-couples the semantics of the loop configuration mask from the switch architecture, several other forms of encoding the information in the configuration mask can be used as well, which could reduce the bit-width required to store that information.
The programmable cross-point switch 73 can be fully connected, but need not be. For instance, the first bit of the standard part 11 can be routed to any bit position of the control register 50, while the second bit position of the standard part 11 need not be routable to the first bit position of the control register 50. Furthermore, the third bit position of the standard part 11 need not be routable to the first and second bit positions in the control register 50, and so on. A similar rationale can be applied to higher-order bits. By applying this rationale, some of the connections can be physically removed from the cross-point switch 73, since they will then never be used.
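One possible reading of the mask interpretation logic is sketched below. The sketch assumes that a mask bit of 1 marks a loop-invariant position and that each control register position receives one select signal: either a hold indication or the index of the standard-part bit to be routed there. The name `interpret_mask` and this signal encoding are assumptions made for illustration.

```python
def interpret_mask(mask, width):
    """Translate a configuration mask into per-position switch control
    signals: None keeps the pre-loaded configuration bit, an integer
    selects which bit of the standard part is routed to that position."""
    signals = []
    next_standard_bit = 0
    for pos in range(width):
        if (mask >> pos) & 1:
            signals.append(None)               # loop-invariant: hold
        else:
            signals.append(next_standard_bit)  # loop-variant: route
            next_standard_bit += 1
    return signals


switch_control = interpret_mask(0b11110011, width=8)
```

Here an 8-bit mask yields eight multi-bit select signals, which illustrates why storing the mask may require less memory than storing the switch control signals themselves.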
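The saving in connections may be checked with the following sketch. It assumes that the standard bits keep their relative order when routed, that bit index 0 corresponds to the first bit, and that the control register is 8 bits wide; the function names are hypothetical.

```python
def useful_crosspoints(width):
    """Connections (standard bit j -> register position p) that can ever
    be used: bit j is never routed below position j."""
    return [(j, p) for j in range(width) for p in range(j, width)]


def standard_bit_targets(mask, width):
    """Register positions receiving standard bits 0, 1, 2, ... for a
    given mask (mask bit 0 = loop-variant position)."""
    return [pos for pos in range(width) if not (mask >> pos) & 1]


WIDTH = 8
allowed = set(useful_crosspoints(WIDTH))

# Exhaustively confirm that no mask ever needs a removed connection.
never_violated = all(
    (j, pos) in allowed
    for mask in range(1 << WIDTH)
    for j, pos in enumerate(standard_bit_targets(mask, WIDTH))
)

full_switch = WIDTH * WIDTH
pruned_switch = len(allowed)
```

Under these assumptions the pruned switch keeps 36 of the 64 connections of a fully connected 8-by-8 switch.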
When the register file 31 is implemented in re-configurable logic, all resources are mapped spatially. The control over the resources is achieved through configuration memory bits. A configuration memory 32 is loaded with a new configuration context when a new type of data flow graph is to be executed. An internal configuration controller 33 takes care of selecting the right configuration context based on the fixed part 12 of the control word 10. Since the fixed part 12 of the control word 10 is left unchanged for a number of cycles, the potential reconfiguration time penalty, associated with reloading of a configuration memory context, is insignificant.
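The context selection may be illustrated by the following sketch, in which a hypothetical controller reloads the configuration memory only when the fixed part changes. The class name, the context table and the cycle counts are assumptions chosen for the example.

```python
class ConfigurationController:
    """Selects a configuration context from the fixed part of the
    control word and counts how often a reload is actually needed."""

    def __init__(self, contexts):
        self.contexts = contexts   # fixed part -> configuration context
        self.active_key = None
        self.active_context = None
        self.reloads = 0

    def select(self, fixed_part):
        # Reload only when a different fixed part is presented.
        if fixed_part != self.active_key:
            self.active_context = self.contexts[fixed_part]
            self.active_key = fixed_part
            self.reloads += 1
        return self.active_context


controller = ConfigurationController({0xB2: "context_A", 0x4C: "context_B"})

# Hypothetical trace: 100 cycles of one loop, then 50 cycles of another.
trace = [0xB2] * 100 + [0x4C] * 50
for fixed_part in trace:
    controller.select(fixed_part)
```

Over the 150 cycles of this trace only two reloads occur, one per loop entry.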
If the re-configurable register file 31 is to be arranged to use a standard control word, the configuration context has to be adapted thereto.
The re-configurable register file 31 further comprises a configurable logic array 34 controlled based on the bit states in the configuration memory 32. Thereby, a configurable interface 35 can be configured so as to provide desired inputs, outputs, and/or control words. In this way, not only connections to functional units may be programmable, but also, for example, the depth of the register files 21. Furthermore, implementation of specialised register file architectures, such as FIFOs (First-In-First-Out) and stacks, in the reconfigurable logic is possible. Due to the simplified addressing scheme required for structures like FIFOs and stacks, the size of the standard part 11 of the control word required to address the register file can be reduced considerably.
In the present case shown in
However, many of these control parameters will not vary within a specific loop. Therefore, for that specific loop or other instruction sequence, it is not necessary to provide for twelve control bits in the instruction word to control the circuitry mapped onto the FPGA block 31. Thus, the instruction word width can be reduced, saving instruction memory.
Due to the fact that the FPGA block 31 is flexible, the very hardware structure mapped onto it can be reconfigured so as to interpret the reduced number of control bits properly. In
Thus, the general strategy is to limit the number of bits in the instruction word that can be used to control the FPGA block 31, therefore saving instruction memory. In spite of the reduction, the FPGA block 31 may still be used by several different critical loops in the application, by using the hoisting concept.
Another general strategy is to investigate which critical loops in the application have the largest loop-invariant parts and can benefit most from hoisting. These loops are then chosen for mapping onto re-configurable hardware units, such as the FPGA block 31, which can exploit hoisting, instead of being mapped onto hard-wired units and register files of the rest of the processor data path 41.
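Such a selection may be sketched as a simple ranking of loops by the number of control bits that stay constant over the loop body. The loop names, the 8-bit control words and the function names below are hypothetical.

```python
def invariant_bit_count(control_words, width):
    """Number of bit positions that never change within the loop."""
    first = control_words[0]
    changed = 0
    for word in control_words:
        changed |= word ^ first
    changed &= (1 << width) - 1
    return width - bin(changed).count("1")


def rank_loops(loops, width):
    """Order loop names by decreasing size of their loop-invariant part."""
    return sorted(loops,
                  key=lambda name: invariant_bit_count(loops[name], width),
                  reverse=True)


# Hypothetical critical loops with their 8-bit control words.
loops = {
    "fir": [0xB2, 0xB6, 0xBA],
    "fft": [0x12, 0xED, 0x30],
    "dct": [0x5A, 0x5B],
}
ranking = rank_loops(loops, width=8)
```

The loops at the head of the ranking would be the first candidates for mapping onto the re-configurable hardware; a practical selection may also weigh how often each loop executes, which this sketch leaves out.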
It is noted that the present invention is not restricted to the preferred embodiments explained above, and can be used in any processor environment where a sequence of at least two instructions of a data processing program is decoded into a sequence of code words used to control a data path. In particular, the suggested hoisting concept may be used for other instruction sequences which are decoded into code word sequences with variant and invariant parts. The preferred embodiments may thus vary within the scope of the attached claims.
Number | Date | Country | Kind |
---|---|---|---|
02077020.2 | May 2002 | EP | regional |

Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/IB03/01849 | 4/25/2003 | WO | | 11/19/2004 |