The present invention relates to a digital signal processor (DSP), for example, a SIMT-based DSP.
Many mobile communication devices use a radio transceiver that includes one or more digital signal processors (DSPs).
For increased performance and reliability, many mobile terminals presently use a type of DSP known as a baseband processor (BBP) for handling many of the signal processing functions associated with processing the received radio signal and preparing signals for transmission. It is advantageous to separate such functions from the main processor, as they are highly timing dependent and may require a real-time operating system. There is a desire that such baseband processors should be as flexible as possible, to adapt to developing standards and to enable hardware reuse. Therefore, programmable baseband processors (PBBPs) have been developed.
Many of the functions frequently performed in such processors are performed on large numbers of data samples. Therefore, a type of processor known as a Single Instruction Multiple Data (SIMD) processor is useful, because it enables a single instruction to operate on multiple data items rather than on one data item at a time. Multiple data items may be arranged in a vector, and a processing unit suitable for operating on a vector of data will be referred to in this document as a vector execution unit.
As a further development of the SIMD architecture, the Single Instruction stream Multiple Tasks (SIMT) architecture has been developed. Traditionally, in the SIMT architecture, one or two SIMD-type vector execution units have been provided in association with an integer execution unit, which may be part of a core processor.
International Patent Application WO 2007/018467 discloses a DSP according to the SIMT architecture, having a processor core including an integer execution unit and a program memory, and two vector execution units which are connected to, but not integrated in the core. The vector execution units may be Complex Arithmetic Logic Units (CALU) or Complex Multiply-Accumulate Units (CMAC). The core has a program memory for distributing instructions to the execution units. In WO2007/018467 each of the vector execution units has a separate instruction decoder. This enables the use of the vector execution units independently of each other, and of other parts of the processor, in an efficient way.
It is an objective of the present invention to make a SIMT processor more flexible and enable more efficient use of the program memory, issue bandwidth and execution units.
This objective is achieved according to the present invention by a digital signal processor comprising:
The digital signal processor is characterized in that the processor comprises an issue control unit for selecting at least two execution units that are to receive and execute the same instruction at the same time, and logic for sending the instruction to said at least two execution units.
In the processor defined above, the same instruction may be used to control a number of execution units. This significantly reduces the control overhead when sending the same instruction to a number of execution units. It also enables parallel execution of the same instruction on a number of execution units. The possibility of starting several execution units at one time makes the handling of instructions very efficient. An execution unit may be a vector execution unit, a scalar execution unit or an integer execution unit. A scalar execution unit is arranged to process one data item at a time, but the data item may be an integer or a complex value. For example, the same vector instruction may be sent to two or more vector execution units to be performed on different sets of data. Examples of non-vector instructions that are often sent to more than one vector execution unit are clear and star. It is possible, for example, to have one issue group that includes all vector execution units.
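By way of illustration only, the broadcast mechanism described above may be modelled by the following C sketch, in which an issue control unit sends one and the same decoded instruction to every selected execution unit. All names, types and the bit-mask encoding of the selection are hypothetical and are not taken from the actual hardware.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_EXEC_UNITS 4   /* e.g. two CMAC and two CALU units (assumed) */

typedef struct {
    const char *name;
    void (*execute)(const char *instruction);
} exec_unit_t;

static void cmac_execute(const char *instruction) { printf("CMAC executes: %s\n", instruction); }
static void calu_execute(const char *instruction) { printf("CALU executes: %s\n", instruction); }

/* Hypothetical issue control: 'selected' is a bit mask in which bit i
 * selects execution unit i. The same instruction word is sent to every
 * selected unit, so all of them start on the instruction at one time. */
static void issue(exec_unit_t units[], uint32_t selected, const char *instruction)
{
    for (int i = 0; i < NUM_EXEC_UNITS; i++)
        if (selected & (1u << i))
            units[i].execute(instruction);
}

int main(void)
{
    exec_unit_t units[NUM_EXEC_UNITS] = {
        { "cmac0", cmac_execute },
        { "cmac1", cmac_execute },
        { "calu0", calu_execute },
        { "calu1", calu_execute },
    };

    /* Broadcast one vector instruction to both CMAC units (bits 0 and 1). */
    issue(units, 0x3u, "cmac vector multiply-accumulate");
    return 0;
}
```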
In a preferred embodiment, each vector execution unit comprises a vector controller arranged to determine whether an instruction is a vector instruction and, if it is, to inform a count register arranged to hold the vector length, said vector controllers being further arranged to control the execution of instructions.
The processor may also comprise one or more accelerators, known in the art. The term functional unit, when used in this document, indicates either an execution unit or an accelerator.
Preferably, a number of issue groups are defined, each issue group comprising at least one of the execution units, and at least one issue group comprising more than one of the execution units, and the issue control unit is arranged to select the at least two execution units by selecting an issue group. This may be hardcoded in the core.
Alternatively, in a preferred embodiment, the issue control unit further comprises at least one mask associated with at least one issue group, said mask indicating which execution unit or units in the issue group should receive and execute the instruction. This makes it possible to change the definition of issue groups and the selection of execution units for each issue group, making the processor more flexible.
An issue group may comprise at least one integer execution unit and/or at least one vector execution unit. An issue group may be defined to comprise only execution units of the same type, or a mix of execution units of different types, as desired. It may be suitable to define one issue group that includes all execution units, for example for issuing the command clear.
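As a minimal sketch of the issue-group and mask concepts, assuming a processor with up to eight execution units identified by bit positions, the following C fragment shows how an issue-group table (hardcoded or programmable) and an optional per-group mask register could be combined to resolve which units receive an instruction. The group numbers and bit masks are illustrative only.

```c
#include <stdint.h>

#define NUM_ISSUE_GROUPS 8

/* Hypothetical issue-group table: each group is a bit mask over the
 * execution units. This table could be hardcoded in the core or be
 * programmable. Group numbers and contents are illustrative only. */
static const uint32_t issue_group[NUM_ISSUE_GROUPS] = {
    [0] = 0x03u,  /* group 0: cmac0 and cmac1              */
    [5] = 0x0cu,  /* group 5: calu0 and calu1              */
    [7] = 0xffu,  /* group 7: all units, e.g. for 'clear'  */
};

/* Optional per-group mask register: narrows a group to a subset of its
 * units without redefining the group itself (default: no restriction). */
static uint32_t group_mask[NUM_ISSUE_GROUPS] = {
    [0] = 0xffu, [5] = 0xffu, [7] = 0xffu,
};

/* Resolve which execution units should receive an instruction that the
 * issue control unit has directed to 'group'. */
uint32_t select_units(unsigned group)
{
    return issue_group[group] & group_mask[group];
}
```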
An instruction may involve reading data from and writing data to other units in the processor. When the same instruction is sent to a number of execution units in an issue group, normally each execution unit should work with its own set of other units, to avoid several execution units trying to read from or write to the same unit. Therefore, in a preferred embodiment, at least one execution unit comprises a mapping table for translating information held in an instruction indicating at least one other unit with which the execution unit should interact, for example, from which memory it should read data. Still, two or more execution units may be arranged to receive data from the same memory unit or functional unit in the processor, for example when one execution unit in the issue group is to perform the function A=sum(X*Y), and another is to perform the function B=sum(X*Z), X, Y and Z being data vectors obtained from the other units in the processor.
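The mapping-table idea can be sketched as follows; the table contents and the notion of "logical ports" are assumptions made for illustration. Each execution unit translates the unit identifier carried in the instruction into its own physical memory bank, so that two units in an issue group executing the same instruction normally access different banks.

```c
#include <stdint.h>

#define NUM_UNITS_IN_GROUP 2
#define NUM_LOGICAL_PORTS  4

/* Hypothetical per-unit mapping table: the instruction names a logical
 * port ("read operand from memory 1"), and each execution unit in the
 * issue group translates that into its own physical memory bank, so the
 * units do not compete for the same bank when running one instruction. */
static const uint8_t port_map[NUM_UNITS_IN_GROUP][NUM_LOGICAL_PORTS] = {
    { 0, 1, 2, 3 },  /* unit 0: logical ports 0..3 -> banks 0..3 */
    { 4, 5, 6, 7 },  /* unit 1: logical ports 0..3 -> banks 4..7 */
};

uint8_t translate_port(unsigned unit, unsigned logical_port)
{
    return port_map[unit][logical_port];
}
```

A shared operand, such as the vector X in the example above, would simply be mapped to the same physical bank in both rows of the table.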
One way of handling the result from an issue group involves writing the result from each execution unit in the issue group to the same vector register unit and letting the vector register unit perform the instructions involved in processing the result.
Preferably, the instruction decoder is arranged to inform the vector register unit about the instruction being executed at any given time.
The selection of which issue group is to perform a particular instruction may be handled in different ways. Normally, an issue signal will be extracted in the core and sent to the relevant execution unit. In this case, the at least one execution unit in an issue group is further arranged to receive an issue signal and to control the execution of instructions based on this issue signal. Alternatively, each vector execution unit may be arranged to extract an issue signal from a received instruction word and determine whether it should participate in the execution of the instruction word based on the issue signal.
Preferably, the vector controller controls the execution of instructions on the basis of an issue signal received from the core. Alternatively, the issue signal may be handled locally by the execution unit itself. How to implement this is known in the art.
Processing according to the invention is made more efficient by enabling concurrent processing of one instruction on two different sets of data by two execution units. It would also be possible to let two execution units process different parts of the same set of data, provided the different parts were stored in different memories. This enables more efficient processing of large sets of data than what is enabled in the prior art, without having to implement larger vector execution units. As an alternative solution, the capacity of a vector execution unit could be increased by increasing the number of datapaths included in the vector execution unit, but such a high-capacity vector execution unit would be unnecessarily large for most commands, and therefore inefficient. Hence, the invention provides a more flexible and cost-efficient solution than providing a single vector execution unit with higher capacity.
The distribution of instructions and data to and from several units in one go allows for extremely efficient handling of instructions since sending the same signal between several units can be achieved at practically the same cost as signaling between two units.
Typically, the program memory is arranged in the processor core and is also arranged to hold instructions for the integer execution unit.
The invention also relates to a baseband communication device suitable for multimode wired and wireless communication, comprising:
In a preferred embodiment, the vector execution units referred to throughout this document are SIMD type vector execution units or programmable co-processors arranged to operate on vectors of data.
The processor according to embodiments of this invention is particularly useful as a digital signal processor, especially as a baseband processor. The front-end unit may be an analog front-end unit arranged to transmit and/or receive radio frequency or baseband signals.
Such processors are widely used in different types of communication devices, such as mobile telephones, TV receivers and cable modems. Accordingly, the baseband communication device may be arranged for communication in a cellular communications network, for example as a mobile telephone or a mobile data communications device. The baseband communication device may also be arranged for communication according to other wireless standards, such as Bluetooth or WiFi. It may also be a television receiver, a cable modem, a WiFi modem or any other type of communication device that is able to deliver a baseband signal to its processor. It should be understood that the term “baseband” only refers to the signal handled internally in the processor. The communication signals actually received and/or transmitted may be any suitable type of communication signals, received on wired or wireless connections. The communication signals are converted by a front-end unit of the device to a baseband signal, in a suitable way.
In the following the invention will be described in more detail, by way of example, and with reference to the appended drawings.
A host interface unit 207 provides connection to the host processor (not shown). If a MAC processor is present, it is connected between the host interface unit 207 and the host processor. A digital front end unit 209 provides connection to an ADC/DAC unit in a manner well known in the art.
As is common in the art, the controller core 201 comprises a program memory 211 as well as instruction issue logic and functions for multi-context support. For each execution context, or thread, supported this includes a program counter, stack pointer and register file (not shown explicitly in
The controller core 201 also comprises an integer execution unit 212 comprising a register file RF, a core integer memory ICM, a multiplier unit MUL and an Arithmetic and Logic/Shift Unit (ALSU). These units are known in the art and are not shown in
An on-chip network 244 interconnects all units of the processor, including the controller core 201, the digital front end unit 209, the host interface unit 207, the vector execution units 203, 205, the memory banks 230, 232, the integer memory bank 238 and the accelerators 242.
In this example, each of the first vector execution unit 203 and the second vector execution unit 205 is a CMAC vector execution unit, comprising a vector controller 213, a vector load/store unit 215 and a number of data paths 217. The load function is used for fetching data from the other units connected to the on-chip network 244 (for example from a memory bank), and the store function is used for storing data from the execution units 203, 205 to, for example, a memory bank 230, 232 through the on-chip network 244. Data may also be obtained from other vector execution units, and/or the computing results may be forwarded to other vector execution units for further processing. Each vector execution unit comprises a vector controller 213, 223 arranged to receive instructions from the program memory 211.
The vector controller of this first vector execution unit is connected to the program memory 211 of the controller core 201 via the issue logic, to receive issue signals related to instructions from the program memory. In the description above, the issue logic decodes the instruction word to obtain the issue signal and sends this issue signal to the vector execution unit as a separate signal. It would also be possible to let the vector controller of the vector execution unit generate the issue signal locally. In this case, the issue signals are created by the vector controller based on the instruction word in the same way as it would be in the issue logic.
Alternatively, the vector execution units 203, 205 are CALU vector execution units of a type known in the art, comprising a vector controller 223, a vector load/store unit 225 and a number of data paths 227. The vector controller 223 of this second vector execution unit is also connected to the program memory 211 of the controller core 201, via the issue logic, to receive issue signals related to instructions from the program memory.
The vector execution units 203, 205 could also be any kind of vector execution units. Although two vector execution units are shown and discussed, the inventive method can be extended to sending the same instruction to three or more vector execution units.
There could be an arbitrary number of vector execution units, in addition to the two shown in
To enable several concurrent vector operations, the processor preferably has a distributed memory system where the memory is divided into several memory banks, represented in
As is known in the art, a number of accelerators 242 are typically connected, since they enable efficient implementation of certain baseband functions such as channel coding and interleaving. Such accelerators are well known in the art and will not be discussed in any detail here. The accelerators may be configurable to be reused by many different standards.
The first and second vector execution units 203, 205 are shown as four-way CMAC units, each with four complex datapaths that may be run concurrently or separately. The four complex data paths include multipliers, adders, and accumulator registers (all not shown in
In one embodiment, the instruction set architecture for the processor core 201 may include three classes of compound instructions. The first class comprises RISC instructions, which operate on 16-bit integer operands. The RISC-instruction class includes most of the control-oriented instructions and may be executed within the integer execution unit 212 of the processor core 201. The second class comprises DSP instructions, which operate on complex-valued data having a real portion and an imaginary portion. The DSP instructions may be executed on one or more of the vector execution units 203, 205. The third class comprises the Vector instructions. Vector instructions may be considered extensions of the DSP instructions, since they operate on large data sets and may utilize advanced addressing modes and vector support. The vector instructions may operate on complex or real data types.
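A rough C sketch of how such a three-way classification might look is given below; the enum values mirror the three classes described above, while the decode function and the bit positions it inspects are purely hypothetical and do not reflect the real instruction encoding.

```c
#include <stdint.h>

typedef enum {
    INSTR_CLASS_RISC,   /* 16-bit integer operands, executed on the integer execution unit */
    INSTR_CLASS_DSP,    /* complex-valued operands, executed on a vector execution unit    */
    INSTR_CLASS_VECTOR  /* DSP operations over whole vectors, with advanced addressing     */
} instr_class_t;

/* Hypothetical decode: the 2-bit class field assumed here is for
 * illustration only and is not the processor's actual encoding. */
instr_class_t classify(uint32_t instruction_word)
{
    switch (instruction_word >> 30) {
    case 0:  return INSTR_CLASS_RISC;
    case 1:  return INSTR_CLASS_DSP;
    default: return INSTR_CLASS_VECTOR;
    }
}
```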
In the prior art, the CMAC units 203, 205 are arranged to operate separately, each processing one instruction, on one set of data, at a time. According to the invention, control means are included which will enable the CMAC units 203, 205 to work concurrently on the same set of data in order to speed up the processing.
For illustration, in the prior art each vector execution unit has a name. The command
means that all the following CMAC instructions should be sent to CMAC unit number 0. This information is found in the instructions themselves and is decoded either in the issue logic in the core 201, or by the vector execution units themselves.
According to the invention, groups of execution units, called issue groups, are specified, each issue group comprising one or more execution units of the same type or of different types. When an instruction is issued, the unit field in the instruction word will not encode one of the execution units directly, but will instead indicate one of the issue groups, as will be discussed in connection with
According to the invention a new command is defined to say that all instructions of a particular type should be sent to a particular issue group, and not to an individual vector execution unit. If the following commands have been issued:
this means that all cmac instructions should be sent to issue group number 0 and all calu instructions should be sent to issue group number 5. If a cmac instruction such as cacc x,y is issued it will be sent to issue group number 0. If a calu instruction such as vadd z, b is issued, it will be sent to issue group number 5. The vector execution units in one issue group may have the same number of datapaths, or different numbers of datapaths.
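The effect of such routing commands can be sketched as follows; the type names, the lookup function and its signature are illustrative assumptions. The issue logic keeps, per instruction type, the number of the issue group that the type is currently bound to, and looks it up when an instruction of that type is decoded.

```c
typedef enum { INSTR_TYPE_CMAC, INSTR_TYPE_CALU, NUM_INSTR_TYPES } instr_type_t;

/* Hypothetical binding established by the routing commands described
 * above: all cmac instructions are currently bound to issue group 0 and
 * all calu instructions to issue group 5, until a new command rebinds
 * them. The numbers match the example in the text but are otherwise
 * arbitrary. */
static unsigned group_for_type[NUM_INSTR_TYPES] = {
    [INSTR_TYPE_CMAC] = 0,
    [INSTR_TYPE_CALU] = 5,
};

/* When, e.g., 'cacc x,y' is decoded as a cmac-type instruction, the
 * issue logic looks up the issue group it is currently bound to. */
unsigned route(instr_type_t type)
{
    return group_for_type[type];
}
```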
As explained in connection with
In a preferred embodiment, however, to provide more flexibility, a mask may be used in connection with the issue signal, as shown in
In this example, the mask units 326, 328, 330 are all used for the same issue group. As indicated by a further mask unit 340, there may be mask units for one or more further issue groups as well. The main purpose of having multiple mask registers for one issue group is to allow each context to have its own separate mask register.
In the example in
The issue group functions are particularly useful in situations where it is important that both CMAC units start at exactly the same time and work in a synchronized manner. Typically the multi-issue functions are used to enable several vector execution units to execute the same instruction, that is, when it is desired to transmit the same instruction to several vector execution units. This applies both to situations where synchronization of the execution is important and where several vector execution units should receive the same instructions but it is not essential that they are synchronized. An example of the latter is the clear instruction which is used to clear a vector execution unit. To clear all vector execution units, an issue group could be defined as comprising all vector execution units and the instruction could be sent to this issue group.
The following example will be discussed on the basis of a SIMT DSP with an arbitrary number of execution units. For simplicity, all units are assumed in this example to be CMAC vector execution units, but in practice a digital signal processor will have units of different types.
In many baseband processing algorithms and programs, the algorithm can be decomposed into a number of DSP tasks, each consisting of a “prolog”, a vector operation and an “epilog”. The prolog is mainly used to clear accumulators, set up addressing modes and pointers, and the like, before the vector operation can be performed. When the vector operation has completed, the result of the vector operation may be further processed by code in the “epilog” part of the task. In SIMT processors, typically only one vector instruction is needed to perform the vector operation.
The typical layout of one DSP task according to the invention is exemplified by the following example task:
The code snippet in the example performs a complex dot-product calculation over 512 complex values and then stores the result back to memory. The routine requires the following instructions to be fetched by the processor core.
In the example above, the setcmvl, cmac and star instructions are issued to and executed on the CMAC vector execution unit whereas ldi, out and idle instructions are executed on the integer core (“core”). The parameter [3] to the star instruction indicates the indirect network port address of the unit to which the resulting data should be sent.
The vector length of the vector instructions indicates how many data words (samples) the vector execution unit should operate on. The vector length may be set in any suitable way, for example one of the following:
The instruction idle #cmac0 instructs the core program flow controller to stop fetching new instructions until the CMAC0 unit has finished its vector operation. After the idle function releases, allowing new instructions to be fetched, the “star” instruction is fetched and dispatched to the CMAC0 vector execution unit. The star instruction instructs the CMAC vector execution unit to store the accumulator to memory.
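Functionally, the example task corresponds to the following C sketch, which is a behavioural model only and not the actual instruction stream: the prolog clears the accumulator and sets up the data pointers, the loop corresponds to the single cmac vector instruction over 512 samples, and the final store corresponds to the star instruction.

```c
#include <complex.h>
#include <stddef.h>

#define VECTOR_LENGTH 512   /* as in the example task */

/* Behavioural model of the example task: a complex dot product over
 * 512 samples followed by storing the accumulated result to memory. */
void cmac_dot_product(const float complex *x,
                      const float complex *y,
                      float complex *result)
{
    float complex acc = 0.0f;                   /* prolog: clear accumulator */

    for (size_t i = 0; i < VECTOR_LENGTH; i++)  /* vector operation          */
        acc += x[i] * y[i];

    *result = acc;                              /* epilog: store, cf. 'star' */
}
```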
There are three possible ways of handling the output from the execution units of an issue group. The simplest and most common is that the execution units have worked separately on sets of data, and that each instruction, or sequence of instructions, is ended individually. In this case, the result may be handled in a manner common in the art.
A second alternative is that the results from two or more execution units constituting an issue group should be handled together. One way of achieving this would be to provide a vector register file 902 as shown in
A third option would be to let only one of the execution units perform the epilog. In this case, for all but one of the execution units in an issue group the last instruction would be for the execution unit to send its data to the one execution unit of the issue group that was to perform the final combining of the results.
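The third option can be illustrated by the following sketch, assuming an issue group of two units that have each reduced their own half of the data (held in different memory banks) to a partial accumulator, after which one unit executes the epilog that combines the two partial results. The function names and the even split of the data are assumptions for illustration.

```c
#include <complex.h>
#include <stddef.h>

/* Each of the two execution units in the issue group reduces its own
 * half of the data (in hardware: held in different memory banks) to a
 * partial accumulator. */
static float complex partial_dot(const float complex *x,
                                 const float complex *y, size_t n)
{
    float complex acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}

/* One unit then performs the epilog that combines the partial results. */
float complex dot_on_issue_group(const float complex *x,
                                 const float complex *y, size_t n)
{
    float complex p0 = partial_dot(x, y, n / 2);               /* unit 0 */
    float complex p1 = partial_dot(x + n / 2, y + n / 2,
                                   n - n / 2);                 /* unit 1 */
    return p0 + p1;                                            /* epilog */
}
```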
The idle instruction is used in the SIMT architecture to stop fetching instructions from the program memory until a particular vector execution unit is finished with its instruction. When a vector execution unit is finished, it returns a signal to indicate to the core that it is ready. This signal might initiate an interrupt signal. When issue groups are used, the idle instruction should preferably stop the fetching of instructions until all vector execution units in the issue group are finished. Therefore, the core should handle ready signals from all vector execution units in the issue group in a coordinated manner. Typically, when the execution units in an issue group run the same instruction and no stalls occur in the execution units, all execution units within the same issue group should release their interrupt signal at the same time. To allow flexibility, it is possible to specify whether "and" or "or" logic should be used to form the corresponding output signal. For example, the criterion may be that the ready signal has been received from all vector units, that is, that all vector execution units in the issue group should be finished. Alternatively, the criterion may be that one of the vector units has issued the ready signal. A practical way of handling this is shown in
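A possible model of the coordinated ready handling, with selectable "and"/"or" logic, is sketched below; the type names and the bit-mask representation of the ready signals and of the group membership are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { READY_ALL, READY_ANY } ready_mode_t;

/* Sketch of the coordinated ready handling: 'ready_bits' has one bit per
 * execution unit that has signalled ready, and 'group_members' marks the
 * units belonging to the issue group. With "and" logic the idle releases
 * only when every unit in the group is ready; with "or" logic it releases
 * as soon as any one of them is. */
bool idle_release(uint32_t ready_bits, uint32_t group_members, ready_mode_t mode)
{
    uint32_t ready_in_group = ready_bits & group_members;

    if (mode == READY_ALL)
        return ready_in_group == group_members;  /* "and": all units ready   */
    else
        return ready_in_group != 0u;             /* "or": at least one ready */
}
```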