The invention relates to methods for encoding a set of operations through a set of variable length instructions and apparatus for decoding the instructions.
In embedded systems, three key processor metrics are performance, power efficiency, and code density. Processor code density is important because it directly affects how much memory is needed for a given application. The more memory that is needed, the bigger, more expensive, and more power hungry the system becomes. If the instructions executed by a processor can be made smaller, less memory is needed to execute a given program. If a complete program can fit within the processor's on-chip memory, power consumption goes down significantly and the performance of the program increases.
Most of today's successful embedded processors use some kind of variable width decoding to improve code density. ARM uses a short instruction mode called THUMB, which is asserted by executing a special instruction. The Blackfin digital signal processor (DSP) has variable width instruction sizes, with the most common instructions encoded as 16-bit instructions. Complex Instruction Set Computer (CISC) architectures generally allow reading data directly from memory using special address modes, have many more instruction widths, and generally have better code density than Reduced Instruction Set Computer (RISC) based processors. However, the more complex decoding of CISC computers generally leads to slower and more power hungry circuitry.
The DSP architecture described herein can achieve significantly better code density and performance in signal processing compared to current RISC-based DSPs, while achieving very high speed of operation of the decoding. The DSP architecture provides 16-bit encoding/decoding of three-register instructions, and orthogonal 64-register selection fields within a 32-bit instruction. The 64-entry register file can allow significantly higher performance compared to typical DSP architectures in demanding signal processing applications, while the 16-bit instruction size provides excellent code density in control-type applications.
Other features and advantages will become apparent from the following detailed description, drawings, and claims.
A digital signal processor (DSP) architecture containing a variable width decoder is shown in
A program memory 110 is used to store a program being executed. The program memory can be separate from the data memory to improve performance, although it could be combined. The width of the program memory is at least 32 bits, but can be 64 bits or 128 bits.
An instruction alignment buffer 120 aligns instructions so that instructions in memory do not have to be aligned on program memory line boundaries. This feature increases code density and reduces power consumption.
An instruction decoder 130 decodes the instruction received from the instruction buffer 120 and sends control signals to a register file, execution units (not shown), and a program sequencer. The instruction decoder decodes the length of an instruction as 16 bits wide or 32 bits wide based on the type of instruction.
A program sequencer 140 controls the fetching of instructions from program memory 110. Sequencer 140 provides a fetch address to program memory 110 and a read signal when an instruction is read. The fetch is done whenever the instruction buffer is not full. The unit also controls non-linear program flows such as jumps, calls, and branches. Up to two instructions can be executed in parallel.
A register file 150 is a unified register file with up to 64 general purpose registers capable of being used for all 32-bit instructions. A large and unified register file is a useful feature of load-store RISC architectures, because there are no addressing modes that allow data variables to be loaded from the data memory with a compute instruction.
A data memory 160 is a multi-bank memory architecture that allows for the fetching of data for computation in parallel with fetching an instruction from program memory. This is generally referred to as a Harvard architecture. In signal processing applications, allowing for simultaneous instruction fetch and data loads often doubles application performance.
A datapath 170 can include processing units for data processing functions. The processor instruction set is flexible and expandable, but has a core instruction set shared by all implementations of the processor. The base integer instructions can include only the following: addition, subtraction, xor, or, and, logical left shift, logical right shift, and arithmetic left shift. More instructions can be added based on specific application needs, and may include floating point arithmetic, multiplication, and/or multiply accumulate operations. Datapath-based instructions can be executed in parallel with load-store instructions.
A load store control 180 enables parallel execution of datapath instructions and load/store of data.
The architecture also provides an external interface 190 and bus 195. The bus communicates with load store control 180, register file 150, data memory 160, and external interface 190.
Register file 150 is a single unified register file that is used for all computer operations, including pointer manipulation, floating point execution, and integer arithmetic. Most architectures today utilize a split register file architecture. One reason for the register file split in these architectures is that a large instruction set does not allow encoding of such a large set of registers in a 32-bit instruction. The trade-off made was for more complicated instruction sets rather than a large register file. In the processor described here, the register file is unified and even allows a 64-entry register file with a 32-bit instruction set. Three-operand instructions with 64-entry register fields fit in a 32-bit instruction by reducing the number of unique instructions and by reducing the size of immediate constants.
In some other designs, there can be a separate 32-entry register file for floating point operations, meaning that there are 32 registers available for integer operations and 32 registers for floating point operations. In still other architectures, there are only 8 data registers and 8 pointer registers. In both cases, register spillage may occur when either the integer register usage or the computational register usage exceeds the size of the respective register file. By making the register file large, unified, and orthogonal, there is only one register constraint to optimize for when writing the code rather than two. The constraint is that the total number of registers used must be less than 64. A large register file is useful in signal processing applications, since one data fetch bus has been removed and thus there is a desire to reuse more of the data, leading to a large number of temporary variables held in the register file rather than memory.
Out of the 16 types within the 4-bit type field, one opcode type (1111) is dedicated to extending the instruction to 32 bits. Instructions with immediate values use bit-4 to indicate a long (32-bit) instruction. Encoding the 32-bit instruction as a four bit value can be done with only four gates, which is insignificant when compared to the size of the whole digital signal processor, which can be on the order of 10,000 gates. However, these four gates enable the encoding of a large set of three register arithmetic instructions within a 16-bit instruction field, which can reduce the code size by half in many signal processing functions. If one bit were dedicated to specifying a 16-bit versus 32-bit instruction, only 15 bits would be available for general operation descriptions, which would not have been sufficient to encode all of the key instructions desired. Forcing many key instructions to be encoded as 32-bit instructions would have significantly increased the code size and power consumption of signal processing.
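The length decision described above can be sketched in a few lines of Python. This is an illustrative model only, not the decoder circuit itself. It assumes that the 4-bit type field sits in the four least significant bits of the first 16-bit word and that "bit-4" means zero-based bit index 4. The text does not list which type codes carry immediate values, so the function takes them as a hypothetical parameter, `immediate_types`, and the function name is likewise chosen for illustration.

```python
EXTEND_TYPE = 0b1111  # type code dedicated to the 32-bit extension (from the text)


def instruction_length(first_halfword: int, immediate_types: frozenset) -> int:
    """Return 16 or 32 for the instruction that starts with this 16-bit word.

    immediate_types is a hypothetical set of 4-bit type codes that carry
    immediate values; the source text does not enumerate them.
    """
    type_field = first_halfword & 0xF      # assumed position: bits [3:0]
    if type_field == EXTEND_TYPE:
        return 32                          # dedicated extension type
    long_bit = (first_halfword >> 4) & 1   # assumed: "bit-4" is bit index 4
    if type_field in immediate_types and long_bit:
        return 32                          # long immediate form
    return 16


# Example with a made-up immediate type code of 0b0011.
example_immediate_types = frozenset({0b0011})
print(instruction_length(0x000F, example_immediate_types))
print(instruction_length(0x0013, example_immediate_types))
print(instruction_length(0x0002, example_immediate_types))
```

Because the check only looks at a handful of low-order bits, the length is known before the rest of the instruction is decoded, which is the property the four-gate detector relies on.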
The base instructions are 16 bits wide, with a second 16-bit extension adding more registers and longer immediate constants to the 16-bit instruction. The 16-bit instructions have three register fields, each with three bits to identify one of registers R0-R7. The 32-bit instructions have three register fields, each with a total of 6 bits to identify one of 64 registers. The lower three bits of each of the register fields, Rn, Rm, and Rd, are contained within the first 16 bits, and the upper three bits, i.e., the most significant bits (MSBs), of each of the register fields are contained within the upper 16 bits of the instruction. Relative to the 16-bit instruction, these three groups of three bits supply the MSBs of the addresses for registers R8 through R63. Any user-entered command that uses only registers R0 through R7 is encoded as a 16-bit instruction, while commands that use registers R8 through R63 are encoded as 32-bit instructions. When programming in assembly code, the instructions can be specified by the programmer. A tool can then parse the text of the assembly code and determine whether a 16-bit or 32-bit instruction is appropriate based on the registers being used.
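As a minimal sketch of the width selection an assembler-like tool could perform, the following Python function splits each 6-bit register number into its lower and upper three bits and picks the 16-bit form only when all upper bits are zero. The function name and the (Rd, Rn, Rm) argument order are assumptions made for illustration; the actual bit positions of the fields within the instruction word are not modeled here.

```python
def encode_register_fields(rd: int, rn: int, rm: int):
    """Split three register numbers into 3-bit low/high parts and pick a width.

    Returns (width_in_bits, low_parts, high_parts). The low parts belong in
    the first 16 bits; the high parts (MSBs) belong in the upper 16 bits of a
    32-bit instruction and are all zero for registers R0-R7.
    """
    regs = (rd, rn, rm)
    for r in regs:
        if not 0 <= r <= 63:
            raise ValueError(f"register R{r} is outside R0-R63")
    low = tuple(r & 0b111 for r in regs)   # three LSBs of each register number
    high = tuple(r >> 3 for r in regs)     # three MSBs of each register number
    width = 16 if all(h == 0 for h in high) else 32
    return width, low, high


print(encode_register_fields(1, 2, 7))    # only R0-R7 are used
print(encode_register_fields(8, 2, 63))   # R8 and R63 force the long form
```

A single register above R7 is enough to force the 32-bit form, so register allocation directly influences code size.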
The instruction decoding circuitry thus supports the encoding of three-operand instructions within 16-bit instruction widths. Short width instruction sets typically limit instructions to two-operand instructions when short instructions are used. Here, all three-operand instructions can be encoded as 16-bit instructions. Three-operand instructions can produce more efficient signal processing code than two-operand instructions.
By trading off immediate value fields and the number of different instructions in the architecture, the inclusion of 6 bit register fields is enabled for all source and destination operands in the case of 32-bit instructions. This means that 64 registers can be used in a 32 bit instruction architecture. The use of 64 registers has the potential of significantly improving the efficiency of the code generated by configurable compilers. A larger register file can reduce the number of loads and stores to data memory, and such reduction can improve performance and reduce power consumption.
Referring to
Based on the buffer pointer, the instruction buffer 120 selects and sends an instruction to the instruction decoder 130. The program memory needs to be at least 64 bits wide to allow for two 32-bit instructions to be executed in parallel on a continuous basis. The instruction output from the instruction buffer is either 32 bits for the single issue configuration, or 64 bits for the dual issue configuration.
The number of instructions executed depends on the types of instructions currently in the instruction buffer. A legal condition for parallel instruction issue includes: (1) no dependency between the result of the first instruction and the inputs of the second instruction, and (2) no contention on hardware resources, meaning that a load/store instruction can be executed in parallel with a datapath instruction. In this embodiment, the core cannot execute two load/store instructions in parallel or execute two datapath instructions in parallel. All control instructions are executed one at a time.
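The issue conditions above can be expressed as a small predicate. The Python sketch below is illustrative only: the `Instr` record, the `Kind` enumeration, and the function name are hypothetical, and the dependency check covers only the case named in the text (the result of the first instruction feeding an input of the second).

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    DATAPATH = "datapath"
    LOAD_STORE = "load_store"
    CONTROL = "control"


@dataclass(frozen=True)
class Instr:
    kind: Kind
    dest: frozenset  # register numbers written by the instruction
    srcs: frozenset  # register numbers read by the instruction


def can_dual_issue(first: Instr, second: Instr) -> bool:
    """Return True if the two instructions may issue in the same cycle."""
    if Kind.CONTROL in (first.kind, second.kind):
        return False  # control instructions execute one at a time
    if first.kind == second.kind:
        return False  # two load/stores or two datapath ops contend for a unit
    if first.dest & second.srcs:
        return False  # second instruction needs the first one's result
    return True


add = Instr(Kind.DATAPATH, frozenset({1}), frozenset({2, 3}))
load = Instr(Kind.LOAD_STORE, frozenset({4}), frozenset({5}))
print(can_dual_issue(add, load))
```

In this model the only pairing that can issue together is one datapath instruction with one independent load/store instruction, matching the embodiment described.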
The size of the instruction is used to update the write pointer and read pointer state machines. A new instruction line is fetched from memory whenever the instruction buffer has 4 empty 16-bit entries. A new instruction line is also fetched from the program memory in case of a program redirection such as a jump instruction or an interrupt request. Although some embodiments include an instruction alignment buffer, there is the possibility of implementing a microprocessor without it. The instruction alignment buffer adds area and power, and there could be applications, predominately 16 bit or 32 bit, that may not benefit from its use.
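The fetch rule can be illustrated with a simple queue model of the alignment buffer. This is a behavioral sketch under stated assumptions, not the pointer state machines themselves: the class name, the buffer capacity of eight 16-bit entries, and the 64-bit line of four halfwords are hypothetical choices. Only the threshold of four empty 16-bit entries comes from the text.

```python
from collections import deque


class AlignmentBuffer:
    """Behavioral model of an instruction buffer holding 16-bit entries."""

    def __init__(self, capacity: int = 8, line_halfwords: int = 4):
        self.capacity = capacity              # hypothetical buffer size
        self.line_halfwords = line_halfwords  # assumed 64-bit memory line
        self.entries = deque()

    def wants_fetch(self) -> bool:
        # Fetch a new line whenever at least four 16-bit entries are empty.
        return self.capacity - len(self.entries) >= 4

    def fill(self, line):
        if len(line) != self.line_halfwords:
            raise ValueError("unexpected line size")
        if self.capacity - len(self.entries) < len(line):
            raise RuntimeError("buffer has no room for a new line")
        self.entries.extend(line)

    def pop_instruction(self, length_bits: int):
        # The decoded instruction size advances the read side by 1 or 2 entries.
        count = length_bits // 16
        if len(self.entries) < count:
            raise RuntimeError("instruction not fully buffered")
        return [self.entries.popleft() for _ in range(count)]


buf = AlignmentBuffer()
buf.fill([0xA, 0xB, 0xC, 0xD])
buf.fill([0x1, 0x2, 0x3, 0x4])
buf.pop_instruction(16)
buf.pop_instruction(32)
print(buf.wants_fetch())  # three entries are free at this point
```

Because entries are consumed in 16-bit units, a 32-bit instruction may straddle two memory lines without any alignment padding in program memory.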
Each register, Rn, Rm, and Rd, is designated with six bits indicating which of the 64 registers is being addressed. The 6-bit address for a register is represented generally as Rx[5:0]. For 16-bit instructions that use registers R0-R7, the most significant bits (MSB) are always 000, while the three LSBs indicate that register. For instructions that have 32 bits and use registers R8 through R63, the MSBs are taken from instr[31:29], instr[28:26], and instr[25:23]. The 32-bit signal from instruction length decoder 410 thus indicates to muxes 450, 460, and 470 whether to fill in the register address with leading zeros, or whether to use bits from instr[31:23] as the MSBs of the register address.
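The behavior of one register address mux can be modeled as follows. This Python sketch is an assumption-laden illustration: the MSB positions 29, 26, and 23 come from the text, but the position of the three LSBs within the first 16 bits is not stated, so `lsb_pos` is a hypothetical parameter (13 is used below only as an example), as is the function name.

```python
def register_address(instr: int, is_32bit: bool, lsb_pos: int, msb_pos: int) -> int:
    """Rebuild a 6-bit register address Rx[5:0] from an instruction word.

    lsb_pos: bit position of the 3-bit low field (hypothetical, not in the text).
    msb_pos: bit position of the 3-bit high field (29, 26, or 23 per the text).
    """
    low = (instr >> lsb_pos) & 0b111
    # The mux selects either the upper instruction bits or leading zeros.
    high = (instr >> msb_pos) & 0b111 if is_32bit else 0b000
    return (high << 3) | low


# Example word: high field 0b101 at bit 29, low field 0b011 at an assumed bit 13.
word = (0b101 << 29) | (0b011 << 13)
print(register_address(word, True, 13, 29))   # 0b101011
print(register_address(word, False, 13, 29))  # 0b000011
```

Zeroing the upper bits in the 16-bit case means any stale data in the upper half of the buffer output cannot change which register is addressed.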
The size of the instruction is used to reset the upper field of the operand register addresses, as shown in muxes 450, 460, and 470, and to indicate a correct program counter address for the next instruction to be executed.
The decoding logic needed to support the dual length instruction set can be minimal and significantly smaller than other encoding/decoding schemes. The logic added by dual encoding length instructions in this scheme includes (or can be limited to) approximately nine NAND gates for the three operand fields Rn, Rm, and Rd (muxes 450, 460, and 470); approximately eight 2-input NAND gates to create a 32-bit instruction indicator (decoder 410); a four input NAND gate for creating an “extend” signal (gate 420); and four 2:1 muxes to create an extended opcode (mux 430) for the final control decoder (440).
All other instruction decode logic can be completely reused between the 16-bit and 32-bit instruction formats, resulting in a very small, power efficient, and fast dual-length instruction decoding circuit.
One innovation that leads to the efficient instruction decoding method is the use of multiple bits to indicate a 32-bit instruction, forcing each register based instruction to be a 16-bit or 32-bit instruction, depending on the registers used, and having two opcode fields that get selected by a 4-bit “extend” signal derived from a 4-bit opcode. The extended mode detection is then used to select the correct type bits for the general decode logic. By keeping the instruction set minimal, three 8-register operands can be used within a 16-bit instruction and three 64-register operands within a 32-bit instruction.
This architecture can be said to tailor the instruction encode/decode scheme to optimize code density for signal processing applications, whereas microprocessors and DSPs are typically optimized for control applications.
While DSPs often use two load store units to bring data to and from a register file, in the present architecture, a second load store unit is omitted in favor of more registers. Dual load-store buses can be useful with a smaller register file, but this architecture preferably uses a larger register file.
Individual descriptions of the instructions shown in
The parallel execution is carried out by the hardware sequencer. As can be seen, the execution unit is being used on every clock cycle, indicating that there is no load-store bottleneck in the application.
Having described certain embodiments, it should be apparent that modifications can be made without departing from the scope, and that other embodiments are within the following claims. For example, while specific numbers of bits have been identified for various aspects including the instruction length, register bits, and extend signal, modifications could be made to different numbers to accommodate a system in a different implementation, while still maintaining the basic principles described herein. While the instructions that are used with certain registers have a lower number of bits (e.g., 16 bits for registers R0-R7), additional instructions could be provided that have a greater number of bits (e.g., 32 bits) in all cases regardless of the registers used; in such a case, the LSBs of the instruction received at the decoder would be 1111 to indicate a 32-bit address (using the exemplary embodiment above).
This application claims priority under 35 U.S.C. Section 119(e) to Provisional Application Ser. No. 61/197,511, filed Oct. 29, 2008, which is incorporated herein by reference.
Number | Date | Country
---|---|---
61197511 | Oct 2008 | US