1. Technical Field of the Invention
The present invention generally relates to the field of architecture, design and development of microprocessors used for audio processing, image processing, signal processing, speech recognition and matrix processing. More particularly, the invention relates to a same instruction different operation (SIDO) processor that allows a short instruction format and the flexibility to dynamically program the processor on the fly by changing data/operand words, and that supports basic integer operations using very simple and efficient hardware execution units.
2. Description of the Related Art
Technology advancement has led to the evolution of new high performance multimedia and DSP applications, requiring high-speed computation-intensive hardware. Implementations of existing architectures cannot keep up with the demands of increased performance as they are fast approaching the bounds of circuit complexity. Several attempts have been made to counter these issues and to develop the best low-cost and high-speed architectures. In this regard the use of very long instruction word (VLIW) architecture has gained a lot of recognition.
A chip with VLIW technology is capable of executing many operations within one clock cycle. Essentially, a compiler reduces program instructions into basic operations that the processor can perform simultaneously. The operations are put into a very long instruction word, which the processor then takes apart, passing the operations to the appropriate devices. In VLIW processors, several simple and short instructions are packed into a single very long instruction, and all of these instructions are executed in parallel. Instruction scheduling in VLIW processors is performed statically by the compiler; therefore their hardware complexity is low compared to superscalar processors. However, VLIW processors require very large instruction memories and high program memory bandwidth, which tends to increase the probability of cache misses during program execution. VLSI implementations of program memories require long wires and wide buses from the instruction memory to the instruction decode/control unit. In nanometer VLSI design, wire delay and intrinsic parasitics are major problems. Moreover, VLIW processors require expensive register files for multi-operand access. In general-purpose CPUs and VLIW processors, the program memory is made up of ROM, EEPROM, PROM, etc., which are relatively expensive compared to data memory. Since program memories are fixed, it is generally not possible or practical to change the flow of the program without using expensive branching techniques.
The present invention recognizes that it would be desirable to have a system or method that enables greater efficiency in handling the execution of operations. It is also desirable that such a system or method offer a very short instruction word with large data widths, with a number of short instructions packed into a long data word. It would be further desirable to have such a system or method that is also scalable to adapt to high clock speed and wide bandwidth processor designs without requiring significant hardware upgrades. These and other benefits are provided in the present invention as described herein.
A same instruction different operation (SIDO) processor and a method for the provision of operation-code along with data (operands) using a short instruction word are described. The data operands are stored as packed data in the memory space of Operand-A in the data memory. However, unlike conventional processors, the operation control words in the SIDO processor are stored in the data memory in the memory space of Operand-B. This results in a relatively smaller instruction memory for the storage of only one instruction operation code and data address.
The present invention allows a short instruction format and the flexibility to dynamically program the processor on the fly by changing operand words. The SIDO processor supports basic integer operations including add, subtract, shift, move, permute, multiply, etc. A number of permutations of the input operands and operations can be achieved by appropriately configuring the operation control word bits in the Operand-B memory space; for example, a 64-bit data word can allow 2^64 different combinations or permutations of operations. With all the execution units of the SIDO processor working in parallel on multiple data operands, a variety of operations can be performed in parallel. This makes the SIDO processor a very powerful number crunching engine for computation intensive applications.
The SIDO processor according to the present invention has numerous advantages over VLIW processors. First, in the SIDO processor, the instruction control word is supplied as an operand to the execution units using the data bus. Second, the SIDO processor requires a smaller instruction memory. Third, the SIDO processor requires less wiring for instruction buses. Fourth, the SIDO processor requires less switching of the instruction bus. Fifth, the SIDO processor requires fewer ports on registers. Sixth, the SIDO processor consumes less power. In fact, it consumes only one fourth of the power that a conventional VLIW processor consumes.
Additional advantages of the present invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The advantages of the present invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.
The present invention is best understood by referring to the accompanying figures and the detailed description set forth herein. Embodiments of the invention are discussed below with reference to the figures. However, those skilled in the art will readily appreciate that the description given herein with respect to the figures is for explanatory purposes as the invention extends beyond these limited embodiments.
Terminology: Given below is a list of definitions of the technical terms which are frequently used in this document:
Operands and operators—Operands refer to the objects that are manipulated and operators refer to the symbols that represent specific operations. For example, in the expression Y+7, Y and 7 are operands and + is an operator. In this document, “operands”, “data operands”, and “data words” are interchangeably used; “operator”, “operation code”, and “operation control word” are interchangeably used.
Operation control word—A predefined code which defines what operation needs to be performed, e.g. 000 for addition, 001 for subtraction, 010 for shift etc. Operation control word is used by the hardware controller to generate proper control signals for a particular operation.
Instruction—A basic command, such as the most rudimentary programming commands comprised of operation codes and data addresses.
Instruction control word—Instruction operation code, also known as opcode or op_code. In this invention, instruction control word is supplied by the instruction memory which is separated from the data memory.
Bus—A collection of wires through which data is transferred. All buses consist of two parts: an address bus and a data bus. The data bus transfers actual data whereas the address bus transfers information about where the data should go. Every bus has a clock speed measured in MHz.
Register—A temporary storage area in computers.
Execution unit—A device for performing logic operations. In a processor like the SIDO processor according to the present invention, the execution unit comprises at least one arithmetic logic unit.
Multiplexer—A multiplexer combines more than one input into a single output. The input selection is performed or controlled by an input select signal. For example, a two-input multiplexer is a simple connection of logic gates whose output Y is either input A or input B depending on the value of a third input S which selects the input.
Compressor—One of the major speed enhancement techniques used in modern digital circuits is the ability to add numbers with minimal carry propagation. For example, a 3:2 compressor reduces three numbers to 2, by doing the addition while keeping the carries and the sum separate. This means that all of the columns can be added in parallel without relying on the result of the previous column, creating a two output “adder” with a time delay that is independent of the size of its inputs.
Ripple carry adder—A ripple carry adder allows the addition of two k-bit numbers to produce one k-bit output. The addition is performed using carry propagation from bit-0 up to bit-(k−1).
Shifter—A hardware device that can shift a data word by any number of bits in a single operation. It is implemented like a multiplexer. Each output can be connected to any input depending on the shift distance.
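The compressor and ripple carry adder definitions above can be illustrated with a minimal software sketch. It is only a behavioral model, not the hardware described in this document: the function names `compress_3_2` and `ripple_carry_add` are hypothetical, and the 16-bit width is assumed from the operand width used elsewhere in the text.

```python
MASK16 = 0xFFFF  # assumed 16-bit datapath width

def compress_3_2(x, y, z, mask=MASK16):
    """3:2 compressor: reduce three numbers to a sum vector and a carry vector."""
    sum_vec = (x ^ y ^ z) & mask                             # per-column sum, no carry propagation
    carry_vec = (((x & y) | (x & z) | (y & z)) << 1) & mask  # per-column carry, moved one column up
    return sum_vec, carry_vec

def ripple_carry_add(a, b, width=16):
    """Add two width-bit numbers bit by bit, the carry rippling upward from bit 0."""
    result, carry = 0, 0
    for i in range(width):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        result |= (abit ^ bbit ^ carry) << i
        carry = (abit & bbit) | (carry & (abit ^ bbit))
    return result  # the carry out of the top bit is dropped (modulo 2**width)

# Three inputs are compressed to two vectors; one final carry-propagate add gives the total.
s, c = compress_3_2(0x1234, 0x0FF0, 0x00FF)
total = ripple_carry_add(s, c)
```

Only the last addition propagates carries; the compressor stage works on every bit column independently, which is why its delay does not grow with the operand width.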
The present invention teaches a same instruction different operation (SIDO) processor which allows a very short instruction word with large data widths. One of the distinct characteristics of the SIDO processor is that it packs several simple and short instructions into a long data word. In today's high performance computers, large data widths of 64, 128, or 256 bits are common. Therefore, storing instruction operation codes as data words is very appealing for high performance processing. The SIDO processor allows a short instruction format and the flexibility to dynamically program the processor on the fly by changing operand words, and supports basic integer operations including add, subtract, shift, move, permute, etc., using very simple and efficient hardware execution units. A number of permutations of the input operands and operators can be achieved by appropriately configuring the operation control word bits in data memory on the fly.
Referring to
The operation control words 320-323 for the operation of execution units 313-316 are also provided from the data memory Operand-B 304. The four 16-bit operation control words are concatenated as a 64-bit operation control word in the memory space of Operand-B 304. A 64-bit data bus 317 is used to load registers B0-B3 in parallel from the data memory 302. The 16-bit operation control words (OCW) are stored as follows: OCW 309 in register B0, OCW 310 in register B1, OCW 311 in register B2, and OCW 312 in register B3. The bit positions are divided into four groups which are loaded concurrently. The most significant 16-bit group, from bit-63 to bit-48, is loaded into register B3; the second group, from bit-47 to bit-32, is loaded into register B2; the third group, from bit-31 to bit-16, is loaded into register B1; and the fourth group, from bit-15 to bit-0, is loaded into register B0. In other words, each of the registers B0-B3 is loaded with a unique section of a control word which represents a unique series of bit positions. Note that the contents of registers B0-B3 are not treated as data; rather, they are treated as 16-bit operation control words.
The four 16-bit operation control words packed in the 64-bit Operand-B 304 operate on the four execution units 313-316 in parallel: OCW 320 is applied to execution unit 313; OCW 321 is applied to execution unit 314; OCW 322 is applied to execution unit 315; and OCW 323 is applied to execution unit 316.
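The loading of registers B0-B3 from the 64-bit Operand-B word can be sketched in software as follows. The bit ranges follow the description above (bit-15 to bit-0 into B0, up to bit-63 to bit-48 into B3); the function names `pack_operand_b` and `unpack_operand_b` are hypothetical, and the loop stands in for what the hardware does concurrently.

```python
def pack_operand_b(b0, b1, b2, b3):
    """Concatenate four 16-bit control words into one 64-bit Operand-B word."""
    return ((b3 & 0xFFFF) << 48) | ((b2 & 0xFFFF) << 32) | ((b1 & 0xFFFF) << 16) | (b0 & 0xFFFF)

def unpack_operand_b(word64):
    """Split a 64-bit word into the four 16-bit control words [B0, B1, B2, B3]."""
    return [(word64 >> (16 * i)) & 0xFFFF for i in range(4)]

word = pack_operand_b(0x1111, 0x2222, 0x3333, 0x4444)
b0, b1, b2, b3 = unpack_operand_b(word)   # B3 receives bits 63..48, B0 receives bits 15..0
```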
Table 1 of
The operations that can be executed on the result of the above operations are supplied using bits 12-15 of the 16-bit operation control words. These operations include but are not limited to a variable eight-position shift left or right of the result. Each 16-bit operation control word of the preferred embodiment is divided into a 12-bit operation code for operation on input operands A0-A3 and a 4-bit operation code for the output shift amount. The first 3-bit group of the 12-bit operation code, from the least significant bit (LSB) position bit0 to bit2, defines the operation to be performed on the input operand 305 A0; the second 3-bit group, from bit3 to bit5, defines the operation to be performed on the input operand 306 A1; the third 3-bit group, from bit6 to bit8, defines the operation to be performed on the input operand 307 A2; and the fourth 3-bit group, from bit9 to bit11, defines the operation to be performed on the input operand 308 A3.
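The field layout of one 16-bit operation control word can be sketched as a small decoder. It assumes the four 3-bit groups are contiguous (bits 0-2, 3-5, 6-8 and 9-11) and that bits 12-15 hold the output shift control; the function name `decode_ocw` is hypothetical, and the meaning of each 3-bit code is deliberately left uninterpreted here.

```python
def decode_ocw(ocw):
    """Split a 16-bit operation control word into its five fields."""
    operand_codes = [(ocw >> (3 * i)) & 0x7 for i in range(4)]  # codes for A0, A1, A2, A3
    shift_control = (ocw >> 12) & 0xF                           # output shift field, bits 12-15
    return operand_codes, shift_control

# Bits 15..12 = 1010, then the codes for A3, A2, A1, A0 = 011, 010, 001, 000.
codes, shift = decode_ocw(0b1010_011_010_001_000)
```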
As shown in Table 2 of
In another preferred embodiment, the SIDO data processor includes a device or an algorithm for concatenating one or more operation control words in a first operand, a device or an algorithm for concatenating data words in one or more data operands, at least one memory for storing the first operand and the data operands, a first set of registers being loaded in parallel with the first operand, a second set of registers being loaded in parallel with the data operands, and one or more execution units using the operation control words decoded from the first operand to perform operations on the data operands. Note that the number of registers in the first set is equal to the number of execution units. Each of the first set of registers is loaded with a unique section of the first operand. The unique section of the first operand is representative of a group of operation control words at a unique series of bit positions. For example, as illustrated in
The 16-bit outputs a-d from all of the four multiplexers 401-404 respectively are routed to the 4×2 compressor 405 for addition. To perform a negation operation, the input operands A0-A3 are inverted using inverters 420, and logic one is input as Carry-in 412 which is also fed to compressor 405. Carry-in 412 and the outputs a-d are all summed up by using a tree of adders in compressor 405. Sum and carry vectors that are generated by compressor 405 are sent to a carry propagate adder such as a ripple carry adder 406 for addition to obtain the final output. Then, the result of the ripple carry adder 406 is sent to a shifter 407 that receives output-controls 413 from the four most significant bits (MSB) 15:12 of the operation control word. The output-control signals define the direction and number of bits the output needs to be shifted to yield the final 16-bit result.
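A behavioral sketch of one execution unit may help to follow this datapath: a per-operand selection, a sum of the four selected values, and a final output shift. Several details are assumptions made for illustration only and are not taken from the source: the 3-bit code values (the actual encoding is defined in Table 1), the arithmetic right shift, and the layout of the 4-bit shift field (bit 3 as direction, bits 2-0 as amount). The names `select` and `execution_unit` are hypothetical.

```python
MASK16 = 0xFFFF

# Hypothetical 3-bit operand codes; the real encoding is given in the source's Table 1.
PASS, NEG, SHL1, SHL1_NEG, SHR1, SHR1_NEG, ZERO = range(7)

def to_signed(v):
    """Interpret the low 16 bits of v as a two's complement number."""
    v &= MASK16
    return v - 0x10000 if v & 0x8000 else v

def select(code, a):
    """Model of one input multiplexer: pick a variant of operand a."""
    x = to_signed(a)
    if code in (SHL1, SHL1_NEG):
        x <<= 1
    elif code in (SHR1, SHR1_NEG):
        x >>= 1                      # arithmetic shift (an assumption)
    elif code not in (PASS, NEG):
        x = 0                        # ZERO and the unused code 7
    if code in (NEG, SHL1_NEG, SHR1_NEG):
        x = -x                       # hardware: invert the bits and feed a carry-in of 1
    return x

def execution_unit(ocw, operands):
    """Sum the four selected operands, then apply the output shift."""
    total = sum(select((ocw >> (3 * i)) & 0x7, a) for i, a in enumerate(operands))
    total = to_signed(total)              # the adder output is 16 bits wide
    ctl = (ocw >> 12) & 0xF
    amount, right = ctl & 0x7, ctl >> 3   # assumed layout: bit 3 = direction
    total = total >> amount if right else total << amount
    return total & MASK16

# (A0 - A1 + A2 - A3) shifted left by one bit, under the assumed encoding.
result = execution_unit(0x1208, [100, 20, 3, 1])
```

In this model the sum is formed with ordinary integer arithmetic; in the hardware the same value is produced by the compressor followed by the carry propagate adder.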
In this embodiment, the execution units 313-316 in
Now referring to
Here are a few examples: If A0:A1:A2:A3 are four 16-bit input operands packed as 64 bits in Operand-A, and the operation control words are provided as four 16-bit words packed in Operand-B, then, based on the configuration, the output words can be (A0+A1+A2+A3) shifted left; (A0−A1−A2−A3) shifted right; (A0+A1−A2+A3) with no shift; and (A0−A1+A2−A3) shifted left, from all four execution units at the same time. Different operations between A0, A1, A2, and A3 are based on the configuration word in Operand-B. The instruction configuration is based on the 16-bit values given in Table 1 of
Table 3 of
Table 4 of
Table 5 of
Table 6 of
To perform the 4×4 matrix multiplication using the SIDO processor, the input matrix columns are stored in 64-bit wide data registers R00-R15 of a typical processor, each packed as four 16-bit words: the first column (y00-y30) is stored in R10; the second column (y01-y31) is stored in R11; the third column (y02-y32) is stored in R12; and the fourth column (y03-y33) is stored in R13 as shown below:
R10 y00:y10:y20:y30
R11 y01:y11:y21:y31
R12 y02:y12:y22:y32
R13 y03:y13:y23:y33
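The column packing listed above can be sketched as follows. The sample matrix values are arbitrary, the helper name `pack4` is hypothetical, and the lane order is an assumption: the leftmost word of the colon notation is placed in the most significant 16 bits, which the source does not state explicitly.

```python
def pack4(w0, w1, w2, w3):
    """Pack four 16-bit words as w0:w1:w2:w3 (w0 in the top 16 bits, an assumption)."""
    return ((w0 & 0xFFFF) << 48) | ((w1 & 0xFFFF) << 32) | ((w2 & 0xFFFF) << 16) | (w3 & 0xFFFF)

# Y[row][col], arbitrary sample values for illustration.
Y = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

# R10..R13 hold columns 0..3: R1c = y0c:y1c:y2c:y3c
regs = {f"R1{c}": pack4(*(Y[r][c] for r in range(4))) for c in range(4)}
```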
The operation control words for the matrix multiplication are stored in register R08 and packed as four 16-bit words. Using Table 1 of
BIT [31:18] is for INSTRUCTION OPCODE. The 14-bit instruction operation code distinguishes the different instructions of a typical processor. One of these instruction codes could be of the SIDO type. For the sake of explanation, the SIDO instruction op_code is 1 (decimal), or 00 0000 0000 0001 (binary).
BIT [17:12] is for OUT. The 6-bit code is for output register or memory write address.
BIT [11:6] is for OPA. The 6-bit code is for Operand-A register or memory read address (from memory space of Operand-A 303 of
BIT [5:0] is for OPB. The 6-bit code is for Operand-B register or memory read address (from the data memory Operand-B 304 of
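The four instruction fields listed above fit in a 32-bit word and can be sketched as an encoder and decoder. The field positions and the SIDO op_code value of 1 come from the text; the function names are hypothetical, and the register addresses in the example (output 20, Operand-A 10, Operand-B 8) are illustrative choices.

```python
SIDO_OPCODE = 1  # example op_code given in the text

def encode_instruction(opcode, out, opa, opb):
    """Build a 32-bit word: [31:18] opcode, [17:12] OUT, [11:6] OPA, [5:0] OPB."""
    return ((opcode & 0x3FFF) << 18) | ((out & 0x3F) << 12) | ((opa & 0x3F) << 6) | (opb & 0x3F)

def decode_instruction(word):
    """Recover the four fields from a 32-bit instruction word."""
    return {
        "opcode": (word >> 18) & 0x3FFF,
        "out": (word >> 12) & 0x3F,
        "opa": (word >> 6) & 0x3F,
        "opb": word & 0x3F,
    }

word = encode_instruction(SIDO_OPCODE, out=20, opa=10, opb=8)
fields = decode_instruction(word)
```

The single short instruction carries only addresses; which arithmetic is performed is decided by the contents of the word that the OPB address points to.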
In another preferred embodiment, the present invention is deployed as a method or process. The data processor includes at least one memory for storing data operands, at least one memory for storing instruction code and data addresses, and at least one execution unit. One of the data operands is specifically used to provide the execution units with one or more operation control words to execute operations on the remaining data operands. The basic steps of the method or process include: fetching the data operands, decoding the operation control words from one of the data operands in parallel, and executing the operations by applying the operation control words to the remaining operands. The step of fetching the data operands may include various sub-steps, for example: fetching the instruction, decoding the instruction and generating the data operand addresses, reading the control word using the address of the data operand that holds the control words, reading the remaining data operands, storing the control word into a first set of registers, and storing the remaining data operands in a second set of registers. Upon execution, the processor writes the result of the step of executing as an output. The operations performed by the processor include, but are not limited to, ADD, NEGATE, SHIFT LEFT 1 bit, SHIFT LEFT 1 bit with NEGATE, SHIFT RIGHT 1 bit, SHIFT RIGHT 1 bit with NEGATE, ZERO, and MULTIPLICATION. Prior to the step of writing the result, further operations may be performed on the result of the step of executing. The further operations may include any of: shifting left, shifting right, addition, subtraction, multiplication, division, saturation, rounding, and logical operations such as AND, OR, XOR, XNOR, NOR, and NAND.
Step 801: Fetch instruction.
Step 802: Decode instruction and generate Operand A and Operand B addresses.
Step 803 and Step 804 are concurrently executed steps, wherein Step 803 includes sub-steps 803a-803b, and Step 804 includes sub-steps 804a-804c.
Sub-step 803a: Read 64-bit data (four 16-bit packed data) using address Operand-A.
Sub-step 803b: Store the 64-bit data (four 16-bit packed data) into four 16-bit registers A0-A3 of
Sub-step 804a: Read 64-bit control word (four 16-bit packed data) using address Operand-B.
Sub-step 804b: Store the 64-bit control word (four 16-bit packed data) into four 16-bit registers B0-B3 of
Sub-step 804c: Decode four 16-bit control words in parallel: bits 0-2, operation control for A0; bits 3-5, operation control for A1; bits 6-8, operation control for A2; bits 9-11, operation control for A3; and bits 12-15, direction (left/right) and number of bits by which the output result is to be shifted.
Step 805: Perform sixteen operations on the four operands stored in registers A0-A3 using their respective operation control words in registers B0-B3 in parallel.
Step 806: Shift the four output results produced in Step 805 by the number of bits and direction specified in the control word bits 12-15.
Step 807: Write the calculation result in Out Register (OUT).
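Steps 801 through 807 can be traced end to end with a small software model. It is a sketch under stated assumptions rather than the patented implementation: only two operand codes are modeled (0 = add, 1 = negate, anything else contributes zero), the shift field is assumed to use bit 3 for direction and bits 2-0 for the amount, the right shift is assumed to be arithmetic, and the register file is a plain dictionary. All function names are hypothetical.

```python
MASK16 = 0xFFFF

def lanes(word64):
    """Split a 64-bit word into four 16-bit lanes, lane 0 = bits 15..0."""
    return [(word64 >> (16 * i)) & MASK16 for i in range(4)]

def run_unit(ocw, a):
    """One execution unit. Assumed codes: 0 = add, 1 = negate, others = zero."""
    total = 0
    for i in range(4):
        code = (ocw >> (3 * i)) & 0x7        # sub-step 804c: 3-bit field for operand Ai
        if code == 0:
            total += a[i]
        elif code == 1:
            total -= a[i]
    total &= MASK16                          # step 805 result, 16 bits wide
    if total & 0x8000:
        total -= 0x10000                     # treat as signed for the arithmetic shift
    ctl = (ocw >> 12) & 0xF                  # bits 12-15: shift control
    amount, right = ctl & 0x7, ctl >> 3      # assumed layout of the shift field
    return (total >> amount if right else total << amount) & MASK16   # step 806

def execute(instr, regs):
    """Steps 802-807 for one fetched instruction word (step 801 supplies instr)."""
    if (instr >> 18) & 0x3FFF != 1:
        raise ValueError("not a SIDO instruction")
    out, opa, opb = (instr >> 12) & 0x3F, (instr >> 6) & 0x3F, instr & 0x3F  # step 802
    a = lanes(regs[opa])                     # sub-steps 803a-803b: registers A0-A3
    b = lanes(regs[opb])                     # sub-steps 804a-804b: registers B0-B3
    results = [run_unit(b[u], a) for u in range(4)]                   # steps 805-806
    regs[out] = sum(r << (16 * u) for u, r in enumerate(results))     # step 807

regs = {
    10: (1 << 48) | (3 << 32) | (20 << 16) | 100,              # A3:A2:A1:A0 = 1, 3, 20, 100
    8: (0x9000 << 48) | (0x1208 << 32) | (0x0249 << 16) | 0,   # B3:B2:B1:B0
}
execute((1 << 18) | (20 << 12) | (10 << 6) | 8, regs)          # OUT=20, OPA=10, OPB=8
```

With these inputs the same instruction yields four different results in register 20, one per execution unit: the plain sum, the negated sum, an alternating sum shifted left, and the sum shifted right. Changing only the word in register 8 changes the operations without touching the instruction.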
Referring back to
Although the SIDO solution based on the foregoing exemplary illustrations is applied only to simple operations such as addition, subtraction, negation etc., the same concept can be applied to more complex instructions such as multiplication and division by small numbers. In case of multiplication, the 16-bit control word can represent 4 multipliers of 4 bit each. Using this approach a single SIDO instruction can perform operations similar to the following:
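As a purely illustrative reading of the multiplication case, the sketch below treats the 16-bit control word as four unsigned 4-bit multipliers. The nibble order (lowest nibble applied to A0), the unsigned interpretation, and the summing of the four products into one 16-bit result are all assumptions and are not taken from the source; the function name `sido_multiply` is hypothetical.

```python
def sido_multiply(ocw, operands):
    """Multiply each operand by its own 4-bit multiplier and sum the products."""
    multipliers = [(ocw >> (4 * i)) & 0xF for i in range(4)]   # m0..m3 from the four nibbles
    return sum(m * a for m, a in zip(multipliers, operands)) & 0xFFFF

# Multipliers 1, 2, 3, 4 applied to A0..A3 = 10, 20, 30, 40.
result = sido_multiply(0x4321, [10, 20, 30, 40])
```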
In the preferred embodiment illustrated in
Typical applications of the present invention include, but are not limited to, the compute intensive tasks in audio processing, video processing, image processing, JPEG, H.264, MPEG, signal processing, speech coding, speech recognition, computer vision, matrix processing, vector math, cryptography, and the like. All of these applications require a large number of arithmetic operations. Therefore, the SIDO solution provided in the present invention is a well-suited choice of architecture for these applications.
Although the invention has been described with reference to at least one specific embodiment, this description is not meant to be construed in a limiting sense. Various modifications of the disclosed embodiment, alternative embodiments or other equivalent solutions of implementing the disclosed SIDO processor with short instructions and provisions of sending operands will become apparent to those skilled in the art upon reference to the description of the invention. It is therefore contemplated that such modifications, equivalents, and alternatives can be made without departing from the spirit and scope of the present invention as defined in the appended claims.
The present application claims benefit of prior filed provisional Appl. Ser. No. 60/648,839 filed on Jan. 31, 2005, the entire content of which is hereby incorporated by reference.
Number | Date | Country
---|---|---
60648839 | Jan 2005 | US
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/US2006/003229 | Jan 2006 | US
Child | 12016171 | | US