This application is related to an application entitled “Add-Shift-Round Instruction with Dual-Use Source Operand for DSP” and an application entitled “Rounding Correction for Add-Shift-Round Instruction with Dual-Use Source Operand for DSP”. These three applications have the same inventors, are commonly assigned, and are simultaneously filed.
1. Technical Field of the Invention
This invention relates generally to digital signal processors, and more specifically to an instruction in which a single operand field provides both an operand value and a control value. More particularly, the instruction is an add-shift-round, the operand value is a rounding bias, and the control value is a shift count.
2. Background Art
In addition to their ISA, some processors also have a microarchitecture which is not directly visible to the ISA code, and which is used at a lower level to implement the ISA. Many processors' microarchitectures are microcoded, in that they have their own “native” software format and control constructs.
In the example shown, the processor retrieves and executes this code from a memory/storage system under control of an instruction fetcher. To improve performance, the ISA code is typically stored in an instruction cache, and may be speculatively brought in from memory/storage by a prefetcher in coordination with a branch predictor. There may also be a separate data cache in some instances. Memory may include DRAM, SRAM, ROM, flash memory, or the like, and storage may include hard disk, CD-ROM, DVD-RAM, or the like. The memory and storage may be coupled directly to the processor, or it may be coupled indirectly via one or more intervening systems or transmission means (not shown). In some embodiments, it may reside on die with the processor core.
Regardless of how or when the code is brought into the processor, before it can be executed, an instruction decoder parses the incoming code to ascertain which instructions are contained in the code. In many machines, the instruction decoder generates microcode including a series of one or more microinstructions which correspond to a given ISA instruction. While the ISA code may be thought of as being the “native” instructions of the architecture, the microcode (μcode) is the “native” instructions of the microarchitecture or the execution units in the processor.
Some ISA instructions, such as trigonometric math functions, require complex operations, and result in lengthy microcode flows. In many instances, it is beneficial to permanently store these microcode flows in a microcode read-only memory (ROM). When the instruction decoder detects such an ISA instruction, the instruction decoder triggers the microcode ROM to output the corresponding microcode flow.
The microcode from the instruction decoder and/or from the microcode ROM is sent to a microinstruction scheduler which controls the delivery of the microcode instructions to the various execution units of the processor, in accordance with the availability of the execution units, the availability of the required input data operands for the microinstructions (μops), and so forth. Ultimately, the microinstructions are executed and their results are written to their appropriate destinations, whether in the register file, memory, storage, or the like. The results are typically also written to the data cache.
All ISAs include various forms of add and subtract instructions. These typically specify two or more source operands such as registers, whose contents are added or subtracted to generate a result which is written to a destination. In some instructions, the destination is expressly identified as an operand of the instruction. In others, the destination is implicit, either in that the result is always written to the same register, or in that the result is written to the register from which one of the source operands was taken.
For example, the X86 instruction set includes an instruction of the form:
Most ISAs include various instructions which employ one or more rounding modes. When the execution unit produces a result whose precision is greater than the destination is able to represent, the result is rounded before being stored to the destination. A variety of rounding modes are known in the art, such as: round toward zero, round away from zero, round toward positive infinity, round toward negative infinity, and round to nearest. There are two common variations of round to nearest, differing in how they handle numbers which fall exactly between two valid rounding results (e.g. at X.5); in the “round to nearest even” mode, 2.5 is rounded to 2, and 3.5 is rounded to 4; in the “round to nearest up” mode, 2.5 is rounded to 3, and 3.5 is rounded to 4.
y=f(x)
where, for each possible value of x, there is exactly one value y.
The rounding function operates as follows. The “open” function markers (shown as non-filled circles) do not constitute part of the function result line, but the “closed” function markers (shown as filled circles) do. For any value on the x axis, there is exactly one point where that x value intersects the function curve, specifying a resulting y value. The open and closed function markers fall at exactly the 0.5 midpoints between adjacent integers, such as at −2.5 and at 1.5. If the x value is exactly Z.5 (where Z is any integer), the resulting y value is Z+1. Thus, the rounding function is “round to nearest integer, and round 0.5 midpoints up.”
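By way of illustration only, the two round-to-nearest variants discussed above can be contrasted in a short Python sketch. The helper names round_nearest_even and round_nearest_up are hypothetical and do not appear in this disclosure, and the sketch assumes inputs that are exactly representable in binary floating point.

```python
import math


def round_nearest_even(x: float) -> int:
    # Python's built-in round() sends exact .5 midpoints to the even neighbor.
    return round(x)


def round_nearest_up(x: float) -> int:
    # Adding 0.5 and taking the floor sends every Z.5 midpoint to Z+1.
    return math.floor(x + 0.5)


# Hypothetical sample inputs, chosen to sit exactly on midpoints.
for x in (2.5, 3.5, -2.5, 1.5):
    print(x, round_nearest_even(x), round_nearest_up(x))
```

The two helpers differ only at midpoints whose lower neighbor is even, such as 2.5.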
Most ISAs also include various forms of shift instructions, which cause the contents of a specified source operand register or an intermediate result to be bit-shifted either left or right as specified by the opcode of the instruction. The shifted result is then written to a specified register or an implicitly identified register. The number of bit positions by which the result is shifted, is typically specified as an immediate value or register operand in the instruction. For example, the X86 architecture includes an instruction of the form:
There are a very few examples of implicitly specified shift count values. For example, the X86 architecture includes an instruction of the form:
Many digital signal processing software algorithms, such as multi-tap filters, perform operations which are implemented by series of multiple instructions, and which are of the equation form:
dest:=(a+b+c+d . . . +m+2^(n−1))>>n
where dest is the destination, a through m are a set of two or more source operands, and >> is the right shift operation, where the sum of the various operands is right shifted by n bit positions.
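A minimal Python sketch of this operation is given below for illustration. The function name sum_shift_round is hypothetical, and the sketch assumes integer operands and a shift count n of at least 1.

```python
def sum_shift_round(operands, n):
    """Return (sum(operands) + 2**(n-1)) >> n, assuming n >= 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rounding_bias = 1 << (n - 1)  # the 2^(n-1) term of the equation
    return (sum(operands) + rounding_bias) >> n


# 10+11+12+13 = 46; 46/4 = 11.5, which rounds up to 12.
print(sum_shift_round([10, 11, 12, 13], 2))
```

Adding the bias before the shift is what turns the truncating right shift into a round-to-nearest division by 2^n, with midpoints rounded up.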
These operations are typically executed hundreds of times for each macro-block in a video display, each time the frame is refreshed. Each of these operations requires the execution of a lengthy sequence of instructions.
What is needed, then, is an improved digital signal processor which includes one or more new instructions specifically designed to execute these digital signal processing software operations in a reduced number of instructions or clock cycles.
The invention will be understood more fully from the detailed description given below and from the accompanying drawings of embodiments of the invention which, however, should not be taken to limit the invention to the specific embodiments described, but are for explanation and understanding only.
The term “source value” will be used to denote the original value of the operand in question, either the value of an immediate, or the contents of a register, or the contents of a memory address, and so forth. The term “operand value” will be used to denote the value upon which an instruction's functionality is performed, such as an addend, whether directly specified by the source value or derived from the source value. The term “control value” will be used to denote a value which controls some arithmetic etc. characteristic of the functionality of the instruction. For example, the instruction's opcode may specify that the instruction is a shift instruction, and a control value may determine whether the shift is left or right, and/or by how many bit positions the result is shifted, and so forth.
A processor using this invention executes a “dual-use-source instruction”, which is one in which a single source value results in both an operand value and a control value. The processor generates the operand value or the control value or both from the source value.
For ease of illustration, the invention will mainly be discussed with reference to embodiments in which the source value is specified as an immediate, but the invention is not necessarily limited to such embodiments.
The present invention includes provision in the processor for executing a new instruction, which may be represented as being of the form:
In this instance, ADDSRN operates on signed values. In some embodiments, there may also be an unsigned version ADDSRN.U of this instruction, but for purposes of illustrating the invention, they will collectively be referred to as simply ADDSRN in this disclosure. The mnemonic suggests “ADD and Shift Right and round to Nearest”.
This instruction is especially useful in speeding up the DSP operation
dest:=(a+b+c+d . . . +m+2^(n−1))>>n
Specifically, the ADDSRN instruction performs the addition of the final three operands, the shifting, and the rounding, in a single instruction. In some embodiments, this may be accomplished in a single clock cycle.
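The arithmetic effect of the instruction can be modeled in software as sketched below. This is a hypothetical functional model only: the function names addsrn and add3, the operand order, and the use of a separate three-input add for the leading partial sum are assumptions made for illustration, not the instruction format of this disclosure.

```python
def add3(src1: int, src2: int, src3: int) -> int:
    # Assumed three-input add used to accumulate a partial sum.
    return src1 + src2 + src3


def addsrn(src1: int, src2: int, n: int) -> int:
    # Assumed model: add both sources and the bias 2^(n-1), then shift right by n.
    if n < 1:
        raise ValueError("n must be at least 1")
    return (src1 + src2 + (1 << (n - 1))) >> n


# One possible decomposition of (a + b + c + d + 2^(n-1)) >> n with n = 2.
partial = add3(1, 2, 3)
print(addsrn(partial, 4, 2))
```

In this model the final addition, the bias, and the shift are folded into the single addsrn step, which is the part of the operation the instruction is described as performing.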
This instruction represents a significant improvement over the prior art. In previous DSP systems, it was necessary to perform a complex and time-consuming series of instructions to perform the functionality of the single ADDSRN instruction. The following is a comparison of the present invention with a hypothetical prior art machine, in executing this operation:
R1:=(R2+R3+R4+R5+2^1)>>2
Assuming that all are single-cycle instructions, and that execution must be serialized (only a single ALU), the prior art DSP takes 50% longer to complete the operation than does the present invention.
The following is a comparison on a more complex operation:
R1:=(R2+R3+R4+R5+R6+R7+R8+R9+2^3)>>4
Using those same assumptions, even on this longer flow, the prior art processor takes 25% longer to complete the operation than does the present invention.
The first source value SRC1 and the second source value SRC2 are provided as operands to the adder, typically via a chain of logic (omitted here for simplicity) which may include a shifter, a bypass mux, and so forth.
The adder receives the third source value SRC3 via another logic chain. For clarity of explanation, an SRC3 value of 00000011₂ or 3₁₀ is illustrated. The third source value is provided to an immediate decoder (IMM DEC) which assumes that the third source value is an encoded value for use in executing the ADDSRN instruction. The immediate decoder decodes the source value N into the rounding bias value 2^(N−1) (DEC_SRC3). In the example shown, the immediate 00000011₂ is decoded into the value 00000100₂. The original third source value 00000011₂ and the decoded value 00000100₂ are provided to a decode mux which selects one of them, according to a control signal is_ADDSRN which indicates whether the instruction is, in fact, the ADDSRN instruction. This same hardware can also be used to execute a three-input ADD instruction in which SRC3 explicitly identifies the third addend.
A bypass mux receives the output of the decode mux, and also a variety of other data sources from which operand values can be taken, such as the outputs of other ALUs (not shown). A bypass mux control value SRC3_Select determines which of these inputs provides the third source value for the current instruction. In the case of the ADDSRN instruction, it will select the data coming from the decode mux.
Because this hardware may be capable of executing a variety of instruction types, not all of which have a third operand, a 3S mux selects either the output of the bypass mux, or the value 00000000₂ (zero, which is inert in addition and subtraction operations), to be used as the third input to the adder, according to a control signal is_3S which indicates whether the current instruction has a third operand.
The adder then adds these three operand values, optionally (but advantageously) with one or two bits of extra internal precision (to handle intermediate overflows, sign extension, and rounding modes), and provides the resulting sum to a result shifter.
The result shifter shifts this sum by a number of bit positions determined by a shift count control value at a shift control input. In the case of the ADDSRN instruction, the shift count value is the decoded value of the SRC3 operand. A count mux selects either the value zero or the output of the bypass mux as the shift count, according to a control signal is_Shift which indicates whether the current instruction is an instruction in which the shift count will come from the bypass mux of the SRC3 logic chain. Recall that the shift count was specified as N (00000011₂) by the original instruction, but has been decoded into the form 2^(N−1) (00000100₂) by the immediate decoder. Typically, the result shifter will be constructed as a set of shift muxes, one per adder output bit line, and these muxes select among their inputs according to a set of mutually exclusive control inputs (in which exactly one bit will be 1 and the rest will be 0). In instructions which do not shift, or which shift by zero positions, the least significant bit (LSB) of the shift muxes' control inputs will be 1.
Note that the decoded SRC3 value will have at most one “1” bit (because the decoder generates a number of the form 2^(N−1)), and that it will be in the Nth position from the right (LSB) of the decoded SRC3 value. In one embodiment, the count mux appends to its output an extra bit in the least significant bit position, which is 1 when the is_Shift control signal selects the 0 input of the count mux, and 0 otherwise; this extra bit signal can be used to control the result shifter muxes to select their “pass through” (non-shifted) input—it becomes the LSB of the shift mux control word. In one embodiment, this LSB is generated simply by a NOR gate whose inputs are the various bits of the count mux output; when is_Shift is 0 (and the count mux passes through the constant 00000000), or when the output of the bypass mux is 00000000, the LSB NOR gate generates a 1; otherwise, it generates a 0.
The output of the result shifter is then written to the destination specified by the instruction.
Note that, in this embodiment, the original SRC3 shift control value 00000011₂ has been discarded early in the logic chain, and only its decoded data operand counterpart 00000100₂ is used in later stages of the logic chain. And note further that, in this embodiment, the special mathematical relationship between the binary representations of N and 2^(N−1) (specifically, that the binary 2^(N−1) has exactly one 1 and it falls in the Nth position from the right) enables this to be the case. If the operand value and the control value had some other mathematical relationship, such as N and 3N+7, or N and N/2+1, it might be necessary to pass both N and 2^(N−1) down parallel logic chains.
If the SRC3 input had been 00000101₂ or 5₁₀, the immediate decoder would have generated the value 00010000₂ or 16₁₀. The adder would add SRC1+SRC2+00010000₂ and the result would have been shifted by five positions.
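The data path described above can be approximated by the following behavioral sketch, offered for illustration only. The function names imm_decode and alu are hypothetical, the bypass mux is assumed always to select the decode mux output, a zero immediate is assumed to decode to zero, and unbounded Python integers stand in for the fixed-width adder and shifter.

```python
def imm_decode(n: int) -> int:
    # Immediate decoder: N -> 2^(N-1), a single 1 in the Nth position from the right.
    # Assumption: an immediate of zero decodes to zero (not stated in the text).
    return 1 << (n - 1) if n > 0 else 0


def alu(src1, src2, src3, is_addsrn, is_3s, is_shift):
    decode_mux = imm_decode(src3) if is_addsrn else src3
    bypass_mux = decode_mux                      # assumed SRC3_Select setting
    third = bypass_mux if is_3s else 0           # 3S mux
    total = src1 + src2 + third                  # three-input adder
    count = bypass_mux if is_shift else 0        # count mux
    if (count & (count - 1)) != 0:
        raise ValueError("shift count must be zero or one-hot in this model")
    lsb = 1 if count == 0 else 0                 # NOR of the count mux output bits
    control_word = (count << 1) | lsb            # LSB means shift by zero
    shift = control_word.bit_length() - 1        # position of the single 1 bit
    return total >> shift


print(alu(10, 11, 3, True, True, True))     # ADDSRN with N = 3
print(alu(1, 2, 3, False, True, False))     # three-input ADD
```

Because the decoded value 2^(N−1) is already one-hot, the same bit pattern serves as the third addend and, after the extra LSB is appended, as the shift control word that selects a shift of N.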
In this embodiment, the immediate decoder has been moved downstream of the bypass mux, making the circuit suitable for use with an ISA in which the dual-use operand is not necessarily an immediate value. By decoding the output of the bypass mux, the shift count can be taken from, e.g., the result of an immediately preceding instruction which has not even been written to the register file yet.
The immediate decoder performs the function 2^N on the SRC3 operand value, generating the rounding bias value which will be passed down the logic chain to the third input of the adder. In the embodiment of
Note that in this embodiment, the original value of SRC3 did not directly specify either the bias value or the shift count; both are derived from it by the processor. In the example shown, both are related to the SRC3 value by respective arithmetic functions. In other embodiments, one or both could be more indirectly derived from it. In other words, SRC3 may simply be a decode input value which is used as a mere index into respective decode lookup tables storing corresponding bias values and shift counts, neither of which may necessarily be mathematically related to the SRC3 value.
The shift count is provided by a count mux which includes one-hot-output decoder logic on its control inputs, which operates as follows. If the is_ADDSRN signal is active, the count mux passes the output of the immediate decoder. Otherwise, if the is_ADDS signal is active, the count mux passes the SRC3 value. Otherwise, if the is_Shift signal is active, the count mux passes the SRC2 value. Otherwise, the count mux passes a zero value.
If the instruction is e.g. a SHIFT instruction which does not include addition, its operands will be a value to be shifted on SRC1, and a shift count on SRC2. In some embodiments, the is_Shift signal may be active for SHIFT, ADDS, and ADDSRN instructions. The count mux's one-hot decoder logic performs prioritization among the is_ADDSRN signal, the is_ADDS signal, and the is_Shift signal, to correctly generate the mux selection signals.
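The prioritization performed by the count mux can be sketched as follows. The function name count_mux and its argument order are hypothetical; the selection order follows the description above.

```python
def count_mux(is_addsrn, is_adds, is_shift, dec_src3, src3, src2):
    # Highest priority first, so that is_shift may also be active
    # for ADDS and ADDSRN instructions without ambiguity.
    if is_addsrn:
        return dec_src3   # output of the immediate decoder
    if is_adds:
        return src3       # SRC3 value used directly
    if is_shift:
        return src2       # plain SHIFT: count arrives on SRC2
    return 0              # no shift


# Hypothetical values: ADDSRN with is_shift also asserted.
print(count_mux(True, False, True, dec_src3=8, src3=3, src2=7))
```

Evaluating the signals in a fixed order is one simple way to obtain mutually exclusive select lines from control signals that may be active simultaneously.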
In another, similar embodiment, the shift count and rounding bias have identical bit patterns, but SRC3 does not directly, expressly specify the bit pattern. For example, the ISA may allow only a very limited set of shift counts and corresponding rounding bias values, and the instruction may include a limited bit field containing an encoded value which selects among the allowed shift counts. For example, a two-bit field could specify: 00 for a shift count and rounding bias of 00000010₂, 01 for a shift count and rounding bias of 00000100₂, 10 for a shift count and rounding bias of 00001000₂, and 11 for a shift count and rounding bias of 00010000₂. In this instance, the two-bit field may not necessarily arrive on the SRC3 lines, and there will be a decoder (not shown) which generates the appropriate shift count/rounding bias value, and mux logic (not shown) feeding the generated value into the bypass mux and the count mux.
The instruction decoder (or an instruction scheduler or other suitable microarchitectural component) provides the is_ADDSRN, SRC3_Select, is_3S, is_Signed, and is_Shift control signals to the dual-use-source arithmetic logic unit, which may be substantially as shown in
If (112) the is_ADDSRN signal indicates that the instruction is the ADDSRN instruction, the decode mux passes (114) the decoded third source value; otherwise, it passes (116) the original third source value. The SRC3_Select signal will cause the bypass mux to pass (118) the output of the decode mux. If the is_3S control signal indicates that the current instruction is a three-operand instruction, the 3S mux will pass (122) the value from the bypass mux; otherwise, it will pass (124) a zero (which is inert in addition and subtraction).
The adder then adds or subtracts (depending upon the opcode) its three operands. The adder will treat the operands as either signed or unsigned values, according to an is_Signed control signal. In one embodiment, the rounding bias (third operand) is always unsigned, regardless of whether the other operands are signed or unsigned.
If (128) the current instruction performs shifting, as indicated by the is_Shift control signal, the shift count mux passes (130) the shift count control word from the bypass mux; otherwise, it passes (132) a zero. The output of the adder is right shifted (134) by the number of bit positions indicated by the shift count mux output (with suitable handling for a zero shift, of course). The shifted result is then written (136) to the destination specified by the instruction, and the method ends (138).
Thus, the original SRC3 source value has ultimately provided two values: a shift count control value expressly specified by the SRC3 value, and a third addend value derived from the shift count according to a predetermined formula or the like. (Note that the shift count is expressly specified in the form of a control word, not as a binary value.)
The method begins (150) with the processor receiving (152) the instruction. The instruction decoder decodes (154) the instruction, and the processor selects (156) an execution unit suitable for executing this particular type of instruction. All SRC source values are passed (158) to the selected execution unit. If (160) the instruction is not a dual-use-source instruction, the execution unit executes (162) the instruction by performing its operation upon the input source values, and the result is written (164) to the specified destination.
However, if (160) the instruction is a dual-use type, one of the source values (SRC-X) is decoded into a decoded value DEC_SRC, which is also passed (172) to the execution unit. In some instances, the original source value SRC-X may expressly provide an operand data value, with a control value being implied thereby. In other instances, the original source value SRC-X may expressly provide a control value, with an operand data value being implied thereby. If (174) the current instruction is of the former type, in which the original source value SRC-X provides an operand data value and the decoded value DEC_SRC is a control value, the execution unit executes the operation upon all the original SRC source values including SRC-X, using the DEC_SRC value as a control input which determines some characteristic of the operation (such as shift count, signed/unsigned type, shift direction, carry mode, operand size, rounding mode, saturation mode, or any other suitably controllable execution characteristic). If (174) the current instruction is of the latter type, the execution unit executes the operation upon the DEC_SRC value and all of the original SRC values except the SRC-X value, with the SRC-X value being used as a control input determining some characteristic of the operation. In either case, the results are written (164) to the specified destination, and the method ends (168).
If (190) the instruction is a dual-use-source type, an operand value and a control value are generated (194) from one of the source values. That source value does not expressly provide either the operand value or the control value; both are derived. The instruction is executed (196) using the other source values, if any, and the derived operand value, with the derived control value determining some characteristic of the functionality, such as the shift count or the like. If (190) the instruction was of another type, it would be executed (192) using all of its source values. In either case, the result is written (198) to the appropriate destination, and the method ends (200).
The SIMD ALUs add (222) their respective operands, including the common rounding bias value, and pass their resulting sums to their respective shifters. The common shift control word is passed (224) to each of the shifters, which shift (226) their respective sum inputs accordingly. The shifted sums are written (228) to the respective SIMD destinations SIMD_R3[i], and the method ends (230).
The SIMD ALUs add (250) their respective operands, each using its respective rounding bias value, and pass their resulting sums to their respective shifters. Each ALU decodes (252) its SRC3[i] value into a corresponding shift control word ShiftCtrl[i], and each shifter shifts (254) its respective sum accordingly. The processor writes (256) the shifted sums to their respective SIMD destinations SIMD_R4[i], and the method ends (258).
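The two SIMD variants described above can be contrasted in the following sketch. The function names, the use of Python lists as SIMD registers, and the assumption that each lane adds two source elements are illustrative only and are not taken from this disclosure.

```python
def simd_addsrn_common(a, b, n):
    # One rounding bias and one shift count shared by every lane.
    bias = 1 << (n - 1)
    return [(x + y + bias) >> n for x, y in zip(a, b)]


def simd_addsrn_per_lane(a, b, counts):
    # Each lane derives its own bias and shift from its own count.
    return [(x + y + (1 << (n - 1))) >> n for x, y, n in zip(a, b, counts)]


# Hypothetical lane contents.
print(simd_addsrn_common([1, 2, 3], [4, 5, 6], 1))
print(simd_addsrn_per_lane([10, 10, 10], [11, 11, 11], [1, 2, 3]))
```

The common form needs only one decode of the shared source value, whereas the per-lane form repeats the decode in every ALU.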
The sum is AND'ed (bitwise) with the shift control word, producing an output (“ares”) of the same width as each of them. The shift control word contains a single 1 in a bit position X, and 0's in the rest of the bit positions; thus, it serves as a mask for testing the state of the sum bit in position X. If that tested bit is also a 1, it means that the rounding bias 2^(N−1) (which is never actually generated in this embodiment) should have been added in with the two operands in generating the sum.
The bits of the output of the AND unit are OR'ed together, producing a single-bit incrementer control signal (“ics”) which indicates whether the rounding bias should have been added in. The output of the shifter is provided to an incrementer which is controlled by this single-bit control signal from the OR gate. If the control signal is a 1, the incrementer increments the shifted result, otherwise it simply passes the shifted result through, producing the output result which is written to the destination specified by the instruction. In one embodiment, the incrementer can simply be an adder which adds the shifted result and the zero-extended OR gate output.
The following table illustrates the operation of this embodiment in the case where the rounding bias should have been added in; or, in other words, in which the result should have been rounded up.
Everything from the Nth position right will be shifted right and discarded. If the Nth position of the sum is a 1, that portion is at least 0.5, and the result should be rounded up to the next integer value.
The following table illustrates the operation of this embodiment in the case where the rounding bias should not have been added in; or, in other words, in which the result should not have been rounded up.
Again, everything from the Nth position right will be shifted right and discarded. If the Nth position of the sum is a 0, that portion is less than 0.5, and the result should not be rounded up.
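The equivalence between adding the rounding bias before the shift and incrementing after the shift can be checked with the sketch below. The function names are hypothetical, the mask is assumed to be aligned with the Nth position from the right of the sum, and unbounded Python integers stand in for the fixed-width hardware.

```python
def addsrn_biased(a: int, b: int, n: int) -> int:
    # Reference form: the bias 2^(N-1) is added before the shift.
    return (a + b + (1 << (n - 1))) >> n


def addsrn_corrected(a: int, b: int, n: int) -> int:
    total = a + b                 # bias is never generated
    mask = 1 << (n - 1)           # assumed alignment of the shift control word
    ares = total & mask           # AND of sum and control word
    ics = 1 if ares != 0 else 0   # OR of all bits of ares
    return (total >> n) + ics     # incrementer after the shifter


# 10 + 11 = 21 = 10101 in binary; the third bit from the right is 1, so round up.
print(addsrn_corrected(10, 11, 3), addsrn_biased(10, 11, 3))
```

The two forms agree because only the Nth bit of the sum decides whether adding 2^(N−1) would carry into the bits that survive the shift.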
The circuit illustrated works for the “round to nearest up” rounding mode. Various alterations may be made to this circuit, to yield the same results. For example, the OR gate could be replaced with an adder, with the LSB of the adder controlling the incrementer.
Different circuitry will be used to implement other rounding modes.
When one component is shown as being adjacent to another component, it should not be interpreted to mean that there is absolutely nothing between the two components, only that they are coupled in some fashion.
The various features illustrated in the figures may be combined in many ways, and should not be interpreted as though limited to the specific embodiments in which they were explained and shown.
The term “processor” has been used in this disclosure to refer to any of a variety of data processing mechanisms. This invention may be used in, for example, a monolithic single-chip processor, a multi-chip processor module, an embedded controller, a microcontroller, or a variety of other such machines capable of executing software, whether embodied as a digital signal processor or as a general purpose microprocessor. The processor may have any of a variety of Instruction Set Architectures.
The processor may include one or more ALUs, any number of which may be capable of executing the new ADDSRN instruction. The invention is not limited to the case where the mnemonic “ADDSRN” is used to identify the instruction in assembly language.
The invention may be used in a fixed-width processor which can only handle data of a single predetermined width (such as 32 bits), or in a processor which can handle data in a variety of widths (such as 8 bits, 16 bits, or 32 bits). It may be used in a processor having a RISC architecture, a CISC architecture, a VLIW architecture, or whatever other architecture may be suitable. It may be used in a SISD (single instruction, single data) implementation, or in a SIMD (single instruction, multiple data) implementation, or in a MIMD (multiple instruction, multiple data) implementation. The invention may be practiced in integer arithmetic, fixed point arithmetic, or floating point arithmetic.
Although the invention has been described with reference to an addition instruction, it may also be used in a subtract instruction, or in a subtract reverse instruction. The term “additive instruction” may be used to generically refer to any particular species of addition or subtraction instruction. The invention may even be practiced in non-additive instructions, such as multiplication instructions, division instructions, and so forth. Addition, subtraction, multiplication, and division instructions may generically be referred to as “arithmetic” instructions. The invention may be practiced with any of a variety of rounding modes of arithmetic instructions.
While the invention has been shown in the context of a three-input adder and a three-operand instruction, it can be practiced in any other size machine. If practiced in a VLIW machine, the VLIW instruction may, in fact, be able to specify all of the source operands and the immediate shift count value, of a many-operand operation.
While the invention has been illustrated with reference to an embodiment in which the ALU extrapolates the final data operand value from an immediate which specifies the shift count, it could also be practiced in an embodiment in which the immediate specifies the final source operand value and the ALU extrapolates the shift count from that immediate value.
And while the invention has been explained with reference to an embodiment in which a single source provides both an operand having a first value and a shift count having a second value, in the broader sense, the invention may be practiced in embodiments in which a single source provides an operand value and some other control value. While the relationship between these has been illustrated as being N and 2^(N−1), the invention is not limited to this relationship but can use any other relationship in which the operand value and the control value are not identical.
And while the instruction has been illustrated with reference to an embodiment in which there are one or more operands beyond the one which provides both the operand value and the control value, it may be used in single-operand instructions as well.
While the invention has been illustrated with reference to various embodiments in which the source value decoding etc. logic is part of the ALU, in other embodiments this logic could be located at various other places in the processor.
And while the invention has been described with reference to embodiments in which the processor includes a register file, it may equally be practiced in embodiments in which there is no register file, but in which the operands are taken directly from memory such as an attached or on-die SRAM memory.
The dual-use source may specify the binary value of the control value, and the processor may decode that control value into a control word value. For example, the dual-use source may have the value 011₂, which is 3₁₀, which the processor may decode into the “one-hot” shift control word value 000001000₂ which means “shift by 3” (the LSB meaning “shift by zero”).
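A short sketch of this binary-to-one-hot decode follows. The function name one_hot_shift_word and the nine-bit default width are assumptions chosen to match the example above.

```python
def one_hot_shift_word(count: int, width: int = 9) -> str:
    # Bit 0 (the LSB) means "shift by zero"; bit k means "shift by k".
    if not 0 <= count < width:
        raise ValueError("count does not fit in the control word")
    return format(1 << count, f"0{width}b")


print(one_hot_shift_word(3))
```

For a count of 3 the sketch yields 000001000, matching the control word given in the example.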
And, finally, in some embodiments, the original bit pattern of the dual-use-source operand may be used directly as an operand value and/or a control word, while in other embodiments, the original bit pattern must be decoded to obtain the operand value and/or the control word. Typically, to save bits in the instruction, the original bit pattern is an encoded value.
In one embodiment, the following encoding is used:
Note that the Shift Control Word bits are shown in this table as including the “shift by zero” LSB. Per this encoding, three instruction bits provide the ability to shift by as much as 8 bit positions, corresponding to a division by 256, with corresponding rounding bias as large as 128. In other words, SRC3 provides the value N−1, where the shift is by N bits and the rounding bias is 2^(N−1). Stated alternatively, SRC3 provides the value N, where the shift is by N+1 bits and the rounding bias is 2^N.
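The relationship just stated can be enumerated with the following sketch. It is derived only from the stated relationship (a three-bit SRC3 holding N−1) and is not a reproduction of the table; the function name decode_src3 is hypothetical.

```python
def decode_src3(src3: int):
    # Assumes a three-bit field holding N-1, so N ranges from 1 to 8.
    if not 0 <= src3 <= 7:
        raise ValueError("SRC3 must fit in three bits")
    shift = src3 + 1              # shift by N bits
    bias = 1 << src3              # rounding bias 2^(N-1)
    control_word = 1 << shift     # one-hot word including the shift-by-zero LSB
    return shift, bias, control_word


for src3 in range(8):
    shift, bias, word = decode_src3(src3)
    print(format(src3, "03b"), shift, bias, format(word, "09b"))
```

The largest encoding, 111, gives a shift of 8 and a rounding bias of 128, in agreement with the limits stated above.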
Those skilled in the art having the benefit of this disclosure will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present invention. Indeed, the invention is not limited to the details described above. Rather, it is the following claims including any amendments thereto that define the scope of the invention.