1. Technical Field of the Invention
This invention relates generally to ISA-level processor instructions such as for a digital signal processor or a microprocessor, and more particularly to an instruction which performs clipping, picking, rounding, and packing of data elements in a single operation.
2. Background Art
Each microprocessor is designed to execute a set of architecture-level instructions, which require the presence of certain architecturally-visible registers and other hardware. The instructions, registers, and other hardware are often collectively referred to as the instruction set architecture (ISA) of the microprocessor.
Regardless of the particular ISA and any particular assembly language incarnation of that ISA, it is common practice in the art to generically describe any instruction in the following form:
Below the ISA level, a microprocessor may utilize a set of microarchitectural features, microcode, registers, execution units, data paths, and so forth, which are not architecturally visible. That is, their presence, absence, or configuration cannot be discerned by ISA code.
Below the microarchitectural level, a microprocessor may utilize circuits, logic, transistors, and so forth, of which the microarchitecture is independent.
A wide variety of ISA instructions are known in the art, such as ADD, SUBTRACT, MULTIPLY, DIVIDE, MOVE, LOAD, STORE, XOR, and so forth.
Some ISAs have provided a MIN instruction which returns the smaller of its (typically two) operands, and a MAX instruction which returns the larger of its operands. For example, the instruction
In previous ISAs, if it was algorithmically necessary to force a result to be within a specified range—in other words, between a specified minimum and a specified maximum—it was necessary to perform a multi-instruction sequence such as
This puts into the destination register R3 the contents of source register R2, bounded by the specified range of 25 to 200.
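The two-step bounding described above can be sketched as a small behavioral model. This is an illustrative sketch only, not ISA code: the function name `clamp_two_step` and the Python integers standing in for registers R2 and R3 are assumptions made for the example.

```python
def clamp_two_step(value: int, lower: int, upper: int) -> int:
    """Bound value to [lower, upper] with a MIN step followed by a MAX step."""
    capped = min(value, upper)   # MIN step: result never exceeds the upper bound
    return max(capped, lower)    # MAX step: result never falls below the lower bound


# Hypothetical register contents: R2 holds the source, R3 receives the result.
r2 = 240
r3 = clamp_two_step(r2, 25, 200)
print(r3)
```

Each call stands for two separate instructions in such an ISA, which is the overhead a combined instruction would avoid.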
Some ISAs have provided the ability to, with a single instruction, perform a same operation upon multiple source and destination data. These are commonly known as single-instruction multiple-data (SIMD) instructions, and they are said to operate on vector operands. Instructions which operate only on scalar operands could be termed single-instruction single-data (SISD) instructions, but they are more commonly referred to simply as scalar instructions.
For example, the scalar code sequence
Some ISAs have provided an EXTRACT instruction, which returns as its result a specified subset or smaller portion of a source register. The subset can be specified by a general purpose register, or a control register, or an immediate value, or it can be implicitly specified by the opcode or other instruction information. For example, the instruction
Some SIMD ISAs have provided PACK and UNPACK instructions, which are used to switch data between various widths. For example, the instruction
Some ISAs have provided various forms of rounding instructions. Rounding operations are generally of one of four types: “up” (also called “ceiling”) which rounds toward positive infinity, “down” (also called “floor”) which rounds toward negative infinity, “zero” (also called “truncate” or “chop”) which rounds toward zero, and “closest” (also called “nearest”) which rounds toward the nearest whole number. For example, the instruction
While these various instructions are known in the art, what has not previously been known, and what would be extremely useful, is a single instruction which combines various features from several of those instructions.
The invention will be understood more fully from the detailed description given below and from the accompanying drawings of embodiments of the invention which, however, should not be taken to limit the invention to the specific embodiments described, but are for explanation and understanding only. While the invention will be described with reference to its embodiment as or within a microprocessor, the invention may be practiced in any other form of processor.
In one embodiment, a single, same lower bound and a single, same upper bound are applied to each of the vector values. In another embodiment—that shown—the LB and UB registers, themselves, contain vector values LB7:0 and UB7:0, respectively, permitting different clipping ranges to be applied to each of the source vector positions.
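The difference between a single shared bound pair and per-element bounds can be illustrated with a short behavioral sketch. The function name `clip_vector` and the use of Python lists for vector registers are assumptions for illustration; element 0 of each list stands for vector position 0.

```python
from typing import List, Sequence, Union

Bound = Union[int, Sequence[int]]


def clip_vector(src: Sequence[int], lb: Bound, ub: Bound) -> List[int]:
    """Clip each element of src to its bounds.

    A scalar bound is applied to every element; a sequence bound supplies
    one value per element position.
    """
    n = len(src)
    lbs = [lb] * n if isinstance(lb, int) else list(lb)
    ubs = [ub] * n if isinstance(ub, int) else list(ub)
    if len(lbs) != n or len(ubs) != n:
        raise ValueError("bound vectors must match the source length")
    return [max(l, min(s, u)) for s, l, u in zip(src, lbs, ubs)]


# Shared bounds: every position is clipped to 0..255.
print(clip_vector([5, 300, -7, 90], 0, 255))
# Per-element bounds: each position has its own clipping range.
print(clip_vector([5, 300, -7, 90], [0, 0, -10, 100], [4, 255, 10, 200]))
```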
In some embodiments, the MIN operation is logically performed before the MAX operation, while in other embodiments the logical ordering is reversed.
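The effect of the ordering can be seen in a minimal sketch. The function names below are assumptions. For properly ordered bounds the two orderings give the same result; they differ only when the lower bound exceeds the upper bound.

```python
def clip_min_first(src: int, lb: int, ub: int) -> int:
    """MIN against the upper bound, then MAX against the lower bound."""
    return max(min(src, ub), lb)


def clip_max_first(src: int, lb: int, ub: int) -> int:
    """MAX against the lower bound, then MIN against the upper bound."""
    return min(max(src, lb), ub)


# Properly ordered bounds: both orderings agree.
print(clip_min_first(300, 25, 200), clip_max_first(300, 25, 200))
# Reversed bounds (LB=200, UB=100): the last operation applied wins.
print(clip_min_first(150, 200, 100), clip_max_first(150, 200, 100))
```

With reversed bounds, the MIN-first ordering always yields the lower bound value and the MAX-first ordering always yields the upper bound value, regardless of the source.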
In one embodiment, the clipped SIMD values from SRC2 are packed into the high-order half of DST, and the clipped SIMD values from SRC1 are packed into the low-order half of DST. In another embodiment, the clipped SIMD values from the two source registers could be interleaved into the destination register. This interleaving is often referred to as a “shuffle” operation.
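The two destination layouts can be compared in a brief sketch. It assumes that list index 0 stands for the least significant element of a register, and that interleaving places the SRC1 element before the SRC2 element in each pair; neither detail is fixed by the description above, and the function names are hypothetical.

```python
from typing import List, Sequence


def pack_concatenated(src1: Sequence[int], src2: Sequence[int]) -> List[int]:
    """SRC1 elements fill the low-order half of DST, SRC2 the high-order half."""
    return list(src1) + list(src2)


def pack_interleaved(src1: Sequence[int], src2: Sequence[int]) -> List[int]:
    """Alternate elements from SRC1 and SRC2 (assumed order: SRC1 first)."""
    out: List[int] = []
    for a, b in zip(src1, src2):
        out.extend((a, b))
    return out


print(pack_concatenated([1, 2], [3, 4]))
print(pack_interleaved([1, 2], [3, 4]))
```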
In some embodiments, a single register can hold the UB and LB values, for example UB in the upper (most significant) half of the register and LB in the lower (least significant) half of the register. This can be true whether the UB and LB are specified as scalar data (a single set of bounds applied to all data elements of a vector source) or as vector data. The UB and LB do not necessarily have to be the same width (in bits) as the source.
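One way a combined bounds register might be decoded is sketched below. The layout is an assumption chosen for illustration: a 32-bit register with an unsigned 16-bit UB in the most significant half and an unsigned 16-bit LB in the least significant half. The names `HALF_BITS` and `split_bounds` are hypothetical.

```python
HALF_BITS = 16  # assumed width of each bound field


def split_bounds(reg: int) -> tuple:
    """Return (lb, ub) from a register holding UB in the upper half, LB in the lower."""
    mask = (1 << HALF_BITS) - 1
    ub = (reg >> HALF_BITS) & mask   # most significant half
    lb = reg & mask                  # least significant half
    return lb, ub


print(split_bounds(0x00C80019))
```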
A clipping operation CLIP( ) is performed upon the source data element SRCi, forcing the result to be between a lower bound LBi and an upper bound UBi. The result is wider than the destination data element location DSTj so it is made narrower by a bit extraction operation PICK( ) which could also be termed a GETBITS( ) operation, then packed into the destination. (Where j is either the ith position or the N+ith position of DST, and N is the number of elements in SRC. For example, in the context of
In some embodiments, a predetermined set of bits is selected from the clipped source data value for packing into the destination register. For example, it might always use the low-order bits, or it might always use the high-order bits. In other embodiments, the set of bits is dynamically selected according to a pick offset control register value PICK_OFFSET. For example, if the pick offset value is 2, the PICK( ) operation may operate upon clipped bits 9:2 of the clipped source value.
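The bit selection can be sketched as a shift followed by a mask. The function name `pick` and the default 8-bit result width are assumptions for illustration; a pick offset of 2 with an 8-bit width selects bits 9:2, as in the example above.

```python
def pick(value: int, offset: int, width: int = 8) -> int:
    """Return bits [offset + width - 1 : offset] of value as an unsigned field."""
    return (value >> offset) & ((1 << width) - 1)


print(hex(pick(0x3FC, 2)))    # bits 9:2
print(hex(pick(0x1234, 0)))   # low-order byte
print(hex(pick(0x1234, 8)))   # high-order byte of a 16-bit value
```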
In some embodiments, rounding is performed on the result data prior to the packing operation, rather than simply truncating the result data and discarding bits; in some such embodiments, a rounding mode control register value ROUND_MODE specifies a rounding mode (such as ceiling, floor, zero, or nearest).
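The four rounding modes named above can be illustrated for the case where low-order bits are discarded from a signed integer. This is a sketch under assumptions: the function name `round_shift` and the mode strings are hypothetical, and in "nearest" mode an exact tie is rounded toward positive infinity, which is only one of several possible tie-breaking rules.

```python
def round_shift(value: int, shift: int, mode: str) -> int:
    """Divide value by 2**shift, rounding according to mode."""
    if shift == 0:
        return value
    floor = value >> shift              # arithmetic shift rounds toward -infinity
    ceiling = -((-value) >> shift)      # negate, floor, negate rounds toward +infinity
    if mode == "floor":
        return floor
    if mode == "ceiling":
        return ceiling
    if mode == "zero":
        return floor if value >= 0 else ceiling
    if mode == "nearest":
        # Add half of the discarded range, then floor (assumed tie rule: up).
        return (value + (1 << (shift - 1))) >> shift
    raise ValueError("unknown rounding mode: " + mode)


for mode in ("floor", "ceiling", "zero", "nearest"):
    print(mode, round_shift(7, 2, mode), round_shift(-7, 2, mode))
```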
In some embodiments, the ROUND_MODE and/or PICK_OFFSET may be specified as parameters in the instruction, rather than in control registers or implicit registers. Alternatively, they can be specified by some combination of instruction bits such as part of the opcode or the immediate data.
The clip-and-pack unit receives as inputs the upper bound value UB, the lower bound value LB, and the source data SRC to be clipped. An upper bound comparator UB COMP compares the source data to the upper bound value, and generates a HIGH mux selection input to a picking rounding multiplexer (PRMux). A lower bound comparator LB COMP compares the source data to the lower bound value, and generates a LOW mux selection input to the PRMux. SAME logic (such as an XNOR gate) determines whether the outputs of the bound comparators are equal, and generates a SOURCE mux selection input to the PRMux.
If the SRC value is greater than the UB value, the HIGH input will be active and the PRMux will select (clip to) the UB value for processing as its result output. If the SRC value is less than the LB value, the LOW input will be active and the PRMux will select (clip to) the LB value for processing as its result output. If the SRC value is greater than the LB value and less than the UB value, the SOURCE mux input will be active and the PRMux will select the SRC value for processing as its result output.
In cases where the SRC value is equal to the LB value or the UB value, it does not matter whether the PRMux uses the SRC value or the LB/UB value, and the designer can implement the logic to use whichever input he chooses.
There is an unusual case where, due to a software programming error or other reason, the LB value is actually larger than the UB value. In this case, if the SRC value lies between the two bounds, both the LOW and HIGH mux selection inputs will be active, and the SOURCE mux selection input will also be active. In one embodiment, the PRMux gives priority to the SOURCE mux selection input over the LOW and HIGH inputs, so the SRC value is not clipped. The SOURCE input is active when the SRC value is between the LB and UB values, regardless of whether the LB value is lower than or higher than the UB value.
In the case where the LB value is greater than the UB value, and the SRC value is greater than them both, only the HIGH input will be active (because the SRC value is greater than the UB value), which will cause the PRMux to select the UB value, which is actually the smaller of the two bounds values. If the LB value is greater than the UB value, and the SRC value is less than them both, only the LOW input will be active, causing the PRMux to select the LB value, which is actually the larger of the two bounds values. Thus, if the bounds values are specified backward, and the SRC value is outside the incorrectly-specified range, the PRMux will clip to the opposite bound: if SRC is greater than both bounds, it will clip to UB (which is smaller than LB), and if SRC is smaller than both bounds, it will clip to LB (which is larger than UB).
In other embodiments, the clip-and-pack unit could treat the “LB greater than UB” situation as specifying a “clipping anti-range”, and the SRC value is clipped to be outside the specified anti-range. By “anti-range” it is meant that LB and UB specify a range from which the result is to be clipped so as to be outside the range, whereas clipping to a conventional range causes the result to be clipped so as to be inside the range. A properly ordered LB and UB thus specify a bandpass filter, and a reverse ordered LB and UB specify a notch filter.
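The selection behavior of the comparators and the SAME logic can be modeled in a few lines. This is a behavioral sketch only, assuming the embodiment in which SOURCE has priority; the function name `prmux_select` is hypothetical, and picking and rounding are omitted.

```python
def prmux_select(src: int, lb: int, ub: int) -> int:
    """Model the clip selection: HIGH, LOW, and SOURCE mux selection inputs."""
    high = src > ub            # UB COMP output
    low = src < lb             # LB COMP output
    source = (high == low)     # SAME logic: XNOR of the comparator outputs
    if source:                 # SOURCE has priority over HIGH and LOW
        return src
    if high:
        return ub
    return lb


# Properly ordered bounds, LB=25 and UB=200.
print([prmux_select(s, 25, 200) for s in (10, 100, 300)])
# Reversed bounds, LB=200 and UB=25: out-of-range sources clip to the opposite bound.
print([prmux_select(s, 200, 25) for s in (10, 100, 300)])
```

With reversed bounds, a source between the two values passes through unclipped, a source above both is clipped to UB (the smaller value), and a source below both is clipped to LB (the larger value).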
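An anti-range could be modeled as follows. The description does not say which edge of the anti-range a value inside it is moved to; this sketch assumes the nearer edge, with ties going to the smaller edge, and that values equal to an edge are left unchanged. The function name is hypothetical.

```python
def clip_range_or_anti_range(src: int, lb: int, ub: int) -> int:
    """Clip into [lb, ub] when lb <= ub; otherwise clip out of the band (ub, lb)."""
    if lb <= ub:
        return max(lb, min(src, ub))          # conventional range
    if ub < src < lb:                         # inside the anti-range
        # Assumed policy: move to the nearer edge, ties to the smaller edge.
        return ub if (src - ub) <= (lb - src) else lb
    return src                                # already outside the anti-range


# Conventional range 25..200.
print([clip_range_or_anti_range(s, 25, 200) for s in (10, 100, 300)])
# Anti-range between 25 and 200 (LB=200, UB=25).
print([clip_range_or_anti_range(s, 200, 25) for s in (10, 100, 150, 300)])
```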
In some embodiments, the processor could generate an exception informing the system that the LB is greater than the UB. In some such embodiments, the exception could be treated as an error condition.
In some embodiments, the processor could internally, silently compensate for the reversal of the LB and UB values, and generate the same results which would have been generated if the LB and UB had been in the correct order. In some such embodiments, it may do so without actually swapping the storage locations of the UB and LB values; that is, the stored LB will still be greater than the stored UB.
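The exception and silent-compensation alternatives can be contrasted in a short sketch. The `policy` parameter and its values "trap" and "swap" are assumptions for illustration, and a Python exception stands in for a processor exception. In the "swap" case only local copies are exchanged, so the caller's stored bounds remain in their original order.

```python
def clip_checked(src: int, lb: int, ub: int, policy: str = "swap") -> int:
    """Clip src to the bounds, handling lb > ub according to policy."""
    if lb > ub:
        if policy == "trap":
            raise ValueError("lower bound exceeds upper bound")
        lb, ub = ub, lb        # compensate locally; stored values are untouched
    return max(lb, min(src, ub))


# Reversed bounds give the same results as correctly ordered bounds.
print([clip_checked(s, 200, 25) for s in (10, 100, 300)])
```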
In some embodiments, a PACKING enable signal controls whether the PRMux performs packing. If the PACKING signal is active, the PRMux selects a subset of the clipped value, as described above. If the PACKING signal is inactive, the entire clipped value is passed through. In some embodiments, a RESULT_SIZE input specifies (either directly or via some implicit or explicit encoding) the number of bits to be output as the result value, enabling different degrees of packing to be achieved. In other embodiments, a single packing factor is used, and the RESULT_SIZE input is not necessary. For example, the PRMux may always reduce a 16-bit clipped value to an 8-bit packed value.
In some embodiments, a ROUNDING enable signal controls whether the PRMux performs rounding of the clipped value before providing it as the result output. In some embodiments, a ROUND_MODE input value specifies the rounding mode, such as specifying “floor”, “ceiling”, “zero”, or “nearest” rounding. In some embodiments, there is only a single rounding mode, and the ROUND_MODE input value is not necessary, with the ROUNDING enable signal selecting between e.g. no rounding and a predetermined rounding scheme, or between two predetermined rounding schemes.
In some embodiments, a PICK_OFFSET determines the position from which the PRMux selects the bits for packing and/or rounding. For example, a PICK_OFFSET value of 2 may cause the PRMux to discard bit positions 0 and 1 from the clipped value, and to provide e.g. bits 2 through 9 as an 8-bit result. In some embodiments, it is the discarded bits which are used in determining the rounding of the result.
Rounding, packing, and picking may be used in any combination.
In some embodiments, a SIGN_EXTENSION input determines whether the result value should be sign extended or zero extended, as determined by how the programmer has specified the instruction. The sign bit that is used to extend is the MSB of the pre-extracted value. If the range of picked bits does not extend to the left past the MSB of the element, then sign extension will have no effect.
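The interaction between the picked bit range and the extension mode can be sketched as below. The function name `pick_extended`, the 16-bit default element width, and the boolean `sign_extend` parameter are assumptions for illustration. Bit positions above the element's MSB are filled with copies of that MSB when sign extending, and with zeros otherwise.

```python
def pick_extended(value: int, offset: int, width: int,
                  element_bits: int = 16, sign_extend: bool = True) -> int:
    """Pick `width` bits starting at `offset`, extending past the element MSB."""
    raw = value & ((1 << element_bits) - 1)          # element as a bit pattern
    if sign_extend and (raw >> (element_bits - 1)) & 1:
        raw -= 1 << element_bits                     # negative int: >> replicates the sign
    return (raw >> offset) & ((1 << width) - 1)


# Bits 19:12 of a 16-bit element reach past its MSB, so the mode matters.
print(hex(pick_extended(0x8000, 12, 8, sign_extend=True)))
print(hex(pick_extended(0x8000, 12, 8, sign_extend=False)))
# Bits 11:4 stay inside the element, so both modes agree.
print(hex(pick_extended(0xABCD, 4, 8, sign_extend=True)))
print(hex(pick_extended(0xABCD, 4, 8, sign_extend=False)))
```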
The processor includes a first upper bound comparator UBC1 coupled to receive the first source operand and the upper bound value, and a first lower bound comparator LBC1 coupled to receive the first source operand and the lower bound value. The processor further includes a first multiplexer control unit MUX CNTL1 which is coupled to receive the outputs of the first upper and lower bound comparators. The processor also includes a first multiplexer MUX1 which is coupled to receive the first source operand, the upper bound value, and the lower bound value, and which is further coupled to receive control signals from the first multiplexer control unit. The first multiplexer passes one of the first source operand, the lower bound, and the upper bound, as determined by the first multiplexer control unit. The passed value is a first clipped source operand CLIPPED SRC1.
The processor includes a second upper bound comparator UBC2 coupled to receive the second source operand and the upper bound value, and a second lower bound comparator LBC2 coupled to receive the second source operand and the lower bound value. The processor also includes a second multiplexer control unit MUX CNTL2 which is coupled to receive the outputs of the second upper and lower bound comparators, and a second multiplexer MUX2 which is coupled to receive the second source operand, the upper bound value, and the lower bound value, and to pass one of them as determined by the second multiplexer control unit. The passed value is a second clipped source operand CLIPPED SRC2.
The processor includes a first shifter SHIFTER1 which is coupled to receive the first clipped source operand and a second shifter SHIFTER2 which is coupled to receive the second clipped source operand. The first shifter performs a pick (by right shifting) of each of the N elements in the first clipped source operand, and generates N round-bit and sticky-bit pairs RS1. The second shifter performs a pick (by right shifting) of each of the N elements in the second clipped source operand, and generates N round-bit and sticky-bit pairs RS2. In one embodiment, each shifter receives an N-element input containing N X-bit data elements, and generates an N-element output containing N X/Y-bit data elements, where Y is any positive integer. In one such embodiment, Y=2; for example, each shifter receives a 128-bit input containing 8 16-bit clipped values, and generates a 64-bit output containing 8 8-bit clipped values. In some embodiments, Y is fixed, while in other embodiments, Y can be dynamically determined by control inputs (not shown). It should be noted that the shifters do not shift bits across separate data elements within their inputs; that is, least significant bits from e.g. element 3 do not get shifted into the most significant bit positions of e.g. element 2. Rather, the shifting is independent as between the various data elements.
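A per-element shifter that also reports a round bit and a sticky bit might be modeled as follows. The sketch assumes non-negative element values and uses the conventional definitions: the round bit is the most significant discarded bit, and the sticky bit is the logical OR of all remaining discarded bits. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def shift_with_rs(element: int, shift: int) -> Tuple[int, int, int]:
    """Right shift one element; return (picked, round_bit, sticky_bit)."""
    picked = element >> shift
    if shift == 0:
        return picked, 0, 0                      # nothing is discarded
    round_bit = (element >> (shift - 1)) & 1     # most significant discarded bit
    sticky_bit = int((element & ((1 << (shift - 1)) - 1)) != 0)  # OR of the rest
    return picked, round_bit, sticky_bit


def shifter(elements: Sequence[int], shift: int) -> List[Tuple[int, int, int]]:
    """Shift every element independently; no bits cross element boundaries."""
    return [shift_with_rs(e, shift) for e in elements]


print(shifter([0b101101, 0b101100, 0b101000], 3))
```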
The processor includes a first rounder ROUNDER1 which is coupled to receive the N-element picked output and the round-bit and sticky-bit pairs RS1 from the first shifter, and a second rounder ROUNDER2 which is coupled to receive the N-element picked output and the round-bit and sticky-bit pairs RS2 from the second shifter.
Each rounder separately rounds each element of its respective N-element input. In some embodiments, the rounding mode is fixed (e.g. it is always “round to nearest even”), while in other embodiments, the rounding mode is dynamically determined by control inputs (not shown). The round-bit and sticky-bit values and the rounding operations may be substantially as known in the art.
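For the "round to nearest even" example, the decision made from the round and sticky bits can be sketched as below. The function name is hypothetical, the picked value is assumed to be non-negative, and overflow of the result field after incrementing is not handled here.

```python
def round_nearest_even(picked: int, round_bit: int, sticky_bit: int) -> int:
    """Round a truncated value using its round and sticky bits.

    round=1, sticky=1: discarded part is above one half, so round up.
    round=1, sticky=0: exact tie, so round up only if picked is odd.
    round=0: discarded part is below one half, so keep picked.
    """
    if round_bit and (sticky_bit or (picked & 1)):
        return picked + 1
    return picked


print(round_nearest_even(5, 1, 1))   # above half
print(round_nearest_even(5, 1, 0))   # tie, odd value moves to even
print(round_nearest_even(4, 1, 0))   # tie, even value is kept
print(round_nearest_even(5, 0, 1))   # below half
```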
The X/Y-bit rounded N-element output of the first rounder and the X/Y-bit rounded N-element output of the second rounder are concatenated into an X-bit YN-element packed result register PACKED RESULT or other suitable result storage or data path location.
The reader should note that, in the example shown, Y=2, such that 2 source operands are clipped and packed into the packed result register. In other embodiments, where Y>2, there will be more than two source operands and a corresponding set of data path elements UBCY, LBCY, MUX CNTLY, MUXY, CLIPPED SRCY, SHIFTERY, RSY, and ROUNDERY for each additional source operand. For example, the processor may perform a 4:1 packing rather than the 2:1 packing illustrated.
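An end-to-end behavioral sketch of the 2:1 case is given below. It is a simplification under stated assumptions: scalar bounds shared by all elements, truncation instead of rounding, an 8-bit output element, SRC1 results placed in the low-order half of the result, and list index 0 standing for the least significant element. The function name `clip_pick_pack` and its parameters are hypothetical.

```python
from typing import List, Sequence


def clip_pick_pack(src1: Sequence[int], src2: Sequence[int],
                   lb: int, ub: int, shift: int = 0,
                   out_bits: int = 8) -> List[int]:
    """Clip each element, pick out_bits starting at bit `shift`, then pack."""
    mask = (1 << out_bits) - 1

    def lane(s: int) -> int:
        clipped = max(lb, min(s, ub))     # clip to [lb, ub]
        return (clipped >> shift) & mask  # pick by right shift, truncating

    # Concatenate: SRC1 lanes in the low-order half, SRC2 lanes in the high-order half.
    return [lane(s) for s in src1] + [lane(s) for s in src2]


print(clip_pick_pack([10, 300], [-5, 128], 0, 255))
print(clip_pick_pack([10, 300], [-5, 128], 0, 255, shift=2))
```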
The term “processor” should be interpreted to mean any of: a single-chip microprocessor, a multi-chip processor module, a digital signal processor, a coprocessor, a computer, an embedded controller, an ASIC, a suitably programmed FPGA or other such reprogrammable logic array, or any other logic means which executes instructions, whether those instructions are ISA-level instructions, microcode, control logic code, or what have you.
When one component is said to be “adjacent” another component, it should not be interpreted to mean that there is absolutely nothing between the two components, only that they are in the order indicated.
The various features illustrated in the figures may be combined in many ways, and should not be interpreted as though limited to the specific embodiments in which they were explained and shown.
Those skilled in the art having the benefit of this disclosure will appreciate that many other variations from the foregoing description and drawings may be made within the scope of the present invention. Indeed, the invention is not limited to the details described above. Rather, it is the following claims including any amendments thereto that define the scope of the invention.