BACKGROUND
Embodiments of the invention relate to systems for processing data, and more specifically, to a system for processing data through single instruction multiple data (SIMD) operations.
Arithmetic instructions, such as add and subtract, are some of the most basic and widely used instructions in any given program. Processors typically support some sort of single instruction multiple data (SIMD) instructions. SIMD processing enables multiple operands within a register to be processed in parallel. Processors support various types of SIMD/non-SIMD instructions, including, but not limited to, 16-bit single add instructions, 8-bit dual (two-way) add instructions, 16-bit add with carry instructions, 8-bit dual (two-way) add with carry instructions and the like. A SIMD adder may be used to perform either one 16-bit addition for the 16-bit add instruction or two 8-bit additions in parallel for the 8-bit dual add instruction. The 16-bit add with carry instruction adds a carry from the previous operation to the operands.
One way of implementing SIMD adders is to use two 8-bit adders with intervening logic that determines the carry propagation. For a 16-bit operation, the carry is propagated from a lower 8-bit adder to an upper 8-bit adder of the operands. For dual 8-bit operation, the carry propagation is inhibited.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention is illustrated by way of example and not by way of limitation in the figures of the accompanying drawings in which like references indicate similar elements. It should be noted that the references to “an” or “one” embodiment of this disclosure are not necessarily to the same embodiment, and such references mean at least one.
FIG. 1A is a diagram used to illustrate the operation of an arithmetic circuit according to one embodiment when it operates as a 16-bit adder.
FIG. 1B is a diagram used to illustrate the operation of an arithmetic circuit according to one embodiment when it operates as four 8-bit adders.
FIG. 2 is a block diagram illustrating a processor including a single instruction multiple data (SIMD) arithmetic circuit, in accordance with one embodiment.
FIG. 3 is a block diagram further illustrating the processor of FIG. 2, in accordance with one embodiment.
FIG. 4 is a block diagram illustrating a digital media processor, in accordance with one embodiment.
FIG. 5 is a block diagram illustrating various design representations or formats for simulation, emulation and fabrication of a design using the disclosed techniques.
DETAILED DESCRIPTION
In the following description, specific details are set forth. However, it is understood that embodiments described may be practiced without these specific details. In other instances, well-known circuits, structures and techniques have not been shown in detail to avoid obscuring the understanding of this description.
Conventional techniques for implementing a 16-bit adder use two 8-bit adders with intervening logic that determines the carry propagation. For a 16-bit operation, the carry is propagated from a lower 8-bit adder to an upper 8-bit adder of the operands. For dual 8-bit operation, the carry propagation is inhibited. One of the main disadvantages of using two 8-bit adders with intervening logic is the limit on the speed that can be achieved. For example, a synthesis tool may realize each of the 8-bit adders with the fastest adder architecture possible, but since there is carry propagation across adders, the speed of the overall adder is limited by that propagation. The carry-related decision logic is in the critical path, causing the design to run more slowly.
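By way of illustration only, the conventional arrangement described above may be modeled in software as follows. The function names `add8` and `conventional_simd_add` are hypothetical and are not taken from the embodiments; the sketch merely shows where the mode-dependent carry decision sits between the two 8-bit adders.

```python
def add8(a, b, cin=0):
    """Model of an 8-bit adder: returns (sum, carry_out)."""
    total = (a & 0xFF) + (b & 0xFF) + (cin & 1)
    return total & 0xFF, total >> 8


def conventional_simd_add(a, b, dual):
    """Two 8-bit adders with intervening carry logic (hypothetical model)."""
    lo, lo_carry = add8(a & 0xFF, b & 0xFF)
    # Intervening logic: the carry is propagated for a 16-bit operation and
    # inhibited for a dual 8-bit operation. The upper adder depends on this
    # decision, which is why it lies in the critical path.
    hi_cin = 0 if dual else lo_carry
    hi, _ = add8((a >> 8) & 0xFF, (b >> 8) & 0xFF, hi_cin)
    return (hi << 8) | lo


single = conventional_simd_add(0x12FF, 0x3401, dual=False)  # 0x4700
dual = conventional_simd_add(0x12FF, 0x3401, dual=True)     # 0x4600
```

In this model the upper addition cannot begin until the carry decision has been made, which corresponds to the speed limitation noted above.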
FIG. 1A is a diagram used to illustrate the operation of an arithmetic circuit according to one embodiment when it operates as a 16-bit adder, for adding two 16-bit numbers. Representatively, two operands A 102 and B 110 are associated with an instruction fetched by, for example, an instruction fetch unit (IFU) (e.g., IFU 322 of FIG. 3). The operation performed by arithmetic circuit 130, according to one embodiment, is A+B+Carry (Cin). Advantageously, there are no conditions to determine if a carry needs to be added from intervening carry logic or propagated to an upper byte.
Representatively, operand 102 includes lower byte value (AL) 106 and upper byte value (AH) 104. Likewise, operand 110 includes lower byte value (BL) 114 and upper byte value (BH) 112. An instruction decode stage (e.g., front end logic 320 of FIG. 3) determines the appropriate values of bit values (“data flags”) A0 124, A1 122, B0 129 and B1 128, and packs these data flags with first operand 102 and second operand 110 into first and second registers 120 and 126, each of which has a length of 18 bits ((2*8)+2). Next, arithmetic circuit 130 performs a single arithmetic operation to obtain an 18-bit intermediate result within register 140. In one embodiment, result extraction logic (not shown) extracts lower byte value (CL) 144 and upper byte value (CH) 142 while ignoring data flag fields 146 and 148 to obtain a 16-bit result that represents an arithmetic operation upon values of operands 102 and 110.
As further illustrated with reference to FIGS. 1A and 1B, setting data flags A1 122 and B1 128 to zero cuts off propagation of any carry generated by the sum of AL 106+BL 114 (“lower byte sum”). Conversely, setting data flag B1 128 enables carry propagation from the lower byte sum to the sum of AH 104+BH 112 (“upper byte sum”). Likewise, setting data flags A0 124 and B0 129 to zero cuts off any carry-in value generated in response to a carry flag, which would otherwise be added to the lower byte sum. Conversely, setting either data flag B0 129 or data flag A0 124 enables carry-in propagation to the lower byte sum from a previous stage. Likewise, setting either data flag B1 128 or data flag A1 122, together with either data flag A0 124 or data flag B0 129, enables both carry-in propagation and carry propagation from the lower byte sum to the upper byte sum. In one embodiment, rounding operations are achieved by setting both data flags A0 124 and B0 129, and carry propagation to the upper byte sum may also be enabled by setting data flag B1 128 to a value of one.
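By way of illustration only, the effect of a pair of data flags on a carry may be modeled as the carry-out of a one-bit full adder at the data flag position. The function name `flag_carry_out` is hypothetical, and the sketch assumes that a set flag has the value one.

```python
def flag_carry_out(a_flag, b_flag, carry_in):
    """Carry out of a one-bit full adder located at a data flag position."""
    return (a_flag + b_flag + carry_in) >> 1


# Both flags zero: any incoming carry is cut off.
cut_off = [flag_carry_out(0, 0, c) for c in (0, 1)]     # [0, 0]
# Exactly one flag set: the incoming carry is passed through unchanged.
propagate = [flag_carry_out(0, 1, c) for c in (0, 1)]   # [0, 1]
# Both flags set: a carry is produced regardless of the incoming carry,
# which is the behavior relied upon for rounding.
generate = [flag_carry_out(1, 1, c) for c in (0, 1)]    # [1, 1]
```

Because the decision is encoded in the operand bits themselves, no separate carry-selection logic is needed between the byte sums in this model.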
As illustrated in FIG. 1A, a result generated by arithmetic circuit 130 is stored in register 140. Representatively, an upper byte value within register 140 (CH) is the upper byte sum and any carry-in, depending on the setting of data flags A1 122 and B1 128. Likewise, a lower byte value (CL) of register 140 comprises the lower byte sum, which may include a carry-in according to the setting of data flags A0 124 and B0 129. Accordingly, in one embodiment, to provide a proper result to a subsequent stage, such as a retirement stage, result values CH 142 and CL 144 are extracted and stored within a register for the subsequent stages. Accordingly, in one embodiment, the positions 146 and 148 of register 140 are ignored to provide, for example, a 16-bit result from the 18-bit value stored within register 140.
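By way of illustration only, the packing, single addition and result extraction of FIG. 1A may be sketched in software as follows. The sketch assumes the data flags occupy position 0 and position 9 of each 18-bit register, as described herein for FIG. 1A; the function names `pack18`, `extract16` and `simd_add16` are hypothetical.

```python
MASK18 = (1 << 18) - 1


def pack18(value16, flag1, flag0):
    """Pack a 16-bit operand and two data flags into an 18-bit value.

    Assumed layout: bit 0 = flag0, bits 1-8 = lower byte,
    bit 9 = flag1, bits 10-17 = upper byte.
    """
    lo = value16 & 0xFF
    hi = (value16 >> 8) & 0xFF
    return (hi << 10) | (flag1 << 9) | (lo << 1) | flag0


def extract16(result18):
    """Ignore the two flag positions and return the 16-bit result."""
    cl = (result18 >> 1) & 0xFF
    ch = (result18 >> 10) & 0xFF
    return (ch << 8) | cl


def simd_add16(a, b, a1, b1, a0, b0, cin=0):
    """Perform one 18-bit addition A + B + Cin on the packed operands."""
    total = (pack18(a, a1, a0) + pack18(b, b1, b0) + cin) & MASK18
    return extract16(total)


single = simd_add16(0x12FF, 0x3401, a1=0, b1=1, a0=0, b0=0)  # 0x4700
dual = simd_add16(0x12FF, 0x3401, a1=0, b1=0, a0=0, b0=0)    # 0x4600
```

The same addition yields either a single 16-bit sum or two independent 8-bit sums, with the difference determined only by the flag values packed into the operands.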
FIG. 1B is a diagram used to illustrate the operation of arithmetic circuit 130 according to one embodiment when it operates as four 8-bit adders. Representatively, operand 150 includes first lower byte value (AL1) 158, second lower byte value (AL2) 156, first upper byte value (AH1) 154 and second upper byte value (AH2) 152. Likewise, operand 160 includes first lower byte value (BL1) 168, second lower byte value (BL2) 166, first upper byte value (BH1) 164 and second upper byte value (BH2) 162. Once the corresponding byte values of operands 150 and 160 are packed with the data flags (A3 172, A2 174, A1 176, A0 178, B3 182, B2 184, B1 186 and B0 188) and stored within first register 170 and second register 180, which each have a length of 36 bits (4*8+4), arithmetic circuit 130 performs a four-way SIMD operation to generate a 36-bit result within output register 190. In one embodiment, result extraction logic (not shown) extracts first lower byte value (CL1) 194, second lower byte value (CL2) 196, first upper byte value (CH1) 197 and second upper byte value (CH2) 192 while ignoring extra bit result fields 193, 195, 197 and 199.
FIG. 2 is a block diagram illustrating a processor 200 including an arithmetic logic unit 370 having a single instruction multiple data (SIMD) arithmetic circuit (adder) 380, in accordance with one embodiment. In one embodiment, processor 200 is referred to herein as a “media signal processor”, which may function according to a data driven architecture. In one embodiment, a plurality of processors 200, as shown in FIG. 2, may be coupled together to form digital media processor 400 as shown in FIG. 4. In one embodiment, digital media processor 400 provides a data driven architecture for performing data intensive applications, such as media processing applications, including, but not limited to, video processing, image processing, sound processing, security based applications and the like.
As shown in FIG. 2, media signal processor (MSP) 200 includes one or more processing elements (PEs) 300 (300-1, . . . , 300-N). Representatively, each PE 300 is coupled to shared register file (SRF) 210. SRF 210 allows PEs 300 to exchange and store data within general purpose registers (GPRs) of register file 210. Representatively, MSP 200 includes internal volatile memory 220 for local data and variable storage, as well as memory command handler (MCH) 230 to alleviate bandwidth bottlenecks on the off chip memory. In one embodiment, PEs 300 are the basic building blocks of MSP 200 and may include instruction memory 310 to support an instruction set designed to provide flow control, arithmetic logic unit functions and custom interface functions, such as multiply-accumulate instructions, bit rotation instructions, or the like.
As such, depending on the function MSP 200 is designed to perform, PEs 300 may be divided to accomplish the desired functionality and parallel performance of algorithmic portions of a media processing application executed by a digital media processor, for example as shown in FIG. 4. In one embodiment, a general processing element (GPE) is the basic processing element upon which more complicated PEs may be generated. In one embodiment, PEs may be categorized as: input processing elements (IPE), which are connected to input ports to accept incoming data streams; general processing elements (GPE), multiply accumulate processing elements (MACPE); and output processing elements (OPE), which are connected to output ports to send outgoing data streams for performing desired processing functionality.
FIG. 3 is a block diagram further illustrating front end logic 320 and execution core 360 of PEs 300 of FIG. 2, in accordance with one embodiment. Although FIG. 3 illustrates an execution architecture of PEs 300 of FIG. 2, it should be recognized that FIG. 3 may illustrate an execution architecture of a processor that does not include processing elements. Accordingly, it should be recognized that a SIMD arithmetic circuit, as described herein, is not limited to media signal processors and can be incorporated within the execution architecture or micro-architecture of conventional processor architectures, to provide high speed addition, subtraction and other like arithmetic functions while avoiding intervening carry logic that places a carry-related decision in the critical path causing conventional adder designs to run more slowly.
In one embodiment, operation of adder 380 requires population of one or more extra bit values packed with the one or more operands of a received instruction (see FIGS. 1A and 1B). Accordingly, rather than using multiple adders with intervening carry logic, in one embodiment, SIMD adder 380 is generated with the synthesis tool to provide a high-speed adder architecture. In one embodiment, as shown in FIGS. 1A and 1B, a 16-bit SIMD adder is realized using an 18-bit adder and a 4-way, 8-bit SIMD adder is realized using a 36-bit adder, respectively. To provide such functionality, in one embodiment, the one or more extra bits are packed together with the one or more operands in a decode stage of an execution pipeline of a processor, as shown in FIG. 3.
Representatively, instruction fetch unit (IFU) 322 is coupled to receive first and second operands associated with an instruction fetched from instruction memory (IM) 310. In one embodiment, the first and second operands each have a length of N*M bits. In one embodiment, instruction decoder (ID) 330 stores the first N*M-bit operand and a first N extra bits in a first (N*M+N)-bit register. Likewise, ID 330 stores the second N*M-bit operand and a second N extra bits in a second (N*M+N)-bit register. In one embodiment, ID 330 uses local register file (LRF) 350 for the first and second (N*M+N)-bit registers. In one embodiment, N is an integer equal to or greater than two and M is an integer equal to or greater than two.
In one embodiment, extra bit logic (EBL) 340 may query look-up table (LT) 342 to determine a value of the first and second N extra bits added to the first and second operands, respectively. In one embodiment, EBL 340 queries LT 342 according to an operation requested by an instruction received from IFU 322. In one embodiment, the requested operation is determined according to an opcode, or other like designation of the instruction. As described herein, the first and second N extra bits are referred to as “data flags”, which may be used to determine whether a carry is propagated from a lower byte value or prior stage and may be used to perform rounding and other like functions.
Accordingly, in one embodiment, EBL 340 populates first and second N-bit data flags to enable (N*M)+N-bit adder 380 to perform an addition operation or other like requested operation. In one embodiment, an N*M-bit result is extracted by result extraction logic (REL) 390 from an (N*M)+N-bit result generated by adder 380. For example, in one embodiment, adder 380 operates as an 18-bit adder, as shown in FIG. 1A. Representatively, ID 330 receives first and second 16-bit operands and stores each of the first and second 16-bit operands and two data flags in an 18-bit register. In one embodiment, each of the 18-bit registers, including the first and second input operands and first and second data flags, is provided to adder 380 to generate an 18-bit sum. REL 390 with adder 380 logically combines values in the first and second registers to obtain result data having a length of 16 bits, for example, as shown in FIG. 1A.
As illustrated in FIG. 1A, PE 300 enables SIMD operation of adder 380 according to one embodiment where SRF 210 and LRF 350 work on either 16-bit integers or dual 8-bit integers. In one embodiment, EBL 340 of ID 330 adds data flags to operands 102 and 110 in the decode stage of the pipeline at positions shown in FIG. 1A. Representatively, data flags A0 124 and A1 122 are added at position 0 and position 9 of operand 102, which is stored in register 120. Likewise, EBL 340 adds data flags B0 129 and B1 128 at position 0 and position 9 of operand 110, which is stored in register 126. As described herein, the terms “set” or “assert” as well as “reset” or “deassert” do not imply a particular logical value. Rather, a bit may be set to “1” or set to “0” and both are considered embodiments of the invention. As a result, a bit may be active “0” (asserted low signal) or active “1” (asserted high signal) in accordance with the embodiments described herein.
TABLE ONE
OPCODE  dual  cin  adsWrnd  A1  B1  A0  B0  Category
addop   0     0    x        0   1   0   0   add (single mode, no carry in)
addop   0     1    x        0   1   c11     add (single mode, carry in)
addop   1     x    x        0   0   0   0   add (dual mode add, no carry in)
adsop   0     x    0        0   1   0   0   ads (single mode add and shift, no rounding)
adsop   0     x    1        0   1   1   1   ads (single mode add and shift, with round)
adsop   1     x    0        0   0   0   0   ads (dual mode add and shift, no carry)
adsop   1     x    1        1   1   1   1   ads (dual mode add and shift, w/rounding)
subop   0     0    x        0   1   1   1   sub (subtract, single mode)
subop   0     1    x        0   1   b11     sub (subtract, single mode, with borrow)
subop   1     x    x        1   1   1   1   sub (dual mode subtract, no borrow)
absop   0     x    x        0   1   1   1   abs (absolute operation, single mode)
absop   1     x    x        1   1   1   1   abs (absolute operation, dual mode)
abdop   0     x    x        0   1   1   1   sub (absolute difference, single mode)
abdop   1     x    x        1   1   1   1   sub (absolute difference, dual mode)
In one embodiment, EBL 340 determines values of data flags A0, A1, B0, B1 (see FIGS. 1A and 1B) from LT 342 based on a type of operation being performed and stores the result in registers 120 and 126, respectively. In one embodiment, LT 342 is populated according to Table 1. Representatively, the populating of data flags A0 124, A1 122, B1 128 and B0 129 enables several addition operations, including single mode with or without carry-in, as well as dual mode addition. In the embodiments described for 16-bit operands, dual mode performs dual 8-bit operations. In addition, Table 1 provides values to enable addition with rounding for single mode or dual mode, which may include shifting, such as, for example, required to perform averaging functions. Representatively, the populating of data flags A1, B1, A0 and B0 enables additional operations, including subtraction operations and absolute difference operations, to be performed using a single arithmetic circuit.
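By way of illustration only, a look-up of data flags by operation type may be sketched as a small table keyed on the opcode, the dual mode indication and the rounding indication. The names `DataFlags`, `LOOKUP` and `data_flags` are hypothetical, the key structure is an assumption made for the sketch, and the carry-in and borrow rows of Table 1 are left out.

```python
from typing import NamedTuple


class DataFlags(NamedTuple):
    a1: int
    b1: int
    a0: int
    b0: int


# Keys are (opcode, dual, round). A round value of None stands for a
# don't-care ("x") entry. Carry-in and borrow rows are omitted here.
LOOKUP = {
    ("addop", 0, None): DataFlags(0, 1, 0, 0),
    ("addop", 1, None): DataFlags(0, 0, 0, 0),
    ("adsop", 0, 0): DataFlags(0, 1, 0, 0),
    ("adsop", 0, 1): DataFlags(0, 1, 1, 1),
    ("adsop", 1, 0): DataFlags(0, 0, 0, 0),
    ("adsop", 1, 1): DataFlags(1, 1, 1, 1),
    ("subop", 0, None): DataFlags(0, 1, 1, 1),
    ("subop", 1, None): DataFlags(1, 1, 1, 1),
    ("absop", 0, None): DataFlags(0, 1, 1, 1),
    ("absop", 1, None): DataFlags(1, 1, 1, 1),
    ("abdop", 0, None): DataFlags(0, 1, 1, 1),
    ("abdop", 1, None): DataFlags(1, 1, 1, 1),
}


def data_flags(opcode, dual, rnd=0):
    """Return the data flags for an operation (hypothetical helper)."""
    exact = LOOKUP.get((opcode, dual, rnd))
    if exact is not None:
        return exact
    # Fall back to the entry whose rounding field is a don't-care.
    return LOOKUP[(opcode, dual, None)]


flags = data_flags("adsop", dual=0, rnd=1)  # DataFlags(a1=0, b1=1, a0=1, b0=1)
```

A table of this kind keeps the decode stage free of per-operation carry logic, since each operation reduces to a choice of four flag values.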
Referring again to FIG. 1A, in one embodiment, for example, for a 16-bit ADD instruction (ADDU), data flags A0 124, B0 129 and A1 122 are set to zero and data flag B1 128 is set to one. Any carry generated by the lower byte sum is propagated to the upper byte sum. For an 8-bit Dual ADD instruction (ADDUU), data flags A0 124, B0 129, A1 122 and B1 128 are all set to zero. Thus the carry generated by the lower byte sum is inhibited from propagating to the upper byte sum to provide the dual add operation with adder 380 (FIG. 3).
For a 16-bit ADD with Carry instruction, data flags A0 124 and A1 122 are set to zero and data flags B0 129 and B1 128 are set to one. If the carry flag (Cin) was set, then one is added to the lower byte sum. Also, any carry generated by the lower byte sum is propagated to the upper byte sum. For a 16-bit ADD, SHIFT and ROUND instruction, data flags A0 124 and B0 129 are set to one, data flag A1 122 is set to zero and data flag B1 128 is set to one. As a result, there is a carry out of the A0/B0 stage (since A0=1 and B0=1) that is added to the sum for rounding purposes. The result is shifted right by one position to perform a division by two operation using, for example, a shifter of ALU 370. In addition, there is carry propagation from the lower byte sum to the upper byte sum.
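By way of illustration only, the single mode ADD, SHIFT and ROUND operation may be sketched as a rounded average of two 16-bit values. The sketch assumes the data flags occupy position 0 and position 9 of each packed operand, and further assumes that the carry out of the uppermost bit of the sum is retained ahead of the right shift; that retention is an assumption of the sketch and is not stated herein. The names `pack18` and `average16` are hypothetical.

```python
def pack18(value16, flag1, flag0):
    """Pack a 16-bit operand with flags at bit 9 (flag1) and bit 0 (flag0)."""
    lo = value16 & 0xFF
    hi = (value16 >> 8) & 0xFF
    return (hi << 10) | (flag1 << 9) | (lo << 1) | flag0


def average16(a, b):
    """Rounded average using flags A1=0, B1=1, A0=1, B0=1."""
    # A0 = B0 = 1 always produces a carry into the lower byte sum, which
    # adds the rounding constant of one before the shift.
    total = pack18(a, 0, 1) + pack18(b, 1, 1)
    cl = (total >> 1) & 0xFF
    ch = (total >> 10) & 0xFF
    carry_out = (total >> 18) & 1  # assumption: kept for the shift
    sum17 = (carry_out << 16) | (ch << 8) | cl
    return sum17 >> 1  # division by two


avg = average16(3, 4)  # 4, i.e. (3 + 4 + 1) >> 1
```

Because the rounding constant enters through the flag position, no separate increment is needed before the shift in this model.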
For an 8-bit Dual ADD, SHIFT and ROUND instruction, data flags A0 124, B0 129, A1 122 and B1 128 are all set to one. As can be seen from the foregoing description, by setting data flags A0 124, A1 122, B0 129 and B1 128 according to, for example, Table 1, various kinds of arithmetic operations are provided by a simple 18-bit adder. The logic can be extended to subtractors by generating a complement value of one of the operands and populating the data flags (A1 122, A0 124, B1 128 and B0 129) according to Table 1. The logic can also be extended to more than two-way SIMD, for example, as shown in FIG. 1B for four-way SIMD.
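By way of illustration only, the extension to subtraction may be sketched as follows. The sketch assumes that the complement value is a bitwise (ones') complement of the second operand and that the data flags supply the additional one, using the single mode and dual mode subtract flag values of Table 1 without borrow. The names `pack18`, `extract16` and `simd_sub16` are hypothetical.

```python
MASK18 = (1 << 18) - 1


def pack18(value16, flag1, flag0):
    """Pack a 16-bit operand with flags at bit 9 (flag1) and bit 0 (flag0)."""
    lo = value16 & 0xFF
    hi = (value16 >> 8) & 0xFF
    return (hi << 10) | (flag1 << 9) | (lo << 1) | flag0


def extract16(result18):
    """Ignore the flag positions and return the 16-bit result."""
    return (((result18 >> 10) & 0xFF) << 8) | ((result18 >> 1) & 0xFF)


def simd_sub16(a, b, dual):
    """A - B computed as A + ~B + 1 with a single 18-bit addition."""
    b_inv = ~b & 0xFFFF  # assumption: bitwise complement of operand B
    # A0 = B0 = 1 supplies the +1 for the lower byte. In dual mode,
    # A1 = B1 = 1 supplies an independent +1 for the upper byte; in single
    # mode, A1 = 0 and B1 = 1 pass the lower carry upward.
    a1 = 1 if dual else 0
    total = (pack18(a, a1, 1) + pack18(b_inv, 1, 1)) & MASK18
    return extract16(total)


single = simd_sub16(0x0100, 0x0001, dual=False)  # 0x00FF
dual = simd_sub16(0x0100, 0x0001, dual=True)     # 0x01FF
```

In dual mode each byte wraps independently, so the lower byte of the second example becomes 0xFF without borrowing from the upper byte.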
As illustrated in FIG. 1B, operation of PE 300 enables SIMD operation of adder 380 according to one embodiment where SRF 210 and LRF 350 work on either 32-bit (quad word) integers, dual 16-bit integers or quad 8-bit integers. Representatively, operand 150 and operand 160 are associated with an instruction fetched by, for example, IFU 322 of FIG. 3. As described above, the operation performed by adder 380 is A+B+Cin. In one embodiment, EBL 340 packs a data flag adjacent to a least significant bit of each 8-bit value of operands 150 and 160.
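By way of illustration only, the four-way arrangement of FIG. 1B may be sketched with a 36-bit packed addition in which one data flag sits immediately below each 8-bit value. The per-mode flag patterns in `B_FLAGS` are an extrapolation from the two-way case and are an assumption of the sketch; the names `pack36`, `extract32` and `simd_add32` are hypothetical.

```python
LANES = 4
MASK36 = (1 << 36) - 1

# Flags packed with operand B, indexed by byte lane (lane 0 = lowest byte).
# A flag of one lets the carry from the byte below pass upward; operand A's
# flags are all zero in this sketch.
B_FLAGS = {
    "quad8": (0, 0, 0, 0),
    "dual16": (0, 1, 0, 1),
    "single32": (0, 1, 1, 1),
}


def pack36(value32, flags):
    """Place each byte in a 9-bit slot with its data flag in the low bit."""
    packed = 0
    for lane in range(LANES):
        byte = (value32 >> (8 * lane)) & 0xFF
        packed |= ((byte << 1) | flags[lane]) << (9 * lane)
    return packed


def extract32(result36):
    """Ignore the four flag positions and return the 32-bit result."""
    value = 0
    for lane in range(LANES):
        value |= ((result36 >> (9 * lane + 1)) & 0xFF) << (8 * lane)
    return value


def simd_add32(a, b, mode):
    """One 36-bit addition serving quad 8-bit, dual 16-bit or 32-bit adds."""
    total = (pack36(a, (0, 0, 0, 0)) + pack36(b, B_FLAGS[mode])) & MASK36
    return extract32(total)


quad = simd_add32(0xFFFFFFFF, 0x00000001, "quad8")       # 0xFFFFFF00
dual = simd_add32(0xFFFFFFFF, 0x00000001, "dual16")      # 0xFFFF0000
single = simd_add32(0xFFFFFFFF, 0x00000001, "single32")  # 0x00000000
```

The three results differ only in how far the carry from the lowest byte travels, which is set entirely by the flags packed with operand B.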
FIG. 4 shows a plurality of MSPs 200 (200-1, . . . , 200-6) coupled together to form a media processor 400 according to one embodiment. As illustrated, MSPs 200 include various ports that enable bi-directional data connection that allows data to flow from one unit to another. As such, each port has the ability to send and receive data simultaneously through various separate uni-directional data buses. In one embodiment, the various ports of the MSPs 200 include first in first out (FIFO) devices in each direction between two units, controlled via, for example, a port selection register.
Accordingly, any port in a unit can connect to a port of each of the other MSPs 200, which may utilize a data bus that is, for example, 16 bits wide. Accordingly, media processor 400 utilizes the plurality of MSPs 200 to freely exchange and share data, which accelerates the performance of data intensive applications, such as audio, video and imaging applications. In one embodiment, media processor 400 is coupled to memory 450 and 440, which are, for example, double data rate (DDR) synchronous dynamic random access memory (SDRAM) devices which run at, for example, 133 MHz (266-MHz DDR devices).
In one embodiment, digital media processor 400 is used within video processing applications, image processing applications, audio processing applications, or the like. In addition, by incorporating a SIMD arithmetic circuit, such as for example, adder 380, media processor 400 provides high-speed arithmetic operations required by media processing applications. In addition, media processor 400 includes memory access units 420 and 425, as well as memory interface units 430 and 435. Likewise, input/output (I/O) block 460 provides access to various I/O devices.
FIG. 5 is a block diagram illustrating various representations or formats for simulation, emulation and fabrication of a design using the disclosed techniques. Data representing a design may represent the design in a number of manners. First, as is useful in simulations, the hardware may be represented using a hardware description language, or another functional description language, which essentially provides a computerized model of how the designed hardware is expected to perform. The hardware model 510 may be stored in a storage medium 500, such as a computer memory, so that the model may be simulated using simulation software 520 that applies a particular test suite to the hardware model 510 to determine if it indeed functions as intended. In some embodiments, the simulation software is not recorded, captured or contained in the medium.
In any representation of the design, the data may be stored in any form of a machine readable medium. An optical or electrical wave 560 modulated or otherwise generated to transport such information, a memory 550 or a magnetic or optical storage 540, such as a disk, may be the machine readable medium. Any of these mediums may carry the design information. The term “carry” (e.g., a machine readable medium carrying information) thus covers information stored on a storage device or information encoded or modulated into or onto a carrier wave. The set of bits describing the design or a particular part of the design is (when embodied in a machine readable medium, such as a carrier or storage medium) an article that may be sold in and of itself, or used by others for further design or fabrication.
Having disclosed embodiments and the best mode, modifications and variations may be made to the disclosed embodiments while remaining within the scope of the embodiments as defined by the following claims.