1. Field of the Invention
This invention is related to the field of processors and, more particularly, to compressor circuitry used in arithmetic processing in processors.
2. Description of the Related Art
Processors are designed to execute instructions that can be categorized into several broad types: arithmetic, logic, control flow (or branch), load/store, etc. Arithmetic instructions may include instructions for various types of arithmetic. For example, floating point and integer arithmetic are common in modern processors. Some processors also implement single instruction, multiple data (SIMD) processing, in which multiple independent arithmetic operations are performed on independent portions of the input operands. SIMD operations are sometimes referred to as vector operations as well.
Arithmetic operations of various types often are implemented using 4:2 compressor circuits for at least a portion of the operation. The 4:2 compressor circuits are also sometimes referred to as carry save adder (CSA) circuits. This description will use the term compressor circuits. An example of an arithmetic operation that can be implemented using 4:2 compressor circuits is multiplication. Multiplication can be implemented as Booth-recoded partial product addition, which can be performed using multiple levels of 4:2 compressors. Each level includes a plurality of compressors. The compressors at a given level receive various input bits from the next higher level and a carry-in from another compressor at the same level. Each compressor at a given level provides a carry out to another compressor at the same level, and sum and carry outputs to the next lower level. Over the levels, the sum and carry outputs are added until a result is generated. Typically, a 4:2 compressor is implemented as two full adders (3:2 compressors) in series.
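The conventional arrangement described above can be sketched behaviorally as follows. This is a minimal illustration, not circuitry from this description: the function names (full_adder, compressor_4_2_series, compress_level) and the choice to route the first full adder's carry to the neighboring bit position are assumptions made for the example.

```python
from itertools import product


def full_adder(x, y, z):
    """3:2 compressor: add three bits and return (sum, carry)."""
    return x ^ y ^ z, (x & y) | (x & z) | (y & z)


def compressor_4_2_series(a, b, c, d, cin):
    """Conventional 4:2 compressor modeled as two full adders in series."""
    s1, cout = full_adder(a, b, c)     # cout goes to a neighbor at the same level
    s, carry = full_adder(s1, d, cin)  # s and carry go to the next lower level
    return s, carry, cout


def compress_level(w, x, y, z, width):
    """One level: reduce four width-bit integers to a (sum, carry) pair."""
    sum_word = 0
    carry_word = 0
    cin = 0
    for i in range(width):
        bits = [(v >> i) & 1 for v in (w, x, y, z)]
        s, carry, cout = compressor_4_2_series(*bits, cin)
        sum_word += s << i
        carry_word += carry << (i + 1)  # carry has twice the weight of bit i
        cin = cout                      # horizontal carry to the next bit position
    # The last horizontal carry also has weight 2**width. Hardware would widen
    # the datapath; this model simply adds it into the carry word.
    carry_word += cin << width
    return sum_word, carry_word


# Four operands are reduced to two whose ordinary sum is unchanged.
s, c = compress_level(13, 7, 9, 15, width=4)
assert s + c == 13 + 7 + 9 + 15
```

In this model the horizontal carry of each compressor does not depend on its own carry-in, so the carries do not ripple along a level.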
In one embodiment, a compressor circuit has a carry-in input and input bits a, b, c, and d. The compressor circuit comprises a first multiplexor (mux) coupled to receive a value of input bit a and a complement of the value of input bit a as inputs and a value of the input bit b as a first selection control. The first mux has a first output. Coupled to receive a value of input bit c and a complement of the value of input bit c as inputs and a value of the input bit d as a second selection control, a second mux has a second output. A third mux is coupled to receive the first output and a complement of the first output as inputs and the second output as a third selection control, and the third mux has a third output. The fourth mux, coupled to receive a value of the third output and a complement of a value of the third output as inputs and the carry-in input as a fourth selection control, has a fourth output which is a sum output of the compressor circuit. In another embodiment, a processor comprises an arithmetic unit comprising a plurality of the compressor circuits arranged in two or more levels of compressor circuits.
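The four-mux sum path of this embodiment may be modeled in software as a rough check. The sketch below is behavioral only; the helper names (mux2, inv, sum_output) are hypothetical, and it assumes that a selection control of 1 selects the complemented input.

```python
from itertools import product


def mux2(in0, in1, sel):
    """2:1 mux: pass in1 when sel is 1, otherwise pass in0."""
    return in1 if sel else in0


def inv(x):
    """Inverter for a single bit."""
    return 1 - x


def sum_output(a, b, c, d, cin):
    """Sum output formed by four 2:1 muxes (assumed select polarity)."""
    first = mux2(a, inv(a), b)               # first mux, selected by b
    second = mux2(c, inv(c), d)              # second mux, selected by d
    third = mux2(first, inv(first), second)  # third mux, selected by second output
    fourth = mux2(third, inv(third), cin)    # fourth mux, selected by carry-in
    return fourth


# Under these assumptions the fourth output is the XOR of all five inputs.
for a, b, c, d, cin in product((0, 1), repeat=5):
    assert sum_output(a, b, c, d, cin) == a ^ b ^ c ^ d ^ cin
```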
In an embodiment, an apparatus comprises a compressor circuit having a carry-in input and input bits a, b, c, and d. The compressor circuit comprises logic circuitry configured to generate a sum output, a first carry output, and a second carry output. The sum output is the exclusive OR of the input bits a, b, c, and d and the carry-in input. The first carry output is the exclusive OR of the input bits a, b, c, and d logically ANDed with the carry-in input, the result of which is logically ORed with the exclusive NOR of the input bits a, b, c, and d logically ANDed with the logical OR of the logical AND of input bits a and b and the logical AND of input bits c and d. The second carry output is the logical AND of the logical OR of input bits a and b and the logical OR of input bits c and d.
The following detailed description makes reference to the accompanying drawings, which are now briefly described.
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will herein be described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the invention to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the present invention as defined by the appended claims.
Turning now to
The decode unit 16 may be configured to generate microops for each instruction provided from the instruction cache 14. Generally, the microops may each be an operation that the hardware included in the execution core 24 is capable of executing. Each instruction may translate to one or more microops which, when executed, result in the performance of the operations defined for that instruction according to the instruction set architecture. The decode unit 16 may include any combination of circuitry and/or microcoding in order to generate microops for instructions. For example, relatively simple microop generations (e.g. one or two microops per instruction) may be handled in hardware while more extensive microop generations (e.g. more than three microops for an instruction) may be handled in microcode. The number of microops generated per instruction in hardware versus microcode may vary from embodiment to embodiment. Alternatively, each instruction may map to one microop executed by the processor. Accordingly, an operation may be an operation derived from an instruction or may be a decoded instruction, as desired.
Microops generated by the decode unit 16 may be provided to the scheduler 20, which may store the microops and may schedule the microops for execution in the execution core 24. In some embodiments, the scheduler 20 may also implement register renaming and may map registers specified in the microops to registers included in the register file 22. When a microop is scheduled, the scheduler 20 may read its source operands from the register file 22 and the source operands may be provided to the execution core 24.
Among the microops executed by the execution core may be various arithmetic operations that may use the 4:2 compressors 28A-28N in the arithmetic unit 26. For example, floating point or integer multiplication may use the compressors 28A-28N for partial product additions. The compressor circuits 28A-28N may be arranged into multiple levels (e.g. two levels are illustrated as horizontal rows in
The register file 22 may generally comprise any set of registers usable to store operands and results of microops executed in the processor 10. In some embodiments, the register file 22 may comprise a set of physical registers and the scheduler 20 may map the logical registers to the physical registers. The logical registers may include both architected registers specified by the instruction set architecture implemented by the processor 10 and temporary registers that may be used as destinations of microops for temporary results (and sources of subsequent microops as well). In other embodiments, the register file 22 may comprise an architected register set containing the committed state of the logical registers and a speculative register set containing speculative register state.
The fetch control unit 12 may comprise any circuitry used to generate PCs for fetching instructions. The fetch control unit 12 may include, for example, branch prediction hardware used to predict branch instructions and to fetch down the predicted path. The fetch control unit 12 may also be redirected (e.g. via misprediction, exception, interrupt, flush, etc.).
The instruction cache 14 may be a cache memory for storing instructions to be executed by the processor 10. The instruction cache 14 may have any capacity and construction (e.g. direct mapped, set associative, fully associative, etc.). The instruction cache 14 may have any cache line size. For example, 64 byte cache lines may be implemented in one embodiment. Other embodiments may use larger or smaller cache line sizes. In response to a given PC from the fetch control unit 12, the instruction cache 14 may output up to a maximum number of instructions. For example, up to 4 instructions may be output in one embodiment. Other embodiments may use more or fewer instructions as a maximum.
It is noted that, while the illustrated embodiment uses a scheduler, other embodiments may implement other microarchitectures. For example, a reservation station/reorder buffer microarchitecture may be used. If in-order execution is implemented, other microarchitectures without out of order execution hardware may be used.
In one embodiment, the 4:2 compressors 28A-28N do not implement the series connection of 3:2 compressors previously used to implement a 4:2 compressor. Additionally, the carry terms used to generate the carry output of the 4:2 compressors 28A-28N that is provided to the next level, and the terms used to generate the carry out to another compressor at the same level, are changed. In one embodiment, the generation of the carry outputs may be more efficient than was previously possible. Additionally, in one embodiment, a low latency implementation using 2:1 multiplexors (muxes) and inverters may be used so that the delay through the 4:2 compressor is relatively low. Viewed in another way, the two carry outputs and the sum output have redundancy (up to 8 possible variations on the three outputs, but only six possible sums, 0 through 5, with 4 inputs and a carry-in input). By redesigning the encoding of the outputs, a high efficiency implementation may be realized.
In one embodiment, the following equations are implemented by the 4:2 compressor circuits 28A-28N. In the equations, as well as in
Sum = ((a^b)^(c^d))^Cin (1)
Carry = ((a^b^c^d) & Cin) | (~(a^b^c^d) & ((a&b) | (c&d))) (2)
Cout = (a|b) & (c|d) (3)
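Equations 1 through 3 can be checked exhaustively in software. The sketch below reads ^ as exclusive OR, & as AND, | as OR, and ~ as inversion; the function name compressor_4_2 is hypothetical, and the check assumes that the Carry and Cout outputs each carry twice the weight of the Sum output.

```python
from itertools import product


def compressor_4_2(a, b, c, d, cin):
    """Behavioral model of equations 1-3 for single-bit inputs."""
    x = (a ^ b) ^ (c ^ d)                          # shared second-level XOR
    s = x ^ cin                                    # equation 1
    carry = (x & cin) | ((1 - x) & ((a & b) | (c & d)))  # equation 2, ~x as 1 - x
    cout = (a | b) & (c | d)                       # equation 3
    return s, carry, cout


# The three outputs encode the count of ones among the five inputs.
for a, b, c, d, cin in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(a, b, c, d, cin)
    assert s + 2 * (carry + cout) == a + b + c + d + cin
```

The model computes the XOR of a, b, c, and d once and reuses it for both the Sum and Carry outputs. Cout depends only on a, b, c, and d.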
Equation 1 may be implemented with 4 two input XOR operations, with only three in series, as indicated by the parentheses in equation 1. That is, two XORs may be performed in parallel (a^b and c^d), the results XORed ((a^b)^(c^d)), and the result of the second XOR level may be XORed with Cin. Additionally, the output of the second level may be used for the Carry equation (equation 2). Accordingly, in some embodiments, the circuitry to implement the compressor circuit may be small. Other embodiments may implement the XOR operation in any desired fashion.
A number of two input XOR operations are used, in one embodiment. A two input XOR operation may be implemented as a 2:1 mux, where the inputs to the mux are one of the XOR input bits and its inverse (or complement), and the mux select is the other input bit. For example, if x XOR y is to be implemented, the inputs to the mux may be “x” and “~x” and the control input may be “y”. If y is one, ~x may be selected; and if y is zero, x may be selected. In one embodiment, a passgate mux implementation may be used to further speed the two input XOR operation.
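The mux-based XOR described above can be illustrated with a short behavioral sketch. The names mux2 and xor_via_mux are hypothetical, and the model says nothing about passgate timing; it only shows the selection logic.

```python
def mux2(in0, in1, sel):
    """2:1 mux: pass in1 when sel is 1, otherwise pass in0."""
    return in1 if sel else in0


def xor_via_mux(x, y):
    """x XOR y: the mux inputs are x and its complement, and y is the select."""
    not_x = 1 - x               # inverter on a single bit
    return mux2(x, not_x, y)    # y == 1 selects ~x, y == 0 selects x


# The mux output matches XOR for all four input combinations.
for x in (0, 1):
    for y in (0, 1):
        assert xor_via_mux(x, y) == x ^ y
```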
One embodiment of the 4:2 compressor circuit 28A is shown in
The embodiment illustrated in
The mux 30A is coupled to receive the value of input bit a on its 0 input (through the inverters 32 and 34, in this embodiment) and the complement of the value of input bit a on its 1 input (through inverter 32 only, in this embodiment). The true select is b, and the complement select is ~b. Accordingly, if b is logical 1, then the complement of a is output by the mux 30A and if b is logical zero, a is output. Thus, mux 30A implements a^b. The mux 30B receives the inputs in the reverse order from mux 30A, but the same mux select. That is, the mux 30B is coupled to receive the complement of a on its 0 input and a on its 1 input. Accordingly, the mux 30B outputs the complement of a if b is 0, and a if b is 1. Thus, mux 30B performs an XNOR operation (~(a^b)). Similarly, the muxes 30C-30D are coupled to receive the c input and its complement (via inverters 36 and 38) and perform XOR and XNOR operations, respectively, based on the d input.
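The complementary behavior of the muxes 30A and 30B may be sketched as follows. This is a logical model only: the function names are hypothetical, and the inverters 32 and 34 are represented simply by the true and complemented values of a.

```python
def mux2(in0, in1, sel):
    """2:1 mux: pass in1 when sel is 1, otherwise pass in0."""
    return in1 if sel else in0


def first_level_pair(a, b):
    """Model of muxes 30A and 30B: same select b, inputs in opposite order."""
    a_true = a        # value of a as seen on the mux input
    a_comp = 1 - a    # complement of a
    out_30a = mux2(a_true, a_comp, b)  # a on input 0, ~a on input 1: XOR
    out_30b = mux2(a_comp, a_true, b)  # ~a on input 0, a on input 1: XNOR
    return out_30a, out_30b


# The two outputs are always complements of each other.
for a in (0, 1):
    for b in (0, 1):
        xor_out, xnor_out = first_level_pair(a, b)
        assert xor_out == a ^ b
        assert xnor_out == 1 - xor_out
```

The same model applies to the muxes 30C and 30D, with c in place of a and d in place of b.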
The second level of two input XORing may be implemented by the muxes 30E and 30F in
A mux 44 is also shown in
Finally, the OR gate 52 (having c and d as inputs), the OR gate 54 (having a and b as inputs), and the AND gate 56 (having the outputs of OR gates 52 and 54 as inputs) generate the Cout output as set forth in equation 3.
It is noted that, while specific logic circuits are shown in
It is noted that, since the output of the mux 30B is the complement of the mux 30A, other embodiments may eliminate the mux 30B in favor of inverting the output of the mux 30A (or vice versa). Such an implementation may be slower than the implementation shown in
In one embodiment, the muxes 30A-30G and 44 may be implemented as pass gate muxes. One such embodiment of the mux 30A is shown in
It is noted that various circuitry above has been described as receiving a value of a bit or signal, or perhaps just receiving a bit or signal. Generally, the value of the bit or signal may be received. The actual bit or signal may be received directly, or one or more levels of buffering or inversion may separate the bit or signal and the receiver. However, the logical state of the bit or signal is received as described, whether directly or indirectly through buffering.
Numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.