Compressors are important circuits within processor functional blocks. For example, a floating-point processing core often generates a significant percentage of a processor's overall heat output, and a floating-point multiplier generates a significant percentage of the heat generated by the floating-point processing core. A partial product reduction unit of the floating-point multiplier, which is composed primarily of compressors, generates a significant percentage of the heat generated by the floating-point multiplier.
In addition, the processing speed of a conventional multiplier depends substantially upon the speed of the compressors within its partial product reduction unit. The compressors within a multiplier may therefore greatly influence the speed and the power-efficiency of the multiplier and of a processor including the multiplier. Hence, compressor designs providing suitable speed and power efficiency are desired.
A system according to some embodiments includes multiplier 30 for multiplying y by m to generate a 128-bit result (p). Multiplier 30 therefore comprises a 64-bit×64-bit multiplier, but embodiments are not limited thereto. Moreover, embodiments may be implemented within any suitable system and are not limited to a multiplier.
Multiplier 30 includes multiplexer 310 to output various 2's complement representations of the multiplicand. Booth selection unit 320 selects and outputs one of the representations as a partial product based on the multiplier as encoded by encoder 330. Each partial product output from Booth selection unit 320 is received and summed by partial product reduction unit 340.
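The selection of partial products described above can be illustrated with a short behavioral sketch. The description does not state which Booth radix is used, so the sketch below assumes radix-4 recoding of an unsigned multiplier; the function name `booth_radix4_partial_products` and its parameters are hypothetical and are used for illustration only.

```python
def booth_radix4_partial_products(multiplicand: int, multiplier: int,
                                  width: int = 64) -> list[int]:
    """Return weighted partial products for an unsigned `width`-bit multiplier.

    Assumption: radix-4 Booth recoding (the radix is not given in the text).
    """
    # Multiples of the multiplicand that a multiplexer could make available.
    multiples = {
        0: 0,
        1: multiplicand,
        2: 2 * multiplicand,
        -1: -multiplicand,
        -2: -2 * multiplicand,
    }
    partial_products = []
    prev_bit = 0  # implicit bit to the right of bit 0
    # One extra digit so that the top digit absorbs an unsigned multiplier.
    for i in range(width // 2 + 1):
        b0 = (multiplier >> (2 * i)) & 1
        b1 = (multiplier >> (2 * i + 1)) & 1
        digit = b0 + prev_bit - 2 * b1  # encoded digit in {-2, -1, 0, 1, 2}
        # Select one multiple and weight it by 4**i.
        partial_products.append(multiples[digit] << (2 * i))
        prev_bit = b1
    return partial_products
```

Under these assumptions, summing the returned values yields the product of the multiplicand and the multiplier, which is the task performed in hardware by a partial product reduction unit and a final adder.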
Partial product reduction unit 340 may comprise a partial product summation tree to sum the partial products into a product of the multiplier and the multiplicand. The product is represented in a redundant form. For example, the product may be represented by 128 Sum bits and 128 Carry bits. Accordingly, adder 350 receives the Carry bits and Sum bits and converts the received bits into a 128-bit binary number (p).
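The redundant Sum/Carry representation and its conversion by a final adder can be sketched at the word level as follows. This is a minimal behavioral model that assumes non-negative operands and reduces them in a simple chain rather than in a tree; the names `carry_save_add`, `reduce_to_redundant`, and `final_add` are hypothetical.

```python
def carry_save_add(x: int, y: int, z: int) -> tuple[int, int]:
    """Compress three words into a Sum word and a Carry word (no carry propagation)."""
    sum_word = x ^ y ^ z
    # Each carry bit has twice the weight of the column that produced it.
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word


def reduce_to_redundant(operands: list[int]) -> tuple[int, int]:
    """Reduce non-negative operands to a redundant (Sum, Carry) pair."""
    sum_word, carry_word = 0, 0
    for operand in operands:
        sum_word, carry_word = carry_save_add(sum_word, carry_word, operand)
    return sum_word, carry_word


def final_add(sum_word: int, carry_word: int, width: int = 128) -> int:
    """Convert the redundant pair into a single `width`-bit binary number."""
    return (sum_word + carry_word) & ((1 << width) - 1)
```

For example, the partial products of 11 × 13 formed by shift-and-add are 11, 11 << 2, and 11 << 3; reducing them and applying `final_add` gives 143. Only the final addition propagates carries, which is why the reduction stages can be fast.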
Partial product reduction unit 340 may comprise a tree including 3:2 compressors. Embodiments may be used in conjunction with any currently- or hereafter-known tree architecture. Each of the 3:2 compressors receives three input bits and outputs a Sum bit and a Carry bit based on the three input bits.
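The logical behavior of a single 3:2 compressor can be summarized in a few lines. The sketch below is a behavioral model only and says nothing about the circuit implementation; the function name `compress_3_2` is hypothetical.

```python
from itertools import product


def compress_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """Return (Sum, Carry) for three input bits."""
    sum_bit = a ^ b ^ c                      # 1 when an odd number of inputs are 1
    carry_bit = (a & b) | (a & c) | (b & c)  # 1 when at least two inputs are 1
    return sum_bit, carry_bit


# The two output bits encode the count of ones among the three inputs.
for a, b, c in product((0, 1), repeat=3):
    sum_bit, carry_bit = compress_3_2(a, b, c)
    assert a + b + c == sum_bit + 2 * carry_bit
```

The loop checks all eight input combinations against the identity A + B + C = Sum + 2·Carry, which is the property that lets a tree of such compressors preserve the total of the partial products.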
Sum block 110 comprises transmission gate 115 and Carry block 120 comprises static mirror 125. According to some embodiments, transmission gate 115 is particularly suitable for performing an XOR logical operation. Static mirror 125, on the other hand, may provide fast production of the Carry bit. Static mirror 125 may also or alternatively facilitate routing of the circuit elements of Carry block 120.
Initially, at 210, three input bits are received at a first block. The first block includes at least one transmission gate. The first block may be an element of any functional unit, including but not limited to partial product reduction unit 340 of multiplier 30. In some embodiments, the first block comprises Sum block 110 of compressor 100. As mentioned above, Sum block 110 includes transmission gate 115.
A Sum bit is output from the first block at 220. The Sum bit is output based at least on the three input bits.
At 230, the three input bits are received at a second block. The second block includes at least one static mirror, and the three input bits may be received by the second block substantially simultaneously with reception of the three input bits by the first block at 210. The second block may comprise Carry block 120 including static mirror 125.
A Carry bit is output from the second block at 240 based at least on the three input bits. The Carry bit and/or the output Sum bit may be input to a “downstream” 3:2 compressor that itself includes a Sum block and a Carry block as described above. In some embodiments, the Carry bit is output to adder 350 along with 127 other Carry bits. Adder 350 may propagate the Carry bits and, along with 128 received Sum bits, generate a final product.
Transmission gate 440 includes an input to receive input bit C, and an output connected to the output of transmission gate 450. Transmission gate 450, in this regard, includes an input to receive C# from inverter 460, an inverted control node connected to the non-inverted control node of transmission gate 440, and a non-inverted control node connected to the inverted control node of transmission gate 440. The outputs of transmission gate 440 and transmission gate 450 are connected to an input of inverter 470, which is to output the Sum bit as shown.
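The output stage described above, in which two complementary transmission gates select between C and C# before a final inversion, can be modeled behaviorally. The signal that drives the control nodes is not specified in this passage; the sketch assumes it is A XOR B, and the names `transmission_gate`, `sum_block`, and `select` are hypothetical.

```python
from typing import Optional


def transmission_gate(data: int, control: int) -> Optional[int]:
    """Pass `data` when conducting; return None to model a high-impedance output."""
    return data if control else None


def sum_block(a: int, b: int, c: int) -> int:
    """Behavioral model of a transmission-gate Sum stage."""
    select = a ^ b     # ASSUMPTION: control signal, not stated in the text
    c_bar = 1 - c      # C# as produced by inverter 460
    # The gates have opposite control polarity, so exactly one conducts.
    out_440 = transmission_gate(c, select)
    out_450 = transmission_gate(c_bar, 1 - select)
    node = out_440 if out_440 is not None else out_450
    return 1 - node    # inverter 470 drives the Sum bit
```

With this assumed polarity, the shared node carries C when A XOR B is 1 and C# otherwise, so the final inversion yields A XOR B XOR C for every input combination.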
Carry block 500 includes p-channel transistors 505 through 525 and n-channel transistors 530 through 550. A source of p-channel transistor 505 is connected to a supply voltage and a gate of p-channel transistor 505 is to receive input bit A. A source of p-channel transistor 510 is connected to the supply voltage, a gate of p-channel transistor 510 is to receive input bit B, and a drain of p-channel transistor 510 is connected to a drain of p-channel transistor 505.
A source of p-channel transistor 515 is connected to the supply voltage and a gate of p-channel transistor 515 is to receive input bit A, while a source of p-channel transistor 520 is connected to the drain of p-channel transistor 505 and a gate of p-channel transistor 520 is to receive input bit C.
N-channel transistors 530 through 550 substantially mirror the layout of p-channel transistors 505 through 525. Specifically, a source of n-channel transistor 530 is connected to ground and a gate of n-channel transistor 530 is to receive input bit A. A source of n-channel transistor 535 is connected to ground, a gate of n-channel transistor 535 is to receive input bit B, and a drain of n-channel transistor 535 is connected to a drain of n-channel transistor 530. A source of n-channel transistor 540 is connected to ground and a gate of n-channel transistor 540 is to receive input bit A.
A source of n-channel transistor 545 is connected to the drain of n-channel transistor 530, a gate of n-channel transistor 545 is to receive input bit C, and a drain of n-channel transistor 545 is connected to a drain of p-channel transistor 520. A source of n-channel transistor 550 is connected to the drain of n-channel transistor 540, a gate of n-channel transistor 550 is to receive input bit B, and a drain of n-channel transistor 550 is connected to a drain of p-channel transistor 525.
The drains of n-channel transistors 545 and 550 and p-channel transistors 520 and 525 are connected to one another and to an input of inverter 560. Inverter 560 outputs the aforementioned Carry bit as shown. According to some embodiments, inverter 560 is omitted and block 500 therefore outputs a Carry# bit. If all inputs are received at substantially the same time, the thus-modified block 500 would output the Carry# bit approximately 50% faster than block 400 would output the Sum bit. The Carry# signal may therefore be connected to slower inputs of a downstream Sum block to reduce overall delay in a partial product reduction tree.
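The mirrored pull-up and pull-down networks of Carry block 500 can be checked with a simple switch-level model. The connections of p-channel transistor 525 are not fully stated in the text above; the sketch assumes that it mirrors n-channel transistor 550, that is, that its gate receives input bit B and that it is in series with p-channel transistor 515. All function names are hypothetical.

```python
def pull_up_conducts(a: int, b: int, c: int) -> bool:
    """P-channel network: a transistor conducts when its gate input is 0."""
    # 505 and 510 in parallel, in series with 520.
    path_1 = (a == 0 or b == 0) and c == 0
    # 515 in series with 525 (ASSUMPTION: gate of 525 receives B).
    path_2 = a == 0 and b == 0
    return path_1 or path_2


def pull_down_conducts(a: int, b: int, c: int) -> bool:
    """N-channel network: a transistor conducts when its gate input is 1."""
    # 530 and 535 in parallel, in series with 545.
    path_1 = (a == 1 or b == 1) and c == 1
    # 540 in series with 550.
    path_2 = a == 1 and b == 1
    return path_1 or path_2


def carry_block(a: int, b: int, c: int, with_inverter: bool = True) -> int:
    """Return Carry, or Carry# when the output inverter is omitted."""
    up = pull_up_conducts(a, b, c)
    down = pull_down_conducts(a, b, c)
    # In a static mirror exactly one network conducts for every input.
    assert up != down
    node = 1 if up else 0  # shared drain node, which carries Carry#
    return 1 - node if with_inverter else node
```

Under these assumptions, the shared drain node is low exactly when at least two inputs are high, so the inverted output is the majority function of A, B, and C. Omitting the inverter removes one gate delay from the path, consistent with the faster Carry# output described above.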
According to some embodiments, integrated circuit 610 also communicates with off-die cache 630. Off-die cache 630 may include registers storing a multiplier or a multiplicand for input to Floating Point Unit 625. Integrated circuit 610 may also communicate with system memory 640 via a host bus and a chipset 650. Memory 640 may comprise any suitable type of memory, including but not limited to Single Data Rate Random Access Memory and Double Data Rate Random Access Memory. In addition, other off-die functional units, such as graphics accelerator 660 and Network Interface Controller (NIC) 670, may communicate with integrated circuit 610 via appropriate busses.
The several embodiments described herein are solely for the purpose of illustration. Therefore, persons in the art will recognize from this description that other embodiments may be practiced with various modifications and alterations.