1. Field of the Disclosure
The present disclosure relates generally to data processing devices, and more particularly to arithmetic processing devices.
2. Description of the Related Art
A data processor device may include a specialized arithmetic processing unit such as an integer or floating-point processing device. Floating-point arithmetic is particularly applicable for performing tasks such as graphics processing, digital signal processing, and scientific applications. A floating-point processing device generally includes devices dedicated to specific functions such as multiplication, division, and addition for floating point numbers.
A floating-point processing device typically supports arithmetic operations for one or more number formats, such as single-precision, double-precision, and extended-precision formats. In addition, some floating-point devices support instruction sets that provide for multiple arithmetic operations per instruction. For example, “Single Instruction, Multiple Data” (SIMD) instructions can specify that the same mathematical operation be performed on multiple data elements.
The present disclosure may be better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings.
The use of the same reference symbols in different drawings indicates similar or identical items.
An arithmetic processing unit is disclosed that can perform multiply operations, addition operations, or a combination thereof. The arithmetic processing unit can operate in two modes. The first mode supports one single, double, or extended-precision computation, and the second mode supports two simultaneous single-precision computations using the same exponent and mantissa datapaths.
FMAM 110 has an input labeled “A” connected to operand register 120, an input labeled “B” connected to operand register 122, an input labeled “C” connected to operand register 124, an input to receive a signal labeled “MODE,” from control module 140, and an output to provide a result to register 126. Control module 140 has an input to receive an instruction from instruction register 130.
FMAM 110 is an arithmetic processing device that can execute arithmetic instructions such as multiply, add, subtract, multiply-add, and multiply-accumulate instructions. FMAM 110 can receive three inputs, A, B, and C. Inputs A and B are a multiplicand and a multiplier, respectively, and input C is an addend. To execute a multiply-add instruction, such as floating-point multiply-add (FMADD), operands A (INPUT1) and B (INPUT2) are multiplied together to provide a product, and operand C (INPUT3) is added to the product. A multiply instruction, such as a floating-point multiply (FMUL), is executed in substantially the same way except operand C is set to a value of zero. An add instruction, such as a floating-point add (FADD), is executed in substantially the same way except operand B is set to a value of one. FMAM 110 includes an output to provide a result of the instruction to result register 126.
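The relationship among these instruction types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: exact rational arithmetic stands in for the hardware datapath, rounding and precision are not modeled, and the function names fmadd, fmul, and fadd are hypothetical.

```python
from fractions import Fraction

def fmadd(a, b, c):
    """Multiply-add: (A * B) + C, computed exactly (no rounding modeled)."""
    return Fraction(a) * Fraction(b) + Fraction(c)

def fmul(a, b):
    """Multiply: the same computation with addend C forced to zero."""
    return fmadd(a, b, 0)

def fadd(a, c):
    """Add: the same computation with multiplier B forced to one."""
    return fmadd(a, 1, c)
```

In this sketch a single routine serves all three instruction types, which mirrors how one multiply-add datapath can be reused for multiply and add.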
In the illustrated embodiment of
The most significant bit of the mantissa, to the left of the binary point, is referred to as an “implicit bit.” A floating-point number is generally presented as a normalized number, where the implicit bit is a one. For example, the number 0.001011*2^23 can be normalized to 1.011*2^20 by shifting the mantissa to the left until a “1” is shifted into the implicit bit, and decrementing the exponent by the same amount that the mantissa was shifted. A floating-point number will also include a sign bit that identifies the number as a positive or negative number. The exponent can also represent a positive or negative number, but a bias value is added to the exponent so that no exponent sign bit is required.
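The normalization example above can be reproduced with a short Python sketch. The representation is an assumption made for illustration: the mantissa is held as an integer with frac_bits bits to the right of the binary point, so that the implicit bit sits at bit position frac_bits. The function name normalize is hypothetical.

```python
def normalize(mantissa, frac_bits, exponent):
    """Shift the mantissa left until the implicit bit is one.

    Assumes 0 < mantissa < 2**(frac_bits + 1). The exponent is decremented
    by the number of positions shifted.
    """
    if mantissa <= 0:
        raise ValueError("mantissa must be a positive integer")
    shift = 0
    while not (mantissa >> frac_bits) & 1:
        mantissa <<= 1
        shift += 1
    return mantissa, exponent - shift

# 0.001011 with six fractional bits is the integer 0b0001011.
normalized, new_exponent = normalize(0b0001011, 6, 23)
```

Here the mantissa is shifted left by three positions, giving 1.011000, and the exponent drops from 23 to 20.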
For purposes of discussion, it is assumed that the fractional component of the mantissa of a single-precision number has twenty-four bits of precision, a double-precision number has fifty-three bits of precision, and an extended-precision number has sixty-four bits of precision. A packed single format contains two individual single-precision values. The first (low) value includes a twenty-four bit mantissa that is right justified in the 64-bit operand field, and the second (high) value includes another twenty-four bit mantissa that is left justified in the 64-bit operand field, with sixteen zeros included between the two single-precision values.
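The packed single mantissa layout can be illustrated as follows. This is a minimal sketch that models only the 64-bit mantissa field described above; the helper names pack_mantissas and unpack_mantissas are hypothetical.

```python
MASK24 = (1 << 24) - 1

def pack_mantissas(high, low):
    """Place two 24-bit mantissas in one 64-bit field.

    The low value occupies bits 23:0, the high value occupies bits 63:40,
    and the sixteen bits 39:24 between them are zero.
    """
    assert 0 <= high <= MASK24 and 0 <= low <= MASK24
    return (high << 40) | low

def unpack_mantissas(field):
    """Recover the (high, low) 24-bit mantissas from a 64-bit field."""
    return (field >> 40) & MASK24, field & MASK24
```

For example, packing a high mantissa of 0xABCDEF with a low mantissa of 0x123456 yields the field 0xABCDEF0000123456.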
FMAM 110 includes mantissa module 114 that performs mathematical operations on the mantissa portions of the received operands and includes exponent module 112 that performs mathematical operations on the exponent portions of the floating-point operands. Mantissa module 114 and exponent module 112 perform their operations in a substantially parallel manner.
In addition, it is assumed for purposes of discussion that FMAM 110 is implemented using a five-stage pipeline. During the first pipeline stage, the exponent of the product is calculated, and the multiply operation begins. The multiplier uses a radix-4 Booth recoding technique in which the multiplier and multiplicand are used to generate thirty-three partial products. The first two levels of 4:2 compressors in a multiplier carry-save adder (CSA) tree are included in the first pipeline stage. During the second pipeline stage, the exponents of the product and the addend are compared and the larger is selected to provide a preliminary exponent of the result. The second stage also includes the three additional 4:2 compressor levels.
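The count of thirty-three partial products follows from radix-4 Booth recoding of a 64-bit multiplier, as the sketch below suggests. It assumes an unsigned multiplier that is zero-extended above bit 63, and it models each partial product as a signed multiple of the multiplicand. The function names are hypothetical, and the sketch does not represent the actual encoder circuit.

```python
def booth_radix4_digits(multiplier, width=64):
    """Recode an unsigned multiplier into radix-4 Booth digits in {-2..2}.

    Digit i is b[2i-1] + b[2i] - 2*b[2i+1], with b[-1] = 0 and zero
    extension above the top bit, giving width // 2 + 1 digits.
    """
    assert 0 <= multiplier < (1 << width)
    extended = multiplier << 1  # append the implied b[-1] = 0
    digits = []
    for i in range(width // 2 + 1):
        window = (extended >> (2 * i)) & 0b111
        low, mid, high = window & 1, (window >> 1) & 1, (window >> 2) & 1
        digits.append(low + mid - 2 * high)
    return digits

def booth_multiply(multiplicand, multiplier, width=64):
    """Sum the Booth partial products (digit * multiplicand * 4**i)."""
    digits = booth_radix4_digits(multiplier, width)
    return sum((d * multiplicand) << (2 * i) for i, d in enumerate(digits))
```

With the default width of 64, booth_radix4_digits returns thirty-three digits, one per partial product, and booth_multiply reproduces the ordinary integer product.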
During the third pipeline stage, the intermediate result (sum and carry) of the multiply-add is presented to a carry-propagate adder (CPA), which calculates an un-normalized and unrounded result. In parallel with the CPA, a leading zero anticipator (LZA) operates on the same intermediate result as the CPA to produce controls for normalization. During the fourth pipeline stage, this result is normalized, and during the fifth stage, the normalized result is rounded.
Operand registers 120, 122, and 124 can each contain a data value, INPUT1, INPUT2, and INPUT3, respectively, that can be provided to FMAM 110. For the purposes of discussion, INPUT1, INPUT2, and INPUT3 can be single, double, or extended-precision floating-point numbers or a combination thereof. FMAM 110 can perform the requested arithmetic operation using the data values, and provide a result to result register 126. For example, FMAM 110 can execute a double-precision FMAC instruction where INPUT1 is multiplied by INPUT2, and the product is added to INPUT3. A double-precision result is provided to result register 126.
Instruction register 130 can contain an instruction (also referred to as an operation code and abbreviated as “opcode”), which identifies the instruction that is to be executed by FMAM 110. The opcode specifies not only the arithmetic operation to be performed, but also the precision of the result that is desired.
Control module 140 can receive the instruction from instruction register 130 and provide mode information, via signal MODE, to FMAM 110. For example, control module 140, upon receiving an extended-precision FMUL instruction, can configure FMAM 110 to perform the indicated computation and to provide an extended-precision result. Moreover, signal MODE can configure FMAM 110 to interpret each of input values INPUT1-3 as representing an operand of any of the supported precision modes.
Portion 300 includes operand registers 120, 122, and 124, a Booth encoder 340, a CSA array 350, a sign control 360, a complement module 370, an alignment module 372, CSA 380, LZA 388, CPA 390, a normalize module 392, and a round module 394. Operand register 120 further includes portions 1201 and 1202, operand register 122 further includes portions 1221 and 1222, operand register 124 further includes portions 1241 and 1242, and result register 126 further includes portions 1261 and 1262.
Operand registers 120 and 122 are connected to Booth encoder 340. Booth encoder 340 is connected to CSA array 350 and to CSA 380. Sign control 360 is connected to CPA 390 and to complement module 370. CSA array 350 has two outputs connected to CSA 380, and CSA 380 has two outputs connected to CPA 390 and to LZA 388. LZA 388 is connected to normalize module 392. CPA 390 is connected to normalize module 392, and normalize module 392 is connected to round module 394. Round module 394 is connected to result register 126. Operand register 124 is connected to complement module 370. Complement module 370 has an output connected to alignment module 372, and alignment module 372 is connected to CSA 380.
Operand register 120 provides a multiplicand operand, INPUT1, and operand register 122 provides a multiplier operand, INPUT2, to Booth encoder 340. Booth encoder 340 uses radix-4 Booth recoding to provide thirty-two partial products to CSA array 350, and a thirty-third partial product to CSA 380. CSA array 350 includes four levels of 4:2 carry-save adders to reduce the thirty-two partial products to two 128-bit partial products.
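The reduction of thirty-two partial products to two values in four levels can be checked with a small behavioral model. The sketch below assumes each 4:2 compressor behaves like two cascaded 3:2 carry-save adders and uses unbounded non-negative Python integers in place of fixed-width buses. The function names are hypothetical.

```python
def csa_3to2(x, y, z):
    """3:2 carry-save adder: returns (sum, carry) with sum + carry == x + y + z."""
    partial_sum = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return partial_sum, carry

def compress_4to2(w, x, y, z):
    """4:2 compressor modeled as two cascaded 3:2 carry-save adders."""
    s1, c1 = csa_3to2(w, x, y)
    return csa_3to2(s1, c1, z)

def reduce_with_4to2_levels(operands):
    """Reduce a list of operands to two values, counting compressor levels."""
    operands = list(operands)
    levels = 0
    while len(operands) > 2:
        assert len(operands) % 4 == 0, "each level consumes groups of four"
        reduced = []
        for i in range(0, len(operands), 4):
            reduced.extend(compress_4to2(*operands[i:i + 4]))
        operands = reduced
        levels += 1
    return operands, levels
```

Each level halves the operand count (32, 16, 8, 4, 2), so four levels suffice, and no carry is propagated across bit positions until the final carry-propagate addition.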
Operand register 124 provides an addend operand, INPUT3, to complement module 370. Complement module 370 can perform a bit-wise inversion of INPUT3 if sign control 360 determines that the computation being performed is an “effective subtract.” The determination of whether the computation is an effective subtract depends on the signs of the source operands as well as sign changes specified by the opcode, and determines if the sign of the product and the sign of the addend are different. Any or all of sources INPUT1, INPUT2, and INPUT3 may be negative (sign1, sign2, and sign3), and the opcode may specify inversion of INPUT3 (invert3) or inversion of the product (invertprod). For ADD/SUB instruction types that include two operands,
EffectiveSubtract=sign1⊕sign3⊕invert3
where sign1 and sign3 are the respective sign bits for INPUT1 and INPUT3, and invert3 corresponds to an optional opcode-specified inversion of INPUT3.
For multiply-add and multiply-subtract instruction types,
EffectiveSubtract=sign1⊕sign2⊕sign3⊕invert3⊕invertprod
where sign1, sign2, and sign3 are the respective sign bits for INPUT1, INPUT2, and INPUT3. Invert3 corresponds to an optional opcode-specified inversion of INPUT3, and invertprod corresponds to an optional opcode-specified inversion of the product prior to the addition operation.
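The two effective-subtract expressions translate directly into exclusive-or operations on single-bit values. The following sketch assumes each argument is 0 or 1; the function names are hypothetical.

```python
def effective_subtract_add(sign1, sign3, invert3=0):
    """ADD/SUB instruction types: sign1 XOR sign3 XOR invert3."""
    return sign1 ^ sign3 ^ invert3

def effective_subtract_fma(sign1, sign2, sign3, invert3=0, invertprod=0):
    """Multiply-add and multiply-subtract instruction types."""
    return sign1 ^ sign2 ^ sign3 ^ invert3 ^ invertprod
```

For instance, adding a positive INPUT1 to a negative INPUT3 is an effective subtract, whereas subtracting a negative INPUT3 (invert3 set) is not.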
Effective subtract does not identify whether the product or the addend should be inverted. Because floating-point is a sign+magnitude number representation, the mantissa should ultimately be positive. The smaller of the addend and the product could be inverted so that the sum of those is always positive. However, the relative size of the addend and product is unknown when sign control 360 determines whether the computation is an effective subtract. Accordingly, INPUT3 is assumed to be smaller and is inverted by complement module 370. CPA 390 is designed so that if the assumption is wrong and the sum would be negative, CPA 390 automatically inverts the sum and returns a positive result. This is accomplished by using a one's complement adder for the CPA, also known as an end-around-carry adder. The sign of the final result is computed separately.
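The behavior of an end-around-carry adder on an effective subtract can be modeled as shown below. This is a behavioral sketch under stated assumptions, not a description of the circuit of CPA 390: the operands are unsigned integers of a given width, the addend is the value that is bit-wise inverted, and the function name magnitude_difference and its returned flag are hypothetical.

```python
def magnitude_difference(product, addend, width):
    """Return (|product - addend|, inverted) using one's complement addition.

    The addend is inverted on the assumption that it is the smaller value.
    A carry out means the assumption held, and the carry is added back in
    at the least significant bit (end-around carry). No carry out means the
    raw sum is negative in one's complement, so it is inverted instead.
    """
    mask = (1 << width) - 1
    assert 0 <= product <= mask and 0 <= addend <= mask
    total = product + (~addend & mask)
    if total >> width:
        return (total + 1) & mask, False
    return ~total & mask, True
```

Either way the returned magnitude is non-negative, which is why the sign of the result can be determined separately.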
In particular, the sign of the result is calculated by first assuming that INPUT3 is larger, and choosing a preliminary result sign equal to the exclusive-or of sign3 and invert3. In the case of a pure multiply (INPUT1*INPUT2) there is no INPUT3, so the preliminary result sign is equal to the exclusive-or of sign1 and sign2. This preliminary sign will be correct unless the operation is an effective subtract where INPUT3 was in fact smaller, and the adder should not have previously inverted the result. If that case is detected, the sign of the result is flipped during the fourth stage of the pipeline.
Alignment module 372 is configured to shift the addend so that its value is aligned to corresponding significant bits of the product, as determined by comparing the value of the exponent of INPUT3 to the value of the product exponent determined by the exponents of INPUT1 and INPUT2.
CSA 380 is another 4:2 carry-save adder that is configured to add the last two partial products provided by CSA array 350 to the aligned addend from alignment module 372 and to the thirty-third partial product from Booth encoder 340. The result provided by CSA 380 is in the form of a 194-bit sum and a 130-bit carry.
CPA 390 is a carry-propagate adder that calculates an un-normalized result based on the sum and carry results provided by CSA 380. LZA 388 operates in parallel to CPA 390, and predicts the number of leading zeros that will be present in the result of CPA 390. The un-normalized result is provided to normalize module 392, which normalizes the result to produce an un-rounded result based on the leading zero prediction from LZA 388. This unrounded result is rounded by round module 394, which provides a final rounded result to result register 126. CPA 390, normalize module 392, and round module 394 can provide a carry-out value to the exponent datapath to increment the exponent of the result.
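The normalization step can be sketched as a leading-zero count followed by a left shift. Note the simplifying assumption: an actual leading zero anticipator predicts the count from the sum and carry inputs in parallel with the adder, whereas this sketch simply counts leading zeros in the finished result. The function names are hypothetical.

```python
def leading_zeros(value, width):
    """Number of zero bits above the most significant one in a width-bit field."""
    assert 0 <= value < (1 << width)
    return width - value.bit_length()

def normalize_result(value, width):
    """Shift left so the leading one reaches the top bit; return (value, shift)."""
    shift = leading_zeros(value, width)
    return (value << shift) & ((1 << width) - 1), shift
```

The shift amount returned here corresponds to the adjustment that would also be applied to the exponent of the result.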
Portion 400 includes operand registers 120, 122, and 124, registers 430 and 432, Booth encoder 340, CSA array 350, sign control 360, complement module 370, alignment modules 372, 472, and 474, CSA 380, CPA 390, normalize modules 492 and 493, and round modules 394 and 494. Complement module 370 further includes portions 3702 and 3704. CPA 390 further includes portions 3902 and 3904. Operand register 120 further includes portions 1201 and 1202, operand register 122 further includes portions 1221 and 1222, operand register 124 further includes portions 1241 and 1242, and result register 126 further includes portions 1261 and 1262.
Operand register 120 is connected to Booth encoder 340. Portion 1221 of operand register 122 is connected to register 430, and portion 1222 of operand register 122 is connected to register 432. Registers 430 and 432 are also connected to Booth encoder 340. Booth encoder 340 is connected to CSA array 350 and to CSA 380. Sign control 360 is also connected to CPA 390, and complement module 370. CSA array 350 has two outputs connected to CSA 380, and CSA 380 has two outputs connected to LZA 388 and to CPA 390. LZA 388 is connected to LZA 486 and LZA 488. CPA 390 has two portions 3902 and 3904. Portion 3902 and LZA 486 are connected to normalize module 492. Portion 3904 and LZA 488 are connected to normalize module 493. Normalize module 492 is connected to round module 394. Round module 394 is connected to portion 1261 of result register 126. Normalize module 493 is connected to round module 494. Round module 494 is connected to portion 1262 of result register 126. Portion 1241 of operand register 124 is connected to portion 3702 of complement module 370, and portion 1242 of operand register 124 is connected to portion 3704 of complement module 370. The outputs of complement module 370 portions 3702 and 3704 are connected to alignment module 372. Alignment module 372 connects to alignment modules 472 and 474. The outputs of alignment modules 472 and 474 are connected to CSA 380.
Portion 400 highlights how the extended precision mantissa datapath illustrated at
Two variations of the multiplier operands BH and BL, provided by operand register 122, are prepared. Register 430 receives operand BH, and the twenty-four bits of operand BH are left justified in 64-bit register 430, and bits 39:0 of register 430 are set to zero. Register 432 receives operand BL, and the twenty-four bits of operand BL are right justified in 64-bit register 432, and bits 63:24 of register 432 are set to zero. Booth encoder 340 uses register 432 to calculate the twelve least significant partial products, and uses register 430 to calculate the thirteen most significant partial products. The middle eight partial products can be calculated using the value provided by either register 430 or 432.
Alignment module 372 is used to perform a fine-grained shift of zero to 15 bit positions. In this second mode of operation the upper and lower bits of the shifter are controlled independently. Alignment modules 472 and 474 are dedicated for use in the packed-single mode of operation and complete the shift by performing shifts by multiples of 16. Individual alignment controls are provided by the exponent datapath. The exponent datapath is configured in the second mode of operation to provide an alignment shift amount for CH and CL based upon a comparison of the exponents of operands AL, BL, and CL, and AH, BH, and CH, respectively, using the same exponent modules used to provide an alignment shift amount in the first operating mode.
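Splitting an alignment shift into a fine stage and a coarse stage can be illustrated as follows. The sketch assumes a right shift of a non-negative integer and ignores the independent control of the upper and lower halves; the function names are hypothetical.

```python
def split_shift(amount):
    """Split a shift amount into (fine, coarse): 0..15 and a multiple of 16."""
    assert amount >= 0
    return amount & 0xF, amount & ~0xF

def two_stage_shift(value, amount):
    """Apply the fine shift first, then complete it with the coarse shift."""
    fine, coarse = split_shift(amount)
    return (value >> fine) >> coarse
```

A shift by 37, for example, is performed as a fine shift by 5 followed by a coarse shift by 32, and gives the same result as a single shift by 37.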
A carry into the least significant bit of CPA 390 is introduced when portion 300 is operating in the first mode if the operation is an effective subtract. When CPA 390 is operating in the second mode, a carry into either or both of portions 3902 and 3904 may be performed based on whether either or both operations, respectively, is an effective subtract. Therefore, sign control 360 can specify that a carry is to be injected not only into bit zero, the least significant bit of portion 3902, but also into bit eighty, the least significant bit of portion 3904, during the carry-propagate calculation.
In the event that a carry is injected into bit 80 of CPA 390, then the natural carry out of bit seventy-nine will not propagate into bit 80. When operating on two packed single-precision operands in the second operating mode, the carry-save adder Wallace tree (CSA array 350 and CSA 380) will always result in a value of one being naturally carried out of bit seventy-nine of CPA 390. Because this natural carry does not occur in CPA 390 when in the second operating mode, a compensation operation is performed during computation of the product by adding a one at bit eighty to the product within CSA array 350, as specified by being in the second operating mode.
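The effect of breaking the carry chain between bit seventy-nine and bit eighty can be modeled with a segmented addition. This sketch makes several assumptions for illustration only: the total adder width of 160 bits, the function name segmented_add, and its parameters are hypothetical, and the compensation value added within CSA array 350 is not modeled.

```python
def segmented_add(x, y, carry_low, carry_high, split=80, width=160):
    """Add x and y as two independent segments divided at bit `split`.

    carry_low is injected at bit zero and carry_high at bit `split`. The
    carry out of the low segment is discarded rather than propagated.
    Returns (low_result, high_result).
    """
    low_mask = (1 << split) - 1
    high_mask = (1 << (width - split)) - 1
    low = (x & low_mask) + (y & low_mask) + carry_low
    high = ((x >> split) & high_mask) + ((y >> split) & high_mask) + carry_high
    return low & low_mask, high & high_mask
```

In this model, a low segment that overflows does not disturb the high segment, so each packed operation sees only its own injected carry.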
LZA module 388 generally comprises two basic steps: generation of a leading zero value, and priority encoding of that value to find the bit position of the first “1”. When in the second operating mode, the first step of generating the LZA value is performed by LZA module 388. The upper portion of that LZA value, corresponding to the high result, is passed to LZA module 486 for priority encoding. The lower portion of the LZA value, corresponding to the low result, is passed to LZA module 488 for priority encoding.
Normalize module 492 receives the unnormalized and unrounded high result from portion 3902 of CPA 390. It also receives the leading zero prediction from LZA 486. It passes the normalized result out to round module 394. Normalize module 493 receives the unnormalized and unrounded low result from portion 3904 of CPA 390. It also receives the leading zero prediction from LZA 488. It passes the normalized result out to round module 494. Note that normalize module 392 is not used in the second mode of operation.
Round module 394 is shared between the first and second modes of operation. When operating in the second mode, round module 394 performs rounding on the high single value and passes the final rounded result to portion 1261 of result register 126. A second round module, 494, is provided to perform the rounding operation on the lower single value when operating in the second mode. The result from round module 494 is placed in portion 1262 of result register 126.
In addition to the mantissa datapath shown in
If the instruction provided at instruction register 130 instead specifies a packed single-precision multiply operation, FMAM 110 will operate in the second mode and the flow diagram proceeds from block 510 to block 550. At block 550, a second operand and a third operand, such as operand AH and AL at
A single arithmetic unit including only one exponent datapath and one mantissa datapath, which can execute a single operation in one mode, can be configured to execute two single-precision operations simultaneously in another mode, with substantially minimal additional cost and device area.
Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities may be performed, or elements included, in addition to those described. Still further, the order in which activities are listed is not necessarily the order in which they are performed.
Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.
For example, generic multiply, multiply-accumulate, and add operations can include variations such as multiply-add, negate multiply add, multiply subtract, and subtract. Implementation details such as the number of pipeline stages and how and when the correction value is applied are illustrated for the purpose of example, and skilled artisans will appreciate that methods disclosed can be implemented in other ways. Furthermore, the methods are applicable to other arithmetic devices and are not limited to floating-point arithmetic devices.
An arithmetic processing unit, such as FMAM 110, can receive two multiply operands and one addition operand, but the methods disclosed herein can be applied to other arithmetic processing units with a different number of multiplication and addition datapaths. Whereas FMAM 110 can support single, double, extended, and packed single-precision number formats, other formats or variations of these formats can be supported. Other arithmetic operations such as divide, square root, and transcendental operations may also be supported by FMAM 110.
Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature of any or all the claims.