The present disclosure relates generally to methods and apparatus for digital computing, and, in particular embodiments, to methods and apparatus for quotient digit recoding or selection in a high-performance arithmetic unit.
Division and square-root operations in digital computing are computationally intensive. Such operations can consume considerable resources, such as hardware resources (to implement the algorithm in hardware), time resources (to implement the algorithm in software), or both.
Research continues with the expectation of improving the algorithms implementing these operations, whether to reduce hardware resources required in implementing the algorithms or the time resources required in executing the algorithms.
According to a first aspect, a redundant binary signed digit (RBSD) divider is provided. The RBSD divider comprising: a first operand prescaling unit configured to scale a divisor by a first scaling factor; a scaled divisor selection unit operatively coupled to the first operand prescaling unit, the scaled divisor selection unit configured to receive an output of the first operand prescaling unit, and selectively swap multiples of plus and minus vectors of inputs to the scaled divisor selection unit to produce selected multiples of the scaled divisor, the swapping being in accordance with a predicted sign of a partial remainder; a digit recoder operatively coupled to multiplexers, the digit recoder configured to recode at least one N-bit portion of the partial remainder as a combination of two N/2-bit vectors; and a plurality of full adder stages having inputs operatively coupled to the scaled divisor selection unit and outputs operatively coupled to the multiplexers, the plurality of full adder stages configured to compress a difference of the partial remainder and the selected multiples of the scaled divisor, wherein the outputs of the plurality of full adder stages being in a redundant format.
In a first implementation form of the RBSD divider according to the first aspect, the first operand prescaling unit being further configured to generate one or more additional integer multiples of the scaled divisor.
In a second implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, further comprising: a second operand prescaling unit configured to scale a dividend by the first scaling factor, and generate a first multiple of the scaled dividend.
In a third implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, the scaled dividend being in one of a non-redundant normal binary format or a RBSD format.
In a fourth implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, N being an even integer value.
In a fifth implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, N being a combination of a first vector expressible as a ceiling (N/2) and a second vector expressible as a floor (N/2), when N is an odd integer value, where ceiling (N/2) produces a smallest integer greater than N/2 and floor (N/2) produces a largest integer smaller than N/2.
In a sixth implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, each full adder stage of the plurality of full adder stages further comprising a partial adder stage.
In a seventh implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, further comprising a sign predictor operatively coupled to the plurality of full adder stages, the sign predictor configured to generate the predicted sign of a subsequent partial remainder in accordance with the outputs of the plurality of full adder stages.
In an eighth implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, the digit recoder comprising a plurality of first recoders coupled to high inputs of a third plurality of recoders and a fourth plurality of recoders, the plurality of first recoders configured to determine a combination of two N/2-bit first vectors or a combination of a ceiling(N/2)-bit first vector and a floor(N/2)-bit first vector in accordance with the plus output; a plurality of second recoders coupled to low inputs of the third plurality of recoders and the fourth plurality of recoders, the plurality of second recoders configured to determine a combination of two N/2-bit second vectors or a combination of a ceiling(N/2)-bit second vector and a floor(N/2)-bit second vector in accordance with the minus output; a plurality of third recoders coupled to the scaled divisor selection unit, the plurality of third recoders configured to select from one of a high plus output of the plurality of first recoders or a high minus output of the plurality of second recoders in accordance with the plus and minus outputs; and a plurality of fourth recoders coupled to the scaled divisor selection unit, the plurality of fourth recoders configured to select from one of a low plus output of the plurality of first recoders or a low minus output of the plurality of second recoders in accordance with the plus and minus outputs.
In a ninth implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, further comprising a reciprocal unit operatively coupled to the first operand prescaling unit, the reciprocal unit configured to estimate a reciprocal of the divisor.
In a tenth implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, wherein N is equal to 6, N/2 is equal to 3, and wherein the first integer multiple of the scaled divisor is 1, and the additional integer multiples of the scaled divisor are 3, 5, 6, and 7.
In a twelfth implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, wherein N is equal to 5, ceiling(N/2) is equal to 3, and floor(N/2) is equal to 2, and wherein the first integer multiple of the scaled divisor is 1, and the additional integer multiples of the scaled divisor are 3, 5, 6, and 7.
In a thirteenth implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, wherein N is equal to 4, N/2 is equal to 2, and wherein the first integer multiple of the scaled divisor is 1, and the additional integer multiple of the scaled divisor is 3.
In a fourteenth implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, wherein N is equal to 3, ceiling(N/2) is equal to 2, and floor(N/2) is equal to 1, and wherein the first integer multiple of the scaled divisor is 1, and the additional integer multiple of the scaled divisor is 3.
In a fifteenth implementation form of the RBSD divider according to the first aspect or any preceding implementation form of the first aspect, wherein N is equal to 2, N/2 is equal to 1, and wherein the first integer multiple of the scaled divisor is 1.
According to a second aspect, a method implemented by a redundant binary signed digit (RBSD) divider is provided. The method comprising: prescaling, by the RBSD divider, a divisor and a dividend, the divisor and the dividend being inputs to the RBSD divider; and iteratively generating, by the RBSD divider, a quotient and a remainder in accordance with the divisor and the dividend utilizing a recoding of one or more radix-2^N multiples of most significant bits of a partial remainder.
In a first implementation form of the method according to the second aspect, the one or more radix-2^N multiples of the most significant bits of the partial remainder being recoded into two or more radix-2^(N/2) multiples when N is even.
In a second implementation form of the method according to the second aspect or any preceding implementation form of the second aspect, the one or more radix-2^N multiples of the most significant bits of the partial remainder being recoded into two or more radix-2^ceiling(N/2) and radix-2^floor(N/2) multiples when N is odd.
In a third implementation form of the method according to the second aspect or any preceding implementation form of the second aspect, prescaling the divisor and the dividend comprising: scaling, by the RBSD divider, the divisor by a first scaling factor and one or more additional integer multiples of the first scaling factor; and scaling, by the RBSD divider, the dividend by the first scaling factor.
In a fourth implementation form of the method according to the second aspect or any preceding implementation form of the second aspect, iteratively generating the quotient and the remainder comprising: recoding, by the RBSD divider, an N-bit portion of a partial remainder as a combination of two N/2-bit vectors or a combination of a ceiling(N/2) bit vector and a floor(N/2) bit vector, where 2^N is the radix of the N-bit portion of the partial remainder being recoded; selecting, by the RBSD divider, a plurality of the first scaled divisors or the second scaled divisors in accordance with a sign of outputs of the recoding; and compressing, by the RBSD divider, the plurality of the first scaled divisors or the second scaled divisors, and a current partial remainder, an output of the compressing comprising a difference of the current partial remainder and a sum of the plurality of the first scaled divisor or the one or more additional integer multiples of the first scaled divisor.
In a fifth implementation form of the method according to the second aspect or any preceding implementation form of the second aspect, further comprising predicting, by the RBSD divider, a sign of a subsequent partial remainder in accordance with the output of the compressing.
In a sixth implementation form of the method according to the second aspect or any preceding implementation form of the second aspect, recoding the one or more N-bit portions of the partial remainder comprising: determining a combination of two N/2-bit vectors or a combination of a ceiling(N/2)-bit first vector and a floor(N/2)-bit first vector in accordance with a plus output of the partial remainder; determining a combination of two N/2-bit vectors or a combination of a ceiling(N/2)-bit second vector and a floor(N/2)-bit second vector in accordance with a minus output of the partial remainder; selecting one of a high plus output of the combination of two N/2-bit first vectors or the combination of the ceiling(N/2)-bit and the floor(N/2)-bit first vectors, or a high minus output of the combination of two N/2-bit second vectors or the combination of the ceiling(N/2)-bit and the floor(N/2)-bit second vectors in accordance with the plus and minus outputs; and selecting one of a low plus output of the combination of two N/2-bit first vector or the ceiling(N/2)-bit and the floor(N/2)-bit first vectors, or a low minus output of the combination of two N/2-bit second vectors or the ceiling(N/2)-bit and the floor(N/2)-bit second vectors in accordance with the plus and minus outputs.
In a seventh implementation form of the method according to the second aspect or any preceding implementation form of the second aspect, further comprising estimating, by the RBSD divider, a reciprocal of the divisor.
In an eighth implementation form of the method according to the second aspect or any preceding implementation form of the second aspect, wherein N is equal to 6 or 5, and wherein the first multiple of the scaled divisor is 1, and the additional integer multiples of the scaled divisor are 3, 5, 6, and 7.
In a ninth implementation form of the method according to the second aspect or any preceding implementation form of the second aspect, wherein N is equal to 4 or 3, and wherein the first multiple of the scaled divisor is 1 and the additional integer multiple of the scaled divisor is 3.
In a tenth implementation form of the method according to the second aspect or any preceding implementation form of the second aspect, wherein N is equal to 2, and wherein the first multiple of the scaled divisor is 1 and there are no other additional integer multiples of the scaled divisor.
According to a third aspect, a system is provided. The system comprising: a non-transitory memory storage comprising instructions and data; one or more processors in communication with the memory storage, wherein the one or more processors execute the instructions; and an arithmetic unit in communication with the one or more processors and the memory storage, the arithmetic unit comprising: a first operand prescaling unit configured to scale a divisor by a first scaling factor; a scaled divisor selection unit operatively coupled to the first operand prescaling unit, the scaled divisor selection unit configured to receive an output of the first operand prescaling unit, and selectively swap multiples of plus and minus vectors of inputs to the scaled divisor selection unit to produce selected multiples of the scaled divisor, the swapping being in accordance with a predicted sign of a partial remainder; a digit recoder operatively coupled to multiplexers, the digit recoder configured to recode at least one N-bit portion of the partial remainder as a combination of two N/2-bit vectors; and a plurality of full adder stages having inputs operatively coupled to the scaled divisor selection unit and outputs operatively coupled to the multiplexers, the plurality of full adder stages configured to compress a difference of the partial remainder and the selected multiples of the scaled divisor, wherein the outputs of the plurality of full adder stages being in a redundant format.
In a first implementation form of the system according to the third aspect, the first operand prescaling unit being further configured to generate one or more additional integer multiples of the scaled divisor.
In a second implementation form of the system according to the third aspect or any preceding implementation form of the third aspect, the arithmetic unit further comprising a second operand prescaling unit configured to scale a dividend by the first scaling factor, and generate a first multiple of the scaled dividend.
In a third implementation form of the system according to the third aspect or any preceding implementation form of the third aspect, the arithmetic unit further comprising a sign predictor operatively coupled to the plurality of full adder stages, the sign predictor configured to generate the predicted sign of a subsequent partial remainder in accordance with the outputs of the plurality of full adder stages.
In a fourth implementation form of the system according to the third aspect or any preceding implementation form of the third aspect, the digit recoder comprising: a plurality of first recoders coupled to high inputs of a third plurality of recoders and a fourth plurality of recoders, the plurality of first recoders configured to determine a combination of two N/2-bit first vectors or a combination of a ceiling(N/2)-bit first vector and a floor(N/2)-bit first vector in accordance with the plus output; a plurality of second recoders coupled to low inputs of the third plurality of recoders and the fourth plurality of recoders, the plurality of second recoders configured to determine a combination of two N/2-bit second vectors or a combination of a ceiling(N/2)-bit second vector and a floor(N/2)-bit second vector in accordance with the minus output; a plurality of third recoders coupled to the scaled divisor selection unit, the plurality of third recoders configured to select from one of a high plus output of the plurality of first recoders or a high minus output of the plurality of second recoders in accordance with the plus and minus outputs; and a plurality of fourth recoders coupled to the scaled divisor selection unit, the plurality of fourth recoders configured to select from one of a low plus output of the plurality of first recoders or a low minus output of the plurality of second recoders in accordance with the plus and minus outputs.
In a fifth implementation form of the system according to the third aspect or any preceding implementation form of the third aspect, the arithmetic unit further comprising a reciprocal unit operatively coupled to the first operand prescaling unit, the reciprocal unit configured to estimate a reciprocal of the divisor.
An advantage of a preferred embodiment is that carry-propagate addition used in computing the quotient bits during iterative processing is eliminated. Eliminating the carry propagation in the computing of the quotient bits allows for increased number of bits per cycle, as well as increased clock speed.
For a more complete understanding of the present disclosure, and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:
The structure and use of disclosed embodiments are discussed in detail below. It should be appreciated, however, that the present disclosure provides many applicable concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of specific structure and use of embodiments, and do not limit the scope of the disclosure.
Division and square-root operations are difficult to implement on a computer. In general, division is the most complicated basic mathematical operation (addition, subtraction, multiplication, and division) to implement, with the algorithms consuming the most resources (hardware resources, time resources, or both hardware and time resources). Algorithms implementing the square-root operations are similar to those implementing the division operations in nature, and have comparable complexity.
In a deeply-pipelined processor design, Newton-Raphson and Goldschmidt iterations consume many cycles due to dependencies between and within successive iterations. In contrast, SRT and GST division algorithms are iterative algorithms that use recurrence equations to successively calculate the next successive quotient or square-root bits. As an example, the recurrence equation for division is expressible as
$$R_{i+1} = r \cdot R_i - q_i \cdot D.$$
The recurrence equation for square-roots is expressible as
$$R_{i+1} = r \cdot R_i - q_i \cdot (2 \cdot Q_{i-1} + r^{-i} \cdot q_i),$$
where, for both recurrence equations, r is the radix, R is the partial remainder, q_i is the computed quotient digit, D is the divisor, i is the iteration counter, and Q_i is the developed root.
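As an illustration only, the division recurrence can be exercised with exact rational arithmetic. The sketch below is not an SRT or GST implementation: it assumes a plain non-redundant digit set 0 to r−1 and selects each digit by an exact comparison, which is a simplification made for clarity. The function name and parameters are hypothetical.

```python
from fractions import Fraction


def digit_recurrence_divide(x, d, r=4, steps=8):
    """Apply R_{i+1} = r*R_i - q_i*D, starting from R_0 = x with 0 <= x < d.

    Assumption: digits are chosen from 0..r-1 by exact floor division,
    not from a redundant digit set as in SRT or GST hardware.
    """
    rem = Fraction(x)
    d = Fraction(d)
    digits = []
    for _ in range(steps):
        shifted = r * rem          # r * R_i
        q = shifted // d           # quotient digit q_i (an int)
        rem = shifted - q * d      # R_{i+1}
        digits.append(q)
    # Radix-r fraction built from the digits: sum of q_i * r^-(i+1)
    quotient = sum(Fraction(q, r ** (i + 1)) for i, q in enumerate(digits))
    return digits, quotient, rem
```

After any number of steps the exact identity x/d = quotient + rem/(d·r^steps) holds, so the final partial remainder measures how far the developed quotient is from the true result.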
A disadvantage of existing SRT algorithm implementations is that they are generally restricted to 2 to 3 bits per cycle (i.e., radix-4 to radix-8) because of their reliance on complex quotient selection tables, which depend on both the partial remainder and the divisor.
Partial remainder-divider diagram 200 presents the quotient digit selection of an SRT algorithm with radix-4 for different divisor and partial remainder values.
In general, the GST algorithms are superior to the SRT algorithms because the GST algorithms allow a higher number of bits to be calculated per cycle by prescaling the dividend and the divisor. The divisor is scaled so that it is close to 1.0. The equations for the result Q (quotient in division operations and root in square root operations) are expressible as
where k is the scaling factor.
Prescaling the dividend and the divisor so that the scaled divisor is close to 1.0 allows for the quotient estimate to be determined directly from the bits of the partial remainder in each iteration without requiring the use of a look-up table.
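The effect of prescaling can be sketched numerically. In the hypothetical example below, the reciprocal estimate k is modeled as 1/d rounded to a fixed number of fractional bits (an assumption; the disclosure only says an estimate is used), and each quotient digit is obtained by rounding the shifted partial remainder, with no table lookup. The digits are signed, and all names are illustrative.

```python
from fractions import Fraction


def prescaled_divide(x, d, radix=16, steps=6, recip_bits=10):
    """Divide x by d after scaling both by k ~ 1/d (hypothetical sketch)."""
    x, d = Fraction(x), Fraction(d)
    # Assumed reciprocal estimate: 1/d rounded to recip_bits fractional bits.
    k = Fraction(round((1 / d) * 2 ** recip_bits), 2 ** recip_bits)
    scaled_d = k * d               # close to 1.0
    rem = k * x                    # scaled dividend is the first remainder
    digits = []
    for _ in range(steps):
        shifted = radix * rem
        q = round(shifted)         # digit read directly from the remainder
        rem = shifted - q * scaled_d
        digits.append(q)
    quotient = sum(Fraction(q, radix ** (i + 1)) for i, q in enumerate(digits))
    return digits, quotient, rem, scaled_d
```

Because the same factor k multiplies the dividend and the divisor, the quotient is unchanged, and the identity x/d = quotient + rem/(scaled_d·radix^steps) holds exactly for any digit choice. The closeness of scaled_d to 1.0 is what keeps the partial remainder bounded when the digit is taken directly from it.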
Output of stage 305 is used as input to stage 307, and so forth. Hence, the delay of the CSAs is linear and is dependent on the number of stages. Each stage also includes a carry propagate adder on the upper bits of the partial remainder, which adds significant delay. The stages also scale with adder width and are not carry free.
In SRT and GST iterative division and square root algorithms, cycle time may be decreased by keeping the partial remainder in redundant binary signed digit (RBSD) format, which helps to improve overall performance. In the RBSD format, the partial remainder may be represented as a sum of two vectors: a Plus (+) vector and a Minus (−) vector. An advantage of using the RBSD format is that the computation of the recurrence equation for a new partial remainder (expressible as R_{i+1} = r·R_i − q_i·D for the division algorithm; the recurrence equation for the square root algorithm is similar) can also be performed in the RBSD format. The computation of the recurrence equation is carry free. In this context, carry free means that the computation of the recurrence equation may be performed independent of carry propagation, no matter the vector length.
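The carry-free property can be illustrated with a small model. The sketch below assumes an RBSD number is held as two non-negative integers (plus, minus) whose value is plus − minus, and adds an ordinary binary number to it with a row of independent one-bit cells. The cell equations are a standard borrow-save construction chosen for illustration, not taken from the disclosure, and the names are hypothetical.

```python
def rbsd_value(plus, minus):
    """Value represented by a Plus/Minus vector pair."""
    return plus - minus


def rbsd_add_binary(plus, minus, addend, width):
    """Add a non-redundant binary addend to an RBSD pair without carry chains.

    Per bit position: p + a - m = 2*c - s with c, s in {0, 1}.
    c goes to the next-higher Plus bit, s stays in the Minus vector, so
    each output bit depends only on inputs at a single bit position.
    """
    mask = (1 << width) - 1
    not_minus = ~minus & mask
    s = (plus ^ addend ^ minus) & mask
    c = (plus & addend) | (plus & not_minus) | (addend & not_minus)
    return c << 1, s               # new (plus, minus) pair
```

For example, with plus = 0b1011, minus = 0b0110 (value 5) and addend 7, the result pair is (22, 10), which represents 12. The delay of such a row does not grow with the vector length, since no bit waits on a carry from a lower position.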
However, the computation of the quotient digits remains a bottleneck. The quotient digits must be produced by summing the MSBs of the partial remainders using a carry propagate adder (CPA).
GST divider 400 includes a prescale stage 405 and an iteration stage 407. Prescale stage 405 is configured to prescale a divisor vector 410 and a dividend vector 412. As shown in
A reciprocal unit 414 determines the reciprocal estimate k≈1/X of divisor vector 410 and may be implemented using an estimate table, for example. Prescalers 416 and 418 multiply divisor vector 410 and dividend vector 412 with the prescaling factor, k. Prescaler 416 produces an output in redundant form, and may be implemented using a compression tree in any suitable redundant format, for example carry-save, carry-borrow, redundant binary signed-digit (RBSD), or any other. Prescaler 418 produces an output in redundant binary signed-digit format. CPAs 420 generate the X3 and X1 scaled divisors from the output of prescaler 416, and multiplexers 422 generate the scaled dividend plus/minus vectors in RBSD format from the output of prescaler 418. The X3 scaled divisor represents 3k*divisor, whereas the X1 scaled divisor represents k*divisor.
Iteration stage 407 includes a digit recoder 425 configured to recode 4-bit portions of the MSBs of the selected non-redundant dividend, the output of mux 429, into a combination of 2-bit values. For radix-64K GST divider 400, digit recoder 425 recodes 4-bit portions of the dividend (a total of 16 possible values: 0-15) into a combination of 2-bit values (a total of 4 values: 0-3 and 4 times multiples thereof). A detailed discussion of the digit recoding is provided below.
Digit recoder 425 includes CPAs 427 that add the MSBs of the dividends (plus and minus) and a multiplexer 429 that selects a recoded digit based on the actual sign of the dividend. The selected recoded digit is the output of digit recoder 425.
The output of digit recoder 425 is provided to multiplexers 431 that select multiples of the scaled divisor, [(0, 1, 2, 3)×4, (0, 1, 2, 3)]×[16^3, 16^2, 16^1, 16^0], to add using RBSD full adders 433. RBSD FA 434 subtracts all of the multiples of the scaled divisor from partial remainder 423. Output of RBSD FA 434 is provided to shifter 435 to shift the output by 4 (i.e., multiply by 16). The output of shifter 435 is the output partial remainder of the stage and may be provided to multiplexer 422 for a subsequent iteration.
GST divider 400 still requires carry propagate subtraction for multiple 4-bit portions, where the carry propagate subtractors, as well as extra carry propagate subtraction bits, are needed for sign and fraction. Furthermore, the positive difference is needed after the carry propagate subtractors. Due to the continued requirement for carry propagate adders, a radix-64 design would require carry propagate adders that are 20-bits wide, which have significant delay and would increase cycle time.
Multiple=4*High(Plus)_Vector_Value−Low(Minus)_Vector_Value,
where the High(Plus)_Vector_Value has a range of [0, 1, 2, 3] and the Low(Minus)_Vector_Value has a range of [0, −1, −2, −3]. All vector values represent multiples of the scaled divisor.
For discussion purposes, consider multiple 7 (shown in highlight 505). Multiple 7 may be expressed as a combination of 7=4*1−(−3)=4+3=7. Similarly, multiple 13 (shown in highlight 510) may be expressed as a combination of 13=4*3−(−1)=12+1=13. Other multiples may also be similarly expressed.
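The single-stage recoding expression can be checked with a few lines of code. This sketch assumes the High value is simply the upper two bits of the 4-bit multiple and the Low value is the negated lower two bits, which reproduces the two worked cases above; the function names are hypothetical.

```python
def recode_radix16(multiple):
    """Split a multiple in 0..15 into (high, low) per
    Multiple = 4*High(Plus)_Vector_Value - Low(Minus)_Vector_Value."""
    if not 0 <= multiple <= 15:
        raise ValueError("multiple must be in 0..15")
    high = multiple >> 2           # range 0..3
    low = -(multiple & 3)          # range 0..-3
    return high, low


def rebuild_multiple(high, low):
    """Recombine the two lower-radix values into the original multiple."""
    return 4 * high - low
```

Calling recode_radix16(7) gives (1, -3) and recode_radix16(13) gives (3, -1), so only the 1×, 2×, and 3× multiples of the scaled divisor are needed to form any multiple from 0 to 15.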
As discussed previously, the prior art GST divider with the digit recoding of
According to an example embodiment, methods and apparatus for a high performance divide or square-root unit with multi-stage quotient digit recoding are provided. The multi-stage quotient digit recoding recodes the upper bits of the Plus and Minus vectors of the partial remainder into combinations of lower-radix multiples over multiple stages so that carry propagate addition is not utilized. The elimination of carry propagate addition enables the use of higher radix operations without incurring extended delays that slow the operations.
Rather than using carry propagate addition, each of the 4-bit portions of both the Plus and Minus vectors are recoded into two additional vectors that can represent all possible combinations of multiples 0 to 15 by the expression
Multiple=4*High(Plus)_Vector_Value−Low(Minus)_Vector_Value,
where the High(Plus)_Vector_Value has a range of [0, 1, 2, 3] and the Low(Minus)_Vector_Value has a range of [0, −1, −2, −3]. All vector values represent multiples of the scaled divisor.
However, unlike the digit recoding described previously, the quotient digit recoding presented herein occurs in two stages. The first stage individually recodes the 4-bit portions of both the Plus vector and the 4-bit portions of the Minus vector to combinations of lower radix High (Plus) Vector multiples and Low (Minus) Vector multiples as in the prior art encodings. The second stage recodes the High portion of the first stage Plus vector (range=4*[0, 1, 2, 3]) and the High portion of the first stage Minus vector (range=4*[0, 1, 2, 3]). Similarly, the second stage also recodes the Low portion of the Plus vector and the Low portion of the Minus vector. The key takeaway is that the digit set of the resultant sum is +/−4*[0, 1, 2, 3], that is, the same lower-radix digit set as the output of the first recoding stage, but instead of being all positives or all negatives (aside from zero) the output of the second recoding stage is a Plus vector and a Minus vector with a digit set of both + and − multiples, which is easily obtained. To form a ‘−’ multiple instead of a ‘+’ multiple (or vice-versa) in RBSD format the ‘+’ and ‘−’ inputs to the RBSD full adder are simply swapped.
For illustrative purposes, consider the situation shown in
Hence, in the first stage of recoding, each multiple (e.g., a Plus vector multiple and a Minus vector multiple), which ranges from 0 to 15, may be represented as a sum of two radix-4 vectors, each of which has a range from 0 to 3. In the example shown in
The recodings of the 4-bit portions 605 and 607 of the Plus vector and the Minus vector, as produced by the first stage of recoding, are provided to the second stage of recoding. In the second stage of recoding, the High (Plus) vectors from the first stage recoding of the 4-bit portion 605 of the High (Plus) vector and the 4-bit portion 607 of the High (Minus) vector (i.e., High (Plus) vector multiple 610 and High (Plus) vector multiple 615) are combined to produce a final High (Plus) multiple.
High (Plus) vector multiple 610, which corresponds to 4-bit portion 605 (the original Plus vector), becomes High (Plus) vector multiple 625 of the Plus vector multiple of the second stage of recoding. This relationship is shown as mapping 628. High (Plus) vector multiple 615, which corresponds to 4-bit portion 607 (the original Minus vector), becomes High (Minus) vector multiple 627 of the Plus vector multiple of the second stage of recoding. This relationship is shown as mapping 629. The combination of High (Plus) vector multiple 625 and High (Minus) vector multiple 627 is expressible as 4× High (Plus) vector multiple 625−4× High (Minus) vector multiple 627=4×(1)−4×(3)=4−12=4×(−2)=−8, which is shown as highlight 630. A final High (Plus) multiple 645 is then −2 in this instance.
Low (Minus) vector multiple 612, which corresponds to 4-bit portion 605 (the original Plus vector), becomes Low (Minus) vector multiple 635 of the Low (Minus) vector multiple of the second stage of recoding. This relationship is shown as mapping 638. Low (Minus) vector multiple 617, which corresponds to 4-bit portion 607 (the original Minus vector), becomes Low (Plus) vector multiple 637 of the Low (Minus) vector multiple of the second stage of recoding. This relationship is shown as mapping 639. The combination of Low (Minus) vector multiple 635 and Low (Plus) vector multiple 637 is expressible as Low (Plus) vector multiple 637−Low (Minus) vector multiple 635=−1−(−1)=−1+1=0, which is shown as highlight 640. A final Low (Minus) multiple 647 is then 0 in this instance.
Final High (Plus) multiple 645 and final Low (Minus) multiple 647 are the recoding of the 4-bit portions 605 and 607 of the original Plus and Minus vectors.
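The two-stage recoding can be modeled arithmetically as follows. This is a sketch under stated assumptions: the 4-bit portions are taken to be 5 (Plus) and 13 (Minus), values inferred from the High and Low multiples quoted in the example rather than stated in the text, and the small subtractions of 2-bit fields stand in for whatever selection logic the hardware uses. Function names are hypothetical.

```python
def split_radix16(nibble):
    """First stage: split a 4-bit portion into 2-bit high and low fields."""
    return nibble >> 2, nibble & 3


def two_stage_recode(plus_nibble, minus_nibble):
    """Second stage: pair the high fields and the low fields of the Plus
    and Minus portions, giving signed digits in -3..3.

    The represented value is 4*high + low == plus_nibble - minus_nibble.
    """
    p_hi, p_lo = split_radix16(plus_nibble)
    m_hi, m_lo = split_radix16(minus_nibble)
    high = p_hi - m_hi             # only 2-bit fields are combined
    low = p_lo - m_lo
    return high, low
```

With the assumed inputs, two_stage_recode(5, 13) yields (-2, 0), matching the final High multiple of −2 and final Low multiple of 0 in the example. No operation spans more than a 2-bit field, which is why no wide carry propagate adder is involved.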
Although the above example focuses on radix-16 recoding, the example embodiments presented herein are operable with other radix recoding. As an example, radix-64 recoding may be performed utilizing the following multiples:
Plus vector multiples: 8×[7, 6, 5, 4, 3, 2, 1, 0]
Minus vector multiples: −[7, 6, 5, 4, 3, 2, 1, 0].
Using radix-64 recoding, 6 bits' worth of multiples can be recoded per multiple pair. A slight issue arising from higher radix recoding is the addition of hard multiples 5×, 6×, and 7×, which must be pre-computed, just as the hard multiple 3× is computed with the current radix-16 recoding. However, the use of the four hard multiples (3×, 5×, 6×, and 7×) may be justified in some situations. Other radix recoding may also be possible, such as radix-256, radix-1024, and so on. Therefore, the focus on radix-16 should not be construed as being limiting to the scope of the example embodiments.
The appendix attached hereto includes example multi-stage quotient digit recoding for N=2, 3, 4, 5, and 6.
The arithmetic unit also finds the quotient and remainder by iteratively generating partial remainders (block 707). The iterative generation of the partial remainder utilizes the multi-stage digit recoding of N-bit portions of the Plus and Minus vectors of the operands into N/2-bit portions, where the arithmetic unit implements radix-2^N operations. The recoded bits may be used to select various multiples of the scaled divisor. The use of the multi-stage digit recoding enables the elimination of carry propagate adders, which removes the delay associated with the carry propagate adders. Additionally, the delay no longer scales directly with the base-2 logarithm of the radix of the arithmetic, which enables higher radix operations without incurring extended delay.
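The relationship between the recoding width and the pre-computed multiples can be sketched as below. The sketch assumes that a hard multiple is any multiple that is not a power of two and that the widest recoded field has ceiling(N/2) bits; under those assumptions it reproduces the multiples listed for N = 2 through 6. The generalized recoder covers even N only, and all names are hypothetical.

```python
import math


def hard_multiples(n_bits):
    """Multiples of the scaled divisor that are not powers of two
    (assumed definition) for recoding N-bit portions."""
    widest = math.ceil(n_bits / 2)
    return [m for m in range(1, 2 ** widest) if m & (m - 1)]


def two_stage_recode_even(plus_bits, minus_bits, n_bits):
    """Recode N-bit Plus/Minus portions (N even) into two signed digits.

    The represented value is 2**(N/2) * high + low.
    """
    half = n_bits // 2
    mask = (1 << half) - 1
    high = (plus_bits >> half) - (minus_bits >> half)
    low = (plus_bits & mask) - (minus_bits & mask)
    return high, low
```

For radix-64 (N = 6), hard_multiples(6) returns [3, 5, 6, 7] and each recoded digit lies in −7 to 7, so the digit pair stands for 8×high + low.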
Although the discussion presented herein focusses on dividers and division operations, the example embodiments are operable with square-root operations. Therefore, the discussion of dividers should not be construed as being limiting to the scope of the example embodiments.
Prescale stage 805 multiplies divisor 810 by the reciprocal estimate, k, of the divisor, obtained from 1/X estimate table 816, so that the scaled divisor is close to 1.0; how close the scaled divisor is to 1.0 depends on the number of bits per cycle processed by GST divider 800.
As discussed previously, prescale stage 805 also prescales operands in accordance with the hard multiples. As an example, for radix-16 operation, prescale stage 805 prescales the scaled divisor from the redundant output of the compression tree 814 by a factor of 3 by CPA 818 to form the scaled divisor X3.
The prescaling of the dividend 812 is performed by prescale unit 814. Prescale unit 814 may implement the prescaling using a compression tree, for example. Prescale unit 814 may operate in a carry-save, RBSD, or any other redundant format. However, if the prescaling is not performed in RBSD format, then the output of the dividend prescaling compression tree must be converted to RBSD format before being latched by the partial remainder register.
Carry propagate adders 818 add the redundant outputs of prescale unit 814 and output the prescaled divisor (scaled divisor X1), as well as the prescaled divisor times three (scaled divisor X3). As an example, the 1× scaled divisor is calculated by combining the sum and carry outputs, or the Plus and Minus outputs, of prescale unit 814 using a carry propagate adder. As an example, the 3× scaled divisor is calculated from the 1× and 2× sum and carry outputs, or the 1× and 2× Plus and Minus outputs, of prescale unit 814 through a 4:2 carry save adder or an RBSD full adder, with the result being combined in a carry propagate adder to form a non-redundant sum. Compression of the scaled divisor may alternatively be performed natively in the RBSD format.
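The carry-save path for the 1× and 3× scaled divisors can be modeled with integer bit operations. In this Python sketch the function names are hypothetical, the 4:2 compressor is built from two 3:2 carry-save stages (one common construction, not necessarily the one used in prescale stage 805), and Python's `+` stands in for the carry propagate adder.

```python
def csa_3to2(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save adder: a + b + c == sum_vec + carry_vec, with no carry ripple."""
    sum_vec = a ^ b ^ c
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1
    return sum_vec, carry_vec


def compress_4to2(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """4:2 compressor modeled as two cascaded 3:2 stages."""
    s1, c1 = csa_3to2(a, b, c)
    return csa_3to2(s1, c1, d)


def scaled_divisor_multiples(sum_vec: int, carry_vec: int) -> tuple[int, int]:
    """Form 1x and 3x from the redundant (sum, carry) form of the scaled divisor."""
    x1 = sum_vec + carry_vec  # carry propagate add of the 1x vectors
    # 3x = 1x + 2x: compress the four vectors (s, c, 2s, 2c) to two, then add.
    s, c = compress_4to2(sum_vec, carry_vec, sum_vec << 1, carry_vec << 1)
    x3 = s + c  # carry propagate add forms the non-redundant 3x
    return x1, x3


x1, x3 = scaled_divisor_multiples(0b1011, 0b0110)
print(x1, x3)  # x3 equals 3 * x1
```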
Multiplexers 820 select the prescaled dividend or an output of iteration stage 807 (a partial remainder from a prior iteration, i.e., a previous partial remainder) and produce a current partial remainder. In an initial iteration, the prescaled dividend is provided to iteration stage 807. In subsequent iterations, the previous partial remainder is passed to iteration stage 807. A partial remainder of the current iteration may be referred to as a current partial remainder. A partial remainder for an iteration occurring after the current iteration may be referred to as a subsequent partial remainder.
In iteration stage 807, multiplexers 822 selectively switch the Plus and Minus vectors of the partial remainder provided to digit recoding unit 824. A predicted sign of the partial remainder, as produced by sign predict unit 826, selects which one of the Plus or Minus vectors of the partial remainder is routed to each input of digit recoding unit 824. As an example, if the predicted sign is positive, then the Plus vector of the partial remainder is provided to a Plus input of digit recoding unit 824 and the Minus vector of the partial remainder is provided to a Minus input of digit recoding unit 824. If the predicted sign is negative, then the Plus vector of the partial remainder is provided to the Minus input of digit recoding unit 824 and the Minus vector of the partial remainder is provided to the Plus input of digit recoding unit 824.
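The swap described above amounts to a two-way selection. A minimal Python sketch follows, assuming the partial remainder value is Plus minus Minus and using a hypothetical function name; swapping the two vectors negates that value, so the recoder inputs represent a non-negative quantity whenever the prediction is correct.

```python
def route_to_recoder(plus: int, minus: int, predicted_negative: bool) -> tuple[int, int]:
    """Return (plus_input, minus_input) for the digit recoder.

    Swapping the vectors negates the represented value Plus - Minus.
    """
    return (minus, plus) if predicted_negative else (plus, minus)


# Partial remainder 3 - 9 = -6 with a negative prediction: the recoder sees +6.
p_in, m_in = route_to_recoder(3, 9, predicted_negative=True)
print(p_in - m_in)
```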
Sign predict unit 826 may be initially set to 0 or positive for the initial partial remainder (e.g., the scaled dividend). For successive iterations, sign predict unit 826 predicts the sign of the partial remainder from the previous partial remainder as produced by iteration stage 807. Sign predict unit 826 may have as inputs the inputs to RBSD full adder 838, which are two Plus vectors and two Minus vectors. Alternatively, sign predict unit 826 may have as inputs the inputs of shifter 840, which are a single Plus vector and a single Minus vector. As an example, the inputs to sign predict unit 826 comprise two Plus and two Minus vectors, the difference of which comprises a signed, two's complement value. Since this value represents the difference between the current partial remainder and a multiple of the scaled divisor, the first 18 bits (for a radix-64K design) of this value will be either all zeros or all ones, indicating either a positive or negative partial remainder. When left shifted by 16 bits, this value will comprise a sign bit (1 bit), an overflow bit (1 bit), an integer quotient value (16 bits), and fractional bits (2 bits), for a total of 20 bits of the next partial remainder. Thus, the inputs to sign predict unit 826 represent the pre-shifted bits of the next partial remainder. For each bit position, each bit may be marked as Plus, Minus, or Zero. The partial remainder may be determined to be negative if the leading bit is a Minus or if the leading bits of a sequence are Zeroes followed by a Minus. Sign predict unit 826 may alternatively use a parallel prefix to determine a succession of leading Zero bits.
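The leading-digit rule above can be sketched for the single Plus/Minus vector case. The Python model below scans from the most significant bit and is a serial stand-in for what hardware would likely do with a parallel prefix; the function name is hypothetical, and treating a position where both the Plus and Minus bits are set as a Zero digit is an assumption of the sketch.

```python
def rbsd_is_negative(plus: int, minus: int, width: int) -> bool:
    """Sign of Plus - Minus from the first non-Zero digit, scanning MSB to LSB."""
    for i in reversed(range(width)):
        p = (plus >> i) & 1
        m = (minus >> i) & 1
        if p == m:
            continue  # Zero digit: both bits clear, or both set (assumed to cancel)
        return m == 1  # leading Minus digit means negative, leading Plus means positive
    return False  # all digits Zero: the value is zero, reported as non-negative


print(rbsd_is_negative(0b0101, 0b0110, 4))  # 5 - 6
print(rbsd_is_negative(0b1000, 0b0111, 4))  # 8 - 7
```

The rule works because the digits below the first non-Zero position can never sum to a magnitude as large as that position's weight.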
Digit recoding unit 824 includes two first recoders 828 (one each of the Plus vector and the Minus vector) and two second recoders 830, with one second recoder coupled to the Plus outputs of the two first recoders 828 (this second recoder 830 selects the Plus multiple) and another second recoder coupled to the Minus outputs of the two first recoders 828 (this second recoder 830 selects the Minus multiple).
The Plus or Minus scaled divisor multiples are selected (using multiplexers 832) by the outputs of the two second recoders 830 (i.e., the output of digit recoding unit 824). If the outputs of the two second recoders 830 are positive, then the Minus scaled divisor multiple is selected and passed to RBSD full adders 834. Conversely, if the outputs of the two second recoders 830 are negative, then the Plus scaled divisor multiple is selected and passed to RBSD full adders 834. This is to ensure that the selected scaled multiple of the divisor is subtracted from the partial remainder, not added to the partial remainder. In an embodiment, the number of multiplexers in multiplexers 832 is equal to a sum of the number of integer bits (equal to the number of bits per cycle) and the number of additional fraction bits (t).
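The sign convention above can be seen in a simplified, non-redundant model of the iteration. The Python sketch below is an assumption-laden illustration, not the disclosed datapath: it uses exact rationals instead of Plus/Minus vectors, omits the recoders and RBSD adders entirely, and picks each signed quotient digit by rounding the partial remainder, a selection rule that is typical of prescaled division but is not confirmed by the text above. All names are hypothetical.

```python
from fractions import Fraction


def divide_prescaled(scaled_dividend: Fraction, scaled_divisor: Fraction,
                     radix_bits: int = 4, iterations: int = 6):
    """Digit-recurrence division for operands prescaled so the divisor is near 1.0."""
    radix = 1 << radix_bits
    remainder = scaled_dividend  # the first iteration takes the prescaled dividend
    quotient = Fraction(0)
    weight = Fraction(1)
    for _ in range(iterations):
        digit = round(remainder)  # signed digit (assumed selection rule)
        # A positive digit selects a negative multiple and vice versa, so the
        # multiple is always subtracted from the partial remainder.
        selected_multiple = -digit * scaled_divisor
        remainder = (remainder + selected_multiple) * radix  # left shift by radix_bits
        quotient += digit * weight
        weight /= radix
    return quotient, remainder


x, d = Fraction(3, 4), Fraction(255, 256)
q, r = divide_prescaled(x, d)
print(float(q), float(x / d))
```

In this model the identity dividend = quotient × divisor + remainder × radix^(-iterations) holds exactly after every iteration, which is a convenient check when experimenting with other radices.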
In general, the RBSD full adders 834 provide a 2:1 compression for multiples of the 4-bit portions. Granularity to 2-bits may be provided by using a type plus-plus-minus (PPM) block driven by one of the second recoders 830 for a Plus term or a type minus-minus-plus (MMP) block driven by the other of the second recoders 830 for a Minus term. Additional RBSD full adders 836 and 838 provide further 2:1 compression.
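A bit-level model shows how PPM and MMP blocks can compress two Plus/Minus pairs into one without any carry propagation. The Python sketch below uses a textbook cell construction, with output bits that depend only on the three input bits at the same position; the actual cells of RBSD full adders 834, 836, and 838 are not specified here, and the function names are hypothetical.

```python
def ppm(x: int, y: int, z: int) -> tuple[int, int]:
    """Plus-plus-minus block: returns (plus, minus) with plus - minus == x + y - z."""
    minus = x ^ y ^ z  # weight-1 Minus output
    plus = ((x & y) | (x & ~z) | (y & ~z)) << 1  # weight-2 Plus output
    return plus, minus


def mmp(x: int, y: int, z: int) -> tuple[int, int]:
    """Minus-minus-plus block: returns (plus, minus) with plus - minus == x - y - z."""
    p, m = ppm(y, z, x)  # y + z - x, then swap the outputs to negate
    return m, p


def rbsd_add(a: tuple[int, int], b: tuple[int, int]) -> tuple[int, int]:
    """2:1 compression: two (plus, minus) pairs in, one (plus, minus) pair out."""
    (p1, m1), (p2, m2) = a, b
    pa, ma = ppm(p1, p2, m1)  # p1 + p2 - m1
    return mmp(pa, ma, m2)    # (p1 + p2 - m1) - m2


plus, minus = rbsd_add((13, 6), (9, 20))
print(plus - minus)  # (13 - 6) + (9 - 20)
```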
A shifter 840 provides shifting; the bits selected as inputs are those that will become the 20 most significant bits. Therefore, the bits selected are bMSB-16, bMSB-17, . . . , bMSB-35. The number of bits presented in the discussion of the example presented in
Specific computing systems may utilize all of the components shown or only a subset of the components, and levels of integration may vary from device to device. Furthermore, a computing system may contain multiple instances of a component, such as multiple processing units, processors, memories, transmitters, receivers, etc. The computing system 900 includes a processing unit (CPU) 902, an arithmetic unit (AU) 904, and memory 906, and may further include mass storage 908, a display adapter 910, a network interface 912, and a human interface 914. Although shown as a single unit, CPU 902 may be implemented as multiple processing units. Mass storage 908, display adapter 910, network interface 912, and human interface 914 may be connected to a bus 916 or through an I/O interface 918 connected to bus 916.
Mass storage 908 may comprise any type of non-transitory storage device configured to store data, programs, and other information and to make the data, programs, and other information accessible via bus 916. Mass storage 908 may comprise, for example, one or more of a solid state drive, hard disk drive, a magnetic disk drive, or an optical disk drive.
Display adapter 910 and I/O interface 918 provide interfaces to couple external input and output devices to the CPU 902. As illustrated, examples of input and output devices include a display coupled to display adapter 910 and a mouse, keyboard, or printer coupled to human interface 914. Other devices may be coupled to CPU 902, and additional or fewer interface cards may be utilized. For example, a serial interface such as Universal Serial Bus (USB) (not shown) may be used to provide an interface for an external device.
Computing system 900 also includes one or more network interfaces 912, which may comprise wired links, such as an Ethernet cable, or wireless links to access nodes or different networks. Network interfaces 912 allow computing system 900 to communicate with remote units via the networks. For example, network interfaces 912 may provide wireless communication via one or more transmitters/transmit antennas and one or more receivers/receive antennas. In an embodiment, computing system 900 is coupled to a local-area network or a wide-area network for data processing and communications with remote devices, such as other processing units, the Internet, or remote storage facilities.
AU 904 includes one or more units implementing any of a variety of arithmetic operations, such as multiply, divide, add, subtract, square-root, and so on. Some of the units utilize multi-stage digit recoding to eliminate delay-intensive carry propagate adders, which enables the use of higher radix operations without incurring the extended delays that ordinarily slow the operations. AU 904 may include units such as GST divider 800, prescale stage 900, reciprocal unit 1000, Goldschmidt divider 1100, and so on.
It should be appreciated that one or more steps of the embodiment methods provided herein may be performed by corresponding units or modules. A signal may be processed by a processing unit or a processing module. Other steps may be performed by a prescaling unit or module, a generating unit or module, a scaling unit or module, a recoding unit or module, a compressing unit or module, a predicting unit or module, a determining unit or module, an estimating unit or module, or a selecting unit or module. The respective units or modules may be hardware, software, or a combination thereof. For instance, one or more of the units or modules may be an integrated circuit, such as field programmable gate arrays (FPGAs) or application-specific integrated circuits (ASICs).
Although the present disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the scope of the disclosure as defined by the appended claims.
This application is a continuation of International Application PCT/US2020/063955, filed Dec. 9, 2020, entitled “Methods and Apparatus for Quotient Digit Recoding in a High-Performance Arithmetic Unit,” which claims the benefit of U.S. Provisional Application No. 63/032,580, filed on May 30, 2020, entitled “Quotient Digit Recoding in a High-Performance Divide/Square-Root Unit,” which applications are incorporated herein by reference in their entireties.
Number | Date | Country
---|---|---
63032580 | May 2020 | US

Relation | Number | Date | Country
---|---|---|---
Parent | PCT/US2020/063955 | Dec 2020 | US
Child | 18060177 | | US