Disclosed embodiments relate to computational apparatuses and methods. Specifically, disclosed embodiments are related to apparatuses, architectures, and methods for general powering computation.
The design of functional units for the computation of powering and q-th roots ($X^Z$, with $Z=p$ or $Z=1/q$, where p and q are integers) has been a challenging task for years. Powering and q-th root extraction are used frequently in the fields of computer graphics, digital signal processing, and scientific computation. They include the computation of the square root ($X^{1/2}$), inverse square root ($X^{-1/2}$), cube root ($X^{1/3}$), inverse cube root ($X^{-1/3}$), squaring ($X^{2}$), inverse squaring ($X^{-2}$), reciprocal ($X^{-1}$), exponential ($e^{y}$ or $2^{y}$), and some other less frequent but also important functions.
There are a number of architectures for the computation of the exponential and logarithm; however, accurately computing the floating-point powering function and the root extraction is difficult. The prohibitive hardware requirements of a table-based implementation and the high intrinsic complexity of digit-recurrence based algorithms have led only to partial solutions, such as powering or root extraction for a constant exponent or for very low precision. The traditional approach to powering and q-th root extraction has been the development of functional units for the computation of a given power or root. Accordingly, there are a number of algorithms and implementations for the most frequent exponents (reciprocal, square root, and inverse square root), including linear-convergence digit-recurrence algorithms and quadratic-convergence multiplicative methods, such as the Newton-Raphson and Goldschmidt algorithms. There are also several approaches for the calculation of other exponents derived from the application of general methods for function evaluation to the case of powering.
In general, in the calculation of a powering or a q-th root with very low precision it is possible to employ direct table look-up, but its high memory requirements make it an inefficient method for single- or double-precision floating-point formats. Polynomial and rational approximations are another way of implementing the powering and q-th root extraction. However, one of the most efficient methods in floating-point representation is table-driven algorithms, which are halfway between direct table look-up and polynomial and rational approximations. The use of a polynomial approximation allows the table size to be reduced and the table look-up allows us to reduce the degree of the polynomial.
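The trade-off between table size and polynomial degree can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function names, the 6-bit table index, the midpoint expansion, and the choice of the reciprocal square root as the target function are hypothetical and are not taken from the cited works.

```python
def rsqrt(x):
    """Reference function for the example: x**(-1/2)."""
    return x ** -0.5

def rsqrt_derivative(x):
    """First derivative of x**(-1/2)."""
    return -0.5 * x ** -1.5

def build_table(f, df, index_bits):
    """Store f and f' at the midpoint of each of 2**index_bits sub-intervals of [1, 2)."""
    size = 1 << index_bits
    table = []
    for i in range(size):
        mid = 1.0 + (i + 0.5) / size
        table.append((f(mid), df(mid), mid))
    return table

def table_driven_eval(table, index_bits, x):
    """Approximate f(x) for x in [1, 2): one look-up plus a first-order correction."""
    size = 1 << index_bits
    i = int((x - 1.0) * size)      # the leading fraction bits select the table entry
    c0, c1, mid = table[i]
    return c0 + c1 * (x - mid)     # degree-1 Taylor polynomial around the midpoint
```

With 64 entries the first-order correction keeps the error of this example in the range of 1e-5, whereas the look-up alone (using only the stored function value) would be accurate to a few units in 1e-3, which is why the table can be kept small.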
There are first- and second-order polynomial approximations based on a Taylor expansion for the calculation of a limited number of powers and roots (square root, reciprocal square root, fourth root, etc.), such as those described in Powering by a Table Look-Up and a Multiplication with Operand Modification by N. Takagi, IEEE Transactions on Computers, vol. 47, no. 11, pp. 1216-1222, November 1998; Faithful Powering Computation Using Table Lookup and Fused Accumulation Tree by J. A. Piñeiro, J. D. Bruguera and J. M. Muller, Proceedings 15th IEEE Symposium on Computer Arithmetic, pp. 40-47, June 2001; and High-performance architectures for elementary function generation by J. Cao, B. W. Y. Wei and J. Cheng, Proceedings 15th IEEE Symposium on Computer Arithmetic, pp. 136-144, June 2001. However, those implementations require replicating the table that stores the coefficients and cannot be considered general q-th root calculation units.
A digit-recurrence method for the q-th root extraction has been presented in A Digit-by-Digit Algorithm for m-th Root Extraction by P. Montuschi, J. D. Bruguera, L. Ciminiera and J. A. Piñeiro, IEEE Transactions on Computers, vol. 56, no. 12, pp. 1696-1706, December 2007, and particularized to the radix-2 cube root computation in A Radix-2 Digit-by-Digit Architecture for Cube Root by J. A. Piñeiro, J. D. Bruguera, F. Lamberti and P. Montuschi, IEEE Transactions on Computers, vol. 57, no. 4, pp. 562-566, April 2008. The complexity of the resulting architecture depends on q: the larger q, the greater the complexity. Consequently, the architecture for the computation of large q-th roots is difficult to implement. There are also some other specific digit-recurrence implementations for both square and cube root computations presented in Digit-by-Digit Methods for Computing Certain Functions by M. D. Ercegovac, 41st Asilomar Conference on Signals, Systems and Computers, pp. 338-342, November 2007; and A Digit-Recurrence Algorithm for Cube Rooting by N. Takagi, IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E84-A, no. 5, pp. 1309-1314, May 2001.
It has to be pointed out that all the methods outlined above for the powering computation and q-th root extraction are targeted at a given exponent. That means that the resulting architecture cannot be used for the calculation of a power or root different from the one it has been designed for. Adapting the architecture to a different power or root requires changing the lookup tables, in the case of table-driven polynomial approximations, or designing a completely new architecture, in the case of the digit-recurrence method. The table-driven polynomial approximations can be adapted to compute more than just one power or root, but this requires the replication of the lookup tables. In any case, the methods above cannot be considered general methods for the calculation of any power or q-th root.
The only architecture in the literature for the q-th root extraction for any q is described in Algorithm and Architecture for Logarithm, Exponential and Powering Computation by J. A. Piñeiro, M. D. Ercegovac and J. D. Bruguera, IEEE Transactions on Computers, vol. 53, no. 9, pp. 1085-1096, September 2004. It was designed for the computation of the powering function $X^{p}$, with p any integer, based on a logarithm-multiplication-exponential chain implementation sped up by using redundancy and on-line arithmetic, and was extended to the computation of $X^{1/q}$. However, the extended architecture for the q-th root extraction is hard to implement because, in addition to the operations in the chain, it includes an integer division and requires the calculation of the remainder of the division.
Disclosed embodiments include an apparatus for general powering computation that comprises (a) a plurality of memory elements; and (b) a hardware processor configured to compute the powering function $X^{Z}$ of a floating-point number X, wherein Z is an unrestricted exponent. The unrestricted exponent can be a fixed-point or a floating-point exponent. Additionally, the unrestricted exponent can be the inverse of a number, which enables q-th root computation as part of the same hardware processor. According to one embodiment, the hardware processor comprises a multiplexing unit, a reciprocal unit, a logarithm unit, an exponential unit, a multiplication unit, a shifter unit, or combinations thereof. The reciprocal unit, logarithm unit, and multiplication unit are configured to perform computations contemporaneously, and the exponential unit is configured to perform computations on an on-line basis. In a particular embodiment, and without limitation, the reciprocal, logarithm, and multiplication units are configured to perform computations on a most-significant-digit-first basis. Disclosed embodiments also include methods for performing general powering computation.
Disclosed embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.
Microprocessors have a general structure to deal with common operations, such as memory access, software instruction execution, peripheral control, and arithmetic calculations. The complexity of some operations, such as the square root, cubic root, and inverse, does not allow specific hardware for these operations to be incorporated within the microprocessor. Consequently, current microprocessors incorporate floating-point units (FPUs) to carry out complex operations such as the square root or division of floating-point numbers. However, the functionality of FPUs is limited, as they cannot implement a large number of operations, and complex operations must be carried out using a software solution. The software solution degrades the overall performance of the system, as it slows down the computations. Disclosed embodiments include an apparatus that implements q-th roots and general powering computations.
Disclosed embodiments, and without limitation, include methods and apparatuses for the powering computation and the root extraction $X^{Y}$, X and Y being floating-point numbers, $X = (-1)^{s_x} \times M_x \times 2^{E_x}$, and for the fixed-exponent case $X^{Z}$ with $Z = y$ or $Z = 1/y$, y being a signed integer operand of $n_y + 1$ bits, with $|y| \ge 2$ for root extraction.
According to a particular embodiment, and without limitation, the apparatus for computing the Z-th powering or Z-th root of a number X comprises: (a) a plurality of memory elements, such as registers, for storing a number X whose Z-th powering or Z-th root is to be computed, a fixed-point number Z that indicates the powering or root exponent, the number of significant bits of the number X and of the resulting computation, the operation being performed (Z-th powering or Z-th root), and the former exponent of Z; (b) a reciprocal unit for computing the reciprocal of Z resulting in a number A; (c) a logarithm unit for computing the logarithm base 2 of the number X resulting in a number B; (d) a multiplication unit for computing the product of said numbers A and B resulting in a number C; and (e) an exponential unit for computing the exponential of said number C. In particular embodiments, the reciprocal unit operates in parallel with the logarithm unit, the logarithm unit and the multiplication unit overlap during computation, the exponential unit and the multiplication unit overlap during computation, the exponential unit computes the exponential on an on-line basis, the logarithm unit computes the logarithm on a most-significant-digit-first basis, and/or the multiplication unit computes the product on a most-significant-digit-first basis. According to one particular embodiment, as shown in
According to one embodiment, the computing of Z-th powering or Z-th roots in a hardware processor comprises: (a) setting a first memory element of the processor to a number X, wherein X is a number whose Z-th powering or Z-th root is to be computed; (b) setting a second memory element of the processor to a number Z, wherein Z is a fixed-point number that indicates the powering or root exponent; (c) setting a third memory element of the processor to the number of significant bits of the number X and of the resulting computation; (d) setting a fourth memory element of the processor to the operation being performed, Z-th powering or Z-th root; (e) setting a fifth memory element to the former exponent of Z; (f) computing the reciprocal of the number Z resulting in a number A; (g) computing the logarithm base 2 of the number X resulting in a number B; (h) computing the product of the numbers A and B resulting in a number C; (i) separating the integer and fractional parts of the number C; and (j) computing the exponential of the number C. In particular embodiments, the computing of the logarithm and the product are overlapped, the computing of the product and the computing of the exponential are overlapped, the number X is represented in single- or double-precision binary floating-point form according to the IEEE-754 standard, the number Z is represented in binary fixed-point form, and the processor is chosen from the group consisting of an integrated circuit, an FPGA device, a microprocessor, a microcontroller, and a general-purpose computer system.
According to a particular embodiment, and without limitation, the method is derived as follows:
$X^{Z} = 2^{\log_2 (X^{Z})} = 2^{Z \times \log_2 X}$ (1)
Considering that X is a floating-point operand, this equation can be rewritten as
$X^{Z} = 2^{Z \times (E_x + \log_2 M_x)} = 2^{Z \times S}$ (2)
where $S = E_x + \log_2 M_x$ is the concatenation of the digits of $E_x$ (integer value) and $\log_2 (M_x) \in [0, 1)$.
According to equation (2), $X^{Z}$ can be calculated as a sequence of operations: (1) logarithm of the significand $M_x$ ($\log_2 M_x \in [0, 1)$), (2) addition of $E_x$ and $\log_2 M_x$ (concatenation of binary strings), (3) multiplication by Z, and (4) exponential of the result of the multiplication. For an efficient implementation, the operations involved must be overlapped. This requires a left-to-right, most-significant-digit-first (MSDF) mode of operation and the use of a redundant representation. A radix-r signed-digit representation with a maximally redundant digit set $\{-(r-1), \ldots, 0, \ldots, (r-1)\}$ is employed.
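The four operations listed above can be sketched in software as follows. This is a minimal illustration in ordinary double-precision arithmetic, assuming a positive operand X; it does not model the overlapped, most-significant-digit-first hardware operation or the redundant representation, and the function name is hypothetical.

```python
import math

def power_via_log_chain(x, z):
    """Compute x**z for x > 0 as 2**(z * S), with S = E_x + log2(M_x)."""
    m, e = math.frexp(x)           # x = m * 2**e with m in [0.5, 1)
    m_x, e_x = 2.0 * m, e - 1      # renormalize so that M_x lies in [1, 2)
    log_m = math.log2(m_x)         # (1) logarithm of the significand, in [0, 1)
    s = e_x + log_m                # (2) addition of E_x and log2(M_x)
    t = z * s                      # (3) multiplication by Z
    return 2.0 ** t                # (4) exponential of the product
```

For example, `power_via_log_chain(27.0, 1/3)` follows the root-extraction path with Z = 1/3, and `power_via_log_chain(10.0, 3)` follows the powering path with Z = 3.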
A potential limitation of the algorithm above for certain applications is the range of the exponential function $2^{Z \times S}$. Digit-recurrence exponential algorithms require the argument to be in the interval (−1, 1), while $Z \times S$ can fall outside that range. To extend the range and guarantee the convergence of the algorithm, the integer and fractional parts of $Z \times S$ must be extracted serially and equation (2) must be rewritten as
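$X^{Z} = 2^{Z \times S} = 2^{\mathrm{int}(Z \times S)} \times 2^{\mathrm{frac}(Z \times S)}$ (3)
$\mathrm{int}(Z \times S)$ and $\mathrm{frac}(Z \times S)$ being the integer and fractional parts of $Z \times S$, respectively. Therefore, according to equation (3) and considering $F = X^{Z} = M_f \times 2^{E_f}$,
$M_f = 2^{\mathrm{frac}(Z \times S)}$ (4)
$E_f = \mathrm{int}(Z \times S)$ (5)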
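Equations (3) to (5) can be checked numerically with a short sketch. It assumes a positive operand X and, as a simplification that is not stated in the disclosure, takes the integer part with a floor so that the fractional part is non-negative; a signed-digit implementation may also produce a negative fractional part. The function name is hypothetical.

```python
import math

def split_power(x, z):
    """Return (M_f, E_f) such that x**z == M_f * 2**E_f, for x > 0."""
    m, e = math.frexp(x)                  # x = m * 2**e with m in [0.5, 1)
    s = (e - 1) + math.log2(2.0 * m)      # S = E_x + log2(M_x), with M_x in [1, 2)
    t = z * s                             # Z x S, possibly far outside (-1, 1)
    e_f = math.floor(t)                   # E_f = int(Z x S)   (floor: assumption)
    frac = t - e_f                        # frac(Z x S)
    m_f = 2.0 ** frac                     # M_f = 2**frac(Z x S)
    return m_f, e_f
```

Only the exponential of the fractional part is evaluated; the integer part becomes the exponent of the result directly, which is what keeps the argument of the exponential inside its range of convergence.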
The argument of the exponential $2^{\mathrm{frac}(Z \times S)}$ is now in (−1, 1). The number of integer bits of $Z \times S$ is larger for $X^{y}$ than for $X^{1/y}$. In the case of root extraction, the number of integer bits depends only on $E_x$, whereas in powering it also depends on y. According to one embodiment, the sequence of operations is as follows:
The number of digits in the integer part is $\lceil (n_{E_x} - 1)/b \rceil + 1$ for powering and $\lceil (n_{E_x} - 1)/b \rceil$ for root extraction. Since root extraction needs to compute $Z = 1/y$, the number of cycles required to obtain the integer part is the same for both algorithms, $\lceil (n_{E_x} - 1)/b \rceil + 1$. Consequently, the total latency is given by
$N = (\lceil (n_{E_x} - 1)/b \rceil + 1) + (\delta + 1) + N_e$ (6)
where $N_e = \lceil n_e / b \rceil$ is the latency of the exponential $2^{\mathrm{frac}(T)}$.
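Equation (6) can be evaluated directly, as in the sketch below. The function and parameter names are hypothetical, and the numeric values used in the usage note are illustrative only; they are not the values of the disclosed embodiments, which are given in the referenced table.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)

def total_latency(n_ex, b, delta, n_e):
    """Total latency N of equation (6), in cycles.

    n_ex  : number of bits of the exponent E_x
    b     : number of bits per radix-r digit
    delta : delay term appearing in equation (6)
    n_e   : number of bits processed by the exponential
    """
    integer_cycles = ceil_div(n_ex - 1, b) + 1   # cycles to obtain the integer part
    overlap_cycles = delta + 1                   # (delta + 1) term
    exp_cycles = ceil_div(n_e, b)                # N_e = ceil(n_e / b)
    return integer_cycles + overlap_cycles + exp_cycles
```

With the illustrative values `n_ex=11`, `b=7`, `delta=2` and `n_e=60`, the three terms are 3, 3 and 9 cycles, so `total_latency(11, 7, 2, 60)` gives 15.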
To provide faithfully rounded powering and root extraction, the rounded result must be within 1 ulp of the exact result. Assuming rounding to the nearest even, the required precision and minimum latency values for each intermediate operation and the latency for the complete operation are shown in the Table of
According to a particular embodiment, and without limitation, the apparatus for computing the Z-th powering or Z-th root of a number X comprises: (a) a plurality of memory elements, such as registers, for storing a number X whose Z-th powering or Z-th root is to be computed, a floating-point or fixed-point number Z that indicates the powering or root exponent, the number of significant bits of the number X and of the resulting computation, the operation being performed (Z-th powering or Z-th root), and the former exponent of Z; (b) a reciprocal unit for computing the reciprocal of Z resulting in a number A; (c) a logarithm unit for computing the logarithm base 2 of the number X resulting in a number B; (d) a shifter unit for shifting the number B in case Z is a floating-point number, resulting in a number B′; (e) a multiplication unit for computing the product of said numbers A and B or B′ resulting in a number C; and (f) an exponential unit for computing the exponential of said number C. In particular embodiments, the reciprocal unit operates in parallel with the logarithm unit, the logarithm unit and the multiplication unit overlap during computation, the exponential unit and the multiplication unit overlap during computation, the exponential unit computes the exponential on an on-line basis, the logarithm unit computes the logarithm on a most-significant-digit-first basis, the shifting is computed on a most-significant-digit-first basis, and/or the multiplication unit computes the product on a most-significant-digit-first basis. According to one particular embodiment, the architecture of the apparatus comprises an exponent selection unit, an operation selection unit, a reciprocal look-up table unit, a high-radix logarithm unit, an LRCF multiplier, a conversion unit, and a high-radix exponential unit.
In an alternative embodiment, the architecture of the apparatus comprises a word-length barrel shifter unit, a high-radix reciprocal unit, a high-radix logarithm unit, a high-radix multiplier, a conversion unit, and a high-radix exponential unit.
According to one embodiment, the computing of Z-th powering or Z-th roots in a hardware processor comprises: (a) setting a first memory element of the processor to a number X whose Z-th powering or Z-th root is to be computed; (b) setting a second memory element of the processor to a fixed-point number or a floating-point number Z that indicates the powering or root exponent; (c) setting a third memory element of the processor to the number of significant bits of the number X and of the resulting computation; (d) setting a fourth memory element of the processor to the operation being performed, Z-th powering or Z-th root; (e) setting a fifth memory element to the former exponent of Z; (f) computing the reciprocal of the number Z resulting in a number A; (g) computing the logarithm base 2 of the number X resulting in a number B; (h) shifting the number B, in case Z is a floating-point number, resulting in a number B′; (i) computing the product of the numbers A and B or B′ resulting in a number C; (j) separating the integer and fractional parts of the number C; and (k) computing the exponential of the number C. In particular embodiments, the computing of the logarithm and the product are overlapped, the computing of the product and the computing of the exponential are overlapped, the number X is represented in single- or double-precision binary floating-point form according to the IEEE-754 standard, the number Z is represented in binary fixed-point form, and/or the processor is chosen from the group consisting of an integrated circuit, an FPGA device, a microprocessor, a microcontroller, and a general-purpose computer system.
According to one embodiment, the function to be computed is $X^{Y}$ or $X^{1/Y}$, X and Y being floating-point numbers, $X = (-1)^{s_x} \times M_x \times 2^{E_x}$ and $Y = (-1)^{s_y} \times M_y \times 2^{E_y}$.
In order to use the same multiplier for both operations, $1/M_y \in (0.5, 1]$ is normalized into [1, 2); then
As for the fixed-exponent case, to guarantee the convergence of the algorithm, the integer and fractional parts are extracted serially,
$|X|^{Z} = M_f \times 2^{E_f}$
being Z=Y or Z=1/Y and
for powering and root extraction, respectively.
The sequence of operations is: (1) reciprocal 1/My for root extraction, (2) evaluation of L=log2|X|, (3) shifting of the result of the logarithm, L×2E
−(nE
In the case of root extraction, the practical range of Ey is limited to
−(n+nE
Consequently, $-69 \le E_y \le 61$ ($-62 < E_y \le 70$) and $-37 \le E_y \le 29$ ($-30 \le E_y \le 38$) for powering (root extraction) in double-precision and single-precision floating-point representation, respectively.
The computation of the powering and the generic root in the unified architecture requires shifting $L = E_x + \log_2 M_x$ by $E_y$ in the case of powering, or by $-(E_y + 1)$ in the case of root extraction. In both cases, the shift amount can be positive or negative.
To simplify the presentation of the shifting algorithm, we consider a shift by $E_z$, with $E_z = E_y$ for powering and $E_z = -(E_y + 1)$ for root extraction.
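The role of the shift amount $E_z$ can be illustrated with a software sketch of the floating-point-exponent case. It is a sketch under stated assumptions: the exponent Y is taken to be positive, the sign of X is ignored, ordinary double-precision arithmetic replaces the digit-serial units, the integer part is taken with a floor, and all function names are hypothetical.

```python
import math

def decompose(v):
    """Return (M, E) with |v| = M * 2**E and M in [1, 2)."""
    m, e = math.frexp(abs(v))      # |v| = m * 2**e with m in [0.5, 1)
    return 2.0 * m, e - 1

def general_power(x, y, root=False):
    """Return |x|**y (root=False) or |x|**(1/y) (root=True), for y > 0."""
    m_x, e_x = decompose(x)
    m_y, e_y = decompose(y)
    log_x = e_x + math.log2(m_x)   # L = E_x + log2(M_x) = log2|X|
    if root:
        factor = 2.0 / m_y         # reciprocal 1/M_y scaled by 2 ...
        e_z = -(e_y + 1)           # ... compensated by the extra -1 in E_z
    else:
        factor = m_y
        e_z = e_y
    shifted = math.ldexp(log_x, e_z)   # shift of the logarithm: L * 2**E_z
    t = factor * shifted               # same multiplication for both operations
    e_f = math.floor(t)                # integer part (floor: assumption)
    m_f = 2.0 ** (t - e_f)             # exponential of the fractional part
    return math.ldexp(m_f, e_f)        # M_f * 2**E_f
```

Because the reciprocal significand is doubled for root extraction, the multiplier sees an operand of similar magnitude in both modes and only the shift amount differs, which matches the two definitions of $E_z$ given above.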
The digits of the logarithm are computed serially, most-significant digit first, and the digits of the integer and fractional parts are obtained in parallel, as shown in
The $E_z$-bit left or right shift is implemented as a right shift: as the leading zeros/ones are not computed, the first nonzero digits of the integer and fractional parts of L are obtained simultaneously in cycle 2; this is equivalent to prealigning L by placing it KE
The shift is split into two parts: (1) a right shift of $(K_{E_x} + 1) - \lfloor E_z / b \rfloor$ or $(K + \gamma + 1) - \lfloor E_z / b \rfloor$ radix-r digits and (2) a binary right shift of $E_z \bmod b$ bits. The digit-by-digit shift is carried out in a displacement register with $N_s$ radix b digits (FIG. 4(C)), where $N_s$ is roughly equal to $N_l$. All the integer digits $I_j$ enter at the same position of the register but in consecutive cycles. The same holds for the fractional digits $L_j$. On the other hand, digit $L_j$ enters $(\gamma - K_{E_x}) + K + 1$ positions to the right of digit $I_j$. The digits are shifted out to the left, one digit every cycle.
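The arithmetic behind the two-part split can be sketched as follows. The sketch only shows how a signed shift amount decomposes into a whole number of b-bit digits plus a remainder of fewer than b bits; it applies both parts as plain scalings and does not model the prealignment offset or the displacement register described above. The function names are hypothetical.

```python
import math

def split_shift(e_z, b):
    """Split a signed shift E_z into (digit_shift, bit_shift).

    digit_shift = floor(E_z / b)  whole radix-2**b digits (may be negative)
    bit_shift   = E_z mod b       remaining bits, always in [0, b)
    """
    digit_shift, bit_shift = divmod(e_z, b)   # floor division, non-negative remainder
    return digit_shift, bit_shift

def apply_split_shift(value, e_z, b):
    """Scale value by 2**E_z in two steps: whole digits first, then single bits."""
    digit_shift, bit_shift = split_shift(e_z, b)
    by_digits = math.ldexp(value, b * digit_shift)
    return math.ldexp(by_digits, bit_shift)
```

For a negative amount such as `e_z = -13` with `b = 4`, the split is `(-4, 3)`: four whole digits in one direction and a 3-bit correction, so the remainder stays in the range 0 to b − 1 regardless of the sign of the shift.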
The position where the Ij digits input the register is determined in terms KE
To provide faithfully rounded powering and root extraction, the rounded result must be within 1 ulp of the exact result. Assuming rounding to the nearest even, the required precision and minimum latency values for each intermediate operation and the latency for the complete operation are shown in the Table of
While particular embodiments have been described, it is understood that, after learning the teachings contained in this disclosure, modifications and generalizations will be apparent to those skilled in the art without departing from the spirit of the disclosed embodiments. It is noted that the disclosed embodiments and examples have been provided merely for the purpose of explanation and are in no way to be construed as limiting. While the methods, systems, apparatuses have been described with reference to various embodiments, it is understood that the words which have been used herein are words of description and illustration, rather than words of limitation. Further, although the system has been described herein with reference to particular means, materials and embodiments, the actual embodiments are not intended to be limited to the particulars disclosed herein; rather, the system extends to all functionally equivalent structures, methods and uses, such as are within the scope of the appended claims. Those skilled in the art, having the benefit of the teachings of this specification, may effect numerous modifications thereto and changes may be made without departing from the scope and spirit of the disclosed embodiments in its aspects.
This application claims the benefit of U.S. Provisional Application No. 61/683,662 filed on 2012-08-15 by the present inventors, which is incorporated herein by reference.
Number | Date | Country
---|---|---
61683662 | Aug 2012 | US