The present invention pertains to digital data processing, and more particularly to high-speed scalar and vector unsigned binary division. The invention has application (by way of non-limiting example) in real-time software applications, scientific programming, sensor array processing, graphics and image processing, signal processing, and other highly compute-intensive and performance critical activities for a variety of applications.
Division, of course, is a fundamental operation on any computer, though design choices that are reasonable for general purpose division are unsuitable for highly compute-intensive applications, e.g., certain real-time software and/or scientific applications, sensor array processing, graphics and image processing, and signal processing. The processing needed for real-time manipulation and interpretation of medical imaging, by way of example, so overloads the computational capacity of conventional systems processors that required performance parameters sometimes cannot be met.
Vector processors are a class of computational devices that permit operations, such as multiplication and addition, to be simultaneously executed on multiple items of data. The complexity of division is such that typical vector processors do not provide a divide operation. Rather, programmers are expected to include in their source code or libraries algorithms that approximate division, e.g., by Newton-Raphson techniques or otherwise.
Though division can be accomplished at acceptable performance levels on both conventional (scalar) and vector processors, there remains a need for improved digital data processor methods and apparatus for scalar and vector binary division. Such is an object of this invention.
Another object of this invention is to provide methods and apparatus for binary division that operate on existing processors, and that can be ported to future architectures.
A related object is to provide such methods as can be readily implemented at low cost and without consumption of undue processor or memory resources.
The foregoing are among the objects attained by the invention which provides, in one aspect, an improved method of operating a digital data processor to perform binary division. The improvement includes estimating reciprocals of at least selected divisors based on values accessed from a look-up table. A related aspect provides such methods wherein the divisors are used as indices to the look-up table. Further related aspects provide such methods wherein the divisors are bitwise shifted, e.g., right-shifted, in order to form such indices.
Further aspects of the invention provide methods as described above including the step of estimating a reciprocal of a divisor that has a value within a first range of values based on a value stored in a first look-up table at an index defined by the divisor. A reciprocal of a divisor within a second range of values (e.g., that may or may not overlap the first range of values) is estimated as a function of a value stored in a second look-up table at an index that is a bitwise-shifted function of the divisor.
Related aspects of the invention provide such methods wherein a divisor is compared with a threshold value to determine whether to estimate the reciprocal as a function of a value stored in the first table or the second table.
Further related aspects provide such methods wherein the first table comprises estimates for each respective integer divisor in the first range, while the second table comprises estimates for respective groups of integer divisors in the second range. Each of the aforementioned groups, according to a related aspect of the invention, has $2^x$ divisors. The step of estimating reciprocals for divisors in the second range, correspondingly, includes right-shifting (or otherwise bitwise shifting) each divisor x bits prior to using it as an index into the second table.
Still further aspects of the invention provide methods as described above including generating a first quotient estimate as functions of reciprocal estimates obtained from the look-up table(s) and of the original dividends. Further quotient estimates are generated, according to related aspects of the invention, by incrementing the initial quotient estimates, e.g., by one or two, depending on the size of any error in the initial reciprocal estimates.
Related aspects of the invention provide methods utilizing steps like those described above of operating a vector processing digital data processor to estimate a plurality of quotients by integer binary division, e.g., with performance under one clock cycle per dividend/divisor pair.
These and other aspects of the invention are evident in the drawings and in the detailed description that follows.
A more complete understanding of the invention may be attained by reference to the drawings, in which:
Illustrated CPU 6 represents a microprocessor, coprocessor, field programmable gate array (FPGA), application specific integrated circuit (ASIC) or other general- or specific-purpose processing unit (or combination thereof), programmable or otherwise, e.g., of the type conventionally used in the aforementioned digital data processor devices. While it can otherwise be configured and operated in the conventional manner, e.g., for image analysis, signal analysis or other functions, in the illustrated embodiment CPU 6 is programmed or otherwise operated in accord with the teachings hereof to perform binary division.
Illustrated memory 4 represents any register, memory (e.g., RAM, DRAM, ROM, EEPROM), storage device, or combination thereof, of the type conventionally used in the aforementioned devices. In the drawing, the memory 4 stores a dividend 20 and divisor 22, each of which is an eight-bit binary number, e.g., an unsigned character or byte. Those skilled in the art will, of course, appreciate that the teachings hereof can be applied to division of values with greater or less bit length and, indeed, of dividends and divisors of dissimilar length (e.g., by zero-padding or otherwise). The memory 4 additionally stores a look-up table 28 of reciprocal estimates and, ultimately, a quotient 22 generated by CPU 6 in the manner discussed herein.
By way of overview, according to one practice of the invention, illustrated CPU 6 determines the quotient 22 in three phases. In phase I, the CPU determines an initial quotient estimate and more particularly, for example, a lower boundary thereof, by accessing the divisor's reciprocal estimate in look-up table 28 and multiplying the dividend by that estimate. In phase II, it determines the error 10, if any, in the initial quotient estimate. And, in phase III, the CPU adjusts the quotient estimate to reduce that error 10.
In phase I, the CPU 6 compares the divisor b to a threshold value between zero and $2^n-1$. Here, the threshold is 32, though in other embodiments it may take on other values. If the divisor is less than the threshold, the CPU 6 obtains the b-th reciprocal estimate from a so-called “small” portion of the look-up table 28; see step 58. Otherwise, in step 64, the CPU obtains the b_shift-th reciprocal estimate within a so-called “big” portion of look-up table 28, where b_shift is equal to b bitwise-shifted (here, to the right) by x bits (here, three bits) to eliminate the x least significant bits; see step 60. The CPU 6, in step 66, multiplies the dividend by the reciprocal estimate and right-shifts the result by the length of the inputs (here, n=8 bits), eliminating the least significant bits of the product and returning a quotient estimate with the same length as the inputs.
In the preceding paragraph and, more generally, throughout this discussion, right-shifting is employed for the purpose of eliminating one or more least significant bits (LSBs) of a value. Those skilled in the art will appreciate that the direction of such shifting is platform-dependent and that, in other embodiments (namely, those implemented on platforms with the LSB on the left), left-shifting is employed for that purpose. With this understanding and for the sake of simplicity, the applicants refer to bitwise shifting that eliminates LSBs as “right” shifting (regardless of whether the actual direction is right or left).
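By way of non-limiting illustration, the multiply-and-shift of step 66 might be sketched in C as follows. This is a minimal sketch assuming n=8 and a reciprocal estimate that has already been fetched from the table; the function name `phase1_estimate` is hypothetical and does not appear in the original listing.

```c
#include <stdint.h>

/* Hypothetical helper: Phase I arithmetic for 8-bit operands (n = 8).
 * `recip` is the 8-bit reciprocal estimate already fetched from the
 * look-up table for the divisor in question. */
uint8_t phase1_estimate(uint8_t a, uint8_t recip)
{
    /* An 8-bit by 8-bit multiply yields a product of at most 16 bits. */
    uint16_t product = (uint16_t)((unsigned)a * (unsigned)recip);

    /* Right-shift by n = 8 bits, discarding the low byte and keeping
     * the high byte as the initial quotient estimate. */
    return (uint8_t)(product >> 8);
}
```

For instance, a dividend of 100 and a reciprocal estimate of 85 give a product of 8500, which right-shifted eight bits is 33.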
In Phase II of the illustrated example, the CPU 6 determines an error of the initial quotient estimate. CPU 6, in step 68, multiplies the divisor by the quotient estimate to determine a dividend estimate. The error is determined in step 70 as the difference of the dividend and its estimate. Those skilled in the art will appreciate other ways to determine the error, all within the invention.
Phase III includes steps 74–78, in which the CPU 6 corrects the quotient based on the size of the error. In the illustrated example, the CPU 6 increments the quotient estimate by one (step 72) if the error is greater than or equal to the divisor. In step 76, the CPU 6 increments the quotient again if the error right-shifted one bit is greater than or equal to the divisor. In step 80, the CPU returns the final quotient estimate in memory 4.
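Phases II and III, as described above, might be sketched in C as follows. This is a minimal sketch under the assumption that the Phase I estimate is a lower bound that falls short of the true quotient by no more than two; the function name `correct_quotient` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: Phases II and III for 8-bit operands.
 * q0 is the Phase I estimate, assumed to be a lower bound on a / b
 * that is low by at most two. */
uint8_t correct_quotient(uint8_t a, uint8_t b, uint8_t q0)
{
    assert(b != 0);

    /* Phase II: dividend estimate and its error. */
    unsigned a_est = (unsigned)b * q0;
    assert(a_est <= a);               /* lower-bound assumption */
    unsigned error = a - a_est;

    /* Phase III: up to two increments. */
    uint8_t q = q0;
    if (error >= b)
        q++;                          /* error of at least one divisor */
    if ((error >> 1) >= b)
        q++;                          /* same as error >= 2 * b */
    return q;
}
```

With dividend 255, divisor 32 and an initial estimate of 5, the error is 95; both comparisons succeed and the corrected quotient is 7.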
Although described above with regard to certain steps and phases, and connections therebetween, it will be appreciated by those skilled in the art that other modifications and alterations thereto are within the scope of the invention. For example, the general structure and method of the illustrated examples can manifest in other contemplated embodiments using different steps and phases, and organization thereof, without departing from the invention.
Look-up Table Design
Referring back to
Preferred embodiments use at least a partially “shared representation,” with at least some possible divisors sharing a common reciprocal estimate. This has the advantage of reducing the number of values in and, therefore, the size of the table 28. It can also speed up table access (e.g., permitting storage of the entire table in RAM or other fast memory) and, therefore, the overall division operation.
By way of example, the look-up table 28 can store reciprocal estimates based on one-to-one representations for smaller-valued divisors (e.g., those with values below a threshold) and based on shared representations for larger-valued divisors (e.g., those with values above that threshold). The threshold value separating these two classes of divisors is selected to strike a balance between table size and error, which are inversely related.
Referring to
The small table includes a one-to-one representation of reciprocal estimates for a first range of divisors, here, divisors between 1 and a threshold value, here 32. Thus, the table stores a reciprocal estimate of 255 for the divisor 1, 127 for the divisor 2, 85 for the divisor 3, and so forth, as shown in
$$\hat{b}_m^{-1} = 1/b_m$$
where,
The values $\hat{b}_m^{-1}$ are converted into and stored as binary integers (e.g., using appropriate scaling) so as to represent values between 0 and 255. No reciprocal is provided for divisor b=0, though a value of “undefined” is used in some embodiments.
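By way of non-limiting illustration, the small table might be pre-calculated as sketched below. The scaling used here, floor(255/b), is an assumption: it reproduces the values 255, 127 and 85 given above for divisors 1, 2 and 3, but the text does not state the scaling explicitly. The name `build_small_table` and the zero placeholder at index 0 are likewise hypothetical.

```c
#include <stdint.h>

#define THRESHOLD 32   /* threshold of the illustrated embodiment */

/* Fill small[1..THRESHOLD-1] with scaled reciprocal estimates.
 * Assumption: the scaling is floor(255 / b), which matches the values
 * 255, 127 and 85 stated for divisors 1, 2 and 3. */
void build_small_table(uint8_t small[THRESHOLD])
{
    small[0] = 0;   /* placeholder only: the reciprocal of 0 is undefined */
    for (unsigned b = 1; b < THRESHOLD; b++)
        small[b] = (uint8_t)(255u / b);
}
```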
The big table includes a shared representation of reciprocal estimates for a second range of divisors, here, divisors from the threshold value 32 to the maximum possible divisor (here, 255, given divisors represented by n=8 bits). In the illustrated embodiment, a common reciprocal estimate is provided for each successive group (or span) of possible divisors in the second range, with each span covering $2^x$ divisors. The value of x can be, for example, three, in which case the big table stores a first reciprocal estimate for the first eight (i.e., $2^3$) divisors in the second range; a second reciprocal estimate for the next eight divisors in the second range; a third reciprocal estimate for the third eight divisors (again, $2^3$) in the second range; and so forth.
In the illustrated embodiment, the big table stores reciprocal estimates having the values indicated in
In the illustrated embodiment, each such estimate $\hat{b}_{m(\mathrm{span})}^{-1}$ is generated, e.g., prior to run-time or, in any event, prior to utilization of the binary division methodology described herein, in accord with the relations
$$\hat{b}_{m(\mathrm{span})}^{-1} = 1/b_{m(\mathrm{high})}$$
where,
The values $\hat{b}_{m(\mathrm{span})}^{-1}$ are converted into and stored as binary integers (e.g., using appropriate scaling) as above.
As an alternative to defining $\hat{b}_{m(\mathrm{span})}^{-1}$ as a function of the largest divisor ($b_{m(\mathrm{high})}$) for each respective span, the smallest divisor ($b_{m(\mathrm{low})}$) may be used instead. Alternatively, an average of the largest and smallest divisors in the group, or some other function of those (or other) values in the group, may be used. Those skilled in the art will appreciate that defining $\hat{b}_{m(\mathrm{span})}^{-1}$ in accord with such alternatives may necessitate corresponding modification of the error adjustment in Phase III (e.g., by use of decrementing instead of incrementing, and so forth).
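The big table might be pre-calculated along the following lines. This is a minimal sketch assuming n=8, x=3 and a threshold of 32, with each span's estimate taken from its largest divisor. The floor(255/b) scaling is an assumption carried over from the small-table values quoted above, and the name `build_big_table` and the zero fill of unused leading elements are hypothetical.

```c
#include <stdint.h>

#define N_BITS    8u                          /* operand length n */
#define X_BITS    3u                          /* spans of 2^3 = 8 divisors */
#define THRESHOLD 32u                         /* first divisor served by the big table */
#define BIG_LEN   (1u << (N_BITS - X_BITS))   /* 32 elements */

/* Fill the big table, indexed by (divisor >> X_BITS).
 * Assumption: scaling is floor(255 / b_high), where b_high is the
 * largest divisor in the span. */
void build_big_table(uint8_t big[BIG_LEN])
{
    for (unsigned i = 0; i < BIG_LEN; i++) {
        /* Largest divisor whose right-shifted value equals i. */
        unsigned b_high = (i << X_BITS) | ((1u << X_BITS) - 1u);

        /* Elements below THRESHOLD >> X_BITS are never read. */
        if (i < (THRESHOLD >> X_BITS))
            big[i] = 0;
        else
            big[i] = (uint8_t)(255u / b_high);
    }
}
```

Under these assumptions, divisors 32 through 39 share the estimate floor(255/39) = 6, and all divisors from 128 upward share the estimate 1.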
Those skilled in the art will recognize that the spans are not limited to eight divisors, but rather, can range from two to the entirety of divisors beyond the threshold (i.e., integer x between 1 and n). In this regard, it will be appreciated that a shared representation with a smaller span yields more accurate reciprocal estimates at the cost of increasing the length and storage requirements of the big table.
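The trade-off between span size and table length can be made concrete with a short calculation. The helpers below are hypothetical and assume the big table is indexed directly by the right-shifted divisor, without an offset.

```c
#include <assert.h>

/* Number of elements in a big table indexed by an n-bit divisor
 * right-shifted x bits. */
unsigned big_table_len(unsigned n, unsigned x)
{
    assert(x <= n);
    return 1u << (n - x);
}

/* Number of leading elements never referenced, because divisors
 * below the threshold are served by the small table. */
unsigned big_table_unused(unsigned threshold, unsigned x)
{
    return threshold >> x;
}
```

For n=8 and x=3 this gives 32 elements, of which the first four are unused for a threshold of 32; reducing x to 1 quadruples the length to 128 elements.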
Accessing the Look-Up Table
Referring back to
In the illustrated embodiment, the CPU 6 references reciprocal estimates in the big table for divisors beyond the threshold using the divisor right-shifted x bits (here, three bits) in order to obtain the reciprocal estimate for that divisor so long, of course, as it is beyond the threshold. This is indicated in the drawing by angled arrows running from divisors 32–255 to table values $\hat{b}_{32}^{-1}$ and $\hat{b}_{63}^{-1}$. In this case, leading elements of the big table (e.g., elements with indices 0 through threshold/$2^x$ − 1) are not used (e.g., since threshold/$2^x$ is the first index generated by such right-shifting). Of course, more or fewer elements can be unused even where right-shifting is employed, e.g., by adding or subtracting an offset to the right-shifted value.
Source code in the C programming language for scalar binary division according to one embodiment of the invention is provided below. Consistent with the description above, the source code provides for processing dividends and divisors, a and b, of eight-bit length and returning quotient estimates of that same length. It assumes a threshold of 32 and spans of eight (i.e., x=3). It will be appreciated that other parameters (e.g., for dividend, divisor and quotient length, threshold, span size, and so forth), data types, variables and function calls, and/or programming languages may be used instead or in addition, consistent with the teachings hereof.
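The table access just described might be sketched in C as follows, assuming a threshold of 32, x=3, and two 32-element tables. The function name `lookup_reciprocal` and the table parameter names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define THRESHOLD 32   /* divisors below this use the small table */
#define X_BITS    3    /* spans of 2^3 = 8 divisors in the big table */

/* Select the reciprocal estimate for divisor b. */
uint8_t lookup_reciprocal(uint8_t b,
                          const uint8_t small[32],
                          const uint8_t big[32])
{
    assert(b != 0);              /* no reciprocal is provided for b = 0 */

    if (b < THRESHOLD)
        return small[b];         /* one-to-one: index is the divisor itself */

    return big[b >> X_BITS];     /* shared: index drops the 3 LSBs (4..31) */
}
```

Divisors 32 through 39 all map to index 4 of the big table, which is the first index that right-shifting can generate for a divisor at or beyond the threshold.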
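The following is a minimal illustrative sketch of such a scalar routine and should not be taken as the applicants' listing. It assumes a threshold of 32, spans of eight, and table entries scaled as floor(255/b) (an assumption consistent with the values 255, 127 and 85 quoted above for divisors 1, 2 and 3). All identifiers (`init_tables`, `udiv8`, `count_mismatches`, `small_tab`, `big_tab`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define THRESHOLD 32u  /* divisors below this use the small table */
#define X_BITS    3u   /* spans of 2^3 = 8 divisors in the big table */

static uint8_t small_tab[32];  /* one entry per divisor 1..31; [0] unused */
static uint8_t big_tab[32];    /* one entry per span; [0..3] unused */

/* Assumed scaling: floor(255 / b). Big-table entries use the largest
 * divisor of each span, so every estimate is a lower bound. */
void init_tables(void)
{
    for (unsigned b = 1; b < THRESHOLD; b++)
        small_tab[b] = (uint8_t)(255u / b);
    for (unsigned i = THRESHOLD >> X_BITS; i < 32u; i++)
        big_tab[i] = (uint8_t)(255u / ((i << X_BITS) | 7u));
}

/* Unsigned 8-bit division in three phases; b must be nonzero. */
uint8_t udiv8(uint8_t a, uint8_t b)
{
    assert(b != 0);

    /* Phase I: reciprocal look-up, multiply, keep the high byte. */
    uint8_t r = (b < THRESHOLD) ? small_tab[b] : big_tab[b >> X_BITS];
    uint8_t q = (uint8_t)(((unsigned)a * r) >> 8);

    /* Phase II: error between the dividend and its estimate. */
    unsigned err = a - (unsigned)b * q;

    /* Phase III: increment once or twice depending on the error. */
    if (err >= b)
        q++;
    if ((err >> 1) >= b)
        q++;
    return q;
}

/* Usage example: compare against the C division operator for every
 * dividend 0..255 and every divisor 1..255; returns the mismatch count. */
unsigned count_mismatches(void)
{
    unsigned bad = 0;
    init_tables();
    for (unsigned a = 0; a < 256u; a++)
        for (unsigned b = 1; b < 256u; b++)
            if (udiv8((uint8_t)a, (uint8_t)b) != a / b)
                bad++;
    return bad;
}
```

With these particular table values the initial estimate never exceeds the true quotient and falls short of it by at most two, so the two conditional increments suffice. Other scalings or span definitions may need a different correction step.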
Binary Division in a Vector Architecture
Further embodiments of the invention provide for application of the foregoing to provide binary division in a vector-processing architecture using vector operations.
Referring back to
Broadly, according to these embodiments, the CPU divides a vector dividend A by a vector divisor B, resulting in a vector quotient Q. As above, although these vectors can be maintained in any form of memory 4 including conventional RAM, DRAM, ROM, EEPROM, in a preferred embodiment register-type memory is used. Of course, the embodiment is not limited to 16-element vectors (nor to elements of 8 bits each) but, rather, can be applied to vectors and elements of other sizes consistent with the teachings hereof.
These small and big tables can be pre-calculated as discussed above and, although these tables can reside in any type of memory 4, each is preferably stored in vectors associated with CPU 6. In the illustrated embodiment, the tables each contain 32 elements and occupy two 16-element vectors apiece.
Generally, as above, in Phase I, the CPU 6 concurrently compares each element of B to a threshold (e.g., between zero and $2^n-1$) and assigns it big or small status. It then retrieves 8-bit reciprocal approximations from both tables for the respective elements of B, combining the appropriate approximations (using a mask that is based on the big/small status) into a single reciprocal estimate vector. The CPU multiplies this by the dividend vector A, resulting in a vector having sixteen 16-bit products. For each 16-bit product, the most significant 8 bits are extracted by the CPU 6 into a quotient estimate vector Q, having sixteen 8-bit elements that serve as first estimates of the respective quotients.
In phase II, the CPU 6 multiplies Q by B, resulting in a vector A_estimate with sixteen dividend estimates. The CPU then subtracts A_estimate from the dividend vector A to produce a corresponding error vector of sixteen elements.
In phase III, the CPU compares the error vector to B, and increments each 8-bit element of Q if the corresponding element of error is greater than or equal to that of B. The elements in error are each right-shifted one bit by the CPU, which compares each element of the shifted error to the corresponding element in B. Again, for those comparisons being greater than or equal, the CPU increments the corresponding 8-bit element of Q. Q is then the final vector of quotient estimates.
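The vector form of Phase I can be emulated in portable C, one loop iteration per vector element, as sketched below. This is not vector-instruction code: the loop merely mirrors what each lane would compute. Masking the small-table index to five bits is an assumption made so that both look-ups stay in range for every lane; the name `vec_phase1` is hypothetical.

```c
#include <stdint.h>

#define LANES     16   /* elements per vector in the illustrated embodiment */
#define THRESHOLD 32
#define X_BITS    3

/* Scalar emulation of vector Phase I: every lane reads BOTH tables and
 * a per-lane mask selects which estimate is kept. */
void vec_phase1(const uint8_t a[LANES], const uint8_t b[LANES],
                const uint8_t small[32], const uint8_t big[32],
                uint8_t q[LANES])
{
    for (int i = 0; i < LANES; i++) {
        /* 0xFF for "big" lanes, 0x00 for "small" lanes. */
        uint8_t is_big = (uint8_t)-(b[i] >= THRESHOLD);

        /* Assumption: the small index is reduced to 5 bits so that the
         * look-up is in range even for lanes that will discard it. */
        uint8_t from_small = small[b[i] & 31];
        uint8_t from_big   = big[b[i] >> X_BITS];

        /* Mask-based select, with no branch on the lane's status. */
        uint8_t r = (uint8_t)((from_big & is_big) |
                              (from_small & (uint8_t)~is_big));

        /* 16-bit product; the most significant 8 bits are the estimate. */
        uint16_t product = (uint16_t)((unsigned)a[i] * r);
        q[i] = (uint8_t)(product >> 8);
    }
}
```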
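Phases II and III of the vector form can likewise be emulated lane by lane in portable C. The sketch below assumes that each incoming element of Q is a lower-bound estimate, so that its product with the divisor fits in eight bits. Lanes with a zero divisor yield unspecified results, and the name `vec_correct` is hypothetical.

```c
#include <stdint.h>

#define LANES 16   /* elements per vector in the illustrated embodiment */

/* Scalar emulation of vector Phases II and III. On entry q[] holds the
 * Phase I estimates; on return it holds the corrected quotients. */
void vec_correct(const uint8_t a[LANES], const uint8_t b[LANES],
                 uint8_t q[LANES])
{
    for (int i = 0; i < LANES; i++) {
        /* Phase II: dividend estimate and error, each 8 bits wide. */
        uint8_t a_est = (uint8_t)(q[i] * b[i]);
        uint8_t err   = (uint8_t)(a[i] - a_est);

        /* Phase III: comparison results as all-ones / all-zeros masks. */
        uint8_t m1 = (uint8_t)-(err >= b[i]);
        uint8_t m2 = (uint8_t)-((err >> 1) >= b[i]);

        /* Each set mask contributes one increment, with no branching. */
        q[i] = (uint8_t)(q[i] + (m1 & 1) + (m2 & 1));
    }
}
```

Expressing the comparisons as masks rather than branches keeps every lane on the same instruction path, which is the property a vector unit requires.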
A more detailed understanding of vector embodiments of the invention may be attained by reference to the C programming language source code provided below. Parameters passed to the function are three pointers to arrays of sixteen dividends, sixteen divisors, and sixteen quotients, respectively. In the code, which operates on (long) vectors of length N, two sets of vector instructions are used in a loop that processes 32 operands. The loop also includes two scalar instructions, loop count and pointer update. All loop instructions are ordered for parallelism of execution (e.g., two instructions per clock cycle) and overall performance equal to or exceeding sixteen quotients in 15½ clock cycles. Macros at the outset of the code define in C instructions used in the assembly language implementation that follows.
Provided below is an assembly language source code suitable for compilation and execution on an aforementioned PowerPC processor and corresponding to the C programming language source code above.
Described herein are methods and apparatus meeting the above-mentioned objects. It will be appreciated that the embodiments described herein are merely examples of the invention and that other embodiments, incorporating modifications to those described herein, fall within the scope of the invention. Therefore, in view of the above, what we claim is:
This application claims the benefit of priority of U.S. Provisional patent application Ser. No. 60/303,559, entitled FAST UNSIGNED CHAR DIVIDE METHODS AND APPARATUS, filed Jul. 6, 2001, the teachings of which are incorporated herein by reference.