Disclosed aspects relate to high performance division and root computation units. More specifically, exemplary aspects relate to improvements in the speed and power consumption in the access of lookup tables used in division and/or root computation in processors.
Computer systems or processors may include an arithmetic and logic unit (ALU) which performs arithmetic and logical operations on data. Some ALUs may include a floating-point unit that may be configured to perform division and/or root calculations (e.g., square root). Division and square root operations may be implemented in processors using similar algorithms which may operate in an iterative manner.
For example, a conventional algorithm used for performing division and/or square root calculations is known as a Sweeney, Robertson, and Tocher (SRT) algorithm. The SRT algorithm is iterative in nature. The iterations of the SRT algorithm may be implemented in a pipelined processor by performing one iteration per cycle, although it may also be possible to spread out each iteration over multiple clock cycles or pipeline stages. It is also possible to implement the SRT algorithm in a non-pipelined fashion, such as in an array divider. The SRT algorithm can produce one or more bits of the desired result (e.g., the quotient of a division or the result of a square root operation) per iteration. The “radix” of a particular division or square root algorithm is an indication of the number of bits produced or computed in each iteration. For example, a radix-4 algorithm computes 2 bits of quotient in every iteration, whereas a radix-16 algorithm computes 4 bits in every iteration, which doubles the speed, or halves the latency, in comparison to the radix-4 algorithm. However, increasing the radix of the algorithm leads to increased complexity and associated hardware and/or software costs of the implementation of the algorithm.
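The relationship between radix and iteration count can be illustrated with a short sketch. The helper names and the 24-bit quotient width used in the usage note are assumptions chosen for illustration and do not appear in this disclosure.

```python
import math


def bits_per_iteration(radix: int) -> int:
    """Quotient bits produced per iteration for a power-of-two radix (log2 of the radix)."""
    if radix < 2 or radix & (radix - 1):
        raise ValueError("radix must be a power of two that is at least 2")
    return radix.bit_length() - 1


def iterations_needed(quotient_bits: int, radix: int) -> int:
    """Iterations required to produce the requested number of quotient bits."""
    return math.ceil(quotient_bits / bits_per_iteration(radix))
```

Under these assumptions, a 24-bit quotient would take 12 iterations at radix 4 and 6 iterations at radix 16.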
Conventional implementations of the SRT algorithm involve a table lookup in each iteration. The table lookup is explained using a description of a conventional division process of dividing a dividend (or numerator) by a divisor (or denominator) to produce a result or quotient in one or more iterations. In the first iteration, the number of times the divisor goes into the dividend is determined. This number, also known as a multiple, forms one or more bits of the quotient (based on the radix). That multiple times the divisor is subtracted from the dividend to form a partial remainder. The operation then moves on to the next iteration where the dividend is replaced by the partial remainder. The steps related to determining the number of times the divisor goes into the partial remainder are repeated in order to obtain further bits of the quotient and the next partial remainder. This process is repeated until the partial remainder is zero, if the quotient has a finite representation in the chosen radix, or continues indefinitely otherwise. In practice, the division process terminates when a predetermined precision of the quotient is reached.
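The conventional digit-by-digit process described above may be sketched as follows. This is a minimal illustration that assumes integer operands with 0 <= dividend < divisor, so that the quotient is a fraction; the function name and the explicit scaling of the partial remainder by the radix in each iteration are assumptions of the sketch.

```python
def long_division_digits(dividend, divisor, radix=2, max_digits=16):
    """Return the quotient digits after the radix point and the final partial remainder.

    Stops early when the partial remainder reaches zero, otherwise after
    max_digits digits (the predetermined precision).
    """
    if not 0 <= dividend < divisor:
        raise ValueError("expected 0 <= dividend < divisor")
    remainder = dividend
    digits = []
    for _ in range(max_digits):
        if remainder == 0:
            break                                   # quotient terminates exactly
        scaled = remainder * radix                  # move to the next digit position
        multiple = scaled // divisor                # times the divisor goes in
        remainder = scaled - multiple * divisor     # next partial remainder
        digits.append(multiple)
    return digits, remainder
```

For instance, 1/8 in radix 2 terminates after three digits, whereas 1/3 in radix 4 repeats and is cut off at the requested precision.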
The SRT algorithm simplifies the above process by providing a mapping of the values of partial remainders to quotient values for various possible values of divisors. A lookup table or two dimensional array is provided for this mapping, where, for example, divisors are disposed on an x-axis (or row direction) and partial remainders are disposed on a y-axis (or column direction). Quotient values are provided for each intersection on the x-y plane or for each combination of divisor values and partial remainder values. In some implementations, fewer than all bits of the divisor and/or partial remainder values (e.g., a predetermined number of most significant bits (MSBs)) may be utilized in the mapping. It will be recognized that truncating the precision of the divisor and/or partial remainder values by using fewer bits may affect accuracy of the corresponding quotient values provided in the table. However, the size of the table, and correspondingly the lookup time, increases if a higher precision (i.e., a larger number of bits) of the divisor and/or partial remainder values is used.
Using the lookup table, in each iteration, the partial remainder (or a truncated version of the partial remainder) for that iteration is used to look up the quotient bits for the particular divisor (or a truncated version) of the division. Depending on various parameters such as the radix of the SRT algorithm, the number of bits of precision of the divisor and/or partial remainder values in the lookup table, etc., the time taken to access the lookup table, as well as the expense in terms of area/cost of implementing the lookup table, can be very high. Accessing the lookup table is in the critical path of processing each iteration.
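A two-index lookup in every iteration can be sketched with a deliberately small toy table. The table below is an assumption made for illustration only: it uses 4-bit integer divisors 8 through 15, exact (untruncated) shifted partial remainders 0 through 63, and non-redundant radix-4 digits 0 through 3, which is simpler than the redundant digit set of a true SRT table. Rows are indexed by the partial remainder and columns by the divisor.

```python
FIRST_DIVISOR = 8                      # toy divisors 8..15 (MSB always set)
DIVISORS = range(FIRST_DIVISOR, 16)
ROWS = 64                              # shifted partial remainders 0..63
MAX_DIGIT = 3                          # toy radix-4 digits 0..3 (not a true SRT digit set)


def build_table():
    """table[p][d - FIRST_DIVISOR] is the quotient digit for remainder p and divisor d."""
    return [[min(MAX_DIGIT, p // d) for d in DIVISORS] for p in range(ROWS)]


TABLE = build_table()


def lookup(shifted_remainder, divisor):
    """Two-dimensional lookup: both indices are needed on every access."""
    return TABLE[shifted_remainder][divisor - FIRST_DIVISOR]


def divide(dividend, divisor, iterations):
    """Toy radix-4 division of dividend/divisor, assuming 0 <= dividend < divisor."""
    remainder, digits = dividend, []
    for _ in range(iterations):
        shifted = remainder << 2               # radix 4: two bits per iteration
        q = lookup(shifted, divisor)           # table access in the critical path
        remainder = shifted - q * divisor
        digits.append(q)
    return digits, remainder
```

In this sketch the divisor index is recomputed on every access even though it never changes during a given division.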
The case of determining the root (e.g., square root) of a number (or radicand) using a corresponding SRT algorithm is similar, where an initial estimate of the root is used in the table lookup instead of the divisor. While the root operation is not described in greater detail here, it will be recognized that the corresponding SRT algorithm also involves a table lookup in each iteration, which affects the speed and power consumption of implementing the SRT algorithm for root computation in processors.
Accordingly, there is a need in the art for overcoming the aforementioned limitations in conventional implementations of the SRT algorithm for division and/or root computations.
Exemplary aspects of this disclosure pertain to systems and methods for division/root computation. A lookup table according to a Sweeney, Robertson, and Tocher (SRT) algorithm for a division/root computation is stored in a memory. Information related to a selected column corresponding to a divisor/root estimate is stored in a high-speed memory. Division/root computation is performed iteratively using the cached information to improve access times and reduce latency of accessing the entire lookup table on each iteration. In each iteration, a quotient/root is determined from the cached information based on a current partial remainder, and a next partial remainder is generated based on the quotient/root, the divisor/root estimate, and the current partial remainder. Implementations of the technology described herein are directed to mechanisms for quickly calculating floating-point divides and square roots in a processor.
For example, an exemplary aspect relates to a method of performing a division, the method comprising: selecting a column of a lookup table according to a Sweeney, Robertson, and Tocher (SRT) algorithm for the division, the selected column corresponding to a divisor of the division, and caching information related to the selected column in a high-speed memory. The method includes iteratively performing the division using the cached information, by determining a quotient from the cached information using a current partial remainder in each iteration, and generating a next partial remainder based on the quotient, the divisor, and the current partial remainder.
Another exemplary aspect relates to a method of performing a root computation, the method comprising: selecting a column of a lookup table according to a Sweeney, Robertson, and Tocher (SRT) algorithm for the root computation, the selected column corresponding to a root estimate of the root computation and caching information related to the selected column in a high-speed memory. The method includes iteratively performing the root computation using the cached information, by determining a root from the cached information using a current partial remainder in each iteration, and generating a next partial remainder based on the root, the root estimate, and the current partial remainder.
Yet another exemplary aspect relates to a processor comprising a memory configured to store a lookup table according to a Sweeney, Robertson, and Tocher (SRT) algorithm for a division/root computation and a high-speed memory configured to cache information related to a selected column of the lookup table, the selected column corresponding to a divisor/root estimate. A division/root computation unit is configured to iteratively perform division/root computation using the cached information, comprising a division/root lookup logic configured to determine a quotient/root from the cached information based on a current partial remainder in each iteration, and generate a next partial remainder based on the quotient/root, the divisor/root estimate, and the current partial remainder.
Another exemplary aspect relates to a processing system comprising means for storing a lookup table according to a Sweeney, Robertson, and Tocher (SRT) algorithm for a division/root computation and caching means for caching information related to a selected column of the lookup table, the selected column corresponding to a divisor/root estimate. The processing system includes means for iteratively performing division/root computation using the cached information based on means for determining a quotient/root from the cached information using a current partial remainder in each iteration, and means for generating a next partial remainder using the quotient/root, the divisor/root estimate, and the current partial remainder.
The accompanying drawings are presented to aid in the description of the technology described herein and are provided solely for illustration of the implementations and not for limitation of the implementations.
Aspects of the invention are disclosed in the following description and related drawings directed to specific aspects of the invention. Alternate aspects may be devised without departing from the scope of the invention. Additionally, well-known elements of the invention will not be described in detail or will be omitted so as not to obscure the relevant details of the invention.
The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects. Likewise, the term “aspects of the invention” does not require that all aspects of the invention include the discussed feature, advantage or mode of operation.
The terminology used herein is for the purpose of describing particular aspects only and is not intended to be limiting of aspects of the invention. As used herein, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises”, “comprising”, “includes” and/or “including”, when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Further, many aspects are described in terms of sequences of actions to be performed by, for example, elements of a computing device. It will be recognized that various actions described herein can be performed by specific circuits (e.g., application specific integrated circuits (ASICs)), by program instructions being executed by one or more processors, or by a combination of both. Additionally, these sequences of actions described herein can be considered to be embodied entirely within any form of computer readable storage medium having stored therein a corresponding set of computer instructions that upon execution would cause an associated processor to perform the functionality described herein. Thus, the various aspects of the invention may be embodied in a number of different forms, all of which have been contemplated to be within the scope of the claimed subject matter. In addition, for each of the aspects described herein, the corresponding form of any such aspects may be described herein as, for example, “logic configured to” perform the described action.
Exemplary aspects of this disclosure are directed to high performance implementations of division and root computation (e.g., square root, cube root, etc.). In some aspects, an exemplary division and square root unit is configured to speed up and simplify the complexity of conventional implementations of the SRT algorithm. A lookup table according to a Sweeney, Robertson, and Tocher (SRT) algorithm for a division/root computation is stored in a memory. The table lookup process in each iteration of the SRT algorithm may be simplified, based, for example on determining a subset of the lookup table comprising one or more table entries of the lookup table which will be accessed for a particular division or root computation implemented in an exemplary processor. In the case of division, the subset may include table entries of a selected column corresponding to the divisor of the particular division. It is recognized that the divisor will be common to each iteration of the SRT algorithm, and therefore, the selected column comprising various possible quotient values corresponding to the various possible partial remainder values for that particular divisor can be extracted from a comprehensive lookup table which has these values for other divisor values. In exemplary aspects, the extracted selected column can be placed in a simplified one-dimensional memory structure which can be more simply indexed with the partial remainder in each iteration (as opposed to indexing the two-dimensional lookup table with two indices as in conventional implementations). The one-dimensional memory structure can be implemented in several ways. Regardless of the particular implementation, the one-dimensional memory structure can be cached in a high-speed memory and accessed with improved speed for the numerous iterations involved in a particular division. 
Since storage, indexing, and accessing of the one-dimensional memory structure is simpler than a two-dimensional lookup table, power consumption in each iteration is also reduced.
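The extraction of a selected column into a one-dimensional structure may be sketched as follows. The toy table is an assumption made for illustration (integer divisors 8 through 15, exact shifted partial remainders 0 through 63, non-redundant radix-4 digits 0 through 3) and is not the table of any particular implementation; a Python list stands in for the high-speed memory.

```python
FIRST_DIVISOR, ROWS = 8, 64
# Toy two-dimensional table: rows are shifted partial remainders, columns are divisors 8..15.
TABLE = [[min(3, p // d) for d in range(FIRST_DIVISOR, 16)] for p in range(ROWS)]


def extract_column(table, divisor):
    """Copy the column for one divisor into a one-dimensional list (the 'cache')."""
    col = divisor - FIRST_DIVISOR
    return [row[col] for row in table]


def divide_with_cached_column(dividend, divisor, iterations):
    """Toy radix-4 division, assuming 0 <= dividend < divisor.

    The full table is touched once; every iteration indexes only the cached
    column, using the partial remainder as the single index.
    """
    column = extract_column(TABLE, divisor)    # done once per division
    remainder, quotient = dividend, 0
    for _ in range(iterations):
        shifted = remainder << 2
        q = column[shifted]                    # one-dimensional access
        remainder = shifted - q * divisor
        quotient = (quotient << 2) | q         # append two quotient bits
    return quotient, remainder
```

Because the divisor is fixed for the whole division, the column index is resolved once rather than in every iteration.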
Extraction and storage of the selected column for a particular divisor can be implemented in several ways. In some aspects, a column mask may be applied to the two-dimensional table in order to extract the selected column corresponding to a specific divisor value for a particular division operation. Alternatively, the selected column may be directly accessed. Extraction of the selected column will be further explained with reference to the various exemplary aspects of this disclosure. Once extracted, the selected column can be stored in a high-speed memory which can be configured to support a one-dimensional memory structure. For example, the high speed memory may be an on-chip cache which is integrated on the same chip as a processor comprising an arithmetic and logic unit (ALU) or more specifically, a floating point unit (FPU) which may be utilized for division and root computations. At the start of an exemplary division, the dividend and divisor operands may be read (e.g., from a register file, cache, main memory, etc.) and a table lookup may be performed to a main or comprehensive two-dimensional lookup table. A selected column can be extracted using the divisor operand and placed in the high speed memory. Entries of the high speed memory can then be accessed in each iteration of the division.
While the above aspects relate to a table lookup for determining quotient bits corresponding to particular mappings of combinations of the partial remainder and the divisor, alternative implementations are possible, where the same mapping can be obtained from logical expressions. For example, for each divisor value, the quotient value for a particular partial remainder value may be expressed as a Boolean or logical expression using bits of the partial remainder value and predetermined coefficients. Since more than one partial remainder may map to the same quotient value for a particular divisor, the logical expressions are formulated to exploit the repetition in the mappings. In exemplary aspects, the logical expressions (or more specifically, coefficient values) that can be used to derive the quotient values for the specific divisor value and various possible partial remainder values can be determined and used for the various iterations involving the same specific divisor value.
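One simple way to exploit the repetition in a column is sketched below. It is an assumption of this sketch, and not necessarily the form of logical expression contemplated above, that the column is monotone in the partial remainder, so that a few comparison constants (playing the role of predetermined coefficients) reproduce every entry. The toy column uses exact integer partial remainders and non-redundant digits 0 through 3.

```python
def column_for(divisor, rows=64):
    """Toy column: quotient digit for each shifted partial remainder 0..rows-1."""
    return [min(3, p // divisor) for p in range(rows)]


def thresholds_for(column):
    """Smallest row index at which each nonzero digit first appears."""
    first = {}
    for p, q in enumerate(column):
        first.setdefault(q, p)
    return [first[q] for q in sorted(first) if q > 0]


def select(p, thresholds):
    """The digit equals the number of thresholds that p has reached."""
    return sum(p >= t for t in thresholds)
```

For divisor 9, the 64-entry column collapses to the three constants 9, 18, and 27, which can be reused in every iteration of that division.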
It will be understood that in exemplary implementations, fewer than all bits of the divisor and/or the partial remainder (e.g., predetermined numbers of MSBs) may be utilized in the various table lookup operations and/or representations of mapping to quotient values using logical expressions.
Aspects related to root computation (e.g., square root) are not described in the same level of detail as division in this disclosure. This is because the various exemplary aspects discussed for division can be easily extended to root computation. For example, where references to a particular divisor are made with regard to table lookups for a particular division operation implemented using the SRT algorithm, an estimate of the root may be used instead, for the case of root computations using the SRT algorithm. Thus, a column of a similar lookup table for a root computation may be selected using an initial estimate of a root, where the initial estimate may be derived from a different lookup table or other mechanisms known in the art. For the purposes of this disclosure, the remaining processes are similar when it comes to a root computation.
Accordingly, an exemplary processor is described which includes a division/root computation unit. A memory is configured to store a lookup table according to a Sweeney, Robertson, and Tocher (SRT) algorithm for a division/root computation and a high-speed memory is configured to cache information related to a selected column of the lookup table, the selected column corresponding to a divisor/root estimate. The division/root computation unit is configured to iteratively perform division/root computation using the cached information. The cached information can include all quotient/root values for the divisor/root estimate in the selected column of the lookup table. In some aspects, the cached information comprises quotient/root select masks based on a logical combination of the divisor/root estimate for the selected column of the lookup table.
Iteratively performing the division/root computation involves a division/root lookup logic configured to determine a quotient/root from the cached information based on a current partial remainder in each iteration and to generate a next partial remainder based on the quotient/root, the divisor/root estimate, and the current partial remainder. The current partial remainder for a first iteration is the dividend/radicand for the division/square root.
In some implementations, the division/root lookup logic includes hardware such as a multiple select multiplexer to select a multiple of the divisor/root estimate based on the quotient/root, and a partial remainder subtractor to generate a next partial remainder as the multiple of the divisor/root estimate subtracted from the current partial remainder. The division/root lookup logic may be configured to determine the quotient/root from the cached information based on only a preselected number of most significant bits (MSBs) of the current partial remainder in each iteration. A carry-propagate adder (CPA) may be configured to add only the most significant bits of a pair of redundant partial remainders from a previous iteration. A pair of redundant partial remainder registers may store the next partial remainder in a redundant form. Moreover, one or more quotient registers, such as a pair of registers comprising a developed quotient/root register (Q) and a developed quotient/root minus one register (Q−1), may be used to store the quotient/root in each iteration.
With reference now to
The selected column or selected quotients available at the output of column/quotient select mask 108 may be latched or directly fed to iterator 110. Dividend register 104 provides the dividend to iterator 110. Iterator 110 may include logic to perform computation for division/root computation in each iteration of a corresponding SRT algorithm. For example, iterator 110 may produce one or more (e.g., r) bits per iteration based on the radix and particular values of the dividend and divisor. Each iteration may be pipelined and executed over one or more clock cycles of processor 100 depending on particular implementations. Once column/quotient select mask 108 is produced, it remains constant across all iterations. In each iteration, the r bits of the result (quotient/root) are produced, which may be stored in one or more registers such as quotient register 112. In each iteration, the bits stored in quotient register 112 may be shifted left to make room for the bits of subsequent iterations and to keep the bits of the result in the correct order. Once the computation is completed (e.g., as determined by a partial remainder value of zero or when a predetermined maximum number of iterations/predetermined precision is reached), e.g., after n iterations, the result may be available from quotient register 112. Further, after the first iteration, the dividend in dividend register 104 is replaced with the partial remainder, and after each subsequent iteration, the partial remainder obtained at the end of that iteration is stored in dividend register 104.
As described above, the Sweeney, Robertson, and Tocher (SRT) algorithm may include a two-dimensional mapping of partial remainder and divisor values to a quotient, which may be in the form of a lookup table. For example, in the lookup table, m MSBs of a partial remainder in a particular iteration and n MSBs of the divisor 102 (in the case of division) or the root estimate (in the case of performing a square root operation) may be used to index into the lookup table to provide b bits of a quotient for that iteration. The particular lookup table used depends on various design considerations, such as the integers m, n, and b, and other parameters such as the radix and the accuracy of the partial remainder/root estimate. In some cases, the partial remainder may not be fully resolved or computed in each iteration. As will be explained in the following sections, it may be possible to leave the computation of a partial remainder in a redundant form (e.g., comprising sum and carry components, rather than a resolved or non-redundant form which would be obtained after adding the sum and carry components in a carry-propagate adder (CPA) as known in the art). If the partial remainder is in redundant form and only m MSBs of the partial remainder are used, then only the m MSBs of the carry and sum components may be resolved in order to get an estimate of the partial remainder in each iteration, rather than resolve the partial remainder first and obtain the m MSBs of the resolved result. Thus, the partial remainder estimate may assume either a carry-in of “0” or “1” from the resolution of less significant bits of the carry and sum components. The precision of the quotient obtained in each iteration is correspondingly adjusted based on the correctness of these assumptions.
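The effect of resolving only the m MSBs of a redundant partial remainder can be illustrated with unsigned integers standing in for the sum and carry components. The function names and the 6-bit width are assumptions of this sketch, and the results are not wrapped to m bits.

```python
def msb_estimate(sum_bits, carry_bits, width, m):
    """Add only the m MSBs of the sum and carry components (assumes a carry-in of 0)."""
    drop = width - m
    return (sum_bits >> drop) + (carry_bits >> drop)


def msb_resolved(sum_bits, carry_bits, width, m):
    """Fully resolve the partial remainder first, then take its MSBs."""
    return (sum_bits + carry_bits) >> (width - m)
```

With sum 0b101101 and carry 0b010011, the truncated addition gives 7 while the fully resolved value gives 8: the estimate is low by exactly the carry out of the discarded low-order bits, which is always 0 or 1.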
A particular iteration of the SRT algorithm will now be discussed in further detail. For example, the operation in an ith iteration can be represented by the equation: $P_{i+1} = r \cdot P_i - q_{i+1} \cdot D$. In this equation, $P_i$ is the partial remainder available as an input to the ith iteration and $P_{i+1}$ is the partial remainder obtained at the end of the ith iteration, to be used in the next or (i+1)th iteration. D represents the divisor, r is the radix, and $q_{i+1}$ represents b bits of the quotient that are provided by the lookup table. In the next iteration, $P_{i+1}$ takes the place of $P_i$, and the lookup table is accessed again, with an approximation of $P_{i+1}$, to provide the next b bits of the quotient. For the first iteration, the dividend is used as the input partial remainder.
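The recurrence above can be exercised with exact rational arithmetic. In the sketch below, the quotient digit is chosen by rounding the exact ratio and clamping it to the signed digit set {−2, ..., 2}; this exact selection stands in for the table lookup and is an assumption of the sketch, as are the function name and the precondition |x| <= (2/3)·d with d > 0.

```python
from fractions import Fraction
import math


def digit_recurrence_divide(x, d, radix=4, max_digit=2, iterations=8):
    """Iterate P_{i+1} = r*P_i - q_{i+1}*D, starting from P_0 = x.

    Returns the list of signed quotient digits and the final partial remainder.
    """
    p = Fraction(x)
    d = Fraction(d)
    digits = []
    for _ in range(iterations):
        shifted = radix * p                                # r * P_i
        q = math.floor(shifted / d + Fraction(1, 2))       # round to nearest digit
        q = max(-max_digit, min(max_digit, q))             # clamp to the digit set
        p = shifted - q * d                                # P_{i+1}
        digits.append(q)
    return digits, p
```

After n iterations, the digits satisfy x/d − Σ q_i·r^(−i) = P_n/(d·r^n), so each iteration adds two bits of quotient at radix 4 while the partial remainder stays bounded by (2/3)·d.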
The SRT algorithm may also be used in an iterative fashion to perform a root computation. In the case of performing a square root operation, for example, an initial estimate of the square root is used, which may be provided by another lookup table. Given divisor 102 or an initial estimate of a square root, one implementation caches a column of a lookup table. The cached column is based upon the divisor 102 or initial estimate of the square root. The cached column is accessed each iteration of the SRT algorithm.
In some aspects, computer system 200 may be configured in or form part of a cellular phone, a tablet, a phablet, a personal digital assistant, or other user device. Processor 202 may be a general-purpose processor, a microcontroller, multicore processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Programmable Logic Device (PLD), a controller, a state machine, gated logic, discrete hardware components, or any other suitable entity that can perform calculations or other manipulations of information.
In some aspects, memory 204 may be a memory structure (e.g., a cache, register bank, etc.) or any other means for storing a lookup table, which may be in communication with processor 202. ALU 206 can perform arithmetic and logical operations on data. Division and root computation unit 208 can perform division and root computation operations. Instruction cache 210 may be populated with instructions of various instruction types that may be retrieved, for example, from a higher order cache or memory. Control unit 216 may provide control to pipeline 212 and other functional units (not shown) within processor 202. High-speed memory 214 may be viewed as and referred to as a cache, a caching means, or a register bank. High-speed memory 214 may be located or integrated on the same chip as processor 202 for faster access, and may also be referred to as an on-chip cache in this context. Although high-speed memory 214 has been illustrated as an individual block, there is no requirement for high-speed memory 214 to be a standalone structure; on the other hand, high-speed memory 214 may be integrated or be part of any other memory structure, which in exemplary aspects is integrated on the same chip as processor 202.
As previously discussed, in one exemplary aspect, a one-dimensional array or column of partial remainder/root table 218 can be extracted and cached for quick access and easier indexing than the entire two-dimensional partial remainder/root table 218. Extraction of the one-dimensional array or column may be implemented in several ways including directly reading out the column, using a mask to read out the column, etc., as will be discussed in the following sections in further detail.
The rows of partial remainder/root table 218 are indexed by the approximate partial remainder, where the values 00000, 00001, 11001, and 11010 are explicitly shown. The columns are indexed by the divisor (or a truncated version, e.g., comprising MSBs of the divisor) in the case of division or the root estimate (or a truncated version of the root estimate) in the case of root computation. A truncated divisor may include the n MSBs of the divisor (excepting the MSB, which is always “1” in a normalized floating point notation), where n is chosen according to established rules regarding the number of bits produced by the look-up table.
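Forming a column index from a truncated divisor may be sketched as follows, treating the divisor significand as an unsigned integer of a given width whose MSB is set. The function name, the 8-bit width, and the choice n = 3 in the usage note are assumptions chosen for illustration.

```python
def column_index(significand, width, n):
    """Return the n bits immediately below the (always set) MSB of the significand."""
    if significand >> (width - 1) != 1:
        raise ValueError("significand must be normalized to the given width")
    return (significand >> (width - 1 - n)) & ((1 << n) - 1)
```

For the 8-bit significand 0b10110110 and n = 3, the leading 1 is skipped and the next three bits, 011, select column 3.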
A selected column 220 of partial remainder/root table 218 is particularly shown in
It is noted that column or quotient select mask 406, divisor register 404, dividend/partial remainder registers 402, and quotient/root registers 416 may be memory structures which may be located outside division and root computation unit 208 in some implementations, and may also be shared with other components or blocks of processor 202. However, in
For an exemplary division operation (e.g., based on the SRT algorithm) performed using division and root computation unit 208, dividend and divisor operands may be received from an instruction and loaded into dividend registers 402 and divisor register 404, respectively. As previously described, a column (e.g., 220) can be selected from partial remainder/root table 218 based on bits of the divisor from divisor register 404. Selecting this column, or “pre-selection,” may be accomplished directly or by forming a mask. Information related to the selected column can be cached and used in the various iterations of the SRT algorithm. The cached information can include the values in the column or combinational logic such as a quotient select mask that can be used to obtain the values in the column. Aspects where the cached information includes all quotient/root values for the divisor/root estimate in the selected column of the lookup table will be discussed with relation to
Thus, column or quotient select mask 406 can include either selected column 220 (as in
Regardless of whether the selected column is extracted or quotient select mask bits are used in block 406, the remaining blocks of division and root computation unit 208 will now be explained. For the first iteration, dividend registers 402 hold the dividend. After the first iteration, for each subsequent iteration, dividend registers 402 hold redundant partial remainders in first and second redundant partial remainder registers 422 and 424, which produce redundant partial remainder bits 410 during each iteration. The redundant partial remainder bits 410 may be in sum/carry, redundant binary signed digit (RBSD) or any other redundant number format.
Divisor register 404 holds divisor bits 405. Redundant partial remainder bits 410 are output from dividend registers 402 and are then input into CPA 426. As previously stated, only a truncated version of the redundant partial remainder bits may be added (e.g., a few MSBs) in order to save time. Accordingly, CPA 426 may add MSBs of redundant partial remainder bits 410 and output non-redundant or resolved partial remainder bits 412. The number of MSBs of redundant partial remainder bits 410 to be added in CPA 426 may be dependent upon the number of bits processed per cycle. As previously mentioned, resolved partial remainder bits 412 are used as an index by division/root lookup logic 408 to look up the quotient or root from column or quotient select mask 406.
Division/root lookup logic 408 can then obtain quotient bits 414, which may be stored in quotient/root register 416 for each iteration. In general, a multiple select multiplexer may be used to select a multiple of the divisor/root estimate based on the quotient/root. In the illustrated implementation, quotient bits 414 for each iteration may also be used by multiple select mux 418, which selects the multiple of the divisor bits 405 that is to be subtracted from the redundant partial remainder bits 410. For example, if the quotient bits 414 denote a decimal value of “3,” then multiple select mux 418 selects “3” times the divisor bits 405 and outputs this value to partial remainder subtractor 420.
A partial remainder subtractor may then be used to generate a next partial remainder as the multiple of the divisor/root estimate subtracted from the current partial remainder. As shown, subtractor 420 calculates the difference between partial remainder bits 410 (from a previous iteration) and the multiple of divisor bits 405 to obtain the partial remainder for the next iteration, to be stored in first and second redundant partial remainder registers 422 and 424 after a left shift, as follows. The partial remainder for the next iteration is shifted left based on how many quotient bits 414 are produced (e.g., based on the radix). Thus, if three quotient bits 414 are produced, the redundant partial remainder bits for the next iteration are shifted left three bits and loaded into first and second redundant partial remainder registers 422 and 424.
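The multiple selection, subtraction, and left shift described above may be sketched with ordinary (non-redundant) integers. The function names, the precomputed list of multiples, and the toy divisor 9 in the usage note are assumptions of this sketch; a real datapath would keep the partial remainder in redundant form.

```python
def divisor_multiples(divisor, max_digit=3):
    """Multiples 0*D, 1*D, ..., max_digit*D presented to the multiplexer."""
    return [k * divisor for k in range(max_digit + 1)]


def select_multiple(q, multiples):
    """Model of the multiple select mux: the quotient digit is the select input."""
    return multiples[q]


def next_partial_remainder(current, q, multiples, bits_per_iteration):
    """Subtract the selected multiple, then shift left by the bits produced."""
    return (current - select_multiple(q, multiples)) << bits_per_iteration
```

With divisor 9 and two bits per iteration, a partial remainder of 20 with quotient digit 2 yields (20 − 18) shifted left by two, i.e., 8.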
Division/root lookup logic 408 obtains the shifted difference from first and second redundant partial remainder registers 422 and 424 in the next iteration and the process repeats. That is, division and root computation unit 208 repeats the process of reading the divisor bits 405, selecting the multiple of the divisor bits 405, and performing the subtraction of the multiple of the divisor bits 405 from the redundant partial remainder bits 410.
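One iteration of the multiple selection, subtraction, and left shift described above may be sketched as follows. This is a simplified model under stated assumptions: it uses ordinary non-redundant Python integers and non-negative quotient digits, whereas the described unit keeps the partial remainder in redundant form. The function name `iteration_step` and the operand values are hypothetical.

```python
def iteration_step(partial: int, divisor: int, digit: int, radix_bits: int) -> int:
    """Return the partial remainder for the next iteration (hypothetical model)."""
    radix = 1 << radix_bits
    # Candidate inputs of the multiple select mux: 0x, 1x, ..., (radix-1)x the divisor.
    multiples = [m * divisor for m in range(radix)]
    selected = multiples[digit]          # the mux picks the multiple named by the quotient digit
    difference = partial - selected      # role of the partial remainder subtractor
    return difference << radix_bits      # shift left by the number of quotient bits produced


# A quotient digit of 3 selects 3 times the divisor, as in the example above.
next_partial = iteration_step(partial=10, divisor=3, digit=3, radix_bits=2)
```

Here 3 × 3 = 9 is subtracted from 10 and the difference of 1 is shifted left by two bits, giving 4 as the partial remainder for the next iteration.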
While quotient register 416 may be a single register (e.g., quotient register Q 430), in some implementations quotient register 416 may comprise one or more quotient registers, such as a pair of registers comprising a developed quotient/root register (Q) and a developed quotient/root minus one register (Q−1), to store the quotient/root. For example, as shown, quotient register Q 430 holds the developed quotient value Q, and quotient register QM 434 holds the developed quotient minus one value Q−1. Updating of quotient registers 416 can be performed using on-the-fly algorithms, as known in the art.
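For context, the well-known on-the-fly conversion scheme for maintaining Q and Q−1 from signed quotient digits can be sketched in Python as shown below. The sketch is illustrative only and is not asserted to be the specific update logic of quotient registers 430 and 434; the function name, the use of unbounded Python integers, and the digit sequence are assumptions.

```python
def on_the_fly(digits, radix_bits):
    """Convert signed radix-2**radix_bits digits to (Q, Q-1) without carry propagation."""
    radix = 1 << radix_bits
    q, qm = 0, -1                       # invariant: qm == q - 1
    for d in digits:
        if d >= 0:
            new_q = q * radix + d               # append digit d to Q
        else:
            new_q = qm * radix + (radix + d)    # append digit (radix - |d|) to Q-1
        if d > 0:
            new_qm = q * radix + (d - 1)        # append digit (d - 1) to Q
        else:
            new_qm = qm * radix + (radix - 1 + d)   # append digit (radix - 1 - |d|) to Q-1
        q, qm = new_q, new_qm           # both registers update together
    return q, qm


# Assumed signed-digit sequence in radix 4: 1*64 - 2*16 + 3*4 - 1 = 43.
Q, QM = on_the_fly([1, -2, 3, -1], radix_bits=2)
```

Every appended digit lies in the range 0 to radix − 1, so in hardware each update reduces to a shift and a concatenation onto either Q or Q−1, and a negative quotient digit never requires a borrow to ripple through the developed quotient.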
It will be appreciated that aspects include various methods for performing the processes, functions and/or algorithms disclosed herein. For example,
In block 502, method 500 loads a column of the lookup table into on-chip high speed memory. For example, given a divisor or root estimate, an appropriate column (e.g., 220) from the partial remainder/root table 218 is selected and stored in on-chip, high-speed memory 214 of
Method 500 flows from block 504 to block 508 for each iteration of the SRT algorithm. After block 508 for a current iteration, method 500 proceeds via path 510 to block 504 and repeats until a partial remainder of zero or a desired accuracy is achieved.
In block 504, method 500 generates a partial remainder based on the SRT algorithm. It is noted that for the first iteration, the first or initial partial remainder may be the dividend or radicand.
In block 506, method 500 indexes into the selected column based on the partial remainder. For example, partial remainder bits generated by the SRT algorithm in a particular iteration may be used to index into the selected column of partial remainder/root table 218 stored in the high-speed memory 214 or column or quotient select mask 406 to provide the estimated quotient bits or square root bits. In further detail, referring back to
In block 508, method 500 updates the partial remainder based on the quotient from the selected column. In one or more implementations, the quotient bits 414 are used to select a multiple of the divisor or root formed thus far, which is subtracted from the current partial remainder bits in a particular iteration to produce partial remainder bits of the next iteration. In further detail, quotient bits 414 obtained from division/root lookup logic 408 may be used to obtain a multiple of divisor bits 405 using multiple select mux 418, which may be subtracted from redundant partial remainder bits 410 in subtractor 420 to produce partial remainder bits to be stored in first and second partial remainder registers 422 and 424 for the next iteration.
After method 500 updates the partial remainder based on the result from the selected column, method 500 returns to block 504 through path 510 and repeats from that point for the next iteration.
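The flow of method 500 may be sketched as follows. The table here is a toy stand-in built by exact integer division over a small range, not an actual SRT table indexed by truncated partial remainder bits; the constants, function names, and operand values are all hypothetical. The point of the sketch is only that the column is selected once per operation and that each iteration then indexes the cached column.

```python
RADIX_BITS = 2
RADIX = 1 << RADIX_BITS
MAX_DIVISOR = 15                     # assumed range of the toy table


def build_table():
    """Toy table: rows are partial remainders, columns are divisors."""
    rows = RADIX * MAX_DIVISOR
    return [[min(RADIX - 1, p // d) if d else 0
             for d in range(MAX_DIVISOR + 1)]
            for p in range(rows)]


def load_column(table, divisor):
    """Block 502: copy the column for this divisor into fast storage."""
    return [row[divisor] for row in table]


def divide_with_cached_column(dividend, divisor, iterations, table):
    """Requires 0 <= dividend < RADIX * divisor and 1 <= divisor <= MAX_DIVISOR."""
    column = load_column(table, divisor)          # done once, before the loop
    partial = dividend                            # block 504, first iteration
    quotient = 0
    for _ in range(iterations):
        digit = column[partial]                   # block 506: index the cached column
        quotient = (quotient << RADIX_BITS) | digit
        partial = (partial - digit * divisor) << RADIX_BITS   # block 508
    return quotient


result = divide_with_cached_column(7, 3, iterations=4, table=build_table())
```

With these assumed inputs the quotient digits are 2, 1, 1, 1 in radix 4, so `result` is 149, which equals 7 × 4³ divided by 3 and rounded down. Only the one-dimensional column is touched inside the loop; the full two-dimensional table is consulted once.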
With reference now to
Like method 500, prior to start of method 600, partial remainder/root table 218 for the SRT algorithm for a given radix and accuracy is generated and stored in memory 204.
In block 602, method 600 loads “0”s and “1”s into quotient select mask registers based on a selected column 220, which is selected based on the divisor or root estimate. For example, the partial remainder is provided as input to combinational logic which includes up to (n−1) quotient/root select registers where n is equal to 2^(radix), and where the radix is an indication of the number of bits of the quotient/root. For example, (n−1) quotient select registers may include patterns of “0”s and “1”s stored therein. The logical combination or combinational logic comprises comparators for comparing one or more bits of the current partial remainder with preselected partial remainder constants, and performing a logical AND on a result of the comparison with the quotient select registers. These aspects are explained further with reference to alternative implementations of partial remainder/root table 218, shown in
Method 600 flows from block 604 to block 608 for each iteration of the SRT algorithm. After block 608 for a current iteration, method 600 proceeds via path 610 to block 604 and repeats until a partial remainder of zero or a desired accuracy is achieved.
In block 604, method 600 generates the partial remainder based on the SRT algorithm. It is noted that for the first iteration, the first or initial partial remainder may be the dividend or radicand.
In block 606, method 600 generates quotient bits based on decoding the partial remainder ANDed with a quotient select mask. In one implementation, the combinational logic compares the current partial remainder with preselected partial remainder constants or coefficients, and the result of the comparison is ANDed with the quotient select register number. These results are ORed together to form a “1-hot” decoded quotient. Also in block 606, the decoded quotient bits are encoded to produce a conventional binary representation of the quotient bits.
In block 608, method 600 updates the partial remainder based on the generated quotient bits. After the combinational logic provides the next quotient or root bits, method 600 returns to block 604 via path 610 and repeats from there for subsequent iterations.
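The mask-based quotient selection of method 600 may be sketched in Python as shown below. The sketch assumes the comparators decode the partial remainder index into a 1-hot vector, that one mask is kept for each non-zero quotient value, and that a quotient of zero is implied when no mask matches. The function names and the eight-entry radix-4 column are hypothetical.

```python
def column_to_masks(column, radix):
    """Block 602: encode a selected column into (radix - 1) bit masks."""
    masks = {}
    for q in range(1, radix):
        mask = 0
        for p, entry in enumerate(column):
            if entry == q:
                mask |= 1 << p          # bit p is "1" when row p yields quotient q
        masks[q] = mask
    return masks


def select_quotient(partial_index, masks):
    """Block 606: decode, AND with each mask, OR into a 1-hot word, then encode."""
    decoded = 1 << partial_index        # comparator outputs: one bit per constant
    one_hot = 0
    for q, mask in masks.items():
        if decoded & mask:              # AND of decoded remainder with mask q
            one_hot |= 1 << q           # OR into the 1-hot decoded quotient
    # Encode the 1-hot word to binary; no bit set means a quotient of zero.
    return one_hot.bit_length() - 1 if one_hot else 0


# Assumed toy column for a radix-4 example with eight partial remainder rows.
toy_column = [0, 0, 1, 1, 2, 2, 3, 3]
toy_masks = column_to_masks(toy_column, radix=4)
digits = [select_quotient(p, toy_masks) for p in range(len(toy_column))]
```

For this assumed column the three masks are 0b00001100, 0b00110000, and 0b11000000, and the selected digits reproduce the column entry for every row. The masks are computed once per operation, so each iteration needs only AND, OR, and encode logic rather than a memory access.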
The combinational logic discussed with reference to method 600 may reside as a circuit on processor 102, where control unit 116 may provide the appropriate controls.
With reference now to
In the illustrated implementation, table 702 represents a radix-8 table lookup example, as each encoded quotient/root can have a value from 0-7. There are seven quotient select masks 704, which are numbered 1-7. Each bit in one of the seven quotient select masks 704 represents a “0” value or a “1” value, which is used as a mask to later select a decoded partial remainder.
The shaded entries in table 702 show an example that all table 702 quotient entries that correspond to a divisor value of 0111 or an equivalent decimal value “6” may be encoded into a quotient select mask #6. Each entry in the quotient select mask #6 is either a “0” or a “1” based on the column comprising divisor 0111, identified as column 722.
Division and square root unit 700 executes quotient bit equations 706. Quotient bit equations 706 represent the equations that generate a “1-hot” decoded quotient based on the partial remainder and the quotient select mask register bits set in the quotient select masks 704. As described above, these “1-hot” quotient bits can be encoded into a binary format by a conventional encoder.
Referring back to
In
Referring back to
In one implementation, the logic blocks 904 encode the column selected by divisor or root estimate 902 into quotient select mask registers 906 (which can be cached or stored in column or quotient select mask 406 of
Although
Although steps and decisions of various methods may have been described serially in this disclosure, some of these steps and decisions may be performed by separate elements in conjunction or in parallel, asynchronously or synchronously, in a pipelined manner, or otherwise. There is no particular requirement that the steps and decisions be performed in the same order in which this description lists them, except where explicitly so indicated, otherwise made clear from the context, or inherently required. It should be noted, however, that in selected variants the steps and decisions are performed in the order described above. Furthermore, not every illustrated step and decision may be required in every implementation/variant in accordance with the invention, while some steps and decisions that have not been specifically illustrated may be desirable or necessary in some implementations/variants in accordance with the invention.
Those of skill in the art would understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the implementations disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware, software, or a combination of hardware and software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the implementations disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in an access terminal. Alternatively, the processor and the storage medium may reside as discrete components in an access terminal.
Accordingly, an aspect of the invention can include a computer-readable medium embodying a method of performing a division/root computation operation using cached information for quotient/root lookup in an SRT algorithm implementation. Accordingly, the invention is not limited to illustrated examples, and any means for performing the functionality described herein are included in aspects of the invention.
While the foregoing disclosure shows illustrative aspects of the invention, it should be noted that various changes and modifications could be made herein without departing from the scope of the invention as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the aspects of the invention described herein need not be performed in any particular order. Furthermore, although elements of the invention may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.