Embodiments of the invention relate generally to performing quad precision floating-point operations in a microprocessor, including microprocessor instructions for performing quad precision floating-point calculations.
Due to the limits of finite precision approximation inherent in microprocessors when attempting to model arithmetic with real numbers, every floating-point operation executed by a microprocessor potentially results in a rounding error. To maintain an acceptable minimum level of accuracy, floating-point computations in microprocessors require a relatively complex set of microinstructions. The floating-point square root operation in many current microprocessors is a notable example of a computationally intensive and potentially error-prone operation.
To ensure a common representation of real numbers on computers, the IEEE-754 Standard for Binary Floating-Point Arithmetic (IEEE 754-1985) was established to govern binary floating-point arithmetic. The current version of the standard has been under revision since 2000 (due for completion in December 2005), and is referred to herein as “the proposed revised IEEE 754 standard” or “IEEE 754r.” This standard specifies number formats, basic operations, conversions, and exception conditions, and requires that the result of a divide or square root operation be calculated as if in infinite precision, and then rounded to one of the two nearest floating-point numbers of the specified precision that surround the result.
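Stated compactly, in notation introduced here for clarity (the symbols below are not drawn from the standard's text), the requirement is that the computed result equal the correctly rounded value of the infinitely precise result:

\[
\mathrm{sqrt}(x) \;=\; \circ\!\left(\sqrt{x}\right), \qquad \circ\!\left(\sqrt{x}\right) \in \{\, r^{-},\, r^{+} \,\},
\]

where \(r^{-}\) and \(r^{+}\) are the two consecutive floating-point numbers of the target precision with \(r^{-} \le \sqrt{x} \le r^{+}\), and \(\circ(\cdot)\) denotes rounding according to the active rounding mode.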
Due to various factors, such as rounding errors, decimal-binary conversion, improper management of extended precision registers, and so on, the square root (“sqrt”) operation is particularly susceptible to error, and different microprocessors that do not adhere to the proposed revised IEEE 754 standard can generate different results for the same square root operation. Increasing the number of digits of precision used by the microprocessor for the operation can help to ensure its accuracy. However, such an increase in precision can require substantial processor overhead and increase processing latencies. For example, the correct value of a floating-point square root operation has been calculated in a microprocessor using 200 digits of precision, but at the cost of significant computing time.
Many microprocessors do not have native instructions for quad precision arithmetic operations, such as a quad precision square root operation, or hardware-based implementations of the square root operation. For these microprocessors, execution of the square root function typically involves a software-based iterative approximation method, such as the Newton-Raphson method, a power series expansion, or a similar method. Such microprocessors execute iterative operations to perform the square root calculation that can involve hundreds of clock cycles in the critical path of the processor.
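For illustration only, the following is a minimal C sketch of such a software-based Newton-Raphson square root, shown in double precision rather than quad precision; the function name, seed calculation, and iteration count are assumptions introduced here and do not correspond to any particular processor's sequence. Because each step depends on the result of the previous one, a full-precision routine of this kind occupies a long chain of dependent operations in the critical path.

```c
#include <math.h>

/* Minimal double-precision sketch of a software Newton-Raphson square root;
 * assumes a is positive, finite, and normal. */
double nr_sqrt(double a)
{
    /* Crude seed for 1/sqrt(a): halve the binary exponent of a. */
    int e;
    (void)frexp(a, &e);
    double y = ldexp(1.0, -e / 2);

    /* Each step y <- y * (3 - a*y*y) / 2 roughly doubles the number of
     * correct bits, so several dependent steps are needed to reach full
     * working precision from this crude seed. */
    for (int i = 0; i < 8; ++i)
        y = y * (1.5 - 0.5 * a * y * y);

    /* sqrt(a) = a * (1/sqrt(a)); the product is still subject to final rounding. */
    return a * y;
}
```

The embodiments described below instead perform a single high-precision Newton-Raphson iteration supported by dedicated reduced-latency instructions.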
Embodiments of the present invention are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.
Embodiments of a method and system for performing quad precision floating-point operations on quad precision operands in a 64-bit microprocessor are described. These embodiments are also referred to herein collectively as the “floating-point operations.” The floating-point operations include square root operations, but are not so limited. Embodiments include a reduced-latency method that can be implemented in microcode operations, in software routines or modules (for example, in a compiler or in software libraries supported by a compiler), in microprocessor instructions, or in hardware-implemented logic in the 64-bit microprocessor. Embodiments of the method include executing a Newton-Raphson iterative process on a quad precision operand using operations embodied in one or more of the microprocessor instructions described below. The iterative method comprises calculating a 64-bit approximation of the reciprocal of the square root of the operand, calculating the result and rounding it to one of the two nearest quad precision floating-point numbers, and determining whether the result is exact or inexact.
The instructions of an embodiment store and operate on the quad precision (128-bit) operand in the floating-point registers of the processor. The instructions of an embodiment are referred to herein as “reduced-latency instructions,” but are not so limited. The reduced-latency instructions of an embodiment include a first set of microprocessor instructions that store the quad precision operand in two floating-point registers. The reduced-latency instructions of an embodiment also include a second set of microprocessor instructions that operate on the two floating-point registers to perform arithmetic and logic operations. By utilizing this storage and logic mechanism, the reduced-latency instructions of an embodiment use fewer clock cycles to perform the arithmetic and logic operations as compared to known methods for performing floating-point square root calculations.
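As a rough illustration of this paired storage (not the processor's actual register encoding), the C sketch below models a 128-bit operand as two 64-bit halves and shows the kind of carry-propagating two-register operation such a split enables; the type and function names are assumptions introduced here.

```c
#include <stdint.h>

/* Illustrative pairing of a quad precision (128-bit) operand into two 64-bit
 * halves, mirroring storage in two floating-point registers.  The split and
 * the names are assumptions for illustration only. */
typedef struct {
    uint64_t hi;   /* sign, 15-bit exponent, and upper 48 stored significand bits */
    uint64_t lo;   /* lower 64 stored significand bits */
} quad_pair;

/* Add the two 128-bit patterns as unsigned integers (e.g., for significand
 * arithmetic); note this is not a floating-point addition. */
static quad_pair quad_pair_add(quad_pair a, quad_pair b)
{
    quad_pair r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* propagate the carry out of the low half */
    return r;
}
```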
The floating-point operations of an embodiment significantly reduce the latency of quad precision square root operations. The floating-point operations can also reduce the latency of other quad precision floating-point operations, for example quad precision division. The floating-point operations described below also provide good instruction-level parallelism, which makes them well suited for processors with pipelined functional units, multiple functional units, and/or multiple cores.
Known implementations of the quad precision square root operation based on the Newton-Raphson method generate an approximate result that may be either an underestimate or an overestimate of the precise result. In contrast, the floating-point operations and corresponding reduced-latency instructions described herein perform a single Newton-Raphson iteration in high precision to obtain an underestimate of the result, apply a simplified rounding method, efficiently determine whether the result is inexact, and employ reduced-latency instructions that lower the latency not only of the quad precision square root but also of other quad precision floating-point operations.
In the following description, numerous specific details are introduced to provide a thorough understanding of, and enabling description for, embodiments of the floating-point square root calculation methodology and instruction set. One skilled in the relevant art, however, will recognize that these embodiments can be practiced without one or more of the specific details, or with other components, systems, etc. In other instances, well-known structures or operations are not shown, or are not described in detail, to avoid obscuring aspects of the disclosed embodiments.
Embodiments of the floating-point operations are directed to the calculation of the quad precision (128-bit) floating-point square root value of a quad precision floating-point argument (or operand). As defined by the proposed revised IEEE 754 standard for binary floating-point arithmetic, the quad precision floating-point format comprises a 1-bit sign, a 15-bit exponent, and a 113-bit significand that includes an implicit integer bit (so that 112 significand bits are stored explicitly).
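For reference, the sketch below decodes those fields from a 128-bit pattern held as two 64-bit words; the structure and function names are illustrative. With the integer bit implicit, 112 of the 113 significand bits are stored, and the exponent bias of the format is 16383.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative view of the quad precision bit layout held as two 64-bit words. */
typedef struct { uint64_t hi, lo; } quad_bits;

static void decode_quad(quad_bits q)
{
    unsigned sign     = (unsigned)(q.hi >> 63);             /* 1-bit sign */
    unsigned biased_e = (unsigned)((q.hi >> 48) & 0x7FFF);  /* 15-bit exponent, bias 16383 */
    uint64_t frac_hi  = q.hi & 0xFFFFFFFFFFFFULL;           /* upper 48 of the 112 stored fraction bits */
    uint64_t frac_lo  = q.lo;                               /* lower 64 stored fraction bits */

    /* For normal numbers the 113th significand bit is the implicit leading 1,
     * so the value is (-1)^sign * 1.fraction * 2^(biased_e - 16383). */
    printf("sign=%u exponent=%d fraction=%012llx%016llx\n",
           sign, (int)biased_e - 16383,
           (unsigned long long)frac_hi, (unsigned long long)frac_lo);
}
```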
In one embodiment of the floating-point operations, a methodology for calculating the square root of a number in a quad precision microprocessor compliant with the proposed revised IEEE 754 standard comprises:
(1) performing a single Newton-Raphson iteration in high precision to obtain an underestimate of the result,
(2) calculating and rounding the result to quad precision using a simplified rounding method,
(3) checking whether the result is inexact, and
(4) embodying one or more portions of the methodology in one or more atomic microinstructions (e.g., reduced-latency instructions) for execution in a 64-bit microprocessor.
The computation begins with the calculation of a 64-bit approximation of the reciprocal of the square root of the operand, 102. This approximation is an underestimate of the exact reciprocal and is used to calculate an underestimate of the result that lies within a small fraction of one ulp (unit in the last place) of the precise square root value. The result is then calculated and rounded to one of the two nearest quad precision floating-point numbers, 104. In most cases, the approximate result can be rounded directly and the IEEE 754r-correct quad precision result is obtained. In general, only a few exceptional cases exist for each rounding mode, and in those cases one ulp may need to be added to the rounded value of the approximate result. Thus, the process determines whether the approximate result can be rounded directly, 106. If not, one ulp is added to the rounded result, 108. After any ulp addition is complete, or upon a determination that direct rounding is possible, the result is checked to determine whether it is exact or inexact, 110.
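A simplified double-precision sketch of the rounding and exactness steps (104 through 110) follows; it is not the quad precision embodiment, it handles only round-to-nearest, and it assumes the supplied approximation r is already within one ulp of the true square root of a positive operand. The type and function names are assumptions introduced here. The fused multiply-add is used so that each residual c*c - a is obtained with a single rounding.

```c
#include <math.h>
#include <stdbool.h>

/* Illustrative result type: the rounded value plus an inexact flag. */
typedef struct { double value; bool inexact; } rounded_sqrt;

/* Given a > 0 and an approximation r of sqrt(a) within one ulp, choose the
 * correctly rounded (round-to-nearest) result from r and its one-ulp
 * neighbors, then flag whether the result is exact. */
static rounded_sqrt round_and_check(double a, double r)
{
    double lo = nextafter(r, 0.0);
    double hi = nextafter(r, INFINITY);
    double cand[3] = { lo, r, hi };
    double best = r, best_err = INFINITY;

    /* Pick the candidate whose square is closest to a. */
    for (int i = 0; i < 3; ++i) {
        double err = fabs(fma(cand[i], cand[i], -a));  /* |cand*cand - a| via one rounding */
        if (err < best_err) { best_err = err; best = cand[i]; }
    }

    /* The result is exact only if its square reproduces the operand exactly. */
    rounded_sqrt out = { best, fma(best, best, -a) != 0.0 };
    return out;
}
```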
The process detailed above represents an iterative calculation based on the Newton-Raphson method, which has been adapted for use with embodiments of the floating-point operations described herein. In one embodiment, specific microcode instructions (reduced-latency instructions) are provided to execute one or more operations of the process. In an embodiment, these reduced-latency instructions are configured to replace and/or supplement the standard instruction set of an existing 64-bit microprocessor, such as the Intel® Itanium® 2 processor.
Column 202 indicates, as an example, the known latency in clock cycles for the Itanium® processor. Column 206 illustrates an estimate of the reduced latency that can be obtained when one or more reduced-latency instructions are used to execute specific operations 1-8 of the figure, according to embodiments of the floating-point operations. For those operations for which reduced-latency instructions are not available, the latency values are unchanged and are shown in parentheses.
Column 308 lists a set of reduced-latency instructions for executing the corresponding operations, according to one embodiment. The notation used for the reduced-latency instructions is also provided in the figures.
Column 310 of both figures lists the estimated reduced latency associated with the reduced-latency instructions.
The reduced-latency instructions outlined in the figures are described individually in rows 314, 316, 318, 320, 322, 324, and 326 of the figures.
The processes and instructions described herein can be adapted for use with other processors and processor architectures using techniques known to those of ordinary skill in the art. The term “processor” as generally used herein refers to any logic processing unit, such as one or more central processing units (“CPU”), digital signal processors (“DSP”), application-specific integrated circuits (“ASIC”), and so on. The processor can be monolithically integrated onto a single chip, distributed among a number of chips or components of a host system, and/or provided by some combination of algorithms. The reduced-latency instructions described above feature enhanced instruction-level parallelism, which makes them well suited for processors with pipelined functional units, multiple functional units, or multiple cores.
Aspects of the floating-point operations described herein may be implemented as functionality programmed into any of a variety of circuitry, including programmable logic devices (“PLDs”), such as field programmable gate arrays (“FPGAs”), programmable array logic (“PAL”) devices, electrically programmable logic and memory devices and standard cell-based devices, as well as application specific integrated circuits. Some other possibilities for implementing aspects of the floating-point operations include: microcontrollers with memory (such as EEPROM), embedded microprocessors, firmware, software, etc. Furthermore, aspects of the floating-point operations may be embodied in microprocessors having software-based circuit emulation, discrete logic (sequential and combinatorial), custom devices, fuzzy (neural) logic, quantum devices, and hybrids of any of the above device types. The underlying device technologies may be provided in a variety of component types, e.g., metal-oxide semiconductor field-effect transistor (“MOSFET”) technologies like complementary metal-oxide semiconductor (“CMOS”), bipolar technologies like emitter-coupled logic (“ECL”), polymer technologies (e.g., silicon-conjugated polymer and metal-conjugated polymer-metal structures), mixed analog and digital, and so on.
It should also be noted that the various functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or as data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, non-volatile storage media in various forms (e.g., optical, magnetic or semiconductor storage media) and carrier waves that may be used to transfer such formatted data and/or instructions through wireless, optical, or wired signaling media or any combination thereof. Examples of transfers of such formatted data and/or instructions by carrier waves include, but are not limited to, transfers (uploads, downloads, e-mail, etc.) over the Internet and/or other computer networks via one or more data transfer protocols (e.g., HTTP, FTP, SMTP, and so on).
Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number respectively. Additionally, the words “herein,” “hereunder,” “above,” “below,” and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word “or” is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list, all of the items in the list, and any combination of the items in the list.
The above description of illustrated embodiments of floating-point operations is not intended to be exhaustive or to limit the floating-point operations to the precise form or instructions disclosed. While specific embodiments of, and examples for, the floating-point operations are described herein for illustrative purposes, various equivalent modifications are possible within the scope of floating-point operations, as those skilled in the relevant art will recognize. Moreover, the teachings of the floating-point operations provided herein can be applied to other floating-point operations, such as quad precision division.
The elements and acts of the various embodiments described above can be combined to provide further embodiments. These and other changes can be made to the floating-point operations in light of the above detailed description.
In general, in the following claims, the terms used should not be construed to limit the floating-point operations to the specific embodiments disclosed in the specification and the claims, but should be construed to include all operations or processes that operate under the claims. Accordingly, the floating-point operations are not limited by the disclosure, but instead the scope of the recited embodiments is to be determined entirely by the claims.
While certain aspects of the floating-point operations are presented below in certain claim forms, the inventor contemplates the various aspects of the floating-point operations in any number of claim forms. For example, while only one aspect of the square root instruction set is recited as embodied in machine-readable medium, other aspects may likewise be embodied in machine-readable medium. Accordingly, the inventor reserves the right to add additional claims after filing the application to pursue such additional claim forms for other aspects of the floating-point operations.