The subject matter disclosed herein generally relates to microprocessor operations. Specifically, the present disclosure addresses systems and methods that accelerate microprocessor computations.
Conventionally, computing devices are used to perform operations that are used in countless applications. As an example, ray casting is an algorithm for three-dimensional (3D) gaze-to-view rendering commonly utilized in virtual, augmented, and mixed reality (VR, AR, and XR) applications. Ray casting requires a large number of square root computations, resulting in long latency and significant power consumption. Long latency causes VR, AR, and XR applications to be unresponsive to users' eye, head, or body movements. High power consumption requires a VR, AR, or XR headset to embed a big and heavy battery. Finally, low bandwidth prevents photorealistic images with higher resolution. This one example use case illustrates operational limitations of conventional microprocessors.
Binary32 format, defined by IEEE Std 754-2019, is commonly used by VR, AR, and XR engineers and others. However, the limitation of a dynamic range caused by finite bit width of exponents is inevitable for any data format, uncompressed or otherwise. A number out of the dynamic range can underflow and become zero (0) or overflow and become infinity (∞).
The binary32 format is a signed exponential format with one sign bit (S), eight exponent bits (E), 23 mantissa bits (M), and one hidden bit (H). When the sign bit (S) is 0, a represented number is positive. Otherwise, it is negative. The eight exponent bits (E) represent an integer in a range of [−126, +127] indicating a dynamic range to be in a range of [2^−126, 2^+127]. The hidden bit (H) is normally 1. The 23 mantissa bits (M) comprise a fraction part of the represented number.
The binary32 format represents a number with a value of (−1)^S*(H·M)*2^E, wherein S is either 0 or 1, E is in a range of [−126, 127], and (H·M) is normally in a range of [1.0,2.0). Thus, binary32 format can represent a nonzero normal number in a range of +[1.0, 2.0)*2^−126 to +[1.0, 2.0)*2^127 or −[1.0, 2.0)*2^−126 to −[1.0, 2.0)*2^127.
For simplicity, “significand” is denoted as an optional hidden bit followed by a plurality of mantissa bits in any data format. M is denoted as a value of the mantissa bits (e.g., 23 bits in the binary32 format). Because M is in a range of [0.0, 1.0), a significand is in the range of [1.0, 2.0) for normal numbers. In general, a numerical value is evaluated by taking the optional hidden bit into account even when only the mantissa bits are available. This is why it is referred to as a “hidden” bit.
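By way of non-limiting illustration, the sign, exponent, hidden bit, and mantissa described above can be extracted from a binary32 encoding as sketched below. The function names decode_binary32 and value are hypothetical and are not part of the disclosure. The treatment of an all-zero exponent field (hidden bit of 0 and exponent of −126) follows IEEE Std 754 rather than any statement in the text above.

```python
import struct

def decode_binary32(x: float):
    """Split x, encoded as binary32, into (S, E, H, M) as named in the text."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                      # S: 0 is positive, 1 is negative
    biased = (bits >> 23) & 0xFF           # eight exponent bits, biased by 127
    fraction = bits & 0x7FFFFF             # 23 mantissa bits
    if biased == 0xFF:
        raise ValueError("infinity and NaN have no (S, E, H, M) value")
    hidden = 0 if biased == 0 else 1       # H is normally 1 (0 for zero/subnormal)
    exponent = -126 if biased == 0 else biased - 127
    mantissa = fraction / 2**23            # M in [0.0, 1.0)
    return sign, exponent, hidden, mantissa

def value(sign, exponent, hidden, mantissa):
    """Evaluate (-1)^S * (H.M) * 2^E."""
    return (-1) ** sign * (hidden + mantissa) * 2.0 ** exponent
```

For example, decode_binary32(-6.0) yields a sign of 1, an exponent of 2, a hidden bit of 1, and a mantissa of 0.5, which corresponds to −1.5*2^2.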
Many CPUs, GPUs, Floating-Point Units (FPUs), and Digital Signal Processors (DSPs) apply Newton-Raphson or Sweeney-Robertson-Tocher (SRT) algorithms for division and square root computation. Both Newton-Raphson and SRT algorithms are slow due to their iterative and recurrent natures, respectively. A fast way of computing a square root of an input operand (X) is to generate a reciprocal square root (R) and then multiply the reciprocal square root by the input operand (X), as shown by the following equation:
√X = X*R = X*(1/√X)
Though the above equation is mathematically correct, applying the equation can provide incorrect results. For example, a square root of 0 is 0. Thus, the reciprocal square root (R) is 1 divided by 0, and thus equal to infinity (∞) by definition.
When the above equation is applied to compute the square root of 0, the result is X*R=0*∞=Not A Number (NaN). In floating-point arithmetic (e.g., IEEE Std 754-2019), the equation runs into an invalid operation when trying to multiply 0 by ∞ and generates a NaN to indicate the failure to produce a numeric result.
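The failure described above can be reproduced with ordinary floating-point arithmetic. The following minimal sketch uses Python binary64 values rather than binary32, and the variable names are illustrative only.

```python
import math

x = 0.0
r = math.inf           # 1/sqrt(0) taken as +infinity, as described in the text
naive_sqrt = x * r     # 0 * infinity is an invalid operation and yields NaN

print(math.isnan(naive_sqrt))   # the naive X*R result is not a number
print(math.sqrt(x))             # the mathematically expected result is 0.0
```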
Some embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings.
The description that follows describes systems, methods, techniques, instruction sequences, and computing machine program products that illustrate example embodiments of the present subject matter. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide an understanding of various embodiments of the present subject matter. It will be evident, however, to those skilled in the art, that embodiments of the present subject matter may be practiced without some or other of these specific details. Examples merely typify possible variations. Unless explicitly stated otherwise, structures (e.g., structural components) are optional and may be combined or subdivided, and operations (e.g., in a procedure, algorithm, or other function) may vary in sequence or be combined or subdivided.
Example embodiments provide a technical solution for dealing with the technical problem of accelerating operations associated with a microprocessor. Specifically, example systems and methods enable generation of significand with high precision and utilize the significand to accelerate numerical computation. Further, example systems and methods enable generation of an unbounded exponent and utilize the unbounded exponent to accelerate numerical computation. The systems and methods are suitable for arithmetic operations on fixed-point, block floating-point, and/or floating-point operands in their uncompressed or compressed formats. Furthermore, input and output operands are allowed to be in different formats. Because computations are accelerated by example embodiments, one or more of the methodologies described herein may obviate a need for certain efforts or computing resources that otherwise would be involved in conventional computational devices. Examples of such computing resources comprise processor cycles, memory usage, data storage capacity, and power consumption.
Example embodiments improve the operations of the microprocessor by using reciprocal or reciprocal square root instructions. Reciprocal or reciprocal square root instructions can provide novel instructions for a CPU, GPU, FPU, or DSP, or other microprocessors. Reciprocal or reciprocal square root instructions can also be an extension to accelerate CPU, GPU, FPU, DSP, or other microprocessors. Furthermore, reciprocal or reciprocal square root instructions can be (or be embodied within) an independent accelerator. Multiplication or other instructions may follow the reciprocal or reciprocal square root instructions to finish division, square root, or other complex operations, as will be discussed in further details below.
In accordance with some example embodiments, some instructions can disregard some mathematical equations when generating the reciprocal square root or other results. For example, a reciprocal square root instruction with an input of 0 (X) should generate +0 as an output, disregarding the equation of 1/√X. Based on some example embodiments, when computing square root (√X), the aforementioned equation is modified and becomes √X ≜ X*R by definition. This √X ≜ X*R equation is then applied to quickly generate a correct result. In accordance with some example embodiments, instructions should be aware of intentional disregard and modification of some equations when performing the X*R or other instructions.
The integrated circuit 102 comprises a microprocessor 106 such as a CPU, GPU, FPU, or DSP core. In example embodiments, the microprocessor 106 comprises an instruction fetch unit 108, a data fetch unit 110, control registers 112, register files 114, an instruction decoder 116, and an execution unit 118. The instruction fetch unit 108 is configured to fetch instructions. For example, the instructions can be fetched from the external memory 104, a cache (not illustrated), or the like. The instruction decoder 116 decodes the instructions from the instruction fetch unit 108 and sends decoded instructions to the execution unit 118. While the instruction fetch unit 108 and the instruction decoder 116 are shown as two distinct units, some embodiments can integrate the functions of the two units into a single unit. Additionally, while the instruction decoder 116 and the data fetch unit 110 are shown as two distinct units, some embodiments can integrate the functions of the two units into a single unit.
The execution unit 118 is further coupled to the control registers 112 and the register files 114. The register files 114 can be a register set, a storage, or a combination thereof.
In example embodiments, the execution unit 118 determines a location of operands to be fetched for use by the instruction and provides the location to the data fetch unit 110. The data fetch unit 110 retrieves the requested operands from the location (e.g., the external memory 104, the register files 114, cache). The execution unit 118 performs the instruction using an arithmetic logic unit 120. When the instruction is retired, one or more resultants are provided to a store unit 122 which stores the resultants. For example, the resultants can be stored to the external memory 104, the register files 114, or the cache.
In some embodiments, reciprocal or reciprocal square root instructions can be novel instructions of the microprocessor 106. The resultant of the reciprocal or reciprocal square root instructions can be stored, for example, in the external memory 104, the register files 114, or the cache. Multiplication or other instructions may follow the reciprocal or reciprocal square root instructions to finish division, square root, or other complex operations, as will be discussed in further detail below.
Division and square root are fundamental operations for computers to precisely render and visualize two-dimensional or higher-dimensional (2D+) objects, such as, for example, generating a photorealistic 2D or 3D image of a house to be built based on a model from an architect or designer, scaling a picture to fit onto a paper for printing, resizing a video game character or virtual reality avatar as it moves forward or backward, or visualizing a 3D molecular structure. Thus, fast division and square root operations improve the functions of a computing device, improve productivity, and improve the user experience.
In example embodiments, any of the units, registers, files, decoders (collectively referred to as “components”) shown in, or associated with,
Moreover, any of the components illustrated in
When instructed by the operation 204 to perform the reciprocal instruction, the device 200 generates the output 206 which comprises a significand with a value in a range of [1.0, 2.0) for a non-zero finite numeric input. In order to have the significand be in such a range, the device 200 may compute as though the exponent is unbounded by any format, bit width, bias, or otherwise. When an input (e.g., input 202) is zero, infinity, or non-numeric, the device 200 may compute as if the input is a non-zero finite numeric or compute according to a standard associated with a corresponding data format (e.g., IEEE 754-2019). This instruction may be referred to as “Exponent-Unbounded Reciprocal.”
The output 206 may optionally comprise an exponent output. For example, when the output should be 1.0*2^−127 (e.g., exponent is −127) but the minimum representable exponent is −126, the device 200 may optionally generate an exponent output with a value of −1 to indicate the output exponent is one less than the minimum representable exponent.
Multiplication can be an instruction of the CPU, GPU, FPU, DSP, or another microprocessor (e.g., the microprocessor 106). Multiplication can also be an extension to accelerate the CPU, GPU, FPU, DSP, or another microprocessor. In some embodiments, multiplication instructions can be (or be embodied within) an independent accelerator.
In example embodiments, the device 300 multiplies the first operand 302 with the second operand 304 and optionally adjusts an exponent to generate a correct result. The device 300 may be controlled by an operation 308 which instructs the device 300 to perform a multiplication with exponent adjustment. Such an instruction can be referred to as “Exponent-Adjusted Multiplication.” In some embodiments, the operation 308 is issued by an instruction decoder (e.g., the instruction decoder 116) of the CPU, GPU, FPU, DSP, or another microprocessor (e.g., the microprocessor 106). The first operand 302, the second operand 304, and the optional third operand 306 can be received from register file output ports of a register file (e.g., the register file 114). An output of the device 300 can be transmitted to a register file input port of a register file (e.g., the register file 114).
In example embodiments, the device 300 adjusts the exponent in one of several ways. In a first manner, if the device 300 receives an external exponent (e.g., an exponent output from the output 206) from the first input 302 directly or via a format converter, the device 300 may use the external exponent to adjust the exponent. For example, when the device 300 realizes that the external exponent is −1 (e.g., one less than the representable minimum), the device 300 understands the reciprocal is actually 1.0*2^−127. The device 300 multiplies 1.0*2^−127 with 1.0*2^127 and generates 1.0 as the correct result (e.g., output 310).
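The first manner can be sketched behaviorally as follows, with a number modeled as a pair of a significand in [1.0, 2.0) and an integer exponent. The function names, the tuple layout, and the choice to hold the stored exponent at the format limit while reporting the excess separately are assumptions made for illustration and are not taken from the disclosure.

```python
E_MIN, E_MAX = -126, 127   # binary32 normal exponent range, per the text

def unbounded_reciprocal(sig: float, exp: int):
    """Reciprocal of sig * 2**exp, for sig in [1.0, 2.0).

    Returns (significand, stored_exponent, external_exponent), where
    external_exponent is how far the true exponent lies outside
    [E_MIN, E_MAX], or 0 when it is inside (assumed convention).
    """
    r_sig, r_exp = 1.0 / sig, -exp
    if r_sig < 1.0:                      # renormalize (0.5, 1.0) into [1.0, 2.0)
        r_sig, r_exp = r_sig * 2.0, r_exp - 1
    if r_exp < E_MIN:
        return r_sig, E_MIN, r_exp - E_MIN
    if r_exp > E_MAX:
        return r_sig, E_MAX, r_exp - E_MAX
    return r_sig, r_exp, 0

def multiply_with_external_exponent(n_sig, n_exp, r_sig, r_exp, r_external):
    """Multiply numerator by reciprocal, restoring the true exponent first."""
    sig = n_sig * r_sig
    exp = n_exp + r_exp + r_external
    if sig >= 2.0:                       # product of two [1, 2) values is in [1, 4)
        sig, exp = sig / 2.0, exp + 1
    return sig, exp

# Dividing 1.0*2^127 by 1.0*2^127, as in the example above.
recip = unbounded_reciprocal(1.0, 127)
quotient = multiply_with_external_exponent(1.0, 127, *recip)
```

In this sketch the reciprocal carries an external exponent of −1, and the product evaluates to a significand of 1.0 with an exponent of 0.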
In a second manner, the device 300 may internally generate an exponent output based on the same exponent from the third input operand 306 (e.g., a same denominator exponent) by performing the same calculation as in the device 200 of
Example embodiments are also applicable to square root and other operations. To compute square root, many conventional CPUs, GPUs, FPUs, and DSPs apply the same families of slow iterative and recurrent algorithms as for division. Example embodiments, however, provide a fast way to compute square root. For example, referring back to
When instructed by the operation 204 to perform Exponent-Unbounded Reciprocal Square Root, the device 200 generates the output 206 in the following manner in accordance with example embodiments. When an input (e.g., input 202) is a non-zero finite number, the device 200 may compute according to the equation of 1/√X and generate the output 206. If the input (e.g., input 202) is zero (±0) or positive infinity (+∞), the device 200 may perform an absolute operation (|X|) instead and generate an absolute value as the output 206. If the input (e.g., input 202) is non-numeric (e.g., NaN in IEEE Std 754-2019) or negative infinity (−∞), the device 200 may generate a non-numeric (e.g., NaN in IEEE Std 754-2019) as the output 206.
In some embodiments, the device 200 may be implemented according to
Referring back to
By switching among various ways to generate the output 206, as described earlier, the square root of an input operand (X) can be correctly computed by deploying devices 200 and 300. For example, when the input 202 is infinity (∞), the device 200 generates an infinity (∞) as the output 206. The device 300 performs the Exponent-Unadjusted Multiplication and multiplies X by R. The multiplication of ∞ (X) and ∞ (R) produces ∞. Thus, the device 300 generates an infinity (∞) as the output 310. The square root of infinity (∞) is infinity (∞). This can be represented by the following equation:
√∞ ≜ X*R = ∞*∞ = ∞
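A behavioral sketch of the input handling described for Exponent-Unbounded Reciprocal Square Root, followed by the X*R multiplication, is given below. It uses Python binary64 values, the function names are hypothetical, and returning NaN for a negative finite input is an assumption that follows IEEE Std 754, since the text does not address that case separately.

```python
import math

def unbounded_rsqrt(x: float) -> float:
    """Reciprocal square root with the special cases described in the text."""
    if math.isnan(x) or x == -math.inf:
        return math.nan                  # non-numeric or negative infinity
    if x == 0.0 or x == math.inf:
        return abs(x)                    # |X| replaces 1/sqrt(X) for zero and +infinity
    if x < 0.0:
        return math.nan                  # assumption: negative finite input is invalid
    return 1.0 / math.sqrt(x)            # ordinary 1/sqrt(X)

def sqrt_via_rsqrt(x: float) -> float:
    """Square root computed as X*R, where R comes from unbounded_rsqrt."""
    return x * unbounded_rsqrt(x)
```

With these special cases, sqrt_via_rsqrt(0.0) evaluates 0*0 and sqrt_via_rsqrt(math.inf) evaluates ∞*∞, so neither case reaches the invalid 0*∞ product.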
In addition to Exponent-Unbounded Reciprocal and Exponent-Unbounded Reciprocal Square Root instructions, the operation 204 may instruct the device 200 to perform reciprocal or reciprocal square root while honoring any exponent range as specified by a corresponding format, bit width, bias, encoding, compression, or a combination thereof and generate the output 206 accordingly. Such instructions are referred to as “Exponent-Bounded Reciprocal” and “Exponent-Bounded Reciprocal Square Root,” respectively. When performing Exponent-Bounded Reciprocal Square Root operation, the device 200 follows the common 1/√X equation for any input 202.
Additions of Exponent-Bounded Reciprocal and Exponent-Bounded Reciprocal Square Root enable the device 200 to be utilized independently from the device 300 and generate reciprocal or reciprocal square root as commonly expected. The device 200 can embody any of Exponent-Unbounded Reciprocal, Exponent-Unbounded Reciprocal Square Root, Exponent-Bounded Reciprocal, Exponent-Bounded Reciprocal Square Root, and/or other instructions. Example embodiments also allow an embodiment without operation 204. In these embodiments, the device 200 is an accelerator or extension.
In addition to Exponent-Adjusted Multiplication and Exponent-Unadjusted Multiplication, the operation 308 may instruct the device 300 to perform another instruction such as multiply-add by multiplying the first input 302 by the second input 304 and adding the third input 306 to a product from the multiplication to generate an output 310. The device 300 can embody any of Exponent-Adjusted Multiplication, Exponent-Unadjusted Multiplication, and/or other instructions. Example embodiments also allow for an embodiment without the operation 308. In these embodiments, the device 300 is an accelerator or extension.
Reciprocal or reciprocal square root can also be an extension to accelerate CPU, GPU, FPU, DSP, or another microprocessor (e.g., the microprocessor 106). Furthermore, reciprocal or reciprocal square root instructions can be (or be embodied within) an independent accelerator. An extension can be implemented in a similar way as an independent accelerator. As an extension or accelerator, example embodiments are embodied without an instruction or data fetch unit (e.g., the data fetch unit 110). In some embodiments, a microprocessor provides an operand to an extension or accelerator. The microprocessor may receive a result from the extension or accelerator. Any of the extensions, accelerators, and devices discussed herein may be a hardware device (e.g., a hardware accelerator).
In the embodiment of
In the embodiment of
A reciprocal component 602 provides a reciprocal resultant based on a precomputed table, approximation, polynomial (e.g., Taylor Series), interpolation (e.g., Chebyshev, minimax), or a combination thereof. A reciprocal square root component 604 provides a reciprocal square root resultant based on a precomputed table, approximation, polynomial (e.g., Taylor Series), interpolation (e.g., Chebyshev, minimax), or a combination thereof. In one embodiment, the reciprocal square root component provides a reciprocal square root resultant (e.g., an output significand) with a value in a range of [1.0,2.0). The approximation, polynomial, and/or interpolation may use a small, precomputed table. In some embodiments, any of the precomputed tables may be lookup tables that are implemented with hardware decoders.
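As one hedged illustration of a precomputed table, the sketch below indexes a reciprocal table by the leading fraction bits of a significand in [1.0, 2.0). The table size of 128 entries, the midpoint sampling, and the names TABLE_BITS, RECIP_TABLE, and table_reciprocal are assumptions chosen for the example. The disclosure does not specify a table layout.

```python
TABLE_BITS = 7   # assumed: index with the top 7 fraction bits (128 entries)

# Entry i approximates 1/s for every s in [1 + i/128, 1 + (i + 1)/128),
# sampled at the midpoint of that interval.
RECIP_TABLE = [
    1.0 / (1.0 + (i + 0.5) / 2**TABLE_BITS) for i in range(2**TABLE_BITS)
]

def table_reciprocal(sig: float) -> float:
    """Look up an approximate reciprocal of a significand in [1.0, 2.0)."""
    if not 1.0 <= sig < 2.0:
        raise ValueError("significand must be in [1.0, 2.0)")
    index = int((sig - 1.0) * 2**TABLE_BITS)   # leading fraction bits
    return RECIP_TABLE[index]
```

The looked-up value lies in (0.5, 1.0), so a one-bit renormalization would be needed to bring it back into [1.0, 2.0). A polynomial or interpolation stage, as mentioned above, could refine the table value.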
The detector 620 is configured to detect whether an input 612 is a zero (0), infinity (∞), or non-numeric (e.g., NaN in IEEE Std 754-2019) and provides the corresponding output when performing Exponent-Unbounded Reciprocal Square Root or other instruction. The selector 606 is configured to select an appropriate result. For example, the result may be selected according to an instructing signal from the CPU, GPU, FPU, DSP, or another microprocessor (e.g., the microprocessor 106). The output of the selector 606 is the output 608. In some embodiments, the selector 606 is implemented with a hardware mux.
When performing Exponent-Unbounded instructions, a first subtracter 610 subtracts a count of leading 0 bit(s) of significand from an exponent portion of the input 612 and generates a difference. When performing reciprocal square root instructions, the difference is right shifted by one (1) to truncate its least significant bit. A negater 614 changes a positive number to a negative number, and vice versa, to generate an “unbounded exponent.” An unbounded exponent is an exponent unbounded by a corresponding format, bit width, bias, or a combination thereof.
When performing Exponent-Unbounded Reciprocal and, optionally, Exponent-Unbounded Reciprocal Square Root instructions, a second subtracter 616 subtracts a minimum representable exponent from the output of the negater 614 based on the output of the negater 614 being less than the minimum representable exponent. Alternatively, the second subtracter 616 subtracts a maximum representable exponent from the output of the negater 614 based on the output of the negater 614 being greater than the maximum representable exponent. In some embodiments, the negater 614 and the second subtracter 616 can be implemented with hardware adders. In some embodiments, it is also possible to merge the negater 614 and the second subtracter 616 into a single hardware adder. The selector 622 may select a suitable corresponding output from the detector 620 if the input 612 is a zero (0), infinity (∞), or non-numeric (e.g., NaN in IEEE Std 754-2019) when performing Exponent-Unbounded Reciprocal Square Root or other instruction. The output of the selector 622 is an exponent output 618.
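The exponent path described above can be sketched with integer arithmetic as follows. The function names are hypothetical, the exponent range is that of binary32, and returning 0 from the second stage when the unbounded exponent is already representable is an assumption, because the text only describes the two out-of-range cases.

```python
E_MIN, E_MAX = -126, 127   # binary32 normal exponent range

def unbounded_exponent(exponent: int, leading_zeros: int, rsqrt: bool = False) -> int:
    """First subtracter 610, optional right shift, then negater 614."""
    difference = exponent - leading_zeros   # first subtracter 610
    if rsqrt:
        difference >>= 1                    # arithmetic right shift by one
    return -difference                      # negater 614

def exponent_output(unbounded: int) -> int:
    """Second subtracter 616: distance outside the representable range."""
    if unbounded < E_MIN:
        return unbounded - E_MIN
    if unbounded > E_MAX:
        return unbounded - E_MAX
    return 0                                # assumption: no excess when in range
```

For a reciprocal of 1.0*2^127, unbounded_exponent(127, 0) gives −127 and exponent_output(−127) gives −1. For a reciprocal square root of 1.0*2^−128, which binary32 holds as an exponent of −126 with two leading 0 bits, unbounded_exponent(−126, 2, rsqrt=True) gives 64.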
To maximize hardware component sharing, the arithmetic logic unit 600 of
To minimize hardware footprint, reciprocal instructions can be implemented separately as a smaller device, by removing the reciprocal square root component 604 and the selector 606. The device 402 in
Referring now to
When performing Exponent-Adjusted Multiplication, an adjuster 718 compares an exponent portion of a denominator 720 against a maximum representable exponent and counts an amount of leading 0s of denominator significand (e.g., portion of the denominator 720). If the exponent of the denominator 720 equals the maximum representable exponent, the adjuster 718 generates a minimum representable exponent (e.g., an exponent adjustment). If the amount of leading 0s of the denominator significand of the denominator 720 is greater than zero (0), the adjuster 718 generates a maximum representable exponent (e.g., an exponent adjustment). Otherwise, the adjuster 718 generates a zero (0). A hardware adder 722 sums up the exponent 704 (M0 exponent), the exponent 710 (M1 exponent), and the exponent adjustment from the adjuster 718 to generate an exponent output 724. When performing Exponent-Unadjusted Multiplication, the adjuster 718 generates a zero (0) resulting in no exponent adjustment.
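The adjuster rules and the three-input sum can be sketched as below. The function names are hypothetical, and the M0 exponent of −1 used in the example is an assumption based on the exponent output value given earlier for the reciprocal of 1.0*2^127.

```python
E_MIN, E_MAX = -126, 127   # binary32 normal exponent range

def exponent_adjustment(den_exponent: int, den_leading_zeros: int,
                        adjusted: bool = True) -> int:
    """Adjuster 718, following the rules stated in the text."""
    if not adjusted:
        return 0                  # Exponent-Unadjusted Multiplication
    if den_exponent == E_MAX:
        return E_MIN              # denominator exponent at the maximum
    if den_leading_zeros > 0:
        return E_MAX              # denominator significand has leading 0s
    return 0

def exponent_sum(m0_exponent: int, m1_exponent: int, adjustment: int) -> int:
    """Adder 722: M0 exponent plus M1 exponent plus the adjustment."""
    return m0_exponent + m1_exponent + adjustment

# Dividing 1.0*2^127 by 1.0*2^127: M0 is the reciprocal, M1 is the numerator.
adjustment = exponent_adjustment(den_exponent=127, den_leading_zeros=0)
result_exponent = exponent_sum(-1, 127, adjustment)
```

Here the adjustment is −126 and the summed exponent is 0, which agrees with the expected quotient of 1.0*2^0.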
In example embodiments, the adjuster 718 may generate the exponent adjustment in at least two alternative ways. In a first manner, if the denominator input 720 comprises a denominator sign (but not necessarily exponent or mantissa) and if Exponent-Unbounded instructions additionally generate a sign output which differs from the denominator sign when overflow or underflow occurs, the adjuster 718 may compare the Reciprocal or Reciprocal Square Root output sign against the denominator sign. The adjuster 718 generates a minimum representable exponent when the signs differ and the M0 exponent 704 (part of the input multiplicand 702) is negative. The adjuster 718 generates a maximum representable exponent based on the signs differing and the M0 exponent 704 being positive. Otherwise, the adjuster 718 generates zero (0).
In a second manner, if the denominator input 720 is unavailable, the adjuster 718 may check a least significant bit (LSB) of the Reciprocal or Reciprocal Square Root mantissa output (part of 702). Some embodiments preset a predetermined value as 0 or 1. The adjuster 718 generates a minimum representable exponent when the LSB differs from the predetermined value (e.g., 0 or 1) and the M0 exponent 704 is negative. The adjuster 718 generates a maximum representable exponent when the LSB differs from the predetermined value and the M0 exponent 704 is positive. Otherwise, the adjuster 718 generates zero (0).
Alternatively, when an unbounded exponent is available as part of the first input multiplicand (M0) 702, the arithmetic logic unit 700 does not have to comprise the adjuster 718, and the adder 722 can be a 2-input adder which sums up the M0 exponent 704 (the unbounded exponent) and the M1 exponent 710. As the unbounded exponent is available, no adjustment is necessary.
Optionally, the arithmetic logic unit 700 may further comprise an exception handler 726. The exception handler 726 checks the first input multiplicand (M0) 702, the second input multiplicand (M1) 708, and, optionally, the denominator input 720. The exception handler 726 may signal an invalid operation when any of the following is detected: 1) any of the inputs 702, 708, and 720 is a signaling non-numeric (e.g., signaling NaN in IEEE Std 754-2019); or 2) one of the inputs 702 and 708 represents 0 while the other one represents infinity. The exception handler 726 may signal a division by zero operation when one of the inputs 702 and 708 represents a non-zero finite number while the other one represents infinity.
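The checks attributed to the exception handler 726 can be sketched as a small classifier. Python floats do not expose whether a NaN is signaling, so the sketch takes that information as a separate boolean argument. That argument, the Signal enumeration, and the function name are assumptions for illustration, and the optional denominator input 720 is represented only through that flag.

```python
import math
from enum import Enum

class Signal(Enum):
    NONE = "none"
    INVALID = "invalid operation"
    DIVISION_BY_ZERO = "division by zero"

def check_exceptions(m0: float, m1: float, any_signaling_nan: bool = False) -> Signal:
    """Classify the multiplicand pair according to the conditions in the text."""
    if any_signaling_nan:
        return Signal.INVALID                      # condition 1
    pair = (m0, m1)
    has_infinity = any(math.isinf(v) for v in pair)
    has_zero = any(v == 0.0 for v in pair)
    has_nonzero_finite = any(math.isfinite(v) and v != 0.0 for v in pair)
    if has_infinity and has_zero:
        return Signal.INVALID                      # condition 2: 0 and infinity
    if has_infinity and has_nonzero_finite:
        return Signal.DIVISION_BY_ZERO             # non-zero finite and infinity
    return Signal.NONE
```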
To maximize hardware component sharing, the arithmetic logic unit 700 of
To minimize hardware footprint, Exponent-Adjusted Multiplication instructions can be implemented separately as a smaller device, by hardwiring to perform Exponent-Adjusted Multiplication instruction. The device 404 in
Example embodiments allow for integrating Exponent-Unbounded Reciprocal and Exponent-Adjusted Multiplication into a single device, integrating Exponent-Unbounded Reciprocal Square Root and Exponent-Unadjusted Multiplication into a single device, or both. When embodying such an integration, the adder 722 of
The embodiments of
Using the VTVT standard cell library, Icarus Verilog may implement the first and second subtracters 610 and 616 and the adder 722 with “fulladder” cells, implement the negater 614 with “inv_1” cells, implement the selector 606 with “mux_2” cells, implement the adjuster 718 with “fulladder” and “nand4_4” cells, implement the multiplier 714 with “fulladder” and “and3_4” cells, and/or implement the reciprocal component 602 and the reciprocal square root component 604 with “nand4_2,” “fulladder,” “and3_2” cells, or a combination thereof. Any precomputed table (e.g., the reciprocal component 602 and reciprocal square root component 604) can be implemented as a read-only memory (ROM).
In example embodiments, GNU Octave, FreeMat, or other programming languages can be used to precompute reciprocal and reciprocal square root and store the resultants as predetermined tables in the reciprocal component 602 and the reciprocal square root element 604, respectively. Alternatively, approximation, polynomial (e.g., Taylor Series), interpolation (e.g., Chebyshev, minimax), or a combination thereof can be applied to generate outputs of the reciprocal component 602 and the reciprocal square root component 604.
In order to ensure the silicon chips are free of manufacturing defects, the following operations can be deployed: (1) dividing 1.0*2^127 by 1.0*2^127; and/or (2) computing a square root of 1.0*2^−128.
Referring back to
The reciprocal component 602 sends 1.0 (e.g., reciprocal of 1.0) to the selector 606. By performing Exponent-Unbounded Reciprocal, the selector 606 selects the output from the reciprocal component 602 and sends 1.0 as the mantissa output 608.
As the device 404 is implemented (e.g., using the embodiment of
The multiplier 714 multiplies the mantissa bits 706 (e.g., 1.0) by the mantissa bits 712 (e.g., 1.0) and sends 1.0 as a mantissa output 716. By combining the exponent output 724 (e.g., 0) and mantissa output 716 (e.g., 1.0) together, 1.0*2^0 or 1.0 is the correct result of dividing 1.0*2^127 by 1.0*2^127.
Referring now to
The reciprocal square root component 604 sends 1.0 (e.g., reciprocal square root of normalized 0.25) to the selector 606. When performing Exponent-Unbounded Reciprocal Square Root, the selector 606 selects the output from the reciprocal square root component 604 and sends 1.0 as the mantissa output 608.
As the device 504 is implemented (e.g., using the embodiment of
The multiplier 714 multiplies the mantissa bits 706 (1.0) by the mantissa bits 712 (1.0) and sends 1.0 as the mantissa output 716. By combining the exponent output 724 (−64) and mantissa output 716 (1.0) together, 1.0*2^−64 is the correct result of √(1.0*2^−128).
In alternative embodiments, the machine 800 operates as a standalone device or may be connected (e.g., networked) to other machines. In a networked deployment, the machine 800 may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. The machine 800 may be a server computer, a client computer, a personal computer (PC), a tablet computer, a laptop computer, a netbook, a set-top box (STB), a personal digital assistant (PDA), a cellular telephone, a smartphone, a web appliance, a network router, a network switch, a network bridge, or any machine capable of executing the instructions 824 (sequentially or otherwise) that specify actions to be taken by that machine. Further, while only a single machine is illustrated, the term “machine” shall also be taken to include a collection of machines that individually or jointly execute the instructions 824 to perform any one or more of the methodologies discussed herein.
The machine 800 comprises a processor 802 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a radio-frequency integrated circuit (RFIC), or any suitable combination thereof), a main memory 804, and a static memory 806, which are configured to communicate with each other via a bus 808. The processor 802 may contain microcircuits that are configurable, temporarily or permanently, by some or all of the instructions 824 such that the processor 802 is configurable to perform any one or more of the methodologies described herein, in whole or in part. For example, a set of one or more microcircuits of the processor 802 may be configurable to execute one or more modules (e.g., software modules) or instructions.
The machine 800 may further comprise a graphics display 810 (e.g., a plasma display panel (PDP), a light emitting diode (LED) display, a liquid crystal display (LCD), a projector, or a cathode ray tube (CRT), or any other display capable of displaying graphics or video). The machine 800 may also comprise an input device 812 (e.g., a keyboard), a cursor control device 814 (e.g., a mouse, a touchpad, a trackball, a joystick, a motion sensor, or other pointing instrument), a storage unit 816, a signal generation device 818 (e.g., a sound card, an amplifier, a speaker, a headphone jack, or any suitable combination thereof), and a network interface device 820.
The storage unit 816 comprises a machine-storage medium 822 (e.g., a tangible machine-storage medium) on which is stored the instructions 824 (e.g., software) embodying any one or more of the methodologies or functions described herein. The instructions 824 may also reside, completely or at least partially, within the main memory 804, within the processor 802 (e.g., within the processor's cache memory), or both, before or during execution thereof by the machine 800. Accordingly, the main memory 804 and the processor 802 may be considered as machine-storage media (e.g., tangible and non-transitory machine-storage media). The instructions 824 may be transmitted or received over a network 826 via the network interface device 820.
In some example embodiments, the machine 800 may be a portable computing device and have one or more additional input components (e.g., sensors or gauges). Examples of such input components include an image input component (e.g., one or more cameras), an audio input component (e.g., a microphone), a direction input component (e.g., a compass), a location input component (e.g., a global positioning system (GPS) receiver), an orientation component (e.g., a gyroscope), a motion detection component (e.g., one or more accelerometers), an altitude detection component (e.g., an altimeter), and a gas detection component (e.g., a gas sensor). Inputs harvested by any one or more of these input components may be accessible and available for use by any of the modules described herein.
The various memories (e.g., 804, 806, and/or memory of the processor(s) 802) and/or storage unit 816 may store one or more sets of instructions and data structures (e.g., software 824) embodying or utilized by any one or more of the methodologies or functions described herein. These instructions, when executed by processor(s) 802 cause various operations to implement the disclosed embodiments.
As used herein, the terms “machine-storage medium,” “device-storage medium,” and “computer-storage medium” (referred to collectively as “machine-storage medium 822”) mean the same thing and may be used interchangeably in this disclosure. The terms refer to a single or multiple storage devices and/or media (e.g., a centralized or distributed database, and/or associated caches and servers) that store executable instructions and/or data, as well as cloud-based storage systems or storage networks that include multiple storage apparatus or devices. The terms shall accordingly be taken to include, but not be limited to, solid-state memories, and optical and magnetic media, including memory internal or external to processors. Specific examples of machine-storage media, computer-storage media, and/or device-storage media 822 include non-volatile memory, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), FPGA, and flash memory devices; magnetic disks such as internal hard disks and removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The terms machine-storage medium or media, computer-storage medium or media, and device-storage medium or media 822 specifically exclude carrier waves, modulated data signals, and other such media, at least some of which are covered under the term “signal medium” discussed below. In this context, the machine-storage medium is non-transitory.
The term “signal medium” or “transmission medium” shall be taken to include any form of modulated data signal, carrier wave, and so forth. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal.
The terms “machine-readable medium,” “computer-readable medium” and “device-readable medium” mean the same thing and may be used interchangeably in this disclosure. The terms are defined to include both machine-storage media and signal media. Thus, the terms include both storage devices/media and carrier waves/modulated data signals.
The instructions 824 may further be transmitted or received over a communications network 826 using a transmission medium via the network interface device 820 and utilizing any one of a number of well-known transfer protocols (e.g., HTTP). Examples of communication networks 826 include a local area network (LAN), a wide area network (WAN), the Internet, mobile telephone networks, plain old telephone service (POTS) networks, and wireless data networks (e.g., Wi-Fi, LTE, and WiMAX networks). The term “transmission medium” shall be taken to include any intangible medium that is capable of storing, encoding, or carrying instructions 824 for execution by the machine 800, and includes digital or analog communications signals or other intangible medium to facilitate communication of such software.
Throughout this specification, plural instances may implement components, operations, or structures described as a single instance. Although individual operations of one or more methods are illustrated and described as separate operations, one or more of the individual operations may be performed concurrently, and nothing requires that the operations be performed in the order illustrated. Structures and functionality presented as separate components in example configurations may be implemented as a combined structure or component. Similarly, structures and functionality presented as a single component may be implemented as separate components. These and other variations, modifications, additions, and improvements fall within the scope of the subject matter herein.
Certain embodiments are described herein as including logic or a number of components, modules, or mechanisms. Modules may constitute either software modules (e.g., code embodied on a machine-storage medium or in a transmission signal) or hardware modules. A “hardware module” is a tangible unit capable of performing certain operations and may be configured or arranged in a certain physical manner. In various example embodiments, one or more computer systems (e.g., a standalone computer system, a client computer system, or a server computer system) or one or more hardware modules of a computer system (e.g., a processor or a group of processors) may be configured by software (e.g., an application or application portion) as a hardware module that operates to perform certain operations as described herein.
In some embodiments, a hardware module may be implemented mechanically, electronically, or any suitable combination thereof. For example, a hardware module may include dedicated circuitry or logic that is permanently configured to perform certain operations. For example, a hardware module may be a special-purpose processor, such as a field programmable gate array (FPGA) or an ASIC. A hardware module may also include programmable logic or circuitry that is temporarily configured by software to perform certain operations. For example, a hardware module may include software encompassed within a general-purpose processor or other programmable processor. It will be appreciated that the decision to implement a hardware module mechanically, in dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.
Accordingly, the term “hardware module” should be understood to encompass a tangible entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired), or temporarily configured (e.g., programmed) to operate in a certain manner or to perform certain operations described herein. As used herein, “hardware-implemented module” refers to a hardware module. Considering embodiments in which hardware modules are temporarily configured (e.g., programmed), each of the hardware modules need not be configured or instantiated at any one instance in time. For example, where the hardware modules comprise a general-purpose processor configured by software to become a special-purpose processor, the general-purpose processor may be configured as respectively different hardware modules at different times. Software may accordingly configure a processor, for example, to constitute a particular hardware module at one instance of time and to constitute a different hardware module at a different instance of time.
Hardware modules can provide information to, and receive information from, other hardware modules. Accordingly, the described hardware modules may be regarded as being communicatively coupled. Where multiple hardware modules exist contemporaneously, communications may be achieved through signal transmission (e.g., over appropriate circuits and buses) between or among two or more of the hardware modules. In embodiments in which multiple hardware modules are configured or instantiated at different times, communications between such hardware modules may be achieved, for example, through the storage and retrieval of information in memory structures to which the multiple hardware modules have access. For example, one hardware module may perform an operation and store the output of that operation in a memory device (e.g., a register file) to which it is communicatively coupled. A further hardware module may then, at a later time, access the memory device to retrieve and process the stored output. Hardware modules may also initiate communications with input or output devices, and can operate on a resource (e.g., a collection of information).
The various operations of example methods described herein may be performed, at least partially, by one or more processors that are temporarily configured (e.g., by software) or permanently configured to perform the relevant operations. Whether temporarily or permanently configured, such processors may constitute processor-implemented modules that operate to perform one or more operations or functions described herein. As used herein, “processor-implemented module” refers to a hardware module implemented using one or more processors.
Similarly, the methods described herein may be at least partially processor-implemented, a processor being an example of hardware. For example, at least some of the operations of a method may be performed by one or more processors or processor-implemented modules. Moreover, the one or more processors may also operate to support performance of the relevant operations in a “cloud computing” environment or as a “software as a service” (SaaS). For example, at least some of the operations may be performed by a group of computers (as examples of machines comprising processors), with these operations being accessible via a network (e.g., the Internet) and via one or more appropriate interfaces (e.g., an application program interface (API)).
The performance of certain of the operations may be distributed among the one or more processors, not only residing within a single machine, but deployed across a number of machines. In some example embodiments, the one or more processors or processor-implemented modules may be located in a single geographic location (e.g., within a home environment, an office environment, or a server farm). In other example embodiments, the one or more processors or processor-implemented modules may be distributed across a number of geographic locations.
Example 1 is an integrated circuit for accelerating operations associated with a microprocessor. The integrated circuit comprises an accelerator that receives an input operand comprising an input exponent and an input mantissa and performs operations to generate an output operand, the accelerator comprising a selector that selects the output; a reciprocal square root component that provides a resultant with a value in a range of [1.0,2.0) to the selector; and a detector that detects whether the input operand is zero, infinity, or non-numeric and provides a value to the selector based on the input operand being zero, infinity, or non-numeric.
In example 2, the subject matter of example 1 can optionally comprise wherein the accelerator is an execution unit and the integrated circuit further comprises an instruction decode unit that decodes instructions comprising a reciprocal square root instruction; and a data fetch unit that accesses the input operand based on the reciprocal square root instruction.
In example 3, the subject matter of any of examples 1-2 can optionally comprise wherein the instruction decode unit and the data fetch unit are comprised within a single unit.
In example 4, the subject matter of any of examples 1-3 can optionally comprise wherein the reciprocal square root component includes a precomputed table.
In example 5, the subject matter of any of examples 1-4 can optionally comprise wherein the accelerator further comprises a subtracter and a negater that generates an unbounded exponent based on the input exponent.
In example 6, the subject matter of any of examples 1-5 can optionally comprise wherein the selector selects the output representing positive zero (+0) based on the input operand being zero (±0).
In example 7, the subject matter of any of examples 1-6 can optionally comprise wherein the selector selects the output representing infinity (+∞) based on the input operand being infinity (+∞).
In example 8, the subject matter of any of examples 1-7 can optionally comprise wherein the selector selects the output representing non-numeric (e.g., NaN in IEEE 754-2019) based on the input operand being negative non-zero.
In example 9, the subject matter of any of examples 1-8 can optionally comprise wherein the selector selects the output representing non-numeric (e.g., NaN in IEEE 754-2019) based on the input operand being non-numeric (e.g., NaN in IEEE 754-2019).
In example 10, the subject matter of any of examples 1-9 can optionally comprise wherein the selector selects the resultant from the reciprocal square root component as the output based on the input operand being positive non-zero finite.
In example 11, the subject matter of any of examples 1-10 can optionally comprise wherein the accelerator is further configured to perform a multiplication operation using the resultant and a second input operand by multiplying the resultant and the second input operand to obtain a multiplication result, the multiplication result comprising a result exponent and a result mantissa.
Example 12 is an integrated circuit for accelerating operations associated with a microprocessor. The integrated circuit comprises a multiplication device that receives a first operand and a second operand, the multiplication device configured to generate a result based on the first operand and the second operand, the multiplication device comprising a multiplier that multiplies the first operand by the second operand to generate the result; and an exception handler that checks the first operand and the second operand for zero, infinity, or non-numeric and signals an exception based on the check of the first operand and the second operand.
In example 13, the subject matter of example 12 can optionally comprise wherein the exception comprises a division by zero operation based on one of the first operand or the second operand being non-zero finite and the other being infinity.
Example 14 is a method for accelerating operations associated with a microprocessor. The method comprises receiving, by an accelerator, an input operand comprising an input exponent and an input mantissa; generating, by the accelerator, an output based on the input operand, the generating including determining, using a reciprocal square root component, a resultant with a value in a range of [1.0,2.0); providing the resultant to a selector; detecting, by a detector, whether the input operand is zero, infinity, or non-numeric; based on the input operand being zero, infinity, or non-numeric, providing a value from the detector to the selector; selecting, by the selector, the output based on the resultant and the value from the detector; and transmitting the output.
In example 15, the subject matter of example 14 can optionally comprise wherein the selecting the output comprises selecting an output representing positive zero based on the input operand being zero.
In example 16, the subject matter of any of examples 14-15 can optionally comprise wherein the selecting the output comprises selecting an output representing infinity (+∞) based on the input operand being infinity (+∞).
In example 17, the subject matter of any of examples 14-16 can optionally comprise wherein the selecting the output comprises selecting an output representing non-numeric based on the input operand being negative non-zero.
In example 18, the subject matter of any of examples 14-17 can optionally comprise wherein the selecting the output comprises selecting an output representing non-numeric based on the input operand being non-numeric.
In example 19, the subject matter of any of examples 14-18 can optionally comprise wherein the generating further comprises generating an unbounded exponent based on the input exponent.
In example 20, the subject matter of any of examples 14-19 can optionally comprise wherein the selecting the output comprises selecting the resultant from the reciprocal square root component as the output based on the input operand being positive non-zero finite.
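By way of illustration only, the detection and selection behavior recited in examples 6-10 and 15-20 may be modeled in software roughly as follows. This is a minimal sketch and not the disclosed circuit: the function names (unpack_binary32, detect, select_output) are hypothetical, the binary32 field layout is the one defined by IEEE Std 754-2019, and the resultant is passed in as a stand-in because the internals of the reciprocal square root component are not modeled here.

```python
import math
import struct


def unpack_binary32(x: float):
    """Round x to binary32 and return (sign, biased exponent, mantissa) fields.

    Hypothetical helper for illustration. struct.pack raises OverflowError
    for finite values too large for binary32.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def detect(x: float):
    """Model of the detector: return a special value, or None if the input
    is positive non-zero finite and the resultant should be used."""
    sign, exponent, mantissa = unpack_binary32(x)
    if exponent == 0xFF and mantissa != 0:
        return math.nan   # non-numeric input (examples 9 and 18)
    if exponent == 0 and mantissa == 0:
        return 0.0        # zero input of either sign gives +0 (examples 6 and 15)
    if sign == 1:
        return math.nan   # negative non-zero input (examples 8 and 17)
    if exponent == 0xFF:
        return math.inf   # +infinity input (examples 7 and 16)
    return None


def select_output(x: float, resultant: float) -> float:
    """Model of the selector: choose between the detector value and the
    resultant supplied by the reciprocal square root component."""
    special = detect(x)
    return resultant if special is None else special
```

In this sketch the detector takes priority over the resultant, so the selector forwards the resultant only when the input operand is positive non-zero finite (examples 10 and 20). Negative infinity is treated as a negative non-zero input and therefore yields a non-numeric output, which is an assumption drawn from the wording of examples 8 and 17.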
Some portions of this specification may be presented in terms of algorithms or symbolic representations of operations on data stored as bits or binary digital signals within a machine memory (e.g., a computer memory). These algorithms or symbolic representations are examples of techniques used by those of ordinary skill in the data processing, computer arithmetic, or mathematical algorithm arts to convey the substance of their work to others skilled in the art. As used herein, an “algorithm” is a self-consistent sequence of operations or similar processing leading to a desired result. In this context, algorithms and operations involve physical manipulation of physical quantities. Typically, but not necessarily, such quantities may take the form of electrical, magnetic, or optical signals capable of being stored, accessed, transferred, combined, compared, or otherwise manipulated by a machine. It is convenient at times, principally for reasons of common usage, to refer to such signals using words such as “data,” “content,” “bits,” “values,” “elements,” “symbols,” “characters,” “terms,” “numbers,” “numerals,” “sign,” “exponent,” “mantissa,” “significand” or the like. These words, however, are merely convenient labels and are to be associated with appropriate physical quantities.
Unless specifically stated otherwise, discussions herein using words such as “processing,” “computing,” “calculating,” “determining,” “presenting,” “displaying,” “subtracting,” “negating,” “forwarding,” “inverting,” “sending,” “generating,” “selecting,” “summing,” “multiplying,” “adjusting,” or the like may refer to actions or processes of a machine (e.g., a computer) that manipulates or transforms data represented as physical (e.g., electronic, magnetic, or optical) quantities within one or more memories (e.g., volatile memory, non-volatile memory, or any suitable combination thereof), registers, or other machine components that receive, store, transmit, or display information. Furthermore, unless specifically stated otherwise, the terms “a” or “an” are herein used, as is common in patent documents, to include one or more than one instance. Finally, as used herein, the conjunction “or” refers to a non-exclusive “or,” unless specifically stated otherwise.
Although an overview of the present subject matter has been described with reference to specific example embodiments, various modifications and changes may be made to these embodiments without departing from the broader scope of embodiments of the present invention. For example, various embodiments or features thereof may be mixed and matched or made optional by a person of ordinary skill in the art. Such embodiments of the present subject matter may be referred to herein, individually or collectively, by the term “invention” merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or present concept if more than one is, in fact, disclosed.
The embodiments illustrated herein are believed to be described in sufficient detail to enable those skilled in the art to practice the teachings disclosed. Other embodiments may be used and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. The Detailed Description, therefore, is not to be taken in a limiting sense, and the scope of various embodiments is defined only by the appended claims, along with the full range of equivalents to which such claims are entitled.
Moreover, plural instances may be provided for resources, operations, or structures described herein as a single instance. Additionally, boundaries between various resources, operations, modules, engines, and data stores are somewhat arbitrary, and particular operations are illustrated in a context of specific illustrative configurations. Other allocations of functionality are envisioned and may fall within a scope of various embodiments of the present invention. In general, structures and functionality presented as separate resources in the example configurations may be implemented as a combined structure or resource. Similarly, structures and functionality presented as a single resource may be implemented as separate resources. These and other variations, modifications, additions, and improvements fall within a scope of embodiments of the present invention as represented by the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.