This invention relates to the field of computer arithmetic. More precisely, it relates to radix-1000 decimal floating-point numbers of various sizes (32 bits, 64 bits, and 128 bits) that use a skewed representation of the fraction to maintain precision and accurate arithmetic on decimal floating-point numbers. It also relates to the implementation of radix-1000 floating point arithmetic units in a method, system and/or computer program product capable of use of floating point arithmetic units for calculation and processing.
The IEEE 754-1985 standard was established for binary floating-point numbers. See David Stevenson, et al., “IEEE Standard for Binary Floating-Point Arithmetic”, IEEE Std 754-1985, March 1985, incorporated herein by reference in its entirety. The widely adopted 1985 standard defined the format and encoding of single, double, and extended-precision binary floating-point data that included normal and subnormal numbers, signed zeros, signed infinities, and special “not a number” (NaN) values. The 1985 standard defined arithmetic, conversion, comparison operations, rounding rules, and special arithmetic on signed zeros, infinities, and NaNs. The major drawback of binary floating-point numbers is their inability to represent decimal fractions (such as 0.2) exactly in binary. The binary fraction must be rounded to the required precision.
The more recent IEEE 754-2008 standard (D. Zuraz, M. Cowlishaw, et al., “IEEE Standard for Floating-Point Arithmetic”, IEEE Std 754-2008, August 2008, incorporated herein by reference in its entirety) extended the original IEEE 754-1985 by adding decimal floating-point numbers. The need for decimal floating-point is important for financial applications, such as banking, taxes, and currency conversions. The use of binary floating-point numbers is inadequate for such applications because of rounding errors, which can be significant in some applications. See M. Cowlishaw, “Decimal Floating-Point: Algorism for Computers”, 16th IEEE Symposium on Computer Arithmetic (ARITH'03), p. 104-111, June 2003, incorporated herein by reference in its entirety.
Although the IEEE 754-2008 decimal floating-point standard can represent decimal fractions exactly with finite precision, the decimal format is complex. The format has three fields: a sign bit, a combination field, and a coefficient continuation field. The coefficient continuation field uses 10-bit declets to encode Binary Coded Decimal (BCD) digits. This encoding scheme is known as Densely Packed Decimal (DPD). See M. Cowlishaw, “Densely Packed Decimal Encoding”, IEE Proceedings—Computers and Digital Techniques, vol. 149, p. 102-104, May 2002, incorporated herein by reference in its entirety. Each declet requires logic to unpack the three BCD digits and then pack them at the end of each operation. See Id.; and S. Carlough, A. Collura, S. Mueller, and M. Kroener, “The IBM zEnterprise-196 decimal floating-point accelerator”, in Proceedings of the 20th IEEE Symposium on Computer Arithmetic, Germany, p. 139-146, July 2011, incorporated herein by reference in its entirety. Internally, the decimal floating-point unit uses BCD digits in arithmetic operations. This is inefficient and increases the area of the decimal floating-point unit in comparison with a binary floating-point unit.
Decimal floating-point numbers are based on the IEEE 754-2008 standard. See D. Zuraz, M. Cowlishaw, et al., “IEEE Standard for Floating-Point Arithmetic”, IEEE Std 754-2008, August 2008, incorporated herein by reference in its entirety. The standard defines decimal interchange formats, called decimal32, decimal64, and decimal128, of widths 32, 64, and 128 bits, respectively. The format has three fields: a sign bit, a combination field, and a coefficient continuation field, as shown in
The integer coefficient C consists of p decimal digits, where p is the precision: p=7, 16, and 34 for decimal32, decimal64, and decimal128, respectively. The numerical value of a finite decimal floating-point number is: (−1)^s × C × 10^q, where q=E−Bias. Prior work on decimal floating-point numbers was led by the IBM zEnterprise decimal floating-point accelerator. See Id. Other work on the implementation of decimal floating-point operations and units is documented in L. K. Wang and M. J. Schulte, “Decimal Floating-Point Adder and Multifunction Unit with Injection-Based Rounding”, in Proceedings of the 18th IEEE Symposium on Computer Arithmetic, France, June 2007; L. K. Wang, M. J. Schulte, J. D. Thompson, and N. Jairam, “Hardware designs for Decimal Floating-Point Addition and Related Operations”, IEEE Transactions on Computers, 58 (3), March 2009; L. K. Wang and M. J. Schulte, “A Decimal Floating-Point Adder with Decoded Operands and a Decimal Leading-Zero Anticipator”, in Proceedings of the 19th IEEE Symposium on Computer Arithmetic, 2009; A. Wahba and H. Fahmy, “Area Efficient and Fast Combined Binary/Decimal Floating Point Fused Multiply Add Unit”, IEEE Transactions on Computers, Vol 66, No 2, February 2017, p. 226-239; and A. Vazquez, E. Antelo, and P. Montuschi, “Improved Design of High-Performance Parallel Decimal Multipliers”, IEEE Transactions on Computers, Vol 59, No 5, May 2010, p. 679-693, each incorporated herein by reference in its entirety. All of this work is based on the IEEE 754-2008 standard.
However, as noted above, even the revised IEEE 754-2008 standard uses Radix 10 for representing decimal floating-point numbers and for decimal floating-point arithmetic. IBM has implemented Radix-10 Floating-point units inside their recent processors. In contrast, as will be explained further hereinbelow, the present invention introduces a novel representation of FLOATING-POINT numbers based on radix-1000 and a SKEWED representation of the fraction. The invention also presents detailed implementation of floating-point arithmetic units that can be of various sizes (32-bit, 64-bit, and 128-bit).
U.S. Pat. No. 7,644,115 B2 is directed to systems and methods for performing large-radix numeric operations. A first number may be segmented into large-radix segments, wherein numbers of the segments are generated such that radix of the segment is greater than radix of the first number. As a result, a plurality of disparate processor-based computing systems may be configured to perform various numeric operations on the large-radix segments of the number and output results of a numeric operation as a number whose radix is equal to the radix of the first number.
In U.S. Pat. No. 7,644,115 B2, unlike the present invention, the numbers that are manipulated are fixed-point numbers that might include a fraction; they are not floating-point numbers. There is no exponent field; there is no format for the number itself (just a string of characters); and there is no hardware implementation. As will be explained further hereinbelow, the present invention incorporates features such as using 32 bits, 64 bits, and 128 bits to store a radix-1000 decimal floating-point number in binary with an exponent field and a skewed representation of the fraction, none of which are present in this prior art. Instead, U.S. Pat. No. 7,644,115 B2 discloses the use of segmentation instructions to segment large numbers and to request that the operating system store large-radix numbers and segments in a data structure (arrays, lists, etc.).
U.S. Pat. No. 6,546,411 B1 is directed to a high-speed radix 100 parallel adder that provides an improved method and apparatus for performing decimal arithmetic using conventional parallel binary adders. In a first aspect, a method for implementing decimal arithmetic using a radix (base) 100 and a method for implementing radix 1000 numbering system are disclosed. The first aspect implements decimal arithmetic utilizing radix 100, where one-hundred decimal numbers, 0 through 99, are represented using seven BCD bits. In a second aspect, a specialized high-speed radix 100 parallel adder is disclosed. In effect, numbers are segmented into several large-radix segments. The radix of a segment may be 100, 1000, 10000, 100000, 1000000, etc. Next, the segmentation instructions may request the operating system to store the large-radix segments in one or more data structures in memory, such as arrays, dynamic arrays, linked lists, stacks, queues, etc.
In U.S. Pat. No. 6,546,411 B1, the radix 100 numbering system is used for inventing a High-Speed Radix 100 Parallel Adder. This reference is directed to integer arithmetic, not floating-point arithmetic. In integer arithmetic, there is no exponent field, no fractions, no exponent logic, no alignment, no normalization, and no rounding logic.
In the publication entitled “Revisiting the Newton-Raphson Iterative Method for Decimal Division” by Mario P. Vestias and Horacio C. Neto (Sep. 5-7, 2011), the focus is on faster decimal division, using the Newton-Raphson iterative method. In this publication, the implementation converts 3 BCD digits into a 10-bit DPD (Densely Packed Decimal used in IEEE 754-2008). In the paper, the implementation converts a 20-bit binary number to radix-1000 (
In contrast to the publication by Vestias and Neto, as will be explained further hereinbelow, the present invention does not do any division, but only addition and subtraction. The invention does not use BCD or DPD, but only BCK (Binary Coded 1000 values). The invention does not convert binary numbers, and instead uses a much more efficient Radix-1000 adder/subtractor for adding and subtracting fractions.
The publication entitled “Floating Point Number Format with Number System with Base of 1000”, IBM Technical Disclosure Bulletin (1998) describes floating-point numbers with a base of 1000 (instead of 2) and discloses that the format is superior to Binary Coded Decimal (BCD). In contrast to the present invention, there is no representation of the Radix-1000 floating-point numbers and no implementation of Radix-1000 floating point units. Rather, this publication goes the opposite direction and implements decimal floating-point units based on Radix 10, BCD, and DPD, with a much more complex representation and implementation.
In one aspect the present invention is directed to a system, structure and method using radix-1000 (instead of radix-10) to represent and operate on decimal floating-point numbers. Instead of using a 10-bit declet to encode a DPD, this invention uses a declet to encode a BCK (Binary Coded 1000 values), where the letter K is the abbreviation of the number 1000. This invention also uses a skewed representation of the fraction field to avoid the loss of decimal digits in arithmetic operations when shifting and rounding the fraction are required.
A minor drawback of radix-1000 is the loss of a BCK digit, or three BCD digits, when incrementing the exponent field by 1 (right-shifting the significand by 10 bits). This is the case when adding/subtracting two radix-1000 floating-point numbers that have different exponent values. This is also the case when a carry is obtained, and shifting and rounding are necessary. A difference of 1 in the radix-1000 exponent is equal to a difference of 3 in the radix-10 exponent. To alleviate this drawback, this invention uses a skewed representation of the fraction field.
The present invention is further directed to a processing circuit comprising logic circuitry that performs radix 1000 decimal floating-point arithmetic. Among the features of the present invention, the logic circuitry operates on operands having a sign bit, an exponent field representing an exponent on a 1000 base and a fraction field representing a number having an absolute value that is less than one. The fraction field comprises a plurality of declets representing the numbers 0-999 and a format indicator. The logic circuitry performs the radix 1000 decimal floating-point arithmetic using one of a plurality of skewed representations of operands as indicated by the format indicator. In addition, the logic circuitry includes expansion circuitry that expands the fraction field F[115:0] into its number representation X[119:0] according to:
fmt[1:0]=F[115:114]
if fmt[1:0]=11 then X[119:117]=F[5:3] else X[119:117]=000;
if fmt[1]=1 then X[116:114]=F[2:0] else X[116:114]=concat{0,0,F[114]};
if fmt[1:0]=11 then X[5:3]=000 else X[5:3]=F[5:3];
if fmt[1]=1 then X[2:0]=000 else X[2:0]=F[2:0], wherein fmt[1:0] is the format indicator.
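The expansion rules listed above can be sketched in software as a behavioral model. The function name `expand_dfp128` and the helper `bits` are hypothetical, and the model assumes that the middle bits pass through unchanged (X[113:6]=F[113:6]), which the four rules imply but do not state explicitly.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Return the bit field value[hi:lo] as an integer (hypothetical helper)."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def expand_dfp128(F: int) -> int:
    """Behavioral sketch: expand the 116-bit fraction F[115:0] into X[119:0]."""
    assert 0 <= F < (1 << 116)
    fmt = bits(F, 115, 114)          # format indicator fmt[1:0]
    fmt1 = fmt >> 1                  # fmt[1]

    x_119_117 = bits(F, 5, 3) if fmt == 0b11 else 0
    x_116_114 = bits(F, 2, 0) if fmt1 == 1 else bits(F, 114, 114)
    x_5_3 = 0 if fmt == 0b11 else bits(F, 5, 3)
    x_2_0 = 0 if fmt1 == 1 else bits(F, 2, 0)

    # Assumption: the remaining 108 bits are copied unchanged.
    middle = bits(F, 113, 6)

    return ((x_119_117 << 117) | (x_116_114 << 114) |
            (middle << 6) | (x_5_3 << 3) | x_2_0)
```

With fmt[1]=0 the model returns F zero-extended to 120 bits; with fmt=10 or fmt=11 the low three or six fraction bits move to the top of X and are replaced by zeros at the bottom.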
Further features of the present invention include a special sub-circuit of the processing sub-circuit that is used for the Addition and Subtraction of radix-1000 expanded fractions.
Another feature of the present invention is a Normalize sub-circuit that may include a special sub-circuit that does the normalization of the radix-1000 result fraction. The normalization process is conditional and depends on whether the input fractions are normalized or not.
As another feature, in a Round sub-circuit, a radix-1000 normalized fraction is rounded according to the rounding mode and the normalized fraction format.
As another feature, in a Pack sub-circuit, a radix-1000 rounded fraction is packed according to its final format.
As an even further feature, in an Exception handler sub-circuit, arithmetic on special Overflow or Invalid values produces special Overflow or Invalid results. This feature also includes the ability to detect and produce an Overflow result when the normalized exponent value becomes too large. It also includes the ability to produce an Inexact flag when the produced radix-1000 fraction is rounded or truncated.
In contrast to the prior art as discussed above, even the revised IEEE 754-2008 standard uses Radix 10 for representing decimal floating-point numbers and for decimal floating-point arithmetic. IBM has implemented Radix-10 Floating-point units inside their recent processors. In contrast, as will be explained further hereinbelow, the present invention introduces a novel representation of FLOATING-POINT numbers based on radix-1000 and a SKEWED representation of the fraction. The invention also presents detailed implementation of floating-point arithmetic units that can be of various sizes (32-bit, 64-bit, and 128-bit).
These and other features, functionalities and objectives are attained with, for example, a method and system for executing a machine instruction in a central processing unit comprising circuitry.
The present invention is illustrated in the accompanying drawings, wherein:
The embodiments of the present invention will be described hereinbelow in conjunction with the above-described drawings. This invention uses radix-1000 (instead of radix-10 as done in the prior art) to represent and operate on decimal floating-point numbers. It is a deviation from the IEEE 754-2008 standard. Instead of using a 10-bit declet to encode a DPD, this invention uses a declet to encode BCK (Binary Coded 1000) values, where the letter K is the abbreviation of the number 1000. The advantages of using radix-1000 are many as outlined below:
The potential drawback of radix-1000 is the loss of a BCK digit, or three BCD digits, when incrementing the exponent field by 1 (right-shifting the significand by 10 bits). This is the case when adding/subtracting two radix-1000 floating-point numbers that have different exponent values. This is also the case when a carry is obtained, and shifting and rounding are necessary. A difference of 1 in the radix-1000 exponent is equal to a difference of 3 in the radix-10 exponent. To alleviate this drawback, this invention uses a skewed representation of the fraction field.
The radix-1000 floating-point interchange format according to the present invention consists of three fields: a sign bit s, a biased exponent field E with e bits, and a fraction field F with f bits, as shown in
In this invention, DFP32, DFP64, and DFP128 are the names of the suggested radix-1000 DFP numbers. Only a few bits are needed for the exponent field, leaving the remaining bits for the fraction field. The biased exponent range is 0 to 2^e−2. The Bias is equal to 2^(e−1)−1. The maximum exponent scale factor is 1000^(+Bias)=10^(+3×Bias) and the minimum is 1000^(−Bias)=10^(−3×Bias). Table 1 shows the suggested parameters of the DFP32, DFP64, and DFP128 numbers. The length of the exponent and fraction fields can be chosen differently, depending on whether a wider exponent range or a higher precision is desired. The special exponent value E=2^e−1 is reserved for infinity and NaN. When E=2^e−1, the most significant bit of F specifies whether the number is infinity or NaN.
The simplest representation of the fraction field F is to split the field into 10-bit declets, starting at the least-significant fraction bit and moving backwards. Each 10-bit declet is a BCK digit that encodes three decimal digits in binary (000 to 999). Since the fraction field is not a multiple of 10 bits, the upper bits of the fraction field encode fewer than 1000 decimal values. This fixed-format representation of the fraction field F is shown in Table 2. The maximum precisions for DFP32, DFP64, and DFP128 are 7+, 16+ and 34+ decimal digits, respectively. The + means that the precision can exceed 7, 16 and 34 digits in some limited cases.
Although the fixed-format representation of the radix-1000 fraction F is the simplest to implement in hardware, its major drawback is the loss of precision when converting a radix-10 decimal number into radix-1000 or when shifting and rounding the result of an arithmetic operation. Because right-shifting is done by multiples of 10 bits, the actual precision is variable. It varies between 5 and 7+ decimal digits for DFP32, between 14 and 16+ for DFP64, and between 32 and 34+ for DFP128.
To remedy the loss of precision, Table 3 shows a two-format skewed representation of the fraction F. The maximum decimal values of the two-format representation of the fraction are denoted as Max F0 and Max F1. The actual precision of the two-format representation of F is now improved. It is 6 to 7+ decimal digits for DFP32, 15 to 16+ decimal digits for DFP64, and 33 to 34+ decimal digits for DFP128.
Since there are three decimal digits in each BCK, it is better to have a three-format skewed representation of the fraction F. Table 4 shows a three-format representation of the DFP32, DFP64, and DFP128 fractions with maximum decimal values: Max F0, Max F1, and Max F2. The actual precision is 7, 16, and 34 decimal digits for DFP32, DFP64, and DFP128, respectively.
The two- and three-format skewed representations of the fraction field are simple to implement. The fraction F is expanded from 25 to 30 bits for DFP32, from 55 to 60 bits for DFP64, and from 116 to 120 bits for DFP128. The format is defined according to the upper bits of the fraction field. Only the upper and lower BCKs of the fraction field are expanded. The middle BCKs are not modified.
Consider the DFP128 three-format fraction, shown in Table 4. Let F[115:0] be the 116-bit fraction, where bit 115 is the most significant and bit 0 is the least-significant. The format fmt is defined according to the two most-significant bits of the fraction: fmt[1:0]=F[115:114]. Let X[119:0] be the 120-bit expanded fraction for DFP128. It consists of 12 BCK digits, equivalent to 36 decimal digits. The expansion logic is described in Table 5. There are three formats to expand. First, if fmt[1] is 0 then X[119:0]={4′b0, F[115:0]}, where { } is the concatenation operator and the upper 4 bits of X are zeros. Second, if fmt[1:0] is 2′b10 then X[119:0]={3′b0, F[2:0], F[113:10], F[9:3]*10}. The 7-bit F[9:3] is multiplied by 10. This requires an adder to compute the least-significant BCK as a shifted addition. Third, if fmt[1:0] is 2′b11 then X[119:0]={F[5:0], F[113:10], F[9:6]*100}. The 4-bit F[9:6] is multiplied by 100, which can be implemented with simple logic.
The expansion logic can be further simplified by multiplying the 7-bit F[9:3] by 8 or the 4-bit F[9:6] by 64, as shown in Table 5b. The least-significant BCK is computed as: X[9:0]={F[9:3], 3′b0}, or {F[9:6], 6′b0}, or F[9:0]. This eliminates the need to multiply by 10 or 100. A second advantage is simplifying the rounding logic. Rather than dividing the least-significant BCK of the result by 10 or 100 in the rounding step, division by 8 or by 64 becomes trivial. Packing the result fraction also becomes trivial. Multiplying F[9:3] by 8 generates 125 valid BCK values (000 to 992), while multiplication by 10 generates only 100 values (000 to 990). Similarly, multiplying F[9:6] by 64 generates 16 valid BCK values (000 to 960), while multiplication by 100 generates only 10 values (000 to 900). Therefore, multiplication by 8 and 64 provides a better granularity for the least-significant BCK, which is preferably rounded for inexact arithmetic results.
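The granularity figures quoted above can be checked with a short enumeration. This is only an illustrative count; the list names are hypothetical, and a BCK value is taken to be valid when it lies in the range 000 to 999.

```python
# 7-bit field F[9:3] (0..127) scaled by 8 (shift left by 3) or by 10.
valid_x8 = [v * 8 for v in range(128) if v * 8 <= 999]
valid_x10 = [v * 10 for v in range(128) if v * 10 <= 999]

# 4-bit field F[9:6] (0..15) scaled by 64 (shift left by 6) or by 100.
valid_x64 = [v * 64 for v in range(16) if v * 64 <= 999]
valid_x100 = [v * 100 for v in range(16) if v * 100 <= 999]

# Scaling by a power of two is just a concatenation with zero bits.
assert all(v * 8 == (v << 3) for v in range(128))
assert all(v * 64 == (v << 6) for v in range(16))

print(len(valid_x8), max(valid_x8))      # 125 992
print(len(valid_x10), max(valid_x10))    # 100 990
print(len(valid_x64), max(valid_x64))    # 16 960
print(len(valid_x100), max(valid_x100))  # 10 900
```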
The three-format fraction representation of DFP32 and DFP64 is slightly more complex to expand. The upper 5 bits of the fraction (with 32 possible values) specify three different formats multiplied by the ten decimal digits. Multiplication by 10 and 100 is necessary for the least-significant BCK only.
Unlike binary floating-point numbers, which must be normalized, decimal floating-point numbers need not be normalized according to the IEEE 754-2008 standard. This means that a decimal number can have multiple representations, called cohorts. For example, the decimal number 0.2 can be represented using different integer coefficients as: 2×10^−1, 20×10^−2, 200×10^−3, etc. All of these are equivalent representations according to IEEE 754-2008. The drawback of cohorts is the additional complexity added to the hardware implementation when adding or subtracting decimal floats. Given two decimal numbers A and B with exponents EA and EB, the preferred exponent of the result is min(EA, EB) for addition and subtraction according to IEEE 754-2008. One coefficient is left-shifted to decrease its exponent (according to the number of leading zeros) and the other coefficient is right-shifted to increase its exponent to match the exponent of the left-shifted coefficient.
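Cohorts can be observed with Python's standard `decimal` module, used here only as a convenient software stand-in for radix-10 decimal arithmetic (it is not the hardware discussed in this document). The sketch shows three members of the cohort of 0.2 and the min(EA, EB) exponent chosen for a sum.

```python
from decimal import Decimal

# Three representations of the same value with different coefficients/exponents.
cohort = [Decimal("2E-1"), Decimal("20E-2"), Decimal("200E-3")]

# All members compare equal numerically...
all_equal = all(member == cohort[0] for member in cohort)

# ...but each keeps its own (coefficient digits, exponent) pair.
encodings = [(m.as_tuple().digits, m.as_tuple().exponent) for m in cohort]

# For an exact sum, the result exponent is the smaller of the two operand exponents.
total = Decimal("2E-1") + Decimal("200E-3")
sum_exponent = total.as_tuple().exponent

print(all_equal)      # True
print(encodings)      # [((2,), -1), ((2, 0), -2), ((2, 0, 0), -3)]
print(total)          # 0.400
print(sum_exponent)   # -3
```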
To alleviate the problem of multiplicity of representation and to simplify the implementation, the radix-1000 fraction field should be normalized. This means that the largest fraction and the smallest exponent value should be used. The most-significant BCK digit in the fraction field F should be non-zero. For example, the decimal number 0.2 should be represented uniquely as: 0.200,000, . . . ×1000^0. The only exception is Zero, which cannot be normalized. The exact number Zero is represented uniquely with E=F=0.
If a fraction is not normalized, it indicates the loss of significant digits. For example, the number 0.000,200,000, . . . ×1000^1 is not normalized. It approximates 0.2 but is not necessarily exactly equal to it. The precision is counted starting at the most-significant nonzero digit. If a number is not normalized then its fraction cannot be left-shifted and normalized, because what comes after the least-significant fraction digit is unknown (not necessarily zero). Therefore, there is no left-shifting of a radix-1000 fraction field when it is not normalized. Only right-shifting is used on the fraction of the lesser exponent when adding/subtracting radix-1000 decimal floating-point numbers.
This section describes the implementation of a radix-1000 DFP128 unit 1000 that performs addition, subtraction, and comparison of radix-1000 numbers according to the present invention. The top-level design of the structure and operation of the radix-1000 DFP128 unit 1000 is shown in
Given two DFP128 numbers A and B, SA and SB are the input sign bits, EA and EB are the input biased exponents, and FA and FB are the input fractions of A and B, respectively, as shown in
The Expand & Swap block 1002 enlarges the input fractions FA and FB from 116 to 120 bits, as described in Table 5b. The Expand logic expands only the most significant 6 bits and least-significant 6 bits of the fractions FA and FB. However, it does not modify 108 bits of FA and FB. The 120-bit expanded fractions are called XA and XB. The expanded fractions are swapped if EA<EB. The 120-bit swapped outputs are called YA and YB, where YA=swap? XB: XA and YB=swap? XA: XB.
In addition, the Expand & Swap block 1002 outputs an LZ signal that indicates whether there is a leading zero BCK in XA or XB. Only the leading BCK is examined: LZ=(XA[119:110]==0)|(XB[119:110]==0). The LZ signal is used by the Normalize block 1006. If LZ is 1 then fraction FA or FB is not normalized, and the result of an arithmetic operation cannot be normalized when there are leading zeros in the result.
In contrast, the DFP accelerator on the IBM z196 unpacks the integer coefficient encoded in DPD into BCD. The unpacking logic for DPD is more complex and applies to all BCD digits. The output of the unpacker on the z196 consists of 36 BCD digits, or 144 bits, which is much longer.
The Exponent Difference block 1008 computes the difference of the 11-bit biased exponents EA and EB. It produces four outputs: swap=sign(EA−EB) is used to swap the expanded fractions XA and XB when (EA<EB), Emax is the maximum exponent value, Smax is the sign of the swapped fraction YA with exponent Emax, and RSA is the absolute difference of EA and EB that saturates at 15. RSA is a 4-bit right shift amount used by the R-Shift block 1004. Only a 4-bit shift amount is required by the right-shifter because there are only 12 BCKs in an expanded fraction, and right-shifting beyond 12 produces a zero output.
The R-Shift block 1004 right-shifts the 120-bit fraction YB according to the 4-bit right-shift amount RSA. It produces three outputs: a 120-bit shifted fraction YS, a 10-bit extra BCK YX that is shifted-out, and a sticky bit S, which is the OR-reduction of the shifted-out bits that appear after YX. The 10-bit YX and the sticky bit S are used by the 120-bit Fraction Adder/Subtractor block 1010 to compute the 120-bit result Z and its 10-bit result extension ZX.
It should be emphasized that there is no left-shifter to left-shift YA, when YA has leading zeros. As stated in the previous section, if an input fraction is not normalized then it cannot be normalized. The concept of cohorts used in the IEEE 754-2008 standard does not apply here. This simplifies the implementation.
The effective operation signal EOP is computed as: EOP=SA^SB^Op, where Op is the arithmetic operation select signal (ADD is 0 and SUB is 1) and ^ is the XOR operation (see XOR block 1011). EOP is equal to Op if A and B have identical signs (SA is equal to SB). Otherwise, EOP=˜Op. Subtraction is also used to compare A with B.
The 120-bit fraction Add/Subtract block 1010 receives two 120-bit input fractions YA and YS, an effective operation signal EOP, a 10-bit YX BCK shifted-out by the R-shifter 1004, and a sticky bit S. It produces a 120-bit result Z, a 10-bit result extension ZX, an output carry Co, and a less-than signal LT that indicates whether YA<YS. The Co signal is valid only for addition (when EOP is 0) and always 0 for subtraction. The LT signal is valid only for subtraction (when EOP is 1) and always 0 for addition. The 10-bit ZX is used as a round BCK for addition and a guard BCK for subtraction.
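The effective operation is a three-input XOR of the two sign bits and the operation select. The following minimal sketch uses a hypothetical function name `effective_op` and enumerates all eight input combinations to confirm the two cases stated in the text.

```python
def effective_op(sa: int, sb: int, op: int) -> int:
    """EOP = SA xor SB xor Op, with ADD encoded as 0 and SUB as 1."""
    return sa ^ sb ^ op


# Exhaustive check: equal signs keep Op, different signs invert it.
for sa in (0, 1):
    for sb in (0, 1):
        for op in (0, 1):
            expected = op if sa == sb else 1 - op
            assert effective_op(sa, sb, op) == expected
```

For example, subtracting a negative number from a positive one (SA=0, SB=1, Op=1) yields EOP=0, an effective addition of magnitudes.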
The Normalize block 1006 receives a 120-bit result Z, a 10-bit result extension ZX, and a carry bit Co from the fraction adder/subtractor 1010. It also receives a leading zero bit LZ from the expand unit 1002 (indicating whether FA or FB is not normalized) and a sticky bit S from the right-shifter 1004. It produces a 120-bit normalized result N, a 5-bit exponent correction EC used to compute the exponent of the result ER, and two X bits used for rounding the normalized result N.
The Sign block 1012 computes the sign of the result SR, based on the effective operation EOP, the sign bit SA, the sign bit Smax (sign of YA), and the LT signal (when EOP is subtraction).
The Round & Pack block 1014 receives a 120-bit normalized result N and two X bits from the Normalize block 1006. It also receives the result sign SR and a 2-bit round direction RDir. The 120-bit result N is rounded according to its format, and then packed into a 116-bit result fraction FR. Since rounding might produce an output carry, post-normalization is done to the rounded result in the same step. An output Inc signal indicates the presence of an output carry and is used to increment ER.
Finally, the Exp block 1018 computes and outputs the result exponent ER=Emax+EC+Inc. The 5-bit signed exponent correction EC is sign-extended and added to Emax.
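The exponent update can be sketched as follows. The function names are hypothetical, EC is assumed to be a 5-bit two's complement value, and overflow detection of the result exponent is deliberately not modeled.

```python
def sign_extend_5bit(ec: int) -> int:
    """Interpret the low 5 bits of ec as a two's complement signed value."""
    ec &= 0x1F
    return ec - 32 if ec & 0x10 else ec


def result_exponent(emax: int, ec: int, inc: int) -> int:
    """ER = Emax + sign-extended EC + Inc (overflow handling omitted)."""
    return emax + sign_extend_5bit(ec) + inc
```

For instance, `result_exponent(1023, 0b11111, 0)` decrements the exponent by one, while `result_exponent(1023, 0b00001, 1)` raises it by two.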
The structure and operation of the Exponent Difference block circuit 1008 are shown in
The input sign bits SA and SB, along with swap, are inputted into the multiplexer 1008c to generate Smax, which is the sign of the swapped fraction YA with exponent Emax. The lower 4 bits of Ediff are further inputted into the 2's complement sub-block 1008e, wherein the 2's complement is computed when the sign of the difference is negative (swap is 1). The lower 4 bits output of the 2's complement sub-block 1008e is inputted along with the output of the >15 sub-block 1008b and the 4′b1111 signal into the multiplexer 1008f to generate RSA. RSA is the absolute difference of EA and EB that saturates at 15. RSA is a 4-bit right shift amount used by the R-Shift block 1004. Only a 4-bit shift amount is required by the right-shifter 1004 because there are only 12 BCKs in an expanded fraction, and right-shifting beyond 12 produces a zero output.
The swap signal selects Emax=max(EA, EB)=swap? EB: EA and the sign bit Smax=swap? SB: SA. Finally, the right-shift amount RSA is computed as: RSA=min(abs(Ediff), 15). It saturates at 15 when abs(Ediff)>15.
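A behavioral sketch of the Exponent Difference block follows. It models only the four outputs described in the text (swap, Emax, Smax, and the saturating shift amount RSA), not the sub-block structure; the function name is hypothetical.

```python
def exponent_difference(ea: int, eb: int, sa: int, sb: int):
    """Return (swap, emax, smax, rsa) for biased exponents ea, eb and signs sa, sb."""
    ediff = ea - eb
    swap = 1 if ediff < 0 else 0      # sign of the difference: 1 when EA < EB
    emax = eb if swap else ea         # larger biased exponent
    smax = sb if swap else sa         # sign of the operand with the larger exponent
    rsa = min(abs(ediff), 15)         # 4-bit shift amount, saturating at 15
    return swap, emax, smax, rsa
```

For example, `exponent_difference(1000, 1003, 0, 1)` selects the second operand as the larger one and requests a right shift of three BCK digits for the first.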
The structure and operation of the Right-Shifter block circuit 1004 is shown in
In parallel, a 10-bit extra BCK YX is produced using also two stages (4-way multiplexers). YX is the last BCK that is shifted out according to the 4-bit shift amount RSA. It is produced from YB and Y1 as shown in
In parallel, a sticky bit S is produced, which is the OR-reduction of all bits that are shifted out after the YX BCK. Large fan-in reduction OR-gates (or trees) 1020a-1020e are used to reduce the 10-bit YB[9:0], the 20-bit YB[19:0], the 30-bit Y1[29:0], the 70-bit Y1[69:0], and the 110-bit Y1[109:0] into a single output bit. To minimize cost, the reduction OR-tree gates are shared. For example, |YB[19:0]=(|YB[19:10])|(|YB[9:0]), where |YB[19:0] means the reduction-OR of YB[19:0]. Similarly, |Y1[69:0]=(|Y1[69:30])|(|Y1[29:0]) and |Y1[109:0]=(|Y1[109:70])|(|Y1[69:30])|(|Y1[29:0]). The output of the first-stage multiplexer 1018c is also ORed via OR-gates 1020c-1020e in the second stage into the second-stage multiplexer 1018f to determine S.
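The three outputs of the right-shifter can be modeled arithmetically, ignoring the two-stage multiplexer structure. The function name is hypothetical, and the behavior for shift amounts above 12 (YX is zero and S collects every bit of YB) is an assumption of this sketch rather than something stated in the text.

```python
def right_shift_bck(yb: int, rsa: int):
    """Shift the 120-bit fraction yb right by rsa BCK digits (10 bits each).

    Returns (ys, yx, sticky): the shifted fraction, the last BCK shifted
    out, and the OR-reduction of all bits shifted out below that BCK.
    """
    assert 0 <= yb < (1 << 120) and 0 <= rsa <= 15
    if rsa == 0:
        return yb, 0, 0
    shift_bits = 10 * rsa
    ys = yb >> shift_bits                           # zero once rsa >= 12
    yx = (yb >> (shift_bits - 10)) & 0x3FF          # last BCK shifted out
    below = yb & ((1 << (shift_bits - 10)) - 1)     # bits shifted out after YX
    sticky = 1 if below else 0
    return ys, yx, sticky
```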
As shown in
For addition (EOP is 0), each 10-bit Adder sub-block 1028 computes a temporary sum TKi=(YAKi+24)+YSKi. The +24 sub-block 1024 is used to skip the 24 invalid values (1000 to 1023), and adjust the sum when (YAKi+YSKi)>999. For subtraction (EOP is 1), the 10-bit Adder sub-block 1028 computes TKi=YAKi+˜YSKi=YAKi+1023−YSKi=(YAKi+24)+(999−YSKi). Each 10-bit Adder sub-block 1028 also produces a generate bit Gi and a propagate bit Pi. The generate bit Gi indicates that TKi is greater than 1023. The propagate bit Pi indicates that TKi is equal to 1023. The Gi and Pi signals can be produced using fast logic, independently of TKi.
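The per-digit first step can be sketched as below. The function name is hypothetical, the model derives Gi and Pi from the full sum rather than from separate fast logic, and it returns the uncorrected 10-bit TKi only; the later post-correction step is not modeled here.

```python
def bck_digit_sum(ya_k: int, ys_k: int, eop: int):
    """First-step sum for one BCK digit: returns (tk, g, p).

    tk is the 10-bit temporary sum TKi, g is set when the full sum
    exceeds 1023, and p is set when the full sum equals 1023.
    """
    assert 0 <= ya_k <= 999 and 0 <= ys_k <= 999
    if eop == 0:
        full = (ya_k + 24) + ys_k            # bias by 24 to skip 1000..1023
    else:
        full = ya_k + (~ys_k & 0x3FF)        # YAKi + 1023 - YSKi
    g = int(full > 1023)
    p = int(full == 1023)
    return full & 0x3FF, g, p
```

Adding 600 and 500 gives a generate with TKi=100, which is already the correct digit of 1100; adding 499 and 500 gives a propagate, since the digit sum is exactly 999.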
The second step compares the magnitudes of YA and YS when EOP is subtraction. It also produces all carries (C0 to C12) using a carry lookahead CMP/CLA unit 1034. The Group-Generate (GG) and Group-Propagate (GP) signals are defined inside the CMP/CLA unit 1034 as follows:
GG=G11|(P11 & G10)|(P11 & P10 & G9)| . . . |(P11 & P10 & . . . & P1 & G0)
GP=P11 & P10 & . . . & P1 & P0.
For subtraction, the generate and propagate signals (G0 to G11 and P0 to P11) outputted from the 10-bit Adder sub-blocks 1028 are inputted into the CMP/CLA unit 1034 and used to compare the fraction YA with YS. Given that TKi=YAKi+1023−YSKi, the generate bit G is 1 when (YAKi>YSKi). The propagate bit Pi is 1 when (YAKi==YSKi). The group generate signal GG indicates whether (YA>YS). The group propagate signal GP indicates whether (YA==YS). The LT signal is defined as: LT=EOP & ˜GG & ˜GP. It is valid only for subtraction, and always zero for addition.
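The comparison performed with the group signals can be sketched on lists of BCK digits. The function name is hypothetical, digit index 0 is taken as the least-significant BCK, and the loop is a sequential stand-in for the lookahead expressions of GG and GP.

```python
def compare_fractions(ya, ys, eop: int):
    """Return (gg, gp, lt) for two equal-length lists of BCK digits.

    gg is 1 when YA > YS, gp is 1 when YA == YS, and lt = EOP & ~GG & ~GP.
    """
    assert len(ya) == len(ys)
    g = [int(a > s) for a, s in zip(ya, ys)]    # per-digit generate (subtraction)
    p = [int(a == s) for a, s in zip(ya, ys)]   # per-digit propagate (subtraction)
    gg = 0
    prefix = 1                                   # AND of propagates above digit i
    for i in reversed(range(len(g))):
        gg |= prefix & g[i]
        prefix &= p[i]
    gp = prefix
    lt = eop & (1 - gg) & (1 - gp)
    return gg, gp, lt
```

The most-significant differing digit decides the outcome: `compare_fractions([0, 0, 5], [999, 999, 4], 1)` reports YA>YS even though the two lower digits of YA are smaller.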
The input carry bit C0 is defined as: C0=EOP & ˜LT & ˜X, where X is the OR-reduction of all the bits that are shifted out: X=(YX!=0)|S (see OR gate 1037). Therefore, C0 is 1 for subtraction (EOP is 1), when YA>=YS (LT is 0), and all the shifted-out bits are zeros (X is 0).
The twelve carries (C1 to C12) are produced in the CMP/CLA unit 1034, and depend on the value of C0, the generate bits (G0 to G11) and the propagate bits (P0 to P11). The output carry is defined as: Co=C12 & ˜EOP (see AND gate 1035). It is valid only for addition (when EOP is 0). In summary, the CMP/CLA unit 1034 outputs:
In parallel, the 1000's complement of YX is computed as: 1000−YX−S=˜(YX+S+23), where S is the sticky bit. The 10-bit ZX BCK is generated as either ˜(YX+S+23) or YX, depending on EOP, LT, and X. It is selected as ˜(YX+S+23) for subtraction (EOP is 1), when YA>=YS (LT is 0), and at least one of the shifted-out bits is non-zero (X is 1). Otherwise, ZX=YX. Structurally, YX is inputted directly into a multiplexer 1036 and inputted into a +23 adder sub-block 1038 that also receives the sticky bit S. The inverse of the output of the +23 adder sub-block 1038 is then inputted into the multiplexer 1036. Multiplexer 1036 also receives the output of the AND gate 1040, which is derived from the logical ANDing of EOP, the inverse of LT, and X. ZX is thus derived as follows:
ZX=(EOP&X&˜LT)?˜(YX+S+23):YX
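The identity 1000−YX−S=˜(YX+S+23) holds because a 10-bit inversion computes 1023 minus its operand. A minimal Python sketch of the ZX selection is given below; the function name `select_zx` is hypothetical and the signals are modeled as small integers.

```python
MASK10 = 0x3FF  # a BCK occupies 10 bits


def select_zx(yx: int, s: int, eop: int, lt: int, x: int) -> int:
    """Model ZX = (EOP & X & ~LT) ? ~(YX + S + 23) : YX."""
    # 10-bit inversion: ~v = 1023 - v, hence ~(YX+S+23) = 1000 - YX - S.
    complemented = ~(yx + s + 23) & MASK10
    return complemented if (eop & x & (1 - lt)) else yx
```

For instance, with YX = 250 and S = 0 in a subtraction where YA >= YS, the sketch selects 750; with S = 1 it selects 749.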
The third step is to post-correct the twelve 10-bit intermediate sums (TK0 to TK11) in parallel and compute a 120-bit result Z. Referring to
Given a 10-bit TKi, the 10-bit post-corrected result ZKi is computed as either:
Adding +1000 to the 10-bit sum TKi is equivalent to adding −24, because 1000=1024 (carry)−24 and the carry is ignored in the 10-bit post-correct adder. For addition, LT is always 0 and only the first two cases apply. The 120-bit result is computed as: Z=YA+YS+C0, where C0 is always 0 for addition.
For subtraction, all four cases apply as shown below. When LT is 0 (YA>=YS), the 120-bit result is computed as: Z=YA−YS−1+C0, where C0=˜X. Hence, Z=YA−YS when X is 0, and Z=YA−YS−1 when X is 1. When LT is 1 (YA<YS), the negative result is converted into positive and the 120-bit result is computed as: Z=YS−YA−C0, where C0 is always 0.
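The pre-correct and post-correct steps can be modeled together in a short Python sketch covering addition and subtraction with LT equal to 0. It is an illustration under assumptions rather than the disclosed circuit: the carries ripple instead of using lookahead, the post-correction is assumed to add the carry-in when a carry-out occurs and to add the carry-in plus 1000 (that is, −24 modulo 1024) otherwise, and the LT = 1 cases are not modeled. The name `add_sub_bcks` is hypothetical.

```python
MASK10 = 0x3FF  # a BCK occupies 10 bits


def add_sub_bcks(ya, ys, eop, c0):
    """Add (eop=0) or subtract (eop=1, YA >= YS) two BCK lists.

    ya, ys: BCK values 0..999, least-significant first.
    Returns (z_bcks, carry_out_of_the_top_BCK).
    """
    z, carry = [], c0
    for a, s in zip(ya, ys):
        # Pre-correct: TK = (YA + 24) + YS, or YA + ~YS for subtraction.
        t = (a + (~s & MASK10)) if eop else (a + 24 + s)
        g, p = int(t > 1023), int(t == 1023)
        carry_out = g | (p & carry)
        total = (t & MASK10) + carry
        if not carry_out:
            total += 1000  # same as -24 once truncated to 10 bits
        z.append(total & MASK10)
        carry = carry_out
    return z, carry
```

As a usage example, `add_sub_bcks([5, 2], [7, 1], 1, 1)` models 2005 − 1007 with no shifted-out bits and returns the BCKs `[998, 0]`, while passing `c0=0` (X is 1) returns `[997, 0]`, consistent with Z=YA−YS−1+C0.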
For addition (EOP is 0), with reference to
SR=(EOP==0)?SA:(S max ^ LT).
The components of the Normalize block 1006 are shown in
If LSA is non-zero, the 120-bit result Z is concatenated with ZX and left-shifted via the L-Shifter sub-block 1042 and the multiplexer sub-block 1052 to produce a normalized result N. It should be noted that the left-shift amount LSA cannot exceed 1 when the fractions FA and FB are both normalized (LZ is 0) and the exponent difference is greater than 1. However, LSA can exceed 1 if the exponents EA and EB are equal or differ by at most 1, in which case the sticky bit S is always 0. Therefore, the L-Shifter sub-block 1042 always inserts zero BCKs when the left shift amount LSA>1.
If the carry bit Co is 1 then Z is right-shifted 10 bits (one BCK) and the output of the L-Shifter sub-block 1042 is ignored. The exponent correction is computed as: EC=Co−LSA. It can range from −12 to +1. In summary:
LSA=˜(LZ|Co)?CLZ:0
N=(Co==0)?{Z,ZX}<<(LSA*10):{10′b1,Z[119:10]}
EC=Co−LSA=Co+˜LSA+1
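The three summary equations can be exercised with the following Python sketch, which models Z as a 120-bit integer and ZX as a 10-bit integer. Several points are assumptions of this sketch: CLZ is taken to be the count of leading zero BCKs in Z and is supplied by the caller, N is taken to be the upper 120 bits of the shifted 130-bit concatenation {Z,ZX}, and the name `normalize` is hypothetical.

```python
MASK120 = (1 << 120) - 1


def normalize(z: int, zx: int, co: int, lz: int, clz: int):
    """Return (LSA, N, EC) for the result fraction Z and extra BCK ZX."""
    # LSA = ~(LZ | Co) ? CLZ : 0
    lsa = 0 if (lz or co) else clz
    if co:
        # Carry out: shift right one BCK and insert a leading BCK of value 1.
        n = (1 << 110) | (z >> 10)
    else:
        # Left-shift {Z, ZX} by LSA BCKs; zero BCKs enter from the right.
        shifted = ((z << 10) | zx) << (lsa * 10)
        n = (shifted >> 10) & MASK120  # keep the upper 120 of 130 bits
    ec = co - lsa  # exponent correction
    return lsa, n, ec
```

With one leading zero BCK and ZX = 7, the sketch moves ZX into the least-significant BCK of N and reports EC = −1.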
In parallel, the Extra sub-block 1054 receives Co, S and LSA, along with Z[9:0] and ZX, where LSA is inputted into Extra sub-block 1054 through the NOR gate 1056, to generate two extra bits X1 and X0 used for rounding N. The X1 bit is 1 when the shifted-out BCK that appears immediately after N is greater than or equal to 500. Otherwise, X1 is 0. The shifted-out BCK can be Z[9:0], ZX, or simply zero, according to Co and LSA. The X0 bit is the OR reduction of all bits that are shifted-out after N. The two X bits are defined by the following equations:
X1=(Co)?(Z[9:0]>=500):((LSA==0)&(ZX>=500))
X0=(Co)?((Z[9:0]!=0)|(ZX!=0)|S):(((LSA==0)&(ZX!=0))|S)
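One reading of the two extra-bit equations is sketched below in Python. The grouping of terms in X0 is an interpretation of the text (all shifted-out bits are ORed together), and the name `extra_bits` and its argument order are hypothetical.

```python
def extra_bits(co: int, lsa: int, z_low: int, zx: int, s: int):
    """Return (X1, X0) given Co, LSA, Z[9:0], ZX and the sticky bit S."""
    if co:
        # Right shift by one BCK: Z[9:0] is the first shifted-out BCK,
        # followed by ZX and the sticky bit.
        x1 = int(z_low >= 500)
        x0 = int(z_low != 0 or zx != 0 or s != 0)
    else:
        # ZX is shifted out only when no left shift took place.
        x1 = int(lsa == 0 and zx >= 500)
        x0 = int((lsa == 0 and zx != 0) or s != 0)
    return x1, x0
```

When LSA is non-zero and Co is 0, ZX has been shifted into N, so only the sticky bit can set X0.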
This invention implements four rounding directions (RDir) defined by the IEEE 754-2008 standard:
RDir 0: Round to nearest, with ties away from zero
RDir 1: Round toward zero (truncate)
RDir 2: Round toward positive (round up)
RDir 3: Round toward negative (round down)
The structure and operation of Round operation of the Round & Pack block 1014 (see
The R-decision sub-block 1060 generates the 7-bit round value RVal, according to the lower 6 bits of N, the two X bits, the rounding direction RDir, the sign bit of the result SR, and the 3-bit format f. The rounding values can be 0, 1, 8, or 64. The values 8 and 64 are used for formats f1 and f2. When rounding to nearest, the round bit can be X1, N2, or N5, depending on the format f. When rounding towards positive or negative, the sticky bit can be X0, (|N[2:0]|X0), or (|N[5:0]|X0), depending on f. The notation (|N[5:0]|X0) means the OR-reduction of the 6-bit N[5:0] with X0. The equations of RVal are presented in Table 6 hereinbelow.
Table 6: Round Value according to the round direction, format, result sign, and extra bits
The round value RVal is then inputted into the BCK adder sub-block 1062 and added to the least-significant BCK N[9:0] to generate a 10-bit temporary sum T[9:0], which is computed as either (N[9:0]+RVal+24) or (N[9:0]+RVal) depending on whether a carry C1 is generated or not.
C1=(N[9:0]+RVal+24)>1023
T[9:0]=(C1)?(N[9:0]+RVal+24):(N[9:0]+RVal)
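The behavior of the BCK adder sub-block on the least-significant BCK can be checked with a small Python sketch. The function name `round_low_bck` is hypothetical, and truncating T to 10 bits is assumed from the stated width of T[9:0].

```python
def round_low_bck(n_low: int, rval: int):
    """Return (C1, T[9:0]) for the least-significant BCK N[9:0] and RVal."""
    biased = n_low + rval + 24
    c1 = int(biased > 1023)  # decimal carry: N[9:0] + RVal >= 1000
    # With a carry, the biased sum wraps to (N[9:0] + RVal - 1000);
    # without one, the unbiased sum is already a valid BCK.
    t = (biased if c1 else n_low + rval) & 0x3FF
    return c1, t
```

For example, N[9:0] = 990 with RVal = 64 gives C1 = 1 and T = 54, since 990 + 64 = 1054.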
In parallel, the upper eleven BCKs of N (N[19:10] to N[119:110]) are incremented independently to produce eleven output BCKs T[19:10] to T[119:110] and eleven propagate bits P1 to P11. Each of the 10-bit incrementer sub-blocks 1064a-1064k produces a 10-bit temporary BCK and a propagate bit Pi as follows:
If the propagate bit Pi is 1 then the corresponding 10-bit incremented BCK is 0. The only exception is the most-significant temporary BCK T[119:110], which is 1 (not zero) if P11 is asserted. This speculation means that if an output carry C12 is generated and the rounded fraction is renormalized then the most significant BCK of the result will be 1. The CLA sub-block 1066 generates output carries C2 to C12, based on the values of P1 to P11 and the carry bit C1.
Rounding and renormalization are done in one step in the Round operation. If the carry C12 is 1, then the result fraction must be renormalized and the result exponent must be incremented. Therefore, Inc=C12 is an output signal used to increment the result exponent. RN is defined as the 120-bit rounded and renormalized fraction. It is described with the following equations. If the carry C12 is 1 then the rounded result RN is also renormalized by having RN[119:110]=1 and all other BCKs (RN[109:100] down to RN[9:0]) equal to 0. In this operation, the outputs of the 10-bit incrementer sub-blocks 1064a-1064k are inputted into multiplexers 1068a-1068k, along with the carries C1 to C11 to generate the 120-bit rounded fraction RN as follows:
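The speculative increment of the upper eleven BCKs can be sketched in Python as follows. Several details are assumptions of this sketch: Pi is taken to be 1 when the BCK equals 999, each carry is taken as C(i+1)=Pi & Ci, each multiplexer is assumed to select the incremented BCK when its incoming carry is 1 and the original BCK otherwise, and the carries ripple rather than using lookahead. The name `round_upper_bcks` is hypothetical.

```python
def round_upper_bcks(upper, c1):
    """Round the upper BCKs of N given the carry C1 from the lowest BCK.

    upper: BCK values N[19:10] .. N[119:110], least-significant first.
    Returns (rounded_bcks, inc), where inc models Inc = C12.
    """
    carry, out = c1, []
    last = len(upper) - 1
    for i, bck in enumerate(upper):
        p = int(bck == 999)  # incrementing 999 wraps and propagates
        if p:
            # Wrapped BCK is 0, except the top BCK, which is speculated
            # to be 1 so that a carry-out leaves a renormalized fraction.
            bumped = 1 if i == last else 0
        else:
            bumped = bck + 1
        out.append(bumped if carry else bck)
        carry &= p
    return out, carry
```

When every upper BCK is 999 and C1 is 1, the sketch returns ten zero BCKs below a top BCK of 1 with inc = 1, matching the renormalized form RN[119:110]=1 described above.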
The structure and operation of the Packing operation of the Round & Pack block 1014 are shown in
The 3-bit format g of RN is then inputted into Pack sub-block 1072 along with the 120-bit rounded fraction RN to then output the 116-bit result fraction FR.
The Pack logic is described in Table 7. There are three formats and three ways to pack the result. If the format is g0 then FR={0, RN[114:0]}. If the format is g1 then FR={2′b10, RN[113:3], RN[116:114]} and the least-significant 3-bit FR[2:0]=RN[116:114]. Finally, if the format is g2 then FR={2′b11, RN[113:6], RN[119:114]} and the least-significant 6-bit FR[5:0]=RN[119:114].
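The three packing cases can be written out as bit-field concatenations in a brief Python sketch. RN is modeled as a 120-bit integer and FR as a 116-bit integer; representing the format g as an index 0, 1 or 2 (rather than the 3-bit signal of the text) and the names `bits` and `pack` are assumptions made for illustration.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Extract value[hi:lo] (inclusive) as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def pack(rn: int, g: int) -> int:
    """Pack the 120-bit rounded fraction RN into the 116-bit FR."""
    if g == 0:
        # FR = {0, RN[114:0]}
        return bits(rn, 114, 0)
    if g == 1:
        # FR = {2'b10, RN[113:3], RN[116:114]}
        return (0b10 << 114) | (bits(rn, 113, 3) << 3) | bits(rn, 116, 114)
    if g == 2:
        # FR = {2'b11, RN[113:6], RN[119:114]}
        return (0b11 << 114) | (bits(rn, 113, 6) << 6) | bits(rn, 119, 114)
    raise ValueError("format index must be 0, 1 or 2")
```

In each case the field widths add up to 116 bits (1+115, 2+111+3, and 2+108+6).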
The Exp block 1018, shown in
There are three observations about EC and ER. The first observation is when EC is −12 then N, RN, and FR must all be zeros. This can happen in the case of subtraction, when the input fractions FA and FB are equal and normalized. However, it cannot happen if one of the input fractions FA or FB is not normalized, because left-shifting the result fraction is not allowed in that case. Therefore, if EC is −12 then ER=0.
The second observation is that if ER is incremented to 2047 then overflow occurs. In this case, the result is infinity, ER saturates at 2047, and the most-significant bit of FR must be zero.
The third observation is that if ER is decremented below 0 then underflow occurs. In this case, ER saturates at 0, and the result fraction FR is also reduced to zero.
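The three observations can be collected into a small Python sketch of the exponent saturation. The way the pre-correction exponent is combined with EC and Inc is an assumption of this sketch, as are the parameter name `e`, the function name `result_exponent` and the returned status strings; only the saturation behavior follows the observations above.

```python
def result_exponent(e: int, ec: int, inc: int):
    """Return (ER, status) from a pre-correction exponent e, EC and Inc."""
    if ec == -12:
        return 0, "zero"        # N, RN and FR are all zeros
    er = e + ec + inc           # assumed combination of the corrections
    if er >= 2047:
        return 2047, "overflow"  # result is infinity; ER saturates at 2047
    if er < 0:
        return 0, "underflow"    # ER saturates at 0 and FR becomes zero
    return er, "normal"
```

A caller would also clear FR in the underflow case, which this exponent-only sketch does not model.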
In one embodiment, a processor of a processing environment executes instructions or code that includes one or more Floating Point Operations or calculations at least partially dependent on decimal floating-point arithmetic units. One embodiment of a processing environment to incorporate and use one or more aspects of the present invention includes, for instance, a z/Architecture® processor (e.g., a central processing unit (CPU)), a memory (e.g., main memory), and one or more input/output (I/O) devices coupled to one another via, for example, one or more buses and/or other connections (e.g., wireless).
A z/Architecture® processor is a part of a System z™ server, offered by International Business Machines Corporation (IBM®). System z™ servers implement IBM's z/Architecture®, which specifies the logical structure and functional operation of the computer. The System z™ server executes an operating system, such as z/OS®, also offered by International Business Machines Corporation. IBM® and z/OS® are registered trademarks of International Business Machines Corporation, Armonk, N.Y., USA and may rely at least partially on arithmetic representation and/or calculations that rely or include the floating decimal point arithmetic units of the present disclosure.
In another embodiment, the instruction and/or the logic of an instruction can be executed in a processing environment that is based on one architecture (which may be referred to as a “native” architecture), but emulates another architecture (which may be referred to as a “guest” architecture). In such an environment, for example, a Perform Floating Point Operation instruction and/or logic thereof, which is specified in the z/Architecture® and designed to execute on a z/Architecture® machine, is emulated to execute on an architecture other than the z/Architecture®. These instructions may rely on or reference, at least partially, an arithmetic representation and/or calculation that includes the floating decimal point arithmetic units of the present disclosure.
As examples, processing environment 1000 may include a Power PC® processor, a pSeries® server, or an xSeries® server offered by International Business Machines Corporation, Armonk, N.Y.; an HP Superdome with Intel® Itanium® 2 processors offered by Hewlett-Packard Company, Palo Alto, Calif.; and/or other machines based on architectures offered by IBM®, Hewlett-Packard, Intel®, Sun Microsystems or others. Power PC®, pSeries® and xSeries® are registered trademarks of International Business Machines Corporation, Armonk, N.Y., U.S.A. Intel® and Itanium® 2 are registered trademarks of Intel Corporation, Santa Clara, Calif.
A native central processing unit may include one or more native registers, such as one or more general-purpose registers and/or one or more special purpose registers, used during processing within the environment. These registers include information that represents the state of the environment at any particular point in time and may rely on or reference, at least partially, an arithmetic representation and/or calculation that includes the floating decimal point arithmetic units of the present disclosure. While specific embodiments have been described in detail in the foregoing detailed description and illustrated in the accompanying drawings, those with ordinary skill in the art will appreciate that various modifications and alternatives to those details could be developed in light of the overall teachings of the disclosure. Accordingly, the particular arrangements disclosed are meant to be illustrative only and not limiting as to the scope of the invention, which is to be given the full breadth of the appended claims and any and all equivalents thereof.