The disclosure generally relates to adders for operands represented in a fractional logarithmic number system.
In a logarithmic number system (LNS), the real value of a number is approximated by its nearest power of two. An LNS number is represented by the sign and logarithm of the absolute value of the number. LNS representations allow multiplication to be a simple addition of exponents. For low-precision neural network inference and training, which involve many multiplications, LNS representations can provide significant memory and computation savings.
Though LNS representations simplify multiplication and division, addition and subtraction become complicated. Addition and subtraction involve interpolation of a nonlinear function and use of lookup tables, which significantly increases memory and computation requirements.
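The trade-off can be illustrated with a minimal sketch that operates directly on base-2 exponents of positive values. The function names are illustrative and not part of the disclosure. The addition uses the standard identity log2(2^a + 2^b) = a + log2(1 + 2^(b−a)) for a ≥ b, whose correction term is the nonlinear function referred to above.

```python
import math

def lns_mul(a, b):
    """Multiply two positive LNS values given their log2 exponents."""
    return a + b  # multiplication is a plain addition of exponents

def lns_add(a, b):
    """Add two positive LNS values given their log2 exponents."""
    hi, lo = max(a, b), min(a, b)
    # The correction term log2(1 + 2**(lo - hi)) is nonlinear in (lo - hi).
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))
```

For example, `lns_mul(3, 2)` yields the exponent of 8 × 4, while `lns_add(3, 2)` needs a logarithm evaluation to yield the exponent of 8 + 4.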
A disclosed adder for fractional logarithmic number system (FLNS) format operands includes a compare-and-swap circuit that is configured to input first and second FLNS operands represented by fixed point values, provide a greater one of the first and second operands as operand x, and provide a lesser or equal one of the first and second operands as operand y. The bits sx and sy are sign bits of x and y, respectively; qx and qy are integer portions of x and y, respectively; and the fraction portions of x and y, interpreted as integers, have values rx and ry, respectively. The FLNS operand x is $x = s_x \cdot 2^{q_x + r_x/n}$.
A disclosed method for adding fractional logarithmic number system (FLNS) format operands includes inputting first and second FLNS operands represented by fixed point values to a compare-and-swap circuit, providing a greater one of the first and second operands as operand x, and providing a lesser or equal one of the first and second operands as operand y. The sign bits of x and y are sx and sy, respectively; qx and qy are integer portions of x and y, respectively; and the fraction portions of x and y, interpreted as integers, have values rx and ry, respectively. The FLNS operand x is $x = s_x \cdot 2^{q_x + r_x/n}$.
Other features will be recognized from consideration of the Detailed Description and Claims, which follow.
Various aspects and features of the circuits and methods will become apparent upon review of the following detailed description and upon reference to the drawings in which:
In the following description, numerous specific details are set forth to describe specific examples presented herein. It should be apparent, however, to one skilled in the art, that one or more other examples and/or variations of these examples may be practiced without all the specific details given below. In other instances, well-known features have not been described in detail so as not to obscure the description of the examples herein. For ease of illustration, the same reference numerals may be used in different diagrams to refer to the same elements or additional instances of the same element.
Fractional LNS (FLNS) formats have been used to improve LNS precision via fractional exponents. In an FLNS format, the exponent is represented by a quotient and a remainder. In the FLNS representation of a number x, where M is the bit-width of x,
$x = s_x \cdot 2^{\dot{x}/\gamma}, \quad \dot{x} = 0, 1, 2, \ldots, 2^{M-1}-1,$
where $\dot{x}$ is an integer and γ is the base factor that controls the fractional exponent of the base. γ controls the quantization gap, which is the distance between successive representable values within the number system.
The FLNS expression of x can be alternatively stated as:
$x = s_x \cdot 2^{q_x + r_x/2^{w_r}},$
where qx and rx are the quotient and remainder of $\dot{x}/\gamma$, and wr represents the bit-width of the remainder.
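The quotient/remainder form can be sketched in Python as follows. This is a minimal illustration, assuming a remainder bit-width of 3 and a base factor γ = 2^wr; the names `W_R`, `GAMMA`, `flns_encode`, and `flns_decode` are hypothetical. Unlike the range given above, the sketch also permits negative exponent codes so that values below 1 can be encoded.

```python
import math

W_R = 3             # assumed remainder bit-width (illustrative)
GAMMA = 1 << W_R    # assumed base factor, gamma = 2**w_r

def flns_decode(s, q, r):
    """Real value of an FLNS number with sign s (+1 or -1), quotient q, remainder r."""
    return s * 2.0 ** (q + r / GAMMA)

def flns_encode(value):
    """Nearest FLNS triple (s, q, r) for a nonzero real value."""
    s = 1 if value > 0 else -1
    x_dot = round(math.log2(abs(value)) * GAMMA)  # integer exponent code
    q, r = divmod(x_dot, GAMMA)                   # quotient and remainder of x_dot / gamma
    return s, q, r
```

With these assumptions, 8.0 encodes exactly as (1, 3, 0), while −3.0 is quantized to (−1, 1, 5), which decodes to roughly −3.08. The difference is the quantization gap controlled by γ.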
Prior approaches involving FLNS have attempted to reduce the hardware resource requirements for performing addition operations by converting the operands to fixed point format and using lookup tables to determine the contribution of the remainder of the exponent. However, the conversion between FLNS format and fixed point introduces extra overhead and can significantly degrade performance in applications such as neural networks.
The disclosed approaches avoid inefficiencies associated with converting operands to fixed point values and converting sums back to FLNS format while improving computational efficiency in adding FLNS format operands. Operands need not be converted from FLNS format to fixed point for accumulation. Avoiding the conversion of values between FLNS format and fixed-point format can significantly improve performance and reduce resource requirements in applications such as neural networks in which accumulated values from one layer are provided as input to the next layer for multiplications.
The disclosed methods and circuits provide a conversion-free FLNS adder of two operands. Each addition is performed by way of a subtraction circuit performing logarithmic division of the operand having the lesser absolute value by the operand having the greater absolute value, approximation circuitry estimating a nearest FLNS value of the result plus 1, and an adder circuit performing a logarithmic multiplication of the estimated value and the greater operand.
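The sum can be written as x+y=x(1+α), where α=y/x is expressed in FLNS form as $\alpha = s_\alpha \cdot 2^{q_\alpha + r_\alpha/n}$.
The term (1+α) can be approximated (1+α→β) to the nearest FLNS value, $\beta = 2^{q_\beta + r_\beta/n}$.
The sum, z, can be efficiently calculated by adding the exponents of x and β. Note that β>0 because |y|/|x|<1.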
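A worked numeric instance may help. It assumes n = 8 fractional steps per unit exponent, an illustrative choice, and two positive operands.

```latex
\begin{align*}
x &= 2^{2} = 4, \qquad y = 2^{1} = 2, \qquad \alpha = y/x = 2^{-1} \\
1 + \alpha &= 1.5, \qquad \log_2 1.5 \approx 0.585 \approx 5/8 \\
\beta &= 2^{0 + 5/8} \approx 1.542 \quad (q_\beta = 0,\ r_\beta = 5) \\
z &\approx x \cdot \beta = 2^{2 + 5/8} \approx 6.17 \quad (\text{exact sum: } 6)
\end{align*}
```

The error relative to the exact sum comes from rounding log2(1+α) to the nearest multiple of 1/n.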
Referring to
Circuits 106 and 108 compare the exponent elements of OP1 and OP2 and provide the one of OP1 and OP2 having the greater absolute value as a fixed-point twos-complement operand x in register 110 and the operand having the lesser absolute value as a fixed-point twos-complement operand y in register 112.
Subtraction circuit 114 subtracts x from y (y−x=qy−qx+ry/n−rx/n) and stores the result in fixed-point twos-complement form in register 116. The integer portion of the value in register 116 is qα, and the fraction portion of the value in register 116 when interpreted as an integer is rα.
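The compare-and-swap and subtraction steps can be modeled in software as follows. This is a behavioral sketch rather than a description of the circuits: each operand is assumed to be a (sign, exponent code) pair whose code is the fixed-point exponent scaled by n, that is q·n + r, and the names `N`, `compare_and_swap`, and `log_divide` are hypothetical.

```python
N = 8  # assumed number of fractional steps per unit exponent

def compare_and_swap(op1, op2):
    """Return (x, y) with |x| >= |y|; operands are (sign, exponent_code) pairs."""
    # A larger exponent code means a larger absolute value.
    return (op1, op2) if op1[1] >= op2[1] else (op2, op1)

def log_divide(x, y):
    """Exponent code of |y|/|x|: (q_y*N + r_y) - (q_x*N + r_x), never positive."""
    return y[1] - x[1]
```

Because the operands are ordered first, the difference is never positive, which matches |y|/|x| ≤ 1.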
Comparison circuit 118, mapping circuits M1, M2, and M3, and selector circuit 126 form an approximation circuit. The approximation circuit maps (1+α) to the nearest FLNS value, $\beta = 2^{q_\beta + r_\beta/n}$.
The mapping circuits 120, 122, and 124 implement three different mappings, and the selector circuit selects the output from one of the mapping circuits. Each of mapping circuits M1 and M2 outputs an unsigned binary format integer rβ, and mapping circuit M3 outputs unsigned binary integers qβ and rβ. The different mappings are based on mutually exclusive cases of the signs and ratio of |x| to |y|. The output of mapping M1 (“case (i)”) is selected in response to sx=sy, the output of mapping M2 (“case (ii)”) is selected in response to sx≠sy and |x|≥2|y|, and the output of mapping M3 (“case (iii)”) is selected in response to sx≠sy and |x|<2|y|<2|x|.
After swapping x and y such that |x|≥|y|, x+y is computed as $x+y = x(1+y/x) \approx x \times 2^{\beta}$. If sx=sy, then 1+y/x>1, $2^{\beta}>1$, and β≥0. If sx≠sy, then 1+y/x<1, $2^{\beta}<1$, and β<0. Thus, for case i, β>0, and for cases ii and iii, β<0. To avoid twos-complement conversions for accessing the mapping circuits, the implemented mappings assume β>0, and β is applied differently between case i and cases ii and iii. For case i, $x+y \approx x \times 2^{\beta}$, and for cases ii and iii, $x+y \approx x \times 2^{-\beta}$. The output from M2 for case ii is rβ≥0, though the actual value of rβ for β in case ii is less than or equal to 0. The output from M3 for case iii is qβ≥0 and rβ>0, though the actual value of qβ is less than or equal to 0 and rβ is less than 0. Given that the actual values of mappings for cases ii and iii are less than or equal to 0, the outputs from mapping circuits M2 and M3 are converted to negative twos-complement values.
The mappings have either n or n−1 entries. In mapping M1, the sum z is bounded within range (x, 2x], i.e., $(s_x \cdot 2^{q_x + r_x/n},\ s_x \cdot 2^{q_x + 1 + r_x/n}]$.
Selector circuit 126 selects one of the outputs from the mapping circuits 120, 122, and 124 based on the states of the signals from comparison circuit 118 and the signal from XNOR circuit 130. In response to sx=sy, the selector circuit selects the output from mapping circuit 120 (M1); in response to sx≠sy and qα≠0, the selector circuit selects the output from mapping circuit 122 (M2); and in response to sx≠sy and qα=0, the selector circuit selects the output from mapping circuit 124 (M3). The signed binary integers qβ and rβ are stored as a signed fixed point value in register 128. The integer portion of the value in register 128 is qβ, and the fraction portion of the value in register 128 when interpreted as an integer is rβ.
Note that qβ=0 is stored in register 128 when the output of mapping M1 or M2 is selected.
For case (i), the output of mapping M1 is always a positive value, and for cases (ii) and (iii), the outputs of mappings M2 and M3 represent negative values but are provided as unsigned magnitudes. Twos-complement converter circuit 132 converts the value from register 128 to a signed twos-complement value (invert integer bits and add 1 to LSB), and selector circuit 134 selects either the value from register 128 or the signed twos-complement value from converter circuit 132 in response to the signal from XNOR circuit 130. In response to sx=sy, the signal from the XNOR circuit causes selector circuit 134 to select the output from register 128, and in response to sx≠sy, the signal from the XNOR circuit causes selector circuit 134 to select the output from converter circuit 132.
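The conversion of an unsigned magnitude to a negative twos-complement value can be sketched as below. The word width and the function names are assumptions for illustration, and the sketch inverts every bit of the word, which is the conventional formulation rather than a statement about how converter circuit 132 is built.

```python
def twos_complement_negate(value, width):
    """Negate an unsigned `width`-bit word: invert all bits, add 1, wrap to width."""
    mask = (1 << width) - 1
    return (~value + 1) & mask

def to_signed(word, width):
    """Interpret a `width`-bit word as a signed twos-complement integer."""
    return word - (1 << width) if word & (1 << (width - 1)) else word
```

For an 8-bit word holding the magnitude 29, the negated word is 227, which reads back as −29 when interpreted as signed.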
Summing circuitry computes qx+rx/n+qβ+rβ/n in response to sx=sy, and qx+rx/n−qβ−rβ/n in response to sx≠sy, to provide the sum z as a fixed point value having an integer portion qz and a fraction portion that as an integer has the value rz ($z = s_z \cdot 2^{q_z + r_z/n}$).
The twos-complement converter 132 is a circuit that converts the unsigned fixed point value from register 128 to a negative twos-complement value. The selector circuit 134 selects as an addend either the fixed point value from register 128 in response to the signal from XNOR circuit 130 indicating sx=sy, or the negative twos-complement value from circuit 132 in response to the signal from the XNOR circuit indicating sx≠sy. The adder circuit 136 adds the value from register 110 (without the sign bit sx) to the addend (without the sign bit if the twos-complement value is selected) selected by selector circuit 134 and provides the sum as a fixed point value in register 138.
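The overall data path can be approximated by a short behavioral model. This sketch is not the disclosed circuit: it replaces the mapping circuits with a direct evaluation of log2 rounded to the nearest step, treats operands as (sign, exponent code) pairs with code = q·n + r, and assumes n = 8. The names `N` and `flns_add` are hypothetical.

```python
import math

N = 8  # assumed fractional steps per unit exponent

def flns_add(op1, op2):
    """Behavioral model: operands and result are (sign, exponent_code) pairs."""
    x, y = (op1, op2) if op1[1] >= op2[1] else (op2, op1)  # ensure |x| >= |y|
    d = y[1] - x[1]                  # exponent code of |y|/|x|, never positive
    ratio = 2.0 ** (d / N)
    if x[0] == y[0]:                 # case (i): magnitudes add, beta is added
        beta = round(N * math.log2(1.0 + ratio))
        return (x[0], x[1] + beta)
    if d == 0:
        raise ValueError("x + y is exactly zero, which has no FLNS exponent")
    beta = round(-N * math.log2(1.0 - ratio))  # non-negative magnitude of beta
    return (x[0], x[1] - beta)       # cases (ii) and (iii): beta is subtracted
```

With n = 8, adding the codes for 4 and 2 gives `(1, 21)`, about 6.17, and adding 4 and −2 gives `(1, 8)`, exactly 2. The result always takes the sign of the larger-magnitude operand.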
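$\forall r_\beta \in \mathbb{Z},\ r_\beta < n-1:\quad 1 + 2^{q_\alpha + r_\alpha/n} \le 2^{r_\beta/n}$
which reduces to:
$\forall r_\beta \in \mathbb{Z},\ r_\beta < n:\quad q_\alpha + r_\alpha/n \le \log_2\left(2^{r_\beta/n} - 1\right)$
The right-hand side of the inequality defines the threshold values. Similarly, the inequality at case (ii) is
$\forall r_\beta \in \mathbb{Z},\ r_\beta < n:\quad q_\alpha + r_\alpha/n \le \log_2\left(-2^{-r_\beta/n} + 1\right)$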
The input to the LUT circuit is an integer value of rα, and the output is a fixed point value, qβ+rβ/n, having qβ as the integer portion and rβ as the fraction portion when interpreted as an integer. The values configured into the LUT circuit are pre-computed as $-\log_2\left(1 - 2^{-r_\alpha/n}\right)$.
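A lookup table of this kind could be pre-computed along the following lines. The sketch rests on assumptions: n = 8, table entries equal to −log2(1 − 2^(−rα/n)) rounded to the nearest multiple of 1/n, rα treated as the magnitude of the fractional exponent of |y|/|x|, and the entry for rα = 0 omitted because it corresponds to an exactly zero sum. The name `build_m3_lut` is hypothetical.

```python
import math

N = 8  # assumed fractional steps per unit exponent

def build_m3_lut(n=N):
    """Map r_alpha in 1..n-1 to (q_beta, r_beta) for -log2(1 - 2**(-r_alpha/n))."""
    lut = {}
    for r_alpha in range(1, n):
        code = round(-n * math.log2(1.0 - 2.0 ** (-r_alpha / n)))
        lut[r_alpha] = divmod(code, n)  # (integer portion, fraction as an integer)
    return lut
```

Under these assumptions the table has n−1 entries, and small values of rα produce an integer portion greater than zero, which is why this mapping needs to output qβ as well as rβ.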
The maximum threshold, T(rβ_max), is the pre-computed threshold with the maximum possible value of rβ, and T(rβ_min) is the pre-computed threshold with the minimum possible value of rβ. Each threshold T(rβ) is computed as $\log_2\left(2^{r_\beta/n} - 1\right)$.
At the top of the search tree, comparison 202 compares qα+rα/n to T(rβ_max/2), which is the threshold at approximately the middle of the range of values of rβ. Note that each division of rβ_max can be the floor of the result (i.e., floor(rβ_max/m) for m a power of 2 greater than 0).
In response to qα+rα/n being equal to the threshold T(rβ_max/2), the output value is rβ_max/2. In response to qα+rα/n<T(rβ_max/2), the decision tree continues with comparison 204 of qα+rα/n to T(rβ_max/4). In response to qα+rα/n>T(rβ_max/2), the decision tree continues with comparison 206 of qα+rα/n to T(rβ_max/2+rβ_max/4).
Comparison 206 compares qα+rα/n to T(rβ_max/2+rβ_max/4). In response to qα+rα/n being equal to the threshold T(rβ_max/2+rβ_max/4), the output value is rβ_max/2+rβ_max/4. In response to qα+rα/n<T(rβ_max/2+rβ_max/4), the decision tree continues with comparison 208 of qα+rα/n to T(rβ_max/2+rβ_max/4−rβ_max/8).
At comparison 208, in response to qα+rα/n<T(rβ_max/2+rβ_max/4−rβ_max/8), the decision tree continues with a comparison of qα+rα/n to T(rβ_max/2+rβ_max/4−rβ_max/8−rβ_max/16) (not shown). In response to qα+rα/n>T(rβ_max/2+rβ_max/4−rβ_max/8), the decision tree continues with a comparison of qα+rα/n to T(rβ_max/2+rβ_max/4−rβ_max/8+rβ_max/16) (not shown). In response to qα+rα/n being equal to the threshold T(rβ_max/2+rβ_max/4−rβ_max/8), the output value is rβ_max/2+rβ_max/4−rβ_max/8.
The search in the decision tree continues as described above until qα+rα/n is equal to a threshold or a comparison at the lowest level in the tree has been reached. At the lowest-level comparison against a threshold T(k), where k is the value of rβ associated with that comparison, if qα+rα/n is less than T(k), then the output is rβ=k. If qα+rα/n is greater than T(k), then the output is rβ=k+1.
The decision tree can be implemented by a programmed processor or by programmable logic. The programmed processor can access a data structure having the threshold values and indexed by values of rβ. A programmable logic implementation can include individual comparison circuits having pre-configured threshold values and associated values of rβ.
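For the programmed-processor option, the decision tree amounts to a binary search over a table of thresholds sorted in ascending order and indexed by rβ. The sketch below is one possible rendering under that assumption. The function name and the threshold values in the usage note are hypothetical, and an input above every threshold is clamped to the last index.

```python
def search_r_beta(v, thresholds):
    """Return the smallest index r with v <= thresholds[r]; thresholds ascend."""
    lo, hi = 0, len(thresholds) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if v == thresholds[mid]:
            return mid           # exact match ends the search early
        if v < thresholds[mid]:
            hi = mid             # continue in the lower half
        else:
            lo = mid + 1         # continue in the upper half
    return lo
```

With the illustrative table `[-3.0, -2.0, -1.0, -0.5]`, an input of −1.5 resolves to index 2 after two comparisons, and an input equal to −2.0 returns index 1 at the first comparison.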
At block 304, the sign bit of operand x is selected as the sign of the sum and can be stored in a register at the bit position of the sign bit of the signed fixed point sum.
At block 306, a subtraction circuit can determine |y|/|x| by subtracting (qy+ry/n)−(qx+rx/n), where (qy+ry/n) denotes the unsigned fixed point value of y, and (qx+rx/n) denotes the unsigned fixed point value of x. The difference is {qα, rα}, which denotes the fixed point value having an integer part qα, and a fractional part that as an integer is denoted rα.
At block 308, the term (1+α) is approximated (1+α→β) to the nearest FLNS value, $\beta = 2^{q_\beta + r_\beta/n}$.
At block 322, the fixed point values {qx, rx} and {qβ, rβ} are summed by an adder, and the result {sz, qz, rz} is output at block 324.
Referring to the PS 402, each of the processing units includes one or more central processing units (CPUs) and associated circuits, such as memories, interrupt controllers, direct memory access (DMA) controllers, memory management units (MMUs), floating point units (FPUs), and the like. The interconnect 416 includes various switches, busses, communication links, and the like configured to interconnect the processing units, as well as interconnect the other components in the PS 402 to the processing units.
The OCM 414 includes one or more RAM modules, which can be distributed throughout the PS 402. For example, the OCM 414 can include battery backed RAM (BBRAM), tightly coupled memory (TCM), and the like. The memory controller 410 can include a DRAM interface for accessing external DRAM. The peripherals 408, 415 can include one or more components that provide an interface to the PS 402. For example, the peripherals can include a graphics processing unit (GPU), a display interface (e.g., DisplayPort, high-definition multimedia interface (HDMI) port, etc.), universal serial bus (USB) ports, Ethernet ports, universal asynchronous receiver-transmitter (UART) ports, serial peripheral interface (SPI) ports, general purpose input/output (GPIO) ports, serial advanced technology attachment (SATA) ports, PCIe ports, and the like. The peripherals 415 can be coupled to the MIO 413. The peripherals 408 can be coupled to the transceivers 407. The transceivers 407 can include serializer/deserializer (SERDES) circuits, MGTs, and the like.
Various logic may be implemented as circuitry to carry out one or more of the operations and activities described herein and/or shown in the figures. In these contexts, a circuit or circuitry may be referred to as “logic,” “module,” “engine,” or “block.” It should be understood that logic, modules, engines and blocks are all circuits that carry out one or more of the operations/activities. In certain implementations, a programmable circuit is one or more computer circuits programmed to execute a set (or sets) of instructions stored in a ROM or RAM and/or operate according to configuration data stored in a configuration memory.
Though aspects and features may in some cases be described in individual figures, it will be appreciated that features from one figure can be combined with features of another figure even though the combination is not explicitly shown or explicitly described as a combination.
The circuits and methods are thought to be applicable to a variety of systems for adding FLNS operands. Other aspects and features will be apparent to those skilled in the art from consideration of the specification. The circuits and methods may be implemented as one or more processors configured to execute software, as an application specific integrated circuit (ASIC), or as logic on a programmable logic device. It is intended that the specification and drawings be considered as examples only, with a true scope of the invention being indicated by the following claims.