A Block Floating-Point (BFP) number system represents a block of floating-point (FP) numbers by a shared exponent (typically the largest exponent in the block) and right-shifted significands of the block of FP numbers. Computations using BFP can provide improved accuracy compared to integer arithmetic and use fewer computing resources than full floating point. However, the range of numbers that can be represented using a BFP format is limited, since small numbers are replaced by zero when the significands are right-shifted too far.
In some applications, such as computational neural networks, input data may have a very large range. The use of BFP in such applications can lead to inaccurate results. In applications that use a large amount of data, the use of higher precision number representations may be precluded by limitations on storage resources, etc.
The accompanying drawings provide visual representations which will be used to more fully describe various representative embodiments, and can be used by those skilled in the art to better understand the representative embodiments disclosed and their inherent advantages. In these drawings, like reference numerals identify corresponding or analogous elements.
The various apparatus and devices described herein provide mechanisms for data processing using an enhanced block floating point data format.
While this present disclosure is susceptible of embodiment in many different forms, there is shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the embodiments shown and described herein should be considered as providing examples of the principles of the present disclosure and are not intended to limit the present disclosure to the specific embodiments shown and described. In the description below, like reference numerals are used to describe the same, similar or corresponding parts in the several views of the drawings. For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.
In accordance with various embodiments, a data processing apparatus is configured to determine a product of two operands stored in an Extended Block Floating-Point (EBFP) format. The operands are decoded, based on their tags and payloads, to generate exponent differences and fractions. Significands of the fractions are multiplied to generate an output significand and shared exponents and exponent differences of the operands are combined to generate an output exponent. Signs of the operands may also be combined to provide an output sign. The apparatus may be combined with an accumulator having one or more lanes to provide an apparatus for determining dot products.
A number may be represented as (−1)^s × m × b^e, where s is a sign value, m is a significand, e is an exponent and b is a base. In some binary (b=2) floating-point representations, such as the 32-bit IEEE (Institute of Electrical and Electronics Engineers) format, the significand is either zero or in the range 1≤m<2. For non-zero values of m, the value m−1 is referred to as the fractional part of the significand. The 32-bit IEEE format stores the exponent as an 8-bit value and the significand as a 23-bit value.
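As an illustration of this representation (a sketch only, not part of the disclosed apparatus), the fields of a 32-bit IEEE number can be extracted as follows for normal (non-zero, non-subnormal) values; the function name is illustrative:

```python
import struct

def ieee754_fields(x: float):
    """Split a 32-bit IEEE float into sign s, unbiased exponent e and
    significand m, so that x == (-1)**s * m * 2**e for normal numbers."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    s = bits >> 31                      # 1 sign bit
    e = ((bits >> 23) & 0xFF) - 127     # 8 exponent bits, bias of 127
    frac = bits & 0x7FFFFF              # 23-bit fraction field
    m = 1.0 + frac / 2**23              # implicit leading one: 1 <= m < 2
    return s, e, m
```

For example, −6.5 decodes to s=1, e=2, m=1.625, since −6.5 = (−1)^1 × 1.625 × 2^2.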
A Block Floating-Point (BFP) number system represents a block of floating-point (FP) numbers by a shared exponent (typically the largest exponent in the Block) and right-shifted significands of the block of FP numbers. The present disclosure improves upon BFP by representing small FP numbers (that would ordinarily be set to zero) by the difference between the exponent and the shared exponent. A tag bit indicates whether the EBFP number represents a shifted significand or the exponent difference.
Some data processing applications, such as Neural Network (NN) processing, require very large amounts of data. For example, a single network architecture can use millions of parameters. Consequently, there is great interest in storing data as efficiently as possible. In some applications, for example, 8-bit scaled integers are used for inference, but data for training requires floating-point numbers with a greater exponent range than the 16-bit IEEE half-precision format, which has only 5 exponent bits. A 16-bit “Bfloat” format has been used for NN training tasks. The Bfloat format has a sign bit, 8 exponent bits, and 7 fraction bits (denoted as s,8e,7f). Other FP formats include “DLfloat”, which has 6 exponent bits and 9 fraction bits (s,6e,9f), as well as other 8-bit formats having more exponent bits than fraction bits (such as s,4e,3f and s,5e,2f).
Block Floating-Point (BFP) representation has been used in a variety of applications, such as NN and Fast Fourier Transforms. In BFP, a block of data shares a common exponent, typically the largest exponent of the block to be processed. The significands of FP numbers are right-shifted by the difference between their individual exponents and the shared exponent. BFP has the added advantage that arithmetic processing can be performed on integer data paths saving considerable power and area in NN hardware implementation. BFP appears particularly well-suited to computing dot products because numbers with smaller exponents will not contribute many bits, if any, to the result. However, a difficulty with using BFP for processing Convolutional Neural Networks (CNNs) is that output feature maps are derived from multiple input feature maps which can have widely differing numeric distributions. In this case, many or even most of the numbers in a BFP scheme for encoding feature maps could end up being set to zero. By contrast, the weights employed in CNNs are routinely normalized to the range −1 . . . +1. Given that successful training and inference is usually dependent on the highest magnitude parameter of each filter, blocks of weights need exponents to sit only within a relatively small range.
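The flush-to-zero behavior described above can be demonstrated with a minimal BFP quantizer (a sketch only; the significand width and rounding mode are illustrative):

```python
import math

def bfp_quantize(values, sig_bits=8):
    """Quantize a block of FP numbers to Block Floating-Point: all values
    share the largest exponent, and each significand is right-shifted by
    the difference between its own exponent and the shared exponent."""
    # frexp decomposes v = f * 2**e with 0.5 <= |f| < 1
    shared = max(math.frexp(v)[1] for v in values)
    out = []
    for v in values:
        # integer significand with sig_bits bits, scaled by the shared exponent
        q = round(v * 2 ** (sig_bits - shared))
        out.append(q * 2.0 ** (shared - sig_bits))
    return out
```

A block containing both 1.0 and 2^−20 returns [1.0, 0.0] with 8 significand bits: the small operand is shifted entirely out of range, which is the failure mode illustrated by the BFP dot product of TABLE 2.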
TABLE 1 shows an example dot product computation for vector operands A and B. The numbers are denoted by hexadecimal significands with radix-2 exponents. Corresponding decimal significands and exponents are shown in brackets. The maximum of each vector is shown in bold font.
−0x1.ccp+20 (−1.80 × 2^20)
+0x1.dep+19 (1.87 × 2^19)
Result
+0x1.5d1bp+29
TABLE 2 shows the same dot product computation for vector operands A and B performed using Block Floating Point arithmetic. In this example, the dot product is calculated as zero because a number of small operands are represented by zero in the Block Floating Point format.
BFP Result
0
This example illustrates that conventional Block Floating Point arithmetic is not well suited for use where data has a large range of values.
The present disclosure uses a number format, referred to as Enhanced Block Floating Point (EBFP). The format may be used in applications such as convolutional neural networks where (i) individual feature maps have widely differing numeric distributions and (ii) filter kernels only require their larger parameters to be represented with higher accuracy.
In accordance with various embodiments, the exponent of a floating number to be encoded is compared with the shared exponent: when the difference is large enough that the BFP representation would be zero due to all the significand bits being shifted out of range, the exponent difference is stored; otherwise, the suitably encoded significand is stored.
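This selection rule can be sketched as follows for an 8-bit radix-2 layout with one tag bit and 6 payload bits (the layout described below with reference to TABLE 3); the function name and field packing are illustrative, and rounding and special cases such as zero are omitted:

```python
def ebfp8_encode(exp_diff: int, frac6: int):
    """Choose between the two EBFP payload forms (one tag bit, 6 payload
    bits, radix 2). exp_diff is the difference from the shared exponent;
    frac6 holds the top 6 fraction bits of the significand.
    Returns (tag, payload)."""
    BIAS = 6
    if exp_diff <= 5:
        # Tag 0: right-shifted significand with an explicit leading one:
        # exp_diff leading zeros, a one, then the surviving fraction bits.
        kept = 5 - exp_diff                      # fraction bits that survive
        return 0, (1 << kept) | (frac6 >> (6 - kept))
    # Tag 1: all significand bits are shifted out; store only the
    # biased exponent difference.
    return 1, exp_diff - BIAS
```

With exp_diff=0 and fraction bits 101010, the payload is 110101 (a leading one followed by five fraction bits); with exp_diff=8, only the biased exponent difference 2 is stored.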
In accordance with an embodiment of the disclosure, an input datum in EBFP format is converted into a number in floating-point format in a data processor. A payload of the EBFP number can be in a first format or a second format. The format of an input datum is determined based on a tag value of the input datum. For the first format, an exponent and significand of a floating-point number are determined, based on a payload of the input datum and a shared exponent. For the second format, the exponent of the floating-point number is determined, based on the payload of the input datum and the shared exponent. In this case, the floating-point number has a designated significand, such as the value “1.” The output floating-point number consists of a sign copied from the input datum, the exponent of the floating-point number and the significand of the floating-point number.
The EBFP format is described in more detail below with reference to an apparatus for converting an EBFP number to a floating-point (FP) number.
First word 204 includes sign bit 210, 1-bit tag 212, and a payload consisting of fields 214, 216, 218 and 220. The tag 212 is set to zero to indicate that the payload is associated with a significand. Fields 214, 216 and 218 indicate a difference between the shared exponent 202 and the exponent of the number being represented. Field 214 contains L zeros, where L may be zero. Field 216 contains a “one” bit, and field 218 contains an R-bit integer P, where R is a designated integer. The base of the representation is called its “radix”: the radix is 2 when R=0, 4 when R=1, and 8 when R=2. Field 218 is omitted when R=0. The exponent difference is given by 2^R×L+P. Field 220 is a rounded and right-shifted fractional part of the significand. The total number of bits in the payload is fixed. Since the number of zeros in field 214 is variable, the number of bits, T, in the fraction field varies accordingly. When the integer value of field 220 is F, the significand is given by 1 + 2^(−T)×F, which may be denoted by 1.fff . . . f. Thus, when the shared exponent is se, the number represented is:
x = 2^se × 2^(−(2^R×L+P)) × (1 + 2^(−T)×F).
Thus, a decoder can determine the represented number by determining L, P and F from an EBFP payload. In one embodiment, the designated number R is zero and the radix is two. In this case
x = 2^se × 2^(−L) × (1 + 2^(−T)×F),
and the payload is simply the right-shifted significand. The exponent difference may be determined by counting the number of leading zeros in the EBFP number.
In second word 206, payload 222 is set to zero. When the tag bit is zero, the payload represents the number zero. When the tag bit is one, the payload represents an exponent difference of −1. This can occur when rounding causes the maximum value to overflow. Thus, the number represented is 2^(se+1).
In third word 208, the tag bit is set to one to indicate that payload 224 relates only to the exponent difference. When the payload is an integer E, the number represented is 2^(se−(E+bias)), where bias is an offset or bias value. The bias value is included since some small values of exponent difference can be represented by first word 204.
TABLE 3 shows how exponent difference and significand values are determined from a payload for an example implementation, where the payload has 8 bits and includes a sign bit, a tag bit and 6 payload bits. In this example, R=0, so the radix is 2. The format is designated “8r2”. In the table below, “f” denotes a fractional bit of the input value and “e” denotes one bit of the biased exponent difference.
For a zero tag, the bits indicated in bold font indicate the encoding of the exponent difference. In this example, the payload is equivalent to a right-shifted significand, including an explicit leading bit. Note that for an exponent difference greater than 5, the right-shifted significand is lost because of the limited number of bits, so only the exponent difference is encoded, with a bias of 6.
In the embodiment shown in TABLE 3, the exponent difference can be decoded from the EBFP number by counting the number of leading zeros in the payload. This operation is denoted as CLZ(payload).
TABLE 4 shows the result of the example dot product computation described above. The exponents and signs of FP values with smaller exponents are retained. The resulting error compared to the true result is 13%. This is much improved compared to conventional BFP, which gave the result as zero. The accuracy of the EBFP approach is sufficient for many applications, including training convolutional neural networks.
EBFP Result
+0x1.28bdp+29 (1.16 × 2^29)
TABLE 5 shows how exponent differences and significands are determined from an input payload for an example implementation, where the payload has 8 bits and includes a sign bit, two tag bits and 5 payload bits. In this example, R=0. In the table below, “f” denotes a fractional bit of the input value and “e” denotes one bit of the biased exponent difference. In this embodiment, the exponent difference can be decoded from the EBFP number by counting the number of leading zeros in the tag and payload. This operation is denoted as CLZ(tag, payload).
TABLES 3 and 5 above illustrate how an output exponent difference and significand can be obtained from a payload.
TABLE 6 shows how output exponent differences and significands are obtained from a payload for an example implementation where the payload has 8 bits and includes a sign bit, a tag bit and 6 payload bits. In this example, R=1, so the radix is 4. In the table below, “f” denotes a fractional bit of the input value and “e” denotes one bit of the biased exponent difference.
(In TABLE 6, the significand field takes forms such as 1.ffff and 1.ff, the number of fraction bits depending on the encoded exponent difference.)
In the examples above, the significand is stored to the right of the encoded exponent difference in the input payload. It will be apparent to those of ordinary skill in the art that alternative arrangements may be used without departing from the present disclosure. For example, in one embodiment, the significand is stored to the left of the encoded exponent difference, and the encoded exponent difference includes L trailing zeros. This is shown in TABLE 7A below. In this embodiment, the roles of the ones and zeros in the encoded exponent difference are reversed. The exponent difference can be decoded by counting the number of trailing zeros in the tag and payload, and is decoded as 2×CTZ(tag, payload)+p−1.
(In TABLE 7A, the significand field likewise takes forms such as 1.ffff and 1.ff.)
The payload is made up of an encoded exponent difference (shown in bold font) concatenated with a number (possibly 0) of fraction bits (ff . . . f), where the encoded exponent difference includes a number (possibly 0) of bits set to zero, at least one bit set to one, and a number (possibly 0) of additional bits (p).
TABLE 7B, below, shows an example encoding using storage 304′ in
The payload is made up of an encoded exponent difference concatenated with a number (possibly 0) of fraction bits (ff . . . f), where the encoded exponent difference includes a number (possibly 0) of bits set to zero, at least one bit set to one, and a number (possibly 0) of additional bits (p).
First decoder 422 is configured to determine exponent difference 426 and fraction 428 based on the payload 408 of input datum 402. Second decoder 424 is configured to determine exponent difference 430 of the floating-point number based on the payload 408 of the input datum 402, the floating-point number having a designated fraction 432. Selector 420 selects the outputs of the first or second decoders 422, 424 as exponent difference 434 and fraction 436. Exponent 438 of the output floating-point number is determined by subtracting the selected exponent difference 434 from a shared exponent 440 in subtractor 442. Sign bit 412 is determined from sign bit 404. However, sign bit 412 may be modified for certain special values, dependent upon the format chosen for the floating-point number.
The arrangement of the logic units shown in
At block 718, the exponent of the output floating-point number is determined by subtracting the exponent difference from a shared exponent. The sign of the output is copied from the sign of the input and the sign, exponent and fraction of the floating-point number are output at block 720.
In some embodiments, an EBFP formatted number occupies an 8-bit word. This enables computations to be made using shorter word lengths. This is advantageous, for example, when a large number of values is being processed or when memory is limited. However, in some applications, such as accumulators, more precision is needed. An EBFP format using 16-bit words is described below. In general, the format may use M-bit words, where M can be any number (e.g., 8, 16, 24, 32, 64, etc.).
In one embodiment using 16-bit words, all EBFP16 numbers have eight more fraction bits than their EBFP8 counterparts, while the range of exponent differences is the same as in EBFP8. EBFP16 may be used where a wider storage format is needed and provides better accuracy and a wider exponent range than the “Bfloat” format.
TABLE 8 below gives an example of decoding an EBFP16r2 (radix 2) format with two tag bits. Note that for exponent differences in the range 7-37, the last eight bits of the payload contain the fractional part of the number, while the first 5 bits contain the exponent difference. In this case, the payload is similar to a floating-point representation of the input, except that the exponent is to be subtracted from the shared exponent.
TABLE 9 below gives an example of decoding an EBFP16r4 (radix 4) format with two tag bits.
In one embodiment, an EBFP number is encoded in a first format of the form “s:tag:P:1:F” or a second format of the form “s:tag:D”, where “s” is a sign bit, “tag” is one or more bits of an encoding tag, “P” is R encoded exponent difference bits, “F” is a fraction and “D” is an exponent difference. Except for a subset of tag values, the floating-point number represented has significand 1.F and exponent difference 2^R×(tag+CLZ)+P, where CLZ is the number of leading zeros in the fraction F. For a first special tag value (e.g., all ones), the second format is used, where the exponent difference is D plus a bias offset.
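Under this rule, the exponent difference may be computed as sketched below (the fraction is taken as a bit string so that its leading zeros can be counted; the function name is illustrative):

```python
def exp_difference(tag: int, p: int, frac_bits: str, R: int) -> int:
    """Exponent difference 2**R * (tag + CLZ) + P for the first format
    's:tag:P:1:F', where CLZ is the number of leading zeros in F."""
    clz = len(frac_bits) - len(frac_bits.lstrip("0"))
    return (2 ** R) * (tag + clz) + p
```

For example, with R=1, tag value 1, P=0 and fraction 0011, the exponent difference is 2 × (1 + 2) + 0 = 6.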
Some example embodiments for an 8-bit EBFP number are given below in TABLE 10.
In contrast with the embodiments discussed above, the positions of the one or more “p” bits are fixed as the leading bits in the payload. With 8-bit data, R may be in the range 0-5. Some examples are listed below in TABLES 11-15.
In TABLE 15, “xxx” is any 3-bit combination except for the special values “111” and “110”.
Still further embodiments are given in TABLES 16-18.
TABLE 18 is equivalent to TABLE 17 and illustrates how the use of zero and one in the part of the encoding shown in bold font may be reversed.
Exponent combiner 834 may be configured to add one to product exponent 836 when the product of the significands is greater than or equal to two. Significand multiplier and shifter 838 may be further configured to right-shift the product of significands by one place when the output significand is greater than or equal to two to generate product significand 840 in a normalized form. Alternatively, the shift may be applied at a later time, such as when the product is accumulated or output.
Decoder 822 (and/or decoder 828) may comprise a first decoder, a second decoder, and a controller configured to select between the first decoder and the second decoder based on the tag value of the input operand. The first decoder is configured to determine the exponent difference 824 and the fraction 826 based at least on payload 806, and the second decoder is configured to determine exponent difference 824 based on payload 806 of the first operand, and further configured to provide a designated value (such as “1”, for example) as fraction 826.
The first decoder may be configured to determine a number of leading zeros in a designated part of payload 806 (as shown as 204 and 204′ in
The second decoder may be configured to determine the first exponent difference based on the first payload and set the first fraction to zero.
Denoting the significand, exponent difference and shared exponent of the first operand as SIG_A, EXP-DIFF_A and SH-EXP_A, respectively, and those of the second operand as SIG_B, EXP-DIFF_B and SH-EXP_B, respectively, the product significand 840 is given by (SIG_A or 1.0 or zero)×(SIG_B or 1.0 or zero). The corresponding product exponent 836 is given by SH-EXP_A + SH-EXP_B − (EXP-DIFF_A + bias) − (EXP-DIFF_B + bias), for a designated bias. The (EXP-DIFF + bias) term is only subtracted when the payload is non-zero and the tag indicates that the payload represents an exponent difference (e.g., tag == 2′b11 AND payload ≠ 0). When the tag indicates that a rounding overflow has occurred (e.g., tag == 2′b11 AND payload == 0), the (EXP-DIFF + bias) term in the product is set to −1 to increment the product exponent.
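The exponent-combination rule above can be sketched as follows, with each operand given as a (tag, payload, exponent-difference) triple; the tag constant and all names are illustrative:

```python
EXP_TAG = 0b11   # tag value marking an exponent-difference payload

def exp_term(tag: int, payload: int, exp_diff: int, bias: int) -> int:
    """Contribution of one operand to the product exponent."""
    if tag == EXP_TAG and payload == 0:
        return -1                  # rounding overflow: bumps the exponent up
    if tag == EXP_TAG:
        return exp_diff + bias     # subtracted from the shared exponents
    return 0                       # significand payload: no extra term

def ebfp_product_exponent(sh_exp_a: int, sh_exp_b: int, a, b, bias: int = 6):
    """Product exponent SH-EXP_A + SH-EXP_B - term(A) - term(B)."""
    return sh_exp_a + sh_exp_b - exp_term(*a, bias) - exp_term(*b, bias)
```

For example, with shared exponents of 10, an exponent-difference operand (tag 2′b11, exponent difference 2) and a significand operand, the product exponent is 10 + 10 − (2 + 6) − 0 = 12.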
Apparatus 800 generates a product of two EBFP operands in a floating-point format. The product may be passed to a fixed-point accumulator. Since a wide range of numbers can be represented in floating-point format, an accumulator may be used that is much wider (has more bits) than the significand of the value to be accumulated. In this case, when a value is added to the accumulator, only a subset of bits in the accumulator are altered. In one embodiment, the accumulator may use a number of overlapping “lanes,” each lane holding a part of the final accumulated value.
Apparatus 800 has a lower power consumption than a conventional multiplier for IEEE-formatted data. In addition, the multiplier is smaller and requires no rounding or subnormal handling.
Anchor value 906 is held constant during a multiply-accumulate computation, such as a dot product operation. For example, the anchor value may be set at SH-EXP_A + SH-EXP_B + 8 for a dot product of EBFP vectors. In one embodiment, lane selector 908 compares product exponent 902 with the most significant bits (MSBs) of all lanes and selects the lane with the lowest lane MSB greater than or equal to product exponent 902. The lane MSB value is given by, for example,
Lane MSB value=Anchor−lane number×(Width−Overlap).
The shift value 910 is computed from the MSB of the selected lane and product exponent 836. Once a lane has been selected, a large shift of significand 904 is not required. In general, the shift is less than would be required for a conventional accumulator that does not use lanes.
A lane may store a “Carry Ready” bit that indicates whether the lane is close to overflowing. When a Carry Ready bit is set high, the overlap bits of the lane are added to the next higher lane, and the overlap bits are reset to zero before the accumulation continues. Operations can be completed in parallel or in series, or in a combination of parallel and series.
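The lane arithmetic above can be sketched as follows, with lanes numbered downward from the anchor (parameter values and names are illustrative):

```python
def lane_msb(anchor: int, lane: int, width: int, overlap: int) -> int:
    """MSB weight of a lane: Anchor - lane number * (Width - Overlap)."""
    return anchor - lane * (width - overlap)

def select_lane(product_exp: int, anchor: int, num_lanes: int,
                width: int, overlap: int) -> int:
    """Select the lane with the lowest MSB still >= the product exponent.
    Higher lane numbers have lower MSBs, so scan from the bottom lane up."""
    for lane in range(num_lanes - 1, -1, -1):
        if lane_msb(anchor, lane, width, overlap) >= product_exp:
            return lane
    return 0   # fall back to the top lane if the product exceeds the anchor
```

With an anchor of 40, four 16-bit lanes and an overlap of 4, the lane MSBs are 40, 28, 16 and 4, so a product exponent of 20 selects lane 1.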
If the product of significands is greater than or equal to two, as depicted by the positive branch from decision block 1112, the output exponent is increased by one at block 1116, and the output significand is right shifted by one at block 1118. In this way, the output significand is normalized to be in the range 1≤significand<2. In an alternative embodiment, the extra shift may be implemented at a later position in a computation—such as in an adder of an accumulator. The sign, exponent and at least the fractional part of the significand of the product are output at block 1114. These values may be passed to an accumulator of a dot product unit.
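The normalization step at blocks 1116 and 1118 can be sketched as follows (a sketch only; the product of two significands in [1, 2) always lies in [1, 4), so a single one-place shift suffices):

```python
def normalize_product(sig: float, exp: int):
    """If the significand product has reached [2, 4), shift it right one
    place and increment the exponent so that 1 <= significand < 2."""
    if sig >= 2.0:
        return sig / 2.0, exp + 1
    return sig, exp
```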
In this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises the element.
Reference throughout this document to “one embodiment,” “certain embodiments,” “an embodiment,” “implementation(s),” “aspect(s),” or similar terms means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of such phrases in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments without limitation.
The term “or,” as used herein, is to be interpreted as an inclusive or meaning any one or any combination. Therefore, “A, B or C” means “any of the following: A; B; C; A and B; A and C; B and C; A, B and C.” An exception to this definition will occur only when a combination of elements, functions, steps or acts are in some way inherently mutually exclusive.
As used herein, the term “configured to,” when applied to an element, means that the element may be designed or constructed to perform a designated function, or that it has the required structure to enable it to be reconfigured or adapted to perform that function.
Numerous details have been set forth to provide an understanding of the embodiments described herein. The embodiments may be practiced without these details. In other instances, well-known methods, procedures, and components have not been described in detail to avoid obscuring the embodiments described. The disclosure is not to be considered as limited to the scope of the embodiments described herein.
Those skilled in the art will recognize that the present disclosure has been described by means of examples. The present disclosure could be implemented using hardware component equivalents such as special purpose hardware and/or dedicated processors which are equivalents to the present disclosure as described and claimed. Similarly, dedicated processors and/or dedicated hard-wired logic may be used to construct alternative equivalent embodiments of the present disclosure.
Dedicated or reconfigurable hardware components used to implement the disclosed mechanisms may be described, for example, by instructions of a hardware description language (HDL), such as VHDL, Verilog or RTL (Register Transfer Language), or by a netlist of components and connectivity. The instructions may be at a functional level or a logical level or a combination thereof. The instructions or netlist may be input to an automated design or fabrication process (sometimes referred to as high-level synthesis) that interprets the instructions and creates digital hardware that implements the described functionality or logic.
The HDL instructions or the netlist may be stored on non-transitory computer readable medium such as Electrically Erasable Programmable Read Only Memory (EEPROM); non-volatile memory (NVM); mass storage such as a hard disc drive, floppy disc drive, optical disc drive; optical storage elements, magnetic storage elements, magneto-optical storage elements, flash memory, core memory and/or other equivalent storage technologies without departing from the present disclosure. Such alternative storage devices should be considered equivalents.
Various embodiments described herein are implemented using dedicated hardware, configurable hardware or programmed processors executing programming instructions that are broadly described in flow chart form that can be stored on any suitable electronic storage medium or transmitted over any suitable electronic communication medium. A combination of these elements may be used. Those skilled in the art will appreciate that the processes and mechanisms described above can be implemented in any number of variations without departing from the present disclosure. For example, the order of certain operations carried out can often be varied, additional operations can be added or operations can be deleted without departing from the present disclosure. Such variations are contemplated and considered equivalent.
The various representative embodiments, which have been described in detail herein, have been presented by way of example and not by way of limitation. It will be understood by those skilled in the art that various changes may be made in the form and details of the described embodiments resulting in equivalent embodiments that remain within the scope of the appended claims.