One or more aspects of embodiments according to the present disclosure relate to neural network processing, and more particularly to a system and method for encoding partial sums or scaling factors.
In an artificial neural network, certain operations, such as calculating a convolution, may involve calculating partial sums and subsequently summing the partial sums. These operations may be burdensome for a processing circuit performing them, both in terms of storage requirements for the partial sums, and in terms of bandwidth used to move the partial sums to and from storage.
Thus, there is a need for an improved system and method for performing neural network calculations.
According to an embodiment of the present disclosure, there is provided a method, including: performing a neural network inference operation, the performing of the neural network inference operation including: calculating a first plurality of products, each of the first plurality of products being the product of a weight and an activation; calculating a first partial sum, the first partial sum being the sum of the products; and compressing the first partial sum to form a first compressed partial sum.
In some embodiments, the first compressed partial sum has a size, in bits, at most 0.85 times that of the first partial sum.
In some embodiments, the first compressed partial sum has a size, in bits, at most 0.5 times that of the first partial sum.
In some embodiments, the first compressed partial sum includes an exponent and a mantissa.
In some embodiments, the first partial sum is an integer, and the exponent is an n-bit integer equal to 2^n − 1 less an exponent difference, the exponent difference being the difference between: the bit position of the leading 1 in a limit number, and the bit position of the leading 1 in the first partial sum.
In some embodiments, n = 4.
In some embodiments: the first compressed partial sum further includes a sign bit, and the mantissa is a 7-bit number excluding an implicit 1.
In some embodiments: the first partial sum is greater than the limit number, the exponent equals 2^n − 1, and the mantissa of the first compressed partial sum equals a mantissa of the limit number.
In some embodiments, the performing of the neural network inference operation further includes: calculating a second plurality of products, each of the second plurality of products being the product of a weight and an activation; calculating a second partial sum, the second partial sum being the sum of the second plurality of products; and compressing the second partial sum to form a second compressed partial sum.
In some embodiments, the method further includes adding the first compressed partial sum and the second compressed partial sum.
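By way of a concrete, non-limiting illustration, the following Python sketch walks through these steps, assuming 8-bit integer weights and activations with small magnitudes and using FP16 (one of the encodings discussed below) as a stand-in compressed format; the function names are illustrative and not taken from the disclosure.

```python
import numpy as np

def compute_partial_sum(weights: np.ndarray, activations: np.ndarray) -> int:
    # Calculate a plurality of products, each the product of a weight and
    # an activation, and sum the products into a single partial sum.
    return int((weights.astype(np.int32) * activations.astype(np.int32)).sum())

def compress(psum: int) -> np.float16:
    # Stand-in compression from a 32-bit integer to 16 bits; partial sums
    # near the top of the 21-bit range would need scaling first (see below).
    return np.float16(psum)

rng = np.random.default_rng(0)
w1, a1, w2, a2 = (rng.integers(-16, 16, 64, dtype=np.int8) for _ in range(4))
c1 = compress(compute_partial_sum(w1, a1))   # first compressed partial sum
c2 = compress(compute_partial_sum(w2, a2))   # second compressed partial sum
total = float(c1) + float(c2)                # adding the compressed partial sums
```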
According to an embodiment of the present disclosure, there is provided a system, including: a processing circuit configured to perform a neural network inference operation, the performing of the neural network inference operation including: calculating a first plurality of products, each of the first plurality of products being the product of a weight and an activation; calculating a first partial sum, the first partial sum being the sum of the products; and compressing the first partial sum to form a first compressed partial sum.

In some embodiments, the first compressed partial sum has a size, in bits, at most 0.85 times that of the first partial sum.
In some embodiments, the first compressed partial sum has a size, in bits, at most 0.5 times that of the first partial sum.
In some embodiments, the first compressed partial sum includes an exponent and a mantissa.
In some embodiments, the first partial sum is an integer, and the exponent is an n-bit integer equal to 2^n − 1 less an exponent difference, the exponent difference being the difference between: the bit position of the leading 1 in a limit number, and the bit position of the leading 1 in the first partial sum.
In some embodiments, n = 4.
In some embodiments: the first compressed partial sum further includes a sign bit, and the mantissa is a 7-bit number excluding an implicit 1.
In some embodiments: the first partial sum is greater than the limit number, the exponent equals 2^n − 1, and the mantissa of the first compressed partial sum equals a mantissa of the limit number.
According to an embodiment of the present disclosure, there is provided a system, including: means for processing configured to perform a neural network inference operation, the performing of the neural network inference operation including: calculating a first plurality of products, each of the first plurality of products being the product of a weight and an activation; calculating a first partial sum, the first partial sum being the sum of the products; and compressing the first partial sum to form a first compressed partial sum.
In some embodiments: the first compressed partial sum includes an exponent and a mantissa; the first partial sum is an integer; and the exponent is an n-bit integer equal to 2^n − 1 less an exponent difference, the exponent difference being the difference between: the bit position of the leading 1 in a limit number, and the bit position of the leading 1 in the first partial sum.
These and other features and advantages of the present disclosure will be appreciated and understood with reference to the specification, claims, and appended drawings.
The detailed description set forth below in connection with the appended drawings is intended as a description of exemplary embodiments of a system and method for neural network processing provided in accordance with the present disclosure and is not intended to represent the only forms in which the present disclosure may be constructed or utilized. The description sets forth the features of the present disclosure in connection with the illustrated embodiments. It is to be understood, however, that the same or equivalent functions and structures may be accomplished by different embodiments that are also intended to be encompassed within the scope of the disclosure. As denoted elsewhere herein, like element numbers are intended to indicate like elements or features.
The input data, which may include (as mentioned above) a set of weights and a set of activations, may originally be represented in floating point format. To enable more efficient multiplication operations, these floating point values may be converted to integers using a process referred to as quantization. A uniform, symmetric quantization process may be employed, although some embodiments described herein are applicable also to nonuniform and asymmetric quantization methods.
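A minimal sketch of one such uniform, symmetric quantization scheme follows; the particular scale computation is an assumption for illustration, as the disclosure does not fix a formula.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, num_bits: int = 8):
    # Uniform, symmetric quantization: one positive scale factor, no
    # zero-point offset, mapping floats onto signed integers.
    # (The int8 storage below assumes num_bits <= 8.)
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for 8 bits
    scale = max(float(np.abs(x).max()), 1e-12) / qmax   # guard against all-zero x
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale                                     # x is approximately q * scale
```

With weights and activations quantized this way, the integer product q_w × q_a approximates (w × a) / (scale_w × scale_a), which is why the accumulation that follows can proceed entirely in integers.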
For activations and weights that are each 8-bit integers, the partial sums may in principle have a bit width of up to 31 bits. In practice, however, when the input data are images, for example, the distribution of values in the input data results in partial sums having a bit width of at most 21 bits (e.g., for ResNet50) or 19 bits (e.g., for MobileNet v2). As such, even if kept in integer form, it may not be necessary to use a 32-bit-wide representation to store each partial sum. Further (as discussed in further detail below), FP16 encoding, BF16 encoding, CP encoding (clipping partial sums to a fixed bit width), or PM encoding (partial sum encoding using a fixed number of most significant bits (MSBs)) may be employed.
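As a back-of-the-envelope check of these bit widths (the 3×3×512 convolution window below is an assumed example, not taken from the disclosure):

```python
def worst_case_magnitude_bits(num_products: int, w_bits: int = 8, a_bits: int = 8) -> int:
    # The largest signed product magnitude is 2^(w_bits-1) * 2^(a_bits-1),
    # e.g. (-128) * (-128) = 2^14 for int8 operands.
    max_product = 2 ** (w_bits - 1) * 2 ** (a_bits - 1)
    return (max_product * num_products).bit_length()   # excludes the sign bit

# A 3x3x512 convolution window accumulates 4608 products, so
# worst_case_magnitude_bits(4608) == 27, while the value distributions of
# real image data keep observed partial sums to 21 bits or fewer, as noted above.
```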
In the FP16 representation, each number may be represented as follows:
(−1)^sign × 2^(exp − 15) × (1.significandbits)₂
(where sign is the sign bit, exp is the exponent, and significandbits is the set of bits of the mantissa (or “significand”) excluding the implicit leading 1), which does not accommodate the largest 19-bit (or 21-bit) integer. However, with suitable scaling, 19- and 21-bit integers may be represented using FP16.
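For instance, the largest 21-bit integer overflows FP16 directly but becomes representable after a power-of-two scaling; the scale of 2^−6 below is an assumed choice, and rounding introduces a small error.

```python
import numpy as np

psum = 2**21 - 1                        # largest 21-bit partial sum: 2097151
overflow = np.float16(psum)             # inf: exceeds FP16's maximum of 65504

scale = 2.0 ** -6                       # 2^21 * 2^-6 = 2^15, safely below 65504
encoded = np.float16(psum * scale)      # 32768.0 after rounding to nearest
decoded = int(float(encoded) / scale)   # 2097152: an error of 1 versus psum
```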
Other encoding methods, such as BF16 encoding, CP encoding, and PM encoding, may also be used (as mentioned above) to encode and compress the partial sums.
The BF16 representation may represent each number as follows:
(−1)^sign × 2^(exp − 127) × (1.significandbits)₂
where sign is the sign bit, exp is the exponent, and significandbits is the set of bits of the mantissa excluding the implicit leading 1. BF16 may have the same exponent range as FP32, but the mantissa of BF16 may be shorter; as such, FP32 may be converted to BF16 by keeping the 7 most significant bits of the mantissa and discarding the 16 least significant bits. BF16 may have the same size as a P16M12 representation (illustrated in the appended drawings).
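A minimal sketch of this FP32-to-BF16 conversion by truncation follows (truncation toward zero; some implementations instead round to nearest even, which is not shown):

```python
import struct

def fp32_to_bf16_bits(x: float) -> int:
    # Reinterpret the FP32 bit pattern and keep the top 16 bits: the sign,
    # the full 8-bit exponent, and the 7 most significant mantissa bits;
    # the 16 least significant mantissa bits are discarded.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16

def bf16_bits_to_fp32(b: int) -> float:
    # Zero-fill the discarded mantissa bits to recover an FP32 value.
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]
```

Because BF16 keeps FP32's 8-bit exponent, no range check is needed in the conversion; only precision is lost.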
CP encoding may clip partial sums that do not fit within the allocated bit width to the largest number that fits within the allocated bit width, i.e., any number larger than the largest number capable of being represented is represented by the largest number capable of being represented. For example, for CP20, a total of 20 bits is available, of which one is the sign bit, which means that the maximum magnitude that can be represented is 2^19 − 1, and for CP19 the maximum magnitude that can be represented is 2^18 − 1. If the partial sum is the largest 20-bit integer, then the error, if it is encoded using CP20 encoding (and the magnitude is represented as the largest 19-bit integer), is 2^19, and the error, if it is encoded using CP19 encoding (and the magnitude is represented as the largest 18-bit integer), is 1.5 × 2^19.
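A sketch of this clipping rule follows; the symmetric treatment of negative sums is an assumption, as the passage discusses magnitudes only.

```python
def cp_encode(psum: int, total_bits: int) -> int:
    # CP encoding: one bit is the sign bit, so any magnitude above
    # 2^(total_bits - 1) - 1 is clipped to that maximum.
    max_mag = 2 ** (total_bits - 1) - 1
    return max(-max_mag, min(psum, max_mag))

# With the numbers used above: cp_encode(2**20 - 1, 20) == 2**19 - 1
# (an error of 2**19), and cp_encode(2**20 - 1, 19) == 2**18 - 1
# (an error of 1.5 * 2**19).
```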
Encoding the partial sum using a P12M8 representation (which has a size of 12 bits) (to form a compressed partial sum) may reduce the size by more than a factor of 2, e.g., if the original partial sum is an integer in a 32-bit representation. The P12M8 encoding of an integer (e.g., of the magnitude of a partial sum) may be formed based on the largest integer to be encoded, which may be referred to herein as the "limit number" used for the encoding. The encoding may proceed as follows. The four-bit exponent of the P12M8 representation is set equal to 15 (15 being equal to 2^n − 1, where n = 4 is the number of bits in the exponent, and 15 is the maximum unsigned integer capable of being stored in the 4-bit exponent) less an exponent difference, the exponent difference being the difference between the bit position of the leading 1 in the limit number, and the bit position of the leading 1 in the integer being encoded (e.g., the partial sum). An example of this encoding is illustrated in a table in the appended drawings.
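The following sketch implements this exponent rule, together with a matching decoder; the handling of zero, the clamping of very small magnitudes, and the truncation of dropped mantissa bits are assumptions, as the passage does not specify them.

```python
def p12m8_encode(psum: int, limit: int) -> tuple[int, int, int]:
    # Returns (sign, 4-bit exponent, 7-bit mantissa) for an integer partial
    # sum, relative to the limit number (the largest integer to be encoded).
    sign = 1 if psum < 0 else 0
    mag = min(abs(psum), limit)            # values above the limit saturate
    if mag == 0:
        return sign, 0, 0                  # assumed encoding of zero
    # Exponent: 15 (the 4-bit maximum, 2^4 - 1) less the difference between
    # the leading-1 bit positions of the limit number and of the magnitude.
    exponent = max(0, 15 - (limit.bit_length() - mag.bit_length()))
    # Mantissa: the 7 bits after the leading 1 (the implicit 1 is not
    # stored); lower bits are truncated.
    shift = mag.bit_length() - 8
    mantissa = (mag >> shift if shift >= 0 else mag << -shift) & 0x7F
    return sign, exponent, mantissa

def p12m8_decode(sign: int, exponent: int, mantissa: int, limit: int) -> int:
    # Restore the implicit leading 1, then undo the shift implied by the exponent.
    shift = limit.bit_length() - 8 - (15 - exponent)
    mag = 0x80 | mantissa
    mag = mag << shift if shift >= 0 else mag >> -shift
    return -mag if sign else mag

# Example, assuming a 21-bit limit number:
# p12m8_encode(1_000_000, 2**21 - 1) == (0, 14, 116)
# p12m8_decode(0, 14, 116, 2**21 - 1) == 999_424   (truncation error of 576)
```

Note that encoding the limit number itself yields an exponent of 15 and the limit number's own mantissa, consistent with the saturation behavior described above for partial sums greater than the limit number.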
In some embodiments, the scaling factor may also be encoded to reduce bandwidth and storage requirements.
As used herein, “a portion of” something means “at least some of” the thing, and as such may mean less than all of, or all of, the thing. As such, “a portion of” a thing includes the entire thing as a special case, i.e., the entire thing is an example of a portion of the thing. As used herein, when a second quantity is “within Y” of a first quantity X, it means that the second quantity is at least X−Y and the second quantity is at most X+Y. As used herein, when a second number is “within Y %” of a first number, it means that the second number is at least (1−Y/100) times the first number and the second number is at most (1+Y/100) times the first number. As used herein, the term “or” should be interpreted as “and/or”, such that, for example, “A or B” means any one of “A” or “B” or “A and B”.
Each of the terms “processing circuit” and “means for processing” is used herein to mean any combination of hardware, firmware, and software, employed to process data or digital signals. Processing circuit hardware may include, for example, application specific integrated circuits (ASICs), general purpose or special purpose central processing units (CPUs), digital signal processors (DSPs), graphics processing units (GPUs), and programmable logic devices such as field programmable gate arrays (FPGAs). In a processing circuit, as used herein, each function is performed either by hardware configured, i.e., hard-wired, to perform that function, or by more general-purpose hardware, such as a CPU, configured to execute instructions stored in a non-transitory storage medium. A processing circuit may be fabricated on a single printed circuit board (PCB) or distributed over several interconnected PCBs. A processing circuit may contain other processing circuits; for example, a processing circuit may include two processing circuits, an FPGA and a CPU, interconnected on a PCB.
As used herein, when a method (e.g., an adjustment) or a first quantity (e.g., a first variable) is referred to as being “based on” a second quantity (e.g., a second variable) it means that the second quantity is an input to the method or influences the first quantity, e.g., the second quantity may be an input (e.g., the only input, or one of several inputs) to a function that calculates the first quantity, or the first quantity may be equal to the second quantity, or the first quantity may be the same as (e.g., stored at the same location or locations in memory as) the second quantity.
It will be understood that, although the terms “first”, “second”, “third”, etc., may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another element, component, region, layer or section. Thus, a first element, component, region, layer or section discussed herein could be termed a second element, component, region, layer or section, without departing from the spirit and scope of the inventive concept.
As used herein, the singular forms “a” and “an” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising”, when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list. Further, the use of “may” when describing embodiments of the inventive concept refers to “one or more embodiments of the present disclosure”. Also, the term “exemplary” is intended to refer to an example or illustration. As used herein, the terms “use,” “using,” and “used” may be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively.
It will be understood that when an element or layer is referred to as being “on”, “connected to”, “coupled to”, or “adjacent to” another element or layer, it may be directly on, connected to, coupled to, or adjacent to the other element or layer, or one or more intervening elements or layers may be present. In contrast, when an element or layer is referred to as being “directly on”, “directly connected to”, “directly coupled to”, or “immediately adjacent to” another element or layer, there are no intervening elements or layers present.
Any numerical range recited herein is intended to include all sub-ranges of the same numerical precision subsumed within the recited range. For example, a range of “1.0 to 10.0” or “between 1.0 and 10.0” is intended to include all subranges between (and including) the recited minimum value of 1.0 and the recited maximum value of 10.0, that is, having a minimum value equal to or greater than 1.0 and a maximum value equal to or less than 10.0, such as, for example, 2.4 to 7.6. Any maximum numerical limitation recited herein is intended to include all lower numerical limitations subsumed therein and any minimum numerical limitation recited in this specification is intended to include all higher numerical limitations subsumed therein.
Although exemplary embodiments of a system and method for neural network processing have been specifically described and illustrated herein, many modifications and variations will be apparent to those skilled in the art. Accordingly, it is to be understood that a system and method for neural network processing constructed according to principles of this disclosure may be embodied other than as specifically described herein. The invention is also defined in the following claims, and equivalents thereof.
The present application claims priority to and the benefit of U.S. Provisional Application No. 63/214,173 filed Jun. 23, 2021, entitled “PARTIAL SUM COMPRESSION FOR CONVOLUTION OPERATIONS IN NEURAL NETWORKS”, the entire content of which is incorporated herein by reference.