This application claims the benefit under 35 USC § 119(a) of Korean Patent Application No. 10-2021-0028929 filed on Mar. 4, 2021, and Korean Patent Application No. 10-2021-0034835 filed on Mar. 17, 2021, in the Korean Intellectual Property Office, the entire disclosures of which are incorporated herein by reference for all purposes.
The following description relates to a method and device for encoding.
An artificial neural network (ANN) is implemented based on a computational architecture. Due to the development of ANN technologies, research is being actively conducted to analyze input data using ANNs in various types of electronic systems and extract valid information. A device that processes an ANN requires a large amount of computation for complex input data. Accordingly, there is a desire for a technique for analyzing a large volume of input data in real time using an ANN and efficiently processing an operation related to the ANN to extract desired information.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
In one general aspect, an encoding method includes receiving input data represented by a 16-bit half floating point, adjusting a number of bits of an exponent and a mantissa of the input data to split the input data into 4-bit units, and encoding the input data in which the number of bits has been adjusted such that the exponent is a multiple of “4”.
The adjusting of the number of bits may include assigning 4 bits to the exponent, and assigning 11 bits to the mantissa.
The encoding may include calculating a quotient and a remainder obtained when a sum of the exponent of the input data and “4” is divided by “4”, encoding the exponent based on the quotient, and encoding the mantissa based on the remainder.
The encoding of the exponent may include encoding the exponent based on the quotient and a bias.
The encoding of the mantissa may include determining a first bit value of the mantissa to be “1”, if the remainder is “0”.
The encoding of the mantissa may include determining a first bit value of the mantissa to be “0” and a second bit value of the mantissa to be “1”, if the remainder is “1”.
The encoding of the mantissa may include determining a first bit value of the mantissa to be “0”, a second bit value of the mantissa to be “0”, and a third bit value of the mantissa to be “1”, if the remainder is “2”.
The encoding of the mantissa may include determining a first bit value of the mantissa to be “0”, a second bit value of the mantissa to be “0”, a third bit value of the mantissa to be “0”, and a fourth bit value to be “1”, if the remainder is “3”.
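The quotient/remainder rules above can be sketched as follows. This is a non-authoritative illustration: the function name is hypothetical, and the bias value of 8 is an assumption taken from the worked example later in this description (1011 = 3+8 (bias)).

```python
def encode_exponent_and_mantissa_prefix(exponent, bias=8):
    # Divide the sum of the exponent and 4 by 4 to obtain a quotient and a remainder.
    quotient, remainder = divmod(exponent + 4, 4)
    # The exponent is encoded based on the quotient and a bias.
    encoded_exponent = quotient + bias
    # The remainder selects which leading mantissa bit is set to 1:
    # remainder 0 -> "1", 1 -> "01", 2 -> "001", 3 -> "0001".
    mantissa_prefix = "0" * remainder + "1"
    return encoded_exponent, mantissa_prefix
```

For example, an exponent of 9 yields an encoded exponent of 11 and the mantissa prefix "01" under this sketch.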
In another general aspect, an operation method includes receiving first operand data represented by a 4-bit fixed point, receiving second operand data that are 16 bits wide, determining a data type of the second operand data, encoding the second operand data, if it is determined the second operand data are of a floating-point type, and splitting the encoded second operand data into four 4-bit bricks, splitting the second operand data into four 4-bit bricks for a parallel data operation, if it is determined the second operand data are of a fixed-point type, and performing a multiply-accumulate (MAC) operation between the second operand data split into the four bricks and the first operand data.
The encoding may include adjusting a number of bits of an exponent and a mantissa of the second operand data, so as to split the second operand data into 4-bit units, and encoding the second operand data in which the number of bits is adjusted such that the exponent is a multiple of “4”.
The splitting may include splitting the encoded second operand data into one exponent brick data and three mantissa brick data.
The performing of the MAC operation may include performing a multiplication operation between the first operand data and each of the three mantissa brick data, comparing the exponent brick data with accumulated exponent data stored in an exponent register, and accumulating a result of performing the multiplication operation to accumulated mantissa data stored in each of three mantissa registers, based on a result of the comparing.
The accumulating may include aligning accumulation positions of the result of performing the multiplication operation and the accumulated mantissa data stored in each of the three mantissa registers, based on the result of the comparing.
In still another general aspect, an encoding device may include a processor configured to receive input data represented by a 16-bit half floating point, adjust a number of bits of an exponent and a mantissa of the input data to split the input data into 4-bit units, and encode the input data in which the number of bits has been adjusted such that the exponent is a multiple of “4”.
The processor may be further configured to assign 4 bits to the exponent, and assign 11 bits to the mantissa.
The processor may be further configured to calculate a quotient and a remainder obtained when a sum of the exponent of the input data and “4” is divided by “4”, encode the exponent based on the quotient, and encode the mantissa based on the remainder.
In a further general aspect, an operation device includes a processor configured to receive first operand data represented by a 4-bit fixed point, receive second operand data that are 16 bits wide, determine a data type of the second operand data, encode the second operand data, if it is determined the second operand data are of a floating-point type and split the encoded second operand data into four 4-bit bricks, split the second operand data into four 4-bit bricks for a parallel data operation, if it is determined the second operand data are of a fixed-point type, and perform a MAC operation between the second operand data split into the four bricks and the first operand data.
The processor may be further configured to adjust a number of bits of an exponent and a mantissa of the second operand data, so as to split the second operand data into 4-bit units, and encode the second operand data in which the number of bits is adjusted such that the exponent is a multiple of “4”.
The processor may be further configured to split the encoded second operand data into one exponent brick data and three mantissa brick data.
The processor may be further configured to perform a multiplication operation between the first operand data and each of the three mantissa brick data, compare the exponent brick data with accumulated exponent data stored in an exponent register, and accumulate a result of performing the multiplication operation to accumulated mantissa data stored in each of three mantissa registers, based on a result of the comparing.
The processor may be further configured to align accumulation positions of the result of performing the multiplication operation and the accumulated mantissa data stored in each of the three mantissa registers, based on the result of the comparing.
In another general aspect, an operation method includes: receiving first operand data represented by a 4-bit fixed point; receiving second operand data that are 16 bits wide; encoding the second operand data, in a case in which the second operand data are of a floating-point type, and splitting the encoded second operand data into four 4-bit bricks; splitting the second operand data into four 4-bit bricks without encoding the second operand data, in a case in which the second operand data are of a fixed-point type; and performing a multiply-accumulate (MAC) operation between the split second operand data and the first operand data.
Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.
Throughout the drawings and the detailed description, unless otherwise described or provided, the same drawing reference numerals will be understood to refer to the same elements, features, and structures. The drawings may not be to scale, and the relative size, proportions, and depiction of elements in the drawings may be exaggerated for clarity, illustration, and convenience.
The following structural or functional descriptions are provided merely to describe the examples, and the scope of the examples is not limited to the descriptions provided in the present specification.
Terms, such as first, second, and the like, may be used herein to describe components. Each of these terms is not used to define an essence, order, or sequence of a corresponding component but used merely to distinguish the corresponding component from other component(s). For example, a “first” component may be referred to as a “second” component, and similarly, the “second” component may also be referred to as the “first” component, within the scope of the right according to the concept of the present disclosure.
It should be noted that if it is described that one component is “connected”, “coupled”, or “joined” to another component, a third component may be “connected”, “coupled”, or “joined” between the first and second components, although the first component may be directly connected, coupled, or joined to the second component. On the contrary, it should be noted that if it is described that one component is “directly connected”, “directly coupled”, or “directly joined” to another component, a third component may be absent. Expressions describing a relationship between components, for example, “between”, “directly between”, or “directly neighboring”, etc., should be interpreted in a like manner.
The singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, components or a combination thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Unless otherwise defined, all terms, including technical and scientific terms, used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure pertains. Terms, such as those defined in commonly used dictionaries, are to be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The examples may be implemented as various types of products such as, for example, a data center, a server, a personal computer, a laptop computer, a tablet computer, a smart phone, a television, a smart home appliance, an intelligent vehicle, a kiosk, and a wearable device. Hereinafter, example embodiments will be described in detail with reference to the accompanying drawings. In the drawings, like reference numerals are used for like elements.
An artificial intelligence (AI) algorithm including deep learning may input data 10 to an ANN, and may learn output data 30 through an operation, for example, a convolution. The ANN may be a computational architecture obtained by modeling a biological brain. In the ANN, nodes corresponding to neurons of a brain may be connected to each other and may collectively operate to process input data. Various types of neural networks include, for example, a convolutional neural network (CNN), a recurrent neural network (RNN), a deep belief network (DBN), and a restricted Boltzmann machine (RBM), but are not limited thereto. In a feed-forward neural network, neurons may have links to other neurons. The links may be expanded in a single direction, for example, a forward direction, through a neural network.
The CNN 20 may be used to extract “features”, for example, a border or a line color, from the input data 10. The CNN 20 may include a plurality of layers. Each of the layers may receive data, may process data input to a corresponding layer and may generate data that is to be output from the corresponding layer. Data output from a layer may be a feature map generated by performing a convolution operation of an image or a feature map that is input to the CNN 20 and weights of at least one filter. Initial layers of the CNN 20 may operate to extract features of a relatively low level, for example, edges or gradients, from an input, such as image data. Subsequent layers of the CNN 20 may gradually extract more complex features, for example, an eye or a nose in an image.
Referring to
N filters, for example, filters 110-1 to 110-n may be formed. Each of the filters 110-1 to 110-n may include n×n weights. For example, each of the filters 110-1 to 110-n may be 3×3 pixels and have a depth value of K. However, the above size of each of the filters 110-1 to 110-n is merely an example and is not limited thereto.
Referring to
The convolution operation process is the process of performing multiplication and addition operations by applying a filter 110 of a predetermined size, that is, n×n, to the input feature map 100 from the upper left to the lower right in a current layer. Hereinafter, the process of performing a convolution operation using a 3×3 filter 110 will be described.
For example, first, an operation of multiplying 3×3 pieces of data in a first region 101 on the upper left side of the input feature map 100 by weights W11 to W33 of the filter 110, respectively, is performed. Here, the 3×3 pieces of data in the first region 101 are a total of nine pieces of data X11 to X33 including three pieces of data in a first direction and three pieces of data in a second direction. Thereafter, first-first output data Y11 in the output feature map 120 are generated using a cumulative sum of the output values of the multiplication operation, in detail, X11×W11, X12×W12, X13×W13, X21×W21, X22×W22, X23×W23, X31×W31, X32×W32, and X33×W33.
Thereafter, an operation is performed by shifting the unit of data from the first region 101 to a second region 102 on the upper left side of the input feature map 100. In this example, the number of pieces of data shifted in the input feature map for the convolution operation process is referred to as a stride. The size of the output feature map 120 to be generated may be determined based on the stride. For example, when the stride is “1”, an operation of multiplying a total of nine pieces of input data X12 to X34 included in the second region 102 by the weights W11 to W33 of the filter 110 is performed, and first-second output data Y12 in the output feature map 120 are generated using a cumulative sum of the output values of the multiplication operation, in detail, X12×W11, X13×W12, X14×W13, X22×W21, X23×W22, X24×W23, X32×W31, X33×W32, and X34×W33.
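The sliding-window multiply-and-accumulate process described above can be sketched in a few lines. This is an illustrative model only (no padding, a positive integer stride, and a square filter are assumed), and the function name is hypothetical.

```python
def conv2d(input_map, kernel, stride=1):
    """Slide an n-by-n kernel over the input feature map from the upper left
    to the lower right, multiplying and accumulating at each position."""
    n = len(kernel)
    rows = (len(input_map) - n) // stride + 1
    cols = (len(input_map[0]) - n) // stride + 1
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Cumulative sum of element-wise products,
            # e.g. X11*W11 + X12*W12 + ... + X33*W33 for output Y11.
            out[i][j] = sum(
                input_map[i * stride + a][j * stride + b] * kernel[a][b]
                for a in range(n) for b in range(n)
            )
    return out
```

The stride determines how far the window shifts between regions 101 and 102, and hence the size of the output feature map, as noted above.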
Referring to
In a first clock, first-first data X11 in a first row ① of the systolic array may be input to the first PE 141. Although not shown in
As described above, the input feature map 130 may be sequentially input to the PEs 141 to 149 according to the clocks, and multiplication and addition operations with the weights input according to the clocks may be performed. An output feature map may be generated using cumulative sums of values output through multiplication and addition operations between weights and data in the input feature map 130 that are sequentially input.
Operations of
An operation using a neural network may require a different operation format according to the type of application. For example, an application configured to determine a type of object in an image may require a precision lower than 8 bits, and a speech-related application may require a precision higher than 8 bits.
Input operands of a multiply-accumulate (MAC) operation, which is an essential operator in deep learning, may also be configured with various precisions depending on the situation. For example, a gradient, one of the input operands required for training a neural network, may require a precision of about a 16-bit half floating point, while the other input operands, an input feature map and weights, may be processed even with a low-precision fixed point.
A basic method of processing data with such various requirements is to generate and use separate hardware components that perform a MAC operation for each input type, which consumes unnecessarily many hardware resources.
In order to perform MAC operations for various input types using single hardware, the operation units of the hardware need to be designed based on the data type with the highest complexity. However, in this case, it is inefficient to perform a low-precision operation through operators designed for the highest-complexity, high-precision data. More specifically, the hardware implementation area may unnecessarily increase, and the hardware power consumption may also unnecessarily increase.
According to an encoding method and an operation method provided herein, it is possible to maintain a gradient operation in the training process at high precision and simultaneously efficiently drive a low-precision inference process.
In operation 210, an encoding device receives input data represented by a 16-bit half floating point.
In operation 220, the encoding device adjusts a number of bits of an exponent and a mantissa of the input data, so as to split the input data into 4-bit units. The encoding device may adjust the number of configuration bits to the form {sign, exponent, mantissa}={1,4,11}, so as to split the bit distribution {sign, exponent, mantissa}={1,5,10} of an existing 16-bit half floating point into 4-bit units. As a result, the number of bits assigned to the exponent decreases by one, to 4 bits, and the number of bits of the mantissa increases by one, to 11 bits.
In operation 230, the encoding device encodes the input data in which the number of bits is adjusted such that the exponent is a multiple of “4”. The encoding device may secure a wider exponent range than the existing 16-bit half floating point and simultaneously encode the exponent in steps of “4” so that it can be easily used for a bit-brick operation. Hereinafter, the encoding method will be described in detail with reference to
Prior to describing the encoding method, a method of representing data by a floating point will be described. For example, the decimal number 263.3 may be the binary number 100000111.0100110 . . . , which may be represented as 1.0000011101×2^8. Furthermore, expressing this using a floating point, the sign bit (1 bit) may be “0” (a positive number), the exponent bits (5 bits) may be 11000 (8+16 (bias)), and the mantissa bits (10 bits) may be 0000011101, so the number may be finally represented as 0110000000011101.
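The representation above can be reproduced with a short sketch. Note the assumptions: the exponent bias of 16 follows this description (the IEEE 754 binary16 format uses a bias of 15), the mantissa is truncated rather than rounded, only positive values are handled, and the function name is hypothetical.

```python
def to_half_float_bits(value, exp_bits=5, man_bits=10, bias=16):
    """Represent a positive value as sign|exponent|mantissa bits, following
    this description's convention (bias of 16; IEEE 754 binary16 uses 15)."""
    sign = "0"  # positive number assumed for this sketch
    exponent = 0
    # Normalize the value to the form 1.xxxx... x 2^exponent.
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1
    # Keep man_bits fractional bits of the mantissa (truncated, not rounded).
    mantissa = int((value - 1) * (1 << man_bits))
    return sign + format(exponent + bias, f"0{exp_bits}b") + format(mantissa, f"0{man_bits}b")
```

Under these assumptions, 263.3 yields the bit string 0110000000011101 given in the example above.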
Referring to
The encoding device may encode the input data in which the number of bits is adjusted such that the exponent is a multiple of “4”. In more detail, the encoding device may calculate a quotient and a remainder obtained when a sum of the exponent of the input data and “4” is divided by “4”, encode the exponent based on the quotient, and encode the mantissa based on the remainder.
The encoding device may encode the exponent based on the quotient and a bias.
The encoding device may determine a first bit value of the mantissa to be “1”, if the remainder is “0”, determine the first bit value of the mantissa to be “0” and a second bit value of the mantissa to be “1”, if the remainder is “1”, determine the first bit value of the mantissa to be “0”, the second bit value of the mantissa to be “0”, and a third bit value of the mantissa to be “1”, if the remainder is “2”, and determine the first bit value of the mantissa to be “0”, the second bit value of the mantissa to be “0”, the third bit value of the mantissa to be “0”, and a fourth bit value to be “1”, if the remainder is “3”. This is represented as in Table 1.
For example, the encoding device may convert 0.10000011101×2^9 to 0.10000011101×2^(4×3)×2^(−3), and again to 0.00010000011101×2^(4×3). Based on this, the encoding device may encode the bits (4-bit) of the exponent to 1011 (3+8 (bias)), the bit (1-bit) of the sign to “0” (a positive number), and the bits of the mantissa to 00010000011.
The encoding device may represent the encoded data by splitting the encoded data into one exponent brick data and three mantissa brick data. The three mantissa brick data may be split into top brick data, middle brick data, and bottom brick data, and a top brick may include one sign bit and three mantissa bits. In the above example, the exponent brick data may be 1011, the top brick data may be 0000, the middle brick data may be 1000, and the bottom brick data may be 0011.
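The brick split described above can be sketched as follows; the function name and the dictionary layout are illustrative only, and the input is assumed to be the 16-bit encoded string in the order sign, 4-bit exponent, 11-bit mantissa.

```python
def split_into_bricks(encoded16):
    """Split an encoded 16-bit string (sign|4-bit exponent|11-bit mantissa)
    into one exponent brick and three mantissa bricks of 4 bits each."""
    sign, exponent, mantissa = encoded16[0], encoded16[1:5], encoded16[5:]
    return {
        "exponent": exponent,
        "top": sign + mantissa[:3],   # top brick: one sign bit + three mantissa bits
        "middle": mantissa[3:7],
        "bottom": mantissa[7:11],
    }
```

Applied to the worked example (sign 0, exponent 1011, mantissa 00010000011), this yields the exponent brick 1011 and the top/middle/bottom bricks 0000/1000/0011 stated above.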
The 4-bit exponent brick data and the 4-bit top/middle/bottom brick data may be easy to split in hardware. In addition, since an exponent difference, which must be considered in a floating-point addition operation, is always a multiple of “4”, a structure that fuses multiplicands using fixed-point adders without particular shifting may be possible.
Referring to
In operation 430, the operation device may determine a data type of the second operand data.
If the second operand data 420 are of a fixed-point type, the operation device may split the second operand data 420 into four 4-bit bricks for a parallel data operation, in operation 440-1.
If the second operand data 420 are of a floating-point type, the operation device may encode the second operand data 420 according to the method described with reference to
In operation 450, the operation device may split the encoded second operand data into four 4-bit bricks. In detail, the operation device may split the encoded second operand data into one exponent brick data and three mantissa brick data.
In operation 460, the operation device may perform a MAC operation between the second operand data split into the four bricks and the first operand data 410. The operation device may perform a multiplication operation between the first operand data 410 and each of the three mantissa brick data. The example of performing a MAC operation between the second operand data split into the four bricks and the first operand data 410 will be described in detail with reference to
In operation 470, the operation device may determine the data type of the second operand data.
If the second operand data 420 are of a fixed-point type, the operation device may accumulate the four split outputs, in operation 480-1.
If the second operand data 420 are of a floating-point type, the operation device may compare the exponent brick data with accumulated exponent data stored in an exponent register, and accumulate a result of performing the multiplication operation to accumulated mantissa data stored in each of three mantissa registers, based on a result of the comparing, in operation 480-2. In detail, the operation device may perform the accumulation by aligning accumulation positions of the result of performing the multiplication operation and the accumulated mantissa data stored in each of the three mantissa registers, based on the result of the comparing. The example of accumulating a result of performing the multiplication operation to accumulated mantissa data stored in each of three mantissa registers, based on a result of the comparing will be described in detail with reference to
Referring to
If second operand data are of a 16-bit half floating-point type, the operation device may split the three mantissa into three 4-bit brick data and perform multiplications with first operand data through the 4×4 multiplier. Three multiplication results obtained thereby may be aligned according to an exponent difference, which is a difference between exponent brick data and accumulated exponent data stored in the exponent register, and the results of performing the multiplication operations may be respectively accumulated to accumulated mantissa data stored in the mantissa registers and stored.
Referring to
For example, if the exponent difference is “0” (if an exponent of second operand data is equal to stored accumulated exponent data), the operation device may accumulate the data by aligning a multiplication operation result and accumulated mantissa data stored in each of three mantissa registers at the same positions.
If the exponent difference is “−1” (if the exponent of the second operand data is greater than the stored accumulated exponent data), the operation device may accumulate the data by aligning the multiplication operation result to be 4-bit shifted rightward from the accumulated mantissa data stored in each of the three mantissa registers.
If the exponent difference is “1” (if the exponent of the second operand data is less than the stored accumulated exponent data), the operation device may accumulate the data by aligning the multiplication operation result to be 4-bit shifted leftward from the accumulated mantissa data stored in each of the three mantissa registers.
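One way to model the exponent-difference alignment is sketched below. This is a deliberate simplification of the scheme above: a single (mantissa, exponent) register pair stands in for the three mantissa registers, mantissas are unsigned integers, and the less significant term is shifted rightward by 4 bits per step of exponent difference so that register positions stay fixed.

```python
def accumulate(acc_mantissa, acc_exp, product, prod_exp):
    """Accumulate a multiplication result into a register pair, aligning by
    the exponent difference; each step of difference is a 4-bit shift."""
    diff = prod_exp - acc_exp
    if diff > 0:
        # Incoming exponent is larger: the stored accumulation loses
        # significance and is shifted rightward; re-base to the new exponent.
        return (acc_mantissa >> (4 * diff)) + product, prod_exp
    if diff < 0:
        # Incoming exponent is smaller: the product loses significance
        # and is shifted rightward before accumulation.
        return acc_mantissa + (product >> (4 * -diff)), acc_exp
    # Exponents equal: accumulate at the same positions.
    return acc_mantissa + product, acc_exp
```

Because every shift is a whole 4-bit brick, the alignment never requires a sub-brick barrel shifter, which is the property the encoding is designed to guarantee.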
Referring to
The processor 710 may receive first operand data represented by a 4-bit fixed point, receive second operand data that are 16 bits wide, determine a data type of the second operand data, encode the second operand data, if the second operand data are of a floating-point type, split the encoded second operand data into four 4-bit bricks, and perform a MAC operation between the second operand data split into the four bricks and the first operand data.
The memory 730 may be a volatile memory or a non-volatile memory.
In some examples, the processor 710 may adjust a number of bits of an exponent and a mantissa of the second operand data, so as to split the second operand data into 4-bit units, and encode the second operand data in which the number of bits is adjusted such that the exponent is a multiple of “4”.
The processor 710 may split the encoded second operand data into one exponent brick data and three mantissa brick data.
The processor 710 may perform a multiplication operation between the first operand data and each of the three mantissa brick data, compare the exponent brick data with accumulated exponent data stored in an exponent register, and accumulate a result of performing the multiplication operation to accumulated mantissa data stored in each of three mantissa registers, based on a result of the comparing.
The processor 710 may align accumulation positions of the result of performing the multiplication operation and the accumulated mantissa data stored in each of the three mantissa registers, based on the result of the comparing.
In addition, the processor 710 may perform the at least one method described above with reference to
The operation device, and other devices, apparatuses, units, modules, and components described herein with respect to
The methods illustrated in
Instructions or software to control a processor or computer to implement the hardware components and perform the methods as described above are written as computer programs, code segments, instructions or any combination thereof, for individually or collectively instructing or configuring the processor or computer to operate as a machine or special-purpose computer to perform the operations performed by the hardware components and the methods as described above. In one example, the instructions or software include machine code that is directly executed by the processor or computer, such as machine code produced by a compiler. In another example, the instructions or software include higher-level code that is executed by the processor or computer using an interpreter. Programmers of ordinary skill in the art can readily write the instructions or software based on the block diagrams and the flow charts illustrated in the drawings and the corresponding descriptions in the specification, which disclose algorithms for performing the operations performed by the hardware components and the methods as described above.
The instructions or software to control a processor or computer to implement the hardware components and perform the methods as described above, and any associated data, data files, and data structures, are recorded, stored, or fixed in or on one or more non-transitory computer-readable storage media. Examples of a non-transitory computer-readable storage medium include read-only memory (ROM), random-access programmable read only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROMs, CD-Rs, CD+Rs, CD-RWs, CD+RWs, DVD-ROMs, DVD-Rs, DVD+Rs, DVD-RWs, DVD+RWs, DVD-RAMs, BD-ROMs, BD-Rs, BD-R LTHs, BD-REs, Blu-ray or optical disk storage, hard disk drive (HDD), solid state drive (SSD), flash memory, a card type memory such as multimedia card micro or a card (for example, secure digital (SD) or extreme digital (XD)), magnetic tapes, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any other device that is configured to store the instructions or software and any associated data, data files, and data structures in a non-transitory manner and provide the instructions or software and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the instructions.
While this disclosure includes specific examples, it will be apparent after an understanding of the disclosure of this application that various changes in form and details may be made in these examples without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered in a descriptive sense only, and not for purposes of limitation. Descriptions of features or aspects in each example are to be considered as being applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order, and/or if components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
10-2021-0028929 | Mar 2021 | KR | national |
10-2021-0034835 | Mar 2021 | KR | national |