This application claims the benefit under 35 USC § 119(a) of Korean Patent Application No. 10-2021-0111118, filed on Aug. 23, 2021 in the Korean Intellectual Property Office, the entire disclosure of which is incorporated herein by reference for all purposes.
The following description relates to a method and apparatus with calculation.
An artificial neural network may be implemented by referring to a computational architecture. Various types of electronic systems may analyze input data and extract valid information using an artificial neural network. An apparatus to process the artificial neural network may require a large amount of computation for complex input data. Such technology may not be capable of effectively processing an operation related to an artificial neural network to extract desired information by analyzing a large amount of input data using the artificial neural network.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
In one general aspect, a processor-implemented method includes: receiving a plurality of pieces of input data expressed as floating point; adjusting a bit-width of mantissa by performing masking on the mantissa of each piece of the input data based on a size of an exponent of each piece of the input data; and performing an operation between the input data with the adjusted bit-width.
For each piece of the input data, the adjusting of the bit-width of the mantissa may include adjusting the bit-width of the mantissa in proportion to the size of the piece of the input data.
For each piece of the input data, the adjusting of the bit-width of the mantissa may include: comparing the piece of the input data to a threshold; and adjusting the bit-width of mantissa based on a result of the comparing.
The threshold may be determined based on a distribution of the input data and an allowable error range.
The method may include: receiving a distribution of the plurality of pieces of input data; and determining a threshold corresponding to each of the plurality of pieces of input data based on the distribution of the plurality of pieces of input data.
The performing of the operation may include controlling a position and a timing of an operator to which the input data with the adjusted bit-width is input.
The performing of the operation may include: determining a number of cycles of the operation performed by a preset number of operators based on the adjusted bit-width of each piece of the input data; and inputting the input data with the adjusted bit-width to the operator based on the determined number of cycles.
The determining of the number of cycles of the operation may include determining the number of cycles of the operation based on the adjusted bit-width of the mantissa of each piece of the input data and a number of bits processible by the operator in a single cycle.
The operator may include: a multiplier configured to perform an integer multiplication of the mantissa of the input data; a shifter configured to shift a result of the multiplier; and an accumulator configured to accumulate the shifted result.
The performing of the operation may include: determining a number of operators for performing the operation within a preset number of cycles of the operation based on the adjusted bit-width of the mantissa of each piece of the input data; and inputting the input data with the adjusted bit-width to the operator based on the determined number of operators.
The determining of the number of operators may include determining the number of operators based on the adjusted bit-width of the mantissa of each piece of the input data and a number of bits processible by the operator in a single cycle.
In another general aspect, one or more embodiments include a non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, configure the one or more processors to perform any one, any combination, or all operations and methods described herein.
In another general aspect, an apparatus includes: one or more processors configured to: receive a plurality of pieces of input data expressed as floating point; adjust a bit-width of mantissa by performing masking on the mantissa of each piece of the input data based on a size of an exponent of each piece of the input data; and perform an operation between the input data with the adjusted bit-width.
For the adjusting of the bit-width of the mantissa, the one or more processors may be configured to, for each piece of the input data, adjust the bit-width of the mantissa in proportion to the size of the piece of the input data.
For the adjusting of the bit-width of the mantissa, the one or more processors may be configured to, for each piece of the input data: compare the piece of the input data to a threshold; and adjust the bit-width of the mantissa based on a result of the comparing.
The threshold may be determined based on a distribution of the input data and an allowable error range.
The one or more processors may be configured to: receive a distribution of the plurality of pieces of input data; and determine a threshold corresponding to each of the plurality of pieces of input data based on the distribution of the plurality of pieces of input data.
For the performing of the operation, the one or more processors may be configured to control a position and a timing of an operator to which the input data with the adjusted bit-width is input.
For the performing of the operation, the one or more processors may be configured to: determine a number of cycles of the operation performed by a preset number of operators based on the adjusted bit-width of the mantissa of each piece of the input data; and input the input data with the adjusted bit-width to the operator based on the determined number of cycles.
For the determining of the number of cycles of the operation, the one or more processors may be configured to determine the number of cycles of the operation based on the adjusted bit-width of the mantissa of each piece of the input data and a number of bits processible by the operator in a single cycle.
The operator may include: a multiplier configured to perform an integer multiplication of the mantissa of the input data; a shifter configured to shift a result of the multiplier; and an accumulator configured to accumulate the shifted result.
For the performing of the operation, the one or more processors may be configured to: determine a number of operators for performing the operation within a preset number of cycles of the operation based on the adjusted bit-width of the mantissa of each piece of the input data; and input the input data with the adjusted bit-width to the operator based on the determined number of operators.
For the determining of the number of operators, the one or more processors may be configured to determine the number of operators based on the adjusted bit-width of the mantissa of each piece of the input data and a number of bits processible by the operator in a single cycle.
In another general aspect, an apparatus includes: a central processing device configured to receive a distribution of a plurality of pieces of input data expressed as floating point, and determine a threshold corresponding to each of the plurality of pieces of input data based on the distribution of the plurality of pieces of input data; and a hardware accelerator configured to receive the plurality of pieces of input data, adjust a bit-width of mantissa by performing masking on the mantissa of each piece of the input data based on a size of exponent of each piece of the input data, and perform an operation between the input data with the adjusted bit-width.
In another general aspect, a processor-implemented method includes: receiving floating point input data; adjusting a bit-width of a mantissa of the input data by comparing a size of an exponent of the input data to a threshold; and performing an operation on the input data with the adjusted bit-width.
The adjusting of the bit-width of the mantissa may include allocating a smaller bit-width to the mantissa in response to the exponent being less than the threshold than in response to the exponent being greater than or equal to the threshold.
The performing of the operation may include using an operator, and the adjusted bit-width of the mantissa may be less than or equal to a number of bits processible by the operator in a single cycle.
The adjusting of the bit-width of the mantissa may include maintaining the bit-width of the mantissa in response to the exponent being greater than or equal to the threshold.
The threshold may include a plurality of threshold ranges each corresponding to a respective bit-width, and the adjusting of the bit-width of the mantissa may include adjusting, in response to the input data corresponding to one of the threshold ranges, the bit-width of the mantissa to be the bit-width corresponding to the one of the threshold ranges.
The performing of the operation may include performing a multiply and accumulate operation using an operator.
In another general aspect, a processor-implemented method includes: receiving floating point weight data and floating point feature map data of a layer of a neural network; adjusting a mantissa bit-width of the weight data and a mantissa bit-width of the feature map data by respectively comparing a size of an exponent of the weight data to a threshold and a size of an exponent of the feature map data to another threshold; and performing a neural network operation between the floating point weight data and the floating point feature map data with the adjusted bit-widths.
Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.
Throughout the drawings and the detailed description, unless otherwise described or provided, the same drawing reference numerals will be understood to refer to the same elements, features, and structures. The drawings may not be to scale, and the relative size, proportions, and depiction of elements in the drawings may be exaggerated for clarity, illustration, and convenience.
The following detailed description is provided to assist the reader in gaining a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, various changes, modifications, and equivalents of the methods, apparatuses, and/or systems described herein will be apparent after an understanding of the disclosure of this application. For example, the sequences of operations described herein are merely examples, and are not limited to those set forth herein, but may be changed as will be apparent after an understanding of the disclosure of this application, with the exception of operations necessarily occurring in a certain order. Also, descriptions of features that are known in the art, after an understanding of the disclosure of this application, may be omitted for increased clarity and conciseness.
Although terms of “first,” “second,” and the like are used to explain various members, components, regions, layers, or sections, these members, components, regions, layers, or sections are not limited to such terms. Rather, these terms are used only to distinguish one member, component, region, layer, or section from another member, component, region, layer, or section. For example, a first member, component, region, layer, or section referred to in examples described herein may also be referred to as a second member, component, region, layer, or section without departing from the teachings of the examples.
Throughout the specification, when an element, such as a layer, region, or substrate, is described as being “on,” “connected to,” or “coupled to” another element, it may be directly “on,” “connected to,” or “coupled to” the other element, or there may be one or more other elements intervening therebetween. In contrast, when an element is described as being “directly on,” “directly connected to,” or “directly coupled to” another element, there can be no other elements intervening therebetween. Likewise, expressions, for example, “between” and “immediately between” and “adjacent to” and “immediately adjacent to” may also be construed as described in the foregoing.
The terminology used herein is for the purpose of describing particular examples only and is not to be limiting of the present disclosure. As used herein, the singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. As used herein, the term “and/or” includes any one and any combination of any two or more of the associated listed items. As used herein, the terms “include,” “comprise,” and “have” specify the presence of stated features, integers, steps, operations, elements, components, numbers, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, numbers, and/or combinations thereof. The use of the term “may” herein with respect to an example or embodiment (for example, as to what an example or embodiment may include or implement) means that at least one example or embodiment exists where such a feature is included or implemented, while all examples are not limited thereto.
Unless otherwise defined herein, all terms used herein including technical or scientific terms have the same meanings as those generally understood by one of ordinary skill in the art to which this disclosure pertains after an understanding of the present disclosure. Terms, such as those defined in commonly used dictionaries, are to be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined herein.
The examples may be implemented in various types of products, for example, a data center, a server, a personal computer, a laptop computer, a tablet computer, a smartphone, a television, a smart home appliance, a smart vehicle, a kiosk, and a wearable device. Hereinafter, the examples are described with reference to the accompanying drawings. Like reference numerals illustrated in the respective drawings refer to like elements.
An artificial intelligence (AI) algorithm including deep learning may provide input data 10 to an artificial neural network (ANN), may learn output data 30 through an operation such as convolution, and may extract a feature using the trained artificial neural network. In the artificial neural network, nodes are interconnected and collectively operate to process the input data 10. Various types of neural networks include, for example, a convolutional neural network (CNN), a recurrent neural network (RNN), a deep belief network (DBN), and a restricted Boltzmann machine (RBM) scheme. However, they are provided as examples only. In a feed-forward neural network, nodes of the neural network have links with other nodes. The links may extend in a single direction, for example, a forward direction through the neural network. While the neural network may be referred to as an “artificial” neural network, such reference is not intended to impart any relatedness with respect to how the neural network computationally maps or thereby intuitively recognizes information and how a human brain operates. That is, the term “artificial neural network” is merely a term of art referring to the hardware-implemented neural network.
The CNN 20 may be used to extract “features”, such as a border and a line color, from the input data 10. The CNN 20 may include a plurality of layers. Each layer may receive data and may process data that is input to a corresponding layer and generate data that is output from the corresponding layer. Data that is output from a layer may be a feature map that is generated through a convolution operation between an image or a feature map input to the CNN 20 and a filter weight. Initial layers of the CNN 20 may operate to extract low level features, such as edges and gradients, from input. Subsequent layers of the CNN 20 may gradually extract more complex features, such as eyes and nose, in the image.
Referring to
A number of filters 110-1 to 110-N may be N. Each of the filters 110-1 to 110-N may include n-by-n (n×n) weights. For example, each of the filters 110-1 to 110-N may have 3×3 pixels and a depth value of K.
Referring to
The process of performing the convolution operation may refer to a process of performing multiplication and addition operations by applying the filter 110 with a desired scale, for example, a size of n×n, from an upper left end to a lower right end of the input feature map 100 in a current layer. Hereinafter, an example process of performing a convolution operation on the filter 110 with a size of 3×3 is described.
For example, 3×3 data in a first area 101 at the upper left end of the input feature map 100 (that is, a total of nine data X11 to X33 including three data in a first direction and three data in a second direction) and weights W11 to W33 of the filter 110 may be multiplied, respectively. By accumulating and summing all output values of the multiplication operation (that is, X11*W11, X12*W12, X13*W13, X21*W21, X22*W22, X23*W23, X31*W31, X32*W32, and X33*W33), (1-1)-th output data Y11 of the output feature map 120 may be generated.
A subsequent operation may be performed by shifting from the first area 101 to a second area 102 of the input feature map 100 by a unit of data. Here, in a convolution operation process, the number of data elements by which the area shifts in the input feature map 100 may be referred to as a stride, and a size of the output feature map 120 to be generated may be determined based on a size of the stride. For example, when stride=1, a total of nine input data X12 to X34 included in the second area 102 and weights W11 to W33 are multiplied, respectively, and (1-2)-th output data Y12 of the output feature map 120 may be generated by accumulating and summing all the output values of the multiplication operation (that is, X12*W11, X13*W12, X14*W13, X22*W21, X23*W22, X24*W23, X32*W31, X33*W32, and X34*W33).
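The sliding-window multiply and accumulate process described above may be sketched as follows. This is a minimal illustration only, assuming a single-channel input feature map without padding; the function name conv2d and the sample values of X and W are hypothetical and are not taken from the disclosure.

```python
def conv2d(feature_map, weights, stride=1):
    """Slide an n-by-n filter over a 2-D input feature map (no padding)."""
    n = len(weights)
    rows = (len(feature_map) - n) // stride + 1
    cols = (len(feature_map[0]) - n) // stride + 1
    output = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            # Multiply each datum of the current area by the matching weight
            # and accumulate the products into one output datum.
            for a in range(n):
                for b in range(n):
                    acc += feature_map[i * stride + a][j * stride + b] * weights[a][b]
            output[i][j] = acc
    return output


# Hypothetical 4x4 input feature map (X11 to X44) and 3x3 filter (W11 to W33).
X = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
W = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
Y = conv2d(X, W, stride=1)  # Y[0][0] plays the role of Y11, Y[0][1] of Y12
```

With a stride of 1, the 4×4 input and the 3×3 filter yield a 2×2 output feature map; a larger stride would yield a smaller output.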
Referring to
The host 210 may perform the overall functionality of controlling the neural network apparatus 200. The host 210 may control the overall operation of the neural network apparatus 200 by running programs stored in the memory 220 included in the neural network apparatus 200. The host 210 may be or include a central processing unit (CPU), a graphics processing unit (GPU), an application processor (AP), etc., provided in the neural network apparatus 200, but is not limited thereto.
The host 210 may output an operation result regarding a class to which input data corresponds among classes using a neural network trained for classification. In detail, the neural network for classification may output an operation result for a probability that input data corresponds to each of the classes as a result value for each corresponding class. Also, the neural network for classification may include a softmax layer and a loss layer. The softmax layer may convert the result value for each of the classes to a probability value and the loss layer may calculate a loss as an objective function for learning of the neural network.
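As a non-limiting illustration, the softmax layer and the loss layer may be sketched as follows. The disclosure does not specify the loss function; a cross-entropy loss is assumed here purely for illustration, and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Convert per-class result values to probability values."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Assumed loss: negative log-probability of the correct class."""
    return -math.log(probs[target_index])


probs = softmax([1.0, 2.0, 3.0])
loss = cross_entropy(probs, target_index=2)
```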
The memory 220 may be hardware configured to store data that is processed and data to be processed in the neural network apparatus 200. Also, the memory 220 may store an application and a driver to be run by the neural network apparatus 200. The memory 220 may include a volatile memory, such as a dynamic random access memory (DRAM), or a nonvolatile memory.
The neural network apparatus 200 may include the hardware accelerator 230 for driving the neural network. The hardware accelerator 230 may correspond to, for example, a neural processing unit (NPU), a tensor processing unit (TPU), and a neural engine, which are dedicated modules for driving the neural network.
Operations of
A typical calculation apparatus performing a deep learning operation may process iterative multiplication and addition operations for many layers and may perform a large amount of computation accordingly. In contrast, a calculation apparatus of one or more embodiments may reduce an amount of deep learning computation by converting data expressed as floating point, non-limiting examples of which will be described below.
In operation 310, a hardware accelerator may receive a plurality of pieces of input data expressed as floating point. The hardware accelerator may also be referred to as the calculation apparatus. In another example, a neural network apparatus (e.g., the neural network apparatus 200) may also be referred to as the calculation apparatus, and may include the hardware accelerator. The input data may include weight and input feature map data described above with reference to
Prior to describing an example of the calculation method, a method of expressing data as floating point is described. Floating point refers to arithmetic that uses a formulaic representation of real numbers as an approximation, in which a number is represented with a mantissa that expresses a significand, without fixing a position of the decimal point, and an exponent that expresses the position of the decimal point. For example, if 263.3 expressed in a decimal system is expressed in a binary system, it is 100000111.0100110 . . . , which may be expressed as 1.0000011101×2^8. In addition, if it is expressed as 16-bit floating point, the sign bit (1 bit) is 0 (positive number), the exponent bits (5 bits) are 11000 (8+16 (bias)), and the mantissa bits (10 bits) are 0000011101, which may be finally expressed as 0110000000011101.
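The encoding of 263.3 described above may be reproduced with the following sketch. The function name encode_fp16 is hypothetical, the mantissa is truncated rather than rounded (an assumption), and the bias is left as a parameter: the example above uses a bias of 16, whereas the IEEE 754 binary16 format uses a bias of 15. Zero, subnormal, and special values are not handled.

```python
import math


def encode_fp16(value, bias=16, exp_bits=5, man_bits=10):
    """Return (sign, exponent, mantissa) bit strings of a nonzero normal value."""
    sign = "1" if value < 0 else "0"
    frac, exp = math.frexp(abs(value))  # abs(value) == frac * 2**exp, 0.5 <= frac < 1
    exponent = exp - 1                  # rewrite as 1.xxx * 2**exponent
    significand = frac * 2              # 1.0 <= significand < 2.0
    # Keep man_bits fraction bits of the significand; extra bits are truncated.
    mantissa = int((significand - 1.0) * (1 << man_bits))
    return (sign,
            format(exponent + bias, f"0{exp_bits}b"),
            format(mantissa, f"0{man_bits}b"))


sign, exponent, mantissa = encode_fp16(263.3)
bits = sign + exponent + mantissa
```

With the bias of 16 used in the example above, the three fields concatenate to the 16-bit pattern given in the text.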
In operation 320, the hardware accelerator may adjust a bit-width of mantissa based on a size of each piece of the input data. The hardware accelerator of one or more embodiments may reduce a bit-width required for multiplication and addition operations between input data expressed as floating point by dynamically setting a bit-width of mantissa of a floating point to be different based on a size of the corresponding input data during a deep learning operation, which may lead to minimizing a loss of accuracy and reducing computational complexity. An expression scheme of input data adjusted according to the method disclosed herein may be referred to as a dynamic floating point number.
As described above, the deep learning operation may use iterative multiplication and addition operations for many layers. Therefore, quantization schemes of one or more embodiments and the hardware accelerator of one or more embodiments to support the same are described herein to process many operations with low cost and high efficiency. A quantization scheme may refer to a method of increasing a computation speed by lowering precision of an artificial neural network parameter and may be, for example, a method of converting 32-bit floating point data to 8-bit integer data.
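For reference, a quantization scheme of the kind mentioned above may be sketched as follows. This is one common scheme (symmetric quantization with a single scale factor) chosen as an assumption for illustration, not a scheme specified by the disclosure; the function names and sample values are hypothetical, and Python floats stand in for 32-bit floating point data.

```python
def quantize_int8(values):
    """Symmetric quantization of floating point values to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map 8-bit integers back to floating point values."""
    return [q * scale for q in quantized]


weights = [0.02, -0.75, 0.31, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
# The round trip is lossy: each value may be off by up to half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```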
However, while quantization of data may increase a computation speed, quantization of data may also decrease a computation accuracy. Therefore, a typical hardware accelerator may perform a re-training process to maintain the computation accuracy. Also, in the case of performing an operation in real time, every time an operation with a large quantization error, such as batch normalization, is performed, the typical hardware accelerator may perform a de-quantization and quantization process. That is, the typical hardware accelerator may de-quantize the data (for example, by converting 8-bit integer data to 32-bit floating-point data), perform the operation using the data expressed as floating point, and then quantize the result again to reduce the computational complexity of a subsequent operation. Therefore, the typical hardware accelerator may only obtain a limited gain. In contrast, the hardware accelerator of one or more embodiments may reduce both a loss of accuracy and an amount of computation.
Since the overall computational accuracy may decrease when data of a large size is approximated, the hardware accelerator of one or more embodiments may simultaneously reduce a loss of accuracy and an amount of computation through a method of allocating a bit-width of mantissa in proportion to a size of data.
In one example, the hardware accelerator of one or more embodiments may adjust a bit-width of mantissa using only the exponent without using the entire input data. For example, the hardware accelerator may adjust a bit-width of the mantissa in proportion to a size of the exponent. This scheme is advantageous in terms of an access speed and a computation speed compared to a typical scheme that uses the entire input data. Also, since the exponent expresses a location of the decimal point, the size of the exponent has a dominant influence on a size of input data. Therefore, adjusting the bit-width of the mantissa using the size of the exponent does not greatly degrade the overall accuracy of computation. The smaller the size of the exponent of the input data, the smaller the influence of the mantissa of the input data on the accuracy of the overall operation. The hardware accelerator of one or more embodiments may thus simultaneously decrease the loss of accuracy and the computational complexity.
In detail, the hardware accelerator may compare the exponent of input data to a threshold using a comparator and may allocate a larger bit-width to mantissa of input data with a large exponent (e.g., with an exponent greater than or equal to the threshold) and allocate a smaller bit-width to mantissa of input data with a smaller exponent (e.g., with an exponent less than the threshold). A non-limiting example of a method of adjusting a bit-width of mantissa based on a threshold is described with reference to
In operation 330, the hardware accelerator may perform an operation between the input data with the adjusted bit-width. The hardware accelerator may perform multiplication and addition operations between input data with the adjusted bit-width. For example, the hardware accelerator may perform multiplication and addition operations between a weight with an adjusted bit-width and an input feature map with an adjusted bit-width.
The hardware accelerator may perform a multiplication operation through a normalization after multiplication between exponents of the respective input data and between mantissas of the respective input data. Here, the multiplication between exponents may refer to multiplication between exponents with the same base and thus, may be identical to performing an addition. The multiplication between mantissas may be performed in the same manner as an integer multiplication. A non-limiting example of a calculation method between input data with the adjusted bit-width is described with reference to
The hardware accelerator may repeat operations 310 to 330 for each layer. The hardware accelerator may receive input data that is input to a corresponding layer for each layer, may adjust a bit-width of mantissa of each piece of the input data, and may perform an operation between input data with the adjusted bit-width. In addition, a threshold to be compared to an exponent of input data may be determined for each layer.
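The floating point multiplication used in operation 330 (addition of exponents, integer multiplication of mantissas, then normalization) may be sketched as follows. This is a simplified illustration under stated assumptions: each value is given as a hypothetical tuple of (sign, unbiased exponent, mantissa field), an implicit leading 1 is assumed, results are truncated rather than rounded, and special values are ignored.

```python
def fp_multiply(a, b, man_bits=10):
    """Multiply two values given as (sign, unbiased exponent, mantissa field)."""
    sign_a, exp_a, man_a = a
    sign_b, exp_b, man_b = b
    sig_a = (1 << man_bits) | man_a  # restore the implicit leading 1
    sig_b = (1 << man_bits) | man_b
    sign = sign_a ^ sign_b
    exponent = exp_a + exp_b         # same base, so exponents are added
    product = sig_a * sig_b          # plain integer multiplication
    # Normalization: the product of two values in [1, 2) lies in [1, 4).
    if product >> (2 * man_bits + 1):
        product >>= 1
        exponent += 1
    mantissa = (product >> man_bits) & ((1 << man_bits) - 1)  # truncate
    return sign, exponent, mantissa


def to_float(x, man_bits=10):
    """Decode a (sign, unbiased exponent, mantissa field) tuple for checking."""
    sign, exponent, mantissa = x
    return (-1) ** sign * (1 + mantissa / (1 << man_bits)) * 2.0 ** exponent


three = (0, 1, 0b1000000000)  # 1.5 * 2**1
five = (0, 2, 0b0100000000)   # 1.25 * 2**2
result = to_float(fp_multiply(three, five))
```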
Referring to
The hardware accelerator may receive a first input 401 and a second input 402 expressed as floating point. For example, the first input 401 may be a weight and the second input 402 may be an input feature map. Alternatively, the first input 401 may be an input feature map and the second input 402 may be a weight.
The dynamic floating point conversion module 410 of the hardware accelerator may adjust a bit-width of the mantissa of each of the first input 401 and the second input 402 based on an exponent size of each of the first input 401 and the second input 402. That is, the dynamic floating point conversion module 410 may convert each of the first input 401 and the second input 402 expressed as floating point to a dynamic floating point number with a smaller bit-width of the mantissa.
The dynamic floating point conversion module 410 may include a comparator, and may compare an exponent of input data to a threshold using the comparator and may allocate a larger bit-width to mantissa of input data with a large exponent (e.g., with an exponent greater than or equal to the threshold) and may allocate a smaller bit-width to mantissa of input data with a small exponent (e.g., with an exponent less than the threshold). By performing threshold comparison using only the exponent rather than the entire input data, the dynamic floating point conversion module 410 of one or more embodiments may convert data expressed as floating point with low cost and without a loss of accuracy.
The dynamic floating point conversion module 410 may output information about the bit-width allocated to the mantissa of input data together with the input data of which the bit-width is adjusted. For example, the dynamic floating point conversion module 410 may output information 403 about the bit-width of the first input 401 and information 404 about the bit-width of the second input 402 respectively with first input data 405 and the second input data 406 expressed as dynamic floating point numbers.
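The comparator-based conversion described above may be sketched as follows, as a non-limiting illustration. The function name mask_mantissa is hypothetical, and the sketch assumes a 16-bit layout with a 5-bit exponent field and a 10-bit mantissa field, a threshold expressed in terms of the stored exponent field, and a reduced width of 4 bits; none of these choices is mandated by the description.

```python
def mask_mantissa(bits16, threshold, small_width=4, man_bits=10, exp_bits=5):
    """Mask low mantissa bits when the exponent field is below the threshold.

    Returns the converted bit pattern and the bit-width allocated to the mantissa.
    """
    exponent = (bits16 >> man_bits) & ((1 << exp_bits) - 1)
    # Only the exponent is compared; the mantissa is never inspected.
    width = man_bits if exponent >= threshold else small_width
    keep_mask = ((1 << width) - 1) << (man_bits - width)  # top `width` mantissa bits
    mantissa = bits16 & ((1 << man_bits) - 1) & keep_mask
    upper = bits16 & ~((1 << man_bits) - 1)               # sign and exponent unchanged
    return upper | mantissa, width


# Exponent field 01010 (10) is below the hypothetical threshold of 16.
small, small_width_used = mask_mantissa(0b0010101011011101, threshold=16)
# Exponent field 11000 (24) is at or above the threshold, so all 10 bits are kept.
large, large_width_used = mask_mantissa(0b0110000000011101, threshold=16)
```

Returning the allocated width alongside the data mirrors the way bit-width information accompanies the converted input data.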
The mixed precision arithmetic module 420 of the hardware accelerator may perform an operation between the first input data 405 and the second input data 406 expressed as the dynamic floating point numbers. The mixed precision arithmetic module 420 may include an operator that performs a multiply and accumulate (MAC) operation.
The hardware accelerator may control timings at which the first input data 405 and the second input data 406 are input to the mixed precision arithmetic module 420, a number of cycles of the operation performed by the mixed precision arithmetic module 420, and a number of operating operators based on the information 403 about the bit-width of the first input 401 and the information 404 about the bit-width of the second input 402.
The mixed precision arithmetic module 420 may support a mixed precision arithmetic using a spatial fusion method and/or a temporal fusion method and may obtain a higher throughput when the bit-width of mantissa is reduced. As a result, through the reduction of the bit-width of the mantissa, the hardware accelerator of one or more embodiments may improve a hardware computation speed and power consumption compared to the typical floating point arithmetic or the typical hardware accelerator. Non-limiting examples of the spatial fusion method and the temporal fusion method are further described with reference to
A neural network apparatus according to an example may receive a distribution of input data for each layer and may determine a threshold corresponding to input data of each layer. For example, referring to
In
The distribution of input data may refer to a weight distribution of the trained artificial neural network for the weight and/or may refer to a distribution of sampled input sets for the input feature map.
The neural network apparatus may determine thresholds corresponding to the weight and the input feature map using a brute-force algorithm. The neural network apparatus may calculate an amount of computation and an accuracy of computation for each of all the threshold combinations and may determine a threshold combination that meets a predetermined criterion.
In detail, using the brute-force algorithm, the neural network apparatus may determine, among all the threshold combinations, a combination of candidate thresholds that has a lowest computational complexity while having an average value-wise error less than a pre-defined maximum average value-wise error. Here, a method of determining a threshold may, without being limited to the aforementioned examples, include any algorithm capable of determining a threshold combination that may maximize a value obtained by dividing an accuracy of computation by an amount of computation.
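A brute-force search of this kind may be sketched as follows. This is an illustrative sketch only: the error metric (mean absolute difference between each value and its truncated version), the cost proxy (product of the average mantissa bit-widths), the two-level bit-widths of 4 and 10, and all function names and sample values are assumptions introduced for the example and are not taken from the disclosure.

```python
import itertools
import math


def truncate(value, threshold, small_width=4, full_width=10):
    """Keep fewer mantissa bits when the (unbiased) exponent is below the threshold."""
    if value == 0.0:
        return 0.0, small_width
    frac, exp = math.frexp(abs(value))  # abs(value) == frac * 2**exp, 0.5 <= frac < 1
    width = full_width if exp - 1 >= threshold else small_width
    scale = 1 << (width + 1)            # leading 1 plus `width` fraction bits
    kept = math.floor(frac * scale) / scale
    return math.copysign(math.ldexp(kept, exp), value), width


def evaluate(samples, threshold):
    """Return (average value-wise error, average mantissa bit-width)."""
    errors, widths = [], []
    for v in samples:
        approx, width = truncate(v, threshold)
        errors.append(abs(v - approx))
        widths.append(width)
    return sum(errors) / len(errors), sum(widths) / len(widths)


def search_thresholds(weights, features, candidates, max_error):
    """Try every (weight threshold, feature threshold) pair; keep the cheapest valid one."""
    best = None
    for t_w, t_x in itertools.product(candidates, repeat=2):
        err_w, bits_w = evaluate(weights, t_w)
        err_x, bits_x = evaluate(features, t_x)
        if max(err_w, err_x) >= max_error:
            continue                    # exceeds the allowed average error
        cost = bits_w * bits_x          # hypothetical proxy for multiplication cost
        if best is None or cost < best[0]:
            best = (cost, t_w, t_x)
    return best


weights = [0.01, -0.3, 0.7, 1.9]        # hypothetical sampled weights
features = [0.05, 0.5, 2.5, 6.1]        # hypothetical sampled feature map data
candidates = list(range(-8, 4))
best = search_thresholds(weights, features, candidates, max_error=0.01)
```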
The neural network apparatus may determine a threshold and then divide the distribution of input data into a plurality of areas based on the determined threshold, and may allocate a bit-width corresponding to each area to the mantissa of the input data. For example, the neural network apparatus may dynamically allocate 10 bits, 8 bits, and 4 bits to mantissa of corresponding data based on an exponent size of the input data.
In detail, when an exponent of input data corresponds to −ths_left<x<ths_right that is a first area 510, the neural network apparatus may allocate 4 bits to the mantissa of the input data. When the exponent of input data corresponds to −thm
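As a non-limiting sketch, allocating a mantissa bit-width based on the exponent and then masking the mantissa may be illustrated on an IEEE 754 half-precision value as follows. The threshold values `th_small` and `th_medium`, the function names, and the assignment of 8 bits and 10 bits to the middle and outer ranges are assumptions made for illustration; only the allocation of 4 bits to the smallest-exponent area follows directly from the description above.

```python
import struct


def mantissa_bits_for(exponent, th_small, th_medium):
    """Pick a mantissa bit-width from the unbiased exponent.
    The two thresholds are hypothetical values, not taken from the description."""
    if exponent < th_small:
        return 4
    if exponent < th_medium:
        return 8
    return 10


def mask_fp16(value, th_small=-4, th_medium=0):
    """Convert value to half precision and keep only the upper mantissa bits."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    exponent = ((bits >> 10) & 0x1F) - 15        # remove the FP16 exponent bias of 15
    keep = mantissa_bits_for(exponent, th_small, th_medium)
    mask = 0xFFFF & ~((1 << (10 - keep)) - 1)     # clear the low (10 - keep) mantissa bits
    (masked,) = struct.unpack("<e", struct.pack("<H", bits & mask))
    return masked, keep
```

In this sketch the masking truncates the mantissa rather than rounding it, so the masked value never has a larger magnitude than the original value.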
Referring to
A mixed precision arithmetic module may be the mixed precision arithmetic module 420 of
Referring to Example 620, in the case of performing a mantissa multiplication operation between two pieces of data expressed as 16-bit floating point, 10 bits are fixedly allocated to the mantissa of corresponding input data and 9 cycles are consumed at all times accordingly. In an example, in the case of using input data converted to a dynamic floating point number, performance may be improved by up to nine times. For example, only a single cycle is consumed for an operation between DFP16_S data and DFP16_S data.
For example, referring to Example 610, only 3 cycles may be consumed for the mantissa multiplication operation between DFP16_L data and DFP16_S data. In detail, the mixed precision arithmetic module may complete the mantissa multiplication operation between DFP16_L data and DFP16_S data by performing the multiplication operation between lower 4 bits of DFP16_L data and the 4 bits of DFP16_S data in a first cycle, the multiplication operation between intermediate 4 bits of DFP16_L data and the 4 bits of DFP16_S data in a second cycle, and the multiplication operation between upper 2 bits of DFP16_L data and the 4 bits of DFP16_S data in a third cycle.
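As a non-limiting sketch, the cycle counts above may be reproduced by splitting each mantissa into pieces of at most 4 bits and accumulating one partial product per cycle. The 4-bit by 4-bit operator width is inferred from the example above, and the function names are hypothetical; the sketch models only the mantissa multiplication and omits the exponent, the sign, and any implicit leading bit.

```python
def split_chunks(mantissa, width, chunk_bits=4):
    """Split a width-bit mantissa into pieces of at most chunk_bits, lowest bits first.
    Each entry is (piece value, bit offset of the piece)."""
    chunks = []
    for shift in range(0, width, chunk_bits):
        size = min(chunk_bits, width - shift)
        chunks.append(((mantissa >> shift) & ((1 << size) - 1), shift))
    return chunks


def temporal_multiply(a, a_bits, b, b_bits, chunk_bits=4):
    """Multiply two mantissas using one piece-by-piece partial product per cycle.
    Returns (product, number of cycles consumed)."""
    product, cycles = 0, 0
    for a_chunk, a_shift in split_chunks(a, a_bits, chunk_bits):
        for b_chunk, b_shift in split_chunks(b, b_bits, chunk_bits):
            product += (a_chunk * b_chunk) << (a_shift + b_shift)
            cycles += 1
    return product, cycles
```

Under these assumptions, a 10-bit by 4-bit multiplication consumes 3 cycles, a 10-bit by 10-bit multiplication consumes 9 cycles, and a 4-bit by 4-bit multiplication consumes a single cycle, consistent with the examples above.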
Referring to
A mixed precision arithmetic module may be the mixed precision arithmetic module 420 of
For example, referring to Example 710, in the case of the mantissa multiplication operation between DFP16_L data and DFP16_S data, the nine operators may perform three operations per cycle. That is, in the corresponding case, the mixed precision arithmetic module may group three operators as a single operator set and may perform an operation on a single piece of DFP16_L data and three pieces of DFP16_S data (first DFP16_S data, second DFP16_S data, and third DFP16_S data) in a single cycle. As a non-limiting example, in
In detail, the mixed precision arithmetic module may complete the mantissa multiplication operation between a single piece of DFP16_L data and three pieces of DFP16_S data, for example, first DFP16_S data, second DFP16_S data, and third DFP16_S data, by performing the multiplication operation between DFP16_L data and the first DFP16_S data in a first operator set, the multiplication operation between the DFP16_L data and the second DFP16_S data in a second operator set, and the multiplication operation between the DFP16_L data and the third DFP16_S data in a third operator set.
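As a non-limiting sketch, the grouping of operators into operator sets may be estimated from the mantissa bit-widths as follows. The function name is hypothetical, and the 4-bit operator width and the rule that one operator handles one piece-by-piece partial product are assumptions inferred from the examples above; only the case of DFP16_L data and DFP16_S data with nine operators is stated in the description.

```python
def operations_per_cycle(a_bits, b_bits, num_operators=9, chunk_bits=4):
    """Estimate how many mantissa multiplications fit in one cycle.
    Returns (operations per cycle, operators grouped into one operator set)."""
    def pieces(bits):
        # Ceiling division: number of chunk_bits-wide pieces in a mantissa
        return -(-bits // chunk_bits)

    operators_per_set = pieces(a_bits) * pieces(b_bits)
    return num_operators // operators_per_set, operators_per_set
```

For a 10-bit mantissa and a 4-bit mantissa, the sketch yields three operator sets of three operators each, which corresponds to the three operations per cycle described above.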
Referring to
The processor 810 may receive a plurality of pieces of input data expressed as floating point, adjust a bit-width of mantissa of each piece of the input data based on a size of exponent of each piece of the input data, and perform an operation between input data with the adjusted bit-width.
The memory 830 may be a volatile memory or a nonvolatile memory.
In addition, the processor 810 may perform the method described above with reference to
The neural network apparatuses, hosts, memories, HW accelerators, floating point conversion modules, mixed precision arithmetic modules, calculation apparatuses, processors, communication interfaces, communication buses, neural network apparatus 200, host 210, memory 220, HW accelerator 230, floating point conversion module 410, mixed precision arithmetic module 420, calculation apparatus 800, processor 810, memory 830, communication interface 850, communication bus 805, and other apparatuses, devices, units, modules, and components described herein with respect to
The methods illustrated in
Instructions or software to control computing hardware, for example, one or more processors or computers, to implement the hardware components and perform the methods as described above may be written as computer programs, code segments, instructions or any combination thereof, for individually or collectively instructing or configuring the one or more processors or computers to operate as a machine or special-purpose computer to perform the operations that are performed by the hardware components and the methods as described above. In one example, the instructions or software include machine code that is directly executed by the one or more processors or computers, such as machine code produced by a compiler. In another example, the instructions or software include higher-level code that is executed by the one or more processors or computers using an interpreter. The instructions or software may be written using any programming language based on the block diagrams and the flow charts illustrated in the drawings and the corresponding descriptions in the specification, which disclose algorithms for performing the operations that are performed by the hardware components and the methods as described above.
The instructions or software to control computing hardware, for example, one or more processors or computers, to implement the hardware components and perform the methods as described above, and any associated data, data files, and data structures, may be recorded, stored, or fixed in or on one or more non-transitory computer-readable storage media. Examples of a non-transitory computer-readable storage medium include read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), flash memory, non-volatile memory, CD-ROMs, CD-Rs, CD+Rs, CD-RWs, CD+RWs, DVD-ROMs, DVD-Rs, DVD+Rs, DVD-RWs, DVD+RWs, DVD-RAMs, BD-ROMs, BD-Rs, BD-R LTHs, BD-REs, Blu-ray or optical disk storage, hard disk drive (HDD), solid state drive (SSD), a card type memory such as multimedia card micro or a card (for example, secure digital (SD) or extreme digital (XD)), magnetic tapes, floppy disks, magneto-optical data storage devices, optical data storage devices, hard disks, solid-state disks, and any other device that is configured to store the instructions or software and any associated data, data files, and data structures in a non-transitory manner and provide the instructions or software and any associated data, data files, and data structures to one or more processors or computers so that the one or more processors or computers can execute the instructions. In one example, the instructions or software and any associated data, data files, and data structures are distributed over network-coupled computer systems so that the instructions and software and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by the one or more processors or computers.
While this disclosure includes specific examples, it will be apparent after an understanding of the disclosure of this application that various changes in form and details may be made in these examples without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered in a descriptive sense only, and not for purposes of limitation. Descriptions of features or aspects in each example are to be considered as being applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order, and/or if components in a described system, architecture, device, or circuit are combined in a different manner, and/or replaced or supplemented by other components or their equivalents.
Number | Date | Country | Kind |
---|---|---|---|
10-2021-0111118 | Aug 2021 | KR | national |