This application claims the benefit of Korean Patent Applications No. 10-2022-0019574, filed Feb. 15, 2022, and No. 10-2023-0001234, filed Jan. 4, 2023, which are hereby incorporated by reference in their entireties into this application.
The present disclosure relates to a method for the hardware design and operation of a floating-point unit that may greatly increase resource utilization, compared to an existing structure, when performing outer-product-based matrix multiplication, which is used in various applications such as artificial neural networks.
Applications based on an artificial neural network or a deep-learning model generally perform operations on values stored in the form of a vector or a matrix for images, voice, pattern data, and the like.
Particularly, because each piece of data is represented as a floating-point number, the performance of floating-point matrix multiplication greatly affects the performance of an artificial neural network application. In particular, operations using small floating-point data types, such as 16-bit and 8-bit floating-point formats, rather than the existing 32-bit floating-point format, are widely used for recent artificial neural networks.
However, a currently used floating-point unit has a problem in that efficiency is decreased because part of its hardware cannot be used for parallel operations.
(Patent Document 1) Korean Patent Application Publication No. 10-2019-0119074, titled “Widening arithmetic in a data processing apparatus”.
An object of the present disclosure is to apply an outer-product-based matrix multiplication method to floating-point matrix multiplication, which is used in various fields, such as artificial neural network operations, and the like, thereby improving operation efficiency.
Another object of the present disclosure is to provide a multi-format floating-point operation structure that is capable of upper-level operation using multiple lower-level operators.
In order to accomplish the above objects, a method for outer-product-based matrix multiplication for a floating-point data type according to an embodiment of the present disclosure includes receiving first floating-point data and second floating-point data and performing matrix multiplication on the first floating-point data and the second floating-point data, and the result value of the matrix multiplication is calculated based on the suboperation result values of respective floating-point units.
Here, the suboperation result values may correspond to intermediate result values of an outer product of the first floating-point data and the second floating-point data.
Here, the first floating-point data and the second floating-point data may be divided into sizes capable of being input to the floating-point units and may then be input to the respective floating-point units.
Here, performing the matrix multiplication may comprise performing a shift operation and an addition operation on the suboperation result value of each of the floating-point units.
Here, each of the first floating-point data and the second floating-point data may be divided into upper bits and lower bits.
Here, performing the matrix multiplication may comprise performing a shift operation, corresponding to double the size of the lower bits, on the result value of a suboperation performed on the upper bits of the first floating-point data and the upper bits of the second floating-point data.
Here, performing the matrix multiplication may comprise performing a shift operation, corresponding to the size of the lower bits, on the result value of a suboperation performed on the upper bits of the first floating-point data and the lower bits of the second floating-point data and on the result value of a suboperation performed on the lower bits of the first floating-point data and the upper bits of the second floating-point data.
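As a purely illustrative sketch of the division-and-recombination summarized above, the following Python fragment multiplies two integer values by dividing each into upper and lower bits, performing four suboperations, and recombining the results with shift and addition operations. The function and variable names, and the lower-bit size m, are hypothetical and serve only to mirror the description; they are not part of the disclosed hardware.

# Illustrative sketch only: recombining four suboperation results with
# shift and addition operations. All names are hypothetical.
def split(x, m):
    # Divide a value into its upper bits and its m lower bits.
    return x >> m, x & ((1 << m) - 1)

def multiply_via_suboperations(x, y, m):
    x_hi, x_lo = split(x, m)
    y_hi, y_lo = split(y, m)
    hh = x_hi * y_hi  # upper bits x upper bits
    hl = x_hi * y_lo  # upper bits x lower bits
    lh = x_lo * y_hi  # lower bits x upper bits
    ll = x_lo * y_lo  # lower bits x lower bits
    # hh is shifted by double the lower-bit size; hl and lh are shifted
    # by the lower-bit size, as described above.
    return (hh << (2 * m)) + ((hl + lh) << m) + ll

assert multiply_via_suboperations(1234, 567, 6) == 1234 * 567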
Also, in order to accomplish the above objects, an apparatus for outer-product-based matrix multiplication for a floating-point data type according to an embodiment of the present disclosure includes an input unit for receiving first floating-point data and second floating-point data and an operation unit for performing matrix multiplication on the first floating-point data and the second floating-point data, and the operation unit includes suboperation units for calculating suboperation result values for the result value of the matrix multiplication.
Here, the suboperation result values may correspond to intermediate result values of an outer product of the first floating-point data and the second floating-point data.
Here, the first floating-point data and the second floating-point data may be divided into sizes capable of being input to the suboperation units and may then be input to the respective suboperation units.
Here, the operation unit may perform a shift operation and an addition operation on the suboperation result value of each of the suboperation units.
Here, each of the first floating-point data and the second floating-point data may be divided into upper bits and lower bits.
Here, the operation unit may perform a shift operation, corresponding to double the size of the lower bits, on the result value of a suboperation performed on the upper bits of the first floating-point data and the upper bits of the second floating-point data.
Here, the operation unit may perform a shift operation, corresponding to the size of the lower bits, on the result value of a suboperation performed on the upper bits of the first floating-point data and the lower bits of the second floating-point data and on the result value of a suboperation performed on the lower bits of the first floating-point data and the upper bits of the second floating-point data.
The above and other objects, features, and advantages of the present disclosure will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings.
The advantages and features of the present disclosure and methods of achieving them will be apparent from the exemplary embodiments to be described below in detail with reference to the accompanying drawings. However, it should be noted that the present disclosure is not limited to the following exemplary embodiments and may be implemented in various forms. Accordingly, the exemplary embodiments are provided only to make the disclosure complete and to fully convey the scope of the present disclosure to those skilled in the art, and the present disclosure is to be defined based only on the claims. The same reference numerals or the same reference designators denote the same elements throughout the specification.
It will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements are not intended to be limited by these terms. These terms are only used to distinguish one element from another element. For example, a first element discussed below could be referred to as a second element without departing from the technical spirit of the present disclosure.
The terms used herein are for the purpose of describing particular embodiments only, and are not intended to limit the present disclosure. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises,” “comprising,” “includes,” and/or “including,” when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
In the present specification, each of expressions such as “A or B”, “at least one of A and B”, “at least one of A or B”, “A, B, or C”, “at least one of A, B, and C”, and “at least one of A, B, or C” may include any one of the items listed in the expression or all possible combinations thereof.
Unless differently defined, all terms used herein, including technical or scientific terms, have the same meanings as terms generally understood by those skilled in the art to which the present disclosure pertains. Terms identical to those defined in generally used dictionaries should be interpreted as having meanings identical to contextual meanings of the related art, and are not to be interpreted as having ideal or excessively formal meanings unless they are definitively defined in the present specification.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings. In the following description of the present disclosure, the same reference numerals are used to designate the same or similar elements throughout the drawings, and repeated descriptions of the same components will be omitted.
In order to complete the multiplication of two matrices, multiple multiplication and addition operations must be performed.
As methods for performing the multiplication of two matrices, there are a method of using an inner product, as shown in (a) of the accompanying drawings, and a method of using an outer product, as shown in (b) thereof. The inner-product method calculates each element of the result matrix as the dot product of a row of the first matrix and a column of the second matrix, whereas the outer-product method accumulates the result matrix as the sum of outer products of a column of the first matrix and the corresponding row of the second matrix.
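For reference, the two formulations may be sketched in Python as follows; this is an illustrative sketch using plain nested lists rather than any particular hardware or library, and both functions compute the same product.

# Illustrative sketch: inner-product and outer-product orderings of
# matrix multiplication compute the same result.
def matmul_inner(A, B):
    # Each C[i][j] is the dot product of row i of A and column j of B.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            C[i][j] = sum(A[i][p] * B[p][j] for p in range(k))
    return C

def matmul_outer(A, B):
    # C is accumulated as a sum of rank-1 (outer-product) updates, one per
    # column of A and the corresponding row of B.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for p in range(k):
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][p] * B[p][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert matmul_inner(A, B) == matmul_outer(A, B)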
A floating-point unit (FPU) is a hardware structure for performing various operations, including arithmetic operations, on floating-point data, which is used to represent real numbers in a binary system, in a computer system. In order to support efficient parallel operations for various floating-point data types (e.g., FP64, FP32, FP16, BF16, and FP8), many structures capable of processing multiple pieces of small data in parallel using an FPU for a single large data type (so-called multiformat vector FPUs) have been proposed.
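The lane-based sharing underlying such structures may be illustrated roughly as follows; this is a sketch with hypothetical values that treats a register as raw integer lanes, whereas an actual multiformat vector FPU operates on floating-point sign, exponent, and fraction fields.

# Illustrative sketch: one 64-bit register viewed as narrower parallel lanes.
def lanes(reg64, width):
    # Split a 64-bit value into 64/width lanes, least significant lane first.
    mask = (1 << width) - 1
    return [(reg64 >> (i * width)) & mask for i in range(64 // width)]

reg = 0x1122334455667788
print(lanes(reg, 32))  # two 32-bit lanes
print(lanes(reg, 16))  # four 16-bit lanes
print(lanes(reg, 8))   # eight 8-bit lanes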
Accordingly, what is required is technology for designing new FPU hardware that increases the utilization of hardware resources, which are wasted to a greater extent as the floating-point data type becomes smaller, and that processes floating-point matrix multiplication for various data types more efficiently.
The most basic FPU requires separate FPU hardware components for respective data types in order to perform operations on different types of floating-point data. The conventional technology (the multiformat vector FPU), which is more advanced than the basic FPU, enables operations on various types of floating-point data using a single shared hardware component and supports parallel operations on small data types. However, an underutilized hardware resource is still present in the conventional technology; by changing the existing vector operation structure into a matrix operation structure, the utilization of the hardware resource, and thus parallel floating-point operation performance per hardware area, may be improved.
Referring to the accompanying flowchart, the method for outer-product-based matrix multiplication for a floating-point data type according to an embodiment of the present disclosure includes receiving first floating-point data and second floating-point data and performing matrix multiplication on the first floating-point data and the second floating-point data (S120), and the result value of the matrix multiplication is calculated based on the suboperation result values of respective floating-point units.
Here, the suboperation result value may correspond to the intermediate result value of the outer product of the first floating-point data and the second floating-point data.
Here, each of the first floating-point data and the second floating-point data may be input to each of the floating-point units after being divided into sizes capable of being input to the floating-point unit.
Here, at the step (S120) of performing the matrix multiplication, shift and addition operations may be performed on the suboperation result of each of the floating-point units.
Here, each of the first floating-point data and the second floating-point data may be divided into upper bits and lower bits.
Here, at the step (S120) of performing the matrix multiplication, a shift operation corresponding to double the size of the lower bits may be performed on the result value of the suboperation performed on the upper bits of the first floating-point data and the upper bits of the second floating-point data.
Here, at the step (S120) of performing the matrix multiplication, a shift operation corresponding to the size of the lower bits may be performed on the result value of the suboperation performed on the upper bits of the first floating-point data and the lower bits of the second floating-point data and on the result value of the suboperation performed on the lower bits of the first floating-point data and the upper bits of the second floating-point data.
Referring to the accompanying drawings, the proposed structure is capable of processing four FP8 multiplication operations at once or processing one FP16 multiplication operation using the hardware resource of a single shared multiplier. Accordingly, resource utilization and parallel operation performance may be improved, compared to an existing FPU, which is capable of processing two FP8 multiplication operations at once or processing one FP16 multiplication operation.
Referring to (b) of the accompanying drawings, an underutilized hardware resource remains in the conventional multiformat vector FPU structure when small floating-point data types are processed.
According to an embodiment of the present disclosure, four small operators may be collectively used as a single large operator in order to improve the utilization of an FPU hardware resource, which is underutilized in the conventional FPU, and to support multi-format floating-point operations. In (c) of the accompanying drawings, four FP8 operators are combined so as to operate as a single FP16 operator.
In the example illustrated in the accompanying drawings, the input value A is divided into upper bits a and lower bits b, and the input value B is divided into upper bits c and lower bits d, where the size of the lower bits is 6 bits. The product of A and B may then be calculated as follows:
A = (a<<6) + b
B = (c<<6) + d
A×B = ((a<<6) + b) × ((c<<6) + d) = (ac<<12) + (ad<<6) + (bc<<6) + bd
That is, matrix multiplication may be performed using four multiplication operations, three bit-shift operations, and three addition operations.
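A minimal numeric check of this decomposition, written in Python under the 6-bit lower-bit split of the example above (illustrative only; the variable names are hypothetical):

# Illustrative check: four multiplications, three bit-shift operations, and
# three addition operations reproduce the full product A x B.
def combined_multiply(a, b, c, d):
    ac, ad, bc, bd = a * c, a * d, b * c, b * d     # four multiplications
    return (ac << 12) + (ad << 6) + (bc << 6) + bd  # three shifts, three additions

A = (21 << 6) + 45  # a = 21, b = 45
B = (9 << 6) + 60   # c = 9, d = 60
assert combined_multiply(21, 45, 9, 60) == A * B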
Also, the proposed FPU structure may be applied recursively across floating-point data formats. That is, just as four FP8 operators are combined into a single FP16 operator, four FP16 operators may be combined into a single FP32 operator, and four FP32 operators may be combined into a single FP64 operator.
Consequently, when the proposed hardware design method is applied, a single FP64 operator may perform a single FP64 operation, four FP32 operations, 16 FP16 operations, or 64 FP8 operations at once. Accordingly, the resource sharing utilization of FPU hardware and parallel operation ability per hardware area for small floating-point data types, such as FP16 and FP8, may be improved.
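This recursive combination may be modeled conceptually in Python as follows; the sketch covers unsigned significand multiplication only, with hypothetical bit widths, and omits the exponent, sign, and rounding handling that the actual FPU must also perform.

# Illustrative sketch: an n-bit multiplication built recursively from four
# (n/2)-bit multiplications, mirroring FP8 -> FP16 -> FP32 -> FP64 reuse.
def recursive_multiply(x, y, bits):
    if bits <= 8:
        return x * y  # base case: one smallest (FP8-sized) multiplier
    half = bits // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    hh = recursive_multiply(x_hi, y_hi, half)
    hl = recursive_multiply(x_hi, y_lo, half)
    lh = recursive_multiply(x_lo, y_hi, half)
    ll = recursive_multiply(x_lo, y_lo, half)
    return (hh << bits) + ((hl + lh) << half) + ll

assert recursive_multiply(123456, 654321, 32) == 123456 * 654321

At each level of the recursion, four operators of the level below are reused and the shift amounts double along with the operand width.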
This FPU hardware structure may be used in semiconductors, such as AI processors, for accelerating artificial neural network applications in which matrix multiplication performance is important, and particularly in applications that perform many matrix multiplication operations using small floating-point data types such as FP16 and FP8.
Referring to the accompanying drawings, the apparatus for outer-product-based matrix multiplication for a floating-point data type according to an embodiment of the present disclosure includes an input unit for receiving first floating-point data and second floating-point data and an operation unit 220 for performing matrix multiplication on the first floating-point data and the second floating-point data, and the operation unit 220 includes suboperation units for calculating suboperation result values for the result value of the matrix multiplication.
Here, the suboperation result value may correspond to the intermediate result value of the outer product of the first floating-point data and the second floating-point data.
Here, each of the first floating-point data and the second floating-point data may be input to each of the suboperation units after being divided into sizes capable of being input to the suboperation unit.
Here, the operation unit 220 may perform shift and addition operations on the suboperation result value of each of the suboperation units.
Here, each of the first floating-point data and the second floating-point data may be divided into upper bits and lower bits.
Here, the operation unit 220 may perform a shift operation corresponding to double the size of the lower bits on the result value of the suboperation performed on the upper bits of the first floating-point data and the upper bits of the second floating-point data.
Here, the operation unit 220 may perform a shift operation corresponding to the size of the lower bits on the result value of the suboperation performed on the upper bits of the first floating-point data and the lower bits of the second floating-point data and on the result value of the suboperation performed on the lower bits of the first floating-point data and the upper bits of the second floating-point data.
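As a purely structural illustration, the described apparatus may be modeled in software as follows; the class and method names are hypothetical and model the described units in code rather than denoting the disclosed hardware.

# Illustrative software model: an operation unit that aggregates the results
# of four suboperation units with shift and addition operations.
class SuboperationUnit:
    def multiply(self, x, y):
        # Produces one intermediate (suboperation) result value.
        return x * y

class OperationUnit:
    def __init__(self, lower_bits):
        self.m = lower_bits
        self.subunits = [SuboperationUnit() for _ in range(4)]

    def multiply(self, x, y):
        mask = (1 << self.m) - 1
        x_hi, x_lo = x >> self.m, x & mask
        y_hi, y_lo = y >> self.m, y & mask
        hh = self.subunits[0].multiply(x_hi, y_hi)
        hl = self.subunits[1].multiply(x_hi, y_lo)
        lh = self.subunits[2].multiply(x_lo, y_hi)
        ll = self.subunits[3].multiply(x_lo, y_lo)
        # Shift-and-add recombination performed by the operation unit.
        return (hh << (2 * self.m)) + ((hl + lh) << self.m) + ll

unit = OperationUnit(lower_bits=6)
assert unit.multiply(1389, 636) == 1389 * 636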
The apparatus for outer-product-based matrix multiplication for a floating-point data type according to an embodiment may be implemented in a computer system 1000 including a computer-readable recording medium.
The computer system 1000 may include one or more processors 1010, memory 1030, a user-interface input device 1040, a user-interface output device 1050, and storage 1060, which communicate with each other via a bus 1020. Also, the computer system 1000 may further include a network interface 1070 connected to a network 1080. The processor 1010 may be a central processing unit or a semiconductor device for executing a program or processing instructions stored in the memory 1030 or the storage 1060. The memory 1030 and the storage 1060 may be storage media including at least one of a volatile medium, a nonvolatile medium, a detachable medium, a non-detachable medium, a communication medium, or an information delivery medium, or a combination thereof. For example, the memory 1030 may include ROM 1031 or RAM 1032.
According to the present disclosure, an outer-product-based matrix multiplication method is applied to floating-point matrix multiplication, which is used in various fields, such as artificial neural network operations, and the like, whereby operation efficiency may be improved.
Also, the present disclosure may provide a multi-format floating-point operation structure that is capable of upper-level operation using multiple lower-level operators.
Specific implementations described in the present disclosure are embodiments and are not intended to limit the scope of the present disclosure. For conciseness of the specification, descriptions of conventional electronic components, control systems, software, and other functional aspects thereof may be omitted. Also, the lines connecting components or connecting members illustrated in the drawings represent functional connections and/or physical or circuit connections, and in an actual device they may be embodied as various alternative or additional functional, physical, or circuit connections. Also, unless a specific term such as “essential” or “important” is used, the corresponding component may not be absolutely necessary.
Accordingly, the spirit of the present disclosure should not be construed as being limited to the above-described embodiments, and the entire scope of the appended claims and their equivalents should be understood as defining the scope and spirit of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---
10-2022-0019574 | Feb 2022 | KR | national |
10-2023-0001234 | Jan 2023 | KR | national |