This application relates to the field of computer technologies, and in particular, to a matrix computing method, an apparatus, a circuit, a system, a chip, and a device.
In the field of computer technologies, a floating point (FP) number is mainly used to represent a decimal, and the floating point number usually includes three parts: a sign bit, an exponent bit, and a mantissa bit. Floating point number-based matrix computing is a common computing method, and may be applied to a plurality of scenarios such as artificial intelligence, deep learning, and high performance computing.
In the conventional technology, in floating point number-based matrix computing, a matrix computing unit based on a half-precision floating point number is provided. The half-precision floating point number has a bit width of 16 bits, and therefore may be referred to as FP16. As shown in
However, both a mantissa computing bit width and an exponent computing bit width in the foregoing matrix computing unit are designed based on FP16, and therefore are applicable to only FP16-based matrix computing, but are not applicable to matrix computing based on a floating point number with a relatively large bit width. Consequently, applicability is poor.
This application provides a matrix computing method, an apparatus, a circuit, a system, a chip, and a device, to implement high-precision matrix computing based on a low-precision matrix computing unit, so as to improve applicability of the matrix computing unit.
According to a first aspect, a matrix computing method is provided. The method is performed by a matrix computing unit, the matrix computing unit may be a matrix computing unit designed based on FP16, and the method includes: obtaining a computing instruction, where the computing instruction includes a to-be-computed matrix and a matrix computing type, precision of a floating point number in the to-be-computed matrix may be higher than FP16, and the matrix computing type may include matrix multiplication, matrix addition, matrix multiplication-addition, and the like; disassembling the to-be-computed matrix to obtain a plurality of disassembled matrices, where precision of a floating point number in the disassembled matrix is lower than the precision of the floating point number in the to-be-computed matrix, the precision of the floating point number in the disassembled matrix may be represented as second precision, the precision of the floating point number in the to-be-computed matrix is represented as first precision, and for example, the first precision is FP32, FP64, or FP128 and the second precision is FP16; and performing computing processing on the plurality of disassembled matrices based on the matrix computing type, to obtain a matrix operation result.
In the foregoing technical solutions, when the to-be-computed matrix and the matrix computing type are obtained, if the precision of the floating point number in the to-be-computed matrix is relatively high, the to-be-computed matrix may be disassembled into a plurality of matrices including low-precision floating point numbers, for example, disassembled into a plurality of matrices including FP16 floating point numbers, and computing processing is performed, based on the matrix computing type, on the plurality of matrices including the FP16 floating point numbers, to obtain a matrix operation result corresponding to the to-be-computed matrix. In this way, high-precision matrix computing can be implemented based on a low-precision matrix computing unit, so that applicability of the matrix computing unit is improved. In addition, in a matrix computing process, upper-layer software applications such as an AI application and an HPC application based on the matrix computing unit are unaware of a specific matrix computing process, so that software adaptation costs can be greatly reduced.
In a possible implementation, the disassembling the to-be-computed matrix to obtain a plurality of disassembled matrices includes: disassembling the to-be-computed matrix according to a preset rule to obtain the plurality of disassembled matrices, where the preset rule is used to disassemble a floating point number with first precision in the to-be-computed matrix into floating point numbers with second precision, and the second precision is lower than the first precision. For example, the first precision is FP32, FP64, or FP128, and the second precision is FP16. In the foregoing possible implementation, according to the preset rule, the to-be-computed matrix may be disassembled into a plurality of matrices including low-precision floating point numbers, for example, disassembled into a plurality of matrices including FP16 floating point numbers. In this way, high-precision matrix computing can be implemented based on a low-precision matrix computing unit, so that applicability of the matrix computing unit is improved.
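The preset rule is not spelled out at this point in the text. As a minimal sketch of one rule that would fit the description, an FP32 value can be peeled into several FP16 values whose sum reproduces it. The names `to_fp16`, `to_fp32`, and `split_fp32`, and the choice of three pieces, are assumptions made for illustration only and are not taken from this application.

```python
import struct


def to_fp16(x):
    """Round x to the nearest FP16 value (returned as a Python float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round x to the nearest FP32 value (returned as a Python float)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def split_fp32(x, pieces=3):
    """Hypothetical rule: repeatedly round the remainder to FP16.

    Each piece is an FP16 value; their sum approximates the FP32 input.
    Values beyond the FP16 range would need separate exponent handling.
    """
    remainder = to_fp32(x)
    parts = []
    for _ in range(pieces):
        piece = to_fp16(remainder)
        parts.append(piece)
        remainder -= piece  # Python floats are FP64, so this is exact here
    return parts


value = to_fp32(1.2345678)
parts = split_fp32(value)
print(parts, abs(sum(parts) - value))
```

In this sketch a single FP16 piece keeps only about 11 significant bits, while the later pieces carry the bits that the first rounding discarded. An input whose magnitude exceeds the FP16 range makes `struct.pack` raise `OverflowError`, which is one reason a separate exponent matrix is useful.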
In another possible implementation, the disassembling the to-be-computed matrix to obtain a plurality of disassembled matrices includes: disassembling the to-be-computed matrix including a floating point number with first precision into a plurality of matrices including floating point numbers with second precision and a plurality of exponent matrices, where the second precision is lower than the first precision. For example, the first precision is FP32, FP64, or FP128, and the second precision is FP16. In the foregoing possible implementation, the to-be-computed matrix including the floating point number with the first precision is disassembled into the plurality of matrices including the floating point numbers with the second precision and the plurality of exponent matrices, for example, disassembled into a plurality of matrices including FP16 floating point numbers and a plurality of exponent matrices. In this way, high-precision matrix computing can be implemented based on a low-precision matrix computing unit, so that applicability of the matrix computing unit is improved.
In another possible implementation, the disassembling the to-be-computed matrix including a floating point number with first precision into a plurality of matrices including floating point numbers with second precision and a plurality of exponent matrices includes: disassembling the to-be-computed matrix including the floating point number with the first precision into a plurality of column matrices by column; and disassembling each of the plurality of column matrices into one first submatrix including a floating point number with the second precision and one first exponent matrix, to obtain a plurality of first submatrices and a plurality of first exponent matrices, where a first submatrix and a first exponent matrix obtained by disassembling a same column matrix correspond to each other. In the foregoing possible implementation, a method for disassembling the to-be-computed matrix into the matrices including the floating point numbers with the second precision is provided, and may be used to effectively disassemble a first matrix when the first matrix and a second matrix are multiplied.
In another possible implementation, the disassembling the to-be-computed matrix including a floating point number with first precision into a plurality of matrices including floating point numbers with second precision and a plurality of exponent matrices includes: disassembling the to-be-computed matrix including the floating point number with the first precision into a plurality of row matrices by row; and disassembling each of the plurality of row matrices into one second submatrix including a floating point number with the second precision and one second exponent matrix, to obtain a plurality of second submatrices and a plurality of second exponent matrices, where a second submatrix and a second exponent matrix obtained by disassembling a same row matrix correspond to each other. In the foregoing possible implementation, a method for disassembling the to-be-computed matrix into the matrices including the floating point numbers with the second precision is provided, and may be used to effectively disassemble a second matrix when a first matrix and the second matrix are multiplied.
In another possible implementation, the to-be-computed matrix includes a first matrix and a second matrix, disassembled matrices corresponding to the first matrix include a plurality of first submatrices and a plurality of first exponent matrices, disassembled matrices corresponding to the second matrix include a plurality of second submatrices and a plurality of second exponent matrices, there are correspondences between the plurality of first submatrices, the plurality of first exponent matrices, the plurality of second submatrices, and the plurality of second exponent matrices, and when the matrix computing type is matrix multiplication, the performing computing processing on the plurality of disassembled matrices based on the matrix computing type includes: determining a first operation result of each first submatrix and a corresponding second submatrix based on a correspondence between the plurality of first submatrices and the plurality of second submatrices, to obtain a plurality of first operation results; determining a second operation result of each first exponent matrix and a corresponding second exponent matrix based on a correspondence between the plurality of first exponent matrices and the plurality of second exponent matrices, to obtain a plurality of second operation results; and determining a matrix operation result of the first matrix and the second matrix based on a correspondence between the plurality of first submatrices and the plurality of first exponent matrices, the plurality of first operation results, and the plurality of second operation results. In the foregoing possible implementation, a method for performing computing processing on the plurality of disassembled matrices based on the matrix computing type to obtain a matrix computing result is provided. In this way, high-precision matrix computing can be implemented based on a low-precision matrix computing unit, so that applicability of the matrix computing unit is improved.
In another possible implementation, the determining a matrix operation result of the first matrix and the second matrix based on a correspondence between the plurality of first submatrices and the plurality of first exponent matrices, the plurality of first operation results, and the plurality of second operation results includes: determining a third operation result based on each of the plurality of first operation results and a second operation result corresponding to the first operation result, to obtain a plurality of third operation results, where a first operation result of a first submatrix corresponds to a second operation result of a first exponent matrix corresponding to the first submatrix; and determining the matrix operation result of the first matrix and the second matrix based on the plurality of third operation results. In the foregoing possible implementation, a method for determining the matrix operation result of the first matrix and the second matrix is provided. In this way, high-precision matrix computing can be implemented based on a low-precision matrix computing unit, so that applicability of the matrix computing unit is improved.
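The column-by-column and row-by-row disassembly, together with the first, second, and third operation results, can be sketched as a sum of outer products. The sketch below assumes that each column of the first matrix and each row of the second matrix shares a single power-of-two exponent, and that accumulation uses a wider adder (FP64 here). The names `split_vector` and `matmul_disassembled` are hypothetical. It illustrates only how exponent handling extends the value range, not how the full first precision is recovered.

```python
import math
import struct


def to_fp16(x):
    """Round x to the nearest FP16 value (returned as a Python float)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def split_vector(values):
    """Return (FP16 entries, shared exponent) so that value ~ entry * 2**exponent."""
    largest = max(abs(v) for v in values)
    exponent = math.frexp(largest)[1]  # largest / 2**exponent lies in [0.5, 1)
    scaled = [to_fp16(math.ldexp(v, -exponent)) for v in values]
    return scaled, exponent


def matmul_disassembled(a, b):
    """Multiply a (m x n) by b (n x p) from per-column and per-row pieces."""
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0.0] * cols for _ in range(rows)]
    for k in range(inner):
        # Column k of the first matrix: first submatrix and its exponent.
        col, exp_a = split_vector([a[i][k] for i in range(rows)])
        # Row k of the second matrix: second submatrix and its exponent.
        row, exp_b = split_vector(b[k])
        # Second operation result: combined power-of-two scale.
        scale = math.ldexp(1.0, exp_a + exp_b)
        for i in range(rows):
            for j in range(cols):
                first = col[i] * row[j]        # first operation result
                result[i][j] += first * scale  # third result, accumulated
    return result


a = [[1.5, 131072.0], [-2.0, 65536.0]]  # 131072 exceeds the FP16 maximum
b = [[0.5, 4.0], [2.0 ** -20, 3.0]]
print(matmul_disassembled(a, b))
```

Because the exponents are stored separately, entries such as 131072 that cannot be represented in FP16 still pass through FP16 submatrices. If the magnitudes within one column or row differ too widely, the smaller scaled entries lose bits or underflow in this simplified scheme.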
In another possible implementation, two floating point numbers at a same location in a first exponent matrix and a corresponding second exponent matrix are X and Y, and a floating point number Z at the same location in a corresponding second operation result satisfies the following formula: Z=2^(expX+expY-Q×2), where Q is related to precision of floating point numbers included in the first matrix and the second matrix, and exp represents an exponential function. In the foregoing possible implementation, computing processing between disassembled exponent matrices is provided, so that it can be ensured that a matrix computing result corresponding to the to-be-computed matrix is obtained through computing based on the disassembled matrices. In this way, high-precision matrix computing can be implemented based on a low-precision matrix computing unit, so that applicability of the matrix computing unit is improved.
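A small worked instance of the formula may help. The sketch below reads expX and expY as the biased exponent fields of X and Y, and takes Q to be the exponent bias of the input format (127 for FP32). Both readings are assumptions, since the text states only that Q is related to the precision. The function names are hypothetical.

```python
import math
import struct

FP32_BIAS = 127  # assumed value of Q when the inputs are FP32


def biased_exponent_fp32(x):
    """Return the 8-bit biased exponent field of x stored as FP32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF


def exponent_product(x, y, q=FP32_BIAS):
    """Z = 2^(expX + expY - Q*2), with expX and expY the biased exponents."""
    exp_x = biased_exponent_fp32(x)
    exp_y = biased_exponent_fp32(y)
    return math.ldexp(1.0, exp_x + exp_y - q * 2)


# 8.0 = 1.0 * 2^3 has biased exponent 130; 0.25 = 1.0 * 2^-2 has 125.
print(exponent_product(8.0, 0.25))  # 2^(130 + 125 - 254) = 2^1
```

Under these assumptions, subtracting Q twice removes the bias carried by each of the two exponent fields, so Z is the power-of-two factor of the product of X and Y.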
In another possible implementation, the matrix computing unit is integrated into a general-purpose processor, and the obtaining a computing instruction includes: obtaining the computing instruction from a register; or the matrix computing unit is integrated outside a general-purpose processor, and the obtaining a computing instruction includes: obtaining the computing instruction from a memory. In the foregoing possible implementation, two manners of integrating the matrix computing unit and the general-purpose processor are provided, so that flexibility and diversity of integration of the matrix computing unit can be improved.
In another possible implementation, the precision of the floating point number in the to-be-computed matrix is FP32 or FP64, and the precision of the floating point number in the disassembled matrix is FP16. In the foregoing possible implementation, a to-be-computed matrix in which precision of a floating point number is FP32 or FP64 may be disassembled, to obtain a plurality of matrices in which precision of floating point numbers is FP16, and a matrix computing result corresponding to the to-be-computed matrix is obtained through computing based on the disassembled matrices. Therefore, applicability of the matrix computing unit is improved.
In another possible implementation, before the disassembling the to-be-computed matrix to obtain a plurality of disassembled matrices, the method further includes: determining that the precision of the floating point number in the to-be-computed matrix is higher than preset precision. In the foregoing possible implementation, when high-precision matrix computing is implemented based on a low-precision matrix computing unit, disassembling may be performed when it is determined that the precision of the floating point number in the to-be-computed matrix is higher than the preset precision, so that a matrix computing result corresponding to the to-be-computed matrix is obtained through computing based on the disassembled matrices, and then applicability of the matrix computing unit is improved.
According to a second aspect, a matrix computing apparatus is provided. The apparatus may be a matrix computing unit designed based on FP16, and the apparatus includes: an obtaining unit, configured to obtain a computing instruction, where the computing instruction includes a to-be-computed matrix and a matrix computing type, precision of a floating point number in the to-be-computed matrix may be higher than FP16, and the matrix computing type may include matrix multiplication, matrix addition, matrix multiplication-addition, and the like; a disassembling unit, configured to disassemble the to-be-computed matrix to obtain a plurality of disassembled matrices, where precision of a floating point number in the disassembled matrix is lower than the precision of the floating point number in the to-be-computed matrix, the precision of the floating point number in the disassembled matrix may be represented as second precision, the precision of the floating point number in the to-be-computed matrix is represented as first precision, and for example, the first precision is FP32, FP64, or FP128 and the second precision is FP16; and a computing unit, configured to perform computing processing on the plurality of disassembled matrices based on the matrix computing type, to obtain a matrix operation result.
In a possible implementation, the disassembling unit is further configured to disassemble the to-be-computed matrix according to a preset rule to obtain the plurality of disassembled matrices, where the preset rule is used to disassemble a floating point number with first precision in the to-be-computed matrix into floating point numbers with second precision, and the second precision is lower than the first precision. For example, the first precision is FP32, FP64, or FP128, and the second precision is FP16.
In a possible implementation, the disassembling unit is further configured to disassemble the to-be-computed matrix including a floating point number with first precision into a plurality of matrices including floating point numbers with second precision and a plurality of exponent matrices, where the second precision is lower than the first precision. For example, the first precision is FP32, FP64, or FP128, and the second precision is FP16.
In a possible implementation, the disassembling unit is further configured to: disassemble the to-be-computed matrix including the floating point number with the first precision into a plurality of column matrices by column; and disassemble each of the plurality of column matrices into one first submatrix including a floating point number with the second precision and one first exponent matrix, to obtain a plurality of first submatrices and a plurality of first exponent matrices, where a first submatrix and a first exponent matrix obtained by disassembling a same column matrix correspond to each other.
In a possible implementation, the disassembling unit is further configured to: disassemble the to-be-computed matrix including the floating point number with the first precision into a plurality of row matrices by row; and disassemble each of the plurality of row matrices into one second submatrix including a floating point number with the second precision and one second exponent matrix, to obtain a plurality of second submatrices and a plurality of second exponent matrices, where a second submatrix and a second exponent matrix obtained by disassembling a same row matrix correspond to each other.
In another possible implementation, the to-be-computed matrix includes a first matrix and a second matrix, disassembled matrices corresponding to the first matrix include a plurality of first submatrices and a plurality of first exponent matrices, disassembled matrices corresponding to the second matrix include a plurality of second submatrices and a plurality of second exponent matrices, there are correspondences between the plurality of first submatrices, the plurality of first exponent matrices, the plurality of second submatrices, and the plurality of second exponent matrices, and when the matrix computing type is matrix multiplication, the computing unit is further configured to: determine a first operation result of each first submatrix and a corresponding second submatrix based on a correspondence between the plurality of first submatrices and the plurality of second submatrices, to obtain a plurality of first operation results; determine a second operation result of each first exponent matrix and a corresponding second exponent matrix based on a correspondence between the plurality of first exponent matrices and the plurality of second exponent matrices, to obtain a plurality of second operation results; and determine a matrix operation result of the first matrix and the second matrix based on a correspondence between the plurality of first submatrices and the plurality of first exponent matrices, the plurality of first operation results, and the plurality of second operation results.
In another possible implementation, the computing unit is further configured to: determine a third operation result based on each of the plurality of first operation results and a second operation result corresponding to the first operation result, to obtain a plurality of third operation results, where a first operation result of a first submatrix corresponds to a second operation result of a first exponent matrix corresponding to the first submatrix; and determine the matrix operation result of the first matrix and the second matrix based on the plurality of third operation results.
In another possible implementation, two floating point numbers at a same location in a first exponent matrix and a corresponding second exponent matrix are X and Y, and a floating point number Z at the same location in a corresponding second operation result satisfies the following formula: Z=2^(expX+expY-Q×2), where Q is related to precision of floating point numbers included in the first matrix and the second matrix, and exp represents an exponential function.
In a possible implementation, the matrix computing unit is integrated into a general-purpose processor, and the obtaining unit is further configured to obtain the computing instruction from a register; or the matrix computing unit is integrated outside a general-purpose processor, and the obtaining unit is further configured to obtain the computing instruction from a memory.
In another possible implementation, the precision of the floating point number in the to-be-computed matrix is FP32 or FP64, and the precision of the floating point number in the disassembled matrix is FP16.
In another possible implementation, the disassembling unit is further configured to determine that the precision of the floating point number in the to-be-computed matrix is higher than preset precision.
According to a third aspect, a matrix computing circuit is provided. The matrix computing circuit is configured to perform operation steps of the matrix computing method provided in any one of the first aspect or the possible implementations of the first aspect.
According to a fourth aspect, a matrix computing system is provided. The system includes a processor and a matrix computing unit. The processor is configured to send a computing instruction to the matrix computing unit, and the matrix computing unit is configured to perform operation steps of the matrix computing method provided in any one of the first aspect or the possible implementations of the first aspect.
According to a fifth aspect, a chip is provided. The chip includes a processor, a matrix computing unit is integrated into the processor, and the matrix computing unit is configured to perform operation steps of the matrix computing method provided in any one of the first aspect or the possible implementations of the first aspect.
According to a sixth aspect, a matrix computing device is provided. The device includes the matrix computing system provided in the fourth aspect or the chip provided in the fifth aspect.
According to a seventh aspect, a readable storage medium is provided. The readable storage medium stores instructions, and when the instructions are run on a device, the device is enabled to perform operation steps of the matrix computing method provided in any one of the first aspect or the possible implementations of the first aspect.
According to an eighth aspect, a computer program product is provided. When the computer program product runs on a computer, the computer is enabled to perform operation steps of the matrix computing method provided in any one of the first aspect or the possible implementations of the first aspect.
It may be understood that any one of the apparatus for performing the matrix computing method, the computer storage medium, or the computer program product provided above is configured to perform the corresponding method provided above. Therefore, for beneficial effects that can be achieved by the apparatus, computer storage medium, or computer program product, refer to the beneficial effects in the corresponding method provided above. Details are not described herein again.
In this application, based on implementations according to the foregoing aspects, the implementations may be further combined to provide more implementations.
First, before the embodiments are described, types of floating point numbers in the embodiments are explained and described.
A floating point (FP) number is mainly used to represent a decimal, and usually includes three parts: a sign bit, an exponent bit, and a mantissa bit. The exponent bit may also be referred to as an exponent, and is referred to as the exponent below. The sign bit may be 1 bit, and the exponent and the mantissa bit each may be a plurality of bits. Usually, floating point numbers may have three formats: a half-precision floating point number, a single-precision floating point number, and a double-precision floating point number. Details are described as follows.
A half-precision floating point number is a binary data type used on a computer, occupies 16 bits (that is, occupies 2 bytes) in a computer memory, and may be referred to as FP16 for short. An absolute value range of a value that can be represented by the half-precision floating point number is approximately [6.10×10^-5, 6.55×10^4].
A single-precision floating point number is a binary data type used on a computer, occupies 32 bits (that is, occupies 4 bytes) in a computer memory, and may be referred to as FP32 for short. An absolute value range of a value that can be represented by the single-precision floating point number is approximately [1.18×10^-38, 3.40×10^38].
A double-precision floating point number is a binary data type used on a computer, occupies 64 bits (that is, occupies 8 bytes) in a computer memory, and may be referred to as FP64 for short. The double-precision floating point number can represent 15 or 16 significant decimal digits. An absolute value range of a value that can be represented by the double-precision floating point number is approximately [2.23×10^-308, 1.80×10^308].
Table 1 below shows a storage format of each of the foregoing three types of floating point numbers. In 16 bits occupied by FP16, a sign bit occupies 1 bit, an exponent occupies 5 bits, and a mantissa bit occupies 10 bits. In 32 bits occupied by FP32, a sign bit occupies 1 bit, an exponent occupies 8 bits, and a mantissa bit occupies 23 bits. In 64 bits occupied by FP64, a sign bit occupies 1 bit, an exponent occupies 11 bits, and a mantissa bit occupies 52 bits.
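The bit allocation described above can be checked with a short sketch that splits a stored value into its sign, exponent, and mantissa fields and derives the normal-number range from the field widths. The helper names `fields` and `normal_range` and the `FORMATS` table are introduced here for illustration only. The exponent bias used (2^(e-1) - 1 for an e-bit exponent) follows IEEE 754 and is not stated in the text.

```python
import struct

# name: (float pack code, same-width unsigned code, exponent bits, mantissa bits)
FORMATS = {
    "FP16": ("<e", "<H", 5, 10),
    "FP32": ("<f", "<I", 8, 23),
    "FP64": ("<d", "<Q", 11, 52),
}


def fields(value, fmt):
    """Return (sign, biased exponent, mantissa) of value stored in fmt."""
    float_code, int_code, exp_bits, man_bits = FORMATS[fmt]
    bits = struct.unpack(int_code, struct.pack(float_code, value))[0]
    sign = bits >> (exp_bits + man_bits)
    exponent = (bits >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = bits & ((1 << man_bits) - 1)
    return sign, exponent, mantissa


def normal_range(fmt):
    """Return (smallest normal, largest finite) magnitude for fmt."""
    _, _, exp_bits, man_bits = FORMATS[fmt]
    bias = (1 << (exp_bits - 1)) - 1
    smallest = 2.0 ** (1 - bias)
    largest = (2.0 - 2.0 ** -man_bits) * 2.0 ** bias
    return smallest, largest


for name in FORMATS:
    print(name, fields(-2.0, name), normal_range(name))
```

The ranges computed this way agree with the approximate absolute value ranges given above for FP16, FP32, and FP64.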
Further, in actual application, to represent a higher-precision floating point number, a format of a floating point number, a storage format in which more bits are occupied, and the like may be further extended, for example, a floating point number occupying 128 bits (which may be referred to as FP128 for short). This is not specifically limited in embodiments of this application.
A floating point number matrix may be a matrix in which a floating point number is used as an element. For example, a floating point number matrix with m rows and n columns includes m×n elements, and the m×n elements may be floating point numbers. Similar to the floating point numbers, there may also be floating point number matrices having different floating point number formats, for example, a floating point number matrix in an FP16 format, a floating point number matrix in an FP32 format, and a floating point number matrix in an FP64 format.
The memory 201 may be configured to store data, a software program, and a module, and mainly includes a program storage area and a data storage area. The program storage area may store an operating system, a software application required by at least one function, intermediate-layer software, and the like. The data storage area may store data created when the device is used, and the like. For example, the operating system may include a Linux operating system, a Unix operating system, a Windows operating system, or the like. The software application required by the at least one function may include an artificial intelligence-related application, a high performance computing (HPC)-related application, a deep learning-related application, a scientific computing-related application, or the like. The intermediate-layer software may include a linear algebra library function or the like. In a possible example, the memory 201 includes but is not limited to a static random access memory (static RAM, SRAM), a dynamic random access memory (dynamic RAM, DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a high-speed random access memory, or the like. Further, the memory 201 may include a nonvolatile memory, for example, at least one magnetic disk storage device, a flash memory device, or another nonvolatile solid-state storage device.
In addition, the processor 202 is configured to control and manage an operation of the computing device, for example, perform various functions of the computing device and process data by running or executing the software program and/or the module stored in the memory 201 and invoking the data stored in the memory 201. In a possible example, the processor 202 includes but is not limited to a central processing unit (CPU), a network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a logic circuit, or any combination thereof. The processor may implement or execute logical blocks, modules, and circuits in various examples described with reference to content disclosed in this application. Alternatively, the processor 202 may be a combination of processors implementing a computing function, for example, a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor.
The communication interface 203 is configured to implement communication between the computing device and an external device. The communication interface 203 may include an input interface and an output interface. The input interface may be configured to obtain floating point number matrices such as a first matrix and a second matrix in the following method embodiment. In some feasible embodiments, there may be only one input interface, or there may be a plurality of input interfaces. The output interface may be configured to output a matrix operation result in the following method embodiment. In some feasible embodiments, the matrix operation result may be directly output by the processor, or may be first stored in the memory and then output by the memory. In some other feasible embodiments, there may be only one output interface, or there may be a plurality of output interfaces.
The bus 204 may be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, or the like. The bus 204 may be classified into an address bus, a data bus, a control bus, or the like. For ease of representation, only one bold line is used for representation in
In this embodiment, the processor 202 may include a matrix computing unit. The matrix computing unit may be configured to support the processor in performing one or more steps in the following method embodiment. The matrix computing unit may be an ASIC, an FPGA, a logic circuit, or the like. Certainly, the matrix computing unit may alternatively be implemented by using software. This is not specifically limited in this embodiment of this application.
Further, the processor 202 may include one or more of other processing units such as a CPU, a GPU, or an NPU. As shown in
S301: Obtain a computing instruction, where the computing instruction includes a to-be-computed matrix and a matrix computing type.
The matrix computing unit may be a matrix computing unit designed based on a low-precision floating point number. For example, the matrix computing unit may be a matrix computing unit designed based on FP16. Optionally, in this embodiment of this application, a bit width of an adder included in the matrix computing unit may be extended, to implement addition of high-precision floating point numbers. For example, the adder is extended to a high-precision adder shown in
In addition, there may be one or more to-be-computed matrices, and precision of a floating point number in the one or more matrices may be represented as first precision. The first precision may be higher than FP16. For example, the first precision may be FP32, FP64, FP128, or the like. The matrix computing type may include matrix multiplication, matrix addition, matrix multiplication-addition, and the like. The matrix multiplication-addition is hybrid computing of matrix addition and matrix multiplication. For example, if a matrix A and a matrix B are used as an example, the matrix multiplication may be represented as A×B, and the matrix addition may be represented as A+B. If a matrix A, a matrix B, and a matrix C are used as an example, the matrix multiplication-addition may be represented as A×B+C. The matrix A, the matrix B, and the matrix C herein may be different matrices, or may be a same matrix. This is not specifically limited in this embodiment of this application.
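As a minimal sketch of how a computing instruction carrying the to-be-computed matrices and the matrix computing type might be modeled in software, the following uses plain lists of rows. The `ComputingType` enumeration, the `ComputingInstruction` class, and the `execute` function are hypothetical names, and the sketch performs the operations directly rather than through disassembled matrices.

```python
from dataclasses import dataclass
from enum import Enum


class ComputingType(Enum):
    MULTIPLY = "A×B"
    ADD = "A+B"
    MULTIPLY_ADD = "A×B+C"


@dataclass
class ComputingInstruction:
    matrices: list                 # one or more to-be-computed matrices
    computing_type: ComputingType  # which operation to perform on them


def matmul(a, b):
    """Plain matrix multiplication on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matadd(a, b):
    """Element-wise matrix addition on lists of rows."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def execute(instruction):
    """Dispatch on the matrix computing type carried by the instruction."""
    m = instruction.matrices
    if instruction.computing_type is ComputingType.MULTIPLY:
        return matmul(m[0], m[1])
    if instruction.computing_type is ComputingType.ADD:
        return matadd(m[0], m[1])
    return matadd(matmul(m[0], m[1]), m[2])  # A×B+C


a = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
ones = [[1.0, 1.0], [1.0, 1.0]]
print(execute(ComputingInstruction([a, identity, ones],
                                   ComputingType.MULTIPLY_ADD)))
```

The same matrix may appear more than once in `matrices`, which corresponds to the case in which the matrix A, the matrix B, and the matrix C are a same matrix.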
Specifically, when a processor of the computing device includes a CPU and the matrix computing unit, upper-layer software (for example, an AI application, an HPC application, and a scientific computing application) based on the processor may send a matrix computing request to the CPU by using a uniformly encapsulated linear algebraic function library. The request may carry the to-be-computed matrix and the matrix computing type. When the CPU receives the request, the CPU may schedule the matrix computing unit to execute a matrix computing task. In a possible implementation, if the matrix computing unit is integrated into the CPU, the CPU may store the to-be-computed matrix and the matrix computing type in the matrix computing request in a register, so that when executing the matrix computing task, the matrix computing unit may obtain the computing instruction from the register, to obtain the to-be-computed matrix and the matrix computing type. In another possible implementation, if the matrix computing unit is parallel to the CPU and integrated separately, the CPU may store the to-be-computed matrix and the matrix computing type in the matrix computing request in memory, so that when executing the matrix computing task, the matrix computing unit may obtain the computing instruction from the memory, to obtain the to-be-computed matrix and the matrix computing type.
S302: Disassemble the to-be-computed matrix to obtain a plurality of disassembled matrices, where precision of a floating point number in the disassembled matrix is lower than precision of a floating point number in the to-be-computed matrix.
The precision of the floating point number in the disassembled matrix may be represented as second precision, the precision of the floating point number in the to-be-computed matrix is represented as first precision, and the second precision is lower than the first precision. For example, the first precision is FP32, FP64, FP128, or the like, and the second precision is FP16.
In addition, the disassembling the to-be-computed matrix to obtain a plurality of disassembled matrices may include: disassembling the to-be-computed matrix according to a preset rule to obtain the plurality of disassembled matrices, where the preset rule is used to disassemble a floating point number with the first precision in the to-be-computed matrix into floating point numbers with the second precision, and the second precision is lower than the first precision. The preset rule may be set in advance. Alternatively, the disassembling the to-be-computed matrix to obtain a plurality of disassembled matrices may include: disassembling the to-be-computed matrix including a floating point number with the first precision into a plurality of matrices including floating point numbers with the second precision and a plurality of exponent matrices. The exponent matrix may be a matrix in which an exponent is used as an element.
Specifically, there may be one or more to-be-computed matrices, and each of the one or more matrices includes a floating point number with the first precision. Each matrix may be disassembled in one of the following manners, to obtain a plurality of corresponding disassembled matrices, and the disassembled matrix may include a floating point number with the second precision.
In a first manner, the to-be-computed matrix includes a first matrix, and a process of disassembling the first matrix into a plurality of disassembled matrices may be: disassembling the first matrix into a plurality of column matrices by column, where precision of a floating point number in each column matrix is the first precision; and disassembling each of the plurality of column matrices into one first submatrix and one first exponent matrix corresponding to the first submatrix, where precision of a floating point number in the first submatrix is the second precision. In this way, a plurality of first submatrices and a plurality of first exponent matrices are correspondingly obtained after the plurality of column matrices are disassembled.
For example, the first matrix is a matrix with M rows and K columns. Specifically, the first matrix may be disassembled into K column matrices, and each column matrix includes M floating point numbers with the first precision. Each of the K column matrices is disassembled into one first submatrix with M rows and W columns and floating point numbers with the second precision and one first exponent matrix with M rows and N columns, and the first submatrix corresponds to the first exponent matrix.
In a second manner, the to-be-computed matrix includes a second matrix, and a process of disassembling the second matrix into a plurality of disassembled matrices may be: disassembling the second matrix into a plurality of row matrices by row, where precision of a floating point number in each row matrix is the first precision; and disassembling each of the plurality of row matrices into one second submatrix and one second exponent matrix corresponding to the second submatrix, where precision of a floating point number in the second submatrix is the second precision. In this way, a plurality of second submatrices and a plurality of second exponent matrices are correspondingly obtained after the plurality of row matrices are disassembled.
For example, the second matrix is a matrix with K rows and N columns. Specifically, the second matrix may be disassembled into K row matrices, and each row matrix includes N floating point numbers with the first precision. Each of the K row matrices is disassembled into one second submatrix with W rows and N columns and floating point numbers with the second precision and one second exponent matrix with M rows and N columns, and the second submatrix corresponds to the second exponent matrix.
In the foregoing two manners, M, K, W, and N are all positive integers, specific values of M, K, and N may depend on a quantity of rows or a quantity of columns of the to-be-computed matrix, and a specific value of W is related to the first precision and the second precision. Optionally, the specific value of W may be set in advance. For example, when the first precision is FP32 and the second precision is FP16, W may be equal to 9; or when the first precision is FP64 and the second precision is FP16, W may be equal to 32.
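As a rough illustration of the idea of turning one column of first-precision numbers into a submatrix of FP16 values and a matching exponent matrix, the following sketch splits each value into a few FP16-representable pieces, each with its own power-of-two exponent. The helper names `to_fp16`, `split_value`, `split_column`, and `rebuild`, the greedy splitting rule, and the piece count of 3 are all assumptions of this sketch. The preset rule and the value of W used in the embodiment are not reproduced here.

```python
import math
import struct


def to_fp16(x):
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def split_value(x, pieces):
    """Split x into (fp16_piece, exponent) pairs so that
    x is approximately sum(piece * 2**exponent).
    Hypothetical rule, for illustration only."""
    result = []
    residual = x
    for _ in range(pieces):
        if residual == 0.0:
            result.append((0.0, 0))
            continue
        mantissa, exponent = math.frexp(residual)  # residual == mantissa * 2**exponent
        piece = to_fp16(mantissa)                  # keep only what FP16 can hold
        result.append((piece, exponent))
        residual -= math.ldexp(piece, exponent)    # remainder goes to the next piece
    return result


def split_column(column, pieces):
    """Turn an M-element column into an M x pieces submatrix of FP16 values
    and an M x pieces matrix of exponents."""
    rows = [split_value(x, pieces) for x in column]
    submatrix = [[piece for piece, _ in row] for row in rows]
    exponents = [[exponent for _, exponent in row] for row in rows]
    return submatrix, exponents


def rebuild(submatrix, exponents):
    """Recombine the pieces of each row into a single value."""
    return [sum(math.ldexp(p, e) for p, e in zip(ps, es))
            for ps, es in zip(submatrix, exponents)]
```

In this sketch, three pieces are enough to recover an FP32 value in the normal range exactly, because each piece carries up to 11 significant bits and an FP32 significand has 24. A row matrix could be handled the same way with the roles of rows and columns swapped.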
In a possible embodiment, when the to-be-computed matrix includes a matrix A and a matrix B, and the matrix computing type is matrix multiplication (that is, A×B), the matrix A may be disassembled in the foregoing first manner, and the matrix B may be disassembled in the foregoing second manner, to obtain a plurality of first submatrices and a plurality of first exponent matrices corresponding to the matrix A, and a plurality of second submatrices and a plurality of second exponent matrices corresponding to the matrix B. The plurality of first submatrices obtained by disassembling the matrix A one-to-one correspond to the plurality of second submatrices obtained by disassembling the matrix B, and the plurality of first exponent matrices also one-to-one correspond to the plurality of second exponent matrices. Optionally, a first submatrix obtained after a column matrix is disassembled corresponds to a second submatrix obtained after a row matrix corresponding to the column matrix is disassembled, a first submatrix and a first exponent matrix obtained after a same column matrix is disassembled correspond to each other, and a second submatrix and a second exponent matrix obtained after a same row matrix is disassembled correspond to each other.
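The pairing of the k-th column of the matrix A with the k-th row of the matrix B rests on a standard identity: A×B equals the sum, over k, of the product of column k of A and row k of B. The sketch below checks that identity on small matrices in ordinary double precision, without any precision splitting. The function names are illustrative only.

```python
def column_of(matrix, k):
    """Return column k of a matrix stored as a list of rows."""
    return [row[k] for row in matrix]


def outer(column, row):
    """Product of an M x 1 column matrix and a 1 x N row matrix."""
    return [[c * r for r in row] for c in column]


def add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def matmul_by_outer_products(a, b):
    """Compute A x B as the accumulated sum of K column-by-row products."""
    m, n = len(a), len(b[0])
    result = [[0.0] * n for _ in range(m)]
    for k in range(len(b)):
        result = add(result, outer(column_of(a, k), b[k]))
    return result


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]          # M = 2, K = 3
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]     # K = 3, N = 2
product = matmul_by_outer_products(a, b)
```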
For example, as shown in
In another possible embodiment, when the to-be-computed matrix includes a matrix A and a matrix B, and the matrix computing type is matrix addition (that is, A+B), the matrix addition may be considered as A×O+B (or O×A+B), where O is a unit diagonal matrix. In other words, multiplication is performed on the matrix A and the matrix O, and then addition is performed on an obtained matrix multiplication result and the matrix B. Correspondingly, the matrix computing unit may disassemble the matrix A and the matrix O according to the matrix multiplication disassembling process provided above, to obtain a plurality of first submatrices and a plurality of first exponent matrices corresponding to the matrix A, and a plurality of second submatrices and a plurality of second exponent matrices corresponding to the matrix O. Optionally, when obtaining the matrix A and the matrix B, the matrix computing unit may transparently transmit the matrix A and the matrix B. For example, the matrix A and the matrix B are transparently transmitted to an adder in the matrix computing unit (for example, the matrix A and the matrix B are directly transparently transmitted to a high-precision adder shown in
In still another possible embodiment, when the to-be-computed matrix includes a matrix A, a matrix B, and a matrix C, and the matrix computing type is matrix multiplication-addition (that is, A×B+C), the matrix A and the matrix B may be first disassembled in the matrix multiplication disassembling manner provided above, and the matrix C is not disassembled. Then, when a matrix multiplication result of A×B is obtained through computing, the matrix multiplication result is directly added to the matrix C, to obtain a matrix multiplication-addition result of A×B+C.
S303: Perform computing processing on the plurality of disassembled matrices based on the matrix computing type.
The matrix computing type includes matrix multiplication, matrix addition, and matrix multiplication-addition. Different matrix computing types correspond to different computing processing on the plurality of disassembled matrices. The following separately describes computing processing corresponding to different matrix computing types.
In a first case, the matrix computing type is matrix multiplication.
Specifically, the to-be-computed matrix may include a first matrix and a second matrix, disassembled matrices corresponding to the first matrix include a plurality of first submatrices and a plurality of first exponent matrices, disassembled matrices corresponding to the second matrix include a plurality of second submatrices and a plurality of second exponent matrices, and there are correspondences between the plurality of first submatrices, the plurality of first exponent matrices, the plurality of second submatrices, and the plurality of second exponent matrices. In this case, a process of performing computing processing on the plurality of disassembled matrices based on the matrix computing type may include: determining a first operation result of each first submatrix and a corresponding second submatrix based on a correspondence between the plurality of first submatrices and the plurality of second submatrices, to obtain a plurality of first operation results; determining a second operation result of each first exponent matrix and a corresponding second exponent matrix based on a correspondence between the plurality of first exponent matrices and the plurality of second exponent matrices, to obtain a plurality of second operation results; and determining a matrix operation result of the first matrix and the second matrix based on a correspondence between the plurality of first submatrices and the plurality of first exponent matrices, the plurality of first operation results, and the plurality of second operation results. 
Optionally, the determining a matrix operation result of the first matrix and the second matrix may be: determining a third operation result based on each of the plurality of first operation results and a second operation result corresponding to the first operation result, to obtain a plurality of third operation results, where a first operation result of a first submatrix corresponds to a second operation result of a first exponent matrix corresponding to the first submatrix; and determining the matrix operation result of the first matrix and the second matrix based on the plurality of third operation results. For example, an accumulated sum of the plurality of third operation results is determined as the matrix operation result of the first matrix and the second matrix.
In a possible embodiment, a specific process of determining a first operation result of a first submatrix and a corresponding second submatrix may be: determining a product of the first submatrix and the corresponding second submatrix as the first operation result. For example, if a first submatrix is a matrix a′, and a second submatrix corresponding to the matrix a′ is a matrix b′, a first operation result of the matrix a′ and the matrix b′ may be a product of the matrix a′ and the matrix b′. To be specific, according to a matrix multiplication operation rule, a row element in the matrix a′ is multiplied by a column element in a corresponding column in the matrix b′, to obtain an element in each row and each column in the first operation result. The first operation result herein may be a matrix including a floating point number with the second precision.
In a possible embodiment, a specific process of determining a second operation result of a first exponent matrix and a corresponding second exponent matrix may be: determining, based on two exponent elements at a same location in the first exponent matrix and the corresponding second exponent matrix, an element at the same location in the second operation result. The second operation result herein may be an exponent matrix. For example, an element in the first row and the first column in the second operation result is determined based on an element in the first row and the first column in the first exponent matrix and an element in the first row and the first column in the second exponent matrix.
Optionally, two floating point numbers at a same location in a first exponent matrix and a corresponding second exponent matrix are X and Y, and a floating point number Z at the same location in a corresponding second operation result satisfies the following formula (I):
In this formula, exp represents an exponential function, and Q is related to precision (that is, the first precision) of floating point numbers included in the first matrix and the second matrix. For example, Q may be 2^(P-1)-1, and P may be specifically a bit width of an exponent bit in the first precision. Optionally, when the first precision is FP32 and the second precision is FP16, P is equal to 8. In this case, Q may be equal to 127, that is, formula (I) is changed to formula (II). When the first precision is FP64 and the second precision is FP16, P is equal to 11. In this case, Q may be equal to 1023, that is, formula (I) is changed to formula (III).
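The relationship between P and Q can be checked with a few lines of arithmetic. The sketch below computes Q from the exponent bit width and, as a side illustration, reads the stored exponent field of an FP32 number using the standard IEEE 754 single-precision layout. The function names are illustrative, and the sketch does not reproduce formula (I).

```python
import struct


def exponent_bias(exponent_bits):
    """Q = 2**(P - 1) - 1 for an exponent field that is P bits wide."""
    return 2 ** (exponent_bits - 1) - 1


def fp32_biased_exponent(x):
    """Return the 8-bit stored exponent of x in IEEE 754 single precision."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF


q_fp32 = exponent_bias(8)    # exponent field of 8 bits
q_fp64 = exponent_bias(11)   # exponent field of 11 bits
```

Subtracting Q from a stored exponent gives the true power of two. For example, 8.0 is stored in FP32 with exponent field 130, and 130 - 127 = 3.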
In a possible embodiment, a specific process of determining the third operation result based on each of the plurality of first operation results and the second operation result corresponding to the first operation result may be: determining a product of elements at a same location in the first operation result and the corresponding second operation result as an element at the same location in the third operation result. The third operation result herein may be a matrix including a floating point number with the first precision. For example, a product of an element in the first row and the first column in the first operation result and an element in the first row and the first column in the corresponding second operation result is determined as an element in the first row and the first column in the third operation result.
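The three operation results can be sketched for a simplified case. The sketch assumes one FP16 piece per element, unbiased exponents, and a second operation result equal to 2 raised to the sum of the two exponents. These are assumptions made for illustration and are not a reproduction of formula (I) or of the submatrix shapes in the embodiment. With a single piece per element, the result only reaches FP16 mantissa precision.

```python
import math
import struct


def to_fp16(x):
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def split(x):
    """Return (fp16_mantissa, exponent) with x approximately mantissa * 2**exponent."""
    if x == 0.0:
        return 0.0, 0
    mantissa, exponent = math.frexp(x)
    return to_fp16(mantissa), exponent


def multiply_via_exponents(a, b):
    """Multiply A (M x K) by B (K x N) using mantissa and exponent parts."""
    m_rows, k_dim, n_cols = len(a), len(b), len(b[0])
    total = [[0.0] * n_cols for _ in range(m_rows)]
    for k in range(k_dim):
        column = [split(a[m][k]) for m in range(m_rows)]   # from column k of A
        row = [split(b[k][n]) for n in range(n_cols)]      # from row k of B
        for m, (ma, ea) in enumerate(column):
            for n, (mb, eb) in enumerate(row):
                first = ma * mb                      # first operation result (mantissa product)
                second = math.ldexp(1.0, ea + eb)    # second operation result (assumed 2**(ea + eb))
                third = first * second               # third operation result (element-wise product)
                total[m][n] += third                 # accumulated sum over k
    return total
```

Inputs whose mantissas fit in FP16, such as 1.5 or 2**20, are reproduced exactly. Other inputs are rounded in `split`, which is why a real scheme would use more than one piece per element.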
In a second case, the matrix computing type is matrix addition.
When the to-be-computed matrix includes a matrix A and a matrix B, addition of the matrix A and the matrix B (that is, A+B) may be considered as A×O+B. After the matrix A and the matrix O are disassembled in S302, matrices obtained after the matrix A and the matrix O are disassembled may be processed according to the computing processing process provided in the first case, to obtain a matrix multiplication result of A×O, and then the matrix multiplication result is added to the matrix B, to obtain a final matrix operation result.
In a third case, the matrix computing type is matrix multiplication-addition.
When the to-be-computed matrix includes a matrix A, a matrix B, and a matrix C, the matrix computing type is matrix multiplication-addition (that is, A×B+C). After the matrix A and the matrix B are disassembled in S302, matrices obtained after the matrix A and the matrix B are disassembled may be processed according to the computing processing process provided in the first case, to obtain a matrix multiplication result of A×B, and then the matrix multiplication result is added to the matrix C, to obtain a final matrix operation result.
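The second and third cases both reduce to a multiplication followed by an addition. The sketch below uses a plain double-precision `matmul` as a stand-in for the disassembled multiplication of the first case, and treats the matrix O as an identity matrix. All function names are illustrative.

```python
def matmul(a, b):
    """Plain matrix product; stands in for the first-case multiplication."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def identity(n):
    """n x n matrix with ones on the diagonal and zeros elsewhere (the matrix O)."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def add(x, y):
    """Element-wise sum of two matrices of the same shape."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def multiply_add(a, b, c):
    """Third case: A x B + C."""
    return add(matmul(a, b), c)


def matrix_add(a, b):
    """Second case: A + B computed as A x O + B."""
    return multiply_add(a, identity(len(a[0])), b)
```

Expressing addition as A×O+B lets one multiply-add data path serve all three matrix computing types.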
Further, before S302, the method may further include S302a.
S302a: Determine that the precision of the floating point number in the to-be-computed matrix is higher than preset precision.
The preset precision may be precision that can be used by the matrix computing unit to implement matrix computing without matrix disassembling, and the preset precision may be set in advance. For example, if the matrix computing unit is a matrix computing unit designed based on FP16, the preset precision may be set to FP16.
Specifically, when obtaining the to-be-computed matrix, the matrix disassembling unit may determine whether the precision of the floating point number included in the to-be-computed matrix is higher than the preset precision. If a determining result is yes, it is determined that the precision of the floating point number in the to-be-computed matrix is higher than the preset precision. In this case, when performing matrix computing, the matrix computing unit may perform matrix disassembling and computing processing in the manners described in S302 and S303, to obtain a final matrix operation result. If a determining result is no, it is determined that the precision of the floating point number in the to-be-computed matrix is lower than or equal to the preset precision. In this case, when performing matrix computing, the matrix computing unit may directly perform computing processing without performing matrix disassembling, to obtain a final matrix operation result.
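The check in S302a amounts to a simple branch. The following sketch compares precisions by total bit width, which is an assumption of the sketch, and returns the steps that would follow. The names `PRECISION_BITS`, `needs_disassembly`, and `plan` are hypothetical.

```python
# Total bit width of each format; used here as a stand-in for "precision".
PRECISION_BITS = {"FP16": 16, "FP32": 32, "FP64": 64, "FP128": 128}


def needs_disassembly(matrix_precision, preset_precision="FP16"):
    """True if the matrix precision is higher than the preset precision."""
    return PRECISION_BITS[matrix_precision] > PRECISION_BITS[preset_precision]


def plan(matrix_precision, preset_precision="FP16"):
    """Return the processing steps chosen by the precision check."""
    if needs_disassembly(matrix_precision, preset_precision):
        return ["disassemble (S302)", "compute on disassembled matrices (S303)"]
    return ["compute directly"]
```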
For ease of understanding, the following describes the technical solutions in this application by using an example in which the to-be-computed matrix includes a first matrix and a second matrix, the matrix computing type is matrix multiplication, and the second precision is FP16.
Based on the matrix computing unit shown in
S01: Upper-layer software such as an AI application and an HPC application generates a matrix computing request, and sends the matrix computing request to a CPU, where the request may carry a first matrix, a second matrix, and a matrix computing type.
S02: When the CPU receives the request, the CPU may deliver a matrix computing task to the matrix computing unit, that is, schedule the matrix computing unit to execute the matrix computing task. The first matrix, the second matrix, and the matrix computing type may be stored in internal storage space of the CPU in a form of a computing instruction.
S03: The matrix computing unit obtains the first matrix, the second matrix, and the matrix computing type; determines whether the first matrix and the second matrix are FP16 matrices, that is, determines whether precision of floating point numbers included in the first matrix and the second matrix is higher than FP16; and performs S04 if the first matrix and the second matrix are not FP16 matrices; or performs S06 according to a matrix computing method in the conventional technology if the first matrix and the second matrix are FP16 matrices. The cache 401 may be configured to cache the first matrix, and the cache 402 may be configured to cache the second matrix.
S04: Disassemble the first matrix and the second matrix, that is, disassemble the first matrix in the first manner described in S302 to obtain a plurality of first submatrices and a plurality of first exponent matrices, and disassemble the second matrix in the second manner to obtain a plurality of second submatrices and a plurality of second exponent matrices. The matrix disassembling unit 403 may be configured to perform a step of disassembling the first matrix and the second matrix. Further, as shown in
S05: Perform exponent matrix computing, that is, perform computing on the plurality of first exponent matrices and the plurality of second exponent matrices, to obtain a plurality of second operation results. The exponent multiplier 404 may be configured to perform exponent matrix computing.
S06: Perform FP16 matrix computing, that is, perform computing on the plurality of first submatrices and the plurality of second submatrices, to obtain a plurality of first operation results, and perform computing on the plurality of first operation results and the plurality of second operation results, to obtain a plurality of third operation results. The FP16 computing unit 405 may be configured to perform FP16 matrix computing.
S07: Perform high-precision matrix addition, that is, compute an accumulated sum of the plurality of third operation results, to obtain a final matrix computing result.
S08: Output the final matrix computing result. The high-precision adder 406 may be configured to perform the steps of performing high-precision matrix addition and outputting the final matrix computing result. In actual application, matrix addition in S07 may alternatively be performed by using an FP16 adder. This is not specifically limited in this embodiment of this application.
In embodiments of this application, when the matrix computing unit obtains the to-be-computed matrix and the matrix computing type, if the precision of the floating point number included in the to-be-computed matrix is higher than FP16, the matrix computing unit may disassemble the to-be-computed matrix into a plurality of matrices including low-precision floating point numbers, for example, a plurality of matrices including FP16 floating point numbers, and perform, based on the matrix computing type, computing processing on the plurality of matrices including the FP16 floating point numbers, to obtain a matrix operation result corresponding to the to-be-computed matrix. In this way, high-precision matrix computing can be implemented based on a low-precision matrix computing unit, so that applicability of the matrix computing unit is improved. In addition, upper-layer software applications such as an AI application and an HPC application are unaware of the specific matrix computing process, so that software adaptation costs can be greatly reduced.
The matrix computing method provided in the embodiments is mainly described above from a perspective of a computing device. It may be understood that, to implement the foregoing functions, the computing device includes a corresponding hardware structure and/or software module for performing the functions. A person skilled in the art should be easily aware that, with reference to the example units and algorithm steps described in the embodiments disclosed in this specification, this application can be implemented in a form of hardware or a combination of hardware and computer software. Whether a function is implemented by hardware or hardware driven by computer software depends on specific applications and design constraints of the technical solutions. A person skilled in the art may use different methods to implement the described functions for each particular application, but it should not be considered that the implementation goes beyond the scope of this application.
In embodiments of this application, a matrix computing apparatus may be divided into function modules based on the foregoing method examples. For example, each function module may be obtained through division based on each corresponding function, or two or more functions may be integrated into one processing module. The integrated module may be implemented in a form of hardware, or may be implemented in a form of a software function module. It should be noted that division into the modules in embodiments of this application is an example, and is merely logical function division. In actual implementation, there may be another division manner.
When each function module is obtained through division based on each corresponding function,
As shown in
The foregoing describes the matrix computing apparatus in embodiments of this application from a perspective of a modular functional entity, and the following describes a matrix computing apparatus in embodiments of this application from a perspective of hardware processing.
In an embodiment of this application, a matrix computing circuit is provided. The matrix computing circuit may be configured to perform one or more steps in S301 to S303 or one or more steps in S03 to S08 in the foregoing method embodiment. In actual application, the matrix computing circuit may be an ASIC, an FPGA, a logic circuit, or the like.
In another embodiment of this application, a matrix computing system or a chip is further provided. A structure of the system or the chip may be shown in
In still another embodiment of this application, a matrix computing device is provided. A structure of the device may be shown in
The processor 202 may be configured to perform one or more steps in S301 to S303 or one or more steps in S01 to S08 in the foregoing method embodiment. In some feasible embodiments, the processor 202 may include a matrix computing unit. The matrix computing unit may be configured to support the processor in performing one or more steps in the foregoing method embodiment. In actual application, the matrix computing unit may be an ASIC, an FPGA, a logic circuit, or the like. Certainly, the matrix computing unit may alternatively be implemented by using software. This is not specifically limited in this embodiment of this application.
It should be noted that components of the matrix computing circuit, the matrix computing system, the matrix computing device, and the like provided in embodiments of this application are separately configured to implement functions of corresponding steps in the foregoing method embodiments. Because the steps have been described in detail in the foregoing method embodiments, details are not described herein again.
All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof. When software is used to implement embodiments, the foregoing embodiments may be implemented completely or partially in a form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded or executed on a computer, the procedures or functions according to embodiments of this application are all or partially generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or other programmable apparatuses. The computer instructions may be stored in a computer-readable storage medium or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid state drive (SSD).
The foregoing descriptions are only specific implementations of this application, but are not intended to limit the protection scope of this application. Any variation or replacement within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
Number | Date | Country | Kind |
---|---|---|---|
202010966997.1 | Sep 2020 | CN | national |
This application is a continuation of International Application No. PCT/CN2021/106961, filed on Jul. 17, 2021, which claims priority to Chinese Patent Application No. 202010966997.1, filed on Sep. 15, 2020. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/CN2021/106961 | Jul 2021 | WO |
Child | 18183394 | | US |