This application is a continuation of International Application No. PCT/CN2021/141000, filed on Dec. 23, 2021, which claims priority to Chinese Patent Application No. 202110181498.6, filed on Feb. 8, 2021, and Chinese Patent Application No. 202011617575.X, filed on Dec. 30, 2020. All of the aforementioned patent applications are hereby incorporated by reference in their entireties.
This application relates to the computer field, and in particular, to a matrix calculation apparatus, method, system, circuit, and device, and a chip.
Matrix calculation is an important computing type in different application scenarios such as artificial intelligence, scientific computing, and graphics computing. A matrix is a set of element values arranged according to a rectangular array. The element values in the matrix may include two values: a zero-element value and a non-zero-element value. When there are a large quantity of zero-element values in a matrix, to save storage space, only non-zero-element values in the matrix may be stored, that is, the matrix is compressed, and a matrix in a compressed format is stored.
In a current technology, a frequently-used method for calculating a matrix in a compressed format is as follows: First, the matrix in the compressed format needs to be decompressed. To be specific, the matrix in the compressed format is converted into a matrix in an uncompressed format. Then, matrix calculation is performed on the matrix in the uncompressed format. In the matrix calculation process, because the matrix in the compressed format needs to be decompressed, and data obtained through decompression occupies large memory space, a calculation speed of the matrix is limited by access bandwidth of memory. When the access bandwidth of the memory is fixed, the calculation speed of the matrix cannot be improved, and consequently, calculation efficiency is low.
This application provides a matrix calculation apparatus, method, system, circuit, and device, and a chip, to directly calculate a matrix in a compressed format without decompressing the matrix in the compressed format, thereby improving calculation efficiency of the matrix in the compressed format.
According to a first aspect, an embodiment of this application provides a matrix calculation apparatus, where the matrix calculation apparatus includes a vector outer product processing engine and an accumulator. The vector outer product processing engine is configured to calculate vector outer products of N first column vectors and N first row vectors, to obtain N intermediate result matrices, where the first column vector includes first element values and row coordinates of the first element values, the first row vector includes second element values and column coordinates of the second element values, the intermediate result matrix includes third element values and position coordinates of the third element values, the position coordinates include the row coordinates of the first element values and the column coordinates of the second element values, the N first column vectors are obtained by converting a first matrix in a compressed format, the N first row vectors are obtained by converting a second matrix in the compressed format, and N is an integer greater than or equal to 1. The accumulator is configured to accumulate, based on indexes of the position coordinates of the third element values, third element values with same position coordinates in the N intermediate result matrices, to obtain a result matrix. In this embodiment of this application, the matrix calculation apparatus calculates the first matrix and the second matrix in the compressed format based on the vector outer products. In the calculation process, row coordinates of element values in the first column vector are reserved, column coordinates of element values in the first row vector are reserved, and then third element values with same position coordinates are accumulated based on indexes of position coordinates, to obtain the result matrix obtained by calculating the two matrices in the compressed format.
Compared with a conventional method in which a matrix in a compressed format needs to be first decompressed and then matrix calculation is performed on a matrix obtained through decompression, the matrix calculation apparatus provided in embodiments of this application can effectively improve calculation efficiency of a matrix in a compressed format.
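As an illustrative, non-limiting sketch of the first aspect, the following Python code models a column vector as a list of (row coordinate, element value) pairs and a row vector as a list of (column coordinate, element value) pairs. The function names `outer_product` and `accumulate` are hypothetical, and a dictionary keyed by position coordinates stands in for the accumulator; the source does not prescribe any of these choices.

```python
from collections import defaultdict

def outer_product(col_vec, row_vec):
    """Return one intermediate result matrix as a list of ((row, col), value).

    col_vec: list of (row_coordinate, value) for the non-zero entries of a column.
    row_vec: list of (col_coordinate, value) for the non-zero entries of a row.
    """
    return [((r, c), a * b) for r, a in col_vec for c, b in row_vec]

def accumulate(intermediates):
    """Add up values that share the same position coordinates."""
    result = defaultdict(int)
    for inter in intermediates:
        for pos, value in inter:
            result[pos] += value
    return dict(result)

# Example (assumed data): A = [[1, 2], [0, 3]], B = [[4, 0], [5, 6]], N = 2.
cols_of_a = [[(0, 1)], [(0, 2), (1, 3)]]   # columns of A with row coordinates
rows_of_b = [[(0, 4)], [(0, 5), (1, 6)]]   # rows of B with column coordinates

result = accumulate(outer_product(c, r) for c, r in zip(cols_of_a, rows_of_b))
```

In this sketch the zero entries of A and B never take part in a multiplication, which is the property that allows the compressed operands to be used without decompression.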
In an optional implementation, the N intermediate result matrices include at least a first intermediate result matrix and a second intermediate result matrix, position coordinates of third element values in the first intermediate result matrix are first position coordinates, and position coordinates of third element values in the second intermediate result matrix are second position coordinates. The accumulator is configured to: write, in a generation sequence of the N intermediate result matrices, the third element values in the first intermediate result matrix into corresponding positions in a cache based on the first position coordinates; and then read, based on the second position coordinates of the third element values in the second intermediate result matrix, cached values that are at positions corresponding to the second position coordinates in the cache, and accumulate the third element values in the second intermediate result matrix and the cached values, to obtain a result matrix in an uncompressed format. In the foregoing optional implementation, the matrix calculation apparatus may output a matrix in an uncompressed format based on different application scenarios, thereby expanding an applicable scenario of the matrix calculation apparatus.
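The cache-based accumulation described above may be sketched as follows. This is a simplified model under stated assumptions: the cache is a zero-initialized two-dimensional Python list, so writing the first intermediate result matrix is treated as adding to zero, and the name `accumulate_dense` is hypothetical.

```python
def accumulate_dense(intermediates, n_rows, n_cols):
    """Accumulate intermediate result matrices into an uncompressed result.

    intermediates: iterable of lists of ((row, col), value), in generation order.
    """
    cache = [[0] * n_cols for _ in range(n_rows)]
    for inter in intermediates:
        for (r, c), value in inter:
            cached = cache[r][c]          # read the cached value at this position
            cache[r][c] = cached + value  # accumulate and write back
    return cache

# Assumed example data: two intermediate result matrices of a 2x2 product.
first = [((0, 0), 4)]
second = [((0, 0), 10), ((0, 1), 12), ((1, 0), 15), ((1, 1), 18)]
dense = accumulate_dense([first, second], n_rows=2, n_cols=2)
```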
In an optional implementation, the matrix calculation apparatus further includes a matrix compression unit. The matrix compression unit is configured to compress the result matrix in the uncompressed format to obtain a result matrix in a compressed format. In the foregoing optional implementation, the matrix calculation apparatus may output a matrix in a compressed format based on different application scenarios, thereby expanding an applicable scenario of the matrix calculation apparatus. In addition, the matrix calculation apparatus compresses the result matrix, and outputs the matrix in the compressed format, thereby saving transmission resources or facilitating a next calculation operation.
In an optional implementation, the accumulator is further specifically configured to: sort third element values in the N intermediate result matrices based on position coordinates of the third element values, for example, sort the third element values based on row coordinates of the third element values, or sort the third element values based on column coordinates of the third element values; and then compare position coordinates in N intermediate result matrices obtained through sorting, add up third element values with same position coordinates, and delete position coordinates of a zero-element value, to obtain a result matrix in a compressed format. In the foregoing optional implementation, the matrix in the compressed format may be directly obtained, and the matrix in the compressed format output by the matrix calculation apparatus may be used in some application scenarios in which a matrix in a compressed format is required for subsequent calculation. In addition, because the matrix calculation apparatus outputs the matrix in the compressed format, transmission resources required for subsequently transmitting the matrix can be reduced.
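The sort-based accumulation may be illustrated by the following sketch, which sorts by (row coordinate, column coordinate), merges neighboring entries with same position coordinates, and drops entries whose accumulated value is zero. The function name `accumulate_sorted` and the example data are assumptions made for illustration only.

```python
def accumulate_sorted(intermediates):
    """Return a compressed result as a sorted list of ((row, col), value)."""
    # Sort all third element values by their position coordinates.
    entries = sorted(
        (e for inter in intermediates for e in inter),
        key=lambda e: e[0],
    )
    merged = []
    for pos, value in entries:
        if merged and merged[-1][0] == pos:
            # Same position coordinates as the previous entry: add up.
            merged[-1] = (pos, merged[-1][1] + value)
        else:
            merged.append((pos, value))
    # Delete position coordinates whose accumulated value is zero.
    return [(pos, value) for pos, value in merged if value != 0]

# Assumed example data; the two values at (1, 1) cancel out.
first = [((0, 0), 4), ((1, 1), -18)]
second = [((0, 0), 10), ((0, 1), 12), ((1, 1), 18)]
compressed = accumulate_sorted([first, second])
```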
In an optional implementation, the matrix calculation apparatus further includes a format conversion unit. The format conversion unit is configured to: obtain the first matrix and the second matrix; convert the first matrix into the N first column vectors and reserve the row coordinates of the first element values in the first column vectors; and convert the second matrix into the N first row vectors and reserve the column coordinates of the second element values in the first row vectors. In this way, the matrix calculation apparatus can calculate the two matrices in the compressed format based on the vector outer products.
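One possible format conversion is sketched below. It assumes, purely for illustration, that the first matrix arrives in a column-oriented compressed format (column offsets, row coordinates, element values) and the second matrix in a row-oriented compressed format (row offsets, column coordinates, element values); the source does not fix which compressed format each operand uses, and the function names are hypothetical.

```python
def csc_to_column_vectors(col_offset, row_coord, values):
    """Split a column-compressed matrix into column vectors that keep row coordinates."""
    return [
        [(row_coord[k], values[k]) for k in range(col_offset[j], col_offset[j + 1])]
        for j in range(len(col_offset) - 1)
    ]

def csr_to_row_vectors(row_offset, col_coord, values):
    """Split a row-compressed matrix into row vectors that keep column coordinates."""
    return [
        [(col_coord[k], values[k]) for k in range(row_offset[i], row_offset[i + 1])]
        for i in range(len(row_offset) - 1)
    ]

# Assumed example: A = [[1, 2], [0, 3]] (column-compressed), B = [[4, 0], [5, 6]] (row-compressed).
column_vectors = csc_to_column_vectors([0, 1, 3], [0, 0, 1], [1, 2, 3])
row_vectors = csr_to_row_vectors([0, 1, 3], [0, 0, 1], [4, 5, 6])
```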
In an optional implementation, the matrix calculation apparatus further includes the format conversion unit. The format conversion unit is further configured to: obtain a fifth matrix and a sixth matrix, perform format conversion on the fifth matrix to obtain the first matrix, and perform format conversion on the sixth matrix to obtain the second matrix, where at least one of the fifth matrix and the sixth matrix is a matrix in an uncompressed format. In the foregoing optional implementation, the matrix calculation apparatus may receive the matrix in the uncompressed format, and then convert the matrix in the uncompressed format into the matrix in the compressed format, so that the matrix calculation apparatus can support calculation of matrices in a plurality of formats.
In an optional implementation, the matrix calculation apparatus further includes the format conversion unit. The format conversion unit is further configured to split the first column vector into X second column vectors, and split the first row vector into X second row vectors, where precision of element values included in the second column vector and the second row vector is second precision, precision of element values included in the first column vector and the first row vector is first precision, the first precision is higher than the second precision, and X is an integer greater than or equal to 2. The vector outer product processing engine is further configured to calculate vector outer products of the X second column vectors and the X second row vectors to obtain X² fourth matrices, where the fourth matrix includes fourth element values and position coordinates of the fourth element values, the position coordinates of the fourth element values include the row coordinates of the first element values and the column coordinates of the second element values, and precision of the fourth element values is the first precision. Then, the accumulator is further configured to accumulate, based on indexes of the position coordinates of the fourth element values, fourth element values with same position coordinates in the X² fourth matrices, to obtain the intermediate result matrix, where precision of the third element values in the intermediate result matrix is the first precision. In the foregoing optional implementation, the matrix calculation apparatus may implement high-precision matrix calculation based on a low-precision matrix calculation apparatus, thereby improving applicability of the matrix calculation apparatus.
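The precision-splitting idea may be sketched for X = 2 as follows. The source does not specify how a vector is split; this sketch assumes that each integer element value is split into a high part and a low 4-bit part with scale factors 16 and 1, so that the X² = 4 partial outer products, once scaled and accumulated, reproduce the full-precision outer product. All names are hypothetical.

```python
def split(vec, bits=4):
    """Split a vector of (coordinate, value) into X = 2 lower-precision vectors.

    Returns [(high_part, scale), (low_part, scale)] with value == hi * 2**bits + lo.
    """
    mask = (1 << bits) - 1
    hi = [(c, v >> bits) for c, v in vec]
    lo = [(c, v & mask) for c, v in vec]
    return [(hi, 1 << bits), (lo, 1)]

def outer_product_split(col_vec, row_vec, bits=4):
    """Accumulate the X*X = 4 partial outer products by position coordinates."""
    acc = {}
    for col_part, col_scale in split(col_vec, bits):
        for row_part, row_scale in split(row_vec, bits):
            # One low-precision outer product (a "fourth matrix" in the text).
            for r, a in col_part:
                for c, b in row_part:
                    acc[(r, c)] = acc.get((r, c), 0) + a * b * col_scale * row_scale
    return acc

# Assumed example data.
col = [(0, 100), (2, 37)]
row = [(1, 25), (3, 200)]
inter = outer_product_split(col, row)
direct = {(r, c): a * b for r, a in col for c, b in row}
```

Here `inter` and `direct` hold the same values, which illustrates that the low-precision partial products carry enough information to recover the high-precision intermediate result matrix.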
According to a second aspect, an embodiment of this application provides a matrix calculation method, where the method is applied to a matrix calculation apparatus, and the method includes: first, obtaining a first calculation instruction, where the first calculation instruction includes N first column vectors and N first row vectors; then, calculating vector outer products of the N first column vectors and the N first row vectors, to obtain N intermediate result matrices, where the first column vector includes first element values and row coordinates of the first element values, the first row vector includes second element values and column coordinates of the second element values, the intermediate result matrix includes third element values and position coordinates of the third element values, the position coordinates include the row coordinates of the first element values and the column coordinates of the second element values, the N first column vectors are obtained by converting a first matrix in a compressed format, the N first row vectors are obtained by converting a second matrix in the compressed format, and N is an integer greater than or equal to 1; and finally, accumulating, based on indexes of the position coordinates of the third element values, third element values with same position coordinates in the N intermediate result matrices, to obtain a result matrix. In this embodiment of this application, the first matrix and the second matrix are calculated based on the vector outer products of the N first column vectors and the N first row vectors. In the calculation process, row coordinates of element values in the first column vector are reserved, column coordinates of element values in the first row vector are reserved, and then third element values with same position coordinates are accumulated based on indexes of position coordinates, to obtain the result matrix obtained by calculating the two matrices in the compressed format.
Compared with a conventional method in which a matrix in a compressed format needs to be first decompressed and then matrix calculation is performed on a matrix obtained through decompression, the matrix calculation method provided in embodiments of this application can effectively improve calculation efficiency of a matrix in a compressed format.
In an optional implementation, the N intermediate result matrices include at least a first intermediate result matrix and a second intermediate result matrix, position coordinates of third element values in the first intermediate result matrix are first position coordinates, and position coordinates of third element values in the second intermediate result matrix are second position coordinates. In the method, the accumulating, based on indexes of the position coordinates of the third element values, third element values with same position coordinates in the N intermediate result matrices, to obtain a result matrix may specifically include: first, writing, in a generation sequence of the N intermediate result matrices, the third element values in the first intermediate result matrix into corresponding positions in a cache based on the first position coordinates; and then reading, based on the second position coordinates of the third element values in the second intermediate result matrix, cached values that are at positions corresponding to the second position coordinates in the cache, and accumulating the third element values in the second intermediate result matrix and the cached values, to obtain a result matrix in an uncompressed format. In the foregoing optional implementation, the matrix calculation apparatus may output a matrix in an uncompressed format based on different application scenarios, thereby expanding an applicable scenario of the matrix calculation apparatus.
In an optional implementation, the method further includes: compressing the result matrix in the uncompressed format to obtain a result matrix in a compressed format. In the foregoing optional implementation, the matrix calculation apparatus may output a matrix in a compressed format based on different application scenarios, thereby expanding an applicable scenario of the matrix calculation apparatus. In addition, the matrix calculation apparatus compresses the result matrix, and outputs the matrix in the compressed format, thereby saving transmission resources or facilitating a next calculation operation.
In an optional implementation, in the method, the accumulating, based on indexes of the position coordinates of the third element values, third element values with same position coordinates in the N intermediate result matrices, to obtain a result matrix may specifically include: first, sorting third element values in the N intermediate result matrices based on position coordinates of the third element values, for example, sorting the third element values based on row coordinates of the third element values, or sorting the third element values based on column coordinates of the third element values; and then comparing position coordinates in N intermediate result matrices obtained through sorting, adding up third element values with same position coordinates, and deleting position coordinates of a zero-element value, to obtain a result matrix in a compressed format. In the foregoing optional implementation, the matrix in the compressed format may be directly obtained, and the matrix in the compressed format output by the matrix calculation apparatus may be used in some application scenarios in which a matrix in a compressed format is required for subsequent calculation. In addition, because the matrix calculation apparatus outputs the matrix in the compressed format, transmission resources required for subsequently transmitting the matrix can be reduced.
In an optional implementation, before the obtaining a first calculation instruction, the method further includes: obtaining a second calculation instruction, where the second calculation instruction includes the first matrix and the second matrix; converting the first matrix into the N first column vectors and reserving the row coordinates of the first element values in the first column vectors; and converting the second matrix into the N first row vectors and reserving the column coordinates of the second element values in the first row vectors. In this way, the matrix calculation apparatus can calculate the two matrices in the compressed format based on the vector outer products.
In an optional implementation, before the obtaining a second calculation instruction, the method further includes: obtaining a third calculation instruction, where the third calculation instruction includes a fifth matrix and a sixth matrix, and at least one of the fifth matrix and the sixth matrix is a matrix in an uncompressed format; and then performing format conversion on the fifth matrix to obtain the first matrix in the compressed format, and performing format conversion on the sixth matrix to obtain the second matrix. In the foregoing optional implementation, the matrix calculation apparatus may receive the matrix in the uncompressed format, and then convert the matrix in the uncompressed format into the matrix in the compressed format, so that the matrix calculation apparatus can support calculation of matrices in a plurality of formats.
In an optional implementation, the first column vector may be first split into X second column vectors, and the first row vector is split into X second row vectors, where precision of element values included in the second column vector and the second row vector is second precision, precision of element values included in the first column vector and the first row vector is first precision, the first precision is higher than the second precision, and X is an integer greater than or equal to 2. In the method, the calculating vector outer products of the N first column vectors and the N first row vectors, to obtain N intermediate result matrices may include: calculating vector outer products of the X second column vectors and the X second row vectors to obtain X² fourth matrices, where the fourth matrix includes fourth element values and position coordinates of the fourth element values, the position coordinates of the fourth element values include the row coordinates of the first element values and the column coordinates of the second element values, and precision of the fourth element values is the first precision; and then, accumulating, based on indexes of the position coordinates of the fourth element values, fourth element values with same position coordinates in the X² fourth matrices, to obtain the intermediate result matrix, where precision of the third element values in the intermediate result matrix is the first precision. In the foregoing optional implementation, the matrix calculation apparatus may implement high-precision matrix calculation based on a low-precision matrix calculation apparatus, thereby improving applicability of the matrix calculation apparatus.
According to a third aspect, a matrix calculation circuit is provided. The matrix calculation circuit is configured to perform operation steps of the matrix calculation method according to any one of the second aspect or the possible implementations of the second aspect.
According to a fourth aspect, a matrix calculation system is provided. The system includes a processor and a matrix calculation apparatus. The processor is configured to send a calculation instruction to the matrix calculation apparatus. The matrix calculation apparatus is configured to perform operation steps of the matrix calculation method according to any one of the second aspect or the possible implementations of the second aspect.
According to a fifth aspect, a chip is provided. The chip includes a processor, a matrix calculation apparatus is integrated into the processor, and the matrix calculation apparatus is configured to perform operation steps of the matrix calculation method according to any one of the second aspect or the possible implementations of the second aspect.
According to a sixth aspect, a matrix calculation device is provided. The device includes the matrix calculation system according to the fourth aspect or the chip according to the fifth aspect.
According to a seventh aspect, a readable storage medium is provided. The readable storage medium stores instructions. When the instructions are run on a device, the device is enabled to perform operation steps of the matrix calculation method according to any one of the second aspect or the possible implementations of the second aspect.
According to an eighth aspect, a computer program product is provided. When the computer program product runs on a computer, the computer is enabled to perform operation steps of the matrix calculation method according to any one of the second aspect or the possible implementations of the second aspect.
It may be understood that any apparatus, computer storage medium, or computer program product for implementing the matrix calculation method provided above is configured to perform the corresponding method provided above. Therefore, for beneficial effects that can be achieved by the apparatus, the computer storage medium, or the computer program product, refer to beneficial effects of the corresponding method provided above. Details are not described herein again.
In this application, based on the implementations according to the foregoing aspects, the implementations may be further combined to provide more implementations.
The following describes the technical solutions in embodiments of this application with reference to the accompanying drawings in embodiments of this application. In the specification, claims, and accompanying drawings of this application, the terms “first”, “second”, and the like are intended to distinguish between similar objects but do not necessarily indicate a specific order or sequence.
To better understand this application, related terms in this application are first described.
Matrix (matrix): A matrix whose dimension is m×n is a rectangular array obtained by arranging elements of m rows (rows) and n columns (columns). For example, a matrix A is shown in Formula (1), and a matrix B is shown in Formula (2):
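Formula (1) and Formula (2) are not reproduced in this text. Assuming the conventional notation in which a_ij and b_ij denote the elements in row i and column j of the matrix A and the matrix B respectively, they would presumably take the following form:

```latex
A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix} \qquad \text{Formula (1)}

B = \begin{bmatrix}
b_{11} & b_{12} & \cdots & b_{1n} \\
b_{21} & b_{22} & \cdots & b_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
b_{m1} & b_{m2} & \cdots & b_{mn}
\end{bmatrix} \qquad \text{Formula (2)}
```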
Matrix addition and subtraction: Mutual addition and subtraction may be performed between matrices of a same dimension, and specifically, addition and subtraction are performed on elements at all positions. For example, both the matrix A and the matrix B are matrices in an m×n dimension, the matrix A and the matrix B are added to obtain a matrix C, and the matrix C is shown in Formula (3):
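Formula (3) is not reproduced in this text. Under the element-wise definition given above, and assuming the element notation a_ij and b_ij for the matrix A and the matrix B, it would presumably read:

```latex
C = A + B = \begin{bmatrix}
a_{11} + b_{11} & \cdots & a_{1n} + b_{1n} \\
\vdots & \ddots & \vdots \\
a_{m1} + b_{m1} & \cdots & a_{mn} + b_{mn}
\end{bmatrix} \qquad \text{Formula (3)}
```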
Matrix multiplication: Two matrices can be multiplied only when a quantity of columns (columns) of the first matrix is equal to a quantity of rows (rows) of the second matrix. For example, the matrix A is a matrix whose dimension is m×n, the matrix B is a matrix whose dimension is n×p, a product of the matrix A and the matrix B is an m×p matrix, and an element in the m×p matrix is shown in Formula (4):
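Formula (4) is not reproduced in this text. Under the standard definition of a matrix product, and assuming the element notation a_ik and b_kj for the matrix A and the matrix B, the element in row i and column j of the product would be:

```latex
[AB]_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}
          = a_{i1} b_{1j} + a_{i2} b_{2j} + \cdots + a_{in} b_{nj}
\qquad \text{Formula (4)}
```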
where 1≤i≤m and 1≤j≤p.
A row vector (row vector) is a matrix whose dimension is 1×m, where m is a positive integer. For example, a row vector is shown in Formula (5):
$$X = [x_1 \;\; x_2 \;\; \cdots \;\; x_m] \qquad \text{Formula (5)}$$
A column vector (column vector) is a matrix whose dimension is m×1, and m is a positive integer. For example, a column vector is shown in Formula (6):
A vector outer product (vector outer product) is a tensor product of two vectors. The tensor product is a matrix. For example, a column vector U whose dimension is m×1 and a row vector V whose dimension is 1×n are given. An outer product U×V of the vector U and the vector V is defined as a matrix D whose dimension is m×n, and the matrix D is shown in Formula (7):
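Formula (7) is not reproduced in this text. Assuming that the column vector U has elements u_1 to u_m and the row vector V has elements v_1 to v_n (element names chosen here for illustration), the outer product would presumably take the following form:

```latex
D = U \times V =
\begin{bmatrix} u_1 \\ u_2 \\ \vdots \\ u_m \end{bmatrix}
\begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}
=
\begin{bmatrix}
u_1 v_1 & u_1 v_2 & \cdots & u_1 v_n \\
u_2 v_1 & u_2 v_2 & \cdots & u_2 v_n \\
\vdots  & \vdots  & \ddots & \vdots  \\
u_m v_1 & u_m v_2 & \cdots & u_m v_n
\end{bmatrix} \qquad \text{Formula (7)}
```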
Matrix in a compressed format: When a matrix includes zero-element values and non-zero-element values, usually, the non-zero-element values in the matrix may be stored in a specific format and the zero-element values are not stored, to save storage space. In this process, the matrix is compressed, and a matrix obtained after compression and storage is referred to as a matrix in a compressed format. A method for compressing a matrix includes but is not limited to coordinate (coordinate, COO), compressed sparse row (compressed sparse row, CSR), compressed sparse column (compressed sparse column, CSC), and the like.
The following separately uses examples to describe the three compression methods: COO, CSR, and CSC.
COO: A matrix is represented by a triplet. The triplet includes three values: a row number, a column number, and an element value. The row number and the column number are for identifying a position of the element value. For example, the triplet is (row number, column number, element value), or the triplet is (element value, row number, column number). Specifically, an arrangement sequence of the three values in the triplet is not limited. For example, refer to
Optionally, the matrix Y in the compressed format may be represented in Formula (8):
Row coordinate=([0, 0, 1, 1, 2, 2, 2, 3, 3])
Column coordinate=([0, 1, 1, 2, 0, 2, 3, 1, 3])
Element value=([1, 2, 3, 4, 5, 6, 7, 8, 9]) Formula (8)
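The COO representation may be illustrated with a short sketch. The dense matrix `Y` below is not taken from the figure, which is not reproduced in this text; it is inferred from the three arrays of Formula (8), and the function name `to_coo` is hypothetical.

```python
# Dense 4x4 matrix inferred from the coordinates and values in Formula (8).
Y = [
    [1, 2, 0, 0],
    [0, 3, 4, 0],
    [5, 0, 6, 7],
    [0, 8, 0, 9],
]

def to_coo(matrix):
    """Return (row coordinates, column coordinates, element values) in row-major order."""
    rows, cols, vals = [], [], []
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            if value != 0:  # only non-zero-element values are stored
                rows.append(i)
                cols.append(j)
                vals.append(value)
    return rows, cols, vals

coo_rows, coo_cols, coo_vals = to_coo(Y)
```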
CSR: A matrix is represented by three types of data: an element value, a column number, and a row offset. The element value and the column number in CSR are represented in manners similar to the element value and the column number in the COO method described above. CSR is different from the COO method in that the row offset indicates a start offset position of the 1st element in a row in all element values. Refer to
The matrix Y in the compressed format may be represented in Formula (9):
Row offset=([0, 2, 4, 7, 9])
Column coordinate=([0, 1, 1, 2, 0, 2, 3, 1, 3])
Element value=([1, 2, 3, 4, 5, 6, 7, 8, 9]) Formula (9)
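The CSR representation may be sketched in the same way. The dense matrix `Y` is again inferred from the arrays in the text rather than taken from the figure, and `to_csr` is a hypothetical name. The last entry of the row offset equals the total quantity of stored element values.

```python
# Dense 4x4 matrix inferred from the arrays in the text (assumption).
Y = [
    [1, 2, 0, 0],
    [0, 3, 4, 0],
    [5, 0, 6, 7],
    [0, 8, 0, 9],
]

def to_csr(matrix):
    """Return (row offsets, column coordinates, element values)."""
    row_offset, cols, vals = [0], [], []
    for row in matrix:
        for j, value in enumerate(row):
            if value != 0:
                cols.append(j)
                vals.append(value)
        # Offset at which the next row starts among all stored element values.
        row_offset.append(len(vals))
    return row_offset, cols, vals

csr_offset, csr_cols, csr_vals = to_csr(Y)
```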
CSC: A matrix is represented by three types of data: an element value, a row number, and a column offset. The element value and the row number in CSC are represented in manners similar to the element value and the row number in the COO method described above. CSC is different from the COO method in that the column offset indicates a start offset position of the 1st element in a column in all element values. Refer to
The matrix Y in the compressed format may be represented in Formula (10):
Column offset=([0, 2, 5, 7, 9])
Row coordinate=([0, 2, 0, 1, 3, 1, 2, 2, 3])
Element value=([1, 5, 2, 3, 8, 4, 6, 7, 9]) Formula (10)
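The CSC representation may be sketched as follows. The dense matrix `Y` is inferred from the COO and CSR arrays given earlier rather than taken from the figure, and `to_csc` is a hypothetical name. In this sketch the row coordinates are emitted in the same column-by-column order as the element values, so the k-th row coordinate always belongs to the k-th element value.

```python
# Dense 4x4 matrix inferred from the arrays in the text (assumption).
Y = [
    [1, 2, 0, 0],
    [0, 3, 4, 0],
    [5, 0, 6, 7],
    [0, 8, 0, 9],
]

def to_csc(matrix):
    """Return (column offsets, row coordinates, element values)."""
    col_offset, rows, vals = [0], [], []
    for j in range(len(matrix[0])):
        for i, row in enumerate(matrix):
            if row[j] != 0:
                rows.append(i)
                vals.append(row[j])
        # Offset at which the next column starts among all stored element values.
        col_offset.append(len(vals))
    return col_offset, rows, vals

csc_offset, csc_rows, csc_vals = to_csc(Y)
```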
It can be learned from the descriptions of the foregoing three matrix compression methods that, each element value in the matrix in the COO compressed format has a corresponding row coordinate (row number) and a corresponding column coordinate (column number), each element value in the matrix in the CSR compressed format has a corresponding column coordinate, and each element value in the matrix in the CSC compressed format has a corresponding row coordinate.
Matrix in an uncompressed format: Refer to the matrix Y shown in
Numeric type: In the computer field, numeric types include an integer and a float. The integer mainly indicates an integer number, and the float mainly indicates a decimal. Integer precision includes int2, int4, int8, int16, int32, and the like, where int denotes the integer type, the number after int represents a quantity of bits (bits) used for the binary representation, and each bit (bit) is 0 or 1. For example, int4 occupies 4 bits (0000 to 1111), and a converted decimal value range is (−8, 7), that is, (−2³, 2³−1). Similarly, a decimal value range for int8 is (−2⁷, 2⁷−1). A unit of a computer storage capacity is 1 byte, namely, 8 bits. Therefore, correspondingly, int8 occupies 1 byte and int16 occupies 2 bytes. A decimal value range for int16 is (−32768, 32767). int32 occupies 4 bytes, and a converted decimal value range is (−2147483648, 2147483647).
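The value ranges above follow from a signed two's complement interpretation, which is assumed here because the text does not name the encoding. A short sketch with the hypothetical helper `int_range`:

```python
def int_range(bits):
    """Return (minimum, maximum) of a signed two's complement integer with `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

# Ranges for the integer precisions named in the text.
ranges = {f"int{b}": int_range(b) for b in (2, 4, 8, 16, 32)}
```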
An integer matrix is a matrix that uses integer numbers as elements. For example, an integer matrix with m rows and n columns includes m×n elements, and the m×n elements are integer numbers. The integer number may be of precision such as int2, int4, int8, int16, or int32. For example, the integer matrix may alternatively include matrices in different integer formats, for example, a matrix including integer numbers in an int8 format, a matrix including integer numbers in an int16 format, and a matrix including integer numbers in an int32 format.
A floating-point (floating-point, FP) mainly represents a decimal number, and usually includes three parts: a sign (sign) bit, an exponent (exponent) field, and a mantissa (mantissa) field. The sign bit may be 1 bit (bit), and the exponent field and the mantissa field may be a plurality of bits (bits). The floating-point may usually include a plurality of formats (format), such as a half-precision floating-point, a single-precision floating-point, and a double-precision floating-point in the IEEE 754 standard. The half-precision floating-point (half-precision floating-point) occupies 16 bits (that is, occupies 2 bytes) in a computer memory, and may also be referred to as FP16 for short. An absolute value range of a value that can be represented by the half-precision floating-point is approximately [6.10×10⁻⁵, 6.55×10⁴]. The single-precision floating-point (single-precision floating-point) occupies 32 bits (that is, occupies 4 bytes) in the computer memory, and may also be referred to as FP32 for short. An absolute value range of a value that can be represented by the single-precision floating-point is approximately [1.18×10⁻³⁸, 3.40×10³⁸]. The double-precision floating-point (double-precision floating-point) occupies 64 bits (that is, occupies 8 bytes) in the computer memory, and may also be referred to as FP64 for short. The double-precision floating-point may represent a 15-digit or 16-digit decimal number. An absolute value range of a value that can be represented by the double-precision floating-point is approximately [2.23×10⁻³⁰⁸, 1.80×10³⁰⁸].
Table 1 below shows a storage format of the foregoing three types of floating-points. In the 16 bits occupied by FP16, a sign bit occupies 1 bit, an exponent occupies 5 bits, and a mantissa field occupies 10 bits. In the 32 bits occupied by FP32, a sign bit occupies 1 bit, an exponent occupies 8 bits, and a mantissa field occupies 23 bits. In the 64 bits occupied by FP64, a sign bit occupies 1 bit, an exponent occupies 11 bits, and a mantissa field occupies 52 bits.
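The field widths above can be checked with the following sketch, which uses the Python standard library `struct` module to encode a value in each IEEE 754 format and then extracts the sign bit, the exponent field, and the mantissa field. The names `LAYOUT`, `UINT`, and `fields` are hypothetical and serve only this illustration.

```python
import struct

# format name -> (struct code, exponent bits, mantissa bits), as described in the text
LAYOUT = {"FP16": ("e", 5, 10), "FP32": ("f", 8, 23), "FP64": ("d", 11, 52)}
# struct code of the unsigned integer with the same width as each float code
UINT = {"e": "H", "f": "I", "d": "Q"}

def fields(value, fmt):
    """Return (sign, exponent field, mantissa field) of `value` encoded in `fmt`."""
    code, exp_bits, man_bits = LAYOUT[fmt]
    # Reinterpret the encoded float as an unsigned integer of the same width.
    (raw,) = struct.unpack("<" + UINT[code], struct.pack("<" + code, value))
    sign = raw >> (exp_bits + man_bits)
    exponent = (raw >> man_bits) & ((1 << exp_bits) - 1)
    mantissa = raw & ((1 << man_bits) - 1)
    return sign, exponent, mantissa

fp16_fields = fields(-1.5, "FP16")
fp32_fields = fields(-1.5, "FP32")
fp64_fields = fields(-1.5, "FP64")
```

For the value −1.5, the sign bit is 1 in all three formats, the exponent field holds the format's bias (15, 127, or 1023), and only the most significant mantissa bit is set.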
A floating-point matrix may be a matrix that uses floating-points as elements. For example, a floating-point matrix with m rows and n columns includes m×n elements, and the m×n elements may be floating-points. Similar to the floating-point, the floating-point matrix may also include matrices in different floating-point formats, for example, a matrix including floating-points in an FP16 format, a matrix including floating-points in an FP32 format, and a matrix including floating-points in an FP64 format.
The memory 201 may be configured to store data, a software program, and a module, and mainly includes a program storage area and a data storage area. The program storage area may store an operating system, a software application required by at least one function, middleware, and the like. The data storage area may store data created when the device is used, and the like. For example, the operating system may include a Linux operating system, a Unix operating system, a Windows operating system, or the like. The software application required by the at least one function may include an application related to artificial intelligence (artificial intelligence), an application related to high-performance computing (high-performance computing, HPC), an application related to deep learning (deep learning), an application related to scientific computing, or the like. The middleware may include a linear algebra library function or the like. In a possible example, the memory 201 includes but is not limited to a static random access memory (static RAM, SRAM), a dynamic random access memory (dynamic RAM, DRAM), a synchronous dynamic random access memory (synchronous DRAM, SDRAM), a high-speed random access memory, or the like. Further, the memory 201 may include a non-volatile memory, for example, at least one magnetic disk storage device, a flash storage device, or another non-volatile solid-state storage device.
In addition, the processor 202 is configured to control and manage an operation of the computing device, for example, perform various functions of the computing device and process data by running or executing a software program and/or a module that are/is stored in the memory 201 and invoking data stored in the memory 201. In a possible example, the processor 202 includes but is not limited to a central processing unit (central processing unit, CPU), a network processing unit (network processing unit, NPU), a graphics processing unit (graphics processing unit, GPU), an application-specific integrated circuit (application-specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA) or another programmable logic device, a transistor logic device, a logic circuit, or any combination thereof. The processor may implement or execute various example logical blocks, modules, and circuits described with reference to content disclosed in this application. Alternatively, the processor 202 may be a combination for implementing a computation function, for example, a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor.
The communication interface 203 is configured to implement communication between the computing device and an external device. The communication interface 203 may include an input interface and an output interface. The input interface may be configured to obtain a first matrix and a second matrix in a compressed format in the following embodiments. In some feasible embodiments, there may be only one input interface, or there may be a plurality of input interfaces. The output interface may be configured to output a result matrix in the following embodiments. In some feasible embodiments, the result matrix may be directly output by the processor, or may be first stored in the memory and then output from the memory. In some other feasible embodiments, there may be only one output interface, or there may be a plurality of output interfaces.
The bus 204 may be a peripheral component interconnect express (peripheral component interconnect express, PCIe) bus, an extended industry standard architecture (extended industry standard architecture, EISA) bus, or the like. The bus 204 may be classified into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is used to represent the bus in
In this embodiment, the processor 202 may include a matrix calculation apparatus. The matrix calculation apparatus may be an ASIC, an FPGA, a logic circuit, or the like. Certainly, the apparatus may alternatively be implemented by using software. This is not specifically limited in this embodiment of this application. The matrix calculation apparatus may be configured to compute a matrix related to artificial intelligence, scientific computing, graphics computing, and the like.
Further, the processor 202 may alternatively include one or more of other processing units such as a CPU, a GPU, or an NPU. As shown in
In this embodiment of this application, the matrix calculation apparatus can calculate a matrix in a compressed format. When performing multiplication calculation on two matrices in a compressed format, the matrix calculation apparatus first obtains N first column vectors converted from one of the matrices, where the first column vector includes first element values and row coordinates of the first element values, and obtains N first row vectors converted from the other matrix, where the first row vector includes second element values and column coordinates of the second element values. The matrix calculation apparatus calculates vector outer products of the N first column vectors and the N first row vectors, to obtain N intermediate result matrices. The intermediate result matrix includes third element values and position coordinates corresponding to the third element values, and the position coordinates include the row coordinates of the first element values and the column coordinates of the second element values. The matrix calculation apparatus can accumulate, based on indexes of the position coordinates, third element values with same position coordinates in the N intermediate result matrices, and then obtain a result matrix. In this embodiment of this application, the matrix calculation apparatus calculates a first matrix and a second matrix based on vector outer products. In the calculation process, row coordinates of element values in the first column vector are reserved, column coordinates of element values in the first row vector are reserved, and then third element values with same position coordinates are accumulated based on indexes of position coordinates, to obtain the result matrix obtained by calculating the two matrices in the compressed format.
Compared with a conventional method in which a matrix in a compressed format needs to be first decompressed and then matrix calculation is performed on a matrix obtained through decompression, the matrix calculation apparatus provided in embodiments of this application can effectively improve calculation efficiency of a matrix in a compressed format.
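The overall flow described above (conversion into column vectors and row vectors, vector outer products, and accumulation by position coordinates) can be sketched in software as follows. This is a minimal sketch under stated assumptions, not the hardware implementation: it assumes that each matrix is given as a list of (row, column, value) triplets, and the function name `spgemm_outer` is hypothetical.

```python
from collections import defaultdict

def spgemm_outer(a_coo, b_coo, n):
    """Multiply two compressed matrices given as (row, col, value) triplets.

    n is the shared dimension (columns of the first matrix, rows of the second).
    Neither input is expanded into an uncompressed matrix.
    """
    # Column k of the first matrix keeps (row coordinate, value);
    # row k of the second matrix keeps (column coordinate, value).
    a_cols = [[] for _ in range(n)]
    b_rows = [[] for _ in range(n)]
    for r, c, v in a_coo:
        a_cols[c].append((r, v))
    for r, c, v in b_coo:
        b_rows[r].append((c, v))

    result = defaultdict(int)
    for k in range(n):                      # one outer product per k
        for i, a in a_cols[k]:
            for j, b in b_rows[k]:
                result[(i, j)] += a * b     # accumulate by position coordinates
    return dict(result)

a = [(0, 0, 1), (0, 1, 2), (1, 1, 3)]       # [[1, 2], [0, 3]]
b = [(0, 0, 4), (1, 0, 5), (1, 1, 6)]       # [[4, 0], [5, 6]]
print(spgemm_outer(a, b, 2))
```

Only non-zero element values are ever multiplied, so the amount of work depends on the number of stored element values rather than on the full matrix dimensions.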
An embodiment of this application provides a matrix calculation apparatus. Refer to
The format conversion unit 405 is configured to: convert a first matrix in a compressed format into N first column vectors, and convert a second matrix in the compressed format into N first row vectors, where the first column vector includes first element values and row coordinates of the first element values, and the first row vector includes second element values and column coordinates of the second element values.
Optionally, the following describes a structure of the vector outer product processing engine 401. Refer to
Each PE has two functions. One function is that each PE 4011 receives two groups of input data, and outputs one group of output data based on the two groups of input data. The output data includes a third element value and position coordinates of the third element value, where the third element value is obtained by performing a multiplication operation on the first element value and the second element value. The position coordinates are obtained by combining the row coordinate of the first element value and the column coordinate of the second element value. For example, a PE in row 0 column 0 in a PE array is used as an example. One group of input data is a first element value (for example, a0) in a first column vector and a row coordinate (for example, i0) corresponding to the first element value. The other group of input data is a second element value (for example, b0) in a first row vector and a column coordinate (for example, j0) corresponding to the second element value. The MAC operation subunit 40110 is configured to: receive the first element value and the second element value, perform multiplication calculation on the first element value and the second element value, and then output a product (that is, a third element value) of the first element value and the second element value. The coordinate combination subunit 40111 is configured to: receive the row coordinate of the first element value and the column coordinate of the second element value, and combine the row coordinate and the column coordinate to obtain the position coordinates. Data output by a PE includes a third element value and position coordinates corresponding to the third element value.
The other function of the PE is that the PE needs to transmit a first element value and a row coordinate corresponding to the first element value to a next PE in a row direction, and the PE needs to transmit a second element value and a column coordinate corresponding to the second element value to a next PE in a column direction. For example, after the 1st clock cycle, the 1st PE (that is, the PE in row 0 column 0) in the PE array transmits a0 and i0 to a next PE in the row direction (for example, a PE in row 0 column 1), and transmits b0 and j0 to a next PE in the column direction (that is, a PE in row 1 column 0). Optionally, a data transmission mode of each PE may be: transmitting data to a next PE in each clock cycle; or in a clock cycle, the 1st PE transmits a0 and i0 to a next PE in the row direction (that is, a PE in row 0 column 1) until the data is transmitted to the last PE in the row (that is, row 0), and transmits b0 and j0 in the column direction until the data is transmitted to the last PE in the column (that is, column 0). In this example, a quantity of levels of PEs through which data is transmitted in each clock cycle may be designed based on an actual requirement, and is not specifically limited.
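The two PE functions (computing a product with combined coordinates, and forwarding operands to neighboring PEs) can be modeled with a small cycle-level simulation. The sketch below assumes the mode in which data moves one PE per clock cycle, and it further assumes that the operand for PE row r and the operand for PE column c enter the array at cycle r and cycle c respectively. This skewed injection is an assumption made for the illustration and is not stated in the description. The function name `simulate_pe_array` is hypothetical.

```python
def simulate_pe_array(col_vec, row_vec):
    """Simulate a skewed PE array; returns {cycle: [(product, (row, col)), ...]}.

    col_vec: [(first element value, row coordinate)], one entry per PE row.
    row_vec: [(second element value, column coordinate)], one entry per PE column.
    """
    rows, cols = len(col_vec), len(row_vec)
    left = [[None] * cols for _ in range(rows)]  # operand travelling along the row
    top = [[None] * cols for _ in range(rows)]   # operand travelling along the column
    schedule = {}
    for cycle in range(rows + cols - 1):
        # Each PE forwards its operands one step: right along rows, down along columns.
        for r in range(rows):
            for c in range(cols - 1, 0, -1):
                left[r][c] = left[r][c - 1]
        for c in range(cols):
            for r in range(rows - 1, 0, -1):
                top[r][c] = top[r - 1][c]
        # Assumed skew: PE row r and PE column c receive their operand at cycle r and c.
        for r in range(rows):
            left[r][0] = col_vec[r] if cycle == r else None
        for c in range(cols):
            top[0][c] = row_vec[c] if cycle == c else None
        # A PE produces output when both operands are present.
        for r in range(rows):
            for c in range(cols):
                if left[r][c] is not None and top[r][c] is not None:
                    (a, i), (b, j) = left[r][c], top[r][c]
                    schedule.setdefault(cycle, []).append((a * b, (i, j)))
    return schedule

print(simulate_pe_array([(1, 0), (2, 1)], [(3, 1), (5, 0)]))
```

Under these assumptions, the PE in row r column c produces its output at cycle r + c, so an array with R rows and C columns finishes one outer product in R + C - 1 cycles.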
Optionally, the following describes a structure of the MAC operation subunit in the PE. Refer to
Optionally, the following describes a structure of the accumulator 402 in the matrix calculation apparatus. Refer to
Optionally, refer to
The following describes a specific function of the format conversion unit 405. The format conversion unit 405 is configured to: convert the first matrix in the compressed format into the N first column vectors, and convert the second matrix in the compressed format into the N first row vectors, where the first column vector includes the first element values and the row coordinates of the first element values, and the first row vector includes the second element values and the column coordinates of the second element values. A dimension of the first matrix is M×N, and a dimension of the second matrix is N×K, where M, N, and K are integers greater than or equal to 1.
Refer to
The format conversion unit 405 is configured to split the matrix A by column into four first column vectors, where the four first column vectors are A0, A1, A2, and A3. The format conversion unit 405 splits the matrix B by row into four first row vectors, where the four first row vectors are B0, B1, B2, and B3. An example in which the first column vector is A0 is used for description, where A0 is [a0, a1, a2, a3]^T, and a0, a1, a2, a3 in A0 all are element values. Each element value has a corresponding row coordinate. For example, a row coordinate of a0 is i0, a row coordinate of a1 is i1, a row coordinate of a2 is i2, and a row coordinate of a3 is i3. An example in which the first row vector is B0 is used for description. B0 is [b0, b1, b2, b3], and b0, b1, b2, b3 in B0 all are element values. Each element value in B0 has a corresponding column coordinate. For example, a column coordinate of b0 is j0, a column coordinate of b1 is j1, a column coordinate of b2 is j2, and a column coordinate of b3 is j3. It should be understood that, the matrix A is split based on N columns, and for each element value in a first column vector obtained after splitting, only a row coordinate of the element value is reserved. The matrix B is split based on N rows, and for each element value in a first row vector obtained after splitting, only a column coordinate of the element value is reserved. Similarly, A1 is [c0, c1, c2, c3]^T, and c0, c1, c2, c3 in A1 all are element values. Each element value has a corresponding row coordinate. For example, a row coordinate of c0 is k0, a row coordinate of c1 is k1, a row coordinate of c2 is k2, and a row coordinate of c3 is k3. B1 is [d0, d1, d2, d3], and d0, d1, d2, d3 in B1 are element values. Each element value in B1 has a corresponding column coordinate. For example, a column coordinate of d0 is l0, a column coordinate of d1 is l1, a column coordinate of d2 is l2, and a column coordinate of d3 is l3.
For example, the first matrix in the compressed format is represented in the COO format as: (0, 0, 1), (1, 0, 2), (1, 1, 3), (1, 2, 4), (2, 0, 5), (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9), and (3, 0, 6). The format conversion unit 405 splits the first matrix in the compressed format by column, where element values in (0, 0, 1), (1, 0, 2), (2, 0, 5) and (3, 0, 6) are element values in a same column, that is, all are element values in column 0. It should be understood that, when the format conversion unit 405 splits the matrix A by column, element values “1”, “2”, “5”, and “6” are used as element values in the A0 vector. The element values “1”, “2”, “5”, and “6” all are element values in a same column. In this case, only row coordinates of the element values need to be reserved. In this example, the vector A0 includes four element values, and the four element values are “1”, “2”, “5”, and “6”. In addition, a row coordinate of the element value “1” is “0”. Similarly, a row coordinate of the element value “2” is “1”, a row coordinate of the element value “5” is “2”, and a row coordinate of the element value “6” is “3”.
For another example, the second matrix in the compressed format is represented in the COO format as: (0, 0, 1), (1, 0, 2), (2, 1, 3), (1, 2, 4), (2, 0, 5), (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9), and (3, 0, 6). The format conversion unit 405 splits the second matrix based on row coordinates, that is, uses element values in a same row as element values in a same vector. For example, row coordinates in the four triplets (2, 1, 3), (2, 0, 5), (2, 2, 6) and (2, 3, 7) are the same. In this case, element values in the four triplets are split into one row vector (for example, the row vector B0), and the element values “3”, “5”, “6”, and “7” are used as element values in the first row vector B0, and each element value in the first row vector B0 has a corresponding column coordinate. For example, a column coordinate of the element value “3” is “1”, a column coordinate of the element value “5” is “0”, a column coordinate of the element value “6” is “2”, and a column coordinate of the element value “7” is “3”. It should be noted that a specific value in the COO format is merely an example provided for ease of description, and does not constitute a limitation on this application.
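The two splitting operations in the foregoing COO examples can be sketched as follows. The sketch assumes that each COO triplet is ordered as (row coordinate, column coordinate, element value), as the examples suggest. The function names `split_by_column` and `split_by_row` are hypothetical and are introduced only for illustration.

```python
from collections import defaultdict

def split_by_column(coo):
    """Group triplets by column; each vector keeps (element value, row coordinate)."""
    columns = defaultdict(list)
    for row, col, value in coo:
        columns[col].append((value, row))
    return dict(columns)

def split_by_row(coo):
    """Group triplets by row; each vector keeps (element value, column coordinate)."""
    rows = defaultdict(list)
    for row, col, value in coo:
        rows[row].append((value, col))
    return dict(rows)

first_matrix = [(0, 0, 1), (1, 0, 2), (1, 1, 3), (1, 2, 4), (2, 0, 5),
                (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9), (3, 0, 6)]
second_matrix = [(0, 0, 1), (1, 0, 2), (2, 1, 3), (1, 2, 4), (2, 0, 5),
                 (2, 2, 6), (2, 3, 7), (3, 1, 8), (3, 3, 9), (3, 0, 6)]

print(split_by_column(first_matrix)[0])   # element values of column 0 with row coordinates
print(split_by_row(second_matrix)[2])     # element values of row 2 with column coordinates
```

Because every element value in one column vector shares the same column coordinate, only the row coordinate has to be carried along with the value, and likewise only the column coordinate for a row vector.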
The vector outer product processing engine 401 is configured to calculate vector outer products of the N first column vectors and the N first row vectors, to obtain N intermediate result matrices. The intermediate result matrix includes third element values and position coordinates of the third element values, and the position coordinates include the row coordinates of the first element values and the column coordinates of the second element values. For example, refer to
The intermediate result matrix C0 includes third element values, and each third element value has corresponding position coordinates. The position coordinates are a combined value of a row coordinate of a first element value and a column coordinate of a second element value. For example, refer to
An example is as follows: a0 is 1, a1 is 2, a2 is 5, and a3 is 6. In addition, the row coordinate of a0 is “0”. Similarly, the row coordinate of a1 is “1”, the row coordinate of a2 is “2”, and the row coordinate of a3 is “3”. b0 is 3, b1 is 5, b2 is 6, and b3 is 7. The column coordinate of b0 is “1”, the column coordinate of b1 is “0”, the column coordinate of b2 is “2”, and the column coordinate of b3 is “3”. The third element value a0b0=1×3=3, and the position coordinates of a0b0 include the row coordinate of a0 and the column coordinate of b0, that is, the position coordinates of a0b0 are (0, 1). The third element value a0b1=1×5=5, and the position coordinates of a0b1 include the row coordinate of a0 and the column coordinate of b1, that is, the position coordinates of a0b1 are (0, 0). The third element value a1b0=2×3=6, and the position coordinates of a1b0 include the row coordinate of a1 and the column coordinate of b0, that is, the position coordinates of a1b0 are (1, 1).
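The element values and coordinates of this example can be run through a short sketch of the vector outer product with coordinate combination. The function name `outer_with_coords` is hypothetical. Each vector entry is assumed to be a pair of an element value and its reserved coordinate.

```python
def outer_with_coords(col_vec, row_vec):
    """Return [(third element value, (row coordinate, column coordinate)), ...].

    col_vec: [(first element value, row coordinate)]
    row_vec: [(second element value, column coordinate)]
    """
    return [(a * b, (i, j)) for a, i in col_vec for b, j in row_vec]

A0 = [(1, 0), (2, 1), (5, 2), (6, 3)]   # a0..a3 with row coordinates
B0 = [(3, 1), (5, 0), (6, 2), (7, 3)]   # b0..b3 with column coordinates
C0 = outer_with_coords(A0, B0)

for value, position in C0:
    print(value, position)
```

The position coordinates of every third element value are taken directly from the coordinates attached to the two operands, so no index lookup into an uncompressed matrix is needed.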
Refer to
Similarly, the intermediate result matrix C1 includes third element values, and each third element value in C1 has corresponding position coordinates. For example, refer to
The accumulator 402 is configured to accumulate, based on indexes of the position coordinates of the third element values, third element values with same position coordinates in the N intermediate result matrices, to obtain a result matrix. Still refer to
Optionally, an implementation in which the accumulator 402 accumulates, based on the indexes of the position coordinates of the third element values, the third element values with the same position coordinates in the N intermediate result matrices includes at least the following two implementations.
In a first possible implementation, when performing vector outer product calculation on the first row vectors and the first column vectors, the vector outer product processing engine 401 generates the intermediate result matrices in a specific sequence. For example, the first row vector and the first column vector that are shown in
Refer to
For example, the first cache 403 is divided into a plurality of storage positions by using a row coordinate identifier and a column coordinate identifier. For example, storage space of the first cache 403 is divided into p×q storage positions based on p rows and q columns. After receiving C0, the accumulator 402 writes the third element values in C0 into corresponding positions in the first cache 403 based on the first position coordinates. For example, if position coordinates of a third element value a0b0 in C0 are (i0, j0), for example, (i0, j0) are (1, 1), the accumulator 402 writes, based on the position coordinates (i0, j0), a0b0 into a position of row 1 column 1 in the first cache 403. Similarly, if position coordinates of a third element value a0b1 in C0 are (i0, j1), for example, (i0, j1) are (1, 2), the accumulator 402 writes, based on the position coordinates (i0, j1), a0b1 into a position of row 1 column 2 in the first cache 403. If position coordinates of a third element value a0b2 in C0 are (i0, j2), for example, (i0, j2) are (1, 3), the accumulator 402 writes, based on the position coordinates (i0, j2), a0b2 into a position of row 1 column 3 in the first cache 403. If position coordinates of a third element value a0b3 in C0 are (i0, j3), for example, (i0, j3) are (1, 4), the accumulator 402 writes, based on the position coordinates (i0, j3), a0b3 into a position of row 1 column 4 in the first cache 403. A process in which the accumulator 402 writes other third element values in C0 into the first cache 403 is not described in detail. A final result is that the accumulator 402 writes all third element values in C0 into the first cache 403 based on the first position coordinates corresponding to each third element value.
Then, when the accumulator 402 receives C1, the accumulator 402 searches, based on the second position coordinates of a third element value in C1, for a cached value at a position corresponding to the second position coordinates in the first cache 403. If no cached value exists at the position corresponding to the second position coordinates, the accumulator 402 writes the third element value into the position corresponding to the second position coordinates in the first cache 403. If there is a cached value at the position corresponding to the second position coordinates, the accumulator 402 reads the cached value from the first cache 403, and accumulates the cached value and the third element value in C1. For example, position coordinates of a third element value c0d0 in C1 are (k0, l0), for example, (k0, l0) is (1, 0). The accumulator 402 queries the first cache 403 based on the position coordinates (k0, l0). If there is no cached value at the position of row 1 column 0 in the first cache 403, the accumulator 402 writes c0d0 into the position of row 1 column 0 in the first cache 403. Position coordinates of a third element value c0d1 in C1 are (k0, l1), for example, (k0, l1) is (1, 1). The accumulator 402 obtains through query that there is a cached value a0b0 at a position of row 1 column 1 in the first cache 403. The accumulator 402 reads a0b0, adds c0d1 and a0b0, and then writes a result (c0d1+a0b0) obtained after the addition into a corresponding position (row 1 column 1) of the first cache 403 based on the position coordinates of c0d1. Similarly, position coordinates of a third element value c0d2 in C1 are (k0, l2), for example, (k0, l2) is (1, 2). The accumulator 402 obtains through query that there is a cached value a0b1 at a position of row 1 column 2 in the first cache 403.
The accumulator 402 reads the cached value a0b1 from the first cache 403, adds c0d2 and a0b1, and then writes an accumulated value (c0d2+a0b1) obtained after the addition into a corresponding position (row 1 column 2) of the first cache 403 based on the position coordinates of c0d2. In this example, the third element values in C1 are not described by using examples one by one. A final result is that the accumulator 402 adds third element values with same position coordinates in C1 and C0, and writes a result obtained after the addition into a corresponding position in the first cache 403 based on the position coordinates. Similarly, when the accumulator 402 receives the intermediate result matrix C2 transmitted by the vector outer product processing engine 401, the accumulator 402 reads a cached value (an accumulated value of third element values with same position coordinates in C0 and C1) from a corresponding position in the first cache 403 based on position coordinates of each third element value in C2. The accumulator 402 accumulates third element values with same position coordinates in C0, C1, and C2, and then writes an accumulated result into a position corresponding to the position coordinates in the first cache 403. It should be understood that the processing performed by the accumulator 402 for C2 and C3 is similar to the processing for C1. Details are not described herein again. A final processing result of the accumulator 402 is to accumulate third element values with same position coordinates in C0, C1, C2, and C3, to obtain a fifth element value. The result matrix includes a plurality of fifth element values. Finally, the first cache 403 outputs the result matrix.
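The read, add, and write-back behavior of the first implementation can be sketched with a cache that is addressed directly by position coordinates. This is a software sketch under stated assumptions: the cache is modeled as a two-dimensional list, `None` marks a position that holds no cached value, and the function name `accumulate_dense` is hypothetical.

```python
def accumulate_dense(intermediates, rows, cols):
    """Accumulate intermediate result matrices into a position-addressed cache.

    intermediates: list of matrices, each a list of (value, (row, col)) entries.
    Returns the result matrix in an uncompressed format.
    """
    cache = [[None] * cols for _ in range(rows)]   # None: no cached value yet
    for matrix in intermediates:                   # C0, C1, ... in arrival order
        for value, (r, c) in matrix:
            if cache[r][c] is None:
                cache[r][c] = value                # first value for this position
            else:
                cache[r][c] = cache[r][c] + value  # read, add, write back
    # Positions that never received a value are zero-element values.
    return [[0 if v is None else v for v in row] for row in cache]

C0 = [(3, (1, 1)), (5, (1, 2))]
C1 = [(4, (1, 0)), (2, (1, 1)), (7, (1, 2))]
print(accumulate_dense([C0, C1], rows=2, cols=3))
```

Because each position coordinate maps to a fixed cache position, no sorting or searching is required, at the cost of reserving one cache position for every position of the result matrix.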
It should be noted that the fifth element value is an accumulated value of a third element value in at least one of the intermediate result matrices C0, C1, C2, and C3. For example, still refer to
Further, if the fifth element value is an accumulated value of third element values, the fifth element value may be a zero-element value. In this example, the result matrix is a matrix in an uncompressed format. Optionally, to save transmission resources or facilitate a next calculation operation, the matrix calculation apparatus may compress the result matrix, to output a matrix in a compressed format. Refer to
It should be understood that, in a process of performing accumulation calculation by the accumulator 402, the first cache 403 is configured to store the third element values and the result obtained after the third element values are accumulated. The storage space of the first cache 403 therefore needs to be large enough to hold one value for each position of an intermediate result matrix. For example, if a dimension of each intermediate result matrix is M×P, in other words, each intermediate result matrix includes M×P third element values, a quantity of cache positions included in the first cache 403 is greater than or equal to M×P. In this case, the storage space of the first cache 403 needs to be capable of storing at least M×P values.
In the first possible implementation, the matrix calculation apparatus may output a matrix in an uncompressed format or a matrix in a compressed format based on different application scenarios, thereby expanding an applicable scenario of the matrix calculation apparatus.
In a second possible implementation, refer to
Similarly, when the accumulator 402 receives C2, the accumulator 402 sorts third element values in C2 based on position coordinates. The accumulator 402 reads position coordinates in the first cache 403, compares the position coordinates of the third element values in C2 and the position coordinates in the first cache 403, adds third element values with same position coordinates to cached values, to obtain accumulated values, and then stores the accumulated values in the first cache 403. Calculation performed by the accumulator 402 on the third element values in C3 is similar to calculation performed on the third element values in C2. Details are not described by using examples herein. A final result is that the accumulator 402 accumulates third element values with same position coordinates in the intermediate result matrices C3, C2, C1, and C0, to obtain fifth element values, and the accumulator 402 deletes a zero-element value and position coordinates of the zero-element value from the plurality of fifth element values. The first cache 403 outputs the result matrix, where the result matrix includes the fifth element values and position coordinates of the fifth element values. In addition, the fifth element value included in the result matrix is a non-zero-element value. Therefore, the result matrix output by the first cache 403 is a matrix in a compressed format.
In the second implementation, the matrix in the compressed format may be directly obtained, and the matrix in the compressed format output by the matrix calculation apparatus may be used in some application scenarios in which a matrix in a compressed format needs to be subsequently calculated. In addition, because the matrix calculation apparatus outputs the matrix in the compressed format, transmission resources required for subsequently transmitting the matrix can be reduced. In addition, in the second implementation, because the third element values cached in the first cache 403 are cached according to the sequence of the position coordinates, the implementation in this example can be implemented by using smaller cache space, thereby saving cache space of the first cache 403.
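The second implementation can be sketched as a merge of coordinate-sorted lists. The sketch below assumes that the cache holds ((row, column), value) entries sorted by position coordinates and that position coordinates are unique within one intermediate result matrix. The function names `merge_sorted` and `accumulate_compressed` are hypothetical.

```python
def merge_sorted(cache, matrix):
    """Merge one intermediate result matrix into a cache sorted by position coordinates."""
    incoming = sorted(matrix, key=lambda entry: entry[0])
    merged, i, j = [], 0, 0
    while i < len(cache) and j < len(incoming):
        if cache[i][0] == incoming[j][0]:          # same position: accumulate
            merged.append((cache[i][0], cache[i][1] + incoming[j][1]))
            i += 1
            j += 1
        elif cache[i][0] < incoming[j][0]:
            merged.append(cache[i])
            i += 1
        else:
            merged.append(incoming[j])
            j += 1
    merged.extend(cache[i:])
    merged.extend(incoming[j:])
    return merged

def accumulate_compressed(intermediates):
    """Accumulate all intermediate matrices and drop zero-element values."""
    cache = []
    for matrix in intermediates:
        cache = merge_sorted(cache, matrix)
    return [(pos, value) for pos, value in cache if value != 0]

C0 = [((1, 2), 5), ((0, 0), 3)]
C1 = [((0, 0), -3), ((1, 2), 1), ((2, 1), 4)]
print(accumulate_compressed([C0, C1]))
```

The cache only ever contains positions that actually received a value, which is why this variant can work with less cache space than a cache that reserves one position per result element.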
Optionally, in this example, the format conversion unit 405 is further configured to: obtain a fifth matrix and a sixth matrix, and perform format conversion on the fifth matrix and the sixth matrix in an uncompressed format to obtain the first matrix and the second matrix in the compressed format; and output the first matrix and the second matrix to the second cache 406, where at least one of the fifth matrix or the sixth matrix is a matrix in the uncompressed format. The vector outer product processing engine 401 obtains the first matrix and the second matrix in the compressed format from the second cache 406. In this example, the matrix calculation apparatus may receive the matrix in the uncompressed format, and then convert the matrix in the uncompressed format into the matrix in the compressed format, so that the matrix calculation apparatus can support calculation of matrices in a plurality of formats. In this example, the fifth matrix and the sixth matrix may include the following several cases.
In a first case, refer to
In a second case, one of the fifth matrix and the sixth matrix is a matrix in the uncompressed format, and the other matrix is a matrix in the compressed format. An example in which the fifth matrix is a matrix in the uncompressed format, and the sixth matrix is a matrix in the compressed format is used for description. If the sixth matrix is a compressed matrix in a CSC or CSR format, the format conversion unit converts both the fifth matrix and the sixth matrix into compressed matrices in a COO format.
Optionally, the format conversion unit 405 is further configured to convert a matrix in the compressed format into a matrix in a target compressed format. For example, when both the first matrix and the second matrix are in the CSC format or the CSR format, the format conversion unit converts both the first matrix and the second matrix into the COO format.
In this example, the matrix calculation apparatus may convert the matrix in the uncompressed format into the matrix in the compressed format by using the format conversion unit, so that the matrix calculation apparatus can support both calculation of the matrix in the compressed format and calculation of the matrix in the uncompressed format. Optionally, the format conversion unit may convert a matrix in a non-target compressed format into a matrix in the target compressed format (for example, the COO format). In this example, the matrix calculation apparatus may convert a matrix in another compressed format into a matrix in the target compressed format, and perform matrix calculation on the matrix in the target compressed format. The matrix calculation apparatus provided in this application may support calculation of matrices in various formats.
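The two kinds of conversion performed by the format conversion unit (uncompressed format to COO, and another compressed format to COO) can be sketched as follows. The CSR layout assumed here is the common one with a value array, a column index array, and a row pointer array. The function names and parameter names are hypothetical and are not taken from this application.

```python
def dense_to_coo(dense):
    """Convert an uncompressed matrix (list of rows) into (row, col, value) triplets."""
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row)
            if v != 0]                       # only non-zero element values are stored

def csr_to_coo(values, col_indices, row_ptr):
    """Convert a CSR matrix into (row, col, value) triplets.

    row_ptr[r] .. row_ptr[r + 1] delimits the entries that belong to row r.
    """
    coo = []
    for r in range(len(row_ptr) - 1):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            coo.append((r, col_indices[k], values[k]))
    return coo

print(dense_to_coo([[1, 0], [0, 2]]))
print(csr_to_coo([1, 2, 3], [0, 1, 0], [0, 1, 3]))
```

After either conversion, both operands are in the same target compressed format, so the subsequent splitting into column vectors and row vectors can use a single code path.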
Optionally, to improve applicability of the matrix calculation apparatus, high-precision matrix calculation may be implemented based on a low-precision matrix calculation apparatus. Precision of element values included in the first column vector and the first row vector is first precision. The format conversion unit 405 splits the first column vector into X second column vectors, and splits the first row vector into X second row vectors, where the second column vector and the second row vector include element values of second precision, and the first precision is higher than the second precision. Then, the vector outer product processing engine 401 calculates vector outer products of the X second column vectors and the X second row vectors to obtain X² fourth matrices, where the fourth matrix includes fourth element values and position coordinates of the fourth element values. Finally, the accumulator 402 accumulates, based on indexes of the position coordinates of the fourth element values, fourth element values with same position coordinates in the X² fourth matrices, to obtain the intermediate result matrix, where precision of the third element values in the intermediate result matrix is the first precision.
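The scheme above can be illustrated for non-negative integers. The sketch below splits each element value into X chunks of low precision, computes the X×X partial outer products, and accumulates them by position coordinates. Weighting each partial product by a shift that restores the significance of its chunks is an assumption of this sketch, because the alignment of partial products is not described in the passage above. The function names and the default of two 4-bit chunks are hypothetical.

```python
def split_unsigned(value, parts, bits):
    """Split a non-negative integer into `parts` chunks of `bits` bits, most significant first."""
    mask = (1 << bits) - 1
    return [(value >> (bits * (parts - 1 - p))) & mask for p in range(parts)]

def outer_low_precision(col_vec, row_vec, parts=2, bits=4):
    """Outer product of vectors whose values fit in parts*bits bits, using `bits`-bit multiplies.

    col_vec: [(value, row coordinate)]; row_vec: [(value, column coordinate)].
    Returns {(row, col): accumulated value}.
    """
    col_split = [[(split_unsigned(v, parts, bits)[p], i) for v, i in col_vec]
                 for p in range(parts)]
    row_split = [[(split_unsigned(v, parts, bits)[q], j) for v, j in row_vec]
                 for q in range(parts)]
    result = {}
    for p in range(parts):                 # parts * parts partial outer products
        for q in range(parts):
            # Assumed alignment: restore the weight of chunk p and chunk q.
            shift = bits * ((parts - 1 - p) + (parts - 1 - q))
            for a, i in col_split[p]:
                for b, j in row_split[q]:
                    result[(i, j)] = result.get((i, j), 0) + ((a * b) << shift)
    return result

print(outer_low_precision([(200, 0), (17, 1)], [(99, 2), (255, 3)]))
```

With two chunks per value, four partial outer products are computed, which matches the X² fourth matrices mentioned above for X = 2.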
The following uses an example to describe how to split a vector of the first precision into a vector of the second precision in this example. After converting the first matrix in the compressed format into the N first column vectors and converting the second matrix in the compressed format into the N first row vectors, the format conversion unit 405 may further split a first column vector into X second column vectors and split a first row vector into X second row vectors based on precision of element values. The precision of the element values included in the first column vector and the first row vector is the first precision. Both the second column vector and the second row vector include an element value of the second precision. For ease of description, a vector including the element value of the second precision is referred to as a “vector of the second precision”, and a vector including the element value of the first precision is referred to as a “vector of the first precision”. X is an integer greater than or equal to 2. The first precision and the second precision may be integer precision, or the first precision and the second precision may be floating point precision. This is not specifically limited. Specifically, the format conversion unit 405 splits each element value in the vector of the first precision into a plurality of values of the second precision, and splits the vector of the first precision into X vectors of the second precision. The following provides descriptions based on cases in which the vector of the first precision and the vector of the second precision are integers, and the vector of the first precision and the vector of the second precision are floating-point vectors.
In a first case, the vector of the first precision and the vector of the second precision are integers. The first precision is higher than the second precision. For example, the first precision is int32, and the second precision may be int2, int4, int8, or int16. Alternatively, the first precision is int16, and the second precision may be int4 or int8. Alternatively, the first precision is int8, and the second precision is int2 or int4. Alternatively, the first precision is int4, and the second precision is int2.
The format conversion unit 405 splits a high-precision integer number into a plurality of low-precision integer numbers, and splits the high-precision integer number from a most significant bit to a least significant bit. For example, refer to
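As a minimal sketch of the integer case, the following Python code splits an int32 value into four int8-sized chunks from the most significant bit to the least significant bit, and multiplies two values using only the chunk products. The function names and the sign handling (the top chunk keeps the two's complement sign, and the lower chunks are unsigned) are assumptions made for illustration; the text above does not specify them.

```python
def split_int(value, total_bits=32, part_bits=8):
    """Split a signed integer into chunks, most significant chunk first.

    Assumption (not stated in the text): the top chunk keeps the sign
    (two's complement) and the remaining chunks are unsigned.
    Returns a list of (chunk, shift) pairs.
    """
    if total_bits % part_bits != 0:
        raise ValueError("total_bits must be a multiple of part_bits")
    if not -(1 << (total_bits - 1)) <= value < (1 << (total_bits - 1)):
        raise ValueError("value does not fit in total_bits")
    mask = (1 << part_bits) - 1
    parts = []
    for i in range(total_bits // part_bits):
        shift = total_bits - part_bits * (i + 1)
        chunk = value >> shift        # arithmetic shift keeps the sign
        if i > 0:
            chunk &= mask             # lower chunks are unsigned
        parts.append((chunk, shift))
    return parts


def join_int(parts):
    """Rebuild the original integer from its (chunk, shift) pairs."""
    return sum(chunk << shift for chunk, shift in parts)


def multiply_via_parts(a, b):
    """Multiply two integers using only products of low-precision chunks."""
    return sum((ca * cb) << (sa + sb)
               for ca, sa in split_int(a)
               for cb, sb in split_int(b))
```

For example, split_int(0x12345678) yields the chunks 0x12, 0x34, 0x56, and 0x78 with shifts 24, 16, 8, and 0.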
In a second case, the vector of the first precision and the vector of the second precision are floating-point vectors. For example, the first-precision floating-point format is FP32, and the second-precision floating-point format is FP16. Alternatively, the first-precision floating-point format is FP64, and the second-precision floating-point format may be FP32 or FP16. The following uses a case in which the first-precision floating-point format is FP32 and the second-precision floating-point format is FP16 as an example for description.
1. One FP32 is split to obtain three FP16.
Currently, composition of FP32 in a standard format is shown in Table 1. FP32 includes a 1-bit sign, an 8-bit exponent, and a 23-bit mantissa. In addition, there is an omitted 1-bit integer, and the omitted integer is 1. For FP32 in the standard format, the integer and the mantissa are 24 bits in total. FP16 in a standard format includes a 1-bit sign, a 5-bit exponent, and a 10-bit mantissa. In addition, there is an omitted 1-bit integer, and the omitted integer is 1. For FP16 in the standard format, the integer and the mantissa are 11 bits in total. To split FP32 in the standard format to obtain FP16 in the standard format, three FP16 in the standard format are required.
The integer and the mantissa of FP32 in the standard format may be divided into three parts. A first part is the integer and the first 10 bits of the mantissa, a second part is the 11th bit to the 21st bit of the mantissa, and a third part is the 22nd bit and the 23rd bit of the mantissa. The three parts are separately represented by FP16 in the standard format. It should be noted herein that, when the 22nd bit and the 23rd bit of the mantissa in the third part are represented by FP16 in the standard format, nine 0s may be padded after the 23rd bit of the mantissa, that is, the 22nd bit and the 23rd bit of the mantissa and the padded 0s are represented by FP16 in the standard format.
In addition, the exponent range of FP16 is −15 to 15, that is, a decimal point may be shifted by up to 15 bits leftward or up to 15 bits rightward. When FP16 in the standard format represents the first part of FP32, the fixed exponent bias is 0; when it represents the second part of FP32, the fixed exponent bias is −11; and when it represents the third part of FP32, the fixed exponent bias is −22. It can be learned that, when the third part is represented, the fixed exponent bias alone already exceeds the exponent range of FP16. Therefore, the corresponding fixed exponent bias may be extracted from the exponent of each FP16 in the standard format.
Therefore, FP32 in the standard format may be represented as:
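As a hedged reconstruction that is consistent with the fixed exponent biases 0, −11, and −22 given above, the representation can be written as follows. The symbols A (the FP32 value) and a_0, a_1, a_2 (the FP16 values in the standard format that represent the first, second, and third parts) are assumed here for illustration.

```latex
A = a_0 + 2^{-11}\, a_1 + 2^{-22}\, a_2
```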
In addition, a common exponent bias may be extracted for each exponent in FP16 in the standard format. Similarly, FP32 in the standard format may be represented as:
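As a minimal sketch of the common-exponent form, assuming normal finite FP32 inputs, the following Python code slices the 24-bit integer-plus-mantissa field into the three parts described above and stores each part as a standard FP16 value, so that x = 2^E × (p0 + 2^−11 × p1 + 2^−22 × p2). The function names and the use of Python's struct module to emulate FP16 and FP32 rounding are illustrative assumptions and are not part of this application.

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest standard FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def split_fp32_three_fp16(x):
    """Return (common_exponent, [p0, p1, p2]) for a normal finite FP32 value."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = -1.0 if bits >> 31 else 1.0
    exp_field = (bits >> 23) & 0xFF
    if exp_field in (0, 0xFF):
        raise ValueError("this sketch handles normal finite values only")
    # Omitted integer 1 followed by the 23-bit mantissa: 24 bits in total.
    significand = (1 << 23) | (bits & 0x7FFFFF)
    part0 = significand >> 13              # integer + mantissa bits 1..10
    part1 = (significand >> 2) & 0x7FF     # mantissa bits 11..21
    part2 = (significand & 0x3) << 9       # mantissa bits 22..23 + nine 0s
    # Each 11-bit part fits exactly in the 11-bit FP16 integer-plus-mantissa.
    parts = [to_fp16(sign * p / 1024.0) for p in (part0, part1, part2)]
    return exp_field - 127, parts


def join_three_fp16(exponent, parts):
    """Recombine the parts using the fixed biases 0, -11, and -22."""
    p0, p1, p2 = parts
    return math.ldexp(p0 + p1 * 2.0 ** -11 + p2 * 2.0 ** -22, exponent)
```

Because the three parts together carry all 24 bits, join_three_fp16 returns the original FP32 value exactly in this sketch.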
2. One FP32 is split to obtain two FP16.
To reduce the quantity of FP16 obtained through splitting, the current FP16 in the standard format may be adjusted: the mantissa of FP16 is extended to 13 bits, and the quantity of bits of the sign and the quantity of bits of the exponent remain unchanged. The adjusted FP16 may be referred to as FP16 in a non-standard format. For FP16 in the non-standard format, the integer and the mantissa are 14 bits in total. In this case, if the mantissa of FP32 in the standard format is to be represented by FP16 in the non-standard format, only two FP16 in the non-standard format are required.
An integer and a mantissa of FP32 in the standard format are divided into two parts. A first part is the integer and the first 13 bits of the mantissa, and a second part is the 14th bit to the 23rd bit. The two parts are separately represented by FP16 in the non-standard format.
It should be further noted herein that, when the second part is represented by FP16 in the non-standard format, four 0s may be padded after the 23rd bit of the mantissa, that is, the 14th bit to the 23rd bit of the mantissa and the padded 0s are represented by FP16 in the non-standard format. As in the foregoing first case, a corresponding fixed exponent bias may also be extracted from the exponent of each FP16 in the non-standard format.
Similarly, FP32 in the standard format may be represented as:
In addition, a common exponent bias may be extracted for each exponent in FP16 in the non-standard format. Similarly, FP32 in the standard format may be represented as:
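As a hedged reconstruction for the two-part split, both forms can be written as follows. The symbols A, a_0, a_1, a'_0, a'_1, and E_A are assumed here for illustration. The bias −14 for the second part is inferred from the 14-bit width of the first part and is not stated explicitly in the text.

```latex
\begin{aligned}
A &= a_0 + 2^{-14}\, a_1
  && \text{(fixed exponent bias extracted for each part)} \\
A &= 2^{E_A} \left( a'_0 + 2^{-14}\, a'_1 \right)
  && \text{(common exponent } E_A \text{ extracted)}
\end{aligned}
```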
Certainly, in a case in which the first-precision floating-point is FP64 and the second-precision floating-point is FP32, FP64 may be split to obtain a plurality of FP32 in the following ways: One FP64 floating-point is split to obtain three FP32 floating-points; or one FP64 floating-point is split to obtain two FP32 floating-points. Optionally, in a case in which the first-precision floating-point is FP64 and the second-precision floating-point is FP16, FP64 may be split to obtain a plurality of FP16 in the following ways: One FP64 floating-point is split to obtain five FP16 floating-points; or one FP64 floating-point is split to obtain four FP16 floating-points. A splitting principle is similar to the foregoing described case in which the first-precision floating-point is FP32 and the second-precision floating-point is FP16. Details are not described herein.
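The split counts above follow from dividing the integer-plus-mantissa width of the source format by that of the target format and rounding up. The following Python sketch checks this arithmetic. The 53-bit width of FP64 (52-bit mantissa plus the omitted integer) comes from the IEEE 754 format rather than from the text above, and the dictionary keys and function name are assumptions for illustration. Only significand bits are counted; exponent range is ignored.

```python
import math

# Integer-plus-mantissa widths in bits (the omitted integer bit is counted).
SIGNIFICAND_BITS = {
    "FP64": 53,              # 52-bit mantissa + 1 (IEEE 754 binary64)
    "FP32": 24,              # 23-bit mantissa + 1
    "FP16": 11,              # 10-bit mantissa + 1
    "FP16-nonstandard": 14,  # 13-bit mantissa + 1, as described in the text
}


def parts_needed(source, target):
    """Smallest number of target-format values that cover the source bits."""
    return math.ceil(SIGNIFICAND_BITS[source] / SIGNIFICAND_BITS[target])
```

By the same arithmetic, splitting one FP64 into only two parts would need at least ceil(53 / 2) = 27 bits of integer plus mantissa in each part, which is more than the 24 bits of FP32 in the standard format.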
For ease of description, an example in which the first row vector is split into two second row vectors and the first column vector is split into two second column vectors is used. For example, the first column vector A0 is a column vector [a0, a1, a2, a3]^T whose precision is FP32. Using the foregoing method for splitting a value of the first precision into two values of the second precision, the format conversion unit 405 splits a floating-point a0 whose precision is FP32 into two floating-points (for example, a0M and a0L) whose precision is FP16. Similarly, a1 is split into a1M and a1L, a2 is split into a2M and a2L, and a3 is split into a3M and a3L. To be specific, as shown in
Further, the vector outer product processing engine 401 calculates vector outer products of the X second column vectors and the X second row vectors to obtain X² fourth matrices, where the fourth matrix includes fourth element values and position coordinates of the fourth element values, the position coordinates of the fourth element values include the row coordinates of the element values in the first column vector and the column coordinates of the element values in the first row vector, and precision of the fourth element values is the first precision (for example, FP32). In this example, refer to
In this example, if the precision of the element values in the first row vector and the first column vector is high, the matrix calculation apparatus may split both the first row vector and the first column vector into a plurality of low-precision vectors. The matrix calculation apparatus performs vector outer product calculation on the low-precision second column vectors and the low-precision second row vectors, to obtain a result matrix, so that calculation can be performed on a matrix in a compressed format, and high-precision matrix calculation can be implemented based on a low-precision matrix calculation apparatus, thereby improving applicability of the matrix calculation apparatus. In addition, in a matrix calculation process, an upper-layer software application (such as an AI or HPC application) based on the matrix calculation apparatus is not aware of the specific matrix calculation process, so that software adaptation costs can be greatly reduced.
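As a minimal sketch of the X = 2 case, the following Python code splits each FP32 element into a high part and a low part stored as standard FP16 values, computes the X² = 4 outer products of the parts, and accumulates them. It is a simplification under stated assumptions: the low part is obtained by rounding the scaled residual rather than by slicing mantissa bits, the scale 2^11 is an assumed fixed exponent bias, input values are assumed to lie within the FP16 range, and products are accumulated in double precision before rounding to FP32. All function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest standard FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


SCALE = 2.0 ** 11  # assumed fixed exponent bias between the two parts


def split_vector(vec):
    """Split FP32 values into (high, low) FP16 lists: v ~ high + low / SCALE."""
    high = [to_fp16(v) for v in vec]
    low = [to_fp16((v - h) * SCALE) for v, h in zip(vec, high)]
    return high, low


def outer(col, row):
    """Dense outer product of a column vector and a row vector."""
    return [[c * r for r in row] for c in col]


def outer_via_split(col, row):
    """Accumulate the four low-precision outer products into one result."""
    weights = (1.0, 1.0 / SCALE)
    result = [[0.0] * len(row) for _ in col]
    for col_part, col_weight in zip(split_vector(col), weights):
        for row_part, row_weight in zip(split_vector(row), weights):
            partial = outer(col_part, row_part)      # one "fourth matrix"
            for i, partial_row in enumerate(partial):
                for j, value in enumerate(partial_row):
                    result[i][j] += value * col_weight * row_weight
    return [[to_fp32(v) for v in r] for r in result]
```

With two standard FP16 parts, only about 22 of the 24 integer-plus-mantissa bits are retained, so this sketch matches the direct FP32 outer product to within a small relative error rather than exactly.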
Based on the matrix calculation apparatus provided in this application, a great benefit can be obtained in a plurality of matrix calculation scenarios. For example, when the matrix calculation apparatus is used in an AI training and inference scenario, calculation of both a matrix in a compressed format and a matrix in an uncompressed format can be fully supported. In AI calculation, sparsity of weights and feature data is more than 50% on average (that is, more than 50% of matrices are in the compressed format). The matrix calculation apparatus in this application may directly calculate the matrix in the compressed format, without decompressing the matrix in the compressed format. In this way, calculation efficiency can be improved by more than four times. In addition, for an HPC scenario such as scientific computing, regardless of calculation of a matrix in an uncompressed format that requires high computing power or a matrix calculation scenario in which memory bandwidth is limited, the matrix calculation apparatus in this application can directly access a matrix in a compressed format from a memory, thereby improving a calculation benefit. The matrix calculation apparatus supports full-precision numerical calculation, and can also effectively cover calculation with various different precision requirements. For example, floating-point calculation such as FP32 and FP16 usually required in an AI training scenario, and FP64 calculation required in some AI training scenarios and in HPC scientific computing, can be fully supported by the matrix calculation apparatus. In addition, the MAC in the matrix calculation apparatus can also support calculation of integer formats with medium and low precision such as INT1, INT2, INT4, and INT8. For a calculation scenario of AI inference, computing power can be improved and inference computation time can be reduced. In addition, various scenarios in which different precisions are mixed in inference computing can be supported, thereby greatly enhancing applicability of the matrix calculation apparatus.
The foregoing describes the embodiment of the matrix calculation apparatus, and the following describes a method performed by the matrix calculation apparatus. Refer to
Step 1301: Obtain a first calculation instruction, where the first calculation instruction includes N first column vectors and N first row vectors.
The N first column vectors are obtained by converting a first matrix in a compressed format, the N first row vectors are obtained by converting a second matrix in the compressed format, and N is an integer greater than or equal to 1.
Step 1302: Calculate vector outer products of the N first column vectors and the N first row vectors, to obtain N intermediate result matrices, where the first column vector includes first element values and row coordinates of the first element values, the first row vector includes second element values and column coordinates of the second element values, the intermediate result matrix includes third element values and position coordinates of the third element values, and the position coordinates include the row coordinates of the first element values and the column coordinates of the second element values.
For this step, refer to specific descriptions of functions performed by the vector outer product processing engine 401 in
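As a minimal sketch of step 1302, the following Python code computes the outer product of one first column vector and one first row vector that store only non-zero element values together with their coordinates. The list-of-pairs layout and the function name are assumptions for illustration.

```python
def sparse_outer(column_vector, row_vector):
    """Outer product of a sparse column vector and a sparse row vector.

    column_vector: list of (row_coordinate, value) pairs
    row_vector:    list of (column_coordinate, value) pairs
    Returns (row, column, value) triples: one intermediate result matrix
    whose position coordinates combine the row coordinate of the first
    element value with the column coordinate of the second element value.
    """
    return [(r, c, a * b) for r, a in column_vector for c, b in row_vector]
```

Because only non-zero element values are stored, the number of multiplications equals the product of the non-zero counts of the two vectors.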
Step 1303: Accumulate, based on indexes of the position coordinates of the third element values, third element values with same position coordinates in the N intermediate result matrices, to obtain a result matrix.
The N intermediate result matrices include at least a first intermediate result matrix and a second intermediate result matrix, position coordinates of third element values in the first intermediate result matrix are first position coordinates, and position coordinates of third element values in the second intermediate result matrix are second position coordinates.
In a first possible implementation, the matrix calculation apparatus writes, in a generation sequence of the N intermediate result matrices, the third element values in the first intermediate result matrix into corresponding positions in a cache based on the first position coordinates; and then reads, based on the second position coordinates of the third element values in the second intermediate result matrix, cached values that are at positions corresponding to the second position coordinates in the cache, and accumulates the third element values in the second intermediate result matrix and the cached values, to obtain a result matrix in an uncompressed format. Optionally, the matrix calculation apparatus compresses the result matrix in the uncompressed format to obtain a result matrix in a compressed format.
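The first possible implementation can be sketched as follows, assuming that each intermediate result matrix is a list of (row, column, value) triples and that the cache is a dense two-dimensional array initialized to zero. The function names and the rows and cols parameters are hypothetical.

```python
def accumulate_in_cache(intermediate_matrices, rows, cols):
    """Accumulate intermediate result matrices into a dense cache."""
    cache = [[0.0] * cols for _ in range(rows)]
    for matrix in intermediate_matrices:      # generation sequence
        for r, c, value in matrix:
            cache[r][c] += value              # read cached value, add, write back
    return cache                              # result in an uncompressed format


def compress(dense):
    """Optionally keep only non-zero element values and their coordinates."""
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row) if v != 0]
```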
In the first possible implementation, refer to specific descriptions of functions performed by the accumulator 402 in the examples corresponding to
In a second possible implementation, the matrix calculation apparatus sorts the third element values in the N intermediate result matrices based on the position coordinates of the third element values. The matrix calculation apparatus compares position coordinates in N intermediate result matrices obtained through sorting, adds up third element values with same position coordinates, and deletes position coordinates of a zero-element value, to obtain a result matrix in a compressed format.
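The second possible implementation can be sketched as follows, again assuming (row, column, value) triples. The function name is hypothetical, and a software sort stands in for whatever sorting mechanism the accumulator actually uses.

```python
from itertools import groupby
from operator import itemgetter


def accumulate_by_sorting(intermediate_matrices):
    """Sort by position coordinates, add equal positions, drop zero sums."""
    triples = sorted(
        (t for matrix in intermediate_matrices for t in matrix),
        key=itemgetter(0, 1),
    )
    result = []
    for (r, c), group in groupby(triples, key=itemgetter(0, 1)):
        total = sum(value for _, _, value in group)
        if total != 0:                        # delete zero-element positions
            result.append((r, c, total))
    return result                             # result in a compressed format
```

Unlike the cache-based approach, this variant never materializes the result matrix in the uncompressed format.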
In the second possible implementation, refer to specific descriptions of functions performed by the accumulator 402 in the examples corresponding to
In this embodiment of this application, the matrix calculation apparatus can directly calculate the matrix in the compressed format, and does not need to perform operations such as decompressing the matrix in the compressed format and performing matrix calculation on a decompressed matrix in a conventional method. The matrix calculation apparatus in this embodiment of this application can improve calculation efficiency of the matrix in the compressed format.
Optionally, refer to
Step 1401: Obtain a third calculation instruction, where the third calculation instruction includes a fifth matrix and a sixth matrix, and at least one of the fifth matrix and the sixth matrix is a matrix in an uncompressed format.
Step 1402: Perform format conversion on the fifth matrix to obtain the first matrix in the compressed format, and perform format conversion on the sixth matrix to obtain the second matrix in the compressed format.
For step 1401 and step 1402, refer to specific descriptions of functions performed by the format conversion unit 405 in the example corresponding to
In this embodiment of this application, the matrix calculation apparatus may convert the matrix in the uncompressed format into the matrix in the compressed format by using the format conversion unit, so that the matrix calculation apparatus can support both calculation of the matrix in the compressed format and calculation of the matrix in the uncompressed format. Optionally, the format conversion unit may convert a matrix in a non-target compressed format into a matrix in the target compressed format (for example, the COO format). In this example, the matrix calculation apparatus may convert a matrix in another compressed format into a matrix in the target compressed format, and perform matrix calculation on the matrix in the target compressed format. The matrix calculation apparatus provided in this application may support calculation of matrices in various formats.
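As a minimal sketch of the format conversion, the following Python code converts a matrix in the uncompressed format into COO-style (row, column, value) triples, and converts a matrix from another compressed format into the same target layout. CSR is chosen here only as an assumed example of a non-target compressed format; the function names and parameter names are hypothetical.

```python
def dense_to_coo(dense):
    """Keep only non-zero element values together with their coordinates."""
    return [(r, c, v)
            for r, row in enumerate(dense)
            for c, v in enumerate(row) if v != 0]


def csr_to_coo(values, col_indices, row_ptr):
    """Expand CSR row pointers into explicit row coordinates."""
    triples = []
    for r in range(len(row_ptr) - 1):
        for k in range(row_ptr[r], row_ptr[r + 1]):
            triples.append((r, col_indices[k], values[k]))
    return triples
```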
Step 1403: Obtain a second calculation instruction, where the second calculation instruction includes the first matrix and the second matrix.
Step 1404: Convert the first matrix into the N first column vectors, and convert the second matrix into the N first row vectors.
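Step 1404 and the subsequent calculation can be sketched as follows, assuming COO-style (row, column, value) triples as input. Column k of the first matrix is paired with row k of the second matrix, and the sum of the N outer products equals the matrix product. The function names are hypothetical, and a dictionary stands in for the accumulator.

```python
def coo_to_column_vectors(coo, n):
    """Column k of the first matrix becomes a list of (row, value) pairs."""
    columns = [[] for _ in range(n)]
    for r, c, v in coo:
        columns[c].append((r, v))
    return columns


def coo_to_row_vectors(coo, n):
    """Row k of the second matrix becomes a list of (column, value) pairs."""
    rows = [[] for _ in range(n)]
    for r, c, v in coo:
        rows[r].append((c, v))
    return rows


def multiply(first_coo, second_coo, n):
    """Multiply two compressed matrices as a sum of N vector outer products."""
    totals = {}
    columns = coo_to_column_vectors(first_coo, n)
    rows = coo_to_row_vectors(second_coo, n)
    for column_vector, row_vector in zip(columns, rows):
        for r, a in column_vector:
            for c, b in row_vector:
                totals[(r, c)] = totals.get((r, c), 0) + a * b
    return sorted((r, c, v) for (r, c), v in totals.items() if v != 0)
```

Here n is the number of columns of the first matrix, which equals the number of rows of the second matrix.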
For step 1403 and step 1404, refer to descriptions of functions performed by the format conversion unit 405 in the example corresponding to
Optionally, to improve applicability of the matrix calculation unit, high-precision matrix calculation may be implemented based on a low-precision matrix calculation apparatus.
A vector of first precision is split into a plurality of vectors of second precision, and then vector outer product calculation is performed on the vectors of the second precision. Refer to the following step 1405 to step 1407.
Step 1405: Split the first column vector into X second column vectors, and split the first row vector into X second row vectors. Precision of element values included in the first column vector and the first row vector is the first precision, precision of element values included in the second column vector and the second row vector is the second precision, the first precision is higher than the second precision, and X is an integer greater than or equal to 2.
For this step, refer to the specific descriptions in which the format conversion unit 405 splits the integer number and the format conversion unit 405 splits the floating point value in the examples corresponding to
Step 1406: Calculate vector outer products of the X second column vectors and the X second row vectors to obtain X² fourth matrices, where the fourth matrix includes fourth element values and position coordinates of the fourth element values, the position coordinates of the fourth element values include the row coordinates of the first element values and the column coordinates of the second element values, and precision of the fourth element values is the first precision.
For this step, refer to the descriptions of the function performed by the vector outer product processing engine 401 in the example corresponding to
Step 1407: Accumulate, based on indexes of the position coordinates of the fourth element values, fourth element values with same position coordinates in the X² fourth matrices, to obtain the intermediate result matrix, where precision of the third element values in the intermediate result matrix is the first precision.
Step 1408: Accumulate, based on indexes of the position coordinates of the third element values, third element values with same position coordinates in the N intermediate result matrices, to obtain a result matrix.
For step 1407 and step 1408, refer to functions performed by the accumulator 402 in the examples corresponding to
In this embodiment of this application, when a vector outer product of two vectors is calculated, a first-precision (high-precision) vector may be further split into a plurality of second-precision (low-precision) vectors, and then vector outer product calculation is performed on the low-precision vectors, so that an outer product of the first-precision vector may be obtained by accumulating outer product results of the plurality of second-precision vectors, without losing precision.
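The accumulation in step 1405 to step 1407 rests on the distributive law. As a hedged illustration, let u be a first column vector and v a first row vector, each written as a weighted sum of X second-precision vectors. The symbols u_i and v_j and the exponent biases s_i and t_j are assumed here and do not appear in the original text.

```latex
u = \sum_{i=0}^{X-1} 2^{-s_i}\, u_i, \qquad
v = \sum_{j=0}^{X-1} 2^{-t_j}\, v_j
\quad\Longrightarrow\quad
u\, v = \sum_{i=0}^{X-1} \sum_{j=0}^{X-1} 2^{-(s_i + t_j)}\, u_i\, v_j
```

The double sum has X² terms, one for each fourth matrix, and every term shares the position coordinates of the original outer product, which is why the terms can be accumulated by position.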
An embodiment of this application provides a matrix calculation circuit. The matrix calculation circuit is configured to perform one or more steps in step 1301 to step 1303 or one or more steps in step 1401 to step 1408 in the foregoing method embodiments. In actual application, the matrix calculation circuit may be an ASIC, an FPGA, a logic circuit, or the like.
Another embodiment of this application provides a matrix calculation system or a chip. A structure of the system or the chip may be shown in
Still another embodiment of this application provides a matrix calculation device. A structure of the device may be shown in
The processor 202 may be configured to perform one or more steps in step 1301 to step 1303 or one or more steps in step 1401 to step 1408 in the foregoing method embodiments. In some feasible embodiments, the processor 202 may include a matrix calculation unit, and the matrix calculation unit may be configured to support the processor in performing one or more steps in the foregoing method embodiments. In actual application, the matrix calculation unit may be an ASIC, an FPGA, a logic circuit, or the like. Certainly, the matrix calculation unit may alternatively be implemented by using software. This is not specifically limited in this embodiment of this application.
It should be noted that the components of the matrix calculation circuit, the matrix calculation system, the matrix calculation device, and the like provided in embodiments of this application are separately configured to implement functions of corresponding steps in the foregoing method embodiments. Because the steps are described in detail in the foregoing method embodiments, details are not described herein again.
All or some of the foregoing embodiments may be implemented using software, hardware, firmware, or any combination thereof. When software is used to implement embodiments, the foregoing embodiments may be implemented completely or partially in a form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, all or some of the processes or the functions according to embodiments of this application are generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive (SSD).
In conclusion, the foregoing embodiments are merely intended for describing the technical solutions of this application, but not for limiting this application. Although this application is described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they may still make modifications to the technical solutions described in the foregoing embodiments or make equivalent replacements to some technical features thereof, without departing from the scope of the technical solutions of embodiments of this application.
Number | Date | Country | Kind
---|---|---|---
202011617575.X | Dec 2020 | CN | national
202110181498.6 | Feb 2021 | CN | national

Relation | Number | Date | Country
---|---|---|---
Parent | PCT/CN2021/141000 | Dec 2021 | US
Child | 18343622 | | US