The technology of the disclosure relates generally to matrix handling in processor-based devices, and, in particular, to techniques for efficient matrix multiplication using vector registers.
Matrix multiplication is a mathematical operation having numerous applications in applied mathematics, physics, engineering, and computer science. In the context of computer science and computer engineering, the multiplication of matrices is fundamental to applications as varied as rendering three-dimensional (3D) graphics, simulating quantum mechanics, modeling financial systems, and developing machine learning algorithms. However, the performance of such applications may be limited by the computational complexity of matrix multiplication operations. For example, the conventional algorithm for multiplying two matrices with dimensions n×n (for an integer n≥2) has a computational complexity of O(n³). As such, the optimization of matrix multiplication operations has the potential to greatly improve a wide variety of applications.
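By way of illustration only, the cubic complexity of the conventional algorithm can be seen in its three nested loops. The following Python sketch is provided merely to make the O(n³) bound concrete; it does not form part of the disclosed apparatus:

```python
def matmul(a, b):
    """Conventional matrix multiplication of list-of-rows matrices.

    For two n-by-n matrices, the three nested loops below execute
    n * n * n multiply/accumulate steps, i.e., O(n^3) work.
    """
    n_rows, inner, n_cols = len(a), len(b), len(b[0])
    result = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):          # each output row
        for j in range(n_cols):      # each output column
            for k in range(inner):   # dot product of row i and column j
                result[i][j] += a[i][k] * b[k][j]
    return result
```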
Conventional attempts to provide efficient matrix multiplication have resulted in development of both hardware-based approaches (e.g., dedicated co-processors, matrix units, and the like) as well as software algorithms (e.g., vector-instruction-based sequences for accomplishing matrix multiplication). Each approach, however, has encountered significant challenges. For example, dedicated co-processors for performing matrix multiplication, such as Google's Tensor Processing Unit (TPU), may provide high peak performance, but may prove incapable of sustaining that performance across a broader range of operations. Similarly, software-based approaches may have limitations in handling floating-point or integer computations, or may be difficult or awkward to program. Thus, an efficient apparatus for performing matrix multiplication operations is desirable.
Aspects disclosed in the detailed description include providing matrix multiplication using vector registers in processor-based devices. In this regard, a processor-based device provides a matrix processing unit and a vector processing unit configured to provide a true matrix multiplication vector instruction expressed as a register-to-register operation. In some aspects, input matrices may first be “blocked” or “tiled” (i.e., organized into smaller submatrices that better fit into the memory hierarchy and/or register file). The elements within each submatrix are rearranged into a format allowing the contents of the submatrix to be loaded into a vector register. A matrix multiplication vector operation (e.g., initiated by executing a matrix multiplication vector instruction) is then used to perform matrix multiplication on the contents of two vector registers storing the rearranged contents of a pair of submatrices, and the result is stored in a third vector register. In this manner, a matrix multiplication operation having a computational complexity of O(n³) can be encoded in a vector instruction having O(n²) elements, which enables a simple sequence of software instructions to efficiently produce a product of large matrices.
In another aspect, a processor-based device for performing matrix multiplication operations using vector registers is provided. The processor-based device comprises a matrix processing unit, as well as a vector processing unit comprising a plurality of vector registers. The matrix processing unit is configured to rearrange a plurality of elements of a first submatrix having a number RA rows and a number CA columns into a first vector having a number of elements equal to the product of RA and CA (RACA). The matrix processing unit is further configured to store the first vector in a first vector register of the plurality of vector registers of the vector processing unit. The matrix processing unit is also configured to rearrange a plurality of elements of a second submatrix having a number RB rows and a number CB columns into a second vector having a number of elements equal to the product of RB and CB (RBCB). The matrix processing unit is additionally configured to store the second vector in a second vector register of the plurality of vector registers of the vector processing unit. The vector processing unit is configured to perform a matrix multiplication vector operation using the first vector register and the second vector register as input operands to generate an output vector having a number of elements equal to the product of RA and CB (RACB), wherein each element E of the output vector, where 0≤E<RACB, is calculated as a dot product of a plurality of elements of the first vector corresponding to a row of the first submatrix, and a plurality of elements of the second vector corresponding to a column of the second submatrix. The vector processing unit is further configured to store the output vector in a third vector register of the plurality of vector registers of the vector processing unit.
In another aspect, a processor-based device for performing matrix multiplication operations using vector registers is provided. The processor-based device comprises a means for rearranging a plurality of elements of a first submatrix having a number RA rows and a number CA columns into a first vector having a number of elements equal to the product of RA and CA (RACA). The processor-based device further comprises a means for storing the first vector in a first vector register of a plurality of vector registers. The processor-based device also comprises a means for rearranging a plurality of elements of a second submatrix having a number RB rows and a number CB columns into a second vector having a number of elements equal to the product of RB and CB (RBCB). The processor-based device additionally comprises a means for storing the second vector in a second vector register of the plurality of vector registers. The processor-based device further comprises a means for performing a matrix multiplication vector operation using the first vector register and the second vector register as input operands to generate an output vector having a number of elements equal to the product of RA and CB (RACB), wherein each element E of the output vector, where 0≤E<RACB, is calculated as a dot product of a plurality of elements of the first vector corresponding to a row of the first submatrix, and a plurality of elements of the second vector corresponding to a column of the second submatrix. The processor-based device also comprises a means for storing the output vector in a third vector register of the plurality of vector registers.
In another aspect, a method for performing matrix multiplication operations using vector registers is provided. The method comprises rearranging, by a matrix processing unit of a processor-based device, a plurality of elements of a first submatrix having a number RA rows and a number CA columns into a first vector having a number of elements equal to the product of RA and CA (RACA). The method further comprises storing, by the matrix processing unit, the first vector in a first vector register of a plurality of vector registers of a vector processing unit of the processor-based device. The method also comprises rearranging, by the matrix processing unit, a plurality of elements of a second submatrix having a number RB rows and a number CB columns into a second vector having a number of elements equal to the product of RB and CB (RBCB). The method additionally comprises storing, by the matrix processing unit, the second vector in a second vector register of the plurality of vector registers of the vector processing unit. The method further comprises performing, by the vector processing unit, a matrix multiplication vector operation using the first vector register and the second vector register as input operands to generate an output vector having a number of elements equal to the product of RA and CB (RACB), wherein each element E of the output vector, where 0≤E<RACB, is calculated as a dot product of a plurality of elements of the first vector corresponding to a row of the first submatrix, and a plurality of elements of the second vector corresponding to a column of the second submatrix. The method also comprises storing, by the vector processing unit, the output vector in a third vector register of the plurality of vector registers of the vector processing unit.
With reference now to the drawing figures, several exemplary aspects of the present disclosure are described. The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any aspect described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects.
Aspects disclosed in the detailed description include providing matrix multiplication using vector registers in processor-based devices. In this regard,
In the example of
To illustrate the constituent elements of the slices 128(0)-128(7),
The processor-based device 100 and its constituent elements as illustrated in
Before discussing the use of vector registers to perform a matrix multiplication operation as disclosed herein, the calculations used by a conventional matrix processing unit when performing matrix multiplication are first discussed. In this regard,
An output matrix 204 represents the result of a matrix multiplication operation on the input matrices 200 and 202, and indicates the computations used to generate each element of the output matrix 204. For example, the upper-left-most element of the output matrix 204 is computed by summing the products of elements of row A of the input matrix 200 with elements of column 1 of the input matrix 202. Accordingly, the value of the upper-left-most element of the output matrix 204 is the sum of the products of element AA of the input matrix 200 and element 00 of the input matrix 202, element AB of the input matrix 200 and element 10 of the input matrix 202, element AC of the input matrix 200 and element 20 of the input matrix 202, and element AD of the input matrix 200 and element 30 of the input matrix 202. Corresponding computations are made for each of the other elements of the output matrix 204. The resulting output matrix has dimensions RA by CB, which in the example of
As seen in
In this regard, the matrix processing units 146(0)-146(M) and the vector processing unit 142 of
As noted above, it is not uncommon for one or both matrices upon which a matrix multiplication operation is to be performed to be larger than a conventional size of the memory hierarchy and/or register files of the matrix processing units 146(0)-146(M). Accordingly, the matrices may be organized into smaller submatrices through a process known as blocking or tiling. In this regard,
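Purely by way of illustration, the blocking process described above may be sketched in Python as follows. The tile size and list-of-rows layout are assumptions made for the sketch only and are not mandated by the disclosure:

```python
def tile(matrix, block):
    """Organize a matrix (list of rows) into block-by-block submatrices.

    Returns a grid of tiles: tiles[r][c] is the submatrix whose upper-left
    element sits at row r*block, column c*block of the input matrix.
    Assumes both dimensions are multiples of the block size.
    """
    rows, cols = len(matrix), len(matrix[0])
    return [
        [[row[c:c + block] for row in matrix[r:r + block]]
         for c in range(0, cols, block)]
        for r in range(0, rows, block)
    ]
```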
To optimize bandwidth usage and enable efficient delivery of data for matrix multiplication, the submatrices 304, 306 next are rearranged. When stored in memory, the contents of the rows of each submatrix 304, 306 may be widely separated in memory by the other contents of the rows of the original input matrices 300, 302 from which the submatrices 304, 306 are extracted (i.e., the contents of the rows of each of the submatrices 304, 306 are not contiguous in memory). This condition arises as a consequence of the format (e.g., row-major, column-major, or Iliffe vector format, as non-limiting examples) in which the input matrices 300, 302 are stored in memory. Thus, the processor-based device 100 of
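The rearrangement described above gathers the scattered rows of a submatrix into one contiguous run of elements. The following Python fragment sketches one possible rearrangement, a row-major flattening of a tile stored inside a larger row-major matrix; as noted above, other storage formats and layouts are possible, so this is an illustrative assumption rather than the required implementation:

```python
def flatten_submatrix(matrix, r0, c0, block):
    """Gather a block-by-block submatrix, whose rows are separated in the
    larger row-major matrix by the other contents of those rows, into a
    single contiguous vector suitable for loading into a vector register."""
    return [matrix[r0 + i][c0 + j]
            for i in range(block)
            for j in range(block)]
```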
The reorganization of the contents of the submatrices 304, 306 may be performed as part of a data transfer between different elements of the processor-based device 100 of
In some aspects, the processor-based device 100 provides a vector instruction specifically for retrieving and rearranging data for a single “tiled” submatrix from a larger input matrix. As a non-limiting example, the following instruction may be provided to load 32 groups of 32 values each:
LDSB destVR, src1SR, src2SR/imm
The LDSB instruction shown above takes three operands: “destVR,” which specifies a destination vector register among the vector registers 144(0)-144(V); “src1SR,” which indicates a scalar register for a start address of the submatrix; and “src2SR/imm,” which indicates either a second scalar register or an immediate value indicating a start address for a next block of 32 elements. Upon execution, the LDSB instruction uses the src1SR operand and the src2SR/imm operand to load 32 groups of 32 values (i.e., a 32×32 submatrix) into the vector register indicated by the destVR operand. In this manner, the vector registers 144(0)-144(V) can be efficiently loaded with the contents of submatrices of input matrices to be multiplied.
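The effect of the LDSB instruction may be approximated in software as a block-strided gather. The sketch below is a behavioral model only, with hypothetical parameter names; the actual instruction operates directly on memory and the vector registers 144(0)-144(V), and is shown here with small group counts for readability rather than the 32 groups of 32 values described above:

```python
def ldsb(memory, start, stride, groups=32, group_size=32):
    """Behavioral model of a block-strided load: gather `groups` runs of
    `group_size` contiguous elements, with successive runs starting
    `stride` elements apart (the role of the src2SR/imm operand)."""
    dest = []
    for g in range(groups):
        base = start + g * stride
        dest.extend(memory[base:base + group_size])
    return dest
```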
To actually perform the matrix multiplication operation after the LDSB instruction has been used to load input matrices, the processor-based device 100 further may provide a matrix multiplication vector instruction. In some aspects, the matrix multiplication vector instruction may be provided as follows:
VFMXM rdest, rsrc1, rsrc2
As seen above, the VFMXM instruction takes as operands “rdest,” which specifies a vector register among the vector registers 144(0)-144(V) in which contents of the output matrix will be stored, and “rsrc1” and “rsrc2,” which indicate the vector registers among the vector registers 144(0)-144(V) in which the contents of the first and second input matrices, respectively, are stored.
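The semantics of such a matrix multiplication vector instruction may be modeled in software. The sketch below is a behavioral model only, written under the assumption of square n×n submatrices flattened in row-major order; it computes each output element as the dot product of the corresponding row of the first operand and column of the second, and does not describe the underlying hardware datapath:

```python
def vfmxm(vsrc1, vsrc2, n):
    """Behavioral model of the matrix multiplication vector operation.

    vsrc1, vsrc2: flat sequences of n*n elements, each holding an n-by-n
    submatrix in row-major order. Returns the flat n*n output vector.
    """
    vdest = [0] * (n * n)
    for e in range(n * n):
        row, col = e // n, e % n  # row of first operand, column of second
        vdest[e] = sum(vsrc1[row * n + k] * vsrc2[k * n + col]
                       for k in range(n))
    return vdest
```

Under this model, an O(n³) computation is invoked through operands of only n² elements each, consistent with the register-to-register encoding described above.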
To illustrate operations performed upon execution of a matrix multiplication vector instruction,
Some aspects of the processor-based device 100 may be configured to perform matrix multiplication operations on matrices of different sizes such as 4×4, 8×8, 16×16, or 32×32 matrices, or even non-square sizes like 8×16, 16×8, and the like, as non-limiting examples. Additionally, aspects may provide support for matrix elements of different sizes and formats (e.g., 8-bit signed or unsigned, 16-bit signed or unsigned, 16-bit floating-point, 32-bit floating-point, or 4-bit signed or unsigned, as non-limiting examples). The vector instruction for performing matrix multiplication according to some aspects may be provided with or without accumulation functionality. Some aspects may be provided with or without floating-point exceptions in line with the conventions of the host instruction set architecture (ISA), while some aspects may provide that results may be of a larger width than inputs (e.g., half-precision inputs may produce single-precision results). Finally, in some aspects, data layout may not be strict row-by-row orientation, but may rather utilize a permutation functional unit to pre-swizzle data to minimize the number of large buses required in the matrix processing units 146(0)-146(M).
The mechanisms described above provide a true matrix multiplication instruction, expressed as a register-to-register operation, that encodes an O(n³) operation in an instruction with n² elements. This provides a concise model of computation that enables simpler software sequences to produce the product of larger matrices. Additionally, the underlying hardware implementation may be modified in some aspects to improve performance or reduce area in a manner that is transparent to software. For example, the number of multiply/accumulate (MAC) units may be varied, and/or support for special handling of sparse matrices may be provided. Hardware may also be modified by adding multiple functional units to execute in parallel. The mechanisms described herein may coexist with expected software-friendly processor ISA features such as breakpoints and exceptions, as well as hardware-friendly features such as load-store ISAs. They also are amenable to compiler optimizations such as code scheduling and register allocation. Additionally, the mechanisms described herein may be applied to performing other operations instead of or in addition to matrix multiplication and/or to operations requiring more operands than the submatrices 304, 306.
To illustrate exemplary operations of the processor-based device 100 of
The matrix processing unit 146(0)-146(M) next rearranges a plurality of elements of a second submatrix, such as the submatrix 306, having a number RB rows and a number CB columns into a second vector 312 having a number of elements 314(0)-314(15) equal to the product of RB and CB (RBCB) (block 504). The matrix processing unit 146(0)-146(M) thus may be referred to herein as “a means for rearranging a plurality of elements of a second submatrix having a number RB rows and a number CB columns into a second vector having a number of elements equal to the product of RB and CB (RBCB).” The matrix processing unit 146(0)-146(M) then stores the second vector 312 in a second vector register, such as the vector register 402, of the plurality of vector registers 144(0)-144(V) of the vector processing unit 142 (block 506). In this regard, the matrix processing unit 146(0)-146(M) may be referred to herein as “a means for storing the second vector in a second vector register of the plurality of vector registers.”
The vector processing unit 142 then performs a matrix multiplication vector operation using the first vector register 400 and the second vector register 402 as input operands to generate an output vector 406 having a number of elements 404(0)-404(15) equal to the product of RA and CB (RACB), wherein each element E of the output vector 406, where 0≤E<RACB, is calculated as a dot product of a plurality of elements of the first vector 308 corresponding to a row of the first submatrix 304 indicated by a quotient of E divided by RA, and a plurality of elements of the second vector 312 corresponding to a column of the second submatrix 306 indicated by a remainder of E divided by CB (block 508). Accordingly, the vector processing unit 142 may be referred to herein as “a means for performing a matrix multiplication vector operation using the first vector register and the second vector register as input operands to generate an output vector having a number of elements equal to the product of RA and CB (RACB), wherein each element E of the output vector, where 0≤E<RACB, is calculated as a dot product of a plurality of elements of the first vector corresponding to a row of the first submatrix indicated by a quotient of E divided by RA, and a plurality of elements of the second vector corresponding to a column of the second submatrix indicated by a remainder of E divided by CB.” After performing the matrix multiplication vector operation, the vector processing unit 142 stores the output vector 406 in a third vector register, such as the vector register 408, of the plurality of vector registers 144(0)-144(V) of the vector processing unit 142 (block 510). The vector processing unit 142 thus may be referred to herein as “a means for storing the output vector in a third vector register of the plurality of vector registers.”
In some aspects, operations of blocks 500 and 504 for rearranging the plurality of elements of the first submatrix 304 and rearranging the plurality of elements of the second submatrix 306 may be performed by the matrix processing unit 146(0)-146(M) as part of operations for transferring data from an external memory, such as the DDR memory 116, to an internal scratchpad memory such as the global scratchpad 136 and/or the local scratchpad 134. In such aspects, the matrix processing unit 146(0)-146(M) may be referred to herein as “a means for rearranging a plurality of elements of a submatrix during one or more data transfer operations for transferring data from an external memory to an internal scratchpad memory.” Some aspects may provide that the operations of blocks 500 and 504 may be performed by the matrix processing unit 146(0)-146(M) during one or more data transfer operations for transferring data within a same memory hierarchy level (e.g., transferring data within the HBM 118, the DDR memory 116, the memory 106, the global scratchpad 136, or the local scratchpad 134, as non-limiting examples). In such aspects, the matrix processing unit 146(0)-146(M) may be referred to herein as “a means for rearranging a plurality of elements of a submatrix during one or more data transfer operations transferring data within a same memory hierarchy level.”
As noted above, the processor-based device 100 according to some aspects may provide a specialized instruction for performing a block-strided load operation for extracting the elements of the submatrix 304 from the input matrix 300. In such aspects, operations of blocks 500 and 504 for rearranging the plurality of elements of the first submatrix 304 and rearranging the plurality of elements of the second submatrix 306 may be performed by the matrix processing unit 146(0)-146(M) as part of performing the block-strided load operation. Accordingly, in such aspects, the matrix processing unit 146(0)-146(M) may be referred to herein as “a means for performing a block-strided load operation.” Additionally, in some aspects, the operations of blocks 500 and 504 may include performing a plurality of vector load operations (e.g., for merging smaller vectors into a larger vector in the vector registers 144(0)-144(V), as a non-limiting example). In such aspects, the matrix processing unit 146(0)-146(M) may be referred to herein as “a means for performing a plurality of vector load operations.”
Providing matrix multiplication using vector registers in processor-based devices according to aspects disclosed herein may be provided in or integrated into any processor-based device. Examples, without limitation, include a set top box, an entertainment unit, a navigation device, a communications device, a fixed location data unit, a mobile location data unit, a global positioning system (GPS) device, a mobile phone, a cellular phone, a smart phone, a session initiation protocol (SIP) phone, a tablet, a phablet, a server, a computer, a portable computer, a mobile computing device, a wearable computing device (e.g., a smart watch, a health or fitness tracker, eyewear, etc.), a desktop computer, a personal digital assistant (PDA), a monitor, a computer monitor, a television, a tuner, a radio, a satellite radio, a music player, a digital music player, a portable music player, a digital video player, a video player, a digital video disc (DVD) player, a portable digital video player, an automobile, a vehicle component, avionics systems, a drone, and a multicopter.
In this regard,
Other master and slave devices can be connected to the system bus 608. As illustrated in
The CPU(s) 602 may also be configured to access the display controller(s) 620 over the system bus 608 to control information sent to one or more displays 626. The display controller(s) 620 sends information to the display(s) 626 to be displayed via one or more video processors 628, which process the information to be displayed into a format suitable for the display(s) 626. The display(s) 626 can include any type of display, including, but not limited to, a cathode ray tube (CRT), a liquid crystal display (LCD), a plasma display, etc.
Those of skill in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithms described in connection with the aspects disclosed herein may be implemented as electronic hardware, instructions stored in memory or in another computer readable medium and executed by a processor or other processing device, or combinations of both. The master devices, and slave devices described herein may be employed in any circuit, hardware component, integrated circuit (IC), or IC chip, as examples. Memory disclosed herein may be any type and size of memory and may be configured to store any type of information desired. To clearly illustrate this interchangeability, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. How such functionality is implemented depends upon the particular application, design choices, and/or design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The various illustrative logical blocks, modules, and circuits described in connection with the aspects disclosed herein may be implemented or performed with a processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices (e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration).
The aspects disclosed herein may be embodied in hardware and in instructions that are stored in hardware, and may reside, for example, in Random Access Memory (RAM), flash memory, Read Only Memory (ROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), registers, a hard disk, a removable disk, a CD-ROM, or any other form of computer readable medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a remote station. In the alternative, the processor and the storage medium may reside as discrete components in a remote station, base station, or server.
It is also noted that the operational steps described in any of the exemplary aspects herein are described to provide examples and discussion. The operations described may be performed in numerous different sequences other than the illustrated sequences. Furthermore, operations described in a single operational step may actually be performed in a number of different steps. Additionally, one or more operational steps discussed in the exemplary aspects may be combined. It is to be understood that the operational steps illustrated in the flowchart diagrams may be subject to numerous different modifications as will be readily apparent to one of skill in the art. Those of skill in the art will also understand that information and signals may be represented using any of a variety of different technologies and techniques. For example, data, instructions, commands, information, signals, bits, symbols, and chips that may be referenced throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or particles, optical fields or particles, or any combination thereof.
The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The present application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Patent Application Ser. No. 62/558,544 entitled “PROVIDING MATRIX MULTIPLICATION USING VECTOR REGISTERS IN PROCESSOR-BASED SYSTEMS” and filed on Sep. 14, 2017, the contents of which is incorporated herein by reference in its entirety.
Number | Date | Country
---|---|---
62558544 | Sep 2017 | US