This application claims the priority benefit of Taiwan application serial no. 111135607, filed on Sep. 20, 2022. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The present disclosure relates to a computing device, and more particularly, to a matrix device for matrix operation and an operation method thereof.
Matrix multiplication is a fundamental operation in computer systems. After an operation circuit completes a previous matrix operation, the elements of the result matrix are sequentially written into a dynamic random access memory (DRAM) in the order in which they are generated. For example, matrices may be stored in the DRAM in either a column-major manner or a row-major manner. However, the order in which the elements of the previous operation's result are stored in the DRAM might be unfavorable for the access pattern of the next matrix operation. For example, the result matrix of the previous matrix operation may be stored in the DRAM in a column-major manner for use by the next matrix operation, while the operand matrix of the next matrix operation is read in a row-major manner. In that case, for the next matrix operation, the elements of the operand matrix are scattered across different positions (non-consecutive addresses) of the DRAM.
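The layout mismatch described above can be illustrated with a short sketch (Python is used here purely for illustration; the helper names are ours and do not appear in the disclosure):

```python
# Illustrative sketch: where the elements of a 2x2 matrix land in
# linear memory under row-major and column-major storage.

def flatten_row_major(matrix):
    """Concatenate the rows: element (r, c) lands at address r*cols + c."""
    return [x for row in matrix for x in row]

def flatten_column_major(matrix):
    """Concatenate the columns: element (r, c) lands at address c*rows + r."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]

M = [["X00", "X01"],
     ["X10", "X11"]]

print(flatten_row_major(M))     # ['X00', 'X01', 'X10', 'X11']
print(flatten_column_major(M))  # ['X00', 'X10', 'X01', 'X11']

# If the previous result is stored column-major but the next operation
# consumes rows, the elements of row 0 ('X00' and 'X01') sit at
# addresses 0 and 2 -- non-consecutive, so one burst cannot cover them.
```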
When the multiple elements accessed by the next matrix operation in the same batch are located at consecutive addresses in the DRAM, the operation circuit may use a burst read command to read these elements at consecutive addresses from the DRAM at one time. When the plurality of elements accessed by the next matrix operation are located at non-consecutive addresses of the DRAM, the operation circuit needs to use a plurality of read commands to read these elements from the DRAM multiple times. Generally speaking, the number of reads to DRAM is proportional to power consumption. How to appropriately store the matrix generated by the previous matrix operation in the DRAM so that the next matrix operation can efficiently access the matrix is an important issue. If the number of times of accessing the DRAM can be reduced in the process of accessing the matrix from the DRAM, the performance of the matrix operation may be effectively improved, and the power consumption of the circuit may be effectively reduced.
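The effect on the number of read commands can be sketched with a simplified model we assume here for illustration: each maximal run of consecutive addresses counts as one burst read.

```python
# Simplified model: count the read commands needed to fetch one row of
# an R x C matrix, treating each maximal run of consecutive addresses
# as a single burst read.

def reads_for_row(layout, rows, cols, row):
    """Return the addresses holding `row` and the number of bursts needed."""
    if layout == "row-major":
        addrs = [row * cols + c for c in range(cols)]
    else:  # column-major
        addrs = [c * rows + row for c in range(cols)]
    bursts = 1 + sum(1 for a, b in zip(addrs, addrs[1:]) if b != a + 1)
    return addrs, bursts

# Row 0 of a 4x4 matrix:
print(reads_for_row("row-major", 4, 4, 0))     # ([0, 1, 2, 3], 1)
print(reads_for_row("column-major", 4, 4, 0))  # ([0, 4, 8, 12], 4)
```

Under the matching layout a single burst suffices, while under the mismatched layout every element costs its own read command; this is the overhead the present disclosure aims to remove.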
The present disclosure provides a matrix device and an operating method thereof to improve performance.
The present disclosure provides a matrix device including a transpose circuit and a memory. The transpose circuit is configured to receive a first element string representing a native matrix from a matrix source, and transpose the first element string into a second element string. All elements in the native matrix are arranged in the first element string in one of a “row-major manner” and a “column-major manner”. The second element string is equivalent to an element string in which all elements of the native matrix are arranged in the other one of the “row-major manner” and the “column-major manner”. The memory is coupled to the transpose circuit to receive the second element string.
In an embodiment of the present disclosure, an operating method of the matrix device includes: receiving, by a transpose circuit of the matrix device, a first element string representing a native matrix from a matrix source; transposing, by the transpose circuit, the first element string into a second element string, wherein all elements of the native matrix are arranged in the first element string in one of a “row-major manner” and a “column-major manner”, and the second element string is equivalent to an element string in which all elements of the native matrix are arranged in the other one of the “row-major manner” and the “column-major manner”; and receiving, by a memory of the matrix device, the second element string.
Based on the above, the transpose circuit in the embodiments of the present disclosure is able to make the arrangement of elements in the memory match the access characteristics of the subsequent calculation through transposing. In this way, the efficiency of the matrix device may be effectively improved.
In order to make the above-mentioned features and advantages of the present disclosure more understandable, the following embodiments are given and described in detail with the accompanying drawings as follows.
The term “couple (or connect)” used throughout this disclosure (including the claims) refers to any direct or indirect connection. For example, if a first device is described as being coupled to a second device, it should be interpreted as meaning that the first device is directly connected to the second device, or that the first device is indirectly connected to the second device through other devices or connection means. Terms such as “first” and “second” mentioned throughout the description of the disclosure (including the claims) are used to denote the names of elements, or to distinguish different embodiments or scopes, and are not intended to limit the upper or lower bound on the number of elements, nor to limit the order of the elements. Moreover, wherever possible, components/members/steps using the same reference numerals in the drawings and description refer to the same or like parts. Components/members/steps using the same reference numerals or the same terms in different embodiments may cross-refer to the related descriptions.
In terms of hardware, the transpose circuit 110 may be implemented as a logic circuit on an integrated circuit. For example, the related functions of the transpose circuit 110 may be implemented in one or more controllers, microcontrollers, microprocessors, application-specific integrated circuits (ASICs), digital signal processors (DSPs), field programmable gate arrays (FPGAs), and/or various logic blocks, modules, and circuits in other processing units. The related functions of the matrix device, the transpose circuit, and/or the memory may be implemented as hardware circuits, such as various logic blocks, modules, and circuits in an integrated circuit, using hardware description languages (such as Verilog HDL or VHDL) or other suitable programming languages.
In the form of software and/or firmware, the related functions of the transpose circuit 110 may be implemented as programming code. For example, the transpose circuit 110 may be implemented using general programming languages (e.g., C, C++, or assembly languages) or other suitable programming languages. The programming code may be recorded/stored in a “non-transitory computer readable medium”. In some embodiments, the non-transitory computer readable medium includes, for example, a semiconductor memory and/or a storage device. The semiconductor memory includes a memory card, a read only memory (ROM), a flash memory, a programmable logic circuit, or other semiconductor memories. The storage device includes a tape, a disk, a hard disk drive (HDD), a solid-state drive (SSD), or other storage devices. An electronic device (such as a central processing unit (CPU), a controller, a microcontroller, or a microprocessor) may read and execute the programming code from the non-transitory computer readable medium, thereby realizing the related functions of the transpose circuit 110.
The transpose circuit 110 may receive, from a matrix source (not shown), an element string ES1 representing a native matrix.
The transpose circuit 110 may transpose the element string ES1 into the element string ES2, wherein all elements of a native matrix are arranged in the element string ES1 in one of a “row-major manner” and a “column-major manner”, and the element string ES2 is equivalent to an element string in which all elements of the native matrix are arranged in the other one of the “row-major manner” and the “column-major manner”. For example, it is assumed that the content of the native matrix A is as shown in Equation 1 below, that is, the first row of the native matrix A is [X00, X01] and the second row is [X10, X11]. The content of the element string ES1 of the native matrix A arranged in the “row-major manner” is {X00, X01, X10, X11}. After the transposing function of the transpose circuit 110, the native matrix A is transposed into the element string ES2 arranged in the “column-major manner”, and the content of the element string ES2 is {X00, X10, X01, X11}.
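A software model of this reordering (for illustration only; the disclosure implements it as a hardware circuit) might look like:

```python
# Software model of the reordering performed by the transpose circuit:
# a row-major element string of an R x C native matrix is turned into
# the column-major element string of the same matrix.

def transpose_element_string(es, rows, cols):
    """Reorder a row-major element string into column-major order."""
    return [es[r * cols + c] for c in range(cols) for r in range(rows)]

ES1 = ["X00", "X01", "X10", "X11"]        # native matrix A, row-major
ES2 = transpose_element_string(ES1, 2, 2)
print(ES2)  # ['X00', 'X10', 'X01', 'X11'] -- matches the example above
```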
The memory 120 is coupled to the transpose circuit 110. The transpose circuit 110 transmits the element string ES2, obtained by transposing the element string ES1 of the native matrix A, to the memory 120. According to the actual design, the memory 120 may be any kind of memory. For example, in some embodiments, the memory 120 may be a static random access memory (SRAM), a dynamic random access memory (DRAM), a magnetoresistive random access memory (MRAM), a flash memory, or other memories. The memory 120 receives and stores the element string ES2 as an operand matrix for the next matrix operation.
The matrix multiplication circuit 230 is coupled to the transpose circuit 210, the memory 220 and the memory 240. The matrix multiplication circuit 230 may perform a previous layer of calculation of neural network calculations to generate native matrices. The matrix multiplication circuit 230 may serve as a matrix source to provide the element string ES1 of the native matrix to the transpose circuit 210. The transpose circuit 210 may transpose the element string ES1 to the element string ES2. The memory 220 is coupled to the transpose circuit 210 to receive and store the element string ES2. The matrix multiplication circuit 230 may read the element string ES3 (matrix A) from the memory 240 as a weight matrix, and read the element string ES2 (matrix B) from the memory 220 as an input matrix, so as to perform a next layer of calculation in the neural network calculation. In general, weight matrices are pre-trained parameters.
For example, assume that the memory 220 includes a DRAM. Based on the transpose operation of the transpose circuit 210, all elements of the same column of the native matrix (the result of the previous layer of calculation) may be stored at multiple consecutive addresses in the memory 220. The memory 220 provides all elements of the same column of the native matrix to the matrix multiplication circuit 230 in a burst mode, so that the matrix multiplication circuit 230 may perform the next layer of calculation of the neural network calculation.
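For the 2×2 case, the effect of this transpose-then-store path can be sketched as follows (illustrative Python; names such as `mem220` are ours):

```python
# Sketch: the previous layer's 2x2 result is stored column-major in the
# memory 220 by the transpose circuit, so each column the next layer
# needs occupies consecutive addresses and can be burst-read.

result = [["Y00", "Y01"],
          ["Y10", "Y11"]]   # native matrix from the previous layer

# Column-major element string as written into the memory 220:
mem220 = [result[r][c] for c in range(2) for r in range(2)]
print(mem220)          # ['Y00', 'Y10', 'Y01', 'Y11']

col0 = mem220[0:2]     # addresses 0..1 -> one burst
col1 = mem220[2:4]     # addresses 2..3 -> one burst
print(col0, col1)      # ['Y00', 'Y10'] ['Y01', 'Y11']
```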
This embodiment does not limit the matrix operation performed by the matrix multiplication circuit 230. In some application examples, the matrix operations may include matrix addition operations, matrix multiplication operations, multiply-accumulate (MAC) operations, and/or other matrix operations. For example, it is assumed that the content of the native matrix A is as shown in Equation 1 above, and the content of the native matrix B is as shown in Equation 2 below, that is, the first row of the native matrix B is [Y00, Y01] and the second row is [Y10, Y11]. A matrix Z is obtained by multiplying the two 2×2 matrices A and B, as shown in Equation 3 below.
The matrix multiplication performed by the matrix multiplication circuit 230 may include four steps. Step 1: The matrix multiplication circuit 230 may extract the elements [X00, X01] of the matrix A from the memory 240, extract the elements [Y00, Y10] of the matrix B from the memory 220, and calculate X00Y00+X01Y10. Step 2: The matrix multiplication circuit 230 may retain the elements [X00, X01] of the matrix A, extract the elements [Y01, Y11] of the matrix B from the memory 220, and calculate X00Y01+X01Y11. Step 3: The matrix multiplication circuit 230 may extract the elements [X10, X11] of the matrix A from the memory 240, extract the elements [Y00, Y10] of the matrix B from the memory 220, and calculate X10Y00+X11Y10. Step 4: The matrix multiplication circuit 230 may retain the elements [X10, X11] of the matrix A, extract the elements [Y01, Y11] of the matrix B from the memory 220, and calculate X10Y01+X11Y11. At this stage, the matrix multiplication circuit 230 may obtain the matrix Z shown in Equation 3.
The matrix multiplication performed by the matrix multiplication circuit 230 described in the preceding paragraph includes four steps, and the memories 220 and 240 are read six times in total. If the calculation is performed on the principle of data reuse, the matrix multiplication may be reduced from four steps to two optimized steps. Optimized step 1: The matrix multiplication circuit 230 may extract the elements [X00, X10] of the matrix A from the memory 240, extract the elements [Y00, Y01] of the matrix B from the memory 220, and calculate X00Y00, X00Y01, X10Y00, and X10Y01. Optimized step 2: The matrix multiplication circuit 230 may extract the elements [X01, X11] of the matrix A from the memory 240, extract the elements [Y10, Y11] of the matrix B from the memory 220, and calculate X01Y10, X01Y11, X11Y10, and X11Y11. At this stage, the matrix multiplication circuit 230 may obtain the matrix Z shown in Equation 3 using X00Y00, X00Y01, X10Y00, X10Y01, X01Y10, X01Y11, X11Y10, and X11Y11 from the optimized step 1 and the optimized step 2, and the memory 220 is read only twice.
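The two schedules can be compared with a small numeric sketch (we substitute integers for the symbolic elements so the results are checkable; the fetch counting is our simplified model, not part of the disclosure):

```python
# Compare the four-step multiply with the data-reuse (outer-product)
# schedule for Z = A x B, counting fetches from the memory 220 (matrix B).
# Integers stand in for the symbolic elements X.. and Y.. of the text.

A = [[1, 2],   # [X00, X01]
     [3, 4]]   # [X10, X11]
B = [[5, 6],   # [Y00, Y01]
     [7, 8]]   # [Y10, Y11]

# Four-step schedule: each output element fetches one column of B.
reads_B = 0
Z_naive = [[0, 0], [0, 0]]
for i in range(2):
    for j in range(2):
        reads_B += 1                      # fetch [B[0][j], B[1][j]]
        Z_naive[i][j] = sum(A[i][k] * B[k][j] for k in range(2))

# Data-reuse schedule: accumulate the outer product of A's column k and
# B's row k; each row of B is fetched once and reused for four products.
reads_B_opt = 0
Z_reuse = [[0, 0], [0, 0]]
for k in range(2):
    reads_B_opt += 1                      # fetch [B[k][0], B[k][1]]
    for i in range(2):
        for j in range(2):
            Z_reuse[i][j] += A[i][k] * B[k][j]

print(Z_naive, Z_reuse)        # both [[19, 22], [43, 50]]
print(reads_B, reads_B_opt)    # 4 2
```

Both schedules yield the same matrix Z, but the data-reuse schedule halves the number of fetches of the matrix B, mirroring the reduction described above.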
To sum up, the transpose circuit in the embodiments of the present disclosure is able to make the arrangement of elements in the memory match the access characteristics of the subsequent calculation through transposing. In this way, the matrix device may reduce the energy consumption and time required for accessing the memory, thereby effectively improving the efficiency of the matrix device.
Although the present disclosure has been disclosed in the above embodiments, it is not intended to limit the present disclosure, and those skilled in the art can make some modifications and refinements without departing from the spirit and scope of the disclosure. Therefore, the scope to be protected by the present disclosure is subject to the scope defined by the appended claims.
Number | Date | Country | Kind
---|---|---|---
111135607 | Sep 2022 | TW | national