The present disclosure generally relates to the field of data processing. More specifically, the present disclosure relates to an integrated circuit apparatus for matrix multiplication, a board card, a computing device, a computing system, and a method.
Data processing in the field of artificial intelligence usually involves a large number of operations, including matrix multiplication on various types of data. Taking machine learning, currently central to the field of artificial intelligence, as an example, many computing tasks involve large-scale matrix multiplication, especially multiplication of large matrices. Further, taking deep learning within machine learning as an example, deep learning includes many kinds of matrix multiplication, such as the multiplication of a weight matrix and an input vector in a fully connected layer and the multiplication of an input vector and a convolution kernel in a convolution layer. It can be appreciated that the larger the data volume and scale involved in the matrix multiplication, the higher the requirement on the storage capacity of a computing platform (especially an on-chip system).
Existing matrix multiplication is usually performed on a processor such as a central processing unit (CPU) or a graphics processing unit (GPU). However, since the processor is limited by the capacity of its internal registers, processing a large amount of data may require extensive data interaction between the processor and an external memory. Since the bandwidth of the input/output ("I/O") bus between the processor and the external memory is limited, a serious I/O bottleneck is likely to occur, causing delays in data transfer and greatly reducing the efficiency of parallel operations. Further, not only does the bandwidth limitation of the I/O bus become a bottleneck for system performance, but the large amount of I/O access between the processor and the external memory also adversely affects computing efficiency and power consumption.
To address at least the technical problems mentioned above, the present disclosure provides a hardware architecture and an operation method that can execute matrix multiplication efficiently, thereby reducing the amount of data transmitted to and from the external memory, minimizing the I/O bottleneck caused by the bus bandwidth limitation, and improving the efficiency of matrix multiplication. Specifically, the present disclosure provides this solution in the several aspects that follow.
A first aspect of the present disclosure discloses an integrated circuit apparatus for matrix multiplication, including: an interface unit, configured to acquire matrix data used for the matrix multiplication from an external memory, where the matrix data includes a first matrix and a second matrix, where the first matrix is divided into N² first matrix blocks, the second matrix is divided into N² second matrix blocks, and the matrix multiplication of the first matrix and the second matrix includes N² matrix multiplication tasks based on the N² first matrix blocks and the N² second matrix blocks, where N is a positive integer greater than or equal to 2; and N² master computing units, where the N² master computing units are connected sequentially to form a data transfer loop, and each master computing unit is configured to execute one corresponding matrix multiplication task among the N² matrix multiplication tasks and includes: a plurality of storage areas, configured to store matrix blocks used for executing the matrix multiplication tasks as well as intermediate results; and a control unit, configured to execute matrix block exchange with an adjacent master computing unit.
In executing the one corresponding matrix multiplication task described above, each master computing unit is configured to: acquire one first matrix block and one second matrix block related to the matrix multiplication task through the interface unit, and store the one first matrix block in a first storage area and the one second matrix block in a second storage area; execute matrix multiplication on the one first matrix block and the one second matrix block to obtain one intermediate result; execute N−1 matrix block exchanges with the adjacent master computing unit through the control unit and by using the first storage area and the second storage area, and execute matrix multiplication on the first matrix block and the second matrix block obtained after each exchange to obtain N−1 further intermediate results; and sum the N intermediate results to complete the related matrix multiplication task.
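For illustration only, and not as a part of the claimed apparatus, the block-level decomposition described above may be sketched in a few lines of NumPy (the matrix sizes and the value N=2 are assumptions of the example):

```python
# Illustrative sketch (assumed sizes, not the claimed hardware): for N = 2,
# each of the N^2 = 4 matrix multiplication tasks sums N intermediate
# results, one per block product, to produce one output block.
import numpy as np

N = 2
A = np.random.rand(6, 6)      # first matrix
B = np.random.rand(6, 6)      # second matrix
s = A.shape[0] // N           # side length of one matrix block

def block(X, i, j):
    """Return block (i, j) of X under an N x N partition."""
    return X[i*s:(i+1)*s, j*s:(j+1)*s]

C = np.zeros((6, 6))
for i in range(N):            # one (i, j) task per master computing unit
    for j in range(N):
        # N intermediate results are summed to complete task (i, j)
        C[i*s:(i+1)*s, j*s:(j+1)*s] = sum(
            block(A, i, k) @ block(B, k, j) for k in range(N))

assert np.allclose(C, A @ B)  # the N^2 block tasks reproduce the full product
```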
A second aspect of the present disclosure discloses a board card, including the integrated circuit apparatus described above and later in a plurality of embodiments.
A third aspect of the present disclosure discloses a computing device, including the board card described above and later in a plurality of embodiments.
A fourth aspect of the present disclosure provides a computing system, including the computing device described above and later in a plurality of embodiments.
A fifth aspect of the present disclosure discloses a method for matrix multiplication using the integrated circuit apparatus described above and later in a plurality of embodiments, including: acquiring, by using an interface unit of the integrated circuit apparatus, matrix data used for the matrix multiplication from an external memory, where the matrix data includes a first matrix and a second matrix, where the first matrix is divided into N² first matrix blocks, the second matrix is divided into N² second matrix blocks, and the matrix multiplication of the first matrix and the second matrix includes N² matrix multiplication tasks based on the N² first matrix blocks and the N² second matrix blocks, where N is a positive integer greater than or equal to 2; and executing, by using each master computing unit, the following operations: acquiring one first matrix block and one second matrix block related to a matrix multiplication task through the interface unit, and storing the one first matrix block in a first storage area and the one second matrix block in a second storage area; executing matrix multiplication on the one first matrix block and the one second matrix block to obtain one intermediate result; executing N−1 matrix block exchanges with an adjacent master computing unit through a control unit and by using the first storage area and the second storage area, and executing matrix multiplication on the first matrix block and the second matrix block obtained after each exchange to obtain N−1 further intermediate results; and summing the N intermediate results to complete the related matrix multiplication task.
A sixth aspect of the present disclosure provides a computer program product that includes program instructions used to execute matrix multiplication. When the program instructions are executed by one or more processors, the method described above and later in a plurality of embodiments is implemented.
By using the aforementioned integrated circuit apparatus, computing device, computing system, board card, and method of the present disclosure, the on-chip resources of an on-chip system may be fully utilized, and data sharing and transfer are implemented among the master computing units, thus significantly reducing I/O data interaction with the external memory and enabling efficient parallel execution of data transfer and multiplication. Further, by splitting the matrices at multiple levels in combination with the hardware architecture, the solution of the present disclosure reduces the complexity of the matrix multiplication and supports multiplication of super-large matrices. Besides, by significantly reducing the data interaction with the external memory, the solution of the present disclosure further improves the execution efficiency of matrix multiplication and alleviates the operation performance bottlenecks caused by on-chip and off-chip I/O bandwidth limitations, thereby improving the overall performance of the integrated circuit apparatus, the computing device, the computing system, or the board card.
By reading the following detailed description with reference to drawings, the above and other objects, features and technical effects of exemplary embodiments of the present disclosure will become easier to understand. In the drawings, several embodiments of the present disclosure are shown in an exemplary but not a restrictive manner, and the same or corresponding reference numerals indicate the same or corresponding parts.
The technical solutions in the embodiments of the present disclosure will be described clearly and completely hereinafter with reference to the drawings in the embodiments of the present disclosure. Obviously, the embodiments to be described are merely some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative efforts shall fall within the scope of protection of the present disclosure.
Specific implementations of the present disclosure will be described in detail in combination with drawings below.
Further, as shown in
As shown in
In order to realize data interaction with an adjacent master computing unit in the data transfer loop, each master computing unit of the present disclosure further includes a control unit, which is configured to execute matrix block exchange with the adjacent master computing unit. Therefore, by means of the interface unit between the integrated circuit apparatus and the external memory and the control unit of each master computing unit, the solution of the present disclosure enables the plurality of master computing units in the integrated circuit apparatus to acquire part of the matrix block data of their respective matrix multiplication tasks from the external memory and to acquire another part (or further parts) of the matrix block data from one or more adjacently connected master computing units through data interaction. Each master computing unit thereby obtains the matrix block data required for its corresponding matrix multiplication task and completes that task on this basis.
Specifically, in performing one corresponding matrix multiplication task, each master computing unit may be configured to acquire, through the interface unit, one first matrix block (from the first matrix) and one second matrix block (from the second matrix) related to that task and store them in a first storage area and a second storage area respectively. The first storage area and the second storage area may be two independent pieces of storage space allocated from the shared storage area and serve as buffer areas for intermediate data.
After acquiring the one first matrix block and the one second matrix block, the master computing unit of the present disclosure may execute matrix multiplication on them to obtain one intermediate result. As mentioned before, the matrix multiplication of the one first matrix block and the one second matrix block may be executed in parallel pipelines by the M² computing sub-units in the master computing unit. Thereafter, the master computing unit may execute N−1 matrix block exchanges with the adjacent master computing unit through the control unit and by using the first storage area and the second storage area, and execute matrix multiplication on the first matrix block and the second matrix block obtained after each exchange to obtain N−1 further intermediate results. For example, when N=2, so that four master computing units are connected sequentially in a loop, one master computing unit may acquire another first matrix block and another second matrix block from its two adjacently connected master computing units, thereby obtaining a further intermediate result. After obtaining the N intermediate results, the master computing unit of the present disclosure may sum these intermediate results to complete the related matrix multiplication task.
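One classical schedule that realizes exactly this "multiply, exchange N−1 times, accumulate" flow is Cannon's algorithm, which the present disclosure invokes below. The following NumPy simulation is purely illustrative (the grid size and block size are assumed; it models the data movement in software, not the disclosed hardware):

```python
# Illustrative software simulation of a Cannon-style schedule on an N x N
# grid: after an initial alignment, every grid position multiplies its
# resident blocks; blocks are then exchanged N - 1 times (A-blocks along
# rows, B-blocks along columns), and the N intermediate results are summed.
import numpy as np

N, s = 3, 2                              # grid size and block side (assumed)
A = np.random.rand(N*s, N*s)
B = np.random.rand(N*s, N*s)

def blocks(X):
    """Partition X into an N x N grid of s x s blocks."""
    return [[X[i*s:(i+1)*s, j*s:(j+1)*s] for j in range(N)] for i in range(N)]

a, b = blocks(A), blocks(B)
# Initial alignment: row i of the A-blocks shifts left by i positions,
# column j of the B-blocks shifts up by j positions.
a = [[a[i][(j + i) % N] for j in range(N)] for i in range(N)]
b = [[b[(i + j) % N][j] for j in range(N)] for i in range(N)]

c = [[np.zeros((s, s)) for _ in range(N)] for _ in range(N)]
for _ in range(N):                       # N rounds: multiply, then exchange
    for i in range(N):
        for j in range(N):
            c[i][j] += a[i][j] @ b[i][j]
    a = [[a[i][(j + 1) % N] for j in range(N)] for i in range(N)]  # A left
    b = [[b[(i + 1) % N][j] for j in range(N)] for i in range(N)]  # B up

assert np.allclose(np.block(c), A @ B)   # each position holds its C-block
```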
As mentioned before, the master computing unit of the present disclosure uses M² computing sub-units to execute specific matrix multiplication tasks. Under this arrangement, the matrix multiplication of the present disclosure may involve a case where the first matrix block and the second matrix block are further divided. Specifically, the first matrix block and the second matrix block may be divided into M² first matrix sub-blocks and M² second matrix sub-blocks respectively. On this basis, one matrix multiplication task of one master computing unit may include M² matrix multiplication sub-tasks based on the M² first matrix sub-blocks and the M² second matrix sub-blocks. Further, each of the M² computing sub-units may be configured to execute one corresponding matrix multiplication sub-task among the M² matrix multiplication sub-tasks.
Specifically, in performing the one corresponding matrix multiplication sub-task, each computing sub-unit may be configured to execute M matrix multiplications to obtain M intermediate sub-results. In particular, the computing sub-unit may acquire one first matrix sub-block and one second matrix sub-block related to the matrix multiplication sub-task from the shared storage area (namely, from the first storage area and the second storage area respectively). Next, the computing sub-unit may execute a matrix multiplication operation on the one first matrix sub-block and the one corresponding second matrix sub-block to obtain one intermediate sub-result. Finally, the computing sub-unit may sum the M intermediate sub-results to complete the related matrix multiplication sub-task.
Based on the internal architecture and matrix division of the integrated circuit apparatus of the present disclosure, the solution of the present disclosure also realizes a high degree of parallel operation. In particular, the N² master computing units may be configured to execute their respective matrix multiplication tasks in parallel, and the M² computing sub-units may be configured to execute their respective matrix multiplication sub-tasks in parallel. Besides, the matrix division of the present disclosure may be performed based on Cannon's algorithm rules. For example, the first matrix and the second matrix involved in the matrix multiplication of the present disclosure may be divided into the N² first matrix blocks and the N² second matrix blocks at the master computing unit level based on Cannon's algorithm rules. Next, at the computing sub-unit level, the one first matrix block and the one second matrix block may be further divided based on Cannon's algorithm rules to obtain the M² first matrix sub-blocks and the M² second matrix sub-blocks.
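The two-level division may be made concrete with the following short sketch (NumPy; the values of N, M, and the sub-block side are assumptions of the example): the matrix is first partitioned into N² blocks for the master computing units, and each block is further partitioned into M² sub-blocks for the computing sub-units.

```python
# Illustrative sketch of the two-level division: an (N*M*s) x (N*M*s) matrix
# is split into N^2 blocks (master computing unit level), and each block is
# split into M^2 sub-blocks (computing sub-unit level).
import numpy as np

N, M, s = 2, 2, 3          # masters per side, sub-units per side, sub-block side
X = np.arange((N*M*s)**2, dtype=float).reshape(N*M*s, N*M*s)

def block(Mtx, i, j, side):
    """Return block (i, j) of Mtx when partitioned into side x side blocks."""
    return Mtx[i*side:(i+1)*side, j*side:(j+1)*side]

X00 = block(X, 0, 0, M*s)  # matrix block for one master computing unit
x01 = block(X00, 0, 1, s)  # matrix sub-block for one computing sub-unit

# The two-level indexing addresses the same elements as direct slicing:
assert np.array_equal(x01, X[0:s, s:2*s])
```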
Through the descriptions in combination with
In an application scenario, the integrated circuit apparatus of the present disclosure may be applied to the field of artificial intelligence, especially to machine learning involving deep neural networks. For example, the integrated circuit apparatus of the present disclosure may execute, on a received first matrix and second matrix, a convolution operation involved in the neural network, which involves a large amount of matrix multiplication. To better understand how the integrated circuit apparatus of the present disclosure is applied in such a scenario, the following exemplarily describes the matrix multiplication involved in the convolution operation performed by the integrated circuit apparatus of the present disclosure according to Cannon's algorithm rules.
As shown in
As known by those skilled in the art, the convolution weight gradient, as the matrix multiplication result in this embodiment, may be computed in the process of neural network back propagation from the gradient of the convolution result of forward propagation. In an operation scenario, convolution weight gradient computing is equivalent to product accumulation computing between the convolution result gradient (for a four-dimensional matrix, the dimensions may be expressed as NiHiWiCi shown in the figure), serving as the first matrix in this embodiment, and the convolution input (for a four-dimensional matrix, the dimensions may be expressed as NoHoWoCo shown in the figure), serving as the second matrix in this embodiment. Here, N represents a sample count, H represents a matrix height, W represents a matrix width, and C represents a channel count. Further, according to the rules of matrix multiplication, the input matrix "convolution result gradient" may be expressed as Ci*NiHiWi, and the input matrix "convolution input" may be expressed as NoHoWo*Co. The convolution result gradient and the convolution input perform convolution weight gradient computing (namely multiplication and addition) along the NiHiWi and NoHoWo directions. Finally, the obtained output matrix "convolution weight gradient" may be expressed as Kh*Kw*Ci*Co (where Kh represents the height of the output matrix, Kw represents the width of the output matrix, Ci represents the channel count of the input matrix "convolution result gradient", and Co represents the channel count of the input matrix "convolution input"). For the sake of brevity, the figure only shows the convolution weight gradient computing in the Ci*Co direction, which is the matrix multiplication of the present disclosure.
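For illustration, the Ci*Co slice of this computation may be sketched as follows (NumPy; restricted for clarity to a single kernel position, i.e. Kh = Kw = 1, so that the Ni/Hi/Wi and No/Ho/Wo extents coincide; all tensor names and sizes are assumptions of the example, not of the disclosure):

```python
# Illustrative sketch: the "convolution result gradient" (NiHiWiCi) and the
# "convolution input" (NoHoWoCo) are multiply-accumulated along the NHW
# direction to yield the Ci x Co "convolution weight gradient".
import numpy as np

Nb, H, W, Ci, Co = 2, 4, 4, 3, 5
grad_out = np.random.rand(Nb, H, W, Ci)  # convolution result gradient
conv_in  = np.random.rand(Nb, H, W, Co)  # convolution input

# Flatten to the 2-D shapes given above: Ci x (NiHiWi) and (NoHoWo) x Co,
# then reduce along the NHW direction to obtain the Ci x Co result.
G = grad_out.reshape(-1, Ci).T           # Ci x (N*H*W)
X = conv_in.reshape(-1, Co)              # (N*H*W) x Co
dW = G @ X                               # Ci x Co convolution weight gradient

assert np.allclose(dW, np.einsum('nhwc,nhwd->cd', grad_out, conv_in))
```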
Based on the above exemplary data placement rules (including, for example, the matrix division method according to Cannon's algorithm) and the architecture of the four master computing units forming the closed loop, the first matrix "convolution result gradient" and the second matrix "convolution input" stored in the external memory may each be divided into four matrix blocks. For the sake of brevity, the four matrix blocks obtained by dividing the first matrix "convolution result gradient" are expressed as A00, A01, A10, and A11 shown in the
Based on the above data blocks, the master computing units may respectively execute the following formulas (1) to (4) to compute their corresponding convolution weight gradients C00, C01, C11, and C10.
C00 = A00*B00 + A01*B10 (1).
C01 = A00*B01 + A01*B11 (2).
C11 = A10*B01 + A11*B11 (3).
C10 = A10*B00 + A11*B10 (4).
Specifically, the solution of the present disclosure may use the four master computing units 0, 1, 2, and 3 to execute the computing tasks corresponding to the formulas (1) to (4), respectively, to obtain C00, C01, C11, and C10. In an operation scenario where Cannon's algorithm is used to execute the above multiplication of the matrix blocks, the positions of A10 and A11 of the input matrix "convolution result gradient" shown in
As mentioned before, each master computing unit may receive one corresponding first matrix block and one corresponding second matrix block from the external memory and execute the corresponding matrix multiplication computing. For example, the master computing unit 0 may receive the first matrix block "A00" of the first matrix "convolution result gradient" and the second matrix block "B00" of the second matrix "convolution input" from the external memory through the interface unit, and execute its first matrix multiplication task (A00*B00), which is a part of its overall matrix multiplication task, according to the formula (1), where "*" represents matrix multiplication. Similarly, the master computing unit 1 receives its corresponding first matrix block and second matrix block (A01 and B11) through the interface unit and executes its first matrix multiplication task (A01*B11) according to the formula (2). Likewise, the master computing units 2 and 3 respectively receive one first matrix block and one second matrix block, namely (A10 and B01) and (A11 and B10), through the interface unit, and execute their respective first matrix multiplication tasks (A10*B01) and (A11*B10) according to the formula (3) and the formula (4).
After receiving the matrix block data from the external memory and executing the first matrix multiplication, each master computing unit may receive another first matrix block and another second matrix block from an interconnected master computing unit. As mentioned above, each master computing unit of the present disclosure may use the bidirectional communication connection to send part of the matrix block data received from the external memory to its adjacent master computing units as the matrix block data of the second matrix multiplication task of those adjacent master computing units.
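The exchange just described may be checked with the following illustrative NumPy simulation (random blocks of an assumed size; the initial placement is the one described above, with units 0 to 3 holding (A00, B00), (A01, B11), (A10, B01), and (A11, B10) respectively). Each unit multiplies its resident pair, swaps blocks with its adjacent units over the transfer loop, multiplies again, and sums, reproducing the formulas (1) to (4):

```python
# Illustrative simulation (not the hardware) of the N - 1 = 1 exchange round
# for N = 2: four master computing units on the loop 0-1-2-3.
import numpy as np

s = 2                                            # block side length (assumed)
r = np.random.rand
A00, A01, A10, A11 = r(s, s), r(s, s), r(s, s), r(s, s)
B00, B01, B10, B11 = r(s, s), r(s, s), r(s, s), r(s, s)

a = {0: A00, 1: A01, 2: A10, 3: A11}             # initial A-block placement
b = {0: B00, 1: B11, 2: B01, 3: B10}             # initial B-block placement

partial = {u: a[u] @ b[u] for u in range(4)}     # first matrix multiplications

# One exchange round over adjacent connections: A-blocks swap between units
# 0<->1 and 2<->3; B-blocks swap between units 0<->3 and 1<->2.
a[0], a[1], a[2], a[3] = a[1], a[0], a[3], a[2]
b[0], b[3], b[1], b[2] = b[3], b[0], b[2], b[1]

result = {u: partial[u] + a[u] @ b[u] for u in range(4)}  # sum N = 2 results

# Units 0, 1, 2, 3 now hold C00, C01, C11, C10 of formulas (1) to (4).
full = np.block([[A00, A01], [A10, A11]]) @ np.block([[B00, B01], [B10, B11]])
assert np.allclose(result[0], full[:s, :s])      # C00
assert np.allclose(result[1], full[:s, s:])      # C01
assert np.allclose(result[2], full[s:, s:])      # C11
assert np.allclose(result[3], full[s:, :s])      # C10
```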
As mentioned above, obtaining "C00" may be seen as the matrix multiplication task of the master computing unit 0. According to the formula (1), the other first matrix block and second matrix block required for completing the second matrix multiplication task within the "C00" matrix multiplication task are "A01" and "B10". Further, it may be seen from
It may be seen from the above descriptions in combination with
As mentioned above, the matrix multiplication of the present disclosure may be executed by the plurality of computing sub-units in each master computing unit. Given this arrangement of a plurality of computing sub-units, the first matrix block and the second matrix block of the present disclosure may be further divided into a plurality of first matrix sub-blocks and a plurality of second matrix sub-blocks, and each matrix multiplication task (such as that of the formula (1), (2), (3), or (4)) may be divided into a plurality of matrix multiplication sub-tasks, each corresponding to one computing sub-unit in the plurality of computing sub-units. On this basis, each computing sub-unit may read, according to its related matrix multiplication sub-tasks, one corresponding first matrix sub-block and one corresponding second matrix sub-block from the shared storage area to execute matrix operations. For a better understanding, the following will discuss how each computing sub-unit completes its respective corresponding matrix multiplication sub-tasks according to the rules of Cannon's algorithm with reference to
As shown in
As shown in
c00 = a00*b00 + a01*b10 (5).
c01 = a00*b01 + a01*b11 (6).
c11 = a10*b01 + a11*b11 (7).
c10 = a10*b00 + a11*b10 (8).
According to the solution of the present disclosure, the four computing sub-units 0, 1, 2, and 3 shown in
Similar to the description with reference to
As shown in the top picture of
Based on the above description, those skilled in the art may understand that the computing result of each matrix multiplication sub-task in the first matrix multiplication task (such as A00*B00) of the master computing unit 0 is only an intermediate sub-result. Therefore, the plurality of matrix multiplication sub-tasks corresponding to the second matrix multiplication task (such as A01*B10) still need to be completed to obtain another intermediate result, so that the final computing result of the matrix multiplication task C00 related to the master computing unit 0 shown in FIG. 5B may be obtained by summing the two intermediate results. Specifically, the computing sub-unit 0, for example, may execute the matrix multiplication sub-task corresponding to the first matrix multiplication task (A00*B00) according to the formula (5) and take the obtained c00 as a first sub-result sub-c001. Next, the computing sub-unit 0 executes the matrix multiplication sub-task corresponding to the second matrix multiplication task (A01*B10) of the C00 to obtain a second sub-result sub-c002. Finally, sub-c001 and sub-c002 are summed to obtain the matrix sub-block c00 in the output matrix block C00. Considering that the right side of the formula (5) contains two addition terms, so that sub-c002 is itself obtained by adding two intermediate results, sub-c001 may be accumulated with the first intermediate result and the second intermediate result of sub-c002 sequentially to obtain the matrix sub-block c00. Specific operations will be described with reference to the computing operation arrays of the sixth time slice and the seventh time slice in FIG. 6.
By executing operations similar to those of the computing sub-unit 0, the computing sub-units 1, 2, and 3 may respectively obtain the matrix sub-blocks c01, c11, and c10 of the C00. As such, the four matrix sub-blocks c00, c01, c11, and c10 shown on the right side of FIG. 5B constitute the output matrix block C00 obtained by the master computing unit 0 executing its matrix multiplication task. Since the intermediate computing results (such as c00, c01, c11, and c10) of each computing sub-unit may be stored in the shared storage area of the corresponding master computing unit instead of in the external memory, the solution of the present disclosure may decrease the data exchange between the master computing unit and the external memory, thus reducing the I/O bottleneck caused by the external bandwidth limitation.
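The accumulation just described may be verified numerically with a short illustrative sketch (NumPy, assumed sub-block size): the sub-block c00 of the output matrix block C00 equals the sum of sub-c001 (obtained from A00*B00 via the formula (5)) and sub-c002 (obtained from A01*B10 in the same way).

```python
# Illustrative check of the two-level accumulation: c00 = sub-c001 + sub-c002
# equals the top-left sub-block of C00 = A00*B00 + A01*B10 (formula (1)).
import numpy as np

t = 2                                         # sub-block side length (assumed)
r = np.random.rand
A00, A01, B00, B10 = r(2*t, 2*t), r(2*t, 2*t), r(2*t, 2*t), r(2*t, 2*t)

sub = lambda M, i, j: M[i*t:(i+1)*t, j*t:(j+1)*t]

# Formula (5) applied within the first matrix multiplication task A00*B00 ...
sub_c001 = sub(A00, 0, 0) @ sub(B00, 0, 0) + sub(A00, 0, 1) @ sub(B00, 1, 0)
# ... and within the second matrix multiplication task A01*B10.
sub_c002 = sub(A01, 0, 0) @ sub(B10, 0, 0) + sub(A01, 0, 1) @ sub(B10, 1, 0)

c00 = sub_c001 + sub_c002                     # summed intermediate sub-results
C00 = A00 @ B00 + A01 @ B10                   # output block per formula (1)
assert np.allclose(c00, sub(C00, 0, 0))       # c00 is the top-left of C00
```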
Further, according to the above description, those skilled in the art may understand that the four computing sub-units included in the master computing unit in
Specifically,
To use on-chip I/O and computing resources effectively, the on-chip operations of the present disclosure may be ping-pong pipeline operations. Specifically, according to the solution of the present disclosure, the on-chip storage resources may be divided into two parts, "ping" and "pong". In an embodiment, when the ping storage resources are used to load data, the pong storage resources are used to execute the matrix multiplication; conversely, when the ping storage resources are used to execute the matrix multiplication, the pong storage resources are used to load data. Based on this resource allocation, the master computing unit of the present disclosure may execute parallel ping-pong pipeline operations.
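The role alternation of the two halves may be illustrated with the following software sketch (sequential Python; the tile contents and the load_tile helper are hypothetical stand-ins for loads from the external memory; in the hardware, loading and computing proceed concurrently):

```python
# Illustrative double-buffering ("ping-pong") sketch: while one half of the
# storage receives the next tile, the other half feeds the multiplier.
import numpy as np

def load_tile(k):
    """Hypothetical stand-in for loading tile pair k from external memory."""
    rng = np.random.default_rng(k)
    return rng.random((2, 2)), rng.random((2, 2))

num_tiles = 4
buffers = [None, None]                  # [ping, pong] storage resources
buffers[0] = load_tile(0)               # prime the ping buffer
acc = np.zeros((2, 2))

for k in range(num_tiles):
    a, b = buffers[k % 2]               # this half feeds the multiplier ...
    if k + 1 < num_tiles:               # ... while the other half is loaded
        buffers[(k + 1) % 2] = load_tile(k + 1)
    acc += a @ b                        # accumulate the tile product

expected = sum(np.matmul(*load_tile(k)) for k in range(num_tiles))
assert np.allclose(acc, expected)
```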
It may be seen from the figure that in the first time slice, the master computing unit 0 loads the B00 from the external memory and stores the B00 in the pong part of the shared storage area. In the second time slice, the master computing unit 0 loads the A00 from the external memory and stores the A00 in the ping part of the shared storage area. Meanwhile, the b00 of the B00 may be loaded to the computing sub-unit 0 in parallel. In the third time slice, the a00 of the A00 may be loaded to the computing sub-unit 0. Besides, during the third time slice and the fourth time slice, the master computing unit 0 sends the A00 to the interconnected master computing unit 1 through the control unit and sends the B00 to the interconnected master computing unit 3. Meanwhile, the master computing unit 0 receives A01 from the master computing unit 1 and B10 from the master computing unit 3 through the control unit.
In the data loading column of the fourth time slice, the b10 of the B00 and the a01 of the A00 may be loaded to the computing sub-unit 0; meanwhile, in the matrix multiplication operation column of the fourth time slice, a00*b00 of the A00 and the B00 is computed to obtain an intermediate sub-result. In the data loading column of the fifth time slice, the b00 of the B10 and the a00 of the A01 may be loaded to the computing sub-unit 0; meanwhile, in the matrix multiplication operation column of the fifth time slice, a01*b10 of the A00 and the B00 is computed to obtain an intermediate sub-result, which is accumulated with the intermediate sub-result of the previous time slice to obtain the intermediate result of the fifth time slice. In the data loading column of the sixth time slice, the b10 of the B10 and the a01 of the A01 may be loaded to the computing sub-unit 0; meanwhile, in the matrix multiplication operation column of the sixth time slice, a00*b00 of the A01 and the B10 is computed to obtain an intermediate sub-result, which is accumulated with the intermediate result of the previous time slice to obtain the intermediate result of the sixth time slice. In the matrix multiplication operation column of the seventh time slice, a01*b10 of the A01 and the B10 is computed to obtain an intermediate sub-result, which is accumulated with the intermediate result of the previous time slice to obtain the matrix sub-block c00 of the output matrix block C00.
During the data loading and computing from the third time slice to the seventh time slice, the pong part of the on-chip storage resources is used to receive the next group of B00 (B00′) and A00 (A00′) from the external memory to enable the master computing unit 0 to execute its next first matrix multiplication task. Next, from the eighth time slice, the computing sub-unit 0 stores the c00 of the C00 output in the previous time slice to the shared storage area. Meanwhile, the b00 of the next group B00′ and the a00 of the next group A00′ are loaded to the computing sub-unit 0 to be computed in the next time slice (which is not shown).
Similarly, the computing sub-units 1, 2, and 3 of the master computing unit 0, as well as the other master computing units and their corresponding computing sub-units, execute operations similar to those of the above eight time slices to obtain the corresponding matrix blocks of their respective output matrices. Since the input matrices "convolution result gradient" and "convolution input" may have a multi-dimensional structure, partial results along the three dimensions N, H, and W may be computed first and then accumulated. Then, the above computing is executed cyclically along the Ci and Co dimensions of the two input matrices to obtain the computing result of the output matrix "convolution weight gradient".
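The outer loop structure described here may be sketched as follows (illustrative NumPy with assumed sizes; the NHW extent is flattened into one axis, and the computation cycles over Ci and Co tiles):

```python
# Illustrative sketch of the cyclic computation: the NHW direction is reduced
# inside each tile, and the tiling cycles over the Ci and Co dimensions to
# fill the Ci x Co "convolution weight gradient" output.
import numpy as np

NHW, Ci, Co, T = 24, 4, 6, 2              # flattened NHW extent and tile size
grad_out = np.random.rand(NHW, Ci)        # flattened convolution result gradient
conv_in  = np.random.rand(NHW, Co)        # flattened convolution input

dW = np.zeros((Ci, Co))
for ci in range(0, Ci, T):                # cycle over Ci tiles
    for co in range(0, Co, T):            # cycle over Co tiles
        # multiply-accumulate along the NHW direction for this output tile
        dW[ci:ci+T, co:co+T] = grad_out[:, ci:ci+T].T @ conv_in[:, co:co+T]

assert np.allclose(dW, grad_out.T @ conv_in)
```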
As shown in
Specifically, in step 1004, the method 1000 acquires one first matrix block and one second matrix block related to the matrix multiplication task through the interface unit and stores the one first matrix block in a first storage area and the one second matrix block in a second storage area. Next, in step 1006, the method 1000 executes matrix multiplication on the one first matrix block and the one second matrix block to obtain one intermediate result. Thereafter, in step 1008, the method 1000 executes, through a control unit and by using the first storage area and the second storage area, N−1 matrix block exchanges with an adjacent master computing unit and executes matrix multiplication on the first matrix block and the second matrix block obtained after each exchange to obtain N−1 further intermediate results. Finally, in step 1010, the method 1000 sums the N intermediate results to complete the related matrix multiplication task.
For the sake of simplicity, the method of the present disclosure is described only in combination with
In different embodiments, the computing processing apparatus of the present disclosure may be configured to perform an operation specified by a user. In an exemplary application, the computing processing apparatus may be implemented as a multi-core artificial intelligence processor. Similarly, one or a plurality of computing apparatuses included in the computing processing apparatus may be implemented as an artificial intelligence processor core or part of a hardware structure of the artificial intelligence processor core. When the plurality of computing apparatuses are implemented as artificial intelligence processor cores or part of hardware structures of the artificial intelligence processor cores, the computing processing apparatus of the present disclosure may be regarded as having a single-core structure or an isomorphic multi-core structure.
In an exemplary operation, the computing processing apparatus of the present disclosure may interact with other processing apparatuses through the interface apparatus to jointly complete the operation specified by the user. According to different implementations, other processing apparatuses of the present disclosure may include one or more types of general and/or dedicated processors, including a central processing unit (CPU), a graphics processing unit (GPU), an artificial intelligence processor, and the like. These processors include but are not limited to a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic components, discrete gate or transistor logic components, discrete hardware components, and the like. Moreover, the number of the processors may be determined according to actual requirements. As described above, the computing processing apparatus of the present disclosure may be regarded as having the single-core structure or the isomorphic multi-core structure. However, when the computing processing apparatus and other processing apparatus are considered together, the computing processing apparatus and other processing apparatus may be regarded as forming a heterogeneous multi-core structure.
In one or a plurality of embodiments, the other processing apparatus may serve as an interface between the computing processing apparatus (which may be embodied as an artificial intelligence operation apparatus such as a neural network operation apparatus) of the present disclosure and external data and control, performing basic control operations that include but are not limited to moving data and starting and/or stopping the computing apparatus. In other embodiments, the other processing apparatus may also cooperate with the computing processing apparatus to jointly complete an operation task.
In one or a plurality of embodiments, the interface apparatus may be used to transfer data and a control instruction between the computing processing apparatus and other processing apparatus. For example, the computing processing apparatus may acquire input data from other processing apparatus via the interface apparatus and write the input data to an on-chip storage apparatus (or called a memory) of the computing processing apparatus. Further, the computing processing apparatus may acquire the control instruction from other processing apparatus via the interface apparatus and write the control instruction to an on-chip control cache of the computing processing apparatus.
Alternatively or optionally, the interface apparatus may further read data in the storage apparatus of the computing processing apparatus and then transfer the data to other processing apparatus.
Additionally or optionally, the combined processing apparatus of the present disclosure may further include a storage apparatus. As shown in the figure, the storage apparatus may be connected to the computing processing apparatus and the other processing apparatus respectively. In one or a plurality of embodiments, the storage apparatus may be used to save data of the computing processing apparatus and/or the other processing apparatus, for example, data that cannot be fully saved in the internal or on-chip storage of the computing processing apparatus or the other processing apparatus.
In some embodiments, the present disclosure also discloses a chip (such as a chip 1202 shown in
In one or a plurality of embodiments, the control component in the board card of the present disclosure may be configured to regulate and control a state of the chip. As such, in an application scenario, the control component may include a micro controller unit (MCU), which may be used to regulate and control a working state of the chip.
According to descriptions in combination with
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet, a smart terminal, a PC device, an Internet of Things terminal, a mobile terminal, a mobile phone, a traffic recorder, a navigator, a sensor, a webcam, a camera, a video camera, a projector, a watch, a headphone, a mobile storage, a wearable device, a visual terminal, an autonomous driving terminal, a vehicle, a household appliance, and/or a medical device. The vehicle includes an airplane, a ship, and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, an electric rice cooker, a humidifier, a washing machine, an electric lamp, a gas cooker, and a range hood; and the medical device includes a nuclear magnetic resonance spectrometer, a B-ultrasonic scanner, and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may be further applied to Internet, Internet of Things, data center, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction sites, medical, and other fields. Further, the electronic device or apparatus of the present disclosure may be further used in application scenarios including cloud, edge, and terminal related to artificial intelligence, big data, and/or cloud computing. In one or a plurality of embodiments, according to the solution of the present disclosure, an electronic device or apparatus with high computing power may be applied to a cloud device (such as the cloud server), while an electronic device or apparatus with low power consumption may be applied to a terminal device and/or an edge device (such as a smart phone or the webcam). In one or a plurality of embodiments, hardware information of the cloud device is compatible with that of the terminal device and/or the edge device. As such, according to the hardware information of the terminal device and/or the edge device, appropriate hardware resources may be matched from hardware resources of the cloud device to simulate hardware resources of the terminal device and/or the edge device to complete unified management, scheduling, and collaborative work of terminal-cloud integration or cloud-edge-terminal integration.
It is required to be explained that, for the sake of brevity, the present disclosure describes some method embodiments as a series of actions and combinations thereof, but those skilled in the art may understand that the solution of the present disclosure is not limited by an order of actions described. Therefore, according to the present disclosure or under the teaching of the present disclosure, those skilled in the art may understand that some steps of the method embodiments may be performed in a different order or simultaneously. Further, those skilled in the art may understand that the embodiments described in the present disclosure may be regarded as optional embodiments; in other words, actions and units involved thereof are not necessarily required for the implementation of a certain solution or some solutions of the present disclosure. Additionally, according to different solutions, descriptions of some embodiments of the present disclosure have their own emphases. In view of this, those skilled in the art may understand that, for a part that is not described in detail in a certain embodiment of the present disclosure, reference may be made to related descriptions in other embodiments.
In terms of specific implementations, according to the present disclosure and under the teaching of the present disclosure, those skilled in the art may understand that several embodiments disclosed in the present disclosure may be implemented through other methods that are not disclosed in the present disclosure. For example, for units in the electronic device or apparatus embodiment, the present disclosure divides the units on the basis of considering logical functions, but there may be other division methods during actual implementations. For another example, a plurality of units or components may be combined or integrated into another system, or some features or functions in the units or components may be selectively disabled. In terms of a connection between different units or components, the connection discussed above in combination with drawings may be direct or indirect coupling between the units or components. In some scenarios, the direct or indirect coupling relates to a communication connection using an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate components may be or may not be physically separated. Components shown as units may be or may not be physical units. The components or units may be located in a same position or distributed to a plurality of network units. Additionally, according to actual requirements, some or all of the units may be selected for achieving the purpose of the solution described in the embodiments of the present disclosure. Additionally, in some scenarios, a plurality of units in the embodiments of the present disclosure may be integrated into one unit, or each of the units may be physically separated.
In some implementation scenarios, the integrated unit may be implemented in the form of a software program unit. When the integrated unit is implemented in the form of the software program unit and sold or used as an independent product, the integrated unit may be stored in a computer-readable memory. Based on such understanding, when the solution of the present disclosure is embodied in the form of a software product (such as a computer-readable storage medium), the software product may be stored in a memory and may include several instructions used to enable a computer device (which may be a personal computer, a server, a network device, or the like) to perform some or all of the steps of the methods of the embodiments of the present disclosure. The foregoing memory includes but is not limited to a USB drive, a flash disk, a read-only memory (ROM), a random access memory (RAM), a mobile hard disk, a magnetic disk, an optical disc, or other media that may store program code.
In some other implementation scenarios, the integrated unit may be implemented in the form of hardware. The hardware may be a specific hardware circuit, which may include a digital circuit and/or an analog circuit. A physical implementation of a hardware structure of the circuit includes but is not limited to a physical component. The physical component includes but is not limited to a transistor, or a memristor, and the like. In view of this, various apparatuses (such as the computing apparatus or other processing apparatus) described in the present disclosure may be implemented by an appropriate hardware processor, such as a central processing unit (CPU), a graphics processing unit (GPU), a field-programmable gate array (FPGA), a digital signal processor (DSP), and an application-specific integrated circuit (ASIC), and the like. Further, the storage unit or the storage apparatus may be any appropriate storage medium (including a magnetic storage medium or a magneto-optical storage medium, and the like), such as an RRAM (resistive random access memory), a dynamic random access memory (DRAM), a static random access memory (SRAM), an enhanced dynamic random access memory (EDRAM), a high bandwidth memory (HBM), a hybrid memory cube (HMC), the ROM, and the RAM, and the like.
Based on the above sufficient disclosure of the present disclosure, those skilled in the art may understand that the present disclosure further discloses technical solutions recorded in following articles.
Article A1. An integrated circuit apparatus for matrix multiplication, including:
Article A2. The integrated circuit apparatus of article A1, where each master computing unit includes M² computing sub-units, and the first matrix block and the second matrix block are respectively divided into M² first matrix sub-blocks and M² second matrix sub-blocks, where one matrix multiplication task includes M² matrix multiplication sub-tasks based on the M² first matrix sub-blocks and the M² second matrix sub-blocks, where each computing sub-unit in the M² computing sub-units is configured to execute one corresponding matrix multiplication sub-task in the M² matrix multiplication sub-tasks, and in executing the one corresponding matrix multiplication sub-task, the computing sub-unit is configured to:
Article A3. The integrated circuit apparatus of article A2, where the first storage area and the second storage area are shared storage areas shared by the M² computing sub-units.
Article A4. The integrated circuit apparatus of article A2, where the plurality of storage areas of each master computing unit further include M² private sub-storage areas, where each private sub-storage area is related to one corresponding computing sub-unit and is configured to store an intermediate sub-result.
Article A5. The integrated circuit apparatus of article A2, where the N² master computing units are configured to execute respective related matrix multiplication tasks in parallel, and the M² computing sub-units are configured to execute respective related matrix multiplication sub-tasks in parallel.
Article A6. The integrated circuit apparatus of any one of articles A1-A5, where the first matrix and the second matrix are divided according to Cannon's algorithm rules to obtain the N² first matrix blocks and the N² second matrix blocks.
Article A7. The integrated circuit apparatus of any one of articles A2-A5, where the first matrix block and the second matrix block are divided according to Cannon's algorithm rules to obtain the M² first matrix sub-blocks and the M² second matrix sub-blocks.
Article A8. A board card, including the integrated circuit apparatus of any one of articles A1-A7.
Article A9. The board card of article A8, where when the board card includes P² integrated circuit apparatuses, the integrated circuit apparatuses are connected sequentially to form a data transfer loop to execute matrix multiplication on a first matrix and a second matrix that are respectively divided into P²*N²*M² matrix blocks, where P is a positive integer greater than or equal to 2.
Article A10. A computing device, including one or a plurality of board cards of article A8.
Article A11. A computing system, including a plurality of computing devices of article A10, where the plurality of computing devices are interconnected and work together to realize distributed matrix multiplication.
Article A12. A method for matrix multiplication using the integrated circuit apparatus of any one of articles A1-A7, including:
Article A13. The method of article A12, where the computing sub-unit is further used to execute following operations:
Article A14. The method of article A13, where the first storage area and the second storage area are shared storage areas shared by the M² computing sub-units.
Article A15. The method of article A13, where the plurality of storage areas of each master computing unit further include M² private sub-storage areas, where each private sub-storage area is related to one corresponding computing sub-unit and is configured to store an intermediate sub-result.
Article A16. The method of article A13, where the N² master computing units are used to execute respective related matrix multiplication tasks in parallel, and the M² computing sub-units are used to execute respective related matrix multiplication sub-tasks in parallel.
Article A17. The method of any one of articles A12-A16, including dividing the first matrix and the second matrix according to Cannon's algorithm rules to obtain the N² first matrix blocks and the N² second matrix blocks.
Article A18. The method of any one of articles A13-A16, where the first matrix block and the second matrix block are divided according to Cannon's algorithm rules to obtain the M² first matrix sub-blocks and the M² second matrix sub-blocks.
Article A19. A computer program product, including program instructions used for executing matrix multiplication, where, when the program instructions are executed by one or more processors, the method of any one of articles A12-A18 is implemented.
It should be understood that terms such as “first”, “second”, “third”, and “fourth” in claims, specification, and drawings are used for distinguishing different objects rather than describing a specific order. It should be understood that the terms “including” and “comprising” used in the specification and the claims indicate the presence of a feature, an entity, a step, an operation, an element, and/or a component, but do not exclude the existence or addition of one or more other features, entities, steps, operations, elements, components, and/or collections thereof.
It should also be understood that terms used in the specification of the present disclosure are merely for a purpose of describing a particular embodiment rather than limiting the present disclosure. As being used in the specification and the claims of the present disclosure, unless the context clearly indicates otherwise, singular forms such as “a”, “an”, and “the” are intended to include plural forms. It should also be understood that a term “and/or” used in the specification and the claims of the present disclosure refers to any and all possible combinations of one or more of relevant listed items and includes these combinations.
As being used in the specification and the claims of the present disclosure, a term “if” may be interpreted as “when”, or “once” or “in response to a determination” or “in response to a case where something is detected” depending on the context. Similarly, depending on the context, a clause “if it is determined that” or “if [a described condition or event] is detected” may be interpreted as “once it is determined that”, or “in response to a determination”, or “once [a described condition or event] is detected”, or “in response to a case where [a described condition or event] is detected”.
Even though the present disclosure has shown and described a plurality of embodiments, it is obvious to those skilled in the art that such embodiments are provided only by way of example. Those skilled in the art may conceive of many modifications, alterations, and substitutions without deviating from the idea and spirit of the present disclosure. It should be understood that various alternatives to the embodiments of the present disclosure described herein may be adopted in the practice of the present disclosure. The attached claims are intended to define the scope of protection of the present disclosure and therefore to cover the equivalents or alternatives within the scope of these claims.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202011610669.4 | Dec 2020 | CN | national |
This application claims benefit under 35 U.S.C. 119, 120, 121, or 365(c), and is a National Stage entry from International Application No. PCT/CN2021/142653, filed Dec. 29, 2021, which claims priority to the benefit of Chinese Patent Application No. 202011610669.4 filed in the Chinese Intellectual Property Office on Dec. 30, 2020, the entire contents of which are incorporated herein by reference.
| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/CN2021/142653 | 12/29/2021 | WO |