The present disclosure relates to an arithmetic system, and more particularly to an arithmetic processing apparatus which efficiently performs machine learning processing.
The speed of machine learning has improved dramatically through general-purpose computing on graphics processing units (GPGPU: General-Purpose computing on GPU).
Meanwhile, machine learning network models that use not only simple convolution operations but also vector inner product operations called attention mechanisms have become known to be useful in many applications, such as automatic translation and image processing (see, for example, Patent Literature (hereinafter, referred to as “PTL”) 1 and Non-Patent Literatures (hereinafter, referred to as “NPLs”) 1 and 2).
According to an aspect, an arithmetic processing apparatus is described or provided. The arithmetic processing apparatus can comprise a plurality of arithmetic units connected to one another in series in a network, wherein the plurality of arithmetic units are configured to, with respect to an M×N-dimensional first matrix a=[a0, a1, a2, . . . , aM-1] and an M×N-dimensional second matrix b=[b0, b1, b2, . . . , bM-1] each including M N-dimensional column vectors, for performing a calculation of obtaining an M×M-dimensional third matrix x=[x0, x1, x2, . . . , xM-1] by a product of matrices x=aTb, the M×M-dimensional third matrix including M M-dimensional column vectors having, as elements, dot products for combinations of all column vectors belonging to the M×N-dimensional first matrix and the M×N-dimensional second matrix, store one set of column vectors [am, bm] of the M×N-dimensional first matrix a=[a0, a1, a2, . . . , aM-1] and the M×N-dimensional second matrix b=[b0, b1, b2, . . . , bM-1] in a corresponding mth one of M arithmetic units being minimum component units of the arithmetic processing apparatus, send array data of N-dimensional column vectors bm from any arithmetic unit of the M arithmetic units, and propagate the array data sequentially to an adjacent arithmetic unit at a subsequent stage, calculate, in the mth arithmetic unit of the M arithmetic units, dot products of the column vectors b0, b1, b2, . . . , bM-1 of the M×N-dimensional second matrix and am that is stored in the mth one of the M arithmetic units, the dot products being M-dimensional column vectors xm=[am·b0, am·b1, am·b2, . . . , am·bM-1]T forming a part of an array of the M×M-dimensional third matrix, and store the M-dimensional column vectors xm in the mth arithmetic unit of the M arithmetic units, wherein am, bm, and xm are column vectors stored in the M arithmetic units as arrays.
An image processing arithmetic apparatus is generally designed with high versatility so that it can execute many types of matrix operations, and has a structure in which specific processing is performed by a software library running on the apparatus. In this case, it is known that extra power is consumed for the sake of general-purpose use and that speed decreases due to the software processing. Conversely, creating a fully dedicated circuit has the disadvantage of being unable to perform other calculations.
The present disclosure describes an arithmetic processing apparatus capable of speeding up the calculation of a network model based on an attention mechanism (hereinafter, also referred to as “attention”).
According to one or more aspects of the present disclosure, an arithmetic processing apparatus may comprise a plurality of arithmetic units connected to one another in series in a network, wherein the plurality of arithmetic units are configured to, with respect to an M×N-dimensional first matrix a=[a0, a1, a2, . . . , aM-1] and an M×N-dimensional second matrix b=[b0, b1, b2, . . . , bM-1] each including M N-dimensional column vectors, for performing a calculation of obtaining an M×M-dimensional third matrix x=[x0, x1, x2, . . . , xM-1] by a product of matrices x=aTb, the M×M-dimensional third matrix including M M-dimensional column vectors having, as elements, dot products for combinations of all column vectors belonging to the M×N-dimensional first matrix and the M×N-dimensional second matrix, store one set of column vectors [am, bm] of the M×N-dimensional first matrix a=[a0, a1, a2, . . . , aM-1] and the M×N-dimensional second matrix b=[b0, b1, b2, . . . , bM-1] in a corresponding mth one of M arithmetic units being minimum component units of the arithmetic processing apparatus, send array data of N-dimensional column vectors bm from any arithmetic unit of the M arithmetic units, and propagate the array data sequentially to an adjacent arithmetic unit at a subsequent stage, calculate, in the mth arithmetic unit of the M arithmetic units, dot products of the column vectors b0, b1, b2, . . . , bM-1 of the M×N-dimensional second matrix and am that is stored in the mth one of the M arithmetic units, the dot products being M-dimensional column vectors xm=[am·b0, am·b1, am·b2, . . . , am·bM-1]T forming a part of an array of the M×M-dimensional third matrix, and store the M-dimensional column vectors xm in the mth arithmetic unit of the M arithmetic units, wherein am, bm, and xm are column vectors stored in the M arithmetic units as arrays.
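As a concrete illustration of this dot-product step, the following is a minimal pure-Python sketch, not the disclosed hardware itself: each of M notional units holds one column pair (a_m, b_m), every b_k is propagated through the chain, and unit m accumulates x_m = [a_m·b_0, . . . , a_m·b_(M-1)]. All function and variable names are illustrative.

```python
# Sketch of the claimed dot-product step; names are illustrative, and the
# propagation of b_k along the daisy chain is modeled as a simple loop.

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def attention_scores(a_cols, b_cols):
    """a_cols, b_cols: M column vectors of length N; returns the columns x_m."""
    M = len(a_cols)
    x = [[0.0] * M for _ in range(M)]
    for k in range(M):          # b_k propagates along the daisy chain
        for m in range(M):      # each unit m updates its stored x_m
            x[m][k] = dot(a_cols[m], b_cols[k])
    return x

a = [[1.0, 0.0], [0.0, 1.0]]    # M=2 columns, N=2
b = [[2.0, 3.0], [4.0, 5.0]]
print(attention_scores(a, b))   # [[2.0, 4.0], [3.0, 5.0]]
```

Note that unit m only ever touches its own a_m and whatever b_k is currently passing by, which is what makes the nearest-neighbor-only communication of the network sufficient.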
For example, the arithmetic processing apparatus may be configured to, when calculating a column vector ym represented by the following equation 1,
which is stored in the mth arithmetic unit and forms a part of the data of a fifth matrix generated by the matrix product y=[y0, y1, y2, . . . , yM-1]=cx from the M-dimensional column vectors xm stored in the mth arithmetic unit and belonging to the M×M-dimensional third matrix, and from an M×N-dimensional fourth matrix c=[c0, c1, c2, . . . , cM-1] stored in advance, send an element of an N-dimensional column vector cm stored in each arithmetic unit to the network in the order cm0, cm1, cm2, . . . , cm(N-1) using the network in which the arithmetic units are connected to one another in series, calculate the N-dimensional column vector ym, and store the N-dimensional column vector ym in the corresponding mth arithmetic unit.
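Under this reading, equation 1 amounts to the weighted sum y_m = Σ_k x_m[k]·c_k, with each c_k streamed element by element over the chain. A hypothetical pure-Python sketch of one unit's work (names are illustrative, not from the disclosure):

```python
def weighted_sum(x_m, c_cols):
    """Unit m holds its score column x_m; each c_k arrives over the chain
    in the element order c_k0, c_k1, ..., and is scaled by x_m[k]."""
    N = len(c_cols[0])
    y_m = [0.0] * N
    for k, c_k in enumerate(c_cols):      # c_k is sent from unit k
        for n in range(N):                # elements arrive in order c_k0, c_k1, ...
            y_m[n] += x_m[k] * c_k[n]
    return y_m

# x_m = [1, 2] with c_0 = [1, 0] and c_1 = [0, 1] gives y_m = [1, 2]
print(weighted_sum([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]]))
```

If c holds value vectors, this step would correspond to the attention-weighted sum over values in a Transformer-style model, though the disclosure does not name it as such here.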
In the above aspects, the arithmetic processing apparatus may be configured to, for performing a calculation of a linear layer for M channels with respect to a column vector dm=[dm0, dm1, dm2, . . . , dm(L-1)]T that constitutes an M×L-dimensional matrix, sequentially input an N×L-dimensional coefficient matrix U and an N-dimensional bias vector V from the outside to be propagated through the series-connected network, and calculate zm represented by the following equation 2, together with the column vector dm stored in advance in the corresponding mth arithmetic unit,
store an mth N-dimensional column vector zm in a corresponding one of the arithmetic units, wherein u and v represent an element of the N×L-dimensional coefficient matrix U and an element of the N-dimensional bias vector V, respectively.
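Equation 2 is, in effect, the per-channel affine map z_m = U d_m + V applied by each unit to its stored d_m as the rows of U and the elements of V stream past. A minimal sketch under that reading (pure Python; names illustrative):

```python
def linear_layer(d_m, U, V):
    """z_m[n] = V[n] + sum_l U[n][l] * d_m[l]; the rows of U and the
    elements of V stream past every unit, which applies them to its d_m."""
    return [v + sum(u * d for u, d in zip(row, d_m)) for row, v in zip(U, V)]

# U = identity, V = [1, 1], d_m = [2, 3]  ->  z_m = [3.0, 4.0]
print(linear_layer([2.0, 3.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0]))
```

Because U and V are broadcast once over the shared chain, all M channels reuse the same coefficient stream while keeping their own d_m and z_m locally.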
In the above aspects, for implementing multi-head attention, the arithmetic processing apparatus may include a set of the network and the M arithmetic units for each of H heads, and may be configured to divide, by the number H of heads, the number of rows of the coefficient matrix of the linear layer immediately before the division among the heads, sequentially input a segment matrix resulting from the division into a communication channel for the hth head, and calculate zhm represented by the following equation 3, together with the shared computed vector data dm,
The arithmetic processing apparatus may be configured to, for implementing the multi-head attention, in order to concatenate the vectors of the multiple heads, divide, by the number H of heads, the number of columns of the coefficient matrix of the linear layer immediately after the combination of the vectors of the multiple heads, sequentially input a segment matrix resulting from the division into the communication channel for the hth head, sum the partial sums for the respective heads, and store the sum of the partial sums in a predetermined arithmetic unit.
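The two divisions described above (splitting the coefficient matrix by rows ahead of the heads, and by columns after them with the partial sums then added) can be sketched as follows. This is pure Python with hypothetical names; the addition network is modeled as a plain element-wise sum:

```python
def split_rows(U, H):
    """Row-wise split of a coefficient matrix among H heads (pre-head layer)."""
    r = len(U) // H
    return [U[h * r:(h + 1) * r] for h in range(H)]

def combine_heads(W, z_heads):
    """Column-wise split of W among heads; each head forms a partial
    product, and the partial sums are added (the addition network's role)."""
    out = [0.0] * len(W)
    col = 0
    for z_h in z_heads:
        W_h = [row[col:col + len(z_h)] for row in W]        # this head's columns
        for i, row in enumerate(W_h):
            out[i] += sum(w * z for w, z in zip(row, z_h))  # partial sum per head
        col += len(z_h)
    return out

# W = [[1, 2, 3, 4]] split into two heads' columns; both heads output [1, 1]
print(combine_heads([[1.0, 2.0, 3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]))  # [10.0]
```

Splitting by columns rather than concatenating first means no head ever needs another head's vector, only the final scalar partial sums meet, which is why a single orthogonal addition network suffices.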
In the above aspects, the arithmetic processing apparatus may be configured to include the network including, as a basic structure, a data readout daisy chain configured to sequentially read out data to the network and a data input daisy chain configured to input data to the arithmetic units, wherein an output of the data readout daisy chain is connected to an input of the data input daisy chain, and wherein the data input daisy chain is further configured to sequentially input the read-out data to the arithmetic units in series.
In the above aspects, the arithmetic processing apparatus may be configured to sequentially propagate an arithmetic code together with data through the network in series.
In the above aspects, the arithmetic processing apparatus may be configured to, in order to process the column vector data of multiple heads without adding hardware, simulate multi-head processing by performing the transfer and reception of arrays and the accumulation of product-sum operations in a divided, sequential manner in head order.
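The time-multiplexed variant can be pictured as one unit set looping over the heads in head order, reusing the same product-sum datapath instead of H parallel sets. A hypothetical sketch (illustrative names, pure Python):

```python
def linear(d, U, V):
    """Simple affine map reused for every head."""
    return [v + sum(u * x for u, x in zip(row, d)) for row, v in zip(U, V)]

def multihead_sequential(d_m, U_heads, V_heads):
    """Process the H heads one after another on a single unit set,
    in head order, rather than on H parallel sets of hardware."""
    return [linear(d_m, U_h, V_h) for U_h, V_h in zip(U_heads, V_heads)]

# Two heads, each a 1x2 coefficient segment with zero bias
print(multihead_sequential([2.0, 3.0],
                           [[[1.0, 0.0]], [[0.0, 1.0]]],
                           [[0.0], [0.0]]))  # [[2.0], [3.0]]
```

The result is identical to the H-parallel case; only the wall-clock time scales with H, which is the usual hardware-versus-latency trade.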
The arithmetic processing apparatus may include, in order to handle a different number of channels or a different number of sequences, the network divided according to the number of channels or the number of sequences, and a circuit element configured to couple the data readout daisy chain to the data input daisy chain.
According to the above aspects, the network is capable of performing almost all processing by sending data only to adjacent arithmetic elements, except for the processing of adding partial sums when matrices are combined. Accordingly, high, mutually independent parallelism is achieved, making it possible to accelerate the calculation of a network model based on an attention mechanism.
Next, the present disclosure will be described with reference to the drawings.
The arithmetic processing apparatus 10 includes H independent arithmetic unit sets 16 in order to perform segment matrix processing, that is, multi-head processing with H-way parallelism. An addition network 18, which sums the H partial sums produced when the inner products of the divided rows and columns are computed instead of combining the matrices after the segment matrix processing is completed, is disposed orthogonally to the first and second daisy chains 12 and 13. Note that, in a case where the cost of implementing an arithmetic unit set is high, this processing may be performed by a separate Central Processing Unit (CPU) or the like. Further, the arithmetic processing apparatus 10 includes a second multiplexer 19 for appropriately propagating data in a case where the first and second daisy chains 12 and 13 are divided in accordance with the number of dimensions of the vectors to be processed. Hereinafter, a method for calculating a dot product or a linear layer in the present embodiment is described.
When calculating the matrix product x=aTb of the first matrix a and the second matrix b, that is, the M M-dimensional column vectors xm=[am·b0, am·b1, am·b2, . . . , am·bM-1]T, the arithmetic unit set 16 continuously outputs the elements of bm from the arithmetic units 11 to the second daisy chain 13 in the following order: b0=[b00, b01, b02, . . . , b0(N-1)], b1=[b10, b11, b12, . . . , b1(N-1)], b2=[b20, b21, b22, . . . , b2(N-1)], . . . , bM-1=[b(M-1)0, b(M-1)1, b(M-1)2, . . . , b(M-1)(N-1)]. These arrays are transferred to the first daisy chain 12 by the first multiplexer 15, and each arithmetic unit 11 receives the data bmn in order from m=0 and executes the product-sum operation. By transferring the array data b in this order, the products can be accumulated continuously. The resulting data of the M-dimensional column vectors xm is stored in the scratch pad SRAM 20 again. On the other hand, in a case where the initial sequence length is M=2, a path is selected such that the subsequent elements at m=2 and thereafter are shortcut to the second daisy chain 13 by the second multiplexer 19, so that a sequence different from the sequence of length 2 using m=0, 1 can be handled. Those skilled in the art will understand that this calculation corresponds to the dot-product attention of query q and key k in, for example, a Transformer.
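The element-streaming order just described can be modeled directly: b arrives as the flat stream b00, b01, . . . , b(M-1)(N-1), and every unit performs one multiply-accumulate per arriving element. A pure-Python sketch (illustrative names, no claim to the actual circuit timing):

```python
def stream_scores(a_cols, b_cols):
    """b streams element by element in the order b00, b01, ..., and each
    unit m accumulates a_m . b_k with one multiply-add per element."""
    M, N = len(a_cols), len(a_cols[0])
    x = [[0.0] * M for _ in range(M)]
    for k in range(M):                    # array b_k enters the chain
        for n in range(N):                # its elements, in order
            b_kn = b_cols[k][n]
            for m in range(M):            # every unit consumes the element
                x[m][k] += a_cols[m][n] * b_kn
    return x

print(stream_scores([[1.0, 0.0], [0.0, 1.0]], [[2.0, 3.0], [4.0, 5.0]]))
# [[2.0, 4.0], [3.0, 5.0]]
```

Streaming in this order means each unit needs only a single running accumulator per incoming array, which is what allows the products to be accumulated continuously as the text describes.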
the arithmetic unit set 16 reads, from each arithmetic unit 11, the set c of arrays stored in each of the arithmetic units 11 in an order of c00, c10, c20, . . . , c(M-1)0, c01, c11, c21, . . . , c(M-1)1, c02, c12, . . . , c(M-1)(N-1) different from the order in
The resulting N-dimensional data zm is stored again in the scratch pad SRAM 20. This calculation corresponds, for example in the Transformer, to generating each of the vectors of query q, key k, and value v from arbitrary vector data by a linear layer. Here, for the sake of simplicity, segment matrix processing is not assumed.
Here, the operation on the operand vector in the linear layer can be performed by providing a common copy of the operand vector in the arithmetic units involved in the segment matrix processing, and can also be realized by providing a communication channel shared in the head division direction or by providing a storage area common to the units involved in the segment matrix processing.
Further, combining the matrices divided among multiple heads can be achieved by dividing, in the column direction, the coefficient matrix of the linear layer that performs the operation immediately after the combination, sequentially inputting the resulting segments into the communication channel of each head, and summing the partial sums divided among the heads by the addition network 18 shown in
Since an instruction can be executed by propagating the operation codes sequentially to the arithmetic units 11, the operation codes may be transmitted along the first daisy chain 12 and/or the second daisy chain 13 of the network 14.
In
Further, as illustrated in
Number | Date | Country | Kind |
---|---|---|---|
2022-117909 | Jul 2022 | JP | national |
This application is a continuation of PCT International Application No. PCT/JP2023/027078, filed on Jul. 24, 2023, which is based on and claims priority to Japanese Patent Application No. 2022-117909, filed Jul. 25, 2022, each of which is incorporated herein by reference in its entirety.
Number | Date | Country | |
---|---|---|---|
Parent | PCT/JP2023/027078 | Jul 2023 | WO |
Child | 19036053 | US |