The present disclosure relates to the field of artificial intelligence algorithms and hardware, and in particular, to a process-in-memory architecture based on a resistive random access memory and a matrix decomposition acceleration algorithm and configured for transformer acceleration.
A Transformer has become a popular deep neural network (DNN) model in natural language processing (NLP) applications and shows excellent performance in aspects such as neural machine translation and entity recognition. Transformer-based models, such as the generative pre-trained Transformer (GPT), Vision Transformer (ViT) and Swin-Transformer, have become some of the most important advances in the field of artificial intelligence. These models achieve higher accuracy than conventional convolutional neural networks (CNNs) and break the dominance of CNNs in various artificial intelligence tasks. Self-attention is an important mechanism in the Transformer model; it is configured to perform self-attention computing on each element in an input sequence and obtain a self-attention representation of each element. The mechanism may capture dependency relationships between the elements in the sequence, realize long-distance dependency modeling, support a multi-head attention mechanism and lower computing complexity, and is one of the cores of the Transformer model. Thus, its computation is also one of the core mathematical operations of the Transformer model.
Unfortunately, the excellent accuracy of the Transformer-based model comes at the expense of more operations, and an existing processor designed for the CNN cannot process these operations effectively. Although researchers have successfully applied the process-in-memory (PIM) architecture based on the resistive random access memory (ReRAM) to acceleration of convolutional neural networks (CNNs) and recurrent neural networks (RNNs), the unique computing process of scaled dot-product attention in the Transformer makes direct application of these designs difficult. In addition, the Transformer has to execute many matrix-to-matrix multiplication (MatMul) operations in which both matrixes are intermediate results of previous layers. These intermediate results need to be written into a computing device beforehand, and such write operations may suspend the computing process and reduce speed and energy efficiency.
An objective of the present disclosure is to provide a process-in-memory architecture based on a resistive random access memory and a matrix decomposition acceleration algorithm to address the defects in the prior art. The present disclosure transforms the original matrix-to-matrix multiplication into matrix-to-vector multiplication and vector dot products by decomposing a matrix using the properties of a symmetric matrix, and, in addition, further optimizes the self-attention computing process based on the Re-Transformer by supporting the computing process with the introduced hardware architecture.
The objective of the present disclosure is implemented through the following technical solutions. A process-in-memory architecture based on a resistive random access memory and a matrix decomposition acceleration algorithm includes:
Further, the three network parameter weight matrixes WQ, WK and WV are randomly initialized when an initial Transformer neural network performs self-attention computation, and a query matrix Q, a key-value matrix K and a value matrix V are obtained by multiplying the input matrix X, to which positional information has been added, by the respective weight matrixes.
Further, the self-attention computation of the Transformer neural network is calculated as softmax(Q·K^T/dk^0.5)·V, and the attention score Out=Q·K^T is expanded as Q·K^T=(X·WQ)·(X·WK)^T=X·(WQ·WK^T)·X^T;
Further, after the orthogonal matrix P, the invertible matrix U, the diagonal matrix D and the block diagonal matrix C are obtained, the attention score Out is represented as:
Out=X·(WQ·WK^T)·X^T=X·A·X^T+X·B·X^T=(X·P)·D·(X·P)^T+(X·U)·C·(X·U)^T,
where P′=X·P and U′=X·U, and the attention score Out is simplified as follows: Out=P′·D·P′^T+U′·C·U′^T.
Further, after the attention score is obtained, the data softmax[(P′·D·P′^T+U′·C·U′^T)/dk^0.5] is obtained using the hybrid softmax computing array of the process-in-memory logic based on the resistive random access memory, and a final self-attention weight coefficient Z is obtained by successively multiplying by the input matrix X and the weight matrix WV, expressed as:
Z=softmax[(P′·D·P′^T+U′·C·U′^T)/dk^0.5]·X·WV.
Further, complex matrix-to-matrix multiplication operation is simplified into matrix-to-vector multiplication and vector dot products using relevant properties of the symmetric matrix.
Further, the resistive random access memory is used in a Transformer for matrix-vector multiplication and softmax operations.
Compared with the prior art, the beneficial effects of the present disclosure are as follows: the present disclosure further optimizes the self-attention computing process based on the Re-Transformer using properties of the matrixes, simplifies the original matrix-to-matrix multiplication into matrix-to-vector multiplication and vector dot products, and computes the intermediate results using an in-memory computing array represented by the resistive random access memory. The solution optimizes the matrix multiplication computing process and uses in-memory computing of the resistive random access memory, thereby reducing the need for many write and storage operations, the power consumption and latency, and the space required for storing the intermediate results. Besides, the sparse array design optimization method is an effective optimization technique that enhances computational efficiency while reducing storage space. Therefore, the present disclosure decomposes part of the original weight matrixes by using the properties of the matrixes, reducing the sparsity of the weight matrixes and further improving the utilization and accuracy of the in-memory computing arrays.
Related instances and solutions may be shown clearly and vividly by replacing part of the text with pictures in the accompanying drawings, which are not necessarily drawn to scale. The accompanying drawings illustrate the specification and the claims vividly and specifically, mainly in the form of schematic diagrams, instead of serving as an accurate reference.
In order to make those skilled in the art better understand the technical solutions of the present disclosure, the present disclosure is described in detail below with reference to the accompanying drawings and specific implementations. Embodiments of the present disclosure are further described in detail below with reference to the accompanying drawings and specific embodiments, which do not serve as a limitation on the present disclosure. If there is no necessity of a sequential relationship between the various steps described herein, the order in which the various steps are described as an example herein is not to be regarded as a limitation; those skilled in the art should know that the order of the various steps may be adjusted as long as the algorithmic logic between them is not violated and the whole process may be implemented.
Transformer Neural Network:
The Transformer is, from a macroscopic perspective, a seq2seq model based on an attention mechanism: an encoder is responsible for mapping an input sequence (x1, . . . , xa) to a sequence of continuous representations z=(z1, . . . , za). For a given z, a decoder generates an output sequence (y1, . . . , yb) one element at a time. At each step, the model is autoregressive. The Transformer follows this type of overall architecture, which is different from a conventional convolutional neural network (CNN) and a recurrent neural network (RNN), and the whole network is completely composed of a self-attention mechanism (Self-Attention) and a feed-forward neural network. In general, the Transformer neural network is composed of an Encoder and a Decoder.
The Encoder is composed of N identical Encoder blocks which are stacked, and the Decoder is likewise composed of N identical Decoder blocks which are stacked. The Encoder plays a role in computing the correlation of data between input sequences. Each Encoder block includes a Multi-Head Attention layer for computing global-scope information of the input data. Self-Attention adopts a parallel computing strategy for simultaneously processing all input sequential data, so the computing efficiency of the model is very high. Stacking a plurality of Encoder layers may better explore the potential connections of data between sequences. The Decoder plays a role in outputting a to-be-predicted sequence by combining the global related information between the previous sequences.
Self-Attention Mechanism (Self-Attention):
The conventional Attention mechanism operates between elements of a target set (Target) and elements of a source set (Source). Briefly speaking, computing the weights in the Attention mechanism needs the participation of the Target. That is, in an Encoder-Decoder model, computing the Attention weights needs not only a hidden state in the Encoder, but also a hidden state in the Decoder. The Self-Attention mechanism performs parallel computing for the pairwise weight relationship between each vector in the input sequence and all other vectors, which is a relationship in a global scope. Thus, the corresponding matrix operations need to be performed only on the Source instead of using information in the Target. The computing steps of the conventional self-attention mechanism are shown in
The three network parameter weight matrixes WQ, WK and WV are randomly initialized when an initial Transformer neural network performs self-attention computing, and a query matrix Q, a key-value matrix K and a value matrix V are obtained by multiplying the input matrix X, to which position information is added, by the respective weight matrixes, so as to assist self-attention computing:
Q=X·WQ
K=X·WK
V=X·WV
WQ, WK, WV∈R^(m×n)
Out=Q·K^T,
where m and n are the number of rows and the number of columns of the weight matrixes, respectively; Out represents an attention score; and a superscript T represents a transpose of a matrix.
The self-attention weight coefficient Z is obtained by computing through three matrixes of Q, K and V, and a computing formula is as follows:
Z=Attention(Q,K,V)=softmax(Q·K^T/dk^0.5)·V,
where dk represents a dimension of the key-value matrix K, and Attention is the self-attention function. In order to prevent a situation in which a too large inner product of Q and K causes the self-attention weight coefficient to deviate to the extreme, dividing by dk^0.5 during computing plays a buffering role. The Attention function is configured to compute the degree of attention of the data at each position in the input sequence with respect to the data at all positions; the higher the degree of relation between two elements, the greater the corresponding attention weight. There is no sequential order of the input vectors in the whole computing process, and parallel computing is implemented through the matrix operations.
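For illustration only (not part of the claimed subject matter), the conventional self-attention computation described above may be sketched in Python with NumPy as follows; the matrix sizes and the random initialization are assumptions made purely for demonstration.

import numpy as np

def self_attention(X, WQ, WK, WV):
    # Conventional self-attention: Z = softmax(Q.K^T / dk^0.5) . V
    Q = X @ WQ                      # query matrix Q = X.WQ
    K = X @ WK                      # key-value matrix K = X.WK
    V = X @ WV                      # value matrix V = X.WV
    dk = K.shape[-1]                # dimension of the key-value matrix K
    scores = Q @ K.T / np.sqrt(dk)  # attention score Out, scaled by dk^0.5
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V              # self-attention weight coefficient Z

# Demonstration with assumed sizes: sequence length 4, model width 8.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
WQ, WK, WV = (rng.standard_normal((8, 8)) for _ in range(3))
Z = self_attention(X, WQ, WK, WV)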
Yang, et al. propose a PIM architecture based on ReRAM in the paper “ReTransformer: ReRAM-based Processing-in-Memory Architecture for Transformer Acceleration” for accelerating the Transformer, a core concept of which mainly includes:
In the matrix linear transformation, the weight matrix WK is transformed to WK^T while the others are not transformed, and Q is computed at the same time. Compared with the initial Transformer, only Q is computed in this step.
During computing of the attention score, the formula (Q·WK^T)·X^T is adopted, and an extra X^T needs to be cached in this step.
Hybrid Softmax operation is performed on each attention score, and this softmax operation adopts a selection and comparison logic of ReRAM.
The softmax operation result is right multiplied by the input matrix X and the weight matrix WV in succession to obtain the self-attention weight coefficient Z. Total computing steps are shown in
The detailed computing process is shown in
Q=X·WQ
R=Q·WK^T
Out=R·X^T
Z=[softmax(R·X^T/dk^0.5)·X]·WV,
where R is an intermediate matrix.
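As a reference-only sketch (the variable names are illustrative and the formulation is restated from the equations above, not taken from the Re-Transformer implementation), the reordered computation can be checked numerically to match the conventional attention score:

import numpy as np

def retransformer_scores(X, WQ, WK):
    # Re-Transformer ordering: Q = X.WQ, R = Q.WK^T, Out = R.X^T
    Q = X @ WQ
    R = Q @ WK.T           # intermediate matrix R
    return R @ X.T         # attention score Out

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 8))
WQ, WK = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
# Equals the conventional (X.WQ).(X.WK)^T up to floating-point error.
assert np.allclose(retransformer_scores(X, WQ, WK), (X @ WQ) @ (X @ WK).T)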
The present disclosure provides the process-in-memory architecture based on the resistive random access memory and the matrix decomposition acceleration algorithm, which is configured for Transformer neural network acceleration and is recorded as an Eigen-Transformer, and whose innovations lie mainly in the aspects of the self-attention acceleration algorithm and the hardware architecture.
In the aspect of self-attention acceleration algorithm, self-attention computing of the initial Transformer is further optimized based on the Re-Transformer by using matrix decomposition and some properties of the symmetric matrix, a core concept of which is shown in
Any real square matrix may be decomposed into a sum of a real symmetric matrix and an antisymmetric matrix.
Any real symmetric matrix is orthogonally similar to a diagonal matrix, and the elements on the main diagonal of the diagonal matrix are all the eigenvalues of the real symmetric matrix.
Any antisymmetric matrix may be decomposed and transformed to a block diagonal matrix through Schur decomposition.
In an example, conventional self-attention computing of the Transformer is softmax(Q·K^T/dk^0.5)·V, where Q·K^T is expanded to be:
Out=Q·K^T=(X·WQ)·(X·WK)^T=X·(WQ·WK^T)·X^T
WQ·WK^T=Wqk=A+B,
where Wqk is a real square matrix, A is a real symmetric matrix, B is an antisymmetric matrix, A=(Wqk+Wqk^T)/2, and B=(Wqk−Wqk^T)/2.
The following decomposition is made based on the properties of the real symmetric matrix and the antisymmetric matrix, respectively:
A=P·D·P^T, and B=U·C·U^T,
where P is an orthogonal matrix, D is a diagonal matrix, U is an invertible matrix, C is a block diagonal matrix, and D and C may be written in the following forms: D=diag(λ1, λ2, . . . , λm), where λi is an eigenvalue of the real symmetric matrix A, i∈{1, 2, . . . , m}, and m represents the number of rows of the weight matrix; and C is a block diagonal matrix whose nonzero diagonal blocks are 2×2 antisymmetric blocks of the form [0, μ; −μ, 0].
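A minimal verification sketch of this decomposition in Python (using NumPy and SciPy; the random weight matrixes are assumptions for illustration only) is as follows:

import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(2)
WQ, WK = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))

Wqk = WQ @ WK.T                 # real square matrix Wqk = WQ.WK^T
A = (Wqk + Wqk.T) / 2           # real symmetric part
B = (Wqk - Wqk.T) / 2           # antisymmetric part

lam, P = np.linalg.eigh(A)      # A = P.D.P^T with P orthogonal
D = np.diag(lam)                # eigenvalues of A on the main diagonal
C, U = schur(B, output='real')  # B = U.C.U^T, C block diagonal with 2x2 blocks

assert np.allclose(Wqk, A + B)
assert np.allclose(A, P @ D @ P.T)
assert np.allclose(B, U @ C @ U.T)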
Based on the decomposition result, the original weights WQ and WK in self-attention computing of the initial Transformer are replaced by the orthogonal matrix P, the invertible matrix U, the diagonal matrix D and the block diagonal matrix C, so the computing steps are transformed to those shown in
Out=X·(WQ·WK^T)·X^T=X·A·X^T+X·B·X^T=(X·P)·D·(X·P)^T+(X·U)·C·(X·U)^T
Making P′=X·P and U′=X·U, the computing step is suitable for the general in-memory computing format, namely multiplication of a matrix with an input vector, so the present disclosure computes the matrix-to-matrix multiplication and matrix-to-vector multiplication through an in-memory computing array represented by the resistive random access memory, which essentially encodes the input data, weights and output data as physical quantities (for example, a voltage and a current), makes the input data, weights and output data meet the mathematical relationship of matrix-to-vector multiplication by constructing a circuit, and completes this type of computing with high parallelism in the analog domain. A related computing paradigm is shown in
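The following idealized behavioral model (a simplification assumed for explanation, not the circuit of the present disclosure) mimics this paradigm in Python: weights are mapped to a differential pair of conductances, inputs are encoded as voltages, and the summed column currents realize the matrix-to-vector multiplication.

import numpy as np

def crossbar_mvm(weights, inputs, g_max=1e-4):
    # Idealized ReRAM crossbar: column current j = sum_i V_i * G_ij.
    w_max = np.abs(weights).max()
    g_pos = np.clip(weights, 0, None) / w_max * g_max   # positive-weight array
    g_neg = np.clip(-weights, 0, None) / w_max * g_max  # negative-weight array
    voltages = inputs                                    # inputs encoded as voltages
    currents = voltages @ g_pos - voltages @ g_neg       # Kirchhoff current summation
    return currents * w_max / g_max                      # decode back to weight scale

rng = np.random.default_rng(3)
W = rng.standard_normal((8, 4))
x = rng.standard_normal(8)
assert np.allclose(crossbar_mvm(W, x), x @ W)            # matches the digital product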
Further, the self-attention computing is simplified to be:
Out=P′·D·P′^T+U′·C·U′^T
Further, by using the hybrid softmax operation based on in-ReRAM logic proposed by the Re-Transformer work, the following operation result is obtained: softmax[(P′·D·P′^T+U′·C·U′^T)/dk^0.5], with D=diag(λ1, λ2, . . . , λm), where λi is an eigenvalue of the real symmetric matrix A, i∈{1, 2, . . . , m}.
Further, the result is multiplied by the input matrix X and the weight matrix WV in succession to obtain the finally outputted self-attention weight coefficient Z, which is specifically represented as: Z=softmax[(P′·D·P′^T+U′·C·U′^T)/dk^0.5]·X·WV.
A corresponding flowchart is drawn in
In the present disclosure, self-attention computing mainly focuses on P′, U′, P′·D·P′^T and U′·C·U′^T.
Taking P′·D·P′^T as an example, due to the properties of the diagonal matrix D, a computing result Oij in the ith row and the jth column only needs the inner product of P′i and P′j, in which each product term is scaled by the corresponding diagonal element of the diagonal matrix D, where P′i and P′j are the vectors in the ith row and the jth row of the matrix P′ and i≠j; and besides, due to the properties of the symmetric matrix, only the upper triangle or the lower triangle needs to be computed to obtain the whole output matrix of this computation.
Likewise, for the computing of U′·C·U′^T, the properties of the antisymmetric matrix and the block diagonal matrix are utilized, and only the upper triangle elements or the lower triangle elements are computed so as to obtain the whole computing result, so the number of computations of the whole process is greatly reduced. Meanwhile, only the matrixes P′ and U′ and the diagonal elements of the diagonal matrix D and the block diagonal matrix C need to be cached in the whole process, so the cache space is also reduced.
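A loop-based sketch of this triangular computation strategy (written for clarity rather than efficiency; the actual mapping onto the ReRAM arrays is a hardware design matter) is given below, assuming P′, U′, the diagonal d of D, and C are already available from the preceding steps:

import numpy as np

def symmetric_term(P_prime, d):
    # P'.D.P'^T: compute only the upper triangle and mirror it (result is symmetric).
    n = P_prime.shape[0]
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            out[i, j] = np.dot(P_prime[i] * d, P_prime[j])  # dot product weighted by diag(D)
            out[j, i] = out[i, j]
    return out

def antisymmetric_term(U_prime, C):
    # U'.C.U'^T: upper triangle only; the lower triangle is its negative, the diagonal is zero.
    n = U_prime.shape[0]
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            out[i, j] = U_prime[i] @ C @ U_prime[j]
            out[j, i] = -out[i, j]
    return out

# Out = symmetric_term(X @ P, np.diag(D)) + antisymmetric_term(X @ U, C)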
In the aspect of the hardware architecture, the present disclosure applies the resistive random access memory (ReRAM) to the Transformer, and corresponding hardware matching measures are provided according to the optimization made in the self-attention acceleration algorithm process, so as to improve acceleration computing, which mainly includes:
According to the self-attention acceleration algorithm optimization made by the present disclosure, the computing process is transformed mainly from the original matrix-to-matrix multiplication into matrix-to-vector multiplication and vector dot products, so the present disclosure computes the related processes using the ReRAM-based process-in-memory logic, stores the elements of the matrixes and vectors in the ReRAM, and performs the operations using the computing logic of the ReRAM, so that the computing power consumption and the cache space may be greatly reduced.
In the softmax operation, the original softmax operation is transformed by using the hybrid softmax proposed based on the selection and comparison logic of the ReRAM, the operation process is simplified by using the selection and comparison logic structure of the ReRAM, and the related measures are described specifically in the Re-Transformer paper.
Based on the above analysis, the process-in-memory architecture based on the resistive random access memory and the matrix decomposition acceleration algorithm provided by the present disclosure includes the following units:
A pretraining unit configured to train three network parameter weight matrixes WQ, WK and WV of a Transformer neural network through a data set.
A decomposition unit configured to perform matrix decomposition based on the weight matrixes WQ and WK obtained by the pretraining unit, wherein the decomposition formula is WQ·WK^T=Wqk=A+B, where Wqk represents a real square matrix, A represents a real symmetric matrix, B represents an antisymmetric matrix, A is calculated by A=(Wqk+Wqk^T)/2, and B is calculated by B=(Wqk−Wqk^T)/2; and to decompose, based on the properties of the real symmetric matrix and the antisymmetric matrix, A=P·D·P^T and B=U·C·U^T, where P represents an orthogonal matrix, D represents a diagonal matrix, U represents an invertible matrix, and C represents a block diagonal matrix.
A preprocessing unit configured to process, by rows and columns respectively, the input matrix X and the matrixes P, D, U and C obtained by the decomposition unit, and input the processed matrixes into a computing unit in a staggered manner.
A computing unit configured to perform multiply-accumulate computation on the matrixes outputted by the preprocessing unit using the resistive random access memory as a computing element to obtain a product matrix, wherein the formula of the multiply-accumulate computation is Out=P′·D·P′^T+U′·C·U′^T, where P′=X·P, U′=X·U, and Out represents an attention score.
A cache unit configured to temporarily store an intermediate result computed by the computing unit.
A self-attention mechanism unit configured to perform normalization computing on the product matrix Out obtained by the computing unit using a hybrid softmax computing array of the process-in-memory logic based on the resistive random access memory to obtain the data softmax[(P′·D·P′^T+U′·C·U′^T)/dk^0.5], where dk represents a dimension of the key-value matrix K, and perform matrix multiply computation on the obtained data softmax[(P′·D·P′^T+U′·C·U′^T)/dk^0.5], the input matrix X and the weight matrix WV to obtain a self-attention weight coefficient. The self-attention weight coefficient is outputted through a data output module.
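A behavioral sketch of the data flow through these units (pure Python/NumPy, with the ReRAM computing arrays and the hybrid softmax idealized as ordinary floating-point operations; all function names are illustrative assumptions) may look as follows:

import numpy as np
from scipy.linalg import schur

def decomposition_unit(WQ, WK):
    # WQ.WK^T = Wqk = A + B, with A = P.D.P^T and B = U.C.U^T.
    Wqk = WQ @ WK.T
    A, B = (Wqk + Wqk.T) / 2, (Wqk - Wqk.T) / 2
    d, P = np.linalg.eigh(A)         # d holds the diagonal of D
    C, U = schur(B, output='real')
    return P, d, U, C

def computing_unit(X, P, d, U, C):
    # Out = P'.D.P'^T + U'.C.U'^T with P' = X.P and U' = X.U.
    P_prime, U_prime = X @ P, X @ U
    return (P_prime * d) @ P_prime.T + U_prime @ C @ U_prime.T

def self_attention_unit(Out, X, WV, dk):
    # Z = softmax(Out / dk^0.5).X.WV (software stand-in for the hybrid softmax array).
    scaled = Out / np.sqrt(dk)
    weights = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X @ WV

# Example data flow: Out = computing_unit(X, *decomposition_unit(WQ, WK));
# Z = self_attention_unit(Out, X, WV, dk).

This software model only serves to check the algebra of the decomposition-based computation; the benefits described in the present disclosure concern its mapping onto the ReRAM hardware.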
The computing methods of the initial Transformer, the Re-Transformer and the Eigen-Transformer of the present disclosure, as well as their cache space, are compared, and the corresponding results are shown in
The Eigen-Transformer in the present disclosure builds upon the Re-Transformer by further optimizing the self-attention computation of the initial Transformer using matrix decomposition techniques and properties of symmetric matrices: the product of the query weight matrix WQ and the transposed key weight matrix WK^T is decomposed into a real symmetric part and an antisymmetric part, which are further decomposed into an orthogonal matrix P with a diagonal matrix D and an invertible matrix U with a block diagonal matrix C, respectively. Compared to the initial Transformer and the Re-Transformer, the Eigen-Transformer in the present disclosure offers the following performance improvements: the matrix decomposition and matrix properties simplify the computation process, reduce the number of computation steps, and minimize caching requirements. By leveraging matrix properties (such as those of symmetric matrices and block diagonal matrices), only the upper or lower triangular elements need to be computed, further reducing the storage space required for intermediate results and significantly lowering power consumption and latency. The Eigen-Transformer further employs the resistive random access memory for the self-attention computation in Transformers, enhancing computation efficiency and accuracy through an in-memory computing architecture. Therefore, the Eigen-Transformer significantly improves the efficiency of self-attention computation, reduces power consumption and storage requirements, and ultimately enhances overall system performance. Therefore,
The hybrid softmax based on the logic-in-ReRAM proposed in the Re-Transformer is adopted for the softmax operation part in the present disclosure and thus is not repeated here.
The present disclosure is intended to optimize the self-attention computing process so as to reduce the number of computing and write operations. The operation is then performed by using the softmax of the ReRAM-based selection and comparison logic structure, so as to further reduce the overall power consumption.
Besides, the present disclosure has described the general process and concept of the present disclosure herein, and the scope covers any and all related algorithms and architectures that the present disclosure is based on. The operation method in the claims will be explained specifically in language used in the claims.
The above description is intended to be explanatory. Those ordinarily skilled in the art can make preliminary theoretical explorations when reading the above description and use some auxiliary means for preliminary simulation. The main core concept of the present disclosure is to optimize self-attention computing by using the properties of the matrixes and to try to compute the transition matrixes used in this solution through the logic-in-ReRAM. Thus, those skilled in the related art can make further exploration and optimization.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202311325100.7 | Oct 2023 | CN | national |
The present application is a continuation of International Application No. PCT/CN2024/073129, filed on Jan. 18, 2024, which claims priority to Chinese Application No. 202311325100.7, filed on Oct. 13, 2023, the contents of both of which are incorporated herein by reference in their entireties.
| Number | Name | Date | Kind |
|---|---|---|---|
| 20220129519 | Zheng | Apr 2022 | A1 |
| Number | Date | Country |
|---|---|---|
| 114282164 | Apr 2022 | CN |
| 115879530 | Mar 2023 | CN |
| 115965067 | Apr 2023 | CN |
| 117371500 | Jan 2024 | CN |
| Entry |
|---|
| International Search Report (PCT/CN2024/073129); Date of Mailing: Jun. 11, 2024; 5 pages. |
| Yang, Xiaoxuan et al. “ReTransformer: ReRAM-based Processing-in-Memory Architecture for Transformer Acceleration” ICCAD, Dec. 31, 2020 (Dec. 31, 2020) ; 9 pages. |
| Number | Date | Country | |
|---|---|---|---|
| Parent | PCT/CN2024/073129 | Jan 2024 | WO |
| Child | 18922350 | US |