The disclosure relates to a neural network model. More particularly, the disclosure relates to a computing method for a transformer model.
Machine learning technologies are utilized in many applications, such as artificial intelligence (AI), data mining, and autopilot systems. Various types of neural networks have been developed to solve different kinds of problems. Among these neural networks, the transformer model is one of the most popular, and it is commonly used to solve natural language tasks.
One of the important components of a transformer is the self-attention mechanism, which equips the model with the ability to capture contextual information from the entire sequence and the flexibility to learn representations from diverse data.
The disclosure provides a computing method suitable for a transformer model. The computing method includes: projecting an input matrix corresponding to an input sequence of feature vectors, according to first learnable weights, into a query matrix (Q) including query vectors; projecting the input matrix, according to second learnable weights, into a value matrix (V) including value vectors; generating a factorized matrix (H) by an incomplete Cholesky factorization according to the query matrix (Q) and a transpose (Qᵀ) of the query matrix, wherein dimensions of the factorized matrix (H) are smaller than dimensions of the query matrix (Q); calculating an intermediate matrix (HᵀV) according to a product between a transpose (Hᵀ) of the factorized matrix and the value matrix (V); and calculating an output matrix according to a product between the factorized matrix (H) and the intermediate matrix.
The disclosure also provides a computing system, which includes a memory and a processor. The memory is configured to store computer-executable instructions. The processor is coupled with the memory. The processor is configured to execute the computer-executable instructions to implement a transformer model. The transformer model includes an attention layer. The attention layer is configured to: project an input matrix corresponding to an input sequence of feature vectors, according to first learnable weights, into a query matrix (Q) including query vectors; project the input matrix, according to second learnable weights, into a value matrix (V) including value vectors; generate a factorized matrix (H) by an incomplete Cholesky factorization according to the query matrix (Q) and a transpose (Qᵀ) of the query matrix, wherein dimensions of the factorized matrix are smaller than dimensions of the query matrix; calculate an intermediate matrix according to a product between the transpose (Hᵀ) of the factorized matrix and the value matrix (V); and calculate an output matrix according to a product between the factorized matrix (H) and the intermediate matrix.
It is to be understood that both of the foregoing general description and the following detailed description are by examples, and are intended to provide further explanation of the invention as claimed.
The disclosure can be more fully understood by reading the following detailed description of the embodiment, with reference made to the accompanying drawings as follows:
Reference will now be made in detail to the present embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the description to refer to the same or like parts.
Reference is made to
In some embodiments, the memory 140 is configured to store computer-executable instructions, training data (during a training process of the transformer model 200), learnable parameters of the transformer model 200 (after the training process), input data to be processed by the transformer model 200 and/or output data generated by the transformer model 200. The memory 140 may include a random-access memory (RAM) module, a read-only memory (ROM), a flash memory, a hard drive, a cache memory, a static random access memory (SRAM), a dynamic random access memory (DRAM), a non-volatile memory (NVM), a solid state drive (SSD), an optical storage medium or any equivalent data storage medium. In some embodiments, the memory 140 may store instructions that are executable by the processor 120 to cause the processor 120 to perform operations corresponding to processes disclosed herein and described in more detail below.
The processor 120 is coupled with the memory 140. The processor 120 is configured to execute the computer-executable instructions to compute, train or operate the transformer model 200. In some embodiments, the transformer model 200 is utilized to perform various natural language tasks, such as question answering, document classification, named entity extraction, coreference resolution, natural language inference, summarization and translation.
In some embodiments, the processor 120 may include a graphics processing unit (GPU), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a tensor processing unit (TPU), a digital signal processor (DSP), a single-instruction multiple-data (SIMD) processor and/or any equivalent processing circuit. Generally, such processors may accelerate various computing tasks associated with evaluating neural network models (e.g., training, prediction, preprocessing, and/or the like) by an order of magnitude or more in comparison to a general-purpose central processing unit (CPU).
The interface 160 is coupled with the processor 120. The interface 160 can include a keyboard, a display, a network transceiver, a connection port (e.g., a USB connection port), a touch panel or any equivalent input/output components. A user or an external device can provide an input to the transformer model 200 or receive an output result from the transformer model 200 through the interface 160.
In an example, when the transformer model 200 is configured to perform a translation task, a user can input an article in a first language, such as a financial report in Mandarin, and the processor 120 will feed this financial report as an input sequence to the transformer model 200. In some embodiments, the input sequence comprises characters or words in the first language (e.g., Mandarin). The transformer model 200 is configured to translate the financial report in Mandarin into another language and to generate an output sequence. The output sequence includes characters or words in a second language (e.g., English). In this case, the transformer model 200 is configured to translate the input sequence into the output sequence. However, the transformer model 200 is not limited to performing translation tasks.
In some embodiments, the transformer model 200 is configured to perform a topic extraction task. In this case, the input sequence includes an article or a document, and the transformer model 200 is configured to generate an output sequence which includes a summary corresponding to the article or the document. For example, the transformer model 200 is configured to generate the output sequence “inflation program in Asia”, which corresponds to the input sequence including the financial report. In some other cases, the output sequence generated by the transformer model 200 can be a classification result, an answer to a question or a title corresponding to the input sequence. The transformer model 200 is configured to extract, identify or generate the output sequence from the input sequence.
In some embodiments, to achieve the aforesaid tasks (translation, classification or answering a question), the transformer model 200 in this disclosure utilizes an incomplete Cholesky factorization (ICF) attention mechanism, which equips the transformer model 200 with the ability to capture contextual information from the whole input sequence and the flexibility to learn representations from diverse data. The incomplete Cholesky factorization (ICF) attention mechanism discussed in the disclosure approximates a self-attention mechanism. The self-attention mechanism is a type of all-to-all attention (every attention vector is correlated with every other attention vector). In this case, a transformer model with the self-attention mechanism occupies relatively large memory storage (for storing the vectors used in the all-to-all attention). As the length of the input sequence increases, the memory usage of the self-attention mechanism grows quadratically. The quadratically-growing memory usage becomes a resource bottleneck during training. Consequently, it requires a huge amount of resources and time to pretrain a transformer model with the self-attention mechanism, and it also limits the ability of such a transformer model to handle longer input sequences, such as high-resolution images and/or long documents.
In some embodiments, the transformer model 200 in this disclosure operates with the ICF attention mechanism, which can generate a result approximating that of the self-attention mechanism. The transformer model 200 with the ICF attention mechanism is able to approximate all-to-all attention with linear memory complexity. In some embodiments, the transformer model 200 with the ICF attention mechanism is able to retain the flexibility of the transformer model 200 and solve the aforesaid quadratic scaling issue.
Reference is further made to
The input embedding module 210 is configured to convert an input sequence INseq into an input representation. In some embodiments, the input embedding module 210 is configured to tokenize the input sequence (e.g., a text sequence including a series of words) and map each token to a vector representation in a multidimensional vector space. For example, a token corresponding to a word may be mapped to a 100-dimensional vector representation of the word.
In a demonstrative case, it is assumed that the input sequence is "today is a good day" with a sequence length of 5. The first word "today" in the input sequence is mapped to a 100-dimensional vector representation (e.g., a 1×100 matrix). The second word "is" in the input sequence is mapped to another 100-dimensional vector representation. In this case, the whole input sequence with five words is represented as a 5×100 input representation. Similarly, if the input sequence includes 2048 words, the input sequence will be represented as a 2048×100 input representation.
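As a minimal sketch of this shape bookkeeping (assuming a toy vocabulary and a randomly initialized embedding table, neither of which is specified by the disclosure), the mapping from words to an n×d input representation can be illustrated in Python as follows:

    import numpy as np

    # Hypothetical toy vocabulary; the actual tokenizer and vocabulary are not specified here.
    vocab = {"today": 0, "is": 1, "a": 2, "good": 3, "day": 4}
    d = 100                                            # embedding dimension
    embedding_table = np.random.randn(len(vocab), d)   # one 100-dimensional vector per token

    tokens = "today is a good day".split()             # sequence length n = 5
    token_ids = [vocab[token] for token in tokens]
    input_representation = embedding_table[token_ids]  # shape (5, 100)
    print(input_representation.shape)                  # (5, 100): an n x d input representation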
The encoder module 220 is configured to generate an encoded representation based on the input representation corresponding to the input sequence INseq. The decoder module 240 is configured to generate or predict an output sequence OUTseq based on the encoded representation generated by the encoder module 220.
In some embodiments, the transformer model 200 may include a fully connected layer 250, which is coupled with the decoder module 240. The fully connected layer 250 is configured to generate an output sequence OUTseq according to an output matrix generated by the decoder module 240.
In some embodiments, the transformer model 200 may include an output embedding module 230, which is configured to generate an output representation based on the output sequence OUTseq or a target sequence TAR. In general, the output embedding module 230 may perform an embedding operation based on the output sequence OUTseq, and the embedding operation is similar to aforesaid embedding operation performed by the input embedding module 210 based on the input sequence INseq.
As shown in
Reference is further made to
As shown in
The attention layer ALself with the self-attention mechanism allows a token to attend to all the tokens of the sequence and incorporate the information of other tokens. An input matrix MIN corresponding to the input sequence INseq is processed by the projection layer L1 through three linear projections (by applying learnable weights WQ, WK, and WV) to emit query vectors, key vectors and value vectors respectively.
In some embodiments, dimensions of the input matrix MIN are n×d, in which n is a sequence length of the input sequence INseq, and d is a dimension value of the feature vectors in the input sequence INseq. For example, when the input sequence INseq includes 2048 words and each word in the input sequence INseq is mapped to a 100-dimensional vector representation, the input matrix MIN will be a 2048×100 matrix (n=2048, d=100). For reasons of parallelization, the query vectors generated by the projection layer L1 are packed into a query matrix Q with n×d dimensions. Similarly, the key vectors and the value vectors are packed into a key matrix K with n×d dimensions and a value matrix V with n×d dimensions, respectively.
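A minimal sketch of these three linear projections, using randomly initialized stand-ins for the learnable weights WQ, WK and WV (the actual trained weights are not specified here):

    import numpy as np

    n, d = 2048, 100                  # sequence length and feature dimension
    M_in = np.random.randn(n, d)      # input matrix MIN, n x d

    # Random stand-ins for the learnable projection weights.
    W_Q = np.random.randn(d, d)
    W_K = np.random.randn(d, d)
    W_V = np.random.randn(d, d)

    Q = M_in @ W_Q                    # query matrix, n x d
    K = M_in @ W_K                    # key matrix,   n x d
    V = M_in @ W_V                    # value matrix, n x d
    print(Q.shape, K.shape, V.shape)  # (2048, 100) (2048, 100) (2048, 100)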
An output matrix MOUT of the attention layer ALself with the self-attention mechanism is defined as:
MOUT=Attention(Q,K,V)=softmax(QKᵀ/√d)V   equation (1)
In equation (1), an attention matrix QKᵀ is a product between the query matrix Q and a transpose Kᵀ of the key matrix K. The attention matrix QKᵀ preserves the attention values of all the pairs of tokens and correlates all the tokens of the sequence. The attention matrix QKᵀ is divided by a scaling factor √d and passed into the softmax layer L2. The output of the softmax layer L2, softmax(QKᵀ/√d), represents the attention weights between the queries and all the keys, and this output is linearly combined with the value matrix V to generate the output matrix MOUT.
During the aforesaid calculations for generating the output matrix MOUT, the softmax layer L2 is calculated according to the attention matrix QKᵀ. In this case, the query matrix Q is an n×d matrix; the transpose Kᵀ of the key matrix K is a d×n matrix; and the attention matrix QKᵀ is an n×n matrix. As the sequence length n increases, the memory usage of the attention matrix QKᵀ grows quadratically, which becomes a resource bottleneck during training.
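For illustration, a minimal NumPy sketch of equation (1) (with random Q, K and V; not the disclosure's implementation) makes the n×n intermediate explicit:

    import numpy as np

    def self_attention(Q, K, V):
        d = Q.shape[1]
        scores = Q @ K.T / np.sqrt(d)                 # attention matrix QK^T / sqrt(d): n x n
        scores -= scores.max(axis=1, keepdims=True)   # subtract row maxima for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=1, keepdims=True) # row-wise softmax
        return weights @ V                            # output matrix MOUT: n x d

    n, d = 2048, 100
    Q, K, V = (np.random.randn(n, d) for _ in range(3))
    M_out = self_attention(Q, K, V)
    print(M_out.shape)   # (2048, 100), but an n x n (2048 x 2048) matrix was materialized above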
Reference is further made to
As shown in
As shown in
In some embodiments, dimensions of the input matrix MIN are n×d, in which n is a sequence length of the input sequence INseq, and d is a dimension value of the feature vectors in the input sequence INseq. In some embodiments, the query vectors generated by the projection layer L3 are packed into a query matrix Q with n×d dimensions. For example, when the input sequence INseq includes 2048 words and each word in the input sequence INseq is mapped to one vector representation in a 100-dimensional vector representation, the input matrix MIN will be a 2048×100 matrix (n=2048, d=100), and the query matrix Q will also be a 2048×100 matrix.
As shown in
In some embodiments, the value vectors generated by the projection layer L3 are packed into the value matrix V with n×d dimensions. For example, when the input matrix MIN is a 2048×100 matrix (n=2048, d=100), the value matrix V will also be a 2048×100 matrix.
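In contrast with the earlier three-projection sketch, the attention layer ALICF emits only queries and values; a minimal sketch (again with random stand-in weights) follows:

    import numpy as np

    n, d = 2048, 100
    M_in = np.random.randn(n, d)     # input matrix MIN

    W_Q = np.random.randn(d, d)      # first learnable weights (random stand-in)
    W_V = np.random.randn(d, d)      # second learnable weights (random stand-in)

    Q = M_in @ W_Q                   # query matrix, n x d (also reused in place of the keys)
    V = M_in @ W_V                   # value matrix, n x d
    # No separate key matrix is produced; the shared-QK attention uses QQ^T instead of QK^T.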
As shown in
In step S330, the ICF function layer L4 is configured to calculate the factorized matrix H and a transpose Hᵀ of the factorized matrix H, such that a product HHᵀ (between the factorized matrix H and the transpose Hᵀ) approximates an exponential function of a shared-QK attention matrix (QQᵀ).
The shared-QK attention matrix QQᵀ is defined as a product between the query matrix Q and a transpose Qᵀ of the query matrix Q. As shown in
In some embodiments, the ICF function layer L4 is configured to calculate the factorized matrix H and the transpose Hᵀ of the factorized matrix H, and the ICF function ensures that the product HHᵀ approximates an exponential function of the shared-QK attention matrix QQᵀ, such that the factorized matrix H and the transpose Hᵀ can be used as a substitute for the shared-QK attention matrix QQᵀ.
For the purpose of explaining the matrix approximation, the aforesaid equation (1) is rewritten, replacing Kᵀ with Qᵀ (shared-QK) and expanding according to the definition of the softmax function, into the following equation (2):
MOUT=(exp(QQᵀ)⊘exp(QQᵀ)1)V   equation (2)
In equation (2), exp(QQᵀ) is an exponential function (applied element-wise) of the shared-QK attention matrix QQᵀ; ⊘ denotes element-wise division; 1 denotes an all-ones vector; and the scaling factor √d is omitted in equation (2) for simplicity.
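As a quick numerical sanity check of equation (2) (a small random example with the √d scaling omitted, as in the text), the row-wise softmax can be compared against the exp-and-divide form:

    import numpy as np

    n, d = 6, 4
    Q = np.random.randn(n, d)
    V = np.random.randn(n, d)

    # Left-hand side: row-wise softmax of the shared-QK attention matrix, times V.
    scores = Q @ Q.T
    softmax_out = (np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)) @ V

    # Right-hand side of equation (2): exp(QQ^T) element-wise divided by exp(QQ^T) times 1.
    E = np.exp(Q @ Q.T)
    ones = np.ones((n, 1))
    eq2_out = (E / (E @ ones)) @ V

    print(np.allclose(softmax_out, eq2_out))   # True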
It is noticed that exp(QQᵀ) is a symmetric and positive definite (SPD) matrix. Incomplete Cholesky factorization (ICF) can be used to approximate a symmetric and positive definite matrix in n×n dimensions by a smaller factorized matrix H in n×p dimensions, wherein p<<n. Because exp(QQᵀ) satisfies the requirements of an SPD matrix, the ICF function can derive the factorized matrix H and the transpose Hᵀ to approximate the Hadamard (element-wise) exponential function of the shared-QK attention matrix QQᵀ.
The parameter p is a dynamic variable utilized by the incomplete Cholesky factorization for approximation. In some embodiments, the parameter p is set to be equal to or smaller than d. When p is set equal to d, the incomplete Cholesky factorization is efficient in down-sizing the factorized matrix H while keeping most of the features of the original all-to-all attention.
In some other embodiments, the parameter p is set to be equal to or smaller than a rank of the shared-QK attention matrix (QQᵀ). If the parameter p is larger than the rank of the shared-QK attention matrix (QQᵀ), the factorized matrix H will include all-zero columns, which carry no information and occupy unnecessary memory.
As mentioned above, the dimensions of the query matrix (Q) are n×d. In some embodiments, the dimensions of the factorized matrix H generated by the ICF layer L4 are n×p, wherein p<<n and p is a parameter corresponding to an iteration count in the incomplete Cholesky factorization. The factorized matrix H and the transpose Hᵀ of the factorized matrix are utilized by the transformer model 200 to replace the shared-QK attention matrix while calculating the output matrix MOUT.
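As an illustrative sketch only (not the disclosure's exact procedure), a pivoted incomplete Cholesky routine can build H one column per iteration while evaluating columns of exp(QQᵀ) on the fly, so the full n×n matrix is never stored; the helper name icf_of_exp_qqt, the pivoting rule and the stopping tolerance are assumptions:

    import numpy as np

    def icf_of_exp_qqt(Q, p, tol=1e-10):
        """Pivoted incomplete Cholesky of G = exp(QQ^T), returning H (n x p) with HH^T ~ G."""
        n = Q.shape[0]
        H = np.zeros((n, p))
        diag_res = np.exp(np.sum(Q * Q, axis=1))   # residual diagonal, initially diag of exp(QQ^T)
        for k in range(p):
            j = int(np.argmax(diag_res))           # pivot: largest residual diagonal entry
            if diag_res[j] <= tol:                 # remaining error is negligible; stop early
                return H[:, :k]
            g_col = np.exp(Q @ Q[j])               # j-th column of exp(QQ^T), computed on the fly
            H[:, k] = (g_col - H[:, :k] @ H[j, :k]) / np.sqrt(diag_res[j])
            diag_res = diag_res - H[:, k] ** 2
        return H

    n, d, p = 512, 16, 16
    Q = np.random.randn(n, d) / np.sqrt(d)         # kept small so exp() stays in a safe range
    H = icf_of_exp_qqt(Q, p)
    G = np.exp(Q @ Q.T)                            # formed here only to check the small example
    rel_err = np.linalg.norm(G - H @ H.T) / np.linalg.norm(G)
    print(H.shape, rel_err)                        # (512, 16) and an error that shrinks as p grows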
After the factorized matrix H and the transpose Hᵀ are derived by the ICF function, the product of these two matrices (HHᵀ) is used to substitute for the exponential function of the shared-QK attention matrix, exp(QQᵀ), in the aforesaid equation (2), which defines an output matrix MOUT of the attention layer ALICF shown in
MOUT=ICFattention(Q,V)=(HHᵀ⊘HHᵀ1)V=H(HᵀV)⊘H(Hᵀ1)   equation (3)
In step S340, the attention layer ALICF executed by the processor is configured to calculate an intermediate matrix HᵀV according to a product between the transpose Hᵀ of the factorized matrix H and the value matrix V.
In step S350, the attention layer ALICF executed by the processor is configured to calculate the output matrix MOUT based on equation (3). In other words, the output matrix MOUT is calculated according to a product between the factorized matrix H and the intermediate matrix HᵀV.
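Combining steps S330 to S350, a sketch of the full ICF attention output (reusing the hypothetical icf_of_exp_qqt helper from the previous sketch; the two sketches run together) evaluates equation (3) without forming any n×n matrix:

    import numpy as np

    def icf_attention(Q, V, p):
        H = icf_of_exp_qqt(Q, p)           # n x p factorized matrix with HH^T ~ exp(QQ^T)
        ones = np.ones((Q.shape[0], 1))
        numerator = H @ (H.T @ V)          # H(H^T V): the intermediate H^T V is only p x d
        denominator = H @ (H.T @ ones)     # H(H^T 1): approximate row sums, n x 1
        return numerator / denominator     # element-wise division, as in equation (3)

    n, d, p = 512, 16, 16
    Q = np.random.randn(n, d) / np.sqrt(d)
    V = np.random.randn(n, d)
    M_out = icf_attention(Q, V, p)
    print(M_out.shape)                     # (512, 16): same shape as the value matrix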
In some embodiments, dimensions of the intermediate matrix HᵀV are smaller than dimensions of the shared-QK attention matrix QQᵀ. As discussed above, dimensions of the query matrix Q are n×d; dimensions of the transpose Qᵀ of the query matrix are d×n; dimensions of the shared-QK attention matrix QQᵀ are n×n; dimensions of the transpose Hᵀ of the factorized matrix H are p×n; dimensions of the value matrix V are n×d; and dimensions of the intermediate matrix HᵀV are p×d.
In most cases, when the input sequence includes a long document (i.e., n is relatively large), the sequence length n will be far greater than p and d (i.e., n>>p and n>>d). For example, when the long document includes 3000 words or more, the sequence length n will reach 3000 or more. Accordingly, the dimensions of the intermediate matrix HᵀV will be far smaller than the dimensions of the shared-QK attention matrix QQᵀ.
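For a rough, hypothetical sense of scale (using n = 3000 from the example above and assuming d = p = 100), the entry counts of the relevant matrices can be compared directly:

    n, d, p = 3000, 100, 100   # example values consistent with the text above
    print(n * n)               # entries in QQ^T:  9,000,000 (grows quadratically with n)
    print(n * p)               # entries in H:       300,000 (grows linearly with n)
    print(p * d)               # entries in H^T V:    10,000 (independent of n)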
In the embodiments of the attention layer ALself with the self-attention mechanism shown in
In the embodiments of the attention layer ALICF with the ICF attention mechanism shown in
As shown in
In some embodiments, the output matrix MOUT of one attention layer can be transmitted to the fully connected layer 250 as shown in
By using the attention layer ALICF with the ICF attention mechanism, the transformer model 200 is able to process input sequences with longer lengths. This is very useful when the transformer model 200 is utilized to translate a paper, a magazine or a long article, because the transformer model 200 is able to obtain an attention matrix of the whole input sequence and generate the output accordingly, without cutting the long input sequence into several pieces.
Although the present invention has been described in considerable detail with reference to certain embodiments thereof, other embodiments are possible. Therefore, the spirit and scope of the appended claims should not be limited to the description of the embodiments contained herein.
It will be apparent to those skilled in the art that various modifications and variations can be made to the structure of the present invention without departing from the scope or spirit of the invention. In view of the foregoing, it is intended that the present invention cover modifications and variations of this invention provided they fall within the scope of the following claims.
This application claims priority to U.S. Provisional Application Ser. No. 63/192,104, filed on May 24, 2021, which is herein incorporated by reference.