This application is a continuation application, claiming priority under § 365(c), of International application No. PCT/KR2023/004696, filed on Apr. 7, 2023, which is based on and claims the benefit of Greek patent application number 20220100326, filed on Apr. 14, 2022, and European patent application number 23150205.5, filed on Jan. 3, 2023, the disclosures of which are incorporated by reference herein in their entirety.
The present application generally relates to an apparatus for accelerating neural network computations. In particular, the present application relates to a hardware accelerator, electronic device, or processing unit, for accelerating machine learning, ML, model computations.
Advances in deep learning have led to various neural networks (NNs) capable of achieving state-of-the-art accuracy on many AI tasks. Among these, recurrent neural networks (RNNs), such as Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) networks, have demonstrated their advantages in addressing sequence transduction and modelling problems.
However, the recurrence property of these models introduces two drawbacks: (1) the limited capacity of utilizing the past or external knowledge, which causes long-range information loss; and (2) the sequential nature of RNNs, which hinders their parallelisation in hardware and slows down their processing speed.
Recently, the attention mechanism has been proposed to address these shortcomings. It can process long sequences of data in parallel to capture long-range information. Based on this principle, different attention-based NNs, such as BERT and GPT, have been proposed that achieve state-of-the-art accuracy in various natural language processing (NLP) tasks.
The benefits of attention-based NNs, however, come at a cost: the high degree of parallelism significantly increases the number of parameters and the amount of computation, resulting in a large overhead on their speed and power consumption. To alleviate this, various hardware accelerators have been proposed. However, there are several issues in these designs. For example, these accelerators only focus on optimizing either the feed-forward networks (FFNs) or the attention mechanism. Without jointly optimizing both parts, these hardware designs lack scalability when accelerating end-to-end attention-based NNs with different input lengths. Furthermore, while optimizing the attention mechanism, most of the existing designs dynamically detect and prune the redundant computation at runtime to achieve high sparsity on specific datasets and networks. However, the generality of these dynamic approaches needs to be further tested, as their performance gain may vary among different datasets and network architectures. Further still, since the sparsity patterns introduced by these dynamic approaches are unstructured, dynamic hardware controllers are required in order to exploit the sparsity. However, such complicated controllers often contain a large number of clocking elements whose cost increases as the transistor size shrinks. As such, the performance or energy improvement brought by these dynamic methods may be diminished due to the hardware overhead of these dynamic controllers.
Therefore, the present applicant has recognised the need for an improved hardware acceleration mechanism for machine learning computations.
Demanding data-intensive applications like machine learning, and the need for high-speed computing, have resulted in a need for “accelerators” to offload work from general purpose processors. An accelerator is a hardware device which partners with general purpose processors, such as a CPU, to boost the speed at which data is processed, and therefore improve the performance of the processor. This is particularly useful where larger, more powerful processors are not able to be used within a device, such as within mobile communication devices, smartphones, home assistant devices, robotic devices, battery-operated devices, and so on.
In an embodiment of this disclosure, an electronic device for accelerating machine learning, ML, model computations is provided. The electronic device comprises: a first processor configured to: generate a query matrix, a key matrix, and a value matrix by performing Fast Fourier Transform (FFT) and butterfly linear transform on at least one input matrix, and a second processor configured to: perform a first matrix multiplication between the query matrix and the key matrix, perform a softmax operation on the result of the first matrix multiplication, and perform a second matrix multiplication between the result of the softmax operation and the value matrix.
In an embodiment of this disclosure, a method for accelerating machine learning, ML, model computations is provided. The method comprises: generating a query matrix, a key matrix, and a value matrix by performing Fast Fourier Transform (FFT) and butterfly linear transform on at least one input matrix, performing a first matrix multiplication between the query matrix and the key matrix, performing a softmax operation on the result of the first matrix multiplication, and performing a second matrix multiplication between the result of the softmax operation and the value matrix. In an embodiment of this disclosure, a computer-readable storage medium is provided. The computer-readable storage medium comprises instructions which, when executed by a processor, cause the processor to carry out the method of this disclosure.
In an embodiment of this disclosure, there is provided a processing unit, also referred to herein as “an electronic device”, for accelerating machine learning, ML, model computations, the processing unit comprising: a first processor for accelerating computations involving a sparse matrix, the sparse matrix having a butterfly sparsity pattern. The first processor may be used to accelerate computations of any layer of the ML model, as long as the computations involve matrices having a butterfly sparsity pattern. For example, the first processor may be used to accelerate computations of attention layers and/or linear layers of the ML model, when the butterfly sparsity pattern is adopted/applied to those layers. The ML model may comprise one or more further layers which do not adopt butterfly sparsity, but the first processor is not used to accelerate computations of these further layers.
The first processor is also referred to herein as a “butterfly processor”, because the first processor operates on data using a butterfly sparsity pattern. That is, the first processor accesses data from an input matrix according to a butterfly sparsity pattern, to reduce the number of computations that need to be performed, without losing accuracy.
The first processor may comprise a plurality of engines, also referred to herein as “butterfly engines”, used to accelerate the computations involving a sparse matrix.
Each butterfly engine may comprise an adaptable memory system and a plurality of adaptable butterfly units, wherein each butterfly unit may be configurable to perform a specific computation.
Each butterfly unit may be configurable to perform one or both of a Fast Fourier Transform, and a linear transformation. This is advantageous because many conventional techniques are able to perform either attention-based operations (i.e. linear transformations) or Fast Fourier Transform based operations, but not both. The butterfly unit of the present techniques is configurable, at runtime, to perform either operation based on which layer of the ML model the processing unit is being used for.
Each butterfly unit may comprise a plurality of multiplexers arranged to select inputs required to perform the specific computation. This enables the butterfly unit to be adapted/configured to perform the specific computation.
In an embodiment of this disclosure, each butterfly unit may comprise: four real-number multipliers, arranged to multiply inputs to perform the specific computation; two real-number adders or subtractors, arranged to operate on outputs of the real-number multipliers; and two complex-number adders or subtractors, arranged to operate on outputs of the real-number adders or subtractors. For example, to perform a linear transformation computation, all four of the real-number multipliers may be used to perform four real-number multiplications, whereas to perform an FFT computation, all four of the real-number multipliers may be used together to perform one complex-number multiplication, since a complex product requires four real-number products. Thus, the same hardware may be configured, in real-time, to suit the computation.
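The following is a minimal software sketch of how one set of four real-number multipliers could serve both modes. It is an illustrative model only, not the disclosed circuit: the function name `butterfly_unit`, the `mode` argument, and the way weights are passed are assumptions made for this example. It relies on the identity (a+bi)(c+di) = (ac-bd) + (ad+bc)i, which needs four real products.

```python
def butterfly_unit(x1, x2, w, mode):
    """Software model of one butterfly unit built around four real multipliers.

    mode == "linear": x1 and x2 are real, and w = (w1, w2, w3, w4) holds four
        independent real weights, so the multipliers yield four real products.
    mode == "fft": x1 and x2 are complex, and w is one complex twiddle factor;
        the four multipliers together form the single complex product w * x2.
    """
    if mode == "linear":
        w1, w2, w3, w4 = w
        # Four real multiplications.
        p1, p2, p3, p4 = w1 * x1, w2 * x2, w3 * x1, w4 * x2
        # The two real adders combine the products into the two outputs.
        return p1 + p2, p3 + p4
    if mode == "fft":
        a, b = x2.real, x2.imag
        c, d = w.real, w.imag
        # The same four real multiplications, now forming one complex product.
        ac, bd, ad, bc = a * c, b * d, a * d, b * c
        # Real subtractor and adder produce the real and imaginary parts.
        t = complex(ac - bd, ad + bc)
        # Complex adder and subtractor produce the two butterfly outputs.
        return x1 + t, x1 - t
    raise ValueError(f"unknown mode: {mode}")
```

For example, `butterfly_unit(1.0, 2.0, (1.0, 2.0, 3.0, 4.0), "linear")` computes the pair (1·1 + 2·2, 3·1 + 4·2), while `butterfly_unit(1 + 0j, 1j, 1j, "fft")` computes x1 ± w·x2 with a single complex product.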
Each butterfly unit may comprise: eight multiplexers, arranged to receive the inputs for the specific computation, and to select the inputs required to perform the specific computation; and two de-multiplexers, arranged between the two real-number adders or subtractors and the two complex-number adders or subtractors, to provide outputs of the butterfly unit.
The memory system of the butterfly engine may comprise a serial-to-parallel module arranged to access data for the specific computation based on the butterfly sparsity pattern.
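To illustrate what accessing data "based on the butterfly sparsity pattern" can mean, the sketch below lists which pairs of elements are combined at each butterfly stage. The function name `butterfly_pairs` and the stage numbering (stage 0 has stride 1) are assumptions for this example; the internal organisation of the serial-to-parallel module is not modelled.

```python
def butterfly_pairs(n, stage):
    """Return the index pairs combined at one stage of an n-point butterfly.

    Stage 0 pairs neighbouring elements (stride 1); each later stage doubles
    the stride. n is assumed to be a power of two.
    """
    stride = 1 << stage
    pairs = []
    for block in range(0, n, 2 * stride):
        for offset in range(stride):
            i = block + offset
            pairs.append((i, i + stride))
    return pairs
```

Because the pairs at every stage are fixed by `n` and `stage` alone, the access order is known ahead of time, which is what makes a static (rather than dynamic) memory controller plausible for this pattern.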
The processing unit may further comprise: a third processor for receiving outputs from the first processor and for performing post-processing using the received outputs. The third processor is also referred to herein as a “post processor”.
As noted above, the first processor may be used to accelerate computations of any layer of the ML model, as long as the computations involve matrices having a butterfly sparsity pattern. For example, the first processor may be used to accelerate computations of attention layers and/or linear layers of the ML model, when the butterfly sparsity pattern is adopted/applied to those layers. However, when the ML model comprises at least one attention layer requiring computations that do not involve or use a sparse matrix having a butterfly sparsity pattern, the processing unit may further comprise a second processor for performing some operations required by an attention mechanism, i.e. for performing operations required by this/these attention layer(s). The second processor is also referred to herein as an “attention processor”. That is, when the ML model comprises even a single attention layer that is vanilla or conventional, i.e. which does not use butterfly sparsity, the second processor is used to execute that attention layer when the model is executed.
The second processor may comprise a plurality of attention engines, wherein each attention engine may comprise: a query key, QK, unit for implementing matrix multiplications between query and key matrices; and a score value, SV, unit for receiving outputs from the QK unit and multiplying the outputs with value vectors to output results of an attention layer of the ML model.
As mentioned above, the plurality of butterfly engines, and therefore the butterfly processor, may be configurable in real-time to perform the specific computation. The butterfly processor may be reconfigurable and adaptable. Thus, the butterfly processor may be configured at runtime to accelerate different layers using a single piece of hardware. This makes the present hardware accelerator efficient, as it can be used to accelerate different and multiple layers of the ML model.
During runtime, the processing unit may receive an input for processing by the ML model, and may use the first processor to perform a Fast Fourier Transform computation using the received input.
In an embodiment of this disclosure, there is provided a computer-implemented method for accelerating machine learning, ML, model computations using the processing unit described herein, the method comprising: receiving an input matrix for processing by the ML model; and using the first processor of the processing unit to: access data from the input matrix based on a butterfly sparsity pattern; configure the plurality of butterfly units of the first processor to perform a Fast Fourier Transform computation; and perform a Fast Fourier Transform computation using the accessed data.
The method may further comprise: using the second processor of the processing unit to:
In a related approach of the present techniques, there is provided a computer-readable storage medium comprising instructions which, when executed by a processor, cause the processor to carry out the methods described herein.
As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.
Furthermore, the present disclosure may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.
Embodiments of the present disclosure also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.
The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.
It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.
In an embodiment, the present disclosure may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.
The methods described above may be wholly or partly performed on an apparatus, i.e. an electronic device, using a machine learning or artificial intelligence model. The model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing. The artificial intelligence model may be obtained by training. Here, “obtained by training” means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.
As mentioned above, the present disclosure may be implemented using an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning. Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation through calculation of a previous layer and an operation of a plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
The learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
Implementations of the present disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
Broadly speaking, the present disclosure generally relates to an apparatus for accelerating neural network computations. In particular, the present application relates to a hardware accelerator, or processing unit, for accelerating machine learning, ML, model computations.
To address the aforementioned issues, the present applicant adopts butterfly sparsity to accelerate attention-based models with at least three novel aspects: i) Fine-grained structured regularity, which possesses regular data accesses to optimize both memory and compute efficiency; ii) Static sparsity pattern, which avoids the need to design a dynamic controller in hardware; iii) Sparsity exploitation on both attention and linear layers, which allows scalable end-to-end acceleration of attention-based NNs. The present applicant therefore proposes FABNet, a hardware-friendly model for FFT, Attention and ButterflyNet. To fully exploit the sparsity in hardware, the present applicant proposes an adaptable butterfly accelerator that can be configured at runtime via dedicated hardware control to accelerate different layers using one unified engine, significantly improving hardware efficiency. To push the performance limit, the present applicant jointly optimized the model and hardware via a co-design approach.
Based on their network structure, attention-based NNs can be classified into three categories: (i) encoder-decoder, (ii) encoder-only, and (iii) decoder-only networks. The encoder-decoder NNs are mainly designed for sequence-to-sequence tasks, such as machine translation.
Based on the original encoder-decoder structure of the Transformer, different variants have been proposed. The encoder-only networks, such as BERT and XLM, are also called autoencoding models and have been widely applied to NLP tasks, such as sequence classification. The Vision Transformer (ViT) also lies in this category and introduces one extra linear projection layer at the beginning. Their encoder layers correspond to the encoder part of the original Transformer. Finally, the decoder-only networks represent the autoregressive models designed for NLP tasks, such as language modelling. GPT is a typical decoder-only model that corresponds to the decoder part of the original Transformer. Although the present applicant focuses on encoder-only networks, the hardware design is flexible and applicable to decoders too.
Butterfly Matrices and FFT. Despite the impressive accuracy attained using attention-based NNs, these models are expensive and not scalable: the self-attention mechanism in the Transformer scales quadratically in compute and memory as a function of the input sequence length. As a result, numerous works adopt structured linear mappings, such as sparse and low-rank matrices, to approximate the attention matrices in the attention components and/or the weight matrices in the feed-forward layers. Choosing an appropriate structure for each linear mapping, however, is application-dependent, often requiring domain expertise and entailing an arduous process of hand-picking solutions, as different structures have different trade-offs in performance and speed.
Generally speaking, sparse matrices are matrices that contain mostly zero values, i.e. most of the values/elements are zero. These are useful because they reduce the amount of computational processing that needs to be performed when performing matrix-based computations (such as Fourier transforms or fast Fourier transforms). As only the non-zero elements of a sparse matrix are stored, the amount of memory required to store the matrix is reduced (compared to a dense matrix). It is desirable to transform data into sparse matrices to make computations less complex/time consuming. As mentioned below in relation to
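To counteract this, many works have utilized butterfly matrices, which are universal representations of structured matrices and are practically efficient due to their simple recursive structure. Specifically, each butterfly matrix $W_{Bfly}$ of size N encodes the recursive divide-and-conquer structure of the Fast Fourier Transform (FFT) and, hence, can be expressed as the product of sparse butterfly factor matrices as follows:
$$W_{Bfly} = W'_{N} \begin{bmatrix} W'_{N/2} & 0 \\ 0 & W'_{N/2} \end{bmatrix} \cdots \begin{bmatrix} W'_{2} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & W'_{2} \end{bmatrix}$$
where each $W'_{N}$, named a butterfly factor, is a 2×2 block matrix of diagonal matrices $B_{N/2}^{i}$ of size N/2, whose entries can be trained via gradient-based methods:
$$W'_{N} = \begin{bmatrix} B_{N/2}^{1} & B_{N/2}^{2} \\ B_{N/2}^{3} & B_{N/2}^{4} \end{bmatrix}$$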
In other words, butterfly matrices arise from the Fast Fourier Transform algorithm. The FFT algorithm works by performing the following steps: (1) separating the odd and even indices of the input, (2) performing an FFT on each half of the input, and (3) recombining pairs of indices using a 2×2 matrix. The sparsity pattern of this FFT algorithm is analysed, which shows that a recursive factorisation of the FFT matrix occurs. The factorisation's sparsity pattern is called a butterfly matrix, and each individual sparse matrix in the product (see equation above) is a butterfly factor. As shown in
Due to their expressiveness in representing structured matrices and approximating unstructured data, butterfly matrices and their variants have found success in replacing attention and weight matrices, considerably improving the accuracy and efficiency of attention-based NNs.
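As an illustration of why a butterfly matrix is cheap to store and apply, the sketch below counts its non-zero entries and applies it to a vector one factor at a time. The names `butterfly_nonzeros` and `apply_butterfly` and the layout of `stages` are assumptions for this example. Each of the log2(N) factors has two non-zeros per row, giving 2·N·log2(N) entries instead of N² for a dense matrix.

```python
def butterfly_nonzeros(n):
    """Non-zero entries in a butterfly matrix of size n (n a power of two)."""
    log2_n = n.bit_length() - 1
    return 2 * n * log2_n


def apply_butterfly(x, stages):
    """Apply a butterfly matrix to x as a sequence of sparse factors.

    stages[s] lists one (w1, w2, w3, w4) tuple per element pair at stride
    2**s, so stages[0] is the factor with the smallest blocks and is applied
    first. len(x) is assumed to be a power of two.
    """
    n = len(x)
    y = list(x)
    for s, weights in enumerate(stages):
        stride = 1 << s
        out = list(y)
        k = 0
        for block in range(0, n, 2 * stride):
            for offset in range(stride):
                i = block + offset
                j = i + stride
                w1, w2, w3, w4 = weights[k]
                # Each output depends on exactly two inputs.
                out[i] = w1 * y[i] + w2 * y[j]
                out[j] = w3 * y[i] + w4 * y[j]
                k += 1
        y = out
    return y
```

With every weight tuple set to `(1, 1, 1, -1)`, `apply_butterfly` reduces to the Walsh-Hadamard transform, which is a convenient sanity check. For N = 1024, `butterfly_nonzeros` gives 20,480 entries against 1,048,576 for a dense matrix.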
Besides attention and weight matrices, some works have explored replacing the entire attention mechanism with more efficient counterparts. A prominent example is FNet, in which the self-attention modules are replaced with 2D Discrete Fourier Transform (DFT) operations. Specifically, for each input, a 1D DFT is applied along the sequence dimension and another 1D DFT is applied along the hidden dimension, keeping only the real component of the resulting output. As the use of DFT facilitates information flow across all embeddings, it results in similar performance to that of vanilla self-attention layers, but with a significant reduction in latency and memory.
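The Fourier mixing step described above can be sketched as follows. This is an illustrative model that uses a direct O(N²) DFT for readability rather than an FFT; the function names `dft` and `fourier_mixing` are assumptions for this example, and the input is taken to be a list of token embeddings (one row per sequence position).

```python
import cmath


def dft(v):
    """Direct 1D discrete Fourier transform of a sequence of numbers."""
    n = len(v)
    return [
        sum(v[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fourier_mixing(x):
    """Apply a 1D DFT along the hidden dimension, then along the sequence
    dimension, and keep only the real part of the result.

    x is a list of rows; each row is one token embedding.
    """
    # DFT along the hidden dimension (within each row).
    rows = [dft(row) for row in x]
    # DFT along the sequence dimension (down each column).
    cols = [dft(list(col)) for col in zip(*rows)]
    # Transpose back to (sequence, hidden) and drop the imaginary part.
    return [[cols[h][s].real for h in range(len(cols))] for s in range(len(x))]
```

Because the two 1D transforms act on different axes, the order in which they are applied does not change the result.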
On the algorithmic front, the proposed FABNet utilizes a mixture of these techniques (FFT and butterfly matrices) to outperform existing works in terms of accuracy. Notably, since FFT matrices are a special case of butterfly matrices, with $B_{N/2}^{1}$ and $B_{N/2}^{3}$ being identity matrices and $B_{N/2}^{2}$ and $B_{N/2}^{4}$ being twiddle factors, both the FFT and butterfly matrices possess the recursive butterfly structure. Therefore, it is possible to unify the computational and data access pattern, and then devise a single hardware engine to accelerate both FFT and butterfly-based operations with high hardware efficiency.
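The special case described above can be made concrete with a textbook radix-2 FFT, in which the recombination step is exactly a butterfly factor whose left blocks are identities and whose right blocks hold the twiddle factors with opposite signs. This is a standard recursive formulation given for illustration only; it is not the hardware schedule, and the function names are assumptions for this example.

```python
import cmath


def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])  # FFT of the even-indexed inputs
    odd = fft(x[1::2])   # FFT of the odd-indexed inputs
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddle = cmath.exp(-2j * cmath.pi * k / n)
        # Butterfly factor [[I, T], [I, -T]] with T the diagonal of twiddles:
        # identity blocks act on `even`, twiddle blocks act on `odd`.
        out[k] = even[k] + twiddle * odd[k]
        out[k + half] = even[k] - twiddle * odd[k]
    return out


def naive_dft(x):
    """Direct O(N^2) DFT, used only to check the FFT above."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
```

Replacing the fixed values 1, `twiddle`, 1 and `-twiddle` with four independent trainable weights per pair turns the same loop into a general butterfly linear transform, which is why one engine can plausibly serve both operations.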
Latency Breakdown and Motivation. The majority of previous Transformer-based accelerators focused on optimizing a subcomponent of the entire model, resulting in suboptimal performance gains. In addition, execution time is heavily dependent on the input length and thus varies in different subcomponents, reducing the scalability with respect to input length and narrowing the deployability of these solutions.
Specifically, the Transformer architecture is split into three main subcomponents: attention layers, linear layers, and other operations, e.g. softmax, layer normalization, residual connections, matrix transformations, and IO operations. Notably, on the CPU, linear layers take up a significant portion of execution time, up to 71.61%, for all input lengths. In contrast, on the GPU, executing linear layers takes up the majority of the total execution time only when the input length is small, e.g. 40.98% and 37.9% for input lengths of 128 and 256, respectively. Therefore, works that focus on optimizing the attention layers can only be effective in the limited case of large input lengths, where attention accounts for a larger share of the execution time.
On the other hand, accelerators that focus solely on optimizing linear layers also suffer from the same drawbacks in certain commonplace scenarios when the input length is large. For instance, executing attention layers on the GPU consumes up to 46.15% of the total execution time when the input length is 512.
Naively adopting a combination of previous works to optimize both the linear and attention subcomponents, however, would require the instantiation of two separate engines, resulting in excessively high resource consumption. As such, there is a gap in designing an accelerator for scalable all-purpose Transformer-based models.
The present applicant addresses this challenge by proposing an adaptable engine that accelerates both the attention and linear layers through its runtime reconfigurability, thus leading to a considerable decrease in resource consumption with negligible accuracy loss.
Algorithm Optimization
Different sparsity patterns exhibit diverse data access patterns, which calls for custom hardware support. However, supporting multiple sparsity patterns may complicate the hardware design. For instance, to fully utilize the sparsity in the random pattern, complex dynamic controllers are required to achieve a load-balanced execution on different hardware engines. The extra overhead of such controllers may counteract the improvement brought by skipping sparse operations, especially when the transistor-size goes down.
The present disclosure aims to find a hardware-friendly sparsity pattern that: (1) has structured data access patterns to simplify the memory design, (2) achieves satisfactory algorithmic performance without the help of other sparsity patterns, and (3) is applicable to both the attention mechanism and FFNs to achieve scalable improvement.
Among the five sparsity patterns in
In the ABfly block, the multi-head attention mechanism is kept, but all the linear layers are compressed using butterfly factorization. To further reduce the amount of computation and parameters, the FBfly block replaces the entire attention mechanism with a 2D Fourier transform layer. Although FBfly is less compute and memory intensive than ABfly, the use of the Fourier transform layer may also affect the accuracy. To recover the algorithm performance when needed, a butterfly-based network called FABNet is proposed that introduces a hybrid of the ABfly and FBfly blocks, as depicted in
Hardware Accelerator
The first processor 810 consists of a number $P_{BE}$ of Butterfly Engines (BEs) 812, which are used to accelerate the computations that involve butterfly patterns, including both FFT and butterfly linear transformations. That is, the first processor 810 may comprise a plurality of butterfly engines 812 used to accelerate the computations involving a sparse matrix. Each butterfly engine 812 may comprise a memory system and a plurality of butterfly units 814, wherein each butterfly unit 814 may be configurable to perform a specific computation.
Each butterfly unit 814 may comprise a plurality of multiplexers arranged to select inputs required to perform the specific computation, as explained in more detail below.
In an embodiment, the first processor 810 generates a query matrix, a key matrix, and a value matrix by performing Fast Fourier Transform (FFT) and butterfly linear transform on at least one input matrix.
In an embodiment, the first processor 810 comprises at least one butterfly engine 812 configured to accelerate computations involving a sparse matrix with respect to the FFT and the butterfly linear transform. The at least one butterfly engine 812 comprises an (adaptable) memory system 816 and a plurality of (adaptable) butterfly units 814, wherein each of the butterfly units 814 comprises at least one real-number multiplier, at least one real-number adder, at least one complex-number adder, at least one multiplexer, and at least one de-multiplexer. The configuration and functions of the at least one real-number multiplier, at least one real-number adder, at least one complex-number adder, at least one multiplexer, and at least one de-multiplexer are described in detail in
In an embodiment, each of the butterfly units 814 performs the FFT and the butterfly transform based on the at least one input matrix and a plurality of twiddle factors. In a case of performing the butterfly linear transform, the twiddle factors are non-symmetric real numbers. In a case of performing the FFT, the twiddle factors are complex and symmetric numbers.
In an embodiment, the first processor 810 may generate the key matrix and the value matrix before the query matrix. The second processor 820 may start the first matrix multiplication when at least part of the query matrix becomes available.
In an embodiment, the second processor 820 may start the second matrix multiplication when at least part of the result of the softmax operation becomes available.
The electronic device 800 may comprise a second processor 820, also referred to herein as an Attention Processor (AP). When the ML model comprises at least one attention layer requiring computations that do not involve a sparse matrix having a butterfly sparsity pattern, the second processor 820 is used to perform operations required by this at least one attention layer. The AP 820 contains Phead number of Attention Engines (AEs) 822, and each AE 822 is composed of one QK unit 824 and one SV unit 826. The QK unit 824 is designed to implement the softmax and the matrix multiplication between queries and keys (i.e., transposed key matrices). The SV unit 826 receives the outputs from the QK unit 824, and multiplies the results with value vectors to generate the final results of the attention layer.
In an embodiment, the QK unit 824 of the second processor 820 performs a first matrix multiplication between the query matrix and the key matrix. The QK unit 824 of the second processor 820 performs a softmax operation on the result of the first matrix multiplication. The SV unit 826 of the second processor 820 performs a second matrix multiplication between the result of the softmax operation and the value matrix.
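The three operations above can be sketched in a few lines of Python/NumPy. This is a minimal functional model, not the hardware design: the function names are hypothetical, and no scaling of the scores is applied because none is described here.

```python
import numpy as np

def softmax_rows(scores):
    """Numerically stable softmax applied independently to each row."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def attention_engine(q, k, v):
    """Functional model of one attention engine (hypothetical name).

    q: (n_queries, d), k: (n_keys, d), v: (n_keys, d_v)
    """
    scores = q @ k.T               # QK unit: first matrix multiplication
    probs = softmax_rows(scores)   # QK unit: softmax
    return probs @ v               # SV unit: second matrix multiplication
```

For example, if every key is the zero vector, all scores are equal, the softmax is uniform, and each output row is the mean of the value rows.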
The electronic device 800 comprises a third processor 830, also referred to herein as a Post-processing Processor (PostP), the off-chip memory and different on-chip buffers. That is, the electronic device 800 comprises a third processor 830 for receiving outputs from the first processor 810 and for performing post-processing using the received outputs. The third processor 830 is responsible for executing the layer normalization and shortcut (SC) addition. To ease the on-chip memory consumption, the intermediate results between different FFT and butterfly linear transformation operations are transferred back to the off-chip memory. All the on-chip buffers utilize double-buffering in order to overlap the data transfer with the computation.
In an embodiment, the third processor 830 receives the query matrix, the key matrix, and the value matrix from the first processor 810. The third processor 830 performs at least one of layer normalization or shortcut addition based on the query matrix, the key matrix, and the value matrix.
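A minimal Python/NumPy sketch of the two post-processing operations follows. It assumes that the shortcut addition is applied before the layer normalization and omits any learned scale and bias; both choices, and the function names, are assumptions for illustration rather than details taken from this description.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and (approximately) unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_process(block_output, shortcut):
    """Shortcut (SC) addition followed by layer normalization (assumed order)."""
    return layer_norm(block_output + shortcut)
```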
Each BE 812 is mainly composed of an (adaptable) memory system 816 and PBU number of (adaptable) Butterfly Units (BUs) 814. To improve the hardware efficiency and enable the use of a single unified engine, the BE 812 is designed with a focus on adaptability. As such, it can be configured via programmable multiplexers and de-multiplexers at runtime to either execute an FFT or a butterfly linear transformation.
In an embodiment, four real-number multipliers are arranged to multiply at least one input matrix and the plurality of twiddle factors. Two real-number adders or subtractors are arranged to add or subtract on outputs of two real-number multipliers from among the four real-number multipliers. Two complex-number adders or subtractors are arranged to add or subtract on outputs of the two real-number adders or subtractors. Eight multiplexers are arranged to select the at least one input matrix required to perform the FFT or the butterfly linear transform. Two de-multiplexers are arranged to control an output flow comprising outputting the data from the two real-number adders or subtractors, or providing the data from the two real-number adders or subtractors to two complex-number adders or subtractors.
In an embodiment, control signals for the eight multiplexers or the two de-multiplexers are set before performing the FFT and the butterfly linear transform.
When performing the butterfly linear transformation (see
outbt1=inbt1·wbt1+inbt2·wbt3, outbt2=inbt1·wbt2+inbt2·wbt4
where inbt1-4 and wbt1-4 represent the inputs and twiddle factors, respectively. To perform the butterfly linear transformation, four multipliers in each BE are configured to execute the four real-number multiplications in the equation above. The values inbt1˜4 and wbt1˜4 are selected via multiplexers as the operands of multipliers. At the same time, the results outbt1˜2 generated from the real-number adders/subtractors are outputted directly from the de-multiplexers.
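The two configurations of a single butterfly unit can be sketched in Python as follows. The butterfly branch follows the equation above directly. The FFT branch is an assumption: it uses the standard radix-2 butterfly and maps it onto the stated resources (four real-number multipliers, two real-number adders or subtractors, two complex-number adders or subtractors). The function name and the argument layout are hypothetical.

```python
def butterfly_unit(mode, in1, in2, w):
    """Functional model of one butterfly unit (BU) in either configuration.

    mode "bfly": in1, in2 are real; w = (w1, w2, w3, w4) are real twiddle factors.
    mode "fft":  in1, in2 are complex; w is a single complex twiddle factor.
    """
    if mode == "bfly":
        w1, w2, w3, w4 = w
        # Four real multiplications.
        m1, m2, m3, m4 = in1 * w1, in2 * w3, in1 * w2, in2 * w4
        # Two real additions; the de-multiplexers output these results directly.
        return m1 + m2, m3 + m4
    if mode == "fft":
        # Four real multiplications form the complex product in2 * w (assumed mapping).
        m1, m2 = in2.real * w.real, in2.imag * w.imag
        m3, m4 = in2.real * w.imag, in2.imag * w.real
        # One real subtraction and one real addition give its real and imaginary parts.
        prod = complex(m1 - m2, m3 + m4)
        # The de-multiplexers forward the product to the complex adder and subtractor.
        return in1 + prod, in1 - prod
    raise ValueError(f"unknown mode: {mode}")
```

For instance, `butterfly_unit("bfly", 1.0, 2.0, (3.0, 4.0, 5.0, 6.0))` evaluates 1·3+2·5 and 1·4+2·6.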
For FFT (see
As shown in (a) of
To avoid the bank conflict, a custom data layout strategy is introduced ((a) of
P0=0, P2
For each 2n-2n-1 columns, the starting positions P2
In an embodiment, the function and structure shown in
While executing the ABfly block, both BP and AP are used to perform butterfly linear transformation and attention matrix multiplication, respectively. To further improve the performance when executing the ABfly block, the present applicant employs a fine-grained pipeline between BP and AP.
to compute one row in the SV and QK units respectively, the total latency reduction is
The overall design space of the present applicant's end-to-end system is formed by FABNet's hyper-parameters and the butterfly accelerator's hardware parameters. Specifically, the joint design space consists of: 1) the algorithm parameters, i.e. the hidden size (Dhid), the expand ratio of FFN (Rffn), the total number of blocks (Ntotal) and the number of ABfly blocks (NABfly) in FABNet, and 2) the hardware parameters, i.e. the parallelism of BU (Pbu) and BE (Pbe) in BP, and the parallelism of the QK (Pqk) and SV (PSV) units in AP.
To assess the trade-off provided by each design point, it is necessary to evaluate its algorithmic performance (e.g. an accuracy metric), its latency and its resource consumption. During search, the algorithmic performance is obtained by training and evaluating FABNet, while the latency is estimated by utilizing a custom cycle-accurate simulator built for the present applicant's butterfly accelerator. To verify whether the design can be accommodated by the target FPGA device, the present applicant developed an analytical model to estimate the consumption of DSP blocks and on-chip memory (BRAMs), and used a set of place-and-route measurements to fit linear regression models for the consumption of Look-Up Tables (LUTs) and registers.
To evaluate the present algorithm and hardware performance for the workloads with long sequences, five tasks from Long-Range-Arena are chosen: hierarchical data classification (ListOps), byte-level text classification (Text), byte-level document retrieval (Retrieval), image classification for sequences of pixels (Image), and classification of long-range spatial dependency (Pathfinder). The input sequence lengths of these datasets range from 1024 to 4000.
Software Implementation. The vanilla Transformer, FNet, and the present FABNet models are implemented using the PyTorch 1.10 framework. The pretrained models are obtained from Huggingface 4.16. The batch size is 256 for both Image and Pathfinder tasks, and 32 for the remaining datasets during training. The learning rate is 0.0001, except for the Image task with 0.01 and the Pathfinder task with 0.0005. Multiple Nvidia A100 and V100 GPUs are used for training. To use FFT cores on Nvidia GPUs, the PyTorch API “rfft2” is used to implement the FFT operation required in both FNet and FABNet. The high-performance CUDA implementation of butterfly linear transformation is adopted to accelerate both GPU training and inference. Two models are defined with different default settings: FABNet-Base (Dhid=768, Rffn=4, Ntotal=12, NABfly=0) and FABNet-Large (Dhid=1024, Rffn=4, Ntotal=24, NABfly=0).
The present applicant implements their hardware accelerators using Verilog. To evaluate performance in different scenarios, two Xilinx FPGA boards are used in the experiments: VCU128 for cloud/server scenarios and Zynq 7045 for edge/mobile settings. Xilinx Vivado 2019.1 is used for synthesis and implementation. The achievable clock frequency of the present designs varies with the FPGA board and the resource consumption. All the FPGA designs are clocked at 200 MHz, which is below the maximum. The power consumption is obtained using the Xilinx Power Estimator (XPE) tool provided by Vivado. A cycle-accurate performance model is developed to evaluate the speed performance, which is cross-validated with RTL simulation results. The present applicant's hardware design uses 16-bit floating point.
Effectiveness of Co-Design. The effectiveness of the present applicant's co-design approach in finding the optimal algorithm and hardware designs is evaluated. For demonstration, LRA-Text is used as the target dataset and VCU128 FPGA as the target device. Dhid, Rffn, NABfly and Ntotal are selected from {64, 128, 256, 512, 1024}, {1, 2, 4}, {0, 1} and {1, 2} respectively. The hardware parallelisms (Pbe, Pbu, Pqk and PSV) are chosen from {4, 8, 16, 32, 64, 128}. Among the design points that satisfy the accuracy constraint, the point with the lowest latency in the Pareto front is chosen as the point of comparison. The selected point is up to 10% more accurate than the points in the same latency range and up to 130× faster than points in the same accuracy range, underlining the advantages of the present applicant's co-design approach. To get the configurations for the rest of the datasets in LRA, the overall accuracy loss is constrained to be less than 0.5% compared with the vanilla Transformer. The final models and designs are chosen as the configurations with the highest hardware performance without violating the accuracy constraints. Unless mentioned otherwise, the rest of the sections report the algorithm and hardware performance using the configurations optimized by the present applicant's hardware configurations.
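To give a sense of the size of this search, the following Python sketch enumerates the joint design space from the value sets listed above and picks the lowest-latency point that meets an accuracy constraint. It assumes that each of the four hardware parallelisms is chosen independently from the same set; the function names and the small list of evaluated points are made up for illustration and are not measured results.

```python
from itertools import product

# Value sets taken from the paragraph above.
D_HID = [64, 128, 256, 512, 1024]
R_FFN = [1, 2, 4]
N_ABFLY = [0, 1]
N_TOTAL = [1, 2]
PARALLELISM = [4, 8, 16, 32, 64, 128]  # assumed to apply independently to each of the 4 parameters

def enumerate_design_space():
    """Yield (Dhid, Rffn, NABfly, Ntotal, Pbe, Pbu, Pqk, Psv) tuples."""
    for algo in product(D_HID, R_FFN, N_ABFLY, N_TOTAL):
        for hw in product(PARALLELISM, repeat=4):
            yield algo + hw

def select_design(evaluated, min_accuracy):
    """Pick the lowest-latency point that satisfies the accuracy constraint.

    evaluated: list of (design, accuracy, latency) tuples.
    Ties in latency are broken in favour of higher accuracy, so the
    returned point is not dominated by any other feasible point.
    """
    feasible = [e for e in evaluated if e[1] >= min_accuracy]
    if not feasible:
        return None
    return min(feasible, key=lambda e: (e[2], -e[1]))

num_points = sum(1 for _ in enumerate_design_space())

# Hypothetical (design, accuracy, latency) values, for illustration only.
demo = [("a", 0.60, 5.0), ("b", 0.64, 9.0), ("c", 0.65, 7.0), ("d", 0.66, 7.0)]
best = select_design(demo, min_accuracy=0.63)
```

Under these assumptions the space contains 5 × 3 × 2 × 2 × 6⁴ = 77,760 points, which is why latency is estimated with a simulator rather than by implementing every point.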
To demonstrate the advantage of co-designing both algorithm (FABNet) and hardware (butterfly accelerator), a baseline design is implemented to accelerate the vanilla BERT. The baseline hardware was designed with multiple multiplier-accumulator (MAC) units to accelerate the linear transform and different matrix multiplications between query, key and value vectors. Each MAC is composed of a set of multipliers followed by an adder tree. The fine-grained intra- and inter-layer pipeline techniques were used to optimize the hardware performance. The parallelism of each MAC unit is allocated according to its workload in order to achieve load-balanced execution between different pipeline stages. For a fair comparison, both baseline and butterfly accelerators are implemented on a VCU128 FPGA using 2048 DSPs. The high bandwidth memory (HBM) was used as the external memory. Both designs were clocked at 200 MHz. Both base and large versions of each model are evaluated using four different input sequence lengths (128, 256, 512 and 1024). The base version contains 12 layers and the large version has 24 layers. A speedup breakdown is shown in
The comparison against GPU and CPU is performed in both server and edge scenarios. In the edge setting, the butterfly accelerator is implemented on a Xilinx Zynq 7045 FPGA. DDR4 is used as external memory and 512 DSPs are used for computation. Nvidia Jetson Nano GPU and Raspberry Pi4 are used as GPU and CPU platforms respectively. In the server scenario, a Xilinx VCU128 FPGA is used to implement the butterfly accelerator. HBM is used as external memory and the design consumes 2048 DSPs. Nvidia V100 and TITAN Xp GPUs are used for comparison. The highly-optimized CUDA code is used for GPU implementations. FPGA designs are clocked at 200 MHz. Both FABNet-Base and FABNet-Large are evaluated using input sequence lengths of 128, 256, 512 and 1024.
In the edge scenario, the present design on Zynq 7045 FPGA achieves a 4.5˜10 times speedup over Jetson Nano GPU and a 41.2˜394.8 times speedup over Raspberry Pi4. On FABNet-Large with input sequences longer than 768, Raspberry Pi4 suffers from an out-of-memory (OOM) issue. At the same time, the present design also shows 5.0˜12.5 and 21.3˜197.4 times higher energy efficiency than Jetson Nano and Raspberry Pi4 respectively. In the server setting, the present design on VCU128 is up to 10.9 and 756 times faster than RTX 2080 Ti GPU and Golden 6145 CPU respectively. Up to 168.0 and 7624.9 times higher energy efficiency is achieved than RTX 2080 Ti GPU and Golden 6145 CPU respectively. The end-to-end speedup on both edge and server scenarios under different input sequences also demonstrates the scalability of the present butterfly accelerator.
As shown in
In order to investigate the sensitivity of the present design to off-chip memory bandwidth, the bandwidth is varied across 6, 12, 25, 50, 100 and 200 GB/s, and the latency is evaluated using the present custom cycle-accurate simulator. For this experiment, four different designs with 16, 32, 64 and 128 BEs are used when executing FABNet-Large with 24 layers. To understand the bandwidth requirements under both short and long input lengths, each design is evaluated using three input sequence lengths (128, 1024 and 4096). The results are shown in
Power and Resource Analysis. Table 6 shows the breakdown of the power consumption based on the report generated from the Vivado XPE tool. Two designs are implemented with 128 BEs (BE-128) and 40 BEs (BE-40) on VCU128 FPGA. In both designs, the dynamic power takes more than 70% of the total power consumption. The memory resources, including both BRAM and HBM, consume more than 40% of the dynamic power. Furthermore, when the number of BEs scales from 40 to 128, the power of clocking, logic & signal and DSPs increases from 0.724 W, 0.844 W and 0.129 W to 2.143 W, 2.693 W and 0.412 W, respectively. Table 7 presents the resource consumption of both BE-40 and BE-128 on the same VCU128 FPGA. Due to the use of FFT and butterfly matrices, the present FABNet becomes less memory-intensive than the vanilla attention-based NNs. Since the theoretical memory bandwidth of a single HBM (450 GB/s) can already satisfy the requirement of the present accelerator, one HBM is used in both designs to reduce the resource and power consumption. When the number of BEs decreases from 128 to 40, the BRAM utilization is reduced from 1042 to 338. The reduction can also be observed on the LUT and Register resources.
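As a quick worked check of how these figures scale, the short Python snippet below computes the ratio between the BE-128 and BE-40 values quoted in the paragraph above. Only the quoted numbers are used; the variable names are arbitrary.

```python
# Number of butterfly engines in the two designs.
be_ratio = 128 / 40  # 3.2

# (BE-40, BE-128) power in watts, as quoted above.
power_w = {
    "clocking": (0.724, 2.143),
    "logic & signal": (0.844, 2.693),
    "DSPs": (0.129, 0.412),
}

# Scaling factor of each power component when going from 40 to 128 BEs.
scaling = {name: p128 / p40 for name, (p40, p128) in power_w.items()}

# BRAM utilization scaling (338 for BE-40, 1042 for BE-128).
bram_ratio = 1042 / 338

for name, factor in scaling.items():
    print(f"{name}: x{factor:.2f}")
print(f"BRAM: x{bram_ratio:.2f}, BEs: x{be_ratio:.2f}")
```

The three power components grow by factors of about 2.96, 3.19 and 3.19, and the BRAM count by about 3.08, all close to the 3.2 times increase in the number of BEs, which suggests roughly linear scaling of these resources with the number of engines.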
The method may further comprise: using the second processor of the processing unit to: perform an attention mechanism computation using the input matrix and an output of the first processor, where the attention mechanism computation does not involve a sparse matrix having a butterfly sparsity pattern. That is, when the ML model comprises even a single attention layer that is vanilla or conventional, i.e. which does not use butterfly sparsity, the second processor is used to execute that attention layer when the model is executed.
The method may further comprise: receiving the query matrix, the key matrix, and the value matrix; and performing at least one of layer normalization or shortcut addition based on the query matrix, the key matrix, and the value matrix.
The method may further comprise: generating the key matrix and the value matrix before the query matrix; and starting the first matrix multiplication when at least part of the query matrix becomes available. Herein, “a matrix is available” means that in an operation using a matrix as an operand, at least one row or at least one column of the matrix required for the operation is in a completed or prepared state.
The method may further comprise: starting the second matrix multiplication when at least part of the result of the softmax operation becomes available.
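The ordering described above can be illustrated with a row-streaming sketch in Python/NumPy. It assumes that the key and value matrices are complete and that query rows arrive one at a time; each row is then taken through the first matrix multiplication, the softmax and the second matrix multiplication without waiting for the remaining rows. The function name is hypothetical, and the sketch models only the data dependencies, not hardware timing.

```python
import numpy as np

def streamed_attention(query_rows, k, v):
    """Process query rows as they become available.

    query_rows: iterable yielding one query row of shape (d,) at a time.
    k: (n_keys, d) and v: (n_keys, d_v) are assumed to be complete already.
    """
    out_rows = []
    for q_row in query_rows:
        scores = k @ q_row                  # first matrix multiplication, one row
        e = np.exp(scores - scores.max())   # softmax over that row
        probs = e / e.sum()
        out_rows.append(probs @ v)          # second matrix multiplication, one row
    return np.stack(out_rows)
```

Because the softmax is taken row by row, the result matches the one obtained by computing the full score matrix first, so the computations can overlap without changing the output.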
The apparatus 1000 comprises at least one processor 1100. The at least one processor 1100 may comprise one or more of: a microprocessor, a microcontroller, and an integrated circuit. The at least one processor 1100 may be, for example, a CPU or GPU.
The apparatus 1000 comprises an electronic device 800 of the type described herein, including with reference to
The apparatus 1000 may comprise a memory 1200. The memory 1200 may comprise volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.
The apparatus 1000 comprises an ML model 1400. The electronic device 800 accelerates computations of the ML model which involve a sparse matrix, the sparse matrix having a butterfly sparsity pattern. As explained above, the first processor 810 of the electronic device 800 accelerates computations of any layer of the ML model 1400, as long as the computations involve matrices having a butterfly sparsity pattern. That is, when the ML model 1400 begins processing some data using the processor 1100, and the processing needs to be accelerated (because of, for example, a latency requirement), the processor 1100 may offload the task to the electronic device 800.
The apparatus 1000 comprises at least one interface 1300 for receiving data/data items for processing by the ML model 1400. The interface 1300 may be a camera for capturing images or a microphone for capturing audio. It will be understood these are non-limiting example interfaces.
In an embodiment of this disclosure, an electronic device for accelerating machine learning, ML, model computations is provided. The electronic device comprises: a first processor configured to: generate a query matrix, a key matrix, and a value matrix by performing Fast Fourier Transform (FFT) and butterfly linear transform on at least one input matrix, and a second processor configured to: perform a first matrix multiplication between the query matrix and the key matrix, perform a softmax operation on the result of the first matrix multiplication, and perform a second matrix multiplication between the result of the softmax operation and the value matrix.
In an embodiment of this disclosure, the first processor comprises at least one butterfly engine configured to accelerate computations involving a sparse matrix with respect to the FFT and the butterfly linear transform, wherein the at least one butterfly engine comprises a memory system and a plurality of butterfly units, wherein each of the butterfly units comprises at least one real-number multiplier, at least one real-number adder, at least one complex-number adder, at least one multiplexer, and at least one de-multiplexer.
In an embodiment of this disclosure, each of the butterfly units is configured to perform the FFT and the butterfly transform based on the at least one input matrix and a plurality of twiddle factors.
In an embodiment of this disclosure, in a case of performing the butterfly linear transform, the twiddle factors are non-symmetric real numbers, and in a case of performing the FFT, the twiddle factors are complex and symmetric numbers.
In an embodiment of this disclosure, each of the butterfly units comprises: four real-number multipliers, arranged to multiply at least one input matrix and the plurality of twiddle factors, two real-number adders or subtractors, arranged to add or subtract on outputs of two real-number multipliers from among the four real-number multipliers, two complex-number adders or subtractors, arranged to add or subtract on outputs of the two real-number adders or subtractors, eight multiplexers, arranged to select the at least one input matrix required to perform the FFT or the butterfly linear transform, and two de-multiplexers, arranged to control an output flow comprising outputting the data from the two real-number adders or subtractors, or providing the data from the two real-number adders or subtractors to two complex-number adders or subtractors.
In an embodiment of this disclosure, control signals for the eight multiplexers or the two de-multiplexers are set before performing the FFT and the butterfly transform.
In an embodiment of this disclosure, the memory system is configured to: calculate starting positions of data layout corresponding to the input matrix, the starting positions indicating how many rows a first element in a current column should be shifted down, permute the at least one input matrix based on the starting positions, and offer data access to the butterfly units based on the permuted input matrix.
In an embodiment of this disclosure, the electronic device comprises: a third processor configured to: receive the query matrix, the key matrix, and the value matrix from the first processor, and perform at least one of layer normalization or shortcut addition based on the query matrix, the key matrix, and the value matrix.
In an embodiment of this disclosure, the first processor is configured to generate the key matrix and the value matrix before the query matrix, and the second processor is configured to start the first matrix multiplication when at least part of the query matrix becomes available.
In an embodiment of this disclosure, the second processor is configured to start the second matrix multiplication when at least part of the result of the softmax operation becomes available.
In an embodiment of this disclosure, a method for accelerating machine learning, ML, model computations is provided. The method comprises: generating a query matrix, a key matrix, and a value matrix by performing Fast Fourier Transform (FFT) and butterfly linear transform on at least one input matrix, performing a first matrix multiplication between the query matrix and the key matrix, performing a softmax operation on the result of the first matrix multiplication, and performing a second matrix multiplication between the result of the softmax operation and the value matrix.
In an embodiment of this disclosure, the method comprises: receiving the query matrix, the key matrix, and the value matrix, and performing at least one of layer normalization or shortcut addition based on the query matrix, the key matrix, and the value matrix.
In an embodiment of this disclosure, the method comprises: generating the key matrix and the value matrix before the query matrix, and starting the first matrix multiplication when at least part of the query matrix becomes available.
In an embodiment of this disclosure, the method comprises: starting the second matrix multiplication when at least part of the result of the softmax operation becomes available.
In an embodiment of this disclosure, a computer-readable storage medium is provided. The computer-readable storage medium comprises instructions which, when executed by a processor, cause the processor to carry out the method of this disclosure.
Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and, where appropriate, other modes of performing the present disclosure, the present disclosure should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that the present disclosure has a broad range of applications, and that the embodiments may take a wide range of modifications without departing from any inventive concept as defined in the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
20220100326 | Apr 2022 | GR | national |
23150205.5 | Jan 2023 | EP | regional |
Relation | Number | Date | Country |
---|---|---|---|
Parent | PCT/KR23/04696 | Apr 2023 | US |
Child | 18221089 | | US |