ELECTRONIC DEVICE AND METHOD FOR ACCELERATING NEURAL NETWORK COMPUTATIONS

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application is a continuation application, claiming priority under § 365(c), of International application No. PCT/KR2023/004696, filed on Apr. 7, 2023, which is based on and claims the benefit of Greek patent application number 20220100326, filed on Apr. 14, 2022, and European patent application number 23150205.5, filed on Jan. 3, 2023, the disclosures of which are incorporated by reference herein in their entirety.

FIELD OF THE INVENTION

The present application generally relates to an apparatus for accelerating neural network computations. In particular, the present application relates to a hardware accelerator, electronic device, or processing unit, for accelerating machine learning, ML, model computations.

BACKGROUND

Advances in deep learning have led to various neural networks (NNs) capable of achieving state-of-the-art accuracy on many AI tasks. Among these, recurrent neural networks (RNNs), such as Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) networks, have demonstrated their advantages in addressing sequence transduction and modelling problems.

However, the recurrence property of these models introduces two drawbacks: (1) the limited capacity of utilizing the past or external knowledge, which causes long-range information loss; and (2) the sequential nature of RNNs, which hinders their parallelisation in hardware and slows down their processing speed.

Recently, the attention mechanism has been proposed to address these shortcomings. It can process long sequences of data in parallel to capture long-range information. Based on this principle, different attention-based NNs, such as BERT and GPT, have been proposed that achieve state-of-the-art accuracy in various natural language processing (NLP) tasks.

The benefits of attention-based NNs, however, come at a cost: the high degree of parallelism significantly increases the number of parameters and the amount of computation, resulting in a large overhead on their speed and power consumption. To alleviate this, various hardware accelerators have been proposed. However, there are several issues in these designs. For example, these accelerators only focus on optimizing either the feed-forward networks (FFNs) or the attention mechanism. Without jointly optimizing both parts, these hardware designs lack scalability when accelerating end-to-end attention-based NNs with different input lengths. Furthermore, while optimizing the attention mechanism, most of the existing designs dynamically detect and prune redundant computation at runtime to achieve high sparsity on specific datasets and networks. However, the generality of these dynamic approaches needs to be further tested, as their performance gain may vary among different datasets and network architectures. Further still, since the sparsity patterns introduced by these dynamic approaches are unstructured, dynamic hardware controllers are required in order to exploit the sparsity. Such complicated controllers often contain a large number of clocking elements whose cost increases as the transistor size shrinks. As such, the performance or energy improvement brought by these dynamic methods may be diminished by the hardware overhead of these dynamic controllers.

Therefore, the present applicant has recognised the need for an improved hardware acceleration mechanism for machine learning computations.

SUMMARY

Demanding data-intensive applications like machine learning, and the need for high-speed computing, have resulted in a need for “accelerators” to offload work from general purpose processors. An accelerator is a hardware device which partners with general purpose processors, such as a CPU, to boost the speed at which data is processed, and therefore improve the performance of the processor. This is particularly useful where larger, more powerful processors are not able to be used within a device, such as within mobile communication devices, smartphones, home assistant devices, robotic devices, battery-operated devices, and so on.

In an embodiment of this disclosure, an electronic device for accelerating machine learning, ML, model computations is provided. The electronic device comprises: a first processor configured to: generate a query matrix, a key matrix, and a value matrix by performing Fast Fourier Transform (FFT) and butterfly linear transform on at least one input matrix, and a second processor configured to: perform a first matrix multiplication between the query matrix and the key matrix, perform a softmax operation on the result of the first matrix multiplication, and perform a second matrix multiplication between the result of the softmax operation and the value matrix.

In an embodiment of this disclosure, a method for accelerating machine learning, ML, model computations is provided. The method comprises: generating a query matrix, a key matrix, and a value matrix by performing Fast Fourier Transform (FFT) and butterfly linear transform on at least one input matrix, performing a first matrix multiplication between the query matrix and the key matrix, performing a softmax operation on the result of the first matrix multiplication, and performing a second matrix multiplication between the result of the softmax operation and the value matrix. In an embodiment of this disclosure, a computer-readable storage medium is provided. The computer-readable storage medium comprises instructions which, when executed by a processor, cause the processor to carry out the method of this disclosure.

In an embodiment of this disclosure, there is provided a processing unit, also referred to herein as “an electronic device”, for accelerating machine learning, ML, model computations, the processing unit comprising: a first processor for accelerating computations involving a sparse matrix, the sparse matrix having a butterfly sparsity pattern. The first processor may be used to accelerate computations of any layer of the ML model, as long as the computations involve matrices having a butterfly sparsity pattern. For example, the first processor may be used to accelerate computations of attention layers and/or linear layers of the ML model, when the butterfly sparsity pattern is adopted/applied to those layers. The ML model may comprise one or more further layers which do not adopt butterfly sparsity, but the first processor is not used to accelerate computations of these further layers.

The first processor is also referred to herein as a “butterfly processor”, because the first processor operates on data using a butterfly sparsity pattern. That is, the first processor accesses data from an input matrix according to a butterfly sparsity pattern, to reduce the number of computations that need to be performed, without losing accuracy.

The first processor may comprise a plurality of engines, also referred to herein as “butterfly engines”, used to accelerate the computations involving a sparse matrix.

Each butterfly engine may comprise an adaptable memory system and a plurality of adaptable butterfly units, wherein each butterfly unit may be configurable to perform a specific computation.

Each butterfly unit may be configurable to perform one or both of a Fast Fourier Transform, and a linear transformation. This is advantageous because many conventional techniques are able to perform either attention-based operations (i.e. linear transformations) or Fast Fourier Transform based operations, but not both. The butterfly unit of the present techniques is configurable, at runtime, to perform either operation based on which layer of the ML model the processing unit is being used for.

Each butterfly unit may comprise a plurality of multiplexers arranged to select inputs required to perform the specific computation. This enables the butterfly unit to be adapted/configured to perform the specific computation.

In an embodiment of this disclosure, each butterfly unit may comprise: four real-number multipliers, arranged to multiply inputs to perform the specific computation; two real-number adders or subtractors, arranged to operate on outputs of the real-number multipliers; and two complex-number adders or subtractors, arranged to operate on outputs of the real-number adders or subtractors. For example, to perform a linear transformation computation, all four of the real-number multipliers may be used to perform four real-number multiplications, whereas to perform an FFT computation, the four real-number multipliers may be used together to perform one complex-number multiplication. Thus, the same hardware may be configured, in real-time, to suit the computation.
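By way of illustration only, the two operating modes of a butterfly unit may be modelled in software as follows. This is a minimal sketch, not the disclosed circuit: the function name `butterfly_unit`, the `mode` strings and the argument layout are assumptions introduced here. It relies only on the fact that a complex product (a+bi)(c+di) = (ac−bd) + (ad+bc)i needs four real multiplications and two real additions or subtractions.

```python
def butterfly_unit(x1, x2, weights, mode):
    """Software model of one configurable butterfly unit (illustrative only)."""
    if mode == "linear":
        # Butterfly linear transform: four independent real multiplications.
        w1, w2, w3, w4 = weights
        m1, m2, m3, m4 = w1 * x1, w2 * x2, w3 * x1, w4 * x2
        # Two real adders combine the four products into two outputs.
        return m1 + m2, m3 + m4
    if mode == "fft":
        # FFT: one complex product w * x2, built from four real multiplications.
        (w,) = weights
        m1, m2 = w.real * x2.real, w.imag * x2.imag
        m3, m4 = w.real * x2.imag, w.imag * x2.real
        # Two real adders/subtractors form the real and imaginary parts.
        t = complex(m1 - m2, m3 + m4)
        # Two complex adders/subtractors produce the butterfly outputs.
        return x1 + t, x1 - t
    raise ValueError("mode must be 'linear' or 'fft'")
```

In both modes the same four multiplications are issued; only the selection of operands and the routing of the sums differ, which is the role the multiplexers and de-multiplexers play in the hardware description.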

Each butterfly unit may comprise: eight multiplexers, arranged to receive the inputs for the specific computation, and to select the inputs required to perform the specific computation; and two de-multiplexers, arranged between the two real-number adders or subtractors and the two complex-number adders or subtractors, to provide outputs of the butterfly unit.

The memory system of the butterfly engine may comprise a serial-to-parallel module arranged to access data for the specific computation based on the butterfly sparsity pattern.

The processing unit may further comprise: a third processor for receiving outputs from the first processor and for performing post-processing using the received outputs. The third processor is also referred to herein as a “post processor”.

As noted above, the first processor may be used to accelerate computations of any layer of the ML model, as long as the computations involve matrices having a butterfly sparsity pattern. For example, the first processor may be used to accelerate computations of attention layers and/or linear layers of the ML model, when the butterfly sparsity pattern is adopted/applied to those layers. However, when the ML model comprises at least one attention layer requiring computations that do not involve or use a sparse matrix having a butterfly sparsity pattern, the processing unit may further comprise a second processor for performing some operations required by an attention mechanism, i.e. for performing operations required by this/these attention layer(s). The second processor is also referred to herein as an “attention processor”. That is, when the ML model comprises even a single attention layer that is vanilla or conventional, i.e. which does not use butterfly sparsity, the second processor is used to execute that attention layer when the model is executed.

The second processor may comprise a plurality of attention engines, wherein each attention engine may comprise: a query key, QK, unit for implementing matrix multiplications between query and key matrices; and a score value, SV, unit for receiving outputs from the QK unit and multiplying the outputs with value vectors to output results of an attention layer of the ML model.

As mentioned above, the plurality of butterfly engines, and therefore the butterfly processor, may be configurable in real-time to perform the specific computation. The butterfly processor may be reconfigurable and adaptable. Thus, the butterfly processor may be configured at runtime to accelerate different layers using a single piece of hardware. This makes the present hardware accelerator efficient, as it can be used to accelerate different and multiple layers of the ML model.

During runtime, the processing unit may receive an input for processing by the ML model, and may use the first processor to perform a Fast Fourier Transform computation using the received input.

In an embodiment of this disclosure, there is provided a computer-implemented method for accelerating machine learning, ML, model computations using the processing unit described herein, the method comprising: receiving an input matrix for processing by the ML model; and using the first processor of the processing unit to: access data from the input matrix based on a butterfly sparsity pattern; configure the plurality of butterfly units of the first processor to perform a Fast Fourier Transform computation; and perform a Fast Fourier Transform computation using the accessed data.

The method may further comprise: using the second processor of the processing unit to:

- perform an attention mechanism computation, that does not involve a sparse matrix having a butterfly sparsity pattern, using the input matrix and an output of the first processor.

In a related approach of the present techniques, there is provided a computer-readable storage medium comprising instructions which, when executed by a processor, cause the processor to carry out the methods described herein.

As will be appreciated by one skilled in the art, the present techniques may be embodied as a system, method or computer program product. Accordingly, present techniques may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects.

Furthermore, the present disclosure may take the form of a computer program product embodied in a computer readable medium having computer readable program code embodied thereon. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.

Computer program code for carrying out operations of the present techniques may be written in any combination of one or more programming languages, including object oriented programming languages and conventional procedural programming languages. Code components may be embodied as procedures, methods or the like, and may comprise sub-components which may take the form of instructions or sequences of instructions at any of the levels of abstraction, from the direct machine instructions of a native instruction set to high-level compiled or interpreted language constructs.

Embodiments of the present disclosure also provide a non-transitory data carrier carrying code which, when implemented on a processor, causes the processor to carry out any of the methods described herein.

The techniques further provide processor control code to implement the above-described methods, for example on a general purpose computer system or on a digital signal processor (DSP). The techniques also provide a carrier carrying processor control code to, when running, implement any of the above methods, in particular on a non-transitory data carrier. The code may be provided on a carrier such as a disk, a microprocessor, CD- or DVD-ROM, programmed memory such as non-volatile memory (e.g. Flash) or read-only memory (firmware), or on a data carrier such as an optical or electrical signal carrier. Code (and/or data) to implement embodiments of the techniques described herein may comprise source, object or executable code in a conventional programming language (interpreted or compiled) such as Python, C, or assembly code, code for setting up or controlling an ASIC (Application Specific Integrated Circuit) or FPGA (Field Programmable Gate Array), or code for a hardware description language such as Verilog (RTM) or VHDL (Very high speed integrated circuit Hardware Description Language). As the skilled person will appreciate, such code and/or data may be distributed between a plurality of coupled components in communication with one another. The techniques may comprise a controller which includes a microprocessor, working memory and program memory coupled to one or more of the components of the system.

It will also be clear to one of skill in the art that all or part of a logical method according to embodiments of the present techniques may suitably be embodied in a logic apparatus comprising logic elements to perform the steps of the above-described methods, and that such logic elements may comprise components such as logic gates in, for example a programmable logic array or application-specific integrated circuit. Such a logic arrangement may further be embodied in enabling elements for temporarily or permanently establishing logic structures in such an array or circuit using, for example, a virtual hardware descriptor language, which may be stored and transmitted using fixed or transmittable carrier media.

In an embodiment, the present disclosure may be realised in the form of a data carrier having functional data thereon, said functional data comprising functional computer data structures to, when loaded into a computer system or network and operated upon thereby, enable said computer system to perform all the steps of the above-described method.

The methods described above may be wholly or partly performed on an apparatus, i.e. an electronic device, using a machine learning or artificial intelligence model. The model may be processed by an artificial intelligence-dedicated processor designed in a hardware structure specified for artificial intelligence model processing. The artificial intelligence model may be obtained by training. Here, “obtained by training” means that a predefined operation rule or artificial intelligence model configured to perform a desired feature (or purpose) is obtained by training a basic artificial intelligence model with multiple pieces of training data by a training algorithm. The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values and performs neural network computation by computation between a result of computation by a previous layer and the plurality of weight values.

As mentioned above, the present disclosure may be implemented using an AI model. A function associated with AI may be performed through the non-volatile memory, the volatile memory, and the processor. The processor may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning. Here, being provided through learning means that, by applying a learning algorithm to a plurality of learning data, a predefined operating rule or AI model of a desired characteristic is made. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.

The AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation by combining the result computed by the previous layer with its plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.

The learning algorithm is a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.

BRIEF DESCRIPTION OF THE DRAWINGS

Implementations of the present disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 is a diagram illustrating a structure of a Transformer network;

FIG. 2 shows an algorithm used by a multi-head attention module to perform linear transformation and attention;

FIG. 3 is a key for the notation used herein;

FIG. 4 shows latency breakdown of a Transformer with different input lengths on GPU and CPU;

FIG. 5 is a diagram showing basic sparsity patterns in existing efficient attention-based NNs;

FIG. 6 is a schematic diagram of a network structure of FABNet;

FIG. 7 is a table illustrating combinations of sparsity patterns in different variants;

FIG. 8 is a hardware overview of an adaptable butterfly accelerator described herein;

FIG. 9A shows the architecture of a butterfly unit;

FIG. 9B shows dataflow of a butterfly linear transform;

FIG. 9C shows dataflow of a Fast Fourier Transform;

FIG. 10 shows the butterfly memory system with adaptable address mapping strategies;

FIG. 11 shows adaptable fine-grained pipeline strategies;

FIG. 12 shows bank conflicts in column-major and row-major orders;

FIG. 13 shows a data layout scheme and hardware design of S2P module;

FIG. 14 shows an example of a 16-input butterfly;

FIG. 15 shows a hardware architecture of Index Coalescing module;

FIG. 16 is a table showing accuracy performance of different models on LRA;

FIG. 17 shows experimental data on the speedup breakdown compared to the baseline design;

FIG. 18 shows the compression rate of the present applicant's optimized FABNet over the vanilla Transformer and FNet in terms of floating-point operations (FLOPs) and model size (number of parameters);

FIG. 19 shows experimental data comparing performance against (a) high-end and (b) edge GPUs and CPUs;

FIG. 20 shows a comparison of the present applicant's techniques with existing Transformer accelerators in terms of latency, power and energy efficiency;

FIG. 21 shows experimental data on the achieved latency for different input sequence lengths when varying the available off-chip memory bandwidth;

FIG. 22 shows a flowchart describing a method for hardware acceleration;

FIG. 23 shows a flowchart describing a method for hardware acceleration; and

FIG. 24 is a block diagram of an apparatus 200 incorporating the accelerator of the present disclosure.

DETAILED DESCRIPTION OF FIGURES

Broadly speaking, the present disclosure generally relates to an apparatus for accelerating neural network computations. In particular, the present application relates to a hardware accelerator, or processing unit, for accelerating machine learning, ML, model computations.

To address the aforementioned issues, the present applicant adopts butterfly sparsity to accelerate attention-based models with at least three novel aspects: i) Fine-grained structured regularity, which possesses regular data accesses to optimize both memory and compute efficiency; ii) Static sparsity pattern, which avoids the need to design a dynamic controller in hardware; iii) Sparsity exploitation on both attention and linear layers, which allows scalable end-to-end acceleration of attention-based NNs. The present applicant therefore proposes FABNet, a hardware-friendly model for FFT, Attention and ButterflyNet. To fully exploit the sparsity in hardware, the present applicant proposes an adaptable butterfly accelerator that can be configured at runtime via dedicated hardware control to accelerate different layers using one unified engine, significantly improving hardware efficiency. To push the performance limit, the present applicant jointly optimized the model and hardware via a co-design approach.

Based on their network structure, attention-based NNs can be classified into three categories: (i) encoder-decoder, (ii) encoder-only, and (iii) decoder-only networks. The encoder-decoder NNs are mainly designed for sequence-to-sequence tasks, such as machine translation.

FIG. 1 is a diagram illustrating a structure of a Transformer 100. One of the most widely-used encoder-decoder networks is the Transformer, which is constructed by a stack of encoders 110 and decoders 120, as illustrated in FIG. 1, where N₁, D_hid and R_fft represent the input length, hidden size and FFN expand ratio, respectively. Each encoder 110 starts with a multi-head attention module 111, followed by a feed-forward network (FFN) 112 consisting of two linear (fully connected) layers. Finally, residual addition and layer normalization (LN) are used after the FFN 112.

FIG. 2 shows an algorithm used by a multi-head attention module to perform linear transformation and attention, and presents the pseudocode of the multi-head attention module.

FIG. 3 is a key for the notation used herein, and summarizes the notation used throughout the present disclosure. The inputs are first mapped to query (Q), key (K) and value (V) matrices through three different linear layers. The query matrix is then multiplied with K^T, followed by a softmax operation to obtain the score (S) matrix. The generated S is multiplied with V, and the resultant matrix flows into another linear layer, which generates the final output matrix of the multi-head attention. Similarly, the decoder employs two multi-head attention modules and one FFN, where the difference is that the key and value inputs of the second attention module come from the outputs of the last encoder.
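The sequence of operations described above (three linear mappings, multiplication of Q with K^T, softmax, multiplication of S with V, and a final linear layer) may be sketched for a single attention head as follows. This is an illustrative NumPy model only; the names `single_head_attention`, `w_q`, `w_k`, `w_v` and `w_o` are assumptions introduced here, bias terms are omitted, and the splitting of the hidden dimension across multiple heads is left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_attention(x, w_q, w_k, w_v, w_o):
    """x: (input_length, hidden_size); all weights: (hidden_size, hidden_size)."""
    # Three different linear layers map the input to Q, K and V.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Score matrix S = softmax(Q K^T), following the text above.
    # (The standard Transformer additionally scales Q K^T by 1/sqrt(d);
    # that factor is omitted here because the text does not mention it.)
    s = softmax(q @ k.T, axis=-1)
    # S is multiplied with V, then passed through the output linear layer.
    return (s @ v) @ w_o, s
```

The score matrix has shape (input length × input length), which is the source of the quadratic growth in compute and memory with input length.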

Based on the original encoder-decoder structure of the Transformer, different variants have been proposed. The encoder-only networks, such as BERT and XLM, are also called autoencoding models and have been widely applied to NLP tasks, such as sequence classification. The Vision Transformer (ViT) also lies in this category and introduces one extra linear projection layer at the beginning. Their encoder layers correspond to the encoder part of the original Transformer. Finally, the decoder-only networks represent the autoregressive models designed for NLP tasks, such as language modelling. GPT is a typical decoder-only model that corresponds to the decoder part of the original Transformer. Although the present applicant focuses on encoder-only networks, the hardware design is flexible and applicable to decoders too.

Butterfly Matrices and FFT. Despite the impressive accuracy attained using attention-based NNs, these models are expensive and not scalable: the self-attention mechanism in the Transformer scales quadratically in compute and memory as a function of the input sequence length. As a result, numerous works adopt structured linear mappings, such as sparse and low-rank matrices, to approximate the attention matrices in the attention components and/or the weight matrices in the feed-forward layers. Choosing an appropriate structure for each linear mapping, however, is application-dependent, often requiring domain expertise and entailing an arduous process of hand-picking solutions, as different structures have different trade-offs in performance and speed.

Generally speaking, sparse matrices are matrices that contain mostly zero values, i.e. most of the values/elements are zero. These are useful because they reduce the amount of computational processing that needs to be performed when performing matrix-based computations (such as Fourier transforms or fast Fourier transforms). As only the non-zero elements of a sparse matrix are stored, the amount of memory required to store the matrix is reduced (compared to a dense matrix). It is desirable to transform data into sparse matrices to make computations less complex/time consuming. As mentioned below in relation to FIG. 5, there are many different types of sparse matrix, which have their advantages and disadvantages. For example, a random sparsity pattern is where random values in the matrix are zero. Although this is sparse, it may be difficult for a processor to process because random values of the matrix will need to be accessed and read. For instance, to fully utilize the sparsity in the random pattern, complex dynamic controllers are required to achieve a load-balanced execution on different hardware engines. The extra overhead of such controllers may counteract the improvement brought by skipping sparse operations, especially when the transistor-size goes down.

To counteract this, many works have utilized butterfly matrices, which are universal representations of structured matrices and are practically efficient due to their simple recursive structure. Specifically, each butterfly matrix W_Bfly of size N encodes the recursive divide-and-conquer structure of the Fast Fourier Transform (FFT) and, hence, can be expressed as the product of sparse butterfly factor matrices as follows:

$W_{Bfly} = W_{N}^{'} [\begin{matrix} W_{N / 2}^{'} & 0 \\ 0 & W_{N / 2}^{'} \end{matrix}] \dots [\begin{matrix} W_{2}^{'} & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & W_{2}^{'} \end{matrix}]$

where each W′_N, named a butterfly factor, is a 2×2 block matrix of diagonal matrices Bⁱ with size N/2, whose entries can be trained via gradient-based methods:

$W_{N}^{'} = [\begin{matrix} B_{N / 2}^{1} & B_{N / 2}^{2} \\ B_{N / 2}^{3} & B_{N / 2}^{4} \end{matrix}]$

In other words, butterfly matrices arise from the Fast Fourier Transform algorithm. The FFT algorithm works by performing the following steps: (1) separating the odd and even indices of the input, (2) performing an FFT on each half of the input, and (3) recombining pairs of indices using a 2×2 matrix. Analysing the sparsity pattern of this FFT algorithm shows that it yields a recursive factorisation of the FFT matrix. The factorisation's sparsity pattern is called a butterfly matrix, and each individual sparse matrix in the product (see equation above) is a butterfly factor. As shown in FIG. 5, the sparsity pattern of each butterfly matrix W_Bfly differs but is fixed.
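The factorisation above may be illustrated with a short NumPy sketch that builds each butterfly factor from four diagonal blocks, forms the product, and applies it one factor at a time. The helper names (`butterfly_factor`, `random_butterfly_factors`, `apply_butterfly`) and the use of random entries are assumptions for illustration; the factors are stored densely here for readability, whereas an efficient implementation would touch only the non-zero entries.

```python
import numpy as np

def butterfly_factor(n, b1, b2, b3, b4):
    """Dense n x n butterfly factor W'_n: a 2x2 block matrix of diagonal blocks."""
    return np.block([[np.diag(b1), np.diag(b2)],
                     [np.diag(b3), np.diag(b4)]])

def random_butterfly_factors(N, rng):
    """Factors ordered as in the product: W'_N first, block-diagonal W'_2 last."""
    factors = []
    n = N
    while n >= 2:
        block = butterfly_factor(n, *rng.standard_normal((4, n // 2)))
        # kron(I, block) repeats the same block along the diagonal.
        factors.append(np.kron(np.eye(N // n), block))
        n //= 2
    return factors

def apply_butterfly(factors, x):
    """Compute W_Bfly @ x one factor at a time (rightmost factor first)."""
    for factor in reversed(factors):
        x = factor @ x
    return x

N = 8  # assumed to be a power of two
rng = np.random.default_rng(0)
factors = random_butterfly_factors(N, rng)
W_bfly = np.linalg.multi_dot(factors)  # the full N x N butterfly matrix
x = rng.standard_normal(N)
```

For N = 8 there are log₂N = 3 factors, each with two non-zero entries per row, so applying them in sequence needs on the order of N·log₂N multiplications instead of the N² required by a dense matrix-vector product.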

Due to their expressiveness in representing structured matrices and approximating unstructured data, butterfly matrices and their variants have found success in replacing attention and weight matrices, considerably improving the accuracy and efficiency of attention-based NNs.

Besides attention and weight matrices, some works have explored replacing the entire attention mechanism with more efficient counterparts. A prominent example is FNet, in which the self-attention modules are replaced with 2D Discrete Fourier Transform (DFT) operations. Specifically, for each input, a 1D DFT is applied along the sequence dimension and another 1D DFT is applied along the hidden dimension, keeping only the real component of the resulting output. As the use of the DFT facilitates information flow across all embeddings, it results in performance similar to that of vanilla self-attention layers, but with a significant reduction in latency and memory.
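The Fourier mixing step described above can be sketched in a few lines of NumPy. The function name `fourier_mixing` is an assumption made here for illustration, and the input is taken to be a single (sequence length × hidden size) matrix.

```python
import numpy as np

def fourier_mixing(x):
    """x: (sequence_length, hidden_size) real-valued input embeddings."""
    # 1D DFT along the hidden dimension, then 1D DFT along the sequence dimension.
    mixed = np.fft.fft(np.fft.fft(x, axis=-1), axis=-2)
    # Only the real component of the result is kept.
    return mixed.real
```

Because the two 1D transforms act on different axes, the result equals the real part of a 2D DFT of the input, and the operation has no trainable parameters.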

On the algorithmic front, the proposed FABNet utilizes a mixture of these techniques, namely FFT and butterfly matrices, to outperform existing works in terms of accuracy. Notably, since FFT matrices are a special case of butterfly matrices with B¹_{N/2} and B³_{N/2} being identity matrices and B²_{N/2} and B⁴_{N/2} being twiddle factors, both the FFT and butterfly matrices possess the recursive butterfly structure. Therefore, it is possible to unify the computational and data access patterns, and then devise a single hardware engine to accelerate both FFT and butterfly-based operations with high hardware efficiency.
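To illustrate the special case, the following sketch builds butterfly factors whose first and third blocks are identity matrices and whose second and fourth blocks hold twiddle factors (with opposite signs), and checks the result against a library FFT. It assumes the radix-2 decimation-in-time formulation with a bit-reversal reordering of the input and a power-of-two length; the function names are hypothetical.

```python
import numpy as np

def fft_butterfly_factor(n):
    """Butterfly factor for the FFT: [[I, T], [I, -T]] with diagonal twiddles T."""
    half = n // 2
    identity = np.eye(half)
    twiddle = np.diag(np.exp(-2j * np.pi * np.arange(half) / n))
    return np.block([[identity, twiddle],
                     [identity, -twiddle]])

def fft_via_butterfly(x):
    """FFT of a length-N vector (N a power of two, N >= 2) via butterfly factors."""
    N = len(x)
    stages = N.bit_length() - 1
    # Bit-reversal permutation: the fully unrolled odd/even separation.
    order = [int(format(i, f"0{stages}b")[::-1], 2) for i in range(N)]
    y = np.asarray(x, dtype=complex)[order]
    n = 2
    while n <= N:
        # Block-diagonal factor made of N/n copies of the size-n butterfly factor.
        y = np.kron(np.eye(N // n), fft_butterfly_factor(n)) @ y
        n *= 2
    return y
```

The only difference from a trainable butterfly linear transform is the content of the diagonal blocks, so the same access pattern and the same sequence of log₂N stages serve both computations.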

Latency Breakdown and Motivation. The majority of previous Transformer-based accelerators focused on optimizing a subcomponent of the entire model, resulting in suboptimal performance gains. In addition, execution time is heavily dependent on the input length and thus varies in different subcomponents, reducing the scalability with respect to input length and narrowing the deployability of these solutions. FIG. 4 shows the latency breakdown of the Transformer's subcomponents for varying input lengths on both GPU and CPU with batch size of 8 and 1, respectively.

Specifically, the Transformer architecture is split into three main subcomponents: attention layers, linear layers, and other operations, e.g. softmax, layer normalization, residual connections, matrix transformations, and IO operations. Notably, on the CPU, linear layers take up a significant portion of execution time, up to 71.61%, for all input lengths. In contrast, on the GPU, executing linear layers takes up the majority of the total execution time only when the input length is small, e.g. 40.98% and 37.9% for input lengths of 128 and 256, respectively. Therefore, works that focus on optimizing the attention layers can only be effective in the limited case of small input lengths.

On the other hand, accelerators that focus solely on optimizing linear layers suffer from the same drawback in commonplace scenarios where the input length is large. For instance, executing attention layers on the GPU consumes up to 46.15% of the total execution time when the input length is 512.

Naively adopting a combination of previous works to optimize both the linear and attention subcomponents, however, would require the instantiation of two separate engines, resulting in excessively high resource consumption. As such, there is a gap in designing an accelerator for scalable all-purpose Transformer-based models.

The present applicant addresses this challenge by proposing an adaptable engine that accelerates both the attention and linear layers through its runtime reconfigurability, thus leading to a considerable decrease in resource consumption with negligible accuracy loss.

Algorithm Optimization

FIG. 5 is a diagram showing basic sparsity patterns in existing efficient attention-based NNs. Various pruning schemes have been proposed to reduce the computational complexity of attention-based NNs, leading to different efficient models. By analysing the computational patterns of these variants, five basic sparsity patterns are defined as shown in FIG. 5: (i) low rank, (ii) sliding window, (iii) butterfly, (iv) random and (v) block-wise pattern. It is observed that the sparsity in these variants can be expressed as different combinations of the basic sparsity patterns, as summarized in FIG. 7. The rationale behind using multiple sparsity patterns within each variant is mainly to compensate for the accuracy loss. For example, Pixelfly introduces an additional low-rank sparsity pattern to increase the expressiveness of its flat block-wise butterfly pattern and improve accuracy (B. Chen et al., “Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models” in International Conference on Learning Representations (ICLR), 2022).

Different sparsity patterns exhibit diverse data access patterns, which calls for custom hardware support. However, supporting multiple sparsity patterns may complicate the hardware design. For instance, to fully utilize the sparsity in the random pattern, complex dynamic controllers are required to achieve a load-balanced execution on different hardware engines. The extra overhead of such controllers may counteract the improvement brought by skipping sparse operations, especially when the transistor-size goes down.

The present disclosure aims to find a hardware-friendly sparsity pattern that: (1) has structured data access patterns to simplify the memory design, (2) achieves satisfactory algorithmic performance without the help of other sparsity patterns, and (3) is applicable to both the attention mechanism and FFNs to achieve scalable improvement.

Among the five sparsity patterns in FIG. 5, the random pattern is first excluded for two reasons: (1) it requires complicated controllers with excessive hardware overhead and (2) its performance gain cannot be guaranteed as the sparsity may vary substantially among different datasets and tasks. In addition, as summarized in FIG. 7, the sliding-window pattern always requires another sparsity pattern to compensate for the accuracy loss, as it only captures the local relationship inside each window. Moreover, although some variants adopt a single low-rank or block-wise sparsity pattern with satisfactory algorithmic performance, they require extra algorithmic operations and dedicated computational kernels during inference, e.g. the locality-sensitive hashing (LSH) in Reformer, which imposes a large overhead at the hardware level. As a result, in the next section the focus is on exploiting the butterfly sparsity pattern to accelerate attention-based NNs.

FIG. 6 is a schematic diagram of a network structure of FABNet. The butterfly pattern has demonstrated its effectiveness and generality in approximating linear transformations. Furthermore, replacing the entire attention layer with a Fourier transform has shown potential for simplifying the computation. To maximize the ability to approximate the computation with acceptable algorithm performance, two basic building blocks for scalable inference are proposed: (1) the Attention Butterfly (ABfly) block, and (2) the Fourier Butterfly (FBfly) block.

In the ABfly block, the multi-head attention mechanism is kept, but all the linear layers are compressed using butterfly factorization. To further reduce the amount of computation and parameters, the FBfly block replaces the entire attention mechanism with a 2D Fourier transform layer. Although FBfly is less compute and memory intensive than ABfly, the use of the Fourier transform layer may also affect the accuracy. To recover the algorithm performance when needed, a butterfly-based network called FABNet is proposed that introduces a hybrid of the ABfly and FBfly blocks, as depicted in FIG. 6. There are N_FBfly FBfly blocks at the beginning, and N_ABfly ABfly blocks at the bottom. With this setup, both N_FBfly and N_ABfly are exposed as user-defined hyper-parameters, forming a trade-off between the algorithm and hardware performance. To balance this trade-off, a co-design framework is proposed to explore the design space of both neural architecture and hardware design according to users' requirements.
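As a minimal sketch of the block ordering described above, the following Python snippet builds the sequence of block types from the two hyper-parameters. The function name `fabnet_block_layout` and the example counts are hypothetical and are given for illustration only.

```python
def fabnet_block_layout(n_fbfly, n_abfly):
    """Return the block order: FBfly blocks first, followed by ABfly blocks."""
    if n_fbfly < 0 or n_abfly < 0:
        raise ValueError("block counts must be non-negative")
    return ["FBfly"] * n_fbfly + ["ABfly"] * n_abfly


# Hypothetical setting: three FBfly blocks followed by one ABfly block.
layout = fabnet_block_layout(n_fbfly=3, n_abfly=1)
```

Setting `n_abfly` to zero yields a network made only of FBfly blocks, which is the cheapest configuration in compute and memory, while increasing it trades hardware cost for accuracy.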

Hardware Accelerator

FIG. 8 shows the proposed hardware accelerator, electronic device, or processing unit 800 for accelerating machine learning model computations. The electronic device 800 comprises a first processor 810, also referred to herein as a Butterfly Processor (BP). The first processor 810 is for accelerating computations involving a sparse matrix, the sparse matrix having a butterfly sparsity pattern.

The first processor 810 consists of P_be Butterfly Engines (BEs) 812, which are used to accelerate the computations that involve butterfly patterns, including both FFT and butterfly linear transformations. That is, the first processor 810 may comprise a plurality of butterfly engines 812 used to accelerate the computations involving a sparse matrix. Each butterfly engine 812 may comprise a memory system and a plurality of butterfly units 814, wherein each butterfly unit 814 may be configurable to perform a specific computation.

Each butterfly unit 814 may comprise a plurality of multiplexers arranged to select inputs required to perform the specific computation, as explained in more detail below.

In an embodiment, the first processor 810 generates a query matrix, a key matrix, and a value matrix by performing Fast Fourier Transform (FFT) and butterfly linear transform on at least one input matrix.

In an embodiment, the first processor 810 comprises at least one butterfly engine 812 configured to accelerate computations involving a sparse matrix with respect to the FFT and the butterfly linear transform. The at least one butterfly engine 812 comprises an (adaptable) memory system 816 and a plurality of (adaptable) butterfly units 814, wherein each of the butterfly units 814 comprises at least one real-number multiplier, at least one real-number adder, at least one complex-number adder, at least one multiplexer, and at least one de-multiplexer. The configuration functions of the at least one real-number multiplier, at least one real-number adder, at least one complex-number adder, at least one multiplexer, and at least one de-multiplexer are described in detail with reference to FIG. 9.

In an embodiment, each of the butterfly units 814 performs the FFT and the butterfly transform based on the at least one input matrix and a plurality of twiddle factors. In a case of performing the butterfly linear transform, the twiddle factors are non-symmetric real numbers. In a case of performing the FFT, the twiddle factors are complex and symmetric numbers.

In an embodiment, the first processor 810 may generate the key matrix and the value matrix before the query matrix. The second processor 820 may start the first matrix multiplication when at least part of the query matrix becomes available.

In an embodiment, the second processor 820 may start the second matrix multiplication when at least part of the result of the softmax operation becomes available.

The electronic device 800 may comprise a second processor 820, also referred to herein as an Attention Processor (AP). When the ML model comprises at least one attention layer requiring computations that do not involve a sparse matrix having a butterfly sparsity pattern, the second processor 820 is used to perform operations required by this at least one attention layer. The AP 820 contains P_head Attention Engines (AEs) 822, and each AE 822 is composed of one QK unit 824 and one SV unit 826. The QK unit 824 is designed to implement the softmax and the matrix multiplication between queries and keys (i.e., transposed key matrices). The SV unit 826 receives the outputs from the QK unit 824, and multiplies the results with value vectors to generate the final results of the attention layer.

In an embodiment, the QK unit 824 of the second processor 820 performs a first matrix multiplication between the query matrix and the key matrix. The QK unit 824 of the second processor 820 performs a softmax operation on the result of the first matrix multiplication. The SV unit 826 of the second processor 820 performs a second matrix multiplication between the result of the softmax operation and the value matrix.
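The division of work between the QK unit and the SV unit may be modelled functionally as in the following Python sketch. The function names and the toy matrices are assumptions for illustration, and any scaling of the scores before the softmax is omitted because it is not described here.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax_rows(m):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def qk_unit(q, k):
    """First matrix multiplication (Q x K^T) followed by the softmax."""
    k_t = [list(col) for col in zip(*k)]
    return softmax_rows(matmul(q, k_t))


def sv_unit(s, v):
    """Second matrix multiplication between the softmax result and V."""
    return matmul(s, v)


# Toy 2x2 matrices (arbitrary illustrative values).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
s = qk_unit(q, k)
out = sv_unit(s, v)
```

Because each row of `s` depends only on one row of `q`, the SV unit can begin consuming rows of `s` before the QK unit has finished the remaining rows.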

The electronic device 800 comprises a third processor 830, also referred to herein as a Post-processing Processor (PostP), the off-chip memory and different on-chip buffers. That is, the electronic device 800 comprises a third processor 830 for receiving outputs from the first processor 810 and for performing post-processing using the received outputs. The third processor 830 is responsible for executing the layer normalization and shortcut (SC) addition. To ease the on-chip memory consumption, the intermediate results between different FFT and butterfly linear transformation operations are transferred back to the off-chip memory. All the on-chip buffers utilize double-buffering in order to overlap the data transfer with the computation.

In an embodiment, the third processor 830 receives the query matrix, the key matrix, and the value matrix from the first processor 810. The third processor 830 performs at least one of layer normalization or shortcut addition based on the query matrix, the key matrix, and the value matrix.

Each BE 812 is mainly composed of an (adaptable) memory system 816 and P_bu (adaptable) Butterfly Units (BUs) 814. To improve the hardware efficiency and enable the use of a single unified engine, the BE 812 is designed with a focus on adaptability. As such, it can be configured via programmable multiplexers and de-multiplexers at runtime to execute either an FFT or a butterfly linear transformation.

FIG. 9A shows the architecture of the butterfly unit 814. Referring to FIG. 8 together with FIG. 9, each adaptable BU 814 consists of four real-number multipliers and two real-number adders/subtractors, followed by two complex-number adders/subtractors. The inputs and twiddle factors of both FFT and butterfly linear transformation are connected to the multipliers, with eight multiplexers used to select the correct inputs for each operation. Two de-multiplexers are placed after the real-number adders/subtractors to control the output flow.

In an embodiment, four real-number multipliers are arranged to multiply the at least one input matrix by the plurality of twiddle factors. Two real-number adders or subtractors are each arranged to add or subtract the outputs of two real-number multipliers from among the four real-number multipliers. Two complex-number adders or subtractors are arranged to add or subtract the outputs of the two real-number adders or subtractors. Eight multiplexers are arranged to select the at least one input matrix required to perform the FFT or the butterfly linear transform. Two de-multiplexers are arranged to control an output flow, which comprises either outputting the data from the two real-number adders or subtractors, or providing the data from the two real-number adders or subtractors to the two complex-number adders or subtractors.

In an embodiment, control signals for the eight multiplexers or the two de-multiplexers are set before performing the FFT and the butterfly linear transform.

When performing the butterfly linear transformation (see FIG. 9B), the twiddle factors are non-symmetric real numbers. Hence, the output of each twiddle multiply can be computed as:

$out_{bt}^{1} = in_{bt}^{1} \cdot w_{bt}^{1} + in_{bt}^{2} \cdot w_{bt}^{3}, \quad out_{bt}^{2} = in_{bt}^{1} \cdot w_{bt}^{2} + in_{bt}^{2} \cdot w_{bt}^{4}$

where $in_{bt}^{1\sim2}$ and $w_{bt}^{1\sim4}$ represent the inputs and twiddle factors, respectively. To perform the butterfly linear transformation, the four multipliers in each BU are configured to execute the four real-number multiplications in the equation above. The values $in_{bt}^{1\sim2}$ and $w_{bt}^{1\sim4}$ are selected via multiplexers as the operands of the multipliers. At the same time, the results $out_{bt}^{1\sim2}$ generated from the real-number adders/subtractors are output directly from the de-multiplexers.

For FFT (see FIG. 9C), since the twiddle factors of FFT are complex and symmetric, only one complex-number multiplication is required per twiddle multiplication. Thus, by selecting the complex inputs $in_{fft_r}^{1\sim2} + in_{fft_i}^{1\sim2} i$ and twiddle factor $w_{fft_r} + w_{fft_i} i$, the four real-number multipliers in each BU are used to perform the required complex-number multiplication. The de-multiplexers are then configured to output the results to the complex-number adders/subtractors to get the final results $out_{fft_r}^{1\sim2} + out_{fft_i}^{1\sim2} i$. The control signals for the multiplexers and de-multiplexers are set by users before running each layer. As such, the proposed adaptable BE can be used to accelerate both FFTs and butterfly linear transformations by reusing the multipliers, adders and subtractors to improve the hardware efficiency.
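The two operating modes of a butterfly unit may be summarised by the following behavioural Python sketch. The function name `butterfly_unit`, the mode strings and the sample operands are hypothetical. In FFT mode it is assumed, as in a conventional radix-2 butterfly, that the twiddle factor multiplies the second input before the complex addition and subtraction.

```python
def butterfly_unit(mode, in1, in2, w):
    """Behavioural model of one adaptable butterfly unit.

    mode "bfly": in1, in2 are real; w = (w1, w2, w3, w4) are real twiddles.
    mode "fft":  in1, in2 are complex; w is a single complex twiddle factor.
    """
    if mode == "bfly":
        w1, w2, w3, w4 = w
        # Four real multiplications, one per multiplier.
        m1, m2, m3, m4 = in1 * w1, in2 * w3, in1 * w2, in2 * w4
        # Two real adders; the de-multiplexers route the sums straight out.
        return m1 + m2, m3 + m4
    if mode == "fft":
        # The same four real multipliers form one complex product in2 * w.
        m1, m2 = in2.real * w.real, in2.imag * w.imag
        m3, m4 = in2.real * w.imag, in2.imag * w.real
        # The real subtractor and adder assemble the complex product.
        prod = complex(m1 - m2, m3 + m4)
        # The de-multiplexers route it to the complex adder and subtractor.
        return in1 + prod, in1 - prod
    raise ValueError("mode must be 'bfly' or 'fft'")


bfly_out = butterfly_unit("bfly", 1.0, 2.0, (3.0, 4.0, 5.0, 6.0))
fft_out = butterfly_unit("fft", 1 + 2j, 3 - 1j, 1j)
```

Both branches use exactly four real multiplications, which is the property that allows the multipliers to be shared between the two operations.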

FIG. 12 shows bank conflicts in column-major and row-major orders. The present applicant's adaptable memory system comprises an input manager, a serial-to-parallel (S2P) module, a parallel-to-serial (P2S) module and input buffers. To reduce the memory consumption, the same input buffers are used for both FFT and butterfly linear transformation. Furthermore, pingpong buffering is used to reduce the data transfer time. However, as the data width of FFT is twice that of the butterfly linear transformation, different address mapping and fine-grained pipelining strategies are adopted to improve the hardware efficiency and performance.

FIG. 10 shows the present address mapping strategies for FFT and the butterfly linear transformation. Assuming the bitwidth of real numbers is 16 bits, each input buffer is 16 bits wide. While processing butterfly linear transformations, input buffers A and B are used as two independent pingpong banks with separate read and write ports. In this manner, when input buffer A is used for computation, buffer B can start the input data transfer of the next batch, leading to the fine-grained pipelining strategy shown in FIG. 11. While processing FFT, since the data include both real and imaginary parts which require 32-bit read and write ports, the lower parts of input buffers A and B are concatenated as the first pingpong bank for the storage of complex numbers. To improve the hardware efficiency, the higher parts of both buffers are reused as the second pingpong bank. As the computation requires both read and write access, another fine-grained pipeline strategy is adopted to overlap the output data transfer only with the input data load of the next batch, as shown in FIG. 11b. With the adaptable pipeline strategies for FFT and butterfly linear transformation, the hardware efficiency and performance are maximised.

As shown in (a) of FIG. 12, the butterfly pattern requires different data access in different stages. The conventional column-major or row-major order will cause bank conflicts while reading the data. For instance, accessing index pair x₀ and x₈ of the first stage causes a read conflict in the column-major order, as shown in (b) of FIG. 12. The row-major order suffers from the same issue while reading x₀ and x₂ in the third stage.

To avoid the bank conflict, a custom data layout strategy is introduced ((a) of FIG. 13) and implemented using the S2P module shown in (b) of FIG. 13. Each column i is permuted using a starting position P_i, which indicates how many rows the first element in the current column should be shifted down. The starting position is defined using the following formula:

$P_0 = 0, \quad P_{2^{n} \sim 2^{n-1}} = P_{2^{n-1} \sim 0} - 1 \quad \text{with } 1 \le n \le N$

For each $2^{n} - 2^{n-1}$ columns, the starting positions $P_{2^{n} \sim 2^{n-1}}$ are obtained by shifting $P_{2^{n-1} \sim 0}$ one position down, as shown in FIG. 13a. The starting positions are generated using a counter together with bit-count and addition operations (FIG. 13b). After packing the serial data together, S2P permutes the packed data based on the starting positions to generate the final outputs.
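The effect of the permuted layout may be checked with the following Python sketch for 16 inputs and 4 banks. Several points are assumptions made for illustration: each column is taken to hold four consecutive inputs, a downward shift is modelled as adding to the row index modulo the number of banks, and all function names are hypothetical.

```python
def starting_positions(num_columns):
    """Build P_i by the doubling rule: each new group of columns repeats the
    earlier positions shifted one more row down (the bit count of i)."""
    positions = [0]
    while len(positions) < num_columns:
        positions += [p + 1 for p in positions]
    return positions[:num_columns]


def bank_of(index, num_banks, positions):
    """Bank (row) holding x_index when its column is rotated down by P_col."""
    column, row = divmod(index, num_banks)
    return (row + positions[column]) % num_banks


NUM_BANKS, NUM_INPUTS = 4, 16
P = starting_positions(NUM_INPUTS // NUM_BANKS)


def pairs_conflict_free(bank_fn):
    """True if, in every butterfly stage, the two inputs of every index pair
    (k, k + stride) are stored in different banks."""
    stride = NUM_INPUTS // 2
    while stride >= 1:
        for k in range(NUM_INPUTS):
            if (k & stride) == 0 and bank_fn(k) == bank_fn(k + stride):
                return False
        stride //= 2
    return True


permuted_ok = pairs_conflict_free(lambda i: bank_of(i, NUM_BANKS, P))
column_major_ok = pairs_conflict_free(lambda i: i % NUM_BANKS)
# Banks holding x11, x1, x9 and x3, which are read together in the first stage.
first_stage_banks = [bank_of(i, NUM_BANKS, P) for i in (11, 1, 9, 3)]
```

Under these assumptions the permuted layout places both inputs of every pair in different banks in all four stages, whereas the plain column-major mapping already fails for x₀ and x₈ in the first stage.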

In an embodiment, the function and structure shown in FIG. 13 may be implemented by the memory system 816 shown in FIG. 8. Referring to FIG. 8 together with FIG. 13, the memory system 816 may calculate starting positions of a data layout corresponding to the input matrix. The starting positions may indicate how many rows a first element in a current column should be shifted down. The memory system 816 may permute the at least one input matrix based on the starting positions. The memory system 816 may offer data access to the butterfly units 814 based on the permuted input matrix. FIG. 14 presents an example with 16 inputs, where the data required by the first and second stages of the butterfly pattern are read from the buffers without bank conflicts. However, as the butterfly units receive data in pairs, an extra pairing is required after the S2P module. An example is the second output column (x₁₁, x₁, x₉, x₃) of the first stage in FIG. 14b. To pair indices, an index coalescing module is designed before the butterfly units (FIG. 15). Based on the index of each input, bit-count and addition operations are used to calculate the corresponding shift position. Then, a crossbar coalesces the index pairs based on the indices and shift positions. To ensure the outputs generated from the butterfly units preserve the original order, a recover module is used before the data is written back.

While executing the ABfly block, both BP and AP are used to perform butterfly linear transformation and attention matrix multiplication, respectively. To further improve the performance when executing the ABfly block, the present applicant employs a fine-grained pipeline between BP and AP.

FIG. 11 shows adaptable fine-grained pipeline strategies, and illustrates the dataflow of BP and AP. In the naive implementation, the key (K), value (V) and query (Q) matrices are generated sequentially from BP. After all of Q, K and V are computed, AP starts the computation of Q×K^T. To optimize this process, the present applicant reorders the execution sequence of linear layers such that BP computes K and V at the beginning, see FIG. 11b. As Q×K^T can be decomposed into multiple vector-matrix multiplications that multiply different rows of Q with the entire matrix K^T, computation of Q×K^T can be started once the first few rows of Q are available. As such, the Q×K^T in AP can be pipelined with the computation of Q in BP. At the same time, since S is generated from the QK unit row by row, Q×K^T can be further pipelined with S×V, as the computation of S×V can start once the first few rows of S are generated from the QK unit. Assuming there are M and L rows in Q and K matrices and it takes

$\frac{T (Q K)}{M}$ and $\frac{T (S V)}{L}$

to compute one row in the QK and SV units respectively, the total latency reduction is

$\frac{M - 1}{M} T (Q K) + \frac{L - 1}{L} T (S V) .$
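As a worked illustration of the expression above, the following Python sketch evaluates the latency reduction for given unit latencies and row counts. The function name and the sample cycle counts are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def pipeline_latency_reduction(t_qk, t_sv, m_rows, l_rows):
    """Latency hidden by the fine-grained pipeline:
    (M - 1) / M * T(QK) + (L - 1) / L * T(SV)."""
    if m_rows < 1 or l_rows < 1:
        raise ValueError("row counts must be at least 1")
    return (m_rows - 1) / m_rows * t_qk + (l_rows - 1) / l_rows * t_sv


# Hypothetical figures: T(QK) = 1000 cycles, T(SV) = 800 cycles, M = 4, L = 8.
saved_cycles = pipeline_latency_reduction(1000, 800, 4, 8)  # 750 + 700
```

As M and L grow, the reduction approaches T(QK) + T(SV), since only the time for a single row of each unit remains exposed.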

The overall design space of the present applicant's end-to-end system is formed by FABNet's hyper-parameters and the butterfly accelerator's hardware parameters. Specifically, the joint design space consists of: 1) the algorithm parameters, i.e. the hidden size (D_hid), the expand ratio of FFN (R_ffn), the total number of blocks (N_total) and the number of ABfly blocks (N_ABfly) in FABNet, and 2) the hardware parameters, i.e. the parallelism of BU (P_bu) and BE (P_be) in BP, and the parallelism of the QK (P_qk) and SV (P_SV) units in AP.

To assess the trade-off provided by each design point, it is necessary to evaluate its algorithmic performance (e.g. an accuracy metric), its latency and its resource consumption. During search, the algorithmic performance is obtained by training and evaluating FABNet, while the latency is estimated by utilizing a custom cycle-accurate simulator built for the present applicant's butterfly accelerator. To verify whether the design can be accommodated by the target FPGA device, the present applicant developed an analytical model to estimate the consumption of DSP blocks and on-chip memory (BRAMs), and used a set of place-and-route measurements to fit linear regression models for the consumption of Look-Up Tables (LUTs) and registers.

To evaluate the present algorithm and hardware performance for workloads with long sequences, five tasks from Long-Range-Arena are chosen: hierarchical data classification (ListOPs), byte-level text classification (Text), byte-level document retrieval (Retrieval), image classification for sequences of pixels (Image), and classification of long-range spatial dependency (Pathfinder). The input sequences of these datasets range from 1024 to 4000.

Software Implementation. The vanilla Transformer, FNet, and the present FABNet models are implemented using the PyTorch 1.10 framework. The pretrained models are obtained from Huggingface 4.16. The batch size is 256 for both the Image and Pathfinder tasks, and 32 for the rest of the datasets during training. The learning rate is 0.0001, except for the Image task with 0.01 and the Pathfinder task with 0.0005. Multiple Nvidia A100 and V100 GPUs are used for training. To use FFT cores on Nvidia GPUs, the PyTorch API “rfft2” is used to implement the FFT operation required in both FNet and FABNet. The high-performance CUDA implementation of butterfly linear transformation is adopted to accelerate both GPU training and inference. Two models are defined with different default settings: FABNet-Base (D_hid=768, R_ffn=4, N_total=12, N_ABfly=0) and FABNet-Large (D_hid=1024, R_ffn=4, N_total=24, N_ABfly=0).

The present applicant implements their hardware accelerators using Verilog. To evaluate performance in different scenarios, two Xilinx FPGA boards are used in the experiments: VCU128 for cloud/server scenarios and Zynq 7045 for edge/mobile settings. Xilinx Vivado 2019.1 is used for synthesis and implementation. The maximum clock frequency of the present designs varies with the FPGA board and resource consumption. All the FPGA designs are clocked at 200 MHz, which is below the maximum. The power consumption is obtained using the Xilinx Power Estimator (XPE) tool provided by Vivado. A cycle-accurate performance model is developed to evaluate the speed performance, which is cross-validated with RTL simulation results. The present applicant's hardware design uses 16-bit floating point.

Effectiveness of Co-Design. The effectiveness of the present applicant's co-design approach in finding the optimal algorithm and hardware designs is evaluated. For demonstration, LRA-Text is used as the target dataset and the VCU128 FPGA as the target device. D_hid, R_ffn, N_ABfly and N_total are selected from {64, 128, 256, 512, 1024}, {1, 2, 4}, {0, 1} and {1, 2} respectively. The hardware parallelisms (P_be, P_bu, P_qk and P_SV) are chosen from {4, 8, 16, 32, 64, 128}. Among the design points that satisfy the accuracy constraint, the point with the lowest latency in the Pareto front is chosen as the point of comparison. The selected point is up to 10% more accurate than the points in the same latency range and up to 130× faster than points in the same accuracy range, underlining the advantages of the present applicant's co-design approach. To get the configurations for the rest of the datasets in LRA, the overall accuracy loss is constrained to be less than 0.5% compared with the vanilla Transformer. The final models and designs are chosen as the configurations with the highest hardware performance without violating the accuracy constraints. Unless mentioned otherwise, the rest of the sections report the algorithm and hardware performance using the configurations optimized by the present applicant's co-design approach.
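The selection rule described above may be sketched in Python as follows. The `DesignPoint` class, the function names and all accuracy and latency values are hypothetical and serve only to illustrate how a point is picked from the Pareto front under an accuracy constraint; they are not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignPoint:
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better


def pareto_front(points):
    """Keep the points that no other point beats on both accuracy and latency."""
    front = []
    for p in points:
        dominated = any(
            q.accuracy >= p.accuracy and q.latency_ms <= p.latency_ms
            and (q.accuracy > p.accuracy or q.latency_ms < p.latency_ms)
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


def select_design(points, reference_accuracy, max_loss):
    """Lowest-latency Pareto point whose accuracy loss stays within max_loss."""
    feasible = [p for p in pareto_front(points)
                if p.accuracy >= reference_accuracy - max_loss]
    return min(feasible, key=lambda p: p.latency_ms) if feasible else None


# Hypothetical design points (not measured results).
candidates = [
    DesignPoint("A", 0.640, 5.0),
    DesignPoint("B", 0.630, 2.0),
    DesignPoint("C", 0.600, 1.0),
    DesignPoint("D", 0.620, 3.0),
]
front_names = [p.name for p in pareto_front(candidates)]
chosen = select_design(candidates, reference_accuracy=0.632, max_loss=0.005)
```

In this toy example point D is dominated by B, point C is fast but violates the accuracy constraint, and B is selected as the fastest feasible point.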

FIG. 16 compares the accuracy of the vanilla Transformer, FNet and the present applicant's FABNet. FABNet achieves higher accuracy than both Transformer and FNet on three out of five tasks, including ListOPs, Retrieval and Image. On average, FABNet has the same accuracy performance as Transformer. To investigate the efficiency of FABNet, FIG. 18 shows the compression rate of the present applicant's optimized FABNet over the vanilla Transformer and FNet in terms of floating-point operations (FLOPs) and model size (number of parameters). Compared with the vanilla Transformer, FABNet achieves around 10~60× reduction in FLOPs and 2~22× reduction in model size, depending on the target task. Furthermore, compared with FNet, FABNet reduces FLOPs by 2~10× and model size by 2~32×.

To demonstrate the advantage of co-designing both algorithm (FABNet) and hardware (butterfly accelerator), a baseline design is implemented to accelerate the vanilla BERT. The baseline hardware was designed with multiple multiply-accumulate (MAC) units to accelerate the linear transform and the different matrix multiplications between query, key and value vectors. Each MAC unit is composed of a set of multipliers followed by an adder tree. Fine-grained intra- and inter-layer pipeline techniques were used to optimize the hardware performance. The parallelism of each MAC unit is allocated according to its workload in order to achieve load-balanced execution between different pipeline stages. For a fair comparison, both the baseline and butterfly accelerators are implemented on a VCU128 FPGA using 2048 DSPs. The high bandwidth memory (HBM) was used as the external memory. Both designs were clocked at 200 MHz. Both base and large versions of each model are evaluated using four different input sequences (128, 256, 512 and 1024). The base version contains 12 layers and the large version has 24 layers. A speedup breakdown is shown in FIG. 17. Compared to the baseline design with BERT-Base, the butterfly accelerator achieves 21~30 times speedup on the FFN. The speedup increases to 37~53 times on the large version. At the same time, the present butterfly accelerator is 57~120 and 98~201 times faster than the baseline design on the attention part of BERT-Base and BERT-Large respectively. The butterfly accelerator shows a higher speedup on the attention than on the FFN, which is due to the higher compression rate brought by FFT.

The comparison against GPU and CPU is performed in both server and edge scenarios. In the edge setting, the butterfly accelerator is implemented on a Xilinx Zynq 7045 FPGA. DDR4 is used as external memory and 512 DSPs are used for computation. Nvidia Jetson Nano GPU and Raspberry Pi4 are used as GPU and CPU platforms respectively. In the server scenario, a Xilinx VCU128 FPGA is used to implement the butterfly accelerator. HBM is used as external memory and the design consumes 2048 DSPs. Nvidia V100 and TITAN Xp GPUs are used for comparison. Highly optimized CUDA code is used for the GPU implementations. FPGA designs are clocked at 200 MHz. Both FABNet-Base and FABNet-Large are evaluated using 128, 256, 512 and 1024 input sequences. FIG. 19 shows the results in terms of speedup and energy efficiency. Energy efficiency is represented using Giga operations per second per Watt (GOPS/Watt). In the edge scenario, the present design on Zynq 7045 FPGA achieves 4.5~10× speedup over Jetson Nano GPU and 41.2~394.8× speedup over Raspberry Pi4. At the same time, the present design also shows 5.0~12.5× and 21.3~197.4× higher energy efficiency than Jetson Nano and Raspberry Pi4 respectively. In the server scenario, the present design on VCU128 is up to 10.9× and 9.7× faster, and up to 160.6× and 149× more energy efficient, than V100 and TITAN Xp GPUs respectively. In summary, the end-to-end speedup and energy efficiency gains on both edge and server scenarios under different input sequences highlight the scalability of the present butterfly accelerator.

On FABNet-Large with input sequences longer than 768, the Raspberry Pi4 suffers from an out-of-memory (OOM) issue. In the server setting, the present design on VCU128 is up to 10.9 and 756 times faster than RTX 2080 Ti GPU and Golden 6145 CPU. Up to 168.0 and 7624.9 times higher energy efficiency is achieved than RTX 2080 Ti GPU and Golden 6145 CPU respectively. The end-to-end speedup on both edge and server scenarios under different input sequences also demonstrates the scalability of the present butterfly accelerator.

FIG. 20 compares the present butterfly accelerator with other state-of-the-art (SOTA) accelerators in terms of speed performance and power consumption. Instead of comparing the effective throughput, the end-to-end latency is used to represent the real execution speed. The energy efficiency is represented by the number of predictions per Joule (Pred./J). Among these accelerators, only SpAtten and DOTA report the end-to-end performance. For the rest of the accelerators, which support only attention, the performance is estimated by reusing the available multipliers in the design to accelerate the FFN. For a fair comparison, it is assumed that all the ASIC designs use 128 multipliers for computation and that the clock frequency is 1 GHz. For a design with more than 128 multipliers, its throughput is linearly scaled down to obtain the end-to-end performance. For instance, DOTA achieves 11.41 times speedup over Nvidia V100 using 12,000 multipliers with 12 TOPS throughput. Its throughput is scaled down by 12,000/128=93.75, which leads to 0.123 times speedup. The power consumption is obtained using the same linear scaling approach. For example, Sanger reports the power consumption of a design with 1024 multipliers. The power consumption of its systolic array (2243 mW) is divided by 1024/128=8, which leads to 280.375 mW. Together with the power of other modules such as pre-processing and memory, its total power consumption is 0.801 W. To match the computational capacity of the ASIC designs, 640 DSPs in the VCU128 FPGA are used. As the present FPGA-based design is clocked at 200 MHz, this ensures the same theoretical peak performance (640×200 M=128 GOPS) as the ASIC designs (128×1 G=128 GOPS).
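The linear scaling used for this comparison may be reproduced with the short Python sketch below. The helper name `scale_to_budget` is hypothetical; the input figures are those quoted in the preceding paragraph.

```python
def scale_to_budget(value, multipliers, budget=128):
    """Linearly scale a reported figure down to a budget of 128 multipliers."""
    return value / (multipliers / budget)


# DOTA: 11.41x speedup reported with 12,000 multipliers (factor 93.75).
dota_speedup = scale_to_budget(11.41, 12_000)

# Sanger: systolic-array power of 2243 mW reported with 1024 multipliers (factor 8).
sanger_array_mw = scale_to_budget(2243, 1024)

# Peak throughput match: 640 DSPs at 200 MHz versus 128 multipliers at 1 GHz.
fpga_peak_gops = 640 * 200e6 / 1e9
asic_peak_gops = 128 * 1e9 / 1e9
```

The last two values are equal, which is the condition used to place the FPGA design and the scaled ASIC designs at the same computational capacity.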

As shown in FIG. 20, the present butterfly accelerator achieves a 25.6 times speedup over the previous FPGA-based design while using nearly 10 times fewer DSPs. At the same time, 62.3 times higher energy efficiency is achieved than FTRANS. Compared with the ASIC designs, the present disclosure can achieve a 14.2 to 23.2 times speedup under the same computational capacity. Although the present FPGA-based butterfly design consumes more power than all the ASIC designs, it can achieve 1.2 to 4.7 times higher energy efficiency than the other SOTA ASIC accelerators. Higher speedup and energy efficiency are expected when the present design is implemented on an ASIC. There are two reasons for the better performance than the ASIC designs: (1) the use of FFT and butterfly factorization significantly reduces the computational complexity at the algorithm level; and (2) the adaptable butterfly design adopts one unified hardware engine to accelerate both FFT and butterfly linear transformation, which significantly improves the hardware efficiency.

In order to investigate the sensitivity of the present design to off-chip memory bandwidth, the bandwidth is varied across 6, 12, 25, 50, 100 and 200 GB/s, and the latency is evaluated using the present custom cycle-accurate simulator. For this experiment, four different designs with 16, 32, 64 and 128 BEs are used when executing FABNet-Large with 24 layers. To understand the bandwidth requirements under both short and long input lengths, each design is evaluated using three input sequence lengths (128, 1024 and 4096). The results are shown in FIG. 21. For a small-scale design of 16 BEs, a bandwidth of 50 GB/s is enough for the design to reach its peak performance under different input sequences. For the largest design with 128 BEs, the achieved performance saturates once the bandwidth reaches 100 GB/s.

Power and Resource Analysis. Table 6 shows the breakdown of the power consumption based on the report generated by the Vivado XPE tool. Two designs are implemented with 128 BEs (BE-128) and 40 BEs (BE-40) on the VCU128 FPGA. In both designs, the dynamic power accounts for more than 70% of the total power consumption. The memory resources, including both BRAM and HBM, consume more than 40% of the dynamic power. Furthermore, when the number of BEs scales from 40 to 128, the power of clocking, logic & signal and DSPs increases from 0.724 W, 0.844 W and 0.129 W to 2.143 W, 2.693 W and 0.412 W, respectively. Table 7 presents the resource consumption of both BE-40 and BE-128 on the same VCU128 FPGA. Due to the use of FFT and butterfly matrices, the present FABNet becomes less memory-intensive than the vanilla attention-based NNs. Since the theoretical memory bandwidth of a single HBM (450 GB/s) can already satisfy the requirement of the present accelerator, one HBM is used in both designs to reduce the resource and power consumption. When the number of BEs decreases from 128 to 40, the BRAM utilization is reduced from 1042 to 338. The reduction can also be observed in the LUT and Register resources.
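As a quick check on how these figures scale with the number of BEs, the growth ratios can be computed directly from the values quoted above. The Python sketch below is illustrative; the dictionary keys and variable names are hypothetical labels, and no values beyond those in the preceding paragraph are used.

```python
# Growth of selected power and resource figures when moving from BE-40 to BE-128.
# Keys are hypothetical labels; values are (BE-40, BE-128) from the text above.

be_ratio = 128 / 40  # 3.2x more butterfly engines

power_w = {
    "clocking": (0.724, 2.143),
    "logic_and_signal": (0.844, 2.693),
    "dsp": (0.129, 0.412),
}

# Ratio of BE-128 power to BE-40 power for each component.
power_growth = {name: big / small for name, (small, big) in power_w.items()}

# BRAM utilization grows from 338 to 1042.
bram_growth = 1042 / 338

for name, ratio in power_growth.items():
    print(f"{name}: {ratio:.2f}x (BE count: {be_ratio:.1f}x)")
print(f"bram: {bram_growth:.2f}x")
```

Each ratio lies within about 10% of the 3.2 times increase in BE count, which suggests that these components grow approximately linearly with the number of BEs.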

FIG. 22 shows a flowchart describing a method for accelerating machine learning, ML, model computations using the processing unit described herein. The method comprises: receiving an input matrix for processing by the ML model (step S100); and using the first processor of the processing unit to: access data from the input matrix based on a butterfly sparsity pattern (step S102); configure the plurality of butterfly units of the first processor to perform a Fast Fourier Transform computation (step S104); and perform a Fast Fourier Transform computation using the accessed data (step S106). The same method may be used to accelerate computations of any layer involving matrices having a butterfly sparsity pattern. That is, instead of the FFT computation, any other computation (e.g. linear transformation) of any layer of the ML model may be performed using the first processor as long as the layer uses butterfly sparsity.
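Steps S102 to S106 may be pictured in software as follows. This Python sketch is a generic iterative radix-2 FFT and is not the hardware implementation of the present disclosure; the function names are hypothetical. Each stage reads pairs of elements separated by a power-of-two distance (the butterfly access pattern), selects a twiddle factor, and applies one butterfly per pair. A direct DFT is included only as a reference for checking the result.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_butterfly(x):
    """Iterative radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    data = [0j] * n
    for i, value in enumerate(x):
        data[bit_reverse(i, bits)] = complex(value)
    span = 1
    while span < n:
        for start in range(0, n, 2 * span):
            for k in range(span):
                # Access: the two operands sit `span` positions apart.
                top = data[start + k]
                bottom = data[start + k + span]
                # Configure: twiddle factor for this position in the stage.
                w = cmath.exp(-2j * cmath.pi * k / (2 * span))
                # Compute: one butterfly produces two outputs.
                data[start + k] = top + w * bottom
                data[start + k + span] = top - w * bottom
        span *= 2
    return data


def naive_dft(x):
    """Direct O(n^2) DFT used only as a reference."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]
```

For an input of length n the loop runs log2(n) stages of n/2 butterflies each, which is the source of the reduced computational complexity compared with a dense matrix multiplication.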

The method may further comprise: using the second processor of the processing unit to: perform an attention mechanism computation using the input matrix and an output of the first processor, where the attention mechanism computation does not involve a sparse matrix having a butterfly sparsity pattern. That is, when the ML model comprises even a single attention layer that is vanilla or conventional, i.e. which does not use butterfly sparsity, the second processor is used to execute that attention layer when the model is executed.

FIG. 23 shows a flowchart describing a method for accelerating machine learning, ML, model computations using the processing unit described herein. The method comprises: generating a query matrix, a key matrix, and a value matrix by performing Fast Fourier Transform (FFT) and butterfly linear transform on at least one input matrix (step S210); performing a first matrix multiplication between the query matrix and the key matrix (step S220); performing a softmax operation on the result of the first matrix multiplication (step S230); and performing a second matrix multiplication between the result of the softmax operation and the value matrix (step S240).

The method may further comprise: receiving the query matrix, the key matrix, and the value matrix; and performing at least one of layer normalization or shortcut addition based on the query matrix, the key matrix, and the value matrix.

The method may further comprise: generating the key matrix and the value matrix before the query matrix; and starting the first matrix multiplication when at least part of the query matrix becomes available. Herein, “a matrix is available” means that, in an operation using the matrix as an operand, at least one row or at least one column of the matrix required for the operation is in a completed or prepared state.

The method may further comprise: starting the second matrix multiplication when at least part of the result of the softmax operation becomes available.
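The ordering described above can be illustrated with a small Python sketch. It is a software analogy under stated assumptions, not the pipeline of the present disclosure: the query, key, and value matrices are taken as already generated (step S210 is not modelled), no score scaling is applied because none is described here, and the function names are hypothetical. The function `attention` performs steps S220 to S240 on complete matrices, while `streamed_attention` consumes one query row at a time, which is possible because the key and value matrices are complete before the first query row arrives.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Steps S220 to S240 on complete matrices (lists of rows)."""
    scores = [[dot(q, k) for k in K] for q in Q]       # first matrix multiplication
    probs = [softmax(r) for r in scores]               # softmax
    v_cols = list(zip(*V))
    return [[dot(p, c) for c in v_cols] for p in probs]  # second matrix multiplication


def streamed_attention(q_rows, K, V):
    """Yield one output row per query row as soon as that row is available."""
    v_cols = list(zip(*V))
    for q in q_rows:
        scores = [dot(q, k) for k in K]     # starts with a single row of Q
        probs = softmax(scores)             # row-wise softmax
        yield [dot(probs, c) for c in v_cols]  # starts with a single softmax row


K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
Q = [[0.5, -1.0], [2.0, 0.0]]

full = attention(Q, K, V)
streamed = list(streamed_attention(iter(Q), K, V))
```

Because each output row depends on only one row of the query matrix, the streamed result matches the result computed on the complete matrices, which is what allows the two matrix multiplications and the softmax operation to overlap in time.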

FIG. 24 is a block diagram of an apparatus 1000 incorporating the accelerator of the present disclosure. The apparatus 1000 may be any apparatus which will implement an ML model and benefit from an accelerator to achieve high-speed computing. For example, the apparatus 1000 may be a resource-constrained device which nevertheless has the minimum hardware capabilities to use a trained neural network/ML model. The apparatus 1000 may be: a smartphone, tablet, laptop, computer or computing device, virtual assistant device, a vehicle, a drone, an autonomous vehicle, a robot or robotic device, a robotic assistant, image capture system or device, an augmented reality system or device, a virtual reality system or device, a gaming system, an Internet of Things device, a smart consumer device, a smartwatch, a fitness tracker, or a wearable device. It will be understood that this is a non-exhaustive and non-limiting list of example apparatuses.

The apparatus 1000 comprises at least one processor 1100. The at least one processor 1100 may comprise one or more of: a microprocessor, a microcontroller, and an integrated circuit. The at least one processor 1100 may be, for example, a CPU or GPU.

The apparatus 1000 comprises an electronic device 800 of the type described herein, including with reference to FIG. 8. A description of the electronic device 800 is therefore not provided again here. The electronic device 800 accelerates machine learning, ML, model computations. Specifically, the electronic device 800 partners with the general purpose processor 1100 of the apparatus 1000 to boost the speed at which data is processed, thereby improving the performance of the processor 1100.

The apparatus 1000 may comprise a memory 1200. The memory 1200 may comprise volatile memory, such as random access memory (RAM), for use as temporary memory, and/or non-volatile memory such as Flash, read only memory (ROM), or electrically erasable programmable ROM (EEPROM), for storing data, programs, or instructions, for example.

The apparatus 1000 comprises an ML model 1400. The electronic device 800 accelerates computations of the ML model which involve a sparse matrix, the sparse matrix having a butterfly sparsity pattern. As explained above, the first processor 810 of the electronic device 800 accelerates computations of any layer of the ML model 1400, as long as the computations involve matrices having a butterfly sparsity pattern. That is, when the ML model 1400 begins processing some data using the processor 1100, and the processing needs to be accelerated (because of, for example, a latency requirement), the processor 1100 may offload the task to the electronic device 800.

The apparatus 1000 comprises at least one interface 1300 for receiving data/data items for processing by the ML model 1400. The interface 1300 may be a camera for capturing images or a microphone for capturing audio. It will be understood these are non-limiting example interfaces.

In an embodiment of this disclosure, the first processor comprises at least one butterfly engine configured to accelerate computations involving a sparse matrix with respect to the FFT and the butterfly linear transform, wherein the at least one butterfly engine comprises a memory system and a plurality of butterfly units, wherein each of the butterfly units comprises at least one real-number multiplier, at least one real-number adder, at least one complex-number adder, at least one multiplexer, and at least one de-multiplexer.

In an embodiment of this disclosure, each of the butterfly units is configured to perform the FFT and the butterfly linear transform based on the at least one input matrix and a plurality of twiddle factors.

In an embodiment of this disclosure, in a case of performing the butterfly linear transform, the twiddle factors are non-symmetric real numbers, and in a case of performing the FFT, the twiddle factors are complex and symmetric numbers.

In an embodiment of this disclosure, each of the butterfly units comprises: four real-number multipliers, arranged to multiply the at least one input matrix and the plurality of twiddle factors; two real-number adders or subtractors, arranged to add or subtract outputs of two real-number multipliers from among the four real-number multipliers; two complex-number adders or subtractors, arranged to add or subtract outputs of the two real-number adders or subtractors; eight multiplexers, arranged to select the at least one input matrix required to perform the FFT or the butterfly linear transform; and two de-multiplexers, arranged to control an output flow comprising outputting the data from the two real-number adders or subtractors, or providing the data from the two real-number adders or subtractors to the two complex-number adders or subtractors.
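One possible reading of this arrangement is sketched below in Python. The mapping of multipliers and adders to each mode is an assumption made for illustration, and the function name, the mode strings, and the argument layout are hypothetical. In FFT mode, the four real multiplications and two real additions or subtractions form the complex product of one twiddle factor and the second input, which then feeds the two complex adders or subtractors. In linear-transform mode, the same four multipliers apply four independent real weights, and the two real sums are output directly, bypassing the complex adders.

```python
def butterfly_unit(x0, x1, twiddles, mode):
    """Software model of one butterfly unit in two modes (illustrative).

    mode "fft":    x0, x1 complex; twiddles is one complex number w.
                   Returns (x0 + w*x1, x0 - w*x1).
    mode "linear": x0, x1 real; twiddles is (w00, w01, w10, w11), all real.
                   Returns (w00*x0 + w01*x1, w10*x0 + w11*x1).
    """
    if mode == "fft":
        w = twiddles
        # Four real multipliers form the partial products of w * x1.
        m0 = w.real * x1.real
        m1 = w.imag * x1.imag
        m2 = w.real * x1.imag
        m3 = w.imag * x1.real
        # Two real adders/subtractors combine them into real and imaginary parts.
        product = complex(m0 - m1, m2 + m3)
        # Two complex adders/subtractors; the same twiddle is used with both signs.
        return x0 + product, x0 - product
    if mode == "linear":
        w00, w01, w10, w11 = twiddles
        # The same four multipliers, now with four independent real weights.
        m0 = w00 * x0
        m1 = w01 * x1
        m2 = w10 * x0
        m3 = w11 * x1
        # The two real adders give the outputs; the complex adders are bypassed.
        return m0 + m1, m2 + m3
    raise ValueError("mode must be 'fft' or 'linear'")


fft_out = butterfly_unit(1 + 2j, 3 - 1j, 0.5 + 0.5j, "fft")
lin_out = butterfly_unit(2.0, 3.0, (1.0, 2.0, 3.0, 4.0), "linear")
```

In this model the `mode` argument plays the role of the multiplexer and de-multiplexer control signals: it selects which operands reach the multipliers and whether the real sums leave the unit directly or pass through the complex adders.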

In an embodiment of this disclosure, control signals for the eight multiplexers or the two de-multiplexers are set before performing the FFT and the butterfly linear transform.

In an embodiment of this disclosure, the memory system is configured to: calculate starting positions of a data layout corresponding to the input matrix, the starting positions indicating how many rows a first element in a current column should be shifted down; permute the at least one input matrix based on the starting positions; and offer data access to the butterfly units based on the permuted input matrix.
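The purpose of such a shifted layout can be illustrated with a small Python model. The formula used for the starting positions is not given in this passage, so the parity rule in `start_position` below is purely an assumption chosen for illustration, as are the function names and the model of rows as memory banks. In this model, element i nominally sits in column i // banks and row i % banks, each column is shifted down cyclically by its starting position, and a butterfly stage whose operands lie in two different columns fetches half of each column at once.

```python
def start_position(col, banks):
    """Hypothetical rule: shift columns with odd bit-count parity by banks/2."""
    return (banks // 2) * (bin(col).count("1") % 2)


def bank_of(index, banks, skewed=True):
    """Bank (row) holding element `index` after the optional column shift."""
    col, row = divmod(index, banks)
    shift = start_position(col, banks) if skewed else 0
    return (row + shift) % banks


def access_groups(n, banks):
    """Yield the operand indices fetched together in each cross-column stage."""
    half = banks // 2
    stride = banks  # smaller strides stay inside one column and cannot conflict
    while stride < n:
        col_step = stride // banks
        for col in range(n // banks):
            if col & col_step:
                continue  # this column is the partner of an earlier one
            partner = col | col_step
            for first_row in (0, half):
                rows = range(first_row, first_row + half)
                yield ([col * banks + r for r in rows]
                       + [partner * banks + r for r in rows])
        stride *= 2


def conflict_free(n, banks, skewed=True):
    """True if every group of operands maps to `banks` distinct banks."""
    return all(
        len({bank_of(i, banks, skewed) for i in group}) == banks
        for group in access_groups(n, banks)
    )


with_shift = conflict_free(64, 8)
without_shift = conflict_free(64, 8, skewed=False)
```

With the shift applied, every group of eight operands in this model falls into eight distinct banks, whereas without it the two halves of each group collide in the same four banks. Other starting-position rules may achieve the same effect.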

In an embodiment of this disclosure, the electronic device comprises: a third processor configured to: receive the query matrix, the key matrix, and the value matrix from the first processor, and perform at least one of layer normalization or shortcut addition based on the query matrix, the key matrix, and the value matrix.

In an embodiment of this disclosure, the first processor is configured to generate the key matrix and the value matrix before the query matrix, and the second processor is configured to start the first matrix multiplication when at least part of the query matrix becomes available.

In an embodiment of this disclosure, the second processor is configured to start the second matrix multiplication when at least part of the result of the softmax operation becomes available.

In an embodiment of this disclosure, the method comprises: receiving the query matrix, the key matrix, and the value matrix, and performing at least one of layer normalization or shortcut addition based on the query matrix, the key matrix, and the value matrix.

In an embodiment of this disclosure, the method comprises: generating the key matrix and the value matrix before the query matrix, and starting the first matrix multiplication when at least part of the query matrix becomes available.

In an embodiment of this disclosure, the method comprises: starting the second matrix multiplication when at least part of the result of the softmax operation becomes available.

In an embodiment of this disclosure, a computer-readable storage medium is provided. The computer-readable storage medium comprises instructions which, when executed by a processor, cause the processor to carry out the method of this disclosure.

Those skilled in the art will appreciate that while the foregoing has described what is considered to be the best mode and, where appropriate, other modes of performing the present disclosure, the present disclosure should not be limited to the specific configurations and methods disclosed in this description of the preferred embodiment. Those skilled in the art will recognise that the present disclosure has a broad range of applications, and that the embodiments may take a wide range of modifications without departing from any inventive concept as defined in the appended claims.

Priority Claims
Number	Date	Country	Kind
20220100326	Apr 2022	GR	national
23150205.5	Jan 2023	EP	regional

Continuations
Relation	Number	Date	Country
Parent	PCT/KR23/04696	Apr 2023	US
Child	18221089		US
