This application claims priority to Korean Patent Application Nos. 10-2024-0071933 filed on May 31, 2024, and 10-2023-0195690 filed on Dec. 28, 2023, which are hereby incorporated by reference in their entirety.
The present invention relates to an artificial intelligence (AI) neural network accelerator and a method therefor, and more particularly to an AI neural network accelerator for a transformer neural network, which requires a large number of weights whose reuse is difficult, thereby requiring frequent external memory access and consequently consuming a large amount of power, and a method therefor.
A transformer neural network is a neural network that learns context and meaning by tracking relationships in sequential data, such as the words in a sentence, and is replacing convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
Such a transformer neural network mathematically finds patterns between elements, so there is no need to construct a large labeled training dataset, and it has been widely used in image classification and large language models since it is suitable for parallel processing and therefore executes quickly. Accordingly, mobile systems providing real-time responses have been anticipated.
However, such a transformer neural network requires a large number of weights whose reuse is difficult, thereby requiring frequent external memory access and consequently consuming a large amount of power.
Therefore, to solve this problem, various technologies (Y. Wang et al., “A 28 nm 27.5TOPS/W Approximate-Computing-Based Transformer Processor with Asymptotic Sparsity Speculating and Out-of-Order Computing,” 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2022, pp. 1-3; F. Tu et al., “A 28 nm 15.59 μJ/Token Full-Digital Bitline-Transpose CIM-Based Sparse Transformer Accelerator with Pipeline/Parallel Reconfigurable Modes,” 2022 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2022, pp. 466-468; and F. Tu et al., “16.1 MulTCIM: A 28 nm 2.24 μJ/Token Attention-Token-Bit Hybrid Sparse Digital CIM-Based Accelerator for Multimodal Transformers,” 2023 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2023, pp. 248-250) have been proposed to increase hardware utilization and reduce power consumption. However, the system power consumption and response time of these transformer processors are not yet suitable for a mobile device. For example, a large language model such as GPT-2 has a large number of weights (400 to 700 M), and external memory access consumes 68% of the total power.
In addition, to alleviate the external memory access bottleneck, a transformer accelerator applying pruning (S. Liu et al., “16.2 A 28 nm 53.8TOPS/W 8b Sparse Transformer Accelerator with In-Memory Butterfly Zero Skipper for Unstructured-Pruned NN and CIM-Based Local-Attention-Reusable Engine,” 2023 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 2023, pp. 250-252) has been proposed. However, while this transformer accelerator may increase weight sparsity, it can be applied only to simple tasks such as predicting the next word (for example, language modeling), and cannot achieve high sparsity in advanced tasks such as language translation, question answering, and summarization.
Therefore, a novel approach that compresses weights to reduce the amount of external memory access is needed for highly energy-efficient mobile large language model acceleration.
Therefore, the present invention has been made in view of the above problems, and provides an AI neural network accelerator and a method therefor that may reduce the number of weights required for accelerating an AI neural network, thereby reducing the amount of external memory access and consequently reducing power consumption, so as to accelerate a highly energy-efficient mobile large language model.
In addition, the present invention provides an AI neural network accelerator and a method therefor that may reduce the number of weights required for accelerating an AI neural network by primarily accelerating a reduced model, in which the number of weights required to accelerate a basic model is reduced by a predetermined ratio, and then performing a secondary acceleration step that accelerates the basic model only in the special case where prediction accuracy of the primary result is less than or equal to a preset threshold value.
In addition, the present invention provides an AI neural network accelerator and a method therefor that may reduce the amount of external memory access for receiving weights by including a weight embedding logic generated in advance as a result of training in which weights of a transformer neural network are matched with kernel location information of the transformer neural network, receiving only the kernel location information from an external memory to generate an implicit weight for accelerating the AI neural network, and then supplying the implicit weight through an on-chip network.
In addition, the present invention provides an AI neural network accelerator and a method therefor that may receive compressed kernel location information from an external memory and decompress it, reducing the time required to receive the kernel location information and thereby reducing the external memory access time.
In accordance with an aspect of the present invention, the above and other objects can be accomplished by the provision of an artificial intelligence (AI) neural network accelerator configured to accelerate a transformer neural network, the AI neural network accelerator including a calculator configured to predict output tokens in units of input tokens and including a plurality of transformer operation cores operating based on a transformer model using n weights (n being a natural number), and a controller configured to control an operation of each of the transformer operation cores after determining a size of the transformer model based on a number of weights, wherein the controller performs a control operation to perform primary prediction in which each of the plurality of transformer operation cores predicts output tokens using a reduced transformer model (hereinafter referred to as a “reduced model”) using m weights (m being a natural number satisfying m<n) in response to acquisition of the input tokens, and to further perform secondary prediction of predicting output tokens for the input tokens using the transformer model before reduction (hereinafter referred to as a “basic model”) exclusively when prediction accuracy of a result of the primary prediction is less than or equal to a preset threshold value.
The AI neural network accelerator may further include a weight generator including a weight embedding logic generated in advance as a result of training by matching weights of the transformer neural network with a×b kernel location information of the transformer neural network, and configured to generate an implicit weight based on location information of a kernel input from an external memory, wherein the controller provides the implicit weight through an on-chip network in response to a request from the plurality of transformer operation cores.
The weight generator may include a code decompression unit configured to decompress location information of a kernel input in a code-compressed state from the external memory, and an implicit weight generation unit including the weight embedding logic and configured to apply location information of a kernel decompressed by the code decompression unit to the weight embedding logic to generate an implicit weight corresponding to a location of the decompressed kernel.
The implicit weight generation unit may include a two-dimensional (2D) MAC array configured to perform multiplication and accumulation operations to generate the implicit weight using the decompressed location information of the kernel, and a weight embedding logic configured to select weight embedding corresponding to the location information of the kernel and transfer the weight embedding to the 2D MAC array.
The weight generator may include a transformer weight memory configured to store the implicit weight, and an on-chip network switch configured to deliver the implicit weight through the on-chip network in response to a request from at least one of the plurality of transformer operation cores.
In accordance with another aspect of the present invention, there is provided a method of accelerating an AI neural network using an AI neural network accelerator including a plurality of transformer operation cores operating based on a transformer model using n weights (n being a natural number) and configured to accelerate a transformer neural network, the method including performing, by the AI neural network accelerator, primary prediction of predicting output tokens using a reduced transformer model (hereinafter referred to as a “reduced model”) using m weights (m being a natural number satisfying m<n) in response to acquisition of input tokens, calculating, by the AI neural network accelerator, prediction accuracy for a prediction result of the performing primary prediction, and further performing, by the AI neural network accelerator, secondary prediction of predicting output tokens for the input tokens using the transformer model before reduction (hereinafter referred to as a “basic model”) when the prediction accuracy is less than or equal to a preset threshold value.
The method may further include generating, by the AI neural network accelerator, an implicit weight based on location information of a kernel input from an external memory and a weight embedding logic generated in advance as a result of training in which weights of an existing transformer neural network are matched with a×b kernel location information of the transformer neural network, storing the implicit weight by the AI neural network accelerator, and transferring, by the AI neural network accelerator, the implicit weight through an on-chip network, wherein each of the performing primary prediction and the further performing secondary prediction includes predicting output tokens for the input tokens using the implicit weight.
The generating an implicit weight may include decompressing location information of a kernel input in a code-compressed state from the external memory, and the location information of the kernel decompressed in the decompressing may be applied to the weight embedding logic to generate an implicit weight corresponding to a decompressed location of the kernel.
The above and other objects, features and other advantages of the present invention will be more clearly understood from the following detailed description taken in conjunction with the accompanying drawings, in which:
Hereinafter, embodiments of the present invention will be described with reference to the attached drawings, and will be described in detail so that those skilled in the technical field to which the present invention pertains may easily practice the present invention. However, the present invention may be implemented in various different forms and is not limited to the embodiments described herein. Meanwhile, to clearly describe the present invention in the drawings, parts not related to the description are omitted, and similar parts are given similar drawing reference numerals throughout the specification. In addition, descriptions of parts that may be easily understood by those skilled in the art even when a detailed description is omitted are omitted.
Throughout the specification and claims, when a part is described as including a certain component, this means that other components may be further included rather than excluded, unless specifically stated otherwise.
Referring to
In this instance, each of the transformer operation cores 121 may include eight equivalent multipliers and accumulators (for 8-bit inputs and 8-bit weights), an input loader, a controller, an on-chip network switch, and input/weight/output memories to perform matrix multiplication required for each decoder block of the transformer and generate an output token each time.
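As a non-limiting illustration of this datapath, the following Python sketch shows how eight 8-bit multiply-accumulate lanes could tile a matrix multiplication; the 32-bit accumulation and the function and variable names are assumptions for illustration only and do not represent the actual hardware implementation.

```python
import numpy as np

def core_matmul_int8(activations: np.ndarray, weights: np.ndarray, num_lanes: int = 8) -> np.ndarray:
    """Hypothetical sketch: tile an int8 matrix multiplication over `num_lanes`
    multiply-accumulate lanes, accumulating in 32-bit as is common for 8-bit MACs."""
    rows, depth = activations.shape
    depth_w, cols = weights.shape
    assert depth == depth_w
    out = np.zeros((rows, cols), dtype=np.int32)
    for r in range(rows):
        for c in range(cols):
            acc = np.int32(0)
            # Each pass feeds `num_lanes` input/weight pairs to the parallel MAC lanes.
            for k in range(0, depth, num_lanes):
                a_lane = activations[r, k:k + num_lanes].astype(np.int32)
                w_lane = weights[k:k + num_lanes, c].astype(np.int32)
                acc += np.int32(np.dot(a_lane, w_lane))
            out[r, c] = acc
    return out
```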
The gateways 10 and 20 may connect an external memory (not illustrated) and the AI neural network accelerator 100. The gateways 10 and 20 may be used to transfer weights stored in the external memory (not illustrated) to the AI neural network accelerator 100, and to transfer processing results generated in the AI neural network accelerator 100 to the external memory (not illustrated).
The token input unit 110 receives a token as input. That is, the token input unit 110 receives one token at a time from an input sentence including several tokens and transfers the token to the calculator 120.
The calculator 120 predicts output tokens in units of input tokens input through the token input unit 110. To this end, the calculator 120 may include a plurality of transformer operation cores 121, and each of the transformer operation cores 121 includes a transformer model.
In this instance, the size of the transformer model is determined by the number of required weights, and power consumption increases as the size of the transformer model increases. In other words, the more weights a transformer model requires, the larger it becomes. Due to this characteristic, when the number of weights required by a transformer model (hereinafter referred to as a “basic model”) using n weights (n being a natural number) is decreased, the size of the basic model may be reduced, and thus power consumption may be significantly reduced.
Therefore, the present invention uses this characteristic to predict a large number of tokens among the total input tokens using a transformer model (hereinafter referred to as a “reduced model”) whose size is reduced by decreasing the number of required weights to m weights (m being a natural number satisfying m<n), thereby reducing the number of weights required for accelerating the AI neural network. In this way, external memory access is reduced, and power consumption is reduced accordingly.
To this end, the calculator 120 is controlled by the controller 130 to be described later.
The controller 130 may control an operation of each of the transformer operation cores 121 after determining the size of the transformer model based on the number of weights. In particular, the controller 130 may control each of the plurality of transformer operation cores 121 so that the core performs primary prediction on an input token using the reduced model, then calculates prediction accuracy for a result of the primary prediction, and additionally performs secondary prediction using the basic model only when the prediction accuracy is less than or equal to a preset threshold value (for example, a prediction accuracy of 60%).
In this instance, the prediction accuracy may be calculated using various known techniques. Based on the calculated prediction accuracy, the controller 130 manages the operation so that the small model is computed first and the large model is executed only as needed.
For example, in response to acquisition of the input token, the controller 130 causes each of the transformer operation cores 121 to primarily predict an output token using a reduced model in which the number of weights of the basic model (n, a natural number) is reduced to n/10, and to omit the secondary prediction process when the prediction accuracy exceeds the threshold value, thereby reducing the number of required weights. Such a process of the controller 130 is schematically displayed in
Referring to
In this way, compared to the conventional technology of predicting output tokens for all input tokens with the basic model, the present invention significantly reduces the number of required weights by predicting output tokens for all input tokens using a reduced model that requires a small number of weights, and performing additional output token prediction using the basic model, which requires a large number of weights, only for the small number of input tokens with low prediction accuracy.
When this method of the present invention, which primarily performs only small model calculation and skips large model calculation when the prediction probability of a specific token exceeds a predefined threshold, is applied to language modeling using GPT-2, external memory access is reduced by 39%.
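A minimal sketch of this control flow is given below, assuming each model exposes a predict call that returns a probability distribution over the vocabulary and that prediction accuracy is approximated by the probability of the top token; the names and the threshold value are illustrative assumptions rather than the actual controller implementation.

```python
import numpy as np

def predict_token(reduced_model, basic_model, input_tokens, threshold: float = 0.6):
    """Hypothetical sketch of the two-stage prediction: run the reduced (small) model
    first, and invoke the basic (large) model only when the small model's confidence
    in its predicted token does not exceed the threshold."""
    # Primary prediction with the reduced model (fewer weights, less external memory access).
    probs_small = reduced_model.predict(input_tokens)   # assumed: returns vocab-sized probabilities
    token_small = int(np.argmax(probs_small))
    confidence = float(probs_small[token_small])

    if confidence > threshold:
        # Confident enough: skip the basic model entirely.
        return token_small

    # Secondary prediction with the basic model only for low-confidence tokens.
    probs_large = basic_model.predict(input_tokens)
    return int(np.argmax(probs_large))
```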
Referring again to
The weight generator 200 generates an implicit weight using an “AI neural network” trained to implicitly remember the weights of the transformer neural network; it may receive only a×b kernel location information of the transformer neural network as input and output an implicit weight corresponding to that location. In this instance, the “AI neural network” is a neural network separate from the transformer neural network, and a commonly used multilayer perceptron may be trained and used to generate the weights for the transformer neural network. A process of training the AI neural network and moving data will be described later with reference to
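As a conceptual illustration of such training, the following sketch fits a small multilayer perceptron to map a×b kernel locations to the corresponding weight values so that only locations need to be supplied at inference time; the network shape, loss, optimizer, and tensor sizes are illustrative assumptions and do not reflect the actual training procedure.

```python
import torch
import torch.nn as nn

# Hypothetical: W is an (a*b, k) tensor of transformer kernel weights to be memorized.
a, b, k = 8, 8, 64
W = torch.randn(a * b, k)  # stand-in for the real transformer weights

# Inputs are normalized (row, col) kernel locations; targets are the weights at that location.
coords = torch.stack(torch.meshgrid(torch.arange(a), torch.arange(b), indexing="ij"), dim=-1)
coords = coords.reshape(-1, 2).float()
coords = coords / coords.max()

mlp = nn.Sequential(nn.Linear(2, 128), nn.ReLU(), nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, k))
opt = torch.optim.Adam(mlp.parameters(), lr=1e-3)

for step in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(mlp(coords), W)  # teach the MLP to reproduce the weights
    loss.backward()
    opt.step()

# At inference time, only the kernel location is needed to regenerate an implicit weight.
implicit_weight = mlp(coords[3:4])  # implicit weights for kernel location (0, 3)
```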
To generate the implicit weight, the weight generator 200 may include a code decompression unit 210, an implicit weight generation unit 220, a transformer weight memory 230, and an on-chip network switch 240.
The code decompression unit 210 decompresses location information of a kernel input in a code-compressed state from an external memory. To this end, as a configuration for decompression of normal code-compressed data, the code decompression unit 210 includes an index memory (IDX_MEM) 211, an MSB memory (MSB_MEM) 212, a sign memory (Sign_MEM) 213, an LSB memory (LSB_MEM) 214, and a router 215.
The index memory (IDX_MEM) 211 stores addresses of data whose MSB side is not filled with consecutive sign bits, the MSB memory (MSB_MEM) 212 stores the MSB part values of data whose MSB side is not filled with sign extension bits (that is, uncompressed data), the sign memory (Sign_MEM) 213 stores the sign bit compressed into 1 bit, and the LSB memory (LSB_MEM) 214 stores the LSB data. The router 215 decompresses the compressed data using the information in the index memory (IDX_MEM) 211, the MSB memory (MSB_MEM) 212, the sign memory (Sign_MEM) 213, and the LSB memory (LSB_MEM) 214, and transfers the decompressed data to an 8-bit queue.
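The following sketch illustrates one possible decompression flow consistent with this description, under the assumption that each 8-bit value is split into 4-bit MSB and LSB halves, with sign-extendable values stored as a 1-bit sign plus the LSB half and the remaining values listed by index together with their MSB halves; the bit widths, memory layout, and function name are assumptions for illustration rather than the actual router logic.

```python
def decompress_kernel_locations(idx_mem, msb_mem, sign_mem, lsb_mem, length):
    """Hypothetical sketch of the router: rebuild 8-bit values from the four memories.

    idx_mem : indices of values whose upper half is NOT pure sign extension
    msb_mem : the stored upper 4 bits for exactly those indices (in order)
    sign_mem: one sign bit per value
    lsb_mem : the lower 4 bits of every value
    """
    uncompressed = dict(zip(idx_mem, msb_mem))
    out = []
    for i in range(length):
        lsb = lsb_mem[i] & 0xF
        if i in uncompressed:
            # MSB half was stored explicitly because it is not sign extension.
            msb = uncompressed[i] & 0xF
        else:
            # MSB half is reconstructed by extending the 1-bit sign.
            msb = 0xF if sign_mem[i] else 0x0
        out.append((msb << 4) | lsb)  # an 8-bit value pushed to the 8-bit queue
    return out

# Example: the value at index 2 keeps its stored MSB half; the others are sign-extended.
codes = decompress_kernel_locations(idx_mem=[2], msb_mem=[0x5],
                                    sign_mem=[0, 1, 0, 0],
                                    lsb_mem=[0x3, 0xA, 0x1, 0x7], length=4)
# codes == [0x03, 0xFA, 0x51, 0x07]
```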
The implicit weight generation unit 220 includes a two-dimensional (2D) MAC array 221 that performs multiplication and accumulation operations to generate an implicit weight using location information of a kernel decompressed by the code decompression unit 210, and a weight embedding logic 222 generated in advance as a result of training by matching weights of the transformer neural network with a×b kernel location information of the transformer neural network.
When location information of a kernel decompressed by the code decompression unit 210 is input, the implicit weight generation unit 220 may detect weight embedding corresponding to the location information from the weight embedding logic 222 and generate an implicit weight.
To this end, the weight embedding logic 222 selects weight embedding corresponding to the location information of the kernel decompressed by the code decompression unit 210 and transfers the weight embedding to the 2D MAC array 221, thereby generating an implicit weight.
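A minimal sketch of this generation path is given below, under the assumption that the weight embedding logic is a lookup table of learned embedding vectors and that the 2D MAC array applies learned projection matrices (for example, the layers of a trained multilayer perceptron) to the selected embedding; the array sizes and names are illustrative.

```python
import numpy as np

def generate_implicit_weight(kernel_location: int, embedding_table: np.ndarray,
                             mac_weights: list) -> np.ndarray:
    """Hypothetical sketch: the weight embedding logic selects the embedding for the
    decompressed kernel location, and the 2D MAC array multiplies and accumulates it
    through the learned layers to produce the implicit weight for that location."""
    x = embedding_table[kernel_location]      # weight embedding selected by kernel location
    for layer in mac_weights[:-1]:
        x = np.maximum(layer @ x, 0.0)        # multiply-accumulate plus ReLU on the MAC array
    return mac_weights[-1] @ x                # implicit weight values for this kernel location

# Illustrative shapes: 64 possible kernel locations, 16-d embeddings, a 3x3 kernel (9 outputs).
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((64, 16))
mac_weights = [rng.standard_normal((32, 16)), rng.standard_normal((9, 32))]
w = generate_implicit_weight(kernel_location=5, embedding_table=embedding_table, mac_weights=mac_weights)
```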
The transformer weight memory 230 stores the implicit weight generated by the implicit weight generation unit 220.
The on-chip network switch 240 transmits an implicit weight stored in the transformer weight memory 230 to the corresponding transformer operation core 121 through the on-chip network in response to a request from at least one of the plurality of transformer operation cores 121. To this end, the on-chip network switch 240 may be controlled by the controller 130. That is, the controller 130 may control an operation of the on-chip network switch 240 to provide the implicit weight through the on-chip network in response to a request from the plurality of transformer operation cores 121.
This process of the weight generator 200 is schematically illustrated in
Through this process, the decompressed data (that is, the location information of a decompressed kernel) is stored in an 8-bit queue and then transferred to the 2D MAC array 221, and the weight embedding corresponding to the location information is detected from the weight embedding logic 222 to generate an implicit weight.
The implicit weight generated in this way is used in a decoder block operation (not illustrated) that forms the transformer network, and may be specifically used in attention or feed-forward operation.
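For instance, assuming the generated implicit weights are assembled into the projection matrices of an attention operation, their use could look like the following sketch; the names and dimensions are illustrative and do not represent the actual decoder block implementation.

```python
import numpy as np

def attention_with_implicit_weights(x, W_q, W_k, W_v):
    """Hypothetical sketch: implicit weights W_q, W_k, W_v reconstructed on chip are used
    directly as the query/key/value projections of a single attention head."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)   # softmax over key positions
    return attn @ v
```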
First, referring to
Referring to
First, in step S110, the AI neural network accelerator 100 of the present invention generates an implicit weight. That is, in step S110, the AI neural network accelerator 100 has a weight embedding logic generated in advance as a result of training by matching weights of the existing transformer neural network with a×b kernel location information of the transformer neural network, and generates an implicit weight based on the kernel location information input from an external memory and the weight embedding logic.
To this end, in step S111, the weight embedding logic is stored in the implicit weight generation unit 220, and in steps S112 and S113, the location information of the kernel input in a code-compressed state from the external memory is decompressed.
In step S114, the implicit weight generation unit 220 applies the location information of the kernel decompressed in step S113 to the weight embedding logic to generate an implicit weight corresponding to the location of the decompressed kernel. That is, in step S114, the implicit weight generation unit 220 generates an implicit weight from the decompressed kernel location and the weight embedding corresponding thereto.
In step S115, the transformer weight memory 230 stores the implicit weight.
In steps S120 and S130, the transformer operation core 121 performs primary prediction to predict output tokens using a reduced transformer model (hereinafter referred to as a “reduced model”) that uses m weights (m being a natural number satisfying m<n) in response to acquisition of input tokens. To this end, in step S130, the transformer operation core 121 is controlled by the controller 130.
In step S140, the controller 130 calculates prediction accuracy for a primary prediction result of step S130.
In addition, in steps S150 and S160, the controller 130 compares the prediction accuracy with a preset threshold value, and when the prediction accuracy is less than or equal to the preset threshold value, the controller 130 controls the transformer operation core 121 so that secondary prediction is further performed to predict an output token for the input token using the transformer model before reduction (hereinafter referred to as the “basic model”). That is, in step S160, the transformer operation core 121 performs the secondary prediction under the control of the controller 130.
To this end, steps S130 and S160 may further include an implicit weight transfer step in which the transformer operation core 121 requests an implicit weight, and in response, the on-chip network switch 240 transfers the implicit weight stored in the transformer weight memory 230 through the on-chip network.
Further, in steps S130 and S160, the transformer operation core 121 predicts an output token for the input token using the implicit weight.
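Putting the steps together, a simplified end-to-end sketch of steps S110 through S160 is given below, reusing hypothetical helpers for decompression, implicit weight generation, and the two models; the control structure and all names are illustrative assumptions rather than the actual accelerator behavior.

```python
def accelerate(input_tokens, compressed_locations, weight_generator,
               reduced_model, basic_model, threshold=0.6):
    """Hypothetical end-to-end sketch of the method (S110-S160)."""
    # S110-S115: decompress kernel locations, generate implicit weights, and store them.
    locations = weight_generator.decompress(compressed_locations)              # S112-S113
    implicit_weights = [weight_generator.generate(loc) for loc in locations]   # S114
    weight_memory = weight_generator.store(implicit_weights)                   # S115

    output_tokens = []
    for token in input_tokens:
        # S120-S130: primary prediction with the reduced model using the implicit weights.
        probs = reduced_model.predict(token, weight_memory)
        prediction, accuracy = max(enumerate(probs), key=lambda p: p[1])       # S140

        # S150-S160: secondary prediction with the basic model only when accuracy is low.
        if accuracy <= threshold:
            probs = basic_model.predict(token, weight_memory)
            prediction = max(range(len(probs)), key=lambda i: probs[i])
        output_tokens.append(prediction)
    return output_tokens
```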
In this way, the present invention has an effect of reducing external memory access by performing some operations inside the accelerator instead of fetching all weights for acceleration of the transformer neural network from the outside. In other words, the present invention may increase energy efficiency while maintaining the accuracy of transformer inference through a hybrid structure of large and small AI neural networks and implicit weight generation.
As an example, in the case of language modeling using GPT-2, external memory access may be reduced by 39% using the method of the present invention, which performs only small model calculation first and skips large model calculation when the prediction probability of a specific token exceeds a predefined threshold value. In addition, external memory access may be reduced by 60% by generating implicit weights using only the locations of kernels as input, and by up to 74% by applying code compression technology to compress the AI network for implicit weight generation.
As another example, in the case of language translation using mT5, external memory access may be reduced by 42% using mixed large and small network structures, external memory access may be reduced by 67% through implicit weight generation, and external memory access may be reduced by 78% using code compression technology.
In addition, in the case of summarization using T5, external memory access may be reduced by 59% using mixed large and small network structures, external memory access may be reduced by 71% through implicit weight generation, and external memory access may be reduced by 78% using code compression technology.
Finally, in the case of language translation using FSMT, external memory access may be reduced by 48% using mixed large and small network structures, external memory access may be reduced by 72% through implicit weight generation, and external memory access may be reduced by 81% using code compression technology.
The effects of the present invention are schematically illustrated in
Referring to
In addition, referring to
Meanwhile, referring to
As described above, compared to conventional technology, the present invention shows an accuracy change of −0.52 to −1.29, a weight compression ratio of 74 to 81%, and a weight loading energy reduction rate of 71 to 76%.
As described above, an AI neural network accelerator and a method therefor of the present invention have a characteristic of being able to reduce the number of weights required for accelerating an AI neural network, thereby reducing the amount of external memory access and consequently reducing power consumption, so as to accelerate a highly energy-efficient mobile large language model.
In addition, the present invention has a characteristic of being able to reduce the number of weights required for accelerating an AI neural network by primarily accelerating a reduced model, in which the number of weights required to accelerate a basic model is reduced by a predetermined ratio, and then performing a secondary acceleration step that accelerates the basic model only in the special case where prediction accuracy of the primary result is less than or equal to a preset threshold value.
In addition, the present invention has a characteristic of being able to reduce the amount of external memory access for receiving weights by including a weight embedding logic generated in advance as a result of training in which weights of a transformer neural network are matched with kernel location information of the transformer neural network, receiving only the kernel location information from an external memory to generate an implicit weight for accelerating the AI neural network, and then supplying the implicit weight through an on-chip network.
An AI neural network accelerator and a method therefor of the present invention described above have an effect of being able to reduce the number of weights required for accelerating an AI neural network, thereby reducing the amount of external memory access and consequently reducing power consumption, so as to accelerate a highly energy-efficient mobile large language model.
In addition, the present invention has an effect of being able to reduce the number of weights required for accelerating an AI neural network by primarily accelerating a reduced model, in which the number of weights required to accelerate a basic model is reduced by a predetermined ratio, and then performing a secondary acceleration step that accelerates the basic model only in the special case where prediction accuracy of the primary result is less than or equal to a preset threshold value.
In addition, the present invention has an effect of being able to reduce the amount of external memory access for receiving weights by including a weight embedding logic pre-generated as a result of training in which weights of a transformer neural network are matched with kernel location information of the transformer neural network, receiving only the kernel location information from an external memory to generate an implicit weight for accelerating the AI neural network, and then supplying the implicit weight through an on-chip network.
Even though the embodiments of the present invention have been described above, the scope of the present invention is not limited thereto, and includes all changes and modifications that can be easily derived from the embodiments by a person having ordinary skill in the art to which the present invention pertains and that are recognized as equivalent thereto.
Number | Date | Country | Kind |
---|---|---|---
10-2023-0195690 | Dec 2023 | KR | national |
10-2024-0071933 | May 2024 | KR | national |