The present specification relates generally to computation of neural networks in natural language processing, and, more specifically, to systems and methods for quantization of parameters of neural networks.
Modern computing platforms, from servers to mobile and embedded devices, are energy constrained. Memory accesses, be it off- or on-chip, account for a significant fraction of overall energy consumption when executing neural models. Memory footprint, bandwidth and energy limitations are most acute for attention-based models in language understanding tasks.
Retraining and fine-tuning of a model can be computationally expensive or infeasible and may sacrifice accuracy if used to improve computational costs of a model.
According to an aspect, there is provided a computer-implemented method for memory storage, including storing a neural network in memory by: storing, in the memory, one or more values of the neural network each as a reference to a representative value.
In some embodiments, the method further includes storing, in the memory, one or more additional values of the neural network.
In some embodiments, the reference is the representative value.
In some embodiments, each of the references are generated by quantizing one of the one or more values of the neural network.
In some embodiments, the method further includes storing, in the memory, a reconstruction table storing each of the representative values for each of the references.
In some embodiments, each of the one or more representative values are generated by assigning each of the one or more values of the neural network to a cluster; and for each cluster, selecting a selected value from the cluster as the representative value for each of the one or more values of the neural network of the cluster.
In some embodiments, each selected value is a centroid, the centroid being an average of the one or more values of the neural network in the cluster that the selected value is selected from.
In some embodiments, the method further comprises: minimizing a sum of each distance between a) each of the one or more values of the neural network and b) the centroid of the cluster that the value is assigned to by iteratively: performing the assigning, the assigning further comprising reassigning a first value of the one or more values of the neural network from an original cluster of the clusters to a new cluster of the clusters where an original distance of the first value to the centroid of the original cluster is greater than a new distance of the first value to the centroid of the new cluster; and subsequently performing the selecting on at least the original cluster and the new cluster. The first value is a different value of the one or more values of the neural network upon each iteration.
In some embodiments, the method further includes generating an output from performing one or more multiply-accumulate operations A1B1+ . . . +AnBn on input vectors A and input vectors B, wherein n is the n-th input vector and wherein one or more of input vectors B are each one of the representative values, by accumulating input vectors A to an accumulated sum of input vectors A per input vector B having the same representative value and subsequently multiplying each of the accumulated sums of input vectors A by the representative value of the input vector B.
In some embodiments, one or more of the input vectors B are each one of the additional values.
In some embodiments, each of the additional values satisfy a criterion of a distribution, content, or value count of a component of the neural network.
In some embodiments, each of the additional values are in a component of the neural network and have a probability pdf less than a threshold value, the probability pdf defined by
pdf(x) = (1/(σ√(2π))) · exp(−(x−μ)²/(2σ²))
wherein x is the additional value, μ is a mean of one or more parameters in the component of the neural network having x, and σ is a standard deviation of one or more parameters in the component of the neural network having x.
In some embodiments, each of the additional values are outside a threshold range from a distribution fit of both the values and the additional values in a component of the neural network.
In some embodiments, at least one of the references is encoded using three bits or four bits.
In some embodiments, at least one of the one or more values of the neural network is an embedding.
In some embodiments, at least one of the one or more values of the neural network is a weight.
In accordance with an aspect, there is provided a system for computation of layers in a neural network, including one or more processing elements each configured to accumulate one or more values of the neural network for each of one or more references to an identical representative value to generate an output for each accumulation; and a shared processing unit configured to, for each of the outputs from each processing element, multiply the output with the identical representative value respective to the output to generate a final output.
In some embodiments, the shared processing unit is further configured to accumulate one or more of the final outputs with one or more outlier values.
In accordance with an aspect, there is provided a computer system for memory storage, including a memory; at least one processor in communication with the computer memory, the memory comprising instructions which, when executed by the at least one processor, carries out the steps of storing a neural network in memory by storing, in the memory, one or more values of the neural network each as a reference to a representative value.
In some embodiments, the storing of the neural network in memory further includes storing, in the memory, one or more additional values of the neural network.
In some embodiments, each of the one or more representative values is generated by assigning each of the one or more values of the neural network to a cluster; and for each cluster, selecting a selected value from the cluster as the representative value for each of the one or more values of the neural network of the cluster.
In some embodiments, the steps further comprise minimizing a sum of each distance between a) each of the one or more values of the neural network and b) the selected value of the cluster that the value is assigned to by iteratively performing the assigning, the assigning further comprising reassigning a first value of the one or more values of the neural network from an original cluster of the clusters to a new cluster of the clusters where an original distance of the first value to the selected value of the original cluster is greater than a new distance of the first value to the selected value of the new cluster; and subsequently performing the selecting on at least the original cluster and the new cluster. The first value is a different value of the one or more values of the neural network upon each iteration.
In accordance with an aspect, there is provided a non-transient computer-readable medium containing computer-readable instructions which, when executed by a computer processor, perform a method of storing a neural network in memory by storing, in the memory, one or more values of the neural network each as a reference to a representative value.
In some embodiments, the storing of the neural network in memory further includes storing, in the memory, one or more additional values of the neural network.
In some embodiments, each of the one or more representative values is generated by assigning each of the one or more values of the neural network to a cluster; and for each cluster, selecting a selected value from the cluster as the representative value for each of the one or more values of the neural network of the cluster.
In some embodiments, each selected value is a centroid, the centroid being an average of the one or more values of the neural network in the cluster that the selected value is selected from.
In some embodiments, the method further includes minimizing a sum of each distance between a) each of the one or more values of the neural network and b) the centroid of the cluster that the value is assigned to by iteratively performing the assigning, the assigning further comprising reassigning a first value of the one or more values of the neural network from an original cluster of the clusters to a new cluster of the clusters where an original distance of the first value to the centroid of the original cluster is greater than a new distance of the first value to the centroid of the new cluster; and subsequently performing the selecting on at least the original cluster and the new cluster. The first value is a different value of the one or more values of the neural network upon each iteration.
In some embodiments, the method further includes generating an output from performing one or more multiply-accumulate operations A1B1+ . . . +AnBn on input vectors A and input vectors B, wherein n is the n-th input vector and wherein one or more of input vectors B are each one of the representative values, by accumulating input vectors A to an accumulated sum of input vectors A per input vector B having the same representative value and subsequently multiplying each of the accumulated sums of input vectors A by the representative value of the input vector B.
In some embodiments, one or more of the input vectors B are each one of the additional values.
Other aspects and features according to the present application will become apparent to those ordinarily skilled in the art upon review of the following description of embodiments of the invention in conjunction with the accompanying figures.
Reference will be made to the accompanying figures to provide exemplary embodiments of the invention, incorporating principles and aspects of the present invention, and in which:
Attention-based models have demonstrated remarkable success in various natural language understanding tasks. However, efficient execution remains a challenge for these models which are memory-bound due to their massive number of parameters. Embodiments herein provide GOBO, a model quantization technique that compresses the vast majority (e.g., 99.9%) of the 32-bit floating-point parameters of state-of-the-art BERT (Bidirectional Encoder Representations from Transformers) models and their variants to 3 bits while maintaining their accuracy. Unlike other quantization methods, GOBO does not require fine-tuning nor retraining to compensate for the quantization error. Two example practical hardware applications of embodiments of GOBO are described. In embodiments in the first example, GOBO reduces memory storage and traffic and as a result inference latency and energy consumption. This GOBO memory compression mechanism can be plug-in compatible with many architectures, for example, the TPU, Eyeriss, and an architecture using Tensor Cores-like units. In embodiments in the second example, a hardware architecture that also reduces computation is described. In some embodiments, the GOBO architecture maintains most of the weights in 3b even during computation. This can: (i) make the processing elements area efficient, allowing more compute power per unit area (ii) replace most multiply-accumulations with additions, and (iii) reduce the off-chip traffic by amplifying on-chip memory capacity.
Modern computing platforms, from servers to mobile and embedded devices, are energy constrained. Techniques for improving their energy efficiency can yield several benefits. They can reduce the energy footprint of data centers and, with it, operating costs and environmental impact. They can increase uptime for mobile devices, and they can boost the capability of all systems by allowing them to perform more computations per unit of time. Improving energy efficiency can be especially important for deep learning workloads, which can be particularly compute and memory intensive, and more so for larger neural network models that require more computations and memory.
Memory accesses, be it off- or on-chip, can account for a significant fraction of overall energy consumption when executing neural models. Memory footprint, bandwidth and energy limitations can be most acute for attention-based models in language understanding tasks. Among them, the BERT family of natural language models can deliver best-of-class accuracy. Their footprint, accesses, and execution time are dominated by the parameters (e.g., weights) of their numerous attention layers. The largest and most accurate among those models, BERT-Large has a footprint of 1.12 GB, while BERT-base sacrifices some accuracy to reduce footprint to 326 MB. These BERT models are particularly expensive to train and use 32b floating-point parameters even for inference. Training BERT-Large on 16 Cloud TPUs (64 TPU chips) can take 4 days. For this reason, a pre-trained version can be used and then refined to a target task. Refinement time can vary per task but usually takes hours to a few days for BERT-Large on an RTX 2080 Ti GPU. Some architecture modifications and quantization methods can reduce the cost of BERT models but may require fine-tuning and often sacrifice accuracy. Quantization methods can prolong training time by as much as 10×.
Embodiments described herein provide a model compaction method, GOBO, for such attention-based models, which can work directly off the fine-tuned models. In some embodiments, GOBO accepts as input a trained model and reduces the number of bits needed to represent its parameters, be they weights or embeddings. In some embodiments, GOBO maintains accuracy without any further model refinement such as retraining or fine-tuning. Fine-tuning may require access to the dataset, which may not be available, or may not be possible under strict time constraints (e.g., daily updates to the model).
In some embodiments, GOBO compacts the original floating-point parameters representing the vast majority of them with 3b or 4b. For example, a floating-point parameter can be stored in memory as a representative value (e.g., as an index identifying a representative value such as in a data structure), where the representative value is encoded using 3b or 4b. In some embodiments, GOBO is plug-in compatible with an execution engine for transformer models as after decoding it, GOBO produces a model with identical architecture (e.g., layer dimensions, types, and connectivity) containing floating-point parameters. In some embodiments, GOBO can be used as an off- and on-chip compression method to reduce footprint, traffic and energy and can amplify bandwidth, capacity, performance and energy efficiency.
In some embodiments, GOBO can further boost performance and energy efficiency as GOBO can simplify computation converting most multiplications to additions. Embodiments described herein provide a specialized functional unit that takes advantage of GOBO's representation to reduce computation costs improving energy efficiency and performance.
Per layer in a neural network, the vast majority of weights can closely follow some Gaussian distribution (whose parameters vary across layers), with very few—for example, less than 0.1% per layer—“outlier” weights being the exception. In some embodiments, GOBO first stores the few outlier weights as-is, and, second, GOBO uses a dictionary of very few—for example, 8—representative values (e.g., centroids) for all other weights. Other data structures can be used. The vast majority of the weights—for example, 99.9%—are then stored as 3b indexes in this example. In some embodiments, GOBO uses a novel representative value selection algorithm that results in higher accuracy and converges much faster than linear partitioning or K-Means.
Deep Compression is representative of a class of dictionary-based compression methods that can be used to compress the parameters of fixed-point models. It can be demonstrated on 16b fixed-point convolutional neural networks. An efficient inference engine (EIE) can take advantage of Deep Compression to greatly boost energy efficiency when executing the resulting sparse neural networks. It can store weights as Huffman-encoded indexes into a dictionary of few representative values. The weights are expanded into their 16b fixed-point representative values before the MAC units. Compared to Deep Compression, in some embodiments, GOBO does not require fine-tuning, can be used on attention-based models that use floating-point values and where preserving outliers can be important for maintaining accuracy, and uses a novel method for selecting the representative values. In addition, in some embodiments, the GOBO hardware architecture never expands the quantized weights to their representative floating-point values. This can improve computational efficiency and speed, as well as reduce computational cost, power, and energy consumption.
Outlier-aware quantization can also be applied to fixed-point convolutional models. Compared to some embodiments of GOBO, this requires determining in advance what fraction of values should be outliers, uses a much larger fraction of outliers (e.g., 3%-5%), uses linear quantization to reduce data width for the non-outliers, and requires fine-tuning of the model. In post-training quantization for convolutional neural networks, the values may tend to follow a bell-shaped distribution that can be taken advantage of using a piece-wise linear quantization. In contrast, in some embodiments, GOBO targets attention-based models which are floating-point based, automatically adjusts the fraction of outliers using a Gaussian distribution fit, utilizes a novel fast converging method for selecting the non-uniform quantization values for the non-outliers, and requires no fine-tuning. This can improve computational efficiency and speed, as well as reduce computational cost, power, and energy consumption.
Embodiments of GOBO can be evaluated on various datasets and state-of-the-art attention-based NLP models: BERT (two variants), DistilBERT, RoBERTa (two variants), HAT, and SpanBERT. Embodiments of GOBO can also be compared with two quantized BERT models, Q8BERT and Q-BERT.
According to various experiments conducted on some embodiments of GOBO: For MNLI (Multi-Genre Natural Language Inference), the most challenging task in the GLUE (General Language Understanding Evaluation) benchmark, GOBO maintains accuracy while quantizing 99.9% of the weights to 3 bits. A centroid selection algorithm used by some embodiments of GOBO converges 9× faster than K-Means selection and consistently reduces the number of required centroids to half. A practical implementation of some embodiments of GOBO compression for off-chip memory reduces model footprint by 10×. For the TPU, this translates into a 10× performance improvement. Under iso-compute-area constraints, an accelerator using some embodiments of GOBO's processing units is on average 7× faster than one based on TensorCore-like units and also consumes 3× less energy.
The BERT Family of NLP Models
Some embodiments of GOBO have been experimentally assessed using the BERT family of NLP models. Attributes of some of these models will now be described.
Google's BERT is an attention-based model that is the model of choice for a variety of NLP tasks.
BERT-Large and BERT-Base: Training deep learning models capable of state-of-the-art accuracy on NLP tasks can be expensive in terms of time, energy and volume of data. To train BERT from scratch in less than 1 hour, a cluster with 1,472 V100 GPUs may be required. In this case, each V100 GPU has 32 GB of memory and consumes 450 W, for a total power cost of 662 kW for the GPUs alone. To enable practical deployment for several tasks without access to such computing power and budgets, the BERT framework introduced a “pre-training and fine-tuning” approach. In this case, BERT is pre-trained once on an unlabeled dataset—for example, billions of words—and then the pre-trained model can be fine-tuned for a few epochs to perform various tasks. Examples of two pre-trained BERT models are BERT-Base and BERT-Large. BERT-Large may achieve higher accuracy leveraging 3.5× more parameters.
Tasks: BERT may be most useful in language understanding tasks, such as sentiment analysis, paraphrasing, sentence similarity detection, and question answering. GLUE and SQuAD (Stanford Question Answering Dataset) can be used as benchmarks for such tasks. The MNLI task of GLUE was used to assess embodiments of GOBO, since a) it is the most comprehensive inference task in the dataset, and b) it is the task most sensitive to quantization. In MNLI, given two provided sentences, the premise and the hypothesis, the model predicts whether the premise entails the hypothesis, contradicts it, or neither. STS-B (Semantic Textual Similarity Benchmark) was used to assess embodiments of GOBO as a representative of GLUE tasks that are less sensitive to quantization. STS-B can be used to predict the similarity score between two human-annotated sentences. Fine-tuning BERT for SQuAD may not be practical on a single GPU as its dataset is among the largest available. Some embodiments of GOBO were also evaluated on an English to French translation task using a Transformer model (HAT).
BERT Architecture: BERT-base consists of 12 BERT Layers while BERT-large has 24.
BERT Derivatives: BERT variants may improve accuracy or reduce size compared to BERT. For example, DistilBERT uses knowledge distillation over the pre-trained BERT models to train a smaller, yet similar architecture. Facebook's RoBERTa uses hyperparameter tuning, a different training method and a different embedding table to improve accuracy while maintaining the same architecture. Some embodiments of GOBO were compared to two state-of-the-art quantized BERT variants, Intel's Q8BERT which uses 8b fixed-point values, and Q-BERT which uses dictionary compression.
Per Layer Weight Distribution:
In some embodiments, GOBO can be used to improve the computational performance of these BERT models. In some embodiments, GOBO splits the weights of each layer into two groups. For example, the “G” (Gaussian) group consists of weights whose magnitude fits within 99.9% of the values as described by the Gaussian distribution formed by the mean and standard deviation of all weights in the layer, while the second group, the “Outliers” (“O”), includes values that fall outside the Gaussian distribution. Experimentally, according to some embodiments, it has been found that: 1) representing just the outliers precisely and quantizing the rest of the model to a few representative values (e.g., 8) is sufficient for maintaining model accuracy; and 2) using representative values for all weights either drastically reduces compression (e.g., too many representative values are used) or sacrifices accuracy. In some embodiments, the G group consists of weights whose magnitude fits within a threshold level relative to attributes of the Gaussian distribution, such as the mean or standard deviation or both of all weights in the layer of the neural network that the weight is in, while the O group consists of weights whose magnitude does not fit within the threshold level relative to the same or different attributes of the Gaussian distribution (or a different threshold level).
Application and Comparison of GOBO
Compression methods for NLP models fall under three different approaches, Model Quantization, Pruning, and Knowledge Distillation. In model quantization, the objective can be to reduce the number of bits for representing the model's parameters while keeping the model architecture as-is. Pruning's goal is to remove some of the weights by forcing them to be zero. Combining pruning with zero-aware memory encoding can reduce the model's footprint. Knowledge distillation trains a smaller model, the student, to mimic the behaviour of a much larger teacher model.
Model Quantization: Quantization techniques can be direct or indirect. Direct methods map the weights to a fixed-point representation, whereas indirect methods use a dictionary of representative values and encode weights as indexes into that dictionary.
Intel's Q8BERT can use a fine-tuning method to quantize the weights and activations to 8-bit fixed-point values. Some operations such as Softmax and Layer Normalization are not quantized and use FP32. The accuracy of some embodiments of GOBO has been experimentally compared with Q8BERT on the MNLI task, and the experiments show that GOBO reduces model size more than Q8BERT while maintaining accuracy. Furthermore, GOBO is faster to deploy since it does not require fine-tuning. Although the decompressed model with GOBO can use FP32 values, in some embodiments, the GOBO hardware accelerator performs most computations without decompressing the weights.
Q-BERT is a dictionary-based fine-tuning approach that can use second-order Hessian information to quantize the model's weights to a few (4 to 16) representative values. It can store weights as indexes to those values. Q-BERT can separate the weights of each layer into multiple groups and can quantize each group separately using per-group dictionaries each of 4, 8 or 16 entries. Dividing each layer into 128 groups can result in acceptable accuracy. Finally, Q-BERT can quantize the embedding tables to 8b to avoid a significant loss in accuracy. In some embodiments, GOBO does not require fine-tuning, only keeps one dictionary per layer and quantizes the embedding layers to 3b. Some embodiments of GOBO were tested experimentally as described herein and shown to achieve higher compression than Q-BERT while maintaining accuracy.
Model Pruning: Weight pruning can reduce model footprint by forcing a portion of the weights to be zero. BERT can be pruned during training. For example, 30%-40% of the weights may be pruned based on magnitude with minimal effect on the accuracy of the final task. MNLI may be the task most sensitive to pruning. Structured pruning can remove a series of weights that correspond to a component of the model. Attention head pruning and Encoder unit pruning are examples of this approach. Pruning methods require fine-tuning to compensate for the initial accuracy loss. Some embodiments of GOBO were tested experimentally as described herein and shown to achieve nearly 10× compression (e.g., 99.9% of 32b values are compressed to 3b). In some embodiments, even ignoring its encoding overhead, a pruning method would have to remove nearly 90% of the weights to achieve similar compression. In some embodiments, GOBO is used to complement pruning.
Knowledge Distillation: Knowledge distillation can train a smaller model (student) from a larger model (teacher). Based on what the student learns from the teacher, there can be three groups of knowledge distillation approaches for BERT. In the first group, the student learns the behaviour of the encoder layer. The student can have fewer attention heads in each layer or fewer encoder layers. Another approach can train a student based on the output logits (input of the last layer's softmax). Furthermore, the student can adopt a different type of network or components thereof, such as a Convolutional Neural Network (CNN) or Long Short-Term Memory (LSTM). A Bidirectional LSTM (BiLSTM) based architecture can be used to replace the attention layers. A CNN-based model can be used to replace transformer layers where the student tries to learn the contextual dependence of the tokens (words) from the output of each attention layer. DistilBERT is a distilled model of BERT-Base and is about half its size. Some embodiments of GOBO were tested experimentally as described herein and shown to compress DistilBERT by 10×, resulting in a model that is 20× smaller than BERT-Base.
GOBO Quantization
In some embodiments, there is provided a computer-implemented method for memory storage including storing a neural network in memory by storing, in the memory, one or more values of the neural network each as a reference to a representative value; and storing, in the memory, one or more additional values of the neural network. In some embodiments, there is provided a computer-implemented method for memory storage including storing a neural network in memory by storing, in the memory, one or more values of the neural network each as a reference to a representative value. The one or more values of the neural network each stored as a reference to a representative value can comprise a “Gaussian” (G) group. In some embodiments, the one or more values of the neural network are stored as the representative value. Throughout this description, these values may be referred to as weights. In some embodiments, these values can be other values such as activations, parameters of the neural network, parameters of embeddings (e.g., in an embedding table), or any collection of values in a neural network, and may not be a weight. In some embodiments, the “Gaussian” group can refer to a group that is not Gaussian. For example, the “Gaussian” group can refer to a group having values according to a different distribution, such as non-uniform distributions. For example, this can be a Laplace distribution. The one or more additional values of the neural network can comprise an “Outlier” (O) group and can be referred to herein as an outlier weight. In some embodiments, as used herein the outlier weight can refer to other values such as outlier activations, outlier parameters of the neural network, or outlier parameters of embeddings (e.g., in an embedding table) and may not be a weight. As used herein, a neural network can refer to a component of a neural network or an embedding (e.g., an embedding table).
In some embodiments, for BERT, GOBO operates at the granularity of a layer and over the fine-tuned model. The compaction process starts with separating the weights into two groups, the “Gaussian” (G) and the “Outliers” (O). GOBO stores the outliers as-is (e.g., FP32) whereas it quantizes the “G” group values to a few representative values. For example, only a tiny fraction of the weights, typically less than 0.1%, may end up in the “O” group. In some embodiments, GOBO reduces overall model size by quantizing the Gaussian group. Since the weight distribution is not uniform, GOBO can use a non-linear quantization method that results in higher resolution where the weights are densely populated. This method can represent roughly 99.9% of the weights with just 8 representative FP32 values, while the inference error can be kept below 1% (or with 16 values for no accuracy loss). In some embodiments, a “G” group weight is stored as a 3b index to those representative values. In some embodiments, GOBO uses just one set of representative values per layer. Each set of representative values may be stored in a separate data structure. Each of the “G” group weights can be represented by a reference to the bin value of the data structure that stores the representative value assigned to the respective “G” group weight.
In some embodiments, GOBO stores the following information per layer: 1) The outliers in original form (e.g., FP32); 2) the bin index for each “G” group weight (e.g., 3b); 3) a data structure such as a reconstruction table for weights which represents the representative values (centroids) for each bin (e.g., FP32). The reconstruction table can be Header 510 as shown in
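As an illustration only, the per-layer information listed above could be held in a container such as the following sketch (Python; the class and field names are hypothetical and not the format used by the embodiments):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CompactedLayer:
    """Hypothetical per-layer container for the three items listed above."""
    outlier_positions: np.ndarray  # where each outlier sits in the weight matrix
    outlier_values: np.ndarray     # outliers kept in original form (e.g., FP32)
    g_indexes: np.ndarray          # one 3b (or 4b) bin index per "G" group weight
    centroids: np.ndarray          # reconstruction table: one FP32 centroid per bin
```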
Outlier Detection
In some embodiments, the additional values are determined using outlier detection. In some embodiments, each of the additional values are in a layer of the neural network and, in each layer, GOBO selects values to be the additional values, where the additional values are those having a probability pdf less than a threshold value, the probability pdf defined by
pdf(x) = (1/(σ√(2π))) · exp(−(x−μ)²/(2σ²))   (Eq. 1)
wherein x is the additional value, μ is a mean of one or more parameters in the layer of the neural network having x, and σ is a standard deviation of one or more parameters in the layer of the neural network having x. The parameters can be values of the neural network, for example, the values (e.g., values in the “G” group) and the additional values (e.g., outlier values). In some embodiments, this is performed per component of the neural network other than a layer. In some embodiments, each of the additional values are outside a threshold range from a Gaussian distribution fit of both the values and the additional values. In some embodiments, the log of the threshold value is in the range of −5 to −3, inclusive, for example, −4. In some embodiments, each of the additional values are outside a threshold range from a criterion related to the values (e.g., corresponding to the “G” group) and the additional values (e.g., corresponding to the “O” group) in a component (e.g., a layer) of the neural network that includes the additional value.
In some embodiments, the additional values are determined using outlier detection according to a different method. For example, in some embodiments, each of the additional values satisfy a criterion of a distribution, content, or value count of a component of the neural network, such as a layer of the neural network. The component can be a component of the neural network that includes the additional value, such as a layer, sublayer, or group of multiple layers of the neural network. As an example, an outlier can be selected based on a threshold magnitude or on the total number of outliers desired (e.g., a pre-determined number of outliers). For example, in some embodiments, GOBO is configured to select a number of outliers (e.g., 100) by selecting that number of values (e.g., weights) with the highest distance from the mean value. For example, the additional values can be selected as those values having the highest distance from the mean of the values in a component (e.g., layer) of the neural network that includes the additional value.
For example, in some embodiments, to detect the outlier weights for a FC layer, GOBO first computes the mean and the standard deviation of the layer's weights. Then per weight, it computes a probability that the weight belongs to that distribution using the PDF (Eq. 1) where x is the weight, μ is the mean, and σ is the standard deviation. GOBO uses a threshold, a configuration parameter, to select outliers. A weight whose probability is less than the threshold is deemed as an outlier. For example, a log-probability threshold of −4 can be sufficient for maintaining overall accuracy. Other threshold values can be acceptable and can be adjusted as desired such as according to the particular application.
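A minimal sketch of this outlier-detection step, assuming the layer's weights are given as a NumPy array and using the log of the Gaussian pdf (Eq. 1) with a configurable threshold (the function name and defaults are illustrative):

```python
import numpy as np

def detect_outliers(weights, log_prob_threshold=-4.0):
    # Fit a single Gaussian to the layer's weights via their mean and standard deviation.
    mu = weights.mean()
    sigma = weights.std()
    # Log of the Gaussian pdf (Eq. 1) evaluated at every weight.
    log_pdf = -0.5 * ((weights - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
    # Weights whose log-probability falls below the threshold are treated as outliers.
    return log_pdf < log_prob_threshold
```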
An outlier-aware quantization method can target fixed-point CNNs and consider a fixed, predetermined fraction of the weights (at least 3%, which is an order of magnitude more than GOBO in some embodiments) as outliers. The outliers remain in 16b and non-outlier values are linearly quantized to 4b. To reduce the quantization error, fine-tuning is required. In some embodiments, since GOBO uses nonlinear quantization and treats only about 1 in every 1,000 weights as an outlier, it can preserve accuracy without fine-tuning. In addition, in some embodiments, GOBO uses dictionary-based compression of the non-outlier group. Some embodiments of GOBO were tested experimentally as described herein against an outlier-aware inspired method that uses Linear Quantization for the non-outliers.
“G” Group Weight Quantization
In some embodiments, the representative value can be stored in the memory as a reference (e.g., index or bin number) to the representative value. In some embodiments, GOBO is configured to generate each representative value by quantizing a value (e.g., parameter such as a weight or activation) of the neural network. The value can be quantized to a reference (e.g., index or bin number) of a representative value. The reference can be encoded in 3b or 4b, for example. The representative value can be generated (e.g., by selecting) as a value based on one or more of the values of the neural network. For example, a function can be applied to the values of the neural network and the representative values can be selected using the function. For example, in some embodiments, the values of the neural network are clustered into mutually exclusive clusters, and a representative value for each of the values of the neural network in each cluster is selected, such as the average value of the values in each cluster. In some embodiments, a single representative value is assigned to each of the values of the neural network in the same cluster. Each of those values are stored in a data structure in memory and/or accessed as a reference to the representative value. The reference can be an index (e.g., bin value or reference to a memory location), such as indicating the representative value for that value of the neural network. The representative values can be stored in a data structure such as a dictionary.
In some embodiments, each of the one or more representative values are generated by assigning each of the one or more values of the neural network to a cluster; and for each cluster, selecting a selected value from the cluster as the representative value for each of the one or more values of the neural network of the cluster.
For example, in some embodiments, GOBO is configured to represent the “G” weights with few representative values (e.g., FP32 values). For example, in some embodiments, this is performed by clustering the weights and assigning a representative value per cluster. The clusters can be of equal population. Intuitively, this objective can put more clusters where weights are densely populated affording higher resolution where there are more weights. GOBO can be configured to implement this by sorting the weights and dividing them to equally sized clusters. The first and the last weight in each cluster determine the boundaries of that cluster.
In some embodiments, a value is selected from each cluster. In some embodiments, the selected value is a centroid, the centroid being an average of the one or more values of the neural network in the cluster that the selected value is selected from. For example, the average of the weights inside each cluster can be used by GOBO as the centroid. Some embodiments of GOBO were tested experimentally, and it was found that quantizing BERT-Base with this approach into 8 clusters (3b indexes) degrades inference accuracy by 10% in GLUE tasks.
To reduce this error, in some embodiments, GOBO is configured to use an iterative approach.
In some embodiments, GOBO is configured to assign each of one or more values of the neural network to a cluster and, for each cluster, select a selected value from the cluster as the representative value for each of the one or more values of the neural network of the cluster. To improve selection of the representative values, in some embodiments, GOBO is configured to minimize a sum of each distance between a) each of the one or more values of the neural network and b) the centroid of the cluster that the value is assigned to by iteratively: performing the assigning, the assigning further including reassigning a first value of the one or more values of the neural network from an original cluster of the clusters to a new cluster of the clusters where an original distance of the first value to the centroid of the original cluster is greater than a new distance of the first value to the centroid of the new cluster; and subsequently performing the selecting on at least the original cluster and the new cluster; wherein the first value is a different value of the one or more values of the neural network upon each iteration.
For example, in some embodiments, GOBO is configured to repeatedly apply the following two steps: 1) GOBO is configured to move a weight from cluster A to cluster B if the L1 distance of the weight and the centroid of cluster A is greater than the L1 distance of the weight and the centroid of cluster B; and 2) after re-assigning the weights to clusters, GOBO is configured to update the centroids by computing the new average over the weights of each cluster. This can be equivalent to minimizing the L2 distance. GOBO is configured to repeat this iterative process until the sum of L1 distances between centroids and weights is minimized. For example, for a 3b quantization, experiments were conducted on some embodiments of GOBO and showed that 7 iterations were enough to converge to the optimal solution.
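The following sketch illustrates this iterative two-step process under stated assumptions (equal-population initialization over the sorted weights, reassignment by L1 distance to the centroids, centroids updated as cluster means, and a fixed iteration count used for simplicity instead of the L1-sum termination test):

```python
import numpy as np

def quantize_g_group(weights, bits=3, iterations=7):
    n_bins = 2 ** bits
    # Equal-population initialization: sort the weights, split them into equally
    # sized clusters, and take each cluster's mean as its initial centroid.
    order = np.argsort(weights)
    assignment = np.empty(len(weights), dtype=np.int64)
    assignment[order] = np.arange(len(weights)) * n_bins // len(weights)
    centroids = np.array([weights[assignment == b].mean() for b in range(n_bins)])
    for _ in range(iterations):
        # Step 1: move each weight to the cluster whose centroid is closest in L1.
        assignment = np.abs(weights[:, None] - centroids[None, :]).argmin(axis=1)
        # Step 2: recompute each centroid as the average of its cluster's weights.
        for b in range(n_bins):
            members = weights[assignment == b]
            if members.size:
                centroids[b] = members.mean()
    return assignment, centroids  # per-weight bin indexes and the representative values
```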
In some embodiments, experiments were performed and showed that terminating the process using the L1 is faster and results in a model with higher inference accuracy than using the L2.
Deep Compression uses dictionary compression for CNNs utilizing K-Means with linear initialization for cluster centroids and requiring fine-tuning to regain any accuracy loss. It minimizes the L2 Norm inside each cluster. In some embodiments, GOBO maintains the model's accuracy without any help from retraining. The centroid initialization in GOBO is nonlinear and depends on the per layer weight distribution according to some embodiments. In some embodiments, GOBO is configured to detect a few but effective outliers and keep them in their original representation. For example, these outliers can be stored as-is (e.g., without quantization) in memory or accessed or both during computation of the neural network or embedding table. In some embodiments, importantly, GOBO is configured to minimize L1-Norm within each cluster rather than L2. Algorithm 1 summarizes how GOBO compacts each layer according to some embodiments.
Accuracy and Compression Potential
Experiments were performed on some embodiments of GOBO to evaluate the effect of embodiments of GOBO on accuracy and compression rate. These experiments will now be described. The embodiments of GOBO evaluated implement the features described herein as used in these experiments. Any additional information that a practical encoding may use was ignored. As described herein, a specific encoding can be used to implement GOBO for off-chip memory compression. As will be shown, in some embodiments, the overhead is tiny as it includes the representative values and the coordinates of the outliers. For example, the overhead can be encoded in a memory configuration as described herein.
Methodology: The pre-trained models, the scripts and datasets provided by the Hugging Face repository, and the SciKit-Learn library were used. Specifically, to fit a Gaussian distribution, sklearn.GaussianMixture was used with one Gaussian component. Then, the log probability for each weight was computed using score_samples. GOBO was configured to consider the weights with log probabilities of −4 or less as outliers. To show the advantage of GOBO's clustering approach, the accuracy of each model was compared when the “G” weights are quantized by Linear Quantization and by K-Means. In linear quantization, the range of non-outlier weights (the “G” group) was linearly divided into the number of bins (2^bits). In K-Means, the same centroid initialization as GOBO was used and iterations were performed until the cluster assignments converged. The outlier weights in all of these methods were detected and represented in the same manner. All experiments were performed on a dual RTX 2080 Ti 11 GB GPU system with an AMD Ryzen Threadripper 1920X CPU with 128 GB of quad-channel DDR4-3200 RAM.
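A sketch of how the per-weight log-probabilities could be obtained with these libraries (variable names are illustrative and the random weights stand in for one layer; the exact experiment scripts are not reproduced here):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

weights = np.random.randn(1024).astype(np.float32)   # stand-in for one layer's weights
gm = GaussianMixture(n_components=1).fit(weights.reshape(-1, 1))
log_prob = gm.score_samples(weights.reshape(-1, 1))  # per-weight log probability
outlier_mask = log_prob <= -4.0                       # threshold used in the experiments
g_group = weights[~outlier_mask]                      # remaining weights form the "G" group
```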
Importance of Outliers:
BERT-Base: Some embodiments of GOBO were compared to BERT-specific quantization methods. In this section we focus on the MNLI task, which is the most challenging among the GLUE set of tasks. Table III shows accuracy on the MNLI task with different quantization methods: Intel's Q8BERT, Q-BERT, and GOBO with 3b or 4b “G” weights. Recall that Q-BERT and Q8BERT require fine-tuning. Using the Hugging Face PyTorch library, fine-tuning BERT-Base for the MNLI task on our dual GPU system took about 3 hours per epoch. We fine-tuned the model for 3 epochs as suggested by the Hugging Face documentation. After 3 epochs (9 hours) we achieved the baseline accuracy of 84.45%. Fine-tuning the same task on the same machine for Q8BERT takes about 34 hours per epoch. Based on the authors' suggestion, we trained the model for 3 epochs (102 hours). Quantizing the same model with GOBO takes about 10 minutes using a single CPU core of our system.
The results showed that GOBO in some embodiments can compact the model by 9.8× with less than 0.7% loss in accuracy, and by 7.92× with no accuracy loss. Q-BERT can compact the model by 6.5× at a 0.56% accuracy loss or by 7.81× with a 1.04% accuracy loss. Q8BERT, which uses 8b fixed-point values, reduces the model the least, by 4× at a 0.7% accuracy loss. In some embodiments, GOBO was shown to produce models that are smaller, with similar or no accuracy loss, and to produce them within minutes instead of days.
The centroid selection method implemented by some embodiments of GOBO for the “G” group provides improved computer functionality over two other methods: linear quantization and K-Means. Experiments were performed comparing GOBO configured to implement quantization as described herein, GOBO with linear quantization, and GOBO with K-Means quantization. Accuracy was evaluated as the number of centroids was changed (all experiments have the same set of outliers). Table IV shows that GOBO according to some embodiments, when using 3b weights (8 centroids), incurs an accuracy loss of 0.69%, which is considerably less than the 1.36% loss incurred with K-Means. Linear quantization performs the worst, incurring an error of nearly 52%. To maintain the baseline accuracy, GOBO, K-Means and Linear Quantization used 4b, 5b, and 6b weight indexes respectively. Using 4b weight indexes amounts to a 33% increase over using 3b. The selection method implemented by some embodiments of GOBO as described herein is also faster than K-Means.
Similar behavior was observed for the STS-B task as Table IV shows. STS-B is less sensitive to quantization. GOBO according to some embodiments incurs no accuracy loss with 3b, whereas K-Means needs 4b and linear quantization requires 5b.
BERT-Large: Experiments were performed to quantize BERT-Large on the SQuAD task. SQuAD is a complex task that requires days of fine-tuning when implemented over BERT-Large. GOBO was applied after this fine-tuning phase at a negligible cost in time. Table W reports the compression and accuracy of the model with different quantization policies for the “G” group. GOBO with K-Means and GOBO with the centroid selection policy described herein were evaluated, and the centroid selection policy implemented by some embodiments of GOBO as described herein proved best. With 3b weight indexes, the accuracy loss is less than 1% and with 4b there is none. For MNLI, GOBO incurred no accuracy loss even with 3b weight indexes.
DistilBERT: Table V shows accuracy when quantizing DistilBERT, which was distilled from BERT-Base and is about 2× smaller. GOBO in some embodiments incurred no accuracy loss with 4b weights and less than 1% loss with 3b weights, resulting in models that are 16× or 21× smaller than BERT-Base, respectively. In either case, K-Means required twice as many bins.
RoBERTa: Table VI shows accuracy and compression ratio. Quantizing to 3b “G” weights incurred an accuracy loss of 8%. Two FC layers (“Value layer” in self-attention and Intermediate layer) in the first 6 BERT Encoders are the most sensitive to quantization. Either the whole model can be quantized to 4b “G” weights (accuracy loss of just 0.6%), or better, 4b “G” weights can be used just for these two layers for the first 6 out of the 12 total Encoder layers and 3b for the rest. This reduces the accuracy loss to just 1.4%. Other configurations can be implemented.
RoBERTa-Large: RoBERTa-Large achieved a score of 90% on MNLI and, as Table VI shows, was less sensitive to quantization than RoBERTa. By quantizing the Value and Intermediate layers to 4b for the first 14 Encoders (out of 24) and to 3b for the rest, GOBO was shown to achieve less than 1% loss in accuracy in some embodiments. In some embodiments, GOBO is configured to encode each of the references for the values (e.g., corresponding to the “G” group) of the neural network using three bits, except each of the references for the values in a value layer or in an intermediate layer for an encoder (e.g., an encoder layer) of the neural network, each of the references for the values in the value layer or in the intermediate layer for the encoder layer of the neural network being encoded using four bits. For example, the value layer and intermediate layer in a RoBERTa architecture for the first few encoder layers may be the most sensitive to quantization and can be quantized to 4 bits, while the rest of the layers are quantized to 3 bits. Other numbers of bits can be used.
HAT: GOBO according to some embodiments was evaluated on HAT produced models. HAT is a neural architecture search (NAS) method that composes Transformer models for efficient language translation on various hardware platforms. HAT designs FP32 models; however, accuracy may only be achieved when the model is quantized using KMeans with linear initialization and for a maximum of 300 iterations (similar to Deep Compression). Table VII shows the BLEU score on the WMT′14 En-Fr, English to French translation task when it is quantized by KMeans and GOBO. GOBO quantization according to some embodiments achieved 0.4 higher BLEU score at the expense of less than 1% extra footprint for the outliers. Where embedding tables were also quantized with GOBO, the compression ratio rises to 7.8×. GOBO according to some embodiments was shown to be effective for an attention model (different than the BERT family) where it outperforms a state-of-the-art dictionary-based quantization method.
SpanBERT: Certain architectures introduced 16b floating-point arithmetic targeting deep learning training and inference. In general, BERT models are trained with FP32. However, SpanBERT is a BERT-derived model that has been successfully trained to work with FP16. Table VIII shows the evaluation of GOBO according to some embodiments on Facebook's SpanBERT on SQuADv1.1 and SQuADv2 datasets. GOBO with 3b matches the baseline accuracy on SQuADv2 and results in less than 1% error in SQuADv1.1. GOBO achieves a 5.31× compression ratio. This result shows that GOBO according to some embodiments remains effective even when using FP16 is possible.
Embedding Table Quantization: In some embodiments, GOBO can also be used to quantize the embedding tables. Table IX shows the size of embedding table before and after quantization. The outlier threshold for all of these experiments is set to −4.
GOBO—Memory Compression
A practical application of GOBO where it is used to compress weights in off-chip memory will now be described.
Long sequential off-chip accesses can be maintained so that bandwidth utilization is not sacrificed.
In some embodiments, GOBO is configured in memory to provide a method for memory storage, the method including configuring the memory according to a data structure, the data structure comprising: a header containing metadata; a quantized weights section, the quantized weights section storing a reference to a representative value for each parameter of a neural network, each of the references stored in a same order as a corresponding parameter in the neural network; and an outliers section storing one or more outlier parameters of the parameters of the neural network. In some embodiments, each reference for each of one or more outlier parameters of the parameters of the neural network is a dummy index, such as a sequence of 0s or other notation. The dummy index can be used by GOBO to designate use of the actual value of a value (e.g., a floating-point outlier value) without reference to a reference (e.g., index) or representative value.
For example, the third weight in SM0-B0 is an outlier, and a dummy “000” index is stored for it. This format maintains the relative position of weights and avoids the hardware support and run-time costs of data movement and re-alignment during decompression.
In some embodiments, each outlier parameter is stored as a relative block reference to a block within a submatrix of a matrix, the block containing a value of the outlier parameter, the matrix storing each of the parameters of the neural network; an offset within the block; and the value of the outlier parameter. In some embodiments, the Outlier section 530 stores an outlier count and one or more outlier parameters for each submatrix (e.g., SM) of a matrix (e.g., the weight matrix) storing each of the parameters (e.g., weights) of the neural network. The value of the outlier parameter can be a representation (e.g., an approximation) of a real number, for example. The value of the outlier parameter can be a 32-bit floating-point value or a 16-bit floating-point value, for example. Other examples include 15b or 7b values or others that use an encoding other than floating-point.
For example, the Outliers section 530 encodes the outliers in submatrix order. Each SM begins with an outlier count (8b supporting up to 256 outliers) followed by the outliers in block order. Each outlier is encoded with a triplet (B, W, V) where B (4b) is the relative block within the SM, W is the weight offset (4b) within the block, and V is the FP32 value. In our example, SM0 contains 2 outliers with the first replacing the third weight in block 0 (dashed arrow).
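A software sketch of this layout, with bit-packing omitted for clarity and with block and submatrix sizes chosen as assumptions (the 4b block and offset fields suggest up to 16 of each):

```python
import numpy as np

BLOCK = 16           # weights per block (assumed)
BLOCKS_PER_SM = 16   # blocks per submatrix (assumed)

def encode_layer(indexes, centroids, outlier_pos, outlier_val):
    header = {"centroids": np.asarray(centroids, dtype=np.float32)}
    quantized = np.asarray(indexes).copy()
    quantized[list(outlier_pos)] = 0          # dummy index keeps weight positions aligned
    sm_size = BLOCK * BLOCKS_PER_SM
    n_sm = (len(quantized) + sm_size - 1) // sm_size
    outliers = [[] for _ in range(n_sm)]      # per-SM lists of (block, offset, value)
    for pos, val in zip(outlier_pos, outlier_val):
        sm, rem = divmod(pos, sm_size)
        blk, off = divmod(rem, BLOCK)
        outliers[sm].append((blk, off, np.float32(val)))
    return header, quantized, outliers
```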
In some embodiments, to make quantization transparent to the processing elements, the decompression engine 540 is configured to generate a stream of FP32 weights. The decompression engine 540 can use two concurrent sequential access streams. The first reads the Header 510 and then the Quantized weights 520. The Header 510 is used to set the lookup tables (LUT) in the decompression engine 540. There is one LUT per weight that can be decompressed concurrently, a configuration parameter. Once the Header 510 is read, the first stream is configured to process the weight indexes placing them in a FIFO which feeds the LUTs that replace them with the appropriate centroids. Concurrently, the second stream is configured to read in the Outliers section 530 placing values in another FIFO. Using the information per outlier, the outliers selectively overwrite the LUT provided values. Since outliers are rare, processing at most one per cycle has been shown experimentally to be sufficient in some embodiments. Other data structures can be used.
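Continuing the previous sketch, the decompression engine's behaviour can be modelled in software as follows (again a simplification: real decompression streams packed 3b indexes through per-lane LUTs rather than indexing arrays):

```python
import numpy as np

def decode_layer(header, quantized, outliers, block=16, blocks_per_sm=16):
    # Look every index up in the centroid table (the hardware LUTs do this per lane).
    weights = header["centroids"][np.asarray(quantized)].astype(np.float32)
    sm_size = block * blocks_per_sm
    # The rare outliers selectively overwrite the LUT-provided values.
    for sm, triplets in enumerate(outliers):
        base = sm * sm_size
        for blk, off, val in triplets:
            weights[base + blk * block + off] = val
    return weights
```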
The described memory layout is suitable, for example, when the dataflow is pre-determined, making sequential accesses to all components possible. To allow the dataflow to change without changing the layout, minor changes to the Outlier memory layout can be used. In some embodiments, the method for memory storage further includes: configuring the memory according to a first additional data structure storing outlier counts; and configuring the memory according to a second additional data structure storing outlier parameters. For example, the outlier counts and the outliers can be stored separately in two linear arrays, C and O respectively. C contains the counts in cumulative form (entry Ci reports the number of outliers before the ith SM) so that they can serve as indexes into O. To access the outliers of the ith SM, an extra memory reference first reads Ci and uses it as an index to start fetching the outliers from O. The number of outliers contained in the SM is given by Ci+1−Ci.
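One way to build and query such arrays (a sketch; C is stored here with a leading zero so that C[i] is the number of outliers before the i-th SM and C[i+1] − C[i] is that SM's count):

```python
import numpy as np

def build_outlier_arrays(outliers_per_sm):
    counts = [len(triplets) for triplets in outliers_per_sm]
    C = np.concatenate(([0], np.cumsum(counts)))   # C[i]: outliers before the i-th SM
    O = [rec for triplets in outliers_per_sm for rec in triplets]
    return C, O

def outliers_of_sm(C, O, i):
    # One extra read of C gives the start index into O; the count is C[i+1] - C[i].
    return O[C[i]:C[i + 1]]
```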
GOBO—Compute Acceleration
In some embodiments, GOBO is configured to generate an output from performing one or more multiply-accumulate operations A1B1+ . . . +AnBn on input vectors A and input vectors B, wherein n is the n-th input vector and wherein one or more of input vectors B are each one of the representative values, by accumulating input vectors A to an accumulated sum of input vectors A per input vector B having the same representative value and subsequently multiplying each of the accumulated sums of input vectors A by the representative value of the input vector B. Additionally, there can be one or more input vectors B that are each one of the additional values (e.g., outlier values or outlier weights) such that each one of the additional values are computed (e.g., multiplied with Am for an input vector Bm) without using any representative value of the additional value as the additional value is computed in the operation as-is.
In some embodiments, GOBO is configured to configure a memory according to a data structure, the data structure storing one or more representative values for values in a neural network; retrieve one or more of the representative values; and perform a computation using the representative values.
A representative value can be a value of a neural network such as a floating point parameter, representation of a real number, weight, activation, or other value. In some embodiments, GOBO is configured to select the representative values based on a property related to the values of the neural network. For example, representative values can be selected as centroids, such as described herein. As another example, a representative value can be a value selected from the values of the neural network that can be used as a representation of one or more other values of the neural network or of a component of the neural network, such as a component that includes the representative value.
In some embodiments, GOBO is configured to execute a computation of the neural network, where the computation relates to a value of the neural network (e.g., during training or classification). GOBO is configured to execute that computation by retrieving one or more representative values from the memory to use as that value in the computation. A reference to the representative value, such as a reference (e.g., index) to the memory location where the representative value is stored, is stored in a data structure (e.g., at 520) and GOBO is configured to retrieve that reference, follow that reference (e.g., retrieve the value stored in the memory location at that reference), and use the representative value in a computation. For example, the representative value can be for an activation value of the neural network, where GOBO is configured to perform a computation of multiplying a weight value with an activation value and adding multiple weight value and activation value products by accumulating (e.g., summing) weight values to be multiplied with activation values having the same representative value (e.g., as indicated by the same reference stored in memory for that activation value) and, for each activation value having the same representative value, subsequently retrieving the representative value from memory and multiplying the representative value with the corresponding accumulation of weight values.
As an example, most of the computation may occur in FC layers, each of which is a weight matrix and activation vector multiplication. The use of a very small dictionary allows transformation of the computation from many "lookup to translate weight index to its FP32 centroid followed by a multiply-accumulate with an FP32 activation" operations (one per weight) into an FP32 activation accumulation per weight index followed by a few FP32 multiply-accumulate operations (one per centroid per output activation). That is, rather than first decoding each weight to its centroid and then multiplying it with the corresponding activation, the computation can be improved by accumulating, for each weight index, the corresponding activation into a per-centroid sum and, after doing so for all quantized weights, multiplying just these few sums with the centroids. For example, in a layer with 4 input activations Ai and one output activation OA, when weights are not quantized, this rudimentary FC layer performs the computation in Eq. 2. In some embodiments, GOBO is configured to perform computations in a neural network according to Eq. 5. For example, GOBO can be configured to perform Eq. 2 as Eq. 5. The computation can be performed in a layer of the neural network or other component of the neural network, such as other than in a fully connected layer.
In some embodiments, the weights are quantized to two centroids V1 and V2, with W1 and W4 mapping to V1, and W2 and W3 mapping to V2, each weight Wi being stored as a 1b index into the dictionary V. Whereas Eq. 2 computes OA=A1×W1+A2×W2+A3×W3+A4×W4, the same output can be computed as in Eq. 5: OA=V1×(A1+A4)+V2×(A2+A3); that is, the activations are first accumulated per weight index and only then multiplied with the centroids.
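The following sketch contrasts the direct (Eq. 2-style) computation with the rearranged (Eq. 5-style) form described above; the function names and the list-based representation are illustrative assumptions.

```python
def fc_output_direct(activations, weight_indexes, centroids):
    """Eq. 2-style: decode each weight index to its centroid, then multiply-accumulate."""
    return sum(a * centroids[idx] for a, idx in zip(activations, weight_indexes))

def fc_output_rearranged(activations, weight_indexes, centroids):
    """Eq. 5-style: accumulate activations per weight index first, then multiply
    each of the few per-index sums by its centroid."""
    sums = [0.0] * len(centroids)
    for a, idx in zip(activations, weight_indexes):
        sums[idx] += a                                   # one FP add per weight
    return sum(s * c for s, c in zip(sums, centroids))   # one MAC per centroid

# W1 and W4 map to V1 (index 0), W2 and W3 map to V2 (index 1):
# OA = V1*(A1 + A4) + V2*(A2 + A3)
oa = fc_output_rearranged([0.5, -1.0, 2.0, 0.25], [0, 1, 1, 0], [0.3, -0.7])
```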
Re-arranging or changing floating-point arithmetic operations can be a concern for hardware and software optimizations as it can affect the outcome. Fortunately, in some embodiments, GOBO's approach effectively improves calculation fidelity as it can use separate accumulators (e.g., 8 separate accumulators). Importantly, each accumulator corresponds to a single weight index. In some embodiments, multiplying with that weight is deferred until the end of the computation, after all relevant activations have been accumulated. In FP32 hardware, these multiplications occur for all activations and weight indexes before accumulation, increasing the chances that fidelity is lost due to weight magnitude differences. Experiments were performed on some embodiments of GOBO and demonstrate that GOBO's approach preserves accuracy.
Tile Architecture: In some embodiments, GOBO is configured to implement a system for computation of layers in a neural network, including one or more processing elements each configured to accumulate one or more values of the neural network for each of one or more references (e.g., “G” group weight indexes) to an identical representative value to generate an output for each accumulation; and a shared processing unit configured to, for each of the outputs from each processing element, multiply the output with the identical representative value respective to the output to generate a final output. The shared processing unit is further configured to accumulate one or more of the final outputs with one or more outlier values, according to some embodiments. For example, multiple final outputs can be added to one or more outlier values according to a computation in a neural network. The computation can be configured to perform any one of Eq. 2 to 5. The representative value can be a centroid. The one or more values of the neural network can be a value such as a parameter (e.g., a weight or an activation) of a neural network or can be a value such as an embedding (e.g., in an embedding table).
For example, in some embodiments, GOBO is configured to include GOBO Tiles 600 with Processing Elements (PEs) 610 and is configured to include a Shared Processing Unit (SPU) 620.
Phase 1: Per-index Activation Accumulation: At the unit's front there are two circular 16-entry activation buffers. At any point, one buffer, the staging buffer, loads one activation at a time from the global buffer 630, while the PEs use the other, the current buffer. It takes 16 cycles for the staging buffer to fill, during which the current buffer rotates its entries so that each activation is seen by each PE 610. For the purposes of the following description of an example embodiment, outlier processing is not discussed. Every cycle, the quantized weight buffer 640 provides 16 weight indexes, one per PE 610. The PEs 610 use these indexes to add the current activation to the corresponding register file entry.
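A behavioral sketch of Phase 1 under the description above, with one 8-entry register file per PE and one 3b weight index per PE per activation; the dense indexing of weight indexes by PE and buffer position is an assumption made for illustration, abstracting away the rotation timing.

```python
def phase1_accumulate(current_buffer, weight_indexes, rf_entries=8):
    """current_buffer: the 16 activations in the current buffer.
    weight_indexes[pe][j]: the 3b index of the weight that PE `pe` pairs with
    activation j as the buffer rotates. Each PE adds the activation into the
    register-file entry selected by that index."""
    num_pes = len(weight_indexes)
    reg_files = [[0.0] * rf_entries for _ in range(num_pes)]
    for pe in range(num_pes):
        for j, a in enumerate(current_buffer):        # over 16 cycles of rotation
            reg_files[pe][weight_indexes[pe][j]] += a
    return reg_files
```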
Phase 2: Centroid Processing: This phase proceeds in centroid/PE order. The SPU 620 multiplies each PE's per-index accumulated sums with the corresponding centroids and accumulates the products to produce the output for that PE.
Outlier Processing: Outliers are processed during phase 1 by the SPU 620. The SPU 620 reads the outliers from the global buffer 630 into a FIFO 650. The outliers can be stored using the (B, W, V) triplet format of the Outliers section 530 described above.
According to some embodiments, this organization can support 4b weight indexes so that models such as RoBERTa, which requires 4b dictionaries for some of its layers, can be executed using some embodiments of GOBO. To do so, GOBO is configured to pair adjacent tiles so that the 8-entry register files in the first tile store the accumulation of input activations corresponding to the first 8 weight indexes, and the second tile stores the rest. The output register file entries are configured to be added in pairs to generate the final output. This can be done over 16 cycles and can reuse the adder from one of the tiles.
Memory Organization: In some embodiments, a GOBO accelerator is configured to have several tiles. A banked Global Buffer supplies the tiles with the data needed. The most demanding data needs can be for the weight indexes. In some embodiments, each tile processes at peak one block of 16 weight indexes per cycle which requires 16×3b=48b per cycle. Each tile also reads a single FP32 activation per cycle at peak. However, the same activation can be shared among all tiles. Finally, each tile is configured to read in outliers which may use at most 40b each plus an 8b count per submatrix of 256 weights. However, the outliers can be infrequent, and the bandwidth used can be low. Accordingly, in some embodiments, GOBO is configured to partition the Global Buffer across the tiles with at least three banks per tile: one for activations, one wider for weights, and one for the outliers plus centroids. Note that there can be one set of centroids per layer that can be replicated across tiles.
Dataflow: Various dataflows can be used. Experiments have been performed that show that the output stationary dataflow may not be the best. In some embodiments, instead of dedicating each PE to process one output activation at a time, GOBO is configured to time multiplex over multiple subsets of activations (from multiple input words) producing partial sums. For example, GOBO is configured to block weights in groups of columns, and, where beneficial, rows, and process the corresponding activation columns producing partial sums. GOBO is configured to perform phase 1 and phase 2 processing to produce these partial sums. In some embodiments, this can provide an advantage of weight reuse across words; the same weights with the corresponding activations can be used for many words. In some embodiments, once done with all words for one group, GOBO is configured to proceed with the next group of columns (and rows) of weights.
Evaluation of GOBO
As will now be described, further experiments were conducted to evaluate the performance and energy-efficiency benefits 1) when GOBO is used to compress off-chip data, according to some embodiments, and 2) for an accelerator that uses GOBO compression and the GOBO processing architecture, according to some embodiments.
Various models were used. The models were quantized with GOBO to 3b except for 12 layers of RoBERTa and 28 layers of RoBERTa-Large that were quantized to 4b as discussed herein. Cycle times were measured using a custom cycle-accurate simulator which uses DRAMsim2 to model off-chip transfers for a DDR4-3200 dual-channel DRAM main memory. The simulator was tested extensively with microbenchmarks. The simulator computes values faithfully in time and compares them against the expected outputs for correctness. Area and power for the on-chip memories are modeled by CACTI. Memories were divided into banks to meet the target access time and bandwidth. Post-layout energy and area estimates were used: all designs were implemented in Verilog and synthesized using Synopsys' Design Compiler with a 65 nm TSMC technology library and a 1 GHz target frequency. Layouts were generated using Cadence Innovus. For power estimation, Intel ModelSim was used to generate signal activity as input to Innovus.
Memory Compression
GOBO memory compression was incorporated according to some embodiments over three popular deep learning hardware architectures. TPU and Eyeriss were configured as described herein. To execute these models, however, they were implemented with FP32 and FP16 MAC units from a vendor-provided library. The MAC units were optimized for energy efficiency. The modeled Tensor Cores accelerator was configured with 128 Tensor Core-like units and with a 2 MB on-chip buffer. Each Tensor Core unit performed an A4×4×B4×4 matrix multiply producing a C4×4 matrix per cycle. It was configured to perform 64 MAC operations per cycle. In some embodiments, the dataflow of each architecture is configured to improve performance and energy efficiency.
TPU: Performance for the TPU improved by nearly 10×. The TPU keeps all activations on-chip and reads weights from off-chip memory. It is severely memory bound on the weight side, thus benefiting fully from the traffic reduction possible with GOBO according to some embodiments. Speedups are slightly lower for DistilBERT and the RoBERTa models. The size of the FC layers in DistilBERT leads to different utilization of the TPU's systolic array, and as a result, the potential for improvement from memory compression is different than it is for the BERT models. The RoBERTa models have layers that are quantized to 4b instead of 3b.
Eyeriss: Speedup on Eyeriss using GOBO according to some embodiments is nearly 7× on average. Eyeriss optimizes the dataflow to keep MACs busy and has far fewer MACs than server-class accelerators such as the TPU. Accordingly, the potential for speedup is lower.
Tensor Cores: This architecture is also memory bound with 16% of all memory transactions being for activations and partial sums, while weights account for the remaining 84%. Given that GOBO reduces weight footprint by nearly 10× in some embodiments, performance improved by nearly 4×.
Execution Time Breakdown:
Energy: DRAM transactions can be expensive in terms of time and energy.
FP16: Some architectures add support for 16b floating-point as these can be sufficient for training and inference of many neural networks.
GOBO Hardware Accelerator
An iso-compute-area comparison of the GOBO accelerator (according to some embodiments) with Tensor Cores and IBM DLP-like architectures was performed. For Tensor Cores, both FP32 and FP16 variants were evaluated, and for the IBM DLP, FP16 hardware was modelled.
The FP32 comparison with Tensor Cores will now be described. Tensor Cores and GOBO FP32 designs were given the same 2 MB on-chip memory. Table X summarizes the configurations. The FP32 Tensor Cores' tile was 6.2× larger than the GOBO tile while having 4× more multipliers and accumulators. GOBO was configured to replace each multiplier with an accumulator and an 8-entry register file, and to have only a single MAC unit that is shared across multiple PEs. Table XI shows the GOBO tile area breakdown. An FP32 multiplier was 4.1× larger than an 8-entry register file. As a result, for the same compute area needed by 128 Tensor Cores, 768 GOBO tiles can be used.
Performance:
Energy: Relative energy trends are as shown in the corresponding graph.
FP16: The FP32 configurations were scaled to use FP16 instead. Table X reports the configurations when both TC and GOBO use FP16 units and a 1 MB on-chip buffer according to some embodiments. Given the non-linear scaling of FP multipliers, the ratio of TC to GOBO tiles decreased from 6.2× to 5×. The same number of GOBO tiles were kept and the TC tiles were increased so that both used the same compute area.
IBM DLP: IBM's Deep Learning Processor is an FP16 systolic array with a 2 MB on-chip buffer. Experiments were performed on embodiments of GOBO demonstrating that a) IBM DLP benefits from GOBO memory compression (IBM DLP+), and b) under iso-compute-area constraints the GOBO accelerator is higher performing and more energy efficient, according to some embodiments. DLP's systolic array was configured with 512 MACs to match the 1K FLOPS/cycle originally reported. GOBO's tiles were 1.2× smaller than DLP's, allowing 20% more tiles to fit. GOBO was configured with the same 2 MB on-chip capacity.
In some embodiments, GOBO is configured to provide a post-training quantization method for attention-based NLP models. In some embodiments, GOBO significantly reduces model size (e.g., parameters and embedding tables) with little or no effect on accuracy. GOBO can be used in at least two practical applications. In the first, which is plug-in compatible with other accelerators, GOBO is used to compress data from off-chip boosting energy efficiency and off-chip capacity. In the second, an accelerator is configured to use GOBO quantization throughout. It is unique at least in that it never expands the weights into their original values directly, according to some embodiments.
Quantizing both Weights and Activations
According to some embodiments, for relatively short inputs, the activations do not represent a significant portion of the memory traffic and thus quantizing them yields little benefit. However, as the input size used per network invocation increases, quantizing the activations becomes beneficial. Further, most of the activation layers resemble bell-shaped distributions such as those exhibited by the weight layers, where the bulk of the values follow the bell-shaped distribution and a few values are outliers. As a further optimization, GOBO is modified to quantize both activations and weights in some embodiments. In this modification, a modified method is used to quantize a generic layer, and then the quantization dictionary of that layer is scaled and shifted to match other layers. For generating the aforementioned generic layer, a random normal distribution is generated with a mean of zero and standard deviation of one. This distribution is then quantized as described herein in some embodiments of GOBO and the resulting quantization dictionary is deemed the Golden Dictionary (GD). The GD can be stored by the processor in the method and can be used and/or accessed during computation. The quantization dictionary for each layer of a model is then computed by scaling and shifting the golden dictionary, GD×s+m, where s is the standard deviation of that specific layer and m is the mean of the values in that layer. For weights, m and s are known in advance. For activations, profiling over a few example inputs is used to determine the mean and standard deviation of each layer. According to some embodiments, this method reduces the amount of storage and traffic needed by both weights and activations, where all but a few outliers are stored as indexes to the respectively scaled golden dictionary.
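A sketch of Golden Dictionary generation and per-layer scaling under the stated approach; a simple k-means stands in for the quantization step, and the function names, sample count, and use of scikit-learn are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def golden_dictionary(num_centroids=8, num_samples=100_000, seed=0):
    """Quantize a standard normal distribution (mean 0, std 1) and keep the
    sorted centroids as the Golden Dictionary (GD)."""
    samples = np.random.default_rng(seed).standard_normal((num_samples, 1))
    km = KMeans(n_clusters=num_centroids, n_init=10).fit(samples)
    return np.sort(km.cluster_centers_.ravel())

def layer_dictionary(gd, layer_values):
    """Per-layer dictionary: GD*s + m, with s and m the layer's std and mean."""
    return gd * np.std(layer_values) + np.mean(layer_values)
```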
In some embodiments, this method enables reducing computation cost. Specifically, in some embodiments, this method quantizes both weights and activations to a few unique values. For instance, with a 4-bit quantization, each weight and activation is mapped to one of 16 unique values. Therefore, multiplying a weight and an activation produces a value that could be one of 16×16=256 weight centroid×activation centroid products. This limited set of possible product values represents an opportunity to replace the native FP32 multiply-accumulate operations, with counting per possible product, specifically with calculating an integer histogram, followed by a final multiplication of the integer counts with their respective FP32 centroid products. However, having 256 counters per processing element is expensive in terms of area and energy.
To reduce the number of counters, an exponential approximation for the dictionary of the original bell-shaped distribution can be used according to some embodiments. This approximation is found to have a minimal effect on the accuracy since the fitted exponential function closely follows the golden dictionary. As explained below, this approximation reduces the aforementioned 256 products to only 15 values which are symmetric around zero, enabling the conversion of the bulk of FP32 MAC operations to 3-bit integer additions to generate a histogram of product occurrences.
According to some embodiments, the golden dictionary is symmetrical around zero, thus curve fitting only one half of the dictionary can be sufficient for generating both halves. The other half can be directly derived as it has the same values but with the opposite sign. According to some embodiments, the golden dictionary stores only magnitudes of the values (i.e., does not store their sign). To perform the curve fitting, MATLAB's curve-fitting toolbox was used to fit an a^int+b curve to the golden dictionary, where a and b are the parameters being fitted and int is the list of integers from the dictionary half being approximated. Since there are more values near zero and the density of the values decreases moving away from zero in either direction, a weighting scheme is applied in the curve fitting to put more weight where the values are densely populated. Specifically, the outer bin is given a weight of one, and moving closer to zero, the weight of each bin is multiplied by 2. In a preferred 4-bit quantization method the first bit is the sign of the value (0: positive, 1: negative) and the following 3 bits are the index. For instance, 1011 maps to −(a^3+b) (not to be confused with a^(−3)+b). Note that a and b are fitted to the golden dictionary that comes from a random distribution and are independent of models and layers.
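A sketch of this curve fit, using scipy's curve_fit as a stand-in for MATLAB's curve-fitting toolbox; the input is assumed to be the magnitude half of the golden dictionary ordered by increasing index, and the names are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_exponential(gd_half):
    """Fit a**int + b to one half of the golden dictionary, weighting bins
    more heavily near zero (outer bin weight 1, doubling toward zero)."""
    ints = np.arange(len(gd_half), dtype=float)
    weights = 2.0 ** (len(gd_half) - 1 - ints)        # e.g. 128, 64, ..., 2, 1 for 8 bins
    popt, _ = curve_fit(lambda x, a, b: a ** x + b, ints, np.asarray(gd_half),
                        p0=(1.5, 0.0), sigma=1.0 / weights)   # sigma is an inverse weight
    a, b = popt
    return a, b
```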
Representing the bulk of the matrix multiplication operands as integer exponents can simplify and accelerate computation as follows. Without loss of generality, here it is assumed that the target is multiplying weights (W) and activations (A). However, the exact same method applies to other computations such as those entailing activations on both sides. Weights are represented as θW×(a^intW+b) and activations as θA×(a^intA+b), where θW and θA are the signs (±1) and intW and intA are the 3-bit indexes. Each output activation OA=Σ W×A therefore expands as OA = Σ θWθA×a^(intW+intA) + b×Σ θWθA×a^intA + b×Σ θWθA×a^intW + b²×Σ θWθA.
The preceding equation shows how the calculation of the output activation is expanded into 4 terms: (1) In the first term (SoI), the challenging part is to compute the summation of the a^(intA+intW) terms. In the 4-bit quantization example, where there is a 3-bit index and a sign bit, intA+intW would be one out of 15 unique values in the [0+0,7+7]=[0,14] range. This computation is performed by first adding intA and intW and then counting how many times each of the 15 possible exponents occurs. After processing the two input arrays, the method has effectively counted how many of a^0, a^1, . . . , a^14 there are overall. After the total counts, hitherto referred to as occurrences, are calculated, 15 MAC operations (occurrences_0×a^0+ . . . +occurrences_14×a^14) are used to compute the SoI term. (2) The second term (SoA) is the sum of the activations of the layer. This term is decomposed into SoA1 and SoA2. SoA1 is dependent on the sign of both weight and activation and is computed during SoI computation. SoA2 is only dependent on activations and is computed while the output of the previous layer is quantized. (3) The third term, SoW, is also decomposed into two terms SoW1 and SoW2. SoW1 is computed similarly to SoA1 and SoW2 is known after the model is profiled, hence it is added as a bias term for each output activation. (4) The fourth and final term, PoM, is decomposed into 4 terms. PoM1 is computed during SoI computation by accumulating the product of the weights' and activations' signs. PoM2 is computed during SoA2 computation (quantizing the output of the previous layer). PoM3 and PoM4 are known when the model is profiled and are added as a bias term.
In some embodiments, computation per output neuron proceeds as follows: (1) for each weight and activation product, the indexes (3-bit values) of the weight and the activation are added. (2) The sign bits of the weight and the activation are XORed. (3) Based on the XOR result, the entry of the occurrences table corresponding to the sum of the weight and activation indexes is incremented (XOR result is zero) or decremented (XOR result is one). In an implementation, the occurrences table has 15 entries. (4) After the occurrences table has been updated for all activation and weight pairs, each occurrences table entry is multiplied with its corresponding value (a^0 . . . a^14) and the products across all entries are accumulated into a single value psum. (5) Finally, the values SoA, SoW and PoM are added to psum.
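A sketch of these per-output-neuron steps for non-outlier pairs, with the fitted base a and the SoA, SoW and PoM contributions supplied as precomputed terms; all names are illustrative assumptions.

```python
def output_neuron(w_signs, w_ints, a_signs, a_ints, a, SoA, SoW, PoM):
    """Steps (1)-(5) above: add the 3-bit indexes, XOR the sign bits, update the
    15-entry occurrences table, resolve the counts with 15 MACs, then add the
    SoA, SoW and PoM terms."""
    occurrences = [0] * 15                              # exponents 0..14 = [0+0, 7+7]
    for ws, wi, asn, ai in zip(w_signs, w_ints, a_signs, a_ints):
        delta = -1 if (ws ^ asn) else 1                 # sign-bit XOR: 0 -> +1, 1 -> -1
        occurrences[wi + ai] += delta                   # steps (1)-(3)
    psum = sum(count * (a ** exp) for exp, count in enumerate(occurrences))  # step (4)
    return psum + SoA + SoW + PoM                       # step (5)
```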
With the optimized quantization scheme there are still outliers that need to be represented. Outliers are a small subset of weights and activations but cover a wide range of values. To quantize outliers, the golden dictionary is expanded with integer exponents to cover the full range of values needed. In an example case of a 4-bit quantization for G values where GD=a^int+b, int was an integer in the range of 0 to 7. To support outliers, the GD is expanded such that int is a positive integer greater than 7. In practice, quantizing to an outlier dictionary OT of 16 or 32 values was sufficient for maintaining accuracy. More than 16 or more than 32 values can also be used.
During a profiling process, each layer's mean, standard deviation and dictionary are determined. These are stored along with the model. These parameters are converted to 16-bit integers. The table below shows the effect of the proposed method on various models and tasks. The baseline is FP32. In the "weight quant" measurements, only weights are quantized using the layer scaling method. In the "weight+activation quant" measurements, both weights and activations are quantized. The dictionary in both cases contains 16-bit fixed-point numbers.
The optimized quantization method can be used to reduce the model's footprint and optionally to simplify the computation.
Memory Compression Implementation
Without loss of generality, an implementation of a memory compressor/decompressor is described for activations. The same method is to be used for weights, with the only difference being that there is no need to compress weight values during runtime. Since weight values are known statically, they can be compressed and stored in memory as described and only need to be decompressed at runtime.
To reduce the activation's memory footprint, the aforementioned method is used to quantize the activations. The quantized values are stored in a suitable memory container. To quantize each activation the method determines the dictionary index corresponding to the activation value. This index can point to the Gaussian dictionary or the outlier dictionary. To find this index the original activation value is compared with the centroids of the dictionaries that were collected during the profiling stage. The centroid closest to the value determines the dictionary (Gaussian or outlier) and the index.
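A sketch of this nearest-centroid lookup over the profiled Gaussian (G) and outlier (OT) dictionaries; the function name and return convention are illustrative assumptions.

```python
def quantize_activation(value, g_centroids, ot_centroids):
    """Return (dictionary, index) for the centroid closest to `value`, choosing
    between the Gaussian dictionary and the outlier dictionary."""
    candidates = [("G", i, abs(value - c)) for i, c in enumerate(g_centroids)]
    candidates += [("OT", i, abs(value - c)) for i, c in enumerate(ot_centroids)]
    name, index, _ = min(candidates, key=lambda t: t[2])
    return name, index
```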
The memory layout with the modified quantization method is similar to the container used for the weight-only quantization method described above.
Compute Acceleration Implementation:
The modified quantization method can be used to improve execution time and energy efficiency by deploying specialized compute units. For this purpose, according to some embodiments, most values are processed as quantized indexes to the G or the OT dictionaries. Specifically, this method exploits the fact that integer compute units are faster and more energy efficient compared with the floating-point units. To leverage the advantage of integer computation, in the preferred implementation the bulk of the computation is converted from the floating-point to a fixed-point, integer domain representing the indexes to the dictionaries.
In the first path, the G stage, where both weight and activation are non-outliers, the three-bit indexes for the weight and the activation are added, producing the 4b Adr signal. The sign bits (S) are XORed, and based on the XOR result, the counter corresponding to the Adr signal (out of 15 possible in the Counters table) is incremented or decremented as described previously. Eventually, after all necessary activation and weight pairs have been processed, each counter value is multiplied with its corresponding a^x factor. There are 15 possible factors, which are stored in an internal table. Alternatively, these factors can be provided externally one by one as needed to multiply them with the Counters table entries. The resulting 15 products are added together into the ACC_G accumulator.
The second path serves the rare cases where at least one of the operands is an outlier. In this case, the AND gate accepting as input the OT field from both values disables the first path and activates instead this second path. For these cases there are more combinations for counting compared to the G stage. The 4b indexes for the weight and/or the activations are first converted into 16b values via the corresponding lookup tables G-LUT (non-outlier) and OT-LUT (outlier), the two values are multiplied, and the product is added to the ACC accumulator. If an operand is among the G values, it indexes into the 8-entry G-LUT (4-bit = 1 sign + 3 index), and if the operand is an outlier, it indexes into the OT-LUT, a 16- or 32-entry table.
After the above two processes complete, each produces a partial sum. Hence, there will be two partial sums: ACC_G for pairs of G values, and ACC for pairs where at least one of the inputs is an outlier. ACC_G and ACC are multiplied by their respective scaling factors, the products are added, and finally the SoA, SoW and PoM terms are added to produce the final output value.
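A behavioral sketch of the two paths and their final combination, assuming signed values are stored directly in the OT-LUT and that the scaling factors and the SoA, SoW and PoM terms are supplied precomputed; all names are assumptions made for illustration.

```python
def dual_path_output(pairs, a, g_lut, ot_lut, scale_g, scale_acc, SoA, SoW, PoM):
    """pairs: ((w_is_outlier, w_sign, w_idx), (a_is_outlier, a_sign, a_idx)) per
    weight/activation pair. Non-outlier pairs update the 15-entry counters table
    (G stage); pairs with at least one outlier go through the LUTs into ACC."""
    counters = [0] * 15
    acc = 0.0
    for (w_ot, w_s, w_i), (a_ot, a_s, a_i) in pairs:
        if not (w_ot or a_ot):                               # G stage
            counters[w_i + a_i] += -1 if (w_s ^ a_s) else 1
        else:                                                # outlier path
            w_val = ot_lut[w_i] if w_ot else (-1 if w_s else 1) * g_lut[w_i]
            a_val = ot_lut[a_i] if a_ot else (-1 if a_s else 1) * g_lut[a_i]
            acc += w_val * a_val
    acc_g = sum(c * (a ** exp) for exp, c in enumerate(counters))
    return acc_g * scale_g + acc * scale_acc + SoA + SoW + PoM
```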
The output activations are quantized using the profiled "Gaussian" and "outlier" dictionaries before storing them in memory. These dictionaries contain the centroids needed to quantize output activations. The figure below demonstrates an embodiment of the quantization unit for the output activations. Each output activation is compared with every centroid. Since the dictionary values are sorted, the method can determine the two closest centroids to the output activation by detecting where the comparators' output changes from zero to one. The comparator that generated the last zero and the comparator that generated the first one correspond to the two closest centroids to the activation. A leading-one detector and an encoder generate the index of the dictionary entry that generated the first one. To find the index of the other candidate, the method can subtract one from this index. To decide which index corresponds to the nearest centroid, the method can look up the centroids at both indexes and subtract them from the activation. The smaller difference determines the quantized form of the output activation.
Table XIII below shows the effectiveness of weight and activation quantization by replacing 97% of expensive FP32 MAC operations with more efficient 3-bit additions (both operands non-outliers) and the remaining 3% with lookups and 16-bit fixed-point MAC operations (at least one operand is an outlier). Of the operations that fall on the outlier path, about 1% have both operands as outliers and about 99% have only one outlier.
Various embodiments of the present invention have been described in detail. The present invention may be embodied in other specific forms without departing from the nature, spirit, or scope of the invention. The discussed embodiments are to be considered illustrative and not restrictive; the scope of the invention is to be indicated by the appended claims rather than the described details of the embodiments. Therefore, all changes which come within the meaning and range of equivalency of the claims are intended to be embraced therein.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/CA2021/051312 | 9/21/2021 | WO |

Number | Date | Country
---|---|---
63082009 | Sep 2020 | US

Relation | Number | Date | Country
---|---|---|---
Parent | 17130690 | Dec 2020 | US
Child | 18026927 | | US