The following relates generally to deep learning networks and more specifically to a system and method for accelerating training of deep learning networks.
The pervasive applications of deep learning and the end of Dennard scaling have been driving efforts to accelerate deep learning inference and training. These efforts span the full system stack, from algorithms to middleware and hardware architectures. Training is a task that includes inference as a subtask, and it is a compute- and memory-intensive task often requiring weeks of compute time.
In an aspect, there is provided a method for accelerating multiply-accumulate (MAC) floating-point units during training or inference of deep learning networks, the method comprising: receiving a first input data stream A and a second input data stream B; adding exponents of the first data stream A and the second data stream B in pairs to produce product exponents; determining a maximum exponent using a comparator; determining a number of bits by which each significand in the second data stream has to be shifted prior to accumulation by adding product exponent deltas to the corresponding term in the first data stream and using an adder tree to reduce the operands in the second data stream into a single partial sum; adding the partial sum to a corresponding aligned value using the maximum exponent to determine accumulated values; and outputting the accumulated values.
In a particular case of the method, determining the number of bits by which each significand in the second data stream has to be shifted prior to accumulation includes skipping ineffectual terms mapped outside a defined accumulator width.
In another case of the method, each significand comprises a signed power of 2.
In yet another case of the method, adding the exponents and determining the maximum exponent are shared among a plurality of MAC floating-point units.
In yet another case of the method, the exponents are set to a fixed value.
In yet another case of the method, the method further comprising storing floating-point values in groups, and wherein the exponent deltas are encoded as a difference from a base exponent.
In yet another case of the method, the base exponent is a first exponent in the group.
In yet another case of the method, using the comparator comprises comparing the maximum exponent to a threshold of an accumulator bit-width.
In yet another case of the method, the threshold is set to ensure model convergence.
In yet another case of the method, the threshold is set to within 0.5% of training accuracy.
In another aspect, there is provided a system for accelerating multiply-accumulate (MAC) floating-point units during training or inference of deep learning networks, the system comprising one or more processors in communication with data memory to execute: an input module to receive a first input data stream A and a second input data stream B; an exponent module to add exponents of the first data stream A and the second data stream B in pairs to produce product exponents, and to determine a maximum exponent using a comparator; a reduction module to determine a number of bits by which each significand in the second data stream has to be shifted prior to accumulation by adding product exponent deltas to the corresponding term in the first data stream and use an adder tree to reduce the operands in the second data stream into a single partial sum; and an accumulation module to add the partial sum to a corresponding aligned value using the maximum exponent to determine accumulated values, and to output the accumulated values.
In a particular case of the system, determining the number of bits by which each significand in the second data stream has to be shifted prior to accumulation includes skipping ineffectual terms mapped outside a defined accumulator width.
In another case of the system, each significand comprises a signed power of 2.
In yet another case of the system, the exponent module, the reduction module, and the accumulation module are located on a processing unit and wherein adding the exponents and determining the maximum exponent are shared among a plurality of processing units.
In yet another case of the system, the plurality of processing units are configured in a tile arrangement.
In yet another case of the system, processing units in the same column share the same output from the exponent module and processing units in the same row share the same output from the input module.
In yet another case of the system, the exponents are set to a fixed value.
In yet another case of the system, the system further comprising storing floating-point values in groups, and wherein the exponent deltas are encoded as a difference from a base exponent, and wherein the base exponent is a first exponent in the group.
In yet another case of the system, using the comparator comprises comparing the maximum exponent to a threshold of an accumulator bit-width, where the threshold is set to ensure model convergence.
In yet another case of the system, the threshold is set to within 0.5% of training accuracy.
These and other aspects are contemplated and described herein. It will be appreciated that the foregoing summary sets out representative aspects of embodiments to assist skilled readers in understanding the following detailed description.
A greater understanding of the embodiments will be had with reference to the Figures.
Embodiments will now be described with reference to the figures. For simplicity and clarity of illustration, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements. In addition, numerous specific details are set forth in order to provide a thorough understanding of the embodiments described herein. However, it will be understood by those of ordinary skill in the art that the embodiments described herein may be practiced without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to obscure the embodiments described herein. Also, the description is not to be considered as limiting the scope of the embodiments described herein.
Any module, unit, component, server, computer, terminal or device exemplified herein that executes instructions may include or otherwise have access to computer readable media such as storage media, computer storage media, or data storage devices (removable and/or non-removable) such as, for example, magnetic disks, optical disks, or tape. Computer storage media may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by an application, module, or both. Any such computer storage media may be part of the device or accessible or connectable thereto. Any application or module herein described may be implemented using computer readable/executable instructions that may be stored or otherwise held by such computer readable media.
During training of some deep learning networks, a set of annotated inputs, for which the desired outputs are known, is processed by repeatedly performing a forward and backward pass. The forward pass performs inference whose output is initially inaccurate. However, given that the desired outputs are known, the training can calculate a loss, a metric of how far the outputs are from the desired ones. During the backward pass, this loss is used to adjust the network's parameters, slowly converging the network to its best possible accuracy.
Numerous approaches have been developed to accelerate training, and fortunately they can often be used in combination. Distributed training partitions the training workload across several computing nodes, taking advantage of data, model, or pipeline parallelism. Overlapping communication and computation can further reduce training time. Dataflow optimizations that facilitate data blocking and maximize data reuse reduce the cost of on- and off-chip accesses within a node by favoring the lower-cost components of the memory hierarchy. Another family of methods reduces the footprint of the intermediate data needed during training. For example, in the simplest form of training, all neuron values produced during the forward pass are kept to be used during backpropagation. Batching and keeping only one or a few samples instead reduces this cost. Lossless and lossy compression methods further reduce the footprint of such data. Finally, selective backpropagation methods alter the backward pass by propagating loss only for some of the neurons, thus reducing work.
On the other hand, the need to boost energy efficiency during inference has led to techniques that increase computation and memory needs during training. This includes works that perform network pruning and quantization during training. Pruning zeroes out weights and thus creates an opportunity for reducing work and model size during inference. Quantization produces models that use shorter datatypes, such as 16b, 8b, or 4b fixed-point values, which are more energy efficient to compute with. Parameter Efficient Training and Memorized Sparse Backpropagation are examples of pruning methods. PACT and outlier-aware quantization are training-time quantization methods. Network architecture search techniques also increase training time as they adjust the model's architecture.
Despite the above, the need to further accelerate training both at the data center and at the edge remains unabated. Operating and maintenance costs, latency, throughput, and node count are major considerations for data centers. At the edge, energy and latency are major considerations, and training may be primarily used to refine or augment already-trained models. Regardless of the target application, improving node performance would be highly advantageous. Accordingly, the present embodiments could complement existing training acceleration methods. In general, the bulk of the computation and data transfers during training is for performing multiply-accumulate (MAC) operations during the forward and backward passes. As mentioned above, compression methods can greatly reduce the cost of data transfers. Embodiments of the present disclosure target the processing elements for these operations and exploit ineffectual work that occurs naturally during training and whose frequency is amplified by quantization, pruning, and selective backpropagation.
Some accelerators rely on the fact that zeros occur naturally in the activations of many models, especially those that use ReLU. There are several accelerators that target pruned models. Another class of designs benefits from reduced value ranges, whether these occur naturally or result from quantization; this includes bit-serial designs and designs that support many different datatypes, such as BitFusion. Finally, another class of designs targets bit-sparsity: by decomposing multiplication into a series of shift-and-add operations, they expose ineffectual work at the bit level.
While the above accelerate inference, training presents substantially different challenges. First is the datatype. While models during inference work with fixed-point values of relatively limited range, the values training operates upon tend to be spread over a large range. Accordingly, training implementations use floating-point arithmetic, with single-precision IEEE floating-point arithmetic (FP32) being sufficient for virtually all models. Other datatypes that facilitate the use of more energy- and area-efficient multiply-accumulate units compared to FP32 have been successfully used in training many models. These include bfloat16 and 8b or smaller floating-point formats. Moreover, since floating-point arithmetic is a lot more expensive than integer arithmetic, mixed-datatype training methods use floating-point arithmetic only sparingly. Despite these proposals, FP32 remains the standard fall-back format, especially for training on large and challenging datasets. As a result of its limited range and the lack of an exponent, the fixed-point representation used during inference gives rise to zero values (too small a value to be represented), zero bit prefixes (small values that can be represented), and bit sparsity (most values tend to be small and few are large), which the aforementioned inference accelerators rely upon. FP32 can represent much smaller values, its mantissa is normalized, and whether bit sparsity exists has not generally been demonstrated.
Additionally, a challenge is the computation structure. Inference operates on two tensors, the weights and the activations, performing per layer a matrix/matrix or matrix/vector multiplication or pairwise vector operations to produce the activations for the next layer in a feed-forward fashion. Training includes this computation as its forward pass, which is followed by the backward pass that involves a third tensor, the gradients. Most importantly, the backward pass uses the activation and weight tensors in a different way than the forward pass, making it difficult to pack them efficiently in memory, and more so to remove zeros as done by inference accelerators that target sparsity. Also related to computation structure are value mutability and value content. Whereas in inference the weights are static, they are not so during training. Furthermore, training initializes the network with random values which it then slowly adjusts. Accordingly, one cannot necessarily expect the values processed during training to exhibit behaviors such as sparsity or bit-sparsity; more so for the gradients, which are values that do not appear at all during inference.
The present inventors have demonstrated that a large fraction of the work performed during training can be viewed as ineffectual. To expose this ineffectual work, each multiplication was decomposed into a series of single bit multiply-accumulate operations. This reveals two sources of ineffectual work: First, more than 60% of the computations are ineffectual since one of the inputs is zero. Second, the combination of the high dynamic range (exponent) and the limited precision (mantissa) often yields values which are non-zero, yet too small to affect the accumulated result, even when using extended precision (e.g., trying to accumulate 2^−64 into 2^64).
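As a minimal illustration of the second source of ineffectual work, the following Python snippet (offered only as an illustration, not as the accelerator's arithmetic) shows that accumulating a product near 2^−64 into a running sum near 2^64 leaves the sum unchanged, because no practical significand can hold both magnitudes at once:

```python
# Accumulating a tiny product into a large running sum is ineffectual:
# the significand cannot represent both magnitudes simultaneously ("swamping").
small = 2.0 ** -64
large = 2.0 ** 64

assert large + small == large  # the addition changes nothing, even in 64-bit floating point
```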
The above observation led the present inventors to consider whether it is possible to use bit-skipping (bit-serial processing where zero bits are skipped over) to exploit these two behaviors. For inference, Bit-Pragmatic is a data-parallel processing element that performs such bit-skipping on one operand side, whereas Laconic does so for both sides. Since these methods target inference only, they work with fixed-point values. Since there is little bit-sparsity in the weights during training, converting a fixed-point design to floating-point is a non-trivial task. Simply converting Bit-Pragmatic into floating point resulted in an area-expensive unit which performs poorly under iso-compute area constraints. Specifically, compared to an optimized bfloat16 processing element that performs 8 MAC operations, under iso-compute constraints, an optimized accelerator configuration using the bfloat16 Bit-Pragmatic PEs is on average 1.72× slower and 1.96× less energy efficient. In the worst case, the bfloat16 Bit-Pragmatic PE was 2.86× slower and 3.2× less energy efficient. The bfloat16 Bit-Pragmatic PE is 2.5× smaller than the bit-parallel PE, and while one can use more such PEs for the same area, one cannot fit enough of them to boost performance via parallelism as required by all bit-serial and bit-skipping designs.
The present embodiments (informally referred to as FPRaker) provide a processing tile for training accelerators which exploits both bit-sparsity and out-of-bounds computations. FPRaker, in some cases, comprises several adder-tree-based processing elements organized in a grid so that it can exploit data reuse both spatially and temporally. The processing elements multiply multiple value pairs concurrently and accumulate their products into an output accumulator. They process one of the input operands per multiplication as a series of signed powers of two, hereafter referred to as terms. The conversion of that operand into powers of two can be performed on the fly; all operands are stored in floating-point form in memory. The processing elements take advantage of ineffectual work that stems either from mantissa bits that are zero or from out-of-bounds multiplications given the current accumulator value. The tile is designed for area efficiency. In some cases for the tile, the processing elements limit the range of powers of two that can be processed simultaneously, greatly reducing the cost of their shift-and-add components. Additionally, in some cases for the tile, a common exponent processing unit is used that is time-multiplexed among multiple processing elements. Additionally, in some cases for the tile, power-of-two encoders are shared along the rows. Additionally, in some cases for the tile, per-processing-element buffers reduce the effects of work imbalance across the processing elements. Additionally, in some cases for the tile, the PE implements a low-cost mechanism for eliminating out-of-range intermediate values.
Additionally, in some cases, the present embodiments can advantageously provide at least some of the characteristics described herein.
The present embodiments also advantageously provide a low-overhead memory encoding for floating-point values that relies on the value distribution that is typical of deep learning training. The present inventors have observed that consecutive values across channels tend to be similar and thus have similar exponents. Accordingly, the exponents can be encoded as deltas for groups of such values. These encodings can be used when storing and reading values off chip, thus further reducing the cost of memory transfers.
Through example experiments, the present inventors determined a number of experimental observations, discussed below.
The present inventors measured the work reduction that was theoretically possible with two related approaches, discussed below.
Example experiments were performed to examine the performance of the present embodiments on different applications. TABLE 1 lists the models studied in the example experiments. ResNet18-Q is a variant of ResNet18 trained using PACT, which quantizes both activations and weights down to four bits (4b) during training. ResNet50-S2 is a variant of ResNet50 trained using dynamic sparse reparameterization, which targets sparse learning that maintains high weight sparsity throughout the training process while achieving accuracy levels comparable to baseline training. SNLI performs natural language inference and comprises fully-connected, LSTM-encoder, ReLU, and dropout layers. Image2Text is an encoder-decoder model for image-to-markup generation. Three models for different tasks were examined from the MLPerf training benchmark: 1) Detectron2, an object detection model based on Mask R-CNN; 2) NCF, a model for collaborative filtering; and 3) BERT, a transformer-based model using attention. For measurement, one randomly selected batch per epoch was sampled over as many epochs as necessary to train the network to its originally reported accuracy (up to 90 epochs were enough for all).
Generally, the bulk of computational work during training is due to three major operations per layer:
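The three equations referenced below are not reproduced in the text. Purely as a hedged sketch, the following LaTeX block gives the standard forms such per-layer operations typically take for a convolutional layer; the notation, with ∗ denoting convolution and W′ denoting the weights reordered for the backward pass, is an assumption and is not taken from the original equations:

```latex
% Sketch of the three per-layer training operations (standard forms only;
% index conventions and the W' reordering are assumptions).
\begin{align}
Z &= I \ast W \tag{1}\\
\frac{\partial E}{\partial I} &= \frac{\partial E}{\partial Z} \ast W' \tag{2}\\
\frac{\partial E}{\partial W} &= I \ast \frac{\partial E}{\partial Z} \tag{3}
\end{align}
```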
For convolutional layers, Equation (1), above, describes the convolution of activations (I) and weights (W) that produces the output activations (Z) during forward propagation. There the output Z passes through an activation function before being used as input for the next layer. Equation (2) and Equation (3), above, describe the calculation of the activation (∂E/∂I) and weight (∂E/∂W) gradients, respectively, in the backward propagation. Only the activation gradients are back-propagated across layers. The weight gradients update the layer's weights once per batch. For fully-connected layers the equations describe several matrix-vector operations, and for other operations they describe vector or matrix-vector operations. For clarity, in this disclosure, gradients are referred to as G. The term term-sparsity is used herein to signify that, for these measurements, the mantissa is first encoded into signed powers of two using canonical encoding, a variation of Booth encoding; this matches how the bit-skipping processing operates on the mantissa.
In an example, activations in image classification networks exhibit sparsity exceeding 35% in all cases. This is expected since these networks generally use the ReLU activation function, which clips negative values to zero. However, weight sparsity is typically low, and only some of the classification models exhibit sparsity in their gradients. For the remaining models, such as those for natural language processing, value sparsity may be very low for all three tensors. Regardless, since models do generally exhibit some sparsity, the present inventors investigated whether such sparsity could be exploited during training. This is a non-trivial task as training is different from inference and exhibits dynamic sparsity patterns on all tensors and a different computation structure during the backward pass. It was found that, generally, all three tensors exhibit high term-sparsity for all models regardless of the target application. Given that term-sparsity is more prevalent than value sparsity, and exists in all models, the present embodiments exploit such sparsity during training to enhance the efficiency of training the models.
An ideal potential speedup due to reduction in the multiplication work can be achieved through skipping the zero terms in the serial input. The potential speedup over the baseline can be determined as:
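The equation itself is not reproduced above. Purely as a hedged sketch consistent with the surrounding description (a baseline that processes every mantissa term versus a unit that skips the zero terms of the serialized operand), one plausible form is:

```latex
% Assumed form of the ideal speedup from skipping zero terms; not reproduced
% from the original equation.
\begin{equation*}
\text{Speedup}_{\mathrm{ideal}} \;=\;
\frac{\displaystyle\sum_{\text{MAC operations}} \bigl(\text{terms in the serialized mantissa}\bigr)}
     {\displaystyle\sum_{\text{MAC operations}} \bigl(\text{nonzero terms in the serialized mantissa}\bigr)}
\end{equation*}
```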
The present embodiments take advantage of bit sparsity in one of the operands used in the three operations performed during training (Equations (1) through (3) above) all of which are composed of many MAC operations. Decomposing MAC operations into a series of shift-and-add operations can expose ineffectual work, providing the opportunity to save energy and time.
To expose ineffectual work during MAC operations, the operations can be decomposed into a series of "shift and add" operations. For multiplication, let A = 2^Ae × Am and B = 2^Be × Bm be two values in floating point, both represented as an exponent (Ae and Be) and a significand (Am and Bm), which is normalized and includes the implied "1.". Conventional floating-point units perform this multiplication in a single step (sign bits are XORed):
A × B = 2^(Ae+Be) × (Am × Bm)
By decomposing Am into a series of signed powers of two Am^p, where Am = Σ_p Am^p and Am^p = ±2^(i_p), the multiplication can be performed as follows:

A × B = (Σ_p Am^p × Bm) << (Ae + Be) = Σ_p ±(Bm << (i_p + Ae + Be)) (6)
For example, if Am = 1.0000001b, Ae = 10b, Bm = 1.1010011b, and Be = 11b, then A × B can be performed as two shift-and-add operations: Bm << (0 + Ae + Be) and Bm << (−7 + Ae + Be). A conventional multiplier would process all bits of Am despite performing ineffectual work for the six bits that are zero.
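As a hedged sketch of this decomposition (plain Python rather than the hardware datapath; the helper names, and the use of plain nonzero bits instead of the canonical signed encoding, are illustrative assumptions), the example above can be emulated on integer significands scaled by 2^7:

```python
def to_terms(mantissa):
    """Positions of the nonzero bits of a significand.  The hardware described above uses a
    canonical (Booth-like) signed power-of-two encoding, which yields the same or fewer terms;
    plain nonzero bits are enough to illustrate the decomposition."""
    return [i for i in range(mantissa.bit_length()) if (mantissa >> i) & 1]

def term_serial_multiply(a_mant, a_exp, b_mant, b_exp):
    """Multiply A = a_mant * 2^a_exp and B = b_mant * 2^b_exp as a series of shift-and-adds."""
    product = 0
    for i in to_terms(a_mant):
        product += b_mant << i                 # one shift-and-add per nonzero term of Am
    return product, a_exp + b_exp              # significand product and its exponent

# The example from the text: Am = 1.0000001b and Bm = 1.1010011b, scaled to integers by 2^7.
a_mant, b_mant = 0b10000001, 0b11010011
prod, exp = term_serial_multiply(a_mant, 2, b_mant, 3)
assert prod == a_mant * b_mant                 # only two shift-and-adds were needed
```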
However, the above decomposition exposes further ineffectual work that conventional units perform as a result of the high dynamic range of values that floating point seeks to represent. Informally, some of the work done during the multiplication will result in values that will be out-of-bounds given the accumulator value. To understand why this is the case, consider not only the multiplication but also the accumulation. Assume that the product A × B will be accumulated into a running sum S whose exponent Se is much larger than Ae + Be. It will not be possible to represent the sum S + A × B given the limited precision of the mantissa. In other cases, some of the "shift-and-add" operations would be guaranteed to fall outside the mantissa even when considering the increased mantissa length used to perform rounding, i.e., partial swamping.
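The following minimal Python sketch shows how such an out-of-bounds term can be detected before any work is spent on it; the function name and the choice of a 12-bit accumulator width threshold (taken from the emax − 12 range described later) are illustrative assumptions:

```python
ACC_WIDTH = 12  # assumed accumulator significand width (bits retained below the maximum exponent)

def term_is_ineffectual(term_pos, a_exp, b_exp, acc_exp):
    """True when shifting Bm by term_pos + a_exp + b_exp lands entirely below the
    accumulator's least-significant retained bit, so the term cannot affect the sum."""
    return term_pos + a_exp + b_exp < acc_exp - ACC_WIDTH

# A 2^-64-scale product term cannot affect an accumulator holding a value near 2^64.
assert term_is_ineffectual(term_pos=0, a_exp=-32, b_exp=-32, acc_exp=64)
```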
Referring now to the figures, a system 100 for accelerating multiply-accumulate (MAC) floating-point units during training or inference of deep learning networks is shown, in accordance with an embodiment.
In an embodiment, the system 100 includes one or more modules and one or more processing elements (PEs) 122. In some cases, the PEs can be combined into tiles. In an embodiment, the system 100 includes an input module 120, a compression module 130, and a transposer module 132. Each processing element 122 includes a number of modules, including an exponent module 124, a reduction module 126, and an accumulation module 128. In some cases, some of the above modules can be run at least partially on dedicated or separate hardware, while in other cases, at least some of the functions of some of the modules are executed on the processing unit 102.
The input module 120 receives two input data streams to have MAC operations performed on them, respectively A data and B data.
The PE 122 performs the multiplication of 8 Bfloat16 (A,B) value pairs, concurrently accumulating the result into the accumulation module 128. The Bfloat16 format consists of a sign bit, followed by a biased 8b exponent, and a normalized 7b significand (mantissa).
The PE 122 accepts 8 8-bit A exponents Ae0, . . . , Ae7, their corresponding 8 3-bit significand terms t0, . . . , t7 (after canonical encoding) and sign bits As0, . . . , As7, along with 8 8-bit B exponents Be0, . . . , Be7, their significands Bm0, . . . , Bm7 (as-is), and their sign bits Bs0, . . . , Bs7, as shown in the figures.
The exponent module 124 adds the A and B exponents in pairs to produce the exponents ABei for the corresponding products. A comparator tree takes these product exponents and the exponent of the accumulator and calculates the maximum exponent emax. The maximum exponent is used to align all products so that they can be summed correctly. To determine the proper alignment per product, the exponent module 124 subtracts each product exponent from emax, calculating the alignment offsets δei. The maximum exponent is also used to discard terms that will fall out-of-bounds when accumulated: the PE 122 will skip any terms that fall outside the emax − 12 range. The minimum number of cycles for processing the 8 MACs is one cycle, regardless of the values. In case one of the resulting products has an exponent larger than the current accumulator exponent, the accumulation module 128 will be shifted accordingly prior to accumulation (acc shift signal). An example of the exponent module 124 is illustrated in the first block of the figures.
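A hedged Python sketch of this exponent stage follows; the function and variable names (exponent_stage, ab_e, delta_e, acc_shift) are illustrative and are not the hardware signal names:

```python
def exponent_stage(a_exps, b_exps, acc_exp):
    """Compute product exponents, the maximum exponent, and per-product alignment offsets."""
    ab_e = [ae + be for ae, be in zip(a_exps, b_exps)]     # product exponents ABei
    e_max = max(ab_e + [acc_exp])                          # comparator tree, incl. accumulator
    delta_e = [e_max - e for e in ab_e]                    # alignment offsets (deltas) per product
    acc_shift = e_max - acc_exp                            # accumulator realignment amount
    return e_max, delta_e, acc_shift

e_max, delta_e, acc_shift = exponent_stage(
    a_exps=[2, 5, 1, 0, 3, 4, 2, 1], b_exps=[3, 1, 2, 4, 0, 2, 1, 3], acc_exp=7)
assert e_max == 7 and acc_shift == 0 and delta_e[0] == 2
```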
Since multiplication with a term amounts to shifting, the reduction module 126 determines the number of bits by which each B significand will have to be shifted prior to accumulation. These are the 4-bit terms K0, . . . , K7. To calculate Ki, the reduction module 126 adds the product exponent deltas (δei) to the corresponding A term ti. To skip out-of-bound terms, the reduction module 126 places a comparator before each K term which compares it to a threshold of the available accumulator bit-width. The threshold can be set to ensure models converge within 0.5% of the FP32 training accuracy on the ImageNet dataset. However, the threshold can also be adjusted, effectively implementing a dynamic bit-width accumulator, which can boost performance by increasing the number of skipped "out-of-bounds" bits. The A sign bits are XORed with their corresponding B sign bits to determine the signs of the products Ps0, . . . , Ps7. The B significands are complemented according to their corresponding product signs, and then shifted using the offsets K0, . . . , K7. The reduction module 126 uses a shifter per B significand to implement the multiplication; in contrast, a conventional floating-point unit would require shifters at the output of the multiplier. Thus, the reduction module 126 effectively eliminates the cost of the multipliers. In some cases, bits that are shifted out of the accumulator range from each B operand can be rounded using the round-to-nearest-even (RNE) approach. An adder tree reduces the 8 B operands into a single partial sum. An example of the reduction module 126 is illustrated in the second block of the figures.
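A hedged sketch of the reduction stage is given below; the names are illustrative, the 12-bit threshold is taken from the emax − 12 range above, and where the hardware keeps guard bits and applies RNE rounding, this sketch simply truncates:

```python
ACC_WIDTH = 12  # accumulator-width threshold from the text; other names are illustrative

def reduction_stage(a_terms, delta_e, b_mants, prod_negative):
    """One step of the reduction stage: K_i = term_i + delta_i selects the shift for lane i;
    lanes whose K exceeds the accumulator width are skipped as ineffectual."""
    partial = 0
    for t, d, bm, neg in zip(a_terms, delta_e, b_mants, prod_negative):
        k = t + d                               # shift amount K_i
        if k > ACC_WIDTH:                       # out-of-bounds term: contributes nothing
            continue
        addend = bm >> k                        # multiplication realized as a shift
        partial += -addend if neg else addend   # adder tree reduces the lanes
    return partial

partial = reduction_stage(a_terms=[0, 2, 7, 13], delta_e=[0, 1, 0, 2],
                          b_mants=[0b11010011] * 4, prod_negative=[False, False, True, False])
```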
For the accumulation module 128, the resulting partial sum from the reduction module 126 is added to the correctly aligned value of the accumulator register. In each accumulation step, the accumulator register is normalized and rounded using the round-to-nearest-even (RNE) scheme. The normalization block updates the accumulator exponent. When the accumulator value is read out, it is converted to bfloat16 by extracting only 7b for the significand. An example of the accumulation module 128 is illustrated in the third block of the figures.
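A hedged sketch of the accumulation step follows; the function name is illustrative, and normalization and RNE rounding, which the hardware performs every step, are deliberately omitted:

```python
def accumulation_stage(acc_mant, acc_exp, partial, e_max):
    """Add the adder-tree partial sum (already aligned to e_max) to the accumulator."""
    acc_aligned = acc_mant >> (e_max - acc_exp)   # e_max >= acc_exp by construction
    new_mant = acc_aligned + partial
    new_exp = e_max
    # In hardware the result is renormalized and rounded (RNE) each step and is read
    # out as bfloat16 by keeping a 7-bit significand; both are omitted in this sketch.
    return new_mant, new_exp
```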
In the worst case, two offsets may differ by up to 12, since the accumulation module in the example above skips any terms that fall outside the emax − 12 range.
In some cases, processing a group of A values will require multiple cycles since some of them will be converted into multiple terms. During that time, the inputs to the exponent module will not change. To further reduce area, the system 100 can take advantage of this expected behavior and share the exponent block across multiple PEs 122. The decision of how many PEs 122 share the exponent module 124 can be based on the expected bit-sparsity: the lower the bit-sparsity, the higher the processing time per PE 122 and the less often it will need a new set of exponents, and hence the more PEs 122 that can share the exponent module 124. Since some models are highly sparse, sharing one exponent module 124 per two PEs 122 may be best in such situations.
By utilizing per PE 122 buffers, it is possible to exploit data reuse temporally. To exploit data reuse spatially, the system 100 can arrange several PEs 122 into a tile.
The present inventors studied the spatial correlation of values during training and found that consecutive values across the channels tend to be similar. This is true for the activations, the weights, and the output gradients. Similar values in floating point have similar exponents, a property which the system 100 can exploit through a base-delta compression scheme. In some cases, values can be blocked channel-wise into groups of 32 values each, where the exponent of the first value in the group is the base and the delta exponent for the rest of the values in the group is computed relative to it, as illustrated in the figures.
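A hedged Python sketch of this base-delta exponent encoding follows; the group size of 32 and the use of the first exponent as the base come from the text, while the helper names and the handling of the delta field width are illustrative assumptions:

```python
GROUP = 32  # group size from the text; the delta field-width handling is an assumption

def encode_exponents(exps):
    """exps: the 8-bit biased exponents of one channel-wise group of GROUP values."""
    base = exps[0]                                        # first exponent stored in full
    deltas = [e - base for e in exps[1:]]                 # remaining exponents as deltas
    width = max((abs(d).bit_length() + 1 for d in deltas), default=1)  # bits per delta (incl. sign)
    return base, width, deltas

def decode_exponents(base, width, deltas):
    return [base] + [base + d for d in deltas]

exps = [130, 131, 129, 130] * 8                           # similar exponents across channels
base, width, deltas = encode_exponents(exps)
assert decode_exponents(base, width, deltas) == exps
assert width <= 2                                         # 2-bit deltas instead of 8-bit exponents
```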
The present inventors have determined that skipping out-of-bounds terms can be inexpensive. The processing element 122 can use a comparator per lane to check if its current K term lies within a threshold set by the accumulator precision. The comparators can be optimized by a synthesis tool for comparison against a constant. The processing element 122 can feed this signal back to the corresponding term encoder, indicating that any subsequent term coming from the same input pair is guaranteed to be ineffectual (out-of-bound) given the current e_acc value. Hence, the system 100 can boost its performance and energy-efficiency by skipping the processing of the subsequent out-of-bound terms. The feedback signals indicating out-of-bound terms of a certain lane across the PEs of the same tile column can be synchronized together.
Generally, data transfers account for a significant portion of, and often dominate, energy consumption in deep learning. Accordingly, it is useful to consider what the memory hierarchy needs to do to keep the execution units busy. A challenge with training is that, while it processes three arrays I, W, and G, the order in which the elements are grouped differs across the three major computations (Equations 1 through 3, above). However, it is possible to rearrange the arrays as they are read from off-chip. For this purpose, the system 100 can store the arrays in memory using a container of "square" 32×32 bfloat16 values. This is a size that generally matches the typical row sizes of DDR4 memories and allows the system 100 to achieve high bandwidth when reading values from off-chip. A container includes values from coordinates (c,r,k) (channel, row, column) to (c+31,r,k+31), where c and k are divisible by 32 (padding is used as necessary). Containers are stored in channel, column, row order. When read from off-chip memory, the container values can be stored in the exact same order in the multi-banked on-chip buffers. The tiles can then access data directly, reading 8 bfloat16 values per access. The weights and the activation gradients may need to be processed in different orders depending on the operation performed; generally, the respective arrays must be accessed in transpose order during one of the operations. For this purpose, the system 100 can include the transposer module 132 on-chip. The transposer module 132, in an example, reads in 8 blocks of 8 bfloat16 values from the on-chip memories. Each of these 8 reads is 8 values wide, and the blocks are written as rows into a buffer internal to the transposer. Collectively these blocks form an 8×8 block of values. The transposer module 132 can then read out 8 blocks of 8 values each and send those to the PE 122. Each of these blocks is read out as a column from its internal buffer. This effectively transposes the 8×8 value group.
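As a hedged sketch of the container tiling and the 8×8 transposer (NumPy used purely for illustration; the function names are assumptions, and the serialization of containers into channel, column, row order is omitted):

```python
import numpy as np

def to_containers(tensor):
    """Tile a (channels, rows, columns) tensor into 32x32 (channel x column)
    containers, one set per row, padding with zeros as necessary."""
    c, r, k = tensor.shape
    cp, kp = -(-c // 32) * 32, -(-k // 32) * 32          # round up to multiples of 32
    padded = np.zeros((cp, r, kp), dtype=tensor.dtype)
    padded[:c, :, :k] = tensor
    # container [ci, row, ki] holds values (32*ci .. 32*ci+31, row, 32*ki .. 32*ki+31)
    return padded.reshape(cp // 32, 32, r, kp // 32, 32).transpose(0, 2, 3, 1, 4)

def transpose_8x8(block):
    """The transposer: 8 blocks of 8 values are read in as rows and read out as columns."""
    return block.T.copy()

x = np.arange(64 * 3 * 40, dtype=np.float32).reshape(64, 3, 40)
assert to_containers(x).shape == (2, 3, 2, 32, 32)
blk = np.arange(64).reshape(8, 8)
assert np.array_equal(transpose_8x8(blk)[0], blk[:, 0])
```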
The present inventors conducted example experiments to evaluate the advantages of the system 100 in comparison to an equivalent baseline architecture that uses conventional floating-point units.
A custom cycle-accurate simulator was developed to model the execution time of the system 100 (informally referred to as FPRaker) and of the baseline architecture. Besides modeling timing behavior, the simulator also modeled value transfers and computation faithfully in time and checked the produced values for correctness against golden values. The simulator was validated with microbenchmarking. For area and power analysis, both the system and the baseline designs were implemented in Verilog and synthesized using Synopsys' Design Compiler with a 65 nm TSMC technology and a commercial library for the given technology. Cadence Innovus was used for layout generation. Intel's PSG ModelSim was used to generate data-driven activity factors, which were fed to Innovus to estimate power. The baseline MAC unit was optimized for area, energy, and latency. Generally, it is not possible to optimize for all three; however, in the case of MAC units, it is possible. An efficient bit-parallel fused MAC unit was used as the baseline PE. The constituent multipliers were both area- and latency-efficient, and were taken from the DesignWare IP library developed by Synopsys. Further, the baseline unit was optimized for deep learning training by reducing the precision of its I/O operands to bfloat16 and accumulating in reduced precision with chunk-based accumulation. The area and energy consumption of the on-chip SRAM Global Buffer (GB), divided into activation, weight, and gradient memories, were modeled using CACTI. The Global Buffer has an odd number of banks to reduce bank conflicts for layers with a stride greater than one. The configurations for both the system 100 (FPRaker) and the baseline are shown in TABLE 2.
To evaluate the system 100, traces for one random mini-batch were collected during the forward and backward pass in each epoch of training. All models were trained long enough to attain the maximum top-1 accuracy as reported. To collect the traces, each model was trained on an NVIDIA RTX 2080 Ti GPU and all of the inputs and outputs for each layer were stored using PyTorch forward and backward hooks. For BERT, BERT-base and the fine-tuning training for a GLUE task were traced. The simulator used the traces to model execution time and to collect activity statistics so that energy could be modeled.
Since embodiments of the system 100 process one of the inputs term-serially, the system 100 uses parallelism to extract more performance. In one approach, an iso-compute-area constraint can be used to determine how many tiles of PEs 122 can fit in the same area as a baseline tile.
The conventional PE that was compared against processes 8 pairs of bfloat16 values concurrently and accumulates their sum. Buffers can be included for the inputs (A and B) and the outputs so that data reuse can be exploited temporally. Multiple PEs 122 can be arranged in a grid, sharing buffers and inputs across rows and columns to also exploit reuse spatially. Both the system 100 and the baseline were configured to have scaled-up, GPU Tensor-Core-like tiles that perform 8×8 vector-matrix multiplication, where 64 PEs 122 are organized in an 8×8 grid and each PE performs 8 MAC operations in parallel.
Post layout, and taking into account only the compute area, a tile of an embodiment of the system 100 occupies 0.22× the area of the baseline tile. TABLE 3 reports the corresponding area and power per tile. Accordingly, to perform an iso-compute-area comparison, the baseline accelerator is configured with 8 tiles and the system 100 with 36 tiles. The area for the on-chip SRAM global buffer is 344 mm², 93.6 mm², and 334 mm² for the activations, weights, and gradients, respectively.
SNLI, NCF, and BERT are dominated by fully-connected layers. While in fully-connected layers there is no weight reuse among different output activations, training can take advantage of batching to maximize weight reuse across multiple inputs (e.g., words) of the same input sentence, which results in higher utilization of the tile PEs. Speedups follow bit sparsity. For example, the system 100 achieves a speedup of 1.8× over the baseline for SNLI due to its high bit sparsity.
At block 302, the input module 120 receives two input data streams to have MAC operations performed on them, respectively A data and B data.
At block 304, the exponent module 124 adds exponents of the A data and the B data in pairs to produce product exponents and determines a maximum exponent using a comparator.
At block 306, the reduction module 126 determines a number of bits by which each B significand has to be shifted prior to accumulation by adding product exponent deltas to the corresponding term in the A data and uses an adder tree to reduce the B operands into a single partial sum.
At block 308, the accumulation module 128 adds the partial sum to a corresponding aligned value using the maximum exponent to determine accumulated values.
At block 310, the accumulation module 128 outputs the accumulated values.
To study the effect of training with FPRaker on accuracy, the example experiments emulated the bit-serial processing of the PE 122 during end-to-end training in PlaidML, which is a machine learning framework with an OpenCL compiler backend. PlaidML was forced to use the mad() function for every multiply-add during training, and the mad() function was overridden with the implementation of the present disclosure to emulate the processing of the PE. ResNet18 was trained on the CIFAR-10 and CIFAR-100 datasets. For reference, the top-1 validation accuracy for training natively in PlaidML with FP32 precision was recorded. The baseline performs bit-parallel MAC with I/O operand precision in bfloat16, which is known to converge and is supported in the art.
Conventionally, training uses bfloat16 for all computations. In some cases, mixed-datatype arithmetic can be used, where some of the computations use fixed-point instead. In other cases, floating-point can be used where the number of bits used by the mantissa varies per operation and per layer. In some cases, the suggested mantissa precisions can be used while training AlexNet and ResNet18 on ImageNet.
Advantageously, the system 100 can perform multiple multiply-accumulate floating-point operations that all contribute to a single final value. The processing element 122 can be used as a building block for accelerators for training neural networks. The system 100 takes advantage of the relatively high term-level sparsity that all values exhibit during training. While the present embodiments describe using the system 100 for training, it is understood that it can also be used for inference. The system 100 may be particularly advantageous for models that use floating-point; for example, models that process language or recommendation systems.
Advantageously, the system 100 allows for efficient precision training. Different precisions can be assigned to each layer during training depending on the layer's sensitivity to quantization. Further, training can start with lower precision and increase the precision per epoch near convergence. The system 100 allows for dynamic adaptation of different precisions and can boost performance and energy efficiency.
The system 100 can be used to also perform fixed-point arithmetic. As such, it can be used to implement training where some of the operations are performed using floating-point and some using fixed-point. To perform fixed-point arithmetic: (1) the exponents are set to a known fixed value, typically the equivalent of zero, and (2) an external overwrite signal indicates that the significands do not contain an implicit leading bit that is 1. Further, since the operations performed during training can be a superset of the operations performed during inference, the system 100 can be used for inference.
Although the invention has been described with reference to certain specific embodiments, various modifications thereof will be apparent to those skilled in the art without departing from the spirit and scope of the invention as outlined in the claims appended hereto.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/CA2021/050994 | 7/19/2021 | WO |

Number | Date | Country
---|---|---
63054502 | Jul 2020 | US