The disclosure generally relates to reduction trees.
In a binary neural network (BNN), the weights and activations are binary values, which significantly reduces the hardware required to compute products. However, reducing vectors of products to scalar values can become a dominant factor in hardware cost.
Reduction trees can be employed to reduce vectors of products to scalar values. A reduction tree generally has multiple levels of reduction operations (e.g., accumulation circuitry), and in a BNN, the reductions are population counters. Each population counter counts a number of products having a bit value of 1. The next level of the reduction tree accumulates the counted values, and additional levels can accumulate totals from lower levels.
As products of weights and activations can be generated using XNOR gates in a BNN, or as wires and inverters in the case of a fully unrolled BNN, the population counters can be the dominant factor in hardware cost. For example, a fully connected layer in a BNN can require the reduction of hundreds of bits per input channel, which, scaled by hundreds of output channels, can cost tens of thousands of look-up tables (LUTs) to implement; a convolutional layer can require hundreds of thousands of LUTs. In addition, because the population count of n bits produces a result having ⌈log2 n⌉+1 bits, the width of the reduction operations grows with each level of the tree, logarithmically in the total number of accumulated products.
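For a rough sense of these costs, the following Python sketch (with hypothetical sizes n and m not taken from any particular layer) models a two-level lossless reduction of one-bit products and reports the operand widths a hardware implementation would need at each level.

```python
import math
import random

def popcount(bits):
    """First-level reduction: count how many one-bit products equal 1."""
    return sum(bits)

# Hypothetical sizes: m first-level counters, each reducing n one-bit products.
n, m = 256, 64
products = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]

level1 = [popcount(p) for p in products]    # each count needs ceil(log2(n)) + 1 bits
level2 = sum(level1)                        # lossless second-level accumulation

width1 = math.ceil(math.log2(n)) + 1
width2 = math.ceil(math.log2(n * m)) + 1    # operand width keeps growing up the tree
print(width1, width2, level2)
```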
A disclosed circuit arrangement includes reduction operator circuits arranged in a first level of a reduction tree. Each reduction operator circuit accumulates respective products into a respective sum. Quantizer circuits are configured to quantize the sums from the reduction operator circuits into quantized sums, respectively, based on values of the sums relative to respective first thresholds. Another reduction operator circuit is arranged in a second level of the reduction tree and is configured to accumulate the quantized sums and provide a first sum. A second-level quantizer circuit is configured to quantize the first sum into a quantized first sum based on a value of the first sum relative to a second threshold.
A disclosed method includes running inference on an input tensor by a neural network having a plurality of layers. The method includes providing, for each layer j of the plurality of layers, elements of an output tensor from layer j as input elements to layer j+1 of the neural network. The inference in layer i of the plurality of layers includes quantizing, by first quantizer circuits arranged in a first level of a reduction tree, sums generated by first reduction operator circuits arranged in the first level into first quantized sums, based on values of the sums relative to respective first thresholds. The inference includes inputting the first quantized sums to a first reduction operator circuit arranged in a second level of the reduction tree. The inference includes quantizing, by a final quantizer circuit, a final sum generated by a reduction operator circuit arranged in a last level of the reduction tree into an element of the output tensor from layer i, based on a value of the final sum relative to a second threshold.
Another disclosed method includes performing feed forward processing by a neural network. The feed forward processing in layer i of the neural network includes summing products of input activations, initial scaling factors, and weights into partial sums by a partial adder tree. The feed forward processing in layer i includes scaling output from the partial adder tree into scaled output using intermediate scaling factors and generating final sums and an output tensor from the scaled output. The feed forward processing in layer i includes providing the output tensor to layer i+1 of the neural network. The method includes performing backpropagation by the neural network, and the backpropagation includes computing gradient values corresponding to elements of the output tensor from layer i, and updating, in layer i, the weights, initial scaling factors, and intermediate scaling factors based on the gradient values. The method includes determining intermediate thresholds for intermediate quantizations in a reduction tree based on trained initial scaling factors and trained intermediate scaling factors.
Other features will be recognized from consideration of the Detailed Description and Claims, which follow.
Various aspects and features of the circuits and methods will become apparent upon review of the following detailed description and upon reference to the drawings.
In the following description, numerous specific details are set forth to describe specific examples presented herein. It should be apparent, however, to one skilled in the art, that one or more other examples and/or variations of these examples may be practiced without all the specific details given below. In other instances, well known features have not been described in detail so as not to obscure the description of the examples herein. For ease of illustration, the same reference numerals may be used in different diagrams to refer to the same elements or additional instances of the same element.
The disclosed methods and systems employ intermediate quantizer circuits between reduction operators in a reduction tree in order to effectively eliminate intermediate lossless accumulations. The intermediate quantizer circuits improve hardware efficiency by eliminating calculations involving a level of precision that would be lost in the quantization of the final accumulation.
The reduction operator circuits at the lowest level of the reduction tree accumulate products into respective sums, and intermediate quantizer circuits quantize the sums from the lowest-level reduction operator circuits into respective quantized sums. The intermediate quantizations can be based on values of the sums relative to respective thresholds for some applications. The quantized sums are input to a reduction operator circuit in the second level of the reduction tree, which accumulates the quantized sums into another sum, which is quantized by another quantizer circuit.
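The following Python sketch illustrates the idea behaviorally (the sizes and threshold values are placeholders, not trained values): each first-level count is binarized against its own threshold before being passed upward, so the second-level reduction only ever operates on one-bit values.

```python
import random

def binarize(value, threshold):
    """Intermediate quantizer: 1 if the sum reaches the threshold, else 0."""
    return 1 if value >= threshold else 0

n, m = 256, 64
thresholds = [n // 2] * m                      # hypothetical trained thresholds, one per counter

products = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
level1_sums = [sum(p) for p in products]       # first-level population counts
quantized = [binarize(s, t) for s, t in zip(level1_sums, thresholds)]

level2_sum = sum(quantized)                    # second-level counter now reduces m one-bit values
output = binarize(level2_sum, m // 2)          # final quantization against a second threshold
print(output)
```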
The reduction trees having intermediate quantizations can be employed in different layers of neural networks as well as in a variety of different types of neural networks. For example, the intermediate quantizations in reduction trees can be used in a deep neural network (DNN) having a cascade of fully connected and convolutional layers, with adjacent layers separated by nonlinearities such as binarization, ReLU, or softmax. In some applications, the data and parameters can be quantized into low-precision data formats such as fixed point, block floating point, and binary, and the outputs of the summations in multiply-and-accumulate (MAC) operations can be immediately quantized down to those formats before being fed forward for inference.
The intermediate quantizations in reduction trees can be used in a binarized DNN having a cascade of fully connected and convolutional layers. In the binarized DNN, data and parameters are quantized down to one bit, and the outputs of the summations in MAC operations can be immediately binarized using a threshold function before being fed forward for inference.
The intermediate quantizations in reduction trees can be used in a binarized DNN in which every layer is fully unrolled, with every MAC operation spatially mapped one-to-one onto the hardware. The one-to-one mapping hardens the parameters in logic, and multiplications in MACs can be reduced to either wires or inverters. The outputs of the summations in the MAC operations can be immediately binarized using a threshold function before being fed forward for inference. Though examples shown in the figures and described herein may be directed to binarized neural networks, it will be appreciated that the methods and circuits can be adapted to neural networks that process other data formats, such as floating point, fixed point, block floating point, etc.
The reduction operator circuits in the reduction tree are population counters. Each reduction operator circuit generates a respective sum by counting the number of products having a bit value of 1. For example, reduction operator circuit 104 is a population counter that counts the number of products of activations x1,1, . . . , x1,n and weights w1,1, . . . , w1,n that have a bit value of 1. The lowest level of the reduction tree has m reduction operator circuits, and each reduction operator circuit accumulates the population count of n products. Each reduction operator circuit produces a result having ⌈log2 n⌉+1 bits from the n products.
According to the disclosed approaches, the reduction trees have intermediate quantizers disposed between reduction operators. The quantizer circuits in the exemplary reduction tree 100 are binarization circuits, and the intermediate quantizers are shown as circuit elements 106, 108, and 110. Each binarization circuit compares the count from one of the reduction operator circuits to a respective threshold and generates a bit value based on the count relative to the threshold. For example, if the sum is less than the threshold, the quantized sum can be bit value 0, and if the sum is greater than or equal to the threshold, the quantized sum can be bit value 1 (or vice versa). The thresholds can be trained values stored in registers of the binarization circuits.
The binary outputs from the binarization circuits 106, 108, . . . , 110 are input to the reduction operator circuit 112 in the second level of the reduction tree. The reduction operator circuit 112 is a population counter that counts the number of binary outputs from the m binarization circuits 106, 108, . . . , 110 having a bit value of 1. The count generated from the m inputs by population counter 112 has ⌈log2 m⌉+1 bits.
The quantizer circuit 114 is a binarization circuit and generates the final binary output of the reduction tree. Notably, the binarization circuit 114 reduces the output from circuit 112 from ⌈log2 m⌉+1 bits to one bit. Prior art reduction trees lack the binarization circuits 106, 108, . . . , 110 between the reduction operators. Thus, prior art reduction trees would require extra hardware for the reduction operator 112 to handle values from the reduction operator circuits (e.g., 104) having ⌈log2 n⌉+1 bits, even though that precision is lost in the final quantization by binarization circuit 114.
Each reduction operator circuit in the first level is coupled to directly input a one-bit activation (xi,j), or to directly input the output from an inverter that is coupled to directly input a one-bit activation. For example, reduction operator circuit 152 is coupled to directly input the activations x1,1 and x1,2, (156, 158) and to directly input the output from inverter 154, which inputs activation x1,n.
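As a behavioral illustration of hardened weights (assuming the {0, 1} inference encoding), an XNOR with a constant weight bit degenerates to a wire when the weight is 1 and to an inverter when the weight is 0, as in the following sketch with hypothetical weight and activation values.

```python
def hardened_product(activation, weight_bit):
    """XNOR with a constant weight: weight 1 -> pass-through wire, weight 0 -> inverter."""
    return activation if weight_bit == 1 else 1 - activation

# One first-level counter over hypothetical hardened weights.
weights = [1, 0, 1, 1, 0]
activations = [1, 1, 0, 1, 0]
count = sum(hardened_product(x, w) for x, w in zip(activations, weights))
print(count)  # population count fed to the first-level quantizer
```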
Each of the reduction operators in the second level sums quantized sums from a subset of the reduction operators in the first level. For example, reduction operator 206 sums the quantized sums from quantizer circuits 204, 212, . . . , 214, which are quantized sums from the reduction operators 202, 216, . . . , 218. Reduction operator 220 sums the quantized sums from quantizer circuits 222, . . . , 224, which are quantized sums from the reduction operators 226, . . . , 228.
A reduction operator circuit can be a tree of full adder circuits or a population counter, depending on application requirements. In a neural network that is not a BNN, each reduction operator circuit in the first level of the reduction tree sums a subset of the products of weights and activations. For example, reduction operator 202 sums the products x1,1*w1,1, x1,2*w1,2, . . . , x1,n*w1,n.
The exemplary reduction tree has trained intermediate quantizer circuits between reduction operators in multiple levels. For example, quantizer circuit 204 quantizes the output from reduction operator 202, the output from quantizer circuit 204 is provided to reduction operator 206, quantizer circuit 208 quantizes the output from reduction operator 206, and the output from quantizer circuit 208 is provided to reduction operator 210. Quantizer circuit 230 quantizes the output from reduction operator 220, and quantizer circuit 232 quantizes the output from reduction operator 210.
In a neural network other than a BNN, the intermediate quantizer circuits in the reduction tree can reduce the precision from one multi-bit value to another multi-bit value of fewer bits. For example, the intermediate quantizer circuits can reduce a value from 16 bits to 8 bits. The quantization can be implemented by a rounding circuit, for example. The intermediate quantizations can have one threshold per bit of each multi-bit value.
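A minimal sketch of such a multi-bit intermediate quantizer, assuming a simple round-and-saturate scheme and illustrative 16-bit and 8-bit widths (the actual widths and rounding rule depend on the application):

```python
def round_to_fewer_bits(value, in_bits=16, out_bits=8):
    """Round an unsigned in_bits-wide sum to the nearest out_bits-wide value."""
    shift = in_bits - out_bits
    rounded = (value + (1 << (shift - 1))) >> shift   # add half an LSB, then truncate
    return min(rounded, (1 << out_bits) - 1)          # saturate at the output range

print(round_to_fewer_bits(40000))   # 16-bit sum quantized down to 8 bits -> 156
```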
For a tensor X of full-precision data, {circumflex over (X)} represents the binarized input activations at training time. To improve the performance of BNNs, residual binarization ("ReB") is applied to the output in order to improve the data precision. Each data element of {circumflex over (X)} is represented by B bits, together with B trained full-precision scaling factors. In the exemplary system, B=2, and the scaling factors are α0 and α1. The goal of residual binarization is to approximate activations X∈ℝ^(M×N) with {circumflex over (X)}∈{−1, 1}^(M×N×B) and α∈ℝ^B, such that X≈Σ_{b=0}^{B−1} α_b {circumflex over (X)}_b, where M and N are the dimensions of X.
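The following numpy sketch illustrates the approximation for B=2 with random data; the greedy choice of each scaling factor as the mean absolute residual is an assumption for illustration only, since the disclosure trains the scaling factors.

```python
import numpy as np

def residual_binarize(X, B=2):
    """Approximate X with sum_b alpha_b * Xhat_b, where Xhat_b is in {-1, +1}."""
    residual = X.copy()
    planes, alphas = [], []
    for _ in range(B):
        alpha = np.mean(np.abs(residual))        # assumed greedy choice; trained in the disclosure
        plane = np.where(residual >= 0, 1.0, -1.0)
        planes.append(plane)
        alphas.append(alpha)
        residual = residual - alpha * plane
    return np.stack(planes, axis=-1), np.array(alphas)

X = np.random.randn(4, 8)                        # M x N full-precision activations
Xhat, alpha = residual_binarize(X)
approx = (Xhat * alpha).sum(axis=-1)             # X ≈ sum_b alpha_b * Xhat_b
print(np.abs(X - approx).mean())                 # approximation error shrinks as B grows
```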
Note that {tilde over ({circumflex over (X)})} represents the tensor at inference time. Every element in {tilde over ({circumflex over (X)})} has a value in the set {0, 1}. The training and inference forms can be converted via the functions {tilde over ({circumflex over (X)})}=({circumflex over (X)}+1)/2 and {circumflex over (X)}=2{tilde over ({circumflex over (X)})}−1. In deep neural network training, a standard approach normalizes the input activation data to have a mean of zero and a standard deviation of 1 in order to improve network performance. Single-sided activation data, taking values of either 0 or 1, would lead to reduced performance due to an implicit bias of 0.5.
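These conversion functions are element-wise and trivially invertible, as the following short sketch confirms:

```python
import numpy as np

def to_inference(Xhat):
    """{-1, +1} training form -> {0, 1} inference form: (Xhat + 1) / 2."""
    return (Xhat + 1) // 2

def to_training(Xtilde):
    """{0, 1} inference form -> {-1, +1} training form: 2 * Xtilde - 1."""
    return 2 * Xtilde - 1

Xhat = np.array([-1, 1, 1, -1])
assert np.array_equal(to_training(to_inference(Xhat)), Xhat)
```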
The input activations are labeled {circumflex over (X)}0 and {circumflex over (X)}1. {circumflex over (X)} has a size of M×K×B, where K is the reduction dimension in matrix multiplication, such that (M, K)*(K, N) produces (M, N), and B is the number of bits that represent each element of the tensor. In the exemplary system, B=2, so that {circumflex over (X)}0 is the slice of {circumflex over (X)} having the first bits of the M×K elements, and {circumflex over (X)}1 is the slice having the second bits of the elements.
The parameters undergoing training include binary weights Ŵ∈{−1, 1}^(K×N), scaling factors α0 and α1, biases δ0, δ1∈ℝ^(N×K/P×B), batch normalization scaling factors γ, biases β, moving average μ, and moving standard deviation √σ, where γ, β, μ, √σ∈ℝ^N. Once trained, the parameters are used to determine the quantization thresholds (T0, T1, Φ0, Φ1) and scaling factors (θ0, θ1) used in the inference arrangement described below.
The weights, activations, and scaling factors can be input from a data bus in a streaming or memory mapped mode, for example, to the training circuitry. Multiplier circuits 252 and 254 generate scaled activations (element-wise) from the input activations and scaling factors ({circumflex over (X)}0×α0 and {circumflex over (X)}1×α1), and multiplier circuits 256 and 258 generate products (element-wise) from the scaled activations and weights Ŵ.
Each of partial adder trees 260 and 262 generates a set of partial sums of products from the multiplier circuits 256 and 258, respectively. Each partial sum is a sum of P products. Partial adder tree 260 generates X′0, and partial adder tree 262 generates X′1, where X′0, X′1∈ℝ^(M×N×K/P). The P inputs correspond to the inputs of the first-level population counters of the reduction tree described above; that is, m in the reduction tree corresponds to K/P, and n corresponds to P.
The elements of X′0 and X′1 are summed element-wise with the biases δ0 and δ1 by adders 264 and 266, respectively. The sgn circuit 268 converts each of the sums output by adder 264 into a +1 or −1 value, depending on the sign bit of the sum, and the sgn circuit 270 similarly converts each of the sums output by adder 266.
From partial adder tree 260 through summation circuit 272, the training circuit performs the accumulation step of a matrix multiplication with sizes (M, K)×(K, N) to produce an output matrix of size (M, N), where K is the reduction dimension. The matrix multiplication outputs of size (M, N) go through batch normalization, which generates output of size (M, N). Summation circuit 272 generates a sum of the values output from the sgn circuit 268, and summation circuit 274 generates a sum of the values output from the sgn circuit 270. Multiplier circuit 276 generates a product of the sum from circuit 272 and the scaling factor α0, and multiplier circuit 278 generates a product of the sum from circuit 274 and the scaling factor α1. Adder circuit 280 generates a sum of the products from multiplier circuits 276 and 278.
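A compact numpy sketch of this training-time forward path follows; the shapes and the sgn-based binarization track the description above, while the sizes M, K, N, P and the values of the scaling factors and biases are placeholders.

```python
import numpy as np

def sgn(v):
    """Sign binarization used at training time: maps to {-1, +1}."""
    return np.where(v >= 0, 1.0, -1.0)

M, K, N, P = 4, 32, 8, 8                           # placeholder sizes, K divisible by P
Xhat = sgn(np.random.randn(2, M, K))               # Xhat_0, Xhat_1 in {-1, +1}
W = sgn(np.random.randn(K, N))                     # binary weights in {-1, +1}
alpha = np.array([1.0, 0.5])                       # trained scaling factors (placeholders)
delta = np.random.randn(2, N, K // P)              # trained biases (placeholders)

Y = np.zeros((M, N))
for b in range(2):
    prod = np.einsum('mk,kn->mnk', alpha[b] * Xhat[b], W)   # scaled element-wise products
    partial = prod.reshape(M, N, K // P, P).sum(axis=-1)    # partial adder tree: sums of P
    s = sgn(partial + delta[b][None, :, :])                 # add bias, binarize partial sums
    Y += alpha[b] * s.sum(axis=-1)                          # sum over K/P, rescale, accumulate
print(Y.shape)                                              # (M, N), fed to batch normalization
```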
The batch normalization function 282 inputs the scaling factors γ, biases β, moving average μ, and moving standard deviation √σ to standardize the sum from adder 280 into intermediate activations X″, where X″∈ℝ^(M×N).
The residual binarization (“ReB”) function 284 is applied in forward propagation to the intermediate activations X″ using scaling factors α′0 and α′1 to generate output activations Ŷ0 and Ŷ1.
The subtraction circuit 306 subtracts, element-wise, the scaled values output by the multiplier 304 from the elements of X″, and the sgn circuit 308 converts each of the values from subtraction circuit 306 to a value of −1 or +1 in response to the sign bit of the value. Multiplier 310 multiplies each of the sign values by the scaling factor α′1, and adder 312 sums, element-wise, the scaled sign values output by multiplier 310 with the scaled sign values from multiplier 304.
The output Y from adder 312 is used only in training, where it serves as the pre-quantization input to the next layer and ensures gradient backpropagation; Y is not used in inference.
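A short numpy sketch of this output-stage residual binarization at training time (X″ is random data here, and the scaling factors α′0 and α′1 are placeholder values):

```python
import numpy as np

def sgn(v):
    return np.where(v >= 0, 1.0, -1.0)

Xpp = np.random.randn(4, 8)            # intermediate activations X'' after batch normalization
a0, a1 = 0.8, 0.3                      # placeholder trained scaling factors alpha'_0, alpha'_1

Yhat0 = sgn(Xpp)                       # first binarization of X''
Yhat1 = sgn(Xpp - a0 * Yhat0)          # binarize the residual left after the first level
Y = a0 * Yhat0 + a1 * Yhat1            # used only in training for gradient backpropagation
```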
The trained binary weights {tilde over (Ŵ)}∈{0, 1}^(K×N) and input activations {tilde over ({circumflex over (X)})}0, {tilde over ({circumflex over (X)})}1∈{0, 1}^(M×K×B) can be input from a data bus in a streaming or memory mapped mode, for example, to the inference circuitry. XNOR circuitry 354 and 356 generate binary products (element-wise) from the activations and the weights {tilde over (Ŵ)}.
The partial adder tree 358 generates sums from subsets having P elements of the outputs from XNOR circuitry 354, and partial adder tree 360 generates sums from subsets having P elements of the outputs from XNOR circuitry 356. As in the training arrangement, the partial adder tree outputs X′0 and X′1 have size M×N×K/P.
Threshold circuits 362 and 364 perform intermediate quantizations (binarization in the example) of the elements of X′0 and X′1, based on the trained intermediate quantization thresholds T0 and T1, respectively.
Inputs to threshold circuits 362 and 364 are matrices of size (M, N, K/P), and T0 and T1 are matrices of threshold values of size (N, K/P). T0 and T1 are broadcast to shape (M, N, K/P) by threshold circuits 362 and 364 in applying the element-wise threshold operation, and the threshold circuits generate output matrices of size (M, N, K/P). The inputs to summation circuits 366 and 368 are matrices of size (M, N, K/P), and summation circuits 366 and 368 perform summation across the third dimension, K/P, resulting in output matrices of size (M, N).
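The following numpy sketch mirrors the broadcast thresholding and the subsequent reduction over K/P; the sizes and the threshold values are placeholders.

```python
import numpy as np

M, N, K_P = 4, 8, 16                               # placeholder sizes; K_P = K / P
Xp0 = np.random.randint(0, 9, size=(M, N, K_P))    # partial popcounts X'_0 (placeholder values)
T0 = np.random.randint(1, 9, size=(N, K_P))        # trained thresholds, one per (n, k/P) position

q0 = (Xp0 >= T0[None, :, :]).astype(np.int64)      # broadcast T0 to (M, N, K/P), binarize
s0 = q0.sum(axis=-1)                               # summation circuit: reduce K/P -> (M, N)
print(s0.shape)
```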
The circuits 370 and 372 implement instances of the ƒ0 function (ƒ0(X)=2X−1) in order to convert data from the {tilde over ({circumflex over (X)})} format (the inference-time format, {0, 1}) to the {circumflex over (X)} format (the training-time format of the binarized input activations, {−1, 1}), so as to enable the application of the scaling factors θ0 and θ1.
The multiplier circuits 374 and 376 scale output values from the instances 370 and 372 of the ƒ0 function, and the adder circuit 378 sums each output element from multiplier circuit 374 with the corresponding output element from multiplier circuit 376.
Threshold circuit 380 quantizes (binarization in the example) the elements generated by adder circuitry 378 based on the trained quantization threshold Φ0. The output from threshold circuit 380 is also provided as input to subtraction circuit 382, which subtracts, element-wise, the output from threshold circuit 380 from the output of adder 378. The output from subtraction circuit 382 is provided as input to threshold circuit 384, which quantizes (binarization in the example) the elements generated by subtraction circuitry 382 based on the trained quantization threshold Φ1. The binary activations from threshold circuits 380 and 384 can be provided as input activations {tilde over (Ŷ)}0 and {tilde over (Ŷ)}1 to the next layer of the neural network, respectively.
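A numpy sketch of this final stage follows, tracking the circuit ordering described above (summation outputs, ƒ0 conversion, θ scaling, addition, and the two threshold circuits); the input values, scaling factors, and thresholds are placeholders, so the sketch shows the structure of the data flow rather than trained behavior.

```python
import numpy as np

def f0(x):
    """f0(X) = 2X - 1, converting {0, 1}-format data to {-1, +1} format."""
    return 2 * x - 1

M, N, K_P = 4, 8, 16
s0 = np.random.randint(0, K_P + 1, size=(M, N))   # output of summation circuit 366 (placeholder)
s1 = np.random.randint(0, K_P + 1, size=(M, N))   # output of summation circuit 368 (placeholder)
theta0, theta1 = 0.8, 0.3                         # placeholder trained scaling factors
phi0, phi1 = 8.0, -0.5                            # placeholder trained thresholds

z = theta0 * f0(s0) + theta1 * f0(s1)             # circuits 370-378: convert, scale, and sum
Y0 = (z >= phi0).astype(np.int64)                 # threshold circuit 380 with Phi_0
Y1 = ((z - Y0) >= phi1).astype(np.int64)          # subtract (382), then threshold 384 with Phi_1
```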
Training of the neural network involves the training system determining a level of accuracy by comparing results produced at the output layer 454 to predetermined reference results after a plurality of training iterations. The training can be terminated in response to completing a predetermined number of training iterations, or in response to the change in the level of accuracy from one iteration to the next falling below a given threshold for a number of consecutive iterations.
Referring to the PS 602, each of the processing units includes one or more central processing units (CPUs) and associated circuits, such as memories, interrupt controllers, direct memory access (DMA) controllers, memory management units (MMUs), floating point units (FPUs), and the like. The interconnect 616 includes various switches, busses, communication links, and the like configured to interconnect the processing units, as well as interconnect the other components in the PS 602 to the processing units.
The OCM 614 includes one or more RAM modules, which can be distributed throughout the PS 602. For example, the OCM 614 can include battery backed RAM (BBRAM), tightly coupled memory (TCM), and the like. The memory controller 610 can include a DRAM interface for accessing external DRAM. The peripherals 608, 615 can include one or more components that provide an interface to the PS 602. For example, the peripherals can include a graphics processing unit (GPU), a display interface (e.g., DisplayPort, high-definition multimedia interface (HDMI) port, etc.), universal serial bus (USB) ports, Ethernet ports, universal asynchronous receiver-transmitter (UART) ports, serial peripheral interface (SPI) ports, general-purpose input/output (GPIO) ports, serial advanced technology attachment (SATA) ports, PCIe ports, and the like. The peripherals 615 can be coupled to the MIO 613. The peripherals 608 can be coupled to the transceivers 607. The transceivers 607 can include serializer/deserializer (SERDES) circuits, MGTs, and the like.
Various logic may be implemented as circuitry to carry out one or more of the operations and functions described herein and/or shown in the figures. In these contexts, a circuit or circuitry may be referred to as “logic,” “module,” “engine,” or “block.” It should be understood that logic, modules, engines and blocks are all circuits that carry out one or more of the operations/activities. In certain implementations, a programmable circuit is one or more computer circuits programmed to execute a set (or sets) of instructions stored in a ROM or RAM and/or operate according to configuration data stored in a configuration memory.
Though aspects and features may in some cases be described in individual figures, it will be appreciated that features from one figure can be combined with features of another figure even though the combination is not explicitly shown or explicitly described as a combination.
The circuits and methods are thought to be applicable to a variety of systems employing reduction trees. Other aspects and features will be apparent to those skilled in the art from consideration of the specification. The circuits and methods can be implemented as one or more processors configured to execute software, as an application specific integrated circuit (ASIC), or as logic on a programmable logic device. It is intended that the specification and drawings be considered as examples only, with a true scope of the invention being indicated by the following claims.