An embodiment of the present invention is generally related to neural networks.
Artificial neural networks (NNs) can be designed and trained to perform a wide range of functions. Example applications of NNs include image processing, speech recognition, data processing, and control, among other applications. Models of NNs can include a large number of layers and parameters (weights). Processors with highly-parallel architectures, such as graphics processing units (GPUs), can facilitate efficient implementation of large NNs.
In one embodiment, each IDP unit receives compressed weights and input feature map (IFM) data and outputs decompressed weights and IFM data to the MAA units. For example, each IDP may include at least one decompressor and a buffer to buffer input data. In one embodiment, the accumulated results of the MAAs correspond to output feature map (OFM) data and intermediate results. One or more units (labeled in
The number of IDPs in one embodiment is eight, although more generally, different numbers of IDPs may be used. In one embodiment, the IDP units run in parallel, each supplying one non-zero weight and one set of feature map values (a subset of the IFM) to a MAA computation unit. In one embodiment, the input units iterate over subsets of the IFMs and corresponding weights over multiple cycles to generate a set of OFMs in parallel.
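By way of illustration, the following simplified Python sketch mimics this dataflow in software (the eight-way split of the IFM, the array shapes, and the function name are illustrative assumptions rather than a description of any particular hardware):

    import numpy as np

    NUM_IDP = 8  # number of parallel input units assumed for this sketch

    def fc_layer(ifm, weights):
        # Each OFM's weight vector and the IFM are split across NUM_IDP input
        # units; each unit supplies only its non-zero weights (with the matching
        # IFM values) for accumulation, mimicking the IDP-to-MAA dataflow.
        num_ofm = weights.shape[0]                 # weights has shape (num_ofm, ifm.size)
        ofm = np.zeros(num_ofm)
        ifm_parts = np.array_split(np.arange(ifm.size), NUM_IDP)
        for o in range(num_ofm):
            for part in ifm_parts:                 # the NUM_IDP units run concurrently in hardware
                w = weights[o, part]
                for idx in np.flatnonzero(w):      # only non-zero weights are supplied
                    ofm[o] += w[idx] * ifm[part][idx]
        return ofm

    rng = np.random.default_rng(0)
    ifm = rng.standard_normal(32)
    weights = rng.standard_normal((4, 32)) * (rng.random((4, 32)) > 0.7)   # mostly-zero weights
    print(np.allclose(fc_layer(ifm, weights), weights @ ifm))              # matches the dense product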
The stored compressed weights may then be used to execute a neural network, as illustrated in the flow chart of
NN training algorithms typically result in the feature maps of the layers of the NN being arbitrarily organized in memory. As a consequence, the weights that correspond to the feature maps will also typically be arbitrarily organized in memory. This arbitrary organization, in turn, impacts compression and execution efficiency. One aspect of reordering is that there are a number of functionally equivalent orderings of a neural network. However, some of the functionally equivalent orderings can be selected to have a structure that can be exploited to achieve better compression rates than others. By way of illustration, suppose that feature maps 0 and 10 of a layer can be swapped with no impact on the NN's input-output relationship, provided a corresponding swap is made to the associated weights. The same weights are applied to the same inputs and the results are summed in the same way, so the original and reordered networks produce the same outputs. However, the reordering may be selected to result in a structure that is better suited for compression and/or has advantages for execution. For example, weights of a NN can be reordered so that similar weights are grouped together in memory. That is, after training of a NN and before compression of its weights, the NN's feature maps, and by extension, the weight values, can be reordered.
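A toy example in Python illustrates the equivalence (the two-layer network and its sizes are illustrative assumptions; feature maps 0 and 10 follow the example above):

    import numpy as np

    def two_layer(x, w1, w2):
        # Two fully connected layers with a ReLU between them.
        h = np.maximum(w1 @ x, 0.0)   # hidden feature map values
        return w2 @ h

    rng = np.random.default_rng(0)
    x = rng.standard_normal(16)
    w1 = rng.standard_normal((12, 16))   # produces 12 hidden feature maps
    w2 = rng.standard_normal((4, 12))    # consumes those feature maps

    # Swap hidden feature maps 0 and 10: rows of w1 and, correspondingly, columns of w2.
    w1_r = w1.copy(); w1_r[[0, 10]] = w1_r[[10, 0]]
    w2_r = w2.copy(); w2_r[:, [0, 10]] = w2_r[:, [10, 0]]

    print(np.allclose(two_layer(x, w1, w2), two_layer(x, w1_r, w2_r)))  # True: same input-output behavior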
In one embodiment, the neural network reordering may be selected to introduce an ordering to the weights to increase the ability to compress the weights (i.e., reduce the amount of data that is used to represent the NN). By reordering network layers, an ordering can be introduced to the weights that is selected to provide better weight compression. One option is to perform the reordering to improve compression by introducing a structure to the weights that aids in compressing them. For example, weights may be grouped or ordered by value. Still another option is to perform the reordering based on characteristics of a coding technique used for compression, such as Huffman coding or Golomb-Rice coding. As an example, feature maps can be reordered so that frequency distributions are sharper in a particular localized area. Additionally, the reordering may be selected to improve prediction accuracy in the encoding. As another example, network feature maps can be reordered so that weight values tend to increase or the number of zero-value weights increases.
Also, by redistributing non-zero weights, it is possible to more effectively skip over zero-value weights during network execution. One option is to perform reordering to group zero-value weights to permit them to be skipped during execution.
As yet another example, weights may be reordered to create better load balancing during parallel processing of a neural network model. For example, the reordering may be performed so that each processing unit, in the parallel processing, is supplied a more equal number (e.g., about the same number) of non-zero weights over a selected number of cycles.
In one embodiment, network pruning and weight clustering of selected weights may be performed after network training. Clustering includes, for example, mapping a number of different weight values to a smaller number of weight values to improve compression. For example, a thousand or more slightly different weights might be mapped to 32 weight values. Clustering is also sometimes referred to as quantization. In one embodiment, low-magnitude weights are pruned (i.e., clamped to zero). In one embodiment, the pruning is performed without impacting network accuracy. The remaining non-zero weights may then be adjusted through network retraining to regain any lost accuracy. That is, to counteract loss of accuracy,
retraining can be done to readjust certain weights so that the overall network maintains the same or nearly the same accuracy, while maintaining the compression advantages.
In one embodiment, pruning increases the percentage of zero-value weights. This has potential advantages for compression and also execution. During execution in an end NN device, a number of weights may be applied in parallel in a given cycle in SIMD fashion (e.g., either all parallel compute units apply a weight or all skip a zero-value weight). That is, there is no need to apply weights equal to zero during execution, since these have no effect. In some cases, pruning can result in a large proportion of the weights ending up being zero (e.g., about 60% to 95% or more), which in turn, provides an opportunity to speed up network execution.
In one embodiment, zero-value weights are grouped to improve execution. It can be difficult to eliminate processing cycles for many of the zero-value weights. However, a number of zero-value weights can be skipped when they are grouped so that they are collected together in the same cycle. This can help speed up execution and improve compression at the same time.
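The following Python sketch quantifies the effect (the SIMD width of eight, the zero fraction, and the reordering rule are illustrative assumptions); a cycle can be eliminated only when every weight issued in that cycle is zero:

    import numpy as np

    SIMD_WIDTH = 8  # assumed number of weights applied together per cycle

    def skippable_cycles(weights):
        # A cycle is skippable only when all SIMD_WIDTH weights issued in it are zero.
        pad = (-len(weights)) % SIMD_WIDTH
        groups = np.pad(weights, (0, pad)).reshape(-1, SIMD_WIDTH)
        return int(np.sum(np.all(groups == 0, axis=1)))

    rng = np.random.default_rng(1)
    weights = rng.standard_normal(128) * (rng.random(128) > 0.8)   # ~80% zeros, scattered

    # Group the zero-value weights together (the corresponding inputs would be
    # permuted the same way so the computed results are unchanged).
    order = np.argsort(weights == 0, kind="stable")                # non-zero weights first, zeros last
    print(skippable_cycles(weights), skippable_cycles(weights[order]))  # grouping frees more cycles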
In addition to reordering the network and lossless compression of the reordered weights, example embodiments can also utilize lossy compression, which can be omitted in other embodiments. In this case, together with reordering, adjustments (e.g., small adjustments) are made to the weights to improve compression.
Weights are pruned 610 to improve weight compression efficiency and reduce network computation cost. In one embodiment, the pruning is performed with variable thresholds. For example, the threshold can be selected based on a predetermined scaling factor of distance measures of the weights. In an example embodiment, the threshold is selected as a value equal to about 20% of the L1 distance of each weight vector in fully connected layers or of each convolutional kernel in convolutional layers. Different scaling factors or different distance measures can be used in alternative embodiments. In another example, the threshold can be found iteratively via dynamic programming to maximize the number of zero values in each generated cluster, subject to a regularization constraint that bounds the threshold.
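A minimal Python sketch of such pruning is shown below, assuming for illustration that each kernel's threshold is the scaling factor times the kernel's mean absolute weight; the particular distance measure and factor in a given embodiment may differ:

    import numpy as np

    def prune_kernels(kernels, scale=0.2):
        # Prune each convolutional kernel with its own threshold derived from a
        # per-kernel distance measure (here, the mean absolute weight value).
        pruned = kernels.copy()
        for k in range(kernels.shape[0]):
            threshold = scale * np.abs(kernels[k]).mean()
            pruned[k][np.abs(pruned[k]) < threshold] = 0.0   # clamp low-magnitude weights to zero
        return pruned

    rng = np.random.default_rng(2)
    kernels = rng.standard_normal((16, 3, 3, 3))             # 16 kernels of shape 3x3x3
    pruned = prune_kernels(kernels)
    print(f"zero fraction: {np.mean(pruned == 0):.2f}")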
The remaining weights are retrained 615. As indicated by block 620, in some embodiments an option may be included to repeat the pruning and retraining one or more times, until a stopping condition is satisfied, such as reaching a preset number of iterations.
Quantization of the weights 625 may be performed with optional retraining. In an example embodiment, the clustering of weights is conducted based on k-means clustering, where the centroid of each cluster is used to represent the weights included in that cluster.
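By way of example, the sketch below clusters the non-zero weight values with a simple one-dimensional k-means (Lloyd's iterations in NumPy) and replaces each weight by its cluster centroid; the choice of 32 clusters, the initialization, and keeping zeros fixed are illustrative assumptions:

    import numpy as np

    def kmeans_quantize(weights, k=32, iters=20):
        w = weights.ravel()
        nz = w[w != 0]                                         # cluster only the non-zero weights
        centroids = np.linspace(nz.min(), nz.max(), k)         # spread initial centroids over the value range
        for _ in range(iters):
            assign = np.argmin(np.abs(nz[:, None] - centroids[None, :]), axis=1)
            for c in range(k):
                members = nz[assign == c]
                if members.size:
                    centroids[c] = members.mean()              # centroid represents its cluster's weights
        quantized = w.copy()
        mask = w != 0
        nearest = np.argmin(np.abs(w[mask][:, None] - centroids[None, :]), axis=1)
        quantized[mask] = centroids[nearest]
        return quantized.reshape(weights.shape), centroids

    rng = np.random.default_rng(3)
    weights = rng.standard_normal(1000) * (rng.random(1000) > 0.6)
    q, centroids = kmeans_quantize(weights)
    print(len(np.unique(q)))                                   # at most k distinct non-zero values, plus zero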
The sets of quantized weights are reordered 630. As previously discussed, reordering may include reordering corresponding to switching around feature maps or feature map nodes in fully-connected layers. However, the reordering may also include reordering to improve compression. The reordering may include reordering into clusters and reordering based on column and row attributes. Sets of quantized weights within clusters may also be selected to maximize effectiveness of predictions. For example, the reordering may include a reordering in which cluster 0 is the most common and cluster 31 is the least common. As one option, columns may be reordered into clusters of a selected number of columns (e.g., 16, depending on implementation details) in increasing order to maximize the effectiveness of some inter-column compression. Additionally, rows may be reordered within a group of columns to effectively compress iteratively in the row dimension. For example, row 1 elements are predicted to be the same as row 0, plus some small positive delta, and the deltas are compressed. Clusters can be any suitable number of columns in alternative embodiments. Clusters can be formed from any suitable elements (e.g., rows) in alternative embodiments.
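The column reordering can be sketched as follows (the group size of 16, the use of the column mean as the sort key, and the synthetic cluster-index data are illustrative assumptions); the recorded permutation corresponds to the feature-map reordering that execution must account for:

    import numpy as np

    GROUP = 16  # assumed number of columns per cluster

    def reorder_columns(w):
        # Within each group of GROUP columns, place the columns in increasing
        # order of a simple key (the column mean) so neighbors are similar and
        # inter-column deltas stay small.
        perm = np.arange(w.shape[1])
        for start in range(0, w.shape[1], GROUP):
            cols = np.arange(start, min(start + GROUP, w.shape[1]))
            perm[cols] = cols[np.argsort(w[:, cols].mean(axis=0))]
        return w[:, perm], perm                   # perm records the corresponding feature-map reordering

    rng = np.random.default_rng(4)
    levels = rng.integers(0, 32, 64)              # per-column typical cluster index
    w = np.clip(levels[None, :] + rng.integers(-2, 3, size=(64, 64)), 0, 31).astype(float)
    reordered, perm = reorder_columns(w)
    # Adjacent-column deltas shrink after reordering, which aids inter-column compression:
    print(np.abs(np.diff(w, axis=1)).mean(), np.abs(np.diff(reordered, axis=1)).mean())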
The deltas are computed versus prediction 635. For example, the differences between adjacent columns and/or rows in a cluster may be computed. Other transformations may be applied to a “base” column or row used to make predictions for the other columns and rows. For example, suppose column 0 is selected as a “base” column and all other columns in a group (e.g., of 16 columns) are predicted by different scale factors applied to the base column. As another example, a row may be predicted to be row 0 multiplied by a scale factor, plus some deltas. In some cases, the deltas will be small.
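The base-column prediction can be sketched as follows (the least-squares choice of scale factors and the synthetic data are illustrative assumptions); reconstruction from the base, the scale factors, and the deltas is exact:

    import numpy as np

    def column_deltas(group):
        # Predict every column from column 0 (the "base") scaled by a per-column
        # factor; the remaining deltas are what gets compressed.
        base = group[:, 0]
        scales = (group.T @ base) / (float(base @ base) or 1.0)   # least-squares scale per column
        deltas = group - np.outer(base, scales)
        return base, scales, deltas

    def reconstruct(base, scales, deltas):
        return np.outer(base, scales) + deltas                    # prediction plus stored deltas

    rng = np.random.default_rng(5)
    base_col = rng.standard_normal(64)
    group = np.outer(base_col, rng.uniform(0.5, 1.5, 16)) + 0.01 * rng.standard_normal((64, 16))
    b, s, d = column_deltas(group)
    print(np.allclose(reconstruct(b, s, d), group), np.abs(d).mean())   # exact, with small deltas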
An optional adjustment 645 of the deltas may be performed to improve compressibility and then retraining performed to mitigate accuracy loss. For example, a delta value might be adjusted up or down a small amount in order to improve compressibility. This adjustment would be a lossy component of the compression scheme.
The deltas and the base prediction are then compressed 650. A coding scheme, such as an entropy coding scheme, may be used. For example, Huffman coding may be used to represent the deltas with a variable number of bits. Efficient compression can be achieved by representing the most common deltas with the fewest possible bits.
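A minimal sketch of building such a code is shown below (this is a generic Huffman construction with an example delta distribution; it is not tied to any particular bitstream format):

    import heapq
    from collections import Counter

    def huffman_code(symbols):
        # Build a Huffman code: the most common symbols receive the shortest codes.
        freq = Counter(symbols)
        if len(freq) == 1:                                   # degenerate case: one distinct symbol
            return {next(iter(freq)): "0"}
        heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            c1, _, t1 = heapq.heappop(heap)                  # merge the two least frequent subtrees
            c2, _, t2 = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in t1.items()}
            merged.update({s: "1" + code for s, code in t2.items()})
            heapq.heappush(heap, (c1 + c2, tie, merged))
            tie += 1
        return heap[0][2]

    deltas = [0] * 50 + [1] * 20 + [-1] * 18 + [2] * 7 + [-3] * 5   # small deltas dominate
    code = huffman_code(deltas)
    encoded = "".join(code[d] for d in deltas)
    print(code, len(encoded), "bits vs", len(deltas) * 8, "for fixed 8-bit values")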
The compressed representation of the reordered model is then written 655 to data storage.
In one embodiment, the manner in which zero values are handled depends in part on the layer type (e.g., convolutional layer vs. fully connected layer). That is, the way in which skipping zero-value weights is implemented depends on the layer type (which in turn corresponds to different mathematical operations, such as vector-product operations for fully connected layers and convolution operations for convolutional layers). For example, zero-value weights may be grouped to more efficiently skip them in a fully connected layer in which vector products are calculated. However, for a convolutional layer, the zero values may be distributed (spread out) to aid in load balancing across parallel computational units, because there is no need to group zero weights in order to skip them in a convolution operation. Consider an example for a convolutional layer in which there is load balancing. In this example, each input unit finds the next non-zero weight for its subset of inputs and moves to that weight, so each input unit moves through its input data at its own rate, hopping from one non-zero weight to the next. Provided each input unit has about the same number of non-zero weights to apply over its subset of inputs, the system is load balanced and effectively skips cycles that would have been needed to apply zero-value weights.
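A toy Python model of this load balancing is shown below (the number of units, the per-IFM non-zero counts, and the greedy assignment are illustrative assumptions); the layer's cycle count is set by the most heavily loaded unit, so evening out the non-zero weights shortens execution:

    import numpy as np

    NUM_UNITS = 8  # assumed number of parallel input units

    rng = np.random.default_rng(6)
    nz_per_ifm = rng.integers(0, 100, 16)      # non-zero weight count per input feature map

    # Naive assignment: IFMs handed to units round-robin, regardless of their non-zero counts.
    naive = [int(nz_per_ifm[u::NUM_UNITS].sum()) for u in range(NUM_UNITS)]

    # Balanced assignment: greedily give each IFM to the currently least-loaded unit.
    balanced = [0] * NUM_UNITS
    for count in sorted(nz_per_ifm, reverse=True):
        balanced[int(np.argmin(balanced))] += int(count)

    # Cycle count is governed by the slowest unit, since each unit hops only over its non-zero weights.
    print("naive cycles:", max(naive), "balanced cycles:", max(balanced))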
In one embodiment, hardware support is provided so that load balancing may be performed on the fly. For example, offline processing may be performed to work out an optimal reordering of the IFMs and to perform a reordering of the OFMs. In one embodiment, remapping logic and remapping tables are supported to specify a variable remapping that is performed during hardware execution of the network.
As previously discussed, reordering may result in an equivalent version of the same network, such as by swapping feature maps for different layers and swapping the corresponding weights (e.g., swapping maps 2 and 10 and swapping the weights that correspond to maps 2 and 10). However, in one embodiment, the reordering includes generating additional remapping tables to aid hardware in a neural processing unit. The remapping tables may instruct hardware to perform a swapping. For example, a remapping table may instruct hardware for output map 3 to swap input maps 2 and 10.
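A minimal sketch of how such a table could be consumed is shown below (the table format and the function and variable names are hypothetical; the swap of input maps 2 and 10 for output map 3 follows the example above):

    # Remapping table: for each output map, the pairs of input maps to swap
    # before applying the stored (reordered) weights.
    remap_table = {3: [(2, 10)]}

    def gather_inputs(output_map, input_maps, table):
        # Return the input maps in the order the reordered weights expect.
        order = list(range(len(input_maps)))
        for a, b in table.get(output_map, []):
            order[a], order[b] = order[b], order[a]
        return [input_maps[i] for i in order]

    input_maps = ["ifm%d" % i for i in range(12)]
    print(gather_inputs(3, input_maps, remap_table))   # ifm2 and ifm10 are exchanged for output map 3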
As previously discussed, a number of different data compression algorithms can be used for the weights, such as, but not limited to, Huffman coding or any other suitable compression algorithm, such as Golomb-Rice coding. Compression performance can depend on the organization of the data to be compressed. For example, compression can rely primarily on making predictions and representing the differences versus the prediction with a variable number of bits, with the more commonly occurring values compressed with fewer bits.
Example embodiments can be deployed as an electronic device including a processor and memory storing instructions. Furthermore, it will be appreciated that embodiments can be deployed as a standalone device or deployed across multiple devices in a distributed client-server networked system.
A non-limiting example of an execution environment for embodiments of the present invention is in Graphics Processing Units (GPUs). While GPUs can provide substantial computation power for implementing a NN, it can be difficult to implement a NN on a device with limited memory and/or power. Example embodiments disclosed herein can enable improved compression of neural network weight parameters for storage in a memory of a GPU and provide improved efficiency of network execution by clustering zero-value weights so they can be more effectively skipped.
Herein, a computer-readable non-transitory storage medium or media may include one or more semiconductor-based or other integrated circuits (ICs) (such as, for example, field-programmable gate arrays (FPGAs) or application-specific ICs (ASICs)), hard disk drives (HDDs), hybrid hard drives (HHDs), optical discs, optical disc drives (ODDs), magneto-optical discs, magneto-optical drives, floppy diskettes, floppy disk drives (FDDs), magnetic tapes, solid-state drives (SSDs), RAM-drives, SECURE DIGITAL cards or drives, any other suitable computer-readable non-transitory storage media, or any suitable combination of two or more of these, where appropriate. A computer-readable non-transitory storage medium may be volatile, non-volatile, or a combination of volatile and non-volatile, where appropriate.
Herein, “or” is inclusive and not exclusive, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, “A or B” means “A, B, or both,” unless expressly indicated otherwise or indicated otherwise by context. Moreover, “and” is both joint and several, unless expressly indicated otherwise or indicated otherwise by context. Therefore, herein, “A and B” means “A and B, jointly or severally,” unless expressly indicated otherwise or indicated otherwise by context.
The scope of this disclosure encompasses all changes, substitutions, variations, alterations, and modifications to the example embodiments described or illustrated herein that a person having ordinary skill in the art would comprehend. The scope of this disclosure is not limited to the example embodiments described or illustrated herein. Moreover, although this disclosure describes and illustrates respective embodiments herein as including particular components, elements, features, functions, operations, or steps, any of these embodiments may include any combination or permutation of any of the components, elements, features, functions, operations, or steps described or illustrated anywhere herein that a person having ordinary skill in the art would comprehend. Additionally, although this disclosure describes or illustrates particular embodiments as providing particular advantages, particular embodiments may provide none, some, or all of these advantages.
While the invention has been described in conjunction with specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. The present invention may be practiced without some or all of these specific details. In addition, well known features may not have been described in detail to avoid unnecessarily obscuring the invention. In accordance with the present invention, the components, process steps, and/or data structures may be implemented using various types of operating systems, programming languages, computing platforms, computer programs, and/or computing devices. In addition, those of ordinary skill in the art will recognize that devices such as hardwired devices, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), or the like, may also be used without departing from the scope and spirit of the inventive concepts disclosed herein. The present invention may also be tangibly embodied as a set of computer instructions stored on a computer readable medium, such as a memory device.
The present application claims the benefit of U.S. Provisional Application No. 62/336,493 filed May 13, 2016, the contents of which are hereby incorporated by reference.