The present disclosure relates to data processing. More specifically, the present disclosure relates to devices and methods for operating and compressing neural networks.
Artificial Neural Networks (ANN), for instance in the form of convolutional neural networks (CNN), are being implemented in more and more electronic devices for a variety of different purposes, such as image or speech processing. ANNs, however, are usually demanding with respect to computational resources, consume a significant amount of energy, for instance due to frequent memory accesses by the layers of an ANN, and have a large memory footprint. For large ANNs with many neural network weights per layer, the memory footprint may become very substantial. Therefore, it is a challenge to implement large ANNs on electronic devices with reduced hardware capabilities in terms of processing power, memory and energy storage, such as smartphones, intelligent cameras or other types of IoT devices. Even in cloud computing, where the processing capabilities, the memory and the power resources of cloud servers are abundant, reducing the memory footprint of ANNs may provide advantages.
To address the issues mentioned above, there have been some suggestions to use compression techniques for decreasing the size, i.e., the overall memory footprint, of ANNs, typically subject to an error (accuracy) constraint, thereby obtaining a smaller neural network that is (almost) as capable as the original neural network.
It is an object of the present disclosure to provide improved devices and methods for operating and compressing neural networks.
The foregoing and other objects are achieved by the subject matter of the independent claims. Further embodiments are apparent from the dependent claims, the description and the figures.
According to a first aspect, a data processing apparatus for operating and compressing a neural network (also referred to as neural network model) is disclosed. The data processing apparatus comprises a processing circuitry configured to operate the neural network, wherein the neural network comprises a plurality of processing layers, wherein each processing layer comprises, i.e. is defined by, a plurality of neural network weights. The processing circuitry is further configured to compress the neural network, wherein for compressing the neural network the processing circuitry is configured to quantize the plurality of neural network weights of each processing layer using a respective quantization bin size and to encode the plurality of quantized neural network weights of each processing layer for obtaining a compressed neural network. The processing circuitry is further configured to determine for each processing layer a norm based on the plurality of neural network weights of each processing layer and to determine the respective quantization bin size for each processing layer based on the norm of the respective processing layer.
In an embodiment, the processing circuitry is configured to determine the norm of the respective processing layer based on the plurality of neural network weights of the respective processing layer as the square root of the sum of squares of the plurality of neural network weights of the respective processing layer (also known as L2 norm).
In an embodiment, the processing circuitry is configured to determine the respective quantization bin size for each processing layer based on the norm of the respective processing layer such that the respective quantization bin size is proportional to the norm of the respective processing layer.
In an embodiment, the processing circuitry is configured to determine the respective quantization bin size for each processing layer as a product of the norm of the respective processing layer and a proportionality constant, wherein the proportionality constant is substantially the same for all processing layers.
In an embodiment, the processing circuitry is configured to determine the proportionality constant based on an adjustable target quantization error.
In an embodiment, the processing circuitry is further configured to determine a quantization error (induced by quantizing the plurality of neural network weights of each processing layer using the respective quantization bin size) and to determine the proportionality constant to be the largest proportionality constant, or close to the largest proportionality constant, for which the determined quantization error is still smaller than or equal to the target quantization error. The quantization error may be a local quantization error per processing layer or a global quantization error produced by all processing layers.
In an embodiment, the processing circuitry is configured to determine the proportionality constant using a giant-step baby-step scheme.
In an embodiment, the processing circuitry is configured to encode the plurality of quantized neural network weights of each processing layer using an entropy encoding scheme, in particular a Huffman encoding scheme, an Arithmetic encoding scheme and/or an Asymmetric Numeral Systems, ANS, encoding scheme.
In an embodiment, the data processing apparatus further comprises a volatile or non-volatile memory, wherein the volatile or non-volatile memory, in particular a RAM, is configured to store the compressed neural network, i.e. for each processing layer the plurality of quantized and encoded neural network weights.
In an embodiment, for operating the neural network the processing circuitry is further configured to decompress the compressed neural network layer by layer. For instance, the layer may be loaded from the RAM to a cache memory.
In an embodiment, the processing circuitry is further configured to compress input data of the neural network.
In an embodiment, the plurality of processing layers comprises one or more sparse processing layers.
According to another aspect a computer-implemented data processing method is provided, wherein the method comprises a step of operating a neural network, wherein the neural network comprises a plurality of processing layers, wherein each processing layer comprises a plurality of neural network weights. The method comprises a further step of determining for each processing layer a norm based on the plurality of neural network weights of each processing layer. Moreover, the method comprises a step of determining a respective quantization bin size for each processing layer based on the norm of the respective processing layer. The method comprises a further step of compressing the neural network by quantizing the plurality of neural network weights of each processing layer using a respective quantization bin size and by encoding the plurality of quantized neural network weights of each processing layer for obtaining a compressed neural network.
The data processing method can be performed by the data processing apparatus according to the first aspect. Thus, further features of the data processing method according to the second aspect result directly from the functionality of the data processing apparatus according to the first aspect and its embodiments described above and below.
According to a further aspect a computer program or a computer program product is provided, comprising a computer-readable storage medium carrying program code which causes a computer or a processor to perform the data processing method according to the second aspect when the program code is executed by the computer or the processor.
The different aspects of the present disclosure can be implemented in software and/or hardware.
Details of one or more embodiments are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description, drawings, and claims.
In the following, embodiments of the present disclosure are described in more detail with reference to the attached figures and drawings, in which:
In the following, identical reference signs refer to identical or at least functionally equivalent features.
In the following description, reference is made to the accompanying figures, which form part of the disclosure, and which show, by way of illustration, specific aspects of embodiments of the present disclosure or specific aspects in which embodiments of the present disclosure may be used. It is understood that embodiments of the present disclosure may be used in other aspects and comprise structural or logical changes not depicted in the figures. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims.
For instance, it is to be understood that a disclosure in connection with a described method may also hold true for a corresponding device or system configured to perform the method and vice versa. For example, if one or a plurality of specific method steps are described, a corresponding device may include one or a plurality of units, e.g. functional units, to perform the described one or plurality of method steps (e.g. one unit performing the one or plurality of steps, or a plurality of units each performing one or more of the plurality of steps), even if such one or more units are not explicitly described or illustrated in the figures. On the other hand, for example, if a specific apparatus is described based on one or a plurality of units, e.g. functional units, a corresponding method may include one step to perform the functionality of the one or plurality of units (e.g. one step performing the functionality of the one or plurality of units, or a plurality of steps each performing the functionality of one or more of the plurality of units), even if such one or plurality of steps are not explicitly described or illustrated in the figures. Further, it is understood that the features of the various exemplary embodiments and/or aspects described herein may be combined with each other, unless specifically noted otherwise.
As illustrated in
The processing circuitry 101 of the server 100 may be implemented in hardware and/or software. The hardware may comprise digital circuitry, or both analog and digital circuitry. Digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or general-purpose processors. The memory 103 may store executable program code which, when executed by the processing circuitry 101, causes the server 100 to perform the functions and methods described herein.
As will be described in more detail in the following under further reference to
Before describing more detailed embodiments of the data processing apparatus 100 for compressing the plurality of processing layers, some further technical background is introduced in the following.
Compression of data generally consists of two phases, namely quantization and encoding. In the quantization phase, the number of unique values (symbols) is reduced. As the number of unique symbols gets smaller, so does the entropy and, consequently, so does the number of bits required for the representation of the layer. This may be an issue when handling neural networks, since each processing layer may have a different number of parameters, i.e. neural network weights, and may follow a different distribution, which carries a different entropy. Generally, it may be beneficial to quantize each layer at a rate that maintains proximity with respect to the original layer distribution. Still, representing the original layers' symbols at a rate that is lower than their original entropy introduces distortion (i.e., a quantization error). Finding a solution that quantizes the model at the lowest possible bit-rate while satisfying a certain quantization error (distortion) requirement (i.e. a target quantization error threshold) is at the heart of quantization optimization problems and is known as the rate-distortion problem. Further, a compression scheme that quantizes each layer at a different rate is referred to as a mixed-bit compression scheme.
In the encoding phase, the symbol statistics are gathered, and the quantized weights are compressed with an asymptotically optimal entropy compressor (e.g., Huffman coding, arithmetic coding, ANS, and the like), which assigns each symbol a length that is inversely proportional to its probability. Besides reducing the number of symbols, the quantization may have an extensive effect on the compression since it produces unique symbol statistics (a distribution), and hence, a unique entropy. Thus, different quantization schemes may lead to different entropy values even when the quantization rates are practically the same. Generally, an entropy encoder compresses these quantized values to their entropy limit without introducing further errors (i.e., lossless coding). Accordingly, for optimizing the compression of a neural network, the objective is to choose the quantization parameters that achieve the largest compression ratio while satisfying the quantization error constraint.
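By way of a non-limiting illustration, the following Python sketch (using NumPy; the function and variable names are chosen for this illustration only and do not form part of the disclosure) shows how uniform quantization reduces the number of unique symbols and thereby the empirical entropy, i.e., the lower bound on the average number of bits per weight that a lossless entropy encoder can reach:

```python
import numpy as np
from collections import Counter

def empirical_entropy(symbols):
    """Empirical (Shannon) entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * np.log2(c / n) for c in counts.values())

rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=10_000).astype(np.float32)   # toy layer weights

bin_size = 0.01                                                    # quantization bin size
symbols = np.round(weights / bin_size).astype(np.int32)            # uniform scalar quantization

print("unique symbols before quantization:", len(np.unique(weights)))
print("unique symbols after quantization :", len(np.unique(symbols)))
print("entropy after quantization [bits/weight]:", empirical_entropy(symbols.tolist()))
```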
In a typical setting, optimization of the quantization parameters requires finding for each processing layer a quantization range ([min, max]) and a number of quantization levels N (or, equivalently, a quantization rate R = log2(N)), from which the quantization error (distortion) is measured over each layer separately. This per-layer distortion approach is simpler and can be analyzed with the standard rate-distortion theory. Nonetheless, it misses the true purpose of the optimization by focusing on each layer independently, instead of focusing on the whole neural network. Moreover, such compression schemes ignore the propagating error effect that spreads the distortion to the consecutive layers in addition to their own quantization error. Addressing this error propagation is complex, as the parameter optimization search space in the quantization process grows exponentially with the number of processing layers of the neural network. This makes an exhaustive search practically impossible even for moderate-size neural networks.
As already mentioned above, the goal of quantization is reducing the number of symbols before the (lossless) compression. It has been shown that uniform scalar quantization is (asymptotically) optimal when one intends to further compress the quantized data. In other words, non-uniform scalar quantization techniques do not yield better compression than simple uniform scalar quantization. This is because uniform quantization maintains the probabilistic characteristics of the weights, and hence, facilitates reaching (asymptotically) the entropy limit.
For a variable $W$ taking values in the interval $[\min, \max]$, a scalar uniform quantizer $q_U(W)$ with $N$ quantization intervals (i.e., quantization bins) of size $\Delta = (\max - \min)/N$ partitions the interval $[\min, \max]$ uniformly, such that the partition boundaries are in $\{\min + j\Delta,\; j = 0, 1, \ldots, N\}$ (as illustrated in the figures).
Often, it is more convenient to analyze the uniform quantizer in terms of the quantization rate R = log2 N instead of the number of quantization bins N. This rate R essentially depicts the number of bits required to index the quantization bins. To analyze this quantizer, a high-rate regime (R >> 1) is considered, where the probability curve in each quantization bin Δj is nearly flat, as illustrated in the figures. In this regime, the mean squared quantization error is approximately given by the following equation (1): $D(R) \approx \frac{\Delta^2}{12} \approx \frac{(\max-\min)^2}{12}\, 2^{-2R}$.
This relation between the quantization rate and its induced distortion is well-known. Quantizing at a lower rate R induces a larger distortion and, vice versa, a lower distortion requirement calls for a higher rate R. This behavior is illustrated in the figures.
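This rate-distortion relation can be checked numerically. The following sketch (illustrative only; the helper name and the Gaussian toy data are assumptions of this example) implements a scalar uniform quantizer over the range [min, max] and compares the measured mean squared error with the high-rate approximation Δ²/12:

```python
import numpy as np

def uniform_quantize(w, rate_bits):
    """Scalar uniform quantizer over [w.min(), w.max()] with 2**rate_bits levels."""
    w_min, w_max = w.min(), w.max()
    delta = (w_max - w_min) / (2 ** rate_bits - 1)        # conventional bin width
    return w_min + np.round((w - w_min) / delta) * delta, delta

rng = np.random.default_rng(0)
w = rng.normal(0.0, 1.0, size=100_000)

for rate in (4, 6, 8):
    q, delta = uniform_quantize(w, rate)
    mse = np.mean((w - q) ** 2)
    print(f"R={rate}: measured MSE={mse:.3e}, approximation delta^2/12={delta**2 / 12:.3e}")
```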
As already described above, after quantizing the neural network weights, they are compressed with entropy encoders. For completeness, a short background on entropy-achieving compression schemes is provided in the following.
An optimal compression scheme allocates to each symbol a bit-length that is inversely proportional to its probability. That is, common symbols are represented by fewer bits than rare ones. The difference between entropy encoders is inherent in the apparatus that captures the dependencies between the symbols. A practical perspective is to consider these encoders as finite-state automata. In this sense, Huffman coding is the simplest and fastest encoder, which utilizes a single state. That is, for every input alphabet symbol, the encoder outputs the corresponding prefix-free code from a lookup table. Nevertheless, Huffman coding must allocate an integer number of bits per symbol, and hence, can get quite far from the entropy limit, which allows a fractional number of bits per symbol.
Arithmetic coding may remedy this drawback of Huffman coding by allowing a fractional number of bits per symbol, and hence is asymptotically optimal. Yet, in terms of the number of states, arithmetic coding may get exponentially large, as it counts all previous symbols to code the next one. This involves a lot of arithmetic, which may make the implementation cumbersome in terms of memory and latency.
J. Duda, “Asymmetric numeral systems,” arXiv preprint arXiv:0902.0271, 2009 suggested an encoding scheme that is based on Asymmetric Numeral Systems (ANS), which bridges between Huffman coding and arithmetic coding. That is, it provides lossless compression at very high compression and decompression speeds. In terms of the finite-state size, it facilitates configuring the encoding table size. Further, it utilizes simple arithmetic (only shifts and additions), which has efficient hardware implementations.
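To make the integer-bits limitation of Huffman coding mentioned above concrete, consider a heavily skewed binary source; the following minimal sketch (the probability value is illustrative) compares the entropy limit with the one bit per symbol that any prefix-free code over a binary alphabet must spend:

```python
import math

p = 0.95                                    # probability of the common symbol (illustrative)
entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# A prefix-free code over a two-symbol alphabet (e.g., Huffman) must spend at least one
# full bit per symbol, whereas the entropy limit allows a fractional number of bits.
print(f"entropy limit      : {entropy:.3f} bits/symbol")   # about 0.286 bits/symbol
print("Huffman (2 symbols): 1.000 bits/symbol")
```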
Embodiments disclosed herein make use of the finding that rotations are induced by the linear operations defined by the plurality of processing layers 201a-n of the neural network 200. These rotations make the compression optimization intricate, as they affect the quantization parameters of successive processing layers 201a-n. For a mixed-bit scheme, typical quantization requires finding for each processing layer i: (a) the quantization range [min, max]i and (b) the quantization rate Ri (an integer). From these, the conventional width of the quantization bin is given by the following equation (2): $\Delta_i = \frac{(\max-\min)_i}{2^{R_i}-1}$.
Yet, the quantization range is sensitive to rotations. Geometrically, each processing layer 201a-n may have its own orientation (direction), length and dimension. The neural network weights of each processing layer 201a-n get rotated (and stretched) by its input, and the output of this layer rotates the successive layer's weights, and so on and so forth. The series of rotations between layers determines the model's output, and hence its performance (as will be explained in more detail in the context of the figures).
To illustrate how rotations affect the quantization range, a simple example is provided with the plurality of weights of a layer expressed as a weight vector w = (1, 0). Thus, in this case max − min = 1 and the resulting bin width = (max − min)/(2^R − 1) = 1/255 in the case of 8-bit quantization. However, after a rotation of 45° in the counterclockwise direction, the resulting weight vector is w′ = (1/√2, 1/√2) ≈ (0.707, 0.707), and one obtains max − min = 0, as well as a bin width of 0. Consequently, the conventional methodology yields a different quantization based on the exact rotation of the weights.
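The effect described above can be reproduced numerically. The following sketch (illustrative only) rotates the example weight vector by 45° and shows that the range-based bin width collapses to zero while the L2 norm, which the embodiments disclosed herein rely on, remains unchanged:

```python
import numpy as np

w = np.array([1.0, 0.0])                         # the example weight vector

theta = np.deg2rad(45.0)                         # 45 degree counterclockwise rotation
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
w_rot = rotation @ w                             # -> (1/sqrt(2), 1/sqrt(2))

def range_based_bin_width(vec, rate_bits=8):
    """Conventional bin width (max - min) / (2**R - 1)."""
    return (vec.max() - vec.min()) / (2 ** rate_bits - 1)

print("bin width before rotation:", range_based_bin_width(w))        # 1/255
print("bin width after rotation :", range_based_bin_width(w_rot))    # 0.0
print("L2 norm before / after   :", np.linalg.norm(w), np.linalg.norm(w_rot))  # both 1.0
```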
Moreover, using a conventional quantization may induce a quantization error requirement that is too small. In the following, it will be shown based on equations (1) and (2) that the induced quantization error in the cosine-similarity case scales as O(|dim(Wi)|/∥Wi∥²). For the relative error, the resulting quantization error scales as O(√|dim(Wi)|/∥Wi∥).
The error becomes smaller as the weights' norm increases, which may be too restrictive, and hence, will induce fine quantization that translates to a poor compression ratio. Another drawback of the conventional schemes is focusing on quantization processes that produce an integer number of bits. This is typically not optimal in terms of compression, as other representations that were quantized at a fractional rate may further reduce the resulting entropy and thus, attain a higher compression ratio.
Considering the scenario where the layer weights are sparse (i.e., many weight values are inherently zero), making use of compressed formats such as compressed sparse row (CSR) speeds up the computation, since mathematical operations that involve zeros may be skipped. However, memory-wise the CSR format is in general not optimal, as it allocates memory for both the values and their indices without encoding them at all. Combining the embodiments disclosed herein with weight sparsity may yield remarkable compression ratios.
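Purely as an illustration of this memory trade-off (the use of SciPy and the toy sparsity level are assumptions of this sketch and not part of the disclosure), the following snippet compares the raw CSR footprint of a sparse weight matrix with its dense footprint; the values stored in the CSR data array could additionally be quantized and entropy-encoded as described herein:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.05, size=(1024, 1024)).astype(np.float32)
dense[rng.random(dense.shape) < 0.9] = 0.0        # make roughly 90% of the weights zero

sparse = csr_matrix(dense)
csr_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes

print("dense footprint [bytes]:", dense.nbytes)
print("CSR footprint   [bytes]:", csr_bytes)      # non-zero values + column indices + row pointers
```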
Another disadvantage of conventional compression schemes is the optimization time and complexity required to find a suitable solution. That is, in many prior art approaches the parameterization must be tuned during the network training. Other, post-training methods perform a layer-by-layer optimization, which is typically cumbersome and requires dedicated hardware (e.g., GPUs).
As already described above, the data processing apparatus 100 shown in
The processing circuitry 101 of the data processing apparatus 100 is further configured to determine, i.e. compute for each processing layer 201a-n a norm based on the plurality of neural network weights of each processing layer 201a-n and to determine the respective quantization bin size for each processing layer 201a-n based on the calculated norm of the respective processing layer 201a-n.
For the optimization, embodiments disclosed herein focus on error constraints at the output 205 of the neural network 200. This is manifestly different from the conventional approach, since the local quantization errors in each processing layer 201a-n contribute differently to the resulting error at the output 205 of the neural network 200, and the standard rate-distortion theory may not hold directly for this case. Embodiments disclosed herein address the quantization parameters optimization by formulating the optimization problem at hand into a single parameter optimization, which allows entangling the plurality of processing layers 201a-n of the neural network 200. Deriving bounds on this single parameter search-space and using a nested giant-step baby-step approach, as implemented by embodiments of the data processing apparatus 100, makes the parameter search extremely efficient and fast.
Embodiments of the data processing apparatus 100 implement a mixed-bit fractional quantization scheme that allows for an efficient search over the generalized rate-distortion curve. Embodiments disclosed herein perform a rotation invariant quantization that is both scalable (with the number of processing layers 201a-n) and extremely fast. The embodiments disclosed herein may be extended to a full-quantization, where the data processing apparatus 100 is configured to compress both the neural network weights of the plurality of processing layers 201a-n as well as the input data (also referred to as activations). Moreover, embodiments disclosed herein provide improved compression results for a neural network 200 including one or more sparse processing layers 201a-n, because the abundance of zero weight values makes the network highly compressible.
Embodiments of the data processing apparatus 100 disclosed herein allow maximizing the neural network compression ratio subject to an error constraint at the output 205 of the neural network 200. This may be regarded as a generalized rate-distortion problem, as the distortion requirement is at the output 205 of the neural network 200. Moreover, embodiments disclosed herein provide a mixed-bit solution, where the i-th processing layer, i.e. one of the processing layers 201a-n, gets quantized with a rate of Ri bits. Embodiments disclosed herein allow efficiently addressing the exponentially large parameter search space, which scales with the number of processing layers 201a-n. Furthermore, embodiments disclosed herein allow finding a parameter solution as fast as possible for meeting time constraints (i.e., a few minutes or less).
As already described above, the processing circuitry 101 of the data processing apparatus 100 is configured to quantize each processing layer 201a-n of the neural network 200 based on its norm, in particular its L2 norm (which can be regarded as a measure of the size or length of the respective processing layer 201a-n). This approach is based on the finding that the norm, e.g. the length, of a respective processing layer 201a-n is invariant to rotations. In an embodiment, for the i-th processing layer, i.e. one of the processing layers 201a-n, the processing circuitry 101 of the data processing apparatus 100 is configured to choose the quantization bin size Δi∝∥Wi∥ as large as possible. As already mentioned, a larger quantization bin size Δi leads to a larger entropy gain, and hence, a higher compression ratio. As will be appreciated and schematically illustrated in
As will be appreciated, as a result of the approach implemented by the processing circuitry 101 of the data processing apparatus 100 to quantize each processing layer 201a-n of the neural network 200 based on its norm, the quantization rate essentially used by the data processing apparatus 100 is not restricted to the conventional integer rates. This allows considering non-integer quantization rates that are more compressible.
Since the error constraint is imposed at the output 205 of the neural network 200 and the quantization error of each individual processing layer 201a-n contributes to this global output error, embodiments of the data processing apparatus 100 implement a parameter search that utilizes a single parameter to set the local target error ϵi and its corresponding quantization bin size Δi for each processing layer 201a-n, instead of the conventional approach of optimizing the quantization range (max−min)i and the quantization rate Ri parameters. As already mentioned, it may be beneficial to set the local target error ϵi as large as possible to obtain a higher compression ratio. In an embodiment, the processing circuitry 101 of the data processing apparatus 100 may be configured to set the local quantization error ϵi in the order of O(|dim(Wi)|), which is larger by a factor of ∥Wi∥ than the typical quantization error in the cosine distance and the relative error distortion metric.
Instead of optimizing layer by layer, as conventionally suggested, for finding the optimal quantization bin size Δi for each processing layer 201a-n in a fast way, the processing circuitry 101 of an embodiment of the data processing apparatus 100 is configured to scale the quantization bin sizes Δi of all processing layers 201a-n together during the search by using a single optimization parameter. Thus, embodiments disclosed herein maintain the bin width in each layer proportional to its norm, so that the proportions between the "errors per layer" remain similar throughout the optimization process, as can be seen in the figures.
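A minimal sketch of this norm-proportional, single-parameter quantization is given below; the function names, the plain proportionality constant c and the toy layer shapes are illustrative assumptions and not the claimed implementation:

```python
import numpy as np

def quantize_layer(weights, bin_size):
    """Uniform scalar quantization with the given bin size; returns integer symbols."""
    return np.round(weights / bin_size).astype(np.int64)

def dequantize_layer(symbols, bin_size):
    return symbols.astype(np.float32) * bin_size

def quantize_network(layers, c):
    """Quantize every layer with a bin size proportional to its L2 norm: delta_i = c * ||W_i||."""
    symbols_per_layer, bin_sizes = [], []
    for w in layers:
        delta = c * np.linalg.norm(w)             # the single parameter c is shared by all layers
        symbols_per_layer.append(quantize_layer(w, delta))
        bin_sizes.append(delta)
    return symbols_per_layer, bin_sizes

rng = np.random.default_rng(0)
layers = [rng.normal(0.0, 0.05, size=shape).astype(np.float32)
          for shape in ((64, 32), (32, 32), (32, 10))]       # toy weight tensors

symbols, deltas = quantize_network(layers, c=1e-4)
reconstructed = [dequantize_layer(s, d) for s, d in zip(symbols, deltas)]
```

Scaling the single constant c up or down scales all per-layer bin sizes together, which corresponds to the single-parameter search discussed above; the integer symbols would subsequently be passed to an entropy encoder.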
Even with a single optimization parameter, e.g. the proportionality constant between the norm of each layer 201a-n and its quantization bin size, the search space may be quite large. To address this and make the search computationally more feasible, in an embodiment, the processing circuitry 101 of the data processing apparatus 100 is configured to determine and/or set an upper and a lower bound for the parameter search space and to implement a fast converging "giant-step baby-step" search scheme, which allows finding the optimized single parameter, e.g. the proportionality constant, in O(√|Ω|) steps, where |Ω| is the search space size. An exemplary search is illustrated in the figures.
From a system point-of-view, given a pretrained neural network 200 with neural network weight tensors {Wi}, embodiments of the data processing apparatus 100 allow obtaining the smallest (quantized and compressed) version of this neural network 200, which attains an output that is as close as possible to the output 205 of the original, i.e. non-compressed neural network 200. To assess the fidelity of the quantized neural network, an input X may be sent through the original neural network 200 and in parallel through the quantized neural network, as illustrated in
As the input X propagates through the original neural network 200 and the quantized version thereof, the input X can be regarded to be rotated and maybe stretched by each processing layer 201a-n. When the quantized and the original layers' weights have a similar length (norm), the quantization errors are essentially reflected in rotation shifts, occurring in each of the quantized layers. That is, each quantized layer produces a rotation error into its output, and this error keeps propagating and accumulating over the layers until it reaches the output of the quantized neural network. Note that these rotation shifts may be constructive, and hence, increase the distance, or destructive, and hence, decrease the distance in each layer, with respect to the processing layers 201a-n of the original neural network 200.
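The fidelity assessment described above may, by way of a non-limiting example, be sketched as follows for a toy fully connected network; the two-layer ReLU model, the cosine-distance helper and all names are assumptions of this illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
layers = [rng.normal(0.0, 0.1, size=(128, 64)), rng.normal(0.0, 0.1, size=(64, 10))]

def forward(x, weights):
    """Toy forward pass: ReLU between layers, linear output layer."""
    h = x
    for w in weights[:-1]:
        h = np.maximum(h @ w, 0.0)
    return h @ weights[-1]

def quantize(w, c):
    delta = c * np.linalg.norm(w)                 # bin size proportional to the layer norm
    return np.round(w / delta) * delta

def cosine_distance(a, b):
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

x = rng.normal(size=128)                          # calibration (or random) input
quantized_layers = [quantize(w, c=1e-3) for w in layers]

y_original = forward(x, layers)
y_quantized = forward(x, quantized_layers)
print("cosine distance at the output:", cosine_distance(y_original, y_quantized))
```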
In case the output error is sufficiently small (i.e., below a pre-defined threshold), the processing circuitry 101 of the data processing apparatus 100 in an embodiment is configured to compress the neural network 200 layer by layer, using, for instance, an asymptotically optimal encoder, such as the ANS. According to rate-distortion theory, the average length of a symbol in the i-th processing layer 201a-n after compression is given by the following equation (3): $\bar{l}_i \approx H\big(q_{\Delta_i}(W_i)\big)$,
where H(⋅) is the entropy function $H(X) = -\sum_{x \in \mathcal{X}} p_x \log p_x$. In other words, the quantization reduces the entropy by −log(Δi), thus, a larger quantization bin size Δi yields a larger entropy gain (and, thus, a larger compression). Therefore, in an embodiment, the processing circuitry 101 of the data processing apparatus 100 is configured to determine a quantized version of the neural network 200 that satisfies the target quantization error requirement, while providing the largest possible quantization bin size Δi.
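The effect of the bin size on the attainable compression can also be observed numerically; in the sketch below (illustrative only), doubling the quantization bin size Δ lowers the empirical entropy of the quantized symbols, i.e., the attainable average code length, by roughly one bit per weight:

```python
import numpy as np
from collections import Counter

def empirical_entropy(symbols):
    """Empirical entropy in bits per symbol; the limit of an optimal lossless encoder."""
    counts = np.array(list(Counter(symbols.tolist()).values()), dtype=np.float64)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=200_000)

for delta in (1e-3, 2e-3, 4e-3):
    symbols = np.round(w / delta).astype(np.int64)
    print(f"delta={delta:.0e}: entropy = {empirical_entropy(symbols):.3f} bits/weight")
```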
As already described above, conventional quantization parameter optimization usually involves optimizing the quantization range [min, max]i and the quantization rate Ri for each layer. Even though optimizing the range and rate is a good strategy for quantizing a single layer, where the distortion is measured locally, it gets extremely hard to optimize when considering the whole neural network 200 and its distortion at the output 205, as already described above.
Before describing further embodiments of the data processing apparatus 100, in the following the conventional scalar uniform quantizer is revisited, and in particular, the connection of the quantization bin size Δ to the resulting quantization error. From equation (1) above and the law of large numbers, one will appreciate that the following relations hold: $\frac{1}{\dim(W)}\|W - q_U(W)\|^2 \approx \mathrm{E}\big[(W - q_U(W))^2\big] \approx \frac{\Delta^2}{12}$.
Based on this mean square error (MSE) other error criteria may be determined.
For example, the relative error criterion (mean absolute percentage error) may be obtained by taking a square root and normalizing by the layer norm in the following way: $\epsilon_{\mathrm{rel}} \approx \frac{\|W - q_U(W)\|}{\|W\|} \approx \sqrt{\frac{\dim(W)}{12}}\,\frac{\Delta}{\|W\|}$. Or, in terms of $\Delta_{\mathrm{rel}}$: $\Delta_{\mathrm{rel}} \approx \sqrt{\frac{12}{\dim(W)}}\,\epsilon_{\mathrm{rel}}\,\|W\|$.
A cosine distance criterion may be derived based on the finding that the cosine distance is equivalent to the Euclidean distance of normalized vectors up to a constant. Thus, assuming that the norms ∥q(W)∥≈∥W∥ are about the same, the following relation holds:
Consequently, the following equation (4) holds:
Or, in terms of Δcos the following relation (5):
In the following, by way of example, the cosine distance analysis will be described in more detail. Based on equations (4) and (5) above it may be appreciated that the quantization bin size Δ can be selected so that it is proportional to the norm ∥W∥ by letting the local quantization errors scale with dim (W). In an embodiment, setting
where k is the parameter to be optimized. Yet, as k gets larger the local errors get smaller, which yields small quantization bin sizes Δ that may be bad for compression. Thus, to enforce quantization and a quantization error, in particular, a small fixed constant ϵ̃ may be added. That is,
The corresponding quantization bin size is:
The resulting fractional quantization rate for the i-th processing layer 201a-n may be expressed by the following equation (6):
This fractional rate Ri is induced by the largest Δi possible, which is substantial for compression. However, practically, after decoding the symbols, they may be stored using one of the nominal representations (e.g., fp32, fp16, int8, and the like).
Based on equation (6) the following entropy gain may be obtained:
As will be appreciated, since the norm of a layer is invariant to rotations, setting a quantization bin size Δ that scales in the same way as the norm, yields a rotation invariant quantization. In other words, the quantization fidelity is independent of the layer's input as it rotates the layer's weights. As will be further appreciated, choosing a different k and, thus, a different proportionality constant for each layer 201a-n would break this rotation invariance.
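This rotation invariance can be checked with a small numerical experiment. The sketch below (all helpers are illustrative) applies a random orthogonal transform to a toy weight vector and evaluates the norm-proportional bin size and its induced cosine-distance error before and after the transform:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 512
w = rng.normal(0.0, 0.05, size=dim)               # toy layer weights

# Random orthogonal matrix (a high-dimensional rotation/reflection) via QR decomposition.
q_mat, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
w_rot = q_mat @ w

def quantize(vec, delta):
    return np.round(vec / delta) * delta

def cosine_distance(a, b):
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

c = 2e-4                                          # shared proportionality constant
for name, vec in (("original", w), ("rotated ", w_rot)):
    norm = np.linalg.norm(vec)                    # invariant under the orthogonal transform
    delta = c * norm                              # hence the bin size is invariant as well
    err = cosine_distance(vec, quantize(vec, delta))
    print(f"{name}: norm={norm:.4f}, delta={delta:.2e}, cosine-distance error={err:.2e}")
```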
As already described above, the search and optimization scheme implemented by the processing circuitry 101 of the data processing apparatus 100 according to an embodiment essentially entangles all processing layers 201a-n by using a single parameter, such as the parameter k. Still, the search space Ω(k) can get quite large. In other words, finding the optimal k* that satisfies the error constraint of the compression of the neural network 200 may still be hard. To search for the optimal k*, the processing circuitry 101 of the data processing apparatus 100 according to an embodiment is configured to limit the actual search range for the optimal parameter. In the case of the cosine distance, an upper bound of the search range may be determined by the processing circuitry 101 of the data processing apparatus 100 based on the following relation:
where i* is the index of the largest layer (i.e. the layer with the largest norm) of the neural network 200. As will be appreciated, in the search scheme implemented by the processing circuitry 101 of the data processing apparatus 100, this is the processing layer 201a-n whose error converges to ϵ̃ last. This can be used for defining where to stop the search. In an embodiment, for a sufficiently large k, the error is approximately √ϵi* = ϵ̃ + o(ϵ̃). At this point, the error converges even at the largest layer (and hence, for the rest of the layers as well, which have a smaller norm).
Namely,
which happens when
For an exemplary value of ϵ̃ = 0.01, the upper limit is
For a lower bound of the search range, the fact may be exploited that ϵi≤1 in the cosine distance criterion. Thus, focusing on the largest layer i* again, it may be observed that
which happens when
Thus, in an embodiment, the processing circuitry 101 of the data processing apparatus 100 may be configured to limit the search to the following range:
For further improving the search time the processing circuitry 101 of the data processing apparatus 100 may be further configured to implement a nested giant-step baby-step search approach that will be described in more detail in the following.
At the beginning, in the "giant-step" stage, only √|Ω(k)| values of k are evaluated. The smallest k that suffices in terms of the output error then becomes the new upper limit, and the search region is refined to a smaller region of k that has, again, only √|Ω(k)| values to inspect. This continues repeatedly, until the search step is sufficiently small. This allows finding a good solution in relatively few iterations (e.g. less than 40 iterations).
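A minimal sketch of such a nested coarse-to-fine ("giant-step baby-step") search over the single parameter k is given below; the fixed grid size, the search bounds, the toy feasibility oracle and the monotonicity assumption (a larger k never increases the output error) are assumptions of this sketch:

```python
def nested_search(k_min, k_max, is_feasible, min_step=1e-3, grid_points=16):
    """Coarse-to-fine search for (approximately) the smallest feasible k in [k_min, k_max].

    Feasibility is assumed to be monotone in k: once a k satisfies the output error
    constraint, every larger k does as well.  Each round inspects only a small number
    of equally spaced values ("giant steps") and then refines the interval just below
    the smallest feasible one ("baby steps").
    """
    lo, hi = k_min, k_max
    best = hi
    while (hi - lo) > min_step:
        step = (hi - lo) / grid_points
        candidates = [lo + i * step for i in range(1, grid_points + 1)]
        feasible = [k for k in candidates if is_feasible(k)]   # e.g., output error <= target
        if not feasible:
            break                                  # no candidate satisfies the error constraint
        best = min(feasible)
        hi = best                                  # smallest feasible k becomes the new upper limit
        lo = max(lo, best - step)                  # and the region just below it is refined
    return best

# Toy feasibility oracle: feasible once k exceeds an (unknown) threshold of 37.2.
result = nested_search(0.0, 100.0, is_feasible=lambda k: k >= 37.2)
print(f"found k ~= {result:.3f}")                  # converges to a value slightly above 37.2
```

In the embodiments described above, the feasibility test would quantize all processing layers 201a-n with the bin sizes induced by k, feed calibration data through the quantized neural network and compare the resulting output error against the target; that part is omitted here for brevity.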
More details of the nested giant-step baby-step search approach implemented by the processing circuitry 101 of the data processing apparatus 100 are illustrated in the flow diagram shown in
This induces a smaller entropy for this layer, which can be attained by the compression. Then, to evaluate the distortion at the output, the model is fed with calibration (or random) data (see 913 of
The compression performance provided by embodiments of the data processing apparatus 100 has been evaluated for various neural network models with different sizes using the cosine distance criterion (wherein, by way of example, the quantization must reach a cosine distance < 0.005). As a baseline for comparison, a MindSpore version 1.6 converter has been used (tested on x86), which is already highly optimized compared to other current state-of-the-art solutions. As can be taken from the table shown in
For assessing how far the solution provided by the data processing apparatus 100 according to an embodiment is from the actual optimal solution, its compression performance was compared with the compression performance provided by the Multi-Objective Bayesian Optimization (MOBO) solution, which is computationally very complex to obtain and takes long running times (a few hours to days on multiple GPUs). As can be taken from
The person skilled in the art will understand that the “blocks” (“units”) of the various figures (method and apparatus) represent or describe functionalities of embodiments of the present disclosure (rather than necessarily individual “units” in hardware or software) and thus describe equally functions or features of apparatus embodiments as well as method embodiments (unit=step).
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other manners. For example, the described apparatus embodiment is merely exemplary. For example, the unit division is merely logical function division and may be other division in actual implementation.
For example, a plurality of units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the displayed or discussed mutual couplings or direct couplings or communication connections may be implemented by using some interfaces. The indirect couplings or communication connections between the apparatuses or units may be implemented in electronic, mechanical, or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each of the units may exist alone physically, or two or more units are integrated into one unit.
This application is a continuation of International Application PCT/EP2022/071324, filed on Jul. 29, 2022, the disclosure of which is hereby incorporated by reference in its entirety.
| | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/EP2022/071324 | Jul 2022 | WO |
| Child | 19038531 | | US |