FIELD
This patent document relates generally to systems and methods for compressing weights in an artificial intelligence solution. Examples of compressing weights in an artificial intelligence semiconductor chip with variable compression ratio are provided.
BACKGROUND
Artificial intelligence solutions are emerging with the advancement of computing platforms and integrated circuit solutions. For example, an artificial intelligence (AI) integrated circuit (IC) may include an accelerator capable of performing AI tasks in embedded hardware. Hardware accelerators have recently emerged and can quickly and efficiently perform AI functions, such as voice or image recognition, at the cost of precision in the input image tensor as well as in the weights of the AI models. For example, in a hardware-based solution, such as an AI chip having an embedded convolution neural network (CNN) model, the bit-width of weights and/or parameters of the AI chip may be limited. For example, the weights of a convolution layer in the CNN in an AI chip may be constrained to 1-bit, 3-bit, or 5-bit values. Further, the memory size for storing the input and output of the CNN in the AI chip may also be limited.
In a deep convolutional neural network, compressing the weights of a CNN model to a lower bit width may be used in a hardware implementation of the network to meet the available computation power and to reduce the size of the model stored in local memory. For example, whereas most trained models use a floating point format to represent model parameters such as filter coefficients or weights, in a hardware implementation a model inside an AI chip may use a low-bit fixed point format to reduce both logic and memory space and to accelerate processing. However, direct quantization of the weights of a CNN model from floating point values to low-bit fixed point values may cause a loss of accuracy of the model and result in performance degradation of the AI chip. The performance degradation is particularly challenging for quantization of weights to a fixed point format of fewer than 8 bits.
This document is directed to systems and methods for addressing the above issues and/or other issues.
BRIEF DESCRIPTION OF THE DRAWINGS
The present solution will be described with reference to the following figures, in which like numerals represent like items throughout the figures.
FIG. 1 illustrates a diagram of an example convolution neural network in an AI chip in accordance with various examples described herein.
FIG. 2 illustrates a diagram of an example process of re-training weights of a neural network in accordance with various examples described herein.
FIG. 3A illustrates a diagram of an example process of forward propagation in re-training weights of a neural network in accordance with various examples described herein.
FIG. 3B illustrates a diagram of an example process of backward propagation in re-training weights of a neural network in accordance with various examples described herein.
FIG. 4 illustrates a flow diagram of an example process of convolution quantization in a training process in accordance with various examples described herein.
FIG. 5 illustrates an example of mask values in a 3-bit configuration in accordance with various examples described herein.
FIG. 6 illustrates a flow diagram of an example process of activation quantization in a training process in accordance with various examples described herein.
FIG. 7A illustrates a flow diagram of an example process of inference of an AI model via activation quantization in accordance with various examples described herein.
FIG. 7B illustrates an example distribution of output values of an AI model in accordance with various examples described herein.
FIG. 8 illustrates an example feature extractor that may be embedded in an AI chip in accordance with various examples described herein.
FIG. 9 illustrates an example image retrieval system in accordance with various examples described herein.
FIG. 10 illustrates various embodiments of one or more electronic devices for implementing the various methods and processes described herein.
DETAILED DESCRIPTION
As used in this document, the singular forms “a”, “an”, and “the” include plural references unless the context clearly dictates otherwise. Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art. As used in this document, the term “comprising” means “including, but not limited to.”
Examples of an “artificial intelligence logic circuit” or “AI logic circuit” include a logic circuit that is configured to execute certain AI functions such as a neural network in AI or machine learning tasks. An AI logic circuit can be a processor. An AI logic circuit can also be a logic circuit that is controlled by an external processor and executes certain AI functions.
Examples of “integrated circuit,” “semiconductor chip,” “chip,” or “semiconductor device” include an integrated circuit (IC) that contains electronic circuits on semiconductor materials, such as silicon, for performing certain functions. For example, an integrated circuit can be a microprocessor, a memory, a programmable array logic (PAL) device, an application-specific integrated circuit (ASIC), or others. An integrated circuit that contains an AI logic circuit is referred to as an AI integrated circuit.
Examples of an “AI chip” include hardware- or software-based device that is capable of performing functions of an AI logic circuit. An AI chip can be physical or virtual. For example, a physical AI chip may include an embedded cellular neural network, which may contain weights and/or parameters of a convolution neural network (CNN) model. A virtual AI chip may be software-based. For example, a virtual AI chip may include one or more processor simulators to implement functions of a desired AI logic circuit of a physical AI chip.
Examples of an “AI model” include data that include one or more weights that, when loaded inside an AI chip, are used by the AI chip in executing an AI task. For example, an AI model for a given CNN may include the weights, biases, and other parameters for one or more convolutional layers of the CNN. Here, the terms weights and parameters of an AI model are used interchangeably.
Examples of an AI task include image recognition, voice recognition, object recognition, data processing and analysis, or any recognition, classification, or processing task that employs artificial intelligence technologies.
FIG. 1 illustrates a diagram of an example CNN in an AI chip in accordance with various examples described herein. A CNN 100 may include multiple cascaded convolution layers, e.g., 102(1), 102(2), 102(3), . . . 102(M). In operation, these convolution layers may include weights stored in fixed point or floating point format. A convolution layer may produce its output in fixed point. In some examples, a convolution layer may also include an activation layer (e.g., a ReLU layer), which may also include fixed point values.
In a non-limiting example, a layer in the CNN 100 may include multiple convolutional filters, and each filter may include multiple weights. For example, the weights of a CNN model may include a mask (kernel) and a scalar for a given layer of the CNN model. The CNN may include a filter-wise scalar value (e.g., an integer). The CNN may also include a layer-wise value for the exponent (e.g., an integer value implemented with a shift). In some examples, an output channel of a CNN layer may include one or more bias values that, when added to the output of the output channel, adjust the output values to a desired range. A kernel in a CNN layer may be represented by multiple values in lower precision, whereas a scalar may be in higher precision. The weights of a CNN layer may include the multiple values in the kernel multiplied by the scalar. In quantizing the floating-point coefficients, there is a trade-off between the compression ratio (the range of the fixed point representation) and precision. In some examples, a compression scheme may quantize the elements (coefficients) in the filter masks with low bits. For example, the quantization bit-width for the coefficients may be 1, 2, 3, 4, or 5 bits, or another suitable bit-width.
In some examples, various compression schemes may be adapted to different hardware constraints and devices, such as mobile phones and smart cameras, as computation resources and memory consumption vary with the application. In some scenarios, the first few convolution layers of the CNN tend to be more sensitive with respect to model accuracy than the deeper layers. In some examples, the weights for different layers in the CNN may have different quantization bits, and thus different compression ratios. As shown in FIG. 1, a first convolution layer 102(1) may have n(1) bits for the filter coefficients (weights); a second convolution layer 102(2) may have n(2) bits, layer 102(5) may have n(5) bits, and layer 102(M) may have n(M) bits, where M is the number of convolution layers or groups of convolution layers in the CNN. A CNN may have any suitable number of convolution layers configured in any suitable groups.
In quantizing the weights of a CNN model, in some examples, the CNN kernel may be approximated with a quantized filter kernel and a scalar: Wi = αi·Wiq, where Wiq is the quantized filter mask for the ith filter, with its elements quantized to a variable number of bits (e.g., 1-bit, 2-bit, 3-bit, or other suitable bits) for different layers, and αi is the scalar for the ith filter, which may be quantized to a higher bit-width, such as 8 bits. To accommodate the dynamic range of the filter coefficients, in some examples, a layer-wise shift value can be used. The shift value may be quantized to 4 bits, for example. The bias of the CNN may be represented with 12-bit data, or another suitable bit-width.
To illustrate variable-bit compression for different layers, in some examples, for a 3×3 kernel, the compressor may compress the weights to 1-bit masks, which would require 9×1 (mask)+8 (scalar)=17 bits for each filter. This is a compression ratio of about 17 times as compared to a 32-bit floating point model. Alternatively, and/or additionally, for a 3×3 kernel, the compressor may compress the weights to 5-bit masks, which would require 9×5 (mask)+8 (scalar)=53 bits for one filter, resulting in a compression ratio of about 5.4 times, roughly equivalent to 6-bit fixed point quantization.
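As a rough illustration of these figures, the per-filter storage and compression ratio can be computed as in the following sketch (a simple Python calculation; the kernel size, scalar width, and 32-bit floating point baseline match the example above and are otherwise assumptions):

    def filter_bits(nbit, kernel_size=3, scalar_bits=8):
        # Bits needed to store one compressed filter: kernel_size^2 mask
        # values at nbit each, plus one scalar.
        return kernel_size * kernel_size * nbit + scalar_bits

    def compression_ratio(nbit, kernel_size=3, scalar_bits=8, float_bits=32):
        # Baseline: the same kernel stored as 32-bit floating point values.
        baseline = kernel_size * kernel_size * float_bits
        return baseline / filter_bits(nbit, kernel_size, scalar_bits)

    print(filter_bits(1), round(compression_ratio(1), 1))  # 17 bits, ~16.9x
    print(filter_bits(5), round(compression_ratio(5), 1))  # 53 bits, ~5.4x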
In a non-limiting example, the CNN may be a VGG (e.g., VGG-16) deep neural network, which may have five layer groups Conv1-5, where each layer group may have multiple convolution layers. In some scenarios, the weights of Conv1-3 may be quantized to 3 bits for the masks and the weights of Conv4-5 to 1 bit for the masks. In FIG. 1, each of the layers 102(1)-(M) may be a single convolution layer or a group of convolution layers. For example, the convolution layer 102(1) may include the group of layers in the layer group Conv1 of the VGG-16. Similarly, the convolution layer 102(2) may include the group of layers in the layer group Conv2; 102(3), 102(4) and 102(5) may include the groups of layers in layer groups Conv3, Conv4, and Conv5, respectively. Using the above example of variable compression ratios for the convolution layers in the VGG-16, an overall compression ratio of about 13 times can be achieved. Whereas the first few layers of the CNN have fewer filter parameters than the last layer(s) due to the increased channel numbers in subsequent layers, using more bits for the weights (at a low compression ratio) in the first few layers may improve the accuracy of the model and the training convergence time without significantly increasing the model size.
In some examples, using fewer bits for the weights (at a higher compression ratio) in subsequent layers in the CNN may significantly reduce the size of the model without a significant sacrifice of the performance of the CNN. Whereas direct quantization of weights (e.g., from 32 bits to 3 bits) may affect the accuracy of the CNN due to loss of precision, a training system may be configured to re-train the AI model, as will be explained further in the present disclosure. In some scenarios where the compression ratio is relatively low (e.g., quantization from 32 bits to 8 bits), the loss of performance of the CNN due to quantization is minimal, and re-training of the weights may not be needed. This is explained in detail with reference to FIG. 2.
Variable compression ratios for different convolution layers in a CNN may be configured in various ways. For example, a first set of weights contained in a first convolution layer may have a higher compression ratio (lower quantization bits) than a second set of weights contained in a second convolution layer that succeeds the first convolution layer. In another example, the weights of a first subset of convolution layers of a CNN comprising a sequence of layers may have a higher compression ratio (lower quantization bits) than the weights of a second subset of convolution layers that succeed the first subset of convolution layers. In the example of an implementation of VGG-16, the weights of the first group of layers Conv1-3 may have a lower compression ratio (e.g., higher quantization bits, such as 3), whereas the second group of layers Conv4-5 may have a higher compression ratio (e.g., lower quantization bits, such as 1). In some examples, other configurations of higher compression ratio layers and lower compression ratio layers may be possible. Correspondingly, the weights of convolution layers having a higher compression ratio may be re-trained, whereas the weights of convolution layers having a lower compression ratio may be left without re-training, without significant loss of the performance of the CNN. Advantageously, when the weights of certain convolution layers are not re-trained, fewer computing resources and/or less training data may be needed.
FIG. 2 illustrates a diagram of an example process of re-training in quantizing weights of a neural network in accordance with various examples described herein. In some examples, a re-training process 200 may include training a floating-point model at 202, re-training with convolution quantization at 204, and re-training with activation quantization at 206. Whereas the floating-point model may be stored with a higher number of bits, such as 32 or 64 bits, re-training with convolution quantization and/or activation quantization provides lower-bit weights for the AI model. When the compression ratio of quantization is high, the precision of a re-trained model with quantized weights in fixed point may be higher than without re-training. When the compression ratio of quantization is low, the performance gain of the AI model from re-training may not be significant enough to justify the computing resources and/or training data required for the re-training. In various training processes in FIG. 2, the weights of one or more convolution layers of the CNN model are updated. Once the weights are updated, the process 200 may upload the weights to an AI chip at 208 for executing an AI task. The re-training processes 204 and 206 are further explained with reference to FIGS. 3A and 3B.
FIG. 3A illustrates a diagram of an example process of forward propagation in re-training weights of a neural network in accordance with various examples described herein. In some examples, the re-training process in 204 and/or re-training process 206 (in FIG. 2) may be implemented in a forward propagation network 300. Here, convolution layers A, A+1, . . . A+N may respectively correspond to a convolution layer in a CNN (e.g., 100 in FIG. 1). For example, layer A may correspond to the first convolution layer in the CNN 100; layer A+1 may correspond to the second convolution layer in the CNN 100 (in FIG. 1). In the example in FIG. 1, layer A may correspond to a layer, or a layer in a group of layers in the CNN 100, such as 102(1)-102(M). In some examples, the forward propagation process includes providing an output of one convolution layer to the input of a succeeding convolution layer.
As shown in FIG. 3A, in the forward propagation network 300, the floating-point weights WA(t) at time t of convolution layer A of a CNN model may be quantized at 302. The floating-point weights WA+1(t) at time t of convolution layer A+1 of the CNN model may be quantized at 304. For example, the quantization at 302, 303, 304 may be implemented in process 204 (in FIG. 2). The quantized weights WQ−A(t), WQ−A+1(t), . . . WQ−A+N(t) may be respectively provided to the convolution layers A, A+1, . . . A+N (306, 308, 310) of the CNN for inference. In the example in FIG. 3A, the output of a respective convolution layer may be generated based on the loaded quantized weights of the respective convolution layer and then provided to the input of the succeeding layer. For example, the output of layer A may be generated based on the weights WQ−A(t) and provided to the input of layer A+1, so on and so forth until propagated to the last convolution layer in the CNN.
FIG. 3B illustrates a diagram of an example process of backward propagation in re-training weights of a neural network in accordance with various examples described herein. In some examples, the re-training process (e.g., 204, 206 in FIG. 2) may be implemented in a backward propagation network 320. In FIG. 3B, in the backward propagation network 320, each of the convolution layers of the CNN model may be updated based on a change of weights. For example, the change of weights for each layer may be determined based on the change of weights in the preceding layer in the backward propagation.
In the example in FIG. 3B, the change of weights ΔWA for layer A (306) may be determined based on the change of weights ΔWA+1; the change of weights ΔWA+1 for layer A+1 (308) may be determined based on the change of weights ΔWA+2, and so on; and the change of weights ΔWA+N−1 for layer A+N−1 may be determined based on the change of weights ΔWA+N of layer A+N (310). In the example in FIG. 3B, if the weights of a layer are to be re-trained, the updated weights at time t+1 are determined based on the change of weights for that layer. For example, at time t+1, the weights WA(t+1)=WA(t)+ΔWA; and the updated weights for convolution layer A+N may be determined as WA+N(t+1)=WA+N(t)+ΔWA+N. In some examples, if the weights of a layer do not need to be re-trained, then the weights in that layer are not updated. In the example in FIG. 3B, the weights of layer A+1 remain the same, e.g., WA+1(t+1)=WA+1(t). Corresponding to FIG. 1, the layers A and A+N in FIGS. 3A and 3B may be any of the convolution layers in FIG. 1 whose weights have a higher compression ratio. The layer A+1 in FIGS. 3A and 3B may be any of the convolution layers in FIG. 1 whose weights have a lower compression ratio. The forward and backward propagation networks 300 and 320 are further explained in a training process with reference to FIGS. 4-7.
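The selective update rule described above may be summarized in the following minimal sketch (Python, for illustration only; the per-layer flag indicating whether a layer is re-trained, and the dictionary layout, are assumptions rather than details from this disclosure):

    def update_weights(weights, deltas, retrain_flags):
        # weights, deltas: dicts mapping a layer name to that layer's weight
        # array and its change of weights from backward propagation.
        # retrain_flags: dict mapping a layer name to True if the layer's
        # weights are re-trained (e.g., layers with a higher compression ratio).
        updated = {}
        for name, w in weights.items():
            if retrain_flags[name]:
                updated[name] = w + deltas[name]   # W(t+1) = W(t) + dW
            else:
                updated[name] = w                  # frozen layer: W(t+1) = W(t)
        return updated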
FIG. 4 illustrates a diagram of an example process of re-training with convolution quantization that may be implemented in a re-training process, such as 204 in FIG. 2. In some examples, a process 400 may include accessing training weights of an AI model at 402. For example, the AI model may be trained in the floating-point model training unit (104 in FIG. 1) and include weights in floating-point. In a non-limiting example, the trained weights may be the weights of a CNN model and may be stored in floating point. For example, the weights may be stored in 32-bit or 16-bit.
In some examples, the process 400 may further include quantizing the trained weights at 404, determining the output of the AI model based on the quantized weights at 406, determining a change of weights at 408, and updating the weights at 410. In some examples, in quantizing the weights at 404, the number of quantization levels may correspond to the hardware constraint of the AI chip so that the quantized weights can be uploaded to the AI chip for execution. In a non-limiting example, the quantized weights may be of 1-bit (binary value), 2-bit, 3-bit, 5-bit or other suitable bits, such as 8-bit. For example, the AI chip may include a CNN model. In the CNN model, the weights may include 1-bit (binary value), 2-bit, 3-bit, 5-bit or other suitable bits, such as 8-bit. The structure of the CNN may correspond to that of the hardware in the AI chip. In the case of 1-bit, the number of quantization levels will be two. In some scenarios, quantizing the weights to 1-bit may include determining a threshold to properly separate the weights into two groups, one below the threshold and one above the threshold, where each group takes one value, such as {1, −1}.
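A minimal sketch of such 1-bit thresholding follows (assuming, purely for illustration, a threshold of zero and the value pair {1, −1}; the actual threshold selection may differ):

    import numpy as np

    def binarize_weights(w, threshold=0.0):
        # Separate the weights into two groups around the threshold and
        # assign one value to each group.
        return np.where(w >= threshold, 1.0, -1.0)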
In some examples, quantizing the weights at 404 may include a dynamic fixed point conversion, in which the quantized weights are determined based on nbit, the bit-size of the weights in the physical AI chip. For example, nbit may be 8 bits, 12 bits, etc. Other values may be possible.
In some examples, quantizing the weights at 404 may include determining the quantized weights based on the interval in which the values of the weights fall, where the intervals are defined depending on the value of nbit. In a non-limiting example, when nbit=1, the weights of a CNN model may be quantized into two quantization levels. In other words, the weight values may be divided into two intervals. For example, the first interval is [0, ∞), and the second interval is (−∞, 0). When Wk≥0, WQ=(Wk)Q=(Wmean)shift-quantized, where Wk represents the weights for a kernel in a convolution layer of the CNN model, Wmean=mean(abs(Wk)), and the shift-quantization of a weight is determined relative to |W|max, where |W|max is the maximum of the absolute values of the weights. Similarly, when Wk<0, WQ=−(Wmean)shift-quantized. The mean and maximum values are relative to a convolution layer in the CNN model.
In a non-limiting example, when nbit=2, the intervals may be defined by (−∞, −Wmean/4), [−Wmean/4, Wmean/4] and (Wmean/4, ∞). Thus, the weights may be quantized into:
WQ=0, when |Wk|<Wmean/4;
WQ=(Wmean)shift-quantized, when Wk>Wmean/4;
WQ=−(Wmean)shift-quantized, when Wk<−Wmean/4.
It is appreciated that other variations may also be possible. For example, Wmax may be used instead of Wmean. Denominators other than the value of 4 may also be used.
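The 2-bit interval assignment above may be sketched as follows (Python; the shift_quantize helper is a hypothetical placeholder, since the exact shift-quantization formula is not reproduced in this text):

    import numpy as np

    def shift_quantize(value):
        # Placeholder for the layer-wise shift quantization; assumption only.
        return value

    def quantize_kernel_2bit(w_k):
        # w_k: the weights of one kernel in a convolution layer.
        w_mean = np.mean(np.abs(w_k))
        level = shift_quantize(w_mean)
        w_q = np.zeros_like(w_k, dtype=float)  # |w_k| < w_mean/4 maps to 0
        w_q[w_k > w_mean / 4] = level          # w_k > w_mean/4
        w_q[w_k < -w_mean / 4] = -level        # w_k < -w_mean/4
        return w_q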
In another non-limiting example, when nbit=3, the intervals may be defined, as shown in FIG. 5. Define W′mean=Wmean/4. Thus, the weights may be quantized into:
WQ=0, when |Wk|<W′mean/2;
WQ=(W′mean)shift-quantized, when W′mean/2<Wk<3W′mean/2;
WQ=(2W′mean)shift-quantized, when 3W′mean/2<Wk<3W′mean;
WQ=(4W′mean)shift-quantized, when Wk>3W′mean;
WQ=−(W′mean)shift-quantized, when −3W′mean/2<Wk<−W′mean/2;
WQ=−(2W′mean)shift-quantized, when −3W′mean<Wk<−3W′mean/2;
WQ=−(4W′mean)shift-quantized, when Wk<−3W′mean.
It is appreciated that other variations may also be possible. For example, Wmax may be used instead of Wmean. Denominators other than the values of 4 or 2 may also be used.
Alternatively, and/or additionally, quantizing the weights at 404 may also include a compressed-fixed point conversion, where a weight value may be separated into a scalar and a mask, such that W=scalar×mask. Here, a mask may include a k×k kernel, and each value in the mask may have a bit-width, such as 1-bit, 2-bit, 3-bit, 5-bit, 8-bit or other bit sizes. In some examples, a quantized weight may be represented by the product of a mask and an associated scalar. The mask may be selected to make full use of the bit size of the kernel, where the scalar may be a largest common factor among all of the weights. In a non-limiting example, when nbit is 5 or above, scalar=min(abs(Wk)) for all weights in the kth kernel.
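A sketch of this compressed-fixed point decomposition follows (Python). The scalar follows the min(abs(Wk)) rule stated above; the rule for deriving the mask values is not reproduced in this text, so rounding the weights by the scalar and clipping to the signed nbit range is an assumption made only for illustration:

    import numpy as np

    def compress_kernel(w_k, nbit=5):
        scalar = np.min(np.abs(w_k))               # scalar = min(abs(w_k))
        if scalar == 0:
            scalar = 1e-8                          # guard against a zero scalar
        limit = 2 ** (nbit - 1) - 1
        mask = np.clip(np.round(w_k / scalar), -limit, limit).astype(np.int32)
        return scalar, mask                        # W is approximated by scalar * mask

    def decompress_kernel(scalar, mask):
        return scalar * mask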
With further reference to FIG. 4, determining the output of the AI model at 406 may include inferring the AI model using the training data 409 and the quantized trained weights. The process 400 may further include determining a change of weights at 408 based on the output of the CNN model. In some examples, the output of the CNN model may be the output of the activation layer of the CNN. The process 400 may further update the weights of the CNN model at 410 based on the change of weights. In some examples, the process 400 may be implemented in a forward propagation and backward propagation framework. For example, the process 400 may perform operation 406 in a layer by layer fashion in a forward propagation, in which the inference of the AI model is propagated from the first convolution layer to the last convolution layer in a CNN (or a subset of the convolution layers in the CNN). The output inferred from the first layer is provided to the second layer, the output inferred from the second layer is provided to the third layer, and so on, until the output of the last layer is inferred.
Now, the forward propagation is further explained with reference to FIG. 3A. In FIG. 3A, in the forward propagation network 300, the weights of the convolution layers in a CNN, e.g., layers A, A+1, . . . , A+N (e.g., 306, 308, . . . 310), are provided from respective quantization processes 302, 303 and 304. In some examples, the quantization processes 302, 303, 304 may be implemented in the manner described with reference to 404 in FIG. 4. In the forward propagation network 300, the output of each layer is provided to the input of the succeeding layer. For example, the output of layer A (306) is provided to the input of the immediately succeeding layer A+1 (308), and the output of layer A+1 is provided to the input of the immediately succeeding layer A+2, and so on, until the output layer, such as layer A+N, where N+1 is the number of convolution layers in the CNN. The inference is obtained at the output of the last convolution layer. As such, the inferred output ŷ is obtained at the output of the last convolution layer 310 in a layer by layer fashion.
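A minimal sketch of this layer-by-layer forward pass follows (Python; conv_layer is a hypothetical per-layer inference function standing in for a convolution layer together with its activation):

    def forward_propagate(quantized_weights, x, conv_layer):
        # quantized_weights: per-layer quantized weights, ordered from the
        # first convolution layer to the last.
        activation = x
        for w_q in quantized_weights:
            activation = conv_layer(activation, w_q)  # output feeds the succeeding layer
        return activation                             # inferred output y-hat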
Returning to FIG. 4, in some examples, the operations 408 and 410 may also be performed in a layer by layer fashion in a backward propagation, in which a change of weights is determined for each layer in a CNN from the last layer to the first layer (or for a subset of the convolution layers in the CNN), and the weights in each layer are updated based on the change of weights. In some examples, a loss function may be determined based on the output of the CNN model (e.g., the output of the last convolution layer of the CNN), and the changes of weights may be determined based on the loss function. This is further explained below.
In some examples, the process 400 may repeat updating the weights of the CNN model in one or more iterations. In some examples, blocks 406, 408, 410 may be implemented using a gradient descent method, in which a suitable loss function may be used. In a non-limiting example, a loss function H( ) may be defined over the predictions of the network, where yi is the prediction of the network, e.g., the output of the CNN based on the ith training instance. In a non-limiting example, if the CNN output includes two image labels (e.g., dog or cat), then yi may have the value of 0 or 1. Here, N is the number of training instances in the training data set. The probability p(yi) of a training instance being yi may be determined from the training. In other words, the loss function H( ) may be defined based on a sum of loss values over a plurality of training instances in the training data set, wherein the loss value of each of the plurality of training instances is a difference between an output of the CNN model for the training instance and a ground truth of the training instance.
In a non-limiting example, the training data 409 may include a plurality of training input images. The ground truth data may include information about one or more objects in the image, or about whether the image contains a class of objects, such as a cat, a dog, a human face, or a given person's face. Inferring the AI model may include generating a recognition result indicating the class to which the input image belongs. In a training process, such as 400, the loss function may be determined based on the image labels in the ground truth and the recognition result generated from the AI chip based on the training input image.
In some examples, gradient descent may be used to determine a change of weights
ΔW=f(WQt)
by minimizing the loss function H( ), where WQt stands for the quantized weights at time t. The process may update the weights from a previous iteration based on the change of weights, e.g., Wt+1=Wt+ΔW, where Wt and Wt+1 stand for the weights in a preceding iteration and the weights in the current iteration, respectively. In some examples, the weights (or updated weights) in each iteration, such as Wt and Wt+1, may be stored in floating point. The quantized weights WQt at each iteration t may be stored in fixed point. In some examples, the gradient descent may include known methods, such as a stochastic gradient descent method. Processes 408 and 410 are further explained in the context of a backward propagation with reference to FIG. 3B.
With reference to FIG. 3B, the changes of weights may be obtained for the convolution layers of the CNN in a backward propagation in 320, from the last convolution layer to the first convolution layer. For example, a change of weights for the last layer, e.g., layer A+N (310), may be obtained based on the derivative of the loss function H(y, ŷ), where y is the ground truth of the training data and ŷ is the inferred output obtained from the training data in the forward propagation network 300. The loss function H( ) may be defined based on comparing the inferred output and the ground truth of the training data as described in the present disclosure. The change of weights for each subsequent layer in the backward propagation network 320 may be obtained based on the preceding layer. For example, the change of weights for layer A (306) may be obtained from the preceding layer A+1; the change of weights for layer A+1 (308) may be obtained from the preceding layer A+2, and so on, in a layer by layer fashion. Once the change of weights is obtained for a convolution layer, the weights of that layer may be updated (for re-training) or not updated. As described above, the weights of certain layers having a higher compression ratio may be updated based on the respective change of weights in those layers, whereas the weights of certain layers having a lower compression ratio may remain unchanged (no re-training).
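The iteration described above, quantize, infer, compute the loss, and apply the change of weights to the floating point weights, may be sketched as follows. For illustration only, the single layer is modeled as a logistic-regression filter so that the gradient of a cross-entropy-style loss can be written in closed form; quantize() stands in for the weight quantization of the present disclosure, and the learning rate and data shapes are assumptions:

    import numpy as np

    def quantize(w, nbit=3):
        # Illustrative uniform quantizer, not the exact scheme described above.
        scale = np.max(np.abs(w)) / (2 ** (nbit - 1) - 1) + 1e-12
        return np.round(w / scale) * scale

    def train_step(w_float, x, y, learning_rate=0.1):
        # x: (N, D) training inputs; y: (N,) ground truth labels in {0, 1}.
        w_q = quantize(w_float)                           # forward pass uses W_Q(t)
        p = 1.0 / (1.0 + np.exp(-x @ w_q))                # predicted probabilities p(y_i)
        loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        grad = x.T @ (p - y) / len(y)                     # gradient of the loss at W_Q(t)
        delta_w = -learning_rate * grad                   # dW = f(W_Q(t))
        return w_float + delta_w, loss                    # W(t+1) = W(t) + dW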
Now returning to FIG. 4, the process 400 may further include repeating blocks 404, 406, 408, 410 iteratively, in one or more iterations, until a stopping criteria is met at 414. In some examples, at each iteration, the process may perform operations 404, 406, 408, 410 in forward and backward propagations as disclosed in the present disclosure. For example, the process 400 may determine the output of the CNN at 406 by inference in a layer by layer fashion in a forward propagation (e.g., as shown in FIG. 3A). The process 400 may also determine the change of weights at 408 and update the weights at 410 in a layer by layer fashion in a backward propagation (e.g., as shown in FIG. 3B). For each iteration, the process 400 may use a batch of training images selected from the training data 409. The batch size may vary. For example, the batch size may have a value of 32, 64, 128, or other number of images.
In each iteration, the process 400 may determine whether a stopping criteria has been met at 414. If the stopping criteria has been met, the process may store the updated weights of the CNN model at the current iteration at 416 for use by another process (e.g., 206 in FIG. 2, to be described). If the stopping criteria has not been met, the process 400 may repeat blocks 404, 406, 408, 410 in a new iteration. In determining whether a stopping criteria has been met, the process 400 may count the number of iterations and determine whether the number of iterations has exceeded a maximum iteration number. For example, the maximum iteration number may be set to a suitable number, such as 100, 200, 1000, 10,000, or an empirical number. In some examples, determining whether a stopping criteria has been met may also include determining whether the value of the loss function at the current iteration is greater than the value of the loss function at a preceding iteration. If the value of the loss function increases, the process 400 may determine that the iterations are diverging and determine to stop the iterations. Alternatively, and/or additionally, if the iterations are diverging, the process 400 may adjust the gradient descent hyper-parameters, such as the learning rate, batch size, gradient descent update mechanism, etc. In some examples, if the value of the loss function does not decrease over a number of iterations, the process 400 may also determine that the stopping criteria is met.
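The stopping check described above may be sketched as follows (Python; the maximum iteration count and the patience window are illustrative values, and detecting divergence could alternatively trigger an adjustment of the hyper-parameters rather than a stop):

    def should_stop(loss_history, max_iterations=1000, patience=20):
        # loss_history: list of loss values, one per completed iteration.
        if len(loss_history) >= max_iterations:
            return True                    # maximum iteration number reached
        if len(loss_history) >= 2 and loss_history[-1] > loss_history[-2]:
            return True                    # loss increased: iterations are diverging
        if len(loss_history) > patience:
            recent_best = min(loss_history[-patience:])
            earlier_best = min(loss_history[:-patience])
            if recent_best >= earlier_best:
                return True                # loss has not decreased over the window
        return False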
In some examples, the process 400 may be implemented entirely on a desktop using a CPU or a GPU. Alternatively, certain operations in the process 400 may be implemented in a physical AI chip, where the trained weights or updated weights are uploaded inside the AI chip.
In some examples, the process 400 may combine the re-training with variable compression schemes. For example, for a given convolution layer in the CNN, if the quantization bits exceed a threshold (high quantization bits), the process 400 may skip updating the weights for that given layer. In the backward propagation training process described above, in determining the change of weights at 408 and updating the weights at 410 in the layer by layer fashion, the process 400 may not need to determine the change of weights or update the weights for the layers with high quantization bits (low compression ratio). In the example above, the convolution layers whose weights are quantized at higher bits (lower compression ratio) still participate in the re-training process, except that no weights for those layers are updated. This results in a speedup of the training process. In some examples, if all of the convolution layers in a CNN are quantized to a bit-width exceeding a threshold (e.g., all convolution layers have high quantization bits), then the entire re-training process 400 (as implemented in 204 in FIG. 2) may be skipped.
FIG. 6 illustrates a diagram of an example process of re-training with activation quantization that may be implemented in the training network, e.g., 206 (in FIG. 2). A training process 600 may perform operations in one or more iterations to train and update the weights of a CNN model, where the trained weights may be output in fixed point, which is suitable for an AI chip to execute. The process 600 may include accessing trained weights of an AI model at 602. For example, the AI model may include quantized weights from the process 204 (in FIG. 2) or 400 (in FIG. 4), where the quantized weights are stored in fixed point (at 416 in FIG. 4). Alternatively, the AI model may be trained in the floating-point model training process (e.g., 202 in FIG. 2) and include weights in floating point. In a non-limiting example, the trained weights may be the weights of a CNN model. The process 600 may further include determining the output of the AI model based on the weights at 604. If the weights of the CNN are in fixed point, such as determined from the re-training process with convolution quantization (e.g., 204 in FIG. 2, 400 in FIG. 4), the operation of determining the output of the CNN may be performed in fixed point. If the weights of the CNN are in floating point, such as trained from the floating-point model training process (e.g., 202 in FIG. 2), the operation of determining the output of the CNN may be performed in floating point. Determining the output of the AI model at 604 may include inferring the AI model using the training data 609 and the weights obtained from box 602.
Similar to FIG. 4, determining the output of the CNN model at 604 may be performed on a CPU or GPU processor outside the AI chip. In other scenarios, determining the output of the CNN model may also be performed directly on an AI chip, where the AI chip may be a physical chip or a virtual AI chip, and executed to produce output. If the weights are in fixed point and supported by a physical AI chip, the weights may be uploaded into the AI chip. In that case, the process 600 may load quantized weights into the AI chip for execution of the AI model. The training data 609 may be similar to the training data 409 in FIG. 4.
With further reference to FIG. 6, the process 600 may further include quantizing the output of the CNN at 606. In some examples, quantizing the output of the CNN may include quantizing at least one activation layer. In some examples, an activation layer in an AI chip may include a rectified linear unit (ReLU) of a CNN. The quantization of the activation layer may be based on the hardware constraints of the AI chip so that the quantized output of the activation layer can mimic the characteristics of the physical AI chip. As illustrated in FIG. 1, each of the convolution layers 102(1), . . . 102(M) in the AI chip may include an activation layer having a bit size. In some examples, each of the convolution layers may produce its output in fixed point. The activation layer (e.g., a ReLU layer) may also include fixed point values. Thus, the quantization of the output of the CNN at 606 may mimic the bit size of the activation layer (e.g., the ReLU layer) of the CNN and also produce a fixed point value.
FIG. 7A illustrates a flow diagram of an example process of inference of an AI model via activation quantization in accordance with various examples described herein. In some examples, a process 700 may quantize the output of one or more convolution layers in a CNN during the training process. The process 700 may be implemented in the operation 606 (in FIG. 6). The one or more convolution layers in the CNN model may correspond to one or more convolution layers in the AI chip in FIG. 1. By quantizing the output of the convolution layers during the training, the trained CNN model may be expected to achieve a performance in an AI chip close to that achieved in a CPU/GPU during the training. In other words, the quantization effect over the CNN model during the training may mimic that of the AI chip so that performance of the CNN model during the training may accurately reflect the anticipated performance of the physical AI chip when the trained CNN model is uploaded and executed in the AI chip.
In some examples, the process 700 may include accessing the input of a first convolution layer at 702 and determining the output of the first convolution layer at 704. For example, the first convolution layer may be any of the convolution layers in a CNN model that corresponds to a convolution layer, e.g., 102, in an AI chip. The output of the convolution may be stored in floating point. Accessing the input of the first convolution layer at 702 may include accessing the input data, if the first convolution layer is the first layer after the input in the CNN, or accessing the output of the preceding layer, if the first convolution layer is an intermediate layer. Determining the output of the first convolution layer at 704 may include executing a CNN model to produce an output at the first convolution layer. In a training process, determining the output of the convolution layer may be performed outside of a chip, e.g., in a CPU/GPU environment. Alternatively, determining the output of the convolution layer may be performed in an AI chip.
With further reference to FIG. 7A, the process 700 may further quantize the output of the first convolution layer at 706. In some examples, the method of quantizing the output of the convolution layer may mimic the configuration of an AI chip, such as the number of bits and the quantization behavior of a convolution layer in an AI chip. For example, the quantized output of the CNN model may be stored in fixed point with the same bit-length as the activation layer of the corresponding convolution layer in the AI chip. In a non-limiting example, the output of each convolution layer in an AI chip may have 5 bits (in hardware), where the output values range from 0 to 31. The process 700 may determine a range for quantization based on the bit-width of the output of each convolution layer of the AI chip. In the above example, the range for quantization may be 0-31, which corresponds to 5 bits in the hardware configuration. The process 700 may perform a clipping over the output of a convolution layer in the CNN model, which sets a value beyond a range to the closest minimum or maximum of the range. FIG. 7B illustrates an example distribution of layer output values of an AI model. In such an example, the layer output values from multiple runs of the AI chip over multiple instances of a training set are all greater than zero. A clipping was done at the maximum value yiα, where i stands for the ith convolution layer. In the above example in which the convolution layer contains 5-bit values, for a value above 31, the process may set the value to the maximum value: 31.
Returning to FIG. 7A, quantizing the activation layer may include quantizing the output values of one or more convolution layers in the CNN. For example, if Y=W*X+b represents the output value of an activation layer, then the activation layer may be quantized by clipping Y to a range [0, α] and representing the clipped value in fixed point.
Here, a value in [0, α] may be represented by the maximum number of bits in the activation layer, e.g., 5-bit, 10-bit, or other values. If an output value is in the range of [0, α], then the quantization becomes a linear transformation. If an output value is less than zero or greater than α, then the quantization clips the value at zero or α, respectively. Here, the quantization of the activation layer limits the value of the output to the same limit as in the hardware. In a non-limiting example, if the bit-width of an activation layer in an AI chip is 5 bits, then [0, α] may be represented by 5 bits. Accordingly, the quantized value will be represented by 5 bits.
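A minimal sketch of this activation quantization follows (Python; alpha is an assumed per-layer clipping value, and 5 bits is used to match the example above):

    import numpy as np

    def quantize_activation(y, alpha, nbit=5):
        levels = 2 ** nbit - 1                        # e.g., 31 for a 5-bit activation
        y_clipped = np.clip(y, 0.0, alpha)            # values below 0 or above alpha are clipped
        codes = np.round(y_clipped / alpha * levels)  # integer codes 0..levels
        return codes * alpha / levels                 # quantized values on [0, alpha]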
With further reference to FIG. 7A, the process 700 may further repeat similar operations for a second convolution layer. The process 700 may access input of the second convolution layer at 708, determine the output of the second convolution layer at 710, and quantize the output of the second convolution layer at 712. For example, the second convolution layer may correspond to a convolution layer in the AI chip, such as 504, or 506 in FIG. 5. In accessing the input of the second convolution layer, the process may take the output of the preceding layer. If the first and second convolution layers in the CNN model are consecutive layers, for example, the first layer in the CNN model corresponds to layer 502 in the AI chip and the second layer in the CNN corresponds to layer 504 in the AI chip, then accessing the input of the second layer (e.g., 504) may include accessing the output values of the first layer (e.g., 502). If the values of the output of the first layer are quantized, then accessing the input of the second layer includes accessing the quantized output of the first layer.
Blocks 710 and 712 may be performed in a similar fashion to blocks 704 and 706. Further, the process 700 may repeat blocks 708-712 for one or more additional layers at 714. In some examples, the process 700 may quantize the output for all convolution layers in a CNN in a layer-by-layer fashion. In some examples, the process 700 may quantize the output of only some convolution layers in a CNN model. For example, the process 700 may quantize the output of the last few convolution layers in the CNN. In some examples, the process 700 may be implemented in the forward propagation network 300 (in FIG. 3A). For example, each of the convolution layers 306, 308 . . . 310 may have an activation layer 307, 309, 311, respectively. The processes 706, 712 may provide quantized output at the respective activation layer, which is further provided to the input of the succeeding layer.
Returning to FIG. 6, the process 600 may further include determining a change of weights at 608 and updating the weights at 610. The process 600 may further repeat the processes 604, 606, 608, 610 until a stopping criteria is met at 614. Determining the change of weights at 608 and updating the weights at 610 may include a similar training process as in FIG. 4. For example, the process 600 may include determining a change of weights at 608 based on the output of the CNN model. In some examples, the output of the CNN model may be the output of the activation layer of the CNN. The process 600 may further update the weights of the CNN model at 610 based on the change of weights. The process may repeat updating the weights of the CNN model in one or more iterations. Similar to FIG. 4, in each iteration, the process 600 may also be implemented in forward and backward propagations in a layer by layer manner. In some examples, blocks 604, 606, 608, 610 may be implemented using a gradient descent method. The gradient descent method may be performed in a similar fashion as described in FIG. 4. For example, a loss function may be defined in the same manner as above,
where yi is the prediction of the network, e.g., the output of the CNN based on the ith training instance. In a non-limiting example, if the CNN output includes two image labels (e.g., dog or cat), then yi may have the value of 0 or 1. N is the number of training instances in the training data set. The probability p(yi) of a training instance being yi may be determined from the training. In other words, the loss function H( ) may be defined based on a sum of loss values over a plurality of training instances in the training data set, wherein the loss value of each of the plurality of training instances is a difference between an output of the CNN model for the training instance and a ground truth of the training instance.
In some examples, the gradient descent may be used to determine a change of weights
ΔW=f(WQt)
by minimizing the loss function H( ), where WQt stands for the quantized weights at time t. In other words, WQt=Q(Wt). The process may update the weights from a previous iteration based on the change of weights, e.g., Wt+1=Wt+ΔW, where Wt and Wt+1 stand for the weights in a preceding iteration and the weights in the current iteration, respectively. In some examples, the weights (or updated weights) in each iteration, such as Wt and Wt+1, may be stored in floating point. The quantized weights WQt at each iteration t may be stored in fixed point. In some examples, the gradient descent may include known methods, such as a stochastic gradient descent method.
With further reference to FIG. 6, once the stopping criteria is met at 614, the process 600 may store the updated weights at 616 for use by another unit (e.g., a unit in 101 in FIG. 1). In some examples, the process 600 may be implemented entirely on a desktop using a CPU or a GPU. Alternatively, certain operations in the process 600 may be implemented in a physical AI chip, where the trained weights or updated weights are uploaded inside the AI chip.
Similar to FIG. 4, in some examples, the process 600 may combine the re-training with variable compression schemes. For example, for a given convolution layer in the CNN, if the quantization bits for the filter coefficients of that layer exceed a threshold (high quantization bits), the process 600 may skip updating the weights for that given layer. The re-training with activation quantization with respect to FIGS. 6 and 7A may be performed in the forward and backward propagation networks in FIGS. 3A and 3B. Similar to the re-training process for convolution quantization with respect to FIG. 4, in determining the change of weights at 608 and updating the weights at 610 in the layer by layer fashion, the process 600 may not need to update the weights for the layers with high quantization bits (low compression ratio). In the example above, the convolution layers whose weights are quantized at higher bits (lower compression ratio) still participate in the re-training process, except that no change of weights is calculated for those layers, and no weights are updated for those layers. This results in a speedup of the training process. In some examples, if all of the convolution layers in a CNN are quantized to a bit-width exceeding a threshold (e.g., all convolution layers have high quantization bits), then the entire re-training process 600 (as implemented in 206 in FIG. 2) may be skipped.
The various embodiments in FIGS. 1-7B illustrate variable-bit compression schemes that quantize the weights in different convolution layers in an AI chip with variable quantization bits. The training process for obtaining the weights of the AI model for uploading to the AI chip may also implement a re-training process for the convolution layers having higher compression ratio (or lower quantization bits). Alternatively, and/or additionally, the AI chip may have weights in a higher-bit compression without requiring re-training. The various embodiments in FIGS. 1-7B illustrate high compression on weights of a CNN, such as 5.4 times or 13 times in the above examples, both below 30% of the size of the weights in the original network.
It is appreciated that variations of these embodiments may exist. For example, the compression schemes may be applicable to other types or architectures of neural networks and are not limited to a particular type, e.g., a CNN. In some examples, the representation of the compressed neural network may contain all of the information required for decoding the parameters and weights without requiring external information for their interpretation. The reduced neural network resulting from the compression may be directly used for inference. In some examples, the compressed neural network may be encoded and reconstructed (decoded) in order to perform inference. The various compression schemes may require the original training data, such as via a re-training process, to improve the performance. Alternatively, a compression scheme may not require the original training data, while using a higher-bit quantization. Furthermore, returning to FIG. 2, although re-training with convolution quantization (204) and re-training with activation quantization (206) are shown in sequence, the re-training processes 204, 206 need not be performed in any particular order. Further, each of the re-training processes 204, 206 may be performed alone without the other.
FIG. 8 illustrates an example AI chip that may utilize the compressed and trained CNN as described in the various embodiments in FIGS. 1-7B in accordance with various examples described herein. In some examples, the AI chip 802 may be configured as a feature extractor. The AI chip 802 may include an embedded cellular neural network. The AI chip 802 may receive one or more image frames and may include a CNN configured to generate feature maps for each of the plurality of image frames. The CNN 806 may be implemented in the AI chip and thus may have hardware constraints. In some examples, the weights of the CNN 806 may be quantized using variable compression schemes and trained using the training processes described with reference to FIGS. 1-7B. The AI chip 802 may also include an invariant pooling layer 808 configured to generate the corresponding feature descriptor based on the feature maps. In some examples, the AI chip 802 may further include an image rotation unit 804 configured to produce multiple images rotated from the image frame at corresponding angles. This allows the CNN to extract deep features from the image, such as in a deep neural network, e.g., VGG-16. Various examples of compressing and training the weights for a VGG-16 are described in the various embodiments, e.g., in FIGS. 1-7B.
Returning to FIG. 8, in some examples, the invariant pooling layer 808 may be configured to determine a feature descriptor based on the feature maps obtained from the CNN. The pooling layer 808 may include a square-root pooling, an average pooling, a max pooling or a combination thereof. The CNN may also be configured to perform a region of interest (ROI) sampling on the feature maps to generate multiple updated feature maps. The various pooling layers may be configured to generate a feature descriptor for various rotated images. In a non-limiting example, an input image (e.g., a captured image or a key frame of a video) may be fed to the CNN, which generates convolutional feature maps, with the dimension of w×h×c, where w and h denote width and height of each feature map, c denotes the number of channels. The invariant pooling layer 808 may be configured to perform one or more pooling functions, such as described in “Information Technology—Multimedia Content Description Interface—Part 15: Compact Descriptors for Video Analysis,” ISO/IEC DIS 15938-15:2018(E), Apr. 28, 2018. In some examples, the invariant pooling may include square-root pooling, followed by average pooling, which is followed by max pooling. The invariant pooling may convert feature maps generated from the convolution layers in the CNN for various image rotations to a single feature descriptor. In some examples, the feature map of the convolution layers may be sampled, such as using region of interest (ROI) sampling. The feature descriptor may include a one-dimensional (1D) vector. For example, the 1D feature descriptor may include a vector containing 512 values associated with each of 512 output channels of the CNN.
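A sketch of the invariant pooling chain follows (Python). The exact ordering of the square-root, average, and max pooling steps across spatial positions and rotations is an assumption made for illustration: each rotated image's feature maps are square-rooted and spatially averaged to one value per channel, and the maximum over the rotations yields the single 1D descriptor.

    import numpy as np

    def invariant_pooling(feature_maps_per_rotation):
        # feature_maps_per_rotation: list of arrays shaped (w, h, c),
        # one per rotated copy of the input image.
        pooled = []
        for fmap in feature_maps_per_rotation:
            sqrt_map = np.sqrt(np.maximum(fmap, 0.0))   # square-root pooling
            pooled.append(sqrt_map.mean(axis=(0, 1)))   # average pooling -> (c,) vector
        return np.max(np.stack(pooled), axis=0)         # max pooling over rotations -> 1D descriptor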
In some examples, various embodiments in FIGS. 1-7B may be utilized to perform weights/coefficients quantization, which may facilitate memory reduction and efficient high-speed operations. For example, when the weights of a CNN are quantized and/or re-trained (e.g., in any of the processes in FIG. 2), the weights may be uploaded to an AI chip for inference with compressed network under limited resource of power, memory, computation and bandwidth.
FIG. 9 illustrates an example image retrieval system in accordance with various examples described herein. An image retrieval system 900 may include a feature extractor 904 configured to extract one or more feature descriptors from an input image. Examples of a feature descriptor include any values that are representative of one or more features of an image. For example, the feature descriptor may include a vector containing values representing multiple channels. The system 900 may also include a comparator 906 configured to compare a feature descriptor of an input image with one or more reference feature descriptors to generate image retrieval results. The reference feature descriptors may be associated with known images or image instances (e.g., objects). In an image retrieval system, a feature descriptor may represent certain features of an image. For example, if two images both contain a dog, they may have similar feature descriptors. In an application of image instance retrieval, for example, the input image may contain a dog, and the feature descriptor may represent certain features of the dog's face or the breed of the dog. Reference descriptors may be pre-trained, each associated with an image of a known object (e.g., a dog or a cat), or the breed of the dog. The comparator 906 may compare the feature descriptor of the input image with the reference descriptors and determine whether the input image contains a dog, or the breed of the dog, based on the result of the comparison. Although FIG. 9 illustrates an example of using an AI chip to implement a CNN and perform image retrieval based on feature descriptors, it is appreciated that various other applications of the compression scheme and training processes described in the various embodiments in the present disclosure may be possible.
The various embodiments in FIGS. 1-9 may facilitate various applications, especially using a low-precision AI chip in performing certain AI tasks. For example, a low-cost, low-precision AI chip with weights having 1-bit values may be used in a surveillance video camera. Such a camera may be capable of performing an AI task in real-time, such as face recognition, to automatically distinguish unfamiliar intruders from registered visitors. The use of such an AI chip may save the network bandwidth, power costs, and hardware costs associated with performing an AI task involving a deep learning neural network. With the embodiments in FIGS. 1-9, it may be feasible to compress the weights of a CNN in an AI chip to 1-bit and train the weights. With the variable compression scheme and/or associated training processes described in the present disclosure, memory utilization and the processing speed of the AI chip may be improved without sacrificing the performance of the CNN model. Further, the CNN in the AI chip may be configured to generate a feature descriptor for each input image and perform an image retrieval task such as described in FIG. 9.
In some examples, an AI chip configured in various configurations with respect to FIGS. 1-9 may be installed in a camera and store the trained weights and/or other parameters of the CNN model, such as the trained/quantized/updated weights generated in any of the processes 200 (FIG. 2), 300, 320 (FIGS. 3A and 3B), 400 (FIG. 4), 600 (FIG. 6), or 700 (FIG. 7A). The AI chip may be configured to receive a captured image from the camera, perform an image recognition task by propagating the captured image from the first convolution layer to the second convolution layer in the AI chip, and so on until the last convolution layer in the CNN model, and determine the recognition result. The recognition result is based on the image data and the weights in the convolution layers of the CNN. The system may present the recognition result on an output device, such as a display. For example, the camera may display the recognition result via a user interface. In a face recognition application, the CNN model may be trained for face recognition. A captured image may include one or more facial images associated with one or more persons. The recognition result may include the name associated with each input facial image. For example, the user interface may display a person's name next to or overlaid on each input facial image associated with that person. Although an example image recognition task is illustrated, other AI tasks may also be implemented in the AI chip.
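For illustration only, the following non-limiting sketch outlines the layer-by-layer propagation described above; the layer callables and the assumption that the final layer produces per-class scores are hypothetical, since the actual on-chip interface is not specified here.

    def run_inference(ai_chip_layers, image_tensor, class_names):
        """Propagate an input image through the convolution layers of the model.

        ai_chip_layers: ordered list of layer callables (first convolution
        layer, second convolution layer, ..., last layer) as they would be
        executed on the AI chip; hypothetical placeholders for this sketch.
        """
        activation = image_tensor
        for layer in ai_chip_layers:              # propagate layer by layer
            activation = layer(activation)
        class_index = int(activation.argmax())    # recognition result from final per-class scores
        return class_names[class_index]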
FIG. 10 illustrates various embodiments of one or more electronic devices for implementing the various methods and processes described in FIGS. 1-9. An electrical bus 1000 serves as an information highway interconnecting the other illustrated components of the hardware. Processor 1005 is a central processing device of the system, configured to perform calculations and logic operations required to execute programming instructions. As used in this document and in the claims, the terms “processor” and “processing device” may refer to a single processor or any number of processors in a set of processors that collectively perform a process, whether a central processing unit (CPU) or a graphics processing unit (GPU), or a combination of the two. Read only memory (ROM), random access memory (RAM), flash memory, hard drives, and other devices capable of storing electronic data constitute examples of memory devices 1025. A memory device, also referred to as a computer-readable medium, may include a single device or a collection of devices across which data and/or instructions are stored.
An optional display interface 1030 may permit information from the bus 1000 to be displayed on a display device 1035 in visual, graphic, or alphanumeric format. An audio interface and an audio output (such as a speaker) may also be provided. Communication with external devices may occur using various communication ports 1040, such as a transmitter and/or receiver, an antenna, an RFID tag, and/or short-range or near-field communication circuitry. A communication port 1040 may be attached to a communications network, such as the Internet, a local area network, or a cellular telephone data network.
The hardware may also include a user interface sensor 1045 that allows for receipt of data from input devices 1050 such as a keyboard, a mouse, a joystick, a touchscreen, a remote control, a pointing device, a video input device, and/or an audio input device, such as a microphone. Digital image frames may also be received from an image capturing device 1055, such as a video camera or still camera, that can be either built into or external to the system. Other environmental sensors 1060, such as a GPS system and/or a temperature sensor, may be installed on the system and be communicatively accessible by the processor 1005, either directly or via the communication ports 1040. The communication ports 1040 may also communicate with the AI chip to upload data to or retrieve data from the chip. For example, a processing device on the network may be configured to perform the processes in FIG. 2 to quantize the weights to certain bit-widths and upload the weights to the AI chip, via the communication port 1040, for performing an AI task. Optionally, the processing device may use an SDK (software development kit) to communicate with the AI chip via the communication port 1040. The processing device may also retrieve the result of an AI task from the output of the AI chip via the communication port 1040. The communication port 1040 may also communicate with any other interface circuit or device that is designed for communicating with an integrated circuit.
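For illustration only, the following non-limiting sketch shows how a processing device might use an SDK to upload quantized weights and retrieve results over the communication port 1040; the module name ai_chip_sdk and the methods upload_weights, run_task, and read_output are placeholders for this sketch, not an actual vendor API.

    # Hypothetical SDK calls -- the names below are placeholders only.
    from ai_chip_sdk import AiChipSDK   # assumed SDK module, not a real package

    def deploy_and_run(port, quantized_weights, input_image):
        chip = AiChipSDK(port=port)                 # e.g., bound to communication port 1040
        chip.upload_weights(quantized_weights)      # push quantized CNN weights to the chip
        chip.run_task(input_image)                  # execute the AI task on-chip
        return chip.read_output()                   # retrieve the result of the AI task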
Optionally, the hardware may not need to include a memory; instead, the programming instructions may be run on one or more virtual machines or one or more containers on a cloud. For example, the various methods illustrated above may be implemented by a server on a cloud that includes multiple virtual machines, each virtual machine having an operating system, a virtual disk, a virtual network, and applications, and the programming instructions for implementing various functions in the system may be stored on one or more of those virtual machines on the cloud.
Various embodiments described above may be implemented and adapted to various applications. For example, the AI chip having a CNN architecture may reside in an electronic mobile device. The electronic mobile device may use a built-in AI chip to generate the feature descriptor. In some scenarios, the mobile device may also use the feature descriptor to perform an image retrieval task such as described in FIG. 9. In other scenarios, the processing device may be a server device on a communication network or may be on the cloud. The processing device may implement a CNN architecture, or may access the feature descriptor generated from the AI chip, and perform image retrieval based on the feature descriptor. These are only examples of applications in which the various systems and processes may be implemented.
The various systems and methods disclosed in this patent document provide advantages over the prior art, whether implemented standalone or in combination. For example, by using a compressor described in various embodiments herein, the weights of a CNN may be quantized at a high compression ratio without significant loss of performance. This may reduce the memory space required for an AI task and also speed up the execution of the AI task on an AI chip. When the variable compression scheme and training processes are implemented for a CNN in an AI chip, with proper re-training of the CNN model under the constraints of fixed-point weights, the model's precision may be very close to that of the floating-point model while using far fewer bits for the model weights. For example, for the VGG-16 model, the accuracy loss from using 1-bit coefficients is estimated to be about 1%.
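For illustration only, the following non-limiting arithmetic sketch estimates the weight-storage reduction from compressing 32-bit floating-point weights to 1-bit values, using the commonly cited parameter count of roughly 138 million for VGG-16; per-layer scale factors and storage padding are ignored in this estimate.

    def weight_memory_mb(num_params, bits_per_weight):
        """Rough weight-storage estimate in megabytes, ignoring scales and padding."""
        return num_params * bits_per_weight / 8 / 1e6

    vgg16_params = 138_000_000                     # commonly cited VGG-16 parameter count
    print(weight_memory_mb(vgg16_params, 32))      # ~552 MB as 32-bit floating point
    print(weight_memory_mb(vgg16_params, 1))       # ~17 MB as 1-bit weights (~32x smaller)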
Other advantages will be apparent to those skilled in the art from the foregoing specification. Accordingly, it will be recognized by those skilled in the art that changes, modifications, or combinations may be made to the above-described embodiments without departing from the broad inventive concepts of the invention. It should therefore be understood that the present solution is not limited to the particular embodiments described herein, but is intended to include all changes, modifications, and combinations of the various embodiments that are within the scope and spirit of the invention as defined in the claims.