This disclosure relates to the field of artificial intelligence, and in particular, to a neural network parameter quantization method and apparatus.
Model compression is a common technique for constructing a lightweight neural network. Generally, 32-bit floating-point data (FP32) is used for storage in neural network models. It is found through research that neural networks have good robustness: when precision of the parameters of a large neural network is reduced through quantization, coding, and the like, the neural network can still maintain good performance. Common low-precision data includes numerical formats such as half-precision floating point (FP16), 16-bit fixed-point integer (INT16), 8-bit fixed-point integer (INT8), 4-bit fixed-point integer (INT4), and 1-bit. With network performance and a model compression degree comprehensively considered, converting a weight parameter from FP32 to INT8 is a common quantization method.
However, during quantization, especially for an adder network, a precision loss is large during low-bit quantization. Therefore, how to reduce the precision loss during low-bit quantization becomes an urgent problem to be resolved.
This disclosure provides a neural network parameter quantization method and apparatus, to quantize a neural network, reduce a precision loss during low-bit quantization, and obtain a lightweight model with a more accurate output.
In view of this, according to a first aspect, this disclosure provides a neural network parameter quantization method, including obtaining a parameter of each neuron in a to-be-quantized model, to obtain a parameter set, clustering parameters in the parameter set to obtain a plurality of types of classified data, and quantizing each type of classified data in the plurality of types of classified data to obtain at least one type of quantization parameter, where the at least one type of quantization parameter is used to obtain a compression model, and precision of the at least one type of quantization parameter is lower than precision of a parameter in the to-be-quantized model.
Therefore, in this implementation of this disclosure, the parameters of a neural network are clustered, and the parameters of each class are then separately quantized. This can improve the expression capability of the model.
In a possible implementation, the clustering the parameter set to obtain a plurality of types of classified data may include clustering the parameter set to obtain at least one type of clustered data, and extracting a preset quantity of parameters from each type of clustered data in the at least one type of clustered data, to obtain the plurality of types of classified data.
Therefore, in this implementation of this disclosure, each type of classified parameters is truncated after clustering, to reduce outliers in each class and improve the expression capability of the subsequent model.
In a possible implementation, the parameter in the to-be-quantized model includes a parameter in a feature output by each neuron or a parameter value in each neuron. Therefore, in this implementation of this disclosure, both the internal parameters of each neuron in the neural network and the feature values output by each neuron are quantized, to reduce the bits occupied by the quantized model and obtain a lightweight model.
In a possible implementation, the to-be-quantized model includes an adder neural network. Therefore, in this implementation of this disclosure, for an adder network, if neurons share a scaling coefficient, a model expression capability is reduced. If a multiplication convolution quantization manner is used, scaling coefficients of weight data and input features may be different. Therefore, in a manner provided in this disclosure, parameters may be clustered, and each type of parameter is quantized, to improve an expression capability of a compressed model, and avoid a non-INT8 value after quantization.
In a possible implementation, the compression model is used to perform at least one of image recognition, a classification task, or target detection. Therefore, the method provided in this disclosure may be applicable to a plurality of scenarios, and has a strong generalization capability.
According to a second aspect, this disclosure provides a neural network parameter quantization apparatus, including an obtaining module configured to obtain a parameter of each neuron in a to-be-quantized model, to obtain a parameter set, a clustering module configured to cluster the parameter set to obtain a plurality of types of classified data, and a quantization module configured to quantize each type of classified data in the plurality of types of classified data to obtain at least one type of quantization parameter, where the at least one type of quantization parameter is used to obtain a compression model, and precision of the at least one type of quantization parameter is lower than precision of a parameter in the to-be-quantized model.
In a possible implementation, the clustering module is further configured to cluster the parameter set to obtain at least one type of clustered data, and extract a preset quantity of parameters from each type of clustered data in the at least one type of clustered data, to obtain the plurality of types of classified data.
In a possible implementation, the parameter in the to-be-quantized model includes a parameter in a feature output by each neuron or a parameter value in each neuron.
In a possible implementation, the to-be-quantized model includes an adder neural network.
In a possible implementation, the compression model is used to perform at least one of image recognition, a classification task, or target detection.
According to a third aspect, an embodiment of this disclosure provides a neural network parameter quantization apparatus, including a processor and a memory. The processor and the memory are interconnected through a line, and the processor invokes program code in the memory to perform a processing-related function in the neural network parameter quantization method in the first aspect. Optionally, the neural network parameter quantization apparatus may be a chip.
According to a fourth aspect, an embodiment of this disclosure provides a neural network parameter quantization apparatus. The neural network parameter quantization apparatus may also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface. The processing unit obtains program instructions through the communication interface, and when the program instructions are executed by the processing unit, the processing unit is configured to perform a processing-related function in the first aspect or any one of the optional implementations of the first aspect.
According to a fifth aspect, an embodiment of this disclosure provides a computer-readable storage medium including instructions. When the instructions are run on a computer, the computer is enabled to perform the method in the first aspect or any one of the optional implementations of the first aspect.
According to a sixth aspect, an embodiment of this disclosure provides a computer program product including instructions. When the computer program product runs on a computer, the computer is enabled to perform the method in the first aspect or any one of the optional implementations of the first aspect.
The following describes the technical solutions in embodiments of this disclosure with reference to the accompanying drawings in embodiments of this disclosure. It is clear that the described embodiments are merely a part rather than all of embodiments of this disclosure. All other embodiments obtained by a person of ordinary skill in the art based on embodiments of this disclosure without creative efforts shall fall within the protection scope of this disclosure.
An overall working procedure of an artificial intelligence system is first described.
The infrastructure provides computing capability support for the artificial intelligence system, implements communication with the external world, and implements support by using a basic platform. The infrastructure communicates with the outside by using a sensor. A computing capability is provided by an intelligent chip, for example, a hardware acceleration chip such as a central processing unit (CPU), a network processing unit (or neural-network processing unit (NPU)), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), or a field-programmable gate array (FPGA). The basic platform includes related platforms, for example, a distributed computing framework and a network, for assurance and support, and may include a cloud storage and computing network, an interconnection network, and the like. For example, the sensor communicates with the outside to obtain data, and the data is provided to an intelligent chip in a distributed computing system provided by the basic platform for computing.
Data at an upper layer of the infrastructure indicates a data source in the field of artificial intelligence. The data relates to a graph, an image, a speech, and a text, further relates to Internet of Things (IoT) data of a device, and includes service data of a system and perception data such as force, displacement, a liquid level, a temperature, and humidity.
Data processing usually includes data training, machine learning, deep learning, searching, inference, decision-making, and the like.
Machine learning and deep learning may mean performing symbolic and formal intelligent information modeling, extraction, preprocessing, training, and the like on data.
Inference is a process in which human intelligent inference is simulated in a computer or an intelligent system, and machine thinking and problem resolving are performed by using formal information according to an inference control policy. A typical function is searching and matching. Decision-making is a process of making a decision after intelligent information is inferred, and usually provides functions such as classification, ranking, and prediction.
After data processing mentioned above is performed on the data, some general capabilities may be further formed based on a data processing result. For example, the general capabilities may be an algorithm or a general system, for example, translation, text analysis, computer vision processing, speech recognition, and image recognition.
The intelligent product and industry application are products and applications of the artificial intelligence system in various fields. The intelligent product and industry application involve packaging overall artificial intelligence solutions, to productize and apply intelligent information decision-making. Application fields of the intelligent products and industry application mainly include intelligent terminals, intelligent transportation, intelligent health care, autonomous driving, smart cities, and the like.
Embodiments of this disclosure relate to related applications of a large quantity of neural networks. To better understand solutions of embodiments of this disclosure, the following first describes related terms and concepts of neural networks that may be involved in embodiments of this disclosure.
The neural network may include a neural unit. The neural unit may be an operation unit that uses $x_s$ and an intercept 1 as an input, and an output of the operation unit may be shown in the following formula: $h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$. Here, $s=1,2,\ldots,n$, n is a natural number greater than 1, $W_s$ is a weight of $x_s$, and b is a bias of the neuron. f is an activation function of the neuron, and is used to introduce a non-linear feature into the neural network to convert an input signal of the neuron into an output signal. The output signal of the activation function may be used as an input of a next convolutional layer. The activation function may be a sigmoid function. The neural network is a network constituted by linking a plurality of single neural units together. To be specific, an output of a neural unit may be an input of another neural unit. An input of each neuron may be connected to a local receptive field of a previous layer to extract a feature of the local receptive field. The local receptive field may be a region including several neurons.
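As an illustration only, the neuron computation above can be sketched in NumPy as follows (the input, weight, and bias values are made up for the example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Inputs x_s, weights W_s, and bias b of a single neural unit (hypothetical values)
x = np.array([0.5, -1.2, 3.0])   # x_s, s = 1..n
W = np.array([0.8, 0.1, -0.4])   # W_s, weight of each x_s
b = 0.2                          # bias of the neuron

# Output of the neuron: f(sum_s W_s * x_s + b), with a sigmoid activation f
output = sigmoid(np.dot(W, x) + b)
print(output)
```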
The deep neural network (DNN) is also referred to as a multi-layer neural network, and may be understood as a neural network having a plurality of intermediate layers. Based on positions of different layers, neural network layers inside the DNN may be classified into three types: an input layer, an intermediate layer, and an output layer. Usually, a first layer is the input layer, a last layer is the output layer, and a middle layer is the intermediate layer, which is also referred to as a hidden layer. Layers are fully connected. To be specific, any neuron at an ith layer is necessarily connected to any neuron at an (i+1)th layer.
Although the DNN seems complex, each layer of the DNN may be represented as the following linear relationship expression: $\vec{y}=\alpha(W\vec{x}+\vec{b})$, where $\vec{x}$ is an input vector, $\vec{y}$ is an output vector, $\vec{b}$ is a bias vector (also referred to as a bias parameter), $W$ is a weight matrix (also referred to as a coefficient), and $\alpha(\cdot)$ is an activation function. At each layer, the output vector $\vec{y}$ is obtained by performing such a simple operation on the input vector $\vec{x}$. Because there are a plurality of layers in the DNN, there are also a plurality of coefficients $W$ and a plurality of bias vectors $\vec{b}$. Definitions of the parameters in the DNN are as follows. The coefficient $W$ is used as an example. It is assumed that in a DNN with three layers, a linear coefficient from a fourth neuron at a second layer to a second neuron at a third layer is defined as $W_{24}^{3}$. The superscript 3 indicates a layer at which the coefficient $W$ is located, and the subscript corresponds to an output third-layer index 2 and an input second-layer index 4.
In conclusion, a coefficient from a kth neuron at an (L−1)th layer to a jth neuron at an Lth layer is defined as $W_{jk}^{L}$.
It should be noted that the input layer does not have the parameters W. In the deep neural network, more intermediate layers make the network more capable of describing a complex case in the real world. Theoretically, a model with more parameters has higher complexity and a larger “capacity”. This indicates that the model can complete a more complex learning task. A process of training the deep neural network is a process of learning a weight matrix, and a final objective of training is to obtain a weight matrix (a weight matrix formed by vectors W at a plurality of layers) at all layers of a trained deep neural network.
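For illustration, a minimal sketch of the layer-by-layer relationship $\vec{y}=\alpha(W\vec{x}+\vec{b})$ with hypothetical layer sizes is given below; the entry W2[j, k] plays the role of the coefficient $W_{jk}^{2}$ described above:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)

# A hypothetical 3-layer DNN: input layer (4), hidden layer (5), output layer (3)
W2 = rng.standard_normal((5, 4))   # W2[j, k]: from neuron k at layer 1 to neuron j at layer 2
b2 = np.zeros(5)
W3 = rng.standard_normal((3, 5))   # W3[j, k]: from neuron k at layer 2 to neuron j at layer 3
b3 = np.zeros(3)

x = rng.standard_normal(4)         # input vector
h = relu(W2 @ x + b2)              # y = alpha(Wx + b) at the hidden layer
y = W3 @ h + b3                    # output layer
print(y)
```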
The convolutional neural network (CNN) is a deep neural network with a convolutional structure. CNN has excellent features such as local perception and weight sharing, which can greatly reduce weight parameters and improve network performance. The CNN has made many breakthroughs in computer vision and image analysis, and has become the core technology of artificial intelligence and deep learning. The convolutional neural network includes a feature extractor that includes a convolutional layer and a sub-sampling layer, and the feature extractor may be considered as a filter. The convolutional layer is a neuron layer that is in the convolutional neural network and at which convolution processing is performed on an input signal. At the convolutional layer of the convolutional neural network, one neuron may be connected only to some adjacent-layer neurons. One convolutional layer usually includes several feature planes, and each feature plane may include some neural units that are in a rectangular arrangement. Neural units in a same feature plane share a weight, and the weight shared herein is a convolution kernel. Weight sharing may be understood as that an image information extraction manner is irrelevant to a location. The convolution kernel may be initialized in a form of a random-size matrix. In a process of training the convolutional neural network, the convolution kernel may obtain an appropriate weight through learning. In addition, benefits directly brought by weight sharing are that connections between layers of the convolutional neural network are reduced, and an overfitting risk is reduced.
(4) Recurrent neural network (RNN): The recurrent neural network, also referred to as a recursive neural network, is used to process sequence data. In a conventional neural network model, from an input layer to an intermediate layer and then to an output layer, the layers are fully connected, but nodes within each layer are not connected. Although this common neural network resolves many problems, it is still incapable of resolving many others. For example, to predict a next word in a sentence, a previous word usually needs to be used, because the previous word and the next word in the sentence are not independent. The RNN is referred to as a recurrent neural network because a current output of a sequence is also related to a previous output of the sequence. A specific representation form is that the network memorizes previous information and applies the previous information to calculation of the current output. To be specific, nodes in the intermediate layer are no longer unconnected, but are connected, and the input of the intermediate layer includes not only the output of the input layer but also the output of the intermediate layer at a previous moment. Theoretically, the RNN can process sequence data of any length. Training for the RNN is the same as training for a CNN or DNN.
The residual neural network is proposed to resolve degradation generated when there are too many hidden layers in a neural network. Degradation means that when there are more hidden layers in the network, accuracy of the network gets saturated and then degrades dramatically. In addition, degradation is not caused by overfitting. However, when backpropagation is performed, and backpropagation reaches a bottom layer, correlation between gradients is low, the gradients are not fully updated, and consequently, accuracy of a prediction label of a finally obtained model is reduced. When the neural network degrades, training effect of a shallow network is better than training effect of a deep network. In this case, if a feature at a lower layer is transmitted to a higher layer, effect is at least not worse than the effect of the shallow network. Therefore, the effect may be reached through identity mapping. Identity mapping is referred to as a shortcut connection, and it is easier to optimize shortcut mapping than to optimize original mapping.
For the foregoing neural networks, such as the CNN, the DNN, the RNN, or the ResNet, a multiplication convolution solution may be used. A core of the multiplication convolution solution is to extract a similarity between a filter and an input image through a convolution multiplication operation.
Generally, a multiplication convolution kernel may be represented as $Y(m,n,t)=\sum_{i}\sum_{j}\sum_{k} S\big(X(m+i,\,n+j,\,k),\,F(i,j,k,t)\big)$ with $S(x,y)=x\times y$, where S(x, y) represents a similarity between x and y, X represents an input image, F represents a filter for convolution calculation, i and j represent a horizontal coordinate and a vertical coordinate of the convolution kernel, k represents an input channel, and t represents an output channel.
A difference from the foregoing multiplication neural network lies in that the adder neural network may extract a similarity between an input image and a feature extractor through a subtraction operation on the input image and the feature extractor.
For example, an adder convolution kernel may be represented as $Y(m,n,t)=-\sum_{i}\sum_{j}\sum_{k}\big|X(m+i,\,n+j,\,k)-F(i,j,k,t)\big|$, that is, the similarity is measured by the negative $\ell_1$ distance between the input and the filter.
For ease of understanding, a difference between the multiplication convolution kernel and the adder convolution kernel is described by using an example. For example, the difference between the multiplication convolution kernel and the adder convolution kernel may be shown in
and a matrix corresponding to the convolution kernel is
When the convolution kernel is a multiplication convolution kernel, an operation of the convolution kernel includes performing a multiplication operation on all elements in the input matrix and the convolution kernel, and for example, is represented as:
When the convolution kernel is an adder convolution kernel, an operation of the convolution kernel includes performing an addition operation on all elements in the input matrix and the convolution kernel, and for example, is represented as:
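The specific numeric formulas above depend on the example matrices given in the accompanying figure and are omitted here. For concreteness, the following NumPy sketch (with hypothetical values, not the figure's) contrasts the two kernel operations on a 3×3 single-channel patch: the multiplication kernel accumulates element-wise products, and the adder kernel accumulates negative absolute differences.

```python
import numpy as np

patch = np.array([[1., 2., 0.],
                  [3., 1., 1.],
                  [0., 2., 2.]])     # a 3x3 region of the input feature map
kernel = np.array([[ 1., 0., -1.],
                   [ 1., 0., -1.],
                   [ 1., 0., -1.]])  # a 3x3 convolution kernel

# Multiplication convolution: similarity measured by element-wise products
mult_out = np.sum(patch * kernel)

# Adder convolution: similarity measured by the (negative) L1 distance
adder_out = -np.sum(np.abs(patch - kernel))

print(mult_out, adder_out)
```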
In a process of training a deep neural network, because it is expected that an output of the deep neural network is as close as possible to a value that actually needs to be predicted, a current predicted value of the network and an actually expected target value may be compared, and then a weight vector of each layer of the neural network is updated based on a difference between the current predicted value and the target value (where certainly, there is usually an initialization process before the first update, to be specific, parameters are preconfigured for all layers of the deep neural network). For example, if the predicted value of the network is large, the weight vector is adjusted to decrease the predicted value, and adjustment is continuously performed, until the deep neural network can predict the actually expected target value or a value that more approximates the actually expected target value. Therefore, “how to obtain, through comparison, a difference between the predicted value and the target value” needs to be predefined. This is a loss function or an objective function. The loss function and the objective function are important equations that measure the difference between the predicted value and the target value. The loss function is used as an example. A higher output value (loss) of the loss function indicates a larger difference. Therefore, training of the deep neural network is a process of minimizing the loss as much as possible.
In a training process, a neural network may correct values of parameters of an initial neural network model by using an error back propagation (BP) algorithm, so that a reconstruction error loss of the neural network model becomes increasingly smaller. Specifically, an input signal is transferred forward until an error loss is produced at the output, and the parameters in the initial neural network model are updated based on back-propagated error loss information, to make the error loss converge. The back propagation algorithm is an error-loss-centered back propagation process intended to obtain parameters, such as a weight matrix, of an optimal neural network model.
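The predict-compare-update loop described above can be illustrated with a deliberately tiny example (a single weight, a squared-error loss, and plain gradient descent; not the training procedure of any particular model):

```python
x, target = 2.0, 10.0   # one training sample and its expected target value
w = 0.5                 # initialized weight
lr = 0.05               # learning rate

for step in range(100):
    pred = w * x                    # forward pass: current predicted value
    loss = (pred - target) ** 2     # loss: difference between prediction and target
    grad = 2 * (pred - target) * x  # back propagation: gradient of the loss w.r.t. w
    w -= lr * grad                  # update the weight to reduce the loss

print(w, loss)   # w converges toward target / x = 5.0
```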
(10) Model quantization: Model quantization is a model compression mode in which high-bit data is converted into low-bit data. For example, a model compression technology that converts a 32-bit floating-point operation into a low-bit integer operation may be referred to as model quantization, and low-bit quantization to 8 bits may be referred to as INT8 quantization. To be specific, a weight that originally needs to be represented by FP32 only needs to be represented by INT8 after quantization. Theoretically, a four-fold network acceleration can be achieved, and an 8-bit value occupies only a quarter of the storage space of a 32-bit value. This reduces storage space and computing time, thus compressing the model and implementing acceleration.
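A minimal sketch of such FP32-to-INT8 conversion is shown below, assuming symmetric quantization with a single scaling coefficient; production toolchains additionally handle zero points, per-channel scales, and calibration:

```python
import numpy as np

def quantize_int8(w_fp32):
    """Symmetric quantization of an FP32 tensor to INT8."""
    scale = 127.0 / np.max(np.abs(w_fp32))        # map max |w| to 127
    w_int8 = np.clip(np.round(w_fp32 * scale), -128, 127).astype(np.int8)
    return w_int8, scale

def dequantize(w_int8, scale):
    return w_int8.astype(np.float32) / scale

w = np.random.randn(64).astype(np.float32)        # FP32 weights (4 bytes each)
w_q, s = quantize_int8(w)                         # INT8 weights (1 byte each)
print(np.max(np.abs(w - dequantize(w_q, s))))     # quantization error
```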
Generally, a CNN is a commonly used neural network. A neural network mentioned below in this disclosure may include a convolutional neural network in which an adder convolution kernel or a multiplication convolution kernel is disposed. For ease of understanding, the following describes a structure of a convolutional neural network by using an example.
For example, the following describes the structure of the CNN in detail with reference to
As shown in
Convolutional layer:
As shown in
The following uses the convolutional layer 221 as an example to describe an internal working principle of one convolutional layer.
The convolutional layer 221 may include a plurality of convolution operators. The convolution operator is also referred to as a kernel. In image processing, the convolution operator functions as a filter that extracts specific information from an input image matrix. The convolution operator may essentially be a weight matrix, and the weight matrix is usually predefined. In a process of performing a convolution operation on an image, the weight matrix usually processes an input image pixel by pixel (or two pixels by two pixels, . . . , which depends on a value of a stride) in a horizontal direction, to extract a specific feature from the image. A size of the weight matrix should be related to a size of the picture. It should be noted that a depth dimension of the weight matrix is the same as a depth dimension of the input picture. In a process of performing a convolution operation, the weight matrix extends to an entire depth of the input picture. Therefore, a convolutional output of a single depth dimension is generated through convolution with a single weight matrix. However, in most cases, a single weight matrix is not used, but a plurality of weight matrices with a same size (rows×columns), namely, a plurality of same-type matrices, are applied. Outputs of the weight matrices are stacked to form a depth dimension of a convolutional picture. The dimension herein may be understood as being determined based on the foregoing “plurality”. Different weight matrices may be used to extract different features from the image. For example, one weight matrix is used to extract edge information of the image, another weight matrix is used to extract a specific color of the image, and still another weight matrix is used to blur unnecessary noise in the image. Sizes of the plurality of weight matrices (rows×columns) are the same. Sizes of feature maps extracted from the plurality of weight matrices with the same size are also the same, and then the plurality of extracted feature maps with the same size are combined to form an output of the convolution operation.
During actual application, weighted values in the weight matrices need to be obtained through massive training. Weight matrices formed by weighted values obtained through training may be used to extract information from an input image, so that the convolutional neural network 200 performs correct prediction.
When the convolutional neural network 200 has a plurality of convolutional layers, a larger quantity of general features are usually extracted at an initial convolutional layer (for example, the convolutional layer 221). The general features may also be referred to as low-level features. As a depth of the convolutional neural network 200 increases, a feature extracted at a more subsequent convolutional layer (for example, the convolutional layer 226), for example, a high-level semantic feature, is more complex. A feature with higher semantics is more applicable to a to-be-resolved problem.
Because a quantity of training parameters usually needs to be reduced, a pooling layer usually needs to be periodically introduced after a convolutional layer, and the pooling layer may also be referred to as a down-sampling layer. For the layers 221 to 226 in the layer 220 shown in
After processing performed at the convolutional layer/pooling layer 220, the convolutional neural network 200 still cannot output required output information. As described above, at the convolutional layer/pooling layer 220, only a feature is extracted, and parameters brought by an input image are reduced. However, to generate final output information (required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate an output of one required class or a group of required classes. Therefore, the neural network layer 230 may include a plurality of intermediate layers (231, 232, . . . , and 23n shown in
At the neural network layer 230, the plurality of intermediate layers are followed by the output layer 240, namely, a last layer of the entire convolutional neural network 200. The output layer 240 has a loss function similar to a categorical cross entropy, and the loss function is used to calculate a prediction error. Once forward propagation (for example, propagation in a direction from 210 to 240 in
It should be noted that the convolutional neural network 200 shown in
In this disclosure, a to-be-processed image may be processed based on the convolutional neural network 200 shown in
A neural network parameter quantization method provided in this embodiment of this disclosure may be performed on a server, or may be performed on a terminal device. The terminal device may be a mobile phone with an image processing function, a tablet personal computer (TPC), a media player, a smart television, a laptop computer (LC), a personal digital assistant (PDA), a personal computer (PC), a camera, a video camera, a smartwatch, a wearable device (WD), an autonomous vehicle, or the like. This is not limited in this embodiment of this disclosure.
As shown in
After collecting the training data, the data collection device 160 stores the training data in a database 130, and a training device 120 obtains a target model/rule 101 through training based on the training data maintained in the database 130. Optionally, the training set mentioned in the following implementations of this disclosure may be obtained from the database 130, or may be obtained based on data entered by a user.
The target model/rule 101 may be a neural network mentioned below in embodiments of this disclosure.
The following describes how the training device 120 obtains the target model/rule 101 based on the training data. The training device 120 processes an input original image, and compares an output image with the original image until a difference between the image output by the training device 120 and the original image is less than a specific threshold. In this way, training of the target model/rule 101 is completed.
The target model/rule 101 may be configured to implement the first neural network obtained by using the neural network parameter quantization method in embodiments of this disclosure. In other words, preprocessed to-be-processed data (for example, an image) is input to the target model/rule 101, to obtain a processing result. The target model/rule 101 in this embodiment of this disclosure may further be the first neural network mentioned below in this disclosure. The first neural network may be a neural network such as a CNN, a DNN, or an RNN. It should be noted that, during actual application, the training data maintained in the database 130 is not necessarily all collected by the data collection device 160, and may be received from another device. In addition, it should be noted that the training device 120 does not necessarily train the target model/rule 101 completely based on the training data maintained in the database 130, and may obtain training data from a cloud or another place to perform model training. The foregoing descriptions should not be construed as a limitation on this embodiment of this disclosure.
The target model/rule 101 obtained through training by the training device 120 may be applied to different systems or devices, for example, an execution device 110 shown in
A preprocessing module 113 and a preprocessing module 114 are configured to perform preprocessing based on the input data (for example, the to-be-processed data) received by the I/O interface 112. In embodiments of this disclosure, the preprocessing module 113 and the preprocessing module 114 may not exist (or only one of the preprocessing module 113 and the preprocessing module 114 exists), and a computing module 111 is directly configured to process the input data.
In a related processing procedure in which the execution device 110 preprocesses the input data or a computing module 111 of the execution device 110 performs computation, the execution device 110 may invoke data, code, and the like in a data storage system 150 to implement corresponding processing, or may store, into the data storage system 150, data, an instruction, and the like obtained through corresponding processing.
Finally, the I/O interface 112 returns the processing result to the client device 140, to provide the processing result to the user. For example, if the first neural network is used to perform image classification, the processing result is a classification result. Then, the I/O interface 112 returns the obtained classification result to the client device 140, to provide the classification result to the user.
It should be noted that the training device 120 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data. The corresponding target models/rules 101 may be used to achieve the foregoing targets or complete the foregoing tasks, to provide a needed result for the user. In some scenarios, the execution device 110 and the training device 120 may be a same device, or may be located inside a same computing device. For ease of understanding, the execution device and the training device are separately described in this disclosure, and this is not limited.
In the case shown in
It should be noted that
As shown in
Refer to
A user may operate user equipment (for example, a local device 401 and a local device 402) to interact with the execution device 110. Each local device may represent any computation device, for example, a personal computer, a computer workstation, a smartphone, a tablet computer, an intelligent camera, a smart automobile, another type of cellular phone, a media consumption device, a wearable device, a set-top box, or a game console.
The local device of each user may interact with the execution device 110 through a communication network of any communication mechanism/communication standard. The communication network may be a wide area network, a local area network, a point-to-point connection, or any combination thereof. Further, the communication network may include a wireless network, a wired network, a combination of a wireless network and a wired network, or the like. The wireless network includes but is not limited to any one or any combination of a 5th generation (5G) mobile communication technology system, a Long-Term Evolution (LTE) system, a Global System for Mobile Communications (GSM), a code-division multiple access (CDMA) network, a wideband CDMA (WCDMA) network, WI-FI, BLUETOOTH, ZIGBEE, a radio frequency identification (RFID) technology, long range (LoRa) wireless communication, and near-field communication (NFC). The wired network may include an optical fiber communication network, a network including coaxial cables, or the like.
In another implementation, one or more aspects of the execution device 110 may be implemented by each local device. For example, the local device 401 may provide local data or feed back a calculation result for the execution device 110.
It should be noted that all functions of the execution device 110 may also be implemented by the local device. For example, the local device 401 implements the functions of the execution device 110 and provides a service for a user of the local device 401, or provides a service for a user of the local device 402.
As performance of a neural network model becomes stronger, parameters of the neural network model also increase. When a network is running, requirements for and consumption of storage, computing, bandwidth, and energy are increasing. This is not conducive to deployment of an artificial intelligence algorithm in a hardware terminal device with limited resources. Reducing storage and computing requirements by using technical means such as pruning and compression of a neural network has become an important part of implementation of a neural network algorithm in an actual terminal.
Model compression is a common means of constructing a lightweight neural network. FP32 is usually used for storage in a neural network. Generally, a neural network has good robustness, and when precision of a parameter of a large neural network is reduced through quantization, coding, and the like, the neural network can still maintain good performance. Common low-precision data includes numerical formats such as FP16, INT16, INT8, INT4, and 1-bit. With network performance and a model compression degree comprehensively considered, converting a weight parameter from FP32 to INT8 is a common quantization method.
For another example, using a lightweight convolution solution is another technical means for constructing a lightweight neural network. Most common convolutional neural networks use multiplication convolution solutions, whose core is to extract a similarity between a filter and an input image by using a convolution multiplication operation. For an adder neural network, a shared-scale quantization technique may be used, in which INT8 quantization is performed with a single quantization parameter s on the result of subtracting the weight from the input feature, and good effect is achieved during INT8 quantization.
For example, as shown in
In addition, only one shared parameter is used to quantize the convolution kernel parameters and the features of the adder network. This inevitably reduces the expression capability of the adder network, and experiments verify that the adder network is basically lossless only under 8-bit quantization. In low-bit (<8 bits) quantization, the precision loss of the quantized adder network is large. In another aspect, as the quantity of bits decreases, the energy consumption of basic operations and the required chip area decrease to different degrees.
Therefore, this disclosure provides a neural network parameter quantization method that can be applied to various neural network quantization scenarios, to implement efficient low-bit quantization.
Based on the foregoing description, the following describes in detail the neural network parameter quantization method provided in this disclosure.
701: Obtain a parameter of each neuron in a to-be-quantized model, to obtain a parameter set.
The to-be-quantized model may include an adder network, a multiplication network, or the like. The to-be-quantized model may include a plurality of neurons, and a parameter of each neuron may be read to obtain the parameter set of the to-be-quantized model.
The parameter of each neuron may include a parameter value in the neuron, or may include a weight occupied by an output of a neuron at each intermediate layer. Therefore, when subsequent quantization is performed, most parameters in the neural network may be quantized, to obtain a lightweight model.
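One possible way to implement step 701 is sketched below for a PyTorch model; the layer types hooked and the calibration input are assumptions for illustration, not requirements of this disclosure:

```python
import torch
import torch.nn as nn

def collect_parameter_set(model: nn.Module, calib_input: torch.Tensor):
    """Collect weight parameters and output-feature parameters of each neuron/layer."""
    weight_params = {name: p.detach().clone()
                     for name, p in model.named_parameters() if "weight" in name}

    feature_params = {}
    hooks = []
    def make_hook(name):
        def hook(module, inputs, output):
            feature_params[name] = output.detach().clone()
        return hook

    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            hooks.append(module.register_forward_hook(make_hook(name)))

    with torch.no_grad():
        model(calib_input)          # one calibration pass to record output features
    for h in hooks:
        h.remove()
    return weight_params, feature_params
```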
702: Cluster the parameter set to obtain a plurality of types of classified data.
After the parameter set is obtained, parameters in the parameter set may be clustered to obtain one or more types of classified data. It may be understood that the parameters in the parameter set are classified into one or more types through clustering.
A specific clustering manner may include K-Means clustering, mean shift clustering, a density-based clustering method (such as density-based spatial clustering of applications with noise (DBSCAN)), expectation-maximization clustering based on a Gaussian mixture model, and the like. Further, a matched clustering manner may be selected based on an actual usage scenario. This is not limited in this disclosure.
Optionally, a specific process of obtaining the plurality of types of classified data may include clustering the parameters in the parameter set to obtain one or more types of classified data, and extracting a preset quantity of parameters from the one or more types of classified data to obtain the foregoing plurality of types of classified data. Therefore, in this implementation of this disclosure, a threshold does not need to be specified, and a calculation amount for the threshold is reduced by extracting a specific quantity. This improves deployment generalization of the method provided in this disclosure.
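A sketch of step 702 under stated assumptions (K-Means over the flattened parameters, and the preset quantity realized by keeping the parameters closest to each cluster mean; the disclosure does not mandate this particular selection rule):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_parameters(param_set: np.ndarray, num_classes: int = 4,
                       preset_quantity: int = 1024):
    """Cluster a flat parameter set and keep a preset quantity of parameters per class."""
    values = param_set.reshape(-1, 1)
    labels = KMeans(n_clusters=num_classes, n_init=10).fit_predict(values)

    classified = []
    for c in range(num_classes):
        members = param_set.reshape(-1)[labels == c]
        # Keep at most `preset_quantity` parameters closest to the cluster mean,
        # which drops outliers without computing an explicit truncation threshold.
        order = np.argsort(np.abs(members - members.mean()))
        classified.append(members[order[:preset_quantity]])
    return classified
```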
703: Quantize each type of data in the plurality of types of classified data to obtain at least one type of quantization parameter.
After clustering is performed to obtain the plurality of types of classified data, each type of classified data may be quantized, that is, bits occupied by a parameter in each type of classified data are reduced, to obtain the at least one type of quantization parameter. The at least one type of quantization parameter is used to obtain a compression model. For example, if a data type of a parameter in the to-be-quantized model is FP16, the data type of the parameter may be converted into INT8, to reduce bits occupied by the parameter and implement low-bit quantization of the parameter. This implements model compression, and a lightweight model is obtained.
Generally, parameters in the parameter set may be classified into parameter values of neurons, for example, referred to as weight parameters, and output feature values of neurons, for example, referred to as feature parameters. The feature parameters may include feature values output by the neurons in the to-be-quantized model after an input image is input to the to-be-quantized model, and the input image may be a preset image or a randomly selected image. The feature parameters and the weight parameters usually affect each other, but the ranges of the two types of parameters may be different. If only one type of quantization parameter is used for quantization, some parameters may be truncated or bits may be wasted. For example, if the range of activation values is used to quantize the weights, most weights are truncated, and this greatly damages the precision of the quantization model. If the range of weights is used to quantize the activation values, only a few bits can be effectively used, and this causes a waste of bits. In this disclosure, the weight range may be truncated to the feature range, to use bits as effectively as possible without reducing model precision, and to avoid a waste of bits.
Therefore, in this implementation of this disclosure, the parameters in the to-be-quantized model are clustered, and the parameters are classified into a plurality of types and then quantized, so that classification and quantization can be implemented, and an expression capability of a model obtained through quantization can be improved. Especially for an adder network, instead of quantization performed by using a shared parameter, quantization after clustering provided in this disclosure can significantly improve a model expression capability, to obtain a lightweight model with higher output precision. In addition, according to the method provided in this disclosure, the parameters are clustered and then quantized, so that a lightweight model with a more accurate output can be obtained only by increasing a small workload. This can be applicable to more scenarios in which a lightweight model needs to be deployed.
The foregoing describes a method procedure provided in this disclosure. For ease of understanding, the following describes in more detail the procedure of the neural network parameter quantization method provided in this disclosure with reference to a specific usage scenario.
First, a to-be-quantized model 801 is obtained. The to-be-quantized model 801 may include a multiplication network, an adder network, or the like.
The method provided in this disclosure may be deployed on a server, or may be deployed on a terminal. For example, the method provided in this disclosure may be deployed on a server. After the server quantizes the to-be-quantized model, an obtained lightweight model may be deployed on a terminal, so that the lightweight model can run on the terminal, and running efficiency of the terminal can be improved.
The to-be-quantized model may be further a CNN, an ANN, an RNN, or the like. The to-be-quantized model may include a plurality of network layers that may be divided into, for example, an input layer, an intermediate layer, and an output layer. Each network layer may include one or more neurons. Generally, an output of a neuron at an upper-layer network layer may be used as an input of a neural network at a lower-layer network layer. Further, for example, the to-be-quantized model may be used to perform one or more tasks such as an image recognition task, a target detection task, a segmentation task, and a classification task. Generally, to reduce computing resources required for running the neural network, the neural network may be compressed, to reduce the computing resources required for running the neural network while maintaining output precision of the neural network.
The method provided in this disclosure may be applicable to a plurality of types of neural networks, and a specific type of the neural network may be determined based on an actual usage scenario. This is not limited in this disclosure.
Then, parameters are extracted from the to-be-quantized model 801 to obtain a parameter set 802.
Further, a parameter like a parameter inside each neuron or an output weight of each neuron may be extracted to obtain the parameter set. For ease of differentiation, in the following, a feature value output by each neuron is referred to as a feature parameter (represented as w), and a convolution kernel parameter of each neuron is referred to as a weight parameter (represented as x).
For example, the to-be-quantized model 801 may include a plurality of neurons, and each neuron has a parameter. For example, each neuron may include one or more of the following: average pooling (avg_pool_3×3) with a pooling kernel size of 3×3, maximum pooling (max pool_3×3) with a pooling kernel size of 3×3, separable convolution (sep_conv_3×3) with a convolution kernel size of 3×3, separable convolution (sep_conv_5×5) with a convolution kernel size of 5×5, dilated convolution (dil_conv_3×3) with a convolution kernel size of 3×3 and a dilation rate of 2, dilated convolution (dil_conv_5×5) with a convolution kernel size of 5×5 and a dilation rate of 2, a skip-connection (skip-connect) operation, a zero operation (Zero, each neuron at a corresponding position is set to zero), or the like. Parameters may be extracted from these operation manners, to obtain a parameter inside a neuron, namely, the weight parameter. An output of a neuron at an upper layer may be used as an input of a neuron at a lower layer, and an output of each neuron may be different. After an input image is input to the to-be-quantized model, a feature value output by each neuron is extracted, to obtain the feature parameter. A parameter inside the neuron or a feature value output by each neuron may form the parameter set 802.
Subsequently, parameters in the parameter set 802 are clustered to obtain a plurality of types of classified data 803.
For example, clustering is performed in a clustering manner like K-Means clustering, mean shift clustering, or DBSCAN, parameters are classified into a plurality of types, and a specific quantity of parameters are extracted from each type of parameters. This reduces quantization of abnormal parameters and improves an expression capability of a model. For example, the weight parameter and the feature parameter may be separately classified, to avoid using a same quantization parameter for the feature parameter and the weight. In addition, compared with truncating a parameter by selecting a threshold, in this disclosure, extracting a specific quantity of parameters from each type of parameters can reduce a workload of calculating a threshold.
Generally, the feature parameters have a great impact on model feature extraction. Therefore, outliers in the feature parameters have a great impact on the statistical range of the model's features, and further affect the quantized scaling coefficient. For example, as shown in
In addition, for an adder network, a shared quantization parameter is usually used for a feature parameter and a weight parameter. As shown in
It may be understood that in this disclosure, a range of the weight parameters is truncated to a range covered by the feature parameters, and truncated weight parameter values and feature parameter values are integrated into a bias ratio (or bias). To be specific, most weight parameters and feature parameters are retained without affecting model precision. This improves an expression capability of a quantized lightweight model.
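A hedged sketch of the range truncation described above: the weights are clipped to the range covered by the feature parameters before the scaling coefficient is computed. Folding the clipped residue into the bias is model-specific and is omitted here.

```python
import numpy as np

def truncate_weights_to_feature_range(weights: np.ndarray, features: np.ndarray):
    """Clip weights to the numeric range spanned by the feature parameters."""
    lo, hi = features.min(), features.max()
    clipped = np.clip(weights, lo, hi)
    # The quantization scale is then derived from the shared (feature) range,
    # so bits are not wasted on weight outliers far outside the feature range.
    scale = 127.0 / max(abs(lo), abs(hi))
    return clipped, scale
```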
Subsequently, the plurality of types of classified data 803 is quantized to obtain at least one type of quantization parameter 804.
For example, a structure of the to-be-quantized model may be shown in
A plurality of quantization manners may be used. A symmetric quantization method is used as an example. First, the term with the maximum absolute value among all weight parameters in each type of classified data is found, and is denoted as $\max(|X_f|)$. Second, the number n of bits to be quantized and the corresponding numerical representation range $[-2^{n-1}, 2^{n-1}-1]$ are determined. INT8 is used as an example, and the numerical range that can be represented by INT8 is [−128, 127]. Third, the scaling coefficient of the weight parameters is determined, for example, represented as $\mathrm{scale}=(2^{n-1}-1)/\max(|X_f|)$. Finally, all parameters of this group of classified data are multiplied by the coefficient and rounded to the nearest integer. In this way, the parameters of this layer are quantized.
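The four steps above may be sketched as follows for one type of classified data (symmetric quantization to n bits; illustrative only):

```python
import numpy as np

def quantize_class(class_params: np.ndarray, num_bits: int = 8):
    """Symmetrically quantize one type of classified data to num_bits."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for INT8
    max_abs = np.max(np.abs(class_params))         # max(|X_f|) within this class
    scale = qmax / max_abs                         # scaling coefficient of the class
    q = np.clip(np.round(class_params * scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

# Each type of classified data gets its own scale, unlike a single shared scale.
classes = [np.random.randn(100) * s for s in (0.1, 1.0, 5.0)]
quantized = [quantize_class(c) for c in classes]
```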
Because $\mathrm{scale}=(2^{n-1}-1)/\max(|v|)$, where v = x or w, the scale is related to $\max(|w|)$. Therefore, in this disclosure, $\max(|w|)$ is first extracted as a feature of each convolution kernel, and then the convolution kernels are clustered in a K-Means clustering manner. The quantization solution with a plurality of shared scales provided in this disclosure introduces only a very small amount of additional calculation, and therefore does not increase power consumption.
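The kernel-grouping idea can be sketched as follows, assuming K-Means over the per-kernel max(|w|) values and a hypothetical group count; kernels in the same group then share one scaling coefficient:

```python
import numpy as np
from sklearn.cluster import KMeans

def group_shared_scales(conv_weight: np.ndarray, num_groups: int = 4, num_bits: int = 8):
    """conv_weight: (out_channels, in_channels, kH, kW); one scale per kernel group."""
    out_channels = conv_weight.shape[0]
    # Feature of each convolution kernel: its maximum absolute weight max(|w|)
    kernel_max = np.abs(conv_weight.reshape(out_channels, -1)).max(axis=1)

    groups = KMeans(n_clusters=num_groups, n_init=10).fit_predict(kernel_max.reshape(-1, 1))

    qmax = 2 ** (num_bits - 1) - 1
    scales = np.zeros(num_groups)
    for g in range(num_groups):
        scales[g] = qmax / kernel_max[groups == g].max()   # shared scale of group g
    return groups, scales
```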
Subsequently, the obtained at least one type of quantization parameter is used to form a lightweight model 805. The lightweight model may be deployed in various devices, for example, in a terminal or a server with low computing power, such as a mobile phone or a smart band.
Generally, stronger performance of a neural network indicates a larger scale of the neural network and more parameters, and requirements for and consumption of storage, bandwidth, energy, and computing resources are higher. According to the neural network quantization method provided in this disclosure, a model can be quantized to obtain a lightweight model, so that the lightweight model can be deployed in various devices. A lightweight model with an accurate output can be obtained while requirements for storage, bandwidth, energy, and computing resources are reduced. The lightweight model is applicable to more scenarios, and has a stronger generalization capability.
The method provided in this disclosure may be deployed in various terminals or servers, especially in a resource-limited device. According to the method provided in this disclosure, a quantized lightweight model can be obtained while an expression capability of the model is ensured. Therefore, a more accurate output result can also be obtained in the resource-limited device.
For example, the method provided in this disclosure may be deployed in a neural network accelerator, and the neural network accelerator improves a running speed of a convolutional network in a hardware module through parallel computing or the like. In this disclosure, hardware resource consumption can be greatly reduced for an adder network, so that hardware resources can be more fully utilized to construct a convolution acceleration module with a higher degree of parallelism. This further improves acceleration effect.
For another example, the method provided in this disclosure may be deployed on a low-power consumption AI chip. In hardware terminal deployment of an artificial neural network chip, power consumption is a core issue. In this disclosure, for an adder convolution kernel, operating power consumption of a circuit can be effectively reduced, so that an AI chip can be deployed on a resource-limited terminal device.
For ease of understanding, effect of deploying the method provided in this disclosure in some common datasets may be shown in Table 1.
It can be learned from Table 1 that, during quantization to relatively high bit widths (for example, 8, 6, and 5 bits), when only post-training quantization (PTQ) is used in the group-shared-scale quantization solution provided in this disclosure, the precision loss of the quantization model is very small. During 4-bit quantization, the precision loss of the quantization model is also very small with quantization-aware training (QAT).
In addition, tests are performed in ImageNet, as shown in Table 2.
It can be learned from Table 2 that, during quantization to relatively high bit widths (8, 6, and 5 bits), when only post-training quantization (PTQ) is used in the group-shared-scale quantization solution provided in this disclosure, the precision loss of the quantization model is very small. During 4-bit quantization, the precision loss of the quantization model is also very small with quantization-aware training (QAT).
The foregoing describes in detail the procedure of the method provided in this disclosure. The following describes an apparatus for performing the foregoing method.
In a possible implementation, the clustering module 1302 is further configured to cluster the parameter set to obtain at least one type of clustered data, and extract a preset quantity of parameters from each type of clustered data in the at least one type of clustered data, to obtain the plurality of types of classified data.
In a possible implementation, the parameter in the to-be-quantized model includes a parameter in a feature output by each neuron or a parameter value in each neuron.
In a possible implementation, the to-be-quantized model includes an adder neural network.
In a possible implementation, the compression model is used to perform at least one of image recognition, a classification task, or target detection.
The neural network parameter quantization apparatus may include a processor 1401 and a memory 1402. The processor 1401 and the memory 1402 are interconnected through a line. The memory 1402 stores program instructions and data.
The memory 1402 stores the program instructions and the data corresponding to steps corresponding to
The processor 1401 is configured to perform the method steps performed by the neural network parameter quantization apparatus shown in any one of the foregoing embodiments in
Optionally, the neural network parameter quantization apparatus may alternatively include a transceiver 1403 configured to receive or send data.
An embodiment of this disclosure further provides a computer-readable storage medium. The computer-readable storage medium stores a program. When the program is run on a computer, the computer is enabled to perform the steps in the methods described in the embodiments shown in
Optionally, the neural network parameter quantization apparatus shown in
An embodiment of this disclosure further provides a neural network parameter quantization apparatus. The neural network parameter quantization apparatus may also be referred to as a digital processing chip or a chip. The chip includes a processing unit and a communication interface. The processing unit obtains program instructions through the communication interface, and when the program instructions are executed by the processing unit, the processing unit is configured to perform the method steps performed by the neural network parameter quantization apparatus in any one of the foregoing embodiments in
An embodiment of this disclosure further provides a digital processing chip. A circuit and one or more interfaces that are configured to implement the processor 1401 or a function of the processor 1401 are integrated into the digital processing chip. When a memory is integrated into the digital processing chip, the digital processing chip may implement the method steps in any one or more embodiments in the foregoing embodiments. When a memory is not integrated into the digital processing chip, the digital processing chip may be connected to an external memory through a communication interface. The digital processing chip implements, based on program code stored in the external memory, the actions performed by the neural network parameter quantization apparatus in the foregoing embodiments.
An embodiment of this disclosure further provides a computer program product. When the computer program product runs on a computer, the computer is enabled to perform the steps performed by the neural network parameter quantization apparatus in the methods described in the foregoing embodiments.
The neural network parameter quantization apparatus in this embodiment of this disclosure may be a chip. The chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute computer-executable instructions stored in a storage unit, so that a chip in the server performs the neural network parameter quantization method described in the foregoing embodiments.
Further, the processing unit or the processor may be a CPU, an NPU, a GPU, a digital signal processor (DSP), an ASIC, an FPGA or another programmable logic device, a discrete gate or a transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any regular processor or the like.
For example, the chip may be a neural-network processing unit (NPU). A core part of the NPU is an operation circuit 1503, and a controller 1504 controls the operation circuit 1503 to extract data from a memory and perform an operation.
In some implementations, the operation circuit 1503 internally includes a plurality of processing units (or process engines (PEs)). In some implementations, the operation circuit 1503 is a two-dimensional systolic array. The operation circuit 1503 may alternatively be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1503 is a general-purpose matrix processor.
For example, it is assumed that there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches corresponding data of the matrix B from a weight memory 1502, and buffers the data on each PE in the operation circuit. The operation circuit obtains data of the matrix A from the input memory 1501 to perform a matrix operation with the matrix B, and stores an obtained partial result or final result of the matrix into an accumulator 1508.
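The following software sketch mimics this data flow for illustration only: a tile of the matrix B is held as the buffered weights, data of the matrix A is streamed against it, and partial results accumulate, in the spirit of the accumulator 1508. The tiling scheme, tile size, and function name are assumptions, not the actual circuit implementation.

```python
import numpy as np

def systolic_style_matmul(a: np.ndarray, b: np.ndarray, tile: int = 16) -> np.ndarray:
    """Software mimic of the operation-circuit data flow.

    The weight matrix B is buffered tile by tile, rows of A are streamed
    against it, and partial results are accumulated before the final result
    is read out, analogous to the role of the accumulator 1508.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    acc = np.zeros((m, n), dtype=np.float32)      # plays the accumulator role
    for k0 in range(0, k, tile):                  # load one tile of B at a time
        b_tile = b[k0:k0 + tile, :]               # weights held by the PE array
        a_tile = a[:, k0:k0 + tile]               # streamed input data
        acc += a_tile @ b_tile                    # accumulate the partial result
    return acc

a = np.random.randn(8, 64).astype(np.float32)
b = np.random.randn(64, 4).astype(np.float32)
assert np.allclose(systolic_style_matmul(a, b), a @ b, atol=1e-4)
```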
A unified memory 1506 is configured to store input data and output data. Weight data is directly transferred to the weight memory 1502 by using a direct memory access controller (DMAC) 1505. The input data is also transferred to the unified memory 1506 by using the DMAC.
A bus interface unit (BIU) 1510 is configured to interact with the DMAC and an instruction fetch buffer (IFB) 1509 through an Advanced eXtensible Interface (AXI) bus.
The BIU 1510 is used by the instruction fetch buffer 1509 to obtain instructions from an external memory, and is further used by the direct memory access controller 1505 to obtain original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly configured to transfer input data in the external memory (for example, a DDR memory) to the unified memory 1506, transfer the weight data to the weight memory 1502, or transfer the input data to the input memory 1501.
A vector calculation unit 1507 includes a plurality of arithmetic processing units. When necessary, the vector calculation unit 1507 performs further processing on an output of the operation circuit, for example, vector multiplication, vector addition, an exponential operation, a logarithmic operation, and value comparison. The vector calculation unit 1507 is mainly configured to perform network calculation at a non-convolutional/fully connected layer in a neural network, for example, batch normalization, pixel-level summation, and up-sampling on a feature map.
In some implementations, the vector calculation unit 1507 can store a processed output vector in the unified memory 1506. For example, the vector calculation unit 1507 may apply a linear function and/or a non-linear function to the output of the operation circuit 1503, for example, perform linear interpolation on a feature map extracted by a convolutional layer, or for another example, accumulate value vectors to generate an activation value. In some implementations, the vector calculation unit 1507 generates a normalized value, a pixel-level summation value, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 1503, for example, to be used in a subsequent layer in the neural network.
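For illustration, the following sketch approximates the kind of post-processing the vector calculation unit 1507 may apply to the accumulated output: batch normalization followed by a non-linear activation whose result can feed a subsequent layer. The choice of ReLU, the parameter names, and the NumPy implementation are assumptions, not the unit's actual behavior.

```python
import numpy as np

def vector_unit_postprocess(acc_out: np.ndarray,
                            gamma: np.ndarray, beta: np.ndarray,
                            eps: float = 1e-5) -> np.ndarray:
    """Illustrative post-processing of the operation-circuit output.

    Batch-normalizes the accumulated matrix result, applies a learned scale
    and shift, then a non-linear activation, mirroring the role of the
    vector calculation unit 1507.
    """
    mean = acc_out.mean(axis=0)
    var = acc_out.var(axis=0)
    normalized = (acc_out - mean) / np.sqrt(var + eps)   # batch normalization
    scaled = gamma * normalized + beta                   # learned scale/shift
    activated = np.maximum(scaled, 0.0)                  # ReLU as the non-linear function
    return activated

out = np.random.randn(32, 16).astype(np.float32)          # accumulator output
act = vector_unit_postprocess(out, gamma=np.ones(16), beta=np.zeros(16))
```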
The IFB 1509 connected to the controller 1504 is configured to store instructions used by the controller 1504.
The unified memory 1506, the input memory 1501, the weight memory 1502, and the IFB 1509 are all on-chip memories. The external memory is private to a hardware architecture of the NPU.
An operation at each layer in a recurrent neural network may be performed by the operation circuit 1503 or the vector calculation unit 1507.
The processor mentioned above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling program execution of the methods in the foregoing embodiments.
In addition, it should be noted that the described apparatus embodiment is merely an example. The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all the modules may be selected according to actual needs to achieve the objectives of the solutions of embodiments. In addition, in the accompanying drawings of the apparatus embodiments provided by this disclosure, connection relationships between modules indicate that the modules have communication connections with each other. This may be further implemented as one or more communication buses or signal cables.
Based on the description of the foregoing implementations, a person skilled in the art may clearly understand that this disclosure may be implemented by software in addition to necessary universal hardware, or by dedicated hardware, including a dedicated integrated circuit, a dedicated CPU, a dedicated memory, a dedicated component, and the like. Generally, any functions that can be performed by a computer program can be easily implemented by using corresponding hardware. Moreover, a specific hardware structure used to achieve a same function may be in various forms, for example, in a form of an analog circuit, a digital circuit, or a dedicated circuit. However, as for this disclosure, software program implementation is a better implementation in most cases. Based on such an understanding, the technical solutions of this disclosure essentially or the part contributing to the technology may be implemented in a form of a software product. The computer software product is stored in a readable storage medium, like a floppy disk, a Universal Serial Bus (USB) flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in embodiments of this disclosure.
All or some of the foregoing embodiments may be implemented by using software, hardware, firmware, or any combination thereof. When software is used to implement the embodiments, all or a part of the embodiments may be implemented in a form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on the computer, the procedure or functions according to embodiments of this disclosure are all or partially generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or other programmable apparatuses. The computer instructions may be stored in a computer-readable storage medium or may be transmitted from a computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired (for example, a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or wireless (for example, infrared, radio, or microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DIGITAL VERSATILE DISC (DVD)), a semiconductor medium (for example, a solid-state disk (SSD)), or the like.
In the specification, claims, and accompanying drawings of this disclosure, the terms “first”, “second”, “third”, “fourth”, and so on (if existent) are intended to distinguish between similar objects but do not necessarily indicate a specific order or sequence. It should be understood that the data termed in such a way are interchangeable in proper circumstances so that embodiments of the present disclosure described herein can be implemented in other orders than the order illustrated or described herein. In addition, the terms “include” and “have” and any other variants are intended to cover the non-exclusive inclusion. For example, a process, method, system, product, or device that includes a list of steps or units is not necessarily limited to those expressly listed steps or units, but may include other steps or units not expressly listed or inherent to such a process, method, product, or device.
Finally, it should be noted that the foregoing descriptions are merely specific implementations of this disclosure, but are not intended to limit the protection scope of this disclosure. Any variation or replacement readily figured out by a person skilled in the art within the technical scope disclosed in this disclosure shall fall within the protection scope of this disclosure.
This is a continuation of International Patent Application No. PCT/CN2023/095019 filed on May 18, 2023, which claims priority to Chinese Patent Application No. 202210600648.7 filed on May 30, 2022. The disclosures of the aforementioned applications are hereby incorporated by reference in their entireties.