This application claims the priority benefit of Taiwan application serial no. 111119653, filed on May 26, 2022. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
The disclosure relates to a machine learning technology, and more particularly to an optimizing method and a computing apparatus for a deep learning network and a computer-readable storage medium.
In recent years, with the rapid advance of artificial intelligence (AI) technology, the number of parameters and the computational complexity of neural network models have been increasing. As a result, compression technologies for deep learning networks have also flourished. It is worth noting that quantization is an important technique for compressing models. However, the prediction accuracy and compression rate of conventional quantized models still need to be improved.
The disclosure provides an optimizing method and a computing apparatus for a deep learning network and a computer-readable storage medium, which can ensure prediction accuracy and compression rate using multi-scale dynamic quantization.
An optimizing method for a deep learning network according to an embodiment of the disclosure includes (but is not limited to) the following steps. A value distribution is obtained from a pre-trained model. One or more breaking points in a range of the value distribution are determined. Quantization is performed on a part of the values of a parameter type in a first section among multiple sections using a first quantization parameter and on the other part of the values of the parameter type in a second section among the sections using a second quantization parameter. The value distribution is a statistical distribution of values of the parameter type in the deep learning network. The range is divided into the sections by the one or more breaking points. The first quantization parameter is different from the second quantization parameter.
A computing apparatus for a deep learning network according to an embodiment of the disclosure includes (but is not limited to) a memory and a processor. The memory is used for storing a code. The processor is coupled to the memory. The processor loads and executes the code to obtain a value distribution from a pre-trained model, determine one or more breaking points in a range of the value distribution, and perform quantization on a part of the values of a parameter type in a first section among multiple sections using a first quantization parameter and on the other part of the values of the parameter type in a second section among the sections using a second quantization parameter. The value distribution is a statistical distribution of values of the parameter type in the deep learning network. The range is divided into the sections by the one or more breaking points. The first quantization parameter is different from the second quantization parameter.
A non-transitory computer-readable storage medium according to an embodiment of the disclosure is used to store a code. A processor loads the code to execute the following steps. A value distribution is obtained from a pre-trained model. One or more breaking points in a range of the value distribution are determined. Quantization is performed on a part of the values of a parameter type in a first section among multiple sections using a first quantization parameter and on the other part of the values of the parameter type in a second section among the sections using a second quantization parameter. The value distribution is a statistical distribution of values of the parameter type in the deep learning network. The range is divided into the sections by the one or more breaking points. The first quantization parameter is different from the second quantization parameter.
Based on the above, according to the optimizing method and the computing apparatus for the deep learning network and the computer-readable storage medium, the value distribution is divided into the sections according to the breaking points, and different quantization parameters are respectively used for the values of the sections. In this way, the quantized distribution can more closely approximate the original value distribution, thereby improving prediction accuracy of a model.
In order for the features and advantages of the disclosure to be more comprehensible, the following specific embodiments are described in detail in conjunction with the drawings.
The memory 110 may be any type of fixed or removable random access memory (RAM), read only memory (ROM), flash memory, traditional hard disk drive (HDD), solid state drive (SSD), or similar elements. In an embodiment, the memory 110 is used to store a code, a software module, a configuration, data, or a file (for example, a sample, a model parameter, a value distribution, or a breaking point).
The processor 150 is coupled to the memory 110. The processor 150 may be a central processing unit (CPU), a graphics processing unit (GPU), other programmable general-purpose or specific-purpose microprocessors, digital signal processors (DSPs), programmable controllers, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), neural network accelerators, other similar elements, or a combination of the foregoing elements. In an embodiment, the processor 150 is used to execute all or part of the operations of the computing apparatus 100 and may load and execute each code, software module, file, and data stored in the memory 110.
Hereinafter, the method according to the embodiment of the disclosure will be described in conjunction with various devices, elements, and modules in the computing apparatus 100. Each process of the method may be adjusted according to the implementation situation and is not limited thereto.
It is worth noting that the pre-trained model has a corresponding parameter (for example, a weight, an input/output activation/feature value) at each layer. It is conceivable that too many parameters will require higher computing and storage requirements, and higher complexity of the parameters will increase the amount of computation. Quantization is one of the techniques for reducing the complexity of a neural network. Quantization can reduce the number of bits for representing the activation/feature value or the weight. There are many types of quantization methods, such as symmetric quantization, asymmetric quantization, and clipping methods.
On the other hand, a value distribution is a statistical distribution of multiple values of one or more parameter types in a deep learning network. The parameter type may be a weight, an input activation/feature value, and/or an output activation/feature value. The statistical distribution expresses the distribution of a statistic (for example, a total number) of each value, an example of which is illustrated in the accompanying drawings.
In an embodiment, the processor 150 may generate the value distribution using verification data. For example, the processor 150 may perform inference on the verification data through a pre-trained floating-point model (that is, the pre-trained model), collect the parameter (for example, the weight, the input activation/feature value, or the output activation/feature value) of each layer, and count the values of the parameter type to generate the value distribution of the parameter type.
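The collection and counting described above can be sketched as follows; the synthetic weights stand in for parameters gathered layer by layer from a pre-trained model, and the bin count is an assumption.

```python
import numpy as np

# A minimal sketch of building a value distribution from collected
# parameters. `collected_weights` is hypothetical data standing in for
# weights gathered layer by layer during inference on verification data.
rng = np.random.default_rng(0)
collected_weights = rng.normal(loc=0.0, scale=0.5, size=10000)

# Count the values into bins to form the statistical distribution
# (statistic = total number of values per bin).
counts, bin_edges = np.histogram(collected_weights, bins=100)

# The range of the value distribution spans the minimum to the maximum.
value_range = (collected_weights.min(), collected_weights.max())
```

The histogram (counts per bin) is one concrete realization of the "statistic of each value" described above.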
Please refer to the accompanying drawings.
Taking the accompanying drawings as an example, the processor 150 may determine multiple first search points in the range of the value distribution (Step S410).
The processor 150 may divide the range according to the first search points to form multiple evaluation sections (Step S420), where the evaluation sections correspond to the first search points. In other words, any first search point divides the range into evaluation sections, or any evaluation section is located between two adjacent first search points. In an embodiment, the processor 150 may determine a first search space in the range of the value distribution. The first search points may divide the first search space into the evaluation sections. The processor 150 may define the first search space and the first search points using a breaking point ratio. Multiple breaking point ratios are respectively the ratios of the first search points to the maximum absolute value in the value distribution, and Mathematical Expression (1) is:
breakpoint ratio = break point / abs max (1)
where breakpoint ratio is the breaking point ratio, break point is any first search point (or other search point or breaking point), and abs max is the maximum absolute value in the value distribution. For example, the first search space is [0.1, 0.9] and the distance between adjacent first search points is 0.1. In other words, the breaking point ratios of the first search points are respectively 0.1, 0.2, 0.3, and so on up to 0.9, and the first search points may be derived by substituting the breaking point ratios back into Mathematical Expression (1).
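Under these example settings, the first search points can be backed out from the breaking point ratios, as in the following sketch (the value of abs max is an assumption):

```python
import numpy as np

# Sketch of Mathematical Expression (1): candidate breaking points are
# defined by ratios of the maximum absolute value in the distribution.
abs_max = 2.0  # assumed maximum absolute value in the value distribution

# First search space [0.1, 0.9] with a step of 0.1, as in the example.
breakpoint_ratios = np.arange(0.1, 1.0, 0.1)

# Back out the first search points: break point = ratio * abs max.
first_search_points = breakpoint_ratios * abs_max
```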
The processor 150 may respectively perform quantization on the evaluation sections of each first search point according to different quantization parameters for obtaining a quantized value corresponding to each first search point (Step S430). In other words, different quantization parameters are used for different evaluation sections of any one first search point. Taking dynamic fixed-point quantization as an example, the quantization parameter includes a bit width (BW), an integer length (IL), and a fraction length (FL). The different quantization parameters are, for example, different integer lengths and/or different fraction lengths. It should be noted that the quantization parameters used by different quantization methods may be different. In an embodiment, under the same bit width, the fraction length used by a section with values close to zero is longer, and the integer length used by a section with greater values is longer.
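A minimal sketch of the per-section fixed-point quantization described above follows; the function name, bit width, and fraction lengths are illustrative assumptions, not the claimed implementation.

```python
import numpy as np

def dynamic_fixed_point_quantize(x, bit_width, frac_len):
    """Illustrative fixed-point quantization: frac_len bits after the
    binary point, bit_width total signed bits."""
    scale = 2.0 ** frac_len
    qmin = -(2 ** (bit_width - 1))
    qmax = 2 ** (bit_width - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax)
    return q / scale

# Under the same bit width, the section with values close to zero uses a
# longer fraction length, while the section with greater values uses a
# longer integer length (hence a shorter fraction length).
x = np.array([0.03, 0.07, 1.4, 2.9])
near_zero = dynamic_fixed_point_quantize(x[:2], bit_width=8, frac_len=7)
far_tail = dynamic_fixed_point_quantize(x[2:], bit_width=8, frac_len=5)
```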
The processor 150 may compare multiple variance amounts of the first search points for obtaining one or more breaking points (Step S440). Each variance amount corresponding to a first search point includes the variance between the quantized values and the corresponding unquantized values (that is, the values before quantization). For example, the variance amount is mean squared error (MSE), root mean squared error (RMSE), or mean absolute error (MAE). Taking MSE as an example, Mathematical Expression (2) is as follows:

MSE = h × (1/n) × Σi=1..n (xi − Q(xi))^2 (2)

where MSE is the variance amount calculated by MSE, xi is the (unquantized) value (for example, a weight or an input/output activation/feature value), Q(xi) is the quantized value, h is a constant, and n is the total number of the values. Taking symmetric quantization for the quantized value as an example, Equations (3) and (4) are as follows:

xquantized = round(xfloat / xscale) (3)

xscale = (xfloatmax − xfloatmin) / (xquantizedmax − xquantizedmin) (4)

where xquantized is the quantized value, xfloat is the value of a floating point (that is, the unquantized value), xscale is the quantization level scale, xfloatmax is the maximum value in the value distribution, xfloatmin is the minimum value in the value distribution, xquantizedmax is the maximum value among the quantized values, and xquantizedmin is the minimum value among the quantized values.
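The search over candidate breaking points scored by MSE can be sketched as follows; the synthetic data, fraction lengths, and helper functions are illustrative assumptions rather than the claimed implementation.

```python
import numpy as np

def mse(x, xq):
    # Variance amount between unquantized and quantized values.
    return np.mean((x - xq) ** 2)

def quantize(x, bit_width, frac_len):
    # Simple fixed-point rounding used only to illustrate the search.
    scale = 2.0 ** frac_len
    qmax = 2 ** (bit_width - 1) - 1
    return np.clip(np.round(x * scale), -qmax - 1, qmax) / scale

# Hypothetical bell-shaped values and candidate first search points.
rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.5, 5000)
abs_max = np.abs(x).max()
candidates = [r * abs_max for r in np.arange(0.1, 1.0, 0.1)]

# For each candidate breaking point, quantize the two sections with
# different fraction lengths and score the total variance amount.
scores = []
for bp in candidates:
    inner = np.abs(x) <= bp
    xq = np.empty_like(x)
    xq[inner] = quantize(x[inner], 8, 7)    # denser section: longer FL
    xq[~inner] = quantize(x[~inner], 8, 5)  # tail section: longer IL
    scores.append(mse(x, xq))

# The breaking point is the candidate with the smallest variance amount.
breaking_point = candidates[int(np.argmin(scores))]
```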
In an embodiment, the processor 150 may use one or more of the first search points with smaller variance amounts as the one or more breaking points, where a smaller variance amount means a variance amount smaller than the others. Taking one breaking point as an example, the processor 150 may select the first search point with the smallest variance amount as the breaking point. Taking two breaking points as an example, the processor 150 selects the first search points with the smallest and the second smallest variance amounts as the breaking points.
Taking selection of the smallest variance amount as an example, reference may be made to the accompanying drawings.
In an embodiment, the processor 150 may determine a second search space according to one or more of the first search points with smaller variance amounts. The second search space is smaller than the first search space. Defined by the breaking point ratio, in an embodiment, the processor 150 may determine the breaking point ratio according to the first search point with the smallest variance amount. The breaking point ratio is the ratio of the first search point with the smallest variance amount to the maximum absolute value in the value distribution, and reference may be made to the relevant description of Mathematical Expression (1), which will not be repeated here. The processor 150 may determine the second search space according to the breaking point ratio. The first search point with the smallest variance amount may be located in the middle of the second search space. For example, if the breaking point ratio is 0.5, the range of the second search space may be [0.4, 0.6], and the distance between two adjacent second search points may be 0.01 (assuming that the distance between adjacent first search points is 0.1). It should be noted that the breaking point ratio with the smallest variance amount in the first stage is not limited to being located in the middle of the second search space.
Similarly, for the second stage, the processor 150 may perform quantization on the values of the evaluation sections divided by each second search point using different quantization parameters to obtain a quantized value corresponding to each second search point. Next, the processor 150 may compare multiple variance amounts of the second search points for obtaining the one or more breaking points. Each variance amount corresponding to a second search point includes the variance between the quantized values and the corresponding unquantized values. For example, the variance amount is MSE, RMSE, or MAE. Additionally, the processor 150 may use one or more of the second search points with smaller variance amounts as the one or more breaking points. Taking one breaking point as an example, the processor 150 may select the second search point with the smallest variance amount as the breaking point.
Please refer to the accompanying drawings.
For the middle section where the value distribution is denser, the processor 150 may assign a greater bit width to the fraction length (FL); and for the tail section where the value distribution is more scattered, the processor 150 may assign a greater bit width to the integer length (IL).
It should also be noted that if more than two breaking points are obtained, it is not limited to applying two quantization parameters to different sections.
In an embodiment, the processor 150 may perform dynamic fixed-point quantization combined with a clipping method. The processor 150 may determine the integer length of the first quantization parameter, the second quantization parameter, or other quantization parameters according to the maximum absolute value and the minimum absolute value in the value distribution. Taking percentile clipping as an example of the clipping method, there are very few values far from the middle in the bell-shaped distribution shown in the accompanying drawings. Therefore, the maximum may be taken as the value at the 99.99th percentile and the minimum as the value at the 0.01th percentile. Equation (5) determines, for example, the integer length of the weight:

ILW = log2(max(|Wmax|, |Wmin|)) + 1 (5)

where ILW is the integer length of the weight, Wmax is the (clipped) maximum in the value distribution of the weights, and Wmin is the (clipped) minimum in the value distribution of the weights.
It should be noted that the maximum and the minimum are not limited to the 99.99th and 0.01th percentiles, quantization is not limited to being combined with percentile clipping, and the quantization method is not limited to dynamic fixed-point quantization. Additionally, input activation/feature values, output activation/feature values, or other parameter types may also be applicable. Taking the absolute maximum value method as an example, the processor 150 may use a part of the training samples as calibration samples and infer the calibration samples to obtain the value distribution of the activation/feature values. The maximum in the value distribution may be used as the maximum for the clipping method. Also, following Equation (5), Equations (6) and (7) may determine, for example, the integer lengths of the input/output activation/feature values:
ILI = log2(max(|Imax|, |Imin|)) + 1 (6)

ILO = log2(max(|Omax|, |Omin|)) + 1 (7)
where ILI is the integer length of the input activation/feature value, ILO is the integer length of the output activation/feature value, Imax is the maximum in the value distribution of the input activation/feature values, Omax is the maximum in the value distribution of the output activation/feature values, Imin is the minimum in the value distribution of the input activation/feature values, and Omin is the minimum in the value distribution of the output activation/feature values.
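The percentile-clipping computation of Equation (5), together with the fraction length relation FL = BW - IL described later, can be sketched as follows; the synthetic weights, the bit width, and the use of floor as the rounding of the log term are assumptions.

```python
import math

import numpy as np

# Sketch of Equation (5): integer length from the clipped maximum and
# minimum of the weights, with 99.99% / 0.01% percentile clipping.
rng = np.random.default_rng(2)
weights = rng.normal(0.0, 1.0, 100000)  # hypothetical weight distribution

w_max = np.percentile(weights, 99.99)   # clipped maximum
w_min = np.percentile(weights, 0.01)    # clipped minimum
il_w = math.floor(math.log2(max(abs(w_max), abs(w_min)))) + 1

# With a predefined bit width limit, the fraction length follows
# FL = BW - IL (Equation (11) of this disclosure).
bw = 8
fl_w = bw - il_w
```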
On the other hand, a straight through estimator (STE) may be used to approximate the gradient of the quantization equation. In an embodiment, the processor 150 may use the straight through estimator with a boundary constraint to further mitigate gradient noise.
lb = -(2^(B-1)) × 2^(-fl) (8)

ub = (2^(B-1) - 1) × 2^(-fl) (9)

∂y/∂xiR = ∂y/∂xiQ, if lb ≤ xiR ≤ ub; ∂y/∂xiR = 0, otherwise (10)

where lb is the bottom limit, ub is the upper limit, fl is the fraction length, R denotes the real number, Q denotes the quantized number, xiR is the value of the real number (that is, the unquantized value), xiQ is the quantized value, y is the output activation/feature value, and B is the bit width. If the value xiR is in the limit range [lb, ub] between the bottom limit and the upper limit, the processor 150 may equate the real gradient ∂y/∂xiR thereof to the quantization gradient ∂y/∂xiQ. However, if the value xiR is outside the limit range [lb, ub], the processor 150 may ignore the gradient thereof and directly set the quantization gradient to zero.
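The gradient masking described above can be sketched as follows; the bit width, fraction length, and the choice of the fixed-point representable range as the limit range are assumptions consistent with the description, not the claimed implementation.

```python
import numpy as np

# Straight through estimator with boundary constraint: the upstream
# gradient passes through unchanged inside [lb, ub] and is zeroed outside.
B, fl = 8, 5
lb = -(2 ** (B - 1)) * 2.0 ** (-fl)      # bottom limit (= -4.0 here)
ub = (2 ** (B - 1) - 1) * 2.0 ** (-fl)   # upper limit (= 3.96875 here)

x_real = np.array([-5.0, -1.0, 0.5, 3.0, 5.0])  # unquantized values
upstream_grad = np.ones_like(x_real)            # dy/dxQ from backprop

# Pass the gradient through only where lb <= x <= ub; zero it elsewhere.
mask = (x_real >= lb) & (x_real <= ub)
grad_x_real = np.where(mask, upstream_grad, 0.0)
```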
In an embodiment, multiple quantization layers are added to the deep learning network. The quantization layers may be divided into three parts for the weight, the input activation/feature value, and the output activation/feature value. In addition, different or identical bit widths and/or fraction lengths may be respectively provided to represent the values of the three parts of the quantization layers. Thereby, quantization at the layer-by-layer level can be achieved.
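The three-part quantization layer described above can be sketched as a wrapper with separate settings for the weight, the input activation, and the output activation; the class name, helper function, and (bit width, fraction length) pairs are illustrative assumptions.

```python
import numpy as np

def fixed_point(x, bit_width, frac_len):
    # Illustrative fixed-point rounding shared by all three parts.
    scale = 2.0 ** frac_len
    qmax = 2 ** (bit_width - 1) - 1
    return np.clip(np.round(x * scale), -qmax - 1, qmax) / scale

class QuantizedDense:
    """Hypothetical layer quantizing weight, input, and output separately."""

    def __init__(self, weight, w_cfg=(8, 6), in_cfg=(8, 4), out_cfg=(8, 4)):
        # Each part may use a different (bit width, fraction length) pair.
        self.weight = fixed_point(weight, *w_cfg)
        self.in_cfg, self.out_cfg = in_cfg, out_cfg

    def __call__(self, x):
        xq = fixed_point(x, *self.in_cfg)      # quantize input activation
        y = xq @ self.weight                   # layer computation
        return fixed_point(y, *self.out_cfg)   # quantize output activation

layer = QuantizedDense(np.array([[0.5], [0.25]]))
out = layer(np.array([[1.0, 2.0]]))
```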
The processor 150 may post-train the quantized model (Step S121). For example, the quantized model is trained using training samples with labeled results.
Next, the processor 150 may determine the fraction length of the weights/activation/feature values of each quantization layer according to a bit width limit of each quantization layer (Step S144). Equation (11) determines the fraction length as follows:
FL=BW−IL (11)
where FL is the fraction length, BW is the predefined bit width limit, and IL is the integer length. Under some application scenarios, the integer length used in Equation (11) may be less than the integer length obtained from Equations (5) to (7), for example, by one bit. (Fine-)tuning the integer length helps to improve the prediction accuracy of a model. Finally, the processor 150 may obtain a post-trained quantized model (Step S145).
Please refer to the accompanying drawings.
An embodiment of the disclosure further provides a non-transitory computer-readable storage medium (for example, a hard disk drive, an optical disk, a flash memory, a solid state drive (SSD), and other storage media) and is used to store a code. The processor 150 or other processors of the computing apparatus 100 may load the code, and execute the corresponding process of one or more optimizing methods according to the embodiments of the disclosure. For the processes, reference may be made to the above descriptions, which will not be repeated here.
In summary, in the optimizing method and the computing apparatus for the deep learning network and the computer-readable storage medium according to the embodiments of the disclosure, the value distribution of the parameters of the pre-trained model is analyzed, and the breaking points that divide the range into sections with different quantization requirements are determined. The breaking points may divide the value distribution of different parameter types into multiple sections and/or divide the value distribution of a single parameter type into multiple sections. Different quantization parameters are respectively used for different sections. The percentile clipping method is used to determine the integer length of the weight, and the absolute maximum method is used to determine the integer length of the input/output feature/activation value. In addition, the straight through estimator with boundary constraint is introduced to improve gradient approximation. In this way, accuracy drop can be reduced and an allowable compression rate can be achieved.
Although the disclosure has been disclosed in the above embodiments, the embodiments are not intended to limit the disclosure. Persons skilled in the art may make some changes and modifications without departing from the spirit and scope of the disclosure. Therefore, the protection scope of the disclosure shall be defined by the appended claims.
Number | Date | Country | Kind
---|---|---|---
111119653 | May 2022 | TW | national