This application is based upon and claims the benefit of priority of the prior Japanese Patent Application No. 2021-136804, filed on Aug. 25, 2021, the entire contents of which are incorporated herein by reference.
The embodiment discussed herein is related to a threshold determination technology.
A neural network, which is one kind of trained model generated through machine learning, is used to perform inference on input data in various fields such as image processing and natural language processing.
Because the configuration of neural networks has become more complicated in recent years, the power consumption of a computer that performs inference using a neural network tends to increase. Therefore, the neural network may be quantized to reduce the power consumption. The quantization of the neural network is processing of converting a numerical value to be quantized, represented by a predetermined bit width, into a quantized numerical value represented by a smaller bit width.
Although the quantization of the neural network is effective for reducing power consumption and memory usage, the accuracy of the numerical value to be quantized deteriorates. For example, when a 32-bit single precision floating point number (FP32) is converted into an eight-bit integer (INT8) through quantization, inference accuracy decreases significantly.
A technique is known that improves the efficiency of a neural network in relation to the quantization of the neural network. A neural network learning device is also known that enables appropriate calculation while making a convolutional neural network (CNN) lighter by lowering the bit width of the calculation. A method for adjusting the precision of some selected layers in the neural network to a lower bit width is also known.
A sequence conversion model based on an attention mechanism is also known.
Japanese National Publication of International Patent Application No. 2021-500654, Japanese Laid-open Patent Publication No. 2020-9048, Japanese Laid-open Patent Publication No. 2020-113273, A. Canziani et al, “An Analysis of Deep Neural Network Models for Practical Applications”, arXiv:1605.07678v4, Apr. 14, 2017, O. Sharir et al., “The Cost of Training NLP Models: A Concise Overview”, arXiv:2004.08900v1, Apr. 19, 2020, Szymon Migacz, NVIDIA, “8-bit Inference with TensorRT”, [online], May 8, 2017, (retrieved on Jun. 16, 2021), Internet <URL:https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf>, and A. Vaswani et al., “Attention is All You Need”, 31st Conference on Neural Information Processing Systems (NIPS 2017), 2017 are disclosed as related art.
According to an aspect of the embodiments, a non-transitory computer-readable storage medium stores a threshold determination program that causes a processor included in a computer to execute a process, the process including quantizing a plurality of numerical values of a quantization target using a variable representing a candidate of a threshold, and determining the threshold based on a quantization error for each of the plurality of numerical values, the quantization error being specified based on the quantizing.
The object and advantages of the invention will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the invention.
In the quantization of the neural network, it is important to select an appropriate scaling factor for converting a numerical value to be quantized into a quantized numerical value. The numerical value to be quantized is a weight of each of a plurality of edges between two layers of the neural network, an output value of each of a plurality of nodes included in each layer of the neural network, or the like. The output value of each node is called an activation. The plurality of numerical values to be quantized and the plurality of quantized numerical values may also be represented by tensors.
In some cases, the accuracy of the quantized numerical value is improved by performing clipping on the numerical value to be quantized. The clipping is processing of converting a numerical value that falls outside a numerical value range defined by a threshold into a quantized numerical value corresponding to the threshold. However, it is difficult to select an appropriate threshold for the clipping.
Note that this problem arises not only in quantization of the weight or the activation but also in quantization of various numerical values in the neural network.
In one aspect, an object of the embodiment is to suppress decrease in inference accuracy caused by quantization of a neural network.
Hereinafter, an embodiment will be described in detail with reference to the drawings.
In the quantization of Szymon Migacz, NVIDIA, “8-bit Inference with TensorRT”, [online], May 8, 2017, (retrieved on Jun. 16, 2021), when a FP32 is converted into an INT8, a numerical value range of the FP32 is limited by performing clipping before a scaling factor is applied. In this case, the upper limit of the numerical value range is defined by a positive threshold +|T|, and the lower limit of the numerical value range is defined by a negative threshold −|T|.
Therefore, through quantization, a floating point number equal to or less than −|T| is converted into an integer corresponding to −|T|, and a floating point number equal to or more than +|T| is converted into an integer corresponding to +|T|. The integer corresponding to −|T| is −127, and the integer corresponding to +|T| is +127. The floating point number less than −|T| and the floating point number larger than +|T| are referred to as outliers.
By performing clipping before the scaling factor is applied, quantization noise can be reduced, and accuracy of a quantized numerical value is improved.
First, a computer sets an initial value to a variable X representing a candidate of a threshold indicating the lower limit or the upper limit of the numerical value range (step 101) and quantizes N (N is an integer equal to or more than two) numerical values to be quantized using the variable X (step 102). In step 102, the computer converts a numerical value outside the numerical value range defined by the variable X into a quantized numerical value corresponding to the variable X and converts a numerical value within the numerical value range into a quantized numerical value using the scaling factor.
Next, the computer calculates a Kullback-Leibler information amount (Kullback-Leibler divergence, KL information amount) according to the following equation using a probability distribution P of N numerical values to be quantized and a probability distribution Q of N numerical values after being quantized (step 103).
KL(P||Q)=Σ_{i=1}^{N} P(i)*log(P(i)/Q(i)) (1)
KL (P||Q) in the equation (1) represents the KL information amount of the probability distribution P and the probability distribution Q, P (i) represents a probability of an i-th (i=1 to N) numerical value to be quantized, and Q (i) represents a probability of an i-th numerical value after being quantized. log represents a binary logarithm or a natural logarithm. KL (P||Q) is used as an index representing a difference between the probability distribution P and the probability distribution Q.
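As an illustration only, the KL information amount described above may be computed as in the following minimal Python sketch. It assumes the standard definition of the Kullback-Leibler divergence with a natural logarithm; the function name kl_divergence and the handling of zero probabilities are assumptions made for this sketch and are not taken from the embodiment.

```python
import math


def kl_divergence(p, q):
    """Return KL(P||Q) = sum_i P(i) * log(P(i) / Q(i)) using a natural logarithm."""
    if len(p) != len(q):
        raise ValueError("P and Q must have the same number of entries")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # a term with P(i) = 0 contributes 0 by convention
        if q_i == 0.0:
            return math.inf  # P has mass where Q has none (assumed handling)
        total += p_i * math.log(p_i / q_i)
    return total


p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
print(kl_divergence(p, p))  # 0.0, identical distributions
print(kl_divergence(p, q))  # positive, equal to 0.25 * ln(2)
```

A smaller value indicates that the two probability distributions are closer to each other, which is why the candidate with the minimum value is selected in this approach.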
Next, the computer checks whether or not the KL information amount is calculated for all candidates (step 104). When an unprocessed candidate remains (step 104, NO), the computer updates the value of the variable X (step 106) and repeats processing in and after step 102 on the next candidate.
When the KL information amount has been calculated for all the candidates (step 104, YES), the computer selects a candidate that has the minimum KL information amount as a threshold (step 105).
In step 106, the computer increases the variable X by a bin width by incrementing the position of the bin indicating the value of the variable X by one. By repeating the processing in step 106, the value of the variable X changes from the position of the 128-th bin to the position of the 2048-th bin. In step 102, an outlier that is larger than the variable X is converted into a quantized numerical value corresponding to the variable X.
By performing quantization using a threshold that has the minimum KL information amount, it is possible to make the probability distribution of the quantized numerical value be closer to the probability distribution of the numerical value to be quantized. However, the threshold determination processing in
The KL information amount includes only information regarding an appearance frequency of each numerical value to be quantized and an appearance frequency of each quantized numerical value and does not include information regarding the numerical values themselves. Therefore, when a bit width of the quantized numerical value is small, inference accuracy may decrease significantly even if the quantization is performed using the threshold that has the minimum KL information amount.
The numerical value to be quantized is a weight of a linear layer in a multi-head attention block included in each layer of the encoder or the decoder and is represented by the FP32. A bit width of the quantized numerical value is two bits.
As a dataset, a German-English translation dataset of Multi30k is used. Training data includes 29000 sentences, verification data includes 1014 sentences, and input data to be inferred includes 1000 sentences.
No quantization represents a case where inference is performed without quantizing the weight represented by the FP32, and the quantization (KL) represents a case where inference is performed by applying the quantization on the basis of the threshold that has the minimum KL information amount.
Inference accuracy 1 represents a bilingual evaluation understudy (BLEU) score when the quantization is applied to the nine fully-coupled layers of the encoder. Inference accuracy 2 represents a BLEU score when the quantization is applied to the nine fully-coupled layers of each of the encoder and the decoder. The higher the BLEU score is, the higher the inference accuracy is.
The inference accuracy with no quantization is 35.08. On the other hand, the inference accuracy 1 of the quantization (KL) is 33.26, and the inference accuracy 2 of the quantization (KL) is 11.88. In this case, it can be understood that the inference accuracy 2 of the quantization (KL) decreases significantly.
According to the threshold determination device 401 in
The storage unit 514 stores an inference model 521 that performs inference in image processing, natural language processing, or the like and input data 524 to be inferred. The inference model 521 is a trained model including a neural network and, for example, is generated through supervised machine learning. The inference model 521 may be a transformer.
The determination unit 511 determines a threshold 522 used for clipping for each layer of the neural network included in the inference model 521 and stores the threshold 522 in the storage unit 514. The threshold 522 indicates the lower limit and the upper limit of the numerical value range of the numerical value to be quantized.
The determination unit 511 quantizes each of N (N is an integer equal to or more than two) numerical values to be quantized based on the numerical value range defined by each of a plurality of candidates of the threshold 522 so as to generate a quantized numerical value corresponding to each numerical value.
In the quantization for converting the FP32 into the INT8, for example, the upper limit of the numerical value range is defined by a candidate TC of a positive threshold, and the lower limit of the numerical value range is defined by a candidate −TC of a negative threshold. In this case, the determination unit 511 can convert an i-th (i=1 to N) numerical value v (i) to be quantized into an i-th numerical value q (i) after being quantized, for example, according to the following equation.
q(i)=round(v(i)/S) (2)
S in the equation (2) represents a scaling factor, and round (v (i)/S) represents a value obtained by rounding v (i)/S. However, when v (i) is equal to or more than TC, q (i)=127, and when v (i) is equal to or less than −TC, q (i)=−127.
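The conversion of the equation (2) together with the clipping at TC and −TC can be sketched as follows. This is a minimal Python sketch under stated assumptions: the scaling factor is assumed to be S = TC/127, which the text does not specify, and the rounding mode is that of Python's built-in round (ties to even). The function name quantize_int8 is hypothetical.

```python
def quantize_int8(values, tc):
    """Apply q(i) = round(v(i) / S), saturating to +/-127 at or beyond +/-tc."""
    if tc <= 0:
        raise ValueError("the threshold candidate must be positive")
    scale = tc / 127.0  # assumed scaling factor S (not given in the text)
    quantized = []
    for v in values:
        if v >= tc:
            quantized.append(127)   # clipped to the upper limit
        elif v <= -tc:
            quantized.append(-127)  # clipped to the lower limit
        else:
            quantized.append(round(v / scale))  # equation (2)
    return quantized


print(quantize_int8([-3.0, -1.0, 0.0, 0.25, 1.0, 2.5], tc=1.0))
# [-127, -127, 0, 32, 127, 127]
```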
Next, the determination unit 511 calculates a quantization error using each numerical value to be quantized and the quantized numerical value corresponding to each numerical value to be quantized and calculates a statistical value of the quantization error for each of the N numerical values to be quantized. Then, the determination unit 511 selects the threshold 522 from among the plurality of candidates based on the statistical value calculated from each of the plurality of candidates.
As the statistical value, for example, an average value, a median, a mode, a maximum value, or a sum is used, and as the threshold 522, for example, a candidate that has a minimum statistical value is selected. By using the statistical value of the quantization error, the threshold 522 suitable for each layer of the neural network can be easily determined.
In the quantization for converting the FP32 into the INT8, for example, an average value QE of the quantization error for each of the N numerical values to be quantized is calculated according to the following equations.
vq(i)=q(i)*S (3)
QE=(1/N)*Σ_{i=1}^{N} |vq(i)−v(i)| (4)
vq (i) in the equation (3) represents a numerical value obtained by inversely quantizing q (i), and |vq (i)-v (i)| in the equation (4) represents an i-th quantization error. However, in a case of q (i)=127, vq (i)=TC, and in a case where q (i)=−127, vq (i)=−TC.
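The average value QE described above can be sketched in Python as follows. The sketch assumes that inverse quantization multiplies the quantized numerical value by the scaling factor and that S = TC/127; both are assumptions for illustration, and the function name mean_quantization_error is hypothetical.

```python
def mean_quantization_error(values, tc):
    """Return the average of |vq(i) - v(i)| for one threshold candidate tc."""
    scale = tc / 127.0  # assumed scaling factor S
    total = 0.0
    for v in values:
        if v >= tc:
            vq = tc    # q(i) = 127 is inversely quantized to TC
        elif v <= -tc:
            vq = -tc   # q(i) = -127 is inversely quantized to -TC
        else:
            vq = round(v / scale) * scale  # assumed inverse quantization
        total += abs(vq - v)
    return total / len(values)


values = [-2.0, 0.0, 0.5, 2.0]
print(mean_quantization_error(values, tc=1.0))  # dominated by the clipped values
print(mean_quantization_error(values, tc=2.0))  # no value is clipped away
```

In this small example the candidate 2.0 yields the smaller average, because the error introduced by clipping ±2.0 to ±1.0 outweighs the finer resolution obtained with the candidate 1.0.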
The quantization error includes not only information regarding an appearance frequency of each numerical value to be quantized and an appearance frequency of each quantized numerical value but also information regarding the numerical values themselves. Therefore, by selecting the candidate that has the minimum statistical value of the quantization error as the threshold 522, the accuracy of the quantized numerical value is higher than in a case where the candidate that has the minimum KL information amount is selected. As a result, even in a case where a bit width of the quantized numerical value is small, a decrease in the inference accuracy caused by the quantization is suppressed, and high inference accuracy can be maintained.
The quantization unit 512 generates a quantization inference model 523 by quantizing each of the N numerical values to be quantized using the threshold 522 for each layer of the neural network and stores the quantization inference model 523 in the storage unit 514.
In the quantization of the numerical value to be quantized, the quantization unit 512 converts the outlier deviated from the numerical value range defined by the lower limit and the upper limit indicated by the threshold 522 into a quantized numerical value corresponding to the lower limit or the upper limit. Then, the quantization unit 512 converts a numerical value within the numerical value range into the quantized numerical value using the scaling factor.
The quantization target is, for example, a weight, a bias, or an activation in each layer of the neural network. A bit width of the quantized numerical value is smaller than a bit width of the numerical value to be quantized. By quantizing the weight, the bias, or the activation, the neural network can be efficiently compressed.
The inference unit 513 performs inference on the input data 524 using the quantization inference model 523 and outputs an inference result. By performing the inference using the quantization inference model 523 instead of the inference model 521, power consumption and memory usage are reduced, and the inference processing is accelerated.
First, the determination unit 511 sets an initial value to the variable X representing the candidate of the threshold 522 (step 601) and quantizes the N numerical values to be quantized using the variable X (step 602). In step 602, the determination unit 511 converts a numerical value deviated from the numerical value range defined by the variable X into a quantized numerical value corresponding to the variable X and converts the numerical value within the numerical value range into a quantized numerical value using the scaling factor.
Next, the determination unit 511 calculates a quantization error using each numerical value to be quantized and each quantized numerical value and calculates a statistical value of the quantization error for each of the N numerical values to be quantized (step 603).
Next, the determination unit 511 checks whether or not the statistical value of the quantization error has been calculated for all the candidates (step 604). When an unprocessed candidate remains (step 604, NO), the determination unit 511 updates the value of the variable X (step 606) and repeats processing in and after step 602 on the next candidate.
When the statistical value of the quantization error is calculated for all the candidates (step 604, YES), the determination unit 511 selects a candidate that has the minimum statistical value as the threshold 522 (step 605).
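The flow of steps 601 to 606 can be sketched as a loop over threshold candidates in which the statistic of the quantization error is interchangeable. The following Python sketch is illustrative only: it assumes symmetric INT8 quantization with S = X/127, takes the candidates as an explicit list, and uses the hypothetical function name select_threshold.

```python
import statistics


def select_threshold(values, candidates, statistic=statistics.mean):
    """Return (candidate, score) with the minimum statistic of the quantization error."""
    best_candidate, best_score = None, None
    for x in candidates:                   # steps 601 and 606: next candidate X
        scale = x / 127.0                  # assumed scaling factor
        errors = []
        for v in values:                   # step 602: clip, quantize, restore
            clipped = max(-x, min(x, v))
            restored = round(clipped / scale) * scale
            errors.append(abs(restored - v))
        score = statistic(errors)          # step 603: statistical value
        if best_score is None or score < best_score:
            best_candidate, best_score = x, score  # step 605: keep the minimum
    return best_candidate, best_score


values = [-0.9, -0.2, -0.1, 0.0, 0.1, 0.15, 0.3, 4.0]
candidates = [0.5, 1.0, 2.0, 4.0]
print(select_threshold(values, candidates))                    # average value
print(select_threshold(values, candidates, statistic=max))     # maximum value
print(select_threshold(values, candidates, statistic=statistics.median))
```

Passing a different callable, such as sum or statistics.median, changes which candidate is preferred, which reflects the choice of statistical value mentioned in the description.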
According to the threshold determination processing in
Next, threshold determination processing in a case where the quantization target is a weight in each layer of a neural network will be described.
The distribution of the weights in
B=(max(W)−min(W))/M (5)
A control variable k is used as a hyperparameter that specifies the candidate of the threshold 522. The lower limit of the numerical value range of the weight to be quantized is represented by −TH (k), and the upper limit is represented by +TH (k). TH (k) is a positive numerical value that changes according to k and represents a candidate of the upper limit of the numerical value range.
First, the determination unit 511 sets an initial value k0 to k (step 801) and calculates TH (k) according to the following equation (step 802).
TH(k)=max(abs(W))−k*B (6)
abs (W) in the equation (6) represents a set of absolute values of the respective weights included in W, and max (abs (W)) represents a maximum value of elements of abs (W).
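The equations (5) and (6) can be sketched in Python as follows. M is treated here as a positive integer supplied by the caller, which is an assumption because its definition is not reproduced in this passage; the function names bin_width and threshold_candidate are hypothetical.

```python
def bin_width(weights, m):
    """Equation (5): B = (max(W) - min(W)) / M."""
    return (max(weights) - min(weights)) / m


def threshold_candidate(weights, m, k):
    """Equation (6): TH(k) = max(abs(W)) - k * B."""
    max_abs = max(abs(w) for w in weights)
    return max_abs - k * bin_width(weights, m)


weights = [-0.8, -0.3, 0.0, 0.2, 1.2]
print(bin_width(weights, m=8))                 # about 0.25
print(threshold_candidate(weights, m=8, k=0))  # largest absolute weight, 1.2
print(threshold_candidate(weights, m=8, k=2))  # about 0.7
```

Each increase of k moves the candidate inward by a multiple of B, so k = 0 corresponds to no clipping and larger k clips more of the distribution.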
Next, the determination unit 511 quantizes N weights W (i) (i=1 to N) to be quantized using TH (k) so as to generate a quantized weight Q (i) (step 803).
In step 803, the determination unit 511 converts W (i) equal to or less than −TH(k) into the quantized weight −THQ (k) corresponding to −TH (k) and converts W (i) equal to or more than TH (k) into the quantized weight THQ (k) corresponding to TH (k). Furthermore, the determination unit 511 converts W (i), which is larger than −TH(k) and smaller than TH (k), into Q (i) using the scaling factor. For example, in a case where Q (i) is represented by the INT8, THQ (k)=127 may be satisfied.
Next, the determination unit 511 sets an initial value 1 to a control variable i (step 804) and compares an absolute value abs (W (i)) of the i-th weight W (i) with TH (k) (step 805).
When abs (W (i)) is smaller than TH (k) (step 805, YES), the determination unit 511 calculates a quantization error qe (i) for W (i) according to the following equation (step 806).
qe(i)=abs(WQ(i)−W(i)) (7)
WQ (i) in the equation (7) represents a numerical value obtained by inversely quantizing Q (i), and abs (WQ (i)−W (i)) represents an absolute value of WQ (i)−W (i).
On the other hand, when abs (W (i)) is equal to or more than TH (k) (step 805, NO), the determination unit 511 calculates the quantization error qe (i) for W (i) according to the following equation (step 807).
qe(i)=abs(W(i))−TH(k) (8)
Next, the determination unit 511 compares i with N (step 808).
When i has not reached N (step 808, NO), the determination unit 511 increments i by one (step 812) and repeats processing in and after step 805.
When i reaches N (step 808, YES), the determination unit 511 calculates an average value QE (k) of the N quantization errors qe (i) according to the following equation (step 809).
QE(k)=ave(qe) (9)
qe in the equation (9) represents a set of qe (1) to qe (N), and ave (qe) represents an average value of qe (1) to qe (N).
Next, the determination unit 511 compares TH (k) with L*B (step 810). L represents a positive integer. When TH (k) is larger than L*B (step 810, YES), the determination unit 511 increments k by Δk (step 813) and repeats processing in and after step 802. For example, in the distribution of the weights illustrated in
When TH (k) is equal to or less than L*B (step 810, NO), the determination unit 511 ends the calculation of QE (k) and selects TH (k) that has the minimum QE (k) among the calculated QE (k) (step 811). Then, the determination unit 511 determines the threshold 522 indicating the lower limit of the numerical value range as −TH (k) and determines the threshold 522 indicating the upper limit of the numerical value range as TH (k).
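The processing of steps 801 to 813 can be sketched end to end as follows. This Python sketch is illustrative and rests on several assumptions that are not stated in the description: the scaling factor is TH(k)/THQ, inverse quantization multiplies Q(i) by that scaling factor, THQ = 1 is used to stand in for a two-bit signed representation, and a guard stops the loop if TH(k) becomes non-positive. The function name determine_weight_threshold and all parameter names are hypothetical.

```python
def determine_weight_threshold(weights, m, l, k0=0, dk=1, thq=127):
    """Return (TH, QE) for the candidate TH(k) with the minimum average error."""
    b = (max(weights) - min(weights)) / m          # equation (5)
    if b <= 0:
        raise ValueError("weights must not all be equal")
    max_abs = max(abs(w) for w in weights)
    best_th, best_qe = None, None
    k = k0                                         # step 801
    while True:
        th = max_abs - k * b                       # step 802, equation (6)
        if th <= 0:
            break                                  # guard added for this sketch
        scale = th / thq                           # assumed scaling factor
        errors = []
        for w in weights:                          # steps 804 to 808 and 812
            if abs(w) < th:                        # step 805
                wq = round(w / scale) * scale      # inversely quantized value
                errors.append(abs(wq - w))         # equation (7)
            else:
                errors.append(abs(w) - th)         # equation (8)
        qe = sum(errors) / len(errors)             # step 809, equation (9)
        if best_qe is None or qe < best_qe:
            best_th, best_qe = th, qe              # kept for step 811
        if th <= l * b:                            # step 810, NO
            break
        k += dk                                    # step 813
    return best_th, best_qe


weights = [-0.5, -0.5, -0.45, -0.4, 0.0, 0.4, 0.45, 0.5, 0.5, 2.0]
print(determine_weight_threshold(weights, m=25, l=1))          # INT8-like case
print(determine_weight_threshold(weights, m=25, l=1, thq=1))   # assumed 2-bit case
```

With the small bit width, the selected candidate lies well below the largest absolute weight in this example, because clipping the single outlier costs less than rounding every other weight to zero.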
Inference accuracy without quantization and inference accuracy 1 and inference accuracy 2 of the quantization (KL) are similar to those of the experimental result illustrated in
The inference accuracy 1 of the quantization (QE) is 35.09, and the inference accuracy 2 of the quantization (QE) is 34.93. In this case, it can be understood that the inference accuracy 1 and the inference accuracy 2 of the quantization (QE) differ only slightly from the inference accuracy without the quantization. Therefore, inference accuracy about the same as that before the quantization is maintained by determining the threshold 522 using the average value of the quantization error instead of the KL information amount.
The configuration of the threshold determination device 401 in
The flowcharts in
The update processing illustrated in
The equations (1) to (9) are merely examples, and the inference device 501 may determine the threshold 522 using another calculation formula.
The memory 1002 is, for example, a semiconductor memory such as a read only memory (ROM) or a random access memory (RAM) and stores programs and data to be used for processing. The memory 1002 may operate as the storage unit 514 in
The CPU 1001 (processor), for example, executes a program using the memory 1002 so as to operate as the determination unit 411 in
For example, the input device 1003 is a keyboard, a pointing device, or the like and is used for inputting instructions or information from a user or an operator. For example, the output device 1004 is a display device, a printer, or the like and is used for an inquiry or an instruction to the user or the operator, and outputting a processing result. The processing result may be an inference result for the input data 524.
The auxiliary storage device 1005 is, for example, a magnetic disk device, an optical disk device, a magneto-optical disk device, a tape device, or the like. The auxiliary storage device 1005 may be a hard disk drive. The information processing device may store programs and data in the auxiliary storage device 1005 and load these programs and data into the memory 1002 for use.
The medium driving device 1006 drives a portable recording medium 1009 and accesses recorded content of the portable recording medium 1009. The portable recording medium 1009 is a memory device, a flexible disk, an optical disk, a magneto-optical disk, or the like. The portable recording medium 1009 may be a compact disk read only memory (CD-ROM), a digital versatile disk (DVD), a universal serial bus (USB) memory, or the like. The user or the operator can store the programs and data in the portable recording medium 1009 and can use these programs and data by loading the programs and data into the memory 1002.
As described above, a computer-readable recording medium in which the programs and data used for processing are stored is a physical (non-transitory) recording medium such as the memory 1002, the auxiliary storage device 1005, or the portable recording medium 1009.
The network connection device 1007 is a communication interface circuit that is connected to a communication network such as a local area network (LAN) or a wide area network (WAN), and that performs data conversion associated with communication. The information processing device can receive programs and data from an external device via the network connection device 1007 and load these programs and data into the memory 1002 for use.
Note that, the information processing device does not need to include all the components in
While the disclosed embodiment and the advantages thereof have been described in detail, those skilled in the art will be able to make various modifications, additions, and omissions without departing from the scope of the embodiment as explicitly set forth in the claims.
All examples and conditional language provided herein are intended for the pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although one or more embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Number | Date | Country | Kind |
---|---|---|---|
2021-136804 | Aug 2021 | JP | national |