This application is based upon and claims the benefit of priority from Japanese Patent Application No. 2020-186016, filed on Nov. 6, 2020; the entire contents of which are incorporated herein by reference.
Embodiments described herein relate generally to a computing device, a computer system, and a computing method.
Neural networks have been widely used, for example, in computing for image recognition processing. For example, high recognition accuracy can be achieved in an image recognition task by using a convolutional neural network (CNN), which is one type of neural network. However, in inference using such a convolutional neural network (CNN), a reduction in processing time and power consumption is required because millions of product-sum operations are executed.
Conventionally, regularization methods that optimize the processing time and power consumption of inference in consideration of the model size and memory usage of a convolutional neural network (CNN) have been studied.
However, such conventional optimization methods do not take the calculation amount or the calculation performance of hardware into consideration.
According to one embodiment, a processor is configured to calculate a calculation amount in inference time of a neural network by using a result of summing, with respect to a group to which quantization is applied, products of the number of product-sum operations and bit widths of the weights used for the product-sum operations in the neural network. The processor is then configured to optimize values of the weights and a quantization step size so as to minimize a recognition error by the neural network based on the calculated calculation amount, and to execute computing of the neural network based on the optimized weights and quantization step size.
Hereinafter, with reference to accompanying drawings, a computing device, a computer system, and a computing method according to the embodiment will be described in detail. Note that the present invention is not limited by this embodiment.
The computer system 1 can output signals corresponding to the result of the processing with respect to the input data and cause a display device 80 to display the result of the processing. The display device 80 is, for example, a liquid crystal display or an organic EL display. The display device 80 is electrically connected to the computer system 1 via a cable or wireless communication.
The computer system 1 includes at least a graphic processing unit (GPU) 10, a central processing unit (CPU) 20, and a memory 70. The GPU 10, the CPU 20, and the memory 70 are connected by an internal bus so that communication can be carried out.
In the present embodiment, the GPU 10 executes computing about inference processing using a later-described neural network 100. The GPU 10 is a processor which carries out similarity calculations in an approximative manner. The GPU 10 executes processing with respect to the input data while using the memory 70 as a work area.
The CPU 20 is a processor that controls the overall operation of the computer system 1. The CPU 20 executes various processing for controlling the GPU 10 and the memory 70. The CPU 20 controls the computing using the neural network 100 that is executed by the GPU 10, while using the memory 70 as a work area.
The memory 70 functions as a memory device. The memory 70 stores input data input from outside the computer system 1, data generated by the GPU 10, data generated by the CPU 20, and parameters of neural networks. Note that the data generated by the GPU 10 and the CPU 20 may include intermediate results and final results of various calculations. For example, the memory 70 includes at least one selected from among DRAM, SRAM, MRAM, a NAND-type flash memory, a resistance-change-type memory (for example, ReRAM, Phase Change Memory (PCM)), etc. A dedicated memory (not illustrated) used by the GPU 10 may be directly connected to the GPU 10.
The input data may be provided from a storage medium 99. The storage medium 99 is electrically connected to the computer system 1 via a cable or wireless communication. The storage medium 99 functions as a memory device and may be any of a memory card, a USB memory, an SSD, an HDD, an optical storage medium, and the like.
In the computer system 1, the neural network 100 is used as a machine learning device. Herein, machine learning is a technique for building an algorithm or a model that carries out tasks such as categorization or prediction by causing a computer to learn a massive amount of data. The neural network 100 is, for example, a convolutional neural network (CNN). The neural network 100 may also be, for example, a multilayer perceptron (MLP) or a neural network provided with an attention mechanism (for example, a Transformer).
The neural network 100 may be a machine learning device that carries out inference on arbitrary data. For example, the neural network 100 may be a machine learning device that takes voice data as input and outputs a categorization of the voice data, one that realizes noise removal or voice recognition for voice data, or one that realizes image recognition of image data. Note that the neural network 100 may be configured as a machine learning model.
The neural network 100 has an input layer 101, hidden layers (also referred to as intermediate layers) 102, and an output layer (also referred to as a fully connected layer) 103.
The input layer 101 receives input data (or part of the data) received from outside the computer system 1. The input layer 101 has a plurality of computing devices (also referred to as neurons or neuron circuits) 118. Note that the computing devices 118 may be dedicated devices or circuits, or the processing thereof may be realized by executing a program by a processor. Similar configurations will be described as computing devices also hereinafter. In the input layer 101, each computing device 118 subjects input data to arbitrary processing (for example, linear transformation, addition of auxiliary data, or the like) to carry out conversion and transmits the converted data to the hidden layers 102.
The hidden layers 102 (102A and 102B) execute various calculation processing with respect to the data from the input layer 101.
The hidden layers 102 have a plurality of computing devices 110 (110A and 110B). In the hidden layers 102, each computing device 110 executes product-sum operation processing using a particular parameter (for example, weight) with respect to supplied data (hereinafter, also referred to as device input data for distinguishing). For example, each of the computing devices 110 executes product-sum operation processing by using mutually different parameters with respect to the supplied data.
The hidden layers 102 may be layered. In this case, the hidden layer 102 includes at least two layers (the first hidden layer 102A and the second hidden layer 102B). The first hidden layer 102A has the plurality of computing devices 110A, and the second hidden layer 102B has the plurality of computing devices 110B.
Each computing device 110A of the first hidden layer 102A executes particular calculation processing with respect to device input data, which is the processing result of the input layer 101. Each computing device 110A transmits the calculation result to each of the computing devices 110B of the second hidden layer 102B. Each computing device 110B of the second hidden layer 102B executes particular calculation processing with respect to device input data, which is the calculation result of each computing device 110A. Each computing device 110B transmits the calculation result to the output layer 103.
In a case in which the hidden layers 102 have a layered structure in this manner, the inference and learning (training) ability of the neural network 100 can be improved. Note that the number of hidden layers 102 may be three or more, or may be one. One hidden layer may be configured to include an arbitrary combination of processing such as product-sum operation processing, pooling processing, normalization processing, and/or activation processing.
The output layer 103 receives the results of the various calculation processing executed by the computing devices 110 of the hidden layers 102 and executes various processing.
The output layer 103 has a plurality of computing devices 119. Each computing device 119 executes particular processing with respect to device input data, which is the calculation results of the plurality of computing devices 110B. As a result, based on the calculation results of the hidden layers 102, inference such as recognition and categorization about the input data supplied to the neural network 100 can be executed. Each computing device 119 can store and output obtained processing results (for example, categorization results). The output layer 103 also functions as a buffer and an interface for outputting the calculation results of the hidden layers 102 to outside the neural network 100.
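For illustration, the following is a minimal sketch in Python (NumPy) of the layered processing described above, with an input feeding two hidden layers and an output layer. The layer sizes, the random weights, and the ReLU activation are illustrative assumptions and not values from the embodiment.

```python
# A minimal sketch of the layered structure: input -> hidden layers 102A/102B -> output layer 103.
# Layer sizes, random weights, and the ReLU activation are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
w1 = rng.normal(size=(16, 32))   # weights of the first hidden layer 102A
w2 = rng.normal(size=(32, 32))   # weights of the second hidden layer 102B
w3 = rng.normal(size=(32, 10))   # weights of the output layer 103

def forward(x):
    h1 = np.maximum(x @ w1, 0.0)   # product-sum operations + activation (102A)
    h2 = np.maximum(h1 @ w2, 0.0)  # product-sum operations + activation (102B)
    return h2 @ w3                 # output layer 103, e.g. categorization scores

scores = forward(rng.normal(size=(1, 16)))  # one item of converted input data
```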
Note that the neural network 100 may be provided outside the GPU 10. In other words, the neural network 100 may be realized by using not only the GPU 10, but also, for example, the CPU 20, the memory 70, and the storage medium 99 in the computer system 1.
The computer system 1 of the present embodiment executes, for example, various calculation processing for inference in voice recognition or image recognition and various calculation processing for machine learning (for example, deep learning) by the neural network 100.
For example, in the computer system 1, based on the various calculation processing by the neural network 100 with respect to image data, it is possible to carry out recognition and categorization with high accuracy to find out what the image data is or to carry out learning so as to recognize/categorize the image data with high accuracy.
The inference time of a convolutional neural network (CNN) including quantized groups is determined by the following three factors related to calculation cost. Herein, the calculation cost refers to the processing time and power consumption of inference.
(1) The number of product-sum operations of a convolutional neural network (CNN)
(2) Bit width dependency of the calculation speed of hardware to which the neural network is applied
(3) The unit of groups processed with the same bit accuracy
Since a convolutional neural network (CNN) has high arithmetic intensity, the calculation time and the calculation speed of the hardware become a bottleneck rather than the memory access time or the bandwidth. Therefore, in order to reduce the inference time, the calculation amount (for example, the number of product-sum operations) and the calculation speed of the hardware should be taken into consideration instead of the model size or memory usage. Accordingly, regarding the factor (1), the number of product-sum operations of the convolutional neural network (CNN) is dominant.
Regarding the factor (2), it is known that the reciprocal of the calculation speed, in other words, the calculation time, is proportional to the bit width in hardware, as described in the literature below.
A. Maki, D. Miyashita, K. Nakata, F. Tachibana, T. Suzuki, and J. Deguchi, “FPGA-based CNN Processor with Filter-Wise-Optimized Bit Precision,” IEEE Asian Solid-State Circuits Conference, 2018.
The factor (3) depends on the specifications of the hardware. For example, depending on the hardware, calculations are carried out with the same bit accuracy in units of a GPU kernel, or in units of a filter of the convolutional neural network (CNN).
For the above reasons, the inference time of a convolutional neural network (CNN) that includes quantized groups can be estimated by summing, over the processed groups, the products of the number of product-sum operations and the bit width of the weights.
In the present embodiment, in dedicated hardware capable of carrying out computation with a plurality of mixed bit widths (for example, 1 to 8 bits), it is assumed that the inference time is determined by a calculation amount of Σ(Number of product-sum operations)×(Bit width of weight).
In such dedicated hardware capable of computing a plurality of mixed bit widths (for example, 1 to 8 bits) related to weight, there is a demand to reduce the above described calculation amount, which determines the inference time, while maintaining recognition accuracy.
Therefore, in the present embodiment, a regularization method using an index of calculation cost that correlates with the inference time is proposed. Specifically, the estimated inference time is added to an error function, and the weights and the quantization step size are optimized while taking both the inference time and the recognition accuracy into consideration. As a result, an allocation of bit widths that realizes high recognition accuracy with less inference time can be obtained. Details will be described below.
Hereinafter, a procedure of quantizing the weights will be described. The index (calculation amount) for estimating the inference time is referred to as MAC×bit and is defined as follows.
MAC×bit = Σ_g {(#MAC operations)_g × b_g}   (1)
Herein, the index g represents a group to which quantization is applied. By appropriately setting the group g in accordance with the specifications of the hardware to which the neural network is applied, the calculation amount in that hardware can be expressed by the above described equation (1). Also, b_g represents the bit width required for expressing the quantized weight of the group g.
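For illustration, the following minimal Python sketch evaluates the equation (1) for a hypothetical convolution layer in which the groups g are filters; the output size, channel counts, kernel size, and bit widths are assumed example values.

```python
# A minimal sketch of equation (1); all sizes and bit widths are hypothetical
# example values, with filters taken as the groups g.
import numpy as np

def mac_x_bit(mac_ops_per_group, bits_per_group):
    # Sum over groups g of (#MAC operations)_g x b_g.
    return int(np.sum(np.asarray(mac_ops_per_group) * np.asarray(bits_per_group)))

# Example: a convolution layer with 3 filters; each filter needs
# Oh x Ow x Cin x Kh x Kw product-sum operations per input image.
oh, ow, cin, kh, kw = 16, 16, 32, 3, 3
macs = [oh * ow * cin * kh * kw] * 3
bits = [8, 4, 2]                       # bit widths b_g of the quantized filters
print(mac_x_bit(macs, bits))           # estimated calculation amount MAC x bit
```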
In the present embodiment, small bit widths are allocated to layers or filters that do not contribute to recognition accuracy among the layers and filters constituting the neural network, so that the bit widths of weights that do not affect the inference result are reduced in quantization and the calculation amount is reduced while recognition accuracy is maintained. Methods for finding the optimum allocation of bit widths include an optimization method based on a gradient descent method. The gradient descent method is an algorithm that updates the weights little by little and searches for a point at which the error becomes minimum. In the method based on the gradient descent method, in addition to the weights to be learned, a parameter used in quantization, such as the quantization step size, is treated as a variable, and the weights and the quantization step size are optimized so as to reduce errors in accordance with the gradient descent method. Then, based on the optimized values of the weights and the quantization step size, the optimum allocation of bit widths can be obtained.
Herein, the procedure of the optimization processing by the calculation module 1101 will be described along the flowchart.
As illustrated in the flowchart, the calculation module 1101 first carries out learning of the weight of the neural network 100 (S1).
Then, the calculation module 1101 initializes a quantization step size from the distribution of the values of the weight after learning (S2). More specifically, for example, an initial bit width before optimization is set to 8 bits, and the quantization step size is initialized to a value obtained by dividing a difference between the maximum value and the minimum value of the weight by 2 to the power of 8.
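The initialization of S2 can be sketched as follows; the 8-bit initial bit width follows the description above, while the shape of the trained weight tensor is an assumed example.

```python
# A minimal sketch of S2, assuming an initial bit width of 8 bits and a
# hypothetical trained weight tensor.
import numpy as np

def init_step_size(weights, init_bits=8):
    # Divide the weight range (max - min) by 2 to the power of init_bits.
    return (weights.max() - weights.min()) / 2 ** init_bits

trained_w = np.random.default_rng(0).normal(scale=0.1, size=(64, 32, 3, 3))
delta0 = init_step_size(trained_w)   # initial quantization step size
```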
Then, the calculation module 1101 determines whether the carried-out update count i has not exceeded an update count set in advance (S3).
If the carried-out update count i has not exceeded the update count (Yes in S3), the calculation module 1101 quantizes the weight with the current quantization step size and carries out learning again by forward propagation to calculate loss (S4).
Herein, a procedure of weight quantization in the present embodiment will be described.
When the weight W is quantized by the quantization step size Δ, the quantized weight W_int is expressed by the following equation (2).
W_int = round(W/Δ)   (2)
Herein, “round” is a function that rounds the value of an input argument to the closest integer value. Also, in the calculation of forward propagation, the weight W_dq obtained by inversely quantizing (dequantizing) W_int is expressed by the following equation (3).
W_dq = W_int × Δ   (3)
The bit width b_g required for expressing the quantized weight W_int can be expressed by the following equation (4).
b_g = ⌈log2(max_g(|W_int,g|) + 1)⌉   (4)
Herein, ⌈·⌉ represents the ceiling function, and max(·) represents the maximum function.
Herein, the index g represents the groups to which quantization is applied and is appropriately set in accordance with the specifications of the hardware to which the neural network is applied. For example, in a case in which the hardware of the above described literature serves as the target to which the neural network is applied, the group g corresponds to a filter.
In the method based on the gradient descent method, the weight W and the quantization step size Δ are set as parameters and are repeatedly updated so as to minimize the error in accordance with the gradient descent method. Then, the allocation of bit widths for the optimized weight W is obtained from the equation (4).
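The following minimal sketch puts the equations (2) to (4) together; treating the filters (axis 0 of the weight tensor) as the groups g follows the filter-wise example above, while the tensor shape and the initial step size are assumed example values.

```python
# A minimal sketch of equations (2)-(4); the weight shape and step size are
# hypothetical, and axis 0 of the tensor is taken as the filter (group) axis.
import numpy as np

def quantize(w, delta):
    return np.round(w / delta)            # eq. (2): W_int = round(W / delta)

def dequantize(w_int, delta):
    return w_int * delta                  # eq. (3): W_dq = W_int x delta

def bit_width_per_group(w_int):
    # eq. (4): b_g = ceil(log2(max_g|W_int,g| + 1)), one value per filter g
    max_abs = np.max(np.abs(w_int), axis=(1, 2, 3))
    return np.ceil(np.log2(max_abs + 1)).astype(int)

w = np.random.default_rng(1).normal(scale=0.1, size=(8, 16, 3, 3))
delta = (w.max() - w.min()) / 2 ** 8      # 8-bit initial step size (S2)
w_int = quantize(w, delta)
w_dq = dequantize(w_int, delta)           # used in the forward propagation
print(bit_width_per_group(w_int))         # allocated bit width of each filter
```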
Returning to the flowchart, the calculation module 1101 measures MAC×bit of the equation (1) by using the quantized weight (S5).
Subsequently, the calculation module 1101 determines whether the MAC×bit measured in S5 is smaller than a threshold value (target) (S6). If the measured MAC×bit is smaller than the threshold value (target) (Yes in S6), the calculation module 1101 updates the weight W and the quantization step size Δ by executing error back propagation using the loss calculated in S4 (S7) and increments the carried-out update count i by 1 (S8). Then, the process returns to S3.
Herein, the processing of S7 will be described in detail. Normally, an error back propagation method is used in learning: the information (∂Loss/∂W) for adjusting (optimizing) the weight W so as to reduce Loss is calculated, and the following equation (5) is then calculated by using the information (∂Loss/∂W); as a result, the weight W can be adjusted so as to reduce errors.
Note that the calculation procedure of the information (∂Loss/∂W) follows the error back propagation method.
Similarly, the quantization step size Δ can also be adjusted so as to reduce errors by obtaining (∂Loss/∂Δ) with the error back propagation method by the following equation (7) and then calculating the following equation (8).
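Since the equations (5) to (8) are not reproduced above, the following sketch only shows the general form of the update in S7 under the assumption of a plain gradient descent step with a hypothetical learning rate lr; grad_w and grad_delta stand for the back-propagated ∂Loss/∂W and ∂Loss/∂Δ.

```python
# A minimal sketch of the update in S7, assuming a plain gradient descent step
# with a hypothetical learning rate; grad_w and grad_delta stand for the
# back-propagated dLoss/dW and dLoss/dDelta.
import numpy as np

def update(w, delta, grad_w, grad_delta, lr=1e-3):
    w_new = w - lr * grad_w              # adjust the weight W to reduce the error
    delta_new = delta - lr * grad_delta  # adjust the quantization step size likewise
    return w_new, delta_new

w, delta = np.zeros(4), 0.01
w, delta = update(w, delta, grad_w=np.ones(4), grad_delta=0.5)
```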
On the other hand, if the measured MAC×bit is equal to or larger than the threshold value (target) (No in S6), the calculation module 1101 calculates a regularization term and adds the term to loss to obtain loss′ (S9). Then, the calculation module 1101 executes error back propagation by using loss′ calculated in S9, thereby updating the weight W and the quantization step size Δ (S7).
The regularization term of a comparative example is designed to add the model size (obtained by multiplying the element count of the weight by its bit width) as a penalty to the error so as to reduce the error, in the manner of the below-described equation (9). However, although the calculation amount (MAC×bit) is reduced when the model size is reduced, this is not an optimum solution.
On the other hand, as described in the below-described equation (10), the regularization term of the present embodiment takes the size (O_h,l, O_w,l) of the output image into consideration and adds the calculation amount (MAC×bit) as a penalty.
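The equations (9) and (10) themselves are not reproduced above; the following sketch merely contrasts the two kinds of penalty as described in the text, under the assumption that each layer l is a convolution whose weights share a single bit width, with hypothetical layer parameters.

```python
# A minimal sketch contrasting the two penalties described in the text; the
# layer parameters are hypothetical, and equations (9)/(10) are not reproduced.
def model_size_penalty(layers):
    # Comparative example: element count of the weights times their bit width.
    return sum(l["cout"] * l["cin"] * l["kh"] * l["kw"] * l["bits"] for l in layers)

def mac_x_bit_penalty(layers):
    # Present embodiment: the output image size (Oh_l, Ow_l) is also taken into
    # account, so the penalty tracks the calculation amount (MAC x bit).
    return sum(l["oh"] * l["ow"] * l["cout"] * l["cin"] * l["kh"] * l["kw"] * l["bits"]
               for l in layers)

layers = [dict(oh=32, ow=32, cout=16, cin=3,  kh=3, kw=3, bits=8),
          dict(oh=16, ow=16, cout=32, cin=16, kh=3, kw=3, bits=4)]
print(model_size_penalty(layers), mac_x_bit_penalty(layers))
```

For these example layers, the first layer looks cheap under the model-size penalty but contributes a comparable share of MAC×bit because of its larger output image, which the model-size penalty ignores.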
The calculation module 1101 repeats the processing of S4 to S9 until the carried-out update count i exceeds the update count set in advance. If the carried-out update count i has exceeded the update count set in advance (No in S3), the calculation module 1101 terminates the processing.
Herein, differences between the aspects on which the method of the embodiment and that of the comparative example focus will be described.
As described above, the comparative example focuses on the model size, whereas the present embodiment focuses on the calculation amount (MAC×bit), which also takes the size of the output image into consideration and therefore correlates with the inference time.
In this manner, according to the present embodiment, the calculation amount in the inference time of the neural network is calculated by using the result of summing, with respect to the groups to which quantization is applied, the products of the number of product-sum operations and the bit widths of the weights for the product-sum operations in the neural network. Then, based on the calculated calculation amount, the values of the weights and the quantization step size are optimized so as to minimize recognition errors by the neural network. As a result, high recognition accuracy can be realized with less inference time, in consideration of the calculation amount and the calculation performance of the hardware.
Note that the computing device of the present embodiment, the computer system including the computing device of the present embodiment, and the storage medium that stores the computing method of the present embodiment can be applied to smartphones, mobile phones, personal computers, digital cameras, car-mounted cameras, monitor cameras, security systems, AI equipment, system libraries (databases), artificial satellites, and so on.
The above description shows an example in which the computing device, the computer system, and the computing method of the present embodiment are applied to the neural network in the computer system 1 related to natural language processing, in which a human language (natural language) is processed by a machine. However, the computing device and the computing method of the present embodiment can be applied to various computer systems including neural networks and to various data processing methods that execute calculation processing by neural networks.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel embodiments described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the embodiments described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.