The present invention relates to a technique for processing information with high reliability, and more particularly, to a calculation system and a calculation method of a neural network.
In recent years, it has been found that a high recognition rate can be achieved by using a deep neural network (DNN) for image recognition, and the DNN has been attracting attention (see, for example, JP-2013-69132-A). Image recognition is processing that classifies and identifies the types of objects in an image. The DNN is a machine learning technique that can achieve a high recognition rate by performing feature quantity extraction in multiple layers, connecting perceptrons each of which extracts a feature quantity of input information.
The improvement in computer performance can be considered one background factor as to why the DNN has been found to be particularly effective among machine learning algorithms. In order to achieve a high recognition rate with the DNN, it is necessary to train and optimize the parameter data (hereinafter simply referred to as “parameters”) of the perceptrons of the intermediate layers by using thousands or tens of thousands of pieces of image data. As the number of parameters increases, more detailed classification of images and a higher recognition rate can be achieved. Therefore, higher computing performance is required in order to train a large number of parameters with a large amount of image data, and general image recognition with the DNN has been realized through recent developments in computers such as multicore technology in servers and GPGPU (general-purpose computing on graphics processing units).
With the wide recognition of the effectiveness of the DNN, research on the DNN has spread explosively and various applications are being studied. In one example, the DNN is considered for recognizing surrounding objects in the development of automatic driving techniques for automobiles.
The current DNN algorithm requires a large memory capacity for storing the parameters necessary for processing and a large calculation load, and consumes high power. In this regard, built-in applications such as automobiles have restrictions on resources and processing performance compared with server environments.
Therefore, the inventors considered combining an FPGA (Field-Programmable Gate Array), which has high computation efficiency per unit of power, with an external memory such as a DRAM (Dynamic Random Access Memory) when mounting the system in a general-purpose small device for automotive applications.
On the other hand, in order to speed up processing (parallelization) and achieve lower power consumption, it is effective to reduce the usage of the external memory and use the internal memory. Therefore, the inventors also considered making effective use of a CRAM (Configuration Random Access Memory) and the like, which is the internal memory of the FPGA. However, a memory having low resistance to soft errors, for example, an SRAM (Static Random Access Memory), is used for the CRAM constituting the logic of the FPGA, and a soft error occurring there changes the operation of the device itself; thus it is necessary to take measures against soft errors.
As a countermeasure against CRAM soft errors, it may be possible to detect a soft error by cyclically monitoring the memory and comparing it with the configuration data stored in the external memory. However, a predetermined period of time (for example, 50 ms or more) is required for error detection, and erroneous processing may be performed until error detection and correction are completed.
Therefore, it is an object of the present invention to enable information processing with a high degree of reliability using DNN, and to provide an information processing technique capable of achieving a higher speed and a lower power consumption.
One aspect of the present invention is a calculation system in which a neural network performing calculation using input data and a weight parameter is implemented in a calculation device, which includes a calculation circuit and an internal memory, and in an external memory, in which the weight parameter is divided into two, i.e., a first weight parameter and a second weight parameter, the first weight parameter is stored in the internal memory of the calculation device, and the second weight parameter is stored in the external memory.
Another aspect of the present invention is a calculation system including an input unit receiving data, a calculation circuit constituting a neural network performing processing on the data, a storage area storing configuration data for setting the calculation circuit, and an output unit for outputting a result of the processing, in which the neural network contains an intermediate layer that performs processing including inner product calculation, and a portion of a weight parameter for the calculation of the inner product is stored in the storage area.
Another aspect of the present invention is a calculation method of a neural network, in which the neural network is implemented on a calculation system including a calculation device including a calculation circuit and an internal memory, an external memory, and a bus connecting the calculation device and the external memory, and the calculation method of the neural network performs calculation using input data and a weight parameter with the neural network. In this case, the calculation method of the neural network includes storing a first weight parameter, which is a part of the weight parameter, in the internal memory, storing a second weight parameter, which is another part of the weight parameter, in the external memory, reading the first weight parameter from the internal memory and reading the second weight parameter from the external memory when the calculation is performed, and preparing the weight parameter required for the calculation in the calculation device and performing the calculation.
According to the present invention, it is possible to process information with a high degree of reliability using DNN, and to provide an information processing technique capable of achieving a higher speed and a lower power consumption. The problems, configurations, and effects other than those described above will become apparent from the following description of the embodiments.
Embodiments will be described with reference to the drawings. In all the drawings explaining the embodiments, the same reference numerals are given to the constituent elements having the same functions, and the repetitive description will be omitted unless it is particularly necessary.
In one example of the embodiments described below, a neural network that performs calculation by using input data and a weight parameter is implemented on a calculation device, such as an FPGA, including a calculation circuit and a memory therein, and on an external memory. The weight parameter is divided into first and second weight parameters; the first weight parameter is stored in a memory provided inside the calculation device, such as a CRAM, and the second weight parameter is stored in an external memory such as a DRAM or a flash memory.
More specifically, in the present embodiment, the set of weight parameters used for the DNN calculation is divided into two as follows. The first weight parameter is a parameter having a low contribution to the calculation result of the DNN, for example, a weight value close to 0 or bits representing the lower digits of a weight. On the other hand, the second weight parameter is a parameter having a high contribution to the calculation result of the DNN, and can be defined as at least a part of the parameters other than the first weight parameter. Then, the first weight parameter is stored in the internal memory (CRAM), the second weight parameter is stored in the external memory (DRAM), and the DNN calculation is executed.
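As a rough illustration of this division, the following Python sketch (purely a software analogue, not part of the embodiment; the threshold value is only an assumption borrowed from an example given later) classifies a set of weights by their contribution, using the magnitude criterion alone. The bit-level variant of the split is illustrated further below.

```python
def partition_by_contribution(weights, zero_threshold=0.005):
    """Sketch: split a layer's weights into a low-contribution part (candidates
    for the internal memory, CRAM/BRAM) and a high-contribution part (for the
    external DRAM). Only the magnitude criterion is shown here."""
    first, second = [], []
    for w in weights:
        # Weights close to 0 contribute little to the inner product result.
        (first if abs(w) <= zero_threshold else second).append(w)
    return first, second
```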
In the embodiment described below, the DNN for processing an image is described, but the application is not limited to the image recognition device.
The accelerator 100 is a device dedicated to processing image data, and the input data is image data sent from the CPU 101. More specifically, when image data processing becomes necessary, the CPU 101 sends the image data to the accelerator 100 and receives the processing result from the accelerator 100.
The accelerator 100 has a calculation data storage area 103 (which may be referred to as an “internal memory 103” for the sake of convenience) and a calculation unit 104 in the inside. An input port and an output port (not shown), the calculation data storage area 103 and the calculation unit 104 are connected by a bus 105 (which may be referred to as an “internal bus 105” for the sake of convenience), and the calculation data is transferred via the bus 105.
The calculation data storage area 103 includes a BRAM (Block RAM) 106 used as a temporary storage area and a CRAM 107. The BRAM 106 stores intermediate results of the calculation executed by the accelerator 100. The CRAM 107 stores configuration data for setting each module of the calculation unit 104. As will be described later, the BRAM and the CRAM also store the parameters (weight data) of the intermediate layers of the DNN.
The calculation unit 104 contains the modules necessary for the calculation of the DNN. Each module included in the calculation unit is programmable by the function of the FPGA. However, it is also possible to configure some of the modules with fixed logic circuits.
In a case where the accelerator 100 is constituted by an FPGA, the calculation unit 104 can be composed of programmable logic cells. The data for programming, such as the contents of lookup tables and the data for setting the switches of the modules 108 to 114 of the calculation unit 104, is loaded from the external memory 102 to the CRAM 107 of the calculation data storage area 103 under the control of the CPU 101, and the logic cells are set so as to realize the functions of the modules 108 to 114.
The calculation control module 108 is a module that controls the flows of other calculation modules and calculation data according to the algorithm of DNN.
The decode calculation module 109 is a module that decodes the parameter stored in the external memory 102 and the internal memory 103. The decode calculation module 109 will be explained in detail later.
The convolution calculation and full connection calculation module 110 is a module that executes the convolution calculation or the full connection calculation in the DNN. Since the contents of the convolution calculation and the full connection calculation are both inner product calculations, both can be executed with one module. Even if there are multiple convolution layers and full connection layers, they can be executed with one convolution calculation and full connection calculation module 110.
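Since both operations reduce to inner products, a single module can serve both kinds of layer. A software analogue (not the FPGA logic itself; single-channel input, stride 1, and no padding are assumptions) might look like:

```python
import numpy as np

def full_connection(x, w, b):
    # Full connection layer: each output is the inner product of the input
    # vector with one row of the weight matrix, plus a bias.
    return w @ x + b

def convolution(image, kernel):
    # Convolution layer: each output value is the inner product of the kernel
    # with one image patch.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.dot(image[i:i + kh, j:j + kw].ravel(), kernel.ravel())
    return out
```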
The activation calculation module 111 is a module that executes the calculation of the activation layer of the DNN.
The pooling calculation module 112 is a module that executes the calculation of the pooling layer in the DNN.
The normalization calculation module 113 is a module that executes the calculation of the normalization layer in the DNN.
The maximum value calculation module 114 is a module for detecting the maximum value of the output layer in the DNN and obtaining the recognition result 202. Among these calculation modules, the modules deeply related to the contents of the present embodiment are the decode calculation module 109 and the convolution calculation and full connection calculation module 110. These two modules will be described in detail later. Configurations whose explanation is omitted in the present embodiment may be based on known FPGA or DNN techniques.
The convolution layers CN1 and CN2 extract the information (feature quantities) required for recognition from the input image data 201. For the convolution processing required to extract the feature quantities, the convolution layer uses parameters. The pooling layer summarizes the information obtained by the convolution layer and, when the data is an image, increases the invariance with respect to position.
The full connection layer IP1 uses the extracted feature quantity to determine which category the image belongs to, i.e., performs the pattern classification.
Each layer constitutes one layer of a multi-layer perceptron. Conceptually, it can be considered that a plurality of nodes are arranged in a row in one layer. One node is associated with all the nodes in the upstream layer. Weight data W (also referred to as a “weight parameter”) is allocated as a parameter to each connection. The input into a node of the downstream layer is based on the inner product of the inputs from the upstream layer and the weight data. Bias data, threshold value data, and the like may also be used for the calculation. In the present specification, these are collectively referred to as parameters. In the present embodiment, characteristic processing is performed when storing the parameters of each layer constituting the neural network in the memory.
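For reference, the input to one downstream node can be written in conventional perceptron notation (the symbols here are generic and are not taken from the drawings of this specification):

```latex
% x_i: values from the upstream layer, w_{ij}: weight data W on the connection
% from upstream node i to downstream node j, b_j: bias, f: activation function
o_j = f\Big(\sum_i w_{ij}\, x_i + b_j\Big)
```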
The programming of the calculation unit 104 is performed by the configuration data C stored in the CRAM 107. Since the CRAM 107 is composed of an SRAM, the configuration data C is loaded from the external memory 102 or the like into the CRAM 107 under the control of the CPU 101 at the time of power-on or the like.
As explained above, the weight data W is allocated to the external memory 102 and the calculation data storage area 103 according to the following rule.
Under this rule, the weight data Wm having a low contribution to the calculation result is stored in the BRAM 106 or the CRAM 107 of the calculation data storage area 103, which has low soft error resistance. The weight data Wk having a high contribution to the calculation result is not stored in the calculation data storage area 103 having low soft error resistance. By storing the weight data Wm having a low contribution to the calculation result in the calculation data storage area 103, which is an internal memory, and using it for calculation, high-speed processing and low power consumption are achieved. In addition, the adverse effect of soft errors on the calculation result can be reduced by performing the calculation with the weight data Wk having a high contribution to the calculation result held in the external memory 102, which has high soft error resistance.
When the image recognition device 1000 performs image recognition, the image data 201 is held in the BRAM 106 as input data I, and calculation is performed with the logic modules of the calculation unit 104. Taking the convolution calculation and full connection calculation module 110 as an example, the parameters required for the calculation are read from the external memory 102 or the calculation data storage area 103 into the calculation unit 104, and the calculation is performed. In the case of the inner product calculation, as many pieces of weight data W as the product of the number of input-side nodes I and the number of output-side nodes O are required.
The convolution layers CN1 and CN2, the full connection layer IP1, and the like all perform the sum-of-products calculation (inner product calculation). Therefore, if the convolution calculation and full connection calculation module 110 is programmed in accordance with the largest row and column, one convolution calculation and full connection calculation module 110 can be used in common for the calculation of each layer by changing the parameters. In this case, the amount of configuration data C can be kept small. However, the amount of weight data W increases as the number of layers and nodes increases.
However, when the weight data W0 close to 0 changes to the weight data far from 0 due to soft error, the adverse effect on calculation result becomes large. Therefore, it is desirable to limit the weight data Wm stored in the calculation data storage area 103 to bits representing lower digits of weight.
How to divide the weight data into W1 and W0 and how to divide W0 into Wm and Wk depend on the soft error resistance of the device and the content of the calculation, but basically they depend on the magnitude of the weight data and the bit position of its digits. For example, plus or minus 0.005 is set as a threshold value, and a parameter whose absolute value is equal to or less than 0.005 can be approximated to zero and treated as weight data W0 close to 0. As another example, the three lower bits are set as the weight data Wm stored in the calculation data storage area 103, and the remaining part is set as the weight data Wk stored in the external memory 102.
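A minimal sketch of the bit-level split (assuming the weight is already encoded as an unsigned fixed-point bit pattern; three lower bits follow the example above, while the later description also mentions a 6-upper/2-lower split):

```python
def split_weight_bits(q_weight, low_bits=3):
    """Sketch: divide one quantized weight (an integer bit pattern) into
    Wm = the lower bits  -> stored in the calculation data storage area 103
    Wk = the upper bits  -> stored in the external memory 102."""
    mask = (1 << low_bits) - 1
    wm = q_weight & mask       # lower digits: low contribution to the result
    wk = q_weight >> low_bits  # upper digits: high contribution to the result
    return wm, wk

def recombine_weight(wm, wk, low_bits=3):
    # Decode step: restore the original bit pattern from the two parts.
    return (wk << low_bits) | wm
```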
The decode calculation module 109 selects the weight data stored in the external memory 102 and the calculation data storage area 103 with a selector 801, controls the timing with a flip flop 802, and sends it to the calculation unit 104. The image data 201 and the intermediate data are also sent to the calculation unit 104 by controlling timing with a flip flop 803.
Next, in the processing of S1003, reference is made to the allocation table that maps the weight data W to the external memory 102 and the internal memory 103. The allocation table is stored, for example, in the DRAM of the external memory 102 in advance.
In the processing of S1004, referring to the allocation table 1100, a predetermined number of bits of the weight data Wm are loaded from the external memory 102 into the internal memory 103.
In the processing in S1005, an address table 1200 indicating the storage location of the weight data Wk stored in the external memory 102 and the weight data Wm loaded in the internal memory 103 is created, and the address table 1200 is stored in the CRAM 107 or the BRAM 106.
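The exact formats of the allocation table 1100 and the address table 1200 follow the drawings, which are not reproduced here; a hypothetical in-memory representation, given purely for illustration (all field names and values below are assumptions), might look like:

```python
# Hypothetical layouts of the two tables (not the actual formats).
allocation_table_1100 = {
    # parameter id -> how many bits of the weight go to each memory
    1: {"external_bits": 6, "internal_bits": 2},
    2: {"external_bits": 6, "internal_bits": 2},
}

address_table_1200 = {
    # parameter id -> (memory, bank, address) of each part after loading
    1: {
        "Wk": ("external_memory_102", "BANK_A", 0x0000),  # upper bits in the DRAM
        "Wm": ("internal_memory_103", "BANK_A", 0x0000),  # lower bits in the CRAM/BRAM
    },
}
```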
The preparation of data necessary for calculation of the calculation unit 104 is completed prior to the image processing of the image recognition device 1000.
Step S1301: The accelerator 100 of the image recognition device 1000 receives the image data 201, which is the input data, from the CPU 101 and stores it in the BRAM 106 in the calculation data storage area 103. The image data corresponds to the input layer IN in the DNN.
Step S1302: The feature quantity extraction is performed with the parameters by using the convolution calculation and full connection calculation module 110. This corresponds to the convolution layers CN1 and CN2 in the DNN. The details will be explained later.
Step S1303: The activation calculation module 111 and the pooling calculation module 112 are applied to the result of the convolution calculation and the result of full connection calculation which are contained in the BRAM 106 in the calculation data storage area 103. The calculation equivalent to the activation layer and the pooling layer in the DNN is executed.
Step S1304: The normalization calculation module 113 is applied to the intermediate layer data stored in the BRAM 106 in the calculation data storage area 103. The calculation equivalent to normalization layer in the DNN is executed.
Step S1305: The feature quantity extraction is performed with the parameter using convolution calculation and full connection calculation module 110. It corresponds to the full connection layer IP1 in the DNN. Details will be explained later.
Step S1306: The index of the element having the maximum value in output layer is derived and output as the recognition result 202.
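As a purely software paraphrase of steps S1301 to S1306 (the helper object below is a hypothetical stand-in for the accelerator modules, not an actual API of the embodiment):

```python
def run_dnn_inference(image_data, accel):
    """Software paraphrase of the overall flow S1301-S1306; `accel` is a
    hypothetical wrapper exposing the calculation modules of the accelerator 100."""
    x = accel.store_input(image_data)        # S1301: input layer IN held in the BRAM 106
    x = accel.conv_fc(x, layer="CN")         # S1302: convolution layers CN1, CN2
    x = accel.pooling(accel.activation(x))   # S1303: activation and pooling layers
    x = accel.normalization(x)               # S1304: normalization layer
    x = accel.conv_fc(x, layer="IP1")        # S1305: full connection layer IP1
    return accel.argmax(x)                   # S1306: index of the maximum -> result 202
```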
Step S1401: The loop variable is initialized as i=1.
Step S1402: The i-th filter of the convolution layer is selected. Here, the multiple pieces of weight data W for the multiple inputs connected to one node in the downstream stage are collectively referred to as a filter.
Step S1403: The parameter is decoded. More specifically, the parameter is loaded into the input register of the convolution calculation and full connection calculation module 110. The details will be explained later.
Step S1404: The data of the intermediate layer stored in the BRAM 106 in the inside of the calculation data storage area 103 is loaded into the input register of the convolution calculation and full connection calculation module 110 as input data.
Step S1405: The inner product calculation is performed by using the convolution calculation and full connection calculation module 110. The output data stored in the output register is temporarily stored in the BRAM 106 in the inside of the calculation data storage area 103 as an intermediate result of calculation.
Step S1406: If the filter has been applied to all the input data, the flow proceeds to step S1407. Otherwise, the target intermediate layer data to which the filter is applied is changed, and step S1404 is performed again.
Step S1407: If there is an unprocessed filter, the process proceeds to step S1408. When the processing of all the filters is completed, the processing flow of the convolution calculation is terminated; the final output of the layer is transferred to the external memory 102, and the data is then transferred to the BRAM 106 to become the input of the subsequent layer.
Step S1408: The loop variable is updated as i=i+1 and the subsequent filter is processed.
With the above processing, the processing flow S1302 for one convolution layer is performed. Although there are some differences, the processing flow S1305 of the full connection layer likewise performs the inner product calculation while changing the parameters, and it can be processed in the same way.
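A software paraphrase of the filter loop S1401 to S1408 (all callables below are hypothetical placeholders for the accelerator operations, not an actual API) could be:

```python
def convolution_layer_flow(filters, layer_input, decode_params, input_windows,
                           inner_product, store_intermediate):
    """Software paraphrase of steps S1401-S1408 for one convolution layer."""
    for i, filt in enumerate(filters, start=1):     # S1401, S1402, S1408: filter loop
        w = decode_params(filt)                     # S1403: decode parameters into registers F
        for window in input_windows(layer_input):   # S1404, S1406: each block of the input data
            y = inner_product(w, window)            # S1405: sum-of-products (inner product)
            store_intermediate(i, y)                #        intermediate result -> BRAM 106
    # S1407: after all filters are processed, the accumulated output becomes the
    # input of the subsequent layer.
```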
The storage areas of the external memory 102 and the internal memory 103 are divided into banks 1501, and address numbers are assigned as addresses 1502. The configuration of the banks 1501 and the way the addresses 1502 are assigned depend on the physical configuration of the memory, but here it is assumed that they are common to the external memory 102 and the internal memory 103 and that one parameter is stored at each address.
In the external memory 102, 8 bits of data 1503a are stored at one address, but only the upper 6 bits, indicated by hatching, are decoded. In the internal memory 103, 2 bits of data 1503b are stored at one address, and both of the 2 bits, indicated by hatching, are decoded.
The decode calculation module 109 has, in the inside, a register 162 for temporarily saving parameters and a decode processing unit 161 for decoding filter data. The convolution calculation and full connection calculation module 110 is a calculation module that executes the inner product calculation, and has input registers 163, multipliers 164, adders 165, and an output register 166. There are an odd number of (2N+1) input registers 163 in total, including registers F holding parameters and registers D holding calculation results of the upstream layer. The input registers 163 are connected to the bus 160 in the inside of the calculation unit 104, and receive and hold input data from the bus 160. All of these input registers 163 except one are connected to the inputs of the multipliers 164, and the remaining one is connected to the input of the adders 165. Of the 2N input registers 163 connected to the inputs of the multipliers 164, half, i.e., N registers F, receive and hold the parameters of the intermediate layer, and the remaining half, i.e., N registers D, receive and hold the calculation intermediate results saved in the BRAM 106 in the internal memory 103.
The convolution calculation and full connection calculation module 110 has N multipliers 164 and adders 165. The N multipliers each calculate the product of a parameter and a calculation intermediate result and output it. The adders 165 calculate the sum of the N multiplier outputs and the value of the one input register connected to them, and the result is saved in the output register 166. The calculation data saved in the output register 166 is transferred to the external memory 102 or a calculation module through the bus 160 in the inside of the calculation unit 104.
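In other words, one activation of the module computes a sum of N products plus the value of the single register fed to the adders; a sketch follows (the role of that single register, for example as a bias or a partial sum, is an assumption, since the specification does not name it):

```python
def conv_fc_module_step(registers_f, registers_d, adder_register):
    """Sketch of one inner product step of module 110: N registers F (parameters)
    and N registers D (intermediate results) feed the multipliers 164; their
    products are summed together with the one remaining input register, and the
    total goes to the output register 166."""
    assert len(registers_f) == len(registers_d)                   # N parameters, N data values
    products = [f * d for f, d in zip(registers_f, registers_d)]  # N multipliers 164
    return sum(products) + adder_register                         # adders 165 -> output register
```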
Explanation will be given by taking as an example the case of decoding the parameter 1503 of the convolution layer CN2 described above.
Next, the decode processing unit 161 in the inside of the calculation unit 104 gives an instruction to transfer, to the register 162 in the inside of the decode calculation module 109, the 2-bit parameter stored at the address ADDR 0 of BANK A of the internal memory 103, based on the address table 1200 described above.
Next, the decode processing unit 161 in the inside of the calculation unit 104 transfers the parameter stored in the register 162 to the register F of the convolution calculation and full connection calculation module via the bus 160.
Step S1701: The number of parameters of the corresponding filter is referred to and set as k. It is assumed that one corresponding parameter is stored at each address.
Step S1711: The loop variable j is initialized as j=1.
Step S1712: The calculation control module 108 transfers the n-bit portion of the parameter stored at the j-th address of the external memory 102 to the register 162 in the inside of the decode calculation module 109 through the internal bus 105 of the accelerator 100 and the bus 160 in the inside of the calculation unit 104.
Step S1713: The calculation control module 108 transfers the m-bit portion of the parameter stored at the j-th address of the internal memory 103 to the register 162 in the inside of the decode calculation module 109 through the internal bus 105 of the accelerator 100 and the bus 160 in the inside of the calculation unit 104.
Step S1714: The calculation control module 108 transfers the (n+m)-bit parameter stored in the register 162 to the j-th register F.
Step S1715: The loop variable is updated as j=j+1. If j≤k is satisfied, step S1712 is subsequently performed, and if not, the decode processing flow of the parameter is terminated.
Thus, the decode of the weight parameter corresponding to one filter of one layer is completed.
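Putting steps S1711 to S1715 together, a software paraphrase of decoding one filter (the reader and writer callables are hypothetical; the recombination order assumes that the lower digits are the bits held internally, as described above) is:

```python
def decode_filter(k, read_external_bits, read_internal_bits, write_register_f,
                  m_bits=2):
    """Software paraphrase of the decode loop S1711-S1715 for one filter of one layer."""
    for j in range(1, k + 1):
        wk = read_external_bits(j)        # S1712: n upper bits from the external memory 102
        wm = read_internal_bits(j)        # S1713: m lower bits from the internal memory 103
        parameter = (wk << m_bits) | wm   # register 162 now holds the (n+m)-bit parameter
        write_register_f(j, parameter)    # S1714: load into the j-th register F
    # S1715: loop until all k parameters of the filter have been decoded.
```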
According to the above-described embodiment, by utilizing the internal memory of the FPGA, high-speed and low-power-consumption calculation can be realized, and the calculation result is highly reliable.
The present invention is not limited to the embodiments described above, but includes various modifications. For example, it is possible to replace a part of the configuration of one embodiment with the configuration of another embodiment, and it is possible to add the configuration of one embodiment to the configuration of another embodiment. Further, it is possible to add, delete, or replace the configuration of another embodiment to, from, or with a part of the configuration of each embodiment.
Number | Date | Country | Kind
---|---|---|---
2017-006740 | Jan 2017 | JP | national