The present invention relates to a neural network circuit associated with a convolutional neural network.
A convolutional neural network (CNN) is used in various fields including image recognition. However, the amount of calculation in a CNN is huge, and as a result the processing speed decreases.
In general, in a convolution layer, the convolution operation in the spatial direction and the convolution operation in the channel direction are performed simultaneously, so the amount of calculation becomes huge. Therefore, a method has been devised that separates the convolution operation in the spatial direction from the convolution operation in the channel direction and executes them separately (see, for example, the non-patent literature 1).
In the convolution operation method described in the non-patent literature 1 (hereafter referred to as depthwise separable convolution), the convolution is separated into a pointwise convolution, which is a 1×1 convolution, and a depthwise convolution. The pointwise convolution performs convolution not in the spatial direction but in the channel direction. The depthwise convolution performs convolution only in the spatial direction; the size of its filter is, for example, 3×3.
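For illustration only (this is not the claimed circuit), the following minimal Python/NumPy sketch shows the two separated operations: a depthwise convolution that filters each channel independently in the spatial direction, and a pointwise (1×1) convolution that mixes channels at each pixel. All function names and shapes here are hypothetical.

```python
import numpy as np

def depthwise_conv(x, w):
    # x: (H, W, M) input feature map; w: (K, K, M), one K x K filter per channel.
    # Spatial-direction convolution only: each channel is filtered independently
    # ("valid" padding, implemented as cross-correlation, as is usual in CNNs).
    H, W, M = x.shape
    K = w.shape[0]
    out = np.zeros((H - K + 1, W - K + 1, M))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + K, j:j + K] * w, axis=(0, 1))
    return out

def pointwise_conv(x, w):
    # x: (H, W, M); w: (M, N). Channel-direction (1 x 1) convolution: a matrix
    # product over the channel axis at every pixel position.
    return x @ w

x = np.random.rand(8, 8, 16)                     # toy input: 8 x 8, 16 channels
y = depthwise_conv(x, np.random.rand(3, 3, 16))  # 3 x 3 depthwise -> (6, 6, 16)
z = pointwise_conv(y, np.random.rand(16, 32))    # 1 x 1 pointwise -> (6, 6, 32)
print(z.shape)
```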
In the case of using a general convolution filter, when the vertical size of the input feature map is H, the horizontal size of the input feature map is W, the number of input channels is M, the filter size is K×K, and the number of output channels is N, the amount of multiplication (calculation amount) is H·W·M·K·K·N (equation (1)).
Since the depthwise convolution in the depthwise separable convolution does not perform convolution in the channel direction, when the size of the depthwise convolution filter is DK×DK, the calculation amount is H·W·M·DK·DK (equation (2)).
Since the pointwise convolution in the depthwise separable convolution does not perform convolution in the spatial direction, it corresponds to the case of DK=1, and the calculation amount is H·W·M·N (equation (3)).
Comparing the sum of the calculation amounts of equations (2) and (3) (the calculation amount of the depthwise separable convolution) with the calculation amount of equation (1) (the calculation amount of general convolution), the calculation amount of the depthwise separable convolution is (1/N)+(1/DK²) times the calculation amount of general convolution. When the size of the depthwise convolution filter is 3×3, the calculation amount of the depthwise separable convolution is reduced to about 1/9 of the calculation amount of general convolution, since the value of N is generally much larger than DK²=9 and the 1/N term is therefore negligible.
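The reduction ratio can be checked numerically; the following sketch evaluates equations (1) to (3) with illustrative values for H, W, M, and N:

```python
# Numerical check of equations (1)-(3); H, W, M, N values are illustrative.
H, W, M, N, DK = 112, 112, 64, 64, 3

standard  = H * W * M * DK * DK * N   # equation (1), with K = DK
depthwise = H * W * M * DK * DK       # equation (2)
pointwise = H * W * M * N             # equation (3)

ratio = (depthwise + pointwise) / standard
assert abs(ratio - (1 / N + 1 / DK**2)) < 1e-12
print(ratio)  # ~0.127, i.e. roughly 1/9 when N is large
```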
NPL 1: A. G. Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications", Google Inc., April 2017.
In the following, it is assumed that a 3×3 filter is used in the depthwise convolution of the depthwise separable convolution. In this case, as shown in Table 1 of the non-patent literature 1, the 1×1 matrix operation (1×1 convolution operation) and the 3×3 matrix operation (3×3 convolution operation) are performed alternately many times.
When realizing an operation circuit that performs the depthwise separable convolution, a configuration such as the one shown in the drawing, in which a 1×1 convolution operation circuit 10 and a 3×3 convolution operation circuit 30 exchange data through a DRAM 50, is conceivable.
The 3×3 convolution operation circuit 30 reads feature map data from the DRAM 50, performs the depthwise convolution using a weight coefficient read from the weight memory 60, and writes the computation result to the DRAM 50. The 1×1 convolution operation circuit 10 reads data from the DRAM 50, performs the pointwise convolution using a weight coefficient read from the weight memory 60, and writes the computation result to the DRAM 50. The amount of calculation results output from the 1×1 convolution operation circuit 10 and the 3×3 convolution operation circuit 30, and the amount of data input to them, are huge. Accordingly, the DRAM 50, which is relatively inexpensive even at large capacity, is generally used as the memory that stores the data.
Before the 1×1 convolution operation circuit 10 starts a computation process, the weight coefficients for the 1×1 convolution operation are loaded into the weight memory 60. The weight coefficients for the 3×3 convolution operation are loaded into the weight memory 60 before the 3×3 convolution operation circuit 30 starts a computation process.
As mentioned above, a DRAM is a relatively inexpensive, high-capacity device. However, a DRAM is a slow memory device; in other words, the memory bandwidth of a DRAM is narrow. Therefore, the data transfer between the operation circuit and the memory becomes a bottleneck, and the computation speed is limited. In particular, the situation in which the time required to read the data needed for one convolution operation from the DRAM exceeds the time required to perform that convolution operation is called a memory bottleneck.
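As an illustration of this definition, the following back-of-the-envelope sketch compares transfer time and computation time for one 1×1 convolution layer; the bandwidth and throughput figures are assumed values, not measurements of any particular device:

```python
# Back-of-the-envelope memory-bottleneck check for one 1x1 convolution layer.
# All figures below are assumed, illustrative values, not device measurements.
H, W, M, N     = 112, 112, 64, 64
bytes_per_elem = 2                       # e.g., 16-bit feature-map data
dram_bandwidth = 25.6e9                  # bytes/s (assumed DRAM interface)
mac_throughput = 1e12                    # multiply-accumulates/s (assumed)

t_compute  = (H * W * M * N) / mac_throughput               # time to compute
t_transfer = (H * W * M * bytes_per_elem) / dram_bandwidth  # time to read input

print(t_transfer > t_compute)  # True here -> a memory bottleneck as defined above
```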
In order to improve the processing speed, it is conceivable to use a calculator based on a systolic array as the calculator that performs the matrix operations in the convolution layer. Alternatively, a SIMD (Single Instruction Multiple Data) type calculator may be used as the calculator that performs the sum-of-products operations.
For example, a configuration in which such calculators are used in the 1×1 convolution operation circuit and the 3×3 convolution operation circuit, as illustrated in the drawing, is conceivable.
However, even in the configuration illustrated in the drawing, the data transfer between the operation circuits and the DRAM remains a bottleneck, so the limitation on computation speed is not eliminated.
It is an object of the present invention to provide a neural network circuit that can alleviate the limitation on computation speed caused by data transfer to and from a narrowband memory.
A neural network circuit according to the present invention divides a convolution operation into a convolution operation in a spatial direction and a convolution operation in a channel direction and performs the respective convolution operations separately, and includes a 1×1 convolution operation circuit that performs the convolution in the channel direction, an SRAM in which a computation result of the 1×1 convolution operation circuit is stored, and an N×N convolution operation circuit that performs the convolution in the spatial direction on the computation result stored in the SRAM.
According to the present invention, the limitation on computation speed caused by data transfer to and from a narrowband memory is alleviated.
Hereinafter, an exemplary embodiment of the present invention will be described with reference to the drawings.
The neural network circuit shown in the drawing includes a 1×1 convolution operation circuit 10, a weight memory 11, an SRAM 20, a 3×3 convolution operation circuit 30, a weight memory 31, and a DRAM 40.
The weight memory 11 stores a weight coefficient for 1×1 convolution operation. The weight memory 31 stores a weight coefficient for 3×3 convolution operation.
The neural network circuit shown in the drawing performs the depthwise separable convolution, that is, it divides the convolution operation into the convolution operation in the spatial direction and the convolution operation in the channel direction and executes them separately.
In this exemplary embodiment, the size of the filter used in the depthwise convolution is 3×3, i.e., a 3×3 convolution operation is performed in the depthwise convolution. However, it is not essential that the filter size be 3×3; the filter size may be N×N (N: a natural number of 2 or more).
The DRAM 40 stores a computation result of the 3×3 convolution operation circuit 30. The 1×1 convolution operation circuit 10 reads data to be computed from the DRAM 40. The SRAM 20 stores a computation result of the 1×1 convolution operation circuit 10. The 3×3 convolution operation circuit 30 reads data to be computed from the SRAM 20.
The circuit configuration shown in the drawing, in which the computation result of the 1×1 convolution operation circuit 10 is passed to the 3×3 convolution operation circuit 30 through the SRAM 20, is adopted for the following reason.
In the neural network circuit shown in the drawing, the calculation amounts of the two convolution operation circuits are as follows.
The calculation amount of the 3×3 convolution operation circuit 30 is H·W·M·3·3 (equation (4)). The calculation amount of the 1×1 convolution operation circuit 10 is H·W·M·N (equation (5)). As mentioned above, the value of the number of output channels N is generally much larger than DK (in this example, 3); that is, N>>DK. As an example, a value between 64 and 1024 is used as N. Values in the same range are used for the number of input channels M.
Comparing equation (4) with equation (5), it can be seen that the calculation amount of the 1×1 convolution operation circuit 10 is N/3², i.e., many times, larger than that of the 3×3 convolution operation circuit 30. On the other hand, the difference in the size of the input to the 1×1 convolution operation circuit 10 and to the 3×3 convolution operation circuit 30 is M1/M3 (the ratio of their numbers of input channels), and since M1=M3 or M1×2=M3 is generally used, the difference is at most about a factor of two. In other words, the 3×3 convolution operation circuit 30, which has the far smaller calculation amount, is more likely to suffer a memory bottleneck than the 1×1 convolution operation circuit 10.
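The following toy arithmetic (illustrative only) makes the comparison concrete: the calculation-amount ratio of equation (5) to equation (4) grows as N/3², while the input-size ratio stays at most about 2:

```python
# Illustrative comparison: equation (5) / equation (4) = N / 3**2, while the
# input-size difference between the two circuits stays within about a factor of 2.
for N in (64, 128, 256, 512, 1024):
    print(f"N={N}: calculation-amount ratio ~ {N / 3**2:.0f}x, input-size ratio <= ~2x")
```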
Therefore, as mentioned above, if the time required for the 3×3 convolution operation circuit 30 to read the computation result of the 1×1 convolution operation circuit 10 from the memory element is long, the overall computation speed of the neural network circuit will decrease.
Accordingly, as shown in the drawing, the SRAM 20 is provided between the 1×1 convolution operation circuit 10 and the 3×3 convolution operation circuit 30, and the 3×3 convolution operation circuit 30 reads the computation result of the 1×1 convolution operation circuit 10 from the SRAM 20 instead of from a DRAM.
The data read speed from an SRAM element (chip) is faster than the data read speed from a DRAM element. Therefore, the overall computation speed of the neural network circuit is improved by arranging the SRAM 20 as shown in the drawing.
Because the integration density of SRAM elements is lower than that of DRAM elements, the unit price per capacity of an SRAM element is higher than that of a DRAM element.
However, in the configuration shown in the drawing, only the computation result of the 1×1 convolution operation circuit 10 is stored in the SRAM 20; as described later, it is sufficient for the SRAM 20 to hold the computation results for a few rows, so a small-capacity SRAM can be used and the increase in cost is suppressed.
The calculation amount of the 3×3 convolution performed by the 3×3 convolution operation circuit 30 is less than the calculation amount of the 1×1 convolution performed by the 1×1 convolution operation circuit 10. Accordingly, as shown in the drawing, the number of calculators in the 3×3 convolution operation circuit 30 can be made smaller than the number of calculators in the 1×1 convolution operation circuit 10.
As mentioned above, the calculation amount of the 1×1 convolution operation circuit 10 is larger than the calculation amount of the 3×3 convolution operation circuit 30. For example, when N=1024, the calculation amount of the 1×1 convolution operation circuit 10 is 1024/9, i.e., about 114 times, the calculation amount of the 3×3 convolution operation circuit 30.
It is preferable that the ratio of the number of calculators in the 1×1 convolution operation circuit 10 to the number of calculators in the 3×3 convolution operation circuit 30 be set according to the calculation amounts. Each calculator performs a convolution operation. In the example of N=1024, the number of calculators in the 1×1 convolution operation circuit 10 may be set to about 100 to 130 times the number of calculators in the 3×3 convolution operation circuit 30, for example. Setting the number of calculators according to the calculation amount is effective, for example, when there is a restriction on the total number of calculators. One example of a case where the total number of calculators is limited is when the neural network circuit is constructed using an FPGA (Field Programmable Gate Array), as described below.
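As a sketch of this allocation rule, the following illustrative Python snippet divides an assumed fixed calculator budget in proportion to the calculation amounts; the budget value is hypothetical:

```python
# Hypothetical split of a fixed calculator budget (e.g., FPGA DSP slices)
# in proportion to the two circuits' calculation amounts; the budget of
# 2048 is an assumed value.
total_calculators = 2048
N, DK = 1024, 3
ratio = N / DK**2                              # ~114: 1x1 vs. 3x3 amount

n_3x3 = max(1, round(total_calculators / (1 + ratio)))
n_1x1 = total_calculators - n_3x3
print(n_1x1, n_3x3, n_1x1 / n_3x3)             # e.g., 2030, 18, ~113x
```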
In addition, the number of input channels M and the number of output channels N are often each set to a power of 2 (2^n, n: a natural number). Accordingly, when the number of calculators in each of the 1×1 convolution operation circuit 10 and the 3×3 convolution operation circuit 30 is also a power of 2, the affinity with various convolutional neural networks is high.
In the second exemplary embodiment, the 1×1 convolution operation circuit 10 and the 3×3 convolution operation circuit 30 in the neural network circuit are constructed on the FPGA 101. The functions of the 1×1 convolution operation circuit 10 and the 3×3 convolution operation circuit 30 are the same as those in the first exemplary embodiment.
In the third exemplary embodiment, in addition to the 1×1 convolution operation circuit 10 and the 3×3 convolution operation circuit 30 in the neural network circuit, the SRAM 20 is also constructed on the FPGA 102. The functions of the 1×1 convolution operation circuit 10, the SRAM 20, and the 3×3 convolution operation circuit 30 are the same as those in the first exemplary embodiment.
A weight coefficient storage unit 80 is explicitly shown in the drawing of this exemplary embodiment. The weight coefficient storage unit 80 stores the weight coefficients that are transferred to the weight memory 11 and the weight memory 31.
The functions of the 1×1 convolution operation circuit 10, the weight memory 11, the SRAM 20, the 3×3 convolution operation circuit 30, the weight memory 31, and the DRAM 40 shown in the drawing are the same as those in the first exemplary embodiment.
The weight memory 11 is provided in correspondence with the 1×1 convolution operation circuit 10, and the weight memory 31 is provided in correspondence with the 3×3 convolution operation circuit 30. As mentioned above, once the computation results of the convolution operation for three rows by the 1×1 convolution operation circuit 10 have been stored in the SRAM 20, the 3×3 convolution operation circuit 30 can start its convolution operation. Thereafter, the 1×1 convolution operation circuit 10 and the 3×3 convolution operation circuit 30 operate in parallel, which also improves the overall computation speed of the neural network circuit. Moreover, since the weight memory 11 and the weight memory 31 work independently, the overall computation speed of the neural network circuit is further improved by constructing the neural network circuit so that, for example, the weight coefficients for the 3×3 convolution operation are transferred from the weight coefficient storage unit 80 to the 3×3 convolution operation circuit 30 while the 1×1 convolution operation circuit 10 is performing the convolution operation for the first three rows.
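The row-level pipelining described above can be modeled conceptually in software as follows; this is an illustrative sketch of the scheduling idea, not the hardware implementation, and all names are hypothetical:

```python
# Conceptual software model of the row pipeline described above: the 1x1
# circuit fills a three-row SRAM line buffer, and the 3x3 circuit starts as
# soon as three rows are available, then runs in parallel with the 1x1 circuit.
from collections import deque

def pipeline(num_rows):
    sram = deque(maxlen=3)          # stands in for SRAM 20 (three rows only)
    schedule = []
    for r in range(num_rows):
        sram.append(f"pw_row{r}")   # 1x1 (pointwise) result for row r
        if len(sram) == 3:
            # 3x3 (depthwise) output centered on row r-1, computed from the
            # buffered rows while the 1x1 circuit produces the next row.
            schedule.append((f"dw_out_row{r - 1}", list(sram)))
    return schedule

for out, rows in pipeline(6):
    print(out, "<-", rows)
```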
As explained above, in each of the above exemplary embodiments, in the neural network circuit that divides the convolution operation into the convolution operation in the spatial direction and the convolution operation in the channel direction and performs them separately, the computation result of the 1×1 convolution operation circuit 10 is stored in the SRAM 20, and the 3×3 convolution operation circuit 30 obtains the computation result of the 1×1 convolution operation circuit 10 from the SRAM 20. As a result, the overall computation speed of the neural network circuit is improved while the increase in the price of the neural network circuit is suppressed.
In each of the above exemplary embodiments, MobileNets as described in the non-patent literature 1 is used as an example of the depthwise separable convolution, but the neural network circuit of each exemplary embodiment is applicable to depthwise separable convolutions other than MobileNets. For example, the processing of the part corresponding to the 3×3 convolution operation circuit 30 need not be the depthwise convolution but may be a grouped convolution, which is a generalization of the depthwise convolution. A grouped convolution divides the input channels of a convolution into G groups and performs the convolution in units of groups. In other words, when the number of input channels is M and the number of output channels is N, G 3×3 convolutions are performed in parallel, each with M/G input channels and N/G output channels, as sketched below. The depthwise convolution corresponds to the case of M=N=G in the grouped convolution.
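A minimal sketch of the grouped convolution described above, assuming NumPy; the weight layout is hypothetical, and the depthwise convolution is recovered as the special case M=N=G:

```python
import numpy as np

def grouped_conv3x3(x, w, G):
    # x: (H, W, M); w: (G, 3, 3, M//G, N//G), a hypothetical weight layout.
    # The M input channels are split into G groups, and each group is
    # convolved independently with its own 3 x 3 filters ("valid" padding).
    H, W, M = x.shape
    n_out = w.shape[4]
    out = np.zeros((H - 2, W - 2, G * n_out))
    for g in range(G):
        xg = x[:, :, g * (M // G):(g + 1) * (M // G)]
        for i in range(H - 2):
            for j in range(W - 2):
                patch = xg[i:i + 3, j:j + 3]   # (3, 3, M//G)
                out[i, j, g * n_out:(g + 1) * n_out] = np.tensordot(
                    patch, w[g], axes=([0, 1, 2], [0, 1, 2]))
    return out

# With M = N = G = 4 (one channel per group), this reduces to a depthwise convolution.
y = grouped_conv3x3(np.random.rand(6, 6, 4), np.random.rand(4, 3, 3, 1, 1), G=4)
print(y.shape)  # (4, 4, 4)
```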
Although the invention of the present application has been described above with reference to example embodiments, the present invention is not limited to the above example embodiments. Various changes can be made to the configuration and details of the present invention that can be understood by those skilled in the art within the scope of the present invention.
10 1×1 convolution operation circuit
11 Weight memory
20 SRAM
30 3×3 convolution operation circuit
31 Weight memory
40 DRAM
80 Weight coefficient storage unit
101, 102 FPGA
111 First weight memory
301 N×N convolution operation circuit
311 Second weight memory
201, 202, 203 Neural network circuit
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/JP2019/012581 | 3/25/2019 | WO | 00