The present invention relates to an operation circuit, an operation method, and a program.
In the case of performing inference using a trained convolutional neural network (CNN) or in the case of training a CNN, convolution processing is performed in the convolution layers, and this convolution processing amounts to repeated product-sum operations. In CNN inference, this product-sum operation (referred to as a “MAC operation” hereinafter) accounts for most of the total processing load. Even when a CNN inference engine is implemented as hardware, the operation efficiency and implementation efficiency of the MAC operation circuit greatly affect the hardware as a whole.
In the convolution layer, output feature map data oFmap is obtained by convolving Kernel, which is a set of weight coefficients, with input feature map data iFmap, which is the feature map output by the previous layer. The input feature map data iFmap and the output feature map data oFmap are each composed of a plurality of channels, the numbers of which are called iCH_num (number of input channels) and oCH_num (number of output channels). Since the convolution of the Kernel is performed between channels, the Kernel has iCH_num×oCH_num channels.
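To make the data layout concrete, the following is a minimal Python/NumPy sketch of this convolution; the 3×3 kernel size, the “same” zero padding, and all names other than iFmap, Kernel, oFmap, iCH_num, and oCH_num are illustrative assumptions rather than part of the embodiment.

```python
import numpy as np

# Minimal sketch of the convolution layer described above.
iCH_num, oCH_num, H, W, K = 3, 4, 8, 8, 3

iFmap = np.random.randn(iCH_num, H, W)            # one plane per input channel
Kernel = np.random.randn(oCH_num, iCH_num, K, K)  # iCH_num x oCH_num kernel channels
oFmap = np.zeros((oCH_num, H, W))                 # one plane per output channel

padded = np.pad(iFmap, ((0, 0), (K // 2, K // 2), (K // 2, K // 2)))
for m in range(oCH_num):        # output channel oCHm
    for n in range(iCH_num):    # input channel iCHn
        for y in range(H):
            for x in range(W):
                # accumulate the product-sum term for channel pair (n, m)
                oFmap[m, y, x] += np.sum(padded[n, y:y + K, x:x + K] * Kernel[m, n])
```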
In the case where such convolution layer processing is implemented as hardware, the throughput is improved by parallelization as follows: oCH_num parallel MAC operation units are prepared, kernel MAC processing for the same input channel number is performed in parallel, and this processing is repeated iCH_num times.
Obtaining an output channel with output channel number m (oCHm) by performing a convolution operation of kernel data having input channel number n and output channel number m on an input channel with input channel number n (iCHn) is represented as “iCHn*oCHm” hereinafter. In the first processing, in which a convolution operation on iCH0 is performed, the MAC operation unit 911 performs the convolution operation of iCH0*oCH0 and adds the operation result to the memory 921. The MAC operation unit 912 performs the convolution operation of iCH0*oCH1 and adds the operation result to the memory 922. The MAC operation unit 913 performs the convolution operation of iCH0*oCH2 and adds the operation result to the memory 923. The MAC operation unit 914 performs the convolution operation of iCH0*oCH3 and adds the operation result to the memory 924.
Subsequently, in the second processing, the input feature map data iFmap of iCH1 is supplied to the MAC operation units 911 to 914, and product-sum operation processing with the Kernel is performed by each MAC operation unit. Each operation result is accumulated, so the memories 921 to 924 hold the sums of the convolution results of iCH0 and iCH1. That is, in the second processing, in which a convolution operation on iCH1 is performed, the product-sum operation result iCH0*oCH0+iCH1*oCH0 is stored in the memory 921, the product-sum operation result iCH0*oCH1+iCH1*oCH1 is stored in the memory 922, the product-sum operation result iCH0*oCH2+iCH1*oCH2 is stored in the memory 923, and the product-sum operation result iCH0*oCH3+iCH1*oCH3 is stored in the memory 924.
In the fifth processing, the input feature map data iFmap of iCH4 is supplied to the MAC operation units 911 to 914, and product-sum operation processing with the Kernel is performed by each MAC operation unit. Each operation result is accumulated, so the memories 921 to 924 hold the sums of the convolution results from iCH0 to iCH4. Since this final operation result is the output feature map data oFmap, the data in the memory 920 is taken as the oFmap result of the present convolution layer. When the next layer is also a convolution layer, the same processing is performed using this output feature map data oFmap as the input feature map data iFmap of the next layer.
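The conventional schedule described above can be summarized in the following sketch, in which mac(n, m) merely stands in for the convolution iCHn*oCHm and the list memory models the memories 921 to 924; the function and variable names are assumptions for illustration.

```python
# Conventional schedule: oCH_num MAC units handle the same input channel in
# parallel, and this is repeated iCH_num times.
iCH_num, oCH_num = 5, 4

def mac(n, m):
    return f"iCH{n}*oCH{m}"   # placeholder for the actual convolution

memory = [[] for _ in range(oCH_num)]
for n in range(iCH_num):          # 1st processing: iCH0, ..., 5th processing: iCH4
    for m in range(oCH_num):      # performed in parallel by MAC units 911 to 914
        memory[m].append(mac(n, m))

print(" + ".join(memory[0]))      # iCH0*oCH0 + iCH1*oCH0 + ... + iCH4*oCH0
```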
Meanwhile, there are more than a few cases in which some of the input feature map data iFmap and some of the kernel data are 0. In such cases, the product-sum operation is unnecessary (because it is multiplication by 0). In particular, since each kernel channel is generally small in size compared with the Fmap, such as 3×3 or 1×1, there may be channels in which the kernel data is entirely zero (a zero matrix).
In the first processing, in which a convolution operation on iCH0 is performed, the kernel data of iCH0 & oCH1 and the kernel data of iCH0 & oCH2 are zero matrices, and thus only 0 would be added to the data stored in the memory 922 and the memory 923. Therefore, the MAC operation unit 912 and the MAC operation unit 913 need not perform arithmetic operations. However, since the calculations of the MAC operation unit 911 and the MAC operation unit 914 cannot be omitted, the MAC operation unit 912 and the MAC operation unit 913 have to wait for the completion of these arithmetic operations in the conventional hardware configuration.
When input data is sparse in this manner, the conventional technology has the problem that a sufficient arithmetic operation speed cannot be achieved.
In view of the above-mentioned circumstances, an object of the present invention is to provide a technology capable of efficiently increasing an arithmetic operation speed while curbing an increase in hardware scale when some weight coefficients are zero matrices in product-sum operation processing in a convolution layer of a neural network.
One aspect of the present invention is an operation circuit for performing a convolution operation of input feature map information supplied as a plurality of channels and coefficient information supplied as a plurality of channels, the operation circuit including a set including at least two channels of an output feature map based on output channels and at least three sub-operation circuits, wherein at least two sub-operation circuits are allocated for each set, the sub-operation circuits included in the set execute processing of a convolution operation of the coefficient information and the input feature map information included in the set, when a specific channel of the output feature map is a zero matrix, a sub-operation circuit that performs a convolution operation of the zero matrix executes processing of a convolution operation of the coefficient information and the input feature map information to be supplied next from a channel of the output feature map and a channel of the input feature map included in the set, and a result of the convolution operation is output for each channel of the output feature map.
One aspect of the present invention is an operation method for causing an operation circuit including a set including at least two channels of an output feature map based on output channels, and at least three sub-operation circuits to execute a convolution operation of input feature map information supplied as a plurality of channels and coefficient information, the operation method including: allocating at least two sub-operation circuits for each set; causing the sub-operation circuits included in the set to execute processing of a convolution operation of the coefficient information and the input feature map information included in the set; when a specific channel of the output feature map is a zero matrix, causing a sub-operation circuit that performs a convolution operation of the zero matrix to execute processing of a convolution operation of the coefficient information and the input feature map information to be supplied next from a channel of the output feature map and a channel of the input feature map included in the set; and outputting a result of the convolution operation for each channel of the output feature map.
One aspect of the present invention is a program causing a computer to realize the operation circuit according to one of the above-described aspects.
According to the present invention, it is possible to efficiently increase an arithmetic operation speed while curbing an increase in hardware scale when some weight coefficients are zero matrices in product-sum operation processing in a convolution layer of a neural network.
An embodiment of the present invention will be described in detail with reference to the drawings. A method of the present embodiment can be applied to, for example, a case in which inference is performed using a trained CNN or a case in which a CNN is trained.
The operation unit 10 includes a MAC operation unit macA (sub-operation circuit), a MAC operation unit macB (sub-operation circuit), a MAC operation unit macC (sub-operation circuit), and a MAC operation unit macD (sub-operation circuit).
The memory 20 includes a memory 21 for oCH0, a memory 22 for oCH1, a memory 23 for oCH2, and a memory 24 for oCH3.
The operation circuit 1 is an operation circuit in a convolution layer of a CNN. The operation circuit 1 divides the kernel data (coefficient information), which is the weight coefficients, into a plurality of sets each including several output channels. The sets are formed such that no channel belongs to two or more sets. Then, the operation circuit 1 allocates to each set as many MAC operation units as the number of channels in the set. The input feature map data iFmap and the weight coefficient data (kernel data) Kernel are supplied to the MAC operation units.
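The following sketch illustrates one possible way to form such sets, assuming two output channels per set and the MAC operation unit names macA to macD used in this description; the helper names and the contiguous grouping are illustrative assumptions.

```python
# Forming disjoint sets of k output channels and allocating k MAC units each.
oCH_num, k = 4, 2
mac_units = ["macA", "macB", "macC", "macD"]

sets = [list(range(s, s + k)) for s in range(0, oCH_num, k)]   # no channel shared
allocation = {tuple(s): mac_units[i * k:(i + 1) * k] for i, s in enumerate(sets)}
print(allocation)   # {(0, 1): ['macA', 'macB'], (2, 3): ['macC', 'macD']}
```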
The operation circuit 1 is configured using a processor such as a central processing unit (CPU) and a memory, or using an operation circuit and a memory. The operation circuit 1 serves as the MAC operation units, for example, by the processor executing a program. Note that all or some of the functions of the operation circuit 1 may be realized using hardware such as an application specific integrated circuit (ASIC), a programmable logic device (PLD), or a field programmable gate array (FPGA). The aforementioned program may be recorded in a computer-readable recording medium. The computer-readable recording medium is, for example, a portable medium such as a flexible disk, a magneto-optical disc, a ROM, or a CD-ROM, or a storage device such as a hard disk or a semiconductor storage device (e.g., a solid state drive (SSD)) provided in a computer system. The aforementioned program may be transmitted via a telecommunication line.
Next, a case of sparse kernel data will be described with reference to the drawings.
In conventional parallel processing, kernel data is used in the order of i, ii, iii, iv, and v, as shown in the drawings.
On the other hand, a plurality of oCHm are integrated as one set and a plurality of MAC operation units are allocated to one set in the present embodiment.
As described above, a set in the present embodiment is configured based on the channels of the input feature map and the channels of the output feature map.
Furthermore, in the present embodiment, the processing order is not fixed to iCH0, iCH1, . . . as in the conventional method; instead, product-sum operation processing is performed adaptively within the same set according to the sparsity of the kernel data, thereby achieving high-speed processing.
Next, an example of a processing order used in kernel data will be described.
The operation circuit 1 uses the kernel data of the first set 201 (set 0) in the order iCH0 & oCH0, iCH0 & oCH1, iCH1 & oCH0, iCH1 & oCH1, iCH2 & oCH0, iCH2 & oCH1, iCH3 & oCH0, iCH3 & oCH1, iCH4 & oCH0, and iCH4 & oCH1.
The operation circuit 1 uses the kernel data of the second set 202 (set 1) in the order iCH0 & oCH2, iCH0 & oCH3, iCH1 & oCH2, iCH1 & oCH3, iCH2 & oCH2, iCH2 & oCH3, iCH3 & oCH2, iCH3 & oCH3, iCH4 & oCH2, and iCH4 & oCH3.
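This order can be expressed as the following sketch, in which kernel channels are consumed input-channel-major while cycling through the output channels of the set; kernel_order is a hypothetical helper name.

```python
# Kernel channels in a set are consumed input-channel-major,
# cycling through the set's output channels.
def kernel_order(set_oCHs, iCH_num):
    return [(n, m) for n in range(iCH_num) for m in set_oCHs]

print(kernel_order([0, 1], 5))
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (3, 0), (3, 1), (4, 0), (4, 1)]
```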
Next, an example of first processing when sparsity has occurred in kernel data will be described with reference to the drawings.
When a channel of kernel data that is a zero matrix is present in a set of the kernel data, the operation circuit 1 uses the MAC operation unit to which that kernel data would otherwise be allocated to perform a convolution operation of the next kernel data in the set and the feature map.
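A minimal sketch of this skip rule follows, assuming kernel channels are stored as NumPy arrays keyed by (input channel, output channel); next_jobs and its arguments are hypothetical names introduced only for illustration.

```python
import numpy as np

def next_jobs(order, kernels, pos, num_units):
    """Pick the next num_units non-zero kernel channels, starting at pos.

    Zero-matrix channels are skipped, so a MAC unit that would have been
    given a zero matrix takes the next kernel data in the set instead.
    """
    jobs = []
    while pos < len(order) and len(jobs) < num_units:
        n, m = order[pos]
        if np.any(kernels[(n, m)]):   # only non-zero kernel channels are processed
            jobs.append((n, m))
        pos += 1
    return jobs, pos
```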
In the first set 201, no arithmetic operation is needed for the kernel data iCH0 & oCH1 because it is a zero matrix. Therefore, in the first processing, the operation circuit 1 performs an arithmetic operation on the kernel data iCH0 & oCH0, skips the kernel data iCH0 & oCH1, and performs an arithmetic operation on the kernel data iCH1 & oCH0, which is next in order in the first set 201.
Accordingly, as shown in the drawings, the MAC operation unit macA and the MAC operation unit macB perform the convolution operations of iCH0*oCH0 and iCH1*oCH0, and both results are added to the memory 21 for oCH0.
As a result, the arithmetic operation result of iCH0*oCH0+iCH1*oCH0 is stored in the memory 21 for oCH0. No arithmetic operation result is added to the memory 22 for oCH1, and the initial value 0 remains therein.
In the second set 202, no arithmetic operation is needed for the kernel data iCH0 & oCH2 because it is a zero matrix. Therefore, the operation circuit 1 skips the kernel data iCH0 & oCH2 in the second set 202 and performs an arithmetic operation on the kernel data iCH0 & oCH3, which is next in order (the kernel data corresponding to one channel is skipped), and a convolution operation on the kernel data iCH1 & oCH2, which is one further ahead.
Accordingly, as shown in the drawings, the MAC operation unit macC and the MAC operation unit macD perform the convolution operations of iCH0*oCH3 and iCH1*oCH2, and each result is added to the memory for the corresponding output channel.
As a result, the arithmetic operation result of iCH1*oCH2 is stored in the memory 23 for oCH2, and the arithmetic operation result of iCH0*oCH3 is stored in the memory 24 for oCH3.
Next, an example of second processing when sparsity has occurred in kernel data will be described with reference to the drawings.
In the second processing, the kernel data iCH1 & oCH1 in the first set 201 is a zero matrix. Therefore, the operation circuit 1 skips the kernel data iCH1 & oCH1, performs an arithmetic operation on the kernel data iCH2 & oCH0, which is next in order in the first set 201, and also performs an arithmetic operation on the kernel data iCH2 & oCH1.
Accordingly, as shown in the drawings, the MAC operation unit macA and the MAC operation unit macB perform the convolution operations of iCH2*oCH0 and iCH2*oCH1, and the results are added to the memory 21 for oCH0 and the memory 22 for oCH1, respectively.
As a result, the arithmetic operation result of iCH0*oCH0+iCH1*oCH0+iCH2*oCH0 is stored in the memory 21 for oCH0. The arithmetic operation result of iCH2*oCH1 is stored in the memory 22 for oCH1.
In the second processing for the second set 202, the kernel data iCH2 & oCH2 is a zero matrix. Therefore, the operation circuit 1 performs an arithmetic operation on the kernel data iCH1 & oCH3, skips the kernel data iCH2 & oCH2, and performs an arithmetic operation on the kernel data iCH2 & oCH3, which is next in order. The MAC operation unit macC adds the convolution result of iCH1*oCH3 to the memory 24 for oCH3, and the MAC operation unit macD likewise adds the convolution result of iCH2*oCH3 to the memory 24 for oCH3.
As a result, nothing is newly added to the arithmetic operation result stored in the memory 23 for oCH2, and the arithmetic operation result of iCH1*oCH2 remains stored. The arithmetic operation result of iCH0*oCH3+iCH1*oCH3+iCH2*oCH3 is stored in the memory 24 for oCH3.
Next, an example of third processing when sparsity has occurred in kernel data will be described with reference to the drawings.
In the third processing, the kernel data iCH3 & oCH1 in the first set 201 is a zero matrix. Therefore, the operation circuit 1 performs an arithmetic operation on the kernel data iCH3 & oCH0, skips the kernel data iCH3 & oCH1 in the first set 201, and performs an arithmetic operation on the kernel data iCH4 & oCH0, which is next in order.
Accordingly, as shown in the drawings, the MAC operation unit macA and the MAC operation unit macB perform the convolution operations of iCH3*oCH0 and iCH4*oCH0, and both results are added to the memory 21 for oCH0.
As a result, the arithmetic operation result of iCH0*oCH0+iCH1*oCH0+iCH2*oCH0+iCH3*oCH0+iCH4*oCH0 is stored in the memory 21 for oCH0. Nothing is newly added to the arithmetic operation result stored in the memory 22 for oCH1, and the result of iCH2*oCH1 remains stored. Since the kernel data iCH4 & oCH1 in the first set 201 is also a zero matrix, it is likewise skipped, and the processing of the first set 201 is completed after being performed three times.
As shown in the drawings, in the second set 202, the kernel data iCH3 & oCH2 and the kernel data iCH3 & oCH3 are zero matrices and are therefore skipped; the MAC operation unit macC and the MAC operation unit macD perform the convolution operations of iCH4*oCH2 and iCH4*oCH3.
As a result, the arithmetic operation result of iCH1*oCH2+iCH4*oCH2 is stored in the memory 23 for oCH2. The arithmetic operation result of iCH0*oCH3+iCH1*oCH3+iCH2*oCH3+iCH4*oCH3 is stored in the memory 24 for oCH3. Processing of the second set 202 is completed after being performed three times.
In this manner, convolution operation results from iCH0 to iCH4 in each oCH are stored in each memory in the present embodiment. Since the arithmetic operation result stored in the memory becomes the final arithmetic operation result, that is, the output feature map data oFmap, the operation circuit 1 uses the data of the memory as a convolution layer result.
In contrast, processing needs to be performed five times in the conventional method. According to the present embodiment, processing is performed only three times, so the processing time is reduced by 40% in this example and the operation speed can be considerably increased.
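The pass counts can be checked with the following sketch, which assumes ideal packing (no supply restrictions) and uses the zero-matrix pattern of the worked example above; passes and zeros are illustrative names.

```python
import math

# Two sets of k=2 output channels, 5 input channels, k kernels per pass.
iCH_num, k = 5, 2
zeros = {(0, 1), (1, 1), (3, 1), (4, 1),    # zero matrices toward oCH1
         (0, 2), (2, 2), (3, 2), (3, 3)}    # zero matrices in the second set

def passes(set_oCHs):
    nonzero = sum((n, m) not in zeros for n in range(iCH_num) for m in set_oCHs)
    return math.ceil(nonzero / k)           # k MAC units consume k kernels per pass

print(passes([0, 1]), passes([2, 3]))       # 3 3, versus iCH_num = 5 conventionally
```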
In the present embodiment, input feature map data iFmap of a plurality of input channels must be supplied to the MAC operation units, so the bus width of the input data becomes larger than the conventional one; if the bus width is n times the conventional one, input feature map data iFmap extending over n channels can be supplied. Further, by making n sufficiently large, it is possible to avoid situations in which skipping cannot be performed due to insufficient iFmap supply capability. However, if the bus width is made excessively large, the resulting increase in circuit scale and the like becomes a problem, and thus restrictions may be added, for example, on the bus width.
Next, assignment of MAC operation units to kernel data sets will be described.
The number of output channels included in one set is denoted k below. For example, as shown in the drawings, the two circuits of the MAC operation unit macA and the MAC operation unit macB may be allocated with oCH0 and oCH1 as one set; in this case, k=2.
When k is small, for example, when k=1, which is the minimum, each set includes only one output channel and one MAC operation unit, as shown in the drawings, and skipping is performed only within the same output channel.
Therefore, the MAC operation unit macA performs a convolution operation of iCH0*oCH0 and stores the operation result in the memory 21 for oCH0 by adding the same thereto, and the MAC operation unit macB performs a convolution operation of 0+iCH2*oCH1 and stores the operation result in the memory 22 for oCH1 by adding the same thereto. The MAC operation unit macC performs a convolution operation of 0+iCH1*oCH2 and stores the operation result in the memory 23 for oCH2 by adding the same thereto, and the MAC operation unit macD performs a convolution operation of iCH0*oCH3 and stores the operation result in the memory 24 for oCH3 by adding the same thereto.
In the case of k=1, for example, four of the five kernel data channels for oCH1 are sparse, but oCH0 is not sparse at all. Therefore, the MAC operation unit macB in charge of oCH1 completes its arithmetic operations in a single processing because skip processing is performed four times, whereas the MAC operation unit macA in charge of oCH0 cannot perform any skip processing and thus needs to perform processing five times. In this manner, in the case of k=1, where each unit can only advance to the next input channel of its own output channel, the MAC operation units in charge of specific output channels often run far ahead. Therefore, in the case of k=1 in this example, the processing of the entire layer is limited by the least sparse output channel and requires five rounds of processing.
Kernel data tends to have a large deviation in sparsity between output channels. That is, there are relatively many situations in which the kernel data of a certain output channel is mostly sparse whereas the kernel data of another output channel is hardly sparse at all.
Accordingly, if k is excessively small, such as k=1, it is necessary to wait until the arithmetic operations of less sparse sets are completed, and a sufficient speed increase may not be expected. Therefore, it is desirable that k be 2 or more.
When k is large, for example, when k=oCH_num, which is the maximum, that is, k=4 in this example, there is only one set 17, as shown in the drawings.
In the case of k=oCH_num, whenever kernel data is sparse, the MAC operation can be advanced in any output channel. In this case, the kernel data can be packed as densely as possible into the MAC operation units, and thus the speed can be maximized.
On the other hand, since each MAC operation unit may perform arithmetic operations for any oCH, the correspondence between the MAC operation units and the memories requires fully coupled wiring. In this example, each of the four MAC operation units must be wired to all four memories.
With this wiring, each of the memory 21 for oCH0 to the memory 24 for oCH3 requires a selector circuit that selects, each time, which of the arithmetic operation results of the oCH_num MAC operation units should be received. In recent CNN convolution layers, oCH_num is in the tens to hundreds, and thus the fully coupled wiring across oCH_num channels and the implementation of such selectors pose hardware problems in terms of circuit area and power consumption. Therefore, it is desirable that the value of k not be excessively large.
Therefore, in the present embodiment, the value of k is set to, for example, 2 or more and less than the maximum value (oCH_num).
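The trade-off can be made concrete with a rough cost sketch; the cost model below (k×k connections per set and a k-input selector per memory) is an illustrative assumption based on the fully coupled wiring described above, not a formula from the embodiment.

```python
# Rough cost model: within a set of k channels, any of the k MAC units may
# write to any of the k memories, giving k*k connections per set and a
# k-input selector per memory.
def wiring_cost(oCH_num, k):
    num_sets = oCH_num // k        # assumes k divides oCH_num
    return num_sets * k * k        # total MAC-to-memory connections

for k in (1, 2, 4, 64):
    print(f"k={k:2d}: {wiring_cost(64, k):5d} connections, {k}-input selectors")
# k=1 cannot skip across channels; k=oCH_num maximizes skipping but the
# wiring and selectors grow with k, hence 2 <= k < oCH_num.
```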
Next, an example of a processing procedure will be described.
The operation circuit 1 determines the output channels of each set in advance and allocates MAC operation units accordingly. The operation circuit 1 allocates at least two MAC operation units (sub-operation circuits) to each set (step S1).
The operation circuit 1 initializes the value of each memory to 0 (step S2).
The operation circuit 1 selects data to be used for an arithmetic operation from kernel data (step S3).
The operation circuit 1 determines whether or not the selected kernel data is a zero matrix (S4). When the operation circuit 1 determines that the selected kernel data is a zero matrix (step S4; YES), processing proceeds to step S5. When the operation circuit 1 determines that the selected kernel data is not a zero matrix (step S4; NO), processing proceeds to step S6.
The operation circuit 1 skips the selected kernel data and re-selects the kernel data next in order. If the re-selected kernel data is also a zero matrix, the operation circuit 1 skips it again and re-selects the kernel data next in order (step S5).
The operation circuit 1 determines a memory for storing results of arithmetic operations performed by the MAC operation units on the basis of presence or absence of skipping and the number of times of skipping (step S6).
Each MAC operation unit performs a convolution operation using the kernel data (step S7).
Each MAC operation unit adds arithmetic operation results and stores the same in the memory (step S8).
The operation circuit 1 determines whether or not the arithmetic operations of all pieces of kernel data have ended (step S9). When the operation circuit 1 determines that the arithmetic operations of all pieces of kernel data have ended (step S9; YES), processing ends. When the operation circuit 1 determines that the arithmetic operations of all pieces of kernel data have not ended (step S9; NO), processing returns to step S3.
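Putting steps S1 to S9 together, the following is a minimal sketch of the procedure, assuming k MAC units per set, NumPy kernel channels keyed by (input channel, output channel), and a hypothetical helper conv2d_same; it illustrates the flow under these assumptions rather than the actual circuit.

```python
import numpy as np

def conv2d_same(plane, kern):
    """Hypothetical helper: 2-D convolution with 'same' zero padding."""
    K = kern.shape[0]
    pad = np.pad(plane, K // 2)
    H, W = plane.shape
    return np.array([[np.sum(pad[y:y + K, x:x + K] * kern) for x in range(W)]
                     for y in range(H)])

def run_layer(kernels, iFmap, sets):
    """Steps S1-S9 for kernel channels keyed by (input ch, output ch)."""
    iCH_num = iFmap.shape[0]
    memory = {m: np.zeros(iFmap.shape[1:]) for s in sets for m in s}   # step S2
    for set_oCHs in sets:                       # step S1: k MAC units per set
        k = len(set_oCHs)
        order = [(n, m) for n in range(iCH_num) for m in set_oCHs]     # step S3
        queue = [nm for nm in order if np.any(kernels[nm])]            # steps S4-S5
        for t in range(0, len(queue), k):       # one pass of the set's MAC units
            for n, m in queue[t:t + k]:         # step S6: destination follows skips
                memory[m] += conv2d_same(iFmap[n], kernels[(n, m)])    # steps S7-S8
    return memory                               # step S9: all kernel data consumed
```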
Note that the processing procedure described above is an example, and the present invention is not limited thereto.
Although the above-described embodiment has described an example of MAC arithmetic operation processing in the convolutional layer of the CNN, the method of the present embodiment can be applied to other networks.
As described above, in the present embodiment, a plurality of oCH (output channels of the weight coefficients) are grouped into one set, and a plurality of MAC operation units are allocated to each set.
Therefore, according to the present embodiment, the waiting in the circuit that may occur when the convolution processing of a neural network typified by a CNN is implemented in hardware can be eliminated, and thus the arithmetic operation speed can be increased.
As described above, in the assignment of MAC operation units to kernel data sets, that is, channel allocation, the arithmetic operation speed cannot be efficiently increased if k is excessively small, and the increase in the circuit area cannot be ignored if k is excessively large. Since the value of k relates to the hardware configuration, such as the wiring between the operation units and the memories, it is determined at the time of hardware design and cannot be changed at the time of inference processing. On the other hand, which output channels are allocated to each set is not related to the hardware configuration and can be changed arbitrarily at the time of inference processing.
For this reason, the operation circuit 1 may optimize the allocation of the MAC operation units by determining the set of output channels of each set in advance on the basis of the values of the kernel data obtained at the time of inference, so that the inference processing speed is maximized for the k determined at the time of hardware design.
The operation circuit 1 checks each value of kernel data obtained at the time of inference (step S101).
The operation circuit 1 determines the number of sets of kernel data and allocates the MAC operation units to the kernel data. The operation circuit 1 may determine the set of output channels included in each set on the basis of, for example, the number and distribution of zero matrices included in the kernel data, and allocate the MAC operation units to the kernel data sets. Alternatively, the operation circuit 1 may determine the set of output channels included in each set such that, when processing proceeds while skipping the kernel data corresponding to zero matrices, the deviation in the number of arithmetic operations of the MAC operation units among the sets is reduced, and allocate the MAC operation units to the kernel data before the actual convolution operation is performed (S102).
The operation circuit 1 determines the set of output channels included in each set and determines whether or not the allocation of the MAC operation units to the kernel data sets is optimized. For example, the operation circuit 1 determines that the allocation is optimized if the difference in the number of arithmetic operations among the MAC operation units is within a predetermined value (S103). When the operation circuit 1 determines that the allocation is optimized (step S103; YES), processing ends. When the operation circuit 1 determines that the allocation is not optimized (step S103; NO), processing returns to step S102.
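One possible realization of this balancing is the following greedy sketch: output channels are placed, least sparse first, into the open set with the lightest non-zero workload. The heuristic and the names balance_sets and zero_count are illustrative assumptions; the embodiment only requires that the deviation in operation counts fall within a predetermined value.

```python
# Greedy balancing of output channels across sets of k channels each.
def balance_sets(zero_count, oCH_num, iCH_num, k):
    num_sets = oCH_num // k
    sets, loads = [[] for _ in range(num_sets)], [0] * num_sets
    for m in sorted(range(oCH_num), key=lambda m: zero_count[m]):
        i = min((i for i in range(num_sets) if len(sets[i]) < k),
                key=lambda i: loads[i])        # lightest set with room left
        sets[i].append(m)
        loads[i] += iCH_num - zero_count[m]    # non-zero kernels = actual work
    return sets

# zero-matrix counts per output channel from the worked example
print(balance_sets({0: 0, 1: 4, 2: 3, 3: 1}, oCH_num=4, iCH_num=5, k=2))
# [[0, 1], [3, 2]] -> each set has 6 non-zero kernels, i.e., 3 passes each
```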
After the optimization procedure described above is completed, the operation circuit 1 performs the convolution operation processing described in the above embodiment.
As described above, in the modified example, allocation of the MAC operation units to kernel data, that is, channels to be assigned to a set, is optimized.
Therefore, according to the modified example, the arithmetic operation speed can be further increased.
Although the embodiments of the present invention have been described in detail with reference to the drawings, specific configurations are not limited to these embodiments, and designs and the like within a range not departing from the gist of the present invention are also included.
The present invention is applicable to various inference processing devices.
Filing Document: PCT/JP2020/045854; Filing Date: 12/9/2020; Country: WO