This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2023-206216, filed Dec. 6, 2023, the entire contents of which are incorporated herein by reference.
This disclosure relates to a grouped convolution processing optimization device, a grouped convolution processing optimization method, and a grouped convolution processing optimization program.
A convolutional neural network (CNN) is a feedforward neural network that has a structure where two types of layers, convolutional layers and pooling layers, are alternately stacked. Hereinafter, the convolutional neural network will be referred to simply as CNN.
In addition, C1 and C2 shown in
Note that an image is merely an example of the data being input. The data input to the CNN may be data other than images.
Furthermore, P1 and P2 shown in
In addition, F shown in
Hereinafter, convolutional computation in a CNN will be described in detail.
The input image shown in
For the sake of simplicity, consider as the input X an image whose height and width are 1 and whose channel count is Cin, shown as the lattice pattern in
That is, in the example of convolutional computation shown in
In the convolutional computation shown in
Note that the convolutional computation shown in
In this specification, the input X and output Y0 are consistently expressed as row vectors (vectors with components arranged horizontally), and the weights are expressed as matrices. Furthermore, the convolutional computation is expressed as the product of the input row vector, placed on the left, and the weight matrix, placed on the right.
The terms “row” and “column” used in this specification are based on the above premise. Therefore, when another mathematically equivalent representation is substituted, the terms “row” and “column” shall be interpreted accordingly.
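As a minimal illustration of this convention (the following sketch and all of its names and values are illustrative, not part of the disclosure), a 1x1 convolution on a 1x1-pixel input reduces to the product of a row vector and a weight matrix:

```python
import numpy as np

# Minimal sketch of the convention above: for a 1x1-pixel input, a 1x1
# convolution reduces to a row vector (input X) times a matrix (the weights).
Cin, Cout = 4, 6
x = np.random.rand(1, Cin)      # input X as a row vector (1 x Cin)
w = np.random.rand(Cin, Cout)   # weight matrix with Cin rows and Cout columns
y0 = x @ w                      # output Y0 as a row vector (1 x Cout)
assert y0.shape == (1, Cout)
```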
The CNN shown in
There is an increasing number of CNNs that use grouped convolution for convolutional computation. In grouped convolution, Cin and Cout are divided into G groups (G is an integer of 2 or more) for computation processing. For example, Non-Patent Literature 1 describes grouped convolution.
The amount of computation in grouped convolution is smaller than that in regular convolution. Furthermore, it has been experimentally shown that ResNet (Residual Neural Network), a type of CNN, achieves higher recognition accuracy when it uses grouped convolution than when it uses regular convolution.
Hereinafter, the computation of grouped convolution in CNNs will be described in detail.
The example of grouped convolution computation shown in
In the example of grouped convolution computation shown in
Note that input X, which is an image, is merely an example of input data. The data input to the grouped convolution layer may also be non-image data.
As shown in
The AI chip multiplies weight Wa by the image composed of channels from the first channel to the (Cin/8)-th channel (which has a channel count of (Cin/8)). As a result of this multiplication, the AI chip obtains an output image with a channel count of (Cout/8).
As shown in
Finally, the AI chip places the obtained images in the same positions as the corresponding divided input X parts. After placing the 8 obtained images, the AI chip concatenates them (indicated as “concat” in
By concatenating the images, the AI chip obtains output Y, which is an image with a channel count of Cout, equivalent to the result of regular convolution computation. However, the output Y obtained from the grouped convolution computation shown in
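The split, per-group convolution, and concatenation steps described above can be sketched as follows under the 1x1-pixel simplification used earlier; the list group_weights plays the role of the weights Wa and so on, and all names and values are illustrative:

```python
import numpy as np

def grouped_conv_1x1(x, group_weights):
    """Grouped 1x1 convolution on a 1x1-pixel input: divide the input along
    the channel direction, convolve each part with its own weight, concat."""
    parts = np.split(x, len(group_weights), axis=1)       # divide into G groups
    outs = [p @ w for p, w in zip(parts, group_weights)]  # per-group convolution
    return np.concatenate(outs, axis=1)                   # the "concat" step

Cin, Cout, G = 16, 16, 8
x = np.random.rand(1, Cin)
group_weights = [np.random.rand(Cin // G, Cout // G) for _ in range(G)]
y = grouped_conv_1x1(x, group_weights)
assert y.shape == (1, Cout)
```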
The computational cost of convolution is proportional to the size of the weights. For example, the computational cost of the convolution shown in
In general, when input X is divided into G groups for grouped convolution computation, the computational cost of the grouped convolution is proportional to {(Cin/G)×(Cout/G)×G}=(Cin×Cout)/G. That is, the computational cost of the grouped convolution shown in
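As a concrete check using the figures from the alignment example below, with Cin=Cout=160 and G=8, regular convolution has a cost proportional to 160×160=25,600, while the grouped convolution has a cost proportional to (160/8)×(160/8)×8=3,200, one eighth of the regular cost.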
As shown in
For AI chips that are not optimized for grouped convolution, it is possible that the grouped convolution computation will be implemented as multiple separate convolution computations.
Therefore, on AI chips where the convolution computation is implemented in this way, executing a grouped convolution invokes the convolution computation G times, and each invocation incurs overhead. This G-fold overhead reduces the computation speed of the grouped convolution.
In the grouped convolution computation shown in
The following possibilities may also exist as reasons why the AI chip may not be suitable for grouped convolution. Some AI chips designed for CNNs have strict constraints imposed on the number of rows and columns of weights corresponding to the number of channels, as well as on the memory layout, that is, memory addressing.
The phrase “imposing constraints” means that the full computation speed can be achieved as long as the constraints are followed, but that performance deteriorates significantly when the constraints are not met. These memory addressing constraints are imposed to reduce the complexity of the AI chip's memory access circuitry.
For example, a requirement that arrays be placed at addresses that are multiples of certain constants may be imposed as a memory addressing constraint. With a 32-bit memory address, when the head address of an array is placed at a multiple of 16, the lower 4 bits of the address are always zero, so the array can be accessed without circuitry to handle the lower 4 bits.
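As a minimal sketch of this kind of constraint (the function name and the addresses below are illustrative), a 16-byte alignment check reduces to testing that the lower 4 bits of an address are zero:

```python
# An address that is a multiple of 16 has its lower 4 bits equal to zero,
# so those bits need no handling circuitry.
def is_16_aligned(addr: int) -> bool:
    return (addr & 0xF) == 0      # lower 4 bits must be zero

assert is_16_aligned(0x1000)      # multiple of 16: constraint satisfied
assert not is_16_aligned(0x1004)  # not a multiple of 16: constraint violated
```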
When a constraint is imposed on the starting address of weights used in grouped convolution, the constraint effectively applies to the starting address of Cin*Cout or Cin*Cout*(kernel size) for each group, since convolution operations are executed for each group.
As shown in
However, in grouped convolution, as shown in
However, depending on the value of ((Cin/8)*(Cout/8)) for each group, not all starting addresses may meet the constraint.
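The following sketch illustrates this dependence, assuming the per-group weight blocks are laid out contiguously; the two block sizes are illustrative and chosen only to show that some sizes keep every group start aligned while others do not:

```python
# The i-th per-group block starts at base + i * block_size, so whether every
# start meets the alignment constraint depends on block_size.
def starts_aligned(block_size, groups, alignment=16, base=0):
    return [(base + i * block_size) % alignment == 0 for i in range(groups)]

print(starts_aligned(block_size=400, groups=8))  # 400 % 16 == 0: all starts aligned
print(starts_aligned(block_size=200, groups=8))  # 200 % 16 == 8: every other start misaligned
```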
In some cases, the AI chip may fail to operate correctly when the constraints are not met. In such cases, additional processes may be required to adjust the placement of weights in memory to meet the constraints. When additional processes to adjust memory placement are added, the overall processing by the AI chip may slow down.
For example, in the case of grouped convolution shown in
Therefore, to meet the constraint, it may be necessary to perform memory address adjustments, such as memory copying the input X and weight Wa before each convolution execution, to align the starting memory addresses. The overhead incurred by memory copying is likely to cause a slowdown in the grouped convolution speed.
Additionally, AI chips designed specifically for convolution may lack circuits dedicated to memory copying. In such cases, memory copying may be performed by a host CPU (Central Processing Unit) or an auxiliary CPU, which can handle flexible processing but may operate at a slower speed. When memory copying is handled by the host CPU or an auxiliary CPU, the grouped convolution speed may further decrease.
Zero-padding may be considered as a method to meet the constraints. However, zero-padding must be inserted between the divided units of input X rather than simply at the end of input X. Therefore, even when zero-padding is applied in grouped convolution, the overhead incurred by memory copying remains significant.
As shown in the upper part of
Among the multiples of 16, the smallest multiple greater than or equal to 20 is 32. Therefore, it is considered that zero-padding should be applied such that the starting address of each divided unit of input X becomes a multiple of 32 (hereinafter referred to as “32-aligned state”).
As shown in the lower part of
However, when zero-padding is applied before each grouped convolution execution, multiple memory copy operations will also be added to the design. The added memory copy operations involve, for example, securing a new memory area for Cin′=256 and copying each divided unit of input X to an address that is a multiple of 32. Since this process is relatively computationally intensive, the memory copying is likely to become a cause of grouped convolution slowdown.
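A minimal sketch of this padding, assuming the Cin=160, G=8 example above (array names are illustrative): securing the new area and copying each divided unit correspond to the memory copy overhead just described.

```python
import numpy as np

# Each of the G=8 divided units of input X (20 channels each) is padded to
# 32 channels, giving Cin' = 256 in the 32-aligned state.
Cin, G, M = 160, 8, 32                 # M = 32: aligned per-group channel count
s = Cin // G                           # 20 channels per divided unit
x = np.random.rand(1, Cin)             # input X as a row vector
padded = np.zeros((1, M * G))          # newly secured area for Cin' = 256
for i in range(G):                     # per-group memory copies
    padded[0, i * M : i * M + s] = x[0, i * s : (i + 1) * s]
```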
The technology to address the problem of reduced execution speed in grouped convolution due to hardware circuits with memory addressing constraints is not described in Non-Patent Literature 1.
Therefore, the purpose of the present disclosure is to provide a grouped convolution processing optimization device, a grouped convolution processing optimization method, and a grouped convolution processing optimization program that enable grouped convolution to be executed at high speed by hardware circuits with memory addressing constraints.
A grouped convolution processing optimization device according to the present disclosure includes a first insertion unit that executes at least one of a column insertion process that inserts (M−N/G) columns of zeros at the right end of an i-th weight matrix for each i from 1 to G or a row insertion process that inserts (M−N/G) rows of zeros at the bottom of the i-th weight matrix for each i from 1 to G, when a constraint that requires convolution computation to be executed on data composed of M channels (where M is an integer greater than N/G) for a grouped convolution is imposed on a trained convolutional neural network, in which an input convolution that executes convolutional computation using an input weight matrix of N rows and N columns on data composed of channels arranged sequentially from the first channel to the N-th channel (where N is an integer of 2 or more), the grouped convolution that divides the result of the input convolution into G parts (where G is an integer of 2 or more) along the channel direction, and executes convolutional computation for each i from 1 to G using the i-th weight matrix of N/G rows and N/G columns on the divided data composed of channels from {(i−1)×N/G+1} to (i×N/G), and an output convolution that executes convolutional computation using an output weight matrix of N rows and N columns on the result of the grouped convolution are each defined, and a second insertion unit that executes at least one of an input weight matrix insertion process that inserts (M−N/G) columns of zeros to the right of the i×N/G-th column of the input weight matrix for each i from 1 to G or an output weight matrix insertion process that inserts (M−N/G) rows of zeros below the i×N/G-th row of the output weight matrix for each i from 1 to G, wherein the second insertion unit executes the input weight matrix insertion process when the row insertion process is executed, and executes the output weight matrix insertion process when the column insertion process is executed.
A grouped convolution processing optimization method according to the present disclosure includes executing at least one of a column insertion process that inserts (M−N/G) columns of zeros at the right end of an i-th weight matrix for each i from 1 to G or a row insertion process that inserts (M−N/G) rows of zeros at the bottom of the i-th weight matrix for each i from 1 to G, when a constraint that requires convolution computation to be executed on data composed of M channels (where M is an integer greater than N/G) for a grouped convolution is imposed on a trained convolutional neural network, in which an input convolution that executes convolutional computation using an input weight matrix of N rows and N columns on data composed of channels arranged sequentially from the first channel to the N-th channel (where N is an integer of 2 or more), the grouped convolution that divides the result of the input convolution into G parts (where G is an integer of 2 or more) along the channel direction, and executes convolutional computation for each i from 1 to G using the i-th weight matrix of N/G rows and N/G columns on the divided data composed of channels from {(i−1)×N/G+1} to (i×N/G), and an output convolution that executes convolutional computation using an output weight matrix of N rows and N columns on the result of the grouped convolution are each defined, and executing at least one of an input weight matrix insertion process that inserts (M−N/G) columns of zeros to the right of the i×N/G-th column of the input weight matrix for each i from 1 to G or an output weight matrix insertion process that inserts (M−N/G) rows of zeros below the i×N/G-th row of the output weight matrix for each i from 1 to G, wherein the input weight matrix insertion process is executed when the row insertion process is executed, and the output weight matrix insertion process is executed when the column insertion process is executed.
A grouped convolution processing optimization program according to the present disclosure causes a computer to execute a first insertion process that executes at least one of a column insertion process that inserts (M−N/G) columns of zeros at the right end of an i-th weight matrix for each i from 1 to G or a row insertion process that inserts (M−N/G) rows of zeros at the bottom of the i-th weight matrix for each i from 1 to G, when a constraint that requires convolution computation to be executed on data composed of M channels (where M is an integer greater than N/G) for a grouped convolution is imposed on a trained convolutional neural network, in which an input convolution that executes convolutional computation using an input weight matrix of N rows and N columns on data composed of channels arranged sequentially from the first channel to the N-th channel (where N is an integer of 2 or more), the grouped convolution that divides the result of the input convolution into G parts (where G is an integer of 2 or more) along the channel direction, and executes convolutional computation for each i from 1 to G using the i-th weight matrix of N/G rows and N/G columns on the divided data composed of channels from {(i−1)×N/G+1} to (i×N/G), and an output convolution that executes convolutional computation using an output weight matrix of N rows and N columns on the result of the grouped convolution are each defined, and a second insertion process that executes at least one of an input weight matrix insertion process that inserts (M−N/G) columns of zeros to the right of the i×N/G-th column of the input weight matrix for each i from 1 to G or an output weight matrix insertion process that inserts (M−N/G) rows of zeros below the i×N/G-th row of the output weight matrix for each i from 1 to G, wherein the second insertion process causes the input weight matrix insertion process to be executed when the row insertion process is executed, and causes the output weight matrix insertion process to be executed when the column insertion process is executed.
According to the present disclosure, hardware circuits with memory addressing constraints can execute grouped convolution at high speed.
Hereinafter, example embodiments of the present disclosure will be described with reference to the drawings. In the present disclosure, the drawings are associated with one or more example embodiments.
The pre-change CNN model storage unit 200 stores a trained CNN model described above, which includes weights Wa to Wh as shown in
The post-change CNN model storage unit 300 stores the trained CNN model, optimized by the grouped convolution processing optimization device 100, which is stored in the pre-change CNN model storage unit 200.
Additionally, an AI chip 400 is communicatively connected to the post-change CNN model storage unit 300. The AI chip 400 is a chip that executes convolution computations using the trained CNN model stored in the post-change CNN model storage unit 300.
As shown in
The grouped convolution processing optimization device 100 in this example embodiment assumes that grouped convolution is not used independently, but that other convolution processes exist before and after the grouped convolution.
As shown in
The grouped convolution processing optimization device 100 in this example embodiment is characterized by realizing the 32-aligned state shown in the lower part of
Additionally, the grouped convolution processing optimization device 100 converts weights used in grouped convolution based on the assumption that the input to the grouped convolution is in the 32-aligned state, as shown in the lower part of
It should be noted that even when the grouped convolution processing optimization device 100 adjusts the weights, the result of the grouped convolution computation does not change. That is, the grouped convolution processing optimization device 100 achieves faster grouped convolution by adjusting the weights without adding new processes.
The optimization candidate structure detection unit 110 has a function of detecting optimizable grouped convolutions from grouped convolutions present in the structure of the trained CNN model stored in the pre-change CNN model storage unit 200.
The optimization candidate structure detection unit 110 in this example embodiment is characterized by considering grouped convolution optimization as a set with the convolutions that exist before and after the grouped convolution.
The optimal channel number determination unit 120 has a function of determining the number of new input channels and the number of new output channels for a detected grouped convolution in an optimized CNN model.
The pre-convolution weight modification unit 130 has a function of modifying the weights used in a convolution that exists before the detected grouped convolution.
The grouped convolution weight modification unit 140 has a function of modifying each weight used in the detected grouped convolution.
The post-convolution weight modification unit 150 has a function of modifying the weights used in a convolution that exists after the detected grouped convolution.
First, an example will be explained in which grouped convolution and convolutions before and after the grouped convolution are executed as a set.
As shown in
Next, as shown in
For convenience, the example in
Next, as shown in
The memory address indicated by the straight lines in the grid-patterned data with C=160 shown in
Specifically, the memory address indicated by the straight lines in the grid-patterned data with C=256, shown in
Note that the grouped convolution and convolution after the grouped convolution shown in
For example, as shown in
By executing convolution using the weight of 160 rows and 256 columns, the grid-patterned data with C=256 in the 32-aligned state, as shown in
Next, as shown in
The grouped convolution weight modification unit 140 generates the weight of 32 rows and 32 columns by inserting 12 (=32−20) rows of zeros at the bottom and 12 (=32−20) columns of zeros at the right end of the weight of 20 (=160/8) rows and 20 (=160/8) columns, as shown in
By executing convolution using the weight of 32 rows and 32 columns, the black data with C=256 in the 32-aligned state, as shown in
Next, as shown in
By executing convolution using the weight of 256 rows and 160 columns, vertical striped data with C=160, shown in
The pre-convolution weight modification unit 130, the grouped convolution weight modification unit 140, and the post-convolution weight modification unit 150 store the optimized CNN model, including the modified weights, in the post-change CNN model storage unit 300.
In this example embodiment, when the grouped convolution weight modification unit 140 inserts zero rows into each weight, the pre-convolution weight modification unit 130 inserts zero columns into the weights. Also, when the grouped convolution weight modification unit 140 inserts zero columns into the weights, the post-convolution weight modification unit 150 inserts zero rows into the weights.
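To make the interplay of the three weight modification units concrete, the following is a minimal NumPy sketch under the 1x1-pixel simplification used earlier, with N=160, G=8, and M=32; all names and random values are illustrative, and bias terms and per-element operation layers are omitted. The final assertion checks the point made above: the zero insertions change the layout of the computation but not its result.

```python
import numpy as np

N, G, M = 160, 8, 32            # channel count, groups, aligned group size
s = N // G                      # original per-group channel count (20)
rng = np.random.default_rng(0)

x      = rng.random((1, N))                      # input to the input convolution
w_pre  = rng.random((N, N))                      # input weight matrix (N x N)
w_grp  = [rng.random((s, s)) for _ in range(G)]  # i-th weight matrices (N/G x N/G)
w_post = rng.random((N, N))                      # output weight matrix (N x N)

def grouped_conv(v, ws):
    """Split row vector v into len(ws) equal groups and multiply per group."""
    parts = np.split(v, len(ws), axis=1)
    return np.concatenate([p @ w for p, w in zip(parts, ws)], axis=1)

# Reference result: input convolution -> grouped convolution -> output convolution.
y_ref = grouped_conv(x @ w_pre, w_grp) @ w_post

# Input weight matrix insertion process: (M - N/G) zero columns after each
# block of N/G columns, giving an N x (M*G) matrix.
w_pre2 = np.hstack([np.hstack([w_pre[:, i*s:(i+1)*s], np.zeros((N, M - s))])
                    for i in range(G)])

# Row and column insertion processes: pad each i-th weight matrix to M x M
# with zero rows at the bottom and zero columns at the right end.
w_grp2 = [np.pad(w, ((0, M - s), (0, M - s))) for w in w_grp]

# Output weight matrix insertion process: (M - N/G) zero rows after each
# block of N/G rows, giving an (M*G) x N matrix.
w_post2 = np.vstack([np.vstack([w_post[i*s:(i+1)*s, :], np.zeros((M - s, N))])
                     for i in range(G)])

y_opt = grouped_conv(x @ w_pre2, w_grp2) @ w_post2
assert np.allclose(y_ref, y_opt)  # the computation result does not change
```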
Hereinafter, an operation of the grouped convolution processing optimization device 100 in this example embodiment will be described with reference to
First, the optimization candidate structure detection unit 110 of the grouped convolution processing optimization device 100 acquires a trained CNN model from the pre-change CNN model storage unit 200 (step S110).
Next, the optimization candidate structure detection unit 110 evaluates whether a grouped convolution is present in the structure of the acquired CNN model (step S120). When there is no grouped convolution (No in step S120), the grouped convolution processing optimization device 100 terminates the optimization process.
When a grouped convolution is present (Yes in step S120), the optimization candidate structure detection unit 110 evaluates whether Cin/G or Cout/G of the grouped convolution is a value unsuitable for a device (step S130).
In this example, the device refers to the AI chip 400. A suitable value for the device is, for example, “32” in the case of the 32-aligned state shown in
When Cin/G or Cout/G of the grouped convolution is not suitable for the device (Yes in step S130), the optimization candidate structure detection unit 110 evaluates whether there are convolutions with G=1 before and after the grouped convolution (step S140). When there is no convolution with G=1 before and after the grouped convolution (No in step S140), the process moves to step S170.
When there are convolutions with G=1 before and after the grouped convolution (Yes in step S140), the optimization candidate structure detection unit 110 evaluates whether the layers between the two convolutions and the grouped convolution are limited to per-element operation layers (step S150). When there is a processing layer other than a per-element operation layer between the two convolutions and the grouped convolution (No in step S150), the process moves to step S170.
When there is no processing layer other than a per-element operation layer between the two convolutions and the grouped convolution (Yes in step S150), the grouped convolution processing optimization device 100 executes a grouped convolution optimization process (step S160).
Next, the optimization candidate structure detection unit 110 evaluates whether all grouped convolutions in the structure of the acquired CNN model have been checked (step S170). When there are still unchecked grouped convolutions (No in step S170), the optimization candidate structure detection unit 110 repeats the process from step S130.
When all grouped convolutions in the structure of the acquired CNN model have been checked (Yes in step S170), the pre-convolution weight modification unit 130, the grouped convolution weight modification unit 140, and the post-convolution weight modification unit 150 store the optimized CNN model, including the modified weights, in the post-change CNN model storage unit 300 (step S180). After storing, the grouped convolution processing optimization device 100 terminates the optimization process.
Next, the grouped convolution optimization process in step S160, which is a sub-process that constitutes the optimization process shown in
First, the optimal channel number determination unit 120 determines the value M, which is greater than an original Cin/G or Cout/G of the grouped convolution and is the smallest value suitable for the device (step S161).
Next, the optimal channel number determination unit 120 sets the new input channel number Cin′ for the grouped convolution in the optimized CNN model as Cin′=M*G (step S162).
Next, the optimal channel number determination unit 120 sets the new output channel number Cout′ for the grouped convolution in the optimized CNN model as Cout′=M*G (step S163).
It should be noted that the optimal channel number determination unit 120 may execute at least one of the processes in step S162 or step S163.
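A minimal sketch of steps S161 to S163, assuming that “suitable for the device” means a multiple of a device-specific alignment unit (16 in the 32-aligned example above); the function and variable names are illustrative:

```python
def determine_m(channels_per_group: int, alignment_unit: int = 16) -> int:
    """Smallest multiple of alignment_unit strictly greater than channels_per_group."""
    return (channels_per_group // alignment_unit + 1) * alignment_unit

Cin, Cout, G = 160, 160, 8
M = determine_m(max(Cin // G, Cout // G))  # step S161: M = 32 when Cin/G = 20
Cin_new = M * G                            # step S162: Cin' = 256
Cout_new = M * G                           # step S163: Cout' = 256
```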
Next, the pre-convolution weight modification unit 130, the grouped convolution weight modification unit 140, and the post-convolution weight modification unit 150 generate a new model structure for the grouped convolution, where the input channel number is Cin′ and the output channel number is Cout′ (step S164).
Next, the pre-convolution weight modification unit 130, the grouped convolution weight modification unit 140, and the post-convolution weight modification unit 150 initialize the weight values of the generated model structure to zero (step S165).
Next, the pre-convolution weight modification unit 130, the grouped convolution weight modification unit 140, and the post-convolution weight modification unit 150 copy parameters, such as weight values, from the original CNN model to the layers unrelated to the optimization process in the generated model structure (step S166).
Next, the pre-convolution weight modification unit 130, the grouped convolution weight modification unit 140, and the post-convolution weight modification unit 150 copy the original weight values to the weights of the optimized target layers in the generated model structure (step S167).
Particularly, the grouped convolution weight modification unit 140 copies the original weight values for each weight matrix. As described above, the pre-convolution weight modification unit 130, the grouped convolution weight modification unit 140, and the post-convolution weight modification unit 150 realize the insertion process shown in
After copying the weight values, the grouped convolution processing optimization device 100 returns to the optimization process shown in
In this example embodiment, the grouped convolution processing optimization device 100 may modify only one of the weights used in the convolutions before and after the grouped convolution.
In the example shown in
In this case, as shown in
Furthermore, the grouped convolution weight modification unit 140 generates a weight of 32 rows and 20 columns by inserting 12 (=32−20) rows of zeros at the bottom of the weight of 20 rows and 20 columns shown in
In this case, as shown in
Further, the grouped convolution weight modification unit 140 generates a weight of 20 rows and 32 columns by inserting 12 (=32−20) columns of zeros at the right end of the weight of 20 rows and 20 columns shown in FIG. 3. The grouped convolution weight modification unit 140 executes the above zero column insertion process for each of the eight weights of 20 rows and 20 columns shown in
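Continuing the NumPy sketch given earlier (and reusing its variables x, w_pre, w_pre2, w_post, w_post2, w_grp, y_ref, M, s, and grouped_conv), the two one-sided variants described above can be checked in the same way; as before, everything here is illustrative:

```python
# Variant 1: zero rows only in each grouped weight (32 x 20), paired with the
# widened pre-convolution weight; the post-convolution weight stays unmodified.
w_grp_rows = [np.pad(w, ((0, M - s), (0, 0))) for w in w_grp]
assert np.allclose(y_ref, grouped_conv(x @ w_pre2, w_grp_rows) @ w_post)

# Variant 2: zero columns only in each grouped weight (20 x 32), paired with the
# deepened post-convolution weight; the pre-convolution weight stays unmodified.
w_grp_cols = [np.pad(w, ((0, 0), (0, M - s))) for w in w_grp]
assert np.allclose(y_ref, grouped_conv(x @ w_pre, w_grp_cols) @ w_post2)
```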
As described above, the grouped convolution weight modification unit 140 of this example embodiment, when a constraint that requires convolution computation to be executed on data composed of M channels (where M is an integer greater than N/G) for a grouped convolution is imposed on a trained convolutional neural network, in which an input convolution that executes convolutional computation using an input weight matrix of N rows and N columns on data composed of channels arranged sequentially from the first channel to the N-th channel (where N is an integer of 2 or more), the grouped convolution that divides the result of the input convolution into G parts (where G is an integer of 2 or more) along the channel direction, and executes convolutional computation for each i from 1 to G using an i-th weight matrix of N/G rows and N/G columns on the divided data composed of channels from {(i−1)×N/G+1} to (i×N/G), and an output convolution that executes convolutional computation using an output weight matrix of N rows and N columns on the result of the grouped convolution are each defined, executes at least one of a column insertion process that inserts (M−N/G) columns of zeros at the right end of the i-th weight matrix for each i from 1 to G or a row insertion process that inserts (M−N/G) rows of zeros at the bottom of the i-th weight matrix for each i from 1 to G.
Further, the pre-convolution weight modification unit 130 of this example embodiment executes an input weight matrix insertion process that inserts (M−N/G) columns of zeros to the right of the i×N/G-th column of the input weight matrix for each i from 1 to G. Additionally, the post-convolution weight modification unit 150 executes an output weight matrix insertion process that inserts (M−N/G) rows of zeros below the i×N/G-th row of the output weight matrix for each i from 1 to G.
The grouped convolution processing optimization device 100 of this example embodiment executes at least one of the input weight matrix insertion process or the output weight matrix insertion process. Also, the pre-convolution weight modification unit 130 executes the input weight matrix insertion process when the row insertion process is executed. The post-convolution weight modification unit 150 executes the output weight matrix insertion process when the column insertion process is executed.
Moreover, in this example embodiment, M is the smallest multiple of a predetermined number (for example, 16) that is greater than N/G (for example, M=32 when N/G=20). In the grouped convolution, a matrix in which the i-th weight matrices are arranged diagonally from the top-left to the bottom-right in the order of i=1 to G may also be used as the weight matrix.
Additionally, in this example embodiment, the optimal channel number determination unit 120 determines at least one of an input channel number, which is the number of channels in the input data for the grouped convolution subject to constraints, or an output channel number, which is the number of channels in the output data resulting from the grouped convolution, to be M*G.
The grouped convolution weight modification unit 140 executes the row insertion process when the input channel number is determined to be M*G, and executes the column insertion process when the output channel number is determined to be M*G.
Furthermore, in this example embodiment, the optimization candidate structure detection unit 110 evaluates whether the input convolution, the grouped convolution, and the output convolution are each defined in the trained convolutional neural network.
Some AI chips experience slower execution speeds for grouped convolutions due to constraints imposed on memory addressing. The pre-convolution weight modification unit 130, the grouped convolution weight modification unit 140, and the post-convolution weight modification unit 150 of the grouped convolution processing optimization device 100 in this example embodiment adjust the weight matrices used in the convolutional layers that make up the CNN model.
By adjusting the weight matrices, the grouped convolution processing optimization device 100 converts processes executed by the CNN model into processes suitable for AI chips with such constraints. Although the amount of computation increases when the process is converted for the AI chip, the grouped convolution can be executed faster because it is replaced with a process the AI chip excels at. In other words, the grouped convolution processing optimization device 100 enables the grouped convolution process to be accelerated on AI chips with constraints without changing the processing results.
Hereinafter, a specific example of a hardware configuration of the grouped convolution processing optimization device 100 in this example embodiment will be described.
The grouped convolution processing optimization device 100 shown in
The grouped convolution processing optimization device 100 is realized in software by the CPU 11 executing a program that provides the functionality of the various components.
That is, the CPU 11 loads the program stored in the auxiliary storage 14 into the main memory 12 and executes it to control the operations of the grouped convolution processing optimization device 100, thereby realizing the functions through software.
Note that the grouped convolution processing optimization device 100 shown in
The main memory 12 is used as a working area or a temporary storage area for data. The main memory 12 may be, for example, RAM (Random Access Memory).
The communication unit 13 has a function of inputting and outputting data with peripheral devices via a wireless network (information communication network).
The auxiliary storage 14 is a non-volatile tangible storage medium. Examples of non-volatile tangible storage media include magnetic disks, magneto-optical disks, CD-ROM (Compact Disk Read-Only Memory), DVD-ROM (Digital Versatile Disk Read-Only Memory), and semiconductor memory.
The input unit 15 has a function of inputting data and processing instructions. The input unit 15 may be an input device such as a keyboard, a mouse, or a touch panel.
The output unit 16 has a function of outputting data. The output unit 16 may be a display device such as an LCD display, a touch panel, or a printing device such as a printer.
As shown in
In the grouped convolution processing optimization device 100, the auxiliary storage 14 stores programs to implement the optimization candidate structure detection unit 110, the optimal channel number determination unit 120, the pre-convolution weight modification unit 130, the grouped convolution weight modification unit 140, and the post-convolution weight modification unit 150.
Additionally, the grouped convolution processing optimization device 100 may be implemented with circuits, such as LSI (Large Scale Integration) hardware components, that realize the functions shown in
Alternatively, the grouped convolution processing optimization device 100 may be implemented in hardware without using a CPU or similar elements. For example, some or all of the components may be realized using general-purpose circuits, dedicated circuits, processors, or a combination of these. These may be configured on a single chip (for example, the above-mentioned LSI) or on multiple chips connected via a bus. Some or all of the components may be realized by a combination of the above-mentioned circuits and a program.
Additionally, some or all of the components of the grouped convolution processing optimization device 100 may be configured with one or more information processing devices equipped with an arithmetic unit and a memory unit.
When some or all of the components are realized by multiple information processing devices or circuits, the information processing devices or circuits may be centrally or distributed. For example, the information processing devices or circuits may be realized in the form of a client-server system or a cloud computing system, each connected via a communication network.
Next, the outline of this disclosure will be described.
When such a grouped convolution processing optimization device is used, a hardware circuit with memory addressing constraints can execute grouped convolution at a higher speed.
Additionally, M may be the smallest multiple of a predetermined number that is greater than N/G.
In grouped convolution, a matrix where i-th weight matrices are arranged diagonally from the top-left to the bottom-right in the order of i=1 to G may be used as the weight matrix.
This configuration allows the grouped convolution processing optimization device to reduce the impact of overhead involved in executing grouped convolution.
Additionally, the grouped convolution processing optimization device 20 includes a determination unit (for example, the optimal channel number determination unit 120) that determines at least one of an input channel number, which is the number of channels in input data for the grouped convolution subject to constraints, or an output channel number, which is the number of channels in the output data resulting from the grouped convolution, to be M*G. The first insertion unit 21 executes the row insertion process when the input channel number is determined to be M*G, and executes the column insertion process when the output channel number is determined to be M*G.
This configuration allows the grouped convolution processing optimization device to adjust the weights that are the target of modification.
Additionally, the grouped convolution processing optimization device 20 includes an evaluation unit (for example, the optimization candidate structure detection unit 110) that evaluates whether the input convolution, the grouped convolution, and the output convolution are each defined in the trained convolutional neural network.
This configuration allows the grouped convolution processing optimization device to determine whether or not to modify the weights used in each convolution.
Also, some or all of the above example embodiments may be described as in the following supplementary notes, but are not limited to these examples.
(Supplementary note 1) A grouped convolution processing optimization device includes: a first insertion unit that executes at least one of a column insertion process that inserts (M−N/G) columns of zeros at the right end of an i-th weight matrix for each i from 1 to G or a row insertion process that inserts (M−N/G) rows of zeros at the bottom of the i-th weight matrix for each i from 1 to G, when a constraint that requires convolution computation to be executed on data composed of M channels (where M is an integer greater than N/G) for a grouped convolution is imposed on a trained convolutional neural network, in which an input convolution that executes convolutional computation using an input weight matrix of N rows and N columns on data composed of channels arranged sequentially from the first channel to the N-th channel (where N is an integer of 2 or more), the grouped convolution that divides the result of the input convolution into G parts (where G is an integer of 2 or more) along the channel direction, and executes convolutional computation for each i from 1 to G using the i-th weight matrix of N/G rows and N/G columns on the divided data composed of channels from {(i−1)×N/G+1} to (i×N/G), and an output convolution that executes convolutional computation using an output weight matrix of N rows and N columns on the result of the grouped convolution are each defined; and a second insertion unit that executes at least one of an input weight matrix insertion process that inserts (M−N/G) columns of zeros to the right of the i×N/G-th column of the input weight matrix for each i from 1 to G or an output weight matrix insertion process that inserts (M−N/G) rows of zeros below the i×N/G-th row of the output weight matrix for each i from 1 to G, wherein the second insertion unit executes the input weight matrix insertion process when the row insertion process is executed, and executes the output weight matrix insertion process when the column insertion process is executed.
(Supplementary note 2) The grouped convolution processing optimization device according to Supplementary note 1, wherein M is the smallest multiple of a predetermined number that is greater than N/G.
(Supplementary note 3) The grouped convolution processing optimization device according to Supplementary note 1 or supplementary note 2, wherein in the grouped convolution, a matrix in which the i-th weight matrices are arranged diagonally from the top-left to the bottom-right in the order of i=1 to G is used as the weight matrix.
(Supplementary note 4) The grouped convolution processing optimization device according to any one of Supplementary notes 1 to 3, further including a determination unit that determines at least one of an input channel number, which is the number of channels in input data for the grouped convolution subject to constraints, or an output channel number, which is the number of channels in the output data resulting from the grouped convolution, to be M*G, wherein the first insertion unit executes the row insertion process when the input channel number is determined to be M*G, and executes the column insertion process when the output channel number is determined to be M*G.
(Supplementary note 5) The grouped convolution processing optimization device according to any one of Supplementary notes 1 to 4, further including an evaluation unit that evaluates whether the input convolution, the grouped convolution, and the output convolution are each defined in the trained convolutional neural network.
(Supplementary note 6) A grouped convolution processing optimization method includes: executing at least one of a column insertion process that inserts (M−N/G) columns of zeros at the right end of an i-th weight matrix for each i from 1 to G or a row insertion process that inserts (M−N/G) rows of zeros at the bottom of the i-th weight matrix for each i from 1 to G, when a constraint that requires convolution computation to be executed on data composed of M channels (where M is an integer greater than N/G) for a grouped convolution is imposed on a trained convolutional neural network, in which an input convolution that executes convolutional computation using an input weight matrix of N rows and N columns on data composed of channels arranged sequentially from the first channel to the N-th channel (where N is an integer of 2 or more), the grouped convolution that divides the result of the input convolution into G parts (where G is an integer of 2 or more) along the channel direction, and executes convolutional computation for each i from 1 to G using the i-th weight matrix of N/G rows and N/G columns on the divided data composed of channels from {(i−1)×N/G+1} to (i×N/G), and an output convolution that executes convolutional computation using an output weight matrix of N rows and N columns on the result of the grouped convolution are each defined; and executing at least one of an input weight matrix insertion process that inserts (M−N/G) columns of zeros to the right of the i×N/G-th column of the input weight matrix for each i from 1 to G or an output weight matrix insertion process that inserts (M−N/G) rows of zeros below the i×N/G-th row of the output weight matrix for each i from 1 to G, wherein the input weight matrix insertion process is executed when the row insertion process is executed, and the output weight matrix insertion process is executed when the column insertion process is executed.
(Supplementary note 7) The grouped convolution processing optimization method according to Supplementary note 6, wherein M is the smallest multiple of a predetermined number that is greater than N/G.
(Supplementary note 8) The grouped convolution processing optimization method according to Supplementary note 6 or 7, wherein in the grouped convolution, a matrix in which the i-th weight matrices are arranged diagonally from the top-left to the bottom-right in the order of i=1 to G is used as the weight matrix.
(Supplementary note 9) The grouped convolution processing optimization method according to any one of Supplementary notes 6 to 8, further includes determining at least one of an input channel number, which is the number of channels in input data for the grouped convolution subject to constraints, or an output channel number, which is the number of channels in the output data resulting from the grouped convolution, to be M*G, wherein the row insertion process is executed when the input channel number is determined to be M*G, and the column insertion process is executed when the output channel number is determined to be M*G.
(Supplementary note 10) The grouped convolution processing optimization method according to any one of Supplementary notes 6 to 9, further includes evaluating whether the input convolution, the grouped convolution, and the output convolution are each defined in the trained convolutional neural network.
(Supplementary note 11) A grouped convolution processing optimization program that causes a computer to execute: a first insertion process that executes at least one of a column insertion process that inserts (M−N/G) columns of zeros at the right end of an i-th weight matrix for each i from 1 to G or a row insertion process that inserts (M−N/G) rows of zeros at the bottom of the i-th weight matrix for each i from 1 to G, when a constraint that requires convolution computation to be executed on data composed of M channels (where M is an integer greater than N/G) for a grouped convolution is imposed on a trained convolutional neural network, in which an input convolution that executes convolutional computation using an input weight matrix of N rows and N columns on data composed of channels arranged sequentially from the first channel to the N-th channel (where N is an integer of 2 or more), the grouped convolution that divides the result of the input convolution into G parts (where G is an integer of 2 or more) along the channel direction, and executes convolutional computation for each i from 1 to G using the i-th weight matrix of N/G rows and N/G columns on the divided data composed of channels from {(i−1)×N/G+1} to (i×N/G), and an output convolution that executes convolutional computation using an output weight matrix of N rows and N columns on the result of the grouped convolution are each defined; and a second insertion process that executes at least one of an input weight matrix insertion process that inserts (M−N/G) columns of zeros to the right of the i×N/G-th column of the input weight matrix for each i from 1 to G or an output weight matrix insertion process that inserts (M−N/G) rows of zeros below the i×N/G-th row of the output weight matrix for each i from 1 to G, wherein the second insertion process causes the input weight matrix insertion process to be executed when the row insertion process is executed, and causes the output weight matrix insertion process to be executed when the column insertion process is executed.
(Supplementary note 12) The grouped convolution processing optimization program according to Supplementary note 11, wherein M is the smallest multiple of a predetermined number that is greater than N/G.
(Supplementary note 13) The grouped convolution processing optimization program according to Supplementary note 11 or 12, wherein in the grouped convolution, a matrix in which the i-th weight matrices are arranged diagonally from the top-left to the bottom-right in the order of i=1 to G is used as the weight matrix.
(Supplementary note 14) The grouped convolution processing optimization program according to any one of Supplementary notes 11 to 13, wherein the program causes the computer to further execute a determination process that determines at least one of an input channel number, which is the number of channels in input data for the grouped convolution subject to constraints, or an output channel number, which is the number of channels in the output data resulting from the grouped convolution, to be M*G, wherein the first insertion process causes the row insertion process to be executed when the input channel number is determined to be M*G, and causes the column insertion process to be executed when the output channel number is determined to be M*G.
(Supplementary note 15) The grouped convolution processing optimization program according to any one of Supplementary notes 11 to 14, wherein the program causes the computer to further execute an evaluation process that evaluates whether the input convolution, the grouped convolution, and the output convolution are each defined in the trained convolutional neural network.
The example embodiments have been described with reference to specific examples, but the present disclosure is not limited to these example embodiments. Various changes and modifications may be made within the scope of the present disclosure as understood by those skilled in the art. Moreover, each example embodiment can be combined with other example embodiments as appropriate.