The present disclosure relates to the field of an execution method for convolution computation, and more particularly, to an execution method for convolution computation that reuses data.
A convolutional neural network (CNN) is a type of deep neural network, which uses a convolution layer to filter inputs to obtain useful information. The filter of the convolution layer can be modified according to the learned parameters to extract the most useful information for a specific task. Convolutional neural networks are generally applicable to classification, detection, and recognition, such as image classification, medical image analysis, and image/video recognition.
At present, there are many neural network accelerators, such as Eyeriss, Tensor Processing Unit (TPU), the DianNao family, Angel-Eye, and EIE. However, some accelerators, such as TPU, DaDianNao, and EIE, are not suitable for low-end edge devices because either large on-chip memory or significant off-chip memory access is required. Eyeriss and Angel-Eye support filters of multiple sizes, but either the processing unit architecture design or the filter mapping onto the multiply-accumulate units (MACs) results in a low utilization rate of the MACs.
In view of the aforementioned problem, the present disclosure provides an execution method for convolution computation. During the execution, the data of some input images, weight values, and output images are reused to avoid repeated access to the same data from off-chip memory or on-chip memory, thereby improving the efficiency.
An aspect of the present disclosure is to disclose an execution method for convolution computation, which is executed by a convolution computation unit that includes a plurality of processing units and a controller. An input image of N channels is divided into X tiles including a first tile to an X-th tile according to a feature tile with size of T×T by the controller, wherein each of the X tiles includes T×T data, which are Ij(1,1)-Ij(T, T), wherein j is a corresponding one of the channels and 1≤j≤N. Convolution computations are sequentially performed on the data in the first tile of the input image of the N channels to the X-th tile of the input image of the N channels by the processing units, and the computation results are stored as output data. The data in each of the tiles are mapped by a kernel with size of A×A, and multiply-accumulate operation is performed on the mapped data in each of the tiles. Each time one of the multiply-accumulate operations performed on the A×A data mapped by the kernel is complete, the kernel is shifted to change the mapped data in said tile, and multiply-accumulate operation is performed on the changed mapped data until the multiply-accumulate operations performed on all of the data in said tile are complete, thereby finishing the convolution computation of said tile. All of the output data form an output image, wherein 1≤A≤T.
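For illustration only, the tiling step described above may be sketched in Python as follows. The function name, the NumPy array representation, and the ordering of tiles (down each column of tile positions first, matching the later 10×10 example) are assumptions for illustration, not part of the disclosure; out-of-range positions are zero-filled as described for partial tiles.

```python
import numpy as np

def divide_into_tiles(image, T):
    """Divide one channel of an H x W input image into T x T tiles.

    Positions of a tile that lie outside the image are filled with a
    default value of zero. Tiles are ordered down each column of tile
    positions first (illustrative assumption).
    """
    H, W = image.shape
    tiles = []
    for tq in range(0, W, T):          # tile columns (slow)
        for tp in range(0, H, T):      # tile rows (fastest)
            tile = np.zeros((T, T), dtype=image.dtype)
            block = image[tp:tp + T, tq:tq + T]
            tile[:block.shape[0], :block.shape[1]] = block
            tiles.append(tile)
    return tiles
```

With a 10×10 image and T=3, this yields the 16 tiles of the later example, with the fourth tile holding only three image data and the sixteenth holding only one.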
In some embodiments of the present disclosure, in the condition of A=3, the mapped data in each of the tiles for the multiply-accumulate operations are Ij(p, q), Ij((p+1), q), Ij((p+2), q), Ij(p, (q+1)), Ij((p+1), (q+1)), Ij((p+2), (q+1)), Ij(p, (q+2)), Ij((p+1), (q+2)), Ij((p+2), (q+2)), wherein 1≤p≤(T−2), 1≤q≤(T−2); wherein when p=1 and q=1, a first multiply-accumulate operation is performed.
In some embodiments of the present disclosure, when p≠(T−2), each time one of the multiply-accumulate operations performed on the A×A data mapped by the kernel is complete, the kernel is shifted so that p is added by 1 until p=(T−2).
In some embodiments of the present disclosure, when p=(T−2) and q=K, after the multiply-accumulate operation performed on the mapped data, which are Ij((T−2), K), Ij((T−1), K), Ij(T, K), Ij((T−2), (K+1)), Ij((T−1), (K+1)), Ij(T, (K+1)), Ij((T−2), (K+2)), Ij((T−1), (K+2)), Ij(T, (K+2)), is complete, the kernel is shifted so that p=1 and q=(K+1), wherein 1≤K≤(T−2).
In some embodiments of the present disclosure, when p=(T−2) and q=(T−2), after the multiply-accumulate operation performed on the mapped data, which are Ij((T−2), (T−2)), Ij((T−1), (T−2)), Ij(T, (T−2)), Ij((T−2), (T−1)), Ij((T−1), (T−1)), Ij(T, (T−1)), Ij((T−2), T), Ij((T−1), T), Ij(T, T), is complete, the multiply-accumulate operations performed on all of the data in said tile are complete, and the kernel is not shifted.
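For illustration only, the kernel-shifting rule of the preceding embodiments (p advances first until p=(T−2), then the kernel returns to p=1 with q increased by 1, stopping at p=q=(T−2)) may be sketched as a generator; the function name and generator form are assumptions, not part of the disclosure.

```python
def kernel_positions(T, A=3):
    """Yield 1-based top-left positions (p, q) visited by the A x A
    kernel in a T x T tile: p advances first until p = T-A+1, then
    the kernel returns to p = 1 with q increased by 1, and no shift
    occurs after the final position (T-A+1, T-A+1)."""
    for q in range(1, T - A + 2):      # q increases after each column
        for p in range(1, T - A + 2):  # p advances fastest
            yield p, q
```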
In some embodiments of the present disclosure, a sequence of performing convolution computations is that the convolution computation is performed on the W-th tile of the input image of the first channel to the N-th channel sequentially, and the convolution computation is not performed on the (W+1)-th tile of the input image of the first channel to the N-th channel sequentially until the convolution computations performed on the W-th tiles of the input image of the N channels are complete, wherein 1≤W≤X.
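For illustration only, the sequence stated above may be expressed as a flat visitation order; the helper name and list form are assumptions, not part of the disclosure.

```python
def computation_order(N, X):
    """Return the 1-based (tile W, channel j) visitation order stated
    above: all N channels of the W-th tile are processed before any
    channel of the (W+1)-th tile is started."""
    return [(W, j) for W in range(1, X + 1) for j in range(1, N + 1)]
```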
In some embodiments of the present disclosure, each of the processing units includes Y multiply-accumulate units for performing the multiply-accumulate operation. In the condition of A=5 and Y<25, the mapped data of each of the tiles for the multiply-accumulate operations are twenty-five data Ij(p, q)-Ij((p+4), (q+4)), wherein 1≤p≤(T−4), 1≤q≤(T−4); when p≠(T−4), the multiply-accumulate operation is performed on the consecutive mapped data from the first to the Y-th among the twenty-five mapped data, wherein after the multiply-accumulate operation performed on the consecutive mapped data from the first to the Y-th is complete, the kernel is shifted so that p is added by 1, and the multiply-accumulate operation is performed on the changed consecutive mapped data from the first to the Y-th among the twenty-five mapped data until p=(T−4).
In some embodiments of the present disclosure, when p=(T−4) and q=K, after the multiply-accumulate operation performed on the consecutive mapped data from the first to the Y-th among the twenty-five mapped data is complete, the kernel is shifted so that p=1 and q=(K+1), and the multiply-accumulate operation is performed on the changed consecutive mapped data from the first to the Y-th among the twenty-five mapped data, wherein 1≤K≤(T−4).
In some embodiments of the present disclosure, when p=(T−4) and q=(T−4), after the multiply-accumulate operation performed on the consecutive mapped data from the first to the Y-th among the twenty-five mapped data is complete, the kernel is shifted so that p=1 and q=1 in the condition of (25−Y)>Y, and each time the kernel is shifted, the multiply-accumulate operation is performed on the changed consecutive mapped data from the (Y+1)-th to the 2Y-th among the twenty-five mapped data.
In some embodiments of the present disclosure, when p=(T−4) and q=(T−4), after the multiply-accumulate operation performed on the consecutive mapped data from the first to the Y-th among the twenty-five mapped data is complete, the kernel is shifted so that p=1 and q=1 in the condition of (25−Y)<Y, and each time the kernel is shifted, the multiply-accumulate operation is performed on the changed consecutive mapped data from the (Y+1)-th to the 25th among the twenty-five mapped data and Z default data from the first to the Z-th, wherein Z=(2Y−25).
In some embodiments of the present disclosure, a sequence of performing convolution computations is that the convolution computation is performed on the W-th tile of the input image of the first channel to the N-th channel sequentially, and the convolution computation is not performed on the (W+1)-th tile of the input image of the first channel to the N-th channel sequentially until the convolution computations performed on the W-th tiles of the input image of the N channels are complete, wherein 1≤W≤X.
In some embodiments of the present disclosure, each of the processing units includes Y multiply-accumulate units for performing the multiply-accumulate operation. In the condition of A=1 and 1<Y<N, the mapped data for the multiply-accumulate operation are data, which are Ij(p, q)-IY(p, q), at a same position of the input image from the first channel to the Y-th channel, wherein 1≤p≤T, 1≤q≤T.
In some embodiments of the present disclosure, when p≠T, each time one of the multiply-accumulate operations performed on the Y data mapped by the kernel is complete, the kernel is shifted so that p is added by 1 until p=T.
In some embodiments of the present disclosure, when p=T and q=K, after the multiply-accumulate operation performed on the Y mapped data, which are Ij (T, K)-IY(T, K), is complete, the kernel is shifted so that p=1 and q=(K+1), wherein 1≤K≤(T−1).
In some embodiments of the present disclosure, when p=T and q=T, after the multiply-accumulate operation performed on the Y mapped data, which are Ij (T, T)-IY(T, T), is complete, the kernel is shifted so that p=1 and q=1 in the condition of (N−Y)>Y, and the mapped data for the multiply-accumulate operation are data, which are I(Y+1)(p, q)-I2Y(p, q) at a same position of the input image from the (Y+1)-th channel to the 2Y-th channel.
In some embodiments of the present disclosure, when p=T and q=T, after the multiply-accumulate operation performed on the Y mapped data, which are Ij (T, T)-IY(T, T), is complete, the kernel is shifted so that p=1 and q=1 in the condition of (N−Y)<Y, and the mapped data for the multiply-accumulate operation are data, which are I(Y+1)(p, q)-IN(p, q) at a same position of the input image from the (Y+1)-th channel to the N-th channel and F default data from the first to the F-th, wherein F=(2Y−N).
In some embodiments of the present disclosure, each time one of the multiply-accumulate operations performed on the data mapped by the kernel is complete, the result of the completed multiply-accumulate operation is added to a partial sum to obtain the computation result, and the value of the partial sum is replaced by the value of the computation result.
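For illustration only, a single multiply-accumulate step with the partial-sum update described above may be sketched as follows; the function name is an assumption, not part of the disclosure.

```python
def mac_step(mapped_data, weights, psum=0):
    """One multiply-accumulate step: the dot product of the mapped
    data and the kernel weights is added to the running partial sum,
    and the result replaces the partial sum (which defaults to zero
    before the first operation)."""
    return sum(d * w for d, w in zip(mapped_data, weights)) + psum
```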
To sum up, in the execution method for convolution computation of the present disclosure, during the execution, the data of some input images, weight values, and output images are reused to avoid repeated access to the same data from off-chip memory or on-chip memory, so as to maximize the efficiency. Therefore, high utilization of multiply-accumulate units and reduction of time for accessing data from the off-chip memory can be achieved, such that the performance of the convolution computation unit is improved.
In order to make the aforementioned summary, other purposes, features and advantages of the present disclosure more obvious and easy to understand, the preferred embodiments of the present disclosure are described in detail below in combination with the attached drawings.
As shown in
In one embodiment, a first buffer 191, a second buffer 193, and a third buffer 195 are also provided between the convolution computation unit 100 and the off-chip memory 190. The input data required for convolution computation can be fetched from the off-chip memory 190 and stored in the first buffer 191 in advance, and the input data memory 131 can access these data directly from the first buffer 191. The kernels/weights required for convolution computation can be fetched from the off-chip memory 190 and stored in the second buffer 193 in advance, and the weight memory 133 can access these kernels/weights directly from the second buffer 193. The output data memory 135 can store the output image obtained by the convolution computation performed by the processing unit array 110 in the third buffer 195, and the third buffer 195 then stores these result data in the off-chip memory 190.
Reference is also made to
In the execution method for convolution computation of the present disclosure, the data of some input images, weight values, and output images are reused to avoid repeated access to the same data from off-chip memory or on-chip memory, so as to maximize the efficiency. Therefore, high utilization of multiply-accumulate units and reduction of time for accessing data from the off-chip memory can be achieved, such that the performance of the convolution computation unit 100 is improved.
The execution method for convolution computation 200 includes steps S210 to S250. The details in the steps may differ according to the size of the kernel, which will be further described later. In step S210, an input image of N channels is divided into X tiles including a first tile to an X-th tile according to a feature tile with size of T×T by the controller 150, wherein each of the X tiles includes T×T data, which are Ij(1,1)-Ij(T, T), wherein j is a corresponding one of the channels and 1≤j≤N (may refer to
References are also made to
As shown in
Next, corresponding to step S230 and step S250, as shown in
In one embodiment, since the size of each of the input image may be different, a part of all tiles divided according to the feature tile with the size of T×T cannot include all of the data of the input image. Therefore, default data may be filled in the positions (or pixels) corresponding to the data of the input image not included in the divided tile. In one embodiment, the default data is zero.
For example, the input image with the size of 10×10 can include 100 input data Ij(1, 1)-Ij(10, 10). If the size of the feature tile is 3×3, the input image can be divided into 16 tiles. The fourth tile only includes three data of the input image, which are Ij(10, 1), Ij(10, 2), Ij(10, 3), corresponding to the positions (1, 1), (1, 2), (1, 3) of the fourth tile, respectively. Besides, the data corresponding to the positions (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3) of the fourth tile are all zero. Similarly, the 16th tile only includes one datum of the input image, Ij(10, 10), corresponding to the position (1, 1) of the 16th tile, and the data corresponding to the remaining positions of the 16th tile are all zero.
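For illustration only, the tile-count arithmetic of this example may be checked as follows; the ceiling formula and function name are assumptions consistent with the zero-filling described above.

```python
import math

def tile_count(H, W, T):
    """Number of T x T feature tiles needed to cover an H x W input
    image when out-of-range tile positions are zero-filled, as in
    the 10 x 10 example above."""
    return math.ceil(H / T) * math.ceil(W / T)
```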
Next, as shown in
P0 = I1(1,1)*W0 + I1(2,1)*W1 + I1(3,1)*W2 + I1(1,2)*W3 + I1(2,2)*W4 + I1(3,2)*W5 + I1(1,3)*W6 + I1(2,3)*W7 + I1(3,3)*W8 + Psum
Since the partial sum Psum has not been calculated before, it is defaulted to 0. Since there are 32 processing units 111, the 9 data may be calculated at the same time by each of the processing units 111, and 32 first output data P0 are obtained.
Next, as shown in
Next, as shown in
The kernel is shifted right or down according to the above rules until p=(T−2) and q=(T−2), as shown in
Next, as shown in
Briefly, in the condition that the size of the kernel (i.e., the number of weights) is equal to the number of multiply-accumulate units included in each of processing units 111, a sequence of performing convolution computations is that the convolution computation is performed on the W-th tile of the input image of the first channel to the N-th channel sequentially, and the convolution computation is not performed on the (W+1)-th tile of the input image of the first channel to the N-th channel sequentially until the convolution computations performed on the W-th tiles of the input image of the N channels are complete, wherein 1≤W≤X.
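For illustration only, the complete traversal summarized above (kernel shifts within a tile, then the next channel, with partial sums carrying the accumulation) may be sketched in Python for one tile position; the function name, NumPy layout, and the use of a full output tile in place of per-unit registers are assumptions, not part of the disclosure.

```python
import numpy as np

def convolve_tile(channel_tiles, kernels, T, A=3):
    """Sketch of the traversal for one tile position W: channel_tiles
    holds the W-th T x T tile of each of the N channels, and kernels
    holds the matching A x A weights. The partial sums carry the
    accumulation across kernel shifts and channels, so the output
    tile equals the sum over channels of each channel's valid 2-D
    cross-correlation."""
    out = np.zeros((T - A + 1, T - A + 1))        # partial sums
    for tile, k in zip(channel_tiles, kernels):   # channels j = 1..N
        for q in range(T - A + 1):                # kernel shift, q slow
            for p in range(T - A + 1):            # p advances fastest
                mapped = tile[p:p + A, q:q + A]   # A x A mapped data
                out[p, q] += np.sum(mapped * k)   # MAC plus partial sum
    return out
```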
In the above method, during the execution, the data of some input images, weight values, and output images are reused to avoid repeated access to the same data from off-chip memory or on-chip memory, so as to maximize the efficiency. Therefore, high utilization of multiply-accumulate units and reduction of time for accessing data from the off-chip memory can be achieved, such that the performance of the convolution computation unit is improved.
References are made to
As shown in
Accordingly, in this embodiment, as shown in
It should be noted that the nine selected data in
Next, as shown in
When p=(T−4) and q=(T−4), after the multiply-accumulate operation performed on the mapped data is complete, the kernel is shifted so that p=1 and q=1 in the condition of (25−Y)>Y. Each time the kernel is shifted, the multiply-accumulate operation is performed on the changed consecutive mapped data from the (Y+1)-th to the 2Y-th among the 25 mapped data. Specifically, when the number of the remaining weights (i.e., (25−Y)) in the kernel, which are not calculated yet, is greater than the number of multiply-accumulate units (Y), the operation for the remaining mapped data still cannot be finished at one time. Therefore, at this time, it returns to perform the multiply-accumulate operation on the consecutive mapped data from the (Y+1)-th to the 2Y-th among the original 25 mapped data (i.e., the 10th to the 18th mapped data in this example), and the kernel is shifted according to the above rules.
When p=(T−4) and q=(T−4), after the multiply-accumulate operation performed on the mapped data is complete, the kernel is shifted so that p=1 and q=1 in the condition of (25−Y)<Y. Each time the kernel is shifted, the multiply-accumulate operation is performed on the changed consecutive mapped data from the (Y+1)-th to the 25th among the 25 mapped data and Z default data from the first to the Z-th, wherein Z=(2Y−25). Specifically, when the number of the remaining weights (i.e., (25−Y)) in the kernel, which are not calculated yet, is less than the number of multiply-accumulate units (Y), the operation for the remaining mapped data can be finished at one time. However, it is possible that the number of multiply-accumulate units is greater than the number of remaining weights. In order to avoid that a part of the multiply-accumulate units is not utilized, the default data may be provided to said part of the multiply-accumulate units in this condition. The number of the default data is Z, and the value thereof is defaulted to zero, wherein Z is equal to the number of multiply-accumulate units (Y) minus the number of weights which are not calculated.
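For illustration only, the splitting of the 25 kernel weights into passes of Y with zero-valued default padding, described in the two conditions above, may be sketched as follows. The function generalizes the two stated cases to any number of passes; the name, 0-based weight indices, and the use of None for default entries are assumptions, not part of the disclosure.

```python
def weight_passes(num_weights, Y):
    """Split the kernel weights (25 for a 5 x 5 kernel) into passes of
    exactly Y for the Y multiply-accumulate units. The final pass is
    padded with default entries (None here; zero-valued data in the
    text) so that no multiply-accumulate unit sits idle."""
    passes = []
    for start in range(0, num_weights, Y):
        chunk = list(range(start, min(start + Y, num_weights)))
        chunk += [None] * (Y - len(chunk))   # Z default data
        passes.append(chunk)
    return passes
```

With Y=13, the second pass carries Z=(2Y−25)=1 default entry; with Y=9 (the example above), the second pass covers the 10th to 18th weights and a third pass pads the last seven.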
Similarly, after the convolution computation performed on all of the data of the first tile of the input image of the first channel, i.e., the convolution computation performed in the first tile is complete, the convolution computation is performed sequentially on the first tile of the input image of the second channel according to the above rules until the convolution computation performed on the first tile of the input image of the N-th channel is complete. After the convolution computations performed on the first tile of the input image of the N channels are complete, then it returns to the input image of the first channel, and the convolution computation is performed sequentially on the second tile of the input image of the first channel according to the above rules until the convolution computations performed on all of the tiles of the input image of the N channels are complete.
Briefly, in the condition that the size of the kernel is greater than the number of multiply-accumulate units included in each of the processing units 111, a sequence of performing convolution computations is that the convolution computation is performed on the W-th tile of the input image of the first channel to the N-th channel sequentially, and the convolution computation is not performed on the (W+1)-th tile of the input image of the first channel to the N-th channel sequentially until the convolution computations performed on the W-th tiles of the input image of the N channels are complete, wherein 1≤W≤X.
In the above method, during the execution, the data of some input images, weight values, and output images are reused to avoid repeated access to the same data from off-chip memory or on-chip memory, so as to maximize the efficiency. Therefore, high utilization of multiply-accumulate units and reduction of time for accessing data from the off-chip memory can be achieved, such that the performance of the convolution computation unit is improved.
References are made to
As shown in
When p=T and q=K, after the multiply-accumulate operation performed on the Y mapped data, which are Ij(T, K), I(j+1)(T, K), I(j+2)(T, K), . . . , IY(T, K), is complete, the kernel is shifted so that p=1 and q=(K+1), wherein 1≤K≤(T−1).
When p=T and q=T, after the multiply-accumulate operation performed on the Y mapped data, which are Ij(T, T), I(j+1)(T, T), I(j+2)(T, T), . . . , IY(T, T), is complete, the kernel is shifted so that p=1 and q=1 in the condition of (N−Y)>Y, and the mapped data for the multiply-accumulate operation are data, which are I(Y+1)(p, q)-I2Y(p, q) at a same position of the input image from the (Y+1)-th channel to the 2Y-th channel. In the condition that the number of the input image of the channels which are not calculated is still larger than the number of the multiply-accumulate units, since the operation for the data at the same position of the remaining input images still cannot be finished at one time, the multiply-accumulate operation continues to be performed on the data, which are I(Y+1)(p, q)-I2Y(p, q), at a same position of the input image from the (Y+1)-th channel to the 2Y-th channel in order.
On the other hand, in the condition of (N−Y)<Y (e.g., in this embodiment, N=13 and Y=9), since the number of the input image of the channels which are not calculated is less than the number of the multiply-accumulate units, the operation for the data at the same position of the remaining input images can be finished at one time. However, similar to the case of the kernel with the size of 5×5, the number of the remaining channels may be less than the number of the multiply-accumulate units in this case. In order to avoid that a part of the multiply-accumulate units is not utilized, the default data may be provided to said part of the multiply-accumulate units in this case. The number of the default data is F, and the value thereof is defaulted to zero, wherein F is equal to the number of multiply-accumulate units (e.g., Y) minus the number of channels which are not calculated (e.g., (N−Y)), for example, F=5 in this case.
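For illustration only, the grouping of the N channels into passes of Y with F default channels, as in the N=13, Y=9 example above, may be sketched as follows; the function name, 1-based channel indices, and None placeholders are assumptions, not part of the disclosure.

```python
def channel_passes(N, Y):
    """For the 1 x 1 kernel, group the N channels into passes of Y for
    the Y multiply-accumulate units. The last pass is padded with F
    default (zero-valued) channels so every multiply-accumulate unit
    is fed; e.g. N=13 and Y=9 gives F=5, as in the example above."""
    passes, F = [], 0
    for start in range(1, N + 1, Y):
        chunk = list(range(start, min(start + Y, N + 1)))
        F = Y - len(chunk)
        chunk += [None] * F     # F default data
        passes.append(chunk)
    return passes, F
```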
In the above method, during the execution, the data of some input images, weight values, and output images are reused to avoid repeated access to the same data from off-chip memory or on-chip memory, so as to maximize the efficiency. Therefore, high utilization of multiply-accumulate units and reduction of time for accessing data from the off-chip memory can be achieved, such that the performance of the convolution computation unit is improved.
References are made to
To sum up, in the execution method for convolution computation 200 of the present disclosure, during the execution, the data of some input images, weight values, and output images are reused to avoid repeated access to the same data from off-chip memory or on-chip memory, so as to maximize the efficiency. Therefore, high utilization of multiply-accumulate units and reduction of time for accessing data from the off-chip memory can be achieved, such that the performance of the convolution computation unit is improved.
It will be appreciated by those skilled in the art that changes could be made to the embodiments described above without departing from the broad inventive concept thereof. It is understood, therefore, that the present disclosure is not limited to the particular embodiments disclosed, but is intended to cover modifications within the spirit and scope of the present disclosure as defined by the appended claims.
The present disclosure claims priority to U.S. Provisional Patent Application Ser. No. 63/147,804, filed on Feb. 10, 2021, which is hereby incorporated by reference in its entirety.