This application claims priority to Chinese patent application No. 202110288755.6, filed with the Chinese Patent Office on Mar. 18, 2021 and entitled "METHOD AND APPARATUS FOR PERFORMING DECONVOLUTION PROCESSING ON FEATURE DATA BY USING CONVOLUTION HARDWARE", which is incorporated herein by reference in its entirety.
The present disclosure relates to the technical field of artificial intelligence, and in particular, to a method and an apparatus for performing deconvolution processing on feature data by using convolution hardware, a device, and a storage medium.
Due to its excellent dataset fitting capability and generalization capability, the convolutional neural network (CNN) algorithm has been widely used for analyzing vision, speech, sensor perception information, and high-level semantic information in the real world. Deconvolution is an operation that interpolates or upsamples images or feature data (also referred to as a "feature map"). With the theoretical innovation and application development of deep learning, deconvolution is widely used in various novel convolutional neural network systems to restore a low-resolution image to a high-resolution image, or to generate a high-dimensional feature map based on a low-dimensional feature map. It can be expected that deconvolution processing will be widely applied in the fields of image style transfer, super-resolution, object detection, semantic segmentation, instance segmentation, key point (including but not limited to human skeleton key point) detection, depth estimation, and the like.
A general-purpose processor, such as a central processing unit (CPU) or a graphics processing unit (GPU), can be used to perform deconvolution processing on the feature map. The general-purpose processor can use a column-to-image (col2im) conversion method, where an example of the method is shown in
Although the method in
To resolve the foregoing technical problem, the present disclosure provides the following technical solutions.
According to a first aspect, the present disclosure provides a method for performing deconvolution processing on a feature map, including: splitting a deconvolution kernel into a plurality of convolution kernels, and optimizing each convolution kernel to remove rows and/or columns consisting of invalid weights. A convolution operation is performed by using the plurality of optimized convolution kernels and the corresponding feature maps, and the plurality of obtained convolutional output feature maps are interleaved and tailored to obtain a deconvolutional output result. The solutions of the present disclosure can be implemented by using convolution hardware without dedicated deconvolution hardware, thereby reducing hardware complexity and saving chip area overhead and power consumption overhead. Moreover, according to the method of the present disclosure, a large quantity of invalid weights are removed through the optimization step, so that the operating efficiency of the relevant hardware can be greatly improved, thereby improving the delay performance and energy consumption characteristics of the hardware.
According to an aspect of the present disclosure, a method for performing deconvolution processing on a feature map by using dedicated convolution hardware is provided, the dedicated convolution hardware including a multiply-add array and an on-chip memory.
The method includes: reading the feature map and the deconvolution kernel into the on-chip memory, and performing zero-padding processing on the feature map; determining a plurality of convolution kernels based on the deconvolution kernel; removing a row and/or column of each convolution kernel in which all elements are invalid weights, to obtain an optimized convolution kernel, and removing a corresponding row and/or column in the zero-padded feature map to obtain an optimized feature map corresponding to each optimized convolution kernel; performing convolution processing on each optimized convolution kernel and the corresponding optimized feature map by using the multiply-add array, to obtain a plurality of convolutional outputs; and performing interleaving synthesis processing on the plurality of convolutional outputs, to obtain an interleaving synthetic output, wherein the interleaving synthetic output includes a deconvolutional output corresponding to the feature map and the deconvolution kernel.
According to a second aspect, the present disclosure further provides an apparatus for performing deconvolution processing on a feature map by using dedicated convolution hardware, the dedicated convolution hardware including a multiply-add array and an on-chip memory.
The apparatus can include: a reading module configured to read the feature map and a deconvolution kernel into the on-chip memory; a zero-padding module configured to perform zero-padding processing on the feature map; a convolution kernel generation module configured to generate a plurality of convolution kernels based on the deconvolution kernel; an optimization module configured to remove a row and/or column of each convolution kernel in which all elements are invalid weights, to obtain an optimized convolution kernel, and remove a corresponding row and/or column in the zero-padded feature map to obtain an optimized feature map corresponding to each optimized convolution kernel; a convolution module configured to perform convolution processing on each optimized convolution kernel and the corresponding optimized feature map by using the multiply-add array, to obtain a plurality of convolutional outputs; and an interleaving synthesis module configured to perform interleaving synthesis processing on the plurality of convolutional outputs, to obtain an interleaving synthetic output, wherein the interleaving synthetic output includes a deconvolutional output corresponding to the feature map and the deconvolution kernel.
According to a third aspect, the present disclosure further provides an electronic device, including: dedicated convolution hardware, including a multiply-add array and an on-chip memory; at least one off-chip memory, storing instructions; and at least one processor, where when the instructions are run by the processor, the electronic device is enabled to implement the method described above.
According to another aspect of the present disclosure, a computer readable storage medium is provided, wherein the computer readable storage medium stores computer program instructions. When the computer program instructions are run by an electronic device, the electronic device is enabled to implement the method described above,
wherein the electronic device further includes dedicated convolution hardware, the dedicated convolution hardware including a multiply-add array and an on-chip memory.
By describing the embodiments of the present disclosure in more detail with reference to the accompanying drawings, the foregoing and other objectives, features, and advantages of the present disclosure will become more apparent. The accompanying drawings are used to provide further understanding of the embodiments of the present disclosure, constitute a part of the specification, and are used to explain the present disclosure together with the embodiments of the present disclosure, without constituting a limitation to the present disclosure. In the accompanying drawings, the same reference numerals generally represent the same components or steps.
Exemplary embodiments of the present disclosure are described below in detail with reference to the accompanying drawings. Obviously, the described embodiments are merely a part, rather than all, of the embodiments of the present disclosure. It should be understood that the present disclosure is not limited by the exemplary embodiments described herein.
Application Overview
To improve processing performance, a general-purpose processor is generally required to have high computing performance and a large-capacity cache and memory; however, this leads to very high energy consumption and hardware costs. Therefore, the general-purpose processor is not suitable for use in a terminal device. To overcome these shortcomings of the general-purpose processor, a dedicated hardware accelerator has been developed, which is suitable for performing convolution, pooling, deconvolution, and other processing in a sliding window manner. The dedicated hardware accelerator has high operating efficiency and very low power consumption, and is therefore very suitable for use in a terminal device.
For example, in the example of
Subsequently, the synthetic feature map is tailored based on the zero-padding parameters ph and pw and the output zero-padding parameters oph and opw in the height direction and the width direction, to obtain a final deconvolutional output feature map. In this example, the zero-padding parameters ph and pw and the output zero-padding parameters oph and opw are all 1. Therefore, ph=1 row of pixels is tailored on the upper side, pw=1 column of pixels is tailored on the left side, (ph−oph)=0 rows of pixels are tailored on the lower side, and (pw−opw)=0 columns of pixels are tailored on the right side. The obtained deconvolutional output feature map is a matrix with a size of 6×6.
The method shown in
Therefore, it is expected that convolution operations and deconvolution operations can be performed by using simple hardware.
Referring to
Although the method for deconvolution processing of
Therefore, it is still expected to provide an improved solution for deconvolution processing. The solution can be implemented based on convolution hardware without dedicated deconvolution hardware, and can further improve operating efficiency of the relevant hardware.
A hardware architecture design of the general-purpose processor is not suitable for the convolution, pooling, deconvolution, and other processing that a neural network model includes in large quantities. As a result, the operating efficiency is very low.
In addition, when the neural network model is run by using a dedicated hardware accelerator, although efficiency can be greatly improved, generally particular hardware needs to be designed for specific processing. For example, a separate convolution module and a separate deconvolution module need to be designed for convolution processing and deconvolution processing, resulting in high hardware complexity and bringing in additional chip area overhead and power consumption overhead.
Although a method of performing deconvolution processing by using convolution hardware has been proposed at present, the method includes a large quantity of invalid operations, resulting in great delay and energy consumption of the hardware accelerator. Moreover, additional on-chip cache space is required, and requirements on hardware are relatively high.
The present disclosure provides a method for performing deconvolution processing on a feature map, to resolve the foregoing technical problems. In the embodiments of the present disclosure, a deconvolution kernel can be split into a plurality of convolution kernels, and each convolution kernel can be optimized to remove invalid weights thereof, to obtain an optimized convolution kernel.
In addition, according to the method, the feature map is correspondingly optimized, to obtain an optimized feature map corresponding to each optimized convolution kernel. A plurality of convolutional outputs are obtained by performing convolution operations using each optimized convolution kernel and the corresponding optimized feature map. Interleaving synthesis may be performed on the plurality of convolutional outputs; optionally, the result may be tailored, to obtain a deconvolutional output feature map with an expected size.
The method of the present disclosure can be implemented by using convolution hardware without dedicated deconvolution hardware. Therefore, hardware complexity is reduced, and chip area overhead and power consumption overhead are saved. Moreover, according to the method of the present disclosure, a large quantity of invalid operations are also reduced through optimization, thereby further improving the operating efficiency of the hardware accelerator, improving delay and energy consumption characteristics, and reducing requirements for on-chip cache space, which helps reduce hardware costs.
Exemplary Method
For ease of description, parameters related to the deconvolution operation are predefined in the present disclosure, including a size (h, w) of an input feature map, a size (kh, kw) of a deconvolution kernel, sliding stride (sh, sw), zero padding (ph, pw), and output zero padding (oph, opw), where h indicates a height dimension of the feature map and w indicates a width dimension of the feature map.
Referring to
In addition, zero-padding processing can be performed on the feature map in various flexible manners. For example, zero padding is performed on the feature map while the feature map is read into the on-chip memory, or zero padding is performed on the feature map after the feature map is read into the on-chip memory, or zero padding is performed on the feature map when the feature map is read from the on-chip memory for use in, for example, convolution operations or other processing.
In an exemplary embodiment, different from commonly used symmetrical zero-padding, zero padding may be performed on the feature map in four directions, respectively. Specifically, an upper-side quantity for zero padding pht′ and a lower-side quantity for zero padding phb′ of the feature map may be determined based on a height size of the deconvolution kernel and a stride in a height direction and a zero-padding parameter in the height direction that are used for deconvolution operation, where the lower-side quantity for zero padding phb′ is one more row than the upper-side quantity for zero padding pht′.
Similarly, a left-side quantity for zero padding pwl′ and a right-side quantity for zero padding pwr′ of the feature map may be determined based on a width size of the deconvolution kernel and a stride in a width direction and a zero-padding parameter in the width direction that are used for deconvolution operation, where the right-side quantity for zero padding pwr′ is one more column than the left-side quantity for zero padding pwl′. For example, the upper-side quantity for zero padding pht′, the left-side quantity for zero padding pwl′, the lower-side quantity for zero padding phb′, and the right-side quantity for zero padding pwr′ of the feature map may be calculated respectively according to the following formulas 1 to 4, where floor is a rounded-down function, ceil is a rounded-up function, kh and kw respectively represent the height size and the width size of the deconvolution kernel, sh and sw respectively represent the stride in the height direction and the stride in the width direction, and ph and pw respectively represent the zero-padding parameter in the height direction and the zero-padding parameter in the width direction that are used for deconvolution operation. Specific formulas are as follows:
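Formulas 1 to 4 express these four quantities in terms of the floor and ceil functions. As a minimal illustrative sketch (not a verbatim restatement of the formulas), the Python helper below assumes pht′ = floor((kh − ph − 1)/sh) with phb′ = pht′ + 1, and the analogous expressions in the width direction; these assumed forms reproduce the values used in the example below.

```python
def split_paddings(kh, kw, sh, sw, ph, pw):
    # Assumed concrete forms of formulas 1 to 4: the upper-side and left-side
    # quantities round down, and the lower-side and right-side quantities are
    # one row/column larger, as stated in the text.
    pht = (kh - ph - 1) // sh  # upper-side quantity for zero padding (pht')
    phb = pht + 1              # lower side: one more row than the upper side (phb')
    pwl = (kw - pw - 1) // sw  # left-side quantity for zero padding (pwl')
    pwr = pwl + 1              # right side: one more column than the left side (pwr')
    return pht, phb, pwl, pwr

# With kh = kw = 3, sh = sw = 2, and ph = pw = 1, this yields (0, 1, 0, 1),
# so a 3x3 feature map is zero-padded to a 4x4 matrix.
```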
In the example of
In Step S120, a plurality of convolution kernels are determined based on the deconvolution kernel.
First, a quantity and sizes of convolution kernels corresponding to the deconvolution kernel may be determined. Specifically, the quantity of the convolution kernels may be determined as a product sh×sw of the stride in the height direction sh and the stride in the width direction sw that are used for deconvolution operation, and two-dimensional indexes (ish, isw) in the height direction and the width direction may be allocated to each convolution kernel. In the example of
A size kh′ of each convolution kernel in the height direction may be determined based on the height size kh of the deconvolution kernel and the stride in the height direction sh and the zero-padding parameter in the height direction ph that are used for deconvolution operation. Similarly, a size kw′ of each convolution kernel in the width direction may be determined based on the width size kw of the deconvolution kernel and the stride in the width direction sw and the zero-padding parameter in the width direction pw that are used for deconvolution operation. For example, the height and width sizes (kh′, kw′) of each convolution kernel may be determined according to the following formulas 5 and 6, where ceil is a rounded-up function, and % is a remainder operator. In the example of
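As a hedged sketch of this step, the helper below determines the quantity of convolution kernels, their two-dimensional indexes, and their sizes; the concrete expression kh′ = ceil((kh + ph % sh)/sh) (and likewise for kw′) is an assumption about formulas 5 and 6 that is consistent with the ceil and % operators noted above and with the 2×2 kernels of the example.

```python
import math

def split_kernel_geometry(kh, kw, sh, sw, ph, pw):
    n_kernels = sh * sw  # one convolution kernel per two-dimensional index (ish, isw)
    # Assumed concrete forms of formulas 5 and 6:
    kh_p = math.ceil((kh + ph % sh) / sh)  # kh'
    kw_p = math.ceil((kw + pw % sw) / sw)  # kw'
    indexes = [(ish, isw) for ish in range(sh) for isw in range(sw)]
    return n_kernels, (kh_p, kw_p), indexes

# kh = kw = 3, sh = sw = 2, ph = pw = 1 -> 4 convolution kernels of size 2x2,
# with indexes (0, 0), (0, 1), (1, 0), and (1, 1).
```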
In step S120, after the quantity and sizes of the convolution kernels are determined, a weight value of each convolution kernel is further determined.
Specifically, a possible implementation includes that for each position in each convolution kernel, two-dimensional coordinate values of a corresponding position in the deconvolution kernel may be determined based on the two-dimensional indexes in the height direction and the width direction of the convolution kernel, the height size and the width size of the convolution kernel, the two-dimensional coordinate values of the position, and the stride in the height direction, the stride in the width direction, the zero-padding parameter in the height direction, and the zero-padding parameter in the width direction that are used for deconvolution operation. A weight value of the corresponding position is taken as a weight value of the position in the convolution kernel. For example, for each position (ikh′, ikw′) in each convolution kernel (ish, isw), a corresponding position (ikh, ikw) in the deconvolution kernel may be determined according to the following formulas 7 and 8, and a weight value of this position may be taken as a weight value of the position (ikh′, ikw′) in the convolution kernel, where ish and isw respectively are indexes of each convolution kernel in the height direction and the width direction, ikh′ and ikw′ respectively are a position coordinate in the height direction and a position coordinate in the width direction in the convolution kernel, and ikh and ikw respectively are a position coordinate in the height direction and a position coordinate in the width direction in the deconvolution kernel.
When the determined corresponding position (ikh, ikw) in the deconvolution kernel exceeds a range of a position coordinate in the deconvolution kernel, a zero value is inserted at the position (ikh′, ikw′) of the convolution kernel. In other words, a weight at the position is a zero-valued invalid weight.
ikh=(kh′−1−ikh′)×sh+ish−ph % sh (Formula 7)

ikw=(kw′−1−ikw′)×sw+isw−pw % sw (Formula 8)
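Formulas 7 and 8 can be applied mechanically to populate all of the convolution kernels. The sketch below (function and variable names are illustrative, not taken from the source) fills each kernel and records a validity mask, inserting a zero marked as invalid wherever the computed coordinate falls outside the deconvolution kernel:

```python
import numpy as np

def derive_conv_kernels(deconv_kernel, sh, sw, ph, pw, kh_p, kw_p):
    kh, kw = deconv_kernel.shape
    kernels, masks = {}, {}
    for ish in range(sh):
        for isw in range(sw):
            k = np.zeros((kh_p, kw_p), dtype=deconv_kernel.dtype)
            valid = np.zeros((kh_p, kw_p), dtype=bool)
            for ikh_p in range(kh_p):
                for ikw_p in range(kw_p):
                    ikh = (kh_p - 1 - ikh_p) * sh + ish - ph % sh  # formula 7
                    ikw = (kw_p - 1 - ikw_p) * sw + isw - pw % sw  # formula 8
                    if 0 <= ikh < kh and 0 <= ikw < kw:
                        # A zero copied from the deconvolution kernel is still a
                        # valid weight; only out-of-range positions stay invalid.
                        k[ikh_p, ikw_p] = deconv_kernel[ikh, ikw]
                        valid[ikh_p, ikw_p] = True
            kernels[(ish, isw)] = k
            masks[(ish, isw)] = valid
    return kernels, masks
```

Running this with the example's parameters reproduces the pattern described below: kernel (0, 0) keeps one valid weight, kernel (0, 1) keeps only its first row of valid weights, kernel (1, 0) keeps only its first column, and kernel (1, 1) contains no invalid weights.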
For example, referring to the example of
For a position (ikh′=0, ikw′=1), by substituting relevant parameters ish=isw=0, ikh′=0, ikw′=1, k′h=k′w=2, sh=sw=2, and ph=pw=1 into the foregoing formulas 7 and 8, it can be calculated that a corresponding position in the deconvolution kernel is (ikh=1, ikw=−1). Since this coordinate value exceeds a coordinate range of the deconvolution kernel (in the example of
Similarly, for positions (ikh′=1, ikw′=0) and (ikh′=1, ikw′=1), corresponding calculated coordinate positions in the deconvolution kernel respectively are (ikh=−1, ikw=1) and (ikh=−1, ikw=−1), and both exceed the coordinate range of the deconvolution kernel. Therefore, zero-valued invalid weights are also inserted at these positions, so as to obtain various weight values of the convolution kernel (0, 0) shown in
Similarly, in a convolution kernel (ish=0, isw=1), for a position (ikh′=0, ikw′=0), a corresponding position in the deconvolution kernel that is calculated according to formulas 7 and 8 is (ikh=1, ikw=2), and a weight value at this position is “−5”; for a position (ikh′=0, ikw′=1), a corresponding calculated position in the deconvolution kernel is (ikh=1, ikw=0), and a weight value at this position is “0”; and for positions (ikh′=1, ikw′=0) and (ikh′=1, ikw′=1), corresponding calculated coordinate positions in the deconvolution kernel respectively are (ikh=−1, ikw=2) and (ikh=−1, ikw=0), and both exceed the coordinate range of the deconvolution kernel. Therefore, zero-valued invalid weights are inserted at these positions, so as to obtain the various weight values of the convolution kernel (0, 1) shown in
Similarly, in the convolution kernel (ish=1, isw=0), the two weight values "−3 and 0" in the first column are weight values determined from the corresponding positions in the deconvolution kernel according to the foregoing calculations, and the two weight values in the second column are zero-valued invalid weights that are inserted because the calculated coordinate values exceed the coordinate range of the deconvolution kernel.
In the convolution kernel (ish=1, isw=1), the four weight values "1, 1, 3, and −1" are all weight values determined from the corresponding positions in the deconvolution kernel according to the foregoing calculations, and no zero-valued invalid weight is inserted. Herein, it should be noted that a zero-valued weight determined from the deconvolution kernel (that is, a zero value that the deconvolution kernel itself includes) is a valid weight, whereas a zero-valued weight that is inserted merely because a calculated position coordinate exceeds the range is an invalid weight.
In some embodiments, to distinguish between an inserted zero-valued invalid weight and a zero-valued valid weight initially included in the deconvolution kernel, a mark may be attached to the inserted zero-valued invalid weight to indicate that it is an invalid weight. Alternatively, a mark may be attached to each weight value in the convolution kernel, to indicate whether the weight value is a valid weight or an invalid weight. For example, the mark may be a bit, where "0" indicates that the corresponding weight is an invalid weight and "1" indicates that the corresponding weight is a valid weight, or vice versa. The indication bit may be stored with the corresponding weight value, serving as, for example, an additional lowest-order bit or highest-order bit of the weight value. Alternatively, a bit map may be formed and stored separately from the convolution kernel.
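For instance, the lowest-order-bit option might be sketched as follows for integer (for example, quantized) weights; this is only one of the marking schemes described, and the helper names are hypothetical:

```python
import numpy as np

def tag_weights(weights, valid):
    # Shift each integer weight left by one bit and place the indication bit
    # in the lowest-order position ("1" = valid weight, "0" = invalid weight).
    return (weights.astype(np.int32) << 1) | valid.astype(np.int32)

def untag_weights(tagged):
    # Recover the weight value and the validity indication bit.
    return tagged >> 1, (tagged & 1).astype(bool)
```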
Herein, a plurality of convolution kernels are determined based on the deconvolution kernel. It should be understood that when the deconvolution kernel includes a bias value, the plurality of convolution kernels determined based on the deconvolution kernel may have the same bias value.
In the plurality of convolution kernels determined above, a large quantity of invalid weights may be included. Therefore, in step S130, each convolution kernel is further optimized, to remove a row and/or column of each convolution kernel in which all elements are invalid weights to obtain an optimized convolution kernel, and correspondingly remove a corresponding row and/or column in the zero-padded feature map to obtain an optimized feature map corresponding to each optimized convolution kernel.
For example, when an indication bit is set for each zero-valued weight in the convolution kernel as described above, whether a row or a column of weight values in the convolution kernel are all zeros can be determined first. If a row or a column includes at least one non-zero weight value, the row or column cannot be removed through optimization. If the weight values in a row or column are all zeros, whether the weight values are valid zero values or invalid zero values may be determined based on the indicator associated with each zero value. Only when the weights in a row or column are all invalid zero values may the row or column be removed through optimization.
In some other embodiments, if all weights (including zero-valued weights and non-zero-valued weights) in the convolution kernel are set with indication bits indicating whether the weights are valid weights, whether weights in a row or column in the convolution kernel are all invalid weights may be directly determined based on these indication bits. When weights in a row or column in the convolution kernel are all invalid weights, the row or column may be removed through optimization.
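A compact sketch of this optimization step is given below. It assumes that all-invalid rows and columns occur only at the borders of a convolution kernel, which is consistent with the coordinate mapping of formulas 7 and 8 (out-of-range coordinates arise only at the ends of the index ranges), so removing them corresponds to trimming the same edges of the zero-padded feature map:

```python
import numpy as np

def optimize_kernel_and_fmap(kernel, valid, padded_fmap):
    rows = np.flatnonzero(valid.any(axis=1))  # rows with at least one valid weight
    cols = np.flatnonzero(valid.any(axis=0))  # columns with at least one valid weight
    top = rows[0]
    bottom = valid.shape[0] - 1 - rows[-1]
    left = cols[0]
    right = valid.shape[1] - 1 - cols[-1]
    # Remove all-invalid border rows/columns from the kernel ...
    opt_kernel = kernel[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    # ... and the corresponding border rows/columns of the zero-padded feature map.
    h, w = padded_fmap.shape
    opt_fmap = padded_fmap[top:h - bottom, left:w - right]
    return opt_kernel, opt_fmap
```

In the example, this turns the convolution kernel (0, 0) into a 1×1 kernel with a 3×3 optimized feature map, and leaves the convolution kernel (1, 1) as a 2×2 kernel with the full 4×4 zero-padded feature map; every pairing yields the same 3×3 convolutional output.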
It can be learned that in step S130, optimized feature maps determined for various optimized convolution kernels may be different from each other. For example, referring to the example of
Similarly, for the convolution kernel (1, 0), the column on the right side includes only zero-valued invalid weights, and is therefore removed through optimization, with only the two valid weight values "−3 and 0" in the first column remaining. Correspondingly, a column on the right side of the zero-padded feature map is removed through optimization to obtain an optimized feature map (1, 0) that is used for the convolution kernel (1, 0), where the optimized feature map (1, 0) is a 4×3 matrix. The convolution kernel (1, 1) does not include any zero-valued invalid weight, and therefore no rows or columns are removed through optimization. For the corresponding feature map (1, 1) of the convolution kernel (1, 1), no rows or columns are removed either. The feature map (1, 1) is the zero-padded feature map, and is a 4×4 matrix.
It can be learned that, through the optimization in step S130, invalid operations in the deconvolution processing can be almost completely eliminated, thereby improving the operating efficiency of the relevant hardware and improving its delay and energy consumption characteristics.
Subsequently, in step S140, convolution processing may be performed on each optimized convolution kernel and the corresponding optimized feature map by using the multiply-add array of the convolution hardware, to obtain various corresponding convolutional outputs. For example, as described in detail below with reference to
Optionally, a result obtained through accumulation may also be linearly adjusted by using a bias value of the convolution kernel, and the obtained adjusted value can be stored in the on-chip memory SRAM as an output value of the convolution operation. In some embodiments, according to a sliding window method, the convolution kernel may be sequentially convolved with feature data in a corresponding window on the feature map, to calculate various feature data in an output feature map.
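A plain sliding-window convolution capturing this dataflow (multiply each window element by a weight in the multiplier array, accumulate the products in the addition tree, and linearly adjust the sum with the bias value) might look as follows; the function name is illustrative:

```python
import numpy as np

def conv2d_valid(fmap, kernel, bias=0):
    kh, kw = kernel.shape
    oh = fmap.shape[0] - kh + 1
    ow = fmap.shape[1] - kw + 1
    out = np.empty((oh, ow), dtype=fmap.dtype)
    for i in range(oh):      # slide the window over the feature map
        for j in range(ow):
            window = fmap[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel) + bias  # multiply, accumulate, adjust
    return out
```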
In step S140, optionally, a quantity of multipliers included in the multiply-add array of the convolution hardware is greater than or equal to a quantity of weight values included in each optimized convolution kernel. In this case, the convolution operations of one sliding window can be completed at one time, thereby ensuring high computational efficiency.
It should be noted that the quantity of the multipliers in the multiply-add array can be smaller than the quantity of the weight values in the deconvolution kernel. In other words, according to this embodiment, relatively more deconvolution processing can be implemented by using relatively few hardware resources. For example, in the example shown in
In step S150, interleaving synthesis processing is performed on various convolutional output feature maps, to obtain an interleaving synthetic output. The interleaving synthesis processing may include adding all elements in each convolutional output into a synthetic matrix by taking the stride in the height direction and the stride in the width direction that are used for deconvolution operation as padding strides, and taking the two-dimensional indexes in the height direction and the width direction of the convolution kernel as a padding offset. The interleaving synthesis processing can be represented by the following formulas 9 and 10, where ihfo and iwfo respectively represent a height coordinate and a width coordinate in the synthetic matrix; ish and isw represent two-dimensional indexes of the convolutional output feature map, that is, two-dimensional indexes of the corresponding convolution kernel; ih and iw respectively represent a height coordinate and a width coordinate in the convolutional output feature map; and sh and sw respectively represent padding strides in the height direction and the width direction.
ihfo=ih×sh+ish (Formula 9)

iwfo=iw×sw+isw (Formula 10)
A position coordinate in each convolutional output feature map may be converted into a position coordinate in the synthetic matrix according to formulas 9 and 10, so that data in each convolutional output feature map is padded into the synthetic matrix, thus completing the interleaving synthesis processing. For example, referring to the example shown in
As shown in the accompanying figures, the convolutional outputs (0, 0), (0, 1), (1, 0), and (1, 1) are each padded into the synthetic matrix in this manner.
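In code, formulas 9 and 10 reduce to strided assignment: every element of the convolutional output (ish, isw) lands on the synthetic-matrix positions (ih×sh+ish, iw×sw+isw). A minimal sketch with illustrative names:

```python
import numpy as np

def interleave_outputs(conv_outputs, sh, sw):
    # All convolutional outputs share the same size (h_o', w_o'), so the
    # synthetic matrix has size (sh * h_o', sw * w_o') per formulas 13 and 14.
    any_out = next(iter(conv_outputs.values()))
    ho_p, wo_p = any_out.shape
    synth = np.zeros((sh * ho_p, sw * wo_p), dtype=any_out.dtype)
    for (ish, isw), out in conv_outputs.items():
        synth[ish::sh, isw::sw] = out  # ih_fo = ih * sh + ish, iw_fo = iw * sw + isw
    return synth
```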
The interleaving synthetic output obtained at step S150 includes at least a deconvolutional output corresponding to the initially provided deconvolution kernel and feature map. For example, in some embodiments, the interleaving synthetic output obtained at step S150 is the deconvolutional output of the initially provided deconvolution kernel and feature map. In some other embodiments, interleaving synthetic output may be tailored, to obtain the deconvolutional output. Therefore, the method 100 provided in this embodiment may also include step S160.
In step S160, the interleaving synthetic output is tailored, to obtain a deconvolutional output corresponding to the deconvolution kernel and the initially input feature map.
Specifically, in step S160, the right side and the lower side of the interleaving synthetic output may be tailored until a size after tailoring corresponds to a size of the deconvolutional output. For example, sizes ho and wo of the deconvolutional output may be calculated according to the following formulas 11 and 12.
ho=(h−1)×sh−2×ph+kh+oph (Formula 11)

wo=(w−1)×sw−2×pw+kw+opw (Formula 12)
Moreover, sizes hfo and wfo of the interleaving synthetic output may be calculated according to formulas 13 to 16, where h′o and w′o are sizes of the convolutional output of each optimized convolution kernel.
hfo=sh×h′o (Formula 13)

wfo=sw×w′o (Formula 14)

h′o=h+pht′+phb′−kh′+1 (Formula 15)

w′o=w+pwl′+pwr′−kw′+1 (Formula 16)
Therefore, in step S160, (wfo−wo) columns may be tailored on the right side of the interleaving synthetic output, and (hfo−ho) rows on the lower side may be tailored, so as to obtain a deconvolutional output with a size of (ho, wo).
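Because only the lower side and the right side are tailored, this step reduces to keeping the upper-left (ho, wo) region of the interleaving synthetic output. A sketch applying formulas 11 and 12:

```python
def deconv_output_size(h, w, kh, kw, sh, sw, ph, pw, oph, opw):
    ho = (h - 1) * sh - 2 * ph + kh + oph  # formula 11
    wo = (w - 1) * sw - 2 * pw + kw + opw  # formula 12
    return ho, wo

def tailor(synth, ho, wo):
    # Tailor (hfo - ho) rows on the lower side and (wfo - wo) columns on the
    # right side by keeping only the upper-left (ho, wo) region.
    return synth[:ho, :wo]
```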
In the example of
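Purely as a usage illustration, the hypothetical helpers sketched above can be strung together with the example's parameters (a 3×3 input, a 3×3 deconvolution kernel, stride 2, zero padding 1, and output zero padding 1); the feature data and weights here are arbitrary placeholders, not the values from the example:

```python
import numpy as np

h = w = 3; kh = kw = 3; sh = sw = 2; ph = pw = 1; oph = opw = 1
fmap = np.arange(h * w, dtype=np.float32).reshape(h, w)               # placeholder data
deconv_kernel = np.arange(kh * kw, dtype=np.float32).reshape(kh, kw)  # placeholder weights

pht, phb, pwl, pwr = split_paddings(kh, kw, sh, sw, ph, pw)           # (0, 1, 0, 1)
padded = np.pad(fmap, ((pht, phb), (pwl, pwr)))                       # 4x4
_, (kh_p, kw_p), _ = split_kernel_geometry(kh, kw, sh, sw, ph, pw)    # 2x2 kernels
kernels, masks = derive_conv_kernels(deconv_kernel, sh, sw, ph, pw, kh_p, kw_p)

conv_outputs = {}
for idx, k in kernels.items():
    opt_k, opt_f = optimize_kernel_and_fmap(k, masks[idx], padded)
    conv_outputs[idx] = conv2d_valid(opt_f, opt_k)                    # each 3x3

synth = interleave_outputs(conv_outputs, sh, sw)                      # 6x6
ho, wo = deconv_output_size(h, w, kh, kw, sh, sw, ph, pw, oph, opw)
result = tailor(synth, ho, wo)                                        # 6x6; nothing trimmed here
```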
It can be understood from the foregoing descriptions that are made with reference to
Moreover, according to this method, there is no need to perform sparse processing on the feature map, and a large quantity of zero-valued invalid weights are removed through optimization. Hence, invalid operations can be greatly reduced, thereby improving the operating efficiency of the hardware, improving the delay and energy consumption characteristics of the related hardware, and reducing requirements for on-chip cache space, which helps further reduce hardware costs.
Exemplary Apparatus
As shown in
The reading module 210 may be configured to read a feature map and a deconvolution kernel into an on-chip memory, such as a static random access memory SRAM, of convolution hardware, as described in step S110. The convolution hardware may be a hardware accelerator dedicated for performing convolution processing, and may include a multiply-add array composed of multipliers and adders and the on-chip memory. The feature map and the deconvolution kernel can be read into the on-chip memory of the convolution hardware from, for example, a dynamic random access memory DRAM that serves as a memory, or from a non-volatile memory such as a flash memory or an electrically erasable programmable read-only memory EEPROM.
The zero-padding module 220 may be configured to perform zero-padding processing on the feature map, as described in step S110. In some embodiments, as described above, zero-padding processing may be performed on the feature map in various flexible manners. For example, zero padding is performed on the feature map while the feature map is read into the on-chip memory, or zero padding is performed on the feature map after the feature map is read into the on-chip memory, or zero padding is performed on the feature map when the feature map is read from the on-chip memory for use in, for example, convolution operations or other processing.
The convolution kernel generation module 230 may be configured to generate a plurality of convolution kernels based on the deconvolution kernel, as described in step S120. For example, the convolution kernel generation module 230 may determine a quantity and sizes of convolution kernels corresponding to the deconvolution kernel, so as to determine a weight value of each position in each convolution kernel.
Specifically, for each position in each convolution kernel, the convolution kernel generation module 230 may determine two-dimensional coordinate values of a corresponding position in the deconvolution kernel based on two-dimensional indexes in a height direction and a width direction of the convolution kernel, a height size and a width size of the convolution kernel, two-dimensional coordinate values of the position, and a stride in the height direction, a stride in the width direction, a zero-padding parameter in the height direction, and a zero-padding parameter in the width direction that are used for deconvolution operation; and take a weight value of the corresponding position as a weight value of the position in the convolution kernel. When the determined position coordinate of the corresponding position in the deconvolution kernel exceeds a range of a position coordinate in the deconvolution kernel, the convolution kernel generation module 230 may insert a zero value at this position in the convolution kernel. In other words, a weight at this position is a zero-valued invalid weight.
In some embodiments, to distinguish between an inserted zero-valued invalid weight and a zero-valued valid weight initially included in the deconvolution kernel, the convolution kernel generation module 230 may further attach a mark to the inserted zero-valued invalid weight to indicate that it is an invalid weight.
The optimization module 240 may be configured to remove a row and/or column of each convolution kernel in which all elements are invalid weights to obtain an optimized convolution kernel, and remove a corresponding row and/or column in the zero-padded feature map to obtain an optimized feature map corresponding to each optimized convolution kernel, as described in detail in step S130. It can be learned that, optimized feature maps determined for various optimized convolution kernels by the optimization module 240 may be different from each other.
The convolution module 250 may perform convolution operation on each optimized convolution kernel and the corresponding optimized feature map by using the multiply-add array of the convolution hardware, to obtain various corresponding convolutional outputs, as described in step S140. For example, the convolution module 250 may provide, in a sliding window manner, each weight value in the convolution kernel and feature data in the corresponding feature map to a multiplier. Multiplication is completed in the multiplier, and then a result is output to the adder and is accumulated with outputs of other multipliers. An obtained sum can be stored in the on-chip memory SRAM.
The interleaving synthesis module 260 may perform interleaving synthesis processing on the plurality of convolutional output feature maps generated by the convolution module 250, to obtain an interleaving synthetic output, as described in step S150.
For example, the interleaving synthesis module 260 may be configured to pad all elements in each convolutional output into a synthetic matrix by taking the stride in the height direction and the stride in the width direction that are used for deconvolution operation as padding strides, and take the two-dimensional indexes in the height direction and the width direction of the convolution kernel as padding offsets. The interleaving synthetic output generated thereby may include at least a deconvolutional output corresponding to the initially provided deconvolution kernel and feature map.
For another example, in some embodiments, the interleaving synthetic output obtained by the interleaving synthesis module 260 is the deconvolutional output of the initially provided deconvolution kernel and feature map.
In some embodiments, optionally, the apparatus 200 may further include a tailoring module 270 that may tailor the interleaving synthetic output feature maps generated by the interleaving synthesis module 260, to obtain the deconvolutional output, as described in step S160. Specifically, the tailoring module 270 may tailor the right side and the lower side of the interleaving synthetic output, until a size after tailoring corresponds to a size of the deconvolutional output.
Exemplary Electronic Device
As shown in
The processor 310 may be any form of processing unit having a data processing capability and/or an instruction execution capability. Examples of the processor 310 include but are not limited to a central processing unit (CPU), an ARM processor, a microcontroller unit (MCU), a general-purpose processor, a controller, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The processor 310 can run instructions in the memory 320 associated with the processor 310 and/or exchange data with the memory 320, so as to control other components coupled through the bus system 350 to cooperate to implement the method, steps, or functions described above.
The memory 320 may include various forms of computer-readable/writable storage media, such as a volatile memory and/or a non-volatile memory. The volatile memory may include, for example, a dynamic random access memory (DRAM) and/or a cache. The non-volatile memory may include, for example, an electrically erasable programmable read-only memory (EEPROM), a hard disk, and a flash memory. The readable/writable storage medium may include, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
In addition, computer-executable instructions can be stored in the memory 320. The instructions can be run by the processor 310 to control other components coupled through the bus system 350 to cooperate to implement the method, steps, or functions described above.
The convolution hardware accelerator 330, which may also be referred to as a convolutional neural network hardware accelerator or dedicated convolution hardware, may be dedicated hardware designed to perform convolution-related processing. As shown in
The I/O interface 340 may include communication interfaces connected to various input and output devices, such as a camera interface, a radar interface, a touch display interface, a network interface, and a controller interface that supports specific communication protocols. It should be understood that various I/O interfaces 340 may be provided according to actual application requirements.
The bus system 350 may be any bus system that can connect various components in the electronic device 300 and support the various components to communicate with each other. Examples of the bus system 350 include but are not limited to a CAN (controller area network) bus, an ISA (industry standard architecture) bus, a PCI (peripheral component interconnect) or PCI-E (PCI Express) bus, an I2C (inter-integrated circuit) bus, an SPI (serial peripheral interface) bus, and a UART (universal asynchronous receiver-transmitter) bus.
Certainly, for simplicity,
Referring to
The feature map buffer unit 420 may receive and store the feature map data by using the interface unit 410. Herein, the feature map may be an initial input feature map that is collected by a camera and has undergone preprocessing such as tailoring and sampling; or it may be a feature map output by an upper layer of the neural network, which may generally be represented in the form of a matrix.
The convolution kernel buffer unit 430 may receive and store convolution kernel data or deconvolution kernel data by using the interface unit 410. For example, the convolution hardware accelerator 400 may receive the convolution kernel to perform conventional convolution processing, or may receive the deconvolution kernel and perform deconvolution processing through a convolution operation according to the method described above with reference to
It can be understood that the feature map buffer unit 420, the convolution kernel buffer unit 430, and the output buffer unit 470 described below may be separate buffer devices, or may be different storage areas in a same buffer device. For example, the feature map buffer unit 420, the convolution kernel buffer unit 430, and the output buffer unit 470 may be implemented as a static random access memory SRAM, which may have a predetermined bit width.
The multiplier array 440 may include a plurality of multipliers 441. Each multiplier 441 may receive a piece of feature data from the feature map buffer unit 420, receive a weight value in the convolution kernel from the convolution kernel buffer unit 430, multiply the feature data and the weight value, and output a product of the feature data and the weight value.
The addition tree unit 450 may include a plurality of adders 451 that are arranged in a tree structure. The addition tree unit 450 may receive an output value of each multiplier 441 from the multiplier array 440, and accumulate output values to obtain and output a sum value thereof.
The bias unit 460 may receive a bias value from the convolution kernel buffer unit 430, receive the output value from the addition tree unit 450, linearly adjust the output value of the addition tree unit 450 by using the bias value, and then output an adjusted value. The value output by the bias unit 460 may be stored in the output buffer unit 470 as a convolution-operation output value. The foregoing steps may be repeated in a sliding window manner, so as to obtain the convolution operation result, that is, the output feature map, of the entire input feature map and the corresponding convolution kernel. The convolution operation result may be stored in the output buffer unit 470 for subsequent processing.
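To make this dataflow concrete, the toy sketch below models one window's pass through these units (the multiplier array 440, the tree of adders 451, and the bias unit 460); it is a behavioral model under the description above, not the hardware design itself:

```python
def adder_tree_sum(products):
    # Pairwise tree reduction, mirroring adders arranged in a tree structure.
    level = list(products)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # carry an unpaired element up to the next level
            nxt.append(level[-1])
        level = nxt
    return level[0]

def window_pass(window, weights, bias):
    products = [x * w for x, w in zip(window, weights)]  # multiplier array
    return adder_tree_sum(products) + bias               # addition tree + bias unit
```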
It can be understood that, for simplicity,
Although the method for performing deconvolution processing through convolution operations has been described above with reference to a dedicated accelerator, it should be understood that the principle of the present disclosure may also be implemented by using general-purpose hardware such as a CPU or a GPU, and similar technical effects may be achieved. For example, chip area overhead and power consumption overhead can be saved by avoiding sparse processing, and a large quantity of invalid operations are reduced through optimization, thereby further improving the operating efficiency of the hardware, improving delay and energy consumption characteristics, and reducing requirements for the on-chip cache space of the general-purpose processor, which helps reduce hardware costs.
Exemplary Computer Program Product and Computer Readable Storage Medium
In addition to the foregoing method and device, the embodiments of the present disclosure may further relate to a computer program product, which includes computer program instructions. When the computer program instructions are run by a processor, a convolutional neural network accelerator can be controlled to implement the method for performing deconvolution processing on a feature map according to the embodiments of the present disclosure that is described in the "exemplary method" part of this specification.
The computer program product may be program code, written in one or any combination of a plurality of programming languages, that is configured to perform the operations in the embodiments of the present disclosure. The programming languages include object-oriented programming languages such as Java, C++, or Python, and further include conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be entirely or partially executed on a user computing device, executed as an independent software package, partially executed on the user computing device and partially executed on a remote computing device, or entirely executed on the remote computing device or a server.
In addition, the embodiments of the present disclosure may further relate to a computer readable storage medium, which stores computer program instructions. When the computer program instructions are run by a processor, the processor is enabled to perform the method for performing deconvolution processing according to the embodiments of the present disclosure that is described in the “exemplary method” part of this specification.
The computer readable storage medium may be one readable medium or any combination of a plurality of readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can include, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection with one or more conducting wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a flash memory, an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
Basic principles of the present disclosure are described above in combination with specific embodiments. However, it should be pointed out that the advantages, superiorities, and effects mentioned in the present disclosure are merely examples rather than limitations, and these advantages, superiorities, and effects cannot be considered necessary for every embodiment of the present disclosure. In addition, the specific details disclosed above are merely for the purposes of example and ease of understanding, rather than limitation. The foregoing details do not mean that the present disclosure must be implemented by using these specific details. Rather, a person skilled in the art can easily conceive of many changes in form and detail under the teachings of the present disclosure, and these changes all fall within the scope defined by the claims of the present disclosure.
The block diagrams of the equipment, the apparatus, the device, and the system involved in the present disclosure are merely exemplary examples and are not intended to require or imply that the equipment, the apparatus, the device, and the system must be connected, arranged, and configured in the manners shown in the block diagrams. It is recognized by a person skilled in the art that the equipment, the apparatus, the device, and the system may be connected, arranged, and configured in an arbitrary manner. The terms such as "include", "contain", and "have" are open terms that mean "including but not limited to", and may be used interchangeably with "including but not limited to". The terms "or" and "and" used herein refer to the term "and/or", and may be used interchangeably with "and/or", unless the context clearly indicates otherwise. The term "such as" used herein refers to the phrase "such as but not limited to", and may be used interchangeably with "such as but not limited to".
It should be further pointed out that, various components or various steps in the apparatus, the device, and the method of the present disclosure may be disassembled and/or recombined. These disassembling and/or recombinations shall be regarded as equivalent solutions of the present disclosure.
The foregoing description about the disclosed aspects is provided, so that any person skilled in the art can implement or use the present disclosure. Various modifications to these aspects are very obvious to a person skilled in the art. Moreover, general principles defined herein may be applicable to other aspects without departing from the scope of the present disclosure. Therefore, the present disclosure is not intended to be limited to the aspects illustrated herein, but extends to the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been given for illustration and description. In addition, this description is not intended to limit the embodiments of the present disclosure to forms disclosed herein. Although a plurality of exemplary aspects and embodiments have been discussed above, a person skilled in the art may recognize certain variations, modifications, changes, additions, and sub-combinations thereof.
Number | Date | Country | Kind |
---|---|---|---|
202110288755.6 | Mar 2021 | CN | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2022/075891 | 2/10/2022 | WO |