The disclosure relates to the field of artificial intelligence technologies, and in particular, to an image processing device and method.
With the development of convolutional neural networks and deep learning, high-performance network models have become increasingly complex. In many computing-constrained environments, a wide range of applications require accelerated convolutional neural networks. Especially on mobile phones, visual intelligence algorithms are used more and more widely, and accelerating these algorithms within the limited computing resources of a mobile phone is of great significance. Similar demands exist in other terminal devices with limited computing power, including, but not limited to, smart TVs, smart refrigerators, surveillance cameras, smart driving vehicles, service robots, small aerospace vehicles, and the like. For algorithms that run in the cloud, it is also desirable to speed up the calculation and save computing resources and costs.
The computational complexity of a model itself can be reduced by optimizing the model. A common method is to reduce the amount of calculation of the model by reducing the total number of weights, for example by pruning unimportant weights (connections), sparsifying weights, or reducing the bit width of weights. However, reducing weights may cause loss of input information, resulting in a significant degradation in model performance.
In addition, most existing solutions are better suited to image-level tasks (such as image classification) and region-level tasks (such as object detection). For pixel-level tasks (e.g., image segmentation, depth prediction, super-resolution, de-noising, etc.), the networks involved are generally more complex and more sensitive to weight reduction, so the above methods are not applicable to pixel-level tasks.
In summary, there is a need for an image processing device and method that can effectively accelerate the processing speed of an image processing neural network.
The present disclosure provides an image processing device and method to at least partially solve the above-mentioned technical problems.
According to an aspect of the present disclosure, there is provided an image processing method, including obtaining an input image; converting the input image or a feature map of the input image into a plurality of target input images or target feature maps, wherein a resolution of each of the target input images or the target feature maps is smaller than a resolution of the feature map of the input image or the input image, and pixels at the same position in each of the target input images or the target feature maps are of a neighborhood relationship with the input image or the feature map of the input image; processing at least a part of the plurality of target input images or target feature maps by one or more convolution blocks in a convolutional neural network; and increasing a resolution of a feature map output from the one or more convolution blocks in the convolutional neural network.
According to an aspect of the present disclosure, there is provided an image processing device, including a transceiver configured to obtain an input image; at least one processor configured to: convert the input image or a feature map of the input image into a plurality of target input images or target feature maps, wherein a resolution of each of the target input images or the target feature maps is smaller than a resolution of the feature map of the input image or the input image, and pixels at the same position in each of the target input images or the target feature maps are of a neighborhood relationship in the input image or the feature map of the input image; process at least a part of the plurality of target input images or target feature maps by one or more convolution blocks in a convolutional neural network; and increase a resolution of a feature map output from the one or more convolution blocks in the convolutional neural network.
The embodiments of the present application are described in detail below, and the examples of the embodiments are illustrated in the accompanying drawings, wherein the same or similar reference numerals indicate the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the accompanying drawings are exemplified, and are not to be construed as limiting the present application.
It shall be understood by one skilled in the art that the singular forms “a”, “an”, “the” and “said” used herein include the plural forms, unless otherwise stated. It is to be understood that the term “comprise” refers to the presence of features, integers, steps, operations, components and/or elements, and does not exclude the presence of one or more other features, integers, steps, operations, components and/or elements. It is to be understood that when an element is referred to as being “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be an intermediate element. Further, the term “connected” or “coupled” as used herein may include either a wireless connection or a wireless coupling. The term “and/or” used herein includes any and all combinations of one or more of the associated listed items.
In order to avoid a significant degradation of performance while accelerating the calculation of a convolutional neural network, the present disclosure provides an image processing device and method, the core of which is to accelerate neural network calculation based on the resolution propagation effect. Specifically, a large-resolution input image or feature map is converted into a plurality of small-resolution target input images or target feature maps, wherein the resolution of each of the target input images or target feature maps is smaller than the resolution of the input image or the feature map. Then at least a portion of the target input images or the target feature maps are processed by one or more convolution blocks in the convolutional neural network. In addition to the total number of weights, another factor that determines the amount of multiplication and addition is the resolution of the input data: when the resolution of the input data of the convolutional neural network is small, the calculation is faster. Therefore, the solution of the present disclosure can improve the calculation speed of the convolutional neural network.
In addition, after processing at least part of the target input images or the target feature maps by one or more convolution blocks in the convolutional neural network, the present disclosure enlarges the processed small-resolution feature maps to a larger-resolution feature map (for example, by enlarging the resolution to the resolution of the larger-resolution feature map), thereby improving the resolution of the output feature map, maintaining the information integrity, and even increasing the resolution of the image.
In the present disclosure, the plurality of small-resolution target input images or feature maps collectively constitute an interlaced space of the large-resolution input image or feature map.
In order to make the objects, technical solutions and advantages of the present disclosure more clear, the present disclosure will be further described in detail below with reference to the specific embodiments of the present disclosure and by referring to the drawings.
At step S110, an input image is acquired.
At step S120, the input image or a feature map of the input image is converted into a plurality of target input images or target feature maps, wherein a resolution of each of the target input images or the target feature maps is smaller than a resolution of the feature map of the input image or the input image, and pixels at the same position in each of the target input images or the target feature maps are of a neighborhood relationship in the input image or the feature map of the input image.
At step S130, at least part of the plurality of target input images or target feature maps are processed by one or more convolution blocks in a convolutional neural network.
At step S140, a resolution of a feature map output from the one or more convolution blocks in the convolutional neural network is enlarged.
In the present embodiment, an input image is converted into a plurality of target input images, or a feature map of the input image is converted into a plurality of target feature maps, and then at least part of the target input images or target feature maps are processed by one or more convolution blocks in a convolutional neural network. A resolution of each of the target input images or the target feature maps is smaller than a resolution of the feature map of the input image or the input image, thereby reducing the calculation amount of the convolution block, improving the computational speed of the convolutional neural network. Furthermore, in the present embodiment, the resolution of the feature map output from the one or more convolution blocks in the convolutional neural network is enlarged, thereby improving the quality of the image output from the network.
The above feature map can be obtained by any method. For example, in some embodiments, the feature map is obtained by processing the input image by the one or more convolution blocks in the convolutional neural network. However, in other embodiments, the feature map of the input image may also be obtained by any other means currently used or developed in the future, and the embodiments of the present disclosure are not limited by the specific manner in which the feature map is obtained.
In some embodiments, enlarging the resolution of the feature map output from the one or more convolution blocks in the convolutional neural network may include: enlarging the resolution of the feature map output from any one or more convolution blocks in the convolutional neural network to a resolution that would be obtained by processing only the input image in the convolutional neural network. In this embodiment, enlarging the resolution may include enlarging the resolution of a feature map finally output from the convolutional neural network, enlarging the resolution of a feature map output from a convolution block in the middle of the convolutional neural network, enlarging the resolution of a feature map output from one convolution block, or enlarging the resolution of feature maps output from a plurality of convolution blocks. The operation is flexible. Moreover, since the input image is converted into a plurality of target input images and the resolution is enlarged only for the feature map finally output from the convolutional neural network, the images or feature maps processed by the convolution blocks in the middle of the convolutional neural network are of a small resolution, thereby further reducing the calculation amount of the convolution blocks and improving the computational speed of the convolutional neural network. Enlarging the resolution of feature maps output from a plurality of convolution blocks may involve enlarging the resolution more than once, so that the resolution can be raised further. Enlarging the resolution of the feature map output from a single convolution block provides a relatively simple way of enlarging the resolution.
Moreover, in other embodiments, the resolution of the feature map output from the convolutional neural network may even be enlarged to be higher than the resolution that would be obtained by processing only the input image in the convolutional neural network, to achieve super-resolution, such as in the ULED display example described below.
Optionally, the method further includes: performing online training on the convolutional neural network, or performing online training on the convolutional neural network by other devices, and acquiring the convolutional neural network from the other devices after training. This step is not shown in
Optionally, after step S140, the number of channels of the input image or feature map of a convolution block in the convolutional neural network is adjusted, and the convolutional neural network is retrained. By retraining the convolutional neural network, the network can adapt to the new number of channels.
Step S120 may include:
A. determining a down resolution ratio N between the resolution of the target input image or the target feature map and the resolution of the feature map of the input image or the input image.
The down resolution ratio N is a parameter indicating a resolution reduction ratio between the target input image and the input image or a parameter indicating a resolution reduction ratio between the target feature map and the feature map of the input image, and can be implemented by any feasible means. For example, for the input image/the target input image, N=W/W′, wherein W and W′ are the widths of the input image and the target input image, respectively; and for a feature map of the input image/the target feature map, N=W/W′, wherein W and W′ are the widths of the feature map of the input image and the target feature map, respectively. However, the technical solution of the embodiments of the present disclosure is not limited thereto. For example, a height ratio between images may be employed instead of a width ratio to characterize N, or any feature capable of reflecting the resolution reduction between the target input image or target feature map and the input image or the feature map of the input image may be used to calculate N.
B. determining a number F*F of the target input images or the target feature maps according to the down resolution ratio N, wherein
F=U(N)
U(⋅) is a ceiling function; and
C. converting the input image or the feature map of the input image into a number F*F of the target input images or the target feature maps.
In the above example, the number of target input images or target feature maps is determined according to a down resolution ratio N between the resolution of the target input image or the target feature map and the resolution of the feature map of the input image or the input image; however, in other examples, the number of target input images or target feature maps may also be determined according to other factors or even directly specified. Embodiments of the present disclosure are not limited by the specific determination method of the number of target input images or target feature maps.
In some embodiments, converting the input image or the feature map of the input image into a number F*F of the target input images or the target feature maps includes:
performing down-sampling on the input image or the feature map of the input image by a step size N to obtain a number F*F of the target input images or the target feature maps, wherein the sampling formula is:
Oi,j(x, y)=I(xN+i, yN+j)
wherein I and O represent the feature map of the input image or the input image and the target input image or the target feature map, respectively, i and j are indexes established for the plurality of converted target input images or target feature maps, and i∈[0,F), j∈[0,F), x and y are abscissa and ordinate of a pixel in a corresponding target input image or target feature map, respectively, and x∈[0,W′), y∈[0,H′), W′=L(W/N), H′=L(H/N), xN+i and yN+j are indexes of a pixel in the input image or the feature map of the input image, U(⋅) represents a ceiling function, L(⋅) represents a floor function, and W and H represent the width and height of the feature map of the input image or the input image, respectively, and W′ and H′ represent the width and height of the target input image or the target feature map, respectively.
In the above manner, the number F*F of the converted target input images or the target feature maps collectively constitute an interlaced space of the input image or the feature map of the input image.
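By way of illustration only, a minimal sketch of this down-sampling in Python/NumPy is given below; the helper name interlaced_space_split is not part of the disclosure, and an integer N is assumed for simplicity (for a non-integer N, F=U(N) and the floor operations above apply).

import math
import numpy as np

def interlaced_space_split(image, n):
    # Implements O_{i,j}(x, y) = I(x*n + i, y*n + j) with F = ceil(n);
    # n is assumed to be a positive integer here for simplicity.
    h, w = image.shape[:2]
    f = math.ceil(n)
    h_out, w_out = h // n, w // n      # H' = L(H/N), W' = L(W/N)
    targets = np.empty((f, f, h_out, w_out), dtype=image.dtype)
    for i in range(f):
        for j in range(f):
            # pixels at the same (x, y) across channels (i, j) are neighbors in the input
            targets[i, j] = image[i:i + h_out * n:n, j:j + w_out * n:n]
    return targets

# Example: a 256x256 input with N=2 yields F*F=4 target images of 128x128 each.
img = np.arange(256 * 256, dtype=np.float32).reshape(256, 256)
assert interlaced_space_split(img, 2).shape == (2, 2, 128, 128)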
In some embodiments, processing at least part of the plurality of target input images or target feature maps by one or more convolution blocks in a convolutional neural network includes:
processing at least part of the plurality of target input images or target feature maps by one or more convolution blocks in a convolutional neural network by one of:
convoluting a part of the plurality of target input images or target feature maps by each convolution block;
convoluting a part of the plurality of target input images or target feature maps by each weighted convolution block; and
convoluting at least part of the plurality of target input images or target feature maps according to the information differences of the plurality of target input images or the target feature maps.
In some embodiments, convoluting a part of the plurality of target input images or target feature maps by each convolution block includes:
convoluting a part of the plurality of target input images or target feature maps by each convolution block, wherein the target input images or the target feature maps of the part have a specific step interval therebetween, or
convoluting a part of the plurality of target input images or target feature maps by each convolution block, wherein a correlation between the target input images or the target feature maps of the part processed by one convolution block is higher than a threshold, or the target input images or the target feature maps of the part processed by one convolution block are multiple preset target input images or target feature maps having a correlation.
It should be noted that the correlation between channels can be manually selected, or can be measured by statistically calculating the distance between the input sub-channels (for example, Euclidean distance), or can be obtained by calculating a co-correlation matrix.
In some embodiments, convoluting a part of the plurality of target input images or target feature maps by each convolution block, wherein the target input images or the target feature maps of the part have a specific step interval therebetween, includes:
dividing the target input images or the target feature maps into R groups of target input images or target feature maps, wherein R is an integer, the target input images or the target feature maps in each group have a step interval R therebetween, and wherein the R groups of target input images or target feature maps do not overlap;
convoluting each group of target input images or target feature maps to obtain a corresponding output feature map,
wherein a first target input image or a first target feature map of the t-th group of target input images or target feature maps is the t-th target input image or target feature map, wherein t∈[0,R−1]; and
wherein the i-th target input image or the i-th target feature map of the t-th group of target input images or target feature maps is the (i*R+t)th target input image or the (i*R+t)th target feature map in all target input images or target feature maps, wherein i is a non-negative integer.
R may be determined in any manner, such as by default, as specified by the user/operator or by some other rule, and embodiments of the present disclosure are not limited by the specific manner in which R is determined.
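By way of illustration only, the grouping rule described above (the i-th member of the t-th group being the (i*R+t)-th target) may be sketched as follows; the helper name is hypothetical.

def group_by_step_interval(num_targets, r):
    # The i-th member of the t-th group is target i*R + t, so members of a
    # group are separated by a step interval of R and the groups do not overlap.
    return [list(range(t, num_targets, r)) for t in range(r)]

# Example: 9 target images (F*F = 9) divided into R = 3 groups:
# [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
print(group_by_step_interval(9, 3))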
In some embodiments, convoluting at least part of the plurality of target input images or target feature maps by each weighted convolution block includes:
convoluting the target input images or target feature maps by each convolution block according to the following formula:
Si=Σp wi,p·Convi,p(Mp), where the sum is taken over p=0, 1, . . . , P−1,
wherein Si represents the i-th output feature map, Convi,p(Mp) represents the convolution operation on the p-th target input image or target feature map Mp by the i-th convolution kernel, wi,p represents a weight for the i-th convolution kernel corresponding to the p-th target input image or target feature map, and P represents the total number of the target input images or target feature maps.
The pixels at the same position among the many channels in the interlaced space have neighborhood relationships of different distances in the original image. The above example takes this difference in the neighborhood relationship into account, so that the integrity of the input information can be maintained.
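By way of illustration only, a minimal sketch of such a channel weighted convolution is given below; the naive conv2d_valid helper and the weight layout are assumptions made for illustration, and a practical implementation would use a deep learning framework.

import numpy as np

def conv2d_valid(x, k):
    # Naive "valid" 2-D convolution used only to keep the sketch self-contained.
    kh, kw = k.shape
    out = np.zeros((x.shape[0] - kh + 1, x.shape[1] - kw + 1), dtype=np.float32)
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def weighted_channel_conv(targets, kernels, weights):
    # S_i = sum over p of w[i][p] * Conv_{i,p}(M_p), where targets[p] is M_p,
    # kernels[i][p] is the kernel applied to M_p for output i, and weights[i][p] is w_{i,p}.
    return [sum(weights[i][p] * conv2d_valid(targets[p], kernels[i][p])
                for p in range(len(targets)))
            for i in range(len(kernels))]

# Example: P = 2 target maps, one output feature map, 3x3 kernels.
maps = [np.random.rand(8, 8).astype(np.float32) for _ in range(2)]
ks = [[np.ones((3, 3), dtype=np.float32) for _ in range(2)]]
ws = [[0.7, 0.3]]
print(weighted_channel_conv(maps, ks, ws)[0].shape)   # (6, 6)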
In some embodiments, convoluting at least part of the plurality of target input images or target feature maps according to the information differences between the plurality of target input images or the target feature maps includes:
computing the information differences of the plurality of target input images or target feature maps according to the following formula:
OSp=Mp−Mb
wherein Mp represents the p-th target input image or target feature map, Mb represents a mapped reference target input image or target feature map, and OSp is offset information of the p-th target input image or target feature map, the reference target input image or target feature map being the converted first target input image or target feature map.
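By way of illustration only, the offset computation with the first converted target map as the reference may be sketched as follows:

import numpy as np

def offset_features(targets):
    # OS_p = M_p - M_b, with the reference M_b taken as the first converted target map.
    reference = targets[0]
    return [m - reference for m in targets]

# Example: the offset of the reference map itself is all zeros by construction.
maps = [np.full((4, 4), v, dtype=np.float32) for v in (1.0, 2.0, 5.0)]
print([float(o.mean()) for o in offset_features(maps)])   # [0.0, 1.0, 4.0]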
The present disclosure further provides an image processing device, which includes:
an I/O module 210 for acquiring an input image;
an Interlaced Space Module (ISM) 220 configured to convert the input image or a feature map of the input image into a plurality of target input images or target feature maps, wherein a resolution of each of the target input images or the target feature maps is smaller than a resolution of the feature map of the input image or the input image, and pixels at the same position in each of the target input images or the target feature maps are of a neighborhood relationship in the input image or the feature map of the input image;
a processing module 230 configured to process at least part of the plurality of target input images or target feature maps by one or more convolution blocks in a convolutional neural network; and
an Up Resolution module (URM) 240 configured to enlarge a resolution of a feature map output from the one or more convolution blocks in the convolutional neural network.
The image processing device of the present disclosure utilizes the interlaced space module to reduce the resolution of the input in the interlaced space while maintaining information integrity, and the output resolution of the original neural network is maintained by the up resolution module.
In some embodiments, the processing module 230 is configured to obtain the feature map of the input image from the input image by the one or more convolution blocks in the convolutional neural network.
In some embodiments, the up resolution module 240 is configured to enlarge a resolution of a feature map output from any one or more convolution blocks in the convolutional neural network to a resolution that is obtained by only processing the input image in the convolutional neural network.
In the present embodiment, an input image is converted into a plurality of target input images, or a feature map of the input image is converted into a plurality of target feature maps, and then at least part of the target input images or target feature maps are processed by one or more convolution blocks in a convolutional neural network. A resolution of each of the target input images or the target feature maps is smaller than a resolution of the feature map of the input image or the input image, thereby reducing the calculation amount of the convolution block, improving the computational speed of the convolutional neural network. Furthermore, in the present embodiment, the resolution of the feature map output from the one or more convolution blocks in the convolutional neural network is enlarged, thereby improving the quality of the image output from the network.
The above feature map can be obtained by any means. For example, in some embodiments, the feature map is obtained by processing the input image by the one or more convolution blocks in the convolutional neural network. However, in other embodiments, the feature map of the input image may also be obtained by any other means currently used or developed in the future, and the embodiments of the present disclosure are not limited by the specific manner in which the feature map is obtained.
Moreover, in other embodiments, the resolution of the feature map output from the convolutional neural network may even be enlarged to be higher than the resolution that is obtained by only processing the input image in the convolutional neural network, to achieve super resolution cases, such as those described below related with ULED display screens.
In some embodiments, the interlaced space module 220 is configured to convert the input image or a feature map of the input image into a plurality of target input images or target feature maps by:
determining a down resolution ratio N between the resolution of the target input image or the target feature map and the resolution of the feature map of the input image or the input image;
determining a number F*F of the target input images or the target feature maps according to the down resolution ratio N, wherein
F=U(N)
U(⋅) is a ceiling function; and
converting the input image or the feature map of the input image into a number F*F of the target input images or the target feature maps.
In some embodiments, the interlaced space module 220 is configured to convert the input image or the feature map of the input image into a number F*F of the target input images or the target feature maps by:
performing down-sampling on the input image or the feature map of the input image by a step size N to obtain a number F*F of the target input images or the target feature maps, wherein the sampling formula is:
Oi,j(x, y)=I(xN+i, yN+j)
wherein I and O represent the feature map of the input image or the input image and the target input image or the target feature map, respectively, and i and j are indexes established for the plurality of converted target input images or target feature maps, and i∈[0,F), j∈[0,F), x and y are abscissa and ordinate of a pixel in a corresponding target input image or target feature map, respectively, and x∈[0,W′), y∈[0,H′), W′=L(W/N), H′=L(H/N), xN+i and yN+j are indexes of a pixel in the input image or the feature map of the input image, U(⋅) represents a ceiling function, L(⋅) represents a floor function, and W and H represent the width and height of the feature map of the input image or the input image, respectively, W′ and H′ represent the width and height of the target input image or the target feature map, respectively.
In the above example, the down resolution ratio N is a parameter indicating a resolution reduction ratio between the target input image and the input image or a parameter indicating a resolution reduction ratio between the target feature map and the feature map of the input image, and can be implemented by any feasible means. For example, for the input image/the target input image, N=W/W′, wherein W and W′ are the widths of the input image and the target input image, respectively; and for a feature map of the input image/the target feature map, N=W/W′, wherein W and W′ are the widths of the feature map of the input image and the target feature map, respectively. However, the technical solution of the embodiments of the present disclosure is not limited thereto. For example, a height ratio between images may be employed instead of a width ratio to characterize N, or any feature capable of reflecting the resolution reduction between the target input image or target feature map and the input image or the feature map of the input image may be used to calculate N.
In the above example, the number of target input images or target feature maps is determined according to a down resolution ratio N between the resolution of the target input image or the target feature map and the resolution of the feature map of the input image or the input image; however, in other examples, the number of target input images or target feature maps may also be determined according to other factors or even directly specified. Embodiments of the present disclosure are not limited by the specific determination method of the number of target input images or target feature maps.
In some embodiments, the processing module 230 is configured to process at least part of the plurality of target input images or target feature maps by one or more convolution blocks in a convolutional neural network by one of:
convoluting a part of the plurality of target input images or target feature maps by each convolution block;
convoluting a part of the plurality of target input images or target feature maps by each weighted convolution block; and
convoluting at least part of the plurality of target input images or target feature maps according to the information differences of the plurality of target input images or the target feature maps.
In some embodiments, the processing module 230 is configured to convolute a part of the plurality of target input images or target feature maps by each convolution block by:
convoluting a part of the plurality of target input images or target feature maps by each convolution block, wherein the target input images or the target feature maps of the part have a specific step interval therebetween, or
convoluting a part of the plurality of target input images or target feature maps by each convolution block, wherein a correlation between the target input images or the target feature maps of the part processed by one convolution block is higher than a threshold, or the target input images or the target feature maps of the part processed by one convolution block are multiple preset target input images or target feature maps having a correlation.
It should be noted that the correlation between channels can be manually selected, or can be measured by statistically calculating the distance between the input sub-channels (for example, Euclidean distance), or can be obtained by calculating a co-correlation matrix.
In some embodiments, the processing module 230 is configured to convolute a part of the plurality of target input images or target feature maps by each convolution block, wherein the target input images or the target feature maps of the part have a specific step interval therebetween, by:
dividing the target input images or the target feature maps into R groups of target input images or target feature maps, wherein R is an integer, the target input images or the target feature maps in each group have a step interval R therebetween, and wherein the R groups of target input images or target feature maps do not overlap;
convoluting each group of target input images or target feature maps to obtain a corresponding output feature map,
wherein a first target input image or a first target feature map of the t-th group of target input images or target feature maps is the t-th target input image or target feature map, where t∈[0,R−1]; and
wherein the i-th target input image or the i-th target feature map of the t-th group of target input images or target feature maps is the (i*R+t)th target input image or the (i*R+t)th target feature map in all target input images or target feature maps, wherein i is a non-negative integer.
R may be determined in any manner, such as by default, as specified by the user/operator or by some other rule, and embodiments of the present disclosure are not limited by the specific manner in which R is determined.
In some embodiments, the processing module 230 is configured to convolute at least part of the plurality of target input images or target feature maps by each weighted convolution block by:
convoluting the target input images or target feature maps by each convolution block according to the following formula:
Si=Σp wi,p·Convi,p(Mp), where the sum is taken over p=0, 1, . . . , P−1,
wherein Si represents the i-th output feature map, Convi,p(Mp) represents the convolution operation on the p-th target input image or target feature map Mp by the i-th convolution kernel, wi,p represents a weight for the i-th convolution kernel corresponding to the p-th target input image or target feature map, and P represents the total number of the target input images or target feature maps.
In some embodiments, the processing module 230 is configured to convolute at least part of the plurality of target input images or target feature maps according to information differences of the plurality of target input images or the target feature maps by:
computing the information differences of the plurality of target input images or target feature maps according to the following formula:
OSp=Mp−Mb
wherein Mp represents the p-th target input image or target feature map, Mb represents a mapped reference target input image or target feature map, and OSp is offset information of the p-th target input image or target feature map, the reference target input image or target feature map being the converted first target input image or target feature map.
After the interlaced space module reduces the resolution and the up resolution module restores the resolution, the network may need to be retrained due to changes in the network model structure. Accordingly, the present disclosure provides a method of training the convolutional neural network, which includes the following steps.
at step S310, obtaining training data;
at step S320, training the convolutional neural network with the training data; and
at step S330, adjusting the number of channels of input images or feature maps of each convolution block of the convolutional neural network according to the training result; wherein, for a K-th convolution block, if the number of channels of input images or feature maps of the K-th convolution block before adjustment is Mk, the number of channels of input images or feature maps of the K-th convolution block after adjustment is λkMk, and λk is an expansion coefficient.
In some embodiments, if the adjustment does not increase the number of channels and convolution kernels, the corresponding convolution kernel follows the convolution method before the adjustment and convolutes all the input channels.
In some embodiments, if the adjustment increases the number of channels and convolution kernels, all newly added input channels are convoluted, or all input channels are convoluted.
In some embodiments, adjusting the number of channels of input images or feature maps of each convolution block of the convolutional neural network according to the training result includes:
setting the expansion coefficient to λ=[λ0, λ1, . . . , λL], wherein L represents the number of convolution blocks, and setting λ0=λ1= ⋅ ⋅ ⋅ =λL so that λ becomes a scalar hyperparameter;
determining an adjustment step size of λ to obtain a new sequence of λ, [λ0, λ1, . . . , λS], wherein each λs (s ∈[0, S]) is an L-dimensional vector, λs=[λ0s, λ1s, . . . , λLs], and λs<λv if s<v, where v ∈[0, S];
calculating a corresponding performance gain for each λs:
wherein the performance gain Gs−1 is a scalar, and As represents the performance of the obtained model corresponding to λs; and
calculating a finally selected expansion coefficient λ based on the following formula:
wherein Gthr is a threshold value, whose value is a predetermined constant value or the performance gain of the convolutional neural network before adjustment that corresponds to the expansion coefficient λ being a unit vector, and argmaxλ
During training, although it may be necessary to adjust the number of feature maps of each convolution block, in general the range of variation of the number of feature maps is limited, and thus the additional increase or decrease in the calculation amount caused thereby is also limited, so that calculation acceleration of the entire model can still be achieved.
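By way of illustration only, the performance-gain-threshold selection described above may be sketched as follows. The exact gain and selection formulas are not reproduced in this text, so the sketch assumes a scalar λ (the equal-coefficient setting of the first sub-step), assumes the gain is the performance difference between consecutive settings, and assumes the largest λ whose gain still exceeds Gthr is selected; the evaluate_model callable is hypothetical.

def select_expansion_coefficient(lambda_seq, evaluate_model, g_thr):
    # lambda_seq: increasing candidate expansion coefficients [lambda_0, ..., lambda_S]
    # evaluate_model: returns the performance A_s of the retrained model for a setting
    # g_thr: performance gain threshold G_thr
    performances = [evaluate_model(lam) for lam in lambda_seq]
    selected = lambda_seq[0]
    for s in range(1, len(lambda_seq)):
        gain = performances[s] - performances[s - 1]   # assumed gain G_{s-1} = A_s - A_{s-1}
        if gain >= g_thr:
            selected = lambda_seq[s]   # keep expanding while the gain is still worthwhile
        else:
            break                      # the performance curve has entered the flat zone
    return selected

# Example with a toy performance curve that flattens after lambda = 1.0 -> prints 1.0
curve = {0.5: 0.70, 0.75: 0.78, 1.0: 0.83, 1.25: 0.84, 1.5: 0.845}
print(select_expansion_coefficient(sorted(curve), curve.get, g_thr=0.02))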
The output resolution can be enlarged by enlarging the resolution of the feature map output from any one or more convolution blocks of the convolutional neural network, so that the resolution of the feature map output from the convolutional neural network is restored to the resolution that would be obtained by processing only the input image in the convolutional neural network.
For example, for a convolutional neural network comprising P convolution blocks, the large-resolution input channel of the Con1 convolution block of the convolutional neural network can be converted into multiple small-resolution input channels, thereby reducing the calculation amount of the Con1˜ConP convolution blocks while maintaining information integrity. After that, the resolution of the output channel of the Con(P−1) convolution block can be enlarged, thereby enlarging the resolution of the output of the entire convolutional neural network, to achieve resolution recovery. The example is given only as an exemplary illustration, and the image processing device and method of the present disclosure are not limited to the input channel resolution reduction of the Con1 convolution block and the output channel resolution improvement of the Con(P−1) convolution block.
The present disclosure further provides a device for training a convolutional neural network, which includes:
an I/O module 410 configured to obtain training data;
a training module 420 configured to train the convolutional neural network with the training data; and
a neural network adjusting module 430 configured to adjust the number of channels of input images or feature maps of each convolution block of the convolutional neural network according to the training result; wherein, for a K-th convolution block, if the number of channels of input images or feature maps of the K-th convolution block before adjustment is Mk, the number of channels of input images or feature maps of the K-th convolution block after adjustment is λkMk, and λk is an expansion coefficient.
In some embodiments, if the adjustment does not increase the number of channels and convolution kernels, the corresponding convolution kernel follows the convolution method before the adjustment and convolutes all the input channels.
In some embodiments, if the adjustment increases the number of channels and convolution kernels, all newly added input channels are convoluted, or all input channels are convoluted.
In some embodiments, the neural network adjusting module 430 is configured to:
set the expansion coefficient to λ=[λ0, λ1, . . . , λL], wherein L represents the number of convolution blocks, and set λ0=λ1= ⋅ ⋅ ⋅ =λL so that λ becomes a scalar hyperparameter;
determine an adjustment step size of λ to obtain a new sequence of λ, [λ0, λ1, . . . , λS], wherein each λs (s ∈[0, S]) is an L-dimensional vector, λs=[λ0s, λ1s, . . . , λLs], and λs<λv if s<v, where v ∈[0, S];
calculate a corresponding performance gain for each λs:
wherein the performance gain Gs−1 is a scalar, and As represents the performance of the obtained model corresponding to λs; and
calculate a finally selected expansion coefficient λ based on the following formula:
wherein Gthr is a threshold value, whose value is a predetermined constant value or the performance gain of the convolutional neural network before adjustment that corresponds to the expansion coefficient λ being a unit vector, and argmaxλ
The image processing method provided by the embodiments of the present disclosure can be used in various electronic devices and applied to various usage scenarios. For example, a method performed by such an electronic device may include:
step S510 of determining a usage scenario of the electronic device from at least one preset scenario; and
step S520 of processing an acquired input image by using the image processing method of the above aspect based on the determined usage scenario of the electronic device, wherein the number of the plurality of target input images or target feature maps is based on the determined usage scenario of the electronic device.
According to an embodiment of the present disclosure, an electronic device corresponding to the method shown in
a scenario determining module 610 configured to determine a usage scenario of the electronic device from at least one preset scenario; and
an image processing device 620 configured to process an acquired input image based on the determined usage scenario of the electronic device, wherein the number of the plurality of target input images or target feature maps is based on the determined usage scenario of the electronic device.
The image processing device 620 may be the image processing device as shown in
The specific process of image processing based on convolutional neural networks according to the present disclosure will be described below with reference to
One of the purposes of the present disclosure is to accelerate convolutional neural networks. Convolutional neural networks may include various convolutional neural network structures, such as various network structures for image-level classification tasks, region-level detection tasks, and pixel-level segmentation tasks. Neural networks for different tasks have some special structures. For example, for classification networks, the last two layers may be fully connected layers. Networks for detection tasks may end with different branches for multitasking. Networks for segmentation tasks not only contain an encoding network, but may also include a decoding network with a higher resolution, skip connections between front and back layers, dilated (hole) convolutions, a pyramid network layer, and the like. In this embodiment, a classical network structure is taken as an example for description. In fact, the present disclosure is applicable as long as the main part of the network architecture is a convolutional neural network.
In the following description, for the sake of brevity of description, in some embodiments, the technical solution of the present disclosure will be described mainly for an input image or a target input image. It should be noted, however, that if not specifically stated or if there is no conflict, the technical solution described for the input image or the target input image is equally applicable to the feature map of the input image or the target feature map.
The input module receives an input image in step S110. This image is input to the subsequent network.
The ISM module converts the input image into a plurality of target input images in step S120, wherein the resolution of each of the plurality of target input images is smaller than the resolution of the input image, and the pixels at the same position in each of the target input images are of a neighborhood relationship in the input image.
As shown in
Firstly a down resolution ratio N is determined. The down resolution ratio N is a parameter indicating a resolution reduction ratio between the target input image and the input image, and can be implemented by any feasible means. For example, N=W/W′, wherein W and W′ are the widths of the input image and the target input image, respectively.
However, the technical solution of the embodiments of the present disclosure is not limited thereto. For example, a height ratio between images may be employed instead of a width ratio to characterize N, or a feature capable of reflecting the resolution reduction between the target input image and the input image may be used to calculate N. N may be an integer or a non-integer.
Then a number F*F of the target input images is determined according to the down resolution ratio N, wherein
F=U(N)
U(⋅) is a ceiling function.
Finally, the input image is converted into a number F*F of the target input images.
For example, the number F*F of target input images can be obtained by down-sampling the input image with a step size of N. The sampling formula is:
Oi,j(x, y)=I(xN+i, yN+j)
W′=L(W/N)
H′=L(H/N)
wherein I and O represent the input image and the target input image, respectively, i and j are indexes established for the plurality of converted target input images, and i∈[0,F), j∈[0,F), x and y are abscissa and ordinate of a pixel in a corresponding target input image, respectively, and x∈[0,W′), y∈[0,H′), W′=L(W/N), H′=L(H/N), xN+i and yN+j are indexes of a pixel in the input image, U(⋅) represents a ceiling function, L(⋅) represents a floor function, and W and H represent the width and height of the input image, respectively, and W′ and H′ represent the width and height of the target input image, respectively.
Finally, a number F*F of converted small images are obtained. The resulting converted small image varies with N. Since N can be a non-integer, the numerical field of N is continuous, and all of the mapped small images collectively constitute an interlaced space of the original image.
The operation of the ISM module reduces the resolution of the input channel in the interlaced space without loss of input channel information.
The plurality of target input images obtained by the conversion in step S120 are input to the convolution block, and each convolution block processes at least a part of the plurality of input target input images in step S130.
In
The processing in step S130 can be performed by a variety of different convolution methods, four of which are described below: packet cross-convolution, correlated channel convolution, channel weighted convolution, and offset feature learning convolution. The first three convolution methods emphasize the influence of the important input channels while alleviating the influence of the unimportant input channels. The present disclosure collectively refers to the first three convolution methods as interlaced-space-based sparse channel convolution, as shown in
Specifically, the first convolution method is an interlaced-space-based packet cross-convolution method.
In the convolution method, the target input images are divided into R groups of target input images, wherein R is an integer, the target input images in each group have a step interval R therebetween, and wherein the R groups of target input images do not overlap.
Then each group of target input images are convoluted to obtain a corresponding output feature map, wherein a first target input image of the t-th group of target input images is the t-th target input image, where t∈[0,R−1].
In the convolution method, the i-th target input image of the t-th group of target input images is the (i*R+t)-th target input image in all target input images, wherein i is a non-negative integer.
The second one is an interlaced-space-based correlated channel convolution (the right graph in
Pixels at the same position in the sub-channels obtained by mapping the original input channel into the interlaced space correspond to different positions in the original picture, and some of the channels are more closely correlated. For some tasks, e.g., super-resolution tasks, the convolution between adjacent channels is more important, and it is therefore necessary to preserve the convolution between these related channels. For other tasks, it is not necessary to perform the convolution between closely correlated channels. T groups of input channels can be selected according to the correlation between the input channels (for example, above a certain threshold) and the requirements of the application. The convolution can then be performed within each group of input channels, and correspondingly, the output channels are also divided into T groups. The output channels of group t (t∈[0,T−1]) only convolute the relevant input channels of group t.
It should be noted that the correlation between channels can be manually selected, or can be measured by statistically calculating the distance between the input sub-channels (for example, Euclidean distance), or can be obtained by calculating a co-correlation matrix.
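By way of illustration only, one of the options mentioned above (measuring correlation by the Euclidean distance between sub-channels) may be sketched as follows; the greedy grouping strategy is an assumption made for illustration.

import numpy as np

def group_correlated_channels(channels, dist_thr):
    # Greedily place each channel into the first group whose representative
    # (its first member) lies within the Euclidean distance threshold.
    groups = []
    for idx, ch in enumerate(channels):
        for group in groups:
            if np.linalg.norm(ch - channels[group[0]]) < dist_thr:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups

# Example: channels 0 and 1 are nearly identical, channel 2 is far away -> [[0, 1], [2]]
chs = [np.zeros((4, 4)), np.zeros((4, 4)) + 0.01, np.ones((4, 4)) * 10]
print(group_correlated_channels(chs, dist_thr=1.0))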
The third one is an interlaced-space-based channel weighted convolution method. In this method, each weighted convolution block performs convolution processing on at least a part of the plurality of target input images. Specifically, each convolution block convolutes the target input images or target feature maps according to the following formula:
Si=Σp wi,p·Convi,p(Mp), where the sum is taken over p=0, 1, . . . , P−1,
wherein Si represents the i-th output feature map, Convi,p(Mp) represents the convolution operation on the p-th target input image Mp by the i-th convolution kernel, wi,p represents a weight for the i-th convolution kernel corresponding to the p-th target input image, and P represents the total number of the target input images.
The existing convolution method is a simple summation of the convolution results over different channels. In contrast, the above convolution method according to the embodiment of the present disclosure takes into account the fact that pixels at the same position of different channels in the interlaced space have neighborhood relationships of different distances in the original image, by setting corresponding weights for the channels, thereby being able to maintain the integrity of the input information.
It should be noted that the weight setting of the input channel may be shared between different output feature maps Si. For example, the output feature map is divided into T groups, and the weight settings are shared between each group. In addition, the channel weighted convolution method can be used in combination with the other convolution methods without conflict.
The fourth one is an interlaced-space-based convolution method that learns offset features. The method processes at least part of the plurality of target input images according to information differences between the plurality of target input images. Specifically, the information differences of the plurality of target input images or target feature maps are determined according to the following formula:
OSp=Mp−Mb
wherein Mp represents the p-th target input image, Mb represents a mapped reference target input image, and OSp is offset information of the p-th target input image, the reference target input image being the converted first target input image.
As shown in
It should be noted that, since this convolution method learns the offset information, the interlaced-space-based input channels other than the reference channel can be directly set as offset information, which can be configured according to the effect in the specific application.
After the process of step S130, the URM module enlarges the resolution of the feature map output from one or more convolution blocks in step S140. The enlarged resolution may be the original output resolution of the convolutional neural network, i.e., the resolution that would be obtained by processing only the input image in the convolutional neural network. In this way, the resolution of the output can be maintained at the level of the output resolution of the original convolutional neural network; alternatively, the resolution after enlargement can be even higher than that of the original convolutional neural network, thereby achieving the purpose of improving the resolution of the input image while reducing the computational complexity.
As shown in
In
With the technical solution according to the embodiments of the present disclosure, the required amount of calculation can be reduced, and thus the required computing resources can be saved and the image processing speed can be improved.
The amount of calculation of the convolutional neural network can be measured by the number of basic multiplication and addition operations. Specifically, if the original neural network to be accelerated in the present disclosure includes L convolution blocks, the calculation amount is:
Cold=C1+C2+ ⋅ ⋅ ⋅ +CL−1+CL
wherein Ci represents the amount of multiplication and addition of the i-th convolution block. The calculation amount of the neural network after the implementation of the embodiment of the present disclosure is:
wherein λMax=max(λ1, λ2, . . . , λL), whose value is generally between [0.5, 1.5] and is much smaller than N. The above formula can be approximated as:
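(The approximated formula itself is not reproduced here; one plausible form, consistent with the approximately N2-fold acceleration stated below and with λMax being on the order of 1, is Cnew≈(λMax2/N2)·Cold≈Cold/N2.)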
It can be seen that the present disclosure can achieve approximately N2 times calculation acceleration for the original neural network. In order to verify the acceleration effect of the present disclosure, the image processing time of the original network and of the network accelerated by the technical solution according to the embodiment of the present disclosure are respectively tested using an NVIDIA Tesla M40 graphics card, and the test results are shown in Table 1:
It can be seen that compared with the existing network, the image processing time of the network accelerated by the present disclosure is greatly reduced.
The above example is described with respect to an input image, however, the technical solution of this example is also applicable to a feature map of an input image, and the difference therebetween is only in whether it is an input image or a feature map of an input image that is converted into a plurality of images.
The feature map of the input image may be generated by one or more convolution blocks of the convolutional neural network. In this case, for example, the ISM module can be placed between convolution blocks of the convolutional neural network. The input image is input to a convolutional neural network, and a feature map of the input image is generated by a convolution block located before the ISM module and sent to the ISM module to generate a corresponding plurality of target feature maps.
Specifically, when the ISM module converts the feature map of the input image into a plurality of target feature maps, it firstly determines the down resolution ratio N. The down resolution ratio N can be implemented by any feasible means. For example, N=W/W′, wherein W and W′ are the widths of the feature map of the input image and the target feature map, respectively.
However, the technical solution of the embodiments of the present disclosure is not limited thereto. For example, a height ratio between images may be employed instead of a width ratio to characterize N, or a feature capable of reflecting the resolution reduction between the target feature map and the feature map of the input image may be used to calculate N.
Then a number F*F of the target feature maps is determined according to the down resolution ratio N, wherein
F=U(N)
U(⋅) is a ceiling function.
Finally, the feature map of the input image is converted into a number F*F of the target feature maps.
The manner of determining the number of target feature maps according to the down resolution ratio N and converting to the target feature maps is similar to the manner of determining the number of target input images according to the down resolution ratio N and converting to the target input image, and it only needs to replace the input image in the manner described for the input image with the feature map of the input image and replace the target input image with the target feature map. The details are not described herein again.
The resulting target feature map can be used for subsequent processing. The subsequent processing is as described in steps S130-S140 in
Alternatively, in other implementations, the received input image may be processed by other modules or units to generate a feature map of the input image, and the feature map of the input image is processed as shown in step S120 of
Alternatively, after the input image has been processed by the convolutional neural network of
The training flow will be described below with reference to the left graph of
The large input image (i.e., training data) can be a batch of images, which can contain several channels. When the large input image contains multiple channels, each channel of the large input image is subjected to the conversion in the interlaced space module and converted into a plurality of small images. When the large input image contains only one channel, the large input image itself is subjected to the conversion in the interlaced space module and converted into a plurality of small images. Some or all of these small images are sequentially sent to each convolution block in the convolutional network for processing, and an output feature map is obtained. Due to the resolution reduction implemented by the interlaced space module, the resolution to be processed by all subsequent convolution blocks is reduced and the amount of calculation is also reduced. Since the resolution processed by the last convolution block is also reduced, in order to maintain the resolution of the output result, the embodiment of the present disclosure introduces an up resolution module to maintain the output resolution. The resolution of the output feature map is enlarged by the up resolution module, and the output result is a feature map with the enlarged resolution. It should be noted that the interlaced space module is not limited to being placed only before the first convolution block, and rather may be placed before any convolution block. The up resolution module can also be placed after any convolution block, as long as the constraint that the ISM module is placed before the up resolution module is satisfied. When the ISM module is located after the h-th convolution block, the ISM module converts each feature map output from the h-th convolution block into a plurality of target feature maps. A loss calculation module can also be provided to calculate the loss of the output result during training. The specific calculation method of this module is related to the specific application. For example, in a super resolution task, a Mean Squared Error (MSE) can be used to calculate the loss. It should be noted that the training process performs a data forward propagation from the front to the back, and also performs a gradient backward propagation from the back to the front after completing a loss calculation. In the left graph of
By retraining the neural network, the number of feature maps for each convolution block can be re-adjusted. This operation is illustrated by the light-colored feature maps at each of the convolution blocks in the corresponding figure.
As shown in
For example, the resolution of the input channel shown in
wherein x%N represents the remainder obtained by dividing x by N.
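The index formula itself is not reproduced in this text; purely as an assumption-labeled illustration consistent with the remainder notation above, the mapping from a source coordinate to a target image and target coordinate could take the following form:

def interlaced_index(x, y, n):
    # Assumed mapping: source pixel (x, y) goes to target image (x % n, y % n)
    # at position (x // n, y // n); the exact formula of the ISM module may differ.
    return (x % n, y % n), (x // n, y // n)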
If the up resolution module URM is located between multiple convolution blocks and the number of channels output by the URM module differs from the original number of input channels of the next convolution block, the first layer of the next convolution block (its convolution kernel, if the layer is a convolutional layer) needs to be adjusted to match the new number of input channels. The adjusted convolution kernel re-learns its weights during the subsequent offline retraining.
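As a hedged sketch of this adjustment (the zero-initialization of newly added channels is an assumption here, mirroring the initialization strategy described later for the WIN5-RB example), the first-layer convolution kernel of the next convolution block could be adapted as follows:

import numpy as np

def adjust_first_conv_kernel(kernel, new_in_channels):
    # kernel has shape (out_channels, in_channels, kH, kW); reuse the original weights
    # for the kept channels and start any newly added input channels from zero,
    # to be re-learned during the subsequent offline retraining.
    out_ch, in_ch, kh, kw = kernel.shape
    adjusted = np.zeros((out_ch, new_in_channels, kh, kw), dtype=kernel.dtype)
    kept = min(in_ch, new_in_channels)
    adjusted[:, :kept] = kernel[:, :kept]
    return adjusted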
The retraining and channel adjustment process can be specifically as follows.
As shown in
In order to explain the adjustment of the hyper-parameter λ=[λ0, λ1, . . . , λL] of the expansion coefficients of the feature maps, the present disclosure uses the model MobileNetV2 as an example. In this model, the expansion coefficients of the layers are set to be the same, so that the number of channels of all layers can be controlled by a global scalar λ (similar in meaning to the expansion coefficient in the present disclosure). For the relationship between model performance and the expansion coefficient, reference may be made to the corresponding figure.
It can be seen from the figure that the performance curve enters a flat zone when λ>1, where increasing the number of channels no longer contributes significantly to the network performance. This is also the reason why the number of channels of the original convolutional neural network corresponds to λ=1. If the resolution of the input image of the original convolutional neural network is directly reduced to 128*128, a down-sampling network is obtained whose performance is lower than that of the original network for the same λ. Outside the flat zone, increasing λ generally improves the performance of the down-sampling network; for example, the down-sampling network with λ=1 is better than the original convolutional neural network with λ=0.5. However, the λ selected by the original convolutional neural network is often near the front of the flat zone. In that case, due to the information loss of the down-sampling network and the existence of the flat performance zone, it is difficult for the performance of the down-sampling network to catch up with that of the original convolutional neural network.
In
In order to determine the best expansion coefficient λ, the embodiment of the present disclosure proposes a method for automatically determining the parameter λ, also referred to as a performance-gain-threshold-based method. The method includes the following sub-steps (an illustrative sketch is given after the sub-steps):
sub-step 1 of setting the expansion coefficient to λ=[λ0, λ1, . . . , λL], wherein L represents the number of convolution blocks and λ0=λ1= . . . =λL, so that λ becomes a scalar hyper-parameter;
sub-step 2 of determining an adjustment step size of λ to obtain a new sequence of candidates [λ0, λ1, . . . , λS], wherein each λs (s ∈ [0, S]) is an L-dimensional vector, λs=[λs0, λs1, . . . , λsL], and λs<λv if s<v, where v ∈ [0, S];
sub-step 3 of calculating a corresponding performance gain for each λs:
wherein the performance gain Gs−1 is a scalar, and As represents the performance of the model obtained with λs; and
sub-step 4 of calculating a finally selected expansion coefficient λ based on the following formula:
wherein Gthr is a threshold value, whose value is a predetermined constant or the performance gain of the convolutional neural network before adjustment (i.e., with the expansion coefficient λ being a unit vector), and the argmax is taken over the candidate indices s.
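The gain and selection formulas themselves are not reproduced in this text. Purely as an assumption-labeled sketch, one plausible reading (gain defined as the performance difference between consecutive candidates, and the selected coefficient being the largest candidate whose gain still reaches the threshold Gthr) could be implemented as follows; the evaluate callable standing in for model training and evaluation is hypothetical.

def select_expansion_coefficient(lambdas, evaluate, g_thr):
    # lambdas: increasing candidate sequence [lambda_0, ..., lambda_S] (sub-steps 1-2).
    # evaluate(lambda_s) returns the model performance A_s for that candidate (sub-step 3).
    # The gain formula A_s - A_{s-1} and the selection rule below are assumptions.
    performances = [evaluate(lam) for lam in lambdas]
    gains = [performances[s] - performances[s - 1] for s in range(1, len(performances))]
    selected = 0
    for s, gain in enumerate(gains, start=1):
        if gain >= g_thr:
            selected = s          # keep the largest index whose gain reaches the threshold
    return lambdas[selected]      # sub-step 4: the finally selected expansion coefficient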
It should be noted that, as the expansion coefficient is adjusted, the newly added channels of feature maps may adopt the convolution methods introduced in step S130, as described above.
The training flow will be described below with reference to the right graph of the corresponding figure.
During training, if the stitching method described later in the embodiments of the present disclosure is used to enlarge the resolution, the loss function of the original convolutional neural network can be applied directly to the expanded small-resolution output, so it is not necessary to add the up resolution module during training; the module is, however, required for testing.
In the present disclosure, the development flow of the neural network model also differs from that of existing convolutional neural networks.
In an embodiment of the present disclosure, the down resolution ratio N may be an adjustable adaptive parameter (an appropriate N is adaptively selected according to the configuration of the hardware operating environment, so that the scheme is applicable to a wider variety of computing platforms), which may be used to trade off performance against speedup. Specifically, the down resolution ratio N may be determined manually or selected according to the hardware resource environment, but the disclosure is not limited thereto.
The image processing method according to the embodiment of the present disclosure can be applied to various scenes.
At step S510, a usage scenario of the electronic device is determined from at least one preset scenario.
The above scenarios may be classified into low-light scenes (such as night scenes), high-light scenes, etc., or may be classified into high ISO scenes, medium ISO scenes, low ISO scenes, etc., or may be classified into super resolution scenes and non-super resolution scenes, etc. The specific classification may depend on the type of the electronic device and its particular use, and embodiments of the present disclosure are not limited by the specific usage scenario classification.
At step S520, an acquired input image is processed by using the image processing method described above.
The method is further described below in specific scenarios. It is to be noted that only certain specific usage scenarios are shown below, and it should not be construed as limiting the application of the technical solutions according to the embodiments of the present disclosure.
In some embodiments, a usage scenario of de-noising nighttime images captured by a mobile phone will be described.
When a user of a mobile phone takes photos in different scenes, the noise differs due to the influence of lighting, hardware, and the like. The computation resources of mobile phones are limited, and a deep-learning-based de-noising model is a complex pixel-level task, which leads to a significant response delay. However, the user is particularly sensitive to the camera response speed when taking pictures. Therefore, in this specific application, resolving the contradiction between the two brings a better user experience.
To address this problem, the captured scenes can be classified and processed separately for different usage scenarios, as illustrated in the corresponding figure.
Without loss of generality, in this embodiment, the WIN5-RB image de-noising network is selected in the present disclosure as an example of a convolutional neural network.
The network structure is shown in the left graph of the corresponding figure.
This scheme is specially designed for nighttime image processing, and it obtains good image quality at the cost of response speed. To further address this problem, the image processing method and device described in the present disclosure can be applied.
A flowchart of an image processing scheme employing an embodiment of the present disclosure is shown in the right graph of the corresponding figure, in which the down resolution ratio N is configured according to the usage scenario, for example:
Night scene: N=2
Low light scene: N=3
Daytime scene: N=4
or:
High ISO: N=2
Medium ISO: N=3
Low ISO: N=4
Depending on the hyper-parameter N, different models or the same model can be used. When the same model is used for different configurations, the number of input channels of the model should be the same as the number of channels in the configuration with the maximum value of N. If the number of input channels provided is smaller than the number of channels expected by the model, the values in the missing input channels can be set to 0. Without loss of generality, in this specific implementation, a trained model can be configured for each value of N that is set.
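A minimal sketch of this configuration logic, assuming the scenario names and N values listed above and using zero-filled channels for the missing inputs, might look as follows:

import numpy as np

# Illustrative mapping from usage scenario to the down resolution ratio N (values from the text).
SCENE_TO_N = {"night": 2, "low_light": 3, "daytime": 4}

def pad_input_channels(small_images, model_in_channels):
    # small_images has shape (C, H, W); when the model was built for the configuration
    # with the maximum N, the missing input channels are simply filled with zeros.
    c, h, w = small_images.shape
    if c >= model_in_channels:
        return small_images
    padded = np.zeros((model_in_channels, h, w), dtype=small_images.dtype)
    padded[:c] = small_images
    return padded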
Without loss of generality, in this specific implementation, it is assumed that the user has taken a night scene or the camera is in a high ISO mode, and N=2 is selected. The input of the WIN5-RB network is a grayscale image of 481*321. For convenience of calculation, the present disclosure scales the input image from 481*321 to 480*320 and converts it into four small images of 240*160 by the interlaced space module (in this embodiment the interlaced space module can be implemented in hardware, which is faster). These small images are stacked into four channels and sent to the first layer of the WIN5-RB network. As the input resolution is reduced, the amount of calculation of the subsequent layers becomes smaller. The last layer of the original WIN5-RB network outputs a 481*321 de-noised image; according to the method of the present disclosure, it is modified into four output channels of 240*160, which then pass through a stitch-based up resolution module (in this embodiment the up resolution module can also be implemented in hardware, which is faster) to restore an image of 480*320, and finally the image is enlarged to 481*321 by interpolation. In this embodiment, the up resolution module may adopt the inverse of the interlacing conversion, i.e., a stitching conversion. For the additional convolution kernels added by the interlaced space module and the up resolution module, the initial weights are set to 0 (other initialization strategies may also be adopted), and the remaining weights are initialized to the corresponding weights of the original network. Finally, retraining is performed. In this embodiment, the expansion coefficients of all the feature maps are set to λi=1, that is, the number of channels of the feature maps is kept unchanged.
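As a sketch under the assumptions of this example (480*320 grayscale input, N=2, and a stitch-based up resolution module that is the exact inverse of the interlacing conversion; the final interpolation back to 481*321 is omitted), the software equivalent of the hardware ISM and URM could look as follows:

import numpy as np

def interlace_split(img, n=2):
    # 480*320 grayscale image -> n*n small images stacked as channels (here four 240*160 images).
    return np.stack([img[dy::n, dx::n] for dy in range(n) for dx in range(n)])

def stitch_merge(channels, n=2):
    # Stitch-based up resolution: the inverse of interlace_split, restoring the 480*320 image.
    c, h, w = channels.shape
    out = np.empty((h * n, w * n), dtype=channels.dtype)
    for idx in range(c):
        dy, dx = divmod(idx, n)
        out[dy::n, dx::n] = channels[idx]
    return out

img = np.random.rand(480, 320).astype(np.float32)
assert np.array_equal(stitch_merge(interlace_split(img, 2), 2), img)  # information-preserving round trip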
In other embodiments, a usage scenario of a high definition television display will be described in conjunction with a ULED display.
Video projection has many important applications. For example, a video may need to be projected to a Samsung HDTV, a shopping mall advertising screen, an irregularly shaped LED advertising screen, a projector, or mobile phones of various resolutions. When traditional player software plays a video on TVs or mobile phones, short-edge alignment is often used to maintain the scale of the video content, and if the resolution is insufficient, the video frame is up-scaled by interpolation algorithms. However, some applications require a full-screen projection of the played video signal, such as digital video images and ULED projection with various shapes and resolutions, and when the resolution is insufficient, it is necessary to generate a super resolution image for the video frame.
Take video signals from HDTVs with different resolutions as an example. Classic HDTV resolution specifications include 2K (1920*1080), 4K (3840*2160) and 8K (7680*4320). Without loss of generality, this specific implementation projects a 4K video onto an 8K screen. Resolution alignment can be done using algorithms such as bilinear interpolation or a traditional data-driven machine learning method, and this processing can also be achieved by a convolutional neural network that processes the video frame by frame. For example, the network consists of three convolution blocks, each of which contains a convolutional layer and an activation function layer; the resolutions of the first two convolution blocks are consistent with the resolution of the input video, and the third convolution block doubles the resolution to 8K by de-convolution.
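For illustration only, a minimal PyTorch sketch of such a three-block network is given below; the channel width of 32 and the 3*3/4*4 kernel sizes are assumptions, not the parameters of the network described above.

import torch.nn as nn

class SimpleSuperResolutionNet(nn.Module):
    # Two resolution-preserving convolution blocks followed by a de-convolution block
    # that doubles the spatial resolution (e.g., 4K -> 8K).
    def __init__(self, channels=32):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        # kernel_size=4, stride=2, padding=1 exactly doubles an even input resolution
        self.block3 = nn.ConvTranspose2d(channels, 3, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        return self.block3(self.block2(self.block1(x)))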
This convolutional neural network has a problem: because several convolution blocks in the network maintain the resolution without down-sampling, and the resolution of each feature map is as high as 3840*2160, it is difficult for the computing resources available on the HDTV to meet the needs of real-time computing.
To at least partially address this problem, the convolutional neural network can be improved using the image processing schemes described above. In some embodiments, a method for displaying a video includes:
step S1710 of receiving a video;
step S1720 of enlarging the resolution of the video using an AI (Artificial Intelligence) related module; and
step S1730 of displaying the video with the resolution enlarged.
In some embodiments, enlarging the resolution of the video using an AI related module includes: using an AI chip to enlarge the resolution of the video according to the image processing method described above.
The AI related module may be an AI chip, or any hardware and/or software implementation that can implement AI functionality.
In some embodiments, enlarging the resolution of the video using an AI related module includes: converting a frame in the video into a plurality of target input images using hardware, and processing the plurality of target input images using the AI related module to enlarge the resolution of the output image.
In some embodiments, a display device includes: an I/O module 1810 for receiving a video;
an up resolution module 1820 configured to enlarge the resolution of the video using an AI (Artificial Intelligence) related module; and
a display module 1830 configured to display the video with the resolution enlarged.
In some embodiments, the up resolution module 1820 is configured for: using an AI chip to improve the resolution of the video according to the image processing method of the present disclosure.
The AI related module may be an AI chip, or any hardware and/or software implementation that can implement AI functionality.
In some embodiments, the up resolution module 1820 is configured for: converting a frame in the video to a plurality of target input images using hardware, and processing the plurality of target input images using the AI related module to improve the resolution of an output image.
The improved network structure is shown in the corresponding figure.
In a specific implementation, the related algorithm is used in an HDTV, and an example of a general hardware structure of an HDTV can be seen in the corresponding figure.
The functions of step S1720 and the up resolution module 1820 can be implemented in the super resolution module shown in the corresponding figure.
The super resolution module may further include a video signal determiner, which determines whether the signal is an 8K signal. If the signal is already an 8K signal, the super resolution task is not performed, and the signal is directly output. If the resolution of the signal is lower than 8K, the super resolution task needs to be executed, and the signal is then sent to the AI computing chip.
The super resolution module may further include an ISM hardware module, which implements the function of the ISM module of the present disclosure in hardware, decomposes a large input image into several sub-images, and sends the sub-images to the AI computing chip. Specifically, after receiving the signal, the ISM module calculates and stores the corresponding data into the AI computing chip according to the index calculation method introduced for the ISM module in the present disclosure.
The super resolution module may further include a video frame buffer for buffering the sub-images obtained from the decomposition in the ISM hardware module, with the sub-images stored at the corresponding addresses of the video frame buffer. After loading the algorithm, the AI computing chip performs super resolution processing on the images at the corresponding addresses of the video frame buffer.
The super resolution module may further include a video frame buffer for buffering the 8K signals obtained by the AI computing chip and sending them to the display module.
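A hedged sketch of the control flow of this super resolution module is given below; ism_split, ai_chip_infer and display are hypothetical placeholders for the ISM hardware module, the AI computing chip and the display module.

def super_resolution_pipeline(frame, height, width, ism_split, ai_chip_infer, display):
    # Placeholder sketch of the module flow described above.
    if (height, width) == (4320, 7680):       # already an 8K signal: bypass super resolution
        display(frame)
        return
    sub_images = ism_split(frame)              # decompose the large frame into sub-images
    frame_buffer = list(sub_images)            # buffer the sub-images for the AI computing chip
    result_8k = ai_chip_infer(frame_buffer)    # super resolution processing on the buffered data
    display(result_8k)                         # the buffered 8K result is sent to the display module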
With the technical solutions shown in
Convolutional neural networks exhibit two effects: a resolution propagation effect and a feature map number effect. A convolutional neural network (CNN) differs from a fully connected neural network in that, within the same channel, the convolution kernel shares parameters and operates on a fixed-size window of the input, so the resolution of the feature map output from a convolutional layer is proportional to the resolution of the feature map input to that layer. Similarly, the resolution of the feature map output from a pooling layer, activation function layer, deconvolution layer, etc. is also proportional to the resolution of the input feature map. If a convolutional neural network, especially a fully convolutional neural network, changes the resolution at a certain convolution block, the resolutions of all subsequent convolution blocks are affected in proportion, and the amount of calculation of each convolution block is also proportional to its resolution. The present disclosure refers to this effect as the resolution propagation effect; its propagation direction follows the order of the convolution blocks. In contrast, the number of feature maps of a certain convolution block only affects the calculation amount of the current block and the next convolution block; the present disclosure refers to this latter effect as the feature map number effect. The acceleration of the calculation of the convolutional neural network in the present disclosure is based on these two effects.
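The proportionality between resolution and calculation amount can be illustrated with a small multiply-accumulate count; the layer sizes below are arbitrary assumptions used only to show that halving the resolution (N=2) reduces the per-layer computation by about N^2.

def conv_layer_macs(h, w, in_ch, out_ch, k=3):
    # Multiply-accumulate count of a convolution layer whose output keeps the input resolution.
    return h * w * in_ch * out_ch * k * k

full = conv_layer_macs(480, 320, 32, 32)
reduced = conv_layer_macs(240, 160, 32, 32)
print(full / reduced)  # 4.0, i.e., roughly an N^2 reduction that propagates to later blocks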
The technical solutions of the embodiments of the present disclosure utilize these effects to accelerate the calculation. Specifically, a large-resolution input image or feature map can be converted into a plurality of small-resolution target input images or feature maps by the interlacing mapping into the interlaced space. In the interlaced space, the integrity of the input information is maintained while the resolution is reduced: a large-resolution input channel is converted into multiple small-resolution input channels, and these small-resolution channels can also be restored to the original large-resolution channel, thus maintaining information integrity while reducing the overall network resolution. An up resolution module is introduced at the end or in the middle of the network to maintain the resolution of the output of the original network.
In addition, the image processing device and method of the present disclosure can trade off the acceleration effect against the different precision requirements of tasks by controlling which part of the small-resolution channels is sent to the network.
The image processing device and method of the present disclosure differ from existing methods that only reduce the size of the model, and can substantially reduce online running memory and storage requirements. On the one hand, the memory usage of the computing task on devices can be reduced; on the other hand, the memory resource requirements of the computing task on the cloud can also be relieved, easing the memory resource burden. For example, the image processing device and method of the present disclosure utilize the resolution propagation effect in the convolutional neural network and the information-preserving characteristic of the interlaced space to achieve a speed improvement of about N^2 times for the target convolutional neural network and to save a large amount of data memory. In particular, the present disclosure proposes a feasible acceleration scheme for complex pixel-level tasks.
On the other hand, training and testing of many tasks, for example image segmentation tasks, are limited by CPU and GPU memory resources. The input of the model has a fixed and limited image resolution, e.g., 320*320, and a large image must be scaled or sliced before being sent to the neural network. The image processing device and method of the present disclosure make it practical to process large images.
Moreover, the present disclosure does not conflict with existing classical methods and can be used together with them.
Heretofore, the embodiments of the present disclosure have been described in detail in conjunction with the accompanying drawings. Based on the above description, those skilled in the art should have a clear understanding of the image processing device and method of the present disclosure.
It should be noted that the implementations that are not shown or described in the drawings or the description are all known to those skilled in the art and are not described in detail. In addition, the above definitions of the various elements and methods are not limited to the specific structures, shapes or manners mentioned in the embodiments, and those skilled in the art can simply modify or replace them.
Of course, the method of the present disclosure also includes other steps according to actual demands, and since they are not related to the innovation of the present disclosure, they are not described here.
The details, technical solutions, and beneficial effects of the present disclosure are described in detail in the above embodiments. It is to be understood that the above description covers only embodiments of the present disclosure and is not intended to limit the disclosure. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.
Priority application: 201811049734.3 | Date: Sep 2018 | Country: CN | Kind: national
Filing document: PCT/KR2019/011557 | Filing date: 9/6/2019 | Country: WO | Kind: 00