The present disclosure relates generally to image processing, and more particularly, to a method and an electronic device for performing image processing using artificial intelligence.
In the art of image processing, editing an image may cause image content to be lost; that is, the editing may leave regions of the image with missing content. As a result, blurred or background-cluttered images may need to be processed. Related image processing operations generally rely on image recovering techniques; however, the related image recovering techniques may have poor performance.
Embodiments of the present disclosure provide an image processing method, apparatus, electronic device, computer readable storage medium, and computer program product that may address the technical problem of poor performance of image processing in the related technology.
According to an aspect of the disclosure, an image processing method includes acquiring a target image based on an editing operation on an original image, wherein the target image includes an unknown region formed by the editing operation. In an embodiment, the image processing method includes performing a filling processing operation on the target image to obtain a first filled image including a first target region, wherein the first target region corresponds to the unknown region. In an embodiment, the image processing method includes identifying a target patch based on the first filled image, wherein the target patch corresponds to at least a portion of the first target region. In an embodiment, the image processing method includes calculating a similarity value between the target patch and at least one patch related to the first filled image using a first artificial intelligence (AI) model. In an embodiment, the image processing method includes determining a target residual patch corresponding to the target patch based on the similarity value. In an embodiment, the image processing method includes generating a processing result image based on the target residual patch.
According to an aspect of the disclosure, an electronic device includes a memory storing one or more instructions, and at least one processor communicatively coupled to the memory. In an embodiment, the at least one processor is configured to execute the one or more instructions to acquire a target image based on an editing operation on an original image, wherein the target image includes an unknown region formed by the editing operation. In an embodiment, the at least one processor is configured to execute the one or more instructions to perform a filling processing operation on the target image to obtain a first filled image including a first target region, wherein the first target region corresponds to the unknown region. In an embodiment, the at least one processor is configured to execute the one or more instructions to identify a target patch based on the first filled image, wherein the target patch corresponds to at least a portion of the first target region. In an embodiment, the at least one processor is configured to execute the one or more instructions to calculate a similarity value between the target patch and at least one patch related to the first filled image using a first artificial intelligence (AI) model. In an embodiment, the at least one processor is configured to execute the one or more instructions to determine a target residual patch corresponding to the target patch based on the similarity value. In an embodiment, the at least one processor is configured to execute the one or more instructions to generate a processing result image based on the target residual patch.
According to an aspect of the disclosure, there is provided a non-transitory computer readable storage medium storing a program executable by at least one processor to perform an image processing method according to at least one of the above-described embodiments.
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure, the following is a brief description of the accompanying drawings that are necessary to describe the embodiments of the present disclosure.
Embodiments of the present disclosure are described below in connection with the accompanying drawings in the present disclosure. It should be understood that the embodiments set forth below in conjunction with the accompanying drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present disclosure and do not constitute a limitation of the technical solutions of the embodiments of the present disclosure.
It will be understood by those skilled in the art that, unless specifically stated, the singular forms “a”, “an”, “said” and “the” used herein may further comprise plural forms. It should be further understood that the terms “includes” and “comprises” as used in embodiments of the present disclosure mean that the corresponding features may be implemented as the features, information, data, steps, operations, elements, and/or components presented herein, but do not exclude implementation as other features, information, data, steps, operations, elements, components and/or combinations thereof supported in the art. It is to be understood that when an element is referred to as being “connected” or “coupled” to another element, the element may be directly connected or coupled to the other element, or the element and the other element may be connected via an intermediate element. Moreover, “connected” or “coupled” as used herein may comprise a wireless connection or wireless coupling. The term “and/or” as used herein indicates at least one of the items defined by the term; for example, “A and/or B” indicates an embodiment of “A”, an embodiment of “B”, or an embodiment of “A and B”.
Throughout the specification, the term “unknown region” may be understood as a region that does not comprise image information or data. In some embodiments, the unknown region is a region to be filled by image generation processing or image filling processing.
Throughout the specification, the terms “first”, “second”, “third”, and so on may be used to distinguish elements or configurations in the content of the specification.
Throughout the specification, the term “patch” may be understood as an image that is at least a part, a portion or a piece of a whole image.
Throughout the specification, the term “residual” may be understood as an element constituting an image content. In some embodiments, the term “residual” may be understood as an element representing high frequency texture of an image.
To make the objects, technical solutions and advantages of the present disclosure clearer, embodiments of the present disclosure are described in further detail below in conjunction with the accompanying drawings.
The following is a description of the relevant technology involved in the present disclosure.
Artificial Intelligence (AI) refers to the theory, methods, technologies, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that may respond in a manner similar to human intelligence. Artificial intelligence also studies the design principles and implementation methods of various intelligent machines, so as to make the machines capable of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields and comprises both hardware-level technology and software-level technology. Basic AI technology generally comprises technologies such as sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly comprise computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning, among other major directions. The present disclosure may involve computer vision, machine learning/deep learning, and other technologies.
Machine Learning (ML) is an interdisciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and many other disciplines. ML studies how computers may simulate or implement human learning behaviors, so as to acquire new knowledge or skills and to reorganize existing knowledge structures to continuously improve their performance. Machine learning is a core of artificial intelligence and is the fundamental way to make computers intelligent, and its applications span all areas of artificial intelligence. Machine learning and deep learning usually comprise techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
Computer Vision (CV) is the science of how to make machines “see”. For example, CV refers to machine vision that uses cameras and computers instead of human eyes to identify, track, and measure targets, and further performs graphics processing so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems capable of acquiring information from images or multidimensional data. Computer vision technologies typically comprise technologies such as image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional (3D) object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, intelligent transportation, and/or other common biometric technologies such as face recognition and fingerprint recognition.
Machine learning/deep learning techniques may be used in the present disclosure to address the technical problems of high computational complexity and long processing time in image processing.
The background of the embodiments of the present disclosure may be described below in terms of example scenarios.
In one example scenario, after a user edits an image (e.g., rotates the image, changes the angle of the image, zooms in/out of the image, performs a perspective transformation of the image, etc.), an image expansion technique may be used to generate the content of the image that is missing, or that is to be expanded, due to the editing, as described in reference to
In an alternative or additional example scenario, when a user takes an image with some targets in the background, the user may remove the targets from the background by using image processing functions (e.g., image target removal techniques). For example, if the image taken by the user comprises, in addition to the user (e.g., in the foreground), a plurality of other people in the background, the user may select the people in the background to remove them, and the terminal device may then recover the removed area according to the background information to achieve the removal of the targets and obtain a result image in which the specific targets are removed.
In another alternative or additional example scenario, after image processing for a face image comprising a mole, the mole is removed, and the processed face image is obtained.
Image target removal techniques may be principally based on image recovering techniques. The region to be removed is treated as the missing region of the image, and the missing region is recovered from the background information by using the image recovering technique, thereby achieving target removal of the image. However, the image recovering operation in the related technology may be computationally complex and time-consuming, resulting in substantial consumption of processing resources (e.g., processor, memory) and a degraded user experience.
Embodiments of the present disclosure provide an image processing method, apparatus, electronic device, computer readable storage medium, and computer program product. Specifically, embodiments of the present disclosure may learn, through a network model, how to calculate the similarity between images during image processing, and may perform the image processing through the network model, which may reduce the computational complexity, shorten the processing time, reduce the memory occupied by the image processing, and improve the user experience.
The technical solution of an embodiment of the present disclosure and the technical effect produced by the technical solution of the present disclosure are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be cross-referenced, borrowed or combined with each other, and the descriptions of the same terms, similar features and similar implementation steps, etc. in different embodiments will not be repeated for the sake of brevity.
An image processing method is provided in embodiments of the present disclosure, as shown in
As shown in
In step S101, the image processing method 10 may include performing image extraction on the acquired first image to obtain a plurality of second images.
Specifically, the first image is an image to be processed. For example, the image after the second filling processing may be an image after a recovering processing or an image after an image expansion processing. Alternatively or additionally, when the first image is an image after the recovering processing, it may further be an image input by the user after shooting. If the first image is improperly focused during shooting so that some regions of the image are blurred, the image may be input into the image processing model provided in the present disclosure for processing. When the first image is an image that has been image expansion processed, the first image may further be an image edited by the user. If a part of the image content is generated by image expansion after the image is edited, so that this part does not comprise original image information and is blurred, the image may be input into the image processing model provided in the present disclosure for processing, so as to improve the clarity of the image.
The image extraction on the first image may be achieved by splitting the first image, and the plurality of second images resulting from the processing do not interfere with each other. That is, the second images are independent of one another and do not overlap in pixels.
Alternatively or additionally, in order to reduce the processing error caused by the splitting boundaries, which may degrade the image processing accuracy, an extension length may be configured for the boundaries of the second images. Assuming that a fixed size X×X is configured to split the first image, after the positions of the second images are determined by the fixed size, the fixed size may be extended to obtain an extended size (X+1)×(X+1). A plurality of second images may then be extracted from the first image based on the extended size, such that there is an overlap of 1 pixel between adjacent second images. The above extension length may be configured according to the actual requirements, and the present disclosure is not limited in this regard.
In Step S102, the image processing method 10 may include determining, in the first image, target images associated with the second images, based on the image similarity value determined by the convolutional neural network.
For example, the similarity between each second image and image regions of the same size as the second images in the first image may be calculated one by one by the convolutional neural network, based on the size of the second images, and the target images associated with the second images may then be determined.
In Step S103, the image processing method 10 may include performing a first filling processing of the first image, based on the second images and the target images.
For example, the first filling processing is an operation of pasting a corresponding image to a specific location in the first image. In some embodiments, the first filling processing is based on the similarity between the second images and the image regions of the same size as the second images in the first image. Because an embodiment of the present disclosure uses the image information originally included in the first image to process the first image, the first image is filled based on the extracted second images and the target images (which may have a higher similarity to the second images) associated with the second images, so as to improve the smoothness of the result obtained from processing the first image.
The process of image extraction in some embodiments is described in further detail below.
In an embodiment, in the step S101, performing image extraction on the acquired first image so as to obtain a plurality of second images comprises the following steps A1-A2 (not shown).
In Step A1, the image extraction on the acquired first image may include acquiring a first image processed after a second filling processing. The second filling processing may include determining, in the image to be filled, a target region, in response to an editing operation, and performing a second filling processing on the target region by using image information included in the image to be filled, to obtain the first image.
For example, the second filling processing is an image recovering processing (e.g., as shown in
Alternatively or additionally, the image recovering performed prior to the image processing of the first image may further be implemented using deep learning-based networks and generative adversarial networks. The learning-based method may use a U-net (fully convolutional) network structure, and may achieve image recovering by using special convolutional operations, based on the U-net network, that are specific to image recovering. The learning-based method uses the special convolution operations for image recovering, and requires inputting an image comprising the missing regions and a mask image generated from the missing regions. The mask image has only the values 0 and 255 and has the same pixel size as the original image, with each position corresponding to a position of the original image; the value of the mask image is 0 in the missing region and 255 in the lossless region. An image recovering flow using the learning-based method may be as shown in
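By way of illustration only, such a mask image may be constructed as sketched below; the function name, the use of NumPy, and the rectangular missing region are illustrative assumptions and not part of the learning-based network itself.

```python
import numpy as np

def build_mask(image_h, image_w, missing_region):
    # The mask has the same pixel size as the original image:
    # missing-region pixels are 0, lossless (known) pixels are 255.
    mask = np.full((image_h, image_w), 255, dtype=np.uint8)
    y0, y1, x0, x1 = missing_region  # illustrative rectangular missing region
    mask[y0:y1, x0:x1] = 0
    return mask

# Example: a 512x512 image whose lower-right quarter is missing.
mask = build_mask(512, 512, (256, 512, 256, 512))
```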
For example, the second filling processing may include the image expansion process, as shown in
In Step A2, the image extraction on the acquired first image may include performing image extraction from the target region of the first image to obtain a plurality of second images.
Specifically, as shown in
After acquiring the first image, the network may split the first image into an image comprising the target region and an image corresponding to the target region. Specifically, since an embodiment of the present disclosure performs image processing for the target region in the first image, the image extraction may optionally be performed only for the target region in the first image.
Alternatively or additionally, in step A1, determining the target region in the image to be filled, in response to the editing operation, may include determining the image to be filled, based on the image before editing, in response to the editing operation, and determining, in the image to be filled, a region that does not comprise any image information, as the target region.
As shown in
In some embodiments, as shown in
In Step A0, the image processing method 10 may include cropping and/or scaling the image to be filled, based on a predetermined image area.
For example, an embodiment of the present disclosure performs an adaptive image cropping operation in the pre-processing stage, primarily according to the shape of the target region determined by the user after editing the image.
In some embodiments, in step A0, cropping and/or scaling the image to be filled based on a predetermined image area comprises steps A01 and A02.
In Step A01, if an area of the image to be filled is larger than the predetermined image area, cropping the image to be filled.
For example, whether to perform the cropping operation may be determined according to the size of the image. If the image is smaller than the predetermined image area (or size) of 512×512, the image is directly input to the subsequent image generative network; otherwise, an adaptive image cropping operation is performed. The image area is only an example, which may be adjusted according to the actual requirements, and the present disclosure is not limited in this regard.
In some embodiments, in step A01, the cropping of the image to be filled comprises steps A011 and A012 (not shown).
Step A011 may include calculating a maximum connected component in a mask image corresponding to the image to be filled.
Step A012 may include cropping the image to be filled, based on minimum bounding squares corresponding to the maximum connected component, to obtain at least one cropped image to be filled.
As shown in
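By way of illustration only, steps A011 and A012 may be sketched as follows using OpenCV connected-component analysis; the helper name, the clipping behavior, and the returned values are illustrative assumptions rather than a definitive implementation.

```python
import cv2
import numpy as np

def crop_around_missing_region(image, mask):
    # Missing pixels have mask value 0; locate them as a binary map.
    missing = (mask == 0).astype(np.uint8)
    num, labels, stats, _ = cv2.connectedComponentsWithStats(missing, connectivity=8)
    if num < 2:
        return image, None  # no missing region, no crop needed
    # Select the maximum connected component (label 0 is the background).
    target = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    x, y = stats[target, cv2.CC_STAT_LEFT], stats[target, cv2.CC_STAT_TOP]
    w, h = stats[target, cv2.CC_STAT_WIDTH], stats[target, cv2.CC_STAT_HEIGHT]
    # Minimum bounding square around the component, clipped to the image borders.
    side = min(max(w, h), min(image.shape[0], image.shape[1]))
    y0 = max(0, min(image.shape[0] - side, y + h // 2 - side // 2))
    x0 = max(0, min(image.shape[1] - side, x + w // 2 - side // 2))
    return image[y0:y0 + side, x0:x0 + side], (y0, x0, side)
```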
Returning to
For example, the image may be cropped into a plurality of square regions according to the geometry of the target regions. The square images larger than the predetermined image area of 512×512 are then scaled down to 512×512, and the square images smaller than (or equal to) 512×512 (NO in step 1106 of
As shown in
In some embodiments, the objective of the pre-processing is to minimize the image scaling ratio, so as to prevent high frequency texture information from being excessively lost in the image generative network stage.
In some embodiments, as shown in
In some embodiments, it is considered that the AOT module, in the network based on the U-net structure of the AOT module, uses only a limited number of scales of dilated convolutions, which are combined by a concatenating operation and output by an element-wise multiplication operation. However, because some feature points cannot participate in the computation when the dilated convolutions extract features, regular texture features cannot be extracted, and the expanded result therefore cannot recover semantic textures. A normal convolution sequentially calculates all the data in the sliding window, whereas a dilated convolution selects only some of the data in the sliding window for calculation. For example, a dilated convolution with a dilation rate of 2 only calculates the data at the corresponding positions and not at the other positions. Therefore, the spatial structure information of the object in the original image is lost in the high-level semantic feature map, and it is difficult to generate regular textures in the expanded result. To solve this problem, embodiments of the present disclosure propose an improved image complementation network to perform the second filling processing.
For example, in step A1, performing second filling processing on the target region to obtain a first image by using the image information included in the image to be filled comprises step A11 (not shown).
Step A11 may include sequentially performing at least one down-sampling operation and at least one up-sampling operation, for the image to be filled, to obtain the first image.
In some embodiments, each time a down-sampling operation is performed, a dilated convolution operation may further be performed on the feature map obtained from the down-sampling operation, based on different dilation rates.
For example, some embodiments may use a Multi-Dilated Residual Block (MDRB) instead of the AOT module, and a network structure of an image complementation network as shown in
Alternatively or additionally, the dilated convolution operation is performed on the feature map obtained from the down-sampling, based on different dilation rates, comprising steps A111-A113 (not shown).
Step A111 may include splitting the down-sampled feature map into at least one sub-feature map, based on a predetermined number of dilated convolutions.
For example, as shown in
In the branch for extracting features, the data are convolved, new feature maps are extracted, and the newly extracted feature maps are output. These feature maps are divided into n groups according to the channel dimension, which may be denoted as o1, o2, o3, . . . , on. The input feature maps are divided into a plurality of groups according to the channel dimension. For example, if the input data dimensions are (1, 128, 128, 64), the feature maps are split according to the predetermined number of dilated convolutions. If 8 dilated convolutions are predetermined, the feature maps may be divided into 8 sub-feature maps (e.g., each of dimensions (1, 128, 128, 8)). If 16 dilated convolutions are predetermined, the feature maps may be divided into 16 sub-feature maps (e.g., each of dimensions (1, 128, 128, 4)). It may be understood that the larger the predetermined number of dilated convolutions, the more texture information may be compensated by the operation of the multi-dilated convolution superimposed residual block. However, when configuring the number of dilated convolutions, the present disclosure further takes the receptive field into account, and thus the number of dilated convolutions may be configured according to the actual requirements, by making a trade-off between the texture information that may be compensated and the receptive field.
Step A112 may include performing feature extraction for different sub-feature maps by using different dilation rates.
For example, each sub-feature map is input to the dilated convolution with a different dilation rate for feature extraction. As shown in
Step A113 may include determining the outputs of the dilated convolutions, based on the feature extraction result of the sub-feature maps.
For example, more of the structural texture information lost by a dilated convolution may be obtained by collecting the feature extraction results of the sub-feature maps. The principle of the superimposed residual operation is that the positions of features extracted by convolutions with different dilation rates differ, as do the positions of features not extracted; by using more types of dilation rates and collecting the information on these differences, the structural texture information lost by the dilated convolution may be recovered to the maximum extent, as shown in
Alternatively or additionally, determining the output of the dilated convolution based on a feature extraction result of the sub-feature maps, comprises performing a summing operation for a feature extraction result of the sub-feature maps so as to obtain a plurality of pieces of superimposed information, concatenating the pieces of superimposed information, and determining the output of the dilated convolution, based on the concatenated superimposed information. The summing operation comprises summing a feature extraction result of a current sub-feature map with a result obtained from a previous summing operation, so as to obtain the superimposed information associated with the current sub-feature map.
For example, as shown in
Alternatively or additionally, in order to more quickly achieve collecting the feature extraction result of the sub-feature maps, the above residual superimposed operations may be completed one by one, based on the order in which the sub-feature maps are obtained by splitting in step A111 (not shown).
Therefore, the feature maps generated by these dilated convolutions are concatenated in the channel dimension and are then input to the subsequent convolutions, which are combined with another branch convolution to complete the attention mechanism operation.
In the branch used for the attention mechanism, the input feature map is input to a convolution whose output feature map g has the same dimensions as the input data, and this branch outputs through an activation function (sigmoid) so as to ensure that the values of the feature map of the attention branch range from 0 to 1. Alternatively or additionally, an element-wise multiplication operation may be applied to the feature map f and the feature map g so as to implement the attention mechanism. The values 0 to 1 in the feature map g are the weights with which the features at the corresponding positions in the feature map f participate in the image expansion. That is, for the final image expansion problem, the feature map g indicates which features in the feature map f are used and with what weights.
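By way of illustration only, the multi-dilated residual block described above (group splitting, dilated convolutions with different dilation rates, superimposed residual summing, channel-wise concatenation, and the sigmoid attention branch) may be sketched in PyTorch as follows; the channel counts, dilation rates, kernel sizes, and the absence of any outer skip connection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiDilatedResidualBlock(nn.Module):
    # Illustrative MDRB sketch: the feature branch splits its feature map into groups,
    # applies dilated convolutions with different dilation rates, superimposes the
    # results residually, concatenates them, and gates them with an attention branch.
    def __init__(self, channels=64, dilation_rates=(1, 2, 4, 8)):
        super().__init__()
        n = len(dilation_rates)
        assert channels % n == 0
        group = channels // n
        self.pre = nn.Conv2d(channels, channels, 3, padding=1)
        self.dilated = nn.ModuleList(
            [nn.Conv2d(group, group, 3, padding=r, dilation=r) for r in dilation_rates]
        )
        self.post = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Conv2d(channels, channels, 3, padding=1)  # attention branch

    def forward(self, x):
        feats = torch.chunk(self.pre(x), len(self.dilated), dim=1)  # o1 ... on
        outs, acc = [], None
        for o, conv in zip(feats, self.dilated):
            # Superimposed residual operation: sum the current extraction result
            # with the result of the previous summing operation.
            acc = conv(o) if acc is None else conv(o) + acc
            outs.append(acc)
        f = self.post(torch.cat(outs, dim=1))   # feature branch output f
        g = torch.sigmoid(self.gate(x))         # attention weights g in [0, 1]
        return f * g                            # element-wise gated output
```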
In an embodiment, in step A2, performing image extraction from the target region of the first image to obtain a plurality of second images comprises the following step A31 (not shown).
Step A31 may include performing image extraction from the target region of the first image to obtain a plurality of second images, based on a predetermined first image size and a predetermined image extraction order.
The image extraction order comprises at least one of from top to bottom, and from left to right.
For example, the first image size affects the fine granularity of image information splitting of the target region, which may be adjusted according to the actual requirements in some embodiments and is not limited herein. The larger the first image size, the greater the fine granularity and the relatively longer the computation time. Alternatively or additionally, the smaller the first image size, the smaller the fine granularity and the relatively shorter the computation time.
Alternatively or additionally, the first image size is related to the target region, and in order to make the image processing network provided by the present disclosure have better processing performance, the image size of the region to be processed may be predetermined. If the size of the currently determined target region is different from the size of the predetermined region, the image of the currently determined target region to be processed may be transformed in size.
In some embodiments, the image extraction order may further be adjusted based on the first image size and the size of the target region. If it is determined, according to the target region and the first image size, that the number of split second images is larger than 3, the image extraction operation may be performed from top to bottom and from left to right, based on the predetermined image extraction order.
Alternatively or additionally, the second images extracted based on the predetermined image extraction order have the characteristics of sequential order and no overlap.
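By way of illustration only, the ordered, non-overlapping extraction of the second images may be sketched as follows; the 32×32 patch size and the 512×512 target region are illustrative assumptions.

```python
import numpy as np

def extract_second_images(target_region, patch_size):
    # Extract non-overlapping patches in the predetermined order:
    # from top to bottom and, within each row, from left to right.
    h, w = target_region.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, patch_size):      # top to bottom
        for x in range(0, w - patch_size + 1, patch_size):  # left to right
            patches.append(target_region[y:y + patch_size, x:x + patch_size])
    return patches

# Example: a 512x512 target region and a 32x32 first image size yield
# 16 x 16 = 256 second images in sequential order with no overlap.
region = np.zeros((512, 512, 3), dtype=np.uint8)
assert len(extract_second_images(region, 32)) == 256
```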
Adapted to the image processing method of the above embodiment, in some embodiments, a contextual-wise Attention Module (C-wAM) is provided, as shown in
The specific process of performing the first filling processing for the second images in some embodiments is described below.
In an embodiment, in the step S102, performing the first filling processing for at least one second image comprises the following step B1 (not shown).
Step B1 may include performing the first filling process for the second images sequentially, based on the image extraction order.
For example, in order to potentially reduce the time consumed for performing the first filling processing operation on the first image, such as the complexity of calculating the filling positions, the first filling processing may be performed on the second images in the predetermined image extraction order. Assuming that 4 second images are extracted from the target region, the order of the second images may be as follows: second image A, second image B, second image C, and second image D, with positions at the upper left, upper right, lower left, and lower right, respectively. At this time, when the first filling processing is performed, the second images at the corresponding positions are obtained according to this image order, in order to perform the first filling processing.
In some embodiments, because the second images are extracted based on the image extraction order, the order in which the similarity information is output is consistent with the extraction order of the second images. This processing facilitates quickly determining the filling positions corresponding to the second images in the first image.
In an embodiment, in the step S102, determining, the target images associated with the second images in the first image, based on the image similarity value determined by the convolutional neural network, comprises: for each second image, performing the following operations of steps C1-C3 (not shown).
Step C1 may include tiling the second image to obtain a third image with an image size corresponding to the target region.
For example, the order of tiling the images may be the same as the order of image extraction. For example, when 9 second images are extracted from the target region, and the second image at the upper-left image position (the first in order) is tiled, a 3×3 third image may be obtained by the tiling; that is, the third image comprises 9 copies of that first second image. Accordingly, at this time, the target region corresponds to 9 different second images, whereas the third image comprises 9 identical second images. If the first filling processing is an operation performed for all the second images, then, when 9 second images are extracted from the target region, 9 third images are obtained accordingly.
The size of the target region may be the same as the image size of the third image.
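By way of illustration only, step C1 may be sketched as follows; the sizes used are illustrative assumptions.

```python
import numpy as np

def tile_second_image(second_image, target_h, target_w):
    # Tile one second image so that the tiled (third) image has the same size
    # as the target region; e.g. a 3x3 tiling when 9 second images were extracted.
    ph, pw = second_image.shape[:2]
    reps_y, reps_x = target_h // ph, target_w // pw
    return np.tile(second_image, (reps_y, reps_x, 1))

patch = np.random.rand(32, 32, 3)
third_image = tile_second_image(patch, 96, 96)   # 3x3 tiling of the same patch
assert third_image.shape == (96, 96, 3)
```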
Step C2 may include calculating, by the convolutional neural network and based on the third image, similarity values of the second image relative to the other second images, so as to obtain a similarity feature map.
Specifically, as shown in
As shown in
Step C3 may include determining, based on the similarity feature map, a target position in the target region for the target image having the highest similarity to that second image.
As shown in
In step S103, performing a first filling processing of the first image, based on the second images and the target images, comprises a step C4 (not shown).
Step C4 may include filling the second images to the target position.
For example, when the first filling processing is performed, the target region may be a blank region (e.g., a region that does not comprise any image information), as shown in
In some embodiments, considering that steps C3 and C4 are performed based on the target image that is most similar to the second images, it may occur that some positions in the target region are not filled by any second image of the third image. In this case, the second image corresponding to such a position in the target region remains filled at that position; that is, the second image at that position is not replaced.
For example, as shown in
Alternatively or additionally, the target region may further comprise corresponding image information when the first filling processing is performed. Therefore, only the image information at the image positions that need to be filled is changed.
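By way of illustration only, steps C2-C4 may be sketched as follows once the similarity feature maps are available: the position of highest similarity is taken as the target position, and the second image is filled (pasted) there. The data layout of the similarity maps (one score per patch slot of the target region) is an illustrative assumption.

```python
import numpy as np

def fill_by_similarity(target_region, second_images, similarity_maps, patch_size):
    # For each second image, its similarity feature map is assumed to hold one
    # score per patch slot of the target region; the slot with the highest score
    # is taken as the target position, and the second image is pasted there.
    filled = target_region.copy()
    for patch, sim_map in zip(second_images, similarity_maps):
        row, col = np.unravel_index(np.argmax(sim_map), sim_map.shape)
        y0, x0 = row * patch_size, col * patch_size
        filled[y0:y0 + patch_size, x0:x0 + patch_size] = patch
    return filled
```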
In an embodiment, steps C01-C02 (not shown) are further included before performing the step C2.
Step C01 may include transforming the third image based on the predetermined second image size, to obtain the transformed third image.
Step C02 may include down-sampling the transformed third image based on the predetermined first image size and the predetermined second image size, to obtain the down-sampled third image.
The first image size may be a size for extracting the second images from the target region of the first image.
As shown in
Alternatively or additionally, the image size of the input to the similarity calculation network may be configured as 256×256 (e.g., the predetermined second image size), and if the image size of the network input does not match the configured size, step C01 may be performed to transform the image and obtain the transformed third image, and the similarity calculation may then be performed based on the transformed third image.
That is, the number of down-samplings of the similarity calculation network depends on the first image size. Assuming that the input image size of the image recovering network is 512×512 and the first image size is 32×32, the height and width of the feature map for the second images output by the similarity calculation network is (512/32)×(512/32)=16×16. That is, the similarity calculation network needs to down-sample an input with an image size of 256×256 into a 16×16 feature map; that is, 4 down-sampling operations are performed.
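By way of illustration only, the number of down-sampling operations may be derived from the sizes given above as follows; the variable names are illustrative.

```python
import math

recover_input, first_image_size, sim_input = 512, 32, 256
feature_hw = recover_input // first_image_size                # 512 / 32 = 16
num_downsamplings = int(math.log2(sim_input // feature_hw))   # log2(256 / 16) = 4
```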
In some embodiments, adapting the above C-wAM network module, an ultra-high resolution recovering network is further provided, as shown in
In some embodiments, the image processing method further comprises step S103 of outputting the target image obtained after the image processing, to a refine recovering module of the image processing network.
The image processing network (C-wAM) provided in some embodiments may further be fused into any image recovering network as a separate module, or as a post-processing operation of any image recovering network, so as to improve the fine granularity and the clarity of the image processed by the recovering network.
Alternatively or additionally, as shown in
Alternatively or additionally, as shown in
The following description is provided for a processing flow and a training process of the C-wAM network provided by the embodiments of the present disclosure.
The C-wAM network provided in some embodiments (e.g., as shown in
As shown in
By the patch network, all tiled patch images comprising known images are transformed into Pnum×Pnum feature maps M1 in a shape of (1, Pnum, Pnum, Pdeep), where Pdeep is the depth of the configured feature map; the larger the Pdeep value, the more accurate the obtained similarity. After that, the dimension of the Pdeep axis of M1 is reduced. The dimension of the Pdeep axis of M1 may be reduced to 1 by taking the average or the maximum along that axis, so as to obtain Pnum×Pnum feature maps M2 in a shape of (1, Pnum, Pnum, 1).
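By way of illustration only, the reduction of the Pdeep axis of M1 to obtain M2 may be sketched as follows; the Pnum and Pdeep values are illustrative assumptions.

```python
import numpy as np

def reduce_depth(m1, mode="mean"):
    # m1: feature map of shape (1, Pnum, Pnum, Pdeep) produced by the patch network.
    # Reduce the Pdeep axis to 1 by taking the average (or the maximum) along it,
    # yielding M2 of shape (1, Pnum, Pnum, 1).
    if mode == "mean":
        return m1.mean(axis=-1, keepdims=True)
    return m1.max(axis=-1, keepdims=True)

m1 = np.random.rand(1, 16, 16, 64)      # Pnum = 16, Pdeep = 64 (illustrative)
m2 = reduce_depth(m1)
assert m2.shape == (1, 16, 16, 1)
```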
As shown in
Referring to Equation 1, MPatch
The operation of labeling the mask is as shown in Equation (2) below.
C-wAMi=C-wAMi×Matt+a×(1−Matt) [Eq. 2]
Referring to Equation 2, C-wAMi denotes the i-th feature map M2 in a shape of (1, Pnum, Pnum, 1). After labeling the Pnum×Pnum feature maps M2, the Pnum×Pnum feature maps C-wAMi in a shape of (1, Pnum, Pnum, 1) are concatenated at the fourth dimension to obtain the A-wSM in a shape of (1, Pnum, Pnum, Pnum×Pnum). The a is a configured parameter.
Ires is denoted as the residual image, where h and w are the corresponding height and width of this residual image. The Ires is split into Pnum×Pnum patch images, where the image size of each patch image is (h/Pnum)×(w/Pnum).
The (x, y) is denoted as coordinates of each patch image in a matrix Pnum×Pnum, where an origin of the coordinates (x, y) is in the upper left corner. The Fas represents the A-wSMs, which comprise Pnum×Pnum attention score maps in a shape of (1, Pnum, Pnum, 1). The Fas
In an embodiment, the process of pasting similar patch images is as follows.
First, the parameters are defined as follows: Ires is an input residual image; Matt is a mask size of the attention score map; Pnum is the number of patch images per column or per row; Fas is an attention score map; and Ores is an aggregated residual image.
Upon pasting, for x in the value range (0, Pnum) and y in the value range (0, Pnum): when Matt at the position (x, y) is equal to 0, the patch at the position (x, y) of the Ires image is not in the missing region, so no operation is required and the patch at the next position is considered. Alternatively or additionally, when Matt at the position (x, y) is not equal to 0, the patch at the position (x, y) of the Ires image is a patch in the missing region, and thus the patch needs to be pasted. Then, extract the feature map Fas
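By way of illustration only, the pasting process described above may be sketched as follows; the assumed layout of Fas (one attention score map per patch position) and the selection of the maximum attention score are illustrative assumptions.

```python
import numpy as np

def aggregate_residual(i_res, m_att, f_as, p_num):
    # For each patch position (x, y): if m_att marks it as lying in the missing
    # region, look up its attention score map in f_as, take the most similar
    # position, and paste that patch of the residual image i_res into the
    # aggregated residual image o_res at (x, y).
    h, w = i_res.shape[:2]
    ph, pw = h // p_num, w // p_num
    o_res = i_res.copy()
    for x in range(p_num):
        for y in range(p_num):
            if m_att[x, y] == 0:
                continue                      # patch not in the missing region
            score_map = f_as[x, y]            # assumed shape (p_num, p_num)
            sx, sy = np.unravel_index(np.argmax(score_map), score_map.shape)
            src = i_res[sx * ph:(sx + 1) * ph, sy * pw:(sy + 1) * pw]
            o_res[x * ph:(x + 1) * ph, y * pw:(y + 1) * pw] = src
    return o_res
```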
A training method of C-wAM is given below.
Specifically, the loss functions used during training the C-wAM network are expressed as shown in Equation (3) below.
L=λ1Lprc+λ2Lstyle+λ3Latt+λ4L1+λ5Ladv [Eq. 3]
Referring to Equation 3, λ1 to λ5 are the weight coefficients of the respective losses, Lprc and Lstyle are the same losses as those of the partial convolution of the relevant technique, Latt is the attention loss provided in the present disclosure, L1 is the L1 loss, and Ladv is the adversarial loss of the Deepfillv2 image recovering.
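By way of illustration only, the weighted combination of Equation (3) may be computed as follows; the loss terms are assumed to be scalar values computed elsewhere, and no particular weight values are implied.

```python
def total_loss(l_prc, l_style, l_att, l_1, l_adv, lambdas):
    # Weighted sum of Equation (3): L = λ1·Lprc + λ2·Lstyle + λ3·Latt + λ4·L1 + λ5·Ladv.
    lam1, lam2, lam3, lam4, lam5 = lambdas
    return lam1 * l_prc + lam2 * l_style + lam3 * l_att + lam4 * l_1 + lam5 * l_adv
```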
The attention loss provided in some embodiments is described as follows. The attention loss is the sum of all magnitudes of the absolute differences between the C-wAM output and the ground truth (labeling), which approximates the A-wSM and which is generated by using the VGG (Visual Geometry Group) network. Given a group Igt={I1gt, . . . , Ingt} of approximate labeling, and the output Igen={I1gen, . . . , Ingen} of the patch network, the attention loss Latt is expressed as shown in Equation (4) below.
Referring to Equation 4, N denotes the number of all extracted patches; Iigen denotes the i-th A-wSM generated by the patch network; and Iigt denotes the i-th labeled feature map obtained by the VGG network. Iigt is expressed by the following Equation (5).
Iigt=M⊙α+|Ψl(Xgt)−Φil(Ψl(Xipt))|⊙(1−M) [Eq. 5]
Referring to Equation 5, α is a coefficient for preventing the mask region value from interfering with the attention score feature map configuration, M is a mask in a size of Pnum×Pnum obtained by the extracted patch resizing method, Xgt is the original image in a size of 256×256, ⊙ denotes element-wise multiplication, Ψl is the feature map of the pooling layer l when Xgt is given, where l equals log2(256/Pnum), and Xipt is the output of the image recovering network generated from the resized 256×256 image, where Φil is defined in Equation (6) shown below.
Φil(I)=Tile(Extracti(I)) [Eq. 6]
Referring to Equation 6, Extracti(I) denotes the i-th extracted patch in a size of 1×1, and Tile(I) denotes tiling the patch into a size of Pnum×Pnum.
For example, some embodiments train the provided C-wAM end-to-end by the above defined attention loss function.
In some embodiments, considering problems such as the low robustness of determining similarity information between images according to cosine similarity, the increasing computational overhead in processing the similarity-result feature map as the density of patches (second images) increases under the contextual attention mechanism, and the increased computational amount of pasting patches based on the contextual attention mechanism, some embodiments provide an improved post-processing technique compared to related post-processing techniques, as shown in
In some embodiments, in the step S102 of
Step D1 may include tiling the second image, to obtain a fourth image with the same image size as a size of the first image.
Step D2 may include determining a target image having the highest similarity to the second image in the first image by a convolutional neural network, based on the fourth image.
For example, the first image of 512×512 generated in the image complementation stage is scaled down to a size of 256×256, as shown in
As shown in
Alternatively or additionally, in step S103 of
Step D3 may include filling the target images to the corresponding positions of the second images in the first image.
For example, as shown in
In some embodiments, considering that a ghosting phenomenon may occur in some scenes because of the direct pasting of high-frequency information, some embodiments further provide a high-frequency information enhanced network so as to reduce the effect of the ghosting phenomenon. The extracted generated high-frequency information image (the target image extracted from the first image) in a size of 512×512 is not as rich in high-frequency information as the original image (e.g., the image before editing), but it does not have the problem of semantic unreasonableness; thus, the high-frequency information enhanced network provided by the present disclosure may perform enhancement processing based on this high-frequency information image.
For example, after performing the first filling processing in step S103, steps E1-E4 are further included (not shown).
Step E1 may include acquiring the fifth images corresponding to regions where the plurality of the second images are located, based on the first image.
For example, the first image may be an image obtained by generative network processing or by image recovering processing (e.g., an image obtained by second filling processing). The fifth images corresponding to the target regions corresponding to all the second images may be acquired, based on the first image.
Alternatively or additionally, acquiring the fifth images corresponding to regions where the plurality of the second images are located, based on the first image, may include performing a scaling down and scaling up operation on the first image, so as to obtain a seventh image, determining an eighth image, based on the first image and the seventh image, and acquiring the fifth images corresponding to the plurality of second images, based on the eighth image.
For example, as shown in
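By way of illustration only, one plausible reading of obtaining the seventh and eighth images is to blur the first image by scaling it down and back up, and to take the difference as the high-frequency (eighth) image; the scale factor and the use of bilinear interpolation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def extract_high_frequency(first_image, scale=0.5):
    # first_image: tensor of shape (N, C, H, W).
    # Scaling down and back up yields a blurred (seventh) image; the difference
    # with the first image keeps mainly high-frequency texture (eighth image).
    _, _, h, w = first_image.shape
    down = F.interpolate(first_image, scale_factor=scale, mode="bilinear", align_corners=False)
    seventh = F.interpolate(down, size=(h, w), mode="bilinear", align_corners=False)
    eighth = first_image - seventh        # high-frequency residual
    return seventh, eighth
```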
Step E2 may include acquiring sixth images corresponding to a plurality of the second images, in the first image after the first filling process.
For example, as shown in
Step E3 may include determining the target filled image, based on the fifth images and the sixth images.
For example, a convolution operation may be performed on the fifth images and the sixth images by the enhanced network, so as to obtain the target filled image. The high frequency information enhanced network has a U-net structure and may comprise gated convolutions.
As shown in
Alternatively or additionally, the enhanced network comprises a down-sampling module, a residual block and an up-sampling module connected sequentially. The down-sampling module comprises a plurality of first gated convolution layers connected sequentially. The residual block comprises a plurality of second gated convolution layers connected sequentially. The second gated convolution layers further comprise a skip connection structure between them. The up-sampling module comprises a plurality of third gated convolution layers connected sequentially. The third gated convolutional layers are connected to the first gated convolutional layers in the corresponding layers.
As shown in
In some embodiments, the use of a plurality of residual connections (skip connection) in the high frequency information enhanced network may retain as much high frequency information in the input data as possible. As shown in
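By way of illustration only, a gated convolution layer of the kind the enhanced network may comprise can be sketched in PyTorch as follows; the kernel size, activation, and channel configuration are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    # Illustrative gated convolution: one branch produces features, a parallel
    # branch produces a sigmoid gate, and the output is their element-wise product.
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        pad = kernel_size // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride, pad)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride, pad)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(self.feature(x)) * torch.sigmoid(self.gate(x))
```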
Step E4 may include determining an image obtained after the image processing, based on the target filled image and the first image.
For example, as shown in
An application example is given below in conjunction with
As shown in
Alternatively or additionally, when the terminal 100 is employed to perform the image processing method provided in some embodiments, the first image to be processed may be an image currently shot and input by the user, an image to be processed obtained from another terminal (e.g., a terminal employed by a user A transmits the image to be processed to the terminal employed by a user B via a connection such as Bluetooth), or the first image obtained from the server 200.
Alternatively or additionally, when the server 200 is employed to perform the image processing method provided by the embodiment of the present disclosure, the image to be processed is obtained from the terminal 100. And then, after the image processing step is performed by the server 200, the result image obtained by the image processing is fed back to the terminal 100.
As shown in
An embodiment of the present disclosure provides an image processing apparatus, as shown in
The image extraction module 101 is configured to perform image extraction on an acquired first image to obtain a plurality of second images; the similarity determination module 102 is configured to determine target images associated with the second images in the first image, based on the image similarity value determined by the convolutional neural network; and the image filling module 103 is configured to perform a first filling processing of the first image, based on the second images and the target images.
In an embodiment, the image extraction module 101, when configured to perform image extraction on the acquired first image to obtain a plurality of second images, is specifically configured to acquire a first image, which is an image processed by the second filling processing. The device 100 may further comprise an initial processing module configured to perform the second filling processing. For example, the initial processing module may be configured to determine, in the image to be filled, a target region, in response to an editing operation, and perform a second filling processing on the target region by using image information included in the image to be filled, so as to obtain the first image.
The image extraction module 101 may be further configured to perform an image extraction from the target region of the first image to obtain a plurality of second images.
In an embodiment, when the initial processing module is configured to determine the target region in the image to be filled in response to the editing operation, the initial processing module may be further configured to determine the image to be filled based on the image before editing, in response to the editing operation, and determine a region of the image to be filled that does not comprise any image information as the target region.
In an embodiment, the initial processing module, prior to being configured to perform a second filling processing on the target region by using image information included in the image to be filled, may be further configured to crop and/or scale the image to be filled, based on a predetermined image area.
In an embodiment, when the initial processing module is configured to perform crop and/or scale of the image to be filled, based on the predetermined image area, the initial processing module may be further configured to, when an area of the image to be filled is larger than the predetermined image area, crop the image to be filled, and, when the area of the cropped image to be filled is larger than the predetermined image area, scale down the area of such cropped image to be filled to the predetermined image area.
In an embodiment, the initial processing module, when configured to perform crop of the image to be filled, may be further configured to calculate maximum connected component in a mask image corresponding to the image to be filled, and crop the image to be filled, based on minimum bounding squares corresponding to the maximum connected component, to obtain at least one cropped image to be filled.
In an embodiment, the initial processing module, when configured to perform a second filling processing of the target region by using image information included in the image to be filled, so as to obtain the first image, may be further configured to sequentially perform at least one down-sampling operation and at least one up-sampling operation for the image to be filled, so as to obtain the first image, wherein, upon each execution of a down-sampling operation, perform a dilated convolution operation for the feature map obtained by down-sampling, based on a different dilation rate.
In an embodiment, the initial processing module, when configured to perform the dilated convolution operation for the feature map obtained by down-sampling, based on the different dilation rate, may be further configured to split the feature map obtained by down-sampling into at least one sub-feature map, based on a predetermined number of dilated convolutions, perform feature extraction from different sub-feature maps by using different dilation rates, and determine an output of a dilated convolution, based on a feature extraction result of the sub-feature maps.
In an embodiment, the initial processing module, when configured to perform determining the output of the dilated convolution, based on the feature extraction result of the sub-feature maps, may be further configured to perform a summing operation for a feature extraction result of the sub-feature maps so as to obtain a plurality of pieces of superimposed information, concatenate the pieces of superimposed information, and determine the output of the dilated convolution, based on the concatenated superimposed information, wherein the summing operation comprises summing a feature extraction result of a current sub-feature map with a result obtained from a previous summing operation, so as to obtain the superimposed information associated with the current sub-feature map.
In an embodiment, the image extraction module 101, when configured to perform an image extraction of a target region of the first image to obtain a plurality of second images, may be further configured to perform image extraction from the target region of the first image, based on a predetermined first image size and a predetermined image extraction order, so as to obtain a plurality of second images.
In an embodiment, the similarity determination module 102, when configured to determine, in the first image, the target images associated with the second images, based on the image similarity value determined by the convolutional neural network, may be further configured to perform the following operations for each second image: tile the second image to obtain a third image with an image size corresponding to the target region, calculate a similarity value of the second image relative to other second images, based on the third image, by a convolutional neural network, so as to obtain a similarity feature map, and determine, the target position in the target region of the target image, with the highest similarity to the second image, based on the similarity feature map.
The image filling module 103, when configured to perform first filling processing of the first image, based on the second images and the target images, may be further configured to fill the second image to the target position.
In an embodiment, prior to being configured to calculate the similarity value of the second image relative to the other second images based on the third image via a convolutional neural network, the image filling module 103 may be further configured to transform the third image, based on a predetermined second image size, so as to obtain a transformed third image, and down-sample the transformed third image, based on the predetermined first image size and the predetermined second image size, so as to obtain the down-sampled third image, wherein the first image size is a size for extracting a second image in a target region of the first image.
In an embodiment, the similarity determination module 102, when configured to determine, in the first image, the target images associated with the second images, based on the image similarity value determined by the convolutional neural network, may be further configured to perform the following operations for each second image: tile the second image so as to obtain a fourth image with an image size the same as a size of the first image, and determine, in the first image, the target image, with the highest similarity to the second image, based on the fourth image, by a convolutional neural network.
The image filling module 103, when configured to perform a first filling processing of the first image, based on the second images and the target images, may be further configured to fill the target image to a corresponding position of the second image in the first image.
In an embodiment, the image filling module 103, after performing the first filling processing, may be further configured to acquire fifth images corresponding to regions where the plurality of second images are located, based on the first image, acquire sixth images corresponding to the plurality of second images from the first image processed by the first filling processing, determine a target filled image, based on the fifth images and the sixth images, and determine an image obtained after the image processing, based on the target filled image and the first image.
In an embodiment, the image filling module 103, when performing the acquisition of fifth images corresponding to the regions where the plurality of second images are located, based on the first image, may be further configured to perform a scaling down and scaling up operation on the first image, so as to obtain a seventh image, determine an eighth image, based on the first image and the seventh image, and acquire the fifth images corresponding to the plurality of second images, based on the eighth image.
In an embodiment, the image filling module 103, when configured to determine the target filled image, based on the fifth images and the sixth images, may be further configured to perform a convolution operation on the fifth images and the sixth images by an enhanced network, so as to obtain the target filled image, wherein the enhanced network comprises a down-sampling module, a residual block and an up-sampling module connected sequentially. The down-sampling module comprises a plurality of first gated convolution layers connected sequentially. The residual block comprises a plurality of second gated convolution layers connected sequentially. The second gated convolution layers further comprise a skip connection structure between each other. The up-sampling module comprises a plurality of third gated convolution layers connected sequentially. Each third gated convolution layer is connected to the first gated convolution layers at corresponding levels.
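By way of illustration only, the following Python (PyTorch) sketch shows one possible arrangement of such an enhanced network. The class names, channel widths and the number of layers per module are assumptions made for brevity and are not mandated by the disclosure; only the gating of each convolution and the sequential down-sampling/residual/up-sampling structure with level-wise skip connections follow the description above.

```python
# Minimal sketch of the enhanced network described above (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedConv2d(nn.Module):
    """Gated convolution: features modulated by a learned soft mask (gate)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.feature = nn.Conv2d(in_ch, out_ch, 3, stride, padding=1)
        self.gate = nn.Conv2d(in_ch, out_ch, 3, stride, padding=1)

    def forward(self, x):
        return F.elu(self.feature(x)) * torch.sigmoid(self.gate(x))

class EnhancedNetwork(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        # Down-sampling module: first gated convolution layers, connected sequentially.
        self.down1 = GatedConv2d(6, ch, stride=2)       # fifth + sixth images concatenated
        self.down2 = GatedConv2d(ch, 2 * ch, stride=2)
        # Residual block: second gated convolution layers with a skip connection.
        self.res1 = GatedConv2d(2 * ch, 2 * ch)
        self.res2 = GatedConv2d(2 * ch, 2 * ch)
        # Up-sampling module: each level also receives the matching down-sampling output.
        self.up2 = GatedConv2d(4 * ch, ch)              # cat(residual output, down2 output)
        self.up1 = GatedConv2d(2 * ch, 3)               # cat(up2 output, down1 output)

    def forward(self, fifth, sixth):
        x = torch.cat([fifth, sixth], dim=1)            # feed both images to one network
        d1 = self.down1(x)
        d2 = self.down2(d1)
        r = d2 + self.res2(self.res1(d2))               # residual (skip) connection
        u2 = F.interpolate(self.up2(torch.cat([r, d2], dim=1)), scale_factor=2)
        u1 = self.up1(torch.cat([u2, d1], dim=1))
        return F.interpolate(u1, scale_factor=2)        # target filled image
```

In this sketch the fifth and sixth images are concatenated along the channel dimension before convolution, which is one common way to supply two aligned images to a single convolutional network.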
The apparatus of an embodiment of the present disclosure may perform the method provided in some embodiments and is based on principles similar to those of the method. The actions performed by the modules in the apparatus of an embodiment of the present disclosure correspond to the steps in the method of an embodiment of the present disclosure, and for a detailed functional description of the modules of the apparatus, reference may be made to the description of the corresponding method previously shown herein, which will not be repeated here.
An embodiment of the present disclosure provides an electronic device comprising a memory, at least one processor, and a computer program stored in the memory, the at least one processor executing the computer program to perform the steps of the image processing method. Compared with the related art, the following may be achieved: after acquiring a first image, image extraction is first performed on the first image to be processed by image processing, so as to obtain a plurality of second images; target images associated with the second images are then determined in the first image, based on the image similarity value determined by the convolutional neural network; and a first filling processing of the first image is then performed, based on the second images and the target images. The embodiment of the present disclosure may thus perform image filling based on the image information included in the first image, wherein the second images correspond to portions of the first image to be subjected to the first filling processing. Such processing facilitates quickly identifying the positions to be filled, reducing the computational complexity and shortening the processing time. Moreover, the similarity value is determined using a convolutional neural network, and the computation of the similarity value is related to the parameters of the network rather than to the size of the processed image, which significantly reduces the computational complexity and thus improves the performance of the image processing.
In an optional embodiment, an electronic device is provided, as shown in
The processor 1001 may be a CPU (Central Processing Unit), a general purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistorized logic device, a hardware component, or any combination thereof. It may implement or execute various exemplary logic blocks, modules, and circuits described in conjunction with the present disclosure. The processor 1001 may further be a combination that implements a computing function, such as a combination comprising one or more microprocessors, a combination of a DSP and a microprocessor, etc.
The bus 1002 may comprise a pathway to transmit information between the above components. The bus 1002 may be a PCI (Peripheral Component Interconnect) bus or an EISA (Extended Industry Standard Architecture) bus, for example. The bus 1002 may be divided into an address bus, a data bus, a control bus, etc. For the convenience of representation, only one thick line is used in
The memory 1003 may be a ROM (Read Only Memory) or other type of static storage device that may store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that may store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other media that may be used to carry or store a computer program and that may be read by a computer, without limitation here.
The memory 1003 is configured to store a computer program for executing an embodiment of the present disclosure, and its execution is controlled by the processor 1001. The processor 1001 is configured to execute the computer program stored in the memory 1003 so as to perform the steps shown in the preceding method embodiment.
The electronic device 100 may include, but is not limited to, smart phones, tablet computers, laptops, smart speakers, smart watches, in-car devices, etc.
Embodiments of the present disclosure provide a computer-readable storage medium on which a computer program is stored, the computer program being executable by at least one processor to perform the steps and corresponding contents of the preceding method embodiments.
Embodiments of the present disclosure further provide a computer program product comprising a computer program, the computer program when executed by at least one processor realizing the steps and corresponding contents of the preceding method embodiments.
The aforementioned image processing method performed by the electronic device in the embodiment provided by the present disclosure may be performed by using an artificial intelligence model.
According to embodiments of the present disclosure, the method performed in the electronic device may obtain output data for recognizing an image, or image features in an image, by using image data or video data as input data for the artificial intelligence model. The artificial intelligence model may be obtained by training. Here, "obtained by training" means that the basic artificial intelligence model is trained with a plurality of pieces of training data by a training algorithm to obtain predefined operation rules or artificial intelligence models configured to perform the desired feature (or purpose). The artificial intelligence model may comprise a plurality of neural network layers. Each layer of the plurality of neural network layers comprises a plurality of weight values and performs neural network computation through computation between the computation results of the previous layer and the plurality of weight values.
Visual understanding is a technique for recognizing and processing objects in the manner of human vision and includes, for example, object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
The image processing apparatus provided in the present disclosure may be implemented by AI models for at least one of a plurality of modules. The functions associated with the AI may be performed through non-volatile memory, volatile memory, and at least one processor.
The processor may comprise one or more processors. The one or more processors may be a general purpose processor (e.g., a central processing unit (CPU), an application processor (AP), etc.), a graphics-only processing unit (e.g., a graphics processing unit (GPU), a vision processing unit (VPU)), and/or an AI-dedicated processor (e.g., a neural processing unit (NPU)).
The one or more processors control processing of the input data, based on predefined operational rules or artificial intelligence (AI) models stored in non-volatile memory and volatile memory. The predefined operation rules or AI models are provided by training or learning.
Here, being provided by learning means that the predefined operation rules or AI models with desired characteristics are obtained by applying a learning algorithm to a plurality of pieces of learning data. The learning may be performed in the device itself in which the AI according to the embodiment is executed, and/or may be implemented by a separate server/system.
The AI model may comprise a plurality of neural network layers. Each layer has a plurality of weight values, and the computation of a layer is performed by the results of the computation of the previous layer and the plurality of weights of the current layer. Examples of neural networks include, but are not limited to, convolutional neural networks (CNN), deep neural networks (DNN), recurrent neural networks (RNN), restricted Boltzmann machines (RBM), deep belief networks (DBN), bidirectional recurrent deep neural networks (BRDNN), generative adversarial networks (GAN), and deep Q networks.
A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data to enable, allow, or control the target device to make determinations or predictions. Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
According to an aspect of the disclosure, an electronic device may include a memory storing one or more instructions, and at least one processor communicatively coupled to the memory. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to acquire a target image based on an editing operation on an original image. In an example, the target image may include an unknown region formed by the editing operation. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to perform a filling processing operation on the target image to obtain a first filled image including a first target region. In an example, the first target region may correspond to the unknown region. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to identify a target patch based on the first filled image. In an example, the target patch may correspond to at least a portion of the first target region. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to calculate a similarity value between the target patch and at least one patch related to the first filled image using a first artificial intelligence (AI) model. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to determine a target residual patch corresponding to the target patch based on the similarity value. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to generate a processing result image based on the target residual patch.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to acquire an edited image by the editing operation on the original image. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to crop the edited image to acquire a cropped image as the target image, when a size of the edited image is larger than the predetermined first size.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to calculate at least one maximum connected component in a mask image corresponding to the edited image. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to crop the edited image based on at least one minimum bounding square corresponding to the at least one maximum connected component, to obtain at least one cropped image.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to scale down the target image based on the predetermined second size, when a size of the target image is larger than the predetermined second size. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to generate the first filled image based on the target image scaled down, using a second AI model for image generation (e.g., recovering).
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to generate image data of the unknown region based on the target image using a second AI model for image generation to obtain the first filled image.
According to an embodiment, the second AI model may include at least one down-sampling operation, at least one dilated convolution operation and at least one up-sampling operation. In an example, each of the at least one down-sampling operation may be configured to extract a feature map from input data. In an example, each of the at least one dilated convolution operation may be configured to split an input feature map into at least one sub-feature map, based on a predetermined number of dilated convolutions. In an example, each of the at least one dilated convolution operation may be configured to perform feature extraction from each of the at least one sub-feature map using each dilation rate corresponding to the each of the at least one sub-feature map. In an example, each of the at least one dilated convolution operation may be configured to calculate output data based on a feature extraction result of the each of the at least one sub-feature map.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to perform at least one summing operation on the feature extraction result of the each of the at least one sub-feature map to obtain a plurality of pieces of superimposed information. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to concatenate the plurality of pieces of superimposed information. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to determine the output data based on the concatenated plurality of pieces of superimposed information. According to an embodiment, the at least one summing operation may include summing a feature extraction result of a current sub-feature map with a result of a previous summing operation, to obtain superimposed information associated with the current sub-feature map.
According to an embodiment, the first AI model may include each sub-network for each patch size. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to calculate the similarity value using a sub-network for a size of the target patch.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to scale down the first filled image based on a predetermined third size, to obtain a second filled image including a second target region. In an example, the second target region may correspond to the first target region of the first filled image. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to extract the target patch from the second filled image based on at least one predetermined patch size. In an example, the target patch may include at least a portion of the second target region.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to tile the target patch to generate a tile image comprising a plurality of tiles corresponding to the target patch. In an example, a size of the tile image may be determined based on a size of the second filled image. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to input the tile image and the second filled image with the second target region masked to the first AI model. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to acquire output data of the first AI model including the similarity value between the target patch and at least one patch of the second filled image with the second target region masked.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to acquire an attention scores map for the target patch. In an example, a value of each position in the attention scores map may indicate a similarity value between the target patch and a patch at the each position in the second filled image with the second target region masked.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to identify a position with a lowest similarity value in the attention scores map. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to determine the target residual patch in the target image, based on the position with the lowest similarity value.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to generate a target residual image for the unknown region of the target image, based on the target residual patch. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to generate the processing result image based on the target residual image and the first filled image.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to generate a first residual image for the unknown region based on the target residual patch. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to generate a second residual image for the unknown region based on the first filled image. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to generate a refined residual image for the unknown region as the target residual image, based on the first residual image and the second residual image.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to perform at least one of a scaling down operation, a scaling up operation or a subtract operation on the first filled image to generate a low resolution residual image as the second residual image.
According to an embodiment, the at least one processor may be configured to execute the one or more instructions to scale up the first filled image based on a size of the target image. According to an embodiment, the at least one processor may be configured to execute the one or more instructions to add the target residual image to the first filled image scaled up to generate the processing result image. An image processing method is provided in embodiments of the present disclosure, as shown in
As shown in
In step S241, the image processing method 240 may include acquiring a target image based on an editing operation on an original image (e.g., high resolution image, 4096×4096). In an embodiment, the target image may include an unknown region formed by the editing operation. In an example, the target image (e.g., the edited image) may be a result image of the editing operation on the original image. The editing operation may include rotating the image, removing a part of the image, changing the angle of the image, etc. Additionally or alternatively, the target image (e.g., the cropped image) may be at least one image cropped from the edited image by cropping the edited image.
According to an embodiment, the acquiring of the target image may include acquiring an edited image by the editing operation on the original image. According to an embodiment, the acquiring of the target image may include cropping the edited image to acquire a cropped image as the target image when a size of the edited image is larger than the predetermined first size (e.g., 512×512).
According to an embodiment, the cropping of the edited image may include calculating at least one maximum connected component in a mask image corresponding to the edited image. According to an embodiment, the cropping of the edited image may include cropping the edited image based on at least one minimum bounding square corresponding to the at least one maximum connected component, to obtain at least one cropped image.
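By way of illustration only, the following Python sketch implements this cropping step under the assumption that the mask image marks the unknown region with non-zero values. The use of scipy.ndimage for connected-component labeling and the helper name crop_by_max_component are illustrative choices rather than requirements of the disclosure.

```python
# Minimal sketch of cropping around the largest connected component of the mask.
import numpy as np
from scipy import ndimage

def crop_by_max_component(edited, mask):
    """Crop the edited image using the minimum bounding square of the largest
    connected component of the mask. Returns the crop and its (top, left) offset."""
    labels, num = ndimage.label(mask > 0)
    if num == 0:
        return edited, (0, 0)
    # Find the largest connected component (labels start at 1).
    sizes = ndimage.sum(mask > 0, labels, index=np.arange(1, num + 1))
    target = 1 + int(np.argmax(sizes))
    ys, xs = np.nonzero(labels == target)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    # Expand the bounding box to its minimum bounding square and clamp it inside the image.
    side = max(y1 - y0, x1 - x0) + 1
    cy, cx = (y0 + y1) // 2, (x0 + x1) // 2
    h, w = mask.shape
    top = int(np.clip(cy - side // 2, 0, max(h - side, 0)))
    left = int(np.clip(cx - side // 2, 0, max(w - side, 0)))
    return edited[top:top + side, left:left + side], (top, left)
```

Returning the crop offset makes it straightforward to paste processed content back into the edited image at a later stage.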
In step S242, the image processing method 240 may include performing a filling processing operation on the target image to obtain a first filled image including a first target region. In an embodiment, the first target region may correspond to the unknown region.
According to an embodiment, the performing of the filling processing operation on the target image may include scaling down the target image based on the predetermined second size (e.g., 512×512) when a size of the target image is larger than the predetermined second size. According to an embodiment, the performing of the filling processing operation on the target image may include generating the first filled image based on the target image scaled down, using a second AI model for image generation.
In an embodiment, the performing of the filling processing operation on the target image may include generating image data of the unknown region based on the target image using a second AI model for image generation to obtain the first filled image (e.g., expanded result). In an example, the second AI model may include at least one down-sampling operation, at least one dilated convolution operation and at least one up-sampling operation.
In an embodiment, each of the at least one down-sampling operation may be configured to extract a feature map from input data. In an embodiment, each of the at least one dilated convolution operation may be configured to split an input feature map into at least one sub-feature map, based on a predetermined number of dilated convolutions. In an embodiment, each of the at least one dilated convolution operation may be configured to perform feature extraction from each of the at least one sub-feature map using each dilation rate corresponding to the each of the at least one sub-feature map. In an embodiment, each of the at least one dilated convolution operation may be configured to calculate output data based on a feature extraction result of the each of the at least one sub-feature map.
In an example, the calculating of the output data may include performing at least one summing operation on the feature extraction result of the each of the at least one sub-feature map to obtain a plurality of pieces of superimposed information. For example, the at least one summing operation may include summing a feature extraction result of a current sub-feature map with a result of a previous summing operation, to obtain superimposed information associated with the current sub-feature map. In an example, the calculating of the output data may include concatenating the plurality of pieces of superimposed information. In an example, the calculating of the output data may include determining the output data based on the concatenated plurality of pieces of superimposed information.
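By way of illustration only, the following PyTorch sketch shows one possible form of such a dilated convolution operation. Splitting the input feature map along the channel dimension and the particular dilation rates (1, 2, 4, 8) are assumptions made for the example; the progressive summing of each branch's result with the previous sum, followed by concatenation, mirrors the calculation described above.

```python
# Minimal sketch of a split-and-sum dilated convolution block (illustrative only).
import torch
import torch.nn as nn

class SplitDilatedBlock(nn.Module):
    def __init__(self, channels, rates=(1, 2, 4, 8)):
        super().__init__()
        assert channels % len(rates) == 0
        sub_ch = channels // len(rates)
        # One dilated convolution per sub-feature map, each with its own dilation rate.
        self.branches = nn.ModuleList(
            nn.Conv2d(sub_ch, sub_ch, 3, padding=r, dilation=r) for r in rates
        )
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        subs = torch.chunk(x, len(self.branches), dim=1)   # split into sub-feature maps
        pieces, running = [], None
        for sub, conv in zip(subs, self.branches):
            feat = conv(sub)                                # feature extraction at this rate
            running = feat if running is None else running + feat  # sum with previous result
            pieces.append(running)                          # superimposed information
        out = torch.cat(pieces, dim=1)                      # concatenate the superimposed pieces
        return self.fuse(out)                               # output of the dilated convolution
```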
In step S243, the image processing method 240 may include identifying a target patch based on the first filled image. In an embodiment, the target patch may correspond to at least a portion of the first target region.
According to an embodiment, the identifying of the target patch may include scaling down the first filled image based on a predetermined third size (e.g., 256×256), to obtain a second filled image including a second target region. In an example, the second target region may correspond to the first target region of the first filled image. According to an embodiment, the identifying of the target patch may include extracting the target patch from the second filled image based on at least one predetermined patch size (e.g., 16×16, 32×32). For example, the target patch may include at least a portion of the second target region.
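By way of illustration only, the following Python sketch extracts candidate target patches on a regular grid and keeps those that overlap the second target region. The grid layout and the use of a single patch size per call are simplifying assumptions made for the example.

```python
# Minimal sketch of extracting target patches from the second filled image.
import numpy as np

def extract_target_patches(second_filled, second_mask, patch_size=16):
    """Return (patch, (top, left)) pairs that cover the second target region.
    second_filled: (H, W, C) array; second_mask: (H, W) array, non-zero in the region."""
    h, w = second_mask.shape
    patches = []
    for top in range(0, h - patch_size + 1, patch_size):
        for left in range(0, w - patch_size + 1, patch_size):
            window = second_mask[top:top + patch_size, left:left + patch_size]
            if window.any():  # patch includes at least a portion of the second target region
                patches.append((second_filled[top:top + patch_size,
                                              left:left + patch_size], (top, left)))
    return patches
```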
In step S244, the image processing method 240 may include calculating a similarity value between the target patch and at least one patch related to the first filled image using a first artificial intelligence (AI) model (e.g., patch network). According to an embodiment, the first AI model may include each sub-network for each patch size (e.g., 32×32, 16×16). The calculating of the similarity value may include calculating the similarity value using a sub-network for a size of the target patch.
According to an embodiment, the calculating of the similarity value between the target patch and at least one patch may include tiling the target patch to generate a tile image comprising a plurality of tiles corresponding to the target patch. In an example, a size of the tile image may be determined based on a size of the second filled image. According to an embodiment, the calculating of the similarity value between the target patch and at least one patch may include inputting the tile image and the second filled image with the second target region masked to the first AI model. According to an embodiment, the calculating of the similarity value between the target patch and at least one patch may include acquiring output data of the first AI model including the similarity value between the target patch and at least one patch of the second filled image with the second target region masked.
In an example, the acquiring of the output data of the first AI model may include acquiring an attention scores map for the target patch. A value of each position in the attention scores map may indicate a similarity value between the target patch and a patch at the each position in the second filled image with the second target region masked. The position may include coordinate information in each image or each map. In an embodiment, a lower similarity value may indicate a higher degree of similarity. Alternatively, a higher similarity value may indicate a higher degree of similarity.
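By way of illustration only, the following PyTorch sketch traces the data flow of this step. In the disclosure the similarity itself is produced by the trained first AI model (patch network); here a plain squared-difference score is used as a stand-in so that the tiling of the target patch, the comparison against every patch position of the masked second filled image, and the resulting attention scores map (with a lower value meaning more similar, as in the embodiment above) can be shown end to end. All tensor shapes are illustrative assumptions.

```python
# Illustrative data flow for the attention scores map (stand-in scoring only).
import torch
import torch.nn.functional as F

def tile_patch(target_patch, height, width):
    """Tile the (C, p, p) target patch to the size of the second filled image;
    this tile image is what would be fed to the first AI model."""
    _, p, _ = target_patch.shape
    reps_h, reps_w = -(-height // p), -(-width // p)          # ceiling division
    return target_patch.repeat(1, reps_h, reps_w)[:, :height, :width]

def standin_attention_scores(target_patch, masked_filled):
    """Squared-difference stand-in for the patch network's attention scores map,
    one score per patch position (lower value = more similar)."""
    _, h, w = masked_filled.shape
    p = target_patch.shape[-1]
    cand = F.unfold(masked_filled.unsqueeze(0), p)            # (1, C*p*p, positions)
    ref = target_patch.reshape(1, -1, 1)                      # flattened target patch
    scores = ((cand - ref) ** 2).mean(dim=1)                  # distance per position
    return scores.view(h - p + 1, w - p + 1)                  # attention scores map

def lowest_score_position(scores_map):
    """Position with the lowest similarity value, later used to locate the target residual patch."""
    idx = int(torch.argmin(scores_map))
    return idx // scores_map.shape[1], idx % scores_map.shape[1]
```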
In step S245, the image processing method 240 may include determining a target residual patch corresponding to the target patch based on the similarity value. According to an embodiment, the determining of the target residual patch may include identifying a position with a lowest similarity value in the attention scores map. According to an embodiment, the determining of the target residual patch may include determining the target residual patch in the target image, based on the position with the lowest similarity value.
In step S246, the image processing method 240 may include generating a processing result image (e.g., final result, final ultra high-resolution result), based on the target residual patch. According to an embodiment, the generating of the processing result image may include generating a target residual image for the unknown region of the target image, based on the target residual patch. According to an embodiment, the generating of the processing result image may include generating the processing result image based on the target residual image and the first filled image (e.g., generated 512×512 result).
In an example, the generating of the target residual image for the unknown region may include generating a first residual image (e.g., pasted result in the unknown region, mask region residual image result) for the unknown region based on the target residual patch. In an example, the generating of the target residual image for the unknown region may include generating a second residual image (e.g., low-resolution residual image, low-resolution high frequency image) for the unknown region based on the first filled image. In an example, the generating of the target residual image for the unknown region may include generating a refined residual image (e.g., enhanced high-frequency information image) for the unknown region as the target residual image, based on the first residual image and the second residual image.
In an embodiment, the generating of the second residual image may include performing at least one of a scaling down operation, a scaling up operation or a subtract operation on the first filled image to generate a low resolution residual image as the second residual image.
In an example, the generating of the processing result image based on the target residual image and the first filled image may include scaling up the first filled image based on a size of the target image. In an example, the generating of the processing result image based on the target residual image and the first filled image may include adding the target residual image to the first filled image scaled up (e.g., generated coarse result, blurred image in original pixel size), to generate the processing result image.
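By way of illustration only, the following PyTorch sketch composes the processing result image as described above. Bilinear resizing, the 128×128 low-resolution size, and the simple addition used to combine the first and second residual images into the refined residual are assumptions for the example; the disclosure does not fix these choices.

```python
# Minimal sketch of the residual composition in step S246 (illustrative only).
import torch.nn.functional as F

def resize(img, size):
    """img: (1, C, H, W) tensor; size: (H, W)."""
    return F.interpolate(img, size=size, mode="bilinear", align_corners=False)

def second_residual(first_filled, low_size):
    """Scale the first filled image down then up and subtract, keeping only the
    high-frequency information lost at the low resolution."""
    blurred = resize(resize(first_filled, low_size), first_filled.shape[-2:])
    return first_filled - blurred

def processing_result(first_filled, first_residual, target_size, low_size=(128, 128)):
    """Compose the final image: upscaled coarse fill plus the refined residual.
    first_residual is assumed to already be at the target image size."""
    refined = first_residual + resize(second_residual(first_filled, low_size), target_size)
    coarse = resize(first_filled, target_size)          # scaled-up first filled image
    return coarse + refined                             # processing result image
```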
According to an aspect of the disclosure, there is provided a non-transitory computer readable storage medium storing a program executable by at least one processor to perform an image processing method according to at least one of the above-described embodiments.
It should be understood that while the flowcharts of embodiments of the present disclosure indicate the individual operational steps by arrows, the order in which these steps are performed is not limited to the order indicated by the arrows. Unless explicitly stated herein, in some implementation scenarios of embodiments of the present disclosure, the steps in the respective flowcharts may be performed in other orders as desired. Moreover, some or all of the steps in each flowchart may comprise a plurality of sub-steps or a plurality of stages, depending on the actual implementation scenario. Some or all of these sub-steps or stages may be executed at the same moment, and each of these sub-steps or stages may also be executed separately at different moments. In scenarios where the execution times differ, the order of execution of these sub-steps or stages may be flexibly configured as needed, and an embodiment of the present disclosure is not limited in this regard.
It should be noted that, for those skilled in the art, other similar implementations based on the technical concept of the present disclosure, without departing from that technical concept, also fall within the scope of protection of the embodiments of the present disclosure.
According to an aspect of the disclosure, an image processing method includes performing image extraction on an acquired first image to obtain a plurality of second images. The image processing method further includes determining target images associated with the plurality of second images in the acquired first image, based on image similarity values determined by a convolutional neural network. The image processing method further includes performing a first filling processing of the acquired first image, based on the plurality of second images and the target images.
In some embodiments, the performing of the image extraction may include acquiring a first image, that has been processed by a second filling processing, wherein the second filling processing may include determining, in response to an editing operation, a target region in an image to be filled, and performing the second filling processing of the target region by using image information included in the image to be filled, to obtain the acquired first image. The performing of the image extraction may further include performing the image extraction from the target region of the acquired first image to obtain the plurality of second images.
In some embodiments, the determining of the target region in the image to be filled may include determining, in response to the editing operation, the image to be filled based on the image before the editing operation. The determining of the target region in the image to be filled may further include identifying, as the target region, a region in the image to be filled that does not comprise the image information.
In some embodiments, the image processing method may further include cropping or scaling the image to be filled, based on a predetermined image area, before the performing of the second filling processing on the target region.
In some embodiments, the cropping or scaling of the image to be filled may include, when an area of the image to be filled is larger than the predetermined image area, cropping the image to be filled, and, when the area of the cropped image to be filled is larger than the predetermined image area, scaling down the area of the cropped image to be filled to the predetermined image area.
In some embodiments, the cropping of the image to be filled may include calculating a maximum connected component in a mask image corresponding to the image to be filled, and cropping the image to be filled based on a minimum bounding square corresponding to the maximum connected component, to obtain at least one cropped image to be filled.
In some embodiments, the performing of the second filling processing on the target region may include sequentially performing at least one down-sampling operation and at least one up-sampling operation on the image to be filled, to obtain the acquired first image. In such embodiments, the performing of each of the at least one down-sampling operation may include performing a dilated convolution operation on a feature map obtained by that down-sampling operation, based on a different dilation rate for each of the at least one down-sampling operation.
In some embodiments, the performing of the dilated convolution operation may include splitting the feature map obtained by the at least one down-sampling operation into at least one sub-feature map, based on a predetermined number of dilated convolutions. The performing of the dilated convolution operation may further include performing feature extraction from different sub-feature maps by using different dilation rates. The performing of the dilated convolution operation may further include determining an output of a dilated convolution, based on a feature extraction result of the different sub-feature maps.
In some embodiments, the determining of the output of the dilated convolution may include performing a summing operation for the feature extraction result of the different sub-feature maps to obtain a plurality of pieces of superimposed information. The determining of the output of the dilated convolution may further include concatenating the plurality of pieces of superimposed information. The determining of the output of the dilated convolution may further include determining the output of the dilated convolution, based on the concatenated plurality of pieces of superimposed information. In such embodiments, the summing operation may include summing the feature extraction result of a current sub-feature map with a result obtained from a previous summing operation, to obtain superimposed information associated with the current sub-feature map.
In some embodiments, the performing of the image extraction from the target region may include performing the image extraction from the target region of the acquired first image, based on a predetermined first image size and a predetermined image extraction order, to obtain the plurality of second images.
In some embodiments, the determining of the target images associated with the plurality of second images may include performing for each second image of the plurality of second images: tiling that second image to obtain a third image with an image size corresponding to the target region, calculating a similarity value of that second image relative to remaining second images of the plurality of second images, based on the third image, by the convolutional neural network, to obtain a similarity feature map, and determining a target position in the target region of a target image of the target images with a highest similarity to that second image, based on the similarity feature map. In such embodiments, the performing of the first filling processing of the acquired first image, based on the plurality of second images and the target images, may include filling that second image to the target position.
In some embodiments, the calculating of the similarity value of that second image may include transforming the third image, based on a predetermined second image size, to obtain a transformed third image, and down-sampling the transformed third image, based on a predetermined first image size and the predetermined second image size, to obtain a down-sampled third image. The predetermined first image size may be a size for extracting that second image from the target region of the acquired first image.
In some embodiments, the determining of the target images may include performing for each second image of the plurality of second images: tiling that second image to obtain a fourth image with an image size that matches a size of the acquired first image, and determining a target image of the target images in the acquired first image with a highest similarity to that second image by the convolutional neural network, based on the fourth image. In such embodiments, the performing of the first filling processing of the acquired first image may include filling the target image to a corresponding position of that second image in the acquired first image.
In some embodiments, the image processing method may further include, after performing the first filling processing, acquiring fifth images corresponding to regions where the plurality of second images are located, based on the acquired first image, acquiring sixth images corresponding to the plurality of second images, from the acquired first image processed by the first filling processing, determining a target filled image, based on the fifth images and the sixth images, and determining an image obtained after the image processing, based on the target filled image and the acquired first image.
In some embodiments, the acquiring of the fifth images may include performing a scaling down operation and a scaling up operation on the acquired first image, to obtain a seventh image, determining an eighth image, based on the acquired first image and the seventh image, and acquiring the fifth images corresponding to the plurality of second images, based on the eighth image.
In some embodiments, the determining of the target filled image may include performing a convolution operation on the fifth images and the sixth images by an enhanced network, to obtain the target filled image. The enhanced network may include a down-sampling layer, a residual block layer, and an up-sampling layer connected sequentially. The down-sampling layer may include a plurality of first gated convolution layers connected sequentially. The residual block layer may include a plurality of second gated convolution layers connected sequentially. The plurality of second gated convolution layers may include a skip connection structure between each other. The up-sampling layer may include a plurality of third gated convolution layers connected sequentially. The plurality of third gated convolution layers may be connected to the plurality of first gated convolution layers in corresponding levels.
According to an aspect of the disclosure, an electronic device includes a memory storing one or more instructions, and at least one processor communicatively coupled to the memory. The processor is configured to execute the one or more instructions to perform image extraction on an acquired first image to obtain a plurality of second images. The processor is further configured to execute the one or more instructions to determine target images associated with the plurality of second images in the acquired first image, based on image similarity values determined by a convolutional neural network. The processor is further configured to execute the one or more instructions to perform a first filling processing of the acquired first image, based on the plurality of second images and the target images.
In some embodiments, the processor is further configured to execute the one or more instructions to acquire a first image, that has been processed by a second filling processing, wherein the second filling processing may include further instructions to determine, in response to an editing operation, a target region in an image to be filled, and to perform the second filling processing of the target region by using image information included in the image to be filled, to obtain the acquired first image. In such embodiments, the processor is further configured to execute the one or more instructions to perform the image extraction from the target region of the acquired first image to obtain the plurality of second images.
According to an aspect of the disclosure, a computer readable storage medium is configured to store computer instructions which allow a computer, when the computer instructions are operated in the computer, to perform image extraction on an acquired first image to obtain a plurality of second images, determine target images associated with the plurality of second images in the acquired first image, based on image similarity values determined by a convolutional neural network, and perform a first filling processing of the acquired first image, based on the plurality of second images and the target images.
In some embodiments, the computer instructions further allow the computer to acquire a first image, that has been processed by a second filling processing, wherein the second filling processing may include further instructions to determine, in response to an editing operation, a target region in an image to be filled, and to perform the second filling processing of the target region by using image information included in the image to be filled, to obtain the acquired first image, and perform the image extraction from the target region of the acquired first image to obtain the plurality of second images.
The present disclosure provides an image processing method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product. Specifically, the present disclosure relates to a first image requiring image processing. After acquiring the first image, image extraction is first performed on the acquired first image to obtain a plurality of second images; target images associated with the second images are then determined in the first image, based on an image similarity value determined by a convolutional neural network; and a first filling processing of the first image is performed based on the second images and the target images. The implementation of the solution of the present disclosure may perform image filling based on the image information included in the first image, such that the second image corresponds to a part of the first image to be subjected to the first filling processing, and a filling of the first image may be achieved based on the target image related to the second image. The processing of this solution may facilitate faster identification of the filling position, with reduced computational complexity and resource usage (e.g., processing resources, memory resources) when compared to related image processing operations. Moreover, the similarity is calculated by using a convolutional neural network, and the computational amount of the similarity calculation is related to the parameters of the network, but not to the size of the processed image, which facilitates reducing the computational complexity and improving the performance of image processing.
Number | Date | Country | Kind |
---|---|---|---|
202111342435.0 | Nov 2021 | CN | national |
202210821462.4 | Jul 2022 | CN | national |
This application is a continuation application of International Application No. PCT/KR2022/017785, filed on Nov. 11, 2022, which claims priority to Chinese Patent Application 202111342435.0, filed on Nov. 12, 2021, and Chinese Patent Application 202210821462.4, filed on Jul. 12, 2022, in the China National Intellectual Property Administration, the disclosures of which are incorporated by reference herein in their entireties.
Number | Date | Country | |
---|---|---|---|
Parent | PCT/KR2022/017785 | Nov 2022 | US |
Child | 18074003 | US |