The disclosure relates to the field of video processing. In particular, the disclosure relates to a method executed by an electronic apparatus, an electronic apparatus, and a storage medium.
In many video application scenarios, there is a technical need to remove objects from the frames of a video and to complete missing areas in those frames. In particular, with the popularity of mobile terminals such as smart phones and tablet computers, the demand for using mobile terminals for video shooting and video processing is increasing. However, in related technologies, the efficiency of techniques such as object removal or missing-area completion in video frames is low.
How to efficiently remove objects or complete missing areas so as to better meet user needs is a technical problem that those skilled in the art have been striving to solve.
In order to at least solve the above problems existing in the prior art, the disclosure provides a method performed by an electronic apparatus, an electronic apparatus, and a storage medium.
According to an aspect of the disclosure, a method performed by an electronic apparatus, includes: determining a local frame sequence from a video; obtaining a feature map sequence of the local frame sequence by encoding the local frame sequence; determining a feature flow sequence of the local frame sequence based on the feature map sequence of the local frame sequence and a mask image sequence of an object to be removed, the mask image sequence being corresponding to the feature map sequence; obtaining an updated feature map sequence of the local frame sequence, by performing, based on the feature flow sequence, feature fusion between adjacent feature maps in the feature map sequence; and obtaining a processed local frame sequence, by decoding the updated feature map sequence of the local frame sequence.
According to another aspect of the disclosure, a method performed by an electronic apparatus, includes: determining a local frame sequence from a video sequence of a video; and determining a reference frame sequence corresponding to the local frame sequence, from the video, and performing inpainting processing for the local frame sequence based on the reference frame sequence, wherein the determining the reference frame sequence corresponding to the local frame sequence from the video includes: determining a candidate frame sequence from a video sequence to which a local frame sequence belongs, wherein the candidate frame sequence comprises image frames in the video sequence excluding the local frame sequence; and selecting the reference frame sequence corresponding to the local frame sequence from the candidate frame sequence based on a similarity between a frame image in the candidate frame sequence and a frame image in the local frame sequence, wherein the similarity represents a correlation of a background area and an uncorrelation of a foreground area between the frame image in the candidate frame sequence and the frame image in the local frame sequence.
The beneficial effects of the technical solutions provided by the disclosure will be described later in combination with specific optional embodiments, or may be learned from descriptions of the embodiments, or may be learned from implementation of the embodiments.
In order to more clearly and easily explain and understand technical solutions in embodiments of the disclosure, the following will briefly introduce the drawings needed to be used in the description of the embodiments of the disclosure:
Embodiments of the disclosure are described below in conjunction with the accompanying drawings in the disclosure. It should be understood that the embodiments described below in combination with the accompanying drawings are exemplary descriptions for explaining technical solutions of the embodiments of the disclosure, and do not constitute restrictions on the technical solutions of the embodiments of the disclosure.
It may be understood by those skilled in the art that the singular forms “a”, “an”, “the” and “this” used herein may also include the plural forms unless specifically stated otherwise. It should be further understood that the terms “include” and “comprise” used in the embodiments of the disclosure mean that a corresponding feature may be implemented as the presented feature, information, data, step, operation, element, and/or component, but do not exclude the implementation of other features, information, data, steps, operations, elements, components and/or combinations thereof supported in the present technical field. It should be understood that, when one element is described as being “connected” or “coupled” to another element, this element may be directly connected or coupled to the other element, or the connection between this element and the other element may be established through an intermediate element. In addition, “connection” or “coupling” used herein may include a wireless connection or wireless coupling. The term “and/or” used herein represents at least one of the items defined by this term; for example, “A and/or B” may be implemented as “A”, or “B”, or “A and B”. When a plurality of (two or more) items are described and the relationship between the plurality of items is not clearly defined, the description may refer to one, more, or all of the plurality of items. For example, for a description of “a parameter A includes A1, A2, A3”, it may be implemented that the parameter A includes A1, or A2, or A3, and it may also be implemented that the parameter A includes at least two of the three parameters A1, A2, A3.
In operation S110, at least one (or a) local frame sequence is determined from a video. This is described below with reference to
First, the video is split based on scene information to obtain at least one video sequence. As shown in
Then, the at least one local frame sequence is obtained, by selecting a predetermined number of consecutive image frames (also referred to as video frames) from each video sequence according to a predetermined stride as the local frame sequence, wherein the predetermined stride is less than or equal to the predetermined number. As shown in
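For illustration, a minimal sketch of this sliding-window selection is given below; the function name, the list-of-frames representation, and the default values of the predetermined number and stride are assumptions of the sketch, not part of the disclosure.

```python
from typing import Any, List

def split_into_local_frame_sequences(video_sequence: List[Any],
                                     num_frames: int = 10,
                                     stride: int = 5) -> List[List[Any]]:
    """Select local frame sequences of `num_frames` consecutive frames from one
    video sequence, moving forward by `stride` frames each time. Because the
    stride is less than or equal to the number of frames, adjacent local frame
    sequences may overlap."""
    assert stride <= num_frames
    sequences = []
    start = 0
    while start < len(video_sequence):
        sequences.append(video_sequence[start:start + num_frames])
        start += stride
    return sequences

# Example: a 20-frame scene with num_frames=10 and stride=5 yields local frame
# sequences starting at frames 0, 5, 10 and 15.
```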
After operation S110, operations S120 to S150 as shown in
In operation S120, a feature map sequence of a current local frame sequence is obtained by encoding the current local frame sequence. A ‘feature’ may be a piece of information about the content of an image; typically about whether a certain region of the image has certain properties. Features may include information about specific structures in the image such as points, edges or objects.
Specifically, each local frame image in the current local frame sequence is input to a convolutional network encoder for encoding, to obtain a feature map of each local frame image, thus obtaining the feature map sequence of the current local frame sequence. Here, the convolutional network encoder may be any convolutional network encoder that may extract a feature map from an image.
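As an illustration only, a minimal convolutional encoder of the kind that could be used here is sketched below (a PyTorch sketch; the layer sizes, channel counts and the 1/4 downsampling factor are assumptions, not a definition of the encoder used in the disclosure).

```python
import torch
import torch.nn as nn

class SimpleConvEncoder(nn.Module):
    """Toy convolutional encoder: maps an RGB frame to a feature map at 1/4 of
    the input resolution."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (B, 3, H, W) -> feature map: (B, out_channels, H/4, W/4)
        return self.net(frame)
```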
In operation S130, a feature flow sequence of the current local frame sequence is determined based on the feature map sequence of the current local frame sequence and a corresponding mask image sequence of an object to be removed (or a mask image sequence corresponding to the feature map sequence of the current local frame sequence).
Specifically, the determining the feature flow sequence of the current local frame sequence based on the feature map sequence of the current local frame sequence and the corresponding mask image sequence of the object to be removed includes: determining an occlusion mask image sequence and the feature flow sequence of the current local frame sequence based on the feature map sequence of the current local frame sequence and the corresponding mask image sequence of the object to be removed.
Here, when determining the occlusion mask image sequence and the feature flow sequence of the current local frame sequence, the following operations are performed for each frame image in the current local frame sequence: selecting two frames of feature maps corresponding to a current frame image and its adjacent frame image from the feature map sequence of the current local frame sequence, and selecting two frames of mask images of the object to be removed corresponding to the current frame image and the adjacent frame image from the mask image sequence of the object to be removed corresponding to the current local frame sequence; and estimating a feature flow and an occlusion mask image from the adjacent frame image to the current frame image by using a feature flow estimation module, based on the two frames of feature maps and the two frames of mask images of the object to be removed.
In the disclosure, in order to make video processing (for example, removing an object from the video, completing a missing area, etc.) more effective, the method proposed in the disclosure may be used to perform a corresponding processing in turn based on an order from the first frame image to the last frame image of each video sequence (this process may be referred to as one forward traversal processing); then, a corresponding processing is performed based on an order from the last frame image to the first frame image of the video sequence (this process may be referred to as one backward traversal processing). One forward traversal processing and one backward traversal processing may be referred to as one bidirectional traversal processing. In the disclosure, at least one bidirectional traversal processing may be performed. In addition, the disclosure also proposes that whenever the bidirectional traversal processing is performed, the backward traversal processing may be performed first and then the forward traversal processing may be performed.
In the disclosure, the adjacent frame image of the current frame image may be a previous frame image of the current frame image or a next frame image of the current frame image. In the following description, for the convenience of description, it is assumed that the current frame image is represented by the t-th frame image, and the adjacent frame image of the current frame image is represented by the (t+1)th frame image. A process of using the feature flow estimation module to estimate the feature flow and the occlusion mask image from the adjacent frame image to the current frame image is described in detail below with reference to
In operation S310, a forward feature flow and a backward feature flow are determined based on the two frames of feature maps and the two frames of the mask images of the object to be removed, of the current frame image and its adjacent frame image, wherein, when the adjacent frame image is the next frame image of the current frame image, the forward feature flow represents a feature flow from the current frame image to the adjacent frame image, and the backward feature flow represents a feature flow from the adjacent frame image to the current frame image; when the adjacent frame image is the previous frame image of the current frame image, the backward feature flow represents the feature flow from the current frame image to the adjacent frame image, and the forward feature flow represents the feature flow from the adjacent frame image to the current frame image.
As shown in
In operation S320, the occlusion mask image from the adjacent frame image to the current frame image is obtained by performing consistency checking on the forward feature flow and the backward feature flow.
As shown in
The process of
In operation S610, each of the two frames of feature maps is weighted with a corresponding mask image of the object to be removed, respectively, to obtain two frames of weighted feature maps.
As shown in
In operation S620, a forward concatenation feature map and a backward concatenation feature map are obtained by performing forward concatenation and backward concatenation on the two frames of the weighted feature maps, respectively.
As shown in
In operation S630, each of the forward concatenation feature map and the backward concatenation feature map is encoded by using a plurality of residual blocks based on a residual network.
As shown in
In operation S640, the forward feature flow is obtained by performing a feature flow prediction on an encoding result of the forward concatenation feature map, and the backward feature flow is obtained by performing a feature flow prediction on an encoding result of the backward concatenation feature map.
As shown in
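A schematic sketch of operations S610 to S640 is given below, assuming PyTorch tensors. The channel counts, the residual-block depth, the module name, and the use of element-wise multiplication as the weighting in operation S610 are illustrative assumptions, not the exact structure of the disclosed feature flow estimation module.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.conv2(self.relu(self.conv1(x))))

class ToyFlowEstimator(nn.Module):
    """Estimates a forward and a backward feature flow from two feature maps
    and their object-removal mask images (operations S610 to S640)."""
    def __init__(self, feat_channels: int = 64, num_blocks: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(2 * feat_channels, feat_channels, 3, padding=1),
            *[ResidualBlock(feat_channels) for _ in range(num_blocks)],
        )
        self.flow_head = nn.Conv2d(feat_channels, 2, 3, padding=1)  # 2-channel flow

    def forward(self, feat_t, feat_t1, mask_t, mask_t1):
        # S610: weight each feature map with its mask of the object to be removed
        w_t, w_t1 = feat_t * mask_t, feat_t1 * mask_t1
        # S620: forward and backward concatenation along the channel dimension
        fwd_cat = torch.cat([w_t, w_t1], dim=1)
        bwd_cat = torch.cat([w_t1, w_t], dim=1)
        # S630: encode each concatenation with residual blocks
        fwd_enc, bwd_enc = self.encoder(fwd_cat), self.encoder(bwd_cat)
        # S640: predict the forward and backward feature flows
        return self.flow_head(fwd_enc), self.flow_head(bwd_enc)
```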
Through the process in
Specifically, for each point A in the adjacent feature map (for example, the (t+1)-th frame feature map), a corresponding point A′ in the current feature map (for example, the t-th frame feature map) is determined using the backward feature flow. In detail, assuming that the point A is (x, y), the corresponding point A′ of the point A in the t-th frame feature map is (x, y) + flow_{t+1→t}(x, y).
When a feature flow value at the point A in the backward feature flow and a feature flow value at the corresponding point A′ in the forward feature flow meet a predetermined condition, the point A is determined as an occlusion point, wherein all the occlusion points determined in this way constitute an occlusion mask image from the adjacent frame image to the current frame image. Specifically, when the forward feature flow is completely consistent with the backward feature flow, the absolute values of the feature flows at the point A and the point A′ are the same but their signs are opposite, that is, the sum of the two feature flows is zero: flow_{t→t+1}(A′) + flow_{t+1→t}(A) = 0. However, considering a calculation error, when the point A: (x, y) satisfies the following formula (1), the point A is referred to as an occlusion point:

|flow_{t→t+1}(A′) + flow_{t+1→t}(A)|² > θ    (1)

Here, θ is a threshold greater than zero. Therefore, by performing the above processing for each point A in the adjacent feature map (for example, the (t+1)-th frame feature map), all points A satisfying formula (1) may be determined, and these points constitute the occlusion mask image from the adjacent frame image to the current frame image.
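A small numerical sketch of the consistency check in formula (1) is shown below. The flow tensors are assumed to be in pixel units with shape (2, H, W), channel 0 holding the x displacement and channel 1 the y displacement; the threshold value is an illustrative assumption.

```python
import numpy as np

def occlusion_mask(flow_fwd: np.ndarray, flow_bwd: np.ndarray,
                   theta: float = 1.0) -> np.ndarray:
    """flow_fwd, flow_bwd: arrays of shape (2, H, W) holding the t->t+1 and
    t+1->t feature flows. Returns a boolean occlusion mask over the adjacent
    ((t+1)-th) feature map, following formula (1)."""
    _, H, W = flow_bwd.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # A' = A + flow_{t+1->t}(A), rounded and clipped to valid coordinates
    xs2 = np.clip(np.round(xs + flow_bwd[0]).astype(int), 0, W - 1)
    ys2 = np.clip(np.round(ys + flow_bwd[1]).astype(int), 0, H - 1)
    # Sum of the backward flow at A and the forward flow at A'
    diff = flow_bwd + flow_fwd[:, ys2, xs2]
    # A is an occlusion point when the squared magnitude exceeds theta
    return (diff ** 2).sum(axis=0) > theta
```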
The feature flow estimation module shown in
In addition, the input of each level of feature flow estimator (also referred to as a flow estimator for short) includes the feature map and the mask image of the object to be removed having a corresponding resolution. That is, for a feature flow estimation module with an N-level pyramid structure of N feature flow estimators, N−1 downsampling operations are required for the feature map and the mask image of the object to be removed having the original resolution. Therefore, the inputs of the 1st-level of feature flow estimator at the rightmost side of the
Therefore, when the feature flow estimation module has N levels of feature flow estimators (N is a positive integer greater than or equal to 1), the estimating the feature flow and the occlusion mask image from the adjacent frame image to the current frame image by using the feature flow estimation module includes: for a 1st-level of feature flow estimator, performing feature flow estimation by using two frames of feature maps and two frames of mask images of the object to be removed obtained from the (N−1)th downsampling, to generate a 1st-level of forward feature flow and a 1st-level of backward feature flow, wherein the forward feature flow represents a 1st-level of feature flow from the current frame image to the adjacent frame image, and the backward feature flow represents a 1st-level of feature flow from the adjacent frame image to the current frame image; and obtaining a 1st-level of occlusion mask image from the adjacent frame image to the current frame image by performing the consistency checking on the forward feature flow and the backward feature flow.
If N=1, the feature flow estimation module only has one feature flow estimator, at this time, the feature flow estimator may have the structure of the feature flow estimation module shown in
If N is greater than or equal to 2, the 1st-level of feature flow estimator has a structure similar to that shown in
In addition, when the feature flow estimation module has N levels of feature flow estimators (N is a positive integer greater than or equal to 1), the estimating the feature flow and the occlusion mask image from the adjacent frame image to the current frame image by using the feature flow estimation module includes: if N is greater than or equal to 2, for an nth-level of feature flow estimator, generating an nth-level of feature flow and an nth-level of occlusion mask image from the adjacent frame image to the current frame image, by using two frames of feature maps and two frames of mask images of the object to be removed obtained from the (N−n)th downsampling, and an (n−1)th-level of feature flow, an (n−1)th-level of occlusion mask image from the adjacent frame image to the current frame image and an (n−1)th-level of additional feature generated by an (n−1)th-level of feature flow estimator, wherein n is a positive integer greater than or equal to 2 and less than or equal to N, wherein an additional feature generated by each of the 1st to (N−1)th levels of feature flow estimators is used to indicate an occlusion area in an occlusion mask image generated by a corresponding level of feature flow estimator. This will be described below with reference to
As shown in
In operation S1020, an adjacent feature map corresponding to the adjacent frame image of the current frame image obtained from the (N−n)th downsampling is feature-weighted and aligned, based on the upsampled feature flow and the upsampled occlusion mask image, the (n−1)th-level of additional feature generated by the (n−1)th-level of feature flow estimator, and the mask image of the object to be removed corresponding to the adjacent frame image of the current frame image obtained from the (N−n)th downsampling, to obtain the weighted and aligned adjacent feature map.
This operation S1020 may include: weighting the adjacent feature map corresponding to the adjacent frame image of the current frame image, obtained from the (N−n)th downsampling, by using the upsampled occlusion mask image, to obtain the weighted adjacent feature map; performing a convolution processing (the size K of the convolution kernel may be 3×3, the number of channels C of the convolution kernel may be Cnum) and a nonlinear processing (such as, a Sigmoid function) using an activation function, on the mask image of the object to be removed corresponding to the adjacent frame image of the current frame image, obtained from the (N−n)th downsampling, to obtain a nonlinear-processed mask image of the object to be removed; performing the convolution processing (the size K of the convolution kernel may be 3×3, the number of channels C of the convolution kernel may be Cnum) on the (n−1)th-level of additional feature generated by the (n−1)th-level of feature flow estimator, and obtaining an updated adjacent feature map based on a result of the convolution processing and the weighted adjacent feature map; weighting the updated adjacent feature map by using the nonlinear-processed mask image of the object to be removed, and performing an alignment processing on it by using the upsampled feature flow, to obtain the weighted and aligned adjacent feature map.
Specifically, as shown in
In operation S1030, the current feature map corresponding to the current frame image obtained from the (N−n)th downsampling is weighted based on the mask image of the object to be removed corresponding to the current frame image obtained from the (N−n)th downsampling, to obtain the weighted current feature map. As shown in
In operation S1040, a backward concatenation is performed on the weighted and aligned adjacent feature map and the weighted current feature map to obtain the backward concatenation feature map. As shown in
Here, the process of obtaining the nth-level of occlusion mask image by performing occlusion mask prediction based on the encoding result in operation S1060 is different from the process of determining the occlusion mask image from the adjacent frame image to the current frame image through the consistency check process described above with reference to
In addition, if N is greater than or equal to 3, when n is greater than or equal to 2 and less than or equal to N−1, for the nth-level of feature flow estimator, the encoding result is also used for additional feature prediction to obtain the nth-level of additional feature. This process is the same as the process described above with reference to
So far, the feature flow and mask image from the adjacent frame image to the current frame image are finally obtained through the N levels of feature flow estimators, and in the same manner, the feature flow sequence of the current local frame sequence may be obtained.
Returning to
Specifically, after operation S130, the feature flow sequence and the occlusion mask image sequence of the current local frame sequence may be obtained. In this case, the obtaining the updated feature map sequence of the current local frame sequence includes: performing the feature fusion between adjacent feature maps in the feature map sequence based on the occlusion mask image sequence, the feature flow sequence and the mask image sequence of the object to be removed corresponding to the current local frame sequence, to obtain the updated feature map sequence of the current local frame sequence. This is described in detail below with reference to
In operation S1210, a mask image of the object to be removed corresponding to the adjacent frame image of the current frame image is selected, from among the mask image sequence of the object to be removed, a feature flow from the adjacent frame image to the current frame image is selected, from among the feature flow sequence, and an occlusion mask image from the adjacent frame image to the current frame image is selected, from among the occlusion mask image sequence.
In operation S1220, the feature map corresponding to the adjacent frame image of the current frame image is weighted and aligned based on the feature flow, the occlusion mask image, and the mask image of the object to be removed corresponding to the adjacent frame image of the current frame image, to obtain the weighted and aligned feature map corresponding to the adjacent frame image of the current frame image. This will be described below with reference to
As shown in
In operation S1320, a convolution processing and a nonlinear-processing (by using an activation function) are performed on the concatenation mask image to obtain the nonlinear-processed concatenation mask image. As shown in
In operation S1330, the weighted feature map corresponding to the adjacent frame image of the current frame image is obtained, by weighting the feature map corresponding to the adjacent frame image of the current frame image by using the nonlinear-processed concatenation mask image. As shown in
In operation S1340, the weighted and aligned feature map corresponding to the adjacent frame image of the current frame image is obtained, by performing a feature aligning on the weighted feature map corresponding to the adjacent frame image of the current frame image by using the feature flow. As shown in
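As an illustration of the feature aligning in operation S1340, a backward-warping sketch using torch.nn.functional.grid_sample is given below. Treating the alignment as bilinear warping with a feature flow given in pixel units (channel 0 holding the x displacement, channel 1 the y displacement) is an assumption of the sketch.

```python
import torch
import torch.nn.functional as F

def align_with_flow(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a feature map (B, C, H, W) towards the current frame using a
    feature flow (B, 2, H, W) expressed in pixels as (dx, dy)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=feat.device, dtype=feat.dtype),
        torch.arange(W, device=feat.device, dtype=feat.dtype),
        indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]          # sampling x-coordinates
    grid_y = ys.unsqueeze(0) + flow[:, 1]          # sampling y-coordinates
    # Normalize coordinates to [-1, 1] as required by grid_sample
    grid_x = 2.0 * grid_x / max(W - 1, 1) - 1.0
    grid_y = 2.0 * grid_y / max(H - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)   # (B, H, W, 2)
    return F.grid_sample(feat, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```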
Returning to
Specifically, the obtaining the updated feature map corresponding to the current frame image by performing feature fusion on the feature map corresponding to the current frame image and the weighted and aligned feature map corresponding to the adjacent frame image of the current frame image includes: performing concatenation on the feature map corresponding to the current frame image and the weighted and aligned feature map corresponding to the adjacent frame image of the current frame image, to obtain the concatenation feature map; performing a convolution processing on the concatenation feature map, to obtain the updated feature map corresponding to the current frame image. As shown in
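A minimal sketch of this fusion step (concatenation followed by a convolution) is shown below, assuming PyTorch tensors; the channel count is an illustrative assumption.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuses the current feature map with the weighted and aligned adjacent
    feature map by concatenation and a convolution."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, current_feat: torch.Tensor,
                aligned_adjacent_feat: torch.Tensor) -> torch.Tensor:
        cat = torch.cat([current_feat, aligned_adjacent_feat], dim=1)
        return self.fuse(cat)   # updated feature map of the current frame
```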
So far, the t-th frame feature map after propagation as shown in
Returning to
Specifically, contrary to the encoding process performed by the convolutional network encoder used in operation S120, the updated feature map sequence of the current local frame sequence may be input to the convolutional network decoder for decoding, to obtain a decoded current local frame sequence. Then, the mask area of the mask image in the mask image sequence may be used to crop the corresponding partial area from each frame of the decoded current local frame sequence, and then, the corresponding area of the corresponding frame in the original current local frame sequence may be replaced by the cropped corresponding partial area, so as to obtain the current local frame sequence after the object being removed or the missing area being completed.
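A sketch of the replacement step (pasting the decoded content back only inside the mask area) is shown below; representing frames and masks as floating-point arrays in [0, 1] is an assumption of the example.

```python
import numpy as np

def composite(original_frame: np.ndarray, decoded_frame: np.ndarray,
              mask: np.ndarray) -> np.ndarray:
    """original_frame, decoded_frame: (H, W, 3); mask: (H, W) with 1 inside the
    area of the object to be removed / missing area and 0 elsewhere. Only the
    mask area is taken from the decoded frame."""
    mask3 = mask[..., None]
    return decoded_frame * mask3 + original_frame * (1.0 - mask3)
```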
In order to further improve an effect of object removal or completion of missing area, the method shown in
In operation S1510, a candidate frame sequence is determined from a current video sequence to which the current local frame sequence belongs, wherein the candidate frame sequence comprises image frames in the current video sequence excluding the current local frame sequence.
Specifically, as shown in
In operation S1520, the reference frame sequence corresponding to the current local frame sequence is selected from the candidate frame sequence according to a similarity between each frame image in the candidate frame sequence and a specific frame image in the current local frame sequence, wherein the similarity represents a correlation of a background area and an uncorrelation of a foreground area between a candidate image in the candidate frame sequence and the specific frame image in the current local frame sequence.
Specifically, as shown in
The selecting the reference frame sequence corresponding to the current local frame sequence from the candidate frame sequence according to the similarity between each frame candidate image in the candidate frame sequence and the specific frame image in the current local frame sequence includes: obtaining the similarity between a current candidate image in the candidate frame sequence and the specific frame image in the current local frame sequence by inputting the current candidate image and the specific frame image into a reference frame matching network; if the similarity between the current candidate image and the specific frame image is greater than a first threshold, selecting the current candidate image as a reference frame and determining whether a sum of similarities of all reference frames selected for the current local frame sequence is greater than a second threshold; if the sum of the similarities of all the reference frames selected for the current local frame sequence is greater than the second threshold, determining all the reference frames selected for the current local frame sequence as the reference frame sequence corresponding to the current local frame sequence. Based on a frame order, the above operations are performed on each frame candidate image in the candidate frame sequence.
As shown in
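The following sketch illustrates the selection rule described above with two illustrative thresholds; the similarity function stands in for the reference frame matching network, and the threshold values are assumptions of the example.

```python
from typing import Any, Callable, List

def select_reference_frames(candidates: List[Any], specific_frame: Any,
                            similarity: Callable[[Any, Any], float],
                            first_threshold: float = 0.5,
                            second_threshold: float = 3.0) -> List[Any]:
    """Scan candidate frames in frame order; keep a candidate as a reference
    frame when its similarity exceeds the first threshold, and stop once the
    accumulated similarity of all selected reference frames exceeds the second
    threshold."""
    references, total = [], 0.0
    for candidate in candidates:
        s = similarity(candidate, specific_frame)
        if s > first_threshold:
            references.append(candidate)
            total += s
            if total > second_threshold:
                break   # enough reference frames collected
    return references
```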
The process of obtaining the similarity between the candidate image in the candidate frame sequence and the specific frame image in the current local frame sequence using the reference frame matching network is described in detail below with reference to
In operation S1710, a similarity matrix between the current candidate image in the candidate frame sequence and the specific frame image in the current local frame sequence is determined. Specifically, this operation may include: by concatenating and encoding the current candidate image in the candidate frame sequence and the specific frame image in the current local frame sequence, obtaining the similarity matrix between the current candidate image and the specific frame image.
As shown in
In operation S1720, an edge attention map is determined based on a mask edge map corresponding to the specific frame image and the similarity matrix. This operation may include: obtaining a heatmap of the mask edge map by convoluting the mask edge map; obtaining the edge attention map based on the heatmap and the similarity matrix.
As shown in
In operation S1730, a mask attention map is determined based on a mask image of the object to be removed corresponding to the specific frame image and the similarity matrix. This operation may include: obtaining a mask similarity matrix based on the mask image of the object to be removed and the similarity matrix; obtaining a mask area feature descriptor based on the mask similarity matrix; obtaining the mask attention map based on the mask area feature descriptor. For example, a ‘feature descriptor’ is a method that extracts feature descriptions for an interest point (or the full image). Feature descriptors serve as a kind of numerical ‘fingerprint’ that may be used to distinguish one feature from another by encoding interesting information into a string of numbers.
As shown in
In operation S1740, a fusion feature map is determined based on the similarity matrix, the edge attention map and the mask attention map. This operation may include: determining the fusion feature map, by multiplying the edge attention map and the similarity matrix and adding a result of the multiplication with the mask attention map.
In operation S1750, the similarity between the current candidate image and the specific frame image is obtained based on the fusion feature map.
As shown in
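Operation S1740 can be summarized by the single fusion expression below; the pooling and linear head that turn the fusion feature map into a scalar similarity (operation S1750) are assumptions of the sketch, not the exact head of the reference frame matching network.

```python
import torch
import torch.nn as nn

def fuse_and_score(similarity_matrix: torch.Tensor,
                   edge_attention: torch.Tensor,
                   mask_attention: torch.Tensor,
                   head: nn.Linear) -> torch.Tensor:
    """similarity_matrix, edge_attention, mask_attention: (B, C, H, W).
    head: nn.Linear(C, 1). Returns one similarity score per image pair."""
    # S1740: fusion = edge attention * similarity matrix + mask attention
    fusion = edge_attention * similarity_matrix + mask_attention
    # S1750 (illustrative): pool the fusion feature map and map it to a score
    pooled = fusion.mean(dim=(2, 3))           # (B, C)
    return torch.sigmoid(head(pooled)).squeeze(-1)
```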
An adaptive reference frame selection method proposed in the disclosure is described above with reference to
So far, according to the above described process, the reference frame sequence corresponding to each local frame sequence may be determined from the video, and then the reference frame sequence can be used to perform the feature enhancement or feature completion on the corresponding local frame sequence. Therefore, the method may also include: by encoding the reference frame sequence corresponding to each local frame sequence, obtaining a feature map sequence of the reference frame sequence. Specifically, the reference frame sequence may be encoded by using the convolutional network encoder adopted in operation S120 above to obtain the feature map sequence of the reference frame sequence.
On this basis, the decoding the updated feature map sequence of the current local frame sequence in operation S150 may include: performing feature enhancement or feature completion on the updated feature map sequence of the current local frame sequence by using the feature map sequence of the reference frame sequence of the current local frame sequence, to obtain the enhanced or completed feature map sequence of the current local frame sequence.
Specifically, feature enhancement and/or feature completion operations are performed on the updated feature map sequence of the current local frame sequence by using the feature map sequence of the reference frame sequence through a Transformer module. However, the disclosure is not limited to this. The feature enhancement and/or feature completion operations may be performed on the updated feature map sequence of the current local frame sequence by using the feature map sequence of the reference frame sequence through a PoolFormer module. Compared with the Transformer module, the PoolFormer module replaces the multi-head attention layer of the Transformer module with a pooling layer, thus making the module lightweight while maintaining performance, so that the video processing method proposed in the disclosure may be more easily deployed on the mobile terminal. Here, when content to be completed in the video has not been completed after operations S110 to S140 due to the video content, the updated feature map sequence of the current local frame sequence may be completed by using the feature map sequence of the reference frame sequence through the Transformer module or the PoolFormer module. In addition, even if the content to be completed in the video has been completed after operations S110 to S140, the Transformer module or the PoolFormer module may still use the feature map sequence of the reference frame sequence to enhance the features of the updated feature map sequence of the current local frame sequence, further improving the effect of video processing.
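For illustration, a simplified PoolFormer-style block is sketched below, in which the attention-based token mixing is replaced by an average pooling operation; the normalization choice, pooling size and MLP ratio are assumptions of the sketch, not the exact module used in the disclosure.

```python
import torch
import torch.nn as nn

class PoolFormerBlock(nn.Module):
    """Transformer-style block whose token mixer is a pooling layer instead of
    multi-head attention, which reduces computation."""
    def __init__(self, dim: int = 64, pool_size: int = 3, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * mlp_ratio, 1), nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.norm1(x)
        # Token mixing by pooling (pooled neighborhood minus the token itself)
        x = x + self.pool(y) - y
        # Channel MLP
        x = x + self.mlp(self.norm2(x))
        return x
```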
Then, the decoding the updated feature map sequence of the current local frame sequence (in operation S150) may also include: performing the decoding processing of the enhanced or completed feature map sequence of the current local frame sequence, that is, the enhanced or completed feature map sequence of the current local frame sequence may be input to a convolutional network decoder for decoding, to obtain the decoded current local frame sequence. Then, the mask area of the mask image in the mask image sequence may be used to crop a corresponding partial area from each frame of the decoded current local frame sequence, and then, a corresponding area of the corresponding frame in the original current local frame sequence may be replaced by the cropped corresponding partial area, so as to obtain the current local frame sequence after the object being removed or the missing area being completed.
In addition, from the above replacement process (that is, the process of replacing the corresponding area of the corresponding frame in the original current local frame sequence by the cropped corresponding partial area), it may be found that, in the decoding stage, only the feature area associated with the mask area is useful for the final completion of the resulting image, while the calculation of the features of the other areas outside the mask area in the decoding stage is redundant. Therefore, in order to further reduce the redundant calculation in the decoding stage, the disclosure proposes a feature cropping method without information loss. This will be described below with reference to
As shown in
Specifically, because each dynamic change of the resolution of the input image leads to additional model loading time when a deep learning model performs inference on the mobile terminal, in order to reduce this time loss, the disclosure determines a maximum bounding box of the mask area according to the mask image sequence corresponding to the object to be removed, for the current local frame sequence, thereby avoiding the additional model loading time caused by dynamic changes of the input image resolution. As shown in
Input(h, w) = (max(h_1, h_2, …, h_m), max(w_1, w_2, …, w_m))    (2)
Wherein, m represents the number of mask images of the object to be removed in the mask image sequence of the object to be removed, h_1, h_2, …, h_m represent the heights of the bounding boxes of the mask areas of the 1st, 2nd, …, m-th mask images of the object to be removed, and w_1, w_2, …, w_m represent the corresponding widths, respectively. Input(h, w) represents the maximum bounding box of the mask area, wherein h and w represent the height and width of the maximum bounding box of the mask area.
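A sketch of computing the maximum bounding box of formula (2) from a sequence of binary mask images is shown below, assuming NumPy arrays with nonzero values inside the mask area.

```python
from typing import List, Tuple
import numpy as np

def max_mask_bounding_box(masks: List[np.ndarray]) -> Tuple[int, int]:
    """Return (h, w): the maximum bounding-box height and width of the mask
    areas over all mask images of the object to be removed."""
    max_h, max_w = 0, 0
    for mask in masks:
        ys, xs = np.nonzero(mask)
        if ys.size == 0:        # frame without a mask area
            continue
        max_h = max(max_h, int(ys.max() - ys.min() + 1))
        max_w = max(max_w, int(xs.max() - xs.min() + 1))
    return max_h, max_w
```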
In operation S1920, a calculation area of the maximum bounding box of the mask area on the enhanced or completed feature map sequence of the current local frame sequence is determined.
Specifically, as shown in
l_k = l_{k−1} + (f_k − 1) × s_1 × s_2 × … × s_{k−1}    (3)
Wherein, k represents the k-th layer of the decoder network, f_k represents the convolution kernel size of each layer, and s_i represents the stride of each layer when convolution is performed. It is assumed that the decoder used in the subsequent decoding operations of the disclosure consists of two upsamplings and two convolution layers with 3×3 convolution kernels. Therefore, for the output layer of the decoder, the receptive field with respect to the input layer may be calculated as:
l_0 = 1
l_1 = 1 + (3 − 1) = 3
l_2 = 3 + (2 − 1) × 1 = 4
l_3 = 4 + (3 − 1) × 1 × 2 = 8
l_4 = 8 + (2 − 1) × 1 × 2 × 1 = 10    (4)
According to equation (4) above, each receptive field of the decoder may be deduced.
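A sketch of the recursion used to obtain equation (4) is given below; describing the decoder as a list of (kernel_size, stride) pairs, with the factors read off equation (4), is an assumption of the example.

```python
from typing import List, Tuple

def receptive_field(layers: List[Tuple[int, int]]) -> List[int]:
    """layers: (kernel_size, stride) per layer, from input to output.
    Returns the receptive field after each layer:
    l_k = l_{k-1} + (f_k - 1) * s_1 * ... * s_{k-1}."""
    rf, stride_product, out = 1, 1, [1]
    for kernel_size, stride in layers:
        rf = rf + (kernel_size - 1) * stride_product
        stride_product *= stride
        out.append(rf)
    return out

# Kernel/stride factors matching equation (4): conv 3x3 (s=1), upsample (f=2, s=2),
# conv 3x3 (s=1), upsample (f=2, s=1)  ->  [1, 3, 4, 8, 10]
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 1)]))
```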
Then, the maximum bounding box of the mask area is scaled according to a resolution ratio between the enhanced or completed feature map sequence and an original feature map sequence of the current local frame sequence. Since the feature map sequence subsequently input into the decoder has been downsampled in the previous process, in order to make the maximum bounding box of the mask area at this time correspond to the resolution of the feature map sequence at the current stage (that is, the enhanced or completed feature map sequence), it is necessary to scale the maximum bounding box of the mask area according to the resolution ratio between the enhanced or completed feature map sequence and the original feature map sequence of the current local frame sequence, for example, reducing the maximum bounding box of the mask area by a quarter.
Then, the scaled maximum bounding box of the mask area is expanded according to the receptive field, to obtain the calculation area. Specifically, the top, bottom, left and right of the scaled maximum bounding box of the mask area are expanded outwards by (receptive field/2) pixels according to the calculated receptive field. For example, it is assumed that the decoder used in the disclosure is composed of two upsamplings and two convolution layers with 3×3 convolution kernels; therefore, as shown in equation (4) above, the receptive field of the decoder is l_4 = 10. Accordingly, after the maximum bounding box of the mask area is scaled, the top, bottom, left and right of the scaled maximum bounding box of the mask area are expanded outwards by l_4/2 pixels, that is, 5 pixels, to obtain the calculation area of the maximum bounding box of the mask area on the enhanced or completed feature map of the current local frame sequence.
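A sketch of determining the calculation area is given below: the maximum bounding box is scaled by the feature/original resolution ratio and expanded on each side by half of the receptive field. The (top, left, height, width) box representation, the 1/4 ratio and the receptive field of 10 are illustrative assumptions following the example above.

```python
from typing import Tuple

def calculation_area(bbox: Tuple[int, int, int, int],
                     scale: float = 0.25,
                     receptive_field: int = 10,
                     feat_h: int = 0, feat_w: int = 0) -> Tuple[int, int, int, int]:
    """bbox: (top, left, height, width) of the maximum bounding box on the
    original frames. Returns (top, left, height, width) of the area to keep on
    the feature map, expanded by receptive_field // 2 on each side."""
    top, left, h, w = bbox
    top, left = int(top * scale), int(left * scale)
    h, w = int(h * scale), int(w * scale)
    pad = receptive_field // 2
    top, left = max(top - pad, 0), max(left - pad, 0)
    h, w = h + 2 * pad, w + 2 * pad
    if feat_h and feat_w:               # optionally clip to the feature map size
        h, w = min(h, feat_h - top), min(w, feat_w - left)
    return top, left, h, w
```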
In operation S1930, the enhanced or completed feature map sequence of the current local frame sequence is cropped according to the calculation area, and the cropped feature map sequence is decoded.
As shown in
After the decoding result of the cropped area sequence is obtained according to the process described in
Specifically, since the decoding result of the cropped area sequence is a decoded image area sequence corresponding to the maximum bounding box of the mask area, the corresponding area in the local frames may be replaced by the decoded image area of each local frame in the decoding result, according to the size and position of the maximum bounding box of the mask area.
The feature cropping method without information loss described above may reduce the redundant calculation in the decoding stage and greatly improve the video processing speed on the mobile terminal while keeping the final video processing effect (such as the effect of object removal) unchanged, thereby giving the method proposed in the disclosure a faster inference speed when deployed on the mobile terminal.
Firstly, a video is split into scenes according to scene information to obtain at least one video sequence, and then at least one local frame sequence is determined from the split at least one video sequence. Then, as shown in
A current local frame sequence and other frames in a current video sequence to which it belongs are input to an adaptive reference frame selection module, which, for the current local frame sequence, selects a reference frame sequence from a candidate frame sequence in the current video sequence excluding the current local frame sequence according to a similarity;
The current local frame sequence and its reference frame sequence are input to an encoder (such as, a convolutional network encoder) for encoding, to obtain a feature map sequence of the current local frame sequence and a feature map sequence of the reference frame sequence;
Based on a feature aligning and propagation module, a feature flow sequence of the current local frame sequence is determined using the feature map sequence of the current local frame sequence and the mask image sequence corresponding to the object to be removed, and then feature fusion is performed between adjacent feature maps in the feature map sequence of the current local frame sequence based on the feature flow sequence to obtain the updated feature map sequence of the current local frame sequence;
The feature enhancement or feature completion operations are performed on the updated feature map sequence of the current local frame sequence by using the feature map sequence of the reference frame sequence output by the encoder through a PoolFormer module or Transformer module, to obtain the enhanced or completed feature map sequence of the current local frame sequence;
Each feature map in the enhanced or completed feature map sequence is cropped using a maximum bounding box of object mask determined based on the mask image sequence of the object to be removed, through a feature cropping module without information loss.
The decoder is used to decode the cropped feature map sequence to obtain the image area corresponding to the maximum bounding box of the object mask for each local frame in the current local frame sequence.
Finally, the corresponding local frame in the current local frame sequence is replaced by each decoded image area corresponding to the maximum bounding box of the object mask, thereby realizing operations such as object removal in specific areas and completion of the missing area.
As shown in
In operation S2320, a reference frame sequence corresponding to each local frame sequence, is determined from the video, and an inpainting processing is performed for the corresponding local frame sequence according to the reference frame sequence.
The determining the reference frame sequence corresponding to each local frame sequence from the video includes: determining a candidate frame sequence from a current video sequence to which a current local frame sequence belongs, wherein the candidate frame sequence comprises image frames in the current video sequence excluding the current local frame sequence; selecting the reference frame sequence corresponding to the current local frame sequence from the candidate frame sequence according to a similarity between each frame image in the candidate frame sequence and a specific frame image in the current local frame sequence, wherein the similarity represents a correlation of a background area and an uncorrelation of a foreground area between an image in the candidate frame sequence and the specific frame image in the current local frame sequence. Since the above process is the same as that in
Thereafter, after the reference frame sequence is selected, the current local frame sequence is inpainted according to the reference frame sequence. For example, a convolutional network encoder may be used to encode the reference frame sequence to obtain a feature map sequence of the reference frame sequence, and then the feature map sequence of the reference frame sequence may be used to enhance or complete the feature map sequence of the current local frame sequence, thereafter, the feature enhanced or feature completed feature map sequence of the current local frame sequence is decoded, and the corresponding image in the current local frame sequence is replaced by the image area corresponding to the mask area in the decoding result, so as to obtain the current local frame sequence after the object being removed or the missing area being completed. In addition, the feature cropping method without information loss described above with reference to
As shown in
Compared with the existing methods, the adaptive reference frame selection method proposed in the disclosure may ensure the effectiveness and efficiency of reference frame selection, thereby making the effect of object removal, completion of the missing area, etc. better. Compared with the existing methods, the feature aligning and propagation method (module) based on the feature flow proposed in the disclosure may make the video completion effect temporally more stable, thus making the final effect of object removal and completion of the missing area better. Through a quantitative comparison performed by calculating the L1 distance between the corresponding channels of the two aligned features, it may be found that the performance of feature alignment using the feature flow output by the feature flow estimation module is better than that of the existing methods.
At least one of the above plurality of modules may be implemented through the AI model. Functions associated with AI may be performed by non-volatile memory, volatile memory, and processors.
As an example, the electronic apparatus may be a PC computer, a tablet device, a personal digital assistant, a smart phone, or another device capable of executing the above set of instructions. Here, the electronic apparatus does not have to be a single electronic apparatus and may also be any device or a collection of circuits that may execute the above instructions (or instruction sets) individually or jointly. The electronic apparatus may also be a part of an integrated control system or a system manager, or may be configured as a portable electronic apparatus interconnected with a local or remote device by an interface (e.g., via wireless transmission). A processor may include one or more processors. In this case, the one or more processors may be a general-purpose processor, such as a central processing unit (CPU), an application processor (AP), etc., a processor used only for graphics (such as a graphics processing unit (GPU) or a visual processing unit (VPU)), and/or an AI-dedicated processor (such as a neural processing unit (NPU)). The one or more processors control the processing of input data according to predefined operation rules or AI models stored in a non-volatile memory and a volatile memory. The predefined operation rules or AI models may be provided through training or learning. Here, providing by learning means that the predefined operation rules or AI models with desired characteristics are formed by applying a learning algorithm to a plurality of learning data. The learning may be performed in the apparatus itself executing the AI according to the embodiment, and/or may be implemented by a separate server/apparatus/system.
A learning algorithm is a method that uses a plurality of learning data to train a predetermined target apparatus (for example, a robot) to enable, allow, or control the target apparatus to make a determination or prediction. Examples of the learning algorithm include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
The AI models may be obtained through training. Here, “obtained through training” refers to training a basic AI model with a plurality of training data through a training algorithm to obtain the predefined operation rules or AI models, which are configured to perform the required features (or purposes).
As an example, the AI models may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and a neural network calculation is performed by performing a calculation between the calculation result of the previous layer and the plurality of weight values. Examples of the neural network include, but are not limited to, a convolutional neural network (CNN), a deep neural network (DNN), a recurrent neural network (RNN), a restricted Boltzmann machine (RBM), a deep belief network (DBN), a bidirectional recurrent deep neural network (BRDNN), a generative adversarial network (GAN), and a deep Q-network.
The processor may execute instructions or codes stored in the memory, where the memory may also store data. Instructions and data may also be transmitted and received through a network via a network interface device, wherein the network interface device may use any known transmission protocol.
The memory may be integrated with the processor as a whole, for example, RAM or a flash memory is arranged in an integrated circuit microprocessor or the like. In addition, the memory may include an independent device, such as an external disk drive, a storage array, or other storage device that may be used by any database system. The memory and the processor may be operatively coupled, or may communicate with each other, for example, through an I/O port, a network connection, or the like, so that the processor may read files stored in the memory.
In addition, the electronic apparatus may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, a mouse, a touch input device, etc.). All components of the electronic apparatus may be connected to each other via a bus and/or a network.
According to an embodiment of the disclosure, there may also be provided a computer-readable storage medium storing instructions, wherein the instructions, when being executed by at least one processor, cause the at least one processor to execute the above method performed by the electronic apparatus according to the exemplary embodiment of the disclosure. Examples of the computer-readable storage medium here include: Read Only Memory (ROM), Random Access Programmable Read Only Memory (PROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disc storage, Hard Disk Drive (HDD), Solid State Drive (SSD), card storage (such as multimedia card, secure digital (SD) card or extremely fast digital (XD) card), magnetic tape, floppy disk, magneto-optical data storage device, optical data storage device, hard disk, solid state disk and any other devices which are configured to store computer programs and any associated data, data files, and data structures in a non-transitory manner, and provide the computer programs and any associated data, data files, and data structures to the processor or the computer, so that the processor or the computer may execute the computer programs. The instructions and the computer programs in the above computer-readable storage mediums may run in an environment deployed in computer equipment such as a client, a host, an agent device, a server, etc. In addition, in one example, the computer programs and any associated data, data files and data structures are distributed on networked computer systems, so that computer programs and any associated data, data files, and data structures are stored, accessed, and executed in a distributed manner through one or more processors or computers.
According to an embodiment of the disclosure, a method performed by an electronic apparatus may be provided. The method may include determining a local frame sequence from a video. The method may include obtaining a feature map sequence of the local frame sequence by encoding the local frame sequence. The method may include determining a feature flow sequence of the local frame sequence based on the feature map sequence of the local frame sequence and a mask image sequence regarding an object to be removed, the mask image sequence being corresponding to the feature map sequence. The method may include obtaining an updated feature map sequence of the local frame sequence, by performing, based on the feature flow sequence, feature fusion between adjacent feature maps in the feature map sequence. The method may include obtaining a processed local frame sequence, by decoding the updated feature map sequence of the local frame sequence.
The determining the local frame sequence from the video may include splitting, based on scene information, the video to obtain a video sequence. The determining the local frame sequence from the video may include obtaining the local frame sequence, by selecting, based on a predetermined stride, a predetermined number of consecutive image frames from the video sequence as the local frame sequence. The predetermined stride may be less than or equal to the predetermined number.
The method may include determining a reference frame sequence corresponding to the local frame sequence. The method may include obtaining a feature map sequence of the reference frame sequence by encoding the reference frame sequence corresponding to the local frame sequence. The decoding the updated feature map sequence of the local frame sequence may include performing feature enhancement or feature completion on the updated feature map sequence of the local frame sequence by using the feature map sequence of the reference frame sequence of the local frame sequence, to obtain the enhanced or completed feature map sequence of the local frame sequence, and decoding the enhanced or completed feature map sequence of the local frame sequence.
The determining the reference frame sequence corresponding to the local frame sequence may include determining a candidate frame sequence from a video sequence to which the local frame sequence belongs, the candidate frame sequence comprising image frames in the video sequence excluding the local frame sequence, and selecting the reference frame sequence corresponding to the local frame sequence from the candidate frame sequence based on a similarity between a candidate image in the candidate frame sequence and a frame image in the local frame sequence, the similarity representing a correlation of a background area and an uncorrelation of a foreground area between the candidate image in the candidate frame sequence and the frame image in the local frame sequence.
The selecting the reference frame sequence corresponding to the local frame sequence from the candidate frame sequence based on the similarity between the candidate image in the candidate frame sequence and the frame image in the local frame sequence may include obtaining the similarity between the candidate image in the candidate frame sequence and the frame image in the local frame sequence by inputting the candidate image in the candidate frame sequence and the frame image in the local frame sequence into a reference frame matching network; based on the similarity being greater than a first threshold, selecting the candidate image as a reference frame and determining whether a sum of similarities of a plurality of reference frames selected for the local frame sequence is greater than a second threshold; and based on the sum of the similarities of the plurality of reference frames selected for the local frame sequence being greater than the second threshold, determining all the reference frames selected for the local frame sequence as the reference frame sequence corresponding to the local frame sequence.
The obtaining the similarity between the candidate image in the candidate frame sequence and the frame image in the local frame sequence by inputting the candidate image in the candidate frame sequence and the frame image in the local frame sequence into the reference frame matching network may include determining a similarity matrix between the candidate image in the candidate frame sequence and the frame image in the local frame sequence, determining an edge attention map based on a mask edge map corresponding to the frame image in the local frame sequence and the similarity matrix, determining a mask attention map based on a mask image of the object to be removed and the similarity matrix. The mask image may be corresponding to the frame image in the local frame sequence; determining a fusion feature map based on the similarity matrix, the edge attention map, and the mask attention map, and obtaining the similarity between the candidate image and the frame image in the local frame sequence based on the fusion feature map.
The determining the similarity matrix between the candidate image in the candidate frame sequence and the frame image in the local frame sequence may include obtaining the similarity matrix between the candidate image and the frame image by concatenating and coding the candidate image in the candidate frame sequence and the frame image in the local frame sequence.
The determining the edge attention map based on the mask edge map corresponding to the frame image and the similarity matrix may include obtaining a heatmap of the mask edge map by convoluting the mask edge map; and obtaining the edge attention map based on the heatmap and the similarity matrix.
The determining the mask attention map based on the mask image of the object to be removed and the similarity matrix, the mask image corresponding to the frame image, may include obtaining a mask similarity matrix based on the mask image of the object to be removed and the similarity matrix; obtaining a mask area feature descriptor based on the mask similarity matrix; and obtaining the mask attention map based on the mask area feature descriptor.
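By way of non-limiting illustration only, the following PyTorch sketch outlines one possible arrangement of the reference frame matching network described above (similarity matrix, edge attention map, mask attention map, fusion, and similarity score); the channel counts, the sigmoid gating, and the pooling used for the mask area feature descriptor are assumptions rather than the claimed design.

```python
# Illustrative sketch only: one possible reference-frame matching network.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RefFrameMatcher(nn.Module):
    def __init__(self, ch: int = 64):
        super().__init__()
        # concatenated candidate + local frame (3 + 3 channels) -> "similarity matrix" features
        self.sim_encoder = nn.Sequential(
            nn.Conv2d(6, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.edge_conv = nn.Conv2d(1, 1, 3, padding=1)    # mask-edge map -> heatmap
        self.desc_conv = nn.Conv2d(ch, ch, 1)             # mask-area descriptor -> attention weights
        self.fuse = nn.Sequential(nn.Conv2d(3 * ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.head = nn.Linear(ch, 1)                      # fused features -> scalar similarity

    def forward(self, candidate, local_frame, mask, mask_edge):
        # 1) similarity matrix: concatenate and encode the two images
        sim = self.sim_encoder(torch.cat([candidate, local_frame], dim=1))   # (B, ch, h, w)
        h, w = sim.shape[-2:]
        mask_s = F.interpolate(mask, size=(h, w), mode="nearest")            # object mask at feature size
        edge_s = F.interpolate(mask_edge, size=(h, w), mode="nearest")

        # 2) edge attention map: heatmap of the mask-edge map modulates the similarity matrix
        edge_att = torch.sigmoid(self.edge_conv(edge_s)) * sim

        # 3) mask attention map: masked similarity -> pooled mask-area descriptor -> attention
        mask_sim = sim * mask_s
        desc = F.adaptive_avg_pool2d(mask_sim, 1)                            # (B, ch, 1, 1) descriptor
        mask_att = torch.sigmoid(self.desc_conv(desc)) * sim

        # 4) fusion feature map and scalar similarity score
        fused = self.fuse(torch.cat([sim, edge_att, mask_att], dim=1))
        score = torch.sigmoid(self.head(F.adaptive_avg_pool2d(fused, 1).flatten(1)))
        return score.squeeze(1)                                              # (B,)
```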
The determining the feature flow sequence of the local frame sequence based on the feature map sequence of the local frame sequence and the mask image sequence regarding the object to be removed may include determining an occlusion mask image sequence and the feature flow sequence of the local frame sequence based on the feature map sequence of the local frame sequence and the mask image sequence regarding the object to be removed.
The obtaining the updated feature map sequence of the local frame sequence may include performing, based on the occlusion mask image sequence, the feature flow sequence, and the mask image sequence regarding the object to be removed, the feature fusion between adjacent feature maps in the feature map sequence, to obtain the updated feature map sequence of the local frame sequence.
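By way of non-limiting illustration only, the following sketch shows how the per-pair estimation and fusion operations described in the following paragraphs could be chained over a local frame sequence; estimate_pair and fuse_pair are hypothetical stand-ins for the modules detailed below, and the choice of a single adjacent frame per frame image is an assumption.

```python
# Illustrative sketch only: sequence-level orchestration of per-pair flow
# estimation and feature fusion. `estimate_pair` and `fuse_pair` are
# hypothetical callables.
def update_feature_maps(feats, masks, estimate_pair, fuse_pair):
    """feats[i], masks[i]: feature map / object mask of the i-th local frame."""
    updated = []
    for i, feat in enumerate(feats):
        j = i + 1 if i + 1 < len(feats) else i - 1                       # pick an adjacent frame
        flow, occ = estimate_pair(feats[j], feat, masks[j], masks[i])    # adjacent -> current frame
        updated.append(fuse_pair(feat, feats[j], flow, occ, masks[j]))   # fuse warped neighbour
    return updated
```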
The determining the occlusion mask image sequence and the feature flow sequence of the local frame sequence based on the feature map sequence of the local frame sequence and the mask image sequence regarding the object to be removed may include selecting, from the feature map sequence of the local frame sequence, two frames of feature maps corresponding to a frame image and an adjacent frame image, and selecting, from the mask image sequence regarding the object to be removed, two frames of mask images of the object to be removed corresponding to the frame image and the adjacent frame image; and estimating a feature flow and the occlusion mask image from the adjacent frame image to the frame image by using a feature flow estimation module, based on the two frames of feature maps and the two frames of mask images of the object to be removed.
The estimating the feature flow and the occlusion mask image from the adjacent frame image to the frame image by using the feature flow estimation module may include determining a forward feature flow and a backward feature flow based on the two frames of feature maps and the two frames of the mask images of the object to be removed; and obtaining the occlusion mask image from the adjacent frame image to the frame image by performing consistency checking on the forward feature flow and the backward feature flow.
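By way of non-limiting illustration only, the following sketch shows a conventional forward-backward consistency check that yields an occlusion mask from a forward feature flow and a backward feature flow; the warping convention and the threshold value are assumptions.

```python
# Illustrative sketch only: forward-backward consistency check for occlusion.
import torch
import torch.nn.functional as F


def warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp `feat` (B, C, H, W) with `flow` (B, 2, H, W) given as pixel offsets."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device, dtype=feat.dtype),
        torch.arange(w, device=feat.device, dtype=feat.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]            # sample positions in pixels
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    grid = torch.stack(                              # normalize to [-1, 1] for grid_sample
        [2.0 * grid_x / max(w - 1, 1) - 1.0, 2.0 * grid_y / max(h - 1, 1) - 1.0], dim=-1
    )
    return F.grid_sample(feat, grid, mode="bilinear", padding_mode="border", align_corners=True)


def occlusion_mask(flow_fwd: torch.Tensor, flow_bwd: torch.Tensor, thresh: float = 1.0):
    """Positions where forward and (warped) backward flow disagree are marked occluded."""
    flow_bwd_warped = warp(flow_bwd, flow_fwd)       # bring backward flow into the forward frame
    diff = (flow_fwd + flow_bwd_warped).norm(dim=1, keepdim=True)
    return (diff > thresh).float()                   # 1 = occluded / inconsistent, 0 = consistent
```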
The feature flow estimation module may include N levels of feature flow estimators, where N may be a positive integer greater than or equal to one (1). The estimating the feature flow and the occlusion mask image from the adjacent frame image to the frame image by using the feature flow estimation module may include, for a 1st-level of feature flow estimator, performing feature flow estimation by using the two frames of feature maps and the two frames of mask images of the object to be removed obtained from (N−1)th downsampling, to generate a 1st-level of forward feature flow and a 1st-level of backward feature flow, and obtaining a 1st-level of occlusion mask image from the adjacent frame image to the frame image by performing the consistency checking on the forward feature flow and the backward feature flow. The forward feature flow may represent the 1st-level of feature flow from the frame image to the adjacent frame image, and the backward feature flow may represent the 1st-level of feature flow from the adjacent frame image to the frame image.
The estimating the feature flow and the occlusion mask image from the adjacent frame image to the frame image by using the feature flow estimation module may further include, based on N being greater than or equal to 2, for an nth-level of feature flow estimator, generating an nth-level of feature flow and an nth-level of occlusion mask image from the adjacent frame image to the frame image, by using the two frames of feature maps and the two frames of mask images of the object to be removed obtained from (N−n)th downsampling, and an (n−1)th-level of feature flow, an (n−1)th-level of occlusion mask image from the adjacent frame image to the frame image, and an (n−1)th-level of additional feature generated by an (n−1)th-level of feature flow estimator, where n may be a positive integer greater than or equal to 2 and less than or equal to N.
An additional feature generated by a level of the 1st to (N−1)th levels of feature flow estimators may indicate an occlusion area in an occlusion mask image generated by a corresponding level of feature flow estimator.
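By way of non-limiting illustration only, the following sketch shows a coarse-to-fine loop over N levels of feature flow estimators; level_one_estimator and refine_estimators are hypothetical callables standing in for the 1st-level and nth-level estimators described above, and the doubling of the flow magnitude at each upsampling step is an assumption.

```python
# Illustrative sketch only: coarse-to-fine loop over N feature-flow estimator levels.
import torch.nn.functional as F


def estimate_flow_pyramid(feat_pyramid, mask_pyramid, level_one_estimator, refine_estimators):
    """
    feat_pyramid[k], mask_pyramid[k]: the two frames' feature maps / object masks after
    the corresponding downsampling, ordered from coarsest (k = 0) to finest (k = N - 1).
    Returns the finest-level feature flow and occlusion mask.
    """
    n_levels = len(feat_pyramid)
    # 1st level: estimate flow and occlusion mask at the coarsest resolution
    flow, occ, extra = level_one_estimator(feat_pyramid[0], mask_pyramid[0])
    for n in range(1, n_levels):
        # upsample the previous level's outputs to the current resolution
        flow = 2.0 * F.interpolate(flow, scale_factor=2, mode="bilinear", align_corners=False)
        occ = F.interpolate(occ, scale_factor=2, mode="nearest")
        # nth level: refine using current-resolution features/masks plus the previous
        # level's flow, occlusion mask and additional feature (resolution handling of
        # `extra` is assumed to be done inside the refining estimator)
        flow, occ, extra = refine_estimators[n - 1](
            feat_pyramid[n], mask_pyramid[n], flow, occ, extra
        )
    return flow, occ
```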
The performing feature flow estimation by using the two frames of feature maps and the two frames of mask images of the object to be removed obtained from the (N−1)th downsampling, to generate the 1st-level of forward feature flow and the 1st-level of backward feature flow may include: weighting the two frames of feature maps obtained from the (N−1)th downsampling and a corresponding mask image of the object to be removed, respectively, to obtain the two frames of the weighted feature maps; obtaining a forward concatenation feature map and a backward concatenation feature map by performing forward concatenation and backward concatenation on the two frames of the weighted feature maps, respectively; encoding the forward concatenation feature map and the backward concatenation feature map by using a plurality of residual blocks based on a residual network; obtaining the forward feature flow by performing a feature flow prediction on an encoding result of the forward concatenation feature map; and obtaining the backward feature flow by performing a feature flow prediction on an encoding result of the backward concatenation feature map.
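By way of non-limiting illustration only, the following sketch shows one possible 1st-level estimator in which the feature maps are weighted by the mask images, concatenated in forward and backward order, encoded by residual blocks, and decoded into feature flows; the use of (1 − mask) weighting, the channel counts, and the block depth are assumptions.

```python
# Illustrative sketch only: a possible 1st-level feature flow estimator.
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True), nn.Conv2d(ch, ch, 3, padding=1)
        )

    def forward(self, x):
        return x + self.body(x)


class LevelOneFlowEstimator(nn.Module):
    def __init__(self, feat_ch: int = 64, num_blocks: int = 4):
        super().__init__()
        self.inp = nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1)
        self.encoder = nn.Sequential(*[ResBlock(feat_ch) for _ in range(num_blocks)])
        self.flow_head = nn.Conv2d(feat_ch, 2, 3, padding=1)   # predicts a 2-channel feature flow

    def _predict(self, a, b):
        return self.flow_head(self.encoder(self.inp(torch.cat([a, b], dim=1))))

    def forward(self, feat_t, feat_adj, mask_t, mask_adj):
        # weight each feature map with (1 - mask) so the object-to-remove area is suppressed (assumption)
        w_t, w_adj = feat_t * (1.0 - mask_t), feat_adj * (1.0 - mask_adj)
        flow_fwd = self._predict(w_t, w_adj)    # forward: frame image -> adjacent frame image
        flow_bwd = self._predict(w_adj, w_t)    # backward: adjacent frame image -> frame image
        return flow_fwd, flow_bwd
```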
The generating the nth-level of feature flow and the nth-level of occlusion mask image from the adjacent frame image to the frame image may include upsampling the (n−1)th-level of feature flow and the (n−1)th-level of occlusion mask image generated by the (n−1)th-level of feature flow estimator, to obtain an upsampled feature flow and an upsampled occlusion mask image; performing feature-weighting and aligning on an adjacent feature map corresponding to the adjacent frame image obtained from the (N−n)th downsampling, based on the upsampled feature flow, the upsampled occlusion mask image, the (n−1)th-level of additional feature generated by the (n−1)th-level of feature flow estimator, and the mask image of the object to be removed corresponding to the adjacent frame image obtained from the (N−n)th downsampling, to obtain the weighted and aligned adjacent feature map; weighting the feature map corresponding to the frame image obtained from the (N−n)th downsampling, based on the mask image of the object to be removed corresponding to the frame image obtained from the (N−n)th downsampling, to obtain a weighted feature map; performing backward concatenation between the weighted and aligned adjacent feature map and the weighted feature map, to obtain a backward concatenation feature map; encoding the backward concatenation feature map by using a plurality of residual blocks based on a residual network; and performing a feature flow prediction and an occlusion mask prediction based on the encoding result, to obtain the nth-level of feature flow and the nth-level of occlusion mask image from the adjacent frame image to the frame image.
The obtaining the weighted and aligned adjacent feature map may include weighting the adjacent feature map by using the upsampled occlusion mask image to obtain the weighted adjacent feature map; performing a convolution processing and a nonlinear-processing by using an activation function, on the mask image of the object to be removed corresponding to the adjacent frame image obtained from the (N−n)th downsampling, to obtain a nonlinear-processed mask image of the object to be removed; performing the convolution processing on the (n−1)th-level of additional feature generated by the (n−1)th-level of feature flow estimator, and obtaining an updated adjacent feature map based on a result of the convolution processing and the weighted adjacent feature map; and weighting the updated adjacent feature map by using the nonlinear-processed mask image of the object to be removed, and performing an alignment processing on the weighted updated adjacent feature map by using the upsampled feature flow, to obtain the weighted and aligned adjacent feature map.
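By way of non-limiting illustration only, the following sketch shows one possible form of the weighting and aligning of the adjacent feature map at the nth level; warp is the flow-warping helper from the consistency-check sketch above, it is assumed that the (n−1)th-level additional feature has already been brought to the current resolution, and the sigmoid gating and (1 − occlusion) weighting are assumptions.

```python
# Illustrative sketch only: nth-level "weight and align" step for the adjacent feature map.
import torch
import torch.nn as nn


class WeightAndAlignAdjacent(nn.Module):
    def __init__(self, feat_ch: int = 64, extra_ch: int = 32):
        super().__init__()
        self.mask_conv = nn.Conv2d(1, feat_ch, 3, padding=1)          # object mask -> gating features
        self.extra_conv = nn.Conv2d(extra_ch, feat_ch, 3, padding=1)  # (n-1)th-level additional feature

    def forward(self, feat_adj, up_flow, up_occ, extra_prev, mask_adj):
        # 1) weight the adjacent feature map with the upsampled occlusion mask
        weighted = feat_adj * (1.0 - up_occ)                  # suppress occluded positions (assumption)
        # 2) convolution + nonlinearity on the adjacent frame's object mask
        gate = torch.sigmoid(self.mask_conv(mask_adj))
        # 3) convolve the previous level's additional feature and update the adjacent feature map
        updated = weighted + self.extra_conv(extra_prev)
        # 4) weight with the nonlinear-processed mask, then align with the upsampled feature flow
        return warp(updated * gate, up_flow)
```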
The obtaining the updated feature map sequence of the local frame sequence may include selecting, from among the mask image sequence regarding the object to be removed, a mask image of the object to be removed corresponding to the adjacent frame image of the frame image, selecting, from among the feature flow sequence, a feature flow from the adjacent frame image to the frame image, and selecting, from among the occlusion mask image sequence, an occlusion mask image from the adjacent frame image to the frame image; weighting and aligning the feature map corresponding to the adjacent frame image based on the feature flow, the occlusion mask image, and the mask image of the object to be removed corresponding to the adjacent frame image, to obtain the weighted and aligned feature map corresponding to the adjacent frame image; and obtaining the updated feature map corresponding to the frame image by performing feature fusion on the feature map corresponding to the frame image and the weighted and aligned feature map corresponding to the adjacent frame image.
The weighting and aligning the feature map corresponding to the adjacent frame image based on the feature flow, the occlusion mask image, and the mask image of the object to be removed corresponding to the adjacent frame image, to obtain the weighted and aligned feature map corresponding to the adjacent frame image, may include obtaining a concatenation mask image by performing concatenation on the occlusion mask image and the mask image of the object to be removed corresponding to the adjacent frame image; performing a convolution processing and a nonlinear-processing by using an activation function on the concatenation mask image to obtain a nonlinear-processed concatenation mask image; obtaining the weighted feature map corresponding to the adjacent frame image by weighting the feature map corresponding to the adjacent frame image by using the nonlinear-processed concatenation mask image; and obtaining the weighted and aligned feature map corresponding to the adjacent frame image by performing a feature aligning on the weighted feature map corresponding to the adjacent frame image by using the feature flow.
The obtaining the updated feature map corresponding to the frame image by performing feature fusion on the feature map corresponding to the frame image and the weighted and aligned feature map corresponding to the adjacent frame image may include performing concatenation on the feature map corresponding to the frame image and the weighted and aligned feature map corresponding to the adjacent frame image, to obtain the concatenation feature map; and performing a convolution processing on the concatenation feature map, to obtain the updated feature map corresponding to the frame image.
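By way of non-limiting illustration only, the following sketch shows one possible per-pair fusion in which the occlusion mask and the adjacent frame's mask image gate the adjacent feature map, the gated map is aligned by the feature flow, and the result is fused with the current frame's feature map by convolution; warp is the helper from the consistency-check sketch above, and the channel counts and sigmoid gating are assumptions.

```python
# Illustrative sketch only: per-pair weighting, aligning, and fusion.
import torch
import torch.nn as nn


class AdjacentFeatureFusion(nn.Module):
    def __init__(self, feat_ch: int = 64):
        super().__init__()
        self.mask_conv = nn.Conv2d(2, feat_ch, 3, padding=1)            # concatenation mask image -> gate
        self.fuse_conv = nn.Conv2d(2 * feat_ch, feat_ch, 3, padding=1)  # concatenated features -> updated map

    def forward(self, feat_t, feat_adj, flow_adj_to_t, occ_adj_to_t, mask_adj):
        # concatenation mask image -> convolution + nonlinearity
        gate = torch.sigmoid(self.mask_conv(torch.cat([occ_adj_to_t, mask_adj], dim=1)))
        # weight, then align the adjacent feature map with the feature flow
        aligned = warp(feat_adj * gate, flow_adj_to_t)
        # concatenate with the current frame's feature map and fuse by convolution
        return self.fuse_conv(torch.cat([feat_t, aligned], dim=1))
```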
The decoding based on the enhanced or completed feature map sequence of the local frame sequence may include determining a maximum bounding box of a mask area based on the mask image sequence regarding the object to be removed; determining a calculation area of the maximum bounding box of the mask area on the enhanced or completed feature map sequence of the local frame sequence; cropping the enhanced or completed feature map sequence of the local frame sequence based on the calculation area; and decoding the cropped enhanced or completed feature map sequence.
The determining the calculation area of the maximum bounding box of the mask area may include calculating a receptive field of a decoder for the decoding; scaling the maximum bounding box of the mask area based on a resolution ratio between the enhanced or completed feature map sequence and an original feature map sequence of the local frame sequence; and expanding the maximum bounding box of the scaled mask area based on the receptive field to obtain the calculation area.
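By way of non-limiting illustration only, the following sketch shows one possible way to derive the calculation area and crop the feature maps before decoding; treating the receptive field as a given value expressed in feature-map pixels is an assumption.

```python
# Illustrative sketch only: crop the feature maps around the mask area before decoding.
import torch


def calculation_area(masks, feat_hw, img_hw, receptive_field: int):
    """masks: (T, 1, H, W) binary object masks; returns (y0, y1, x0, x1) on the feature maps."""
    union = masks.sum(dim=(0, 1)) > 0                        # union of mask pixels over all frames
    ys, xs = torch.where(union)
    assert ys.numel() > 0, "empty mask sequence"
    sy, sx = feat_hw[0] / img_hw[0], feat_hw[1] / img_hw[1]  # resolution ratio feature / image
    y0 = max(int(ys.min() * sy) - receptive_field, 0)
    y1 = min(int(ys.max() * sy) + receptive_field + 1, feat_hw[0])
    x0 = max(int(xs.min() * sx) - receptive_field, 0)
    x1 = min(int(xs.max() * sx) + receptive_field + 1, feat_hw[1])
    return y0, y1, x0, x1


def crop_for_decoding(feats, masks, img_hw, receptive_field: int):
    """feats: (T, C, h, w) updated feature maps; returns the cropped sequence to decode."""
    y0, y1, x0, x1 = calculation_area(masks, feats.shape[-2:], img_hw, receptive_field)
    return feats[:, :, y0:y1, x0:x1]
```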
According to an embodiment of the disclosure, a method performed by an electronic apparatus may be provided. The method may include determining a local frame sequence from a video sequence of a video; and determining a reference frame sequence corresponding to the local frame sequence, from the video, and performing inpainting processing for the local frame sequence based on the reference frame sequence.
The determining the reference frame sequence corresponding to the local frame sequence from the video may include: determining a candidate frame sequence from a video sequence to which a local frame sequence belongs, wherein the candidate frame sequence comprises image frames in the video sequence excluding the local frame sequence; and selecting the reference frame sequence corresponding to the local frame sequence from the candidate frame sequence based on a similarity between a frame image in the candidate frame sequence and a frame image in the local frame sequence. The similarity may represent at least one of a correlation of a background area and an uncorrelation of a foreground area between the frame image in the candidate frame sequence and the frame image in the local frame sequence.
According to an embodiment of the disclosure, an electronic apparatus may be provided. The electronic apparatus may include a processor; and a memory storing computer executable instructions. The computer executable instructions, when executed by the processor, may cause the processor to perform the above method.
According to an embodiment of the disclosure, a computer-readable storage medium storing instructions may be provided. The instructions, when executed by a processor, may cause the processor to perform the above method.
It should be noted that the terms “first”, “second”, “third”, “fourth”, “1”, “2” and the like (if any) in the description and claims of the disclosure and the above drawings are used to distinguish similar objects, and need not be used to describe a specific order or sequence. It should be understood that data used in this way may be interchanged in appropriate situations, so that the embodiments of the disclosure described here may be implemented in an order other than that of the illustration or text description.
It should be understood that although each operation is indicated by arrows in the flowcharts of the embodiments of the disclosure, an implementation order of these operations is not limited to the order indicated by the arrows. Unless explicitly stated herein, in some implementation scenarios of the embodiments of the disclosure, the operations in the flowcharts may be executed in other orders according to requirements. In addition, some or all of the operations in each flowchart may include a plurality of sub-operations or stages, based on an actual implementation scenario. Some or all of these sub-operations or stages may be executed at the same time, or each of them may be executed at a different time. In scenarios with different execution times, an execution order of these sub-operations or stages may be flexibly configured according to requirements, which is not limited by the embodiments of the disclosure.
The above description is only an alternative implementation of some implementation scenarios of the disclosure. It should be pointed out that, for those of ordinary skill in the art, adopting other similar implementation means based on the technical idea of the disclosure also belongs to the protection scope of the embodiments of the disclosure, without departing from the technical concept of the disclosure.
Number | Date | Country | Kind
--- | --- | --- | ---
202211515180.8 | Nov 2022 | CN | national
This application is a by-pass continuation application of International Application No. PCT/KR2023/013077, filed on Sep. 1, 2023, which is based on and claims priority to Chinese Patent Application No. 202211515180.8, filed on Nov. 29, 2022, in the Chinese Intellectual Property Office, the disclosures of which are incorporated by reference herein in their entireties.
  | Number | Date | Country
--- | --- | --- | ---
Parent | PCT/KR2023/013077 | Sep 2023 | WO
Child | 18368921 | | US