The invention belongs to the field of computer vision technology and relates to a deep learning method for fully automatic natural image matting.
Image matting is the key technology behind seamlessly combining a foreground object with another image to create a new one. As society develops and technology advances, the number of images around us has grown exponentially, accompanied by numerous image processing technologies. From early image classification to object detection and then image segmentation, all of these reflect the desire to free people's hands and reduce manual labor, and these needs are met by different image processing technologies that make our lives easier.
Image matting is an important task in computer vision. It builds on image segmentation but extends it considerably. Image segmentation aims to separate different regions, or regions of interest, in an image; it is essentially a zero-or-one binary classification problem and does not demand much detail at segment edges. Image matting, in contrast, not only separates the foreground region but also requires a much finer treatment of the segmented objects, such as human hair, animal fur, dense meshes, and translucent objects. Such high-precision segmentation results are of great significance for image composition: they can be used in everyday applications such as replacing portrait backgrounds, as well as in virtual background production in the movie industry and fine parts production in industry.
Image matting and image composition are essentially inverse processes, and the mathematical model can be expressed by the following formula:
I_z = αF_z + (1 − α)B_z, α ∈ [0, 1] (1)
Here z = (x, y) denotes the position of a pixel in the image I; F and B are the foreground and background values at pixel z, respectively; and α represents the opacity of the pixel, taking values between 0 and 1, so its estimation is essentially a regression problem. The formula gives an intuitive account of image composition: an image is composed of many pixels, and each pixel is a weighted sum of the foreground and background, with α as the weighting factor. When α = 1, the pixel is completely opaque, i.e., composed only of the foreground. When α = 0, it is completely transparent, i.e., composed only of the background. When α ∈ (0, 1), the pixel is a weighted sum of foreground and background; the area where such pixels lie is also called the unknown region or transition region.
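As a minimal NumPy sketch (not part of the invention itself), Formula 1 can be applied directly to composite a foreground over a background; the 2×2 image and the constant colors below are illustrative values only:

```python
import numpy as np

def composite(fg, bg, alpha):
    """Composite a foreground over a background per Formula 1:
    I_z = alpha * F_z + (1 - alpha) * B_z, with alpha in [0, 1]."""
    alpha = alpha[..., None]  # (H, W) -> (H, W, 1), broadcast over RGB
    return alpha * fg + (1.0 - alpha) * bg

# Illustrative constant-color foreground/background and a mixed alpha matte.
fg = np.full((2, 2, 3), 200.0)
bg = np.full((2, 2, 3), 50.0)
alpha = np.array([[1.0, 0.0],     # opaque / transparent pixels
                  [0.5, 0.25]])   # transition-region pixels
img = composite(fg, bg, alpha)
```

With α = 1 the pixel takes the foreground value, with α = 0 the background value, and in between a weighted sum of the two, exactly as described above.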
Looking back at Formula 1, image matting is an under-constrained problem: for an RGB color image there are 7 unknowns (three foreground channels, three background channels, and α) but only 3 knowns. Therefore, some existing methods address this ill-posed problem by adding auxiliary information (such as a trimap or scribble strokes), in which the alpha value of some regions is manually specified. With continued scientific and technological progress, research on image matting and related fields keeps producing new breakthroughs. Algorithms in the field of image matting can be roughly divided into the following three types.
(1) Sampling-Based Methods
Sampling-based methods sample the known foreground and background areas to find candidate foreground and background colors for a given pixel, and then use different evaluation metrics to determine the optimal weighted combination of foreground and background pixels. Different sampling schemes affect this combination differently; they include sampling pairs of pixels along the boundary of the unknown region, sampling based on ray casting, and sampling based on color clustering. The evaluation metric is used to choose among the sampled candidates; it mainly includes the reconstruction error of Formula 1, the distance from pixels in the unknown region, and similarity measures between foreground/background samples.
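The pairwise sampling idea can be sketched as follows. For a fixed (F, B) pair, the α minimizing the reconstruction error of Formula 1 has the closed form α = (I − B)·(F − B)/|F − B|²; the pair with the lowest residual is kept. The helper name and the tiny candidate lists are illustrative, not from the patent:

```python
import numpy as np

def estimate_alpha(pixel, fg_samples, bg_samples):
    """For one unknown pixel, test every sampled (F, B) candidate pair.
    For each pair, the best alpha under Formula 1 is the projection
    alpha = (I - B).(F - B) / |F - B|^2, clipped to [0, 1]; the pair
    with the smallest reconstruction error wins."""
    best_alpha, best_err = 0.0, np.inf
    for F in fg_samples:
        for B in bg_samples:
            d = F - B
            a = np.clip(np.dot(pixel - B, d) / (np.dot(d, d) + 1e-8), 0.0, 1.0)
            err = np.linalg.norm(pixel - (a * F + (1.0 - a) * B))
            if err < best_err:
                best_alpha, best_err = a, err
    return best_alpha
```

A pixel that truly is a 50/50 blend of a sampled foreground and background color recovers α ≈ 0.5 with zero residual.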
(2) Propagation-Based Methods
In propagation-based methods, the α values of pixels with known α in Formula 1 are propagated to pixels with unknown α through different propagation algorithms. The most mainstream approach makes a local smoothness assumption on the foreground/background and then finds the globally optimal alpha matte by solving a sparse linear system. Other methods include random walks and non-local propagation.
(3) Deep-Learning Based Methods
With the rapid development of deep learning, more and more deep-learning-based methods in visual tasks such as image classification and semantic segmentation have surpassed traditional image processing techniques, and applying deep learning to image matting has greatly improved the quality of the final composited images. The laboratory of Professor Jia Jiaya at the Chinese University of Hong Kong proposed deep automatic portrait matting, which considers not only semantic prediction of the image but also pixel-level optimization of the alpha matte. In their implementation, the input image is first segmented into foreground, background, and unknown regions through semantic segmentation, and a novel matting layer then enables feedforward and feedback operations through the entire network. This end-to-end deep learning method requires no user interaction, ensuring accuracy while greatly reducing manual labor. More recently, the laboratory of Professor Xu Weiwei at Zhejiang University proposed a late-fusion method that, from a classification perspective, divides image matting into coarse foreground/background classification and edge optimization. In their implementation, a binary classification task is first performed on the image, and multiple convolutional layers then perform a fusion operation. The difference from deep portrait matting is that the latter uses a traditional propagation method to bridge the training process, whereas late fusion trains in stages in a fully convolutional manner.
In view of the shortcomings of existing methods, the present invention proposes a fully automatic image matting framework based on attention-guided hierarchical structure aggregation. The framework obtains a fine alpha matte from a single input RGB image without any additional auxiliary information. The user inputs a single RGB image to the network; a feature extraction network with an atrous spatial pyramid pooling module first extracts image features, a channel attention module then filters the high-level features, and the filtered results together with the low-level features are sent to a spatial attention module to extract image details. Finally, the obtained matte, the supervised ground truth, and the original image are sent to a discriminator network for later refinement, yielding a fine alpha matte.
The technical solution of the present invention:
A fully automatic natural image matting method, which obtains an accurate alpha matte of the foreground object from a single RGB image without any additional auxiliary information. The method consists of four stages, and the overall pipeline is as shown in the
(1) Hierarchical Feature Extraction Stage
The hierarchical feature extraction stage extracts feature representations at different levels from the input image. ResNeXt is selected as the basic backbone and divided into five blocks, ordered from shallow to deep. Low-level spatial and texture features are extracted from the shallow layers, while high-level semantic features are extracted from the deep layers. As the network deepens, it learns increasingly semantic features, so the second block is used to extract the low-level features.
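As an illustrative stand-in for the backbone (the real blocks are ResNeXt stages with learned convolutions, which are not reproduced here), the following sketch shows the hierarchical idea: each block halves the spatial resolution, and every intermediate output is retained so that block 2 can supply low-level features and block 5 high-level features:

```python
import numpy as np

def avg_pool2(x):
    """Halve spatial resolution by 2x2 average pooling; a toy stand-in
    for one backbone block's downsampling."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2
    x = x[:h, :w]
    return 0.25 * (x[0::2, 0::2] + x[1::2, 0::2]
                   + x[0::2, 1::2] + x[1::2, 1::2])

def extract_hierarchy(image, n_blocks=5):
    """Run the image through five sequential blocks, keeping every
    intermediate output: feats[1] (block 2) plays the role of the
    low-level spatial/texture features, feats[4] (block 5) the
    high-level semantic features."""
    feats, x = [], image
    for _ in range(n_blocks):
        x = avg_pool2(x)
        feats.append(x)
    return feats
```

On a 64×64 input this yields outputs of size 32, 16, 8, 4, and 2 per side, mirroring the shallow-to-deep block ordering described above.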
(2) Pyramidal Feature Filtration Stage
After the high-level semantic feature representation is extracted, traditional methods usually pass the entire representation to the next step without filtering. Since an image contains more than one type of object, more than one kind of semantic information is activated in the upper layers, and objects in both the foreground and the background are likely to be activated (that is, different channels respond to different objects), which causes great trouble for image matting. The present invention therefore proposes a pyramidal feature filtration module (that is, the channel attention in the hierarchical attention). The specific process is shown in
Output = σ(MLP(MaxPool(Input))) × Input (2)
Here Input represents the high-level semantic features obtained in the first stage, and σ represents the non-linear activation function. The channel attention map obtained after σ has size 1×1×n, where n is the number of channels; the high-level semantic features have size x×y×n, where x and y are the height and width of the feature map. The two are broadcast when multiplied, and × denotes the element-wise multiplication of the channel attention map with the high-level semantic features.
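Formula 2 can be sketched in NumPy as follows. The MLP is shown as a two-layer ReLU bottleneck whose hidden size and weights (w1, w2) are assumptions here, since the patent does not fix them:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Pyramidal feature filtration per Formula 2:
    Output = sigmoid(MLP(MaxPool(Input))) * Input.
    feat: (x, y, n) high-level feature map; w1: (n, h), w2: (h, n) are
    the bottleneck MLP weights (hidden size h is an assumption)."""
    pooled = feat.max(axis=(0, 1))           # global max pool -> (n,)
    hidden = np.maximum(0.0, pooled @ w1)    # ReLU bottleneck -> (h,)
    attn = sigmoid(hidden @ w2)              # channel attention map, (n,)
    return attn[None, None, :] * feat        # broadcast as 1x1xn over x*y*n
```

Because the sigmoid keeps each channel weight in (0, 1), channels that respond to irrelevant (e.g. background) objects are suppressed rather than passed on unfiltered.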
(3) Appearance Cues Filtration Stage
Existing learning-based methods directly upsample the selected high-level semantic features to obtain the final alpha matte, which largely loses the detail and texture information of foreground objects at their edges. To improve the fineness of the alpha matte at object edges (such as hair, translucent glass, and meshes), the present invention proposes an appearance cues filtration module (that is, the spatial attention in the hierarchical attention). As shown in the
The alpha prediction error is the mean absolute difference between the predicted alpha matte and the ground truth over the pixel set: Lα = (1/|Ω|) Σi∈Ω |αpi − αgi|, where Ω represents the set of pixels, |Ω| represents the number of pixels in the image, and αpi and αgi denote the predicted alpha value and the supervised ground truth at pixel i. The structural similarity error ensures the consistency of the spatial and texture information extracted from the low-level features to further improve the structure of foreground objects. Its calculation formula is as follows:
SSIM(αp, αg) = ((2μpμg + C1)(2σpg + C2)) / ((μp² + μg² + C1)(σp² + σg² + C2)), where αpi and αgi denote the predicted alpha value and the supervised ground truth at pixel i, μp, μg and σp², σg² represent the means and variances of αp and αg, σpg is their covariance, and C1 and C2 are small constants that stabilize the division.
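Under the definitions above, the alpha prediction error and a global (single-window) structural similarity can be sketched as follows; the values of the stabilizing constants c1 and c2 are the conventional SSIM choices and are an assumption here:

```python
import numpy as np

def alpha_loss(pred, gt):
    """Mean absolute error between the predicted alpha matte and the
    ground truth over the pixel set Omega."""
    return np.abs(pred - gt).mean()

def ssim(pred, gt, c1=0.01 ** 2, c2=0.03 ** 2):
    """Global structural similarity between two alpha mattes, computed
    from their means, variances, and covariance (a single-window sketch;
    practical SSIM is usually windowed)."""
    mu_p, mu_g = pred.mean(), gt.mean()
    var_p, var_g = pred.var(), gt.var()
    cov = ((pred - mu_p) * (gt - mu_g)).mean()
    return ((2.0 * mu_p * mu_g + c1) * (2.0 * cov + c2)) / \
           ((mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2))
```

A perfect prediction gives zero alpha loss and an SSIM of 1; structurally mismatched mattes score strictly lower.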
(4) Later Refinement Stage
In order to make the generated alpha matte more closely match the supervised ground truth in visual effect, a discriminator network is used in the later refinement stage. As shown in the
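A rough sketch of the adversarial refinement follows, under stated assumptions: the discriminator is reduced to a single linear layer over the concatenated image and matte (the patent uses a discriminator network, i.e. a learned convolutional model), and the standard non-saturating GAN losses stand in for the exact adversarial loss, which the text does not spell out:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_score(image, matte, w, b=0.0):
    """Toy stand-in for the discriminator: the original image and an
    alpha matte are concatenated and mapped to a real/fake probability.
    The weight vector w is hypothetical; the real model is convolutional."""
    x = np.concatenate([image.ravel(), matte.ravel()])
    return sigmoid(x @ w + b)

def gan_losses(d_real, d_fake, eps=1e-8):
    """Standard non-saturating GAN losses (an assumption): the
    discriminator learns to score ground-truth mattes high and generated
    mattes low; the generator is pushed to fool it, which drives the
    generated matte toward the ground truth in visual terms."""
    d_loss = -np.log(d_real + eps) - np.log(1.0 - d_fake + eps)
    g_loss = -np.log(d_fake + eps)
    return d_loss, g_loss
```

A discriminator that separates real from generated mattes well (e.g. scores of 0.9 vs 0.1) incurs a lower loss than one that cannot tell them apart.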
The beneficial effects of the present invention: compared with existing image matting methods, the biggest advantage of the present invention is that it requires no auxiliary information and no additional user interaction; a single input RGB image suffices to obtain a fine alpha matte. On the one hand, this saves researchers a great deal of time, since auxiliary information such as trimaps or scribbles no longer needs to be produced manually; on the other hand, users no longer need to manually mark foreground/background regions. At the same time, the attention-guided hierarchical structure fusion method of the present invention is instructive for the image matting task: it removes the dependence on auxiliary information while ensuring the accuracy of the alpha matte. This idea of high-level features guiding low-level learning has great reference value for other computer vision tasks.
The specific embodiments of the present invention are further described below in conjunction with the drawings and technical solutions. In order to better compare the contribution of different components to the entire framework, we make a visual illustration according to
The core of the present invention lies in attention-guided hierarchical structure fusion, which is described in detail below in conjunction with the specific implementation. The invention is divided into four parts. The first part uses the feature extraction network and the atrous spatial pyramid pooling module to extract features at different levels, as shown in the overall framework pipeline of
Number | Date | Country | Kind |
---|---|---|---|
202010029018.X | Jan 2020 | CN | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2020/089942 | 5/13/2020 | WO | 00 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2021/139062 | 7/15/2021 | WO | A |
Number | Name | Date | Kind |
---|---|---|---|
9665906 | Adeyoola | May 2017 | B2 |
20160180433 | Adeyoola | Jun 2016 | A1 |
20170278246 | Kim | Sep 2017 | A1 |
20180253865 | Price | Sep 2018 | A1 |
20200357142 | Aydin | Nov 2020 | A1 |
20210027470 | Lin | Jan 2021 | A1 |
Number | Date | Country |
---|---|---|
108010049 | May 2018 | CN |
109685067 | Apr 2019 | CN |
Entry |
---|
Singh et al. "Automatic Trimap and Alpha-Matte Generation For Digital Image Matting" 2013 Sixth International Conference on Contemporary Computing (IC3), Aug. 8-10, 2013. |
Lin et al. “RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 5168-5177. |
Number | Date | Country | |
---|---|---|---|
20210216806 A1 | Jul 2021 | US |