The present application claims priority to Chinese Patent Application No. 202410544518.5, filed May 6, 2024, the entire disclosure of which is incorporated herein by reference.
The present application relates to a face image restoration method based on a state space model and belongs to the field of image processing technology.
In audio-visual entertainment, security monitoring and other scenarios, clear, high-quality face images not only provide users with a good visual experience, but also assist law enforcement officers in searching for suspects, missing persons and the like. Shooting conditions involve many unstable factors, such as focus failure of the imaging equipment or camera shake; low light, high exposure or subject motion in the imaging environment; and lossy compression and coding formats during channel transmission, all of which introduce varying degrees of degradation into the image. The effective identity information provided by an indistinguishable low-quality face image is very limited. Face image restoration therefore aims to recover clear, high-quality face images from degraded low-quality face images, which helps improve the robustness of downstream tasks such as face super-resolution and recognition, old photo restoration, and virtual digital human image editing.
Currently, the model architectures used in face restoration methods are mainly based on convolutional neural networks (CNNs) and Transformer networks. Because the convolution operation is inherently local, existing CNN-based methods often fail to mine global information; existing Transformer-based methods divide the image into a number of patches and then use a computationally intensive self-attention mechanism to grasp global information, with limited ability to capture pixel-level detail information.
The described methods have achieved some positive results in face restoration tasks, but when facing face images of larger size with more complex and severe degradation, they cannot simultaneously mine and efficiently integrate local detail information and global geometric information, resulting in restoration performance that falls short of existing industry requirements.
To overcome the deficiencies in the related art, the present application provides a face image restoration method based on a state space model.
Compared with a convolutional neural network architecture, the state space model architecture has global information mining capability; compared with the Transformer architecture, it not only reduces computational overhead but also has stronger temporal memory capability. The present application further introduces multi-scale technology to alleviate the problem of insufficient integration of local and global information, and to maintain the consistency of face identity information while ensuring the restoration of detailed textures and geometric contours. In the face of complex real degradation scenes, the proposed restoration method of the present application can not only achieve high index scores, such as peak signal-to-noise ratio (PSNR), but also recover clear, high-quality face images.
In order to achieve the stated purpose, the present application is realized using the following technical solution.
In a first aspect, the present application discloses a face image restoration method based on a state space model, including:
In some embodiments, the first multi-scale state space module, the second multi-scale state space module, the third multi-scale state space module, the fourth multi-scale state space module, the fifth multi-scale state space module and the sixth multi-scale state space module have the same network structure, each of which includes a plurality of multi-scale state space models and a feature extraction unit, and each multi-scale state space model contains two state space branches, a dimension reduction unit, a first linear layer and a second linear layer, a layer normalization unit, and a dimension upgrading unit.
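As one possible reading of this structure, a minimal PyTorch-style sketch of a single multi-scale state space model is given below. The channel widths, the use of linear layers for the dimension reduction and upgrading units, the placeholder `StateSpaceBranch`, and the way the branch outputs are combined are illustrative assumptions rather than details fixed by the description.

```python
import torch
import torch.nn as nn

class StateSpaceBranch(nn.Module):
    """Placeholder for one state space branch (e.g. a selective-scan SSM);
    assumed to keep the channel dimension unchanged."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # stand-in for the actual state space scan

    def forward(self, x):
        return self.proj(x)

class MultiScaleStateSpaceModel(nn.Module):
    """Sketch of one multi-scale state space model: layer normalization,
    dimension reduction, two state space branches, two linear layers,
    dimension upgrading, and a residual connection."""
    def __init__(self, dim, reduced_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                       # layer normalization unit
        self.reduce = nn.Linear(dim, reduced_dim)           # dimension reduction unit
        self.branch1 = StateSpaceBranch(reduced_dim)        # first state space branch
        self.branch2 = StateSpaceBranch(reduced_dim)        # second state space branch
        self.linear1 = nn.Linear(reduced_dim, reduced_dim)  # first linear layer
        self.linear2 = nn.Linear(reduced_dim, reduced_dim)  # second linear layer
        self.upgrade = nn.Linear(reduced_dim, dim)          # dimension upgrading unit

    def forward(self, x):  # x: (batch, tokens, dim)
        h = self.reduce(self.norm(x))
        # the first branch output is summed with the reduced features, then the
        # two linear layers are applied (one possible reading of the description)
        a = self.linear1(self.branch1(h) + h)
        b = self.linear2(self.branch2(h))
        return x + self.upgrade(a + b)
```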
In some embodiments, the first image fusion module, the second image fusion module and the third image fusion module have the same network structure, and each of the first image fusion module, the second image fusion module and the third image fusion module comprises a 2-fold down-sampling unit, a feature extraction unit and a channel attention unit;
low-quality face images of different scales are input, and the channel attention unit performs key feature fusion on the shallow features mined by the feature extraction unit and the deep features output by the encoder at different stages.
In some embodiments, the feature extraction unit includes a residual convolution block and an activation function;
In some embodiments, each state space model includes three linear layers, a convolution layer, two activation layers, and a selective state space model connected by residuals;
u(t) denotes an input signal, x(t) denotes a historical state, x′(t) denotes a current state, y(t) denotes an output signal; A denotes a state transfer matrix, B is a matrix of inputs to states, C is a matrix of states to outputs, and D is a parameter of inputs to outputs.
In some embodiments, since the input data for image processing is often discrete, the ordinary differential equations of the state space model are converted into discrete-time difference equations according to a bilinear transformation method by selecting a suitable discrete-time step Δt; the difference equations are:
xk denotes a current kth state, uk denotes a kth discrete value of an input sequence, and xk−1 denotes a (k−1)th historical state; the current state xk is affected by both the historical state and the current input;
the state space model controls the focus on the current input by adjusting the step size Δt, thereby achieving selective forgetting or retention of states; when Δt is increased, the model tends to focus on the current input and forget the previous states; when Δt is reduced, the model tends to retain more historical states, thereby realizing a selective state space model.
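To make the role of the step size concrete, a scalar toy sketch of the bilinear discretization and the resulting recurrence is given below; the coefficient values (a = −1, b = 1) and the impulse input are purely illustrative assumptions.

```python
import numpy as np

def bilinear_discretize(a, b, dt):
    """Bilinear (Tustin) discretization of a scalar state space model
    x'(t) = a*x(t) + b*u(t); returns the discrete coefficients (a_bar, b_bar)."""
    a_bar = (1 + dt / 2 * a) / (1 - dt / 2 * a)
    b_bar = dt * b / (1 - dt / 2 * a)
    return a_bar, b_bar

def run_scan(u, a=-1.0, b=1.0, dt=0.1):
    """Recurrence x_k = a_bar * x_{k-1} + b_bar * u_k over an input sequence u."""
    a_bar, b_bar = bilinear_discretize(a, b, dt)
    x, states = 0.0, []
    for u_k in u:
        x = a_bar * x + b_bar * u_k
        states.append(x)
    return np.array(states)

u = np.zeros(20)
u[0] = 1.0                         # a single impulse at k = 0
print(run_scan(u, dt=0.1)[:5])     # small step: the impulse is retained longer
print(run_scan(u, dt=4.0)[:5])     # large step: the impulse is forgotten quickly
```

With a negative state coefficient, a larger Δt shrinks the magnitude of the discrete transfer coefficient, so earlier states decay faster, matching the selection behaviour described above.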
In some embodiments, the first multi-scale attention fusion module, the second multi-scale attention fusion module, and the third multi-scale attention fusion module have the same network structure, each of which includes a generalized interpolation unit, a 2-fold down-sampling unit, a local attention stage, and a global attention stage;
the first stage features, the second stage features and the third stage features output by the encoder are unified in size by the generalized interpolation unit, so that they are consistent with the size of the features currently output by the encoder at the current stage after passing through the 2-fold down-sampling unit; local and global features are then fused by the local attention stage and the global attention stage.
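A minimal sketch of this size-unification step is shown below; bilinear interpolation and a 1×1 convolution after concatenation are assumptions, since the description does not fix the interpolation type or the fusion operator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def unify_and_fuse(stage_feats, target_hw, out_channels):
    """Resize encoder features from several stages to one spatial size and fuse
    them with a 1x1 convolution (illustrative generalized interpolation).
    In a trained model the 1x1 convolution would be a learned module parameter."""
    resized = [F.interpolate(f, size=target_hw, mode="bilinear", align_corners=False)
               for f in stage_feats]
    fused = torch.cat(resized, dim=1)
    conv = nn.Conv2d(fused.shape[1], out_channels, kernel_size=1)
    return conv(fused)

# example: three encoder stage features at different scales unified to 32x32
f1 = torch.randn(1, 64, 128, 128)
f2 = torch.randn(1, 128, 64, 64)
f3 = torch.randn(1, 256, 32, 32)
out = unify_and_fuse([f1, f2, f3], target_hw=(32, 32), out_channels=256)
print(out.shape)  # torch.Size([1, 256, 32, 32])
```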
In some embodiments, the 2-fold down-sampling unit includes a pooling layer and a convolution layer, the feature extraction unit includes two residual convolution blocks and an activation function, and the channel attention unit includes a dimension reduction unit, an attention mechanism and a dimension reduction unit;
the generalized interpolation unit includes an image fusion unit and a convolution layer, the local attention stage includes two residual convolution blocks and two activation functions, and the global attention stage contains an hourglass-shaped attention fusion unit and a channel attention unit.
In some embodiments, a method of training a restoration model includes:
In a second aspect, the present application provides a face image restoration system based on a state space model, including a processor and a storage medium;
In a third aspect, the present application provides a computer readable storage medium having a computer program stored thereon, the computer program implements the method described in the first aspect when executed by a processor.
In a fourth aspect, the present application provides a computer device, including a memory and a processor, the memory stores a computer program, the processor implements the method described in the first aspect when the computer program is executed by the processor.
In a fifth aspect, the present application provides a computer program product including a computer program that implements the method described in the first aspect when executed by a processor.
The present application proposes a face image restoration method based on a state space model. Firstly, based on an image fusion module, input images of different scales are fused at different stages of the encoder, so that image pixels and semantic features learn from each other. Secondly, a multi-scale state space model is introduced on the basis of the traditional state space model, and processing branches with different receptive field sizes are used to mine local and global information and assist the deep feature extraction of the model. Finally, compared with the usual skip-connection scheme between the encoding and decoding structures, the multi-scale attention fusion module fully considers the characteristics of the encoder features at different stages and effectively improves the learning ability of the decoder. The present application can guarantee the consistency of face identity information while ensuring the detailed textures and geometric contours of the restoration results, and further improves the generalization performance of the face restoration model in real scenes, so as to meet the needs of all kinds of face image-related tasks and applications.
The present application is further described below in conjunction with the accompanying drawings. The following embodiments are only used to illustrate the technical solution of the present application more clearly, and are not to be used to limit the scope of the present application.
The present application provides a face image restoration method based on a state space model, including:
In some embodiments, as shown in
An encoder, for mining multi-scale deep semantic features based on the to-be-restored face image and its down-sampled images at different scales; a decoder, for generating the restored face image based on the multi-scale deep semantic features generated by the encoder.
In the present application, an image fusion module, including a down-sampling unit, a feature extraction unit and a channel attention unit, is configured to receive low-quality face images of different scales; the channel attention unit performs key feature fusion on the shallow features mined by the feature extraction unit and the deep features output by the encoder at different stages, to promote image pixels and semantic features to learn from each other;
a multi-scale state space module, for summing an output of the first state space branch with an output of the dimension reduction unit, and then passing the result through the first linear layer and the second linear layer, respectively;
a multi-scale attention fusion module, for mining local and global information based on processing branches with different receptive field sizes, and using skip connections to enhance the learning ability of the decoder.
A technical idea of the present application is as follows: based on an image fusion module, multi-scale input images are fused at different stages of the encoder, so that image pixels and semantic features learn from each other; secondly, a multi-scale state space model is introduced on the basis of the traditional state space model, and processing branches with different receptive field sizes are used to mine local and global information and assist the deep feature extraction of the model; finally, compared with the usual skip-connection scheme between the encoding and decoding structures, the multi-scale attention fusion module fully considers the characteristics of the encoder features at different stages and effectively improves the learning ability of the decoder. This alleviates the difficulty of existing methods in grasping long-distance global feature information and their low inference efficiency, improves the generalization performance of the face restoration model, and achieves higher evaluation index scores and high-quality visualization effects in real scenarios.
As shown in
The feature extraction unit includes a residual convolution block and an activation function. The residual convolution block includes two convolution layers and two activation functions, with a rectified linear unit (ReLU) used as the activation function; the output features obtained after the two convolution layers extract deeper features from the input features are summed with the input features to obtain the final output features.
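A minimal PyTorch sketch of such a residual convolution block is shown below; channel-preserving 3×3 convolutions are an assumption, since the kernel size and channel counts are not specified here.

```python
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Two 3x3 convolutions with ReLU activations; the result is summed with
    the input features to form the final output features."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)
```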
As shown in
The image fusion module is used to enable image pixels and semantic features to learn from each other. Specifically, the semantic features of the input low-quality face image are initially extracted based on a convolutional neural network, and multi-scale fusion is utilized to efficiently combine image features from different scales.
In some embodiments, as shown in
In some embodiments, the 2-fold down-sampling unit includes a pooling layer and a convolution layer, the feature extraction unit includes two residual convolution blocks and an activation function, and the channel attention unit includes a dimension reduction unit, an attention mechanism and a dimension reduction unit; the generalized interpolation unit includes an image fusion unit and a convolution layer, the local attention stage includes two residual convolution blocks and two activation functions, and the global attention stage contains an hourglass-shaped attention fusion unit and a channel attention unit.
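One possible reading of the channel attention unit is a squeeze-and-excitation style gate; the reduce-then-restore arrangement, the global average pooling, and the reduction ratio below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention: global pooling,
    channel dimension reduction, gating, and channel dimension restoration."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # reduce
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),  # restore
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(self.pool(x))
```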
The multi-scale attention fusion module is used to combine the low-level details of face image features at different scales with high-level semantics. Specifically, the encoder is divided into three stages of different scales to extract the low-level semantic features of the face image; the output features of the decoder at stage i are skip-connected through the concatenation function; the local semantic features are improved based on residual convolution and the rectified linear unit (ReLU); and the global semantic features are improved based on the hourglass-shaped attention fusion mechanism, which first fuses the dimension-reduced encoding input and then fuses the dimension-upgraded decoding output in stages, so as to preserve the globally important features and identity of the face image, enhance the learning ability of the decoder, and make the restored face image more realistic.
As shown in
In some embodiments, as shown in
As shown in
Multi-scale state space models are built on traditional state space theory. An expression of the state space model, in its standard continuous-time form, is:
x′(t)=Ax(t)+Bu(t)
y(t)=Cx(t)+Du(t)
u(t) denotes an input signal, x(t) denotes a historical state, x′(t) denotes a current state, y(t) denotes an output signal; A denotes a state transfer matrix, B is a matrix of inputs to states, C is a matrix of states to outputs, and D is a parameter of inputs to outputs.
Since the input data for image processing is often discrete, the ordinary differential equations of the state space model are converted into discrete-time difference equations according to a bilinear transformation method by selecting a suitable discrete-time step Δt; the difference equations are:
xk=Ā·xk−1+B̄·uk
yk=C·xk+D·uk
where Ā and B̄ are the discretized state transfer matrix and input matrix obtained from A and B by the bilinear transformation with step Δt;
xk denotes a current kth state, uk denotes a kth discrete value of an input sequence, and xk−1 denotes a (k−1)th historical state; the current state xk is affected by both the historical state and the current input.
The feature sequence processing method based on the state space model can capture global information over long distances through the selection mechanism. Compared with other feature sequence analysis models, it has greater computing throughput, faster model inference and higher execution efficiency; it retains the consistency of the semantic information of the face as far as possible while guaranteeing the details and textures of the restoration results, and is able to achieve better restoration results in real scenarios.
A method of training a restoration model in the present application includes:
Specifically, obtaining the training set includes:
Specifically, pixels of the high-quality face image are adjusted to obtain a degraded face image, and the degraded face image is used as the face training image to be restored.
Specifically, each high-quality face image is extracted from the FFHQ dataset and its side length is adjusted to 512 pixels, and the degraded face image is then obtained through a pixel adjustment operation. The expression for the pixel adjustment operation is as follows:
Ilq={JPEGq[(Ihq*kσ)↓s+nδ]}↑s;
Ilq denotes the degraded face image, i.e., the face training image to be restored; JPEGq denotes JPEG compression with a compression quality of q; Ihq denotes the high-quality face image, i.e., the high definition face real image; * denotes a convolution operation; kσ denotes a blurring kernel with standard deviation σ; ↓s denotes an s-fold down-sampling operation; nδ denotes Gaussian noise with standard deviation δ; and ↑s denotes an s-fold up-sampling operation.
The specific parameters of the pixel adjustment operation can be adjusted according to the actual image, which are not limited here.
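A hedged OpenCV/NumPy sketch of this degradation pipeline is shown below; the concrete blur strength, scale factor, noise level, JPEG quality and interpolation modes are illustrative choices rather than values fixed by the method.

```python
import cv2
import numpy as np

def degrade(hq, sigma=2.0, scale=4, noise_sigma=10.0, jpeg_q=60):
    """Ilq = {JPEGq[(Ihq * k_sigma) down-sampled s-fold + n_delta]} up-sampled s-fold."""
    h, w = hq.shape[:2]
    blurred = cv2.GaussianBlur(hq, (0, 0), sigma)                     # Ihq * k_sigma
    down = cv2.resize(blurred, (w // scale, h // scale),
                      interpolation=cv2.INTER_LINEAR)                 # down-sample s-fold
    noisy = down.astype(np.float64) + np.random.normal(0.0, noise_sigma, down.shape)
    noisy = np.clip(noisy, 0, 255).astype(np.uint8)                   # + n_delta
    ok, buf = cv2.imencode(".jpg", noisy,
                           [int(cv2.IMWRITE_JPEG_QUALITY), jpeg_q])   # JPEGq compression
    jpeg = cv2.imdecode(buf, cv2.IMREAD_COLOR)
    return cv2.resize(jpeg, (w, h), interpolation=cv2.INTER_LINEAR)   # up-sample s-fold

# example usage: hq is a 512x512 BGR face image, e.g. loaded with cv2.imread(path)
```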
Second, the multi-scale down-sampled face training image Ilq to be restored is input into the pre-constructed restoration model to obtain the multi-scale up-sampled restored face training image Îhq.
Third, based on the restored face training image Îhq and the corresponding high definition face real image Ihq, a loss function value L of the restoration model is calculated.
The expression of the loss function for the restoration model is as follows:
L=Ll1+λperLper+λadvLadv
L denotes the total loss function value of the restoration model; Ll1 denotes the L1 loss function value; λper denotes the perceptual loss weight, which in this embodiment takes the value of 0.1; Lper denotes the perceptual loss function value based on the VGG network; λadv denotes the adversarial loss weight, which in this embodiment takes the value of 0.01; and Ladv denotes the adversarial loss function value based on adversarial training.
The expression for the L1 loss function value Ll1 is as follows:
Ll1=∥Ihq−Îhq∥1
Ihq denotes the collection of high definition real face images; Îhq denotes the collection of restored face training images; and ∥·∥1 denotes the mean absolute error.
The expression for the perceptual loss function value Lper based on the VGG network is as follows:
Lper=∥ø(Ihq)−ø(Îhq)∥22
ø denotes the feature maps of the 1st to 5th convolution layers in the pre-trained VGG model; ∥·∥22 denotes the square of the 2-norm.
The expression for the adversarial loss function value Ladv based on adversarial training is as follows:
Ladv=−EÎhq[softplus(D(Îhq))]
D(·) denotes the output of the discriminator in adversarial training; EÎhq denotes the expectation over the distribution of Îhq; and softplus denotes the softplus function, which is expressed as:
softplus(x)=ln(1+e^x)
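A hedged PyTorch sketch of this total training loss is given below; the use of VGG-19, the truncation point of the feature extractor, the mean-squared reduction of the feature distance, and the assumption that the discriminator logits are passed in are illustrative choices, not details fixed by this description.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# assumed perceptual feature extractor; the actual layers used may differ
vgg_features = vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad = False

def restoration_loss(restored, target, disc_logits, lambda_per=0.1, lambda_adv=0.01):
    """L = L_l1 + lambda_per * L_per + lambda_adv * L_adv (see the expressions above)."""
    l1 = F.l1_loss(restored, target)                                # L_l1
    per = F.mse_loss(vgg_features(restored), vgg_features(target))  # L_per on VGG features
    adv = -F.softplus(disc_logits).mean()                           # L_adv = -E[softplus(D(x))]
    return l1 + lambda_per * per + lambda_adv * adv
```

Here `restored` and `target` are assumed to be normalized 3-channel image tensors, and `disc_logits` is the discriminator output D(Îhq) computed elsewhere in the training loop.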
Fourth, iterative update training of the restoration model is performed based on the gradient descent method, and the restoration model with the smallest total loss function value is selected as the pre-trained restoration model.
Based on the first embodiment, the present embodiment provides a face image restoration system based on a state space model, which includes a processor and a storage medium; the storage medium is used to store instructions; and the processor is configured to operate according to the instructions to perform the method according to the first embodiment.
Based on the first embodiment, the present embodiment provides a computer readable storage medium having a computer program stored thereon, the computer program implements the method described in the first aspect when executed by a processor.
Based on the first embodiment, the present embodiment provides a computer device, including a memory and a processor, the memory stores a computer program, the processor implements the method described in the first aspect when the computer program is executed by the processor.
Based on the first embodiment, the present embodiment provides a computer program product, which includes a computer program that implements the method described in the first aspect when executed by a processor.
It should be appreciated by those skilled in the art that embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of a fully hardware embodiment, a fully software embodiment, or an embodiment combining software and hardware aspects. Further, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, a disk memory, a CD-ROM, an optical memory, etc.) that contain computer-usable program code therein.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each process and/or block in the flowchart and/or block diagram, and combinations of processes and/or blocks in the flowchart and/or block diagram, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data-processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data-processing device produce a device for carrying out the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing the computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
The foregoing is only a preferred embodiment of the present application, and it should be noted that, for those skilled in the art, a number of improvements and embellishments may be made without departing from the principles of the present application, which shall also be regarded as falling within the scope of protection of the present application.