This application is the national phase entry of International Application No. PCT/CN2021/111577, filed on Aug. 9, 2021, which is based upon and claims priority to Chinese Patent Application No. 202110127788.2, filed on Jan. 29, 2021, the entire contents of which are incorporated herein by reference.
The present disclosure relates to the field of image processing technologies, and in particular, to a target-independent video generation method and system for high resolution face swapping.
Image/video synthesis methods for face swapping are an important branch of image and video generation technology in the computer vision field. These methods aim to transfer a source face to a target face while maintaining the identity information represented by the source face and detailed information, such as the posture and facial expression, of the target face.
With the widespread application of deep learning theory, especially the rapid development of generative adversarial networks (GANs), most existing face swapping technologies use GAN-based models to synthesize vivid swapped images. However, all currently known target-independent face swapping architectures can handle face swapping tasks only at 256×256 resolution, for the following three reasons:
The above three reasons hinder the optimization of an algorithm, cause the training of a GAN to collapse, or cause a generated face to lack sufficient details, so that the generated face is no different from a swapped face generated at 256×256 resolution. As a result, an image with face swapping is not sufficiently realistic.
To solve the above problems in the prior art, that is, to obtain a video characterized by high resolution face swapping, the present disclosure provides a video generation method and system for high resolution face swapping.
To solve the above technical problems, the present disclosure provides the following solutions.
A video generation method for high resolution face swapping is provided. The video generation method includes:
Optionally, the video generation method further includes:
Optionally, the determining a first loss function of the face feature encoder according to multiple history real face images specifically includes:
Optionally, the first loss function $L_{inv}$ is determined according to the following formula:
$L_{inv} = \lambda_1 L_{rec} + \lambda_2 L_{LPIPS} + \lambda_3 L_{id} + \lambda_4 L_{ldm}$;
where a first face reconstruction loss function $L_{rec}$:
$L_{rec} = \lVert x - \hat{x} \rVert_2$;
a first face sensing loss function $L_{LPIPS}$:
$L_{LPIPS} = \lVert F(x) - F(\hat{x}) \rVert_2$;
a first face identity loss function $L_{id}$:
$L_{id} = 1 - \cos(R(x), R(\hat{x}))$; and
a first face keypoint loss function $L_{ldm}$:
$L_{ldm} = \lVert P(x) - P(\hat{x}) \rVert_2$;
where $x$ represents a history real face image, $\hat{x}$ represents a virtual face image, $\lVert \cdot \rVert_2$ represents Euclidean distance calculation, $F(\cdot)$ represents a face feature extraction function, $R(\cdot)$ represents a face recognition feature extraction function, $\cos(\cdot,\cdot)$ represents cosine similarity calculation, $P(\cdot)$ represents a face keypoint extraction function, and each of $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ is a weight of the first loss function.
Optionally, the video generation method further includes:
Optionally, the determining a second loss function of the face feature exchanger according to multiple history real face images and corresponding history target face images specifically includes:
Optionally, the second loss function $L_{swap}$ is determined according to the following formula:
$L_{swap} = \varphi_1 L_{rec}' + \varphi_2 L_{LPIPS}' + \varphi_3 L_{id}' + \varphi_4 L_{ldm}' + \varphi_5 L_{norm}$;
where a second face reconstruction loss function $L_{rec}'$:
$L_{rec}' = \lVert x_s - \hat{x}_s \rVert_2 + \lVert x_t - \hat{x}_t \rVert_2$;
a second face sensing loss function $L_{LPIPS}'$:
$L_{LPIPS}' = \lVert F(x_t) - F(y_{s2t}) \rVert_2$;
a second face identity loss function $L_{id}'$:
$L_{id}' = 1 - \cos(R(x_s), R(y_{s2t}))$;
a second face keypoint loss function $L_{ldm}'$:
$L_{ldm}' = \lVert P(x_t) - P(y_{s2t}) \rVert_2$; and
a regularization term $L_{norm}$:
$L_{norm} = \lVert L_s^{high} - L_{s2t} \rVert_2$;
where $x_s$ represents a history real face image, $\hat{x}_s$ represents a history virtual face image, $x_t$ represents a history target face image, $\hat{x}_t$ represents a history virtual target face image, $y_{s2t}$ represents a history swapped face image, $L_s^{high}$ represents a high-level semantic expression of a history real face image, $L_{s2t}$ represents a high-level semantic expression of a history swapped face image, $\lVert \cdot \rVert_2$ represents Euclidean distance calculation, $F(\cdot)$ represents a face feature extraction function, $R(\cdot)$ represents a face recognition feature extraction function, $\cos(\cdot,\cdot)$ represents cosine similarity calculation, $P(\cdot)$ represents a face keypoint extraction function, and each of $\varphi_1$, $\varphi_2$, $\varphi_3$, $\varphi_4$, and $\varphi_5$ is a weight of the second loss function.
To solve the above technical problems, the present disclosure further provides the following solutions:
A video generation system for high resolution face swapping is provided, where the video generation system includes:
To solve the above technical problems, the present disclosure further provides the following solutions:
A video generation method for high resolution face swapping is provided, including:
To solve the above technical problems, the present disclosure further provides the following solutions:
A computer-readable storage medium is provided, where the computer-readable storage medium stores one or more programs, and the one or more programs, when executed by an electronic device including multiple application programs, cause the electronic device to perform the following operations:
Based on embodiments of the present disclosure, the present disclosure discloses the following technical effects:
In the present disclosure, the face feature encoder performs hierarchical encoding on the face feature to preserve as many semantic details of a face as possible; the face feature exchanger performs further processing based on the hierarchical encoding to obtain hierarchical encoding of a swapped face feature with semantic details; the face generator generates the initial swapped face image; and the face fuser fuses the initial swapped face image with the target face image to obtain the final swapped face image. Therefore, a high resolution face swapping video can be obtained.
Descriptions of Reference Numerals:
Image obtaining device—1, face feature encoder—2, face feature exchanger—3, face generator—4, and face fuser—5.
The preferred implementations of the present disclosure are described below with reference to the accompanying drawings. Those skilled in the art should understand that the implementations herein are merely intended to explain the technical principles of the present disclosure, rather than to limit the protection scope of the present disclosure.
The present disclosure provides a video generation method for high resolution face swapping. The face feature encoder performs hierarchical encoding on the face feature to preserve as many semantic details of a face as possible; the face feature exchanger performs further processing based on the hierarchical encoding to obtain hierarchical encoding of a swapped face feature with semantic details; the face generator generates the initial swapped face image; and the face fuser fuses the initial swapped face image with the target face image to obtain the final swapped face image. Therefore, a high resolution face swapping video can be obtained.
To make the above objectives, features, and advantages of the present disclosure clearer and more comprehensible, the present disclosure will be further described in detail below with reference to the accompanying drawings and the specific examples.
As shown in
Step 100: Obtain a target face image in a to-be-processed video and a corresponding source face image.
Step 200: Extract a feature of each of the source face image and the target face image through a face feature encoder, to obtain corresponding source feature codes and target feature codes.
Step 300: Generate swapped face feature codes through a face feature exchanger according to the source feature codes and the target feature codes.
In the present disclosure, a face feature exchanger that performs piecewise nonlinear optimization is used, and the swapped face feature codes are obtained by manipulating global feature codes of the face, to avoid local distortion of the generated face.
Step 400: Generate an initial swapped face image through a face generator according to the swapped face feature codes.
Step 500: Fuse the initial swapped face image with the target face image through a face fuser that is based on face semantic segmentation, to obtain a final swapped face image.
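As a non-limiting illustration, steps 200 to 500 can be sketched as a composition of four components applied to each frame. The function names `swap_frame` and `swap_video`, the callable signatures, and the choice to pass the components as plain Python callables are assumptions made for this sketch only; the disclosure does not prescribe any programming interface.

```python
from typing import Any, Callable, List, Sequence

# Hypothetical type aliases: the disclosure does not fix a data layout.
Image = Any
Codes = Any


def swap_frame(
    source: Image,
    target: Image,
    encoder: Callable[[Image], Codes],
    exchanger: Callable[[Codes, Codes], Codes],
    generator: Callable[[Codes], Image],
    fuser: Callable[[Image, Image], Image],
) -> Image:
    """Run steps 200 to 500 on one source/target pair."""
    # Step 200: encode the source face and the target face.
    source_codes = encoder(source)
    target_codes = encoder(target)
    # Step 300: derive swapped face feature codes from both sets of codes.
    swapped_codes = exchanger(source_codes, target_codes)
    # Step 400: generate the initial swapped face image.
    initial_swapped = generator(swapped_codes)
    # Step 500: fuse the initial swapped face image with the target face image.
    return fuser(initial_swapped, target)


def swap_video(
    source: Image,
    target_frames: Sequence[Image],
    encoder: Callable[[Image], Codes],
    exchanger: Callable[[Codes, Codes], Codes],
    generator: Callable[[Codes], Image],
    fuser: Callable[[Image, Image], Image],
) -> List[Image]:
    """Apply swap_frame to every target frame of a to-be-processed video."""
    return [
        swap_frame(source, frame, encoder, exchanger, generator, fuser)
        for frame in target_frames
    ]
```

Passing the components as arguments keeps the sketch independent of any particular network implementation, so each component can be replaced by a trained model without changing the control flow.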
Preferably, before step 100 is performed, the face feature encoder may be optimized first. Specifically, the video generation method for high resolution face swapping in the present disclosure further includes:
Step A1: Determine a first loss function of the face feature encoder according to multiple history real face images.
Step A2: Iteratively adjust a weight of the face feature encoder according to the first loss function by using a gradient backpropagation algorithm, until the first loss function converges to obtain an adjusted face feature encoder.
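The iterate-until-convergence pattern of step A2 can be illustrated with a deliberately small example. In the sketch below, a single scalar stands in for the encoder weights and the gradient is supplied analytically; in an actual implementation the gradient would be obtained by backpropagation through the face feature encoder. The function names, the learning rate, the tolerance, and the convergence test (change in loss below a threshold) are all assumptions for illustration.

```python
from typing import Callable, Tuple


def train_until_converged(
    weight: float,
    loss_and_grad: Callable[[float], Tuple[float, float]],
    learning_rate: float = 0.1,
    tolerance: float = 1e-10,
    max_steps: int = 10_000,
) -> Tuple[float, float, int]:
    """Gradient descent that stops once the loss stops changing.

    Returns the final weight, the final loss, and the number of loss
    evaluations performed.
    """
    previous_loss = float("inf")
    loss = float("inf")
    for step in range(1, max_steps + 1):
        loss, grad = loss_and_grad(weight)
        # Assumed convergence criterion: the loss changed by less than tolerance.
        if abs(previous_loss - loss) < tolerance:
            return weight, loss, step
        # Move the weight against the gradient of the loss.
        weight -= learning_rate * grad
        previous_loss = loss
    return weight, loss, max_steps


def toy_loss_and_grad(weight: float) -> Tuple[float, float]:
    """Toy stand-in for the first loss function: (w - 3)^2 and its derivative."""
    diff = weight - 3.0
    return diff * diff, 2.0 * diff


final_weight, final_loss, steps = train_until_converged(0.0, toy_loss_and_grad)
```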
Further, in step A1, the determining a first loss function of the face feature encoder according to multiple history real face images specifically includes:
Step A11: Extract a feature of each history real face image through the current face feature encoder, to obtain real feature codes [C, Llow, Lhigh], where C represents a basic information expression of a face feature, Llow represents a low-level semantic expression, and Lhigh represents a high-level semantic expression.
Step A12: Obtain a reconstructed virtual face image through the face generator according to the real feature codes.
Step A13: Determine the first loss function according to each pair of a history real face image and a virtual face image.
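Steps A11 and A12 can be sketched as follows, under the assumption that each part of the real feature codes [C, Llow, Lhigh] is stored as a flat sequence of numbers. The class name `FeatureCodes`, its field names, and the function `reconstruct_pairs` are hypothetical and serve only to show how the image pairs used in step A13 could be assembled.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in: an image flattened to a sequence of pixel values.
Image = Sequence[float]


@dataclass(frozen=True)
class FeatureCodes:
    """Hierarchical feature codes [C, Llow, Lhigh] of one face image."""
    c: Sequence[float]       # basic information expression of the face feature
    l_low: Sequence[float]   # low-level semantic expression
    l_high: Sequence[float]  # high-level semantic expression


def reconstruct_pairs(
    real_images: Sequence[Image],
    encoder: Callable[[Image], FeatureCodes],
    generator: Callable[[FeatureCodes], Image],
) -> List[Tuple[Image, Image]]:
    """Return (real image, reconstructed virtual image) pairs."""
    pairs = []
    for x in real_images:
        codes = encoder(x)        # step A11: extract the real feature codes
        x_hat = generator(codes)  # step A12: reconstruct a virtual face image
        pairs.append((x, x_hat))  # each pair feeds the loss of step A13
    return pairs
```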
Specifically, the first loss function $L_{inv}$ may be determined according to the following formula:
$L_{inv} = \lambda_1 L_{rec} + \lambda_2 L_{LPIPS} + \lambda_3 L_{id} + \lambda_4 L_{ldm}$;
where a first face reconstruction loss function $L_{rec}$:
$L_{rec} = \lVert x - \hat{x} \rVert_2$;
a first face sensing loss function $L_{LPIPS}$:
$L_{LPIPS} = \lVert F(x) - F(\hat{x}) \rVert_2$;
a first face identity loss function $L_{id}$:
$L_{id} = 1 - \cos(R(x), R(\hat{x}))$; and
a first face keypoint loss function $L_{ldm}$:
$L_{ldm} = \lVert P(x) - P(\hat{x}) \rVert_2$;
where $x$ represents a history real face image, $\hat{x}$ represents a virtual face image, $\lVert \cdot \rVert_2$ represents Euclidean distance calculation, $F(\cdot)$ represents a face feature extraction function, $R(\cdot)$ represents a face recognition feature extraction function, $\cos(\cdot,\cdot)$ represents cosine similarity calculation, $P(\cdot)$ represents a face keypoint extraction function, and each of $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ is a weight of the first loss function.
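The first loss function can be evaluated as in the following minimal sketch, which assumes that images and extracted features are flattened to sequences of numbers. The extraction functions F, R, and P are passed in as callables because the disclosure does not specify their implementation; the function names and the default weights of 1.0 are placeholders, not values taken from the disclosure.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Extractor = Callable[[Vector], Vector]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def inversion_loss(
    x: Vector,
    x_hat: Vector,
    F: Extractor,
    R: Extractor,
    P: Extractor,
    weights: Tuple[float, float, float, float] = (1.0, 1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the four terms of the first loss function."""
    lam1, lam2, lam3, lam4 = weights  # placeholder weights (assumption)
    l_rec = l2_distance(x, x_hat)                     # reconstruction term
    l_lpips = l2_distance(F(x), F(x_hat))             # sensing term
    l_id = 1.0 - cosine_similarity(R(x), R(x_hat))    # identity term
    l_ldm = l2_distance(P(x), P(x_hat))               # keypoint term
    return lam1 * l_rec + lam2 * l_lpips + lam3 * l_id + lam4 * l_ldm
```

A perfect reconstruction, in which the virtual face image equals the real face image, drives every term to zero, which is why minimizing this loss pushes the encoder toward codes that the generator can invert faithfully.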
Further, after step A1 is performed to optimize the face feature encoder, and before step 100 is performed, the face feature exchanger is further optimized. Specifically, the video generation method for high resolution face swapping in the present disclosure further includes:
Step B1: Determine a second loss function of the face feature exchanger according to multiple history real face images and corresponding history target face images.
Step B2: Iteratively adjust a weight of the face feature exchanger according to the second loss function by using a gradient backpropagation algorithm, until the second loss function converges to obtain an adjusted face feature exchanger.
Further, in step B1, the determining a second loss function of the face feature exchanger according to multiple history real face images and corresponding history target face images specifically includes:
Step B11: For each group of history real face images and history target face images, extract features of the history real face images and the history target face images respectively through the current face feature encoder, to obtain corresponding real feature codes and history target feature codes.
Step B12: Obtain a corresponding reconstructed history virtual face image and a corresponding reconstructed history virtual target face image through the face generator according to each of the real feature codes and the history target feature codes.
Step B13: Generate history swapped face feature codes through the face feature exchanger according to the real feature codes and the history target feature codes.
Step B14: Obtain a history swapped face image through the face generator according to the history swapped face feature codes.
Step B15: Determine the second loss function according to each group of history real face images, each group of history target face images, each group of history virtual face images, each group of history virtual target face images, and each group of history swapped face images.
Specifically, the second loss function $L_{swap}$ may be determined according to the following formula:
$L_{swap} = \varphi_1 L_{rec}' + \varphi_2 L_{LPIPS}' + \varphi_3 L_{id}' + \varphi_4 L_{ldm}' + \varphi_5 L_{norm}$;
where a second face reconstruction loss function $L_{rec}'$:
$L_{rec}' = \lVert x_s - \hat{x}_s \rVert_2 + \lVert x_t - \hat{x}_t \rVert_2$;
a second face sensing loss function $L_{LPIPS}'$:
$L_{LPIPS}' = \lVert F(x_t) - F(y_{s2t}) \rVert_2$;
a second face identity loss function $L_{id}'$:
$L_{id}' = 1 - \cos(R(x_s), R(y_{s2t}))$;
a second face keypoint loss function $L_{ldm}'$:
$L_{ldm}' = \lVert P(x_t) - P(y_{s2t}) \rVert_2$; and
a regularization term $L_{norm}$:
$L_{norm} = \lVert L_s^{high} - L_{s2t} \rVert_2$;
where $x_s$ represents a history real face image, $\hat{x}_s$ represents a history virtual face image, $x_t$ represents a history target face image, $\hat{x}_t$ represents a history virtual target face image, $y_{s2t}$ represents a history swapped face image, $L_s^{high}$ represents a high-level semantic expression of a history real face image, $L_{s2t}$ represents a high-level semantic expression of a history swapped face image, $\lVert \cdot \rVert_2$ represents Euclidean distance calculation, $F(\cdot)$ represents a face feature extraction function, $R(\cdot)$ represents a face recognition feature extraction function, $\cos(\cdot,\cdot)$ represents cosine similarity calculation, $P(\cdot)$ represents a face keypoint extraction function, and each of $\varphi_1$, $\varphi_2$, $\varphi_3$, $\varphi_4$, and $\varphi_5$ is a weight of the second loss function.
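The second loss function can be evaluated in the same manner. The sketch below again assumes flattened vectors and caller-supplied extraction functions F, R, and P; the function name `swap_loss`, the argument names, and the default weights are hypothetical. The sensing and keypoint terms compare the swapped face with the target face, while the identity term compares it with the source face.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]
Extractor = Callable[[Vector], Vector]


def l2_distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm = math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def swap_loss(
    x_s: Vector, x_s_hat: Vector,    # source face and its reconstruction
    x_t: Vector, x_t_hat: Vector,    # target face and its reconstruction
    y_s2t: Vector,                   # swapped face image
    l_s_high: Vector, l_s2t: Vector, # high-level codes: source and swapped
    F: Extractor, R: Extractor, P: Extractor,
    weights: Tuple[float, float, float, float, float] = (1.0, 1.0, 1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the five terms of the second loss function."""
    phi1, phi2, phi3, phi4, phi5 = weights  # placeholder weights (assumption)
    l_rec = l2_distance(x_s, x_s_hat) + l2_distance(x_t, x_t_hat)
    l_lpips = l2_distance(F(x_t), F(y_s2t))              # appearance follows the target
    l_id = 1.0 - cosine_similarity(R(x_s), R(y_s2t))     # identity follows the source
    l_ldm = l2_distance(P(x_t), P(y_s2t))                # keypoints follow the target
    l_norm = l2_distance(l_s_high, l_s2t)                # regularization term
    return (phi1 * l_rec + phi2 * l_lpips + phi3 * l_id
            + phi4 * l_ldm + phi5 * l_norm)
```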
The processing steps for obtaining the history virtual face image $\hat{x}_s$ and the history virtual target face image $\hat{x}_t$ are the same as step A11 to step A13, and are not repeated here.
The present disclosure is based on the pre-trained face generator and face fuser, and a piecewise training policy is used to reduce dependence on hardware.
In step 500, the face fuser fuses the swapped face image with background of the face in the target face image, to obtain a final swapped face image, and a high resolution face swapping video is formed based on each frame of final swapped face image.
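One simple way to realize such a fusion is per-pixel blending with a face mask, sketched below. The disclosure states only that the face fuser is based on face semantic segmentation; the blending rule, the function name `fuse_with_mask`, and the representation of images as nested lists of grayscale values are assumptions for illustration. The mask is assumed to come from a segmentation model, with 1.0 marking face pixels and 0.0 marking background pixels.

```python
from typing import List

# Hypothetical stand-in: a grayscale image as rows of pixel values.
Image = List[List[float]]


def fuse_with_mask(swapped: Image, target: Image, face_mask: Image) -> Image:
    """Blend per pixel: mask * swapped + (1 - mask) * target."""
    if not (len(swapped) == len(target) == len(face_mask)):
        raise ValueError("images and mask must have the same number of rows")
    fused = []
    for row_s, row_t, row_m in zip(swapped, target, face_mask):
        if not (len(row_s) == len(row_t) == len(row_m)):
            raise ValueError("images and mask must have the same row length")
        # Face pixels come from the swapped image, background from the target.
        fused.append([m * s + (1.0 - m) * t for s, t, m in zip(row_s, row_t, row_m)])
    return fused
```

Fractional mask values between 0.0 and 1.0 yield a soft transition at the face boundary, which is one possible way to avoid a visible seam between the swapped face and the target background.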
In the present disclosure, phased model design is used to achieve high resolution video face swapping at the megapixel level. The model mainly includes four parts: the face feature encoder that performs hierarchical encoding, the face feature exchanger that performs piecewise nonlinear optimization, the face generator based on StyleGAN (v1 or v2), and the face fuser based on face semantic segmentation. The face feature encoder uses a face image as an input to obtain a hierarchical feature expression of the face. The face feature exchanger performs feature exchange on hierarchical feature expressions of the source face and the target face, to obtain a face-level feature expression after swapping. The face generator uses the face-level feature expression obtained after swapping as an input to obtain a face after swapping. Finally, in video processing, the face fuser fuses the face obtained after swapping with the background of the target face, to obtain a current face frame after swapping.
To ensure training stability of the model and reduce hardware requirements, a phased training method is used in the present disclosure. That is, based on the pre-trained face generator and face fuser, the face feature encoder is trained first, and then the face feature exchanger is trained.
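The phased training order can be expressed as a small schedule that records which component is updated in each phase. The names `TrainingPhase`, `PHASES`, and `trainable_flags` are hypothetical. Keeping the face feature encoder fixed during the second phase is an inference from the statement that only the weight of the face feature exchanger is adjusted according to the second loss function.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

COMPONENTS: Tuple[str, ...] = ("encoder", "exchanger", "generator", "fuser")


@dataclass(frozen=True)
class TrainingPhase:
    """One phase of training and the components whose weights it updates."""
    name: str
    trainable: Tuple[str, ...]


# The generator and the fuser are pre-trained and never appear as trainable.
PHASES = (
    TrainingPhase(name="phase 1: face feature encoder", trainable=("encoder",)),
    TrainingPhase(name="phase 2: face feature exchanger", trainable=("exchanger",)),
)


def trainable_flags(phase: TrainingPhase) -> Dict[str, bool]:
    """Map each component to whether its weights are updated in this phase."""
    return {component: component in phase.trainable for component in COMPONENTS}
```

Because only one component is updated at a time, gradients and optimizer state need to be kept for that component alone, which is consistent with the stated aim of reducing dependence on hardware.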
To supervise the training process of the model, in the present disclosure, a face reconstruction loss, a face sensing loss, a face identity loss, and a face keypoint loss are used to constrain the face feature encoder, and a face reconstruction loss, a face sensing loss, a face identity loss, a face keypoint loss, and a regularization term are used to constrain the face feature exchanger. Specifically, the face feature encoder is responsible for performing hierarchical encoding on the face feature to preserve as many semantic details of a face as possible, and the face feature exchanger operates on the hierarchical encoding to obtain hierarchical encoding of a swapped face feature with semantic details, so that the face generator generates the swapped face.
The present disclosure is described in detail below with a specific embodiment (as shown in
Step S1: Reconstruct a history real face image. Step S1 specifically includes the following process:
Step S11: Extract a feature of the history real face image to obtain face-level encoding [C, Llow, Lhigh] (that is, real feature codes) of the history real face.
Step S12: Input the face-level encoding into the face generator to obtain a reconstructed virtual face image.
Step S2: Calculate a first face reconstruction loss, a first face sensing loss, a first face identity loss, and a first face keypoint loss according to the real face image and the virtual face image, and iteratively adjust a weight of the face feature encoder by using a loss gradient backpropagation algorithm, until convergence occurs.
Step S2 specifically includes the following sub-steps:
Step S21: Determine a first loss function based on the virtual face image obtained in step S12 and the real face image. The first loss function is divided into four parts: a first face reconstruction loss function, a first face sensing loss function, a first face identity loss function, and a first face keypoint loss function.
Step S22: Iteratively adjust the weight of the face feature encoder based on losses of the first face reconstruction loss function, the first face sensing loss function, the first face identity loss function, and the first face keypoint loss function by using a gradient backpropagation algorithm, until convergence occurs.
Step S3: Fuse the source face image with the target face image. Step S3 specifically includes the following process:
Step S31: Extract a feature of each of the source face image and the target face image to obtain hierarchical encoding of the source face image and the target face image.
Step S32: Input the source face-level encoding and the target face-level encoding into the face feature exchanger to obtain swapped face-level encoding.
Step S33: Input the swapped face-level encoding into the face generator to obtain a swapped face image.
Step S34: Input the swapped face image and the target face image into the face fuser, to fuse the face part of the swapped face image with the background part of the target face image to obtain a final swapped face image.
Each time after a swapped face image is obtained, the face feature exchanger may be further optimized according to the currently obtained swapped face image, the source face image, and the target face image. Specifically:
Step S4: Determine a second face reconstruction loss function, a second face sensing loss function, a second face identity loss function, a second face keypoint loss function, and a regularization term according to the swapped face image, the source face image, and the target face image, and iteratively adjust a weight of the face feature exchanger by using a loss gradient backpropagation algorithm, until convergence occurs.
Step S4 specifically includes the following steps:
Step S41: Determine a second loss function based on the swapped face image obtained in step S33, the source face image, and the target face image, where the second loss function is divided into five parts: a second face reconstruction loss function, a second face sensing loss function, a second face identity loss function, a second face keypoint loss function, and a regularization term.
Step S42: Iteratively adjust a weight of the face feature exchanger based on the second face reconstruction loss function, the second face sensing loss function, the second face identity loss function, the second face keypoint loss function, and the regularization term by using a gradient backpropagation algorithm, until convergence occurs.
In the present disclosure, the target-independent face swapping capability at the megapixel level is achieved through the piecewise face feature encoder, the piecewise face feature exchanger, the piecewise face generator, and the piecewise face fuser. Specifically, the face feature encoder performs hierarchical encoding to obtain a complete feature expression of a face, the face feature exchanger performs piecewise nonlinear optimization to obtain a complete feature expression of a swapped face, the face generator uses the complete feature expression of the swapped face to generate a swapped face with rich details at the resolution of 1024×1024, and finally the face fuser fuses the swapped face with the background of the target face.
In addition, the present disclosure further provides a video generation system for high resolution face swapping, so that a video characterized by high resolution face swapping can be obtained.
As shown in
The image obtaining device 1 is configured to obtain a target face image in a to-be-processed video and a corresponding source face image.
The face feature encoder 2 is configured to extract a feature of each of the source face image and the target face image, to obtain corresponding source feature codes and target feature codes.
The face feature exchanger 3 is configured to generate swapped face feature codes according to the source feature codes and the target feature codes.
The face generator 4 is configured to generate an initial swapped face image according to the swapped face feature codes.
The face fuser 5 is configured to fuse the initial swapped face image with the target face image, to obtain a final swapped face image.
Besides, the present disclosure further provides the following solutions:
A video generation method for high resolution face swapping is provided, including:
To solve the above technical problems, the present disclosure further provides the following solutions:
A computer-readable storage medium is provided, where the computer-readable storage medium stores one or more programs, and the one or more programs, when executed by an electronic device including multiple application programs, cause the electronic device to perform the following operations:
Compared with the prior art, the video generation system for high resolution face swapping and the computer-readable storage medium in the present disclosure have the same beneficial effects as those of the above video generation method for high resolution face swapping, and will not be repeated herein.
The technical solutions of the present disclosure are described with reference to the preferred implementations and drawings. Those skilled in the art should easily understand that the protection scope of the present disclosure is not limited to these specific implementations. A person skilled in the art can make equivalent changes or substitutions to the relevant technical features without departing from the principles of the present disclosure, and the technical solutions after these changes or substitutions should fall within the protection scope of the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
202110127788.2 | Jan 2021 | CN | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/CN2021/111577 | 8/9/2021 | WO |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2022/160657 | 8/4/2022 | WO | A |
Number | Date | Country |
---|---|---|
20230112462 A1 | Apr 2023 | US |