The present disclosure claims priority to Chinese Patent Application No. 202011211255.4, filed on Nov. 3, 2020, the disclosure of which is herein incorporated by reference in its entirety.
The present disclosure relates to the field of image processing technologies, and for example, to a method and apparatus for training parameter estimation models, a device, and a storage medium.
With the development of video technologies, there is an increasing demand for realistic face models in entertainment applications that display face images, such as face animation, face recognition, and augmented reality (AR). Creating a realistic three-dimensional face model is difficult: a corresponding three-dimensional face needs to be reconstructed from one or more two-dimensional face images or depth images, such that the three-dimensional face includes multiple types of three-dimensional information, such as face shapes, colors, illumination, and head rotation angles.
In a three-dimensional face reconstruction method, three-dimensional scan data of a large number of faces is usually collected firstly. Then a corresponding three-dimensional morphable model (3DMM) is constructed based on the three-dimensional scan data. In this case, the three-dimensional morphable model includes an average face shape of a standard face, a principal component base indicating a change of face identity, and a principal component base indicating a change of face expression. Then reconstruction parameters corresponding to two groups of the principal component bases are estimated based on a current to-be-reconstructed two-dimensional face image, such that the average face shape is morphed by correspondingly adjusting, based on the reconstruction parameters, the two groups of the principal component bases, and thus a corresponding three-dimensional face is reconstructed.
The reconstruction parameters corresponding to the principal component bases are generally estimated by one of two methods. In one method, pixel values of feature points in the two-dimensional face image are directly determined as supervision information of three-dimensional face deformation to estimate the reconstruction parameters corresponding to the principal component bases. However, as reconstructing a three-dimensional face from a two-dimensional picture is an ill-posed problem and only the pixel values of the feature points are determined as the supervision information, the accuracy of estimating the reconstruction parameters cannot be ensured. In the other method, multi-view two-dimensional face images or depth information of the to-be-reconstructed face are determined as input to estimate the reconstruction parameters corresponding to the principal component bases. However, this method requires a plurality of face images to be collected, and may even require a special sensor to collect a depth image, such that the reconstruction scene is limited. In addition, the collection requirements for estimating the reconstruction parameters are excessive, and thus the operation of three-dimensional face reconstruction is complicated.
The present disclosure provides a method and apparatus for training parameter estimation models, a device, and a storage medium, which optimizes the method for training the parameter estimation models used for estimating corresponding reconstruction parameters in three-dimensional face reconstruction, improves an accuracy of estimating the reconstruction parameters used in the three-dimensional face reconstruction, and reduces operation complexity of the three-dimensional face reconstruction on the basis of ensuring the accuracy of the three-dimensional face reconstruction.
The present disclosure provides a method for training parameter estimation models. The method includes:
The present disclosure further provides an apparatus for training parameter estimation models. The apparatus includes:
The present disclosure further provides a computer device. The computer device includes:
The present disclosure further provides a computer-readable storage medium storing a computer program thereon. The computer program, when run by a processor, causes the processor to perform the above method for training the parameter estimation models.
The present disclosure will be described hereinafter with reference to the accompanying drawings and embodiments. The specific embodiments described herein are merely illustrative of the present disclosure. For convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
Referring to
In S110, reconstruction parameters specified for three-dimensional face reconstruction are estimated by inputting training samples in a face image training set to a pre-constructed neural network model, and three-dimensional faces corresponding to the training samples are reconstructed by inputting the reconstruction parameters to a pre-constructed three-dimensional morphable model.
Applications such as face animation, face recognition, augmented reality, and face beautification all require three-dimensional face reconstruction technologies. Three-dimensional face reconstruction refers to reconstruction of a plurality of pieces of three-dimensional information from a two-dimensional face image, such as a three-dimensional geometric shape, an albedo, illumination information, and a head rotation angle. The three-dimensional geometric shape is composed of a set of vertices in three-dimensional space, and each vertex is uniquely determined by corresponding three-dimensional coordinates (x, y, z). For the three-dimensional face reconstruction, a corresponding three-dimensional morphable model is pre-constructed. The three-dimensional morphable model is used to model the shape and appearance of the three-dimensional face for any two-dimensional face image, and any face shape is expressed as the sum of a standard face and linear combinations of principal component vectors indicating changes of face shape and face expression. Therefore, the average face shape of the standard face is correspondingly deformed and the face expression is adjusted based on different linear combinations of the principal component vectors, such that the corresponding three-dimensional face is reconstructed.
In the embodiments, the reconstruction parameters specified for the three-dimensional face reconstruction refer to a plurality of types of parameters indicating the linear combination forms of referenced principal component vectors and parameters (such as an illumination parameter, a face position, and a gesture) affecting a realistic effect of the three-dimensional face in the case that the average face shape is deformed and the face expression is adjusted based on the three-dimensional morphable model. Corresponding deformation of the average face shape and change of face expression are controlled based on the reconstruction parameters, such that the corresponding three-dimensional face is generated, and the display of details, such as an illumination condition and a gesture angle in the three-dimensional face is improved. Therefore, an accuracy of estimating reconstruction parameters of a to-be-reconstructed two-dimensional face image directly affects a reality of the reconstructed three-dimensional face corresponding to the two-dimensional face image. Therefore, for a more realistic reconstructed three-dimensional face, it is necessary to accurately estimate a plurality of reconstruction parameters specified for the three-dimensional face reconstruction from the to-be-reconstructed two-dimensional face image. The reconstruction parameters in the embodiments include a deformation parameter indicating a change of face shape, an expression parameter indicating a change of face expression, an albedo parameter indicating a change of a face albedo, an illumination parameter indicating a change of face illumination, a position parameter indicating face translation, a rotation parameter indicating a head gesture, and the like. The albedo parameter includes red-green-blue (RGB) color information of the to-be-reconstructed two-dimensional face image.
In some embodiments, for accurate estimation of a plurality of types of reconstruction parameters specified for the three-dimensional face reconstruction for any to-be-reconstructed two-dimensional face image, in the embodiments, a neural network model is initially constructed, and then parameter estimation training is performed on the neural network model based on a large number of face image samples, such that a parameter estimation model capable of accurately estimating the reconstruction parameters specified for the three-dimensional face reconstruction is acquired. The neural network model is a convolutional neural network. In the case that the parameter estimation training is performed on the neural network model, a corresponding face image training set is constructed firstly. The face image training set includes a large number of face images, with different sources and different types, determined as training samples of the neural network model. As shown in
Therefore, after the reconstruction parameters used in performing the three-dimensional face reconstruction on the training sample are estimated by the neural network model, the reconstruction parameters are correspondingly inputted to the pre-constructed three-dimensional morphable model, corresponding deformation, expression change, and albedo change are performed on a standard face in the three-dimensional morphable model by the three-dimensional morphable model using the reconstruction parameters, and corresponding three-dimensional detail display information is correspondingly adjusted, such that the three-dimensional face corresponding to the training sample is reconstructed. The reality of the reconstructed three-dimensional face is analyzed by comparing similarity between the training sample and the reconstructed three-dimensional face.
In S120, a plurality of loss functions of a plurality of pieces of two-dimensional supervision information between the three-dimensional faces and the training samples are calculated, and weights corresponding to the plurality of loss functions are adjusted.
In some embodiments, for analysis of the reality of the reconstructed three-dimensional face relative to the face in the training sample, a corresponding loss function is predetermined in the embodiments, and the similarity between the reconstructed three-dimensional face and the training sample is compared by the loss function. In the embodiments, a plurality of pieces of two-dimensional supervision information indicate that the similarity between the reconstructed three-dimensional face and the training sample under a plurality of supervision dimensions is judged comprehensively based on the plurality of pieces of two-dimensional supervision information, and the loss functions of the plurality of pieces of two-dimensional supervision information are set to avoid a reconstruction error of the three-dimensional face as comprehensively as possible. Illustratively, the loss functions of the plurality of pieces of two-dimensional supervision information in the embodiments include: an image pixel loss function, a key point loss function, an identity feature loss function, an albedo penalty function, and a regular term corresponding to a target reconstruction parameter in the reconstruction parameters specified for the three-dimensional face reconstruction.
Upon reconstruction of the three-dimensional faces corresponding to the training samples, reconstruction errors between the reconstructed three-dimensional faces and the training samples under dimensions of the plurality of pieces of supervision information, that is, values of the loss functions, are calculated by the currently set loss functions of the plurality of pieces of two-dimensional supervision information. The reality of the reconstructed three-dimensional face of each piece of supervision information is further analyzed based on the values of the plurality of loss functions, such that an estimation accuracy of the trained neural network model of each piece of supervision information is determined. Further, the weights corresponding to the plurality of loss functions are correspondingly adjusted to improve estimation capability in next training.
Illustratively, in the embodiments, by calculating the image pixel loss function, the key point loss function, the identity feature loss function, and the albedo penalty function between the three-dimensional face and the training sample, and the regular term corresponding to the target reconstruction parameter in the reconstruction parameters specified for the three-dimensional face reconstruction, a reconstruction accuracy capability of the current training process for an image pixel, a key point, an identity feature, an albedo, and a plurality of reconstruction parameters in the three-dimensional face reconstruction is determined. The weights of the plurality of loss functions are correspondingly adjusted based on the reconstruction capability, and the training is continued to continuously improve a capability of estimating the reconstruction parameters in the three-dimensional face reconstruction.
In addition, as the reconstructed three-dimensional face only includes a three-dimensional face image, and the training sample further displays a background image of a non-face image in addition to the face image, as shown in
The three-dimensional face is rendered by the differentiable renderer, such that the texture and the image of the rendered three-dimensional face are more similar to the training sample, and the parameter estimation model is more accurately trained subsequently based on the rendered three-dimensional face.
In S130, fitting loss functions are generated based on the plurality of loss functions and the weights corresponding to the plurality of loss functions, and a trained parameter estimation model is acquired by performing an inverse correction on the neural network model using the fitting loss function.
Upon calculation of the plurality of loss functions of the plurality of pieces of two-dimensional supervision information between the three-dimensional faces and the training samples, and adjustment of the weights corresponding to the plurality of loss functions, the plurality of loss functions of the plurality of pieces of two-dimensional supervision information are integrated based on the weights corresponding to the plurality of loss functions to generate corresponding fitting loss functions, and the fitting loss functions are determined as the loss functions of the whole training process of the neural network model. Then backpropagation is performed on the whole training process by the fitting loss function to correct the network parameters in the neural network model. Reconstruction parameters of the next training sample in the three-dimensional face reconstruction are estimated by the corrected neural network model according to the above processes, such that the training process is further performed, the inverse correction is continuously performed on the neural network model, and the finally trained neural network model is determined as a trained parameter estimation model. The inverse correction is continuously performed on the neural network model using the fitting loss functions in the embodiments based on the two-dimensional face information of the plurality of pieces of supervision information without referring to additional three-dimensional face information prior to reconstructing the three-dimensional face to acquire the trained parameter estimation model, such that the method for training the parameter estimation model used for estimating corresponding reconstruction parameters in the three-dimensional face reconstruction is optimized. 
The parameter estimation model is trained based on the loss functions of the plurality of pieces of two-dimensional supervision information, such that reference information in the training is more comprehensive, and thus the accuracy of estimating the reconstruction parameters used in the three-dimensional face reconstruction is improved.
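The integration of the individual loss functions into one fitting loss can be illustrated with a minimal Python sketch. It assumes that the fitting loss is a plain weighted sum; the function name `fitting_loss`, the dictionary keys, and all numeric values are hypothetical and serve only as an illustration. In an actual training process the loss values would be tensors of a deep learning framework so that backpropagation can be performed.

```python
from typing import Dict


def fitting_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine individual supervision losses into one weighted sum (assumed form)."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight configured for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical loss values and weights for one training step, illustration only.
losses = {"photometric": 0.5, "landmark": 2.0, "perception": 0.25, "box": 0.0}
weights = {"photometric": 2.0, "landmark": 0.5, "perception": 4.0, "box": 10.0}
total = fitting_loss(losses, weights)
```

Adjusting a weight between training rounds changes how strongly the corresponding piece of supervision information influences the correction of the network parameters.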
For example, in the case that the loss functions of the plurality of pieces of two-dimensional supervision information are: an image pixel loss function L_photometric, a key point loss function L_landmark, an identity feature loss function L_perception, an albedo penalty function L_box, and a regular term corresponding to a target reconstruction parameter (an adjustment parameter α of a principal component vector indicating the change of face shape, an adjustment parameter δ of a principal component vector indicating the change of face expression, an adjustment parameter β of a principal component vector indicating the change of face albedo, and an illumination parameter γ) in the reconstruction parameters specified for the three-dimensional face reconstruction, the fitting loss function in the embodiment is:
represents a weight corresponding to the image pixel loss function, λ1 represents a weight corresponding to the key point loss function, λid represents a weight corresponding to the identity feature loss function, and λb represents a weight corresponding to the albedo penalty function, and λα, λβ, λδ, and λγ represent weights corresponding to regular terms corresponding to target reconstruction parameters in the plurality of reconstruction parameters specified for the three-dimensional face reconstruction, respectively.
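One possible form of such a fitting loss function, consistent with the loss functions and weights listed above, is sketched below. The symbol L_fit, the weight symbol λ_p for the image pixel loss function, the symbol λ_l for the key point weight, and the use of squared L2 norms as the regular terms are assumptions made for illustration only; they are not taken from the original formula.

```latex
\begin{equation}
\begin{aligned}
L_{fit} ={}& \lambda_{p} L_{photometric} + \lambda_{l} L_{landmark}
           + \lambda_{id} L_{perception} + \lambda_{b} L_{box} \\
         & + \lambda_{\alpha} \lVert \alpha \rVert_2^2
           + \lambda_{\beta} \lVert \beta \rVert_2^2
           + \lambda_{\delta} \lVert \delta \rVert_2^2
           + \lambda_{\gamma} \lVert \gamma \rVert_2^2
\end{aligned}
\end{equation}
```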
Upon acquisition of the trained parameter estimation model, the reconstruction parameters of any to-be-reconstructed two-dimensional face image in the three-dimensional face reconstruction are accurately estimated by the parameter estimation model. Therefore, upon the acquisition of the trained parameter estimation model by performing the inverse correction on the neural network model using the fitting loss function, the embodiments further include: inputting to-be-reconstructed two-dimensional face images to the parameter estimation model, estimating reconstruction parameters specified for three-dimensional face reconstruction, and reconstructing three-dimensional faces corresponding to the two-dimensional face images by inputting the reconstruction parameters to a pre-constructed three-dimensional morphable model.
As shooting sizes of to-be-reconstructed two-dimensional face images are different, for the accuracy of the three-dimensional face reconstruction, as shown in
According to the technical solutions provided in the embodiments, a corresponding neural network model is pre-constructed for a plurality of reconstruction parameters specifically used in the three-dimensional face reconstruction. Each training sample in a face image training set is inputted to the neural network model to estimate the reconstruction parameters required in the three-dimensional face reconstruction of the training sample, and the reconstruction parameters are inputted to the pre-constructed three-dimensional morphable model to reconstruct the three-dimensional face corresponding to the training sample. The fitting loss function in training the neural network model is generated by calculating the plurality of loss functions of the plurality of pieces of two-dimensional supervision information between the three-dimensional face and the training sample and adjusting the weights corresponding to the plurality of loss functions. The inverse correction is continuously performed on the neural network model using the fitting loss functions based on the two-dimensional face information of the plurality of pieces of supervision information without referring to additional three-dimensional face information prior to reconstructing the three-dimensional face to acquire the trained parameter estimation model, such that the method for training the parameter estimation models used for estimating the corresponding reconstruction parameters in the three-dimensional face reconstruction is optimized. The parameter estimation model is trained based on the loss functions of the plurality of pieces of two-dimensional supervision information, such that the reference information in the training is more comprehensive, and thus the accuracy of estimating the reconstruction parameters used in the three-dimensional face reconstruction is improved. 
Meanwhile, the reconstruction parameters in the three-dimensional face reconstruction are estimated by the trained parameter estimation model, such that the deformation process of the to-be-reconstructed face image in the three-dimensional morphable model is more accurate, the accuracy of the three-dimensional face reconstruction is ensured, and the operation complexity of the three-dimensional face reconstruction is reduced without deploying additional information devices in the three-dimensional face reconstruction.
The three-dimensional morphable model in the embodiments includes a dual principal component analysis (PCA) model and a single PCA model. As shown in
The dual PCA model in the embodiments includes a three-dimensional average face, a first principal component base indicating the change of face appearance, and a second principal component base indicating the change of face expression, which is represented as: S=
The single PCA model in the embodiments includes an average face albedo and a third principal component base indicating the change of a face albedo, which is represented as T=
In addition, for three-dimensional detail features in the three-dimensional face reconstruction, the three-dimensional morphable model of the embodiments further includes an illumination parameter γ indicating the change of face illumination, a position parameter t indicating face translation, and a rotation parameter p indicating a head gesture. In the embodiments, spherical harmonic illumination is used as the illumination in a three-dimensional scene to estimate the corresponding illumination parameter γ.
Therefore, for accurate reconstruction of the three-dimensional face, the reconstruction parameters specified for the three-dimensional face reconstruction in the embodiment are (α, δ, β, γ, t, p).
In some embodiments, for the accuracy of the three-dimensional face reconstruction, in the dual PCA model in the embodiments, the number of first principal component bases indicating the change of face appearance is 80, and the number of second principal component bases indicating the change of face expression is 30. In the single PCA model, the number of third principal component bases indicating the change of the face albedo is 79. The number of illumination parameters is 27, namely nine parameters for each of the R, G, and B color channels, and the numbers of position parameters and rotation parameters are each three. The numbers of principal component bases and illumination parameters according to the embodiments are merely exemplary; they are set according to the corresponding reconstruction requirements and are not limited in the embodiments.
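With the exemplary numbers above, the neural network model would output 80 + 30 + 79 + 27 + 3 + 3 = 222 values per face image. The following Python sketch shows one way to split such an output into the groups (α, δ, β, γ, t, p). The ordering of the groups inside the vector and the names `PARAM_LAYOUT` and `split_parameters` are assumptions made for illustration; the embodiments do not prescribe a layout.

```python
from typing import Dict, List, Sequence

# Group sizes follow the exemplary configuration; the ordering is an assumption.
PARAM_LAYOUT = [
    ("alpha", 80),  # face shape (first principal component base)
    ("delta", 30),  # face expression (second principal component base)
    ("beta", 79),   # face albedo (third principal component base)
    ("gamma", 27),  # illumination, nine values per R, G, B channel
    ("t", 3),       # face translation
    ("p", 3),       # head rotation
]
TOTAL_PARAMS = sum(size for _, size in PARAM_LAYOUT)


def split_parameters(vector: Sequence[float]) -> Dict[str, List[float]]:
    """Slice a flat network output into named reconstruction parameter groups."""
    if len(vector) != TOTAL_PARAMS:
        raise ValueError(f"expected {TOTAL_PARAMS} values, got {len(vector)}")
    groups: Dict[str, List[float]] = {}
    offset = 0
    for name, size in PARAM_LAYOUT:
        groups[name] = list(vector[offset:offset + size])
        offset += size
    return groups


# Dummy output vector used only to demonstrate the slicing.
groups = split_parameters([float(i) for i in range(TOTAL_PARAMS)])
```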
As shown in
In some embodiments, as shown in
In S210, three-dimensional face scan data with uniform illumination under a multi-dimensional data source is collected, and the three-dimensional average face, the average face albedo, the first principal component base, the second principal component base, and the third principal component base are acquired by performing a deformation analysis, an expression change analysis, and an albedo analysis on the three-dimensional face scan data.
In some embodiments, for accurate judgement of a principal component base capable of affecting the reality of the three-dimensional face reconstruction, a three-dimensional (3D) scanning technology is firstly used to scan a large amount of face information with uniform illumination under the multi-dimensional data source of different ethnicities, ages, sexes, skin colors, expressions, and the like, such that the three-dimensional face scan data with the uniform illumination under the multi-dimensional data source is acquired. Subsequently, the corresponding deformation analysis, expression change analysis, and albedo analysis are performed on the large amount of three-dimensional face scan data to acquire the corresponding three-dimensional average face, average face albedo, the first principal component base indicating the change of face appearance, the second principal component base indicating the change of face expression, and the third principal component base indicating the change of the face albedo, such that the corresponding three-dimensional morphable model is constructed subsequently to perform the accurate three-dimensional reconstruction on the face image.
In S220, the reconstruction parameters specified for the three-dimensional face reconstruction are estimated by inputting training samples in a face image training set to a pre-constructed neural network model.
In S230, the reconstruction parameters matched with the first principal component base and reconstruction parameters matched with the second principal component base are inputted to a dual PCA model, and a three-dimensional morphable face is acquired by deforming the three-dimensional average face.
In some embodiments, upon estimation of a plurality of types of reconstruction parameters in performing the three-dimensional face reconstruction on the training samples, the plurality of types of reconstruction parameters are inputted to the three-dimensional morphable model to deform the three-dimensional average face. In this case, the three-dimensional morphable model includes the dual PCA model and the single PCA model. Different PCA models have different reconstruction functions. The dual PCA model is mainly configured to model the changes of face appearance and expression in the three-dimensional face reconstruction, and the single PCA model is mainly configured to model the change of the face albedo in the three-dimensional face reconstruction. Therefore, the three-dimensional face reconstruction is performed on the training sample sequentially by the dual PCA model and the single PCA model.
Firstly, the reconstruction parameters matched with the first principal component base and the reconstruction parameters matched with the second principal component base in the dual PCA model are screened out from the estimated reconstruction parameters, and then the screened reconstruction parameters are inputted to the dual PCA model. Corresponding appearance change and expression change are performed on the three-dimensional average face defined in the dual PCA model using a model indication function of the above dual PCA model, such that the corresponding three-dimensional morphable face is acquired. Subsequently, the single PCA model is used to continuously change the albedo of the three-dimensional morphable face to reconstruct the corresponding three-dimensional face.
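The deformation performed by the dual PCA model can be sketched in Python as follows. The sketch assumes the linear form described earlier, in which a face shape is the sum of the average face and linear combinations of the principal component vectors; the function name `morph_shape` and the toy data are hypothetical. The albedo change in the single PCA model would follow the same pattern with the average face albedo and the third principal component base.

```python
from typing import List, Sequence


def morph_shape(
    mean_shape: Sequence[float],
    identity_bases: Sequence[Sequence[float]],
    expression_bases: Sequence[Sequence[float]],
    alpha: Sequence[float],
    delta: Sequence[float],
) -> List[float]:
    """Deform the average face with identity and expression coefficients.

    Every shape is a flat list [x0, y0, z0, x1, y1, z1, ...] of vertex coordinates.
    """
    if len(alpha) != len(identity_bases) or len(delta) != len(expression_bases):
        raise ValueError("one coefficient is required per principal component base")
    shape = list(mean_shape)
    # Add the weighted identity (appearance) components.
    for coeff, base in zip(alpha, identity_bases):
        for i, value in enumerate(base):
            shape[i] += coeff * value
    # Add the weighted expression components.
    for coeff, base in zip(delta, expression_bases):
        for i, value in enumerate(base):
            shape[i] += coeff * value
    return shape


# Toy example: a single vertex, one identity base, two expression bases.
morphed = morph_shape(
    mean_shape=[0.0, 0.0, 0.0],
    identity_bases=[[1.0, 0.0, 0.0]],
    expression_bases=[[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
    alpha=[2.0],
    delta=[0.5, -1.0],
)
```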
In S240, the three-dimensional morphable face and reconstruction parameters matched with the third principal component base are inputted to the single PCA model, and a reconstructed three-dimensional face is acquired by performing an albedo correction on the three-dimensional morphable face based on the average face albedo.
In some embodiments, upon acquisition of the corresponding three-dimensional morphable face by the dual PCA model, the reconstruction parameters matched with the third principal component base defined in the single PCA model are screened out again from the estimated reconstruction parameters. Then the three-dimensional morphable face and the reconstruction parameters matched with the third principal component base are both inputted to the single PCA model. The albedo correction is performed on the three-dimensional morphable face based on the average face albedo using a model representation function of the above single PCA model, such that the reconstructed three-dimensional face is acquired.
In addition, for three-dimensional detail features of the three-dimensional face, the three-dimensional face is optimized by an illumination parameter, a position parameter, and a rotation parameter defined in the three-dimensional morphable model.
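As an illustration of how an illumination parameter could act on the reconstructed face, the following Python sketch shades one vertex with spherical harmonic illumination. It assumes the first three spherical harmonic bands (nine basis functions), a 27-value illumination parameter laid out as nine coefficients for R, then G, then B, and a shaded color equal to the albedo multiplied by the irradiance. The layout, the shading rule, and the names `sh_basis` and `shade_vertex` are assumptions for illustration only.

```python
from typing import List, Sequence


def sh_basis(normal: Sequence[float]) -> List[float]:
    """Nine real spherical harmonic basis values (bands 0 to 2) for a unit normal."""
    x, y, z = normal
    return [
        0.282095,
        0.488603 * y,
        0.488603 * z,
        0.488603 * x,
        1.092548 * x * y,
        1.092548 * y * z,
        0.315392 * (3.0 * z * z - 1.0),
        1.092548 * x * z,
        0.546274 * (x * x - y * y),
    ]


def shade_vertex(
    albedo_rgb: Sequence[float], normal: Sequence[float], gamma: Sequence[float]
) -> List[float]:
    """Multiply the vertex albedo by the per-channel spherical harmonic irradiance."""
    if len(gamma) != 27:
        raise ValueError("expected 27 illumination values (9 per color channel)")
    basis = sh_basis(normal)
    shaded = []
    for channel in range(3):
        coeffs = gamma[channel * 9:(channel + 1) * 9]
        irradiance = sum(c * b for c, b in zip(coeffs, basis))
        shaded.append(albedo_rgb[channel] * irradiance)
    return shaded


# Hypothetical lighting: only the constant band is set, scaled to unit irradiance.
gamma = [0.0] * 27
for channel in range(3):
    gamma[channel * 9] = 1.0 / 0.282095
color = shade_vertex([0.8, 0.6, 0.4], [0.0, 0.0, 1.0], gamma)
```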
In S250, a plurality of loss functions of a plurality of pieces of two-dimensional supervision information between the three-dimensional face and the training sample are calculated, and weights corresponding to the plurality of loss functions are adjusted.
In S260, fitting loss functions are generated based on the plurality of loss functions and the weights corresponding to the plurality of loss functions, and a trained parameter estimation model is acquired by performing an inverse correction on the neural network model using the fitting loss function.
According to the technical solutions provided in the embodiments, in the training of the parameter estimation model, the dual PCA model is used to construct the three-dimensional morphable model to ensure the accuracy of the three-dimensional face reconstruction. Further, the loss on the plurality of pieces of two-dimensional supervision information between the reconstructed three-dimensional face and the training sample is reflected on the estimation error of the reconstruction parameters as much as possible. The parameter estimation model is trained based on the loss functions of the plurality of pieces of two-dimensional supervision information, such that the reference information in the training is more comprehensive and accurate, and thus the accuracy of estimating the reconstruction parameters used in the three-dimensional face reconstruction is improved.
In some embodiments, as shown in
In S301, reconstruction parameters specified for three-dimensional face reconstruction are estimated by inputting training samples in a face image training set to a pre-constructed neural network model, and three-dimensional faces corresponding to the training samples are reconstructed by inputting the reconstruction parameters to a pre-constructed three-dimensional morphable model.
In S302, skin masks are segmented from the training samples.
In some embodiments, a mask is a binary image composed of pixel values 0 and 1. In the embodiments, in the case that the image pixel loss function is set, for the accuracy of image pixel loss in the training, a skin mask is generated for each training sample, in which pixel values of the facial skin region are all set to 1 and pixel values of the non-facial skin region are all set to 0. To this end, a skin segmentation algorithm is used to accurately segment the corresponding facial skin region from the training sample, such that interference of pixel features in the non-facial skin region on the three-dimensional face reconstruction is avoided.
In S303, a corresponding image pixel loss function is acquired by calculating, based on the skin masks, pixel errors of the same pixel points in facial skin regions in the three-dimensional faces and the training samples.
In some embodiments, after the skin mask is segmented from the training sample, pixel points at the same pixel position are looked up in the reconstructed three-dimensional faces and the training samples. Then whether each such pixel point is in the facial skin region is accurately judged based on the segmented skin masks. The overall pixel error between the three-dimensional faces and the training samples in the facial skin regions is analyzed by calculating the pixel errors of the pixel points in the facial skin region between the three-dimensional faces and the training samples, such that the image pixel loss function is acquired. The image pixel loss function only compares pixel errors in the facial skin regions before and after reconstruction, and shields the effect of pixels of the non-facial skin region, such that estimated face identity features and albedo information in the reconstruction parameters are more accurate.
Illustratively, the image pixel loss function in the embodiment is:
wherein $I_{in}^{i,jk}$ represents a pixel value of a pixel point (j, k) in the ith training sample, $I_{out}^{i,jk}$ represents a pixel value of the pixel point (j, k) in the three-dimensional face reconstructed for the ith training sample, and $M_{jk}$ represents a pixel value of the pixel point (j, k) in the skin mask, $M_{jk}$ being 1 for a pixel point in the facial skin region and 0 otherwise.
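As a minimal sketch of how a skin-masked image pixel loss could be computed from the quantities defined above, the following Python function averages the per-pixel color distance over skin pixels only. The function name `masked_pixel_loss`, the use of a Euclidean color distance, and the normalization by the number of skin pixels are assumptions made for illustration; they are not taken from the disclosure.

```python
import math


def masked_pixel_loss(inputs, renders, masks):
    """Mean color error over facial-skin pixels only (illustrative sketch).

    inputs, renders: lists of images; image[j][k] is an (r, g, b) tuple.
    masks: list of binary skin masks; mask[j][k] is 1 for skin, 0 otherwise.
    The Euclidean distance and the normalization are assumptions.
    """
    total = 0.0
    for img_in, img_out, mask in zip(inputs, renders, masks):
        err_sum = 0.0
        skin_count = 0
        for row_in, row_out, row_m in zip(img_in, img_out, mask):
            for p_in, p_out, m in zip(row_in, row_out, row_m):
                if m:  # non-skin pixels are ignored entirely
                    err_sum += math.dist(p_in, p_out)
                    skin_count += 1
        # Guard against a mask with no skin pixels.
        total += err_sum / max(skin_count, 1)
    return total / max(len(inputs), 1)
```

In this sketch a large error at a non-skin pixel (hair, background, glasses) contributes nothing to the loss, which is the shielding effect the skin mask is intended to provide.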
In S304, a plurality of key feature points at a predetermined position are extracted from the training samples, and visibility of the plurality of key feature points is determined.
In some embodiments, to match the key feature points in the reconstructed three-dimensional faces one-to-one with those in the training samples, when the key point loss function in the training is set, a Landmark algorithm is used to extract the key feature points at the predetermined positions in a plurality of face regions from the training samples, for example, 17 key feature points on a face contour, five key feature points on each of left and right eyebrows, six key feature points on each of left and right eyes, nine key feature points on a nose, and 20 key feature points on a mouth. Illustratively, 68 key feature points are used in the embodiments, as shown in
In S305, the key point loss function is acquired by calculating position reconstruction errors of the plurality of visible key feature points between the three-dimensional faces and the training samples.
In some embodiments, after the plurality of visible key feature points are determined, the position reconstruction error of each visible key feature point before and after reconstruction is calculated by analyzing whether pixel positions of the visible key feature points on the reconstructed three-dimensional faces and the training samples are consistent, such that the corresponding key point loss function is acquired. For a face with a large head rotation angle in the training sample, only the visible key feature points, that is, those on the visible half of the face, need to be selected to calculate the corresponding key point reconstruction loss; the invisible key feature points do not participate in the loss calculation of key point reconstruction.
In addition, in the training, the head poses of the reconstructed three-dimensional faces and the training samples are different. Therefore, to match the same pixel point in the three-dimensional face and the training sample, in the embodiments, vertexes matched with the plurality of visible key feature points in the training sample are determined by dynamically selecting key feature points in the reconstructed three-dimensional face. Illustratively, according to the head pose in the three-dimensional face, a three-dimensional mesh vertex matched with each visible key feature point is dynamically selected from the three-dimensional face, and position information of the three-dimensional mesh vertex in the three-dimensional face is determined as a reconstruction position of the visible key feature point to calculate a position reconstruction error of the visible key feature point between the three-dimensional face and the training sample.
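One possible way to picture the dynamic selection described above is sketched below in Python. It assumes, purely for illustration, that each contour key feature point has a short list of candidate mesh vertexes and that the candidate lying furthest out on the silhouette after the head rotation is the one matched to the key feature point. The helper names `rotate_yaw` and `select_contour_vertex`, the yaw-only rotation, and the silhouette criterion are all hypothetical and are not specified in the disclosure.

```python
import math


def rotate_yaw(vertex, yaw):
    """Rotate a 3D point about the vertical (y) axis by yaw radians."""
    x, y, z = vertex
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z, y, -s * x + c * z)


def select_contour_vertex(vertices, candidates, yaw, side):
    """Pick the candidate vertex on the silhouette for the current pose.

    vertices: list of (x, y, z) mesh vertexes in the canonical pose.
    candidates: vertex indices that may represent this contour point.
    side: +1 for the right contour, -1 for the left contour.
    Assumption: the matched vertex is the candidate with the most
    extreme horizontal coordinate after rotation.
    """
    return max(candidates, key=lambda idx: side * rotate_yaw(vertices[idx], yaw)[0])
```

With a frontal head the outermost candidate is chosen; as the head turns, a candidate further back on the cheek moves onto the silhouette and is chosen instead, so the reconstruction position of the key feature point follows the visible contour.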
Firstly, head poses, such as head translation positions and rotation angles, in the reconstructed three-dimensional face are analyzed, and then the three-dimensional mesh vertex matched with each visible key feature point is dynamically selected from the three-dimensional face based on the head poses in the three-dimensional face and a face part represented by each visible key feature point. As shown in
Illustratively, the key point loss function in the embodiments is:
wherein $l_{in}^{i,j}$ represents a position coordinate of the jth key feature point in the ith training sample, $l_{out}^{i,j}$ represents a position coordinate of the jth key feature point in the three-dimensional face reconstructed for the ith training sample, $V_{ij}$ represents visibility of the jth key feature point, which is 1 for a visible key feature point and 0 for an invisible key feature point, and $w_j$ is a weight of the jth key feature point in the loss function. Different weights are used for different face parts (such as eyes, mouth, and contour points), and the weights are controlled by adjusting the value of $w_j$.
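The role of the visibility flag and the per-point weight can be illustrated with a short Python sketch. The function name `keypoint_loss`, the squared Euclidean distance, and the normalization by the sum of active weights are assumptions chosen for illustration; the disclosure only states that invisible key feature points are excluded and that different face parts carry different weights.

```python
def keypoint_loss(pts_in, pts_out, visibility, weights):
    """Visibility-masked, weighted key point error (illustrative sketch).

    pts_in, pts_out: per-sample lists of (x, y) key point positions in the
        training sample and in the reconstructed three-dimensional face.
    visibility: per-sample lists of 0/1 flags, one per key point.
    weights: one weight per key point index, shared across samples.
    Squared distance and weight normalization are assumptions.
    """
    total = 0.0
    for sample_in, sample_out, vis in zip(pts_in, pts_out, visibility):
        num = 0.0
        den = 0.0
        for (x0, y0), (x1, y1), v, w in zip(sample_in, sample_out, vis, weights):
            # Invisible points have v == 0 and drop out of both sums.
            num += v * w * ((x0 - x1) ** 2 + (y0 - y1) ** 2)
            den += v * w
        total += num / den if den > 0 else 0.0
    return total / max(len(pts_in), 1)
```

Raising the weight of, say, the mouth points makes mouth alignment errors dominate the loss, while a key feature point hidden by head rotation has no influence regardless of its weight.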
In S306, first identity features corresponding to the training samples and second identity features corresponding to the three-dimensional faces are acquired by inputting the training samples and the reconstructed three-dimensional faces to a pre-constructed face recognition model.
In some embodiments, for the identity feature loss function, it is essential to analyze whether the identity feature changes before and after reconstruction. Therefore, in the embodiments, a corresponding face recognition model is pre-constructed for identity feature recognition, and the identity features before and after reconstruction are extracted by the face recognition model. As shown in
In addition, as the head pose in the training sample has a certain rotation angle, part of the face region in the training sample is invisible, and the identity features extracted at a single angle have an error. To ensure the accuracy of the first identity feature, the first identity feature corresponding to the training sample in the embodiments is further calculated by: collecting a plurality of face images that contain the same faces as the training samples and are shot from a plurality of angles; extracting a plurality of identity sub-features corresponding to the plurality of face images by inputting the plurality of face images to the pre-constructed three-dimensional morphable model; and acquiring the first identity features corresponding to the training samples by integrating the extracted plurality of identity sub-features.
By analyzing the face in the training sample, a plurality of face images containing the same face as the training sample are additionally shot from the plurality of angles, and then the face images shot from the plurality of angles are inputted to the three-dimensional morphable model. The identity features of the plurality of face images are extracted based on the first principal component base indicating the change of face appearance in the three-dimensional morphable model, such that the identity sub-feature corresponding to each face image is acquired. In this case, the plurality of identity sub-features are integrated to acquire the first identity feature corresponding to the training sample, such that comprehensiveness and accuracy of the first identity feature are ensured.
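The disclosure does not state how the identity sub-features are integrated. As one hedged illustration, the Python sketch below averages the sub-features extracted from the different viewing angles and rescales the result to unit length; both the averaging and the normalization, as well as the function name `integrate_identity_features`, are assumptions.

```python
import math


def integrate_identity_features(sub_features):
    """Fuse per-view identity sub-features into one feature (sketch).

    sub_features: non-empty list of equal-length feature vectors, one per
        face image shot from a different angle.
    Assumption: integration is an element-wise mean followed by L2
    normalization; other fusion rules are equally possible.
    """
    count = len(sub_features)
    dim = len(sub_features[0])
    mean = [sum(f[d] for f in sub_features) / count for d in range(dim)]
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean] if norm > 0 else mean
```

Averaging lets information visible in one view compensate for regions hidden in another, which matches the stated aim of making the first identity feature more comprehensive.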
In S307, an identity feature loss function is calculated based on similarities between the first identity features and the second identity features.
In some embodiments, upon acquisition of the first identity feature corresponding to the training sample and the second identity feature corresponding to the three-dimensional face, to analyze whether an error is present in the identity features before and after reconstruction, the similarity between the first identity feature and the second identity feature needs to be determined first, and then a corresponding identity feature loss function is calculated based on the similarity.
Illustratively, the identity feature loss function in the embodiments is:
wherein $F_{in}^{i}$ represents a first identity feature corresponding to the ith training sample, and $F_{out}^{i}$ represents a second identity feature corresponding to the three-dimensional face reconstructed for the ith training sample.
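A similarity-based identity feature loss can be sketched as follows, assuming that the similarity is the cosine similarity between the two feature vectors and that the loss is one minus that similarity, averaged over the batch. The choice of cosine similarity and the function names `cosine_similarity` and `identity_loss` are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a degenerate feature as dissimilar
    return dot / (norm_a * norm_b)


def identity_loss(feats_in, feats_out):
    """Mean of (1 - cosine similarity) over the batch (assumed form).

    feats_in: first identity features, one per training sample.
    feats_out: second identity features, one per reconstructed face.
    """
    losses = [1.0 - cosine_similarity(a, b) for a, b in zip(feats_in, feats_out)]
    return sum(losses) / max(len(losses), 1)
```

Under this form the loss approaches zero when the reconstructed face is recognized as the same identity as the training sample and grows as the two identity features diverge.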
In S308, albedos of a plurality of vertexes in the three-dimensional face are calculated.
In some embodiments, the albedos of a plurality of pixel points in the training sample are calculated by detecting information such as color and reflected light intensity of the pixel points, and then the albedo of each vertex on the reconstructed three-dimensional face is set based on the position matching between each vertex in the reconstructed three-dimensional face and the plurality of pixel points in the training sample, such that consistency of the face albedo before and after reconstruction is ensured.
In S309, an albedo penalty function is calculated based on the albedos of the plurality of vertexes in the three-dimensional face and a predetermined albedo interval.
In some embodiments, to ensure that the albedo of each vertex in the reconstructed three-dimensional face is neither too low nor too high, the albedo of the vertex in the three-dimensional face is correspondingly adjusted in the embodiments. A reasonable albedo interval, which is [0.05, 0.95] in the embodiments, is predetermined, such that the albedos of the plurality of vertexes in the reconstructed three-dimensional face fall entirely within the predetermined albedo interval. Therefore, by analyzing whether the albedo of each vertex in the three-dimensional face is within the predetermined albedo interval, a corresponding albedo penalty function is calculated to continuously optimize the albedo in the reconstructed three-dimensional face in the training.
Illustratively, the albedo penalty function in the embodiments is:
wherein Tclipi,j represents an albedo of the jth pixel point in the ith training sample, and Tclipi,j is an albedo of the jth pixel point in the three-dimensional face reconstructed by the ith training sample.
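The idea of penalizing albedos that leave the predetermined interval can be illustrated with the Python sketch below. It assumes that each albedo is compared with its value clipped to [0.05, 0.95] and that the squared difference is averaged; the squared form, the averaging, and the function name `albedo_penalty` are assumptions, while the interval itself comes from the embodiments.

```python
def albedo_penalty(albedos, low=0.05, high=0.95):
    """Penalty for albedos outside [low, high] (illustrative sketch).

    albedos: per-sample lists of scalar albedo values, one per vertex of
        the reconstructed three-dimensional face.
    Assumption: the penalty is the mean squared distance between each
    albedo and its value clipped to the interval.
    """
    total = 0.0
    for sample in albedos:
        squared = 0.0
        for t in sample:
            clipped = min(max(t, low), high)
            squared += (t - clipped) ** 2  # zero when t is inside the interval
        total += squared / max(len(sample), 1)
    return total / max(len(albedos), 1)
```

Albedos already inside the interval contribute nothing, so the penalty only pushes out-of-range values back toward the interval without disturbing the rest.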
In the embodiments, S302 and S303 are processes of calculating the image pixel loss function, S304 and S305 are processes of calculating the key point loss function, S306 and S307 are processes of calculating the identity feature loss function, and S308 and S309 are processes of calculating the albedo penalty function. The calculation processes corresponding to the image pixel loss function, the key point loss function, the identity feature loss function, and the albedo penalty function in the embodiments are performed simultaneously or sequentially, which is not limited herein.
In S310, weights corresponding to the plurality of loss functions are adjusted.
In S311, fitting loss functions are generated based on the plurality of loss functions and the weights corresponding to the plurality of loss functions, and a trained parameter estimation model is acquired by performing an inverse correction on the neural network model using the fitting loss functions.
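Generating the fitting loss function from the individual loss functions and their weights can be sketched as a weighted sum, as in the Python function below. The dictionary keys and the weight values in the usage example are hypothetical; the disclosure does not give concrete weights.

```python
def fitting_loss(losses, weights):
    """Weighted sum of individual loss values (illustrative sketch).

    losses: mapping from loss name to its current scalar value.
    weights: mapping from loss name to its weight.
    Raises KeyError if a loss has no corresponding weight.
    """
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical values, for illustration only.
example = fitting_loss(
    {"pixel": 2.0, "keypoint": 1.0, "identity": 0.5, "albedo": 0.1},
    {"pixel": 1.0, "keypoint": 0.5, "identity": 2.0, "albedo": 10.0},
)
```

Adjusting a weight changes how strongly the corresponding piece of supervision information steers the correction of the neural network model, which is the purpose of the weight adjustment in S310.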
According to the technical solutions provided in the embodiments, some loss functions in the plurality of loss functions of the plurality of pieces of two-dimensional supervision information are optimized by the skin mask and dynamically selecting the key feature points in the training sample, such that the accuracy of training the parameter estimation model is ensured. The parameter estimation model is trained based on the loss functions of the plurality of pieces of two-dimensional supervision information, such that the reference information in the training is more comprehensive, and thus the accuracy of estimating the reconstruction parameters used in the three-dimensional face reconstruction is improved.
According to the technical solutions provided in the embodiments, a corresponding neural network model is pre-constructed for a plurality of reconstruction parameters specifically used in the three-dimensional face reconstruction. Each training sample in a face image training set is inputted to the neural network model to estimate the reconstruction parameters required in the three-dimensional face reconstruction of the training sample, and the reconstruction parameters are inputted to the pre-constructed three-dimensional morphable model to reconstruct the three-dimensional face corresponding to the training sample. The fitting loss function for training the neural network model is generated by calculating the plurality of loss functions of the plurality of pieces of two-dimensional supervision information between the three-dimensional face and the training sample and adjusting the weights corresponding to the plurality of loss functions. The inverse correction is continuously performed on the neural network model using the fitting loss function to acquire the trained parameter estimation model. Because the fitting loss function relies only on the plurality of pieces of two-dimensional supervision information, no additional three-dimensional face information needs to be referred to prior to reconstructing the three-dimensional face, such that the method for training the parameter estimation models used for estimating the corresponding reconstruction parameters in the three-dimensional face reconstruction is optimized. The parameter estimation model is trained based on the loss functions of the plurality of pieces of two-dimensional supervision information, such that the reference information in the training is more comprehensive, and thus the accuracy of estimating the reconstruction parameters used in the three-dimensional face reconstruction is improved.
Meanwhile, the reconstruction parameters in the three-dimensional face reconstruction are estimated by the trained parameter estimation model, such that the deformation process of the to-be-reconstructed face image in the three-dimensional morphable model is more accurate, the accuracy of the three-dimensional face reconstruction is ensured, and the operation complexity of the three-dimensional face reconstruction is reduced because no additional information devices need to be deployed in the three-dimensional face reconstruction.
The apparatus for training the parameter estimation models according to the embodiments is applicable to the method for training the parameter estimation models according to any one of the above embodiments, and has corresponding functions and effects.
The computer device according to the embodiments is configured to perform the method for training the parameter estimation models according to any one of the above embodiments, and has corresponding functions and effects.
A sixth embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program thereon. The program, when run by a processor, causes the processor to perform the method for training the parameter estimation models according to any one of the above embodiments. The method includes:
The embodiments of the present disclosure provide a storage medium containing computer-executable instructions. The computer-executable instructions are not limited to performing the above method operations, and may also perform related operations in the method for training the parameter estimation models according to any embodiment of the present disclosure.
The computer-readable storage medium is a non-transitory storage medium.
Based on the above description of the embodiments, the present disclosure is achieved by software and necessary general-purpose hardware, or by hardware. The technical solutions of the present disclosure are substantially embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium, such as a floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and includes a plurality of instructions causing a computer device (which may be a personal computer, a server, or a network device) to perform the method according to the embodiments of the present disclosure.
In the above embodiments of the apparatus for training the parameter estimation models, a plurality of included units and modules are only divided according to function logic, but are not limited to the above division, as long as corresponding functions are achieved. In addition, the names of the plurality of function units are merely for convenience of distinguishing from each other, and are not intended to limit the scope of protection of the present disclosure.
| Number | Date | Country | Kind |
|---|---|---|---|
| 202011211255.4 | Nov 2020 | CN | national |

| Filing Document | Filing Date | Country | Kind |
|---|---|---|---|
| PCT/CN2021/125575 | 10/22/2021 | WO | |