This application relates to the field of computer technologies, and specifically to an image generation technology.
With the development of computer technologies, image processing technologies are applied to more and more fields. For example, the image processing technologies may include image generation such as face image generation. The face image generation may be applied to fields such as animation production.
In a current related technology, if face images for a to-be-adjusted object in different face poses need to be obtained, a modeler and an animator need to draw the face images in the respective face poses. As a result, this image generation process is time-consuming and labor-intensive, and the image generation efficiency of this conventional process is low.
According to one or more embodiments, an image generation method, performed by an electronic device, the method comprising: obtaining an original face image frame, audio driving information, and emotion driving information, the original face image frame comprising an original face of a to-be-adjusted object, the audio driving information comprising voice content of the to-be-adjusted object to drive a face pose of the original face to change according to the voice content, and in a case of issuing the voice content, the emotion driving information being configured for describing a target emotion of the to-be-adjusted object to drive the face pose of the original face to change according to the target emotion; performing spatial feature extraction on the original face image frame to obtain an original face spatial feature corresponding to the original face image frame; performing feature interaction processing on the audio driving information and the emotion driving information to obtain a face local pose feature of the to-be-adjusted object issuing the voice content with the target emotion; and performing, based on the original face spatial feature and the face local pose feature, face reconstruction processing on the to-be-adjusted object to generate a target face image frame.
An image generation apparatus, deployed on an electronic device, and comprising: at least one memory configured to store program code; and at least one processor configured to read the program code and operate as instructed by the program code, the program code including: obtaining code configured to cause the at least one processor to obtain an original face image frame, audio driving information, and emotion driving information of a to-be-adjusted object, the original face image frame comprising an original face of the to-be-adjusted object, the audio driving information comprising voice content of the to-be-adjusted object, to drive a face pose of the original face to change according to the voice content, and in a case of issuing the voice content, the emotion driving information being configured for describing a target emotion of the to-be-adjusted object to drive the face pose of the original face to change according to the target emotion; extraction code configured to cause the at least one processor to perform spatial feature extraction on the original face image frame, to obtain an original face spatial feature corresponding to the original face image frame; interaction code configured to cause the at least one processor to perform feature interaction processing on the audio driving information and the emotion driving information, to obtain a face local pose feature of the to-be-adjusted object issuing the voice content with the target emotion; and reconstruction code configured to cause the at least one processor to perform, based on the original face spatial feature and the face local pose feature, face reconstruction processing on the to-be-adjusted object, to generate a target face image frame.
A non-transitory computer-readable storage medium having instructions stored therein, which when executed by a processor in an electronic device cause the processor to execute an image generation method comprising: obtaining an original face image frame, audio driving information, and emotion driving information, the original face image frame comprising an original face of a to-be-adjusted object, the audio driving information comprising voice content of the to-be-adjusted object to drive a face pose of the original face to change according to the voice content, and in a case of issuing the voice content, the emotion driving information being configured for describing a target emotion of the to-be-adjusted object to drive the face pose of the original face to change according to the target emotion; performing spatial feature extraction on the original face image frame to obtain an original face spatial feature corresponding to the original face image frame; performing feature interaction processing on the audio driving information and the emotion driving information to obtain a face local pose feature of the to-be-adjusted object issuing the voice content with the target emotion; and performing, based on the original face spatial feature and the face local pose feature, face reconstruction processing on the to-be-adjusted object to generate a target face image frame.
To describe the technical solutions of some embodiments of this disclosure more clearly, the following briefly introduces the accompanying drawings for describing some embodiments. The accompanying drawings in the following description show only some embodiments of the disclosure, and a person of ordinary skill in the art may still derive other drawings from these accompanying drawings without creative efforts. In addition, one of ordinary skill would understand that aspects of some embodiments may be combined together or implemented alone.
FIG. 1F is another model structure diagram of an image generation method according to one or more embodiments of the present disclosure.
To make the objectives, technical solutions, and advantages of the present disclosure clearer, the following further describes the present disclosure in detail with reference to the accompanying drawings. The described embodiments are not to be construed as a limitation to the present disclosure. All other embodiments obtained by a person of ordinary skill in the art without creative efforts shall fall within the protection scope of the present disclosure and the appended claims.
In the following descriptions, related “some embodiments” describe a subset of all possible embodiments. However, it may be understood that the “some embodiments” may be the same subset or different subsets of all the possible embodiments, and may be combined with each other without conflict. As used herein, each of such phrases as “A or B,” “at least one of A and B,” “at least one of A or B,” “A, B, or C,” “at least one of A, B, and C,” and “at least one of A, B, or C,” may include all possible combinations of the items enumerated together in a corresponding one of the phrases. For example, the phrase “at least one of A, B, and C” includes within its scope “only A”, “only B”, “only C”, “A and B”, “B and C”, “A and C” and “all of A, B, and C.”
A person skilled in the art would understand that these “units” could be implemented by hardware logic, a processor or processors executing computer software code, or a combination of both. The “units” may also be implemented in software stored in a memory of a computer or a non-transitory computer-readable medium, where the instructions of each unit are executable by a processor to thereby cause the processor to perform the respective operations of the corresponding unit.
The embodiments of the present disclosure provide an image generation method and related devices, and the related devices may include an image generation apparatus, an electronic device, a computer-readable storage medium, and a computer program product. The image generation apparatus may be specifically integrated in an electronic device, and the electronic device may be a device such as a terminal or a server.
It may be understood that, the image generation method in one or more examples may be performed on a terminal, or may be performed on a server, or may be performed by a terminal and a server jointly. The above examples are not to be construed as limiting the present disclosure.
The server 11 may be configured to obtain an original face image frame, audio driving information, and emotion driving information of a to-be-adjusted object. The original face image frame may include an original face of the to-be-adjusted object, and the audio driving information may include voice content of the to-be-adjusted object to drive a face pose of the original face to change according to the voice content. The emotion driving information may be configured for describing a target emotion of the to-be-adjusted object in a case of issuing the voice content, to drive the face pose of the original face to change according to the target emotion. Spatial feature extraction may be performed on the original face image frame to obtain an original face spatial feature corresponding to the original face image frame. Feature interaction processing may be performed on the audio driving information and the emotion driving information to obtain a face local pose feature of the to-be-adjusted object. Face reconstruction processing may be performed on the to-be-adjusted object based on the original face spatial feature and the face local pose feature to generate a target face image frame, and the target face image frame may be sent to the terminal 10. The server 11 may be one server or a server cluster including a plurality of servers. In the image generation method or apparatus disclosed in the present disclosure, a plurality of servers may form a blockchain, and the servers are nodes on the blockchain.
The terminal 10 may be configured to receive the target face image frame sent by the server 11. The terminal 10 may include a mobile phone, an intelligent television, a tablet computer, a notebook computer, a personal computer (PC), an intelligent voice interaction device, an intelligent household appliance, an in-vehicle terminal, an aircraft, or any other suitable device known to one of ordinary skill in the art. A client may be further disposed on the terminal 10, and the client may be an application program client, a browser client, or any other suitable client known to one of ordinary skill in the art.
In one or more examples, the foregoing step of generating the target face image frame by the server 11 may be performed by the terminal 10.
The image generation method provided in the embodiments of this disclosure relates to computer vision technologies, voice technologies, and natural language processing in the field of artificial intelligence.
Detailed descriptions are made below separately. The description orders of the following embodiments are not intended to limit preference orders of the embodiments.
This embodiment is described from the perspective of an image generation apparatus. The image generation apparatus may be specifically integrated in an electronic device, and the electronic device may be a device such as a server or a terminal.
It may be understood that, related data such as user information is involved in specific implementations of this disclosure. When the above embodiments of this disclosure are applied to a specific product or technology, permission or consent of a user is required, and collection, use, and processing of the related data need to comply with relevant laws, regulations, and standards of relevant countries and regions.
Embodiments of the present disclosure may be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation, and assistant driving.
As shown in
101. Obtain an original face image frame, audio driving information, and emotion driving information.
In one or more examples, the original face image frame may include an original face of a to-be-adjusted object. For example, the original face image frame may be an image containing an original face of a to-be-adjusted object, and the to-be-adjusted object may be an object whose face pose is to be adjusted. The face pose mentioned herein may specifically refer to a face expression, for example, object face information such as a mouth shape or a gaze. The embodiments of the present disclosure are not limited to this example. An emotion presented by the face of the to-be-adjusted object in a target face image frame may correspond to the emotion driving information, and the mouth shape of the to-be-adjusted object in the target face image frame may conform to the audio driving information.
The audio driving information is audio information configured for driving the face pose of the original face to change. The audio driving information includes voice content of the to-be-adjusted object, to drive the face pose of the original face to change according to the voice content, and the change herein is mainly a mouth shape change. To improve image generation efficiency, in the embodiments of the present disclosure, the audio driving information may be configured for replacing the face pose of the to-be-adjusted object in the original face image frame with a corresponding face pose of the to-be-adjusted object when speaking, to obtain the target face image frame, where the audio information (e.g., voice content) corresponding to the speaking of the to-be-adjusted object is the audio driving information, and the audio driving information is mainly driving information based on the voice content issued by the to-be-adjusted object. An audio length corresponding to the audio driving information may be, for example, 1 second or 2 seconds. The embodiments of the present disclosure are not limited to these examples. In one or more examples, a face pose may refer to a facial expression corresponding to an emotion (e.g., angry, happy, sad).
The emotion driving information is information configured for driving the face pose of the original face to change, and the emotion driving information is configured for describing a target emotion of the to-be-adjusted object in a case of issuing the voice content, to drive the face pose of the original face to change according to the target emotion, thereby adjusting the face pose to a face pose with the target emotion represented by the emotion driving information. The emotion driving information is mainly driving information based on an emotion of the to-be-adjusted object. The emotion driving information may have a plurality of information carriers, for example, text, audio, an image, and a video. The embodiments of the present disclosure are not limited to this example.
In one or more examples, the emotion driving information may be text containing emotion description information of “angry”, and a face pose of the to-be-adjusted object in the target face image frame finally generated based on the emotion driving information may be with an angry emotion.
In one or more examples, the emotion driving information may be a piece of audio containing emotion information "panic", an emotion recognition result "panic" may be obtained based on emotion recognition of the audio, and a face pose of the to-be-adjusted object in the target face image frame generated based on the emotion recognition result may be with a "panic" emotion.
In one or more examples, the emotion driving information may be an image containing emotion information “excited”, an emotion recognition result “excited” may be obtained based on emotion recognition of the image, and a face pose of the to-be-adjusted object in the target face image frame generated based on the emotion recognition result may be with an “excited” emotion.
In one or more embodiments, change information such as the mouth shape of the to-be-adjusted object when speaking may be determined using the voice content of the to-be-adjusted object contained in the audio driving information, pose information such as the gaze of the to-be-adjusted object may be determined using the target emotion contained in the emotion driving information, and then a change in the face pose of the to-be-adjusted object may be comprehensively determined.
In one or more examples, if voice content included in the audio driving information is "hello", a mouth shape of the to-be-adjusted object may be simply determined based on the audio driving information, thereby ensuring that the mouth shape is a mouth shape of "hello". In a case that the target emotion represented by the emotion driving information is "angry", the mouth shape is still a mouth shape of pronouncing "hello", but because the target emotion is "angry", a face pose of a part such as the gaze is affected by the emotion change, so that a face pose change caused by the target emotion may be superimposed on the mouth shape of "hello".
In some embodiments, an emotion of the to-be-adjusted object may be determined according to speaking content and volume of the to-be-adjusted object contained in the audio driving information, and a change in the face pose of the to-be-adjusted object is determined with reference to the determination result and the emotion driving information.
In one or more examples, a plurality of pieces of audio driving information and emotion driving information of the to-be-adjusted object may be obtained; and for each piece of audio driving information, a target face image frame with a target emotion corresponding to each piece of audio driving information is generated according to the target emotion corresponding to each piece of audio driving information and the emotion driving information and the original face image frame, and then target face image frames with the target emotion corresponding to the pieces of audio driving information are spliced, to generate a target face video segment corresponding to the to-be-adjusted object with the target emotion. The target face video segment contains a face pose change process of the to-be-adjusted object when speaking with the target emotion, and audio information corresponding to the speaking of the to-be-adjusted object is the pieces of audio driving information. The leading role in the target face video segment is still the face of the object in the original face image frame, and the expression (particularly, the mouth shape) of the to-be-adjusted object in the generated target face video segment corresponds to the emotion driving information and the pieces of audio driving information.
In one or more embodiments, this disclosure may be applied to a video repair scenario. For example, if a lecture video about the to-be-adjusted object is damaged, and some video frames in the lecture video are lost, the lost video frames may be generated or repaired using other video frames in the lecture video, audio information corresponding to the lost video frames, and an emotion label based on the image generation method provided in this disclosure. The audio information configured for repair may be audio segments in 1 second before and after the lost video frames in the lecture video, the audio segments are the audio driving information in the foregoing embodiment, and the emotion label may be an emotion label corresponding to the lecture video or may be an emotion label of a lecturer in video frames before and after the lost video frames. The embodiments of the present disclosure are not limited to this example.
As shown in
102. Perform spatial feature extraction on the original face image frame, to obtain an original face spatial feature corresponding to the original face image frame.
The original face spatial feature may specifically include three-dimensional (3D) face coefficients corresponding to the original face image frame, for example, may include identity information, lighting, texture, expression, pose, and gaze. The face in the original face image frame may be reconstructed according to the face coefficients.
The spatial feature extraction on the original face image frame may specifically be performing convolution processing, pooling processing, or any other image processing known to one of ordinary skill in the art on the original face image frame. The embodiments of the present disclosure are not limited to this example.
In one or more embodiments, spatial feature extraction may be performed on the original face image frame through an image feature extraction network. The image feature extraction network may be specifically a neural network model, and the neural network may be a visual geometry group network (VGGNet), a residual network (ResNet), a densely connected convolutional network (DenseNet), or any other suitable neural network known to one of ordinary skill in the art. However, it is to be understood that, the neural network is not merely limited to the several types listed above.
The image feature extraction network may be pre-trained, and 3D face coefficients corresponding to a face image may be predicted through the image feature extraction network.
In one or more examples, the original face spatial feature corresponding to the original face image frame may be extracted using ResNet50 or another network structure, and the feature extraction process may be represented using the following formula (1):
In one or more examples, coeff is a 3D face coefficient (e.g., the original face spatial feature corresponding to the original face image frame), D3DFR may represent ResNet50 or another network structure, and Iface may represent the original face image frame.
After the original face spatial feature is extracted, the original face spatial feature may be further screened to obtain a feature associated with a face pose of the to-be-adjusted object. For example, 3D face coefficients such as identity information, lighting, texture, expression, pose, and gaze may be extracted from the original face spatial feature, and the 3D face coefficients are used as a final original face spatial feature.
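For illustration only, the following is a minimal sketch of such a coefficient extractor, assuming a recent torchvision ResNet-50 backbone whose classification head is replaced with a linear regression layer in the spirit of Formula (1), coeff = D3DFR(Iface); the coefficient dimensionality and the split into identity, lighting, texture, expression, pose, and gaze slices are assumptions, not the exact network of this disclosure.

```python
# Hedged sketch: spatial feature extraction, coeff = D3DFR(I_face).
# The coefficient layout is illustrative only.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FaceCoeffExtractor(nn.Module):
    def __init__(self, coeff_dim: int = 257):
        super().__init__()
        backbone = resnet50(weights=None)  # "ResNet50 or another network structure"
        # Replace the classification head with a regression head for 3D face coefficients.
        backbone.fc = nn.Linear(backbone.fc.in_features, coeff_dim)
        self.backbone = backbone

    def forward(self, face_image: torch.Tensor) -> torch.Tensor:
        # face_image: (B, 3, H, W) original face image frame I_face
        return self.backbone(face_image)  # coeff: (B, coeff_dim) original face spatial feature

extractor = FaceCoeffExtractor()
coeff = extractor(torch.randn(1, 3, 224, 224))  # slices would then be screened into id/lighting/texture/expression/pose/gaze
```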
103. Perform feature interaction processing on the audio driving information and the emotion driving information, to obtain a face local pose feature of the to-be-adjusted object issuing the voice content with the target emotion.
In one or more examples, the feature interaction means that different features are combined together according to a specific rule to form a new feature. The feature interaction may help capture a relationship between features, and learn and use information in data from a plurality of perspectives, thereby enhancing prediction performance of the model. The rule may be, for example, addition, multiplication, squaring, or root finding. In one or more examples, a face local pose feature may correspond to one or more features of a face (e.g., mouth, eyes, cheeks, nose, etc.) that exhibit an expression. In one or more examples, a face local pose feature may correspond to a face that does not exhibit an expression. For example, a face local pose feature may be one of a plurality of face local pose features, where one of the face local pose features corresponds to a scenario where a face does not exhibit an expression, and the remaining face local pose features correspond to scenarios where the face exhibits one or more expressions (e.g., happy, sad, concerned, angry, etc.).
In one or more examples, the audio driving information and the emotion driving information may be combined according to a specific rule, thereby obtaining the face local pose feature. For example, the audio driving information and the emotion driving information may be added to obtain the face local pose feature. For another example, the audio driving information and the emotion driving information may be multiplied to obtain the face local pose feature.
In one or more embodiments, a manner of performing feature interaction processing on the audio driving information and the emotion driving information to obtain a face local pose feature of the to-be-adjusted object issuing the voice content with the target emotion may be:
The audio driving information may be one audio driving sequence, which may include at least one audio frame.
The performing feature interaction processing on the audio driving information and the emotion driving information may mean that the audio driving information and the emotion driving information are combined together according to a specific rule, to form new information (e.g., interaction feature information). The interaction feature information may be information obtained by combining a plurality of features, and may reflect a relationship between different features to a specific extent. In some embodiments, semantic feature extraction may be first performed on the emotion driving information, to obtain an emotion semantic feature corresponding to the emotion driving information. Vectorization processing may be performed on each audio frame in the audio driving information, to obtain an initial feature vector corresponding to each audio frame, and then dimension mapping may be performed on the initial feature vector of each audio frame to obtain an audio feature vector corresponding to each audio frame. Through dimension mapping, a vector dimension quantity of the initial feature vector may be transformed. In this way, a vector dimension quantity of the audio feature vector of each audio frame may be consistent. Then, feature interaction is performed between the emotion semantic feature and the audio feature vector of each audio frame, to obtain interaction feature information.
In one or more examples, a process of the feature interaction processing may be: sorting a token corresponding to each audio frame in the audio driving information in order of the audio frames, to obtain a sorted token sequence; adding a token corresponding to the emotion driving information to the sorted token sequence, to obtain an updated token sequence; for each token in the updated token sequence, extracting feature information of the token; processing the feature information of the token according to feature information of front and back tokens of the token; fusing the processed feature information of each token, to obtain interaction feature information; and predicting the face local pose feature based on the interaction feature information and the emotion driving information. The prediction manner may be implemented through a neural network model, and the neural network model may be obtained by performing training in advance.
The emotion semantic feature corresponding to the emotion driving information may be considered as one token, and the audio feature vector of each audio frame in the audio driving information is considered as one token, so that a specific process of the foregoing feature interaction processing may be as follows: Tokens corresponding to the audio frames may be sorted in order of the audio frames, to obtain a sorted token sequence, and then a token corresponding to the emotion driving information may be added to the sorted token sequence, for example, may be added to a position of a queue head or queue tail, thereby obtaining a target token sequence. Feature information of each token in the target token sequence may be extracted, and then the feature information of the token is processed according to feature information of tokens before and after each token to obtain interaction feature information based on the processed feature information of each token. For example, the processed feature information of each token may be fused to obtain interaction feature information, and the fusion manner may be a splicing or weighting operation or any other suitable operation known to one of ordinary skill in the art.
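As an illustrative sketch of the token-based interaction described above (the dimensions, layer counts, and the placement of the emotion token at the queue tail are assumptions), the audio-frame tokens and the emotion token may be processed by a standard transformer encoder so that each token is updated according to the tokens before and after it:

```python
# Hedged sketch of token-based feature interaction: ordered audio-frame tokens plus
# one emotion token are fed to a transformer encoder; every output token reflects
# its context, giving the interaction feature information. All sizes are assumptions.
import torch
import torch.nn as nn

d_model = 256
audio_proj = nn.Linear(80, d_model)    # dimension mapping of per-frame audio vectors (80 is illustrative)
emo_proj = nn.Linear(512, d_model)     # dimension mapping of the emotion semantic feature
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=2,
)

audio_frames = torch.randn(1, 32, 80)  # 32 audio frames, already sorted in temporal order
emotion_feat = torch.randn(1, 1, 512)  # emotion semantic feature (e.g., from a text/image encoder)

tokens = torch.cat([audio_proj(audio_frames), emo_proj(emotion_feat)], dim=1)  # emotion token at the queue tail
interaction = encoder(tokens)          # interaction feature information, one updated vector per token
```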
In one or more embodiments, the audio driving information may include a plurality of audio frames, and a manner of performing feature interaction processing on the audio driving information and the emotion driving information, to obtain interaction feature information may be:
In some embodiments, the original face spatial feature may include 3D face coefficients such as object identity information, lighting, texture, expression, pose, and gaze that correspond to the original face image frame and from which object identity information corresponding to the to-be-adjusted object may be extracted.
The positional encoding (PE) information corresponding to the audio driving information may include a frame number of each audio frame in the audio driving information. In the feature interaction processing process, by adding the PE information, a time sequence feature of the audio driving information may be fused into the interaction feature information.
In one or more examples, feature interaction processing may be performed by combining the object identity information, the PE information, the audio driving information, and the emotion driving information, thereby improving accuracy of the feature interaction information, to make it convenient to reconstruct a more real face image frame subsequently.
In one or more examples, a manner of predicting the face local pose feature based on the interaction feature information and the emotion driving information may be:
There are a plurality of processes of fusing the interaction feature information and the emotion driving information. The embodiments of the present disclosure are not limited to this example. For example, the fusion manner may be splicing processing or a weighting operation.
The performing decoding processing on the fused feature information may be specifically performing attention processing on the fused feature information, thereby obtaining the face local pose feature corresponding to the target face image frame.
Given that an emotion usually relates to an intense extent (e.g., intensity), face reconstruction processing may be performed with reference to emotion intensity information. Based on this, in one or more examples, a manner of performing feature interaction processing on the audio driving information and the emotion driving information, to obtain a face local pose feature may be:
In one or more examples, the emotion intensity information may reflect an intense extent of an emotion. For example, if an emotion is gladness, intense extents of gladness may be different, and then change extents of the face pose are also different, so that face pose changes caused based on the emotion driving information are different. Therefore, the face local pose feature may be determined based on the emotion intensity information. The emotion intensity information used herein may be emotion intensity information that is set in advance (e.g., preset emotion intensity information). The preset emotion intensity information may be set according to an actual situation, or may be set randomly. The embodiments of the present disclosure are not limited to this example. The preset emotion intensity information is learnable.
For example, a mapping relationship among different emotions, emotion intensity information, and change extents of the face pose may be established in advance. In this way, after the emotion driving information and the preset emotion intensity information are obtained, a change extent of the face pose may be determined based on the mapping relationship, and then a face local pose feature is obtained with reference to the voice content included in the audio driving information, thereby adjusting the face pose of the to-be-adjusted object.
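A toy illustration of such a pre-established mapping follows; the emotion labels, intensity levels, and change-extent values are entirely hypothetical and only show how a lookup of this kind could be organized.

```python
# Hedged illustration of a pre-established mapping from (emotion, intensity level)
# to a face-pose change extent; all entries and values are made up.
POSE_CHANGE_EXTENT = {
    ("happy", "low"): 0.3,
    ("happy", "high"): 0.8,
    ("angry", "low"): 0.4,
    ("angry", "high"): 1.0,
}

def change_extent(emotion: str, intensity: str) -> float:
    # Look up how strongly the target emotion should deform the face pose.
    return POSE_CHANGE_EXTENT.get((emotion, intensity), 0.5)

print(change_extent("angry", "high"))  # 1.0
```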
The updated emotion intensity information may be learned emotion intensity information, and may be configured for representing an intense extent of an emotion.
In one or more embodiments, the audio driving information and the emotion driving information may be transformed into the face local pose feature using a transformer network usually used in sequential modeling. The transformer may be a model formed completely by an attention mechanism and usually contains an encoder and a decoder. The input of the transformer network may be one sequence, and each part of the sequence may be referred to as one token.
When generation of a face image is driven using audio and emotion description, the audio driving information may first undergo feature processing to obtain an audio driving sequence (denoted as A1-32), where the audio driving sequence may include audio feature information corresponding to each audio frame in the audio driving information. Because the face feature is closely related to the object identity information, and human face identity information is not completely decoupled in an expression coefficient generated only based on the audio and the emotion description, one token (α) may be expanded, where α represents the object identity information, and the object identity information is input to the transformer encoder (ϕ). In addition, to match a feature dimension in the encoder, dimension mapping may be first performed on the audio driving sequence A1-32 and the object identity information α, and then the audio driving sequence A1-32 and the object identity information α that undergo dimension mapping are input to the encoder.
In one or more examples, a contrastive language-image pre-training (CLIP) model may be used to extract an emotion semantic feature (denoted as zemo) contained in the emotion driving information. The CLIP model may be a multi-modal pre-training model, may be a multi-modal model based on contrastive learning, can learn a matching relationship between text and an image, and may contain one text encoder and one image encoder, which are respectively used to extract a text representation and an image representation. The CLIP model facilitates migration between a plurality of tasks, and therefore, may be applied to other visual tasks.
In addition, given that an emotion usually involves an intense extent, one learnable emotion intensity token (σ) may be further expanded to encode intensity information, where σ represents the preset emotion intensity information corresponding to the emotion driving information. In one or more embodiments, zemo and PE information of the audio driving information may be added to the audio driving information, the object identity information, and the learnable emotion intensity representation (e.g., the preset emotion intensity information), and then they are input together to the encoder to perform feature interaction learning of audio and an emotion. The feature interaction process is shown in Formula (2):
z is an intermediate feature of audio-emotion fusion (e.g., the interaction feature information in the foregoing embodiments); {circumflex over (σ)} is an updated learnable emotion intensity representation (e.g., the updated emotion intensity information in the foregoing embodiments). Linear represents a linear layer, and may be used for dimension mapping; ϕ represents an encoder.
Then, the interaction feature information z and an emotion representation zemo corresponding to the emotion driving information may be fused, for example, added, and then a fusing result is input to the transformer decoder (Ψ) to predict an emotion coefficient, thereby obtaining a face local pose feature. In one or more examples, a fully connected layer may be designed to perform dimension mapping on a feature outputted by the decoder, to obtain an emotion coefficient in a preset dimension. A process thereof is specifically shown in Formula (3):
qT is PE information of audio driving information whose time dimension is T, Linear represents a fully connected layer (e.g., a linear layer), and Ψ represents a decoder. {circumflex over (β)} may be used to represent a face local pose feature.
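The bodies of Formulas (2) and (3) are not reproduced in this text; based on the symbol descriptions above, a plausible reading is the following (a hedged reconstruction, not the exact notation of the original figures):

$$z,\ \hat{\sigma} = \phi\bigl(\mathrm{Linear}([\alpha;\ A_{1\text{-}32};\ \sigma]) + z_{emo} + \mathrm{PE}\bigr)$$

$$\hat{\beta} = \mathrm{Linear}\bigl(\Psi(z + z_{emo},\ q_T)\bigr)$$

Here the bracketed term denotes the dimension-mapped token sequence formed from the object identity information, the audio driving sequence, and the preset emotion intensity token, consistent with the description of the encoder input above.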
Based on the foregoing process, the audio driving information and the emotion driving information that are given may be finally reflected in an expression coefficient of the 3D face coefficients.
104. Perform, based on the original face spatial feature and the face local pose feature, face reconstruction processing on the to-be-adjusted object, to generate a target face image frame.
The face local pose feature may reflect the change in the face pose induced by the current audio and emotion, so that face reconstruction processing may be performed on the to-be-adjusted object based on the original face spatial feature and the face local pose feature, thereby generating a target face image frame based on the original face image frame. The face reconstruction processing on the to-be-adjusted object provided in the embodiments of the present disclosure is an adjustment based on the original face included in the original face image frame, so that drawing does not need to be performed again, thereby improving image generation efficiency. To be specific, the target face image frame may be a corresponding face image obtained after the face pose in the original face image frame is adjusted based on the audio driving information and the emotion driving information.
In one or more examples, if updated emotion intensity information is further generated while performing feature interaction processing on the audio driving information and the emotion driving information, to obtain the face local pose feature corresponding to the target face image frame, a manner of performing, based on the original face spatial feature and the face local pose feature, face reconstruction processing on the to-be-adjusted object, to generate a target face image frame may be:
The original face spatial feature, the face local pose feature, and the updated emotion intensity information may be fused, to obtain a fused face spatial feature. There are a plurality of fusion manners. The embodiments of the present disclosure are not limited to this example. For example, the fusion manner may be splicing processing or weighted fusion. After the fusion processing, face reconstruction processing is performed on the to-be-adjusted object based on the fused face spatial feature, thereby generating the target face image frame.
In the foregoing manner, it is considered that an emotion usually involves an intense extent, which may be reflected through the updated emotion intensity information, so that the impact of the emotion intensity on the face pose is considered when the target face image frame is generated, thereby obtaining a more real and accurate target face image frame.
In one or more examples, a manner of performing, based on the original face spatial feature and the face local pose feature, face reconstruction processing on the to-be-adjusted object, to generate a target face image frame may be:
In one or more examples, a face local pose feature contains partial face pose information in a to-be-generated target face image frame. For example, the face local pose feature may include related 3D face coefficients such as an expression, a pose, a gaze, and a mouth shape. The original face spatial feature contains face pose information of the original face image frame.
There are a plurality of processes of fusing the original face spatial feature and the face local pose feature. For example, the fusion manner may be splicing processing or weighted fusion. The embodiments of the present disclosure are not limited to this example.
The fused face spatial feature obtained through the foregoing manner not only contains a main face feature of the target object, but also contains detail information of a specific expression of the human face, to help better simulate a corresponding face pose of the target object.
In one or more examples, a manner of performing, based on the fused face spatial feature, face reconstruction processing on the to-be-adjusted object, to obtain a reference face image frame corresponding to the to-be-adjusted object may be:
Face reconstruction processing may be performed on the to-be-adjusted object based on the fused face spatial feature through a 3DMM (3D Morphable Model) model or any other suitable model known to one of ordinary skill in the art, to obtain a reconstructed 3D face image. In one or more examples, the face reconstruction processing may be 3D reconstruction. In 3D reconstruction, an input two-dimensional face image may be represented using a 3D mesh, and the 3D mesh may contain coordinates and colors of vertexes of a 3D net structure. In one or more embodiments, the 3D mesh (e.g., the reconstructed 3D face image) may be projected onto a two-dimensional plane in a rendering manner, as shown in
Texture and lighting of the reconstructed 3D face image may come from the original face image frame, while a pose and an expression of the reconstructed 3D face image may come from the audio driving information and the emotion driving information. The 3D image may be projected onto a two-dimensional plane by performing rendering and mapping processing on the reconstructed 3D face image, to obtain a reference face image frame corresponding to the to-be-adjusted object.
In one or more examples, the fused face spatial feature may contain a geometry feature and a texture feature of the to-be-adjusted object, and the reconstructed 3D face image may be constructed according to the geometry feature and the texture feature. The geometry feature may be understood as coordinate information of a key point of the 3D net structure of the to-be-adjusted object, and the texture feature may be understood as a feature indicating texture information of the to-be-adjusted object. There are a plurality of processes of performing face reconstruction processing based on the fused face spatial feature, to obtain a reconstructed 3D face image corresponding to the to-be-adjusted object. For example, position information of at least one face key point may be extracted from the fused face spatial feature, the position information of the face key point is transformed into a geometry feature, and a texture feature of the to-be-adjusted object is extracted from the fused face spatial feature.
After the geometry feature and the texture feature are obtained through transform, a 3D object model of the to-be-adjusted object, that is, the reconstructed 3D face image in the foregoing embodiments, may be constructed, and the 3D object model is projected onto a two-dimensional plane, to obtain a reference face image frame. There are a plurality of processes of obtaining the reference face image frame. For example, 3D model parameters of the to-be-adjusted object may be determined according to the geometry feature and the texture feature, a 3D object model of the to-be-adjusted object is constructed based on the 3D model parameters, and the 3D object model is projected onto a two-dimensional plane, to obtain the reference face image frame.
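The following is a simplified sketch of this reconstruct-then-project step, assuming a linear blend of identity and expression bases and an orthographic projection; it is not the specific 3DMM or renderer of this disclosure, and all dimensions are illustrative.

```python
# Hedged sketch: build a 3D vertex set from blended bases, then project it
# onto the image plane to obtain the reference face image frame geometry.
import torch

def reconstruct_vertices(mean_shape, id_basis, exp_basis, id_coeff, exp_coeff):
    # mean_shape: (V, 3); id_basis: (V, 3, Di); exp_basis: (V, 3, De)
    # Linear 3DMM-style blend: mean face + identity offsets + expression offsets.
    verts = mean_shape \
        + torch.einsum("vxd,d->vx", id_basis, id_coeff) \
        + torch.einsum("vxd,d->vx", exp_basis, exp_coeff)
    return verts                       # (V, 3) reconstructed 3D face vertices

def project_orthographic(verts, scale=1.0, tx=0.0, ty=0.0):
    # Drop the depth axis and apply a similarity transform to land on the 2D plane.
    xy = verts[:, :2] * scale
    xy[:, 0] += tx
    xy[:, 1] += ty
    return xy                          # (V, 2) vertex positions on the two-dimensional plane

V, Di, De = 500, 80, 64
verts = reconstruct_vertices(torch.zeros(V, 3), torch.randn(V, 3, Di),
                             torch.randn(V, 3, De), torch.randn(Di), torch.randn(De))
uv = project_orthographic(verts, scale=0.9)
```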
In one or more examples, a manner of generating the target face image frame based on the original face image frame, the fused face spatial feature, and the reference face image frame may be:
A spatial feature of the original face image frame or the reference face image frame with each preset resolution may be obtained through multi-scale feature extraction, different resolutions correspond to different image scales of original face feature images, and similarly, different resolutions correspond to different image scales of reference face feature images. In one or more embodiments, an original spatial feature of the original face image frame and a reference spatial feature of the reference face image frame may be obtained through multi-scale extraction, the original spatial feature contains original face feature images on a plurality of scales, and the reference spatial feature contains reference face feature images on a plurality of scales, so that each of the original spatial feature of the original face image frame and the reference spatial feature of the reference face image frame is a multi-layer spatial feature.
There are a plurality of processes for extracting original face feature images on a plurality of scales of the original face image frame, and reference face feature images on a plurality of scales of the reference face image frame. The process may be as follows:
For example, spatial encoding may be performed on the original face image frame and the reference face image frame with each preset resolution using an encoding network (denoted as Enc Block) of a trained image generation model, and therefore an original face feature image and a reference face feature image with each resolution may be obtained.
The encoding network (Enc Block) may include a plurality of encoding sub-networks, each encoding sub-network may correspond to one preset resolution, the encoding sub-networks may be sorted sequentially in ascending order according to magnitudes of resolutions, thereby obtaining the encoding network. When the original face image frame and the reference face image frame are input to the encoding network for network encoding, each encoding sub-network may output a spatial feature corresponding to one preset resolution. Encoding networks for the original face image frame and the reference face image frame may be the same or different, but different encoding networks share network parameters. The encoding sub-network may have a plurality of structures. For example, the encoding sub-network may be formed by a simple single-layer convolutional network, or may have another encoding network structure. The preset resolution may be set according to actual application. For example, the resolution may range from 4*4 to 512*512.
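A minimal sketch of such an encoding network is given below, assuming each encoding sub-network is a single stride-2 convolution; the channel counts and the number of scales are illustrative, and the original-face and reference-face branches may share these parameters as described above.

```python
# Hedged sketch of a multi-scale encoding network (Enc Block pyramid): each
# sub-network halves the resolution and emits the spatial feature for that scale.
import torch
import torch.nn as nn

class EncBlockPyramid(nn.Module):
    def __init__(self, channels=(3, 32, 64, 128, 256)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride=2, padding=1), nn.LeakyReLU(0.2))
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)   # one feature map per preset resolution, from fine to coarse
        return feats

encoder = EncBlockPyramid()
original_feats = encoder(torch.randn(1, 3, 256, 256))   # original face feature images on a plurality of scales
reference_feats = encoder(torch.randn(1, 3, 256, 256))  # reference branch sharing the same parameters
```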
The latent feature information may be an intermediate feature w obtained by encoding and mapping the fused face spatial feature, and different elements of the intermediate feature w control different visual features, thereby reducing a relationship between features (decoupling and feature separation). The encoding and mapping process may be extracting deep relationships hidden under a surface feature from the fused face spatial feature, and a latent feature (latent code) may be obtained by decoupling the relationships. There may be a plurality of processes of mapping the fused face spatial feature to the latent feature information using the trained image generation model. For example, the fused face spatial feature may be mapped to the latent feature information (w) using a mapping network (mlp) of the trained image generation model.
A spatial feature of the original face image frame or the reference face image frame with each preset resolution may be obtained through multi-scale feature extraction, and different resolutions correspond to different image scales of original face feature images. A low-level feature has a high resolution, and contains more detail information; a high-level feature has strong semantic information. A face image frame may be constructed with reference to original face feature images and reference face feature images on scales. In this way, a representation capability of a feature may be improved, thereby improving generation efficiency and accuracy of the target face image frame. In addition, the dimension of the fused face spatial feature may be reduced through encoding and mapping processing, thereby reducing a calculation amount in a subsequent generation process of the target face image frame, and improving generation efficiency of the target face image frame.
In one or more examples, a manner of fusing the original face feature images on the plurality of scales, the reference face feature images on the plurality of scales, and the latent feature information, to obtain the target face image frame may be:
In one or more examples, a manner of fusing the corresponding fused face feature image on the target scale, an original face feature image on an adjacent scale, and a reference face feature image on the adjacent scale, to obtain the target face image frame may be:
The adjacent scale may be a scale larger than the target scale among the plurality of scales. In a case that the plurality of scales contain 4*4, 8*8, 16*16, 32*32, and 64*64, if the target scale is 16*16, the adjacent scale may be 32*32; if the target scale is 4*4, the adjacent scale may be 8*8.
In one or more embodiments, a preset basic style feature may be adjusted based on the latent feature information to obtain a modulated style feature. The preset basic style feature may be understood as a style feature in a constant tensor (Const) that is set in advance in a process of performing image driving. The so-called style feature may be understood as feature information configured for generating an image of a specific style.
There are a plurality of processes of style modulation processing. For example, a size of a basic style feature is adjusted, to obtain an initial style feature, modulation processing is performed on the latent feature information, to obtain a convolutional weight corresponding to the initial style feature, and the initial style feature is adjusted based on the convolutional weight, to obtain a modulated style feature.
The convolutional weight may be understood as weight information in a case of performing convolution processing on the initial style feature. There may be a plurality of processes of performing modulation processing on the latent feature information. For example, a basic convolutional weight may be obtained, and the convolutional weight is adjusted based on the latent feature information, thereby obtaining a convolutional weight corresponding to the initial style feature. The adjusting the convolutional weight based on the latent feature information may be implemented mainly using a Mod module and a Demod module in a decoding network of stylegan v2 (a style migration model).
After modulation processing is performed on the latent feature information, the initial style feature may be adjusted based on the convolutional weight obtained after the modulation processing. There may be a plurality of processes of adjustment. For example, a target style convolutional network corresponding to a resolution of a basic face image is selected from style convolutional networks (StyleConv) of the trained image generation model, and the initial style feature is adjusted based on the convolutional weight, to obtain a modulated style feature. The basic face image with the initial resolution is generated based on the preset basic style feature.
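For illustration, the following sketch shows the general modulate/demodulate idea used by StyleGAN2-style decoders (the "Mod"/"Demod" modules mentioned above); it is a generic simplification with assumed dimensions, not the exact decoding network of the trained image generation model.

```python
# Hedged sketch of modulated convolution: the latent feature scales the convolution
# weight per input channel (Mod), the weight is renormalized per output channel
# (Demod), and the modulated weight is applied to the style feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv(nn.Module):
    def __init__(self, in_ch, out_ch, latent_dim, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.affine = nn.Linear(latent_dim, in_ch)   # maps latent w to per-channel styles

    def forward(self, feat, latent):
        b, in_ch, h, w = feat.shape
        style = self.affine(latent).view(b, 1, in_ch, 1, 1) + 1.0
        weight = self.weight.unsqueeze(0) * style                     # Mod: scale weights by the style
        demod = torch.rsqrt((weight ** 2).sum(dim=(2, 3, 4)) + 1e-8)  # Demod: per-output-channel norm
        weight = weight * demod.view(b, -1, 1, 1, 1)
        # Grouped convolution applies a distinct modulated weight to each batch item.
        out = F.conv2d(feat.view(1, b * in_ch, h, w),
                       weight.view(-1, in_ch, *self.weight.shape[2:]),
                       padding=1, groups=b)
        return out.view(b, -1, h, w)

conv = ModulatedConv(in_ch=64, out_ch=64, latent_dim=512)
modulated = conv(torch.randn(2, 64, 32, 32), torch.randn(2, 512))  # modulated style feature
```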
In one or more examples, a manner of fusing the modulated style feature, the original face feature image on the adjacent scale, and the reference face feature image on the adjacent scale, to obtain the target face image frame may be:
The fused face feature image may also be considered as a fused style feature.
In one or more examples, a manner of generating the target face image frame based on the fused face feature image on the adjacent scale and the basic face image may be: using the fused face feature image on the adjacent scale as a fused face feature image on a new target scale, and returning to perform the step of performing, based on the latent feature information, style modulation processing on the corresponding fused face feature image on the target scale, to obtain a modulated style feature, until a scale of an obtained target face image frame satisfies a preset scale condition.
The preset scale condition may be specifically that the scale of the target face image frame is a maximum scale of the plurality of scales.
In one or more embodiments, there may be a plurality of processes of selecting, based on the preset resolution, a target original spatial feature (e.g., an original face feature image on the target scale) from original spatial features and a target reference spatial feature (e.g., a reference face feature image on the target scale) from reference spatial features. For example, original spatial features and reference spatial features may be sorted based on the preset resolution separately. Based on sorting information, an original spatial feature with a minimum resolution is selected from the original spatial features as a target original spatial feature, and a reference spatial feature with a minimum resolution is selected from the reference spatial features as a target reference spatial feature. After the target original spatial feature and the target reference spatial feature are selected, the target original spatial feature may be deleted from the original spatial features, and the target reference spatial feature may be deleted from the reference spatial features. In this way, spatial features with the minimum resolution may be selected from the original spatial features and the reference spatial features each time, thereby obtaining the target original spatial feature and the target reference spatial feature.
After the target original spatial feature and the target reference spatial feature are selected, the modulated style feature, the target original spatial feature, and the target reference spatial feature may be fused. There may be a plurality of processes of fusion. For example, the target original spatial feature, the target reference spatial feature, and the modulated style feature may be directly spliced, thereby obtaining a fused style feature with the current resolution, which may be specifically shown in Formula (4), Formula (5), and Formula (6):
zsty represents the fused face spatial feature, StyleConv represents a style convolution module in stylegan v2, Up represents up-sampling, T represents a feature resampling operation, Fflow has a function of transforming a feature into a dense flow field ϕi, and FRGB has a function of transforming a feature into an RGB color image. Fi+1 is a fused style feature, the fused style feature may be a style feature corresponding to a next preset resolution of the basic style feature, Fi is the basic style feature, Fis is the target original spatial feature with the preset resolution, and Fird is the target reference spatial feature with the preset resolution.
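As an illustrative sketch of one such fusion step (with StyleConv abbreviated to a plain convolution, an additive fusion assumed, and all channel counts and resolutions made up), the style feature is up-sampled, a dense flow field is predicted, the source feature is resampled with it, and the reference feature is merged in:

```python
# Hedged sketch in the spirit of Formulas (4)-(6): upsample and style-convolve F_i,
# predict a dense flow field, warp (resample) the source feature, and fuse with the
# rendered-reference feature to obtain F_{i+1} and an RGB output at this resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(feature, flow):
    # T(.): resample `feature` with a dense flow field given in normalized coordinates.
    b, _, h, w = feature.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    return F.grid_sample(feature, base_grid + flow.permute(0, 2, 3, 1), align_corners=True)

ch = 64
style_conv = nn.Conv2d(ch, ch, 3, padding=1)   # stand-in for StyleConv(z_sty, .)
to_flow = nn.Conv2d(ch, 2, 3, padding=1)       # F_flow: feature -> dense flow field
to_rgb = nn.Conv2d(ch, 3, 3, padding=1)        # F_RGB: feature -> RGB color image

F_i = torch.randn(1, ch, 16, 16)               # fused style feature on the current scale
F_src = torch.randn(1, ch, 32, 32)             # target original spatial feature on the next scale
F_rd = torch.randn(1, ch, 32, 32)              # target reference spatial feature on the next scale

up = F.interpolate(F_i, scale_factor=2, mode="bilinear", align_corners=False)  # Up(F_i)
styled = style_conv(up)
flow = to_flow(styled)                         # dense flow field phi_i
F_next = styled + warp(F_src, flow) + F_rd     # fused style feature F_{i+1} (additive fusion assumed)
rgb = to_rgb(F_next)                           # RGB image at this resolution
```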
After the fused style feature with the current resolution is obtained, the target face image frame with the target resolution may be generated based on the fused style feature and the basic face image. There may be a plurality of processes of generating the target face image frame with the target resolution. For example, a current face image may be generated based on the fused style feature, and the current face image and the basic face image are fused to obtain a fused face image with the current resolution; and the fused style feature may be used as the preset basic style feature, and the fused face image is used as the basic face image, to return to perform the step of adjusting a preset basic style feature based on the latent feature information, until the current resolution is the target resolution, to obtain the target face image frame.
In the process of generating the target face image frame with the target resolution, it may be found that the current face image and the basic face image with different resolutions are superimposed sequentially. In addition, during superimposition, the resolutions are sequentially increased, and therefore a high-definition target face image frame may be output.
In one or more examples, a basic optical flow field with an initial resolution may be generated based on the basic style feature, thereby outputting a target optical flow field with the target resolution. The basic optical flow field may be understood as a visualized field configured to indicate the same key point movement in a face image with the initial resolution. There may be a plurality of processes of outputting the target optical flow field. For example, a basic optical flow field with an initial resolution may be generated based on the basic style feature, and the modulated style feature, the original spatial feature, and the reference spatial feature are fused according to the basic optical flow field, to obtain a target optical flow field with the target resolution.
There may be a plurality of processes of fusing the modulated style feature, the original spatial feature, and the reference spatial feature. For example, based on the preset resolution, a target original spatial feature may be selected from original spatial features and a target reference spatial feature may be selected from reference spatial features; the modulated style feature, the target original spatial feature, and the target reference spatial feature are fused, to obtain a fused style feature with the current resolution; and a target optical flow field with the target resolution is generated based on the fused style feature and the basic optical flow field.
There may be a plurality of processes of generating a target optical flow field with the target resolution based on the fused style feature and the basic optical flow field. For example, a current optical flow field may be generated based on the fused style feature, and the current optical flow field and the basic optical flow field are fused, to obtain a fused optical flow field with the current resolution; and the fused style feature is used as the preset basic style feature, and the fused optical flow field is used as the basic optical flow field, to return to perform the step of adjusting a preset basic style feature based on the latent feature information, until the current resolution is the target resolution, to obtain the target optical flow field.
The target face image and the target optical flow field may be simultaneously generated based on the preset basic style feature, or the target face image or the target optical flow field may be individually generated based on the preset basic style feature. Using an example in which the target face image and the target optical flow field are simultaneously generated, the preset basic style feature, the basic face image, and the basic optical flow field may be processed through a decoding network of the trained image generation model, the decoding network may include a decoding sub-network corresponding to each preset resolution, the resolutions may increase from 4*4 to 512*512, and a network structure of a decoding sub-network may be shown in
For an operation of each layer in the network in the foregoing image generation process, reference may be made to
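As a rough illustration of one such decoding layer, the following sketch assumes a StyleGAN2-style block in which the basic style feature is up-sampled, fused with the warped target original spatial feature and the target reference spatial feature, modulated by zsty, and used to accumulate the RGB image and the dense flow field skip-wise; the names (DecodeBlock, warp), the channel layout, and the exact fusion order are assumptions for illustration only, not the network defined in this disclosure.

```python
# Hedged sketch: one resolution step of the decoding described above.
# StyleConv is approximated by a plain convolution scaled by a projection of z_sty;
# Up is bilinear up-sampling; the resampling operation T is implemented with grid_sample.
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(feature, flow):
    """Resample `feature` (B, C, H, W) with a dense flow field `flow` (B, 2, H, W)."""
    b, _, h, w = feature.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).to(feature)   # identity sampling grid
    grid = base + flow.permute(0, 2, 3, 1)                           # offsets in normalized units
    return F.grid_sample(feature, grid, align_corners=True)

class DecodeBlock(nn.Module):
    """One decoding sub-network: fuses Up(F_i), T(F_i^s, phi_i), and F_i^rd under z_sty."""
    def __init__(self, ch, sty_dim):
        super().__init__()
        self.mod = nn.Linear(sty_dim, ch)            # crude stand-in for style modulation
        self.conv = nn.Conv2d(3 * ch, ch, 3, padding=1)
        self.to_rgb = nn.Conv2d(ch, 3, 1)            # F_RGB: feature -> RGB residual
        self.to_flow = nn.Conv2d(ch, 2, 1)           # F_flow: feature -> dense flow field

    def forward(self, feat, z_sty, src_feat, rd_feat, rgb, flow):
        # src_feat / rd_feat are the target original / reference spatial features at the new resolution.
        feat = F.interpolate(feat, scale_factor=2, mode="bilinear", align_corners=False)
        flow = F.interpolate(flow, scale_factor=2, mode="bilinear", align_corners=False)
        rgb = F.interpolate(rgb, scale_factor=2, mode="bilinear", align_corners=False)
        warped_src = warp(src_feat, flow)            # T: resample the source cue with the current flow
        feat = self.conv(torch.cat([feat, warped_src, rd_feat], dim=1))
        feat = feat * self.mod(z_sty)[:, :, None, None]
        rgb = rgb + self.to_rgb(feat)                # superimpose the current image on the basic image
        flow = flow + self.to_flow(feat)             # superimpose the current flow on the basic flow
        return feat, rgb, flow
```

A full decoder would stack such blocks from the 4*4 resolution up to 512*512, feeding each block the fused style feature, fused face image, and fused optical flow field produced by the previous block, which mirrors the iterative superimposition described above.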
The generation network may be established based on StyleGAN2. In one or more examples, an original face spatial feature (denoted as θ) of a source human face (e.g., the original face image frame) and a face local pose feature (denoted as {circumflex over (β)}) of a driven human face (e.g., the target face image frame) may be recombined first. For example, object identity information (αs), lighting (δs), and texture (γs) in the original face spatial feature θ of the source human face, an expression coefficient and a pose (pd) in the face local pose feature {circumflex over (β)}, and updated emotion intensity information ({circumflex over (σ)}) may be merged into one new feature vector (zsty). The feature vector is the fused face spatial feature in the foregoing embodiments, the feature vector contains expression and pose information of the driven human face, and emotion information in the feature vector is further enhanced using an emotion intensity representation outputted by the transformer encoder. The process of obtaining the fused face spatial feature is shown in Formula (7):
The Linear function may represent a linear layer.
In addition, an encoder (Esrc) of the source human face may be constructed to provide a texture cue (Fs) of the source human face, and an encoder (Erd) of the rendered driven human face may be constructed to explicitly provide a spatial cue (Frd) of the driven human face, which mainly includes information such as a pose and an expression.
Then, zsty, Fs, and Frd are input to the network G, and processed by each layer in the network G, to finally generate a high-fidelity human face, that is, the target face image frame in the foregoing embodiments.
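A minimal sketch of this recombination (the Linear projection of Formula (7)) is given below; the coefficient dimensions are made-up placeholders, and the module name CoefficientRecombiner is not from this disclosure.

```python
# Hedged sketch of Formula (7): concatenate the source identity, lighting, and texture
# coefficients with the driven expression/pose coefficients and the updated emotion
# intensity, then project them with a linear layer into the fused feature z_sty.
import torch
import torch.nn as nn

class CoefficientRecombiner(nn.Module):
    def __init__(self, in_dim, sty_dim=512):
        super().__init__()
        self.linear = nn.Linear(in_dim, sty_dim)     # the "Linear" layer mentioned above

    def forward(self, alpha_s, delta_s, gamma_s, expr_d, pose_d, sigma_hat):
        merged = torch.cat([alpha_s, delta_s, gamma_s, expr_d, pose_d, sigma_hat], dim=-1)
        return self.linear(merged)                   # z_sty, the fused face spatial feature

# Usage with assumed coefficient sizes (80-d identity, 27-d lighting, 80-d texture,
# 64-d expression, 6-d pose, 1-d emotion intensity):
alpha_s, delta_s, gamma_s = torch.randn(1, 80), torch.randn(1, 27), torch.randn(1, 80)
expr_d, pose_d, sigma_hat = torch.randn(1, 64), torch.randn(1, 6), torch.randn(1, 1)
z_sty = CoefficientRecombiner(80 + 27 + 80 + 64 + 6 + 1)(alpha_s, delta_s, gamma_s, expr_d, pose_d, sigma_hat)
```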
The process of training the emotional human face generation network may be as follows:
In one or more embodiments, the emotional human face generation network may be trained using four loss functions, and the four loss functions are respectively: emotion consistency loss (denoted as Lemo), pixel-level image reconstruction loss (denoted as Lrec), feature-level perception loss (denoted as Lp), and adversarial loss (denoted as Lgan). In one or more examples, the four loss functions may be fused (for example, undergo a weighting operation), to obtain a total loss function, and then parameters of the emotional human face generation network are adjusted based on the total loss function, until the total loss function satisfies a preset loss condition, thereby obtaining a trained emotional human face generation network, where the preset loss condition may be that the total loss function is less than a preset loss value.
In some embodiments, the total loss function L may be shown in Formula (8):
For the emotion consistency loss Lemo, emotion recognition processing may be performed separately on the sample object in the target driving face image frame sample and in the predicted driving face image frame, to obtain a first emotion recognition result corresponding to the target driving face image frame sample and a second emotion recognition result corresponding to the predicted driving face image frame; and calculation is performed based on the first emotion recognition result and the second emotion recognition result, to obtain an emotion consistency loss of the preset emotional human face generation network.
For the image reconstruction loss Lrec, the image reconstruction loss of the preset emotional human face generation network may be determined based on a similarity between the target driving face image frame sample and the predicted driving face image frame. The similarity may be specifically a cosine similarity or may be a histogram similarity measure. The embodiments of the present disclosure are not limited to this example.
The perception loss Lp is a feature-level loss value. Specifically, spatial feature extraction is performed on each of the target driving face image frame sample and the predicted driving face image frame, to obtain a first face spatial feature of the target driving face image frame sample and a second face spatial feature of the predicted driving face image frame, and then a similarity between the first face spatial feature and the second face spatial feature is calculated, to obtain a perception loss. A larger similarity indicates a smaller perception loss; conversely, a smaller similarity indicates a larger perception loss.
For the adversarial loss Lgan, probabilities that the target driving face image frame sample and the predicted driving face image frame belong to a real driving face image frame may be separately predicted, and the adversarial loss of the preset emotional human face generation network is determined based on the probabilities.
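The sketch below illustrates how the four losses and their weighted fusion into the total loss of Formula (8) might be computed; the emotion classifier, feature extractor, discriminator, and the loss weights are placeholders assumed for illustration, and the L1/BCE forms are one reasonable choice rather than the exact losses of this disclosure.

```python
# Hedged sketch of the four losses described above and their weighted fusion (Formula (8)).
import torch
import torch.nn.functional as F

def generation_total_loss(real_img, fake_img, emotion_net, feat_net, disc,
                          w_emo=1.0, w_rec=1.0, w_p=1.0, w_gan=0.1):
    # Emotion consistency loss: compare emotion predictions on the real and generated frames.
    l_emo = F.l1_loss(emotion_net(fake_img), emotion_net(real_img))
    # Pixel-level image reconstruction loss: similarity between the two frames.
    l_rec = F.l1_loss(fake_img, real_img)
    # Feature-level perception loss: distance between extracted spatial features.
    l_p = F.l1_loss(feat_net(fake_img), feat_net(real_img))
    # Adversarial loss (generator side): push the discriminator's "real" probability towards 1.
    d_fake = disc(fake_img)
    l_gan = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    return w_emo * l_emo + w_rec * l_rec + w_p * l_p + w_gan * l_gan
```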
In one or more embodiments, a manner of separately predicting probabilities that the target driving face image frame sample and the predicted driving face image frame belong to a real driving face image frame, and determining the adversarial loss of the preset emotional human face generation network based on the probabilities may be:
The emotional human face generation network may be a generative adversarial network. The generative adversarial network may contain one generator and one discriminator. The generator is responsible for generating a sample, and the discriminator is responsible for discriminating whether the sample generated by the generator is real. The generator may need to generate a sample that is as realistic as possible to confuse the discriminator, and the discriminator may need to distinguish the sample generated by the generator from a real sample as much as possible, so that the capabilities of the generator are improved through a continuous adversarial game between the two.
The preset discrimination model may be the discriminator D. During training, the target driving face image frame sample is a real image, the predicted driving face image frame is a generation result of the preset emotional human face generation network, and the discriminator may need to determine that the generation result is fake and the real image is real. The preset emotional human face generation network may be considered as the entire generator G. During training and learning, an image generated by the generator G may need to be capable of fooling the discriminator D. For example, the generator G aims to make the discriminator D determine, as much as possible, that the probability that the predicted driving face image frame generated by the generator G belongs to a real driving face image frame is 1.
Input of the discriminator may be a real image or output of the emotional human face generation network, aiming to distinguish the output of the emotional human face generation network from the real image as much as possible. The emotional human face generation network needs to fool the discriminator as much as possible. The emotional human face generation network and the discriminator oppose each other, and continuously adjust parameters, thereby obtaining a trained emotional human face generation network.
In one or more examples, a manner of performing spatial feature extraction on the original face image frame, to obtain an original face spatial feature corresponding to the original face image frame may be:
The image generation model may be specifically a neural network model, and the neural network may be a visual geometry group network (VGGNet), a residual network (ResNet), a densely connected convolutional network (DenseNet), or any other suitable neural network model known to one of ordinary skill in the art. However, it is to be understood that, the image generation model is not merely limited to the several types listed above.
The image generation model may be obtained by training with a plurality of groups of training data, and the image generation model may be specifically trained by another device and then provided to the image generation apparatus, or may be trained by the image generation apparatus autonomously.
If the image generation model is trained by the image generation apparatus autonomously, before the performing spatial feature extraction on the original face image frame through the image generation model, to obtain the original face spatial feature corresponding to the original face image frame, the image generation method may further include:
The target driving face image frame sample may be considered as label information, and may be specifically an expected driving face image frame corresponding to the audio driving information sample and the emotion driving information sample.
There are a plurality of processes of obtaining an original face image frame sample of a sample object, a target driving face image frame sample, and an audio driving information sample and an emotion driving information sample that correspond to the target driving face image frame sample. The embodiments of the present disclosure are not limited to this example.
For example, the model may be trained using a lecture video with an emotion label. For example, any two video frames containing the object face may be extracted from a lecture video about the sample object, where one frame is used as the original face image frame sample, and the other frame is used as the target driving face image frame sample. Audio information within 1 second before and after the target driving face image frame sample in the lecture video is used as the audio driving information sample, and the emotion label of the lecture video is used as the emotion driving information sample.
For another example, any 32 consecutive frames may be captured from the lecture video and used as driving images (e.g., target driving face image frame samples, denoted as I1-32), which also serve as the supervision ground truth for the model generation result, where the first frame I1 may be used as an original image (e.g., the original face image frame sample), audio signals corresponding to the consecutive driving images are used as audio driving information samples (A1-32), and a text representation (temo) of the emotion label of the video is used as the emotion driving information for generating a human face.
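Under the stated assumptions (25 fps video, 16 kHz audio, array-shaped inputs), assembling one such training sample might look like the following sketch; the helper name make_sample and the shapes are illustrative only.

```python
# Hedged sketch: build one training sample from a labeled lecture video clip.
import numpy as np

def make_sample(frames, audio, emotion_label, start, clip_len=32, fps=25, sr=16000):
    """frames: (N, H, W, 3) video frames; audio: 1-D waveform aligned with the video."""
    driving = frames[start:start + clip_len]             # I_1..I_32, the target driving frames (supervision)
    source = driving[0]                                   # I_1, the original face image frame sample
    a0 = int(start / fps * sr)
    a1 = int((start + clip_len) / fps * sr)
    audio_driving = audio[a0:a1]                          # A_1..A_32, audio aligned to the clip
    return source, driving, audio_driving, emotion_label  # emotion label text, e.g. "angry"

# Usage with dummy data:
frames = np.zeros((100, 256, 256, 3), dtype=np.uint8)
audio = np.zeros(100 * 16000 // 25, dtype=np.float32)
sample = make_sample(frames, audio, "angry", start=10)
```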
In one or more examples, a manner of adjusting, based on the target driving face image frame sample and the predicted driving face image frame, parameters of the preset image generation model, to obtain a trained image generation model may be:
Emotion recognition processing may be performed separately on the sample objects in the target driving face image frame sample and the predicted driving face image frame through an emotion recognition model, and the emotion recognition model may be specifically a neural network model.
The first emotion recognition result may be first probability information that an emotion of the sample object in the target driving face image frame sample belongs to a preset emotion. The second emotion recognition result may be second probability information that an emotion of the sample object in the predicted driving face image frame belongs to a preset emotion. Then, the emotion loss information is obtained according to a cross entropy loss between the first probability information and the second probability information.
In one or more examples, a manner of determining, based on a similarity between the target driving face image frame sample and the predicted driving face image frame, reconstruction loss information of the preset image generation model may be:
A vector distance between a feature vector corresponding to the first feature information and a feature vector corresponding to the second feature information may be calculated, and the similarity between the first feature information and the second feature information is determined according to the vector distance. A larger vector distance indicates a lower similarity and a larger loss value corresponding to the reconstruction loss information; conversely, a smaller vector distance indicates a higher similarity and a smaller loss value corresponding to the reconstruction loss information.
In one or more examples, a manner of adjusting the parameters of the preset image generation model according to the emotion loss information and the reconstruction loss information, to obtain the trained image generation model may be:
The performing spatial feature extraction on the target driving face image frame sample may be performing convolution processing, pooling processing, or any other suitable processing known to one of ordinary skill in the art, on the target driving face image frame sample. The embodiments of the present disclosure are not limited to this example. In one or more embodiments, spatial feature extraction may be performed on the target driving face image frame sample through a trained image feature extraction network.
The extracted target face spatial feature corresponding to the target driving face image frame sample may specifically include 3D face coefficients corresponding to the target driving face image frame sample, for example, may specifically include identity information, lighting, texture, expression, pose, and gaze.
The face key point extraction may be specifically to extract key points of facial features, such as an eye or a mouth.
For the regularization loss information, in one or more embodiments, the target face spatial feature corresponding to the target driving face image frame sample may be used as a supervisory signal of the extracted face local pose feature. In one or more examples, a feature distance between the face local pose sample feature corresponding to the target driving face image frame sample and the target face spatial feature may be calculated, and the regularization loss information is determined according to the feature distance. A larger feature distance indicates a larger loss value corresponding to the regularization loss information; conversely, a smaller feature distance indicates a smaller loss value corresponding to the regularization loss information.
A manner of adjusting the parameters of the preset image generation model according to the emotion loss information, the reconstruction loss information, the face key point loss information, and the regularization loss information, to obtain the trained image generation model may be:
There are a plurality of processes of fusing the emotion loss information, the reconstruction loss information, the face key point loss information, and the regularization loss information. The embodiments of the present disclosure are not limited to this example. For example, the fusion manner may be weighted fusion.
In the process of training the preset image generation model, the total loss information is first calculated, the parameters of the preset image generation model are then adjusted using a back-propagation algorithm, and the parameters of the image generation model are optimized based on the total loss information, until a loss value corresponding to the total loss information is less than a preset loss value, to obtain a trained image generation model. The preset loss value may be set according to an actual situation. For example, a higher precision requirement for the image generation model corresponds to a smaller preset loss value.
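A minimal sketch of this update loop, assuming the four losses are provided by a callback and fused by simple weighting, is shown below; the weights, the threshold, and the optimizer are placeholders, not values fixed by this disclosure.

```python
# Hedged sketch: weighted fusion of the four losses, back-propagation, and the preset
# loss condition used as a stopping criterion.
def train_until_converged(model, optimizer, data_loader, compute_losses,
                          weights=(1.0, 1.0, 1.0, 1.0), preset_loss=0.05, max_steps=100000):
    for _, batch in zip(range(max_steps), data_loader):
        l_emo, l_rec, l_lmd, l_reg = compute_losses(model, batch)   # the four losses described above
        total = sum(w * l for w, l in zip(weights, (l_emo, l_rec, l_lmd, l_reg)))
        optimizer.zero_grad()
        total.backward()                  # back-propagation algorithm
        optimizer.step()
        if total.item() < preset_loss:    # preset loss condition
            break
    return model
```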
In one or more embodiments, the image generation model may include an emotion-perceivable audio-3D coefficient transform network and an emotional human face generation network. As shown in
Then, feature interaction processing may be performed on the audio driving information and the emotion driving information, to obtain a face local pose feature, denoted as {circumflex over (β)}. A specific process thereof may include: extracting an emotion semantic feature (denoted as zemo) contained in the emotion driving information (which may be, for example, description information such as “angry”) and extracting object identity information α from the original face spatial feature θ through a multi-modal model, and performing dimension mapping on an audio driving sequence (A1, A2, . . . , and AT) and the object identity information α through a linear layer; then inputting the PE information (PE) of the audio driving information, the preset emotion intensity information σ, the emotion semantic feature zemo, the dimension-mapped audio driving sequence A1-32, and the object identity information α to the encoder for feature interaction, to obtain interaction feature information z (z may include a feature sequence z1, z2, . . . , and zT) and updated emotion intensity information {circumflex over (σ)}; and then inputting the interaction feature information z, the emotion semantic feature zemo, and the PE information of the audio driving information to the decoder, and processing an output result of the decoder through the linear layer, to obtain the face local pose feature {circumflex over (β)}, where {circumflex over (β)} may include a coefficient sequence {circumflex over (β)}1, {circumflex over (β)}2, . . . , and {circumflex over (β)}T.
After the face local pose feature is obtained, the original face spatial feature and the face local pose feature may be fused, and β in the original face spatial feature θ may be specifically replaced with {circumflex over (β)}, to obtain an updated {circumflex over (θ)}. Coefficients in β and {circumflex over (β)} correspond to each other. In addition, the updated emotion intensity information and {circumflex over (θ)} may be further fused, to obtain the fused face spatial feature. Then, face reconstruction processing is performed on the to-be-adjusted object based on the fused face spatial feature, to obtain a reconstructed 3D face image corresponding to the to-be-adjusted object; and rendering and mapping processing is performed on the reconstructed 3D face image through a rendering layer, to obtain a two-dimensional image of the driven emotional human face, that is, the reference face image frame corresponding to the to-be-adjusted object. Finally, the reference face image frame, the source human face, and the recombined coefficients (which are specifically the fused face spatial feature) are input to the emotional human face generation network, to obtain a final realistic human face.
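The following sketch outlines one possible shape of the audio-to-coefficient part of this pipeline: a transformer encoder that lets the audio tokens interact with the identity, emotion, and intensity conditions, and a decoder that outputs the per-frame coefficients {circumflex over (β)} together with an updated intensity. All dimensions, layer counts, and the placement of the intensity token are assumptions for illustration, not the configuration of this disclosure.

```python
# Hedged sketch of the feature interaction and prediction described above.
import torch
import torch.nn as nn

class AudioToCoeff(nn.Module):
    def __init__(self, d_model=256, audio_dim=80, id_dim=80, emo_dim=512, coeff_dim=70, n_frames=32):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)   # dimension mapping of A_1..A_T
        self.id_proj = nn.Linear(id_dim, d_model)         # dimension mapping of identity alpha
        self.emo_proj = nn.Linear(emo_dim, d_model)       # emotion semantic feature z_emo
        self.intensity = nn.Linear(1, d_model)            # preset emotion intensity sigma
        self.pos = nn.Parameter(torch.zeros(n_frames, d_model))   # PE information (learned here)
        self.encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(d_model, 4, batch_first=True), 3)
        self.decoder = nn.TransformerDecoder(nn.TransformerDecoderLayer(d_model, 4, batch_first=True), 3)
        self.head = nn.Linear(d_model, coeff_dim)         # linear layer producing each coefficient

    def forward(self, audio, identity, z_emo, sigma):
        a = self.audio_proj(audio) + self.pos                      # (B, T, d): audio tokens + PE
        cond = torch.stack([self.id_proj(identity), self.emo_proj(z_emo), self.intensity(sigma)], dim=1)
        z = self.encoder(torch.cat([cond, a], dim=1))               # interaction feature information
        sigma_hat = z[:, 2]                                         # updated intensity token (assumed slot)
        tgt = a + self.emo_proj(z_emo).unsqueeze(1)                 # decoder queries: PE'd audio + emotion
        out = self.decoder(tgt, z)
        return self.head(out), sigma_hat                            # per-frame coefficients and sigma_hat
```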
In the process of training the emotion-perceivable audio-3D coefficient transform network, the target driving face image frame sample corresponding to the sample object may be denoted as I, and the reference face image frame sample obtained through the rendering layer is denoted as Ird. In one or more examples, the rendered human face Ird may be pasted back onto the original face image frame sample through face masking, to form a new human face image, denoted as I3d. I3d and the real value image I are then input to the emotion recognition model for emotion recognition processing, and emotion loss information (e.g., the emotion consistency loss) of the network is calculated based on the emotion recognition results, as shown in Formula (9) below:
φemo represents the emotion recognition model, and Lemo represents the emotion consistency loss.
In addition, a face reconstruction loss of the emotion-perceivable audio-3D coefficient transform network may be further calculated based on the real value image I and the rendered human face Ird, as shown in Formula (10):
M represents the face mask, and Lrec represents the face reconstruction loss.
In some embodiments, the original 68 face key points (denoted as l) may be extracted from the 3D face coefficients of the target driving face image frame sample, and face key points (denoted as lrd) may be extracted from the recombined 3D face coefficients, to calculate a face key point loss between the two, as shown in Formula (11):
Llmd represents the face key point loss.
In some embodiments, a coefficient regularization loss of the emotion-perceivable audio-3D coefficient transform network may be further calculated, and a coefficient regularization term may be used to stabilize a training process. Specifically, by calculating a loss value between a 3D face coefficient (β) outputted by the image feature extraction network and a 3D face coefficient ({circumflex over (β)}) predicted by the decoder, a coefficient regularization loss Lreg may be obtained, as shown in Formula (12):
After the emotion consistency loss, the face reconstruction loss, the face key point loss, and the coefficient regularization loss are obtained, a total loss function of the emotion-perceivable audio-3D coefficient transform network may be calculated based on the four loss functions, and then the emotion-perceivable audio-3D coefficient transform network is trained based on the total loss function. The total loss function L may be shown in Formula (13):
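As a hedged illustration of Formulas (9) to (13), the sketch below composes the masked re-pasted face and fuses the emotion consistency, masked reconstruction, landmark, and coefficient regularization terms by weighting; the L1/L2 forms and the weights are assumptions rather than the exact formulas.

```python
# Hedged sketch of the training losses of the emotion-perceivable audio-3D coefficient
# transform network and their weighted total.
import torch
import torch.nn.functional as F

def transform_net_losses(I, I_rd, mask, lmk_real, lmk_pred, beta, beta_hat, emo_net,
                         w=(1.0, 1.0, 1.0, 0.1)):
    I_3d = mask * I_rd + (1 - mask) * I                 # rendered face pasted back via the face mask M
    l_emo = F.l1_loss(emo_net(I_3d), emo_net(I))        # Formula (9): emotion consistency
    l_rec = F.l1_loss(mask * I_rd, mask * I)            # Formula (10): masked face reconstruction
    l_lmd = F.mse_loss(lmk_pred, lmk_real)              # Formula (11): 68-point face key point loss
    l_reg = F.mse_loss(beta_hat, beta)                  # Formula (12): coefficient regularization
    return w[0] * l_emo + w[1] * l_rec + w[2] * l_lmd + w[3] * l_reg   # Formula (13): total loss
```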
Based on the image generation method provided in this disclosure, motion of a face in any original face image frame may be flexibly driven according to a given audio and emotion description, to generate a target face image frame in which the mouth shape is consistent with the audio content and the face emotion is consistent with the emotion description. The emotion description herein may be a hybrid emotion or any other suitable emotion description known to one of ordinary skill in the art. The embodiments of the present disclosure are not limited to this example. This solution may implement emotional voice figure driving, so that any emotion and any audio can be reflected on any human face, thereby implementing multi-modal emotional human face driving and providing a highly operable solution.
Emotional voice figure driving specifically means that a facial expression of a figure is further controlled according to an externally given emotion condition or an emotion representation decoupled from the audio, so as to be consistent with an expected emotional expression.
It may be learned from the foregoing that, in one or more examples, an original face image frame, audio driving information, and emotion driving information of a to-be-adjusted object may be obtained, the original face image frame including an original face of the to-be-adjusted object, the audio driving information including voice content of the to-be-adjusted object, to drive a face pose of the original face to change according to the voice content, and the emotion driving information being configured for describing a target emotion of the to-be-adjusted object in a case of issuing the voice content, to drive the face pose of the original face to change according to the target emotion; and spatial feature extraction is performed on the original face image frame, to obtain an original face spatial feature corresponding to the original face image frame. Feature interaction processing is performed on the audio driving information and the emotion driving information, to obtain a face local pose feature of the to-be-adjusted object, and the face local pose feature may reflect the current voice content and the emotion-induced change in the face pose, so that face reconstruction processing is performed on the to-be-adjusted object based on the original face spatial feature and the face local pose feature, thereby generating a target face image frame based on the original face image frame. In this disclosure, partial face pose detail information of the to-be-adjusted object may be captured using the audio driving information and the emotion driving information, and then face adjustment is performed on the original face image frame based on the captured information, thereby obtaining the corresponding target face image frame. In this way, an improvement in generation efficiency and accuracy of the target face image frame is facilitated.
According to the method described in the foregoing embodiments, the following provides a detailed description by using an example in which the image generation apparatus is specifically integrated into a server.
One or more embodiments of this disclosure provide an image generation method. As shown in
201. A server obtains an original face image frame, audio driving information, and emotion driving information of a to-be-adjusted object.
202. The server performs spatial feature extraction on the original face image frame, to obtain an original face spatial feature corresponding to the original face image frame.
203. The server performs feature interaction processing on the audio driving information and the emotion driving information, to obtain interaction feature information.
204. The server predicts the face local pose feature based on the interaction feature information and the emotion driving information.
205. The server performs, based on the original face spatial feature and the face local pose feature, face reconstruction processing on the to-be-adjusted object, to generate a target face image frame.
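Under the assumption that the four stages are available as callable components, the server-side flow of steps 201 to 205 can be summarized by the following sketch; the parameter names are placeholders, not interfaces defined in this disclosure.

```python
# Hedged sketch tying steps 202-205 together after the inputs of step 201 are obtained.
def generate_target_frame(original_frame, audio_info, emotion_info,
                          extract_spatial, interact, predict_pose, reconstruct):
    spatial = extract_spatial(original_frame)             # step 202: original face spatial feature
    interaction = interact(audio_info, emotion_info)      # step 203: interaction feature information
    local_pose = predict_pose(interaction, emotion_info)  # step 204: face local pose feature
    return reconstruct(spatial, local_pose)               # step 205: target face image frame
```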
Based on the image generation method provided in this disclosure, motion of a face in any original face image frame may be flexibly driven according to a given audio and emotion description, to generate a target face image frame in which the mouth shape is consistent with the audio content and the face emotion is consistent with the emotion description. The emotion description herein may be a hybrid emotion or any other suitable emotion description known to one of ordinary skill in the art. The embodiments of the present disclosure are not limited to this example. This solution may implement emotional voice figure driving, so that any emotion and any audio can be reflected on any human face, thereby implementing multi-modal emotional human face driving and providing a highly operable solution.
Emotional voice figure driving specifically means that a facial expression of a figure is further controlled according to an externally given emotion condition or an emotion representation decoupled from the audio, so as to be consistent with an expected emotional expression.
It may be learned from the foregoing that, in one or more examples, a server may obtain an original face image frame, audio driving information, and emotion driving information of a to-be-adjusted object; perform spatial feature extraction on the original face image frame, to obtain an original face spatial feature corresponding to the original face image frame; perform feature interaction processing on the audio driving information and the emotion driving information, to obtain interaction feature information; and then predict the face local pose feature based on the interaction feature information and the emotion driving information. The face local pose feature may reflect the current audio and the emotion-induced change in the face pose, so that the server may perform face reconstruction processing on the to-be-adjusted object based on the original face spatial feature and the face local pose feature, thereby generating a target face image frame based on the original face image frame. In this disclosure, partial face pose detail information of the to-be-adjusted object may be captured using the audio driving information and the emotion driving information, and then face adjustment is performed on the original face image frame based on the captured information, thereby obtaining the corresponding target face image frame. In this way, an improvement in generation efficiency and accuracy of the target face image frame is facilitated.
To better implement the foregoing method, one or more embodiments of this disclosure further provides an image generation apparatus. As shown in
The obtaining unit is configured to obtain an original face image frame, audio driving information, and emotion driving information of a to-be-adjusted object, the original face image frame including an original face of the to-be-adjusted object, the audio driving information including voice content of the to-be-adjusted object, to drive a face pose of the original face to change according to the voice content, and the emotion driving information being configured for describing a target emotion of the to-be-adjusted object in a case of issuing the voice content, to drive the face pose of the original face to change according to the target emotion.
The extraction unit is configured to perform spatial feature extraction on the original face image frame, to obtain an original face spatial feature corresponding to the original face image frame.
The interaction unit is configured to perform feature interaction processing on the audio driving information and the emotion driving information, to obtain a face local pose feature of the to-be-adjusted object issuing the voice content with the target emotion.
In some embodiments of this disclosure, the interaction unit may include a feature interaction sub-unit and a prediction sub-unit, which are as follows:
In some embodiments of this disclosure, the audio driving information includes a plurality of audio frames, and the feature interaction sub-unit may be further configured to: extract object identity information in the original face spatial feature; encode position information of each audio frame of the plurality of audio frames in the audio driving information, to obtain a position code of each audio frame, and combine position codes respectively corresponding to the plurality of audio frames, to obtain PE information corresponding to the audio driving information; and perform feature interaction processing on the object identity information, the PE information, the audio driving information, and the emotion driving information, to obtain the interaction feature information.
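The position coding step can be illustrated with a standard sinusoidal scheme as below; the disclosure does not fix a particular encoding, so the sinusoidal form and the dimension are assumptions.

```python
# Hedged sketch: map each audio frame index to a position code and stack the codes
# into the PE information for the audio driving information.
import math
import torch

def audio_position_codes(num_frames, d_model=256):
    position = torch.arange(num_frames, dtype=torch.float32).unsqueeze(1)   # frame indices
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(num_frames, d_model)
    pe[:, 0::2] = torch.sin(position * div)    # position code of each audio frame (even dims)
    pe[:, 1::2] = torch.cos(position * div)    # position code of each audio frame (odd dims)
    return pe                                   # PE information corresponding to the audio driving information

pe = audio_position_codes(32)                   # e.g., 32 audio frames
```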
In some embodiments of this disclosure, the prediction sub-unit may be further configured to: fuse the interaction feature information and the emotion driving information, to obtain fused feature information; and decode the fused feature information, to obtain the face local pose feature.
The reconstruction unit is configured to perform, based on the original face spatial feature and the face local pose feature, face reconstruction processing on the to-be-adjusted object, to generate a target face image frame.
In some embodiments of this disclosure, the interaction unit may include a setting sub-unit and an interaction sub-unit, which are as follows:
In some embodiments of this disclosure, the reconstruction unit may include a fusing sub-unit, a reconstruction sub-unit, and a generation sub-unit, which are as follows:
In some embodiments of this disclosure, the reconstruction sub-unit may be further configured to: perform, based on the fused face spatial feature, face reconstruction processing on the to-be-adjusted object, to obtain a reconstructed 3D face image corresponding to the to-be-adjusted object; and perform rendering and mapping processing on the reconstructed 3D face image, to obtain the reference face image frame corresponding to the to-be-adjusted object.
In some embodiments of this disclosure, the generation sub-unit may be further configured to: perform multi-scale feature extraction on the original face image frame, to obtain original face feature images corresponding to the original face image frame on a plurality of scales; perform multi-scale feature extraction on the reference face image frame, to obtain reference face feature images corresponding to the reference face image frame on a plurality of scales; encode and map the fused face spatial feature, to obtain latent feature information corresponding to the fused face spatial feature; and fuse the original face feature images on the plurality of scales, the reference face feature images on the plurality of scales, and the latent feature information, to obtain the target face image frame.
In some embodiments of this disclosure, the extraction unit may be further configured to perform spatial feature extraction on the original face image frame through an image generation model, to obtain the original face spatial feature corresponding to the original face image frame;
In some embodiments of this disclosure, the image generation apparatus may further include a training unit, and the training unit may be configured to train the image generation model; and
In some embodiments of this disclosure, the step of “adjusting, based on the target driving face image frame sample and the predicted driving face image frame, parameters of the preset image generation model, to obtain a trained image generation model” may include:
In some embodiments of this disclosure, the step of “adjusting the parameters of the preset image generation model according to the emotion loss information and the reconstruction loss information, to obtain the trained image generation model” may include:
In some embodiments of this disclosure, the step of “performing feature interaction processing on the audio driving information and the emotion driving information, to obtain a face local pose feature of the to-be-adjusted object issuing the voice content with the target emotion” may include:
It may be learned from the foregoing that, in one or more examples, the obtaining unit 301 may obtain an original face image frame, audio driving information, and emotion driving information of a to-be-adjusted object, the original face image frame including an original face of the to-be-adjusted object, the audio driving information including voice content of the to-be-adjusted object, to drive a face pose of the original face to change according to the voice content, and the emotion driving information being configured for describing a target emotion of the to-be-adjusted object in a case of issuing the voice content, to drive the face pose of the original face to change according to the target emotion; the extraction unit 302 may perform spatial feature extraction on the original face image frame, to obtain an original face spatial feature corresponding to the original face image frame; the interaction unit 303 may perform feature interaction processing on the audio driving information and the emotion driving information, to obtain a face local pose feature of the to-be-adjusted object; and the face local pose feature may reflect the current voice content and the emotion-induced change in the face pose, so that the reconstruction unit 304 may perform face reconstruction processing on the to-be-adjusted object based on the original face spatial feature and the face local pose feature, thereby generating a target face image frame based on the original face image frame. In this disclosure, partial face pose detail information of the to-be-adjusted object may be captured using the audio driving information and the emotion driving information, and then face adjustment is performed on the original face image frame based on the captured information, thereby obtaining the corresponding target face image frame. In this way, an improvement in generation efficiency and accuracy of the target face image frame is facilitated.
One or more embodiments of this disclosure further provides an electronic device.
In one or more examples, the electronic device may include components such as a processor 401 of one or more processing cores, a memory 402 of one or more computer readable storage media, a power supply 403, and an input unit 404. A person skilled in the art may understand that the electronic device structure shown in
In one or more examples, the processor 401 may be a control center of the electronic device, and is connected to various parts of the entire electronic device by using various interfaces and lines. By running or executing a software program and/or module stored in the memory 402 and calling data stored in the memory 402, the processor 401 implements various functions of the electronic device and processes data. Optionally, the processor 401 may include one or more processing cores. Preferably, the processor 401 may integrate an application processor and a modem. The application processor mainly processes an operating system, a user interface, an application program, or any other suitable programs or components known to one of ordinary skill in the art. The modem mainly processes wireless communication. It may be understood that the foregoing modem may alternatively not be integrated into the processor 401.
The memory 402 may be configured to store a software program and module. The processor 401 runs the software program and module stored in the memory 402, to execute various functional applications and data processing. The memory 402 may mainly include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (for example, a sound playing function and an image playing function), or any other suitable information known to one of ordinary skill in the art. The data storage area may store data created according to use of the electronic device. In addition, the memory 402 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory, or another non-volatile solid-state storage device. Correspondingly, the memory 402 may further include a memory controller, to provide access of the processor 401 to the memory 402.
The electronic device further includes the power supply 403 for supplying power to the components. Preferably, the power supply 403 may be logically connected to the processor 401 by using a power supply management system, thereby implementing functions such as charging, discharging, and power consumption management by using the power supply management system. The power supply 403 may further include one or more of a direct current or alternating current power source, a recharging system, a power source fault detection circuit, a power source converter or an inverter, a power source state indicator, or any other components.
The electronic device may further include the input unit 404. The input unit 404 may be configured to receive entered numeric or character information and generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control.
Although not shown in the figure, the electronic device may further include a display unit, or any other suitable component known to one of ordinary skill in the art. Details are not described herein again. Specifically, in one or more examples, the processor 401 in the electronic device loads, into the memory 402 according to the following instructions, executable files corresponding to processes of one or more application programs, and the processor 401 runs the application programs stored in the memory 402 to implement the following various functions:
For a specific implementation of each of the foregoing operations, reference may be made to the foregoing embodiments. This is not described herein again.
A person of ordinary skill in the art may understand that, all or some steps of the methods in the foregoing embodiments may be implemented by using a computer program, or implemented through instructions controlling relevant hardware, and the computer program may be stored in a computer-readable storage medium and loaded and executed by a processor.
Accordingly, one or more embodiments of this disclosure provides a computer-readable storage medium, storing a computer program. The computer program may be loaded by a processor, to perform the steps in any image generation method according to the embodiments of this disclosure.
For a specific implementation of each of the foregoing operations, reference may be made to the foregoing embodiments. This is not described herein again.
The computer-readable storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other suitable component known to one of ordinary skill in the art.
Because the computer program stored in the computer-readable storage medium may perform the steps of any image generation method provided in the embodiments of this disclosure, the instructions can implement beneficial effects that may be implemented by any image generation method provided in the embodiments of this disclosure. For details, reference may be made to the foregoing embodiments. Details are not described herein again.
According to one aspect of this disclosure, a computer program product is provided. The computer program product includes a computer program, and the computer program is stored in a computer-readable storage medium. A processor of a computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, to cause the computer device to perform a method provided in various optional implementations in the foregoing image generation aspects.
The image generation method and the related devices provided in the embodiments of this disclosure are described in detail above. The principle and implementations of this disclosure are described herein by using specific examples. The descriptions of the foregoing embodiments are merely used for helping understand the method and core ideas of this disclosure. In addition, a person skilled in the art may make modifications in terms of the specific implementations and application scopes according to the idea of this disclosure. In conclusion, the content of this specification is not to be construed as a limitation on this disclosure.
Foreign application priority data: Chinese Patent Application No. 202211080126.5, filed September 2022 (CN, national).
The present application is a continuation application of International Application No. PCT/CN2023/112814 filed on Aug. 14, 2023, which claims priority to Chinese Patent Application No. 202211080126.5, filed with the China National Intellectual Property Administration on Sep. 5, 2022, the disclosures of each of which are incorporated by reference herein in their entireties.
Related application data: parent application PCT/CN2023/112814, filed August 2023 (WO); child application No. 18632696 (US).