The present disclosure relates to a method and an apparatus for generating a reenacted image. More particularly, the present disclosure relates to a method, an apparatus, and a computer-readable recording medium capable of generating an image transformed by reflecting characteristics of different images.
Extraction of a facial landmark means the extraction of keypoints of a main part of a face or the extraction of an outline drawn by connecting the keypoints. Facial landmarks have been used in techniques including analysis, synthesis, morphing, reenactment, and classification of facial images, e.g., facial expression classification, pose analysis, synthesis, and transformation.
Existing facial image analysis and utilization techniques based on facial landmarks do not distinguish appearance characteristics from emotional characteristics, e.g., facial expressions, of a subject when processing facial landmarks, leading to deterioration in performance. For example, when performing emotion classification on a facial image of a person whose eyebrows are at a height greater than the average, the facial image may be misclassified as surprise even when it is actually emotionless.
The present disclosure provides a method and an apparatus for generating a reenacted image. The present disclosure also provides a computer-readable recording medium having recorded thereon a program for executing the method in a computer. The technical objects of the present disclosure are not limited to the technical objects described above, and other technical objects may be inferred from the following embodiments.
Additional aspects will be set forth in part in the description which follows and, in part, will be apparent from the description, or may be learned by practice of the presented embodiments of the disclosure.
According to an aspect of the present disclosure, a method of generating a reenacted image includes: extracting a landmark from each of a driver image and a target image; generating a driver feature map based on pose information and expression information of a first face shown in the driver image; generating a target feature map and a pose-normalized target feature map based on style information of a second face shown in the target image; generating a mixed feature map by using the driver feature map and the target feature map; and generating the reenacted image by using the mixed feature map and the pose-normalized target feature map.
According to another aspect of the present disclosure, a computer-readable recording medium includes a recording medium having recorded thereon a program for executing the method described above on a computer.
According to another aspect of the present disclosure, an apparatus for generating a reenacted image includes: a landmark transformer configured to extract a landmark from each of a driver image and a target image; a first encoder configured to generate a driver feature map based on pose information and expression information of a first face shown in the driver image; a second encoder configured to generate a target feature map and a pose-normalized target feature map based on style information of a second face shown in the target image; an image attention unit configured to generate a mixed feature map by using the driver feature map and the target feature map; and a decoder configured to generate the reenacted image by using the mixed feature map and the pose-normalized target feature map.
The above and other aspects, features, and advantages of certain embodiments of the disclosure will be more apparent from the following description taken in conjunction with the accompanying drawings, in which:
Reference will now be made in detail to embodiments, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to like elements throughout. In this regard, the present embodiments may have different forms and should not be construed as being limited to the descriptions set forth herein. Accordingly, the embodiments are merely described below, by referring to the figures, to explain aspects. As used herein, the term “and/or” includes any and all combinations of one or more of the associated listed items. Expressions such as “at least one of,” when preceding a list of elements, modify the entire list of elements and do not modify the individual elements of the list.
Although the terms used in the embodiments are selected from among common terms that are currently widely used, the terms may be different according to an intention of one of ordinary skill in the art, a precedent, or the advent of new technology. Also, in particular cases, the terms are discretionally selected by the applicant of the present disclosure, in which case, the meaning of those terms will be described in detail in the corresponding part of the detailed description. Therefore, the terms used in the specification are not merely designations of the terms, but the terms are defined based on the meaning of the terms and content throughout the specification.
Throughout the specification, when a part “includes” a component, it means that the part may additionally include other components rather than excluding other components as long as there is no particular opposing recitation. Also, the terms described in the specification, such as “ . . . er (or)”, “ . . . unit”, “ . . . module”, etc., denote a unit that performs at least one function or operation, which may be implemented as hardware or software or a combination thereof.
In addition, although the terms such as “first” or “second” may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another element.
Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. The embodiments may, however, be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein.
The present disclosure is based on the paper entitled ‘MarioNETte: Few-shot Face Reenactment Preserving Identity of Unseen Targets’ (arXiv: 1911.08139v1, [cs.CV], 19 Nov. 2019). Therefore, the descriptions in the paper, including those omitted herein, may be employed in the following description.
Hereinafter, embodiments will be described in detail with reference to the drawings.
Referring to
The server 100 may be connected to an external device through a communication network. The server 100 may transmit data to or receive data from an external device (e.g., the first terminal 10 or the second terminal 20) connected thereto.
For example, the communication network may include a wired communication network, a wireless communication network, and/or a complex communication network. In addition, the communication network may include a mobile communication network such as Third Generation (3G), Long-Term Evolution (LTE), or LTE Advanced (LTE-A). Also, the communication network may include a wired or wireless communication network such as Wi-Fi, universal mobile telecommunications system (UMTS)/general packet radio service (GPRS), and/or Ethernet.
The communication network may include a short-range communication network such as magnetic secure transmission (MST), radio frequency identification (RFID), near-field communication (NFC), ZigBee, Z-Wave, Bluetooth, Bluetooth Low Energy (BLE), or infrared (IR) communication. In addition, the communication network may include a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN).
The server 100 may receive data from at least one of the first terminal 10 and the second terminal 20. The server 100 may perform an operation by using data received from at least one of the first terminal 10 and the second terminal 20. The server 100 may transmit a result of the operation to at least one of the first terminal 10 and the second terminal 20.
The server 100 may receive a relay request from at least one of the first terminal 10 and the second terminal 20. The server 100 may select the terminal that has transmitted the relay request. For example, the server 100 may select the first terminal 10 and the second terminal 20.
The server 100 may relay a communication connection between the selected first terminal 10 and second terminal 20. For example, the server 100 may relay a video call connection between the first terminal 10 and the second terminal 20 or may relay a text transmission/reception connection. The server 100 may transmit, to the second terminal 20, connection information about the first terminal 10, and may transmit, to the first terminal 10, connection information about the second terminal 20.
The connection information about the first terminal 10 may include, for example, an IP address and a port number of the first terminal 10. The first terminal 10 having received the connection information about the second terminal 20 may attempt to connect to the second terminal 20 by using the received connection information.
When an attempt by the first terminal 10 to connect to the second terminal 20 or an attempt by the second terminal 20 to connect to the first terminal 10 is successful, a video call session between the first terminal 10 and the second terminal 20 may be established. The first terminal 10 may transmit an image or sound to the second terminal 20 through the video call session. The first terminal 10 may encode the image or sound into a digital signal and transmit a result of the encoding to the second terminal 20.
Also, the first terminal 10 may receive an image or sound from the second terminal 20 through the video call session. The first terminal 10 may receive an image or sound encoded into a digital signal and decode the received image or sound.
The second terminal 20 may transmit an image or sound to the first terminal 10 through the video call session. Also, the second terminal 20 may receive an image or sound from the first terminal 10 through the video call session. Accordingly, a user of the first terminal 10 and a user of the second terminal 20 may make a video call with each other.
The first terminal 10 and the second terminal 20 may be, for example, a desktop computer, a laptop computer, a smart phone, a smart tablet, a smart watch, a mobile terminal, a digital camera, a wearable device, or a portable electronic device. The first terminal 10 and the second terminal 20 may execute a program or an application. The first terminal 10 and the second terminal 20 may be of the same type or different types.
The server 100 may generate a reenacted image by using a driver image and a target image. For example, each of the images may be an image of the face of a person or an animal, but is not limited thereto. Hereinafter, a driver image, a target image, and a reenacted image according to an embodiment will be described in detail with reference to
For example, the target image 210 may be an image of the face of a person other than the users of the terminals 10 and 20, or an image of the face of one of the users of the terminals 10 and 20 that is different from the driver image 220. In addition, the target image 210 may be a static image or a dynamic image.
The face in the reenacted image 230 has the identity of the face in the target image 210 (hereinafter, referred to as ‘target face’) and the pose and facial expression of the face in the driver image 220 (hereinafter, referred to as a ‘driver face’). Here, the pose may include a movement, position, direction, rotation, inclination, etc. of the face. Meanwhile, the facial expression may include the position, angle, and/or direction of a facial contour. In this embodiment, a facial contour may include, but is not limited to, an eye, nose, and/or mouth.
In detail, when comparing the target image 210 with the reenacted image 230, the two images 210 and 230 show the same person with different facial expressions. That is, the eyes, nose, mouth, and hair style of the target image 210 are identical to those of the reenacted image 230, respectively.
The facial expression and pose shown in the reenacted image 230 are substantially the same as the facial expression and pose of the driver face. For example, when the mouth of the driver face is open, the reenacted image 230 is generated in which the mouth of a face is open; and when the head of the driver face is turned to the right or left, the reenacted image 230 is generated in which the head of a face is turned to the right or left.
When the driver image 220 is a dynamic image in which the driver face continuously changes, the reenacted image 230 may be generated in which the target image 210 is transformed according to the pose and facial expression of the driver face.
Meanwhile, the quality of the reenacted image 230 generated by using an existing technique in the related art may be seriously degraded. In particular, when only a small number of target images 210 are available (i.e., in a few-shot setting) and the identity of the target face does not coincide with the identity of the driver face, the quality of the reenacted image 230 may be significantly low.
By using a method of generating a reenacted image according to an embodiment, the reenacted image 230 may be generated with high quality even in a few-shot setting. Hereinafter, the method of generating a reenacted image will be described in detail with reference to
Operations of the flowchart shown in
In operation 310, the apparatus 400 extracts a landmark from each of a driver image and a target image. In other words, the apparatus 400 extracts at least one landmark from the driver image and extracts at least one landmark from the target image.
For example, the target image may include at least one frame. For example, when the target image includes a plurality of frames, the target image may be a dynamic image (e.g., a video image) in which the target face moves according to a continuous flow of time.
The landmark may include information about a position corresponding to at least one of the eyes, nose, mouth, eyebrows, and ears of each of the driver face and the target face. For example, the apparatus 400 may extract a plurality of three-dimensional landmarks from each of the driver image and the target image. Then, the apparatus 400 may generate a two-dimensional landmark image by using the extracted three-dimensional landmarks.
For example, the apparatus 400 may extract an expression landmark and an identity landmark from each of the driver image and the target image.
For example, the expression landmark may include expression information and pose information of the driver face and/or the target face. Here, the expression information may include information about the position, angle, and direction of an eye, a nose, a mouth, a facial contour, etc. In addition, the pose information may include information such as the movement, position, direction, rotation, and inclination of the face.
For example, the identity landmark may include style information of the driver face and/or the target face. Here, the style information may include texture information, color information, shape information, etc. of the face.
In operation 320, the apparatus 400 generates a driver feature map based on pose information and expression information of a first face in the driver image.
The first face refers to the driver face. As described above with reference to
For example, the apparatus 400 may generate the driver feature map by inputting the pose information and the expression information of the first face into an artificial neural network. Here, the artificial neural network may include a plurality of artificial neural networks that are separated from each other, or may be implemented as a single artificial neural network.
According to an embodiment, the expression information or the pose information may correspond to the expression landmark obtained in operation 310.
In operation 330, the apparatus 400 generates a target feature map and a pose-normalized target feature map based on style information of a second face in the target image.
The second face refers to the target face. As described above with reference to
The style information may include texture information, color information, and/or shape information. Accordingly, the style information of the second face may include texture information, color information, and/or shape information, corresponding to the second face.
According to an embodiment, the style information may correspond to the identity landmark obtained in operation 310.
The target feature map may include the style information and pose information of the second face. In addition, the pose-normalized target feature map corresponds to an output of an artificial neural network into which the style information of the second face is input. In other words, the pose-normalized target feature map may include information corresponding to a unique feature of the second face, excluding the pose information of the second face. That is, it may be understood that the target feature map includes data corresponding to the expression landmark obtained from the second face, and that the pose-normalized target feature map includes data corresponding to the identity landmark obtained from the second face.
In operation 340, the apparatus 400 generates a mixed feature map by using the driver feature map and the target feature map.
For example, the apparatus 400 may generate the mixed feature map by inputting the pose information and the expression information of the first face and the style information of the second face into an artificial neural network. Accordingly, the mixed feature map may be generated such that the second face has the pose and facial expression corresponding to the landmark of the first face. In addition, spatial information of the second face included in the target feature map may be reflected in the mixed feature map.
In operation 350, the apparatus 400 generates a reenacted image by using the mixed feature map and the pose-normalized target feature map.
Accordingly, the reenacted image may be generated to have the identity of the second face and the pose and facial expression of the first face.
Hereinafter, an example of an operation of the apparatus 400 will be described in detail with reference to
Referring to
In addition, it will be understood by one of skill in the art that one or more of the landmark transformer 410, the first encoder 420, the second encoder 430, the image attention unit 440, and the decoder 450 of the apparatus 400 may be implemented as an independent apparatus.
In addition, the landmark transformer 410, the first encoder 420, the second encoder 430, the image attention unit 440, and the decoder 450 may be implemented as at least one processor. Here, the processor may be implemented as an array of a plurality of logic gates, or may be implemented as a combination of a general-purpose microprocessor and a memory storing a program executable by the microprocessor. In addition, it will be understood by one of skill in the art that the landmark transformer 410, the first encoder 420, the second encoder 430, the image attention unit 440, and the decoder 450 may be implemented as different types of hardware.
For example, the apparatus 400 of
As another example, the apparatus 400 of
Meanwhile, the apparatus 400 shown in
The apparatus 400 receives a driver image x and target images yi, and transmits the received driver image x and target images yi to the landmark transformer 410. Also, the apparatus 400 transfers the target images yi to the second encoder 430, which will be described below. Here, i is a natural number greater than or equal to 2.
The landmark transformer 410 extracts a landmark from each of the driver image x and the target images yi.
For example, the landmark transformer 410 may generate a landmark image based on the driver image x and the target images yi. In detail, the landmark transformer 410 may extract three-dimensional landmarks from each of the driver image x and the target images yi, and render the extracted three-dimensional landmarks to two-dimensional landmark images rx and riy. That is, the landmark transformer 410 generates the two-dimensional landmark image rx for the driver image x by using the three-dimensional landmarks of the driver image x, and generates the two-dimensional landmark images riy for the target images yi by using the three-dimensional landmarks of the target images yi. An example in which the landmark transformer 410 extracts the three-dimensional landmarks of the driver image x and the target images yi will be described below with reference to
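As an illustrative sketch only (not the disclosed implementation), rendering extracted three-dimensional landmarks into a two-dimensional landmark image such as rx or riy may proceed roughly as follows. The detector detect_3d_landmarks, the 68-keypoint layout, and the contour grouping are assumptions made for illustration.

```python
# Illustrative sketch: rasterize 3-D facial landmarks into a 2-D landmark image.
# `detect_3d_landmarks` is a hypothetical detector returning an (N, 3) array of
# (x, y, z) keypoints in pixel coordinates; any 3-D facial landmark detector could be used.
import numpy as np
import cv2

# Assumed 68-point grouping of keypoints into facial contours (illustrative only).
CONTOURS = {
    "jawline": list(range(0, 17)),
    "right_eyebrow": list(range(17, 22)),
    "left_eyebrow": list(range(22, 27)),
    "nose": list(range(27, 36)),
    "right_eye": list(range(36, 42)),
    "left_eye": list(range(42, 48)),
    "mouth": list(range(48, 68)),
}

def rasterize_landmarks(landmarks_3d: np.ndarray, height: int, width: int) -> np.ndarray:
    """Project 3-D landmarks onto the image plane and draw their outlines."""
    canvas = np.zeros((height, width, 3), dtype=np.uint8)
    points_2d = landmarks_3d[:, :2].astype(np.int32)   # drop the depth coordinate
    for name, indices in CONTOURS.items():
        pts = points_2d[indices].reshape(-1, 1, 2)
        closed = name in ("right_eye", "left_eye", "mouth")
        cv2.polylines(canvas, [pts], isClosed=closed, color=(255, 255, 255), thickness=1)
    return canvas

# Usage (assuming detect_3d_landmarks exists):
# r_x = rasterize_landmarks(detect_3d_landmarks(driver_image), 256, 256)
```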
As described above with reference to
The first encoder 420 generates a driver feature map zx based on pose information and expression information of a first face in the driver image x.
In detail, the first encoder 420 generates the driver feature map zx based on at least one of the pose information and the expression information of the driver face. For example, the first encoder 420 may extract the pose information and the expression information of the driver face from the two-dimensional landmark image rx, and generate the driver feature map zx by using the extracted information.
Here, it may be understood that the pose information and the expression information correspond to the expression landmark extracted by the landmark transformer 410.
The second encoder 430 may generate target feature maps ziy and a normalized target feature map Ŝ based on style information of a second face in the target images yi.
In detail, the second encoder 430 generates the target feature maps ziy based on the style information of the target face. For example, the second encoder 430 may generate the target feature maps ziy by using the target images yi and the two-dimensional landmark images riy. In addition, the second encoder 430 transforms the target feature maps ziy into the normalized target feature maps Ŝ through a warping function T. Here, the normalized target feature map Ŝ denotes a pose-normalized target feature map. An example in which the second encoder 430 generates the target feature maps ziy and the normalized target feature maps Ŝ will be described below with reference to
Meanwhile, it may be understood that the style information corresponds to the identity landmark extracted by the landmark transformer 410.
The image attention unit 440 generates a mixed feature map zxy by using the driver feature map zx and the target feature maps ziy. An example in which the image attention unit 440 generates the mixed feature map zxy will be described below with reference to
The decoder 450 generates a reenacted image by using the mixed feature map zxy and the normalized target feature maps Ŝ. An example in which the decoder 450 generates the reenacted image will be described below with reference to
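For orientation, the data flow among the components just described may be sketched as follows. The callables stand in for the landmark transformer 410, the first encoder 420, the second encoder 430, the image attention unit 440, and the decoder 450; their interfaces here are assumptions for illustration, not the disclosed implementation.

```python
# Illustrative wiring of the apparatus 400 (hypothetical interfaces).
def reenact(driver_image, target_images,
            landmark_transformer, encoder_driver, encoder_target,
            image_attention, decoder):
    # 1. Landmark transformer: 2-D landmark images for the driver and the targets.
    r_x, r_y_list = landmark_transformer(driver_image, target_images)

    # 2. First encoder: driver feature map z_x from pose and expression information.
    z_x = encoder_driver(r_x)

    # 3. Second encoder: target feature maps z_y^i and pose-normalized feature maps S_hat^i.
    z_y_list, s_hat_list = zip(*(encoder_target(y_i, r_y_i)
                                 for y_i, r_y_i in zip(target_images, r_y_list)))

    # 4. Image attention unit: mixed feature map z_xy (driver as query, targets as memory).
    z_xy = image_attention(z_x, list(z_y_list))

    # 5. Decoder: reenacted image from the mixed and pose-normalized feature maps.
    return decoder(z_xy, list(s_hat_list))
```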
Although not illustrated in
The landmark transformer 410 according to an embodiment utilizes multiple dynamic images of unlabeled faces and is trained in an unsupervised manner. Accordingly, in a few-shot setting, a high-quality reenacted image may be generated even with a large structural difference between landmarks of a driver face and a target face.
In operation 510, the landmark transformer 410 receives an input image and a landmark.
The input image refers to a driver image and/or target images, and the target images may include facial images of an arbitrary person.
In addition, the landmark refers to keypoints of one or more main parts of a face. For example, the landmark included in the face may include information about the position of at least one of the main parts of the face (e.g., eyes, nose, mouth, eyebrows, jawline, and ears). The landmark may include information about the size or shape of at least one of the main parts of the face. The landmark may include information about the color or texture of at least one of the main parts of the face.
The landmark transformer 410 may extract a landmark corresponding to the face in the input image. The landmark may be extracted through a known technique, and the landmark transformer 410 may use any known method. In addition, the present disclosure is not limited to a method performed by the landmark transformer 410 to obtain a landmark.
A landmark may be updated as a sum of an average landmark, an identity landmark, and an expression landmark. For example, when a video image (i.e., a dynamic image) of person c is received as an input image, a landmark of person c in a frame t may be expressed as a sum of an average landmark related to an average identity of collected human faces (i.e., average facial landmark geometry), an identity landmark related to a unique identity of person c (i.e., facial landmark of identity geometry), and an expression landmark of person c in the frame t (i.e., facial landmark of expression geometry). An example of calculating the average facial landmark geometry, the facial landmark of identity geometry, and the facial landmark of expression geometry will be described below in detail with reference to operation 530 in
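Because the rendered equation is not reproduced in the text above, the decomposition just described may be written compactly as follows (a reconstruction from the surrounding description; the original notation may differ):

```latex
\[
  l(c,t) \;=\;
  \underbrace{l_m}_{\text{average geometry}} \;+\;
  \underbrace{l_{id}(c)}_{\text{identity of person } c} \;+\;
  \underbrace{l_{exp}(c,t)}_{\text{expression in frame } t}
\]
```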
In operation 520, the landmark transformer 410 estimates a principal component analysis (PCA) transformation matrix corresponding to the updated landmark.
The updated landmark may be represented by the PCA transformation matrix together with a predetermined unit vector. For example, a first updated landmark may be calculated as a product of the unit vector and a first PCA transformation matrix, and a second updated landmark may be calculated as a product of the unit vector and a second PCA transformation matrix.
The PCA transformation matrix is a matrix that transforms a high-dimensional (e.g., three-dimensional) landmark into low-dimensional (e.g., two-dimensional) data, and may be used in PCA.
PCA is a dimensionality reduction method in which the distribution of data may be preserved as much as possible and new axes orthogonal to each other are searched for to transform variables in a high-dimensional space into variables in a low-dimensional space. In detail, in PCA, first, a hyperplane closest to data may be searched for, and then the data may be projected onto a low-dimensional hyperplane to reduce the dimensionality of the data.
In PCA, a unit vector defining an i-th axis may be referred to as an i-th principal component (PC), and, by linearly combining such axes, high-dimensional data may be transformed into low-dimensional data.
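As a minimal numerical sketch of the dimensionality reduction described above (generic PCA, not the specific learning model of the present disclosure), the principal components may be obtained with a singular value decomposition:

```python
# Minimal PCA sketch: project high-dimensional data onto its first k principal components.
import numpy as np

def pca_project(X: np.ndarray, k: int):
    """X: (n_samples, n_features). Returns (low-dimensional projection, components)."""
    X_centered = X - X.mean(axis=0)                    # center the data
    # Rows of Vt are orthonormal principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:k]                                # (k, n_features) unit vectors
    projection = X_centered @ components.T             # coordinates in the low-dimensional space
    return projection, components

# A flattened landmark can then be approximated as a linear combination of the
# principal components (unit vectors) weighted by PCA coefficients:
#   x ≈ mean + coefficients @ components
```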
For example, the landmark transformer 410 may estimate the transformation matrix by using Equation 1.
In Equation 1, X denotes a high-dimensional landmark, Y denotes a low-dimensional PC, and α denotes a PCA transformation matrix.
As described above, the PC (i.e., the unit vector) may be predetermined. Accordingly, when a new landmark is received, a corresponding PCA transformation matrix may be determined. In this case, a plurality of PCA transformation matrices may exist corresponding to one landmark.
In operation 520, the landmark transformer 410 may use a pre-trained learning model to estimate a PCA transformation matrix. Here, the learning model refers to a model that is pre-trained to estimate a PCA transformation matrix from an arbitrary facial image and a landmark corresponding thereto.
The learning model may be trained to estimate a PCA transformation matrix from a facial image and a landmark corresponding to the facial image. In this case, several PCA transformation matrices may exist corresponding to one high-dimensional landmark, and the learning model may be trained to output only one PCA transformation matrix among the PCA transformation matrices. Accordingly, the landmark transformer 410 may output one PCA transformation matrix by using an input image and a corresponding landmark.
A landmark to be used as an input to the learning model may be extracted from a facial image and obtained through a known method of visualizing the facial image.
The learning model may be trained to classify a landmark into a plurality of semantic groups corresponding to the main parts of a face, respectively, and output PCA transformation coefficients corresponding to the plurality of semantic groups, respectively. Here, the semantic groups may be classified to correspond to eyebrows, eyes, nose, mouth, and/or jawline.
The landmark transformer 410 may classify a landmark into semantic groups in subdivided units by using the learning model, and estimate PCA transformation matrices corresponding to the classified semantic groups.
In operation 530, the landmark transformer 410 calculates an expression landmark and an identity landmark corresponding to the input image by using the PCA transformation matrix.
A landmark may be decomposed into a plurality of sub-landmarks. In detail, when a video image (i.e., a dynamic image) of person c is received as an input image, a landmark of person c in a frame t may be decomposed into an average landmark, an identity landmark, and an expression landmark, as described above.
For example, the landmark may be expressed as Equation 2.
In Equation (2),
In addition, in Equation 2,
In Equation 3, T denotes the total number of frames included in the dynamic image. Accordingly,
In addition, in Equation 2,
In addition, in Equation 2,
Equation 5 represents a result of performing PCA on each of the semantic groups (e.g., the right eye, left eye, nose, and mouth) of person c. In Equation 5, nexp denotes the sum of the numbers of expression bases of all semantic groups, bexp denotes an expression basis that is a PCA basis, and α denotes a PCA coefficient. α corresponds to a PCA coefficient of the PCA transformation matrix corresponding to each semantic group estimated in operation 520.
In other words, bexp denotes a unit vector, and a high-dimensional expression landmark may be defined as a combination of low-dimensional unit vectors. In addition, nexp denotes the total number of facial expressions that person c may make with his/her right eye, left eye, nose, mouth, etc.
The landmark transformer 410 separates expression landmarks into semantic groups of the face (e.g., mouth, nose, and eyes) and performs PCA on each group to extract the expression bases from the training data.
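A rough sketch of this per-semantic-group basis extraction is given below; it reuses the pca_project helper sketched earlier, and the keypoint index ranges of each semantic group are illustrative assumptions.

```python
# Sketch: extract expression bases b_exp by running PCA separately on each semantic group.
import numpy as np

SEMANTIC_GROUPS = {
    "right_eye": range(36, 42), "left_eye": range(42, 48),
    "nose": range(27, 36), "mouth": range(48, 68),
}

def extract_expression_bases(expression_landmarks: np.ndarray, k_per_group: int):
    """expression_landmarks: (n_samples, N, 2) expression landmarks from training data.
    Returns a dict mapping each semantic group to its (k_per_group, group_size*2) bases."""
    bases = {}
    for name, idx in SEMANTIC_GROUPS.items():
        group = expression_landmarks[:, list(idx), :].reshape(len(expression_landmarks), -1)
        _, components = pca_project(group, k_per_group)   # PCA on this group only
        bases[name] = components                          # expression bases for this group
    return bases
```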
Accordingly, the expression landmark
The landmark transformer 410 may train a learning model to estimate the PCA coefficient α(c,t) by using an image x(c,t) and the landmark
As described above with reference to Equation 2, a landmark may be defined as a sum of an average landmark, an identity landmark, and an expression landmark. The landmark transformer 410 may calculate an expression landmark through operation 530. Therefore, the landmark transformer 410 may calculate an identity landmark as shown in Equation 6.
In Equation 6,
In Equation 7, λexp denotes a hyperparameter that controls the intensity of an expression predicted by the landmark transformer 410.
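Since the rendered Equations 5 through 7 are not reproduced in the text above, the relations described in the preceding paragraphs may be reconstructed as follows (a plausible form inferred from the surrounding description; the original notation may differ):

```latex
% Expression landmark as a linear combination of the PCA expression bases (cf. Equation 5):
l_{exp}(c,t) \;\approx\; \sum_{i=1}^{n_{exp}} \alpha_i(c,t)\, b_{exp,i}

% Predicted expression landmark scaled by the intensity hyperparameter (cf. Equation 7):
\hat{l}_{exp}(c,t) \;=\; \lambda_{exp} \sum_{i=1}^{n_{exp}} \alpha_i(c,t)\, b_{exp,i}

% Identity landmark obtained by removing the average and expression parts (cf. Equation 6):
\hat{l}_{id}(c) \;=\; l(c,t) \;-\; l_m \;-\; \hat{l}_{exp}(c,t)
```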
When the target images yi are received as input images, the landmark transformer 410 takes the mean over all identity landmarks l̂id(cy). In summary, when the driver image x and the target images yi are received as input images, and a target landmark l̂(cy, ty) and a driver landmark l̂(cx, tx) are received, the landmark transformer 410 transforms the received landmark as shown in Equation 8.
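One plausible form of the transformation described as Equation 8, combining the mean target identity with the driver pose and expression, is the following (a reconstruction under the assumption that the identity landmarks are averaged over the K target images; the original notation may differ):

```latex
\hat{l}(c_x \rightarrow c_y,\, t_x) \;=\;
  l_m \;+\; \frac{1}{K}\sum_{i=1}^{K} \hat{l}_{id}(c_y, t_i) \;+\; \hat{l}_{exp}(c_x, t_x)
```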
The landmark transformer 410 performs denormalization to recover to the original scale, translation, and rotation, and then performs rasterization. A landmark generated through rasterization may be transferred to the first encoder 420 and the second encoder 430.
Referring to
Although
Also, the learning models described with reference to
The landmark transformer 410 illustrated in
When an input image x(c,t) and a normalized landmark
In detail, the first neural network 411 extracts an image feature from the input image x(c,t). In addition, the landmark transformer 410 performs first processing for removing an average landmark from the normalized landmark.
In addition, the landmark transformer 410 performs second processing for calculating an expression landmark l̂exp(c,t) according to the PCA coefficient and Equation 7. Furthermore, the landmark transformer 410 performs third processing for calculating an identity landmark l̂id(c) by using the result of the first processing and the result of the second processing.
As described above with reference to
Meanwhile, the landmark transformer 410 may separate a landmark from an image even when a large number of target images 210 are given (i.e., in a many-shot setting). Hereinafter, an example in which the landmark transformer 410 extracts (separates) a landmark from an image in a many-shot setting will be described with reference to
In operation 710, the landmark transformer 410 receives a plurality of dynamic images.
Here, the dynamic image includes a plurality of frames. Only one person may be captured in each of the dynamic images. That is, only the face of one person is captured in one dynamic image, and respective faces captured in the plurality of dynamic images may be of different people.
In operation 720, the landmark transformer 410 calculates an average landmark lm of the plurality of dynamic images.
For example, the average landmark lm may be calculated by Equation 9.
In Equation 9, C denotes the number of input images, and T denotes the number of frames included in each of the input images.
The landmark transformer 410 may extract a landmark l(c,t) of each of the faces captured in the C dynamic images, respectively. Then, the landmark transformer 410 calculates an average value of all of the extracted landmarks l(c,t), and sets the calculated average value as the average landmark lm.
In operation 730, the landmark transformer 410 calculates a landmark l(c,t) for a specific frame among a plurality of frames included in a specific dynamic image containing a specific face among the dynamic images.
For example, the landmark l(c,t) for the specific frame may be keypoint information of the face included in a t-th frame of a c-th dynamic image among the C dynamic images. That is, it may be assumed that the specific dynamic image is the c-th dynamic image and the specific frame is the t-th frame.
In operation 740, the landmark transformer 410 calculates an identity landmark lid(c) of the face captured in the specific dynamic image.
For example, the landmark transformer 410 may calculate the identity landmark lid(c) by using Equation 10.
Various facial expressions of the specific face are captured in a plurality of frames included in the c-th dynamic image. Therefore, in order to calculate the identity landmark lid(c), the landmark transformer 410 may assume that a mean value of the facial expression landmarks lexp of the specific face included in the c-th dynamic image is 0. Accordingly, the identity landmark lid(c) may be calculated without considering the mean value of the expression landmarks lexp of the specific face.
In summary, the identity landmark data lid(c) may be defined as a value obtained by subtracting the average landmark lm of the plurality of dynamic images from the mean value of the respective landmarks l(c,t) of the plurality of frames included in the c-th dynamic image.
In operation 750, the landmark transformer 410 may calculate an expression landmark lexp(c,t) of the face captured in the specific frame included in the specific dynamic image.
That is, the landmark transformer 410 may calculate the expression landmark lexp(c,t) of the face captured in the t-th frame of the c-th dynamic image. For example, the expression landmark lexp(c,t) may be calculated by Equation 11.
The expression landmark lexp(c,t) may correspond to an expression of the face captured in the t-th frame and movement information of parts of the face, such as the eyes, eyebrows, nose, mouth, and chin line. In detail, the expression landmark lexp(c,t) may be defined as a value obtained by subtracting the average landmark lm and the identity landmark lid(c) from the landmark l(c,t) for the specific frame.
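The calculations of operations 720 through 750 may be sketched as follows. The array shape, with C dynamic images of T frames each and N three-dimensional keypoints per frame, is an assumption for illustration, and the code follows the verbal description of Equations 9 through 11 rather than the rendered equations themselves.

```python
# Sketch of the many-shot landmark decomposition described in operations 720-750.
import numpy as np

def decompose_landmarks(landmarks: np.ndarray):
    """landmarks: (C, T, N, 3) keypoints for C dynamic images with T frames each."""
    # Operation 720 (Equation 9): average landmark over all videos and all frames.
    l_m = landmarks.mean(axis=(0, 1))            # (N, 3)

    # Operation 740 (Equation 10): per-person identity landmark, assuming the mean
    # expression landmark over a video is zero.
    per_person_mean = landmarks.mean(axis=1)     # (C, N, 3)
    l_id = per_person_mean - l_m                 # (C, N, 3)

    # Operation 750 (Equation 11): per-frame expression landmark.
    l_exp = landmarks - l_m - l_id[:, None]      # (C, T, N, 3)
    return l_m, l_id, l_exp
```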
As described above with reference to
Referring to
For example, the second encoder 430 may adopt a U-Net architecture. U-Net is a U-shaped network that basically performs a segmentation function and has a symmetric shape.
In
The second encoder 430 generates the encoded target feature map Sj and the normalization flow map fy by using the rendered target landmark ry and the target image y. Then, the second encoder 430 generates the normalized target feature map Ŝ by applying the generated encoded target feature map Sj and the normalization flow map fy to the warping function T.
Here, it may be understood that the normalized target feature map Ŝ is a pose-normalized target feature map. Accordingly, it may be understood that the warping function T is a function of normalizing pose information of a target face and generating data including only normalized pose information and a unique style of the target face (i.e., an identity landmark).
In summary, the normalized target feature map Ŝ may be expressed as Equation 12.
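A minimal sketch of this pose normalization, assuming PyTorch and a bilinear grid-sampling warp as the warping function T (an assumption, not the disclosed implementation), is:

```python
# Sketch: warp an encoded target feature map S with a normalization flow map f_y.
# The flow map is assumed to hold per-pixel sampling coordinates normalized to [-1, 1].
import torch
import torch.nn.functional as F

def warp(feature_map: torch.Tensor, flow_map: torch.Tensor) -> torch.Tensor:
    """feature_map: (B, C, H, W); flow_map: (B, 2, H, W) -> warped feature map."""
    grid = flow_map.permute(0, 2, 3, 1)          # (B, H, W, 2) sampling grid
    return F.grid_sample(feature_map, grid, mode="bilinear", align_corners=False)

# S_hat = warp(S, f_y)   # pose-normalized target feature map (cf. Equation 12)
```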
Referring to
To transfer the style information of a target to a driver, existing approaches encode the target information as a vector and mix it with a driver feature by concatenation or adaptive instance normalization (AdaIN) layers. However, encoding a target as a spatially agnostic vector causes spatial information of the target to be lost. In addition, because these methods have no innate design for multiple target images, summary statistics (e.g., a mean or a maximum) are used to deal with multiple targets, which may cause details of the target to be lost. The image attention unit 440 according to an embodiment alleviates the aforementioned problems.
The image attention unit 440 generates the mixed feature map 930 by using a driver feature map 910 and the target feature maps 920. Here, the driver feature map 910 may serve as an attention query, and the target feature maps 920 may serve as attention memory.
Although one driver feature map 910 and three target feature maps 920 are illustrated in
The image attention unit 440 attends to appropriate positions of the respective landmarks 941, 942, 943, and 944 while processing the plurality of target feature maps 920. In other words, the landmark 941 of the driver feature map 910 and the landmarks 942, 943, and 944 of the target feature maps 920 correspond to a landmark 945 of the mixed feature map 930.
The driver feature map 910 and the target feature maps 920 input to the image attention unit 440 may include a landmark of a driver face and a landmark of a target face, respectively. In order to generate an image of the target face corresponding to the movement and expression of the driver face while preserving the identity of the target face, the image attention unit 440 may perform an operation of matching the landmark of the driver face with the landmark of the target face.
For example, in order to control the movement of the target face according to the movement of the driver face, the image attention unit 440 may link landmarks of the driver face, such as keypoints of the eyes, eyebrows, nose, mouth, and jawline, to landmarks of the target face, such as corresponding keypoints of the eyes, eyebrows, nose, mouth, and jawline, respectively. Moreover, in order to control the expression of the target face according to the expression of the driver face, the image attention unit 440 may link expression landmarks of the driver face, such as the eyes, eyebrows, nose, mouth, and jawline, to corresponding expression landmarks of the target face, such as the eyes, eyebrows, nose, mouth, and jawline, respectively.
For example, the image attention unit 440 may detect the eyes in the driver feature map 910, then detect the eyes in the target feature maps 920, and then generate the mixed feature map 930 such that the eyes of the target feature maps 920 reenact the movement of the eyes of the driver feature map 910. The image attention unit 440 may perform substantially the same operation on other feature points in the face.
The image attention unit 440 may generate the mixed feature map 930 by inputting pose information of the driver face and style information of the target face into an artificial neural network. For example, in an attention block 441, an attention may be calculated based on Equations 13 and 14.
In Equation 13, zx denotes the driver feature map 910 and satisfies zx∈h
In Equation 14, f denotes a flattening function, which is ƒ: d
For example, first, the attention block 441 divides the number of channels of the positional encoding in half. Then, the attention block 441 utilizes half of them to encode the horizontal coordinate and the rest of them to encode the vertical coordinate. To encode the relative position, the attention block 441 normalizes the absolute coordinate by the width and the height of the feature map. Thus, given a feature map of z∈h
The image attention unit 440 generates the mixed feature map 930 by using instance normalization layers 442 and 444, a residual connection, and a convolution layer 443. The image attention unit 440 provides a direct mechanism of transferring information from the plurality of target feature maps 920 to the pose of the driver face.
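A simplified sketch of the image attention unit 440 is given below, treating the driver feature map as the attention query and the target feature maps as the attention memory. The positional encoding of Equations 13 and 14, the projection sizes, and the exact ordering of the normalization layers are simplified or assumed here.

```python
# Simplified sketch of the image attention unit (positional encoding omitted).
import math
import torch
import torch.nn as nn

class ImageAttentionSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, kernel_size=1)   # query projection (driver)
        self.k = nn.Conv2d(channels, channels, kernel_size=1)   # key projection (targets)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)   # value projection (targets)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)
        self.norm1 = nn.InstanceNorm2d(channels, affine=True)
        self.norm2 = nn.InstanceNorm2d(channels, affine=True)

    def forward(self, z_x, z_y_list):
        # z_x: (B, C, H, W) driver feature map; z_y_list: K target maps, each (B, C, H, W).
        b, c, h, w = z_x.shape
        k_imgs = len(z_y_list)
        query = self.q(z_x).flatten(2).transpose(1, 2)                  # (B, HW, C)
        memory = torch.stack(z_y_list, dim=1).flatten(0, 1)             # (B*K, C, H, W)
        key = self.k(memory).flatten(2).reshape(b, k_imgs, c, h * w)    # (B, K, C, HW)
        value = self.v(memory).flatten(2).reshape(b, k_imgs, c, h * w)  # (B, K, C, HW)
        key = key.permute(0, 2, 1, 3).flatten(2)                        # (B, C, K*HW)
        value = value.permute(0, 1, 3, 2).flatten(1, 2)                 # (B, K*HW, C)
        attn = torch.softmax(query @ key / math.sqrt(c), dim=-1)        # (B, HW, K*HW)
        mixed = (attn @ value).transpose(1, 2).reshape(b, c, h, w)      # (B, C, H, W)
        mixed = self.norm1(mixed) + z_x                                 # residual connection
        return self.norm2(self.out(mixed))                              # mixed feature map z_xy
```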
Referring to
In
In addition, a warp-alignment block 451 of the decoder 450 applies a warping function T by using an output u of the previous block of the decoder 450 and the normalized target feature map. The warping function T may be used for generating a reenacted image in which the movement and pose of a driver face are transferred to a target face preserving its unique identity, and may differ from the warping function T applied in the second encoder 430.
In a few-shot setting, the decoder 450 averages resolution-compatible feature maps from different target images (i.e., Ŝj = Σi Ŝji/K). To apply the pose-normalized feature maps to the pose of the driver face, the decoder 450 generates an estimated flow map fu of the driver face by using a 1×1 convolution block that takes u as an input. Then, alignment by T(Ŝj; fu) may be performed, and the result of the alignment may be concatenated to u and then fed into a 1×1 convolution block and a residual upsampling block.
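A sketch of this warp-alignment step, reusing the warp helper sketched above for the second encoder and assuming that u and the normalized target feature maps share the same resolution and number of channels, is:

```python
# Sketch of the warp-alignment block 451 (illustrative layer sizes; `warp` as sketched above).
import torch
import torch.nn as nn

class WarpAlignmentSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 convolution estimating a 2-channel flow map f_u of the driver pose from u.
        self.flow = nn.Conv2d(channels, 2, kernel_size=1)
        # 1x1 convolution applied after concatenating u with the aligned target features.
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)

    def forward(self, u, s_hat_list):
        # Few-shot setting: average the resolution-compatible normalized feature maps.
        s_hat = torch.stack(s_hat_list, dim=0).mean(dim=0)    # (B, C, H, W)
        f_u = self.flow(u)                                    # estimated flow map of the driver face
        aligned = warp(s_hat, f_u)                            # alignment by T(S_hat; f_u)
        # Concatenate to u; the result would then go to a residual upsampling block.
        return self.fuse(torch.cat([u, aligned], dim=1))
```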
As described above with reference to
Meanwhile, based on a target image, which is a static image, a dynamic image may be generated as a reenacted image. For example, when a target image is input, a dynamic image may be generated as a reenacted image by using an image transformation template. Here, the image transformation template may be pre-stored or input from an external source.
Hereinafter, an example in which a dynamic image is generated as a reenacted image will be described with reference to
Referring to
The processor 1110 may be an example of the apparatus 400 described above with reference to
In addition, the apparatus 1100 may be included in the server 100 and/or the terminals 10 and 20 of
The processor 1110 receives a target image y. For example, the target image y may be a static image. The size of a target face captured in the target image y may vary, and for example, the size of the face captured in target image 1 may be 100×100 pixels, and the size of the face captured in target image 2 may be 200×200 pixels.
The processor 1110 extracts only a facial region from the target image y. For example, the processor 1110 may extract a region corresponding to the target face from the target image y at a preset size. For example, when the preset size is 100×100 and the size of the facial region included in the target image is 200×200, the processor 1110 may reduce the facial image having a size of 200×200 to an image having a size of 100×100 and then extract the reduced region. Alternatively, the processor 1110 may extract the facial image having a size of 200×200 and then convert it into an image having a size of 100×100.
The processor 1110 may obtain at least one image transformation template. The image transformation template may be understood as a tool for transforming a target image into a new image of a specific shape. For example, when an expressionless face is captured in a target image, a new image in which the expressionless face is transformed into a smiling face may be generated by a specific image transformation template. For example, the image transformation template may be a dynamic image, but is not limited thereto.
The image transformation template may be an arbitrary template that is pre-stored in the memory 1120, or may be a template selected by a user from among a plurality of templates stored in the memory 1120. In addition, the processor 1110 may receive at least one driver image x and use the driver image x as an image transformation template. Accordingly, although omitted below, the image transformation template may be interpreted to be the same as the driver image x. For example, the driver image x may be a dynamic image, but is not limited thereto.
The processor 1110 may generate a reenacted image (e.g., a dynamic image) by transforming an image (e.g., a static image) of the facial region extracted from the target image y by using the image transformation template. An example in which the processor 1110 generates a reenacted image will be described below with reference to
In operation 1210, the processor 1110 receives a target image. Here, the target image may be a static image including a single frame.
In operation 1220, the processor 1110 obtains at least one image transformation template from among a plurality of image transformation templates pre-stored in the memory 1120. Alternatively, the image transformation template may be selected by a user from among the plurality of pre-stored image templates. For example, the image transformation template may be a dynamic image, but is not limited thereto.
Although not illustrated in
In operation 1230, the processor 1110 may generate a dynamic image as a reenacted image by using the image transformation template. In other words, the processor 1110 may generate a dynamic image as a reenacted image by using the target image, which is a static image, and the image transformation template, which is a dynamic image.
For example, the processor 1110 may extract texture information from the face captured in the target image. For example, the texture information may be information about the color and visual texture of a face.
In addition, the processor 1110 may extract a landmark from a region corresponding to the face captured in the image transformation template. An example in which the processor 1110 extracts a landmark from an image transformation template is the same as described above with reference to
For example, a landmark may be obtained from a specific shape, pattern, color, or a combination thereof included in the face of a person, based on an image processing algorithm. Here, the image processing algorithm may include one of scale-invariant feature transform (SIFT), histogram of oriented gradient (HOG), Haar feature, Ferns, local binary pattern (LBP), and modified census transform (MCT), but is not limited thereto.
The processor 1110 may generate a reenacted image by using the texture information and the landmark. An example in which the processor 1110 generates a reenacted image is the same as described above with reference to
The reenacted image may be a dynamic image including a plurality of frames. For example, a change in the expression of the face captured in the image transformation template may be equally reproduced in the reenacted image. That is, at least one intermediate frame may be included between the first frame and the last frame of the reenacted image, the facial expression captured in each intermediate frame may gradually change, and the change in the facial expression may be the same as the change in the facial expression captured in the image transformation template.
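Operation 1230 may be sketched as a frame-by-frame loop; extract_texture, extract_landmark, and combine are hypothetical helpers standing in for the texture extraction, landmark extraction, and image generation described above.

```python
# Sketch of operation 1230: generate a dynamic reenacted image from a static target
# image and a dynamic image transformation template (hypothetical helper functions).
def reenact_with_template(target_image, template_frames,
                          extract_texture, extract_landmark, combine):
    texture = extract_texture(target_image)        # color / visual texture of the target face
    reenacted_frames = []
    for frame in template_frames:                  # one landmark set per template frame
        landmark = extract_landmark(frame)
        reenacted_frames.append(combine(texture, landmark))
    return reenacted_frames                        # frames of the reenacted dynamic image
```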
The processor 1110 according to an embodiment may generate a reenacted image having the same effect as that of a dynamic image (e.g., an image transformation template) in which a user changing his/her facial expression is captured.
The driver image 1320 may be a dynamic image in which a facial expression and/or a pose of a person changes over time. For example, the driver image 1320 of
The person captured in the target image 1310 may be different from the person captured in the driver image 1320. Accordingly, the face captured in the driver image 1320 may be different from the face captured in the target image 1310. For example, by comparing the target face captured in the target image 1310 with the driver face captured in the driver image 1320 in
The processor 1110 generates the reenacted image 1330 by using the target image 1310 and the driver image 1320. Here, the reenacted image 1330 may be a dynamic image. For example, the reenacted image 1330 may be an image in which a person corresponding to the target face makes a facial expression and/or a pose corresponding to the driver face. That is, the reenacted image 1330 may be a dynamic image in which the facial expression and/or the pose of the driver face continuously change.
In the reenacted image 1330 of
As described above with reference to
The facial expression shown in each image transformation template may correspond to one of various facial expressions, such as a sad expression, a happy expression, a winking expression, a depressed expression, a blank expression, a surprised expression, or an angry expression, and the image transformation templates include information about different facial expressions. The various facial expressions correspond to different outline images, respectively. Accordingly, the image transformation templates may include different outline images, respectively.
The processor 1110 may extract a landmark from the image transformation template. For example, the processor 1110 may extract an expression landmark corresponding to the facial expression shown in the image transformation template.
For example, the target image 1510 may contain a smiling face. The facial expression 1520 shown in the image transformation template may include an outline corresponding to the eyebrows, eyes, and mouth of a winking and smiling face.
The processor 1110 may extract texture information of a region corresponding to the face from the target image 1510. Also, the processor 1110 may extract a landmark from the facial expression 1520 shown in the image transformation template. In addition, the processor 1110 may generate the reenacted image 1530 by combining the texture information of the target image 1510 and the landmark of the facial expression 1520 shown in the image transformation template.
Referring to
Here, each of the at least one frame between the first frame 1610 and the last frame 1620 may be an image showing the face with the right eye being gradually closed.
It may be understood that the reenacted image 1730 illustrated in
The processor 1110 may extract texture information of a region corresponding to the face from the target image 1710. Also, the processor 1110 may extract a landmark from the image transformation template 1720. For example, the processor 1110 may extract the landmark from regions corresponding to the eyebrows, eyes, and mouth in the face shown in the image transformation template 1720. The processor 1110 may generate the reenacted image 1730 by combining the texture information of the target image 1710 and the landmark of the image transformation template 1720.
Referring to
Each of the at least one frame between the first frame 1810 and the last frame 1820 of the reenacted image 1730 may include an image showing the face with the right eye being gradually closed and the mouth being gradually open.
As described above, the apparatus 400 may generate a reenacted image containing a face having the identity of a target face and the expression of a driver face, by using a driver image and a target image. Also, the apparatus 400 may accurately separate a landmark even from a small number of images (i.e., in a few-shot setting). Furthermore, the apparatus 400 may separate, from an image, a landmark including more accurate information about the identity and expression of a face shown in the image.
In addition, the apparatus 1100 may generate a reenacted image showing the same facial expression as that captured in a dynamic image in which a user changing his/her facial expression is captured.
Meanwhile, the above-described method may be written as a computer-executable program, and may be implemented in a general-purpose digital computer that executes the program by using a computer-readable recording medium. In addition, the structure of the data used in the above-described method may be recorded in a computer-readable recording medium through various means. Examples of the computer-readable recording medium include storage media such as read-only memory (ROM), random-access memory (RAM), universal serial bus (USB) storage devices, floppy disks, and hard disks, and optical recording media such as compact disc-ROMs (CD-ROMs) and digital versatile discs (DVDs).
According to an embodiment of the present disclosure, a reenacted image containing a face having the identity of a target face and the expression of a driver face may be generated by using a driver image and a target image. In addition, a landmark may be accurately separated even from a small number of images (i.e., in a few-shot setting). Furthermore, a landmark including more accurate information about the identity and expression of a face shown in an image may be separated.
In addition, a user may generate, without directly capturing a dynamic image by himself/herself, a reenacted image having the same effect as that of a dynamic image in which the user changing his/her facial expression is captured.
It will be understood by one of skill in the art that the disclosure may be implemented in a modified form without departing from the intrinsic characteristics of the descriptions provided above. The methods disclosed herein are to be considered in a descriptive sense only, and not for purposes of limitation, and the scope of the present disclosure is defined not by the above descriptions, but by the claims and their equivalents, and all variations within the scope of the claims and their equivalents are to be construed as being included in the present disclosure.
Number | Date | Country | Kind |
---|---|---|---|
10-2019-0141723 | Nov 2019 | KR | national |
10-2019-0177946 | Dec 2019 | KR | national |
10-2019-0179927 | Dec 2019 | KR | national |
10-2020-0022795 | Feb 2020 | KR | national |
This application is a continuation-in-part of U.S. patent application Ser. No. 17/092,486, filed on Nov. 9, 2020, which claims the benefit of Korean Patent Applications No. 10-2019-0141723, filed on Nov. 7, 2019, No. 10-2019-0177946, filed on Dec. 30, 2019, No. 10-2019-0179927, filed on Dec. 31, 2019, and No. 10-2020-0022795, filed on Feb. 25, 2020, in the Korean Intellectual Property Office, the disclosures of which are incorporated herein in their entireties by reference.
 | Number | Date | Country
---|---|---|---
Parent | 17092486 | Nov 2020 | US
Child | 17658620 | | US