This application claims the benefit of Korean Patent Application Nos. 10-2019-0141723, filed on Nov. 7, 2019, 10-2019-0177946, filed on Dec. 30, 2019, 10-2019-0179927, filed on Dec. 31, 2019, and 10-2020-0022795, filed on Feb. 25, 2020, in the Korean Intellectual Property Office, the disclosures of which are incorporated herein by reference in their entireties.
The present disclosure relates to an image conversion apparatus and method, and a computer-readable recording medium, and more particularly, to an image conversion apparatus and method, in which a static image may be converted into a natural moving image, and a computer-readable recording medium.
The present disclosure relates to a landmark data decomposition apparatus and method, and a computer-readable recording medium, and more particularly, to a landmark data decomposition apparatus and method, in which landmark data may be more accurately separated from a face included in an image, and a computer-readable recording medium.
The present disclosure relates to a landmark decomposition apparatus and method, and a computer-readable recording medium, and more particularly, to a landmark decomposition apparatus and method, in which a landmark may be decomposed from one frame or a small number of frames, and a computer-readable recording medium.
The present disclosure relates to an image transformation apparatus and method, and a computer-readable recording medium, and more particularly, to an image transformation apparatus and method, in which an image that is naturally transformed according to characteristics of another image may be generated, and a computer-readable recording medium.
Most portable personal devices have built-in cameras and may thus capture static images or moving images such as videos. Whenever a moving image of a desired facial expression is needed, a user of a portable personal device has to use a built-in camera of the portable personal device to capture an image.
When the moving image of the desired facial expression is not obtained, the user has to repeatedly capture images until a satisfactory result is obtained. Accordingly, there is a need for a method of transforming a static image input by the user into a natural moving image by inserting a desired facial expression into the static image.
Research on techniques that analyze and use an image of a person's face based on facial landmarks obtained by extracting facial key points of the person's face is being actively conducted. The facial landmarks include result values of extracting points of major elements of a face, such as eyes, eyebrows, a nose, a mouth, and a jawline or extracting an outline drawn by connecting the points. The facial landmarks are mainly used in techniques such as facial expression classification, pose analysis, face synthesis, face transformation, etc.
However, facial image analysis and utilization techniques of the related art that are based on facial landmarks do not take into account appearance and emotional characteristics of the face when processing the facial landmarks, which results in a decrease in performance of the techniques. Therefore, in order to improve the performance of the facial image analysis and utilization techniques, there is a need for development of techniques for decomposing the facial landmarks including the emotional characteristics of the face.
The present disclosure provides an image conversion apparatus and method, in which a static image may be converted into a natural moving image, and a computer-readable recording medium.
The present disclosure provides a landmark data decomposition apparatus and method, in which landmark data may be more accurately and precisely decomposed from a face included in an image, and a computer-readable recording medium.
The present disclosure provides a landmark decomposition apparatus and method, in which landmark decomposition may be performed even on an object with only a small amount of data, and a computer-readable recording medium.
The present disclosure provides an image transformation apparatus and method, in which, when a target image is given as an object to be transformed, a user image different from the target image may be used to generate an image that conforms to the user image and also has characteristics of the target image, and a computer-readable recording medium.
An image conversion method according to an embodiment of the present disclosure, which uses an artificial neural network, includes receiving a static image from a user, obtaining at least one image conversion template, and transforming the static image into a moving image by using the obtained image conversion template.
The present disclosure may provide an image conversion apparatus and method, in which, even though a user does not directly capture a moving image, a moving image having the same effect as a moving image captured by the user directly changing his or her facial expression is provided, and a computer-readable recording medium.
The present disclosure may provide an image conversion apparatus and method, in which a moving image generated by converting a static image is provided to a user, to thereby provide the user with an interesting user experience along with the moving image, and a computer-readable recording medium.
The present disclosure may provide a landmark data decomposition apparatus and method, in which landmark data may be more accurately and precisely decomposed from a face included in an image, and a computer-readable recording medium.
The present disclosure may provide a landmark data decomposition apparatus and method, in which landmark data including information about characteristics and facial expressions of a face included in an image may be more accurately decomposed, and a computer-readable recording medium.
The present disclosure may provide a landmark decomposition apparatus and method, in which landmark decomposition may be performed even on an object with only a small amount of data, and a computer-readable recording medium.
The present disclosure may provide an image transformation apparatus and method, in which, when a target image is given as an object to be transformed, a user image different from the target image may be used to generate an image that conforms to the user image and also has characteristics of the target image, and a computer-readable recording medium.
Advantages and features of the present disclosure and methods of accomplishing the same will become more apparent by the embodiments described below in detail with reference to the accompanying drawings. In this regard, the embodiments of the present disclosure may have different forms and should not be construed as being limited to the descriptions set forth herein. Rather, these embodiments will give a comprehensive understanding of the present disclosure and fully convey the scope of the present disclosure to those of ordinary skill in the art, and the present disclosure will only be defined by the appended claims. The same reference numerals refer to the same elements throughout the specification.
Although the terms such as “first”, “second”, and so forth, can be used for describing various elements, the elements are not limited to the terms. The terms as described above may be used only to distinguish one element from another element. Accordingly, first elements mentioned below may refer to second elements within the technical idea of the present disclosure.
The terms used herein should be considered in a descriptive sense only and not for purposes of limitation. As used herein, the singular forms “a”, “an”, and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be understood that the term such as “include (or including)” or “comprise (or comprising)” used herein is inclusive or open-ended and does not exclude the presence or addition of elements or method operations.
Unless defined otherwise, all the terms used herein may be construed as meanings that may be commonly understood by those of ordinary skill in the art. Also, terms such as those defined in generally used dictionaries are not to be construed ideally or excessively, unless clearly defined otherwise.
In embodiments of the present disclosure, the server 10 may receive an image from the terminal 20, transform the received image into an arbitrary form, and transmit the transformed image to the terminal 20. Alternatively, the server 10 may function as a platform for providing a service that the terminal 20 may access and use. The terminal 20 may transform an image selected by a user of the terminal 20 and transmit the transformed image to the server 10.
The server 10 may be connected to a communication network. The server 10 may be connected to another external device through the communication network. The server 10 may transmit or receive data to or from the other external device connected thereto.
The communication network connected to the server 10 may include a wired communication network, a wireless communication network, or a complex communication network combining them. The communication network may include a mobile communication network such as 3G, LTE, or LTE-A. The communication network may include a wired or wireless communication network such as Wi-Fi, UMTS/GPRS, or Ethernet. The communication network may include a short-range communication network such as MST, RFID, NFC, ZigBee, Z-Wave, Bluetooth, BLE, or infrared communication. The communication network may include a LAN, a MAN, or a WAN.
The server 10 may be connected to the terminal 20 through the communication network. When the server 10 is connected to the terminal 20, the server 10 may transmit and receive data to and from the terminal 20 through the communication network. The server 10 may perform an arbitrary operation using the data received from the terminal 20. The server 10 may transmit a result of the operation to the terminal 20.
Examples of the terminal 20 may include a desktop computer, a smartphone, a smart tablet, a smartwatch, a mobile terminal, a digital camera, a wearable device, or a portable electronic device. The terminal 20 may execute a program or an application.
Referring to
The image receiver 110 receives an image from a user. The image may include a user's face, and may include a still image or a static image. A size of the user's face included in the image may vary from image to image. For example, a size of a face included in image 1 may be a pixel size of 100×100, and a size of a face included in image 2 may be a pixel size of 200×200.
The image receiver 110 may extract only a face area from the image received from the user, and then provide the extracted face area to the image transformer 130.
The image receiver 110 may extract an area corresponding to the user's face from the image including the user's face into a predetermined size. For example, when the predetermined size is 100×100 and the size of the area corresponding to the user's face included in the image is 200×200, the image receiver 110 may reduce the image having the size of 200×200 into an image having a size of 100×100, and then extract the reduced image. Alternatively, a method of extracting an image having a size of 200×200 and then transforming the image into an image having a size of 100×100 may be used.
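Purely as an illustration of the face-area extraction and resizing described above, the following sketch uses OpenCV and its bundled Haar cascade face detector; the detector choice, the function name, and the 100×100 target size are assumptions used only for illustration, not part of the disclosed apparatus.

```python
import cv2

def extract_face_area(image_path, target_size=(100, 100)):
    """Detect the largest face in an image, crop it, and resize it to target_size."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # Haar cascade face detector shipped with OpenCV (illustrative choice).
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None

    # Keep the largest detected face area.
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])
    face = image[y:y + h, x:x + w]

    # Reduce (or enlarge) the crop to the predetermined size, e.g. 100x100.
    return cv2.resize(face, target_size)
```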
The template obtainer 120 obtains at least one image conversion template. The image conversion template may be understood as a tool capable of transforming an image received by the image receiver 110 into a new image in a specific shape. For example, when a user's expressionless face is included in the image received by the image receiver 110, a new image including a user's smiling face may be generated by using a specific image conversion template.
The image conversion template may be determined in advance as an arbitrary template, or may be selected by the user.
The image transformer 130 may receive, from the image receiver 110, a static image corresponding to the face area. Also, the image transformer 130 may transform the static image into a moving image by using the image conversion template obtained by the template obtainer 120.
Referring to
The image conversion method according to the present disclosure uses an artificial neural network, and a static image may be obtained in operation S110. The static image may include a user's face and may include one frame.
In operation S120, at least one of a plurality of image conversion templates stored in the image conversion apparatus 100 may be obtained. The image conversion template may be selected by a user from among the plurality of image conversion templates stored in the image conversion apparatus 100.
The image conversion template may be understood as a tool capable of transforming an image received in operation S110 into a new image in a specific shape. For example, when a user's expressionless face is included in the image received in operation S110, a new image including a user's smiling face may be generated by using a specific image conversion template.
In another embodiment, when a user's smiling face is included in the image received in operation S110, a new image including a user's angry face may be generated by using another specific image conversion template.
In some embodiments, in operation S120, at least one reference image may be received from the user. For example, the reference image may include an image obtained by capturing the user, or an image of another person selected by the user. When the user does not select one of a plurality of preset templates and selects the reference image, the reference image may be obtained as the image conversion template. That is, it may be understood that the reference image performs the same function as the image conversion template.
In operation S130, the static image may be converted into a moving image by using the obtained image conversion template. In order to transform the static image into the moving image, texture information may be extracted from the user's face included in the static image. The texture information may include information about a color or visual texture of the user's face.
Also, in order to transform the static image into the moving image, landmark information may be extracted from an area corresponding to a person's face included in the image conversion template. The landmark information may be obtained from a specific shape, pattern, and color included in the person's face, or a combination thereof, based on an image processing algorithm. Also, the image processing algorithm may include one of SIFT, HOG, Haar feature, Ferns, LBP, and MCT, but is not limited thereto.
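The landmark extraction described above may be sketched, for example, with the dlib library and its publicly available 68-point shape predictor; this particular library and model file are assumptions used only for illustration.

```python
import dlib
import numpy as np

# Illustrative landmark extractor; the predictor file is dlib's public
# 68-point model and must be downloaded separately.
detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(image_rgb):
    """Return a (68, 2) array of facial key points, or None if no face is found."""
    faces = detector(image_rgb, 1)
    if not faces:
        return None
    shape = predictor(image_rgb, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)
```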
The moving image may be generated by combining the texture information and the landmark information. In some embodiments, the moving image may include a plurality of frames. In the moving image, a frame corresponding to the static image may be set as a first frame, and a frame corresponding to the image conversion template may be set as a last frame.
For example, a user's facial expression included in the static image may be the same as a facial expression included in the first frame included in the moving image. Moreover, when the texture information and the landmark information are combined, the user's facial expression included in the static image may be transformed in response to the landmark information, and the last frame included in the moving image may include a frame corresponding to the user's transformed facial expression.
When the moving image is generated using the artificial neural network, the moving image may gradually change from the user's facial expression included in the static image to the user's facial expression transformed in response to the landmark information. That is, at least one frame may be included between the first frame and the last frame of the moving image, and a facial expression included in each of the at least one frame may gradually change.
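A minimal sketch of this gradual change, under the assumption that each frame is rendered by a trained generator network conditioned on the texture information and a per-frame landmark set, is linear interpolation between the landmarks of the static image and those of the image conversion template; the generator named in the comment is hypothetical.

```python
import numpy as np

def interpolate_landmarks(start_landmarks, end_landmarks, num_frames):
    """Yield num_frames landmark sets moving gradually from start to end (num_frames >= 2)."""
    for i in range(num_frames):
        alpha = i / (num_frames - 1)  # 0.0 for the first frame, 1.0 for the last frame
        yield (1.0 - alpha) * np.asarray(start_landmarks) + alpha * np.asarray(end_landmarks)

# Hypothetical usage: 'generator' stands for the trained artificial neural network
# that renders one frame from the texture information and one landmark set.
# frames = [generator(texture_info, lm)
#           for lm in interpolate_landmarks(static_landmarks, template_landmarks, 30)]
```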
By using the artificial neural network, even though the user does not directly capture a moving image, a moving image having the same effect as a moving image captured by the user directly changing his or her facial expressions may be generated.
A plurality of image conversion templates may be stored in the image conversion apparatus 100. Each of the plurality of image conversion templates may include outline images respectively corresponding to eyebrows, eyes, and a mouth. The plurality of image conversion templates may correspond to various facial expressions such as a sad expression, a joyful expression, a winking expression, a depressed expression, a blank expression, a surprised expression, an angry expression, etc., and the plurality of image conversion templates include information about different facial expressions, respectively. Outline images respectively corresponding to the various facial expressions are different from each other. Accordingly, the plurality of image conversion templates may include different outline images, respectively.
Referring to
Referring to
Although it may be seen that the moving image 33 shown in
The image conversion apparatus 100 may extract, from the static image 31, texture information of an area corresponding to the user's face. Also, the image conversion apparatus 100 may extract landmark information from the image conversion template 32. The image conversion apparatus 100 may generate the moving image 33 by combining the texture information of the static image 31 and the landmark information of the image conversion template 32.
The moving image 33 is shown as an image including a user's winking face.
However, the moving image 33 includes a plurality of frames. The moving image 33 including the plurality of frames will be described with reference to
Referring to
Each of the at least one frame present between the first frame 33_1 and the last frame 33_n of the moving image 33 may include an image of the user's face, whose eyes are gradually shut.
Referring to
Although it may be seen that the moving image 43 shown in
The image conversion apparatus 100 may extract, from the static image 41, texture information of an area corresponding to the user's face. Also, the image conversion apparatus 100 may extract landmark information from the reference image 42. The image conversion apparatus 100 may extract landmark information from areas, of the face included in the reference image 42, respectively corresponding to eyebrows, eyes, and a mouth. The image conversion apparatus 100 may generate the moving image 43 by combining the texture information of the static image 41 and the landmark information of the reference image 42.
The moving image 43 is shown as an image including a user's winking face with a big smile. However, the moving image 43 includes a plurality of frames. The moving image 43 including the plurality of frames will be described with reference to
Referring to
Each of the at least one frame present between the first frame 43_1 and the last frame 43_n of the moving image 43 may include an image of the user's face, whose eyes are gradually shut and mouth is gradually opened.
Referring to
The image conversion apparatus 200 may be similar or identical to the image conversion apparatus 100 shown in
The processor 210 may control overall operations of the image conversion apparatus 200 and may include at least one processor such as a central processing unit (CPU), or the like. The processor 210 may include at least one specialized processor corresponding to each function, or may include an integrated processor.
The memory 220 may store programs, data, or files related to the artificial neural network. The memory 220 may store instructions executable by the processor 210. The processor 210 may execute the programs stored in the memory 220, read the data or files stored in the memory 220, or store new data. Also, the memory 220 may store program commands, data files, data structures, etc. separately or in combination.
The processor 210 may obtain a static image from an input image. The static image may include a user's face and may include one frame.
The processor 210 may read at least one of a plurality of image conversion templates stored in the memory 220. Alternatively, the processor 210 may read at least one reference image stored in the memory 220. For example, the at least one reference image may be input by a user.
The reference image may include an image obtained by capturing the user, or an image of another person selected by the user. When the user does not select one of a plurality of preset templates and selects the reference image, the reference image may be obtained as the image conversion template.
The processor 210 may transform the static image into a moving image by using the obtained image conversion template. In order to transform the static image into the moving image, texture information may be extracted from the user's face included in the static image. The texture information may include information about a color or visual texture of the user's face.
Also, in order to transform the static image into the moving image, landmark information may be extracted from an area corresponding to a person's face included in the image conversion template. The landmark information may be obtained from a specific shape, pattern, and color included in the person's face, or a combination thereof, based on an image processing algorithm. Also, the image processing algorithm may include one of SIFT, HOG, Haar feature, Ferns, LBP, and MCT, but is not limited thereto.
The moving image may be generated by combining the texture information and the landmark information. The moving image may include a plurality of frames. In the moving image, a frame corresponding to the static image may be set as a first frame, and a frame corresponding to the image conversion template may be set as a last frame.
For example, a user's facial expression included in the static image may be the same as a facial expression included in the first frame included in the moving image. Moreover, when the texture information and the landmark information are combined, the user's facial expression included in the static image may be transformed in response to the landmark information, and the last frame included in the moving image may include a frame corresponding to the user's transformed facial expression. The moving image generated by the processor 210 may have a shape as shown in
The processor 210 may store the generated moving image in the memory 220 and output the moving image to be seen by the user.
As described with reference to
Also, the image conversion apparatus 200 may provide the user with a moving image generated by transforming a static image, to thereby provide the user with an interesting user experience along with the moving image.
Referring to
In an embodiment of the present disclosure, the server 10-1 may receive an image from the terminal 20-1, extract landmark data from a face included in the received image, calculate necessary data from the extracted landmark data, and then transmit the calculated data to the terminal 20-1.
Alternatively, the server 10-1 may function as a platform for providing a service that the terminal 20-1 may access and use. The terminal 20-1 may extract landmark data from a face included in an image, calculate necessary data from the extracted landmark data, and then transmit the calculated data to the server 10-1.
The server 10-1 may be connected to a communication network. The server 10-1 may be connected to another external device through the communication network. The server 10-1 may transmit or receive data to or from the other external device connected thereto.
The communication network connected to the server 10-1 may include a wired communication network, a wireless communication network, or a complex communication network combining them. The communication network may include a mobile communication network such as 3G, LTE, or LTE-A. The communication network may include a wired or wireless communication network such as Wi-Fi, UMTS/GPRS, or Ethernet. The communication network may include a short-range communication network such as MST, RFID, NFC, ZigBee, Z-Wave, Bluetooth, BLE, or infrared communication. The communication network may include a LAN, a MAN, or a WAN.
The server 10-1 may be connected to the terminal 20-1 through the communication network. When the server 10-1 is connected to the terminal 20-1, the server 10-1 may transmit and receive data to and from the terminal 20-1 through the communication network. The server 10-1 may perform an arbitrary operation using the data received from the terminal 20-1. The server 10-1 may transmit a result of the operation to the terminal 20-1.
Examples of the terminal 20-1 may include a desktop computer, a smartphone, a smart tablet, a smartwatch, a mobile terminal, a digital camera, a wearable device, or a portable electronic device. The terminal 20-1 may execute a program or an application.
Referring to
The image receiver 110-1 may receive a plurality of images from a user. Each of the plurality of images may include only one person. That is, each of the plurality of images may include only one person's face, and people included in the plurality of images may all be different people.
The image receiver 110-1 may extract only a face area from each of the plurality of images, and then provide the extracted face area to the landmark data calculator 120-1.
The landmark data calculator 120-1 may calculate landmark data of faces respectively included in the plurality of images, mean landmark data of all faces included in the plurality of images, characteristic landmark data of a specific face included in a specific image among the plurality of images, and facial expression landmark data of the specific face.
In some embodiments, the landmark data may include a result of extracting facial key points. A method of extracting landmark data may be described with reference to
The landmark data may be obtained by extracting points of major elements of a face, such as eyes, eyebrows, a nose, a mouth, and a jawline or extracting an outline drawn by connecting the points. The landmark data may be used in techniques such as facial expression classification, pose analysis, synthesis of faces of different people, face transformation, etc.
Referring back to
The landmark data calculator 120-1 may calculate landmark data from the specific image including the specific face, among the plurality of images. In more detail, landmark data for a specific face included in a specific frame among a plurality of frames included in a specific image may be calculated.
Also, the landmark data calculator 120-1 may calculate characteristic landmark data of the specific face included in the specific image among the plurality of images. The characteristic landmark data may be calculated based on face landmark data included in each of the plurality of frames included in the specific image.
Also, the landmark data calculator 120-1 may calculate facial expression landmark data for the specific frame in the specific image by using the mean landmark data, the landmark data for the specific frame, and the characteristic landmark data. For example, the facial expression landmark data may correspond to a facial expression of the specific face or movement information of major elements such as eyes, eyebrows, a nose, a mouth, and a jawline.
The landmark data storage 130-1 may store the data calculated by the landmark data calculator 120-1. For example, the landmark data storage 130-1 may store the mean landmark data, the landmark data for the specific frame, the characteristic landmark data, and the facial expression landmark data, which are calculated by the landmark data calculator 120-1.
Referring to
In operation S1200, the landmark data decomposition apparatus 100-1 may calculate mean landmark data Im. The mean landmark data Im may be represented as follows.
Im = (1/(C·T)) Σ_{c=1}^{C} Σ_{t=1}^{T} I(c,t) [Equation 1]
In an embodiment of the present disclosure, C may denote the number of the plurality of images, and T may denote the number of frames included in each of the plurality of images.
That is, the landmark data decomposition apparatus 100-1 may extract landmark data I(c,t) of each of the faces included in the plurality of images C. The landmark data decomposition apparatus 100-1 may calculate a mean value of all pieces of the extracted landmark data. The calculated mean value may correspond to the mean landmark data Im.
In operation S1300, the landmark data decomposition apparatus 100-1 may calculate landmark data I(c,t) for a specific frame among a plurality of frames in a specific image including a specific face, among the plurality of images.
For example, the landmark data I(c,t) for the specific frame may be information about a facial key point of a specific face included in a t-th frame of a c-th image among the plurality of images C. That is, it may be assumed that the specific image is the c-th image, and the specific frame is the t-th frame.
In operation S1400, the landmark data decomposition apparatus 100-1 may calculate characteristic landmark data Iid(c) of the specific face included in the c-th image. The characteristic landmark data Iid(c) may be represented as follows.
Iid(c) = (1/T) Σ_{t=1}^{T} I(c,t) − Im [Equation 2]
In an embodiment of the present disclosure, a plurality of frames included in the c-th image include various facial expressions of the specific face. Accordingly, in order to calculate the characteristic landmark data Iid(c), the landmark data decomposition apparatus 100-1 may assume that a mean value of the facial expression landmark data Iexp of the specific face included in the c-th image is 0. Therefore, the characteristic landmark data Iid(c) may be calculated without considering the mean value of the facial expression landmark data Iexp of the specific face.
The characteristic landmark data Iid(c) may be defined as a value obtained by calculating landmark data for each of the plurality of frames included in the c-th image, calculating the mean landmark data of the landmark data for each of the plurality of frames, and subtracting the mean landmark data Im of the plurality of images from the calculated mean landmark data of the c-th image.
In operation S1500, the landmark data decomposition apparatus 100-1 may calculate facial expression landmark data Iexp(c,t) of the specific face.
In more detail, the landmark data decomposition apparatus 100-1 may calculate facial expression landmark data Iexp(c,t) of the specific face included in the t-th frame of the c-th image. The facial expression landmark data Iexp(c,t) may be represented as follows.
Iexp(c,t) = I(c,t) − Im − Iid(c) [Equation 3]
The facial expression landmark data Iexp(c,t) may correspond to a facial expression of the specific face included in the t-th frame and movement information of eyes, eyebrows, a nose, a mouth, and a jawline included in the specific face. In more detail, the facial expression landmark data Iexp(c,t) may be defined as a value obtained by subtracting the mean landmark data Im and the characteristic landmark data Iid(c) from the landmark data I(c,t) for the specific frame.
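The decomposition of operations S1200 through S1500 may be sketched with NumPy as follows, assuming the landmark data of all images are stacked into an array of shape (C, T, K, 2), where K is the number of facial key points; the array layout and the function name are illustrative assumptions.

```python
import numpy as np

def decompose_landmarks(landmarks):
    """landmarks: array of shape (C, T, K, 2) holding I(c, t) for C images of T frames each.

    Returns the mean landmark data Im, the per-image characteristic landmark
    data Iid(c), and the per-frame facial expression landmark data Iexp(c, t).
    """
    I_m = landmarks.mean(axis=(0, 1))              # mean over all images and frames (Equation 1)
    I_id = landmarks.mean(axis=1) - I_m            # Iid(c) = mean over frames of I(c,t) - Im (Equation 2)
    I_exp = landmarks - I_m - I_id[:, np.newaxis]  # Iexp(c,t) = I(c,t) - Im - Iid(c) (Equation 3)
    return I_m, I_id, I_exp
```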
Through an operation as described with reference to
The server 10-1 or the terminal 20-1 may implement a technique of transforming a facial expression of a first image into a facial expression of a face included in a second image while maintaining an external shape of a face included in the first image, by using the facial expression landmark data Iexp(c,t), the mean landmark data Im, and the characteristic landmark data Iid(c) which are decomposed by the landmark data decomposition apparatus 100-1. A detailed method thereof may be described with reference to
Referring to
For example, the first image 300 may correspond to a tx-th frame among a plurality of frames included in a cx-th image among a plurality of images. Also, the second image 400 may correspond to a ty-th frame among a plurality of frames included in a cy-th image among a plurality of images. The cx-th image and the cy-th image may be different from each other.
Landmark data of the face included in the first image 300 may be decomposed as follows.
I(cx,tx)=Im+Iid(cx)+Iexp(cx,tx) [Equation 4]
The landmark data I(cx,tx) of the face included in the first image 300 may be decomposed into the mean landmark data Im, the characteristic landmark data Iid(cx), and the facial expression landmark data Iexp(cx,tx), as shown in Equation 4.
Landmark data of the face included in the second image 400 may be decomposed as follows.
I(cy,ty)=Im+Iid(cy)+Iexp(cy,ty) [Equation 5]
The landmark data I(cy,ty) of the face included in the second image 400 may be decomposed into the mean landmark data Im, the characteristic landmark data Iid(cy), and the facial expression landmark data Iexp(cy,ty), as shown in Equation 5.
In order to transform only the facial expression of the first image 300 into the facial expression of the face included in the second image 400 while maintaining the external shape of the face included in the first image 300, the landmark data of the face included in the first image 300 may be expressed as follows.
I(cx→cy,ty)=Im+Iid(cx)+Iexp(cy,ty) [Equation 6]
The server 10-1 or the terminal 20-1 may maintain the characteristic landmark data Iid(cx) of the face included in the first image 300 while substituting the facial expression landmark data Iexp(cy,ty) of the face included in the second image 400 for the facial expression landmark data Iexp(cx,tx) of the face included in the first image 300, as shown in Equation 6.
By using such a method, the first image 300 may be converted into a third image 500. Although the face included in the first image 300 had a smiling expression, a face included in the third image 500 has a winking expression with a big smile, as in the facial expression of the face included in the second image 400.
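Given the arrays produced by such a decomposition, the recombination of Equation 6 may be sketched as follows; the indices cx, cy, and ty correspond to the first and second images described above, and the function name is illustrative.

```python
def transfer_expression(I_m, I_id, I_exp, cx, cy, ty):
    """Keep the identity of image cx while adopting the expression of frame ty of image cy."""
    # I(cx -> cy, ty) = Im + Iid(cx) + Iexp(cy, ty)   (Equation 6)
    return I_m + I_id[cx] + I_exp[cy, ty]
```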
A MarioNETte model is used to transform a facial expression of a face included in an image without using the landmark data decomposition method. When the MarioNETte model is used, a result of measuring the degree of naturalness of a transformed image is 0.147.
A MarioNETte+LT model is used to transform a facial expression of a face included in an image using the landmark data decomposition method. When the MarioNETte+LT model is used, a result of measuring the degree of naturalness of a transformed image is 0.280. That is, it is verified that the image transformed using the MarioNETte+LT model is 1.9 times more natural than the image transformed using the MarioNETte model.
Referring to
The landmark data decomposition apparatus 200-1 may be similar or identical to the landmark data decomposition apparatus 100-1 shown in
The processor 210-1 may control overall operations of the landmark data decomposition apparatus 200-1, and may include at least one processor such as a CPU, or the like. The processor 210-1 may include at least one specialized processor corresponding to each function, or may include an integrated processor.
The memory 220-1 may store programs, data, or files that control the landmark data decomposition apparatus 200-1. The memory 220-1 may store instructions executable by the processor 210-1. The processor 210-1 may execute the programs stored in the memory 220-1, read the data or files stored in the memory 220-1, or store new data. Also, the memory 220-1 may store program commands, data files, data structures, etc. separately or in combination.
The processor 210-1 may receive a plurality of images. Each of the plurality of images may include only one person. That is, each of the plurality of images may include only one person's face, and people included in the plurality of images may all be different people.
The processor 210-1 may store the plurality of received image in the memory 220-1.
The processor 210-1 may extract landmark data I(c,t) of each of the faces included in a plurality of images C. The processor 210-1 may calculate a mean value of all pieces of the extracted landmark data. The calculated mean value may correspond to the mean landmark data Im.
The processor 210-1 may calculate landmark data I(c,t) for a specific frame among a plurality of frames in a specific image including a specific face, among the plurality of images.
The landmark data I(c,t) for the specific frame may be information about a facial key point of a specific face included in a t-th frame of a c-th image among the plurality of images C. That is, it may be assumed that the specific image is the c-th image, and the specific frame is the t-th frame.
The processor 210-1 may calculate characteristic landmark data Iid(c) of the specific face included in the c-th image. A plurality of frames included in the c-th image include various facial expressions of the specific face. Accordingly, in order to calculate the characteristic landmark data Iid(c), the processor 210-1 may assume that a mean value of the facial expression landmark data Iexp of the specific face included in the c-th image is 0. Therefore, the characteristic landmark data Iid(c) may be calculated without considering the mean value of the facial expression landmark data Iexp of the specific face.
The characteristic landmark data Iid(c) may be defined as a value obtained by calculating landmark data for each of the plurality of frames included in the c-th image, calculating the mean landmark data of the landmark data for each of the plurality of frames, and subtracting the mean landmark data Im of the plurality of images from the calculated mean landmark data of the c-th image.
The processor 210-1 may calculate facial expression landmark data Iexp(c,t) of the specific face included in the t-th frame of the c-th image. The facial expression landmark data Iexp(c,t) may correspond to a facial expression of the specific face included in the t-th frame and movement information of eyes, eyebrows, a nose, a mouth, and a jawline included in the specific face. In more detail, the facial expression landmark data Iexp(c,t) may be defined as a value obtained by subtracting the mean landmark data Im and the characteristic landmark data Iid(c) from the landmark data I(c,t) for the specific frame.
The processor 210-1 may store, in the memory 220-1, the facial expression landmark data Iexp(c,t), the mean landmark data Im, and the characteristic landmark data Iid(c), which are decomposed.
As described with reference to
Also, the landmark data decomposition apparatuses 100-1 and 200-1 may decompose landmark data including information about characteristics and facial expressions of a face included in an image more accurately.
Moreover, the server 10-1 or the terminal 20-1 including the landmark data decomposition apparatuses 100-1 and 200-1 may implement a technique of naturally transforming a facial expression of a first image into a facial expression of a face included in a second image while maintaining an external shape of a face included in the first image, by using facial expression landmark data Iexp(c,t), mean landmark data Im, and characteristic landmark data Iid(c), which are decomposed.
The server 1000 may be connected to a communication network. The server 1000 may be connected to other external devices through the communication network. The server 1000 may transmit data to another connected device or receive data from the other device.
The communication network connected to the server 1000 may include a wired communication network, a wireless communication network, or a complex communication network. A communication network may include mobile communication networks such as 3G, LTE, or LTE-A. A communication network may include a wired or wireless communication network such as Wi-Fi, UMTS/GPRS, or Ethernet. A communication network may include short-range communication networks such as Magnetic Secure Transmission (MST), Radio Frequency Identification (RFID), Near Field Communication (NFC), ZigBee, Z-Wave, Bluetooth, Bluetooth Low Energy (BLE), or Infrared (IR) communication, or the like. A communication network may include a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN).
The server 1000 may receive data from at least one of the first terminal 2000 and the second terminal 3000. The server 1000 may perform an operation by using the data received from at least one of the first terminal 2000 and the second terminal 3000. The server 1000 may transmit a result of the operation to at least one of the first terminal 2000 and the second terminal 3000.
The server 1000 may receive a relay request from at least one terminal from among the first terminal 2000 and the second terminal 3000. The server 1000 may select terminals that have transmitted relay requests. For example, the server 1000 may select the first terminal 2000 and the second terminal 3000.
The server 1000 may serve as an intermediate for a communication connection between the selected first terminal 2000 and the selected second terminal 3000. For example, the server 1000 may serve as an intermediary for a video call connection or a text transmission and reception connection between the first terminal 2000 and the second terminal 3000. The server 1000 may transmit connection information regarding the first terminal 2000 to the second terminal 3000, and transmit connection information regarding the second terminal 3000 to the first terminal 2000.
The connection information regarding the first terminal 2000 may include an IP address and a port number of the first terminal 2000. The first terminal 2000 that has received connection information regarding the second terminal 3000 may try to connect to the second terminal 3000 by using the received connection information.
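Purely as an illustration of the exchange of connection information described above, the server-side pairing of relay requests might look like the following sketch; the message fields and the function name are assumptions rather than a defined protocol.

```python
def pair_relay_requests(requests):
    """requests: list of dicts such as {"terminal_id": "t1", "ip": "10.0.0.1", "port": 5000}.

    Pairs consecutive relay requests and returns, for each terminal in a pair,
    the connection information of its peer terminal.
    """
    pairings = []
    for first, second in zip(requests[0::2], requests[1::2]):
        pairings.append({
            first["terminal_id"]: {"peer_ip": second["ip"], "peer_port": second["port"]},
            second["terminal_id"]: {"peer_ip": first["ip"], "peer_port": first["port"]},
        })
    return pairings
```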
As the attempt by the first terminal 2000 to connect to the second terminal 3000 or the attempt by the second terminal 3000 to connect to the first terminal 2000 succeeds, a video call session between the first terminal 2000 and the second terminal 3000 may be established. The first terminal 2000 may transmit an image or sound to the second terminal 3000 through the video call session. The first terminal 2000 may encode the image or sound into a digital signal and transmit a result of the encoding to the second terminal 3000.
The first terminal 2000 may receive an image or sound that is encoded into a digital signal and decode the received image or sound.
The second terminal 3000 may transmit an image or sound to the first terminal 2000 through the video call session. In addition, the second terminal 3000 may receive an image or sound from the first terminal 2000 through the video call session. Accordingly, a user of the first terminal 2000 and a user of the second terminal 3000 may make a video call with each other.
The first terminal 2000 and the second terminal 3000 may be, for example, a desktop computer, a laptop computer, a smartphone, a smart tablet, a smart watch, a mobile terminal, a digital camera, a wearable device, or a portable electronic device, or the like. The first terminal 2000 and the second terminal 3000 may execute a program or application. The first terminal 2000 and the second terminal 3000 may be devices of the same type or different types.
In operation S210, a face image of a first person and landmark information corresponding to the face image are received. Here, a landmark may be understood as a landmark of the face image (facial landmark). The landmark may indicate major elements of a face, for example, eyes, eyebrows, nose, mouth, or jawline.
In addition, the landmark information may include information about the position, size or shape of the major elements of the face. Also, the landmark information may include information about a color or texture of the major elements of the face.
The first person indicates an arbitrary person, and in operation S210, a face image of the arbitrary person and landmark information corresponding to the face image are received. The landmark information may be obtained through well-known technology, and any of well-known methods may be used to obtain the same. In addition, the present disclosure is not limited by the method of obtaining a landmark.
In operation S220, a transformation matrix corresponding to the landmark information is estimated. The transformation matrix may constitute the landmark information together with a preset unit vector. For example, first landmark information may be calculated by a product of the unit vector and a first transformation matrix. For example, second landmark information may be calculated by a product of the unit vector and a second transformation matrix.
The transformation matrix is a matrix converting high-dimensional landmark information into low-dimensional data, and may be used in principal component analysis (PCA). PCA is a dimension reduction method in which distribution of data is preserved as much as possible and new axes orthogonal to each other are searched for to convert variables of a high-dimensional space into variables of a low-dimensional space. In PCA, first, a hyperplane that is closest to data is searched for, and then the data is projected onto a hyperplane of a low dimension to reduce the data.
In PCA, a unit vector defining an ith axis is referred to as an ith principal component (PC), and by linearly combining these axes, high-dimensional data may be converted into low-dimensional data.
X=αY [Equation 7]
Here, X denotes landmark information of a high dimension, Y denotes a principal component of a low dimension, and α denotes a transformation matrix.
As described above, the unit vector, that is, a principal component, may be determined in advance. Accordingly, when new landmark information is received, a transformation matrix corresponding thereto may be determined. Here, there may be a plurality of transformation matrices corresponding to one piece of landmark information.
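As an illustration of Equation 7, the projection of high-dimensional landmark information onto a small number of principal components may be sketched with scikit-learn; the number of components and the random placeholder data are arbitrary assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# X holds flattened landmark information, one row per face image
# (e.g. 68 key points * 2 coordinates = 136 dimensions); random placeholder data.
X = np.random.rand(500, 136)

pca = PCA(n_components=10)                   # number of low-dimensional principal components (arbitrary)
Y = pca.fit_transform(X)                     # low-dimensional representation, shape (500, 10)
X_reconstructed = pca.inverse_transform(Y)   # approximate reconstruction, in the sense of Equation 7
```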
Meanwhile, in operation S220, a learning model trained to estimate the transformation matrix may be used. The learning model may be understood as a model trained to estimate a PCA transformation matrix from an arbitrary face image and landmark information corresponding to the arbitrary face image.
The learning model may be trained to estimate the transformation matrix from face images of different people and landmark information corresponding to each of the face images. There may be several transformation matrices corresponding to one piece of high-dimensional landmark information, and the learning model may be trained to output only one transformation matrix among the several transformation matrices.
The landmark information used as an input to the learning model may be obtained using a well-known method of extracting landmark information from a face image and visualizing the landmark information.
Thus, in operation S220, the face image of the first person and the landmark information corresponding to the face image are received as an input, and one transformation matrix is estimated therefrom and output.
Meanwhile, the learning model may be trained to classify landmark information into a plurality of semantic groups respectively corresponding to the right eye, the left eye, nose, and mouth, and to output PCA conversion coefficients respectively corresponding to the plurality of semantic groups.
Here, the semantic groups are not necessarily classified to correspond to the right eye, the left eye, nose, and mouth, but may also be classified to correspond to eyebrows, eyes, nose, mouth, and jawline or to correspond to eyebrows, the right eye, the left eye, nose, mouth, jawline, and ears. In operation S220, the landmark information is classified into semantic groups of a segmented unit according to the learning model, and a PCA conversion coefficient corresponding to the classified semantic groups may be estimated.
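If the commonly used 68-point landmark annotation is assumed, the semantic groups mentioned above may be expressed as index ranges, for example as follows; the exact grouping is an illustrative assumption.

```python
# Index ranges of the commonly used 68-point facial landmark annotation (illustrative).
SEMANTIC_GROUPS = {
    "jawline":       range(0, 17),
    "right_eyebrow": range(17, 22),
    "left_eyebrow":  range(22, 27),
    "nose":          range(27, 36),
    "right_eye":     range(36, 42),
    "left_eye":      range(42, 48),
    "mouth":         range(48, 68),
}

def split_into_semantic_groups(landmarks):
    """landmarks: (68, 2) array; returns a dict of per-group landmark arrays."""
    return {name: landmarks[list(idx)] for name, idx in SEMANTIC_GROUPS.items()}
```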
In operation S230, an expression landmark of the first person is calculated using the transformation matrix. Landmark information may be decomposed into a plurality of pieces of sub landmark information, and it is assumed in the present disclosure that the landmark information may be expressed as below.
I(c,t)=Im+Iid(c)+Iexp(c,t) [Equation 8]
where I(c,t) denotes landmark information in a t-th frame of a video containing person c; Im denotes a mean facial landmark of humans; Iid(c) denotes a facial landmark of identity geometry of person c; Iexp(c,t) denotes a facial landmark of expression geometry of person c in the t-th frame of the video containing person c.
That is, landmark information in a particular frame of a particular person may be expressed as a sum of mean landmark information of faces of all persons, identity landmark information of just the particular person, and facial expression and motion information of the particular person in the particular frame.
The mean landmark information may be defined by the equation below, and may be calculated based on a large amount of videos that are collectable in advance.
Im = (1/(C·T)) Σ_{c=1}^{C} Σ_{t=1}^{T} I(c,t) [Equation 9]
where C denotes the number of previously collected videos and T denotes the total number of frames of a video, and thus, Im denotes a mean of the landmarks I(c,t) of all persons appearing in the previously collected videos.
Meanwhile, the expression landmark may be calculated using the equation below.
Iexp(c,t) = Σ_{k=1}^{nexp} αk(c,t)·bexp,k [Equation 10]
The above equation represents a result of performing PCA on each semantic group of person c. Here, nexp denotes the total number of expression bases over all semantic groups, bexp denotes an expression basis, that is, a basis of PCA, and α denotes a coefficient of PCA.
In other words, bexp denotes an eigenvector described above, and an expression landmark of a high dimension may be defined by a combination of low-dimensional eigenvectors. Also, nexp denotes a total number of expressions and motions that may be expressed by person c by the right eye, the left eye, nose, mouth, etc.
Thus, the expression landmark of the first person may be defined by a set of expression information regarding main parts of a face, that is, each of the right eye, the left eye, nose, and mouth. Also, there may be αk(c,t) corresponding to each eigenvector.
The learning model described above may be trained to estimate a PCA coefficient α(c,t) by using, as an input, a picture x(c,t) and landmark information I(c,t) of person c whose landmark information is to be decomposed, as shown in Equation 8. Through such learning, the learning model may estimate a PCA coefficient from an image of a particular person and landmark information corresponding thereto, and may estimate the low-dimensional eigenvector.
When applying a trained neural network, a PCA transformation matrix is estimated by using a picture x(c′,t) and landmark information I(c′,t) of person c′ whose landmark is to be decomposed, as an input to a neural network. Here, a value obtained from learning data may be used as bexp, and an expression landmark may be estimated as below by using a predicted (estimated) PCA coefficient and bexp.
Îexp(c,t)=bexpᵀα̂(c,t) [Equation 11]
Here, Îexp(c,t) denotes an estimated expression landmark, and α̂(c,t) denotes an estimated PCA transformation matrix.
In operation S240, an identity landmark of the first person is calculated using the expression landmark. As described with reference to Equation 8, landmark information may be defined as a sum of mean landmark information, identity landmark information, and expression landmark information, and the expression landmark information may be estimated through Equation 11 in operation S230.
Thus, the identity landmark may be calculated as below.
Îid(c)=I(c,t)−Im−Îexp(c,t) [Equation 12]
The above equation may be derived from Equation 8, and when the expression landmark is calculated in operation S230, an identity landmark may be calculated through Equation 12 in operation S240. The mean landmark information Im may be calculated based on a large amount of videos that are collectable in advance.
Thus, when a face image of a person is given, landmark information may be obtained therefrom, and expression landmark information and identity landmark information may be calculated from the face image and the landmark information.
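Assuming the PCA coefficients α̂(c,t) have been estimated by the trained model and the expression basis bexp has been obtained from the learning data, Equations 11 and 12 may be computed as in the following sketch; the array shapes and the function name are assumptions.

```python
def estimate_expression_and_identity(I_ct, I_m, b_exp, alpha_hat):
    """I_ct, I_m: flattened landmark vectors of shape (D,);
    b_exp: expression basis of shape (n_exp, D);
    alpha_hat: estimated PCA coefficients of shape (n_exp,). All inputs are NumPy arrays."""
    I_exp_hat = b_exp.T @ alpha_hat      # Equation 11: estimated expression landmark
    I_id_hat = I_ct - I_m - I_exp_hat    # Equation 12: estimated identity landmark
    return I_exp_hat, I_id_hat
```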
A multilayer perceptron (MLP) is a type of artificial neural network in which several perceptron layers are stacked to overcome the shortcomings of a single-layer perceptron. Referring to
In
When the transformation matrix is estimated through the trained artificial neural network, as described above with reference to
The trained artificial neural network is trained to estimate a low-dimensional eigenvector and a conversion coefficient from a large number of face images and landmark information corresponding to the face images, and the artificial neural network trained in this manner may estimate the eigenvector and the conversion coefficient even when just a face image of one frame is given.
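One possible shape of such a network, sketched here with PyTorch, maps an image feature vector and a flattened landmark vector to one PCA coefficient per expression basis; the layer sizes and input dimensions are assumptions, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class CoefficientEstimator(nn.Module):
    """Illustrative MLP that maps (image features, landmarks) to PCA coefficients."""

    def __init__(self, feature_dim=512, landmark_dim=136, n_exp=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim + landmark_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Linear(128, n_exp),   # one coefficient per expression basis
        )

    def forward(self, image_features, landmarks):
        x = torch.cat([image_features, landmarks], dim=-1)
        return self.mlp(x)
```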
When an expression landmark and an identity landmark of an arbitrary person are decomposed using the above-described method, the quality of face image processing techniques such as face reenactment, face classification, face morphing, or the like may be improved.
In the face reenactment technique, when a target face and a driver face are given, a face image or picture is composed, in which a motion of the driver face is emulated but the identity of the target face is preserved.
In the face morphing technique, when a face image or picture of each of person 1 and person 2 is given, a face image or picture of a third person in which the identities of person 1 and person 2 are preserved is composed. In a traditional morphing algorithm, a key point of a face is searched for, and then the face is divided into triangular or quadrangular pieces that do not overlap each other with respect to the key point. Thereafter, the pictures of person 1 and person 2 are combined to compose a picture of the third person, and because positions of key points of person 1 and person 2 are different from each other, when the picture of the third person is made by combining the pictures of person 1 and person 2 pixel-wise, a great degree of incongruity may be perceived. Because the conventional face morphing technique does not distinguish the characteristics of outer appearance of an object from characteristics due to emotion, such as its facial expression, the quality of a morphing result may be low.
According to the landmark decomposition method of the present disclosure, since expression landmark information and identity landmark information may be respectively decomposed from one piece of landmark information, the landmark decomposition method may contribute to improving a result of a face image processing technique using facial landmarks. In particular, according to the landmark decomposition method of the present disclosure, landmarks may be decomposed also when a very small amount of face image data is given, and thus, the landmark decomposition method of the present disclosure may be highly useful.
The receiver 5100 receives a face image of a first person and landmark information corresponding to the face image. Here, a landmark refers to a landmark of the face (facial landmark), and may be understood as a concept encompassing major elements of a face, for example, eyes, eyebrows, nose, mouth, jawline, etc.
In addition, the landmark information may include information about the position, size or shape of the major elements of the face. Also, the landmark information may include information about a color or texture of the major elements of the face.
The first person denotes an arbitrary person, and the receiver 5100 receives a face image of the arbitrary person and landmark information corresponding to the face image. The landmark information may be obtained through well-known technology, and any of well-known methods may be used to obtain the same. In addition, the present disclosure is not limited by the method of obtaining a landmark.
The transformation matrix estimator 5200 estimates a transformation matrix corresponding to the landmark information. The transformation matrix may constitute the landmark information together with a preset unit vector. For example, first landmark information may be calculated by a product of the unit vector and a first transformation matrix. For example, second landmark information may be calculated by a product of the unit vector and a second transformation matrix.
The transformation matrix is a matrix converting high-dimensional landmark information into low-dimensional data, and may be used in principal component analysis (PCA). PCA is a dimension reduction method in which distribution of data is preserved as much as possible and new axes orthogonal to each other are searched for to convert variables of a high-dimensional space into variables of a low-dimensional space. In PCA, first, a hyperplane that is closest to data is searched for, and then the data is projected onto a hyperplane of a low dimension to reduce the data.
In PCA, a unit vector defining an ith axis is referred to as an ith principal component (PC), and by linearly combining these axes, high-dimensional data may be converted into low-dimensional data.
As described above, the unit vector, that is, a principal component, may be determined in advance. Accordingly, when new landmark information is received, a transformation matrix corresponding thereto may be determined. Here, there may be a plurality of transformation matrices corresponding to one piece of landmark information.
Meanwhile, the transformation matrix estimator 5200 may use a learning model trained to estimate the transformation matrix. The learning model may be understood as a model trained to estimate a PCA transformation matrix from an arbitrary face image and landmark information corresponding to the arbitrary face image.
The learning model may be trained to estimate the transformation matrix from face images of different people and landmark information corresponding to each of the face images. There may be several transformation matrices corresponding to one piece of high-dimensional landmark information, and the learning model may be trained to output only one transformation matrix among the several transformation matrices.
The landmark information used as an input to the learning model may be obtained using a well-known method of extracting landmark information from a face image and visualizing the landmark information.
Thus, the transformation matrix estimator 5200 may receive, as an input, the face image of the first person and landmark information corresponding to the face image, and estimate one transformation matrix from the face image and output the transformation matrix.
Meanwhile, the learning model may be trained to classify landmark information into a plurality of semantic groups respectively corresponding to the right eye, the left eye, nose, and mouth, and to output PCA conversion coefficients respectively corresponding to the plurality of semantic groups.
Here, the semantic groups are not necessarily classified to correspond to the right eye, the left eye, nose, and mouth, but may also be classified to correspond to eyebrows, eyes, nose, mouth, and jawline or to correspond to eyebrows, the right eye, the left eye, nose, mouth, jawline, and ears. The transformation matrix estimator 5200 may classify the landmark information into semantic groups of a segmented unit according to the learning model, and estimate a PCA conversion coefficient corresponding to the classified semantic groups.
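A minimal sketch of this per-group treatment is shown below; the landmark index ranges assume the common 68-point convention, and the group names and basis shapes are illustrative assumptions rather than values prescribed by the present disclosure:

```python
import numpy as np

# Hypothetical semantic groups over the common 68-point landmark convention.
SEMANTIC_GROUPS = {
    "right_eye": list(range(36, 42)),
    "left_eye":  list(range(42, 48)),
    "nose":      list(range(27, 36)),
    "mouth":     list(range(48, 68)),
}

def groupwise_pca_coefficients(landmarks, bases):
    """landmarks: (68, 2) array; bases: dict mapping group -> (k, n_points*2) PCA basis.

    Returns one coefficient vector per semantic group, mirroring the idea of estimating
    PCA conversion coefficients per group rather than for the whole face at once.
    """
    coeffs = {}
    for name, idx in SEMANTIC_GROUPS.items():
        flat = landmarks[idx].reshape(-1)      # the group's landmarks as one vector
        coeffs[name] = bases[name] @ flat      # projection onto that group's basis
    return coeffs
```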
The calculator 5300 calculates an expression landmark of the first person by using the transformation matrix, and calculates an identity landmark of the first person by using the expression landmark. Landmark information may be decomposed into a plurality of pieces of sub landmark information, for example, into mean landmark information, identity landmark information, and expression landmark information.
That is, landmark information in a particular frame of a particular person may be expressed as a sum of mean landmark information of faces of all persons, identity landmark information of the particular person, and facial expression and motion information of the particular person in the particular frame.
The mean landmark information may be defined by the equation as below, and may be calculated based on a large amount of videos that are collectable in advance.
The learning model described above may be trained to estimate a PCA coefficient α(c,t) by using, as an input, a picture x(c,t) and landmark information I(c,t) of person c whose landmark information is to be decomposed, as shown in Equation 8. Through such learning, the learning model may estimate a PCA coefficient from an image of a particular person and landmark information corresponding thereto, and may estimate the low-dimensional eigenvector.
When applying a trained neural network, a PCA transformation matrix is estimated by using a picture x(c′,t) and landmark information I(c′,t) of person c′, whose landmark is to be decomposed, as an input to the neural network. Here, a value obtained from learning data may be used as bexp, and an expression landmark may be estimated as in Equation 11 by using the predicted (estimated) PCA coefficient and bexp.
Meanwhile, as described with reference to Equation 8, landmark information may be defined as a sum of mean landmark information, identity landmark information, and expression landmark information, and the expression landmark information may be estimated through Equation 11 in operation S230.
Accordingly, the identity landmark may be calculated as in Equation 12, and when a face image of an arbitrary person is given, landmark information may be obtained therefrom, and expression landmark information and identity landmark information may be calculated from the face image and the landmark information.
A reenacted image 4300 has the characteristics of the target image 4100, but a facial expression thereof corresponds to the driver image 4200. That is, the reenacted image 4300 has an identity landmark of the target image 4100, but the expression landmark has features corresponding to the driver image 4200.
Thus, for natural reenactment of a face, it is important to appropriately decompose an identity landmark and an expression landmark from one landmark.
The server 10000 may be connected to a communication network. The server 10000 may be connected to other external devices through the communication network. The server 10000 may transmit data to another connected device or receive data from the other device.
The communication network connected to the server 10000 may include a wired communication network, a wireless communication network, or a complex communication network. The communication network may include mobile communication networks such as 3G, LTE, or LTE-A. The communication network may include a wired or wireless communication network such as Wi-Fi, UMTS/GPRS, or Ethernet. The communication network may include short-range communication networks such as Magnetic Secure Transmission (MST), Radio Frequency Identification (RFID), Near Field Communication (NFC), ZigBee, Z-Wave, Bluetooth, Bluetooth Low Energy (BLE), or Infrared (IR) communication. The communication network may include a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN).
The server 10000 may receive data from at least one of the first terminal 6000 and the second terminal 7000. The server 10000 may perform an operation by using the data received from at least one of the first terminal 6000 and the second terminal 7000. The server 10000 may transmit a result of the operation to at least one of the first terminal 6000 and the second terminal 7000.
The server 10000 may receive a relay request from at least one terminal from among the first terminal 6000 and the second terminal 7000. The server 10000 may select a terminal that has transmitted a relay request. For example, the server 10000 may select the first terminal 6000 and the second terminal 7000.
The server 10000 may serve as an intermediary for a communication connection between the selected first terminal 6000 and the selected second terminal 7000. For example, the server 10000 may serve as an intermediary for a video call connection or a text transmission and reception connection between the first terminal 6000 and the second terminal 7000. The server 10000 may transmit connection information regarding the first terminal 6000 to the second terminal 7000, and transmit connection information regarding the second terminal 7000 to the first terminal 6000.
The connection information regarding the first terminal 6000 may include an IP address and a port number of the first terminal 6000. The first terminal 6000 that has received connection information regarding the second terminal 7000 may try to connect to the second terminal 7000 by using the received connection information.
When the attempt by the first terminal 6000 to connect to the second terminal 7000 or the attempt by the second terminal 7000 to connect to the first terminal 6000 succeeds, a video call session between the first terminal 6000 and the second terminal 7000 may be established. The first terminal 6000 may transmit an image or sound to the second terminal 7000 through the video call session. The first terminal 6000 may encode the image or sound into a digital signal and transmit a result of the encoding to the second terminal 7000.
In addition, the first terminal 6000 may receive an image or sound from the second terminal 7000 through the video call session. The first terminal 6000 may receive an image or sound that is encoded into a digital signal and decode the received image or sound.
The second terminal 7000 may transmit an image or sound to the first terminal 6000 through the video call session. In addition, the second terminal 7000 may receive an image or sound from the first terminal 6000 through the video call session. Accordingly, a user of the first terminal 6000 and a user of the second terminal 7000 may make a video call with each other.
The first terminal 6000 and the second terminal 7000 may be, for example, a desktop computer, a laptop computer, a smartphone, a smart tablet, a smart watch, a mobile terminal, a digital camera, a wearable device, or a portable electronic device, or the like. The first terminal 6000 and the second terminal 7000 may execute a program or application. The first terminal 6000 and the second terminal 7000 may be devices of the same type or different types.
Referring to
In operation S2100, landmark information is obtained from a face image of a user. The landmark denotes face parts that characterize the face of the user, and may include, for example, eyes, eyebrows, nose, mouth, ears, or jawline or the like of the user. In addition, the landmark information may include information about the position, size or shape of the major elements of the face of the user. In addition, the landmark information may include information about a color or texture of the major elements of the face of the user.
The user may denote an arbitrary user who uses a terminal on which the image transformation method according to the present disclosure is performed. In operation S2100, the face image of the user is received and landmark information corresponding to the face image is obtained. The landmark information may be obtained through well-known technology, and any of well-known methods may be used to obtain the same. In addition, the present disclosure is not limited by the method of obtaining landmark information.
In operation S2200, a transformation matrix corresponding to the landmark information may be estimated. The transformation matrix may constitute the landmark information together with a preset unit vector. For example, first landmark information may be calculated by a product of the unit vector and a first transformation matrix. For example, second landmark information may be calculated by a product of the unit vector and a second transformation matrix.
The transformation matrix is a matrix converting high-dimensional landmark information into low-dimensional data, and may be used in principal component analysis (PCA). PCA is a dimension reduction method in which the distribution of the data is preserved as much as possible while new axes orthogonal to each other are searched for, to convert variables of a high-dimensional space into variables of a low-dimensional space. In PCA, first, a hyperplane that is closest to the data is searched for, and then the data is projected onto that low-dimensional hyperplane to reduce its dimensionality.
In PCA, a unit vector defining an ith axis is referred to as an ith principal component (PC), and by linearly combining these axes, high-dimensional data may be converted into low-dimensional data.
X=αY [Equation 13]
Here, X denotes landmark information of a high dimension, Y denotes a principal component of a low dimension, and α denotes a transformation matrix.
As described above, the unit vector, that is, a principal component, may be determined in advance. Accordingly, when new landmark information is received, a transformation matrix corresponding thereto may be determined. Here, there may be a plurality of transformation matrices corresponding to one piece of landmark information.
Meanwhile, in operation S2100, a learning model trained to estimate the transformation matrix may be used. The learning model may be understood as a model trained to estimate a PCA transformation matrix from an arbitrary face image and landmark information corresponding to the arbitrary face image.
The learning model may be trained to estimate the transformation matrix from face images of different people and landmark information corresponding to each of the face images. There may be several transformation matrices corresponding to one piece of high-dimensional landmark information, and the learning model may be trained to output only one transformation matrix among the several transformation matrices.
The landmark information used as an input to the learning model may be obtained using a well-known method of extracting landmark information from a face image and visualizing the landmark information.
Thus, in operation S2100, the face image of the user and landmark information corresponding to the face image are received as an input, and one transformation matrix is estimated therefrom and output.
Meanwhile, the learning model may be trained to classify landmark information into a plurality of semantic groups respectively corresponding to the right eye, the left eye, nose, and mouth, and to output PCA conversion coefficients respectively corresponding to the plurality of semantic groups.
Here, the semantic groups are not necessarily classified to correspond to the right eye, the left eye, nose, and mouth, but may also be classified to correspond to eyebrows, eyes, nose, mouth, and jawline or to correspond to eyebrows, the right eye, the left eye, nose, mouth, jawline, and ears. In operation S2100, the landmark information is classified into semantic groups of a segmented unit according to the learning model, and a PCA conversion coefficient corresponding to the classified semantic groups may be estimated.
Meanwhile, an expression landmark of the user is calculated using the transformation matrix. Landmark information may be decomposed into a plurality of pieces of sub landmark information, and it is assumed in the present disclosure that the landmark information may be expressed as below.
I(c,t)=Im+Iid(c)+Iexp(c,t) [Equation 14]
where I(c,t) denotes landmark information in a t-th frame of a video containing person c; Im denotes a mean facial landmark of humans; Iid(c) denotes a facial landmark of identity geometry of person c; Iexp(c,t) denotes a facial landmark of expression geometry of person c in the t-th frame of the video containing person c.
That is, landmark information in a particular frame of a particular person may be expressed as a sum of mean landmark information of faces of all persons, identity landmark information of the particular person, and facial expression and motion information of the particular person in the particular frame.
The mean landmark information may be defined by the equation as below, and may be calculated based on a large amount of videos that are collectable in advance.
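The referenced equation can plausibly be written as follows, consistent with the description in the next sentence; here C is an assumed symbol for the number of persons (videos) in the previously collected data:

```latex
I_m = \frac{1}{C\,T}\sum_{c=1}^{C}\sum_{t=1}^{T} I(c,t)
```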
where T denotes the total number of frames of a video, and thus, Im denotes a mean of the landmarks I(c,t) of all persons appearing in previously collected videos.
Meanwhile, the expression landmark may be calculated using the equation below.
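A plausible form of this equation, consistent with the description of bexp, α, and nexp given immediately below, is:

```latex
I_{\mathrm{exp}}(c,t) = \sum_{k=1}^{n_{\mathrm{exp}}} \alpha_k(c,t)\, b_{\mathrm{exp},k}
```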
The above equation represents a result of performing PCA on each semantic group of person c. nexp denotes a sum of the expression bases of all semantic groups, bexp denotes an expression basis, which is a basis of PCA, and α denotes a PCA coefficient.
In other words, bexp denotes an eigenvector described above, and an expression landmark of a high dimension may be defined by a combination of low-dimensional eigenvectors. Also, nexp denotes a total number of expressions and motions that may be expressed by person c by the right eye, the left eye, nose, mouth, etc.
Thus, the expression landmark of the user may be defined by a set of expressions regarding the main parts of the face, that is, each of the right eye, the left eye, nose, and mouth. Also, there may be an αk(c,t) corresponding to each eigenvector.
The learning model described above may be trained to estimate a PCA coefficient α(c,t) by using, as an input, a picture x(c,t) and landmark information I(c,t) of person c whose landmark information is to be decomposed, as shown in Equation 14. Through such learning, the learning model may estimate a PCA coefficient from an image of a particular person and landmark information corresponding thereto, and may estimate the low-dimensional eigenvector.
When applying a trained neural network, a PCA transformation matrix is estimated by using a picture x(c′,t) and landmark information I(c′,t) of person c′ whose landmark is to be decomposed, as an input to a neural network. Here, a value obtained from learning data may be used as bexp, and an expression landmark may be estimated as below by using a predicted (estimated) PCA coefficient and bexp.
Îexp(c,t)=bexpᵀα̂(c,t) [Equation 17]
Thereafter, by using the expression landmark, an identity landmark of the user is calculated. As described with reference to Equation 14, landmark information may be defined as a sum of mean landmark information, identity landmark information, and expression landmark information, and the expression landmark information may be estimated through Equation 17.
Thus, the identity landmark may be calculated as below.
Îid(c)=I(c,t)−Im−Îexp(c,t) [Equation 18]
The above equation may be derived from Equation 14, and when the expression landmark is calculated, the identity landmark may be calculated through Equation 18. The mean landmark information may be calculated based on a large amount of videos that are collectable in advance.
Thus, when a face image of a person is given, landmark information may be obtained therefrom, and expression landmark information and identity landmark information may be calculated from the face image and the landmark information.
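A minimal NumPy sketch of this decomposition step, assuming the PCA coefficients have already been predicted by a trained model and the expression basis has been obtained from training data, could look as follows:

```python
def decompose_landmark(l, l_mean, b_exp, alpha_hat):
    """Sketch of the decomposition described above (cf. Equations 17 and 18).

    l:         (d,) landmark vector of the given face image (NumPy array)
    l_mean:    (d,) mean landmark computed from previously collected videos
    b_exp:     (k, d) expression basis obtained from the training data
    alpha_hat: (k,) PCA coefficients predicted by the trained model (assumed given)
    """
    l_exp = b_exp.T @ alpha_hat        # expression landmark (Equation 17)
    l_id = l - l_mean - l_exp          # identity landmark (Equation 18)
    return l_exp, l_id
```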
In operation S2200, a user feature map is generated from pose information of the face image of the user. The pose information may include motion information and facial expression information of the face image. Also, in operation S2200, the user feature map may be generated by inputting pose information corresponding to the face image of the user to an artificial neural network. Meanwhile, the pose information may be understood as corresponding to the expression landmark information obtained in operation S2100.
The user feature map generated in operation S2200 includes information expressing a facial expression that the user is making and characteristics of a motion of the face of the user. In addition, the artificial neural network used in operation S2200 may be a convolutional neural network (CNN) or other types of artificial neural networks.
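As an illustration of such an encoder, the following is a minimal PyTorch sketch of a convolutional network mapping a rendered pose (expression landmark) image to a spatial user feature map; the layer count and channel sizes are assumptions, not the configuration of the present disclosure:

```python
import torch
import torch.nn as nn

class PoseEncoder(nn.Module):
    """Minimal CNN that turns a rendered pose image into a spatial feature map."""
    def __init__(self, in_ch: int = 3, base_ch: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base_ch, base_ch * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base_ch * 2, base_ch * 4, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, pose_img: torch.Tensor) -> torch.Tensor:
        return self.net(pose_img)   # user feature map, e.g. (B, 256, H/8, W/8)

# Example: feature_map = PoseEncoder()(torch.randn(1, 3, 256, 256))
```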
In operation S2300, a face image of a target is received, and a target feature map and a pose-normalized target feature map are generated from style information and pose information corresponding to the face image of the target.
The target refers to a person to be transformed according to the present disclosure, and the user and the target may be different persons, but are not limited thereto. A reenacted image generated as a result of performing the present disclosure is transformed from the face image of the target, and may appear as the target imitating or copying a motion or facial expression of the user.
The target feature map includes information expressing a facial expression that the target is making and characteristics of a motion of the face of the target.
The pose-normalized target feature map may correspond to an output regarding the style information input to an artificial neural network. Alternatively, the pose-normalized target feature map may include information corresponding to distinct characteristics of the face of the target except for the pose information of the target.
Like the artificial neural network used in operation S2200, CNN may be used as the artificial neural network used in operation S2300, and a structure of the artificial neural network used in operation S2200 may be different from that of the artificial neural network used in operation S2300.
The style information refers to information indicating distinct characteristics of a person in a face of the person; for example, the style information may include innate features appearing on the face of the target, a size, shape, position or the like of landmarks, etc. Alternatively, the style information may include at least one of texture information, color information, and shape information corresponding to the face image of the target.
It will be understood that the target feature map includes data corresponding to expression landmark information obtained from the face image of the target, and the pose-normalized target feature map includes data corresponding to the identity landmark information obtained from the face image of the target.
In operation S2400, a mixed feature map is generated using the user feature map and the target feature map, and the mixed feature map may be generated by inputting the pose information of the face image of the user and the style information of the face image of the target, into an artificial neural network.
The mixed feature map may be generated to have pose information in which a landmark of the target corresponds to a landmark of the user.
Like the artificial neural network used in operation S2200 and operation S2300, CNN may be used as the artificial neural network used in operation S2400, and a structure of the artificial neural network used in operation S2400 may be different from that of the artificial neural network used in the previous operations.
In operation S2500, by using the mixed feature map and the pose-normalized target feature map, a reenacted image of the face image of the target is generated.
As described above, the pose-normalized target feature map includes data corresponding to identity landmark information obtained from the face image of the target, and the identity landmark information refers to information corresponding to distinct characteristics of a person, which are not relevant to expression information corresponding to motion information or facial expression information of that person.
When a motion of the target that naturally follows a motion of the user is obtained through the mixed feature map generated in operation S2400, then in operation S2500, by reflecting the distinct characteristics of the target, an effect may be obtained as if the target itself were actually moving and making the facial expression.
When comparing the target image with the reenacted image of
Meanwhile, a facial expression of the person of the reenacted image is substantially the same as that of the user. For example, when the user on the user image is opening the mouth, the reenacted image has an image of the target opening the mouth. In addition, when the user on the user image is turning his or her head to the right or to the left, the reenacted image has an image of the target turning his or her head to the right or the left.
When an image of the user that changes in real time is received and a reenacted image is generated based on the received image, the reenacted image may change the target image according to the motion and facial expression of the user, which change in real time.
The landmark obtainer 8100 receives face images of a user and a target and obtains landmark information from each face image. The landmark denotes face parts that characterize the face of the user, and may include, for example, eyes, eyebrows, nose, mouth, ears, or jawline or the like of the user. In addition, the landmark information may include information about the position, size or shape of the major elements of the face of the user. In addition, the landmark information may include information about a color or texture of the major elements of the face of the user.
The user may denote an arbitrary user who uses a terminal on which the image transformation method according to the present disclosure is performed. The landmark obtainer 8100 receives the face image of the user and obtains landmark information corresponding to the face image. The landmark information may be obtained through well-known technology, and any of well-known methods may be used to obtain the same. In addition, the present disclosure is not limited by the method of obtaining landmark information.
The landmark obtainer 8100 may estimate a transformation matrix corresponding to the landmark information. The transformation matrix may constitute the landmark information together with a preset unit vector. For example, first landmark information may be calculated by a product of the unit vector and a first transformation matrix. For example, second landmark information may be calculated by a product of the unit vector and a second transformation matrix.
The transformation matrix is a matrix converting high-dimensional landmark information into low-dimensional data, and may be used in principal component analysis (PCA). PCA is a dimension reduction method in which the distribution of the data is preserved as much as possible while new axes orthogonal to each other are searched for, to convert variables of a high-dimensional space into variables of a low-dimensional space. In PCA, first, a hyperplane that is closest to the data is searched for, and then the data is projected onto that low-dimensional hyperplane to reduce its dimensionality.
In PCA, a unit vector defining an ith axis is referred to as an ith principal component (PC), and by linearly combining these axes, high-dimensional data may be converted into low-dimensional data.
Meanwhile, the landmark obtainer 8100 may use a learning model trained to estimate the transformation matrix. The learning model may be understood as a model trained to estimate a PCA transformation matrix from an arbitrary face image and landmark information corresponding to the arbitrary face image.
The learning model may be trained to estimate the transformation matrix from face images of different people and landmark information corresponding to each of the face images. There may be several transformation matrices corresponding to one piece of high-dimensional landmark information, and the learning model may be trained to output only one transformation matrix among the several transformation matrices.
The landmark information used as an input to the learning model may be obtained using a well-known method of extracting landmark information from a face image and visualizing the landmark information.
Thus, the landmark obtainer 8100 receives the face image of the user and landmark information corresponding to the face image as an input, and estimates one transformation matrix therefrom and outputs the same.
Meanwhile, the learning model may be trained to classify landmark information into a plurality of semantic groups respectively corresponding to the right eye, the left eye, nose, and mouth, and to output PCA conversion coefficients respectively corresponding to the plurality of semantic groups.
Here, the semantic groups are not necessarily classified to correspond to the right eye, the left eye, nose, and mouth, but may also be classified to correspond to eyebrows, eyes, nose, mouth, and jawline or to correspond to eyebrows, the right eye, the left eye, nose, mouth, jawline, and ears. The landmark obtainer 8100 may classify the landmark information into semantic groups of a segmented unit according to the learning model, and estimate a PCA conversion coefficient corresponding to the classified semantic groups.
Meanwhile, an expression landmark of the user may be calculated using the transformation matrix. Landmark information may be decomposed into a plurality of pieces of sub landmark information, and in the present disclosure, the landmark information is defined to be a sum of mean facial landmark of humans, facial landmark of identity geometry of a person, and facial landmark of expression geometry of the person.
That is, landmark information in a particular frame of a particular person may be expressed as a sum of mean landmark information of faces of all persons, identity landmark information of the particular person, and facial expression and motion information of the particular person in the particular frame.
Meanwhile, the expression landmark corresponds to pose information of the face image of the user, and the identity landmark corresponds to style information of the face image of the target.
In sum, the landmark obtainer 8100 may receive the face image of the user and the face image of the target and respectively generate, from the face images, a plurality of pieces of landmark information including expression landmark information and identity landmark information.
The first encoder 8200 generates a user feature map from the pose information of the face image of the user. The pose information corresponds to the expression landmark information and may include motion information and facial expression information of the face image. In addition, the first encoder 8200 may input pose information corresponding to the face image of the user into an artificial neural network to generate the user feature map.
The user feature map generated by the first encoder 8200 includes information expressing a facial expression that the user is making and characteristics of a motion of the face of the user. In addition, the artificial neural network used by the first encoder 8200 may be a convolutional neural network (CNN) or other types of artificial neural networks.
The second encoder 8300 generates a target feature map and a pose-normalized target feature map from style information and pose information of the face image of the target.
The target refers to a person to be transformed according to the present disclosure, and the user and the target may be different persons, but are not limited thereto. A reenacted image generated as a result of performing the present disclosure is transformed from the face image of the target, and may appear as the target imitating or copying a motion or facial expression of the user.
The target feature map generated by the second encoder 8300 may be understood to be data corresponding to the user feature map generated by the first encoder 8200, and includes information expressing the features of a facial expression that the target is making and a motion of the face of the target.
The pose-normalized target feature map may correspond to an output regarding the style information input to an artificial neural network. Alternatively, the pose-normalized target feature map may include information corresponding to distinct characteristics of the face of the target except for the pose information of the target.
Like the artificial neural network used by the first encoder 8200, CNN may be used as the artificial neural network used by the second encoder 8300, and a structure of the artificial neural network used by the first encoder 8200 may be different from that of the artificial neural network used by the second encoder 8300.
The style information refers to information indicating distinct characteristics of a person in a face of the person; for example, the style information may include innate features appearing on the face of the target, a size, shape, position or the like of landmarks, etc. Alternatively, the style information may include at least one of texture information, color information, and shape information corresponding to the face image of the target.
It will be understood that the target feature map includes data corresponding to expression landmark information obtained from the face image of the target, and the pose-normalized target feature map includes data corresponding to the identity landmark information obtained from the face image of the target.
The blender 8400 may generate a mixed feature map by using the user feature map and the target feature map, and generate the mixed feature map by inputting the pose information of the face image of the user and the style information of the face image of the target, into an artificial neural network.
The mixed feature map may be generated to have pose information in which a landmark of the target corresponds to a landmark of the user. Like the artificial neural network used by the first encoder 8200 and the second encoder 8300, a CNN may be used as the artificial neural network used by the blender 8400, and a structure of the artificial neural network used by the blender 8400 may be different from that of the artificial neural network used by the first encoder 8200 or the second encoder 8300.
The user feature map and the target feature map that are input to the blender 8400 respectively include landmark information of the face of the user and landmark information of the face of the target, and the blender 8400 may perform an operation of matching the landmark of the face of the user to the landmark of the face of the target such that the distinct characteristics of the face of the target are maintained, while generating the face of the target corresponding to the motion and facial expression of the face of the user.
For example, to control a motion of the face of the target according to a motion of the face of the user, it may be understood that landmarks of the user such as the eyes, eyebrows, nose, mouth, jawline or the like are respectively linked with landmarks of the target such as the eyes, eyebrows, nose, mouth, jawline or the like.
Alternatively, to control a facial expression of the face of the target according to a facial expression of the face of the user, landmarks of the user such as the eyes, eyebrows, nose, mouth, jawline or the like may be respectively linked with landmarks of the target such as the eyes, eyebrows, nose, mouth, jawline or the like.
By using the mixed feature map and the pose-normalized target feature map, the decoder 8500 generates a reenacted image of the face image of the target.
As described above, the pose-normalized target feature map includes data corresponding to identity landmark information obtained from the face image of the target, and the identity landmark information refers to information corresponding to distinct characteristics of a person, which are not relevant to expression information corresponding to motion information or facial expression information of that person.
When a motion of the target that naturally follows a motion of the user is obtained through the mixed feature map generated by the blender 8400, the decoder 8500 may, by reflecting the distinct characteristics of the target, obtain an effect as if the target itself were actually moving and making the facial expression.
A multilayer perceptron (MLP) is a type of artificial neural network in which several perceptron layers are stacked to overcome the shortcomings of a single-layer perceptron. Referring to
When the transformation matrix is estimated through the trained artificial neural network, as described above with reference to
The trained artificial neural network is trained to estimate a low-dimensional eigenvector and a conversion coefficient from a large number of face images and landmark information corresponding to the face images, and the artificial neural network trained in this manner may estimate the eigenvector and the conversion coefficient even when only a face image of a single frame is given.
When an expression landmark and an identity landmark of an arbitrary person are decomposed using the above-described method, the quality of face image processing techniques such as face reenactment, face classification, face morphing, or the like may be improved.
Referring to
fy denotes a normalized flow map used when normalizing a target feature map, and T denotes a warping function performing warping. Also, Sj, for j=1 . . . ny, denotes the target feature map encoded in each convolutional layer.
The second encoder 8300 receives a rendered target landmark and a rendered target image as an input and generates, therefrom, an encoded target feature map and an encoded normalized flow map fy. Also, by performing a warping function by using the generated target feature map Sj and the generated normalized flow map fy as an input, a warped target feature map is generated.
Here, the warped target feature map may be understood to be the same as the pose-normalized target feature map described above. Accordingly, the warping function T may be understood to be a function that generates data consisting of only style information of the target itself, that is, only identity landmark information, without expression landmark information of the target.
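The following is a minimal PyTorch sketch of such a warping function using bilinear sampling; the convention that the flow stores per-pixel offsets is an assumption of the sketch:

```python
import torch
import torch.nn.functional as F

def warp(feature_map: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Bilinear warping of a feature map by a 2-channel flow map (illustrative sketch).

    feature_map: (B, C, H, W); flow: (B, 2, H, W) holding per-pixel offsets (assumed).
    """
    b, _, h, w = feature_map.shape
    # Base sampling grid in normalized [-1, 1] coordinates, as grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=feature_map.device),
        torch.linspace(-1, 1, w, device=feature_map.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Convert pixel offsets to the normalized coordinate range.
    offset = torch.stack((flow[:, 0] / (w - 1) * 2, flow[:, 1] / (h - 1) * 2), dim=-1)
    return F.grid_sample(feature_map, base + offset, mode="bilinear", align_corners=True)
```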
As described above, the blender 8400 generates a mixed feature map from a user feature map and a target feature map, and may generate the mixed feature map by inputting, into an artificial neural network, pose information of the face image of the user and the style information of the face image of the target.
In
The user feature map and the target feature map that are input to the blender 8400 respectively include landmark information of the face of the user and landmark information of the face of the target, and the blender 8400 may perform an operation of matching the landmark of the face of the user to the landmark of the face of the target such that the distinct characteristics of the face of the target are maintained, while generating the face of the target corresponding to the motion and facial expression of the face of the user.
For example, to control a motion of the face of the target according to a motion of the face of the user, it may be understood that landmarks of the user such as the eyes, eyebrows, nose, mouth, jawline or the like are respectively linked with landmarks of the target such as the eyes, eyebrows, nose, mouth, jawline or the like.
Alternatively, to control a facial expression of the face of the target according to a facial expression of the face of the user, landmarks of the user such as the eyes, eyebrows, nose, mouth, jawline or the like may be respectively linked with landmarks of the target such as the eyes, eyebrows, nose, mouth, jawline or the like.
In addition, for example, the eyes may be found in the user feature map and in the target feature map, and a mixed feature map may be generated such that the eyes of the target feature map follow a movement of the eyes of the user feature map. Substantially the same operation may be performed on the other landmarks by using the blender 8400.
Referring to
In
In addition, a warp-alignment block of the decoder 8500 performs a warping function by using an output (u) of a previous block of the decoder 8500 and the pose-normalized target feature map as an input. The warping function performed in the decoder 8500 is to generate a reenacted image conforming to a motion and pose of the user while maintaining the distinct characteristics of the target, and is different from the warping function performed in the second encoder 8300.
Meanwhile, a moving image may be generated by the embodiments described above with reference to
Alternatively, based on an image conversion template, a static image input may be converted into a moving image. The image conversion template may include a plurality of frames, and each frame may be a static image. For example, a plurality of intermediate images (i.e., a plurality of static images) may be generated by applying each of the plurality of frames to an input static image. And, the moving image may be generated by combining the generated intermediate images.
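A minimal sketch of this template-based conversion is given below; convert_frame is a hypothetical stand-in for the per-frame transformation described above, and the use of imageio for writing the result is an assumption of the sketch:

```python
import imageio

def make_moving_image(static_image, template_frames, convert_frame, out_path="out.mp4", fps=25):
    """Apply each template frame to the input static image and combine the results.

    convert_frame(static_image, template_frame) is a hypothetical stand-in for the
    conversion step described above; it returns one intermediate static image.
    """
    intermediate = [convert_frame(static_image, frame) for frame in template_frames]
    imageio.mimsave(out_path, intermediate, fps=fps)   # combine into a moving image
    return out_path
```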
Alternatively, a moving image may be generated by converting an input moving image. In this case, each of a plurality of first static images (frames) included in the input moving image is converted into a corresponding second static image, and the second static images are combined to generate the moving image.
The embodiments described above with reference to
When there is a mismatch between the target identity and the driver identity, face reenactment suffers severe degradation in the quality of the result, especially in a few-shot setting. The identity preservation problem, where the model loses the detailed information of the target leading to a defective output, is the most common failure mode. The problem has several potential sources such as the identity of the driver leaking due to the identity mismatch, or dealing with unseen large poses.
To overcome such problems, we introduce components that address the mentioned problem: image attention block, target feature alignment, and landmark transformer. Through attending and warping the relevant features, the proposed architecture, called MarioNETte, produces high-quality reenactments of unseen identities in a few-shot setting. In addition, the landmark transformer dramatically alleviates the identity preservation problem by isolating the expression geometry through landmark disentanglement. Comprehensive experiments are performed to verify that the proposed framework can generate highly realistic faces, outperforming all other baselines, even under a significant mismatch of facial characteristics between the target and the driver.
Given a target face and a driver face, face reenactment aims to synthesize a reenacted face which is animated by the movement of a driver while preserving the identity of the target.
Many approaches make use of generative adversarial networks (GANs), which have demonstrated great success in image generation tasks. Xu et al.; Wu et al. (2017; 2018) achieved high-fidelity face reenactment results by exploiting CycleGAN (Zhu et al. 2017). However, the CycleGAN-based approaches require at least a few minutes of training data for each target and can only reenact predefined identities, which is less attractive in-the-wild, where a reenactment of unseen targets cannot be avoided.
The few-shot face reenactment approaches, therefore, try to reenact any unseen targets by utilizing operations such as adaptive instance normalization (AdaIN) (Zakharov et al. 2019) or warping module (Wiles, Koepke, and Zisserman 2018; Siarohin et al. 2019). However, current state-of-the-art methods suffer from the problem we call identity preservation problem: the inability to preserve the identity of the target leading to defective reenactments. As the identity of the driver diverges from that of the target, the problem is exacerbated even further.
Examples of flawed and successful face reenactments, generated by previous approaches and the proposed model, respectively, are illustrated in
1. Neglecting the identity mismatch may lead the identity of the driver to interfere with the face synthesis such that the generated face resembles the driver (
2. Insufficient capacity of the compressed vector representation (e.g., AdaIN layer) to preserve the information of the target identity may lead the produced face to lose the detailed characteristics (
3. Warping operation incurs a defect when dealing with large poses (
We propose a framework called MarioNETte, which aims to reenact the face of unseen targets in a few-shot manner while preserving the identity without any fine-tuning. We adopt an image attention block and target feature alignment, which allow MarioNETte to directly inject features from the target when generating an image. In addition, we propose a novel landmark transformer which further mitigates the identity preservation problem by adjusting for the identity mismatch in an unsupervised fashion. Our contributions are as follows:
MarioNETte Architecture
The generator consists of the following components:
Image Attention Block
To transfer style information of targets to the driver, previous studies encoded target information as a vector and mixed it with the driver feature by concatenation or AdaIN layers (Liu et al. 2019; Zakharov et al. 2019). However, encoding targets as a spatially agnostic vector leads to losing spatial information of the targets. In addition, these methods lack an innate design for multiple target images, and thus summary statistics (e.g., mean or max) are used to deal with multiple targets, which might cause loss of detail of the target.
We suggest an image attention block (
Given a driver feature map zx ∈ ℝ^(hx×wx×c) and target feature maps zy={zyi}i=1 . . . K, the image attention block computes a query from the driver and keys and values from the targets:
Q=zxWq+PxWqp ∈ ℝ^(hxwx×ca)
K=ZyWk+PyWkp ∈ ℝ^(Khywy×ca)
V=ZyWv ∈ ℝ^(Khywy×c)
where Zy denotes the target feature maps stacked along the spatial axis, Px and Py denote positional encodings of the driver and target feature maps, respectively, Wq, Wqp, Wk, Wkp, and Wv are learned projection matrices, ca denotes the channel dimension of the attention space, and f(Q, K, V) denotes the dot-product attention applied to the resulting query, keys, and values.
Instance normalization, a residual connection, and a convolution layer follow the attention layer to generate the output feature map zxy. The image attention block offers a direct mechanism for transferring information from multiple target images to the pose of the driver.
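A minimal PyTorch sketch of such an attention block is given below; it follows the described query/key/value structure and the normalization, residual, and convolution steps, while the positional-encoding terms are omitted and the channel sizes are assumptions:

```python
import torch
import torch.nn as nn

class ImageAttentionBlock(nn.Module):
    """Sketch: driver positions attend over K target feature maps (positional encodings omitted)."""
    def __init__(self, ch: int = 256, attn_ch: int = 128):
        super().__init__()
        self.wq = nn.Linear(ch, attn_ch)
        self.wk = nn.Linear(ch, attn_ch)
        self.wv = nn.Linear(ch, ch)
        self.norm = nn.InstanceNorm2d(ch)
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.scale = attn_ch ** -0.5

    def forward(self, zx: torch.Tensor, zy: torch.Tensor) -> torch.Tensor:
        # zx: driver feature map (B, C, Hx, Wx); zy: target feature maps (B, K, C, Hy, Wy).
        b, c, hx, wx = zx.shape
        q = self.wq(zx.flatten(2).transpose(1, 2))              # (B, HxWx, Ca)
        tgt = zy.permute(0, 1, 3, 4, 2).reshape(b, -1, c)       # (B, K*Hy*Wy, C)
        k, v = self.wk(tgt), self.wv(tgt)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, c, hx, wx)
        out = self.norm(out) + zx                               # instance norm + residual
        return self.conv(out)                                   # output feature map zxy
```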
Target Feature Alignment
The fine-grained details of the target identity can be preserved through the warping of low-level features (Siarohin et al. 2019). Unlike previous approaches that estimate a warping flow map or an affine transform matrix by computing the difference between keypoints of the target and the driver (Balakrishnan et al. 2018; Siarohin et al. 2018; Siarohin et al. 2019), we propose a target feature alignment (
1. Target pose normalization. In the target encoder Ey, the encoded feature maps {Sj}j=1 . . . ny are pose-normalized by warping them with the estimated normalization flow map fy, producing pose-normalized target feature maps {Ŝj}.
2. Driver pose adaptation. The warp-alignment block in the decoder receives {Ŝi}i=1 . . . K and the output u of the previous block of the decoder. In a few-shot setting, we average resolution-compatible feature maps from different target images (i.e., Ŝj = (1/K)Σi Ŝji). To adapt the pose-normalized feature maps to the pose of the driver, we generate an estimated flow map of the driver, fu, using a 1×1 convolution that takes u as the input. Alignment by the warping function T(Ŝj; fu) follows.
Landmark Transformer
Large structural differences between two facial landmarks may lead to severe degradation of the quality of the reenactment. The usual approach to such a problem has been to learn a transformation for every identity (Wu et al. 2018) or to prepare paired landmark data with the same expressions (Zhang et al. 2019). However, these methods are unnatural in a few-shot setting where we handle unseen identities, and moreover, labeled data is hard to acquire. To overcome this difficulty, we propose a novel landmark transformer which transfers the facial expression of the driver to an arbitrary target identity. The landmark transformer utilizes multiple videos of unlabeled human faces and is trained in an unsupervised manner.
Landmark Decomposition
Given video footages of different identities, we denote x(c,t) as the t-th frame of the c-th video, and I(c,t) as a 3D facial landmark. We first transform every landmark into a normalized landmark Ī(c,t) by normalizing the scale, translation, and rotation. Inspired by 3D morphable models of face (Blanz and Vetter 1999), we assume that normalized landmarks can be decomposed as follows:
Ī(c,t)=Īm+Īid(c)+Īexp(c,t) [Equation 21]
where Īm is the average facial landmark geometry computed by taking the mean over all landmarks, Īid(c) denotes the landmark geometry of identity c, computed by Īid(c)=ΣtĪ(c,t)/Tc−Īm where Tc is the number of frames of c-th video, and Īexp(c,t) corresponds to the expression geometry of t-th frame. The decomposition leads to Īexp(c,t)=Ī(c,t)−Īm−Īid(c).
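As an illustration, the decomposition can be sketched in NumPy as follows; for simplicity the sketch assumes every video has the same number of frames, whereas the text above allows a per-video frame count Tc:

```python
import numpy as np

def decompose_landmarks(normalized):
    """normalized: (C, T, 68, 3) array of scale/translation/rotation-normalized landmarks.

    Returns the mean geometry, per-identity geometry, and per-frame expression geometry,
    following the decomposition l(c,t) = l_m + l_id(c) + l_exp(c,t).
    """
    l_mean = normalized.mean(axis=(0, 1))        # average over identities and frames
    l_id = normalized.mean(axis=1) - l_mean      # per-identity geometry, (C, 68, 3)
    l_exp = normalized - l_mean - l_id[:, None]  # per-frame expression, (C, T, 68, 3)
    return l_mean, l_id, l_exp
```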
Given a target landmark Ī(cy, ty) and a driver landmark Ī(cx, tx) we wish to generate the following landmark:
Ī(cx→cy,tx)=Īm+Īid(cy)+Īexp(cx,tx) [Equation 22]
i.e., a landmark with the identity of the target and the expression of the driver. Computing Īid(cy) and Īexp is possible if enough images of cy are given, but in a few-shot setting, it is difficult to disentangle landmark of unseen identity into two terms.
Landmark Disentanglement
To decouple the identity and the expression geometry in a few-shot setting, we introduce a neural network to regress the coefficients for linear bases. Previously, such an approach has been widely used in modeling complex face geometries (Blanz and Vetter 1999). We separate expression landmarks into semantic groups of the face (e.g., mouth, nose and eyes) and perform PCA on each group to extract the expression bases from the training data:
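The omitted expansion presumably takes the following form, consistent with the sentence that follows (the summation running over the expression bases of all semantic groups):

```latex
\bar{I}_{\mathrm{exp}}(c,t) = \sum_{k=1}^{n_{\mathrm{exp}}} \alpha_k(c,t)\, b_{\mathrm{exp},k}
```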
where bexp,k and αk represent the basis and the corresponding coefficient, respectively.
The proposed neural network, a landmark disentangler M, estimates α(c,t) given an image x(c,t) and a landmark Ī(c,t).
Īexp(c,t)=λexp bexpᵀα̂(c,t)
Īid(c)=Ī(c,t)−Īm−Īexp(c,t) [Equation 24]
where λexp is a hyperparameter that controls the intensity of the predicted expressions from the network. An image feature extracted by a ResNet-50 and the landmark Ī(c,t)−Īm are fed into a 2-layer MLP to predict α̂(c,t).
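A minimal PyTorch sketch of such a regressor is given below; it uses torchvision's ResNet-50 as the image feature extractor as described above (the exact weights argument may vary with the torchvision version), while the hidden width and the number of coefficients are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class LandmarkDisentangler(nn.Module):
    """Sketch: image feature + (landmark - mean landmark) -> predicted PCA coefficients."""
    def __init__(self, n_landmark_dims: int = 68 * 3, n_coeffs: int = 64):
        super().__init__()
        backbone = resnet50(weights=None)     # image feature extractor
        backbone.fc = nn.Identity()           # keep the 2048-d pooled feature
        self.backbone = backbone
        self.mlp = nn.Sequential(             # 2-layer MLP predicting the coefficients
            nn.Linear(2048 + n_landmark_dims, 512), nn.ReLU(inplace=True),
            nn.Linear(512, n_coeffs),
        )

    def forward(self, image: torch.Tensor, landmark_minus_mean: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(image)                               # (B, 2048)
        x = torch.cat([feat, landmark_minus_mean.flatten(1)], 1)  # concatenate inputs
        return self.mlp(x)                                        # alpha_hat, (B, n_coeffs)
```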
During the inference, the target and the driver landmarks are processed according to Equation 24. When multiple target images are given, we take the mean value over all Īid(cy). Finally, landmark transformer converts landmark as:
Ī(cx→cy,tx)=Īm+Īid(cy)+Īexp(cx,tx) [Equation 25]
Denormalization to recover the original scale, translation, and rotation is followed by the rasterization that generates a landmark adequate for the generator to consume.
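The inference-time combination of Equations 24 and 25 can be sketched as follows; all inputs are assumed to be NumPy arrays of already-normalized landmarks, and denormalization and rasterization are intentionally left out of the sketch:

```python
def transform_landmark(l_mean, l_target_norm, b_exp, alpha_hat_target, alpha_hat_driver,
                       lam_exp=1.0):
    """Sketch of landmark transfer: identity of the target with the expression of the driver.

    The alpha_hat coefficients are assumed to come from the landmark disentangler, and
    b_exp is the (k, d) expression basis extracted from the training data.
    """
    l_exp_target = lam_exp * (b_exp.T @ alpha_hat_target)   # target expression (Equation 24)
    l_id_target = l_target_norm - l_mean - l_exp_target     # target identity (Equation 24)
    l_exp_driver = lam_exp * (b_exp.T @ alpha_hat_driver)   # driver expression geometry
    return l_mean + l_id_target + l_exp_driver              # transferred landmark (Equation 25)
```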
Experimental Setup
Datasets
We trained our model and the baselines using VoxCeleb1 (Nagrani, Chung, and Zisserman 2017), which contains 256×256 videos of 1,251 different identities. We utilized the test split of VoxCeleb1 and CelebV (Wu et al. 2018) for evaluating self-reenactment and reenactment under a different identity, respectively. We created the test set by sampling 2,083 image sets from 100 randomly selected videos of the VoxCeleb1 test split, and uniformly sampled 2,000 image sets from every identity in CelebV. The CelebV data includes videos of five different celebrities with widely varying characteristics, which we utilize to evaluate the performance of the models reenacting unseen targets, similar to an in-the-wild scenario. Further details of the loss function and the training method can be found in Supplementary Material A3 and A4.
Baselines
MarioNETte variants, with and without the landmark transformer (MarioNETte+LT and MarioNETte, respectively), are compared with state-of-the-art models for few-shot face reenactment. Details of each baseline are as follows:
Metrics
We compare the models based on the following metrics to evaluate the quality of the generated images. Structural similarity (SSIM) (Wang et al. 2004) and peak signal-to-noise ratio (PSNR) evaluate the low-level similarity between the generated image and the ground-truth image. We also report the masked SSIM (M-SSIM) and masked PSNR (M-PSNR), where the measurements are restricted to the facial region.
In the absence of the ground-truth image, where a different identity drives the target face, the following metrics are more relevant. Cosine similarity (CSIM) of embedding vectors generated by a pre-trained face recognition model (Deng et al. 2019) is used to evaluate the quality of identity preservation. To inspect the capability of the model to properly reenact the pose and the expression of the driver, we compute PRMSE, the root mean square error of the head pose angles, and AUCON, the ratio of identical facial action unit values, between the generated images and the driving images. OpenFace (Baltrusaitis et al. 2018) is utilized to compute pose angles and action unit values.
Experimental Results
Models were compared under self-reenactment and reenactment of different identities, including a user study. Ablation tests were conducted as well. All experiments were conducted under two different settings: one-shot and few-shot, where one or eight target images were used respectively.
Self-reenactment
Reenacting Different Identity
User Study
Two types of user studies are conducted to assess the performance of the proposed model:
For both studies, 150 examples were sampled from CelebV, which were evenly distributed to 100 different human evaluators.
Ablation Test
We performed an ablation test to investigate the effectiveness of the proposed components. While keeping all other things the same, we compare the following configurations reenacting different identities: (1) MarioNETte is the proposed method where both the image attention block and target feature alignment are applied. (2) AdaIN corresponds to the same model as MarioNETte, where the image attention block is replaced with an AdaIN residual block while the target feature alignment is omitted. (3) +Attention is a MarioNETte where only the image attention block is applied. (4) +Alignment only employs the target feature alignment.
Entirely relying on target feature alignment for reenactment, +Alignment is vulnerable to failures due to large differences in pose between target and driver that MarioNETte can overcome. Given a single driver image along with three target images (
Related Works
The classical approach to face reenactment commonly involves the use of explicit 3D modeling of human faces (Blanz and Vetter 1999), where the 3DMM parameters of the driver and the target are computed from a single image and blended eventually (Thies et al. 2015; Thies et al. 2016). Image warping is another popular approach, where the target image is modified using the estimated flow obtained from 3D models (Cao et al. 2013) or sparse landmarks (Averbuch-Elor et al. 2017). Face reenactment studies have embraced the recent success of neural networks, exploring different image-to-image translation architectures (Isola et al. 2017) such as the works of Xu et al. (2017) and Wu et al. (2018), which combined the cycle consistency loss (Zhu et al. 2017). A hybrid of the two approaches has been studied as well. Kim et al. (2018) trained an image translation network which maps a reenacted render of a 3D face model into a photo-realistic output.
Architectures capable of blending the style information of the target with the spatial information of the driver have been proposed recently. The AdaIN layer (Huang and Belongie 2017; Huang et al. 2018; Liu et al. 2019), attention mechanisms (Zhu et al. 2019; Lathuilière et al. 2019; Park and Lee 2019), deformation operations (Siarohin et al. 2018; Dong et al. 2018), and GAN-based methods (Bao et al. 2018) have all seen wide adoption. A similar idea has been applied to few-shot face reenactment settings, such as the use of image-level (Wiles, Koepke, and Zisserman 2018) and feature-level (Siarohin et al. 2019) warping, and an AdaIN layer in conjunction with meta-learning (Zakharov et al. 2019). The identity mismatch problem has been studied through methods such as CycleGAN-based landmark transformers (Wu et al. 2018) and landmark swappers (Zhang et al. 2019). While effective, these methods either require an independent model per person or a dataset with image pairs that may be hard to acquire.
In this paper, we have proposed a framework for few-shot face reenactment. Our proposed image attention block and target feature alignment, together with the landmark transformer, allow us to handle the identity mismatch caused by using the landmarks of a different person. The proposed method does not need an additional fine-tuning phase for identity adaptation, which significantly increases the usefulness of the model when deployed in-the-wild. Our experiments, including human evaluation, demonstrate the strength of the proposed method.
One exciting avenue for future work is to improve the landmark transformer to better handle the landmark disentanglement to make the reenactment even more convincing.
Supplemental Materials
MarioNETte Architecture Details
Architecture Design
Given a driver image x and K target images {yi}, the proposed few-shot face reenactment framework, which we call MarioNETte, first generates 2D landmark images (i.e., rx and {ryi}i=1 . . . K). We utilize a 3D landmark detector that maps an h×w×3 image to 68×3 landmark coordinates (Bulat and Tzimiropoulos 2017) to extract facial keypoints, which include information about pose and expression and are denoted as Ix and Iyi, respectively. We further rasterize the 3D landmarks into images by a rasterizer R, resulting in rx=R(Ix) and ryi=R(Iyi).
We utilize a simple rasterizer that orthogonally projects 3D landmark points, e.g., (x, y, z), into the 2D XY-plane, e.g., (x, y), and we group the projected landmarks into 8 categories: left eye, right eye, contour, nose, left eyebrow, right eyebrow, inner mouth, and outer mouth. For each group, lines are drawn between the points in a predefined order with predefined colors (e.g., red, red, green, blue, yellow, yellow, cyan, and cyan, respectively), resulting in a rasterized image as shown in
MarioNETte consists of a conditional image generator G(rx; {yi}i=1 . . . K, {ryi}i=1 . . . K) and a projection discriminator D(x̂, r̂, c). The discriminator D determines whether the given image x̂ is a real image from the data distribution, taking into account the conditional input of the rasterized landmarks r̂ and identity c.
The generator G(rx; {yi}i=1 . . . K, {ryi}i=1 . . . K) is further broken down into four components, namely, a target encoder, a driver encoder, a blender, and a decoder. The target encoder Ey(y, ry) takes a target image and generates an encoded target feature map zy together with the warped target feature map Ŝ. The driver encoder Ex(rx) receives a driver image and creates a driver feature map zx. The blender B(zx, {zyi}i=1 . . . K) combines the encoded feature maps to produce a mixed feature map zxy. The decoder Q(zxy, {Ŝi}i=1 . . . K) generates the reenacted image. The input image y and the landmark image ry are concatenated channel-wise and fed into the target encoder.
The target encoder E_y(y, r_y) adopts a U-Net (Ronneberger, Fischer, and Brox 2015) style architecture including five downsampling blocks and four upsampling blocks with skip connections. Among the five feature maps {s_j}j=1 . . . 5 generated by the downsampling blocks, the most downsampled feature map, s_5, is used as the encoded target feature map z_y, while the others, {s_j}j=1 . . . 4, are transformed into normalized feature maps. A normalization flow map f_y ∈ R^((h/2)×(w/2)×2) transforms each of these feature maps into a normalized feature map through a warping function w,
Ŝ = {w(s_1; f_y), . . . , w(s_4; f_y)} [Equation 26]
The flow map f_y is generated at the end of the upsampling blocks, followed by an additional convolution layer and a hyperbolic tangent activation layer, thereby producing a 2-channel feature map, where the two channels denote flows for the horizontal and vertical directions, respectively.
We adopt a bilinear-sampler-based warping function w, which is widely used along with neural networks due to its differentiability (Jaderberg et al. 2015; Balakrishnan et al. 2018; Siarohin et al. 2019). Since each s_j has a different width and height, average pooling is applied to downsample f_y so that its size matches that of s_j.
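A minimal Python sketch of this bilinear-sampler warping is given below. It interprets the 2-channel tanh output as absolute sampling coordinates in [-1, 1], which is one common convention when using grid_sample; the exact flow convention of the original implementation is an assumption.

import torch
import torch.nn.functional as F

def warp(feature: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """feature: (B, C, H, W); flow: (B, 2, h, w) with values in [-1, 1]."""
    _, _, h, w = feature.shape
    # downsample the flow map via average pooling to match the feature map size
    flow = F.adaptive_avg_pool2d(flow, (h, w))
    grid = flow.permute(0, 2, 3, 1)   # (B, H, W, 2) as (x, y) sampling pairs
    return F.grid_sample(feature, grid, mode="bilinear", align_corners=False)

# Example: warp all intermediate target feature maps s_1..s_4 with one flow map f_y
# s_hat = [warp(s_j, f_y) for s_j in s_list]   # corresponds to Equation 26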
The driver encoder E_x(r_x), which consists of four residual downsampling blocks, takes the driver landmark image r_x and generates the driver feature map z_x.
The blender B(z_x, {z_yi}i=1 . . . K) produces the mixed feature map z_xy by blending the positional information of z_x with the target style feature maps z_y. We stack three image attention blocks to build our blender; a simplified sketch of such an attention block is given below.
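The following Python sketch illustrates, in simplified form, how a driver feature map can attend over the spatial positions of K target feature maps: each driver position is a query, and every spatial position of every target feature map serves as a key/value. Positional encodings, normalization layers, and the stacking of multiple blocks are omitted here and the exact block design is an assumption.

import math
from typing import List
import torch
import torch.nn as nn

class SimpleImageAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)

    def forward(self, z_x: torch.Tensor, z_ys: List[torch.Tensor]) -> torch.Tensor:
        b, c, h, w = z_x.shape
        q = self.q(z_x).flatten(2).transpose(1, 2)     # (B, HW, C) driver queries
        z_y = torch.cat(z_ys, dim=3)                   # stack K target maps along width
        k = self.k(z_y).flatten(2).transpose(1, 2)     # (B, K*HW, C)
        v = self.v(z_y).flatten(2).transpose(1, 2)     # (B, K*HW, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(c), dim=-1)
        mixed = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return z_x + mixed                             # residual connection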
The decoder Q(z_xy, {Ŝ_i}i=1 . . . K) consists of four warp-alignment blocks followed by residual upsampling blocks. Note that the last upsampling block is followed by an additional convolution layer and a hyperbolic tangent activation function.
The discriminator D(x̂, r̂, c) consists of five residual downsampling blocks without self-attention layers. We adopt a projection discriminator with a slight modification: the global sum-pooling layer is removed from the original structure. By removing the global sum-pooling layer, the discriminator generates scores on multiple patches, like a PatchGAN discriminator (Isola et al. 2017).
We adopt the residual upsampling and downsampling blocks proposed by Brock, Donahue, and Simonyan (2019) to build our networks. All batch normalization layers are substituted with instance normalization, except in the target encoder and the discriminator, where the normalization layer is absent. We utilize ReLU as the activation function. The number of channels is doubled (or halved) when the output is downsampled (or upsampled). The minimum number of channels is set to 64 and the maximum to 512 for every layer. Note that the input image, which is used as an input for the target encoder, driver encoder, and discriminator, is first projected through a convolutional layer to match the channel size of 64.
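For illustration, a hedged Python sketch of one residual downsampling block following the pattern described above (instance normalization, ReLU, channel change, spatial downsampling) is shown below; the exact layer ordering and shortcut design of the cited blocks of Brock, Donahue, and Simonyan (2019) are assumptions.

import torch
import torch.nn as nn

class ResDownBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.main = nn.Sequential(
            nn.InstanceNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.InstanceNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.AvgPool2d(2),
        )
        self.skip = nn.Sequential(nn.Conv2d(in_ch, out_ch, 1), nn.AvgPool2d(2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.main(x) + self.skip(x)

# Channels double while the resolution halves, e.g., 64 -> 128 -> 256 -> 512 (capped at 512).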
Positional Encoding
We utilize a sinusoidal positional encoding introduced by Vaswani et al. (2017) with a slight modification. First, we divide the number of channels of the positional encoding in half. Then, we utilize half of them to encode the horizontal coordinate and the rest to encode the vertical coordinate. To encode the relative position, we normalize the absolute coordinates by the width and the height of the feature map. Thus, given a feature map z ∈ R^(h×w×c), the encoding at each spatial position is computed from its normalized horizontal and vertical coordinates.
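The following Python sketch builds such a 2D positional encoding: the first half of the channels encodes the horizontal coordinate (normalized by the width) and the second half encodes the vertical coordinate (normalized by the height), using Vaswani-style sinusoids. The exact frequency scaling is an assumption.

import torch

def positional_encoding_2d(c: int, h: int, w: int) -> torch.Tensor:
    """Returns a (c, h, w) encoding; c must be divisible by 4."""
    assert c % 4 == 0
    pe = torch.zeros(c, h, w)
    c_half = c // 2
    # sinusoid frequencies (assumed scaling, following Vaswani et al.)
    div = torch.exp(torch.arange(0, c_half, 2, dtype=torch.float32)
                    * (-torch.log(torch.tensor(10000.0)) / c_half))     # (c/4,)
    x = torch.arange(w, dtype=torch.float32) / w   # horizontal coord, normalized by width
    y = torch.arange(h, dtype=torch.float32) / h   # vertical coord, normalized by height
    # first half of the channels: horizontal coordinate
    pe[0:c_half:2] = torch.sin(x[None, :] * div[:, None])[:, None, :].expand(-1, h, -1)
    pe[1:c_half:2] = torch.cos(x[None, :] * div[:, None])[:, None, :].expand(-1, h, -1)
    # second half of the channels: vertical coordinate
    pe[c_half::2] = torch.sin(y[None, :] * div[:, None])[:, :, None].expand(-1, -1, w)
    pe[c_half + 1::2] = torch.cos(y[None, :] * div[:, None])[:, :, None].expand(-1, -1, w)
    return pe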
Loss Functions
Our model is trained in an adversarial manner using a projection discriminator D (Miyato and Koyama 2018). The discriminator aims to distinguish between a real image of the identity c and a synthesized image of c generated by G. Since paired target and driver images from different identities cannot be acquired without explicit annotation, we train our model using target and driver images extracted from the same video. Thus, the identities of x and y are always the same, i.e., c, for every target and driver image pair (x, {y_i}i=1 . . . K) during training.
We use the hinge GAN loss (Lim and Ye 2017) to optimize the discriminator D as follows:
x̂ = G(r_x; {y_i}, {r_yi})
L_D = max(0, 1 − D(x, r_x, c)) + max(0, 1 + D(x̂, r_x, c)) [Equation 28]
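As a concrete illustration, a minimal Python sketch of the discriminator hinge loss in Equation 28 is given below, assuming the projection discriminator returns a patch-wise score map that is averaged into a scalar (consistent with the PatchGAN-style modification described earlier).

import torch
import torch.nn.functional as F

def discriminator_hinge_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    # L_D = max(0, 1 - D(x, r_x, c)) + max(0, 1 + D(x_hat, r_x, c)), averaged over patches
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()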
The loss function of the generator consists of four components: the GAN loss L_GAN, the perceptual losses L_P and L_PF, and the feature matching loss L_FM. The GAN loss L_GAN is the generator part of the hinge GAN loss and is defined as follows:
L_GAN = −D(x̂, r_x, c) [Equation 29]
The perceptual loss (Johnson, Alahi, and Fei-Fei 2016) is calculated by averaging the L1-distances between the intermediate features of a pre-trained network computed on the ground truth image x and on the generated image x̂. We use two different networks for the perceptual losses: L_P and L_PF are extracted from VGG19 and VGG-VD-16, trained on the ImageNet classification task (Simonyan and Zisserman 2014) and a face recognition task (Parkhi, Vedaldi, and Zisserman 2015), respectively. We use features from the following layers to compute the perceptual losses: relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1. The feature matching loss L_FM is the sum of the L1-distances between the intermediate features of the discriminator D when processing the ground truth image x and the generated image x̂, which helps to stabilize the adversarial training. The overall generator loss is the weighted sum of the four losses:
L_G = L_GAN + λ_P L_P + λ_PF L_PF + λ_FM L_FM [Equation 30]
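A hedged Python sketch of the overall generator objective in Equation 30 follows. The VGG19, VGG-VD-16 (face), and discriminator feature extractors are treated as callables returning lists of intermediate feature maps, and the weights use the values given in the training details below (λ_P = 10, λ_PF = 0.01, λ_FM = 10); the facial-region weighting of the perceptual loss is omitted for brevity.

from typing import Callable, List
import torch

def l1_feature_distance(feats_a: List[torch.Tensor], feats_b: List[torch.Tensor]) -> torch.Tensor:
    # mean absolute difference between corresponding intermediate feature maps
    return sum(torch.mean(torch.abs(a - b)) for a, b in zip(feats_a, feats_b))

def generator_loss(
    x: torch.Tensor, x_hat: torch.Tensor, d_fake: torch.Tensor,
    vgg_feats: Callable, vggface_feats: Callable, disc_feats: Callable,
    lambda_p: float = 10.0, lambda_pf: float = 0.01, lambda_fm: float = 10.0,
) -> torch.Tensor:
    loss_gan = -d_fake.mean()                                             # Equation 29
    loss_p = l1_feature_distance(vgg_feats(x), vgg_feats(x_hat))          # VGG19 perceptual loss
    loss_pf = l1_feature_distance(vggface_feats(x), vggface_feats(x_hat)) # VGG-VD-16 perceptual loss
    loss_fm = l1_feature_distance(disc_feats(x), disc_feats(x_hat))       # feature matching loss
    return loss_gan + lambda_p * loss_p + lambda_pf * loss_pf + lambda_fm * loss_fm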
Training Details
To stabilize the adversarial training, we apply spectral normalization (Miyato et al. 2018) to every layer of the discriminator and the generator. In addition, we use the convex hull of the facial landmarks as a facial region mask and give three-fold weight to the masked positions while computing the perceptual loss. We use the Adam optimizer to train our model, with a learning rate of 2×10−4 for the discriminator and 5×10−5 for the generator and the style encoder. Unlike the setting of Brock, Donahue, and Simonyan (2019), we update the discriminator only once per generator update. We set λ_P to 10, λ_PF to 0.01, λ_FM to 10, and the number of target images K to 4 during training.
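The following short Python sketch illustrates the optimization setup described above: spectral normalization applied to the convolutional and linear layers, and separate Adam learning rates for the discriminator and the generator. The module names are placeholders and applying spectral normalization via recursive wrapping is one possible realization, not necessarily the original one.

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

def apply_spectral_norm(module: nn.Module) -> nn.Module:
    # recursively wrap every Conv2d/Linear layer with spectral normalization
    for name, child in module.named_children():
        if isinstance(child, (nn.Conv2d, nn.Linear)):
            setattr(module, name, spectral_norm(child))
        else:
            apply_spectral_norm(child)
    return module

# generator = apply_spectral_norm(generator)
# discriminator = apply_spectral_norm(discriminator)
# opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
# opt_g = torch.optim.Adam(generator.parameters(), lr=5e-5)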
Landmark Transformer Details
Landmark Decomposition
Formally, landmark decomposition is calculated as:
where C is the number of videos, Tc is the number of frames of c-th video, and T=ΣTc. We can easily compute the components shown in Equation 31 from the training dataset.
However, when an image of an unseen identity c′ is given, the decomposition of the identity and the expression shown in Equation 31 is not possible, since Î_exp(c′,t) will be zero for a single image. Even when a few frames of an unseen identity c′ are given, Î_exp(c′,t) will be zero (or near zero) if the expressions in the given frames are not diverse enough. Thus, to perform the decomposition shown in Equation 31 even under the one-shot or few-shot settings, we introduce the landmark disentangler.
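For illustration, the following Python sketch shows one way the decomposition referenced as Equation 31 can be computed over the training set, under the assumption that a normalized landmark is split into a dataset mean, a per-identity component (per-video mean minus the dataset mean), and a per-frame expression residual; this reading is inferred from the surrounding description and is an assumption.

from typing import Dict
import numpy as np

def decompose_landmarks(landmarks: Dict[int, np.ndarray]):
    """landmarks[c]: (T_c, 68, 3) normalized landmarks of the c-th training video."""
    all_frames = np.concatenate(list(landmarks.values()), axis=0)   # (T, 68, 3)
    mean = all_frames.mean(axis=0)                                   # dataset mean landmark
    identity, expression = {}, {}
    for c, frames in landmarks.items():
        identity[c] = frames.mean(axis=0) - mean                     # per-identity component
        expression[c] = frames - mean - identity[c]                  # per-frame expression residual
    return mean, identity, expression

# Note: with a single frame of an unseen identity, the per-identity mean equals that
# frame, so the expression residual collapses to zero -- the issue described above.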
Landmark Disentanglement
To compute the expression bases b_exp, using the expression geometry obtained from the VoxCeleb1 training data, we divide a landmark into different groups (i.e., left eye, right eye, eyebrows, mouth, and the remaining points) and perform PCA on each group. We utilize PCA dimensions of 8, 8, 8, 16, and 8 for the respective groups, resulting in a total number of expression bases, n_exp, of 48.
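A hedged Python sketch of this per-group PCA is given below: PCA is fit separately on each landmark group of the expression components with the per-group dimensions listed above. The group-to-index mapping is left as an input, since the exact point assignment is an assumption.

import numpy as np
from sklearn.decomposition import PCA

GROUP_DIMS = {"left_eye": 8, "right_eye": 8, "eyebrows": 8, "mouth": 16, "other": 8}

def fit_expression_bases(expr: np.ndarray, group_indices: dict) -> dict:
    """expr: (N, 68, 3) expression components over the training set;
    group_indices: group name -> list of landmark indices."""
    bases = {}
    for name, idx in group_indices.items():
        flat = expr[:, idx, :].reshape(len(expr), -1)        # flatten the group's points
        bases[name] = PCA(n_components=GROUP_DIMS[name]).fit(flat)
    return bases                                              # 8+8+8+16+8 = 48 expression bases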
We train the landmark disentangler separately on the VoxCeleb1 training set. Before training the landmark disentangler, we normalize each expression parameter α_l to follow a standard normal distribution N(0, 1²) for the ease of regression training. We employ a ResNet50 pre-trained on ImageNet (He et al. 2016) and extract features from the first layer up to the last layer right before the global average pooling layer. The extracted image features are concatenated with the normalized landmark Ī minus the mean landmark m, and fed into a 2-layer MLP followed by a ReLU activation. The whole network is optimized by minimizing the MSE loss between the predicted and target expression parameters, using the Adam optimizer with a learning rate of 3×10−4. We use gradient clipping with a maximum gradient norm of 1 during training. We set the expression intensity parameter λ_exp to 1.5.
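The following Python sketch illustrates this training setup: ImageNet-pretrained ResNet50 features (globally pooled here, which is an assumption about how the feature map is vectorized) are concatenated with the mean-subtracted normalized landmark and regressed onto the expression parameters with an MSE loss. The MLP layout and keeping the backbone frozen are assumptions.

import torch
import torch.nn as nn
import torchvision

backbone = torchvision.models.resnet50(pretrained=True)
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])   # up to the last conv block

n_exp = 48
mlp = nn.Sequential(
    nn.Linear(2048 + 68 * 3, 512), nn.ReLU(),
    nn.Linear(512, n_exp),
)
optimizer = torch.optim.Adam(mlp.parameters(), lr=3e-4)

def training_step(image, landmark, mean_landmark, target_alpha):
    with torch.no_grad():                                             # backbone frozen (assumption)
        feat = feature_extractor(image).mean(dim=[2, 3])              # (B, 2048) pooled features
    lm = (landmark - mean_landmark).flatten(1)                        # (B, 68*3)
    pred = mlp(torch.cat([feat, lm], dim=1))
    loss = nn.functional.mse_loss(pred, target_alpha)                 # targets normalized to N(0, 1)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(mlp.parameters(), max_norm=1.0)    # gradient clipping
    optimizer.step()
    return loss.item()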
Additional Ablation Tests
Quantitative Results
The quantitative results of the additional ablation tests are shown in the accompanying figures.
Qualitative Results
In the qualitative comparison, MarioNETte tends to generate more natural images in the few-shot setting, while +Alignment struggles to deal with multiple target images with diverse poses and expressions.
Inference Time
In this section, we report the inference time of our model. We measured the latency of the proposed method while generating 256×256 images with different numbers of target images, K ∈ {1, 8}. We ran each setting 300 times and report the average speed. We utilized an Nvidia Titan Xp GPU and PyTorch 1.0.1.post2. As mentioned in the main paper, we used the open-sourced implementation of Bulat and Tzimiropoulos (2017) to extract 3D facial landmarks.
Since we perform batched inference over the multiple target images, the inference time of the proposed components (e.g., the target encoder and the target landmark transformer) scales sublinearly with the number of target images K. On the other hand, the open-source 3D landmark detector processes images sequentially, and thus its processing time scales linearly. A minimal timing harness is sketched below.
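The following Python sketch shows one way to measure such average latency on the GPU over 300 runs; the warm-up runs and the use of CUDA synchronization points are assumptions added for measurement hygiene rather than details stated above, and the model call is a placeholder.

import time
import torch

def measure_latency(run_once, n_runs: int = 300, warmup: int = 10) -> float:
    """run_once: a zero-argument callable that performs one full generation."""
    for _ in range(warmup):
        run_once()                      # warm-up runs (assumed, not stated in the text)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        run_once()
    torch.cuda.synchronize()
    return (time.time() - start) / n_runs   # average seconds per generation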
Additional Examples of Generated Images
We provide additional qualitative results of the baseline methods and the proposed models on the VoxCeleb1 and CelebV datasets. We report qualitative results for both the one-shot and few-shot (8 target images) settings, except for Monkey-Net, which is designed to use only a single target image. In the case of few-shot reenactment, we display only one target image due to limited space.
The embodiments described above may also be implemented in the form of a recording medium including instructions executable by a computer, such as a program module executed by a computer. A computer-readable medium may be any available medium accessible by a computer, and may include both volatile and non-volatile media, and separable and non-separable media.
Further, examples of the computer-readable medium may include a computer storage medium. Examples of the computer storage medium include volatile, nonvolatile, separable, and non-separable media realized by an arbitrary method or technology for storing information about a computer-readable instruction, a data structure, a program module, or other data.
At least one of the components, elements, modules or units (collectively “components” in this paragraph) represented by a block in the drawings such as FIGS. 2, 9, 17, 18, 23-27, and 29-32 may be embodied as various numbers of hardware, software and/or firmware structures that execute respective functions described above, according to an exemplary embodiment. For example, at least one of these components may use a direct circuit structure, such as a memory, a processor, a logic circuit, a look-up table, etc. that may execute the respective functions through controls of one or more microprocessors or other control apparatuses. Also, at least one of these components may be specifically embodied by a module, a program, or a part of code, which contains one or more executable instructions for performing specified logic functions, and executed by one or more microprocessors or other control apparatuses. Further, at least one of these components may include or may be implemented by a processor such as a central processing unit (CPU) that performs the respective functions, a microprocessor, or the like. Two or more of these components may be combined into one single component which performs all operations or functions of the combined two or more components. Also, at least part of functions of at least one of these components may be performed by another of these components. Further, although a bus is not illustrated in the above block diagrams, communication between the components may be performed through the bus. Functional aspects of the above exemplary embodiments may be implemented in algorithms that execute on one or more processors. Furthermore, the components represented by a block or processing steps may employ any number of related art techniques for electronics configuration, signal processing and/or control, data processing and the like.
While the embodiments of the present disclosure have been particularly shown and described with reference to the attached drawings, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the inventive concept as defined by the appended claims. Therefore, the embodiments described above should be considered in a descriptive sense only in all respects and not for purpose of limitation.
Number | Date | Country | Kind |
---|---|---|---|
10-2019-0141723 | Nov 2019 | KR | national |
10-2019-0177946 | Dec 2019 | KR | national |
10-2019-0179927 | Dec 2019 | KR | national |
10-2020-0022795 | Feb 2020 | KR | national |