This disclosure relates to techniques for manipulation of facial images and more particularly to the use of deep learning and artificial neural networks for performing manipulation of facial images.
Understanding and manipulating face images in-the-wild is of great interest to the computer vision and graphics community, and as a result, has been extensively studied in previous work. Example techniques range from relighting portraits (e.g., Y. Wang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras, Face Re-lighting from a Single Image Under Harsh Lighting conditions, Pages 1-8, June 2007), editing or exaggerating expressions (e.g., F. Yang, J. Wang, E. Shechtman, L. Bourdev, and D. Metaxas, Expression Flow for 3d-aware Face Component Transfer, ACM Transactions on Graphics (TOG), volume 30, page 60. ACM, 2011), and even driving facial performances (e.g., F. Yang, J. Wang, E. Shechtman, L. Bourdev, and D. Metaxas, Expression Flow for 3d-aware Face Component Transfer, ACM Transactions on Graphics (TOG), volume 30, page 60. ACM, 2011). Many of these methods start by explicitly reconstructing facial attributes such as geometry, texture, and illumination, and then editing these attributes in the image. However, reconstructing these attributes is a challenging and often ill-posed task. Previous techniques attempt to address these challenges by either utilizing more data (e.g., RGBD video streams) or imposing a strong prior on the reconstruction that is adapted to the particular editing task that is to be solved (e.g., utilizing low dimensional geometry). As a result, these techniques tend to be both costly (with respect to use of computational resources) and error-prone. Moreover, such techniques fail to generalize at scale.
Techniques are disclosed for performing manipulation of facial images using a neural network architecture. In one example embodiment, the neural network architecture includes a disentanglement portion and a rendering portion. The disentanglement portion of the network is trained to disentangle at least one physical property captured in the facial image, such that the disentanglement portion receives a facial image and outputs a disentangled representation of that facial image based on the at least one physical property. The rendering portion of the network receives or otherwise has access to the disentangled representation and is trained to perform a facial manipulation of the facial image based upon an image formation equation and the at least one physical property, thereby generating a manipulated facial image. The at least one physical property may include, for example, at least one of diffuse albedo, a surface normal, a matte mask, a background, a shape, a texture, illumination, and shading. These properties are also referred to herein as intrinsic facial properties. As will be appreciated, the network is able to handle a much wider range of manipulations including changes to, for example, viewpoint, lighting, expression, and even higher-level attributes like facial hair and age, aspects that cannot be represented using previous techniques. Significant advantages can be realized, including the ability to learn a model for a given facial appearance in terms of intrinsic (or physical) facial properties without the need for expensive data capture (e.g., calibrated appearance and geometry capture).
In some embodiments, the disentanglement portion of the network includes one or more first layers, each first layer encoding a respective map. Each map performs a transformation of the input image to a respective first intermediate result. Each respective first intermediate result is associated with an intrinsic facial property (e.g., geometry, diffuse albedo, or illumination), sometimes referred to herein as physical properties. The rendering portion of the network includes one or more second layers arranged according to an image formation equation for manipulating a facial image. The rendering portion operates on the first intermediate result(s) to generate a manipulated facial image. In some such cases, the disentanglement portion of the network further includes a respective first intermediate loss function associated with each map. In some such embodiments, during a training phase, each respective first intermediate loss function causes an inference with respect to a corresponding facial property of said respective map.
Trivially applying autoencoder networks to learn “good” facial features from large amounts of data often leads to representations that are not meaningful, making the subsequent editing challenging. According to various embodiments of the present disclosure, a network can be trained to infer approximate models for facial appearance in terms of intrinsic face properties such as geometry (surface normals), material properties (diffuse albedo), illumination, and shading. Merely introducing these factors into the network, however, is not sufficient: because the inverse rendering problem is ill-posed, the learned intrinsic properties can be arbitrary. Instead, according to some embodiments provided herein, a network is guided by imposing priors on each of the intrinsic properties. These priors may include, for example, a morphable model-driven prior on the geometry, a Retinex-based prior on the albedo, and an assumption of a low-frequency spherical harmonics-based lighting model. Under these constraints, utilizing adversarial supervision on image reconstruction and weak supervision on the inferred face intrinsic properties, various network embodiments can learn disentangled representations of facial appearance.
According to one embodiment, a matte layer may be introduced to disentangle the foreground (face) and the natural image background. Furthermore, according to various embodiments, low-dimensional manifold embeddings are exposed for each of the intrinsic facial properties, which in turn enables direct and data-driven semantic editing from a single input image. For example, direct illumination editing using explicit spherical harmonics lighting built into the network may be performed. In addition, semantically meaningful expression edits such as smiling-based edits, and more structurally global edits such as aging may be achieved. Thus, an end-to-end generative network specifically designed for understanding and editing of face images in the wild is provided herein. According to some embodiments, the image formation and shading processes may be encoded as in-network layers, enabling disentangling of physically based rendering elements such as shape, illumination, and albedo in latent space. Further, according to other embodiments, statistical loss functions that correspond to well-studied theories (such as Batchwise White Shading (“BWS”), corresponding to color consistency theory) are used to help improve disentangling of latent representations.
As will be further recognized, facial image manipulation network 200 will typically undergo a supervised learning or training phase in which both the aforementioned weights, which codify the intercoupling between artificial neural units, and the biases of respective artificial neural units are learned by employing an optimization method such as gradient descent. The learning phase will typically utilize a set of training data and validation data. Full batch learning, mini-batch learning, stochastic gradient descent or any other training methods may be employed. Further, updating of the weights and biases during the learning/training phase may utilize the backpropagation algorithm. Upon training of facial image manipulation network 200, a test phase may then be conducted using arbitrary input facial images Ii, which are processed by the network 200 using the learned weights and biases from the learning/training phase.
As can be seen in this example embodiment, facial image manipulation network 200 comprises a disentanglement network or portion 202 and a rendering network or portion 204. As will be appreciated, each network or portion 202 and 204 may include one or more layers (e.g., input layers, middle layers, hidden layers, output layers). At a high level, disentanglement network 202 operates to disentangle an input facial image Ii into intrinsic facial properties (described below). The output of disentanglement network 202 is generally depicted as a disentangled representation 208 of input image Ii. Rendering network 204 may then operate on these disentangled intrinsic facial properties to render various facial manipulations. Thus, facial image manipulation network 200 receives input image Ii and ultimately generates output image Io, which represents a desired facially manipulated representation of input image Ii. For example, output image Io may be a facial image that includes facial hair or some other perceptible feature that was not present in input image Ii. In another example, output image Io may be a facial image that displays glasses that were not present in input image Ii. In another example, output image Io may be a facial image that includes a smile rather than pursed lips present in input image Ii.
As will be appreciated in light of this disclosure, both input image Ii and output image Io are representations of facial images in an image space. That is, according to one embodiment, input and output images (Ii and Io) comprise pixel data values for facial images. Input and output images Ii and Io may be greyscale or color images. In the former case, pixel values may thereby describe greyscale intensity values while in the latter case, pixel values may describe RGB (“Red-Green-Blue”) intensity values. Further, input and output images (Ii and Io) may represent 2-D or 3-D images. It will be further understood that, in the case of 2-D images, although input image and output image representations (Ii and Io) may be represented by a 2-D matrix corresponding to the pixel values in a 2-D image, some reshaping of the data comprising input image Ii and output image Io may be performed, such as reshaping into a 1-D vector, prior to processing by facial image manipulation network 200.
As further shown in
Rendering network 204 operates to generate output image Io, which is a rendered facial manipulation of input image Ii using intermediate results IR11-IR1N generated by disentanglement network 202. In particular, according to one embodiment, rendering network 204 utilizes an architecture based upon image formation equation 206. Example image formation equations are described below. For purposes of the current discussion, it is sufficient to understand that rendering network 204 comprises a plurality of neural network layers arranged in an architecture based upon image formation equation 206. Each of the neural network layers in rendering network 204 may generate respective intermediate results, which may then be provided to other layers. A more detailed description of rendering network 204 is described below with respect to
In any case, input image Ii is provided to encoder 212 of the disentanglement network 202, which generates entangled latent representation Zi. Disentanglement network 202 causes latent representation Zi to be disentangled into disentangled latent representations Z1-ZN. Each of the disentangled latent representations Z1-ZN is then provided to a respective decoder 210(1)-210(N), which generates a respective intermediate result IR11-IR1N. It will be understood that entangled latent representation Zi and disentangled latent representations Z1-ZN are represented in a different space from image space (e.g., the space where input image Ii and output image Io are represented). In particular, latent representations Zi and Z1-ZN are typically lower dimensional representations than those of image space.
Further, as depicted in
As previously explained, facial image manipulation network 200 further comprises rendering network 204. As depicted in
It will be understood that each of layers 20821-2082N comprising rendering network 204 may represent a single neural network layer or multiple neural network layers. Further, layers 20821-2082N may be arranged in architecture determined by image formation equation 206 (shown in
Rendering network 204 further comprises output layer 2082M. Output layer 2082M generates output image Io and may be associated with one or more global loss functions (as shown in
Disentanglement Network
Referring to
Each of the latent representations ZAe, ZNe, Zm and Zbg is then passed to a respective decoder 210(4), 210(3), 210(2) and 210(1) and decoded into respective intermediate results Ae, Ne, M, and Ibg. It will be understood that intermediate results Ae, Ne, M, and Ibg are generated as a map or transformation from input image Ii into each respective intermediate result Ae, Ne, M and Ibg.
Disentanglement Network Loss Functions
According to one embodiment, facial manipulation network 200 may be guided during training by imposing priors respectively on each intrinsic property. In particular, each intermediate result Ae, Ne, M, and Ibg may be associated with one or more intermediate loss functions (the various types of loss functions referenced herein are described in detail below). In particular, as shown in
It will be understood that an L1 loss function, also known as least absolute deviations (“LAD”) or least absolute errors (“LAE”), minimizes the absolute differences between an estimated value and a target value. In particular, if y is a target value and h(x) an estimate, L1 loss may be expressed as follows:
$$L_1 = \sum_{i=0}^{n} \lvert y_i - h(x_i) \rvert.$$
The L2 loss functions shown in
$$L_2 = \sum_{i=0}^{n} \left( y_i - h(x_i) \right)^2.$$
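As an illustration only, the two loss definitions above can be sketched in a few lines of NumPy. The function names `l1_loss` and `l2_loss` are hypothetical and chosen for this example; they do not refer to any component of network 200.

```python
import numpy as np

def l1_loss(y, h):
    """Sum of absolute differences between targets y and estimates h."""
    y = np.asarray(y, dtype=float)
    h = np.asarray(h, dtype=float)
    return float(np.sum(np.abs(y - h)))

def l2_loss(y, h):
    """Sum of squared differences between targets y and estimates h."""
    y = np.asarray(y, dtype=float)
    h = np.asarray(h, dtype=float)
    return float(np.sum((y - h) ** 2))

targets = [1.0, 2.0, 4.0]
estimates = [1.5, 2.0, 2.0]
print(l1_loss(targets, estimates))  # 2.5
print(l2_loss(targets, estimates))  # 4.25
```

The single large error (4.0 versus 2.0) dominates the L2 value far more than the L1 value, which is why an L1 term tolerates sharp, sparse deviations while an L2 term favors smooth ones.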
According to one embodiment, A signifies an adversarial loss function that pits a discriminative model against a generative model to determine whether a sample is from the model distribution or the data distribution.
Rendering Network
As previously explained, facial image manipulation network 200 further comprises rendering network 204. Rendering network 204 renders a manipulated facial image using an image formation equation based upon intermediate results Ae, Ne, M and Ibg received from disentanglement network 202. Rendering network further utilizes latent representation ZL generated by disentanglement network 202.
Referring to
An example image formation equation informing the architecture of rendering network shown in
According to one embodiment Ifg is a result of a rendering process frendering based upon Ae, Ne and L as follows:
$$I_{fg} = f_{\text{rendering}}(A_e, N_e, L).$$
Assuming Lambertian reflectance and adopting Retinex theory to separate the albedo (reflectance) from the geometry and illumination, Ifg may be expressed as follows:
$$I_{fg} = f_{\text{image-formation}}(A_e, S_e) = A_e \odot S_e,$$
whereby ⊙ denotes a per-element product operation in the image space and:
$$S_e = f_{\text{shading}}(N_e, L).$$
If these previous two equations are differentiable, they can be represented as in-network layers in an autoencoder network. This allows representation of an image using disentangled latent variables for physically meaningful (intrinsic) factors in the image formation process and in particular ZAe, ZNe, and ZL. This is advantageous over conventional approaches using a single latent variable that encodes the combined effects of all image formation factors. In particular, each of the disentangled latent variables ZAe, ZNe, and ZL allows access to a specific manifold where semantically relevant edits can be performed while keeping irrelevant latent variables fixed. For example, image relighting may be performed by only traversing the lighting manifold ZL, or the albedo may be changed (e.g., to grow a beard) by traversing ZAe.
In practice, the shading process utilizing geometry Ne, and illumination L under unconstrained conditions may result in fshading(⋅,⋅) being a discontinuous function in a significantly large portion of the space it represents. In order to address these issues, according to one embodiment, distant illumination L is represented by spherical harmonics such that the Lambertian fshading(⋅,⋅) has an analytical form and is differentiable.
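According to one embodiment, L is represented by a 9-dimensional vector (spherical harmonics coefficients). For a given pixel i, with its normal $n_i = [n_x, n_y, n_z]^T$, the shading for pixel i is rendered as $S_e^i = S_e(n_i, L) = [n_i; 1]^T K [n_i; 1]$, where K is a 4×4 matrix whose entries are determined by the spherical harmonics coefficients L.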
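The quadratic form above can be sketched as follows. The entries of K and the constants C1 to C5 are not given in the text here, so this sketch assumes the widely used second-order spherical harmonics irradiance matrix; the coefficient ordering L1..L9, and the names `sh_matrix` and `shade`, are likewise assumptions made for illustration.

```python
import numpy as np

# Assumed constants of the standard quadratic-form irradiance approximation.
C1, C2, C3, C4, C5 = 0.429043, 0.511664, 0.743125, 0.886227, 0.247708

def sh_matrix(L):
    """Build a symmetric 4x4 matrix K from 9 lighting coefficients (assumed layout)."""
    L1, L2, L3, L4, L5, L6, L7, L8, L9 = L
    return np.array([
        [C1 * L9,  C1 * L5, C1 * L8, C2 * L4],
        [C1 * L5, -C1 * L9, C1 * L6, C2 * L2],
        [C1 * L8,  C1 * L6, C3 * L7, C2 * L3],
        [C2 * L4,  C2 * L2, C2 * L3, C4 * L1 - C5 * L7],
    ])

def shade(normals, L):
    """Evaluate [n;1]^T K [n;1] for every normal in an (..., 3) array."""
    K = sh_matrix(L)
    ones = np.ones(normals.shape[:-1] + (1,))
    n1 = np.concatenate([normals, ones], axis=-1)  # homogeneous normals, (..., 4)
    return np.einsum("...i,ij,...j->...", n1, K, n1)

# A 2x2 patch of normals facing the camera, lit by a purely ambient light.
normals = np.zeros((2, 2, 3))
normals[..., 2] = 1.0
ambient = np.array([1.0, 0, 0, 0, 0, 0, 0, 0, 0])
print(shade(normals, ambient))
```

Because the expression is a polynomial in the normal and linear in L, it is differentiable with respect to both, which is the property the shading layer relies on.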
In-Network Face Representation—Explicit
According to one embodiment as shown in
In-Network Face Representation—Implicit
Although the explicit representation depicted in
In-Network Background Matting
According to one embodiment, to encourage physically based representations of albedo, surface normals, and lighting to concentrate on the face region, the background may be disentangled from the foreground. According to one embodiment, matte layer 208(e) computes the compositing of the foreground onto the background as follows:
$$I_o = M \odot I_{fg} + (1 - M) \odot I_{bg}.$$
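A minimal NumPy sketch of the image formation and compositing steps is given below, assuming all maps are arrays of matching spatial size with values in [0, 1]. The function names `form_foreground` and `composite` are hypothetical.

```python
import numpy as np

def form_foreground(albedo, shading):
    """I_fg = A_e (.) S_e: per-element product of albedo and shading."""
    return albedo * shading

def composite(matte, foreground, background):
    """I_o = M (.) I_fg + (1 - M) (.) I_bg, with the matte broadcast over channels."""
    return matte * foreground + (1.0 - matte) * background

H, W = 4, 4
albedo = np.full((H, W, 3), 0.8)
shading = np.full((H, W, 3), 0.5)
background = np.zeros((H, W, 3))
matte = np.zeros((H, W, 1))
matte[1:3, 1:3] = 1.0  # face region occupies the centre of the image

I_fg = form_foreground(albedo, shading)
I_o = composite(matte, I_fg, background)
print(I_o[..., 0])
```

Both operations are element-wise and therefore differentiable, so gradients from a loss on the output image can reach the matte, the foreground, and the background.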
Matte layer 208(e) also allows the employment of efficient skip layers where unpooling layers in the decoder stack can use the pooling switches from the corresponding encoder stack of the input image (216).
Because the mapping between the pooling and unpooling switches establishes a skip connection between the encoder and the decoder, the details of the background are significantly preserved. Such skip connections may bypass the bottleneck Z and therefore only allow partial information flow through Z during training. In contrast, for the foreground face region, all the information flows through the bottleneck Z without any skip connections such that full control is maintained over the latent manifolds for editing at the expense of some detail loss.
Convolutional Encoder Stack Architecture
According to one embodiment, convolutional encoder stack 212 is comprised of three convolutions with 32*3×3, 64*3×3 and 64*3×3 filter sets. According to this same embodiment, each convolution is followed by a max-pooling and ReLU nonlinear activation function (not shown in
According to one embodiment, Zi is a latent variable vector of size 128×1, which is fully connected to the last encoder stack downstream as well as the individual latent variables Zbg, ZL, Zm and the foreground representations ZNe, ZAe. For the explicit representation 214(a), Zi is directly connected to ZNe and ZAe. For the implicit representation 214(b), Zi is directly connected to ZUV, ZNi and ZAi. According to one embodiment, all latent representations are 128×1 vectors except for ZL, which represents the light L directly and is a 27×1 vector where three 9×1 concatenated vectors represent spherical harmonics of red, blue and green lights.
According to one embodiment, all convolutional decoder stacks 210(1)-210(4) are strictly symmetrical to convolutional encoder stack 212. Convolutional decoder stacks 210(1) and 210(2) may utilize skip connections to convolutional encoder 212 at corresponding layers. According to one embodiment, in the implicit representation 214(b), ZNi and ZAi share weights in their respective decoders because supervision is performed only for the implicit normals.
Training
Main Loss Function
According to one embodiment, training may be performed using “in-the-wild” facial images. This means that access is provided only to the image itself (denoted by I*). Thereby, no ground-truth data is available for illumination, normal, or albedo.
According to one embodiment a main loss function imposed for the reconstruction of image Ii at the output Io is:
$$E_o = E_{\text{recon}} + \lambda_{\text{adv}} E_{\text{adv}}.$$
According to this relationship, $E_{\text{recon}} = \lVert I_i - I_o \rVert^2$ and $E_{\text{adv}}$ is an adversarial loss function such that a discriminative network is trained simultaneously to distinguish between generated and real images. According to one embodiment, an energy-based method is utilized to incorporate adversarial loss. According to this method, an autoencoder is utilized as the discriminative network. The adversarial loss may be defined as $E_{\text{adv}} = D(I')$ for the generative network. Here I′ is the reconstruction of the discriminator input Io and D(⋅) is an L2 reconstruction loss of the discriminative network. In this case, the discriminative network may be trained to minimize the reconstruction error (L2) for real facial images while maximizing the reconstruction error with a margin for Io.
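The main loss and the energy-based adversarial term can be sketched as below. This is a sketch under stated assumptions: the hinge form of the margin, the default values of `lambda_adv` and `margin`, and the stand-in `autoencoder` callable are all illustrative choices that the text does not specify.

```python
import numpy as np

def recon_energy(autoencoder, image):
    """D(.): squared L2 error between an image and its autoencoder reconstruction."""
    return float(np.sum((autoencoder(image) - image) ** 2))

def generator_loss(I_i, I_o, autoencoder, lambda_adv=0.01):
    """E_o = E_recon + lambda_adv * E_adv, with E_adv the energy of the output I_o."""
    e_recon = float(np.sum((I_i - I_o) ** 2))
    e_adv = recon_energy(autoencoder, I_o)
    return e_recon + lambda_adv * e_adv

def discriminator_loss(real, generated, autoencoder, margin=1.0):
    """Low energy for real images; energy of generated images pushed up to a margin."""
    e_real = recon_energy(autoencoder, real)
    e_fake = recon_energy(autoencoder, generated)
    return e_real + max(0.0, margin - e_fake)

# Toy stand-in for a trained discriminator autoencoder (assumption for the demo).
toy_autoencoder = lambda x: 0.5 * x
real = np.ones(4)
fake = 2.0 * np.ones(4)
print(generator_loss(real, fake, toy_autoencoder, lambda_adv=0.5))
print(discriminator_loss(real, fake, toy_autoencoder, margin=5.0))
```

In an actual training loop the two losses would be minimized in alternation, each with respect to its own network's parameters.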
Intermediate Loss Functions
A fully unsupervised training using only the reconstruction and adversarial loss on the output image Io will often result in semantically meaningless latent representations. The network architecture itself cannot prevent degenerate solutions such as constant Se where Ae captures both albedo and shading information. Because each of the rendering elements has a specific physical meaning and they are explicitly encoded as intermediate layers in the network, according to one embodiment, additional constraints may be introduced through intermediate loss functions to guide the training.
In particular, according to one embodiment, a “pseudo ground-truth” $\hat{N}$ for the normal representation Ne may be introduced to maintain Ne close to plausible face normals during the training process. $\hat{N}$ may be estimated by fitting a rough facial geometry to every image in a training set using a 3D morphable model. According to one embodiment, the following L2 intermediate loss function may then be introduced for Ne:
$$E_{\text{recon-}N} = \lVert N_e - \hat{N} \rVert^2.$$
According to other embodiments, similar to $\hat{N}$, a pseudo ground-truth $\hat{L}$ may be introduced for the lighting, with an L2 reconstruction loss with respect to the lighting parameters ZL:
$$E_{\text{recon-}L} = \lVert Z_L - \hat{L} \rVert^2.$$
$\hat{L}$ may be computed from $\hat{N}$ and the input image Ii using least square optimization and constant albedo assumption.
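One way to read the least-squares step is sketched below for a single color channel. Under a second-order spherical harmonics model the shading is linear in the nine lighting coefficients, so with a constant albedo the coefficients follow from an ordinary linear least-squares fit. The basis ordering and constants are the standard ones and are assumed here, as are the function names `sh_basis` and `fit_lighting`.

```python
import numpy as np

# Assumed constants of the standard second-order SH irradiance approximation.
C1, C2, C3, C4, C5 = 0.429043, 0.511664, 0.743125, 0.886227, 0.247708

def sh_basis(normals):
    """Return a (P, 9) matrix B with shading = B @ L for (P, 3) unit normals."""
    x, y, z = normals[:, 0], normals[:, 1], normals[:, 2]
    return np.stack([
        np.full_like(x, C4),
        2 * C2 * y,
        2 * C2 * z,
        2 * C2 * x,
        2 * C1 * x * y,
        2 * C1 * y * z,
        C3 * z ** 2 - C5,
        2 * C1 * x * z,
        C1 * (x ** 2 - y ** 2),
    ], axis=1)

def fit_lighting(normals, intensities, albedo=1.0):
    """Least-squares estimate of the 9 lighting coefficients under constant albedo."""
    B = albedo * sh_basis(normals)
    L_hat, *_ = np.linalg.lstsq(B, intensities, rcond=None)
    return L_hat

# Synthetic check: recover a known light from noiseless intensities.
rng = np.random.default_rng(0)
normals = rng.normal(size=(200, 3))
normals /= np.linalg.norm(normals, axis=1, keepdims=True)
L_true = rng.normal(size=9)
intensities = 0.8 * sh_basis(normals) @ L_true
print(np.allclose(fit_lighting(normals, intensities, albedo=0.8), L_true))
```

For a color image the same fit would be repeated per channel, with the pixels restricted to the face region covered by the fitted normals.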
According to other embodiments, following Retinex theory, wherein albedo is assumed to be piecewise constant and shading to be smooth, an L1 smoothness loss may be introduced for albedo as follows:
$$E_{\text{smooth-}A} = \lVert \nabla A_e \rVert,$$
in which ∇ is the spatial image gradient operator.
In addition, because shading is assumed to vary smoothly, an L2 intermediate smoothness loss may be introduced for Se as follows:
$$E_{\text{smooth-}S} = \lVert \nabla S_e \rVert^2.$$
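The two smoothness terms can be sketched with forward differences standing in for the spatial gradient operator. The discretization and the function names are assumptions made for this example.

```python
import numpy as np

def spatial_gradient(img):
    """Forward differences along rows and columns of an (H, W) or (H, W, C) map."""
    dy = img[1:, ...] - img[:-1, ...]
    dx = img[:, 1:, ...] - img[:, :-1, ...]
    return dy, dx

def albedo_smoothness(A):
    """L1 norm of the gradient: tolerates a few sharp edges (piecewise-constant albedo)."""
    dy, dx = spatial_gradient(A)
    return float(np.abs(dy).sum() + np.abs(dx).sum())

def shading_smoothness(S):
    """Squared L2 norm of the gradient: penalises sharp edges heavily (smooth shading)."""
    dy, dx = spatial_gradient(S)
    return float((dy ** 2).sum() + (dx ** 2).sum())

step = np.array([[0.0, 2.0]])          # one sharp edge of height 2
ramp = np.array([[0.0, 1.0, 2.0]])     # same total change spread over two steps
print(albedo_smoothness(step), albedo_smoothness(ramp))    # 2.0 2.0
print(shading_smoothness(step), shading_smoothness(ramp))  # 4.0 2.0
```

The L1 term scores the sharp edge and the gradual ramp equally, whereas the squared term prefers the ramp, which matches the Retinex assumption of piecewise-constant albedo and smooth shading.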
For the implicit coordinate system (UV) (
$$E_{UV} = \lVert UV - \widehat{UV} \rVert^2,$$
$$E_{N} = \lVert N_i - \hat{N}_i \rVert^2.$$
and $\hat{N}_i$ may be obtained from a morphable model in which vertex-wise correspondence on a 3D fit exists. In particular, an average shape of the morphable model
Batch-Wise Shading Constraint
Due to the ambiguity in the magnitude of lighting and therefore the intensity of shading, it may be necessary to introduce constraints on the shading magnitude to prevent the network from generating arbitrarily bright or dark shading. Moreover, because the illumination is separated in color space by individual Lr, Lg and Lb, according to one embodiment a constraint may be imposed to prevent the shading from being too strong in one color channel compared to the others. To handle these ambiguities, according to one embodiment, a Batch-wise White Shading (“BWS”) constraint may be introduced on Se as follows:
where $s_r^i(j)$ denotes the j-th pixel of the i-th example in the first (red) channel of Se. $s_g$ and $s_b$ denote the second and third channels of the shading, respectively. m is the number of pixels in a training batch. According to one embodiment, c is set to 0.75.
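The constraint equation itself is not reproduced in the text above, so the following is only one plausible reading of it: the mean shading of each color channel, taken over all m pixels of a batch, is pulled toward the same constant c. The squared penalty and the function name `bws_penalty` are assumptions.

```python
import numpy as np

def bws_penalty(S, c=0.75):
    """Penalise per-channel batch means of shading that deviate from c.

    S has shape (batch, H, W, 3); the mean runs over every pixel in the batch.
    """
    channel_means = S.reshape(-1, S.shape[-1]).mean(axis=0)
    return float(np.sum((channel_means - c) ** 2))

white = np.full((2, 4, 4, 3), 0.75)   # balanced shading at the target level
reddish = white.copy()
reddish[..., 0] = 1.0                 # red channel too strong
print(bws_penalty(white), bws_penalty(reddish))
```

Because the statistic is computed over the batch rather than per image, individual images remain free to be brighter or darker than c as long as the batch average stays balanced across channels.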
Because $\hat{N}$ obtained by the previously described morphable model addresses a region of interest only on the surface of a face, according to one embodiment, it is used as a mask and all foreground losses are computed under this face mask. In addition, according to one embodiment, the region of interest is also used as a mask pseudo ground truth at training time for learning the matte mask:
$$E_M = \lVert M - \hat{M} \rVert^2,$$
in which $\hat{M}$ represents the morphable model mask.
Process for Network Construction
Face Editing by Manifold Traversal
According to one embodiment, in order to compute the distributions for each attribute to be manipulated, a subset of images from a facial image database such as CelebA may be sampled (e.g., 2000 images) with the appropriate attribute label (e.g., smiling or other expressions). Then, a manifold traversal method may be employed independently on each appropriate variable. The extent of traversal may be parameterized by a regularization parameter λ (see, e.g., Gardner et al., Deep Manifold Traversal: Changing Labels With Convolutional Features, arXiv preprint arXiv:1511.06421, 2015).
Experiments
According to one embodiment, the CelebA dataset may be utilized to train facial manipulation network 200. During this training, for each facial image, landmarks may be detected and a 3D morphable model may be fitted to the facial region to develop a rough estimation of the rendering elements ($\hat{N}$, $\hat{L}$). These estimates may then be utilized to set up the previously described intermediate loss functions.
Baseline Comparisons
Furthermore, given input facial images, facial manipulation network 200 described herein provides explicit access to albedo, shading and normal maps for the face (rows 4-6 of
Smile Face Editing
Using the techniques described with reference to
Aging Face Editing
Relighting
$$I_{fg} = f_{\text{image-formation}}(A_e, S_e) = A_e \odot S_e$$
a detailed albedo Atarget may be generated. Given a light source Lsource, the shading of the target may be rendered under this light with the target normals Ntarget given by:
$$S_e = f_{\text{shading}}(N_e, L),$$
to obtain the transferred Stransfer. Finally, the lighting transferred image may be rendered with Stransfer and Atarget using the relation:
$$I_{fg} = f_{\text{image-formation}}(A_e, S_e) = A_e \odot S_e.$$
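The lighting transfer steps can be sketched as follows. The sketch assumes that the detailed albedo is obtained by inverting the image formation relation, that is, by dividing the target image by its estimated shading; that step, the `eps` guard against division by zero, and the function name `relight` are assumptions. The shading maps are taken as given, for example as produced by the shading layer from the target normals under the target and source lights.

```python
import numpy as np

def relight(I_target, S_target, S_transfer, eps=1e-6):
    """Transfer lighting onto a target image.

    I_target:   target image (foreground region)
    S_target:   shading of the target under its own estimated light
    S_transfer: shading of the target normals under the source light
    """
    A_target = I_target / np.maximum(S_target, eps)  # detailed albedo (assumed step)
    return A_target * S_transfer                     # I = A (.) S with the new shading

I_target = np.full((2, 2, 3), 0.4)
S_target = np.full((2, 2, 3), 0.5)
brighter = relight(I_target, S_target, 2.0 * S_target)
print(brighter[0, 0])
```

Dividing by the shading keeps the high-frequency detail of the original photograph in the albedo, so the relit result is not limited to the detail the decoder can reproduce.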
Facial image manipulation network 200 and its various functional blocks may be implemented on a computing device such as a general-purpose or application specific CPU that includes one or more storage devices and/or non-transitory computer-readable media having encoded thereon one or more computer-executable instructions or software for implementing techniques as variously described in this disclosure. In some embodiments, the storage devices include a computer system memory or random access memory, such as a durable disk storage (e.g., any suitable optical or magnetic durable storage device, including RAM, ROM, Flash, USB drive, or other semiconductor-based storage medium), a hard-drive, CD-ROM, or other computer readable media, for storing data and computer-readable instructions and/or software that implement various embodiments as taught in this disclosure. In some embodiments, the storage device includes other types of memory as well, or combinations thereof. In one embodiment, the storage device is provided on the computing device. In another embodiment, the storage device is provided separately or remotely from the computing device. The non-transitory computer-readable media include, but are not limited to, one or more types of hardware memory, non-transitory tangible media (for example, one or more magnetic storage disks, one or more optical disks, one or more USB flash drives), and the like. In some embodiments, the non-transitory computer-readable media included in the computing device store computer-readable and computer-executable instructions or software for implementing various embodiments. In one embodiment, the computer-readable media are provided on the computing device. In another embodiment, the computer-readable media are provided separately or remotely from the computing device.
Training subsystem 622 further comprises facial image training/validation datastore 610(a), which stores training and validation facial images. Training algorithm 616 represents programmatic instructions for carrying out training of network 200 in accordance with the training described herein. As shown in
Test/inference subsystem 624 further comprises test/inference algorithm 626, which utilizes network 200 and the optimal weights/biases generated by training subsystem 622. CPU/GPU 612 may then carry out test/inference algorithm 626 based upon the model architecture and the previously described generated weights and biases. In particular, test/inference subsystem 624 may receive test image 614, from which it may generate a manipulated image 620 using network 200.
It will be further readily understood that network 508 may comprise any type of public and/or private network including the Internet, LANs, WAN, or some combination of such networks. In this example case, computing device 500 is a server computer, and client 506 can be any typical personal computing platform. Further note that some components of the creator recommendation system 102 may be served to and executed on the client 506, such as a user interface by which a given user interacts with the system 102. The user interface can be configured, for instance, similar to the user interface of Behance® in some embodiments. In a more general sense, the user interface may be configured, for instance, to allow users to search for and view creative works, and to follow or appreciate certain creators for which the viewer has affinity. The user interface can be thought of as the front-end of the creative platform. The user interface may further be configured to cause display of an output showing ranked creators, such as shown in
As will be further appreciated, computing device 600, whether the one shown in
In some example embodiments of the present disclosure, the various functional modules described herein, and specifically training and/or testing of network 200, may be implemented in software, such as a set of instructions (e.g., HTML, XML, C, C++, object-oriented C, JavaScript, Java, BASIC, etc.) encoded on any non-transitory computer readable medium or computer program product (e.g., hard drive, server, disc, or other suitable non-transitory memory or set of memories), that when executed by one or more processors, cause the various facial image manipulation methodologies provided herein to be carried out.
In still other embodiments, the techniques provided herein are implemented using software-based engines. In such embodiments, an engine is a functional unit including one or more processors programmed or otherwise configured with instructions encoding a facial image manipulation process as variously provided herein. In this way, a software-based engine is a functional circuit.
In still other embodiments, the techniques provided herein are implemented with hardware circuits, such as gate level logic (FPGA) or a purpose-built semiconductor (e.g., application specific integrated circuit, or ASIC). Still other embodiments are implemented with a microcontroller having a processor, a number of input/output ports for receiving and outputting data, and a number of embedded routines by the processor for carrying out the functionality provided herein. In a more general sense, any suitable combination of hardware, software, and firmware can be used, as will be apparent. As used herein, a circuit is one or more physical components and is functional to carry out a task. For instance, a circuit may be one or more processors programmed or otherwise configured with a software module, or a logic-based hardware circuit that provides a set of outputs in response to a certain set of input stimuli. Numerous configurations will be apparent.
The following examples pertain to further example embodiments, from which numerous permutations and configurations will be apparent.
Example 1 is a neural network architecture for manipulating a facial image, said architecture comprising: a disentanglement portion trained to disentangle at least one physical property captured in said facial image, said disentanglement portion receiving said facial image and outputting a disentangled representation of said facial image based on said at least one physical property; and a rendering portion trained to perform a facial manipulation of said facial image based upon an image formation equation and said at least one physical property, thereby generating a manipulated facial image.
Example 2 includes the subject matter of Example 1, wherein: said disentanglement portion includes at least one first layer, each of said at least one first layer encoding a respective map, wherein each map performs a transformation of said facial image to a respective first intermediate result, said respective first intermediate result associated with one of said at least one physical property; and said rendering portion includes at least one second layer arranged according to said image formation equation for manipulating said facial image, wherein said rendering portion operates on said at least one first intermediate result to generate said manipulated facial image.
Example 3 includes the subject matter of Example 2, wherein a respective first intermediate loss function is associated with each of said at least one map.
Example 4 includes the subject matter of Example 3, wherein during a training phase, each respective first intermediate loss function causes an inference of said respective map.
Example 5 includes the subject matter of any of Examples 2 through 4, wherein each of said maps further comprises a convolutional encoder stack and at least one convolutional decoder stack, each of said at least one convolutional decoder stack generating one of said respective first intermediate results.
Example 6 includes the subject matter of Example 5, wherein said convolutional encoder stack generates an entangled representation in a latent space.
Example 7 includes the subject matter of Example 6, and further includes a fully connected layer.
Example 8 includes the subject matter of Example 7, wherein said fully connected layer generates a disentangled representation in said latent space from said entangled representation.
Example 9 includes the subject matter of any of the preceding Examples, wherein said at least one physical property includes at least one of diffuse albedo, a surface normal, a matte mask, a background, a shape, a texture, illumination, and shading.
Example 10 includes the subject matter of any of the preceding Examples, wherein said at least one physical property includes at least one of geometry, illumination, texture, and shading.
Example 11 is a method for generating a manipulated facial image using a neural network architecture that includes a disentanglement portion and a rendering portion, said disentanglement portion trained to disentangle at least one physical property captured in said facial image, and said rendering portion trained to perform a facial manipulation of said facial image based upon an image formation equation and said at least one physical property, said method comprising: receiving said facial image at said disentanglement portion of said neural network architecture, thereby disentangling at least one physical property captured in said facial image and outputting a disentangled representation of said facial image based on said at least one physical property; and receiving said disentangled representation of said facial image at said rendering portion of said neural network architecture, thereby generating a manipulated facial image.
Example 12 includes the subject matter of Example 11, wherein: said disentanglement portion includes at least one first layer, each of said at least one first layer encoding a respective map, wherein each map performs a transformation of said facial image to a respective first intermediate result, said respective first intermediate result associated with one of said at least one physical property; and said rendering portion includes at least one second layer arranged according to said image formation equation for manipulating said facial image, wherein said rendering portion operates on said at least one first intermediate result to generate said manipulated facial image.
Example 13 includes the subject matter of Example 12, wherein a respective first intermediate loss function is associated with each of said at least one map, and during a training phase, each respective first intermediate loss function causes an inference of said respective map.
Example 14 includes the subject matter of any of Examples 11 through 13, wherein said at least one physical property includes at least one of diffuse albedo, a surface normal, a matte mask, a background, a shape, a texture, illumination, and shading.
Examples 15 through 18 are each a computer program product including one or more non-transitory computer readable mediums encoded with instructions that when executed by one or more processors cause operations of a neural network architecture to be carried out so as to generate a manipulated facial image, said neural network architecture including a disentanglement portion and a rendering portion, said disentanglement portion trained to disentangle at least one physical property captured in said facial image, and said rendering portion trained to perform a facial manipulation of said facial image based upon an image formation equation and said at least one physical property, said operations responsive to receiving said facial image at said disentanglement portion of said neural network architecture, said operations comprising the method of any of Examples 11 through 14. The one or more non-transitory computer readable mediums may be any physical memory device, such as one or more computer hard-drives, servers, magnetic tape, compact discs, thumb drives, solid state drives, ROM, RAM, on-chip cache, registers, or any other suitable non-transitory or physical storage technology.
Example 19 is a method for generating a manipulated facial image, the method comprising: associating a respective first intermediate loss function with each of a plurality of first intermediate results generated by a first network portion, wherein each of said plurality of first intermediate results corresponds to a respective intrinsic facial property; providing said plurality of first intermediate results to a second network portion, said second network portion arranged according to an image formation equation for rendering a manipulated facial image based upon said image formation equation; performing a training by imposing a plurality of respective first intermediate loss functions upon each of said first intermediate results, to generate a plurality of weights; assigning said generated weights in said first and second network portions; and providing an input facial image to said first network portion, wherein said first network portion performs a disentanglement of a facial image into said intrinsic facial properties and said second network portion receives said disentangled facial properties to generate a manipulated facial image.
Example 20 includes the subject matter of Example 19, and further includes: associating a respective second intermediate loss function with each of a plurality of second intermediate results associated with said second network portion, wherein said training further imposes said second intermediate loss function upon each of said respective second intermediate results.
Example 21 includes the subject matter of Example 19 or 20, wherein said associated intrinsic properties are at least one of albedo ($A_e$), normal ($N_e$), matte mask ($M$), and background ($I_{bg}$).
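By way of illustration only, the following minimal sketch shows how the intrinsic properties named in Example 21 could be recombined by a rendering portion. The specific compositing form (foreground equals albedo times shading, blended with the background through the matte mask) is an assumption made for this sketch and is not recited in this example; the function name `render_face` is hypothetical, and the shading map is assumed to have been computed beforehand from the normals and the illumination.

```python
import numpy as np

def render_face(albedo, shading, matte, background):
    """Composite a face image from intrinsic maps (illustrative assumption).

    albedo, background: H x W x 3 arrays
    shading:            H x W x 3 (or H x W x 1) array
    matte:              H x W x 1 array with values in [0, 1]
    """
    albedo = np.asarray(albedo, dtype=float)
    shading = np.asarray(shading, dtype=float)
    matte = np.asarray(matte, dtype=float)
    background = np.asarray(background, dtype=float)

    # Assumed image formation: the foreground is albedo modulated by shading.
    foreground = albedo * shading

    # Assumed matting: blend foreground and background per pixel.
    return matte * foreground + (1.0 - matte) * background
```

Because every step is an elementwise product or sum, a layer arranged this way is differentiable, which is what would allow losses on the rendered image to propagate back to the individual property maps.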
Example 22 includes the subject matter of Example 21, and further includes: generating a pseudo ground truth ($\hat{N}$) for said normal representation $N_e$, wherein said pseudo ground truth is utilized in one of said first intermediate loss functions according to the relationship: $E_{\text{recon-N}} = \lVert N_e - \hat{N} \rVert^2$.
Example 23 includes the subject matter of Example 22, wherein $\hat{N}$ is estimated by fitting a rough facial geometry to every image in a training set using a 3D morphable model.
Example 24 includes the subject matter of any of Examples 21 through 23, and further includes associating an L1 smoothness intermediate loss function for $A_e$ according to the relationship: $E_{\text{smooth-A}} = \lVert \nabla A_e \rVert$, wherein $\nabla$ is a spatial image gradient operator.
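By way of illustration only, the two intermediate loss functions of Examples 22 and 24 can be sketched as follows. The function names are hypothetical. Two points are assumptions of this sketch rather than statements of the disclosure: the normal reconstruction loss is read as a squared L2 norm summed over all pixels, and the spatial image gradient is approximated by forward differences along the two image axes.

```python
import numpy as np

def recon_normal_loss(n_e, n_hat):
    """Squared L2 distance between predicted normals and the pseudo ground truth."""
    diff = np.asarray(n_e, dtype=float) - np.asarray(n_hat, dtype=float)
    return float(np.sum(diff ** 2))

def smooth_albedo_loss(a_e):
    """L1 norm of the spatial gradient of the albedo map.

    The gradient is approximated with forward differences along the
    row axis (0) and the column axis (1); this choice is an assumption.
    """
    a = np.asarray(a_e, dtype=float)
    dy = np.diff(a, axis=0)  # vertical neighbor differences
    dx = np.diff(a, axis=1)  # horizontal neighbor differences
    return float(np.abs(dy).sum() + np.abs(dx).sum())
```

A spatially constant albedo map yields a smoothness loss of zero, so this term penalizes only variation between neighboring pixels and leaves the overall albedo level unconstrained.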
Example 25 includes the subject matter of any of Examples 19 through 24, wherein generating a manipulated facial image further comprises: providing at least one positive data element ($\{x_p\}$) to said first network portion to generate a respective positive code ($\{z_p\}$); providing at least one negative data element ($\{x_n\}$) to said first network portion to generate a respective negative code ($\{z_n\}$); measuring an empirical distribution of an input image ($\{Z_{\text{Source}}\}$); generating a transformed empirical distribution ($\{Z_{\text{Trans}}\}$) from $\{Z_{\text{Source}}\}$ by moving the distribution $\{Z_{\text{Source}}\}$ toward $\{z_p\}$; and generating said manipulated facial image by decoding $\{Z_{\text{Trans}}\}$.
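By way of illustration only, the code-space manipulation of Example 25 can be sketched with a simple linear shift. This is a deliberately simplified stand-in: the direction is taken as the difference between the mean positive code and the mean negative code, which is an assumption of this sketch and not necessarily how the transformed distribution is computed in any given embodiment. The function name `shift_toward_positive` and the `step` parameter are hypothetical, and the encoding and decoding steps performed by the network portions are outside the sketch.

```python
import numpy as np

def shift_toward_positive(z_source, z_pos, z_neg, step=1.0):
    """Move source codes along the mean(negative) -> mean(positive) direction.

    z_source: (n, d) codes of the input image(s)
    z_pos:    (p, d) codes of positive examples (attribute present)
    z_neg:    (q, d) codes of negative examples (attribute absent)
    step:     scalar controlling how far the codes are moved
    """
    z_source = np.asarray(z_source, dtype=float)
    # Assumed direction: difference of the two empirical means.
    direction = np.mean(z_pos, axis=0) - np.mean(z_neg, axis=0)
    return z_source + step * direction
```

Setting `step` to zero returns the source codes unchanged, and larger values would produce a stronger edit once the shifted codes are decoded.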
Examples 26 through 32 are each a computer program product including one or more non-transitory machine readable mediums encoded with instructions that when executed by one or more processors cause a process to be carried out for generating a manipulated facial image from an input facial image, the process comprising the method of any of Examples 19 through 25. The previous disclosure with respect to the non-transitory computer readable medium(s) is equally applicable here.
The foregoing description of example embodiments of the disclosure has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in light of this disclosure. It is intended that the scope of the disclosure be limited not by this detailed description, but rather by the claims appended hereto.