Generative machine-learning models, such as generative adversarial networks (GANs), are a class of deep learning models used for generating data samples, such as images, that resemble a given dataset. Although different domains may share similar semantics and characteristics, conventional generative machine-learning models are unable to generate aligned data samples across different domains simultaneously.
Embodiments of the present disclosure relate to systems and methods for multi-domain generative adversarial networks with learned warp fields. The present disclosure provides generative machine-learning models that can simultaneously generate aligned image samples from multiple related domains. The generative machine-learning models can be updated/trained to learn the shared features across multiple domains and a per-domain morph layer to morph shared features according to each domain. Conventional generative machine-learning models fail to simultaneously model images with highly varying geometries, such as images of human faces, painted and artistic faces, as well as multiple different animal faces.
The techniques described herein leverage the fact that a variety of object classes may share common attributes, despite certain geometric differences. In addition to simultaneously generating aligned image samples from multiple related domains, the generative machine-learning models described herein produce aligned samples that can be used for applications such as segmentation transfer and cross-domain image editing, as well as training in low-data regimes. The generative machine-learning models described herein can be utilized for image-to-image translation tasks, and greatly surpass the performance of conventional approaches in cases where the geometric differences between domains are large.
At least one aspect relates to a processor. The processor can include one or more circuits. The one or more circuits can generate input data according to a noise function. The one or more circuits can determine, using a generative machine-learning model and based at least on the input data, a plurality of output images each corresponding to one of a respective plurality of image domains. The generative machine-learning model can generate a plurality of morph maps each corresponding to one of the respective plurality of image domains. The one or more circuits can present, using a display device, the plurality of output images.
In some implementations, the one or more circuits can update/train the generative machine-learning model by applying the input data to a generative neural network of the generative machine-learning model to generate a set of output features. In some implementations, the one or more circuits can update/train the generative machine-learning model by applying the plurality of morph maps to the set of output features to generate a set of morphed output features. In some implementations, the one or more circuits can update/train the generative machine-learning model based at least on a plurality of outputs of a respective plurality of discriminator models that respectively receive the plurality of output images as input. In some implementations, each of the respective plurality of discriminator models corresponds respectively to one of the respective plurality of image domains.
In some implementations, each of the respective plurality of image domains corresponds to a geometrically different domain. In some implementations, the plurality of morph maps each comprises a pixel-wise transformation vector. In some implementations, the generative machine-learning model comprises a plurality of rendering layers updated to generate the plurality of output images. In some implementations, the plurality of rendering layers receives a sum calculated based at least on a set of morphed features. In some implementations, the plurality of rendering layers comprise at least one shared weight value. In some implementations, the generative machine-learning model comprises a plurality of layers, at least one layer of the generative machine-learning model being a convolution layer.
At least one other aspect is related to a processor. The processor can include one or more circuits. The one or more circuits can determine, using a generative machine-learning model and based at least on input noise data, a plurality of output images, each corresponding to one of a respective plurality of image domains. The generative machine-learning model can generate a plurality of morph maps, each corresponding to one of the respective plurality of image domains. The one or more circuits can update/train the generative machine-learning model based at least on a plurality of outputs from a respective plurality of discriminator models. Each of the respective plurality of discriminator models can correspond respectively to one of the respective plurality of image domains.
In some implementations, the generative machine-learning model comprises a pre-trained generative neural network. In some implementations, the generative neural network comprises a variational autoencoder. In some implementations, the generative neural network comprises a generative adversarial network. In some implementations, the generative machine-learning model is, or comprises, any combination of one or more of: a pre-trained generative neural network; a variational autoencoder; and/or a generative adversarial network. In some implementations, each of the respective plurality of image domains corresponds to a geometrically different domain. In some implementations, the plurality of morph maps each comprises a pixel-wise transformation vector. In some implementations, the plurality of morph maps are generated using at least a first layer of the generative machine learning model, and the one or more circuits can update the generative machine-learning model by applying the plurality of morph maps to a set of features generated by at least a second layer of the generative machine-learning model.
Yet another aspect of the present disclosure is related to a method. The method can include generating, by using one or more processors, input data according to a noise function. The method can include determining, using the one or more processors and a generative machine-learning model, based at least on the input data, a plurality of output images each corresponding to one of a respective plurality of image domains. The generative machine-learning model can generate a plurality of morph maps each corresponding to one of the respective plurality of image domains. The method can include presenting, using the one or more processors and using a display device, the plurality of output images.
In some implementations, the method can include updating/training/establishing, by using the one or more processors, the generative machine-learning model by applying the input data to a generative neural network of the generative machine-learning model to generate a set of output features. In some implementations, the method can include updating, by using the one or more processors, the generative machine-learning model by applying the plurality of morph maps to the set of output features to generate a set of morphed output features.
The processors, systems, and/or methods described herein can be implemented by or included in at least one of a system associated with an autonomous or semi-autonomous machine (e.g., an in-vehicle infotainment system); a system for performing simulation operations; a system for performing digital twin operations; a system for performing light transport simulation; a system for performing collaborative content creation for 3D assets; a system for performing deep learning operations; a system implemented using an edge device; a system implemented using a robot; a system for generating or presenting virtual reality (VR) content, augmented reality (AR) content, and/or mixed reality (MR) content; a system for performing conversational AI operations; a system for performing generative AI operations using a large language model (LLM); a system for generating synthetic data; a system incorporating one or more virtual machines (VMs); a system implemented at least partially in a data center; or a system implemented at least partially using cloud computing resources.
The present systems and methods for multi-domain generative adversarial networks with learned warp fields are described in detail below with reference to the attached drawing figures, wherein:
This disclosure relates to systems and methods that train/update and execute multi-domain generative adversarial networks that utilize learned warp fields. At a high level, a computing system can extend the functionality of a pre-trained generative adversarial network by incorporating additional domain-specific morph layers during fine-tuning. Features for multiple spatial resolutions generated by the GAN can be utilized by the domain-specific morph layers to modify the geometry embedded in the features for multiple target domains.
Generative machine-learning models, such as GANs, are a class of deep learning models used for generating new data samples, such as images, that resemble (or are otherwise based on) other data (e.g., data that the models “learn” from). As used herein, deep learning refers to neural networks that have three or more layers, and can be especially adept at learning from large datasets. Generative machine-learning models may be utilized, for example, to generate images, perform image editing, inverse rendering, style transfer, image-to-image translation, and semi-supervised learning. GANs may be updated/trained on individual domains, or classes of data, but often struggle to generalize to many domains. For example, generative machine-learning models may be updated/trained to generate images of human faces from input data. As used herein, related domains are domains with data sharing certain features, such as semantic characteristics. Although different domains may share similar semantics and other characteristics, updated/trained generative machine-learning models are unable to generate aligned data samples across different domains simultaneously.
This is because the geometry of the output data corresponding to different target domains can vary significantly even though the domains may share similar semantics and other characteristics. For example, a face of a human is significantly different geometrically from a face of a cat or a dog. As the geometric differences between different domains increase, so does the difficulty in utilizing conventional techniques to update/train a generative machine-learning model that produces simultaneous outputs corresponding to the different domains.
To address these challenges, the systems and methods described herein provide techniques to update/train and execute a shared generative machine-learning model that includes morph maps that geometrically deform and adapt feature maps produced by an underlying generative network. By sharing generative network layers, the semantic properties learned by the model can be shared across all modeled domains, while the geometric differences are still correctly reflected due to the additional, domain-specific morph operations. The generative machine-learning model described herein can include a shallow convolutional network that renders the morphed features into correctly stylized and geometrically aligned outputs that are also semantically consistent across multiple domains.
To update/train the generative machine-learning models described herein, a generative neural network (e.g., a GAN, etc.) is first pre-trained/implemented on an initial domain that includes semantic features that are shared by other target domains for the generative machine-learning model. The generative neural network can be pre-trained/pre-configured using any suitable machine-learning technique and may include a GAN learning framework or a variational autoencoder (VAE) framework. Once the generative neural network has been pre-trained, the generative machine-learning model, including the generative neural network, can be fine-tuned to produce simultaneous outputs. The generative machine-learning model includes a morph network, which receives intermediate features produced by the generative neural network as input and produces corresponding morph maps for each target domain as output.
The intermediate features may include semantic information and fine-grained edge information. The intermediate features are provided as input to the morph network of the generative machine-learning model during fine-tuning to produce corresponding morph maps for each target domain. The morph maps are used to transform the intermediate features to produce morphed features for each target domain, which are subsequently provided to rendering layers (which are also updated/trained during the fine-tuning process) that respectively correspond to and produce a respective output image for each target domain.
In the example of updating/training a GAN, the output images are provided as input to respective discriminator models, which are updated/trained concurrently with the generative machine-learning model to receive images and generate a prediction of whether the input image was a real image or generated by the generative machine-learning model. The training set for each discriminator model can include real images corresponding to the target domain and images generated by the corresponding rendering layers of the generative machine-learning model. The output of the discriminator model can be utilized to update/train the layers of the generative machine-learning model. Using this process, the generative machine-learning model can be updated/trained to learn the geometric differences between domains in an unsupervised manner.
The system 100 can train, update, or configure one or more generative machine-learning models 104. The generative machine-learning model 104 can include a generative neural network 106 and one or more morph layers 108. The generative machine-learning model 104 may receive noise data or other data as input and can simultaneously generate aligned image samples from multiple related domains (e.g., once updated/trained using the parent domain data 114 and the additional domain data 116). The generative neural network 106 of the generative machine-learning model 104 may include a GAN, a VAE, or another suitable generative neural network.
The generative neural network 106 can include an input layer, one or more output layers, and/or one or more intermediate layers, such as hidden layers, which can each have respective nodes. The generative neural network 106 may include, or may be generated from, a pre-trained neural network, such as a pre-trained network that is updated/trained on the parent domain data 114 corresponding to a first image domain. Image domains refer to spaces of possible images that may share geometric and semantic similarities (e.g., images of a similar category). For example, image domains may include images of human faces, images of different types of cars, images of cat faces, images of wildlife, or other types of images that share semantic and geometric similarities, or otherwise depict similar categories of content.
The generative neural network 106 may be a pre-trained neural network that is updated, trained or fine-tuned on a single image domain, such as an image domain represented by the parent domain data 114. Conventional generative neural network updating/training techniques may use approaches such as fine-tuning a pre-trained generative network, which loses the ability to sample from the parent domain, or, more generally, multiple domains at the same time. As such, the generative neural network 106, on its own, may be unable to simultaneously generate aligned samples from multiple image domains (e.g., which may vary geometrically, but not necessarily semantically). To address these issues, the system 100 implements the generative machine-learning model 104 to include one or more morph layers 108, which receive the output of the generative neural network 106 and generate corresponding morph maps that can be used to geometrically deform and adapt the features output by the generative neural network 106.
In particular, the one or more morph layers 108 can be updated/trained to predict domain-specific morph maps which warp the output features of the generative neural network 106 according to different geometries of different image domains for which the generative machine-learning model 104 is updated/trained (e.g., image domains corresponding to the additional domain data 116). In some implementations, the generative machine-learning model 104 includes one or more additional rendering layers 110 (e.g., convolutional neural network layers), which can generate correctly stylized and geometrically aligned outputs based at least on the morphed features. The output images produced by the generative machine-learning model 104 can be semantically consistent across multiple domains.
One advantage of the generative machine-learning model 104 is an ability to generate aligned output images simultaneously based at least on input data (e.g., noise data). Many related domains may have similar semantics and characteristics, such as animal faces or face paintings. Aligned images may be images that share common attributes and conditions across domains, such as pose and lighting. By leveraging a single pre-trained generative neural network 106, the generative machine-learning model 104 provides computational advantages by sharing weights across domains. Additionally, the generative machine-learning model 104 may be trained, updated, or fine-tuned to a variety of applications, such as transferring segmentation labels from one domain to another, in which such information may not be available. Additional applications for the generative machine-learning model 104 include expressive image editing across different domains, image-to-image translation across domains, and zero-shot semantic segmentation transfer across domains.
In some implementations, the generative neural network 106 may be a pre-trained neural network, which is extended via the one or more morph layers 108 to generate simultaneous outputs across multiple, geometrically distinct domains. In some implementations, the generative neural network 106 includes a GAN, such as a model based at least on StyleGAN2, which is configured to produce multiple output features that are provided as input to the one or more morph layers 108, as described in further detail herein.
The system 100 can configure the generative machine-learning model 104 according to multiple image domains, which may be included in the training data elements 112. The training data elements 112 include the parent domain data 114 and additional domain data 116. The system 100 can configure the generative machine-learning model 104 using any suitable generative machine-learning techniques, including variational autoencoders or generative adversarial networks. Configuring the generative machine-learning model 104 may include updating the weights and biases based at least on a loss value calculated according to the techniques described herein. For example, updating the generative machine-learning model 104 may include updating weights, biases, or other trainable parameters of at least the one or more morph layers 108. In some implementations, the parameters of the generative neural network 106 may be pre-trained, and not updated based at least on the training process for updating the generative machine-learning model 104 to generate aligned samples from multiple image domains.
The system 100 can operate using the training data elements 112 (e.g., training data images), which may be retrieved from one or more databases or sources of training data. The training data elements 112 include parent domain data 114 and additional domain data 116. The parent domain data 114 may include a set of images from a parent domain upon which the generative neural network 106 has been pre-trained. In some implementations, the system 100 may update/train the generative neural network 106 using a suitable training technique, such as generative adversarial training using the parent domain data 114. The parent domain data 114 may be a large-scale dataset that includes multiple images that share geometric, semantic, and categorical similarities.
The additional domain data 116 may include multiple image datasets for each additional domain for which the generative machine-learning model 104 is to be updated/trained. The image datasets of the parent domain data 114 and each domain of the additional domain data 116 may be large-scale image datasets that include images that share common characteristics or features specific to each domain, such as objects, scenes, or styles. The images from each domain of the parent domain data 114 and the additional domain data 116 may be diverse within said domain to ensure the generative machine-learning model 104 is updated/trained according to a wide range of features and variations within each domain.
In the example training process described herein, the set of datasets for domains to be updated/trained may be referred to as D={πP, π1, . . . , πN}, where πP refers to the parent image dataset of the parent domain included in the parent domain data 114. The one or more morph layers 108 of the generative machine-learning model 104 can include domain-specific morph layers, which may be represented herein as M1, . . . , MN. As described herein, the generative machine-learning model 104 can include one or more rendering layers 110 (sometimes represented herein as R1, . . . , RN), which may be domain-specific rendering layers. The rendering layers 110 are described in further detail in connection with
The generative machine-learning model 104 may be configured/implemented by the system 100 to receive input data (e.g., noise generated from a noise distribution), and produce one or more output images 120. The output images 120 may each correspond to a respective domain upon which the generative machine-learning model 104 has been updated/trained, including the domain represented by the images of the parent domain data 114 and each domain represented by the datasets included in the additional domain data 116.
To generate the output images 120, the system 100 can generate noise data from a standard normal distribution (e.g., by sampling a vector z˜p(z), where p(z) is a standard normal distribution) and provide the noise data as input to the generative neural network 106. As described herein, the generative neural network 106 can be a pre-trained generative model that is updated/trained using the image dataset of the parent domain included in the parent domain data 114. The system 100 can execute the generative neural network 106 by propagating the input noise data through each layer of the generative neural network 106, computing the mathematical operations of each layer, and providing the result of the computations as input to the next layer until an output is generated. The generative neural network 106 can be pre-trained to produce an output image corresponding to the parent domain and one or more intermediate features (described in further detail in connection with
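As a non-limiting illustration, the following sketch shows one way noise data may be sampled and propagated through a pre-trained generator to obtain both a parent-domain output and per-resolution intermediate features. The interface is an assumption for illustration (e.g., the return_features argument is hypothetical); a particular generator implementation may instead expose intermediate features through forward hooks.

```python
import torch

def run_parent_generator(generator, batch_size=4, z_dim=512, device="cuda"):
    # Sample noise data z from a standard normal distribution (z ~ p(z)).
    z = torch.randn(batch_size, z_dim, device=device)
    # `return_features=True` is an assumed interface for retrieving the
    # per-resolution intermediate features u1, ..., uL alongside the image.
    parent_image, features = generator(z, return_features=True)
    return parent_image, features
```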
The intermediate features (e.g., the features 206 of
The system 100 can then perform a morph operation, which bilinearly interpolates the features produced by the generative neural network 106, to generate respective sets of morphed features that are geometrically transformed to be suitable for each additional domain d represented in the additional domain data 116. The system 100 can then provide the geometrically transformed features as input to respective rendering layers 110 for each target domain d represented in the additional domain data 116, which produce the output images 120. Each of the output images 120 corresponds to a respective target domain. The output images 120 may each include respective RGB images.
The system 100 can utilize any suitable training process to update one or more trainable parameters of the generative machine-learning model 104. For example, the system 100 can use a gradient descent operation, such as stochastic gradient descent or another optimization algorithm, to update the trainable parameters of the morph layers 108, the rendering layers 110, or other layers or trainable parameters of the generative machine-learning model 104. In the example where the generative machine-learning model 104 is updated/trained as a generative adversarial network, the system 100 can update/train the generative machine-learning model 104 using separate discriminator models 111 with the same architecture for each domain represented in the parent domain data 114 and the additional domain data 116. The system 100 can calculate/determine a suitable loss value for updating/training the parameters of the generative machine-learning model 104, such as a non-saturating logistic loss. In some implementations, R1 regularization and path-length regularization may be utilized to update/train the generative machine-learning model 104.
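As a non-limiting illustration, the following PyTorch-style sketch shows one way a per-domain discriminator loss may combine a logistic adversarial term with R1 regularization on real images; the r1_gamma value is a placeholder for illustration rather than a value prescribed by this disclosure.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(discriminator, real, fake, r1_gamma=10.0):
    # Logistic GAN loss: push real logits up and fake logits down.
    real = real.detach().requires_grad_(True)
    real_logits = discriminator(real)
    fake_logits = discriminator(fake.detach())
    adv = F.softplus(-real_logits).mean() + F.softplus(fake_logits).mean()
    # R1 regularization: penalize the gradient of the real logits with
    # respect to the real input images.
    grads = torch.autograd.grad(real_logits.sum(), real, create_graph=True)[0]
    r1 = grads.pow(2).sum(dim=[1, 2, 3]).mean()
    return adv + 0.5 * r1_gamma * r1
```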
The system 100 can utilize equal loss weightings for the morph layers 108 corresponding to each domain. In implementations that utilize low-data regime training, the system 100 may weight the losses by |πd|/maxl|πl|, where |πd| is the number of training examples in the corresponding domain d. Such weighting may be utilized such that the features generated by the generative neural network 106 are mostly learned from data-rich domains while domains with significantly less data leverage the rich representation with domain-specific layers. As described herein, the generative neural network 106 can be initialized from pre-trained weights on the parent domain represented in the images of the parent domain data 114.
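As a non-limiting illustration, the per-domain loss weighting described above may be computed as in the following sketch; the domain names and dataset sizes shown are illustrative placeholders only.

```python
def domain_loss_weights(dataset_sizes):
    """Sketch of the weighting |pi_d| / max_l |pi_l| for low-data training.
    Equal weighting corresponds to returning 1.0 for every domain instead."""
    largest = max(dataset_sizes.values())
    return {domain: size / largest for domain, size in dataset_sizes.items()}

# Illustrative (hypothetical) dataset sizes: a data-rich parent domain and a
# much smaller additional domain.
weights = domain_loss_weights({"parent_faces": 70000, "painted_faces": 1300})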
In some implementations, the system 100 can initialize each of the discriminator model(s) 111 from the same pre-trained model, which can help stabilize training. The system 100 can configure one discriminator model 111 for each target domain (e.g., the domains represented in the parent domain data 114 and the additional domain data 116). Each of the one or more discriminator models 111 can be domain-specific binary classifiers that are updated/trained by the system 100 concurrently with the generative machine-learning model 104 to distinguish between real samples (e.g., images) from the training data elements 112 and samples generated by the generative machine-learning model 104. The discriminator model(s) 111 may include convolutional neural networks that receive image data as input. The system 100 can execute the discriminator model(s) 111 by propagating the input data through each layer of the discriminator model(s) 111 to produce an output value (e.g., a binary output, a probability, or a score) that indicates whether the input data is part of the training data elements 112 or generated by an output layer (e.g., a rendering layer 110) of the generative machine-learning model 104.
During training, the system 100 can update/train the generative machine-learning model 104 and the discriminator models 111 in an adversarial manner, where over time, the generative machine-learning model 104 generates samples that are more difficult for the discriminator models 111 to differentiate, while the discriminator is updated/trained to improve its predictions about whether the generated samples are included in the training data elements 112. This adversarial training process can continue until a training termination condition has been met.
The system 100 can train or otherwise update the generative machine-learning model 104 and the one or more discriminator model(s) 111 by modifying or updating one or more parameters, such as weights, biases, or other training parameters, of various nodes of each model. The system 100 can apply various machine-learning model optimization or modification operations to modify the generative machine-learning model 104 and the discriminator model(s) 111.
In some implementations, the system 100 may hold certain parameters of the generative machine-learning model 104 and/or the discriminator model(s) 111 constant. For example, the system 100 may freeze/maintain the parameters of the first three layers of the discriminator model(s) 111 and the generative neural network 106, such that the weights or other parameters of said layers are not updated during training. In some implementations, the system 100 may share one or more weights, trainable parameters, or layers across multiple rendering layers 110. For example, the system 100 may share the weights of k rendering layers 110 across the domains represented in the parent domain data 114 and the additional domain data 116. Sharing such layers can promote rendering similar styles (e.g., colors) across output images 120 for different domains.
The number of shared parameters of the rendering layers 110 may be a hyperparameter of the generative machine-learning model 104, which can be adjusted at model generation to increase or decrease how similar in style the output images 120 for each domain become. In some implementations, the hyperparameter k indicating the number of shared layers across the rendering layers 110 can be set to 1. In some implementations, the shared rendering layer 110 may be a 4×4 spatial-resolution layer, and sharing said layer may be utilized to produce similarly styled output images 120. Once generated, the system 100 can present the output images 120 using a display device, store the output images 120 in memory, such as a database, or provide the output images 120 to another computing system (e.g., via a network or a suitable communications bus or interface).
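As a non-limiting illustration, the following sketch shows one way a single shared low-resolution block (k=1) may be reused across otherwise domain-specific renderers so that output styles remain similar; the module structure and channel counts are assumptions made for illustration rather than a prescribed architecture.

```python
import torch.nn as nn

class DomainRenderer(nn.Module):
    """Per-domain renderer that reuses a shared block and adds its own
    domain-specific convolution layers (channel counts are placeholders)."""

    def __init__(self, shared_block, channels=64):
        super().__init__()
        self.shared_block = shared_block          # same module instance across domains
        self.domain_specific = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, 3, 1),            # 1x1 convolution to RGB
        )

    def forward(self, morphed_features):
        return self.domain_specific(self.shared_block(morphed_features))

shared_block = nn.Conv2d(64, 64, 3, padding=1)    # weights shared across all domains
renderers = [DomainRenderer(shared_block) for _ in range(3)]
```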
In some implementations, the system 100 can use various subsets of the training data elements 112 to configure the generative machine-learning model 104 and the discriminator model(s) 111. For example, the system 100 can use different batches, or sets, of training data to update/train the models, and may allocate a portion of both the parent domain data 114 and the additional domain data 116 as a test set that is not exposed to the models during training/updating. During one iteration of training, the system 100 may execute the generative machine-learning model 104 using noise data as input as described herein to generate a set of output images 120 that each correspond to a respective domain (e.g., the parent domain and each additional domain represented in the additional domain data 116). Each output image 120 can then be provided as input to a respective discriminator model 111, which can output a prediction of whether the output image 120 was generated by the generative machine-learning model 104. The output of each discriminator model 111 can be utilized to calculate a loss for the generative machine-learning model 104. The system 100 can update/train each discriminator model 111 with corresponding training data elements 112 (e.g., images included in the parent domain data 114 and/or the additional domain data 116) for each domain, as described herein. An example dataflow diagram showing example operations of the generative machine-learning model 104 and the discriminator models 111 is shown in
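As a non-limiting illustration of one such training iteration, the following sketch scores each domain's generated images with its own discriminator and accumulates a weighted non-saturating loss for the generative machine-learning model; the dictionary-based interfaces for the model and the discriminators are assumptions made for illustration.

```python
import torch.nn.functional as F

def generator_iteration(model, discriminators, loss_weights, z):
    """One generator-side iteration: the model maps shared noise to one image
    batch per domain, each batch is scored by its own discriminator, and a
    weighted non-saturating logistic loss is accumulated."""
    output_images = model(z)                      # assumed: dict of domain -> images
    total_loss = 0.0
    for domain, fake_images in output_images.items():
        logits = discriminators[domain](fake_images)
        total_loss = total_loss + loss_weights[domain] * F.softplus(-logits).mean()
    total_loss.backward()                         # gradients for an optimizer step
    return total_loss
```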
Referring to
The GAN generator 204 may be, for example, a pre-trained StyleGAN2 model, or another suitable generative model. For the purposes of describing how the generative machine-learning model 200 produces output data, and as described in connection with
As shown in
To generate the output images 216 using the generative machine-learning model 200, noise data (e.g., a noise vector) z˜p(z) can be sampled from a standard normal distribution, which is provided as input to the GAN generator 204. In some implementations, additional input data, such as metadata defining certain features to generate, may be provided as input to cause the GAN generator 204 to generate certain types of outputs (e.g., outputs having certain characteristics, styles, etc.). The GAN generator 204 can then be executed by propagating the input data through each layer of the GAN generator 204, producing the output image IP (e.g., shown as the top output image 216) and also the intermediate features 206, which may be represented herein as the u1, . . . , uL for L features in the GAN generator 204.
The generator features 206 may be stored for each spatial resolution from 2^2×2^2 to 2^(L+1)×2^(L+1) before the final features are transformed via a 1×1 convolution layer (e.g., tRGB) that produces the output RGB values. In this example, square images with H=W=2^(L+1) are utilized for simplicity. However, it should be understood that the parameters of the generative machine-learning model 200 may be modified to accommodate images of any dimension or resolution.
The features 206 generated by the GAN generator 204 can be provided as input to the MorphNet 208 to produce domain-specific warp fields 210 (sometimes referred to herein as morph maps) that can modify the geometry embedded in the features 206 to be suitable for each target domain. The MorphNet 208 can reduce the channel dimension for each feature 206 to be smaller through a 1×1 convolution layer. The MorphNet 208 can include further layers that then upsample all reduced features to match the largest spatial resolution H×W.
In this example implementation, the upsampled features are concatenated channel-wise within the MorphNet 208 and are propagated through two 3×3 convolution layers of the MorphNet 208. In some implementations, a fixed 2-dimensional sinusoidal positional encoding is added to the merged features to inject grid position information which can be useful for learning geometric biases in a dataset.
This tensor is then processed by domain-specific convolutional layers for each domain within the MorphNet 208. The convolutional layers produce an H×W×2 morph map (shown here as the warp fields 210). Each of the warp fields 210 may include an (x, y) vector transformation for each position (e.g., pixel) within the H×W feature maps 206. A respective warp field 210 may be generated for each additional domain d. The values of the warp fields 210 can be normalized between [−1/η, 1/η], for example, through a tanh activation function, where η is a hyperparameter that controls the maximum displacement the morphing operation is allowed to produce. Each warp field 210 represents the relative horizontal and vertical direction from which each pixel gets its value (e.g., where a pixel is a (p, q) position in a 3-dimensional spatial tensor). In the various examples described in further detail herein, the morph hyperparameter is set to η=3, such that each pixel can move at most ⅙ of the image size in the x and y direction. The hyperparameter can be adjusted depending on the geometric gap between domains.
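As a non-limiting illustration, the following PyTorch-style sketch mirrors the MorphNet structure described above. The channel widths, the simplified positional-encoding helper, and the use of a single convolution per domain-specific head are assumptions made for illustration rather than the exact architecture of the MorphNet 208.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def sinusoidal_encoding(x):
    # Simplified stand-in for a fixed 2D sinusoidal positional encoding,
    # broadcast over the channel dimension of x (shape: N x C x H x W).
    n, c, h, w = x.shape
    ys = torch.linspace(0.0, math.pi, h, device=x.device).view(1, 1, h, 1)
    xs = torch.linspace(0.0, math.pi, w, device=x.device).view(1, 1, 1, w)
    return torch.sin(ys) + torch.cos(xs)

class MorphNet(nn.Module):
    """Reduce each generator feature with a 1x1 convolution, upsample to the
    largest resolution H x W, concatenate, add a positional encoding, merge
    with two 3x3 convolutions, and emit one two-channel warp field per target
    domain bounded to [-1/eta, 1/eta]."""

    def __init__(self, feature_channels, num_domains, hidden=64, eta=3.0):
        super().__init__()
        self.eta = eta
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, hidden, kernel_size=1) for c in feature_channels]
        )
        self.merge = nn.Sequential(
            nn.Conv2d(hidden * len(feature_channels), hidden, 3, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.LeakyReLU(0.2),
        )
        # One domain-specific head per target domain producing (x, y) offsets.
        self.heads = nn.ModuleList(
            [nn.Conv2d(hidden, 2, 3, padding=1) for _ in range(num_domains)]
        )

    def forward(self, features):
        h, w = features[-1].shape[-2:]            # largest spatial resolution
        reduced = [
            F.interpolate(conv(u), size=(h, w), mode="bilinear", align_corners=True)
            for conv, u in zip(self.reduce, features)
        ]
        x = torch.cat(reduced, dim=1)
        x = x + sinusoidal_encoding(x)            # inject grid position information
        x = self.merge(x)
        # tanh bounds each per-pixel displacement to [-1/eta, 1/eta].
        return [torch.tanh(head(x)) / self.eta for head in self.heads]
```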
The warp fields 210 can then be utilized to differentiably morph each of the generated features 206 to generate the corresponding morphed features 212. To do so, a 2D sampling grid can be initialized from an identity transformation matrix, and normalized between [−1, 1]. The sampling grid can be generated to have the same shape as the warp fields 210, and each pixel (p, q) in the sampling grid can be initialized to include the absolute position (x, y) of the source pixel that will be morphed into (p, q). For example, if pixel (p, q) has value (−1, −1), the vector at the top left corner of the corresponding feature map 206 will be morphed into (p, q). The warp field 210 is added to the grid. The resulting grid is represented mathematically herein as Γ∈ℝ^(H×W×2). The warp fields 210 are pixel-wise morphing maps, which provide precise control for fine-detailed morphing.
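As a non-limiting illustration, the sampling grid Γ may be constructed as in the following sketch, which assumes warp fields shaped (N, 2, H, W) with channels ordered (x, y).

```python
import torch
import torch.nn.functional as F

def build_sampling_grid(warp_field):
    """Initialize an identity sampling grid normalized to [-1, 1] (so each
    pixel samples its own location) and add the predicted warp field."""
    n, _, h, w = warp_field.shape
    theta = torch.eye(2, 3, device=warp_field.device).unsqueeze(0).repeat(n, 1, 1)
    identity = F.affine_grid(theta, size=(n, 1, h, w), align_corners=True)  # (N, H, W, 2)
    return identity + warp_field.permute(0, 2, 3, 1)                        # grid Gamma
```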
To generate the morphed features 212, the following morph operation, which bilinearly interpolates features, is performed for each layer l of the features 206: ũlpq=Σn Σm ulnm·max(0, 1−|xpq−m|)·max(0, 1−|ypq−n|), where the sums run over the spatial positions (n, m) of layer l.
In the above equation, ũlpq is the corresponding morphed feature 212 vector at pixel (p, q) for layer l, ulnm is the source feature 206 vector prior to the morph operation at pixel (n, m), and (xpq, ypq) is the sample point in the grid Γ for the pixel (p, q), assuming unnormalized grid coordinates for ease of presentation. In this example, the grid Γ is bilinearly interpolated to match the spatial dimension of each layer (Hl, Wl). The morphed features 212 (represented mathematically herein as {ũ1, . . . , ũL}d for each domain d) are then geometrically transformed to be suitable for each domain d.
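As a non-limiting illustration, the morph operation may be carried out with a bilinear grid-sampling primitive, as in the following sketch, which resizes the sampling grid Γ to each layer's spatial size before sampling.

```python
import torch.nn.functional as F

def morph_features(features, grid):
    """Resize the sampling grid Gamma to each layer's spatial size and
    bilinearly sample the features at the displaced positions."""
    morphed = []
    for u in features:                                     # u: (N, C_l, H_l, W_l)
        h_l, w_l = u.shape[-2:]
        grid_l = F.interpolate(grid.permute(0, 3, 1, 2), size=(h_l, w_l),
                               mode="bilinear", align_corners=True)
        grid_l = grid_l.permute(0, 2, 3, 1)                # (N, H_l, W_l, 2)
        morphed.append(F.grid_sample(u, grid_l, mode="bilinear",
                                     align_corners=True))
    return morphed
```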
Each of the morphed features 212 is then provided as input to the renderers 214 (e.g., the rendering layers 110 of
Once the output images 216 have been generated by executing the renderers 214 of the generative machine-learning model 200, the output images 216 can be provided as input to the discriminators 218 (e.g., the discriminator models 111 of
Referring to
Referring to
Referring to
Based at least on these examples, it is shown that the *DC-StyleGAN2 model, which fine-tunes a pre-trained model, has difficulties learning class-conditioning information, as can be seen in its low classification accuracy. Additionally, the third row shows that even without morphing, the generative machine-learning models described herein produce aligned poses due to the shared use of some generator features, but have trouble sharing features across domains because of their geometric differences. In contrast, the generative machine-learning models described herein utilizing the morph operations leverage domain-specific layers but still benefit from sharing the entire stack of features due to the geometric morphing, and therefore achieve the best overall sample quality and accuracy on all datasets, as shown in the bottom row.
Some examples of how edit vectors can be transferred across all domains are shown in
Referring to
As the morph map MA captures the geometric differences between domains, the morph map MA can be used to transfer the segmentation masks of the parent domain across one or more target domains, as shown in
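As a non-limiting illustration, the following sketch warps a parent-domain segmentation mask with the same sampling grid used to morph the features for a target domain; the mask shape and the use of nearest-neighbor sampling (to keep label indices discrete) are assumptions made for illustration.

```python
import torch.nn.functional as F

def transfer_segmentation(parent_mask, grid):
    """Warp a parent-domain segmentation mask (assumed (N, 1, H, W) integer
    class labels) with the sampling grid for a target domain."""
    warped = F.grid_sample(parent_mask.float(), grid, mode="nearest",
                           align_corners=True)
    return warped.long()
```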
Referring to
The StarGANv2 model has a strong shape bias from the input image, and therefore has trouble translating to another car domain, as indicated by its low accuracy in Table 2 below. For an example dataset that includes images of faces, the StarGANv2 model performs well if updated/trained only on animal faces because geometric differences between animal classes are small. However, when updated/trained on five geometrically distinct domains within the dataset, training collapses and fails to translate between human and animal faces, as represented in both Table 3 and
The generative machine-learning models described herein share features for multiple domains, which can be beneficial for domains with small amounts of data, as they can leverage the rich representations learned from other domains. The generative machine-learning models described herein can be evaluated on a first domain (e.g., the Faces domain) by varying the amount of data for sub-domains included therein (here, the MetFaces and Cat domains) while other domains use the full training data. Results from these training approaches are summarized in Table 5 above. Compared to other machine-learning techniques, the generative machine-learning models described herein achieve better FID performance when the amount of training data is small. In the example experiments summarized in Table 5, StyleGAN2 training mode-collapsed when performed with 5% of the data. The generative machine-learning models described herein can be combined with techniques that explicitly tackle low-data GAN training to achieve useful results.
Referring to
Pre-trained generative neural networks may be fine-tuned for new target domains, but cannot be effectively updated/trained to produce samples for multiple target domains simultaneously. While these methods can achieve high image quality, fine-tuning encourages the child (e.g., fine-tuned) models to be specialized to the new domains. As a further comparison, the same parent model used by the generative machine-learning models described herein was fine-tuned for each domain. For an example dataset including faces, the fine-tuning process preserves some attributes such as pose and colors (with the same latents for original and fine-tuned models).
An example dataset including images of cars reflects more diversity in viewpoints and car placement. The fine-tuned models show different sizes, poses, and backgrounds. On the other hand, the generative machine-learning models described herein produce consistently aligned cars. The viewpoint alignment is evaluated with a regression model by measuring the mean difference in azimuth and elevation between Sedan and other domains. Fine-tuning achieves 53.2 and 3.8 degrees in azimuth and elevation, respectively. Example experimental results show that the generative machine-learning models described herein achieve 21.0 and 2.2 degrees in azimuth and elevation, significantly outperforming the fine-tuning approach of other machine-learning models. As datasets become more diverse, it becomes challenging to enforce alignment without feature sharing for conventional machine-learning techniques. The generative machine-learning models described herein have the advantage of being a single model that directly produces highly aligned samples across domains while enabling a diverse set of applications.
Now referring to
The method 1000, at block B1002, includes generating input data (e.g., the input data 202) according to a noise function. The input data may be generated, for example, by sampling from a multi-dimensional normal distribution. The noise data may include one or more noise vectors. Although various approaches described herein have described input noise data being sampled from a normal distribution, it should be understood that any type of noise function may be utilized to create noise data suitable for the techniques described herein. The noise data may be generated in part based at least on a random number generation algorithm.
The method 1000, at block B1004, includes determining, using a generative machine-learning model (e.g., the generative machine-learning model 104, the generative machine-learning model 200, etc.) and based at least on the input data, one or more output images (e.g., the output images 120, the output images 216). The output images can each correspond to a respective image domain. The generative machine-learning model can have at least one layer (e.g., the morph layers 108, the MorphNet 208) that generates one or more morph maps (e.g., the warp fields 210). Each of the morph maps can correspond to one of the respective image domains.
To determine the output images, a generative neural network of the generative machine-learning model can be executed using the noise data as input to generate a set of output features. The output features can be processed according to the techniques described herein, and provided as input to the one or more morph layers of the generative machine-learning model to generate the morph maps for each image domain. The morph maps can then be applied to the set of input features, as described herein, to produce one or more sets of morphed features (e.g., the morphed features 212). The morphed features can then be provided as input to and propagated through one or more rendering layers (e.g., the rendering layers 110, the renderers 214, etc.), to produce the one or more output images.
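As a non-limiting illustration, the operations of blocks B1002 and B1004 may be composed as in the following sketch, which reuses the hypothetical helpers from the earlier sketches (build_sampling_grid and morph_features) and assumes renderers that accept the multi-resolution list of morphed features; these interfaces are assumptions made for illustration.

```python
def generate_aligned_images(generator, morph_net, renderers, z):
    """Execute the pre-trained generator on noise data, predict a warp field
    per target domain, morph the shared features, and render one aligned
    output image per domain."""
    parent_image, features = generator(z, return_features=True)
    warp_fields = morph_net(features)            # one (N, 2, H, W) field per domain
    outputs = {"parent": parent_image}
    for index, (warp_field, renderer) in enumerate(zip(warp_fields, renderers)):
        grid = build_sampling_grid(warp_field)
        morphed = morph_features(features, grid)
        outputs[f"domain_{index}"] = renderer(morphed)
    return outputs
```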
The method 1000, at block B1006, includes presenting the output images using a display device. In some implementations, the output images can be stored in one or more regions of computer memory (e.g., a database, etc.). In some implementations, the output images can be provided to another computing system (e.g., via a network or a suitable communications bus or interface).
In some implementations, the output images can be utilized to update the generative machine-learning model. For example, once the output images are generated according to the present techniques, the output images, along with images from a training dataset, can be provided as input to one or more discriminator models (e.g., the discriminator models 111, the discriminators 218, etc.). The discriminator models can be updated/trained to determine whether a given input image was produced by the generative machine-learning model or was originally included in the training dataset. A respective discriminator model may be utilized for each target domain. A loss can be calculated for the generative machine-learning model based at least on the output of the discriminators to update/train the generative machine-learning model to produce data that better resembles the training dataset. The loss may be utilized to update/train the discriminator models concurrently to better distinguish between training data samples and samples produced by the generative machine-learning model. The trainable parameters of the generative machine-learning model may be updated according to the appropriate loss(es) using a suitable optimization algorithm, such as gradient descent.
Example Content Streaming System
Now referring to
In the system 1100, for an application session, the client device(s) 1104 may only receive input data in response to inputs to the input device(s) 1126, transmit the input data to the application server(s) 1102, receive encoded display data from the application server(s) 1102, and display the display data on the display 1124. As such, the more computationally intense computing and processing is offloaded to the application server(s) 1102 (e.g., rendering—in particular ray or path tracing—for graphical output of the application session is executed by the GPU(s) of the application server(s) 1102). In other words, the application session is streamed to the client device(s) 1104 from the application server(s) 1102, thereby reducing the requirements of the client device(s) 1104 for graphics processing and rendering.
For example, with respect to an instantiation of an application session, a client device 1104 may be displaying a frame of the application session on the display 1124 based at least on receiving the display data from the application server(s) 1102. The client device 1104 may receive an input to one of the input device(s) 1126 and generate input data in response. The client device 1104 may transmit the input data to the application server(s) 1102 via the communication interface 1120 and over the network(s) 1106 (e.g., the Internet), and the application server(s) 1102 may receive the input data via the communication interface 1118. The CPU(s) 1108 may receive the input data, process the input data, and transmit data to the GPU(s) 1110 that causes the GPU(s) 1110 to generate a rendering of the application session. For example, the input data may be representative of a movement of a character of the user in a game session of a game application, firing a weapon, reloading, passing a ball, turning on a vehicle, etc. The rendering component 1112 may render the application session (e.g., representative of the result of the input data) and the render capture component 1114 may capture the rendering of the application session as display data (e.g., as image data capturing the rendered frame of the application session). The rendering of the application session may include ray or path-traced lighting and/or shadow effects, computed using one or more parallel processing units—such as GPUs, which may further employ the use of one or more dedicated hardware accelerators or processing cores to perform ray or path-tracing techniques—of the application server(s) 1102. In some embodiments, one or more virtual machines (VMs)—e.g., including one or more virtual components, such as vGPUs, vCPUs, etc. —may be used by the application server(s) 1102 to support the application sessions. The encoder 1116 may then encode the display data to generate encoded display data and the encoded display data may be transmitted to the client device 1104 over the network(s) 1106 via the communication interface 1118. The client device 1104 may receive the encoded display data via the communication interface 1120 and the decoder 1122 may decode the encoded display data to generate the display data. The client device 1104 may then display the display data via the display 1124.
Example Computing Device
Although the various blocks of
The interconnect system 1202 may represent one or more links or busses, such as an address bus, a data bus, a control bus, or a combination thereof. The interconnect system 1202 may be arranged in various topologies, including but not limited to bus, star, ring, mesh, tree, or hybrid topologies. The interconnect system 1202 may include one or more bus or link types, such as an industry standard architecture (ISA) bus, an extended industry standard architecture (EISA) bus, a video electronics standards association (VESA) bus, a peripheral component interconnect (PCI) bus, a peripheral component interconnect express (PCIe) bus, and/or another type of bus or link. In some embodiments, there are direct connections between components. As an example, the CPU 1206 may be directly connected to the memory 1204. Further, the CPU 1206 may be directly connected to the GPU 1208. Where there is direct, or point-to-point connection between components, the interconnect system 1202 may include a PCIe link to carry out the connection. In these examples, a PCI bus need not be included in the computing device 1200.
The memory 1204 may include any of a variety of computer-readable media. The computer-readable media may be any available media that may be accessed by the computing device 1200. The computer-readable media may include both volatile and nonvolatile media, and removable and non-removable media. By way of example, and not limitation, the computer-readable media may comprise computer-storage media and communication media.
The computer-storage media may include both volatile and nonvolatile media and/or removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, and/or other data types. For example, the memory 1204 may store computer-readable instructions (e.g., that represent a program(s) and/or a program element(s), such as an operating system). Computer-storage media may include, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which may be used to store the desired information and which may be accessed by computing device 1200. As used herein, computer storage media does not comprise signals per se.
The communication media may embody computer-readable instructions, data structures, program modules, and/or other data types in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” may refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, the communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer-readable media.
The CPU(s) 1206 may be configured to execute at least some of the computer-readable instructions to control one or more components of the computing device 1200 to perform one or more of the methods and/or processes described herein. The CPU(s) 1206 may each include one or more cores (e.g., one, two, four, eight, twenty-eight, seventy-two, etc.) that are capable of handling a multitude of software threads simultaneously. The CPU(s) 1206 may include any type of processor, and may include different types of processors depending on the type of computing device 1200 implemented (e.g., processors with fewer cores for mobile devices and processors with more cores for servers). For example, depending on the type of computing device 1200, the processor may be an Advanced RISC Machines (ARM) processor implemented using Reduced Instruction Set Computing (RISC) or an x86 processor implemented using Complex Instruction Set Computing (CISC). The computing device 1200 may include one or more CPUs 1206 in addition to one or more microprocessors or supplementary co-processors, such as math co-processors.
In addition to or alternatively from the CPU(s) 1206, the GPU(s) 1208 may be configured to execute at least some of the computer-readable instructions to control one or more components of the computing device 1200 to perform one or more of the methods and/or processes described herein. One or more of the GPU(s) 1208 may be an integrated GPU (e.g., with one or more of the CPU(s) 1206) and/or one or more of the GPU(s) 1208 may be a discrete GPU. In embodiments, one or more of the GPU(s) 1208 may be a coprocessor of one or more of the CPU(s) 1206. The GPU(s) 1208 may be used by the computing device 1200 to render graphics (e.g., 3D graphics) or perform general purpose computations. For example, the GPU(s) 1208 may be used for General-Purpose computing on GPUs (GPGPU). The GPU(s) 1208 may include hundreds or thousands of cores that are capable of handling hundreds or thousands of software threads simultaneously. The GPU(s) 1208 may generate pixel data for output images in response to rendering commands (e.g., rendering commands from the CPU(s) 1206 received via a host interface). The GPU(s) 1208 may include graphics memory, such as display memory, for storing pixel data or any other suitable data, such as GPGPU data. The display memory may be included as part of the memory 1204. The GPU(s) 1208 may include two or more GPUs operating in parallel (e.g., via a link). The link may directly connect the GPUs (e.g., using NVLINK) or may connect the GPUs through a switch (e.g., using NVSwitch). When combined together, each GPU 1208 may generate pixel data or GPGPU data for different portions of an output or for different outputs (e.g., a first GPU for a first image and a second GPU for a second image). Each GPU 1208 may include its own memory, or may share memory with other GPUs.
In addition to or alternatively from the CPU(s) 1206 and/or the GPU(s) 1208, the logic unit(s) 1220 may be configured to execute at least some of the computer-readable instructions to control one or more components of the computing device 1200 to perform one or more of the methods and/or processes described herein. In embodiments, the CPU(s) 1206, the GPU(s) 1208, and/or the logic unit(s) 1220 may discretely or jointly perform any combination of the methods, processes and/or portions thereof. One or more of the logic units 1220 may be part of and/or integrated in one or more of the CPU(s) 1206 and/or the GPU(s) 1208 and/or one or more of the logic units 1220 may be discrete components or otherwise external to the CPU(s) 1206 and/or the GPU(s) 1208. In embodiments, one or more of the logic units 1220 may be a coprocessor of one or more of the CPU(s) 1206 and/or one or more of the GPU(s) 1208.
Examples of the logic unit(s) 1220 include one or more processing cores and/or components thereof, such as Data Processing Units (DPUs), Tensor Cores (TCs), Tensor Processing Units (TPUs), Pixel Visual Cores (PVCs), Vision Processing Units (VPUs), Image Processing Units (IPUs), Graphics Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), Tree Traversal Units (TTUs), Artificial Intelligence Accelerators (AIAs), Deep Learning Accelerators (DLAs), Arithmetic-Logic Units (ALUs), Application-Specific Integrated Circuits (ASICs), Floating Point Units (FPUs), input/output (I/O) elements, peripheral component interconnect (PCI) or peripheral component interconnect express (PCIe) elements, and/or the like.
The communication interface 1210 may include one or more receivers, transmitters, and/or transceivers that allow the computing device 1200 to communicate with other computing devices via an electronic communication network, including wired and/or wireless communications. The communication interface 1210 may include components and functionality to allow communication over any of a number of different networks, such as wireless networks (e.g., Wi-Fi, Z-Wave, Bluetooth, Bluetooth LE, ZigBee, etc.), wired networks (e.g., communicating over Ethernet or InfiniBand), low-power wide-area networks (e.g., LoRaWAN, SigFox, etc.), and/or the Internet. In one or more embodiments, logic unit(s) 1220 and/or communication interface 1210 may include one or more data processing units (DPUs) to transmit data received over a network and/or through interconnect system 1202 directly to (e.g., a memory of) one or more GPU(s) 1208. In some embodiments, a plurality of computing devices 1200 or components thereof, which may be similar or different to one another in various respects, can be communicatively coupled to transmit and receive data for performing various operations described herein, such as to facilitate latency reduction.
The I/O ports 1212 may allow the computing device 1200 to be logically coupled to other devices including the I/O components 1214, the presentation component(s) 1218, and/or other components, some of which may be built into (e.g., integrated in) the computing device 1200. Illustrative I/O components 1214 include a microphone, mouse, keyboard, joystick, game pad, game controller, satellite dish, scanner, printer, wireless device, etc. The I/O components 1214 may provide a natural user interface (NUI) that processes air gestures, voice, or other physiological inputs generated by a user. In some instances, inputs may be transmitted to an appropriate network element for further processing, such as to modify and register images. An NUI may implement any combination of speech recognition, stylus recognition, facial recognition, biometric recognition, gesture recognition both on screen and adjacent to the screen, air gestures, head and eye tracking, and touch recognition (as described in more detail below) associated with a display of the computing device 1200. The computing device 1200 may include depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, touchscreen technology, and combinations of these, for gesture detection and recognition. Additionally, the computing device 1200 may include accelerometers or gyroscopes (e.g., as part of an inertial measurement unit (IMU)) that allow detection of motion. In some examples, the output of the accelerometers or gyroscopes may be used by the computing device 1200 to render immersive augmented reality or virtual reality.
The power supply 1216 may include a hard-wired power supply, a battery power supply, or a combination thereof. The power supply 1216 may provide power to the computing device 1200 to allow the components of the computing device 1200 to operate.
The presentation component(s) 1218 may include a display (e.g., a monitor, a touch screen, a television screen, a heads-up-display (HUD), other display types, or a combination thereof), speakers, and/or other presentation components. The presentation component(s) 1218 may receive data from other components (e.g., the GPU(s) 1208, the CPU(s) 1206, DPUs, etc.), and output the data (e.g., as an image, video, sound, etc.).
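As a brief illustration of a presentation component receiving pixel data and outputting it as an image, the sketch below hands a tensor of pixel data to a display path. The use of matplotlib, the random stand-in tensor, and the [-1, 1] value range are assumptions made only for this sketch; they are not specified by the disclosure.

```python
import torch
import matplotlib.pyplot as plt

# Stand-in for pixel data received from a GPU/CPU: a random (3, H, W) tensor in [-1, 1].
pixel_data = torch.rand(3, 32, 32) * 2.0 - 1.0

# Convert to (H, W, 3) in [0, 1] and hand it to the display path as an image.
img = (pixel_data.permute(1, 2, 0) + 1.0) / 2.0
plt.imshow(img.numpy())
plt.axis("off")
plt.show()
```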
Example Data Center
As shown in
In at least one embodiment, grouped computing resources 1314 may include separate groupings of node C.R.s 1316 housed within one or more racks (not shown), or many racks housed in data centers at various geographical locations (also not shown). Separate groupings of node C.R.s 1316 within grouped computing resources 1314 may include grouped compute, network, memory or storage resources that may be configured or allocated to support one or more workloads. In at least one embodiment, several node C.R.s 1316 including CPUs, GPUs, DPUs, and/or other processors may be grouped within one or more racks to provide compute resources to support one or more workloads. The one or more racks may also include any number of power modules, cooling modules, and/or network switches, in any combination.
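Purely as an illustration of grouping node compute resources within a rack to support a workload, the following sketch models node C.R.s and a rack with plain Python dataclasses. The NodeCR and Rack names, the processor counts, and the GPU-count check are hypothetical; this is not a real orchestration API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NodeCR:
    """One node compute resource (node C.R.) with its processor counts."""
    name: str
    cpus: int
    gpus: int
    dpus: int = 0

@dataclass
class Rack:
    """A rack housing a separate grouping of node C.R.s."""
    nodes: List[NodeCR] = field(default_factory=list)

    def total_gpus(self) -> int:
        return sum(node.gpus for node in self.nodes)

# Group several node C.R.s within one rack to back a single training workload.
rack = Rack(nodes=[NodeCR("node-1", cpus=64, gpus=8),
                   NodeCR("node-2", cpus=64, gpus=8, dpus=1)])
assert rack.total_gpus() >= 8, "workload requires at least 8 GPUs"
```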
The resource orchestrator 1312 may configure or otherwise control one or more node C.R.s 1316(1)-1316(N) and/or grouped computing resources 1314. In at least one embodiment, resource orchestrator 1312 may include a software design infrastructure (SDI) management entity for the data center 1300. The resource orchestrator 1312 may include hardware, software, or some combination thereof.
In at least one embodiment, as shown in
In at least one embodiment, software 1332 included in software layer 1330 may include software used by at least portions of node C.R.s 1316(1)-1316(N), grouped computing resources 1314, and/or distributed file system 1338 of framework layer 1320. One or more types of software may include, but are not limited to, Internet web page search software, e-mail virus scan software, database software, and streaming video content software.
In at least one embodiment, application(s) 1342 included in application layer 1340 may include one or more types of applications used by at least portions of node C.R.s 1316(1)-1316(N), grouped computing resources 1314, and/or distributed file system 1338 of framework layer 1320. One or more types of applications may include, but are not limited to, any number of genomics applications, cognitive computing applications, and machine-learning applications, including training or inferencing software, machine-learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), and/or other machine-learning applications used in conjunction with one or more embodiments, such as to train, configure, update, and/or execute the generative machine-learning models 104 and 200.
In at least one embodiment, any of configuration manager 1334, resource manager 1336, and resource orchestrator 1312 may implement any number and type of self-modifying actions based at least on any amount and type of data acquired in any technically feasible fashion. Self-modifying actions may relieve a data center operator of the data center 1300 from making possibly bad configuration decisions and may help avoid underutilized and/or poorly performing portions of the data center.
The data center 1300 may include tools, services, software or other resources to update/train one or more machine-learning models (e.g., train the generative machine-learning models 104 and 200, etc.) or predict or infer information using one or more machine-learning models according to one or more embodiments described herein. For example, a machine-learning model(s) may be updated/trained by calculating weight parameters according to a neural network architecture using software and/or computing resources described above with respect to the data center 1300. In at least one embodiment, trained or deployed machine-learning models corresponding to one or more neural networks may be used to infer or predict information using resources described above with respect to the data center 1300 by using weight parameters calculated through one or more training techniques, such as but not limited to those described herein.
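The following is a minimal PyTorch sketch of the two phases described above: weight parameters are first calculated according to a neural network architecture (training), and the calculated weights are then reused to infer on new inputs. The toy architecture, random data, optimizer choice, and iteration count are assumptions for illustration only; they do not represent the disclosed generative machine-learning models 104 and 200.

```python
import torch
import torch.nn as nn

# Toy architecture and data, used only to illustrate the two phases.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Training: weight parameters are calculated by repeatedly minimizing a loss.
inputs, targets = torch.randn(8, 16), torch.randn(8, 1)
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

# Inference: the calculated weight parameters are reused to predict on new inputs.
model.eval()
with torch.no_grad():
    prediction = model(torch.randn(1, 16))
```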
In at least one embodiment, the data center 1300 may use CPUs, application-specific integrated circuits (ASICs), GPUs, FPGAs, and/or other hardware (or virtual compute resources corresponding thereto) to perform training and/or inferencing using above-described resources. Moreover, one or more software and/or hardware resources described above may be configured as a service to allow users to update/train or perform inferencing of information, such as image recognition, speech recognition, or other artificial intelligence services.
Example Network Environments
Network environments suitable for use in implementing embodiments of the disclosure may include one or more client devices, servers, network attached storage (NAS), other backend devices, and/or other device types. The client devices, servers, and/or other device types (e.g., each device) may be implemented on one or more instances of the computing device(s) 1200 of
Components of a network environment may communicate with each other via a network(s), which may be wired, wireless, or both. The network may include multiple networks, or a network of networks. By way of example, the network may include one or more Wide Area Networks (WANs), one or more Local Area Networks (LANs), one or more public networks such as the Internet and/or a public switched telephone network (PSTN), and/or one or more private networks. Where the network includes a wireless telecommunications network, components such as a base station, a communications tower, or even access points (as well as other components) may provide wireless connectivity.
Compatible network environments may include one or more peer-to-peer network environments—in which case a server may not be included in a network environment—and one or more client-server network environments—in which case one or more servers may be included in a network environment. In peer-to-peer network environments, functionality described herein with respect to a server(s) may be implemented on any number of client devices.
In at least one embodiment, a network environment may include one or more cloud-based network environments, a distributed computing environment, a combination thereof, etc. A cloud-based network environment may include a framework layer, a job scheduler, a resource manager, and a distributed file system implemented on one or more of servers, which may include one or more core network servers and/or edge servers. A framework layer may include a framework to support software of a software layer and/or one or more application(s) of an application layer. The software or application(s) may respectively include web-based service software or applications. In embodiments, one or more of the client devices may use the web-based service software or applications (e.g., by accessing the service software and/or applications via one or more application programming interfaces (APIs)). The framework layer may be, but is not limited to, a type of free and open-source software web application framework, such as one that may use a distributed file system for large-scale data processing (e.g., "big data").
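As a hedged sketch of a client device accessing web-based service software via an API, the snippet below issues an HTTP request using Python's standard library. The endpoint URL, the JSON payload fields, and the response handling are hypothetical placeholders; no particular API is defined by the disclosure.

```python
import json
import urllib.request

# Hypothetical endpoint and payload; neither is defined by the disclosure.
ENDPOINT = "https://example.com/api/v1/generate"
payload = json.dumps({"noise_seed": 42, "num_domains": 3}).encode("utf-8")

request = urllib.request.Request(
    ENDPOINT,
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# response = urllib.request.urlopen(request)  # would return the service's JSON response
```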
A cloud-based network environment may provide cloud computing and/or cloud storage that carries out any combination of computing and/or data storage functions described herein (or one or more portions thereof). Any of these various functions may be distributed over multiple locations from central or core servers (e.g., of one or more data centers that may be distributed across a state, a region, a country, the globe, etc.). If a connection to a user (e.g., a client device) is relatively close to an edge server(s), a core server(s) may designate at least a portion of the functionality to the edge server(s). A cloud-based network environment may be private (e.g., limited to a single organization), may be public (e.g., available to many organizations), and/or a combination thereof (e.g., a hybrid cloud environment).
The client device(s) may include at least some of the components, features, and functionality of the example computing device(s) 1200 described herein with respect to
The disclosure may be described in the general context of computer code or machine-useable instructions, including computer-executable instructions such as program modules, being executed by a computer or other machine, such as a personal data assistant or other handheld device. Generally, program modules, including routines, programs, objects, components, data structures, etc., refer to code that performs particular tasks or implements particular abstract data types. The disclosure may be practiced in a variety of system configurations, including hand-held devices, consumer electronics, general-purpose computers, more specialized computing devices, etc. The disclosure may also be practiced in distributed computing environments where tasks are performed by remote-processing devices that are linked through a communications network.
As used herein, a recitation of “and/or” with respect to two or more elements should be interpreted to mean only one element, or a combination of elements. For example, “element A, element B, and/or element C” may include only element A, only element B, only element C, element A and element B, element A and element C, element B and element C, or elements A, B, and C. In addition, “at least one of element A or element B” may include at least one of element A, at least one of element B, or at least one of element A and at least one of element B. Further, “at least one of element A and element B” may include at least one of element A, at least one of element B, or at least one of element A and at least one of element B.
The subject matter of the present disclosure is described with specificity herein to meet statutory requirements. However, the description itself is not intended to limit the scope of this disclosure. Rather, the inventors have contemplated that the claimed subject matter might also be embodied in other ways to include different steps or combinations of steps similar to the ones described in this document, in conjunction with other present or future technologies. Moreover, although the terms “step” and/or “block” may be used herein to connote different elements of methods employed, the terms should not be interpreted as implying any particular order among or between various steps herein disclosed unless and except when the order of individual steps is explicitly described.
This application claims priority to U.S. Provisional Application No. 63/344,011, filed May 19, 2022, which is incorporated herein by reference in its entirety.