The present application claims the benefit under 35 U.S.C. § 119 of European Patent Application No. EP 23 19 8631.6 filed on Sep. 20, 2023, which is expressly incorporated herein by reference in its entirety.
The present invention relates to a device and computer-implemented methods for machine learning, for providing a machine learning system, or for operating a technical system.
A diffusion model may be used in machine learning for generating synthetic digital images, e.g., for training, testing, verifying and/or validating the machine learning system depending on the synthetic digital image.
The diffusion model may be trained unconditionally or conditionally.
A device and the computer-implemented methods according to example embodiments of the present invention provide machine learning with a diffusion model and a discriminator for providing guidance, wherein learning parameters of the diffusion model and/or the discriminator are learned depending on the guidance provided by the discriminator in the unconditional or conditional case. The diffusion model may be a neural network. The parameters of the diffusion model may be weights of the neural network that implements the diffusion model. The discriminator may be a neural network. The parameters of the discriminator may be weights of the neural network that implements the discriminator.
According to an example embodiment of the present invention, a computer-implemented method for machine learning, in particular for generating training and/or test data for training a machine learning system or for training an image classifier, comprises providing a latent variable of a diffusion model representing the synthetic digital image, wherein providing the latent variable comprises sampling the latent variable from random noise, or providing a digital reference image, and adding noise with the diffusion model to the digital reference image to determine the latent variable, or mapping the digital reference image with an encoder to an input of the diffusion model, and adding noise with the diffusion model to the input to determine the latent variable, wherein the method further comprises mapping the latent variable with the diffusion model depending on parameters of the diffusion model to features of the diffusion model, mapping the features with the diffusion model depending on parameters of the diffusion model to an output of the diffusion model, wherein the synthetic digital image comprises the output, or wherein the method comprises mapping the output with a decoder to the synthetic digital image, mapping the features or the synthetic digital image with a discriminator to a class, in particular either a class of a predetermined set of classes for a real digital image or a class for a synthetic digital image, depending on parameters of the discriminator, and learning at least one parameter of the diffusion model and/or the discriminator depending on a loss for the discriminator that is defined depending on the class and a predetermined reference class. The discriminator may be semantic segmentation network based, i.e., segmenter based. The discriminator provides the guidance for learning.
The synthetic digital image comprises pixels. According to an example embodiment of the present invention, the method may comprise determining the class pixel-wise, and learning the at least one parameter depending on a loss that is defined depending on the pixel-wise classes and respective predetermined pixel-wise reference classes. This means the discriminator provides the guidance pixel-wise, in particular for the unconditional diffusion model, to improve local details.
According to an example embodiment of the present invention, the method may comprise determining a sequence of outputs comprising the output, wherein determining the sequence of outputs comprises determining a first output of the sequence depending on a result of encoding the input and decoding the encoded input, and successively determining a next output in the sequence depending on the result of encoding the output preceding the next output in the sequence and decoding the encoded output preceding the next output in the sequence, wherein encoding or decoding comprises determining the features. This means that the outputs and the features are determined in stages. The discriminator may use features of the encoder or decoder at different stages.
According to an example embodiment of the present invention, the method may comprise determining, with a respective discriminator, the class for a plurality of outputs of the sequence depending on features collected from the respective encoding or decoding, and learning the parameters depending on the losses that are defined for the discriminators depending on the class predicted by the respective discriminator and the predetermined reference class. A discriminator per stage in the plurality of stages may be used for providing the guidance. Instead of training the diffusion model to predict noise at a single, randomly sampled stage during training, a multistep sampling at the plurality of stages takes into account several stages. The stages may be a number of successive stages, i.e., stages within a horizon defined by the number of stages.
According to an example embodiment of the present invention, the method may comprise predicting a noise in a next output of the sequence depending on the output preceding the next output in the sequence of outputs, and learning the parameters depending on a loss for noise and depending on the losses that are defined for the discriminators, wherein the loss for noise is defined depending on the predicted noise and a predetermined random noise, in particular random noise sampled from a Gaussian distribution. This reduces the memory consumption compared to using the loss for noise at several stages.
According to an example embodiment of the present invention, the method may comprise providing a condition, in particular a label map, modulating the features depending on the condition and mapping the modulated features with the diffusion model to the output, and/or with the discriminator to the class. The label map for example comprises a pixel-wise reference class for the pixels of the synthetic digital image. The implicit assumption is that the conditional information is beneficial for the noise prediction in the diffusion model, and thus the denoising will learn to use the condition. With the discriminator there is explicit enforcement or supervision for the usage of the condition.
According to an example embodiment of the present invention, the method may comprise learning the parameters depending on the loss that is defined for the discriminator, wherein the loss that is defined for the discriminator is defined depending on the condition, or learning the parameters depending on the losses that are defined for the discriminators, wherein the losses are defined depending on the condition. In the conditional case, additional guidance is provided that depends on the condition.
This discriminator provides in particular pixel-wise guidance and encourages the conditional alignment. The discriminator may be trained jointly with the diffusion model, and learns to predict the given label map condition on real samples and to predict fake samples as the additional "fake" class. The diffusion model learns to generate realistic images aligned with the label map as synthetic digital images to fool the discriminator. In this way, the label map condition is explicitly leveraged during training.
The condition comprises the predetermined reference class or the pixel-wise reference classes, e.g., the label map, to provide the additional guidance.
A computer-implemented method for providing the machine learning system, in particular a neural network for semantic segmentation, object recognition, or classification of objects, comprises generating a synthetic digital image with the method for machine learning, and training, testing, verifying, or validating the machine learning system for semantic segmentation, or object recognition, or classification of objects, preferably for recognizing background or an object, in particular a traffic sign, a road surface, a pedestrian, a vehicle, an animal, a plant, infrastructure, or the sky, depending on the synthetic digital image.
Training the machine learning system may comprise providing the condition, in particular the label map that is provided for generating the synthetic digital image as ground truth for training the machine learning system.
A computer-implemented method for operating a technical system, in particular a computer-controlled machine, preferably a robotic system, a vehicle, a domestic appliance, a power tool, a manufacturing machine, a personal assistant or an access control system, comprises providing the machine learning system with the method for providing the machine learning system, capturing a digital image with a sensor, in particular a video, radar, LiDAR, ultrasonic, motion, or thermal image, and determining a control signal for operating the technical system with the machine learning system depending on the digital image.
A device may comprise at least one processor and at least one memory, wherein the at least one processor is configured to execute instructions that, when executed by the at least one processor, cause the device to execute the method for machine learning, for providing the machine learning system or operating the technical system, wherein the at least one memory is configured to store the instructions.
A computer program, characterized in that the computer program comprises computer-readable instructions that, when executed by a computer, cause the computer to execute the method according to the present invention disclosed herein.
Further advantageous embodiments of the present invention are derived from the following description and the figures.
The device 100 comprises at least one processor 102 and at least one memory 104. The at least one memory 104 comprises at least one non-transitory memory. The at least one memory 104 is configured to store instructions that are executable by the at least one processor 102 and that cause the device 100 to perform a method for generating a synthetic digital image, when executed by the at least one processor 102.
The synthetic digital image may be a video, radar, LiDAR, ultrasonic, motion, or thermal image.
The device 100 may be configured to read a digital reference image from storage. The digital reference image may be a video, radar, LiDAR, ultrasonic, motion, or thermal image.
The device 100 may comprise at least one interface 106. The at least one interface 106 may be configured for receiving the digital reference image, e.g., from a sensor 108. The at least one interface 106 may be configured to output a control signal for an actuator 110.
The sensor 108 may be configured to capture a video, radar, LiDAR, ultrasonic, motion, or thermal image.
The sensor 108 or the actuator 110 may be part of a technical system 112. The technical system 112 is a physical system. The technical system 112 may be a computer-controlled machine, preferably a robotic system, a vehicle, a domestic appliance, a power tool, a manufacturing machine, a personal assistant or an access control system.
The device 100 may comprise a machine learning system that is configured to operate the technical system 112 depending on input from the sensor 108 and with output of the control signal to the actuator 110.
For example, the machine learning system is configured to determine the control signal for operating the technical system 112 depending on a digital image.
The digital image may be a video, radar, LiDAR, ultrasonic, motion, or thermal image.
Determining the control signal depending on the digital image for example comprises mapping the digital image with a neural network for semantic segmentation, object recognition or classification of objects.
Determining the control signal depending on the digital image for example comprises determining the control signal depending on an output of the neural network, e.g., the semantic segmentation, object recognition, or classification of objects.
For example, a background is distinguished from objects, or recognized, or classified as background in the digital image, or an object is distinguished from background, or recognized, or classified, e.g., as a traffic sign, a road surface, a pedestrian, a vehicle, an animal, a plant, infrastructure, or the sky, depending on the digital image.
For example, the control signal is determined to move the technical system 112 upon recognizing or classifying the object as a predetermined object indicating that moving the technical system 112 is acceptable, or to stop the technical system 112 otherwise or upon recognizing or classifying the object as a predetermined object indicating that stopping the technical system 112 is required.
Classifying the object may comprise segmentation and/or object detection.
The diffusion model 202 is configured to receive a digital image x, e.g., a digital reference image 204, in an image space, and to determine an output {tilde over (x)}, e.g., a synthetic digital image 208, in the image space. The digital image x and/or the digital reference image 204 and/or the synthetic digital image 208 may be a video, radar, LiDAR, ultrasonic, motion, or thermal image. Optionally, the diffusion model 202 is configured to receive a condition y, e.g., a digital reference condition 206, and to determine the output {tilde over (x)} depending on the condition y and the digital image x.
The diffusion model 202 comprises an encoder 210 that is configured to encode the input x, e.g., the digital reference image 204, into a representation Z0, e.g., of the digital reference image 204, in a latent space of the diffusion model.
The diffusion model 202 comprises a decoder 214 that is configured to decode a representation {tilde over (Z)}0, e.g., of the synthetic digital image 208, in the latent space to the output {tilde over (x)}, e.g., the synthetic digital image 208, in the image space.
The image space in the example is a pixel space, e.g., for an RGB image comprising a width W, a height H and 3 channels. The image space may comprise more or fewer channels.
The diffusion model 202 is configured to successively add in a forward pass 216 random noise to the representation Z0 in the latent space. In the example, the diffusion model 202 according to the first embodiment is configured to determine a sequence Z1, . . . , ZT of representations, wherein after the first representation Z0 a next representation Zt+1 in the sequence is determined depending on the result of adding random noise to the representation Zt preceding the next representation Zt+1 in the sequence of representations.
The diffusion model 202 is configured to successively remove in a denoising process 218 random noise from the last representation ZT in the latent space. In the example, the exemplary diffusion model 202 is configured to determine a sequence {tilde over (Z)}T−1, . . . , {tilde over (Z)}0 of outputs depending on the last representation ZT as input to the denoising process. A first output {tilde over (Z)}t=T−1 in the sequence is determined in the example by denoising the input ZT. After the first output {tilde over (Z)}t=T−1 a next output {tilde over (Z)}t−1 in the sequence is determined depending on the result of denoising the output {tilde over (Z)}t preceding the next output {tilde over (Z)}t−1 in the sequence of outputs.
The diffusion model 202 is configured to determine the first output {tilde over (Z)}t=T−1 depending on a result of encoding the input ZT with an encoder 220 and decoding the encoded input with a decoder 222.
The diffusion model 202 is configured to determine the respective next output {tilde over (Z)}t−1 depending on a result of encoding the output {tilde over (Z)}t preceding the next output {tilde over (Z)}t−1 with an encoder and decoding the encoded output with a decoder.
A respective encoder is configured to determine features of the encoder at different levels of the encoder. A respective decoder is configured to determine features of the decoder at different levels of the decoder.
The diffusion model 202 is optionally configured to manipulate the features depending on the condition y. The diffusion model 202 is for example configured to manipulate the features of one encoder or multiple encoders or the features of one decoder or multiple decoders. The diffusion model 202 is for example configured to manipulate the features at at least one level or at different levels. The diffusion model 202 is for example configured to input the condition y to the encoder or decoder at the level or at the respective levels. According to the example, the features of a respective encoder are manipulated depending on the condition y only or depending on the condition y and the features of the same encoder. According to the example, the features of a respective decoder are manipulated depending on the condition y only or depending on the condition y and the features of the same decoder.
An example for the diffusion model is a stable diffusion model that operates in a latent space of an autoencoder.
Firstly, an encoder ε, i.e., the encoder 210, maps a given image x, e.g., the digital reference image 204, into a spatial latent code Z0=ε(x), then {tilde over (Z)}0 is mapped back to a prediction for the image {tilde over (x)}, e.g., the synthetic digital image 208, in the image space by a decoder D, i.e., the decoder 214. According to an example, the autoencoder is trained to reconstruct the given image, i.e., {tilde over (x)}=D(ε(x)).
Secondly, the forward pass 216 and the denoising process 218 of the diffusion model are trained in the latent space.
The forward pass is a Markov chain to gradually add Gaussian noise to the spatial latent code Z0, e.g., according to

q(Zt|Zt−1)=N(Zt; √(1−βt)·Zt−1, βt·I)

where {βt}, t=0, . . . , T, is a fixed variance schedule. The noisy latent code Zt at a timestep t, and in particular the final noisy latent code ZT, results from successively applying the forward step and can be written in closed form, e.g., as

Zt=√(αt)·Z0+√(1−αt)·ε

where ε˜N(0, I), where Z0=ε(x), and αt:=Πs≤t(1−βs).
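For illustration, the closed-form noising step may be sketched as follows, e.g., in Python; the linear variance schedule, the latent shape and the batch size are assumptions chosen only for this sketch and are not part of the description above.

```python
import torch

# Illustrative sketch of the closed-form forward (noising) step.
T = 1000
betas = torch.linspace(1e-4, 2e-2, T)            # fixed variance schedule {beta_t} (assumed linear)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # alpha_t := prod_{s<=t} (1 - beta_s)

def q_sample(z0, t, eps):
    """Noisy latent Z_t = sqrt(alpha_t) * Z_0 + sqrt(1 - alpha_t) * eps."""
    a = alphas_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * eps

z0 = torch.randn(8, 4, 32, 32)                   # spatial latent code of a reference image batch
t = torch.randint(0, T, (8,))                    # a sampled timestep per image
eps = torch.randn_like(z0)                       # Gaussian noise, eps ~ N(0, I)
zt = q_sample(z0, t, eps)                        # noisy latent code Z_t
```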
The reverse denoising process 218 is parametrized in the example by another Gaussian distribution, e.g.,

pθ(zt−1|zt)=N(zt−1; μθ(zt,t), σt²·I).

Essentially, μθ(zt,t) is expressed as a linear combination of zt and the predicted noise ϵθ(zt,t).
In an example, the respective encoder and the respective decoder are implemented as a UNet. UNet is described e.g., in Olaf Ronneberger, Philipp Fischer, Thomas Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” 18 May 2015, https://arxiv.org/abs/1505.04597.
The parameters of the encoder and decoder are learned in the example by minimizing the L2 norm of the noise prediction error at a sampled timestep t, e.g.,

Lnoise=E[∥ε−ϵθ(zt,t)∥²]

In case the optional condition y is used, in the training, or after an initial training of the diffusion model, the modulation of the features depending on the condition y may be trained with a loss, e.g., E[∥ε−ϵθ(zt,t,y)∥²], that depends on the condition y.
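A minimal sketch of this noise prediction loss is given below; the `denoiser` call signature stands in for whatever UNet or transformer implements ϵθ and is an assumption made only for illustration.

```python
import torch
import torch.nn.functional as F

def noise_loss(denoiser, z0, t, alphas_bar, y=None):
    """Sketch of the noise prediction loss at a sampled timestep t.
    `denoiser` is a placeholder for the epsilon-prediction network; its
    signature (zt, t[, y]) is an assumption for this sketch."""
    eps = torch.randn_like(z0)                          # random noise eps ~ N(0, I)
    a = alphas_bar[t].view(-1, 1, 1, 1)
    zt = a.sqrt() * z0 + (1.0 - a).sqrt() * eps         # forward noising of the latent code
    eps_pred = denoiser(zt, t, y) if y is not None else denoiser(zt, t)
    return F.mse_loss(eps_pred, eps)                    # L2 norm of the noise prediction error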
The diffusion model 202 may be configured to control the features of at least one of the decoders. The diffusion model 202 may comprise a trainable copy of the encoder that precedes the decoder that is controlled. The diffusion model 202 may be configured to provide the condition y to a convolution, in particular a zero convolution, and to provide a result of adding the convolved condition to the preceding output zt as input to the trainable copy of the encoder. The diffusion model 202 may be configured to provide the output of the trainable copy of the encoder to a convolution, in particular a zero convolution. The diffusion model 202 may be configured to manipulate the features of the decoder at one level depending on the output of the convolution. The diffusion model 202 may be configured to propagate the output of the trainable copy of the encoder through a series of convolutions, in particular zero convolutions and manipulate the features of the decoder at different levels depending on the output of a respective convolution of the series. Propagate means for example, that the output of a convolution is used as input for a next convolution in the series.
Zero convolution in this context refers to a convolution that comprises parameters, weight and bias, that are initialized as zeros before training the diffusion model and that are learned in training. In the example, the zero convolution is a 1×1 convolution. Trainable copy of the encoder refers to an encoder that comprises parameters that are initialized before training of the trainable copy with the values of the parameters of the encoder. In training of the trainable copy, the parameters of the encoder are frozen, i.e., remain unchanged, and the parameters of the trainable copy are learned.
The output of the convolutions and the trainable copy of the encoder is for example integrated in the diffusion model 202 as ControlNet, wherein the features of the trainable copy of the encoder are manipulated.
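The zero-convolution idea and the trainable copy of the encoder described above may be sketched as follows; the channel sizes, the way the control features are returned for the decoder, and the class names are assumptions for illustration, not the ControlNet implementation itself.

```python
import torch
import torch.nn as nn

class ZeroConv2d(nn.Conv2d):
    """1x1 convolution whose weight and bias are initialized to zero, so the control
    branch initially contributes nothing; its parameters are learned in training."""
    def __init__(self, in_ch, out_ch):
        super().__init__(in_ch, out_ch, kernel_size=1)
        nn.init.zeros_(self.weight)
        nn.init.zeros_(self.bias)

class ControlBranch(nn.Module):
    """Sketch: a trainable copy of the encoder is fed with the noisy latent plus the
    convolved condition; its output is zero-convolved and injected into the decoder."""
    def __init__(self, encoder_copy: nn.Module, cond_ch: int, latent_ch: int, feat_ch: int):
        super().__init__()
        self.encoder_copy = encoder_copy          # initialized with the frozen encoder's weights
        self.cond_in = ZeroConv2d(cond_ch, latent_ch)
        self.feat_out = ZeroConv2d(feat_ch, feat_ch)

    def forward(self, zt, y):
        h = zt + self.cond_in(y)                  # add convolved condition y to the latent z_t
        f = self.encoder_copy(h)                  # features of the trainable copy
        return self.feat_out(f)                   # zero-convolved control features for the decoder
```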
The diffusion model 202 may be configured to modulate the features of the trainable copy in the same way as described for the encoder of the diffusion model 202.
According to an example, the layers for modulating the features, e.g., the SPADE blocks, are inserted at predetermined locations of the trainable copy, e.g., UNet, and modulate the features of the trainable copy as described for the modulation of the features in the diffusion model 202.
The features f may be manipulated depending on the condition y in different ways.
According to a first exemplary modulation, a SPADE block takes the conditional input y and extracted features f from a level of the frozen diffusion network, and outputs spatially modulated features fadp, e.g., as

fadp=γ(y)·(f−μf)/γf+μ(y)

where μf is the mean and γf is the standard deviation of the extracted features f, and where μ(y) and γ(y) are learned pixel-wise modulation parameters conditioned on the condition y. The SPADE block is for example configured as described in Taesung Park, Ming-Yu Liu, Ting-Chun Wang, Jun-Yan Zhu, "Semantic Image Synthesis with Spatially-Adaptive Normalization," 5 Nov. 2019, https://arxiv.org/abs/1903.07291.
According to an example, the modulated features fadp and the extracted features f are added to an output fout that replaces the extracted features f at the original level, i.e., f←fout=fadp+f.
The modulated features fadp may be inserted to replace the extracted features f at the original level, i.e., f←fout=fadp, or inserted at another level, e.g., a level that is after the original level in direction of the denoising process 218. The diffusion model 202 may be configured to replace features f at a level with features fout=f+tanh(β)fadp, where β is a learnable factor initialized, e.g., as zero. This means that the modulated features fadp are added to f in a learnable way.
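The first exemplary modulation, combined with the learnable insertion fout=f+tanh(β)·fadp, may be sketched as follows; the hidden width, the convolutional layers producing γ(y) and μ(y), and the nearest-neighbor resizing of the condition are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpadeModulation(nn.Module):
    """Sketch of f_adp = gamma(y) * (f - mu_f) / sigma_f + mu(y), inserted with the
    learnable residual f_out = f + tanh(beta) * f_adp described above."""
    def __init__(self, feat_ch: int, cond_ch: int, hidden_ch: int = 128):
        super().__init__()
        self.shared = nn.Sequential(nn.Conv2d(cond_ch, hidden_ch, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden_ch, feat_ch, 3, padding=1)   # pixel-wise gamma(y)
        self.mu = nn.Conv2d(hidden_ch, feat_ch, 3, padding=1)      # pixel-wise mu(y)
        self.beta = nn.Parameter(torch.zeros(1))                   # learnable gate, initialized as zero

    def forward(self, f, y):
        y = F.interpolate(y, size=f.shape[-2:], mode="nearest")    # align condition to the feature map
        f_norm = (f - f.mean(dim=(2, 3), keepdim=True)) / (f.std(dim=(2, 3), keepdim=True) + 1e-5)
        h = self.shared(y)
        f_adp = self.gamma(h) * f_norm + self.mu(h)                # spatially modulated features
        return f + torch.tanh(self.beta) * f_adp                   # learnable insertion of f_adp
```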
The diffusion model 202 may comprise a sequence of a self-attention layer, a cross attention layer and a feed forward layer. According to a second exemplary modulation the features f are extracted from an output of the cross-attention layer of the diffusion model that follows after the self-attention layer of the diffusion model in direction of the denoising process 218, and provided as input to the SPADE block and as input to the feed forward layer of the diffusion model. The features are modulated depending on an output of the SPADE block as described for the first exemplary modulation.
The diffusion model 202 may comprise one or several of the parts for modulating features.
For example, SPADE blocks are inserted in parts of the diffusion model according to the first exemplary modulation or the second exemplary modulation at one predetermined location or at multiple predetermined locations of UNet or any other architecture of the diffusion model.
At inference time, the input zT for the denoising process 218 is randomly sampled from the Gaussian distribution and the condition y is provided. Then the trained denoising encoder and decoder sets, e.g., the UNet, are sequentially applied to obtain the denoised latent zt−1 given zt from t=T to t=1, while the features are manipulated depending on the condition y. The final synthesized image can be obtained by feeding the clean latent space code {tilde over (Z)}0 through the decoder D, i.e., {tilde over (x)}=D({tilde over (Z)}0).
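The inference procedure may be sketched as follows; the denoiser interface, the DDPM-style update rule and the choice of σt=√βt are assumptions made for this sketch, not the only possible sampler.

```python
import torch

@torch.no_grad()
def sample(denoiser, decoder, y, betas, shape=(1, 4, 32, 32)):
    """Sketch of the denoising process 218 at inference time: start from Gaussian noise z_T,
    iteratively denoise from t = T to t = 1 under the condition y, then decode the clean latent."""
    T = len(betas)
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)                                      # z_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_pred = denoiser(z, torch.full((shape[0],), t), y)   # features modulated by y inside
        mean = (z - betas[t] / (1 - alphas_bar[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        noise = torch.randn_like(z) if t > 0 else torch.zeros_like(z)
        z = mean + betas[t].sqrt() * noise                      # denoised latent z_{t-1}
    return decoder(z)                                           # x_tilde = D(z_0)
```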
The diffusion model 202 comprises trainable parameters.
A discriminator or discriminators may be used for training the diffusion model 202. The training may use explicit supervision of the condition y.
The first denoising stage 302 is configured to output features f of the encoder 304 and/or the decoder 306 of the first denoising stage 302 to the feature-based discriminator 300.
The first denoising stage 302 is configured to manipulate the features f in the encoder 304 and/or the decoder 306 of the first denoising stage 302 as described above and to output the manipulated features f to the feature-based discriminator 300.
The feature-based discriminator 300 is configured to map the features f that are provided to the feature-based discriminator 300 by the first denoising stage 302 to a class c, in particular either a class of a predetermined set of N classes for a real digital image or a class for a synthetic digital image. In the example, the feature-based discriminator 300 is configured to determine the class pixel-wise.
The first denoising stage 302 may be configured to output an attention map depending on the output {tilde over (Z)}t. The feature-based discriminator 300 may be configured to map the attention map received from denoising stage 302 and the features f to the class c or the pixel-wise class c.
Using the condition y is optional. In an unconditional case, i.e., without input of the condition y to the first denoising stage 302, the feature-based discriminator 300 may be trained to minimize the loss
wherein ypred is the prediction, Dis(ft) represents the mapping function of the feature-based discriminator 300, wherein i,j indicates a spatial location, the prediction ypred=0 represents a fake class and the prediction ypred=1 represents a real class, and Lseg represents the per-pixel binary cross-entropy loss with the ground truth label as 1 everywhere.
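The unconditional per-pixel loss described above may be sketched as follows; the discriminator interface (one logit per spatial location) and the use of binary cross-entropy with logits are assumptions, and the boolean flag generalizes the ground-truth label of 1 used for real samples.

```python
import torch
import torch.nn.functional as F

def unconditional_seg_loss(disc, features, real: bool):
    """Sketch of L_seg in the unconditional case: the discriminator maps features f_t to a
    per-pixel real/fake prediction y_pred over spatial locations (i, j); the ground-truth
    label is 1 everywhere for real features and 0 everywhere for synthetic ones."""
    logits = disc(features)                      # assumed shape (B, 1, H, W), one logit per pixel
    target = torch.ones_like(logits) if real else torch.zeros_like(logits)
    return F.binary_cross_entropy_with_logits(logits, target)
```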
A loss for the conditional case, i.e., wherein the diffusion model 202 uses the condition y, e.g., an input map, is for example:
where CE represents the cross-entropy loss, wherein the prediction ypred includes N real semantic classes (1, . . . , N) and 1 fake class (N+1). Since the class frequency is usually imbalanced, the example comprises a class balancing weight αc, which for example is the inverse pixel-wise frequency of a class c per batch, i.e., depends on the height H, the width W and the expectation of the indicator I(yi,j,c=1) of a pixel i,j being labeled with the real class c. During training, e.g., due to memory constraints, a limited number of images, e.g., 8 images, is sampled as a batch per optimization step, i.e., one training batch then contains 8 images.
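A minimal sketch of this conditional, pixel-wise (N+1)-class loss is given below; the discriminator interface, the label-map dtype, and the exact form of the inverse-frequency class balancing weight αc are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def conditional_seg_loss(disc, features, label_map, real: bool, num_classes: int):
    """Sketch: the discriminator predicts one of N real semantic classes or the additional
    fake class (index num_classes here) per pixel. Real features are supervised with the
    given label map condition y; synthetic ones with the fake class everywhere."""
    logits = disc(features)                                   # assumed shape (B, N+1, H, W)
    if real:
        target = label_map                                    # per-pixel class indices (long tensor)
    else:
        target = torch.full_like(label_map, num_classes)      # every pixel labeled as the fake class
    # illustrative inverse pixel-wise class frequency per batch as balancing weight alpha_c
    counts = torch.bincount(target.flatten(), minlength=num_classes + 1).float()
    weights = target.numel() / (counts.clamp(min=1.0) * (num_classes + 1))
    return F.cross_entropy(logits, target, weight=weights)
```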
The second denoising stage 402 is configured to manipulate the features f in the encoder 404 and/or the decoder 406 of the second denoising stage 402 as described above.
The image-based discriminator 400 is configured to map the next output {tilde over (Z)}t−1 that is provided to the image-based discriminator 400 by the second denoising stage 402 to a class c, in particular either a class of a predetermined set of N classes for a real digital image or a class for a synthetic digital image. In the example, the image-based discriminator 400 is configured to determine the class pixel-wise.
Using the condition y is optional. In an unconditional case, i.e., without input of the condition to the second denoising stage 402, the image-based discriminator 400 may be trained to minimize the loss
wherein Dis({tilde over (x)}t) represents the mapping function of the image-based discriminator 400, wherein i,j indicates a spatial location, the prediction ypred=0 represents a fake class and the prediction ypred=1 represents a real class, and Lseg represents the per-pixel binary cross-entropy loss with the ground truth label as 1 everywhere.
The diffusion model 202 may comprise multiple first denoising stages 302 or multiple second denoising stages 402. The diffusion model 202 may comprise a first denoising stage 302 or multiple first denoising stages 302 and a second denoising stage 402 or multiple second denoising stages 402.
For example, the diffusion model 202 is trained depending on a respective loss Lseg for the output {tilde over (Z)}t and for the sequence of outputs {tilde over (Z)}t−1, {tilde over (Z)}t−2, . . . within a horizon, e.g., with a total loss

L=Lnoise+λ·Σi Lseg,i

wherein λ is an optional weighting factor and the sum runs over the outputs within the horizon. To avoid memory explosion, only gradients of the first step t, i.e., Lnoise, may be used, while the discriminator loss Lseg may be determined at the following steps t−1, . . . , t−1−th.
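The multistep training strategy may be sketched as follows; the single-step denoising helper, the discriminator-loss helper and the horizon handling are placeholders assumed only for this sketch.

```python
def multistep_loss(denoise_step, disc_loss, z_t, t, y, horizon, lam=1.0):
    """Sketch of the combined objective L = L_noise + lambda * sum_i L_seg,i:
    the noise loss is evaluated only at the first step t, while the discriminator
    loss L_seg is accumulated over the following steps within the horizon."""
    l_noise, z = denoise_step(z_t, t, y)          # first step: noise loss and denoised latent
    l_seg = disc_loss(z, y)                       # discriminator guidance on the first output
    for k in range(1, horizon):
        _, z = denoise_step(z, t - k, y)          # further denoising steps, no additional noise loss
        l_seg = l_seg + disc_loss(z, y)           # discriminator guidance at the following steps
    return l_noise + lam * l_seg
```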
The layer 602 may be configured to map the condition y, e.g., the label map, pixel-wise to determine a modified noise prediction, e.g., as

{tilde over (ϵ)}θ=γ(y)·(ϵθ−μϵ)/γϵ+μ(y)

where μϵ is the mean and γϵ is the standard deviation of the original noise prediction ϵθ, and γ(y) and μ(y) are the learned modulation parameters conditioned on the input condition y.
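Analogously to the feature modulation, the layer 602 may be sketched as follows; the condition encoder and channel sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseModulation(nn.Module):
    """Sketch: pixel-wise modulation of the predicted noise,
    eps_mod = gamma(y) * (eps - mu_eps) / sigma_eps + mu(y)."""
    def __init__(self, noise_ch: int, cond_ch: int, hidden_ch: int = 64):
        super().__init__()
        self.shared = nn.Sequential(nn.Conv2d(cond_ch, hidden_ch, 3, padding=1), nn.ReLU())
        self.gamma = nn.Conv2d(hidden_ch, noise_ch, 3, padding=1)  # learned gamma(y)
        self.mu = nn.Conv2d(hidden_ch, noise_ch, 3, padding=1)     # learned mu(y)

    def forward(self, eps_pred, y):
        y = F.interpolate(y, size=eps_pred.shape[-2:], mode="nearest")
        mean = eps_pred.mean(dim=(2, 3), keepdim=True)             # mu_eps
        std = eps_pred.std(dim=(2, 3), keepdim=True) + 1e-5        # gamma_eps
        h = self.shared(y)
        return self.gamma(h) * (eps_pred - mean) / std + self.mu(h)
```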
A corresponding training strategy for the diffusion model 202 improves the synthesis quality and is particularly useful for the conditional diffusion model 202, i.e., the diffusion model 202 using the condition y to enhance the conditional alignment.
The discriminator provides per-pixel guidance without using the condition or per-pixel conditional guidance when using the condition y. The multistep sampling of losses in the horizon during training further improves the training. Especially for the conditional case, i.e., when using the condition y, the inter-step conditioning strengthens the conditional effect.
The method comprises a step 702.
The step 702 comprises providing a latent variable ZT of the diffusion model 202 representing the synthetic digital image {tilde over (x)}.
The latent variable ZT is sampled from random noise.
The inference may use the condition y or not.
In the conditional case, i.e., when the condition y is used, the step 702 comprises providing the condition y.
The condition y is for example the label map.
The method comprises a step 704.
The step 704 comprises mapping the latent variable ZT with the diffusion model 202 depending on parameters of the diffusion model 202 to features f of the diffusion model 202.
The step 704 may comprise determining the sequence of outputs ZT−1, . . . Z0 comprising the output Z0.
Determining the sequence of outputs {tilde over (Z)}T−1, . . . {tilde over (Z)}0 may comprise determining a first output {tilde over (Z)}T−1 of the sequence depending on a result of encoding the input ZT and decoding the encoded input, and successively determining a next output {tilde over (Z)}T−2, . . . {tilde over (Z)}0 in the sequence depending on the result of encoding the output {tilde over (Z)}T−1, . . . {tilde over (Z)}1 preceding the next output {tilde over (Z)}T−2, . . . {tilde over (Z)}0 in the sequence and decoding the encoded output preceding the next output in the sequence.
Encoding or decoding may comprise determining the features f.
The step 704 comprises predicting a noise ϵi in a next output {tilde over (Z)}i of the sequence {tilde over (Z)}T−2, . . . {tilde over (Z)}0 depending on the output {tilde over (Z)}T−1, . . . {tilde over (Z)}1 preceding the next output {tilde over (Z)}i in the sequence of outputs {tilde over (Z)}T−1, . . . {tilde over (Z)}0.
The method comprises a step 706.
The step 706 comprises mapping the features f with the diffusion model 202 depending on parameters of the diffusion model 202 to an output Z0 of the diffusion model 202.
The step 706 may comprise modulating the features f depending on the condition y as described above and mapping the modulated features f with the diffusion model 202 to the output Z0.
The output Z0 may be in image space, i.e., pixel space, or in latent space.
The method may comprise a step 708 for mapping the output Z0 in latent space to the synthetic digital image {tilde over (x)}.
The synthetic digital image {tilde over (x)} may comprise the output Z0 in image space. This means, no decoder 214 is used.
For training, validation, or verification, the method may comprise a step 710.
The step 710 may comprise mapping the features f with the feature-based discriminator 300 to a class c.
The step 710 may comprise mapping the synthetic digital image {tilde over (x)} with the image-based discriminator 400 to the class c.
The class c is either a class of the predetermined set of N classes for the real digital image or the class for the fake digital image, i.e., recognition of a synthetic digital image {tilde over (x)}.
The method may comprise determining the class pixel-wise.
The step 710 may comprise determining the class c for a plurality of outputs of the sequence ZT−1, . . . Z0 with a respective discriminator depending on features fT−1, . . . f0 collected from the respective encoding or decoding.
The method comprises a step 712.
The step 712 comprises learning at least one parameter of the diffusion model 202 and/or of the feature-based discriminator 300 and/or of the image-based discriminator 400 depending on the loss LSeg.
For the unconditional case, in the loss LSeg the prediction ypred=0 represents a fake class and the prediction ypred=1 represents a real class. For the conditional case, in the loss LSeg one prediction represents a fake class and N predictions represent real classes. For example, the prediction ypred for N real semantic classes (1, . . . , N) and 1 fake class (N+1) is used as described above.
This means, the loss LSeg for the discriminator is defined depending on the class c and a predetermined reference class or the pixel-wise classes and pixel-wise reference classes.
In the conditional case, the condition y may comprise the predetermined reference class or the pixel-wise reference classes.
The step 712 may comprise learning the parameters depending on the losses LSeg,i that are defined for the discriminators depending on the class c predicted by the respective discriminator and the predetermined reference class.
The step 712 may comprise learning the parameters depending on the loss for noise Lnoise and depending on the losses LSeg that are defined for the discriminators.
The loss for noise Lnoise is defined depending on the predicted noise ϵi and a predetermined random noise, in particular random noise sampled from a Gaussian distribution.
The loss LSeg that is defined for the discriminator or the losses LSeg that are defined for the discriminators may be the loss LSeg that is defined depending on the condition y.
The reference class may be provided pixel-wise.
The method may comprise learning the at least one parameter depending on the loss that is defined depending on the pixel-wise classes and respective predetermined pixel-wise reference classes.
The steps 702 to 710 may be repeated to determine a batch of training data for learning the parameters. The step 712 may comprise learning the parameters, e.g., with a gradient descent based on the batch of training data.
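The repetition over a batch and the gradient-descent update may be sketched as follows; the optimizer choice, the data loader interface and the `compute_loss` helper (standing in for the combined Lnoise/LSeg objective sketched above) are assumptions for illustration.

```python
import torch

def train(diffusion_model, discriminator, data_loader, compute_loss, steps, lr=1e-4):
    """Sketch of repeating steps 702 to 710 to form batches and learning the parameters
    of the diffusion model and/or the discriminator in step 712 by gradient descent."""
    params = list(diffusion_model.parameters()) + list(discriminator.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _, (images, label_maps) in zip(range(steps), data_loader):
        loss = compute_loss(diffusion_model, discriminator, images, label_maps)
        optimizer.zero_grad()
        loss.backward()                            # gradients of the combined loss
        optimizer.step()                           # gradient-descent-based parameter update
```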
The method according to the second embodiment comprises a step 802.
The step 802 comprises providing a digital reference image 204.
The method according to the second embodiment comprises a step 804.
The step 804 comprises adding noise with the diffusion model 202 to the input x, i.e., the digital reference image 204 to determine the latent variable ZT.
The method according to the second embodiment comprises a step 806.
The step 806 comprises mapping the latent variable ZT as described in step 704.
The step 806 may comprise determining the sequence of outputs ZT−1, . . . Z0 as described in step 704, wherein encoding or decoding may comprise determining the features f.
The step 806 may comprise predicting the noise ϵi as described in step 704.
According to the second embodiment, the method comprises a step 808.
The step 808 comprises mapping the features f with the diffusion model 202 as described in step 706.
The step 808 may comprise modulating the features f depending on the condition y as described in step 706.
The output Z0 may be in image space, i.e., pixel space, or in latent space.
The method according to the second embodiment may comprise a step 810 for mapping the output Z0 in latent space to the synthetic digital image {tilde over (x)}.
The synthetic digital image {tilde over (x)} may comprise the output Z0 in image space. This means, no decoder 214 is used.
For training, validation, or verification, the method according to the second embodiment may comprise a step 812.
The step 812 may comprise mapping the features f with the feature-based discriminator 300 to the class c.
The step 812 may comprise mapping the synthetic digital image {tilde over (x)} with the image-based discriminator 400 to the class c.
The method according to the second embodiment may comprise determining the class pixel-wise.
The step 812 may comprise determining the class c for a plurality of outputs of the sequence ZT−1, . . . Z0 with a respective discriminator depending on features fT−1, . . . f0 collected from the respective encoding or decoding.
The method according to the second embodiment comprises a step 814.
The step 814 comprises learning at least one parameter of the diffusion model 202 and/or of the feature-based discriminator 300 and/or of the image-based discriminator 400 as described in step 712.
The steps 802 to 812 may be repeated to determine a batch of training data for learning the parameters. The step 814 may comprise learning the parameters, e.g., with a gradient descent based on the batch of training data.
According to the third embodiment, the diffusion model comprises the encoder 210 for mapping a digital image to an input x.
According to the third embodiment, the method comprises a step 902.
The step 902 comprises providing the digital reference image 204.
According to the third embodiment, the method comprises a step 904.
The step 904 comprises mapping the digital reference image 204 with the encoder 210 to the input x of the diffusion model 202.
According to the third embodiment, the method comprises a step 906.
The step 906 comprises adding noise with the diffusion model 202 to the input x to determine the latent variable ZT.
According to the third embodiment, the method comprises a step 908.
The step 908 comprises mapping the latent variable ZT as described in step 704.
The step 908 may comprise determining the sequence of outputs ZT−1, . . . Z0 as described in step 704, wherein encoding or decoding may comprise determining the features f.
The step 908 may comprise predicting the noise ϵi as described in step 704.
According to the third embodiment, the method comprises a step 910.
The step 910 comprises mapping the features f with the diffusion model 202 as described in step 706.
The step 910 may comprise modulating the features f depending on the condition y as described in step 706.
The output Z0 may be in image space, i.e., pixel space, or in latent space.
The method according to the third embodiment may comprise a step 912 for mapping the output Z0 in latent space to the synthetic digital image {tilde over (x)}.
The synthetic digital image {tilde over (x)} may comprise the output Z0 in image space. This means, no decoder 214 is used.
For training, validation, or verification, the method according to the third embodiment may comprise a step 914.
The step 914 may comprise mapping the features f with the feature-based discriminator 300 to the class c.
The step 914 may comprise mapping the synthetic digital image {tilde over (x)} with the image-based discriminator 400 to the class c.
The method according to the third embodiment may comprise determining the class pixel-wise.
The step 914 may comprise determining the class c for a plurality of outputs of the sequence ZT−1, . . . Z0 with a respective discriminator depending on features fT−1, . . . f0 collected from the respective encoding or decoding.
The method according to the third embodiment comprises a step 916.
The step 916 comprises learning at least one parameter of the diffusion model 202 and/or of the feature-based discriminator 300 and/or of the image-based discriminator 400 as described in step 712.
The steps 902 to 914 may be repeated to determine a batch of training data for learning the parameters. The step 916 may comprise learning the parameters, e.g., with a gradient descent based on the batch of training data.
The method for generating synthetic images as described above may be used for providing a machine learning system. The machine learning system may comprise a neural network for semantic segmentation, object recognition, or classification of objects.
An exemplary computer-implemented method for providing the machine learning system comprises generating a synthetic digital image with the method for generating the synthetic digital image, and training, testing, verifying, or validating the machine learning system depending on the synthetic digital image.
Preferably a plurality of synthetic digital images are determined and the machine learning system is trained, tested, verified or validated depending on the plurality of synthetic digital images.
Depending on the purpose or type of machine learning system that shall be provided, the machine learning system is trained, tested, verified, or validated for semantic segmentation, or object recognition, or classification of objects.
Preferably, the machine learning system is trained, tested, verified, or validated for recognizing background or an object, in particular a traffic sign, a road surface, a pedestrian, a vehicle, an animal, a plant, infrastructure, or the sky, depending on the synthetic digital image.
Training, testing, verifying, or validating the machine learning system may comprise providing the condition y, in particular the label map, that is provided for generating the synthetic digital image as ground truth for training, testing, verifying, or validating the machine learning system.
The diffusion model may be a generative model or part of a generative model to generate training and/or test data for a machine learning system or training and/or test data for an image classifier.
This means that a computer-implemented method for generating the synthetic digital image {tilde over (x)} with the diffusion model for generating training and/or test data for training a machine learning system or an image classifier comprises sampling an input ZT from a probability distribution, in particular a Gaussian distribution, providing the condition y for the semantic layout of the synthetic digital image {tilde over (x)}, in particular a text, a class label, a semantic label map, a two-dimensional or a three-dimensional pose, or a depth map, and mapping the input ZT with the diffusion model to the output {tilde over (x)} or {tilde over (Z)}0 of the diffusion model, e.g., as described in step 608.
The synthetic digital image {tilde over (x)} may comprise the output {tilde over (x)} of the diffusion model, or the synthetic digital image {tilde over (x)} is generated depending on the output {tilde over (Z)}0 of the diffusion model.
The diffusion model is the generative model or part of the generative model to generate training and/or test data for a machine learning system or training and/or test data for an image classifier. The diffusion model comprises the successive levels.
Mapping the input ZT with the diffusion model to the output {tilde over (x)} or {tilde over (Z)}0 may comprise successively determining features f at the levels depending on the input ZT, inputting the condition y to the diffusion model at at least one level of the levels, modulating the features f of at least one level of the levels depending on the condition y, and determining the output {tilde over (Z)}0 of the diffusion model depending on the modulated features.
An exemplary computer-implemented method for operating the technical system 112 comprises providing the machine learning system with the method for providing the machine learning system, capturing a digital image with the sensor 108, and determining the control signal for operating the technical system 112 with the machine learning system depending on the digital image.
The methods comprise a generic training paradigm to improve the synthesis quality of diffusion models, which is particularly useful for conditional diffusion models and improves the conditional alignment. The methods can improve the training of both unconditional and conditional diffusion models and are independent of the architecture design of the diffusion models. The diffusion model may comprise a denoising model for the denoising process 218, which can be a UNet or a transformer-based model in the pixel space or in the latent space.
For instance, the methods can be applied to improve ControlNet or FreestyleNet, even though they have different designs of the denoising model with conditions. By employing the methods, the synthesized digital images are better aligned with the conditions, e.g., the label map. This is beneficial for downstream applications, since the condition to the generative models will be used as the ground-truth label during training of the downstream task. For instance, when using the synthetic data for augmenting the training of a semantic segmentation network, the label map of the generative model may be reused as the ground-truth label to train the semantic segmentation network. If the synthesized images are not aligned with the conditional inputs, the mismatch will mislead the network training because the ground-truth label is no longer correct.
Foreign application priority data: 23 19 8631.6, Sep. 2023, EP (regional).