CROSS REFERENCE
The present application claims the benefit under 35 U.S.C. § 119 of European Patent Application No. EP 22 20 1999.4 filed on Oct. 17, 2022, which is expressly incorporated herein by reference in its entirety.
The present invention relates to a method of processing digital image data.
The present invention further relates to an apparatus for processing digital image data.
Generative adversarial networks, GAN, are described in the related art and characterize an approach to generative modeling which can, e.g., be used to generate image data.
Exemplary embodiments of the present invention relate to a method, for example a computer-implemented method, of processing digital image data, comprising: determining, by an encoder configured to map a first digital image to an extended latent space associated with a generator of a generative adversarial network, GAN, system, a noise prediction associated with the first digital image, determining, by the generator of the GAN system, at least one further digital image based on the noise prediction associated with the first digital image and a plurality of latent variables associated with the extended latent space. In some exemplary embodiments, this may make it possible to determine, e.g., generate, further digital images comprising similar or identical content to the first digital image, but, optionally, with a modified style, as, e.g., characterized by at least some of the plurality of latent variables.
In some exemplary embodiments of the present invention, the digital image data and/or the (first) digital image may comprise at least one of, but is not limited to: a) at least one digital image, b) an image or frame of a video stream, c) data associated with a RADAR system, e.g. imaging RADAR system, e.g. RADAR image, d) data associated with a LIDAR system, e.g. LIDAR image, e) an ultrasonic image, f) a motion image, g) a thermal image, e.g. as obtained from a thermal imaging system.
In some exemplary embodiments of the present invention, at least some of the plurality of latent variables associated with the extended latent space characterize at least one of the following aspects of the first digital image: a) a style, e.g. a non-semantic appearance, b) a texture, c) a color. In some exemplary embodiments, a style of a digital image may be characterized by a combination of a texture of at least some parts of the digital image and a color of at least some parts of the digital image.
In some exemplary embodiments of the present invention, the method comprises determining the plurality of latent variables based on at least one of: a) a second digital image, which is different from the first digital image, e.g. using the encoder, b) a plurality of probability distributions, as may e.g. be obtained based on a data set in some exemplary embodiments.
In some exemplary embodiments of the present invention, the method comprises at least one of: a) determining a plurality of, for example hierarchical, feature maps based on the first digital image, b) determining a plurality of latent variables associated with the extended latent space for the first digital image based on the plurality of, for example hierarchical, feature maps, c) determining a, for example additive, noise map based on at least one of the plurality of, for example hierarchical, feature maps.
In some exemplary embodiments of the present invention, the method comprises: randomly and/or pseudo-randomly masking at least a portion of the noise prediction associated with the first digital image. Note that according to further exemplary embodiments, the masking is not required for modifying a style, e.g. for style augmentation, according to the principle of the embodiments.
In some exemplary embodiments of the present invention, the method comprises: masking of the noise map, e.g. in a random and/or pseudo-random fashion.
In some exemplary embodiments of the present invention, the method comprises: dividing, e.g. spatially dividing, the noise map into a plurality of, e.g. P×P many, e.g. non-overlapping, patches, selecting, in a random and/or pseudo-random fashion, a subset of the patches, replacing the subset of the patches by patches of, e.g. unit Gaussian, random variables, e.g. of the same size.
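As a non-limiting illustration of the dividing, selecting, and replacing steps above, the following Python sketch (with hypothetical function name mask_noise_map and hypothetical parameters P and rho) replaces a randomly selected fraction of the non-overlapping P×P patches of a 2D noise map with unit Gaussian random variables:

```python
import numpy as np

def mask_noise_map(noise, P=4, rho=0.5, rng=None):
    # replace a randomly selected fraction rho of the non-overlapping
    # P x P patches of a 2D noise map with unit Gaussian noise
    rng = np.random.default_rng(0) if rng is None else rng
    h, w = noise.shape
    assert h % P == 0 and w % P == 0, "noise map must tile into P x P patches"
    masked = noise.copy()
    coords = [(i, j) for i in range(0, h, P) for j in range(0, w, P)]
    n_sel = int(round(rho * len(coords)))
    for k in rng.choice(len(coords), size=n_sel, replace=False):
        i, j = coords[k]
        masked[i:i + P, j:j + P] = rng.standard_normal((P, P))
    return masked
```

The patch size P and the ratio rho of replaced patches are free design parameters in this sketch.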
In some exemplary embodiments of the present invention, the method comprises: combining the noise prediction associated with the first digital image with a style prediction of a or the second digital image, generating a further digital image using the generator based on the combined noise prediction associated with the first digital image and the style prediction of the second digital image. In some exemplary embodiments, this makes it possible to provide the further digital image with the style, or, for example, with at least some aspects of the style, of the second digital image, and, e.g., with the content of the first digital image.
In some exemplary embodiments of the present invention, the method comprises: providing the noise prediction associated with the first digital image, providing different sets of latent variables characterizing different styles to be applied to a, for example semantic, content of the first digital image, generating a plurality of digital images with the different styles using the generator based on the noise prediction associated with the first digital image and the different sets of latent variables characterizing the different styles.
In some exemplary embodiments of the present invention, the method comprises: providing image data, e.g. comprising one or more digital images, associated with a first domain, providing image data, e.g. comprising one or more digital images, associated with a second domain, applying a style of the second domain to the image data associated with the first domain.
In some exemplary embodiments of the present invention, the image data associated with the first domain comprises labels, wherein, for example, the applying of the style of the second domain to the image data associated with the first domain comprises preserving the labels. This way, a style of the digital images of the first domain may be modified while at the same time preserving the labels, thus providing further labeled image data with different style(s).
In some exemplary embodiments of the present invention, the method comprises: providing first image data having first content information, providing second image data, wherein for example the second image data comprises second content information different from the first content information, extracting style information of the second image data, applying at least a part of the style information of the second image data to the first image data.
In some exemplary embodiments of the present invention, the method comprises: generating training data, e.g. for training at least one neural network and/or machine learning system, wherein the generating is e.g. based on image data of a source domain and based on modified image data of the source domain, wherein, for example, the modified image data is and/or has been modified with respect to an image style, e.g. according to the principle of the embodiments, for example based on a style of further image data, and, optionally, training the at least one neural network system based on the training data.
Further exemplary embodiments of the present invention relate to an apparatus for performing the method according to the embodiments of the present invention.
Further exemplary embodiments of the present invention relate to a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method according to the embodiments of the present invention.
Further exemplary embodiments of the present invention relate to a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method according to the embodiments of the present invention.
Further exemplary embodiments of the present invention relate to a data carrier signal carrying and/or characterizing the computer program according to the embodiments of the present invention.
Further exemplary embodiments of the present invention relate to a use of the method according to the embodiments and/or of the apparatus according to the embodiments and/or of the computer program according to the embodiments and/or of the computer-readable storage medium according to the embodiments and/or of the data carrier signal according to the embodiments for at least one of: a) determining at least one further digital image based on the noise prediction associated with the first digital image and a plurality of latent variables associated with the extended latent space, at least some of the plurality of latent variables being associated with another image and/or other data than the first digital image, b) transferring a style from a second digital image to the first digital image, e.g. while preserving a content of the first digital image, c) disentangling style and content of at least one digital image, d) creating different stylized digital images with unchanged content, e.g. based on the first digital image and a style of at least one further, e.g. second, digital image, e) using, e.g. re-using, labelled annotations for stylized images, f) avoiding annotation work when changing a style of at least one digital image, g) generating, e.g. perceptually realistic, digital images, e.g. with different styles, h) providing a proxy validation set, e.g. for testing out-of-distribution generalization, e.g. of a neural network system, i) training a machine learning system, j) testing a machine learning system, k) verifying a machine learning system, l) validating a machine learning system, m) generating training data, e.g. for a machine learning system, n) data augmentation, e.g. of existing image data, o) improving a generalization performance of a machine learning system, p) manipulating, e.g. flexibly manipulating, image styles, e.g. without a training associated with multiple data sets, q) utilizing an encoder GAN pipeline to manipulate image styles, r) embedding, by the encoder, information associated with an image style into, for example intermediate, latent variables, s) mixing styles of digital images, e.g. for generating at least one further digital image comprising a style based on the mixing.
Some exemplary embodiments of the present invention will now be described with reference to the figures.
Exemplary embodiments, see, for example,
In some exemplary embodiments, the digital image data and/or the (first) digital image x1 may comprise at least one of, but is not limited to: a) at least one digital image, b) an image or frame of a video stream, c) data associated with a RADAR system, e.g. imaging RADAR system, e.g. RADAR image, d) data associated with a LIDAR system, e.g. LIDAR image, e) an ultrasonic image, f) a motion image, g) a thermal image, e.g. as obtained from a thermal imaging system.
In some exemplary embodiments, at least some of the plurality LAT-VAR of latent variables associated with the extended latent space SP-W+ characterize at least one of the following aspects of the first digital image: a) a style, e.g. a non-semantic appearance, b) a texture, c) a color. In some exemplary embodiments, a style of a digital image may be characterized by a combination of a texture of at least some parts of the digital image and a color of at least some parts of the digital image.
In some exemplary embodiments,
In some exemplary embodiments, the GAN system 10 may comprise an optional discriminator 16, which, in some further exemplary embodiments, may e.g. be used for training at least one component of the GAN system, as is conventional in the art.
Some exemplary embodiments may make use of aspects of GAN inversion, which is related to finding, e.g. determining, latent variables in a latent space of a, for example pretrained, GAN, e.g. the GAN system 10 of
In some exemplary embodiments, the generator 14 of the GAN system 10 is configured and/or trained to generate digital images, e.g. photorealistic digital images, from latent variables, such as e.g. random (or pseudorandom) latent variables.
In some exemplary embodiments, the GAN system 10 of
In some exemplary embodiments, e.g. in addition to “styles”, spatial stochastic noise that is e.g. randomly sampled, e.g. from a Gaussian distribution, may be added, e.g. after at least one, e.g. some, e.g. each, feature modulation(s).
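As a minimal, illustrative sketch (not the actual StyleGAN-type implementation), adding randomly sampled spatial Gaussian noise to a feature map after a feature modulation may e.g. look as follows; the function name and the use of a single noise plane broadcast over all channels are assumptions:

```python
import numpy as np

def add_spatial_noise(features, noise_strength=0.1, rng=None):
    # add one randomly sampled Gaussian spatial noise plane to an
    # (h, w, c) feature map, broadcast over all c channels
    rng = np.random.default_rng(0) if rng is None else rng
    h, w, c = features.shape
    noise = rng.standard_normal((h, w, 1))
    return features + noise_strength * noise
```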
In some exemplary embodiments, the encoder 12 (
In some exemplary embodiments, in the W+ space, “styles” at different layers may e.g. be different. In some exemplary embodiments, a, for example properly, trained encoder 12, e.g. trained by randomly masking out noise according to some exemplary embodiments, can disentangle texture and structure information, e.g. in an unsupervised way. More specifically, in some exemplary embodiments, the encoder 12 will encode texture information into “style” latents (latent variables) and content information into noise(s). Note, however, that according to further exemplary embodiments, the masking is not (necessarily) required for modifying a style, e.g. for style augmentation, according to the principle of the embodiments. In other words, in some exemplary embodiments, style mixing, e.g. style augmentation, can e.g. be performed without masking.
In some exemplary embodiments, e.g. given a pretrained generator G (such as e.g. generator 14 of
where d(⋅) is a distance metric, e.g. to measure a similarity between the original image x and the reconstructed image G(z).
In some exemplary embodiments, L2 and LPIPS (as e.g. defined by arXiv:1801.03924v2 [cs.CV] 10 Apr. 2018) can be jointly used as the distance metric d(⋅).
In some exemplary embodiments, the extended (intermediate) latent space W+ encourages a comparatively good reconstruction quality. In some exemplary embodiments, e.g. in addition to the prediction of intermediate latents, spatial noises may be predicted as well, which, in some exemplary embodiments, can e.g. better preserve detail information in a given image. In some exemplary embodiments, formally, the Encoder E and Generator G can be described as follows:
{w, ε}=E(x),
x*=G(w, ε),
wherein x and x* are the given original image and the reconstructed image, respectively, wherein w characterizes the predicted intermediate latent variables, and wherein ε characterizes the predicted noises. In some exemplary embodiments, the encoder may e.g. be trained to, e.g. faithfully, reconstruct the given image x.
In some exemplary embodiments,
Element E1 symbolizes a feature pyramid according to some exemplary embodiments, which is e.g. configured to perform the step of determining 120 a plurality of, for example hierarchical, feature maps FM based on the first digital image x1, see block 120 of
In some exemplary embodiments, the feature pyramid E1 may e.g. comprise a plurality of convolution layers for providing the plurality of, for example hierarchical, feature maps FM.
In some exemplary embodiments, the feature pyramid E1 may e.g. be based on, e.g. be similar or identical to, the structure depicted by
In some exemplary embodiments, other topologies for the feature pyramid E1 are also possible.
Elements E2-1, . . . , E2-n, . . . of
Element E3 of
In some exemplary embodiments, the noise mapper E3 may e.g. comprise a stack of, e.g. 1×1, convolution layers, which is configured to take a h×w×c feature map as an input and to output a h×w×c′ feature map.
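For illustration only: a 1×1 convolution over an h×w×c feature map is a per-pixel linear map across channels, so a noise-mapper stack as described above may be sketched as follows (the function names and the ReLU between layers are assumptions, not the actual architecture of E3):

```python
import numpy as np

def conv1x1(feat, weight, bias=None):
    # a 1x1 convolution over an (h, w, c) feature map is a per-pixel
    # linear map across channels; weight has shape (c, c_out)
    out = feat @ weight
    if bias is not None:
        out = out + bias
    return out

def noise_mapper(feat, weights):
    # hypothetical stack of 1x1 convolutions mapping an h x w x c
    # feature map to an h x w x c' noise map
    for k, w in enumerate(weights):
        feat = conv1x1(feat, w)
        if k < len(weights) - 1:
            feat = np.maximum(feat, 0.0)  # ReLU between layers (assumption)
    return feat
```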
In some exemplary embodiments,
In some exemplary embodiments,
In some exemplary embodiments,
In some exemplary embodiments, the generator 14a of
In some exemplary embodiments, the generator 14a may e.g. be of the StyleGAN-type or StyleGAN2-type, as e.g. described in at least one of the following papers:
As an example, in some exemplary embodiments, the generator 14a may comprise the architecture as exemplarily denoted by
In some exemplary embodiments,
In some exemplary embodiments, e.g. using the encoder 12, 12a of the GAN system 10, a style of a digital image x1 can be modified, e.g. by changing the intermediate latents w, which characterize aspects of the style of the digital image x1.
In this regard,
Elements E21 of
Elements E22 symbolize blocks of the encoder which are configured to determine information characterizing a style of the respective input image x1, x2, e.g. characterized by latent variables w as explained above, see, for example, elements w1, . . . , wk of
Elements E23 of
Element 14b of
In other words, some exemplary embodiments make it possible to keep, e.g. preserve, the content of the first digital image x1 and to transfer the style information of the second digital image x2 to the first digital image x1, e.g. by combining the noise prediction ε1 from the first digital image x1, and the (e.g. intermediate) latent variables w2 of the second digital image x2. In some exemplary embodiments, the, for example fixed, generator 14b, e.g. of the StyleGAN-type or of the StyleGAN2-type, takes the components ε1, w2 as inputs and produces the mixed image xmix.
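As a purely illustrative toy sketch of the combination of ε1 and w2 into a mixed image (the linear “encoder” and “generator” below are simplistic stand-ins, not the StyleGAN-type architecture): content is carried here by a zero-mean noise map and style by global image statistics:

```python
import numpy as np

# Toy stand-ins for the encoder/generator interfaces (NOT the
# StyleGAN-type architecture): "content" is a zero-mean noise map,
# "style" is the pair of global image statistics (mean, std).
def toy_encoder(img):
    w = np.array([img.mean(), img.std()])  # "style" latents
    eps = img - img.mean()                 # "content" noise map
    return w, eps

def toy_generator(w, eps):
    mean, std = w
    # re-impose the style statistics onto the content structure
    return eps * (std / (eps.std() + 1e-8)) + mean

x1 = np.linspace(0.0, 1.0, 16).reshape(4, 4)  # "content" image (gradient)
x2 = np.linspace(5.0, 9.0, 16).reshape(4, 4)  # "style" image (brighter)

w1, eps1 = toy_encoder(x1)
w2, eps2 = toy_encoder(x2)
x_mix = toy_generator(w2, eps1)  # content of x1 combined with style of x2
```

In this toy example, x_mix keeps the spatial structure of x1 while adopting the brightness statistics of x2.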
Returning to
In some exemplary embodiments, and as at least partly already mentioned above, the noise map is spatially divided into nonoverlapping P×P patches PATCH (see also block 132a of
In some exemplary embodiments, e.g. based on a pre-defined ratio ρ, a subset PATCH-SUB of the patches is e.g. randomly selected and replaced by patches of unit Gaussian random variables ϵ∼N(0, 1) of the same size, wherein, for example, N(0, 1) is the prior distribution of the noise map, e.g. at a training of the generator 14, 14a (which can e.g. be of the StyleGAN2-type).
In some exemplary embodiments, the encoder 12, 12a may be denoted as a “masked noise encoder”, as, in some exemplary embodiments, it is trained with random masking, e.g. to predict the noise map.
In some exemplary embodiments, the proposed random masking may reduce an encoding capacity of the noise map, hence encouraging the encoder 12, 12a to jointly exploit the latent codes {wk} for reconstruction. In some exemplary embodiments, thus, the encoder 12, 12a takes the noise map and latent codes from the content and style images, respectively. In some exemplary embodiments, then, they may be fed into the generator 14, 14a (e.g., of the StyleGAN2-type), e.g. to synthesize a new image.
In some exemplary embodiments, if the encoder 12, 12a is not trained with random masking, the new image may not have, e.g. any, perceptible difference(s) with the content image. In some exemplary embodiments, this means that the latent codes {wk} encode negligible information of the image. In contrast, in some exemplary embodiments, when being trained with masking, the encoder creates a novel image that takes the content and style from two different images. In some exemplary embodiments, this observation confirms an important role of masking for content and style disentanglement according to some exemplary embodiments, and thus the, e.g. improved, style mixing capability.
In some exemplary embodiments, the noise map does not, e.g. no longer, encode all perceptible information of the image, including style and content. In some exemplary embodiments, in effect, the latent codes {wk} play a more active role in controlling the style.
In the following, aspects and information related to an encoder training loss according to some exemplary embodiments are provided.
In some exemplary embodiments, the principle according to the embodiments related to GAN inversion, e.g. StyleGAN2 inversion according to some exemplary embodiments, with a masked noised encoder EM can be formulated as {w1, . . . , wK, ε}=EM(x);
x*=G∘EM(x)=G(w1, . . . , wK, ε).
In some exemplary embodiments, the masked noise encoder EM maps the given image x onto the latent codes {wk} and the noise map ε.
In some exemplary embodiments, the generator G (see also element 14, 14a of
In some exemplary embodiments, the encoder 12, 12a, e.g. the masked noise encoder EM, is trained, e.g. to reconstruct the original image x.
In some exemplary embodiments, when training the encoder 12, 12a, e.g. the masked noise encoder EM, to reconstruct the original image x, the original noise map ε is masked, e.g. before being fed into the, e.g. pre-trained, generator G, wherein the masking can e.g. be characterized by:
εM=(1−Mnoise)⊙ε+Mnoise⊙ϵ,
x̂=G(w1, . . . , wK, εM),
wherein Mnoise e.g. is a random binary mask, wherein ϵ∼N(0, 1) denotes fresh unit Gaussian noise, wherein ⊙ indicates the Hadamard product, and wherein x̂ denotes the reconstructed image with the masked noise εM.
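The masking of the noise map with a random binary mask, as described above, may e.g. be sketched as follows (illustrative NumPy example; the 8×8 size and the 0.5 masking ratio are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = rng.standard_normal((8, 8))    # noise map predicted by the encoder
fresh = rng.standard_normal((8, 8))  # fresh unit Gaussian noise
# random binary mask M_noise; the 0.5 masking ratio is an assumption
M = (rng.random((8, 8)) < 0.5).astype(float)

# eps_M = (1 - M_noise) (*) eps + M_noise (*) fresh, (*) = Hadamard product
eps_M = (1.0 - M) * eps + M * fresh
```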
In some exemplary embodiments, the training loss for the encoder can be characterized by
L=Lmse+λ1Llpips+λ2Ladv+λ3Lreg,
where {λi} are weighting factors. The first three terms are the pixel-wise MSE loss, learned perceptual image patch similarity (LPIPS) loss (e.g., according to Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang; “The unreasonable effectiveness of deep features as a perceptual metric;” In CVPR, 2018.) and adversarial loss (e.g., according to Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, DavidWarde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio; “Generative adversarial nets;” in NeurIPS, 2014.):
Lmse=∥(1−Mimg)⊙(x−x̂)∥2,
Llpips=∥(1−Mfeat)⊙(VGG(x)−VGG(x̂))∥2,
Ladv=−log D(G(EM(x))).
Note that, since, in some exemplary embodiments, masking removes the information of the given image x at certain spatial positions, the reconstruction requirement on these positions should then be relaxed. In some exemplary embodiments, Mimg and Mfeat may e.g. be obtained by up- and downsampling the noise mask Mnoise to the image size and the feature size of the, e.g. VGG-based, feature extractor.
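As a small illustrative sketch of the masked pixel-wise MSE loss, which ignores reconstruction errors at masked-out positions (the 4×4 example values are arbitrary):

```python
import numpy as np

def masked_mse(x, x_hat, M_img):
    # || (1 - M_img) (*) (x - x_hat) ||_2: reconstruction errors at
    # masked-out positions (M_img = 1) are ignored
    return float(np.linalg.norm(((1.0 - M_img) * (x - x_hat)).ravel()))

x = np.ones((4, 4))                        # arbitrary "original" image
x_hat = np.zeros((4, 4))                   # arbitrary "reconstruction"
M_img = np.zeros((4, 4)); M_img[:2] = 1.0  # top half masked out

loss = masked_mse(x, x_hat, M_img)  # only the 8 unmasked pixels contribute
```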
In some exemplary embodiments, the adversarial loss is obtained by formulating the encoder training as an adversarial game with a discriminator D (also see optional element 16 of
In some exemplary embodiments, the last regularization term is defined as
Lreg=∥ε∥1+∥EwM(G(wgt, ϵ))−wgt∥2
In some exemplary embodiments, the L1 norm helps to induce sparse noise prediction. In some exemplary embodiments, it is complementary to random masking, reducing the capacity of the noise map. In some exemplary embodiments, the second term is obtained by using the ground truth latent codes wgt of synthesized images G(wgt, ϵ) to train the latent code prediction EwM (e.g. according to Xu Yao, Alasdair Newson, Yann Gousseau, and Pierre Hellier; “Feature-Style Encoder for Style-Based GAN Inversion,” arXiv preprint, 2022.). In some exemplary embodiments, it guides the encoder to stay close to the original latent space of the generator, speeding up the convergence.
In some exemplary embodiments,
In some exemplary embodiments,
In some exemplary embodiments, there are multiple ways to obtain style information from one or more digital images and/or data sets. For example, as exemplarily shown in
In some exemplary embodiments, one, e.g. single, e.g. unlabeled, image from a target domain TD (
In some exemplary embodiments, styles extracted e.g. from one or more digital images, e.g. based on the principle according to the embodiments, can also be interpolated. As exemplarily shown in
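Such an interpolation of extracted styles may e.g. be sketched as a simple linear blend of style latents (the function name and the example latent vectors are hypothetical):

```python
import numpy as np

def interpolate_styles(w_a, w_b, alpha):
    # linear blend: alpha = 0 gives w_a, alpha = 1 gives w_b
    return (1.0 - alpha) * w_a + alpha * w_b

# hypothetical example style latents
w_day = np.array([1.0, 0.0, 2.0])
w_night = np.array([0.0, 1.0, 0.0])
w_mid = interpolate_styles(w_day, w_night, 0.5)  # halfway style
```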
In some exemplary embodiments, as illustrated by
In some exemplary embodiments,
In some exemplary embodiments,
In some exemplary embodiments, the style-mixed images x as e.g. obtained according to at least one of
For instance,
In some exemplary embodiments,
Since according to exemplary embodiments, the content information I-CONT-1 remains unchanged, e.g. during the processing using the generator 14b, the labels LAB of the first domain or source domain DOM-1 may be used and they are preserved throughout the generation of style-mixed images x-CONT-1-STYLE-2. In some exemplary embodiments, style information of digital images can e.g. be translated from any target domain(s), e.g. without labels. Such data augmentation according to exemplary embodiments can e.g. be helpful for improving generalization performance.
For example, in some exemplary embodiments, a (machine learning) model trained solely on day scenes (i.e., one single, specific domain or style) may perform badly on other scenes, such as e.g. night scenes. With the proposed style-mixing data augmentation according to exemplary embodiments, a performance gap between day scenes and night scenes may be largely reduced.
Interestingly, in some exemplary embodiments, it can be observed that style mixing within a source domain can improve, e.g. boost, an out-of-domain (“ood”) generalization, e.g. without access to more datasets. In some exemplary embodiments, it is hypothesized that an intra-mix stylization according to some exemplary embodiments can be helpful, e.g. for finding a solution near a flat optimum, which may e.g. lead to better generalization ability.
Furthermore, in some exemplary embodiments, the style-mixed images as can be obtained applying the principle according to the embodiments can also be used for validation, where the test performance can e.g. serve as a proxy indicator of generalization to select models. In some conventional approaches, there may not be a good or preferable way to pick, a priori, a model with the best generalization ability given, e.g. only, a source dataset. Therefore, in some exemplary embodiments, style-mixing by applying the principle according to the embodiments may be helpful for selecting the best model, e.g. without requiring target datasets.
In some exemplary embodiments, one single, e.g. unlabeled, image can be used, e.g. is enough, e.g. for style extraction, e.g., by using the encoder 12, 12a, where the style can e.g. be transferred to a source dataset. Since in some exemplary embodiments, the source dataset may be labelled, the model can thus be tested on a style-mixed dataset. Based on a so determined test accuracy, in some exemplary embodiments, the model's generalization performance on the target dataset can be approximated.
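The proxy-validation idea above may e.g. be sketched as follows; the brightness-threshold toy “model” and the example images are purely hypothetical stand-ins for a trained classifier and a style-mixed, labelled dataset:

```python
import numpy as np

def proxy_validation_accuracy(model, stylized_images, labels):
    # test accuracy on style-mixed images with preserved source labels,
    # used as a proxy indicator of ood generalization
    preds = np.array([model(img) for img in stylized_images])
    return float((preds == np.array(labels)).mean())

# purely hypothetical toy "model": classify by mean brightness
toy_model = lambda img: int(img.mean() > 0.5)

stylized = [np.full((2, 2), 0.9), np.full((2, 2), 0.1), np.full((2, 2), 0.8)]
labels = [1, 0, 0]  # preserved source labels
acc = proxy_validation_accuracy(toy_model, stylized, labels)
```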
In some exemplary embodiments,
To summarize some exemplary aspects, in some exemplary embodiments, see, for example,
Further exemplary embodiments,
In some exemplary embodiments, the apparatus 200 comprises at least one calculating unit, e.g. processor, 202 and at least one memory unit 204 associated with (i.e., usable by) the at least one calculating unit 202, e.g. for at least temporarily storing a computer program PRG and/or data DAT, wherein the computer program PRG is e.g. configured to at least temporarily control an operation of the apparatus 200, e.g. for implementing at least some aspects of the GAN system 10 (
In some exemplary embodiments, the at least one calculating unit 202 comprises at least one core (not shown) for executing the computer program PRG or at least parts thereof, e.g. for executing the method according to the embodiments or at least one or more steps and/or other aspects thereof.
According to further exemplary embodiments, the at least one calculating unit 202 may comprise at least one of the following elements: a microprocessor, a microcontroller, a digital signal processor (DSP), a programmable logic element (e.g., FPGA, field programmable gate array), an ASIC (application specific integrated circuit), hardware circuitry, a tensor processor, a graphics processing unit (GPU). According to further preferred embodiments, any combination of two or more of these elements is also possible.
According to further exemplary embodiments, the memory unit 204 comprises at least one of the following elements: a volatile memory 204a, e.g. a random-access memory (RAM), a non-volatile memory 204b, e.g. a Flash-EEPROM.
In some exemplary embodiments, the computer program PRG is at least temporarily stored in the non-volatile memory 204b. Data DAT, e.g. associated with at least one of: a) digital image(s), b) parameters and/or hyperparameters of the GAN system 10, c) latent variables, d) random data, e.g. for masking a noise map, e) distribution(s) DISTR, f) content information I-CONT-1, g) style information I-STYLE-2 and the like, which may e.g. be used for executing the method according to some exemplary embodiments, may at least temporarily be stored in the RAM 204a.
In some exemplary embodiments, an optional computer-readable storage medium SM comprising instructions, e.g. in the form of a further computer program PRG′, may be provided, wherein the further computer program PRG′, when executed by a computer, i.e., by the calculating unit 202, may cause the computer 202 to carry out the method according to the embodiments. As an example, the storage medium SM may comprise or represent a digital storage medium such as a semiconductor memory device (e.g., solid state drive, SSD) and/or a magnetic storage medium such as a disk or hard disk drive (HDD) and/or an optical storage medium such as a compact disc (CD) or DVD (digital versatile disc) or the like.
In some exemplary embodiments, the apparatus 200 may comprise an optional data interface 206, e.g. for bidirectional data exchange with an external device (not shown). As an example, by means of the data interface 206, a data carrier signal DCS may be received, e.g. from the external device, for example via a wired or a wireless data transmission medium, e.g. over a (virtual) private computer network and/or a public computer network such as e.g. the Internet.
In some exemplary embodiments, the data carrier signal DCS may represent or carry the computer program PRG, PRG′ according to the embodiments, or at least a part thereof.
Further exemplary embodiments relate to a computer program PRG, PRG′ comprising instructions which, when the program is executed by a computer 202, cause the computer 202 to carry out the method according to the embodiments.
Further exemplary embodiments,
In the following, further aspects and advantages according to further exemplary embodiments are provided, which, in some exemplary embodiments, may be combined with each other and/or with at least one of the exemplary aspects explained above.
In some conventional approaches, the i.i.d. (independent and identically distributed) assumption has been made for deep learning, i.e., training and test data such as e.g. digital images should be drawn from the same distribution. However, in real life, the i.i.d. assumption can be easily violated. For example, different weather conditions or different cities can cause distribution shifts. In at least some conventional approaches, such data shifts can lead to severe performance degradation. In at least some conventional approaches, unsupervised domain adaptation or domain generalization aims to mitigate this issue.
In some conventional approaches, data augmentation techniques such as e.g. color transformation and CutMix (https://arxiv.org/pdf/1905.04899.pdf) are proposed, which can randomly modify an appearance of a dataset, but which cannot transfer appearances/styles of another dataset to a source dataset. In some conventional approaches, Image to Image Translation for Domain Adaptation can do such targeted translation, but requires the image-to-image translation model to be trained on both source and target domain.
In some exemplary embodiments, the principle according to the embodiments can e.g. be seen and/or used as an enhancement to Encoder-GAN architectures, such as “Encoding in Style: a StyleGAN Encoder for Image-to-Image Translation (pSp)” (https://arxiv.org/pdf/2008.00951.pdf). Particularly, and in contrast to the conventional approaches, the principle according to the embodiments can flexibly manipulate image styles, e.g. without multi-dataset training. In some exemplary embodiments, the images, e.g. synthesized images, as obtained by applying the principle according to the embodiments can be used for data augmentation during network training, e.g. to improve model generalization performance.
In some exemplary embodiments, stylized images as e.g. obtained by applying the principle according to the embodiments can be used for validation, e.g. to indicate a model's out-of-distribution (ood) generalization capability.
In some exemplary embodiments, an Encoder-GAN pipeline is used to manipulate image styles. In some exemplary embodiments, it can be observed that a, for example properly, trained encoder can disentangle style and content information in an unsupervised manner. More specifically, in some exemplary embodiments, the encoder can embed style information into intermediate latent variables and content information into the noise predictions. Moreover, in some exemplary embodiments, this pipeline generalizes well to unseen datasets.
In some exemplary embodiments, taking advantage of these appealing properties of the Encoder-GAN pipeline related to the principle according to the embodiments, multiple applications are proposed, e.g. to manipulate image styles, and/or further usages, e.g. during training and/or validation.
In some exemplary embodiments, the principle according to the embodiments enables to transfer styles of other datasets to a source dataset and to generate stylized images with well-preserved content information of the original images.
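Purely for illustration, the data flow of such a style transfer may be sketched as follows. The encode and generate functions below are simplified numerical stand-ins for a trained encoder and generator, not the actual networks: here, a global mean acts as a crude "style" and the spatial residual as "content":

```python
import numpy as np

def encode(image):
    """Return (style_latents, content_noise) for an image (stand-in)."""
    style = image.mean(axis=(0, 1))  # coarse global statistics ~ style
    noise = image - style            # residual spatial detail ~ content
    return style, noise

def generate(style, noise):
    """Recombine style latents and content noise into an image (stand-in)."""
    return style + noise

def transfer_style(source_img, target_img):
    """Keep the source content, adopt the target style."""
    _, source_noise = encode(source_img)
    target_style, _ = encode(target_img)
    return generate(target_style, source_noise)
```

In this sketch, transfer_style keeps the spatial residual (content) of the source image and adopts the global statistics (style) of the target image, mirroring the roles of the noise predictions and the intermediate latent variables in the Encoder-GAN pipeline.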
In some exemplary embodiments, the principle according to the embodiments enables to interpolate styles and/or to learn a style distribution and to sample from the style distribution. In some exemplary embodiments, the stylized images as obtained by applying the principle according to the embodiments can e.g. be used for data augmentation, e.g. during training.
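Style interpolation and sampling from a learned style distribution may, purely for illustration, be sketched as follows; all names are illustrative, and in the Encoder-GAN pipeline, w_a and w_b would be intermediate latent codes produced by the encoder:

```python
import numpy as np

def interpolate_styles(w_a, w_b, t):
    """Linear interpolation between two style latent codes, t in [0, 1]."""
    return (1.0 - t) * w_a + t * w_b

def fit_style_distribution(w_codes):
    """Fit a diagonal Gaussian to a set of style latents (shape (N, D))."""
    mu = w_codes.mean(axis=0)
    sigma = w_codes.std(axis=0) + 1e-8  # avoid degenerate zero variance
    return mu, sigma

def sample_styles(mu, sigma, n, rng=None):
    """Draw n new style codes from the fitted distribution."""
    rng = rng or np.random.default_rng()
    return mu + sigma * rng.standard_normal((n, mu.shape[0]))
```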
In some exemplary embodiments, the stylized images as obtained by applying the principle according to the embodiments can e.g. be used as proxy validation sets, e.g. for out-of-distribution (ood) data, where the test accuracy on stylized synthetic images can predict the ood generalization performance to a certain extent. In some exemplary embodiments, this can be useful for model selection. For example, for a model trained on sunny day scenes (source domain), night, foggy, snowy, and scenes under other weather conditions are considered ood samples. In some exemplary embodiments, styles of the ood samples can be transferred to the source domain while preserving the content of the source images. In some exemplary embodiments, since the source domain images may be labelled, models can be tested on the stylized source images, and the test accuracy can indicate the model's ood generalization ability.
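The use as a proxy validation set may be sketched as follows; this is a minimal illustration in which model is any callable mapping a batch of images to per-class scores, and the names are not taken from any specific framework:

```python
import numpy as np

def proxy_ood_accuracy(model, stylized_images, labels):
    """Accuracy on stylized source images as a proxy for ood performance.

    Since the stylization preserves the content, the original labels of
    the source images remain valid for the stylized images.
    """
    preds = np.argmax(model(stylized_images), axis=1)
    return float((preds == labels).mean())
```

The accuracy returned on such stylized images can then serve as the indicator of ood generalization ability described above, e.g. for model selection.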
In some exemplary embodiments, e.g. thanks to the style-content disentanglement of the encoder 12, 12a, exemplary embodiments enable to generate different stylized images while the content of the images remains unchanged, e.g. the same content as in the original image. Thus, in some exemplary embodiments, labelled annotations of the original images can also be used for the stylized images.
In some conventional approaches, when drawing samples from a distribution that is not covered by the existing data, the gathered samples are required to be labelled, which is not necessary in exemplary embodiments, since the labels of the original images are preserved. Thus, in some exemplary embodiments, time and/or costs, e.g. for additional annotation work, can be saved.
Further, in some exemplary embodiments, the style-mixed images as may be obtained by applying the principle of the embodiments are perceptually realistic and e.g. close to the target datasets. Therefore, in some exemplary embodiments, they can be used as a proxy validation set for testing out-of-distribution generalization.
In some exemplary embodiments, an Encoder-GAN pipeline according to the principle of the embodiments does not require training on the target datasets, unlike some conventional image-to-image translation methods. In some exemplary embodiments, a model trained on a single dataset generalizes well to unseen datasets, which enables more flexible style mixing and manipulation.
In some exemplary embodiments, the principle of the embodiments can e.g. be used for at least one of: training a machine learning (ML) system, generating training data for this training, generating test data, e.g. to check whether the trained ML system can then be safely operated.
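The use of stylized images for generating training data may, purely for illustration, be sketched as follows; stylize is a hypothetical placeholder for the Encoder-GAN stylization described above, and since the content, and hence the label, is preserved, the labels are passed through unchanged:

```python
import numpy as np

def augment_batch(images, labels, stylize, p=0.5, rng=None):
    """Replace each image with a stylized variant with probability p.

    `stylize` is any callable producing a stylized image with the same
    content (and therefore the same label) as its input.
    """
    rng = rng or np.random.default_rng()
    out = []
    for img in images:
        out.append(stylize(img) if rng.random() < p else img)
    return np.stack(out), labels
```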
In some exemplary embodiments, aspects of the embodiments relate to and/or characterize a generative model, e.g. to generate training or test data, and to a method to train the generative model.
In some exemplary embodiments, the principle according to the embodiments may e.g. be used for at least one of, but not limited to: a) data analysis, e.g. analysis of digital image and/or video data, b) classifying digital image data, c) detecting the presence of objects in the data, d) performing a semantic segmentation on the data, e.g. regarding at least one of, but not limited to: d1) traffic signs, d2) road surfaces, d3) pedestrians, d4) vehicles, d5) object classes that may e.g. occur in a semantic segmentation task, e.g., trees, sky, etc.
Number | Date | Country | Kind |
---|---|---|---|
22 20 1999.4 | Oct 2022 | EP | regional |