A conditional generative adversarial network (cGAN) includes a generator and a discriminator. The generator includes a first neural network that generates a synthetic image based on conditional input information. The discriminator includes a second neural network for determining whether an input image fed to it is synthetic (meaning that it was produced by the generator) or real (meaning that it was likely not produced by the generator). A training system iteratively adjusts the parameter values of the generator with the aim of producing synthetic images that are mistaken by the discriminator as being real. The training system updates the parameter values of the discriminator such that it can successfully discriminate between synthetic and real images. This training objective is referred to as adversarial because it pits the generator against the discriminator. Once the generator and the discriminator are fully trained, a developer will use the generator to transform images from an input form into an output form. At this stage, the discriminator is not needed and is discarded.
GAN technology has proven to be a powerful tool to train image translators. For instance, the technical literature has discussed the use of GAN-trained image translators that allow users to modify fashion-related images. Yet there is room for improvement in existing GANs. For instance, a developer may have difficulty obtaining sufficient training data to train robust models using a GAN.
A computer-implemented technique is described herein that uses a generative adversarial network (GAN) to jointly train a generator neural network (“generator”) and a discriminator neural network (“discriminator”). Unlike traditional GAN designs, the discriminator performs the dual role of: (a) determining plural attribute values associated with an object depicted in an input image fed to the discriminator; and (b) determining whether the input image fed to the discriminator is real or “fake” (meaning that it is synthesized by the generator).
Also unlike traditional GAN designs, an inference-stage image classifier can make use of a model that is learned for the GAN's discriminator. In other words, the GAN produces two productive models. The first trained model finds use in an image translator, while the second trained model finds use in an image classifier. The GAN is referred to as a multi-task GAN herein because it incorporates a discriminator that performs plural tasks, including determining plural attribute values and determining whether the input image fed to the discriminator is real or fake. Another reason that the GAN can be referred to as multi-task is that it uses the same training procedure to produce two models.
According to one technical advantage, the GAN described herein can increase the accuracy of its trained models by including a discriminator that performs plural tasks. More specifically, the inclusion of a dual-use discriminator enables the generator to produce more realistic synthesized images, and enables the discriminator to perform its dual classification role with greater accuracy. This is true even for the case in which the GAN is trained using a relatively modest-sized corpus of training examples.
According to one illustrative aspect, the generator receives generator input information that includes a conditional input image and one or more conditional input values that express desired characteristics of a generator output image. In one non-limiting case, the conditional input image is produced by applying a predetermined image transformation on a given input image (e.g., a real image). The discriminator receives the conditional input image in conjunction with a discriminator input image, which corresponds to either the generator output image or a real image.
The above-summarized technique can be manifested in various types of systems, devices, components, methods, computer-readable storage media, data structures, graphical user interface presentations, articles of manufacture, and so on.
This Summary is provided to introduce a selection of concepts in a simplified form; these concepts are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
The same numbers are used throughout the disclosure and figures to reference like components and features. Series 100 numbers refer to features originally found in
This disclosure is organized as follows. Section A describes a GAN-based training framework for jointly training an image translator and an image classifier. Section B sets forth illustrative methods which explain the operation of the training framework of Section A. And Section C describes illustrative computing functionality that can be used to implement any aspect of the features described in Sections A and B.
As a preliminary matter, the term “hardware logic circuitry” corresponds to technology that includes one or more hardware processors (e.g., CPUs, GPUs, etc.) that execute machine-readable instructions stored in a memory, and/or one or more other hardware logic units (e.g., FPGAs) that perform operations using a task-specific collection of fixed and/or programmable logic gates. Section C provides additional information regarding one implementation of the hardware logic circuitry. In some contexts, each of the terms “component,” “engine,” “system,” and “tool” refers to a part of the hardware logic circuitry that performs a particular function.
In one case, the illustrated separation of various parts in the figures into distinct units may reflect the use of corresponding distinct physical and tangible parts in an actual implementation. Alternatively, or in addition, any single part illustrated in the figures may be implemented by plural actual physical parts. Alternatively, or in addition, the depiction of any two or more separate parts in the figures may reflect different functions performed by a single actual physical part.
Other figures describe the concepts in flowchart form. In this form, certain operations are described as constituting distinct blocks performed in a certain order. Such implementations are illustrative and non-limiting. Certain blocks described herein can be grouped together and performed in a single operation, certain blocks can be broken apart into plural component blocks, and certain blocks can be performed in an order that differs from that which is illustrated herein (including a parallel manner of performing the blocks). In one implementation, the blocks shown in the flowcharts that pertain to processing-related functions can be implemented by the hardware logic circuitry described in Section C, which, in turn, can be implemented by one or more hardware processors and/or other logic units that include a task-specific collection of logic gates.
As to terminology, the phrase “configured to” encompasses various physical and tangible mechanisms for performing an identified operation. The mechanisms can be configured to perform an operation using the hardware logic circuitry of Section C. The term “logic” likewise encompasses various physical and tangible mechanisms for performing a task. For instance, each processing-related operation illustrated in the flowcharts corresponds to a logic component for performing that operation. A logic component can perform its operation using the hardware logic circuitry of Section C. When implemented by computing equipment, a logic component represents an electrical element that is a physical part of the computing system, in whatever manner implemented.
Any of the storage resources described herein, or any combination of the storage resources, may be regarded as a computer-readable medium. In many cases, a computer-readable medium represents some form of physical and tangible entity. The term computer-readable medium also encompasses propagated signals, e.g., transmitted or received via a physical conduit and/or air or other wireless medium, etc. However, the specific term “computer-readable storage medium” expressly excludes propagated signals per se, while including all other forms of computer-readable media.
The following explanation may identify one or more features as “optional.” This type of statement is not to be interpreted as an exhaustive indication of features that may be considered optional; that is, other features can be considered as optional, although not explicitly identified in the text. Further, any description of a single entity is not intended to preclude the use of plural such entities; similarly, a description of plural entities is not intended to preclude the use of a single entity. Further, while the description may explain certain features as alternative ways of carrying out identified functions or implementing identified mechanisms, the features can also be combined together in any combination. Further, the term “plurality” refers to two or more items, and does not necessarily imply “all” items of a particular kind, unless otherwise explicitly specified. Unless otherwise noted, the descriptors “first,” “second,” “third,” etc. are used to distinguish among different items, and do not imply an ordering among items. Finally, the terms “exemplary” or “illustrative” refer to one implementation among potentially many implementations.
A. Illustrative Computing Systems
A.1. Overview
The discriminator 110 receives discriminator input information 120. The discriminator input information 120 includes the same conditional input image 116 that was fed to the generator 108. It also includes a discriminator input image. The discriminator input image can either correspond to the generator output image 114 or a “real” image, which, in one interpretation, means any kind of input image produced by a mechanism (e.g., a camera), provided that the input image was not synthetically generated by the generator 108. The real image is the real or source counterpart of the generator output image 114.
The discriminator 110 performs two tasks based on the discriminator input information 120. As a first task, the discriminator 110 generates one or more attribute values 122 that identify one or more characteristics of the discriminator input image included in the discriminator input information 120. That is, each attribute value specifies a class associated with a particular attribute of the discriminator input image. For example, the attribute value “red” specifies a class associated with the attribute of “color.” As a second task, the discriminator 110 generates an output result 124 that specifies whether the discriminator input image is fake (meaning that it was likely generated by the generator 108) or real (meaning that it was not likely generated by the generator 108). In contrast, note that a traditional discriminator of a GAN performs the sole dedicated task of determining whether input information fed into the discriminator is real or fake. The discriminator 110 may be referred to as a multi-task discriminator because it performs at least the above two tasks. It may also be considered a multi-task discriminator because it classifies plural characteristics of the discriminator input image.
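The two outputs described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the discriminator emits one list of scores per attribute plus a single probability that the image is real; the function name `decode_discriminator_output`, the class lists, and the 0.5 threshold are hypothetical and do not come from this disclosure.

```python
from typing import Dict, List, Tuple


def decode_discriminator_output(
    attribute_scores: Dict[str, List[float]],
    class_names: Dict[str, List[str]],
    real_probability: float,
    threshold: float = 0.5,  # assumed cut-off between "real" and "fake"
) -> Tuple[Dict[str, str], str]:
    """Map raw per-attribute scores and a realness score to readable results."""
    attribute_values = {}
    for attribute, scores in attribute_scores.items():
        # First task: pick the highest-scoring class for each attribute.
        best_index = max(range(len(scores)), key=scores.__getitem__)
        attribute_values[attribute] = class_names[attribute][best_index]
    # Second task: decide whether the input image is real or fake.
    verdict = "real" if real_probability >= threshold else "fake"
    return attribute_values, verdict


# Hypothetical scores for a single discriminator input image.
scores = {"color": [0.1, 0.7, 0.2], "pattern": [0.6, 0.4]}
names = {"color": ["blue", "red", "green"], "pattern": ["plain", "striped"]}
values, verdict = decode_discriminator_output(scores, names, real_probability=0.2)
```

In this sketch the attribute value “red” is selected as the class for the attribute “color,” mirroring the example in the text.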
The behavior of the generator 108 is governed by a set of parameter values θg, while the behavior of the discriminator 110 is governed by a set of parameter values θd. The parameter-updating system 106 iteratively adjusts these parameter values based on the following competing objectives. As a first objective, the parameter-updating system 106 attempts to iteratively adjust the parameter values of the generator 108 such that it produces increasingly realistic synthetic images. A generator output image is deemed realistic when it successfully “fools” the discriminator 110 into identifying it as real, when, in fact, it is synthetic. A successful generator output image also exhibits desired attribute values, and further enables the discriminator 110 to correctly assess its properties. As a second objective, the parameter-updating system 106 attempts to iteratively adjust the parameter values of the discriminator 110 to progressively increase the accuracy with which it: (a) classifies the attributes of the discriminator input image; and (b) assesses whether the discriminator input image is real or fake. In performing this training task, the parameter-updating system 106 draws from training images in a data store 126.
Once fully trained, the parameter values θg of the generator 108 define a first trained model, while the parameter values θd of the discriminator 110 define a second trained model. In the inference stage, an image translator (not shown in
The use of a multi-task discriminator 110 imposes additional constraints in the training performed by the training framework 102, compared to a traditional GAN. These added constraints result in the production of a more accurate generator model and a more accurate discriminator model (compared to the example in which the discriminator 110 does not perform multiple tasks). The conditional input image fed to the discriminator 110 also passes on useful information to the discriminator 110, which contributes to the end result of producing an accurate discriminator model.
From the standpoint of the discriminator 110, the generator 108 serves the role of expanding the number of training examples fed to the discriminator 110, starting from an original corpus of labeled source images. This factor is one reason the training framework 102 is able to produce robust models starting with a relatively modest corpus of training examples. Additional details regarding the training performed by the training framework 102 are set forth in Subsection A.4.
The generator input information 202 also includes one or more attribute arrays 206. Each attribute array specifies an attribute value associated with an attribute. For example, a first attribute array specifies a particular color, a second attribute array specifies a particular pattern, a third attribute array specifies a particular material, and so on. More specifically, each attribute array has a size of f×g×1. Each element of an array specifies the same attribute value. Consider, for example, the attribute array associated with color. Assume that there are M color classes, and that red is the mth color in a list of the M color classes. An attribute array for the color red can include the same value m/M across its 256×256×1 elements.
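The m/M encoding can be sketched as follows. This is a minimal illustration that represents the single-channel array as a nested list and assumes that m is counted from 1; the helper name `make_attribute_array` and the choice of M = 12 are hypothetical.

```python
from typing import List


def make_attribute_array(
    class_index: int, num_classes: int, height: int, width: int
) -> List[List[float]]:
    """Build a height x width single-channel array filled with m / M."""
    if not 1 <= class_index <= num_classes:
        raise ValueError("class_index must lie in 1..num_classes")
    value = class_index / num_classes
    # Every element carries the same attribute value.
    return [[value] * width for _ in range(height)]


# Assumed example: red is the 3rd of 12 color classes, so each element is 3/12.
color_array = make_attribute_array(class_index=3, num_classes=12, height=256, width=256)
```

One such array would be built per attribute (color, pattern, material, and so on) and supplied alongside the conditional input image as an additional channel.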
In general, the generator 108 seeks to produce a generator output image 114 that embodies the characteristics of both the conditional input image 204 and the conditional input value(s). For instance, the generator 108 seeks to produce a red-green-blue (RGB) generator output image that resembles the men's shirt in the conditional input image 204, and which embodies the attribute values specified in the attribute arrays. In one case, the generator output image 114 has a size of f×g×3, e.g., 256×256×3.
As described above, the input-preparing component 304 can produce the conditional input image 308 by applying any type of transformation to an original input image. In one non-limiting example, the input-preparing component 304 can apply any type of edge-preserving filter on a real image 316 to produce the conditional input image 308, such as a bilateral filter, anisotropic diffusion filter, an edge-preserving domain transform filter, and so on. A non-limiting example of a domain transform filter is described in Gastal, et al., “Domain Transform for Edge-Aware Image and Video Processing,” in ACM Transactions on Graphics, Article No. 69, July 2011, 11 pages. In other cases, the input-preparing component 304 can apply a neural network that performs a style transformation of any type on the input image 316. A non-limiting example of a style transformation engine is described in Gatys, et al., “Image Style Transfer Using Convolutional Neural Networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, 10 pages.
The training framework 102, image translator system 302, and image classifier system 402 can be distributed between user computing devices and the servers 504 in any manner. In one case, for instance, one or more servers implement the entirety of the training framework 102. Likewise, one or more servers can implement the image translator system 302 and the image classifier system 402. In other cases, any (or all) of these elements can be implemented by each user computing device in local fashion. In still other cases, some of the functionality of these elements can be implemented by the servers 504 and some of the functionality of these elements can be implemented by individual user computing devices 502.
A.2. Illustrative Applications
The operation of the application shown in
In the merely illustrative case of
In operation 7.1, the image-editing engine 604 receives input information that specifies one or more selections that the user has made using the graphical control 906. In operation 7.2, the image-editing engine 604 uses the input-preparing component 304 of the image translator system 302 to generate translator input information. More specifically, the input-preparing component 304 generates a conditional input image by applying a predefined image transformation to the input image 904 that appears in the electronic document 902 (assuming that the conditional input image has not already been generated and pre-stored). In the merely illustrative example of the figures, the input-preparing component 304 applies a transformation to produce a grayscale version of the input image 904, but many other kinds of transformations can be used. The input-preparing component 304 then generates one or more attribute values that encode the selections that the user has made via the graphical control 906. In operation 7.3, the image-editing engine 604 relies on the image translator 312 to map the translator input information to a translator output image 908. In operation 7.4, the image-editing engine 604 sends the translator output image 908 (or link thereto) to the user computing device 702. The document viewer 704 displays the translator output image 908 in the same electronic document 902 in which the input image 904 appears, although this need not be the case in all applications. The translator output image 908 shows a version of the T-Shirt depicted in the input image 904 with a red-star pattern.
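The grayscale transformation mentioned in operation 7.2 can be sketched as below. The disclosure does not state which conversion is used; this example assumes the common ITU-R BT.601 luma weights and represents an image as rows of (R, G, B) tuples with 8-bit values. The function name `to_grayscale` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def to_grayscale(rgb_image: List[List[Pixel]]) -> List[List[int]]:
    """Convert an RGB image into a single-channel grayscale image."""
    # Assumed weights: ITU-R BT.601 luma coefficients.
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in rgb_image
    ]


# A tiny 2x2 stand-in for the input image.
input_image = [
    [(255, 255, 255), (0, 0, 0)],
    [(255, 0, 0), (0, 255, 0)],
]
conditional_image = to_grayscale(input_image)
```

Any other predefined transformation (for example, an edge-preserving filter) could take the place of `to_grayscale` in this role.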
An optional graphical control 910 corresponds to a slide bar that allows the user to adjust the size of the stars in the star pattern in the translator output image 908. That is, the user can increase the size of the stars by moving a sliding member of the slide bar to the right. In response, the image translator system 302 dynamically adjusts the translator output image 908. Other implementations can provide dynamic control in different ways, such as by including additional slide bars associated with additional attributes.
In operation 8.1, the item-serving engine 606 receives a trigger event that indicates that the user has invoked its services. In this case, the trigger event corresponds to a determination that the user has loaded the electronic document 902 which includes the input image 904. In operation 8.2, the item-serving engine 606 relies on the input-preparing component 404 of the image classifier system 402 to generate classifier input information. The classifier input information includes the input image 904 and a transformed version of the input image 904. Again, the input-preparing component 404 produces a grayscale version of the input image 904 (if not already generated and pre-stored), but other implementations can use other types of transformations. In operation 8.3, based on the classifier input information, the image classifier 408 generates plural attribute values associated with the input image 904. These attribute values describe respective properties of the T-Shirt depicted in the input image 904. For example, one attribute value may describe a particular color of the T-Shirt, another attribute value may specify the T-Shirt's manufacturer, another attribute value may specify the T-Shirt's material, another attribute value may specify the T-Shirt's pattern, and so on. The image classifier 408 also generates an output result that indicates whether the input image 904 is real or fake. This information is an artifact of the fact that the image classifier 408 has been trained using the GAN 104 of
In operation 8.4, the item-serving engine 606 next identifies at least one supplemental content item based on the identified attributes. For example, assume that the item-serving engine 606 includes or has access to a data store 802 of supplemental content items, such as digital advertisements. Further assume that each supplemental content item in the data store 802 is tagged with metadata that describes its attribute values. The item-serving engine 606 can perform a search to find one or more supplemental content items in the data store 802 having attribute values that best match the attribute values of the input image 904. For example, the item-serving engine 606 can locate a digital advertisement that pertains to men's T-Shirts for sale by a particular manufacturer, where those T-Shirts include various patterns.
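The search in operation 8.4 can be sketched in a simple form. The disclosure does not specify a matching metric; this example assumes that “best match” means the largest number of identical attribute values. The item records, identifiers, and function names are hypothetical.

```python
from typing import Dict, List

Attributes = Dict[str, str]


def count_matches(item_attributes: Attributes, query: Attributes) -> int:
    """Count how many query attribute values the item shares."""
    return sum(1 for key, value in query.items() if item_attributes.get(key) == value)


def find_best_items(items: List[dict], query: Attributes, top_k: int = 1) -> List[dict]:
    """Return the top_k items ranked by number of matching attribute values."""
    ranked = sorted(
        items,
        key=lambda item: count_matches(item["attributes"], query),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical metadata-tagged supplemental content items.
data_store = [
    {"id": "ad-1", "attributes": {"color": "blue", "material": "cotton"}},
    {"id": "ad-2", "attributes": {"color": "red", "material": "cotton", "pattern": "stars"}},
    {"id": "ad-3", "attributes": {"color": "red", "material": "wool"}},
]
# Attribute values that the image classifier might report for the input image.
image_attributes = {"color": "red", "material": "cotton", "pattern": "stars"}
best = find_best_items(data_store, image_attributes)
```

A production system would more likely use an index or a learned similarity measure; the linear scan here only illustrates the matching idea.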
In operation 8.5, the item-serving engine 606 sends one or more supplemental content items (or URLs associated therewith) to the document viewer 704.
The above-described applications are set forth in the spirit of illustration, not limitation. Many other applications can make use of the image translator system 302 and the image classifier system 402 of
A.3. The Generator and the Discriminator
This subsection sets forth illustrative details of the generator 108 and the discriminator 110. This subsection also indirectly describes the architecture of the image translator 312 and the image classifier 408, since the image translator 312 adopts the same architecture as the generator 108, and the image classifier 408 adopts the same architecture as the discriminator 110.
The generator 108 and the discriminator 110 can each include one or more convolutional neural networks (CNNs). Therefore, in advance of explaining illustrative architectures of the generator 108 and the discriminator 110, the principal features of an illustrative CNN 1002 will be described below with reference to
The CNN 1002 shown in
In each convolutional operation, a convolutional component moves an n×m kernel across an input image (where “input image” in this general context refers to whatever image is fed to the convolutional component). In some implementations, at each position of the kernel, the convolutional component generates the dot product of the kernel values with the underlying pixel values of the image. The convolutional component stores that dot product as an output value in an output image at a position corresponding to the current location of the kernel. More specifically, the convolutional component can perform the above-described operation for a set of different kernels having different machine-learned kernel values. Each kernel corresponds to a different pattern. In early layers of processing, a convolutional component may apply a kernel that serves to identify relatively primitive patterns (such as edges, corners, etc.) in the image. In later layers, a convolutional component may apply a kernel that finds more complex shapes (such as shapes that correspond to particular kinds of objects in the input image 1006, etc.).
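The sliding dot product described above can be sketched for a single kernel and a single-channel image. This minimal example assumes a stride of one and no padding, and, as is usual in CNN practice, does not flip the kernel. The kernel values are hand-picked for illustration, whereas a trained network would learn them.

```python
from typing import List

Matrix = List[List[float]]


def convolve2d(image: Matrix, kernel: Matrix) -> Matrix:
    """Slide an n x m kernel over the image and store each dot product."""
    n, m = len(kernel), len(kernel[0])
    out_height = len(image) - n + 1
    out_width = len(image[0]) - m + 1
    output = []
    for i in range(out_height):
        row = []
        for j in range(out_width):
            # Dot product of the kernel with the pixels it currently covers.
            dot = sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(n)
                for b in range(m)
            )
            row.append(dot)
        output.append(row)
    return output


# An image with a vertical edge between its second and third columns.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A hand-picked kernel that responds to dark-to-bright vertical edges.
vertical_edge_kernel = [[-1, 1], [-1, 1]]
response = convolve2d(image, vertical_edge_kernel)
```

The output is non-zero only where the kernel straddles the edge, which illustrates how an early-layer kernel can identify a primitive pattern.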
In each pooling operation, a pooling component moves a window of predetermined size across an input image (where the input image corresponds to whatever image is fed to the pooling component). The pooling component then performs some aggregating/summarizing operation with respect to the values of the input image enclosed by the window, such as by identifying and storing the maximum value in the window, generating and storing the average of the values in the window, etc. A pooling operation may also be referred to as a down-sampling operation. Although not shown, a counterpart up-sampling component can expand an input image into a larger-sized output image, e.g., by duplicating values in the input image within the output image.
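The pooling and up-sampling operations can be sketched together. This example assumes max pooling with non-overlapping 2×2 windows and up-sampling by value duplication, which are two of the options named in the text; the function names are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def max_pool(image: Matrix, size: int = 2) -> Matrix:
    """Keep the maximum value in each non-overlapping size x size window."""
    height, width = len(image), len(image[0])
    return [
        [
            max(image[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, width - size + 1, size)
        ]
        for i in range(0, height - size + 1, size)
    ]


def upsample_by_duplication(image: Matrix, factor: int = 2) -> Matrix:
    """Expand the image by repeating each value factor times in both directions."""
    output = []
    for row in image:
        wide_row = [value for value in row for _ in range(factor)]
        output.extend(list(wide_row) for _ in range(factor))
    return output


image = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 1, 2, 3],
    [0, 2, 4, 1],
]
pooled = max_pool(image)
restored = upsample_by_duplication(pooled)
```

Note that `restored` has the original dimensions but not the original detail, which is why down-sampling is described as losing fine-grained information.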
A fully-connected component is often preceded by a flattening component (not shown in
The last fully-connected layer of the CNN 1002 provides a final representation of features associated with the input image 1006. Although not shown, one or more classification components may operate on the features to generate output conclusions. For example, the CNN 1002 may include a softmax output operation, a support vector machine (SVM) classifier, etc.
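As an illustration of the softmax output operation mentioned above, the following minimal sketch converts a list of raw feature scores into class probabilities. The input scores are hypothetical; subtracting the maximum score before exponentiation is a standard numerical-stability measure and does not change the result.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to one."""
    peak = max(scores)
    # Shift by the maximum so that exp() never overflows.
    exponentials = [math.exp(score - peak) for score in scores]
    total = sum(exponentials)
    return [value / total for value in exponentials]


# Hypothetical scores for three classes of one attribute.
probabilities = softmax([2.0, 1.0, 0.1])
```

The class with the largest score receives the largest probability, so taking the arg-max of the probabilities yields the predicted class.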
The architecture 1102 shown in
In addition, the first series of layers 1104 feeds feature information to like-dimensioned layers in the second series of layers 1108. This feeding of information across the like-dimensioned layers has the resultant effect of preserving fine-detailed information that would otherwise be lost by down-sampling the generator input information into the low-dimensioned representation 1106.
In the specific example of
In the first stage (S1), a generator 1202 maps generator input information 1204 into a generator output image G1 1206. The generator input information 1204 includes a conditional input image 1208 and one or more conditional input values 1210. The generator 1202 itself includes a down-sampling component 1212 that corresponds to the first series of layers 1104 shown in
In an optional second phase of training (S2), the output of the first phase of training can be applied as input to the second phase of training. The second phase of training has the end effect of further improving the quality of the trained models produced by the GAN 104. The second phase of training uses a second-stage generator 1224 in combination with a second-stage discriminator 1226. The second-stage generator 1224 is adapted to accept different generator input information compared to the first-stage generator 1202, but is otherwise like the first-stage generator 1202. The second-stage generator 1224 begins its training without reference to the trained parameter values learned in the first phase. The second-stage discriminator 1226 has the same architecture as the first-stage discriminator 1218; unlike the generator 1224, the second-stage discriminator 1226 begins its training using the model parameter values learned in the first phase.
More specifically, in the second phase, the second-stage generator 1224 maps generator input information 1228 that includes the generator output image G1 1206 generated by the first phase into a generator output image G2 1230. The generator input information 1228 also includes the same conditional input image 1208 and the conditional input values 1210 that were fed to the first-stage generator 1202 in the first phase. The second-stage discriminator 1226 again maps discriminator input information 1232 into classification results 1234. The discriminator input information 1232 includes a conditional input image and either the generator output image G2 1230 or the real counterpart thereof.
If only the first stage is used to train the models, the inference-stage image translator 312 (of
The CNN 1304 can include any implementation-specific combination of components. A first component 1314 can apply a convolutional operation, a batch normalization operation, and a ReLU operation in that order. The first component 1314 also has the effect of down-converting the four-channel discriminator input information 1306 into three-channel information for further processing by the remainder of the CNN 1304. (The input to the first component 1314 is four channels because it includes a one-channel grayscale conditional input image and a three-channel red-green-blue discriminator input image.) Another convolutional neural network (CNN) 1316 performs further processing on the information fed to it by the first component 1314. In one implementation, the CNN 1316 may use the architecture of the VGG neural network described in Simonyan, et al., “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv:1409.1556v6 [cs.CV], Apr. 10, 2015, 14 pages, although any other neural network can be used to implement the CNN 1316. A flattening component 1318 converts the output information provided by the CNN 1316 into a single output vector. A fully-connected neural network 1320 performs further processing on the flattened output vector using one or more fully-connected layers that adopt any activation function (e.g., ReLU). An optional drop-out component 1322 selectively ignores or “drops out” output values produced by the fully-connected neural network 1320. This well-known drop-out operation helps reduce overfitting during the training operation.
A.4. Training Environment
In general, the training environment 1402 is capable of producing robust models based on a relatively modest corpus of training examples. This characteristic ensues, in part, from the multi-task nature of the training performed by the training framework 102. That is, the multi-task training imposes additional constraints that have the result of extracting additional insight from the training examples.
Further, the generator 108 serves the role of expanding the number of training examples fed to the discriminator 110, starting from an original corpus of labeled source images. For example, the generator 108 can modify a source image of a pair of red shoes in many different ways by feeding a conditional input image associated with this source image, together with different conditional input values. The conditional input values, for instance, may specify shoes of different colors, textures, etc. This is another reason why the training environment 1402 can be said to produce robust models starting with a relatively modest corpus of training examples.
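The expansion effect can be made concrete with a small sketch that enumerates the combinations of conditional input values that could be paired with a single conditional input image. The attribute names and options are hypothetical; each resulting combination would be encoded and fed to the generator to yield one additional training example for the discriminator.

```python
from itertools import product
from typing import Dict, List


def enumerate_conditional_values(
    attribute_options: Dict[str, List[str]]
) -> List[Dict[str, str]]:
    """List every combination of attribute values, one dict per combination."""
    names = list(attribute_options)
    return [
        dict(zip(names, combination))
        for combination in product(*(attribute_options[name] for name in names))
    ]


# Hypothetical options for one source image of a pair of shoes.
options = {
    "color": ["red", "blue", "green"],
    "texture": ["leather", "canvas"],
}
combinations = enumerate_conditional_values(options)
```

With three colors and two textures, one labeled source image yields six generator inputs, and the count grows multiplicatively as further attributes are added.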
The parameter-updating system 106 of the training framework 102 can iteratively update the parameter values of the GAN 104 based on the following loss function:
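\[
\begin{aligned}
\mathcal{L}(G,D) ={} & \mathbb{E}_{x,y}\big[\log\big(\mathrm{mask}\cdot D(x,y)\big)\big] \\
& + \mathbb{E}_{x,z}\big[\log\big(1-\mathrm{mask}\cdot D(x,G(x,z))\big)\big] \\
& + \lambda\,\mathbb{E}_{x,y,z}\big[\lVert y-G(x,z)\rVert_{1}\big].
\end{aligned}
\]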
In this equation, the generator (G) 108 maps an observed image x and random noise vector z to an output image y. The generator 108 is trained to generate an output image that cannot be distinguished from real images by the discriminator (D) 110. The discriminator 110, in turn, is optimized to discriminate between the synthesized and real images. In other words, the generator 108 attempts to minimize the objective in the above equation, while the discriminator 110 attempts to maximize it. The third term of the above equation reduces blurring by requiring the generator 108 to produce output images that are close to respective ground-truth outputs. This portion of the equation uses the L1 distance (l1) to express this goal. λ is a weighting factor that governs the impact that the anti-blurring objective has on the training performed by the training framework 102. Finally, mask is a tensor of the same shape as D(x,y) with binary values 0 and 1. A value of 1 indicates that a corresponding label exists. This mask tensor has the end result of counting only labels that are available in the course of the training operation.
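The role of the mask and of the L1 term can be illustrated numerically. The following is a minimal sketch of one possible reading of the objective, not a definitive implementation: it assumes that entries whose mask value is 0 are simply skipped, that the adversarial terms take the usual log D and log(1 − D) form, and that expectations are replaced by sums over a single example. All function names and input values are hypothetical.

```python
import math
from typing import List


def masked_log_term(predictions: List[float], mask: List[int]) -> float:
    """Sum log(p) over entries whose label is available (mask value 1)."""
    return sum(math.log(p) for p, m in zip(predictions, mask) if m == 1)


def l1_distance(target: List[float], generated: List[float]) -> float:
    """Sum of absolute differences between ground truth and generator output."""
    return sum(abs(t - g) for t, g in zip(target, generated))


def objective(
    d_real: List[float],
    d_fake: List[float],
    mask: List[int],
    target: List[float],
    generated: List[float],
    lam: float,
) -> float:
    """Adversarial terms on real and generated inputs plus the weighted L1 term."""
    real_term = masked_log_term(d_real, mask)
    fake_term = masked_log_term([1.0 - p for p in d_fake], mask)
    return real_term + fake_term + lam * l1_distance(target, generated)


# Hypothetical values: the second discriminator output has no label (mask 0).
value = objective(
    d_real=[0.9, 0.5],
    d_fake=[0.1, 0.5],
    mask=[1, 0],
    target=[1.0, 0.0],
    generated=[0.8, 0.1],
    lam=10.0,
)
```

Changing the masked-out entries of `d_real` or `d_fake` leaves `value` unchanged, which is the sense in which only available labels are counted.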
B. Illustrative Processes
The processing by the discriminator 110 includes the following operations. In block 1602, the discriminator 110 receives discriminator input information that includes the conditional input image and a discriminator input image, the discriminator input image corresponding to either the generator output image or a real image that is not generated by the generator 108. In block 1604, the discriminator 110 produces plural attribute values based on the discriminator input information, each attribute value being associated with a characteristic of an object depicted by the discriminator input image. In block 1606, the discriminator 110 produces an output result based on the discriminator input information that indicates whether the discriminator input image is the generator output image or the real image.
In block 1608, the parameter-updating system 106 iteratively adjusts parameter values of the generator 108 and the discriminator 110. Upon completion of training, the parameter-updating system 106 provides a first trained model based on trained parameter values associated with the generator 108 for use by an image translator 312, and a second trained model based on trained parameter values associated with the discriminator 110 for use by an image classifier 408.
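The hand-off in block 1608 can be sketched as follows. This is a schematic illustration only, assuming that parameter values can be represented as flat lists of numbers; the names `TrainedModels`, `train`, and `toy_update_step` are hypothetical, and the update step is a placeholder rather than an actual gradient computation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Params = List[float]


@dataclass
class TrainedModels:
    translator_params: Params  # derived from the generator
    classifier_params: Params  # derived from the discriminator


def train(
    generator_params: Params,
    discriminator_params: Params,
    update_step: Callable[[Params, Params], Tuple[Params, Params]],
    num_iterations: int,
) -> TrainedModels:
    """Iteratively adjust both parameter sets, then keep both models."""
    for _ in range(num_iterations):
        generator_params, discriminator_params = update_step(
            generator_params, discriminator_params
        )
    # Both networks are retained after training, one per downstream use.
    return TrainedModels(list(generator_params), list(discriminator_params))


def toy_update_step(gen: Params, disc: Params) -> Tuple[Params, Params]:
    # Placeholder update: a real system would apply gradients here.
    return [g + 1.0 for g in gen], [d + 1.0 for d in disc]


models = train([0.0, 1.0], [2.0], toy_update_step, num_iterations=3)
```

The point of the sketch is the return value: the discriminator's trained parameter values are kept for use by the image classifier instead of being discarded.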
C. Representative Computing Functionality
The computing device 1902 can include one or more hardware processors 1904. The hardware processor(s) 1904 can include, without limitation, one or more Central Processing Units (CPUs), and/or one or more Graphics Processing Units (GPUs), and/or one or more Application Specific Integrated Circuits (ASICs), etc. More generally, any hardware processor can correspond to a general-purpose processing unit or an application-specific processor unit.
The computing device 1902 can also include computer-readable storage media 1906, corresponding to one or more computer-readable media hardware units. The computer-readable storage media 1906 retains any kind of information 1908, such as machine-readable instructions, settings, data, etc. Without limitation, for instance, the computer-readable storage media 1906 may include one or more solid-state devices, one or more magnetic hard disks, one or more optical disks, magnetic tape, and so on. Any instance of the computer-readable storage media 1906 can use any technology for storing and retrieving information. Further, any instance of the computer-readable storage media 1906 may represent a fixed or removable unit of the computing device 1902. Further, any instance of the computer-readable storage media 1906 may provide volatile or non-volatile retention of information.
The computing device 1902 can utilize any instance of the computer-readable storage media 1906 in different ways. For example, any instance of the computer-readable storage media 1906 may represent a hardware memory unit (such as Random Access Memory (RAM)) for storing transient information during execution of a program by the computing device 1902, and/or a hardware storage unit (such as a hard disk) for retaining/archiving information on a more permanent basis. In the latter case, the computing device 1902 also includes one or more drive mechanisms 1910 (such as a hard drive mechanism) for storing and retrieving information from an instance of the computer-readable storage media 1906.
The computing device 1902 may perform any of the functions described above when the hardware processor(s) 1904 carry out computer-readable instructions stored in any instance of the computer-readable storage media 1906. For instance, the computing device 1902 may carry out computer-readable instructions to perform each block of the processes described in Section B.
Alternatively, or in addition, the computing device 1902 may rely on one or more other hardware logic units 1912 to perform operations using a task-specific collection of logic gates. For instance, the hardware logic unit(s) 1912 may include a fixed configuration of hardware logic gates, e.g., that are created and set at the time of manufacture, and thereafter unalterable. Alternatively, or in addition, the other hardware logic unit(s) 1912 may include a collection of programmable hardware logic gates that can be set to perform different application-specific tasks. The latter category of devices includes, but is not limited to, Programmable Array Logic Devices (PALs), Generic Array Logic Devices (GALs), Complex Programmable Logic Devices (CPLDs), Field-Programmable Gate Arrays (FPGAs), etc.
In some cases (e.g., in the case in which the computing device 1902 represents a user computing device), the computing device 1902 also includes an input/output interface 1916 for receiving various inputs (via input devices 1918), and for providing various outputs (via output devices 1920). Illustrative input devices include a keyboard device, a mouse input device, a touchscreen input device, a digitizing pad, one or more static image cameras, one or more video cameras, one or more depth camera systems, one or more microphones, a voice recognition mechanism, any movement detection mechanisms (e.g., accelerometers, gyroscopes, etc.), and so on. One particular output mechanism may include a display device 1922 and an associated graphical user interface presentation (GUI) 1924. The display device 1922 may correspond to a liquid crystal display device, a light-emitting diode display (LED) device, a cathode ray tube device, a projection mechanism, etc. Other output devices include a printer, one or more speakers, a haptic output mechanism, an archival mechanism (for storing output information), and so on. The computing device 1902 can also include one or more network interfaces 1926 for exchanging data with other devices via one or more communication conduits 1928. One or more communication buses 1930 communicatively couple the above-described units together.
The communication conduit(s) 1928 can be implemented in any manner, e.g., by a local area computer network, a wide area computer network (e.g., the Internet), point-to-point connections, etc., or any combination thereof. The communication conduit(s) 1928 can include any combination of hardwired links, wireless links, routers, gateway functionality, name servers, etc., governed by any protocol or combination of protocols.
The following summary provides a non-exhaustive set of illustrative examples of the technology set forth herein.
According to a first example, one or more computing devices are described for performing machine training. The computing device(s) include a conditional generative adversarial network (GAN) including a generator neural network and a discriminator neural network. The generator neural network is configured to: receive generator input information that includes a conditional input image and one or more conditional input values; and transform the generator input information into a generator output image, wherein the one or more conditional input values describe one or more desired characteristics of the generator output image. The discriminator neural network is configured to: receive discriminator input information that includes the conditional input image and a discriminator input image, the discriminator input image corresponding to either the generator output image or a real image that is not generated by the generator neural network; produce plural attribute values based on the discriminator input information, each attribute value being associated with a characteristic of an object depicted by the discriminator input image; and produce an output result based on the discriminator input information that indicates whether the discriminator input image is the generator output image or the real image. The device(s) also include a parameter-updating system for iteratively adjusting parameter values of the generator neural network and the discriminator neural network. Upon completion of training, the parameter-updating system provides a first trained model based on trained parameter values associated with the generator neural network for use by an image translator, and a second trained model based on trained parameter values associated with the discriminator neural network for use by an image classifier. The GAN and the parameter-updating system are implemented by hardware logic circuitry provided by the one or more computing devices.
According to a second example, the conditional input image is a transformed version of the real image.
According to a third example, the discriminator neural network includes a convolutional neural network for mapping the discriminator input information into feature information, and plural individual classifier neural networks for respectively producing the plural attribute values and the output result that conveys whether the discriminator input information includes the generator output image or the real image.
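The shared-trunk, multi-head arrangement of the third example can be sketched in a few lines. The code below is a minimal illustration, not the described discriminator neural network: a two-number summary stands in for the convolutional neural network, each head is a single linear layer followed by a softmax, and all names (`shared_features`, `linear_head`, `discriminate`, the `real_vs_fake` head) and weight values are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_features(conditional_image: Sequence[float],
                    input_image: Sequence[float]) -> List[float]:
    # Stand-in for the convolutional trunk: both images are combined and
    # reduced to a small feature vector (mean and value range).
    pixels = list(conditional_image) + list(input_image)
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]


def linear_head(features: Sequence[float],
                weights: Sequence[Sequence[float]]) -> List[float]:
    # One weight row per class; returns a probability per class.
    return softmax([sum(w * f for w, f in zip(row, features)) for row in weights])


def discriminate(conditional_image: Sequence[float],
                 input_image: Sequence[float],
                 heads: Dict[str, Sequence[Sequence[float]]]) -> Dict[str, List[float]]:
    # The feature information is computed once and shared by every head.
    features = shared_features(conditional_image, input_image)
    return {name: linear_head(features, weights) for name, weights in heads.items()}


heads = {
    "category": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # three classes
    "color": [[0.2, -0.1], [-0.3, 0.4]],               # two classes
    "real_vs_fake": [[1.0, -1.0], [-1.0, 1.0]],        # real or synthetic
}
out = discriminate([0.1, 0.2], [0.3, 0.4], heads)
```

Because every head reads the same feature information, supervision from the attribute heads and from the real-versus-synthetic head both shape the shared trunk, which is the multi-task aspect of the design.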
According to a fourth example, a method is described for training a conditional generative adversarial network (GAN). The method includes using a generator neural network of the GAN to: receive generator input information that includes a conditional input image and one or more conditional input values; and transform the generator input information into a generator output image, wherein the one or more conditional input values describe one or more desired characteristics of the generator output image. The method also includes using a discriminator neural network of the GAN to: receive discriminator input information that includes the conditional input image and a discriminator input image, the discriminator input image corresponding to either the generator output image or a real image that is not generated by the generator neural network; produce plural attribute values based on the discriminator input information, each attribute value being associated with a characteristic of an object depicted by the discriminator input image; and produce an output result based on the discriminator input information that indicates whether the discriminator input image is the generator output image or the real image. The method further includes iteratively adjusting parameter values of the generator neural network and the discriminator neural network. Upon completion of training, the method provides a first trained model based on trained parameter values associated with the generator neural network for use by an image translator, and a second trained model based on trained parameter values associated with the discriminator neural network for use by an image classifier.
According to a fifth example, relating to the fourth example, the conditional input image is a transformed version of the real image.
According to a sixth example, relating to the fourth example, the discriminator neural network includes a convolutional neural network for mapping the discriminator input information into feature information, and plural individual classifier neural networks for respectively producing the plural attribute values and the output result that conveys whether the discriminator input information includes the generator output image or the real image.
According to a seventh example, relating to the fourth example, the method further includes using the image translator by: receiving translator input information that includes a translator conditional input image, the translator conditional input image being generated by transforming a translator input image, the translator input information also including one or more translator conditional input values, the one or more translator conditional input values describing one or more desired characteristics of a translator output image; and using the image translator to transform the translator input information into the translator output image.
According to an eighth example, relating to the seventh example, the method further includes: providing an electronic document to a user computing device operated by a user, the electronic document including the translator input image and a graphical control that enables the user to enter the one or more translator conditional input values. The one or more translator conditional input values are received in response to interaction by the user with the graphical control. The method further includes sending the translator output image to the user computing device for presentation to the user.
According to a ninth example, relating to the eighth example, the translator output image is presented in the electronic document in which the translator input image appears.
According to a tenth example, relating to the fourth example, the method further includes using the image classifier by: receiving classifier input information that includes a classifier input image and a classifier conditional input image, the classifier conditional input image being generated by transforming the classifier input image; producing plural classifier attribute values based on the classifier input information, each classifier attribute value being associated with a characteristic of an object depicted by the classifier input image; and producing a classifier output result based on the classifier input information that indicates whether the classifier input image is synthetic or real.
According to an eleventh example, relating to the tenth example, the classifier input image appears on an electronic document presented to a user computing device operated by a user, and the method further includes: identifying a supplemental content item based on the plural classifier attribute values; and sending the supplemental content item to the user computing device for presentation to the user.
According to a twelfth example, relating to the eleventh example, the supplemental content item is presented in the electronic document in which the classifier input image appears.
According to a thirteenth example, an image translator is described that is produced by the method of the fourth example.
According to a fourteenth example, an image classifier is described that is produced by the method of the fourth example.
According to a fifteenth example, an image classification system is described that is implemented by one or more computing devices. The image classification system includes hardware logic circuitry configured to: receive a classifier input image to be classified; transform the classifier input image into a classifier conditional input image, the classifier input image and the classifier conditional input image corresponding to classifier input information; use an image classifier neural network provided by the hardware logic circuitry to produce plural classifier attribute values based on the classifier input information, each classifier attribute value being associated with a characteristic of an object depicted by the classifier input image; and use the image classifier neural network to produce a classifier output result based on the classifier input information that indicates whether the classifier input image is synthetic or real. The image classifier neural network has a model that is trained by iteratively adjusting parameter values of a discriminator neural network in a generative adversarial network (GAN).
According to a sixteenth example, relating to the fifteenth example, the GAN includes a generator neural network configured to: receive generator input information that includes a generator conditional input image and one or more generator conditional input values; and transform the generator input information into a generator output image, the one or more generator conditional input values describing one or more desired characteristics of the generator output image. The discriminator neural network is configured to: receive discriminator input information that includes the generator conditional input image and a discriminator input image, the discriminator input image corresponding to either the generator output image or a real image that is not generated by the generator neural network; produce plural discriminator attribute values based on the discriminator input information, each discriminator attribute value being associated with a characteristic of an object depicted by the discriminator input image; and produce a discriminator output result based on the discriminator input information that indicates whether the discriminator input image is the generator output image or a real image.
According to a seventeenth example, relating to the sixteenth example, each generator conditional input image fed to the generator neural network and the discriminator neural network is produced by a same image transformation that is used to produce the classifier conditional input image.
According to an eighteenth example, relating to the fifteenth example, the classifier neural network includes a convolutional neural network for mapping the classifier input information into feature information, and plural individual classifier neural networks for respectively producing the plural classifier attribute values and the classifier output result that conveys whether the classifier input information is synthetic or real.
According to a nineteenth example, relating to the fifteenth example, the plural classifier attribute values include any two or more classes selected from: a category class; a color class; a department class; a material class; and a pattern class.
According to a twentieth example, relating to the fifteenth example, the classifier input image originates from an electronic document with which a user is interacting via a user computing device.
A twenty-first example corresponds to any combination (e.g., any logically consistent permutation or subset) of the above-referenced first through twentieth examples.
A twenty-second example corresponds to any method counterpart, device counterpart, system counterpart, means-plus-function counterpart, computer-readable storage medium counterpart, data structure counterpart, article of manufacture counterpart, graphical user interface presentation counterpart, etc. associated with the first through twenty-first examples.
In closing, the functionality described herein can employ various mechanisms to ensure that any user data is handled in a manner that conforms to applicable laws, social norms, and the expectations and preferences of individual users. For example, the functionality can allow a user to expressly opt in to (and then expressly opt out of) the provisions of the functionality. The functionality can also provide suitable security mechanisms to ensure the privacy of the user data (such as data-sanitizing mechanisms, encryption mechanisms, password-protection mechanisms, etc.).
Further, the description may have set forth various concepts in the context of illustrative challenges or problems. This manner of explanation is not intended to suggest that others have appreciated and/or articulated the challenges or problems in the manner specified herein. Further, this manner of explanation is not intended to suggest that the subject matter recited in the claims is limited to solving the identified challenges or problems; that is, the subject matter in the claims may be applied in the context of challenges or problems other than those described herein.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Number | Date | Country | |
---|---|---|---|
20210303927 A1 | Sep 2021 | US |