The present application relates to generating style pictures, and in particular but not limited to, generating style pictures based on multiple models using neural networks.
Some neural networks, for example, style-based generative adversarial networks (StyleGANs), provide an adversarial generative modeling framework that shows a powerful generation ability in learning the structures of human faces and drawing virtual human faces. Further, technologies such as StyleGAN blending can generate paired virtual human photos and stylized images that have stylized special effects while maintaining the identity of the face. Moreover, the generated paired data facilitate further training of a downstream deep network model to convert users' input photos into pictures with stylized special effects.
The existing StyleGAN blending technology first trains a StyleGAN model, i.e., a base model, using a face dataset, such as Flickr-Faces-HQ (FFHQ), to generate a series of realistic face images. Then, a batch of face special effects pictures with specific styles are selected to further train and optimize the base model and obtain a new model, i.e., a transferred model, that can generate special effects. However, the base model and the transferred model cannot guarantee good consistency of facial identity features. As a result, the two models cannot create a personalized style image or provide matching data for downstream models.
Furthermore, the StyleGAN blending technology interpolates weights of different layers in the two models, i.e., the base model and the transferred model, to obtain a new model, i.e., an interpolated model. The interpolated model can retain the identity characteristics generated by the base model while maintaining the style characteristics of the transferred model. A balance between the identity characteristics and the style characteristics can be obtained by adjusting the interpolation strategies in different layers.
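As a non-limiting sketch only, the layer-wise weight interpolation described above may be illustrated as follows, assuming the base model and the transferred model share the same architecture and are available as PyTorch state dictionaries; the helper name blend_models, the layer-name prefixes, and the alpha schedule are hypothetical and shown purely for explanation.

import torch

def blend_models(base_state, transfer_state, alpha_per_layer):
    """Linearly interpolate two state dicts that share the same keys.

    alpha_per_layer maps a key prefix (e.g. a resolution block name) to a
    blend weight in [0, 1]; 0 keeps the base weights, 1 keeps the
    transferred weights.  Keys without a matching prefix keep the base
    weights unchanged.
    """
    blended = {}
    for key, base_w in base_state.items():
        alpha = 0.0
        for prefix, a in alpha_per_layer.items():
            if key.startswith(prefix):
                alpha = a
                break
        if torch.is_floating_point(base_w):
            blended[key] = (1.0 - alpha) * base_w + alpha * transfer_state[key]
        else:
            blended[key] = base_w  # non-float buffers are copied as-is
    return blended

# Hypothetical per-resolution schedule; shifting which layers follow which
# model trades identity characteristics against style characteristics.
alpha_schedule = {"synthesis.b4": 1.0, "synthesis.b8": 1.0, "synthesis.b1024": 0.0}
base = {"synthesis.b4.w": torch.zeros(2, 2), "synthesis.b1024.w": torch.zeros(2, 2)}
transfer = {"synthesis.b4.w": torch.ones(2, 2), "synthesis.b1024.w": torch.ones(2, 2)}
print(blend_models(base, transfer, alpha_schedule)["synthesis.b4.w"])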
The existing StyleGAN blending technology relies on a large amount of uniformly styled data to train the transferred model. However, it is hard to guarantee pictures or data with a uniform style: works created by artists are often expensive, the samples are few, and the styles are not completely consistent. Such inadequate pictures or data cause the transferred model to fail to converge to a high-quality style effect. Furthermore, it is hard to guarantee the quality of the images generated by the final interpolated model. As a result, it is difficult to meet the aesthetic requirements of the business side.
Moreover, when there is a need to mix different styles of works to create a new style, the existing technology cannot controllably blend the style effects of different models, such as the hair features of one style and the facial features of another style.
The present disclosure provides examples of techniques relating to generating one or more style pictures by mixing generation effects of different interpolated models in different regions without adding additional data for training downstream models.
According to a first aspect of the present disclosure, there is provided a method for generating a style picture. The method may include obtaining one or more models by training a neural network.
Further, the method may include obtaining a plurality of interpolated models based on the one or more models, generating a plurality of pictures by the plurality of interpolated models, and generating the style picture by combining two or more pictures in the plurality of pictures using one or more model-specific alpha masks.
According to a second aspect of the present disclosure, there is provided an apparatus for generating a style picture. The apparatus may include one or more processors and a memory configured to store instructions executable by the one or more processors. The one or more processors, upon execution of the instructions, are configured to obtain one or more models by training a neural network.
Further, the one or more processors may be configured to obtain a plurality of interpolated models based on the one or more models, generate a plurality of pictures by the plurality of interpolated models, and generate the style picture by combining two or more pictures in the plurality of pictures using one or more model-specific alpha masks.
According to a third aspect of the present disclosure, there is provided a non-transitory computer readable storage medium including instructions stored therein. Upon execution of the instructions by one or more processors, the instructions may cause the one or more processors to perform acts including obtaining one or more models by training a neural network.
Further, the instructions may cause the one or more processors to perform acts including obtaining a plurality of interpolated models based on the one or more models, generating a plurality of pictures by the plurality of interpolated models, and generating the style picture by combining two or more pictures in the plurality of pictures using one or more model-specific alpha masks.
A more particular description of the examples of the present disclosure will be rendered by reference to specific examples illustrated in the appended drawings. Given that these drawings depict only some examples and are therefore not to be considered limiting in scope, the examples will be described and explained with additional specificity and detail through the use of the accompanying drawings.
Reference will now be made in detail to specific implementations, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.
Reference throughout this specification to “one embodiment,” “an embodiment,” “an example,” “some embodiments,” “some examples,” or similar language means that a particular feature, structure, or characteristic described is included in at least one embodiment or example. Features, structures, elements, or characteristics described in connection with one or some embodiments are also applicable to other embodiments, unless expressly specified otherwise.
Throughout the disclosure, the terms “first,” “second,” “third,” etc. are all used as nomenclature only for references to relevant elements, e.g., devices, components, compositions, steps, etc., without implying any spatial or chronological orders, unless expressly specified otherwise. For example, a “first device” and a “second device” may refer to two separately formed devices, or two parts, components or operational states of a same device, and may be named arbitrarily.
The terms “module,” “sub-module,” “circuit,” “sub-circuit,” “circuitry,” “sub-circuitry,” “unit,” or “sub-unit” may include memory (shared, dedicated, or group) that stores code or instructions that can be executed by one or more processors. A module may include one or more circuits with or without stored code or instructions. The module or circuit may include one or more components that are directly or indirectly connected. These components may or may not be physically attached to, or located adjacent to, one another.
As used herein, the term “if” or “when” may be understood to mean “upon” or “in response to” depending on the context. These terms, if they appear in a claim, may not indicate that the relevant limitations or features are conditional or optional. For example, a method may comprise steps of: i) when or if condition X is present, function or action X′ is performed, and ii) when or if condition Y is present, function or action Y′ is performed. The method may be implemented with both the capability of performing function or action X′ and the capability of performing function or action Y′. Thus, the functions X′ and Y′ may both be performed, at different times, on multiple executions of the method.
A unit or module may be implemented purely by software, purely by hardware, or by a combination of hardware and software. In a pure software implementation, for example, the unit or module may include functionally related code blocks or software components, that are directly or indirectly linked together, so as to perform a particular function.
In some examples, the base model 101 may include a plurality of resolution layers, BL_1, BL_2, . . . , BL_N-1, and BL_N, that are responsible for different features in the generated pictures, where N is a positive integer greater than 1. The plurality of resolution layers, BL_1, BL_2, . . . , BL_N-1, and BL_N, respectively correspond to different resolutions. The resolution of the resolution layer BL_i may be higher than the resolution of the resolution layer BL_i-1, where i is a positive integer between 2 and N. The resolution layer BL_1 may correspond to the lowest resolution while the resolution layer BL_N may correspond to the highest resolution.
In some examples, in the base model 101, the resolution layer BL_1 may correspond to a resolution of 4×4, and the resolution layer BL_2 may correspond to a resolution of 8×8. Moreover, the resolution layer BL_N-1 may correspond to a resolution of 512×512, and the resolution layer BL_N may correspond to a resolution of 1024×1024.
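For illustration only, under the doubling convention described above (layer 1 at 4×4, each subsequent layer doubling the resolution), the resolution handled by layer i can be written as 4 × 2^(i−1); a minimal sketch, assuming N = 9 for a 1024×1024 output:

def layer_resolution(i: int) -> int:
    # Layer 1 -> 4x4, layer 2 -> 8x8, ..., layer i -> (4 * 2**(i - 1)) square.
    return 4 * 2 ** (i - 1)

N = 9  # assumed value: with N = 9, the last layer corresponds to 1024x1024
print([layer_resolution(i) for i in range(1, N + 1)])
# prints [4, 8, 16, 32, 64, 128, 256, 512, 1024]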
As illustrated in
In some other examples, different transferred models may be generated in different training periods when the dataset of one style is used for training the base model 101.
The transferred model 102 may include a plurality of resolution layers, TL_1, TL_2, . . . , TL_N-1, and TL_N, as illustrated in
In some examples, in the transferred model 102, the resolution layer TL_1 may correspond to a resolution of 4×4, and the resolution layer TL_2 may correspond to a resolution of 8×8. Moreover, the resolution layer TL_N-1 may correspond to a resolution of 512×512, and the resolution layer TL_N may correspond to a resolution of 1024×1024.
Further, a plurality of interpolated models may be generated based on one or more models obtained by training the neural network. The one or more models may include the base model 101, the transferred model 102, or any model obtained by training the neural network. The one or more models may have the same architecture.
In some examples, the plurality of interpolated models may be generated through interpolating at the different resolution layers in the base model 101.
In some examples, the plurality of interpolated models may be respectively generated by interpolating at different resolution layers of the transferred model 102. In some examples, through interpolating at the different resolution layers in the transferred model 102 and the base model 101, multiple different interpolated models may be generated.
In some examples, in the interpolated model 103, the resolution layer IL_1 may correspond to a resolution of 4×4, and the resolution layer IL_2 may correspond to a resolution of 8×8. Moreover, the resolution layer IL_N-1 may correspond to a resolution of 512×512, and the resolution layer IL_N may correspond to a resolution of 1024×1024.
In some examples, a plurality of different interpolated models may be further generated based on the interpolated model 103. For example, the plurality of different interpolated models may be generated by respectively interpolating different resolution layers of the interpolated model 103.
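As a hedged sketch, reusing the hypothetical blend_models helper from the earlier example, a family of different interpolated models may be produced simply by varying the per-layer schedule; the layer names and cut-off points below are illustrative assumptions rather than prescribed values.

def cutoff_schedule(layer_names, cutoff):
    """Layers before `cutoff` take weights from one source model (alpha = 0);
    layers from `cutoff` onward take weights from the other (alpha = 1)."""
    return {name: (0.0 if idx < cutoff else 1.0)
            for idx, name in enumerate(layer_names)}

layers = ["synthesis.b4", "synthesis.b8", "synthesis.b16", "synthesis.b32"]
schedules = [cutoff_schedule(layers, c) for c in range(1, len(layers))]
# Each schedule, passed to blend_models(), yields a different interpolated model.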
In some examples, the plurality of different interpolated models may include a first interpolated model, a second interpolated model, and a third interpolated model. Due to data limitations, a single model may have some specific flaws, artifacts, or areas that do not meet business needs or requirements. Moreover, these problems are coupled with feature parts of the face: if the generation effect of eyes is flawed, the model has problems with eyes in most of the generated images. As a result, it is possible to replace such flawed regions with the results of other models, rather than replacing the generated picture as a whole.
The picture 201 shown in
In some examples, most or all of the pictures generated by one interpolated model may have the same artifact or flaw. For example, most or all of the pictures generated by the first interpolated model may have the same hair artifact as the first picture 202, and most or all of the pictures generated by the second interpolated model may have the same skin tone artifact as the second picture 203.
In some examples, after obtaining the first picture 202 and the second picture 203, a face analysis model is used to identify a target area or region in the two pictures. For example, the face analysis model identifies the hair regions of the two pictures and replaces the hair region of the first picture 202 with the hair region of the second picture 203 to obtain a style picture whose hair has no flaws or artifacts.
In some examples, image masking is used to implement the replacement of the target region in pictures. For example, a mask is used to replace the hair region of the first picture 202 with the hair region of the second picture 203.
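A minimal sketch of the hard mask replacement described above, assuming the two pictures are uint8 numpy arrays of shape (H, W, 3) and a face analysis step has already produced a binary hair mask of shape (H, W); the function name replace_region is hypothetical.

import numpy as np

def replace_region(first_img, second_img, region_mask):
    """Copy the masked region (e.g. hair) of second_img into first_img."""
    mask = region_mask[..., None].astype(first_img.dtype)  # broadcast over RGB
    return first_img * (1 - mask) + second_img * mask

# Toy usage: the top rows stand in for the identified hair region.
first = np.zeros((4, 4, 3), dtype=np.uint8)
second = np.full((4, 4, 3), 255, dtype=np.uint8)
hair = np.zeros((4, 4), dtype=np.uint8)
hair[:2] = 1
print(replace_region(first, second, hair)[:, :, 0])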
In some examples, to avoid the blunt transition areas caused by hard image masking, targeted adjustments, e.g., feathering, are applied in different facial regions, and the two pictures, i.e., the first picture 202 and the second picture 203, are combined by using a model-specific alpha mask.
In some examples, the different facial regions may be identified by using an intermediate matrix Mface. Taking the hair region as an example, the hair region of the first picture 202 or the second picture 203 is identified by determining an intermediate matrix Mface corresponding to the hair region. The intermediate matrix Mface may include a plurality of matrix elements mface, where a matrix element mface equal to 0 indicates the background of the picture, and a matrix element mface equal to 1 indicates the facial target region in the picture, e.g., the hair region.
After a facial target region, e.g., one of the different facial regions, is identified, an alpha mask matrix Malpha is obtained by performing convolution operations on the intermediate matrix Mface using a kernel function. The kernel function may be a two-dimensional Gaussian function adapted to the feature shapes. In some examples, the two-dimensional alpha mask matrix Malpha is obtained by using the following equation (1):
Malpha(x, y) = ∫∫ K(a, b) Mface(x − a, y − b) da db    (1)

where K(a, b) denotes the two-dimensional Gaussian function and Mface(x − a, y − b) denotes the intermediate matrix. The model-specific alpha mask may be implemented by equation (1).
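A minimal sketch of equation (1), assuming the intermediate matrix Mface is a binary numpy array and using a Gaussian filter from scipy as the kernel function K; the sigma value is an illustrative assumption, not a disclosed parameter.

import numpy as np
from scipy.ndimage import gaussian_filter

def feathered_alpha(m_face, sigma=5.0):
    """Approximate equation (1): convolve the binary region matrix Mface with
    a two-dimensional Gaussian kernel to obtain a soft alpha mask in [0, 1]."""
    return np.clip(gaussian_filter(m_face.astype(np.float32), sigma=sigma), 0.0, 1.0)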
In some examples, the style picture is obtained by blending the two pictures using the following equation (2):

Ifinal = (1 − Malpha) Ifirst + Malpha Isecond    (2)

where Ifinal denotes the style picture, Ifirst denotes the first picture 202 generated by the first interpolated model, Isecond denotes the second picture 203 generated by the second interpolated model, and Malpha denotes the two-dimensional alpha mask matrix.
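Equation (2) may be applied per pixel as sketched below, assuming both pictures are float numpy arrays in [0, 1] with shape (H, W, 3) and Malpha is an array of shape (H, W); the function name composite is hypothetical.

def composite(first_img, second_img, m_alpha):
    """Equation (2): Ifinal = (1 - Malpha) * Ifirst + Malpha * Isecond."""
    a = m_alpha[..., None]  # broadcast the (H, W) alpha over the RGB channels
    return (1.0 - a) * first_img + a * second_img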
As shown in
Moreover, a style picture may be obtained by combining multiple pictures that are respectively generated by multiple different interpolated models. For example, in addition to the first picture 202 and the second picture 203, a third picture is generated by a third interpolated model that is different from the first and second interpolated models. After a first combined picture, i.e., the style picture generated in
In some examples, the first combined picture that is obtained by combining the first picture 202 and the second picture 203 may still have artifacts or flaws in the face or in other facial regions, such as the nose region, the ear region, etc. For example, the first combined picture has flaws in the nose region, and the third picture generated by the third interpolated model has no flaw in the nose region. To further improve the effect of generating style pictures, the first combined picture and the third picture are combined by using the second model-specific alpha mask.
The examples in the present disclosure integrate different stylized features to create new style special effects by combining styles, thereby reducing the cost of manual drawing and retouching. When the multiple different models have flaws or artifacts, the flaws or artifacts of a single model may be eliminated by combining the advantages of the multiple different models. Thus, manual editing in later stages is avoided. When the style data cannot cover various situations, for example, faces wearing glasses, regions generated well by existing models are used to replace areas in pictures generated by models with missing data or poor effects, so that aesthetically pleasing style pictures may still be generated.
As shown in
The processing component 402 usually controls overall operations of the system 400, such as operations relating to display, a telephone call, data communication, a camera operation and a recording operation. The processing component 402 may include one or more processors 420 for executing instructions to complete all or a part of steps of the above method. The processors 420 may include CPU, GPU, DSP, or other processors. Further, the processing component 402 may include one or more modules to facilitate interaction between the processing component 402 and other components. For example, the processing component 402 may include a multimedia module to facilitate the interaction between the multimedia component 408 and the processing component 402.
The memory 404 is configured to store different types of data to support operations of the system 400. Examples of such data include instructions, contact data, phonebook data, messages, pictures, videos, and so on for any application or method that operates on the system 400. The memory 404 may be implemented by any type of volatile or non-volatile storage devices or a combination thereof, and the memory 404 may be a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic memory, a flash memory, a magnetic disk or a compact disk.
The power supply component 406 supplies power for different components of the system 400. The power supply component 406 may include a power supply management system, one or more power supplies, and other components associated with generating, managing and distributing power for the system 400.
The multimedia component 408 includes a screen providing an output interface between the system 400 and a user. In some examples, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen receiving an input signal from a user. The touch panel may include one or more touch sensors for sensing a touch, a slide and a gesture on the touch panel. The touch sensor may not only sense a boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation. In some examples, the multimedia component 408 may include a front camera and/or a rear camera. When the system 400 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data.
The I/O interface 412 provides an interface between the processing component 402 and a peripheral interface module. The above peripheral interface module may be a keyboard, a click wheel, a button, or the like. These buttons may include, but are not limited to, a home button, a volume button, a start button and a lock button.
The sensor component 414 includes one or more sensors for providing a state assessment in different aspects for the system 400. For example, the sensor component 414 may detect an on/off state of the system 400 and the relative locations of components, for example, a display and a keypad of the system 400. The sensor component 414 may also detect a position change of the system 400 or a component of the system 400, presence or absence of a contact of a user on the system 400, an orientation or acceleration/deceleration of the system 400, and a temperature change of the system 400. The sensor component 414 may include a proximity sensor configured to detect presence of a nearby object without any physical touch. The sensor component 414 may further include an optical sensor, such as a CMOS or CCD image sensor used in an imaging application. In some examples, the sensor component 414 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is configured to facilitate wired or wireless communication between the system 400 and other devices. The system 400 may access a wireless network based on a communication standard, such as WiFi, 4G, or a combination thereof. In an example, the communication component 416 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an example, the communication component 416 may further include a Near Field Communication (NFC) module for promoting short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra-Wide Band (UWB) technology, Bluetooth (BT) technology and other technology.
In an example, the system 400 may be implemented by one or more of Application Specific Integrated Circuits (ASIC), Digital Signal Processors (DSP), Digital Signal Processing Devices (DSPD), Programmable Logic Devices (PLD), Field Programmable Gate Arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic elements to perform the above method.
A non-transitory computer readable storage medium may be, for example, a Hard Disk Drive (HDD), a Solid-State Drive (SSD), Flash memory, a Hybrid Drive or Solid-State Hybrid Drive (SSHD), a Read-Only Memory (ROM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, etc.
In step 502, the processor 420 obtains one or more models by training a neural network.
In some examples, the one or more models may have the same architecture.
In some examples, the one or more models may include a base model obtained by training a neural network using a face dataset.
In some examples, the one or more models may include one or more transferred models obtained by training the base model using one or more new datasets.
In some examples, the one or more models may include multiple interpolated models obtained by respectively interpolating different layers of the base model or the one or more transferred models.
In some examples, the neural network is a StyleGAN. The face dataset may include fake human face photos generated by a neural network such as StyleGAN.
In step 504, the processor 420 obtains a plurality of interpolated models based on the one or more models obtained in the step 502.
In step 506, the processor 420 generates a plurality of pictures by the plurality of interpolated models.
In step 508, the processor 420 generates the style picture by combining two or more pictures in the plurality of pictures using one or more model-specific alpha masks.
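For illustration, steps 504 through 508 may be outlined end to end as follows; this is a hedged sketch only, in which the generator callables, the face-parsing callable, and the sigma value are assumptions rather than elements required by the method.

import numpy as np
from scipy.ndimage import gaussian_filter

def generate_style_picture(generators, latent, parse_region, sigma=5.0):
    """Hypothetical outline of steps 506-508.

    generators: callables mapping a latent code to a float (H, W, 3) picture,
        e.g. the interpolated models obtained in step 504.
    parse_region: callable returning a binary (H, W) facial target mask.
    """
    pictures = [g(latent) for g in generators]                    # step 506
    m_face = parse_region(pictures[0]).astype(np.float32)         # target region
    m_alpha = gaussian_filter(m_face, sigma)[..., None]           # equation (1)
    return (1.0 - m_alpha) * pictures[0] + m_alpha * pictures[1]  # step 508, eq. (2)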
In some examples, the plurality of interpolated models may include a first interpolated model and a second interpolated model.
In some examples, the processor 420 may further generate a first picture by the first interpolated model, generate a second picture by the second interpolated model, and generate the style picture by combining the first picture and the second picture using a first model-specific alpha mask in the one or more model-specific alpha masks.
In some examples, the processor 420 may further identify a facial target area in the first picture and determine an intermediate matrix for the facial target area, obtain an alpha mask matrix by performing convolution operations on the intermediate matrix, and generate the style picture based on the alpha mask matrix, the first picture, and the second picture.
In some examples, each picture of a plurality of pictures generated by the first interpolated model may include the facial target area.
In some examples, the processor 420 may perform the convolution operations on the intermediate matrix by using a kernel function.
In some examples, the plurality of interpolated models may include a first interpolated model, a second interpolated model, and a third interpolated model. Further, the processor 420 may respectively generate a first picture, a second picture, and a third picture by the first interpolated model, the second interpolated model, and the third interpolated model, generate a first combined picture by combining the first picture and the second picture using a first model-specific alpha mask in the one or more model-specific alpha masks, and generate the style picture by combining the first combined picture and the third picture using a second model-specific alpha mask in the one or more model-specific alpha masks.
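A brief sketch of this two-stage combination, assuming the three pictures are float numpy arrays of shape (H, W, 3) and the two model-specific alpha masks are arrays of shape (H, W) in [0, 1]; the function name combine_three is hypothetical.

def combine_three(pic1, pic2, pic3, mask1, mask2):
    """Blend pic1 and pic2 with the first model-specific alpha mask, then blend
    the resulting first combined picture with pic3 using the second mask."""
    a1, a2 = mask1[..., None], mask2[..., None]
    first_combined = (1.0 - a1) * pic1 + a1 * pic2
    return (1.0 - a2) * first_combined + a2 * pic3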
In some examples, the first model-specific alpha mask may be the same as or different from the second model-specific alpha mask.
In some examples, the processor 420 may further generate a plurality of transferred models in different training periods by training the base model on a dataset of one style and obtain the plurality of interpolated models based on the plurality of transferred models in the different training periods.
In some examples, the processor 420 may generate a plurality of different transferred models by training the base model on a plurality of datasets of different styles.
In some examples, the processor 420 may obtain a plurality of interpolated models by interpolating at different layers of a transferred model.
In some examples, there is provided an apparatus for generating a style picture. The apparatus includes one or more processors 420 and a memory 404 configured to store instructions executable by the one or more processors; where the processor, upon execution of the instructions, is configured to perform a method as illustrated in
In some other examples, there is provided a non-transitory computer readable storage medium 404, having instructions stored therein. When the instructions are executed by one or more processors 420, the instructions cause the processor to perform a method as illustrated in
The description of the present disclosure has been presented for purposes of illustration, and is not intended to be exhaustive or limited to the present disclosure. Many modifications, variations, and alternative implementations will be apparent to those of ordinary skill in the art having the benefit of the teachings presented in the foregoing descriptions and the associated drawings.
The examples were chosen and described in order to explain the principles of the disclosure, and to enable others skilled in the art to understand the disclosure for various implementations and to best utilize the underlying principles and various implementations with various modifications as are suited to the particular use contemplated. Therefore, it is to be understood that the scope of the disclosure is not to be limited to the specific examples of the implementations disclosed and that modifications and other implementations are intended to be included within the scope of the present disclosure.