METHOD FOR TRAINING CONVOLUTIONAL NEURAL NETWORK, AND METHOD AND DEVICE FOR STYLIZING VIDEO

Description

TECHNICAL FIELD

The present disclosure relates to technical field of imaging processing, and particularly, to a method for training a convolutional neural network (CNN) for stylizing a video, and a method and device for stylizing video.

BACKGROUND

Style transfer aims to transfer the style of a reference image/video to an input image/video. It is different from color transfer in the sense that it transfers not only colors but also strokes and textures of the reference. Some existing techniques are time consuming and ineffective, while some techniques impose heavy computation burden on computing devices.

SUMMARY

The embodiments of the present disclosure relate to a method for training a convolutional neural network (CNN) for stylizing a video, and a method and device for stylizing video.

According to a first aspect, there is provided a method for training a convolutional neural network (CNN) for stylizing a video, comprising: transforming each of a plurality of original frames of the video into a stylized frame by using a first CNN for stylizing; determining at least one first loss according to a first original frame and second original frame of the plurality of original frames and results of the transforming, the second original frame being next to the first original frame; and training the first CNN according to the at least one first loss.

According to a second aspect, there is provided a method for stylizing a video, comprising: stylizing a video by using a first convolutional neural network (CNN); where the first CNN has been trained according to at least one first loss which is determined according to a first original frame and second original frame of a plurality of original frames of the video and results of transforming, the second original frame being next to the first original frame, the transforming comprising: transforming each of a plurality of original frames of the video into a stylized frame by using a first CNN for stylizing.

According to a third aspect, there is provided a device for stylizing a video, comprising: a memory for storing instructions; and at least one processor configured to execute the instructions to perform operations of: stylizing a video by using a first convolutional neural network (CNN); where the first CNN has been trained according to at least one first loss which is determined according to a first original frame and second original frame of a plurality of original frames of the video and results of transforming, the second original frame being next to the first original frame, the transforming comprising: transforming each of a plurality of original frames of the video into a stylized frame by using a first CNN for stylizing

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the present disclosure.

FIG. 1 illustrates images obtained when the current filters adopted in smartphone perform standard color transformation to the images/videos.

FIG. 2 illustrates stylized frame sequence when video style transfer is performed on original sequence of frames.

FIG. 3 illustrates temporal inconsistency in relevant video style transfer.

FIG. 4 illustrates a flow chart of a method for training a CNN for stylizing a video according to at least some embodiments of the present disclosure.

FIG. 5 illustrates a block diagram of a device for training a CNN for stylizing a video according to at least some embodiments of the present disclosure.

FIG. 6 illustrates a flow chart of a method for stylizing a video according to at least some embodiments of the present disclosure.

FIG. 7 illustrates a block diagram of a device for stylizing a video according to at least some embodiments of the present disclosure.

FIG. 8 illustrates the architecture of the proposed Twin Network according to at least some embodiments of the present disclosure.

FIG. 9 illustrates some example details about the StyleNet according to at least some embodiments of the present disclosure.

FIG. 10 illustrates VGG network which is used as a loss network.

FIG. 11 illustrates style transfer result from the proposed Twin Network according to at least some embodiments of the present disclosure.

FIG. 12 illustrates a block diagram of electronic device according to another exemplary embodiment.

Specific embodiments of the present disclosure have been illustrated through the above accompanying drawings and more detailed descriptions will be made below. These accompanying drawings and textual descriptions are intended not to limit the scope of the concept of the present disclosure in any manner but to explain the concept of the present disclosure to those skilled in the art with reference to specific embodiments.

DETAILED DESCRIPTION

Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise represented. The implementations set forth in the following description of exemplary embodiments do not represent all implementations consistent with the present disclosure. Instead, they are merely examples of apparatuses and methods consistent with aspects related to the present disclosure as recited in the appended claims.

Gatys et al. (A Neural Algorithm of Artistic Style (Gatys, Ecker, and Bethge; 2015)) presented a technique for learning a style and applying it to other images. Briefly, they use gradient descent from white noise to synthesize an image which matches the content and style of the target and source image respectively. Though impressive stylized results are achieved, Gatys et al.'s method takes quite a long time to infer the stylized image. Afterwards, Johnson et al. (Perceptual Losses for Real-Time Style Transfer and Super-Resolution) use a feed-forward network to reduce the computation time and effectively conduct the image style transfer.

Simply treating each video frame as an independent image, the aforementioned image style transfer methods can be directly extended to videos. However, without considering temporal consistency, those methods will inevitably bring flicker artifacts to generated stylized videos.

Video-based solution tries to achieve video style transfer directly on the video domain. For example, Ruder (and other similar works, for example, Manuel Ruder, Alexey Dosovitskiy, and Thomas Brox titled Artistic style transfer for videos (2016)) presents a method of obtaining stable video by penalizing departures from the optical flow of the input video. Style features remain present from frame to frame, following the movement of elements in the original video. However, the on-the-fly computation of optical flows makes this approach computationally far too heavy for real-time style-transfer, taking minutes per frame.

One of the issues in video style transfer is the temporal inconsistency problem, which can be observed visually as flickering between consecutive frames and inconsistent stylization of moving objects (as illustrated in FIG. 3). In this disclosure, a multi-level temporal loss is introduced according to at least some embodiments of the present disclosure, to stabilize the video style transfer. Comparing to previous method, the proposed method is more advantageous.

First, unlike relevant methods that enforce the temporal consistency at the final output level (e.g., the last network layer), we design a multi-level temporal loss that enforce the high-level semantic information to be synced in earlier network layers, which gives higher flexibility for the network to adjust its weights to achieve the temporal consistency and thus result in better network convergence property. A more stable video style transfer result is also delivered (e.g., without flickering effect).

Second, our method generates no extra computation burden during run time, which may avoid the on-the-fly optical flow calculation and greatly reduce the computation burden.

As can be seen in FIG. 1, the current filters adopted in smartphone just perform standard color transformation to the images/videos. These default filters are somewhat boring and can hardly attract users' attention (especially for those young ones).

Style transfer provides a more impressive effect to images and videos, and the number of style filters we can create is unlimited, which can largely enrich the filters in smartphone and is more attractive for (young) users.

As can be seen in FIG. 2, video style transfer transforms the original sequence of frames into another stylized frame sequence. This can provide a more impressive effect to users comparing to relevant filters, which just change the color tone or color distribution. In addition, the number of style filters we can create is unlimited, which can largely enrich the products (such as video album) in smartphone.

In FIG. 2, (a) illustrates an original video and (b) illustrates a stylized video.

Most of the current product adopt an image-based video style transfer method to generate stylized video, where they apply image-based style transfer techniques to a video frame by frame. However, this scheme inevitably brings temporal inconsistencies and thus causes severe flicker artifacts. FIG. 3 illustrates an example of temporal inconsistency in relevant video style transfer. As the highlighted part in the figure, the result of stylized frame t and t+1 is with no temporal consistency and thus create a flickering effect.

FIG. 3 illustrates temporal inconsistency in relevant video style transfer. Left and right images denote the stylized frame at t and t+1 respectively. As can be seen, even under such a short period of time (e.g., 1/30 second), stylized frame t and t+1 is different in several parts (e.g., the parts in the circles) and thus create a flickering effect.

In this disclosure, a temporal stability mechanism, which is generated by Twin Network, is proposed to stabilize the changes in pixel values from frame-to-frame. Furthermore, unlike previous video style transfer methods that introduces heavy computation burden during run time, the stabilization is done at training time, allowing for an unruffled style transfer of videos in real-time.

According to a first aspect, there is provided a method for training a CNN for stylizing a video. FIG. 4 illustrates a flow chart of a method for training a CNN for stylizing a video according to at least some embodiments of the present disclosure.

At block S402, each of a plurality of original frames of the video is transformed into a stylized frame by using a first convolutional neural network (CNN) for stylizing.

At block S404, at least one first loss is determined according to a first original frame and second original frame of the plurality of original frames and the results of the transforming. Here, the second original frame is next to the first original frame.

At block S406, the first CNN is trained according to the at least one first loss.

At least one temporal loss is introduced to stabilize the video style transfer, so as to enforce the temporal consistency at the final output level, which will have more flexibility.

In at least some embodiments, the at least one first loss may include a semantic-level temporal loss, and the determining the at least one first loss may include: extracting a first output of a hidden layer in the first CNN when the first CNN is applied to the first original frame, and extracting a second output of the hidden layer in the first CNN when the first CNN is applied to the second original frame; and determining a semantic-level temporal loss according to a first difference between the first output and the second output.

Here, the high-level semantic information is forced to be synced in earlier network layers, it will be easier and effective for adapting the network to a specific type (e.g., in our case, to generate a stable output frames).

Traditional method usually try to enforce the temporal consistency in only the output level (e.g., the last layer of network). However, tuning the result based totally on the output level result is somewhat challenge and has less flexibility to adjust the CNN. Here, the encoder loss is used to alleviate the problem. The encoder loss penalizes temporal inconsistency on the last level feature map to enforce a high-level semantic similarity between two consecutive frames.

In at least some embodiments, the at least one first loss may include a contrastive loss, and the determining the at least one first loss may include: determining a contrastive loss according to a second difference between: (a) a difference between the first original frame and a stylized first frame corresponding to the first original frame, and (b) a difference between the second original frame and a stylized second frame corresponding to the second original frame.

The idea behind contrastive loss is that one should consider the motion changes in the original frames and use them as a guide to update the neural network. For example, if there is a large motion change in the original frames, then we should also expect a relatively large changes between the corresponding stylized frames at time t and t+1. In this case, we should ask the network to output a pair of stylized frames that could be potentially different (instead of blindly enforcing frames t and t+1 to be exactly the same). Otherwise, if only minor or no motion is observed, then the network can generate similar stylized frames.

The contrastive loss can achieves this by trying to minimize the difference between the changes of original and stylized frame at time t and t+1. The information can thus correctly guide the CNN to generate images depending on the source motion changes. In addition, comparing to the direct temporal loss that is difficult to train, the contrastive loss guarantees a more stable neural network training process and a better converge property. One advantage of the contrastive loss is that it introduces no extra computation burden to run time.

In at least some embodiments, the above method may include transforming each of the plurality of original frames of the video by using a second CNN, the second CNN having been trained on an ImageNet dataset; transforming each of a plurality of the stylized frames by using the second CNN; determining at least one second loss according to an output feature vector of each of the plurality of the original frames at a first layer of the second CNN, and an output feature vector of each of the plurality of the stylized frames at a first layer of the second CNN. Here, training the first CNN according to the at least one first loss includes: training the first CNN according to the at least one first loss and the at least one second loss.

In at least some embodiments, the at least one second loss may include a content loss, and the method further includes: extracting a first feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to each of the plurality of original frames; extracting a second feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to a stylized frame corresponding to the original frame; and determining the content loss according to Euclidean distance between the first feature map and second feature map.

By using the content loss to train the CNN, it is advantageous that the difference between the original frame and the stylized frame can be minimized.

In at least some embodiments, the at least one second loss may include a style loss, and the method further includes: determining a first Gram matrix according to a first feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to each of the plurality of original frames; determining a second Gram matrix according to a second feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to a stylized frame corresponding to the original frame; and determining the style loss according a difference between the first Gram matrix and second Gram matrix.

By using the style loss to train the CNN, it is advantageous that the difference between styles of two frame can be minimized.

In at least some embodiments, determining the style loss according the difference between the first Gram matrix and second Gram matrix includes: determining the style loss according a squared Frobenius norm of the difference between the first Gram matrix and second Gram matrix.

In at least some embodiments, training the first CNN according to the at least one first loss and the at least one second loss includes: training the first CNN such that a weighted sum of the at least one first loss and the at least one second loss is minimized.

In at least some embodiments, training the first CNN such that a weighted sum of the at least one first loss and the at least one second loss is minimized includes: training the first CNN based on a method which uses gradient to update network parameters of the first CNN, such that a weighted sum of the at least one first loss and the at least one second loss is minimized.

In at least some embodiments, the second CNN is selected from a group including a VGG network, InceptionNet, and ResNet.

According to a second aspect, there is provided a device for training a CNN for stylizing a video. FIG. 5 illustrates a block diagram of a device for training a CNN for stylizing a video according to at least some embodiments of the present disclosure.

The device may include a determination unit 502, transforming unit 504 and training unit 506.

The transforming unit 504 is configured to transform each of a plurality of original frames of the video into a stylized frame by using a first convolutional neural network (CNN) for stylizing.

The determination unit 502 is configured to determine at least one first loss according to a first original frame and second original frame of the plurality of original frames and results of the transforming. The second original frame may be next to the first original frame.

The training unit 506 is configured to train the first CNN according to at least one first loss.

In at least some embodiments, the at least one first loss may include a semantic-level temporal loss. The determination unit 502 is configured to extract a first output of a hidden layer in the first CNN when the first CNN is applied to the first original frame, and extracting a second output of the hidden layer in the first CNN when the first CNN is applied to the second original frame; and determining a semantic-level temporal loss according to a first difference between the first output and the second output.

In at least some embodiments, the at least one first loss may include a contrastive loss. The determination unit 502 is configured to determine a contrastive loss according to a second difference between: (a) a difference between the first original frame and a stylized first frame corresponding to the first original frame, and (b) a difference between the second original frame and a stylized second frame corresponding to the second original frame.

In at least some embodiments, the transforming unit 504 is configured to transform each of the plurality of original frames of the video by using a second CNN. The second CNN having been trained on an ImageNet dataset. The transforming unit 504 is configured to transform each of a plurality of the stylized frames by using the second CNN; determining at least one second loss according to an output feature vector of each of the plurality of the original frames at a first layer of the second CNN, and an output feature vector of each of the plurality of the stylized frames at a first layer of the second CNN.

The training unit 506 is configured to train the first CNN according to the at least one first loss and the at least one second loss.

In at least some embodiments, the at least one second loss may include a content loss. The determination unit 502 is further configured to extract a first feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to each of the plurality of original frames; extract a second feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to a stylized frame corresponding to the original frame; and determine the content loss according to Euclidean distance between the first feature map and second feature map.

In at least some embodiments, the at least one second loss may include a style loss. The determination unit 502 may be further configured to determine a first Gram matrix according to a first feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to each of the plurality of original frames; determine a second Gram matrix according to a second feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to a stylized frame corresponding to the original frame; and determine the style loss according a difference between the first Gram matrix and second Gram matrix.

In at least some embodiments, the determination unit 502 may be configured to determine the style loss according a squared Frobenius norm of the difference between the first Gram matrix and second Gram matrix.

In at least some embodiments, the training unit 506 may be configured to train the first CNN such that a weighted sum of the at least one first loss and the at least one second loss is minimized.

In at least some embodiments, the training unit 506 is configured to train the first CNN based on a method which uses gradient to update network parameters of the first CNN, such that a weighted sum of the at least one first loss and the at least one second loss is minimized.

There is provided a non-transitory storage medium having stored thereon computer-executable instructions that, when being executed by a processor, cause the processor to perform the method as described above.

FIG. 6 illustrates a method for stylizing a video according to at least some embodiments of the present disclosure.

At block S602, a video is stylized by using a first convolutional neural network (CNN). Here, the first CNN has been trained according to at least one first loss which is determined according to a first original frame and second original frame of a plurality of original frames of the video and results of transforming, the second original frame being next to the first original frame, the transforming comprising: transforming each of a plurality of original frames of the video into a stylized frame by using a first CNN for stylizing.

According to some embodiments, the at least one first loss may include a semantic-level temporal loss, and the semantic-level temporal loss is determined according to a first difference between the first output and the second output, the first output is an output of a hidden layer in the first CNN when the first CNN is applied to the first original frame, and the second output is an output of the hidden layer in the first CNN when the first CNN is applied to the second original frame.

According to some embodiments, the at least one first loss may include a contrastive loss, and the contrastive loss is determined according to a second difference between: (a) a difference between the first original frame and a stylized first frame corresponding to the first original frame, and (b) a difference between the second original frame and a stylized second frame corresponding to the second original frame.

According to some embodiments, the training the first CNN according to the at least one first loss may include: training the first CNN according to the at least one first loss and the at least one second loss.

Here, the at least one second loss may be obtained by: transforming each of the plurality of original frames of the video by using a second CNN, the second CNN having been trained on an ImageNet dataset; transforming each of a plurality of the stylized frames by using the second CNN; and determining the at least one second loss according to an output feature vector of each of the plurality of the original frames at a first layer of the second CNN, and an output feature vector of each of the plurality of the stylized frames at a first layer of the second CNN.

According to at least some embodiments, the at least one second loss may include a content loss, and the content loss may be obtained by: extracting a first feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to each of the plurality of original frames; extracting a second feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to a stylized frame corresponding to the original frame; and determining the content loss according to Euclidean distance between the first feature map and second feature map.

According to at least some embodiments, the at least one second loss may include a style loss, and the style loss is obtained by: determining a first Gram matrix according to a first feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to each of the plurality of original frames; determining a second Gram matrix according to a second feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to a stylized frame corresponding to the original frame; and determining the style loss according a difference between the first Gram matrix and second Gram matrix.

According to at least some embodiments, determining the style loss according the difference between the first Gram matrix and second Gram matrix may include: determining the style loss according a squared Frobenius norm of the difference between the first Gram matrix and second Gram matrix.

According to at least some embodiments, training the first CNN according to the at least one first loss and the at least one second loss may include: training the first CNN such that a weighted sum of the at least one first loss and the at least one second loss is minimized.

According to at least some embodiments, training the first CNN such that a weighted sum of the at least one first loss and the at least one second loss is minimized may include: training the first CNN based on a method which uses gradient to update network parameters of the first CNN, such that a weighted sum of the at least one first loss and the at least one second loss is minimized.

According to at least some embodiments, the second CNN may be selected from a group comprising a VGG network, InceptionNet, and ResNet.

FIG. 7 illustrates a device for stylizing a video according to at least some embodiments of the present disclosure.

The device includes a styling module 702, configured for stylizing a video by using a first convolutional neural network (CNN). Here, the first CNN has been trained according to at least one first loss which is determined according to a first original frame and second original frame of a plurality of original frames of the video and results of transforming, the second original frame being next to the first original frame, the transforming comprising: transforming each of a plurality of original frames of the video into a stylized frame by using a first CNN for stylizing.

According to at least some embodiments, the at least one first loss comprises a semantic-level temporal loss, and the semantic-level temporal loss is determined according to a first difference between the first output and the second output, the first output is an output of a hidden layer in the first CNN when the first CNN is applied to the first original frame, and the second output is an output of the hidden layer in the first CNN when the first CNN is applied to the second original frame.

According to at least some embodiments, the at least one first loss comprises a contrastive loss, and the contrastive loss is determined according to a second difference between: (a) a difference between the first original frame and a stylized first frame corresponding to the first original frame, and (b) a difference between the second original frame and a stylized second frame corresponding to the second original frame.

According to at least some embodiments, the training the first CNN according to the at least one first loss comprises: training the first CNN according to the at least one first loss and the at least one second loss, wherein the at least one second loss is obtained by: transforming each of the plurality of original frames of the video by using a second CNN, the second CNN having been trained on an ImageNet dataset; transforming each of a plurality of the stylized frames by using the second CNN; and determining the at least one second loss according to an output feature vector of each of the plurality of the original frames at a first layer of the second CNN, and an output feature vector of each of the plurality of the stylized frames at a first layer of the second CNN.

According to at least some embodiments, the at least one second loss comprises a content loss, and the content loss is obtained by: extracting a first feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to each of the plurality of original frames; extracting a second feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to a stylized frame corresponding to the original frame; and determining the content loss according to Euclidean distance between the first feature map and second feature map.

According to at least some embodiments, the at least one second loss comprises a style loss, and the style loss is obtained by: determining a first Gram matrix according to a first feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to each of the plurality of original frames; determining a second Gram matrix according to a second feature map of an activation of a convolutional layer of the second CNN when the second CNN is applied to a stylized frame corresponding to the original frame; and determining the style loss according a difference between the first Gram matrix and second Gram matrix.

According to at least some embodiments, determining the style loss according the difference between the first Gram matrix and second Gram matrix comprises: determining the style loss according a squared Frobenius norm of the difference between the first Gram matrix and second Gram matrix.

According to at least some embodiments, training the first CNN according to the at least one first loss and the at least one second loss comprises: training the first CNN such that a weighted sum of the at least one first loss and the at least one second loss is minimized.

According to at least some embodiments, training the first CNN such that a weighted sum of the at least one first loss and the at least one second loss is minimized comprises: training the first CNN based on a method which uses gradient to update network parameters of the first CNN, such that a weighted sum of the at least one first loss and the at least one second loss is minimized.

According to at least some embodiments, the second CNN is selected from a group comprising a VGG network, InceptionNet, and ResNet.

Some embodiments of the present disclosure will be further described below.

Network Architecture

FIG. 8 illustrates the architecture of the proposed Twin Network according to at least some embodiments of the present disclosure. A model of the Twin Network may consist of two parts: StyleNet and LossNet. The video frames are fed into the twin network by pair (e.g., frame t and frame t+1), and the twin network will generate the following losses: content loss t and content loss t+1, style loss t and style loss t+1, encoder loss, and contrastive loss. These losses will be used to update the SyleNet for better video style transfer.

FIG. 9 illustrates more details about the StyleNet. It may be a deep convolutional neural network (CNN) parameterized by weights W. The StyleNet may transform input images x into output images y via the mapping y=fw (x).

Now, the description is made by taking convolutional neural network as an example for fw (x). A convolutional neural network, fw (.), consists of an input and an output layer, as well as multiple hidden layers. The hidden layers of a CNN typically consist of a series of convolutional layers that convolve with a multiplication or other dot product. The activation function is commonly a RELU layer, and is subsequently followed by additional convolutions such as pooling layers, fully connected layers and normalization layers, referred to as hidden layers because their inputs and outputs are masked by the activation function and final convolution. Finally, the network output a transformed image y based on the aforementioned operators.

Part
Input Shape
Operation
Output Shape

encoder
(h, w, n text missing or illegible when filed

)
CONV-(C64, K7 × 7, S1 × 1, P_same), ReLU, Instance Normal
(h, w, 64)

(h, w, 64)
CONV-(C128, K4 × 4, S2 × 2, P_same), ReLU, Instance Normal

(\frac{h}{2}, \frac{w}{2}, 128)

(\frac{h}{2}, \frac{w}{2}, 128)

CONV-(C256, K4 × 4, S2 × 2, P_same), ReLU, Instance Normal

(\frac{h}{4}, \frac{w}{4}, 256)

bottleneck

(\frac{h}{8}, \frac{w}{8}, 256)