This is a U.S. National Stage application under 35 U.S.C. §371 of International Application PCT/CN2016/112812, filed Dec. 29, 2016.
Field of Invention
The present invention relates to a stereoscopic video generation method, and more particularly to a monocular-to-binocular stereoscopic video generation method based on 3D convolution neural network.
Description of Related Arts
Due to their strong sense of reality and immersion, 3D films are very popular with audiences. In recent years, 3D films have accounted for a large share of the film market, representing 14% to 21% of total North American box office revenue between 2010 and 2014. In addition, with the emergence of the virtual reality (VR) market, head-mounted displays have created a further demand for 3D content.
Equipment and production costs are high for directly shooting in a 3D film format, so converting 2D films into 3D films has become a more practical choice. A typical professional conversion process usually comprises firstly manually creating a depth map for each frame, and then combining the original 2D video frame with the depth map to produce a stereoscopic image pair through a rendering algorithm based on the depth map. However, this process is still expensive and requires costly manpower. Therefore, high production costs have become a major stumbling block to the large-scale development of the 3D film industry.
In recent years, many researchers have sought to produce 3D videos from a single video sequence through existing 3D model libraries and depth estimation techniques. Currently, depth information is able to be obtained through both hardware and software. Hardware with access to depth information comprises laser range finders and the 3D depth somatosensory camera KINECT launched by MICROSOFT. Common software methods comprise multi-view stereo, photometric stereo, shape from shading, depth from defocus, and methods based on machine learning. The methods based on machine learning are mainly adapted for converting 2D films into 3D films, and in recent years, with the wide application of deep learning frameworks, such frameworks have also been applied to depth estimation. For example, Eigen et al. firstly achieved end-to-end monocular image depth estimation through a multi-scale convolution neural network (CNN). However, the size of the output is limited, so the predicted depth map is much smaller than the inputted original image, and the height and the width of the obtained depth map are respectively only 1/16 of those of the original image. Therefore, Eigen and Fergus later improved the network structure, which comprises firstly up-sampling the originally obtained CNN output, then concatenating it with the convolution result of the original input image, and then processing through multiple convolutional layers to deepen the neural network, so as to obtain a final outputted depth map with higher resolution.
However, the depth map obtained by the above method still has the problems that the contour is not clear enough and the resolution is low. In addition, the problem of completing occluded regions and other parts that become invisible due to the change of viewpoint remains difficult to solve.
An object of the present invention is to provide a monocular-to-binocular stereoscopic video generation method to overcome the deficiencies of the prior art, which is able to automatically convert existing 2D video sources into stereoscopic videos playable on 3D devices by training a deep 3D fully convolutional neural network.
The object of the present invention is achieved by a technical solution as follows. A stereoscopic video generation method based on 3D convolution neural network comprises steps of:
preparing training data; training the 3D convolution neural network; taking 2D videos as a left eye video input to the trained neural network model to generate right eye videos; and synthesizing the left and right eye videos into 3D videos for output.
The training data are downloaded from the web; a sufficient number (at least 20) of non-animated 3D movies are adopted; all videos are firstly divided into left eye views and right eye views; and blank frames which may occur at titles, tails and shot transitions are removed, so as to obtain training samples of about 5,000,000 frames. The rich training samples enable the trained CNN to have a strong generalization capability.
For a convolutional layer with input temporal length t0, width w0 and height h0, padding pad, kernel size kernel_size and stride stride, the output temporal length t1, width w1 and height h1 are obtained as:
t1=(t0+2×pad−kernel_size)/stride+1 (1);
w1=(w0+2×pad−kernel_size)/stride+1 (2);
h1=(h0+2×pad−kernel_size)/stride+1 (3).
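By way of illustration only, the following Python sketch applies formulas (1) to (3); the concrete input sizes and kernel size used in the example are assumptions and are not fixed by the present invention.

def conv3d_output_size(t0, h0, w0, kernel_size, stride=1, pad=0):
    """Apply formulas (1)-(3): output extent along time, height and width."""
    t1 = (t0 + 2 * pad - kernel_size) // stride + 1
    h1 = (h0 + 2 * pad - kernel_size) // stride + 1
    w1 = (w0 + 2 * pad - kernel_size) // stride + 1
    return t1, h1, w1

# Assumed example: five input frames of 180x320 pixels, a 3x3x3 kernel,
# stride 1 and no padding shrink every dimension by two.
print(conv3d_output_size(t0=5, h0=180, w0=320, kernel_size=3))  # (3, 178, 318)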
The final output of the 3D convolution neural network is color images with RGB channels.
Generally speaking, five continuous frames of the left eye view produce one frame of the right eye view. However, both the height and the width of the right eye view shrink by four pixels relative to the left eye views. Therefore, during network training, the output right eye view is aligned with the central area of the middle frame of the five input frames to obtain the loss, which is back-propagated to adjust the network parameters.
The correlation between adjacent frames in the time domain exists only within the same shot. Therefore, before network training, the videos are firstly split into shots through shot segmentation. The shot segmentation algorithm adopted by the present invention is based on the fuzzy C-means clustering algorithm, and specifically comprises steps of:
firstly, converting every frame of a video from RGB (Red-Green-Blue) space to YUV (Luminance and Chrominance) space through a conversion formula of
Y=0.299R+0.587G+0.114B
U=0.492(B−Y)
V=0.877(R−Y) (4).
and then calculating a color histogram of YUV channels of every frame and calculating an inter-frame difference between adjacent frames through a formula of
x(f_i, f_{i+1}) = Σ_{k=1}^{n} |H_Y(f_i, k) − H_Y(f_{i+1}, k)| + Σ_{k=1}^{m} (|H_U(f_i, k) − H_U(f_{i+1}, k)| + |H_V(f_i, k) − H_V(f_{i+1}, k)|) (5),
wherein m is the number of histogram bins of the U and V channels, n is the number of histogram bins of the Y channel, m<n, and H(f, k) represents the number of pixels falling into the k-th bin of frame f.
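The following Python sketch illustrates formulas (4) and (5) for a pair of RGB frames; the bin counts, value ranges and the use of NumPy are implementation assumptions and are not prescribed by the present invention.

import numpy as np

def frame_difference(frame_a, frame_b, n_bins_y=64, m_bins_uv=32):
    """Inter-frame difference of formula (5); bin counts n and m (m < n) are assumed values."""
    def yuv_histograms(rgb):
        r = rgb[..., 0].astype(float)
        g = rgb[..., 1].astype(float)
        b = rgb[..., 2].astype(float)
        y = 0.299 * r + 0.587 * g + 0.114 * b      # formula (4)
        u = 0.492 * (b - y)
        v = 0.877 * (r - y)
        hy, _ = np.histogram(y, bins=n_bins_y, range=(0, 255))
        hu, _ = np.histogram(u, bins=m_bins_uv, range=(-160, 160))
        hv, _ = np.histogram(v, bins=m_bins_uv, range=(-160, 160))
        return hy, hu, hv
    hy_a, hu_a, hv_a = yuv_histograms(frame_a)
    hy_b, hu_b, hv_b = yuv_histograms(frame_b)
    return (np.abs(hy_a - hy_b).sum()
            + np.abs(hu_a - hu_b).sum()
            + np.abs(hv_a - hv_b).sum())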
All inter-frame differences of the video are clustered into three categories through the fuzzy C-means clustering algorithm: a shot change class (SC), a suspected shot change class (SSC) and a non-shot change class (NSC). The suspected shot change class refers to frames for which it is difficult to determine whether the shot has changed or not.
The fuzzy C-means clustering algorithm (whose input is the sequence of adjacent inter-frame differences of the video and whose output is the type of each adjacent frame pair) comprises steps of:
(1) initializing a class number c=3 and an index weight w=1.5, and assigning all membership values μ_{ik} (i=1, . . . , c, k=1, . . . , n, wherein n is the total number of inter-frame differences) to 1/c;
(2) calculating the c clustering centers c_i through formula (6), wherein i=1, . . . , c;
(3) calculating a value function J through formula (7), wherein if J is smaller than a predetermined threshold, or the variation of J relative to its previous value is smaller than a threshold, the fuzzy C-means clustering algorithm is stopped; and
(4) calculating new membership values μ_{ij} through formula (8) and returning to step (2), wherein:
c_i = (Σ_{j=1}^{n} μ_{ij}^w x_j) / (Σ_{j=1}^{n} μ_{ij}^w) (6),
J = Σ_{i=1}^{c} Σ_{j=1}^{n} μ_{ij}^w ‖c_i − x_j‖² (7),
μ_{ij} = 1 / Σ_{k=1}^{c} (‖c_i − x_j‖ / ‖c_k − x_j‖)^{2/(w−1)} (8).
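The following is a minimal Python sketch of steps (1) to (4) and formulas (6) to (8) applied to one-dimensional inter-frame differences. The small random perturbation of the initial memberships, the convergence tolerance and the iteration limit are implementation assumptions; step (1) of the method assigns all memberships to exactly 1/c.

import numpy as np

def fuzzy_c_means(x, c=3, w=1.5, tol=1e-4, max_iter=100):
    """Cluster 1-D inter-frame differences x into c classes (steps (1)-(4), formulas (6)-(8))."""
    x = np.asarray(x, dtype=float)
    n = x.size
    mu = np.full((c, n), 1.0 / c)            # step (1): all memberships set to 1/c
    # Assumed detail: a tiny random perturbation breaks the symmetry of the
    # uniform initialization so that the clustering centers can separate.
    rng = np.random.default_rng(0)
    mu += rng.uniform(0, 1e-3, size=mu.shape)
    mu /= mu.sum(axis=0, keepdims=True)
    prev_j = None
    for _ in range(max_iter):
        muw = mu ** w
        centers = (muw @ x) / muw.sum(axis=1)                    # formula (6)
        dist = np.abs(centers[:, None] - x[None, :]) + 1e-12     # |c_i - x_j|
        j = float((muw * dist ** 2).sum())                       # formula (7)
        if prev_j is not None and abs(prev_j - j) < tol:         # step (3): stop on convergence
            break
        prev_j = j
        ratio = dist[:, None, :] / dist[None, :, :]              # ||c_i-x_j|| / ||c_k-x_j||
        mu = 1.0 / (ratio ** (2.0 / (w - 1.0))).sum(axis=1)      # formula (8), back to step (2)
    # The cluster with the largest center corresponds to SC, the smallest to NSC,
    # and the remaining one to SSC.
    return centers, mu, mu.argmax(axis=0)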
The suspected shot change class SSC is processed as follows. When there are multiple SSC frames SSC(k) (k=j, . . . , j+n−1) between two continuous shot change frames SC(i) and SC(i+1), if a condition is met as follows:
H_SSC(k)≥0.025*[H_SC(i)+H_SC(i+1)] (9),
then the frame SSC(k) is reclassified as a shot change, wherein H_SSC(k) represents the histogram bin difference of SSC(k), and H_SC(i) and H_SC(i+1) represent the histogram bin differences of SC(i) and SC(i+1), respectively. However, shot changes should not occur continuously between adjacent frames. Therefore, some frames which meet formula (9) are deleted.
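A short Python sketch of the check in formula (9) follows; the function name and argument layout are hypothetical, and thinning of consecutive promoted frames, as described above, is left to the caller.

def promote_suspected_changes(h_ssc, h_sc_i, h_sc_next):
    """Return indices of SSC frames whose histogram bin difference satisfies formula (9)."""
    threshold = 0.025 * (h_sc_i + h_sc_next)
    return [k for k, value in enumerate(h_ssc) if value >= threshold]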
The mean image of all input training samples is calculated. While training, the mean image is subtracted from every frame of the left eye views inputted into the network. Similarly, when using the trained network model to generate the right eye views, the mean image also needs to be subtracted from the inputted left eye views.
A loss function is computed between {tilde over (Y)}, the output of the last layer of the 3D convolution neural network, and Y, the real right eye view corresponding to the middle frame of the five continuous frames participating in the 3D convolution, over the n outputted pixels. The network is trained by minimizing this loss function, and the training is completed when the loss function converges.
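The specification does not reproduce the loss formula in this text. A common per-pixel choice consistent with the description (a comparison between {tilde over (Y)} and Y averaged over the n output pixels) is the mean squared error; the Python sketch below is an assumption, not the invention's prescribed loss, and also shows the two-pixel crop that aligns the real right eye view with the smaller network output.

import numpy as np

def mse_loss(y_pred, y_true):
    """Assumed loss: mean squared error over the n outputted pixels."""
    diff = y_pred.astype(float) - y_true.astype(float)
    return float((diff ** 2).mean())

def crop_target(y_true_full):
    """Crop two pixels from each side so the real right eye view matches the output size."""
    return y_true_full[2:-2, 2:-2, :]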
Beneficial effects of the present invention are as follows. Based on the deep convolution neural network model obtained through training on a large data set, right eye videos are automatically generated from the left eye videos, so that human participation is maximally reduced during the stereoscopic video production process, thereby improving the efficiency of stereoscopic video production and reducing production costs.
The present invention is further explained in detail with accompanying drawings as follows.
Stereoscopic video generation is a technique developed because existing 3D content is relatively scarce. It is able to automatically produce a 3D display effect by computation from commonly watched 2D films or TV (television) shows.
As shown in the accompanying drawings, the stereoscopic video generation method based on the 3D convolution neural network comprises the following steps.
(1) Training the 3D convolution neural network.
In order to avoid the over-fitting phenomenon while training the deep convolution neural network, it is necessary to prepare sufficient training data. In the present invention, more than twenty non-animated 3D films downloaded from the web are taken as the training data; the 3D videos are divided into left eye JPEG image sequences and right eye JPEG image sequences through FFmpeg commands; and blank frames which may appear at titles, tails and shot transitions are deleted from the sequences.
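For illustration, the following Python sketch invokes FFmpeg's crop filter to separate the two views, assuming the source movie is stored in a side-by-side 3D format; the file names, output directories and the side-by-side assumption are not specified by the present invention.

import os
import subprocess

movie = "movie_3d_sbs.mkv"   # hypothetical side-by-side 3D source file
os.makedirs("left", exist_ok=True)
os.makedirs("right", exist_ok=True)

# Left eye view: left half of every frame, written as a JPEG image sequence.
subprocess.run(["ffmpeg", "-i", movie, "-vf", "crop=iw/2:ih:0:0",
                "left/%06d.jpg"], check=True)
# Right eye view: right half of every frame.
subprocess.run(["ffmpeg", "-i", movie, "-vf", "crop=iw/2:ih:iw/2:0",
                "right/%06d.jpg"], check=True)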
All training samples are cropped or scaled to the same height and width, and the mean image of all training images is calculated.
The left eye JPEG image sequences separated from every movie are processed through shot segmentation by the fuzzy C-means clustering method mentioned in the summary of the present invention; the mean image is subtracted from the left eye images to obtain the input data for training; the first two frames and the last two frames are removed from the right eye images of the corresponding shots, and two pixels are cropped off from each of the four sides to obtain the training targets; and the training pairs are then saved in an HDF5 format file.
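A minimal Python sketch of this pairing step, using the h5py library, is given below; the array layout, the dataset names "data" and "label", and the function signature are assumptions for illustration only.

import h5py
import numpy as np

def save_training_pairs(left_frames, right_frames, mean_image, path):
    """left_frames, right_frames: aligned (N, H, W, 3) uint8 frames of one shot."""
    data, labels = [], []
    for i in range(2, len(left_frames) - 2):             # skip first/last two frames
        clip = left_frames[i - 2:i + 3].astype(np.float32) - mean_image  # 5-frame input
        target = right_frames[i][2:-2, 2:-2, :]           # crop two pixels from each side
        data.append(clip)
        labels.append(target)
    with h5py.File(path, "w") as f:                       # dataset names are assumed
        f.create_dataset("data", data=np.stack(data))
        f.create_dataset("label", data=np.stack(labels))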
Convolutional kernel parameters of every layer of the 3D convolution neural network are initialized from a Gaussian distribution with a mean of 0 and a standard deviation of 0.01; the initial learning rate of every layer is set to 0.01; the learning rate is reduced to 1/10 of its former value every 100,000 training steps; and the momentum is set to 0.9.
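The specification does not name a training framework; a sketch of the stated initialization and schedule, assuming PyTorch and a model object implementing the 3D convolution network, could look as follows.

import torch

def initialize(model):
    # Gaussian initialization with mean 0 and standard deviation 0.01.
    for m in model.modules():
        if isinstance(m, torch.nn.Conv3d):
            torch.nn.init.normal_(m.weight, mean=0.0, std=0.01)
            if m.bias is not None:
                torch.nn.init.zeros_(m.bias)

def make_optimizer(model):
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    # Multiply the learning rate by 0.1 every 100,000 steps
    # (scheduler.step() is assumed to be called once per training step).
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100000, gamma=0.1)
    return optimizer, scheduler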
Through the above training data and parameters, the 3D convolution neural network is trained until the loss function converges.
(2) Generating right eye videos from the 2D left eye videos through the trained 3D convolution neural network.
The 2D videos to be converted are taken as the left eye videos; they are converted into image sequences in the same manner as the training data, processed through shot segmentation by the fuzzy C-means clustering method, and converted into images of the same size as the training images through scaling or cropping; the mean image of the training images is subtracted from every frame, and the frames are then inputted into the 3D convolution neural network shot by shot. The output of the last convolutional layer is of floating-point type, whereas three-channel RGB images whose gray scale is an integer in the range of [0, 255] are required. Therefore, the final output v of the network is rounded to obtain the expected right eye views, i.e., when v∈[0, 255], the final output is rounded to the nearest integer; when v<0, the final output is 0; and when v>255, the final output is 255. One middle frame of the right eye view is generated from every five frames of the left eye views, and the generation process slides forward with a stride of one in the time domain, so that the corresponding right eye views of every shot are obtained except for the first two frames and the last two frames. Losing these four frames per shot is acceptable during the video editing process.
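The rounding rule and the sliding five-frame window can be sketched in Python as follows; the `predict` callable stands for the trained network and is an assumed placeholder.

import numpy as np

def to_rgb_uint8(raw):
    """Round the floating-point network output and clamp it to the range [0, 255]."""
    return np.clip(np.rint(raw), 0, 255).astype(np.uint8)

def generate_right_views(left_frames, predict):
    """Slide a five-frame window with stride one over one shot of left eye frames."""
    right_views = []
    for i in range(2, len(left_frames) - 2):
        clip = left_frames[i - 2:i + 3]                  # five continuous left eye frames
        right_views.append(to_rgb_uint8(predict(clip)))  # right view of the middle frame
    return right_views   # the first and last two frames of the shot have no output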
(3) Synthesizing the left and right eye videos into 3D videos.
After the left eye videos have been converted into the right eye videos through the trained network, the left eye views and the generated right eye views are synthesized into 3D videos for playback on 3D display devices.
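As one possible illustration, the two image sequences can be combined into a side-by-side 3D video with FFmpeg's hstack filter; the paths, frame rate, codec and the side-by-side output format are assumptions, not requirements of the present invention.

import subprocess

# Combine left and right eye image sequences into a side-by-side 3D video.
subprocess.run(["ffmpeg",
                "-framerate", "24", "-i", "left/%06d.jpg",
                "-framerate", "24", "-i", "right_generated/%06d.jpg",
                "-filter_complex", "hstack=inputs=2",
                "-c:v", "libx264", "output_sbs_3d.mp4"], check=True)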
The foregoing is intended to be only a preferred embodiment of the present invention, but the protective scope of the present invention is not limited thereto, and any changes or substitutions that may be readily apparent to those skilled in the art within the technical scope of the present invention are intended to be encompassed within the protective scope of the present invention. Accordingly, the protective scope of the present invention should be based on the protective scope defined by the claims.
Filing Document | Filing Date | Country | Kind
PCT/CN2016/112812 | Dec. 29, 2016 | WO | 00

Publishing Document | Publishing Date | Country | Kind
WO2018/119808 | Jul. 5, 2018 | WO | A

U.S. Patent Application Publications
Number | Name | Date | Kind
20120051625 | Appia et al. | Mar. 2012 | A1

Foreign Patent Documents
Number | Date | Country
102223553 | Oct. 2011 | CN
102932662 | Feb. 2013 | CN
103716615 | Apr. 2014 | CN
105979244 | Sep. 2016 | CN
106504190 | Mar. 2017 | CN

U.S. Publication Data
Number | Date | Country
20190379883 A1 | Dec. 2019 | US