Video frame interpolation is a video processing technique having many applications. For example, video frame interpolation may be utilized when performing frame rate conversion or in the generation of slow motion video effects. Traditional approaches to performing video frame interpolation have included identifying correspondences between consecutive frames, and using those correspondences to synthesize the interpolated intermediate frames through warping. Unfortunately, however, those traditional approaches typically suffer from the inherent ambiguities in estimating the correspondences between consecutive frames, and are particularly sensitive to occlusions/dis-occlusion, changes in colors, and changes in lighting.
In an attempt to overcome the limitations of traditional methods for performing video frame interpolation, alternative approaches have been explored. One such alternative approach relies on phase-based decomposition of the input images. However, the conventional methods based on this alternative approach are limited in the range of motion they can handle. Consequently, there remains a need in the art for a video processing solution capable of interpolating video frames for challenging scenes containing changes in color, changes in light, and/or motion blur.
There are provided systems and methods for performing video frame interpolation using a convolutional neural network, substantially as shown in and/or described in connection with at least one of the figures, and as set forth more completely in the claims.
The following description contains specific information pertaining to implementations in the present disclosure. One skilled in the art will recognize that the present disclosure may be implemented in a manner different from that specifically discussed herein. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.
The present application discloses a video processing solution suitable for use in performing video frame interpolation that overcomes the drawbacks and deficiencies in the conventional art. In one implementation, the present solution does so at least in part by utilizing a convolutional neural network (CNN) configured to receive phase-based decompositions of images contained in consecutive video frames and to determine a phase-based intermediate decomposition of an image of an in-between video frame based on those received image decompositions. In one implementation, the CNN architecture disclosed in the present application mirrors in its structure the phase-based decomposition applied to the video frame images, and may be configured to determine phase and amplitude values of the intermediate image decomposition resolution level-by-resolution level.
The CNN architecture disclosed herein is advantageously designed, in one implementation, to require relatively few parameters. In a further implementation, the present solution introduces a “phase loss” during training of the CNN that is based on the phase difference between an image included in an interpolated in-between video frame and the corresponding true image (hereinafter “ground truth” or “ground truth image”). In addition, the phase loss encodes motion relevant information.
To achieve efficient and stable training, in one implementation, the present solution uses a hierarchical approach to training that starts from estimating phase values at lower resolution levels and incrementally proceeds to the next higher resolution level. As such, the present video processing solution may advantageously outperform existing state-of-the-art methods for video frame interpolation when applied to challenging video imagery.
It is noted that, as defined in the present application, a CNN is a deep artificial neural network including layers that apply one or more convolution operations to an input to the CNN. Such a CNN is a machine learning engine designed to progressively improve its performance of a specific task. In various implementations, CNNs may be utilized to perform video processing or natural-language processing.
As further shown in
It is noted that, although the present application refers to frame interpolation software code 110 as being stored in system memory 106 for conceptual clarity, more generally, frame interpolation software code 110 may be stored on any computer-readable non-transitory storage medium. The expression “computer-readable non-transitory storage medium,” as used in the present application, refers to any medium, excluding a carrier wave or other transitory signal, capable of providing instructions to a hardware processor, such as hardware processor 104 of computing platform 102 or hardware processor 124 of user system 120, for example. Thus, a computer-readable non-transitory medium may correspond to various types of media, such as volatile media and non-volatile media, for example. Volatile media may include dynamic memory, such as dynamic random access memory (dynamic RAM), while non-volatile memory may include optical, magnetic, or electrostatic storage devices. Common forms of computer-readable non-transitory media include, for example, optical discs, RAM, programmable read-only memory (PROM), erasable PROM (EPROM), and FLASH memory.
It is further noted that although
As a result, hardware processor 104 and system memory 106 may correspond to distributed processor and memory resources within video processing system 100. Thus, it is to be understood that various features of frame interpolation software code 110, such as one or more of the features described below by reference to
According to the implementation shown by
Although user system 120 is shown as a desktop computer in
It is noted that, in some implementations, frame interpolation software code 110 may be utilized directly by user system 120. For example, frame interpolation software code 110 may be transferred to user system memory 126, via download over communication system 118, for example, or via transfer using a computer-readable non-transitory medium, such as an optical disc or FLASH drive. In those implementations, frame interpolation software code 110 may be persistently stored on user system memory 126, and may be executed locally on user system 120 by user system hardware processor 124.
As further shown in
Video sequence 230a including first and second consecutive video frames 232 and 234, and interpolated video frame 233 correspond respectively in general to video sequence 130a including first and second consecutive video frames 132 and 134, and interpolated video frame 133, in
In addition,
As shown by
Also shown in
The functionality of frame interpolation software code 110/210 and CNN 240 will be further described by reference to
Referring now to
CNN 240 is trained to enable frame interpolation software code 110/210 to determine an intermediate image included in an interpolated video frame, given as inputs the images included in video frames neighboring the interpolated video frame. However, rather than directly predicting the color pixel values of the intermediate image, CNN 240 predicts the values of the complex-valued steerable pyramid decomposition of the intermediate image. Thus, the goal of training is to predict the phase values of the intermediate image included in the interpolated video frame based on the complex-valued steerable pyramid decomposition of the input video frame images.
An exemplary implementation of the architecture of CNN 240 is shown in
It is noted that although
In the interests of stability, a hierarchical training procedure may be adopted, where the lowest level convolutional processing blocks are trained first. That is to say, the convolutional processing blocks corresponding to lower resolution levels of the complex-valued steerable pyramid used for image decomposition may be trained independently of convolutional processing blocks corresponding to higher resolution levels of the complex-valued steerable pyramid.
The exemplary training procedure described in the present application can be seen as a form of curriculum training that aims at improving training by gradually increasing the difficulty of the learning task. According to the present exemplary implementation, use of a complex-valued steerable pyramid decomposition on input images automatically provides a coarse to fine representation of those images that is well suited for such a hierarchical training approach.
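The coarse-to-fine curriculum described above can be pictured as a list of training stages. The following is a minimal sketch under assumptions that are not stated in the text: the helper name `hierarchical_schedule` is hypothetical, and the sketch assumes that each stage adds the block for the next higher resolution level while the blocks introduced in earlier stages remain in the set being trained.

```python
def hierarchical_schedule(num_blocks):
    """List, per training stage, the block indices included in that stage.

    Index 0 stands for the block at the lowest resolution level.
    Assumption for this sketch: blocks added in earlier stages stay
    active in later stages.
    """
    if num_blocks < 1:
        raise ValueError("num_blocks must be at least 1")
    return [list(range(stage + 1)) for stage in range(num_blocks)]


schedule = hierarchical_schedule(3)
for stage, blocks in enumerate(schedule):
    print(f"stage {stage}: train blocks {blocks}")
```

A variant that freezes the lower blocks once they are trained would instead list only the newest block index at each stage; the text does not specify which of these is used.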
Given input images I1 and I2, and ground truth intermediate image I, CNN 240 is trained to determine a predicted intermediate image Î that is as close as possible to ground truth image I. The input images I1 and I2 are decomposed using the complex-valued steerable pyramid. By applying the complex-valued steerable pyramid filters Ψω,θ, which include quadrature pairs, the input images I1 and I2 can be decomposed into a set of scale and orientation dependent complex-valued sub-bands Rω,θ(x,y):
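$$R_{\omega,\theta}(x,y) = (I * \Psi_{\omega,\theta})(x,y) \quad \text{(Equation 1)}$$
$$= C_{\omega,\theta}(x,y) + i\,S_{\omega,\theta}(x,y) \quad \text{(Equation 2)}$$
$$= A_{\omega,\theta}(x,y)\,e^{i\phi_{\omega,\theta}(x,y)} \quad \text{(Equation 3)}$$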
where Cω,θ(x,y) is the cosine part and Sω,θ(x,y) is the sine part. Because they represent the even-symmetric and odd-symmetric filter response, respectively, it is possible to compute for each sub-band the amplitude:
Aω,θ(x,y)=|Rω,θ(x,y)| (Equation 4)
and the phase values:
ϕω,θ(x,y)=Im(log(Rω,θ(x,y))), (Equation 5)
where Im represents the imaginary part of the term. The frequencies that cannot be captured in the levels of the complex-valued steerable pyramid can be summarized in real valued high-pass and low-pass residuals rh and rl, respectively.
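Equations 2 through 5 can be checked numerically on a toy sub-band. The sketch below is illustrative only: the function name `amplitude_and_phase` and the random cosine and sine parts are assumptions standing in for real filter responses, and no steerable pyramid filter is actually applied.

```python
import numpy as np


def amplitude_and_phase(subband):
    """Split a complex-valued sub-band into amplitude and phase."""
    amplitude = np.abs(subband)   # Equation 4
    phase = np.angle(subband)     # Equation 5, Im(log(R)) for nonzero R
    return amplitude, phase


# Toy sub-band built from a cosine (even) and a sine (odd) part, as in Equation 2.
rng = np.random.default_rng(0)
cosine_part = rng.standard_normal((4, 4))
sine_part = rng.standard_normal((4, 4))
subband = cosine_part + 1j * sine_part

amplitude, phase = amplitude_and_phase(subband)

# Polar form from Equation 3 recovers the original sub-band.
reconstructed = amplitude * np.exp(1j * phase)
```

Here `np.angle` is used in place of the explicit imaginary part of the complex logarithm; the two agree for nonzero responses.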
The input images I1 and I2 are decomposed using Equation 1 to yield respective image decompositions R1 and R2 as:
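$$R_i = \Psi(I_i) = \left\{ \{ (\phi^{i}_{\omega,\theta}, A^{i}_{\omega,\theta}) \mid \omega,\theta \},\; r^{i}_{l},\; r^{i}_{h} \right\} \quad \text{(Equation 6)}$$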
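One way to picture the decomposition in Equation 6 is as a small container holding a phase and an amplitude array per level and orientation, plus the two residuals. The sketch below is an assumption for illustration: the names `Decomposition` and `from_subbands`, the `(level, orientation)` key layout, and the toy values are not taken from the text.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

import numpy as np

Key = Tuple[int, int]  # (level index, orientation index), a hypothetical layout


@dataclass
class Decomposition:
    phases: Dict[Key, np.ndarray]
    amplitudes: Dict[Key, np.ndarray]
    low_pass: np.ndarray   # real-valued low-pass residual
    high_pass: np.ndarray  # real-valued high-pass residual


def from_subbands(subbands, low_pass, high_pass):
    """Build the container from complex sub-bands and the two residuals."""
    phases = {key: np.angle(band) for key, band in subbands.items()}
    amplitudes = {key: np.abs(band) for key, band in subbands.items()}
    return Decomposition(phases, amplitudes, low_pass, high_pass)


# Toy data: two levels with one orientation each (values are arbitrary).
subbands = {
    (0, 0): np.array([[1 + 1j, -2j]]),
    (1, 0): np.array([[3 + 0j, 0 + 4j], [-1 + 0j, 1 - 1j]]),
}
decomposition = from_subbands(
    subbands, low_pass=np.zeros((1, 2)), high_pass=np.zeros((2, 2))
)
```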
These image decompositions R1 and R2 are the inputs to CNN 240. Using these values, the training objective is to determine R̂, the decomposition of the predicted intermediate image Î corresponding to the interpolated video frame to be inserted between the video frames containing images I1 and I2. We introduce the prediction function, ℱ, learned with CNN 240 using the parameters Λ. Denoting Ψ−1 as the reconstruction function, the predicted intermediate image Î corresponding to the interpolated video frame is:
$$\hat{I} = \Psi^{-1}(\hat{R}) = \Psi^{-1}(\mathcal{F}(R_1, R_2; \Lambda)) \quad \text{(Equation 7)}$$
CNN 240 is trained to minimize the objective function, or loss function, ℒ (hereinafter “loss function”), over a dataset 𝒟 including triplets of images I1, I2, and ground truth intermediate image I:
$$\Lambda^{*} = \arg\min_{\Lambda}\; \mathbb{E}_{I_1, I_2, I \sim \mathcal{D}}\left[\mathcal{L}(\mathcal{F}(R_1, R_2; \Lambda), I)\right] \quad \text{(Equation 8)}$$
The training objective is to predict, through a determination performed using CNN 240, the intermediate image decomposition values R̂ that lead to a predicted intermediate image Î similar to the ground truth image I. The training also penalizes deviation from the ground truth image decomposition R. Thus, a loss function is utilized that includes an image loss term summed with a phase loss term.
For the image loss term, the l1-norm of pixel images is used to express the image loss as:
$$\mathcal{L}_1 = \lVert I - \hat{I} \rVert_1 \quad \text{(Equation 9)}$$
Regarding the phase loss term, it is noted that the predicted intermediate image decomposition R̂ of the intermediate image Î corresponding to the interpolated video frame includes amplitudes and phase values for each level and orientation present in the complex-valued steerable pyramid decomposition. To improve the quality of the intermediate image, a loss term that captures the deviations Δϕ of the predicted phase ϕ̂ from the ground truth phase ϕ is summed with the image loss term. The phase loss term is defined as the l1 loss of the phase difference values over all levels (ω) and orientations (θ):
$$\mathcal{L}_{\text{phase}} = \sum_{\omega,\theta} \lVert \Delta\phi_{\omega,\theta} \rVert_1 \quad \text{(Equation 10)}$$
where Δϕ is defined as:
$$\Delta\phi = \operatorname{atan2}\left(\sin(\phi - \hat{\phi}),\, \cos(\phi - \hat{\phi})\right) \quad \text{(Equation 11)}$$
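The atan2 form in Equation 11 keeps each phase difference within the principal range, so two phases that sit just on either side of the ±π boundary are treated as close rather than almost 2π apart. A minimal sketch of Equations 10 and 11 follows; the function names and the dictionary layout keyed by `(level, orientation)` are assumptions made for illustration.

```python
import numpy as np


def wrapped_phase_difference(phase_true, phase_pred):
    """Equation 11: phase difference wrapped to the principal range."""
    difference = phase_true - phase_pred
    return np.arctan2(np.sin(difference), np.cos(difference))


def phase_loss(phases_true, phases_pred):
    """Equation 10: l1 norm of wrapped differences summed over all sub-bands."""
    return sum(
        np.abs(wrapped_phase_difference(phases_true[key], phases_pred[key])).sum()
        for key in phases_true
    )


# Toy phases for a single sub-band; the first pair straddles the wrap-around point.
phases_true = {(0, 0): np.array([3.0, 0.5])}
phases_pred = {(0, 0): np.array([-3.0, 0.25])}

raw = phases_true[(0, 0)] - phases_pred[(0, 0)]
wrapped = wrapped_phase_difference(phases_true[(0, 0)], phases_pred[(0, 0)])
loss = phase_loss(phases_true, phases_pred)
```

In this toy case the raw difference of the first pair is 6.0 radians, while the wrapped difference is 6.0 − 2π, a small negative angle.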
Finally, we define the loss function using the image loss term and the phase loss term as:
$$\mathcal{L} = \mathcal{L}_1 + v\,\mathcal{L}_{\text{phase}} \quad \text{(Equation 12)}$$
where v is a weighting factor applied to the phase loss term. That is to say, in some implementations, the phase loss term ℒphase of loss function ℒ is weighted relative to the image loss term ℒ1. Moreover, in some implementations, the weighting factor v may be less than one (1.0). In one exemplary implementation, for instance, the weighting factor v may be approximately 0.1. It is noted, however, that in some implementations, it may be advantageous or desirable for the weighting factor v to be greater than one.
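The combined loss of Equation 12 can be sketched numerically as below. This is an illustration under assumptions: the function names, the use of plain lists of phase arrays (one entry per level and orientation), and the toy values are hypothetical, and the default `v=0.1` simply mirrors the exemplary value mentioned above.

```python
import numpy as np


def image_loss(image_true, image_pred):
    """Equation 9: l1 norm of the pixel-wise image difference."""
    return np.abs(image_true - image_pred).sum()


def phase_loss(phases_true, phases_pred):
    """Equation 10, using the wrapped difference of Equation 11."""
    total = 0.0
    for phase_true, phase_pred in zip(phases_true, phases_pred):
        difference = phase_true - phase_pred
        wrapped = np.arctan2(np.sin(difference), np.cos(difference))
        total += np.abs(wrapped).sum()
    return total


def total_loss(image_true, image_pred, phases_true, phases_pred, v=0.1):
    """Equation 12: image loss plus the phase loss weighted by v."""
    return image_loss(image_true, image_pred) + v * phase_loss(phases_true, phases_pred)


image_true = np.array([[0.0, 1.0], [1.0, 0.0]])
image_pred = np.array([[0.0, 0.5], [1.0, 0.0]])
phases_true = [np.array([0.0, np.pi / 2])]
phases_pred = [np.array([0.0, 0.0])]

loss = total_loss(image_true, image_pred, phases_true, phases_pred)
```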
Referring once again to
User 136 may utilize user system 120 to interact with video processing system 100 in order to synthesize interpolated video frame 133/233 for insertion between first and second video frames 132/232 and 134/234 in video sequence 130b. As shown by
Flowchart 460 continues with decomposing first and second images I1 and I2 included in respective first and second consecutive video frames 132/232 and 134/234 to produce first and second image decompositions R1 212 and R2 214 (action 463). As discussed above, in some implementations, first and second images I1 and I2 may be decomposed using a complex-valued steerable pyramid to filter first and second images I1 and I2, according to Equation 1 above. First and second images I1 and I2 may be decomposed by frame interpolation software code 110/210, executed by hardware processor 104, and using decomposition module 216.
Flowchart 460 continues with using CNN 240 to determine intermediate image decomposition R̂ 213 based on first and second image decompositions R1 212 and R2 214, where intermediate image decomposition R̂ 213 corresponds to interpolated video frame 133/233 for insertion between first and second video frames 132/232 and 134/234 (action 464). Determination of intermediate image decomposition R̂ 213 based on first and second image decompositions R1 212 and R2 214 may be performed by frame interpolation software code 110/210, executed by hardware processor 104, and using CNN 240.
Referring to
Table 540 corresponds in general to the architecture of CNN 240. Thus, CNN 240 may include the convolutional processing blocks 542-0 and 542-1 through 542-10 described by table 540. In addition, convolutional processing blocks 542-0 and 542-1 correspond respectively in general to convolutional processing blocks 342-0 and 342-1, in
Thus, each of convolutional processing blocks 542-1 through 542-10 includes successive convolutional processing layers corresponding to successive convolutional processing layers 344a-1 and 344b-1, as well as a final processing layer corresponding to final processing layer 346-1. Furthermore, each of convolutional processing blocks 542-1 through 542-10 also includes elements corresponding respectively in general to intermediate features map 348 and a next lower resolution level intermediate image decomposition determined by the next lower level convolutional processing block.
It is noted that the next lower resolution level intermediate image decomposition determined by the next lower level convolutional processing block is resized before being provided as an input to each of convolutional processing blocks 542-1 through 542-10. In other words, the intermediate image decomposition output of each of convolutional processing blocks 542-0 and 542-1 through 542-9, but not convolutional processing block 542-10, is resized and provided as an input to the next convolutional processing blocks in sequence, from convolutional processing block 542-1 to convolutional processing block 542-10.
Convolutional processing block 542-0 is configured to determine lowest resolution level 313-0 of intermediate image decomposition R̂ 213 based on lowest resolution levels 212-0/312-0 and 214-0/314-0 of respective first and second image decompositions R1 212 and R2 214. Convolutional processing block 542-1 is analogously configured to determine next higher resolution level 313-1 of intermediate image decomposition R̂ 213 based on next higher resolution levels 212-1/312-1 and 214-1/314-1 of respective first and second image decompositions R1 212 and R2 214, and so forth, through convolutional processing block 542-10.
According to the present exemplary implementation, each of convolutional processing blocks 542-0 and 542-1 through 542-10 corresponds respectively to a resolution level of the complex-valued steerable pyramid applied to first and second images I1 and I2 by decomposition module 216, in action 463. For example, convolutional processing block 542-0 may correspond to the lowest resolution level of the complex-valued steerable pyramid, while convolutional processing block 542-10 may correspond to the highest resolution level of the complex-valued steerable pyramid. Moreover, convolutional processing blocks 542-1 through 542-9 may correspond respectively to progressively higher resolution levels of the complex-valued steerable pyramid between the lowest and the highest resolution levels.
In some implementations, CNN 240 may determine intermediate image decomposition R̂ 213 using convolutional processing blocks 542-0 and 542-1 through 542-10 in sequence, beginning with convolutional processing block 542-0 corresponding to the lowest resolution level of the complex-valued steerable pyramid and ending with convolutional processing block 542-10 corresponding to the highest resolution level of the complex-valued steerable pyramid. Thus, in those implementations, intermediate image decomposition R̂ 213 may be determined by CNN 240 level-by-level with respect to the resolution levels of the complex-valued steerable pyramid, from a lowest resolution level to a highest resolution level, using convolutional processing blocks 542-0 and 542-1 through 542-10 in sequence.
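The level-by-level data flow described above, in which each block receives the two input levels together with the resized output of the previous block, can be sketched as a simple loop. Everything named here is an assumption for illustration: `placeholder_block` is a hypothetical blending rule standing in for a trained convolutional processing block, and `resize_nearest` uses nearest-neighbor resampling because the text does not state the resizing method.

```python
import numpy as np


def resize_nearest(array, shape):
    """Nearest-neighbor resize of a 2D array to the given (rows, cols)."""
    rows = (np.arange(shape[0]) * array.shape[0]) // shape[0]
    cols = (np.arange(shape[1]) * array.shape[1]) // shape[1]
    return array[np.ix_(rows, cols)]


def placeholder_block(level_1, level_2, previous):
    """Hypothetical stand-in for one trained block (not the actual CNN)."""
    blended = 0.5 * (level_1 + level_2)
    if previous is None:
        return blended
    return 0.5 * blended + 0.5 * previous


def predict_levels(decomposition_1, decomposition_2, block=placeholder_block):
    """Run the blocks in sequence, from lowest to highest resolution level."""
    predictions = []
    previous = None
    for level_1, level_2 in zip(decomposition_1, decomposition_2):
        if previous is not None:
            # The lower-level output is resized before entering the next block.
            previous = resize_nearest(previous, level_1.shape)
        current = block(level_1, level_2, previous)
        predictions.append(current)
        previous = current
    return predictions


# Toy inputs: three levels ordered from lowest to highest resolution.
shapes = [(2, 2), (4, 4), (8, 8)]
decomposition_1 = [np.ones(shape) for shape in shapes]
decomposition_2 = [3.0 * np.ones(shape) for shape in shapes]
predictions = predict_levels(decomposition_1, decomposition_2)
```

The output of the last block is not resized, matching the description that only the outputs feeding a higher-level block are resized.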
Flowchart 460 can conclude with synthesizing interpolated video frame 133/233 based on intermediate image decomposition R̂ 213 (action 465). By way of example, the reconstruction function Ψ−1 can be applied to intermediate image decomposition R̂ 213 to produce intermediate image Î in a manner analogous to Equation 7, above. Interpolated video frame 133/233 may then be synthesized to include intermediate image Î. Synthesis of interpolated video frame 133/233 may be performed by frame interpolation software code 110/210, executed by hardware processor 104, and using frame synthesis module 218.
It is noted that, although not included in flowchart 460, in some implementations, the present method can include rendering video sequence 130b including interpolated video frame 133/233 inserted between first and second video frames 132/232 and 134/234, on a display, such as display 108 or display 128 of user system 120. As noted above, in some implementations, frame interpolation software code 110/210 including CNN 240 may be stored on a computer-readable non-transitory medium, may be transferred to user system memory 126, and may be executed by user system hardware processor 124. Consequently, the rendering of video sequence 130b including interpolated video frame 133/233 inserted between first and second video frames 132/232 and 134/234 on display 108 or display 128 may be performed by frame interpolation software code 110/210, executed respectively by hardware processor 104 of computing platform 102 or by user system hardware processor 124.
Thus, the present application discloses a video processing solution suitable for use in performing video frame interpolation. The present solution utilizes a CNN configured to receive phase-based decompositions of images contained in consecutive video frames and to determine the phase-based intermediate decomposition of an image contained in an in-between video frame based on those received image decompositions. The disclosed CNN architecture is simple and is advantageously designed to require relatively few parameters. Moreover, the present solution introduces the concept of phase loss during training of the CNN that is based on the phase difference between an image included in an interpolated in-between video frame and its corresponding ground truth. Consequently, the present video processing solution advantageously outperforms existing state-of-the-art methods for video frame interpolation when applied to challenging video imagery including changes in color, changes in lighting, and/or motion blur.
From the above description it is manifest that various techniques can be used for implementing the concepts described in the present application without departing from the scope of those concepts. Moreover, while the concepts have been described with specific reference to certain implementations, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the scope of those concepts. As such, the described implementations are to be considered in all respects as illustrative and not restrictive. It should also be understood that the present application is not limited to the particular implementations described herein, but many rearrangements, modifications, and substitutions are possible without departing from the scope of the present disclosure.
The present application claims the benefit of and priority to a Provisional Patent Application Ser. No. 62/643,580, filed Mar. 15, 2018, and titled “Video Frame Interpolation Using a Convolutional Neural Network,” which is hereby incorporated fully by reference into the present application.
Number | Date | Country | |
---|---|---|---|
20190289257 A1 | Sep 2019 | US |
Number | Date | Country | |
---|---|---|---|
62643580 | Mar 2018 | US |