The present invention relates to the field of computer vision, and more particularly to a method for generating and interacting with virtual viewpoints in a Free Viewpoint Video (FVV).
In traditional TV relay, the program director shoots a video program from a limited number of viewpoints and relays it to viewers; the video information flows in one direction only, so viewers can only watch from the specific viewpoints chosen by the director. Moreover, because most broadcast scenes are covered by a limited number of cameras, frame skipping occurs when the director actively switches viewpoints, giving viewers a less than ideal viewing experience. To overcome this passive viewing experience, Free Viewpoint Video (FVV) technology has developed rapidly in recent years alongside improvements in video capture devices and computing power, and interactive video viewing is becoming the development trend of a new generation of media.
In the TV relay of typical stage scenes such as sports events, a broadcaster will set up as many cameras as possible to capture as many viewpoints as possible. As the number of cameras increases, viewpoint switching becomes smoother, but the data transmission load also grows linearly. Virtual viewpoint generation technology was therefore developed to achieve the smoothest possible viewpoint switching with a controllable number of cameras. This technology generates virtual viewpoints between the physical viewpoints captured by the cameras, shifting the data transmission load from the physical capture terminals to a local or cloud server with high computing power. Generating higher-quality virtual viewpoints with the lowest possible computational effort has therefore become the core of FVV-related technologies.
Some existing virtual viewpoint generation technologies generate viewpoints by rendering traditional depth and parallax images. For example, the method in patent CN102447932A includes the following steps: calculating a depth image of the shot scene using the pre-calibrated intrinsic and extrinsic parameters of a camera; mapping the pixels of a corresponding reference image into 3D space using the depth information in the depth image; converting the reference-image pixels in 3D space to a virtual camera position using the translation parameters and the intrinsic parameters of the camera; and finally displaying the images on the plane of the virtual camera, i.e., the virtual viewpoint images. This method is computationally intensive because all pixels of the images must be traversed, and the rendering cost grows rapidly as image resolution and the number of cameras increase. Moreover, this virtual viewpoint generation method requires the cameras to be calibrated in advance, and the difficulty and precision of camera calibration suffer greatly in the TV relay of large scenes such as sports events, degrading the quality of the synthesized virtual viewpoints.
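For context, the depth-image-based rendering pipeline described above can be sketched roughly as follows; the camera parameters, array shapes, and helper names are illustrative assumptions rather than the cited patent's actual implementation, and hole filling and blending are omitted.

```python
import numpy as np

def render_virtual_view(ref_image, depth, K_ref, K_virt, R, t):
    """Illustrative depth-image-based rendering: back-project every pixel of the
    reference view into 3D, transform it into the virtual camera frame, and
    re-project it onto the virtual image plane."""
    h, w = depth.shape
    virt = np.zeros_like(ref_image)

    # Pixel grid in homogeneous coordinates, 3 x (h*w).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T

    # Back-project into 3D using the depth map and the reference intrinsics.
    pts_ref = np.linalg.inv(K_ref) @ pix * depth.reshape(1, -1)

    # Rigid transform into the virtual camera frame, then project.
    pts_virt = R @ pts_ref + t.reshape(3, 1)
    proj = K_virt @ pts_virt
    z = proj[2]
    u_v = np.round(proj[0] / np.maximum(z, 1e-6)).astype(int)
    v_v = np.round(proj[1] / np.maximum(z, 1e-6)).astype(int)

    # Keep only pixels that land inside the virtual image with positive depth.
    valid = (z > 1e-6) & (u_v >= 0) & (u_v < w) & (v_v >= 0) & (v_v < h)
    virt[v_v[valid], u_v[valid]] = ref_image.reshape(-1, ref_image.shape[-1])[valid]
    return virt
```

Even in this stripped-down form, every pixel of every reference image must be traversed, which illustrates why the cost of such methods grows with resolution and camera count.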
At present, deep learning for virtual viewpoint generation is mostly applied to Video Frame Interpolation, where networks with specific structures predict and generate virtual viewpoints from optical-flow-related information between adjacent video frames together with physical viewpoint images in a dataset. If these Video Frame Interpolation networks are used directly for multi-viewpoint shooting of large scenes, large areas of artifacts are produced because adjacent physical viewpoints have wide baselines and large displacements.
An objective of the present invention is to provide a method for generating and interacting with virtual viewpoints in a Free Viewpoint Video (FVV) based on a deep convolutional neural network (CNN), to improve the quality of virtual viewpoints and reduce the computational effort.
The technical solution employed in the present invention is as follows.
A Free Viewpoint Video (FVV) generation and interaction method based on a deep convolutional neural network is provided, including following steps of:
a step (1) of rectifying the pose and color of cameras in an acquisition system
the acquisition system includes N cameras in uniform and circular arc-shaped arrangement at the same height; the pose and position of the cameras are rectified based on a reference object at the center of the circular arc, and the position of each camera remains unchanged after rectification; and color parameters of the N cameras are rectified by a white balance algorithm based on Gray World;
a step (2) of shooting a target scene object in synchronous video sequences with the camera array of the acquisition system, and selecting video frames at a certain moment to rectify the baselines of N−1 groups of adjacent viewpoints in sequence, to obtain N−1 image affine transformation matrices Mi, i = 1, 2, . . . , N−1;
a step (3) of rectifying baselines of all frame data of adjacent sub-viewpoints in sequence by the obtained affine transformation matrices Mi;
a step (4) of pre-processing binocular datasets through baseline rectification, color rectification based on the Gray World algorithm, and displacement threshold screening based on optical flow calculation, and then training the deep CNN for virtual viewpoint generation;
a step (5) of inputting the baseline data rectified in the step (3) into the deep CNN pre-trained in the step (4), and outputting generated virtual viewpoint 2D images based on the number of reconstructed virtual viewpoints;
a step (6) of stitching the image matrices of the physical viewpoints and the generated virtual viewpoints in order of their physical spatial positions, and labeling the Block_Index of each viewpoint in the stitched image matrix in sequence;
and a step (7) of synthesizing the stitched frames at every moment obtained in the step (6) into an FVV at a shooting frame rate of multiple cameras.
Compared with the prior art, the beneficial effects of the present invention are as follows.
(1) Unlike traditional geometric methods based on depth and parallax, the method in the present invention rectifies the baselines of the shot FVV sequences at the pixel level and predicts and generates virtual viewpoints with the deep CNN without calibrating multiple cameras in advance, which solves the problems of low precision and difficulty in multi-camera calibration in large scenes, reduces the computational effort, and improves the efficiency of virtual viewpoint generation.
(2) According to the present invention, the binocular vision datasets are pre-processed through baseline rectification, color rectification, and displacement threshold screening based on optical flow calculation when training the deep CNN, so the virtual viewpoints are synthesized with better quality in the case of wide baselines and large displacements between adjacent viewpoints, and large areas of artifacts can be eliminated to some extent.
The present invention will be described in detail below with reference to the accompanying drawings by embodiments.
In this embodiment, a multi-camera array as shown in a topology of
A processing flow of this embodiment is shown in
(1) Circular Arc-Shaped Arrangement of a Camera Array, and Pose and Color Rectification of Multiple Cameras
The topology of a hardware acquisition system is shown in
(2) Synchronous Rectification of Multiple Cameras
All cameras are synchronized by an external trigger signal generator through video data trigger lines, and the frequency of trigger signals is adjusted to trigger all cameras to synchronously acquire information of a shot scene.
(3) Simultaneous Acquisition of Video Sequences and Baseline Rectification to Obtain Affine Transformation Matrices
The camera array arranged in the step (1) is configured to shoot a target scene object in synchronous video sequences. Video frames at a certain moment are selected to rectify the baselines of N−1 groups of adjacent viewpoints in sequence, and a translation factor (x, y), a rotation factor θ and a scaling factor k of the affine transformation are set manually based on feature points of the object in the scene, so that the feature points at the center of the scene coincide with the position of the reference object, such as the central rectification point of the scene at point O in the schematic diagram of the baseline rectification system used in this embodiment (shown in
where α = k·cos(θ), β = k·sin(θ).
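As a minimal sketch, one way to assemble an affine matrix containing these α and β terms is OpenCV's cv2.getRotationMatrix2D(), which already produces the [[α, β, ·], [−β, α, ·]] layout; treating (cx, cy) as the image point that maps onto the reference point O and adding a manual translation is an assumption about how the factors above are combined, and the numeric values below are placeholders.

```python
import cv2
import numpy as np

def build_affine(cx, cy, theta, k, tx=0.0, ty=0.0):
    """Illustrative construction of one affine matrix M_i: (cx, cy) is the image
    point mapped onto the reference point, theta the rotation factor in degrees,
    k the scaling factor, and (tx, ty) an extra manual translation."""
    # getRotationMatrix2D yields [[alpha, beta, ...], [-beta, alpha, ...]]
    # with alpha = k*cos(theta) and beta = k*sin(theta).
    M = cv2.getRotationMatrix2D((cx, cy), theta, k)
    M[0, 2] += tx  # append the manually set translation factor
    M[1, 2] += ty
    return M.astype(np.float32)

# Example placeholder values, in practice set from scene feature points.
M_i = build_affine(cx=960, cy=540, theta=1.5, k=1.02, tx=12, ty=-3)
```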
(4) Batch Baseline Rectification
The baselines of all frame data of adjacent sub-viewpoints are rectified in sequence with the obtained affine transformation matrices Mi through the warpAffine( ) function in OpenCV: the baselines of the N−1 groups of cameras are rectified in pairs with the affine matrices Mi (i = 1, 2, . . . , N−1) obtained in the step (3), sequentially according to the spatial positions of the N cameras in the circular arc-shaped arrangement, so that the rectified image baselines of all N cameras are kept at the same level.
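A sketch of this batch rectification step, under the assumption that each camera's frames are exported as image files; the directory layout, file naming, and frame size are placeholders.

```python
import cv2
import glob
import os

def rectify_camera_frames(frame_dir, out_dir, M, size):
    """Apply one pre-computed affine matrix M to every frame of one camera so
    that its baseline stays level with its neighbouring viewpoint."""
    os.makedirs(out_dir, exist_ok=True)
    for path in sorted(glob.glob(os.path.join(frame_dir, "*.png"))):
        frame = cv2.imread(path)
        warped = cv2.warpAffine(frame, M, size)  # size = (width, height)
        cv2.imwrite(os.path.join(out_dir, os.path.basename(path)), warped)

# Example: rectify the N-1 camera groups in spatial order with matrices M_1..M_{N-1}.
# for i, M_i in enumerate(affine_matrices, start=1):
#     rectify_camera_frames(f"cam_{i:02d}", f"cam_{i:02d}_rect", M_i, (1920, 1080))
```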
(5) Virtual Viewpoint Generation Network Training
This step starts with pre-processing the datasets through baseline rectification, color rectification, and displacement threshold screening based on optical flow calculation. Each dataset consists of ‘left-center-right’ viewpoint image triplets from many scenes. The baselines of the image triplets are first rectified in batches by the same method as in the step (3), so that several groups of feature points in every three images are kept at the same level. Color rectification is then performed with the white balance algorithm based on Gray World, so that the three images of the same scene share the same white balance parameters. Finally, optical flow maps are calculated pairwise within each triplet to obtain the average pixel displacement of the same object in the same scene, and a threshold is set so that the triplets whose displacement exceeds it are selected to form a new training dataset.
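The three pre-processing operations could look roughly like the sketch below; the displacement threshold, the Farneback optical-flow parameters, and the choice of the center view as the shared white-balance reference are assumptions, not values given in the embodiment.

```python
import cv2
import numpy as np

def gray_world_gain(img):
    """Gray World gain: per-channel scale that maps each channel mean to the global mean."""
    means = img.reshape(-1, 3).astype(np.float32).mean(axis=0)
    return means.mean() / (means + 1e-6)

def apply_gain(img, gain):
    return np.clip(img.astype(np.float32) * gain, 0, 255).astype(np.uint8)

def mean_displacement(img_a, img_b):
    """Average pixel displacement between two views, estimated with dense optical flow."""
    ga = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gb = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(ga, gb, None, 0.5, 3, 15, 3, 5, 1.2, 0)
    return float(np.linalg.norm(flow, axis=2).mean())

def preprocess_triplet(left, center, right, threshold=20.0):
    """Color-rectify a baseline-rectified 'left-center-right' triplet with shared
    Gray World parameters, then keep it only if the adjacent-view displacement
    exceeds the threshold (wide-baseline, large-displacement samples)."""
    gain = gray_world_gain(center)  # same white balance parameters for all three views
    left, center, right = (apply_gain(im, gain) for im in (left, center, right))
    disp = max(mean_displacement(left, center), mean_displacement(center, right))
    return (left, center, right) if disp > threshold else None
```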
The structure of the deep CNN used in this embodiment is based on an open-source network SepConv, as shown in
L1 = ∥R − R_GT∥₂², L2 = ∥S(R) − S(R_GT)∥₂²
Ltotal = L1 + α·L2, where L1 is the 2-norm error between the network-predicted image R and the ground truth R_GT based on the pixel RGB difference, L2 is the difference between the feature structures extracted by the network, and S(·) is a feature extraction function used to train the network model to perceive the deep structure of the scene. The total training loss function Ltotal is a linearly weighted sum of L1 and L2. An optimal parameter model of the Virtual View Generation Network (VVGN) is obtained by iterative training for a certain period.
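A sketch of this training loss in PyTorch; modelling S(·) with frozen VGG16 features and the value of α are assumptions, since the embodiment only states that S(·) extracts a deep structure of the scene, and mse_loss is used here as a mean-normalized form of the squared 2-norm.

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen feature extractor standing in for S(.); an assumption, not the patent's network.
vgg_features = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def vvgn_loss(pred, gt, alpha=0.1):
    """Ltotal = L1 + alpha * L2 for a predicted image R and ground truth R_GT."""
    l1 = F.mse_loss(pred, gt)                              # pixel term ||R - R_GT||_2^2
    l2 = F.mse_loss(vgg_features(pred), vgg_features(gt))  # feature term ||S(R) - S(R_GT)||_2^2
    return l1 + alpha * l2
```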
(6) Virtual Viewpoint Generation
The baseline data rectified in the step (4) are input into the pre-trained deep Virtual View Generation Network (VVGN), and the generated virtual viewpoint 2D images are output according to the number of virtual viewpoints to be reconstructed. Unlike traditional virtual viewpoint generation methods, the method in the present invention predicts and generates a virtual viewpoint between two physical viewpoints with the deep CNN: the input data are baseline-rectified at the pixel level in advance, and the CNN learns the feature structures of the two viewpoints directly to output the results, without calibrating multiple cameras in advance. This step determines the quality of the generated virtual viewpoints. The binocular datasets are pre-processed through baseline rectification, color rectification, and displacement threshold screening based on optical flow calculation in the step (5), and are input into the CNN as shown in
The network is trained with the loss functions L1, L2 and Ltotal defined in the step (5). In the case of wide binocular baselines, better virtual viewpoint quality may be obtained than with existing deep learning-based Video Frame Interpolation networks, and the computational effort is much lower than with traditional methods.
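An inference loop for this step might look like the sketch below; the model(left, right, t) call signature and the uniform spacing of the virtual viewpoints between the two physical views are hypothetical, as the VVGN interface is not specified here.

```python
import torch

def generate_virtual_views(model, left, right, num_virtual):
    """Generate num_virtual views between one pair of baseline-rectified physical
    views. `model` is a pre-trained VVGN; (left, right, t) is a hypothetical
    interface, not the patent's actual API."""
    views = []
    with torch.no_grad():
        for j in range(1, num_virtual + 1):
            t = j / (num_virtual + 1)  # relative position between the two physical views
            views.append(model(left, right, t))
    return views
```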
(7) Stitching Matrices of all Viewpoint Image Frames
The image matrices of the physical viewpoints and the virtual viewpoints generated in the step (6) are stitched in order of their physical spatial positions (the number of rows and columns of the stitched image matrix depends on the number of virtual viewpoints generated), and the Block_Index of each viewpoint in the image matrix is labeled in sequence, by default in row order.
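A sketch of the stitching and labeling step with NumPy; the row-major Block_Index numbering follows the default row order mentioned above, while the grid width `cols` is a placeholder.

```python
import numpy as np

def stitch_viewpoints(frames, cols):
    """Stitch viewpoint frames, already ordered by physical spatial position, into
    one image matrix; Block_Index runs row-major, i.e. Block_Index = row * cols + col."""
    h, w, c = frames[0].shape
    rows = int(np.ceil(len(frames) / cols))
    canvas = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    block_index = {}
    for idx, frame in enumerate(frames):
        r, col = divmod(idx, cols)
        canvas[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
        block_index[idx] = (r, col)
    return canvas, block_index
```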
(8) FVV Synthesis
The stitched frames at every moment obtained in the previous step are synthesized into an FVV, using FFmpeg or the cv2.VideoWriter( ) class in OpenCV, at the shooting frame rate of the multiple cameras, and the FVV is compressed at a certain compression ratio and stored on a local server.
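A minimal OpenCV sketch of this synthesis step; the 'mp4v' codec, output path, and frame rate are example choices, and the compression and server storage are not shown.

```python
import cv2

def write_fvv(stitched_frames, out_path, fps):
    """Encode the per-moment stitched frames into an FVV file at the cameras'
    shooting frame rate."""
    h, w = stitched_frames[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in stitched_frames:
        writer.write(frame)
    writer.release()

# write_fvv(frames, "fvv_output.mp4", fps=25)
```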
(9) Interactive Viewing of FVV by Users
The interface of an FVV Player is shown in
Number | Name | Date | Kind
---|---|---|---
20100026712 | Aliprandi | Feb 2010 | A1
20180048810 | Matsushita | Feb 2018 | A1
20180182114 | Hanamoto | Jun 2018 | A1
20180220125 | Tamir | Aug 2018 | A1
20190014301 | Ota | Jan 2019 | A1
20190068955 | Nakazato | Feb 2019 | A1
20190174109 | Yoshikawa | Jun 2019 | A1
20190174122 | Besley | Jun 2019 | A1
20190244379 | Venkataraman | Aug 2019 | A1
20190356906 | Handa | Nov 2019 | A1
20200120324 | Kwong | Apr 2020 | A1
20200234495 | Nakao | Jul 2020 | A1
20200258196 | Kokura | Aug 2020 | A1
20200329189 | Tanaka | Oct 2020 | A1
20200336719 | Morisawa | Oct 2020 | A1
20200372691 | Ito | Nov 2020 | A1
20200380720 | Dixit | Dec 2020 | A1
20200387288 | Ito | Dec 2020 | A1
20210120218 | Himukashi | Apr 2021 | A1
20210134058 | Ito | May 2021 | A1
20210334935 | Grigoriev | Oct 2021 | A1
20220130056 | Zhang | Apr 2022 | A1
Number | Date | Country
---|---|---
102447932 | May 2012 | CN
105659592 | Jun 2016 | CN
107396133 | Nov 2017 | CN
107545586 | Jan 2018 | CN
107493465 | Jun 2019 | CN
110113593 | Aug 2019 | CN
110223382 | Sep 2019 | CN
110443874 | Nov 2019 | CN
110798673 | Feb 2020 | CN
2018147329 | Aug 2018 | WO
Entry
---
Wang, Yanru et al.; "Interactive free-viewpoint video generation"; Virtual Reality & Intelligent Hardware; vol. 2, No. 3; Jun. 30, 2020; pp. 247-260.
Deng, Bao-Song et al.; "Wide Baseline Matching based on Affine Iterative Method"; Signal Processing; vol. 23, No. 6; Dec. 31, 2007; pp. 823-828.
Bosc, Emilie et al.; "Towards a New Quality Metric for 3-D Synthesized View Assessment"; IEEE Journal of Selected Topics in Signal Processing; vol. 5, No. 7; Sep. 26, 2011; All.