The present invention relates to 360-degree video coding and processing. In particular, the present invention relates to decoding, from a 360° VR video sequence, the view region within the user's field of view. More particularly, the present invention discloses adaptive region-based video decoding that responds to the user's viewpoint behavior in order to enhance the user's viewing experience.
The 360-degree video, also known as immersive video, is an emerging technology that can provide a sense of being present in the scene. The sense of immersion is achieved by surrounding the user with a wrap-around scene covering a panoramic view, in particular a 360-degree (360°) field of view. The sense of presence can be further improved by stereoscopic rendering. Accordingly, panoramic video is being widely used in Virtual Reality (VR) applications.
Immersive video involves capturing a scene using one or multiple cameras to cover a panoramic view, such as a 360-degree field of view. An immersive camera usually uses a set of cameras arranged to capture a 360° field of view; typically, two or more cameras are used. All videos must be captured simultaneously, with each camera recording a separate fragment (also called a separate perspective) of the scene. Furthermore, the set of cameras is often arranged to capture views horizontally, although other camera arrangements are possible.
While a 360° video provides all-around scenes, a user views only a limited field of view at any given time. Therefore, the decoder only needs to decode a portion (e.g., a view region) of each 360° frame and display that portion to the user. However, a user does not always look at the same region: in practical usage, a user may look around, so the field of view may change from time to time. Accordingly, different regions need to be decoded and displayed.
The region that needs to be decoded and displayed can be determined according to a 3D projection model and the field of view.
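As an illustration only, the following minimal sketch (not part of the original disclosure) shows how a view region could be derived for an equirectangular projection. The function name, coordinate convention, and the simplified handling of the poles and the 180°/−180° seam are all assumptions:

```python
def viewport_to_region(yaw, pitch, fov_h, fov_v, frame_w, frame_h):
    """Map a viewport (yaw, pitch, horizontal/vertical FOV, in degrees)
    to a rectangular pixel region (left, top, width, height) of an
    equirectangular 360-degree frame covering 360 x 180 degrees."""
    px_per_deg_x = frame_w / 360.0
    px_per_deg_y = frame_h / 180.0
    center_x = (yaw + 180.0) * px_per_deg_x
    center_y = (90.0 - pitch) * px_per_deg_y
    width = fov_h * px_per_deg_x
    height = fov_v * px_per_deg_y
    left = center_x - width / 2.0          # may wrap around the seam
    top = max(0.0, center_y - height / 2.0)
    return int(left), int(top), int(width), int(height)
```

For example, with a 3840×1920 equirectangular frame, viewport_to_region(0, 0, 135, 90, 3840, 1920) yields a 1440×960-pixel region centered in the frame.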
As shown above, region-based decoding of 360° frames needs to decode the field of view in response to the user's current viewpoint. The user's viewpoint or viewpoint motion may be detected automatically if the user wears a head-mounted display (HMD) equipped with 3D motion sensors. The user's viewpoint may also be indicated by the user with a pointing device. In order to accommodate different fields of view for a 360° video sequence, various 3D coding systems have been developed in the field. For example, Facebook™ developed a pyramid coding system that streams 30 bitstreams corresponding to 30 different fields of view. Only the bitstream for the visible field of view is treated as the main bitstream; the main bitstream is coded to allow full-resolution rendering, while the other bitstreams are coded at reduced resolutions.
Qualcomm™ also developed a coding system to facilitate multiple fields of view. In particular, Qualcomm™ uses a truncated square pyramid projection that projects a selected field of view onto the front (i.e., main) cube face.
According to the conventional region-based multiple Fields of View (FOV) coding system, bitstreams for a large number of fields of view have to be generated. The large amount of data to be streamed causes long network latency. When the user changes his or her viewpoint, the bitstream associated with the updated viewpoint may not be available, so the user has to rely on a non-main bitstream to display the view region at reduced resolution. In some cases, part of the data in the updated view region may not be available from any of the 30 bitstreams, so erroneous data may appear in the updated view region. It is therefore desirable to develop techniques to adaptively stream bitstreams according to different fields of view. Furthermore, it is desirable to develop an adaptive coding system that can facilitate different fields of view effectively without the need for high bandwidth or long switching latency.
Methods and apparatus of video decoding for a 360-degree video sequence are disclosed. According to one method, a first view region associated with a first field of view for a user at a previous frame time is determined in the previous 360-degree frame. The previous 360-degree frame and the current 360-degree frame in a 360-degree video sequence can be decoded from a bitstream. An extended region, extended from the first view region, is determined in the current 360-degree frame based on the user's viewpoint information. The extended region in the current 360-degree frame is then decoded. Furthermore, a second view region in the current 360-degree frame, associated with the actual field of view for the user at the current frame time, can be rendered.
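A hypothetical per-frame loop for this method might look as follows. This is a sketch only: the decoder and renderer interfaces are assumptions, not taken from the disclosure, and extend_region is sketched further below.

```python
def process_frame(decoder, renderer, frame, prev_region, viewpoints):
    """One iteration of the adaptive region-based decode/render flow."""
    extended = extend_region(prev_region, viewpoints)   # anticipate viewpoint motion
    pixels = decoder.decode_region(frame, extended)     # decode only the extended region
    actual_region = renderer.current_view_region()      # actual FOV at render time
    renderer.render(pixels, actual_region)              # render the second view region
    return extended
```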
In one embodiment, the extended region is enlarged in the turn direction when the user's viewpoint turns, and is reduced when the user's viewpoint returns to stillness. In another embodiment, the extended region is enlarged in a direction corresponding to previous viewpoint motion. The extended region can be enlarged according to predicted viewpoint motion derived using linear prediction of previous viewpoint motion; it may also be enlarged according to predicted viewpoint motion derived using non-linear prediction of previous viewpoint motion. The extended region can further be determined according to a learning mechanism based on the user's view tendency; for example, the user's view tendency may comprise the frequency of the user's viewpoint changes, the speed of the user's viewpoint motion, or both. In yet another embodiment, a predefined region is derived based on the user's view information, and the extended region corresponds to the smallest rectangular region covering both the first view region and the predefined region.
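By way of illustration only, the sketch below (not from the original disclosure; the helper names, pixel coordinates, default margin, and absence of seam wrap-around handling are assumptions) captures two of these strategies: enlargement along a linearly predicted viewpoint motion, and the smallest rectangle covering two regions.

```python
def extend_region(region, viewpoints, margin=0):
    """Enlarge (left, top, w, h) along the linearly predicted viewpoint motion.

    viewpoints: list of recent (x, y) viewpoint positions in frame pixels;
    linear prediction assumes the viewpoint keeps its last observed velocity.
    """
    left, top, w, h = region
    if len(viewpoints) >= 2:
        dx = viewpoints[-1][0] - viewpoints[-2][0]
        dy = viewpoints[-1][1] - viewpoints[-2][1]
        left = min(left, left + dx) - margin   # grow toward the motion direction
        top = min(top, top + dy) - margin
        w = w + abs(dx) + 2 * margin
        h = h + abs(dy) + 2 * margin
    return left, top, w, h

def union_rect(r1, r2):
    """Smallest rectangle covering both r1 and r2 (wrap-around ignored)."""
    left = min(r1[0], r2[0])
    top = min(r1[1], r2[1])
    right = max(r1[0] + r1[2], r2[0] + r2[2])
    bottom = max(r1[1] + r1[3], r2[1] + r2[3])
    return left, top, right - left, bottom - top
```

In the last embodiment above, union_rect(first_view_region, predefined_region) would give the extended region.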
The step of rendering the second view region may further comprise blurring any non-decoded region in the second view region.
The following description is of the best-contemplated mode of carrying out the invention. This description is made for the purpose of illustrating the general principles of the invention and should not be taken in a limiting sense. The scope of the invention is best determined by reference to the appended claims.
As mentioned before, according to the conventional region-based multiple-FOV coding system, bitstreams for a large number of fields of view have to be generated. When the user changes his or her field of view or viewpoint, the associated bitstream has to be switched, which may cause substantial latency depending on network conditions.
In order to overcome the issues associated with changing the field of view, an adaptive coding system for 360° video sequences is disclosed. The adaptive decoding system extends the decoded region to anticipate possible changes in the field of view. Therefore, when a user moves his or her viewpoint, the adaptive decoding system will likely provide the new view region with fewer artifacts.
According to the present invention, the view region is adaptively decoded based on a prediction of the user's turning behavior. In particular, the decoded region is enlarged to prevent the user from observing non-decoded areas, which provides a better user experience due to better quality and less non-decoded area being rendered. The decoded region can be determined adaptively using viewpoint prediction.
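As an illustration only, the following sketch (not part of the disclosure; the growth and shrink amounts are assumed tuning values, and only the horizontal direction is handled for brevity) captures the turn-based rule: enlarge the decoded region on the side the viewpoint is turning toward, and shrink it back when the viewpoint is still.

```python
def adapt_region(region, turn_dx, grow=64, shrink=16, base_w=1024):
    """Adapt (left, top, w, h) to the horizontal turn speed turn_dx (pixels/frame)."""
    left, top, w, h = region
    if turn_dx > 0:                       # turning right: extend to the right
        w += grow
    elif turn_dx < 0:                     # turning left: extend to the left
        left -= grow
        w += grow
    elif w > base_w:                      # still: relax toward the base width
        w = max(base_w, w - shrink)
    return left, top, w, h
```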
According to another embodiment, the adaptive region decoding can be based on the user's viewpoint moving history. For example, the prediction can be applied in an arbitrary direction. Also, the prediction can adapt to various velocities: the faster the user's viewpoint moves, the larger the decoded region becomes.
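For instance, a velocity-dependent extension margin could be computed as below. This is a sketch; the gain and cap are illustrative parameters, not values from the disclosure.

```python
def motion_margin(speed_px_per_frame, gain=2.0, cap=256):
    """Extra pixels to add in the motion direction, growing with viewpoint speed."""
    return min(int(gain * speed_px_per_frame), cap)
```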
While the motion vector prediction (MVP) mentioned above may be used to extend the decoded region to reduce the possibility of a non-decoded area, it cannot guarantee that the new view region is always fully covered by the decoded region. In case any non-decoded area occurs, an embodiment according to the present invention blurs the non-decoded area to reduce the visibility of the non-decoded data.
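One possible realization of this blurring step, as a sketch only (assuming NumPy-style image arrays and the OpenCV GaussianBlur routine; the boolean mask convention is an assumption):

```python
import cv2

def conceal_non_decoded(view, mask, ksize=31):
    """Blur the pixels marked as non-decoded to reduce their visibility.

    view: HxWx3 uint8 image of the rendered view region.
    mask: HxW boolean array, True where pixels were never decoded
          (these may contain stale or filler data).
    """
    blurred = cv2.GaussianBlur(view, (ksize, ksize), 0)
    out = view.copy()
    out[mask] = blurred[mask]   # replace only the non-decoded pixels
    return out
```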
The view region prediction can be improved using a learning mechanism. For example, the learning process can be based on the user's view tendency, such as the frequency and the speed at which the user changes his or her viewpoint. In another example, the learning mechanism can be based on video preference; the user's view information can be collected and used to build a predefined prediction.
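A minimal sketch of such a learning mechanism (the tracked statistics and the mapping to an extension margin are illustrative assumptions, not taken from the disclosure):

```python
class ViewTendency:
    """Track how often and how fast a user moves the viewpoint,
    and derive a per-user region-extension margin from it."""

    def __init__(self):
        self.moves = 0
        self.frames = 0
        self.total_speed = 0.0

    def update(self, speed_px_per_frame):
        self.frames += 1
        if speed_px_per_frame > 0:
            self.moves += 1
            self.total_speed += speed_px_per_frame

    def margin(self, base=32):
        if self.moves == 0:
            return base
        freq = self.moves / self.frames           # how often the view changes
        avg_speed = self.total_speed / self.moves # how fast it changes
        return int(base + 2 * freq * avg_speed)   # larger margin for restless viewers
```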
The system according to the present invention is compared with the systems developed by Facebook™ and Qualcomm™ in Table 1.
While the cube 3D model is used to generate the view region in the above examples, the present invention is not limited to the cube 3D model. In Table 1, the present invention is configured to support a 135-degree FOV; nevertheless, any other FOV coverage may be used.
The flowchart shown above is intended to serve as an example to illustrate embodiments of the present invention. A person skilled in the art may practice the present invention by modifying individual steps or by splitting or combining steps without departing from the spirit of the present invention.
The above description is presented to enable a person of ordinary skill in the art to practice the present invention as provided in the context of a particular application and its requirements. Various modifications to the described embodiments will be apparent to those with skill in the art, and the general principles defined herein may be applied to other embodiments. Therefore, the present invention is not intended to be limited to the particular embodiments shown and described, but is to be accorded the widest scope consistent with the principles and novel features herein disclosed. In the above detailed description, various specific details are illustrated in order to provide a thorough understanding of the present invention. Nevertheless, it will be understood by those skilled in the art that the present invention may be practiced without such specific details.
Embodiments of the present invention as described above may be implemented in various hardware, software code, or a combination of both. For example, an embodiment of the present invention can be one or more electronic circuits integrated into a video compression chip, or program code integrated into video compression software, to perform the processing described herein. An embodiment of the present invention may also be program code to be executed on a Digital Signal Processor (DSP) to perform the processing described herein. The invention may also involve a number of functions to be performed by a computer processor, a digital signal processor, a microprocessor, or a field programmable gate array (FPGA). These processors can be configured to perform particular tasks according to the invention by executing machine-readable software code or firmware code that defines the particular methods embodied by the invention. The software code or firmware code may be developed in different programming languages and different formats or styles. The software code may also be compiled for different target platforms. However, different code formats, styles, and languages of software code, and other means of configuring code to perform the tasks in accordance with the invention, will not depart from the spirit and scope of the invention.
The invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described examples are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
The present invention claims priority to U.S. Provisional Patent Application, Ser. No. 62/428,571, filed on Dec. 1, 2016. The U.S. Provisional Patent Application is hereby incorporated by reference in its entirety.