The present disclosure relates to encoding and decoding immersive videos, for example when such immersive videos are processed in a system for virtual reality, augmented reality or augmented virtuality and for instance when displayed in a head mounted display device.
Recently there has been a growth of available large field-of-view content (up to 360°). Such content is potentially not fully visible by a user watching the content on immersive display devices such as Head Mounted Displays, smart glasses, PC screens, tablets, smartphones and the like. That means that at a given moment, a user may only be viewing a part of the content. A user can, however, navigate within the content by various means such as head movement, mouse movement, touch screen, voice and the like. A large field-of-view content may be, among others, a three-dimension computer graphic imagery scene (3D CGI scene), a point cloud or an immersive video.
Many terms might be used to designate such immersive videos: Virtual Reality (VR), 360, panoramic, 4π steradians, immersive, omnidirectional, large field of view, etc.
For coding an omnidirectional video into a bitstream, for instance for transmission over a data network, traditional video codecs, such as HEVC or H.264/AVC, can be used. Each picture of the omnidirectional video is thus first projected onto one or more 2D pictures, for example one or more rectangular pictures, using a suitable projection function. In practice, a picture from the omnidirectional video is represented as a 3D surface. For ease of projection, a convex and simple surface such as a sphere, a cube or a pyramid is usually used for the projection. The projected 2D pictures representative of the omnidirectional video are then coded using a traditional video codec.
For coding an omnidirectional video, the projected picture of the 3D surface can then be coded using conventional video coding standards such as HEVC, H.264/AVC, etc. According to such standards, a 2D picture is encoded by first dividing it into non-overlapping blocks of fixed size and then by encoding those blocks individually.
However, such conventional 2D video coders/decoders remain suboptimal for coding a picture of a 3D surface projected onto a 2D picture, since such a projected picture presents characteristics different from those of classical 2D pictures.
There is thus a need for improving the performance of conventional 2D video coders/decoders when coding/decoding projected pictures of 3D surfaces.
There is also a need for this improvement to require minimal modification of the existing video coders/decoders, so as to allow maximum reuse of existing standardized solutions and thus minimize the cost of the proposed solution.
A particular aspect of the present disclosure relates to a method for coding a large field of view video into a bitstream, at least one picture of the omnidirectional video being represented as a surface, the surface being projected onto at least one 2D picture using a projection function. The method comprises, for at least one current block of the at least one 2D picture:
Thus, this particular embodiment relies on a wholly novel and inventive solution for encoding a large field of view video into a bitstream.
For this, it is proposed to take into account the deterministic distortion resulting from the projection function of the surface onto the 2D picture, such distortion resulting in a given pixel density function across the 2D picture. Consequently, a size associated with a block to be encoded can be adapted according to this a priori known information content of the block, so that the encoding process starts from a situation closer to the optimum, thus resulting in a more efficient encoding process.
For instance, such a size associated with a block may be a column and/or row size of the block, and/or a size of a support used for encoding this block.
According to one embodiment, the act of adapting a size of the at least one current block comprises splitting the at least one current block according to a criterion function of the pixel density function, the act of splitting delivering at least one subblock associated to the at least one current block; and the act of encoding comprises encoding the at least one subblock associated to the at least one current block.
In this embodiment, the amount of operations performed for applying the splitting process of the considered standard (e.g. the rate distortion optimization process in HEVC) is reduced as splitting decisions have already been taken according to a priori known information derived from the pixel density function.
According to one embodiment, the act of adapting a size of the at least one current block comprises adapting a size of a transform to be applied to the current block or subblock; and the act of encoding comprises applying the transform of adapted size to the current block or subblock.
In this embodiment, the complexity of the transform processing module is reduced, thus resulting in an encoder optimization.
In one embodiment of the present disclosure, it is proposed a method for decoding a bitstream representative of a large field of view video, at least one picture of the omnidirectional video being represented as a surface, the surface being projected onto at least one 2D picture using a projection function. The method comprises, for at least one current block of the at least one 2D picture:
Thus, the characteristics and advantages of the method for decoding according to the present disclosure are the same as the method for coding described above. Therefore, they are not described in more detail.
According to one embodiment, the act of adapting a size of the at least one current block comprises splitting the at least one current block according to a criterion function of the pixel density function, the act of splitting delivering at least one subblock associated to the at least one current block;
and the act of decoding comprises decoding the at least one subblock associated to the at least one current block.
In this embodiment, the bitrate of the bitstream is thus minimized, as no particular splitting syntax is needed.
According to one embodiment, the act of adapting a size of the at least one current block comprises adapting a size of an inverse transform to be applied to the current block or subblock; and the act of decoding comprises applying the inverse transform of adapted size to the current block or subblock.
In this embodiment, the complexity of the inverse transform processing module is reduced, thus resulting in a decoder optimization.
According to one embodiment, the criterion used for deciding the splitting of the at least one current block belongs to a group comprising at least:
According to one embodiment, the act of adapting a size of the at least one current block comprises delivering at least one current block of adapted size, and wherein the adapted size is derived from a nominal size divided by an average value, or a median value, of the pixel density function computed for at least one pixel of the at least one current block.
In this embodiment, the size of square blocks (e.g. CTU, CU, PU, TU) can be adapted so as to optimize the encoding/decoding process, taking into account the distortion resulting from the projection of the 3D surface onto the 2D picture. This embodiment is in particular suited for equirectangular projections, in which the resulting pixel density function depends only on the vertical coordinate (i.e. Y) in the resulting 2D picture.
According to one embodiment, the act of adapting a size comprises:
In this embodiment, it is the size of rectangular blocks (e.g. CTU, CU, PU, TU) that is adapted so as to optimize the encoding/decoding process, taking into account the distortion resulting from the projection of the 3D surface onto the 2D picture. This embodiment is in particular suited for cube mapping projections, in which the resulting pixel density function depends on both coordinates (i.e. X and Y) in the resulting 2D picture.
According to one embodiment, the splitting into at least one subblock of the at least one current block is signaled in the bitstream using an existing splitting syntax of the standard.
In this embodiment, the decoder of the existing standard is reused as such, thus allowing minimizing the cost for deploying the disclosed method in applications.
Another aspect of the present disclosure relates to an apparatus for coding a large field of view video into a bitstream, at least one picture of the omnidirectional video being represented as a surface, the surface being projected onto at least one 2D picture using a projection function. The apparatus for coding comprises, for at least one current block of the at least one 2D picture:
Such an apparatus is particularly adapted for implementing the method for coding a large field of view video into a bitstream according to the present disclosure (according to any of the various aforementioned embodiments).
Thus, the characteristics and advantages of this apparatus are the same as the method for coding described above. Therefore, they are not described in more detail.
Another aspect of the present disclosure relates to an apparatus for decoding a bitstream representative of a large field of view video, at least one picture of the omnidirectional video being represented as a surface, the surface being projected onto at least one 2D picture using a projection function. The apparatus for decoding comprises, for at least one current block of the at least one 2D picture:
Such an apparatus is particularly adapted for implementing the method for decoding a bitstream representative of a large field of view video according to the present disclosure (according to any of the various aforementioned embodiments).
Thus, the characteristics and advantages of this apparatus are the same as the method for decoding described above. Therefore, they are not described in more detail.
Another aspect of the present disclosure relates to a computer program product comprising program code instructions for implementing the above-mentioned methods (in any of their different embodiments), when the program is executed on a computer or a processor.
Another aspect of the present disclosure relates to a non-transitory computer-readable carrier medium storing a computer program product which, when executed by a computer or a processor causes the computer or the processor to carry out the above-mentioned methods (in any of their different embodiments).
Another aspect of the present disclosure relates to a bitstream representative of a coded omnidirectional video, at least one picture of the omnidirectional video being represented as a surface, the surface being projected onto at least one 2D picture using a projection function. The bitstream comprises:
Such bitstream is delivered by a device implementing the method for coding a large field of view video into a bitstream according to the present disclosure, and intended to be used by a device implementing the method for decoding a bitstream representative of a large field of view video according to the present disclosure (according to any of their various aforementioned embodiments).
Thus, the characteristics and advantages of this bitstream are the same as the methods described above. Therefore, they are not described in more detail.
Another aspect of the present disclosure relates to an immersive rendering device comprising an apparatus for decoding a bitstream representative of a large field of view video according to the present disclosure.
Yet another aspect of the present disclosure relates to a system for immersive rendering of a large field of view video encoded into a bitstream, comprising at least:
In all of the figures of the present document, the same numerical reference signs designate similar elements and steps.
The present principle is disclosed here in the case of omnidirectional video; it may also be applied to conventional planar images acquired with a very large field of view, i.e. acquired with a very small focal length such as a fish-eye lens.
Several types of systems may be envisioned to perform the decoding, playing and rendering functions of an immersive display device, for example when rendering an immersive video.
A first system, for processing augmented reality, virtual reality, or augmented virtuality content is illustrated in
The processing device can also comprise a second communication interface with a wide area network such as the internet in order to access content located in a cloud, directly or through a network device such as a home or a local gateway. The processing device can also access a local storage through a third interface such as a local access network interface of Ethernet type. In an embodiment, the processing device may be a computer system having one or several processing units. In another embodiment, it may be a smartphone which can be connected through wired or wireless links to the immersive video rendering device, or which can be inserted in a housing in the immersive video rendering device and communicate with it through a connector or wirelessly as well. Communication interfaces of the processing device are wireline interfaces (for example a bus interface, a wide area network interface, a local area network interface) or wireless interfaces (such as an IEEE 802.11 interface or a Bluetooth® interface).
When the processing functions are performed by the immersive video rendering device, the immersive video rendering device can be provided with an interface to a network directly or through a gateway to receive and/or transmit content.
In another embodiment, the system comprises an auxiliary device which communicates with the immersive video rendering device and with the processing device. In such an embodiment, this auxiliary device can contain at least one of the processing functions.
The immersive video rendering device may comprise one or several displays. The device may employ optics such as lenses in front of each of its displays. The display can also be a part of the immersive display device, as in the case of smartphones or tablets. In another embodiment, displays and optics may be embedded in a helmet, in glasses, or in a visor that a user can wear. The immersive video rendering device may also integrate several sensors, as described later on. The immersive video rendering device can also comprise several interfaces or connectors. It might comprise one or several wireless modules in order to communicate with sensors, processing functions, handheld devices or devices or sensors related to other body parts.
The immersive video rendering device can also comprise processing functions executed by one or several processors and configured to decode content or to process content. By processing content, it is understood here all the functions required to prepare content that can be displayed. This may comprise, for instance, decoding content, merging content before displaying it and modifying the content to fit the display device.
One function of an immersive content rendering device is to control a virtual camera which captures at least a part of the content structured as a virtual volume. The system may comprise pose tracking sensors which totally or partially track the user's pose, for example, the pose of the user's head, in order to process the pose of the virtual camera. Some positioning sensors may track the displacement of the user. The system may also comprise other sensors related to environment for example to measure lighting, temperature or sound conditions. Such sensors may also be related to the users' bodies, for instance, to measure sweating or heart rate. Information acquired through these sensors may be used to process the content. The system may also comprise user input devices (e.g. a mouse, a keyboard, a remote control, a joystick). Information from user input devices may be used to process the content, manage user interfaces or to control the pose of the virtual camera. Sensors and user input devices communicate with the processing device and/or with the immersive rendering device through wired or wireless communication interfaces.
Through
The immersive video rendering device 10, illustrated on
Memory 105 comprises parameters and code program instructions for the processor 104. Memory 105 can also comprise parameters received from the sensors 20 and user input devices 30.
Communication interface 106 enables the immersive video rendering device to communicate with the computer 40. The communication interface 106 of the processing device may be a wireline interface (for example a bus interface, a wide area network interface, a local area network interface) or a wireless interface (such as an IEEE 802.11 interface or a Bluetooth® interface). Computer 40 sends data and optionally control commands to the immersive video rendering device 10. The computer 40 is in charge of processing the data, i.e. preparing them for display by the immersive video rendering device 10. Processing can be done exclusively by the computer 40, or part of the processing can be done by the computer and part by the immersive video rendering device 10. The computer 40 is connected to the internet, either directly or through a gateway or network interface 50. The computer 40 receives data representative of an immersive video from the internet, processes these data (e.g. decodes them and possibly prepares the part of the video content that is going to be displayed by the immersive video rendering device 10) and sends the processed data to the immersive video rendering device 10 for display. In a variant, the system may also comprise local storage (not represented) where the data representative of an immersive video are stored, said local storage being on the computer 40 or on a local server accessible through a local area network for instance (not represented).
The game console 60 is connected to internet, either directly or through a gateway or network interface 50. The game console 60 obtains the data representative of the immersive video from the internet. In a variant, the game console 60 obtains the data representative of the immersive video from a local storage (not represented) where the data representative of the immersive video are stored, said local storage can be on the game console 60 or on a local server accessible through a local area network for instance (not represented).
The game console 60 receives data representative of an immersive video from the internet, processes these data (e.g. decodes them and possibly prepares the part of the video that is going to be displayed) and sends the processed data to the immersive video rendering device 10 for display. The game console 60 may receive data from sensors 20 and user input devices 30 and may use them to process the data representative of an immersive video obtained from the internet or from the local storage.
Immersive video rendering device 70 is described with reference to
The immersive video rendering device 80 is illustrated on
A second system, for processing augmented reality, virtual reality, or augmented virtuality content is illustrated in
This system may also comprise sensors 2000 and user input devices 3000. The immersive wall 1000 can be of OLED or LCD type. It can be equipped with one or several cameras. The immersive wall 1000 may process data received from the sensor 2000 (or the plurality of sensors 2000). The data received from the sensors 2000 may be related to lighting conditions, temperature, environment of the user, e.g. position of objects.
The immersive wall 1000 may also process data received from the user input devices 3000. The user input devices 3000 send data such as haptic signals in order to give feedback on the user's emotions. Examples of user input devices 3000 are handheld devices such as smartphones, remote controls, and devices with gyroscope functions.
Data from the sensors 2000 and the user input devices 3000 may also be transmitted to the computer 4000.
The computer 4000 may process the video data (e.g. decoding them and preparing them for display) according to the data received from these sensors/user input devices. The sensor signals can be received through a communication interface of the immersive wall. This communication interface can be of Bluetooth type, of WIFI type or any other type of connection, preferentially wireless, but can also be a wired connection.
Computer 4000 sends the processed data and optionally control commands to the immersive wall 1000. The computer 4000 is configured to process the data, i.e. to prepare them for display by the immersive wall 1000. Processing can be done exclusively by the computer 4000, or part of the processing can be done by the computer 4000 and part by the immersive wall 1000.
The immersive wall 6000 receives immersive video data from the internet through a gateway 5000 or directly from internet. In a variant, the immersive video data are obtained by the immersive wall 6000 from a local storage (not represented) where the data representative of an immersive video are stored, said local storage can be in the immersive wall 6000 or in a local server accessible through a local area network for instance (not represented).
This system may also comprise sensors 2000 and user input devices 3000. The immersive wall 6000 can be of OLED or LCD type. It can be equipped with one or several cameras. The immersive wall 6000 may process data received from the sensor 2000 (or the plurality of sensors 2000). The data received from the sensors 2000 may be related to lighting conditions, temperature, environment of the user, e.g. position of objects.
The immersive wall 6000 may also process data received from the user input devices 3000. The user input devices 3000 send data such as haptic signals in order to give feedback on the user's emotions. Examples of user input devices 3000 are handheld devices such as smartphones, remote controls, and devices with gyroscope functions.
The immersive wall 6000 may process the video data (e.g. decoding them and preparing them for display) according to the data received from these sensors/user input devices. The sensor signals can be received through a communication interface of the immersive wall. This communication interface can be of Bluetooth type, of WIFI type or any other type of connection, preferentially wireless, but can also be a wired connection. The immersive wall 6000 may comprise at least one communication interface to communicate with the sensors and with the internet.
Gaming console 7000 sends instructions and user input parameters to the immersive wall 6000. Immersive wall 6000 processes the immersive video content possibly according to input data received from sensors 2000 and user input devices 3000 and gaming consoles 7000 in order to prepare the content for display. The immersive wall 6000 may also comprise internal memory to store the content to be displayed. The immersive wall 6000 can be of OLED or LCD type. It can be equipped with one or several cameras.
The data received from the sensors 2000 may be related to lighting conditions, temperature, environment of the user, e.g. position of objects. The immersive wall 6000 may also process data received from the user input devices 3000. The user input devices 3000 send data such as haptic signals in order to give feedback on the user's emotions. Examples of user input devices 3000 are handheld devices such as smartphones, remote controls, and devices with gyroscope functions.
The immersive wall 6000 may process the immersive video data (e.g. decoding them and preparing them for display) according to the data received from these sensors/user input devices. The sensor signals can be received through a communication interface of the immersive wall. This communication interface can be of Bluetooth type, of WIFI type or any other type of connection, preferentially wireless, but can also be a wired connection. The immersive wall 6000 may comprise at least one communication interface to communicate with the sensors and with the internet.
The general principle of the disclosed method consists in coding an omnidirectional video into a bitstream while taking into account the distortion of the information resulting from the projection of a 3D surface onto a 2D picture. More particularly, it is proposed to adapt the size of the blocks constituting the 2D picture as a function of the pixel density function determined according to the projection function of the 3D surface onto the 2D picture. The blocks of adapted size are then encoded into the bitstream. Indeed, the inventors have noted that the projection of the 3D surface onto the 2D picture leads to deterministic distortions in the resulting picture.
For instance, reconsidering the embodiment shown on
x = W·θ/(2π) + W/2,
y = 2H·φ/(2π) + H/2,
with W and H being the width and the height of the 2D picture (i.e. the frame to be encoded at the end), and where (x,y) corresponds to the location of a point M on the XY-plane of the 2D picture and (θ,φ) are the coordinates of a corresponding point M′ on the sphere S.
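As a purely illustrative sketch (not part of the disclosed method), the mapping above can be expressed as follows, assuming the conventions θ∈[−π,π] and φ∈[−π/2,π/2] for the angular coordinates:

```python
import math

def equirect_project(theta, phi, W, H):
    """Map a point (theta, phi) of the sphere S to a location (x, y) on the W x H
    2D picture, following x = W*theta/(2*pi) + W/2 and y = 2*H*phi/(2*pi) + H/2."""
    x = W * theta / (2.0 * math.pi) + W / 2.0
    y = 2.0 * H * phi / (2.0 * math.pi) + H / 2.0
    return x, y

# The point theta = 0, phi = 0 maps to the centre of the 2D picture.
print(equirect_project(0.0, 0.0, W=4096, H=2048))  # -> (2048.0, 1024.0)
```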
However, it can be seen that the width of each pixel projected in the rendering frame varies according to the angle φ. The variation of the width of the pixel can be very high: in equirectangular mapping, the whole top line and bottom line of the 2D picture correspond to one single pixel (which lies at a pole of the sphere S).
Consequently, a density of pixels in the 2D picture can be defined in order to quantify this spreading effect of the original pixels of the sphere S when projected onto the 2D picture.
According to one embodiment of the present disclosure, the density of pixels at a given location (x,y) on the 2D picture is defined as inversely proportional to the number of pixels of the 2D picture (assuming a uniform pixel grid) that are filled by a given pixel of the sphere S that is projected onto that particular location (x,y).
However, due to the structure of the projection operation, it appears that the density of pixels in the 2D picture is not isotropic. Therefore, according to one embodiment of the present disclosure, the pixel density function is defined as a 2D density function Density2D(x,y) represented as a vector composed of both horizontal and vertical pixel density functions (Density2D_h(x,y), Density2D_v(x,y)).
In case of an equirectangular projection of the sphere S onto a 2D picture, the horizontal and vertical densities of pixels corresponding to that definition can be expressed as:
with ρ the diameter of the sphere S. However, for the sake of simplicity, the parameters can be chosen so that the maximum horizontal and vertical density functions are set to one. It results in normalized horizontal and vertical densities of pixels that can be expressed as:
(cos(φ),1)
Equivalently, using the coordinate change given above between the Cartesian co-ordinates on the XY-plane and the angular co-ordinates on the sphere, it results in:
(cos(π·(y − H/2)/H), 1)
We thus see in that particular embodiment that the vertical pixel density on the 2D picture is constant (i.e. no spreading of the original pixels occurs in the vertical direction, i.e. along the Y axis when performing the projection) whereas the horizontal pixel density varies according to the cosine of the y coordinate as shown in
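The following short sketch (illustrative only; the density law cos(π·(y − H/2)/H) is derived from the normalized densities and the coordinate change given above) evaluates the normalized horizontal and vertical pixel densities at a given row of an equirectangular picture:

```python
import math

def equirect_density(y, H):
    """Normalized (horizontal, vertical) pixel density at row y of an equirectangular
    picture of height H: (cos(phi), 1) with phi = pi * (y - H/2) / H."""
    phi = math.pi * (y - H / 2.0) / H
    return (math.cos(phi), 1.0)

H = 2048
print(equirect_density(H // 2, H))  # equator: (1.0, 1.0), no spreading
print(equirect_density(0, H))       # top row (pole): horizontal density close to 0
```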
As another example, we can reconsider the embodiment shown on
Here, k denotes the face number and (u,v), where u,v∈[−1,1], denote the coordinates on that face. Each face of the cube is of width w and of height h.
In that case,
Consequently, it can be seen from those two exemplary embodiments that, contrary to a classical 2D picture in which the information depends only on the content of the picture itself, a projected picture of a 3D surface may present deterministic distortions resulting from the projection operation. Such a difference is obviously not taken into account in conventional 2D video coders/decoders. The inventors therefore propose to adapt the size of a block BLK of an existing video codec (e.g. HEVC, H.264/AVC, JVET, VP10, QTBT, etc.) according to a pixel density function determined according to the projection function used for projecting the 3D surface onto the 2D picture. Depending on the video coding standard considered, such a block BLK may be identified with the macroblocks (MB) (such as in H.264/AVC) or the Coding Tree Units (CTU) (such as in HEVC), i.e. the units of pixels delivered by the subdividing module. It can further be identified with other subblocks encountered in those standards. For instance, according to an HEVC coder, a coding tree unit comprises a coding tree block (CTB) of luminance samples and two coding tree blocks of chrominance samples, together with corresponding syntax elements regarding further subdividing of coding tree blocks. A coding tree block of luminance samples may have a size of 16×16 pixels, 32×32 pixels or 64×64 pixels. Each coding tree block can be further subdivided into smaller blocks (known as coding blocks CB) using a tree structure and quadtree-like signaling. The root of the quadtree is associated with the coding tree unit. The size of the luminance coding tree block is the largest supported size for a luminance coding block. One luminance coding block and ordinarily two chrominance coding blocks form a coding unit (CU). A coding tree unit may contain one coding unit or may be split to form multiple coding units, each coding unit having an associated partitioning into prediction units (PU) and a tree of transform units (TU). The decision whether to code a picture area using inter-picture or intra-picture prediction is made at the coding unit level. A prediction unit partitioning structure has its root at the coding unit level. Depending on the basic prediction-type decision, the luminance and chrominance coding blocks can then be further split in size and predicted from luminance and chrominance prediction blocks (PB). The HEVC standard supports variable prediction block sizes from 64×64 down to 4×4 samples. The prediction residual is coded using block transforms. A transform unit (TU) tree structure has its root at the coding unit level. The luminance coding block residual may be identical to the luminance transform block or may be further split into smaller luminance transform blocks. The same applies to chrominance transform blocks. A transform block may have a size of 4×4, 8×8, 16×16 or 32×32 samples. In such an embodiment, it is therefore proposed to adapt the size of a CU, PU or TU according to the density function, in the same way as for a CTU or a macroblock.
Referring now to
More particularly, in one embodiment, the first CTU 1 is placed where the pixel density is maximum (for achieving a minimal distortion). For example, the first CTU 1 is placed at the center left of the image for the equirectangular layout, or at the center of the front face for the cube mapping. It is to be noted that the top left pixel of the first CTU can be chosen to be aligned horizontally and/or vertically on the horizontal and/or vertical center of the image/face, or at a position that is a multiple of p, p being the size of the smallest possible transform, or a power of 2 for hardware design simplification.
Once placed, the size of the first CTU 1 is adapted according to the pixel density. According to different variants, the adapted size is obtained as:
In case the pixel density function is a 2D pixel density function composed of both a horizontal and a vertical density function as discussed above in relation with
The process is then repeated, for instance in a raster-scan order in order to place and adapt the size of the other CTUs.
Reconsidering the embodiment where the projection function corresponds to an equirectangular mapping (as discussed above in relation with the
In one variant, the CTU adapted size is rounded to the closest power of two in order to have good split properties for the subblocks whose maximum size is derived from the CTU size (or to the closest multiple of the minimum transform size, depending on the split strategy and the codec abilities). Either the minimum block size or the maximum split depth is set to remain constant. In the latter case, it means that for a bigger CTU, the minimum transform size is bigger, which is coherent with a reduced pixel density. Thus, signaling can be reduced, as small transforms have little advantage for low pixel density. The value of the adapted size of the CTUs may be signaled in the bitstream to the decoder using a syntax element, e.g. an existing syntax element of the standard. For example, in case of an HEVC standard, the adapted size of the CTUs may be signaled in a Sequence Parameter Set (SPS) of the bitstream or in a Picture Parameter Set (PPS), or in a particular syntax element for the CU, PU or TU. According to another variant of this embodiment, no signaling is used and it is assumed that the decoder will perform the same derivations for determining the adapted size of the CTUs according to the density function.
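A minimal sketch of this per-row size adaptation is given below. It is only one possible reading of the variant above: the starting row, the clipping range and the evaluation of the density at the middle of each CTU row are assumptions, and the density follows the equirectangular law discussed earlier.

```python
import math

def adapted_ctu_sizes_per_row(H, nominal=64, min_size=16, max_size=128):
    """Derive one adapted CTU size per CTU row of an equirectangular picture of
    height H: nominal size divided by the horizontal density at the row centre,
    rounded to the closest power of two and clipped to codec-supported sizes."""
    sizes, y = [], 0
    while y < H:
        density = max(abs(math.cos(math.pi * (y + nominal / 2.0 - H / 2.0) / H)), 1e-3)
        size = 2 ** round(math.log2(nominal / density))
        size = max(min_size, min(max_size, size))
        sizes.append(size)
        y += size
    return sizes

# Larger CTUs are obtained near the poles, where the pixel density is low.
print(adapted_ctu_sizes_per_row(H=1024))
```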
Referring now to
In a variant, the size can be set constant on a CTU row (or column), for allowing the mapping of the CTU on the 2D image.
Referring now to
In the present embodiment, the CTU is rectangular. According to that embodiment, the same concepts as disclosed above in relation with
According to different variants, the width (respectively height) of the CTU of adapted size is obtained as:
In the embodiment illustrated in
Here again, the embodiment disclosed herein could apply equivalently to adapt the size of a CU, PU or TU.
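By analogy with the square case, a hedged sketch of the rectangular adaptation could look as follows; the use of the mean densities and the rounding/clipping policy are assumptions, not requirements of the present disclosure.

```python
import math

def _round_pow2(value, lo, hi):
    """Round to the closest power of two and clip to [lo, hi]."""
    return max(lo, min(hi, 2 ** round(math.log2(max(value, lo)))))

def adapted_rect_ctu(nominal_w, nominal_h, mean_density_h, mean_density_v,
                     min_size=8, max_size=128):
    """Adapt the width with the horizontal pixel density and the height with the
    vertical pixel density, e.g. for a cube-mapped picture."""
    width = _round_pow2(nominal_w / max(mean_density_h, 1e-6), min_size, max_size)
    height = _round_pow2(nominal_h / max(mean_density_v, 1e-6), min_size, max_size)
    return width, height

# A region where the horizontal spreading dominates yields a wider CTU.
print(adapted_rect_ctu(64, 64, mean_density_h=0.5, mean_density_v=0.9))  # -> (128, 64)
```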
Referring now to
Referring to our discussion at the beginning of this disclosure, such subblock can be for instance a CU, PU or TU in case the considered video coding standard is HEVC. Indeed, in such standard, the size decided for the CTU drives the maximum allowable size for the subblocks. More particularly, the size for those subblocks is derived from this maximum allowable size following a splitting process defined in the standard.
In one embodiment, the maximum allowable size is derived based on the method disclosed above in relation with
As for the selection of the size of the block BLK, the inventors propose to take advantage of the knowledge of the 2D pixel density function for improving the performance of known coders/decoders. More particularly, in the present embodiment, it is proposed to split such a subblock from its maximum allowable size (e.g. the CTU size for a CU, the CU size for a PU, etc.) according to a criterion depending on the pixel density function. Indeed, low pixel density areas will not benefit from small block intra prediction, and one coding mode is more appropriate when the block is consistent, so as to reduce distortions in inter prediction.
Consequently, when the shape of the subblocks can be rectangular, splits can be inferred in one dimension so that the subblock shape reflects the pixel density, i.e. the topology depends on the pixel density. According to a first variant, a vertical split (e.g. corresponding to a vertical fold line 2100) is decided if an average value of the horizontal pixel density function over an average value of the vertical pixel density function is greater than a threshold, e.g. if:
mean{Density2D_h(x,y), (x,y)∈B} > 2×mean{Density2D_v(x,y), (x,y)∈B}
with B the considered subblock.
Conversely, a horizontal split (e.g. corresponding to a horizontal fold line 2110) is decided if an average value of the vertical pixel density function over an average value of the horizontal pixel density function is greater than a threshold, e.g. if:
mean{Density2D_v(x,y), (x,y)∈B} > 2×mean{Density2D_h(x,y), (x,y)∈B}
with B the considered subblock.
Such a variant allows maintaining a same amount of information in both directions (vertical and horizontal).
According to a second variant, a vertical split (e.g. corresponding to a vertical fold line 2100) is decided if a maximum value of the horizontal pixel density function over a maximum value of the vertical pixel density function is greater than a threshold. For instance, the width of the subblock after splitting (Max-width) is derived from a maximal width of the subblock before splitting (MAX-CU-WIDTH) according to:
with B the considered subblock, >> is the binary shifting operator, and [.] is the default integer part operator (“floor” operator). In that case, a splitting occurs if the maximum value of the horizontal pixel density function is greater than two times the maximum value of the vertical pixel density function. Conversely, a horizontal split (e.g. corresponding to a horizontal fold line 2110) is decided if a maximum value of the vertical pixel density function over a maximum value of the horizontal pixel density function is greater than a threshold. For instance, the height of the subblock after splitting (Max-height) is derived from a maximal height of the subblock before splitting (MAX-CU-HEIGHT) according to:
with B the considered subblock.
Such a variant allows maintaining a same amount of information in both directions (vertical and horizontal).
According to a third variant, a vertical split (e.g. corresponding to a vertical fold line 2100) is decided if a maximum value of the horizontal pixel density function over a minimum value of the horizontal pixel density function is greater than a threshold, e.g.:
max{Density2D_h(x,y), (x,y)∈B} > 2×min{Density2D_h(x,y), (x,y)∈B}
with B the considered subblock.
Conversely, a horizontal split (e.g. corresponding to a horizontal fold line 2110) is decided if a maximum value of the vertical pixel density function over a minimum value of the vertical pixel density function is greater than a threshold, e.g.:
max{Density2D_v(x,y), (x,y)∈B} > 2×min{Density2D_v(x,y), (x,y)∈B}
with B the considered subblock.
It must be noted that the three variants discussed above are not exclusive and can be used together and combined.
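For illustration, a sketch of the first (mean-based) variant is given below; the function name, the way density samples are provided and the use of 2 as the threshold are assumptions consistent with the inequalities above.

```python
def split_direction(density_h, density_v, threshold=2.0):
    """Infer a split direction for a rectangular subblock B from its per-pixel
    density samples (first variant, mean-based criterion).

    density_h, density_v: horizontal / vertical density values of the pixels of B.
    Returns 'vertical', 'horizontal' or None when no split is inferred."""
    mean_h = sum(density_h) / len(density_h)
    mean_v = sum(density_v) / len(density_v)
    if mean_h > threshold * mean_v:
        return "vertical"      # split along a vertical fold line
    if mean_v > threshold * mean_h:
        return "horizontal"    # split along a horizontal fold line
    return None

# Horizontal density much larger than vertical density -> vertical split inferred.
print(split_direction([0.9, 0.8, 0.85], [0.3, 0.35, 0.3]))  # -> 'vertical'
```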
In one embodiment, the splitting process is recursive and the above-discussed criteria are applied successively to the subblocks obtained from one given block as long as they remain valid. Once the criterion used for deciding the splitting is no longer satisfied, the splitting process stops and is reinitialized for another block.
After the splitting policy according to the disclosed method has been applied, the resulting subblocks can be further split by a rate-distortion optimization (RDO) process performed by an encoder. According to one variant, the splitting of the subblock according to the disclosed method may be signaled in the bitstream to the decoder (e.g. splitting of a CU into a set of TUs, or splitting of a macroblock into a set of subblocks) using a syntax element, e.g. the existing syntax of the standard used to signal splitting information. Such a syntax element is generally coded in particular syntax elements associated with the CU, PU or TU, or macroblock. According to another variant of this embodiment, no signaling is used and it is assumed that the decoder performs the same derivations for determining the splitting of the subblock according to the pixel density function.
Referring now to
w×max{Density2D_h(x,y), (x,y)∈B} and h×max{Density2D_v(x,y), (x,y)∈B},
where max{Density2D_h(x,y), (x,y)∈B} and max{Density2D_v(x,y), (x,y)∈B} denote the maximum horizontal and vertical pixel densities over the considered subblock B of width w and height h.
Consequently, since the number of "effective pixels" (i.e. pixels effectively carrying information) is smaller than the number of pixels of the pixel grid in the considered subblock B, the quantity of information is expected to be reduced accordingly.
It thus appears that the subblock B, once transformed into the frequency domain, can be represented over a reduced number of transform coefficients, i.e. less than w×h coefficients that could be expected originally in respect of the size of that subblock B.
In one embodiment, the size of the subblock B is derived based on the method disclosed above in relation with
In another embodiment, the size of the subblock B is derived according to a known splitting method.
In other words, some transform subblocks in the picture can be made sparse (i.e. some coefficients inside the transform block can be anticipated as equal to zero), as a function of the projected picture pixel density at the considered block's spatial location.
Consequently, an incomplete transform may be applied to the subblocks in order to deliver such sparse transformed subblocks according to a criterion function of a pixel density function. The efficiency of the implementation of the transform is therefore improved compared to the present implementation for a given size of subblock.
In a variant, it is proposed to perform the following acts for implementing an incomplete 1D transform according to the present disclosure (
D_h,max^B = max{Density2D_h(x,y), (x,y)∈B}
In case the equirectangular mapping is used, the above three acts are sufficient for implementing the density-based incomplete transform, because the pixel density only depends on the y-coordinate of the pixel in the picture (see our discussion above in respect of
In case the density also varies along the y-axis inside the mapped picture (as may happen for instance in a cubic mapping as discussed above in relation with
D_v,max^B = max{Density2D_v(x,y), (x,y)∈B}
In one exemplary embodiment, the Nh,max and the Nv,max values are given by:
N_h,max = α×W×D_h,max^B
N_v,max = α×H×D_v,max^B
with α a weighting factor greater or equal to one (e.g. equal to 1.2).
In another exemplary embodiment, the Nh,max and the Nv,max values are given by:
N_h,max = W + β×log2(D_h,max^B)
N_v,max = H + β×log2(D_v,max^B)
with β a weighting factor between 0 and 1, e.g. equal to 0.8 (D_h,max^B and D_v,max^B being less than 1, log2(D_h,max^B) and log2(D_v,max^B) are negative).
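The two exemplary derivations of Nh,max and Nv,max can be sketched as below (illustrative only; here W and H are taken to be the width and height of the considered transform block, and the clipping to [1, block size] is an added assumption):

```python
import math

def max_nonzero_coeffs(block_w, block_h, d_h_max, d_v_max,
                       alpha=1.2, beta=0.8, variant="scale"):
    """Maximum number of non-zero transform coefficients kept along each direction.

    variant 'scale': N = alpha * size * D_max       (first exemplary embodiment)
    variant 'log'  : N = size + beta * log2(D_max)  (second exemplary embodiment;
                                                     D_max <= 1, so the term is <= 0)"""
    if variant == "scale":
        n_h, n_v = alpha * block_w * d_h_max, alpha * block_h * d_v_max
    else:
        n_h, n_v = block_w + beta * math.log2(d_h_max), block_h + beta * math.log2(d_v_max)
    clip = lambda n, size: max(1, min(size, int(round(n))))
    return clip(n_h, block_w), clip(n_v, block_h)

print(max_nonzero_coeffs(32, 32, d_h_max=0.25, d_v_max=1.0, variant="scale"))  # -> (10, 32)
print(max_nonzero_coeffs(32, 32, d_h_max=0.25, d_v_max=1.0, variant="log"))    # -> (30, 32)
```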
Practically speaking, the transform is now implemented for a subblock that may be rectangular and not square. If we recall that a separable 2D integer transform is formulated as follows:
∀X∈ℝ^(n×n), T(X) = A·X·A^T
where X is the 2D square block to transform, and A is the considered orthogonal transform matrix, the matrix implementation of the incomplete transform takes the following form:
∀X∈ℝ^(m×n), T(X) = A·X·B^T
where A and B are respectively m×m and n×n transform matrices. The 2D separable transform thus consists in the successive application of horizontal and vertical 1D transform. Considering the horizontal case in the sequel, without any loss in generality, the horizontal discrete cosine transform (DCT) is expressed as follows:
T_h(X) = X·B^T
Given the Nh,max maximum of non-zero coefficients in the target transform block, this matrix computation is performed as follows:
A similar approach is applied for the vertical 1D transform.
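A possible sketch of such an incomplete separable transform is given below; it is not the matrix computation of the present disclosure (which is not fully reproduced here) but one way to obtain the same effect, using explicitly built orthonormal DCT-II matrices and computing only the first Nh,max coefficient columns and Nv,max coefficient rows:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (row k holds the k-th basis vector)."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2.0 * n)) * np.sqrt(2.0 / n)
    m[0, :] = np.sqrt(1.0 / n)
    return m

def incomplete_transform(block, n_h_max, n_v_max):
    """Separable 2D transform of an h x w block in which only the first n_h_max
    coefficient columns and n_v_max coefficient rows are computed; all other
    coefficients are forced to zero (density-based incomplete transform)."""
    h, w = block.shape
    A = dct_matrix(h)   # vertical transform matrix
    B = dct_matrix(w)   # horizontal transform matrix
    coeffs = np.zeros((h, w))
    partial = block @ B[:n_h_max, :].T                      # horizontal pass, truncated
    coeffs[:n_v_max, :n_h_max] = A[:n_v_max, :] @ partial   # vertical pass, truncated
    return coeffs

block = np.arange(64, dtype=float).reshape(8, 8)
t = incomplete_transform(block, n_h_max=4, n_v_max=8)
print(np.count_nonzero(t[:, 4:]))  # -> 0: the high-frequency columns remain zero
```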
In one variant, the number of non-zero coefficients may be signaled in the bitstream to the decoder using a syntax element, e.g. the existing syntax of the standard. According to another variant of this embodiment, no signaling is used and it is assumed that the decoder performs the same derivations as detailed above for the coder in order to determine the non-zero coefficients according to the pixel density function, in order to apply an incomplete inverse transform.
Referring now to
According to the present embodiment, the design of density-based incomplete transforms takes into account the fact that the 2D density function varies not only between successive coding blocks but also from pixel line to pixel line and from pixel column to pixel column in the considered mapped picture. Consequently, it is possible to enhance the incomplete transform concept discussed above in relation with
However, the variations in the number of coefficients can only be applied to the first separable sub-transform. For the second transform step, the max density along the considered direction in the block must be used. In a variant of the embodiment, the order of sub-transforms to be applied is chosen to begin with the direction having the most density variation in the block.
According to another variant, the order of sub-transforms to be applied is chosen to end with the direction having the smallest maximum density value in the block, in order to ensure that a maximum number of high-frequency coefficients are forced to be zero in the proposed transform process.
For instance, if the first separable sub-transform is selected as being the horizontal 1D transform, the following acts apply:
Determine the maximum number of non-zero transform coefficients Nh,max(j) (e.g. according to the embodiment discussed above for deriving Nh,max) as a function of the pixel density level for the y-coordinate equal to j: Dh(j).
Dh(j) = max{Density2D_h(x,j), (x,j)∈B}
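A short sketch of this per-row derivation (reusing, for illustration only, the α-based formula of the first exemplary embodiment) could be:

```python
def per_row_coeff_budget(d_h_rows, block_w, alpha=1.2):
    """One maximum number of non-zero horizontal coefficients per row j of the block,
    derived from the maximum horizontal density Dh(j) observed on that row."""
    return [max(1, min(block_w, int(round(alpha * block_w * d)))) for d in d_h_rows]

# An 8-sample-wide block whose rows get progressively closer to a pole.
print(per_row_coeff_budget([1.0, 0.8, 0.5, 0.25], block_w=8))  # -> [8, 8, 5, 2]
```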
According to another embodiment, the embodiments disclosed with
In a standard like HEVC, it appears that the position of the last significant coefficient in a TU following the forward scan order is coded first, before coding the transform coefficients themselves. The position of the last significant coefficient in a TU is coded by explicitly signaling its (X,Y)-coordinates. Coordinate X indicates the column number and Y the row number.
However, as explained above in relation with
Furthermore, in a standard like HEVC, a significance map is used to code zero and non-zero coefficients in a considered TU. Since the last significant coefficient position is already known, the significance map scan pass starts at the coefficient before the last significant coefficient in the scan order, and continues backwards until the top-left coefficient in the transform block has been processed. This backward scanning is performed coefficient group (CG) by coefficient group in the transform block, and is performed coefficient by coefficient inside each CG.
However, in one embodiment, it is proposed to avoid coding the significance of coefficients that are anticipated to be zero, based on the knowledge of the pixel density function as explained above. In particular, if the most advanced embodiment for the design of the incomplete transform was used (e.g. in which the number of non-zero transform coefficients Nh,max(j) is selected on a line-by-line basis), then the construction of the incomplete transform can be reproduced in strictly the same way on the decoder side. Therefore, an a priori map of necessarily zero coefficients can be deduced by the decoder and used to decode an incomplete significance map. This thus allows reducing the bit rate of the bitstream by reducing the quantity of information transmitted for the significance map. However, this requires that the decoder performs the same derivations as detailed above for the coder in order to determine which coefficients are anticipated to be zero even though they are not indicated as such in the significance map.
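As an illustration of this symmetry between encoder and decoder, the sketch below (names and layout are purely hypothetical) builds the a priori map of necessarily-zero coefficients from the per-row budgets, so that significance flags are only coded for the remaining positions:

```python
def a_priori_zero_map(per_row_budget, block_w):
    """Boolean map (True = coefficient known to be zero a priori), derived identically
    at the encoder and at the decoder from the per-row coefficient budgets."""
    return [[col >= budget for col in range(block_w)] for budget in per_row_budget]

zero_map = a_priori_zero_map([8, 8, 5, 2], block_w=8)
# Significance flags are only coded/parsed where zero_map[row][col] is False.
print(sum(flag for row in zero_map for flag in row))  # -> 9 positions skipped
```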
However, the present principle could be applied to any 3D representation of an omnidirectional video where the 3D surface is projected onto at least one 2D picture using a projection function. Indeed, such projection function results in any case in deterministic distortions that lead to a 2D pixel density function as discussed above in relation with
According to one embodiment of the proposed method, the size of a block BLK used by an existing video codec (e.g. HEVC, H.264/AVC, JVET, VP10, QTBT, etc.) is adapted S2400 according to a pixel density function determined according to the projection function used for projecting the 3D surface onto the 2D picture according to the method disclosed above in relation with
Depending on the video coding standard considered, such block BLK may be identified as the macroblocks (MB) (such as in H.264/AVC) or the Coding Tree Units (CTU) (such as in HEVC), i.e. the units of pixels delivered by a subdividing module of the encoder. It can further be identified to other subblocks encountered in those standards. For instance, according to an HEVC coder, it can be a CU, a PU, or a TU.
The block of adapted size is then encoded S2410 in order to deliver the bitstream.
In the present embodiment, a conventional video encoder implementing a given standard performs the encoding operation. In one variant, the encoder is a HEVC encoder as discussed later on in relation with
As discussed above in relation with
In another embodiment, the block of adapted size is further split S2400a in subblocks according to a criterion function of the pixel density function as disclosed above in relation with
As discussed above in relation with
In another embodiment, the transform operation (e.g. a DCT in H.264/AVC or HEVC) enforced during the encoding operation is applied in a partial form.
According to this embodiment, a size of the transform is determined (S2400b) according to the method disclosed in
Then, an incomplete transform may be applied (S2410a) to the subblocks in order to deliver sparse transformed subblocks according to a criterion function of a pixel density function as discussed above in relation with
In one variant, it is proposed to implement an incomplete 1D transform (e.g. in case of an equirectangular mapping that results in a 2D density function that depends on only one coordinate, Y). In another variant, it is proposed to implement an incomplete transform for both the horizontal and the vertical 1D transform. In one variant, the number of non-zero coefficients may be signaled in the bitstream to the decoder using a syntax element, e.g. the existing syntax of the standard (e.g. in a significance map in the bitstream). According to another variant of this embodiment, no signaling is used and it is assumed that the decoder will perform the same derivations as detailed above for the coder in order to determine the non-zero coefficients according to the pixel density function, in order to apply an incomplete inverse transform. According to another embodiment, delivering an adapted size when adapting (S2400) the size of the block BLK may be optional. According to this embodiment, adapting the size of the block BLK comprises splitting the current block BLK into a set of subblocks. According to this embodiment, the size of the block BLK is adapted by splitting the current block BLK in the same manner as disclosed above (S2400a) for the block BLK of adapted size.
According to another embodiment, delivering an adapted size when adapting (S2400) the size of the block BLK may be optional. According to this embodiment, adapting (S2400) the size of the block BLK delivers an adapted size of the transform (S2400b) to be applied to the block BLK when encoding the block BLK as disclosed above in relation with
As discussed extensively in relation with
According to another variant, no signaling is used and it is assumed that the decoder performs the same derivations as the encoder for determining the adapted size of the block BLK based on the pixel density function. In that case, the decoder performs the same operations as the encoder described above in relation with
Depending on the video coding standard considered, such block BLK may be identified as the macroblocks (MB) (such as in H.264/AVC) or the Coding Tree Units (CTU) (such as in HEVC), i.e. the units of pixels delivered by a subdividing module of a decoder. It can further be identified to other subblocks encountered in those standards. For instance, according to an HEVC coder, it can be a CU, a PU, or a TU.
In another embodiment, the block of adapted size is further split S2500a in subblocks and the decoding operation S2510 is performed for each of the subblocks that compose the block of adapted size in order to deliver the corresponding 2D picture.
As discussed above in relation with
According to another variant of this embodiment, no signaling is used and it is assumed that the decoder performs the same derivations for determining the splitting information of the subblock based on the pixel density function, thus minimizing the syntax elements going through the network and thus the bitrate of the bitstream.
As discussed in relation with
In another embodiment, the inverse transform operation S2510a (e.g. an inverse DCT in H.264/AVC or HEVC) enforced during the decoding operation is applied in a partial form. As discussed above in relation with
According to another embodiment, delivering an adapted size when adapting (S2500) the size of the block BLK may be optional. According to this embodiment, adapting (S2500) the size of the block BLK delivers an adapted size of the inverse transform (S2500b) to be applied to the block BLK when decoding the block BLK as disclosed above in relation with
The video encoder 400 is disclosed as conforming to an HEVC coder; however, the present principle may apply to any 2D video coding scheme processing video as a sequence of 2D pictures. Classically, the video encoder 400 may include several modules for block-based video encoding, as illustrated in
Firstly, a subdividing module divides the picture I into a set of units of pixels.
The encoding process is described below as applying on a unit of pixels that is called a block BLK. Such a block BLK may correspond to a macroblock, or a coding tree unit, or any subblock from one of the units described above, or any other layout of subdivision of picture I comprising luminance samples and chrominance samples, or luminance samples only.
The encoder, as well as the decoder, described below is for illustration purposes. According to some embodiments, encoding or decoding modules may be added, or removed or may vary from the following modules. However, the principle disclosed herein could still be applied to these embodiments. The subdividing module divides the picture I into a set of blocks of adapted size according to any embodiments disclosed in relation with
The value of the adapted size of the block BLK may be signaled in the bitstream to the decoder using a syntax element, e.g. an existing syntax element of the standard. Such a syntax element is generally coded in a Sequence Parameter Set (SPS) of the bitstream. According to another variant, no signaling is used and it is assumed that the decoder will perform the same derivations for determining the adapted size of the block BLK according to the pixel density function.
The encoder 400 performs encoding of each block BLK of the picture I as follows. The encoder 400 comprises a mode selection unit for selecting a coding mode for a block BLK of a picture to be coded, e.g. based on a rate/distortion optimization. Such a mode selection unit comprising:
The mode selection unit may also decide whether subdivision of the block is needed according to rate/distortion optimization for instance. In that case, the mode selection unit then operates for each subblock of the block BLK. In case of an HEVC coder, the subblocks may correspond to CU, PU or TU. The principle disclosed in relation with
A splitting of a subblock may then occur according to any of the embodiments disclosed in relation with
According to another embodiment, the size of a CU is split so as to provide an adapted size for a PU according to the pixel density function. The splitting may be signaled in the bitstream to the decoder using a syntax element, e.g. the existing syntax of the standard for specifying splitting of a CU size for the PU. According to another variant of this embodiment, no signaling is used and it is assumed that the decoder performs the same splitting of the CU size for adapting the size of the PU according to the pixel density function.
When the CTU is split, the mode selection unit selects a coding mode for each CU, and possibly each PU, of the CTU.
Once a coding mode or coding modes is/are selected for the current block BLK, the mode selection unit delivers a predicted block PRED and corresponding syntax elements to be coded in the bitstream for performing the same block prediction at the decoder. When the current block BLK has been split, the predicted block PRED is formed by the set of predicted subblocks (CU/PU) delivered by the mode selection unit for each subblocks (CU/PU).
A residual block RES is then obtained by subtracting the predicted block PRED from the original block BLK.
The residual block RES is then transformed by a transform processing module delivering a transform block TCOEF of transformed coefficients. In case the transform processing module operates on transform blocks of size smaller than the residual block RES, the transform processing module delivers a set of corresponding transform blocks TCOEF. For instance, a rate/distortion optimization may be performed to decide whether a large transform block or smaller transform blocks should be used. In case of an HEVC coder, the transform processing module operates on blocks called transform units (TU). According to an embodiment of the present principle, the size of a TU may be adapted according to a pixel density function depending on the projection function used for projecting the 3D surface onto the 2D picture as disclosed with
According to another embodiment, the transform processing module operates the transform on the block RES using an incomplete transform as disclosed in relation with
In one variant, the number of non-zero coefficients may be signaled in the bitstream to the decoder using a syntax element, e.g. the existing syntax of the standard (e.g. in a significance map in the bitstream). According to another variant of this embodiment, no signaling is used and it is assumed that the decoder performs the same derivations as detailed above for the coder in order to determine the non-zero coefficients according to the density function, in order to apply an incomplete inverse transform. Each delivered transform block TCOEF is then quantized by a quantization module delivering a quantized transform block QCOEF of quantized residual transform coefficients.
The syntax elements and quantized residual transform coefficients of the block OCOEF are then input to an entropy coding module to deliver the coded video data of the bitstream STR.
For instance, HEVC uses context-adaptive binary arithmetic coding, also known as CABAC. The arithmetic coding performed by the entropy coding module encodes an entire stream of bits, obtained after a suitable binarization of the symbols to encode (syntax elements, quantized transform coefficients, etc.), by their joint probability, represented by an interval in (0, 1). The entropy coding module performs arithmetic coding by modelling the probabilities of the symbols through context models for the different syntax elements and by updating the model states after encoding every bit. The context models initialize the probabilities based on neighborhood encoding information.
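As a conceptual illustration only (not the actual CABAC state machine, binarization or renormalization of HEVC), the toy model below tracks an adaptive estimate of P(bit = 1) per context and shows how the unit interval (0, 1) shrinks to the joint probability of the coded bits.

```python
class ToyContext:
    """Toy adaptive context model: an exponentially updated estimate of
    P(bit = 1), standing in for CABAC's table-driven probability states."""
    def __init__(self, p_one=0.5, rate=0.05):
        self.p_one, self.rate = p_one, rate

    def update(self, bit):
        self.p_one += self.rate * ((1.0 if bit else 0.0) - self.p_one)

def joint_interval(bits, ctx):
    """Shrink the (0, 1) interval bit by bit; the final width equals the
    joint probability of the bit string under the adaptive model."""
    low, width = 0.0, 1.0
    for bit in bits:
        if bit:                              # '1' maps to the upper sub-interval
            low += width * (1.0 - ctx.p_one)
            width *= ctx.p_one
        else:                                # '0' maps to the lower sub-interval
            width *= (1.0 - ctx.p_one)
        ctx.update(bit)                      # model state updated after every bit
    return low, width
```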
The quantized residual transform coefficients of the quantized transform block OCOEF are processed by an inverse quantization module delivering a block TCOEF′ of dequantized transform coefficients. The block TCOEF′ is passed to an inverse transform module for reconstructing a block of residual prediction RES′. When the transform processing module operated on a set of TUs from the current block RES, the inverse transform module also operates on the same set of TUs. The block RES′ is thus formed by the inverse transformed samples for the set of TUs.
A reconstructed version REC of the block BLK is then obtained by adding the prediction block PRED to the reconstructed residual prediction block RES′. The reconstructed block REC is stored in memory for later use by a picture reconstruction module for reconstructing a decoded version I′ of the picture I. Once all the blocks BLK of the picture I have been coded, the picture reconstruction module performs reconstruction of the decoded version I′ of the picture I from the reconstructed blocks REC. Optionally, deblocking filtering may be applied to the reconstructed picture I′ for removing blocking artifacts between reconstructed blocks.
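A minimal sketch of this reconstruction path, reusing the DCT helper idea from the earlier sketch; the variable names mirror the text (OCOEF, TCOEF′, RES′, PRED, REC) but the code itself is only an illustrative assumption, not the actual HEVC reconstruction process.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal type-II DCT matrix of size n x n."""
    i = np.arange(n)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i[None, :] + 1) * i[:, None] / (2 * n))
    c[0, :] /= np.sqrt(2.0)
    return c

def reconstruct_block(ocoef, pred, qstep):
    """Inverse quantization, inverse 2D transform, then REC = PRED + RES',
    clipped to the 8-bit sample range."""
    n = ocoef.shape[0]
    c = dct_matrix(n)
    tcoef_rec = ocoef.astype(float) * qstep    # TCOEF': dequantized coefficients
    res_rec = c.T @ tcoef_rec @ c              # RES': inverse 2D DCT
    rec = np.round(pred.astype(float) + res_rec)
    return np.clip(rec, 0, 255).astype(np.uint8)
```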
Once the picture I′ has been reconstructed and possibly deblocked, the resulting reconstructed picture is added to a reference picture memory for later use as a reference picture for encoding the following pictures of the set of pictures to code.
The bitstream generated from the above-described encoding process is then transmitted over a data network or stored in a memory for immersive rendering of an omnidirectional video decoded from the bitstream STR.
A bitstream STR representative of coded pictures, each representative of a projection of an omnidirectional video onto a 2D picture, comprises coded data representative of at least one current block BLK of said 2D picture. Such a current block may have been coded according to an embodiment of the present disclosure.
According to an embodiment, the bitstream STR may also comprise coded data representative of an item of information relating to the projection function.
The video decoder 700 disclosed herein performs the decoding of the pictures according to the HEVC video coding standard. However, the present principle could easily be applied to any video coding standard.
The video decoder 700 performs the reconstruction of the omnidirectional video by decoding the coded pictures from the bitstream on a picture-by-picture basis and by decoding each picture on a block-by-block basis. According to the video compression scheme used, parallel processing may be used for decoding the bitstream, either on a picture basis or on a block basis. A picture I′ is thus reconstructed from the compressed bitstream as follows.
The coded data is passed to the video decoding modules of the video decoder 700 for reconstructing the blocks of the picture I. According to the present principle, the size of the block to reconstruct depends on a pixel density function computed according to the projection function.
The video decoder 700 first decodes from the bitstream a syntax element indicating the size of the CTU used for encoding the pictures of the 2D video; e.g. such a size may be 16×16, 32×32 or 64×64 pixels.
Such a syntax element is generally coded in a Sequence Parameter Set (SPS) of the bitstream.
According to a variant, the size of the block to reconstruct is first computed from the decoded value representing the size of a CTU and according to the principle disclosed in relation with
According to another variant, the adapted size of the CTU is decoded from the bitstream.
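For the no-signaling variant, a decoder could recompute the adapted size exactly as the encoder did; the sketch below mirrors the illustrative encoder-side helpers (equirectangular assumption, hypothetical function name and size set), not any normative derivation.

```python
import math

def decoder_adapted_ctu_size(sps_ctu_size, ctu_top, pic_height,
                             allowed_sizes=(16, 32, 64, 128)):
    """Recompute the adapted CTU size from the size decoded in the SPS
    and the same pixel density function as used at the encoder."""
    latitude = math.pi * (ctu_top + 0.5) / pic_height - math.pi / 2.0
    density = max(math.cos(latitude), 1e-3)
    target = sps_ctu_size / density
    return min(allowed_sizes, key=lambda s: abs(s - target))
```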
Then, the block with adapted size is reconstructed as follows.
As illustrated in
After entropy decoding, the block OCOEF of quantized transform coefficients is inverse quantized by the inverse quantization module to deliver a block TCOEF′ of dequantized transform coefficients. The block TCOEF′ of dequantized transform coefficients is inverse transformed by an inverse transform module delivering a residual prediction block RES′.
According to a variant, in case the bitstream syntax indicates that the residual prediction block has been adapted during encoding so as to operate the transform on blocks (TU) of size smaller than the residual block RES′, the inverse transform module operates the inverse transform for each TU of the residual prediction block to reconstruct. Such size adaptation or splitting may have been decided by the encoder according to a rate/distortion optimization process or for adapting the size of the TUs according to the principle disclosed with
According to a variant, the adapted size or the splitting of the size of the residual block so as to provide an adapted size for the TU is signaled to the decoder using the existing syntax of the standard for specifying the splitting of a CU size for the TU size. According to another variant of this embodiment, no signaling is used and it is assumed that the decoder performs the same size adaptation or splitting of the size of the residual block so as to provide an adapted size for the TU according to the pixel density function, as done by the encoder 400.
The inverse transform module operates the inverse transform for each TU associated with the residual block RES and delivers a block of samples (i.e. inverse transformed coefficients) for each TU. The delivered blocks of samples thus form the residual prediction block RES′.
According to another embodiment, the inverse transform module implements an incomplete transform as disclosed in relation with
A reconstructed block REC is then obtained by adding the prediction block PRED to the reconstructed residual prediction block RES′. The reconstructed block REC is stored in memory for later use by a picture reconstruction module for reconstructing the decoded picture I′. Once all the blocks of the picture I have been decoded, the picture reconstruction module performs reconstruction of the decoded picture I′ from the reconstructed blocks REC. Optionally, deblocking filtering may be applied to the reconstructed picture I′ for removing blocking artifacts between reconstructed blocks.
The reconstructed picture I′ is then added to a reference picture memory for later use as a reference picture for decoding the following pictures of the set of pictures to decode.
The reconstructed picture I′ is then stored on a memory or output by the video decoder apparatus 700 to an immersive rendering device 10 as disclosed above. The video decoder apparatus 700 may also be comprised in the immersive rendering device 80. In that case, the reconstructed picture I′ is output by the decoder apparatus to a display module of the immersive rendering device 80.
According to the immersive rendering system implemented, the disclosed decoder apparatus may be comprised in any one of the processing devices of an immersive rendering system such as disclosed herein, for instance in a computer 40, a game console 60, a smartphone 701, an immersive rendering device 80 or an immersive wall 6000.
The decoder apparatus 700 may be implemented as hardware, software, or a combination of hardware and software.
According to an embodiment, the encoder apparatus comprises a processing unit PROC equipped for example with a processor and driven by a computer program PG stored in a memory MEM and implementing the method for coding an omnidirectional video according to the present principles. At initialization, the code instructions of the computer program PG are for example loaded into a RAM (not shown) and then executed by the processor of the processing unit PROC. The processor of the processing unit PROC implements the steps of the method for coding an omnidirectional video which have been described here above, according to the instructions of the computer program PG.
The encoder apparatus comprises a communication unit COMOUT to transmit an encoded bitstream STR to a data network.
The encoder apparatus also comprises an interface COMIN for receiving a picture to be coded or an omnidirectional video to encode.
According to an embodiment, the decoder apparatus comprises a processing unit PROC equipped for example with a processor and driven by a computer program PG stored in a memory MEM and implementing the method for decoding a bitstream representative of an omnidirectional video according to the present principles.
At initialization, the code instructions of the computer program PG are for example loaded into a RAM (not shown) and then executed by the processor of the processing unit PROC. The processor of the processing unit PROC implements the steps of the method for decoding a bitstream representative of an omnidirectional video which have been described here above, according to the instructions of the computer program PG.
The apparatus may comprise a communication unit COMOUT to transmit the reconstructed pictures of the video data to a rendering device.
The apparatus also comprises an interface COMIN for receiving a bitstream STR representative of the omnidirectional video to decode from a data network, a gateway, or a Set-Top-Box.
Priority claim: application 16306264.9, filed Sep 2016, EP (regional).
PCT filing document: PCT/EP2017/071831, filed 8/31/2017, WO (00).