The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent the work is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Omnidirectional video/360 video can be rendered to provide an immersive user experience. For example, in a virtual reality application, computer technologies create realistic images, sounds, and other sensations that replicate a real environment or create an imaginary setting, so that a user can have a simulated omnidirectional video/360 video experience of physical presence in an environment.
Aspects of the disclosure provide an apparatus that includes an interface circuit, a processing circuit, and a display device. The interface circuit is configured to receive media data with video content being structured into one or more tracks corresponding to one or more spatial partitions. The media data includes a correspondence of the one or more tracks to the one or more spatial partitions. The processing circuit is configured to extract the correspondence of the one or more tracks to the one or more spatial partitions, select, from the one or more tracks, one or more covering tracks with spatial partitions covering a region of interest based on the correspondence, and generate images of the region of interest based on the one or more covering tracks. The display device is configured to display the images of the region of interest.
According to an aspect of the disclosure, the processing circuit is configured to determine a correspondence of a track to a spatial partition based on spatial partition information associated with the track.
According to an aspect of the disclosure, the processing circuit is configured to determine a projection type based on a projection indicator, and determine the correspondence based on the projection type. In an embodiment, the processing circuit is configured to extract values in a spherical coordinate system that define the spatial partition when the projection indicator is indicative of equirectangular projection (ERP). For example, the processing circuit is configured to determine a center point and a field of view that define the spatial partition based on the values in the spherical coordinate system. In another example, the processing circuit is configured to determine boundaries that define the spatial partition based on the values in the spherical coordinate system.
In another embodiment, the processing circuit is configured to extract a face index that identifies the spatial partition when the projection indicator is indicative of platonic solid projection.
Aspects of the disclosure provide a method for image rendering. The method includes receiving media data with video content being structured into one or more tracks corresponding to one or more spatial partitions. The media data includes a correspondence of the one or more tracks to the one or more spatial partitions. Further, the method includes extracting the correspondence of the one or more tracks to the one or more spatial partitions, selecting, from the one or more tracks, one or more covering tracks with spatial partitions covering a region of interest based on the correspondence, generating images of the region of interest based on the one or more covering tracks, and displaying the images of the region of interest.
Aspects of the disclosure provide an apparatus that includes a memory and a processing circuit. The memory is configured to buffer captured media data. The processing circuit is configured to structure video content of the captured media data into one or more tracks corresponding to one or more spatial partitions, encode the media data and encapsulate the encoded media data with a correspondence of the one or more tracks to the one or more spatial partitions into one or more files.
Aspects of the disclosure provide a method. The method includes receiving captured media data, structuring video content of the captured media data into one or more tracks corresponding to one or more spatial partitions, encoding the media data and encapsulating the encoded media data with a correspondence of the one or more tracks to the one or more spatial partitions into one or more files.
Various embodiments of this disclosure that are proposed as examples will be described in detail with reference to the following figures, wherein like numerals reference like elements.
According to an aspect of the disclosure, the source system 110 structures media data logically in one or more tracks, and each track includes a sequence of samples in time order. In an embodiment, the source system 110 structures image/video data into one or more tracks according to spatial partitions. The one or more tracks are encapsulated in one or more files. Further, the source system 110 includes a correspondence between a track and a spatial partition to assist rendering. Thus, in an example, based on the correspondence, the rendering system 160 can fetch appropriate tracks to generate images of a region of interest.
The source system 110 can be implemented using any suitable technology. In an example, components of the source system 110 are assembled in a device package. In another example, the source system 110 is a distributed system; components of the source system 110 can be arranged at different locations, and are suitably coupled together, for example by wired connections and/or wireless connections.
In an example, the source system 110 includes an acquisition device 112, a memory 115, and a processing circuit 120 that are suitably coupled together.
The acquisition device 112 is configured to acquire various media data, such as images, sound, and the like of omnidirectional video/360 video. The acquisition device 112 can have any suitable settings. In an example, the acquisition device 112 includes a camera rig (not shown) with multiple cameras, such as an imaging system with two fisheye cameras, a tetrahedral imaging system with four cameras, a cubic imaging system with six cameras, an octahedral imaging system with eight cameras, an icosahedral imaging system with twenty cameras, and the like, configured to take images of various directions in a surrounding space.
In an embodiment, the images taken by the cameras overlap, and can be stitched to provide a larger coverage of the surrounding space than a single camera. In an example, the images taken by the cameras can provide 360° sphere coverage of the whole surrounding space. It is noted that the images taken by the cameras can also provide less than 360° sphere coverage of the surrounding space.
The media data acquired by the acquisition device 112 can be suitably stored or buffered, for example in the memory 115. The processing circuit 120 can access the memory 115, process the media data, and encapsulate the media data in suitable format. The encapsulated media data is then suitably stored or buffered, for example in the memory 115.
In an embodiment, the processing circuit 120 includes an audio processing path configured to process audio data, and includes an image/video processing path configured to process image/video data. The processing circuit 120 then encapsulates the audio, image and video data with metadata according to a suitable format.
In an example, on the image/video processing path, the processing circuit 120 can stitch images taken from different cameras together to form a stitched image, such as an omnidirectional image, and the like. Then, the processing circuit 120 can project the omnidirectional image onto a suitable two-dimensional (2D) plane to convert the omnidirectional image into 2D images that can be encoded using 2D encoding techniques. Then the processing circuit 120 can suitably encode the image and/or a stream of images.
It is noted that the processing circuit 120 can project the omnidirectional image according to any suitable projection technique. In an example, the processing circuit 120 can project the omnidirectional image using equirectangular projection (ERP). The ERP projection projects a sphere surface, such as an omnidirectional image, to a rectangular plane, such as a 2D image, in a manner similar to projecting the earth's surface to a map. In an example, the sphere surface (e.g., the earth's surface) uses a spherical coordinate system of yaw (e.g., longitude) and pitch (e.g., latitude), and the rectangular plane uses an XY coordinate system. During the projection, the yaw circles are transformed into vertical lines and the pitch circles are transformed into horizontal lines; the yaw circles and the pitch circles are orthogonal in the spherical coordinate system, and the vertical lines and the horizontal lines are orthogonal in the XY coordinate system.
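By way of illustration only, the ERP mapping can be sketched in a few lines of Python; the function name and the assumption that yaw spans [-180°, 180°] and pitch spans [-90°, 90°] are choices made for this example rather than requirements of the disclosure.

    def erp_project(yaw, pitch, width, height):
        # Map a sphere-surface point (yaw, pitch), given in degrees, to
        # pixel coordinates on a width-by-height ERP image. A yaw circle
        # (constant yaw) maps to a vertical line of constant x, and a
        # pitch circle (constant pitch) maps to a horizontal line of
        # constant y.
        x = (yaw + 180.0) / 360.0 * width
        y = (90.0 - pitch) / 180.0 * height
        return (x, y)

For example, erp_project(0, 0, 4096, 2048) returns the center of the 2D image, consistent with the center point of the sphere surface.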
In another example, the processing circuit 120 can project the omnidirectional image onto faces of a platonic solid, such as a tetrahedron, a cube, an octahedron, an icosahedron, and the like. The projected faces can be respectively rearranged, such as rotated and relocated, to form a 2D image. The 2D images are then encoded.
It is noted that, in an embodiment, the processing circuit 120 can encode images taken from the different cameras, and does not perform the stitch operation and/or the projection operation on the images.
It is also noted that the processing circuit 120 can encapsulate the media data using any suitable format. In an embodiment, the media data is encapsulated in a single track. For example, the ERP projection projects a sphere surface to a rectangular plane, and the single track can include a timed sequence of the entire rectangular images of the rectangular plane.
In another embodiment, the media data is encapsulated in multiple tracks. In an example, the ERP projection projects a sphere surface to a rectangular plane, and the rectangular plane is divided into multiple partitions (also known as “sub-pictures”). A timed sequence of images of a partition forms a track. Thus, video content of the sphere surface is structured into multiple tracks corresponding to the multiple partitions.
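As a sketch only, a uniform grid split of an ERP frame might look as follows; the helper name and the choice of a uniform grid are assumptions for this example, as the disclosure does not mandate any particular partitioning.

    def split_into_subpictures(frame_width, frame_height, cols, rows):
        # Divide an ERP frame into a cols-by-rows grid of sub-picture
        # rectangles, each returned as an (x, y, width, height) tuple.
        # The timed sequence of images in one rectangle forms one track.
        tile_w = frame_width // cols
        tile_h = frame_height // rows
        return [(c * tile_w, r * tile_h, tile_w, tile_h)
                for r in range(rows) for c in range(cols)]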
In another example, the platonic solid projection projects a sphere surface onto faces of a platonic solid. In the example, the sphere surface is partitioned according to the faces of the platonic solid. A timed sequence of images on a face forms a track. Thus, video content of the sphere surface is structured into multiple tracks corresponding to the faces of the platonic solid.
In another example, multiple cameras are configured to take images in different directions of a scene. In the example, the scene is partitioned according to the fields of view of the cameras. A timed sequence of images from a camera forms a track. Thus, video content of the scene is structured into multiple tracks corresponding to the multiple cameras.
According to an aspect of the disclosure, the processing circuit 120 is configured to generate a correspondence between tracks and spatial partitions, and include the correspondence with the media data. In an example, the processing circuit 120 includes a file/segment encapsulation module 130 configured to encapsulate the correspondence of tracks to spatial partitions in files and/or segments. The correspondence can be used to assist a rendering system, such as the rendering system 160, to fetch appropriate tracks and render images of a region of interest.
In an embodiment, the processing circuit 120 is configured to use an extensible format standard, such as the ISO base media file format and the like, for time-based media, such as video and/or audio. In an example, the ISO base media file format defines a general structure for time-based multimedia files, and is flexible and extensible, which facilitates interchange, management, editing, and presentation of media. The ISO base media file format is independent of any particular network protocol, and can support various network protocols in general. Thus, in an example, presentations based on files in the ISO base media file format can be rendered locally, via a network, or via another stream delivery mechanism.
Generally, a media presentation can be contained in one or more files. One specific file of the one or more files includes metadata for the media presentation, and is formatted according to a file format, such as the ISO base media file format. The specific file can also include media data. When the media presentation is contained in multiple files, the other files can include media data. In an embodiment, the metadata is used to describe the media data by reference. Thus, in an example, the media data is stored in a form that does not favor any particular protocol. The same media data can be used for local presentation, for multiple protocols, and the like. The media data can be stored in order or out of order.
Specifically, the ISO base media file format includes a specific collection of boxes. Boxes are the logical containers, and include descriptors that hold parameters derived from the media content and the media content structures. The media is encapsulated in a hierarchy of boxes, where a box is an object-oriented building block defined by a unique type identifier and a length.
In an example, the presentation of media content is referred to as a movie and is logically divided into tracks, such as parallel tracks. Each track represents a timed sequence of logical samples of media content. Media content is stored and accessed in access units, such as frames, and the like. An access unit is defined as the smallest individually accessible portion of data within an elementary stream, and unique timing information can be attributed to each access unit. In an embodiment, access units can be stored physically in any sequence and/or any grouping, intact or subdivided into packets. The ISO base media file format uses the boxes to map the access units to a stream of logical samples using references to byte positions where the access units are stored. In an example, the logical sample information allows access units to be decoded and presented synchronously on a timeline, regardless of storage.
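For illustration, a reader might walk the top-level boxes of such a file as sketched below, assuming the standard box header of a 32-bit big-endian size followed by a 4-character type; the function name is hypothetical.

    import struct

    def iter_boxes(data, offset=0, end=None):
        # Walk ISO base media file format boxes in a byte buffer. Each
        # box begins with a 32-bit size and a 4-character type code.
        end = len(data) if end is None else end
        while offset + 8 <= end:
            size, box_type = struct.unpack('>I4s', data[offset:offset + 8])
            header = 8
            if size == 1:
                # A 64-bit "largesize" follows the type field.
                size, = struct.unpack('>Q', data[offset + 8:offset + 16])
                header = 16
            elif size == 0:
                # The box extends to the end of the buffer.
                size = end - offset
            if size < header:
                break  # malformed box; stop walking
            yield box_type.decode('ascii'), offset + header, size - header
            offset += size

Container boxes, such as a track box, can be parsed by calling the same routine recursively on the payload range that is yielded.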
According to an aspect of the disclosure, the processing circuit 120 is configured to include the correspondence of tracks to spatial partitions in the metadata for the tracks. In an embodiment, the processing circuit 120 is configured to use a track box to include metadata for a track. The processing circuit 120 can include a description of the spatial partition in the metadata for the track. For example, the processing circuit 120 can include the description of the spatial partition in a sub-box of the track box. The description of the spatial partition can be suitably provided based on the partition characteristics.
In an embodiment, video content of a sphere surface is projected to a rectangular plane according to the ERP projection, and the rectangular plane is divided into multiple partitions (sub-pictures). In the embodiment, the description of the spatial partitions (sub-pictures) is provided in a spherical coordinate system. In an example, a spatial partition is defined by a center point and a field of view. The center point is provided as a center in the yaw dimension (center_yaw) and a center in the pitch dimension (center_pitch), and the field of view is provided as a field of view in the yaw dimension (fov_yaw) and a field of view in the pitch dimension (fov_pitch). In another example, a spatial partition is defined by boundaries, such as a minimum yaw value (yaw_left), a maximum yaw value (yaw_right), a minimum pitch value (pitch_bot), and a maximum pitch value (pitch_top).
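As a sketch, and assuming all values are expressed in degrees, the two forms can be related as shown below; the class name is hypothetical, and wrap-around of yaw at ±180° is ignored for brevity.

    from dataclasses import dataclass

    @dataclass
    class ErpPartition:
        # Center-point form of an ERP sub-picture, in degrees.
        center_yaw: float
        center_pitch: float
        fov_yaw: float
        fov_pitch: float

        def bounds(self):
            # Derive the boundary form (yaw_left, yaw_right, pitch_bot,
            # pitch_top) by offsetting half the field of view from the
            # center point in each dimension.
            return (self.center_yaw - self.fov_yaw / 2.0,
                    self.center_yaw + self.fov_yaw / 2.0,
                    self.center_pitch - self.fov_pitch / 2.0,
                    self.center_pitch + self.fov_pitch / 2.0)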
In another embodiment, the platonic solid projection projects a sphere surface onto faces of a platonic solid; thus, the sphere surface is partitioned according to the faces of the platonic solid. In the embodiment, the description of the spatial partitions is provided using face indexes. In an example, a spatial partition can be identified based on the number of faces (num_faces) of the platonic solid and a face index (face_id) for the face corresponding to the spatial partition.
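A corresponding sketch for the platonic solid case, with a hypothetical class name, might simply carry the two fields named above.

    from dataclasses import dataclass

    @dataclass
    class PlatonicPartition:
        # num_faces identifies the solid: 4 (tetrahedron), 6 (cube),
        # 8 (octahedron), or 20 (icosahedron) in the examples above.
        num_faces: int
        # face_id identifies the face, and thus the spatial partition,
        # that the track carries.
        face_id: int

        def is_valid(self):
            return 0 <= self.face_id < self.num_faces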
In an embodiment, multiple cameras are configured to take images in different directions of a scene. In the embodiment, the scene is partitioned according to the fields of view of the cameras (each sub-picture equals a camera-captured picture). In an example, a spatial partition can be identified based on characteristics of the corresponding camera, such as the field of view of the camera, and the like.
In an embodiment, the processing circuit 120 is implemented using one or more processors, and the one or more processors are configured to execute software instructions to perform media data processing. In another embodiment, the processing circuit 120 is implemented using integrated circuits.
In an example, the rendering system 160 includes an interface circuit 161, a processing circuit 170, and a display device 165 that are suitably coupled together.
The rendering system 160 can be implemented using any suitable technology. In an example, components of the rendering system 160 are assembled in a device package. In another example, the rendering system 160 is a distributed system; components of the rendering system 160 can be located at different locations, and are suitably coupled together by wired connections and/or wireless connections.
The processing circuit 170 is configured to process the media data and generate images for the display device 165 to present to one or more users. The display device 165 can be any suitable display, such as a television, a smart phone, a wearable display, a head-mounted device, and the like.
According to an aspect of the disclosure, the processing circuit 170 is configured to determine a correspondence of tracks to spatial partitions from metadata of a media presentation. Then, the processing circuit 170 is configured to determine one or more covering tracks with spatial partitions that cover a region of interest based on the correspondence. Then the one or more covering tracks can be fetched, and the processing circuit 170 can generate one or more images of the region of interest based on the one or more covering tracks.
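A minimal sketch of such a selection is shown below, reusing the hypothetical ErpPartition class sketched above together with a simple rectangle-overlap test; yaw wrap-around is again ignored for brevity.

    def select_covering_tracks(tracks, roi):
        # tracks: list of (track_id, ErpPartition) pairs taken from the
        # metadata; roi: an ErpPartition describing the region of interest.
        roi_l, roi_r, roi_b, roi_t = roi.bounds()
        covering = []
        for track_id, partition in tracks:
            p_l, p_r, p_b, p_t = partition.bounds()
            # Keep the track when its partition overlaps the region
            # of interest in both the yaw and pitch dimensions.
            if p_l < roi_r and roi_l < p_r and p_b < roi_t and roi_b < p_t:
                covering.append(track_id)
        return covering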
In an embodiment, the processing circuit 170 is configured to request suitable media data, such as a specific track, from the delivery system 150 via the interface circuit 161. In another embodiment, the processing circuit 170 is configured to fetch a specific track from a locally stored file.
In an example, the processing circuit 170 includes a parser module 180 and an image generation module 190. The parser module 180 is configured to parse the metadata to extract the correspondence of tracks to spatial partitions. The image generation module 190 is configured to generate images of the region of interest. The parser module 180 and the image generation module 190 can be implemented as processors executing software instructions or as integrated circuits.
In an embodiment, description of the spatial partitions is provided in a spherical coordinate system. In an example, the parser module 180 extracts, from metadata of a track, values in the spherical coordinate system for a center point and a field of view that define a spatial partition. In another example, the parser module 180 extracts, from metadata of a track, values in the spherical coordinate system that define boundaries of a spatial partition.
In another embodiment, description of the spatial partitions is provided as face indexes for a platonic solid. In an example, the parser module 180 extracts, from metadata of a track, the number of faces of the platonic solid and a face index for a face that identifies a spatial partition.
In an embodiment, description of the spatial partitions is provided as characteristics of cameras. In an example, the parser module 180 extracts, from the metadata of a track, the characteristics of a camera, and determines the spatial partition based on the characteristics.
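Combining the three cases, the parsing dispatch might be sketched as follows; the dictionary keys and the projection indicator values are assumptions for this example, and the hypothetical partition classes from the sketches above are reused.

    def parse_partition(track_metadata):
        # Dispatch on the projection indicator to recover the spatial
        # partition that a track corresponds to.
        projection = track_metadata['projection']
        if projection == 'erp':
            return ErpPartition(track_metadata['center_yaw'],
                                track_metadata['center_pitch'],
                                track_metadata['fov_yaw'],
                                track_metadata['fov_pitch'])
        if projection == 'platonic':
            return PlatonicPartition(track_metadata['num_faces'],
                                     track_metadata['face_id'])
        # Otherwise fall back to camera characteristics, such as the
        # field of view of the capturing camera.
        return track_metadata['camera_characteristics']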
In an embodiment, the processing circuit 170 is implemented using one or more processors, and the one or more processors are configured to execute software instructions to perform media data processing. In another embodiment, the processing circuit 170 is implemented using integrated circuits.
At S210, media data is acquired.
At S220, the media data is processed.
At S230, the correspondence of tracks to spatial partitions (sub-pictures) is encapsulated with the media data in files/segments.
At S240, the encapsulated files/segments are stored and delivered.
At S310, media data with the correspondence of tracks to spatial partitions is received.
At S320, one or more tracks whose spatial partitions cover a region of interest are selected.
At S330, images that render views of the region of interest are generated.
At S340, the images are displayed.
When implemented in hardware, the hardware may comprise one or more of discrete components, an integrated circuit, an application-specific integrated circuit (ASIC), etc.
While aspects of the present disclosure have been described in conjunction with the specific embodiments thereof that are proposed as examples, alternatives, modifications, and variations to the examples may be made. Accordingly, embodiments as set forth herein are intended to be illustrative and not limiting. There are changes that may be made without departing from the scope of the claims set forth below.
This present disclosure claims the benefit of U.S. Provisional Application No. 62/372,824, “Methods and Apparatus of Indications of VR and 360 video Content in File Formats” filed on Aug. 10, 2016, and U.S. Provisional Application No. 62/382,805, “Methods and Apparatus of Indications of VR in File Formats” filed on Sep. 2, 2016, which are incorporated herein by reference in their entirety.
Number | Date | Country
62/372,824 | Aug. 10, 2016 | US
62/382,805 | Sep. 2, 2016 | US