360° or spherical videos are video recordings captured by an omnidirectional (360°) camera or a group of cameras configured for 360° coverage. Images from the camera(s) are then stitched to form a single video in a projection space, such as an equirectangular or spherical-based space. This video data is then encoded for storage or transmission. However, encoding in equirectangular and spherical-based spaces presents issues related to distortion. Moreover, encoders are typically configured to handle any type of video data. To do this, the encoder reads in all of the video data and stores it in a cache. The encoder then searches through all of the video data to perform motion estimation and prediction coding. Consequently, the motion estimation searching and prediction processing are non-optimal because the image has been distorted when mapping it from a true spherical shape into an equirectangular or other format.
Described herein is a method and apparatus for using cube mapping and mapping metadata with encoders. Video data, such as 360° video data, is sent by a capturing device to an application, such as video editing software, which generates cube mapped video data and mapping metadata from the 360° video data. An encoder then applies the mapping metadata to the cube mapped video data to minimize or eliminate search regions when performing motion estimation, to minimize or eliminate neighbor regions when performing intra-prediction coding, and to assign zero weights to edges having no relational meaning. Consequently, the encoder encodes the cube mapped video data faster and more efficiently.
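For illustration only, the following is a minimal sketch of the kind of information such mapping metadata might carry. The structure and field names here are assumptions for the purposes of this description, not a normative format:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FaceInfo:
    """Describes one face region within the cube mapped frame (hypothetical layout)."""
    face_id: int                     # 0..5 for cube faces, -1 for a blank filler region
    rect: Tuple[int, int, int, int]  # (x, y, width, height) in frame pixels
    rotation: int                    # orientation of the face in degrees (0/90/180/270)
    blank: bool = False              # True if the region carries no image data

@dataclass
class MappingMetadata:
    """Per-stream mapping metadata carried, for example, in header information."""
    frame_width: int
    frame_height: int
    faces: List[FaceInfo] = field(default_factory=list)
    # Pairs of face ids whose shared edge has no relational meaning
    meaningless_edges: List[Tuple[int, int]] = field(default_factory=list)
```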
Although encoder(s) 135 are shown as separate device(s), they may be implemented as an external device or integrated in any device that may be used in capturing, generating or transmitting video data. In an implementation, encoder(s) 135 may include a multiplexer. In an implementation, encoder(s) 135 can process non-cube mapped video data as well as cube mapped video data depending on the presence of the mapping metadata in, for example, the header information. Application 130 may be implemented or co-located with any of video capturing device 120, mobile phone 122, camera 124 or encoder(s) 135, for example. In an implementation, application 130 may be implemented on a standalone server. Although decoder(s) 140 are shown as separate device(s), they may be implemented as an external device or integrated in any device that may be used in replaying or displaying the video data. In an implementation, decoder(s) 140 may include a demultiplexer. In an implementation, decoder(s) 140 can process non-cube mapped video data as well as cube mapped video data depending on the presence of the mapping metadata in, for example, the header information. Application 142 may be implemented or co-located with any of destination device 144, VR headset and audio headphones 146 or decoder(s) 140, for example.
An encoder 135 uses the mapping metadata to encode the cube mapped video data (230). Encoder 135 can minimize the amount of video data that has to be read and stored, reduce the search region area, simplify transition smoothing between face edges and reduce the number of bits needed to encode specific faces. The impact of the mapping metadata is described further below.
As noted, the cube mapping is unfolded in accordance with a predetermined or selected pixel arrangement.
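As one concrete example of such an arrangement (an assumed layout, not the only possibility), the six cube faces can be unfolded into a 4x3 "cross" that leaves six blank regions in the rectangular frame:

```python
# One possible 4x3 cross unfolding of a cube map. "." marks blank regions
# that contain no image data. Other arrangements (e.g., a packed 3x2 grid)
# are equally valid; the mapping metadata records which one is in use.
CROSS_4X3 = [
    [".",    "top",    ".",     "."   ],
    ["left", "front",  "right", "back"],
    [".",    "bottom", ".",     "."   ],
]
```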
Encoder 500 includes an input port 505 that is in communication with or connected to (collectively “connected to”) at least a general coder control 510, a transform, scaling and quantization 515 via a summer 512, an intra-picture estimation 520, a filter control analysis 525, and a motion estimation 530. General coder control 510 is further connected to a header, metadata and entropy 570, transform, scaling and quantization 515, and motion estimation 530. Transform, scaling and quantization 515 is further connected to header, metadata and entropy 570, a scaling and inverse transform 535, and an intra/inter selection 540. Intra-picture estimation 520 is further connected to header, metadata and entropy 570 and intra-prediction 545, which is in turn connected to a pole 541 of intra/inter selection 540.
Motion estimation 530 is further connected to header, metadata and entropy 570 and motion compensation 550, which is in turn connected to pole 542 of intra/inter selection 540. An output pole 543 of intra/inter selection 540 is connected to transform, scaling and quantization 515 via summer 512 and filter control analysis 525 via summer 523. Scaling and inverse transform 535 is further connected to filter control analysis 525 and intra-picture estimation 520, both via summer 523. Filter control analysis 525 is further connected to header, metadata and entropy 570 and in-loop filtering 555, which is in turn connected to decoded picture buffer 560. Decoded picture buffer 560 is further connected to motion estimation 530, motion compensation 550, and an output port 565 for outputting an output video signal.
Operation of encoder 500 is described with respect to illustrative components that use mapping metadata to optimize encoder processing. In particular, these illustrative encoder components are motion estimation 530, intra-picture estimation 520, and in-loop filtering 555. Each of these encoder components implements logic that uses the mapping metadata to minimize or eliminate search regions when performing motion estimation, to minimize or eliminate pixels when performing intra-picture estimation, and to assign zero weights to edges having no relational meaning when smoothing transitions at face edges, i.e., deblocking. Other encoder components can also benefit directly or indirectly from the use of cube mapped video data and mapping metadata.
The cube mapped video data is input at input port 505. As stated above, encoder 500 splits the cube mapped video data into multiple blocks. The blocks are then processed by motion estimation 530, intra-picture estimation 520, and in-loop filtering 555 at the appropriate times using the mapping metadata.
In general, motion estimation determines motion vectors that describe the transformation from one 2D image to another image from adjacent frames in the video data sequence. The motion vectors may relate to the whole image or to specific parts, such as rectangular blocks, arbitrarily shaped patches or even individual pixels. In particular, motion estimation involves comparing each of the blocks with a corresponding block and its adjacent neighbors in a nearby frame of the video data, where the latter is denoted as a search area. A motion vector is created that models the movement of a block from one location to another. This movement, calculated for all the blocks comprising a frame, constitutes the motion estimated within a frame on a per block basis. For example, a motion vector is typically denoted by an X component and a Y component giving the amount of motion in pixels in the x direction and in the y direction. The X and Y components can be fractional, such as ½ or ¼ pixel, depending on the codec in use. For a conventional encoder, a search area may have a height of 7 blocks, e.g., 3 blocks above and 3 blocks below a center block, and a width that may be about twice the height. The search area parameters are illustrative and can depend on the encoder/decoder. A full search of all potential blocks, however, is a computationally expensive task. This search area is then moved in a predetermined manner from a target pixel (i.e., right, left, up and down) within a search region.
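For reference, the following is a minimal sketch of conventional full-search block matching using a sum of absolute differences (SAD) cost. The frame representation (numpy arrays) and the search range are illustrative assumptions:

```python
import numpy as np

def full_search(cur_block, ref_frame, cx, cy, search_range=16):
    """Exhaustive block matching: returns the integer motion vector (dx, dy)
    that minimizes SAD between cur_block and a candidate block in ref_frame.
    (cx, cy) is the top-left corner of the block in the current frame."""
    bh, bw = cur_block.shape
    best_cost, best_mv = float("inf"), (0, 0)
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x, y = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame
            if x < 0 or y < 0 or x + bw > ref_frame.shape[1] or y + bh > ref_frame.shape[0]:
                continue
            cand = ref_frame[y:y + bh, x:x + bw]
            cost = np.abs(cur_block.astype(int) - cand.astype(int)).sum()
            if cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost
```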
Encoder 500 and motion estimation 530 minimize the amount of cube mapped video data that has to be read and stored in cache and reduce the search region boundaries by using the mapping metadata. For example, in an implementation using a pixel arrangement that contains blank faces, the mapping metadata identifies the regions of the frame that carry no valid image data, i.e., invalid regions.
In an implementation, encoder 500 and motion estimation 530 would not read or preload a cache or buffer for these invalid regions. In another implementation, motion estimation 530 could remove these invalid regions from the search region.
In an implementation, removal or clamping of the search region can be done by generating a mask based on the mapping metadata, overlaying it on the search region and then searching only in the remaining search region. In an implementation, a map can contain each pixel location along with an invalid bit or flag based on the mapping metadata. The map can then be used to avoid loading data or to skip regions as designated.
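A minimal sketch of this masking approach, assuming the hypothetical MappingMetadata structure sketched earlier: build a per-pixel validity mask once from the metadata, then consult it before loading or searching a region:

```python
import numpy as np

def build_validity_mask(meta):
    """Per-pixel mask: True where the frame carries real image data,
    False for blank/invalid regions identified by the mapping metadata."""
    mask = np.zeros((meta.frame_height, meta.frame_width), dtype=bool)
    for face in meta.faces:
        if not face.blank:
            x, y, w, h = face.rect
            mask[y:y + h, x:x + w] = True
    return mask

def region_is_searchable(mask, x, y, w, h):
    """Skip a candidate region entirely if it contains no valid pixels."""
    return mask[y:y + h, x:x + w].any()
```

In the full-search sketch above, such a check could be consulted before loading each candidate region, so that invalid regions are neither fetched into the cache nor scored.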
In intra-picture estimation 520, pixels in neighboring blocks are checked and potentially used to predict the pixels in the target block. Consequently, the efficiency of intra-picture estimation 520 can be increased by using the mapping metadata to eliminate searching in neighboring blocks that are invalid regions. Similar to motion estimation 530, intra-picture estimation 520 can proceed with the search if the faces are relationally meaningful.
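The following sketch illustrates the idea for a simplified set of intra-prediction neighbors (above, left and above-left); the neighbor set and block geometry are assumptions made for illustration:

```python
def usable_intra_neighbors(mask, bx, by, bs):
    """Return which neighbors of the block at (bx, by) hold valid pixels
    and may therefore be used for intra prediction. bs is the block size
    in pixels; mask is the per-pixel validity mask built from the metadata."""
    neighbors = {
        "above":      (bx,      by - bs),
        "left":       (bx - bs, by),
        "above_left": (bx - bs, by - bs),
    }
    usable = {}
    for name, (nx, ny) in neighbors.items():
        inside = nx >= 0 and ny >= 0
        # A neighbor is usable only if it lies inside the frame and
        # every one of its pixels is valid per the mapping metadata.
        usable[name] = inside and mask[ny:ny + bs, nx:nx + bs].all()
    return usable
```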
In-loop filtering 555 improves visual quality and prediction performance by smoothing the sharp edges that can form between blocks due to the block coding process. This is typically done by assigning a weight or strength to each horizontal and vertical edge between adjacent blocks. Based on the weight, a filter is applied across the edge to smooth the transition from one block to another block. A weight of zero means that nothing is done on that edge. The efficiency of in-loop filtering 555 can be increased by using the mapping metadata to assign a zero weight to each edge that is not relationally meaningful, that is, an edge between two faces that has no relational meaning. This can be implemented using the map described above, for example.
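A minimal sketch of this weighting scheme, again assuming the hypothetical metadata structure with its list of relationally meaningless edges:

```python
def edge_filter_strength(meta, face_a, face_b, default_strength=2):
    """Assign a deblocking strength to the edge between two faces.
    Zero means the edge is left untouched because the neighboring
    faces have no relational meaning per the mapping metadata."""
    pair = (face_a, face_b)
    if pair in meta.meaningless_edges or pair[::-1] in meta.meaningless_edges:
        return 0
    return default_strength
```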
In addition to the above encoder components, encoder 500 can minimize the number of bytes needed to encode the cube mapped video data. In an implementation using a pixel arrangement that includes blank faces, the mapping metadata identifies those faces so that encoder 500 can encode them with a minimal number of bits.
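One way this could look, as a sketch under the same assumptions (encode_block here stands in for whatever per-region coding routine the encoder uses, and the bitstream representation is purely illustrative):

```python
def encode_faces(meta, frame, encode_block):
    """Iterate over the faces of a cube mapped frame, spending bits only
    on faces that carry image data; blank faces are signaled, not coded."""
    bitstream = []
    for face in meta.faces:
        if face.blank:
            bitstream.append(("blank_face", face.face_id))  # one flag, no residual
        else:
            x, y, w, h = face.rect
            bitstream.append(encode_block(frame[y:y + h, x:x + w]))
    return bitstream
```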
In various alternatives, the processor 802 includes a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core can be a CPU or a GPU. In various alternatives, the memory 804 is located on the same die as the processor 802, or is located separately from the processor 802. The memory 804 includes a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 806 includes a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 808 include, without limitation, a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 810 include, without limitation, a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 812 communicates with the processor 802 and the input devices 808, and permits the processor 802 to receive input from the input devices 808. The output driver 814 communicates with the processor 802 and the output devices 810, and permits the processor 802 to send output to the output devices 810. It is noted that the input driver 812 and the output driver 814 are optional components, and that the device 800 will operate in the same manner if the input driver 812 and the output driver 814 are not present.
In general, a method for processing video data includes generating cube mapped video data, determining at least one pixel arrangement for the cube mapped video data, creating mapping metadata associated with the at least one pixel arrangement, and encoding the cube mapped video data using the mapping metadata, where the mapping metadata provides pixel arrangement and orientation information. In an implementation, the mapping metadata is sent in a header associated with the cube mapped video data. In an implementation, the method includes converting non-cube mapped video data into the cube mapped video data. In an implementation, the mapping metadata identifies faces having blank data or faces that have no relational meaning with neighboring faces for use with motion estimation. In an implementation, the method further includes generating a mask based on the mapping metadata, overlaying the mask on the search region area to identify the faces having blank data or the faces that have no relational meaning with neighboring faces, and searching in remaining search region areas. In an implementation, the mapping metadata identifies faces having blank data or faces that have no relational meaning with neighboring faces for use with intra-picture estimation. In an implementation, the mapping metadata identifies face edges that have no relational meaning as between neighboring faces for transition smoothing between face edges. In an implementation, the method further includes assigning a zero weight to a face edge when the face edge has no relational meaning as between neighboring faces. In an implementation, the mapping metadata identifies blank faces for purposes of storing the cube mapped video data.
In general, an apparatus for processing video data includes a video generator that generates cube mapped video data, determines at least one pixel arrangement for the cube mapped video data, and creates mapping metadata associated with the at least one pixel arrangement, and an encoder connected to the video generator, where the encoder encodes the cube mapped video data using the mapping metadata to minimize encoder processing by providing pixel arrangement and orientation information. In an implementation, the mapping metadata is sent in a header associated with the cube mapped video data. In an implementation, the video generator converts non-cube mapped video data into the cube mapped video data. In an implementation, the mapping metadata identifies faces having blank data or faces that have no relational meaning with neighboring faces for use in motion estimation. In an implementation, the encoder generates a mask based on the mapping metadata, overlays the mask on the search region area to identify the faces having blank data or the faces that have no relational meaning with neighboring faces, and searches in remaining search region areas. In an implementation, the mapping metadata identifies faces having blank data or faces that have no relational meaning with neighboring faces for use in intra-picture estimation. In an implementation, the mapping metadata identifies face edges that have no relational meaning as between neighboring faces for use in transition smoothing between face edges. In an implementation, the encoder assigns a zero weight to a face edge when the face edge has no relational meaning as between neighboring faces. In an implementation, the mapping metadata identifies blank faces for the purpose of storing the cube mapped data.
In general, a method for processing video data includes receiving cube mapped video data, receiving mapping metadata associated with at least one pixel arrangement for the cube mapped video data, and encoding the cube mapped video data using the mapping metadata, where the mapping metadata provides pixel arrangement and orientation information. In an implementation, the mapping metadata is received in a header associated with the cube mapped video data.
It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element can be used alone without the other features and elements or in various combinations with or without other features and elements.
The methods provided can be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors can be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable medium). The results of such processing can be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements features of the disclosure.
The methods or flow charts provided herein can be implemented in a computer program, software, or firmware incorporated in a non-transitory computer-readable storage medium for execution by a general purpose computer or a processor. Examples of non-transitory computer-readable storage media include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks and digital versatile disks (DVDs).