The disclosed embodiments are generally directed to encoding, and in particular, to 3D stereo video encoding.
The transmission and reception of stereo video data over various media is ever increasing. Typically, video encoders are used to compress the stereo video data and reduce the amount of data transmitted over the media. Efficient encoding of stereo video data is a key feature for encoders and is important for real-time applications.
The Moving Picture Experts Group (MPEG) introduced the MPEG-2 multiview profile as an amendment to the MPEG-2 standard to enable multiview video coding. The amendment defines a base layer, which is associated with the left view, and an enhancement layer, which is associated with the right view. The base layer is encoded in a manner compatible with common MPEG-2 decoders. In particular, the base layer uses a conventional motion estimation process, which requires a full exhaustive search in a reference frame for each macroblock (MB) of a current frame to find the best motion vector, i.e., the one with the lowest rate distortion cost. For the enhancement layer, the conventional motion estimation process is performed with respect to both the base layer and the enhancement layer, which is very time consuming. Alternatively, the motion vector from the base layer frame may be used directly as the motion vector for the co-located MB in the enhancement layer frame, saving cycles in the motion estimation process. However, taking the motion vector directly from one view and using it for the other view is not optimal and introduces visual quality degradation.
An efficient motion estimation method and apparatus for 3D stereo video encoding is described herein. In an embodiment of the method, an enhancement layer motion vector for a frame is determined by obtaining a motion vector of a co-located macroblock (MB) from the same frame of a base layer. The motion vectors of a predetermined number of surrounding MBs from the same frame of the base layer are also obtained. A predicted motion vector for the MB of the frame in the enhancement layer is determined using, for example, a median value from the motion vectors associated with the co-located MB and the predetermined number of surrounding MBs. A small or less than full range motion refinement is performed to obtain a final motion vector, where full range refers to the maximum search range supported by an encoder performing the method.
A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
The Moving Picture Experts Group (MPEG) introduced the MPEG-2 multiview profile as an amendment to the MPEG-2 standard to enable multiview video coding. As shown in
Described herein is a method in accordance with some embodiments that provides efficient motion estimation for stereo video data, 3D stereo video data and the like. In particular, if the motion vector of the co-located MBn from the base layer frame P1 is used, then an exhaustive motion search is not needed for MBn in the enhancement layer. This approach can be applied to all MBs in the enhancement layer frames. The efficiency rests on the fact that the left and right view images are captured at the same time, for the same scene, from different angles, and are therefore correlated. This is particularly true for stereo video data: since the left and right views are separated by approximately the distance between the eyes, the motion information should be highly correlated between the two views.
Based on these observations, a fast and efficient motion estimation method utilizes motion vector information from the base layer to predict the motion vector for the co-located area in the enhancement layer without sacrificing quality.
Referring to
In some embodiments, the median method is used instead of an average method to determine the predicted motion vector because averaging over all of the motion vectors can misrepresent the motion trend; for example, opposing motion vectors may average to a value near zero that reflects neither motion. The number of neighbors from the base layer used in determining the predicted motion vector MVn for the enhancement layer may depend upon performance, speed and other like factors, and may change from frame to frame, over time, or based on other inputs (e.g., user input, performance inputs, power consumption inputs, etc.).
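The median-based prediction described above can be sketched as follows. This is a minimal illustration, not the claimed implementation; the function name, the tuple representation of motion vectors, and the component-wise median are assumptions made for clarity.

```python
def predict_enhancement_mv(colocated_mv, neighbor_mvs):
    """Predict the enhancement-layer MV for a macroblock from base-layer MVs.

    colocated_mv: (x, y) motion vector of the co-located base-layer MB.
    neighbor_mvs: (x, y) motion vectors of the surrounding base-layer MBs.
    Returns the component-wise median of all candidate motion vectors,
    which resists the averaging problem where opposing vectors cancel out.
    """
    candidates = [colocated_mv] + list(neighbor_mvs)
    xs = sorted(mv[0] for mv in candidates)
    ys = sorted(mv[1] for mv in candidates)
    mid = len(candidates) // 2  # middle element of the sorted components
    return (xs[mid], ys[mid])
```

With five candidates, for instance, the third-smallest x and y components are selected, so a single outlier vector cannot pull the prediction toward zero the way an average would.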
Centered on the generated predicted motion vector MVn, motion refinement can then be performed in a small search range, as compared to the base layer, to get a final motion vector with higher precision for MBn in the enhancement layer P-frame P2. With respect to the term “small search range”, assume that the original full search range that a video encoder can support is M×M. The motion refinement in accordance with some embodiments can then be done in a range of size N×N with 1 ≤ M/N ≤ M, that is, with N no larger than M.
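The refinement step can be sketched as a search over a small N×N window centered on the predicted motion vector. This is an illustrative sketch under simplifying assumptions: frames are plain 2D arrays of luma samples, the matching cost is a sum of absolute differences (SAD) rather than a full rate distortion cost, and all names are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between the size x size block of `cur`
    at (cx, cy) and the size x size block of `ref` at (rx, ry)."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total

def refine_mv(cur, ref, bx, by, pred_mv, n, size=4):
    """Refine pred_mv for the block at (bx, by) in `cur` by searching an
    n x n window centered on the predicted position in `ref`.  Here n is
    much smaller than the encoder's full search range M."""
    h, w = len(ref), len(ref[0])
    best_mv, best_cost = pred_mv, float("inf")
    half = n // 2
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            rx = bx + pred_mv[0] + dx
            ry = by + pred_mv[1] + dy
            # skip candidates that would read outside the reference frame
            if 0 <= rx <= w - size and 0 <= ry <= h - size:
                cost = sad(cur, ref, bx, by, rx, ry, size)
                if cost < best_cost:
                    best_cost = cost
                    best_mv = (pred_mv[0] + dx, pred_mv[1] + dy)
    return best_mv
```

Because the window is only N×N instead of M×M, the number of candidate positions evaluated per macroblock drops from M² to N², which is the source of the speedup relative to an exhaustive search.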
The motion refinement may be more applicable and useful in certain types of applications. For example, if an input stream has very little motion between frames, then in most cases there will not be much of a difference in compressed stream size between using an integer pixel based motion vector and a half pixel based motion vector. The small motion refinement is more useful where there is large motion between frames in the input video stream, such as in video gaming applications. In these cases, the small motion refinement helps enhance the precision of the motion estimation and reduces the rate distortion cost, which improves the compression ratio.
The methods described herein can be applied to all MBs in the enhancement layer frames, so that the exhaustive search can be avoided for those frames. The efficient motion estimation algorithm for MPEG-2 based 3D stereo video makes use of the motion vector information of the left view to predict the motion vector of the co-located area in the right view, simplifying the time consuming motion estimation process. Such a speedup would benefit, for example, systems with limited processing power, or could help in handling multiple encoding jobs. The described method can increase throughput in some systems, as the exhaustive motion search process is known to occupy a large percentage of the entire encoding time.
The apparatus and methods described herein are applicable to MPEG-2 based 3D stereo video coding in a variety of frame compatible formats including the top and bottom format, the side by side format, the horizontal or vertical line interleaved format, and the checkerboard format. In each case, the two views are downsized horizontally and/or vertically and packed into a single frame. For each MB of the right view, the motion vector can be predicted from the motion vectors of the co-located MBs in the left view using the method described herein. In this manner, the exhaustive search can be avoided for half the area of each frame, which in turn speeds up the overall encoding process.
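For a frame-packed layout, applying the method requires locating the left-view macroblock that co-locates with a given right-view macroblock. The sketch below assumes a side-by-side packing in which the left view occupies the left half of each frame; the helper name and macroblock-grid indexing are hypothetical, and other packings (top and bottom, interleaved, checkerboard) would use an analogous mapping.

```python
def colocated_left_mb(mb_x, mb_y, mbs_per_row):
    """Map a right-view macroblock (mb_x, mb_y) to its co-located
    left-view macroblock, assuming side-by-side packing where the
    right view starts at column mbs_per_row // 2.

    The co-located MB is in the same row, shifted left by half the
    frame width measured in macroblocks."""
    half = mbs_per_row // 2
    if mb_x < half:
        raise ValueError("macroblock is not in the right-view half")
    return (mb_x - half, mb_y)
```

For a top and bottom packing, the same idea applies with the row index shifted by half the frame height instead of the column index.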
The encoded video data is processed by decoder(s) 540, which in turn send the decoded video data to destination devices, which may include, but are not limited to, destination device 542, online gaming device 544, and display monitor 546. Although the encoder(s) 530 and decoder(s) 540 are shown as separate devices, they may be implemented as external devices or integrated in any device that may be used in storing, capturing, generating, transmitting or receiving video data.
The processor 602 may include a central processing unit (CPU), a graphics processing unit (GPU), a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU. The memory 604 may be located on the same die as the processor 602, or may be located separately from the processor 602. The memory 604 may include a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache. In some embodiments, the high throughput video encoders are implemented in the processor 602.
The storage 606 may include a fixed or removable storage, for example, a hard disk drive, a solid state drive, an optical disk, or a flash drive. The input devices 608 may include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 610 may include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 612 communicates with the processor 602 and the input devices 608, and permits the processor 602 to receive input from the input devices 608. The output driver 614 communicates with the processor 602 and the output devices 610, and permits the processor 602 to send output to the output devices 610. It is noted that the input driver 612 and the output driver 614 are optional components, and that the device 600 will operate in the same manner if the input driver 612 and the output driver 614 are not present.
The video encoders described herein may use a variety of encoding schemes including, but not limited to, Moving Picture Experts Group (MPEG) MPEG-1, MPEG-2, MPEG-4, MPEG-4 Part 10, the Windows® *.avi format, the QuickTime® *.mov format, H.264 encoding schemes, High Efficiency Video Coding (HEVC) encoding schemes and streaming video formats.
In general, in accordance with some embodiments, a method for encoding a frame in an enhancement layer includes obtaining a motion vector of a co-located macroblock from a same frame, as the frame, in a base layer. The motion vectors from a predetermined number of neighbor macroblocks from the same frame of the base layer are also obtained. A predicted motion vector is determined based on the motion vector and the motion vectors for a macroblock of the frame in the enhancement layer. In some embodiments, a less than full range motion refinement is performed on the predicted motion vector to obtain a final motion vector. The less than full range motion refinement is centered on the predicted motion vector. In some embodiments, a median value of the motion vector and the motion vectors is used to determine the predicted motion vector. In some embodiments, a weighting function is applied to the motion vectors based on predetermined criteria. In some embodiments, the predetermined number of neighbor macroblocks is based on a desired level of accuracy or on a desired level of resolution.
In accordance with some embodiments, a method for encoding includes obtaining a motion vector of a co-located macroblock from a left view frame and motion vectors from a predetermined number of neighbor macroblocks from the left view frame. A predicted motion vector is then determined based on the motion vector and the motion vectors for a macroblock of a right view frame associated with the left view frame.
In accordance with some embodiments, a device includes a base layer encoder and an enhancement layer encoder connected to the base layer encoder, which encodes a frame in an enhancement layer. The enhancement layer encoder obtains a motion vector of a co-located macroblock from a same frame, as the frame, in a base layer and motion vectors from a predetermined number of neighbor macroblocks from the same frame of the base layer. The enhancement layer encoder determines a predicted motion vector based on the motion vector and the motion vectors for a macroblock of the frame in the enhancement layer. In some embodiments, the enhancement layer encoder performs a less than maximum range motion refinement on the predicted motion vector to obtain a final motion vector, where the device supports up to a maximum range search. In some embodiments, the frame and same frame are stereo video data frames.
In accordance with some embodiments, a method for encoding a frame in an enhancement layer includes determining a predicted motion vector for a macroblock in the frame in the enhancement layer based on a motion vector of a co-located macroblock from the same frame in the corresponding base layer and motion vectors for neighboring macroblocks in the base layer. In some embodiments, a less than full range motion refinement is performed on the predicted motion vector to obtain a final motion vector. The less than full range motion refinement is centered on the predicted motion vector. In some embodiments, a median value of the motion vector and the motion vectors is used to determine the predicted motion vector. In some embodiments, a weighting function is applied to the motion vectors based on predetermined criteria. In some embodiments, the number of neighbor macroblocks used is based on a desired level of accuracy or on a desired level of resolution.
In accordance with some embodiments, a method for encoding includes determining a predicted motion vector for a macroblock in a right view frame based on a motion vector of a co-located macroblock from the same frame in a corresponding left view frame and motion vectors for neighboring macroblocks in the left view frame.
In accordance with some embodiments, a device includes a base layer encoder and an enhancement layer encoder connected to the base layer encoder. The enhancement layer encoder encodes a frame in an enhancement layer. In particular, the enhancement layer encoder determines a predicted motion vector for a macroblock in the frame in the enhancement layer based on a motion vector of a co-located macroblock from the same frame in the base layer and motion vectors for neighboring macroblocks in the base layer. In some embodiments, the enhancement layer encoder performs a less than maximum range motion refinement on the predicted motion vector to obtain a final motion vector, where the device supports up to a maximum range search. In some embodiments, the frame and same frame are stereo video data frames.
It should be understood that many variations are possible based on the disclosure herein. Although features and elements are described above in particular combinations, each feature or element may be used alone without the other features and elements or in various combinations with or without other features and elements.
The methods provided, to the extent applicable, may be implemented in a general purpose computer, a processor, or a processor core. Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) circuits, any other type of integrated circuit (IC), and/or a state machine. Such processors may be manufactured by configuring a manufacturing process using the results of processed hardware description language (HDL) instructions and other intermediary data including netlists (such instructions capable of being stored on a computer readable media). The results of such processing may be maskworks that are then used in a semiconductor manufacturing process to manufacture a processor which implements aspects of the embodiments.
The methods or flow charts provided herein, to the extent applicable, may be implemented in a computer program, software, or firmware incorporated in a computer-readable storage medium for execution by a general purpose computer or a processor. Examples of computer-readable storage mediums include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks, and digital versatile disks (DVDs).