1. Field of the Invention
The present invention generally relates to video encoding. More specifically, the present invention uses multiple reference frames to generate forward, backward or bi-directional P frames to facilitate construction of hierarchical frame structures to better accommodate low complexity decoder profiles.
2. Background Art
Many video encoders can generate encoded video using hierarchical B frames, and the use of hierarchical B frames to encode video is well known. Hierarchical B frames enable encoders to improve coding efficiency, and they can also provide temporal scalability and better drift control (e.g., reduced error propagation).
Many video decoding devices are low power and/or low complexity devices. These resource-limited decoders are generally constrained in processing speed and/or power and are unable to support B frames, whether or not hierarchically arranged. For example, devices that conform to the “baseline profile” specified by the H.264 standard cannot decode B frames. Consequently, many playback devices cannot exploit the benefits of hierarchical B frames.
Accordingly, there is a need for encoding techniques that provide low complexity decoders with the benefits of hierarchically arranged frame structures without requiring B frames.
Embodiments of the present invention provide systems, methods and apparatuses for generating forward, backward or bi-directional predictive frames (i.e., P frames). According to an aspect of the present invention, prior to encoding a sequence of video frames, P frames within the video sequence can be reordered to include causal and/or non-causal references to one or more reference frames. This allows any block partition of a bi-directional P frame to include a single reference to a reference frame that is displayed either before or after the bi-directional P frame. As a result, compression and visual quality can be improved.
Hierarchical frame structures can be constructed using bi-directional P frames during the encoding process. Such hierarchical frame structures can better accommodate low complexity decoding profiles (e.g., devices conforming to the baseline profile specified in the International Telecommunication Union (ITU) H.264 standard). Multilayered encoded video bitstreams can be generated based on the hierarchical frame structures. Specifically, in a multilayered encoded video bitstream, a first layer can include anchor frames while one or more second layers can include bi-directional P frames that reference the anchor frames and/or one or more frames in a lower level layer.
The encoding techniques of the present disclosure provide temporal scalability and flexibly accommodate a wide range of decoders. The encoding techniques of the present disclosure also improve the coding efficiency and visual quality of video sequences decoded by low complexity decoders. Further, the techniques of the present disclosure can improve error resiliency during decoding since frame dependencies can be broken up by layers. For example, if a network connection introduces a large number of errors into a high level layer of the encoded hierarchical structure, then a decoder can simply ignore the corrupted layers during decoding. In this way, errors experienced by a corrupted layer of the multilayered encoded bitstream need not affect the decoding performance and visual quality of the remaining encoded layers of the hierarchical structure. Drift control can also be improved, in a manner similar to that provided by hierarchical B frames, since frame dependencies can be confined to a Group of Pictures (GOP).
Frame 102 is an I frame and represents the beginning of the GOP 100. Frame 110 is a P frame referencing frame 102 and represents the end of the GOP 100. Frames 102 and 110 can be considered anchor reference frames. These anchor reference frames can form the first layer of a hierarchical frame structure. That is, frames 102 and 110 can form the first layer (e.g., a base layer) of a multilayered encoded video bitstream. Frames 102 and 110 are frames that should be decoded before any frames forming portions of a second or higher layer (e.g., an enhancement layer) of a multilayered encoded video bitstream are decoded and exploited.
Frames 104, 106 and 108 are each P frames. Frame 106 can form a second layer of the hierarchical frame structure. Specifically, frame 106 need not be decoded and displayed by a decoder but can be decoded to improve temporal scalability and/or visual quality if so desired. Frame 106 can reference both frame 102 and frame 110 (as indicated by the arrows illustrated in phantom). As such, frame 106 is a bi-directional P frame. Frame 102 can be considered a causal reference for frame 106 as frame 102 occurs prior to frame 106 temporally. Frame 110 can be considered a non-causal reference for frame 106 as frame 110 occurs subsequent to frame 106 temporally. Frames 102 and 110 can be reordered prior to encoding so that they are encoded before frame 106.
By exploiting both causal and non-causal references, an aspect of the present invention can enable the construction of hierarchical frame structures using bi-directional P frames. The exploitation of non-causal references allows frame 106 to use prediction information for pixel regions that would otherwise be occluded when limited to only causal references.
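As an illustration only, the GOP described above could be represented with the following hypothetical Python data structures. The field names are not drawn from any standard, and the references assumed for the third-layer frames 104 and 108 are examples rather than requirements of the invention:

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    display_index: int   # position in display order (102, 104, ... follow the example above)
    frame_type: str      # "I" or "P"
    layer: int           # 0 = base/anchor layer, 1+ = enhancement layers
    references: list = field(default_factory=list)  # display indices of reference frames

gop = [
    # Base layer: anchor frames that open and close the GOP.
    Frame(102, "I", layer=0),
    Frame(110, "P", layer=0, references=[102]),
    # Second layer: a bi-directional P frame with a causal reference (102)
    # and a non-causal reference (110).
    Frame(106, "P", layer=1, references=[102, 110]),
    # Third layer (assumed references, for illustration only).
    Frame(104, "P", layer=2, references=[102, 106]),
    Frame(108, "P", layer=2, references=[106, 110]),
]
```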
The references to frames 102 and 110 used by frame 106 can be generated on a block partition basis. That is, a P frame can be broken into several similarly sized partitions (e.g., an 8×8 pixel region, 16×8 pixel region, etc.). Each block partition of a bi-directional P frame of the present invention can include a reference to either a forward-looking or backward-looking reference frame.
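For instance, an encoder could select the reference for each block partition by comparing prediction costs against the causal and non-causal reference frames. The sketch below is only illustrative: it assumes a sum-of-absolute-differences (SAD) cost computed on co-located (zero-motion) blocks, whereas a practical encoder would also perform motion estimation:

```python
import numpy as np

def choose_partition_references(frame, causal_ref, noncausal_ref, block=(16, 16)):
    """For each block partition of a bi-directional P frame, pick the single
    reference (causal/past or non-causal/future) with the lower SAD cost."""
    h, w = frame.shape
    bh, bw = block
    choices = {}
    for y in range(0, h, bh):
        for x in range(0, w, bw):
            cur = frame[y:y + bh, x:x + bw].astype(np.int32)
            cost_past = np.abs(cur - causal_ref[y:y + bh, x:x + bw]).sum()
            cost_future = np.abs(cur - noncausal_ref[y:y + bh, x:x + bw]).sum()
            # Unlike a B frame, each partition keeps exactly one reference.
            choices[(y, x)] = "causal" if cost_past <= cost_future else "non-causal"
    return choices
```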
Each layer of the hierarchical frame structure 200 can be included as a different layered portion of an encoded video bitstream provided to a downstream video decoder. That is, frames 102 and 110 can form a base layer, frame 106 can form a separate first enhancement layer and frames 104 and 108 can form a still separate second enhancement layer.
Based on the capabilities of the decoder (e.g., processing power/speed and other decoding resources), the decoder can choose how many enhancement layers to decode beyond the baseline layer (i.e., layer 202). By encoding frames 102 through 110 hierarchically, an encoder of the present invention can introduce temporal scalability into the resulting encoded bitstream. Further, coding efficiency can be improved by relying on hierarchical dependencies as less video content information may be encoded at higher layers.
An encoder of the present invention can generate the hierarchical structure and frame dependencies described above.
Furthermore, an encoder operating according to the present invention can determine which type of reference (either a forward or backward reference) will be associated with a particular block partition of a bi-directional P frame. The use of forward/non-causal references can improve visual quality and coding efficiency by enabling prediction of occluded pixel partitions that previously could not be predicted when limited to backward-looking references. Errors across GOPs can also be limited by restricting the constructed hierarchical structures, and the frame reference dependencies therein, to within a single GOP.
Each layer of a resulting encoded hierarchical frame structure contained within a GOP can be labeled and associated with target decoder device types during the encoding process. That is, during encoding, an encoder of the present invention can specify which layers are associated with particular device profiles. This labeling information can be carried in the bitstream 300 as labels. For example, the labels may be Supplemental Enhancement Information (SEI) messages in accordance with the Advanced Video Coding (AVC)/H.264 standard. In one or more exemplary embodiments, the SEI messages may also contain out-of-band information.
Informational labels (e.g., SEI messages) may be placed at the start and/or end of GOPs. As an example, information label 308, which is at the end of GOP A, can specify which layer or layers of the first GOP are directed to a specific device type. Consequently, a decoder that receives the bitstream 300 can, from a review of the information label 308, determine which of the layers 302 through 306 should be used for decoding a GOP and which layers can or should be ignored. As an example, a first layer (e.g., layer 302) can be specified for use by all devices/baseline devices; a second layer (e.g., layer 304) can be specified for use by more advanced decoders and/or decoders with less disruptive network restrictions; and a third layer (e.g., layer 306) can be specified for the most advanced devices having no network restrictions. Device-based layer labels can vary for each GOP in the bitstream 300.

Information label 318, which is at the beginning of GOP A, may contain the same information as information label 308. Because the information label 318 may contain out-of-band information at the beginning of the GOP, a video distribution server (e.g., a sync server) may use the information label 318 to filter the bitstream so that layers a recipient decoder will not be able to play are not transmitted to that decoder unnecessarily. Information labels 316 and 322 may contain information similar to information labels 308 and 318, respectively. In one exemplary embodiment, each GOP may include an information label only at the beginning. In another exemplary embodiment, each GOP may include an information label only at the end. In yet another exemplary embodiment, each GOP may include information labels at both the beginning and the end. In one embodiment, information labels 308, 316, 318 and 322 may be implemented as SEI messages. In another embodiment, those information labels may be implemented in other formats that contain the label information and/or out-of-band information. Further, an informational label may contain other information about the bitstream.
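As an illustration only (no particular SEI payload syntax is implied), an information label for a GOP could carry a mapping from layers to the device classes that should decode them. The field names and layer indices below are hypothetical:

```python
# Hypothetical contents of an information label (e.g., carried in an SEI
# message) placed at the beginning and/or end of a GOP.
gop_a_label = {
    "gop_id": "A",
    "layers": {
        0: {"frames": "I/P anchor frames", "devices": "all devices (baseline profile)"},
        1: {"frames": "bi-directional P frames", "devices": "more advanced decoders"},
        2: {"frames": "additional P frames", "devices": "most advanced devices, no network restrictions"},
    },
}
```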
At step 402, a video sequence is received from a video source. The video sequence can contain a number of video frames.
At step 404, an order for encoding the video frames is determined. The order for encoding can be determined based on one or more target decoder profiles. The order for encoding can also be determined by the ability to encode bi-directional P frames. That is, frames determined to be P frames can be rearranged to include both causal and non-causal references to one or more reference frames.
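For instance, the reordering can be performed layer by layer so that anchor frames and any non-causal reference frames are encoded before the bi-directional P frames that depend on them. The sketch below is illustrative only and reuses the frame numbering and layer assignment of the earlier example:

```python
def encoding_order(frames):
    """Reorder frames so that every reference frame (causal or non-causal)
    is encoded before the frames that reference it.  Here 'frames' is a list
    of (display_index, layer) pairs; lower layers hold the anchor frames."""
    return sorted(frames, key=lambda f: (f[1], f[0]))

# Display order for the example GOP: 102, 104, 106, 108, 110.
display = [(102, 0), (104, 2), (106, 1), (108, 2), (110, 0)]
print([idx for idx, _ in encoding_order(display)])
# -> [102, 110, 106, 104, 108]: frame 110 is pulled ahead of frame 106 so the
#    non-causal reference is available when frame 106 is encoded.
```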
At step 406, the rearranged video frames are encoded to form a hierarchical frame structure comprising multiple layers of encoded video. The hierarchical frame structure can be confined to a GOP. Each layer of the resulting hierarchical frame structure can be labeled and associated with one or more target decoder device types during the encoding process. For example, information labels (e.g., SEI messages in accordance with H.264) can be generated for each GOP to specify which particular layers of the resulting hierarchical encoded structure are to be decoded by corresponding decoders. For instance, a first layer can be labeled as available for all devices including baseline devices. A second layer can be labeled as directed to more advanced decoders and/or decoders with less disruptive network restrictions. Similarly, a third layer can be generated and directed to the most advanced decoder devices having no network restrictions.
At step 407, a server may prepare one or more bitstreams for one or more targeted devices. A video distribution server may be used to transmit encoded video to decoder devices. In one embodiment, less than the whole encoded video is transmitted. For example, a video distribution center (e.g., a sync center) may discard data contained in layers above a certain layer when it knows that the recipient (e.g., a playback device) that will play the content cannot decode layers above that layer.
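A minimal sketch of such server-side filtering, assuming the encoded GOP is available as a list of (layer, payload) pairs and that the recipient's highest decodable layer is known (for example, from the device-based layer labels); the function name and data layout are hypothetical:

```python
def filter_bitstream_for_device(gop_packets, max_decodable_layer):
    """Drop encoded data belonging to layers the recipient cannot decode.
    'gop_packets' is a list of (layer, payload) pairs; 'max_decodable_layer'
    is the highest layer the target device supports."""
    kept = [(layer, payload) for layer, payload in gop_packets
            if layer <= max_decodable_layer]
    # Layers above the device's capability are never transmitted, reducing
    # the size of the delivered bitstream.
    return kept
```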
At step 408, the encoded video is transmitted across a network as a multi-layered bitstream.
At step 410, the encoded video is received by a target decoder device. In one or more exemplary embodiments, selected parts, rather than the whole encoded video, may be received. For example, when a video distribution center (e.g., a sync center) decides to discard data contained in layers above a certain layer, the recipient will not receive layers of encoded data it would not be able to play anyway. Thus, a smaller file size may be achieved for transmission, and the recipient need not receive the whole encoded video.
At step 412, the target decoder device decodes the encoded video based on the capabilities of the decoder. Specifically, the decoder can review the information labels (e.g., SEI messages) used to label the layers of the encoded video and can determine which layers to use for decoding. The target decoder can determine the one or more layers to decode for an entire encoded sequence or can dynamically adjust which layers to decode based on varying network conditions and varying capabilities of the decoder.
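For example, a decoder could map the information label for a GOP and its own capabilities (together with current network conditions) to a set of layers to decode, re-evaluating that choice per GOP. The sketch below assumes the hypothetical label structure shown earlier and reduces the decoder's capabilities to two illustrative flags:

```python
def layers_to_decode(label, supports_enhancement, network_ok):
    """Choose which layers of a GOP to decode from its information label.
    The two flags stand in for the decoder's capabilities and the current
    network conditions."""
    selected = [0]              # the base layer is always decoded
    if supports_enhancement:
        selected.append(1)      # bi-directional P frame layer
        if network_ok:
            selected.append(2)  # highest enhancement layer
    return [layer for layer in selected if layer in label["layers"]]
```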
According to a further aspect of the present invention, scalable bitstreams can be generated based on a hierarchical coding structure provided by features of the present invention. That is, one or more side channels in an encoded video bitstream can be used to carry B frames. The side channels can be used by a decoder device that can decode B frames. For example, a baseline layer of an encoded bitstream can include I and P frames and no B frames. A first set of enhancement layers can include bi-directional P frames while a second set of enhancement layers can include B frames, whether or not bi-directional. The second set of enhancement layers can serve as an alternative set of enhancement layers that can be exploited by a decoder capable of decoding B frames. The alternative layer can contain fewer bits than the layer containing only P frames while reproducing a video frame of substantially similar visual quality, or can contain a similar number of bits while reproducing a video frame of better visual quality.
In this way, encoded bitstreams can be developed in which lower layers of encoded video are shared by all downstream decoders while higher layers of encoded video are tailored to different decoders. To improve coding efficiency, decoders having the ability to decode B frames can replace the higher, P-frame-only layers with the alternative layers that include B frames. The side channel information carrying the alternative layers having B frames can be included in the bitstream.
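As an illustration, a decoder that supports B frames could substitute the alternative side-channel layer for the P-frame-only enhancement layer; the layer names and data layout below are hypothetical:

```python
def pick_enhancement_layer(available_layers, can_decode_b_frames):
    """Prefer the alternative B-frame enhancement layer when the decoder
    supports it; otherwise fall back to the bi-directional P frame layer.
    'available_layers' maps layer names to encoded payloads."""
    if can_decode_b_frames and "enh_b_frames" in available_layers:
        return available_layers["enh_b_frames"]   # fewer bits or better quality
    return available_layers["enh_bidir_p"]        # baseline-compatible layer
```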
According to a further aspect of the present invention, a repository for encoded video can generate demuxable bitstreams. A repository of encoded video can be, for example, a server/service (e.g., iTunes) that syncs multiple remote decoder devices to encoded video.
When a new sequence of video is to be made available for download, the repository can obtain or prepare multiple bitstreams for download. That is, the repository can obtain or generate encoded video for download by a wide range of decoder devices. The encoded video available for download can include labels specifying which layers are intended for specific decoder devices or profiles. Accordingly, based on the capabilities of the particular decoder attempting to download an encoded video bitstream from the repository, the repository can use the labels to determine exactly what portions of the bitstream the decoder needs for decoding. These decisions, regarding which bitstream and which layers of a particular bitstream to provide to the downstream device, can be made dynamically during download as network conditions vary. This technique generates an efficient bitstream for download by a target device and limits the amount of unnecessary data transmitted to the decoder. In essence, a bitstream is tailored for download by the server repository prior to transmission according to device-based layer labels.
An encoder of the present invention can include an encoding unit and a control unit. The encoding unit can perform the functions of encoding video data based on control information or coding directions received from the control unit. Specifically, the control unit can determine the arrangement of video frames for encoding, frame types, and a hierarchical frame structure for encoding based on exploitation of bi-directional P frames. The control unit can also generate or specify the information labels (e.g., SEI messages) to be included in the resulting encoded bitstream.
A decoder of the present invention can include a decoding unit and a control unit. The control unit can receive and decode the information labels (e.g., SEI messages) in a received bitstream. The control unit can subsequently direct the decoding unit to decode the encoded video in a particular manner based on the information labels (e.g., SEI messages) and the capabilities of the decoder.
An encoder and decoder of the present invention can be implemented in hardware, software or some combination thereof. For example, an encoder and/or decoder of the present invention can be implemented using a computer system.
While various embodiments of the present invention have been described above, it should be understood that they have been presented by way of example and not limitation. It will be apparent to one skilled in the pertinent art that various changes in form and detail can be made therein without departing from the spirit and scope of the invention. Therefore, the present invention should only be defined in accordance with the following claims and their equivalents.
Number | Date | Country
---|---|---
61079631 | Jul 2008 | US