This document addresses techniques for video coding with temporal scalability.
Video coding standards (such as H.264/AVC and H.265/HEVC) provide mechanisms for temporal scalability, also known as temporal layering. Temporal scalability segments a compressed video bitstream into layers that allow decoding and playback of the bitstream at a variety of frame rates. In such layering systems, the portion of an encoded bitstream comprising a lower layer can be decoded at a lower output frame rate without the portion of the bitstream comprising upper layers, while decoding an upper layer (for a higher output frame rate) requires decoding all lower layers. The lowest temporal layer is the base layer with the lowest frame rate, while higher temporal layers are enhancement layers with higher frame rates.
Temporal scalability is useful in a variety of settings. For example, where there is insufficient bandwidth to transmit an entire encoded bitstream, only the lower layers may be transmitted to produce a useful, lower frame rate output at a decoder without transmitting the upper layers. Temporal scalability also provides a mechanism for reducing decoder complexity by decoding only lower temporal layers, for example when a decoder does not have sufficient resources to decode all layers or when a display is incapable of presenting the highest frame rate of the highest layer. Temporal scalability also enables trick-mode playback, such as fast-forward playback.
Video coding techniques with motion prediction impose constraints on the references used when predicting inter-frame motion. For example, I-frames (or intra-coded frames) do not predict motion from any other frame, P-frames are predicted from a single reference frame, and B-frames are predicted from two reference frames. Video coding techniques for temporal scalability may impose further constraints. For example, in an HEVC encoded video sequence, temporal sublayer access (TSA) and stepwise TSA (STSA) pictures can be identified. In HEVC, a decoder may switch the number of layers being decoded mid-stream. A TSA picture indicates when a decoder can safely increase the number of layers being decoded to include any higher layers. An STSA picture identifies when a decoder can safely increase the number of layers decoded to an immediately higher layer. Identification of TSA and STSA pictures imposes constraints on which frames may be used as motion prediction references.
Inventors perceive a need for improved techniques for video compression with temporal scalability that better balance video encoding goals, such as coding efficiency, complexity, and latency in real-time encoding, while also meeting prediction-structure constraints, such as those imposed by the H.264 and H.265 video coding standards.
Techniques for video coding with temporal scalability are presented. Embodiments of the techniques include structures of inter-frame motion prediction references that meet prediction constraints of temporal scalability, such as the constraints of the temporal scalability modes of the H.264 and H.265 video coding standards, while also balancing video coding goals such as coding efficiency, complexity, and latency in real-time encoding. In embodiments, the structure of inter-frame motion prediction references may include a virtual temporal layering structure with more virtual temporal layers than there are identified temporal layers actually encoded into a temporally scalable bitstream. For example, a video may be encoded with a dyadic prediction structure of N virtual layers, where the resultant encoded bitstream identifies only N−1 actual temporal layers. Two or more virtual temporal layers may be combined into a single signaled temporal layer in the encoded bitstream, for example by combining the lowest virtual temporal layers (the layers with the lowest time resolution or lowest frame rates). Such virtual temporal layers may be useful to improve coding efficiency and balance practical encoding constraints, such as real-time video encoding where the frame rate input to an encoder is variable, or where some frames expected at the input to an encoder are missing.
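For illustration, the following minimal sketch (in Python, with a hypothetical function name not taken from any standard) shows how N = 4 virtual temporal layers may be mapped onto N − 1 = 3 signaled temporal layers by merging the two lowest virtual layers into a single signaled base layer:

```python
# Minimal sketch (hypothetical naming): map 1-based virtual temporal layers
# to 1-based signaled temporal layers by merging the two lowest virtual
# layers into a single signaled base layer, as described above.

def signaled_layer(virtual_layer: int) -> int:
    """Virtual layers 1 and 2 combine into the signaled base layer
    (TemporalID = 1); each higher virtual layer maps to its own
    signaled enhancement layer."""
    return 1 if virtual_layer <= 2 else virtual_layer - 1

# Example: N = 4 virtual layers are signaled as N - 1 = 3 temporal layers.
assert [signaled_layer(v) for v in (1, 2, 3, 4)] == [1, 1, 2, 3]
```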
A video coding system 100 may be used in a variety of applications. In a first application, the terminals 110, 150 may support real time bidirectional exchange of coded video to establish a video conferencing session between them. In another application, a terminal 110 may code pre-produced video (for example, television or movie programming) and store the coded video for delivery to one or, often, many downloading clients (e.g., terminal 150). Thus, the video being coded may be live or pre-produced, and the terminal 110 may act as a media server, delivering the coded video according to a one-to-one or a one-to-many distribution model. For the purposes of the present discussion, the type of video and the video distribution schemes are immaterial unless otherwise noted.
The network represents any number of networks that convey coded video data between the terminals 110, 150, including, for example, wireline and/or wireless communication networks. The communication network may exchange data in circuit-switched or packet-switched channels. Representative networks include telecommunications networks, local area networks, wide area networks, and/or the Internet. For the purposes of the present discussion, the architecture and topology of the network are immaterial to the operation of the present disclosure unless otherwise noted.
The coding system 140 may perform coding operations on the video to reduce its bandwidth. Typically, the coding system 140 exploits temporal and/or spatial redundancies within the source video. For example, the coding system 140 may perform motion compensated predictive coding in which video frame or field pictures are parsed into sub-units (called “pixel blocks,” for convenience), and individual pixel blocks are coded differentially with respect to predicted pixel blocks, which are derived from previously-coded video data. A given pixel block may be coded according to any one of a variety of predictive coding modes, such as intra-coding (prediction from data within the same picture), single-prediction inter-coding (prediction from one previously-coded reference picture), and bi-predictive inter-coding (prediction from two previously-coded reference pictures).
Pixel blocks also may be coded according to other coding modes. Any of these coding modes may induce visual artifacts in decoded images, and artifacts at block boundaries may be particularly noticeable to the human visual system.
The coding system 140 may include a coder 142, a decoder 143, an in-loop filter 144, a picture buffer 145, and a predictor 146. The coder 142 may apply the differential coding techniques to the input pixel block using predicted pixel block data supplied by the predictor 146. The decoder 143 may invert the differential coding techniques applied by the coder 142 to a subset of coded frames designated as reference frames. The in-loop filter 144 may apply filtering techniques, including deblocking filtering, to the reconstructed reference frames generated by the decoder 143. The picture buffer 145 may store the reconstructed reference frames for use in prediction operations. The predictor 146 may predict data for input pixel blocks from within the reference frames stored in the picture buffer.
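For illustration, the following self-contained sketch shows the differential coding and reconstruction relationship between the coder 142 and the decoder 143 described above; the sample values are hypothetical, and the transform and quantization steps of a real coder are omitted:

```python
# Minimal, self-contained sketch of differential (residual) coding of a
# pixel block against a predicted block, mirroring coder 142 / decoder 143.
# Hypothetical data: 4-sample "pixel blocks" represented as plain lists.

input_block     = [100, 102, 98, 101]
predicted_block = [ 99, 101, 99, 100]   # supplied by predictor 146

# Coder 142: code the block differentially against the prediction.
residual = [s - p for s, p in zip(input_block, predicted_block)]

# Decoder 143: invert the differential coding to reconstruct the block.
reconstructed = [r + p for r, p in zip(residual, predicted_block)]

assert reconstructed == input_block  # lossless in this simplified sketch
# In a real coder the residual is transformed and quantized before coding,
# so reconstruction is approximate rather than exact.
```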
The transmitter 150 may transmit coded video data to a decoding terminal via a channel CH.
The receiver 160 may receive a data stream from the network and may route components of the data stream to appropriate units within the terminal 200.
The video decoder 170 may perform decoding operations that invert coding operations performed by the coding system 140. The video decoder may include a decoder 172, an in-loop filter 173, a picture buffer 174, and a predictor 175. The decoder 172 may invert the differential coding techniques applied by the coder 142 to the coded frames. The in-loop filter 173 may apply filtering techniques, including deblocking filtering, to reconstructed frame data generated by the decoder 172. For example, the in-loop filter 173 may perform various filtering operations (e.g., de-blocking, de-ringing filtering, sample adaptive offset processing, and the like). The filtered frame data may be output from the decoding system. The picture buffer 174 may store reconstructed reference frames for use in prediction operations. The predictor 175 may predict data for input pixel blocks from within the reference frames stored by the picture buffer according to prediction reference data provided in the coded video data.
The post-processor 180 may perform operations to condition the reconstructed video data for display. For example, the post-processor 180 may perform various filtering operations (e.g., de-blocking, de-ringing filtering, and the like), which may obscure visual artifacts in output video that are generated by the coding/decoding process. The post-processor 180 also may alter resolution, frame rate, color space, etc. of the reconstructed video to conform it to requirements of the video sink 190.
The video sink 190 represents various hardware and/or software components in a decoding terminal that may consume the reconstructed video. The video sink 190 typically may include one or more display devices on which reconstructed video may be rendered. Alternatively, the video sink 190 may be represented by a memory system that stores the reconstructed video for later use. The video sink 190 also may include one or more application programs that process the reconstructed video data according to controls provided in the application program. In some embodiments, the video sink may represent a transmission system that transmits the reconstructed video to a display on another device, separate from the decoding terminal. For example, reconstructed video generated by a notebook computer may be transmitted to a large flat panel display for viewing.
The foregoing discussion describes the functional units of the encoding terminal and the decoding terminal.
The H.264 and H.265 video coding standards introduced flexible coding structures (such as hierarchical dyadic structures).
This section details a subset of the signaling mechanisms defined in the HEVC standard for signaling temporal layers.
HEVC temporal layer signaling includes TemporalID, vps_max_sub_layers_minus1, and sps_max_sub_layers_minus1. TemporalID is signaled in the network abstraction layer (NAL) unit header to specify the temporal identifier of that temporal layer, and a sub-bitstream extraction process may use TemporalID to extract the sub-bitstream corresponding to a target frame rate. vps_max_sub_layers_minus1 and sps_max_sub_layers_minus1 specify the maximum number of temporal sub-layers that may be present in each coded video sequence (CVS) referring to the video parameter set (VPS) and the sequence parameter set (SPS), respectively.
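For illustration, the following sketch shows a sub-bitstream extraction process that retains only NAL units at or below a target temporal layer; the NAL-unit representation is hypothetical (a real extractor parses TemporalID from the NAL unit header), and TemporalID is numbered from 1 per the convention used later in this document:

```python
# Sketch of sub-bitstream extraction by TemporalID (hypothetical NAL-unit
# representation; an actual extractor parses NAL unit headers).

from dataclasses import dataclass

@dataclass
class NalUnit:
    temporal_id: int   # TemporalID from the NAL unit header (1-based here)
    payload: bytes

def extract_sub_bitstream(nal_units, target_temporal_id):
    """Keep only NAL units at or below the target temporal layer."""
    return [nal for nal in nal_units if nal.temporal_id <= target_temporal_id]

# Example: keep the base layer plus the first enhancement layer.
stream = [NalUnit(1, b"..."), NalUnit(2, b"..."), NalUnit(3, b"...")]
reduced_rate = extract_sub_bitstream(stream, target_temporal_id=2)
assert all(nal.temporal_id <= 2 for nal in reduced_rate)
```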
A reference picture set specifies the prediction referencing of pictures. A reference picture set is the set of reference pictures associated with a current picture to be encoded or decoded; it may consist of all reference pictures that precede the current picture in coding order (the order in which frames are encoded or decoded, which differs from presentation order) and that may be used for inter-prediction of the current picture or of any picture following the current picture in decoding order.
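For illustration, the following sketch contrasts presentation order with coding order for a small hierarchical group of pictures; the orderings are illustrative values, and the reference picture set of a given picture is drawn from the pictures already coded:

```python
# Sketch: coding order vs. presentation order for a small hierarchical GOP,
# and the prior-in-coding-order pictures from which a reference picture set
# may be drawn (illustrative values, not taken from the source).

presentation_order = [0, 1, 2, 3, 4]
coding_order       = [0, 4, 2, 1, 3]   # anchors first, then B-frames

# Pictures available as references for picture 1 (the 4th picture coded)
# are those already coded; its reference picture set is a subset of these.
already_coded = coding_order[:coding_order.index(1)]
assert already_coded == [0, 4, 2]
```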
Temporal layering may impose further constraints on prediction referencing. HEVC includes such constraints and signaling schemes to achieve smooth playback, efficient trick play, and fast-forward/rewind functionality with temporal layering. In HEVC temporal layering, pictures in a lower temporal layer cannot predict from pictures in a higher temporal layer. The temporal layer is signaled in the bitstream as the TemporalID. Other restrictions include the signaling of STSA and TSA pictures, which disallow prediction referencing within a sub-layer at various points in the bitstream to indicate where a decoder may switch up to a higher frame rate.
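For illustration, the inter-layer constraint may be expressed as a simple validity check (a sketch, not part of any standard's specification):

```python
# Sketch: validity check for the temporal layering constraint that a
# picture may not reference a picture in a higher temporal layer.

def reference_allowed(current_tid: int, reference_tid: int) -> bool:
    return reference_tid <= current_tid

assert reference_allowed(current_tid=2, reference_tid=1)      # allowed
assert not reference_allowed(current_tid=1, reference_tid=2)  # disallowed
```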
A hierarchical dyadic structure is a constraint on a layered prediction scheme whereby every B-frame may only be predicted by immediately neighboring frames (in presentation order) from the current temporal layer or a lower temporal layer. In a hierarchical dyadic structure, the GOP size n is an integer power of 2; if m is the number of B-pictures between consecutive non-B frames, the GOP contains one leading I-picture and n/(m+1) − 1 P-frames, and each P-frame is predicted from the immediately previous P-frame or I-frame. A hierarchical dyadic structure halves the frame rate for each temporal layer removed. In embodiments, all I-pictures and P-pictures may be encoded only as members of the bottom two virtual temporal layers, that is, virtual temporal layers 1 and 2 of the illustrated structures.
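For illustration, the following sketch assigns 1-based virtual temporal layers in a hierarchical dyadic structure; the position-based assignment convention is an assumption for illustration, not taken from the source figures:

```python
# Sketch (illustrative convention): assign 1-based virtual temporal layers
# in a hierarchical dyadic structure with N layers and GOP size
# n = 2**(N - 1). Removing the top layer halves the frame rate.

def dyadic_virtual_layer(position_in_gop: int, n_layers: int) -> int:
    gop_size = 2 ** (n_layers - 1)
    i = position_in_gop % gop_size
    if i == 0:
        return 1                      # I-/P-frame anchoring the GOP
    tz = (i & -i).bit_length() - 1    # number of trailing zero bits of i
    return n_layers - tz

# N = 4 layers, GOP of 8: layers for positions 0..7 (presentation order).
assert [dyadic_virtual_layer(i, 4) for i in range(8)] == [1, 4, 3, 4, 2, 4, 3, 4]
```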
Coding efficiency may be reduced when the number of possible reference pictures is reduced. Hence, the visual quality of video encoded with temporal layering may be reduced due to the additional prediction constraints imposed by a temporal layering system.
In a real-time video encoding system, the frame rate of images arriving at an encoder may vary from an expected target frame rate. Varying source frame rates may be caused by factors such as camera fluctuations under various lighting conditions, transcoding of variable frame rate sequences, or encoder capability. Frame rates also may vary outside of real-time encoding; for example, a source video signal may include a splice from a first camera that captures at a first frame rate to a second camera that captures at a second frame rate different from the first.
These fluctuations may result in the encoder receiving frames at irregular intervals in time, potentially causing frames to be missing at expected points in time, given a target frame rate. An encoding system with a fixed or constant number of virtual temporal layers in a varying frame rate environment may provide a prediction structure that balances trade-offs among video quality, complexity (storage), latency, and ease of encoder implementation across a wide variation in instantaneous frame rates.
Various design challenges may occur when designing a prediction structure. For example, a first design challenge is selection of an optimal number of temporal layers. Traditionally, the number of temporal layers is chosen based on the desired frame rates. For example, in the scenario where the target frame rate is the same as the base layer frame rate, a prediction structure such as those illustrated in the figures may be used.
The following embodiments may be applied separately or jointly to address various challenges in designing a prediction structure for video encoding with temporal layering. These embodiments include a generalized motion prediction structure that provides a good trade-off when operating at an arbitrary target frame rate and an arbitrary base layer frame rate.
The number of signaled temporal layers and the TemporalID for a particular picture are signaled in the bitstream based on a target frame rate (the highest frame rate a decoder can decode, by decoding all layers) and a required base layer frame rate (a minimum frame rate a decoder is expected to decode, by decoding only the base layer):
num_temporal_layers = Min(log2(target frame rate / base layer frame rate) + 1, N)
where num_temporal_layers is the number of temporal layers signaled in a bitstream, and N is a chosen number for the total number of virtual temporal layers. In one example implementation, N is set to 4, which would result in the dyadic prediction structures illustrated in the figures.
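For illustration, this computation may be sketched as follows, assuming the target frame rate is an exact power-of-two multiple of the base layer frame rate:

```python
from math import log2

# Sketch of the signaled-layer computation described above, assuming the
# target frame rate is an exact power-of-two multiple of the base rate.

def num_temporal_layers(target_fps: float, base_fps: float, n_virtual: int = 4) -> int:
    return min(int(log2(target_fps / base_fps)) + 1, n_virtual)

assert num_temporal_layers(30, 30) == 1   # target == base: one signaled layer
assert num_temporal_layers(60, 30) == 2   # target == 2 * base
assert num_temporal_layers(240, 30) == 4  # capped at N = 4 virtual layers
```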
The total number of virtual temporal layers, N, may be chosen, for example, by balancing compression quality (compression ratio or image quality at a given bitrate), latency, and complexity (of an encoder or decoder). A higher N will generally lead to higher compression quality, but will also lead to longer latency and more complexity. A lower N will generally produce lower compression quality, but will reduce latency and complexity.
If the target frame rate for a set of pictures is higher than the base layer frame rate, that set of pictures may be signaled in the encoded bitstream as enhancement temporal layer pictures (TemporalID > 1). Note that TemporalID in this convention starts from 1, and base layer pictures have TemporalID = 1. The remaining pictures, which are not signaled as enhancement temporal layer pictures (and are treated as base layer pictures), may be further split into “virtual temporal layers” based on their temporal referencing. These virtual temporal layers are together signaled in an encoded bitstream as a single base layer (TemporalID = 1).
The term “virtual temporal layers” refers to the further, non-signaled temporal layering structure within a single signaled temporal layer, such as a single HEVC temporal layer. In some embodiments, only the base temporal layer (TemporalID = 1) may contain a plurality of virtual temporal layers.
In one embodiment, the total number of virtual layers is chosen independently of the target frame rate and the required base layer frame rate. In this embodiment, the number of virtual temporal layers is fixed to N for different target frame rates and base layer frame rates. In one example, N is set to 4.
In other embodiments, the number of virtual temporal layers within a signaled temporal layer (for example, an HEVC temporal sub-layer) is chosen based on the target frame rate and the base layer frame rate. In one example, when the target frame rate equals the base layer frame rate, the number of virtual temporal layers for the TemporalID = 1 layer is chosen to be 4, and when the target frame rate equals twice the base layer frame rate, the number of virtual temporal layers for TemporalID = 1 is chosen to be 3.
In another example, the number of virtual temporal layers for the TemporalID = 1 signaled layer is:
N − Min(log2(target frame rate / base layer frame rate) + 1, N) + 1 (equivalently, N − num_temporal_layers + 1)
In one example implementation, N is set to 4, which would result in the prediction structures illustrated in the figures.
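For illustration, continuing the earlier sketch, the number of virtual temporal layers within the signaled base layer may be computed as follows:

```python
from math import log2

# Sketch: virtual temporal layers within the signaled base layer
# (TemporalID = 1), i.e., N - num_temporal_layers + 1.

def base_layer_virtual_layers(target_fps: float, base_fps: float, n_virtual: int = 4) -> int:
    signaled = min(int(log2(target_fps / base_fps)) + 1, n_virtual)
    return n_virtual - signaled + 1

assert base_layer_virtual_layers(30, 30) == 4   # target == base rate
assert base_layer_virtual_layers(60, 30) == 3   # target == 2 * base rate
```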
Varying the number of virtual temporal layers trades off complexity against video quality. More virtual temporal layers lead to more complexity and higher video quality at a given encoded bitrate. Here, complexity may include the amount of storage needed for decoded picture buffers, playback latency, and the like. The number of signaled temporal layers trades off frame-rate modulation flexibility against video quality.
The figures illustrate examples of such prediction structures.
Benefits of using virtual temporal layers, as in the illustrated prediction structures, include improved coding efficiency and a better balance of practical encoding constraints, such as real-time encoding with a variable input frame rate.
Other benefits of the illustrated prediction structures include graceful handling of missing frames, as follows.
First, when a picture with virtual temporal layer > 2 is missing, references for other, present B-frames with virtual temporal layer > 2 are modified to predict from pictures in a virtual temporal layer lower than that of the missing frame. For example, when a picture from virtual layer 3 is missing, any pictures that would have used the missing picture as a reference picture instead use the nearest neighboring frame in a virtual temporal layer less than 3 (i.e., from either virtual temporal layer 1 or 2).
Second, when a picture with virtual temporal layer <= 2 is missing, the next available picture immediately after the missing picture is promoted to virtual temporal layer 1 or 2, based on the number of missing pictures.
The TemporalID for each picture is assigned according to the picture timing of the incoming pictures.
A benefit of handling missing frames according to these methods is that they are implementation-friendly. These methods reduce encoder complexity when addressing missing frames. When the number of virtual temporal layers is the same, missing pictures are handled in the same way regardless of the target frame rate and the base layer frame rate.
In some embodiments, an encoder may adapt a prediction pattern when expected reference pictures are missing at the input to the encoder. In these embodiments, optional boxes 906, 908, 910, 912, and 914 may adapt the prediction pattern. Box 906 determines whether an expected reference frame is missing, for example in a real-time encoder. If no reference frame is missing, encoding continues as normal in box 916. When a reference frame is missing, in box 908, if the virtual temporal layer that would have been assigned to the missing frame is less than or equal to 2, control flow moves to box 912; otherwise, control flow moves to box 910. In box 910, where the missing frame's virtual temporal layer was greater than 2, frames that would have been predicted using the missing frame as a prediction reference instead predict from the nearest neighboring (not missing) frame that is in a virtual temporal layer lower than the virtual temporal layer of the missing frame. In box 912, where the missing frame's virtual temporal layer is less than or equal to 2, the next available picture immediately following the missing frame is promoted to virtual layer 1 or 2 (that is, the next available picture is encoded in virtual layer 1 or 2). After promotion, in box 914, any picture that would have been predicted from the missing frame will instead use the promoted picture as a reference frame.
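For illustration, the control flow of boxes 908 through 914 may be sketched as follows; the data model (frames as dictionaries carrying a 1-based virtual layer) and the function name are hypothetical, and the promotion step is simplified to promote directly to the missing frame's layer:

```python
# Sketch of boxes 908-914 (hypothetical, simplified data model: frames are
# dicts with a 1-based 'vlayer'; lower_layer_frames maps virtual temporal
# layer -> the nearest available frame in that layer).

def adapt_for_missing_frame(expected_vlayer, next_frame, lower_layer_frames, dependents):
    """Adapt prediction when the frame expected at expected_vlayer is missing."""
    if expected_vlayer <= 2:
        # Boxes 912/914: promote the next available picture to virtual
        # layer 1 or 2 and use it as the substitute reference.
        next_frame["vlayer"] = expected_vlayer
        substitute = next_frame
    else:
        # Box 910: reference the nearest available frame from a virtual
        # layer below the missing frame's layer.
        candidates = [v for v in lower_layer_frames if v < expected_vlayer]
        substitute = lower_layer_frames[max(candidates)]
    for frame in dependents:
        frame["reference"] = substitute

# Example: a virtual-layer-3 picture is missing; its dependents now
# reference the nearest virtual-layer-2 picture instead.
deps = [{"vlayer": 4, "reference": None}]
adapt_for_missing_frame(3, {"vlayer": 3}, {1: {"id": "P0"}, 2: {"id": "P4"}}, deps)
assert deps[0]["reference"] == {"id": "P4"}
```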
Some embodiments may be implemented, for example, using a non-transitory computer-readable storage medium or article which may store an instruction or a set of instructions that, if executed by a processor, may cause the processor to perform a method in accordance with the disclosed embodiments. The exemplary methods and computer program instructions may be embodied on a non-transitory machine readable storage medium. In addition, a server or database server may include machine readable media configured to store machine executable program instructions. The features of the embodiments of the present invention may be implemented in hardware, software, firmware, or a combination thereof and utilized in systems, subsystems, components or subcomponents thereof. The “machine readable storage media” may include any medium that can store information. Examples of a machine readable storage medium include electronic circuits, semiconductor memory device, ROM, flash memory, erasable ROM (EROM), floppy diskette, CD-ROM, optical disk, hard disk, fiber optic medium, or any electromagnetic or optical storage device.
While the invention has been described in detail above with reference to some embodiments, variations within the scope and spirit of the invention will be apparent to those of ordinary skill in the art. Thus, the invention should be considered as limited only by the scope of the appended claims.