This application claims the benefit of priority of application Ser. No. 62/001,998, filed on May 22, 2014, the disclosure of which is incorporated herein in its entirety.
Many video compression standards, e.g. H.264/AVC and H.265/HEVC, have been widely used in video capture, video storage, real time video communication and video transcoding. Examples of popular applications include Apple's AirPlay Mirroring, FaceTime and iPhone/iPad video capture.
Most video compression standards achieve much of their compression efficiency by searching a reference picture, through motion estimation and compensation, for a prediction of the current picture and coding only the difference between the current picture and that prediction. The highest rates of compression can be achieved when the prediction is highly correlated to the current picture. One of the major challenges that such systems face is how to achieve good compressed video visual quality during illumination changes, such as fading transitions. In those cases, the current picture often is more strongly correlated to the reference picture scaled by a weighting factor and shifted by an offset than to the reference picture itself. To address this problem, the weighted prediction (WP) tool has been adopted in the H.264/AVC and H.265/HEVC video coding standards; it improves coding efficiency by applying a multiplicative weighting factor and an additive offset to the motion compensated prediction to form a weighted prediction. Even though weighted prediction was originally designed to handle fades and cross-fades, better compression efficiency also can be obtained in other cases, because weighted prediction can not only manage local illumination variations but also improve sub-pixel precision for motion compensation when reference picture lists contain duplicate references.
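The benefit can be illustrated with a small numerical sketch (hypothetical values, not taken from either standard): during a fade, a current picture whose pixels are roughly half as bright as the reference is predicted almost exactly once the reference is scaled by a weight of 0.5, whereas the unweighted reference leaves a large residual to code.

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(64, 64)).astype(np.float64)
current = 0.5 * reference                      # a simple fade: half the brightness

plain_residual = current - reference                     # prediction without weighting
weighted_residual = current - (0.5 * reference + 0.0)    # weight w = 0.5, offset o = 0

print(np.abs(plain_residual).mean())           # large residual energy
print(np.abs(weighted_residual).mean())        # essentially zero residual energy
```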
Optimal solutions are obtained when illumination compensation weights, motion estimation and rate distortion optimization are optimized jointly. However, they are generally based on iterative methods requiring large computation times, which are not acceptable for many applications (e.g., real time coding). Moreover, convergence may not be guaranteed.
Many algorithms rely on a relatively long window of pictures to observe enough statistics for an accurate detection. However, such methods require the availability of the statistics of the entire fade duration, which introduces long delays and is impractical in real-time encoding systems, particularly those that select coding parameters in a pipelined fashion (e.g., on a pixel-block-by-pixel-block basis) where such statistics are unavailable.
Most weighted prediction parameter estimation algorithms can be described as a three-step process. In the first step, a picture signal analysis is performed to extract image characteristics. The analysis may be applied to the current (original) picture and to the reference (original or reconstructed) picture. Various statistics may be extracted, such as the mean of the whole-picture pixel values, the standard deviation of the whole-picture pixel values, the mean square of the whole-picture pixel values, the mean of the product of co-located pixel values, the mean of the pixel gradients, the pixel histogram, etc. The second step is estimation of the weighted prediction parameter values. Finally, in the third step, it is decided whether weighted prediction is applied to compress the current picture.
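A minimal sketch of this three-step flow is shown below, assuming whole-picture means and standard deviations as the extracted statistics and a simple ratio-based heuristic for the weight; the threshold and the decision rule are illustrative assumptions, not prescribed by any standard.

```python
import numpy as np

def estimate_wp_parameters(cur, ref, threshold=0.02):
    # Step 1: picture signal analysis -- extract whole-picture statistics
    # from the current picture and the reference picture.
    mean_cur, mean_ref = float(cur.mean()), float(ref.mean())
    std_cur, std_ref = float(cur.std()), float(ref.std())

    # Step 2: parameter estimation -- one common heuristic maps the weight
    # to the ratio of standard deviations and the offset to the remaining
    # DC difference.
    weight = std_cur / std_ref if std_ref > 0 else 1.0
    offset = mean_cur - weight * mean_ref

    # Step 3: decision -- apply weighted prediction only when the estimated
    # parameters differ meaningfully from the identity (w = 1, o = 0).
    use_wp = abs(weight - 1.0) > threshold or abs(offset) > threshold * 255
    return use_wp, weight, offset
```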
In many practical encoder designs, especially for real time applications, the encoder is not able to analyze the current picture to gather the statistics needed for estimating the optimal weighted prediction parameter(s) before the encoding process starts. This constraint prevents weighted prediction from being applied in this kind of encoder.
Embodiments of the present disclosure provide a pipelined video coding system that includes a motion estimation stage and an encoding stage. The motion estimation stage may operate on an input frame of video data in a first stage of operation and may generate estimates of motion and other statistical analyses. The encoding stage may operate on the input frame of video data in a second stage of operation later than the first stage. The encoding stage may perform predictive coding using coding parameters that are selected, at least in part, from the motion estimates and statistical analyses generated by the motion estimation stage. Because the motion estimation is performed at a processing stage that precedes the encoding, a greater amount of processing time may be devoted to these processes than in systems that perform both operations in a single processing stage.
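The staggering can be pictured with a short sketch (the helper functions below are stand-ins for the two stages, not an implementation of either): while the encoding stage codes frame N using results produced for it one stage earlier, the motion estimation stage is already analyzing frame N+1. In a hardware pipeline the two stages would run concurrently on different frames; the loop below interleaves them sequentially only for clarity.

```python
def run_hme(frame):
    # Stand-in for the first stage: motion estimation and statistical analysis.
    return {"stats": None, "motion": None, "weights": (1.0, 0.0)}

def encode_frame(frame, hme_result):
    # Stand-in for the second stage: predictive coding driven by HME output.
    return b""

def encode_sequence(frames):
    coded, pending = [], None
    for frame in frames:
        hme_result = run_hme(frame)                 # stage 1 works on frame N+1 ...
        if pending is not None:
            coded.append(encode_frame(*pending))    # ... while stage 2 codes frame N
        pending = (frame, hme_result)
    if pending is not None:
        coded.append(encode_frame(*pending))        # drain the pipeline
    return coded
```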
For bidirectional transmission of data, however, each terminal 110, 120 may code video data captured at a local location for transmission to the other terminal via the network 130. Each terminal 110, 120 also may receive the coded video data transmitted by the other terminal, may decode the coded data and may display the recovered video data at a local display device. Bidirectional data transmission is common in communication applications such as video calling or video conferencing.
In
A first terminal 210 may include a video source 215, a pre-processor 220, a video coder 225, a transmitter 230, and a controller 235. The video source 215 may provide video to be coded by the terminal 210. The pre-processor 220 may perform various analytical and signal conditioning operations on the video data, often to condition it for coding. The video coder 225 may apply coding operations to the video sequence to reduce the video sequence's bit rate. The transmitter 230 may buffer coded video data, format it for transmission to a second terminal 250 and transmit the data to a channel 245. The controller 235 may manage operations of the first terminal 210.
Embodiments of the present disclosure find application with a variety of video sources 215. In a videoconferencing system, the video source 215 may be a camera that captures local image information as a video sequence. In a gaming or graphics-authoring application, the video source 215 may be a locally-executing application that generates video for transmission. In a media serving system, the video source 215 may be a storage device storing previously prepared video.
Embodiments of the present disclosure also find application with a variety of pre-processors 220. For example, the pre-processor 220 may search for video content in the source video sequence that is likely to generate artifacts when the video sequence is coded, decoded and displayed. The pre-processor 220 also may apply various filtering operations to the frame data to improve efficiency of coding operations applied by a video coder 225.
As noted, the video coder 225 may perform coding operations on the video sequence to reduce the sequence's bit rate. The video coder 225 may code the input video data by exploiting temporal and spatial redundancies in the video data. For example, the video coder 225 may apply coding operations that are mandated by a governing coding protocol, such as the ITU-T H.264/AVC and H.265/HEVC coding standards.
The transmitter 230 may transmit coded data to the channel 245. In this regard, the transmitter 230 may merge coded video data with other data streams, such as audio data and/or application metadata, into a unitary data stream (called “channel data” herein). The transmitter 230 may format the channel data according to requirements of the channel 245 and transmit it to the channel 245.
The first terminal 210 may operate according to a coding policy, which is implemented by the controller 235 and the video coder 225, which select coding parameters to be applied by the video coder 225 in response to various operational constraints. Such constraints may be established by, among other things: a data rate that is available within the channel to carry coded video between terminals, a size and frame rate of the source video, a size and display resolution of a display at the terminal 250 that will decode the video, and error resiliency requirements imposed by a protocol under which the terminals operate. Based upon such constraints, the controller 235 and/or video coder 225 may select a target bit rate for coded video (for example, N bits/sec) and an acceptable coding error for the video sequence. Thereafter, they may apply various coding decisions to individual frames of the video sequence. For example, the controller 235 and/or video coder 225 may select a frame type for each frame, a coding mode to be applied to pixel blocks within each frame, and quantization parameters to be applied to frames and/or pixel blocks.
During coding, the controller 235 and/or video coder 225 may assign to each frame a certain frame type, which can affect the coding techniques that are applied to the respective frame. For example, frames often are assigned as one of the following frame types: an intra frame (I frame), which is coded without reference to any other frame in the sequence; a predictive frame (P frame), which is coded with reference to at most one previously coded frame; and a bidirectionally predictive frame (B frame), which is coded with reference to one or two previously coded frames.
A video coder 225 commonly parses input frames into a plurality of pixel blocks (for example, blocks of 4×4, 8×8 or 16×16 pixels each) and codes the frames on a pixel-block-by-pixel-block basis. Pixel blocks may be coded predictively with reference to other coded pixel blocks as determined by the frame type assigned to the pixel blocks' respective frame. For example, pixel blocks of I frames can be coded non-predictively or they may be coded predictively with reference to pixel blocks of the same frame (spatial prediction). Pixel blocks of P frames may be coded non-predictively, via spatial prediction or via temporal prediction with reference to one previously coded reference frame. Pixel blocks of B frames may be coded non-predictively, via spatial prediction or via temporal prediction with reference to one or two previously coded reference frames.
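A simple partitioning routine of the kind implied here might look as follows (the block size and edge padding are assumptions for illustration):

```python
import numpy as np

def partition_into_pixel_blocks(luma, block=16):
    # Pad the frame so both dimensions are multiples of the block size,
    # then yield each pixel block together with its top-left position.
    h, w = luma.shape
    padded = np.pad(luma, ((0, -h % block), (0, -w % block)), mode="edge")
    for y in range(0, padded.shape[0], block):
        for x in range(0, padded.shape[1], block):
            yield (y, x), padded[y:y + block, x:x + block]
```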
The receiver 255 may receive channel data from the channel 245 and parse it into its constituent elements. For example, the receiver 255 may distinguish coded video data from coded audio data and route each type of coded data to a decoder that handles it. In the case of coded video data, the receiver 255 may route it to the video decoder 260.
The video decoder 260 may perform decoding operations that invert processes applied by the video coder 225 of the first terminal 210. Thus, the video decoder 260 may perform prediction operations according to the coding mode that was identified and perform entropy decoding, inverse quantization and inverse transforms to generate recovered video data representing each coded frame.
The post-processor 265 may perform additional processing operations on recovered video data to improve quality of the video prior to rendering. Filtering operations may include, for example, filtering at pixel block edges, anti-banding filtering and the like.
The video sink 270 may consume the reconstructed video. The video sink 270 may be a display device that displays the reconstructed video to an operator. Alternatively, the video sink may be an application executing on the second terminal 250 that consumes the video (as in a gaming application).
The HME 310 may estimate motion of image content from the content of a frame. Typically, the HME 310 may analyze frame content at two or more levels of resolution to estimate motion. The HME 310, therefore, may output a motion vector representing motion characteristics that are observed in the frame content. The motion vector may be output to the BPC 320 to aid in prediction operations.
The HME 310 also may perform statistical analyses of the frame N and output data representing those statistics. The statistics also may be output to the BPC 320 to assist in mode selection operations, discussed below.
The HME 310 further may determine weighting factors and offset values to be used in weighted prediction. The weighting factors and offset values also may be output to the BPC 320.
The BPC 320 may include a subtractor 321, a transform unit 322, a quantizer 323, an entropy coder 324, an inverse quantizer 325, an inverse transform unit 326, a prediction/mode selection unit 327, a multiplier 328, and an adder 329.
The BPC 320 may operate on an input frame N+1 on a pixel-block-by-pixel-block basis. Typically, a frame N+1 of content may be parsed into a plurality of pixel blocks, each of which may correspond to a respective spatial area of the frame. The BPC 320 may process each pixel block individually.
The subtractor 321 may perform a pixel-by-pixel subtraction between pixel values in the source frame N+1 and any pixel values that are provided to the subtractor 321 by the prediction/mode selection unit 327. The subtractor 321 may output residual values representing results of the subtraction on a pixel-by-pixel basis. In some cases, the prediction/mode selection unit 327 may provide no data to the subtractor 321 in which case the subtractor 321 may output the source pixel values without alteration.
The transform unit 322 may apply a transform to a pixel block of input data, which converts the pixel block to an array of transform coefficients. Exemplary transforms may include discrete cosine transforms and wavelet transforms. The transform unit 322 may output transform coefficients for each pixel block to the quantizer 323.
The quantizer 323 may apply a quantization parameter Qp to the transform coefficients output by the transform unit 322. The quantization parameter Qp may be a single value applied uniformly to each transform value in a pixel block or, alternatively, it may represent an array of values, each value being applied to a respective transform coefficient in the pixel block. The quantizer 323 may output quantized transform coefficients to the entropy coder 324.
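The transform and quantization operations can be sketched together as below, using a floating-point DCT and a single uniform step size; real H.264/AVC and H.265/HEVC encoders use specified integer transforms and quantization scaling tables, so this is only an illustration of the principle.

```python
import numpy as np
from scipy.fftpack import dct

def transform_and_quantize(residual_block, qp_step):
    # Transform unit 322: forward 2-D DCT of the prediction residual.
    coeffs = dct(dct(residual_block, axis=0, norm="ortho"), axis=1, norm="ortho")
    # Quantizer 323: divide every coefficient by the step size derived from
    # Qp and round to the nearest integer level.
    return np.round(coeffs / qp_step).astype(np.int32)
```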
The entropy coder 324, as its name implies, may perform entropy coding of the quantized transform coefficients presented to it. The entropy coder 324 may output a serial data stream, typically run-length coded data, representing the quantized transform coefficients. Typical entropy coding schemes include variable length coding and arithmetic coding. The entropy coded data may be output from the BPC 320 as coded data of the pixel block. Thereafter, it may be merged with other data, such as coded data from other pixel blocks and coded audio data, and be output to a channel (not shown).
The BPC 320 may include a local decoder, formed of the inverse quantizer unit 325, the inverse transform unit 326, and an adder (not shown), that reconstructs select coded frames, called "reference frames." Reference frames are frames that are selected as candidates for prediction of other frames in the video sequence. When a frame is selected to serve as a reference frame, a decoder (not shown) must decode the coded reference frame and store it in a local cache for later use. The encoder also includes decoder components so it may decode the coded reference frame data and store it in its own cache. Thus, absent transmission errors, the encoder's reference picture cache 330 and the decoder's reference picture cache (not shown) should store the same data.
The inverse quantizer unit 325 may perform processing operations that invert coding operations performed by the quantizer 323. Thus, the transform coefficients that were divided down by a respective quantization parameter may be scaled by the same quantization parameter. Quantization often is a lossy process, however, and therefore the scaled coefficient values that are output by the inverse quantizer unit 325 oftentimes will not be identical to the coefficient values that were input to the quantizer 323.
The inverse transform unit 326 may invert transformation processes that were applied by the transform unit 322. Again, the inverse transform unit 326 may apply inverse discrete cosine transforms or inverse wavelet transforms to match the transforms applied by the transform unit 322. The inverse transform unit 326 may generate pixel values, which approximate the prediction residuals that were input to the transform unit 322.
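Continuing the same floating-point sketch, the local decoder's inverse quantization and inverse transform might be written as follows; because the rounding performed by the quantizer is not recoverable, the output only approximates the original residual.

```python
import numpy as np
from scipy.fftpack import idct

def reconstruct_residual(levels, qp_step):
    # Inverse quantizer 325: scale the integer levels back by the step size.
    coeffs = levels.astype(np.float64) * qp_step
    # Inverse transform unit 326: inverse 2-D DCT recovers an approximation
    # of the prediction residual that entered the transform unit.
    return idct(idct(coeffs, axis=1, norm="ortho"), axis=0, norm="ortho")
```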
Although not shown in
The prediction unit 327 may perform mode selection and prediction operations for the input pixel block. In doing so, the prediction unit 327 may select a type of coding to be applied to the pixel block, for example intra-prediction, unidirectional inter-prediction or bidirectional inter-prediction. For either type of inter prediction, the prediction unit 327 may perform a prediction search to identify, from a reference picture stored in the reference picture cache 330, stored data to serve as a prediction reference for the input pixel block. The prediction unit 327 may generate identifiers of the prediction reference by providing motion vectors or other metadata (not shown) for the prediction. The motion vector may be output from the BPC 320 along with other data representing the coded block.
The multiplier 328 and adder 329 may apply a weighting factor and offset to the predicted data generated by the prediction unit 327. Specifically, the multiplier 328 may scale the predicted data according to the weighting factor provided by the HME 310. The adder 329 may add an offset value to the output of the multiplier, again, using a value that is provided by the HME. Data output from the adder 329 may be input to the subtractor 321 as prediction data.
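In floating point, the combined effect of the multiplier 328, the adder 329 and the subtractor 321 can be sketched as below; production encoders carry out the same operations in integer arithmetic with a weight denominator, which is omitted here for clarity.

```python
import numpy as np

def weighted_prediction_residual(source_block, predicted_block, weight, offset):
    # Multiplier 328: scale the motion-compensated prediction by the
    # HME-supplied weighting factor; adder 329: add the HME-supplied offset.
    weighted = np.clip(weight * predicted_block.astype(np.float64) + offset, 0, 255)
    # Subtractor 321: the residual actually coded is the source block minus
    # the weighted prediction.
    return source_block.astype(np.float64) - weighted
```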
The principles of the present disclosure conserve resources expended in a video coder by staggering operation of the HME 310 and the BPC 320. In many coding implementations, especially for real time applications, a video coder cannot review all pixel values for a frame being coded (frame N+1) to develop statistics needed for estimating an optimal set of weighted prediction parameter(s) before the encoding process starts. Embodiments of the present disclosure overcome such limitations by performing such analyses in an HME 310 which operates a frame ahead of coding operations. A given frame will be processed by the HME 400 (
In practice, it may be convenient to provide the HME 310 and BPC 320 as separate circuit systems of a common integrated circuit.
The downsampler 410 may perform a downsampling of the input frames, both the source frame and the reference frame. Typical downsampling operations include a 2×2 or 4×4 downsampling of the input frames. Thus, the downsampler 410 may output a representation of video data that has a lower spatial resolution than the input frames. For convenience, the downsampled frame data is labeled “level 1” and the original resolution frame data is labeled “level 0.”
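A 2×2 downsampling by block averaging, one of the typical operations mentioned above, can be sketched as:

```python
def downsample_2x2(frame):
    # Average each non-overlapping 2x2 block of a 2-D luma array (a numpy
    # array) to produce the lower-resolution "level 1" frame.
    h, w = frame.shape
    h, w = h - h % 2, w - w % 2
    return frame[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
```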
The motion estimators 420, 430 each may perform motion analysis on the source frame using the reference frame data as a reference point. The level 1 motion estimator 430 is expected to perform its analysis more quickly than the level 0 motion estimator 420 because the level 1 motion estimator 430 is operating on a lower resolution of the frame data than the level 0 estimator 420 does.
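The two-level search can be sketched as follows; the block size, search radii and sum-of-absolute-differences cost are illustrative assumptions. The coarse vector found at level 1 is scaled up and then only refined locally at level 0, which is what makes the full-resolution search comparatively cheap.

```python
import numpy as np

def sad(a, b):
    # Sum of absolute differences, a common motion-search cost.
    return int(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def full_search(block, ref, cy, cx, radius):
    # Exhaustive search in a (2*radius+1)^2 window centred on (cy, cx).
    bh, bw = block.shape
    best_cost, best_mv = None, (0, 0)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = cy + dy, cx + dx
            if 0 <= y <= ref.shape[0] - bh and 0 <= x <= ref.shape[1] - bw:
                cost = sad(block, ref[y:y + bh, x:x + bw])
                if best_cost is None or cost < best_cost:
                    best_cost, best_mv = cost, (dy, dx)
    return best_mv

def hierarchical_mv(src, ref, src_l1, ref_l1, y, x, block=16):
    # Level 1: wide search on the 2x2-downsampled frames (cheap per position).
    mv1 = full_search(src_l1[y // 2:y // 2 + block // 2,
                             x // 2:x // 2 + block // 2],
                      ref_l1, y // 2, x // 2, radius=16)
    # Level 0: scale the coarse vector up and refine it in a small window.
    cy, cx = y + 2 * mv1[0], x + 2 * mv1[1]
    dy, dx = full_search(src[y:y + block, x:x + block], ref, cy, cx, radius=2)
    return (2 * mv1[0] + dy, 2 * mv1[1] + dx)
```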
The level 1 statistical analyzer 440 may perform statistical analyses on the level 1 source frame. The level 1 statistical analyzer 440 may collect data on any or all of the following metrics:
The weight estimator 450 may derive weighting factors and offsets for use in weighted prediction. In an embodiment, the weight estimator 450 may derive its weights using results of the level 1 motion estimator 430.
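One possible derivation, assumed here for illustration, is a least-squares fit of the current level 1 picture against the motion-compensated level 1 reference produced by the level 1 motion estimator; the slope of the fit becomes the weighting factor and the intercept becomes the offset.

```python
import numpy as np

def fit_weight_offset(cur_l1, mc_ref_l1):
    # Fit cur ~= w * ref + o over the level 1 pixels (ordinary least squares).
    x = mc_ref_l1.astype(np.float64).ravel()
    y = cur_l1.astype(np.float64).ravel()
    var_x = x.var()
    if var_x < 1e-6:                      # flat reference: fall back to a DC offset
        return 1.0, float(y.mean() - x.mean())
    weight = float(((x - x.mean()) * (y - y.mean())).mean() / var_x)
    offset = float(y.mean() - weight * x.mean())
    return weight, offset
```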
The level 0 statistical analyzer 460 may perform statistical analyses on the level 0 source frame. The level 0 statistical analyzer 460 may collect data on any or all of the metrics listed above for level 1.
Embodiments of the present disclosure also may include a region classifier 470 that works in conjunction with an HME 400. In such an embodiment, an HME 400 may analyze a source frame with regard to several different reference frames. The HME 400 may perform its processes for each of the reference frames and generate sets of weighting parameters, a weighting factor and offset, for each such reference frame. The HME 400 may output all sets of weighting parameters to a BPC (
In an embodiment, the region classifier 470 may detect regions within frames that share similar content and may cause the HME 400 to develop sets of weighted prediction parameters independently for each region according to its image content. The region classifier 470 may assign image content to different regions according to:
In another embodiment, such regions may be identified based not only on similarities observed between spatially adjacent elements of image content but also based on similarities observed between image content and co-located image content in temporally-adjacent frames. Typically, contiguous areas of frames that exhibit similarities in one or more of the foregoing statistics may be assigned to a common region.
Once regions are identified from within a frame, the HME 400 may operate on the regions separately and develop weighted prediction parameters independently for the regions according to their respective statistics.
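A minimal sketch of region-based estimation is shown below, assuming mean luminance as the only classification criterion and the standard-deviation-ratio heuristic from above for the per-region parameters; an actual classifier could combine several of the statistics listed earlier.

```python
import numpy as np

def region_wp_parameters(cur, ref, block=16, n_regions=3):
    # Classify each pixel block into a region by its mean luminance, then
    # estimate a weighting factor and offset independently per region.
    samples = {r: ([], []) for r in range(n_regions)}
    h, w = cur.shape
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            c = cur[y:y + block, x:x + block]
            r = min(int(c.mean() * n_regions / 256), n_regions - 1)
            samples[r][0].append(c.astype(np.float64).ravel())
            samples[r][1].append(ref[y:y + block, x:x + block].astype(np.float64).ravel())
    params = {}
    for r, (cur_px, ref_px) in samples.items():
        if not cur_px:
            continue                                  # no blocks fell in this region
        cv, rv = np.concatenate(cur_px), np.concatenate(ref_px)
        weight = cv.std() / rv.std() if rv.std() > 0 else 1.0
        params[r] = (float(weight), float(cv.mean() - weight * rv.mean()))
    return params
```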
The foregoing discussion has described operation of the embodiments of the present disclosure in the context of terminals that embody encoders and/or decoders. Commonly, these components are provided as electronic devices. They can be embodied in integrated circuits, such as application specific integrated circuits, field programmable gate arrays and/or digital signal processors. Alternatively, they can be embodied in computer programs that execute on personal computers, notebook computers, tablet computers, smartphones or computer servers. Such computer programs typically are stored in physical storage media such as electronic-, magnetic- and/or optically-based storage devices, where they are read to a processor under control of an operating system and executed. Similarly, decoders can be embodied in integrated circuits, such as application specific integrated circuits, field-programmable gate arrays and/or digital signal processors, or they can be embodied in computer programs that are stored by and executed on personal computers, notebook computers, tablet computers, smartphones or computer servers. Decoders commonly are packaged in consumer electronics devices, such as gaming systems, DVD players, portable media players and the like; and they also can be packaged in consumer software applications such as video games, browser-based media players and the like. And, of course, these components may be provided as hybrid systems that distribute functionality across dedicated hardware components and programmed general-purpose processors, as desired.
Several embodiments of the disclosure are specifically illustrated and/or described herein. However, it will be appreciated that modifications and variations of the disclosure are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the disclosure.
Number | Date | Country
---|---|---
62/001,998 | May 22, 2014 | US