The present invention relates to video coding and, in particular, to techniques for computing scale factors and offsets for use in weighted prediction.
Weighted prediction allows an encoder to specify the use of a scaling factor (w) and offset value (o) when performing motion compensation. Weighted prediction provides a significant coding performance benefit in special cases such as fade-to-black, fade-in, and cross-fade transitions. Weighted prediction modes include implicit weighted prediction for B-frames and explicit weighted prediction for P-frames. Weighted prediction techniques have been codified in video coding standards such as ITU-T H.264.
Weighted prediction has been considered most effective for prepared video sequences where the fade effects are introduced into the sequences artificially, for example, by an editing operation. The inventors identified an opportunity to use weighted prediction in video conferencing applications where a continuous video sequence is presented for coding but the sequence exhibits brightness variations that arise from variations in the captured video data. For example, operators often move about in proximity to lighting sources that affect the overall brightness of captured video data. Auto exposure controls at a camera may mitigate some brightness variations but not all. The inventors recognized a need for an application of weighted prediction that provides efficient coding in light of such variations.
Embodiments of the present invention provide video coding techniques for deriving scaling factors W and/or offset values O for use in weighted prediction. Given a frame of input video to be coded, a prediction match may be established with one or more reference frames. The input frame may be parsed into a plurality of regions. Thereafter, the scaling factor W and/or offset value O may be derived by developing a system of equations that relates each predicted pixel to its counterpart pixel in the input frame through the scaling factor W and/or offset value O. Equations within the system may be prioritized according to the relative priority of the regions, and the scaling factor W and/or offset value O may be solved for. The scaling factor W and/or offset value O then may be used during weighted prediction of the input frame.
The communication network 130 may provide communication channels between the respective terminals via wired or wireless communication networks, for example, the communication services provided by packet-based Internet or mobile communication networks (e.g., 3G and 4G wireless networks). Although only two terminals are described in detail, the videoconferencing system 100 may include additional terminals provided in mutual communication for multipoint videoconferences.
In an embodiment, the pre-processor 210 may parse individual frames of video data into pixel blocks (often, 16×16 or 8×8 blocks of pixel data within frames). The pre-processor 210 further may filter the video data to condition it for coding, applying, for example, de-noising filters, sharpening filters, smoothing filters, bilateral filters and the like dynamically to the source video based on characteristics observed within the video. The pre-processor 210 further may apply brightness control that normalizes brightness variations that may occur in input video.
The coding engine 220, in an embodiment, may be a functional unit that codes data by motion-predictive coding techniques to exploit spatial and/or temporal redundancies therein. The coding engine 220 may output a coded video data stream that consumes lower bandwidth than the source video data stream. The coded video data may comply with a predetermined coding protocol to enable a decoder (not shown) to decode the coded video data. Exemplary protocols may include the H.263, H.264 and/or MPEG families of coding standards.
In an embodiment, the transmitter 240 may format the coded data for transmission over the channel. The transmitter 240 may include buffer memory (not shown) to store the coded video data prior to transmission. The transmitter 240 further may receive and buffer data from other sources, such as audio coders (not shown), as well as administrative data to be conveyed to the decoder.
To code an input pixel block predictively, the motion predictor 360 may generate a predicted pixel block and output the predicted pixel block data to the subtractor 310. The subtractor 310 may generate data representing a difference between the source pixel block and predicted pixel block. The subtractor 310 may operate on a pixel-by-pixel basis, developing residuals at each pixel position over the pixel block. If a given pixel block is to be coded non-predictively, then the motion predictor 360 will not generate a predicted pixel block and the subtractor 310 may output pixel residuals that are the same as the source pixel data.
The transform unit 320 may convert the pixel block data output by the subtractor 310 into an array of transform coefficients, such as by a discrete cosine transform (DCT) process or a wavelet transform. Typically, the number of transform coefficients generated therefrom will be the same as the number of pixels provided to the transform unit 320. Thus, an 8×8, 8×16 or 16×16 block of pixel data may be transformed into an 8×8, 8×16 or 16×16 block of coefficient data. The quantizer unit 330 may quantize (divide) each transform coefficient of the block by a quantization parameter Qp. In many circumstances, low-amplitude coefficients may be truncated to zero. The entropy coder 340 may code the quantized coefficient data by run-value coding, run-length coding or the like. Data from the entropy coder may be output to the channel as coded video data of the pixel block.
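By way of illustration, the transform-and-quantize path described above might be sketched as follows. The 8×8 block size, the orthonormal DCT from scipy, and the single flat quantization parameter Qp are simplifying assumptions made for this sketch, not the exact behavior of the coding engine or of any particular standard.

```python
# Minimal sketch of transforming and quantizing one residual pixel block.
# Assumptions: 8x8 block, orthonormal 2-D DCT, uniform quantizer with flat Qp.
import numpy as np
from scipy.fft import dctn, idctn

def transform_and_quantize(block, qp):
    """Forward DCT (transform unit 320) followed by uniform quantization (quantizer 330)."""
    coeffs = dctn(block.astype(np.float64), norm='ortho')
    return np.round(coeffs / qp).astype(np.int32)

def dequantize_and_inverse(quantized, qp):
    """Approximate reconstruction of the residual, as a reference-frame decoder would perform."""
    return idctn(quantized.astype(np.float64) * qp, norm='ortho')

residual = np.random.randint(-32, 32, size=(8, 8))
q = transform_and_quantize(residual, qp=16)
print("nonzero coefficients after quantization:", np.count_nonzero(q))
```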
The reference frame decoder 350 may decode pixel blocks of reference frames and assemble decoded data for such frames. Although not shown in
The motion predictor 360 may search the reference picture cache for stored decoded frame data that exhibits strong correlation with the source pixel block. When the motion predictor 360 finds an appropriate prediction reference for the source pixel block, it may generate motion vector data (MVs) that may be output to the decoder as part of the coded video data stream. The motion predictor 360 may retrieve a reference pixel block from the reference picture cache and output it to the subtractor 310. In so doing, the scalar 370 may scale the pixel data of the reference pixel block by a scale factor W, which may have unity gain (W=1) in appropriate circumstances. Similarly, the adder 380 may add an offset O to the scaled data. The offset may be zero or have a negative value in appropriate circumstances.
The scalar 370 and adder 380, therefore, may cooperate to support weighted prediction in which a predicted pixel block is generated as:
P_PRED(i,j) = W * P_REF(i,j) + O,    (Eq. 1)
where P_PRED(i,j) represents pixel values of a predicted pixel block input to the subtractor 310, P_REF(i,j) represents pixel values of a pixel block extracted from a reference picture cache according to the motion predictor, W represents the scale factor applied at the scalar 370, and O represents the offset applied at the adder 380.
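As a concrete illustration of Eq. 1, the following sketch scales and offsets a reference pixel block and clips the result to the valid sample range. The 8-bit bit depth and the example values of W and O are assumptions chosen for illustration only.

```python
# Sketch of Eq. 1: generating a weighted prediction block from a reference block.
# Assumptions: 8-bit samples, illustrative W and O values.
import numpy as np

def weighted_prediction(ref_block, w, o, bit_depth=8):
    """P_PRED(i,j) = W * P_REF(i,j) + O, clipped to the valid pixel range."""
    pred = w * ref_block.astype(np.float64) + o        # scalar 370, then adder 380
    return np.clip(np.rint(pred), 0, (1 << bit_depth) - 1).astype(np.uint8)

ref = np.random.randint(0, 256, size=(16, 16), dtype=np.uint8)
pred = weighted_prediction(ref, w=0.85, o=-4.0)        # e.g., approximating a fade
```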
In an embodiment, the same scaling factor W and offset value O may be used for all reference pixel blocks extracted from a common frame in the reference picture cache. The scaling factor W and offset value O may be provided to the coding engine 300 once per frame or once per slice.
The coding engine 300 further may include a controller 390 to manage coding of the source video, including estimation of distortion and selection of a final coding mode for use in coding video.
P_TAR(i,j) = W * P_REF(i,j) + O,    (Eq. 2)
where P_TAR(i,j) represents a value of a pixel at location (i,j) in a target frame, P_REF(i,j) represents a value of a pixel at location (i,j) in a reference frame, W represents a scaling factor to be applied in weighted prediction, and O represents an offset value to be applied in weighted prediction (box 430). The values W and O eventually will be calculated from this system of equations. The method may supplement the system of equations by identifying regions having relatively high priority and preferentially weighting the equations corresponding to those regions (box 440). High-priority regions, by way of example, may be identified as regions having relatively low texture, regions toward the center of the image, or regions that exhibit high temporal stability. The method 400 then may solve for W and O using the system of equations (box 450).
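The prioritization of box 440 might, for example, be expressed as a per-pixel weight map derived from the region each pixel falls in, as in the sketch below. The texture and center-bias scoring used here is an illustrative assumption; the method does not mandate a particular weighting function.

```python
# Sketch of box 440: assigning a priority weight to each pixel's equation based on
# the region it belongs to. Low-texture regions and regions near the image center
# receive higher weights. The specific scoring function is an illustrative assumption.
import numpy as np

def region_priority_weights(target, region_size=16):
    """Return a per-pixel weight map derived from per-region texture and position."""
    h, w = target.shape
    weights = np.ones((h, w), dtype=np.float64)
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    max_dist = np.hypot(cy, cx)
    for y in range(0, h, region_size):
        for x in range(0, w, region_size):
            region = target[y:y + region_size, x:x + region_size].astype(np.float64)
            texture = region.std()                       # low texture -> high priority
            ry = y + region.shape[0] / 2.0
            rx = x + region.shape[1] / 2.0
            center_bias = 1.0 - np.hypot(ry - cy, rx - cx) / max_dist
            weights[y:y + region_size, x:x + region_size] = \
                (1.0 / (1.0 + texture)) + center_bias
    return weights

frame = np.random.randint(0, 256, size=(64, 64))
wmap = region_priority_weights(frame)
```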
During operation of boxes 420-440, the method may create an over-determined system of equations, meaning it has more equations than unknowns. The method may solve for W and O using statistical estimation techniques. For example, the values of W and O may be derived as values that minimize mean squared error between the target pixel values that would be computed via Eq. 2 and the actual target pixel values that occur in the source pixel block. In another example, the values of W and O may be derived as values that minimize transform energy of prediction residuals generated between the target pixel values that would be computed via Eq. 2 and source blocks—when those residuals are coded via discrete cosine transforms, wavelet transforms, Hadamard transforms and the like. Once derived, the values of W and O may be used in the coding engine for use in weighted prediction during coding of pixel blocks.
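A minimal sketch of the first estimation option follows, assuming a weighted least-squares formulation: each pixel contributes one equation of the form of Eq. 2, scaled by its priority weight, and W and O are the values that minimize the weighted mean squared error. The use of numpy's lstsq solver is an implementation choice for illustration.

```python
# Sketch of box 450: least-squares solution of the over-determined system
# P_TAR = W * P_REF + O, with each pixel equation scaled by a priority weight.
import numpy as np

def solve_w_o(target, reference, weights=None):
    """Return (W, O) minimizing the weighted mean squared error of Eq. 2."""
    tar = target.astype(np.float64).ravel()
    ref = reference.astype(np.float64).ravel()
    wts = np.ones_like(tar) if weights is None else weights.astype(np.float64).ravel()
    s = np.sqrt(wts)
    # Each row of A is one equation: W * ref_pixel + O = tar_pixel.
    A = np.column_stack((ref, np.ones_like(ref))) * s[:, None]
    b = tar * s
    (w, o), *_ = np.linalg.lstsq(A, b, rcond=None)
    return w, o

# Synthetic check: a reference frame faded by a known gain and offset.
ref = np.random.randint(16, 240, size=(64, 64))
tar = np.clip(0.8 * ref - 6.0 + np.random.normal(0, 1, ref.shape), 0, 255)
print(solve_w_o(tar, ref))   # approximately (0.8, -6.0)
```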
In practice, if the motion estimation search (box 410) identifies non-zero motion, Eq. 2 takes the form P_TAR(i,j) = W * P_REF(i−mv_x, j−mv_y) + O, where mv_x and mv_y are components of a motion vector identified in the motion estimation search.
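For illustration, the displaced reference samples might be gathered as in the sketch below, assuming integer-pel motion, simple border clamping, and the convention that the first coordinate of Eq. 2 indexes rows; none of these details are prescribed by the method itself.

```python
# Sketch of building the motion-compensated reference P_REF(i - mv_x, j - mv_y)
# used in Eq. 2 when the motion estimation search returns non-zero motion.
# Assumptions: integer-pel motion, edge clamping, first index treated as the row.
import numpy as np

def motion_compensated_ref(reference, mv_x, mv_y):
    h, w = reference.shape
    rows = np.clip(np.arange(h)[:, None] - mv_x, 0, h - 1)
    cols = np.clip(np.arange(w)[None, :] - mv_y, 0, w - 1)
    return reference[rows, cols]
```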
In an embodiment, the method may operate as a pre-processing stage before substantive coding and block-based motion estimation occur. The derived W and O values may be used for all blocks coded by the coding engine.
The method of
In another embodiment, prior to computation of W and O, the method may reduce the contribution, within the system of equations, of any equation for which the difference between the target pixel and its associated reference pixel exceeds a predetermined threshold (box 470). The threshold may vary based on an overall correlation factor identifying a degree of correlation between the target frame and the reference frame. In an embodiment, an equation may be removed entirely from the system of equations when the threshold is exceeded.
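One possible realization of box 470 is sketched below: pixel equations whose target/reference difference exceeds the threshold are attenuated or removed, with the threshold scaled by a frame-level correlation factor. The specific threshold rule, the correlation scaling, and the attenuation factor are assumptions made for illustration.

```python
# Sketch of box 470: reducing (or zeroing) the weight of equations whose
# target/reference pixel difference exceeds a threshold. Scaling the threshold
# by frame-level correlation is an illustrative assumption.
import numpy as np

def suppress_outlier_equations(target, reference, weights, base_threshold=32.0,
                               remove=False, attenuation=0.1):
    tar = target.astype(np.float64)
    ref = reference.astype(np.float64)
    # Frame-level correlation; a poorly correlated frame pair tolerates larger differences.
    corr = np.corrcoef(tar.ravel(), ref.ravel())[0, 1]
    threshold = base_threshold * (2.0 - max(corr, 0.0))
    outliers = np.abs(tar - ref) > threshold
    new_weights = weights.copy()
    new_weights[outliers] = 0.0 if remove else new_weights[outliers] * attenuation
    return new_weights
```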
In a further embodiment, computation of W and O may be performed through an iterative operation (box 480). For example, after computation of W and O, the method may compute an estimated value of each target pixel through application of the scaling factor W and offset O to the corresponding reference pixel (e.g., P_EST(i,j) = W * P_REF(i,j) + O) (box 482). Thereafter, the method 400 may compute an error by comparison of the estimated target pixel value to the actual value of the target pixel (box 484). The method may compare the error value to a threshold; the threshold may vary based on a value representing a difference between the target pixels and the estimated pixels taken across the entire frame (box 486). If the error value exceeds the threshold, then the method may reduce the contribution of the equation corresponding to the target pixel within the system of equations (box 488), or the equation may be removed entirely. The method may return to box 450 to solve for W and O again. The operation of box 480 may be repeated over several iterations as desired.
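The iterative operation of box 480 might be sketched as follows. The fixed iteration count, the attenuation factor, and the frame-wide error criterion (twice the mean absolute error) are illustrative assumptions rather than requirements of the method.

```python
# Sketch of box 480: solve for W and O, estimate the target pixels, and down-weight
# equations whose estimation error exceeds a frame-wide threshold, then re-solve.
# Assumptions: fixed iteration count, 0.1 attenuation, threshold = 2x mean |error|.
import numpy as np

def _solve(tar, ref, wts):
    """Weighted least-squares solve of P_TAR = W * P_REF + O (box 450)."""
    s = np.sqrt(wts.ravel())
    A = np.column_stack((ref.ravel(), np.ones(ref.size))) * s[:, None]
    (w, o), *_ = np.linalg.lstsq(A, tar.ravel() * s, rcond=None)
    return w, o

def iterative_w_o(target, reference, weights, iterations=3, attenuation=0.1):
    tar, ref = target.astype(np.float64), reference.astype(np.float64)
    wts = weights.astype(np.float64).copy()
    w, o = _solve(tar, ref, wts)                 # initial solve (box 450)
    for _ in range(iterations):                  # box 480
        est = w * ref + o                        # box 482: P_EST = W * P_REF + O
        err = np.abs(tar - est)                  # box 484: per-pixel error
        threshold = 2.0 * err.mean()             # box 486: frame-wide criterion
        wts[err > threshold] *= attenuation      # box 488: reduce contribution
        w, o = _solve(tar, ref, wts)             # solve again (back to box 450)
    return w, o
```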
The coding of each region has an associated cost due to the cost of encoding separate W and O parameters as well as other overhead. In an embodiment, the method may operate over a subset of regions based on an expected cost of coding the frame. The method also may determine how many regions should be made subject to the method 400 based on the expected coding cost, estimated benefit of using extra partitions and/or the current bit-rate of the coded video stream.
Again, during operation of boxes 520-540, the method 500 may create an over-determined system of equations. The method 500 may solve for W and O using statistical estimation techniques. For example, the values of W and O may be derived as values that minimize mean squared error between the target pixel values that would be computed via Eq. 2 and the actual target pixel values that occur in the source pixel block. In another example, the values of W and O may be derived as values that minimize transform energy of prediction residuals generated between the target pixel values that would be computed via Eq. 2 and source blocks—when those residuals are coded via discrete cosine transforms, wavelet transforms, Hadamard transforms and the like. Once derived, the values of W and O may be used in the coding engine for use in weighted prediction during coding of pixel blocks.
In practice, if the motion estimation search (box 510) identifies non-zero motion, Eq. 2 takes the form P_TAR(i,j) = W * P_REF(i−mv_x, j−mv_y) + O, where mv_x and mv_y are components of a motion vector identified in the motion estimation search.
In an embodiment, the method 500 may operate as a pre-processing stage before substantive coding and block-based motion estimation occur. The derived W and O values may be used for all blocks coded by the coding engine.
The method of
As in the embodiment of
Again, during operation of boxes 720-740, the method 700 may create an over-determined system of equations. The method 700 may solve for W and O using statistical estimation techniques. For example, the values of W and O may be derived as values that minimize mean squared error between the target pixel values that would be computed via Eq. 2 and the actual target pixel values that occur in the source pixel block. In another example, the values of W and O may be derived as values that minimize transform energy of prediction residuals generated between the target pixel values that would be computed via Eq. 2 and source blocks—when those residuals are coded via discrete cosine transforms, wavelet transforms, Hadamard transforms and the like. Once derived, the values of W and O may be used in the coding engine for use in weighted prediction during coding of pixel blocks.
The method of
If W and O do not fall on the same side of their respective thresholds, then, in a first embodiment, the value of W may be taken as valid but O may be recomputed using sources outside the domain of the system of equations. For example, O may be computed as a difference between the mean pixel values of the target frame and the reference frame, respectively (O = AVG_TAR − AVG_REF) (box 790). In another embodiment, the value of O may be taken as valid but W may be recomputed using sources outside the domain of the system of equations. For example, W may be computed as a ratio of variances between the target frame and the reference frame (W = VAR_TAR / VAR_REF) (box 800). Thereafter, the values of W and O may be taken as valid and output to the coding engine (box 780).
In another embodiment (not shown), if W and O fall on different sides of their thresholds, the method may revise both values. It may calculate W first as W = VAR_TAR / VAR_REF and thereafter calculate O based on the newly calculated W value as O = AVG_TAR − (W * AVG_REF).
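A sketch of this combined revision appears below. The interpretation of the consistency test (W compared against a threshold of 1 and O against a threshold of 0, with the two required to agree in direction) and the threshold values themselves are assumptions made for illustration; the frame-statistic fallback follows the formulas above.

```python
# Sketch of the consistency check and combined revision of W and O.
# Assumptions: thresholds of 1.0 for W and 0.0 for O, and a "same side" test
# meaning the two parameters agree in direction relative to those thresholds.
import numpy as np

def sanity_check_w_o(w, o, target, reference, w_threshold=1.0, o_threshold=0.0):
    tar = target.astype(np.float64)
    ref = reference.astype(np.float64)
    consistent = (w >= w_threshold) == (o >= o_threshold)
    if consistent:
        return w, o                                   # box 780: accept as valid
    # Combined revision: derive W from the variance ratio, then O from the means.
    w = tar.var() / max(ref.var(), 1e-9)              # W = VAR_TAR / VAR_REF
    o = tar.mean() - w * ref.mean()                   # O = AVG_TAR - (W * AVG_REF)
    return w, o
```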
The foregoing discussion identifies functional blocks that may be used in video coding systems constructed according to various embodiments of the present invention. In practice, these systems may be applied in a variety of devices, such as mobile devices provided with integrated video cameras (e.g., camera-enabled phones, entertainment systems and computers) and/or wired communication systems such as videoconferencing equipment and camera-enabled desktop computers. In some applications, the functional blocks described hereinabove may be provided as elements of an integrated software system, in which the blocks may be provided as separate elements of a computer program. In other applications, the functional blocks may be provided as discrete circuit components of a processing system, such as functional units within a digital signal processor or application-specific integrated circuit. Still other applications of the present invention may be embodied as a hybrid system of dedicated hardware and software components. Moreover, the functional blocks described herein need not be provided as separate units. For example, although
Further, the figures illustrated herein have provided only so much detail as necessary to present the subject matter of the present invention. In practice, video coders typically will include functional units in addition to those described herein, including audio processing systems, buffers to store data throughout the coding pipelines as illustrated and communication transceivers to manage communication with the communication network and a counterpart decoder device. Such elements have been omitted from the foregoing discussion for clarity.
Several embodiments of the invention are specifically illustrated and/or described herein. However, it will be appreciated that modifications and variations of the invention are covered by the above teachings and within the purview of the appended claims without departing from the spirit and intended scope of the invention.
This application claims benefit under 35 U.S.C. §119(e) of U.S. Provisional Application No. 61/441,961, filed Feb. 11, 2011, which is incorporated herein by reference in its entirety.