This invention relates generally to video coding, and more particularly to modifying the signaling of transform coefficients based upon perceptual characteristics of the video content.
When videos, images, multimedia or other similar data are encoded or decoded, compression is typically achieved by quantizing the data. A set of previously reconstructed blocks of data is used to predict the block currently being encoded or decoded. The set can include one or more previously reconstructed blocks. A difference between a prediction block and the block currently being encoded is a prediction residual block. In the decoder, the prediction residual block is added to a prediction block to form a decoded or reconstructed block.
Alternatively, in intra-mode, the prediction block can be determined by an intra prediction process 210, which also produces intra mode information 211. The input video block and the prediction block are input to a difference calculation 214, which outputs a prediction residual block 215. This prediction residual block is transformed 216 to produce transform coefficients 219, which are quantized 217 using rate control 213 to produce quantized transform coefficients 218. These coefficients are input to an entropy coder 220 for signaling in a bitstream 221. Additional mode and motion information are also signaled in the bitstream.
The quantized transform coefficients also undergo an inverse quantization 230 and inverse transform 240 process, the output of which is added 250 to the prediction block to produce a reconstructed block 241. The reconstructed block is stored in memory for use in subsequent prediction and motion estimation processes.
Compression of data is achieved primarily through the quantization process. Typically, the rate control module 213 determines quantization parameters that control how coarsely or finely a transform coefficient is quantized. To achieve lower bitrates or smaller file sizes, transform coefficients are quantized more coarsely, resulting in fewer bits output to the bitstream. This quantization introduces both visual and numerical distortion into the decoded video, as compared to the video input to the encoder. The bitrate and measured distortion are typically combined in a cost function. The rate control chooses parameters that minimize the cost function, i.e., minimizing the bitrate needed to achieve a desired distortion, or minimizing the distortion associated with a desired bitrate. The most common distortion metrics are based on the mean squared error (MSE) or mean absolute error, which are typically determined by taking pixel-wise differences between input blocks and reconstructed versions of those blocks.
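As a minimal illustrative sketch (not part of the described codec), this conventional rate-distortion trade-off can be written as a Lagrangian cost J = D + λR; the function names and candidate structure below are assumptions made for illustration only:

```python
import numpy as np

def mse(block, reconstructed):
    """Mean squared error between an input block and its reconstruction."""
    diff = block.astype(np.float64) - reconstructed.astype(np.float64)
    return np.mean(diff ** 2)

def rd_cost(distortion, rate_bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R used by rate control."""
    return distortion + lam * rate_bits

def select_qp(candidates, lam):
    """Choose the quantization parameter whose measured (distortion, rate) pair
    minimizes the cost for a block; candidates are (qp, distortion, rate_bits) tuples."""
    return min(candidates, key=lambda c: rd_cost(c[1], c[2], lam))[0]
```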
Metrics such as MSE, however, do not always accurately reflect how the human visual system (HVS) perceives distortion in images or video. Two decoded images having the same MSE as compared to the input image may be perceived by the HVS as having significantly different levels of distortion, depending upon where the distortion is located in the image. For example, the HVS is more sensitive to noise in smooth regions of an image than to noise in highly textured areas. Moreover, the visual acuity, which is the highest spatial frequency that can be perceived by the HVS, is dependent upon the motion of the object or scene across the retina of the viewer. For normal visual acuity, the highest spatial frequency that can be resolved is 30 cycles per degree of visual angle. This value is calculated for a visual stimulus that is stationary on the retina. The HVS is equipped with a mechanism of eye movements that enables tracking of a moving stimulus, keeping it stationary on the retina. However, as the velocity of the moving stimulus increases, the tracking performance of the HVS declines. This results in a decrease of the maximum perceptible spatial frequency. The maximum perceptible spatial frequency can be expressed as the following function:
K(vRx/y) = Kmax·vc/(vc + vRx/y),

where Kmax is the highest perceptible frequency for a static stimulus (30 cycles per degree), vRx/y is the velocity component of the stimulus in the horizontal or vertical direction, and vc is Kelly's corner velocity (2 degrees per second). This function is shown in the accompanying figure.
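As an illustrative sketch only, and assuming the acuity relation stated above, the decline of the maximum perceptible frequency with retinal velocity could be computed as follows; the function name and defaults are assumptions:

```python
def max_perceptible_frequency(v_r, k_max=30.0, v_c=2.0):
    """Maximum perceptible spatial frequency (cycles/degree) for a stimulus
    moving across the retina at velocity v_r (degrees/second)."""
    return k_max * v_c / (v_c + abs(v_r))

# Example: a stimulus moving at 10 deg/s is limited to 30 * 2 / 12 = 5 cycles/degree.
```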
Prior art methods that use perceptual metrics to code images and video typically replace or extend the distortion metric in the rate-control cost function with perceptually motivated distortion metrics, which are designed based upon the behavior of the HVS. One method uses a visual attention model, just-noticeable-difference (JND), contrast sensitivity function (CSF), and skin detection to modify how quantization parameters are selected in an H.264/MPEG-4 Part 10 codec. Transform coefficients are quantized more coarsely or finely based in part on these perceptual metrics. Another method uses perceptual metrics to normalize transform coefficients. Because these existing methods for perceptual coding are essentially forms of rate control and coefficient scaling, the decoder and encoder must still be capable of decoding all transform coefficients at any time, including transform coefficients that represent spatial frequencies that are not visible to the HVS due to the motion of a block. Coefficients that fall into this category unnecessarily consume bits in the bitstream and require processing that adds little or no quality to the decoded video.
There is a need, therefore, for a method that eliminates the signaling of coefficients that do not add to the perceptual quality of the video and eliminates the additional software or hardware complexity associated with receiving and processing those coefficients.
Embodiments of the invention are based on the realization that conventional encoding/decoding (codec) techniques must be capable of processing and signaling coefficients that represent spatial frequencies that are not perceptible to a viewer.
This invention uses a motion-based visual acuity model to determine which frequencies are not visible and then, instead of only quantizing the corresponding coefficients more coarsely as done in traditional rate control methods, eliminates the need to signal or decode those coefficients. Eliminating those coefficients further reduces the amount of data that needs to be signaled in the bitstream, and reduces the amount of processing or hardware needed to decode the data.
The motion information 161 is also input to a visual perceptual model 310. The visual perceptual model first estimates the velocity of a block or object represented by the block. The “velocity” is characterized by changes in pixel intensities, which can be represented by a motion vector. A formula, which incorporates a visual acuity model and the velocity, identifies a range of spatial frequency components that are not likely to be detected by the human visual system. The visual perceptual model can also incorporate the content of neighboring previously-reconstructed blocks when determining the range of spatial frequencies. The visual perceptual model then maps the spatial frequency range to a subset of transform coefficient indices. Transform coefficients that are outside this subset represent spatial frequencies that are imperceptible, based on the visual perceptual model. Horizontal and vertical indices representing the boundaries of the subset are signaled as coefficient cutoff information 312 to a spatiotemporal coefficient selector 320.
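A minimal sketch of such a mapping from velocity to coefficient cutoff indices is given below. It assumes the acuity relation described above; the names max_perceptible_frequency and cutoff_index, and the assumption that the highest-order coefficient of a block corresponds to the static acuity limit, are illustrative and not part of the described model:

```python
import numpy as np

K_MAX = 30.0  # highest perceptible spatial frequency for a static stimulus (cycles/degree)
V_C = 2.0     # Kelly's corner velocity (degrees/second)

def max_perceptible_frequency(velocity):
    """Maximum perceptible spatial frequency for a stimulus moving at the given retinal velocity."""
    return K_MAX * V_C / (V_C + abs(velocity))

def cutoff_index(velocity, block_size, f_block=K_MAX):
    """Map a per-direction velocity to a coefficient cutoff index in [1, block_size].

    f_block is the spatial frequency represented by the highest-order
    coefficient of the block (an assumption for illustration).
    """
    frac = max_perceptible_frequency(velocity) / f_block
    return max(1, min(block_size, int(np.ceil(frac * block_size))))

# Coefficient cutoff information for an N x N block, derived from per-block velocities
N = 8
vx, vy = 12.0, 3.0                                   # hypothetical velocities in degrees/second
cx, cy = cutoff_index(vx, N), cutoff_index(vy, N)    # column and row cutoff indices
```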
A subset of quantized transform coefficients 311 is decoded from the bitstream and is input to the spatiotemporal coefficient selector. Given the coefficient cutoff information, the spatiotemporal coefficient selector arranges the subset of quantized transform coefficients according to the positions determined by the visual perceptual model. These arranged selected coefficients 321 are input to a coefficient reinsertion process 330, which substitutes predetermined values, e.g., zero, into the positions corresponding to coefficients which were cut off, i.e., not part of the subset identified by the visual perceptual model.
After coefficient reinsertion, the resulting modified quantized transform coefficients 322 are inverse quantized 120 to produce reconstructed transform coefficients 121, which in turn are inverse transformed 130 to produce a reconstructed prediction residual block 131. The pixels in the prediction block 132 are added 140 to those in the reconstructed prediction residual block 131 to obtain a reconstructed block 141 for the output video 102, and the set of previously reconstructed blocks 150 is stored in a memory buffer.
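A compact sketch of this decoding path, assuming a uniform inverse quantizer and a 2-D DCT-based codec, is shown below; it is illustrative only and not the normative reconstruction of any particular standard:

```python
import numpy as np

def idct2(coeffs):
    """2-D inverse DCT built from an orthonormal type-II DCT basis matrix."""
    n = coeffs.shape[0]
    k = np.arange(n)
    basis = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    basis[0, :] *= 1 / np.sqrt(2)
    basis *= np.sqrt(2 / n)
    return basis.T @ coeffs @ basis

def reconstruct_block(modified_quantized_coeffs, prediction_block, qstep):
    """Inverse quantize, inverse transform, and add the prediction block."""
    recon_coeffs = modified_quantized_coeffs * qstep     # inverse quantization
    residual = idct2(recon_coeffs)                       # inverse transform
    return np.clip(prediction_block + residual, 0, 255)  # reconstructed block
```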
Perceptual Model and Coefficient Processing
For example, the decoder normally processes an N×N block of transform coefficients. This block has N columns and N rows. If the column cutoff index is cx, then the visual perceptual model has determined that horizontal frequencies represented by coefficients in columns 1 through cx are perceptible, and that horizontal frequencies represented by coefficients in columns cx+1 through N are imperceptible. Similarly, the vertical velocity ƒ(mvy) is mapped 420 to a row cutoff index cy 421. The column cutoff and row cutoff indices comprise the coefficient cutoff information 312, which is signaled to the spatiotemporal coefficient selector 320.
The subset of quantized transform coefficients 311 decoded from the bitstream forms an incomplete set of transform coefficients, because coefficients beyond the row or column cutoff indices were not signaled in the bitstream. The coefficient cutoff information is used to arrange the subset of quantized transform coefficients. These selected coefficients 321 are then input to a coefficient reinsertion process, which fills in values for the missing coefficients. Typically, a value of zero is used for this substitution. In the example above, and in the common case where the transform used by the codec is related to the Discrete Cosine Transform (DCT), the selected coefficients form a cx×cy block of coefficients, which can be placed in the upper-left corner of an N×N block. Positions not occupied by the selected coefficients are filled with zero values. The output of the coefficient reinsertion process is a block of modified quantized transform coefficients 122, which is processed by the rest of the decoder.
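A minimal sketch of the selection and reinsertion steps for such a DCT-like layout, assuming zero substitution and illustrative function names, is shown below:

```python
import numpy as np

def select_coefficients(full_block, cx, cy):
    """Encoder side: keep only the perceptible coefficients (rows 0..cy-1, columns 0..cx-1)."""
    return full_block[:cy, :cx]

def reinsert_coefficients(selected, n, fill_value=0):
    """Decoder side: place the selected coefficients in the upper-left corner of an
    n x n block and fill the remaining positions with a predetermined value."""
    modified = np.full((n, n), fill_value, dtype=selected.dtype)
    cy, cx = selected.shape
    modified[:cy, :cx] = selected
    return modified

# Example: an 8x8 block with cutoff indices cx=3, cy=2
quantized = np.arange(64).reshape(8, 8)
subset = select_coefficients(quantized, cx=3, cy=2)   # signaled in the bitstream
modified = reinsert_coefficients(subset, n=8)         # reconstructed at the decoder
```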
As described above, motion information, such as motion vectors, is used to identify the velocity 510 of the block or object represented by the block. The velocity can be represented by separate horizontal and vertical velocities, or by a two-dimensional vector or function as shown. The velocities are mapped 520 to coefficient cutoff indices. For example, for separate horizontal and vertical motion models, there can be a column cutoff index Tx and a row cutoff index Ty.
Another method 532 for cutting out coefficients can use a 2-D function g(Tx, Ty). This function can trace any path over a block, outside of which coefficients are not signaled. Additional embodiments can relate the function g to the type of transform being used, as the spatial frequency components represented by a given coefficient position are dependent upon the type of transform used by the codec.
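One simple sketch of such a 2-D cutoff is an elliptical boundary parameterized by Tx and Ty; the mask construction below is purely illustrative and is only one of many possible paths:

```python
import numpy as np

def cutoff_mask(n, tx, ty):
    """Boolean mask of coefficient positions inside an elliptical path g(Tx, Ty).

    Positions where the mask is False lie outside the path and are not signaled.
    """
    rows, cols = np.mgrid[0:n, 0:n]
    return (cols / float(tx)) ** 2 + (rows / float(ty)) ** 2 <= 1.0

mask = cutoff_mask(8, tx=4, ty=2)
signaled_positions = np.argwhere(mask)  # coefficient indices that remain in the bitstream
```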
The motion-based perceptual, or visual acuity model, can consider the horizontal and vertical velocities separately or jointly. As described above, cutoff indices can be determined separately based on horizontal and vertical motion, or the cutoff indices can be determined jointly as a function of the horizontal and vertical or other measured motion directions combined. For systems that apply separable transforms horizontally and vertically, the horizontal and vertical motion models and cutoff indices can also be applied in a separable fashion, both horizontally and vertically. Thus, the complexity reductions resulting from hardware and software implementations of separable transforms can also be extended to the separable application of this invention.
Encoder
The subset of quantized transform coefficients also undergoes a coefficient reinsertion process 330, in which coefficients outside the subset are assigned predetermined values, resulting in a complete set of modified quantized transform coefficients. This modified set undergoes an inverse quantization and inverse transform process, whose output is added to the prediction block to produce a reconstructed block. The reconstructed block is stored in memory for use in subsequent prediction and motion estimation processes.
The preferred embodiment describes how the coefficient selector and reinsertion processes are applied prior to inverse quantization in the decoder. In an additional embodiment, the coefficient selector and reinsertion processes can be applied between the inverse quantization and the inverse transform. In this case, the coefficient cutoff information is also input to the inverse quantizer so that the quantizer knows which coefficients are signaled in the bitstream. Similarly, the encoder can place the coefficient selector between the transform and quantization processes (and between the inverse quantization and inverse transform processes), and the coefficient cutoff information can also be input to the quantizer (and inverse quantizer) so the quantizer knows which subset of coefficients to quantize.
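A rough sketch of this encoder-side placement, with the selector between the transform and the quantizer, is given below; the uniform quantizer and all names are assumptions made for illustration:

```python
import numpy as np

def encode_block_with_selector(residual_coeffs, cx, cy, qstep):
    """Transform-domain encoder path with the coefficient selector placed
    between the transform and the quantizer (illustrative only)."""
    selected = residual_coeffs[:cy, :cx]           # coefficient selector: keep perceptible subset
    quantized_subset = np.round(selected / qstep)  # quantize only the selected coefficients
    return quantized_subset                        # entropy coded and signaled in the bitstream

def local_reconstruction(quantized_subset, n, qstep):
    """Encoder's local decoding loop: reinsertion, then inverse quantization."""
    modified = np.zeros((n, n))
    cy, cx = quantized_subset.shape
    modified[:cy, :cx] = quantized_subset          # coefficient reinsertion with zeros
    return modified * qstep                        # inverse quantization; inverse transform follows
```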
The functions ƒ(mvx) and ƒ(mvy), which map motion information to velocities, can include scaling, another mapping, or thresholding. For example, the functions can be configured so that no coefficients are cut off when the motion represented by mvx and mvy is below a given threshold. The motion information input to these functions can also be scaled nonlinearly, or the motion information can be mapped based upon an experimentally predetermined relation between motion and visible frequencies. When a predetermined relation is used, the decoder and encoder use the same model, so no additional side information needs to be signaled. A further refinement of this embodiment allows the model to vary, in which case additional side information is needed.
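A small illustrative sketch of such a thresholded mapping follows; the threshold, scale factor, exponent, and function name are assumptions, not values taken from the description:

```python
def motion_to_velocity(mv_component, scale=0.5, threshold=4, exponent=1.0):
    """Map a motion-vector component to an effective velocity for the acuity model.

    Below the threshold, the velocity is reported as zero so that no
    coefficients are cut off; above it, the magnitude can be scaled nonlinearly.
    """
    magnitude = abs(mv_component)
    if magnitude < threshold:
        return 0.0
    return scale * (magnitude ** exponent)
```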
The functions ƒ(mvx) and ƒ(mvy) and the corresponding mappings and visual perceptual model can also incorporate the motion associated with neighboring previously-decoded blocks. For example, suppose a large cluster of blocks in a video has similar motion. This cluster can be associated with a large moving object. The visual perceptual model can determine that such an object is likely to be tracked by the human eye, causing the velocity of the block relative to the viewer's retina to decrease, as compared to a small moving object that the viewer is not following. In this case, the functions ƒ(mvx) and ƒ(mvy) and corresponding mappings can be scaled so that fewer coefficients are cut out of the block of coefficients. Conversely, if the current block has a significantly different amount or direction of motion as compared to neighboring blocks, then the visual perceptual model can increase the number of cut-out coefficients under the assumption that distortion is less likely to be perceived in a block that is difficult to track due to surrounding motion.
The encoder can perform additional motion analysis on the input video to determine motion and perceptible motion. If this analysis results in a change in the cut-off coefficients as compared to a codec that uses existing information such as motion vectors, then the results of the additional motion analysis can be signaled in the bitstream. The decoder's visual perceptual model and mappings can incorporate this additional analysis along with the existing motion information, such as motion vectors.
In addition to reducing the number of coefficients that are signaled, another embodiment can reduce other kinds of information. If a codec supports a set of modes, such as prediction modes or block size or block shape modes, then the size of this set of modes can be reduced based upon the visual perceptual model. For example, a codec may support several block-partitioning modes, where a 2N×2N block is partitioned into multiple 2N×N, N×2N, N×N, etc. sub-blocks. Typically, smaller block sizes are used to allow different motion vectors or prediction modes to be applied to each sub-block, resulting in a higher fidelity reconstruction of the sub-block. If the motion model, however, determines that all motion associated with a 2N×2N block is fast enough so that some spatial frequencies are unlikely to be perceptible, then the codec can disable the use of smaller sub-blocks for this block. By limiting the number of partitioning modes in this way, the complexity of the codec, and the number of bits needed to be signaled for these modes in the bitstream, can be reduced.
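A brief sketch of limiting the partition-mode set based on the motion model is shown below; the mode labels and the fast-motion threshold are illustrative assumptions:

```python
def allowed_partition_modes(block_velocity, fast_motion_threshold=16.0):
    """Restrict the partitioning modes of a 2Nx2N block when motion is fast enough
    that fine spatial frequencies are unlikely to be perceptible."""
    all_modes = ["2Nx2N", "2NxN", "Nx2N", "NxN"]
    if abs(block_velocity) >= fast_motion_threshold:
        return ["2Nx2N"]  # disable smaller sub-blocks; fewer mode bits to signal
    return all_modes
```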
The perceptual model can also incorporate spatial information from neighboring previously-decoded blocks. If the current block is part of a moving or non-moving object which encompasses the current block and neighboring previously-reconstructed blocks, then the visual perceptual model and mappings for the current block can be made more similar to those used for the previously-reconstructed blocks. Thus, a consistent model is used over a moving object comprising multiple blocks.
The perceptual model and mappings can be modified based upon the global motion in the video. For example, if a video was acquired by a camera panning across a stationary scene, then the mappings can be modified to cut out no coefficients, unless this global motion is above a given threshold. Above this threshold, the panning is considered to be so fast that a viewer would be unlikely to be able to track any object in the scene. This may happen during a fast transition between scenes.
This invention can also be extended to operate on intra-coded blocks. Motion can be associated with intra-coded blocks based upon the motion of neighboring or previously-decoded and spatially-correlated inter-coded blocks. In a typical video coding system, intra-coded pictures or intra-coded blocks may occur only periodically, so that most blocks are inter-coded. If no scene-change is detected, then the parts of a moving object coded using an intra-coded block can be assumed to have motion consistent with the previously-decoded intra-coded blocks from that object. The coefficient cut-off process can be applied to the intra-coded blocks using the motion information from the neighboring or motion-consistent blocks in previously-decoded pictures. Additional reductions in signaled information can be achieved by reducing, for example, the number of prediction modes or block partitioning modes available for use by the intra-coded block.
The type of transform can be modified or selected based upon the visual perceptual model. For example, slow-moving objects can use a transform that reproduces sharp fine detail, whereas fast objects can use a transform, such as a directional transform, that reproduces detail in a given direction. If the motion of a block is, for example, mostly horizontal, then a directional transform that is oriented horizontally can be selected. The loss of vertically-oriented detail is imperceptible according to the visual model. Such directional transforms can be less complex and better performing in this case as compared to conventional two-dimensional separable transforms like the 2-D DCT.
The invention can be extended to work with stereo (3-D) video in that the mappings can be scaled so that more coefficients are cut off in background objects, and fewer coefficients are cut off in foreground objects. Given that a viewer's attention is likely to be focused on the foreground objects, additional distortion can be tolerated in background objects as the motion of the background object increases. Furthermore, two visual perceptual models can be used: one for blocks including foreground objects, and another for blocks including background objects.
If all coefficients are cut out, then no coefficients are signaled in the bitstream for a given block. In this case, the data in the bitstream can be further reduced by not signaling any header or additional information associated with representing a block of coefficients. Alternatively, if the bitstream contains a coded-block-pattern flag which is set to true if all coefficients in the block are zero, then this flag can be set when no coefficients are to be signaled.
Instead of using the visual perceptual model to limit the subset of coefficients that are signaled, the model can also be used to determine a down-sampling factor for an input video block. Blocks can be down-sampled prior to encoding and then up-sampled after decoding. Faster moving blocks can be assigned a higher down-sampling factor, based upon the motion model.
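A minimal sketch of choosing a per-block down-sampling factor from the motion model follows; the thresholds and factors are illustrative assumptions only:

```python
def downsampling_factor(block_velocity, thresholds=(4.0, 12.0), factors=(1, 2, 4)):
    """Map a block's velocity to a down-sampling factor: faster-moving blocks are
    down-sampled more aggressively before encoding and up-sampled after decoding."""
    speed = abs(block_velocity)
    if speed < thresholds[0]:
        return factors[0]
    if speed < thresholds[1]:
        return factors[1]
    return factors[2]
```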
Although the invention has been described by way of examples of preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the invention. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the invention.