1. Field
The present application relates to video encoders and cost functions employed therein.
2. Background
Video compression involves compression of digital video data. Video compression is used for efficient coding of video data in video file formats and streaming video formats. Compression is a reversible conversion of data to a format with fewer bits, usually performed so that the data can be stored or transmitted more efficiently. If the inverse of the process, decompression, produces an exact replica of the original data then the compression is lossless. Lossy compression, usually applied to image data, does not allow reproduction of an exact replica of the original image, but it is more efficient. While lossless video compression is possible, in practice it is virtually never used. Standard video data rate reduction involves discarding data.
Video is basically a three-dimensional array of color pixels. Two dimensions serve as spatial (horizontal and vertical) directions of the moving pictures, and one dimension represents the time domain.
A frame is a set of all pixels that (approximately) correspond to a single point in time. Basically, a frame is the same as a still picture. However, in interlaced video, the set of horizontal lines with even numbers and the set with odd numbers are grouped together in fields. The term “picture” can refer to a frame or a field.
Video data contains spatial and temporal redundancy. Similarities can thus be encoded by merely registering differences within a frame (spatial) and/or between frames (temporal). Spatial encoding is performed by taking advantage of the fact that the human eye is unable to distinguish small differences in color as easily as it can changes in brightness, and so very similar areas of color can be “averaged out.” With temporal compression, only the changes from one frame to the next are encoded because a large number of the pixels will often be the same on a series of frames.
Video compression typically reduces this redundancy using lossy compression. Usually this is achieved by (a) image compression techniques to reduce spatial redundancy from frames (this is known as intraframe compression or spatial compression) and (b) motion compensation and other techniques to reduce temporal redundancy (known as interframe compression or temporal compression).
H.264/AVC is a video compression standard resulting from the joint efforts of ISO (International Organization for Standardization) and ITU (International Telecommunication Union).
Quantization, in principle, involves reducing the dynamic range of the signal. This impacts the number of bits (rate) generated by entropy coding. This also introduces loss in the residual, which causes the original and reconstructed macroblock to differ. This loss is normally referred to as quantization error (distortion). The strength of quantization is determined by a quantization parameter. The higher the quantization parameter, the higher the distortion and the lower the rate.
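The rate/distortion tradeoff of quantization can be illustrated with a minimal sketch. It assumes a plain uniform scalar quantizer with a hypothetical `step` argument standing in for the quantization parameter; it is not the H.264/AVC quantizer, and the coefficient values are made up for illustration.

```python
def quantize(coeffs, step):
    """Map each coefficient to an integer level (truncation toward zero)."""
    return [int(c / step) for c in coeffs]


def dequantize(levels, step):
    """Reconstruct approximate coefficients from the integer levels."""
    return [level * step for level in levels]


def squared_error(a, b):
    """Sum of squared differences, used here as the distortion measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


coeffs = [52, -17, 9, -4, 3, 0, -1, 1]  # illustrative values only
for step in (4, 16):
    levels = quantize(coeffs, step)
    nonzero = sum(1 for level in levels if level != 0)
    error = squared_error(coeffs, dequantize(levels, step))
    print(step, levels, nonzero, error)
```

With the coarser step, fewer levels are non-zero (so fewer bits are needed to code them) while the squared error grows, which is the behavior described above.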
As discussed above, the predictor can be of two types—intra 128 and inter 130. Spatial estimation 124 looks at the neighboring macroblocks in a frame to generate the intra predictor 128 from among multiple choices. Motion estimation 126 looks at the previous/future frames to generate the inter predictor 130 from among multiple choices. Inter predictor aims to reduce temporal redundancy. Typically, reducing temporal redundancy has the biggest impact on reducing rate.
Motion estimation may be one of the most computationally expensive blocks in the encoder because of the huge number of potential predictors it has to choose from. Practically, motion estimation involves searching for the inter predictor in a search area comprising a subset of the previous frames. Potential predictors or candidates from the search area are examined on the basis of a cost function or metric. Once the metric is calculated for all the candidates in the search area, the candidate that minimizes the metric is chosen as the inter predictor. Hence, the main factors affecting motion estimation are: search area size, search methodology, and cost function.
Focusing particularly on cost function, a cost function essentially quantifies the redundancy between the original block of the current frame and a candidate block of the search area. The redundancy should ideally be quantified in terms of accurate rate and distortion.
The cost function employed in current motion estimators is Sum-of-Absolute-Difference (SAD).
In the example in
The following steps are calculated to get a motion vector (X,Y):
Ideally, the predictor macroblock partition should be the macroblock partition that most closely resembles the macroblock. One of the drawbacks of SAD is that it does not specifically and accurately account for Rate and Distortion. Hence the redundancy is not quantified accurately, and therefore it is possible that the predictive macroblock partition chosen is not the most efficient choice. Thus, in some cases utilizing a SAD approach may actually result in less than optimal performance.
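For reference, SAD-based selection of a motion vector can be sketched as follows. This is a minimal illustration assuming an exhaustive search over a square window and frames stored as lists of rows; the function names and the search strategy are assumptions made for the example, not details taken from this description.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2-D blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def crop(frame, top, left, height, width):
    """Extract a height x width block whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in frame[top:top + height]]


def sad_motion_search(current_block, reference, top, left, search_range):
    """Test every offset within +/- search_range and keep the minimum SAD."""
    height, width = len(current_block), len(current_block[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if (y < 0 or x < 0 or y + height > len(reference)
                    or x + width > len(reference[0])):
                continue
            cost = sad(current_block, crop(reference, y, x, height, width))
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best  # (minimum SAD, X, Y)
```

Note that the metric only measures pixel-wise difference; nothing in it reflects how many bits the residual will take after transform, quantization, and entropy coding.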
One embodiment relates to a method for selecting a predictive macroblock partition in motion estimation and compensation in a video encoder including determining a bit rate signal, generating a distortion signal, calculating a cost based on the bit rate signal and the distortion signal, and determining a motion vector from the cost. The motion vector designates the predictive macroblock partition. The method may be implemented in a mobile device such as a mobile phone, digital organizer or laptop computer.
Reference will now be made in detail to some embodiments, examples of which are illustrated in the accompanying drawings. It will be understood that the embodiments are not intended to limit the description. On the contrary, the description is intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope of the description as defined by the claims. Furthermore, in the detailed description, numerous specific details are set forth in order to provide a thorough understanding. However, it may be obvious to one of ordinary skill in the art that the present description may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present description.
Some portions of the detailed descriptions that follow are presented in terms of procedures, logic blocks, processing, and other symbolic representations of operations on data bits within a computer or digital system memory. These descriptions and representations are means used by those skilled in the data processing arts to effectively convey the substance of their work to others skilled in the art. A procedure, logic block, process, etc., is herein, and generally, conceived to be a sequence of steps or instructions leading to a desired result.
Unless specifically stated otherwise as apparent from the discussion herein, it is understood that throughout discussions of the embodiments, discussions utilizing terms such as “determining” or “outputting” or “transmitting” or “recording” or “locating” or “storing” or “displaying” or “receiving” or “recognizing” or “utilizing” or “generating” or “providing” or “accessing” or “checking” or “notifying” or “delivering” or the like, refer to the action and processes of a computer system, or similar electronic computing device, that manipulates and transforms data. The data is represented as physical (electronic) quantities within the computer system's registers and memories and is transformed into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices.
In general, embodiments of the description below subject candidate macroblock partitions to a series of processes that approximate the processes the macroblock partition would undergo were it actually selected as the predictive macroblock partition (see generally
An optimal solution to predictive macroblock partition selection needs to be established to understand where SAD stands and the scope of the gains possible. The optimal solution will guarantee minimum distortion (D) under a rate (R) constraint. Such a solution is found using Lagrangian-based optimization, which combines Rate and Distortion as D+λR. λ is the Lagrangian multiplier that represents the tradeoff between rate and distortion.
The current frame 308 is motion compensated 310 by the candidate macroblock partitions 304 to obtain a residual error signal e(x,y), as shown in (1).
e(x,y) is divided into an integral number of 4×4 blocks e(x,y,z) 312 where
The size of e(x,y) is A×B. The values that A and B can take are shown in Table 1. Let e(x,y,z) be denoted by E.
E 312 is transformed 314 into the frequency domain from the spatial domain. Let the transformed block be denoted as t(x,y,z) or T 316. Since the transform is separable, it is applied in two stages, horizontal (4) and vertical (5) on E 312. E′ represents the intermediate output. D represents the transform matrix shown in (6).
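The two-stage separable transform can be sketched as below. The matrix used is the 4×4 integer core transform of H.264/AVC; treating it as the matrix D of (6), and treating the horizontal stage as a row-wise product followed by a column-wise vertical stage, are assumptions made for this illustration, since the equations themselves are not reproduced here.

```python
# Assumed transform matrix: the 4x4 integer core transform of H.264/AVC.
D = [
    [1,  1,  1,  1],
    [2,  1, -1, -2],
    [1, -1, -1,  1],
    [1, -2,  2, -1],
]


def matmul(a, b):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    """Swap rows and columns."""
    return [list(row) for row in zip(*m)]


def forward_transform(e):
    """Compute T = D * E * D^T in two separable stages."""
    e_prime = matmul(e, transpose(D))  # horizontal stage: transforms each row
    return matmul(D, e_prime)          # vertical stage: transforms each column
```

For a flat residual block, all of the energy lands in the top-left (DC) coefficient, which is why the transformed block is usually cheaper to code than the spatial one.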
T 316 is quantized 318 with a quantization parameter Q, which is predetermined. Let the quantized block be denoted by l(x,y,z) or L 320.
The values for the elements of M are derived from a table known in the art. A sample of the table is shown in Table 2.
Next, L 320 is entropy coded 328 using a context-adaptive variable length coding (CAVLC) scheme. This generates the number of bits taken to represent l(x,y,z), which is denoted as Rate(x,y,z,Q) or Rate(Q) 332.
Rate(x,y,z,Q)=CAVLC(l(x,y,z,Q)) (11)
It should be appreciated by one skilled in the art that CAVLC is known in the art and that another entropy coding algorithm may be used in its place.
L 320 is inverse quantized 322 with quantization parameter Q. Let the inverse quantized block be denoted by l̂(x,y,z) or L̂ 324.
The values for the elements of M̂ are derived from a table known in the art. A sample of the table is shown in Table 3.
L̂ is transformed from the frequency domain to the spatial domain 326. Let the transformed block be denoted by ê(x,y,z,Q) or Ê 329. Since the transform is separable, it is applied in two stages, horizontal (14) and vertical (15), on L̂. L′ represents the intermediate output. D̂ represents the transform matrix shown in (16).
The squared-error between Ê and E represents the Distortion, Distortion(x,y,z,Q) or Distortion(Q).
The Lagrangian cost Cost4×4(x,y,z,Q,λ) is calculated for a predefined λ.
Cost4×4(x,y,z,Q,λ)=Distortion(x,y,z,Q)+λ×Rate(x,y,z,Q) (18)
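The squared-error distortion and the per-block Lagrangian cost of (18) can be sketched as follows. The function names are hypothetical, and the rate is passed in as an already computed bit count rather than being derived by entropy coding.

```python
def distortion(e, e_hat):
    """Squared error between the residual block E and its reconstruction."""
    return sum((a - b) ** 2
               for row, row_hat in zip(e, e_hat)
               for a, b in zip(row, row_hat))


def cost_4x4(e, e_hat, rate_bits, lam):
    """Lagrangian cost of one 4x4 block: Distortion + lambda * Rate."""
    return distortion(e, e_hat) + lam * rate_bits


e = [[5] * 4 for _ in range(4)]
e_hat = [row[:] for row in e]
e_hat[0][0] = 7   # reconstruction error of 2 at one position
e_hat[2][3] = 4   # reconstruction error of 1 at another
print(cost_4x4(e, e_hat, rate_bits=12, lam=0.5))  # 5 + 0.5 * 12 = 11.0
```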
The total cost for p(x,y) is given by:
The motion vector (X,Y) is then calculated as follows.
(X,Y)=(x,y)|min Cost(x,y,Q,λ) (20)
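The selection in (20) amounts to an argmin over the candidates. The sketch below assumes that the total cost of a candidate is the sum of the costs of its 4×4 blocks, and represents each candidate as a list of hypothetical (distortion, rate) pairs; both are assumptions made for the example.

```python
def total_cost(blocks, lam):
    """Sum D + lambda * R over the 4x4 blocks of one candidate (assumed form)."""
    return sum(d + lam * r for d, r in blocks)


def select_motion_vector(candidates, lam):
    """Return the (x, y) offset whose candidate has the minimum total cost."""
    return min(candidates, key=lambda mv: total_cost(candidates[mv], lam))


# Illustrative values only: one candidate has lower distortion, the other lower rate.
candidates = {
    (0, 0): [(60, 25), (40, 15)],   # total D = 100, total R = 40
    (1, -2): [(70, 6), (50, 4)],    # total D = 120, total R = 10
}
print(select_motion_vector(candidates, lam=1.0))  # (1, -2): 130 beats 140
print(select_motion_vector(candidates, lam=0.1))  # (0, 0): 104 beats 121
```

The two calls show how λ sets the tradeoff: a larger λ penalizes rate more heavily, so the cheaper-to-code candidate wins even though its distortion is higher.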
The optimal solution just described may be too complex to be practical even though it provides the best solution possible. Embodiments of the present description introduce a new cost function that represents a computational approximation of the optimal solution. This computational approximation may have an insignificant impact on the results of the optimal solution while significantly reducing the complexity of the same.
According to the optimal solution, T would now be quantized. However, the quantization process is computationally complex because it involves multiplication and other complex binary functions. Thus in one embodiment, the multiplication of T and M from (7) is approximated through a series of shifts and adds as follows:
M(i,j)×T(i,j)=(T(i,j)<<a+Sign(T(i,j)<<b,
(7) can be rewritten as the quantization approximation 418:
L(i,j)=((T(i,j)<<a+Sign(T(i,j)<<b,
S and R can be determined from (9) and (10). The multiplication factor M is approximated with M̃. The values of a, b, c, d,
According to the optimal solution, the quantization approximation block 420 would then be entropy coded to produce the rate signal 428. However, entropy coding algorithms such as CAVLC are highly computationally demanding operations. Entropy coding of a 4×4 quantized block involves encoding a Token (which indicates the number of non-zero coefficients and the number of trailing 1's), the signs of the trailing 1's, the Levels of the non-zero coefficients, and the Runs of zeros between non-zero coefficients. In one embodiment, the entropy coding is eliminated by using the Fast Bits Estimation Method (FBEM) to estimate the rate. According to FBEM, the number of bits taken by the different elements can be derived from the number of non-zero coefficients (NC), the number of zeros (NZ), and the sum of absolute levels (SAL).
where Scan( ) represents the zig-zag scan
Thus, a Rate 428 can be determined for each candidate macroblock partition 406 through an entropy coding approximation 424.
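The three statistics that FBEM relies on can be gathered in a single pass over the zig-zag scanned block, as sketched below. The scan table is the 4×4 frame zig-zag order of H.264/AVC. Counting NZ as the zeros that precede the last non-zero coefficient in scan order is an assumption, and the mapping from (NC, NZ, SAL) to an estimated bit count is not reproduced here.

```python
# 4x4 zig-zag scan order as (row, column) pairs.
ZIGZAG_4X4 = [
    (0, 0), (0, 1), (1, 0), (2, 0),
    (1, 1), (0, 2), (0, 3), (1, 2),
    (2, 1), (3, 0), (3, 1), (2, 2),
    (1, 3), (2, 3), (3, 2), (3, 3),
]


def fbem_statistics(levels):
    """Return (NC, NZ, SAL) for a 4x4 block of quantized levels."""
    scanned = [levels[r][c] for r, c in ZIGZAG_4X4]
    nonzero_positions = [i for i, v in enumerate(scanned) if v != 0]
    nc = len(nonzero_positions)                 # number of non-zero coefficients
    sal = sum(abs(v) for v in scanned)          # sum of absolute levels
    # Assumption: NZ counts only zeros located before the last non-zero level.
    nz = (nonzero_positions[-1] + 1 - nc) if nonzero_positions else 0
    return nc, nz, sal
```

Because only counts and sums are needed, no code tables are consulted per candidate, which is where the saving over full entropy coding comes from.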
According to the optimal solution, L would also need to be inverse-quantized 322 and inverse-transformed 326. Similar to quantization, inverse quantization is also computationally complex. In one embodiment, these processes are simplified through an inverse quantization approximation. The inverse quantization approximation is achieved by performing the same steps as the quantization approximation, but with a second quantization parameter.
L′(i,j)=((T(i,j)<<a+Sign(T(i,j)<<b,
In one embodiment, the second quantization parameter is chosen such that S=15, which is approximately equivalent to calculating the zero-distortion value.
By doing the above steps, inverse quantization 322 has been significantly simplified and inverse transformation 326 is no longer necessary. It is appreciated that because embodiments achieve the inverse quantization approximation through quantization approximation with a second quantization parameter, both L and L′ can be generated from the same circuitry, module, etc.
In one embodiment, once the inverse quantization approximation block L′ 422 has been generated, the Distortion 430, Distortion(x,y,z,Q) or Distortion(Q), can be represented by the squared-error between L′ and L. (L′-L) represents the quantization error and has a small dynamic range. Hence embodiments can store the squared values in a lookup-table to avoid the squaring operation.
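The lookup-table idea can be sketched as follows. The bound `MAX_ABS_ERROR` is a hypothetical value chosen for the example; an actual bound would follow from the dynamic range of the quantization error in a given design.

```python
MAX_ABS_ERROR = 15  # hypothetical bound on |L' - L|, assumed for illustration

# Precomputed squares, so the distortion loop needs no multiplication.
SQUARE_LUT = [v * v for v in range(MAX_ABS_ERROR + 1)]


def distortion_lut(l_prime, l):
    """Squared error between L' and L using table lookups instead of squaring."""
    total = 0
    for row_prime, row in zip(l_prime, l):
        for a, b in zip(row_prime, row):
            total += SQUARE_LUT[abs(a - b)]  # raises IndexError if bound exceeded
    return total
```

The table stays small precisely because the error has a small dynamic range; a wider range would need a larger table or a fallback to direct squaring.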
In one embodiment, the Lagrangian cost for each of the integral number of four by four blocks Cost4×4(x,y,z,Q,λ) is calculated for a predefined λ.
Cost4×4(x,y,z,Q,λ)=Distortion(x,y,z,Q)+λ×Rate(x,y,z,Q) (34)
In one embodiment, the total cost for p(x,y) is given by:
Finally, the motion vector (X,Y) is then selected as follows:
(X,Y)=(x,y)|min Cost(x,y,Q,λ) (36)
Thus, the above embodiments are able to accurately approximate the rate and distortion for each candidate macroblock partition. The embodiments may select the best possible predictive macroblock partition with more certainty than the SAD cost function because the selection process specifically accounts for Rate and Distortion. Therefore, the embodiments are able to achieve a higher signal to noise ratio than SAD for a given bitrate, as illustrated in
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present description. Various modifications to these embodiments may be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the description. Thus, the present description is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.