The invention relates to methods for encoding and decoding a video signal and corresponding apparatuses.
Coding of video signals is well known in the art and is usually based on the MPEG-4 or H.264/AVC standard. The committees responsible for these two standards are the ISO and the ITU. In order to reduce the bit rate of video signals, the ISO and ITU coding standards apply hybrid video coding with motion-compensated prediction combined with transform coding of the prediction error. In a first step, the motion-compensated prediction is performed. The temporal redundancy, i.e. the correlation between consecutive images, is exploited for the prediction of the current image from already transmitted images. In a second step, the residual error is transform coded, thereby reducing the spatial redundancy.
In order to perform the motion-compensated prediction, the current image of a sequence is split into blocks. For each block, a displacement vector di is estimated and transmitted that refers to the corresponding position in one of the reference images. The displacement vectors may have fractional-pel resolution. Today's standard H.264/AVC allows for ¼-pel displacement resolution. Displacement vectors with fractional-pel resolution may refer to positions in the reference image that are located between the sampled positions. In order to estimate and compensate the fractional-pel (sub-pel) displacements, the reference image has to be interpolated at the sub-pel positions. H.264/AVC uses a 6-tap Wiener interpolation filter with fixed filter coefficients. The interpolation process used in H.264/AVC is depicted in
It is an object of the invention to provide a method for encoding and decoding video data in a more effective manner.
The object is solved by the methods according to claims 1, 13, and 21.
Accordingly, a method for encoding a video signal representing a moving picture is provided that comprises the steps of receiving successive frames of a video signal, coding a frame of the video signal using a reference frame of the video signal, and calculating analytically a value of a sub-pel position of the reference frame by use of a filter having an individual set of two-dimensional filter coefficients. According to this aspect of the invention, instead of calculating the values of sub-pel positions in two steps based on two one-dimensional filters, the present invention discloses a method of calculating the value of a sub-pel position in a single step by use of a set of two-dimensional filter coefficients.
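For illustration only, and not as the claimed implementation, the single-step calculation can be sketched as follows, assuming a 6×6 support and hypothetical names:

```python
# Illustrative sketch (assumed names): computing one sub-pel value in a single
# step with a 6x6 non-separable filter, instead of two passes with
# one-dimensional filters.
import numpy as np

FILTER_SIZE = 6
FO = 2  # assumed offset from (x, y) to the top-left sample of the 6x6 support

def interpolate_subpel_2d(ref, x, y, h_sp):
    """ref: 2D reference image, (x, y): full-pel anchor, h_sp: 6x6 coefficient set."""
    support = ref[y - FO : y - FO + FILTER_SIZE, x - FO : x - FO + FILTER_SIZE]
    return float(np.sum(h_sp * support))  # single two-dimensional filtering step
```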
The filter set can be established by setting up an individual set of equations for the sub-pel position. Accordingly, the calculation is independent for each sub-pel position.
According to an aspect of the invention, some of the two-dimensional filter coefficients are set equal under the constraint that the distances of the corresponding full-pel positions to the current sub-pel position, for which the two-dimensional filter coefficients are calculated, are equal. This contributes to reducing the data overhead. Instead of transmitting all filter coefficients, only a reduced number of filter coefficients has to be transmitted.
According to another aspect of the invention, the filter coefficients are coded. The coding may be based on a temporal prediction, wherein the differences of a first filter set with respect to a second filter set have to be transmitted. It is also possible to base the coding on a spatial prediction, wherein the symmetry of the statistical properties of the video signal is exploited. The two-dimensional filter coefficients of a second sub-pel can be predicted by interpolating the impulse response of a filter set up from the two-dimensional filter coefficients of a first sub-pel, such that the result is used for the second sub-pel. Coding the filter coefficients provides a further reduction of the amount of data to be transmitted from an encoder to a decoder.
According to another aspect of the invention, the standard representation form of a filter having one-dimensional filter coefficients is replaced by the corresponding two-dimensional form of the filter. Accordingly, the means provided to encode or decode a video signal can be configured to fulfil only the requirements for a two-dimensional representation form even though two-dimensional and one-dimensional filter sets are used.
The method according to the present invention supports all kinds of filtering, such as, for example, a Wiener filter having fixed coefficients. The two-dimensional filter can also be a polyphase filter.
According to an aspect of the invention, different filters are provided for different regions of a picture, such that several sets of filter coefficients can be transmitted and the method comprises the step of indicating which filter set is to be used for a specific region. Accordingly, it is not necessary to transmit all individual sets of filter coefficients, if these sets are identical for different regions. Instead of conveying the data related to the filter coefficients repeatedly from the encoder to the decoder, a single flag or the like is used to select the filter set for a specific region. The region can be a macroblock or a slice. In particular, for a macroblock, it is possible to signal the partition id.
According to another aspect of the invention, a different method for encoding a video signal representing a moving picture by use of a motion-compensated prediction is provided. The method includes the steps of receiving successive frames of a video signal, coding a frame of the video signal using a reference frame of the video signal, and calculating a value of the sub-pel position independently by minimisation of an optimisation criterion in an adaptive manner. According to this aspect of the invention, the calculation of a value of a sub-pel position is not only carried out independently, but also by minimisation of an optimisation criterion in an adaptive manner. "In an adaptive manner" implies the use of an adaptive algorithm or iteration. Providing an adaptive solution enables the encoder to find an optimum solution with respect to a certain optimisation criterion. The optimisation criterion may vary in time or for different locations of the sub-pel, entailing a continuously adapted optimum solution. This aspect of the invention can be combined with the step of calculating the value of the sub-pel position analytically by use of a filter having an individual set of two-dimensional filter coefficients, such that the filter coefficients are calculated adaptively. The optimisation criterion can be based on the rate-distortion measure or on the prediction error energy. The calculation can be carried out by setting up an individual set of equations for the filter coefficients of each sub-pel position. In particular, with the prediction error energy as the optimisation criterion, it is possible to first compute the derivative of the prediction error energy in order to find an optimum solution. The set of two-dimensional filter coefficients can also profit from setting those two-dimensional filter coefficients equal for which the distance of the corresponding full-pel position to the current sub-pel position is equal. The step of equating can be based on statistical properties of the video signal, a still picture, or any other criterion. The two-dimensional filter coefficients can be coded by means of a temporal prediction, wherein the differences of a first filter set with respect to a second filter set (e.g. the one used for the previous image, picture or frame) have to be determined. The filter coefficients can also be coded by a spatial prediction, wherein the symmetry of the statistical properties of the video signal is exploited as set out before. The two-dimensional filter can be a polyphase filter.
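As an illustrative sketch only, and under the assumption that the prediction error energy is used as the optimisation criterion, the adaptive calculation can be set up as a least-squares problem per sub-pel position (setting the derivative of the energy to zero yields a linear system); all names below are hypothetical:

```python
# Hedged sketch: estimate the 36 two-dimensional coefficients of one sub-pel
# position by minimising the prediction error energy over all observations
# (6x6 reference supports and the original samples they should predict).
import numpy as np

def estimate_filter_for_subpel(supports, targets):
    """supports: list of 6x6 reference patches; targets: list of original samples."""
    A = np.stack([s.reshape(-1) for s in supports])   # one observation per row
    b = np.asarray(targets, dtype=float)
    coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)    # minimises ||A h - b||^2
    return coeffs.reshape(6, 6)
```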
Different filters can be provided for different regions of a picture, such that several sets of filter coefficients can be transmitted and the method may comprise a step of indicating which filter set is to be used for a specific region. This can be done by a specific flag provided in the coding semantics. The region can be a macroblock or a slice, wherein the partition id can be signalled for each macroblock.
According to another aspect of the invention, a method is provided for encoding and decoding a video signal. The method provides an adaptive filter flag in the syntax of a coding scheme. The adaptive filter flag indicates whether a specific filter is used or not. This is particularly useful, since an adaptive filtering step may not be beneficial for all kinds of video signals. Accordingly, a flag (adaptive filter flag) is provided in order to switch the adaptive filter function on or off.
According to another aspect of the invention, a sub-pel is selected for which, among a plurality of sub-pels, a filter coefficient is to be transmitted. This information is included for example in a coding scheme or a coding syntax. Similarly, it can be indicated whether a set of filter coefficients is to be transmitted for the selected sub-pel. This measure takes account of the fact that filter coefficients are not always calculated for all sub-pels. In order to reduce the data overhead, it is possible to transmit only the differences of a present set of filter coefficients with respect to a previous set of filter coefficients. Further, it is possible to code the differences according to entropy coding for any selected sub-pel. The adaptive filter flag can be introduced in the picture parameter set raw byte sequence payload syntax of the coding scheme. This is only one example for a position of an adaptive filter flag in the coding syntax. Other flags may be provided to indicate whether an adaptive filter is used for a current macroblock, another region of a picture, or for B- or P-slices.
The present invention provides also an apparatus for encoding a video signal representing a moving picture by use of motion compensated prediction. An apparatus according to the present invention comprises means for receiving successive frames of a video signal, means for coding the frame of the video signal using a reference frame of the video signal, and means for calculating analytically a value of a sub-pel position of the reference frame by use of a filter having an individual set of two-dimensional filter coefficients.
According to another preferred embodiment, the apparatus according to the present invention may include means for receiving successive frames of a video signal, means for coding a frame of the video signal using a reference frame of the video signal, and means for calculating a value of a sub-pel position independently by minimisation of an optimisation criterion in an adaptive manner.
The present invention also provides a respective method for decoding a coded video signal that has been encoded according to the method for encoding the video signal as set out above, and an apparatus for decoding a coded video signal comprising means to carry out the method for decoding.
The methods and apparatuses for encoding and decoding as well as the coding semantics explained above are applicable to scalable video. It is an aspect of the present invention to provide the methods and apparatuses explained above for scalable video, wherein an independent filter set is used for a layer or a set of layers of the scalable video coding. The filter set for a second layer is predicted from a filter set of a first layer. The layers are typically produced by spatial or temporal decomposition.
These and other aspects of the invention are apparent from and will be elucidated by reference to the embodiments described hereinafter and with respect to the following figures.
The present invention relates to an adaptive interpolation filter, which is independently estimated for every image. This approach makes it possible to take into account the alteration of image signal properties, especially aliasing, on the basis of a minimization of the prediction error energy. According to another aspect of the invention, an approach is disclosed for efficient coding of the filter coefficients, which is required especially at low bit rates and for videos with low spatial resolution. In the following section, the new interpolation filter scheme is described. According to a further aspect of the invention, an optimized low-overhead syntax that allows unambiguous decoding of the filter coefficients is disclosed.
In order to achieve the practical bound for the gain obtained by means of an adaptive filter, another kind of adaptive filter has been developed. For every sub-pel position SP (a . . . o), see
In the following, we describe the calculation of the filter coefficients more precisely. Let us assume that h00SP, h01SP, . . . , h54SP, h55SP are the 36 filter coefficients of a 6×6-tap 2D filter used for a particular sub-pel position SP. Then the value pSP (a . . . o) to be interpolated is computed by a convolution:
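The convolution itself is not reproduced in this text; a reconstruction consistent with the surrounding definitions (the 36 coefficients weighting the 6×6 integer support) would read:

$$p^{SP} = \sum_{i=0}^{5} \sum_{j=0}^{5} h_{ij}^{SP}\, P_{i,j}$$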
where Pi,j is an integer sample value (A1 . . . F6).
The calculation of coefficients and the motion compensation are performed in the following steps:
x̃ = x + ⌊mvx⌋ − FO,  ỹ = y + ⌊mvy⌋ − FO
The filter coefficients have to be quantized and transmitted as side information, e.g. using an intra/inter prediction and entropy coding (see the section "Prediction and Coding of the Filter Coefficients").
Transmitting 360 filter coefficients may result in a high additional bit rate, which can drastically reduce the coding gain, especially for video sequences with small spatial resolution. In order to reduce the side information, we assume that the statistical properties of an image signal are symmetric.
Thus, the filter coefficients are assumed to be equal if the distances of the corresponding full-pel positions to the current sub-pel position are equal (distance equality between the pixels in the x- and y-direction is also assumed, i.e. if the image signal is interlaced, a scaling factor should be considered, etc.).
Let us denote hC1a as the filter coefficient used for computing the interpolated pixel at sub-pel position a at the integer position C1, depicted in
hC1a=hA3d=hC6c=hF3l
hC2a=hB3d=hC5c=hE3l
hC3a=hC3d=hC4c=hD3l
hC4a=hD3d=hC3c=hC3l
hC5a=hE3d=hC2c=hB3l
hC6a=hF3d=hC1c=hA3l
The same assumptions, applied at sub-pel positions b and h, result in 3 coefficients for these sub-pel positions:
hC1b=hC6b=hA3h=hF3h
hC2b=hC5b=hB3h=hE3h
hC3b=hC4b=hC3h=hD3h
In the same way, we get 21 filter coefficients for sub-pel positions e, g, m, o, 18 filter coefficients for sub-pel positions f, i, k, n, and 6 filter coefficients for the sub-pel position j.
hA1e=hA6g=hF1m=hF6o
hA2e=hB1e=hA5g=hB6g=hE1m=hF2m=hE6o=hF5o
hA3e=hC1e=hA4g=hC6g=hD1m=hF3m=hD6o=hF4o
hA4e=hD1e=hA3g=hD6g=hC1m=hF4m=hC6o=hF3o
hA5e=hE1e=hA2g=hE6g=hB1m=hF5m=hB6o=hF2o
hA6e=hF1e=hA1g=hF6g=hA1m=hF6m=hA6o=hF1o
hB2e=hB5g=hE2m=hE5o
hB3e=hC2e=hB4g=hC5g=hD2m=hE3m=hD5o=hE4o
hB4e=hD2e=hB3g=hD5g=hC2m=hE4m=hC5o=hE3o
hB5e=hE2e=hB2g=hE5g=hB2m=hE5m=hB5o=hE2o
hB6e=hF2e=hB1g=hF5g=hA2m=hE6m=hA5o=hE1o
hC3e=hC4g=hD3m=hD4o
hC4e=hD3e=hC3g=hD4g=hC3m=hD4m=hC4o=hD3o
hC5e=hE3e=hC2g=hE4g=hB3m=hD5m=hB4o=hD2o
hC6e=hF3e=hC1g=hF4g=hA3m=hD6m=hA4o=hD1o
hD4e=hD3g=hC4m=hC3o
hD5e=hE4e=hD2g=hE3g=hB4m=hC5m=hB3o=hC2o
hD6e=hF4e=hD1g=hF3g=hA4m=hC6m=hA3o=hC1o
hE5e=hE2g=hB5m=hB2o
hE6e=hF5e=hE1g=hF2g=hA5m=hB6m=hA2o=hB1o
hF6e=hF1g=hA6m=hA1o
hA1f=hA6f=hA1i=hF1i=hA6k=hF6k=hF1n=hF6n
hA2f=hA5f=hB1i=hE1i=hB6k=hE6k=hF2n=hF5n
hA3f=hA4f=hC1i=hD1i=hC6k=hD6k=hF3n=hF4n
hB1f=hB6f=hA2i=hF2i=hA5k=hF5k=hE1n=hE6n
hB2f=hB5f=hB2i=hE2i=hB5k=hE5k=hE2n=hE5n
hC1f=hC6f=hA3i=hF3i=hA4k=hF4k=hD1n=hD6n
hC2f=hC5f=hB3i=hE3i=hB4k=hE4k=hD2n=hD5n
hC3f=hC4f=hC3i=hD3i=hC4k=hD4k=hD3n=hD4n
hD1f=hD6f=hA4i=hF4i=hA3k=hF3k=hC1n=hC6n
hD2f=hD5f=hB4i=hE4i=hB3k=hE3k=hC2n=hC5n
hD3f=hD4f=hC4i=hD4i=hC3k=hD3k=hC3n=hC4n
hE1f=hE6f=hA5i=hF5i=hA2k=hF2k=hB1n=hB6n
hE2f=hE5f=hB5i=hE5i=hB2k=hE2k=hB2n=hB5n
hE3f=hE5f=hB5i=hE5i=hB2k=hE2k=hB2n=hB5n
hF1f=hF6f=hA6i=hF6i=hA2k=hF1k=hA1n=hA6n
hF2f=hF5f=hB6i=hE6i=hA2k=hF2k=hA2n=hA5n
hF3f=hF4f=hC6i=hD6i=hA3k=hF3k=hA3n=hA4n
hA1j=hA6j=hF1j=hF6j
hA2j=hA5j=hB1j=hB6j=hE1j=hF2j=hE6j=hF5j
hA3j=hA4j=hC1j=hC6j=hD1j=hD6j=hF3j=hF4j
hB2j=hB5j=hE2j=hE5j
hB3j=hB4j=hC2j=hC5j=hD2j=hD5j=hE3j=hE4j
hC3j=hC4j=hD3j=hD4j
In total, this reduces the number of needed filter coefficients from 360 to 54, exploiting the assumption that the statistical properties of an image signal are symmetric. In the following section we describe how the filter coefficients can be predicted and coded. In some cases (e.g. interlaced video), we can no longer assume that the horizontal and vertical filter sets are equal. Then, vertical and horizontal symmetries have to be assumed independently of each other.
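Purely as an illustration of how such a symmetry can be exploited in an implementation, the following sketch expands a reduced coefficient set back to a full 6×6 matrix for sub-pel position f, whose horizontal mirror symmetry (h[i][j] = h[i][5−j]) leaves 18 unique values; names and layout are assumptions:

```python
# Illustrative sketch (assumed names): position f lies on the vertical line
# halfway between the two centre columns, so h[i][j] == h[i][5 - j].
# Only the left half of the 6x6 filter (18 values) then needs to be stored
# or transmitted.
import numpy as np

def expand_f_filter(unique18):
    """Expand 18 unique coefficients (6 rows x 3 left columns) to a full 6x6 filter."""
    left = np.asarray(unique18, dtype=float).reshape(6, 3)
    return np.hstack([left, left[:, ::-1]])  # mirror the left half onto the right

full = expand_f_filter(range(18))
assert np.allclose(full[:, :3], full[:, :2:-1])  # symmetry check: h[i][j] == h[i][5-j]
```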
After a quantization of the filter coefficients, a combination of two prediction schemes is proposed. The first type is a temporal (inter) prediction, whereby the differences of the current filter set with respect to the filter set used for the previous image have to be transmitted. This type of coding is applied for the filter coefficients at sub-pel positions a and b. The second type is a spatial (intra) prediction. Exploiting the symmetry of the statistical properties of an image signal, and knowing that no bilinear interpolation is used, the coefficients of the 2D filters for the different sub-pel positions can be regarded as samples of a common 2D filter, also called a polyphase filter. So, knowing the impulse response of the common filter at particular positions, we can predict its impulse response at other positions by interpolation.
This process is depicted in
Thus, only entropy coded differences have to be transmitted.
So, with ha and hb, and accordingly hc, hd, hh and hi, we can predict 2D filter coefficients by multiplication:
he=hd·ha
hf=hd·hb
hj=hh·hb
Alternatively, knowing the impulse response of the polyphase filter at particular sub-pel positions, we can predict the impulse response at remaining sub-pel positions applying spline or other interpolation functions.
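Read as an outer product of the one-dimensional vertical and horizontal filters, the multiplicative prediction above can be sketched as follows; this interpretation and all names are assumptions made for illustration, not the normative procedure:

```python
# Hedged sketch of the multiplicative (spatial) prediction he = hd * ha,
# interpreted as an outer product of the 1D vertical filter (position d) and
# the 1D horizontal filter (position a). Only the entropy-coded differences
# between the true coefficients and such a prediction would then be sent.
import numpy as np

def predict_2d_filter(h_vertical, h_horizontal):
    """Predict a 6x6 two-dimensional filter from two 6-tap one-dimensional filters."""
    return np.outer(np.asarray(h_vertical, float), np.asarray(h_horizontal, float))

# Example with the quarter-pel taps listed later in the text (an assumption
# that the adaptive 1D filters resemble them); vertical taps equal by symmetry.
h_a = np.array([1, -5, 52, 20, -5, 1]) / 64.0
h_e_predicted = predict_2d_filter(h_a, h_a)
```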
In order to reduce the complexity required for the realization of two different approaches, namely the standard separable filter and an adaptive non-separable 2D filter, we propose to bring the standard coefficients into the 2D form. In this case, 15 different matrices containing interpolation filter coefficients have to be stored (if the displacement vector resolution is restricted to quarter-pel). For the sub-pel positions a, b, c, d, h, l, located on a row or on a column, only 6 coefficients are used:
a, dT: [1 −5 52 20 −5 1] · 2^(−6)
b, hT: [1 −5 20 20 −5 1] · 2^(−5)
c, lT: [1 −5 20 52 −5 1] · 2^(−6)
For the remaining sub-pel positions, 2D matrices with up to 36 coefficients have to be used, which can be derived in the same manner. As an example, a matrix for the position f is given:
The matrix coefficients for the sub-pel positions i, n, k can be obtained by rotating the matrix used for the sub-pel position f by 90°, 180° and 270° in the mathematical sense, respectively.
The same can be applied at sub-pel positions e, g, m and o. The coefficient matrix for the sub-pel position e is given as example.
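As a hedged illustration of how such a matrix can be derived, the sketch below builds the 2D form for position f from the 1D standard taps, assuming the H.264/AVC construction f = (b + j)/2, where b is the horizontal half-pel on row C and j is the centre half-pel obtained by also filtering vertically; the intermediate integer rounding of the standard is ignored, so the result is only an approximation of the exact integer arithmetic:

```python
# Sketch of deriving the 2D form of the standard filter for position f.
import numpy as np

w = np.array([1, -5, 20, 20, -5, 1], dtype=float)

matrix_b = np.zeros((6, 6))
matrix_b[2, :] = w / 32.0                 # b: 1D horizontal filter applied on row C
matrix_j = np.outer(w, w) / 1024.0        # j: 6-tap applied horizontally and vertically
matrix_f = 0.5 * (matrix_b + matrix_j)    # f = (b + j) / 2, as a single 6x6 matrix

assert abs(matrix_f.sum() - 1.0) < 1e-9   # interpolation filter coefficients sum to 1
```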
Replacing the 1D standard filter with the corresponding 2D form would give the following advantages:
As already shown, the coefficients of the 2D filter sets can be regarded as samples of one common 2D filter, sampled at different positions. Since the standard filter as used in H.264 uses a bilinear interpolation for quarter-pel positions, its impulse and frequency response diverge from those of the Wiener filter. In order to show that the standard interpolation filter applied at quarter-pel positions is far away from the Wiener filter, which is the optimal one if fixed coefficients are preconditioned, the frequency responses of both the Wiener filter, applied at half-pel positions, and a bilinear filter, applied at quarter-pel positions, are depicted in
Thus, we propose to use a two-dimensional Wiener filter with fixed coefficients, as described in the section "Prediction and Coding of the Filter Coefficients". By selecting the number of bits used for the quantization of the filter coefficients, the desired approximation accuracy for the optimal 2D Wiener filter can be achieved. Applying this approach does not require a non-separable 2D filter set; thus, separable filters can also be deployed.
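A minimal sketch of the coefficient quantization referred to above, assuming a plain fixed-point representation with a selectable number of fractional bits (the representation actually used may differ):

```python
# Illustrative sketch: the approximation accuracy of the fixed 2D Wiener
# filter is controlled by the number of fractional bits per coefficient.
import numpy as np

def quantize_coeffs(coeffs, fractional_bits):
    """Round each coefficient to the chosen number of fractional bits."""
    scale = 1 << fractional_bits
    return np.round(np.asarray(coeffs, float) * scale) / scale
```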
It is possible that different parts of an image contain different aliasing components. One reason may be that an image contains different objects, which move differently. Another reason may be that an image contains different textures. Each texture can have different aliasing components. Thus, using different filters that are adapted to different regions can improve the prediction. In this case, we would transmit several sets of filter coefficients. In addition, we would transmit a partition of each image indicating which filter set is valid for each region. A preferred embodiment signals the partition id for each macroblock. Alternatively, this partition could be defined as a slice as used in H.264 or MPEG-4.
As we already mentioned, the introduced approach is not restricted to the described settings, such as quarter-pel motion resolution and a 6×6-tap filter size. Depending on the requirements, the filter can either be extended to an 8×8-tap filter, which would result in a better prediction quality but also increase the computational effort, or be reduced to a 4×4-tap filter. Using the same techniques described above, we can extend the approach to e.g. ⅛-pel motion resolution. As we showed, it is not necessary to develop extra filter coefficients. Instead, we can exploit the polyphase structure of the 2D filter and predict the best filter coefficients with high accuracy.
It is also conceivable to use several filter sets, one for each reference frame. Thus, the approach proposed in the section "Non-separable two-dimensional Adaptive Wiener Interpolation Filter" can be applied to each reference frame independently. However, this would increase the side information.
Another extension is defining a set of n predetermined filter sets or n predetermined filters. For each frame, just the index of one or more of the predetermined filter sets is transmitted. Thus, the analytically calculated optimal filter is mapped to the best predetermined filter set or filter of the set. So, only the index of the predetermined filter set or filter (if necessary, entropy coded) needs to be transmitted.
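For illustration, a sketch of the mapping to the best predetermined filter set follows; the distance measure (sum of squared coefficient differences) is an assumption, and a rate-distortion measure or the prediction error could be used instead:

```python
# Illustrative sketch: the encoder maps the analytically calculated filter to
# the nearest of n predetermined filter sets and transmits only its index.
import numpy as np

def best_predefined_index(calculated, predefined_sets):
    """Return the index of the predefined filter set nearest to the calculated one."""
    costs = [float(np.sum((calculated - p) ** 2)) for p in predefined_sets]
    return int(np.argmin(costs))          # only this index needs to be transmitted
```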
This section describes exemplary syntax and semantics which allows the invented scheme to be incorporated into the H.264/AVC standard.
With the introduction of the adaptive interpolation filter scheme, the adaptive filter scheme can be switched on or off by the encoder. For this purpose, we introduce an adaptive_filter_flag in the picture parameter set raw byte sequence payload syntax.
This flag indicates to the decoder whether the adaptive interpolation scheme is applied for the current sequence (adaptive_filter_flag = 1) or not (adaptive_filter_flag = 0).
adaptive_filter_flagB equal to 1 indicates that the adaptive interpolation scheme is in use for B-slices. adaptive_filter_flagB equal to 0 indicates that the adaptive interpolation scheme is not in use for B-slices.
For all of those slice headers where the adaptive interpolation scheme is in use, the entropy coded filter coefficients are transmitted by the encoder.
This indicates to the decoder that, if adaptive_filter_flag is set to 1 and the current slice is a P-slice, then the entropy coded filter coefficients are transmitted. First, use_all_subpel_positions is transmitted. use_all_subpel_positions equal to 1 specifies that all independent filter subsets are in use. use_all_subpel_positions equal to 0 indicates that not every sub-pel position sub_pel (a . . . o) has been used by the motion estimation tool and positions_pattern is transmitted. positions_pattern[sub_pel] equal to 1 specifies that FilterCoef[sub_pel][i] is in use, where FilterCoef represents the actually transmitted optimal filter coefficients.
Since use_all_subpel_positions signals whether every sub-pel position is in use, positions_pattern cannot be equal to 1111. If use_all_subpel_positions is equal to 0 and the first four entries of positions_pattern are equal to 1, the last entry (j_pos) must be equal to 0 and is not transmitted.
Then, for every sub-pel position for which the filter coefficients have been calculated, the entropy coded (here, using CAVLC) quantized differences DiffFilterCoef (see the section "Prediction and Coding of the Filter Coefficients") are transmitted. Thus, the reconstructed filter coefficients are obtained by adding the differences to the predicted filter coefficients.
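A hedged sketch of this decoder-side reconstruction, with the decoded, dequantized differences standing in for DiffFilterCoef and with hypothetical names:

```python
# Illustrative sketch: reconstructed coefficients = predicted coefficients
# plus decoded differences, per signalled sub-pel position.
def reconstruct_filters(predicted, diff):
    return {sub_pel: predicted[sub_pel] + diff[sub_pel] for sub_pel in diff}
```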
A similar scheme can be applied to a scalable video coder, where for each layer (or for several layers) either independent filter sets or a common filter set are used. In case each layer uses an independent filter set, it can be predicted from the lower to the upper layer.
Since applying one adaptive filter set to the entire image results only in averaged improvements, not every macroblock is necessarily coded more efficiently. To ensure the best coding efficiency for every macroblock, an additional step can be performed at the encoder, whereby for each macroblock two filter sets, the standard and the adaptive one, are compared. For those macroblocks where the adaptive filter is better (e.g. in terms of a rate-distortion criterion), a new filter is calculated and only this one is transmitted. For the remaining macroblocks, the standard interpolation filter is applied. In order to signal whether the adaptive or the standard filter is applied to the current macroblock, an additional flag has to be transmitted for each macroblock.
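As an illustration of this per-macroblock decision, a sketch of a simple rate-distortion comparison follows; the Lagrangian cost and its weighting are assumptions, and any rate-distortion criterion could serve:

```python
# Illustrative sketch: compare the rate-distortion cost of the standard and
# the adaptive filter for one macroblock and set the per-macroblock flag
# (defined below) accordingly.
def choose_filter_for_mb(dist_standard, rate_standard, dist_adaptive, rate_adaptive, lam):
    cost_standard = dist_standard + lam * rate_standard
    cost_adaptive = dist_adaptive + lam * rate_adaptive
    return 1 if cost_adaptive < cost_standard else 0
```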
adaptive_filter_in_current_mb equal to 1 specifies that the adaptive filter is in use for the current macroblock. adaptive_filter_in_current_mb equal to 0 specifies that the standard (fixed) filter is in use for the current macroblock.
Alternatively, another adaptive filter can be calculated for all those macroblocks for which the standard (fixed) filter has been chosen. The filter coefficients of this filter are transmitted in the same manner as described in the previous section. In that case, the adaptive_filter_in_current_mb flag would switch between two filter sets. The adaptive_filter_in_current_mb flag can be predicted from neighboring, already decoded macroblocks so that only the prediction error for the adaptive_filter_in_current_mb flag is transmitted. If entropy coding is used (e.g. arithmetic coding, CABAC), this flag can be coded with less than 1 bit/flag.
In some cases, e.g. if an image consists of different textures, it is conceivable to use several independent filters. These can be filter coefficient sets calculated independently for every image, or one of a set of pre-defined filter sets, or a combination of both. For this purpose, a filter number has to be transmitted for each macroblock (or set of, e.g., neighboring macroblocks). Furthermore, this filter set can be predicted starting from neighboring, already decoded macroblocks. Thus, only entropy coded differences (CAVLC, CABAC) have to be transmitted.
The present invention is beneficial for a broad variety of applications such as digital cinema, video coding, digital TV, DVD, Blu-ray, HDTV, and scalable video. All these applications will profit from one or more aspects of the present invention. The present invention is in particular dedicated to improving the MPEG-4 Part 10 / H.264/AVC standard. In order to enhance the coding schemes and coding syntax of these standards, particular semantics are disclosed which may comply with the standard requirements. However, the basic principle of the present invention should not be constrained to any particular syntax given on the previous pages, but will be acknowledged by the person skilled in the art in a much broader sense.
Filing Document | Filing Date | Country | Kind | 371(c) Date
---|---|---|---|---
PCT/EP2006/003410 | 4/13/2006 | WO | 00 | 8/14/2008

Number | Date | Country
---|---|---
60594494 | Apr 2005 | US
60595941 | Aug 2005 | US