The present invention generally relates to the field of data compression and, more specifically, to an encoding method applied to a video sequence divided into successive groups of frames (GOFs) themselves subdivided into successive couples of frames (COFs) including a reference frame and a current frame, said method comprising the following steps:
Although network bandwidth and storage capacity of digital devices are increasing rapidly, video compression still plays an essential role, owing to the exponential growth in size of multimedia content. Moreover, many applications require not only a high compression efficiency but also an enhanced flexibility. For instance, SNR scalability is highly needed to transmit a video over heterogeneous networks, while spatial/temporal scalability is required so that the same compressed video bitstream may be decoded by different types of digital terminals, according to their computational, display and memory capabilities.
Current standards like MPEG-4 implement a limited scalability in a predictive DCT-based framework, through additional high-cost layers. More efficient solutions, based on a 3D wavelet decomposition followed by a hierarchical encoding of the spatio-temporal trees, have recently been proposed as an extension of still-image coding techniques to video coding. A 3D, or (2D+t), wavelet decomposition of the sequence of frames, considered as a 3D volume, provides a natural spatial resolution and frame rate scalability. The coefficients generated by the wavelet transform constitute a hierarchical pyramid in which the spatio-temporal relationship is defined by means of 3D orientation trees evidencing the parent-offspring dependencies between coefficients; the in-depth scanning of these coefficients in the hierarchical trees, together with the progressive bitplane encoding technique, leads to the desired quality scalability. A higher flexibility is thus obtained at a reasonable cost in terms of coding efficiency.
Some prior implementations are based on that approach. In such implementations, the input video sequence is generally divided into Groups of Frames (GOFs), and each GOF, itself subdivided into successive couples of frames (which are as many inputs for a so-called Motion-Compensated Temporal Filtering, or MCTF module), is first motion-compensated (MC) and then temporally filtered (TF) as shown in
When a Haar multiresolution analysis is used for the temporal decomposition, one motion vector field is generated between every two frames of the considered group of frames at each temporal decomposition level, so that the number of motion vector fields is equal to half the number of frames in the temporal subband: for a GOF of eight frames, four motion vector fields at the first level, two at the second one, and one at the third one. Motion estimation (ME) and motion compensation (MC) are only performed every two frames of the input sequence, and the total number of ME/MC operations required for the whole temporal tree resulting from this MCTF operation is roughly the same as in a predictive scheme. Using these very simple filters, the low frequency temporal subband represents a temporal average of the input couples of frames, whereas the high frequency one contains the residual error after the MCTF step.
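As a sketch of the counting argument above (the helper name is hypothetical, and a GOF of eight frames is assumed, matching the four/two/one counts):

```python
def motion_vector_fields_per_level(gof_size):
    """Count the motion vector fields generated at each temporal
    decomposition level of a Haar MCTF: one field per (ref, cur)
    couple, i.e. half the number of frames in the subband."""
    counts = []
    frames = gof_size
    while frames > 1:
        counts.append(frames // 2)   # one field per couple of frames
        frames //= 2                 # only the low subbands are filtered again
    return counts

# For a GOF of 8 frames: [4, 2, 1] -> 7 fields in total, i.e. roughly
# the same ME/MC cost as a predictive scheme over the same 8 frames.
```

The geometric halving is why the whole temporal tree costs no more ME/MC work than a frame-by-frame predictive scheme.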
In such a 3D video coding scheme, the ME/MC operations are generally performed in the forward way, i.e. when performing the motion compensation into a couple of frames (i, i+1), i is displaced in the direction of motion towards i+1. If, as shown in the example of
This behaviour can be explained by the following temporal filtering equations (1) and (2), giving the MCTF equations for the low and high frequency subbands, in which the motion vector (v_i, v_j) is subtracted from the coordinates of both the reference and the low frequency subbands (A=reference frame; B=current frame):

L(i−v_i, j−v_j) = √2·A(i−v_i, j−v_j) + H(i, j) (1)

H(i, j) = (1/√2)·(B(i, j) − A(i−v_i, j−v_j)) (2)
Assuming that the prediction error is null, one has L = A·√2: the low frequency subband is then very similar to the reference frame. It will moreover be shown that, when the reconstruction is not perfect, these MCTF equations always reconstruct the reference frame better than the current one.
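A minimal one-pixel sketch of this Haar analysis step (motion compensation is omitted, i.e. the pixel is assumed already aligned; the function name is illustrative only). It shows that a null prediction error yields H = 0 and L = √2·A:

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_mctf_analyze(a_ref, b_cur):
    """One-pixel Haar temporal filtering (motion shift omitted):
    the high subband carries the prediction residual, and the low
    subband L = sqrt(2)*A + H = (A + B)/sqrt(2) is the scaled
    temporal average of the couple."""
    high = (b_cur - a_ref) / SQRT2
    low = SQRT2 * a_ref + high
    return low, high

low, high = haar_mctf_analyze(10.0, 10.0)   # perfect prediction
# high == 0 and low == sqrt(2) * 10, i.e. L = A * sqrt(2)
```

With full-precision subbands the couple is perfectly invertible (A = (L−H)/√2, B = (L+H)/√2); the asymmetry discussed next only appears when the high subband is truncated.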
The process of MCTF combined with block matching ME is described in
This processing does not however completely solve the problem of unconnected pixels, since it can be shown that, when the video bitstream is only partly decoded, they may still induce some perturbations in the spatio-temporal tree reconstruction.
Considering then a couple of low and high frequency subbands, it is supposed that no wavelet coefficient was transmitted for the high frequency one (H=0). The reconstruction equations for the A (reference) and B (current) frames, which are:

A′(i−v_i, j−v_j) = (1/√2)·(L(i−v_i, j−v_j) − H(i, j))

B′(i, j) = (1/√2)·(L(i−v_i, j−v_j) + H(i, j))

become, when H=0:

A′(i−v_i, j−v_j) = (1/√2)·L(i−v_i, j−v_j)

B′(i, j) = (1/√2)·L(i−v_i, j−v_j)

which correspond respectively to the reconstructed reference and current frames with no coefficient in the decoded high frequency subband. The corresponding reconstruction error is then given by the equations (9) and (10):

|A′−A|(i−v_i, j−v_j) = |ε(i, j)|/2 (9)

|B′−B|(i, j) = |ε(i, j)|/2 (10)
where ε is the prediction error. This proves that the error is equally distributed between A and B frames.
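This equal split can be checked numerically on a single connected pixel. The sketch below uses the Haar inverse (A′ = (L−H)/√2, B′ = (L+H)/√2) with the motion shift omitted for brevity:

```python
import math

SQRT2 = math.sqrt(2.0)

a, b = 100.0, 104.0                  # prediction error eps = b - a = 4
high = (b - a) / SQRT2               # high subband (discarded below)
low = SQRT2 * a + high               # low subband

# Decode while discarding the high frequency coefficient (H = 0):
a_rec = (low - 0.0) / SQRT2
b_rec = (low + 0.0) / SQRT2

# The reconstruction error is |eps|/2 = 2 on BOTH frames:
# |a_rec - a| == |b_rec - b| == 2
```

Both reconstructed pixels collapse onto the temporal average (a+b)/2 = 102, so each frame absorbs exactly half of the prediction error.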
For unconnected pixels, however, the conclusions are not the same. For an unconnected pixel of the reference frame, no temporal filtering is applied and the low frequency subband simply receives the scaled pixel value, L(i, j) = √2·A(i, j). The reconstruction equations (11) and (12):

A′(i, j) = (1/√2)·L(i, j) (11)

B′(i, j) = √2·H(i, j) + A′(i−v_i, j−v_j) (12)

become, when H=0:

A′(i,j)=A(i,j) (13)

B′(i, j) = A(i−v_i, j−v_j) (14)
which gives, for the reconstruction error, for unconnected pixels of the reference and current frames with no coefficient in the decoded high frequency subband, the following equations (15) and (16):

|A′−A|(i,j)=0 (15)

|B′−B|(i, j) = |ε(i, j)| (16)
In this case, the error is now entirely put on the current frame. Owing to the cascaded forward ME/MC, this error propagates in depth inside the temporal tree, leading to a quality drop within each half of the GOF and inducing some annoying visual effects.
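The same numerical check for an unconnected pixel of the reference frame (handled here, as a common choice, by simply scaling it into the low subband, L = √2·A) shows the error landing entirely on the current frame:

```python
import math

SQRT2 = math.sqrt(2.0)

a, b = 100.0, 104.0          # eps = b - a = 4
low = SQRT2 * a              # unconnected reference pixel: scaled copy only
high = (b - a) / SQRT2       # residual stored for the current frame

# Decode while discarding the high frequency coefficient (H = 0):
a_rec = low / SQRT2          # = a exactly: no error on the reference frame
b_rec = SQRT2 * 0.0 + a_rec  # = a: the whole |eps| = 4 hits the current frame
```

Contrary to the connected case, the reference is recovered exactly while the current frame carries the full prediction error, which is the asymmetry that cascades through the forward temporal tree.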
This kind of drift is a real issue in the (2D+t) video coding scheme, since a balanced temporal decomposition is a prerequisite for an efficient coding of the wavelet coefficients (the coefficients of the root subbands have offspring in the highest levels, and an assumption made for data compression is that the coefficients of a same line of descent have a similar behaviour).
Moreover, in the 3D subband coding approach, the temporal distance between these reference and current frames ((ref,cur) couple) increases with deeper temporal levels. If the temporal distance between two successive frames is considered as equal to 1, it is equal to 2 if there is one frame between them, and so on. Since, as explained just above, low frequency temporal subbands are very close to the input reference frames, it will be considered that they are located at the same instant as their reference, and, consequently, the notion of temporal distance can be simply extended to them. Based on this statement, it is possible to evaluate the temporal distance between frames (or subbands) at each temporal resolution level. As shown in
It is therefore the object of the invention to propose a video encoding method with which the shift leading to these artefacts is at least reduced.
To this end, the invention relates to a video encoding method as defined in the introductory part of the description, which is moreover characterized in that the direction of the motion estimation step is modified according to the considered couple of frames in the concerned GOF.
In an advantageous implementation of said encoding method, the direction of the motion estimation step is alternately a backward one and a forward one for the successive couples of frames of any concerned GOF.
This method provides closer couples of reference and current frames for ME/MC at deeper temporal decomposition levels, and it also leads to more balanced temporal approximations of the GOF at each temporal resolution level. A better distribution of the bit budget among the temporal subbands is therefore obtained, and the global efficiency over the whole GOF is improved. Especially at low bitrates, the overall quality of the reconstructed video sequence is improved.
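As an illustration (not the claimed implementation itself), the effect of alternating the ME direction can be simulated by tracking where each low subband sits in time: forward ME leaves the low subband at the earlier frame of the couple, backward ME at the later one. The alternation pattern below (starting backward) is one plausible choice:

```python
def one_level(positions, alternate):
    """Time positions of the low subbands after one Haar temporal level."""
    lows = []
    for k in range(0, len(positions), 2):
        backward = alternate and (k // 2) % 2 == 0
        # backward ME: reference (hence low subband) at the later frame;
        # forward ME: at the earlier frame
        lows.append(positions[k + 1] if backward else positions[k])
    return lows

def temporal_tree(gof_size, alternate):
    """Return (position of the final low subband,
    (ref, cur) distance of the first couple at each level)."""
    pos, dists = list(range(gof_size)), []
    while len(pos) > 1:
        dists.append(pos[1] - pos[0])
        pos = one_level(pos, alternate)
    return pos[0], dists

# Classical forward scheme on a GOF of 8: distances [1, 2, 4], final low at 0.
# Alternate scheme:                       distances [1, 1, 3], final low at 5.
```

The alternate scheme keeps the (ref, cur) couples closer at deep levels (1, 1, 3 instead of 1, 2, 4) and moves the lowest frequency subband toward the middle of the GOF, as claimed.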
In another implementation of the encoding method, the direction of the motion estimation step for the successive couples of frames of any concerned GOF is chosen according to an arbitrarily modified scheme in which the motion estimation and compensation operations are concentrated on a limited number of said couples of frames, selected according to an energy criterion.
By deciding to favour some frames to the detriment of the other ones inside a GOF, this method makes it possible to obtain an improved coding efficiency in a particular temporal area.
The invention will now be described in a more detailed manner, with reference to the accompanying drawings in which:
While in the 3D video coding scheme described above (in relation with
in which n is the temporal decomposition level, d_intra represents the intra-frame temporal distance within a GOF, or (ref, cur) couple distance, and d_inter represents the inter-frame temporal distance between two successive couples, in number of frame units.
With this solution, the lowest frequency temporal subbands are shifted towards the middle of the GOF, leading to a more balanced temporal decomposition. The quality degradation due to unconnected pixels is still present, but it is no longer cumulative across the successive temporal levels. The use of such a modified ME/MC in a 3D subband video compression scheme allows a clear and noticeable improvement of the coding efficiency at low bitrates, as illustrated in
However, when considering an extract from a sequence of frames in which the first part (for instance a first GOF) contains a high amount of motion (due to a camera panning, for instance) while there is almost no motion left in the second part (for instance a second GOF) of said extract (which shows for example a house), the following remarks can be made. At low bitrates, the first part of the extract (the first GOF) cannot be encoded correctly, owing to the high degree of motion: visually, the reconstructed video contains a lot of very annoying block artefacts induced by the block matching ME and the poor error encoding (one could get rid of these artefacts only at very high bitrates). It may then be proposed to change the motion estimation direction according to the motion content. However, if the considered sequence is coded with a classical forward scheme or with the alternate scheme, the end of the first GOF (this first GOF contains a high amount of motion, but said motion stops at the end of the GOF, so that said end is rather still) is of poor quality compared to the similar frames of the second GOF (which is completely still). The problem with these "still" frames at the end of the first GOF is that they suffer from being clustered in a same GOF with some previous frames which contain a high amount of motion.
It may then be proposed, on the basis of an energy criterion, to concentrate the ME and MC operations on the successive frames which, at said end of the first GOF, are quite similar (since they are still), and to "sacrifice" the middle ones, because they cannot be coded with a good quality anyway (the maximum allowed bitrate being not sufficient). An implementation of this solution is given in
Number | Date | Country | Kind
---|---|---|---
01403384.9 | Dec 2001 | EP | regional
02291984.9 | Aug 2002 | EP | regional

Filing Document | Filing Date | Country
---|---|---
PCT/IB02/05669 | 12/20/2002 | WO