The invention relates to an encoding method for the compression of a video sequence divided into groups of frames (GOFs) themselves subdivided into couples of frames, each of said GOFs being decomposed by means of a three-dimensional (3D) wavelet transform comprising successively, at each decomposition level, a motion compensation step between the two frames of each couple of frames, a temporal filtering step, and a spatial decomposition step of each temporal subband thus obtained, said motion compensation being based for each temporal decomposition level on a motion estimation performed at the highest spatial resolution level, the motion vectors thus obtained being divided by powers of two in order to obtain the motion vectors also for the lower spatial resolutions, the estimated motion vectors that allow the reconstruction of any spatial resolution level being encoded and put in the coded bitstream together with, and just before, the coded texture information formed by the wavelet coefficients at this given spatial level, said encoding operation being carried out on said estimated motion vectors at the lowest spatial resolution, only refinement bits of said motion vectors at each spatial resolution being then put in the coded bitstream refinement bitplane by refinement bitplane, from one resolution level to the other, and specific markers being introduced in said coded bitstream for indicating the end of the bitplanes, the temporal decomposition levels and the spatial decomposition levels respectively.
The invention also relates to a corresponding encoding device, to a transmittable video signal consisting of a coded bitstream generated by such an encoding device, to corresponding decoding devices, and to computer executable process steps for use in such decoding devices.
Video streaming over heterogeneous networks requires a high scalability capability, i.e. the possibility that parts of a bitstream can be decoded without a complete decoding of the coded sequence and can be combined to reconstruct the initial video information at lower spatial or temporal resolutions (spatial scalability, temporal scalability) or with lower quality (SNR or bitrate scalability). A convenient way to achieve these three types of scalability (spatial, temporal, SNR) is a three-dimensional subband decomposition of the input video sequence, after a motion compensation of said sequence. For the design of an efficient scalable video coding scheme, motion estimation and motion compensation are indeed key components, but with some contradictory requirements: they must provide a good temporal prediction while keeping the motion information overhead low, in order not to reduce drastically the bit budget available for texture encoding/decoding.
A fully scalable video coding method has already been described in the document WO 02/01881 (PHFR000070). The main characteristics of this method are first recalled, with reference to
At the decoder side, in the case of temporal scalability, in order to allow a progressive decoding, the bitstream has then been organized as described for example in
In the case of spatial scalability, in order to be able to reconstruct a reduced spatial resolution video, it then appeared as not desirable to transmit at the beginning of the bitstream the motion vector fields of full resolution, and the solution proposed to this end in the cited document was to adapt the motion described by the motion vectors to the size of the current spatial level: a low resolution motion vector field corresponding to the lowest spatial resolution was first transmitted, and the resolution of the motion vectors was progressively increased according to the increase in the spatial resolution, only the difference between a motion vector field resolution and another one being encoded and transmitted. In the technical solution thus described, the motion vectors are assumed to be obtained by means of a block-based motion estimation method, such as full-search block matching or any other derived solution, and the size of the blocks in the motion estimation must then be chosen carefully: indeed, if the original size of the block is 8×8 at the full resolution, it becomes 4×4 at the half resolution, then 2×2 at the quarter resolution, and so on; consequently, a problem may appear if the original size of the blocks is too small, and it must always be checked that the original size is compatible with the number of decomposition/reconstruction levels.
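One plausible reading of this compatibility check (a purely illustrative sketch; the function name and the acceptance threshold are assumptions, not taken from the cited document) is that the original block size must survive as many halvings as there are reconstruction levels:

```python
def block_size_ok(block_size, levels):
    """An 8x8 block becomes 4x4, 2x2, ... at lower resolutions; the
    original size must still be at least 1 pixel after `levels` halvings."""
    return (block_size >> levels) >= 1

# 8x8 blocks survive three halvings (down to 1x1) but not four:
assert block_size_ok(8, 3)
assert not block_size_ok(8, 4)
```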
With for instance s spatial decomposition levels, if one wants the motion vectors corresponding to all possible resolutions, either the initial motion vectors are divided by 2^s or a shift of s positions is performed, the result representing the motion vectors corresponding to the blocks of the lowest resolution, the size of which is divided by 2^s. A division by 2^(s−1) of the original motion vector would provide the next spatial resolution, but this value is already available from the previous operation: it corresponds to a shift of s−1 positions. The difference, with respect to the first operation, is the bit in the binary representation of the motion vector with a weight of 2^(s−1). It is then sufficient to add this bit (called refinement bit) to the previously transmitted vector to reconstruct the motion vector at a higher resolution, which is illustrated in
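The shift-and-refine mechanism just described can be sketched as follows for one non-negative motion vector component (an illustrative sketch only; sign handling and the actual coding of the bits are left aside, and the function names are hypothetical):

```python
def mv_at_level(mv_full, level):
    """Motion vector component at `level` resolutions below full:
    an arithmetic shift of `level` positions (division by 2**level)."""
    return mv_full >> level

def refine(mv_coarse, bit):
    """Append one refinement bit: move from one resolution to the next
    higher one by doubling the coarse vector and adding the bit."""
    return (mv_coarse << 1) | bit

s = 3                                  # number of spatial decomposition levels
v = 0b10110                            # full-resolution component (22)
coarse = mv_at_level(v, s)             # lowest-resolution vector: 0b10
b = (v >> (s - 1)) & 1                 # refinement bit, weight 2**(s-1)
assert refine(coarse, b) == mv_at_level(v, s - 1)   # next resolution recovered
```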
Thanks to this scalable motion vector encoding method (such as described in the cited document and hereinabove recalled), the hierarchy of the temporal and spatial levels has been transposed to the motion vector coding, which makes it possible to decode the motion information progressively: for a given spatial resolution, the decoder no longer has to decode parts of the bitstream that are not useful at that level. However, although said scalable vector encoding method ensures a fully progressive bitstream, the overhead of the motion information may become too high in the case of decoding at very low bitrate, leading to the following drawback: the inability to decode texture bits for lack of available budget, and therefore a very poor reconstruction quality.
It is therefore an object of the invention to propose an encoding method avoiding this drawback, and therefore better suited to the situation where high bitrate scalability must be obtained, i.e. when the decoding bitrate is much lower than the encoding bitrate.
To this end, the invention relates to an encoding method such as defined in the introductory part of the description and which is moreover characterized in that, for each temporal decomposition level, additional specific markers are introduced into said coded bitstream, for indicating in each spatial decomposition level the end of the motion vector information related to said spatial decomposition level.
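The effect of these additional markers can be sketched as follows (a purely illustrative sketch: the marker byte values, names and payloads are assumptions, not values specified by the method):

```python
# Hypothetical marker values, chosen for illustration only.
END_OF_MV = b'\xff\xa0'                # additional end-of-motion-vector marker
END_OF_SPATIAL_LEVEL = b'\xff\xa1'     # end-of-spatial-level marker

def write_spatial_level(out, mv_bits, texture_bits):
    """One spatial level of one temporal level: the motion vector bits,
    the additional end-of-motion-vector marker proposed by the method,
    then the texture bits and the end-of-spatial-level marker."""
    out.extend(mv_bits)
    out.extend(END_OF_MV)              # lets a decoder skip straight to texture
    out.extend(texture_bits)
    out.extend(END_OF_SPATIAL_LEVEL)

stream = bytearray()
write_spatial_level(stream, b'MV-BITS', b'TEXTURE-BITS')
# A decoder giving up on the motion vectors can jump past the marker:
texture_start = stream.index(END_OF_MV) + len(END_OF_MV)
assert stream[texture_start:].startswith(b'TEXTURE-BITS')
```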
Another object of the invention is to propose an encoding device for carrying out said encoding method.
To this end, the invention relates to a device for encoding a video sequence divided into groups of frames (GOFs) themselves subdivided into couples of frames, each of said GOFs being decomposed by means of a three-dimensional (3D) wavelet transform comprising successively, at each decomposition level, a motion compensation step between the two frames of each couple of frames, a temporal filtering step, and a spatial decomposition step of each temporal subband thus obtained, said motion compensation being based for each temporal decomposition level on a motion estimation performed at the highest spatial resolution level, the motion vectors thus obtained being divided by powers of two in order to obtain the motion vectors also for the lower spatial resolutions, the estimated motion vectors that allow the reconstruction of any spatial resolution level being encoded and put in the coded bitstream together with, and just before, the coded texture information formed by the wavelet coefficients at this given spatial level, said encoding operation being carried out on said estimated motion vectors at the lowest spatial resolution, only refinement bits of said motion vectors at each spatial resolution being then put in the coded bitstream refinement bitplane by refinement bitplane, from one resolution level to the other, and specific markers being introduced in said coded bitstream for indicating the end of the bitplanes, the temporal decomposition levels and the spatial decomposition levels respectively, said encoding device comprising motion estimation means, for determining from said video sequence the motion vectors associated to all couples of frames, 3D wavelet transform means, for carrying out within each GOF, on the basis of said video sequence and said motion vectors, successively a motion compensation step, a temporal filtering step, and a spatial decomposition step, and encoding means, for coding both the coefficients issued from said transform means and the motion vectors delivered by said motion estimation means and yielding said coded bitstream, said encoding device being further characterized in that it also comprises means for introducing into said coded bitstream additional specific markers for indicating in each spatial decomposition level the end of the motion vector information related to said spatial decomposition level.
The invention also relates to a transmittable video signal consisting of a coded bitstream generated by such an encoding device, said coded bitstream being characterized in that it comprises additional specific markers for indicating in each spatial decomposition level the end of the motion vector information related to said spatial decomposition level.
Another object of the invention is to propose a device for decoding a bitstream generated by carrying out the encoding method such as proposed.
To this end, the invention relates either to a device for decoding a coded bitstream generated by carrying out the above-described encoding method, said decoding device comprising decoding means, for decoding in said coded bitstream both coefficients and motion vectors, inverse 3D wavelet transform means, for reconstructing an output video sequence on the basis of the decoded coefficients and motion vectors, and resource controlling means, for determining before each motion vector decoding process the amount of bit budget already spent and for deciding, on the basis of said amount, whether or not to stop the decoding operation concerning the motion information, by means of a skipping operation of the residual part of said motion information, or to a device for decoding a coded bitstream generated by carrying out said encoding method, said decoding device comprising decoding means, for decoding in said coded bitstream both coefficients and motion vectors, inverse 3D wavelet transform means, for reconstructing an output video sequence on the basis of the decoded coefficients and motion vectors, and resource controlling means, for determining before each motion vector decoding process the amount of bit budget already spent and for deciding, on the basis of said amount, whether or not to stop the decoding operation concerning the motion information and the residual part of the concerned spatial decomposition level, by means of a skipping operation of the residual part of said motion information and of the following residual part of the concerned spatial decomposition level.
The invention also relates to computer executable process steps for use in such decoding devices.
The present invention will now be described, by way of example, with reference to the accompanying drawings in which:
The solution illustrated in
Under these particular circumstances, it is proposed, according to the invention, to focus on texture bit decoding to the detriment of motion vector decoding and to introduce, during the implementation of the decoding process, a decision on whether or not to continue decoding the motion vectors. Given a certain decoding bitrate, the amount of bit budget already spent is checked before each motion vector decoding process (approximation MV1 or further MVi). If this amount exceeds a certain percentage (M %) of the total bit budget, the motion overhead is assumed to be too high to allow the decoding of further detail bitplanes, and it is decided not to decode the remaining parts of the motion information, so as to save bits for the following texture coefficients. In order to be able to implement this technical solution, the decoder must be able to skip the parts of the bitstream corresponding to the motion vectors, so as to jump directly to the next texture part. For instance in
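The budget test and the skipping operation can be sketched as follows (an illustrative sketch only: the marker value, the `Reader` stub and the function names are assumptions, not elements specified by the invention):

```python
END_OF_MV = b'\xff\xa0'            # hypothetical end-of-motion-vector marker

class Reader:
    """Minimal bitstream reader stub, for illustration only."""
    def __init__(self, data):
        self.data, self.pos = data, 0
    def skip_to_after(self, marker):
        """Jump directly past the next occurrence of `marker`."""
        self.pos = self.data.index(marker, self.pos) + len(marker)

def should_decode_mv(spent_bits, total_budget, m_percent):
    """Budget test performed before each motion vector decoding step:
    decode the MVs only if less than M % of the budget is spent."""
    return spent_bits <= (m_percent / 100.0) * total_budget

stream = b'MV-BITS' + END_OF_MV + b'TEXTURE-BITS'
reader = Reader(stream)
if not should_decode_mv(spent_bits=90, total_budget=100, m_percent=80):
    reader.skip_to_after(END_OF_MV)    # overhead too high: skip the MV part
assert stream[reader.pos:] == b'TEXTURE-BITS'
```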
The encoding method thus described may be implemented in an encoding device such as illustrated in
At the decoding side (or in a server), the corresponding decoding method may be implemented in a decoding device such as illustrated in
The method as proposed may however introduce a drift between the coding and decoding operations when the motion vector decoding operation is stopped at a certain spatio-temporal level: if further spatio-temporal levels are still decoded, no motion compensation will indeed be performed for these remaining resolutions, including the one under reconstruction. In order to limit this drawback, and taking into account the fact that a great part of the bit budget available for decoding has already been spent by the first bitplane, it is proposed, according to the invention, to dynamically reduce the set of decoding parameters, for instance by reducing the frame rate or the spatial resolution according to the given requirements of the application, so as to obtain a visually acceptable reconstruction quality. The spatio-temporal resolution for which the motion vector decoding operation is stopped has to be reconstructed at the maximum quality allowed by the available bit budget, and the higher resolutions may be given up. The emphasis is thus put here on the in-depth exploration of the bitplanes for the current spatio-temporal resolution, instead of trying to reconstruct all of them, which would anyway be of poor quality under the above-mentioned decoding conditions. This is illustrated in
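One possible reduction policy can be sketched as follows (a hypothetical sketch: the function name and the level numbering are assumptions used only to illustrate the idea of giving up the higher resolutions):

```python
def plan_reconstruction(resolutions, stop_level):
    """When motion vector decoding stops at `stop_level`, keep the
    resolutions up to and including it (to be reconstructed as deeply
    as the remaining budget allows) and give up the higher ones."""
    kept = [r for r in resolutions if r <= stop_level]
    dropped = [r for r in resolutions if r > stop_level]
    return kept, dropped

# MV decoding stopped at level 1: levels 2 and 3 are given up, and the
# budget is spent on deeper bitplanes of levels 0 and 1 instead.
kept, dropped = plan_reconstruction([0, 1, 2, 3], stop_level=1)
assert kept == [0, 1] and dropped == [2, 3]
```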
The foregoing description of the preferred embodiments of the invention has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and obviously many modifications and variations, apparent to a person skilled in the art and intended to be included within the scope of this invention, are possible in light of the above teachings.
It may for example be understood that the devices described herein can be implemented in hardware, software, or a combination of hardware and software, without excluding that a single item of hardware or software can carry out several functions or that an assembly of items of hardware and software or both carry out a single function. These devices may be implemented by any type of computer system, or other apparatus adapted for carrying out the methods described herein. A typical combination of hardware and software could be a general-purpose computer system with a computer program that, when loaded and executed, controls the computer system such that it carries out the methods described herein. Alternatively, a special-purpose computer, containing specialized hardware for carrying out one or more of the functional tasks of the invention, could be utilized. The present invention can also be embedded in a computer program product, which comprises all the features enabling the implementation of the methods and functions described herein, and which, when loaded in a computer system, is able to carry out these methods and functions. The terms computer program, software program, program, program product, or software, in the present context, mean any expression, in any language, code or notation, of a set of instructions intended to cause a system having an information processing capability to perform a particular function either directly or after either or both of the following: (a) conversion to another language, code or notation; and/or (b) reproduction in a different material form.
Number | Date | Country | Kind |
---|---|---|---|
01403319.5 | Dec 2001 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/IB02/05306 | 12/9/2002 | WO |