The present invention relates to a method for jointly encoding a plurality of input video streams.
In several applications, several video streams need to be simultaneously compressed before transmission or storage. One obvious solution is to encode each stream independently. This generally consumes a great deal of processing power, as most existing encoders follow more or less the same reference architecture in which the bulk of the processing consists of computing encoding related syntax elements. Most traditional encoders further construct a sequence of predicted pixel blocks from the received input video data and from these calculated encoding related syntax elements. These predicted pixel blocks are then processed, generally by subtracting them from corresponding blocks of the input video stream or vice versa, to thereby obtain a sequence of residual pixel blocks. The processing further involves transforming this sequence of residual pixel blocks, with subsequent quantizing and entropy encoding in combination with the encoding related syntax elements, to obtain a traditional encoded video stream.
Although such encoding methods are now widespread, they still require a lot of processing power, since an encoder needs to compute the encoding related syntax elements for each input stream. When several input streams are to be jointly encoded, this processing effort is multiplied by the number of input streams to be encoded.
On the other hand, alternative coding mechanisms have been developed to jointly encode all the input streams together, with the objective of maximizing the compression efficiency of the whole set of input streams. As an example, a “Multiview Video Coding”, hereafter abbreviated by MVC, extension was recently standardized as Annex H of the H.264/AVC video coding standard. The aim of MVC is to offer good compression performance when jointly encoding a set of input video streams by exploiting the similarities between those video streams. As its name suggests, one potential application is to encode several views of a given scene obtained by several cameras. The shorter the distance between these cameras, the better the compression obtained using MVC for jointly compressing the multiple views. A drawback of the MVC approach however is that it creates strong coding interdependencies between the coded streams. This especially presents drawbacks at the decoder side: in order to decode one video stream of the plurality of encoded streams, all the data from all other views required by an inter-view prediction step need to be decoded as well. Similarly, if one wants to display a given video stream, the decoder has to decode all the encoded streams on which the displayed stream depends, according to this MVC encoding method.
It is therefore an object of the present invention to describe an alternative encoding and decoding method for encoding a plurality of video streams, which requires less processing power both at the encoder and decoder side.
According to the invention this object is achieved by providing a method for encoding a plurality of video streams, said method comprising the steps of receiving said plurality of video streams, constructing a plurality of sequences of predicted pixel blocks, processing and entropy encoding said predicted pixel blocks of said plurality of sequences of predicted pixel blocks with corresponding blocks of said plurality of video streams for generating a plurality of sequences of encoded residual pixel data, wherein said plurality of sequences of predicted pixel blocks are constructed from encoding structure data generated from said plurality of video streams, and wherein said plurality of sequences of encoded residual pixel data is provided together with reference data comprising said encoding structure data as encoded data of said plurality of video streams.
In this way, a plurality of sequences of encoded residual pixel data streams will be generated together with reference data comprising encoding structure data. This makes the joint encoding process itself much easier, as the encoding structure data only have to be determined once instead of for each individual stream of the plurality.
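The following Python sketch merely illustrates this idea under heavily simplified assumptions; the helper names, the trivial "encoding structure" (a zero-motion previous-frame prediction plus a single quantizer step) and the use of zlib as a stand-in for entropy coding are hypothetical and do not correspond to any particular embodiment described below.

```python
# Illustrative sketch only: one shared encoding structure, per-stream residual data.
import numpy as np
import zlib

def derive_encoding_structure(reference_stream):
    # Toy "encoding structure data": a single shared set of coding decisions,
    # determined only once for all streams (assumption for the example).
    return {"prediction": "previous_frame", "qstep": 8}

def encode_stream(frames, structure):
    encoded = []
    prev = np.zeros_like(frames[0])
    for frame in frames:
        predicted = prev                                   # shared prediction rule
        residual = frame.astype(np.int64) - predicted
        quantized = residual // structure["qstep"]         # shared quantizer step
        encoded.append(zlib.compress(quantized.astype(np.int16).tobytes()))
        prev = predicted + quantized * structure["qstep"]  # closed-loop reconstruction
    return encoded

# Three similar toy "video streams" of 4 frames of 16x16 pixels each.
rng = np.random.default_rng(0)
base = rng.integers(0, 256, size=(4, 16, 16))
streams = [np.clip(base + i, 0, 255) for i in range(3)]

structure = derive_encoding_structure(streams[0])          # determined only once
encoded_residuals = [encode_stream(s, structure) for s in streams]
print(len(encoded_residuals), "residual streams share one structure:", structure)
```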
In an embodiment said processing and entropy encoding comprises generating a plurality of sequences of residual pixel blocks from the difference between predicted pixel blocks of said plurality of sequences of predicted pixel blocks and corresponding blocks of said plurality of video streams, to transform, quantize and entropy encode said residual pixel blocks of said respective sequences to thereby obtain said plurality of sequences of encoded residual pixel data.
In another embodiment said encoding structure data is further entropy encoded to provide encoded encoding structure data as said reference data.
The encoding structure data can be generated from an intermediate stream derived from at least one video stream from said plurality.
This intermediate stream may be obtained e.g. by averaging at least two video streams of said plurality, but it can also be a selection of one of the streams of the plurality.
The encoding structure data can also be generated from at least two video streams of said plurality by analyzing encoding decisions for said at least two video streams and selecting a single prediction choice for being comprised in said encoding structure data.
In an embodiment said analysis is based upon comparing said encoding decisions with respect to a predetermined optimization criterion.
The present invention relates as well to a method for decoding at least one encoded video stream comprising at least one sequence of encoded residual pixel data and reference data comprising input encoding structure data, said method including a step of receiving a plurality of sequences of encoded residual pixel data and of said reference data comprising said input encoding structure data, a step of selecting at least one sequence of encoded residual pixel data pertaining to said at least one encoded video stream and said reference data comprising said encoding structure data, to entropy decode and process said at least one sequence of encoded residual pixel data pertaining to said at least one encoded video stream with said encoding structure data to provide at least one sequence of decoded pixel blocks as at least one decoded video stream.
In this way a decoder receiving such a plurality of encoded residual pixel blocks together with a reference stream comprising encoding structure data only has to select the reference stream and the appropriate sequence of encoded residual pixel data pertaining to the video to be decoded. The decoding or reconstruction can be done rather easily by performing the steps of entropy decoding and processing, involving e.g. prediction construction, to finally arrive at the decoded pixel blocks. In case several video streams need to be decoded, embodiments of the method become even more interesting, as the encoding structure is the same for all the streams to be decoded and the processing involving e.g. the prediction construction may imply applying the same operations to the decoded residual pixel block parts of each stream. As these processing steps are the same for all the streams to be decoded, they can be efficiently executed in parallel implementations, using for instance the Single Instruction, Multiple Data, abbreviated by SIMD, approach. A much simpler decoder for jointly decoding several encoded streams is thereby obtained, since the same encoding structure is shared by all streams and the prediction construction can be efficiently implemented in a joint parallel process.
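A minimal sketch of such a joint decoding pass is given below, matching the toy encoder sketch above; the single vectorized reconstruction step over all selected streams stands in for a SIMD-style parallel implementation. All names and the trivial shared structure are assumptions made for the example.

```python
# Illustrative sketch only: one shared structure drives a vectorized pass over all streams.
import numpy as np
import zlib

def decode_streams(encoded_residuals, structure, frame_shape):
    n_streams = len(encoded_residuals)
    prev = np.zeros((n_streams,) + frame_shape, dtype=np.int64)
    decoded = []
    for t in range(len(encoded_residuals[0])):
        # Entropy-decode each selected stream's residual data for frame t.
        residuals = np.stack([
            np.frombuffer(zlib.decompress(enc[t]), dtype=np.int16).reshape(frame_shape)
            for enc in encoded_residuals
        ]).astype(np.int64)
        # One shared prediction and reconstruction step applied to all streams at once.
        frames = prev + residuals * structure["qstep"]
        decoded.append(frames)
        prev = frames
    return decoded  # one (n_streams, height, width) array per decoded frame

# Example use with the output of the encoder sketch given earlier:
# decoded = decode_streams(encoded_residuals, structure, (16, 16))
```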
In an embodiment said at least one sequence of encoded residual pixel data pertaining to said at least one encoded video stream is submitted to an inverse quantization and an inverse transformation to thereby obtain at least one sequence of decoded residual pixel blocks, wherein at least one prediction of pixel blocks is constructed from said encoding structure data and from buffered pixel blocks for combination with said at least one decoded residual pixel blocks to thereby obtain said at least one sequence of decoded pixel blocks.
In another variant said encoding structure data is derived from said reference data by entropy decoding encoded encoding structure data extracted from said reference data.
The present invention relates as well to an encoder and a decoder for performing the subject methods.
Further embodiments are set out in the appended claims.
It is to be noticed that the term ‘coupled’, used in the claims, should not be interpreted as being limitative to direct connections only. Thus, the scope of the expression ‘a device A coupled to a device B’ should not be limited to devices or systems wherein an output of device A is directly connected to an input of device B. It means that there exists a path between an output of A and an input of B which may be a path including other devices or means. It is to be noticed that the term ‘comprising’, used in the claims, should not be interpreted as being limitative to the means listed thereafter. Thus, the scope of the expression ‘a device comprising means A and B’ should not be limited to devices consisting only of components A and B. It means that with respect to the present invention, the only relevant components of the device are A and B.
The above and other objects and features of the invention will become more apparent and the invention itself will be best understood by referring to the following description of an embodiment taken in conjunction with the accompanying drawings wherein
a shows a basic scheme of an embodiment of a prior art encoder,
b shows a basic embodiment of a prior art MVC encoder,
a shows an end-to-end encoding and transmission scheme comprising a joint encoder, an intermediate node and individual or joint decoders,
b shows an overview of the coding interdependencies obtained using classical AVC and MVC prior art approaches, and the approach followed in embodiments according to the invention
a shows a first embodiment JE1 of a joint encoder according to the invention,
b shows an embodiment of a single video encoder module E1 which is included in the first embodiment of the joint encoder JE1 of
c shows a second embodiment JE2 of a joint encoder according to the invention,
d shows an embodiment of another single video encoder module E2 which is included in the second embodiment of the joint encoder JE2 of
a shows a fourth embodiment JE4 of a joint encoder according to the invention,
b shows details of a first embodiment JED1 of a “make joint encoding decisions” module JED of
c shows details of a second embodiment JED2 of a “make joint encoding decisions” module JED of
a shows a first embodiment of a decoder JD1 according to the invention, and
b shows a second embodiment of a decoder JD2 according to the invention.
It is to be remarked that the following merely illustrates the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the invention. All examples and conditional language recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor(s) to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the invention. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
It is also to be understood that throughout this document the notations “input video stream” and “output video stream” refer to input and output data which can have the form of real streaming video but can also relate to (stored) data files, or any combination of these. The embodiments set out in this description therefore refer to both online and offline encoding of these video data and to any combination thereof.
In many applications, several video streams representing the same content, but not completely identical pixel-wise, need to be simultaneously compressed or encoded before transmission or storage. A typical example is a set of video streams obtained when capturing a scene with several cameras located close to each other, often also denoted as Multiview Video. For these applications, similarities usually arise for portions of the scene that correspond to objects lying at the largest distance from the cameras. For those objects, the disparities between the different cameras are usually the smallest. This situation can also arise if one wants to simultaneously encode several variants of the same video content which slightly differ from each other, for instance because of different post-processing with respect to color values, illumination values, etc. applied to these variants, or because each version has been uniquely watermarked.
Prior art solutions either encode each of these input video streams separately and thereby use a standard encoder such as the one shown in
To enable a better understanding of some embodiments described in this patent application, a brief explanation of this H.264 coding standard with the data partitioning feature is given below:
According to this H.264 standard each video frame is divided and encoded at the macroblock level, where each macroblock is a 16×16 block of pixels.
Macroblocks can be grouped together in slices to allow parallelization or error resilience. For each macroblock, the coded bitstream contains, firstly, data which signal to the decoder how to compute a prediction of that macroblock based on already decoded macroblocks and, secondly, residual data which are decoded and added to the prediction to re-construct the macroblock pixel values. Each macroblock is either encoded in “intra-prediction” mode in which the prediction of the macroblock is formed based on reconstructed macroblocks in the current slice, or “inter-prediction” mode in which the prediction of the macroblock is formed based on blocks of pixels in already decoded frames, called reference frames. The intra-prediction coding mode applies spatial prediction within the current slice in which the encoded macroblock is predicted from neighbouring samples in the current slice that have been previously encoded, decoded and reconstructed. A macroblock coded in intra-prediction mode is called an I-type macroblock. The inter-prediction coding mode is based on temporal prediction in which the encoded macroblock is predicted from samples in previous and/or future reference frames. A macroblock coded in inter-prediction mode can either be a P-type macroblock if each sub-block is predicted from a single reference frame, or a B-type macroblock if each sub-block is predicted from one or two reference frames.
The default H.264 behaviour is to group macroblocks in raster-scan order (i.e. scanning lines from left to right) into slices. The H.264 standard however further introduced another feature, referred to as flexible macroblock ordering, hereafter abbreviated with FMO. FMO partitions a video frame into multiple slice groups, where each slice group contains a set of macroblocks which could potentially be in nonconsecutive positions and could be anywhere in a frame.
For transport, each slice can, in the default mode, be transported within one network abstraction layer, hereafter abbreviated by NAL, unit. However the H.264/AVC standard further describes an additional feature of data partitioning of each slice over several NAL units, to improve the error resilience during the transport of the slice.
According to this feature of data partitioning of one slice over several partitions, the encoded contents of one slice are distributed over 3 NAL units: a NAL unit partition A, a NAL unit partition B, and a NAL unit partition C. According to the standard, the NAL unit partition A contains the Category 2 syntax elements of that slice, representing all slice-related syntax elements that are not residual data. These Category 2 syntax elements comprise the slice header and header data for each macroblock within a slice, including the intra-prediction mode for intra-coded macroblocks and the motion vectors for inter-coded macroblocks. The NAL unit partition B contains the Category 3 syntax elements, that is the intra-coded residual data of the macroblocks of the slice under consideration, if intra-prediction coding was used, and the NAL unit partition C contains the Category 4 syntax elements, that is the inter-coded residual data, if this type of coding was used.
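A short sketch of this split of one slice into the three partitions is given below; the category numbers follow the description above, but the syntax-element records are toy dictionaries rather than real H.264 bitstream syntax.

```python
# Illustrative sketch only: distributing one slice's syntax elements over partitions A, B, C.
def partition_slice(syntax_elements):
    partition_a = [e for e in syntax_elements if e["category"] == 2]  # headers, modes, motion vectors
    partition_b = [e for e in syntax_elements if e["category"] == 3]  # intra-coded residual data
    partition_c = [e for e in syntax_elements if e["category"] == 4]  # inter-coded residual data
    return partition_a, partition_b, partition_c

slice_elements = [
    {"category": 2, "name": "slice_header"},
    {"category": 2, "name": "mb0_motion_vector"},
    {"category": 3, "name": "mb1_intra_residual"},
    {"category": 4, "name": "mb0_inter_residual"},
]
a, b, c = partition_slice(slice_elements)
print(len(a), len(b), len(c))  # -> 2 1 1
```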
Returning to
In most traditional encoders the computation of the block transform and quantization is performed in the forward direction, but usually a feedback step in the reverse direction is also present. These feedback steps are usually added to make sure that the encoder uses the same sets of decoded frames as the decoder to make the predictions. Such encoders are called “closed-loop” encoders, as opposed to “open-loop” encoders, where these feedback steps are not present. On the other hand, the main differentiator between encoders lies in the way they define the encoding related syntax elements, implying a choice of frame type, slicing, intra- vs. inter-prediction, choice of intra-prediction mode and computation of motion vectors. These steps are thus generally performed within the block “make encoding decisions” and usually add significant additional complexity in the encoder with respect to a decoder.
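The sketch below gives a toy impression of such a “make encoding decisions” step, reduced to choosing, per 16x16 macroblock, between a crude intra prediction (DC value from the row above) and an inter prediction (co-located block of the previous frame) by comparing residual energy. Real encoders evaluate far more modes and cost terms; every name here is an assumption made for the example.

```python
# Illustrative sketch only: per-macroblock intra/inter decision by residual energy.
import numpy as np

def make_encoding_decisions(frame, prev_frame, mb=16):
    decisions = []
    for y in range(0, frame.shape[0], mb):
        for x in range(0, frame.shape[1], mb):
            block = frame[y:y+mb, x:x+mb].astype(np.int64)
            # Crude intra prediction: DC value of the row just above the block (128 at the top edge).
            intra_pred = np.full_like(block, frame[y-1, x:x+mb].mean() if y else 128)
            # Crude inter prediction: co-located block of the previous (reference) frame.
            inter_pred = prev_frame[y:y+mb, x:x+mb].astype(np.int64)
            intra_cost = np.abs(block - intra_pred).sum()
            inter_cost = np.abs(block - inter_pred).sum()
            decisions.append(("inter" if inter_cost < intra_cost else "intra", (y, x)))
    return decisions

rng = np.random.default_rng(1)
prev = rng.integers(0, 256, size=(32, 32))
cur = np.clip(prev + rng.integers(-3, 4, size=(32, 32)), 0, 255)
print(make_encoding_decisions(cur, prev)[:2])
```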
As explained above, encoding of a plurality of video streams can be achieved by separately encoding these individual video streams using such a state-of-the-art encoder for each video sequence to be encoded. This however requires a lot of processing effort.
As an alternative, MVC encoding was introduced to improve the compression as an extension of the H.264/AVC standard. As shown schematically in
Thus while MVC can improve the compression efficiency, it still has drawbacks of being computationally very intensive.
b (II) shows the interdependencies for the separate encoding of the views using regular H.264/AVC encoders. In this case the 3 encoded views do not show coding interdependencies, but, as mentioned before, the drawback is that each view has to be compressed separately, resulting in high computational effort.
These drawbacks of the prior art methods are overcome by embodiments of joint encoders and decoders according to this invention. A high level scheme of such a joint encoder JE coupled via an optional intermediate node IM to several decoders JD and JD′ is shown in
The intermediate node IM is thus further adapted to identify and extract the appropriate data for further forwarding to their destination. This can be performed by filtering the plurality of encoded residual data streams and encapsulating the needed video data into a transport stream for further transmission to their final destination, being the two decoders JD and JD' in
The decoder JD is adapted to receive the encoded data of the first and second video stream, and the intermediate node will accordingly provide the encoded residual pixel data ERPD1, resp. ERPD2, of IV1, resp. IV2, together with the reference encoding structure data IREF. In this decoder JD all this information will be used for decoding so as to obtain the correctly decoded video data, with the aim of reconstructing the original video streams IV1 and IV2 as well as possible. The decoded streams are denoted DV1 and DV2 and are provided on respective output terminals DOUT1 and DOUT2. It is to be remarked that in case the intermediate node is not present, all encoded residual pixel data can be directly transmitted and provided to a decoder. Such a decoder is then adapted to extract from the input data the reference encoding structure data as well as the encoded residual pixel data pertaining to the video that has to be decoded. The decoding or reconstruction can be done rather easily by performing the steps of entropy decoding and processing, usually involving a prediction construction, to finally arrive at the decoded pixel blocks. As the encoding structure is the same for all the streams to be decoded, the prediction construction consists in applying the same operations to the already decoded parts of each stream. As these processing steps are the same for all the streams to be decoded, they can be efficiently executed in parallel implementations, using for instance the Single Instruction, Multiple Data, abbreviated by SIMD, approach. A much simpler decoder for jointly decoding several encoded streams is thereby obtained, since the same encoding structure is shared by all streams and the prediction construction can be efficiently implemented in a joint parallel process.
As all views will be encoded from a common encoding structure IREF, the resulting interdependencies between the encoded views remain simple, as shown in
Several embodiments of such joint encoders, and of decoders adapted to decode encoded video streams which were encoded by these joint encoders, will now be described.
A first embodiment of a joint encoder JE1, depicted in
Alternatively, the input video for which the obtained encoding structure offers the best average rate-distortion performance for the overall compression of the input streams, can also be selected to become the reference stream.
A person skilled in the art is able to generate detailed embodiments for realizing the aforementioned selection procedures. Therefore detailed embodiments of such selection modules S, such as the one depicted in
In the joint encoder embodiment JE1 depicted in
An embodiment E1 of this encoder module is shown in
The residual pixel blocks RPB2 further undergo a transformation, quantization and entropy encoding, in the embodiment of
The encoded residual pixel data ERPD2 is provided as output data on a first output terminal OUTE1 of the module E1 of
Referring back to
Another alternative embodiment JE2 is depicted in
EESD1 will serve as input data to another single video encoder module E2, which is further adapted to determine the residual pixel data of EV2, using EESD1 as a reference input. An embodiment of this module E2 is shown into more details in
Such joint encoders are thus particularly useful, e.g., for compression of stereo or multiview video. For applications using e.g. stereoscopy-based 3D video or free viewpoint video, one typically has to capture several views of the same object or scene. For instance, in stereoscopy, the two videos are typically very close to each other. When capturing multiple video streams spanning a wide range of viewpoints, the various streams can typically be grouped in clusters of streams with viewpoints close to each other. To store or transmit 2 or more video streams with close viewpoints, the prior art method will independently compress and store/transmit the various views. In this prior art case, the complexity and the storage/transmission cost will scale linearly with the number of views to encode. The joint encoders JE1 and JE2 offer the alternative of first encoding only one of the video streams and re-using the encoding structure (which, in the case of encoding using e.g. H.264 standard encoding methods with NAL unit partitioning as in JE2, relates to the Partition A of the obtained stream) to efficiently encode the other similar video streams. This drastically reduces the encoding complexity for the latter streams and allows all streams to share the same Partition A on the storage/transmission medium, if this coding standard is used in the traditional encoder.
In a third embodiment JE3 of a joint encoder, shown in
In this embodiment JE3 this intermediate stream IS is created by averaging the input streams at the pixel level, after which traditional encoding, such as e.g. standard H.264 encoding, is applied to this intermediate stream. The traditional encoder is again denoted ETH. The resulting encoded intermediate stream is denoted EISTh. This encoded stream can then be filtered to extract the partition A, as only this partition is further needed during the encoding of the input video streams IV1 and IV2. The partition A, comprising the encoding structure data EESD, is then provided as reference data IREF on a reference output terminal OUTREF of the joint encoder JE3, and is also further used in two single encoder modules E2 comprised within JE3. The operation of these single encoder modules E2 was described in an earlier paragraph with reference to
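A minimal sketch of forming such an intermediate stream by pixel-level averaging is given below; the array shapes and the 8-bit pixel range are assumptions made for the example.

```python
# Illustrative sketch only: intermediate stream IS as the pixel-wise average of the inputs.
import numpy as np

def make_intermediate_stream(streams):
    # streams: list of equally shaped (frames, height, width) arrays
    return np.mean(np.stack(streams).astype(np.float64), axis=0).round().astype(np.uint8)

rng = np.random.default_rng(2)
iv1 = rng.integers(0, 256, size=(2, 16, 16), dtype=np.uint8)
iv2 = np.clip(iv1.astype(np.int16) + 2, 0, 255).astype(np.uint8)
intermediate = make_intermediate_stream([iv1, iv2])
print(intermediate.shape)  # (2, 16, 16)
```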
The leftmost encoder module E2 in
In yet another embodiment of a joint encoder JE4, being depicted in
This joint encoding structure is then further used within the joint encoder JE4 to construct the sequences of predicted pixel blocks PPB1 and PPB2, respectively for IV1 and IV2. Respective sequences of residual pixel blocks RPB1 and RPB2 will be generated from the difference between these respective sequences of predicted pixel blocks PPB1 and PPB2 and corresponding blocks from the respective input video streams IV1 and IV2. These respective sequences of residual pixel blocks are then further processed, e.g. transformed and quantized, to obtain respective sequences of quantized residual pixel data QRPD1 and QRPD2, which are subsequently entropy encoded so as to obtain respective sequences of encoded residual pixel data denoted ERPD1 and ERPD2, which are provided at respective output terminals OUT1 and OUT2 of this joint encoder JE4. The encoding structure data JESD will in this embodiment also undergo an entropy encoding step before being delivered as encoded encoding structure data EJESD, which is provided as reference data IREF on the reference output terminal OUTREF of this joint encoder JE4.
Two embodiments of such a “Make Joint Encoding Decisions” block JED will now be described with reference to
A first embodiment JED1 shown in
The joint encoding structure data JESD can be calculated at the slice level by applying, for each independently computed encoding structure data ESD1 and ESD2, the corresponding prediction and residual quantization steps inherent in ESD1 and ESD2 to that particular slice for both input videos IV1 and IV2. The joint performance of this particular encoding structure data ESD1, resp. ESD2, is then evaluated using some metric. This metric can e.g. be based on determining the rate-distortion quality. The encoding structure data which yields the best quality metric value with respect to this rate-distortion performance will then be selected to be the joint encoding structure data JESD. Such a quality metric value can for instance comprise measuring the sum, over all input streams, of the PSNR between the original slice and the slice coded with this particular ESD1 or ESD2. In this case the maximum metric value is sought by this comparison step. Alternatively the sum of the encoding size required for this particular ESD1, resp. ESD2, together with the residual of all input videos obtained by using this particular encoding structure, can be determined. In this case the ESD yielding the minimum value for this metric will be selected.
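A toy sketch of this JED1-style selection is given below, reduced to picking, among two candidate structures, the one whose summed PSNR over both input slices is highest; the candidate set, the zero-prediction "coding" stand-in and all names are assumptions made for the example.

```python
# Illustrative sketch only: select the candidate ESD with the best joint PSNR metric.
import numpy as np

def psnr(orig, recon):
    mse = np.mean((orig.astype(np.float64) - recon) ** 2)
    return 99.0 if mse == 0 else 10 * np.log10(255.0 ** 2 / mse)

def code_with(slice_pixels, qstep):
    # Toy stand-in for "apply the prediction and residual quantization of this ESD".
    return np.clip((slice_pixels.astype(np.int64) // qstep) * qstep, 0, 255)

rng = np.random.default_rng(3)
iv1_slice = rng.integers(0, 256, size=(16, 16))
iv2_slice = np.clip(iv1_slice + rng.integers(-2, 3, size=(16, 16)), 0, 255)

candidates = {"ESD1": 4, "ESD2": 12}   # two independently obtained candidate structures
joint_esd = max(candidates,
                key=lambda k: psnr(iv1_slice, code_with(iv1_slice, candidates[k]))
                            + psnr(iv2_slice, code_with(iv2_slice, candidates[k])))
print("selected joint encoding structure:", joint_esd)
```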
In a second embodiment JED2 shown in
As with the other embodiment JED1, again many possibilities exist to define the metric to be applied for the comparison of the predictions performed by the compare prediction module. It is for instance possible to aim at minimizing the total energy of all residuals. This total energy may be calculated as the sum, over all input videos, of the sum, over all pixels of the blocks, of the squared differences between original and predicted pixel values.
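The residual-energy metric just described can be sketched as follows; the block contents and names are toy assumptions for the example.

```python
# Illustrative sketch only: total residual energy of one candidate prediction, summed over all input videos.
import numpy as np

def total_residual_energy(blocks, predicted_block):
    # Sum over all input videos of the sum of squared pixel differences.
    return sum(np.sum((b.astype(np.int64) - predicted_block) ** 2) for b in blocks)

rng = np.random.default_rng(4)
mb_iv1 = rng.integers(0, 256, size=(16, 16))
mb_iv2 = np.clip(mb_iv1 + 1, 0, 255)
candidate_prediction = np.full((16, 16), 128, dtype=np.int64)
print(total_residual_energy([mb_iv1, mb_iv2], candidate_prediction))
```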
The selection of the quantization parameter QP itself can, for instance, be based on a fixed choice or can be chosen so that the total size of the coded macroblocks fits a given size (in bits) budget.
It is to be remarked that, in this embodiment JED2, the selection of the prediction mode and the choice of the quantization parameter QP were decoupled. As in state-of-the-art encoders, the person skilled in the art will be able to generalize the approach so as to select both the prediction mode and the quantization parameter QP based on a feedback of the encoding loop (which can include entropy coding), where the parameters are chosen e.g. so as to maximize the total quality (over all input video streams) of the coded macroblock under a total bit budget, or to maximize the compression by e.g. minimizing the total size of the encoded MB for all input video streams under a required minimal quality.
In addition, the principles described in the previous paragraph can also be extended to optimize the quantization parameter QP choices. This may involve using state-of-the-art techniques such as Lagrangian optimization, which allow the QP value per macroblock to be optimized so as to optimize a global rate-distortion performance (at a slice, frame or sequence level).
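The Lagrangian rate-distortion choice mentioned above amounts to minimizing a cost J = D + lambda * R over the candidate parameters; the sketch below illustrates this with made-up distortion and rate figures per QP candidate.

```python
# Illustrative sketch only: Lagrangian rate-distortion selection of a quantization parameter.
candidates = {            # QP: (distortion, rate in bits) -- made-up example values
    22: (40.0, 5200),
    28: (95.0, 3100),
    34: (210.0, 1900),
}
lam = 0.05                # Lagrange multiplier trading distortion against rate
best_qp = min(candidates, key=lambda qp: candidates[qp][0] + lam * candidates[qp][1])
print("selected QP:", best_qp)
```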
Most of the shown embodiments JE1 to JE4 thus take advantage of the fact that the encoding structure of an encoded stream (potentially packaged in Partition A if data partitioning is used) contains all the important coding decisions made by the encoder: intra- and inter-prediction mode, picture buffer management, motion vectors, etc. Once this encoding structure is fixed, the encoding process of the residual data, such as Partitions B and C if data partitioning is used, merely consists of applying the chosen prediction mode, computing the residual data, applying the integer block transform and quantization and finally entropy coding the obtained results. In the simplest form, the encoding of N input streams thus results in an output consisting of a common partition A and of N (unshared) partitions, each corresponding to one of the input streams. One given coded stream can be extracted from the N+1 created partitions by assembling its dedicated partition with the common partition. A decoder can then process those two partitions to display the required view.
Because the preferred use cases of this invention apply to input streams with strong similarities, a unique encoding structure may contain efficient coding decisions for all the input streams. Therefore, for joint encoding of several input raw video streams, providing a shared partition made of a unique encoding structure to be used for all streams, and a dedicated partition per individual stream made of their coded residual data, provides a simple yet very effective method.
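Extracting one coded stream from the N+1 stored partitions can be sketched as below; the byte payloads are placeholders rather than real NAL units, and the function name is an assumption for the example.

```python
# Illustrative sketch only: pair the shared partition A with one stream's dedicated residual partition.
def extract_stream(shared_partition_a, residual_partitions, stream_index):
    return [shared_partition_a, residual_partitions[stream_index]]

shared_a = b"\x00common-encoding-structure"
residuals = [b"\x01residuals-view-0", b"\x02residuals-view-1", b"\x03residuals-view-2"]
coded_view_1 = extract_stream(shared_a, residuals, 1)
print(len(coded_view_1), "partitions assembled for the requested view")
```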
For embodiments based on the selection or the creation of an intermediate stream, yet other embodiments of encoders according to the invention may combine one of the previously described embodiments with state-of-the-art encoding mechanisms as described with reference to
The slicing of the video picture itself can also be chosen during the encoding process so as to group macroblocks into FMO slices as a function of their similarities across the multiple input video streams. Slices of macroblocks that are very similar across the views are then encoded using a common encoding structure, while slices of macroblocks having more differences across the different input videos are encoded independently using a state-of-the-art encoding process that outputs a dedicated encoding structure related to that FMO slice for each video input.
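A toy sketch of such a similarity-driven grouping decision is given below, assigning each 16x16 macroblock to either a "shared structure" or an "independent" slice group; the mean-absolute-difference measure and its threshold are arbitrary assumptions for the example.

```python
# Illustrative sketch only: per-macroblock slice-group assignment based on cross-view similarity.
import numpy as np

def map_macroblocks(view_a, view_b, mb=16, threshold=4.0):
    groups = {}
    for y in range(0, view_a.shape[0], mb):
        for x in range(0, view_a.shape[1], mb):
            diff = np.abs(view_a[y:y+mb, x:x+mb].astype(np.int64)
                          - view_b[y:y+mb, x:x+mb]).mean()
            groups[(y, x)] = "shared_structure" if diff <= threshold else "independent"
    return groups

rng = np.random.default_rng(5)
va = rng.integers(0, 256, size=(32, 32))
vb = np.clip(va + rng.integers(-2, 3, size=(32, 32)), 0, 255)
print(map_macroblocks(va, vb))
```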
It may further be noticed that this switching decision between a state-of-the-art encoder and an encoder as shown in the previous embodiments, can also be made at a coarser granularity e.g. at the frame level, or at the sequence level.
Two embodiments of a decoder for co-operating with the aforementioned joint encoders will now be described with reference to
Decoder JD1 of
In the embodiment of
The embodiment JD2 depicted in
While the principles of the invention have been described above in connection with specific apparatus, it is to be clearly understood that this description is made only by way of example and not as a limitation on the scope of the invention, as defined in the appended claims.