The present invention relates to devices and a method for video encoding and reconstruction.
In present video transmission systems, content providers do not always have a clear view of the network conditions the transmitted multimedia content will undergo between sender and receiver. However, most content providers do know which part of the content is to be provided in high quality and which part is less important. This may for instance be the case for some movie scenes, where the main characters can be much more important than the landscape in the background, while for other scenes of the same movie it may be exactly the opposite. For a high-quality transmission of such varying important parts, whereby network conditions may also change, no good solutions are presently available.
Existing scalable video coding methods, hereafter abbreviated SVC, encode the complete video into one base quality layer and one or multiple quality enhancement layers, whereby each next encoded layer depends upon and improves the quality of the previous layer. This solution may take into account varying network conditions and/or varying user preferences/devices, e.g. by making the number of enhancement layers transmitted dependent upon the quality of the link between sender and receiver. This may guarantee that at least a minimum base quality layer reaches the end-client under worst-case network conditions. However, even in such situations the content provider has no real control to guarantee that the most important content part, thus possibly varying objects from the video, will reach the client with a certain level of guarantee. This is because in SVC the base layer abstracts from the video contents and thus provides the complete video at a minimum guaranteed quality which, for some scenes, may not be sufficient.
Another existing encoding method, used for transmitting panoramic interactive video, tiles the video frames into independent sub-frames, denoted video tiles. These tiles are independently encoded and sent at different qualities or resolutions in order to enable interactive zooming in the video content while maintaining the overall resolution received by the end-device. However, this approach suffers from the disadvantage of consuming much more bandwidth than traditional encoding schemes, in view of its poor compression efficiency.
It is thus an object of the invention to provide improved encoding schemes which are at the same time compression-efficient and versatile, thus allowing adaptation to different network conditions.
This object is achieved by means of an embodiment of a transmitter for generating a set of encoded prioritized video streams from an incoming video stream, said transmitter comprising:
In this way prioritization can be given on two different levels, a first level at the compression stage, a second level at the transport stage.
In an embodiment said video decomposition module comprises a filter bank of N analysis filters, for performing respective filtering operations on said incoming video stream, thereby obtaining a set of N independent video components.
In another embodiment said video decomposition module is adapted to perform a fast Fourier transform operation on said incoming video stream, followed by at least two low pass filtering operations in the respective frame directions.
In yet another embodiment said video decomposition module is adapted to perform a wavelet transform operation on said incoming video stream.
These three different embodiments for performing the decomposition of the incoming video signal all result in a set of independent decomposed video components. The filter bank and fast-Fourier transform embodiments result in a set of independent spectral components, the wavelet embodiment results in a set of independent time-spatial/spectral components.
According to another aspect said transmission rule engine is adapted to determine said compression parameters and said respective priority parameters, based on knowledge of said decomposition process and on the contents of said incoming video stream.
This allows the more important parts to be better prioritized, based on the contents themselves.
In yet another embodiment said transmission rule engine is further adapted to receive said at least two independent video components from said video decomposition module, to analyze said at least two independent video components and to further determine said compression parameters and said respective priority parameters based on said analysis.
This further enhances the accuracy for attributing priorities based on the video component contents themselves.
In an embodiment said compression parameter is a quantization parameter.
To enable good recovery at the receiver coupled to embodiments of the transmitter, the transmitter is adapted to send a signaling message to a receiver coupled to said transmitter, said signaling message containing information about said decomposition used by said transmitter and the respective qualities in which said respective set of packet streams are generated.
The present invention relates as well to embodiments of a receiver adapted to request at least two compressed independent video components, each of them in a respective requested quality, from a transmitter, said receiver further being adapted to
In an embodiment said re-composition operation comprises performing a set of synthesis filter operations.
In another embodiment said re-composition operation comprises performing an inverse Fourier transform operation.
In yet another embodiment said re-composition operation comprises performing an inverse wavelet transform operation.
These three respective embodiments of a re-composition module at the receiver allow reconstruction of encoded signals received from respective transmitter embodiments performing the associated respective decomposition.
To further enable the correct reconstruction of the video signal embodiments of the receiver are adapted to receive from said transmitter a signaling message informing said receiver on a video decomposition operation performed in said transmitter, such that said receiver is adapted to perform an inverse re-composition operation.
The present invention relates as well to embodiments of methods performed in the transmitter/receiver and in between both of them, e.g. via the signaling messages between both of them, as well as to embodiments of computer programs for performing these methods.
It is to be noticed that the term ‘coupled’, used in the claims, should not be interpreted as being limitative to direct connections only. Thus, the scope of the expression ‘a device A coupled to a device B’ should not be limited to devices or systems wherein an output of device A is directly connected to an input of device B. It means that there exists a path between an output of A and an input of B which may be a path including other devices or means.
It is to be noticed that the term ‘comprising’, used in the claims, should not be interpreted as being limitative to the means listed thereafter. Thus, the scope of the expression ‘a device comprising means A and B’ should not be limited to devices consisting only of components A and B. It means that with respect to the present invention, the only relevant components of the device are A and B.
The above and other objects and features of the invention will become more apparent and the invention itself will be best understood by referring to the following description of an embodiment taken in conjunction with the accompanying drawings wherein:
It is to be remarked that the following merely illustrates the principles of the invention. It will thus be appreciated that those skilled in the art will be able to devise various arrangements that, although not explicitly described or shown herein, embody the principles of the invention. Furthermore, all examples and conditional language recited herein are principally intended expressly to be only for pedagogical purposes to aid the reader in understanding the principles of the invention and the concepts contributed by the inventor(s) to furthering the art, and are to be construed as being without limitation to such specifically recited examples and conditions. Moreover, all statements herein reciting principles, aspects, and embodiments of the invention, as well as specific examples thereof, are intended to encompass both structural and functional equivalents thereof. Additionally, it is intended that such equivalents include both currently known equivalents as well as equivalents developed in the future, i.e., any elements developed that perform the same function, regardless of structure.
It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative circuitry embodying the principles of the invention. Similarly, it will be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes which may be substantially represented in computer readable medium and so executed by a computer or processor, whether or not such computer or processor is explicitly shown.
Transmitter T contains an input which is adapted to receive a video file Vin. This video can be temporarily stored within T before being forwarded to a video encoding apparatus E of T. In the embodiments depicted in
This video decomposition module is adapted to perform a first decomposition of the video signal into components in the spectral or time/spatial-spectrum domain. For embodiments where this decomposition takes place in the spectral domain, such as the one shown in
For a (block-based)(discrete)(fast) Fourier transform embodiment, the input video frames are first decomposed into their 2D spectra or their 3D spectra using both temporal and 2D spatial Fourier transform, in two directions, the horizontal and the vertical direction. This can be performed by the use of fast Fourier transform, abbreviated by FFT, or 3D-FFT on each color channel (for example RGB, YUV, LAB, HSV or other color spaces), respectively. Furthermore, the FFT can be applied on the whole frame or on blocks of pixels for each frame, but also on different tiles of the frame at different resolutions in order to introduce spatial locality to the transform. The latter tiling is not shown in
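By way of a purely illustrative, non-limiting sketch (function names are hypothetical and not part of the description), the block-based 2D FFT decomposition of one color channel may be written as:

```python
import numpy as np

def spectral_decompose(channel, block=8):
    """Split one color channel into per-block 2D spectra via the FFT.
    Illustrative sketch; block size and layout are assumptions."""
    h, w = channel.shape
    assert h % block == 0 and w % block == 0, "frame must tile evenly"
    spectra = np.empty((h // block, w // block, block, block), dtype=complex)
    for by in range(h // block):
        for bx in range(w // block):
            tile = channel[by * block:(by + 1) * block,
                           bx * block:(bx + 1) * block]
            spectra[by, bx] = np.fft.fft2(tile)   # 2D spectrum of this block
    return spectra

def spectral_recompose(spectra, block=8):
    """Inverse of spectral_decompose: per-block inverse FFT."""
    nby, nbx = spectra.shape[:2]
    channel = np.empty((nby * block, nbx * block))
    for by in range(nby):
        for bx in range(nbx):
            channel[by * block:(by + 1) * block,
                    bx * block:(bx + 1) * block] = \
                np.fft.ifft2(spectra[by, bx]).real  # input was real-valued
    return channel
```

A 3D variant would additionally apply the transform along the temporal axis, and a tiled variant would apply it at different resolutions per tile.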
The values of such spectra may be complex and typically in floating point precision. The advantage of spectral decomposition for natural video signals (that is excepting parasites and white noise signals) is that most of the signal energy is compacted into the spectral low frequencies, while signal energy quickly decreases when frequency increases, as is shown in
Another aspect of the Fourier transform of real video signals is that their spectrum has well-known symmetry properties, which make it a robust representation.
The output of this video decomposition module is denoted Vd1 . . . Vdn, in
Global 2D (or 3D) multi-scale block 2D (or 3D) spectra can then be optionally partitioned into different components, which is typically performed to create components that contain specific percentages of the spectral energy of the local video signal. This partitioning is shown in
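The partitioning of a spectrum into components each holding a given percentage of the spectral energy may, for illustration only, be sketched as follows (the energy fractions and the greedy strongest-bins-first assignment are assumptions, not prescribed by the description):

```python
import numpy as np

def partition_by_energy(spectrum, fractions=(0.8, 0.15, 0.05)):
    """Assign each spectral bin to a component so that component 0 holds
    roughly the first energy fraction, component 1 the next, and so on.
    Returns one component index per bin."""
    energy = np.abs(spectrum).ravel() ** 2
    order = np.argsort(energy)[::-1]                 # strongest bins first
    share = np.cumsum(energy[order]) / energy.sum()  # cumulative energy share
    before = share - energy[order] / energy.sum()    # share before each bin
    bounds = np.cumsum(fractions)                    # e.g. 0.8, 0.95, 1.0
    labels = np.empty(energy.size, dtype=int)
    labels[order] = np.searchsorted(bounds, before, side='right')
    labels = np.minimum(labels, len(fractions) - 1)  # guard rounding overflow
    return labels.reshape(spectrum.shape)
```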
Spectral components can be encoded with a traditional coding algorithm, such as for example H264, with a preliminary normalization and quantization of the spectral coefficients amplitude and phase.
The separate components Vd1 to Vdn produced by the decomposition module may be compressed separately at different qualities or adapted domain resolutions, this difference in compression being controlled by a signal prc. This control signal prc may comprise a set of control parameters, e.g. a quantization parameter per decomposed video, which can individually influence the encoding, if needed. In some embodiments only two out of the n components are differently encoded, in other embodiments all n are differently encoded, and all embodiments in between are possible as well.
This control signal prc is generated and provided to the encoder E by a transmitter rule module TRM for influencing the compression. This control signal is denoted prc on
In the embodiment depicted in
Priorities are determined by TRM for influencing prc. For a spectral decomposition these priorities are determined for the different frequency components, thus calculated or attributed for the different spectral components based on their frequency range. The DC component and the low frequencies receive the highest priority, as they generally also have the highest energy. In general, frequency bands are prioritized based on the energy they contain. This energy distribution is however content-dependent: in the presence of edges in only one given direction, the spectrum amplitude in that spatial frequency direction is known to be higher than in the other directions. It follows that priorities follow the relative importance of the spectral component, with respect to the other components, in the quality of the reconstruction.
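The energy-based prioritization of frequency bands described above may be illustrated by the following non-limiting sketch, which splits a 2D spectrum into radial bands and ranks them by the energy they contain (the band boundaries and the 1..N priority scale are assumptions):

```python
import numpy as np

def band_priorities(spectrum, n_bands=3):
    """Rank radial frequency bands of a 2D spectrum by their energy and
    return one priority per band (1 = highest priority)."""
    h, w = spectrum.shape
    fy = np.fft.fftfreq(h)[:, None]                  # normalized frequencies
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    edges = np.linspace(0.0, radius.max() + 1e-9, n_bands + 1)
    band_energy = [float(np.sum(np.abs(spectrum[(radius >= lo) & (radius < hi)]) ** 2))
                   for lo, hi in zip(edges[:-1], edges[1:])]
    rank = np.argsort(np.argsort(band_energy)[::-1])  # 0 = most energetic band
    return [int(r) + 1 for r in rank]
```

For natural video the DC-containing low band holds most of the energy and thus obtains priority 1, matching the behavior described above.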
This is further illustrated in
An operator or another person with knowledge of the video contents, can therefore already attribute a default higher weight to these lower frequencies, based on a presumed spectrum. Some examples of weights are shown at the bottom in
The sender, e.g. an operator, may thus already prioritize some image components based on their directional normalized frequencies fx and fy, which may have a distribution, dependent on the content, as shown in
In another embodiment such as the one depicted in
In another embodiment, not shown in the figures, VDA may as well receive input values for the weights from e.g. an operator during an initialization process, and take this information together with the analysis of the spectral components Vd1 to Vdn to generate the intermediate weights c for provision to a rules engine module TRE. The latter will further process this information into a set of encoding-related parameters, for provision to the encoding module E, thereby enabling the different components to be compressed separately. In an embodiment the rules engine module TRE is adapted to derive quantization parameters, abbreviated QP, from these weights oca and/or the intermediate weight parameters c. In an embodiment high QPs can be attributed to low-importance components, thus for which the weights as given by oca are low, and low QPs (high quality) to important components, thus for which the weights of oca are high. Numerical examples in the case of a Fourier transform may attribute a low QP, typically below 15 for H.264/AVC, as small modifications already impair the reconstruction, to components with high peaks, as shown in
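The mapping from importance weights to quantization parameters may, purely for illustration, be sketched as follows (the linear mapping and the QP range 10..40 are assumptions; only the "low QP, typically below 15 for H.264/AVC, for important components" figure comes from the description):

```python
def weights_to_qp(weights, qp_min=10, qp_max=40):
    """Map importance weights in [0, 1] (1 = most important) to H.264-style
    QPs: high weight -> low QP (high quality). Linear mapping is an
    illustrative assumption."""
    return [round(qp_max - w * (qp_max - qp_min)) for w in weights]
```

For example, a component with weight 0.9 obtains QP 13, i.e. below the threshold of 15 mentioned above for important components.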
For embodiments where the decomposition process is based on wavelet decomposition, which will be described in conjunction with
This set of QP values for the different decomposed videos, is provided to the compression module of the encoder as control signal prc.
Other parameters, such as target bitrates, segment length, and a specific packet priority, will also be calculated by TRE, and next be provided to a packetization and prioritization module MPP of the transmitter, in the form of priority parameters Pr for the respective components.
Target bitrates can e.g. be computed by a linear sharing of a given target bit rate, determined by the operator, over all components according to their importance, which importance is again reflected by the weights.
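The linear sharing of an operator-chosen target bit rate may be illustrated as follows (a minimal sketch; the weight values in the usage example are hypothetical):

```python
def share_bitrate(total_kbps, weights):
    """Linearly share an operator-chosen target bit rate over the
    components in proportion to their importance weights."""
    total_w = sum(weights)
    return [total_kbps * w / total_w for w in weights]
```

For instance, a 1000 kbps budget with weights 3 and 1 yields 750 kbps and 250 kbps for the two components.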
All these parameters calculated by TRM and provided to MPP are denoted Pr in
These parameters Pr, prc and oca may also be further adapted during operation, starting from some default parameters, depending on the chosen decomposition.
In
The generation of the different versions and the packetizing of all of them thus takes place in module MPP. This module receives control parameters from TRM, which are denoted Pr, and which may thus comprise a set of l values of frame rate, segments length or a combination of both, per decomposed video file. The sub-set pr1 for Vde1 can be different from the sub-set prn for Vden.
In this way l×n different packetized or transport video streams are generated and stored in a memory NFM in the transmitter. Upon receiving a request from the receiver, denoted ur, a set of possibly different quality versions of two or more of the components is provided to the receiver. This means that the receiver is knowledgeable about the number of different quality versions; this information was communicated to the receiver by signaling from the transmitter (not shown in the figures). The user may thus request a particular quality version of each component, depending on the bandwidth between transmitter and receiver (which the user has to determine), user preferences (high quality or low quality, e.g. depending on the display), content type (e.g. for sports, users prefer HD resolution at 60 frames per second rather than 4K resolution at 30 frames per second; similarly, component qualities can be requested in order to better reconstruct e.g. HD at 60 fps), or complexity of the reconstruction, e.g. to save battery or improve fluidity of the display on mobile devices (selecting lower qualities enables fewer reconstruction operations in all embodiments), etc.
The parameters used for generating the different versions are denoted priority parameters pr and may thus comprise a set of n or more parameter values, one for each component; they are generated in the rules engine module TRE, taking into account the respective weights oca and/or c. In an example the relative importance of the component in the total amplitude of the spectrum, as already attributed or determined by the weight, can also result in a relatively high priority value, e.g. high priority is 1 and low priority is 3 on a scale of one to three. Similarly, although QP or target bitrates are already set in the encoder, it could be that, if variable bit rate transport schemes are used, like for AVC streams, some components require providing enough resources to anticipate a peak of bit rate, as they come with high details that require high quality. The occurrence of these peaks can be flagged together with the priority, to inform the decoder that this component might require more buffering or a higher delay. This can be indicated by setting a special bit or flag, e.g. in the header of the packets of the transport stream, similar to the indication of the priority itself. The receiver and other network elements in the network of course have to be knowledgeable of this packet header structure.
In case of a flag indicating extra buffering, this can e.g. be understood by the receiver as instructing a receiver or de-packetizing buffer either to reserve some extra space for buffering some consecutive encoded frames when possible (possible in software, not always possible in hardware), before providing them to the decompression unit of the decoder, or to switch to a lower quality level that better suits the receiver or de-packetizing buffer size.
In this way a content provider is capable of providing a plurality of different qualities, which may e.g. vary per receiver type (e.g. mobile phone or HD TV and any variation in between), connection type, network condition, and combinations thereof.
The memory module NFM will thus store all these encoded versions, and select for a user R a set of n encoded versions. It is thus possible to select Vde1q2, Vde2q1, Vde3q2, . . . Vdenq1, based on the ur parameter the user provides. The user is to check the network conditions and is knowledgeable about his type of receiver, so he can adjust ur such as to always receive an optimal quality taking into account his settings and the current network conditions. The user preferences are provided by each user, as depicted by the signal or message upr, from a preferences and rules module RRM in the receiver R, to the NFM.
For instance, if the user wants to reconstruct a video at 4K resolution at full quality, the client will ask for and download all components in their version with priority 1. In case the user rather wants a reconstruction in HD resolution, because of a lower bandwidth or a TV set that is not 4K compatible, the client will select low-frequency priority 1 components and high-frequency priority 2 components. In case the user uses a mobile device and only wishes a minimal resolution with the lowest priority, the client might only ask for and download the priority 3 low-frequency components.
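The client-side selection just exemplified may be sketched as follows (the component names and the 'low'/'high' frequency labelling are hypothetical; the three targets mirror the 4K/HD/mobile examples above):

```python
def select_versions(components, target):
    """Pick a priority level per component for the example targets:
    '4k' -> all components at priority 1; 'hd' -> priority 1 for
    low-frequency components and 2 for high-frequency ones;
    'mobile' -> only low-frequency components, at priority 3.
    `components` maps a component name to 'low' or 'high'."""
    if target == '4k':
        return {name: 1 for name in components}
    if target == 'hd':
        return {name: (1 if band == 'low' else 2)
                for name, band in components.items()}
    if target == 'mobile':
        return {name: 3 for name, band in components.items() if band == 'low'}
    raise ValueError(target)
```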
Via this mechanism a highly adaptive and fine-tuned selection of video encoding based on the video content and taking into account observed network conditions is possible.
Upon having received the request from the receiver comprising the requested components in a requested quality, NFM thus selects the appropriate encoded versions and transmits them further to R via the network. Within the receiver a de-packetization unit DPU unpacks the stream into its set of received streams Vde1r to Vdenr. These are provided to a decoder D which first decompresses them into Vd1r to Vdnr, after which steps they undergo a re-composition.
In
These files are further provided over the network to the receiver R, which first may temporarily store them in a receiver or de-packetization buffer (not shown on the figures) if needed, which de-packetization buffer may be part of the de-packetization unit DPU.
After de-packetization by DPU, received packets of the encoded received streams Vde1r and Vdenr are provided to the decoder D of the receiver. Therein a decompression first takes place; the resulting exemplary components are denoted Vd1r and Vdnr. These are next provided to a re-composition unit VR, which is adapted to again generate a received video file Vr, which may be provided to a display.
Such re-composition in VR can be done by simply reconstructing the decoded components at their specific location in the video. It can also include specific post-filters that enable smoothing of the transitions between regions of the video that were reconstructed from different components at different qualities. Possible decompositions are detailed in the embodiments below.
Partitioned components belonging to the same group or partition are treated together similarly (e.g. grouping all spatial parts of the frame belonging to a person in one group, and grouping the background in another group). This may be realized by a simple tiling of the input image, which can be done before or after the decomposition: before is possible especially in the case of wavelet decomposition, and also when selecting Fourier transforms or filter banks, while in all cases the tiling can also be done after the decomposition.
In this case the operator knows about the partitioning procedure, and can provide parameters that take this already into account.
The receiver of
In the embodiment of
Additional information can further be used during the relative importance estimation, e.g. by analyzing the video frames input, identifying objects of interests and their textures or contours and estimating which components contribute the most to their high quality reconstruction. To this end one possible approach is to reconstruct the image or video frames with only one component and compare the reconstructed image for one component and the original frame on the locations of the detected objects and contours.
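The reconstruct-and-compare estimation of component importance described above may be sketched as follows, under the assumption of an additive re-composition and a mean-squared-error metric (both are assumptions; `recompose` and the object mask are hypothetical inputs):

```python
import numpy as np

def component_importance(frame, components, recompose, mask=None):
    """Estimate each component's importance by reconstructing the frame
    from that component alone and measuring the error against the
    original, optionally restricted to a mask over detected objects."""
    if mask is None:
        mask = np.ones_like(frame, dtype=bool)
    scores = []
    for i in range(len(components)):
        solo = [c if j == i else np.zeros_like(c)
                for j, c in enumerate(components)]
        err = np.mean((recompose(solo)[mask] - frame[mask]) ** 2)
        scores.append(1.0 / (1.0 + err))       # low error -> high importance
    total = sum(scores)
    return [s / total for s in scores]         # normalized weights
```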
Based on this analysis again a set of parameters, represented by signal c in
An example of such filters are quadrature mirror filters. Such filters are designed to generate uniformly distributed sub-bands, each capturing a frequential sub-band, from low frequencies to high frequencies. The advantage of such an approach is that the human eye is much more sensitive to the low frequencies, but also requires sharp details in specific areas of the video frames. A filter bank enables decomposition into such frequency sub-bands, and these can later be encoded separately using standard video encoding techniques such as H.264 or H.265.
The filter bank at the sender side is usually referred to as the analysis filters, and those at the receiver side are then denoted synthesis filters. On
The analysis filters, and also the synthesis filters, can thus be implemented for example as quadrature mirror filters as detailed in Girod, B., Hartung, F., & Horn, U. (1996), "Subband image coding", in Subband and Wavelet Transforms (pp. 213-250), Springer US; but any other choice of quadrature mirror filters can be selected, as long as it prevents aliasing distortions. It is also possible to use machine learning to define the filters of the filter banks, in order to specialize such filters for detection and separation of specific textures in the images. Such techniques are e.g. described by Sainath, T. N., Kingsbury, B., Mohamed, A. R., & Ramabhadran, B. (2013, December), "Learning filter banks within a deep neural network framework", Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on (pp. 297-302), IEEE; and by Rigamonti, R., Sironi, A., Lepetit, V., & Fua, P. (2013, June), "Learning separable filters", Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (pp. 2754-2761), IEEE.
The filtered frames fi generated by the filter bank are obtained by convolving the input frames with the respective analysis filters, followed by down-sampling, with n, m being pixel indices. The reconstruction of the frames is obtained similarly, by up-sampling followed by filtering with the synthesis filters.
The N output sub-sampled frames fi that are thus generated by the analysis filter bank can be partitioned, such as in
At the receiver side, an up-sampling step and a mirror filter bank must be applied to reconstruct the video frames.
Note that in other embodiments the down- and upsampling steps are not present.
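A minimal two-channel example of such an analysis/synthesis pair, using the Haar filters (the simplest quadrature mirror pair, chosen here only for illustration) together with the down- and up-sampling by 2 mentioned above, is:

```python
import numpy as np

def haar_analysis(x):
    """Two-channel Haar analysis bank: filtering + down-sampling by 2.
    Returns the sub-sampled low-pass and high-pass signals (1D sketch;
    the input length must be even)."""
    x = np.asarray(x, dtype=float)
    low  = (x[0::2] + x[1::2]) / np.sqrt(2)
    high = (x[0::2] - x[1::2]) / np.sqrt(2)
    return low, high

def haar_synthesis(low, high):
    """Mirror synthesis bank: up-sampling + filtering gives perfect
    reconstruction of the analysis input."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2)
    x[1::2] = (low - high) / np.sqrt(2)
    return x
```

Applying the 1D pair along rows and then columns yields the 2D sub-band decomposition of a frame.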
In another embodiment, as depicted on
The spectral resolution is equal to the resolution of the block of pixels the FFT is applied to.
For the associated receiver module VR, depicted in
The decomposition here based on a multi-resolution wavelet transform can for example be based on the Discrete Wavelet Transform (DWT) as used in standards like JPEG2000, but other choices of wavelets (actually the choice of the wavelet basis function) are possible in order to provide more sensitivity to directivity of video content features. Similarly to the embodiment based on the spectral decomposition, components show peaks of energy or information but now localized on the pixel areas they relate to, while Fourier transform peaks are rather localized in low frequencies. Another slight difference is that the wavelet decomposition can be applied recursively into different levels, each yielding frequency components relating to another resolution. These various components computed by the wavelet decomposition can be called wavelet bands, with typically a horizontal, a vertical and a diagonal wavelet band (also called sub-band when different resolutions or levels are computed). Examples of how these wavelet bands map to the real image are represented on
As for spectral decompositions, each wavelet band can be subdivided and partitioned into different components. Video energy distribution is now local in both the spatial and the spectral dimensions, and the amplitude of the coefficients represents the importance of a given wavelet basis in the video signal representation. Partitioning can be done regularly in the wavelet domain by regularly subdividing the wavelet bands. The importance of such wavelet coefficients in the quality of the reconstruction is proportional to the square of the coefficient amplitude. The compression of wavelet coefficients can be done in a scalable manner by using quantization, bitplane representation and entropy coding, like e.g. in JPEG2000; however AVC could also be used to encode the wavelet bands as regular video frames. Similarly to the spectral decomposition embodiment, wavelets can also be applied in the temporal dimension, leading to 3D wavelet decompositions.
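A one-level 2D Haar wavelet decomposition into the LL approximation band and three detail bands, together with its inverse, may be sketched as follows (Haar is chosen only for simplicity; standards such as JPEG2000 use different wavelet filters, and the band naming convention is an assumption):

```python
import numpy as np

def dwt2_haar(a):
    """One-level 2D Haar decomposition: row pass then column pass,
    yielding the approximation band ll and detail bands lh, hl, hh."""
    a = np.asarray(a, dtype=float)
    L = (a[:, 0::2] + a[:, 1::2]) / 2.0   # row-wise average
    H = (a[:, 0::2] - a[:, 1::2]) / 2.0   # row-wise difference
    ll = (L[0::2] + L[1::2]) / 2.0
    lh = (L[0::2] - L[1::2]) / 2.0
    hl = (H[0::2] + H[1::2]) / 2.0
    hh = (H[0::2] - H[1::2]) / 2.0
    return ll, lh, hl, hh

def idwt2_haar(ll, lh, hl, hh):
    """Inverse of dwt2_haar: undo the column pass, then the row pass."""
    L = np.empty((2 * ll.shape[0], ll.shape[1])); H = np.empty_like(L)
    L[0::2], L[1::2] = ll + lh, ll - lh
    H[0::2], H[1::2] = hl + hh, hl - hh
    a = np.empty((L.shape[0], 2 * L.shape[1]))
    a[:, 0::2], a[:, 1::2] = L + H, L - H
    return a
```

For natural content most of the energy lands in ll, which is why that band obtains the highest priority in the scheme described above.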
Components generated by the decomposition and partitioning can now receive a priority that depends first on the resolution level of the wavelet decomposition and secondly on the energy of the wavelet component.
An embodiment for the reconstruction for wavelet-based decomposition is shown in
In the claims hereof any element expressed as a means for performing a specified function is intended to encompass any way of performing that function. This may include, for example, a) a combination of electrical or mechanical elements which performs that function or b) software in any form, including, therefore, firmware, microcode or the like, combined with appropriate circuitry for executing that software to perform the function, as well as mechanical elements coupled to software controlled circuitry, if any. The invention as defined by such claims resides in the fact that the functionalities provided by the various recited means are combined and brought together in the manner which the claims call for, and unless otherwise specifically so defined, any physical structure is of little or no importance to the novelty of the claimed invention. Applicant thus regards any means which can provide those functionalities as equivalent as those shown herein.
While the principles of the invention have been described above in connection with specific apparatus, it is to be clearly understood that this description is made only by way of example and not as a limitation on the scope of the invention, as defined in the appended claims.
Number | Date | Country | Kind |
---|---|---|---|
15307068.5 | Dec 2015 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/EP2016/081184 | 12/15/2016 | WO | 00 |