The present invention relates to methods for processing signals, such as by way of non-limiting examples video, image, hyperspectral image, audio, point clouds, 3DoF/6DoF and volumetric signals. Processing data may include, but is not limited to, obtaining, deriving, encoding, outputting, receiving and reconstructing a signal in the context of a hierarchical (tier-based) coding format, where the signal is decoded in tiers at subsequently higher levels of quality, leveraging and combining subsequent tiers (“echelons”) of reconstruction data. Different tiers of the signal may be coded with different coding formats (e.g., by way of non-limiting examples, traditional single-layer DCT-based codecs, ISO/IEC MPEG-5 Part 2 Low Complexity Enhancement Video Coding, SMPTE VC-6 2117, etc.), by means of different elementary streams that may or may not be multiplexed in a single bitstream.
In tier-based coding formats, such as ISO/IEC MPEG-5 Part 2 LCEVC (hereafter “LCEVC”), or SMPTE VC-6 2117 (hereafter “VC-6”), a signal is decomposed into multiple “echelons” (also known as “hierarchical tiers”) of data, each corresponding to a “Level of Quality” (also referred to herein as “LoQ”) of the signal, from the highest echelon at the sampling rate of the original signal to a lowest echelon, which typically has a lower sampling rate than the original signal. In the non-limiting example when the signal is a picture in a video stream, the lowest echelon may be a thumbnail of the original picture, e.g. a low-resolution frame in a video stream, or even just a single picture element. Other echelons contain information on corrections to apply to a reconstructed rendition in order to produce the final output. Echelons may be based on residual information, e.g. a difference between a version of the original signal at a particular level of quality and a reconstructed version of the signal at the same level of quality. A lowest echelon may not comprise residual information but may comprise the lowest sampling of the original signal. The decoded signal at a given Level of Quality is reconstructed by first decoding the lowest echelon (thus reconstructing the signal at the first, lowest, Level of Quality), then predicting a rendition of the signal at the second (next higher) Level of Quality, then decoding the corresponding second echelon of reconstruction data (also known as “residual data” at the second Level of Quality), then combining the prediction with the reconstruction data so as to reconstruct the rendition of the signal at the second (higher) Level of Quality, and so on, up to reconstructing the given Level of Quality.
Reconstructing the signal may comprise decoding residual data and using this to correct a version at a particular Level of Quality that is derived from a version of the signal from a lower Level of Quality. Different echelons of data may be coded using different coding formats, and different Levels of Quality may have different sampling rates (e.g., resolutions, for the case of image or video signals). Subsequent echelons may refer to a same signal resolution (i.e., sampling rate) of the signal, or to a progressively higher signal resolution. Examples of these approaches are described in more detail in the available specifications for LCEVC and VC-6.
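The tier-by-tier reconstruction described above can be illustrated with a minimal sketch. It assumes a one-dimensional signal, a factor-of-two nearest-neighbour upsampler as the predictor, and integer residuals; the function names are hypothetical, and the upsamplers and residual formats actually used by LCEVC and VC-6 are defined in their specifications.

```python
from typing import List, Sequence


def upsample_nearest(samples: Sequence[int], factor: int = 2) -> List[int]:
    """Predict the next Level of Quality by repeating each sample (assumed predictor)."""
    return [s for s in samples for _ in range(factor)]


def reconstruct(lowest: Sequence[int],
                residual_echelons: Sequence[Sequence[int]]) -> List[int]:
    """Rebuild the signal tier by tier: predict, then correct with residuals."""
    rendition = list(lowest)  # decoded lowest echelon
    for residuals in residual_echelons:
        prediction = upsample_nearest(rendition)
        if len(residuals) != len(prediction):
            raise ValueError("residual echelon does not match the predicted size")
        # Combine the prediction with the residual data for this Level of Quality.
        rendition = [p + r for p, r in zip(prediction, residuals)]
    return rendition


lowest_echelon = [10, 20]
echelons = [[0, 1, -1, 0], [0, 0, 1, 0, 0, 0, 0, 2]]
signal = reconstruct(lowest_echelon, echelons)
```

Stopping the loop early yields the rendition at an intermediate Level of Quality, which is what allows a decoder to reconstruct only up to the level it needs.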
The process of encoding and decoding a signal tends to be resource intensive. For example, video encoding and decoding requires a frame of data to be processed in fractions of a second (33 ms for frames at 30 Hz or 16 ms for frames at 60 Hz). Applications such as videoconferencing, which require both audio and video encoding and transmission over a network, often command a large proportion of available resources on a computing device. Additional challenges are also faced with mobile devices, which operate with more limited processing resources and typically use battery power. It is desired to provide improved encoding and decoding methods for variable real-world use conditions.
Aspects of the present invention are set out in the appended independent claims. Variations of the present invention are set out in the appended dependent claims. Additional variations and aspects are set out in the examples described herein.
In tier-based hierarchical coding technologies, such as those embodied in LCEVC and VC-6, a signal may require a varying amount of correction based on the fidelity of the predicted rendition of a given Level of Quality (LoQ). This correction is provided by “residual data” (or simply “residuals”) in order to generate a reconstruction of the signal at the given LoQ that best resembles (or even losslessly reconstructs) the original signal. In tier-based hierarchical coding, a signal may consist of multiple components or channels. For an audio signal, these may comprise components relating to different loudspeakers and/or microphones. For a video signal, these may comprise components relating to different colour channels. For example, LCEVC and VC-6 are configured to process different chroma planes (e.g., by way of non-limiting example, Y or luma, U chroma and V chroma). Chroma planes may be defined according to a specified colour encoding method and may be reconstructed to their target resolution by means of independent residual planes. Chroma planes may be processed in series or in parallel and may be combined in an output reconstruction for rendering on a display device. Further details of standardised processes for decoding chroma planes are described in the specifications for LCEVC and VC-6.
Encoding and/or decoding a signal requires efficient use of available resources. For example, hardware and/or software encoders and decoders need to efficiently control processor, memory and power utilization (amongst others). For mobile encoders and decoders, such as smartphones and tablets, power is typically provided by a battery. When battery consumption is relevant, e.g. when it needs to be conserved, encoding processing power is a relevant metric to minimize. In several devices (such as, by way of non-limiting example, mobile devices), power consumption is significantly influenced by the number of memory accesses and memory copies.
Certain novel embodiments illustrated herein allow encoding and/or decoding devices to flexibly save significant processing power by means of limiting the encoding of upper layer signals to a subset of available signal components. Surprisingly, only encoding one component of a signal at a higher level can still provide perceivable improvements in an output reconstruction, yet significantly reduce resource utilization. This makes it a suitable adaptation to known tier-based hierarchical coding approaches for efficient encoding and decoding when resources, e.g. on a computing device, are limited. In one example, restricting the encoding of components of the signal limits the generation of echelons of residuals for chroma planes for higher levels of quality.
Non-limiting embodiments illustrated herein refer to a signal as a sequence of samples. These samples may comprise, for example, two-dimensional images, video frames, video fields, sound frames, etc. In the description the terms “image”, “picture” or “plane” (intended with the broadest meaning of “hyperplane”, i.e., array of elements with any number of dimensions and a given sampling grid) will be often used to identify the digital rendition of a sample of the signal along the sequence of samples, wherein each plane has a given resolution for each of its dimensions (e.g., X and Y), and comprises a set of plane elements (or “element”, or “pel”, or display element for two-dimensional images often called “pixel”, for volumetric images often called “voxel”, etc.) characterized by one or more “values” or “settings” (e.g., by way of non-limiting examples, colour settings in a suitable colour space, settings indicating density levels, settings indicating temperature levels, settings indicating audio pitch, settings indicating amplitude, settings indicating depth, settings indicating alpha channel transparency level, etc.). Each plane element is identified by a suitable set of coordinates, indicating the integer positions of said element in the sampling grid of the image. Signal dimensions can include only spatial dimensions (e.g., in the case of an image) or also a time dimension (e.g., in the case of a signal evolving over time, such as a video signal).
As non-limiting examples, a signal can be an image, an audio signal, a multi-channel audio signal, a telemetry signal, a video signal, a 3DoF/6DoF video signal, a volumetric signal (e.g., medical imaging, scientific imaging, holographic imaging, etc.), a volumetric video signal, or even signals with more than four dimensions.
For simplicity, non-limiting embodiments illustrated herein often refer to signals that are displayed as 2D planes of settings (e.g., 2D images in a suitable colour space), such as for instance a video signal. The terms “picture”, “frame” or “field” will be used interchangeably with the term “image”, so as to indicate a sample in time of the video signal: any concepts and methods illustrated for video signals made of frames (progressive video signals) can be easily applicable also to video signals made of fields (interlaced video signals), and vice versa. Despite the focus of embodiments illustrated herein on image and video signals, people skilled in the art can easily understand that the same concepts and methods are also applicable to any other types of multidimensional signal (e.g., audio signals, volumetric signals, stereoscopic video signals, 3DoF/6DoF video signals, plenoptic signals, point clouds, etc.).
Components of a signal represent different “values” or “settings”. For example, these may comprise, as set out above, different colour channels, different sensor channels, different audio channels, metadata channels etc. For example, a different plane of samples as set out above may be provided for each of the different components, and an encoding and/or decoding process may be applied to each component plane in series or in parallel to generate encoded and decoded versions of the components. For ease of explanation, reference will be made herein to a YUV colour encoding of a video signal, where there are three components: Y, U and V. Y represents a luma or brightness channel and U and V represent different opponent colour channels. It should be noted that the described examples are not limited to YUV encodings and may be applied to different colour encodings (including RGB, Lab, YDbDr, XYZ etc.) and to non-colour examples. For example, for surround sound audio, there may be six audio channels including front left and right, surround left and right, centre and sub-woofer channels.
In a first aspect described herein, there is a method of encoding a signal using a hierarchical or multi-layer coding approach. The signal is encoded at a first layer using a first encoding module and at a second layer using a second encoding module. For example, the first encoding module may represent a base encoding layer and the second layer may represent an enhancement encoding layer. Alternatively, the first and second encoding modules may represent different sub-layers of an enhancement encoding layer. The signal is composed of two or more components.
In examples of the first aspect, the components encoded by the second encoding module comprise a subset of the components encoded by the first encoding module. This may be implemented by the method comprising sending a signal from the second encoding module to the first encoding module to instruct the first encoding module to provide to the second encoding module only a first component of the signal at the first layer. The signal may be sent when the second module determines that only the first component of the signal is to be encoded at the second layer. As the second encoding module only receives a subset of the components from the first encoding module, it may only encode what it receives. This not only reduces the memory use by the first and second encoding modules, it also reduces the computations performed by the second encoding module.
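The interaction between the two encoding modules can be sketched as follows. This is an illustrative model only: the class names, method names and the dictionary-of-planes representation are hypothetical, and the "encoding" step is replaced by a stand-in that simply reports which components were received.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Plane = List[int]


@dataclass
class FirstEncodingModule:
    """Hypothetical first-layer module that hands layer-1 data to the second module."""

    requested: Optional[Tuple[str, ...]] = None  # None: no signal received yet

    def receive_control(self, components: Tuple[str, ...]) -> None:
        # Signal from the second module: provide only these components from now on.
        self.requested = components

    def provide(self, layer1: Dict[str, Plane]) -> Dict[str, Plane]:
        names = tuple(layer1) if self.requested is None else self.requested
        # Only the requested planes are copied for the second module.
        return {name: list(layer1[name]) for name in names}


@dataclass
class SecondEncodingModule:
    """Hypothetical second-layer module; it can only encode what it receives."""

    first: FirstEncodingModule

    def restrict_to(self, components: Tuple[str, ...]) -> None:
        self.first.receive_control(components)

    def encode(self, layer1: Dict[str, Plane]) -> List[str]:
        received = self.first.provide(layer1)
        return list(received)  # stand-in for encoding: names of encoded components


layer1_data = {"Y": [1, 2, 3, 4], "U": [5, 6], "V": [7, 8]}
first = FirstEncodingModule()
second = SecondEncodingModule(first)

full_mode = second.encode(layer1_data)     # all components provided
second.restrict_to(("Y",))                 # second module sends the signal
reduced_mode = second.encode(layer1_data)  # only the first component provided
```

In this sketch the saving comes from `provide` copying one plane instead of three, and from the second module having less data to process.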
The encoding device 110 encodes the input signal 110 using at least a first layer (LAYER 1) using a first encoding module 120 and a second layer (LAYER 2) using a second encoding module 130. The input signal 110 is composed of two or more components, in
In
In
In certain examples, such as those similar to LCEVC, the first and second encoding modules 120, 130 may respectively implement different coding methods. For example, the first coding method may correspond to a single-layer coding method (such as AVC, HEVC, AV1, VP9, EVC, VVC, VC-6) whilst the second method may correspond to a different multi-layer coding method (such as LCEVC). In other examples, the first and second encoding modules 120, 130 may respectively implement the same coding method (such as VC-6 or AVC/HEVC).
The example of
In a first operating mode, e.g. according to a standard specification, the encoded second stream 150 also comprises encoded versions of the set of components (i.e., [E20, E21, E22]). The encoded second stream 150 is received by a second decoding module 230, which in the first operating mode may decode the encoded second stream 150 according to a standard specified decoding process (e.g., as specified for an enhancement stream in LCEVC or for an echelon in VC-6).
A second operating mode is shown in
In one case, the decoding device 200 is a passive device and simply decodes and reconstructs based on a set of received encoded streams. For example, if encoded component data is absent from the encoded second stream 150 (e.g., as is shown for components 1 and 2), then this data is not used in the reconstruction. In these cases, the received decoded first level data—DE11 and DE12— may be upscaled to a second level of quality without adding any additional residual data; whereas the decoded first level data for the first component—DE10—may be upscaled and then the decoded second level data—DE20—may be added to that upscaled first component data.
In another case, even if the second decoding module 230 receives encoded data for all three components in the encoded second stream 150, it may discard data for one or more components based on local processing conditions. For example, if resources are constrained at the decoding device 200, only one component may be decoded and used to output the reconstructed signal 240.
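Both decoder behaviours described above (reconstructing passively from whatever was received, and locally discarding second-level data) can be sketched in a few lines. The sketch assumes one-dimensional planes, nearest-neighbour upscaling and additive integer residuals; the function names and the `decode_only` parameter are hypothetical.

```python
from typing import Dict, List, Optional, Set

Plane = List[int]


def upscale(plane: Plane, factor: int = 2) -> Plane:
    """Nearest-neighbour upscaling (an assumption, not a standardised upsampler)."""
    return [s for s in plane for _ in range(factor)]


def reconstruct(first_level: Dict[int, Plane],
                second_level: Dict[int, Plane],
                decode_only: Optional[Set[int]] = None) -> Dict[int, Plane]:
    """Upscale every component; add residuals only where available and wanted."""
    output: Dict[int, Plane] = {}
    for index, plane in first_level.items():
        predicted = upscale(plane)
        residuals = second_level.get(index)
        if decode_only is not None and index not in decode_only:
            residuals = None  # received but discarded due to local conditions
        if residuals is None:
            output[index] = predicted
        else:
            output[index] = [p + r for p, r in zip(predicted, residuals)]
    return output


de1 = {0: [100, 50], 1: [7], 2: [9]}  # stands in for DE10, DE11, DE12

# Passive case: only second-level data for component 0 (DE20) was received.
passive = reconstruct(de1, {0: [1, -1, 0, 2]})

# Constrained case: data for all components received, only component 0 decoded.
constrained = reconstruct(
    de1, {0: [1, -1, 0, 2], 1: [1, 1], 2: [2, 2]}, decode_only={0}
)
```

Both calls produce the same output here, which reflects the point that a missing component and a locally discarded component are handled by the same fallback: the upscaled first-level data.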
In the examples described herein, one or more of the decoding device and the encoding device may be a mobile device, such as a mobile phone, a tablet, a laptop, a low-power portable device (e.g., a smartwatch), etc. In one case, a device may comprise both encoding and decoding devices, e.g. a mobile phone holding a video-conference may simultaneously encode and decode video streams or a voice assistant may simultaneously encode and decode audio streams.
In certain examples, the control signal (CTRL) described above is sent when the second module determines that only the first component of the signal is to be encoded at the second layer. For example, it may be an optional signal, whereas in the absence of the signal encoding is performed according to a standardised process (such as LCEVC or VC-6). Hence, the examples described herein may comprise an optional “out-of-standard” enhancement that does not affect standardised encoding or decoding; it may be added as an optional feature in certain devices (e.g. mobile or resource-limited devices).
At block 320, the determined resource condition at block 310 is evaluated to determine if resource use is to be reduced. This may be performed by comparing a measured resource condition with a defined threshold. For example, this may comprise a requirement to reduce power consumption, e.g. based on a battery capacity falling below a threshold value, or to reduce CPU/GPU loading, e.g. based on a threshold utilisation being exceeded. The condition may comprise a requirement to reduce the number of processing operations to be performed in the encoding of the signal. The processing operations may comprise reading and/or writing to memory. For example, these may be mem-copy operations.
Based on the evaluation at block 320, one of blocks 330 or 340 is selected. If resource use does not need to be reduced, e.g. because one or more resource metrics are within acceptable ranges, then at block 330, a full set of components are encoded at a second encoding module, such as 130 in
Reducing the encoded components may reduce resource use in multiple ways. Processing resources used to encode and/or decode components at one or more of the first and second encoding modules may be saved. Memory use may be reduced by only copying, by the first encoding module, one component from many into memory for access by the second encoding module. The modules described herein may be configured to flexibly encode and/or decode based on a received signal, such that a minimal level of control signalling is required to flexibly change the encoding and decoding approaches (e.g. just the signal from the second encoding module to the first encoding module may be required).
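The threshold-based selection between encoding a full set of components and a reduced set can be sketched as below. The metric names and the threshold values (20% battery, 85% processor utilisation) are illustrative assumptions, not values taken from the description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ResourceCondition:
    """Hypothetical measured condition; both values are fractions in [0, 1]."""
    battery_fraction: float
    cpu_utilisation: float


def must_reduce(condition: ResourceCondition,
                min_battery: float = 0.20,
                max_cpu: float = 0.85) -> bool:
    """Compare the measured condition with assumed thresholds."""
    return (condition.battery_fraction < min_battery
            or condition.cpu_utilisation > max_cpu)


def components_for_second_layer(
        condition: ResourceCondition,
        all_components: Tuple[str, ...] = ("Y", "U", "V")) -> Tuple[str, ...]:
    """Return the components to encode at the second layer."""
    if must_reduce(condition):
        return all_components[:1]  # reduced set: first component only
    return all_components          # full set


healthy = components_for_second_layer(ResourceCondition(0.80, 0.40))
low_battery = components_for_second_layer(ResourceCondition(0.10, 0.40))
busy_cpu = components_for_second_layer(ResourceCondition(0.80, 0.95))
```

Because the decision is a pure function of the measured condition, it can be re-evaluated per frame or per group of frames without restarting the encoder.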
In certain cases, the first encoding module may implement a first encoding method, and the second encoding module may implement a second encoding method. The first encoding method may be different from the second encoding method. The first encoding method may alternatively be the same as the second encoding method. The first layer is at a lower level in the hierarchy than the second layer. For example, the first layer may be at a lower resolution than the second layer.
The method of
In this method, providing only the first component of the signal may comprise processing, by the first module, the two or more components of the signal and passing, by the first module to the second module, only the first component of the signal. As discussed above, providing only the first component of the signal may comprise writing to a memory, by the first module, only the first component of the signal. It may also or alternatively comprise encoding, by the first module, only the first component of the signal. The first layer may be at a lower level in the hierarchy than the second layer. For example, the first layer may be at a lower resolution than the second layer.
A corresponding method of decoding a signal using a hierarchical coding approach may also be provided. This may be based on the arrangement of
A method of encoding a signal may also be performed by the first encoding module in a set of encoding modules. In this case, the first encoding module may receive a signal from a second encoding module, e.g. as shown in
In another example, there is provided a method of encoding a signal using a hierarchical coding approach, wherein the signal is encoded at a first layer using a first encoding module and at a second layer using a second encoding module, and wherein the signal is composed of two or more components, the method comprising sending a signal from the first encoding module to the second encoding module to instruct the second encoding module to only encode a first component of the signal at the second layer. In this case, the signal may be sent from the first encoding module to the second encoding module. For example, the signal may be sent when the first encoding module determines that only the first component of the signal is to be encoded at the second layer. The determination may comprise determining a condition requiring that only the first component of the signal should be provided. The condition may comprise encoding a signal for a low-power service. The low-power service may comprise a videoconferencing service. The condition may comprise a requirement to reduce power consumption. The condition may comprise a requirement to reduce the number of processing operations to be performed in the encoding of the signal. The processing operations may comprise reading and/or writing to memory. For example, these may be mem-copy operations. The first encoding module may implement a first encoding method, and the second encoding module may implement a second encoding method. The first encoding method may be different from the second encoding method. The first encoding method may alternatively be the same as the second encoding method.
According to one specific implementation, a signal processor (e.g., computer processor hardware) is configured to receive a signal composed of multiple planes and encode it (“encoder”). For example, the planes may correspond to colour planes in a video or image signal, for instance a luma plane (Y) and two chroma planes (U and V). The encoder produces for each plane (for instance, the colour planes) of the signal a rendition of the signal at a first level of quality (e.g., a lower level) and encodes it with a first coding method. It then produces a predicted rendition of the signal at a second level of quality (e.g., a higher level), and correspondingly produces and encodes a layer (e.g., an echelon) of residual data at the second level of quality to apply to the predicted rendition of the signal at the second level of quality in order to produce a corrected rendition of the signal at the second level of quality. The predicted rendition of the signal may be generated by a scaling process, for example an upscaling, applied to said rendition of the signal at the first level of quality. Upon detecting that chroma processing should be limited to the lower level of quality, the encoder may generate and encode an echelon of residual data at the second level of quality for the luma component of the signal only, without also generating layers (e.g. echelons) of residual data at the second level of quality for the chroma components of the signal. The residual data may be encoded with a second coding method. In one embodiment, the first and second encoding methods are the same encoding methods. In a different embodiment, the first and second encoding methods are different. A similar approach may be applied for multi-channel audio data, where residual data may only be provided for certain audio channels at a higher level of quality (e.g. a higher sampling or bit rate or a wider frequency range).
In this case, audio output devices that normally output human speech, such as central and front speakers, may have corresponding audio channels (i.e. components) that are encoded by the second encoding module and audio output devices such as surround and sub-woofer speakers may receive components that are only encoded by the first encoding module (e.g., that are reconstructed by the second processing module without encoded elements from enhancement streams). This may save resources yet have a minimal effect on sound perception.
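The encoder implementation described above can be sketched for the video case as follows. The sketch assumes one-dimensional planes of even length, pairwise-average downsampling, nearest-neighbour upscaling, and a lossless pass-through in place of the first coding method; all function names are hypothetical. The same structure would apply to audio by listing channel names instead of colour planes.

```python
from typing import Dict, List, Sequence, Tuple

Plane = List[int]


def downsample(plane: Plane) -> Plane:
    """Average neighbouring pairs (assumed downsampler; expects even length)."""
    return [(plane[i] + plane[i + 1]) // 2 for i in range(0, len(plane), 2)]


def upsample(plane: Plane) -> Plane:
    """Nearest-neighbour prediction of the second level of quality (assumed)."""
    return [s for s in plane for _ in range(2)]


def encode(planes: Dict[str, Plane],
           enhance: Sequence[str] = ("Y",)) -> Tuple[Dict[str, Plane], Dict[str, Plane]]:
    """Return (first-level renditions, second-level residual echelons)."""
    # Every plane is represented at the first (lower) level of quality.
    # A real first coding method would be lossy; here it is a pass-through.
    base = {name: downsample(plane) for name, plane in planes.items()}
    residuals: Dict[str, Plane] = {}
    for name in enhance:  # residuals are generated only for selected planes
        predicted = upsample(base[name])
        residuals[name] = [o - p for o, p in zip(planes[name], predicted)]
    return base, residuals


frame = {"Y": [10, 12, 20, 22], "U": [5, 5, 6, 6], "V": [1, 3, 1, 3]}
base_layer, enhancement = encode(frame)                   # luma-only enhancement
_, full_enhancement = encode(frame, enhance=("Y", "U", "V"))
```

With the default `enhance=("Y",)` the loop body runs once instead of three times, which is where the reduction in computation and memory traffic arises in this simplified model.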
In a corresponding specific decoder implementation, a signal processor configured as a decoder receives an encoded signal, obtains a rendition of the signal at a first (lower) level of quality and produces a predicted rendition of the signal at a second (higher) level of quality, the second level of quality having a higher resolution (i.e., signal sampling rate) than the first level of quality. The predicted rendition of the signal may be generated by a scaling process, for example an upscaling, applied to said rendition of the signal at the first level of quality. The decoder may then receive and decode one or more echelons of residual data to apply to the predicted rendition of the signal to produce a corrected rendition of the signal at the second level of quality. When detecting that no echelons of residual data were encoded for one or more chroma planes of the signal, the decoder outputs for said chroma planes the predicted rendition of the planes at the second level of quality. In some examples, a bit in the decoded bitstream signals to the decoder the presence or absence of residual data at a given level of quality for a chroma plane.
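One way a decoder might act on per-plane presence signalling is sketched below. The layout of the flags (one bit per plane, least significant bit for Y) is purely hypothetical and is not the syntax of LCEVC, VC-6 or any other standard; it only illustrates the decision the decoder makes for each plane.

```python
from typing import Dict, List, Optional

PLANES = ("Y", "U", "V")  # assumed plane order for the hypothetical flag field


def residual_presence(flags: int) -> Dict[str, bool]:
    """Map a hypothetical flag field to per-plane presence of residual data."""
    return {name: bool((flags >> bit) & 1) for bit, name in enumerate(PLANES)}


def output_plane(predicted: List[int],
                 residuals: Optional[List[int]],
                 present: bool) -> List[int]:
    """Output the corrected rendition if residuals are present, else the prediction."""
    if not present or residuals is None:
        return list(predicted)
    return [p + r for p, r in zip(predicted, residuals)]


presence = residual_presence(0b001)  # only the luma plane carries residuals
luma = output_plane([11, 11, 21, 21], [-1, 1, -1, 1], presence["Y"])
chroma_u = output_plane([5, 5, 6, 6], None, presence["U"])
```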
In certain examples, the encoder is configured not to process and encode layers (e.g., echelons) of residual data for chroma planes at the second level of quality in case of specific applications, such as by means of non-limiting example videoconferencing. In other non-limiting embodiments, the encoder is configured not to process and encode echelons of residual data for chroma planes at the second level of quality in case of remaining battery falling below a threshold.
According to certain examples described herein, a signal processor is configured to receive a signal and encode it with a hybrid tier-based encoding method, such as by way of non-limiting example MPEG-5 Part 2 LCEVC (Low Complexity Enhancement Video Coding) or SMPTE VC-6 ST2117. The encoder receives the signal, downsamples it to a lower level of quality, produces for each colour plane of the signal a rendition of the signal at a first (lower) level of quality and encodes it with a codec implementing a first coding method. In some examples, the codec implementing the first coding method is a hardware codec. The encoder then receives from the hardware codec the decoded reconstruction of said first coding process, produces a predicted rendition of the signal at a second (higher) level of quality, and correspondingly produces and encodes an echelon of residual data at the second level of quality to apply to the predicted rendition of the signal at the second level of quality in order to produce a corrected rendition of the signal at the second level of quality. When detecting that chroma processing should be limited to a lower level of quality, the encoder signals to the codec implementing the first coding method that chroma residual data at a higher level of quality will not be produced. As a consequence, the codec implementing the first coding method will not provide to the encoder the decoded reconstructions of chroma planes at the first level of quality.
In certain examples, when receiving a signal from the encoder indicating that chroma residual data will not be produced, the codec implementing the first coding method will not perform mem-copy (memory copy) operations to provide the encoder with decoded reconstructions of chroma planes at the first level of quality, with consequent savings in processing power and battery power consumption. Correspondingly, the encoder will not perform memory operations and computing operations on chroma planes, producing further savings in processing power consumption. In a further embodiment, there is provided an instantiation of the encoding pipeline in order to allow on the fly disablement of the encoding of the chroma planes as described in the present description.
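The scale of the mem-copy saving can be estimated with simple arithmetic. The sketch below assumes planar 8-bit YUV 4:2:0 data (each chroma plane has half the luma width and half the luma height), a first level of quality of 960x540 and 30 frames per second; these figures are illustrative assumptions rather than values from the description.

```python
from typing import Dict


def plane_bytes(width: int, height: int, bytes_per_sample: int = 1) -> int:
    """Size of one plane in bytes."""
    return width * height * bytes_per_sample


def chroma_copy_savings(width: int, height: int, fps: int,
                        sub_x: int = 2, sub_y: int = 2,
                        bytes_per_sample: int = 1) -> Dict[str, float]:
    """Estimate bytes not copied when chroma reconstructions are not provided."""
    luma = plane_bytes(width, height, bytes_per_sample)
    # Two chroma planes (U and V), each subsampled by sub_x and sub_y.
    chroma = 2 * plane_bytes(width // sub_x, height // sub_y, bytes_per_sample)
    return {
        "luma_bytes_per_frame": luma,
        "chroma_bytes_per_frame": chroma,
        "saved_fraction": chroma / (luma + chroma),
        "saved_bytes_per_second": chroma * fps,
    }


savings = chroma_copy_savings(width=960, height=540, fps=30)
```

Under these assumptions the chroma planes account for one third of the bytes per frame, so skipping their copy avoids roughly 7.8 MB of memory traffic per second for this single transfer alone, before counting any downstream processing on those planes.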
In certain examples, responsive to detecting that a specific use case requires a higher quality reconstruction, the encoder is configured to process residual data for all chroma planes, and signal to the codec implementing the first coding method that all chroma reconstructions at the first level of quality will be necessary.
In the example of
In preferred examples, a particular subset of components may be selected for encoding when resources are constrained. For example, with colour components it has been found that encoding residual data for only lightness or contrast planes, and not encoding chroma planes, produces improved perception of video quality over no encoded residual data at that level of quality but uses considerably fewer resources (e.g., 33% of the encoding resources). While quality is best when all components are encoded, this may not be possible when resources are limited, e.g. when applications take processing resources during a video call or when a mobile phone runs low on battery; in these cases reducing the encoded components can help slow resource drain yet provide adequate quality to continue the call. Also, the systems and methods discussed herein may be flexibly and dynamically applied during encoding without needing to stop or start the video stream, meaning that falling back to a reduced number of components is graceful and can provide an improved visual experience compared with falling back to a lower level of quality immediately.
The techniques described herein may be implemented in software or hardware, or may be implemented using a combination of software and hardware. They may include configuring an apparatus to carry out and/or support any or all of the techniques described herein.
The above embodiments are to be understood as illustrative examples. Further embodiments are envisaged. It is to be understood that any feature described in relation to any one embodiment may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the embodiments, or any combination of any other of the embodiments.
Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.
Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/GB2020/052616 | 10/16/2020 | WO |

Number | Date | Country
---|---|---
62923380 | Oct 2019 | US