The present invention relates to the processing of a multi-layer video stream. In particular, the present invention relates to one or more of the encoding and the decoding of a multi-layer video stream, for example using different approaches to communicate the multi-layer stream to a decoding device and enable efficient decoding.
Multi-layer video coding schemes have existed for a number of years but have experienced problems with widespread adoption. Much of the video content on the Internet is still encoded using H.264 or MPEG-4 Part 10, Advanced Video Coding (MPEG-4 AVC), with this format being used for between 80% and 90% of online video content. This content is typically supplied to decoding devices as a single video stream that has a one-to-one relationship with available hardware and/or software video decoders, e.g. a single stream is received, parsed and decoded by a single video decoder to output a reconstructed video signal. Many video decoder implementations are thus developed according to this framework. To support different encodings, decoders are generally configured with a simple switching mechanism that is driven based on metadata identifying a stream format.
Existing multi-layer coding schemes include the Scalable Video Coding (SVC) extension to H.264, Scalable extensions to H.265 or MPEG-H Part 2 High Efficiency Video Coding (SHVC), and newer standards such as MPEG-5 Part 2 Low Complexity Enhancement Video Coding (LCEVC). While H.265 is a development of the coding framework used by H.264, LCEVC takes a different approach to scalable video. SVC and SHVC operate by creating different encoding layers and feeding each of these with a different spatial resolution. Each layer encodes the input according to a normal AVC or HEVC encoder with the possibility of leveraging information generated by lower encoding layers. LCEVC, on the other hand, generates one or more layers of enhancement residuals as compared to a base encoding, where the base encoding may be of a lower spatial resolution.
One reason for the slow adoption of multi-layer coding schemes has been the difficulty adapting existing and new decoders to process multi-layer encoded streams. As discussed above, video streams are typically single streams of data that have a one-to-one pairing with a suitable decoder, whether implemented in hardware or software or a combination of the two. Client devices and media players, including Internet browsers, are thus built to receive a stream of data, determine what video encoding the stream uses, and then pass the stream to an appropriate video decoder. Within this framework, multi-layer schemes such as SVC and SHVC have typically been packaged as larger single video streams containing multiple layers, where these streams may be detected as “SVC” or “SHVC” and the multiple layers extracted from the single stream and passed to an SVC or SHVC decoder for reconstruction. This approach, though, often negates some of the benefits of multi-layer encodings. Hence, many developers and engineers have concluded that multi-layer coding schemes are too cumbersome and have returned instead to multicasting single H.264 video streams.
It is thus desired to obtain an improved method and system for decoding multi-layer video data that overcomes some of the disadvantages discussed above and that allows more of the benefits of multi-layer coding schemes to be realised.
The paper “The Scalable Video Coding Extension of the H.264/AVC Standard” by Heiko Schwarz and Mathias Wien, as published in IEEE Signal Processing Magazine 135, March 2008, provides an overview of the SVC extension.
The paper “Overview of SHVC: Scalable Extensions of the High Efficiency Video Coding Standard” by Jill Boyce, Yan Ye, Jianle Chen, and Adarsh K. Ramasubramonian, as published in IEEE Transactions on Circuits and Systems for Video Technology, VOL. 26, NO. 1, January 2016, provides an overview of the SHVC extensions.
The decoding technology for LCEVC is set out in the Draft Text of ISO/IEC FDIS 23094-2 as published at Meeting 129 of MPEG in Brussels in January 2020, as well as the Final Approved Text and WO 2020/188273 A1.
US 2010/0272190 A1 describes a scalable transmitting/receiving apparatus and a method for improving availability of a broadcasting service, which can allow a reception party to select an optimum video according to an attenuation degree of a broadcasting signal by scalably encoding video data and transmitting it by a different transmission scheme for each layer. US 2010/0272190 A1 encodes HD and SD video streams using an H.264 scalable video encoder (i.e., using SVC) and generates different layers of the SVC encoding using different packet streams. At a decoding device, a DVB-S2 receiver/demodulator receives/demodulates a satellite broadcasting signal from a transmitting satellite and restores a first layer packet stream and a second layer packet stream. At the decoding device, a scalable combiner combines the restored first- and second-layer packet streams in input order generating a single transport stream. A subsequent demultiplexer demultiplexes and depacketizes the combined transport stream and splits it into first- and second-layer video streams, which are then passed to an H.264 scalable video decoder for decoding and generation of a reconstruction of the original HD video stream.
WO 2017/141038 A1 describes a physical adapter that is configured to receive a data stream comprising data useable to derive a rendition of a signal at a first level of quality and reconstruction data produced by processing a rendition of the signal at a second, higher level of quality and indicating how to reconstruct the rendition at the second level of quality using the rendition at the first level of quality. WO 2017/141038 A1 describes how a presentation timestamp (PTS) may be used to synchronise different elementary streams, a first elementary stream with a first packet identifier (PID) and a second elementary stream with a second packet identifier (PID).
With multi-layer streams there is also a general problem of stream management. Different layers of a multi-layer stream may be generated together or separately, and may be supplied together or separately. It is desired to have improved methods and systems for transmission and re-transmission of multi-layer streams over a network. For example, it is desired to allow content distributors to easily and flexibly modify video quality by adding additional layers in a multi-layer scheme. It is also desired to be able to flexibly re-multiplex multi-layer video streams without breaking downstream multi-layer decoding.
There is also a problem of supplying multi-layer streams as static file formats. For example, video streams may be read from fixed or portable media, such as solid-state devices or portable disks, or downloaded and stored as a file for later viewing. It is difficult to support the carriage of multi-layer video with existing file formats, as these file formats typically assume a one-to-one mapping with media content and decoding configurations, whereas multi-layer streams may use different decoding configurations for different layers. Changes in file formats often do not work practically, as they require updates to decoding hardware and software and may affect the decoding of legacy formats.
All of the publications set out above are incorporated by reference herein.
Aspects of the present invention are set out in the appended independent claims. Variations of these aspects are set out in the appended dependent claims.
Certain examples described herein allow decoding devices to be easily adapted to handle multi-layer video coding schemes. Certain examples are described with reference to an LCEVC multi-layer video stream, but the general concepts may be applied to other multi-layer video schemes including SVC and SHVC, as well as multi-layer watermarking and content delivery schemes.
Different examples are presented. In one set of examples a single or joint packet stream is generated for the multi-layer video stream. This may be a joint elementary packet stream. The single or joint packet stream may be processed in a one-to-one manner by existing decoders despite containing data for multiple levels or layers of the multi-layer video stream. For example, backward compatibility may be maintained by passing a single or joint data stream comprising encoded data for multiple layers to a first layer decoder, where the configuration of the single or joint data stream is such that data relating to layers other than the first layer is ignored. Other layer decoders, including those that operate according to different decoding methods (e.g., based on different video coding standards) may receive either the single or joint data stream or other layer data from said stream and provide enhancements to the first layer decoding.
A number of examples will now be described with reference to the accompanying Figures.
In certain cases, there may be special PID values that are reserved for indexing tables. In one case, one PID value may be reserved for a program association table (PAT) that contains a directory listing of a set of program map tables, where a program map table (PMT) comprises a mapping between one or more PID values and a particular “program”. Originally, a “program” related to a particular broadcast program, but with Internet streaming the term is used more broadly to relate to the content of a particular video stream. PMTs may provide additional metadata regarding content that is transmitted as part of a PID stream 104 within the Transport Stream 102. The PMT may comprise program descriptors. These are sets of bytes (multiples of 8 bits), where a length of the descriptors may be defined (e.g., a length may indicate that N descriptors follow, each of 8 bits). Descriptors may be provided for an entire MPEG-2 program or for individual elementary streams. They may be optionally provided, such that some elementary streams do not carry descriptors. In certain cases, descriptors may be provided generally as part of program-specific information (PSI), which comprises metadata for content that is supplied as part of a transport stream. In certain examples described later in this description, the descriptors may be used to pair different layers within a multi-layer stream without breaking backward decoding capabilities for lower layers.
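By way of illustration only, the following Python sketch shows how the elementary-stream loop and descriptor loops of a PMT section may be walked once the section bytes have been extracted from the transport stream. The field offsets follow the MPEG-2 systems layout, but the function names and the omission of CRC checking are simplifications introduced here for illustration.

```python
def parse_descriptors(buf):
    """Walk a descriptor loop: each descriptor is a tag byte, a length byte, then payload."""
    descriptors, i = [], 0
    while i + 2 <= len(buf):
        tag, length = buf[i], buf[i + 1]
        descriptors.append((tag, buf[i + 2:i + 2 + length]))
        i += 2 + length
    return descriptors


def parse_pmt(section):
    """Return (elementary_PID, stream_type, descriptors) tuples from a PMT section (CRC ignored)."""
    section_length = ((section[1] & 0x0F) << 8) | section[2]
    program_info_length = ((section[10] & 0x0F) << 8) | section[11]
    i = 12 + program_info_length          # start of the elementary-stream loop
    end = 3 + section_length - 4          # stop before the trailing CRC_32
    streams = []
    while i < end:
        stream_type = section[i]
        elementary_pid = ((section[i + 1] & 0x1F) << 8) | section[i + 2]
        es_info_length = ((section[i + 3] & 0x0F) << 8) | section[i + 4]
        descriptors = parse_descriptors(section[i + 5:i + 5 + es_info_length])
        streams.append((elementary_pid, stream_type, descriptors))
        i += 5 + es_info_length
    return streams
```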
Both BMFF and CMAF define “containers”, which are portions of the file structure that store encoded media data. Metadata for a container may define an encoding standard that has been used to generate the encoded media data. In one case, the file format may be defined in a file type field (“ftyp”) at the beginning of the file that specifies the encoding format, e.g. “AVC1” for AVC or “HEVC” for HEVC. The file format may be parsed by a decoding device and used to activate a suitable decoder for the file format. This form of file format identification, however, requires a one-to-one mapping between the identified file format and the decoder implementation. While this works for monolithic scalable technologies such as SVC or SHVC, where a single decoder receives and decodes all the layers within a multi-layer video encoding, it does not work when different decoders are used for different layers (such as in LCEVC). It also causes problems with backwards compatibility. For example, a stream tagged as “SVC1” would be passed to an SVC decoder and would raise an error if an SVC1 decoder was not present, despite each layer within SVC being based on the AVC format.
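As an illustration of the file format identification discussed above, the short sketch below reads the leading “ftyp” box of an ISO BMFF or CMAF file and returns its brand identifiers. The layout shown (a 32-bit box size, a four-character box type, a major brand, a minor version and a list of compatible brands) follows the ISO base media file format; handling of extended 64-bit box sizes is omitted for brevity.

```python
import struct

def read_ftyp(path):
    """Return (major_brand, minor_version, compatible_brands) from the leading ftyp box."""
    with open(path, "rb") as f:
        size, box_type = struct.unpack(">I4s", f.read(8))
        if box_type != b"ftyp":
            raise ValueError("file does not start with an ftyp box")
        body = f.read(size - 8)
    major_brand = body[0:4].decode("ascii")
    minor_version = struct.unpack(">I", body[4:8])[0]
    compatible = [body[i:i + 4].decode("ascii") for i in range(8, len(body), 4)]
    return major_brand, minor_version, compatible
```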
In certain examples described herein a flexible method of decoding multi-layer video is provided. In these examples, a “nested” identifier is used to identify encoded multi-layer video content within a file. The “nested” identifier operates as a valid identifier for at least one layer of the multi-layer video content (e.g., for a base layer). As such, even if decoders for one or more other layers of the multi-layer video content are not available, the at least one layer may be passed for decoding using an available first decoder (e.g., a legacy decoder) by parsing the nested identifier. In this case, data for the one or more other layers in the multi-layer video content may be ignored and only data for the at least one layer decoded using available decoders. However, if decoders for one or more other layers are available, these may be activated based on additional information derived from the nested identifier and passed at least the data for the one or more other layers. In certain cases, the data for all the layers are passed to each decoder and decoders are configured to ignore data that does not relate to their given layer (e.g., data tagged with values that are not recognised by a decoder may be ignored using default functionality of the decoder). In one case, the nested identifier may be implemented using the descriptor fields of a transport and/or elementary stream.
Following the parsing of the stream data 205 by the stream receiver 210, the stream receiver 210 is configured to instantiate one or more decoders to decode the individual data streams contained within the stream data 205.
The video decoders 220 shown in
Once a particular PID stream or track is identified, the ES parser 222 is configured to process the data of that stream or track to provide encoded media data to the AU producer 224. This may comprise extracting data from the payloads of multiple TS packets 110 to form a PES 106 as shown in
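Purely as an illustrative sketch, the functions below show one way the payloads of 188-byte transport stream packets carrying a given PID may be gathered back into PES packets. The packet field layout follows MPEG-2 transport streams; the resynchronisation, continuity-counter checking and PES header parsing that a production demultiplexer would perform are omitted, and the function names are arbitrary.

```python
def ts_payloads(ts_bytes, wanted_pid):
    """Yield (payload_unit_start, payload_bytes) for TS packets that match one PID."""
    for offset in range(0, len(ts_bytes) - 187, 188):
        pkt = ts_bytes[offset:offset + 188]
        if pkt[0] != 0x47:                              # lost sync; a real demuxer would resync
            continue
        pid = ((pkt[1] & 0x1F) << 8) | pkt[2]
        if pid != wanted_pid:
            continue
        pusi = bool(pkt[1] & 0x40)                      # payload_unit_start_indicator
        afc = (pkt[3] >> 4) & 0x3                       # adaptation_field_control
        start = 4
        if afc & 0x2:                                   # skip any adaptation field
            start += 1 + pkt[4]
        if afc & 0x1:                                   # payload present
            yield pusi, pkt[start:]


def collect_pes_packets(ts_bytes, pid):
    """Re-assemble PES packets: a new PES packet begins where the start indicator is set."""
    pes_packets, current = [], bytearray()
    for pusi, payload in ts_payloads(ts_bytes, pid):
        if pusi and current:
            pes_packets.append(bytes(current))
            current = bytearray()
        current.extend(payload)
    if current:
        pes_packets.append(bytes(current))
    return pes_packets
```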
The renderer 226 is configured to decode the media sample data received from the AU producer 224 to produce a decoded medium 228 (e.g., a frame, a subtitle or an audio sample) ready to be output, e.g. on a display. The renderer 226 may comprise a specific codec or decoder, such as a H.264 or H.265 decoder (or an MP3/4 or Advanced Audio Coding-AAC decoder for audio data).
In the examples of
Within the context of the comparative examples of
With newer multi-layer video coding schemes, such as LCEVC, it is becoming possible to have a multi-layer coding scheme where different layers within the coding scheme are encoded according to different coding standards. For example, an LCEVC data stream may comprise a lower resolution H.264, H.265, or Versatile Video Coding (VVC) “base” layer stream and an LCEVC-specific “enhancement layer” stream, where the LCEVC-specific “enhancement layer” stream may in turn comprise different sub-layers. The “base” layer stream may thus take a variety of forms and may comprise pre-existing and/or independent streams, e.g. a video distributor may provide an LCEVC-specific “enhancement layer” stream “on top of” an existing and/or independent “base” layer stream. In LCEVC, the sub-layers comprise encoded residual data for application to a decoded output of the “base” layer stream.
Turning to the example of
The stream receiver 210 in the example 300 performs functions similar to the stream receiver 210 described with reference to
In the example of
In the present example 300, the stream generator 310 takes the two packet sub-streams and generates a joint elementary packet stream comprising a sequence of packets comprising data for both the first and the second layers. The joint elementary packet stream may comprise a joined PES 508 as shown in
Returning to
The first layer decoder 322 is configured to receive the joint elementary packet stream generated by the stream generator and to output a decoding of the data for the first layer within the joint elementary packet stream. The first layer decoder 322 may comprise a normal single layer decoder (e.g., a H.264 or H.265 decoder). In this case, NAL units for the second layer in the joint elementary packet stream, such as NAL units 524, may be ignored by the first layer decoder 322, e.g. based on unit type values within the headers of those NAL units that are not recognised by the first layer decoder 322.
The second layer decoder 324 also receives the joint elementary packet stream generated by the stream generator 310. The second layer decoder 324 decodes the data for the second layer, e.g. the data contained in NAL units 524, to output a decoding of the data for the second layer within the joint elementary packet stream. In the examples described herein, the second layer decoder 324 is different to the first layer decoder 322. For example, the first layer decoder 322 may be a legacy hardware and/or software video decoder that complies with a first video coding standard (e.g., H.264, H.265, VVC etc.) and the second layer decoder 324 may be an enhancement hardware and/or software video decoder that complies with a second, different video coding standard (e.g., LCEVC). The second layer decoder 324 may decode a residual signal whereas the first layer decoder 322 may decode a video (non-residual) signal. The residual signal may comprise a plurality of sub-layers representing different levels of quality (e.g., different spatial resolutions).
In the present example, the multi-layer controller 326 is communicatively coupled to the first layer decoder 322 and the second layer decoder 324. The multi-layer controller 326 is configured to combine an output of the first layer decoder 322 and an output of the second layer decoder 324 to provide a multi-layer reconstruction of the video signal. Although shown as a separate component, in certain implementations the multi-layer controller 326 may form part of the second layer decoder 324. For example, an enhancement decoder may comprise a second layer decoder in the form of a residual decoder and a controller to apply decoded residual data to a decoded frame of video from the first layer decoder 322. The multi-layer controller 326 may receive the output of the first and second layer decoders 322, 324 directly or indirectly. In the latter case, the multi-layer controller 326 may have access to a shared memory where decoded output of one or more of the first and second layer decoders 322, 324 is available. The shared memory may comprise a frame buffer that contains one or more frames as they are decoded. In an LCEVC case, the first layer decoder 322 may comprise a base decoder that outputs a lower quality picture or frame, e.g. a lower resolution frame, and the second layer decoder 324 may comprise an LCEVC decoder that receives and decodes residual data for a higher quality picture or frame, e.g., at a higher resolution. In this case, the multi-layer controller 326 may be configured to upsample the output of the first layer decoder 322 and apply one or more sub-layers of residual data. In one configuration, the multi-layer controller 326 may apply a first sub-layer of decoded residual data to the output of the first layer decoder 322, upsample the result, and then apply a second sub-layer of decoded residual data before outputting the multi-layer reconstruction of the video signal at the upsampled resolution. In LCEVC, the sub-layers and upsampling operations may be flexibly configured and so different multi-layer reconstruction configurations are possible. Generally, multiple sub-layers of the second layer may be decoded, possibly in parallel, by a common (e.g., single) second layer decoder.
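The following numerical sketch illustrates, for a single plane, the kind of combination the multi-layer controller 326 may perform: the base decoder output is corrected with sub-layer 1 residuals, upsampled, and then refined with sub-layer 2 residuals. The nearest-neighbour upsampling, the frame dimensions and the clipping range are placeholders chosen for illustration and are not the kernels or parameters defined by LCEVC.

```python
import numpy as np

def reconstruct_plane(base_plane, sub1_residuals, sub2_residuals):
    """Correct, upsample by 2 in each direction, then enhance a single picture plane."""
    corrected = base_plane.astype(np.int32) + sub1_residuals          # sub-layer 1 at base resolution
    upsampled = np.kron(corrected, np.ones((2, 2), dtype=np.int32))   # simple 2x nearest-neighbour upsample
    enhanced = upsampled + sub2_residuals                             # sub-layer 2 at the higher resolution
    return np.clip(enhanced, 0, 255).astype(np.uint8)

# Hypothetical sizes: a 960x540 base plane reconstructed to a 1920x1080 output plane.
base = np.zeros((540, 960), dtype=np.uint8)
sub1 = np.zeros((540, 960), dtype=np.int32)
sub2 = np.zeros((1080, 1920), dtype=np.int32)
output = reconstruct_plane(base, sub1, sub2)
```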
In the example 300 of
In certain examples, the stream receiver 210 may comprise a demultiplexer to receive and demultiplex a multiplexed transport stream comprising a first packet sub-stream and a second packet sub-stream, e.g. Transport Stream 102. The data payloads of the first packet sub-stream may form a first packetized elementary stream (i.e., a first PES) and the data payloads of the second packet sub-stream may form a second packetized elementary stream (i.e., a second PES), wherein a joint elementary packet stream is generated that comprises a third packetized elementary stream (i.e., a joined PES) with a header comprising a presentation time stamp, the data payload following the header comprising data payloads from the first and second packetized elementary streams that are associated with the presentation time stamp. In certain cases, the first and second packet sub-streams may be transmitted such that the data payload of the second packet sub-stream arrives no later than the data payload of a corresponding portion of the first packet sub-stream. This may assist in synchronising the two PID streams.
A method 600 of processing a multi-layer video stream will now be described, where the multi-layer video stream encodes a video signal and comprises at least a first layer and a second layer. The method 600 may be implemented using the systems described above.
At block 602, the method 600 comprises receiving a first packet sub-stream for the first layer. This first packet sub-stream may be a PID stream or PES such as PID stream 104 or PES 106 in
At block 604, the method 600 comprises receiving a second packet sub-stream for the second layer. This second packet sub-stream may be a PID stream or PES such as PID stream 504 or PES 506 in
At block 606, the packets from the first packet sub-stream and the second packet sub-stream are joined to generate a joint elementary packet stream. The joint elementary packet stream comprises a sequence of packets comprising data for both the first layer and the second layer. These packets may be NAL units, such as NAL units 522 and 524 in
At block 608, the joint elementary packet stream is provided to a first layer decoder for decoding of the data for the first layer within the joint elementary packet stream. This may comprise decoding data for the first layer using a H.264, H.265, or VVC decoder. The first layer decoder may ignore packets comprising data for the second layer. The first layer decoder may comprise the first layer decoder 322 of
At block 610, the joint elementary packet stream is also provided to a second layer decoder for decoding of at least the data for the second layer within the joint elementary packet stream. The second layer decoder may comprise the second layer decoder 324 of
At block 612, the method 600 comprises combining an output of the first layer decoder and an output of the second layer decoder to provide a multi-layer reconstruction of the video signal. For example, this may be performed by the multi-layer controller 326 of
In one case, the data payloads of the first packet sub-stream form a first packetized elementary stream (i.e., a first PES) and the data payloads of the second packet sub-stream form a second packetized elementary stream (i.e., a second PES). The joint elementary packet stream thus comprises a third packetized elementary stream (i.e., a third PES). The third packetized elementary stream has a header comprising a presentation time stamp (PTS), and a data payload following the header (i.e., a payload of the third PES) comprises data payloads from the first and second packetized elementary streams that are associated with the presentation time stamp. In this case, the presentation time stamp is used to sync data for a particular picture or frame. In one case, the first and second packet sub-streams may be transmitted such that the last packet in the second packet sub-stream arrives no later than the last packet in the first packet sub-stream. In this manner, data for all the layers of the multi-layer encoding for a given picture or frame is available at a decoder to be synchronised, decoded, and combined as described.
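As a minimal sketch of the synchronisation just described, the function below pairs base and enhancement PES payloads by presentation time stamp and concatenates them into joint payloads. The (pts, payload) tuple inputs and the policy of releasing a joint payload even when no enhancement payload has arrived are assumptions made for illustration only.

```python
def join_by_pts(base_pes, enhancement_pes):
    """Concatenate base and enhancement PES payloads that share a presentation time stamp."""
    enhancement_by_pts = {pts: payload for pts, payload in enhancement_pes}
    joint = []
    for pts, base_payload in base_pes:
        joint_payload = base_payload + enhancement_by_pts.get(pts, b"")
        joint.append((pts, joint_payload))      # one joint payload per presentation time stamp
    return joint
```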
A payload of the joint elementary packet stream may comprise a sequence of network abstraction layer (NAL) units for the first layer and a sequence of NAL units for the second layer. The first layer decoder may be configured to ignore the network abstraction layer units for the second layer based on unit type data values within a header of the network abstraction layer units for the second layer.
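The sketch below illustrates this behaviour for an Annex B style payload: the joint stream is split into NAL units on start codes and routed by the unit type in the first header byte. The assumption that enhancement NAL units carry unit type values from the range that the base standard leaves unspecified (24 to 31 for H.264) is made here purely for illustration; the precise values used in a given deployment are defined by the relevant carriage specification.

```python
def split_annexb(elementary_stream):
    """Split an Annex B byte stream into NAL units on 0x000001 start codes (simplified:
    four-byte start codes leave a stray trailing zero on the preceding unit)."""
    starts = []
    i = elementary_stream.find(b"\x00\x00\x01")
    while i != -1:
        starts.append(i + 3)
        i = elementary_stream.find(b"\x00\x00\x01", i + 3)
    ends = [s - 3 for s in starts[1:]] + [len(elementary_stream)]
    return [elementary_stream[s:e] for s, e in zip(starts, ends)]


def route_nal_units(nal_units):
    """Route units by nal_unit_type (low five bits of the first byte, as in H.264)."""
    base_units, enhancement_units = [], []
    for nal in nal_units:
        if (nal[0] & 0x1F) >= 24:           # hypothetical: unspecified types carry enhancement data
            enhancement_units.append(nal)
        else:
            base_units.append(nal)
    return base_units, enhancement_units
```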
An output of the first layer decoder may be renderable independently of the multi-layer reconstruction of the video signal. For example, it may be possible to view both the output of the first layer decoder and the multi-layer reconstruction of the video signal. In certain cases, a displayed video rendering may switch between the two outputs based on the availability of the second packet sub-stream. The first layer may comprise a “base” video stream and the second layer may comprise a corresponding “enhancement” video stream. The base video stream may have first encoding parameters and the enhancement video stream may have second encoding parameters. In one case, the multi-layer reconstruction of the video signal comprises a higher quality rendition of a base video signal decoded from the base video stream. For example, the first packet sub-stream may represent a video encoding at a first resolution, such as a High Definition (HD) H.264 encoding, and the second packet sub-stream may represent an enhancement encoding at a second higher resolution, such as an Ultra-HD (UHD) LCEVC encoding. As well as different resolutions, the two packet sub-streams may also represent encodings at one or more of: different bit rates, different colour depths, different quantisation configurations, and different bit depths.
In one case, the method 600 further comprises receiving a multiplexed transport stream comprising the first packet sub-stream and the second packet sub-stream and demultiplexing the multiplexed transport stream to extract the first packet sub-stream and the second packet sub-stream. For example, this may be performed by one of the stream receiver 210, the stream generator 310, and the ES Parser Joiner 410 of the previous examples.
In certain examples, data for the second layer comprises frames of residual data that are combined with frames of the base video signal as decoded from the base video stream. For example, the second layer may comprise a Low Complexity Enhancement Video Coding (LCEVC) video stream. In certain cases, the second layer may comprise a watermarking stream, e.g. a stream with data to be added to an original video stream to visibly or invisibly mark, identify or secure the original video stream. In this case, the second layer may comprise data that is combined with the original video stream but where the data does not comprise residual data. The second layer may also comprise a metadata stream to accompany an original, first layer video stream. For example, the second layer may comprise localised metadata associated with objects or people within the original first layer video stream, such as unique identifiers or hyperlinks to data sources. There may also be multiple base layer and/or multiple enhancement layers as part of a multi-layer video stream and these may be processed similarly to the single base and single enhancement examples described herein.
In one case, the method 600 described herein may be applied as part of an adapted media player implementation. Blocks 602 and 604 may be implemented by a PMT reader or parser that extracts Elementary Streams (ES) from a digital multimedia container such as a transport stream. For example, a transport stream may be detected and a PMT parsed to extract a directory or mapping of programs to a set of PIDs. Blocks 602 and 604 may also involve a transport stream extractor that creates a PES reader for one or more of the identified PID streams. The transport stream extractor may perform functions similar to the stream receiver 210 or the stream generator 310. The PES reader may be generated based on a stream type and stream information (e.g., PES information). The PID for each ES/PES may also be extracted and passed to the corresponding PES reader. In an enhancement coding case, if a corresponding base PID is signalled, then a PES reader for the base stream (e.g., having the base PID) may be shared between a base decoder and an enhancement decoder. When ES metadata are read, if a reference base PID is signalled, the ES for the base is sent to a “base” PES reader. In the present case, all PES readers may have as a consumer a joiner interface that is capable of providing the joint elementary packet stream. Each PES reader may be unaware of the joining and see a sequence of NAL units. The joiner interface may provide data to specific data consumers, such as readers (i.e., decoders) for particular video formats. The joiner interface may be provided by an ES Parser that is fed by two inputs, based on the two PID streams, but has a single output as per comparative ES parsers such as 222 in
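A highly simplified sketch of the joiner interface described above is given below: each PES reader pushes payloads tagged with a layer name and a presentation time stamp, and a joint payload is released once the base data for that time stamp is present, with any enhancement data appended if it has arrived. The class and method names, and the layer labels “base” and “enhancement”, are arbitrary choices for this illustration.

```python
class PesJoiner:
    """Collect payloads from several PES readers and expose one joint payload per time stamp."""

    def __init__(self):
        self._pending = {}                               # pts -> {layer_name: payload_bytes}

    def consume(self, layer, pts, payload):
        """Called by a PES reader each time it completes a payload for a given time stamp."""
        self._pending.setdefault(pts, {})[layer] = payload

    def pop_joint(self, pts):
        """Release base-plus-enhancement data for a time stamp, or None if the base is missing."""
        parts = self._pending.get(pts, {})
        if "base" not in parts:
            return None
        del self._pending[pts]
        return parts["base"] + parts.get("enhancement", b"")
```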
In certain cases, the NAL units of the second layer may be interleaved with the NAL units of the first layer. In this case, a second layer decoder or second layer pre-processor may parse the sequence of NAL units to extract the NAL units of the second layer from the joined stream.
In certain cases, the joint stream may only require data for the first layer before release. In this case, if the second layer data is present it may be added, but if it is absent (e.g., in whole or part), it may be omitted and only the first layer decoded and viewed. The second layer decoder may thus either skip the enhancements and/or provide a pass through of the first layer output. Hence, the enhancement layer may be flexibly added to the base layer. The present examples allow easy retrofitting of existing stream processing pipelines to manage multi-layer streams. For example, each video decoder expects a single PES stream as input and thus a common interface may be provided, regardless of whether a single layer or multi-layer stream is being decoded. The advanced logic for the upper layers of the multi-layer stream may thus be incorporated into second layer decoders and/or multi-layer controllers that can be easily added to existing options for stream parsing and decoding.
In certain cases, one or more of the example systems 300 and 400, or method 600, may be implemented via instructions retrieved from a computer-readable medium that are executed by a processor of a decoding system, such as a client device.
Certain general information relating to example enhancement coding schemes will now be described. This information provides examples of specific multi-layer coding schemes.
It should be noted that examples are presented herein with reference to a signal as a sequence of samples (i.e., two-dimensional images, video frames, video fields, sound frames, etc.). For simplicity, non-limiting examples illustrated herein often refer to signals that are displayed as 2D planes of settings (e.g., 2D images in a suitable colour space), such as for instance a video signal. In a preferred case, the signal comprises a video signal. An example video signal is described in more detail with reference to
The terms “picture”, “frame” or “field” are used interchangeably with the term “image”, so as to indicate a sample in time of the video signal: any concepts and methods illustrated for video signals made of frames (progressive video signals) are easily applicable also to video signals made of fields (interlaced video signals), and vice versa. Despite the focus of examples illustrated herein on image and video signals, people skilled in the art can easily understand that the same concepts and methods are also applicable to any other types of multidimensional signal (e.g., audio signals, volumetric signals, stereoscopic video signals, 3DoF/6DoF video signals, plenoptic signals, point clouds, etc.). Although image or video coding examples are provided, the same approaches may be applied to signals with dimensions fewer than two (e.g., audio or sensor streams) or greater than two (e.g., volumetric signals).
In the description the terms “image”, “picture” or “plane” (intended with the broadest meaning of “hyperplane”, i.e., array of elements with any number of dimensions and a given sampling grid) will often be used to identify the digital rendition of a sample of the signal along the sequence of samples, wherein each plane has a given resolution for each of its dimensions (e.g., X and Y), and comprises a set of plane elements (or “element”, or “pel”, or display element for two-dimensional images often called “pixel”, for volumetric images often called “voxel”, etc.) characterized by one or more “values” or “settings” (e.g., by way of non-limiting example, colour settings in a suitable colour space, settings indicating density levels, settings indicating temperature levels, settings indicating audio pitch, settings indicating amplitude, settings indicating depth, settings indicating alpha channel transparency level, etc.). Each plane element is identified by a suitable set of coordinates, indicating the integer positions of said element in the sampling grid of the image. Signal dimensions can include only spatial dimensions (e.g., in the case of an image) or also a time dimension (e.g., in the case of a signal evolving over time, such as a video signal). In one case, a frame of a video signal may be seen to comprise a two-dimensional array with three colour component channels or a three-dimensional array with two spatial dimensions (e.g., of an indicated resolution—with lengths equal to the respective height and width of the frame) and one colour component dimension (e.g., having a length of 3). In certain cases, the processing described herein is performed individually to each plane of colour component values that make up the frame. For example, planes of pixel values representing each of Y, U, and V colour components may be processed in parallel using the methods described herein.
Certain examples described herein use a scalability framework that uses a base encoding and an enhancement encoding. The video coding systems described herein operate upon a received decoding of a base encoding (e.g., frame-by-frame or complete base encoding) and add one or more of spatial, temporal, or other quality enhancements via an enhancement layer. The base encoding may be generated by a base layer, which may use a coding scheme that differs from the enhancement layer, and in certain cases may comprise a legacy or comparative (e.g., older) coding standard.
In the spatially scalable coding scheme, the methods and apparatuses may be based on an overall algorithm which is built over an existing encoding and/or decoding algorithm (e.g., MPEG standards such as AVC/H.264, HEVC/H.265, etc. as well as non-standard algorithms such as VP9, AV1, and others) which works as a baseline for an enhancement layer. The enhancement layer works according to a different encoding and/or decoding algorithm. The idea behind the overall algorithm is to encode/decode the video frame hierarchically as opposed to using block-based approaches as done in the MPEG family of algorithms. Hierarchically encoding a frame includes generating residuals for the full frame, and then a reduced or decimated frame and so on.
Above the dashed line is a series of enhancement level processes to generate an enhancement layer of a multi-layer coding scheme. In the present example, the enhancement layer comprises two sub-layers. In other examples, one or more sub-layers may be provided.
To generate the encoded enhancement layer, sub-layer 2 stream, a further level of enhancement information is created by producing and encoding a further set of residuals via residual generator 700-S. The further set of residuals are the difference between an up-sampled version (via up-sampler 705U) of a corrected version of the decoded base stream (the reference signal or frame), and the input signal 701 (the desired signal or frame).
To achieve a reconstruction of the corrected version of the decoded base stream as would be generated at a decoder (e.g., as shown in
The up-sampled signal (i.e., reference signal or frame) is then compared to the input signal 701 (i.e., desired signal or frame) to create the further set of residuals (i.e., a difference operation is applied by the residual generator 700-S to the up-sampled re-created frame to generate a further set of residuals). The further set of residuals are then processed via an encoding pipeline that mirrors that used for the first set of residuals to become an encoded enhancement layer, sub-layer 2 stream (i.e., an encoding operation is then applied to the further set of residuals to generate the encoded further enhancement stream). In particular, the further set of residuals are transformed (i.e., a transform operation 710-0 is performed on the further set of residuals to generate a further transformed set of residuals). The transformed residuals are then quantised, and entropy encoded in the manner described above in relation to the first set of residuals (i.e., a quantisation operation 720-0 is applied to the transformed set of residuals to generate a further set of quantised residuals; and, an entropy encoding operation 730-0 is applied to the quantised further set of residuals to generate the encoded enhancement layer, sub-layer 2 stream containing the further level of enhancement information). In certain cases, the operations may be controlled, e.g. such that only the quantisation step 720-1 is performed, or only the transform and quantisation step. Entropy encoding may optionally be used in addition. Preferably, the entropy encoding operation may be a Huffman encoding operation or a run-length encoding (RLE) operation, or both (e.g., RLE then Huffman encoding). The transformation applied at both blocks 710-1 and 710-0 may be a Hadamard transformation that is applied to 2×2 or 4×4 blocks of residuals.
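As a simplified numerical sketch of the sub-layer encoding pipeline just described, the function below applies a 2×2 Hadamard-style transform to each block of a residual plane and then applies uniform quantisation. The coefficient naming (average, horizontal, vertical, diagonal), the absence of any normalisation, the fixed quantisation step and the omission of entropy encoding are simplifications for illustration and do not reproduce the exact operations defined by LCEVC.

```python
import numpy as np

def encode_sublayer(residuals, step=8):
    """Transform a plane of residuals in 2x2 blocks and quantise the resulting coefficients."""
    height, width = residuals.shape                                  # assumed even in both directions
    coeffs = np.zeros((4, height // 2, width // 2), dtype=np.int32)  # A, H, V, D coefficient planes
    for y in range(0, height, 2):
        for x in range(0, width, 2):
            r00, r01 = int(residuals[y, x]), int(residuals[y, x + 1])
            r10, r11 = int(residuals[y + 1, x]), int(residuals[y + 1, x + 1])
            coeffs[0, y // 2, x // 2] = r00 + r01 + r10 + r11        # average
            coeffs[1, y // 2, x // 2] = r00 - r01 + r10 - r11        # horizontal
            coeffs[2, y // 2, x // 2] = r00 + r01 - r10 - r11        # vertical
            coeffs[3, y // 2, x // 2] = r00 - r01 - r10 + r11        # diagonal
    return np.round(coeffs / step).astype(np.int32)                  # uniform quantisation per coefficient
```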
The encoding operation in
As illustrated in
Additionally, and optionally in parallel, the encoded enhancement layer, sub-layer 2 stream is processed to produce a decoded further set of residuals. Similar to sub-layer 1 processing, enhancement layer, sub-layer 2 processing comprises an entropy decoding process 830-0, an inverse quantisation process 820-0 and an inverse transform process 810-0. Of course, these operations will correspond to those performed at block 700-0 in encoding system 700, and one or more of these steps may be omitted as necessary. Block 800-0 produces a decoded enhancement layer, sub-layer 2 stream comprising the further set of residuals, and these are summed at operation 800-C with the output from the up-sampler 805U in order to create an enhancement layer, sub-layer 2 reconstruction of the input signal 701, which may be provided as the output of the decoding system 800. Thus, as illustrated in
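Continuing the earlier encoder-side sketch, the function below performs the corresponding inverse steps for one sub-layer: the quantised coefficient planes are rescaled and an inverse 2×2 transform rebuilds a plane of residuals ready to be summed with the preliminary or up-sampled picture. As before, the scaling and integer rounding are placeholders rather than the operations defined by the standard.

```python
import numpy as np

def decode_sublayer(quantised_coeffs, step=8):
    """Dequantise A, H, V, D coefficient planes and invert the 2x2 transform."""
    a, h, v, d = (quantised_coeffs * step).astype(np.int32)      # rescale each coefficient plane
    rows, cols = a.shape
    residuals = np.zeros((rows * 2, cols * 2), dtype=np.int32)
    residuals[0::2, 0::2] = (a + h + v + d) // 4
    residuals[0::2, 1::2] = (a - h + v - d) // 4
    residuals[1::2, 0::2] = (a + h - v - d) // 4
    residuals[1::2, 1::2] = (a - h - v + d) // 4
    return residuals
```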
With reference to the example 300 of
In general, examples described herein operate within encoding and decoding pipelines that comprises at least a transform operation. The transform operation may comprise the DCT or a variation of the DCT, a Fast Fourier Transform (FFT), or, in preferred examples, a Hadamard transform as implemented by LCEVC. The transform operation may be applied on a block-by-block basis. For example, an input signal may be segmented into a number of different consecutive signal portions or blocks and the transform operation may comprise a matrix multiplication (i.e., linear transformation) that is applied to data from each of these blocks (e.g., as represented by a 1D vector). In this description and in the art, a transform operation may be said to result in a set of values for a predefined number of data elements, e.g. representing positions in a resultant vector following the transformation. These data elements are known as transformed coefficients (or sometimes simply “coefficients”).
As described herein, where the enhancement data comprises residual data, a reconstructed set of coefficient bits may comprise transformed residual data, and a decoding method may further comprise instructing a combination of residual data obtained from the further decoding of the reconstructed set of coefficient bits with a reconstruction of the input signal generated from a representation of the input signal at a lower level of quality to generate a reconstruction of the input signal at a first level of quality. The representation of the input signal at a lower level of quality may be a decoded base signal and the decoded base signal may be optionally upscaled before being combined with residual data obtained from the further decoding of the reconstructed set of coefficient bits, the residual data being at a first level of quality (e.g., a first resolution). Decoding may further comprise receiving and decoding residual data associated with a second sub-layer, e.g. obtaining an output of the inverse transformation and inverse quantisation component, and combining it with data derived from the aforementioned reconstruction of the input signal at the first level of quality. This data may comprise data derived from an upscaled version of the reconstruction of the input signal at the first level of quality, i.e. an upscaling to the second level of quality.
Further details and examples of a two sub-layer enhancement encoding and decoding system may be obtained from published LCEVC documentation. Although examples have been described with reference to a tier-based hierarchical coding scheme in the form of LCEVC, the methods described herein may also be applied to other tier-based hierarchical coding schemes, such as VC-6: SMPTE VC-6 ST-2117 as described in PCT/GB2018/053552 and/or the associated published standard document, which are both incorporated by reference herein.
In LCEVC and certain other coding technologies, a video signal fed into a base layer is a downscaled version of the input video signal, e.g. 701. In this case, the signal that is fed into both sub-layers of the enhancement layer comprises a residual signal comprising residual data. A plane of residual data may also be organised in sets of n-by-n blocks of signal data 910. The residual data may be generated by comparing data derived from the input signal being encoded, e.g. the video signal 701, and data derived from a reconstruction of the input signal, the reconstruction of the input signal being generated from a representation of the input signal at a lower level of quality. The comparison may comprise subtracting the reconstruction from the downsampled version. The comparison may be performed on a frame-by-frame (and/or block-by-block) basis. The comparison may be performed at the first level of quality; if the base level of quality is below the first level of quality, a reconstruction from the base level of quality may be upscaled prior to the comparison. In a similar manner, the input signal to the second sub-layer, e.g. the input for the second sub-layer transformation and quantisation component, may comprise residual data that results from a comparison of the input video signal 701 at the second level of quality (which may comprise a full-quality original version of the video signal) with a reconstruction of the video signal at the second level of quality. As before, the comparison may be performed on a frame-by-frame (and/or block-by-block) basis and may comprise subtraction. The reconstruction of the video signal may comprise a reconstruction generated from a decoding of the encoded base bitstream and a decoded version of the first sub-layer residual data stream. The reconstruction may be generated at the first level of quality and may be upsampled to the second level of quality.
Hence, a plane of data 908 for the first sub-layer may comprise residual data that is arranged in n-by-n signal blocks 910. One such 2 by 2 signal block is shown in more detail in
A set of additional examples will now be described. These operate within a similar context to the examples set out above but differ in certain aspects. These examples may or may not be used with a multi-layer scheme such as that described with reference to
In one example, a method of processing a multi-layer video stream is provided. The multi-layer video stream encodes a video signal and comprises at least a first layer and a second layer. In this example, the method comprises: receiving a first packet sub-stream for the first layer; receiving a second packet sub-stream for the second layer; and, joining packets from the first packet sub-stream and the second packet sub-stream to generate a joint elementary packet stream, the joint elementary packet stream comprising a sequence of packets comprising data for both the first layer and the second layer. In this case, each packet of the first packet sub-stream may comprise a header and a data payload, where the data payload comprises the encoded data for the first layer, and each packet of the second packet sub-stream may comprise a header and a data payload, the data payload comprising the encoded data for the second layer. For example, this method may comprise a method similar to that performed by the stream generator 310. However, this method may be performed at an encoder such that the joint elementary stream is transmitted as a single PID stream to a decoder, e.g. as part of a transport stream.
Variations of the example above are shown in
Although examples are presented herein in the form of transmitted streams, static media files are also based on the same framework and so the methods described herein may also be applied to media “containers”, such as those that wrap encoded media content.
In certain variations of the examples described above (e.g., those described with reference to
In certain variations, a single packet identifier is assigned to the joint elementary packet stream. For example, “C” in
In certain variations, methods set out above may comprise transmitting the joint elementary packet stream as part of a packetised transport stream to one or more video decoders, data for the joint elementary packet stream being indicated by the single packet identifier in packet headers of the packetised transport stream. For example, this is shown in
In certain aspects, a method of processing a multi-layer video stream is provided. Again, the multi-layer video stream encodes a video signal and comprises at least a first layer and a second layer. In this aspect, the method comprises receiving encoded data for the first layer; receiving encoded data for the second layer; and combining the encoded data for the first layer and the encoded data for the second layer as a single elementary packet stream with a single packet identifier, the single packet identifier being linked with the first layer within metadata for the single elementary packet stream. This method may be performed at an encoder (e.g., as per the recent examples) or at a decoder (e.g., as per the examples of
In this aspect, the method may further comprise, e.g. at a service provider server, transmitting the single elementary packet stream as part of a transport stream to one or more decoding devices. This may occur if a joint or single stream is generated at an encoder.
In this aspect, the encoded data for the first layer and the encoded data for the second layer may be interleaved. For example, PES payloads for frames or planes of a video signal may result in data being grouped in a BLBLBLBLBLB . . . format, where B indicates data (e.g., NAL units) for a first or base layer and L indicates data (e.g., NAL units) for a second or LCEVC/enhancement layer. Interleaving may enable the simple synchronisation of different layers and provide robustness to reduce stream latencies.
In the above cases, a method of decoding a transport stream as generated by encoder generating the single encoded data stream may comprise: extracting an elementary packet stream from the transport stream based on the single packet identifier; communicating data from the elementary packet stream to a first layer decoder based on a mapping between the single packet identifier and the first layer; communicating data from the elementary packet stream to a second layer decoder to determine if the elementary packet stream comprises encoded data for the second layer; and, responsive to a determination that the elementary packet stream comprises encoded data for the second layer, combining an output of the first layer decoder and the second layer decoder to provide a multi-layer reconstruction of the video signal. For example, metadata for a program represented by the elementary packet stream, such as one or more of PSI, PAT, PMT, and descriptor data, may indicate that the elementary packet stream also comprises data for one or more additional layers and/or the second layer decoder may inspect data packets for the elementary packet stream to determine if they contain metadata (such as header data) that indicates the presence of other layer encoded data. In certain cases, the first layer decoder may be a legacy hardware and/or software decoder that is not able to be updated with new functionality, where other layer decoders (such as the second layer decoder) may be updatable with new functionality and so may encompass additional logic to parse the combined data stream. In one case, a second layer decoder may be passed at least a portion of data from all compatible first layer data streams and may only be activated if second layer data is detected within those streams.
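A short sketch of this conditional behaviour, reusing the routing helper from the earlier sketch, might look as follows; the decoder objects and their decode()/combine() methods are hypothetical placeholders rather than a real decoder API.

```python
def decode_single_pid_stream(elementary_stream, base_decoder, enhancement_decoder, controller):
    """Decode a combined single-PID stream; enhance only if second-layer data is present."""
    base_units, enhancement_units = route_nal_units(split_annexb(elementary_stream))
    base_picture = base_decoder.decode(base_units)
    if not enhancement_units:                       # no second-layer data detected
        return base_picture
    residuals = enhancement_decoder.decode(enhancement_units)
    return controller.combine(base_picture, residuals)
```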
In one variation, the second layer decoder is configured to inspect header data from one or more network abstraction layer units derived from the elementary packet stream to determine if the elementary packet stream comprises encoded data for the second layer. In this case, the first layer decoder may be configured to ignore NAL units containing encoded data for the second layer based on values within the headers of said NAL units.
At block 1204, a descriptor field of the first encoded data stream is parsed to extract an identifier for the first encoded data stream. The descriptor field may be a descriptor field as defined as part of metadata for the first encoded data stream, such as PSI data. The descriptor field may be set as a value that is ignored by a first layer decoder, such as a reserved value for one or more first layer coding standards.
At block 1206, a second encoded data stream is received for the second layer of the multi-layer video stream. The second encoded data stream may be received in a similar manner to the first layer of the multi-layer video stream, e.g. as PID stream 504 in
At block 1210, conditional logic is applied based on the parsing performed at blocks 1204 and 1208. For example, responsive to the presence of the identifier for the first encoded data stream as determined in block 1208, the first and second encoded data streams are paired as set out in block 1212 and a decoding of the multi-layer video stream is instructed based on the paired data. For example, this decoding may comprise the decoding shown in
In certain cases, the first encoded data stream for the first layer (e.g., as read as a media track from a file or received as a PID stream) appears as a first layer encoded bitstream. However, it actually carries interleaved encoded data for multiple layers.
In certain aspects, a method of decoding a multi-layer video stream comprises accessing a media track of a data file structure, the media track being identified by an identifier, the media track carrying the multi-layer video stream, the multi-layer video stream encoding a video signal and comprising data representing a first layer and data representing a second layer; parsing the identifier to instruct decoding of the data representing the first layer using a first layer decoder, wherein the identifier is defined according to an encoding format of the first layer, wherein data within the media track is accessed by the first layer decoder; and parsing the identifier to instruct decoding of the data representing the second layer using the second layer decoder, wherein outputs of the first and second layer decoders are combinable to reconstruct an output for the multi-layer video stream. In this case, as an adaptation of the method of
In certain examples as described herein, the encoded data for the first layer and encoded data for the second layer are generated using different video encoders. For example, selectable base codecs generate the first layer and an LCEVC codec generates further layers.
In certain examples described herein a first layer video stream may form a base layer for multiple enhancement streams (e.g., multiple second layer streams). In this case, each enhancement stream may have a different function and/or may carry differentiated content. For example, enhancement streams may be provided at different levels of quality (such as different bit rates, colour depths, and/or resolutions) and/or may include different content to be overlaid over base stream content. For example, an enhancement stream may provide different text for different languages or different advertising content for different users or areas. In one case, each LCEVC stream may encode different logo content for surfaces visible in the base video stream, such as sport hoardings or billboards within videos. This approach, and the use of descriptors more generally, may be applied to both video streams over a network and file-based content (e.g., streams as recorded as bit sequences within files).
In a case where there is one first layer or base stream and multiple second layer or enhancement streams, a descriptor may be provided that has a loop function and defines the plurality of additional streams that are associated with the first layer or base stream. This descriptor with a loop function may be provided as part of the first layer or base stream, thus allowing any decoder that receives the first layer or base stream to have access to a set of identifiers for available second layer or enhancement streams. Each second layer or enhancement stream may have a descriptor with the identifier for that stream. Hence, the decoders may pair base and multiple different enhancement streams but legacy base decoders may ignore the additional descriptors that accompany the base stream and simply decode the base stream as per single layer cases.
In an LCEVC case, an LCEVC video extension descriptor may be defined. Each LCEVC video stream (e.g., a “second layer” stream as discussed in examples herein) may have an LCEVC video descriptor that is present in a descriptor loop of a program map section for the LCEVC video stream. A base video stream (e.g., a first layer stream as discussed in examples herein) may constitute a base video stream for more than one LCEVC video stream. The base video stream may also comprise an LCEVC video extension descriptor. As set out above, the base video stream may comprise multiple LCEVC video extension descriptors (in a so-called descriptor “loop”), where each LCEVC video extension descriptor comprises an identifier that identifies an association with a different LCEVC encoded video stream. The LCEVC video extension descriptor may be identified using an extension descriptor tag, e.g. a tag with a value of “0x17” that was previously defined as “reserved”. As such, legacy base decoders may simply ignore the LCEVC video extension descriptor as the value is deemed not used in their configuration.
Each LCEVC video extension descriptor may have a form similar to that set out in the table below:
In this case, the lcevc_stream_tag field is an 8-bit field specifying the identifier of an association between a base and an enhancement encoded video stream. In alternative examples, the lcevc_stream_tag may be replaced with a PID of a base stream to form the link between streams. The other fields may then provide additional optional information regarding the LCEVC video stream, such as properties of the LCEVC video stream.
An LCEVC registration descriptor may also be provided that defines a set of available lcevc_stream_tags that may be used. For example, this may have the form:
This descriptor may be used for a base stream, where num_lcevc_stream_tags defines the number of associated enhancement streams and the inbuilt loop repeats the lcevc_stream_tags for each enhancement stream in turn. As before, the lcevc_stream_tag value allows the video elementary stream to be indicated as the base of an LCEVC video stream that carries the same tag value in its LCEVC video descriptor. In cases where there is a single base and enhancement stream the base descriptor may just contain one lcevc_stream_tag.
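A minimal sketch of how such descriptors might be used to pair streams is given below, building on the parse_pmt() sketch earlier. The descriptor tag values, and the assumed byte layouts (a registration descriptor carrying a count byte followed by one lcevc_stream_tag byte per associated enhancement stream, and a video descriptor whose first byte is its lcevc_stream_tag), are placeholders for illustration and are not the normative descriptor syntax.

```python
def pair_streams(pmt_streams, video_descriptor_tag=0x17, registration_descriptor_tag=0x18):
    """Return (base_PID, enhancement_PID) pairs matched on a shared lcevc_stream_tag value."""
    enhancement_pids = {}                              # lcevc_stream_tag -> enhancement PID
    base_pids = {}                                     # lcevc_stream_tag -> base PID
    for pid, stream_type, descriptors in pmt_streams:  # e.g., output of parse_pmt() above
        for tag, body in descriptors:
            if tag == video_descriptor_tag and body:
                enhancement_pids[body[0]] = pid        # first byte assumed to be lcevc_stream_tag
            elif tag == registration_descriptor_tag and body:
                for stream_tag in body[1:1 + body[0]]: # count byte then one tag byte per stream
                    base_pids[stream_tag] = pid
    return [(base_pids[t], e_pid) for t, e_pid in enhancement_pids.items() if t in base_pids]
```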
In certain cases, one or more of the example systems 300 and 400, method 600, method 1100, method 1200 or any other of the examples described herein may be implemented via instructions retrieved from a computer-readable medium. These may be executed by a processor of a decoding system, such as a client device. In one case, examples related to method 1100 may be implemented by way of instructions retrieved from a computer-readable medium and executed by a processor of an encoding system, such as an encoding server.
The techniques described herein may be implemented in software or hardware, or may be implemented using a combination of software and hardware. They may include configuring an apparatus to carry out and/or support any or all of techniques described herein. The above examples are to be understood as illustrative. Further examples are envisaged. It is to be understood that any feature described in relation to any one example may be used alone, or in combination with other features described, and may also be used in combination with one or more features of any other of the examples, or any combination of any other of the examples. Furthermore, equivalents and modifications not described above may also be employed without departing from the scope of the invention, which is defined in the accompanying claims.
Number | Date | Country | Kind |
---|---|---|---|
2116781.2 | Nov 2021 | GB | national |
2200609.2 | Jan 2022 | GB | national |
2200674.6 | Jan 2022 | GB | national |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/GB2022/052949 | 11/21/2022 | WO |