ATTENTION BASED CONTEXT MODELLING FOR IMAGE AND VIDEO COMPRESSION

FIELD

Embodiments of the present disclosure relate to the field of artificial intelligence (AI)-based video or picture compression technologies, and in particular, to context modelling using an attention layer within a neural network to process elements of a latent tensor.

BACKGROUND

Video coding (video encoding and decoding) is used in a wide range of digital video applications, for example broadcast digital TV, video transmission over the internet and mobile networks, real-time conversational applications such as video chat and video conferencing, DVD and Blu-ray discs, video content acquisition and editing systems, and camcorders for security applications.

The amount of video data needed to depict even a relatively short video can be substantial, which may result in difficulties when the data is to be streamed or otherwise communicated across a communications network with limited bandwidth capacity. Thus, video data is generally compressed before being communicated across modern-day telecommunications networks. The size of a video could also be an issue when the video is stored on a storage device because memory resources may be limited. Video compression devices often use software and/or hardware at the source to code the video data prior to transmission or storage, thereby decreasing the quantity of data needed to represent digital video pictures. The compressed data is then received at the destination by a video decompression device that decodes the video data. With limited network resources and ever-increasing demands for higher video quality, improved compression and decompression techniques that improve the compression ratio with little to no sacrifice in picture quality are desirable.

In recent years, deep learning has been gaining popularity in the fields of picture and video encoding and decoding.

SUMMARY

The embodiments of the present disclosure provide apparatuses and methods for entropy encoding and decoding of a latent tensor, which includes separating the latent tensor into segments and obtaining a probability model for the entropy encoding of a current element of the latent tensor by processing a set of elements by one or more layers of a neural network including an attention layer.

According to an embodiment, a method is provided for entropy encoding of a latent tensor, comprising: separating the latent tensor into a plurality of segments in the spatial dimensions and in the channel dimension, each segment including at least one latent tensor element; processing an arrangement of the plurality of segments by one or more layers of a neural network, including at least one attention layer; and obtaining a probability model for the entropy encoding of a current element of the latent tensor based on the processed plurality of segments.

The method considers spatial correlations in the latent tensor and spatial adaptivity for the implicit entropy estimation. The attention mechanism adaptively weights the importance of the previously coded latent segments. The contribution of the segments to the entropy modeling of the current element corresponds to their respective importance. Thus, the performance of the entropy estimation is improved.
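As a non-limiting illustration, the separation into segments may be sketched as follows. The function name `split_into_segments`, the nested-list layout `latent[channel][row][column]`, the dictionary keyed by (channel segment index, row, column), and the assumption that every tensor dimension is a multiple of the segment size are choices made for this sketch only and are not taken from the disclosure.

```python
def split_into_segments(latent, seg_c, seg_h, seg_w):
    """Split a latent tensor given as latent[channel][row][col] into segments.

    Returns a dict mapping (channel_segment_index, row, col) of the segment
    grid to the flattened list of latent elements inside that segment.
    Assumes (for this sketch) that C, H and W are multiples of the segment sizes.
    """
    num_c, num_h, num_w = len(latent), len(latent[0]), len(latent[0][0])
    if num_c % seg_c or num_h % seg_h or num_w % seg_w:
        raise ValueError("tensor dimensions must be multiples of the segment size")
    segments = {}
    for k in range(num_c // seg_c):          # channel segment index
        for i in range(num_h // seg_h):      # row in the segment grid
            for j in range(num_w // seg_w):  # column in the segment grid
                segments[(k, i, j)] = [
                    latent[k * seg_c + c][i * seg_h + h][j * seg_w + w]
                    for c in range(seg_c)
                    for h in range(seg_h)
                    for w in range(seg_w)
                ]
    return segments


# Toy 4x4x4 tensor whose element value encodes its own position.
latent = [[[16 * c + 4 * h + w for w in range(4)] for h in range(4)] for c in range(4)]
segments = split_into_segments(latent, seg_c=2, seg_h=2, seg_w=2)
```

With these toy sizes the tensor yields eight segments of eight elements each, and every segment carries both a spatial coordinate and a channel segment index.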

In an exemplary implementation, the separating the latent tensor comprises separating the latent tensor into two or more segments in the channel dimension.

Separating the latent tensor into segments in the channel dimension may enable the use of cross-channel correlations for the context modelling, and may thus improve the performance of the entropy estimation.

For example, the processing of the arrangement of the plurality of segments further comprises: obtaining two or more groups of segments; and processing the segments within a group out of the two or more groups independently by the one or more layers of the neural network.

An independent processing of segments within a group facilitates a parallel processing, which may enable a more efficient use of hardware and may reduce the processing time.

In an exemplary implementation, for each group out of the two or more groups, the segments in said group have a same respective channel segment index, said channel segment index indicating the position of the segments in the channel dimension.

Segments in a group having a same respective channel segment index facilitate an independent processing of latent tensor elements having different coordinates in the spatial dimensions and a sequential processing of segments of different channels. Thus, spatial and cross-channel correlations are included in the obtained probability model.

For example, the segments having a same channel segment index are grouped into either two groups or four groups.

Obtaining two (four) groups per channel segment index enables a processing that requires at least two (four) sequential steps per channel segment index.

In an exemplary implementation, the segments having a same channel segment index are grouped according to a checkerboard pattern into either two groups or four groups in the spatial dimensions.

A checkerboard pattern causes a spatially uniform distribution of the segments of a group within a plurality of segments having a same channel segment index. This may enhance the spatial correlations in the processing of the segments.
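A checkerboard grouping of the segments sharing one channel segment index may, for example, be sketched as below. The helper names are hypothetical, and the particular four-group pattern (a repeating 2×2 tile) is one possible reading assumed for illustration; the disclosure does not fix a specific index formula.

```python
def checkerboard_group(row, col, num_groups=2):
    """Return the group index of the segment at spatial position (row, col)."""
    if num_groups == 2:
        # Classic checkerboard: neighbours in a row or column alternate groups.
        return (row + col) % 2
    if num_groups == 4:
        # Assumed variant: each position of a repeating 2x2 tile is its own group.
        return 2 * (row % 2) + (col % 2)
    raise ValueError("num_groups must be 2 or 4")


def group_segments(height, width, num_groups=2):
    """Collect the spatial coordinates of a height x width segment grid per group."""
    groups = [[] for _ in range(num_groups)]
    for row in range(height):
        for col in range(width):
            groups[checkerboard_group(row, col, num_groups)].append((row, col))
    return groups
```

All segments listed in one group may then be processed in parallel, so that the number of sequential steps per channel segment index equals the number of groups.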

For example, the processing of the arrangement of the plurality of segments further comprises: processing the segments in a first group in parallel, followed by processing the segments in a second group in parallel, wherein the segments in the second group have a same index in the channel dimension as the segments in the first group.

Such an arrangement may improve the performance of the entropy estimation by focusing on spatial correlations due to the related processing order.

In an exemplary implementation, the processing of the arrangement of the plurality of segments further comprises processing the segments in a first group in parallel, followed by processing the segments in a second group in parallel, wherein the segments in the second group have a same spatial coordinate as the corresponding segments in the first group.

Such an arrangement may improve the performance of the entropy estimation by focusing on cross-channel correlations due to the related processing order.

For example, the method further comprises generating a bitstream; and including an indication into said bitstream indicating whether or not two or more groups of segments are obtained.

Such an indication may enable an efficient extraction from the bitstream during decoding.

In an exemplary implementation, the processing of the arrangement of the plurality of segments further comprises separating the segments into a plurality of patches, each patch including two or more segments; and processing each patch independently by the one or more layers of the neural network.

A separation into patches that are processed independently may improve the processing as this facilitates an adaptation to hardware properties.

For example, each patch includes a K×M grid of spatially neighboring segments and L neighboring segments in the channel dimension, with L, K and M being positive integers, at least one of K and M being larger than one.

Patches including neighboring segments may enable a processing in a sliding window fashion, thus taking into account spatial correlation in the neighborhood of a currently processed segment.

In an exemplary implementation, the patches out of the plurality of patches are overlapping in the spatial dimensions and/or in the channel dimension.

Overlapping patches may improve the obtained probability model by taking into account additional spatial and/or cross-channel correlations.
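The placement of K×M patches over the grid of segments, with or without overlap, may be illustrated by the following sketch. It covers the spatial dimensions only and assumes that the grid size allows the patches to cover every segment exactly; the function names and the single `overlap` parameter (applied to both spatial dimensions) are assumptions of this sketch.

```python
def patch_origins(grid_h, grid_w, patch_k, patch_m, overlap=0):
    """Top-left segment coordinates of K x M patches placed over a segment grid.

    overlap is the number of segment rows/columns shared by adjacent patches.
    """
    if not 0 <= overlap < min(patch_k, patch_m):
        raise ValueError("overlap must be non-negative and smaller than the patch size")
    step_r = patch_k - overlap
    step_c = patch_m - overlap
    rows = range(0, grid_h - patch_k + 1, step_r)
    cols = range(0, grid_w - patch_m + 1, step_c)
    return [(r, c) for r in rows for c in cols]


def patch_members(origin, patch_k, patch_m):
    """Spatial coordinates of all segments belonging to the patch at origin."""
    r0, c0 = origin
    return [(r0 + r, c0 + c) for r in range(patch_k) for c in range(patch_m)]
```

For a 4×4 grid of segments and 2×2 patches, no overlap gives four independent patches, whereas an overlap of one segment gives nine patches, each sharing segments with its neighbours.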

For example, the method further comprises generating a bitstream; and including an indication into said bitstream indicating one or more of whether or not the patches out of the plurality of patches are overlapping, and an amount of overlap.

Such an indication may enable an efficient extraction from the bitstream during decoding.

In an exemplary implementation, the processing of a patch out of the plurality of patches by the one or more layers of the neural network further comprises obtaining two or more groups of segments within said patch; and processing the segments within a group out of the two or more groups independently by the one or more layers of the neural network.

A combination of an independent processing of patches and an independent processing of segments in a group within a patch may reduce the number of sequential steps in the processing of a patch and thus the overall processing time.

For example, the processing of the arrangement comprises arranging the segments in a predefined order, wherein segments with a same spatial coordinate are grouped together.

Such an arrangement may improve the performance of the entropy estimation by focusing on cross-channel correlations due to the related processing order.

In an exemplary implementation, the processing of the arrangement comprises arranging the segments, wherein segments, which have different spatial coordinates, are arranged consecutively in a predefined order.

Such an arrangement may improve the performance of the entropy estimation by focusing on spatial correlations due to the related processing order.
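The two arrangements described above may be contrasted with the following sketch, which lists segment indices in processing order. The function name `arrange_segments`, the flag `channel_first`, and the raster scan used for the spatial coordinates are assumptions made for illustration.

```python
def arrange_segments(num_channel_segments, height, width, channel_first=True):
    """Return (channel_segment_index, row, col) tuples in a predefined order.

    channel_first=True : segments with the same spatial coordinate are grouped
                         together, so all channel segments of one position
                         are consecutive.
    channel_first=False: segments with different spatial coordinates are
                         consecutive, so one channel segment index is scanned
                         over all positions before the next one starts.
    """
    if channel_first:
        return [
            (c, row, col)
            for row in range(height)
            for col in range(width)
            for c in range(num_channel_segments)
        ]
    return [
        (c, row, col)
        for c in range(num_channel_segments)
        for row in range(height)
        for col in range(width)
    ]
```

In the first order, the context of a segment mostly consists of other channel segments at the same position; in the second, it mostly consists of spatial neighbours with the same channel segment index.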

For example, the processing by the neural network comprises applying a first neural subnetwork to extract features of the plurality of segments, and providing an output of the first neural subnetwork as an input to a subsequent layer within the neural network.

Processing the input of the neural network to extract the features of the plurality of segments may enable a focus of the attention layer on independent deep features of the input.

In an exemplary implementation, the processing by the neural network further comprises: providing positional information of the plurality of segments as an input to the at least one attention layer.

The positional encodings may enable the attention layer to utilize the sequential order of the input sequence.

In an exemplary implementation, the processing an arrangement of the plurality of segments includes selecting a subset of segments from said plurality of segments; and said subset is provided as an input to a subsequent layer within the neural network.

Selecting a subset of segments may enable support for latent tensors of larger sizes by requiring a reduced size of memory and/or a reduced amount of processing.

For example, the processing by the at least one attention layer in the neural network further comprises: applying a mask, which masks elements in an attention tensor following the current element within a processing order of the latent tensor.

Applying a mask ensures that only previously encoded elements may be processed and thus the coding order is preserved. The mask mirrors the availability of information at the decoding side to the encoding side.
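The effect of such a mask on the attention weights may be sketched as below. The sketch assumes a lower-triangular mask over a sequence of segments and a plain softmax over raw scores; the function names and the toy score values are illustrative and do not reflect a particular network of the disclosure.

```python
import math


def causal_mask(n):
    """mask[i][j] is True when position j does not follow position i in the order."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax in which masked-out positions receive a weight of zero."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        allowed = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(allowed)  # subtract the maximum for numerical stability
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Large scores at "future" positions have no influence once they are masked.
scores = [[0.0, 5.0, 5.0], [1.0, 1.0, 5.0], [0.0, 0.0, 0.0]]
weights = masked_softmax(scores, causal_mask(3))
```

Because the weights of later positions are exactly zero, the encoder obtains the same probability model that the decoder can reproduce from already decoded segments.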

In an exemplary implementation, the neural network includes a second neural subnetwork, the second neural subnetwork processing an output of the attention layer.

The second neural subnetwork may process the features outputted by the attention layer to provide probabilities for the symbols used in the encoding and may thus enable an efficient encoding and/or decoding.

For example, at least one of the first neural subnetwork and the second neural subnetwork is a multilayer perceptron.

A multilayer perceptron may provide an efficient implementation of a neural network.

In an exemplary implementation, the at least one attention layer in the neural network is a multi-head attention layer.

A multi-head attention layer may improve the estimation of probabilities by processing different representations of the input in parallel and thus providing more projections and attention computations, which correspond to various perspectives of the same input.

For example, the at least one attention layer in the neural network is included in a transformer subnetwork.

A transformer subnetwork may provide an efficient implementation of an attention mechanism.

In an exemplary implementation, the method further comprises: padding the beginning of the arrangement of the plurality of segments with a zero segment before processing by the neural network.

A padding with zeros at the beginning of the arrangement mirrors the availability of information at the decoding side and thus causality in the coding order is preserved.
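One way to picture this padding is as a shift of the input sequence by one position, sketched below. Dropping the last segment so that the sequence length stays unchanged is an assumption of this sketch (it is common in autoregressive models) and is not stated in the disclosure; the function name is likewise hypothetical.

```python
def shift_with_zero_segment(segments):
    """Prepend an all-zero segment so that input position i holds segment i - 1.

    The last segment is dropped to keep the sequence length unchanged
    (an assumption of this sketch).
    """
    if not segments:
        return []
    zero_segment = [0] * len(segments[0])
    return [zero_segment] + [list(s) for s in segments[:-1]]
```

With this shift, the probability model of the first segment is derived from the zero segment alone, which is exactly the information available to a decoder before anything has been decoded.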

For example, the method further comprises: entropy encoding the current element into a first bitstream using the obtained probability model.

Using the probability model obtained by processing the plurality of segments by a neural network including an attention layer may reduce the size of the bitstream.
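The relation between the probability model and the bitstream size may be made concrete with the ideal code length, which is the negative base-2 logarithm of the probability assigned to each coded symbol. The sketch below compares a uniform model with a hypothetical context-adapted model; the probability values are invented for illustration, and a practical arithmetic coder only approaches these ideal lengths.

```python
import math


def ideal_code_length_bits(symbols, probability_models):
    """Sum of -log2 p(symbol) over all symbols, one probability model per symbol."""
    return sum(-math.log2(pmf[s]) for s, pmf in zip(symbols, probability_models))


symbols = [2, 2]
# Uniform model over a four-symbol alphabet: 2 bits per symbol.
uniform = [{0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}] * 2
# Hypothetical adapted model that assigns the actual symbol a probability of 0.5.
adapted = [{0: 0.25, 1: 0.125, 2: 0.5, 3: 0.125}] * 2

uniform_bits = ideal_code_length_bits(symbols, uniform)
adapted_bits = ideal_code_length_bits(symbols, adapted)
```

The more probability mass the model places on the symbols that actually occur, the fewer bits are needed to encode them.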

In an exemplary implementation, the method further comprises: quantizing the latent tensor before separating into segments.

A quantized latent tensor yields a simplified probability model, thus enabling a more efficient encoding process. Also, such a latent tensor is compressed and can be processed with reduced complexity and represented more efficiently within the bitstream.

For example, the quantizing of the latent tensor comprises a Rate-Distortion Optimized Quantization (RDOQ) including the obtained probability model.

Optimizing the quantization including the obtained probability model may yield a more efficient representation of the latent tensor within the bitstream.
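A generic rate-distortion trade-off of this kind may be sketched for a single latent value as follows. The sketch assumes a squared-error distortion, the ideal code length as rate, two integer candidates per value, and a Lagrange multiplier `lam`; these choices, the function name, and the toy probabilities are assumptions for illustration and are not the specific RDOQ of the disclosure.

```python
import math


def rdoq_scalar(value, pmf, lam):
    """Pick the integer level minimising distortion + lam * rate for one value.

    pmf maps candidate integer levels to probabilities taken from the
    probability model; levels with zero or missing probability are never chosen
    in preference to a level with positive probability.
    """
    candidates = sorted({math.floor(value), math.ceil(value)})

    def cost(level):
        p = pmf.get(level, 0.0)
        if p <= 0.0:
            return math.inf
        rate_bits = -math.log2(p)
        return (value - level) ** 2 + lam * rate_bits

    return min(candidates, key=cost)


pmf = {0: 0.5, 1: 0.125}  # hypothetical model: level 0 costs 1 bit, level 1 costs 3 bits
```

With `lam = 0.0` the value 0.6 is quantized to the nearest level 1, whereas with `lam = 0.2` the cheaper level 0 is selected because the saved rate outweighs the added distortion.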

For example, the method further comprises selecting the probability model for the entropy encoding according to: computational complexity and/or properties of the first bitstream.

Enabling the selection of the context modelling strategy may allow for better performance during the encoding process and may provide flexibility in adapting the encoded bitstream to the desired application.

In an exemplary implementation, the method further comprises: hyper-encoding the latent tensor to obtain a hyper-latent tensor; entropy encoding the hyper-latent tensor into a second bitstream; entropy decoding the second bitstream; and obtaining a hyper-decoder output by hyper-decoding the hyper-latent tensor.

Introducing a hyper-prior model may further improve the probability model and thus the coding rate by determining further redundancy in the latent tensor.

For example, the method further comprises: separating the hyper-decoder output into a plurality of hyper-decoder output segments, each hyper-decoder output segment including one or more hyper-decoder output elements; and, for each segment out of the plurality of segments, concatenating said segment and a set of hyper-decoder output segments out of the plurality of hyper-decoder output segments before obtaining the probability model.

The probability model may be further improved by concatenating the hyper-decoder output with a respective segment out of the plurality of segments.

In an exemplary implementation, the set of hyper-decoder output segments to be concatenated with a respective segment includes one or more of a hyper-decoder output segment corresponding to said respective segment, or a plurality of hyper-decoder output segments corresponding to a same channel as said respective segment, or a plurality of hyper-decoder output segments spatially neighboring said respective segment, or a plurality of hyper-decoder output segments including neighboring segments spatially neighboring said respective segment and segments corresponding to a same channel as said neighboring segment.

The probability model may be further improved by including a respective set of hyper-decoder output segments. The behavior for performance and complexity may depend on the set of hyper-decoder output segments and the content to be encoded.
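For the first of the options listed above, in which each segment is concatenated with the single hyper-decoder output segment corresponding to it, the concatenation may be sketched as follows. Segments are represented as flat lists, and the function name and the one-to-one pairing by list position are assumptions of this sketch.

```python
def concat_with_hyper(segments, hyper_segments):
    """Append to every latent segment its corresponding hyper-decoder output segment."""
    if len(segments) != len(hyper_segments):
        raise ValueError("one hyper-decoder output segment is expected per segment")
    return [list(seg) + list(hyp) for seg, hyp in zip(segments, hyper_segments)]
```

The other options differ only in which hyper-decoder output segments are gathered for a given segment, for instance those of the same channel or of the spatial neighbourhood, at the cost of a longer concatenated input.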

For example, the method further comprises adaptively selecting the set of hyper-decoder output segments according to: computational complexity and/or properties of the first bitstream.

Enabling the selection of an additional hyper-prior modelling strategy may allow for better performance during the encoding process and may provide flexibility in adapting the encoded bitstream to the desired application.

In an exemplary implementation, one or more of the following steps are performed in parallel for each segment out of the plurality of segments: processing by the neural network, and entropy encoding the current element.

A parallel processing of the segments may result in a faster encoding into the bitstream.

According to an embodiment, a method is provided for encoding image data comprising: obtaining a latent tensor by processing the image data with an autoencoding convolutional neural network; and entropy encoding the latent tensor into a bitstream using a generated probability model according to any of the methods described above.

The entropy coding may be readily and advantageously applied to image encoding, to effectively reduce the data rate, e.g. when transmission or storage of pictures or videos is desired, as the latent tensors for image reconstruction may still have considerable size.

According to an embodiment, a method is provided for entropy decoding of a latent tensor, comprising: initializing the latent tensor with zeroes; separating the latent tensor into a plurality of segments in the spatial dimensions and in the channel dimension, each segment including at least one latent tensor element; processing an arrangement of the plurality of segments by one or more layers of a neural network, including at least one attention layer; and obtaining a probability model for the entropy decoding of a current element of the latent tensor based on the processed plurality of segments.
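The sequential nature of the decoding may be illustrated by the following simplified loop over a flattened latent tensor. The callables `context_model` and `decode_symbol` are hypothetical placeholders for the neural network including the attention layer and for the entropy decoder reading from the bitstream; the element-by-element order is a simplification that ignores segments, groups, and patches.

```python
def decode_latent(num_elements, context_model, decode_symbol):
    """Decode a flattened latent tensor element by element.

    context_model(latent, index) returns the probability model for the element
    at index, given the partially decoded tensor (hypothetical placeholder).
    decode_symbol(pmf) returns the next symbol from the bitstream using that
    probability model (hypothetical placeholder).
    """
    latent = [0] * num_elements  # initialise the latent tensor with zeroes
    for index in range(num_elements):
        # Elements not yet decoded are still zero, so the model only ever
        # sees information that is actually available at the decoder.
        pmf = context_model(latent, index)
        latent[index] = decode_symbol(pmf)
    return latent
```

A usage example with trivial stand-ins would pass a function returning a fixed probability model as `context_model` and a function reading successive symbols from a list as `decode_symbol`.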

In an exemplary implementation, the separating the latent tensor comprises separating the latent tensor into two or more segments in the channel dimension.

An independent processing of segments within a group facilitates a parallel processing, which may enable a more efficient use of hardware and may reduce the processing time.

For example, the segments having a same channel segment index are grouped into either two groups or four groups.

Obtaining two (four) groups per channel segment index enables a processing that requires at least two (four) sequential steps per channel segment index.

In an exemplary implementation, the segments having a same channel segment index are grouped according to a checkerboard pattern into either two groups or four groups in the spatial dimensions.

Such an arrangement may improve the performance of the entropy estimation by focusing on spatial correlations due to the related processing order.

Such an arrangement may improve the performance of the entropy estimation by focusing on cross-channel correlations due to the related processing order.

For example, the method further comprises receiving a bitstream; and obtaining an indication from said bitstream indicating whether or not two or more groups of segments are obtained.

Such an indication may enable an efficient extraction from the bitstream during decoding.

A separation into patches that are processed independently may improve the processing as this facilitates an adaptation to hardware properties.

Patches including neighboring segments may enable a processing in a sliding window fashion, thus taking into account spatial correlation in the neighborhood of a currently processed segment.

In an exemplary implementation, the patches out of the plurality of patches are overlapping in the spatial dimensions and/or in the channel dimension.

Overlapping patches may improve the obtained probability model by taking into account additional spatial and/or cross-channel correlations.

For example, the method further comprises receiving a bitstream; and obtaining an indication from said bitstream indicating one or more of whether or not the patches out of the plurality of patches are overlapping, and an amount of overlap.