This invention relates to a method and system for lossy image or video encoding, transmission and decoding, a method, apparatus, computer program and computer readable storage medium for lossy image or video encoding and transmission, and a method, apparatus, computer program and computer readable storage medium for lossy image or video receipt and decoding.
There is increasing demand from users of communications networks for images and video content. Demand is increasing not just in the number of images viewed and the playing time of video, but also in the resolution of that content. This places increasing demand on communications networks and increases their energy use because of the larger amount of data being transmitted.
To reduce the impact of these issues, image and video content is compressed for transmission across the network. The compression of image and video content can be lossless or lossy compression. In lossless compression, the image or video is compressed such that all of the original information in the content can be recovered on decompression. However, when using lossless compression there is a limit to the reduction in data quantity that can be achieved. In lossy compression, some information is lost from the image or video during the compression process. Known compression techniques attempt to minimise the apparent loss of information by removing information that results in changes to the decompressed image or video that are not particularly noticeable to the human visual system.
Artificial intelligence (AI) based compression techniques achieve compression and decompression of images and videos through the use of trained neural networks in the compression and decompression process. Typically, during training of the neural networks, the difference between the original image and video and the compressed and decompressed image and video is analyzed and the parameters of the neural networks are modified to reduce this difference while minimizing the data required to transmit the content. However, AI based compression methods may achieve poor compression results in terms of the appearance of the compressed image or video or the amount of information required to be transmitted.
According to the present invention there is provided a method for lossy image and video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent, wherein the sizes of the bins used in the quantization process are based on the input image; transmitting the quantized latent to a second computer system; decoding the quantized latent using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
The sizes of the bins may be different between at least two of the pixels of the latent representation.
The sizes of the bins may be different between at least two channels of the latent representation.
A bin size may be assigned to each pixel of the latent representation.
The quantisation process may comprise performing an operation on the value of each pixel of the latent representation corresponding to the bin size assigned to that pixel.
The quantisation process may comprise subtracting a mean value of the latent representation from each pixel of the latent representation.
The quantisation process may comprise a rounding function.
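By way of illustration only, the quantisation described in the preceding statements may be sketched as follows. This is a hypothetical example: the function names, the array shapes and the particular composition of mean subtraction, scaling and rounding are assumptions made for the sketch, not a definitive implementation.

```python
import numpy as np

def quantise(latent, mean, bin_sizes):
    """Quantise a latent with a per-pixel bin size.

    `latent`, `mean` and `bin_sizes` share a shape, e.g. (C, H, W), so
    a bin size is assigned to each pixel of the latent representation.
    """
    # Subtract the mean, scale by the assigned bin size, apply a
    # rounding function, then map back to the original scale.
    return np.round((latent - mean) / bin_sizes) * bin_sizes + mean

# A smaller bin quantises a pixel more finely (lower distortion, higher
# rate); a larger bin quantises it more coarsely.
latent = np.random.randn(2, 4, 4)
quantised = quantise(latent, np.zeros_like(latent), np.full_like(latent, 0.5))
```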
The sizes of the bins used to decode the quantized latent may be based on previously decoded pixels of the quantized latent.
The quantisation process may comprise a third trained neural network.
The third trained neural network may receive at least one previously decoded pixel of the quantized latent as an input.
The method may further comprise the steps of: encoding the latent representation using a fourth trained neural network to produce a hyper-latent representation; performing a quantization process on the hyper-latent representation to produce a quantized hyper-latent; transmitting the quantized hyper-latent to the second computer system; and decoding the quantized hyper-latent using a fifth trained neural network to obtain the sizes of the bins; wherein the decoding of the quantized latent uses the obtained sizes of the bins.
The output of the fifth trained neural network may be processed by a further function to obtain the sizes of the bins.
The further function may be a sixth trained neural network.
The sizes of the bins used in the quantization process of the hyper-latent representation may be based on the input image.
The method may further comprise the step of identifying at least one region of interest of the input image; and reducing the size of the bins used in the quantisation process for at least one corresponding pixel of the latent representation in the identified region of interest.
The method may further comprise the step of identifying at least one region of interest of the input image; wherein a different quantisation process is used for at least one corresponding pixel of the latent representation in the identified region of interest.
The at least one region of interest may be identified by a seventh trained neural network.
The location of the one or more regions of interest may be stored in a binary mask; and the binary mask may be used to obtain the sizes of the bins.
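As a sketch of how the binary mask may be used to obtain the sizes of the bins, the bins may be reduced inside the identified region of interest; the helper below and its scaling factor are assumptions made for the illustration:

```python
import numpy as np

def roi_bin_sizes(base_bin_size, binary_mask, roi_factor=0.5):
    """Reduce the bin size for pixels inside a region of interest.

    `binary_mask` marks region-of-interest pixels with ones; shrinking
    the bins there quantises those pixels more finely, spending more
    rate on the regions that matter most.
    """
    return np.where(binary_mask.astype(bool),
                    base_bin_size * roi_factor,
                    base_bin_size)
```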
According to the present invention there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent, wherein the sizes of the bins used in the quantization process are based on the input image; decoding the quantized latent using a second neural network to produce an output image, wherein the output image is an approximation of the input image; determining a quantity based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the determined quantity; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
The method may further comprise the steps of: encoding the latent representation using a third neural network to produce a hyper-latent representation; performing a quantization process on the hyper-latent representation to produce a quantized hyper-latent; transmitting the quantized hyper-latent to the second computer system; and decoding the quantized hyper-latent using a fourth neural network to obtain the sizes of the bins; wherein the decoding of the quantized latent uses the obtained bin sizes; and the parameters of the third neural network and the fourth neural network are additionally updated based on the determined quantity to obtain a third trained neural network and a fourth trained neural network.
The quantisation process may comprise a first quantisation approximation.
The determined quantity may be additionally based on a rate associated with the quantized latent; a second quantisation approximation may be used to determine the rate associated with the quantized latent; and the second quantisation approximation may be different to the first quantisation approximation.
The determined quantity may comprise a loss function and the step of updating of the parameters of the neural networks may comprise the steps of: evaluating a gradient of the loss function; and back-propagating the gradient of the loss function through the neural networks; wherein a third quantisation approximation is used during back-propagation of the gradient of the loss function; and the third quantisation approximation is the same approximation as the first quantisation approximation.
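One common pair of quantisation approximations, given here purely as an illustrative assumption for a unit bin size, uses a straight-through estimator on the distortion path (so the same identity-gradient approximation is reused when back-propagating the loss) and additive uniform noise when evaluating the rate:

```python
import torch

def ste_round(x):
    # First/third approximation: exact rounding in the forward pass,
    # identity (straight-through) gradient in the backward pass.
    return x + (torch.round(x) - x).detach()

def noise_round(x):
    # Second approximation: additive uniform noise in [-0.5, 0.5) as a
    # differentiable stand-in for rounding when estimating the rate.
    return x + (torch.rand_like(x) - 0.5)
```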
The parameters of the neural networks may be additionally updated based on a distribution of the sizes of the bins.
At least one parameter of the distribution may be learned.
The distribution may be an inverse gamma distribution.
The distribution may be determined by a fifth neural network.
According to the present invention there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent, wherein the sizes of the bins used in the quantization process are based on the input image; and transmitting the quantized latent.
According to the present invention there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving the quantized latent transmitted according to the method above at a second computer system; decoding the quantized latent using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
According to the present invention there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent; decoding the quantized latent using a second neural network to produce an output image, wherein the output image is an approximation of the input image; determining a quantity based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the determined quantity; and repeating the above steps using a plurality of sets of input images to produce a first trained neural network and a second trained neural network; wherein at least one of the plurality of sets of input images comprises a first proportion of images including a particular feature; and at least one other of the plurality of sets of input images comprises a second proportion of images including the particular feature, wherein the second proportion is different to the first proportion.
The first proportion may be all of the images of the set of input images.
The particular feature may be of one of the following: a human face, an animal face, text, eyes, lips, a logo, a car, flowers and a pattern.
Each of the plurality of sets of input images may be used an equal number of times during the repetition of the method steps.
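A minimal sketch of assembling such sets, assuming pools of images with and without the particular feature (the helper below and its parameters are hypothetical):

```python
import random

def make_set(feature_images, other_images, proportion, size, seed=0):
    """Build a set of input images in which `proportion` of the images
    contain a particular feature (e.g. human faces)."""
    rng = random.Random(seed)
    n_feature = round(proportion * size)
    images = rng.sample(feature_images, n_feature)
    images += rng.sample(other_images, size - n_feature)
    rng.shuffle(images)
    return images

# e.g. one set made entirely of face images, another with a 10% share:
# set_a = make_set(faces, generic, proportion=1.0, size=1000)
# set_b = make_set(faces, generic, proportion=0.1, size=1000)
```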
The difference between the output image and the input image may be at least partially determined by a neural network acting as a discriminator.
A separate neural network acting as a discriminator may be used for each set of the plurality of sets of input images.
The parameters of one or more of the neural networks acting as discriminators may be updated for a first number of training steps; and one or more other of the neural networks acting as discriminators may be updated for a second number of training steps, wherein the second number is lower than the first number.
The determined quantity may be additionally based on a rate associated with the quantized latent; the updating of the parameters for at least one of the plurality of sets of input images may use a first weighting for the rate associated with the quantized latent; and the updating of the parameters for at least one other of the plurality of sets of input images may use a second weighting for the rate associated with the quantized latent, wherein the second weighting is different to the first weighting.
The difference between the output image and the input image may be at least partially determined using a plurality of perceptual metrics; the updating of the parameters for at least one of the plurality of sets of input images may use a first set of weightings for the plurality of perceptual metrics; and the updating of the parameters for at least one other of the plurality of sets of input images may use a second set of weightings for the plurality of perceptual metrics, wherein the second set of weightings is different to the first set of weightings.
The input image may be a modified image in which one or more regions of interest have been identified by a third trained neural network and other regions of the image have been masked.
The regions of interest may be regions comprising one or more of the following features: human faces, animal faces, text, eyes, lips, logos, cars, flowers and patterns.
The location of the areas of the one or more regions of interest may be stored in a binary mask.
The binary mask may be an additional input to the first neural network.
According to the present invention there is provided a method for lossy image and video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent; transmitting the quantized latent to a second computer system; and decoding the quantized latent using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein the first trained neural network and the second trained neural network have been trained according to the method above.
According to the present invention there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent; and transmitting the quantized latent; wherein the first trained neural network has been trained according to the method above.
According to the present invention there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving the quantized latent transmitted according to the method above at a second computer system; and decoding the quantized latent using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein the second trained neural network has been trained according to the method above.
According to the present invention there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent; decoding the quantized latent using a second neural network to produce an output image, wherein the output image is an approximation of the input image; determining a quantity based on a rate associated with the quantized latent, wherein the evaluation of the rate comprises the step of interpolation of a discrete probability mass function; updating the parameters of the first neural network and the second neural network based on the determined quantity; and repeating the above steps using a plurality of sets of input images to produce a first trained neural network and a second trained neural network.
At least one parameter of the discrete probability mass function may be additionally updated based on the evaluated rate.
The method may further comprise the steps of: encoding the latent representation using a third neural network to produce a hyper-latent representation; performing a quantization process on the hyper-latent representation to produce a quantized hyper-latent; and decoding the quantized hyper-latent using a fourth neural network to obtain at least one parameter of the discrete probability mass function; wherein the parameters of the third neural network and the fourth neural network are additionally updated based on the determined quantity to obtain a third trained neural network and a fourth trained neural network.
The interpolation may comprise at least one of the following: piecewise constant interpolation, nearest neighbour interpolation, linear interpolation, polynomial interpolation, spline interpolation, piecewise cubic interpolation, Gaussian processes and kriging.
The discrete probability mass function may be a categorical distribution.
The categorical distribution may be parameterized by at least one vector.
The categorical distribution may be obtained by a soft-max projection of the vector.
The discrete probability mass function may be parameterized by at least a mean parameter and a scale parameter.
The discrete probability mass function may be multivariate.
The discrete probability mass function may comprise a plurality of points; a first adjacent pair of points of the plurality of points may have a first spacing; and a second adjacent pair of points of the plurality of points may have a second spacing, wherein the second spacing is different to the first spacing.
The discrete probability mass function may comprise a plurality of points; a first adjacent pair of points of the plurality of points may have a first spacing; and a second adjacent pair of points of the plurality of points may have a second spacing, wherein the second spacing is equal to the first spacing.
At least one of the first spacing and the second spacing may be obtained using the fourth neural network.
At least one of the first spacing and the second spacing may be obtained based on the value of at least one pixel of the latent representation.
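The following sketch shows one way such a rate evaluation could work, assuming a categorical distribution obtained by a soft-max projection of a parameter vector, support points with possibly unequal spacings, and linear interpolation (any of the listed interpolation schemes could be substituted):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def interpolated_rate(logits, support, y):
    """Rate (in bits) of a latent value under a discrete probability
    mass function, evaluated at a possibly non-integer value by linear
    interpolation between the two nearest support points."""
    pmf = softmax(logits)  # soft-max projection of the parameter vector
    i = np.clip(np.searchsorted(support, y) - 1, 0, len(support) - 2)
    t = (y - support[i]) / (support[i + 1] - support[i])
    p = (1.0 - t) * pmf[i] + t * pmf[i + 1]
    return -np.log2(p)

# Adjacent pairs of points may have different spacings:
support = np.array([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])
bits = interpolated_rate(np.zeros_like(support), support, y=0.3)
```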
According to the present invention there is provided a method for lossy image and video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent; transmitting the quantized latent to a second computer system; and decoding the quantized latent using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein the first trained neural network and the second trained neural network have been trained according to the method above.
According to the present invention there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent; and transmitting the quantized latent; wherein the first trained neural network has been trained according to the method above.
According to the present invention there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving the quantized latent according to the method above at a second computer system; and decoding the quantized latent using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein the second trained neural network has been trained according to the method above.
According to the present invention there is provided a method for lossy image and video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; performing a first operation on the latent representation to obtain a residual latent; transmitting the residual latent to a second computer system; performing a second operation on the residual latent to obtain a retrieved latent representation, wherein the second operation comprises performing an operation on previously obtained pixels of the retrieved latent; and decoding the retrieved latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
The operation on previously obtained pixels of the retrieved latent may be performed for each pixel of the retrieved latent for which previously obtained pixels have been obtained.
At least one of the first operation and the second operation may comprise the solving of an implicit equation system.
The first operation may comprise a quantisation operation.
The operation performed on previously obtained pixels of the retrieved latent may comprise a matrix operation.
The matrix defining the matrix operation may be sparse.
The matrix defining the matrix operation may have zero values corresponding to pixels of the retrieved latent that have not been obtained when the matrix operation is performed.
The matrix defining the matrix operation may be lower triangular.
The second operation may comprise a standard forward substitution.
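A minimal sketch of such a decode, assuming the latent is flattened to a vector, the matrix L is strictly lower triangular and a mean is available (all names are illustrative):

```python
import numpy as np

def retrieve_latent(residual, L, mean):
    """Recover the retrieved latent from the residual latent by
    standard forward substitution.

    Solves the implicit equation system y = residual + mean + L @ y;
    because L is strictly lower triangular, each pixel depends only on
    previously obtained pixels of the retrieved latent.
    """
    y = np.zeros_like(residual)
    for i in range(residual.shape[0]):
        # Matrix operation on previously obtained pixels y[:i] only;
        # entries of L on or above the diagonal are zero.
        y[i] = residual[i] + mean[i] + L[i, :i] @ y[:i]
    return y
```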
The operation performed on previously obtained pixels of the retrieved latent may comprise a third trained neural network.
The method may further comprise the steps of: encoding the latent representation using a fourth trained neural network to produce a hyper-latent representation; performing a quantization process on the hyper-latent representation to produce a quantized hyper-latent; transmitting the quantized hyper-latent to the second computer system; and decoding the quantized hyper-latent using a fifth trained neural network, wherein the operation performed on previously obtained pixels of the retrieved latent is based on the output of the fifth trained neural network.
The decoding of the quantized hyper-latent using the fifth trained neural network may additionally produce a mean parameter; and the implicit equation system may additionally comprise the mean parameter.
According to the present invention there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; performing a first operation on the latent representation to obtain a residual latent; performing a second operation on the residual latent to obtain a retrieved latent representation, wherein the second operation comprises performing an operation on previously obtained pixels of the retrieved latent; decoding the retrieved latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; determining a quantity based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the determined quantity; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network.
The operation performed on previously obtained pixels of the retrieved latent may comprise a matrix operation.
The parameters of a matrix defining the matrix operation may be additionally updated based on the determined quantity.
The operation performed on previously obtained pixels of the retrieved latent may comprise a third neural network; and the parameters of the third neural network may be additionally updated based on the determined quantity to produce a third trained neural network.
The method may further comprise the steps of: encoding the latent representation using a fourth neural network to produce a hyper-latent representation; performing a quantization process on the hyper-latent representation to produce a quantized hyper-latent; transmitting the quantized hyper-latent to the second computer system; and decoding the quantized hyper-latent using a fifth neural network, wherein the operation performed on previously obtained pixels of the retrieved latent is based on the output of the fifth neural network; wherein the parameters of the fourth neural network and the fifth neural network are additionally updated based on the determined quantity to produce a fourth trained neural network and a fifth trained neural network.
According to the present invention there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; performing a first operation on the latent representation to obtain a residual latent; and transmitting the residual latent.
According to the present invention there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving the residual latent transmitted according to the method above at a second computer system; performing a second operation on the residual latent to obtain a retrieved latent representation, wherein the second operation comprises performing an operation on previously obtained pixels of the retrieved latent; and decoding the retrieved latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
According to the present invention there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; entropy encoding the latent representation; transmitting the entropy encoded latent representation to a second computer system; entropy decoding the entropy encoded latent representation; and decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; determining a quantity based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the determined quantity; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network; wherein the entropy decoding of the entropy encoded latent representation is performed pixel by pixel; and the order of the pixel by pixel decoding is additionally updated based on the determined quantity.
The order of the pixel by pixel decoding may be based on the latent representation.
The entropy decoding of the entropy encoded latent may comprise an operation based on previously decoded pixels.
The determining of the order of the pixel by pixel decoding may comprise ordering a plurality of the pixels of the latent representation in a directed acyclic graph.
The determining of the order of the pixel by pixel decoding may comprise operating on the latent representation with a plurality of adjacency matrices.
The determining of the order of the pixel by pixel decoding may comprise dividing the latent representation into a plurality of sub-images.
The plurality of sub-images may be obtained by convolving the latent representation with a plurality of binary mask kernels.
The determining of the order of the pixel by pixel decoding may comprise ranking a plurality of pixels of the latent representation based on the magnitude of a quantity associated with each pixel.
The quantity associated with each pixel may be the location or scale parameter associated with that pixel.
The quantity associated with each pixel may be additionally updated based on the evaluated difference.
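A sketch of such a ranking, assuming a per-pixel scale parameter (for example predicted by the entropy model, so that the encoder and decoder can derive the same order independently):

```python
import numpy as np

def decoding_order(scales):
    """Rank latent pixels for pixel-by-pixel entropy decoding by the
    magnitude of the scale parameter associated with each pixel,
    largest first; the direction of the ordering is an assumption."""
    return np.argsort(-np.abs(scales).reshape(-1))

# order[k] gives the flat index of the k-th pixel to be decoded.
order = decoding_order(np.random.rand(16, 16))
```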
The determining of the order of the pixel by pixel decoding may comprise a wavelet decomposition of a plurality of pixels of the latent representation.
The order of the pixel by pixel decoding may be based on the frequency components of the wavelet decomposition associated with the plurality of pixels.
The method may further comprise the steps of: encoding the latent representation using a fourth trained neural network to produce a hyper-latent representation; transmitting the hyper-latent to the second computer system; and decoding the hyper-latent using a fifth trained neural network, wherein the order of the pixel by pixel decoding is based on the output of the fifth trained neural network.
According to the present invention there is provided a method for lossy image and video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; entropy encoding the latent representation; transmitting the entropy encoded latent representation to a second computer system; entropy decoding the entropy encoded latent representation; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein the first trained neural network and the second trained neural network have been trained according to the method above.
According to the present invention there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; entropy encoding the latent representation; and transmitting the entropy encoded latent representation; wherein the first trained neural network has been trained according to the method above.
According to the present invention there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving the entropy encoded latent representation transmitted according to the method above at a second computer system; and decoding the latent representation using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein the second trained neural network has been trained according to the method above.
According to the present invention there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; decoding the latent representation using a second neural network to produce an output image, wherein the output image is an approximation of the input image; determining a quantity based on a difference between the output image and the input image and a rate associated with the latent representation, wherein a first weighting is applied to the difference between the output image and the input image and a second weighting is applied to the rate associated with the latent representation when determining the quantity; updating the parameters of the first neural network and the second neural network based on the determined quantity; and repeating the above steps using a first set of input images to produce a first trained neural network and a second trained neural network; wherein, after at least one of the repeats of the above steps, at least one of the first weighting and the second weighting is additionally updated based on a further quantity, the further quantity based on at least one of the difference between the output image and the input image and the rate associated with the latent representation.
At least one of the difference between the output image and the input image and the rate associated with the latent representation may be recorded for each repeat of the steps; and the further quantity may be based on at least one of a plurality of the previously recorded differences between the output image and the input image and a plurality of the previously recorded rates associated with the latent representation.
The further quantity may be based on an average of the plurality of the previously recorded differences or rates.
The average may be at least one of the following: the arithmetic mean, the median, the geometric mean, the harmonic mean, the exponential moving average, the smoothed moving average and the linear weighted moving average.
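As an illustration, the further quantity could be an exponential moving average of previously recorded rates compared against a target rate; the update rule and the constants below are assumptions made for the sketch:

```python
def updated_weighting(weighting, recorded_rates, target_rate,
                      smoothing=0.99, step=1e-3):
    """Update the rate weighting from an exponential moving average of
    the previously recorded rates: the weighting rises when the
    averaged rate is above target and falls when it is below, steering
    subsequent training repeats toward the target rate."""
    ema = recorded_rates[0]
    for rate in recorded_rates[1:]:
        ema = smoothing * ema + (1.0 - smoothing) * rate
    return weighting + step * (ema - target_rate)
```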
Outlier values may be removed from the plurality of the previously recorded differences or rates before determining the further quantity.
The outlier values may only be removed for an initial predetermined number of repeats of the steps.
The rate associated with the latent representation may be calculated using a first method when determining the quantity and a second method when determining the further quantity, wherein the first method is different to the second method.
At least one repeat of the steps may be performed using an input image from a second set of input images; and the parameters of the first neural network and the second neural network may not be updated when an input image from the second set of input images is used.
The determined quantity may be additionally based on the output of a neural network acting as a discriminator.
According to the present invention there is provided a method of training one or more neural networks, the one or more neural networks being for use in lossy video encoding, transmission and decoding, the method comprising the steps of: receiving an input video at a first computer system; encoding a plurality of frames of the input video using a first neural network to produce a plurality of latent representations; decoding the plurality of latent representations using a second neural network to produce a plurality of frames of an output video, wherein the output video is an approximation of the input video; determining a quantity based on a difference between the output video and the input video and a rate associated with the plurality of latent representations, wherein a first weighting is applied to the difference between the output video and the input video and a second weighting is applied to the rate associated with the plurality of latent representations; updating the parameters of the first neural network and the second neural network based on the determined quantity; and repeating the above steps using a plurality of input videos to produce a first trained neural network and a second trained neural network; wherein, after at least one of the repeats of the above steps, at least one of the first weighting and the second weighting is additionally updated based on a further quantity, the further quantity based on at least one of the difference between the output video and the input video and the rate associated with the plurality of latent representations.
The input video may comprise at least one I-frame and a plurality of P-frames.
The quantity may be based on a plurality of first weightings or second weightings, each of the weightings corresponding to one of the plurality of frames of the input video.
After at least one of the repeats of the steps, at least one of the plurality of weightings may be additionally updated based on an additional quantity associated with each weighting.
Each additional quantity may be based on a predetermined target value of the difference between the output frame and the input frame or the rate associated with the latent representation.
The additional quantity associated with the I-frame may have a first target value and at least one additional quantity associated with a P-frame may have a second target value, wherein the second target value is different to the first target value.
Each additional quantity associated with a P-frame may have the same target value.
The plurality of first weightings or second weightings may be initially set to zero.
According to the present invention there is provided a method for lossy image and video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent; transmitting the quantized latent to a second computer system; and decoding the quantized latent using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image; wherein the first trained neural network and the second trained neural network have been trained according to the method above.
According to the present invention there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image or video at a first computer system; encoding the input image or video using a first trained neural network to produce a latent representation; and transmitting the latent representation; wherein the first trained neural network has been trained according to the method above.
According to the present invention there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving the latent representation transmitted according to the method above at a second computer system; and decoding the latent representation using a second trained neural network to produce an output image or video, wherein the output image or video is an approximation of the input image or video; wherein the second trained neural network has been trained according to the method above.
According to the present invention there is provided a method for lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent; entropy encoding the quantized latent using a probability distribution, wherein the probability distribution is defined using a tensor network; transmitting the entropy encoded quantized latent to a second computer system; entropy decoding the entropy encoded quantized latent using the probability distribution to retrieve the quantized latent; and decoding the quantized latent using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
The probability distribution may be defined by a Hermitian operator operating on the quantized latent, wherein the Hermitian operator is defined by the tensor network.
The tensor network may comprise a non-orthonormal core tensor and one or more orthonormal tensors.
The method may further comprise the steps of: encoding the latent representation using a third trained neural network to produce a hyper-latent representation; performing a quantization process on the hyper-latent representation to produce a quantized hyper-latent; transmitting the quantized hyper-latent to the second computer system; and decoding the quantized hyper-latent using a fourth trained neural network; wherein the output of the fourth trained neural network is one or more parameters of the tensor network.
The tensor network may comprise a non-orthonormal core tensor and one or more orthonormal tensors; and the output of the fourth trained neural network may be one or more parameters of the non-orthonormal core tensor.
One or more parameters of the tensor network may be calculated using one or more pixels of the latent representation.
The probability distribution may be associated with a sub-set of the pixels of the latent representation.
The probability distribution may be associated with a channel of the latent representation.
The tensor network may be at least one of the following factorisations: Tensor Tree, Locally Purified State, Born Machine, Matrix Product State and Projected Entangled Pair State.
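A minimal sketch of how one listed factorisation, a Born Machine built on a Matrix Product State, could define a probability over quantised latent symbols; the tensor shapes, boundary conventions and the omission of normalisation are all assumptions made for the illustration:

```python
import numpy as np

def born_amplitude(cores, symbols):
    """Contract a Matrix Product State over one symbol per site.

    `cores[k]` has shape (D_left, S, D_right), with D_left = 1 at the
    first site and D_right = 1 at the last; `symbols[k]` indexes the
    physical dimension S (the quantised latent value at pixel k).
    """
    vec = cores[0][0, symbols[0], :]
    for core, s in zip(cores[1:], symbols[1:]):
        vec = vec @ core[:, s, :]  # contract along the bond dimension
    return vec[0]

def born_probability(cores, symbols):
    # Born rule: the (unnormalised) probability is the squared
    # magnitude of the amplitude; normalisation over all symbol
    # combinations is omitted for brevity.
    return abs(born_amplitude(cores, symbols)) ** 2
```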
According to the present invention there is provided a method of training one or more networks, the one or more networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving a first input image; encoding the first input image using a first neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent; entropy encoding the quantized latent using a probability distribution, wherein the probability distribution is defined using a tensor network; entropy decoding the entropy encoded quantized latent using the probability distribution to retrieve the quantized latent; decoding the quantized latent using a second neural network to produce an output image, wherein the output image is an approximation of the input image; determining a quantity based on a difference between the output image and the input image; updating the parameters of the first neural network and the second neural network based on the determined quantity; and repeating the above steps using a plurality of input images to produce a first trained neural network and a second trained neural network.
One or more of the parameters of the tensor network may be additionally updated based on the determined quantity.
The tensor network may comprise a non-orthonormal core tensor and one or more orthonormal tensors; and the parameters of all of the tensors of the tensor network except for the non-orthonormal core tensor may be updated based on the determined quantity.
The tensor network may be calculated using the latent representation.
The tensor network may be calculated based on a linear interpolation of the latent representation.
The determined quantity may be additionally based on the entropy of the tensor network.
According to the present invention there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; performing a quantization process on the latent representation to produce a quantized latent; entropy encoding the quantized latent using a probability distribution, wherein the probability distribution is defined using a tensor network; and transmitting the entropy encoded quantized latent.
According to the present invention there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving an entropy encoded quantized latent transmitted according to the method above at a second computer system; entropy decoding the entropy encoded quantized latent using the probability distribution to retrieve the quantized latent; and decoding the quantized latent using a second trained neural network to produce an output image, wherein the output image is an approximation of the input image.
According to the present invention there is provided a method for lossy image and video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; encoding the latent representation using a second trained neural network to produce a hyperlatent representation; encoding the hyperlatent representation using a third trained neural network to produce a hyperhyperlatent representation; transmitting the latent, hyperlatent and hyperhyperlatent representation to a second computer system; decoding the hyperhyperlatent representation using a fourth trained neural network; decoding the hyperlatent representation using the output of the fourth trained neural network and a fifth trained neural network; and decoding the latent representation using the output of the fifth trained neural network and a sixth trained neural network to produce an output image, wherein the output image is an approximation of the input image.
The method may further comprise the step of determining the rate of the input image; wherein, if the determined rate satisfies a predetermined condition, the steps of encoding the hyperlatent representation and decoding the hyperhyperlatent representation are not performed.
The predetermined condition may be that the rate is less than a predetermined value.
According to the present invention there is provided a method of training one or more networks, the one or more networks being for use in lossy image or video encoding, transmission and decoding, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first neural network to produce a latent representation; encoding the latent representation using a second neural network to produce a hyperlatent representation; encoding the hyperlatent representation using a third neural network to produce a hyperhyperlatent representation; decoding the hyperhyperlatent representation using a fourth neural network; decoding the hyperlatent representation using the output of the fourth neural network and a fifth neural network; and decoding the latent representation using the output of the fifth neural network and a sixth neural network to produce an output image, wherein the output image is an approximation of the input image; determining a quantity based on a difference between the output image and the input image; updating the parameters of the third and fourth neural networks based on the determined quantity; and repeating the above steps using a plurality of input images to produce a third and fourth trained neural network.
The parameters of the first, second, fifth and sixth neural network may not be updated in at least one of the repeats of the steps.
The method may further comprise the step of determining the rate of the input image; wherein, if the determined rate satisfies a predetermined condition, the parameters of the first, second, fifth and sixth neural network are not updated in that repeat of the steps.
The predetermined condition may be that the rate is less than a predetermined value.
The parameters of the first, second, fifth and sixth neural network may not be updated after a predetermined number of repeats of the steps.
The parameters of the first, second, fifth and sixth neural network may additionally be updated based on the determined quantity to produce a first, second, fifth and sixth trained neural network.
At least one of the following operations may be performed on at least one of the plurality of input images before performing the other steps: an upsampling, a smoothing filter and a random crop.
According to the present invention there is provided a method for lossy image or video encoding and transmission, the method comprising the steps of: receiving an input image at a first computer system; encoding the input image using a first trained neural network to produce a latent representation; encoding the latent representation using a second trained neural network to produce a hyperlatent representation; encoding the hyperlatent representation using a third trained neural network to produce a hyperhyperlatent representation; and transmitting the latent, hyperlatent and hyperhyperlatent representation.
According to the present invention there is provided a method for lossy image or video receipt and decoding, the method comprising the steps of: receiving the latent, hyperlatent and hyperhyperlatent representation transmitted according to the method above at a second computer system; decoding the hyperhyperlatent representation using a fourth trained neural network; decoding the hyperlatent representation using the output of the fourth trained neural network and a fifth trained neural network; and decoding the latent representation using the output of the fifth trained neural network and a sixth trained neural network to produce an output image, wherein the output image is an approximation of the input image.
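The nested structure of the decode described above may be sketched as follows; `net4`, `net5` and `net6` stand for the fourth, fifth and sixth trained neural networks, and their interfaces are assumptions made for the illustration:

```python
import torch.nn as nn

class ThreeLevelDecoder(nn.Module):
    """The hyperhyperlatent conditions the hyperlatent decode, whose
    output in turn conditions the latent decode."""
    def __init__(self, net4, net5, net6):
        super().__init__()
        self.net4, self.net5, self.net6 = net4, net5, net6

    def forward(self, latent, hyperlatent, hyperhyperlatent):
        out4 = self.net4(hyperhyperlatent)    # decode the hyperhyperlatent
        out5 = self.net5(hyperlatent, out4)   # decode the hyperlatent
        return self.net6(latent, out5)        # produce the output image
```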
According to the present invention there is provided a data processing system configured to perform any of the methods above.
According to the present invention there is provided a data processing apparatus configured to perform any of the methods above.
According to the present invention there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the methods above.
According to the present invention there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the methods above.
Aspects of the invention will now be described by way of examples, with reference to the figures.
Compression processes may be applied to any form of information to reduce the amount of data, or file size, required to store that information. Image and video information is an example of information that may be compressed. The file size required to store the information, particularly during a compression process when referring to the compressed file, may be referred to as the rate. In general, compression can be lossless or lossy. In both forms of compression, the file size is reduced. However, in lossless compression, no information is lost when the information is compressed and subsequently decompressed. This means that the original file storing the information is fully reconstructed during the decompression process. In contrast to this, in lossy compression information may be lost in the compression and decompression process and the reconstructed file may differ from the original file. Image and video files containing image and video data are common targets for compression. JPEG, JPEG2000, AVC, HEVC and AV1 are examples of compression processes for image and/or video files.
In a compression process involving an image, the input image may be represented as x. The data representing the image may be stored in a tensor of dimensions H×W×C, where H represents the height of the image, W represents the width of the image and C represents the number of channels of the image. Each H×W data point of the image represents a pixel value of the image at the corresponding location. Each channel C of the image represents a different component of the image for each pixel which are combined when the image file is displayed by a device. For example, an image file may have 3 channels with the channels representing the red, green and blue components of the image respectively. In this case, the image information is stored in the RGB colour space, which may also be referred to as a model or a format. Other examples of colour spaces or formats include the CMYK and the YCbCr colour models. However, the channels of an image file are not limited to storing colour information and other information may be represented in the channels. As a video may be considered a series of images in sequence, any compression process that may be applied to an image may also be applied to a video. Each image making up a video may be referred to as a frame of the video.
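For example, a 512×768 RGB image may be stored as follows (a small illustrative snippet):

```python
import numpy as np

# An H x W x C tensor: each H x W data point is a pixel, and the three
# channels hold its red, green and blue components respectively.
image = np.zeros((512, 768, 3), dtype=np.uint8)
H, W, C = image.shape                  # 512, 768, 3
red_of_top_left_pixel = image[0, 0, 0]
```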
The frames of a video may be labelled depending on the nature of the frame. For example, frames of a video may be labeled as I-frames and P-frames. I-frames may be the first frame of a new section of a video. For example, the first frame after a scene transition may be labeled an I-frame. P-frames may be subsequent frames after an I-frame. For example, the background or objects present in a P-frame may not change from the I-frame preceding the P-frame. The changes in a P-frame compared to the I-frame preceding the P-frame may be described by motion of the objects present in the frame or by motion of the perspective of the frame.
The output image may differ from the input image and may be represented by x̂. The difference between the input image and the output image may be referred to as distortion or a difference in image quality. The distortion can be measured using any distortion function which receives the input image and the output image and provides an output which represents the difference between the input image and the output image in a numerical way. An example of such a function is the mean square error (MSE) between the pixels of the input image and the output image, but there are many other ways of measuring distortion, as will be known to the person skilled in the art. The distortion function may comprise a trained neural network.
Typically, the rate and distortion of a lossy compression process are related. An increase in the rate may result in a decrease in the distortion, and a decrease in the rate may result in an increase in the distortion. Changes to the distortion may affect the rate in a corresponding manner. A relation between these quantities for a given compression technique may be defined by a rate-distortion equation.
AI based compression processes may involve the use of neural networks. A neural network is an operation that can be performed on an input to produce an output. A neural network may be made up of a plurality of layers. The first layer of the network receives the input. One or more operations may be performed on the input by the layer to produce an output of the first layer. The output of the first layer is then passed to the next layer of the network which may perform one or more operations in a similar way. The output of the final layer is the output of the neural network.
Each layer of the neural network may be divided into nodes. Each node may receive at least part of the input from the previous layer and provide an output to one or more nodes in a subsequent layer. Each node of a layer may perform the one or more operations of the layer on at least part of the input to the layer. For example, a node may receive an input from one or more nodes of the previous layer. The one or more operations may include a convolution, a weight, a bias and an activation function. Convolution operations are used in convolutional neural networks. When a convolution operation is present, the convolution may be performed across the entire input to a layer. Alternatively, the convolution may be performed on at least part of the input to the layer.
Each of the one or more operations is defined by one or more parameters that are associated with each operation. For example, the weight operation may be defined by a weight matrix defining the weight to be applied to each input from each node in the previous layer to each node in the present layer. In this example, each of the values in the weight matrix is a parameter of the neural network. The convolution may be defined by a convolution matrix, also known as a kernel. In this example, one or more of the values in the convolution matrix may be a parameter of the neural network. The activation function may also be defined by values which may be parameters of the neural network. The parameters of the network may be varied during training of the network.
Other features of the neural network may be predetermined and therefore not varied during training of the network. For example, the number of layers of the network, the number of nodes of the network, the one or more operations performed in each layer and the connections between the layers may be predetermined and therefore fixed before the training process takes place. These features that are predetermined may be referred to as the hyperparameters of the network. These features are sometimes referred to as the architecture of the network.
To train the neural network, a training set of inputs may be used for which the expected output, sometimes referred to as the ground truth, is known. The initial parameters of the neural network are randomized and the first training input is provided to the network. The output of the network is compared to the expected output, and based on a difference between the output and the expected output the parameters of the network are varied such that the difference between the output of the network and the expected output is reduced. This process is then repeated for a plurality of training inputs to train the network. The difference between the output of the network and the expected output may be defined by a loss function. The loss function may be evaluated using the difference between the output of the network and the expected output, and the gradient of the loss function determined. Back-propagation of the gradients dL/dy of the loss function through the network may then be used to update the parameters of the neural network, for example by gradient descent. A plurality of neural networks in a system may be trained simultaneously through back-propagation of the gradient of the loss function to each network.
In the case of AI based image or video compression, the loss function may be defined by the rate-distortion equation. The rate-distortion equation may be represented by Loss=D+λ*R, where D is the distortion function, λ is a weighting factor, and R is the rate loss. λ may be referred to as a Lagrange multiplier. The Lagrange multiplier provides a weight for a particular term of the loss function in relation to each other term and can be used to control which terms of the loss function are favoured when training the network.
In the case of AI based image or video compression, a training set of input images may be used. An example training set of input images is the KODAK image set (for example at www.cs.albany.edu/xypan/research/snr/Kodak.html). An example training set of input images is the IMAX image set. An example training set of input images is the Imagenet dataset (for example at www.image-net.org/download). An example training set of input images is the CLIC Training Dataset P (“professional”) and M (“mobile”) (for example at http://challenge.compression.cc/tasks/).
An example of an AI based compression process 100 is shown in
In a third step, the quantized latent is entropy encoded in an entropy encoding process 150 to produce a bitstream 130. The entropy encoding process may be for example, range or arithmetic encoding. In a fourth step, the bitstream 130 may be transmitted across a communication network.
In a fifth step, the bitstream is entropy decoded in an entropy decoding process 160. The quantized latent is provided to another trained neural network 120 characterized by a function gθ acting as a decoder, which decodes the quantized latent. The trained neural network 120 produces an output based on the quantized latent. The output may be the output image of the AI based compression process 100. The encoder-decoder system may be referred to as an autoencoder.
The system described above may be distributed across multiple locations and/or devices. For example, the encoder 110 may be located on a device such as a laptop computer, desktop computer, smart phone or server. The decoder 120 may be located on a separate device which may be referred to as a recipient device. The system used to encode, transmit and decode the input image 5 to obtain the output image 6 may be referred to as a compression pipeline.
The AI based compression process may further comprise a hyper-network 105 for the transmission of meta-information that improves the compression process. The hyper-network 105 comprises a trained neural network 115 acting as a hyper-encoder fθh and a trained neural network 125 acting as a hyper-decoder gθh. An example of such a system is shown in
Components of the system not further discussed may be assumed to be the same as discussed above. The neural network 115 acting as a hyper-encoder receives the latent that is the output of the encoder 110. The hyper-encoder 115 produces an output based on the latent representation that may be referred to as a hyper-latent representation. The hyper-latent is then quantized in a quantization process 145 characterised by Qh to produce a quantized hyper-latent. The quantization process 145 characterised by Qh may be the same as the quantisation process 140 characterised by Q discussed above.
In a similar manner as discussed above for the quantized latent, the quantized hyper-latent is then entropy encoded in an entropy encoding process 155 to produce a bitstream 135. The bitstream 135 may be entropy decoded in an entropy decoding process 165 to retrieve the quantized hyper-latent. The quantized hyper-latent is then used as an input to trained neural network 125 acting as a hyper-decoder. However, in contrast to the compression pipeline 100, the output of the hyper-decoder may not be an approximation of the input to the hyper-decoder 115. Instead, the output of the hyper-decoder is used to provide parameters for use in the entropy encoding process 150 and entropy decoding process 160 in the main compression process 100. For example, the output of the hyper-decoder 125 can include one or more of the mean, standard deviation, variance or any other parameter used to describe a probability model for the entropy encoding process 150 and entropy decoding process 160 of the latent representation. In the example shown in
Further transformations may be applied to at least one of the latent and the hyper-latent at any stage in the AI based compression process 100. For example, at least one of the latent and the hyper latent may be converted to a residual value before the entropy encoding process 150, 155 is performed. The residual value may be determined by subtracting the mean value of the distribution of latents or hyper-latents from each latent or hyper latent. The residual values may also be normalised.
To perform training of the AI based compression process described above, a training set of input images may be used as described above. During the training process, the parameters of both the encoder 110 and the decoder 120 may be simultaneously updated in each training step. If a hyper-network 105 is also present, the parameters of both the hyper-encoder 115 and the hyper-decoder 125 may additionally be simultaneously updated in each training step.
The training process may further include a generative adversarial network (GAN). When applied to an AI based compression process, in addition to the compression pipeline described above, an additional neural network acting as a discriminator is included in the system. The discriminator receives an input and outputs a score based on the input providing an indication of whether the discriminator considers the input to be ground truth or fake. For example, the indication may be a score, with a high score associated with a ground truth input and a low score associated with a fake input. For training of a discriminator, a loss function is used that maximizes the difference in the output indication between an input ground truth and an input fake.
When a GAN is incorporated into the training of the compression process, the output image 6 may be provided to the discriminator. The output of the discriminator may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Alternatively, the discriminator may receive both the input image 5 and the output image 6 and the difference in output indication may then be used in the loss function of the compression process as a measure of the distortion of the compression process. Training of the neural network acting as a discriminator and the other neural networks in the compression process may be performed simultaneously. During use of the trained compression pipeline for the compression and transmission of images or video, the discriminator neural network is removed from the system and the output of the compression pipeline is the output image 6.
Incorporation of a GAN into the training process may cause the decoder 120 to perform hallucination. Hallucination is the process of adding information in the output image 6 that was not present in the input image 5. In an example, hallucination may add fine detail to the output image 6 that was not present in the input image 5 or received by the decoder 120. The hallucination performed may be based on information in the quantized latent received by decoder 120.
As discussed above, a video is made up of a series of images arranged in sequential order. AI based compression process 100 described above may be applied multiple times to perform compression, transmission and decompression of a video. For example, each frame of the video may be compressed, transmitted and decompressed individually. The received frames may then be grouped to obtain the original video.
A number of concepts related to the AI compression processes discussed above will now be described. Although each concept is described separately, one or more of the concepts described below may be applied in an AI based compression process as described above.
Quantisation is a critical step in any AI-based compression pipeline. Typically, quantisation is achieved by rounding data to the nearest integer. This can be suboptimal, because some regions of images and video can tolerate higher information loss, while other regions require fine-grained detail. Below it is discussed how the size of quantisation bins can be learned, instead of being fixed to nearest-integer rounding. We detail several architectures to achieve this, such as predicting bin sizes from hypernetworks, context modules, and additional neural networks. We also document the necessary changes to the loss function and quantisation procedure required for training AI-based compression pipelines with learned quantisation bin sizes, and show how to introduce Bayesian priors to control the distribution of bin sizes that are learned during training. We show how learned quantisation bins can be used both with and without split quantisation. This innovation also allows distortion gradients to flow through the decoder and to the hypernetwork. Finally, we give a detailed account of Generalised Quantisation Functions, which give performance and runtime improvements. In particular, this innovation allows us to include a context model in a compression pipeline's decoder, but without incurring runtime penalties from repeatedly running an arithmetic (or other lossless) decoding algorithm. Our methods for learning quantisation bins are compatible with all ways of transmitting metainformation, such as hyperpriors, autoregressive models, and implicit models.
The following discussion will outline the functionality, scope and future outlook of learned quantisation bins and generalised quantisation functions for usage in, but not limited to, AI-based image and video compression.
A compression algorithm may be broken into two phases: the encoding phase and the decoding phase. In the encoding phase, input data is transformed into a latent variable with a smaller representation (in bits) than the original input variable. In the decoding phase, a reverse transform is applied to the latent variable, by which the original data (or an approximation of the original data) is recovered.
An AI-based compression system must also be trained. This is the procedure of choosing parameters of the AI-based compression system that achieve good compression results (small file size and minimal distortion). During training, parts of the encoding and decoding algorithm are run, in order to decide how to adjust the parameters of the AI-based compression system.
To be precise, in AI-based compression, encoding typically takes the following form:

y = fenc(x)
Here x is the data to be compressed (image or video), fenc is the encoder, which is usually a neural network with parameters θ that are trained. The encoder transforms the input data x into a latent representation y, which is lower-dimensional and in an improved form for further compression.
To compress y further and transmit as a stream of bits, an established lossless encoding algorithm such as arithmetic encoding may be used. These lossless encoding algorithms may require y to be discrete, not continuous, and also may require knowledge of the probability distribution of the latent representation. To achieve this, a quantisation function Q (usually nearest integer rounding) is used to convert the continuous data into discrete values ŷ.
The necessary probability distribution p(ŷ) is found by fitting a probability distribution onto the latent space. The probability distribution can be directly learned, or as is often the case, is a parametric distribution with parameters determined by a hyper-network consisting of a hyper-encoder and hyper-decoder. If using a hyper-network, a second bitstream (also known as "side information") may be encoded, transmitted, and decoded:

ẑ = Q(fθh(y)), (μy, σy) = gθh(ẑ)
where μy, σy are the mean and scale parameters that determine the quantised latent distribution p(ŷ).
The encoding process (with a hyper-network) is depicted in
Decoding proceeds as follows:

ŷ = decode(bitstream, p(ŷ)), x̂ = fdec(ŷ)
Summarising: the distribution of latents p(ŷ) is used in the arithmetic decoder (or other lossless decoding algorithm) to turn the bitstream into the quantised latents ŷ. Then a function fdec transforms the quantised latents into a lossy reconstruction of the input data, denoted x̂. In AI-based compression, fdec is usually a neural network, depending on learned parameters θ.
If using a hyper-network, the side information bitstream is first decoded and then used to obtain the parameters needed to construct p(ŷ), which is needed for decoding the main bitstream. An example of the decoding process (with a hyper-network) is depicted in
AI-based compression depends on learning parameters for the encoding and decoding neural networks, using typical optimisation techniques with a "loss function." The loss function is chosen to balance the goals of compressing the image or video to small file sizes, while maximising reconstruction quality. Thus the loss function consists of two terms:

Loss = D + λ·R
Here R determines the cost of encoding the quantised latents according to the distribution p(ŷ), D measures the reconstructed image quality, and λ is a parameter that determines the tradeoff between low file size and reconstruction quality. A typical choice of R is the cross entropy

R = −𝔼x∼p(x)[log pŷ(ŷ)]  (5a)
The choice of pŷ(ŷ) is due to quantisation: the latents are rounded to the nearest integer, so the probability of ŷ is given by the integral of the (unquantised) latent distribution p(y) from ŷ−½ to ŷ+½, which can be written in terms of the cumulative distribution function P as pŷ(ŷ) = P(ŷ+½) − P(ŷ−½). The function D may be chosen to be the mean squared error, but can also be a combination of other metrics of perceptual quality, such as MS-SSIM, LPIPS, and/or adversarial loss (if using an adversarial neural network to enforce image quality).
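As a sketch, assuming a Gaussian latent distribution and a small clamping floor for numerical stability (both illustrative choices, not prescribed by the method), the rate term might be computed as:

```python
import torch

def rate_bits(y_hat, mu, sigma):
    """Rate term R for nearest-integer quantisation: the probability of each
    quantised latent is the CDF mass in the unit bin centred on it."""
    dist = torch.distributions.Normal(mu, sigma)
    p = dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)  # p_ŷ(ŷ) = P(ŷ+1/2) − P(ŷ−1/2)
    return -torch.log2(p.clamp_min(1e-9)).sum()        # total cost in bits
```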
If using a hyper-network, an additional term may be added to R to represent the cost of transmitting the additional side information:

Rz = −𝔼x∼p(x)[log pẑ(ẑ)]
Altogether, note that the loss function depends explicitly on the choice of quantisation scheme through the R term, and implicitly, because ŷ depends on the choice of quantisation scheme.
It will now be discussed how learned quantisation bins may be used in AI-based image and video compression. The steps involved are discussed in turn below.
A significant step in the typical AI-based image and video compression pipeline is “quantisation,” where the pixels of the latent representation are usually rounded to the nearest integer. This is required for the algorithms that losslessly encode the bitstream. However, the quantisation step introduces its own information loss, which impacts reconstruction quality.
It is possible to improve the quantisation function by training a neural network to predict the size of the quantisation bin that should be used for each latent pixel. Normally, the latents y are rounded to the nearest integer, which corresponds to a "bin size" of 1. That is, every possible value of y in an interval of length 1 gets mapped to the same ŷ:

ŷ = Q(y) = ⌊y⌉
However, this may not be the optimal choice of information loss: for some latent pixels, more information can be disregarded (equivalently: using bins larger than 1) without impacting reconstruction quality much. And for other latent pixels, the optimal bin size is smaller than 1.
This issue can be resolved by predicting the quantisation bin size, per image, per pixel. We do this with a tensor Δ ∈ ℝC×H×W, which then modifies the quantisation function as follows:

ξy = QΔ(y) = ⌊y/Δ⌉
We refer to ξy as the "quantised latent residuals." Thus Equation 7 becomes:

ŷ = ⌊y/Δ⌉ ⊙ Δ, with element-wise division and multiplication,
indicating that values in an interval of length Δ get mapped to the same quantised latent value.
Note that because the learned quantisation bin sizes are incorporated into a modification of the quantisation function Q, any data that we wish to encode and transmit can make use of learned quantisation bin sizes. For example, if instead of encoding the latents ŷ, we wish to encode the mean-subtracted latents y−μy, this can be achieved:

ξy = ⌊(y − μy)/Δ⌉, ŷ = ξy ⊙ Δ + μy
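A minimal sketch of this modified quantisation (PyTorch assumed; the mean subtraction is optional, as in the example above):

```python
import torch

def quantise_with_bins(y, delta, mu=None):
    """Q_Δ: optionally subtract the mean, divide element-wise by the per-pixel
    bin sizes Δ, and round; returns the residuals ξ_y that are entropy encoded."""
    if mu is not None:
        y = y - mu
    return torch.round(y / delta)

def dequantise_with_bins(xi, delta, mu=None):
    """Inverse step used by the decoder: rescale the residuals by Δ element-wise."""
    y_hat = xi * delta
    if mu is not None:
        y_hat = y_hat + mu
    return y_hat
```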
Similarly, hyperlatents, hyperhyperlatents, and other objects we might wish to quantise can all use the modified quantisation function QΔ, for an appropriately learned Δ.
Several architectures for predicting quantisation bin sizes will now be discussed. A possible architecture is predicting the quantisation bin sizes Δ using a hypernetwork. The bitstream is encoded as follows:

ẑ = Q(fθh(y)), Δ = gθh(ẑ), ξy = ⌊y/Δ⌉
where the division is element-wise. ξy is now the object that is losslessly encoded and sent as a bitstream (we refer to ξy as the quantised latent residuals).
An example of the modified encoding process using a hypernetwork is depicted in
When decoding, the bitstream is losslessly decoded as usual. We then use Δ to rescale ξy by multiplying the two element-wise. The result of this transformation is what is now denoted ŷ and passed to the decoder network as usual:

ŷ = ξy ⊙ Δ
An example of the modified decoding process using a hypernetwork is depicted in
Applying the above techniques may lead to a 1.5 percent improvement in the rate of an AI based compression pipeline and a 1.9 percent improvement in the distortion when measured by MSE. The performance of the AI based compression process is therefore improved.
We detail several variants of the above architectures which are of use:
We also emphasise that our methods for learning quantisation bins are compatible with all ways of transmitting metainformation, such as hyperpriors, hyperhyperpriors, autoregressive models, and implicit models.
To train neural networks with learned quantisation bins for AI-based compression, we may modify the loss function. In particular, the cost of encoding data described in Equation 5a may be modified as follows:

pŷ(ŷ) = P(ŷ + Δ/2) − P(ŷ − Δ/2), R = −𝔼x∼p(x)[log pŷ(ŷ)]
The idea is that we need to integrate the probability distribution of the latents from ŷ − Δ/2 to ŷ + Δ/2, instead of integrating over an interval of length 1.
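As a sketch (same illustrative assumptions as the earlier snippet: Gaussian latent model, small clamping floor), the modified rate term might look like:

```python
import torch

def rate_bits_learned_bins(y_hat, mu, sigma, delta):
    """Rate term with learned bin sizes: the probability mass is the CDF mass
    in a bin of width Δ (instead of width 1) around each quantised value."""
    dist = torch.distributions.Normal(mu, sigma)
    p = dist.cdf(y_hat + delta / 2) - dist.cdf(y_hat - delta / 2)
    return -torch.log2(p.clamp_min(1e-9)).sum()
```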
Similarly, if using a hypernetwork, the term Rz = −𝔼x∼p(x)[log pẑ(ẑ)] corresponding to the cost of encoding the hyperlatents is modified in exactly the same way as the cost of encoding the latents is modified to incorporate learned quantisation bin sizes.
Neural networks are usually trained by variants of gradient descent, utilising backpropagation to update the learned parameters of the network. This requires computing gradients of all layers in the network, which in turn requires the layers of the network to be composed of differentiable functions. However, the quantisation function Q and its learned bins modification QΔ are not differentiable, because of the presence of the rounding function. In AI-based compression, one of two differentiable approximations to quantisation is used during training of the network to replace Q(y) (no approximation is used once the network is trained and used for inference):
When using learned quantisation bins, the approximations to quantisation during training are

(ξ̃y)noise = y/Δ + ε, ε ∼ U(−½, ½), and (ξ̃y)STE = ⌊y/Δ⌉, where in the backward pass the gradient of the rounding function is replaced by the identity (the straight-through estimator, STE).
Instead of choosing one of these differentiable approximations during training, AI-based compression pipelines can also be trained with "split quantisation," where we use (ξ̃y)noise when calculating the rate loss R, but we send (ξ̃y)STE to the decoder in training. AI-based compression networks can be trained with both split quantisation and learned quantisation bins.
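A minimal sketch of the two approximations with learned bins (PyTorch; the uniform-noise range and the detach-based STE are the standard constructions, shown here only for illustration):

```python
import torch

def quantise_noise(y, delta):
    """'Noise' approximation: simulate rounding of y/Δ with uniform noise in
    (−1/2, 1/2); fully differentiable in both y and Δ."""
    noise = torch.rand_like(y) - 0.5
    return y / delta + noise

def quantise_ste(y, delta):
    """Straight-through estimator: true rounding in the forward pass, identity
    gradient in the backward pass (the detach trick)."""
    xi = y / delta
    return xi + (torch.round(xi) - xi).detach()
```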
First, note that split quantisation comes in two flavours:
Notice that with integer rounding quantisation, hard split and soft split quantisation are equivalent, because the gradients in the backward pass coincide.
However, hard split and soft split quantisation are not equivalent when using learned quantisation bins, because the gradients in the backward pass differ.
Now we examine loss function gradients in each quantisation scheme.
Rate gradients w.r.t. Δ are negative in every quantisation scheme:
In every quantisation scheme, the rate term receives ξ̃noise, so
Then we have
because ½ − ε > 0 always. So the rate gradients always drive Δ to increase. Distortion gradients w.r.t. Δ differ by quantisation scheme. The gradients are
In soft split quantisation,
and 𝔼[ξ̃y ε] = 𝔼[ξ̃y]𝔼[ε], because the random noise ε in the backward pass is independent of the STE rounding that produces ξ̃y in the forward pass. This means that
Therefore, in soft split quantisation, rate gradients drive Δ larger, while distortion gradients are on average 0, so overall Δ→∞ and training the network is not possible.
Conversely, in hard split quantisation
because ξ̃y is not independent of
Altogether, if using split quantisation with learned quantisation bins, we use hard split quantisation and not soft split quantisation.
The nontrivial distortion gradients that can be achieved with or without split quantisation mean that distortion gradients flow through the decoder and to the hypernetwork. This is normally not possible in models with a hypernetwork, but is a feature introduced by our methods for learning quantisation bin sizes.
In some compression pipelines it is important to control the distribution of values that are learned for the quantisation bin sizes (this is not always necessary). When needed, we achieve this by introducing an additional term into the loss function:

Loss = D + λ·R + FΔ
FΔ is characterised by the choice of distribution pΔ(Δ), which we refer to as a “prior” on Δ, following terminology used in Bayesian statistics. Several choices can be made for the prior:
In the previous section, we gave detailed descriptions of a simple quantisation function Q that makes use of a tensor of bin sizes Δ:

ξy = ⌊y/Δ⌉, ŷ = ξy ⊙ Δ  (24)
We can extend all of these methods to more generalised quantisation functions. In the general case Q is some invertible function of y and Δ. Then encoding is given by:

ξy = Q(y, Δ),
and decoding is achieved by

ŷ = Q−1(ξy, Δ).
This quantisation function is more flexible than Equation 24, resulting in improved performance. The generalised quantisation function can also be made context-aware e.g. by incorporating quantisation parameters that use an auto-regressive context model.
All methods of the previous section are still compatible with the generalised quantisation function framework:
The more flexible generalised quantisation functions improve performance. In addition to this, the generalised quantisation function can depend on parameters that are determined auto-regressively, meaning that quantisation depends on pixels that have already been encoded/decoded:
In general, using auto-regressive context models improves AI-based compression performance.
Auto-regressive Generalised Quantisation Functions are additionally beneficial from a runtime perspective. Other standard auto-regressive models such as PixelCNN require executing the arithmetic decoder (or other lossless decoder) each time a pixel is being decoded using the context model. This is a severe performance hit in real-world applications of image and video compression. However, the Generalised Quantisation Function framework allows us to incorporate an auto-regressive context model into AI-based compression, without the runtime problems of e.g. PixelCNN. This is because Q−1 acts auto-regressively on ξ̃, which is fully decoded from the bitstream. Thus the arithmetic decoder does not need to be run auto-regressively and the runtime problem is solved.
The generalised quantisation function can be any invertible function. For example:
Furthermore, Q in general need not have a closed form, or be invertible. For example, we can define Qenc (·, Δ) and Qdec (·, Δ) where these functions are not necessarily inverses of each other and train the whole pipeline end-to-end. In this case, Qenc and Qdec could be neural networks, or modelled as specific processes such as Gaussian Processes, Probabilistic Graphical Models (simple example: Hidden Markov Models).
To train AI-based Compression pipelines that use Generalised Quantisation Functions, we use many of the same tools as described above:
Depending on the choice of Generalised Quantisation Function, other tools become necessary for training the AI-based compression pipeline:
There are several possibilities for context modelling that are compatible with the Generalised Quantisation Function framework:
Here L is a matrix that can be given a particular structure depending on the desired context. For example, L could be banded, upper/lower triangular, sparse, or only non-zero for the n elements preceding the current position being decoded (in raster-scan order).
If we obtain important metainformation, e.g. from an attention mechanism/focus mask, we can incorporate this into the Δ predictions. In this case, the size of the quantisation bins adapts even more precisely to sensitive areas of images and videos, where knowledge of the sensitive regions is stored in this metainformation. In this way, less information is lost from perceptually important regions, while performance gains result from disregarding information in unimportant regions, in an enhanced way compared to AI-based compression pipelines that do not have adaptable bin sizes.
We further outline a connection between learned quantisation bins and variable rate models: one form of variable rate model trains an AI-based compression pipeline with a free hyperparameter δ controlling bin sizes. At inference time, δ is transmitted as metainformation to control the rate (cost in bits) of transmission: a lower δ means smaller bins and a larger transmission cost, but better reconstruction quality.
In the variable rate framework, δ is a global parameter in the sense that it controls all bin sizes simultaneously. In our innovation, we obtain the tensor of bin sizes Δ locally, that is, predicted per pixel, which is an improvement. In addition, variable rate models that use δ to control the rate of transmission are compatible with our framework, because we can scale the local prediction element-wise by the global prediction as needed to control the rate during inference:

Δtotal = δ · Δ
In this section we detail a collection of training procedures applied to the generative adversarial networks framework, which allows us to control the bit allocation for different areas in the images depending on what is depicted in them. This approach allows us to bias a generative compression model to any type of image data and control the quality of the resulting images based on the subject in the image.
Generative adversarial networks (GANs) have shown excellent results when applied to a variety of different generative tasks in the image, video and audio domains. The approach is inspired by game theory, in which two models, a generator and a critic, are pitted against each other, making both of them stronger as a result. The first model in a GAN is a generator G that takes a noise variable input z and outputs a synthetic data sample x̂; the second model is a discriminator D that is trained to tell the difference between samples from the real data distribution and the data generated by the generator. An example overview of the architecture of a GAN is shown in
Let us denote px as the data distribution over real samples x, pz as the data distribution over the noise samples z, and pg as the generator's distribution over the data x.
Training a GAN is then presented as a minimax game in which the following function is optimised:

minG maxD 𝔼x∼px[log D(x)] + 𝔼z∼pz[log(1 − D(G(z)))]
Adapting the generative adversarial approach for the image compression task, we begin by considering an image x ∈ ℝC×H×W, where C is the number of channels, and H and W are the height and width in pixels.
The compression pipeline based on an autoencoder consists of an encoder function fθ(x) = y that encodes the image x into a latent representation y, a quantisation function Q required for sending ŷ as a bitstream, and a decoder function gθ(ŷ) = x̂ that decodes the quantised latents ŷ into the reconstructed image x̂ ∈ ℝC×H×W:

x̂ = gθ(Q(fθ(x)))  (32)
In this case, a combination of encoder fθ, quantisation function Q and decoder gθ can be thought of together as a generative network. For simplicity of notation we denote this generative network as G(x). The generative network is complemented with a discriminator network D that is training in conjunction with the generative network in a two-stage manner.
An example of a standard generative adversarial compression pipeline is shown in

The compression network may be trained with the rate-distortion loss

LRD = 𝔼x∼px[λrate·r(ŷ) + d(x, x̂)]  (36)
where px is a distribution of natural images, r(ŷ) is a rate measured using an entropy model, λrate is a Lagrangian multiplier controlling the balance between rate and distortion, and d(x, {circumflex over (x)}) is a distortion measure.
Complementing this learned compression network with a discriminator model may improve the perceptual quality of the output images. In this case, the compression encoder-decoder network can be viewed as a generative network, and the two models can then be trained using a bi-level approach at each iteration. For the discriminator architecture, we chose to use a conditional discriminator, shown to produce better quality reconstructed images. The discriminator D(x, ŷ), in this case, is conditioned on the quantised latent ŷ. We begin by training the discriminator with the discriminator loss:

LD = 𝔼x∼px[−log D(x, ŷ)] + 𝔼x∼px[−log(1 − D(x̂, ŷ))]
To train the generative network in (32), we augment the rate-distortion loss in (36) by adding an adversarial "non-saturating" loss used for training generators in GANs:

LG = 𝔼x∼px[λrate·r(ŷ) + d(x, x̂) − log D(x̂, ŷ)]  (37)
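For illustration, the adversarial terms might be implemented as follows (a sketch assuming logit-valued discriminator outputs d_real and d_fake; the exact weighting of the terms in the full loss is a design choice, not prescribed here):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    """Discriminator objective in binary cross-entropy form: score real
    inputs high and generated (reconstructed) inputs low."""
    return (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

def generator_adversarial_loss(d_fake):
    """'Non-saturating' generator loss: push the discriminator score for the
    reconstruction towards 'real'."""
    return F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))

# total loss (sketch): rate + lambda_rate * distortion + generator_adversarial_loss(d_fake)
```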
Adding the adversarial loss into the rate-distortion loss encourages the network to produce natural-looking patterns and textures. Using an architecture that combines GANs with autoencoders has allowed for excellent results in image compression, with substantial improvements in the perceptual quality of the reconstructed images. However, despite the great overall results of such architectures, there are a number of notable failure modes. It has been observed that these models struggle to compress regions of high visual importance, which include, but are not limited to, human faces or text. An example of such a failure mode is exhibited in
Under this framework we train the network on multiple datasets with a separate discriminator for each of them. We start by selecting N additional datasets X1, . . . XN on which the model is biased. A good example of one such dataset that helps with modelling faces would be a dataset that consists of portraits of people. For each dataset Xi, we introduce a discriminator model Di. Each of the discriminator models Di is trained only on data from the dataset Xi, while the Encoder-Decoder model is trained on images from all the datasets.
where xi is an image from dataset Xi, yi is the latent representation of the image xi, G(xi) is the reconstructed image, and pxi is the data distribution of dataset Xi.
An illustration of a compression pipeline with multi-discriminator cGAN training with dataset biasing is shown in
For illustration purposes, we focus our attention on the failure mode of faces, as previously demonstrated in
As an example, consider dataset biasing using just one extra dataset. In this case, X1 is the general training dataset and X2 is a dataset with portrait images only. A comparison between the reconstructions of the same generative model trained to the same bitrate with and without the multi-discriminator dataset biasing scheme is presented in
The approach described above can also be used in an architecture where a single discriminator is used for all the datasets. Additionally, the discriminator Di for a specific dataset can be trained more often than the generator, increasing the effects of biasing on that dataset.
Given a generative compression network as described above, we now define architectural modifications that permit higher bit allocation, conditioned on image or frame context. To increase the effect of dataset biasing and change the perceptual quality of different areas of the image depending on the image's subject, we propose a training procedure in which the Lagrangian coefficient that controls the bitrate differs for each dataset Xi, changing the generator loss function in (37) to
This approach trains the model to assign a higher percentage of the bitstream to the face regions of the compressed images. The results of the bitrate-adjusted dataset biasing can be observed in
Extending the method proposed above, we propose using different distortion functions d(x, x̂) for the different datasets used for biasing. This method allows us to adjust the focus of the model for each particular type of data. For example, we can use a linear combination of MSE, LPIPS and MS-SSIM metrics as our distortion function.
Changing the coefficients λi,MSE, λi,LPIPS, λi,MS-SSIM, . . . of the different components of the distortion function may change the perceptual quality of the resulting images, allowing the generative compression model to reconstruct different areas of the image in different ways. Equation 37 can then be modified by indexing the distortion function d(x, x̂) for each of the datasets Xi:
We now discuss utilising salience (attention) masks to bias the generative compression model to the areas that are particularly important in the image. We propose generating these masks using a separate pre-trained network, the output of which can be used to further improve the performance of the compression model.
Begin by considering a network H that takes an image x as an input and outputs a binary mask m ∈ {0, 1}H×W:

m = H(x)
Salient pixels in x are indicated by ones in m, and zeros indicate areas that the network does not need to focus on. This binary mask can be used to further bias the compression network to these areas. Examples of such important areas include, but are not limited to, human facial features such as eyes and lips. Given m, we can modify the input image x so that these areas are prioritised. The modified image xH is then used as an input into the adversarial compression network. An example of such a compression pipeline is shown in
Extending the approach proposed above, we propose an architecture that makes use of the pre-trained network that produces the salience mask to bias a compression pipeline. This approach allows for changing the bit-rate allocation to the various parts of the reconstructed images by changing the mask, without retraining the compression network. In this variation, the mask m from equation 41 is used as an additional input to train the network to allocate more bits to the areas marked as salient (one) in m. At the inference stage, after the network is trained, bit allocation can be adjusted by modifying the mask m. An example of such a compression pipeline is shown in
We further propose a training scheme that ensures that the model is exposed to examples from a wide range of natural images. The training dataset is constructed out of images from N different classes, with each image being labelled accordingly. During training, the images are sampled from the datasets according to their class. By sampling equally from each class, the model may be exposed to underrepresented classes and is able to learn the whole distribution of natural images.
Modern methods of learned image compression, such as VAE and GAN based architectures, allow for excellent compression at small bitrates with substantial improvements in the perceptual quality of the reconstructed images. However, despite the great overall results of such architectures, there are a number of notable failure modes, as discussed above. It has been observed that these models struggle to compress regions of high visual importance, which include, but are not limited to, human faces or text. An example of such a failure mode is exhibited in
We propose an approach that allows increasing the perceptual quality of a region of interest (ROI) by allocating more bits to it in the bitstream, by changing the quantisation bin size in the ROI.
To encode the latent y into a bitstream, we may first quantise it to ensure that it is discrete. We propose to use a quantisation parameter Δ to control the bpp allocated to each area. Δ is the quantisation bin size or quantisation interval, which represents the coarseness of quantisation in our latent and hyperlatent space. The coarser the quantisation, the fewer bits are allocated to the data.
Quantisation of the latent y is then achieved as follows:

ŷ = Q(y, Δ) = ⌊y/Δ⌉ · Δ  (42)
We propose to utilise a spatially varying Δ to control the coarseness of the quantisation across the image. This allows us to control the number of allocated bits and the visual quality of different areas of the image. This proceeds as follows.
Begin by considering a function H, usually represented by a neural network, that detects the regions of interest. The function H(x) takes an image x as an input and outputs a binary mask m ∈ {0, 1}H×W. A one in m indicates that the corresponding pixel of the image x lies within a region of interest, and a zero corresponds to the pixel lying outside of it.
In one instance, the network H(x) is trained prior to training the compression pipeline, and in another, it is trained in conjunction with the Encoder-Decoder. Map m is used for creating a quantisation map Δ where each pixel is assigned a quantisation parameter. If the value in m for a certain pixel is one, the corresponding value in Δ is small. Function Q, defined in eqn. 42 then uses the spatial map Δ to quantise y into ŷ before encoding it into bitstream. The result of such a quantisation scheme is a higher bitrate for the regions of interest compared to the rest of the image.
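A minimal sketch of constructing such a quantisation map from the mask (the two bin sizes delta_roi and delta_bg are illustrative values, not prescribed by the method):

```python
import torch

def delta_from_mask(mask, delta_roi=0.5, delta_bg=2.0):
    """Build a spatial quantisation map Δ from a binary ROI mask m: small bins
    (fine quantisation, more bits) inside the region of interest, larger bins
    (coarse quantisation, fewer bits) elsewhere."""
    mask = mask.float()
    return delta_roi * mask + delta_bg * (1.0 - mask)
```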
The proposed pipeline is illustrated in
In another instance, we may utilise a different quantisation function Qm for the areas identified with the ROI detection network H(x) from eqn. 41. An example of such an arrangement is shown in
Encoding and decoding a stream of discrete symbols (such as the latent pixels in an AI-based compression pipeline) into a binary bitstream may require access to a discrete probability mass function (PMF). However, it is widely believed that training such a discrete PMF in an AI-based compression pipeline is impossible, because training requires access to a continuous probability distribution function (PDF). As such, in training an AI-based compression pipeline, the de facto standard is to train a continuous PDF, and only after training is complete, approximate the continuous PDF with a discrete PMF, evaluated at a discrete number of quantization points.
Described below is an inversion of this procedure, in which an AI-based compression pipeline may be trained on a discrete PMF directly, by interpolating the discrete PMF to a continuous, real-valued space. The discrete PMF can be learned or predicted, and can also be parameterized.
The following description will outline the functionality, scope and future outlook of discrete probability mass functions and interpolation for usage in, but not limited to, AI-based image and video compression. The following provides a high-level description of discrete probability mass functions, a description of their use in inference and training of AI-based compression algorithms, and methods of interpolating functions (such as discrete probability mass functions).
In the AI-based compression literature, the standard approach to creating entropy models is to start with a continuous probability density function (PDF) py(y) (such as the Laplace or Gaussian distributions). Because Shannon entropy may only be defined on discrete variables ŷ (usually ŷ ∈ ℤ), this PDF must be turned into a discrete probability mass function (PMF) pŷ(ŷ), for use, for example, by a lossless arithmetic encoder/decoder. This may be done by gathering up all the (continuous) mass inside a (unit) bin centred at ŷ:

pŷ(ŷ) = ∫[ŷ−½, ŷ+½] py(t) dt
This approach was first proposed in Johannes Ballé, Valero Laparra, and Eero P. Simoncelli. End-to-end optimized image compression. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, Apr. 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017, which is hereby incorporated by reference. This function is not just defined on the integers ŷ ∈ ℤ and will accept any real-valued argument. This may be quite convenient during training, where a PDF over the continuous, real-valued latents outputted by the encoder is still needed. Therefore, a "new" function pỹ(ỹ) := pŷ(ỹ) will be defined, which, by definition, perfectly agrees with the PMF pŷ defined on the integers. This function pỹ is the PMF that an end-to-end AI-based compression algorithm is actually trained with.
To summarize: a continuous real-valued PDF py (y ∈ ℝ) is turned into a discrete PMF pŷ (ŷ ∈ ℤ), which is then evaluated as a continuous PDF pỹ during training (ỹ ∈ ℝ).
This thinking model may be reversed. Rather than starting from a PDF, it is instead possible to begin with a discrete PMF, and recover a continuous PDF (which need only be used in training) by interpolating the PMF.
Suppose we are given a PMF pŷ. We can represent this PMF using two vectors of length N, namely ŷi and p̂i, where i = 1 . . . N indexes the discrete points. In the old thinking model (where the PMF is defined through a function), we would define p̂i = pŷ(ŷi). However, in general the p̂i's could be any non-negative vector that sums to one. The vector ŷi should be sorted in ascending order, and does not necessarily need to have integer values.
Now, suppose we are given a query point ỹ ∈ [ŷ1, ŷN]. Note that the query point must be bounded by the extremes of the discrete points. To define (an approximate) training PDF f(ỹ), we use an interpolation routine
There are many different interpolation routines available. A non-exhaustive list of possible interpolating routines is:
In general, the function so defined via interpolation may not exactly be a PDF. Depending on the interpolation routine used, the interpolated value may be negative or may not have unit mass. However, these problems can be mitigated by choosing a suitable routine. For example, piecewise linear interpolation may preserve mass and preserve positivity, which ensures the interpolated function is actually a PDF. Piecewise cubic Hermite interpolation can be constrained to be positive, if the interpolating points themselves are positive as discussed in Randall L Dougherty, Alan S Edelman, and James M Hyman. Nonnegativity-, monotonicity-, or convexity-preserving cubic and quintic hermite interpolation. Mathematics of Computation, 52(186):471-494, 1989, which is hereby incorporated by reference.
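As an illustration, piecewise linear interpolation of a discrete PMF can be done with a single library call (numpy assumed; the vectors y_points and p_points are the ŷi and p̂i described above):

```python
import numpy as np

def interpolated_pdf(y_tilde, y_points, p_points):
    """Piecewise linear interpolation of a discrete PMF, giving a continuous
    surrogate usable in training. y_points must be sorted ascending, p_points
    non-negative and summing to one, and the query points must lie within
    [y_points[0], y_points[-1]]."""
    return np.interp(y_tilde, y_points, p_points)
```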
However, piecewise linear interpolation can suffer from other problems. Its derivatives are piecewise constant, and the interpolation error can be quite bad, for example as shown in the left image of
A discrete PMF can be trained directly in an AI-based image and video compression algorithm. To achieve this in training, the probability values of the real-valued latents, outputted by an encoder, are interpolated using the discrete values of the PMF model. In training, the PMF is learned by passing gradients from the rate (bitstream size) loss backwards through to the parameters of the PMF model.
The PMF model could be learned, or could be predicted. By learned, we mean that the PMF model and its hyper-parameters could be independent of the input image. By predicted, we mean that the PMF model could conditionally depend on 'side-information' such as information stored in hyper-latents. In this scenario, the parameters of the PMF could be predicted by a hyper-decoder. In addition, the PMF could conditionally depend on neighbouring latent pixels (in this case, we would say the PMF is a discrete PMF context model). Regardless of how the PMF is represented, during training the values of the PMF may be interpolated to provide estimates of the probability values at real-valued (non-quantized) points, which may be fed into the rate loss of the training objective function.
The PMF model could be parameterized in any one of the following ways (though this list is non-exhaustive):
This framework can be extended in any one of several ways. For instance, if the discrete PMF is multivariate (multi-dimensional), then a multivariate (multi-dimensional) interpolation scheme could be used to interpolate the values of the PMF to real vector-valued points. For instance, multi-linear interpolation could be used (bilinear in 2d; trilinear in 3d; etc). Alternately, multi-cubic interpolation could be used (bicubic in 2d; tricubic in 3d; etc).
This interpolation method is not constrained to only modeling discrete-valued PMFs. Any discrete-valued function can be interpolated, anywhere in the AI-based compression pipeline, and the techniques described herein are not strictly limited to modeling probability mass/density functions.
In AI-based compression, autoregressive context models have powerful entropy modeling capabilities, yet suffer from very poor run-time, due to the fact that they must be run in serial.
This document describes a method for overcoming this difficulty, by predicting autoregressive modeling components from a hyper-decoder (and conditioning these components on “side” information). This technique yields an autoregressive system with impressive modeling capabilities, but which is able to run in real-time. This real-time capability is achieved by detaching the autoregressive system from the model needed by the lossless decoder. Instead, the autoregressive system reduces to solving a linear equation at decode time, which can be done extremely quickly using numerical linear algebra techniques. Encoding can be done quickly as well by solving a simple implicit equation.
This document outlines the functionality and scope of current and future utilization of autoregressive probability models with linear decoding systems for use in, but not limited to, image and video data compression based on AI and deep learning.
In AI-based image and video compression, an input image x is mapped to a latent variable y. It is this latent variable which is encoded into a bitstream and sent to a receiver, who will decode the bitstream back into the latent variable. The receiver then transforms the recovered latent back into a representation (reconstruction) x̂ of the original image.
To perform the step of transforming the latent into a bitstream, the latent variable may be quantized into an integer-valued representation ŷ. This quantized latent ŷ is transformed into the bitstream via a lossless encoding/decoding scheme, such as an arithmetic encoder/decoder or range encoder/decoder.
Lossless encoding/decoding schemes may require a model of the one-dimensional discrete probability mass function (PMF) for each element of the quantized latent variable. The optimal bitstream length (file-size) is achieved when this model PMF matches the true one-dimensional data-distribution of the latents.
Thus, file-size is intimately tied to the power of the model PMF to match the true data distribution. More powerful model PMFs yield smaller file-sizes, and better compression. This in turn yields lower reconstruction errors (as for a given file-size, more information can be sent for reconstructing the original image). Hence, much effort has gone into developing powerful model PMFs (often called entropy models).
The typical approach for modeling one-dimensional PMFs in AI-based compression is to use a parametric one-dimensional distribution, P(Y=ŷi|θ), where θ are the parameters of the one-dimensional PMF. For example a quantized Laplacian or quantized Gaussian could be used. In these two examples, θ comprises the location μ and scale σ parameters of the distribution. For example, if a quantized Gaussian (Laplacian) were used, the PMF would be written

P(Y = ŷi | θ) = ∫[ŷi−δ/2, ŷi+δ/2] p(y | μ, σ) dy
Here p(y|μ, σ) is the continuous Gaussian (Laplacian), and δ is the quantization bin size (typically δ=1).
More powerful models may be created by “conditioning” the parameters θ, such as location μ or scale σ, on other information stored in the bitstream. In other words, rather than statically fixing parameters of the PMF to be constant across all inputs of the AI-based compression system, the parameters can respond dynamically to the input.
This is commonly done in two ways. In the first, extra side-information ẑ is sent in the bitstream in addition to ŷ. The variable ẑ is often called a hyper-latent. It is decoded in its entirety prior to decoding ŷ, and so is available for use in encoding/decoding ŷ. Then, μ and σ can be made functions of ẑ, for example returning μ and σ through a neural network. The one-dimensional PMF is then said to be conditioned on ẑ, and is given by P(Y = ŷ | μ(ẑ), σ(ẑ)).
Another approach is to use autoregressive probabilistic models. For example, PixelCNN as described in Aäron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Koray Kavukcuoglu, Oriol Vinyals, and Alex Graves. Conditional image generation with pixelcnn decoders. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, Dec. 5-10, 2016, Barcelona, Spain, pages 4790-4798, 2016, which is hereby incorporated by reference, has been widely used in academic AI-based compression papers, as for instance done in David Minnen, Johannes Ballé, and George Toderici. Joint autoregressive and hierarchical priors for learned image compression. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 10794-10803, 2018, which is hereby incorporated by reference. In this framework, context pixels are used to condition the location μ and scale σ parameters of the PMF at the current pixel. These context pixels are previously decoded pixels neighbouring the current pixel. For example, suppose the previous k pixels have been decoded. Due to inherent spatial correlations in images, these pixels often contain relevant information about the current active pixel. Hence, these context pixels may be used to improve the location and scale predictions of the current pixel. The PMF for the current pixel would then be given by P(Y=ŷi|μ(ŷi−1, . . . , ŷi−k), σ(ŷi−1, . . . , ŷi−k)), where now μ and σ are functions (usually convolutional neural networks) of the previous k variables.
These two approaches, conditioning via either hyper-latents or with autoregressive context models, both come with benefits and drawbacks.
One of the main benefits of conditioning via hyper-latents is that quantization can be location-shifted. In other words, quantization bins can be centred about the location parameter μ. Using integer-valued bins, the quantized latent is given as

ŷ = ⌊y − μ⌉ + μ
where ⌊·⌉ is the rounding function. This may yield superior results to straight rounding, ŷ = ⌊y⌉. In addition, conditioning via hyper-latents can be implemented relatively quickly, with (depending on the neural network architecture) real-time decoding speeds.
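As a one-line illustration (PyTorch assumed):

```python
import torch

def location_shifted_round(y, mu):
    """Location-shifted quantization: centre the integer bins on the predicted
    location parameter μ rather than on the integers themselves."""
    return torch.round(y - mu) + mu
```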
The main benefit of autoregressive context models is the use of contextual information—the neighbouring decoded pixels. Because images (and videos) are spatially highly correlated, these neighbouring pixels can provide very accurate and precise predictions about what the current pixel should be. Most state-of-the-art academic AI-based compression pipelines use autoregressive context models due to their impressive performance results, when measured in terms of bitstream length and reconstruction errors. However, despite their impressive relative performance, they suffer from two problems.
First, they must be run serially: the PMF of the current pixel depends on all previously decoded pixels. In addition, the location and scale functions μ(·) and σ(·) are usually large neural networks. These two facts mean that autoregressive context models cannot be run in real-time, taking many orders of magnitude longer than the computing budget necessitated by real-time performance on edge devices. Thus, in their current state, autoregressive context models are not commercially viable, despite the fact that they yield impressive compression performance.
Second, due to the effects of cascading errors, autoregressive context models must use straight rounding (ŷ = ⌊y⌉). Location-shifted rounding (ŷ = ⌊y − μ⌉ + μ) is not possible, because tiny floating point errors introduced early in the decoding pass can be amplified and magnified during the serial decoding pass, leading to vastly different predictions between the encoder and decoder. The lack of location-shifted rounding is problematic, and it is believed that, all other components being equal, an autoregressive model with location-shifted rounding (if it were possible to construct) would outperform a straight-rounding autoregressive model.
Thus, there is a need to develop a PMF modeling framework that combines the benefits of conditioning on hyper-latents (fast runtime; location-shifted rounding), with the impressive performance of autoregressive modeling (creating powerful predictions from prior decoded context pixels).
Described below is a technique to modify the hyper-decoder to additionally predict the parameters of an autoregressive model. In other words, we will condition the parameters of the autoregressive model on the hyper-latent ẑ. This is in contrast to the standard set-up in autoregressive modeling, where the autoregressive functions are static and unchanging, and do not change depending on the compression pipeline input.
We will primarily be concerned with the following quasi-linear setup (in the sense that the decode pass is linear, while encode is not). In addition to the μ and σ predictions, the hyper-decoder may also output a sparse matrix L, called the context matrix. This sparse matrix will be used for the autoregressive context modeling component of the PMF as follows. Given an ordering on the latent pixels (such as raster-scan order), suppose the previous k latent pixels have been encoded/decoded, and so are available for autoregressive context modeling. An approach is to use the following modified location-shifted quantization: we quantize via

ŷi = ⌊yi − μi − Σj Lijŷj⌉ + μi + Σj Lijŷj  (48)

where the sum runs over the previously decoded pixels j = i−k, . . . , i−1.
The probability model is then given by P(Y = ŷi | μi + Σj Lijŷj, σi), with the sum again over j = i−k, . . . , i−1. In matrix-vector notation, we have that

ŷ = ⌊y − μ − Lŷ⌉ + μ + Lŷ,
where here L is the sparse matrix outputted by the hyper-decoder. (Note that L need not be predicted, it could also be learned or static). Note that in the ordering of the latent pixels, L may be a strictly lower-triangular matrix. This hybrid autoregressive-hyper-latent context modeling approach may be called L-context.
Note that this is a form of autoregressive context modeling. This is because the one-dimensional PMF relies on previously decoded latent pixels. However we remark that only the location parameters may rely on previously decoded latent pixels, not the scale parameters.
Notice that the integer values which may be actually encoded by the arithmetic encoder/decoder are the quantization residuals

ξ̃ = ⌊y − μ − Lŷ⌉ = ŷ − μ − Lŷ.
Therefore, in decode, the arithmetic decoder returns from the bitstream not ŷ but ξ̃. Then, ŷ may be recovered by solving the following linear system for ŷ:

(I − L)ŷ = ξ̃ + μ  (50)
or put another way, by setting ŷ = (I − L)−1(ξ̃ + μ).
Solving the system (50) is detached from the arithmetic decoding process. That is, whereas the arithmetic decoding process must be done serially as the bitstream is received, solving (50) is independent of this process and can be done using any numerical linear algebra algorithm. The decoding pass of the L-context modeling step may not be a serial procedure, and can be run in parallel.
Another way of viewing this result is to see that, equivalently, the arithmetic encoder/decoder operates on ξ̃, which has location zero. That is, the arithmetic encoder operates not on the ŷ latents, but on the residuals ξ̃. In this view, the PMF is P(Ξ = ξ̃ | 0, σ). Only after ξ̃ is recovered from the bitstream do we then recover ŷ. However, since recovering ξ̃ from the bitstream may not be autoregressive (the only dependence being on σ, which has not been made context/autoregressive dependent), this procedure may be extremely fast. Then, ŷ can be recovered using highly optimized linear algebra routines to solve (50).
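A sketch of this decode step (numpy/scipy assumed; L is shown dense here for clarity, though in practice it is sparse). Because L is strictly lower-triangular, (I − L) is unit lower-triangular and a single forward substitution suffices; no serial re-runs of the arithmetic decoder are needed:

```python
import numpy as np
from scipy.linalg import solve_triangular

def decode_l_context(xi, mu, L):
    """Recover ŷ from the decoded residuals by solving (I − L) ŷ = ξ + μ."""
    n = xi.shape[0]
    A = np.eye(n) - L  # unit lower-triangular system matrix
    return solve_triangular(A, xi + mu, lower=True, unit_diagonal=True)
```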
In both encoding and the training of the L-context system, we may solve (48) for the unknown variable ŷ; ŷ is not given explicitly, and must be determined. In fact, (48) is an implicit system. Here we outline several possible approaches to finding ŷ satisfying (48).
Many of the techniques described in the previous section can be applied at decode time as well. In particular, the linear equation

(I − L)ŷ = ξ̃ + μ
may be solved in any of the following ways.
Below, we describe in detail an example of the L-context module inside an AI-based compression pipeline.
In the previous sections, we have assumed L is lower-triangular, with respect to the decode ordering of the pixels. A generalization is to relax this assumption, to a general matrix A, not necessarily lower-triangular. In this case, the encoding equations would be to solve

ŷ = round(y − μ − Aŷ) + μ + Aŷ,

and send the residuals

ξ̃ = ŷ − μ − Aŷ

to the bitstream, via the PMF P(Ξ = ξ̃_i | 0, σ_i). At decode, after retrieving ξ̃ from the bitstream, the rounded latent is recovered by solving the following linear system for ŷ:

(I − A)ŷ = ξ̃ + μ.
In general, the context functions could be non-linear. For example, the encoding problem would be to solve

ŷ = round(y − f(ŷ)) + f(ŷ),

where f is a non-linear function, which could be learned or predicted, such as a neural network with learned or predicted parameters. The rounded latent ŷ is a fixed point of (55). This equation can be solved with any non-linear equation solver. Then, during encode the residual latents

ξ̃ = ŷ − f(ŷ)

are sent to the bitstream, via the PMF P(Ξ = ξ̃_i | 0, σ_i). At decode, the following non-linear equation is solved for ŷ:

ŷ = ξ̃ + f(ŷ).
One interpretation of this latter extension is as an implicit PixelCNN. For example, if f(·) has a triangular Jacobian (matrix of first derivatives), then (55) models an autoregressive system. However, (55) is more general than this interpretation, indeed it is capable of modelling not just autoregressive systems but any probabilistic system with both forward and backward conditional dependencies in the pixel ordering.
In AI-based image and video compression, autoregressive modelling is a powerful technique for entropy modelling of the latent space. Context models that condition on previously decoded pixels are used in state-of-the-art AI-based compression pipelines. However, the autoregressive ordering in context models is often predefined and rudimentary, such as raster scan ordering, which may impose unwanted biases in the learning. To this end, we propose alternative autoregressive orderings in context models that are either fixed but non-raster scan, conditioned, learned, or directly optimised for.
In mathematical terms, the goal of lossy AI-based compression is to infer a prior probability distribution, the entropy model, which matches as closely as possible the latent distribution that generates the observed data. This can be achieved by training a neural network through an optimisation framework such as gradient descent. Entropy modelling underpins the entire AI-based compression pipeline, where better distribution matching corresponds to better compression performance, characterised by lower reconstruction losses and bitrates.
For image and video data, which exhibit large spatial and temporal redundancy, an autoregressive process termed context modelling can be very helpful to exploit this redundancy in the entropy modelling. At a high level, the general idea is to condition the explanation of subsequent information on existing, available information. The process of conditioning on previous variables to realise the next variable implies an autoregressive information retrieval structure of a certain ordering. This concept has proven to be incredibly powerful in AI-based image and video compression and is commonly part of cutting-edge neural compression architectures.
However, the ordering of the autoregressive structure, the autoregressive ordering (or AO for short), in AI-based image and video compression may be predetermined. These context models often adopt a so-called raster scan order, which naturally follows the data sequence in image data types (3-dimensional; height×width×channels, such as RGB), for example.
Below, we describe a number of AOs that can be fixed or learned, along with a number of distinct frameworks through which these can be formulated. The AO of context modelling can be generalised through these frameworks, which can be optimised for finding the optimal AO of the latent variables. The following concepts are discussed:
An AI-based image and video compression pipeline usually follows an autoencoder structure, which is composed of convolutional neural networks (CNNs) that make up an encoding module and a decoding module, whose parameters can be optimised by training on a dataset of natural-looking images and video. The (observed) data is commonly denoted by x and is assumed to be distributed according to a data distribution p(x). The feature representation after the encoder module is called the latents and is denoted by ŷ. This is what eventually gets entropy coded into a bitstream in encoding, and vice versa in decoding.
The true distribution of the latent space p(y|x) is practically unattainable. This is because the marginalisation of the joint distribution over y and x to compute the data distribution, p(x)=∫p(x|y)p(y)dy, is intractable. Hence, we can only find an approximate representation of this distribution, which is precisely what entropy modelling does.
The true latent distribution of y ∈ ℝ^M can be expressed, without loss of generality, as a joint probability distribution with conditionally dependent variables

p(y) = p(y_1, y_2, …, y_M),

which models the probability density over all sets of realisations of y. Equally, a joint distribution can be factorised into a set of conditional distributions of each individual variable, with an assumed, fixed ordering from i ∈ {1, …, M}:

p(y) = Π_{i=1}^{M} p(y_i | y_{<i}),
where y_{<i} denotes the vector of all latent variables preceding y_i, implying an AO that is executed serially from 1 to M (an M-step AO). However, M is often very large and therefore inferring p(y_i | y_{<i}) at each step is computationally cumbersome. To achieve speedups in the autoregressive process, we can (1) condition each variable only on a subset of its preceding variables, such as a local neighbourhood, or (2) reduce the number of autoregressive steps by treating groups of variables as conditionally independent.
Applying either of the two concepts imposes constraints that invalidate the equivalence of the joint probability and the factorisation into conditional components as described in Equation (59), but is often done to trade off against modelling complexity. The first concept is almost always applied in practice for high-dimensional data, for example in PixelCNN-based context modelling, where only a local receptive field is considered. An example of this process is shown in
The second concept includes the case of assuming a factorised entropy model (no conditioning on random variables, only on deterministic parameters) and a hyperprior entropy model (latent variables are all conditionally independent due to conditioning on a set of hyperlatents, Z). Both of these cases have a 1-step AO, meaning the inference of the joint distribution is executed in a single step.
Below will be described three different frameworks which specify the AO for a serial execution of any autoregressive process for the application in AI-based image and video compression. This may include, but is not limited to, entropy modelling with a context model. Each framework offers a different way of (1) defining the AO and (2) formulating potential optimisation techniques for it.
The data may be assumed to be arranged in 2-D format (a single-channel image or a single-frame, single-channel video) with dimensionality M=H×W where H is the height dimension and W is the width dimension. The concepts presented herein are equally applicable for data with multiple channels and multiple frames.
Graphical models, or more specifically directed acyclic graphs (DAGs), are very useful for describing probability distributions and their conditional dependency structure. A graph is made up of nodes, corresponding to the variables of the distribution, and directed links (arrows), indicating the conditional dependency (the variable at the head of the arrow is conditioned on the variable at the tail). For a visual example, the joint distribution that describes the example in
The main constraint for a directed graph to properly describe a joint probability is that it cannot contain any directed cycles. This means there should be no path that starts at any given node on that path and ends on the same node, hence directed acyclic graphs. The raster scan ordering follows exactly the same structure shown in
A different AO that is less than M-step is the checkerboard ordering. The ordering is visualised for a simple example in
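A small illustration of this ordering (our sketch; names are illustrative) constructs the 2-step checkerboard AO on an H×W latent grid, where "anchor" pixels are decoded unconditionally in step 1 and the complementary set is decoded in step 2, conditioned on the anchors:

```python
import numpy as np

def checkerboard_order(H: int, W: int) -> np.ndarray:
    """Return a step index per pixel: 0 for anchor pixels, 1 for conditioned pixels."""
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    return (ii + jj) % 2

print(checkerboard_order(4, 4))
# [[0 1 0 1]
#  [1 0 1 0]
#  [0 1 0 1]
#  [1 0 1 0]]
```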
Binary mask kernels for autoregressive modelling are a useful framework to specify AOs that are N-step, where N≪M. Given the data y, the binary mask kernel technique entails dividing it up into N lower-resolution subimages {y_1, …, y_N}. Note that whilst previously we defined each pixel as a variable y_i, here we define y_i ∈ ℝ^K as a group of K pixels or variables that are conditionally independent (and are conditioned on jointly for future steps).
A subimage y_i is extracted by convolving the data with a binary mask kernel M_i ∈ {0, 1}^{k×k}. Here, the N kernels together cover each kernel position exactly once, so that Σ_i M_i = 1_{k×k}, where 1_{k×k} denotes the k×k matrix of ones.
It is also possible to represent traditional interlacing schemes with binary mask kernels, such as Adam7 used in PNG.
Raster scan order could also be defined within the binary mask kernel framework, where the kernel size would be H×W; this would mean that it is an H×W = N-step AO, with N binary mask kernels of size H×W that are organised such that the ones are ordered in raster scan.
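The following sketch (ours; a 4-step AO with 2×2 kernels is assumed) extracts the N = 4 lower-resolution subimages. Strided slicing is used as an equivalent to convolving with the four single-one binary kernels at stride 2:

```python
import numpy as np

def subimages_2x2(y: np.ndarray):
    # Equivalent to convolving y at stride 2 with the binary kernels
    # [[1,0],[0,0]], [[0,1],[0,0]], [[0,0],[1,0]], [[0,0],[0,1]].
    return [y[r::2, c::2] for r in (0, 1) for c in (0, 1)]

y = np.arange(16).reshape(4, 4)
for i, sub in enumerate(subimages_2x2(y), start=1):
    print(f"subimage y{i}:\n{sub}")
```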
In summary, binary mask kernels may lend themselves better to gradient descent-based learning techniques, and are related to further concepts regarding autoregressive ordering in frequency space as discussed below.
The ranking table is a third framework for characterising AOs, and under a fixed ranking it is especially effective in describing M-step AOs without the representational complexity of binary mask kernels. The concept of a ranking table is simple: given a quantity q ∈ ℝ^M (flattened and corresponding to the total number of variables), the AO is determined on the basis of a ranking system of the elements of q, q_i, such that the index with the largest q_i gets assigned as y_1, the index with the second largest q_i gets assigned as y_2, and so on. The indexing can be performed using the argsort operator, and the ranking can be either in descending or ascending order, depending on the interpretation of q.
q can be a pre-existing quantity that relays certain information about the source data y, such as the entropy parameters of y (either learned or predicted by a hyperprior), for example the scale parameter σ. In this particular case, we can define an AO by variables that have scale parameters of descending order. This comes with the interpretation that high-uncertainty regions, associated with variables with large scale parameters σ_i, should be unconditional since they carry information not easily retrievable by context. An example visualisation of this process, where the AO is defined as y_1, y_2, … y_16, can be seen in
q can also be derived from pre-existing quantities, such as the first-order or second-order derivative of the location parameter μ. Both of these can be obtained by applying finite-difference methods to obtain gradient vectors (for first-order derivatives) or Hessian matrices (for second-order derivatives), and would be obtained before computing q and argsort(q). The ranking can then be established by the norm of the gradient vector, or norm of the eigenvalues of the Hessian matrix, or any measure of the curvature of the latent image. Alternatively, for second-order derivatives, the ranking can be based on the magnitude of the Laplacian, which is equivalent to the trace of the Hessian matrix.
Lastly, q can also be a separate entity altogether. A fixed q can be arbitrarily pre-defined before training and remain either static or dynamic throughout training, much like a hyperparameter. Alternatively, it can be learned and optimised through gradient descent, or parametrised by a hypernetwork.
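As a concrete illustration of the first of these choices (our sketch; σ is assumed to be predicted, for example, by a hyperprior), the ranking-table AO over descending scale parameters reduces to a single argsort:

```python
import numpy as np

def ranking_table_order(sigma: np.ndarray) -> np.ndarray:
    """Return latent indices ordered so high-uncertainty pixels come first."""
    q = sigma.flatten()
    return np.argsort(-q)          # index of y_1 first, then y_2, ...

sigma = np.array([[0.3, 2.1], [0.9, 0.1]])
print(ranking_table_order(sigma))  # -> [1 2 0 3]
```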
The way we access elements of y depends on whether or not we want gradients to flow through the ranking operator:
y_sort = Py,

where P is the permutation matrix realising the (hard) ranking given by argsort(q).
In the case that the ranking table is optimised through gradient descent-based methods, indexing operators such as argsort or argmax may not be differentiable. Hence, a continuous relaxation of the permutation matrix, P̃, must be used, which can be implemented with the SoftSort operator:

P̃ = softmax(−d(sort(q)1_M^T, 1_M q^T)/τ),
where d is an arbitrary distance metric such as the L1-norm, d(x, y) = |x − y|, 1_M is a vector of ones of length M, and τ > 0 is a temperature parameter controlling the degree of continuity (where lim_{τ→0} P̃ = P, i.e. P̃ approaches the true argsort operator as τ approaches zero). The softmax operator is applied per row, such that each row sums up to one. An example of this is shown in
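A minimal sketch of this relaxation (ours; descending order and the L1-norm are assumed) shows that rows of P̃ approach one-hot rows of the true permutation matrix as τ → 0:

```python
import numpy as np

def softsort(q: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Continuous relaxation of the permutation matrix induced by argsort(q)."""
    q_sorted = np.sort(q)[::-1]                    # descending order
    d = np.abs(q_sorted[:, None] - q[None, :])     # pairwise L1 distances
    logits = -d / tau
    logits -= logits.max(axis=1, keepdims=True)    # stabilise the softmax
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)        # row-wise softmax

q = np.array([0.3, 2.1, 0.9, 0.1])
print(softsort(q, tau=0.1) @ q)   # approximately sorted q: [2.1, 0.9, 0.3, 0.1]
```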
The ranking table concept can be extended to work with binary mask kernels as well. The matrix q will be of the same dimensionality as the mask kernels themselves, and the AO will be specified based on the ranking of the elements in q.
Another possible autoregressive model is one which is defined on a hierarchical transformation of the latent space. In this perspective, the latent is transformed into a hierarchy of variables, where lower hierarchical levels are conditioned on higher hierarchical levels.
This concept can best be illustrated using wavelet decompositions. In a wavelet decomposition, a signal is decomposed into high frequency and low frequency components. This is done via a wavelet operator W. Let us denote a latent image as y^0, of size H×W pixels. We use the superscript 0 to mark that the latent image is at the lowest (or root) level of the hierarchy. Using one application of a wavelet transform, the latent image can be transformed into a set of 4 smaller images y_ll^1, y_lh^1, y_hl^1, and y_hh^1, each of size H//2×W//2. The letters h and l denote high frequency and low frequency components respectively. The first letter in the tuple corresponds to the first spatial dimension (say height) of the image, and the second letter corresponds to the second dimension (say width). So for example y_hl^1 is the wavelet component of the latent image y^0 corresponding to high frequencies in the height dimension, and low frequencies in the width dimension.
In matrix notation we have

[y_ll^1; y_lh^1; y_hl^1; y_hh^1] = W y^0 = [W_ll; W_lh; W_hl; W_hh] y^0.

So one can see that W is a block matrix, comprising 4 rectangular blocks, a block for each of the corresponding frequency decompositions.
Now this procedure can be applied again, recursively on the low-frequency blocks, which constructs a hierarchical tree of decompositions.
Crucially, if the transform matrix W is invertible (and indeed in the case of the wavelet transform W^(−1) = W^T), then the entire procedure can be inverted. Given the last level of the hierarchy, the preceding level's low frequency component can easily be recovered just by applying the inverse transform on the last level. Then, having recovered the next level's low-frequency components, the inverse transform is applied to the second-last level, and so on, until the original image is recovered.
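The following sketch illustrates a two-level hierarchy and its exact inversion, assuming the PyWavelets package; the document's y_ll/y_lh/y_hl/y_hh subbands correspond to PyWavelets' cA/cH/cV/cD coefficients (naming conventions differ between sources):

```python
import numpy as np
import pywt

y0 = np.random.randn(16, 16)

# Level 1: split y0 into four H//2 x W//2 subbands.
cA1, (cH1, cV1, cD1) = pywt.dwt2(y0, "haar")
# Level 2: recurse on the low-frequency block only.
cA2, (cH2, cV2, cD2) = pywt.dwt2(cA1, "haar")

# Inversion walks the hierarchy top-down, exactly recovering y0.
cA1_rec = pywt.idwt2((cA2, (cH2, cV2, cD2)), "haar")
y0_rec = pywt.idwt2((cA1_rec, (cH1, cV1, cD1)), "haar")
assert np.allclose(y0, y0_rec)
```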
Now, how can this hierarchical structure be used to construct an autoregressive ordering? In each hierarchical level, an autoregressive ordering is defined between the elements of that level. For example, refer to the bottom image of
Another DAG is defined between the elements of the next lowest level, and the autoregressive process is applied recursively, until the original latent variable is recovered.
Thus, an autoregressive ordering is defined on the variables given by the levels of the Wavelet transform of an image, using a DAG between elements of the levels of the tree, and the inverse Wavelet transform.
We remark that this procedure can be generalized in several ways:
Example techniques for constrained optimization and rate-distortion annealing are set out in international patent application PCT/GB2021/052770, which is hereby incorporated by reference.
An AI-based compression pipeline tries to minimize the rate (R) and distortion (D). The objective function is:

min A_R R + A_D D,

where minimization is taken over a set of compression algorithms, and A_R and A_D are the scalar coefficients controlling the relative importance of respectively the rate and the distortion to the overall objective.
In international patent application PCT/GB2021/052770, this problem is reformulated as a constrained optimization problem. A method for solving this constrained optimization problem is the Augmented Lagrangian technique as described in PCT/GB2021/052770. The constrained optimization problem is to solve:

min D subject to R = c,
where c is a target compression rate. Note that D and R are averaged over the entire data distribution. Note also that an inequality constraint could also be used. Furthermore, the roles of R and D could be reversed: instead we could minimize rate subject to a distortion constraint (which may be a system of constraint equations).
Typically the constrained optimization problem will be solved using stochastic first-order optimization methods. That is, the objective function will be calculated on a small batch of training samples (not the entire dataset), after which a gradient will be computed. An update step will then be performed, modifying the parameters of the compression algorithm, and possibly other parameters related to the constrained optimization, such as Lagrange multipliers. This process may be iterated many thousands of times, until a suitable convergence criterion has been reached. For example, the following steps may be performed:
However, there are several issues encountered while training a constrained optimization problem in a stochastic small-batch first-order optimization setting. First and foremost, the constraint cannot be computed on the entire dataset at each iteration, and will typically only be computed on the small batch of training samples used at each iteration. Using such a small number of training samples in each batch can make updates to the constrained optimization parameters (such as the Lagrange multipliers in the Augmented Lagrangian) extremely dependent on the current batch, leading to high variance of training updates, suboptimal solutions, or even an unstable optimization routine.
An aggregated average constraint value (such as the average rate) can be computed over N of the previous iteration steps. This has the meritorious effect of carrying constraint information from the last many optimization iteration steps over to the current optimization step, especially with regard to updating parameters related to the constrained optimization algorithm (such as the update to the Lagrange multipliers in the Augmented Lagrangian). A non-exhaustive list of ways to calculate this average over the last N iterations, and apply it in the optimization algorithm, is:
where avg is a generic averaging operator. Examples of an averaging operator are:
Regardless, the aggregated constraint value is computed over the N previous iterations, and is used to update parameters of the training optimization algorithm, such as the Lagrange multipliers.
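As one hedged sketch of this idea (an assumed form, not the patented algorithm; all names, the exponential moving average, and the step sizes are illustrative), the multiplier update can be driven by an aggregated rather than per-batch rate:

```python
class RateConstraintEMA:
    """Augmented-Lagrangian-style multiplier update using an EMA of the rate."""

    def __init__(self, target_rate: float, beta: float = 0.99, lr: float = 1e-3):
        self.c = target_rate       # target rate c
        self.beta = beta           # EMA decay over previous iterations
        self.lr = lr               # multiplier step size
        self.avg_rate = None       # aggregated constraint value
        self.lmbda = 0.0           # Lagrange multiplier

    def update(self, batch_rate: float) -> float:
        if self.avg_rate is None:
            self.avg_rate = batch_rate
        else:
            self.avg_rate = self.beta * self.avg_rate + (1 - self.beta) * batch_rate
        # The multiplier ascends on the aggregated, not per-batch, violation,
        # reducing the variance caused by small batches.
        self.lmbda += self.lr * (self.avg_rate - self.c)
        return self.lmbda
```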
A second problem with stochastic first-order optimization algorithms is that the dataset will contain images with extremely large or extremely small constraint values (such as having either very small or very large rate R). When outliers are present in a dataset, they will force the function we are learning to take them into account, and may create a poor fit for the more common samples. For instance, the update to the parameters of the optimization algorithm (such as the Lagrange multipliers) may have high variance and cause non-optimal training when there are many outliers.
Some of these outliers may be removed from the computation of the constraint average detailed above. Some possible methods for filtering (removing) these outliers would be
Using a constrained optimization algorithm such as the Augmented Lagrangian, we are able to target a specific average constraint (such as rate) target c on the training data set. However, converging to that target constraint on the training set does not guarantee that we will have the same constraint value on the validation set. This may be caused, for example, by differences between the quantization function used in training and the one used in inference (test/validation). For example, it is common to quantize using uniform noise during training, but use rounding in inference (called 'STE'). Ideally the constraint would be satisfied in inference; however, this can be difficult to achieve.
The following methods may be performed:
The techniques described above and set out in international patent application PCT/GB2021/052770 may also be applied in the AI based compression of video. In this case, a Lagrange multiplier may be applied to the rate and distortion associated with each of the frames of the video used for each training step. One or more of these Lagrange multipliers may be optimized using the techniques discussed above. Alternatively, the multipliers may be averaged over a plurality of frames during the training process.
The target value for the Lagrange multipliers may be set to an equal value for each of the frames of the input video used in the training step. Alternatively, different values may be used. For example, a different target may be used for I-frames of the video than for P-frames. A higher target rate may be used for the I-frames compared to the P-frames. The same techniques may also be applied to B-frames.
In a similar manner to image compression, the target rate may initially be set to zero for one or more frames of the video used in training. When a target value is set for distortion, the target value may be set so that the initial weighting is at maximum for the distortion (for example, the target rate may be set at 1).
AI-based compression relies on modeling discrete probability mass functions (PMFs). These PMFs can appear deceptively simple. Our usual mental model begins with one discrete variable X, which can take on D possible values X_1, …, X_D. Then, constructing a PMF P(X) is done simply by making a table where the entries are defined P_i = P(X_i). Of course the P_i's have to be non-negative and sum to 1, but this can be achieved, for example, by using the softmax function. For modeling purposes, it does not seem that hard to learn each of the P_i's in this table to fit a particular data distribution.
What about a PMF over two variables, X and Y, each of which can take on D possible values? This again still seems manageable, in that a 2d table would be needed, with entries P_ij = P(X_i, Y_j). This is slightly more involved; now the table has D^2 entries, but it is still manageable, provided D is not too big. Continuing on, with three variables a 3d table would be needed, with entries P_ijk indexed by a 3-tuple.
However, this naive "build a table" approach quickly becomes unmanageable as soon as we attempt to model any more than a handful of discrete variables. For example, think of modeling a PMF over the space of RGB 1024×1024 images: each pixel can take on 256^3 possible values (each color channel has 256 possible values, and we have 3 color channels). Then the lookup table we'd need has (256^3)^(1024×1024) entries, which is astronomically large.
In an alternative approach, PMFs may be modelled as tensors. A tensor is simply another word for a giant table (but with some extra algebraic properties, not discussed herein). A discrete PMF can always be described as a tensor. For example, a 2-tensor (alternatively referred to as a matrix) is an array with two indices, i.e. a 2d table. So the above PMF P_ij = P(X_i, Y_j) over two discrete variables X and Y is a 2-tensor. An N-tensor T_{i_1…i_N} is, correspondingly, an array indexed by N indices.
The main appeal of this viewpoint is that massive tensors may be modelled using the framework of tensor networks. Tensor networks may be used to approximate a very high dimensional tensor with contractions of several low dimensional (i.e. tractable) tensors. That is, tensor networks may be used to perform a low-rank approximation of otherwise intractable tensors.
For example, if we view matrices as 2-tensors, standard low-rank approximations (such as singular value decomposition (SVD) and principal component analysis (PCA)) are tensor network factorizations. Tensor networks are generalizations of the low-rank approximations used in linear algebra to multilinear maps. An example of the use of tensor networks in probabilistic modeling for machine learning is shown in "Ivan Glasser, Ryan Sweke, Nicola Pancotti, Jens Eisert, and J Ignacio Cirac. Expressive power of tensor-network factorizations for probabilistic modeling, with applications from hidden markov models to quantum machine learning. arXiv preprint, arXiv:1907.03741, 2019", which is hereby incorporated by reference.
Tensor networks may be considered an alternative to a graphical model. There is a correspondence between tensor networks and graphical models: any probabilistic graphical model can be recast as a tensor network, however the reverse is not true. There exist tensor networks for joint density modelling that cannot be recast as probabilistic graphical models, yet have strong performance guarantees, and are computationally tractable. In many circumstances tensor networks are more expressive than traditional probabilistic graphical models like HMMs:
All other modeling assumptions being equal, tensor networks may be preferred over HMMs.
An intuitive explanation for this result is that probabilistic graphical models factor the joint via their conditional probabilities, which are usually constrained to be positive by only considering exponential maps p(X=x_i|Y) ∝ exp(−f(x_i)). This amounts to modeling the joint as a Boltzmann/Gibbs distribution. This may in fact be a restrictive modeling assumption. A completely alternative approach offered by tensor networks is to model the joint as an inner product: p(X) ∝ ⟨X, HX⟩ for some Hermitian positive (semi-)definite operator H. (This modeling approach is inspired by the Born rule of quantum systems.) The operator H can be written as a giant tensor (or tensor network). Crucially, the entries of H can be complex. It is not at all obvious how (or even if) this could be translated into a graphical model. It does however present a completely different modeling perspective, otherwise unavailable.
Let us illustrate what a tensor network decomposition is through a simple example. Suppose we have a large D×D matrix T (a 2-tensor), with entries T_ij, and we want to make a low-rank approximation of T, say a rank-r approximation, with r < D. One way to do this is to find an approximation T̂, with entries

T̂_ij = Σ_{α=1}^{r} A_iα B_αj.
In other words, we're saying T̂ = AB, where A is a D×r matrix and B is an r×D matrix. We have introduced a hidden dimension, shared between A and B, which is to be summed over. This can be quite useful in modeling: rather than dealing with a giant D×D matrix, if we set r very small, we can save on a large amount of computing time or power by going from D^2 parameters to 2Dr parameters. Moreover, in many modeling situations, r can be very small while still yielding a "good enough" approximation of T.
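A worked example of this rank-r approximation (our sketch; the sizes D and r are illustrative) uses a truncated SVD, which gives the best rank-r approximation in Frobenius norm:

```python
import numpy as np

D, r = 64, 4
T = np.random.randn(D, D)
U, s, Vt = np.linalg.svd(T, full_matrices=False)
A = U[:, :r] * s[:r]             # D x r matrix
B = Vt[:r, :]                    # r x D matrix
T_hat = A @ B                    # rank-r approximation of T
print(T.size, A.size + B.size)   # 4096 parameters vs. 512
```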
Let's now model a 3-tensor, following the same approach. Suppose we're given a D×D×D tensor T, with entries T_ijk. One way to approximate T is with the following decomposition

T̂_ijk = Σ_{α=1}^{r} Σ_{β=1}^{r} A_iα B_αjβ C_βk.
Here A and C are low-rank matrices, and B is a low-rank 3-tensor. There are now two hidden dimensions to be summed over: one between A and B, and one between B and C. In tensor network parlance, these hidden dimensions may be called the bond dimension. Summing over a dimension may be called a contraction.
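This contraction can be written compactly (our sketch; sizes are illustrative) as a single einsum, where the two bond dimensions of size r are summed away:

```python
import numpy as np

D, r = 16, 3
A = np.random.randn(D, r)        # D x r matrix
B = np.random.randn(r, D, r)     # low-rank 3-tensor
C = np.random.randn(r, D)        # r x D matrix

# T_hat[i,j,k] = sum over bonds a,b of A[i,a] * B[a,j,b] * C[b,k]
T_hat = np.einsum("ia,ajb,bk->ijk", A, B, C)
print(T_hat.shape)               # (16, 16, 16), from far fewer parameters than D**3
```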
This example can be continued, approximating a 4-tensor as a product of lower dimensional tensors, but the indexing notation quickly becomes cumbersome to write down. Instead, we will use tensor network diagrams, a concise way of diagrammatically conveying the same calculations.
In a tensor network diagram, tensors are represented by blocks, and each indexing dimension is represented as an arm, as shown in
We can represent the tensor decomposition of the 3-tensor T̂ given by equation (69) diagrammatically, as seen in the top row of
Armed with this notation, we can now delve into some possible tensor-network factorizations used for probabilistic modeling. The key idea is that the true joint distribution for a high-dimensional PMF is intractable. We must approximate it, and will do so using tensor-network factorizations. These tensor network factorizations can then be learned to fit training data. Not all tensor network factorizations will be appropriate. It may be necessary to constrain entries of the tensor network to be non-negative and to sum to 1.
An example of such an approach is the use of a Matrix Product State (MPS) (sometimes also called a Tensor Train). Suppose we want to model a PMF P(X_1, …, X_N) as a tensor T̂_{i_1…i_N}. The MPS factorizes this tensor into a chain of low-dimensional 3-tensors (cores), contracted along shared bond dimensions.
Graphically as a tensor network diagram, this can be seen in the bottom row of
To ensure the entries sum to 1, a normalization constant is computed by summing over all possible states. Though computing this normalization constant for a general N-tensor may be impractical, conveniently for an MPS, due to its linear nature, the normalization constant can be computed in O(N) time. Here by "linear nature" we mean that the tensor products can be performed sequentially one-by-one, operating down the line of the tensor train. (Both tensors and their tensor network approximations are multilinear functions.)
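A sketch of this O(N) computation for an MPS with non-negative entries (our assumptions: cores of shape (bond_in, D, bond_out), boundary bonds of size one): each core is summed over its physical index, and the resulting bond matrices are multiplied down the train:

```python
import numpy as np

def mps_normalizer(cores) -> float:
    """Normalization constant of a non-negative MPS: sum of T_hat over all states."""
    acc = np.ones((1, 1))
    for G in cores:                    # G has shape (r_in, D, r_out)
        acc = acc @ G.sum(axis=1)      # marginalise the physical index, keep bonds
    return float(acc[0, 0])

cores = [np.random.rand(1, 5, 3), np.random.rand(3, 5, 3), np.random.rand(3, 5, 1)]
print(mps_normalizer(cores))
```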
An MPS appears quite a lot like a Hidden Markov Model (HMM). In fact, there is indeed a correspondence: an MPS with positive entries corresponds exactly to an HMM.
Further examples of tensor network models are Born Machines and Locally Purified States (LPS). Both are inspired by models arising in quantum systems. Quantum systems assume the Born rule, which says that the probability of an event X occurring is proportional to its squared norm under an inner product ⟨·, H·⟩, with some positive (semi-)definite Hermitian operator H. In other words, the joint probability is a quadratic function. This is a powerful probabilistic modeling framework that has no obvious connection to graphical models.
Locally Purified State (LPS) takes the form depicted in
The elements of T̂ are guaranteed to be positive, by virtue of the fact that contraction along the purification dimension yields positive values (for a complex number z, z·z̄ = |z|^2 ≥ 0).
As in the MPS, computing the normalization constant of an LPS is fast and can be done in O(N) time. A Born Machine is a special case of an LPS, when the size of the purification dimensions is one.
Tensor trees are another example type of tensor network. At the leaves of the tree, dangling arms are to be contracted with data. However, the hidden dimensions are arranged in a tree, where nodes of the tree store tensors. Edges of the tree are dimensions of the tensors to be contracted. A simple Tensor Tree is depicted in
Note that a tensor tree can be combined with the framework of the Locally Purified State: a purification dimension could be added to each tensor node, to be contracted with the complex conjugate of that node. This would then define an inner product according to some Hermitian operator given by the tensor tree and its complex conjugate.
Another example tensor network is the Projected Entangled Pair States (PEPS). In this tensor network, tensor nodes are arranged in a regular grid, and are contracted with their immediate neighbours. Each tensor has an additional dangling arm (free index) which is to be contracted with data (such as latent pixel values). In a certain sense, PEPS draws a similarity to Markov Random Fields and the Ising Model. A simple example of PEPS on a 2×2 image patch is given in
Tensor network calculations (such as computing the joint probability of a PMF, conditional probabilities, marginal probabilities, or calculating the entropy of a PMF) can be massively simplified, and greatly sped up, by putting a tensor into canonical form, as discussed in greater detail below. All of the tensors networks discussed above can be placed into a canonical form.
Because the basis in which hidden dimensions are represented is not fixed (so called gauge-freedom), we can simply change the basis in which these tensors are represented. For example, when a tensor network is placed in canonical form, almost all the tensors can be transformed into orthonormal (unitary) matrices.
This can be done by performing a sequential set of decompositions on the tensors in the tensor network. These decompositions include the QR decomposition (and its variants RQ, QL, and LQ), the SVD decomposition, the spectral decomposition (if it is available), the Schur decomposition, the QZ decomposition, and Takagi's decomposition, among others. The procedure of writing a tensor network in canonical form works by decomposing each of the tensors into an orthonormal (unitary) component and another factor. The other factor is contracted with a neighbouring tensor, modifying the neighbouring tensor. Then, the same procedure is applied to the neighbouring tensor and its neighbours, and so on, until all but one of the tensors is orthonormal (unitary).
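A minimal sketch of this sweep for a tensor train (our assumptions: real-valued cores of shape (bond_in, D, bond_out); the QR decomposition is used): each core is made orthonormal and its R factor is absorbed into the right neighbour:

```python
import numpy as np

def left_canonicalize(cores):
    """QR sweep: all but the last core become orthonormal (left-canonical form)."""
    cores = [c.copy() for c in cores]
    for i in range(len(cores) - 1):
        r_in, D, r_out = cores[i].shape
        Q, R = np.linalg.qr(cores[i].reshape(r_in * D, r_out))
        cores[i] = Q.reshape(r_in, D, Q.shape[1])       # orthonormal component
        # Absorb the remaining factor R into the neighbouring core.
        cores[i + 1] = np.einsum("ab,bjc->ajc", R, cores[i + 1])
    return cores
```

The final, non-orthonormal core produced by such a sweep plays the role of the core tensor discussed next.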
The remaining tensor which is not orthonormal (unitary) may be called the core tensor. The core tensor is analogous to the diagonal matrix of singular values in an SVD decomposition, and contains spectral information about the tensor network. The core tensor can be used to calculate, for instance, normalizing constants of the tensor network, or the entropy of the tensor network.
The use of tensor networks for probabilistic modeling in AI-based image and video compression will now be discussed in more detail. As discussed above, in an AI-based compression pipeline, an input image (or video) x is mapped to a latent variable y, via an encoding function (typically a neural network). The latent variable y is quantized to integer values ŷ, using a quantization function Q. These quantized latents are converted to a bitstream using a lossless encoding method such as entropy encoding as discussed above. Arithmetic encoding or decoding is an example of such an encoding process and will be used as an example in further discussion.
This lossless encoding process is where the probabilistic model is required: the arithmetic encoder/decoder requires a probability mass function q(ŷ) to convert integer values into the bitstream. On decode, similarly the PMF is used to turn the bitstream back into quantized latents, which are then fed through a decoder function (also typically a neural network), which returns the reconstructed image x̂.
The size of the bitstream (the compression rate) is determined largely by the quality of the probability (entropy) model. A better, more powerful, probability model results in smaller bitstreams for the same quality of reconstructed image.
The arithmetic encoder typically operates on one-dimensional PMFs. To incorporate this modeling constraint, typically the joint PMF q(ŷ) is assumed to be independent, so that each of the pixels ŷ_i is modeled by a one-dimensional probability distribution q(ŷ_i | θ_i). Then the joint density is modeled as

q(ŷ) = Π_{i=1}^{M} q(ŷ_i | θ_i),
where M is the number of pixels. The parameters θi control the one-dimensional distribution at pixel i. As discussed above, often the parameters θ may be predicted by a hyper-network (containing a hyper-encoder and hyper-decoder). Alternately or additionally, the parameters may be predicted by a context-model, which uses previously decoded pixels as an input.
Either way, fundamentally this modeling approach assumes a one-dimensional distribution on each of the ŷi pixels. This may be restrictive. A superior approach can be to model the joint distribution entirely. Then, when encoding or decoding the bitstream, the necessary one-dimensional distributions needed for the arithmetic encoder/decoder can be computed as conditional probabilities.
Tensor networks may be used for modeling the joint distribution. This can be done as follows. Suppose we are given a quantized latent ŷ = {ŷ_1, ŷ_2, …, ŷ_M}. Each latent pixel will be embedded (or lifted) into a high dimensional space. In this high dimensional space, integers are represented by vectors lying on the vertices of a probability simplex. For example, suppose we quantize y_i to D possible integer values {−D//2, −D//2+1, …, −1, 0, 1, …, D//2−1, D//2}. The embedding maps ŷ_i to a D-dimensional one-hot vector, with a one in the slot corresponding to the integer value, and zeros everywhere else.
For example, suppose each ŷ_i can take on values {−3, −2, −1, 0, 1, 2, 3}, and ŷ_i = −1. Then the embedding is e(ŷ_i) = (0, 0, 1, 0, 0, 0, 0).
Thus, the embedding maps ŷ = {ŷ_1, ŷ_2, …, ŷ_M} to e(ŷ) = {e(ŷ_1), e(ŷ_2), …, e(ŷ_M)}. In effect this takes ŷ, living in an M-dimensional space, and maps it to a D·M-dimensional space.
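For illustration, the one-hot embedding for D = 7 can be written as follows (our sketch; names are illustrative):

```python
import numpy as np

def embed(y_hat: np.ndarray, D: int = 7) -> np.ndarray:
    """Map integer latents in {-D//2, ..., D//2} to D-dimensional one-hot vectors."""
    lo = -(D // 2)
    idx = (y_hat - lo).astype(int)            # -3 -> slot 0, ..., 3 -> slot 6
    e = np.zeros((y_hat.size, D))
    e[np.arange(y_hat.size), idx] = 1.0
    return e

print(embed(np.array([-1]), D=7))             # [[0. 0. 1. 0. 0. 0. 0.]]
```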
Now, each of these entries in the embedding can be viewed as dimensions indexing a high-dimensional tensor. Thus, the approach we will take is to model the joint probability density via a tensor network T̂. For example, we could model the joint density as

q(ŷ) ∝ ⟨e(ŷ), H e(ŷ)⟩,
where H is a Hermitian operator modeled via a tensor network (as described above). Any tensor network with tractable inference can be used here, such as Tensor Trees, Locally Purified States, Born Machines, Matrix Product States, Projected Entangled Pair States, or any other tensor network.
At encode/decode, the joint probability cannot be used by the arithmetic encoder/decoder. Instead, one-dimensional distributions must be used. To calculate the one-dimensional distribution, conditional probabilities may be used.
Conveniently, conditional probabilities are easily computed by marginalizing out hidden variables, fixing prior conditional variables, and normalizing. All of these can be done tractably using tensor networks.
For example, suppose we encode/decode in raster-scan order. Then, pixel-by-pixel, we will need the following conditional probabilities: q(ŷ_1), q(ŷ_2 | ŷ_1), …, q(ŷ_M | ŷ_{M−1}, …, ŷ_1). Each of these conditional probabilities can be computed tractably by contracting the tensor network over the hidden (unseen) variables, fixing the index of the conditioning variable, and normalizing by an appropriate normalization constant.
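The following sketch illustrates the marginalise/fix/normalise pattern on a small, explicitly materialised joint (for illustration only; with a tensor network the same operations are performed by contractions, without ever forming the full joint):

```python
import numpy as np

D = 4
joint = np.random.rand(D, D, D)
joint /= joint.sum()                      # toy joint q(y1, y2, y3)

q1 = joint.sum(axis=(1, 2))               # q(y1): marginalise y2 and y3

y1 = 2                                    # suppose y1 = 2 has been decoded
q2_given_1 = joint[y1].sum(axis=1)        # fix y1, marginalise y3
q2_given_1 /= q2_given_1.sum()            # normalise to obtain q(y2 | y1)
```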
If the tensor network is in canonical form, this is an especially fast procedure, for in this case contraction along the hidden dimension is equivalent to multiplication with the identity.
The tensor network can be applied to joint probabilistic modeling of the PMF across all latent pixels, or patches of latent pixels, or modeling joint probabilities across channels of the latent representation, or any combination thereof.
Joint probabilistic modeling with a tensor network can be readily incorporated into an AI-based compression pipeline, as follows. The tensor network could be learned during end-to-end training, and then fixed post-training. Alternately, the tensor network, or components thereof, could be predicted by a hyper-network. A tensor network may additionally or alternatively be used for entropy encoding and decoding the hyper-latent in the hyper-network. In this case, the parameters of the tensor network used for entropy encoding and decoding the hyper-latent could be learned during end-to-end training, and then fixed post-training.
For instance, a hyper-network could predict the core tensor of a tensor network, on a patch-by-patch basis. In this scenario, the core tensor varies across pixel-patches, but the remaining tensors are learned and fixed across pixel patches. For example, see
Rather than (or possibly in conjunction with) using a hyper-network to predict tensor network components, parts of the tensor network may be predicted using a context module which uses previously decoded latent pixels.
During training of the AI-based compression pipeline with a tensor network probability model, the tensor network can be trained on non-integer valued latents (y rather than ŷ = Q(y), where Q is a quantization function). To do so, the embedding functions e can be defined on non-integer values. For example, the embedding function could comprise tent functions, which take on the value of 1 at the appropriate integer value, zero at all other integers, and interpolate linearly in between. This then performs multi-linear interpolation. Any other real-valued extension to the embedding scheme could be used, so long as the extension agrees with the original embedding on integer valued points.
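A sketch of such a tent-function extension (ours; D = 7 and the clipping at the boundary slots are assumptions): mass is split linearly between the two neighbouring integer slots, so the extension agrees with the one-hot embedding on integers:

```python
import numpy as np

def tent_embed(y: float, D: int = 7) -> np.ndarray:
    """Real-valued extension of the one-hot embedding via tent functions."""
    lo = -(D // 2)
    e = np.zeros(D)
    f = int(np.floor(y))
    w = y - f                                   # fractional part in [0, 1)
    e[np.clip(f - lo, 0, D - 1)] += 1.0 - w     # weight on the lower integer slot
    e[np.clip(f + 1 - lo, 0, D - 1)] += w       # weight on the upper integer slot
    return e

print(tent_embed(-1.0))    # [0. 0. 1. 0. 0. 0. 0.]   (matches the one-hot case)
print(tent_embed(-0.75))   # [0. 0. 0.75 0.25 0. 0. 0.]
```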
The performance of the tensor network entropy model may be enhanced by some forms of regularization during training. For example, entropy regularization could be used. In this case, the entropy H(q) of the tensor network could be calculated, and a multiple of this could be added or subtracted to the training loss function. Note that the entropy of a tensor network in canonical form can be easily calculated by computing the entropy of the core tensor.
Training techniques for an auxiliary hyperhyperprior, and their functionality and scope, for use in, but not limited to, image and video data compression based on AI and deep learning, will be discussed below.
A commonly adopted network configuration for AI-based image and video compression is the autoencoder. It consists of an encoder module that transforms the input data into "latents" (y), an alternative representation of the input data, often modelled as a set of pixels, and a decoder module that takes the set of latents and transforms it back into the input data (or as close an approximation as possible). Because of the high-dimensional nature of the "latent space", where each latent pixel represents a dimension, we "fit" a parametric distribution onto the latent space with a so-called "entropy model" p(ŷ). The entropy model is used to convert ŷ into a bitstream using a lossless arithmetic encoder. The parameters for the entropy model ("entropy parameters") are learned internally within the network. The entropy model can either be learned directly or predicted via a hyperprior structure. An illustration of this structure can be found in
The entropy parameters most commonly comprise a location parameter and a scale parameter (which is often expressed as a positive real value), such as (but not limited to) the mean μ and standard deviation σ for a Gaussian distribution, and the mean μ and scale b for a Laplacian distribution. Naturally, there exist many more distribution types, both parametric and non-parametric, with a large variety of parameter types.
A hyperprior structure predicts the parameters of the entropy model from a quantized hyperlatent,

(μ_Y, σ_Y) = h_D(Ẑ; θ), Ẑ = Q(Z),

where all symbols with a ∧ represent the quantized versions of the original, Q is a quantization function, h_D is the set of transformations, θ represents the parameters of said transformations, and Ẑ ∼ p(Z | μ_Z, σ_Z). This structure is used to help the entropy model capture dependencies it cannot capture by itself. Since a hyperprior predicts the parameters of an entropy model, we can now learn the model of the hyperprior, which can comprise the same types of parametric distributions as the entropy model. Just as we add a hyperprior to the entropy model, we can add a hyperprior, with latents w, to the hyperprior to predict its parameters, referred to as a hyperhyperprior. So instead of Ẑ ∼ p(Z | μ_Z, σ_Z), now we have Ẑ ∼ p(Z | μ_Z(Ŵ), σ_Z(Ŵ)). Further hyperpriors can be added to the model.
A hyperhyperprior can be added to an already trained model with a hyperprior, which may be referred to as an auxiliary hyperhyperprior. This technique is applied in particular, but not exclusively, to improve the model's capacity to capture low frequency features, and thus improve performance on "low-rate" images. Low frequency features are present in an image if there are no abrupt color changes along its axes. Thus, an extreme example of an image consisting entirely of low frequency features would be an image with only one color throughout. We can find out the amount of low frequency features an image has by extracting its power spectrum.
Hyperhyperpriors may be trained jointly with the hyperprior and the entropy parameters. However, using a hyperhyperprior on all images may be computationally expensive. In order to maintain performance on non-low-rate images, but still give the network the capacity to model low frequency features, we may adopt an auxiliary hyperhyperprior which is used only when an image fits a predetermined criterion, such as being low-rate. An example of a low-rate image is one whose bits-per-pixel (bpp) is roughly below 0.1. An example of this is shown in Algorithm 3.
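A hedged sketch of this routing (cf. Algorithm 3; the module interface, the helper names, and the 0.1 bpp threshold are all illustrative assumptions, not the patented implementation):

```python
LOW_RATE_BPP = 0.1   # assumed threshold for classifying an image as low-rate

def entropy_parameters(image, hyperprior, hyperhyperprior):
    """Engage the auxiliary hyperhyperprior only for low-rate images."""
    z_hat = hyperprior.encode(image)
    bpp = hyperprior.estimate_bpp(z_hat, image)
    if bpp < LOW_RATE_BPP:
        # Low-rate image: predict the hyperprior's own entropy parameters from
        # the hyperhyperlatent w_hat, and flag this choice in the bitstream.
        w_hat = hyperhyperprior.encode(z_hat)
        mu_z, sigma_z = hyperhyperprior.decode(w_hat)
        return hyperprior.decode(z_hat), (mu_z, sigma_z), True
    # Otherwise, fall back to the ordinary hyperprior path.
    return hyperprior.decode(z_hat), None, False
```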
The auxiliary hyperhyperprior framework allows the model to be adjusted only when required. Once trained, we can encode into the bitstream a flag signaling that this specific image needs a hyperhyperprior. This approach can be generalized to an arbitrary number of further components of the entropy model, such as a hyperhyperhyperprior.
The most direct way of training our hyperhyperprior is to "freeze" the existing pre-trained hyperprior network, including the encoder and decoder, and only optimize the weights of the hyperhyper modules. In this document, when we refer to "freeze", it means that the weights of the frozen modules are not trained and do not accumulate gradients used to train the non-frozen modules. By freezing the existing entropy model, the hyperhyperprior may modify the hyperprior's parameters, like μ and σ in the case of a normal distribution, in such a way that it is more biased towards low-rate images.
Using this training scheme provides several benefits:
A possible implementation is to initially let the hyperprior network train for N iterations. Once N iterations are reached, we may freeze the entropy model and switch to the hyperhyperprior if an image has a low rate. This allows the hyperprior model to specialize on the images it already performs well on, while the hyperhyperprior works as intended. Algorithm 4 illustrates the training scheme. This training scheme can also be used to train only on the low-frequency regions of an image, if it has them, by splitting the image into K blocks of size N×N and then applying this scheme on those blocks.
Another possibility is not to wait N iterations to start training the hyperhyperprior, as shown in Algorithm 5.
There are different criteria to choose from to classify an image as low-rate, including: using the rate calculated with the distribution we chose as a prior; using the mean or median value of the power spectrum of an image; or using the mean or median value of the frequencies obtained by a fast Fourier transform.
Data augmentation can be used to create more samples with low frequency features that are related to low-rate images to create sufficient data. There are different ways images can be modified:
In addition to upsampling or blurring the images, a random crop may also be performed.
Number | Date | Country | Kind
---|---|---|---
2111188.5 | Aug 2021 | GB | national

Filing Document | Filing Date | Country | Kind
---|---|---|---
PCT/EP2022/071858 | 8/3/2022 | WO |