Video content represents the majority of total Internet traffic and is expected to increase even more as the spatial resolution, frame rate, and color depth of videos increase and more users adopt streaming services. Although existing codecs have achieved impressive performance, they have been engineered to the point where adding further small improvements is unlikely to meet future demands. Consequently, exploring fundamentally different ways to perform video coding may advantageously lead to a new class of video codecs with improved performance and flexibility.
For example, one advantage of using a trained machine learning (ML) model, such as a neural network (NN) in the form of a generative adversarial network (GAN), for example, to perform video compression is that it enables the ML model to infer visual details that would otherwise be costly, in terms of data transmission, to obtain. However, training ML models such as GANs is typically challenging because the training alternates between minimization and maximization steps to converge to a saddle point of the loss function. The task becomes more challenging when considering the temporal domain and the increased complexity it introduces.
The following description contains specific information pertaining to implementations in the present disclosure. One skilled in the art will recognize that the present disclosure may be implemented in a manner different from that specifically discussed herein. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.
As noted above, video content represents the majority of total Internet traffic and is expected to increase even more as the spatial resolution, frame rate, and color depth of videos increase and more users adopt streaming services. Although existing codecs have achieved impressive performance, they have been engineered to the point where adding further small improvements is unlikely to meet future demands. Consequently, exploring fundamentally different ways to perform video coding may advantageously lead to a new class of video codecs with improved performance and flexibility.
For example, and as further noted above, one advantage of using a trained machine learning (ML) model, such as a neural network (NN) in the form of a generative adversarial network (GAN), for example, to perform video compression is that it enables the ML model to infer visual details that would otherwise be costly, in terms of data transmission, to obtain. However, training ML models such as GANs is typically challenging because the training alternates between minimization and maximization steps to converge to a saddle point of the loss function. The task becomes even more challenging when considering the temporal domain and the increased complexity it introduces, if only because of the increased amount of data.
The present application discloses a framework based on knowledge distillation and latent space residuals that uses any adversarially trained image compression ML model as a basis to build a video compression codec having a hallucination capacity similar to that of a trained GAN, which is particularly important when targeting low bit-rate video compression. The images resulting from the present ML model-based video compression solution are visually pleasing without requiring a high bit-rate. Some image details synthesized when using an ML model-based video codec may look realistic while deviating slightly from the ground truth. Nevertheless, the present ML model-based video compression solution is capable of providing image quality that would be impossible using the same amount of transmitted data in conventional approaches. Moreover, in some implementations, the present ML model-based video compression solution can be implemented as substantially automated systems and methods.
It is noted that, as used in the present application, the terms “automation,” “automated,” and “automating” refer to systems and processes that do not require the participation of a human user, such as a human editor or system administrator. Although, in some implementations, a human system administrator may review the performance of the automated systems operating according to the automated processes described herein, that human involvement is optional. Thus, the processes described in the present application may be performed under the control of hardware processing components of the disclosed systems.
It is further noted that, as defined in the present application, the expression “machine learning model” (hereinafter “ML model”) refers to a mathematical model for making future predictions based on patterns learned from samples of data obtained from a set of trusted known matches and known mismatches, known as training data. Various learning algorithms can be used to map correlations between input data and output data. These correlations form the mathematical model that can be used to make future predictions on new input data. Such a predictive model may include one or more logistic regression models, Bayesian models, or NNs, for example. In addition, machine learning models may be designed to progressively improve their performance of a specific task.
A “deep neural network” (deep NN), in the context of deep learning, may refer to an NN that utilizes multiple hidden layers between input and output layers, which may allow for learning based on features not explicitly defined in raw data. As used in the present application, a feature labeled as an NN refers to a deep neural network. In various implementations, NNs may be utilized to perform image processing or natural-language processing. Although the present novel and inventive principles are described below by reference to an exemplary NN class known as GANs, that characterization is provided merely in the interests of conceptual clarity. More generally, the present ML model-based video compression solution may be implemented using other types of ML models, and may be particularly advantageous when used with ML models that are onerous, expensive, or time consuming to train.
As further shown in
Although the present application refers to ML model-based codec software resources 130 as being stored in system memory 106 for conceptual clarity, more generally system memory 106 may take the form of any computer-readable non-transitory storage medium. The expression “computer-readable non-transitory storage medium,” as used in the present application, refers to any medium, excluding a carrier wave or other transitory signal, that provides instructions to processing hardware 104 of computing platform 102. Thus, a computer-readable non-transitory storage medium may correspond to various types of media, such as volatile media and non-volatile media, for example. Volatile media may include dynamic memory, such as dynamic random access memory (dynamic RAM), while non-volatile memory may include optical, magnetic, or electrostatic storage devices. Common forms of computer-readable non-transitory storage media include, for example, optical discs, RAM, programmable read-only memory (PROM), erasable PROM (EPROM), and FLASH memory.
Moreover, although
Processing hardware 104 may include multiple hardware processing units, such as one or more central processing units, one or more graphics processing units, one or more tensor processing units, one or more field-programmable gate arrays (FPGAs), custom hardware for machine-learning training or inferencing, and an application programming interface (API) server, for example. By way of definition, as used in the present application, the terms “central processing unit” (CPU), “graphics processing unit” (GPU), and “tensor processing unit” (TPU) have their customary meaning in the art. That is to say, a CPU includes an Arithmetic Logic Unit (ALU) for carrying out the arithmetic and logical operations of computing platform 102, as well as a Control Unit (CU) for retrieving programs, such as ML model-based codec software resources 130, from system memory 106, while a GPU may be implemented to reduce the processing overhead of the CPU by performing computationally intensive graphics or other processing tasks. A TPU is an application-specific integrated circuit (ASIC) configured specifically for artificial intelligence (AI) processes such as machine learning.
In some implementations, computing platform 102 may correspond to one or more web servers, accessible over communication network 110 in the form of a packet-switched network such as the Internet, for example. Moreover, in some implementations, communication network 110 may be a high-speed network suitable for high performance computing (HPC), for example a 10 GigE network or an Infiniband network. In some implementations, computing platform 102 may correspond to one or more computer servers supporting a private wide area network (WAN), local area network (LAN), or included in another type of limited distribution or private network. As yet another alternative, in some implementations, system 100 may be implemented virtually, such as in a data center. For example, in some implementations, system 100 may be implemented in software, or as virtual machines.
According to the implementation shown by
With respect to display 122 of user system 120, display 122 may be physically integrated with user system 120 or may be communicatively coupled to but physically separate from user system 120. For example, where user system 120 is implemented as a smartphone, laptop computer, or tablet computer, display 122 will typically be integrated with user system 120. By contrast, where user system 120 is implemented as a desktop computer, display 122 may take the form of a monitor separate from user system 120 in the form of a computer tower. Moreover, display 122 may take the form of a liquid crystal display (LCD), a light-emitting diode (LED) display, an organic light-emitting diode (OLED) display, a quantum dot (QD) display, or a display using any other suitable display technology that performs a physical transformation of signals to light.
ML model-based codec software resources 130 are further described below by reference to
According to the exemplary implementation shown in
Image compression can formally be expressed as minimizing the expected length of the bitstream as well as the expected distortion of the reconstructed image compared to the original, formulated as optimizing the following rate-distortion objective function:
$$\mathcal{L}_{g,p_{\hat{y}}} = \mathbb{E}_{x \sim p_x}\!\left[\, -\log_2 p_{\hat{y}}(\hat{y}) \;+\; \lambda\, d(x, \hat{x}) \,\right] \qquad \text{(Equation 1)}$$

where $-\log_2 p_{\hat{y}}(\hat{y})$ is the rate term and $d(x, \hat{x})$ is the distortion term.
It is noted that, in the notation used in Equation 1, the parameters of $g$ include those of $g^{-1}$. Here $d$ indicates a distortion measure and can include a combination of the $\ell_2$ loss, the structural similarity index measure (SSIM), the learned perceptual image patch similarity (LPIPS), and the like. The rate corresponds to the length of the bitstream needed to encode the quantized representation $\hat{y}$, based on a learned entropy model $p_{\hat{y}}$ over the unknown distribution of natural images $p_x$. By reducing the weight $\lambda$, better compression can be achieved at the cost of larger distortion on the reconstructed image.
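Purely by way of illustration, the following sketch (in Python with PyTorch) shows one way the rate and distortion terms of Equation 1 might be assembled in practice; the `encoder`, `decoder`, `quantize`, and `entropy_model` components are hypothetical stand-ins, not elements of the present disclosure, and a simple mean squared error stands in for the combination of distortion measures mentioned above.

```python
import torch
import torch.nn.functional as F

def rate_distortion_loss(x, encoder, decoder, quantize, entropy_model, lam=0.01):
    """Sketch of the Equation 1 objective: rate + lambda * distortion.

    encoder/decoder correspond to g and its inverse g^-1, quantize rounds the
    latent, and entropy_model returns the element-wise likelihoods p_y_hat(y_hat).
    All four modules (and lam) are assumptions for illustration only.
    """
    y = encoder(x)                      # latent representation y = g(x)
    y_hat = quantize(y)                 # quantized latent y_hat
    x_hat = decoder(y_hat)              # reconstruction x_hat = g^-1(y_hat)

    # Rate term: -log2 p_y_hat(y_hat), summed over the latent, averaged per image.
    likelihoods = entropy_model(y_hat)
    rate = -torch.log2(likelihoods).sum() / x.shape[0]

    # Distortion term d(x, x_hat); MSE here, but SSIM or LPIPS could be combined.
    distortion = F.mse_loss(x_hat, x)

    return rate + lam * distortion
```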
According to one implementation of the present novel and inventive concepts, the neural image compression formulation described above can be augmented with an ML model in the form of a conditional GAN. In such a case of adversarial training, $D$ is denoted as the discriminator neural network that learns to distinguish between the ground truth $x$ and the decoded images $\hat{x}$, conditioned on the latent representation $\hat{y}$:
$$\mathcal{L}_{D} = \mathbb{E}_{x \sim p_x}\!\left[\, -\log\!\big(1 - D(\hat{x}, \hat{y})\big) \;-\; \log D(x, \hat{y}) \,\right] \qquad \text{(Equation 2)}$$
The training of the discriminator is alternated with the training of image compression ML model 232, in which case the rate-distortion objective augmented with the adversarial loss is optimized:
$$\mathcal{L}_{g,p_{\hat{y}}} = \mathbb{E}_{x \sim p_x}\!\left[\, -\log_2 p_{\hat{y}}(\hat{y}) \;+\; \lambda\, d(x, \hat{x}) \;-\; \log D(\hat{x}, \hat{y}) \,\right] \qquad \text{(Equation 3)}$$

where $-\log_2 p_{\hat{y}}(\hat{y})$ and $d(x, \hat{x})$ remain the rate and distortion terms, respectively, while $-\log D(\hat{x}, \hat{y})$ is the adversarial loss.
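The alternation between the discriminator update (Equation 2) and the compression model update (Equation 3) might be sketched as follows; the bundled compression model `g`, the conditional discriminator `D`, and the weights `lam` and `beta` are assumptions for illustration, not a definitive description of the disclosed training procedure.

```python
import torch
import torch.nn.functional as F

def gan_compression_training_step(x, g, D, opt_g, opt_D, lam=0.01, beta=0.1):
    """One alternating min/max training step, sketching Equations 2 and 3.

    g is assumed to bundle the (hypothetical) encoder, quantizer, entropy model,
    and decoder, returning (y_hat, x_hat, rate); D is the conditional
    discriminator D(image, y_hat). Names and weights are illustrative only.
    """
    # --- Discriminator (maximization) step, cf. Equation 2 ---
    with torch.no_grad():
        y_hat, x_hat, _ = g(x)
    opt_D.zero_grad()
    real_logits = D(x, y_hat)
    fake_logits = D(x_hat, y_hat)
    loss_D = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    loss_D.backward()
    opt_D.step()

    # --- Compression model (minimization) step, cf. Equation 3 ---
    opt_g.zero_grad()
    y_hat, x_hat, rate = g(x)
    fake_logits = D(x_hat, y_hat)
    adv = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    loss_g = rate + lam * F.mse_loss(x_hat, x) + beta * adv
    loss_g.backward()
    opt_g.step()
    return loss_D.item(), loss_g.item()
```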
In order to take advantage of temporal redundancy in video encoding, video compression relies on information transfer through motion compensation. More precisely, a subsequent frame $x_{t+1}$ (identified in
However, it is noted that the video compression techniques described below by reference to
Two strategies for leveraging a trained image compression ML model to compress the residual information needed to fix the current estimate $\tilde{x}_{t+1}$ are described below. The first strategy is referred to as “knowledge distillation with latent space residuals” and the second strategy is referred to as “knowledge distillation with image space residuals.”
Knowledge Distillation with Latent Space Residuals:
As the residual mapping function $h$ and its reverse $h^{-1}$ operate on a single time instant, time can be omitted from the notation that follows. The following definitions are established: $r = h(y, \tilde{y})$ and $\hat{y} = h^{-1}(\hat{r}, \tilde{y})$. It is noted that because $\tilde{y}_{t+1}$ is obtained via motion compensation, it is available both at encoding and decoding time. This solution is designed to leverage the image compression ML model $g$ trained with the adversarial loss. The parameters of image compression ML model $g$ remain unchanged. To achieve this residual compression, the parameters of the residual mapping $h$ (including its reverse $h^{-1}$) and the parameters of the probability model $p_{\hat{r}}$ need to be trained. This can be done by optimizing the following rate-distortion loss:
$$\mathcal{L}_{h,p_{\hat{r}}} = \mathbb{E}_{x \sim p_x}\!\left[\, -\log_2 p_{\hat{r}}(\hat{r}) \;+\; \lambda\, d\!\left(x^{*},\; g^{-1}(\hat{y})\right) \,\right] \qquad \text{(Equation 4)}$$

where $-\log_2 p_{\hat{r}}(\hat{r})$ is the rate term and $d\!\left(x^{*}, g^{-1}(\hat{y})\right)$ is the distortion term, with $x^{*}$ denoting the frame compressed as a single image by the trained image compression ML model $g$.
It is noted that the target frames are no longer the ground truth but are instead the output of the image compression ML model $g$. This enables the performance of knowledge distillation and retains the detail hallucination capabilities of the adversarially trained image compression model. The residual mapping itself can be implemented as a combination of several techniques, as described in greater detail in the attached paper, titled “Knowledge Distillation for GAN Based Video Codec,” which is hereby incorporated fully by reference into the present application.
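A minimal sketch of this latent-space-residual training, assuming the frozen image compression model is split into hypothetical `g_enc`/`g_quant`/`g_dec` components and the trainable residual path consists of `h`, `h_inv`, `quantize_r`, and `entropy_model_r`, is provided below; it is illustrative only and does not reproduce the exact residual mapping described in the incorporated paper.

```python
import torch
import torch.nn.functional as F

def latent_residual_distillation_loss(x_next, y_tilde, g_enc, g_quant, g_dec,
                                      h, h_inv, quantize_r, entropy_model_r,
                                      lam=0.01):
    """Sketch of the Equation 4 loss for knowledge distillation with latent
    space residuals. g_* form the frozen, adversarially trained image model;
    h/h_inv/quantize_r/entropy_model_r are trainable. Names are assumptions."""
    with torch.no_grad():
        y = g_enc(x_next)                # latent of the current frame
        target = g_dec(g_quant(y))       # distillation target x*: output of g alone

    r = h(y, y_tilde)                    # latent residual r = h(y, y_tilde)
    r_hat = quantize_r(r)
    rate = -torch.log2(entropy_model_r(r_hat)).sum() / x_next.shape[0]

    y_hat = h_inv(r_hat, y_tilde)        # latent reconstructed at the decoder
    x_hat = g_dec(y_hat)                 # decoded frame via the frozen g^-1

    distortion = F.mse_loss(x_hat, target)   # target is g's output, not ground truth
    return rate + lam * distortion
```

Because only `h`, `h_inv`, `quantize_r`, and `entropy_model_r` are registered with the optimizer, the parameters of the image compression model remain unchanged, as stated above.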
Knowledge Distillation with Image Space Residuals:
A different approach to leverage the available trained image compression ML model g is to work with image space residuals as represented by exemplary ML model-based video codec architecture 228, in
The difference in the approach depicted by ML model-based video codec architecture 228, is that in the implementation shown in
$$z_{t+1} = x_{t+1} - \hat{x}_{t+1} \qquad \text{(Equation 5)}$$
The neural encoder and neural decoder functions are denoted respectively as $h$ and $h^{-1}$. They may be implemented as neural network layers as described above by reference to
The training loss may be expressed as:
$$\mathcal{L}_{h,p_{\hat{r}}} = \mathbb{E}_{x \sim p_x}\!\left[\, -\log_2 p_{\hat{r}}(\hat{r}) \;+\; \lambda\, d(x_{t+1}, \hat{x}_{t+1}) \;-\; \log D\!\left(x^{*}_{t+1}, \hat{x}_{t+1}\right) \,\right] \qquad \text{(Equation 6)}$$

where $-\log_2 p_{\hat{r}}(\hat{r})$ is the rate term, $d(x_{t+1}, \hat{x}_{t+1})$ is the distortion term, and $-\log D(x^{*}_{t+1}, \hat{x}_{t+1})$ is the adversarial loss. It is noted that the training loss expressed by Equation 6 includes an adversarial loss against $x^{*}_{t+1}$, which corresponds to the image compressed as a single frame with the trained image compression ML model $g$.
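A comparable sketch for the image-space-residual loss of Equation 6 is given below, again with hypothetical module names: `x_pred` stands in for the motion compensated frame estimate, `x_star` for the frame compressed as a single image by the trained model $g$, and `h`/`h_inv`/`quantize_r`/`entropy_model_r` for the trainable residual path.

```python
import torch
import torch.nn.functional as F

def image_residual_loss(x_next, x_pred, x_star, D, h, h_inv,
                        quantize_r, entropy_model_r, lam=0.01, beta=0.1):
    """Sketch of the Equation 6 training loss for knowledge distillation with
    image space residuals. All names and the weight beta are assumptions."""
    z = x_next - x_pred                      # image space residual (Equation 5)

    r = h(z)                                 # residual latent
    r_hat = quantize_r(r)
    rate = -torch.log2(entropy_model_r(r_hat)).sum() / x_next.shape[0]

    z_hat = h_inv(r_hat)                     # decoded residual
    x_hat = x_pred + z_hat                   # corrected frame reconstruction

    distortion = F.mse_loss(x_hat, x_next)   # d(x_{t+1}, x_hat_{t+1})

    # Adversarial term against x_star (non-saturating form).
    logits = D(x_star, x_hat)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))

    return rate + lam * distortion + beta * adv
```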
Temporal instability, such as flickering, mismatch between hallucinated details, and the like, can occur in both of the knowledge distillation processes described above. In order to maintain temporally stable results, a temporal smoothing component can be added. More formally, given the previously decoded frame $\hat{x}_t$ and motion vectors $\hat{m}_t$, the objective is to process the frame $\hat{x}_{t+1}$ to remove any temporal artifact:
$$\hat{x}^{*}_{t+1} = F\!\left(\hat{x}_{t+1},\; W(\hat{x}^{*}_{t}, \hat{m}_t)\right) \qquad \text{(Equation 7)}$$
with the $*$ superscript indicating a temporally processed frame. $W$ is the image warping function that uses the motion field $\hat{m}_t$ to warp the previous frame $\hat{x}^{*}_{t}$ to match the current frame $\hat{x}_{t+1}$. It is noted that rather than a single previously decoded frame, in some implementations $\hat{x}_t$ may represent multiple frames. In implementations in which multiple previously decoded frames are utilized, each motion vector $\hat{m}_t$ may be treated as a pair of data points including the displacement and the reference frame index.
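As an illustration of Equation 7, the following sketch implements a simple backward warping function $W$ with PyTorch's `grid_sample` and applies a hypothetical correcting network `F_net` to the concatenation of the current frame and the warped previous frame; the concatenation and all module names are assumptions, not the disclosed implementation.

```python
import torch
import torch.nn.functional as nnf

def warp(prev_frame, flow):
    """Backward-warp prev_frame (N, C, H, W) with a dense motion field
    flow (N, 2, H, W) given in pixels; a minimal stand-in for W in Equation 7."""
    _, _, h, w = prev_frame.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=prev_frame.device),
        torch.arange(w, device=prev_frame.device),
        indexing="ij",
    )
    # Absolute sampling positions, normalized to [-1, 1] for grid_sample.
    gx = 2.0 * (xs.float().unsqueeze(0) + flow[:, 0]) / max(w - 1, 1) - 1.0
    gy = 2.0 * (ys.float().unsqueeze(0) + flow[:, 1]) / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)          # (N, H, W, 2)
    return nnf.grid_sample(prev_frame, grid, align_corners=True)

def temporally_smooth(x_hat_next, x_star_prev, flow, F_net):
    """x_hat_star_next = F(x_hat_next, W(x_star_prev, m_hat_t)), cf. Equation 7.
    F_net is a hypothetical correcting network; its two inputs are concatenated
    along the channel dimension in this sketch."""
    warped_prev = warp(x_star_prev, flow)
    return F_net(torch.cat((x_hat_next, warped_prev), dim=1))
```

Regions warped from outside the previous frame are zero-padded by `grid_sample` in this sketch; such regions are exactly those a merging mask would typically exclude from the temporal penalty described below.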
The correcting function $F$ may be implemented as an NN and may be trained using the following loss:
$$\mathcal{L}_{F} = \mathbb{E}_{x \sim p_x}\!\left[\, M \odot d\!\left(\hat{x}^{*}_{t+1},\; W(\hat{x}^{*}_{t}, \hat{m}_t)\right) \;-\; \log D\!\left(\hat{x}^{*}_{t}, \hat{x}_{t+1}\right) \,\right] \qquad \text{(Equation 8)}$$
where $M \odot d\!\left(\hat{x}^{*}_{t+1}, W(\hat{x}^{*}_{t}, \hat{m}_t)\right)$ is the temporal term and $-\log D(\hat{x}^{*}_{t}, \hat{x}_{t+1})$ is the optional adversarial loss term, with $d$ a distortion error that penalizes deviation between the two consecutive frame appearances to enforce temporal stability (it can be the $\ell_1$ loss, for example). $M$ is a merging function that may be implemented as a binary mask indicating where motion vectors are valid, to limit the penalty to regions where motion is correctly estimated. Finally, as an option, an adversarial loss may be added to avoid over-smoothing the final output.
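A corresponding sketch of the Equation 8 training loss is shown below, with the merging mask $M$ supplied as a precomputed binary tensor and the adversarial term included only when a discriminator is provided. Because the extracted text is ambiguous about the exact discriminator inputs, this sketch feeds the processed frame to the discriminator so that the adversarial term actually constrains the output of $F$; that choice, like every name here, is an assumption.

```python
import torch
import torch.nn.functional as F

def temporal_smoothing_loss(x_star_next, x_star_prev_warped, valid_mask,
                            D=None, x_hat_next=None, beta=0.1):
    """Sketch of the Equation 8 objective for training the correcting function F.
    valid_mask plays the role of the merging mask M (1 where motion is valid)."""
    # Temporal term: masked per-pixel l1 distance between the processed frame
    # and the warped previous processed frame.
    temporal = (valid_mask * torch.abs(x_star_next - x_star_prev_warped)).mean()

    loss = temporal
    if D is not None and x_hat_next is not None:
        # Optional adversarial term to avoid over-smoothing (assumed inputs).
        logits = D(x_star_next, x_hat_next)
        loss = loss + beta * F.binary_cross_entropy_with_logits(
            logits, torch.ones_like(logits))
    return loss
```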
The knowledge distillation with image space residuals approach corresponding to exemplary ML model-based video codec architecture 228, in
Referring now to
Flowchart 350 further includes comparing the uncompressed video content with the motion compensated video content to identify image space residual 237 corresponding to the uncompressed video content (action 352). Continuing to refer to
Flowchart 350 further includes transforming image space residual 237 to latent space representation 239 of image space residual 237 (action 353). Image space residual 237 may be transformed to latent space representation 239 of image space residual 237 in action 353 by ML model-based video compression encoder 235, executed by processing hardware 104 of system 100, and using neural encoder function $h$.
Flowchart 350 further includes receiving, using trained image compression ML model 232, the motion compensated video content (e.g., motion compensated frame 219) (action 354). As noted above, trained image compression ML model 232 may include a trained NN, such as a trained GAN, for example. Moreover, and as noted above, in some implementations, trained image compression ML model 232 may include an NN trained using an objective function including an adversarial loss. Action 354 may be performed by ML model-based video compression encoder 235, executed by processing hardware 104 of system 100.
Flowchart 350 further includes transforming, using trained image compression ML model 232, the motion compensated video content represented by motion compensated frame 219 to latent space representation 234 of the motion compensated video content (action 355). As shown by
It is noted that although flowchart 350 depicts actions 354 and 355 as following actions 351, 352, and 353, that representation is provided merely by way of example. In some other implementations, actions 354 and 355 may be performed in sequence, but in parallel, i.e., substantially concurrently, with actions 351, 352, and 353. In still other implementations, action 354, or actions 354 and 355, may precede one or more of actions 351, 352, and 353.
Flowchart 350 further includes encoding latent space representation 239 of image space residual 237 to produce an encoded latent residual (action 356). Latent space representation 239 of image space residual 237 may be encoded in action 356 to produce the encoded latent residual by ML model-based video compression encoder 235, executed by processing hardware 104 of system 100.
Flowchart 350 further includes encoding, using trained image compression ML model 232, latent space representation 234 of motion compensated frame 219 to produce encoded latent video content (action 357). Latent space representation 234 of motion compensated frame 219 may be encoded in action 357 to produce the encoded latent video content by ML model-based video compression encoder 235, executed by processing hardware 104 of system 100, and using trained image compression ML model 232.
It is noted that although flowchart 350 depicts action 357 as following action 356, that representation is provided merely by way of example. The only constraint placed on the timing of action 357 is that it follows action 355, while the only constraint placed on the timing of action 356 is that it follows action 353. Thus, in various implementations, action 357 may follow action 356, may precede action 356, or may be performed in parallel, i.e., substantially concurrently, with action 356. That is to say, in some implementations, the encoded latent residual produced in action 356 and the encoded latent video content produced in action 357 may be produced in parallel.
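Purely for illustration, the actions of flowchart 350 described above might be sequenced at the encoder as in the following sketch; every function name is a hypothetical stand-in (the `g_*` callables for trained image compression ML model 232, and `h`, `quantize_r`, and `r_entropy_code` for the residual path of encoder 235) rather than an element of the disclosed system.

```python
def encode_frame_image_space_residual(x_uncompressed, x_motion_comp,
                                      g_enc, g_quantize, g_entropy_code,
                                      h, quantize_r, r_entropy_code):
    """Sketch of actions 352-357 of flowchart 350 with hypothetical modules."""
    # Action 352: compare uncompressed and motion compensated content to
    # identify the image space residual (cf. Equation 5).
    image_residual = x_uncompressed - x_motion_comp

    # Action 353: transform the image space residual to its latent space
    # representation using the neural encoder function h.
    latent_residual = h(image_residual)

    # Actions 354 and 355: receive the motion compensated content and transform
    # it to its latent space representation with the trained image model.
    latent_motion_comp = g_enc(x_motion_comp)

    # Action 356: encode the latent residual.
    encoded_latent_residual = r_entropy_code(quantize_r(latent_residual))

    # Action 357: encode the latent of the motion compensated content.
    # Actions 356 and 357 are independent and could equally run in parallel.
    encoded_latent_video = g_entropy_code(g_quantize(latent_motion_comp))

    return encoded_latent_residual, encoded_latent_video
```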
Referring to
The knowledge distillation with latent space residuals approach corresponding to exemplary ML model-based video codec architecture 226, in
Referring now to
Flowchart 460 further includes transforming, using trained image compression ML model 232, the uncompressed video content represented by uncompressed frame 217 to first latent space representation 234a of the uncompressed video content (action 462). As shown by
Flowchart 460 further includes transforming, using trained image compression ML model 232, the motion compensated video content represented by motion compensated frame 219 to second latent space representation 234b of the motion compensated video content (action 463). As shown by
It is noted that although flowchart 460 depicts action 463 as following action 462, that representation is provided merely by way of example. In various implementations, action 463 may follow action 462, may precede action 462, or may be performed in parallel with, i.e., substantially concurrently with, action 462. That is to say, in some implementations, the transformation of the uncompressed video content to first latent space representation 234a, and the transformation of the motion compensated video content to second latent space representation 234b, may be performed in parallel.
Flowchart 460 further includes generating a bitstream for transmitting compressed video content 117 corresponding to uncompressed video content 116 based on first latent space representation 234a and second latent space representation 234b (action 464). In some implementations, action 464 may include determining, using first latent space representation 234a and second latent space representation 234b, a latent space residual. For example, such a latent space residual may be based on the difference between first latent space representation 234a and second latent space representation 234b. In implementations in which a latent space residual is determined as part of action 464, the bitstream for transmitting compressed video content 117 corresponding to uncompressed video content 116 may be generated using the latent space residual. Generation of the bitstream for transmitting compressed video content 117, in action 464, may be performed by ML model-based video compression encoder 233, executed by processing hardware 104 of system 100.
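Similarly, and purely for illustration, the actions of flowchart 460 might be sketched as follows, with hypothetical stand-ins (`g_enc` for the trained image compression ML model 232, and `h`, `quantize_r`, and `r_entropy_code` for the residual path of encoder 233); it is a sketch under those assumptions, not a definitive implementation.

```python
def encode_frame_latent_space_residual(x_uncompressed, x_motion_comp,
                                       g_enc, h, quantize_r, r_entropy_code):
    """Sketch of actions 462-464 of flowchart 460 with hypothetical modules."""
    # Action 462: transform the uncompressed frame to the first latent space
    # representation using the trained image compression ML model.
    first_latent = g_enc(x_uncompressed)

    # Action 463: transform the motion compensated frame to the second latent
    # space representation (may run in parallel with action 462).
    second_latent = g_enc(x_motion_comp)

    # Action 464: determine a latent space residual from the two latents (here
    # via the residual mapping h; a plain difference is another option) and
    # generate the bitstream from it.
    latent_residual = h(first_latent, second_latent)
    bitstream = r_entropy_code(quantize_r(latent_residual))
    return bitstream
```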
With respect to the actions represented in
Thus, the present application discloses a framework including an ML model-based video compression solution based on knowledge distillation and latent space residual to enable use of a video compression codec that has similar hallucination capacity to a trained GAN, which is particularly important when targeting low bit-rate video compression. The present ML model-based video compression solution advances the state-of-the-art by providing images that are visually pleasing without requiring a high bit-rate. Some image details synthesized when using an ML model-based video codec may look realistic while deviating slightly from the ground truth. Nevertheless, the present ML model-based video compression solution is advantageously capable of providing image quality that would be impossible using the same amount of transmitted data in conventional approaches.
From the above description it is manifest that various techniques can be used for implementing the concepts described in the present application without departing from the scope of those concepts. Moreover, while the concepts have been described with specific reference to certain implementations, a person of ordinary skill in the art would recognize that changes can be made in form and detail without departing from the scope of those concepts. As such, the described implementations are to be considered in all respects as illustrative and not restrictive. It should also be understood that the present application is not limited to the particular implementations described herein, but many rearrangements, modifications, and substitutions are possible without departing from the scope of the present disclosure.
The present application claims the benefit of and priority to Provisional Patent Application Ser. No. 63/172,315, filed Apr. 8, 2021, and titled “Neural Network Based Video Codecs,” and Provisional Patent Application Ser. No. 63/255,280, filed Oct. 13, 2021, and titled “Microdosing For Low Bitrate Video Compression,” which are hereby incorporated fully by reference into the present application.