The present embodiments generally relate to a method and an apparatus for video encoding or decoding, by using deep neural networks.
In conventional image or video coding, codecs already show the benefit of block-based coding. In recent deep learning-based image or video compression, however, the full image is usually processed at once; for example, the whole picture is fed into an auto-encoder to compress it.
According to an embodiment, a method of video decoding is provided, comprising: accessing a bitstream including a picture, said picture having a plurality of blocks; entropy decoding said bitstream to generate a set of values for a block of said plurality of blocks; applying a neural network to said set of values to generate a block of picture samples for said block, said neural network having a plurality of network layers, wherein each network layer of said plurality of network layers performs linear and non-linear operations.
According to an embodiment, a method of video encoding is provided, comprising: accessing a picture, said picture partitioned into a plurality of blocks; forming an input based on at least a block of said picture; applying a neural network to said input to form output coefficients, said neural network having a plurality of network layers, wherein each network layer of said plurality of network layers performs linear and non-linear operations; and entropy encoding said output coefficients.
According to another embodiment, an apparatus for video decoding is provided, comprising one or more processors, wherein said one or more processors are configured to: access a bitstream including a picture, said picture having a plurality of blocks; entropy decode said bitstream to generate a set of values for a block of said plurality of blocks; apply a neural network to said set of values to generate a block of picture samples for said block, said neural network having a plurality of network layers, wherein each network layer of said plurality of network layers performs linear and non-linear operations.
According to another embodiment, an apparatus for video encoding is provided, comprising one or more processors, wherein said one or more processors are configured to: access a picture, said picture partitioned into a plurality of blocks; form an input based on at least a block of said picture; apply a neural network to said input to form output coefficients, said neural network having a plurality of network layers, wherein each network layer of said plurality of network layers performs linear and non-linear operations; and entropy encode said output coefficients.
According to another embodiment, an apparatus for video decoding is provided, comprising: means for accessing a bitstream including a picture, said picture having a plurality of blocks; means for entropy decoding said bitstream to generate a set of values for a block of said plurality of blocks; means for applying a neural network to said set of values to generate a block of picture samples for said block, said neural network having a plurality of network layers, wherein each network layer of said plurality of network layers performs linear and non-linear operations.
According to another embodiment, an apparatus for video encoding is provided, comprising: means for accessing a picture, said picture partitioned into a plurality of blocks; means for forming an input based on at least a block of said picture; means for applying a neural network to said input to form output coefficients, said neural network having a plurality of network layers, wherein each network layer of said plurality of network layers performs linear and non-linear operations; and means for entropy encoding said output coefficients.
The system 100 includes at least one processor 110 configured to execute instructions loaded therein for implementing, for example, the various aspects described in this application. Processor 110 may include embedded memory, input output interface, and various other circuitries as known in the art. The system 100 includes at least one memory 120 (e.g., a volatile memory device, and/or a non-volatile memory device). System 100 includes a storage device 140, which may include non-volatile memory and/or volatile memory, including, but not limited to, EEPROM, ROM, PROM, RAM, DRAM, SRAM, flash, magnetic disk drive, and/or optical disk drive. The storage device 140 may include an internal storage device, an attached storage device, and/or a network accessible storage device, as non-limiting examples.
System 100 includes an encoder/decoder module 130 configured, for example, to process data to provide an encoded video or decoded video, and the encoder/decoder module 130 may include its own processor and memory. The encoder/decoder module 130 represents module(s) that may be included in a device to perform the encoding and/or decoding functions. As is known, a device may include one or both of the encoding and decoding modules. Additionally, encoder/decoder module 130 may be implemented as a separate element of system 100 or may be incorporated within processor 110 as a combination of hardware and software as known to those skilled in the art.
Program code to be loaded onto processor 110 or encoder/decoder 130 to perform the various aspects described in this application may be stored in storage device 140 and subsequently loaded onto memory 120 for execution by processor 110. In accordance with various embodiments, one or more of processor 110, memory 120, storage device 140, and encoder/decoder module 130 may store one or more of various items during the performance of the processes described in this application. Such stored items may include, but are not limited to, the input video, the decoded video or portions of the decoded video, the bitstream, matrices, variables, and intermediate or final results from the processing of equations, formulas, operations, and operational logic.
In several embodiments, memory inside of the processor 110 and/or the encoder/decoder module 130 is used to store instructions and to provide working memory for processing that is needed during encoding or decoding. In other embodiments, however, a memory external to the processing device (for example, the processing device may be either the processor 110 or the encoder/decoder module 130) is used for one or more of these functions. The external memory may be the memory 120 and/or the storage device 140, for example, a dynamic volatile memory and/or a non-volatile flash memory. In several embodiments, an external non-volatile flash memory is used to store the operating system of a television. In at least one embodiment, a fast external dynamic volatile memory such as a RAM is used as working memory for video coding and decoding operations, such as for MPEG-2, HEVC, or VVC.
The input to the elements of system 100 may be provided through various input devices as indicated in block 105. Such input devices include, but are not limited to, (i) an RF portion that receives an RF signal transmitted, for example, over the air by a broadcaster, (ii) a Composite input terminal, (iii) a USB input terminal, and/or (iv) an HDMI input terminal.
In various embodiments, the input devices of block 105 have associated respective input processing elements as known in the art. For example, the RF portion may be associated with elements suitable for (i) selecting a desired frequency (also referred to as selecting a signal, or band-limiting a signal to a band of frequencies), (ii) down converting the selected signal, (iii) band-limiting again to a narrower band of frequencies to select (for example) a signal frequency band which may be referred to as a channel in certain embodiments, (iv) demodulating the down converted and band-limited signal, (v) performing error correction, and (vi) demultiplexing to select the desired stream of data packets. The RF portion of various embodiments includes one or more elements to perform these functions, for example, frequency selectors, signal selectors, band-limiters, channel selectors, filters, downconverters, demodulators, error correctors, and demultiplexers. The RF portion may include a tuner that performs various of these functions, including, for example, down converting the received signal to a lower frequency (for example, an intermediate frequency or a near-baseband frequency) or to baseband. In one set-top box embodiment, the RF portion and its associated input processing element receives an RF signal transmitted over a wired (for example, cable) medium, and performs frequency selection by filtering, down converting, and filtering again to a desired frequency band. Various embodiments rearrange the order of the above-described (and other) elements, remove some of these elements, and/or add other elements performing similar or different functions. Adding elements may include inserting elements in between existing elements, for example, inserting amplifiers and an analog-to-digital converter. In various embodiments, the RF portion includes an antenna.
Additionally, the USB and/or HDMI terminals may include respective interface processors for connecting system 100 to other electronic devices across USB and/or HDMI connections. It is to be understood that various aspects of input processing, for example, Reed-Solomon error correction, may be implemented, for example, within a separate input processing IC or within processor 110 as necessary. Similarly, aspects of USB or HDMI interface processing may be implemented within separate interface ICs or within processor 110 as necessary. The demodulated, error corrected, and demultiplexed stream is provided to various processing elements, including, for example, processor 110, and encoder/decoder 130 operating in combination with the memory and storage elements to process the datastream as necessary for presentation on an output device.
Various elements of system 100 may be provided within an integrated housing. Within the integrated housing, the various elements may be interconnected and transmit data therebetween using a suitable connection arrangement 115, for example, an internal bus as known in the art, including the I2C bus, wiring, and printed circuit boards.
The system 100 includes communication interface 150 that enables communication with other devices via communication channel 190. The communication interface 150 may include, but is not limited to, a transceiver configured to transmit and to receive data over communication channel 190. The communication interface 150 may include, but is not limited to, a modem or network card and the communication channel 190 may be implemented, for example, within a wired and/or a wireless medium.
Data is streamed to the system 100, in various embodiments, using a Wi-Fi network such as IEEE 802.11. The Wi-Fi signal of these embodiments is received over the communications channel 190 and the communications interface 150 which are adapted for Wi-Fi communications. The communications channel 190 of these embodiments is typically connected to an access point or router that provides access to outside networks including the Internet for allowing streaming applications and other over-the-top communications. Other embodiments provide streamed data to the system 100 using a set-top box that delivers the data over the HDMI connection of the input block 105. Still other embodiments provide streamed data to the system 100 using the RF connection of the input block 105.
The system 100 may provide an output signal to various output devices, including a display 165, speakers 175, and other peripheral devices 185. The other peripheral devices 185 include, in various examples of embodiments, one or more of a stand-alone DVR, a disk player, a stereo system, a lighting system, and other devices that provide a function based on the output of the system 100. In various embodiments, control signals are communicated between the system 100 and the display 165, speakers 175, or other peripheral devices 185 using signaling such as AV.Link, CEC, or other communications protocols that enable device-to-device control with or without user intervention. The output devices may be communicatively coupled to system 100 via dedicated connections through respective interfaces 160, 170, and 180. Alternatively, the output devices may be connected to system 100 using the communications channel 190 via the communications interface 150. The display 165 and speakers 175 may be integrated in a single unit with the other components of system 100 in an electronic device, for example, a television. In various embodiments, the display interface 160 includes a display driver, for example, a timing controller (T Con) chip.
The display 165 and speakers 175 may alternatively be separate from one or more of the other components, for example, if the RF portion of input 105 is part of a separate set-top box. In various embodiments in which the display 165 and speakers 175 are external components, the output signal may be provided via dedicated output connections, including, for example, HDMI ports, USB ports, or COMP outputs.
Note that this simple example omits many details, especially on the strategy for the entropy coding of the coefficients. In this example, the whole image is fed into the auto-encoder, and each transmitted coefficient is used in the reconstruction of an area of at most 36×36 pixels in the reconstructed image. However, there are no particular region boundaries for each decoded coefficient, and each final pixel potentially depends on the values of many coefficients spatially located around this pixel.
The present application proposes compressive auto-encoders working on image parts (as opposed to the whole image). The image partitioning can be handled in the DNN design in order to reduce data redundancy. Classical image/video partitioning schemes can be used, for example, regular block splitting as in JPEG and H.264/AVC, quad-tree partitioning as in H.265/HEVC, or more advanced splitting as in H.266/VVC.
Some advantages of using block-based (or region-based) encoding are described as follows:
In
The output from the neural network can then be quantized (330). The quantized values are entropy coded (340) to output a bitstream. It should be noted that quantization is not mandatory if the network itself already operates on integers, because in that case the quantization is “included” in the network during training.
If encoding the current block is based on other reconstructed blocks, the encoder can also decode the encoded block to provide causal information. The quantized values are de-quantized (360). The dequantized values are used to reconstruct the block by using another neural network (350), which performs linear and non-linear operations. Generally, this neural network (350) used for decoding performs the inverse operations of the neural network (320) used for encoding.
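As an illustrative, non-limiting sketch of the quantization (330) and de-quantization (360) steps around the entropy coder, the following assumes uniform scalar quantization with a hypothetical step size and toy coefficient values; the actual quantizer used by the embodiments is not fixed here.

```python
# Hedged sketch of steps 330/340/360: uniform scalar quantization of the
# auto-encoder output before entropy coding, and its inverse at the decoder.
# The step size and coefficient values are illustrative assumptions.

def quantize(coeffs, step):
    """Map real-valued network outputs to integer levels (step 330)."""
    return [round(c / step) for c in coeffs]

def dequantize(levels, step):
    """Inverse quantization (step 360) before the decoding network (350)."""
    return [q * step for q in levels]

coeffs = [0.73, -1.28, 2.10, 0.02]   # toy latent coefficients
levels = quantize(coeffs, step=0.5)  # integers passed to the entropy coder (340)
rec = dequantize(levels, step=0.5)   # values fed to the decoding network (350)
```

Only the integer levels need to be entropy coded; the decoder recovers the de-quantized values exactly from them.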
In the following, we first assume that the image has been partitioned into uniform non-overlapping blocks and that each block is coded sequentially, following the raster scan order as illustrated in
Each block is composed of a set of pixels, having at least one component. Typically, a pixel has three components (for example {R, G and B}, or {Y, U and V}). Note that the proposed methods also apply to other “image-based” information such as a depth map, a motion-field, etc.
We assume that each block is compressed using a compressive auto-encoder, for example, as shown in
Case 1—Corner Case
The first case, as illustrated in
Case 2—Top Row Case
The second case, as illustrated in
Case 3—Left Column Case
This case, as illustrated in
Case 4—General Case
The last case, as illustrated in
The auto-encoder is similar in principle to the previous ones, but three concatenated channels are used instead of one. The concatenation refers to the usual tensor concatenation where each layer of each block forms a tensor of dimension w×h×d, where w and h are the block width and height and d is the depth of the tensor, i.e., d=3 in this case since each block has one component only.
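The tensor concatenation described above can be sketched as follows; the block size and pixel values are illustrative assumptions, and NumPy stands in for whatever tensor library implements the network.

```python
import numpy as np

# Hypothetical 4x4 single-component blocks: the current block S and the
# reconstructed top (Q) and left (R) neighbours.
h = w = 4
S = np.arange(16, dtype=np.float32).reshape(h, w)
Q = np.ones((h, w), dtype=np.float32)
R = np.zeros((h, w), dtype=np.float32)

# Usual tensor concatenation: three single-component channels stacked along
# a new depth axis, giving a tensor of depth d = 3.
x = np.stack([S, Q, R], axis=-1)
```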
Case 4—Variant 1
According to another embodiment, in the general case where the top and left blocks are available, the top-left block P is also added to the auto-encoder inputs. The auto-encoder inputs are similar to the ones presented in the previous general case, with an additional channel. The reconstructed top-left block P has been mirrored horizontally and vertically, to increase the correlation with each pixel of S.
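A minimal sketch of the horizontal-and-vertical mirroring of the reconstructed top-left block P (the block size and values are hypothetical):

```python
# Hypothetical 2x2 reconstructed top-left block P.
P = [[1, 2],
     [3, 4]]

# Mirroring horizontally and vertically reverses both axes, so the pixels of P
# nearest the current block S end up nearest S's top-left corner.
P_mirrored = [row[::-1] for row in P[::-1]]
```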
Example of Auto-Encoders with Input Context
Symmetrically, the decoder as illustrated in
Note that other layers may be used in the auto-encoder, such as a generalized divisive normalization (GDN) layer, a normalization layer, etc.
Input Extension
As the image is encoded sequentially, block by block, in a variant an extended version of the block X to encode is input to the auto-encoder in order to decrease blocking artifacts, as illustrated in
In another variant, the border B is also reconstructed by the decoder, but during the training stage the reconstruction error associated with the border is weighted by a factor α less than or equal to 1: L=∥X−X̂∥+α∥B−B̂∥. For the final reconstruction, the overlapping borders are used in a weighted average with the current block to obtain the final block, as illustrated in
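A sketch of the border-weighted training loss and the final weighted-average blending, under stated assumptions: the norm is taken as L1 and the blend uses equal weights, neither of which is fixed by the variant above.

```python
# Sketch of the border-weighted loss L = ||X - X_hat|| + alpha * ||B - B_hat||
# and of the final weighted-average blending; the L1 norm and the 50/50 blend
# weights are assumptions for illustration.

def border_weighted_loss(X, X_hat, B, B_hat, alpha):
    err = lambda u, v: sum(abs(a - b) for a, b in zip(u, v))  # L1 norm (assumed)
    return err(X, X_hat) + alpha * err(B, B_hat)

def blend(block_px, border_px, w=0.5):
    """Weighted average of the current block with an overlapping decoded border."""
    return [w * p + (1.0 - w) * q for p, q in zip(block_px, border_px)]

loss = border_weighted_loss([1.0, 2.0], [1.0, 1.0], [2.0], [0.0], alpha=0.5)
```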
Training Process
The auto-encoders as described above can be trained sequentially, as illustrated in
Unification of Different Cases
As shown in
Similar to Case 4, the extended reconstructed top block Qext is mirrored vertically (1220) so that the top pixels of S are better spatially correlated with the top pixels of the mirrored version of Qext. The extended reconstructed left block Rext is mirrored horizontally (1230) so that the left pixels of S are better spatially correlated with the left pixels of the mirrored version of Rext. The mirrored version of Qext (1220), that of Rext (1230), and S (1240) are each fed into a convolutional layer (1281, 1282, 1283), the down-sampling factor of each convolutional layer being chosen such that the output feature maps have the same spatial dimensions. All the resulting feature maps are concatenated (1250) and fed into the auto-encoder (1260) to obtain reconstructed block Ŝ.
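The unified input construction can be sketched as follows, with strided average pooling standing in for the learned convolutional layers (1281, 1282, 1283) and hypothetical block contents; the point is that the down-sampling factors are chosen so all three feature maps reach the same spatial size before concatenation (1250).

```python
import numpy as np

def downsample(x, factor):
    """Stand-in for a strided convolutional layer: average pooling by `factor`."""
    h, w = x.shape
    return x.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

S = np.zeros((4, 4), dtype=np.float32)          # current block, factor 1
Q_ext = np.ones((8, 8), dtype=np.float32)       # extended top block, factor 2
R_ext = np.full((8, 8), 2.0, dtype=np.float32)  # extended left block, factor 2

# Mirror the neighbours toward S, down-sample so the spatial sizes match,
# then concatenate the feature maps along the depth axis (1250).
fmaps = [downsample(Q_ext[::-1, :], 2),   # mirrored vertically (1220)
         downsample(R_ext[:, ::-1], 2),   # mirrored horizontally (1230)
         downsample(S, 1)]
x = np.stack(fmaps, axis=-1)              # 4 x 4 x 3 input to the auto-encoder (1260)
```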
Latent Input
In an example as illustrated in
In another example as illustrated in
Spatial Localization Input
In this embodiment, in order to “specialize” the network with respect to the pixel location in the block, we propose to modify the input of the network. Indeed, the pixel location in the block helps the network to better use the neighboring block information. In all embodiments, this additional input can be used in addition to the input of neighboring blocks (either reconstructed samples or latent variables).
In one example as illustrated in
In order to give the decoder the same information, the same two channels H and V are input into a secondary network (1510, 1520) having a set of layers similar to the encoder part (successive convolution, down-sampling, and non-linear layers) until the resolution matches the input of the corresponding layer in the decoder. In
In another example as shown in
In another example, the network is rendered completely spatially aware by replacing all (or part) of the convolution layers with fully connected layers. This method is especially relevant in the case of auto-encoders for small blocks (for example, up to 16×16).
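A possible form of the two spatial-localization channels, assuming H and V simply hold each pixel's normalized horizontal and vertical coordinates; the exact encoding is not fixed by the embodiments above.

```python
# Assumed form of the two extra channels for a 4x4 block: H holds each pixel's
# normalized horizontal coordinate and V its vertical coordinate, letting the
# network condition on where a pixel sits inside the block.
h = w = 4
H = [[x / (w - 1) for x in range(w)] for _ in range(h)]
V = [[y / (h - 1) for _ in range(w)] for y in range(h)]
```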
Adaptive Block Size
In an embodiment, several auto-encoders are trained for different block sizes. The image is partitioned using different block sizes, as illustrated in
In this embodiment, there exist several auto-encoders:
In this embodiment, the reconstructed pixel values from the neighbors, taken at the same size as the current block, are considered as input, since neighbor blocks may have sizes different from the current block, which makes the latent information unavailable. In
In case of latent input, an approximation of the latent variables is given by re-encoding the virtual block (from reconstructed pixels) in an auto-encoder. The latent variables are then taken from the input of the last layer.
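A toy sketch of approximating a neighbor's latent variables by re-encoding the virtual block built from reconstructed pixels; the one-layer "encoder", its weights, and the pixel values are all hypothetical stand-ins for the trained network.

```python
import numpy as np

def encoder(x, weights):
    """Toy stand-in for the encoding half of the auto-encoder: a linear
    operation followed by a non-linear one (ReLU); the activations entering
    the network's last layer are taken as the approximate latents."""
    return np.maximum(x @ weights, 0.0)

# Virtual neighbour block assembled from reconstructed pixels (assumed values).
virtual_block = np.array([[0.5, 1.0]], dtype=np.float32)
weights = np.array([[1.0, -1.0],
                    [0.5,  0.5]], dtype=np.float32)
latents = encoder(virtual_block, weights)
```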
RDO
Given several auto-encoders specialized by the block size, a classical Rate-Distortion Optimization (RDO) can be performed outside the auto-encoders as illustrated in
Φ(A)+λ(R(A)+S0) vs. Φ(B)+Φ(C)+Φ(D)+Φ(E)+λ(R(B)+R(C)+R(D)+R(E)+S1)
where Φ( ) is the distortion function (between the original and reconstructed block), R( ) is the rate (in bits) of coding the given block, S0 is the coding cost of signaling that the block is not split, S1 is the coding cost of signaling that the block is split, and λ is the trade-off between the distortion and the rate. The option with the lower cost is selected, and the same method can be applied recursively on each block.
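The split decision can be sketched with assumed distortions and rates; Φ, R, S0, S1, and λ follow the definitions in the text, and the lower Lagrangian cost wins.

```python
def choose_split(phi, rate, lam, s0, s1):
    """Compare the cost of coding block A whole against splitting it into
    sub-blocks B, C, D, E (phi: distortion per block, rate: bits per block)."""
    no_split = phi["A"] + lam * (rate["A"] + s0)
    split = sum(phi[b] for b in "BCDE") + lam * (sum(rate[b] for b in "BCDE") + s1)
    return "split" if split < no_split else "no_split"

# Hypothetical measurements for one block and its four sub-blocks.
phi = {"A": 10.0, "B": 2.0, "C": 2.0, "D": 2.0, "E": 2.0}   # distortions
rate = {"A": 100, "B": 30, "C": 30, "D": 30, "E": 30}       # rates in bits
decision = choose_split(phi, rate, lam=0.05, s0=1, s1=1)
```

With a larger λ the rate term dominates and the cheaper-to-signal "no split" option is preferred instead.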
Post-Filtering
In order to remove blocking artifacts between blocks, a post-filter network is trained on the block boundaries. To improve the performance, the auto-encoders (2010, 2020, 2030, 2040) and the post-filter network (2050, 2060, 2070) can be trained or fine-tuned jointly, for example, using the process shown in
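As a non-limiting illustration of what a boundary post-filter does, the following hand-crafted smoothing (not the trained network of 2050-2070) averages the two pixel columns on either side of a vertical block boundary:

```python
import numpy as np

def deblock_vertical_boundary(img, x, strength=0.5):
    """Toy stand-in for the learned post-filter: smooth the two pixel columns
    on either side of a vertical block boundary at column x."""
    out = img.copy()
    left, right = img[:, x - 1], img[:, x]
    avg = 0.5 * (left + right)
    out[:, x - 1] = (1 - strength) * left + strength * avg
    out[:, x] = (1 - strength) * right + strength * avg
    return out
```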
Various methods are described herein, and each of the methods comprises one or more steps or actions for achieving the described method. Unless a specific order of steps or actions is required for proper operation of the method, the order and/or use of specific steps and/or actions may be modified or combined. Additionally, terms such as “first”, “second”, etc. may be used in various embodiments to modify an element, component, step, operation, etc., for example, a “first decoding” and a “second decoding”. Use of such terms does not imply an ordering to the modified operations unless specifically required. So, in this example, the first decoding need not be performed before the second decoding, and may occur, for example, before, during, or in an overlapping time period with the second decoding.
Various methods and other aspects described in this application can be used to modify modules, for example, the neural networks (320, 350, 440) of a video encoder and decoder as shown in
An embodiment provides a computer program comprising instructions which when executed by one or more processors cause the one or more processors to perform the encoding method or decoding method according to any of the embodiments described above. One or more of the present embodiments also provide a computer readable storage medium having stored thereon instructions for encoding or decoding video data according to the methods described above. One or more embodiments also provide a computer readable storage medium having stored thereon a bitstream generated according to the methods described above. One or more embodiments also provide a method and apparatus for transmitting or receiving the bitstream generated according to the methods described above.
Various implementations involve decoding. “Decoding,” as used in this application, may encompass all or part of the processes performed, for example, on a received encoded sequence in order to produce a final output suitable for display. In various embodiments, such processes include one or more of the processes typically performed by a decoder, for example, entropy decoding, inverse quantization, and deconvolution. Whether the phrase “decoding process” is intended to refer specifically to a subset of operations or generally to the broader decoding process will be clear based on the context of the specific descriptions and is believed to be well understood by those skilled in the art.
Various implementations involve encoding. In an analogous way to the above discussion about “decoding”, “encoding” as used in this application may encompass all or part of the processes performed, for example, on an input video sequence in order to produce an encoded bitstream.
The implementations and aspects described herein may be implemented in, for example, a method or a process, an apparatus, a software program, a data stream, or a signal. Even if only discussed in the context of a single form of implementation (for example, discussed only as a method), the implementation of features discussed may also be implemented in other forms (for example, an apparatus or program). An apparatus may be implemented in, for example, appropriate hardware, software, and firmware. The methods may be implemented in, for example, an apparatus, for example, a processor, which refers to processing devices in general, including, for example, a computer, a microprocessor, an integrated circuit, or a programmable logic device. Processors also include communication devices, for example, computers, cell phones, portable/personal digital assistants (“PDAs”), and other devices that facilitate communication of information between end-users.
Reference to “one embodiment” or “an embodiment” or “one implementation” or “an implementation”, as well as other variations thereof, means that a particular feature, structure, characteristic, and so forth described in connection with the embodiment is included in at least one embodiment. Thus, the appearances of the phrase “in one embodiment” or “in an embodiment” or “in one implementation” or “in an implementation”, as well any other variations, appearing in various places throughout this application are not necessarily all referring to the same embodiment.
Additionally, this application may refer to “determining” various pieces of information. Determining the information may include one or more of, for example, estimating the information, calculating the information, predicting the information, or retrieving the information from memory.
Further, this application may refer to “accessing” various pieces of information. Accessing the information may include one or more of, for example, receiving the information, retrieving the information (for example, from memory), storing the information, moving the information, copying the information, calculating the information, determining the information, predicting the information, or estimating the information.
Additionally, this application may refer to “receiving” various pieces of information. Receiving is, as with “accessing”, intended to be a broad term. Receiving the information may include one or more of, for example, accessing the information, or retrieving the information (for example, from memory). Further, “receiving” is typically involved, in one way or another, during operations, for example, storing the information, processing the information, transmitting the information, moving the information, copying the information, erasing the information, calculating the information, determining the information, predicting the information, or estimating the information.
It is to be appreciated that the use of any of the following “/”, “and/or”, and “at least one of”, for example, in the cases of “A/B”, “A and/or B” and “at least one of A and B”, is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of both options (A and B). As a further example, in the cases of “A, B, and/or C” and “at least one of A, B, and C”, such phrasing is intended to encompass the selection of the first listed option (A) only, or the selection of the second listed option (B) only, or the selection of the third listed option (C) only, or the selection of the first and the second listed options (A and B) only, or the selection of the first and third listed options (A and C) only, or the selection of the second and third listed options (B and C) only, or the selection of all three options (A and B and C). This may be extended, as is clear to one of ordinary skill in this and related arts, for as many items as are listed.
As will be evident to one of ordinary skill in the art, implementations may produce a variety of signals formatted to carry information that may be, for example, stored or transmitted. The information may include, for example, instructions for performing a method, or data produced by one of the described implementations. For example, a signal may be formatted to carry the bitstream of a described embodiment. Such a signal may be formatted, for example, as an electromagnetic wave (for example, using a radio frequency portion of spectrum) or as a baseband signal. The formatting may include, for example, encoding a data stream and modulating a carrier with the encoded data stream. The information that the signal carries may be, for example, analog or digital information. The signal may be transmitted over a variety of different wired or wireless links, as is known. The signal may be stored on a processor-readable medium.
Number | Date | Country | Kind |
---|---|---|---|
19306703.0 | Dec 2019 | EP | regional |
Filing Document | Filing Date | Country | Kind |
---|---|---|---|
PCT/US2020/064862 | 12/14/2020 | WO |