A growing trend in video transmission is that a significant portion of images and videos recorded in the field are consumed by machines only, without ever reaching human eyes. Those machines process images and videos to complete specific tasks such as object detection, object tracking, segmentation, event detection, and the like. Recognizing that this trend is prevalent and will only accelerate, international standardization bodies have been working to standardize image and video coding that is optimized primarily for machine consumption. For example, standards such as JPEG AI and Video Coding for Machines have been proposed in addition to established standards such as Compact Descriptors for Visual Search and Compact Descriptors for Video Analytics. Image and video data intended for machine consumption does not necessarily have the same requirements as data intended for human consumption. With the growing volume of such data, solutions that encode data for machine consumption more efficiently than classical image and video coding techniques are needed.
In classical image and video coding, hybrid systems are predominantly built using a workflow that includes: input pre-processing, partitioning, prediction, frequency transform, quantization, and entropy coding. Most of the steps result in so-called lossless compression, such that the output of the decoder can be identical to the input to the encoder. The only step that allows lossy compression is quantization. Here, the designer of the coding system can apply domain knowledge to remove redundant information from the input signal. In classical image and video coding systems, this is usually done by utilizing knowledge about the human visual system. For example, since the human visual system is more sensitive to low frequencies than high frequencies, quantization for human consumption is typically designed such that more information is preserved for low spatial frequencies. Such quantization strategies, however, may lead to sub-optimal results when humans are replaced by machines as the end users. Depending on the task, machines may be sensitive to any portion of the spectrum. An adaptive strategy when designing quantization may provide enhanced performance and efficiency.
Systems and methods for image and video coding are presented which include an encoder and a decoder. The present systems and methods are preferably applied to use cases where machines are processing the output of the decoder. The present methods preferably include a method for adaptive quantization that is tailored for improved performance in the tasks that a machine conducts on the output of the decoder. The present systems and methods improve efficiency of the image and video coding compared to widely used systems that apply classical coding techniques developed for human end users.
In one embodiment, a system comprises an encoder, which encodes image and/or video data using the proposed quantization method, and a decoder, which decodes the encoded image and/or video data using information provided by the encoder in the bitstream, or which applies an additional post-processing quantization method independently of the encoded bitstream information.
An adaptive quantization module (AQM) for encoding and decoding image or video data for machines using adaptive quantization will preferably obtain a machine model for the machine-based system receiving the image and/or video data. From this machine-model, the AQM will generate a frequency importance map. The frequency importance map is used to determine an adjustment matrix, which in turn is used to adjust the default quantization matrix. The AQM quantizes the image or video data using the adjusted quantization matrix.
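The AQM flow described above can be sketched as follows. The flat default matrix, the importance-to-factor mapping, and the clipping threshold are illustrative assumptions for this sketch, not values taken from the disclosure.

```python
import numpy as np

# Hypothetical flat default quantization matrix (step sizes) for 8x8 blocks.
DEFAULT_M = np.full((8, 8), 16.0)

def adjust_quantization(default_m, importance, k=1.0):
    """Scale the default matrix by per-frequency adjustment factors:
    frequencies with high importance keep fine quantization (factor near 1),
    while less important frequencies are quantized more coarsely (factor > 1)."""
    imp = importance / importance.max()      # normalize importance to (0, 1]
    factors = k / np.clip(imp, 0.1, None)    # inverse importance, clipped for stability
    return default_m * factors               # Hadamard product with the default matrix

def quantize(coeffs, m):
    """Uniform quantization: divide each coefficient by its step size and round."""
    return np.round(coeffs / m)

def dequantize(q, m):
    """Reconstruct approximate coefficients from quantized values."""
    return q * m
```

In this sketch a frequency the machine model is four times more sensitive to retains the default step size, while the remaining frequencies receive a step four times coarser.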
In some embodiments, the frequency importance map is implicitly determined from the machine model by iteratively testing the sensitivity of the model output to the changes in each frequency band of interest. This may include testing of the frequency sensitivity of the machine model using a gradient method which is based on the chain rule for differentiation.
Alternatively, the frequency importance map may be implicitly determined using statistics of the sample dataset on which the machine model is trained.
The AQM preferably adjusts the default quantization matrix by calculating a Hadamard product of the default quantization matrix and the adjustment matrix.
Preferably, a video coding system for machines includes an encoder with an AQM and a compliant decoder with a compatible AQM. In some embodiments, the AQM at the encoder site generates the adjusted quantization matrix and encodes parameters of the matrix in the bitstream. In this case, the AQM at the decoder can extract the parameters from the bitstream to perform inverse quantization.
While adaptive quantization is available for machine-targeted coding, the system retains the flexibility to use classical quantization when needed, for example in a hybrid use case where both machines and humans consume the decoded video, as extracted from the bitstream or from sub-streams within a bitstream.
Muxer 135 is a multiplexer that combines two or more sub-streams into a unified bitstream that is sent to the decoder. Thus, muxer 135 receives the outputs from video encoder 120 and video encoder 125 to generate the resultant bitstream from the encoder 100.
The encoder further includes an adaptive quantization module, aQM module 115, which calculates an adaptive quantization matrix based on the machine model 160 and passes the calculated values to the video encoder(s) 120, 125. The adaptive quantization methods employed by aQM module 115 are discussed in further detail below and depicted in the high-level flow chart of
The encoded video bitstream is provided over a communication channel to decoder 102. Demuxer 140 receives the bitstream, de-multiplexes the sub-streams from the bitstream, and sends them to the appropriate video decoder. Video decoder(s) 150, 155 decode the image/video using the information in the bitstream. Optionally, the video decoders can implement the proposed adaptive quantization method for encoder-independent post-processing.
An aQM module 145 is provided in the decoder 102 to calculate an adaptive quantization matrix based on the machine model 160 and passes the calculated values to the video decoder(s) 150, 155 to implement adaptive quantization. Alternatively, the aQM module 145 receives parameters for the adaptive quantization matrix calculated by the encoder and which are signaled in the received bitstream.
In addition to the encoder 100 and the decoder 102, a machine model 160 can be stored or transmitted to the encoder 100 and the decoder 102. This machine model 160 preferably contains relevant information about the machine algorithm. This information can be used to calculate proper quantization adjustments.
In classical image/video coding, pixels that represent either the input signal or the residual from the prediction process are transformed into frequency coefficients using some variant of Discrete Cosine Transform, Discrete Sine Transform, Wavelet Transform, or similar transformations. The reason is found in the energy compaction and decorrelation properties of those transforms. In the image/picture pixel block, the energy is distributed nearly uniformly across the block. After transformation, the data has been decorrelated horizontally and vertically and one dominant coefficient now contains a significant proportion of the energy.
For example, an 8×8 picture block of pixels and the corresponding DCT-II transformed coefficients are given in
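The energy-compaction behavior can be illustrated with a small self-contained sketch. The 8×8 gradient block below is a made-up example, not the data from the figure.

```python
import numpy as np

def dct2_matrix(n=8):
    """Orthonormal DCT-II basis matrix C, so that T = C @ X @ C.T."""
    k = np.arange(n)
    c = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    c *= np.sqrt(2.0 / n)
    c[0, :] = np.sqrt(1.0 / n)   # DC row scaling for orthonormality
    return c

# A smooth 8x8 block (a gentle horizontal gradient): pixel energy is spread
# nearly uniformly, but after the transform it concentrates in the DC term.
x = np.tile(np.linspace(100, 140, 8), (8, 1))
C = dct2_matrix(8)
t = C @ x @ C.T

total_energy = np.sum(t ** 2)
dc_share = t[0, 0] ** 2 / total_energy   # fraction of energy in the DC coefficient
```

Because the transform is orthonormal, total energy is preserved (Parseval), while for a smooth block the single DC coefficient carries the overwhelming majority of it.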
However, machines often process visual information based on principles that differ from the human visual system in significant ways. Distortions in the high frequency portion of the spectrum (which correspond to highly textured portions of the image/picture) can result in a significant degradation of accuracy in completing the given machine task. The sensitivity of a particular machine to different portions of the input spectrum is not readily apparent and generally has to be calculated based on the machine model parameters.
The method of adaptive quantization processing in accordance with the present disclosure is generally illustrated in
The transformed coefficients matrix T is quantized such that each value is divided by the corresponding value in the quantization matrix M:
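The quantization relation referenced above does not survive in this text. One plausible reconstruction, consistent with the element-wise division just described and with the Hadamard notation that follows, is:

```latex
Q_{N\times N} \;=\; \operatorname{round}\!\left( T_{N\times N} \circ M_{N\times N} \right)
```

where, under this reading, the entries of M are taken as reciprocals of the quantization step sizes, so that the Hadamard product implements the element-wise division; this interpretation is an assumption.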
Where ∘ represents a Hadamard product of the matrices T and M (each of size N×N), resulting in the quantized N×N coefficient matrix Q. With optimal values, the matrix M can achieve optimal rate-distortion performance. However, the definition of the distortion measure changes when the end user is not human, but rather is a machine performing a specific task for which it is trained. With the present adaptive quantization method, the values in matrix M are calculated such that the resulting matrix Q achieves the highest performance in task completion with the lowest amount of data. Thus, it is preferable to compute M such that it reflects the importance mapping of the machine model, as illustrated in
Referring to
Explicit information can be provided by the designer who built and configured the machine model 160. Such information maps directly and explicitly to the quantization, which is trivial to apply. Here, we consider implicit deduction, which can be performed from the training samples or from the model parameters.
The spectral domain information is represented by the transform coefficients. The magnitude of a coefficient corresponds to the importance of the given frequency. The information most pertinent to the machine model is the information preserved in the coefficients after quantization. The dynamic range of all the coefficients at a given frequency, given by the energy or the variance of all the samples, represents the importance of that frequency. Because of this, machine-optimized quantization should preserve frequencies that have higher importance to the machine model and reduce or remove the less important frequencies.
In step 205, a frequency importance map can be derived for the machine model. To determine the frequency importance map for a given machine model, two implicit techniques can be used. First, the mapping can be obtained from the model itself using a technique that is equivalent to backpropagation. The sensitivity of the model output to the changes in each frequency band can be tested using a gradient method, which is based on the chain rule for differentiation. If the output loss function of the machine model is given as L, and the input to the model is given as X, the differentiation chain gives the following relationship:
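The chain-rule relationship referenced above is elided in this text; for an N-layer network it presumably takes the form:

```latex
\frac{\partial L}{\partial X} \;=\; \frac{\partial L}{\partial X_N}\cdot\frac{\partial X_N}{\partial X_{N-1}}\cdots\frac{\partial X_1}{\partial X}
```

This reconstruction follows from the surrounding definitions of L, X, and the layer outputs X_n, and should be read as a sketch of the elided relation.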
Where each X_n is the output of the n-th layer of the neural network. Since the input X is obtained by dequantizing the quantized frequency coefficients in Q, further expansion of the chain relates to the derivative (or gradient in the case of matrices):
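The further expansion mentioned above, relating the loss to the quantized coefficients, can be reconstructed as follows, assuming X is obtained from Q by dequantization:

```latex
\frac{\partial L}{\partial Q} \;=\; \frac{\partial L}{\partial X}\cdot\frac{\partial X}{\partial Q}
```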
This derivative will determine the sensitivity of the model, represented by the loss function L, to changes in each quantized coefficient.
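When backpropagation through the actual model is unavailable, the same sensitivity can be approximated numerically. The sketch below uses central finite differences with a toy stand-in loss; `toy_loss`, the step matrix, and all values are hypothetical, not part of the disclosure.

```python
import numpy as np

def frequency_sensitivity(loss_fn, q, step, eps=1.0):
    """Estimate |dL/dQ| for each quantized coefficient by central finite
    differences; a numerical stand-in for backpropagation through the model."""
    sens = np.zeros_like(q, dtype=float)
    for i in range(q.shape[0]):
        for j in range(q.shape[1]):
            qp, qm = q.copy(), q.copy()
            qp[i, j] += eps
            qm[i, j] -= eps
            # dequantize each perturbed version before feeding the "model"
            sens[i, j] = abs(loss_fn(qp * step) - loss_fn(qm * step)) / (2 * eps)
    return sens

# Hypothetical stand-in for a machine model that mostly attends to the DC term.
def toy_loss(x):
    return float(x[0, 0] ** 2 + 0.01 * x[1, 1] ** 2)
```

Coefficients with a large estimated gradient would then receive proportionally smaller entries in the quantization matrix, as described next.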
Using this information, the default quantization matrix M can be adjusted, such that the elements Mi,j corresponding to the elements Qi,j with the highest gradient are proportionally decreased, and vice versa. If the backpropagation method is not available or not easily computable, an alternative implicit technique is to use the statistics of the sample dataset on which the model is trained. The rationale for using training set statistics is that the model's sensitivity is correlated with the variance of the training samples. In other words, and in general terms, the model will recognize new inputs that are within the dynamic range of the training samples, while inputs that are outside the range of the learning samples will be prone to misclassification.
In step 215, an adjustment factor matrix, F, is determined. The calculation of the correlation between training and new samples is preferably performed in the frequency domain. For each frequency coefficient position i,j the variance quotient can be calculated and stored as an adjustment factor. For example,
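The example formula itself is elided here; one plausible reconstruction of the "variance quotient" with multiplier k, where the statistics are taken over the dataset at each frequency position, is:

```latex
F_{i,j} \;=\; k \cdot \frac{\sigma^2\!\left(T'_{i,j}\right)}{\sigma^2\!\left(T_{i,j}\right)}
```

This form is an assumption inferred from the surrounding description.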
T′_{i,j} is the coefficient in the transformed new sample, and T_{i,j} is the coefficient in the transformed training sample at the same frequency. The T values are calculated as averages of all the coefficients at the given frequency in the dataset. An additional coefficient k is introduced as a multiplying factor that can be adjusted based on the specific use case to control the rate; in the default case it has a value of 1.
The final value of the adaptive quantization matrix MA is obtained by calculating a Hadamard product of the default matrix M and the adjustment factors matrix F.
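A minimal numpy sketch of this step, assuming F is computed as a per-frequency variance quotient over datasets of transformed blocks and MA = M ∘ F:

```python
import numpy as np

def adjustment_factors(train_coeffs, new_coeffs, k=1.0):
    """Per-frequency variance quotient over a dataset of transformed blocks.
    train_coeffs, new_coeffs: arrays of shape (num_blocks, N, N)."""
    var_train = train_coeffs.var(axis=0) + 1e-9   # guard against division by zero
    var_new = new_coeffs.var(axis=0)
    return k * var_new / var_train

def adaptive_matrix(default_m, f):
    """M_A = M ∘ F: Hadamard product of the default matrix and the factors."""
    return default_m * f
```

With synthetic data whose new-sample variance is four times the training variance, every adjustment factor comes out near 4, coarsening those frequencies accordingly.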
An example of the default quantization matrix M, the calculated adjustment factors F, and the resulting adaptive quantization matrix MA are given in
While the given examples describe machine models based on neural networks, a similar methodology can be applied to other models based on different statistical learning techniques. Values for the adaptive quantization matrix can be calculated each time the machine model is updated. For example, this can be done once when the system is initialized, based on the starting parameters of the machine model, and then repeated each time the machine model is updated. The updates can be passed to the aQM modules, which recalculate the matrix values. Updates to the machine model 160 might be initiated based on re-training of the existing model or replacement of the model with a new model, either using the same machine framework reappropriated/retrained for the new task, or replacing the whole framework. Adaptive quantization is a universal process that does not depend on a particular configuration or dimensionality of the machine model.
Regarding the operation of the aQM module in the decoder, it is important to note its post-processing role. The output of the decoding process is typically a pixel representation of the coded signal. Since the formulas for calculating the sensitivity of the model to the input signal can be applied either in the pixel domain (X) or extended to the quantized domain (Q), the post-processing can be done either directly on the decoder output in the pixel domain, or the decoder output can be transformed into the frequency domain, quantized, and then de-quantized back into the pixel domain. In both cases, the resulting pixel representation is expected to produce better performance, in the sense of accuracy, when passed into the machine model.
The bitstream is generally comprised of standard elements such as a stream header, sub-stream headers, and payload information. The adaptive quantization matrix can be signaled in the bitstream, for example in the picture header. The parameters can be specified at the lowest level, the block level.
Each Image/picture header can contain adaptive quantization parameters in the following format:
Block ID represents the identification number of the prediction block (i.e., a block in image coding, or a macroblock or coding unit in video coding) in the current image/picture. The AQ deltas represent the elements of the adjustment matrix. The elements are encoded as values for the first block and as difference values for each subsequent block. For example, to obtain the element values for the second block, the values of the first block, in the first row, are added to the difference values (deltas) in the second row.
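The delta scheme described above can be sketched as follows, assuming each row of deltas is taken relative to the immediately preceding block; the header layout itself is not reproduced here, and the helper names are illustrative.

```python
def encode_aq_deltas(blocks):
    """Encode per-block AQ parameters: the first block verbatim, each
    subsequent block as element-wise differences from the previous block.
    blocks: list of flat lists of equal length."""
    rows = [list(blocks[0])]
    for prev, cur in zip(blocks, blocks[1:]):
        rows.append([c - p for p, c in zip(prev, cur)])
    return rows

def decode_aq_deltas(rows):
    """Invert the delta coding by cumulatively adding each row of deltas."""
    blocks = [list(rows[0])]
    for deltas in rows[1:]:
        blocks.append([p + d for p, d in zip(blocks[-1], deltas)])
    return blocks
```

Delta coding keeps the signaled values small when neighboring blocks use similar adjustment parameters, which is the typical case.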
In operation, and still referring to
In an embodiment, and still referring to
Still referring to
In operation, and with continued reference to
Further referring to
With continued reference to
In some implementations, and still referring to
In some implementations, and still referring to
Some embodiments may include non-transitory computer program products (i.e., physically embodied computer program products) that store instructions, which when executed by one or more data processors of one or more computing systems, cause at least one data processor to perform operations herein.
Still referring to
For instance, encoder 600 may be configured to perform a single step or sequence repeatedly until a desired or commanded outcome is achieved; repetition of a step or a sequence of steps may be performed iteratively and/or recursively using outputs of previous repetitions as inputs to subsequent repetitions, aggregating inputs and/or outputs of repetitions to produce an aggregate result, reduction or decrement of one or more variables such as global variables, and/or division of a larger processing task into a set of iteratively addressed smaller processing tasks. Encoder 600 may perform any step or sequence of steps as described in this disclosure in parallel, such as simultaneously and/or substantially simultaneously performing a step two or more times using two or more parallel threads, processor cores, or the like; division of tasks between parallel threads and/or processes may be performed according to any protocol suitable for division of tasks between iterations. Persons skilled in the art, upon reviewing the entirety of this disclosure, will be aware of various ways in which steps, sequences of steps, processing tasks, and/or data may be subdivided, shared, or otherwise dealt with using iteration, recursion, and/or parallel processing.
With continued reference to
It is to be noted that any one or more of the aspects and embodiments described herein may be conveniently implemented using one or more machines (e.g., one or more computing devices that are utilized as a user computing device for an electronic document, one or more server devices, such as a document server, etc.) programmed according to the teachings of the present specification, as will be apparent to those of ordinary skill in the computer art. Appropriate software coding can readily be prepared by skilled programmers based on the teachings of the present disclosure, as will be apparent to those of ordinary skill in the software art. Aspects and implementations discussed above employing software and/or software modules may also include appropriate hardware for assisting in the implementation of the machine executable instructions of the software and/or software module.
Such software may be a computer program product that employs a machine-readable storage medium. A machine-readable storage medium may be any medium that is capable of storing and/or encoding a sequence of instructions for execution by a machine (e.g., a computing device) and that causes the machine to perform any one of the methodologies and/or embodiments described herein. Examples of a machine-readable storage medium include, but are not limited to, a magnetic disk, an optical disc (e.g., CD, CD-R, DVD, DVD-R, etc.), a magneto-optical disk, a read-only memory “ROM” device, a random-access memory “RAM” device, a magnetic card, an optical card, a solid-state memory device, an EPROM, an EEPROM, and any combinations thereof. A machine-readable medium, as used herein, is intended to include a single medium as well as a collection of physically separate media, such as, for example, a collection of compact discs or one or more hard disk drives in combination with a computer memory. As used herein, a machine-readable storage medium does not include transitory forms of signal transmission.
Such software may also include information (e.g., data) carried as a data signal on a data carrier, such as a carrier wave. For example, machine-executable information may be included as a data-carrying signal embodied in a data carrier in which the signal encodes a sequence of instructions, or portion thereof, for execution by a machine (e.g., a computing device) and any related information (e.g., data structures and data) that causes the machine to perform any one of the methodologies and/or embodiments described herein.
Examples of a computing device include, but are not limited to, an electronic book reading device, a computer workstation, a terminal computer, a server computer, a handheld device (e.g., a tablet computer, a smartphone, etc.), a web appliance, a network router, a network switch, a network bridge, any machine capable of executing a sequence of instructions that specify an action to be taken by that machine, and any combinations thereof. In one example, a computing device may include and/or be included in a kiosk.
Processor 704 may include any suitable processor, such as without limitation a processor incorporating logical circuitry for performing arithmetic and logical operations, such as an arithmetic and logic unit (ALU), which may be regulated with a state machine and directed by operational inputs from memory and/or sensors; processor 704 may be organized according to Von Neumann and/or Harvard architecture as a non-limiting example. Processor 704 may include, incorporate, and/or be incorporated in, without limitation, a microcontroller, microprocessor, digital signal processor (DSP), Field Programmable Gate Array (FPGA), Complex Programmable Logic Device (CPLD), Graphical Processing Unit (GPU), general purpose GPU, Tensor Processing Unit (TPU), analog or mixed signal processor, Trusted Platform Module (TPM), a floating-point unit (FPU), and/or system on a chip (SoC).
Memory 708 may include various components (e.g., machine-readable media) including, but not limited to, a random-access memory component, a read only component, and any combinations thereof. In one example, a basic input/output system 716 (BIOS), including basic routines that help to transfer information between elements within computer system 700, such as during start-up, may be stored in memory 708. Memory 708 may also include (e.g., stored on one or more machine-readable media) instructions (e.g., software) 720 embodying any one or more of the aspects and/or methodologies of the present disclosure. In another example, memory 708 may further include any number of program modules including, but not limited to, an operating system, one or more application programs, other program modules, program data, and any combinations thereof.
Computer system 700 may also include a storage device 724. Examples of a storage device (e.g., storage device 724) include, but are not limited to, a hard disk drive, a magnetic disk drive, an optical disc drive in combination with an optical medium, a solid-state memory device, and any combinations thereof. Storage device 724 may be connected to bus 712 by an appropriate interface (not shown). Example interfaces include, but are not limited to, SCSI, advanced technology attachment (ATA), serial ATA, universal serial bus (USB), IEEE 1394 (FIREWIRE), and any combinations thereof. In one example, storage device 724 (or one or more components thereof) may be removably interfaced with computer system 700 (e.g., via an external port connector (not shown)). Particularly, storage device 724 and an associated machine-readable medium 728 may provide nonvolatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data for computer system 700. In one example, software 720 may reside, completely or partially, within machine-readable medium 728. In another example, software 720 may reside, completely or partially, within processor 704.
Computer system 700 may also include an input device 732. In one example, a user of computer system 700 may enter commands and/or other information into computer system 700 via input device 732. Examples of an input device 732 include, but are not limited to, an alpha-numeric input device (e.g., a keyboard), a pointing device, a joystick, a gamepad, an audio input device (e.g., a microphone, a voice response system, etc.), a cursor control device (e.g., a mouse), a touchpad, an optical scanner, a video capture device (e.g., a still camera, a video camera), a touchscreen, and any combinations thereof. Input device 732 may be interfaced to bus 712 via any of a variety of interfaces (not shown) including, but not limited to, a serial interface, a parallel interface, a game port, a USB interface, a FIREWIRE interface, a direct interface to bus 712, and any combinations thereof. Input device 732 may include a touch screen interface that may be a part of or separate from display 736, discussed further below. Input device 732 may be utilized as a user selection device for selecting one or more graphical representations in a graphical interface as described above.
A user may also input commands and/or other information to computer system 700 via storage device 724 (e.g., a removable disk drive, a flash drive, etc.) and/or network interface device 740. A network interface device, such as network interface device 740, may be utilized for connecting computer system 700 to one or more of a variety of networks, such as network 744, and one or more remote devices 748 connected thereto. Examples of a network interface device include, but are not limited to, a network interface card (e.g., a mobile network interface card, a LAN card), a modem, and any combination thereof. Examples of a network include, but are not limited to, a wide area network (e.g., the Internet, an enterprise network), a local area network (e.g., a network associated with an office, a building, a campus or other relatively small geographic space), a telephone network, a data network associated with a telephone/voice provider (e.g., a mobile communications provider data and/or voice network), a direct connection between two computing devices, and any combinations thereof. A network, such as network 744, may employ a wired and/or a wireless mode of communication. In general, any network topology may be used. Information (e.g., data, software 720, etc.) may be communicated to and/or from computer system 700 via network interface device 740.
Computer system 700 may further include a video display adapter 752 for communicating a displayable image to a display device, such as display device 736. Examples of a display device include, but are not limited to, a liquid crystal display (LCD), a cathode ray tube (CRT), a plasma display, a light emitting diode (LED) display, and any combinations thereof. Display adapter 752 and display device 736 may be utilized in combination with processor 704 to provide graphical representations of aspects of the present disclosure. In addition to a display device, computer system 700 may include one or more other peripheral output devices including, but not limited to, an audio speaker, a printer, and any combinations thereof. Such peripheral output devices may be connected to bus 712 via a peripheral interface 756. Examples of a peripheral interface include, but are not limited to, a serial port, a USB connection, a FIREWIRE connection, a parallel connection, and any combinations thereof.
The foregoing has been a detailed description of illustrative embodiments of the invention. Various modifications and additions can be made without departing from the spirit and scope of this invention. Features of each of the various embodiments described above may be combined with features of other described embodiments as appropriate in order to provide a multiplicity of feature combinations in associated new embodiments. Furthermore, while the foregoing describes a number of separate embodiments, what has been described herein is merely illustrative of the application of the principles of the present invention. Additionally, although particular methods herein may be illustrated and/or described as being performed in a specific order, the ordering is highly variable within ordinary skill to achieve methods, systems, and software according to the present disclosure. Accordingly, this description is meant to be taken only by way of example, and not to otherwise limit the scope of this invention.
Exemplary embodiments have been disclosed above and illustrated in the accompanying drawings. It will be understood by those skilled in the art that various changes, omissions and additions may be made to that which is specifically disclosed herein without departing from the spirit and scope of the present invention.
This application is a continuation of international application PCT/US23/32030 filed on Sep. 6, 2023, and titled “Image and Video Coding with Adaptive Quantization for Machine-based Applications,” which in turn claims priority to U.S. Provisional Application Ser. No. 63/404,272 filed on Sep. 7, 2022, titled “Image and Video Coding System with a Machine-based Adaptive Quantization.”
| Number | Date | Country |
|---|---|---|
| 63404272 | Sep 2022 | US |

|  | Number | Date | Country |
|---|---|---|---|
| Parent | PCT/US2023/032030 | Sep 2023 | WO |
| Child | 19070043 |  | US |