The present invention is generally directed to the compression of frames in a group of pictures (GOP) using a video encoder.
Video compression algorithms have historically been implemented in an encoder using software running on a processor or on dedicated hardware with a combination of firmware and hardware components.
An encoder may be a device, circuit, transducer, software program, or algorithm that converts information (i.e., data) from one format or code to another. Encoding may be performed for the purposes of standardization, speed, secrecy, security, or saving memory space by reducing file size. A bitrate control mechanism of the encoder monitors the incremental file size being generated, compares it to the requested target bitrate, and makes adjustments at both small and large scales as necessary. This may generally be implemented by setting bit budgets and using various metrics within a frame to set the quantization level being used on a per macroblock (MB), slice, or frame basis.
On processors and in firmware of hardware encoders, these changes may generally be made with a very short feedback time, since the processing is performed in a mostly serial fashion, one MB after another. For example, when beginning the quantization on one MB, the results of all of the previous MBs in a frame are generally known.
The goal of a bitrate control algorithm is to generate a specific quantization level (Qp) for each MB that provides a near uniform and optimal distortion level, (frame-to-frame and within a frame), stays within the dictated bit budget, and keeps the video stream in compliance with buffer limits for overflow and underflow.
A video encoding method and a video encoder are described for processing frames in a group of pictures (GOP). A difference between a bit budget of a selected frame in the GOP and an estimated number of bits consumed by the selected frame is determined. Quantization parameter (Qp) values assigned to coefficients of macroblocks (MBs) in the selected frame are adjusted if the difference does not fall within a tolerance. A bit budget to the GOP may be assigned or adjusted based on a target bitrate. A bit budget may be assigned to each unprocessed frame in the GOP. Spatial activity may be calculated for each MB in the selected frame, and a bit budget and quantization may be assigned for each MB in the selected frame based on the spatial activity. The number of bits consumed per MB in the selected frame may be approximated based on zero and non-zero coefficients of the MB. Quantization may be performed on each MB in the selected frame using the Qp values. The Qp values may be filtered.
A video encoder may comprise a memory configured to store at least one group of pictures (GOP), and a processor configured to determine a difference between a bit budget of a selected frame in the GOP and an estimated number of bits consumed by the selected frame, and adjust Qp values assigned to coefficients of MBs in the selected frame if the difference does not fall within a tolerance.
A computer-readable storage medium may be configured to store a set of instructions used for manufacturing a semiconductor device. The semiconductor device may comprise a memory configured to store at least one GOP, and a processor configured to determine a difference between a bit budget of a selected frame in the GOP and an estimated number of bits consumed by the selected frame, and adjust Qp values assigned to coefficients of MBs in the selected frame if the difference does not fall within a tolerance. The instructions may be Verilog data instructions or hardware description language (HDL) instructions.
A more detailed understanding may be had from the following description, given by way of example in conjunction with the accompanying drawings wherein:
A processor, (e.g., a graphics processing unit (GPU), computer processing unit (CPU), and the like), may be configured to rapidly manipulate and alter memory in such a way so as to accelerate the building of images in a frame buffer intended for output to a display. Modern processors may be efficient at manipulating computer graphics, and their highly parallel structure makes them effective for implementing algorithms where processing of large blocks of data may be performed in parallel.
Compute shaders may be used to perform video compression while enhancing the speed of algorithms performed by the processor. A processor, (or an array of processors), may contain many compute engines such as compute shaders, (e.g., hundreds or thousands), and is thus termed a massively parallel (MP) architecture.
A data parallel programming model used in an MP machine may assign each processing unit to work on different data but with the same instructions. In video compression, for example, each 16×16 pixel MB may be processed at once so that adjacent MBs do not know the results of their neighboring MBs before they begin their processing.
In practice, a processing unit may not actually have that many processors, depending on the size of the video image and the capacity of the processing unit being used. However, the programming model allows one to assume that the MBs execute in one or more groups until all groups have completed. Every MB may begin processing at once and all may complete processing at the same time. Therefore, it may not be possible to adjust the processing of one MB based on the results of other MBs, as is done in conventional serial processing algorithms.
The primary problem is one of feedback on the quantity of bits being consumed from the first MB to the last (Mth) MB in the frame 105, as shown in
One solution to this problem is to divide the frame 105 into a plurality of slices, gather aggregate data in-between slices, and apply this feedback to the quantization levels used in future slices. This may work up to a point, but limits the total amount of parallelism. Another (not mutually exclusive) solution is to iteratively arrive at a final solution.
In accordance with one embodiment, the entire frame 105 may be processed using the video and image compression procedure 300 of
Entropy encoding is the final procedure of the overall encoding process, whereby all of the encoding decisions are converted and the actual bits that go into the stream are encoded. This procedure provides the actual numbers of bits. Prior to doing this final procedure, various estimates may be used to reduce the number of bits required. The tentative solution may use one Qp value and estimate how many bits may be consumed. If it is close to the allocated number of bits, then final entropy processing may be implemented. If the estimate falls outside of the expected range, then Qp may be adjusted and the estimate may be performed again, perhaps iterating several times.
As shown in
As shown in
As shown in
After the array of per MB Qp levels has been generated (in the iterative stages above), a filter may optionally be run on the Qp values to minimize the changes in Qp, thus attempting to maintain constant or lower distortion while consuming the same or fewer bits. This may be accomplished by taking into account the bit cost in the stream for making a change to the Qp versus the benefit of making the change.
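A minimal form of such a filter is sketched below: a Qp change between consecutive MBs is kept only if it exceeds a threshold, reflecting that signaling a delta-Qp itself costs bits in the stream. The threshold value and the consecutive-MB scan order are assumptions for illustration.

```python
# Hypothetical smoothing pass over the per-MB Qp array: suppress small
# Qp changes whose signaling cost likely outweighs their benefit.
def filter_qp(qp_values, min_delta=2):
    """Carry the previous Qp forward unless the change is worth signaling."""
    filtered = [qp_values[0]]
    for q in qp_values[1:]:
        filtered.append(q if abs(q - filtered[-1]) >= min_delta else filtered[-1])
    return filtered
```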
When the frame has completed the entropy encoding stage, the final true sum of bits consumed is tallied and compared to the budget for that frame. This feedback may be used to make two adjustments on future frames. First, the bit budget model may be adjusted, if required. Second, the overall video sequence buffer model is adjusted to ensure that the bits of the frames are maintained within the required limits.
Since the processing of frames may have a deep pipeline, the feedback, (per frame and per GOP), may not be immediate, and thus the iterative per frame bit budget allocation and closeness to this budget are important. However, video scene content may change suddenly, so a method is needed for allocating some reserved number of additional bits. If there is a sudden change, such as a cut to a new scene, there may be a need for more bits because the motion estimation from prior frames may produce poor results. Thus, detection of a new scene may be used as a trigger to allow using some reserve bits.
An enhancement may provide a budget for each frame in two parts, (e.g., 90% may be free to be completely used, and the remainder (10%) may only be used after the first iteration of the quantization, if needed). Any of these reserved bits that are not used may then be carried over to the next frame in the pipeline.
The accumulation of these reserved or standby bits over time needs to be large enough to handle, approximately, the expected worst case condition, i.e., a frame that is suddenly more difficult than expected and that may occur roughly once every 50 or 100 frames, whereas the normally budgeted allocation may always be fully used within a GOP.
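The two-part budget with carryover may be sketched as follows. The 90/10 split, the accounting, and the function name are assumptions for illustration.

```python
# Illustrative split of a frame's bit budget into a freely usable
# portion and a reserve; unused reserve bits carry over to later frames.
def split_budget(frame_budget, carried_reserve=0, free_fraction=0.9):
    """Return (free_bits, reserve_bits) for one frame. The reserve,
    including any carryover, may only be tapped after the first
    quantization iteration shows it is needed."""
    free = int(frame_budget * free_fraction)
    reserve = frame_budget - free + carried_reserve
    return free, reserve
```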
Each iteration of the proxy quantization may be implemented based on a reduced resolution image, or based on a statistical sampling of the MBs, thus reducing the computation time. After the quantization is determined to be final, the full resolution image may be quantized. The size of the reduced (i.e., compressed) resolution image may be optimized.
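The statistical-sampling variant may be sketched as below: only a sampled subset of MBs is estimated, and the measured cost is scaled up to the full frame. The sampling rate, uniform random sampling, and linear scaling are assumptions; other sampling schemes could equally apply.

```python
import random

# Illustrative proxy estimate: evaluate a random sample of MBs and scale
# the result to the whole frame, reducing computation per iteration.
def sampled_bit_estimate(mbs, estimate_mb, sample_fraction=0.1, seed=0):
    """Estimate total frame bits from a random sample of MBs.

    estimate_mb is a caller-supplied per-MB bit estimator.
    """
    rng = random.Random(seed)
    n = max(1, int(len(mbs) * sample_fraction))
    sample = rng.sample(mbs, n)
    return sum(estimate_mb(mb) for mb in sample) * len(mbs) / n
```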
The processor 502 may include a CPU, a GPU, a CPU and GPU located on the same die, or one or more processor cores, wherein each processor core may be a CPU or a GPU. The memory 504 may be located on the same die as the processor 502, or may be located separately from the processor 502. The memory 504 may include a volatile or non-volatile memory, for example, random access memory (RAM), dynamic RAM, or a cache.
The storage 506 may include a fixed or removable storage, for example, hard disk drive, solid state drive, optical disk, or flash drive. The input devices 508 may include a keyboard, a keypad, a touch screen, a touch pad, a detector, a microphone, an accelerometer, a gyroscope, a biometric scanner, or a network connection, (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals). The output devices 510 may include a display, a speaker, a printer, a haptic feedback device, one or more lights, an antenna, or a network connection, (e.g., a wireless local area network card for transmission and/or reception of wireless IEEE 802 signals).
The input driver 552 communicates with the processor 502 and the input devices 508, and permits the processor 502 to receive input from the input devices 508. The output driver 554 communicates with the processor 502 and the output devices 510, and permits the processor 502 to send output to the output devices 510.
Although features and elements are described above in particular combinations, each feature or element can be used alone without the other features and elements or in various combinations with or without other features and elements. The apparatus described herein may be manufactured by using a computer program, software, or firmware incorporated in a computer-readable storage medium for execution by a general purpose computer or a processor. Examples of computer-readable storage media include a read only memory (ROM), a random access memory (RAM), a register, cache memory, semiconductor memory devices, magnetic media such as internal hard disks and removable disks, magneto-optical media, and optical media such as CD-ROM disks and digital versatile disks (DVDs).
Embodiments of the present invention may be represented as instructions and data stored in a computer-readable storage medium. For example, aspects of the present invention may be implemented using Verilog, which is a hardware description language (HDL). When processed, Verilog data instructions may generate other intermediary data, (e.g., netlists, GDS data, or the like), that may be used to perform a manufacturing process implemented in a semiconductor fabrication facility. The manufacturing process may be adapted to manufacture semiconductor devices (e.g., processors) that embody various aspects of the present invention.
Suitable processors include, by way of example, a general purpose processor, a special purpose processor, a conventional processor, a digital signal processor (DSP), a plurality of microprocessors, a graphics processing unit (GPU), an accelerated processing unit (APU), a DSP core, a controller, a microcontroller, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), any other type of integrated circuit (IC), and/or a state machine, or combinations thereof.