With respect to encoding and compression of video data, it is known that encoders generally rely only on information they can cull from an input stream of images (or, in the case of a transcoder, from a compressed bitstream) to inform the various processes (e.g., frame-type determination) and devices (e.g., a rate controller) that constitute operation of a video encoder. This information can be computationally expensive to derive, and it may fail to provide the video encoder with the cues needed to generate an optimal encoding in an efficient manner.
Embodiments of the present invention can use measurements and/or statistics metadata provided by an image-capture system to supplement selection or revision of coding parameters by an encoder. An encoder can receive a video sequence together with associated metadata and may code the video sequence into a compressed bitstream. The coding process may include initial parameter selections made according to a coding policy, and revision of a parameter selection according to the metadata. In some embodiments, various coding decisions and information associated with the compressed bitstream may be passed to a transcoder, which may use the coding decisions and other information, in addition to the metadata originally provided by the image-capture system, to supplement decisions associated with transcoding operations. The scheme may reduce the complexity of the generated bitstream(s) and increase the efficiency of the coding process(es) while maintaining the perceived quality of the video sequence when it is recovered at a decoder. Thus, the bitstream(s) may be transmitted with less bandwidth, and the computational burden on both the encoder and the decoder may be lessened.
The preprocessor 110 (as shown in phantom) optionally receives the metadata M1 from the metadata sensor(s) and images (i.e., the video sequence) from the camera 105. The preprocessor 110 may preprocess the set of images using the metadata M1 prior to coding. The preprocessed images may form a preprocessed video sequence that may be received by the encoder 120. The preprocessor 110 also may generate a second set of metadata M2, which may be provided to the encoder 120 to supplement selection or revision of a coding parameter associated with a coding operation.
The encoder 120 may receive as its input the video sequence from the camera 105 or, if the preprocessor 110 is used, the preprocessed video sequence. The encoder 120 may code the input video sequence as coded data according to a coding process. Typically, such coding exploits spatial and/or temporal redundancy in the input video sequence and generates coded video data that is bandwidth-compressed as compared to the input video sequence. Such coding further involves selection of coding parameters, such as quantization parameters and the like, which are transmitted in a channel as part of the coded video data and are used during decoding to generate a recovered video sequence. The encoder 120 may receive the metadata M1, M2 and may select coding parameters based, at least in part, on the metadata. It will be appreciated that typically an encoder works together with a rate controller to make various coding decisions, as is shown in
The coded video data buffer 130 may store the coded bitstream before transferring it to a channel, i.e., a transmission medium that carries the coded bitstream to a decoder. Channels typically include storage devices such as optical, magnetic or electrical memories and communications channels provided, for example, by communications networks or computer networks.
In an embodiment, the encoding system 100 may include a pair of pipelined encoders 120, 140 (as shown in
The encoding operations carried out by the encoding system 100 may be reversed by the decoding system 150, which may include a receive buffer 180, a decoder 170 and a postprocessor 160. Each unit may perform the inverse of its counterpart in the encoding system 100, ultimately approximating the video sequence received from the camera 105. The postprocessor 160 may receive the metadata M1 and/or the metadata M2, and use this information to select or revise a postprocessing parameter associated with a postprocessing operation (as detailed below). The decoder 170 and the postprocessor 160 may include other blocks (not shown) that perform various processes to match or approximate coding processes applied at the encoding system 100.
The rate controller 240 may be used to manage the bit budget of the bitstream, for example, by keeping the number of bits available per frame under a prescribed, though possibly varying, threshold. To this end, the rate controller 240 may make coding parameter assignments by, for example, assigning prediction modes for frames and/or assigning quantization parameters for pixel blocks within frames. The rate controller 240 may include a bitrate estimation unit 250, a frame-type assignment unit 260 and a metadata processing unit 270. The bitrate estimation unit 250 may estimate the number of bits needed to encode a particular frame at a particular quality, and the frame-type assignment unit 260 may determine what prediction type (e.g., I, P, B, etc.) should be assigned to each frame.
The metadata processing unit 270 may receive the metadata M1 associated with each frame, analyze it, and then may send the information to the bitrate estimation unit 250 or the frame-type assignment unit 260, where it may alter quantization-parameter or frame-type assignments. The rate controller 240, and more specifically the metadata processing unit 270, may analyze metadata one frame at a time or, alternatively, may analyze metadata for a plurality of contiguous frames in an effort to detect a pattern. Additionally, the rate controller 240 may contain a cache (not shown) for holding various metadata values in memory so that they can be compared relative to one another. As is known, various compression processes base their selection of coding parameters on other inputs and, therefore, the rate controller 240 may receive inputs and generate outputs other than those shown in
The metadata M1 may be generated by the image-capture device or an apparatus external to the image-capture device, such as, for example, a boom arm on which the image-capture device is mounted. When the metadata M1 is generated by the image-capture device, it may be calculated or derived by the device or come from the device's image sensor processor (ISP). For each image in the video sequence, the metadata M1 may include, for example, exposure time (i.e., a measure of the amount of light allowed to hit the image sensor), digital/analog gain (generally an indication of noise level, which may comprise an exposure value plus an amplification value), aperture value (which generally determines the amount and angle of light allowed to hit the image sensor), luminance (which is a measure of the intensity of the light hitting the image sensor and which may correspond to the perceived brightness of the image/scene), ISO (which is a measure of the image sensor's sensitivity to light), white balance (which generally is an adjustment used to ensure neutral colors remain neutral), focus information (which describes whether the light from the object being filmed is well-converged; more generally, it is the portion of the image that appears sharp to the eye), brightness, physical motion of the image-capture device (via, for example, an accelerometer), etc.
Additionally, certain metadata may be considered singly or in combination with other metadata. For example, exposure time, digital/analog gain, aperture value, luminance, and ISO may be considered as a single value or score in determining the parameters to be used by certain preprocessing or encoding operations.
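By way of a non-limiting illustration, the following sketch shows one way several such metadata fields might be collapsed into a single low-light/noise score; the field names, normalization constants and equal weighting are assumptions made here for illustration only and are not prescribed by this description.

```python
# Hypothetical sketch: collapse several capture-metadata fields into one
# "low-light/noise score" in [0.0, 1.0]. Field names and weights are assumed.
def low_light_score(exposure_time_s, analog_gain, aperture_f, luminance, iso):
    # Normalize each field against an assumed "bright daylight" reference.
    exposure_term = min(exposure_time_s / 0.033, 1.0)   # ~1/30 s or longer
    gain_term     = min(analog_gain / 8.0, 1.0)          # high gain => more noise
    aperture_term = min(aperture_f / 8.0, 1.0)           # small aperture => less light
    luma_term     = 1.0 - min(luminance / 255.0, 1.0)    # dark scene => higher score
    iso_term      = min(iso / 1600.0, 1.0)               # high ISO => more noise
    # Equal weighting here purely for illustration.
    return (exposure_term + gain_term + aperture_term + luma_term + iso_term) / 5.0

print(low_light_score(0.033, 6.0, 2.0, 40, 800))  # dim indoor scene -> roughly 0.67
```

A preprocessor or rate controller could then key its parameter choices off this single score rather than off each field individually.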
At block 410, one or more of the images optionally may be preprocessed (as shown in phantom), wherein the video sequence may be converted into a preprocessed video sequence. “Preprocessing” refers generally to operations that condition pixels for video coding, such as, for example, denoising, scaling, color balancing, effects, packaging each frame into pixel blocks or macroblocks, etc. As at block 420, where the video sequence is encoded, the preprocessing stage may take into account the received metadata M1. More specifically, a preprocessing parameter associated with a preprocessing operation may be selected or revised according to the metadata associated with the video sequence.
As an example of preprocessing according to the metadata M1, consider denoising. Generally, denoising filters attempt to remove noise artifacts from source video sequences prior to the video sequences being coded. Noise artifacts typically appear in source video as small aberrations in the video signal within a short time duration (perhaps a single pixel in a single frame). Denoising filters can be controlled during operation by varying the strength of the filter as it is applied to video data. When the filter is applied at a relatively low level of strength (i.e., the filter is considered “weak”), the filter tends to allow a greater percentage of noise artifacts to propagate through the filter uncorrected than when the filter is applied at a relatively high level of strength (i.e., when the filter is “strong”). A relatively strong denoising filter, however, can induce image artifacts for portions of a video sequence that do not include noise.
According to an embodiment of the invention, the value of a preprocessing parameter associated with the strength of a denoising filter can be determined by the metadata M1. For example, the luminance and/or ISO values of an image may be used to control the strength of the denoising filter; in low-light conditions, the strength of the denoising filter may be increased relative to the strength of the denoising filter in bright conditions.
The denoiser may be a temporal denoiser, which may generate an estimate of global motion across frames (e.g., based on a sum of absolute differences between successive frames) that may be used to affect future coding operations; also, the combination of exposure and gain metadata M1 may be used to determine a noise estimate for the image, which noise estimate may affect operation of the temporal denoiser. At least one benefit of using such metadata to control the strength of the denoising filter is that it may provide more effective noise elimination, which can improve coding efficiency by eliminating high-frequency image components while at the same time maintaining appropriate image quality.
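A minimal sketch of the denoising control described above follows; the thresholds, the strength scale and the form of the noise estimate are assumptions chosen for illustration rather than values taken from this description.

```python
# Illustrative sketch (not the specification's algorithm): choose a denoiser
# strength from luminance/ISO metadata and derive a crude noise estimate
# from exposure and gain. All thresholds and scales are assumed.
def denoise_strength(luminance, iso):
    if luminance < 50 or iso >= 800:       # low light: noise dominates
        return 0.9                          # relatively strong filter
    if luminance < 120 or iso >= 400:
        return 0.5
    return 0.2                              # bright scene: keep the filter weak

def noise_estimate(exposure_time_s, gain):
    # Longer integration combined with high gain generally implies more sensor noise.
    return exposure_time_s * gain

strength = denoise_strength(luminance=35, iso=1600)       # -> 0.9 (strong)
sigma    = noise_estimate(exposure_time_s=0.05, gain=8.0)  # -> 0.4 (arbitrary units)
print(strength, sigma)
```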
As another example of preprocessing according to the metadata M1, consider scaling of the video sequence. As is well known, scaling is the process of converting a first image/video representation at a first resolution into a second image/video representation at a second resolution. For example, a user may want to convert high-definition (HD) video captured by his camera into a VGA (640×480) version of the video.
When scaling, there inherently are choices as to which scaling filters (and associated parameters) to use. Scaling generally must contend with a relatively high level of high-frequency information in the image, which can affect the choice of these filters and parameters. Various metadata M1 (e.g., focus information) can be used to select a preprocessing parameter associated with a filter operation. Similarly, if in-device scaling occurs (via, e.g., binning, line-skipping, etc.), such information can be used by the pre/postprocessor. In-device scaling may insert artifacts into the image, which artifacts may be searched for by the preprocessor (via, e.g., edge detection), and the size, frequency, etc. of the artifacts may be used to determine which scaling filters and coefficients to use, as may knowledge of the type of scaling performed (e.g., if it is known that the image was not binned, only line-skipped, then a relatively heavy filter may be used to compensate for any aliasing artifacts).
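The following is a hedged sketch of how a preprocessor might choose a scaling filter from in-device-scaling metadata; the filter names, tap counts and thresholds are hypothetical and merely illustrate the kind of decision described above.

```python
# Hypothetical sketch: pick a downscaling filter based on metadata describing
# any in-device scaling already performed. Filter names/taps are illustrative.
def choose_scaler(in_device_scaling, focus_score=None):
    if in_device_scaling == "line_skipped":
        # Line skipping tends to alias; use a heavier low-pass filter.
        return {"filter": "lanczos", "taps": 8, "prefilter_sigma": 1.2}
    if in_device_scaling == "binned":
        # Binning already low-passed the image; a lighter filter suffices.
        return {"filter": "bilinear", "taps": 2, "prefilter_sigma": 0.0}
    # No in-device scaling: let focus metadata hint at high-frequency content.
    heavy = focus_score is not None and focus_score > 0.8
    return {"filter": "lanczos" if heavy else "bicubic",
            "taps": 6 if heavy else 4,
            "prefilter_sigma": 0.5 if heavy else 0.0}

print(choose_scaler("line_skipped"))  # heavier filtering to mask aliasing
```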
Preprocessing may be used to decrease coding complexity at the encoding stage. For example, if the dynamic range of the video sequence (or, rather, of the images that make up the video sequence) is known, then it can be reduced during the preprocessing stage such that the encoding process is easier. Additionally, the preprocessing stage itself may generate metadata M2, which may be used by the encoder (or a decoder, transcoder, etc., as discussed below), in which case the metadata M2 generated by the preprocessing stage may be multiplexed with the metadata M1 received with the original video sequence or may be stored/received separately.
Generally, increasing brightness is a difficult situation to code for, and an image-capture device may artificially attempt to normalize brightness (i.e., keep it within a predetermined range) by, for example, modifying the aperture of the optics system and the integration time of the image sensor. However, during dynamic changes, the aperture/integration control may lag behind the image sensor. In such a situation, if, for example, the metadata M1 indicates that the image-capture device is relatively still over the respective frames, and the only things that really are changing are the aperture/integration controls as the camera attempts to adjust to the new steady-state operational parameters, then a preprocessor may attempt to further normalize brightness across the respective frames.
At block 420, an encoder may code the input video sequence into a coded bitstream according to a video coding policy. At least one of the coding parameters that make up the video coding policy may be selected or revised according to the metadata, which may include the metadata M2 generated at the preprocessing stage (as shown in phantom), and the metadata M1 associated with the original video sequence. Examples of the parameters whose values may be selected or revised by the metadata include bitrates, frame types, quantization parameters, etc.
As an example of how the coding at block 420 may use the metadata M1 to select certain of its parameters, consider metadata M1 describing motion of the image-capture device, which can be used, for example, to select quantization parameters and/or bitrates for various portions of the video sequence.
In both cases, a moving camera is likely to acquire video sequences with a relatively high proportion of blurred image content due to the motion. Use of relatively high quantization parameters and/or low target bitrates likely will cause the respective portion to be coded at a lower quality than other portions where a quantization parameter is lower or a target bitrate is higher. This coding policy may induce a higher number of coding errors into the “moving” portion, but the errors may not affect perceptual quality due to the blurred image content in the source image(s).
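As a rough illustration of this policy, the sketch below biases the quantization parameter upward and the target bitrate downward when accelerometer metadata reports significant camera motion; the threshold and offset values are assumptions, not values taken from this description.

```python
# Minimal sketch: coarsen quantization and lower the target bitrate while the
# camera-motion metadata exceeds a threshold. All constants are assumed.
def motion_bias(default_qp, default_bitrate, motion_magnitude,
                motion_threshold=1.5, qp_offset=4, bitrate_scale=0.7):
    if motion_magnitude > motion_threshold:   # camera moving: content likely blurred
        return default_qp + qp_offset, default_bitrate * bitrate_scale
    return default_qp, default_bitrate

qp, rate = motion_bias(default_qp=26, default_bitrate=2_000_000, motion_magnitude=2.3)
print(qp, rate)  # 30 1400000.0 -- coarser quantization for the blurred portion
```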
As another example of how coding parameters may be adjusted according to the metadata, consider metadata M1 that describes focus information, which may indicate that the camera actually is in the act of focusing over a plurality of frames. In this case, and generally without sacrificing perceptual quality, the encoder may encode with less quality/bandwidth the frames occurring during the “unfocused” phase than those occurring where focus has been set or “locked,” and may adjust quantization parameters, etc., accordingly.
A rate controller may select coding parameters based on a focus score delivered by the camera. The focus score may be provided directly by the camera as a pre-calculated value or, alternatively, may be derived by the rate controller from a plurality of values provided by the camera, such as, for example, aperture settings, the focal length of the image-capture device's lens, etc. A low focus score may indicate that image content is unfocused, whereas a higher focus score may indicate that image content is in focus. When the focus score is low, the rate controller may increase quantization parameters over default values provided by a default coding scheme. As discussed, higher quantization parameters generally provide greater compression, but they can lower the perceived quality of a recovered video sequence. However, for video sequences with low focus scores, the reduced quality may not be as perceptible because the image content is unfocused.
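A minimal sketch of such a focus-score-driven adjustment appears below, assuming a normalized score in [0, 1] (with 1 indicating locked focus) and illustrative offsets; neither the range nor the offsets are prescribed by this description.

```python
# Sketch of a focus-score-driven QP adjustment. Score range and offsets are assumed.
def qp_from_focus(default_qp, focus_score):
    if focus_score < 0.3:        # clearly hunting for focus
        return default_qp + 6    # spend fewer bits; blur hides the loss
    if focus_score < 0.7:        # partially focused
        return default_qp + 2
    return default_qp            # focus locked: keep the default quality

for score in (0.1, 0.5, 0.9):
    print(score, qp_from_focus(default_qp=24, focus_score=score))  # 30, 26, 24
```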
As another example, changes in exposure can be used to, for example, select or revise parameters associated with the allocation of intra/inter-coding modes or the quantization step size. By analyzing certain of the metadata M1 (e.g., exposure, aperture, brightness, etc.) during the coding stage, particular effects may be detected, such as an exposure transition, or fade (e.g., when a portion of the video sequence moves from the ground to the sky). Given this information, a rate controller may, for example, determine where in a fade-like sequence a new I-frame will be used (e.g., at the first frame whose exposure value is halfway between the exposure values of the first and last frames in the fade-like sequence).
As discussed, exposure metadata may include indicators of the brightness, or luma, of each image. Generally, a camera's ISP will attempt to maintain the brightness at a constant level within upper and lower thresholds (labeled “acceptable” levels herein) so that the perceived quality of the images is reasonable, but this does not always work (e.g., when the camera is moving too quickly from shooting a very dark scene to shooting a very bright scene). By analyzing brightness metadata associated with some number of contiguous frames, a rate controller may determine a pattern (see, e.g.,
Together with the direction (i.e., light-to-dark, dark-to-light, etc.) of the brightness gradient over contiguous frames, a rate controller also may take into account various other metadata M1, such as, for example, movement of the camera. For example, if, over a number of successive frames, the brightness and camera motion are above or increasing beyond predetermined thresholds, then quantization parameters may be increased over the frames. The alteration of quantization parameters in this exemplary instance may be acceptable because it is likely that the image is 1) washed-out and 2) blurry; thus, the perceived quality of the encoded image likely will not suffer from a fewer number of bits being allocated to it.
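The following sketch illustrates one way a rate controller might classify the brightness gradient over a window of contiguous frames and combine it with motion metadata; the window handling, thresholds and QP offset are assumptions made for illustration.

```python
# Illustrative sketch: classify the brightness trend over a window of frames
# and raise QP when the scene is both washing out and moving. Constants assumed.
def brightness_trend(brightness_values):
    deltas = [b - a for a, b in zip(brightness_values, brightness_values[1:])]
    if all(d > 0 for d in deltas):
        return "dark_to_light"
    if all(d < 0 for d in deltas):
        return "light_to_dark"
    return "stable_or_mixed"

def qp_for_window(default_qp, brightness_values, motion_values,
                  bright_limit=200, motion_limit=1.5):
    trend = brightness_trend(brightness_values)
    washing_out = trend == "dark_to_light" and brightness_values[-1] > bright_limit
    moving = max(motion_values) > motion_limit
    return default_qp + 4 if (washing_out and moving) else default_qp

print(qp_for_window(26, [150, 180, 210, 235], [1.8, 2.0, 2.2, 1.9]))  # -> 30
```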
A rate controller also may use brightness to supplement frame-type decisions. Generally, frame types may be assigned according to a default group of pictures (GOP) pattern (e.g., I, B, B, B, P, I); in an embodiment, the GOP may be modified by information from the metadata M1 regarding brightness. For example, if, between two successive frames, the change in brightness is above a predetermined threshold, and the number of macroblocks in the first frame to be intra-coded is above a predetermined threshold (e.g., 70%), then the rate controller may “force” the first frame to be an I-frame even though some of its macroblocks may otherwise have been inter-coded.
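A compact sketch of this frame-type override follows; the brightness-delta and intra-macroblock thresholds are illustrative stand-ins for the predetermined thresholds mentioned above.

```python
# Sketch of the brightness-driven frame-type override, with assumed thresholds.
def maybe_force_idr(default_type, brightness_delta, intra_mb_fraction,
                    delta_threshold=40, intra_threshold=0.70):
    # If brightness jumped and most macroblocks would be intra-coded anyway,
    # promote the frame to an I-frame rather than keeping its default GOP type.
    if brightness_delta > delta_threshold and intra_mb_fraction > intra_threshold:
        return "I"
    return default_type

print(maybe_force_idr("P", brightness_delta=55, intra_mb_fraction=0.82))  # -> "I"
print(maybe_force_idr("P", brightness_delta=10, intra_mb_fraction=0.82))  # -> "P"
```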
Similarly, metadata M1 for a few buffered frames may be used to determine, for example, the amount by which a camera's auto-exposure adjustment is lagging behind; this measurement can be used to either preprocess the frames to correct the exposure, or indicate to the encoder certain characteristics of the incoming frames (i.e., that the frames are under/over-exposed) so that, for example, a rate controller can adjust various parameters accordingly (e.g., lower the bitrate, lower the frame rate, etc.).
As still another example, white balance adjustments/information from the camera may be used by the encoder to detect, for example, scene changes, which can help the encoder to allocate bits appropriately, determine when a new I-frame should be used, etc. For example, if the white balance adjustment for each of frames 10-30 remains relatively constant, but at frame 31 the adjustment changes dramatically, then that may be an indication that, for example, there has been a scene change, and so the rate controller may make frame 31 an I-frame.
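As an illustrative sketch only, white-balance gains might be compared against their recent history to flag a likely scene change; the representation of white balance as red/blue gain pairs, the distance metric and the threshold are all assumptions.

```python
# Hedged sketch: flag a likely scene change when the per-frame white-balance
# adjustment jumps relative to its recent history. Representation and threshold assumed.
def scene_change_from_wb(wb_history, wb_current, threshold=0.15):
    if not wb_history:
        return False
    # White balance expressed here as (red_gain, blue_gain) pairs.
    avg_r = sum(r for r, _ in wb_history) / len(wb_history)
    avg_b = sum(b for _, b in wb_history) / len(wb_history)
    r, b = wb_current
    return abs(r - avg_r) + abs(b - avg_b) > threshold

history = [(1.40, 2.10)] * 20                        # frames 10-30: nearly constant
print(scene_change_from_wb(history, (1.90, 1.60)))   # frame 31: large jump -> True
```

When the function returns True, the rate controller in this example would assign an I-frame at that point in the sequence.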
Like preprocessing and encoding, “postprocessing” also may take advantage of metadata associated with the original video sequence and/or the preprocessed video sequence. Once the coded bitstream has been decoded by a decoder into a video sequence, the video sequence optionally may be postprocessed by a postprocessor using the metadata. Postprocessing refers generally to operations that condition pixels for viewing. According to an embodiment, a postprocessing stage may use the metadata to improve such operations.
Many of the operations done in the preprocessing stage may be augmented or reversed in the postprocessing stage using the metadata M1 generated during image-capture and/or the metadata M2 generated during preprocessing. For example, if denoising is done at the preprocessing stage (as discussed above), information pertaining to the type and amount of denoising done can be passed to the postprocessing stage (as additional metadata M2) so that the noise can be added back to the image. Similarly, if the dynamic range of the images was reduced during preprocessing (as discussed above), then on the decode side the inverse can be done to bring the dynamic range back to where it was originally.
As another example, consider the case where the postprocessor has information from the preprocessor regarding how the image was downscaled, what filter coefficients were used, etc. In such a case, that information can be used by the postprocessor to compensate for image degradation possibly introduced by the scaling. Generally, preprocessing can introduce artifacts into the video, but by using metadata associated with the original video sequence and/or the preprocessing operations, decoding operations can be informed where and what these artifacts are and can attempt to correct them.
Postprocessing operations may be performed using metadata associated with the original video sequence (i.e., the metadata M1). For example, a postprocessor may use white balance values from the image-capture device to select postprocessing parameters associated with the color saturation and/or color balance of a decoded video sequence. Thus, many of the metadata-using processing operations described herein can be performed either in the preprocessing stage or the postprocessing stage, or both.
It will be appreciated that during encoding of the first bitstream, certain frames may be dropped, averaged, etc., potentially causing metadata to become out of sync with the frame(s) it purports to describe. Further, certain metadata may not be specific to a single frame, but may indicate a difference of a certain metric (e.g., brightness) between two or more frames. In light of these issues, the encoder 820 may include a metadata correlator 840 to map the metadata to the first bitstream (using, for example, time stamps, key frames, etc.) such that if the first bitstream is decoded by a transcoder, any metadata will be associated with the portion of the recovered video to which it belongs. The syncing information may be multiplexed together with the metadata or kept separate from it.
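A minimal sketch of such a metadata correlator is shown below, mapping each metadata record to the nearest surviving coded frame by timestamp; the data layout and the nearest-timestamp rule are assumed for illustration (key frames or other syncing information could be used instead, as noted above).

```python
# Minimal sketch of a metadata correlator: attach each metadata record to the
# nearest surviving coded frame by timestamp, so records describing dropped or
# averaged frames still map to the right portion of the stream. Layout assumed.
def correlate(metadata_records, coded_frame_timestamps):
    """metadata_records: list of (timestamp, payload); timestamps in seconds."""
    mapping = {}
    for ts, payload in metadata_records:
        nearest = min(coded_frame_timestamps, key=lambda f: abs(f - ts))
        mapping.setdefault(nearest, []).append(payload)
    return mapping

records = [(0.033, {"iso": 400}), (0.066, {"iso": 420}), (0.100, {"iso": 800})]
frames  = [0.033, 0.100]                      # the 0.066 frame was dropped
print(correlate(records, frames))             # 0.066 record attaches to 0.033
```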
The coding system 800 further may include a transcoder 850 to recode the coded video data according to a second coding protocol (block 930 of
The rate controller 880 further may select coding parameters based on the metadata M3 obtained by the first encoder 820. The metadata M3 may include information defining or indicating (Qp,bits) pairs, motion vectors, frame or sequence complexity (including temporal and spatial complexity), bit allocations per frame, etc. The metadata M3 also may include various candidate frames that the first encoding process held onto before making final decisions regarding which of the candidate frames would ultimately be used as reference frames, and information regarding intra/inter-coding mode decisions.
Additionally, the metadata M3 also may include a quality metric that may indicate to the transcoder the objective and/or perceived quality of the first bitstream. A quality metric may be based on various known objective video evaluation techniques that generally compare the source video sequence to the compressed bitstream, such as, for example, peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), video quality metric (VQM), etc. A transcoder may use or not use certain metadata based on a received quality metric. For example, if the quality metric indicates that a portion of the first bitstream is of excellent quality (either relative to other portions of the first bitstream, or absolutely with respect to, for example, the compression format of the first bitstream), then the transcoder may re-use certain metadata associated with coding parameters for that portion of the sequence (e.g., quantization parameters, bit allocations, frame types, etc.) instead of expending processing time and effort calculating those values again.
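The sketch below illustrates quality-metric gating of parameter re-use in a transcoder; the use of PSNR and the cutoff value are assumptions, and any of the metrics mentioned above (SSIM, VQM, etc.) could be substituted.

```python
# Sketch of quality-metric gating in a transcoder: re-use the first encoder's
# per-portion parameters only where the reported quality is high enough.
# The metric (PSNR in dB) and the cutoff are assumptions.
def reuse_or_recompute(portions, psnr_cutoff=40.0):
    decisions = []
    for portion in portions:
        if portion["psnr"] >= psnr_cutoff:
            # Trust the first pass: carry over QPs, frame types, bit allocations.
            decisions.append(("reuse", portion["qp"], portion["frame_type"]))
        else:
            decisions.append(("recompute", None, None))
    return decisions

portions = [{"psnr": 43.1, "qp": 24, "frame_type": "P"},
            {"psnr": 31.7, "qp": 38, "frame_type": "B"}]
print(reuse_or_recompute(portions))  # first portion re-used, second recomputed
```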
In an embodiment, the transcoder 850 may include a confidence estimator 890 that may adjust the rate controller's reliance on the metadata M1, M2, M3 obtained by the first coding operation.
In an embodiment, the confidence estimator 890 may examine a first set of metadata to determine whether the rate controller may consider other metadata to set coding parameters (block 1000 of
In another embodiment, the confidence estimator 890 may review camera metadata to determine whether the rate controller 880 may rely on or re-use quantization parameters from the first coding in the second coding. For example, if the confidence estimator 890 encounters coded video data with a relatively high quantization parameter (block 1020 of FIG. 10), and camera metadata M1 indicates a relatively low level of camera motion (block 1025 of
In a further embodiment, the confidence estimator 890 may review encoder metadata M3 to determine whether the rate controller 880 may rely on or re-use quantization parameters from the first encoding in the second coding. For example, if the confidence estimator 890 encounters coded video data with a relatively high quantization parameter (block 1040 of
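Because the conclusions of the confidence checks above are only partially stated in this passage, the following sketch is purely illustrative: it structures the inputs mentioned (first-pass quantization parameter, camera motion, first-pass complexity) into a coarse confidence label, and the thresholds as well as the mapping from checks to a re-use decision are assumptions.

```python
# Hedged sketch of a confidence estimator over first-pass and camera metadata.
# Thresholds and the decision mapping are assumed for illustration only.
def metadata_confidence(first_pass_qp, camera_motion, first_pass_complexity,
                        high_qp=36, low_motion=0.5, high_complexity=0.7):
    """Coarse confidence that first-pass parameters can be re-used (assumed logic)."""
    if first_pass_qp < high_qp:
        return "high"    # first pass was not heavily quantized; its choices look trustworthy
    if camera_motion <= low_motion and first_pass_complexity >= high_complexity:
        return "low"     # high QP despite a still camera and complex content: recompute
    return "medium"

print(metadata_confidence(first_pass_qp=40, camera_motion=0.2,
                          first_pass_complexity=0.9))  # -> "low"
```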
Coding system 800 may include a preprocessor (not shown) to condition pixels for encoding by encoder 870, and certain preprocessing operations may be affected by metadata. For example, if a quality metric indicates that the coding quality of a portion of the bitstream is relatively poor, then the preprocessor can blur the sequence in an effort to mask the sub-par quality. As another example, the preprocessor may be used to detect artifacts in the recovered video (as described above); if artifacts are detected and the metadata M1 indicates that the exposure of the frame(s) is in flux or varies beyond a predetermined threshold, then the preprocessor may introduce noise into the frame(s).
Coding system 800 may include a postprocessor (not shown), and certain postprocessing operations may be affected by metadata, including metadata M3 generated by the first encoder 820.
It will be appreciated that many of the types of metadata that may comprise the metadata M3 discussed above generally are discarded after the first encoding process has been completed, and therefore usually are not available to supplement decisions made by a transcoder. It also will be appreciated that having these types of metadata may be especially beneficial when the video processing environment is constrained in some manner, such as within a mobile device (e.g., a mobile phone, netbook, etc.). With regard to a mobile device, there may be limited storage space on the device such that the source video may be compressed into a first bitstream in real time, as it is being captured, and the source video discarded immediately after processing. In this case, the transcoder may not have access to the source video but may access the metadata to transcode the coded video data with higher quality than may be possible if transcoding the coded video data alone. A mobile device also may be limited in processing and/or battery power such that multiple start-from-scratch encodes of a video sequence (which may occur because the user wants to, for example, upload/send the video to various people, services, etc.) would tax the processor to such an extent that the battery would drain too quickly. It also may be the case that the device is constrained by channel limitations. For example, the user of the mobile phone may be in a situation where he needs to upload a video to a particular service, but effectively is prohibited from doing so because he is in an area with low-bandwidth Internet connectivity (e.g., an area covered only by EDGE, etc.); in this scenario, the user may be able to more quickly re-encode the video (because of the metadata associated with the video) to put it in a form that is more amenable to being uploaded via the “slow” network.
As another example, assume that a mobile phone has generated a first bitstream from a real-time capture, and that the first bitstream has been encoded at VGA resolution using the H.264 video codec, and then stored to memory within the phone, together with various metadata M1 realized during the real-time capture, and any metadata M3 generated by the H.264 coding process. At some later point in time, the user may want to upload or send the first bitstream to a friend or video-sharing service, which may require the first bitstream to be transcoded into a format accepted by the user/service; e.g., the user may wish to send the video to a friend as an MMS (Multimedia Messaging Service) message, which requires that the video be in a specific format and resolution, namely H.263/QCIF.
Assuming the source video was deleted during or after generation of the first bitstream (as a matter of practice or because, for example, the phone does not have enough storage capacity to keep both the source video and the first bitstream), the phone will need to decode the first bitstream in order to generate a recovered video sequence (i.e., some approximation of the original capture) that can be re-encoded in the new format. After the first bitstream (or a first portion of the first bitstream) has been decoded, the transcoder's encoder may begin to encode the recovered video into a second bitstream. The metadata M3 provided to the encoder's rate controller may include, for example, information indicating the relative complexity of the current or future frames, which may be used by the rate controller to, for example, assign a low quantization parameter to a frame that is particularly complex.
The various systems described herein may each include a storage component for storing machine-readable instructions for performing the various processes as described and illustrated. The storage component may be any type of machine-readable medium (i.e., one capable of being read by a machine) such as hard drive memory, flash memory, floppy disk memory, optically-encoded memory (e.g., a compact disk, DVD-ROM, DVD±R, CD-ROM, CD±R, holographic disk), a thermomechanical memory (e.g., scanning-probe-based data storage), or any other type of machine-readable (computer-readable) storage medium. Each computer system also may include addressable memory (e.g., random access memory, cache memory) to store data and/or sets of instructions that may be included within, or be generated by, the machine-readable instructions when they are executed by a processor on the respective platform. The methods and systems described herein may also be implemented as machine-readable instructions stored on or embodied in any of the above-described storage mechanisms.
Although the preceding text sets forth a detailed description of various embodiments, it should be understood that the legal scope of the invention is defined by the words of the claims set forth below. The detailed description is to be construed as exemplary only and does not describe every possible embodiment of the invention since describing every possible embodiment would be impractical, if not impossible. Numerous alternative embodiments could be implemented, using either current technology or technology developed after the filing date of this patent, which would still fall within the scope of the claims defining the invention. For example, in an embodiment, metadata M3 (as described with respect to
It should be understood that there exist implementations of other variations and modifications of the invention and its various aspects, as may be readily apparent to those of ordinary skill in the art, and that the invention is not limited by the specific embodiments described herein. It is therefore contemplated to cover any and all modifications, variations or equivalents that fall within the scope of the basic underlying principles disclosed and claimed herein.
The present application claims the benefit of U.S. Provisional application Ser. No. 61/184,780 filed Jun. 5, 2009, entitled “IMAGE ACQUISITION AND ENCODING SYSTEM.” The aforementioned application is incorporated herein by reference in its entirety.