MPEG is a standard for the compression, decompression, processing, and coded representation of moving pictures and audio. The MPEG-1, MPEG-2, and MPEG-4 standards are currently used to encode video into bit streams.
The MPEG standard promotes interoperability. An MPEG-compliant bit stream can be decoded and displayed by different platforms including, but not limited to, DVD/VCD, satellite TV, and personal computers running multimedia applications.
The MPEG standard leaves little latitude to optimize the decoding process. However, the MPEG standard leaves much greater latitude to optimize the encoding process. Consequently, many different encoder designs can generate compliant bit streams.
However, not all encoder designs produce the same quality bit stream. For example, bit allocation (or bit rate control) can play an important role in video quality. Encoders using different bit allocation schemes can produce bit streams of different quality. Poor bit allocation can result in bit streams of poor quality.
One challenge of designing a video encoder is producing high-quality bit streams from different types of inputs, such as video, still images, and a mixture of the two. This challenge becomes more complicated if different video clips are captured from different devices and have different characteristics. The output bit stream typically has a constant frame rate, as mandated by the compression standard, but the input video sequences might not have the same frame rate.
Encoding of still images poses an additional problem. When a still image is displayed on a television, the image quality tends to “oscillate.” For example, the image as initially displayed appears fuzzy, but then becomes sharper, goes back to fuzzy, and so forth.
It is desirable to produce high-quality, compliant bit streams from different types of multimedia having different characteristics.
According to one aspect of the present invention, a video bit stream having a constant frame rate is generated from an input having a frame rate that is different than the constant frame rate. Zero-motion difference frames are added to the bit stream to achieve the constant frame rate.
According to another aspect of the present invention, bit rate control includes using a state transition model to determine a noise masking factor for a frame; and assigning a number of bits as a function of the noise masking factor.
Other aspects and advantages of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the present invention.
As shown in the drawings for purposes of illustration, the present invention is embodied in the encoding of multimedia. The present invention is especially useful for generating bit streams from multimedia including a combination of still images and video clips. The bit streams are of high quality, and they can be made compliant. Encoded still images do not “oscillate” during display.
Audio can be handled separately. According to the MPEG standard, for instance, audio is coded separately and interleaved with the video.
Reference is made to FIG. 1, which illustrates a multimedia system 110 that receives an input including video clips and/or still images.
Different video clips can have different formats. Exemplary formats for the video clips include, without limitation, MPEG, DVI, and WMV. Different still images can have different formats. Exemplary formats for the still images include, without limitation, GIF, JPEG, TIFF, RAW, and bitmap.
The input may have a constant frame rate or a variable frame rate. For example, one video clip might have 30 frames per second, while another video clip has 10 frames per second. Other images might be still images.
The multimedia system 110 includes a converter 112 and an encoder 114. The converter 112 converts the input to a format expected by the encoder 114. For example, the converter 112 would ensure that still images and video are in the format expected by an MPEG-compliant encoder 114. This might include transcoding video and still images. The converter 112 would also ensure that the input is in a color space expected by the encoder 114. For example, the converter 112 might convert an image from the RGB color space to the YCbCr or YUV color space. The converter 112 might also change the picture size.
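By way of illustration, the following sketch (not part of the patent; the function name and the BT.601 full-range convention are assumptions) shows one way a converter such as the converter 112 could map RGB pixels to YCbCr:

    import numpy as np

    def rgb_to_ycbcr(rgb):
        """Convert an HxWx3 uint8 RGB image to YCbCr (BT.601, full range)."""
        rgb = rgb.astype(np.float32)
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y  =  0.299 * r + 0.587 * g + 0.114 * b          # luma
        cb = -0.169 * r - 0.331 * g + 0.500 * b + 128.0  # blue-difference chroma
        cr =  0.500 * r - 0.419 * g - 0.081 * b + 128.0  # red-difference chroma
        return np.clip(np.stack([y, cb, cr], axis=-1), 0, 255).astype(np.uint8)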
The converter 112 supplies the converted input to the encoder 114. The converter 112 could also supply information about the input. The information might include input type (e.g., still image, video clip). If the input is a video clip, the information could also include frame rate of the video clip. If the input is a still image, the information could also include the duration for which the still image should be displayed. In the alternative, this information could be supplied to the encoder 114 via user input.
Additional reference is made to FIG. 2, which illustrates a method of encoding performed by the encoder 114.
If the frame rates match (block 212), which means that the input is a video clip, the encoder 114 performs motion analysis (block 213) and uses the motion analysis to reduce temporal redundancy in the frames (block 214). The motion analysis may be performed in a conventional manner. In addition to performing motion analysis, the encoder 114 may also analyze the content of each frame. The reason for analyzing scene content will be described later.
The temporal redundancy can be reduced by the use of independent frames and difference frames. An MPEG-compliant encoder, for example, would create groups of pictures. Each group of pictures (GOP) would start with an I-Frame (i.e., an independent frame), and would be followed by P-frames and B-frames. The P-frame is a difference frame that can show motion and pixel differences in a frame with respect to previous frames in its GOP. The B-frame is a difference frame that can show motion and pixel differences in a frame with respect to previous and future frames in its GOP.
If the frame rates do not match (block 212), the encoder determines the number of zero-motion difference frames that are needed to obtain the frame rate of a compliant bit stream (block 216). A zero-motion difference frame is a frame whose forward or backward motion vectors all have values of zero. If the input is a video clip having a frame rate of 10 frames per second (fps) and the bit stream frame rate is 30 fps, the encoder would determine that 20 zero-motion difference frames should be added for each second of video.
If the input is a video clip, the encoder 114 then reduces the temporal redundancy of the input (block 214). If necessary during this step, the encoder 114 can insert the zero-motion difference frames to achieve the constant frame rate. The zero-motion difference frames can be added before or after the temporal redundancy has been reduced. Consider an example in which an MPEG-compliant encoder receives frames of a 10 fps video clip. For each frame it receives, the encoder 114 could insert, on average, two P-frames indicating no motion and no pixel differences.
If the input is a still image, the encoder 114 does not need to perform motion analysis. Instead, the encoder 114 determines the duration over which the still image should be displayed (block 216) and adds the zero-motion difference frames to the bit stream (block 218). If the still image should be displayed for three seconds and the frame rate of the bit stream is 30 fps, the three seconds correspond to 90 frames; the image itself occupies one frame (an independent frame), so the encoder 114 determines that 89 zero-motion difference frames should be added to obtain the frame rate of the bit stream.
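The arithmetic for both cases can be summarized in a short sketch (the function names are hypothetical, for illustration only):

    def zero_motion_frames_per_second(clip_fps, stream_fps):
        """Zero-motion difference frames inserted per second of a video clip."""
        return stream_fps - clip_fps              # e.g., 30 - 10 = 20

    def zero_motion_frames_for_still(duration_s, stream_fps):
        """The still image itself occupies one frame (an independent frame);
        the rest of the display duration is filled with difference frames."""
        return duration_s * stream_fps - 1        # e.g., 3 * 30 - 1 = 89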
The zero-motion difference frames would indicate motion-compensated pixel differences having zero values (these frames are hereinafter referred to as zero-motion difference frames indicating zero pixel differences), unless it is desired to improve the visual quality of the independent frame. Zero-motion difference frames indicating zero pixel differences can be compressed better than zero-motion difference frames indicating non-zero pixel differences.
However, zero-motion difference frames indicating non-zero pixel differences can be used to improve the visual quality of the preceding I-frame. Suppose, for example, that the I-frame is assigned a sub-optimal number of bits prior to being placed in the bit stream. To improve the visual quality, the first several zero-motion difference frames following the I-frame would indicate non-zero pixel differences, progressively refining the displayed image. The remaining zero-motion difference frames would indicate zero pixel differences.
If encoding is performed according to the MPEG standard, P-frames are the preferred difference frames. However, B-frames could be used instead of, or in addition to, the P-frames.
Consider an example in which the input consists of a still image that should be displayed for five seconds. An MPEG encoder may encode the still image as six identical GOPs, with each GOP containing twenty-five frames (an I-frame followed by twenty-four zero-motion P-frames). If the zero-motion P-frames indicate zero pixel differences, each I-frame will be displayed without any oscillation or other distracting motion.
The GOPs may be made identical so as to conform to a pre-decided GOP size. However, the bit stream could be non-compliant, in which case the GOPs need not be identical. Also, a GOP is not limited to twenty-five frames; a GOP may contain an arbitrary number of frames.
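A minimal sketch of the GOP layout from the five-second example above (the helper function is hypothetical; "I" and "P" stand for the frame types):

    def still_image_gops(duration_s=5, stream_fps=30, gop_size=25):
        total = duration_s * stream_fps               # 5 * 30 = 150 frames
        gop = ["I"] + ["P"] * (gop_size - 1)          # I-frame + 24 zero-motion P-frames
        return [list(gop) for _ in range(total // gop_size)]

    assert len(still_image_gops()) == 6               # six identical GOPs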
After the temporal redundancy has been exploited and the proper frame rate has been achieved, the encoder 114 transforms the frames from their spatial domain representation to a frequency domain representation (block 220). The frequency domain representation contains transform coefficients. An MPEG encoder, for example, converts 8×8 pixel blocks of each frame to 8×8 blocks of DCT coefficients.
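As a sketch of this step (assuming SciPy is available; a production MPEG encoder would implement the DCT directly), an 8×8 pixel block can be transformed and inverse-transformed as follows:

    import numpy as np
    from scipy.fft import dctn, idctn

    block = np.arange(64, dtype=np.float32).reshape(8, 8) - 128.0  # sample 8x8 block
    coeffs = dctn(block, type=2, norm="ortho")    # 8x8 block of DCT coefficients
    recon = idctn(coeffs, type=2, norm="ortho")   # inverse DCT, for verification
    assert np.allclose(block, recon, atol=1e-3)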
The encoder 114 performs lossy compression by quantizing the transform coefficients in the transform coefficient blocks (block 222). The encoder 114 then performs lossless compression (e.g., entropy coding) on the quantized blocks (block 224). The compressed data is placed in the bit stream (block 226).
Reference is now made to FIG. 3, which illustrates a method of bit rate control performed during encoding.
At block 310, a quantizer step size is determined. The quantizer step size may be determined in a conventional manner. For example, a quantizer table could be used to determine the quantizer step size.
The quantizer step size may also be determined according to decoding buffer constraints. One of the constraints is overflow/underflow of a decoding buffer. During encoding, the encoder keeps track of the exact number of bits that will be in the decoding buffer (assuming that the encoding standard specifies the decoding buffer behavior, as is the case with MPEG). If the decoding buffer capacity is approached, the quantizer step size is reduced so a greater number of bits are pulled from the buffer to avoid buffer overflow. If an underflow condition is approached, the quantizer step size is increased so fewer bits are pulled from the decoding buffer. The encoder adjusts the step size to avoid these overflow and underflow conditions. The encoder can also perform bit stuffing to avoid buffer overflow.
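A simplified, hypothetical model of this bookkeeping (the thresholds and scaling constants are assumptions; MPEG's video buffering verifier is more detailed than this sketch):

    def adjust_step(step, fullness, capacity, frame_bits, bits_per_frame_period):
        """Track decoding-buffer fullness and nudge the quantizer step size."""
        fullness += bits_per_frame_period - frame_bits  # bits arrive; decoder pulls a frame
        if fullness > 0.9 * capacity:    # nearing overflow: spend more bits
            step *= 0.9                  # smaller step -> more bits pulled from buffer
        elif fullness < 0.1 * capacity:  # nearing underflow: spend fewer bits
            step *= 1.1                  # larger step -> fewer bits pulled
        return step, fullness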
A noise masking factor is selected for each frame (block 312). The noise masking factor is determined according to scene content, because the noise perceived by the human visual system varies with the content of the scene. In scenes with high texture and high motion, the human eye is less sensitive to noise, so fewer bits can be allocated to frames containing such content. Thus, the noise masking factor is assigned to achieve the highest visual quality at the target bit rate.
For example, a still image is assigned the highest noise masking factor (e.g., 1) so it can be displayed with the highest visual quality. Low motion video is assigned a lower noise masking factor (e.g., 0.7) than still images; high motion video is assigned a lower factor (e.g., 0.4) than low motion video, and scene changes are assigned the lowest factor (e.g., 0.3). Thus, more bits will be assigned to a still image than a scene change, given the same buffer constraints.
The noise masking factor is used to adjust the quantizer step size (block 314). For example, the quantizer step size can be scaled by the noise masking factor, dividing the step size by the factor so that a higher factor yields a smaller step size and, consequently, a larger number of bits.
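A minimal sketch of blocks 312 and 314, using the exemplary factors above (the division convention is an assumption chosen so that a higher factor yields a smaller step size and thus more bits):

    NOISE_MASKING = {
        "still_image":  1.0,   # highest visual quality, most bits
        "low_motion":   0.7,
        "high_motion":  0.4,
        "scene_change": 0.3,   # noise is masked the most, fewest bits
    }

    def adjusted_step(base_step, frame_class):
        return base_step / NOISE_MASKING[frame_class]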
The quantizer step sizes are used to generate the quantized coefficients (block 316). For example, a deadzone quantizer would use the step size as follows:

q = sgn(c) · ⌊|c| / Δ⌋

where sgn(c) is the sign of the transform coefficient c, Δ is the quantization step size, and q is the quantized transform coefficient.
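A minimal NumPy transcription of this quantizer (the bin-center reconstruction in dequantize is a common convention, not specified above):

    import numpy as np

    def deadzone_quantize(c, step):
        """q = sgn(c) * floor(|c| / step); coefficients with |c| < step
        fall in the deadzone and quantize to zero."""
        return np.sign(c) * np.floor(np.abs(c) / step)

    def dequantize(q, step):
        """Reconstruct at the center of each nonzero quantization bin."""
        return np.sign(q) * (np.abs(q) + 0.5) * step * (q != 0)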
Increasing the quantization step size can reduce image quality. If the quantizer step is increased for a still image (for example, to avoid buffer underflow), the number of bits assigned to the still image will be sub-optimal. Consequently, image quality of the still image will be reduced. To improve the quality of the still image, the encoder can add a few of the zero-motion difference frames indicating non-zero pixel differences.
A state transition model can be used to determine the noise masking factors. An exemplary state transition model 510 is illustrated in FIG. 5.

Reference is now made to FIG. 5. The state transition model 510 includes states corresponding to still images, low motion video, high motion video, and scene changes, with each state corresponding to a noise masking factor (e.g., the exemplary factors given above). Transitions between the states are made as the characteristics of the input change.
A state transition model according to the present invention is not limited to any particular number of states or transitions. However, increasing the number of states and transitions can increase the complexity of the state transition model.
The transitions can be determined in a variety of ways. As a first example, a transition could be determined from information identifying the input type (video or still image). This information may be ascertained by the encoder (e.g., by examining headers) or supplied to the encoder (e.g., via manual input).
As a second example, a transition could be determined by identifying the amount of noise that can be masked by the frames. For video clips, the encoder could determine the amount of motion from the motion vectors generated during motion analysis. The encoder could also examine scene content (e.g., the amount of texture). Changes in highly textured surfaces, for example, would not be readily perceptible to the human visual system. Therefore, a transition could be made to a state (e.g., high motion) corresponding to a lower noise masking factor.
Other models could have states corresponding to different texture amounts and different levels of noise. In general, the states can be defined by any relevant information that is related to the characteristics of the images and video.
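A hypothetical sketch of such a state transition model (the states and factors follow the examples above; the motion threshold is an assumption, for illustration only):

    class NoiseMaskingStateMachine:
        FACTORS = {"still_image": 1.0, "low_motion": 0.7,
                   "high_motion": 0.4, "scene_change": 0.3}

        def __init__(self):
            self.state = "still_image"

        def transition(self, is_still, is_scene_change, avg_motion):
            """Move to a new state based on frame information, and return
            the noise masking factor for the new state."""
            if is_still:
                self.state = "still_image"
            elif is_scene_change:
                self.state = "scene_change"
            elif avg_motion > 8.0:        # assumed threshold (pixels per frame)
                self.state = "high_motion"
            else:
                self.state = "low_motion"
            return self.FACTORS[self.state]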
Reference is now made to FIG. 6, which illustrates a hardware implementation of an encoder 610. The encoder 610 includes a processor 612.
The encoder 610 further includes a state machine 620, which implements a state transition model. The processor 612 supplies the different states to the state machine 620, and the state machine 620 supplies noise masking factors to a bit rate controller 622. The bit rate controller 622 uses the noise masking factors to adjust the quantizer step sizes, and a quantizer 624 uses the adjusted quantizer step sizes to quantize the transform coefficient blocks. Lossless compression is then performed by a variable length coder 626. A bit stream having a constant frame rate is provided on an output of the variable length coder (VLC) 626.
The encoder may be implemented as an ASIC. The bit rate controller 622, the quantizer 624 and the variable length coder 626 may be implemented as individual circuits.
The ASIC may be part of a machine that does encoding. For example, the ASIC may be on-board a camcorder or a DVD writer. The ASIC would allow real-time encoding. The ASIC may be part of a DVD player or any device that needs encoding of video and images.
Reference is now made to FIG. 7, which illustrates a computer 710 that is programmed with an encoding program 716.
The program 716 may be a standalone program or part of a larger program. For example, the program 716 may be part of a video editing program. The program 716 may be distributed via electronic transmission, via removable media (e.g., a CD) 718, etc.
The computer 710 can transmit the bit stream (B) to another machine (e.g., via a network 720), or store the bit stream (B) on a storage medium 730 (e.g., hard drive, optical disk). If the bit stream (B) is compliant, it can be decoded by a compliant decoder 740 of a playback device 742.
Although several specific embodiments of the present invention have been described and illustrated, the present invention is not limited to the specific forms or arrangements of parts so described and illustrated. Instead, the present invention is construed according to the following claims.