The present invention relates generally to video coding and, more particularly, to video editing.
Video editing capability is an increasingly requested feature in video playing and/or capturing devices. Transitional effects between different video-sequences, logo insertion and over-layering sequences are among the most widely used operations in editing. Video editing tools enable users to apply a set of effects on their video clips aiming to produce a functionally and aesthetically better representation of their video.
To apply video editing effects on video sequences, several commercial products exist. These software products are targeted mainly for the PC platform. Because processing power, storage and memory constraints are not an issue in the PC platform today, the techniques utilized in such video-editing products operate on the video sequences mostly in their raw formats in the spatial domain. With such techniques, the compressed video is first decoded and then the editing effects are introduced in the spatial domain. Finally, the video is again encoded. This is known as spatial domain video editing operation.
For devices with low resources in processing power, storage space, available memory and battery power, decoding a video sequence and re-encoding it are costly operations that take a long time and consume a lot of battery power. Many of the latest communication devices, such as mobile phones, communicators and PDAs, are equipped with video cameras, offering users the capability to shoot video clips and send them over wireless networks. It is advantageous and desirable to allow users of those communication devices to generate quality video at their terminals. The spatial domain video editing operation is not suitable in wireless cellular environments.
As mentioned above, most video effects are performed in the spatial domain in prior art. In the case of video blending (transitional effects for fading, etc.) between two or more sequences, for instance, video clips are first decompressed and then the effects are performed according to the following equation:
$$\tilde{V}(x,y,t) = \alpha_1 V_1(x,y,t) + \alpha_2 V_2(x,y,t) \tag{1}$$
where $\tilde{V}(x,y,t)$ is the edited sequence obtained from the original sequences $V_1(x,y,t)$ and $V_2(x,y,t)$, and $\alpha_1$ and $\alpha_2$ are two weighting parameters chosen according to the desired effect. Equation (1) is applied in the spatial domain to the various color components of the video sequence, depending on the desired effect.
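As an illustration of the spatial-domain blend of Equation (1), the following is a minimal sketch that assumes each decoded frame is a list of rows of pixel values for a single color component. The function names and the linear fade schedule in `crossfade_weights` are hypothetical and are not taken from the specification.

```python
def blend_spatial(v1, v2, a1, a2):
    """Equation (1): weighted sum of two decoded frames of equal size."""
    if len(v1) != len(v2) or any(len(r1) != len(r2) for r1, r2 in zip(v1, v2)):
        raise ValueError("frames must have identical dimensions")
    return [[a1 * p1 + a2 * p2 for p1, p2 in zip(r1, r2)]
            for r1, r2 in zip(v1, v2)]


def crossfade_weights(k, n):
    """Assumed linear fade: weights for transition frame k of n (k = 0..n-1)."""
    a2 = (k + 1) / (n + 1)
    return 1.0 - a2, a2


# Blend one row of two frames at the first of three transition frames.
a1, a2 = crossfade_weights(0, 3)
blended = blend_spatial([[100, 200]], [[0, 100]], a1, a2)
```

In the conventional approach described here, this blend runs on fully decoded pixels, so every output frame still has to pass through the encoder afterwards.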
Finally, the resulting edited image sequence is re-encoded. The major disadvantage of this approach is that it is computationally intensive, especially in the encoding part. The typical complexity ratio between generic encoders and decoders is approximately four. Using this conventional spatial-domain editing approach, all of the video frames coming right after the transition effect in the second sequence must be re-encoded.
Furthermore, users often repeat editing operations several times before the desired result is achieved. The repetition adds to the complexity of the editing operations, and requires more processing power. It is therefore important to develop efficient techniques that minimize the decoding and encoding operations and function in the compressed domain to perform such editing effects.
In order to perform efficiently, video compression techniques exploit spatial redundancy in the frames forming the video. First, the frame data is transformed to another domain, such as the Discrete Cosine Transform (DCT) domain, to decorrelate it. The transformed data is then quantized and entropy coded.
In addition, the compression techniques exploit the temporal correlation between the frames: when coding a frame, utilizing the previous, and sometimes the future, frame(s) offers a significant reduction in the amount of data to compress.
The information representing the changes in areas of a frame can be sufficient to represent a consecutive frame. This is called prediction and the frames coded in this way are called predicted (P) frames or Inter frames. As the prediction cannot be 100% accurate (unless the changes undergone are described in every pixel), a residual frame representing the errors is also used to compensate the prediction procedure.
The prediction information is usually represented as vectors describing the displacement of objects in the frames. These vectors are called motion vectors. The procedure to estimate these vectors is called motion estimation. The usage of these vectors to retrieve frames is known as motion compensation.
Prediction is often applied on blocks within a frame. The block sizes vary for different algorithms (e.g. 8×8 or 16×16 pixels, or 2n×2m pixels with n and m being positive integers). Some blocks change significantly between frames, to the point that it is better to send all the block data independently from any prior information, i.e. without prediction. These blocks are called Intra blocks.
In video sequences there are frames that are fully coded in Intra mode. For example, the first frame of the sequence is usually fully coded in Intra mode, because it cannot be predicted from an earlier frame. Frames that are significantly different from previous ones, such as when there is a scene change, are usually also coded in Intra mode. The choice of the coding mode is made by the video encoder.
The decoder 420 operates on a multiplexed video bit-stream (includes video and audio), which is demultiplexed to obtain the compressed video frames. The compressed data comprises entropy-coded-quantized prediction error transform coefficients, coded motion vectors and macro block type information. The decoded quantized transform coefficients c(x,y,t), where x,y are the coordinates of the coefficient and t stands for time, are inversely quantized to obtain transform coefficients d(x,y,t) according to the following relation:
$$d(x,y,t) = Q^{-1}(c(x,y,t)) \tag{3}$$
where $Q^{-1}$ is the inverse quantization operation. In the case of scalar quantization, equation (3) becomes
$$d(x,y,t) = QP \cdot c(x,y,t) \tag{4}$$
where $QP$ is the quantization parameter. In the inverse transform block, the transform coefficients are subject to an inverse transform to obtain the prediction error $E_c(x,y,t)$:
$$E_c(x,y,t) = T^{-1}(d(x,y,t)) \tag{5}$$
where $T^{-1}$ is the inverse transform operation, which is the inverse DCT in many compression techniques.
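The two decoding steps of Equations (4) and (5) can be sketched as follows. This is a hedged illustration only: it assumes a square block, a single scalar QP for the whole block, and an orthonormal DCT-II as the transform T. Real codecs define their own scaling and block sizes, and all function names here are hypothetical.

```python
import math


def dct_matrix(n):
    """Rows of the orthonormal DCT-II basis for length-n signals."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dequantize(c, qp):
    """Equation (4): d = QP * c for every coefficient."""
    return [[qp * v for v in row] for row in c]


def idct2(d):
    """Equation (5) with T taken to be a 2-D orthonormal DCT: E = C^T d C."""
    c = dct_matrix(len(d))
    return matmul(matmul(transpose(c), d), c)


def decode_prediction_error(c, qp):
    return idct2(dequantize(c, qp))


# A 4x4 block whose only nonzero quantized coefficient is the DC term.
coeffs = [[2, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
error_block = decode_prediction_error(coeffs, qp=8)
```

With this normalization, a DC-only block decodes to a flat block: the dequantized DC value 16 is spread evenly as 16/4 = 4 per pixel.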
If the block of data is an intra-type macro block, the pixels of the block are equal to $E_c(x,y,t)$. In fact, as explained previously, there is no prediction, i.e.:
$$R(x,y,t) = E_c(x,y,t). \tag{6}$$
If the block of data is an inter-type macro block, the pixels of the block are reconstructed by finding the predicted pixel positions using the received motion vectors $(\Delta x, \Delta y)$ on the reference frame $R(x,y,t-1)$ retrieved from the frame memory. The obtained predicted frame is:
$$P(x,y,t) = R(x+\Delta x, y+\Delta y, t-1) \tag{7}$$
The reconstructed frame is
$$R(x,y,t) = P(x,y,t) + E_c(x,y,t) \tag{8}$$
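The reconstruction rules of Equations (6) to (8) can be sketched per block as below. The sketch assumes integer-pel motion vectors, a reference frame indexed as `ref[y][x]`, and a displaced block that lies fully inside the reference frame. The function names and argument layout are hypothetical.

```python
def predict_block(ref, x0, y0, dx, dy, size):
    """Equation (7): copy the block displaced by (dx, dy) from the reference."""
    h, w = len(ref), len(ref[0])
    if not (0 <= x0 + dx and x0 + dx + size <= w
            and 0 <= y0 + dy and y0 + dy + size <= h):
        raise ValueError("displaced block falls outside the reference frame")
    return [[ref[y0 + dy + j][x0 + dx + i] for i in range(size)]
            for j in range(size)]


def reconstruct_block(ec, ref=None, x0=0, y0=0, dx=0, dy=0):
    """Intra block (ref is None): Equation (6). Inter block: Equation (8)."""
    if ref is None:
        return [row[:] for row in ec]
    size = len(ec)
    p = predict_block(ref, x0, y0, dx, dy, size)
    return [[p[j][i] + ec[j][i] for i in range(size)] for j in range(size)]


# Reference frame with pixel value 10*y + x, and a 2x2 residual block.
ref = [[10 * y + x for x in range(4)] for y in range(4)]
ec = [[1, -1], [0, 2]]
inter = reconstruct_block(ec, ref, x0=0, y0=0, dx=1, dy=2)
intra = reconstruct_block(ec)
```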
In general, blending, transitional effects, logo insertion and frame superposition are editing operations which can be achieved by the following operation:
$$\tilde{V}(x,y,t) = \sum_{i=1}^{N} \alpha_i(x,y,t)\, V_i(x,y,t) \tag{9}$$
where $\tilde{V}(x,y,t)$ is the edited sequence obtained from the $N$ original sequences $V_i(x,y,t)$, and $t$ is the time index at which the effect takes place. The parameter $\alpha_i(x,y,t)$ represents the modification to be applied to $V_i(x,y,t)$ for all pixels $(x,y)$ at the desired time $t$.
For the sake of simplicity, we consider the case when N=2, i.e., the editing is performed using two input sequences. Nevertheless, it is important to stress that all of the following editing discussion can be generalized to n arbitrary input frames to produce one edited output frame.
For N=2, Equation (9) can be written as Equation (1):
$$\tilde{V}(x,y,t) = \alpha_1(x,y,t)\, V_1(x,y,t) + \alpha_2(x,y,t)\, V_2(x,y,t)$$
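The general editing operation, of which the two-sequence form is the N=2 case, is a per-pixel weighted sum over the input sequences. The following sketch evaluates it for one time instant, assuming each frame and each weight map is a list of rows of equal size. The name `blend_n` is hypothetical.

```python
def blend_n(frames, weights):
    """Sum over i of alpha_i(x,y) * V_i(x,y) for one time instant t."""
    if not frames or len(frames) != len(weights):
        raise ValueError("need one weight map per input frame")
    height, width = len(frames[0]), len(frames[0][0])
    out = [[0.0] * width for _ in range(height)]
    for v, a in zip(frames, weights):
        for y in range(height):
            for x in range(width):
                out[y][x] += a[y][x] * v[y][x]
    return out


# Three 1x2 frames: the left pixel shows only frame 1,
# the right pixel is an even mix of frames 2 and 3.
frames = [[[10, 20]], [[30, 40]], [[50, 60]]]
weights = [[[1, 0]], [[0, 0.5]], [[0, 0.5]]]
edited = blend_n(frames, weights)
```

Because the weights depend on (x,y), the same routine covers uniform fades, position-dependent transitions and overlays restricted to a region of the frame.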
The present invention provides a method for compressed domain operation to achieve the desired editing effects, with reduced complexity, starting substantially at any frame (at any time t). The method, according to the present invention, offers the possibility of changing the effect, including regaining the original clip. In the editing device, according to the present invention, transform coefficients of a part of the video sequence are obtained from an encoder so that they can be combined with transform coefficients of another part of the video sequence, the transform coefficients of another video sequence or the transform coefficients indicative of a logo in order to achieve video effects, such as blending, sliding transition and logo insertion.
Thus, the first aspect of the present invention provides a method for editing a bitstream carrying video data indicative of a video sequence. The method comprises:
According to the present invention, the acquiring step includes:
According to the present invention, the modified data contain a plurality of quantized modified transform coefficients, and the modifying step includes changing the transform coefficients for providing a plurality of modified transform coefficients. The method further comprises:
According to the present invention, the method further comprises:
According to the present invention, one or both of the first and second weighting parameters are adjusted to achieve a blending effect, or a sliding transitional effect. The further data can be obtained from a memory device via a transform operation, or from the same or a different bitstream.
According to the present invention, the method further comprises:
According to the present invention, the method further comprises:
The second aspect of the present invention provides a video editing device for editing a bitstream carrying video data indicative of a video sequence. The device comprises:
According to the present invention, the acquiring module comprises:
According to the present invention, the transform coefficients are changed in the transform domain to become modified transform coefficients by the modification module, and the editing device further comprises:
According to the present invention, the editing device further comprises:
According to the present invention, the editing device further comprises:
The third aspect of the present invention provides a video coding system, which comprises:
The fourth aspect of the present invention provides an electronic device, which comprises:
The fifth aspect of the present invention provides a software product for use in a video editing device for editing a bitstream carrying video data indicative of a video sequence. The software product comprises:
The software product further comprises:
According to the present invention, the code for extracting comprises:
According to the present invention, the code for modifying comprises:
According to the present invention, the code for mixing comprises:
According to the present invention, the software product comprises:
According to the present invention, the software product comprises:
The present invention will become apparent upon reading the description taken in conjunction with
a is a block diagram showing an electronic device having a compressed-domain video editing device, according to the present invention.
b is a block diagram showing another electronic device having a compressed-domain video editing device, according to the present invention.
c is a block diagram showing yet another electronic device having a compressed-domain video editing device, according to the present invention.
d is a block diagram showing still another electronic device having a compressed-domain video editing device, according to the present invention.
The present invention is mainly concerned with transitional effects between different video sequences, logo insertion and overlaying of video sequences while the sequences are in compressed format. As such, the editing effects are applied to the video sequences without requiring full decoding and re-encoding. Thus, the present invention is concerned with blending and logo insertion operations in video editing. Blending is the operation of combining or joining sequences, overlaying for the entire frames or part of the frames in the sequences. Logo insertion is the operation of inserting a logo, which can be an image or graphic at a particular area of the frames in the video sequences.
Transition effect editing between two frames can be broken down into performing such operations between the corresponding macroblocks of these two frames. As explained above, macroblocks in compressed video are of two types: Intra and Inter. Hence, there are four different combinations for applying editing effects between the macroblocks. The following sections present how to achieve the above effects with each combination of these macroblocks.
In general, editing operations can happen on a video clip in a channel at one of its terminals. The edited video clip is outputted at the other terminal, as shown in
Blending of an Intra Block with an Intra Block
This operation in the spatial domain is performed as follows:
$$\tilde{I}(x,y,t) = \alpha_1(t) I_1(x,y,t) + \alpha_2(t) I_2(x,y,t)$$
For Intra frames, using the steps of the earlier section, we have,
$$\tilde{V}(x,y,t) = \alpha_1(t) E_1(x,y,t) + \alpha_2(t) E_2(x,y,t) \tag{10}$$
For Intra frames, using the steps of the earlier section, and after taking the transform of the frame after special effects, the same operations can be formulated as follows in the compressed domain:
$$\tilde{e}(x,y) = \alpha_1(t) d_1(x,y) + \alpha_2(t) d_2(x,y) \tag{11}$$
The transform domain approach significantly simplifies the blending operations, as can be seen from
It should be understood that it is possible to combine the inverse quantization, scaling and quantization blocks or to combine the scaling and quantization blocks into a single coding block.
This process is repeated for both luminance and chrominance components of the video bitstream.
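The intra-with-intra case relies on the linearity of the transform: scaling and summing the dequantized coefficients gives the same result as blending the pixels and transforming afterwards. The sketch below checks this numerically for one 4×4 block, assuming an orthonormal DCT-II as the transform T. All function names are hypothetical and the block values are arbitrary test data.

```python
import math


def dct_matrix(n):
    """Rows of the orthonormal DCT-II basis for length-n signals."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2(block):
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def blend_intra_intra(d1, d2, a1, a2):
    """Equation (11): weighted sum of two blocks of transform coefficients."""
    return [[a1 * p + a2 * q for p, q in zip(r1, r2)]
            for r1, r2 in zip(d1, d2)]


# Two arbitrary intra blocks and their coefficient-domain blend.
i1 = [[(3 * x + 5 * y) % 17 for x in range(4)] for y in range(4)]
i2 = [[(7 * x + 2 * y) % 13 for x in range(4)] for y in range(4)]
e_tilde = blend_intra_intra(dct2(i1), dct2(i2), 0.3, 0.7)
pixels = idct2(e_tilde)  # only needed here to verify the result
```

In an editing pipeline the inverse transform in the last line would not be run; the blended coefficients would go straight to quantization and entropy coding.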
Blending of an Inter Block with an Inter Block
Inter-frames are reconstructed by summing the residual error with the motion-compensated prediction,
$$V_1(x,y,t) = R_1(x+\Delta x_1, y+\Delta y_1, t-1) + E_1(x,y)$$
and similarly,
$$V_2(x,y,t) = R_2(x+\Delta x_2, y+\Delta y_2, t-1) + E_2(x,y)$$
The spatial domain representation of the dissolve effect is formulated as follows:
$$\tilde{V}(x,y,t) = \alpha_1(t)\bigl(R_1(x+\Delta x_1, y+\Delta y_1, t-1) + E_1(x,y)\bigr) + \alpha_2(t)\bigl(R_2(x+\Delta x_2, y+\Delta y_2, t-1) + E_2(x,y)\bigr)$$
$$\tilde{V}(x,y,t) = \alpha_1(t)E_1(x,y) + \alpha_2(t)E_2(x,y) + \alpha_1(t)R_1(x+\Delta x_1, y+\Delta y_1, t-1) + \alpha_2(t)R_2(x+\Delta x_2, y+\Delta y_2, t-1)$$
Note that $\tilde{V}(x+\Delta x_1, y+\Delta y_1, t-1)$ is the previously reconstructed frame after the fading effects, and it can be re-written in terms of $R(x+\Delta x_1, y+\Delta y_1, t-1)$, which represents the frame that would have been reconstructed if transitional effects were not applied:
$$\tilde{V}(x+\Delta x_1, y+\Delta y_1, t-1) = \alpha_1(t-1)R_1(x+\Delta x_1, y+\Delta y_1, t-1) + \alpha_2(t-1)R_2(x+\Delta x_1, y+\Delta y_1, t-1)$$
Then the prediction residual can be calculated by:
$$F(x,y,t) = \tilde{V}(x,y,t) - \tilde{V}(x+\Delta x_1, y+\Delta y_1, t-1)$$
$$F(x,y,t) = \alpha_1(t)E_1(x,y) + \alpha_2(t)E_2(x,y) + \alpha_1(t)R_1(x+\Delta x_1, y+\Delta y_1, t-1) + \alpha_2(t)R_2(x+\Delta x_2, y+\Delta y_2, t-1) - \alpha_1(t-1)R_1(x+\Delta x_1, y+\Delta y_1, t-1) - \alpha_2(t-1)R_2(x+\Delta x_1, y+\Delta y_1, t-1)$$
$$F(x,y,t) = \alpha_1(t)E_1(x,y) + \alpha_2(t)E_2(x,y) - \bigl(\alpha_1(t-1) - \alpha_1(t)\bigr)R_1(x+\Delta x_1, y+\Delta y_1, t-1) - \alpha_2(t-1)R_2(x+\Delta x_1, y+\Delta y_1, t-1) + \alpha_2(t)R_2(x+\Delta x_2, y+\Delta y_2, t-1) \tag{12}$$
Taking the transform of the new residual data, we have the blending effect of two inter blocks in the transform domain:
$$\tilde{e}(x,y) = \alpha_1(t)d_1(x,y) + \alpha_2(t)d_2(x,y) - \bigl(\alpha_1(t-1) - \alpha_1(t)\bigr)T\bigl(R_1(x+\Delta x_1, y+\Delta y_1, t-1)\bigr) - \alpha_2(t-1)T\bigl(R_2(x+\Delta x_1, y+\Delta y_1, t-1)\bigr) + \alpha_2(t)T\bigl(R_2(x+\Delta x_2, y+\Delta y_2, t-1)\bigr) \tag{13}$$
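The transform-domain result for two inter blocks can be sketched for a single block as follows. The sketch assumes an orthonormal DCT-II as the transform T and takes the three motion-compensated prediction blocks as inputs that have already been fetched from the reference frames. The function and parameter names are hypothetical: `r1_mv1` stands for R1 displaced by the first motion vector, `r2_mv1` for R2 displaced by the first motion vector, and `r2_mv2` for R2 displaced by the second.

```python
import math


def dct_matrix(n):
    """Rows of the orthonormal DCT-II basis for length-n signals."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def dct2(block):
    c = dct_matrix(len(block))
    ct = [list(col) for col in zip(*c)]
    return matmul(matmul(c, block), ct)


def blend_inter_inter(d1, d2, r1_mv1, r2_mv1, r2_mv2,
                      a1_t, a2_t, a1_prev, a2_prev):
    """New residual coefficients for one inter/inter block pair."""
    t_r1 = dct2(r1_mv1)    # T(R1) at motion vector 1
    t_r2a = dct2(r2_mv1)   # T(R2) at motion vector 1
    t_r2b = dct2(r2_mv2)   # T(R2) at motion vector 2
    n = len(d1)
    return [[a1_t * d1[j][i] + a2_t * d2[j][i]
             - (a1_prev - a1_t) * t_r1[j][i]
             - a2_prev * t_r2a[j][i]
             + a2_t * t_r2b[j][i]
             for i in range(n)] for j in range(n)]
```

Since every term is linear in the block data, the output equals the transform of the spatial-domain residual F(x,y,t), which gives a convenient check when testing an implementation.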
Blending of an Intra Block with an Inter Block
The spatial domain representation of dissolve effect can be formulated as follows:
$$\tilde{V}(x,y,t) = \alpha_1(t)E_1(x,y) + \alpha_2(t)\bigl(R_2(x+\Delta x_2, y+\Delta y_2, t-1) + E_2(x,y)\bigr),$$
or
$$\tilde{V}(x,y,t) = \alpha_1(t)E_1(x,y) + \alpha_2(t)E_2(x,y) + \alpha_2(t)R_2(x+\Delta x_2, y+\Delta y_2, t-1) \tag{14}$$
Since the output is an intra block, i.e., no prediction, the transform of the block is given by,
$$\tilde{e}(x,y,t) = \alpha_1(t)d_1(x,y) + \alpha_2(t)d_2(x,y) + \alpha_2(t)T\bigl(R_2(x+\Delta x_2, y+\Delta y_2, t-1)\bigr) \tag{15}$$
Equation (15) gives the result of blending an intra block with an inter block in the transform domain.
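A hedged sketch of Equation (15) for one block follows, again assuming an orthonormal DCT-II as the transform T. Only the prediction block of the second sequence needs a forward transform, because the first block is intra-coded. The names `blend_intra_inter` and `r2_mv2` (R2 displaced by the second motion vector) are hypothetical.

```python
import math


def dct_matrix(n):
    """Rows of the orthonormal DCT-II basis for length-n signals."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dct2(block):
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def blend_intra_inter(d1, d2, r2_mv2, a1_t, a2_t):
    """Equation (15): output intra coefficients for an intra/inter pair."""
    t_r2 = dct2(r2_mv2)
    n = len(d1)
    return [[a1_t * d1[j][i] + a2_t * (d2[j][i] + t_r2[j][i])
             for i in range(n)] for j in range(n)]
```

The inverse transform is included only so that the result can be compared against the spatial-domain blend of Equation (14) during testing.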
Blending of an Inter Block with an Intra Block
The spatial domain representation of dissolve effect is then formulated as follows:
$$\tilde{V}(x,y,t) = \alpha_1(t)\bigl(R_1(x+\Delta x_1, y+\Delta y_1, t-1) + E_1(x,y)\bigr) + \alpha_2(t)E_2(x,y),$$
or
$$\tilde{V}(x,y,t) = \alpha_1(t)E_1(x,y) + \alpha_2(t)E_2(x,y) + \alpha_1(t)R_1(x+\Delta x_1, y+\Delta y_1, t-1)$$
Again, $\tilde{V}(x+\Delta x_1, y+\Delta y_1, t-1)$ is the previously reconstructed frame after the fading effects and can be re-written in terms of $R(x+\Delta x_1, y+\Delta y_1, t-1)$, which represents the frame that would have been reconstructed if transition effects were not applied:
$$\tilde{V}(x+\Delta x_1, y+\Delta y_1, t-1) = \alpha_1(t-1)R_1(x+\Delta x_1, y+\Delta y_1, t-1) + \alpha_2(t-1)R_2(x+\Delta x_1, y+\Delta y_1, t-1)$$
The prediction residual can be calculated by:
$$F(x,y,t) = \tilde{V}(x,y,t) - \tilde{V}(x+\Delta x_1, y+\Delta y_1, t-1)$$
$$F(x,y,t) = \alpha_1(t)E_1(x,y) + \alpha_2(t)E_2(x,y) + \alpha_1(t)R_1(x+\Delta x_1, y+\Delta y_1, t-1) - \alpha_1(t-1)R_1(x+\Delta x_1, y+\Delta y_1, t-1) - \alpha_2(t-1)R_2(x+\Delta x_1, y+\Delta y_1, t-1)$$
$$F(x,y,t) = \alpha_1(t)E_1(x,y) + \alpha_2(t)E_2(x,y) - \bigl(\alpha_1(t-1) - \alpha_1(t)\bigr)R_1(x+\Delta x_1, y+\Delta y_1, t-1) - \alpha_2(t-1)R_2(x+\Delta x_1, y+\Delta y_1, t-1) \tag{16}$$
Taking the transform of the new residual data, we have the effect of blending an inter block with an intra block:
$$\tilde{e}(x,y) = \alpha_1(t)d_1(x,y) + \alpha_2(t)d_2(x,y) - \bigl(\alpha_1(t-1) - \alpha_1(t)\bigr)T\bigl(R_1(x+\Delta x_1, y+\Delta y_1, t-1)\bigr) - \alpha_2(t-1)T\bigl(R_2(x+\Delta x_1, y+\Delta y_1, t-1)\bigr) \tag{17}$$
Blending of an Inter Block with an Intra Block for the First Intra Frame
This is a special case of blending an inter block with an intra block, applied to the first intra frame. This case corresponds to $\alpha_2(t-1)=0$; the rest of the process follows the preceding analysis. Applying $\alpha_2(t-1)=0$ to Equation (17) gives the final residual coefficients in the transform domain:
$$\tilde{e}(x,y) = \alpha_1(t)d_1(x,y) + \alpha_2(t)d_2(x,y) - \bigl(\alpha_1(t-1) - \alpha_1(t)\bigr)T\bigl(R_1(x+\Delta x_1, y+\Delta y_1, t-1)\bigr) \tag{18}$$
These transform coefficients $\tilde{e}(x,y)$ are then quantized and sent to the entropy coder.
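Equations (17) and (18), followed by the quantization step, can be sketched as below. The sketch takes the transformed prediction blocks T(R1) and T(R2) as inputs (`t_r1`, `t_r2`), on the assumption that they were computed elsewhere. The quantizer is an assumed uniform rounding quantizer that mirrors the scalar inverse quantization of Equation (4); the specification does not state the forward quantizer, and all names are hypothetical.

```python
def blend_inter_intra(d1, d2, t_r1, a1_t, a2_t, a1_prev,
                      a2_prev=0.0, t_r2=None):
    """Equation (17); with a2_prev == 0 it reduces to Equation (18)."""
    if a2_prev != 0.0 and t_r2 is None:
        raise ValueError("t_r2 is required when a2_prev is nonzero")
    out = []
    for j, row in enumerate(d1):
        out_row = []
        for i in range(len(row)):
            v = (a1_t * d1[j][i] + a2_t * d2[j][i]
                 - (a1_prev - a1_t) * t_r1[j][i])
            if a2_prev != 0.0:
                v -= a2_prev * t_r2[j][i]
            out_row.append(v)
        out.append(out_row)
    return out


def quantize(e, qp):
    """Assumed uniform quantizer: c = round(e / QP)."""
    return [[round(v / qp) for v in row] for row in e]


d1 = [[40.0, 8.0]]
d2 = [[80.0, -16.0]]
t_r1 = [[100.0, 0.0]]

# First transition frame: the previous output frame was pure sequence 1.
first = blend_inter_intra(d1, d2, t_r1, a1_t=0.5, a2_t=0.5, a1_prev=1.0)
levels = quantize(first, qp=3)

# Later transition frame: the previous frame already contained sequence 2.
later = blend_inter_intra(d1, d2, t_r1, a1_t=0.5, a2_t=0.5, a1_prev=0.75,
                          a2_prev=0.25, t_r2=[[20.0, 4.0]])
```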
Similar to the process as shown in
It should be understood that it is possible to combine the inverse quantization, scaling and quantization blocks or to combine the scaling and quantization blocks into a single coding block.
This process is repeated for both luminance and chrominance components of the video bitstream.
In typical applications, the above-described process can be further improved. For example, it is possible to allow only the selected transition frames to go through the method of producing the edited bitstream 170, according to the present invention. For frames that are not transition frames, the operations can be skipped. This improvement process can be carried out by setting one of the weighting parameters in the above-described case to 0: α1(t)=0 or α2(t)=0. When α2(t)=0, there is no need to compute the transform coefficients 138′ of R2(x+Δx2,y+Δy2,t−1). Likewise, when α2(t−1)=0, there is no need to compute 137′, or R2(x+Δx1,y+Δy1,t−1). When α1(t−1)=α1(t), there is no need to compute the transform coefficients 138 of R1(x+Δx1,y+Δy1,t−1).
When α2(t−1)=α2(t), the transform coefficients of R2(x+Δx2,y+Δy2,t−1) and R2(x+Δx1,y+Δy1,t−1) need not be computed separately in different coding blocks, but they can be computed as follows. After computing both R2(x+Δx2,y+Δy2,t−1) and R2(x+Δx1,y+Δy1,t−1), the block R2(x+Δx2,y+Δy2,t−1) is subtracted from R2(x+Δx1,y+Δy1,t−1). The difference is subjected to transform coding in one of the transform blocks, such as the block 39′. The results are scaled by α2(t−1) or α2(t), and the scaled result is fed to the summing block 25. The remaining steps are identical to the process as described in conjunction with
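The shortcuts described above amount to deciding, per block, which forward transforms of prediction data are actually required. A minimal sketch of that decision for the inter-with-inter case follows. The string labels and the function name are hypothetical, and the exact floating-point comparisons assume that the weights are set from a schedule rather than computed with rounding error.

```python
def transforms_needed(a1_t, a2_t, a1_prev, a2_prev):
    """List the prediction-block transforms required for one inter/inter block."""
    needed = []
    # The R1 term is weighted by (a1_prev - a1_t) and vanishes when they match.
    if a1_prev != a1_t:
        needed.append("T(R1 at mv1)")
    if a2_prev == a2_t and a2_t != 0.0:
        # Equal weights: transform the difference of the two R2 blocks once.
        needed.append("T(R2 at mv1 - R2 at mv2)")
    else:
        if a2_prev != 0.0:
            needed.append("T(R2 at mv1)")
        if a2_t != 0.0:
            needed.append("T(R2 at mv2)")
    return needed


# Frame outside the transition, showing only sequence 1: nothing to transform.
outside = transforms_needed(1.0, 0.0, 1.0, 0.0)
# First transition frame of a cross-fade.
first = transforms_needed(0.5, 0.5, 1.0, 0.0)
# Weights held constant between consecutive frames.
steady = transforms_needed(0.5, 0.5, 0.5, 0.5)
```

In the first case the new residual reduces to the original coefficients of sequence 1, so the block can be passed through unchanged.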
Sliding Transitional Effect
The sliding transitional effect, also known as the “wipe” effect, makes one video clip slide into the other during the transition. This can be accomplished by assigning appropriate weights α(x,y,t) that depend on the spatial location (x,y) in the frame. For the frames V1(x,y,t), the weights are set to α1(x,y,t)=0 or α1(x,y,t)=1 in order to dictate which parts of frame 1 are to be included in the sliding transition. Likewise, setting α2(x,y,t)=0 or α2(x,y,t)=1 dictates which parts of frame 2 are to be included.
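One way to generate such binary weights is sketched below for a left-to-right wipe. It assumes that the two weight maps are complementary (each pixel comes from exactly one sequence) and that the boundary moves linearly across the frame over n transition frames. Both assumptions and the function name are illustrative and are not taken from the specification.

```python
def wipe_weights(width, height, k, n):
    """Binary weight maps for transition frame k of n (k = 0..n-1).

    Columns left of the moving boundary take sequence 2 (alpha2 = 1);
    the remaining columns keep sequence 1 (alpha1 = 1).
    """
    boundary = (width * (k + 1)) // (n + 1)
    a2 = [[1 if x < boundary else 0 for x in range(width)]
          for _ in range(height)]
    a1 = [[1 - v for v in row] for row in a2]
    return a1, a2


# An 8-pixel-wide row at the first and last of three transition frames.
a1_first, a2_first = wipe_weights(8, 1, 0, 3)
a1_last, a2_last = wipe_weights(8, 1, 2, 3)
```

In a block-based codec the boundary would usually be aligned to macroblock edges so that each block takes its data from a single sequence.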
Logo Insertion
Logo insertion can be accomplished in different ways. One way is logo insertion with blending, as shown in
In logo insertion with blending, the transform coefficients 120 from one of decoder (see
Logo insertion without blending is shown in
Superposition of Multiple Sequences or Frames
In the above-described editing processes, the number of input sequences, or N, is set to 2 (Equation 1). Similarly, the number of frames, or n, for use in motion prediction is also set to 2. However, the method of transform domain editing, according to the present invention, can be generalized such that the number of frames can be extended from n=2 to n=N, with N being a positive integer larger than 2.
The compressed-domain editing modules as shown in FIGS. 4 to 7 can be incorporated into conventional encoders and decoders as shown in
Each of the editing modules 5, 5′ and 7 can also be incorporated in an expanded decoder 620 as shown in
The editing module 8 of
The expanded encoder 610 can be integrated into an electronic device 710, 720 or 730 to provide compressed domain video editing capability to the electronic device, as shown separately in
It should be understood that video effect provided in blocks 22, 22′, as shown in
In sum, the present invention provides a method and device for editing a bitstream carrying video data in a video sequence. The editing procedure includes:
The transform coefficients can be modified by combining the transform coefficients with other transform coefficients by way of weighted summation, for example. The other transform coefficients can be obtained from the same video sequence or from a different video sequence. They can also be obtained from a memory via a transform module.
Many or all of these method steps can be carried out by software codes in a software program.
Thus, although the invention has been described with respect to a preferred embodiment thereof, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.
The present patent application is related to U.S. patent application Ser. No. 10/737,184, filed Dec. 16, 2003, assigned to the assignee of the present patent application. The present invention is also related to U.S. Patent Application Docket No. 944-001-128, assigned to the assignee of the present application, filed even date herewith.