The present invention relates to a method and an apparatus for video stabilization, which involve the detection, estimation, and removal of unwanted camera motion from incoming video images. The method and apparatus described herein can be incorporated into both the pre-processing and post-processing stages of a typical video processing system, including but not limited to video recording and playback systems.
Video taken from a shaky camera can be quite displeasing to an audience, often causing uneasiness and nausea. Shaky motions are, however, sometimes unavoidable. For example, an outdoor highly-mounted camera will shake under windy conditions. Also, personal video cameras are widely available, but it is not easy for a non-professional photographer to keep a camera stable while shooting. Failure to maintain the stability of a video camera while shooting introduces unwanted motions into the recorded video, which can result in poor-quality or distracting videos. Consequently, it is desirable to stabilize incoming video images before recording takes place, or to filter out the unwanted camera motions from a recorded video. Such tasks are collectively referred to as video stabilization.
Existing video stabilization techniques can be broadly classified into two main categories: optical stabilization techniques and signal processing techniques. Optical stabilization techniques stabilize the optics portion (e.g., lens/prism) of a camera by moving the optical parts in opposition to the camera's shaking. Such techniques generally result in little or no change in the efficiency of camera operation or the quality of the recorded image, but they add significant cost to the camera and require components that are prone to mechanical wear. Moreover, such techniques work only in the video recording phase and thus are not suitable for correcting a previously recorded video tainted by unwanted camera motions.
Signal processing techniques, on the other hand, generally involve analyzing a video signal to detect and estimate camera movements, and then transforming the video signal such that the effects of unwanted camera movements are compensated. With recent advances in digital signal processing equipment and techniques, signal processing techniques, in particular those involving digital signal processing, appear to be a more economical and reliable method for video stabilization. In addition, digital signal processing offers a feasible way of stabilizing a pre-recorded video.
Many prior works approach the problem of video stabilization from the digital signal processing perspective. However, they can suffer from a number of deficiencies. For example, they often rely heavily on the correct extraction of features such as edges, corners, etc., to identify reference areas/points for camera movement estimation, meaning that sophisticated and time-consuming feature extractors have to be incorporated into the video stabilization methods. Furthermore, even if reference areas/points can be identified by good feature extractors, these reference areas/points must then go through another selection process to filter out those that correspond to foreground objects, which usually contribute errors to the camera movement estimation. This selection process is essentially a segmentation process that separates foreground objects from the background. This, however, is not an easy task, as segmentation remains a fundamental research problem that has not been fully resolved. Finally, the extracted features have to be tracked across a number of frames before the camera movement can be estimated, which introduces further inaccuracies when the tracking techniques employed are not sufficiently robust. Related prior works can be found in U.S. Patent Application Publication No. 2003/0090593 to Xiong and in U.S. Pat. No. 5,053,876 to Blissett et al.
Some other prior works, such as those in U.S. Pat. No. 5,973,733 to Gove and U.S. Pat. No. 6,459,822 to Hathaway et al., rely on motion vectors derived from block matching techniques for the camera motion analysis. These prior works generally partition a video frame into non-overlapping blocks, and all of these blocks must be involved in the motion estimation process. Motion vectors obtained under this approach are reasonably good at representing the true motions in a scene under the assumption that motion discontinuities occur only at the regularly spaced block boundaries. However, this assumption is not likely to hold for a typical operational environment. Moreover, since all the blocks must be involved in the motion estimation process, this approach does not offer a flexible way to scale the computation requirement up or down to meet different computation constraints.
Additional references that include camera motion estimation methods, but are not particularly focused on video stabilization, include U.S. Pat. No. 6,710,844 to Han et al., U.S. Pat. No. 5,742,710 to Hsu et al., U.S. Pat. No. 6,349,114 to Mory, U.S. Pat. No. 6,738,099 to Osberger, U.S. Pat. No. 5,751,838 to Cox et al., U.S. Pat. No. 6,707,854 to Bonnet et al., and U.S. Pat. No. 5,259,040 to Hanna. Methods described in these references, however, suffer from the same aforementioned problems.
What is desired is a robust and efficient video stabilization method that neither depends on feature extraction and segmentation, nor relies on an assumption that cannot reliably hold under normal operating conditions.
Disclosed herein is a method and apparatus for providing computationally efficient and robust video stabilization that lends itself to both software and hardware implementations.
According to the present disclosure, a set of sample blocks can be derived from a set of sample points, for example where each sample block is centered about the associated sample point. Motion vectors for these blocks can then be generated by a block matching technique using a current frame and a reference frame. These sample block motion vectors, representing the motion of the associated sample points, can then be used for camera motion estimation.
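For illustration only, the following Python sketch shows one way the motion vector of a single sample block could be computed by an exhaustive sum-of-absolute-differences (SAD) block match; the block size, search range, and all function and variable names are assumptions made for this example rather than features of the disclosure.

```python
import numpy as np

def block_motion_vector(current, reference, point, block=8, search=7):
    """Exhaustive SAD block matching for one sample block (sketch).
    Assumes the sample point lies at least block//2 + search pixels
    from every frame border; images are indexed as [row, column]."""
    m, n = point                         # sample point: column m, row n
    h = block // 2
    cur_blk = current[n - h:n + h, m - h:m + h].astype(np.int32)
    best_sad, best_mv = None, (0, 0)
    for q in range(-search, search + 1):       # candidate row offset
        for p in range(-search, search + 1):   # candidate column offset
            ref_blk = reference[n + q - h:n + q + h,
                                m + p - h:m + p + h].astype(np.int32)
            sad = int(np.abs(cur_blk - ref_blk).sum())
            if best_sad is None or sad < best_sad:
                best_sad, best_mv = sad, (p, q)
    return best_mv                       # motion vector V = [p, q, 0]^T
```

Applying such a routine to each sample point would yield the set of motion vectors used in the estimation steps that follow.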
The camera motion can be estimated based on a valid subset of motion vectors according to the affine camera motion model. A subset can be considered valid if the associated sample blocks are not collinear. By considering different combinations of subsets of non-collinear blocks out of the set of sample blocks, a set of possible camera motion parameters can thus be obtained.
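A set of three sample points is non-collinear exactly when the determinant of the matrix formed from their homogeneous coordinates is nonzero. A minimal sketch of such a validity test (the function name and tolerance are assumptions):

```python
import numpy as np

def is_valid_subset(points, eps=1e-6):
    """Return True when three sample points are not collinear (sketch).
    points: three (m, n) coordinate pairs; eps is an assumed tolerance."""
    (m1, n1), (m2, n2), (m3, n3) = points
    # The homogeneous-coordinate matrix is singular iff the points
    # lie on a single line, which is exactly the invalid case.
    det = np.linalg.det(np.array([[m1, m2, m3],
                                  [n1, n2, n3],
                                  [1.0, 1.0, 1.0]]))
    return abs(det) > eps
```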
The final camera motion can be obtained by searching over the space of possible camera motion parameters. By evaluating, for each candidate motion parameter, the likelihood that it corresponds to unwanted motion according to a similarity measurement, the final estimated camera motion can be selected as the motion parameter that results in the best similarity measurement. The best similarity measurement is then compared against a threshold to determine whether the final motion parameter actually represents unwanted camera motion.
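Purely as an illustrative sketch of this selection step, assuming a SAD-style similarity measure, a list of candidate affine matrices, and a remap_frame helper of the kind sketched later in this description, the search could be organized as follows; the names and the threshold convention are assumptions:

```python
import numpy as np

def select_camera_motion(candidates, current, reference, threshold):
    """Choose, among candidate affine matrices, the one whose remapped
    current frame is most similar to the reference frame (sketch)."""
    best_score, best_A = None, None
    for A in candidates:                          # 3x3 affine matrices
        remapped = remap_frame(current, A)        # helper sketched later
        score = float(np.mean(np.abs(remapped.astype(np.int32)
                                     - reference.astype(np.int32))))
        if best_score is None or score < best_score:
            best_score, best_A = score, A
    if best_score is None or best_score > threshold:
        return np.eye(3)   # reject: identity has no remapping effect
    return best_A
```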
The final motion parameter can be used to remap the current frame to generate an output frame in order to eliminate the detected unwanted camera motion, thus resulting in stabilized video.
Embodiments are illustrated by way of example in the accompanying figures, in which like reference numbers indicate similar parts, and in which:
The current frame memory unit 20 is for receiving and storing the input video frames 12 as they are successively input to the video stabilization apparatus 10. At any instant, a video frame 12 stored in the memory unit 20 can be considered a current video frame, whereas a video frame stored in the memory unit 29, for example a previously stabilized video frame, can be considered a reference video frame. The current and reference video frames are provided to a motion estimation unit 22, where they can be used for deriving motion vectors based on a set of sample points 31 (shown in
Referring now also to
After the motion vectors of the set of sample blocks 32 are obtained, these motion vectors are provided to a camera motion estimation unit 23. The camera motion estimation unit 23 estimates the camera motion for combinations of valid subsets of the set of N motion vectors, a subset being valid under the condition that the associated sample points are not collinear. For example, consider a scenario where there are only four sample points 31 as depicted in
The camera motion estimation unit 23 can estimate the camera motion, for example according to the six parameter affine motion model. Denoting the coordinate of the pixel at the mth column and the nth row as [m, n, 1]T, and assuming all pixel movements are due to camera motion, then, for any pixel at coordinate [mi, ni, 1]T in the current frame, the pixel coordinate [m′i, n′i, 1]T of the corresponding pixel in the reference frame is as follows:
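$$\begin{bmatrix} m'_i \\ n'_i \\ 1 \end{bmatrix} = A \begin{bmatrix} m_i \\ n_i \\ 1 \end{bmatrix}, \qquad A = \begin{bmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ 0 & 0 & 1 \end{bmatrix} \tag{1}$$

where A is the affine transformation matrix whose six parameters characterize the camera motion (the parameter labels a_1 through a_6 are supplied here for exposition).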
Suppose there are three sample blocks 32 Bi, Bj and Bk centered about the sampling points with coordinates [mi, ni, 1]T, [mj, nj, 1]T and [mk, nk, 1]T respectively, whose corresponding motion vectors are Vi=[pi, qi, 0]T, Vj=[pj, qj, 0]T and Vk=[pk, qk, 0]T. The corresponding locations of these blocks in the reference frame are then [mi+pi, ni+qi, 1]T, [mj+pj, nj+qj, 1]T and [mk+pk, nk+qk, 1]T. If the assumption that the motion vectors represent the true motion of the sample points 31 holds, then:
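$$A \begin{bmatrix} m_i \\ n_i \\ 1 \end{bmatrix} = \begin{bmatrix} m_i+p_i \\ n_i+q_i \\ 1 \end{bmatrix}, \quad A \begin{bmatrix} m_j \\ n_j \\ 1 \end{bmatrix} = \begin{bmatrix} m_j+p_j \\ n_j+q_j \\ 1 \end{bmatrix}, \quad A \begin{bmatrix} m_k \\ n_k \\ 1 \end{bmatrix} = \begin{bmatrix} m_k+p_k \\ n_k+q_k \\ 1 \end{bmatrix} \tag{2}$$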
Collecting these three relations into a single matrix equation yields:
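$$A \begin{bmatrix} m_i & m_j & m_k \\ n_i & n_j & n_k \\ 1 & 1 & 1 \end{bmatrix} = \begin{bmatrix} m_i+p_i & m_j+p_j & m_k+p_k \\ n_i+q_i & n_j+q_j & n_k+q_k \\ 1 & 1 & 1 \end{bmatrix} \tag{3}$$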
Now, provided that the three sample blocks Bi, Bj and Bk are not collinear, the affine motion transformation matrix A can be derived as follows:
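$$A = \begin{bmatrix} m_i+p_i & m_j+p_j & m_k+p_k \\ n_i+q_i & n_j+q_j & n_k+q_k \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} m_i & m_j & m_k \\ n_i & n_j & n_k \\ 1 & 1 & 1 \end{bmatrix}^{-1} \tag{4}$$

The matrix being inverted is non-singular precisely when the three sample points are not collinear.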
From Equation (4), it can be seen that the camera motion, parameterized in the form of a transformation matrix, can be estimated from the motion vectors, provided that the three sample blocks 32 in consideration are not collinear. By using Equation (4), the camera motion estimation unit 23 can estimate a number of possible camera motion parameters by considering different subsets of the N non-collinear sample blocks 32.
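As an illustrative Python sketch of Equation (4) (the function and variable names are assumed), the affine matrix can be recovered with a single matrix inversion; candidate parameters can then be produced, for example, by iterating over combinations of three non-collinear sample blocks:

```python
import numpy as np

def estimate_affine(points, vectors):
    """Solve Equation (4): A = Y X^{-1}, where the columns of X are the
    homogeneous coordinates of three non-collinear sample points and the
    columns of Y are their displaced positions in the reference frame."""
    (mi, ni), (mj, nj), (mk, nk) = points
    (pi, qi), (pj, qj), (pk, qk) = vectors
    X = np.array([[mi, mj, mk],
                  [ni, nj, nk],
                  [1.0, 1.0, 1.0]])
    Y = np.array([[mi + pi, mj + pj, mk + pk],
                  [ni + qi, nj + qj, nk + qk],
                  [1.0, 1.0, 1.0]])
    # X is invertible exactly when the three points are not collinear.
    return Y @ np.linalg.inv(X)
```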
Referring again to
On the other hand, rather than constructing a complete remapped image, it is also possible to remap only a subset of the pixel coordinates in the current image to form a partially remapped image, such that only these remapped pixels contribute to the similarity measurement for selecting the best camera motion parameter. The camera motion selection unit 24 can also reject all of the received camera motion parameters if the best similarity measurement indicates that none of them represents unwanted or undesirable motion. In this situation, the camera motion selection unit 24 should return an affine transformation that has no effect in the subsequent frame remapping process, for example:
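$$A = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

i.e., the identity transformation, which maps every pixel coordinate to itself.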
The frame remapping unit 25 will then receive the final camera motion parameter from the camera motion selection unit 24. In essence, the frame remapping unit 25 takes the final camera motion parameter and employs Equation (1) to remap all the pixel coordinates in the current frame to another set of pixel coordinates. Each pixel is then relocated to its remapped coordinate accordingly to form the remapped output frame. It should be noted that the pixel coordinate remapping process can also be performed on a sub-pixel basis to improve stabilization performance, provided that appropriate interpolation methods are used to relocate pixels with sub-pixel accuracy. The remapped output frame will then be stored in the output frame memory unit 27 to serve as the stabilized video frame 18.
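For illustration, a minimal sketch of such a remapping for a single-channel frame is given below; it realizes sub-pixel accuracy by inverse mapping with bilinear interpolation, which is one common implementation choice rather than the only one, and all names are assumptions:

```python
import numpy as np

def remap_frame(current, A):
    """Remap a single-channel frame with affine matrix A per Equation (1),
    using inverse mapping with bilinear interpolation so that every output
    pixel receives a value (illustrative sketch)."""
    rows, cols = current.shape
    A_inv = np.linalg.inv(A)
    # Homogeneous coordinates [m, n, 1]^T of every output pixel.
    n_out, m_out = np.mgrid[0:rows, 0:cols]
    coords = np.stack([m_out.ravel().astype(np.float64),
                       n_out.ravel().astype(np.float64),
                       np.ones(rows * cols)])
    m_src, n_src, _ = A_inv @ coords              # sub-pixel source positions
    m0 = np.floor(m_src).astype(int)
    n0 = np.floor(n_src).astype(int)
    fm, fn = m_src - m0, n_src - n0               # bilinear weights
    v = (m0 >= 0) & (m0 < cols - 1) & (n0 >= 0) & (n0 < rows - 1)
    img = current.astype(np.float64)
    out = np.zeros(rows * cols)
    out[v] = ((1 - fm[v]) * (1 - fn[v]) * img[n0[v], m0[v]]
              + fm[v] * (1 - fn[v]) * img[n0[v], m0[v] + 1]
              + (1 - fm[v]) * fn[v] * img[n0[v] + 1, m0[v]]
              + fm[v] * fn[v] * img[n0[v] + 1, m0[v] + 1])
    return out.reshape(rows, cols).astype(current.dtype)
```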
The reference frame rendering unit 28 reads the stabilized video frame from the output frame memory unit 27 and prepares a reference frame for a next iteration of the stabilization process. The reference frame rendering unit 28 can update the reference frame memory unit 29 by cloning the content in output frame memory unit 27 at each iteration, or it can do so periodically at a pre-defined sampling frequency. Another method of preparing the reference frame is to consider the similarity measurement taken from camera motion selection unit 24 and determine whether the estimated camera motion attains a certain predetermined level of confidence. If the confidence is higher than the pre-defined threshold, then the reference frame rendering unit 28 updates the reference frame. Otherwise, a previously-prepared reference frame can be retained in the reference frame memory unit 29 for the next iteration.
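As a sketch of one possible update policy (the function name, the confidence convention, and the threshold are assumptions; frames are assumed to be numpy arrays):

```python
def prepare_reference(prev_reference, stabilized, confidence, threshold):
    """Return the reference frame for the next iteration: the newly
    stabilized frame when the motion estimate was sufficiently confident,
    otherwise the previously prepared reference (policy sketch only)."""
    return stabilized.copy() if confidence >= threshold else prev_reference
```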
The process described above can be repeated for each incoming video frame 12. Referring again to
While various embodiments in accordance with the principles disclosed herein have been described above, it should be understood that they have been presented by way of example only, and are not limiting. Thus, the breadth and scope of the invention(s) should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the claims and their equivalents issuing from this disclosure. Furthermore, the above advantages and features are provided in described embodiments, but shall not limit the application of such issued claims to processes and structures accomplishing any or all of the above advantages.
Additionally, the section headings herein are provided for consistency with the suggestions under 37 CFR 1.77 or otherwise to provide organizational cues. These headings shall not limit or characterize the invention(s) set out in any claims that may issue from this disclosure. Specifically and by way of example, although the headings refer to a “Technical Field,” such claims should not be limited by the language chosen under this heading to describe the so-called technical field. Further, a description of a technology in the “Background” is not to be construed as an admission that technology is prior art to any invention(s) in this disclosure. Neither is the “Summary” to be considered as a characterization of the invention(s) set forth in issued claims. Furthermore, any reference in this disclosure to “invention” in the singular should not be used to argue that there is only a single point of novelty in this disclosure. Multiple inventions may be set forth according to the limitations of the multiple claims issuing from this disclosure, and such claims accordingly define the invention(s), and their equivalents, that are protected thereby. In all instances, the scope of such claims shall be considered on their own merits in light of this disclosure, but should not be constrained by the headings set forth herein.