A broad range of video equipment, from cameras in smartphones to equipment for large production studios, is available to individuals and businesses. The footage recorded by such equipment often appears wobbly because of unwanted motion in the recorded video caused by, e.g., unintended shaking of the camera or the rolling shutter effect.
Different techniques are used to stabilize a video sequence and remove unwanted camera movements. The objective of motion stabilization is to remove the jitter produced by hand-held devices. Camera jitter introduces extraneous motion that is not related to the actual motion of objects in the picture. Therefore, the motion appears as random picture movements that produce disturbing visual effects.
Image stabilization methods have been developed in the past that model the camera motion and distinguish between the intended and unintended motions. Other methods have also been developed that generate a set of curves to track different camera movements such as translation, rotation, and zoom. The curves are smoothed and the differences between the unsmoothed curves and the smoothed curves are used to define a set of transformations to apply to each video image to remove the unwanted camera motion.
Some embodiments provide a method for homography-based video stabilization and smoothing. During the analysis phase, the method analyzes a video sequence and determines homographies between each pair of consecutive frames that capture the dominant motion of the video sequence. In order to facilitate these homography calculations, the method in some embodiments first identifies the points of interest, referred to as robust image feature points, within each frame. Each identified feature point is then described in terms of one or more parameters of a group of neighboring points. The method then matches the feature points between each frame and the previous frame in the sequence. Other embodiments use different methods such as optical flow to match points between frames.
Once the matches are identified, the method uses a novel enhancement of the Random Sample Consensus (RANSAC) algorithm, referred to herein as Geometrically Biased Historically Weighted RANSAC (or weighted RANSAC for brevity), to identify homographies between each pair of consecutive frames describing the spatial transformation of feature points associated with the dominant motion between the frames.
Prior to the application of the weighted RANSAC algorithm, some embodiments apply a non-maximum suppression algorithm to the set of feature matches to reduce the density of feature matches in areas of high concentration. The result is a more uniform distribution of matched feature points across the entire image, rather than dense clusters of feature points in areas of high detail. This allows the subsequent application of the weighted RANSAC algorithm to produce a more spatially uniform consensus of motion.
The method maintains historical metrics for each feature point that indicate in how many previous frames the feature point has been tracked, in how many of the previous frames the feature point was an inlier that contributed to the dominant motion of the video sequence, and how much the feature point has moved from the dominant field of motion.
The method utilizes the historical metrics to perform the weighted RANSAC with a cost function associated with each point, where inclusion of prior inliers (particularly those with a long history of being an inlier) is weighted heavily and the feature points that have long been major outliers are weighted lightly, or in some embodiments negatively. The algorithm also incorporates a geometric component in the weighted RANSAC cost function that biases solutions towards solutions that have minimal spatial distortion. The weighted RANSAC is utilized to provide a homography that describes the motion from frame N−1 to frame N.
During the smoothing and stabilization phase, the method in some embodiments utilizes the homographies and finds smoothing homographies that are applied to the video frames to stabilize the sequence of video frames. The homographies are utilized to determine the reprojected position of each corner of each video frame and subsequently calculate the differences or “deltas” between the original corner frame positions and their reprojected positions based on the homography calculated from the dominant motion between the consecutive frames.
The method then applies a smoothing function to the sequence of deltas for each of the four corners. Some embodiments apply a Gaussian kernel to perform the smoothing operation. The method thus has, for each corner, both a noisy trajectory and a smooth trajectory through time. The method calculates the difference between the smooth trajectory and the noisy trajectory for each corner in each frame. The differences are then used to find the corrective homography that is applied to each original video frame in order to bring the frame onto the smooth trajectory and produce the smooth video sequence.
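The smoothing step can be illustrated with a minimal sketch. It assumes that one coordinate of one corner's trajectory is stored as a plain list of numbers, and that the ends of the sequence are handled by clamping; the function names, the 3-sigma kernel radius, and the clamping are illustrative assumptions and are not taken from the description.

```python
import math

def gaussian_kernel(sigma):
    """Return a normalized 1-D Gaussian kernel (assumed radius: 3 sigma)."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(values, sigma):
    """Low-pass filter a sequence; indices are clamped at both ends."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out

def corrections(values, sigma):
    """Per-frame difference between the smooth and the noisy trajectory."""
    return [s - v for s, v in zip(smooth(values, sigma), values)]
```

In this sketch a larger `sigma` corresponds to stronger smoothing, and the values returned by `corrections` would be the per-corner offsets from which a corrective homography is later derived.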
Some embodiments provide a tripod mode to completely eliminate the dominant motion of the video sequence as if the camera were on a tripod. The method selects a key frame (e.g., the original frame or a frame that has most of the relevant subject matter, etc.) in the video sequence and calculates the difference between the corners of every other frame and the corresponding corners of the key frame. The differences are then used to map all frames to the key frame homography to delete all the motion related to the dominant motion of the video sequence.
The preceding Summary is intended to serve as a brief introduction to some embodiments of the invention. It is not meant to be an introduction or overview of all inventive subject matter disclosed in this document. The Detailed Description that follows and the Drawings that are referred to in the Detailed Description will further describe the embodiments described in the Summary as well as other embodiments. Accordingly, to understand all the embodiments described by this document, a full review of the Summary, Detailed Description and the Drawings is needed. Moreover, the claimed subject matters are not to be limited by the illustrative details in the Summary, Detailed Description and the Drawing, but rather are to be defined by the appended claims, because the claimed subject matters can be embodied in other specific forms without departing from the spirit of the subject matters.
The novel features of the invention are set forth in the appended claims. However, for purpose of explanation, several embodiments of the invention are set forth in the following figures.
In the following detailed description of the invention, numerous details, examples, and embodiments of the invention are set forth and described. However, it will be clear and apparent to one skilled in the art that the invention is not limited to the embodiments set forth and that the invention may be practiced without some of the specific details and examples discussed.
Some embodiments provide a system and method for removing unwanted motion and stabilizing a video sequence.
During the analysis phase, the process determines the relevant structure between frames and utilizes it to determine the inter-frame homography that describes the dominant motion. The dominant motion is the motion of the dominant plane of the video sequence through time. The process excludes and ignores transient objects that move through the frame when calculating the inter-frame transformation of the dominant plane.
Once the dominant motion in the video sequence is identified, the process smooths (at 110) the effects of the unwanted motion to stabilize the video sequence. The space-time motion trajectory of the dominant plane identified in the analysis phase includes both the wanted major motions such as pans, zooms, etc., and the unwanted high-frequency motion such as camera shake, vibrations, etc. During the stabilization phase, the process removes this unwanted component through low-pass smoothing of the noisy trajectory. The amount of smoothing applied in some embodiments is a user-specified parameter.
Several more detailed embodiments of the invention are described in sections below. Section I discusses analysis of a video sequence. Next, Section II describes stabilization of the video sequence. Finally, section III provides a description of a computer system with which some embodiments of the invention are implemented.
I. Video Sequence Analysis
The process then describes (at 215) each identified feature in the current video frame. In some embodiments, each feature is described in terms of one or more parameters of a group of neighboring points. The description of the features in each video frame is described by reference to
Otherwise, when the current frame is not the first frame, the process matches (at 230) the description of each point of interest in the current frame with the description of points of interest in the previous frame to identify a match between feature points in the current frame and the previous frame. Matching of the features in successive video frames is described by reference to
The process then calculates (at 235) the movement of each feature point from the previous frame to the current frame. Next, the process determines (at 240) whether the current frame is the second frame in the sequence. If yes, the process identifies (at 245) a homography between the first and second frame to describe movement of feature points between the pair of frames. The process then proceeds to 260 to store historical metrics for the feature points as described below.
As described further below, the homography between a pair of frames is determined by using a geometrically biased historically weighted RANSAC that is based on historical metrics and a geometric component that biases the solutions towards minimally distorted solutions. For the second frame in the video sequence, the homography is calculated for the first time and historical metrics are not available yet. However, the feature points that are matched between the first and second frames and have less motion between the two frames are more likely to be part of the background and contribute to the dominant motion of the video sequence.
The process defines (at 245) a cost function that gives more weight to feature points with less motion between the first and second frames and includes a geometric component that biases towards the solutions that have minimal spatial distortion. The process then determines (at 247) a homography that describes the dominant motion from the first frame to the second frame using a weighted RANSAC method that uses the cost function and gives more weight to matched feature points with less motion between the first and second frames.
Utilizing homographies for stabilizing video sequences significantly outperforms existing video stabilization techniques that are based on simpler frame-to-frame affine transformations. An affine transformed plane is a plane that is either translated (i.e., moved), rotated, scaled (i.e., resized), or sheared (i.e., fixed in one dimension while the lines in other dimension are moved) but does not include, for instance, a plane subject to keystoning effect where a perspective image is projected onto a surface at an angle. On the other hand, homography captures any linear transformation (or distortion) of a two-dimensional plane in a three-dimensional space.
The RANSAC algorithm is used to come up with a consensus among feature point mappings that generate the homographies between two frames. Assuming that both frames are viewing the same plane from different positions and/or angles, homographies are used to determine how this hypothesized plane gets distorted from one frame to the other.
Homography is an invertible transformation that describes the changes in a perspective projection when the point of view of the observer changes. A homography is a 3 by 3 matrix:
Given a point X1 with coordinates (a1, b1, 1) in one image and a point X2 with coordinates (a2, b2, 1) in another image, the homography relates the point coordinates in the two images if X2=M X1. When the homography is applied to every pixel in an image, the image is a warped version of the original image.
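As a minimal sketch of the relation X2 = M X1, the function below maps a single point through a homography. The matrix is assumed to be stored as three nested row lists, and the name `apply_homography` is illustrative. Because the relation holds in homogeneous coordinates, the result is divided by its third component so that the mapped point again has the form (a2, b2, 1).

```python
def apply_homography(m, x, y):
    """Map the point (x, y, 1) through the 3x3 matrix m and renormalize."""
    xh = m[0][0] * x + m[0][1] * y + m[0][2]
    yh = m[1][0] * x + m[1][1] * y + m[1][2]
    w = m[2][0] * x + m[2][1] * y + m[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    return xh / w, yh / w

# A pure translation by (5, -3) leaves the third component at 1.
shift = [[1, 0, 5], [0, 1, -3], [0, 0, 1]]
moved = apply_homography(shift, 10, 10)
```

A non-zero entry in the last row of the matrix is what produces perspective (keystone-like) effects that an affine transformation cannot express.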
Using a large number of points that match, the relative motion from frame N−1 to frame N is described by this matrix, which describes how the inliers move between frame N−1 and frame N. The shortcoming of using a simple RANSAC algorithm is that the method does not provide much continuity through the video sequence. For instance, in a video sequence where the background dominates the scene (for instance the scene described by reference to
The process defines (at 250) a cost function that weights each feature point based on the historical metrics and includes a geometric component that biases solutions towards the solutions that have minimal spatial distortion. The process then determines homography between the current frame and the previous frame by performing (at 255) a geometrically biased historically weighted RANSAC to produce an inter frame homography that describes the motion from the previous frame to the current frame. As described in the following sections, some embodiments provide a novel technique to collect historical metrics for feature points and utilize the metrics to further refine the identification of the inliers and calculation of the dominant motion between the frames. Performing the geometrically biased historically weighted RANSAC method and using the historical metrics of feature points to determine the dominant motion is described by reference to
The process then calculates (at 260) historical metrics for each feature point. For each feature in the current frame, the process stores a historical metric that indicates (i) whether the feature has been an inlier or an outlier in a set of previous frames, (ii) the age of the feature, i.e., in how many previous frames the feature was tracked, and (iii) the projection error of the feature calculated in the previous frame. Calculation of historical metrics for each feature point is described by reference to
The process then determines (at 265) whether all frames in the video sequence are examined. If not, the process proceeds to 205, which was described above. Otherwise, the process optionally optimizes (at 270) the calculated homographies. The process then ends. Optimizing the homographies is described by reference to
A. Feature Identification
Some embodiments identify features in each frame by identifying a set of feature points that includes corners and line intersections. For instance, some embodiments identify feature points where there are two dominant edge directions in a local neighborhood of the point. Other embodiments also identify isolated points with a maximum or minimum local intensity as feature points.
Different embodiments use different techniques for feature detection. For instance, some embodiments utilize the high-speed “Features from Accelerated Segment Test” (FAST) algorithm to identify points of interest. The FAST algorithm is described in “Machine Learning for High-Speed Corner Detection,” Edward Rosten and Tom Drummond, Proceedings of the 9th European Conference on Computer Vision, Volume Part I, pages 430-443, 2006. This document is herein incorporated by reference. Other embodiments utilize other techniques such as the Speeded Up Robust Features (SURF) feature detection method to identify points of interest. The SURF algorithm is described in “Speeded Up Robust Features (SURF),” Herbert Bay, Andreas Ess, Tinne Tuytelaars, Luc Van Gool, Computer Vision and Image Understanding (CVIU), Vol. 110, No. 3, pp. 346-359, Sep. 10, 2008. This document is herein incorporated by reference. Yet other embodiments use other techniques such as optical flow to match points between frames.
For each point (e.g., each pixel) in the frame 300 one or more parameters of a set of neighboring points are examined. For instance, in the example of
In addition, some embodiments perform a quick test to exclude a large number of candidate points. In these embodiments, only the four pixels labeled 1, 5, 9, and 13 are examined and the point is discarded as a candidate feature point if the point is not brighter than at least three of these points by a threshold or darker than at least three of the points by a threshold. If a point is not discarded as a candidate feature point, then the intensity of the point is compared with the intensity of the 16 neighboring points as described above.
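The quick rejection test can be sketched as follows. The sketch assumes that the intensities of the candidate point and of the four pixels labeled 1, 5, 9, and 13 have already been read from the frame; the function name and the argument layout are illustrative assumptions.

```python
def passes_quick_test(center, compass, threshold):
    """Return True if the candidate point survives the four-pixel test.

    center:    intensity of the candidate point
    compass:   intensities of the four pixels labeled 1, 5, 9 and 13
    threshold: minimum intensity difference that counts as brighter/darker
    """
    # Count the pixels the candidate is brighter than, and darker than,
    # by more than the threshold.
    brighter = sum(1 for v in compass if center - v > threshold)
    darker = sum(1 for v in compass if v - center > threshold)
    # The candidate is kept only if one of the counts reaches three.
    return brighter >= 3 or darker >= 3
```

Only candidates for which this function returns True would go on to the full comparison against the 16 neighboring points.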
B. Feature Description
After the feature points are identified on a frame, one or more parameters of each feature point are described in order to compare and match the points in different frames. Different embodiments use different techniques to describe the feature points. For instance, some embodiments utilize the “Binary Robust Independent Elementary Features” (BRIEF) algorithm to describe the features in each frame. BRIEF is described in “BRIEF: Binary Robust Independent Elementary Features,” Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua, European Conference on Computer Vision, 2010. This document is herein incorporated by reference. Other embodiments utilize the “Oriented FAST and Rotated BRIEF” (ORB) algorithm to define the feature points in each frame. The ORB algorithm is described in “ORB: An Efficient Alternative to SIFT or SURF,” Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski, Computer Vision (ICCV), 2011 IEEE International Conference on Computer Vision (ICCV). This document is herein incorporated by reference. Other embodiments utilize the above-mentioned SURF algorithm to describe the feature points.
The method creates a bit vector corresponding to each pair of points. The bit value corresponding to each pair is 1 if the intensity of a particular point in the pair is higher or equal to the intensity of the other point in the pair. Otherwise, the bit value is 0.
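The construction of the bit vector can be sketched as below. The sketch assumes that the neighborhood of a feature point is available as a small 2-D list of intensities and that the list of point pairs has been chosen beforehand; how the pairs are selected is not covered here, and the names `describe`, `patch`, and `pairs` are illustrative.

```python
def describe(patch, pairs):
    """Pack one bit per point pair into an integer descriptor.

    patch: 2-D list of intensities around the feature point
    pairs: list of ((row1, col1), (row2, col2)) positions inside the patch
    """
    bits = 0
    for (r1, c1), (r2, c2) in pairs:
        # The bit is 1 when the first point is at least as bright as the second.
        bit = 1 if patch[r1][c1] >= patch[r2][c2] else 0
        bits = (bits << 1) | bit
    return bits

patch = [[10, 20],
         [30, 5]]
pairs = [((0, 0), (0, 1)), ((1, 0), (1, 1)), ((0, 1), (0, 1))]
descriptor = describe(patch, pairs)  # bits 0, 1, 1
```

Storing the descriptor as an integer keeps the later bitwise comparison between two descriptors cheap.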
C. Feature Matching
Once the feature points are identified and described, the descriptions are utilized to match the points between the frames. For each feature point selected in a current frame, a search is made to find a point in the previous frame whose descriptor best matches a predetermined number of bits in the descriptor of the selected feature point.
In order to expedite the search, some embodiments divide each frame into a grid of overlapping blocks and search for a point in the corresponding block of the previous frame. The rationale is that features will not migrate too much from frame to frame and will typically remain within a single block. The overlap allows for the features to be tracked continuously as they travel over block edges.
When matching features from frame N to N−1, some embodiments perform a two-step match. In the first step, a match is done from feature points in frame N to N−1. In this step, the two closest matches (in Hamming distance) are found in frame N−1 for each feature in frame N. If the distance of the closest match is below a threshold, and if the difference between the distances of the closest and second closest matches is above a certain threshold (for example, “the closest is at least twice as close as the second-closest”), then the closest match is considered as a match candidate. The reasoning is that, if both the closest match and second closest match are very similar in distance, then the match is ambiguous, and is thrown out.
Once all frame-N-to-N−1 matches are found, a reverse match is performed in the second step. In this step, for all the matched-to points in N−1 identified in the first step, the closest matches in frame N are found by using the same method as the first step, but going from frame N−1 to frame N. If the matching is bidirectional for a pair of matches identified in the first step (that is, for a point p1 in frame N and point p2 in frame N−1, p1's best match is p2, and p2's best match is p1) then the two points are identified as matched points.
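The two-step match can be sketched as follows, assuming that descriptors are stored as Python integers and that every descriptor of the other frame is a candidate (the block-based search restriction is omitted). The threshold values, the "twice as close" ratio form of the ambiguity test, and all function names are illustrative assumptions.

```python
def hamming(a, b):
    """Number of bit positions in which two integer descriptors differ."""
    return bin(a ^ b).count("1")

def best_match(query, candidates, max_dist, min_ratio):
    """Index of the unambiguous closest candidate, or None."""
    ranked = sorted(range(len(candidates)),
                    key=lambda i: hamming(query, candidates[i]))
    if not ranked:
        return None
    d1 = hamming(query, candidates[ranked[0]])
    if d1 >= max_dist:
        return None  # closest match is not close enough
    if len(ranked) > 1:
        d2 = hamming(query, candidates[ranked[1]])
        if d2 == d1 or d2 < min_ratio * d1:
            return None  # second closest is too similar: ambiguous
    return ranked[0]

def two_step_match(curr, prev, max_dist=64, min_ratio=2.0):
    """Return (index in curr, index in prev) pairs that match both ways."""
    matches = []
    for i, desc in enumerate(curr):
        j = best_match(desc, prev, max_dist, min_ratio)  # frame N to N-1
        if j is None:
            continue
        back = best_match(prev[j], curr, max_dist, min_ratio)  # N-1 to N
        if back == i:
            matches.append((i, j))
    return matches
```

For example, `two_step_match([0b11110000, 0b00001111], [0b00001111, 0b11110001, 0b10101010])` pairs each current descriptor with its near-identical counterpart and leaves the third previous-frame descriptor unmatched.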
The process then determines (at 820) whether the spatial distance (i.e., the two dimensional distance) between the current feature point in the current frame and the selected feature of the previous frame is more than a predetermined threshold. If yes, the process does not consider the two points to be candidates for matching and proceeds to 830, which is described below.
Otherwise, the process computes and saves (at 820) the Hamming distance between the descriptor of the selected feature point in the current frame and the descriptor of the selected feature point in the previous frame. In some embodiments, the process makes a bitwise comparison of the two descriptors. The Hamming distance between two descriptors is the number of positions at which the corresponding bits are different, i.e., the minimum number of bit substitutions required to change one descriptor into the other. The feature points match if the Hamming distance is within a predetermined threshold.
The process then determines (at 840) whether all points in the current frame are examined. If not, the process selects (at 845) the next feature point in the current frame. The process then proceeds to 815, which was described above. Otherwise, the process selects (at 850) the first feature point in the current frame. The process then identifies (at 855) the two feature points in the previous frame that best match the current feature point based on the computed Hamming distances.
The process then determines (at 860) whether the Hamming distance between the current point and the best match is below a threshold. If not, the process proceeds to 875, which is described below. Otherwise, the process determines (at 865) whether the difference between the Hamming distance of the current point and the best match and the Hamming distance of the current point and the second best match is more than a threshold. If not, the process determines (at 875) that the feature point in the current frame does not match any feature point in the previous frame. The process then proceeds to 880, which is described below.
Otherwise, the process adds (at 870) the selected feature point in the current frame and the best matching feature point in the previous frame to the list of candidate matching pairs. The process then determines (at 880) whether all feature points in the current frame are examined. If not, the process selects (at 885) the next feature point in the current frame. The process then proceeds to 855, which was described above.
Otherwise, the process selects (at 890) the first feature point of the previous frame from the list of candidate matching pairs. The process then identifies (at 891) the two feature points in the current frame that best match the selected feature point in the previous frame based on the computed Hamming distances. The process then determines (at 892) whether the Hamming distance between the selected feature point and the best match is below a threshold.
If not, the process proceeds to 896, which is described below. Otherwise, the process determines (at 893) whether the difference between the Hamming distance of the current point and the best match and the Hamming distance of the current point and the second best match is more than a threshold. If not, the process determines (at 896) that the current feature point in the previous frame does not match any feature points in the current frame. The process then proceeds to 897, which is described below.
Otherwise, the process identifies (at 895) the selected feature point of the previous frame and the corresponding best match of the current frame as matching points.
Referring back to
Some embodiments provide an option for a user to indicate a subject in a video sequence that should be the focus of stabilization. For instance, a vehicle during a race containing a lot of other vehicles, the individual 330 on a bicycle in the example of
As shown, the user interface provides the option for the user to select an area in the video frame to be the focus of stabilization. For instance, the user interface allows drawing a shape (such as a rectangle, a circle, a lasso, etc.) or identifying a set of points on the video frame to define a polygon around a desired subject. In this example, the user has identified a polygon 1125 by identifying a set of points 1135 around a desired subject such as the automobile 1130 shown in the video frame. The identification of this area of interest is done on the first frame of the video sequence in some embodiments.
By selecting an area around the desired subject (in this example the automobile 1130) the user indicates that the algorithm should stabilize the motion of the automobile, rather than the motion of the background. Only features found within the selected region are used for the initial weighted RANSAC calculation that determines homography between a pair of frames. Once the weighted RANSAC homography estimation has been performed with the selected subset of points, other points outside of the selected region can also be found to be inliers in the plane of dominant motion. The historical metrics of those points (along with those within the selected area of interest) are then initialized as being inliers, which would bias towards their selection as the dominant motion during analysis of subsequent frames. In some embodiments, histories of features within the selected area of interest are biased more heavily for selection as future inliers in the weighted RANSAC calculations (e.g., by initializing their histories to indicate a long history of being an inlier).
D. Determine Homographies Between Frames and the Dominant Motion of the Video Sequence Using Historical Metrics for Feature Points
Finding the same point in two frames allows determining how the feature points move relative to each other. Once the matches have been established, some embodiments determine homographies that describe the travel of the feature points between the frames. Some embodiments utilize a novel geometrically biased historically weighted RANSAC method to determine the inter frame homographies.
A RANSAC algorithm iteratively examines a set of observed data points, which includes inliers and outliers, to estimate the parameters of a model and to identify which points are inliers and which are outliers. The RANSAC algorithm is non-deterministic and produces a reasonable result only with a certain probability, which increases as more iterations are performed. The inliers are points (or pairs of points) that are considered part of the dominant motion plane. Inlier feature points contribute to a solution for the dominant motion that is consistent; outliers do not.
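The hypothesize-and-verify loop of a traditional (unweighted) RANSAC can be illustrated with a deliberately simplified sketch. To keep the example short, the model here is a pure two-dimensional translation rather than a homography, so a single match is enough to form a hypothesis; this substitution, the tolerance, the iteration count, and the function name are all assumptions made for illustration.

```python
import random

def ransac_translation(matches, tolerance=2.0, iterations=100, seed=0):
    """Find the translation supported by the most matches.

    matches: list of ((x_prev, y_prev), (x_curr, y_curr)) point pairs
    Returns (dx, dy, indices of the inlier matches).
    """
    rng = random.Random(seed)
    best = (0.0, 0.0, [])
    for _ in range(iterations):
        # Hypothesize a model from a minimal random sample (one match).
        (px, py), (cx, cy) = rng.choice(matches)
        dx, dy = cx - px, cy - py
        # Collect the matches that agree with the hypothesis.
        inliers = [i for i, ((ax, ay), (bx, by)) in enumerate(matches)
                   if abs(bx - ax - dx) <= tolerance
                   and abs(by - ay - dy) <= tolerance]
        # Keep the hypothesis with the largest consensus set.
        if len(inliers) > len(best[2]):
            best = (dx, dy, inliers)
    return best
```

In this plain form every match has the same vote; the weighted variant described in this section replaces the simple inlier count with a per-point cost function.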
Some feature points are attached to objects that are moving through the plane of the image (e.g., the person 950 on the bike in
The process then determines (at 1315) whether the current frame is the second frame in the video sequence. If not, the process proceeds to 1325, which is described below. When the current frame is the second frame, historical metrics for the feature points are not determined yet. However, the feature points that are matched between the first and second frames and have less motion between the two frames are more likely to be part of the background and contribute to the dominant motion of the video sequence.
Therefore, when the current frame is the second frame, the process defines (at 1320) a cost function that gives more weight to feature points with less motion between the first and second frames and includes a geometric component that biases towards the solutions that have minimal spatial distortion. The process then determines (at 1323) a homography that describes the dominant motion from the first frame to the second frame using a weighted RANSAC method that uses the cost function and gives more weight to matched feature points with less motion between the first and second frames. The process then proceeds to 1335, which is described below.
The homography between the first and second frame in some embodiments is determined based on a traditional RANSAC algorithm. When the current frame is subsequent to the second frame, historical metrics for the feature points are already determined and stored (as described by reference to operation 1345, below) and all other subsequent homographies are determined using the geometrically biased historically weighted RANSAC algorithm. In some embodiments, the historical metrics and the geometric components are determined for the first frame based on a consensus vote (i.e., the traditional RANSAC). In these embodiments, operations 1315 and 1320 are bypassed and the homography between the first and second frames is also determined using the geometrically biased historically weighted RANSAC algorithm.
The process defines (at 1325) a cost function that weights each feature point based on the historical metrics and includes a geometric component that biases solutions towards the solutions that have minimal spatial distortion. The following pseudo code defines the cost function used for scoring the geometrically biased historically weighted RANSAC method of some embodiments.
As shown by the above pseudo code, the cost function returns a total score for each result generated by the geometrically biased historically weighted RANSAC algorithm. For each inlier, the cost function identifies the number of times the point has been an inlier and an outlier and scales the score based on the inlier and outlier counts.
The cost function also biases against points that had large reprojection error in the last frame and the points that had more motion. The cost function also includes a geometric component. The geometric acceptance criteria are based on the two following measurements: the angle distortion at the corners of the reprojected frame and the maximum corner travel of the reprojected frame.
When applying the detected motion homography to the original video frame, the cost function calculates the maximum angle change from 90 degrees for each of the four corners. The cost function then calculates the cosine of this angle over the maximum distance traveled by any of the detected inliers for that frame. When the ratio of cos(angle_delta)/max_inlier_travel exceeds a predetermined threshold, (e.g., 1.0) the algorithm result is considered a failure for this frame.
The cost function also applies the detected motion homography to the frame bounds, and calculates the maximum difference in position between the original four corners and their reprojected positions. The cost function then calculates the ratio of this maximum travel over the median of the travel of all inlier features. If this ratio of max_corner_travel/median (feature-travel) exceeds a predetermined threshold (e.g., 2.5) then the algorithm result is considered a failure for this frame.
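The two geometric measurements described above can be sketched as follows. The sketch follows the wording of the text literally, including the ratio cos(angle_delta)/max_inlier_travel, and uses the example thresholds 1.0 and 2.5. The function names, the representation of the homography as nested row lists, and the decision to reject a frame whose inliers show no travel (to avoid dividing by zero) are assumptions made for illustration.

```python
import math
from statistics import median

def apply_homography(m, x, y):
    """Map (x, y) through the 3x3 matrix m and renormalize."""
    w = m[2][0] * x + m[2][1] * y + m[2][2]
    return ((m[0][0] * x + m[0][1] * y + m[0][2]) / w,
            (m[1][0] * x + m[1][1] * y + m[1][2]) / w)

def max_angle_change(quad):
    """Largest deviation from 90 degrees among the four corner angles."""
    worst = 0.0
    for i in range(4):
        cx, cy = quad[i]
        ax, ay = quad[i - 1][0] - cx, quad[i - 1][1] - cy
        bx, by = quad[(i + 1) % 4][0] - cx, quad[(i + 1) % 4][1] - cy
        cos_a = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
        angle = math.degrees(math.acos(max(-1.0, min(1.0, cos_a))))
        worst = max(worst, abs(angle - 90.0))
    return worst

def geometric_ratios(m, width, height, inlier_travels):
    """Return (cos(angle_delta)/max_inlier_travel, max_corner_travel/median travel)."""
    corners = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
    warped = [apply_homography(m, x, y) for x, y in corners]
    angle_delta = max_angle_change(warped)
    max_corner_travel = max(math.hypot(wx - x, wy - y)
                            for (x, y), (wx, wy) in zip(corners, warped))
    angle_ratio = math.cos(math.radians(angle_delta)) / max(inlier_travels)
    travel_ratio = max_corner_travel / median(inlier_travels)
    return angle_ratio, travel_ratio

def passes_geometric_checks(m, width, height, inlier_travels,
                            angle_limit=1.0, travel_limit=2.5):
    """True when neither ratio exceeds its threshold."""
    if not inlier_travels or min(inlier_travels) <= 0:
        return False  # assumption of this sketch: no measurable travel, reject
    angle_ratio, travel_ratio = geometric_ratios(m, width, height, inlier_travels)
    return angle_ratio <= angle_limit and travel_ratio <= travel_limit
```

For a pure translation by two pixels with inliers that also travel two pixels, the ratios are 0.5 and 1.0, so the frame is accepted; a homography that drags one corner far beyond the typical inlier travel is rejected by the second ratio.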
The process then determines (at 1330) the homography between the current frame and the previous frame by performing a geometrically biased historically weighted RANSAC method. The method uses the cost function and weights each feature point based on the feature point's historical metrics to produce a homography that describes the dominant motion from the previous frame to the current frame.
The process then calculates (at 1335) a score for the determined homography based on the histories of the inlier/outlier points and geometric distortion produced by the homography (e.g., as described by reference to the pseudo code, above). The process then determines (at 1340) whether the score is better than the prior best score. If not, the process proceeds to 1350, which is described below. Otherwise, the process saves the determined homography as the best candidate homography.
The information saved in historical metrics for each feature point is used to calculate a better estimate of the dominant field of motion in the video sequence. For instance, the process can determine that 750 feature points are identified in a frame N, 500 of which match to feature points in frame N−1. Of those 500 matches, 410 have a history, which means they at least existed in frame N−2 and have been used to calculate the motion homography in the past. Of those 410 feature points, 275 were inliers in the calculation of the motion and the remaining 135 were outliers. In addition, for each of those points with a history, the projection error is used to determine how closely the point's travel matched the dominant motion plane in the prior frame.
Some embodiments utilize the historical metrics and the geometric bias to perform a geometrically biased historically weighted RANSAC (or RANSAC with a cost function associated with each point), where inclusion of prior inliers (particularly those with a long history of being inliers) is weighted heavily and the feature points that have long been major outliers are weighted lightly, or in some embodiments negatively. Features that are new (i.e., without much history) are considered positively in order to allow a homography that describes as much of a frame as possible.
The weighted RANSAC is utilized in some embodiments to provide a homography that describes the motion from frame N−1 to frame N. These embodiments define a cost function to optimize. The cost function associates a weight with each point, rather than giving each point the same weight. Each point, depending on its history of being an inlier, its age, and its projection error, has a sway or controlling influence in the cost function. The geometric component biases the solution towards solutions with minimal spatial distortion.
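One possible shape for such a per-point weight is sketched below. The specific formula, the function names, and the error_scale parameter are hypothetical and chosen only to reflect the qualitative behavior described above: long-time inliers count heavily, long-time outliers count lightly or negatively, new features count positively, and a large projection error in the last frame reduces a point's influence.

```python
def point_weight(inlier_count, outlier_count, last_error, error_scale=10.0):
    """Hypothetical weight of one inlier point in the candidate score."""
    if inlier_count + outlier_count == 0:
        # New feature without history: small positive contribution.
        return 1.0
    history_term = inlier_count - outlier_count
    if history_term <= 0:
        # Mostly an outlier in the past: zero or negative contribution.
        return float(history_term)
    # Mostly an inlier in the past: damp by the last-frame projection error.
    return history_term / (1.0 + last_error / error_scale)

def candidate_score(inlier_histories):
    """Sum the weights of the inliers of one candidate homography.

    Each entry is a tuple (inlier_count, outlier_count, last_error).
    """
    return sum(point_weight(i, o, e) for i, o, e in inlier_histories)
```

Under these assumed weights, a candidate whose inliers are mostly long-standing inliers scores higher than one that agrees mainly with points that have historically been outliers.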
Referring back to
The threshold for each point is an absolute travel distance in any direction. In determining inliers/outliers, the process takes the homography determined in step 1330 and applies that homography mapping to all points within frame N−1, which will produce their projected positions into frame N. If the projected position of a point differs from the actual position of the point's matched feature by more than an acceptable threshold, the process considers the point to be an outlier, otherwise the point is an inlier.
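The inlier test described above can be illustrated with a short sketch. The function names and the representation of matches as pairs of (x, y) tuples are assumptions made for this example; the homography is taken to map frame N−1 coordinates into frame N.

```python
import math

def project(H, point):
    """Map a 2D point through a 3x3 homography stored as nested lists."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

def classify_matches(H, matches, threshold):
    """Split matches into inliers and outliers.

    matches is a list of (previous_point, current_point) pairs. A match is an
    outlier when the projected previous point lies farther than `threshold`
    from the matched point in the current frame.
    """
    inliers, outliers = [], []
    for prev_pt, cur_pt in matches:
        px, py = project(H, prev_pt)
        error = math.hypot(px - cur_pt[0], py - cur_pt[1])
        if error > threshold:
            outliers.append((prev_pt, cur_pt))
        else:
            inliers.append((prev_pt, cur_pt))
    return inliers, outliers
```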
In some embodiments such as real-time applications or where the computing resources are limited, when the first satisfactory solution is found, the process ends (not shown). In other embodiments, the process performs a certain number of iterations to find a better solution. In these embodiments, the process increments (at 1360) the number of iterations performed by one.
The process then determines (at 1365) whether the maximum allowable iterations are performed. If not, the process updates (at 1370) the threshold for identifying the inliers. For instance, when the solution was acceptable (i.e., the number of inliers was determined at 1350 to be larger than or equal to the acceptable minimum), the process decreases the threshold to fit the inliers in order to find a better estimate for the dominant motion. On the other hand, when the solution was unacceptable (i.e., the number of inliers was determined at 1350 to be less than the acceptable minimum), the process increases the threshold to fit the inliers in order to find an acceptable number of inliers. The process then proceeds to 1315, which was described above.
The process calculates (at 1375) historical metrics for each feature point. For each feature in the current frame, the process stores a historical metric that indicates (i) whether the feature has been an inlier or an outlier in a group of previous frames, (ii) the age of the feature to show in how many previous frames a feature was tracked, and (iii) the projection error of the feature calculated in the previous frame. Calculating and updating the historical metrics for feature points is described in more detail below.
The process then performs a refinement step on the resulting homography, which minimizes the projection error between the sets of matched points between the two frames. The process uses (at 1380) a linear least squares minimizer to refine the homography to reduce the sum of squared errors between the reprojected points and their detected feature matches. The process then uses (at 1385) a nonlinear optimization method to minimize the error. Some embodiments perform Levenberg-Marquardt nonlinear optimization. Other embodiments utilize other non-linear optimization methods such as Broyden-Fletcher-Goldfarb-Shanno (BFGS), scaled conjugate gradient, etc. to minimize the error. Some embodiments perform the nonlinear optimization for several iterations (e.g., until the error is below a predetermined threshold or a certain number of iterations are performed). The process then ends.
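The threshold update at 1370 can be sketched as a single function. The multiplicative factors 0.9 and 1.1 and the function name are hypothetical; the description above only specifies the direction in which the threshold moves.

```python
def update_threshold(threshold, inlier_count, min_inliers, shrink=0.9, grow=1.1):
    """Tighten the inlier threshold after an acceptable solution, loosen it otherwise."""
    if inlier_count >= min_inliers:
        # Acceptable solution: decrease the threshold to refine the dominant motion.
        return threshold * shrink
    # Too few inliers: increase the threshold to admit more points.
    return threshold * grow
```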
E. History and Metrics
Some embodiments maintain historical metrics for feature points identified in each frame to utilize in determining the dominant motion among the frames. Once an initial determination of the inliers and outliers is done (e.g., by performing process 1300 described by reference to
In addition, some embodiments maintain an age for each feature point to indicate in how many previous frames the feature point has been tracked.
Some embodiments also calculate the projection error for each feature point (i.e., how much the feature point moved away from the dominant field of motion).
As shown, the particular feature point had a projection error of 9.21 pixels (as shown by 1805) in the x dimension and 2.78 pixels (as shown by 1810) in the y dimension. This is the error between the projection that was given for the dominant motion and where this feature point is actually mapped. As described by reference to
This information is used to determine whether a feature was moving in the same direction or in a different direction than the dominant motion. Some embodiments maintain the projection error of each feature point for the previous frame only. Other embodiments maintain the projection error of each feature point for more than one previous frame. In these embodiments, data structure 1800 for each feature point is a two dimensional array.
Next, process 1900 updates (at 1910) the age of the feature point. When a feature point is identified for the first time in a frame, the age is set to 1 and is incremented by 1 each time the feature point is matched to a feature point in a future frame. A data structure similar to data structure 1700 is maintained for each feature point identified in a frame.
The process then calculates (at 1915) the projection error of each feature point to show the deviation from the inter-frame dominant motion. The process then ends. A data structure similar to data structure 1800 is maintained for each feature point identified in a frame. Some embodiments maintain the projection error of a feature point over several frames. In these embodiments, data structure 1800 for each feature point is an array of data to store the projection error of the feature point over multiple frames.
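The three historical metrics kept for each feature point can be pictured with a small record type. The class name, field names, and window lengths below are hypothetical and are not the layout of data structures 1700 or 1800; the sketch only illustrates an inlier/outlier history, an age counter that becomes 1 on first detection, and a projection error kept for a configurable number of frames.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class FeatureHistory:
    """Hypothetical per-feature record of inlier history, age, and projection error."""
    age: int = 0
    # True for frames where the point was an inlier, False where it was an outlier.
    inlier_flags: deque = field(default_factory=lambda: deque(maxlen=8))
    # (x_error, y_error) in pixels; maxlen=1 keeps only the previous frame.
    projection_errors: deque = field(default_factory=lambda: deque(maxlen=1))

    def update(self, was_inlier, error_xy):
        """Record one more frame in which the feature was detected or matched."""
        self.age += 1
        self.inlier_flags.append(was_inlier)
        self.projection_errors.append(error_xy)

    def inlier_count(self):
        return sum(1 for flag in self.inlier_flags if flag)

    def outlier_count(self):
        return sum(1 for flag in self.inlier_flags if not flag)
```

Raising the maxlen of projection_errors corresponds to the embodiments that keep the projection error over several frames.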
F. Optimization
Some embodiments perform optimization after all homographies between each pair of consecutive video frames are identified.
The process then selects (at 2010) the first pairwise homography as the current homography. The process then determines (at 2015) whether the confidence level for the corresponding pair of frames is below a predetermined threshold. If not, the process proceeds to 2030, which is described below. Otherwise, when the confidence level is below the threshold, the process determines (at 2020) whether prior and subsequent valid homographies exist. If not, the process proceeds to 2030, which is described below.
Otherwise, the process replaces (at 2025) the current homography with a linear interpolation of the first prior valid homography and the first successive valid homography in time. The process then determines (at 2030) whether all pairwise homographies are examined. If yes, the process ends. Otherwise, the process selects (at 2035) the next homography as the current homography. The process then proceeds to 2015, which was described above.
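The replacement of low-confidence homographies can be sketched as follows. The function name and the element-wise interpolation weighted by the position in time between the two valid neighbors are assumptions for illustration; the description above states only that a linear interpolation of the first prior and first subsequent valid homographies is used.

```python
def repair_homographies(homographies, confidences, min_confidence):
    """Replace low-confidence 3x3 homographies by interpolating valid neighbors.

    homographies is a list of 3x3 nested lists, one per consecutive frame pair.
    Entries without both a prior and a subsequent valid homography are kept as-is.
    """
    valid = [conf >= min_confidence for conf in confidences]
    repaired = list(homographies)
    for i, ok in enumerate(valid):
        if ok:
            continue
        prev_i = next((j for j in range(i - 1, -1, -1) if valid[j]), None)
        next_i = next((j for j in range(i + 1, len(valid)) if valid[j]), None)
        if prev_i is None or next_i is None:
            continue
        t = (i - prev_i) / (next_i - prev_i)
        A, B = homographies[prev_i], homographies[next_i]
        repaired[i] = [[(1.0 - t) * A[r][k] + t * B[r][k] for k in range(3)]
                       for r in range(3)]
    return repaired
```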
The analysis phase ends after the homographies are optimized. At the end of analysis phase, if there are M frames in the video sequence, there will be M−1 homographies, one for each pair of consecutive frames.
II. Stabilization
A. Removing Unwanted Motion
The analysis phase provides a complete chain of homographies between all frames. Some embodiments calculate a smoothed chain of correction homographies. In some embodiments, the amount of smoothing is a scalar user specified parameter, which sets how aggressively the noisy space-time motion trajectory is smoothed/filtered.
The process then smooths the offsets of each corner of a frame by applying (at 2115) a smoothing function to the corner offset of the frame and the corresponding corner offsets of a group of previous and subsequent frames. For instance, some embodiments utilize a kernel length of 60 frames with 30 frames before and 30 frames after the current frame. When there are not enough frames either before or after a frame, some embodiments utilize a kernel with fewer frames. Some embodiments utilize a Gaussian smoothing function as the function to smooth the array of offset points.
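The Gaussian smoothing of a corner offset track can be sketched for a single coordinate. The function name and the choice of sigma as one third of the kernel radius are hypothetical; the same function would be applied to the x and y offsets of each of the four corners. Near the start and end of the sequence the kernel is truncated and renormalized, which corresponds to using a kernel with fewer frames.

```python
import math

def gaussian_smooth(values, radius=30, sigma=None):
    """Smooth a 1D sequence with a truncated, renormalized Gaussian kernel.

    radius is the number of frames used before and after the current frame.
    """
    if radius <= 0:
        return list(values)
    if sigma is None:
        sigma = radius / 3.0  # assumed relationship between radius and sigma
    smoothed = []
    last = len(values) - 1
    for i in range(len(values)):
        lo, hi = max(0, i - radius), min(last, i + radius)
        weights = [math.exp(-((j - i) ** 2) / (2.0 * sigma ** 2))
                   for j in range(lo, hi + 1)]
        total = sum(weights)
        acc = sum(w * values[j] for w, j in zip(weights, range(lo, hi + 1)))
        smoothed.append(acc / total)
    return smoothed
```

The difference between a smoothed track and the original track then gives the correction to apply to the corresponding corner in each frame.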
In some embodiments, the amount of smoothing performed is based on a user selectable parameter.
The user interface includes a control 2215 for adjusting the amount of smoothing for the video sequence. The user interface also includes a control 2220 for enabling and disabling video stabilization. The slider control 2215 is used to indicate the amount of smoothing. In this example, the slider indicates a value from 0-100%, which in this particular embodiment corresponds to a range of 0-6 seconds for the Gaussian smoothing of the corner positions of the frame. In other embodiments, the user interface includes a text field for specifying the range in seconds for frames to be used in the Gaussian smoothing function.
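As a worked example of the slider mapping, the sketch below converts a slider percentage into a number of frames. The function name and the frame rate parameter are hypothetical, and treating the 0-6 second range as the total width of the smoothing window is an assumption of this example.

```python
def slider_to_window_frames(percent, frames_per_second, max_seconds=6.0):
    """Convert a 0-100% slider value to a smoothing window length in frames."""
    percent = max(0.0, min(100.0, percent))
    seconds = (percent / 100.0) * max_seconds
    return round(seconds * frames_per_second)
```

Under these assumptions, at 30 frames per second a slider value of 50% corresponds to 3 seconds, or 90 frames.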
Referring back to
In addition to removing the effects of shaking of the camera, the disclosed geometrically biased historically weighted RANSAC approach reduces the rolling shutter effects. Rolling shutter is a method of image capture where each frame is recorded not from a snapshot taken at a single point in time, but rather by scanning across the frame either vertically or horizontally.
Because rolling shutter distortion is caused by high frequency motion (typically caused by camera shake) and the disclosed motion model can accommodate for the distortions, the smoothing model not only reduces the high-frequency motion to smooth the video, it also reduces the high-frequency changes in distortion that are caused by rolling shutter.
The calculated position of the corners to which the corners of the video frame are moved when applying the homography from the previous frame are labeled as A2, B2, C2, and D2. As shown, the effects of the camera shake are removed from the smoothed frame 2315 after operation 2130 of process 2100 is performed on the image.
In some embodiments, the video sequence is cropped to an inside rectangle. In other embodiments, instead of cropping, the blank portion of each image is filled in by using the information from other frames (e.g., the neighboring frames) or by extrapolating parameter information from points from the adjacent areas of the image. The technique is sometimes referred to as in-painting.
B. Tripod Mode
The smoothing embodiments described above smooth the perceived motion of the camera through the frame. Some embodiments provide a different technique that removes all camera-related motion from the sequence as if the camera is on a tripod. In order to chain back the product of the homographies, some embodiments select a key frame in the video sequence and calculate the difference between each corner of any other frame and the corresponding corner of the key frame. The differences are used to map all frames to the key frame homography to delete all motion related to the dominant motion of the video sequence. In other words, all point positions of inliers are reprojected to the key frame's coordinate system (by producing a product of consecutive homography matrices).
This operation is conceptually similar to stacking up all the frames. In some embodiments, the video sequence is cropped to the inside rectangle. In other embodiments, instead of cropping, the blank portion of each image is filled in by using the information from other frames or by extrapolating the points of the adjacent areas of the image.
The user can view different frames of the video sequence 2515 to identify a frame that the user wants to select as the key frame. The user then selects the frame as the key frame for the tripod mode stabilization by selecting the control 2515.
Referring back to
The process then sets (at 2415) the last frame in the sequence as the current frame. The process then determines (at 2420) whether the current frame is the key frame. If yes, the process proceeds to 2450, which is described below. Otherwise, the process determines (at 2425) whether the current frame is located after the key frame in the sequence of video frames. If not, the process proceeds to 2440, which is described below.
Otherwise, the process computes (at 2430) the product of the inverse homography matrices starting from the inverse homography between the current frame and the immediately preceding frame up to and including the homography between the frame after the key frame and the key frame. The process then applies (at 2435) the product of the inverse homographies to the current frame to remove all motion related to the dominant motion of the video sequence. The process then proceeds to 2450, which is described below.
When the current frame is located before the key frame in the sequence of video frames, the process computes (at 2440) the product of the pairwise homography matrices starting from the homography between the next frame and the current frame and up to and including the homography between the key frame and the frame before the key frame. The process then applies (at 2445) the product of the homographies to the current frame to remove all motion related to the dominant motion of the video sequence.
The process then determines (at 2450) whether the current frame is the first frame in the video sequence. If yes, the process ends. Otherwise, the process sets (at 2455) the frame immediately preceding the current frame as the current frame. The process then proceeds to 2420, which is described above. Although process 2400 is described as starting from the last frame in the sequence in operation 2415 and ending at the first frame in the sequence when operation 2450 is true, a person of ordinary skill in the art will realize that the process can be implemented by starting from the first frame in the sequence and ending at the last frame in the sequence.
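The chaining of homographies toward the key frame can be sketched as follows. The function names and the indexing convention are assumptions made for this example: homographies[i] is taken to map frame i to frame i+1, so a sequence of M frames has M−1 entries, and the returned matrix maps the coordinates of the given frame into the key frame's coordinate system.

```python
def matmul(A, B):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]

def inverse(M):
    """Inverse of a 3x3 matrix via the adjugate (assumes a nonzero determinant)."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def transform_to_key_frame(homographies, frame_index, key_index):
    """Chain pairwise homographies so that a frame is mapped onto the key frame."""
    T = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    if frame_index > key_index:
        # Frame after the key frame: chain inverse homographies back to the key frame.
        for j in range(frame_index - 1, key_index - 1, -1):
            T = matmul(inverse(homographies[j]), T)
    else:
        # Frame before the key frame: chain forward homographies up to the key frame.
        for j in range(frame_index, key_index):
            T = matmul(homographies[j], T)
    return T
```

For the key frame itself the loop bodies do not run and the identity is returned, which matches the process skipping directly to 2450 for that frame.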
III. Electronic System
Many of the above-described features and applications are implemented as software processes that are specified as a set of instructions recorded on a computer readable storage medium (also referred to as computer readable medium, machine readable medium, machine readable storage). When these instructions are executed by one or more computational or processing unit(s) (e.g., one or more processors, cores of processors, or other processing units), they cause the processing unit(s) to perform the actions indicated in the instructions. Examples of computer readable media include, but are not limited to, CD-ROMs, flash drives, random access memory (RAM) chips, hard drives, erasable programmable read only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), etc. The computer readable media do not include carrier waves and electronic signals passing wirelessly or over wired connections.
In this specification, the term “software” is meant to include firmware residing in read-only memory or applications stored in magnetic storage, which can be read into memory for processing by a processor. Also, in some embodiments, multiple software inventions can be implemented as sub-parts of a larger program while remaining distinct software inventions. In some embodiments, multiple software inventions can also be implemented as separate programs. Finally, any combination of separate programs that together implement a software invention described here is within the scope of the invention. In some embodiments, the software programs, when installed to operate on one or more electronic systems, define one or more specific machine implementations that execute and perform the operations of the software programs.
The bus 2605 collectively represents all system, peripheral, and chipset buses that communicatively connect the numerous internal devices of the electronic system 2600. For instance, the bus 2605 communicatively connects the processing unit(s) 2610 with the read-only memory 2630, the GPU 2615, the system memory 2620, and the permanent storage device 2635.
From these various memory units, the processing unit(s) 2610 retrieves instructions to execute and data to process in order to execute the processes of the invention. The processing unit(s) may be a single processor or a multi-core processor in different embodiments. Some instructions are passed to and executed by the GPU 2615. The GPU 2615 can offload various computations or complement the image processing provided by the processing unit(s) 2610.
The read-only-memory (ROM) 2630 stores static data and instructions that are needed by the processing unit(s) 2610 and other modules of the electronic system. The permanent storage device 2635, on the other hand, is a read-and-write memory device. This device is a non-volatile memory unit that stores instructions and data even when the electronic system 2600 is off. Some embodiments of the invention use a mass-storage device (such as a magnetic or optical disk and its corresponding disk drive, integrated flash memory) as the permanent storage device 2635.
Other embodiments use a removable storage device (such as a floppy disk, flash memory device, etc., and its corresponding drive) as the permanent storage device. Like the permanent storage device 2635, the system memory 2620 is a read-and-write memory device. However, unlike storage device 2635, the system memory 2620 is a volatile read-and-write memory, such as a random access memory. The system memory 2620 stores some of the instructions and data that the processor needs at runtime. In some embodiments, the invention's processes are stored in the system memory 2620, the permanent storage device 2635, and/or the read-only memory 2630. For example, the various memory units include instructions for processing multimedia clips in accordance with some embodiments. From these various memory units, the processing unit(s) 2610 retrieves instructions to execute and data to process in order to execute the processes of some embodiments.
The bus 2605 also connects to the input and output devices 2640 and 2645. The input devices 2640 enable the user to communicate information and select commands to the electronic system. The input devices 2640 include alphanumeric keyboards and pointing devices (also called “cursor control devices”), cameras (e.g., webcams), microphones or similar devices for receiving voice commands, etc. The output devices 2645 display images generated by the electronic system or otherwise output data. The output devices 2645 include printers and display devices, such as cathode ray tubes (CRT) or liquid crystal displays (LCD), as well as speakers or similar audio output devices. Some embodiments include devices such as a touchscreen that function as both input and output devices.
Finally, as shown in
Some embodiments include electronic components, such as microprocessors, storage and memory that store computer program instructions in a machine-readable or computer-readable medium (alternatively referred to as computer-readable storage media, machine-readable media, or machine-readable storage media). Some examples of such computer-readable media include RAM, ROM, read-only compact discs (CD-ROM), recordable compact discs (CD-R), rewritable compact discs (CD-RW), read-only digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), a variety of recordable/rewritable DVDs (e.g., DVD-RAM, DVD-RW, DVD+RW, etc.), flash memory (e.g., SD cards, mini-SD cards, micro-SD cards, etc.), magnetic and/or solid state hard drives, read-only and recordable Blu-Ray® discs, ultra density optical discs, any other optical or magnetic media, and floppy disks. The computer-readable media may store a computer program that is executable by at least one processing unit and includes sets of instructions for performing various operations. Examples of computer programs or computer code include machine code, such as is produced by a compiler, and files including higher-level code that are executed by a computer, an electronic component, or a microprocessor using an interpreter.
While the above discussion primarily refers to microprocessor or multi-core processors that execute software, some embodiments are performed by one or more integrated circuits, such as application specific integrated circuits (ASICs) or field programmable gate arrays (FPGAs). In some embodiments, such integrated circuits execute instructions that are stored on the circuit itself. In addition, some embodiments execute software stored in programmable logic devices (PLDs), ROM, or RAM devices.
As used in this specification and any claims of this application, the terms “computer”, “server”, “processor”, and “memory” all refer to electronic or other technological devices. These terms exclude people or groups of people. For the purposes of the specification, the terms display or displaying mean displaying on an electronic device. As used in this specification and any claims of this application, the terms “computer readable medium,” “computer readable media,” and “machine readable medium” are entirely restricted to tangible, physical objects that store information in a form that is readable by a computer. These terms exclude any wireless signals, wired download signals, and any other ephemeral signals.
While the invention has been described with reference to numerous specific details, one of ordinary skill in the art will recognize that the invention can be embodied in other specific forms without departing from the spirit of the invention. In addition, a number of the figures (including
The present Application claims the benefit of U.S. Provisional Patent Application 61/832,750, entitled, “Robust Image Feature Based Video Stabilization and Smoothing,” filed Jun. 7, 2013. The content of U.S. Provisional application 61/832,750 is hereby incorporated by reference.
Number | Date | Country | |
---|---|---|---|
20140362240 A1 | Dec 2014 | US |
Number | Date | Country | |
---|---|---|---|
61832750 | Jun 2013 | US |