The present invention relates to the field of image processing. More specifically, the present invention relates to motion estimation.
Motion estimation is the process of determining motion vectors that describe the transformation from one image to another, usually between adjacent frames in a video sequence. The motion vectors may relate to the whole image (global motion estimation) or to specific parts, such as rectangular blocks, arbitrarily shaped patches or even individual pixels. The motion vectors may be represented by a translational model or by one of many other models that are able to approximate the motion of a real video camera, such as rotation and translation in all three dimensions and zoom.
Applying the motion vectors to an image to synthesize the transformation to the next image is called motion compensation. The combination of motion estimation and motion compensation is a key part of video compression as used by MPEG-1, MPEG-2 and MPEG-4 as well as many other video codecs.
A technique for estimating background motion in monocular video sequences is described herein. The technique is based on occlusion information contained in video sequences. Two algorithms are described for estimating background motion: one fits well for general cases, and the other fits well for a case when available memory is very limited. The significance of the technique includes: 1) a motion segmentation algorithm with an adaptive and temporally stable estimate of the number of objects, 2) two algorithms to infer occlusion relations among segmented objects using the detected occlusions and 3) background motion estimation from the inferred occlusion relations.
In one aspect, a method of motion estimation programmed in a memory of a device comprises performing motion segmentation to segment an image into different objects using motion vectors to obtain a segmentation result, generating an occlusion matrix using the segmentation result, occluded pixel information and image data and estimating background motion using the occlusion matrix. The occlusion matrix is of size K×K, wherein K is a number of objects in the image. Each entry in the occlusion matrix represents the number of pixels by which one segment is occluded by another segment. Estimating the background motion includes finding the background object. The device is selected from the group consisting of a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a handheld computer, a personal digital assistant, a cellular/mobile telephone, a smart appliance, a gaming console, a digital camera, a digital camcorder, a camera phone, a smart phone, a portable music player, a tablet computer, a mobile device, a video player, a video disc writer/player, a television, and a home entertainment system.
In another aspect, a method of motion segmentation programmed in a memory of a device comprises generating a histogram using input motion vectors, performing K-means clustering with different numbers of clusters and generating a cost, determining a number of clusters using the cost, computing a centroid of each cluster and clustering a motion vector at each pixel with a nearest centroid, wherein the clustered motion vectors and nearest centroids segment a frame into objects. A number of the segments is not fixed. A temporally stable estimation of the number of clusters is developed. A Bayesian approach for estimation is used. The device is selected from the group consisting of a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a handheld computer, a personal digital assistant, a cellular/mobile telephone, a smart appliance, a gaming console, a digital camera, a digital camcorder, a camera phone, a smart phone, a portable music player, a tablet computer, a mobile device, a video player, a video disc writer/player, a television, and a home entertainment system.
In another aspect, a method of occlusion relation inference programmed in a memory of a device comprises finding a first corresponding motion segment of an occluded object, finding a pixel location in the next frame, finding a second corresponding motion segment of an occluding object, incrementing an entry in an occlusion matrix and repeating the steps until all occlusion pixels have been traversed. The entry represents the number of pixels by which the first segment is occluded by the second segment. The device is selected from the group consisting of a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a handheld computer, a personal digital assistant, a cellular/mobile telephone, a smart appliance, a gaming console, a digital camera, a digital camcorder, a camera phone, a smart phone, a portable music player, a tablet computer, a mobile device, a video player, a video disc writer/player, a television, and a home entertainment system.
In another aspect, a method of occlusion relation inference programmed in a memory of a device comprises using a sliding window to locate occlusion regions and neighboring regions, moving the window if there are no occluded pixels in the window, computing a first luminance histogram at the occluded pixels, computing a second luminance histogram for each motion segment inside the window, comparing the first luminance histogram and the second luminance histogram, identifying a first motion segment with a closest luminance histogram to an occlusion region as a background object in the window, identifying a second motion segment with the most pixels among all but the background motion segment as an occluding, foreground object, incrementing an entry in an occlusion matrix by the number of pixels in the occlusion region in the window and repeating the steps until an entire frame has been traversed. The device is selected from the group consisting of a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a handheld computer, a personal digital assistant, a cellular/mobile telephone, a smart appliance, a gaming console, a digital camera, a digital camcorder, a camera phone, a smart phone, a portable music player, a tablet computer, a mobile device, a video player, a video disc writer/player, a television, and a home entertainment system.
In another aspect, a method of background motion estimation programmed in a memory of a device comprises designing a metric to measure an amount of contradiction when selecting a motion segment as a background object, assigning the background motion to be the motion of the segment with a minimum amount of contradiction and subtracting the background motion of the background object from motion vectors to obtain a depth map. The method further comprises determining whether the number of occluded pixels is below a first threshold, the minimum contradiction is above a second threshold, or the total number of occlusion pixels is below a third threshold, and if so, assigning the background object to be the largest segment and the corresponding motion to be the background motion. The device is selected from the group consisting of a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a handheld computer, a personal digital assistant, a cellular/mobile telephone, a smart appliance, a gaming console, a digital camera, a digital camcorder, a camera phone, a smart phone, a portable music player, a tablet computer, a mobile device, a video player, a video disc writer/player, a television, and a home entertainment system.
In another aspect, an apparatus comprises a video acquisition component for acquiring a video, a memory for storing an application, the application for: performing motion segmentation to segment an image of the video into different objects using motion vectors to obtain a segmentation result, generating an occlusion matrix using the segmentation result, occluded pixel information and image data and estimating background motion using the occlusion matrix and a processing component coupled to the memory, the processing component configured for processing the application. The occlusion matrix is of size K×K, wherein K is a number of objects in the image. Each entry in the occlusion matrix represents the number of pixels by which one segment is occluded by another segment. Estimating the background motion includes finding the background object.
A technique for estimating background motion in monocular video sequences is described herein. The technique is based on occlusion information contained in video sequences. Two algorithms are described for estimating background motion: one fits well for general cases, and the other fits well for a case when available memory is very limited. The second algorithm is tailored toward platforms where memory usage is heavily constrained, so that a low-cost implementation of background motion estimation is made possible.
Background motion estimation is very important in many applications, such as depth map generation, moving object detection, background subtraction, video surveillance and other applications. For example, a popular method of generating depth maps for monocular video is to compute motion vectors and subtract the background motion from the motion vectors. The remaining magnitudes of the motion vectors then serve as the depth values. Oftentimes, global motion is used instead of background motion to accomplish such tasks. Global motion accounts for the motion of the majority of pixels in the image. In cases where background pixels are fewer than foreground pixels, global motion is not equal to background motion.
Occlusion is one of the most straightforward cues for inferring relative depth between objects. If object A is occluded by object B, then object A is behind object B. Thus, background motion is able to be estimated from the relative occlusion relations among objects. The primary problem then becomes determining which object occludes which object. In video sequences, it is possible to detect occlusion regions. Occlusion regions refer to either covered regions, which appear in the current frame but will disappear in the next frame due to occlusion by relatively closer objects, or uncovered regions, which appear in the current frame but were not visible in the previous frame due to the movement of occluding objects. Occlusion regions, both covered and uncovered, belong to the occluded objects. If occlusion regions are able to be associated with certain objects, then the occluded objects are able to be found. Therefore, the frame is first segmented into different objects. Then, given the covered and uncovered pixel locations, algorithms are developed to infer occlusion relations among the objects. Finally, from the estimated occlusion relations, the background motion is estimated.
There are various methods of segmenting an image into different objects or segments based on motion vectors. In order to achieve fast computation and reduce memory usage, K-means clustering is used for motion segmentation. The K-means clustering algorithm is a technique for cluster analysis which partitions n observations into a fixed number of clusters K, so that each observation $v_j$ belongs to the cluster with the nearest centroid $c_i$. K-means clustering works by minimizing the following cost function:

$$\Phi_K = \sum_{i=1}^{K} \sum_{v_j \in S_i} \| v_j - c_i \|^2, \tag{1}$$

where $S_i$ denotes the set of observations assigned to cluster $i$.
The K-means clustering algorithm is used to perform the motion segmentation, with some modifications. First, the number of clusters/segments K is not fixed; an algorithm is used to estimate the number of segments in order to make it adaptive. In addition, in order to avoid large variation in segmentation results between consecutive frames, a temporal stabilization mechanism is used. Once the number of segments/clusters is determined, K-means clustering is used to find the centroids of these clusters or segments. Then, the motion vector at each pixel is clustered to the nearest centroid in Euclidean distance to complete the motion segmentation.
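By way of illustration, a minimal sketch of this segmentation step is given below in Python. The array layout, function name and parameters such as the iteration count are assumptions made for the example, not requirements of the described method; the routine simply clusters the per-pixel motion vectors with plain K-means and reports the clustering cost $\Phi_K$ of Equation (1).

```python
import numpy as np

def kmeans_motion_segmentation(mv, k, iters=20, seed=0):
    """Segment a frame by clustering per-pixel motion vectors with K-means.

    mv: float array of shape (H, W, 2) holding (vx, vy) per pixel (assumed layout).
    Returns (labels, centroids, cost): labels is (H, W) with values in [0, k),
    centroids is (k, 2), and cost is the K-means cost Phi_K of Equation (1).
    """
    h, w, _ = mv.shape
    pts = mv.reshape(-1, 2).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Initialize centroids from k randomly chosen motion vectors.
    centroids = pts[rng.choice(len(pts), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each motion vector to the nearest centroid (Euclidean distance).
        d2 = ((pts[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centroid as the mean of the vectors assigned to it.
        for i in range(k):
            members = pts[labels == i]
            if len(members) > 0:
                centroids[i] = members.mean(axis=0)
    cost = ((pts - centroids[labels]) ** 2).sum()
    return labels.reshape(h, w), centroids, cost
```

Running this routine for several candidate values of K yields the costs consumed by the cluster-number estimate described next.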
In order to make the estimate of the number of clusters temporally stable, a Bayesian approach for estimation is used, with the prior probability obtained from a prediction based on the posterior probability in previous frames. The Bayesian approach computes the maximum a posteriori estimate of the number of clusters. The posterior probability of the number of clusters $k_n$ in the current frame given the observations (motion vectors) in the current frame and all previous frames $z_{1:n}$ is able to be computed as:

$$P(k_n \mid z_{1:n}) = \frac{P(z_n \mid k_n)\, P(k_n \mid z_{1:n-1})}{P(z_n \mid z_{1:n-1})}. \tag{2}$$
The estimate of the number of clusters is the value $k_n$ which maximizes $P(k_n \mid z_{1:n})$. The denominator $P(z_n \mid z_{1:n-1})$ is constant for all values of $k_n$, so maximizing $P(k_n \mid z_{1:n})$ is equivalent to maximizing the numerator. The conditional probability $P(z_n \mid k_n)$ is able to be modeled as a decreasing function of a cost function $\Psi(z_n, k_n)$:

$$P(z_n \mid k_n) \propto e^{-\Psi(z_n, k_n)}, \qquad \Psi(z_n, k_n) = \Phi_{k_n} + \lambda k_n, \tag{3}$$
where $\Phi_{k_n}$ is the K-means clustering cost function of Equation (1) and is a function of the number of clusters $k_n$ and the observations (motion vectors) $z_n$ of the current frame n. The cost function $\Psi(z_n, k_n)$ balances the number of clusters against the cost due to clustering. More clusters result in a smaller clustering cost because of the finer partition of the observations, but too many clusters may not help. So the combination of the cost and the number of clusters, weighted by $\lambda$, determines the final cost function. A smaller cost means a higher probability, and the conditional probability is constructed so that it is a decreasing function of the cost function. The second term $P(k_n \mid z_{1:n-1})$ is able to be computed as:

$$P(k_n \mid z_{1:n-1}) = \sum_{k_{n-1}} P(k_n \mid k_{n-1})\, P(k_{n-1} \mid z_{1:n-1}), \tag{4}$$
where $P(k_n \mid k_{n-1})$ is the state transition probability, and $P(k_{n-1} \mid z_{1:n-1})$ is the posterior probability computed from the previous frame. The state transition probability is able to be predefined. A simple form is used to speed up computation:

$$P(k_n \mid k_{n-1}) = 2^{-|k_n - k_{n-1}|}. \tag{5}$$
With the posterior probability computed as in Equation (2), the number of clusters is estimated as the value $k_n$ which has the maximum posterior probability, i.e.:

$$\hat{k}_n = \arg\max_{k_n} P(k_n \mid z_{1:n}).$$
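The following sketch illustrates one way Equations (2) through (5) are able to be combined per frame. It is a simplified example: the candidate cluster counts, the weight λ and the cost shift applied for numerical stability are illustrative assumptions, not values taken from the specification.

```python
import numpy as np

def estimate_num_clusters(costs, prev_posterior, k_values, lam=1.0):
    """Choose the number of clusters k_n maximizing the posterior of Equation (2).

    costs:          dict mapping each candidate k to its K-means cost Phi_k on the
                    current frame (e.g., from kmeans_motion_segmentation above).
    prev_posterior: array over k_values holding P(k_{n-1} | z_{1:n-1}).
    Returns (k_hat, posterior); the posterior seeds the next frame's estimate.
    """
    k_values = np.asarray(k_values)
    # Likelihood: a decreasing function of Psi = Phi_k + lam * k (Equation (3)).
    psi = np.array([costs[int(k)] + lam * k for k in k_values], dtype=np.float64)
    likelihood = np.exp(-(psi - psi.min()))  # shifted for numerical stability
    # Transition matrix of Equation (5), normalized so each column sums to 1.
    trans = 2.0 ** -np.abs(k_values[:, None] - k_values[None, :])
    trans = trans / trans.sum(axis=0, keepdims=True)
    # Prediction step of Equation (4): P(k_n | z_{1:n-1}).
    prior = trans @ prev_posterior
    posterior = likelihood * prior
    posterior = posterior / posterior.sum()  # the denominator of Equation (2)
    return int(k_values[posterior.argmax()]), posterior
```

On the first frame, prev_posterior is able to be initialized uniformly over the candidate values.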
After the number of clusters or segments has been estimated, a K-means clustering technique is used to cluster the motion vectors at each pixel. The centroid of each cluster will be computed, and the motion vector at each pixel is able to be clustered with the closest centroid. Then, motion segmentation is achieved. The entire frame is segmented into K objects.
From available occlusion detection results, it is able to be determined which pixels in the current frame will be covered in the next frame and which pixels in the current frame were covered in the previous frame (i.e., are uncovered in the current frame). The known fact is that occlusion pixels belong to occluded objects.
To simplify notation, Vx12 and Vy12 are used to denote the horizontal and vertical motion vectors from frame n−1 to frame n, and Vx21 and Vy21 are used to denote the horizontal and vertical motion vectors from frame n to frame n−1. Similarly, Vx23 and Vy23 are used to denote the horizontal and vertical motion vectors from frame n to frame n+1, and Vx32 and Vy32 are used to denote the horizontal and vertical motion vectors from frame n+1 to frame n. If a pixel (x,y) on frame n is identified as a covered pixel, then Vx21(x,y) and Vy21(x,y) are used to cluster (x,y) into one of the motion segments i, and this segment i is identified as the occluded object. In addition, the pixel (x′,y′)=(x,y)−(Vx21(x,y), Vy21(x,y)) on frame n+1 is analyzed. The motion vectors Vx32(x′,y′) and Vy32(x′,y′) are used to cluster (x′,y′) into one of the motion segments j, and this segment j is identified as the occluding object. Entry (i,j) in the occlusion matrix O is then incremented by 1. All of the occlusion pixels are traversed in order to obtain the final occlusion matrix O. The algorithm description is shown in FIG. 5.
In the step 500, a corresponding motion segment i using Vx21 and Vy21 is found. In the step 502, a pixel location in the next frame (x′,y′)=(x,y)−(Vx21(x,y), Vy21(x,y)) is found. In the step 504, a corresponding motion segment j of (x′, y′) using Vx32 and Vy32 is found. In the step 506, entry (i,j) in the occlusion matrix O is incremented by 1. In the step 508, it is determined if all occlusion pixels (x, y) have been traversed. If all occlusion pixels (x, y) have been traversed, then the occlusion matrix O is completed. If all occlusion pixels (x, y) have not been traversed, then the process returns to the step 500. In some embodiments, the order of the steps is modified. In some embodiments, more or fewer steps are implemented.
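A sketch of this loop for covered pixels is shown below; uncovered pixels are handled analogously with the corresponding motion fields. The helper nearest_segment applies the nearest-centroid rule described above, and the array layouts and the rounding of projected coordinates are assumptions of the example, not requirements of the method.

```python
import numpy as np

def nearest_segment(vx, vy, centroids):
    """Index of the centroid closest to motion vector (vx, vy)."""
    d2 = (centroids[:, 0] - vx) ** 2 + (centroids[:, 1] - vy) ** 2
    return int(d2.argmin())

def build_occlusion_matrix(covered, v21, v32, centroids):
    """Accumulate the K x K occlusion matrix O over all covered pixels.

    covered: (H, W) bool mask of pixels in frame n covered in frame n+1.
    v21:     (H, W, 2) motion from frame n to frame n-1.
    v32:     (H, W, 2) motion from frame n+1 to frame n.
    O[i, j] counts pixels of occluded segment i hidden by occluding segment j.
    """
    h, w = covered.shape
    O = np.zeros((len(centroids), len(centroids)), dtype=np.int64)
    ys, xs = np.nonzero(covered)
    for y, x in zip(ys, xs):
        # Step 500: occluded segment i from the trusted backward motion.
        i = nearest_segment(v21[y, x, 0], v21[y, x, 1], centroids)
        # Step 502: project the pixel into frame n+1 by negating v21.
        xp = int(np.rint(x - v21[y, x, 0]))
        yp = int(np.rint(y - v21[y, x, 1]))
        if not (0 <= xp < w and 0 <= yp < h):
            continue  # projected location falls outside the frame; skip it
        # Step 504: occluding segment j from the motion of frame n+1.
        j = nearest_segment(v32[yp, xp, 0], v32[yp, xp, 1], centroids)
        # Step 506: record that segment i is occluded by segment j.
        O[i, j] += 1
    return O
```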
The algorithm described in the section above uses motion vectors to associate occlusion pixels with motion segments. Both forward and backward motion vectors between three consecutive frames are stored, which is a total of eight frames of motion vectors. In cases where memory is limited and very expensive to use, the previous algorithm may not be appropriate. In this section, an algorithm that uses a small amount of memory is described. The primary reason many frames of motion vectors need to be stored is that the motion at occluded pixels cannot be trusted, so motion from adjacent frames is used as a substitute. However, instead of using motion to associate occluded pixels with motion segments, appearance is able to be used. It is assumed that the occluded region belongs to the segment with the most similar appearance. Appearance usually refers to luminance, color and texture properties, but in order to make the algorithm cost effective, only the luminance property is used herein, although color and texture properties are also able to be used to provide better performance. A luminance histogram is used to measure the similarity between regions.

Sliding windows are used to locate occlusion regions and their neighboring regions. A multi-scale sliding window is used to traverse the image. In order to save memory and computation, the multiple scales apply only to the width of the window. In other words, the height of the window is fixed, and only the width is varied to account for different scales, so only a fixed number of lines need to be stored instead of the whole frame. As the sliding window moves across the image, if there are no occluded pixels inside the window, then the window is moved to the next position. Otherwise, the luminance histogram of the occluded pixels is computed. For the other pixels inside the window, pixels belonging to the same motion segment are grouped together, and a luminance histogram for each motion segment inside the window is constructed. The luminance histogram of the occlusion region is compared to the luminance histograms of the motion segments. The motion segment i with the closest luminance histogram to the occlusion region is identified as the background object in that window. The motion segment j with the most pixels among all but the background motion segment is identified as the occluding/foreground object. Then, entry (i,j) in the occlusion matrix O is incremented by the number of pixels in the occlusion region inside the sliding window.

Some criteria are able to be used to remove outliers; for example, the number of occluding pixels and occluded pixels in a sliding window has to be over a certain threshold, and the level of similarity between histograms has to be over a certain value. After the multi-scale sliding windows traverse the entire frame, the final occlusion matrix O is obtained to infer the occlusion relations among the motion segments or objects.
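The following sketch illustrates the processing of a single window position in this low-memory variant; the outer loop sliding the window across positions and width scales is omitted. The histogram-intersection similarity, the bin count and the outlier thresholds are illustrative choices made for the example, not values prescribed by the specification.

```python
import numpy as np

def luma_hist(values, bins=32):
    """Normalized luminance histogram (sums to 1)."""
    h, _ = np.histogram(values, bins=bins, range=(0, 256))
    return h / max(h.sum(), 1)

def process_window(luma, labels, occluded, O, min_pixels=20, min_sim=0.5):
    """One position of the sliding window in the low-memory variant.

    luma:     (h, w) window luminance (0..255).
    labels:   (h, w) motion-segment index per pixel in the window.
    occluded: (h, w) bool mask of occlusion pixels in the window.
    O:        K x K occlusion matrix, updated in place.
    """
    n_occ = int(occluded.sum())
    if n_occ < min_pixels:
        return  # too few occluded pixels: move the window to the next position
    h_occ = luma_hist(luma[occluded])
    best_i, best_sim, counts = None, -1.0, {}
    for s in np.unique(labels[~occluded]):
        mask = (labels == s) & ~occluded
        counts[int(s)] = int(mask.sum())
        # Histogram intersection as the similarity between the two regions.
        sim = np.minimum(h_occ, luma_hist(luma[mask])).sum()
        if sim > best_sim:
            best_i, best_sim = int(s), sim
    if best_i is None or best_sim < min_sim or len(counts) < 2:
        return  # outlier window: insufficient similarity or only one segment
    # Occluding/foreground object: largest non-background segment in the window.
    best_j = max((s for s in counts if s != best_i), key=counts.get)
    O[best_i, best_j] += n_occ
```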
Once the occlusion matrix O is obtained, the background motion is able to be estimated. In the depth estimation application, the background motion is subtracted from the motion vectors to obtain the depth map. A miscalculated background motion will produce wrong relative depth between objects and will contradict the occlusion relations described in the occlusion matrix O. The contradiction is quantified based on the occlusion matrix O. One of the motion segments is chosen as the background object, and the motion of that background object is the background motion. If object k is chosen as the background object, then the depth at each object i is computed as $d_i = \| v_i - v_k \|$. The contradiction from the pair (i, j) is then
$$C_{k,(i,j)} = \max(O_{i,j} - O_{j,i},\, 0)\, I(d_j - d_i) + \max(O_{j,i} - O_{i,j},\, 0)\, I(d_i - d_j), \tag{6}$$

where

$$I(x) = \begin{cases} 1, & x < 0, \\ 0, & x \ge 0, \end{cases}$$
and a large d means close while a small d means far. The total contradiction when assuming $v_k$ as the background motion is able to be computed as:

$$C_k = \sum_{i < j} C_{k,(i,j)}. \tag{7}$$
The background motion is assigned to be the motion that leads to the minimum amount of contradiction $C_k$. However, if the number of occluded pixels is small, if the minimum contradiction is still too large, or if the total number of occlusion pixels is too small to have any statistical significance, then the largest segment is assigned to be the background object, and the corresponding motion is assigned to be the background motion.
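A sketch combining Equations (6) and (7) with this fallback rule follows. The thresholds min_occ and max_contradiction are illustrative assumptions; the specification leaves the particular threshold values open.

```python
import numpy as np

def estimate_background_motion(O, centroids, seg_sizes,
                               min_occ=100, max_contradiction=0.5):
    """Select the background segment by minimizing contradiction (Eqs. (6)-(7)).

    O:          K x K occlusion matrix (O[i, j]: segment i occluded by segment j).
    centroids:  (K, 2) motion of each segment.
    seg_sizes:  pixel count of each segment, used by the fallback rule.
    Returns (k_bg, v_bg): background segment index and background motion.
    """
    K = len(centroids)
    total_occ = int(O.sum())
    contr = np.zeros(K)
    for k in range(K):
        # Depth under the hypothesis "segment k is background": large d = close.
        d = np.linalg.norm(centroids - centroids[k], axis=1)
        for i in range(K):
            for j in range(i + 1, K):
                net = O[i, j] - O[j, i]
                # Net evidence says i is behind j, but the depth says otherwise.
                if net > 0 and d[j] < d[i]:
                    contr[k] += net
                elif net < 0 and d[i] < d[j]:
                    contr[k] += -net
    k_bg = int(contr.argmin())
    # Fallback: too little occlusion evidence, or contradiction still too large.
    if total_occ < min_occ or contr[k_bg] > max_contradiction * max(total_occ, 1):
        k_bg = int(np.argmax(seg_sizes))
    return k_bg, centroids[k_bg]
```

Given the returned background motion v_bg, the per-pixel depth of the following paragraph is obtained as the magnitude of the motion vector with the background motion removed, e.g., depth = np.linalg.norm(mv - v_bg, axis=2).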
In depth estimation in monocular video sequences, motion vectors are first estimated, and then background motion is subtracted from these motion vectors to obtain the depth map.
In some embodiments, the occlusion-based background motion estimation application(s) 830 include several applications and/or modules. In some embodiments, modules include one or more sub-modules as well. In some embodiments, fewer or additional modules are able to be included.
Examples of suitable computing devices include a personal computer, a laptop computer, a computer workstation, a server, a mainframe computer, a handheld computer, a personal digital assistant, a cellular/mobile telephone, a smart appliance, a gaming console, a digital camera, a digital camcorder, a camera phone, a smart phone, a portable music player, a tablet computer, a mobile device, a video player, a video disc writer/player (e.g., DVD writer/player, Blu-ray® writer/player), a television, a home entertainment system or any other suitable computing device.
To utilize the occlusion-based background motion estimation method, a user acquires a video/image such as on a digital camcorder, and before, during or after the content is acquired, the occlusion-based background motion estimation method automatically performs motion estimation on the data. The occlusion-based background motion estimation occurs automatically without user involvement.
In operation, the occlusion-based background motion estimation method is very useful in many applications, for example depth map generation, background subtraction, video surveillance and other applications. The significance of the background motion estimation method includes: 1) a motion segmentation algorithm with adaptive and temporally stable estimate of the number of objects is developed, 2) two algorithms are developed to infer occlusion relations among segmented objects using the detected occlusions and 3) background motion estimation from the inferred occlusion relations.
The present invention has been described in terms of specific embodiments incorporating details to facilitate the understanding of principles of construction and operation of the invention. Such reference herein to specific embodiments and details thereof is not intended to limit the scope of the claims appended hereto. It will be readily apparent to one skilled in the art that other various modifications may be made in the embodiment chosen for illustration without departing from the spirit and scope of the invention as defined by the claims.