The invention relates to video analysis and more specifically to the recognition of moving features in a video based upon their motion.
The recognition of moving features such as people in videos has a number of applications. For example, the recognition of pedestrians can be used to control automated braking systems in cars.
In order to recognise a moving feature in a video, the motion of the feature through the frames of the video must be determined and then this motion is compared with the expected motion of that feature.
The motion or trajectory of a feature can be considered as the position of the feature in each of a number of frames of a video. More accurate results for the recognition can be obtained when longer trajectories are available. It is possible that the feature may be obscured in some of the frames of the video by occlusion. This will limit the length of any trajectory returned for such a feature. There may also be errors in matching features between frames that can also limit the lengths of trajectories for features.
The selection of the characteristics of the expected motion of a feature also presents a challenge. Characteristics of the motion that allow accurate classification are those that discriminate sharply between the presence and the absence of the feature.
It is an object of the present invention to address the issues discussed above.
According to an aspect of the present invention, there is provided a method of recognising a moving feature in a video sequence. The moving feature has a characteristic pattern of motion over the video sequence. The video sequence comprises a first frame and a plurality of following frames.
The method comprises locating points of interest in each frame of the video sequence. The points of interest potentially correspond to the moving feature. In order to construct a trajectory for a feature in the first frame, correspondences between points of interest in a following frame and the frame preceding it are determined. It is possible that for a given point of interest in one frame, more than one point of interest in a following frame corresponding to it may be used. Using the correspondences between points of interest, trajectories for the points of interest in the first frame are constructed. Because of the possibility of multiple correspondences between points of interest between frames, more than one possible trajectory for a point of interest in the first frame may be found.
Once a plurality of trajectories for the points of interest in the first frame has been constructed, they are compared with the characteristic pattern of motion. Based on this comparison, a feature in the first frame can be recognised.
The method allows flexibility in the matching of points of interest between frames. Some of the plurality of trajectories may not track the same physical feature across all of the frames. However, because such a trajectory is not likely to exhibit the characteristic pattern, it will be excluded when the comparison with the characteristic pattern takes place. The flexibility is advantageous as when an object is obscured by occlusion, trajectories will still be generated even for the frames where the object is obscured. Further, when there is more than one potential match for an object across a pair of frames, these can all be included in the plurality of trajectories.
The first frame can be either the start of the video sequence or its end.
According to an embodiment of the present invention, the points of interest are matched between frames by considering the video sequence as a sequence of overlapping pairs of frames. In each pair of frames, the probability that each of a set of points of interest in the second frame of the pair corresponds to the same feature as a point of interest in the first frame of the pair is calculated.
According to an embodiment of the present invention, the probability that two points of interest in different frames correspond to the same feature is based on the spatial distance between the points of interest or the appearance similarity between the points of interest.
According to an embodiment of the present invention, the plurality of trajectories are determined by constructing an acyclic graph with nodes representing the points of interest in the frames and edges linking points of interest that correspond to the same feature in subsequent frames, and then traversing the graph.
According to an embodiment of the present invention, traversing the acyclic graph comprises making a probabilistic selection of edges at each node, based on an augmented probability. The augmented probability can include a factor depending on the traceable depth of the trajectory. This makes it possible to favour long trajectories that are likely to allow a more certain determination of whether the indicative relationship exists. The augmented probability can include a factor based on the conservation of a physical property such as speed. This makes it possible to favour trajectories that exhibit properties expected of physical objects and thus makes it possible to exclude unphysical trajectories.
According to an embodiment of the present invention, the trajectories are compared to the characteristic pattern using a random decision forest.
According to an embodiment of the present invention, the characteristic pattern is a correlation between two trajectories from the plurality of trajectories.
According to embodiments of the present invention, trajectories are analysed and classified based on correlations that may exist between pairs of trajectories.
The characteristic motion may be walking motion. There are a number of correlations that exist between the motions of the two feet of a walking pedestrian.
According to an embodiment of the present invention, first and second trajectories making up the pair of trajectories have a static phase and a dynamic phase. The indicative relationship between the pair of trajectories that is used to classify them is that the static phase of the first trajectory coincides with the dynamic phase of the second trajectory.
This analysis allows detection of pedestrians since the trajectories of the feet of a walking pedestrian exhibit a static phase and a dynamic phase. The standing foot of a pedestrian is stationary relative to the ground and is thus static. The second foot of the pedestrian moves relative to the ground and can thus be considered to be in a dynamic phase. As the pedestrian walks, each foot alternates between static and dynamic phases and while one foot is in a static phase, the other foot is in a dynamic phase.
According to an embodiment of the present invention, the indicative relationship between the first and second trajectories is a correlation between the directions of motion of the pair of trajectories.
The motions of the feet of a pedestrian are expected to be approximately parallel to a walking direction of the pedestrian. This can be used to identify pairs of trajectories that potentially relate to pairs of feet for a pedestrian.
According to an embodiment of the present invention, the correlation in the directions of motion of the first and second trajectories includes the first trajectory being substantially parallel to the second trajectory.
According to an aspect of the present invention there is provided a computer readable medium that carries instructions to cause a computer to carry out a method in accordance with the present invention.
According to an aspect of the present invention, a video analysis system for recognising a moving feature in a video is provided.
In the following, embodiments of the invention will be described with reference to the drawings in which:
To recognise a moving feature in a video, a trajectory for the feature over the frames of the video is required. In order to track the trajectory of an object in a video, it is necessary to identify the location of that object in the frames of the video. There are therefore two problems that must be addressed: the location of objects must be identified in each frame, and the same object must be identified in the frames over which it is to be tracked.
The step S100 of the method involves locating points of interest in each frame of the video sequence. Spatial patterns in the pixels of each frame indicative of certain features that may occur in more than one frame of the sequence are found. For example, the first step may involve detecting corners in each of the frames.
In order to find the trajectories of features in the video, in step S101, points of interest that correspond to features in the immediately preceding frame are determined.
In step S102, for a given point of interest in the first frame of the sequence, a set of possible trajectories is generated. The most general set of possible trajectories for that point of interest includes the position in the next frame of any of the points of interest identified in the next frame. It is also possible that the trajectory leads to a position not shown in the frame, either outside the frame, or behind another object shown in the frame.
The set of possible trajectories is determined by considering pairs of frames and identifying potential matching points of interest between the pairs of frames. The pairs of matching points of interest are identified by calculating a matching score. The matching score takes factors such as the spatial distance between the locations of the points of interest and the similarity in appearance of the points of interest into account. This may be achieved by identifying points of interest that have a similar appearance, and/or a similar position within the two frames. From the potential matching pairs of points of interest, possible trajectories are constructed by tracing the matching pairs through the sequence of frames.
The set of possible trajectories for an object in the first frame of a video sequence may include a large number of points of interest in the second and subsequent frames. The set of possible trajectories is narrowed down by enforcing a constraint based on a pattern that the trajectory of the object is expected to follow. This can be, for example a physical rule such as that the speed of an object will be conserved.
Once the number of possible trajectories has been narrowed down, when a trajectory is required for further processing, a probabilistic selection of the probable trajectories is made. The resultant trajectory is used in the further processing, for example in a gesture recognition algorithm. It is noted that if the probabilistic selection is repeated the selected trajectory may be different.
The trajectories are then compared with a characteristic pattern of motion in step S103.
The points of interest in the frame at time t are denoted as pi(t), where i=1, . . . , n is an index for the points of interest. The two-dimensional location of the point of interest pi(t) in the frame t is xi(t).
In step S2, matches between the points of interest in the frames are determined. For each point of interest in a frame, points of interest in the previous frame that are potential ancestors are identified. This identification is carried out by analysing points of interest in the previous frame that are within a certain range of the corner being considered in the present frame.
A temporal matching score is calculated for each pair of points of interest. The temporal matching score matches a point of interest in the present frame with a point of interest in the preceding frame.
The temporal matching score is based on the appearance similarity of the two points of interest and the spatial distance between them.
The temporal matching score between the point of interest pi(t) in the frame at time t and the point of interest pj(t−1) in the preceding frame at time t−1 is denoted as Pij(pi(t), pj(t−1)), and can be calculated by the following formula:
Pij(pi(t),pj(t−1))∝exp(−αSij)exp(−βDij)
Where Sij is the appearance similarity between the point of interest pi(t) and the point of interest pj(t−1). The appearance similarity is calculated from the local image regions around the points of interest in their respective frames. It is calculated as the mean normalised sum of absolute differences between the local image regions. Dij is the spatial distance between the points of interest and is calculated from the following formula.
Dij=∥xi(t)−xj(t−1)∥
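By way of illustration only, the temporal matching score may be sketched as follows. The function names, the default values of α and β, the use of 8-bit intensities and the normalisation by 255 are assumptions made for this sketch and are not taken from the description. Sij is treated here as a difference measure, so that a smaller value indicates a closer appearance.

```python
import math

def appearance_difference(patch_a, patch_b):
    """Mean normalised sum of absolute differences between two patches.

    Assumption: patches are flat lists of 8-bit intensities of equal size,
    so dividing by 255 and by the patch size gives a value in [0, 1].
    """
    if len(patch_a) != len(patch_b) or not patch_a:
        raise ValueError("patches must be non-empty and of equal size")
    total = sum(abs(a - b) for a, b in zip(patch_a, patch_b))
    return total / (255.0 * len(patch_a))

def matching_score(x_i, x_j, patch_i, patch_j, alpha=1.0, beta=0.1):
    """Unnormalised P_ij = exp(-alpha * S_ij) * exp(-beta * D_ij)."""
    s_ij = appearance_difference(patch_i, patch_j)
    d_ij = math.dist(x_i, x_j)  # Euclidean distance between the 2-D locations
    return math.exp(-alpha * s_ij) * math.exp(-beta * d_ij)
```

Because the score is only defined up to proportionality, it is the relative values for the candidates of a given point of interest that matter.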
Potential matches are identified based on the temporal matching score Pij. A binary value Eij(t) is associated with each potential match between a point of interest pi(t) in the frame at time t and the point of interest pj(t−1) in the preceding frame at time t−1. Potential matches may be selected as the highest value of the temporal matching score for a given corner in the present frame, or all matches having temporal matching score within a threshold of the maximum value may be selected as potential matches. Thus Eij(t)=1 when either:
Pij=maxjPij or
Pij+e>maxjPij
Where e is a threshold.
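The selection rule above may be sketched as follows, assuming purely for illustration that the scores Pij for a pair of frames are held in a nested list indexed first by i and then by j. The function name and data layout are hypothetical.

```python
def potential_matches(scores, e=0.0):
    """Return the set of (i, j) for which E_ij = 1.

    scores[i][j] holds P_ij for point i in the present frame and point j
    in the preceding frame. A pair is kept when P_ij equals the row
    maximum, or when P_ij + e exceeds the row maximum.
    """
    matches = set()
    for i, row in enumerate(scores):
        if not row:
            continue  # no candidate ancestors for this point
        best = max(row)
        for j, p in enumerate(row):
            if p == best or p + e > best:
                matches.add((i, j))
    return matches
```

With e set to zero only the best candidate for each point is kept; a larger e admits further candidates whose scores are close to the best.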
Potential matches are also determined by considering the process in reverse; for each point of interest in the preceding frame, a temporal matching score is calculated for points of interest in the present frame.
In step S3, the total number of matches between two frames is limited. This is achieved by keeping only a fixed number of matches. The threshold e may be dynamically adjusted so that the number of matches between pairs of frames is constant. A total of 4n matches between frames has been found to be a suitable fixed number. The matches kept are those with the highest temporal matching score. This may result in some of the corners having no matches where the maximum matching score for a corner is a low value.
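One simple way of limiting the matches, given here as a sketch only, is to rank the candidate matches by score and keep the first 4n. The tuple layout and function name are assumptions; the alternative of adjusting the threshold e dynamically is not shown.

```python
def keep_top_matches(scored_matches, n_points, factor=4):
    """Keep the factor * n_points matches with the highest scores.

    scored_matches is a list of (score, i, j) tuples for one pair of frames.
    """
    limit = factor * n_points
    ranked = sorted(scored_matches, key=lambda m: m[0], reverse=True)
    return ranked[:limit]
```

A point of interest whose best score is low may lose all of its matches under this rule, which is the behaviour noted above.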
Following step S3, a number of points of interest in each frame in the sequence and a set of active matches of points of interest between frames are available to construct a graph from which probabilistic trajectories are extracted.
The graph is constructed in step S4. For each frame t, there is a set of points of interest pi(t) and between temporally adjacent frames, there is a set of matches Eij(t) between the points of interest of a frame at time t and a frame at time t−1. These are used to construct a graph Gi(N,E). The nodes of the graph N represent matched points of interest in the preceding frames and edges E represent the matches between these points of interest. The graph is constructed by defining a root node for each of the points of interest in the final frame of the video sequence. Edges are added for each of the active matches from these points of interest to points of interest in the preceding frame. Nodes are defined for the matched points of interest in the preceding frame. The process is repeated for active matches from the frame preceding the final frame of the sequence, and an acyclic graph such as that shown in
The traceable depth of an edge, that is, the number of frames over which potential ancestors of a feature point pi(t) can be traced in the graph, is written as d[Eij(t)].
Data representing the graph Gi(N,E) is stored in the memory 104. The data is stored as values for xi(t) for each of the N nodes and values of Pij(t) and d[Eij(t)] for each of the edges, Eij(t).
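The construction of the graph from the final frame backwards may be sketched as follows. The dictionary-based representation, the function names, and the choice to measure traceable depth as the number of earlier frames reachable from a node are assumptions made for illustration.

```python
def build_graph(matches, last_frame, n_roots):
    """Build edges {(t, i): [(t - 1, j), ...]} starting from the final frame.

    matches[t] is a set of (i, j) pairs: point i in frame t is an active
    match of point j in frame t - 1. One root is created for each of the
    n_roots points of interest in the final frame.
    """
    edges = {}
    frontier = {(last_frame, i) for i in range(n_roots)}
    while frontier:
        next_frontier = set()
        for (t, i) in frontier:
            children = sorted((t - 1, j) for (a, j) in matches.get(t, ()) if a == i)
            edges[(t, i)] = children
            next_frontier.update(children)
        frontier = {node for node in next_frontier if node not in edges}
    return edges

def traceable_depth(edges, node, cache=None):
    """Number of earlier frames reachable from node by following edges."""
    if cache is None:
        cache = {}
    if node not in cache:
        children = edges.get(node, [])
        cache[node] = 0 if not children else 1 + max(
            traceable_depth(edges, child, cache) for child in children)
    return cache[node]
```

Because every edge leads from frame t to frame t−1, the resulting graph cannot contain a cycle.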
In step S5, a trajectory is generated by traversing the acyclic graph. The traversal is started at one of the root nodes and a probabilistic selection of edges is made. The process is repeated at each node. The probabilistic selection is made based upon an augmented probability. The augmented probability is calculated from the matching score and includes a factor to take into account the traceable depth and a speed conservation factor.
The augmented probability, or sampling probability P′ij(pi(t), pj(t−1)) is given by the following formula:
P′ij(pi(t),pj(t−1))∝Pij exp(−γ/(d[Eij]+1))exp(−δVij)
Where γ and δ are positive weighting factors and Vij is a speed conservation factor given by the following formula:
Vij(T)=∥(xh(T+1)−xi(T))−(xi(T)−xj(T−1))∥
The above formula for the speed conservation factor takes into account the position of the point of interest in the frame in question xi(T), the position of the point of interest in the trajectory in the preceding frame xj(T−1) and the position of the point of interest being considered as the next point in the trajectory xh(T+1).
The use of an augmented probability that includes a factor dependent upon the traceable depth means that long trajectories are favoured. Long trajectories are likely to be of greater use than short trajectories. Further, the speed conservation factor means that links between feature points that preserve the speed of the object in the trajectory up to the frame under consideration are favoured. This ensures that smooth trajectories that are more likely to correspond to the motion of physical features are more likely to be followed when traversing the acyclic graph.
The acyclic graph can be traversed multiple times to generate a number of probabilistic trajectories for an object corresponding to a feature point in the final frame. The trajectories generated may be different, and thus at least some will not be correct. However, by allowing possible trajectories where there is uncertainty about the correspondence between consecutive frames, trajectories over relatively long sequences of frames can be obtained even where the objects are obscured for part of the sequence by occlusion.
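The two factors may be illustrated with the following sketch, in which the function names and the default values of γ and δ are assumptions. The result is unnormalised, in keeping with the proportionality in the formula for the sampling probability.

```python
import math

def speed_conservation(x_next, x_cur, x_prev):
    """V = || (x_h(T+1) - x_i(T)) - (x_i(T) - x_j(T-1)) || for 2-D points."""
    dx = (x_next[0] - x_cur[0]) - (x_cur[0] - x_prev[0])
    dy = (x_next[1] - x_cur[1]) - (x_cur[1] - x_prev[1])
    return math.hypot(dx, dy)

def augmented_score(p_ij, depth, v_ij, gamma=1.0, delta=1.0):
    """Unnormalised P'_ij = P_ij * exp(-gamma / (d + 1)) * exp(-delta * V_ij)."""
    return p_ij * math.exp(-gamma / (depth + 1)) * math.exp(-delta * v_ij)
```

For equal matching scores, an edge with a larger traceable depth or a smaller change in velocity receives a larger sampling weight.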
Multiple trajectories for an object may be of use, particularly when used in conjunction with, for example, a pattern recognition algorithm that finds features within a video sequence that exhibit a particular characteristic motion. In such a case it may be beneficial to examine a number of possible trajectories for each point of interest in the final frame of the sequence and examine all the possible trajectories for those that exhibit the particular characteristic motion. The use of an acyclic graph to generate multiple possible trajectories is particularly suitable for such an application.
Following the traversal of the acyclic graph the probabilistic trajectories are stored in the memory 104 as Xi(t)=[x(t), x(t−1), . . . x(t−T+1)] which represents a set of locations in frames for the points belonging to the trajectory.
In order to extract a trajectory from the graph for a point of interest, at each node one of the edges connecting that node to a node representing a point of interest in an earlier frame is selected.
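A single traversal may be sketched as follows. For brevity the sketch assumes that a sampling weight has already been computed for every edge and stored in a dictionary; in the method described, the speed conservation factor depends on the part of the trajectory already selected, so the weights would in practice be evaluated during the traversal. All names are hypothetical.

```python
import random

def sample_trajectory(edges, weights, root, rng=random):
    """Walk from a root node towards earlier frames, choosing one edge per node.

    edges:   {node: [child, ...]} with children in the preceding frame
    weights: {(node, child): unnormalised sampling probability}
    """
    path = [root]
    node = root
    while edges.get(node):
        children = edges[node]
        w = [weights[(node, child)] for child in children]
        # random.choices normalises the weights internally
        node = rng.choices(children, weights=w, k=1)[0]
        path.append(node)
    return path
```

Calling the function repeatedly with the same root may return different paths, which corresponds to generating several probabilistic trajectories for the same point of interest.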
To identify the walking motion of a pedestrian, the step of comparing trajectories with a characteristic pattern involves identifying features that potentially relate to a foot of a pedestrian and then finding pairs of such features with correlated motion which correspond to a pair of feet of a pedestrian. This is described in reference to
In step S601, candidate trajectories are identified. In order to determine whether a trajectory potentially relates to a foot, the motion of the trajectory over the time of approximately one walk cycle is analysed. If the trajectory relates to a point of interest corresponding to a location on a foot, then it is expected to exhibit a moving phase and a stationary phase within the walk cycle. Therefore, the candidate trajectories are identified in step S601 as those that exhibit a dynamic phase and a stationary phase within one walk cycle.
Once the candidate trajectories have been identified, correlated pairs of candidate trajectories are found in step S602. For a walking pedestrian, it is expected that the motion of two feet will be correlated. This correlation is both temporal and spatial. The temporal aspect of the correlation is that when one foot of a walking pedestrian is in the stationary phase, the other foot is expected to be in its dynamic phase. The spatial locations of the two feet are also expected to be correlated. In addition to these correlations, the directions of the motion of a pedestrian's feet are expected to be correlated. The trajectories of the two feet of a walking person are expected to be approximately parallel, in a direction which is the direction in which the pedestrian is walking. By identifying pairs of candidate trajectories that exhibit the correlations described above, pedestrians are identified in a video sequence from the motion of points of interest relating to the pedestrians' feet.
A random forest classifier can be used to classify the trajectories as relating to feet or not to feet to identify the candidate trajectories in step S601.
A method of using a random forest classifier to classify trajectories as relating to feet or not is shown in
The trajectory is written as:
$\tilde{X}_i(t) = [\tilde{x}(t), \tilde{x}(t-1), \ldots, \tilde{x}(t-T+1)]^T$
in its canonical form.
In step S702, feature vectors v0 and v1 are generated from the canonical form of the trajectory. These feature vectors are generated by cutting the trajectory into five sections by randomly selecting four times t0, t1, t2, and t3 at which the trajectory is cut.
The feature vectors are given by the following formulae:
v0=
v1=
In step S703, features fs and fd are calculated from the feature vectors and randomly selected variables. The features are calculated as the distance and the dot product of the two vectors v0 and v1 using the following:
fs=∥a0v1−a1v0∥
fd=<b0v0,b1v1>
Where a0, a1, b0 and b1 are random coefficients that take values between 0 and 1. By using different values for the cutting points t0, t1, t2 and t3, and for the coefficients a0, a1, b0 and b1, numerous variations in fs and fd can be made.
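Given two feature vectors v0 and v1, however obtained, the features fs and fd may be computed as in the following sketch. The function name is an assumption, and the vectors are taken to be sequences of equal length.

```python
import math

def split_features(v0, v1, a0, a1, b0, b1):
    """Return (f_s, f_d) with f_s = ||a0*v1 - a1*v0|| and f_d = <b0*v0, b1*v1>."""
    fs = math.sqrt(sum((a0 * q - a1 * p) ** 2 for p, q in zip(v0, v1)))
    fd = sum((b0 * p) * (b1 * q) for p, q in zip(v0, v1))
    return fs, fd
```

In use, the coefficients would be drawn at random between 0 and 1 for each branching point, so that each tree of the forest tests a different combination of the two vectors.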
In step S704, the trajectories are classified as either relating to feet or not relating to feet using a random decision forest.
The random forest is stored as a set of values for the parameters t0, t1, t2, t3, a0, a1, b0 and b1, functions of fs and fd, and a threshold θ for each branching point. These values are stored in the memory 104. They are calculated prior to the running of the program 106 by manually annotating features in a video sequence, calculating the values of the functions of fs and fd, and selecting the value for the threshold θ that gives the largest information gain.
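The selection of a threshold by information gain may be sketched as follows for a single branching point. The sketch assumes that the value of the chosen function has been computed for each annotated trajectory, that candidate thresholds are taken midway between consecutive distinct values, and that Shannon entropy is used; none of these details is specified above.

```python
import math

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        h -= p * math.log2(p)
    return h

def best_threshold(values, labels):
    """Return (theta, gain) maximising information gain for the test value < theta."""
    base = entropy(labels)
    best = (None, 0.0)
    ordered = sorted(set(values))
    for lo, hi in zip(ordered, ordered[1:]):
        theta = (lo + hi) / 2.0
        left = [l for v, l in zip(values, labels) if v < theta]
        right = [l for v, l in zip(values, labels) if v >= theta]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best[1]:
            best = (theta, gain)
    return best
```

If no candidate threshold gives a positive gain, the function returns a threshold of None, which may be taken as a sign that the branching point should not split on that function.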
Once candidate trajectories for feet have been identified they are analysed to find correlated pairs of feet relating to the same pedestrian. This analysis is also carried out using a random decision forest in a manner similar to that described above. The functions used to make the decisions when traversing the trees in the random decision forest are calculated based on the correlation of the directions of motion of the two trajectories under consideration and the correlation of the moving and stationary phases of the motion in the trajectories.
In step S1002, a walking direction vector xiu(t) is calculated for the two trajectories based on the locations xi(t) and xu(t) of the points of interest in the frame at time t relating to the two trajectories.
xiu(t)=xi(t)−xu(t)
In step S1003, a consistency c based on dot products of the directions of motion with each other and with the walking direction is calculated.
The consistency exploits the fact that it is expected that the directions of movement of the feet of a pedestrian will be close to parallel and that these directions will be approximately parallel with the walking direction of the pedestrian.
In step S1004, a function fo based on the consistency and a random vector φ is calculated.
fo=<φ,c>
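The composition of the consistency c is not set out in detail above. The following sketch therefore assumes, purely for illustration, that c has three components: the dot product of the two unit directions of motion, and the dot product of each of them with the unit walking direction. The function names are hypothetical.

```python
import math

def unit(v):
    """Return v scaled to unit length, or the zero vector if v is zero."""
    n = math.hypot(v[0], v[1])
    return (v[0] / n, v[1] / n) if n > 0 else (0.0, 0.0)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def orientation_feature(dir_i, dir_u, x_i, x_u, phi):
    """Return (f_o, c) with f_o = <phi, c>.

    Assumed layout of c: (<d_i, d_u>, <d_i, w>, <d_u, w>), where w is the
    unit vector along x_iu = x_i - x_u.
    """
    walk = unit((x_i[0] - x_u[0], x_i[1] - x_u[1]))
    d_i, d_u = unit(dir_i), unit(dir_u)
    c = (dot(d_i, d_u), dot(d_i, walk), dot(d_u, walk))
    return sum(p * v for p, v in zip(phi, c)), c
```

Under this assumption, two trajectories that move in parallel along the walking direction give components of c close to 1, whereas unrelated trajectories tend to give smaller values.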
In step S1101, velocity vectors Yi(t) are calculated for the trajectories.
$Y_i(t) = [y(t), y(t-1), \ldots, y(t-T+2)]^T \in \mathbb{R}^{2(T-1)}$
Where y(τ)=x(τ)−x(τ−1) for τ=t, . . . , t−T+2, and T is the number of frames in the trajectory.
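The velocity vectors are frame-to-frame differences of the trajectory, which may be sketched as follows. The function name is an assumption, and the trajectory is taken to be stored newest frame first, as in the notation used for Xi(t).

```python
def velocities(trajectory):
    """Return the frame-to-frame displacements of a trajectory.

    trajectory is a list of 2-D points ordered newest first. Each output
    element is the position in one frame minus the position in the frame
    before it, so a trajectory of T points yields T - 1 velocities.
    """
    return [(newer[0] - older[0], newer[1] - older[1])
            for newer, older in zip(trajectory, trajectory[1:])]
```

For example, a trajectory of three points gives two velocity vectors, consistent with the dimension 2(T−1) of Yi(t).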
In step S1102, rectified velocity vectors are calculated. Using the equation:
constant camera motion is cancelled. The rectified velocity vectors are then generated using the following equation:
$\check{Y}_i(t) = [\check{y}(t), \check{y}(t-1), \ldots, \check{y}(t-T+2)]^T$
In step S1103, the rectified velocity vectors are cut into l pieces at fixed cutting points, and in step S1104, a vector based on the dot products of the cut rectified velocity vectors is calculated.
$q = [\langle \check{Y}'_i(t), \check{Y}'_u(t) \rangle, \ldots, \langle \check{Y}'_i(t_{l-2}), \check{Y}'_u(t_{l-2}) \rangle]^T \in \mathbb{R}^l$
Finally in step S1105, a function based on the vector and a random vector is calculated:
fp=<ψ,q>
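The calculation of q and fp may be sketched as follows, assuming for illustration that the fixed cutting points divide the velocity sequence into l pieces of approximately equal length and that l is given by the length of ψ. The function name is hypothetical.

```python
def phase_feature(vel_i, vel_u, psi):
    """Return (f_p, q) with f_p = <psi, q>.

    vel_i and vel_u are equal-length lists of 2-D rectified velocities.
    q[k] is the dot product of the k-th pieces of the two sequences.
    """
    l = len(psi)
    n = len(vel_i)
    if n != len(vel_u) or l == 0 or n < l:
        raise ValueError("need equal-length sequences with at least l samples")
    bounds = [k * n // l for k in range(l + 1)]  # assumed evenly spaced cuts
    q = []
    for start, stop in zip(bounds, bounds[1:]):
        q.append(sum(a[0] * b[0] + a[1] * b[1]
                     for a, b in zip(vel_i[start:stop], vel_u[start:stop])))
    fp = sum(w * v for w, v in zip(psi, q))
    return fp, q
```

When one trajectory is static while the other is dynamic, as expected for a pair of feet, the dot products in q are close to zero; trajectories that move together give larger values.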
A second random decision forest is used to classify pairs of trajectories as relating to pairs of feet or not. The result of the decision forest may be a large number of pairs of trajectories relating to the same pair of feet. A mean shift algorithm is used to find the main clusters and the average of the most probable cluster is taken as the final result.
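A mean shift step of this kind may be sketched as follows. The sketch assumes that each detected pair is summarised by a 2-D point, uses a flat kernel of a given bandwidth, and merges modes that lie within half a bandwidth of each other; these choices and the function name are assumptions.

```python
import math

def mean_shift(points, bandwidth, iterations=50):
    """Flat-kernel mean shift on 2-D points.

    Returns a list of (mode, count) pairs, largest cluster first.
    """
    modes = []
    for p in points:
        x, y = p
        for _ in range(iterations):
            near = [q for q in points if math.dist((x, y), q) <= bandwidth]
            if not near:
                break
            nx = sum(q[0] for q in near) / len(near)
            ny = sum(q[1] for q in near) / len(near)
            if math.dist((x, y), (nx, ny)) < 1e-9:
                break  # converged
            x, y = nx, ny
        modes.append((x, y))
    clusters = []  # each entry is [mode, number of points that converged to it]
    for m in modes:
        for c in clusters:
            if math.dist(m, c[0]) <= bandwidth / 2:
                c[1] += 1
                break
        else:
            clusters.append([m, 1])
    clusters.sort(key=lambda c: c[1], reverse=True)
    return [(c[0], c[1]) for c in clusters]
```

The first entry of the returned list corresponds to the cluster with the most support, and its mode is the average of the points within the bandwidth at convergence.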
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/GB2009/001948 | 8/6/2009 | WO | 00 | 5/22/2012 |
Publishing Document | Publishing Date | Country | Kind |
---|---|---|---|
WO2011/015801 | 2/10/2011 | WO | A |
Number | Date | Country | |
---|---|---|---|
20120224744 A1 | Sep 2012 | US |