1. Field of the Invention
The present invention generally relates to human-computer interaction, and more specifically relates to a method and an apparatus for tracking an object in human-computer interaction.
2. Description of the Related Art
Object tracking is an important and essential part of the human-computer interaction field. Currently, hand tracking is being studied as an example of object tracking, and methods for tracking a hand have been provided, such as a tracking method based on color features of the hand, a tracking method based on depth features of the hand, or the like.
However, the hand is a non-rigid object, and shape deformation and shape inconsistency may occur during a motion process. Furthermore, the motion of a hand has some peculiar characteristics; for example, the motion velocity of a hand may change constantly, and hand information in an image may blur due to a rapid motion of the hand. Thus, it is difficult to find a single feature of a hand that has an optimal tracking effect for all of the scenes during a whole motion process of the hand.
U.S. Pat. No. 8,213,679 B2 discloses a method for moving-target tracking and number counting. In this method, a matching degree between a target region of a current frame and a target region of a previous frame is calculated based on all of the features in a pre-established feature pool, and an overall matching degree is further calculated based on the feature with the maximum matching degree. By this method, different features may be used to perform tracking for different video frames during the motion process of an object. However, in this method, a complicated matching calculation is performed between the two video frames, so the calculation amount is large and the processing speed is slow.
According to an aspect of an embodiment of the present invention, a method for tracking an object includes tracking, based on a previously selected first tracking feature, the object in a sequence of video frames having the object; selecting, when a scene of the video frame is changed, a second tracking feature with optimal tracking performance for the changed scene; and continuing to track the object based on the selected second tracking feature.
According to another aspect of an embodiment of the present invention, an apparatus for tracking an object includes a feature selection unit configured to select a tracking feature with optimal tracking performance for a changed scene and notify a tracking unit of the tracking feature, when the scene of a video frame is changed; and the tracking unit configured to track, based on the selected tracking feature, the object in a sequence of the video frames having the object.
According to another aspect of an embodiment of the present invention, a method for selecting a tracking feature used for tracking an object includes selecting the tracking feature with optimal tracking performance for a changed scene, in response to a change of the scene of a video frame having the object.
According to the object tracking technology and the tracking feature selection technology of the embodiments of the present invention, a feature with optimal tracking performance for a corresponding scene can be dynamically selected in response to a scene change; thus, it is possible to perform accurate tracking.
In the following, embodiments of the present invention are described in detail with reference to the accompanying drawings, so as to facilitate the understanding of the present invention.
For convenience of explanation, as an example of the object tracking, the hand tracking technology according to an embodiment of the present invention will be described below.
First, the basic concept of the hand tracking technology according to the present invention will be described briefly. As described above, the hand is a non-rigid object that moves rapidly and deforms easily. Thus, it is difficult to find a single feature of a hand that can obtain an optimal tracking effect for all of the scenes during a whole motion process of the hand. For this situation, the present invention provides a tracking technology that dynamically selects a feature fitting the current scene in response to a specific scene change during the tracking process of the hand. For example, when the hand moves rapidly, the edge information of the hand becomes unclear or is lost; for this scene, a color feature has a good distinguishing effect. Accordingly, when this scene occurs during a tracking process, a color feature may be dynamically selected to perform the tracking. As another example, when the hand moves in the vicinity of a face, the distinguishing power of the color feature is reduced since the colors of the hand and the face are similar; meanwhile, a depth feature shows a good distinguishing effect. Accordingly, when this scene occurs during a tracking process, a depth feature may be dynamically selected instead of a color feature to perform the tracking. Furthermore, for some scenes, not only a single feature but also a combination of features may be selected for the hand tracking. In this way, a feature fitting the current scene can be dynamically selected in response to a scene change during the tracking process of the hand; thus, it is possible to perform accurate tracking.
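As a non-limiting illustration of this basic concept, the following Python sketch shows the overall control flow of such dynamic feature selection; the helpers track_one_frame, scene_changed, and select_best_feature are hypothetical placeholders standing in for the processing described in detail below.

    def track_hand(frames, initial_feature):
        # Track the hand frame by frame, switching the tracking feature
        # whenever the currently selected feature stops fitting the scene.
        feature = initial_feature
        for frame in frames:
            result = track_one_frame(frame, feature)      # hypothetical tracker call
            if scene_changed(result):                     # tracking performance fell
                feature = select_best_feature(result)     # feature fitting the new scene
                result = track_one_frame(frame, feature)  # re-track with the new feature
            yield result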
As illustrated in FIG. 2, in step S210, the hand is tracked, based on a previously selected first tracking feature, in a sequence of video frames having the hand.
Tracking features are features that represent characteristics of a hand and can provide good tracking performance during the tracking of a hand. A tracking feature may be the color feature or the depth feature described above, and may also be an edge feature, a grayscale feature, or the like.
In this step, the first tracking feature used for the tracking may be a previously selected tracking feature fitting the current scene, and may also be a tracking feature selected by any other appropriate method. In the following, the processing of step S210 will be described with reference to FIG. 3.
As illustrated in FIG. 3, in step S310, the hand is tracked based on the first tracking feature, and the reliability of the tracking result is calculated.
The specific tracking processing based on the first tracking feature may be performed by any known methods, such as a Kalman filtering method, a particle filtering method or the like, and the detailed description is omitted here.
The tracking of the hand according to the embodiment of the present invention is a real-time and online process. In this step, for each of the obtained video frames with the hand, the tracking of the hand is performed in real time by using the first tracking feature, and the reliability of the tracking result that is obtained by the tracking is calculated, until a start video frame T whose tracking performance starts to fall appears; namely, the reliability of the tracking result in the video frame T based on the first tracking feature is less than a predetermined reliability threshold, and the reliability of the tracking result in a video frame T-1 is greater than or equal to the reliability threshold. The reliability reflects the degree of reliability of the tracking result. Specifically, a reduction of the reliability indicates that the tracking performance of the currently selected tracking feature is reduced; that is to say, the currently selected tracking feature does not fit the scene of the current video frame, namely, a change of scene has occurred. For example, suppose that in the first 100 frames a color feature is used as the tracking feature to perform the tracking, and the tracking performance in all of these frames is relatively high; whereas in the 101st frame, the hand moves to the vicinity of the face, and the distinguishing power of the color feature is reduced since the colors of the hand and the face are similar. Accordingly, the reliability of the tracking result when the tracking is performed by using the color feature in the 101st frame is reduced, and the tracking performance is reduced; namely, the 101st frame is the start video frame T whose tracking performance starts to fall, as described above.
The reliability may be calculated by any appropriate method. Considering that the color distance and the position distance of the hand between two adjacent frames in the same scene do not vary much, an example of a method for calculating the reliability is as follows.
Confidence_i = 1 / (D(Color_i, Color_(i-1)) + D(Pos_i, Pos_(i-1)))    (1)
where Confidence_i represents the reliability of the tracking result of the i-th frame, D(Color_i, Color_(i-1)) represents the color distance between the i-th frame and the (i-1)-th frame, and D(Pos_i, Pos_(i-1)) represents the position distance between the i-th frame and the (i-1)-th frame. The color distance and the position distance may be calculated by any appropriate methods. For example, as a method for calculating the color distance, a distance, such as a Bhattacharyya distance, between the color histograms of the tracking region of the tracked hand in two adjacent frames is calculated; and as a method for calculating the position distance, a Euclidean distance between the positions of the tracked hand in two adjacent frames is calculated. If Confidence_i is less than a predetermined reliability threshold, it is determined that the tracking performance of the currently selected tracking feature in the i-th frame has fallen. The reliability threshold may be set by experience according to a specific application environment.
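As a non-limiting sketch of how expression (1) may be evaluated, the following Python code computes the reliability using OpenCV; the choice of 32-bin hue histograms and the small epsilon guarding against division by zero are illustrative assumptions, not part of the embodiment.

    import cv2
    import numpy as np

    def confidence(roi, prev_roi, pos, prev_pos):
        """Reliability of expression (1) for two adjacent frames."""
        # Color distance: Bhattacharyya distance between hue histograms of the
        # tracked hand regions (an illustrative choice of color representation).
        hists = []
        for region in (roi, prev_roi):
            h = cv2.calcHist([cv2.cvtColor(region, cv2.COLOR_BGR2HSV)],
                             [0], None, [32], [0, 180])
            cv2.normalize(h, h, alpha=1.0, norm_type=cv2.NORM_L1)
            hists.append(h)
        d_color = cv2.compareHist(hists[0], hists[1], cv2.HISTCMP_BHATTACHARYYA)
        # Position distance: Euclidean distance between the tracked positions.
        d_pos = np.linalg.norm(np.asarray(pos, dtype=float) -
                               np.asarray(prev_pos, dtype=float))
        eps = 1e-6  # illustrative guard against division by zero
        return 1.0 / (d_color + d_pos + eps)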
Returning to FIG. 3, in step S320, the tracking of the hand is continued based on the first tracking feature in k video frames after the start video frame T; and in step S330, it is determined whether the scene of the video frame has changed.
As described in step S310 above, the start video frame T whose tracking performance is reduced appears because the tracking scene changes. However, in actuality, interference such as noise in the obtained video frames may also be the reason that the tracking performance in the video frame T is reduced. Accordingly, in step S320, after the start video frame T appears, it is not necessary to change the tracking feature immediately; instead, an "allowable period" is set. In the "allowable period", the first tracking feature is still used for the tracking of the hand, and it is observed whether the tracking performance gets better. The "allowable period" may be set by experience according to a specific tracking environment; for example, the "allowable period" may be k video frames after the start video frame T, where k>0. In step S330, it is determined that the scene of the video frame has changed if the tracked hand is missed in a video frame of the k video frames or the reliability of the tracking result of the video frame T+k is still less than the reliability threshold; otherwise, the tracking is continued based on the first tracking feature.
In this step, a determination is performed based on the tracking results in the k video frames using the first tracking feature. Specifically, if the tracked hand is missed (i.e., the tracking fails) in a video frame of the k video frames, or the reliability of the tracking result of the video frame T+k is still less than the reliability threshold, namely, the tracking performance still does not get better after the "allowable period" is over, it is determined that the scene has changed and that good tracking performance cannot be obtained by using the first tracking feature in the current scene. On the contrary, if the tracking performance gets better, for example, the reliability becomes greater than or equal to the reliability threshold at some video frame in the "allowable period" and the reliability of subsequent frames remains greater than or equal to the reliability threshold, it is determined that good tracking performance can be obtained by using the first tracking feature in the current scene; thus, the first tracking feature can still be used for the tracking.
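The decision rule of steps S320 and S330 may be sketched as follows; here, reliabilities is assumed to hold the reliability of the tracking result for the frames T through T+k, with None marking a frame in which the tracked hand was missed.

    def scene_has_changed(reliabilities, threshold, k):
        # reliabilities[0] is frame T; reliabilities[1:k+1] is the "allowable period".
        period = reliabilities[1:k + 1]
        if any(r is None for r in period):   # the hand was missed during the period
            return True
        return reliabilities[k] < threshold  # still unreliable at frame T+k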
Returning to FIG. 2, in step S220, when the scene of the video frame is changed, a second tracking feature with optimal tracking performance for the changed scene is selected.
When the scene of the video frame is changed and good tracking performance cannot be obtained based on the first tracking feature in the changed scene, in step S220, the tracking feature with the optimal tracking performance for the changed scene may be selected by any appropriate method. As an example, the second tracking feature with the optimal tracking performance for the changed scene may be selected based on previously calculated tracking performance of each of the tracking features in each of the scenes of a training data set. The training data set consists of training video frames in the scenes, and the training video frames include the hand. In this example, the tracking performance of each of the tracking features in each of the scenes is calculated in advance; thus, after the changed scene is determined, it is easy to select the tracking feature with the optimal tracking performance for the changed scene. The tracking performance of each of the tracking features in each of the scenes may be calculated in advance by any known method in the art; for completeness of description, an example will be described briefly.
First, a feature pool is constructed. The feature pool includes features that can provide good tracking performance in the tracking of a hand, for example, single features such as a color feature, a depth feature, an edge feature, a grayscale feature, or the like, as well as combination features formed of a plurality of the single features. Furthermore, the training data is collected and a training data set is established. It should be noted that the training data set should cover as many different scenes relating to the motion of the hand as possible, and specifically, different scenes relating to the motion of the hand in the human-computer interaction field. Next, the training data (including video frames of the hand) is classified according to the scenes relating to the motion of the hand. The scenes relating to the motion of the hand include, for example, a scene in which the hand moves rapidly, a scene in which the hand moves to the vicinity of the face, or the like. It should be noted that these two scenes are just examples, and the number and the type of the specific scenes may be set according to the actual application; a possible organization is sketched below.
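For illustration only, the feature pool and the scene-classified training data may be organized as follows; the feature names and scene labels are assumptions made for this sketch, not a prescribed set.

    FEATURE_POOL = [
        ("color",), ("depth",), ("edge",), ("grayscale",),  # single features
        ("color", "depth"),                                 # a combination feature
    ]

    training_set = {           # video frames classified by scene
        "rapid_motion": [],    # the hand moves rapidly
        "near_face": [],       # the hand moves in the vicinity of the face
    }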
After the training data is classified according to the different scenes, for each of the video frames in each of the scenes, the position of the hand in the video frame is manually marked as ground truth, by drawing a hand region using a rectangular frame or by drawing a center position using points. Furthermore, for each of the scenes, the feature distribution of each of the features in the feature pool is calculated. The feature distribution reflects the specific values of the tracking feature over the frames in the scene. For example, when a depth value is used as the tracking feature, the specific value in each of the frames is the depth value of the detected hand in that frame, and an example of such a feature distribution is illustrated in the accompanying drawings.
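A minimal sketch of one way to compute such a feature distribution follows, using a normalized histogram of per-frame feature values; the bin count and value range are illustrative assumptions.

    import numpy as np

    def feature_distribution(values, bins=32, value_range=(0.0, 4.0)):
        # values: the tracking feature's value in each frame of a scene,
        # e.g. the depth (in meters) of the detected hand in each frame.
        hist, _ = np.histogram(values, bins=bins, range=value_range)
        return hist / max(hist.sum(), 1)  # normalize to a distribution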
After the training data is classified according to the different scenes, an offline tracking of the hand is performed for all of the scenes by using the features in the feature pool. For example, if there are r features (single features or combination features) in the feature pool, the tracking of the hand is performed for each of the r features, in all of the scenes. Then, for each of the tracking features, the average tracking performance in each of the scenes is calculated. The tracking performance is represented by a parameter or a combination of parameters, such as a tracking accuracy, a tracking error, the number of times of tracking failure (missing the tracking object), or the like. For example, as illustrated by the following expression (2), the average tracking performance is represented by a combination of the tracking error and the number of times of tracking failure.
Avg.PR_m = (1/n)·Σ_{i=1}^{n} error_i + losstimes_m    (2)

where Avg.PR_m represents the average tracking performance of a feature in a scene m, error_i represents the tracking error of the feature in an i-th frame of the scene m (the tracking error may be represented by a distance between the manually marked ground-truth position of the hand in the video frame and the offline-tracked position of the hand in the video frame), n is the number of the video frames of the scene m in the training data set, and losstimes_m represents the number of tracking failures of the feature in the scene m. The smaller the Avg.PR_m calculated by expression (2) is, the better the tracking performance of the feature is.
Thus, the tracking performance of the tracking features in the scenes can be calculated in advance according to the above expression (2). It should be noted that expression (2) may also be expanded to the whole training data set; namely, the average tracking performance of the features may also be calculated for the whole training data set.
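A sketch of expression (2) follows, under the assumption that the combination is an unweighted sum of the mean tracking error and the number of tracking failures; any weighting would be an equally valid combination.

    import numpy as np

    def avg_pr(errors, loss_times):
        # errors: per-frame tracking error of one feature in one scene;
        # loss_times: number of tracking failures of that feature in the scene.
        return float(np.mean(errors)) + loss_times  # smaller is better

Pooling the frames of all scenes into errors and loss_times yields the expanded, whole-data-set variant mentioned above.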
Returning to step S220, as described above, in this step, it is only necessary to determine what the scene has changed to; the feature with the optimal tracking performance for the changed scene can then be selected based on the previously calculated tracking performance of the tracking features in the scenes. The detailed steps will be described with reference to FIG. 6.
As illustrated in FIG. 6, in step S610, the feature distribution of the first tracking feature in the k+1 video frames from the video frame T to the video frame T+k is calculated.
In step S620, distances between the feature distribution and the previously calculated feature distribution of the first tracking feature in each of the scenes of the training data set are calculated.
As described above, in the k+1 video frames from the video frame T to the video frame T+k, good tracking performance cannot be obtained by using the first tracking feature; thus, it is determined that the scene has changed since the video frame T. Here, for convenience of explanation, the current scene that has been changed is represented by Situation_current. Furthermore, as described above, the distribution of each feature in the feature pool is calculated in advance for each possible scene.
Accordingly, in step S620, the corresponding distances between the feature distribution of the first tracking feature in the k+1 video frames and the feature distribution of the first tracking feature in each of the scenes of the training data set can be calculated.
In step S630, the scene in the training data set, which corresponds to a minimum distance among the distances, is determined.
In this step, the minimum distance among the corresponding distances calculated in step S620 is determined, and the scene Situation_minD in the training data set, which corresponds to the minimum distance, is determined. The scene may be represented by the following expression.
Situation_minD = Min_{i∈(1,M)} D(feature1_Situation_current, feature1_Situation_i)

where M is the number of the scenes in the training data set, and D(feature1_Situation_current, feature1_Situation_i) represents the distance between the feature distribution of the first tracking feature in the changed current scene Situation_current and the previously calculated feature distribution of the first tracking feature in the i-th scene Situation_i of the training data set.
In step S640, the tracking feature with the optimal tracking performance for the scene in the training data set that corresponds to the minimum distance is determined, based on the previously calculated tracking performance of each of the tracking features in each of the scenes of the training data set, and serves as the second tracking feature.
As described above, the average tracking performance Avg.PR of each tracking feature in each scene is calculated in advance according to expression (2); thus, the tracking feature with the optimal tracking performance for the scene Situation_minD can be easily determined, and serves as the second tracking feature with the optimal tracking performance for the changed current scene Situation_current.
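Steps S620 to S640 may be sketched together as follows; scene_dists and avg_pr_table are assumed to hold, respectively, the previously calculated per-scene feature distributions of the first tracking feature and the previously calculated Avg.PR values per scene and per feature, and a Euclidean distance between distributions is an illustrative choice.

    import numpy as np

    def select_second_feature(current_dist, scene_dists, avg_pr_table):
        # Steps S620/S630: scene of the training data set whose distribution of
        # the first tracking feature is closest to the currently observed one.
        scene = min(scene_dists,
                    key=lambda s: np.linalg.norm(current_dist - scene_dists[s]))
        # Step S640: feature with the optimal (smallest) average tracking
        # performance Avg.PR for that scene.
        return min(avg_pr_table[scene], key=avg_pr_table[scene].get)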
It should be noted that, in step S610, the feature distribution of the first tracking feature is calculated in the k+1 video frames from the video frame T whose reliability falls below the reliability threshold; however, this is just an example. Specifically, the feature distribution may be calculated in a plurality of video frames from several frames before or after the video frame T to the video frame T+k; alternatively, the feature distribution may also be calculated in a video frame sequence of more or fewer than k+1 video frames.
Additionally, in the above description relating to FIG. 6, the changed scene is determined based on the feature distribution of the first tracking feature; however, this is just an example, and the changed scene may also be determined by any other appropriate method.
Returning to FIG. 2, in step S230, the tracking of the hand is continued based on the selected second tracking feature.
As described above, the tracking of the hand according to the embodiment of the present invention is a real-time, online tracking process. Thus, after the second tracking feature is selected, in step S230, for each of the video frames having the hand after the scene has changed, the tracking of the hand is continued in real time based on the second tracking feature. The specific tracking processing based on the second tracking feature may be performed by using any known method, and the detailed description is omitted here.
The method for tracking a hand according to the embodiment of the present invention has been described above. According to the method, a feature with optimal tracking performance for a corresponding scene can be dynamically selected in response to a scene change during the tracking process of the hand; thus, it is possible to perform accurate tracking.
It should be noted that, in the whole tracking process applying the tracking method according to the embodiment of the present invention, when the scene is changed, a feature most fitting the changed scene is dynamically selected to perform the tracking; however, for the first video frame at the start of the tracking, the scene cannot be predicted, so the most fitting feature cannot be selected in advance. Accordingly, for the first video frame at the start of the tracking, the tracking may be performed based on the tracking feature with the optimal average tracking performance in the whole training data set, which may be calculated by using the expanded expression (2) as described above.
Additionally, in the above description, a hand is tracked as an example of the tracking object; however, the object tracking method according to the present invention is not limited to hand tracking, and may be applied to the tracking of other objects.
Furthermore, an embodiment of the present invention may also provide a tracking feature selecting method for real-time object tracking. In this method, the tracking feature with optimal tracking performance for a changed scene is selected in response to a change of the scene of a video frame having the object. For the specific processing of the selecting step, reference may be made to the above descriptions of step S220 and steps S610 to S640, and the detailed description is omitted here.
In the following, an object tracking apparatus according to an embodiment of the present invention will be described with reference to FIG. 8.
As illustrated in FIG. 8, the object tracking apparatus includes a feature selection unit 810 configured to select a tracking feature with optimal tracking performance for a changed scene and notify a tracking unit 820 of the tracking feature, when the scene of a video frame is changed; and the tracking unit 820 configured to track, based on the selected tracking feature, the object in a sequence of video frames having the object.
For the detailed functions and operations of the above feature selection unit 810 and tracking unit 820, reference may be made to the above descriptions of the object tracking method, and the detailed description is omitted here.
The basic principle of the present invention has been described above with reference to the embodiments. Any one or all of the steps or units of the method or apparatus according to the present invention may be implemented by hardware, software, or a combination thereof in any computing device (including a processor, a storage medium, etc.) or a network of computing devices, and this can be done by persons skilled in the art who have read the specification of the present application.
Therefore, the present invention may also be realized by a program or a set of programs running on any computing device. The computing device may be a well-known general-purpose device. The present invention may thus also be implemented by providing a program product including program code for implementing the method or apparatus. That is to say, the program product also belongs to the present invention, and so does a storage medium storing the program product. Obviously, the storage medium may be any well-known storage medium or any storage medium to be developed in the future.
In addition, in the apparatus or method of the present invention, units or steps may be divided and/or recombined. The division and/or recombination should be regarded as an equivalent embodiment of the present invention. Steps of the above method may be performed in time order, however the performing sequence is not limited to the time order. Any steps may be performed in parallel or independently.
The present invention is not limited to the specifically disclosed embodiments, and various modifications, combinations and replacements may be made without departing from the scope of the present invention.
The present application is based on and claims the benefit of priority of Chinese Priority Application No. 201310479162.3 filed on Oct. 14, 2013, the entire contents of which are hereby incorporated by reference.