This application is entitled to claim the benefit of priority based on Japanese Patent Application No. 2008-202291, filed on Aug. 5, 2008; the entire contents of which are incorporated herein by reference.
The present invention relates to an apparatus and a method for tracking an object in an image, and more particularly, to an apparatus and a method which may speed up tracking of the object and improve robustness.
JP-A 2006-209755 (KOKAI) (see page 11, FIG. 1) and L. Lu and G. D. Hager, “A Nonparametric Treatment for Location/Segmentation Based Visual Tracking,” Computer Vision and Pattern Recognition, 2007 disclose conventional image processing apparatuses that track objects using classification units which separate the objects from their backgrounds in input images, adapting to changes in the appearance of the objects and their backgrounds over time. These apparatuses generate new feature extraction units when the classification units are updated. However, the features extracted by the new feature extraction units are not always effective for separating the objects from their backgrounds when an object changes temporarily (e.g., a person raises his/her hand for a quick moment), and therefore tracking may be unsuccessful.
As stated above, the conventional technologies may fail to track an object because the features extracted by newly generated feature extraction units are not always effective for separating the object from its background.
The present invention provides an image processing apparatus, an image processing method, and an image processing program that allow high-speed and robust tracking of an object.
An aspect of the embodiments of the invention is an image processing apparatus which comprises: a classification unit configured to extract N features from an input image using N pre-generated feature extraction units and to calculate a confidence value which represents object-likelihood based on the extracted N features; an object detection unit configured to detect an object included in the input image based on the confidence value; a feature selection unit configured to select M feature extraction units from the N feature extraction units, M being a positive integer smaller than N, such that the separability between the confidence value of the object and that of its background becomes greater than in a case where all N feature extraction units are used; and an object tracking unit configured to extract the M features selected by the feature selection unit from the input image and to track the object using the M features.
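As a rough illustration only, this arrangement might be sketched in Python as follows; the class name ObjectTrackerSketch, the callable feature extractors, and the weighted-sum confidence are assumptions made for the example, not the claimed implementation.

```python
import numpy as np

class ObjectTrackerSketch:
    """Illustrative skeleton: N feature extraction units, a linear confidence
    for detection, and a reduced M-feature confidence for tracking."""

    def __init__(self, extractors, weights, num_selected):
        self.extractors = list(extractors)   # N pre-generated feature extraction units (callables)
        self.weights = np.asarray(weights, dtype=float)
        self.M = num_selected                # M < N
        self.selected = None                 # indices chosen by the feature selection unit

    def detection_confidence(self, patch):
        """Object-likelihood from all N features (used by the object detection unit)."""
        x = np.array([g(patch) for g in self.extractors])
        return float(self.weights @ x)

    def tracking_confidence(self, patch):
        """Object-likelihood from the M selected features only (used by the object tracking unit)."""
        idx = np.asarray(self.selected)
        x = np.array([self.extractors[i](patch) for i in idx])
        return float(self.weights[idx] @ x)
```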
As shown in the block diagram of the first embodiment, the image processing apparatus 100 includes an acquisition unit 110, an object detection unit 120, a feature selection unit 130, an object tracking unit 140, a storage unit 150, and a control unit 160.
The feature selection unit 130 may generate a plurality of groups of features, each group containing the extracted N features, based on a detection result of the object detection unit 120 or a tracking result of the object tracking unit 140. Based on the generated groups of features, the feature selection unit 130 may select M feature extraction units from the N feature extraction units such that the separability between the confidence value of the object and that of its background becomes greater.
The sequence of images acquired by the acquisition unit 110 is input to the object detection unit 120 or the object tracking unit 140. The image processing apparatus 100 outputs a detection result of the object detection unit 120 and a tracking result of the object tracking unit 140 from the feature selection unit 130 or the object tracking unit 140. The object detection unit 120, the object tracking unit 140, and the feature selection unit 130 are each connected to the storage unit 150. The object detection unit 120 outputs the detection result of the object to the object tracking unit 140 and the feature selection unit 130. The object tracking unit 140 outputs the tracking result of the object to the object detection unit 120 and the feature selection unit 130. The feature selection unit 130 outputs the selection result of the features to the object tracking unit 140.
Operation of the image processing apparatus according to the first embodiment of the present invention is explained below with reference to the flowchart of steps S310 through S350.
In step S310, the control unit 160 stores the image sequence acquired by the acquisition unit 110 in the storage unit 150.
In step S320, the control unit 160 determines whether the present mode is a tracking mode. For example, the control unit 160 determines that the present mode is the tracking mode in a case where detection and tracking of the object in the previous image were successful and feature selection was performed in step S350. When the control unit 160 determines that the present mode is the tracking mode (“Yes” in step S320), it proceeds to step S340. When the control unit 160 determines that the present mode is not the tracking mode (“No” in step S320), it proceeds to step S330.
In step S330, the object detection unit 120 detects an object using N features extracted by the N feature extraction units 151 (g1, g2, . . . , gN) stored in the storage unit 150. More specifically, a confidence value which expresses object-likelihood at each position of an input image is calculated, and the position having the peak of the confidence value is set as the position of the object. The confidence value cD may be calculated from the extracted N features x1, x2, . . . , xN using equation 1, where xi denotes the feature extracted by the feature extraction unit gi.
cD=fD(x1, x2, . . . , xN)   (Equation 1)
The function fD is, for example, a classifier which separates the pre-learned object used for generating the N feature extraction units from its background. The function fD may therefore be nonlinear, but a linear function as shown in equation 2 may simply be used. Here, “background” means the area of an image that remains after the removal of the object. In practice, an area including the position is set for each position of the input image, and classification is performed by extracting features from the set area to decide whether the position belongs to the object. The set areas therefore include both the object and its background at positions near the boundary between the object and the background; at such positions, the position is classified as object when the proportion of the object in the area is greater than a predefined value.
A classifier which satisfies equation 2 may be realized by using, for example, the well-known AdaBoost algorithm, where gi denotes the i-th weak classifier, xi denotes the output of the i-th weak classifier, and ai denotes the weight of the i-th weak classifier.
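Assuming that equation 2 has the weighted-sum form cD = a1*x1 + a2*x2 + . . . + aN*xN implied by the AdaBoost description, the detection of step S330 might be sketched as follows; the sliding-window scan, the step size, and the extractor callables are assumptions made for the example.

```python
import numpy as np

def detect_object(image, extractors, alphas, window, step=4):
    """Compute cD = sum_i a_i * x_i on a sliding window and return the peak.
    Detection is treated as unsuccessful when the peak is below a threshold (not shown)."""
    win_h, win_w = window
    best_pos, best_score = None, -np.inf
    for y in range(0, image.shape[0] - win_h + 1, step):
        for x in range(0, image.shape[1] - win_w + 1, step):
            patch = image[y:y + win_h, x:x + win_w]
            feats = np.array([g(patch) for g in extractors])  # x_1 .. x_N
            score = float(np.dot(alphas, feats))              # assumed weighted-sum form of Equation 2
            if score > best_score:
                best_pos, best_score = (y, x), score
    return best_pos, best_score
```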
In step S331, the control unit 160 determines whether detection of the object was successful. For example, the control unit 160 determines that detection is unsuccessful when the peak value of the confidence value is smaller than a threshold value. In step S331, the control unit 160 proceeds to step S320 when it determines that detection of the object is unsuccessful (“No” in step S331), and proceeds to step S350 when it determines that detection of the object is successful (“Yes” in step S331).
In step S340, the object tracking unit 140 tracks the object using M features extracted by the M feature extraction units selected by the feature selection unit 130. More specifically, a confidence value which expresses object-likelihood at each position of the input image is calculated, and the position having the peak of the confidence value is set as the position of the object. The object tracking unit 140 determines that tracking is unsuccessful when the peak value of the confidence value is smaller than a threshold value. The confidence value cT may be calculated from the extracted M first features xσ1, xσ2, . . . , xσM using equation 3, where xσi denotes the feature extracted by the feature extraction unit gσi, given the conditions σ1, σ2, . . . , σM ∈ {1, 2, . . . , N} and σi≠σj if i≠j.
cT=fT(xσ1, xσ2, . . . , xσM)   (Equation 3)
For example, the function fT restricts the input of the function fD, used for the detection of the object, to the M features. If fD is a linear function as shown in equation 2, fT can be expressed by equation 4.
Simply, bi=aσi (i=1, 2, . . . , M). The confidence value cT may also be calculated using the similarity between the M first features xσ1, xσ2, . . . , xσM and M second features yσ1, yσ2, . . . , yσM extracted from the object in an input image for which the detection or tracking process has already been completed. For example, the similarity may be calculated as the inner product of a first vector having the M first features and a second vector having the M second features, as shown in equation 5, where yσi denotes the feature extracted by the feature extraction unit gσi.
Equation 6, which uses only the positive values of the product terms of equation 5, may also be used.
Equation 7, which focuses on the signs of the product terms of equation 5, may also be used.
The function h(x) is the same as that used in equation 6. Equation 7 represents the rate at which the signs of the M first features match those of the M second features.
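The three similarity measures described above might be sketched as follows; since equations 5 to 7 are not reproduced in this text, h is assumed here to be the positive-part function, and the sign-matching rate of equation 7 is taken as the fraction of positive products.

```python
import numpy as np

def similarity_inner_product(x_sel, y_sel):
    """Equation 5 as described: inner product of the M first and M second features."""
    return float(np.dot(x_sel, y_sel))

def similarity_positive_products(x_sel, y_sel):
    """Equation 6 as described: sum only the positive products; h() is assumed
    to be the positive-part function h(t) = max(t, 0)."""
    products = np.asarray(x_sel, dtype=float) * np.asarray(y_sel, dtype=float)
    return float(np.maximum(products, 0.0).sum())

def similarity_sign_match(x_sel, y_sel):
    """Equation 7 as described: rate at which the signs of the first and second
    features agree, computed as the fraction of positive products."""
    products = np.asarray(x_sel, dtype=float) * np.asarray(y_sel, dtype=float)
    return float(np.mean(products > 0))
```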
In step S341, the control unit 160 determines whether tracking of the object is successful. The control unit 160 proceeds to step S350 when it determines that tracking of the object is successful (“Yes” in step S341), and proceeds to step S330 when it determines that tracking of the object is unsuccessful (“No” in step S341).
In step S350, the feature selection unit 130 selects M feature extraction units from the N feature extraction units such that the degree of separation of the confidence value cD, which represents object-likelihood, between the object and its background becomes larger, in order to adapt to changes in the appearance of the object and its background. The outputs of the unselected N−M feature extraction units are treated as 0 in the calculation of cD. Suppose that cD is calculated by equation 2. In one feature selection method, features y1, y2, . . . , yN (yi denotes the feature extracted by gi) are extracted as a group from the position of the object by the N feature extraction units, and M feature extraction units are selected in descending order of ai*yi. Instead of using these N features as they are, N features extracted as another group from the position of the object in each of a plurality of previously processed images may be considered. This enables the average value Myi of the features extracted by each feature extraction unit gi to be calculated, so that M feature extraction units can be selected in descending order of ai*Myi; higher-order statistics may also be incorporated. For example, letting syi be the standard deviation of the features extracted by the feature extraction unit gi, M feature extraction units are selected in descending order of ai*(yi−syi) or ai*(Myi−syi).

N features z1, z2, . . . , zN (zi denotes the feature extracted by the feature extraction unit gi) extracted from neighboring areas of the object may also be used, so that M feature extraction units are selected in descending order of ai*(yi−zi). As for the feature zi extracted from the background, instead of using the value of zi as it is, M feature extraction units may be selected in descending order of ai*(yi−Mzi) or ai*(Myi−Mzi), where Mz1, Mz2, . . . , MzN are the average values of the features extracted from the neighboring areas of the object and from background positions without objects in a plurality of previously processed images. Higher-order statistics such as the standard deviations sz1, sz2, . . . , szN may be incorporated as well as the average values; for example, M feature extraction units may be selected in descending order of ai*(Myi−syi−Mzi−szi). The neighboring areas for extracting zi may be selected from, for example, four areas (e.g., to the right, left, top, and bottom) of the object, or from areas which have a large cD or cT. An area having a large cD is likely to be falsely detected as the object, and an area having a large cT is likely to be falsely tracked as the object; selecting such an area widens the gap between cT in this area and cT at the position of the object, and the peak of cT may therefore be sharpened.

Instead of selecting M feature extraction units in descending order of ai*yi, the feature extraction units whose ai*yi is greater than a threshold value may be selected. If the number of values ai*yi greater than the predefined threshold value is smaller than M, where M is set as the minimum number of feature extraction units to be selected, M feature extraction units may be selected in descending order of ai*yi.
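One of the ranking criteria named above, descending order of ai*(Myi−syi−Mzi−szi), might be sketched as follows; the array shapes and the function name are assumptions made for the example.

```python
import numpy as np

def select_feature_units(alphas, object_features, background_features, M):
    """Rank the N feature extraction units by a separability score and keep the top M.
    object_features:     samples x N responses at object positions (the y_i group)
    background_features: samples x N responses at neighboring/background positions (the z_i group)"""
    object_features = np.asarray(object_features, dtype=float)
    background_features = np.asarray(background_features, dtype=float)
    My = object_features.mean(axis=0)        # average object response per unit
    Mz = background_features.mean(axis=0)    # average background response per unit
    sy = object_features.std(axis=0)         # higher-order statistics (standard deviations)
    sz = background_features.std(axis=0)
    score = np.asarray(alphas) * (My - sy - Mz - sz)   # one of the criteria described above
    order = np.argsort(score)[::-1]          # descending order of the separability score
    return order[:M]                         # indices of the selected M extraction units
```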
Images of multiple resolutions may be input by creating low-resolution images by down-sampling the input images. In this case, the object detection unit 120 and the object tracking unit 140 perform detection or tracking on the images of multiple resolutions. Detection of the object is performed by setting the position which gives the maximum peak value of cD across the image resolutions as the position of the object. The generation method of the samples in the feature selection unit 130 is fundamentally as described above; however, the neighboring areas of the object differ in that they exist not only on the image having the resolution at which the peak value of cD or cT is maximal but also on the images having the other resolutions. Therefore, the samples used for feature selection are created from the images of multiple resolutions.
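A minimal multi-resolution detection loop consistent with this description might look as follows; down-sampling by striding, the stride values, and the detect_fn interface are assumptions made for the example.

```python
def detect_over_resolutions(image, detect_fn, strides=(1, 2, 4)):
    """Run detection on progressively down-sampled copies and keep the resolution
    whose confidence peak is largest.  detect_fn(image) is assumed to return
    (peak_position, peak_value); striding stands in for a proper low-pass resize."""
    best = None
    for s in strides:
        small = image[::s, ::s]
        pos, value = detect_fn(small)
        if pos is None:
            continue
        candidate = ((pos[0] * s, pos[1] * s), value, s)   # map back to the original resolution
        if best is None or value > best[1]:
            best = candidate
    return best   # ((y, x) in the original image, peak value, stride), or None
```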
According to the first embodiment of the image processing apparatus, M feature extraction units are selected from the pre-generated N feature extraction units such that the separability between the confidence value of the object and that of its background becomes greater. As a result, high-speed tracking as well as adaptation to appearance changes of the object and its background can be realized.
In this embodiment, a verification process for candidate positions of the object is introduced for the case where the confidence value cT, which represents object-likelihood, has a plurality of peaks (i.e., there are a plurality of candidate positions of the object).
The block diagram of the image processing apparatus according to a second embodiment of the invention is the same as that of the first embodiment of the invention described above.
In step S401, when it is determined in step S320 that the present mode is the tracking mode, the object tracking unit 140 calculates the confidence value cT, which represents object-likelihood as shown in equation 3, at each position of the image using, for example, one of equations 4-7.
In step S402, the object tracking unit 140 acquires the peaks of the confidence value cT calculated in step S401.
In step S403, the object tracking unit 140 excludes any peak acquired in step S402 whose value is smaller than a threshold value.
In step S404, the control unit 160 determines whether the number of remaining peaks is 0. When the control unit 160 determines that the number of remaining peaks is 0 (“Yes” in step S404), tracking is unsuccessful and the control unit 160 proceeds to step S330, where detection of the object is performed again. When the control unit 160 determines that the number of remaining peaks is not 0 (i.e., the number of remaining peaks is greater than or equal to 1) (“No” in step S404), the control unit 160 proceeds to step S405.
In step S405, the control unit 160 verifies the hypothesis that each of the remaining peak positions corresponds to the position of the object. The verification of a hypothesis is performed by calculating a confidence value cV which represents object-likelihood. If the confidence value is equal to or smaller than a threshold value, the corresponding hypothesis is rejected; if the confidence value is greater than the threshold value, the corresponding hypothesis is accepted. When the control unit 160 determines that all of the hypotheses are rejected, tracking is unsuccessful and the control unit 160 proceeds to step S330, where detection of the object is performed again. When there are a plurality of accepted hypotheses, the control unit 160 sets the peak position having the maximum value of cV as the final position of the object and proceeds to the feature selection step S350.
The confidence value cV showing object-likelihood used for hypothesis verification is calculated by means other than the means for calculating cT. In the simplest case, cD may be used as cV; a hypothesis at a position which does not look like the object can then be rejected. Outputs of classifiers using higher-level feature extraction units, which are different from the feature extraction units stored in the storage unit 150, may also be used as cV. In general, the higher-level feature extraction units have a large calculation cost, but the number of calculations of cV for an input image is smaller than that of cD and cT, so the calculation cost does not greatly affect the total processing time of the apparatus. As the higher-level feature extraction, for example, features based on edges may be used as described in N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” Computer Vision and Pattern Recognition, 2005. The similarity between the position of the object in the previous image and the hypothetical position in the present image may also be used. This similarity may be the normalized correlation between pixel values in the two regions, one including the position of the object and the other including the hypothetical position, or a similarity of the distributions of pixel values. The similarity of the distributions of pixel values may be based on, for example, the Bhattacharyya coefficient or the sum of the intersection of two histograms of pixel values.
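Two of the similarity measures mentioned above for cV might be sketched as follows, assuming the two regions have been resampled to the same size and that pixel values lie in the range 0-255; the function names and the bin count are assumptions made for the example.

```python
import numpy as np

def normalized_correlation(region_prev, region_hyp):
    """One possible cV: normalized correlation of pixel values between the object
    region in the previous image and the hypothesized region in the present image."""
    a = region_prev.astype(float).ravel() - region_prev.mean()
    b = region_hyp.astype(float).ravel() - region_hyp.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def histogram_similarities(region_prev, region_hyp, bins=32, value_range=(0, 256)):
    """Alternative cV based on pixel-value distributions: returns the Bhattacharyya
    coefficient and the histogram intersection of the two regions."""
    h1, _ = np.histogram(region_prev, bins=bins, range=value_range)
    h2, _ = np.histogram(region_hyp, bins=bins, range=value_range)
    h1 = h1 / max(h1.sum(), 1)   # normalize to probability distributions
    h2 = h2 / max(h2.sum(), 1)
    bhattacharyya = float(np.sum(np.sqrt(h1 * h2)))
    intersection = float(np.sum(np.minimum(h1, h2)))
    return bhattacharyya, intersection
```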
According to the second embodiment of the image processing apparatus, more robust tracking may be realized by introducing a verification process into the tracking process of the object.
In this embodiment, a case where a plurality of objects are included in an image is explained. The block diagram and operation of the image processing apparatus according to the third embodiment of the present invention are similar to those of the first embodiment of the present invention described above.
In step S310, the control unit 160 stores the sequence of images input from an image input unit in the storage unit.
In step S320, the control unit 160 determines whether the present mode is a tracking mode. For example, the control unit 160 determines that the present mode is the tracking mode in a case where detection and tracking of the objects in the previous image were successful and feature selection was performed for at least one object in step S350. When a certain number of images have been processed since the last time the detection step S330 was performed, the control unit 160 determines that the present mode is not the tracking mode.
In step S330, the object detection unit 120 detects objects using N features extracted by the N feature extraction units g1, g2, . . . , gN stored in the storage unit 150. More specifically, a confidence value cD which expresses object-likelihood at each position of an input image is calculated, all of the positions having peaks of the confidence value are acquired, and each of these positions is set as a position of an object.
In step S331, the control unit 160 determines whether detection of the objects was successful. For example, the control unit 160 determines that detection is unsuccessful when all of the peak values of the confidence value are smaller than a threshold value. In this case, the confidence value cD is calculated by, for example, equation 2. In step S331, the control unit 160 proceeds to step S320 and processes the next image when it determines that detection of the objects is unsuccessful (“No” in step S331), and proceeds to step S350 when it determines that detection of the objects is successful (“Yes” in step S331).
In step S340, the object tracking unit 140 tracks each of the objects using the M features extracted by the M feature extraction units selected for that object by the feature selection unit 130. More specifically, a confidence value cT which expresses object-likelihood at each position of the input image is calculated for each object, and the position having the peak of the confidence value is set as the position of that object.
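Per-object tracking with per-object selected features might be organized as follows; the dictionary layout and the track_one interface are assumptions made for the example.

```python
def track_all_objects(image, object_states, track_one):
    """Track each object with its own selected feature extraction units.
    object_states: dict mapping an object id to {'selected': indices, 'position': (y, x)}
    track_one(image, selected, position) is assumed to return (new_position, peak_value)."""
    results = {}
    for obj_id, state in object_states.items():
        new_pos, peak = track_one(image, state["selected"], state["position"])
        results[obj_id] = {"position": new_pos, "peak": peak}
    return results
```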
In step S341, the control unit 160 determines whether tracking of the objects is successful. The control unit 160 determines that tracking is unsuccessful when the peak values of the confidence values for all of the objects are smaller than a threshold value (“No” in step S341). Alternatively, the control unit 160 may determine that tracking is unsuccessful when the peak value of the confidence value for at least one object is smaller than a threshold value (“No” in step S341). In this case, the confidence value cT is calculated by, for example, equation 4. The control unit 160 proceeds to step S350 when it determines that tracking of the objects is successful (“Yes” in step S341), and proceeds to step S330 when it determines that tracking of the objects is unsuccessful (“No” in step S341).
In step S350, the feature selection unit 130 selects M feature extraction units from the N feature extraction units for each object such that the degree of separation of the confidence value cD, which represents object-likelihood, between each of the objects and its background becomes larger, in order to adapt to changes in the appearance of each object and its background. Since the calculating method of cD is explained in the first embodiment of the present invention, the explanation of the calculating method is omitted here.
According to the third embodiment of the image processing apparatus, tracking may be more robust and faster than ever before when a plurality of objects are included in an image.
Before evaluating equation 5, equation 6, or equation 7, which are means for calculating the confidence value cT representing object-likelihood, a certain value θσi may be subtracted from the output of each feature extraction unit gσi. This means that xσi and yσi in equation 5, equation 6, and equation 7 are replaced with xσi−θσi and yσi−θσi, respectively. θσi may be, for example, the average value Myσi of yσi used in the above-mentioned feature selection, the average value of both yσi and zσi, or the intermediate value instead of the average value. Alternatively, the learning result of a classifier which separates yσi from zσi (a plurality of yσi and zσi exist if there are a plurality of samples generated at the time of feature selection) may be used for the output of each feature extraction unit gi. For example, a linear classifier expressed in the form l = u*x − v may be used, where l denotes a category label, x denotes the value of a learning sample (i.e., yσi or zσi), and u and v denote constants determined by learning. The category label of yσi is set to 1 and the category label of zσi is set to −1 at the time of learning. If the value of u acquired by the learning is not 0, v/u is used as θi; if the value of u is 0, then θi=0. The learning of the classifier may be performed using linear discriminant analysis, support vector machines, or any other method capable of learning linear classifiers.
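The per-unit offset θi derived from a 1D linear classifier might be sketched as follows; a least-squares fit is used here as a stand-in for linear discriminant analysis or a linear SVM, and the function name is an assumption made for the example.

```python
import numpy as np

def learn_feature_offset(y_samples, z_samples):
    """Fit, for one feature extraction unit, a 1D linear rule l = u*x - v separating
    object responses y (label +1) from background responses z (label -1), and return
    theta = v/u, or 0 when u is 0."""
    x = np.concatenate([np.asarray(y_samples, float), np.asarray(z_samples, float)])
    labels = np.concatenate([np.ones(len(y_samples)), -np.ones(len(z_samples))])
    u, c = np.polyfit(x, labels, 1)      # fits labels ≈ u*x + c, so v = -c
    return float(-c / u) if u != 0 else 0.0
```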
The invention is not limited to the above embodiments; elements can be modified and embodied without departing from the scope of the invention. Further, suitable combinations of the plurality of elements disclosed in the above embodiments may create various inventions. For example, some of the elements may be omitted from all the elements described in the embodiments, and elements of different embodiments may be suitably combined with each other. The processing of each element of the image processing apparatus may be performed by a computer using a computer-readable image processing program stored in or transmitted to the computer.