The present invention relates to a technique for tracking a specific subject in an image.
Examples of techniques for tracking a specific subject in an image include a technique using luminance and color information and a technique based on template correlation. A recent technique utilizing a deep neural network (hereinafter referred to as a DNN) has been attracting increasing attention as a high-accuracy tracking technique. For example, Non-Patent Literature 1 discusses a method for tracking a specific subject in an image. An image including a tracking target and an image to be a search range are input to convolutional neural networks (hereinafter abbreviated as CNNs) having the same weight. Then, the cross correlation between feature quantities obtained from the CNNs is calculated to identify the position where the tracking target exists in the image that is the search range.
PTL 1: Japanese Patent Application Laid-Open No. 2013-219531
NPL 1: Bertinetto, “Fully-Convolutional Siamese Networks for Object Tracking”, arXiv 2016
However, in the technique discussed in Non-Patent Literature 1, when the image contains an object similar to the tracking target, the cross-correlation value with the similar object becomes high, and the similar object may be erroneously tracked as the tracking target. In Patent Literature 1, when an object similar to the tracking target exists in the vicinity of the tracking target, the positions of the tracking target and the similar object are predicted. However, the method discussed in Patent Literature 1 uses only the position of the tracking target for prediction. Thus, when the tracking target exists at a position away from the predicted position or when the tracking target and the similar object are close to each other, the tracking target may be lost.
The present invention has been devised in view of the above-described problem and is directed to tracking of a specific object.
According to an aspect of the present invention, an information processing apparatus configured to track a specific object in images captured at a plurality of times includes a retaining unit configured to retain a feature quantity of a tracking target based on a learned model configured to detect a position of a predetermined object in an input image, an acquisition unit configured to acquire feature quantities of objects in a plurality of images based on the learned model, a detection unit configured to detect a candidate object similar to the tracking target based on the feature quantity of the tracking target and the feature quantities of the objects acquired from the plurality of images, and an identification unit configured to identify a correlation between the candidate object detected in a first image and the candidate object in a second image captured at a different time from the first image among the plurality of images.
Further features of the present invention will become apparent from the following description of exemplary embodiments with reference to the attached drawings.
An information processing apparatus according to an exemplary embodiment of the present invention will be described below with reference to the accompanying drawings. Components assigned the same reference numerals in the drawings perform the same operations, and thus duplicated descriptions thereof will be omitted. The components described in the exemplary embodiment are mere examples and are not intended to limit the scope of the present invention.
In a first exemplary embodiment, an example will be described where a tracking target and an object similar to the tracking target are tracked at the same time so that stable tracking is continued even in a situation where there are many objects similar to the tracking target or where the tracking target is blocked by another object. More specifically, the present exemplary embodiment is directed to stably tracking each object even in a case where an object similar to the tracking target is present.
A storage unit H104 stores data to be processed and tracking target data according to the present exemplary embodiment. As media of the storage unit H104, a hard disk drive (HDD), a flash memory, and various optical media can be used. An input unit H105 includes a keyboard, a touch panel, a dial, or the like for accepting inputs from a user, and is used to set a tracking target. A display unit H106 includes a liquid crystal display or the like and displays a subject and a tracking result to the user. The information processing apparatus 1 can communicate with another apparatus such as an imaging apparatus via a communication unit H107.
In S301, the image acquisition unit 201 acquires an image (initial image) in which a predetermined object is captured. The image acquisition unit 201 may acquire an image captured by an imaging apparatus connected to the information processing apparatus or acquire an image stored in the storage unit H104. The processing in S301 to S303 is directed to setting the object of interest to be the tracking target using the initial image.
In S302, the tracking target determination unit 202 determines the object to be the tracking target (object of interest) in the image acquired in S301. There may be one or a plurality of tracking targets. In the present exemplary embodiment, an example of selecting one tracking target will be described. In this step, the tracking target determination unit 202 acquires the position of an image feature indicating a predetermined object from the image by using a learned model that detects the position of the predetermined object, and determines a partial image containing the object of interest. As the learned model, for example, a model that has learned the image feature of a predetermined object such as a person or a vehicle in advance is used. A learning method will be described below. When one object is detected in the image, the object becomes a tracking target. When the predetermined object is not detected in the image, for example, the image in the next frame may be input. When a plurality of objects is acquired, the tracking target determination unit 202 outputs a tracking target candidate and then determines the tracking target by using a method specified in advance. In this case, the tracking target determination unit 202 determines the tracking target (object of interest) in the acquired image according to an instruction specified by the input unit H105.
Examples of specific methods for determining the tracking target include a method for determining the tracking target by a touch on the subject displayed on the display unit H106. In addition to specification by the input unit H105, the tracking target determination unit 202 may determine the tracking target by automatically detecting a main subject in the image. Examples of the method for automatically detecting the main subject in the image include the method discussed in Japanese Patent No. 6556033. The tracking target determination unit 202 may also determine the main subject based on both the specification by the input unit H105 and a result of detecting an object in the image. Examples of techniques for detecting an object in an image include “Liu, SSD: Single Shot Multibox Detector, In: ECCV2016”.
In S303, the retaining unit 203 retains the feature quantity of the tracking target from the image containing the determined tracking target, based on the learned model.
In S401, the retaining unit 203 acquires information about the position of the tracking target in the image determined by the tracking target determination unit 202. The acquired information about the position of the tracking target is hereinafter referred to as a bounding box (BB). As the information about the position of the tracking target, information about the center position of the tracking target input by the user when the tracking target is determined in S302, or a result of detecting a predetermined position (e.g., the center of gravity) of the tracking target by the learned model, is used.
In S402, the retaining unit 203 acquires a template image, i.e., an image of the tracking target extracted and resized to a predetermined size, based on the position of the tracking target in the image. More specifically, the retaining unit 203 clips, as the template image, the periphery of the region acquired in S401 from the initial image, and then resizes the clipped image to the predetermined size. The predetermined size may be adjusted to the input image size of the learned model.
In S403, the retaining unit 203 inputs the template image indicating the tracking target to the learned model for detecting the position of the predetermined object in the input image, thus acquiring the feature quantity of the tracking target. In this case, the retaining unit 203 inputs the image resized in S402 to a convolutional neural network (CNN) (the learned model). The CNN has been trained in advance to acquire a feature quantity that makes it easier to distinguish between a tracking target and a non-tracking target. A learning method will be described below. The CNN includes convolution and nonlinear transform such as Rectified Linear Unit (hereinafter referred to as ReLU) and Max Pooling. ReLU and Max Pooling described herein are to be considered as mere examples. Leaky ReLU or a sigmoid function may be used instead of ReLU. Average Pooling may be used instead of Max Pooling. The present exemplary embodiment is not limited to these methods. Then, in S404, the retaining unit 203 retains the feature quantity of the tracking target acquired in S403 as a template feature quantity indicating the tracking target. The above-described processing is processing in the tracking target setting phase.
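As an illustration of S401 to S404, the following is a minimal sketch of extracting and retaining the template feature quantity, assuming a PyTorch-style CNN backbone. The names (backbone, TEMPLATE_SIZE, CONTEXT_SCALE, extract_template_feature), the margin around the BB, and the input size are illustrative assumptions, not the embodiment's actual implementation.

```python
import torch
import torch.nn.functional as F

TEMPLATE_SIZE = 127   # assumed input size of the learned model
CONTEXT_SCALE = 2.0   # assumed margin around the bounding box

def extract_template_feature(backbone, image, bb):
    """image: float tensor (1, 3, H, W); bb: (cx, cy, w, h) of the tracking target."""
    cx, cy, w, h = bb
    half = 0.5 * CONTEXT_SCALE * max(w, h)
    x0, y0 = int(cx - half), int(cy - half)
    x1, y1 = int(cx + half), int(cy + half)
    patch = image[:, :, max(y0, 0):y1, max(x0, 0):x1]             # S402: clip the periphery
    patch = F.interpolate(patch, size=(TEMPLATE_SIZE, TEMPLATE_SIZE),
                          mode='bilinear', align_corners=False)    # S402: resize
    with torch.no_grad():
        template_feature = backbone(patch)                          # S403: CNN feature quantity
    return template_feature                                         # S404: retained as template
```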
In S304, the image acquisition unit 201 acquires images captured at a plurality of times to perform tracking processing. The subsequent processing detects the tracking target set in a first image from a second image captured at a different time from the first image. The first and second images are captured so that as large a portion of the tracking target as possible is included in the images.
In S305, the object detection unit 204 detects a candidate object similar to the tracking target based on the feature quantity of the tracking target and the feature quantities of objects acquired from a plurality of images.
In S501, the object detection unit 204 acquires a search range image (partial image) indicating a region to be subjected to tracking target search from the current image (second image). In this case, the object detection unit 204 acquires the search range image based on the last detection position of the tracking target or the candidate object. More specifically, the object detection unit 204 extracts, from the second image, the partial image with a predetermined size from a region corresponding to the vicinity of the candidate object detected in the first image (past image). The size of the search region may be changed depending on the speed of the object and the angle of view of the image. The search region may be the entire search image or the periphery of the last position of the tracking target. Setting a partial region, not the entire region, of the input image as a search range provides effects of improving the processing speed and reducing tracking correlation errors.
In S502, the object detection unit 204 extracts, from the search range image, an input image to be input to the learned model. The object detection unit 204 clips the search range region from the search range image and then resizes it. The size of the search range is determined to be, for example, a constant multiple of the BB size of the tracking target. Acquiring the feature quantities from images of the same size makes it possible to obtain feature quantities with little noise. The object detection unit 204 clips a region based on the determined search region and resizes the region so that the resizing ratio is equivalent to that in S402.
In S503, the object detection unit 204 inputs the extracted search range image to the learned model (CNN) for detecting the position of the predetermined object in the input image, to acquire the feature quantity of each search range image. More specifically, the object detection unit 204 inputs the image of the clipped region to the CNN. The feature quantity of each search range image indicates a feature quantity of an object existing in each search range image. The weight of the CNN in S503 is assumed to be partly or entirely identical to the weight of the CNN in S403. For example, when a certain search range image contains a blocking object that blocks a person, the CNN enables acquisition of the feature quantity indicating the blocking object. When another partial image contains an animal but not a person, the feature quantity indicating the animal is acquired.
In S504, the object detection unit 204 acquires a cross correlation between the feature quantity of the tracking target and the feature quantity of the object existing in the current search range image acquired in S503. The cross correlation is an index representing a similarity between detected objects. In this case, an object similar to the tracking target (an object of the same type as the tracking target) is referred to as the candidate object. More specifically, an object having the cross correlation larger than a predetermined value is the candidate object. The candidate object includes one or both of the tracking target and the non-tracking target. In a specific example, when the tracking target is a person, a search range image having the feature quantity indicating a person has a high cross correlation.
In S505, the object detection unit 204 detects the position of the candidate object in the current image. Since the weight of the CNN in S503 is partly or entirely identical to the weight of the CNN in S403, a value of the cross correlation increases at a position in a search range where a candidate object is highly likely to exist. This makes it possible to detect the position of the candidate object in a search range image having the value of the cross correlation larger than or equal to the threshold value. More specifically, based on the cross correlation acquired in S504, the object detection unit 204 detects a position where the cross correlation is larger than the predetermined value, as the position of the candidate object. The tracking target is less likely to exist at a position where the cross correlation is smaller than the predetermined value. In this case, the object detection unit 204 further acquires the BB that surrounds the candidate object based on the position of the candidate object. First, the object detection unit 204 determines the position of the BB based on a search range image that has indicated a response of a high cross correlation.
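The following is a minimal sketch of S503 to S505 under the same assumptions as above, computing the cross correlation by using the retained template feature as a correlation kernel over the search-range feature (the Siamese-network formulation of Non-Patent Literature 1) and thresholding the resulting map. The threshold value and function names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def detect_candidates(backbone, template_feature, search_image, threshold=0.5):
    """Return (y, x) positions in the correlation map whose response exceeds the threshold."""
    with torch.no_grad():
        search_feature = backbone(search_image)            # S503: feature of the search range
        # S504: cross correlation, using the template feature as a correlation kernel
        score_map = F.conv2d(search_feature, template_feature)
    score_map = score_map.squeeze(0).squeeze(0)
    # S505: positions where the cross correlation exceeds a predetermined value
    ys, xs = torch.nonzero(score_map > threshold, as_tuple=True)
    return list(zip(ys.tolist(), xs.tolist())), score_map
```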
In S306, the tracking unit 205 identifies a correlation between a candidate object detected in the first image among the plurality of images and a candidate object in the second image captured at a different time from the first image. Identifying the correlation between the objects detected at the plurality of times enables tracking of correlated objects. More stable tracking is enabled since the feature quantity and the position of the tracking target are updated based on the image in which the tracking target is detected.
In S701, the tracking unit 205 acquires combinations of candidate objects detected in an image captured at a past time and prestored in the storage unit 206 and candidate objects detected in the image captured at the current time (correlation candidates). In this case, past candidate objects are paired with current candidate objects to generate all possible combinations of the past and current candidate objects. Each of the candidate objects detected in the past images is assigned a tracking target/non-tracking target label. When there is one tracking target, the past candidate object identified as the tracking target may be correlated with each of the current candidate objects.
In S702, the tracking unit 205 identifies a combination (correlation) having an acquired similarity higher than or equal to a threshold value. A high similarity between the past and current candidate objects indicates that the past and current candidate objects are highly likely to be identical objects. There is a plurality of methods for correlation. For example, there are a method for preferentially correlating candidate objects having a higher similarity, and a method that uses the Hungarian algorithm. The correlation method is not limited herein. In this case, the tracking unit 205 identifies the identical objects based on the similarity between a candidate object other than the tracking target in the first image and a candidate object in the second image. Tracking other objects similar to an object that is the tracking target in this way can prevent the tracking target from being correlated with another object. Thus, stable tracking can be performed. Suitably performing correlation in this way enables recognition of the past and current tracking targets as identical objects.
For example, a similarity L between a past candidate c1 and a current candidate c2 is calculated as follows. Here, BB denotes a vector of four variables (center coordinate x, center coordinate y, width, and height) of each candidate's bounding box, and f denotes the feature of each candidate. The feature refers to the feature extracted, at the position of each candidate, from the feature map acquired from the CNN. W1 and W2 are empirically acquired coefficients, where W1 > 0 and W2 > 0. More specifically, the similarity becomes higher as the feature quantities become closer, and as the detection positions and the sizes of the detection regions become closer. [Equation 1]
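Since the equation image is not reproduced above, the following is one plausible form of Formula (1-1) that is consistent with the description (higher similarity for a closer bounding box and closer features, with W1 > 0 and W2 > 0); the choice of squared Euclidean distances is an assumption.

$$
L(c_1, c_2) = -\,W_1 \left\lVert \mathrm{BB}_{c_1} - \mathrm{BB}_{c_2} \right\rVert^{2} - W_2 \left\lVert f_{c_1} - f_{c_2} \right\rVert^{2}
$$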
Then, in S703, the tracking unit 205 identifies the tracking target based on a result of correlation. As the result of the correlation acquired in S702, the tracking unit 205 can identify the current candidate object correlated with the past tracking target, as the tracking target. A candidate object other than the tracking target is supplied with information indicating that the object is not a tracking target. When there is no current candidate object having a similarity to the feature quantity of the past tracking target higher than the predetermined threshold value, it is likely that the tracking target is outside the angle of view or blocked by another object. In that case, the tracking unit 205 may notify that no tracking target is identified.
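As an illustration of S701 to S703, the following sketch builds the similarity matrix between past and current candidates and solves the assignment with the Hungarian algorithm via scipy.optimize.linear_sum_assignment. The similarity follows the plausible form of Formula (1-1) given above, and the coefficient values, the acceptance threshold, and the function names are assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

W1, W2 = 1.0, 1.0          # empirically set coefficients (assumed values)
SIM_THRESHOLD = -10.0      # minimum similarity to accept a correlation (assumed)

def similarity(c1, c2):
    """c1, c2: dicts with 'bb' = (cx, cy, w, h) and 'feat' = feature vector."""
    bb_dist = np.sum((np.asarray(c1['bb']) - np.asarray(c2['bb'])) ** 2)
    feat_dist = np.sum((np.asarray(c1['feat']) - np.asarray(c2['feat'])) ** 2)
    return -W1 * bb_dist - W2 * feat_dist

def correlate(past_candidates, current_candidates):
    """S701-S702: correlate past and current candidates; returns accepted index pairs."""
    sim = np.array([[similarity(p, c) for c in current_candidates]
                    for p in past_candidates])
    rows, cols = linear_sum_assignment(-sim)   # Hungarian algorithm, maximizing similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= SIM_THRESHOLD]

# S703: the current candidate paired with the past candidate labeled as the
# tracking target is identified as the current tracking target.
```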
Finally, in S704, the storage unit 206 retains the feature quantity of the tracking target in the second image and the feature quantity of the candidate object in the second image. When the tracking target is identified in the current image, the storage unit 206 updates the feature quantity of the tracking target. When a candidate object having a similarity to the feature quantity of the tracking target in the first image larger than the predetermined threshold value is detected in the second image, the storage unit 206 retains the feature quantity acquired from the second image as the feature quantity of the tracking target. When no candidate object having the similarity to the feature quantity of the tracking target larger than the predetermined threshold value is detected in the second image, the storage unit 206 retains the feature quantity acquired from the first image as the feature quantity of the tracking target. When no tracking target is detected in the current image, the tracking unit 205 retains the feature quantity and position of the tracking target in the past image. Further, the storage unit 206 stores the feature quantity of the current candidate object supplied with the tracking target/non-tracking target label. The storage unit 206 updates the BB (position and size) and the feature of the tracking target and the candidate object thereof. Retaining the feature quantity and the determination result of candidate objects similar to the tracking target enables more stable tracking.
In S307, the image acquisition unit 201 determines whether to end the tracking processing. When the tracking processing is to be continued, the processing returns to S304. When the tracking processing is to be ended, the processing ends. For example, the image acquisition unit 201 determines to end the processing if an end instruction from the user is acquired or if the image of the next frame cannot be acquired. When the image of the next frame can be acquired, the processing returns to S304. The processing in the tracking processing execution step has been described above. The learning processing will be described below.
Now, a method for training the learned model (specifically, the CNN) for estimating an object position in an image is described. The learned model used herein is assumed to have learned an object classification task (e.g., a task for detecting a person and not detecting an animal) to some extent, and is then trained so that it can recognize an individual based on the external features of the predetermined object. This enables tracking of a specific object.
Assume an example case where a person A wears red clothing, and a person B wears yellow clothing. Since a color of the clothing is not a necessary feature for a learned model that merely detects a person, the learned model may have learned to ignore the color of the clothing in a person detection task. However, when detecting (tracking) only the person A, the model needs to learn a feature that distinguishes between the persons A and B. In this case, the color of the clothing is an important feature required to identify an individual. In the present exemplary embodiment, among the objects in the same category, the model learns the feature quantity of a tracking target object while distinguishing the tracking target object from other objects in the same category.
The storage unit 206 stores images captured at a plurality of times and GT information indicating the position and size of a tracking target in each image. In this case, the storage unit 206 stores information about the center position (or the BB indicating the region) of the tracking target object input by the user for each image as the GT information. The GT information may be generated by a method other than GT assignment by the user. For example, a result of detecting the position of a tracking target object by the use of another learned model may also be used. The GT acquisition unit 1400, the template image acquisition unit 1401, and the search range image acquisition unit 1402 each acquire an image stored in the storage unit 206.
The ground truth (hereinafter referred to as GT) acquisition unit 1400 acquires the GT information to acquire the correct-answer position of the object that is the tracking target in the template image and the correct-answer position of the tracking target in the search range image. The GT acquisition unit 1400 also acquires the BB of the tracking target in the template image acquired by the template image acquisition unit 1401, and the BB of the tracking target in the search range image acquired by the search range image acquisition unit 1402. More specifically, the correct-answer position in the search range image corresponds to the GT map 1704 described below.
The template image acquisition unit 1401 acquires an image containing the tracking target as the template image. The template image may contain a plurality of objects in the same category. The search range image acquisition unit 1402 acquires the image to be subjected to tracking target search. More specifically, the feature quantity of the specific object to be the tracking target can be acquired from this image. For example, the template image acquisition unit 1401 selects any frame from a series of sequence images, and the search range image acquisition unit 1402 selects another frame not selected by the template image acquisition unit 1401 among the sequence images.
The tracking target estimation unit 1403 estimates the position of the tracking target in the search range image. The tracking target estimation unit 1403 estimates the position of the tracking target in the search range image based on the template image acquired by the template image acquisition unit 1401 and the search range image acquired by the search range image acquisition unit 1402.
The loss calculation unit 1404 calculates a loss based on the tracking result acquired by the tracking target estimation unit 1403 and the position of the tracking target in the search range image acquired by the GT acquisition unit 1400. The loss becomes smaller as the estimation result becomes closer to the teacher data. The loss calculation unit 1404 acquires the correct answer of the position of the tracking target in the search range image based on the GT information acquired by the GT acquisition unit 1400.
The parameter update unit 1405 updates CNN parameters based on the loss acquired by the loss calculation unit 1404. Herein, the parameter update unit 1405 updates the parameters so that loss values converge. When a sum total of the loss values converges or when a loss value becomes smaller than a predetermined value, the parameter update unit 1405 updates a parameter set and ends the learning.
The parameter storage unit 1406 stores the CNN parameters updated by the parameter update unit 1405, in the storage unit 206 as learned parameters.
The flow of the learning processing will be described below.
In S1502, the template image acquisition unit 1401 clips the region to be used as the template from the template image and then resizes the region to a predetermined size. The size of the region to be clipped is determined, for example, as a constant multiple of the BB size based on the BB of the tracking target.
In S1503, the tracking target estimation unit 1403 inputs the template image generated in S1502 to the learning model (CNN) and then acquires the CNN feature quantity of the template.
In S1504, the search range image acquisition unit 1402 acquires the search range image. The search range is acquired as a partial image containing the tracking target, based on the position and size of the tracking target object.
In S1505, the search range image acquisition unit 1402 clips the search range region from the search range image and then resizes the search range region. The size of the search range is determined to be a constant multiple of the BB size of the tracking target. The search range image acquisition unit 1402 resizes the search range region at the magnification used to resize the template in S1502 (so that the size of the tracking target after resizing of the template approximately coincides with its size after resizing of the search range).
In S1506, the tracking target estimation unit 1403 inputs the search range image generated in S1505 to the learning model (CNN) and then acquires the CNN feature quantity of the search range.
In S1507, the tracking target estimation unit 1403 estimates the position of the tracking target in the search range image. The tracking target estimation unit 1403 calculates the cross correlation indicating the similarity between the CNN feature of the template acquired in S1503 and the CNN feature of the search range acquired in S1506, and then outputs the cross correlation as a map. The tracking target estimation unit 1403 estimates, as the position of the tracking target, a position having a cross-correlation value larger than or equal to a threshold value.
In S1508, the loss calculation unit 1404 calculates a loss related to the inferred position of the tracking target and a loss related to the inferred size of the tracking target. With regard to the loss related to the position, the loss calculation unit 1404 calculates the loss so as to advance the learning so that the cross-correlation value at the tracking target position becomes large. The GT acquisition unit 1400 acquires the BB of the tracking target in the template image acquired by the template image acquisition unit 1401 and the BB of the tracking target in the search range image acquired by the search range image acquisition unit 1402.
The loss function can be represented by Formula (1-2), where Cin denotes the map 1701 acquired in the processing in S1507, and Cgt denotes the GT map 1704. Formula (1-2) represents the mean square of a difference between the maps Cin and Cgt on a pixel basis. The loss decreases when the tracking target is appropriately estimated, and increases when a non-tracking target is estimated to be a tracking target or when a tracking target is estimated to be a non-tracking target. [Equation 2]
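From this description, Formula (1-2) can be written as the following pixel-wise mean squared error, where N denotes the number of pixels in the map (the explicit normalization by N is an assumption):

$$
\mathrm{Loss}_{C} = \frac{1}{N} \sum_{i=1}^{N} \left( C_{\mathrm{in}}(i) - C_{\mathrm{gt}}(i) \right)^{2} \tag{1-2}
$$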
Likewise, the losses related to the size are calculated by Formulas (1-3) and (1-4). [Equations 3]
LossW and LossH denote the loss related to the width and the loss related to the height of the estimated tracking target, respectively. Wgt and Hgt denote the values of the width and the height of the tracking target embedded at the position of the tracking target, respectively. Calculating the losses by using Formulas (1-3) and (1-4) advances the learning so that Win and Hin come to indicate the width and the height of the tracking target, respectively, at the position of the tracking target. Summing all of the losses yields Formula (1-5).
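By analogy with Formula (1-2), the size losses and the total loss described above can be written as follows; whether the squared errors are averaged over the whole map or restricted to the tracking target position is not specified above, so the whole-map form shown here is an assumption:

$$
\mathrm{Loss}_{W} = \frac{1}{N} \sum_{i=1}^{N} \left( W_{\mathrm{in}}(i) - W_{\mathrm{gt}}(i) \right)^{2} \tag{1-3}
$$

$$
\mathrm{Loss}_{H} = \frac{1}{N} \sum_{i=1}^{N} \left( H_{\mathrm{in}}(i) - H_{\mathrm{gt}}(i) \right)^{2} \tag{1-4}
$$

$$
\mathrm{Loss} = \mathrm{Loss}_{C} + \mathrm{Loss}_{W} + \mathrm{Loss}_{H} \tag{1-5}
$$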
Although, in this case, the loss is described as the mean squared error (hereinafter referred to as MSE), the loss is not limited to MSE. The loss may be Smooth-L1 or the like. The calculation formula for the loss is not limited. Further, a loss function for the position and a loss function for the size may be different.
In S1509, the parameter update unit 1405 (learning unit) updates the CNN parameters based on the loss calculated in S1508. The parameters are updated based on backpropagation by using stochastic gradient descent (SGD) with Momentum or the like. Outputs of the loss functions for one image have been described above. In the actual learning, however, loss values are calculated by Formula (1-2) with respect to scores estimated for a plurality of various images. The parameter update unit 1405 updates interlayer connection weighting factors of the learning model so that the loss values for the plurality of images each becomes smaller than a predetermined threshold value.
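A minimal sketch of the update in S1509 follows, assuming a PyTorch-style setup and SGD with Momentum; the stand-in model and loss function are hypothetical placeholders, with loss_fn standing for the total loss of Formula (1-5).

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the CNN being trained; the embodiment's model takes a
# template and a search-range image and outputs score/size maps.
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
loss_fn = nn.MSELoss()  # stand-in for the total loss of Formula (1-5)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

def training_step(search_batch, gt_maps):
    """One update of S1509: backpropagation with SGD with Momentum."""
    optimizer.zero_grad()
    predicted_maps = model(search_batch)      # stand-in for S1503-S1507
    loss = loss_fn(predicted_maps, gt_maps)   # S1508: loss calculation
    loss.backward()                           # gradients by backpropagation
    optimizer.step()                          # S1509: parameter update
    return loss.item()
```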
In S1510, the parameter storage unit 1406 stores the CNN parameters updated in S1509 in the storage unit 206. In the inference step, an inference is made by using the parameters stored in S1510 to enable correct tracking of the tracking target.
In S1511, the parameter update unit 1405 determines whether to end the learning. When the loss value acquired by Formula (1-2) becomes smaller than the predetermined threshold value, the parameter update unit 1405 determines that the learning is to be ended.
The present exemplary embodiment is characterized by tracking the tracking target and an object similar to the tracking target at the same time. The following describes, using a specific example, how simultaneously tracking an object similar to the tracking target reduces the possibility of erroneously tracking the similar object.
First, a case where only the tracking target 804 is being tracked will be described below. In this case, the object 804 being correctly tracked at the time t = 0 is blocked by an object 808 at the time t = 1. When the blocking occurs, a feature quantity of the object 804 with a low object likelihood is highly likely to be acquired because of the blocking, whereas a feature quantity with a high object likelihood is acquired for the object 808. Thus, at the time t = 1, the object 808 is highly likely to be recognized as the tracking target, and erroneous tracking of the object 808 as the tracking target starts.
A case where not only the tracking target 804 but also the similar object 805 is tracked at the same time will be described below. At the time t = 1, there are two past tracking target candidates, the objects 804 and 805. At the time t = 1, only an object 808 is acquired as a new tracking target candidate since an object 809 is being blocked. At this time, when the similarity between the past candidate 804 and the object 808 is compared with the similarity between the past candidate 805 and the object 808, the similarity between the objects 805 and 808 is higher than the similarity between the objects 804 and 808. This is because the CNN feature correlated with each candidate has been learned so as to distinguish between the objects, and the position and the size of the BB change only moderately over time. Thus, the current candidate 808 is correlated not with the past candidate 804 but with the past candidate 805, and the latest feature quantity of the object 805 is updated to the feature quantity of the object 808. For the object 804, which is not detected at the time t = 1, the feature quantity acquired at the time t = 0 is retained. Then, at the time t = 2, the similarities between candidate objects are calculated. At the time t = 2, the objects 804 and 808 are past candidates, while objects 811 and 812 are acquired as two new candidates. Since these two candidate objects are not blocked, desirable feature quantities can be acquired for them. After the similarity calculation, high similarities are obtained between the objects 808 and 811 and between the objects 804 and 812, and low similarities are obtained between the objects 808 and 812 and between the objects 804 and 811. Thus, the object 804, which is the tracking target, is correlated with the object 812, enabling correct tracking of the tracking target.
In a modification 1-1, the weight W2 for the feature quantity in Formula (1-1) according to the first exemplary embodiment is sequentially updated by using the feature quantities of the tracking target and a similar object acquired in time series.
An example is described below. [Equation 4]
ftarget denotes the feature quantity of a tracking target acquired at each time, and fdistractor denotes the feature quantity of a similar object acquired at each time.
Updating the weight by using the features of the tracking target and the similar object in this way allows the similarity to be calculated with a larger weight applied to feature dimensions that make it easier to distinguish between the tracking target and the similar object. This makes it easier to distinguish between the tracking target and the similar object even when their features are close to each other in the feature space.
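One plausible instantiation of such a dimension-wise update (not necessarily the form of the original formula) replaces the scalar W2 with per-dimension weights that emphasize the dimensions in which the tracking target and the similar object differ, for example:

$$
W_{2}^{(d)} \leftarrow \frac{\left| f_{\mathrm{target}}^{(d)} - f_{\mathrm{distractor}}^{(d)} \right|}{\sum_{d'} \left| f_{\mathrm{target}}^{(d')} - f_{\mathrm{distractor}}^{(d')} \right|}
$$

so that the feature term of Formula (1-1) becomes a weighted sum over feature dimensions.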
In a modification 1-2, a transformation used to obtain the similarity between feature quantities in Formula (1-1) according to the first exemplary embodiment is learned in advance by metric learning. When F denotes a function that transforms the feature quantity, Formula (1-1) is represented as Formula (1-7). [Equation 5]
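A plausible form of Formula (1-7), assuming that the learned transform F is applied to the feature quantities before taking the distance used in Formula (1-1), is:

$$
L(c_1, c_2) = -\,W_1 \left\lVert \mathrm{BB}_{c_1} - \mathrm{BB}_{c_2} \right\rVert^{2} - W_2 \left\lVert F(f_{c_1}) - F(f_{c_2}) \right\rVert^{2}
$$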
The transform F is implemented as a neural network with one or more layers and can be trained in advance by using the triplet loss or the like. Training the transform F with the triplet loss allows it to learn a transformation in which the distance is short when the past and current objects are identical and long when they are different objects. A learning method using the triplet loss is described in detail in “Wang, Learning Fine-grained Image Similarity with Deep Ranking, In: CVPR2014”.
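A minimal sketch of pre-training such a transform F with the triplet loss, using the standard PyTorch API, is shown below; the two-layer architecture, feature dimensions, and margin are assumptions. The anchor and positive features are taken from the same object at different times, and the negative feature from a different, similar-looking object.

```python
import torch
import torch.nn as nn

# Assumed small transform F; dimensions and depth are illustrative.
F_transform = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 128))
criterion = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.SGD(F_transform.parameters(), lr=1e-3, momentum=0.9)

def metric_learning_step(anchor_feat, positive_feat, negative_feat):
    """One triplet-loss update: pull identical objects together, push different ones apart."""
    optimizer.zero_grad()
    loss = criterion(F_transform(anchor_feat),
                     F_transform(positive_feat),
                     F_transform(negative_feat))
    loss.backward()
    optimizer.step()
    return loss.item()
```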
A second exemplary embodiment further performs blocking determination processing in the tracking target identification processing in S306 of the first exemplary embodiment.
The configuration is basically similar to that according to the first exemplary embodiment.
Now, processing performed by the information processing apparatus 1′ according to the second exemplary embodiment will be described. The processing flow according to the present exemplary embodiment basically corresponds to that of the first exemplary embodiment, with the blocking determination processing added.
In S1002, the blocking determination unit 207 determines the presence or absence of a blocking region where the candidate object is blocked, based on the position of the candidate object in the current processing target image (second image). More specifically, the blocking determination unit 207 performs the blocking determination for each candidate object in the current image. The blocking determination processing in S1002 will be described in more detail below.
ps denotes the position of the candidate, and po denotes the position of the occluder. α is an empirically-set value.
In S703, the tracking unit 205 identifies the correlation between the candidate object in the first image and the candidate object in the second image based on the result of the blocking determination. More specifically, the tracking unit 205 identifies the position of the tracking target object in the second image. When the candidate object identified as the last tracking target object is identified in the current image in S702, the tracking unit 205 identifies the position of the tracking target in the current image. When the last tracking target is not identified from the candidate objects in the current image in S702, the blocking determination is performed in S1002. When the tracking target is determined to be blocked in the current image, the occluder is identified, and the position of the tracking target is updated based on Formula (2-1), while the feature quantity of the tracking target is not updated. In S704, the storage unit 206 stores the position and the feature quantity of the tracking target identified by the tracking unit 205. When blocking occurs, the above-described processing updates the position of the tracking target while retaining its feature quantity, which in some cases enables tracking to be restarted after the blocking is resolved.
A modification 2-1 performs the blocking determination by using a neural network. An example of blocking determination by a neural network is “Zhou, Bi-box Regression for Pedestrian Detection and Occlusion Estimation, In: ECCV2018”. In this example, in S1002, the tracking unit 205 estimates the BB of an object and simultaneously estimates the non-blocked (viewable) region of the object region. Then, when the ratio of the blocked region to the object region exceeds a predetermined threshold value, the blocking determination unit 207 determines that blocking has occurred.
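A minimal sketch of the threshold check described above, given an estimated object-region area and an estimated visible (non-blocked) area; how these areas are estimated follows the cited method and is not reproduced here, and the threshold value is an assumption.

```python
def is_blocked(object_area, visible_area, threshold=0.5):
    """Return True when the ratio of the blocked region to the object region
    exceeds a predetermined threshold (threshold value is an assumption)."""
    blocked_ratio = 1.0 - visible_area / object_area
    return blocked_ratio > threshold
```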
In a third exemplary embodiment, with respect to a tracking method based on online learning, a plurality of candidate objects is simultaneously tracked to stably track each of a plurality of similar objects. The hardware configuration is similar to that according to the first exemplary embodiment.
Processing performed by the information processing apparatus 3 according to the present exemplary embodiment will be described.
Such a tracking method based on online learning can reduce erroneous tracking by simultaneously tracking a plurality of candidates in a similar way to the first exemplary embodiment.
In a fourth exemplary embodiment, a case where not one tracking target but a plurality of tracking targets is set is described. Even when a plurality of similar objects is tracked, simultaneously tracking candidate objects detected in the past enables stable tracking even if a tracking target is once lost. The hardware configuration is similar to that according to the first exemplary embodiment. An information processing apparatus according to the present exemplary embodiment has a functional configuration similar to that of the information processing apparatus 1 according to the first exemplary embodiment, except for a difference in processing between the tracking target determination unit 202 and the tracking unit 205. The tracking target determination unit 202 determines a plurality of objects as tracking targets. The tracking target determination unit 202 determines the tracking targets by a method similar to that according to the first exemplary embodiment. All objects included in a certain image may be acquired as the tracking targets. The tracking unit 205 tracks detected objects with regard to a plurality of tracking targets. More specifically, the tracking unit 205 retains the CNN features of the plurality of candidate objects and performs correlation by using the similarity between the candidate objects at the times t and t + 1.
Processing performed by the information processing apparatus 1 according to the present exemplary embodiment will be described. The processing flow according to the present exemplary embodiment basically corresponds to that of the first exemplary embodiment.
In a case where two objects are detected at the time t and one object is detected at the time t + 1, the object at the time t + 1 is assumed to be identical to whichever of the two objects acquired at the time t has the higher similarity. Correlating objects having a high similarity in this way reduces the possibility of erroneous tracking. However, there may arise a case where an object detected at the time t is not detected at the time t + 1 because of blocking. In this case, if at least one candidate object other than the tracking target object exists at the time t + 1, erroneous tracking of a candidate object at a close position may be started. Thus, in S306, the CNN features of a plurality of objects serving as candidate objects may be retained, and the similarities with the retained feature quantities of the candidate objects may be calculated at the time of the similarity calculation. While the tracking target object is blocked, the correlation cannot be identified, but tracking can be restarted when the blocking is resolved.
The present invention is also implemented by performing the following processing. More specifically, software (program) for implementing the functions of the above-described exemplary embodiments is supplied to a system or an apparatus via a data communication network or various types of storage media, and a computer (CPU or micro processing unit (MPU)) of the system or the apparatus reads and executes the program. The program may be provided by being recorded in a computer-readable recording medium.
The present invention is not limited to the above-described exemplary embodiments and various changes and modifications can be made without departing from the spirit and scope of the present invention. Therefore, the following claims are appended to apprise the public of the scope of the present invention.
According to the present invention, a specific object can be tracked.
Embodiment(s) of the present invention can also be realized by a computer of a system or apparatus that reads out and executes computer executable instructions (e.g., one or more programs) recorded on a storage medium (which may also be referred to more fully as a ‘non-transitory computer-readable storage medium’) to perform the functions of one or more of the above-described embodiment(s) and/or that includes one or more circuits (e.g., application specific integrated circuit (ASIC)) for performing the functions of one or more of the above-described embodiment(s), and by a method performed by the computer of the system or apparatus by, for example, reading out and executing the computer executable instructions from the storage medium to perform the functions of one or more of the above-described embodiment(s) and/or controlling the one or more circuits to perform the functions of one or more of the above-described embodiment(s). The computer may comprise one or more processors (e.g., central processing unit (CPU), micro processing unit (MPU)) and may include a network of separate computers or separate processors to read out and execute the computer executable instructions. The computer executable instructions may be provided to the computer, for example, from a network or the storage medium. The storage medium may include, for example, one or more of a hard disk, a random-access memory (RAM), a read only memory (ROM), a storage of distributed computing systems, an optical disk (such as a compact disc (CD), digital versatile disc (DVD), or Blu-ray Disc (BD)™), a flash memory device, a memory card, and the like.
While the present invention has been described with reference to exemplary embodiments, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.
Number | Date | Country | Kind |
---|---|---|---|
2020-123796 | Jul 2020 | JP | national |
This application is a Continuation of International Patent Application No. PCT/JP2021/024898, filed Jul. 1, 2021, which claims the benefit of Japanese Patent Application No. 2020-123796, filed Jul. 20, 2020, both of which are hereby incorporated by reference herein in their entirety.
Relation | Number | Date | Country
---|---|---|---
Parent | PCT/JP2021/024898 | Jul 2021 | WO
Child | 18155349 | | US