Re-identifying individuals in images can be a difficult task, because many images are not taken with sufficiently high resolution to use facial recognition software. Conventional methods of re-identification depend on a comparison of a first total image to a second total image. Comparing the two total images, however, requires compressing image data for each image by one or more orders of magnitude, resulting in a significant loss of data and resolution. As a result, conventional methods are error prone and may return false negatives due to, among other things, differing conditions between the images being compared, such as different lighting and a change in pose of the individual.
The present disclosure is directed to systems and methods for re-identifying objects in images, substantially as shown in and/or described in connection with at least one of the figures, as set forth more completely in the claims.
The following description contains specific information pertaining to implementations in the present disclosure. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.
Camera 105 may be a camera for capturing images, such as a security camera. In some implementations, camera 105 may be one camera, or may include two or more cameras positioned to capture images at different locations. For example, a first camera may be positioned to capture an image of individuals entering a building, such as an office building or a retail store, and a second camera may be positioned at a second location, such as near an elevator in the office building or near valuable merchandise in the retail store. In some implementations, camera 105 may capture image 101 and image 102 and transmit image 101 and image 102 to computing device 110.
Executable code 140 includes one or more software modules stored in memory 130 for execution by processor 120 of computing device 110. As shown in
Feature vector module 143 is a software module for execution by processor 120 to extract one or more feature vectors from each patch of an image. In some implementations, each feature vector may include one or more metrics of the patch from which the feature vector is extracted. For example, feature vector module 143 may extract color information, texture information, and other descriptors from each patch of image 101 and each patch of image 102.
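By way of illustration only, the following is a minimal sketch of the kind of per-patch feature extraction feature vector module 143 may perform, assuming non-overlapping rectangular patches, per-channel color histograms as the color descriptor, and a gradient-orientation histogram as the texture descriptor. The function names, patch size, and descriptor choices are illustrative assumptions and not limitations of the present disclosure.

```python
import numpy as np

def extract_patch_feature(patch):
    """Concatenate a color descriptor and a texture descriptor for one patch.

    `patch` is an H x W x 3 uint8 array (RGB). The specific descriptors
    (8-bin per-channel color histograms and an 8-bin gradient-orientation
    histogram) are illustrative choices only.
    """
    # Color descriptor: per-channel intensity histograms.
    color = np.concatenate([
        np.histogram(patch[..., c], bins=8, range=(0, 255), density=True)[0]
        for c in range(3)
    ])

    # Texture descriptor: histogram of gradient orientations on the gray image,
    # weighted by gradient magnitude.
    gray = patch.mean(axis=2)
    gy, gx = np.gradient(gray)
    orientation = np.arctan2(gy, gx)
    magnitude = np.hypot(gx, gy)
    texture, _ = np.histogram(orientation, bins=8, range=(-np.pi, np.pi),
                              weights=magnitude)
    texture = texture / (texture.sum() + 1e-9)

    return np.concatenate([color, texture])

def extract_patch_features(image, patch_size=(16, 16)):
    """Split `image` into a grid of non-overlapping patches and return the
    ordered list of per-patch feature vectors (one vector per location k)."""
    ph, pw = patch_size
    rows, cols = image.shape[0] // ph, image.shape[1] // pw
    return [extract_patch_feature(image[r*ph:(r+1)*ph, c*pw:(c+1)*pw])
            for r in range(rows) for c in range(cols)]
```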
Patch repositioning module 145 is a software module for execution by processor 120 to displace one or more patches of image 101. In some implementations, when image 101 and image 102 depict the same object, the position and/or location of the object may be different in image 102 than it is in image 101. For example, image 101 may depict an individual from a substantially frontal view point, taken as the individual faced camera 105. Image 102 may depict the same individual from an angle, such as 15 degrees, 30 degrees, etc., to the side of the individual, measured horizontally from the direction the individual is facing at the time image 102 is taken.
Image comparison module 147 is a software module for execution by processor 120 to determine whether image 101 and image 102 depict the same object. In some implementations, image comparison module 147 may compare the aggregate image measure for image 101 with the aggregate image measure for image 102. Based on the comparison, image comparison module 147 may determine that image 102 depicts the same object as image 101 if the two aggregate image measures are similar, such as when there is no more than a 20% variance, a 15% variance, a 10% variance, etc., between the two aggregate image measures.
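As an illustrative sketch of this comparison step only, the following shows one way image comparison module 147 might test whether two aggregate image measures fall within a relative variance threshold such as one of the example values above (e.g., 10%); the function name and the handling of near-zero measures are assumptions.

```python
def same_object(measure_a, measure_b, max_variance=0.10):
    """Return True when two aggregate image measures are within a relative
    variance threshold (e.g. 10%), per the example thresholds above."""
    denom = max(abs(measure_a), abs(measure_b), 1e-9)  # guard against zero
    return abs(measure_a - measure_b) / denom <= max_variance
```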
At 903, executable code 140 extracts a first plurality of feature vectors from each of the first plurality of patches and a second plurality of feature vectors from each of the second plurality of patches. From each patch location $k$, feature vector module 143 may extract color descriptors and texture descriptors, such as color histograms, gradient histograms, etc. In some implementations, feature vector module 143 may concatenate the extracted color descriptors and texture descriptors into the patch feature vector $p_i^k$. Executable code 140 may represent the first image $i$ as an ordered set of patch features $X_i = \{p_i^1, p_i^2, \ldots, p_i^K\}$ and the second image $j$ as an ordered set of patch features $X_j = \{p_j^1, p_j^2, \ldots, p_j^K\}$, where $K$ is the number of patches. In some implementations, executable code 140 may learn a dissimilarity function for feature vectors extracted from patches. Executable code 140 may define the dissimilarity measure as:
$\Phi(p_i^k, p_j^k) = (p_i^k - p_j^k)^T M^{(k)} (p_i^k - p_j^k)$, (2)
where $p_i^k$ and $p_j^k$ are the feature vectors extracted from the patch at location $k$ in the first image $i$ and the corresponding location $k$ in the second image $j$, and $M^{(k)}$ is the metric for location $k$. In some implementations, a single metric $M$ could be learned for all patch locations. However, regions with statistically different amounts of background noise should have different metrics. For example, when camera 105 is used to capture images of individuals, patches close to the head of an individual may contain more background noise than patches close to the torso of the individual. In some implementations, recognition performance may be a function of available training data, which may limit the number of patch metrics that executable code 140 can learn efficiently.
To learn $M^{(k)}$ on the first image $i$ and the second image $j$, executable code 140 may introduce the space of pair-wise differences, $p_{ij}^k = p_i^k - p_j^k$, and partition the training data into $p_{ij}^{k+}$ when $i$ and $j$ are images containing the same object, and $p_{ij}^{k-}$ otherwise. Note that, for learning, executable code 140 may use only differences between patches from the same location $k$. Executable code 140 may assume a zero-mean Gaussian structure on the difference space and employ a log likelihood ratio test, resulting in:
$M^{(k)} = \Sigma_{k+}^{-1} - \Sigma_{k-}^{-1}$, (3)

where $\Sigma_{k+}$ and $\Sigma_{k-}$ are the covariance matrices of $p_{ij}^{k+}$ and $p_{ij}^{k-}$, respectively:

$\Sigma_{k+} = \sum (p_{ij}^{k+})(p_{ij}^{k+})^T$, (4)

$\Sigma_{k-} = \sum (p_{ij}^{k-})(p_{ij}^{k-})^T$. (5)
To compute the dissimilarity between the first image $i$ and the second image $j$, executable code 140 may combine the patch dissimilarity measures by summing over all patches, $\sum_{k=1}^{K} \Phi(p_i^k, p_j^k)$, which may equivalently be represented using a block diagonal matrix whose blocks are the metrics $M^{(k)}$, where all $M^{(k)}$ are learned independently or through spatial clusters. This approach may be referred to as patch-based metric learning (PML).
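The following is a minimal sketch of PML under the definitions above: each per-location metric $M^{(k)}$ is computed from the covariances of same-object and different-object difference vectors per equations (3)-(5), and the image dissimilarity is the sum of per-patch terms from equation (2). The small ridge regularization used to keep the covariances invertible, and the normalization by the number of pairs, are implementation assumptions rather than requirements of the disclosure.

```python
import numpy as np

def learn_patch_metric(pos_diffs, neg_diffs, reg=1e-3):
    """Learn M^(k) = inv(Sigma_k+) - inv(Sigma_k-) for one patch location k.

    pos_diffs / neg_diffs: (N, d) arrays of difference vectors p_ij^k from
    same-object / different-object training pairs at location k. A small
    ridge term keeps the covariance matrices invertible.
    """
    d = pos_diffs.shape[1]
    sigma_pos = pos_diffs.T @ pos_diffs / len(pos_diffs) + reg * np.eye(d)
    sigma_neg = neg_diffs.T @ neg_diffs / len(neg_diffs) + reg * np.eye(d)
    return np.linalg.inv(sigma_pos) - np.linalg.inv(sigma_neg)

def patch_dissimilarity(p_ik, p_jk, M_k):
    """Phi(p_i^k, p_j^k) = (p_i^k - p_j^k)^T M^(k) (p_i^k - p_j^k), eq. (2)."""
    diff = p_ik - p_jk
    return float(diff @ M_k @ diff)

def pml_image_dissimilarity(patches_i, patches_j, metrics):
    """Rigid PML: sum the per-location dissimilarities over all K patches."""
    return sum(patch_dissimilarity(p_i, p_j, M_k)
               for p_i, p_j, M_k in zip(patches_i, patches_j, metrics))
```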
At 904, executable code 140 applies a dimensionality reduction to the first plurality of feature vectors and the second plurality of feature vectors. For example, executable code 140 may apply a principal component analysis or another appropriate compression method. Method 900 continues at 905, where executable code 140 repositions each patch of the first plurality of patches based on a deformation cost for each patch of the first plurality of patches. In some implementations, executable code 140 may learn a deformation cost for each of the first plurality of patches. Pose changes and different camera viewpoints make re-identification more difficult. To overcome this issue, executable code 140 may deform the first image by repositioning patches in the first image when matching to the second image. In some implementations, executable code 140 may approximate continuous non-affine warps by translating two-dimensional (2D) templates. In some implementations, patch repositioning module 145 may use a spring model to limit the displacement of patches in the first image. The deformable dissimilarity measure for matching the patch at location $k$ in the first image with the second image may be defined as:
$\Psi(p_i^k, j) = \min_{l} \left( \Phi(p_i^k, p_j^l) + \alpha_k \Delta(k, l) \right)$,

where patch feature $p_j^l$ is extracted from the second image $j$ at location $l$. The appearance term $\Phi(p_i^k, p_j^l)$ may compute the feature dissimilarity between patches and may be learned in the same manner as learning $M^{(k)}$ described above. The deformation cost $\alpha_k \Delta(k, l)$ may refer to a spring model that controls the relative placement of patches $k$ and $l$. $\Delta(k, l)$ is the squared distance between the patch locations. $\alpha_k$ encodes the rigidity of the spring: $\alpha_k = \infty$ corresponds to a rigid model, while $\alpha_k = 0$ allows a patch to change its location freely.
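For illustration, the sketch below evaluates the deformable dissimilarity measure for one patch under the assumption that the lowest-cost placement over candidate locations $l$ is retained, which is the usual reading of a spring model, and that patch locations are (row, column) grid coordinates; these details and the function names are assumptions rather than limitations.

```python
import numpy as np

def deformable_patch_dissimilarity(p_ik, loc_k, patches_j, locations_j,
                                   M_k, alpha_k):
    """Psi(p_i^k, j): match patch k of image i against every location l of
    image j, trading appearance dissimilarity against a spring penalty.

    Keeps the best (minimum-cost) placement; alpha_k -> infinity recovers
    the rigid match, alpha_k = 0 lets the patch move freely.
    """
    best = np.inf
    for p_jl, loc_l in zip(patches_j, locations_j):
        diff = p_ik - p_jl
        appearance = float(diff @ M_k @ diff)            # Phi(p_i^k, p_j^l)
        deformation = alpha_k * float(np.sum(
            (np.asarray(loc_k) - np.asarray(loc_l)) ** 2))  # squared-distance spring
        best = min(best, appearance + deformation)
    return best
```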
Executable code 140 may combine the deformable dissimilarity measures $\Psi(p_i^k, j)$ into a unified dissimilarity measure:

$\Psi(i, j) = \mathbf{w}^T \psi_{ij}$,

where $\mathbf{w}$ is a vector of weights and $\psi_{ij}$ corresponds to a vector of the patch dissimilarity measures $\Psi(p_i^1, j), \ldots, \Psi(p_i^K, j)$.
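A minimal sketch of the unified measure follows, assuming the weighted-sum form $\Psi(i, j) = \mathbf{w}^T \psi_{ij}$, reusing the deformable_patch_dissimilarity helper from the previous sketch, and assuming the first and second images share the same grid of patch locations.

```python
import numpy as np

def image_dissimilarity(patches_i, grid_locations, patches_j, metrics,
                        alphas, w):
    """Build psi_ij from the per-patch deformable measures and apply the
    learned weights: Psi(i, j) = w^T psi_ij.

    `grid_locations` holds the (row, col) coordinates of the common patch
    grid; `metrics[k]` and `alphas[k]` are M^(k) and alpha_k for location k.
    """
    psi_ij = np.array([
        deformable_patch_dissimilarity(p_ik, grid_locations[k], patches_j,
                                       grid_locations, metrics[k], alphas[k])
        for k, p_ik in enumerate(patches_i)
    ])
    return float(np.dot(w, psi_ij))
```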
To learn $\alpha_k$ and $\mathbf{w}$, patch repositioning module 145 may define the optimization problem as a relative distance comparison of triplets $\{i, j, z\}$ such that $\Psi(i, z) > \Psi(i, j)$ for all $i, j, z$, where $i$ and $j$ correspond to images containing the same person, and $i$ and $z$ are images containing different people. In some implementations, patch repositioning module 145 may use a limited number of unique spring constants $\alpha_k$ and apply a two-step optimization. First, patch repositioning module 145 may optimize $\alpha_k$ with $\mathbf{w} = 1$ by performing an exhaustive grid search while maximizing the Rank-1 recognition rate. Second, patch repositioning module 145 may fix $\alpha_k$ and determine the best $\mathbf{w}$ using structural support vector machines (SVMs). This approach may be referred to as deformable patch metric learning (DPML). In some implementations, patch repositioning module 145 may simplify equation (8) by restricting the number of unique spring constants. Two parameters, $\alpha_1$ and $\alpha_2$, may be assigned to patch locations obtained by hierarchical clustering with the number of clusters $m = 2$, as shown in
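The first step of the two-step optimization may be sketched as an exhaustive grid search over two spring constants, scored by Rank-1 recognition rate with $\mathbf{w}$ fixed to all ones. The probe/gallery bookkeeping, the candidate grid of $\alpha$ values, and the reuse of image_dissimilarity from the sketch above are illustrative assumptions; the structural SVM step for learning $\mathbf{w}$ is not shown.

```python
import itertools
import numpy as np

def rank1_rate(probe_feats, grid_locations, gallery, metrics, alphas, w):
    """Fraction of probes whose true match is ranked first by Psi(i, j).

    `gallery` is a list of (patches_j, identity) tuples; the probe at index
    idx is assumed to have true identity idx.
    """
    hits = 0
    for idx, patches_i in enumerate(probe_feats):
        scores = [image_dissimilarity(patches_i, grid_locations, patches_j,
                                      metrics, alphas, w)
                  for patches_j, _ in gallery]
        best = int(np.argmin(scores))
        hits += int(gallery[best][1] == idx)
    return hits / len(probe_feats)

def grid_search_alphas(probe_feats, grid_locations, gallery, metrics,
                       clusters, candidates=(0.0, 0.1, 1.0, 10.0)):
    """Step one of the two-step optimization: with w fixed to all ones, try
    every (alpha_1, alpha_2) pair on a small grid and keep the pair with the
    best Rank-1 rate. `clusters[k]` maps patch location k to cluster 0 or 1."""
    K = len(metrics)
    w = np.ones(K)
    best_pair, best_rate = None, -1.0
    for a1, a2 in itertools.product(candidates, repeat=2):
        alphas = [a1 if clusters[k] == 0 else a2 for k in range(K)]
        rate = rank1_rate(probe_feats, grid_locations, gallery, metrics,
                          alphas, w)
        if rate > best_rate:
            best_pair, best_rate = (a1, a2), rate
    return best_pair, best_rate
```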
At 906, executable code 140 determines a plurality of patch dissimilarity measures based on a plurality of patch metrics, each patch dissimilarity measure being a dissimilarity between corresponding patches of the first plurality of patches and the second plurality of patches. In some implementations, executable code 140 may learn a metric for each patch location in the grid of patches. In some implementations, the metric learning may be based on a plurality of training images, such as the VIPeR dataset, the i-LIDS dataset, the CUHK01 dataset, etc. The VIPeR dataset is one of the most popular person re-identification datasets. It contains 632 image pairs of pedestrians captured by two outdoor cameras. VIPeR images contain large variations in lighting conditions, background, viewpoint, and image quality. The i-LIDS dataset has 119 individuals with 476 images. This dataset is very challenging because it includes many occlusions. Often only the top part of the individual is visible, and usually there is a significant scale or viewpoint change. The CUHK01 dataset contains 971 persons captured with two cameras. For each person, two images from each camera are provided. The images in this dataset are of better quality and higher resolution than the images in the VIPeR dataset and the i-LIDS dataset.
At 907, executable code 140 computes an image dissimilarity between the first image and the second image based on an aggregate of the plurality of patch dissimilarity measures. In some implementations, the image dissimilarity may be calculated between the first image and a plurality of candidate images. The image dissimilarity may be calculated by adding together the plurality of patch dissimilarity measures between the two images. Method 900 continues at 908, where executable code 140 evaluates the image dissimilarity to determine a probability of whether the first object and the second object are the same. In some implementations, the image dissimilarity may be used for ranking image candidates to determine whether two images contain the same object. Image comparison module 147 may determine whether the first image and the second image depict the same object based on the image dissimilarity.
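As an illustrative sketch of the ranking described at 907 and 908, candidate images may be ordered by their unified dissimilarity to the probe image, reusing image_dissimilarity from the earlier sketch; the candidate naming scheme is an assumption.

```python
def rank_candidates(probe_patches, grid_locations, candidates, metrics,
                    alphas, w):
    """Order candidate images by unified dissimilarity to the probe image.

    `candidates` is an iterable of (name, patches_j) pairs; the candidate
    with the smallest Psi(i, j) is ranked first.
    """
    scored = [(image_dissimilarity(probe_patches, grid_locations, patches_j,
                                   metrics, alphas, w), name)
              for name, patches_j in candidates]
    return sorted(scored)  # ascending: most similar candidate first
```

The top-ranked candidate, or any candidate whose dissimilarity falls below a chosen threshold, may then be evaluated by image comparison module 147 to decide whether the two images depict the same object.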
From the above description, it is manifest that various techniques can be used for implementing the concepts described in the present application without departing from the scope of those concepts. Moreover, while the concepts have been described with specific reference to certain implementations, a person having ordinary skill in the art would recognize that changes can be made in form and detail without departing from the scope of those concepts. As such, the described implementations are to be considered in all respects as illustrative and not restrictive. It should also be understood that the present application is not limited to the particular implementations described above, but many rearrangements, modifications, and substitutions are possible without departing from the scope of the present disclosure.