Embodiments described herein generally relate to the field of image analysis, and more particularly, to subject-matching in image analysis.
Subject-matching in images is the process of determining whether individuals in separate images are the same person. This matching is based on the visual similarity of the individual in each image, along with any knowledge of the time or location at which the images are taken. Generally, an individual depicted in a target image, sometimes taken from a video sequence, is compared with a source image, or training set, of possible matching images, to find the most similar match.
Visual information, also known as visual features, may be used for the comparison. For example, color features collected over horizontal stripes in the target sequence images may be compared with the same horizontal stripes in the source sequence images. Such features are relatively robust to changes in orientation of the subject, because the color of a subject is typically fairly constant along a horizontal axis.
However, there is a need to improve the accuracy of subject-matching for a wider range of possible situations.
In accordance with one aspect, a method for subject-matching in images comprises detecting a first orientation of a first subject in a target image; comparing the first orientation with a second orientation of a second subject in a source image to obtain at least one orientation parameter; extracting at least one feature from the target image; computing a visual similarity score between the target image and the source image using the at least one feature, the visual similarity score being computed as a function of the at least one feature and the at least one orientation parameter; and determining a match between the first subject and the second subject in accordance with the visual similarity score.
In accordance with another aspect, a system comprises a memory having stored therein a program comprising at least one sequence of instructions for subject-matching in images, and at least one processor coupled to the memory. The processor is configured for executing the at least one sequence of instructions for detecting a first orientation of a first subject in a target image; comparing the first orientation with a second orientation of a second subject in a source image to obtain at least one orientation parameter; extracting at least one feature from the target image; computing a visual similarity score between the target image and the source image using the at least one feature, the visual similarity score being computed as a function of the at least one feature and the at least one orientation parameter; and determining a match between the first subject and the second subject in accordance with the visual similarity score.
In accordance with yet another aspect, a computer readable medium or media has stored thereon computer readable instructions executable by at least one processor. The instructions cause the processor to detect a first orientation of a first subject in a target image; compare the first orientation with a second orientation of a second subject in a source image to obtain at least one orientation parameter; extract at least one feature from the target image; compute a visual similarity score between the target image and the source image using the at least one feature, the visual similarity score being computed as a function of the at least one feature and the at least one orientation parameter; and determine a match between the first subject and the second subject in accordance with the visual similarity score.
In some example embodiments, computing a visual similarity score comprises applying weights to the features in accordance with the at least one orientation parameter to obtain the visual similarity score.
In some example embodiments, applying weights comprises applying a non-zero weight to feature types that are associated with the at least one orientation parameter.
In some example embodiments, extracting features comprises extracting features corresponding to feature types that are associated with the at least one orientation parameter.
In some example embodiments, the at least one orientation parameter corresponds to an orientation pair representative of the first orientation and the second orientation. An expected orientation pair may be associated with a higher weight than an unexpected orientation pair.
In some example embodiments, detecting orientation comprises characterizing orientation using at least two angles, the at least two angles comprising a first angle α, along a horizontal plane, between a facing direction of the first subject and a viewing axis of an image acquisition device having acquired the target image, and a second angle β, along a vertical plane, between the viewing axis of the image acquisition device and a horizontal axis.
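As an illustration of the two-angle characterization, the following Python sketch computes α and β from direction vectors. The function name `orientation_angles`, the coordinate convention (z up, x-y as the horizontal plane), and the representation of the facing direction and the viewing axis as vectors are assumptions made for this example only; the disclosure does not prescribe how the angles are obtained.

```python
import math

def orientation_angles(facing_xy, view_xyz):
    """Return (alpha, beta) in degrees.

    alpha: signed angle, in the horizontal plane, between the subject's
           facing direction and the horizontal projection of the viewing axis.
    beta:  angle, in a vertical plane, between the viewing axis and the
           horizontal.
    Assumed convention: z points up; vectors need not be normalized.
    """
    fx, fy = facing_xy
    vx, vy, vz = view_xyz
    # Signed angle between the two horizontal-plane vectors.
    dot = fx * vx + fy * vy
    cross = fx * vy - fy * vx
    alpha = math.degrees(math.atan2(cross, dot))
    # Tilt of the viewing axis relative to the horizontal plane.
    beta = math.degrees(math.atan2(abs(vz), math.hypot(vx, vy)))
    return alpha, beta
```

Under these assumptions, a subject whose facing direction is opposite to a level viewing axis yields α = 180° and β = 0°.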
In some example embodiments, applying weights comprises applying a first set of weights when the first orientation and the second orientation are the same, and applying a second set of weights when the first orientation and the second orientation are different. In some example embodiments, applying a second set of weights comprises selecting the second set of weights from a group of weights as a function of a degree of difference between the first orientation and the second orientation.
In some example embodiments, the weights are applied by feature type.
In some example embodiments, the feature types are grouped together as a function of a sensitivity to orientation changes from a target image to a source image.
In some example embodiments, at least one feature type comprises sub-features having varying weights assigned thereto.
In some example embodiments, applying weights to the features comprises applying at least one fixed weight independent of the at least one orientation parameter.
In some example embodiments, the weights are determined using a training model configured for at least one of: learning an orientation-dependent distance metric; associating feature types with orientation parameters; learning transformations of color and SIFT features from target image to source image for a given image acquisition device pair; and learning predictable changes in orientation from target image to source image for the given image acquisition device pair.
Many further features and combinations thereof concerning the present improvements will appear to those skilled in the art following a reading of the instant disclosure.
It will be noted that throughout the appended drawings, like features are identified by like reference numerals.
There is described herein a method, system, and computer readable medium for performing subject-matching in images. Subject-matching refers to finding a match for a subject, such as a human, in an image or a sequence of images, when compared with another image or sequence of images. The image under investigation will be referred to herein as a target image. The image to which the target image is compared will be referred to herein as a source image. The target image may be obtained using any one of various types of image acquisition devices such as a still camera, a video camera, etc. A series of target images may represent a video sequence taken from a same image acquisition device, or separate images taken from multiple image acquisition devices. The source image may be obtained using the same or a different image acquisition device as the target image. The source image may be a scanned image of a known subject used for the purpose of subject-matching. In such a case, there may be no information available regarding the image acquisition device having acquired the source image. The source image may form part of a database of video surveillance data that has been divided into sequences depicting single individuals for ease of comparison.
An embodiment for performing subject-matching is illustrated in the flowchart of
Referring back to
At 206, features are extracted from the target image. Generally, the features may be local features or global features. Examples of local features are those obtained using a scale-invariant feature transform (SIFT) algorithm. Local features are spatial, location-specific features which typically provide accurate results when there is no change in orientation between a target image and a source image. Examples of global features are color- or texture-based features obtained using, for example, a Red, Green, Blue (RGB), Hue, Saturation, Value (H,S,V), or luminance and chrominance (Y, Cb, Cr) color model. Global features are typically robust to changes in orientation. For example, a bounding box may be divided into a plurality of horizontal stripes. Each stripe is described by a histogram with multiple bins for each color component H, S, and V. In general, image features may capture different aspects of the image, such as color, texture, patterns, small image patches, etc. Other embodiments for feature extraction will be readily understood. Note that feature extraction, as per 206, may be performed earlier in the method, e.g. before orientation detection and generating of an orientation parameter 202, 204 or concurrently therewith. In some embodiments, a set of predetermined features are extracted from the target image regardless of the detected orientation. In these embodiments, the extraction of features may be performed at any time before computing a visual similarity score. In other embodiments, the features to be extracted are determined as a function of the orientation parameter. For example, color features may be extracted when the orientation parameter is “different”, and spatial features may be extracted when the orientation parameter is “same”. In these embodiments, features will be extracted after the orientation parameter has been generated at 204. In some embodiments, features from the source image are pre-extracted and stored with the source image. 
Features stored with the source image may be retrieved for the purpose of computing the visual similarity score. Alternatively, feature extraction from the source image may be performed concurrently with or subsequently to feature extraction from the target image.
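The horizontal-stripe color descriptor mentioned above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `stripe_histograms`, the default numbers of stripes and bins, and the input format (rows of (h, s, v) tuples already scaled to [0, 1]) are assumptions.

```python
def stripe_histograms(hsv_rows, n_stripes=6, n_bins=8):
    """Describe a bounding-box crop by per-stripe H, S and V histograms.

    hsv_rows: list of pixel rows, each a list of (h, s, v) tuples in [0, 1].
    Returns a flat list of length n_stripes * 3 * n_bins.
    """
    height = len(hsv_rows)
    features = []
    for k in range(n_stripes):
        # Row range covered by the k-th horizontal stripe.
        top = k * height // n_stripes
        bottom = (k + 1) * height // n_stripes
        pixels = [px for row in hsv_rows[top:bottom] for px in row]
        total = len(pixels) or 1
        for channel in range(3):  # 0 = H, 1 = S, 2 = V
            hist = [0.0] * n_bins
            for px in pixels:
                # Clamp so that a value of exactly 1.0 falls in the last bin.
                b = min(int(px[channel] * n_bins), n_bins - 1)
                hist[b] += 1.0
            # Normalize so stripes of different sizes are comparable.
            features.extend(count / total for count in hist)
    return features
```

Each per-stripe, per-channel histogram sums to one, so descriptors from crops of different sizes can be compared directly.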
At 208, a visual similarity score is computed using at least some of the features as extracted from both the target and source image, as a function of the orientation parameter. The orientation parameter is used to determine a best use of image features, i.e. which features should be considered given the difference in orientation between the target and the source image, so as to maximize the accuracy of the result. For example, color features may be used when the orientation parameter is “different” and spatial features may be used when the orientation parameter is “same”. In some embodiments, the visual similarity score will only use a given type of feature and discard other types of features, as a function of the orientation parameter. In other embodiments, the visual similarity score will be computed using different types of features, and each feature type will be assigned a weight as a function of the orientation parameter. For example, color features may be assigned a greater weight than spatial features when the orientation parameter is “different”, and spatial features may be assigned a greater weight when the orientation parameter is “same”. Feature types may be grouped as a function of a sensitivity to changes in orientation, or each feature type may be assigned its own weight. In some embodiments, each feature within a given feature type may be assigned its own weight.
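A minimal sketch of the orientation-dependent weighting described above is given below. The weight values, the dictionary `WEIGHTS`, and the function name `visual_similarity` are hypothetical and chosen only for illustration; the per-feature-type similarities are assumed to lie in [0, 1].

```python
# Hypothetical weights per orientation parameter; values are illustrative only.
WEIGHTS = {
    "same":      {"color": 0.3, "spatial": 0.7},
    "different": {"color": 0.8, "spatial": 0.2},
}

def visual_similarity(similarities, orientation_parameter, weights=WEIGHTS):
    """Combine per-feature-type similarities into a single score.

    similarities: mapping from feature type to a similarity in [0, 1].
    A weight of zero discards the corresponding feature type entirely.
    """
    w = weights[orientation_parameter]
    return sum(w[name] * similarities[name] for name in w)
```

With a color similarity of 0.9 and a spatial similarity of 0.2, the "different" weights favor the color evidence and give a higher score than the "same" weights.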
At 210, a match between subjects in the target image and the source image is determined in accordance with the visual similarity score as computed. In some embodiments, a threshold may be set for the similarity score, above which the subjects are considered to be a match. A similarity score found to be below the threshold would be indicative of no match. In other embodiments, a greatest visual similarity score (or smallest distance) obtained from comparing a target image to a plurality of source images is indicative of a match and all other scores are indicative of no match. Other information may also be used, either when computing the visual similarity score or when determining whether the subjects in the images match. For example, the target and source image acquisition devices may be positioned such that a subject passing from one to the other may result in an expected orientation parameter, and expected orientation parameters may be reflected in higher weights on certain features, compared to weights arising from unexpected orientation parameters. Time and/or location of image acquisition may also weigh into subject-matching, either as complementary information used to validate a matching result, or as additional parameters in the visual similarity score.
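The two decision rules described above (a threshold on the score, and the greatest score over a plurality of source images) can be combined in a short sketch. The function name `best_match` and the choice to treat a score equal to the threshold as no match are assumptions.

```python
def best_match(scores_by_source, threshold=None):
    """Return the identifier of the best-scoring source image, or None.

    scores_by_source: mapping from source identifier to similarity score.
    threshold: optional minimum score; the best candidate is rejected
               unless its score is strictly above this value (assumption).
    """
    if not scores_by_source:
        return None
    best = max(scores_by_source, key=scores_by_source.get)
    if threshold is not None and scores_by_source[best] <= threshold:
        return None
    return best
```

When distances are used instead of similarities, the same logic applies with `min` in place of `max` and the comparison reversed.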
Reference is now made to
The orientation unit 504 is configured for orientation detection. It may be triggered by a control signal 503. The control signal 503 may be commanded by a user via a graphical user interface on the system 400 or via an external device (not shown) operatively connected to the system 400. The connection between the system 400 and the external device may be wired or wireless. Control signal 503 may also be generated by another application 406n. The other application 406n may form part of the system 400 or be provided separately therefrom.
In some embodiments, the contribution of a subvector is weighted differently as a function of the orientation parameter. For example, given two subvector distances $d_1$ and $d_2$, a match may correspond to the smallest value of $\lambda_1^{t,s} d_1 + \lambda_2^{t,s} d_2$, where $\lambda_i^{t,s}$ is a parameter that depends on (a) the type of feature in the subvector, $i$ (e.g., $i=1$ or $i=2$), and (b) the orientations of the target and source subjects, $t,s$ (e.g., $(\alpha_t, \beta_t, \alpha_s, \beta_s)$). Using an example with four target/source orientation pairs, namely F/F, F/B, B/F, B/B, the similarity (or distance) may be computed as $\lambda_1^{F/F} d_1 + \lambda_2^{F/F} d_2$ if the target and source are F/F, and as $\lambda_1^{F/B} d_1 + \lambda_2^{F/B} d_2$ if the target and source are F/B. The weighting of the two feature types is thus set dynamically depending on the orientation of the target and source images.
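A worked numerical example may help to show the effect of the orientation-dependent weights. All values below are hypothetical and chosen only for illustration: the same pair of distances (a small first distance and a large second distance) is combined once with assumed F/F weights and once with assumed F/B weights.

```latex
% Hypothetical values, chosen only to illustrate the computation.
\begin{align*}
d_1 &= 0.2, \qquad d_2 = 0.9 \\
(\lambda_1^{F/F}, \lambda_2^{F/F}) &= (0.3,\ 0.7), \qquad
(\lambda_1^{F/B}, \lambda_2^{F/B}) = (0.8,\ 0.2) \\
\text{F/F:}\quad \lambda_1^{F/F} d_1 + \lambda_2^{F/F} d_2 &= 0.3 \cdot 0.2 + 0.7 \cdot 0.9 = 0.69 \\
\text{F/B:}\quad \lambda_1^{F/B} d_1 + \lambda_2^{F/B} d_2 &= 0.8 \cdot 0.2 + 0.2 \cdot 0.9 = 0.34
\end{align*}
```

With these assumed weights, the large second distance is discounted for the F/B pair, so the combined distance is smaller than for the F/F pair even though the underlying distances are identical.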
In some embodiments, the visual similarity score unit 506 is configured for machine learning, and uses training data to find the appropriate weights for each subvector. For example using the four orientation pairs F/F, F/B, B/F, B/B, an eight-dimension feature vector x may be created with two non-zero features for any pair as per Table 1:
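Orientation pair | Feature vector |
---|---|
F/F | $x = \langle d_1, d_2, 0, 0, 0, 0, 0, 0 \rangle$ |
F/B | $x = \langle 0, 0, d_1, d_2, 0, 0, 0, 0 \rangle$ |
B/F | $x = \langle 0, 0, 0, 0, d_1, d_2, 0, 0 \rangle$ |
B/B | $x = \langle 0, 0, 0, 0, 0, 0, d_1, d_2 \rangle$ |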
In this example, the non-zero features come in pairs. The first feature is $d_1$ and corresponds to a color feature, and the next feature is $d_2$ and corresponds to a SIFT feature. The score for each pair is $\lambda \cdot x$, where $\lambda = (\lambda_1^{FF}, \lambda_2^{FF}, \lambda_1^{FB}, \lambda_2^{FB}, \lambda_1^{BF}, \lambda_2^{BF}, \lambda_1^{BB}, \lambda_2^{BB})$. In some embodiments, the parameters $\lambda$ may be set manually.
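The layout described above can be sketched in a few lines of Python. The names `PAIRS`, `block_vector`, `score`, and `LAM`, as well as the manually set weight values, are hypothetical; the sketch only shows that the dot product with a block-sparse vector selects the two weights belonging to the active orientation pair.

```python
PAIRS = ("F/F", "F/B", "B/F", "B/B")

def block_vector(pair, d1, d2):
    """Place (d1, d2) in the two slots reserved for the given pair."""
    x = [0.0] * (2 * len(PAIRS))
    k = PAIRS.index(pair)
    x[2 * k] = d1      # color distance
    x[2 * k + 1] = d2  # SIFT distance
    return x

def score(lam, x):
    """Dot product of the weight vector and the feature vector."""
    return sum(l * v for l, v in zip(lam, x))

# Hypothetical, manually set weights (lambda_1, lambda_2) for each pair,
# ordered F/F, F/B, B/F, B/B.
LAM = [0.3, 0.7, 0.8, 0.2, 0.8, 0.2, 0.3, 0.7]
```

For example, `score(LAM, block_vector("F/B", 0.2, 0.9))` uses only the third and fourth weights, because every other entry of the vector is zero.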
In some embodiments, a fixed weight may be set on one or more of the feature types. For example, using the four orientation pairs F/F, F/B, B/F, B/B, a five-dimensional feature vector x may be created with two non-zero features for any pair as per Table 2:
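Orientation pair | Feature vector |
---|---|
F/F | $x = \langle d_1, d_2, 0, 0, 0 \rangle$ |
F/B | $x = \langle d_1, 0, d_2, 0, 0 \rangle$ |
B/F | $x = \langle d_1, 0, 0, d_2, 0 \rangle$ |
B/B | $x = \langle d_1, 0, 0, 0, d_2 \rangle$ |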
In this example, the first feature is $d_1$ (a color feature) and then one of the next four features is $d_2$ (a SIFT feature). The score for each pair is $\lambda \cdot x$, where $\lambda = (\lambda_1, \lambda^{FF}, \lambda^{FB}, \lambda^{BF}, \lambda^{BB})$. The weight on $d_1$ is $\lambda_1$ and does not depend on the orientation parameter.
There may be more than two feature types. For example, there may be multiple SIFT sets with different pooling (where pooling is an operation that affects the extracted features), as per Table 3:
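Orientation pair | Feature vector |
---|---|
F/F | $x = \langle d_{\mathrm{color}}, d_{\mathrm{SIFT1}}, d_{\mathrm{SIFT2}}, 0, 0, 0, 0, 0, 0 \rangle$ |
F/B | $x = \langle d_{\mathrm{color}}, 0, 0, d_{\mathrm{SIFT1}}, d_{\mathrm{SIFT2}}, 0, 0, 0, 0 \rangle$ |
B/F | $x = \langle d_{\mathrm{color}}, 0, 0, 0, 0, d_{\mathrm{SIFT1}}, d_{\mathrm{SIFT2}}, 0, 0 \rangle$ |
B/B | $x = \langle d_{\mathrm{color}}, 0, 0, 0, 0, 0, 0, d_{\mathrm{SIFT1}}, d_{\mathrm{SIFT2}} \rangle$ |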
In this example, the score for each pair is $\lambda \cdot x$, where $\lambda = (\lambda_0, \lambda_{\mathrm{SIFT1}}^{FF}, \lambda_{\mathrm{SIFT2}}^{FF}, \lambda_{\mathrm{SIFT1}}^{FB}, \lambda_{\mathrm{SIFT2}}^{FB}, \lambda_{\mathrm{SIFT1}}^{BF}, \lambda_{\mathrm{SIFT2}}^{BF}, \lambda_{\mathrm{SIFT1}}^{BB}, \lambda_{\mathrm{SIFT2}}^{BB})$. This lets the SIFT features have different weights depending on the orientation parameters, but fixes the weights on color features. Similarly, color weights may also be configured to differ as a function of the orientation parameter, and the weights on SIFT features may be fixed. In some embodiments, both SIFT weights and color weights may vary as a function of the orientation parameter. In addition, a greater number of feature types and/or pooling combinations may be used.
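As noted above, the weights may be found from training data rather than set manually. The following sketch is one possible way to do so and is not taken from the disclosure: it fits the eight orientation-indexed weights of the two-feature example with a plain logistic regression trained by batch gradient descent. The synthetic data, the assumption that the SIFT distance is informative for same-orientation pairs and the color distance for different-orientation pairs, and all names (`make_sample`, `train`) are hypothetical.

```python
import math
import random

PAIRS = ("F/F", "F/B", "B/F", "B/B")

def make_sample(rng, pair, is_match):
    """Return a synthetic (slot, d1, d2, label) tuple; illustrative data only."""
    informative = rng.uniform(0.0, 0.3) if is_match else rng.uniform(0.7, 1.0)
    noise = rng.uniform(0.0, 1.0)
    same = pair in ("F/F", "B/B")
    # Assumption: SIFT (d2) is informative for same-orientation pairs,
    # color (d1) for different-orientation pairs.
    d1, d2 = (noise, informative) if same else (informative, noise)
    return PAIRS.index(pair), d1, d2, 1.0 if is_match else 0.0

def train(samples, steps=300, lr=1.0):
    """Fit p(match) = sigmoid(b - lambda . x) by batch gradient descent."""
    lam = [0.0] * (2 * len(PAIRS))
    b = 0.0
    n = len(samples)
    for _ in range(steps):
        grad = [0.0] * len(lam)
        grad_b = 0.0
        for k, d1, d2, y in samples:
            # Only the two slots of the active pair are non-zero in x.
            z = b - lam[2 * k] * d1 - lam[2 * k + 1] * d2
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y                   # derivative of the log-loss w.r.t. z
            grad[2 * k] -= err * d1       # dz/dlambda = -x
            grad[2 * k + 1] -= err * d2
            grad_b += err
        lam = [l - lr * g / n for l, g in zip(lam, grad)]
        b -= lr * grad_b / n
    return lam, b

rng = random.Random(0)
data = [make_sample(rng, pair, m)
        for pair in PAIRS for m in (True, False) for _ in range(50)]
lam, b = train(data)
```

On this synthetic data the learned weight on the informative distance of each orientation pair ends up larger than the weight on the uninformative one, which mirrors the intended behavior of down-weighting features that are unreliable for a given orientation change.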
In some embodiments, the accuracy obtained with certain features, for example color features, may also depend on the image acquisition device pair. If trained and used for a particular pair of devices, then the visual similarity score unit 506 may learn to down-weight color features if they are not likely to be effective. For example, the unit 506 may learn that blue colors become dark blue for a given device pair. Alternatively, or in combination therewith, the unit 506 may model expected orientations for a given pair of devices by learning how much to weight SIFT depending on orientation. Table 4 is an example using both types of learning:
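Orientation pair | Feature vector |
---|---|
F/F | $x = \langle f_t^{\mathrm{color}}, f_s^{\mathrm{color}}, d_{\mathrm{SIFT}}, 0, 0, 0 \rangle$ |
F/B | $x = \langle f_t^{\mathrm{color}}, f_s^{\mathrm{color}}, 0, d_{\mathrm{SIFT}}, 0, 0 \rangle$ |
B/F | $x = \langle f_t^{\mathrm{color}}, f_s^{\mathrm{color}}, 0, 0, d_{\mathrm{SIFT}}, 0 \rangle$ |
B/B | $x = \langle f_t^{\mathrm{color}}, f_s^{\mathrm{color}}, 0, 0, 0, d_{\mathrm{SIFT}} \rangle$ |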
In this example, there are only four parameters for the SIFT and orientation weighting, with the rest of the parameters being dedicated to model color transfer. This may be extended to multi-device situations, by adding new subspaces for color features.
In some embodiments, weights may vary as a function of the individual features, within a given feature type. For example, a learned distance metric, represented by a distance vector $D$, may be used to determine that leg colors are more useful than torso colors when orientation changes from the target image to the source image. The distance vector $D$ may correspond to a point-wise difference between a target feature vector $x_t$ and a source feature vector $x_s$. For example, $D_{L1} = \langle \lvert x_{t,1} - x_{s,1} \rvert, \lvert x_{t,2} - x_{s,2} \rvert, \ldots \rangle$. An example of an eight-dimension feature vector $x$ using the learned distance metric is as per Table 5:
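Orientation pair | Feature vector |
---|---|
F/F | $x = \langle D_{\mathrm{sift}}, D_{\mathrm{color}}, 0, 0, 0, 0, 0, 0 \rangle$ |
F/B | $x = \langle 0, 0, D_{\mathrm{sift}}, D_{\mathrm{color}}, 0, 0, 0, 0 \rangle$ |
B/F | $x = \langle 0, 0, 0, 0, D_{\mathrm{sift}}, D_{\mathrm{color}}, 0, 0 \rangle$ |
B/B | $x = \langle 0, 0, 0, 0, 0, 0, D_{\mathrm{sift}}, D_{\mathrm{color}} \rangle$ |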
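The point-wise distance vector described above, and the use of per-dimension weights on it, can be sketched as follows. The function names and the example weight values are assumptions made for illustration.

```python
def l1_distance_vector(xt, xs):
    """Point-wise absolute difference between target and source features."""
    if len(xt) != len(xs):
        raise ValueError("feature vectors must have the same length")
    return [abs(a - b) for a, b in zip(xt, xs)]

def weighted_distance(weights, distance_vector):
    """Per-dimension weights let individual features count differently."""
    return sum(w * d for w, d in zip(weights, distance_vector))
```

For instance, if the first entries of a color feature vector describe the torso and the last entries the legs, a hypothetical weight vector such as `[0.2, 0.2, 0.8]` makes a leg-color difference count four times as much as a torso-color difference of the same size.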
In some embodiments, the distance vectors may be shared among all of the feature vectors, such that this information is re-used by default, while also indexing each vector by orientation pair, as per Table 6:
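Orientation pair | Feature vector |
---|---|
F/F | $x = \langle D_{\mathrm{sift}}, D_{\mathrm{color}}, D_{\mathrm{sift}}, D_{\mathrm{color}}, 0, 0, 0, 0, 0, 0 \rangle$ |
F/B | $x = \langle D_{\mathrm{sift}}, D_{\mathrm{color}}, 0, 0, D_{\mathrm{sift}}, D_{\mathrm{color}}, 0, 0, 0, 0 \rangle$ |
B/F | $x = \langle D_{\mathrm{sift}}, D_{\mathrm{color}}, 0, 0, 0, 0, D_{\mathrm{sift}}, D_{\mathrm{color}}, 0, 0 \rangle$ |
B/B | $x = \langle D_{\mathrm{sift}}, D_{\mathrm{color}}, 0, 0, 0, 0, 0, 0, D_{\mathrm{sift}}, D_{\mathrm{color}} \rangle$ |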
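One way to share the distance vectors across all feature vectors while also indexing them by orientation pair is sketched below. The ordering (a shared copy first, followed by one slot per orientation pair) and the function name `shared_and_indexed` are assumptions; other layouts would serve the same purpose.

```python
PAIRS = ("F/F", "F/B", "B/F", "B/B")

def shared_and_indexed(pair, d_sift, d_color):
    """Concatenate a shared copy of the distance vectors with a copy
    placed in the slot of the given orientation pair (zeros elsewhere)."""
    block = list(d_sift) + list(d_color)
    x = list(block)  # shared part, present for every orientation pair
    for p in PAIRS:
        x.extend(block if p == pair else [0.0] * len(block))
    return x
```

A model trained on such vectors can place general-purpose weights on the shared part and orientation-specific corrections on the indexed part.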
In some embodiments, concatenation vectors are used instead of distance vectors. The concatenation vectors may be used, for example, to learn that certain colors or key points on one side of the target image in a region of interest match the colors or key points on the other side of the region of interest in the source image when a subject is front facing in the target image and back facing in the source image. Other learning options may also be applied.
In some embodiments, certain features are independent of orientation and other features are learned as being orientation-specific, as per Table 7:
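Orientation pair | Feature vector |
---|---|
F/F | $x = \langle f_t^{\mathrm{color}}, f_s^{\mathrm{color}}, f_t^{\mathrm{sift}}, f_s^{\mathrm{sift}}, 0, 0, 0, 0, 0, 0 \rangle$ |
F/B | $x = \langle f_t^{\mathrm{color}}, f_s^{\mathrm{color}}, 0, 0, f_t^{\mathrm{sift}}, f_s^{\mathrm{sift}}, 0, 0, 0, 0 \rangle$ |
B/F | $x = \langle f_t^{\mathrm{color}}, f_s^{\mathrm{color}}, 0, 0, 0, 0, f_t^{\mathrm{sift}}, f_s^{\mathrm{sift}}, 0, 0 \rangle$ |
B/B | $x = \langle f_t^{\mathrm{color}}, f_s^{\mathrm{color}}, 0, 0, 0, 0, 0, 0, f_t^{\mathrm{sift}}, f_s^{\mathrm{sift}} \rangle$ |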
In this example, color transformation features are considered in the first two subspaces for each vector, and then SIFT transformation and relative weighting are learned while taking into account orientation. Transfer features may be grouped as a function of orientation sensitivity. Across devices, there may be a feature subspace for each device, which holds color features for the target and the source, and a feature subspace for each orientation change, which holds SIFT features for the target and the source.
Other variations and/or combinations may be used for the different feature subspaces. Generally, the feature subspaces may be used to train a model to learn which feature types are important, which weightings on features are orientation-change dependent, which transformations of features are image acquisition device dependent, which transformations of features are orientation-change dependent, and/or to learn and exploit predictable changes in orientation from target to source. Although illustrated with an orientation parameter that corresponds to four possible orientation pairs of F/F, F/B, B/F, B/B, each embodiment may be adapted to other orientation parameters, such as same/different or an angle value difference. In addition, features of the embodiments described may be combined in different ways.
One skilled in the relevant arts will recognize that changes may be made to the embodiments described above without departing from the scope of the invention disclosed. For example, the blocks and/or operations in the flowcharts and drawings described herein are for purposes of example only. There may be many variations to these blocks and/or operations without departing from the teachings of the present disclosure. For instance, the blocks may be performed in a differing order, or blocks may be added, deleted, or modified.
Although illustrated in the block diagrams as groups of discrete components communicating with each other via distinct data signal connections, it will be understood by those skilled in the art that the present embodiments are provided by a combination of hardware and software components, with some components being implemented by a given function or operation of a hardware or software system, and many of the data paths illustrated being implemented by data communication within a computer application or operating system. Based on such understandings, the technical solution of the present invention may be embodied in the form of a software product. The software product may be stored in a non-volatile or non-transitory storage medium, which may be, for example, a compact disk read-only memory (CD-ROM), USB flash disk, or a removable hard disk. The software product may include a number of instructions that enable a computer device (personal computer, server, or network device) to execute the methods provided in the embodiments of the present invention. The structure illustrated is thus provided to increase efficiency of teaching the present embodiment. The present disclosure may be embodied in other specific forms without departing from the subject matter of the claims.
Also, one skilled in the relevant arts will appreciate that although the systems, methods and computer readable media disclosed and shown herein may have a specific number of elements/components, the systems, methods and computer readable media may be modified to include additional or fewer of such elements/components. In addition, alternatives to the examples provided above are possible in view of specific applications. For instance, emerging technologies (e.g. fifth generation (5G) and future technologies) are expected to require higher performance processors to address ever growing data bandwidth and low-latency connectivity requirements. Therefore, new devices will be required to be smaller, faster and more efficient. Some embodiments can specifically be designed to satisfy the various demands of such emerging technologies. The embodiments described herein may be applied in parallel programming, cloud computing, and other environments for big data, as well as in embedded devices, or on custom hardware such as GPUs or FPGAs.
The present disclosure is also intended to cover and embrace all suitable changes in technology. Modifications which fall within the scope of the present invention will be apparent to those skilled in the art, and, in light of a review of this disclosure, such modifications are intended to fall within the appended claims.
Number | Date | Country | |
---|---|---|---|
20170213108 A1 | Jul 2017 | US |