The present invention relates to the field of person re-identification in computer vision, and specifically, to a deep discriminative network for person re-identification in an image or a video.
In recent years, as public safety has received increasing attention, video surveillance systems have become widespread. Public places such as airports, train stations, campuses, and office buildings urgently need surveillance to ensure security. Faced with massive amounts of surveillance video data, a great deal of manpower must be invested in monitoring and retrieving video information; this approach is inefficient and wastes resources. Using computer vision technology to automatically monitor and analyze video information will accelerate the construction of safe cities.
Person re-identification is a key task in computer vision research. In general, given an image or a video of a pedestrian, person re-identification is the process of recognizing the same person in other images or videos captured in non-overlapping scenes. Although relevant research has received growing attention and the accuracy of person re-identification has improved considerably, many difficulties remain. Since the pedestrian image to be recognized is taken by a camera different from that of the original image, differences between cameras introduce variations in imaging conditions; environments differ across scenes, so the collected data carry different biases; changes in lighting can make the same color appear different; and, most importantly, changes in pedestrian pose and occlusions under the camera make recognizing the same person difficult.
In recent years, following the trend of deep learning, convolutional neural networks have been widely used in the field of person re-identification. Extracting image features through deep networks and measuring distances on the corresponding feature space with deep learning or traditional methods greatly improve the accuracy of person re-identification. The progress of these works benefits from the feature-extraction ability of deep convolutional network models, but their exploration of discriminative ability is confined to a given feature space, which limits further improvement of the discriminative power of deep models.
To overcome the above deficiencies of the prior art, the present invention provides a deep discriminative network model method for person re-identification in an image or a video. Based on the process of judging the similarity of pedestrians between different images, a deep discriminative network model is designed. The two input images are concatenated along the color channel, the similarity between the images is discriminated in the original image difference space, and the learning ability of the network is improved by embedding an Inception module, so as to effectively distinguish whether the input images belong to the same person. In the present invention, features of individual images are not extracted, and there is no traditional step of extracting features from the input images, so the potential of the deep convolutional neural network model in discriminating image differences can be fully exploited.
In the present invention, the two input images are first concatenated along the color channel, and the resulting spliced object is defined as the original difference space of the two images; this result is then fed into the designed convolutional neural network, which finally computes the similarity between the two input images by learning the difference information in the original space. The deep discriminative network in the present invention comprises the construction of the original difference space and a convolutional network; the convolutional network comprises three connected convolutional modules and one Inception module, followed by an asymmetric convolutional layer and a fully connected layer. The similarity between images is obtained using the SoftMax algorithm.
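As an illustration, the construction of this original difference space can be sketched in a few lines of PyTorch (an assumed framework; the 160×60 image size follows the embodiment described below):

```python
import torch

# Two input pedestrian images, each a 3 x H x W tensor (R, G, B);
# the 160 x 60 size follows the embodiment below.
img_a = torch.rand(3, 160, 60)
img_b = torch.rand(3, 160, 60)

# Concatenating along the channel dimension yields the 6-channel
# (R, G, B, R, G, B) "image" that defines the original difference space.
difference_space = torch.cat([img_a, img_b], dim=0)  # shape: (6, 160, 60)
```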
The technical scheme proposed in the present invention is as follows:
Disclosed is a deep discriminative network model method for person re-identification in an image or a video. A deep discriminative network is constructed in which the two input images are concatenated along the color channel, the resulting spliced object is fed into a convolutional network, and the deep discriminative network outputs the similarity between the two input images by learning the difference information in the original difference space, thereby realizing person re-identification. Specifically, the method comprises the steps of:
1) designing the structure of a deep discriminative network model;
11) constructing the original difference space of the image;
The two input images are concatenated along the color channels (R, G, B) to form an “image” containing six channels (R, G, B, R, G, B), and this “image” is defined as the original difference space of the two images, serving as the object directly learned by the convolutional neural network;
12) designing three connected convolutional modules for learning the difference information of the input object;
Each module contains two convolutional operations, one ReLU mapping, and one max-pooling operation, where the convolutional kernels are 3×3 with a stride of 1, and the pooling window is 2×2 with a stride of 2;
13) designing an Inception module following the convolutional modules to increase the depth and width of the network; and
14) designing an asymmetric convolutional operation to further reduce the difference dimension, and using a fully connected layer and the SoftMax method to calculate the similarity between the input images (a sketch of this architecture is given below);
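For concreteness, a minimal PyTorch sketch of the network of steps 11) to 14) follows; the channel widths, the Inception branch configuration, the ordering of operations inside a module, and the 3×1 asymmetric kernel are illustrative assumptions, since the description does not fix them:

```python
import torch
import torch.nn as nn

def conv_module(in_ch, out_ch):
    # One of the three connected convolutional modules: two 3x3 convolutions
    # (stride 1), one ReLU mapping, and 2x2 max pooling with stride 2.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class InceptionModule(nn.Module):
    # A GoogLeNet-style Inception block; the branch widths are illustrative
    # assumptions, not values fixed by the description.
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 32, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 32, kernel_size=1),
                                nn.Conv2d(32, 64, kernel_size=3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, kernel_size=1),
                                nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

class DDNIM(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_module(6, 32),    # input: the 6-channel difference space
            conv_module(32, 64),
            conv_module(64, 128),
            InceptionModule(128),  # 32 + 64 + 32 + 32 = 160 output channels
            # Asymmetric convolution; the 3x1 kernel size is an assumption.
            nn.Conv2d(160, 160, kernel_size=(3, 1)),
            nn.Flatten(),
        )
        self.classifier = nn.LazyLinear(2)  # two outputs: similar / dissimilar

    def forward(self, x):
        logits = self.classifier(self.features(x))
        return torch.softmax(logits, dim=1)  # similarity via SoftMax

model = DDNIM()
pair = torch.rand(1, 6, 160, 60)  # one concatenated image pair
print(model(pair))                # two probabilities summing to 1
```

On a 160×60 input pair, the three pooling stages reduce the spatial size to 20×7 before the Inception module and the asymmetric convolution.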
2) setting the pedestrian images in data set X to the same size, and dividing them into a training set T and a test set D;
In the specific embodiment of the present invention, the pedestrian images in data set X are uniformly resized to 160×60 and randomly divided into the training set T and the test set D;
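A small sketch of this preparation step (the 50/50 ratio and the split by pedestrian identity, which is usual in re-identification, are assumptions; the description only states that the division is random, and the helper names are hypothetical):

```python
import random
from PIL import Image

def prepare_image(path):
    # Uniformly resize a pedestrian image to 160 x 60 (height x width);
    # note that PIL's resize takes (width, height).
    return Image.open(path).resize((60, 160))

def split_identities(identity_ids, train_ratio=0.5):
    # Randomly divide the identities of data set X into training set T
    # and test set D.
    ids = list(identity_ids)
    random.shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return set(ids[:cut]), set(ids[cut:])
```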
3) training the deep discriminative network constructed in Step 1) with the training set T, updating the learning parameters until convergence, and obtaining the trained deep discriminative network model; comprising the steps of:
31) performing data augmentation on the images in the training set T by:
A. horizontally flipping each image in the training set T to obtain its mirror image;
B. taking the center of each image in the training set T (including the mirror images generated in Step A) as a reference, randomly sampling multiple images (e.g., 5 images; the purpose of sampling is to increase the number of training samples) offset in the horizontal and vertical directions by a certain amount. In the specific embodiment of the present invention, the offsets lie in [−0.05H, 0.05H] × [−0.05W, 0.05W], where H and W are the height and width of the original image, respectively;
32) pre-processing the samples: calculating the mean and variance of all samples in the training set, and then normalizing all images (in both the training set and the test set) to obtain normally distributed sample data as the subsequent training data;
33) generating training samples: all samples of each person form similarity pairs with one another; for each similarity pair, two images are randomly selected from the samples of all other persons to form dissimilarity pairs with one sample of the pair, so that the ratio of similarity pairs to dissimilarity pairs is controlled at 1:2 in the final training samples (steps 31) to 33) are sketched below, after step 4)); and
34) using the batch training method, randomly sampling 128 pairs of pedestrian images from the training samples, and updating the network parameters with the stochastic gradient descent method until convergence, to obtain the trained deep discriminative network model (a training sketch is also given below);
In the specific embodiment of the present invention, 128 pairs of pedestrian images are sampled for batch training; when using the stochastic gradient descent method, the learning rate is set to 0.05, the momentum is 0.9, the learning rate decays to 0.0001, and the weight decay is 0.0005;
The trained deep discriminative network model can be evaluated by using the pedestrian images in test set D; and
4) using the trained deep discriminative network model on the test data set D, verifying whether the pedestrians in each pair of input images in the test data set D belong to the same person, and obtaining the accuracy rate.
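To make the data preparation of steps 31) to 33) concrete, the following is a minimal sketch in PyTorch/torchvision (an assumed framework; the translation-based offset sampling, the five crops per image, and the helper names are illustrative assumptions):

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(img, n_crops=5):
    # Step 31): each training image plus its horizontal mirror, and n_crops
    # samples randomly offset from the center within [-0.05H, 0.05H] x
    # [-0.05W, 0.05W]. Translation is used here as a simple stand-in for
    # center-offset sampling.
    _, h, w = img.shape  # img is a 3 x H x W tensor
    out = [img, TF.hflip(img)]
    for base in list(out):
        for _ in range(n_crops):
            dx = int(random.uniform(-0.05, 0.05) * w)
            dy = int(random.uniform(-0.05, 0.05) * h)
            out.append(TF.affine(base, angle=0.0, translate=[dx, dy],
                                 scale=1.0, shear=[0.0]))
    return out

def normalize_all(train_imgs, all_imgs):
    # Step 32): compute the mean and variance over all training samples,
    # then normalize every image (training and test) with these statistics.
    stack = torch.stack(train_imgs)
    mean = stack.mean(dim=(0, 2, 3), keepdim=True).squeeze(0)
    std = stack.std(dim=(0, 2, 3), keepdim=True).squeeze(0)
    return [(img - mean) / std for img in all_imgs]

def make_pairs(samples_by_id):
    # Step 33): all samples of each person pair up as similarity pairs; for
    # each similarity pair, two images of other people form dissimilarity
    # pairs, fixing the positive:negative ratio at 1:2.
    pairs = []
    ids = list(samples_by_id)
    for pid in ids:
        own = samples_by_id[pid]
        others = [s for q in ids if q != pid for s in samples_by_id[q]]
        for i in range(len(own)):
            for j in range(i + 1, len(own)):
                pairs.append((own[i], own[j], 1))
                for neg in random.sample(others, 2):
                    pairs.append((own[i], neg, 0))
    return pairs
```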
The SoftMax algorithm is used in the present invention to obtain similarities between images.
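Similarly, a hedged sketch of the batch training of step 34) and the verification of step 4), using the embodiment's hyperparameters; the epoch count and the cosine decay from 0.05 down to 0.0001 are assumptions, since the description states only the start and end learning rates:

```python
import random
import torch
import torch.nn as nn

def train(model, pairs, epochs=20):
    # Step 34): stochastic gradient descent over batches of 128 pairs with
    # learning rate 0.05 (decaying to 0.0001), momentum 0.9, and weight
    # decay 0.0005, as in the embodiment.
    opt = torch.optim.SGD(model.parameters(), lr=0.05,
                          momentum=0.9, weight_decay=0.0005)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(
        opt, T_max=epochs, eta_min=0.0001)
    nll = nn.NLLLoss()  # the model already outputs SoftMax probabilities
    for _ in range(epochs):
        random.shuffle(pairs)
        for k in range(0, len(pairs) - 127, 128):
            batch = pairs[k:k + 128]
            # Build the 6-channel difference-space inputs and the labels.
            x = torch.stack([torch.cat([a, b], dim=0) for a, b, _ in batch])
            y = torch.tensor([lbl for _, _, lbl in batch])
            opt.zero_grad()
            loss = nll(torch.log(model(x) + 1e-12), y)
            loss.backward()
            opt.step()
        sched.step()

@torch.no_grad()
def accuracy(model, test_pairs):
    # Step 4): verify whether the two pedestrians of each test pair in D
    # match, and report the accuracy rate.
    correct = 0
    for a, b, lbl in test_pairs:
        pred = model(torch.cat([a, b], dim=0).unsqueeze(0)).argmax(dim=1)
        correct += int(pred.item() == lbl)
    return correct / len(test_pairs)
```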
Compared with the prior art, the beneficial effects of the present invention are:
The present invention provides a deep discriminative network model method for person re-identification in an image or a video, and further explores the potential of the deep convolutional neural network model in discriminating image differences. Compared with the prior art, the present invention no longer learns features of individual images; instead, the input images are concatenated along the color channel at the outset, so that the designed network learns their difference information in the original space of the image. By introducing the Inception module and embedding it into the model, the learning ability of the network is improved, and a better discriminating effect is achieved.
The present invention will become apparent from the following detailed description of embodiments and from the accompanying drawings, which do not limit the scope of the invention in any way.
The present invention proposes a deep discriminative network model algorithm (hereinafter referred to as DDN-IM) for person re-identification, whose structure is shown in the accompanying drawing.
1. Designing the deep discriminative network architecture, comprising the steps described above.
2. Training the deep discriminative network (parameter learning), as described above.
In order to verify the effect of the Inception module in the deep discriminative network model, the present invention performs corresponding comparison experiments according to whether or not the Inception module is used and with the Inception module placed after different convolutional modules; the results are shown in the corresponding table.
In Table 2, eSDC (existing Salience Detection Combination) is a salience detection method combined with a conventional method, documented in the literature (R. Zhao, W. Ouyang, and X. Wang, “Unsupervised salience learning for person re-identification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3586-3593.); KISSME (Keep It Simple and Straightforward Metric Learning) is documented in the literature (M. Köstinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric learning from equivalence constraints,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2288-2295.); FPNN (Filter Pairing Neural Network) is documented in the literature (W. Li, R. Zhao, T. Xiao, and X. Wang, “DeepReID: Deep filter pairing neural network for person re-identification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 152-159.); IDLA (Improved Deep Learning Architecture) is documented in the literature (E. Ahmed, M. Jones, and T. K. Marks, “An improved deep learning architecture for person re-identification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3908-3916.); SIRCIR (Single-Image Representation and Cross-Image Representation) is documented in the literature (F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang, “Joint learning of single-image and cross-image representations for person re-identification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1288-1296.); PersonNet (Person Network) is documented in the literature (L. Wu, C. Shen, and A. van den Hengel, “PersonNet: Person re-identification with deep convolutional neural networks,” arXiv preprint arXiv:1601.07255, 2016.); and Norm X-Corr (Normalized Cross-Correlation) is documented in the literature (A. Subramaniam, M. Chatterjee, and A. Mittal, “Deep neural networks with inexact matching for person re-identification,” in Advances in Neural Information Processing Systems 29, 2016, pp. 2667-2675.).
In Table 3, LOMO+XQDA (LOcal Maximal Occurrence representation and Cross-view Quadratic Discriminant Analysis) is documented in the literature (S. Liao, Y. Hu, X. Zhu, and S. Z. Li, “Person re-identification by local maximal occurrence representation and metric learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2197-2206.); KEPLER (KErnelized saliency-based Person re-identification through multiple metric LEaRning) is documented in the literature (N. Martinel, C. Micheloni, and G. L. Foresti, “Kernelized saliency-based person re-identification through multiple metric learning,” IEEE Transactions on Image Processing, vol. 24, no. 12, pp. 5645-5658, 2015.); NLML (Nonlinear Local Metric Learning) is documented in the literature (S. Huang, J. Lu, J. Zhou, and A. K. Jain, “Nonlinear local metric learning for person re-identification,” Computer Science, 2015.); SSDAL+XQDA (Semi-Supervised Deep Attribute Learning and Cross-view Quadratic Discriminant Analysis) is documented in the literature (C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian, “Deep attributes driven multi-camera person re-identification,” arXiv preprint arXiv:1605.03259, 2016.); DR-KISS (Dual-Regularized KISS) is documented in the literature (D. Tao, Y. Guo, M. Song, Y. Li, Z. Yu, and Y. Y. Tang, “Person re-identification by dual-regularized KISS metric learning,” IEEE Transactions on Image Processing, vol. 25, no. 6, pp. 2726-2738, 2016.); SCSP (Spatially Constrained Similarity function on Polynomial feature map) is documented in the literature (D. Chen, Z. Yuan, B. Chen, and N. Zheng, “Similarity learning with spatial constraints for person re-identification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1268-1277.); and SSM (Supervised Smoothed Manifold) is documented in the literature (S. Bai, X. Bai, and Q. Tian, “Scalable person re-identification on supervised smoothed manifold,” arXiv preprint arXiv:1703.08359, 2017.).
In Table 4, ITML (Information Theoretic Metric Learning) is documented in the literature (J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information-theoretic metric learning,” in Proceedings of the Twenty-Fourth International Conference on Machine Learning, 2007, pp. 209-216.); kLFDA (kernel Local Fisher Discriminant Analysis) is documented in the literature (F. Xiong, M. Gou, O. Camps, and M. Sznaier, “Person re-identification using kernel-based metric learning methods,” in European Conference on Computer Vision. Springer, 2014, pp. 1-16.); DML (Deep Metric Learning) is documented in the literature (D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Deep metric learning for person re-identification,” in International Conference on Pattern Recognition, 2014, pp. 34-39.); NullReid (Null space for person Re-identification) is documented in the literature (L. Zhang, T. Xiang, and S. Gong, “Learning a discriminative null space for person re-identification,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1239-1248.); Ensembles (Metric Ensembles) is documented in the literature (S. Paisitkriangkrai, C. Shen, and A. van den Hengel, “Learning to rank in person re-identification with metric ensembles,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1846-1855.); ImpTrpLoss (Improved Triplet Loss) is documented in the literature (D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, “Person re-identification by multi-channel parts-based CNN with improved triplet loss function,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1335-1344.); and MTDnet (Multi-Task Deep Network) is documented in the literature (W. Chen, X. Chen, J. Zhang, and K. Huang, “A multi-task deep network for person re-identification,” in Thirty-First AAAI Conference on Artificial Intelligence, 2017.).
As described above, the present invention has been tested on three different data sets and compared with other methods. Table 2, Table 3, and Table 4 list the CMC results obtained with different methods on the CUHK01, QMUL GRID, and PRID2011 data sets, respectively. As can be seen, the deep discriminative network model proposed in the present invention achieves better performance, indicating the effectiveness of the algorithm.
It is to be noted that the above contents are a further detailed description of the present invention in connection with the disclosed embodiments. The present invention is not limited to the embodiments referred to, and may be varied and modified by those skilled in the art without departing from the conception and scope of the present invention. The claimed scope of the present invention should be defined by the claims.
Number | Date | Country | Kind
--- | --- | --- | ---
201710570245.1 | Jul 2017 | CN | national
Filing Document | Filing Date | Country | Kind
--- | --- | --- | ---
PCT/CN2018/073708 | 1/23/2018 | WO | 00