This application is based on and claims priority under 35 USC 119 from Japanese Patent Application No. 2016-169678 filed Aug. 31, 2016.
The present invention relates to an image processing apparatus, a non-transitory computer readable medium, and an image processing method.
According to an aspect of the invention, there is provided an image processing apparatus including a reception section, an image extraction section, a forming section, and a comparison section. The reception section receives a video. The image extraction section extracts target object images from multiple frames that constitute the video received by the reception section. The forming section forms multiple target object images among the target object images extracted by the image extraction section into one unit, the multiple target object images being temporally apart from each other. The comparison section makes a comparison on the basis of the unit formed by the forming section.
Exemplary embodiments of the present invention will be described in detail based on the following figures, wherein:
Now, exemplary embodiments of the present invention will be described in detail with reference to the drawings.
A person region extraction unit 28 automatically extracts person regions, typically as rectangular regions, in a case where persons are included in the frames (images) that constitute the video received by the data reception unit 26. Various methods have been proposed for person region detection, and any standard method may be used. One representative method is Fast R-CNN, described in R. Girshick, Fast R-CNN, arXiv:1504.08083, 2015, for example.
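Purely as an illustrative sketch, and not as the claimed configuration, the person region extraction may be realized with a pretrained object detector. The sketch below uses torchvision's Faster R-CNN as a stand-in for the Fast R-CNN detector cited above; the 0.8 score threshold is an arbitrary assumption.

```python
# Sketch only: torchvision's Faster R-CNN is used as a stand-in for the Fast R-CNN
# detector cited above; the score threshold is an illustrative assumption.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

PERSON_LABEL = 1  # "person" class index in the COCO label set used by torchvision

# In newer torchvision versions, use weights="DEFAULT" instead of pretrained=True.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

def extract_person_regions(frame_rgb, score_threshold=0.8):
    """Return [x1, y1, x2, y2] boxes of persons detected in one frame."""
    with torch.no_grad():
        output = model([to_tensor(frame_rgb)])[0]
    boxes = []
    for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
        if label.item() == PERSON_LABEL and score.item() >= score_threshold:
            boxes.append(box.tolist())
    return boxes
```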
A timeline segment forming unit 30 forms the person regions extracted by the person region extraction unit 28 into a timeline segment as one unit. That is, as illustrated in
Here, S1, S2, and S3 are the areas of portions defined in
Note that, as illustrated in
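The exact overlap criterion involving the areas S1, S2, and S3 is defined with reference to a figure that is not reproduced here. Purely as an illustration of the general idea of forming person regions from temporally separate frames into one timeline segment, the sketch below greedily links person regions of consecutive frames using a simple intersection-over-union test as an assumed stand-in for that criterion.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes (assumed criterion)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def form_timeline_segments(boxes_per_frame, iou_threshold=0.5):
    """Greedily link person regions of consecutive frames into timeline segments.

    boxes_per_frame: list over frames, each a list of [x1, y1, x2, y2] boxes.
    Returns a list of segments, each a list of (frame_index, box) tuples.
    """
    segments = []   # finished and ongoing segments
    active = []     # indices of segments that were extended in the previous frame
    for t, boxes in enumerate(boxes_per_frame):
        next_active = []
        for box in boxes:
            best, best_iou = None, iou_threshold
            for seg_idx in active:
                if seg_idx in next_active:
                    continue  # each segment absorbs at most one region per frame
                cand_iou = iou(segments[seg_idx][-1][1], box)
                if cand_iou >= best_iou:
                    best, best_iou = seg_idx, cand_iou
            if best is None:
                segments.append([(t, box)])
                next_active.append(len(segments) - 1)
            else:
                segments[best].append((t, box))
                next_active.append(best)
        active = next_active
    return segments
```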
One of the problems in forming timeline segments is that, if persons overlap to an extremely large degree, person regions that should be formed into different timeline segments of different persons are instead formed into the same timeline segment. That is, as illustrated in
The multiple-person overlap determination unit 32 separates the multiple persons into different timeline segments before and after the multiple persons are in the overlapping state. Accordingly, it is possible to suppress the erroneous assignment of multiple persons to a single timeline segment.
The multiple-person overlap determination unit 32 is configured as a binary classifier. The classifier is formed by, for example, preparing learning data in which any person region in which multiple persons are in the overlapping state is assumed to be a positive instance and any person region in which multiple persons are not in the overlapping state is assumed to be a negative instance, extracting features, and performing model learning. When extracting features, any image features, such as HOG (histogram of oriented gradients) feature values or SIFT+BoF (scale-invariant feature transform and bag of features) feature values, may be extracted. In the model learning, a classifier, such as an SVM (support vector machine) classifier, may be used. Alternatively, it is possible to form a classifier directly from RGB inputs by using a convolutional neural network, a representative example of which is AlexNet, described in A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.
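A minimal sketch of such a binary classifier, assuming the HOG-plus-SVM variant mentioned above, grayscale person-region crops resized to 64x128 pixels, and pre-collected positive (overlapping) and negative (non-overlapping) crops; all parameter values are illustrative assumptions rather than prescribed settings.

```python
# Sketch of the HOG + SVM variant described above; crop size, HOG parameters,
# and SVM settings are illustrative assumptions.
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import SVC

def hog_feature(person_region_rgb):
    """HOG feature of a person-region crop, resized to a fixed 64x128 size."""
    gray = resize(rgb2gray(person_region_rgb), (128, 64))
    return hog(gray, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def train_overlap_classifier(positive_crops, negative_crops):
    """positive_crops: regions with overlapping persons; negative_crops: without."""
    X = np.array([hog_feature(c) for c in positive_crops + negative_crops])
    y = np.array([1] * len(positive_crops) + [0] * len(negative_crops))
    clf = SVC(kernel="linear", probability=True)
    clf.fit(X, y)
    return clf

def persons_overlap(clf, person_region_rgb, threshold=0.5):
    """True when the region is classified as containing overlapping persons."""
    return clf.predict_proba([hog_feature(person_region_rgb)])[0, 1] >= threshold
```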
A timeline segment comparison unit 34 compares timeline segments formed by the timeline segment forming unit 30 with each other. An output unit 36 causes the display device 22 to display the result of comparison made by the timeline segment comparison unit 34 via the display controller 18 described above, for example.
A comparison of timeline segments is made according to a first exemplary embodiment in which person identification is performed or according to a second exemplary embodiment in which the distance between persons is calculated.
First, the first exemplary embodiment is described.
In the first exemplary embodiment, the timeline segment comparison unit 34 illustrated in
The segment person identification unit 42 causes a person identification unit 44 to perform individual identification for each frame in a segment. When determination is performed on a segment, the scores calculated for each person ID are integrated across the frames to implement individual identification for the segment. As a method for integration, processing such as adding up the scores corresponding to each person ID may be performed.
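As one way to realize the integration described above (a sketch under the assumption that the per-frame identifier returns a score vector indexed by person ID), the per-frame scores may simply be summed and the highest-scoring ID selected for the segment.

```python
import numpy as np

def identify_person_in_segment(frame_scores):
    """frame_scores: array of shape (num_frames, num_person_ids), one score
    vector per frame in the segment. Scores are summed over the frames and the
    person ID with the largest integrated score is returned."""
    integrated = np.sum(np.asarray(frame_scores, dtype=float), axis=0)
    return int(np.argmax(integrated)), integrated
```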
Further, the above-described individual identification may be combined with a face recognition technique that is widely used. In the case of combination, scores may be weighted and added up, for example.
Specifically, the segment person identification unit 42 includes the person identification unit 44, which is combined with a face detection unit 46 and a face recognition unit 48.
The person identification unit 44 is caused to learn, in advance, multiple persons present in a video and infers the IDs of the persons when a frame (image) in a segment is input. In the learning, all persons to be identified are assigned respective IDs, person region images in which each person is present are collected as positive instances of the corresponding ID, and such learning data is collected for each of the persons to be identified. The learning data is thus prepared, features are extracted, and model learning is performed to thereby form the person identification unit 44. When extracting features, any image features, such as HOG feature values or SIFT+BoF feature values, may be extracted. In the model learning, a classifier, such as an SVM classifier, may be used. Alternatively, it is possible to form a classifier directly from RGB inputs by using a convolutional neural network, a representative example of which is AlexNet, described in A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012.
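Purely as a sketch, the person identification unit 44 may be realized as a multi-class classifier over the same kind of features; the sketch below assumes the HOG features and an SVM (reusing the hypothetical hog_feature function from the earlier sketch) and uses the classifier's per-class scores as the per-person-ID scores that are integrated as described above.

```python
# Sketch: a multi-class SVM over HOG features (hog_feature as sketched earlier);
# decision_function values serve as per-person-ID scores. Assumes three or more
# registered persons, so decision_function returns one score per person ID.
import numpy as np
from sklearn.svm import SVC

def train_person_identifier(crops, person_ids):
    """crops: person-region images; person_ids: the ID label of each crop."""
    X = np.array([hog_feature(c) for c in crops])
    clf = SVC(kernel="linear", decision_function_shape="ovr")
    clf.fit(X, np.array(person_ids))
    return clf

def person_id_scores(clf, person_region_rgb):
    """One score per registered person ID for a single frame's person region."""
    return clf.decision_function([hog_feature(person_region_rgb)])[0]
```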
The face detection unit 46 detects face regions when a frame in a segment is input.
In a case where face detection by the face detection unit 46 is successful, the face recognition unit 48 calculates a score for each person ID assigned to a corresponding one of the persons registered in advance.
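For the combination with face recognition mentioned above, one simple possibility, sketched under the assumption that a face-recognition score vector is available only for frames in which face detection succeeded and with arbitrary weights, is a weighted sum of the two score vectors per frame.

```python
import numpy as np

def combined_frame_scores(body_scores, face_scores=None, w_body=0.5, w_face=0.5):
    """Weighted combination of per-person-ID scores for one frame.

    body_scores: scores from the person identification unit (one per person ID).
    face_scores: scores from the face recognition unit, or None when face
                 detection failed for this frame.
    """
    body_scores = np.asarray(body_scores, dtype=float)
    if face_scores is None:
        return body_scores
    return w_body * body_scores + w_face * np.asarray(face_scores, dtype=float)
```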
First, in step S10, a video is received. Next, in step S12, the video received in step S10 is divided into frames (images). In step S14, timeline segments are formed from the frames obtained as a result of division in step S12. In step S16, a segment person identification process is performed. In step S18, it is determined whether processing is completed for all of the segments. If it is determined that processing is completed for all of the segments (Yes in step S18), the flow ends. If it is determined that processing is not completed for all of the segments (No in step S18), the flow returns to step S16, and processing is repeated until the processing is completed for all of the segments.
First, in step S161, a segment is input. Next, in step S162, individual identification is performed on the frames (images) obtained as a result of division in step S12 described above. In step S163, it is determined whether processing is completed for all of the frames. If processing is completed for all of the frames (Yes in step S163), the flow proceeds to step S164, the scores calculated for each frame and for each person are integrated, and the flow ends. On the other hand, if it is determined that processing is not completed for all of the frames (No in step S163), the flow returns to step S162, and processing is repeated until the processing is completed for all of the frames.
Next, the second exemplary embodiment is described.
In the second exemplary embodiment, the timeline segment comparison unit 34 illustrated in
The inter-segment distance determination unit 42a calculates the distance between two segments that are input. As the calculation method, the distance between each pair of frames respectively included in the two segments may be calculated and the average distance may be defined as the distance between the two segments. Alternatively, another method in which the distance between two segments is defined as the distance between sets, such as the Hausdorff distance, may be used, for example.
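Both variants mentioned above are sketched below, under the assumption that each frame of a segment has already been mapped to a feature vector by any of the feature extraction methods discussed earlier; the choice of the Euclidean frame-to-frame distance is also an assumption.

```python
import numpy as np
from scipy.spatial.distance import cdist, directed_hausdorff

def average_pairwise_distance(segment_a, segment_b):
    """segment_a, segment_b: arrays of shape (num_frames, feature_dim).
    Average Euclidean distance over all frame pairs from the two segments."""
    return float(cdist(segment_a, segment_b).mean())

def hausdorff_distance(segment_a, segment_b):
    """Symmetric Hausdorff distance between the two segments viewed as point sets."""
    return max(directed_hausdorff(segment_a, segment_b)[0],
               directed_hausdorff(segment_b, segment_a)[0])
```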
Further, the above-described distance calculation may be combined with a face recognition technique that is widely used. In the case of combination, scores may be weighted and added up, for example.
Specifically, the inter-segment distance determination unit 42a includes an inter-person distance determination unit 44a, which is combined with a face recognition unit 46a and an inter-face distance calculation unit 48a.
The inter-person distance determination unit 44a determines whether two persons respectively present in the two input segments are the same person.
Here, a configuration is employed in which a binary result indicating whether or not the two persons are the same person is returned. The inter-person distance may then be defined by returning a predetermined small value in a case where the two persons are determined to be the same person and a predetermined large value in a case where the two persons are determined not to be the same person.
Alternatively, a method of performing end-to-end processing from feature extraction to identification may be applied by using deep learning as described in H. Liu, J. Feng, M. Qi, J. Jiang and S. Yan, End-to-End Comparative Attention Networks for Person Re-identification, IEEE Transactions on Image Processing, vol. 14, No. 8, June 2016, or in L. Wu, C. Shen, A. van den Hengel, PersonNet: Person Re-identification with Deep Convolutional Neural Networks, http://arxiv.org/abs/1601.07255.
The face recognition unit 46a detects and recognizes face regions when a frame in a segment is input. The inter-face distance calculation unit 48a calculates the distance between faces respectively present in two input frames in a case where face detection is successful. As a standard method for this, a method such as OpenFace, described in F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: A Unified Embedding for Face Recognition and Clustering, CVPR 2015, pp. 815-823, is available.
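A sketch of the inter-face distance calculation, under the assumption that some embedding model in the spirit of the FaceNet/OpenFace approach cited above maps an aligned face crop to a fixed-length vector; the embed callable below is a hypothetical placeholder for such a model.

```python
import numpy as np

def inter_face_distance(face_crop_a, face_crop_b, embed):
    """Euclidean distance between the embeddings of two detected faces.

    embed: a hypothetical callable wrapping a FaceNet/OpenFace-style model that
    maps an aligned face crop to a fixed-length embedding vector."""
    emb_a = np.asarray(embed(face_crop_a), dtype=float)
    emb_b = np.asarray(embed(face_crop_b), dtype=float)
    return float(np.linalg.norm(emb_a - emb_b))
```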
Further, an inter-segment distance correction unit 54 may be provided. The inter-segment distance correction unit 54 corrects the distance on the basis of a condition that segments that are present at the same time and in the same space always correspond to different persons.
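A sketch of this correction, under the assumptions that each segment carries its frame-index range and that "present at the same time and in the same space" can be approximated by temporal overlap within one camera's video; co-occurring segments are pushed apart by assigning a large distance.

```python
def correct_inter_segment_distance(distance, span_a, span_b, large_value=1e6):
    """span_a, span_b: (first_frame_index, last_frame_index) of each segment.
    Segments that overlap in time in the same video are assumed to show
    different persons, so their distance is forced to a large value."""
    overlaps_in_time = span_a[0] <= span_b[1] and span_b[0] <= span_a[1]
    return large_value if overlaps_in_time else distance
```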
The distance between the segments is thus determined, and clustering is performed. Clustering is performed on the basis of the distance between segments calculated by the inter-segment distance determination unit 42a. As the method for clustering, the k-means method or various hierarchical clustering methods, for example, may be used.
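As a sketch of the clustering step, hierarchical clustering may be run directly on the precomputed inter-segment distance matrix; the number of clusters, that is, of distinct persons, is assumed to be given here.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_segments(distance_matrix, num_persons):
    """distance_matrix: symmetric (num_segments, num_segments) matrix of the
    corrected inter-segment distances. Returns one cluster label per segment."""
    condensed = squareform(np.asarray(distance_matrix), checks=False)
    tree = linkage(condensed, method="average")
    return fcluster(tree, t=num_persons, criterion="maxclust")
```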
First, in step S20, a video is received. Next, in step S22, the video received in step S20 is divided into frames (images). In step S24, timeline segments are formed from the frames obtained as a result of division in step S22. In step S26, the distance between segments is calculated. In step S28, it is determined whether processing is completed for all pairs of segments. If it is determined that processing is completed for all pairs of segments (Yes in step S28), the flow proceeds to step S30, clustering is performed, and the flow ends. On the other hand, if it is determined that processing is not completed for all pairs of segments (No in step S28), the flow returns to step S26, and processing is repeated until the processing is completed for all pairs of segments.
First, in step S261, segments are input. Next, in step S262, for the frames (images) obtained as a result of division in step S22 described above, the distance between frames is calculated. In step S263, it is determined whether processing is completed for all pairs of frames. If processing is completed for all pairs of frames (Yes in step S263), the flow proceeds to step S264, the distance between the segments is calculated, and the flow ends. On the other hand, if it is determined that processing is not completed for all pairs of frames (No in step S263), the flow returns to step S262, and processing is repeated until the processing is completed for all pairs of frames.
Note that persons are assumed to be the target objects in the above-described exemplary embodiments; however, the target objects are not limited to persons, and any objects, such as animals or cars, for example, may be targets.
The foregoing description of the exemplary embodiments of the present invention has been provided for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations will be apparent to practitioners skilled in the art. The embodiments were chosen and described in order to best explain the principles of the invention and its practical applications, thereby enabling others skilled in the art to understand the invention for various embodiments and with the various modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the following claims and their equivalents.