The present disclosure relates to a computer-implemented method for determining similarity in appearance of objects in image frames of at least one video sequence. The disclosure further relates to an image processing system configured to perform the method.
Re-identification in video sequences refers to the process of identifying and tracking an object across different image frames in a video sequence. This may be useful in a number of different applications and for a number of reasons. For example, in surveillance systems, identifying and tracking people or vehicles may be critical for security purposes. In other applications it may be useful to monitor and analyze, for example, the behavior and trajectory of an object over time. Moreover, many automated systems, such as those used in autonomous vehicles or robotics, rely on re-identification and object tracking for decision making.
When tracking objects in a video depicting a scene, object detections in a current frame are associated with object tracks from previous frames. Conventionally, the association has been based on comparing the state, such as a position, of an object detected in the current frame to a predicted state. More recently, methods have been developed that take the appearance of the tracked objects into account.
For this purpose feature vectors, which capture essential characteristics of an object's appearance, can be used in the re-identification. The feature vectors of different frames can be compared or matched to determine a similarity between objects detected in the different frames. Matching algorithms can be used to assess the similarity between the feature vectors to determine if they belong to the same object. One way of performing the re-identification is to use machine learning to extract feature vectors. A machine learning model may be trained on datasets of object images to generate similar feature vectors for images depicting the same physical object, and deviating feature vectors for images depicting different physical objects. A deep learning model may thereby be used as part of a process of tracking objects in a video.
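As a non-limiting illustration of the matching described above, the following Python sketch decides whether two feature vectors are likely to belong to the same physical object by thresholding their L2 distance. The function name, the threshold value and the use of NumPy are assumptions for illustration only, not part of the disclosure.

```python
import numpy as np

def same_object(f1, f2, threshold=1.1):
    # Feature vectors of the same physical object should lie close together
    # in the embedding space; the threshold value here is arbitrary and
    # would in practice be tuned on a validation dataset.
    return bool(np.linalg.norm(np.asarray(f1) - np.asarray(f2)) < threshold)
```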
Despite the recent progress in this technology, re-identification still presents challenges for certain setups and scenarios. As an example, imaging configurations that are optimized for human users typically involve digital processing of the captured images, which may cause failure in re-identifying an object in different image frames.
It is an object of the present disclosure to provide a system and a method for improving the re-identification of objects in image frames of at least one video sequence.
The present disclosure relates to, according to a first embodiment, a computer-implemented method for determining similarity in appearance of objects in image frames of at least one video sequence, comprising the steps of:
processing first raw image data of a first image frame in a first image processing pipeline applying first image processing settings, thereby obtaining a processed first image frame;
detecting a first object in the processed first image frame;
extracting, from a first image area comprising the first object, one or more first feature vectors describing the appearance of the first object;
processing second raw image data of a second image frame in a second image processing pipeline applying second image processing settings, thereby obtaining a processed second image frame;
detecting a second object in the processed second image frame;
extracting, from a second image area comprising the second object, one or more second feature vectors describing the appearance of the second object;
comparing the one or more first feature vectors to the one or more second feature vectors to determine a similarity of the first object and the second object; and,
if the one or more first feature vectors and the one or more second feature vectors differ by more than a first threshold and the first image processing settings differ from the second image processing settings by more than a second threshold: re-processing the first raw image data corresponding to the first image area applying the second image processing settings, extracting, from the first image area, one or more updated first feature vectors describing the appearance of the first object, and comparing the one or more updated first feature vectors to the one or more second feature vectors.
In modern camera systems there is often an image processing pipeline in which the raw image data is processed. For example, the image processing pipeline may reduce noise, such as artifacts and image sensor noise, and adjust contrast, brightness, color and other image properties. The image processing pipeline may operate using a set of image processing settings, which may, for example, be influenced by or depend on environmental conditions, such as lighting or weather conditions, noise etc. It has been found that the variation in image processing settings between frames may cause differences in the feature vectors that are extracted from the processed images, which in turn may cause failure in re-identifying the object.
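A toy sketch of such an image processing pipeline is given below, applying only a gain (in dB) and a per-channel white-balance multiplier; real pipelines also perform demosaicing, denoising, tone mapping etc. The settings keys gain_db and white_balance are hypothetical names used for illustration.

```python
import numpy as np

def process_raw(raw, settings):
    # Convert the gain from decibels to a linear amplitude factor and
    # apply it together with a per-channel white-balance multiplier.
    linear_gain = 10 ** (settings["gain_db"] / 20)
    img = raw.astype(np.float32) * linear_gain
    img *= np.asarray(settings["white_balance"])  # e.g. [r_gain, g_gain, b_gain]
    return np.clip(img, 0, 255).astype(np.uint8)
```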
By re-processing the first raw image data applying the second image processing settings if both the one or more first feature vectors and the one or more second feature vectors differ by more than a first threshold and the first image processing settings differ from the second image processing settings by more than a second threshold, it is possible to obtain one or more updated first feature vectors that can be re-compared against the one or more second feature vectors. The updated first feature vectors, i.e., feature vectors extracted from image data that has been re-processed using the image processing settings of another image frame, can be used in the second comparison to determine whether the difference between the first feature vectors and the second feature vectors was caused by actual differences in appearance, or whether it was caused by the fact that different image processing settings were used for the two image frames.
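The two-threshold logic described above may be sketched as follows. The helper functions process_raw, extract_features and settings_difference are assumed to be supplied by the surrounding system; their names and signatures are illustrative only.

```python
import numpy as np

def compare_with_reprocessing(raw1, settings1, settings2, feats1, feats2,
                              first_threshold, second_threshold,
                              process_raw, extract_features,
                              settings_difference):
    # First comparison: distance between the first and second feature vectors.
    distance = np.linalg.norm(feats1 - feats2)
    if (distance > first_threshold and
            settings_difference(settings1, settings2) > second_threshold):
        # Re-process the first raw image data with the *second* settings,
        # then extract updated first feature vectors and re-compare.
        updated_frame = process_raw(raw1, settings2)
        updated_feats1 = extract_features(updated_frame)
        distance = np.linalg.norm(updated_feats1 - feats2)
    return distance
```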
The step of comparing the one or more updated first feature vectors to the one or more second feature vectors can be used to update the determined similarity of the first object and the second object. One embodiment of the presently disclosed computer-implemented method for determining similarity in appearance of objects in image frames of at least one video sequence comprises the step of updating the determined similarity of the first object and the second object by comparing the one or more updated first feature vectors to the one or more second feature vectors.
In order to be able to re-process the first raw image data corresponding to the first image area applying the second image processing settings, it may be necessary to temporarily store the first raw image data, preferably in a storage medium such as a random-access memory, at least until the one or more first feature vectors and the one or more second feature vectors have been compared. In the event that the first image processing settings differ from the second image processing settings by more than a second threshold, the first raw image data has to be stored until it has been re-processed. If the criteria for re-processing are not fulfilled, the first raw image data can be discarded; otherwise, the first raw image data can be discarded after the re-processing. It is possible to temporarily store the first raw image data only for the first image area to avoid unnecessary use of memory. In the same way, the re-processing may only need to be carried out for the first image area.
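One conceivable way to handle this temporary storage is a small keyed buffer, as in the sketch below; the class and its interface are assumptions, not part of the disclosure.

```python
class RawAreaBuffer:
    """Temporarily holds raw image data for object image areas only."""

    def __init__(self):
        self._areas = {}

    def store(self, frame_id, raw_area):
        # Keep only the raw data of the image area, not the whole frame,
        # to avoid unnecessary use of memory.
        self._areas[frame_id] = raw_area

    def get(self, frame_id):
        return self._areas.get(frame_id)

    def discard(self, frame_id):
        # Called once the comparison, and any re-processing, is complete.
        self._areas.pop(frame_id, None)
```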
The presently disclosed computer-implemented method for determining similarity in appearance of objects in image frames of at least one video sequence can be performed repeatedly, for example, at fixed time intervals or at every n-th image frame in a video sequence to track an object. As would be realized by a person skilled in the art, the first/second raw image data, image frame, image settings, processed image, feature vectors etc. can be extended to third, fourth, fifth raw image data, image frame, image settings, processed image, feature vectors etc. with corresponding comparisons.
The present disclosure further relates to an image processing system comprising processing circuitry configured to carry out any embodiment of the presently disclosed computer-implemented method for determining similarity in appearance of objects in image frames of at least one video sequence.
The system may further comprise one or more image sensors, preferably in one or more cameras, for capturing the image frames processed by the processing circuitry. The system may further comprise one or more displays for displaying the processed image frames, for example, in the form of one or more video sequences. The objects that are detected can be labelled on the display. Alternatively, or in combination, the processing circuitry may be configured to generate or extract metadata in which the objects are labelled according to, for example, their properties, object identifiers, or which object class they belong to. If the method for determining similarity in appearance of objects determines that the first object in the processed first image frame is the same as the second object in the processed second image frame, or at least that the likelihood is greater than a predefined threshold, a user may be able to follow the object with an appropriate label. Alternatively, or in combination, the extracted information may be used in other applications, such as for analytical purposes.
A person skilled in the art will recognize that the presently disclosed method for improving the re-identification of objects in image frames of at least one video sequence may be performed using any embodiment of the presently disclosed image processing system, and vice versa.
Various embodiments are described hereinafter with reference to the drawings. The drawings are examples of embodiments and are intended to illustrate some of the features of the presently disclosed method and system for determining similarity in appearance of objects in image frames of at least one video sequence.
The present disclosure relates to a computer-implemented method for determining similarity in appearance of objects in image frames of at least one video sequence. The disclosure further relates to a method of tracking one or more objects in at least one video sequence using the presently disclosed method for determining similarity in appearance of objects in image frames of at least one video sequence.
The method comprises the steps of:
processing first raw image data of a first image frame in a first image processing pipeline applying first image processing settings, thereby obtaining a processed first image frame;
detecting a first object in the processed first image frame;
extracting, from a first image area comprising the first object, one or more first feature vectors describing the appearance of the first object;
processing second raw image data of a second image frame in a second image processing pipeline applying second image processing settings, thereby obtaining a processed second image frame;
detecting a second object in the processed second image frame;
extracting, from a second image area comprising the second object, one or more second feature vectors describing the appearance of the second object; and
comparing the one or more first feature vectors to the one or more second feature vectors to determine a similarity of the first object and the second object.
The method further comprises the steps of, if the one or more first feature vectors and the one or more second feature vectors differ by more than a first threshold and the first image processing settings differ from the second image processing settings by more than a second threshold:
re-processing the first raw image data corresponding to the first image area applying the second image processing settings;
extracting, from the first image area, one or more updated first feature vectors describing the appearance of the first object; and
comparing the one or more updated first feature vectors to the one or more second feature vectors.
The method may further comprise the step of displaying the first object and/or the second object, preferably adding a label to indicate an identity of the object being tracked.
Moreover, if the presently disclosed method for determining similarity in appearance of objects determines that the similarity between the objects in the image frames is above a similarity threshold, the first object may be associated with the second object. This may include, for example, providing an indication that the first and second objects are considered to be the same object. This may include the step of outputting metadata, for example in the form of a common label, indicative of an association between the first and second object.
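A minimal sketch of such an association step is shown below; the similarity threshold and the metadata schema (a common label field) are hypothetical choices for illustration.

```python
def associate_objects(track_label, detection_bbox, similarity,
                      similarity_threshold=0.8):
    # If the similarity is above the threshold, the first and second objects
    # are considered the same, and metadata with a common label is emitted.
    if similarity > similarity_threshold:
        return {"label": track_label, "bbox": detection_bbox}
    return None  # no association; the detection may start a new track
```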
The step of detecting a first object in the processed first image frame and the step of detecting a second object in the processed second image frame may be done in several ways. The step may, for example, include the use of a convolutional neural network, and may detect objects in substantially real-time. One example of a method for object detection is the method described in J. Redmon, S. Divvala, R. Girshick and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016, pp. 779-788, doi: 10.1109/CVPR.2016.91, hereinafter referred to as YOLO. The YOLO method divides an input image into a grid of cells. For each cell, bounding boxes are predicted using a single convolutional network. For each bounding box, likelihoods that the detected object belongs to different predefined classes are calculated. Low-confidence bounding boxes can then be filtered out. As a person skilled in the art would realize, the presently disclosed computer-implemented method for determining similarity in appearance of objects may use the YOLO method or other similar methods capable of detecting objects in images.
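The confidence filtering mentioned above can be sketched as follows; the detector producing the boxes, confidences and class labels is assumed to exist, and the threshold value is illustrative. A real YOLO-style pipeline would additionally apply non-maximum suppression.

```python
def filter_detections(boxes, confidences, classes, conf_threshold=0.5):
    # Keep only bounding boxes whose confidence meets the threshold.
    return [(box, conf, cls)
            for box, conf, cls in zip(boxes, confidences, classes)
            if conf >= conf_threshold]
```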
As would be recognized by a person skilled in the art, the step of extracting one or more first feature vectors describing the appearance of the first object and/or extracting one or more second feature vectors describing the appearance of the second object may also be carried out in several ways. For example, the step of extracting one or more first feature vectors and/or extracting one or more second feature vectors may comprise applying a machine learning model, such as a neural network. For this purpose the method may use a convolutional neural network or equivalent. The convolutional neural network or equivalent will preferably have been pre-trained to extract feature vectors. During training of such a network it may, for example, be possible to use triplets of images. The triplets may comprise an anchor, a positive and a negative. The triplets may comprise roughly aligned matching/non-matching objects. The loss function, which is a function that quantifies the difference between the predicted output and the ground truth, may involve known technologies such as triplet loss and softmax loss. One example of a convolutional neural network for extracting feature vectors, and the training of the same, is described in Schroff, F., Kalenichenko, D. and Philbin, J. (2015) “Facenet: A Unified Embedding for Face Recognition and Clustering”. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, 7-12 Jun. 2015, 815-823, hereinafter referred to as FaceNet. Accordingly, the machine learning model may have been trained to calculate feature vectors that are closer to each other for objects which are the same and further apart from each other for objects which are different. That is, the distance, as measured by a suitable distance metric such as the L2-norm, between feature vectors of objects which are the same is shorter than the distance between feature vectors of objects which are different. The distance between the feature vectors may thus be used to measure a similarity of the feature vectors and thereby the objects, wherein the similarity increases with decreasing distance between the feature vectors.

The first and second image frames may be image frames from one image sensor of a camera at different points in time, wherein the first image frame and second image frame are image frames from one video sequence. The first image frame may, for example, be captured before the second image frame, or the second image frame may be captured before the first image frame. It is also possible that the first image frame and second image frame are image frames from multiple image sensors. The multiple image sensors may be multiple image sensors depicting different views of a scene. For such embodiments, the first and second image frames may still be image frames at different points in time. Alternatively, the first and second image frames are image frames captured at the same time. Some cameras have multiple image sensors for capturing a wider, or even a full 360 degree, view. The first and second image frames may be image frames from two of these image sensors, and, as would be understood by a person skilled in the art, this may be further extended to any number of image sensors and image frames. The multiple image sensors may also be sensors of different cameras. In the embodiment in which the first image frame and the second image frame are image frames from multiple image sensors, the first image frame and second image frame may be image frames from different video sequences.
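As a concrete illustration of the triplet training described above, the following sketch computes a triplet loss for one (anchor, positive, negative) triplet of embeddings, in the style of FaceNet; the margin value is an arbitrary example.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Squared L2 distances between the embeddings.
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    # Hinge loss: the positive pair should be at least `margin`
    # closer than the negative pair.
    return float(max(d_pos - d_neg + margin, 0.0))
```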
The method may further comprise the step of determining a similarity of the first object and the second object by comparing the one or more first feature vectors to the one or more second feature vectors. In this regard feature vectors may be seen as numerical representations of an object's visual characteristics extracted from the processed first and second image frames, respectively. The term ‘feature vector’ is commonly used in machine learning and would as such be known by a person skilled in the art. The one or more first feature vectors and the one or more second feature vectors may also be referred to as re-identification vectors. Comparing the one or more first feature vectors to the one or more second feature vectors may involve assessing the similarity or dissimilarity between the one or more first feature vectors and the one or more second feature vectors. A number of possible similarity metrics can be used for this purpose. The choice of similarity metric may depend on factors such as the content of the scene, properties of the objects, and parameters and requirements related to the re-identification. Possible similarity metrics include, but are not limited to, Euclidean distance, Manhattan distance, Jaccard similarity and Hamming distance. It can be noted that the step of comparing the one or more updated first feature vectors to the one or more second feature vectors may perform the same comparison, i.e., the second comparison may apply, for example, the above described similarity metrics. Accordingly, the step of comparing the one or more first feature vectors to the one or more second feature vectors and/or the step of comparing the one or more updated first feature vectors to the one or more second feature vectors may comprise using a similarity metric to calculate a similarity between the one or more first feature vectors and the one or more second feature vectors, and thereby also a similarity between the first object and the second object.
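For illustration, the similarity metrics listed above may be computed as in the sketch below; the Jaccard and Hamming variants assume binary feature vectors.

```python
import numpy as np

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def manhattan_distance(a, b):
    return float(np.sum(np.abs(a - b)))

def hamming_distance(a, b):
    # Fraction of vector positions that differ (binary vectors).
    return float(np.mean(a != b))

def jaccard_similarity(a, b):
    # Intersection over union of the "on" positions (binary vectors).
    a, b = a.astype(bool), b.astype(bool)
    return float(np.logical_and(a, b).sum() / np.logical_or(a, b).sum())
```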
A further option when comparing the one or more first feature vectors to the one or more second feature vectors is to include a first step of determining which object class the first and second object belong to. This is a task that a machine learning algorithm can typically perform with a high level of accuracy. The step of determining which object class the first and second object belong to may in practice be performed as part of the step of detecting the first and second object in the processed first image frame and processed second image frame, respectively. Commonly known object detectors, such as the above-mentioned YOLO and further developments of it, are typically capable of determining both a position and an object class for the detected objects. The presently disclosed computer-implemented method for determining similarity in appearance of objects is not limited to any specific object detection algorithm or architecture. In certain cases, object detectors only perform object localization and detection, without classification. For such applications it is possible to use an additional classifier to determine which object class the first and second object belong to. Examples of architectures capable of performing object classification include AlexNet, VGG16, GoogLeNet, EfficientNet and RegNet. According to one embodiment the method may comprise the step of determining whether the first object and the second object are of the same object class. If the first object and the second object are not of the same class, it can be assumed that they are not the same object. There is then no need to further compare the feature vectors.
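A sketch of such a class gate is given below; the dictionary keys object_class and features are hypothetical and used for illustration only.

```python
def compare_if_same_class(det1, det2, similarity_fn):
    # Objects of different classes cannot be the same physical object,
    # so the feature-vector comparison can be skipped entirely.
    if det1["object_class"] != det2["object_class"]:
        return 0.0
    return similarity_fn(det1["features"], det2["features"])
```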
While the objects may be any objects, the presently disclosed method for determining similarity in appearance of objects may be particularly useful for humans. In one embodiment the first and second object are humans. In a number of surveillance scenarios it may be useful to be able to track an individual in one or more video sequences. Modern machine learning-based re-identification methods have the ability to re-identify and track an individual even if the individual, for example, turns around or changes body position between image frames. The presently disclosed method for determining similarity in appearance of objects can further improve such re-identification by ensuring that the comparisons of feature vectors are done using the same image processing settings if there is doubt as to whether the objects are the same. The presently disclosed method for determining similarity in appearance of objects may also be used in applications in which the objects are vehicles. In one embodiment the first and second object are, accordingly, vehicles. Further applications are possible. For example, animals, items in warehouses and shops, and transportation goods can be re-identified and tracked.
If the one or more first feature vectors and the one or more second feature vectors differ by more than a first threshold, and, additionally, the first image processing settings differ from the second image processing settings by more than a second threshold, the first raw image data is re-processed. The first image processing settings differing from the second image processing settings by more than a second threshold may mean that one or more parameters that are part of the processing settings, for example, a gain and/or exposure compensation and/or white balance and/or color adjustment, such as local tone mapping, are quantifiable, so that a difference between two sets of settings can be expressed numerically. As an example, a “gain” in the context of processing a raw image may refer to an adjustment of the intensity or brightness of the image. The gain may be expressed in decibels (dB), which can be compared from one image processing setting to another. Similar quantifiable levels exist for the other settings listed above. The re-processing may be carried out on the first raw image data corresponding to the first image area. In order to reduce computations, as well as the amount of temporarily stored data, only a portion covering the first image area may need to be re-processed. In this case only the raw image data corresponding to the portion covering the first image area needs to be temporarily stored for re-processing. The re-processing is performed on the first raw image data using the second image processing settings. The re-processing generates a new processed image, which can be referred to as an updated processed first image. Optionally, it is now also possible to re-detect an object, such as the first object, in the updated processed first image and/or extract, from the first image area, one or more updated first feature vectors describing the appearance of the first object. The one or more updated first feature vectors can then be re-compared to the one or more second feature vectors.
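A quantified settings difference of the kind described above could, as one assumption-laden example, be based on the absolute gain difference in dB; further parameters such as exposure compensation or white balance could be folded in similarly.

```python
def settings_difference(settings_a, settings_b):
    # Compare two sets of image processing settings by their gain in dB.
    # The key name and the single-parameter comparison are illustrative.
    return abs(settings_a["gain_db"] - settings_b["gain_db"])
```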
As would generally be understood by a person skilled in the art, the use of first/second raw image data, image frame, image settings, processed image, feature vectors etc. does not necessarily mean that the first image frame is captured before the second image frame. If, for example, the first image frame is captured before the second image frame, it is possible to re-process the first raw image data corresponding to the first image area applying the second image processing settings and extract one or more updated first feature vectors, but it would also be possible to re-process the second raw image data corresponding to the second image area applying the first image processing settings and extract one or more updated second feature vectors, which could then be compared to the one or more first feature vectors. Both of these variants may be useful. For this reason the first/second raw image data, image frame, image settings, processed image, feature vectors etc. may refer to images from different points in time, wherein the “first” is not necessarily captured before the “second”. The first image area comprising the first object and the second image area comprising the second object may refer to regions or sub-areas of the first image frame and the second image frame, respectively. Extracting a region or sub-area of an image frame is sometimes also referred to as cropping. In the present disclosure the first image area and second image area may refer to an area encapsulating the first/second object, wherein the area is smaller than the area of the entire first/second image frame. The first image area and second image area are typically, but not necessarily, rectangular or square-shaped.
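Cropping an image area from a frame, as discussed above, may be sketched as follows; the bounding-box format (x, y, width, height) is an assumption.

```python
def crop_image_area(frame, bbox):
    # Extract the sub-area encapsulating a detected object from a frame
    # stored as a NumPy array of shape (height, width, channels).
    x, y, w, h = bbox
    return frame[y:y + h, x:x + w]
```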
In one embodiment of the presently disclosed method for determining similarity in appearance of objects, the step of comparing the one or more first feature vectors to the one or more second feature vectors and/or the step of comparing the one or more updated first feature vectors to the one or more second feature vectors comprises applying a machine learning model, such as a neural network, that has been trained to compare the appearance of two objects based on feature vectors describing the appearance of the two objects. Training a machine learning model, such as a neural network, may typically involve collecting a training dataset including annotations or other information indicating which objects are the same or different in the different frames. As previously explained, a machine learning model may be trained, for example, by using triplet training, on images of objects which are known to be the same or different, to calculate feature vectors that are closer to each other for objects which are the same and further apart from each other for objects which are different. When the one or more first and second feature vectors have been extracted, the comparison of the feature vectors may be done using a similarity/dissimilarity measure. As described above, one example of a similarity metric is Euclidean distance. In the example of using the FaceNet system to measure a similarity, squared L2 distances are used to determine object similarity. The feature vectors may be provided in any suitable format, such as arrays, database entries, text files, binary formats etc. The training may involve other well-known steps, such as data augmentation, transfer learning, validation etc.
As described above, in one embodiment of the presently disclosed method for determining similarity in appearance of objects, the first raw image data corresponding to at least the first image area is temporarily stored until the one or more first feature vectors and the one or more second feature vectors have been compared, and, in the event that the first image processing settings differ from the second image processing settings by more than a second threshold, until the first raw image data has been re-processed. The first raw image data and/or the second raw image data may be temporarily stored until the one or more first feature vectors have been compared to the one or more second feature vectors and/or until the first raw image data has been re-processed. After the comparison the raw image data may be discarded. In one embodiment the first raw image data and/or the second raw image data is discarded after the one or more first feature vectors have been compared to the one or more second feature vectors and/or the first raw image data has been re-processed.
The present disclosure further relates to a computer program having instructions which, when executed by a computing device or computing system, cause the computing device or computing system to carry out any embodiment of the presently disclosed method for determining similarity in appearance of objects. The computer program may be stored on any suitable type of storage media, such as non-transitory storage media.
As stated above the present disclosure further relates to an image processing system comprising processing circuitry configured to carry out any embodiment of the presently disclosed method for determining similarity in appearance of objects. The processing circuitry may comprise a single image processing pipeline, wherein the first image processing pipeline is the same image processing pipeline as the second image processing pipeline, or separate image processing pipelines for processing the first raw image data of the first image frame and the second raw image data of the second image frame. The latter may be useful if there are, for example, multiple cameras for capturing different views of a scene.
The image processing system may comprise a machine learning model, such as a neural network, that has been trained to compare the appearance of two objects based on feature vectors describing the appearance of the two objects.
The system may, but does not necessarily have to, include a display for displaying the processed image frames of the at least one video sequence. When tracking an object it may be useful to display the object with, for example, a label or bounding box. However, in other applications the re-identified objects are not necessarily displayed but used in additional applications, including, for example, further analysis.
The system may further comprise peripheral components, such as one or more memory units, which may be used for storing instructions that can be executed by the processing circuitry. The system may further comprise internal and external network interfaces, input and/or output ports etc.
As would be understood by a person skilled in the art, the processing circuitry may comprise a single processor or a processing unit in a multi-core/multiprocessor system. The processing circuitry may be connected to a data communication infrastructure.
The system may include one or more memory units, such as a random access memory (RAM) and/or a read-only memory (ROM), or any suitable type of memory. The system may further comprise a communication interface that allows software and/or data to be transferred between the system and external devices. Software and/or data transferred via the communication interface may be in any suitable form of electric, optical or RF signals. The communication interface may comprise, for example, a cable or a wireless interface. The camera(s) may be connected to the processing circuitry by means of one or more Ethernet cables, possibly employing Power over Ethernet.
| Number | Date | Country | Kind |
|---|---|---|---|
| 23218017.4 | Dec 2023 | EP | regional |