The present invention relates to the field of unsupervised tracking technology, in particular to a contrastive loss based training strategy for unsupervised multi-object tracking.
Mainstream multi-object tracking algorithms are implemented through object detection and representation-vector extraction. To improve tracking performance, researchers first proposed using an additional appearance-feature extractor to increase the information available when associating objects across frames, but the use of multiple models makes it difficult to meet real-time requirements. To meet real-time requirements, researchers proposed multi-object tracking models based on the Joint Detection and Embedding (JDE) paradigm. However, regardless of the approach, as long as the tracking strategy uses the correlation between objects in previous and subsequent frames, it requires extremely labor-intensive trajectory annotation.
Existing methods treat embedding training as classification, which brings new problems. They treat each trajectory in the dataset as a category and constrain the embedding branch by classifying the features it produces. This training strategy achieves good results when the number of trajectories is small, but when the number of trajectories is too large the model becomes difficult to fit (the number of outputs of the fully connected layer is proportional to the number of trajectories), and the inconsistent trajectory lengths in the dataset cause an imbalance in the number of samples per category, which limits the performance of JDE-paradigm trackers. Meanwhile, the JDE paradigm uses a common backbone network to extract unified features for multiple tasks, but there is a certain conflict between the sub-tasks, which weakens the effect of JDE-paradigm models.
Therefore, we design a contrastive loss based training strategy for unsupervised multi-object tracking to provide another technical solution for the above technical problems.
Based on this, it is necessary to provide a contrastive loss based training strategy for unsupervised multi-object tracking to solve the technical problems proposed in the above background technology.
In order to solve the above technical problems, the present invention adopts the following technical scheme:
As a preferred embodiment of the contrastive loss based training strategy for unsupervised multi-object tracking provided by the present invention, an SSCI module is calculated according to the following:
As a preferred embodiment of the contrastive loss based training strategy for unsupervised multi-object tracking provided by the present invention, the positive sample pair is constructed by adjacent frame targets, and the steps are as follows:
As a preferred embodiment of the contrastive loss based training strategy for unsupervised multi-object tracking provided by the present invention, after inputting these sub-videos into a network, the corresponding feature vectors Êt={x1, x2 . . . xkt} and Êt+1={x1, x2 . . . xkt+1} can be obtained according to the detection annotations of frame t and frame t+1;
As a preferred embodiment of the contrastive loss based training strategy for unsupervised multi-object tracking provided by the present invention, the cross-frame expression ability of features is enhanced by forward matching and reverse matching, and the steps are as follows:
As a preferred embodiment of the contrastive loss based training strategy for unsupervised multi-object tracking provided by the present invention, the MOT Challenge comprises MOT17 and MOT20;
the MOT17 dataset comprises a training set and a testing set, the training set contains 5316 frames of images from 7 videos, and the testing set also contains 7 videos and a total of 5919 frames;
The MOT20 dataset comprises a training set and a testing set, the training set accounts for 4 videos and 8931 frames of images, and the testing set accounts for 4 videos and 4479 frames of images.
As a preferred embodiment of the contrastive loss based training strategy for unsupervised multi-object tracking provided by the present invention, a ratio of the training set and the testing set in the MOT17 is 5:5.
There is no doubt that the above technical solutions of this application can solve the technical problems that this application seeks to solve.
Meanwhile, the present invention has at least the following beneficial effects through the above technical scheme:
To explain the technical scheme of the embodiment of the present invention more clearly, a brief introduction will be made to the accompanying drawings used in the embodiments or the description. It is obvious that the drawings in the description below are only some embodiments of the present disclosure, and those ordinarily skilled in the art can obtain other drawings according to these drawings without creative work.
In order to make the objective, technical solution, and advantages of the present invention clearer and more specific, the present invention will be further described in detail below with reference to accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present invention and are not intended to limit the present invention.
In order to make the personnel in the technical field better understand the scheme of the present invention, the following will describe the technical scheme in the embodiment of the present invention clearly and completely in combination with the accompanying drawings.
It should be noted that in the case of no conflict, the embodiments in the present invention and the characteristics and technical schemes in the embodiment can be combined with each other.
It should be noted that similar annotations and letters denote similar items in the following accompanying drawings, therefore, once an item is defined in a figure, it does not need to be further defined and explained in the subsequent figure.
With reference to
The positive and negative sample pairs required for the contrastive loss can be obtained from two priors, that is, the matching pairs obtained by prior 2) are viewed as the positive sample pairs in the contrastive learning, and the embedded features of other objects are taken as negative samples, so as to realize the self-supervised training of the embedding branch.
The JDE tracker has a dataset denoted as {It, Bt, yt}t=1 . . . N during supervised training, where It∈R^(c×h×w) denotes a frame image, Bt∈R^(kt×4) denotes the positions of the kt objects in the current frame image, and yt∈Z^(kt) denotes the trajectory numbers of the kt objects in the current frame. These JDE trackers predict the object positions B̂t∈R^(k̂t×4).
The three most common representation losses are the cross-entropy loss, the triplet loss and the contrastive loss. Their relative constraint purposes are shown in
According to the equation and
The triplet loss does not need to determine the specific category of each feature; it only needs to know whether the features entering the loss calculation belong to the same category. This makes the triplet loss more flexible than the cross-entropy loss, but because it lacks the explicit per-category feature centers of the cross-entropy loss, its effect decreases, and the sampling strategy has an extremely large impact on its effect; the farthest positive sample and the nearest negative sample are therefore used in place of random sampling as an optimization. According to
According to the equation and
A constrained SSCI module is formed by using the relations among objects within a video frame and between objects in adjacent video frames. The SSCI module is only a loss-calculation module; the motivation and basis of its design derive from two key pieces of prior information, namely that objects within the same frame must not be the same, and that objects in adjacent frames can be matched into pairs with high accuracy according to their embedded features. These two priors are shown in
According to the two pieces of prior information shown in
After inputting these sub-videos into a network, the corresponding feature vectors Êt={x1, x2 . . . xkt} and Êt+1={x1, x2 . . . xkt+1} can be obtained according to the detection annotations of frame t and frame t+1, where x denotes the feature vector of the corresponding object, and kt and kt+1 denote the number of objects in the respective frame images. Since the trajectory annotation cannot be used, the cross-entropy loss cannot be used here to construct a constraint on the embedded features, so the present invention uses three variant losses based on the self-supervised contrastive loss as constraints; the original form of the self-supervised contrastive loss is shown in Equation 5:
where sim(xi, xi+) denotes the cosine similarity between the i-th sample and its positive sample, sim(xi, xj) denotes the similarity between the i-th object and a sample other than itself, and T is the temperature that controls the degree of constraint on difficult samples. From this equation it can also be seen that the construction of positive and negative samples is the most important part of the contrastive loss.
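For illustration, the self-supervised contrastive loss of Equation 5 can be sketched in Python as follows. This is a minimal NumPy sketch; the function names (`cosine_sim`, `info_nce_loss`) are illustrative assumptions, not part of the original disclosure.

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two feature vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def info_nce_loss(anchor, positive, negatives, temperature=0.5):
    """Self-supervised contrastive loss for one anchor, following the form
    of Equation 5: -log(exp(sim(x, x+)/T) /
    (exp(sim(x, x+)/T) + sum_j exp(sim(x, x_j)/T)))."""
    pos = np.exp(cosine_sim(anchor, positive) / temperature)
    neg = sum(np.exp(cosine_sim(anchor, n) / temperature) for n in negatives)
    return -np.log(pos / (pos + neg))
```

The loss is small when the anchor is close to its positive and far from its negatives, and grows as the positive drifts away or negatives move closer.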
As shown in
The value of mi,j denotes the cosine similarity between the embedding vectors corresponding to the two objects. As shown in
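The similarity matrix between adjacent frames can be built as below; a minimal sketch in which the function name and the row/column convention (frame t as rows, frame t+1 as columns) are assumptions for illustration.

```python
import numpy as np

def similarity_matrix(feats_t, feats_t1):
    """Cosine-similarity matrix M between the embeddings of frame t (rows)
    and frame t+1 (columns): m[i, j] = cos(x_i, x_j)."""
    a = feats_t / np.linalg.norm(feats_t, axis=1, keepdims=True)
    b = feats_t1 / np.linalg.norm(feats_t1, axis=1, keepdims=True)
    return a @ b.T
```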
For the information condition, the loss function Lsame for the negative samples in the same frame is first designed, as shown in Equation 7:
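Equation 7 itself is not reproduced in this text. As an illustrative assumption consistent with the later remark that Lsame uses simple addition as a loss, one possible sketch sums the pairwise similarities of same-frame embeddings, so that minimizing it pushes intra-frame features apart:

```python
import numpy as np

def l_same(feats):
    """Assumed simple-addition form of the intra-frame loss L_same: objects
    in one frame are guaranteed to be distinct, so sum (and average) the
    pairwise cosine similarities of all same-frame embeddings; minimizing
    this drives intra-frame features apart."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    sim = f @ f.T
    k = len(feats)
    # exclude each object's similarity with itself (the diagonal)
    return float((sim.sum() - np.trace(sim)) / (k * (k - 1)))
```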
The first loss Lsame acts only on objects in the same frame and does not establish constraints on cross-frame objects, yet cross-frame association is the most important ability required for tracking tasks. Therefore, SSCI uses the Hungarian algorithm on Mt,t+1 as the forward matching of frame-t objects to frame-(t+1) objects to obtain the matching pairs of the same object in adjacent frames, that is, the Hungarian operation of Lcross in
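The forward matching step can be sketched with SciPy's Hungarian solver; the function name and the use of `scipy.optimize.linear_sum_assignment` are illustrative assumptions, and the threshold mirrors the training-time matching threshold discussed later.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def forward_match(M, thresh=0.7):
    """Forward matching of frame-t objects to frame-(t+1) objects on the
    similarity matrix M via the Hungarian algorithm; pairs whose
    similarity falls below `thresh` are discarded."""
    rows, cols = linear_sum_assignment(-M)  # negate to maximize similarity
    return [(int(i), int(j)) for i, j in zip(rows, cols) if M[i, j] >= thresh]
```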
Lcycle acts on the elements in Mt+1,t; it uses the forward matching pairs as the reverse matching pairs and does not perform an additional matching operation, that is, the reverse operation of Lcycle in
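The cycle constraint can be sketched as below. The exact equation for Lcycle is not reproduced in this text, so this sketch assumes the InfoNCE form of Equation 5 applied to the reverse matrix Mt+1,t (the transpose of Mt,t+1), reusing each forward pair (i, j) as the reverse pair (j, i) instead of matching again; function and variable names are illustrative.

```python
import numpy as np

def l_cycle(M, forward_pairs, temperature=0.5):
    """Assumed sketch of L_cycle: contrastive constraint on the reverse
    matrix M_{t+1,t} = M.T, reusing the forward pairs (i, j) as reverse
    pairs (j, i) rather than running the Hungarian algorithm again."""
    M_rev = M.T
    losses = []
    for i, j in forward_pairs:
        pos = np.exp(M_rev[j, i] / temperature)
        # all other frame-t objects act as negatives for row j
        neg = np.exp(np.delete(M_rev[j], i) / temperature).sum()
        losses.append(-np.log(pos / (pos + neg)))
    return float(np.mean(losses))
```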
The present invention uses the MOT Challenge datasets, comprising MOT17 and MOT20. The MOT17 dataset comprises a training set and a testing set; the training set contains 5316 frames of images from 7 videos, and the testing set also contains 7 videos with a total of 5919 frames. MOT20 is a denser dataset than MOT17; its training set contains 4 videos and 8931 frames of images, and its testing set contains 4 videos and 4479 frames of images. In this section, apart from the test experiment, the remaining experiments use the first half of the MOT17 data as the training set and the second half as the validation set. In the testing-set experiment, consistent with JDE, FairMOT and Cstrack, the additional CrowdHuman, ETH, CityPersons, CalTech, CUHK-SYSU and PRW datasets are used.
In terms of evaluation metrics, the present invention will use standard MOT Challenge evaluation metrics, and focus on MOTA, IDF1, MT (Mostly Tracked objects), ML (Mostly Lost objects), IDS (Number of Identity Switches) metrics.
In order to ensure the adequacy of the experiment, the present invention applies unsupervised training on FairMOT, Cstrack and OMC for corresponding effect comparison. Meanwhile, to ensure fair comparison, the present invention maintains the standard hyperparameters of these networks. Both Cstrack and OMC use the SGD optimizer and train for 30 epochs; the learning rate is initialized to 5×10−4 and decayed to 5×10−5 at epoch 20, and the weights of the detection loss and the embedding loss use the 1:0.02 ratio from the original paper. FairMOT uses the Adam optimizer and trains for 30 epochs with the learning rate set to 1×10−4; its detection loss and embedding loss use learnable weights. All training in the present invention is carried out on a Tesla V100 GPU. Consecutive frames in unsupervised training are randomly selected from the 10 frames before and after the first frame according to the video frame rate.
The present invention carries out all the validation experiments mentioned above, namely verifying that: 1) the features extracted by a randomly initialized embedding branch can still distinguish objects across short-interval frames; 2) Lsame uses simple addition as a loss, and the effect on the experiment of replacing the contrastive loss with the triplet loss; 3) the competition problem still exists in Cstrack even when using the CCN module.
The key prior that a randomly initialized embedding branch can still produce somewhat discriminative embedded features when the interval between two frames is small is the premise on which Lcross can operate. To verify this prior, the present invention uses the output features of the randomly initialized embedding branch to simulate tracking, and uses these features for matching to observe the accuracy rate.
Specifically, the 28th frame image in the MOT17-09 sequence, together with the images 1, 5, 10 and 20 frames after it, is fed into a network loaded with only COCO pretrained weights (because the pretraining covers only the detection branch, the embedding branch is randomly initialized at this point); the similarity matrix M of the embedded features is calculated, and the Hungarian algorithm is used for matching according to the similarity. The results are shown in
It is also necessary to verify the effect of replacing Equation 7 with Equation 3 and Equation 11 on the experiment.
From
In order to verify whether the competition problem persists, the present invention conducts a simple experiment. As shown in Table 1, the first two rows are the results of Cstrack with an untrained and a trained embedding branch, respectively, and the last two rows are the corresponding results for FairMOT. Because the IDF1 metric is more indicative of the tracking effect and MOTA is more indicative of the detection effect, the present invention lets IDF1 denote the tracking effect and MOTA denote the detection effect. From Table 1, it can be seen that training the embedding branch can indeed greatly improve the tracking effect.
The present invention conducts ablation studies on the three sub-losses, the number of negative samples, the difficult-sample temperature and the training matching threshold, and displays the visualization results. All experiments involved in the present invention are implemented based on FairMOT.
Firstly, the ablation of SSCI is studied.
SSCI consists of three sub-losses: Lsame is responsible for pushing apart the features of different objects within the same frame; Lcross is responsible for pulling together the features of positive sample pairs successfully matched across adjacent frames; Lcycle is responsible for ensuring that the forward and reverse matching results are consistent.
Table 2 shows the effect of using each loss on the validation set, where the result of the fourth row is the effect of supervised training. From Table 2, using only Lsame can already achieve an effect similar to supervision. After adding Lcross and Lcycle, IDF1 is significantly improved and IDS is reduced, that is, the effect of the embedding branch improves, but this also causes a decline in recall (an increase in FN) and a decline in MOTA; the present invention believes this result is caused by competition between the embedding branch and the detection branch.
Since both Lcross and Lcycle are based on the contrastive loss, and the number of negative samples has a large impact on the effect of the contrastive loss, the present invention studies the number of negative samples. Lcross and Lcycle both constrain the positive sample pairs that are successfully matched, so the remaining objects in the current two frames can naturally be regarded as negative samples. Meanwhile, because the MOT17 dataset is composed of multiple video segments, objects from different videos can be assumed to be distinct, so the present invention fills in objects from different videos in the same batch as negative samples. These negative samples filled in from different video segments are regarded as additional negative samples, and their number is analyzed. Table 3 shows the effect of FairMOT when using different numbers of negative samples, where Nt is the number of objects in the first frame. From Table 3, more negative samples generally bring higher IDF1 but at the same time reduce MOTA; therefore, to balance the most critical MOTA and IDF1 metrics, SSCI finally chose Nneg/Nt=2.
The self-supervised contrastive loss uses a temperature T to control the weight of difficult samples (see Equation 5, Equation 7, Equation 8 and Equation 9); its original formulation sets the temperature to 0.5 and notes that the optimal value differs by task. Therefore, the present invention compares the effects of different fixed T values in Table 4 and adds a comparison with an adaptive T value. From the results in the table, T=2 achieves the best results among fixed values, but a dynamically obtained T achieves the best results by adapting to the number of objects, so the T of SSCI is set to T=½·log(Nt+Nt+1+1).
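The adaptive temperature can be computed as below; a one-line sketch that assumes the reading T = ½·log(Nt + Nt+1 + 1) (with a natural logarithm), which is consistent with the observation that a fixed T=2 is nearly optimal for typical object counts.

```python
import numpy as np

def adaptive_temperature(n_t, n_t1):
    """Adaptive temperature T = 1/2 * log(N_t + N_{t+1} + 1), so the
    weighting of difficult samples scales with the total number of
    objects in the two frames."""
    return 0.5 * np.log(n_t + n_t1 + 1)
```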
Since Lcross and Lcycle need the linear matching of the Hungarian algorithm to construct positive sample pairs during training, the threshold in the Hungarian algorithm inevitably affects the correctness and number of pairs, and thus the final effect. The present invention compares the effects of different thresholds in Table 5, where Nmatch and Nright denote, respectively, the proportion of successful matches to the total number of objects in the last epoch of training, and the proportion of correct matches to successful matches. From the table, a higher threshold leads to a significant reduction in the number of successful matches without raising the accuracy rate much, while a lower threshold increases the number of matches but reduces the accuracy rate. According to the experimental results, SSCI finally chose thresh=0.7.
Finally, a series of visualizations of the features generated by the embedding branch trained with SSCI is presented to show an effect comparable to supervised learning.
Firstly, the present invention uses the feature heat map response diagram to demonstrate the discriminative ability of the features obtained by unsupervised embedding training.
Table 6 lists the results of the multi-object tracking algorithm trained by the present invention compared with current advanced supervised and unsupervised tracking algorithms on the MOT17 dataset. It can be seen that the present invention obtains performance similar to its corresponding supervised method on the main tracking metrics; obtaining an effect similar to the supervised method without using trajectory annotation makes this a viable training mode. Compared with other unsupervised algorithms, only OUTrack, which uses additional supervised signals, achieves better results than the present invention; this proves that the present invention is close to the best among unsupervised tracking methods. Table 7 lists the corresponding comparison of the present invention with current advanced supervised and unsupervised tracking algorithms on the MOT20 dataset.
The preferred embodiments of the present invention disclosed above are intended only to help illustrate the present invention. The preferred embodiments do not set forth all details, nor do they limit the present invention to the specific embodiments described. Obviously, many modifications and variations are possible in light of the above specification. The embodiments were chosen and described in the specification in order to better explain the principles of the present invention and its practical application, so that technical personnel in the field can well understand and use the present invention. The present invention is limited only by the claims and their full scope and equivalents.
Number | Date | Country | Kind |
---|---|---|---|
2023106318958 | May 2023 | CN | national |