This application claims the priority benefit of Chinese Patent Application No. 202210793867.1, filed on Jul. 7, 2022 in the China National Intellectual Property Administration, the disclosure of which is incorporated herein in its entirety by reference.
The present disclosure relates generally to information processing and computer vision, and more particularly, to a method, an apparatus and a storage medium for multi-target multi-camera tracking.
With the development of computer science and artificial intelligence, it has become increasingly common and effective to use computers to run artificial intelligence models for information processing. Computer vision is an important application field of artificial intelligence models.
Multi-target tracking is a hot topic in computer vision technology. Multi-target tracking, commonly abbreviated as MTT (Multiple Target Tracking; sometimes also abbreviated as MOT: Multiple Object Tracking), is used to detect targets of types of interest, such as pedestrians, automobiles and/or animals, in a video and to assign identifiers (IDs) to them, so as to perform trajectory tracking without knowing the number of the targets in advance. A desired tracking result is that different targets own different IDs, enabling work such as precise tracking and precise searching. MTT is a key technology in the field of computer vision, and has been widely applied in fields such as autonomous driving, intelligent monitoring, and behavior recognition.
In multi-target tracking, for an input video, a tracking result of targets is output. In a tracking result image, each target is indicated by, for example, a rectangular bounding box with a corresponding ID number. In an image sequence of multiple frames of a video, the moving trajectory of bounding boxes sharing the same ID can be regarded as the trajectory of the target with that ID. In these multiple frames, the image block sequence of the image blocks indicated by the bounding boxes of the ID is referred to as a tracklet. In a tracklet, each image block can be regarded as one frame of the tracklet, and each frame can be assigned information representing the time and spatial position of a target trajectory.
Since a single camera providing an input video monitors a limited space, practical video monitoring (tracking) applications may use multiple cameras to perform monitoring and target tracking over a larger space. This involves Multi-Target Multi-Camera Tracking (MTMCT). MTMCT, for example, processes input image sequences from multiple cameras and outputs identified image sequences, wherein, if the same target appears in image sequences from different cameras, it is desired that bounding boxes with the same target ID identify the image blocks of that target, regardless of whether it crosses cameras. The image blocks corresponding to the bounding boxes of the same target ID constitute a cross-camera tracklet of the target with that target ID. That is, two frames of images in a single tracklet can come from different cameras.
At present, multi-camera multi-target tracking technology mainly includes two stages: single-camera target tracking, and inter-camera matching.
The adverse factors affecting the accuracy of a multi-camera multi-target tracking result include occlusion, illumination, pose changes, etc. It is challenging to improve the accuracy of a multi-camera multi-target tracking result.
A brief summary of the present disclosure is given below to provide a basic understanding of some aspects of the present disclosure. It should be understood that this summary is not exhaustive. It is not intended to identify key or important parts of the present disclosure, nor to limit the scope of the present disclosure. Its object is only to present some concepts briefly, as a preamble to the detailed description that follows.
The technical problems to be solved by embodiments of the present disclosure include, but are not limited to, at least one of: reducing incorrect cross-camera target trajectories, reducing identification-switch (ID-switch), and reducing incorrect target identifier assignment.
According to an aspect of the present disclosure, there is provided a method for multi-target multi-camera tracking. The method comprises: determining an overall local target trajectory set including a local target trajectory set of each camera by performing single-camera multi-target tracking on a corresponding image sequence provided by each camera of a plurality of cameras; and determining a global target trajectory set for the plurality of cameras by performing multi-camera multi-target matching on the overall local target trajectory set; wherein determining the global target trajectory set comprises: determining a cluster matched global trajectory set by clustering local target trajectories in the overall local target trajectory set; determining a cost-minimum path set by implementing a cost-minimum path algorithm on a directed graph constructed with each trajectory in the cluster matched global trajectory set as a vertex; and merging corresponding trajectories in the cluster matched global trajectory set based on the cost-minimum path set.
According to another aspect of the present disclosure, there is provided an apparatus for multi-target multi-camera tracking. The apparatus comprises: a memory having instructions stored thereon; and at least one processor configured to execute the instructions to: determine an overall local target trajectory set including a local target trajectory set of each camera by performing single-camera multi-target tracking on a corresponding image sequence provided by each camera of a plurality of cameras; and determine a global target trajectory set for the plurality of cameras by performing multi-camera multi-target matching on the overall local target trajectory set; wherein determining the global target trajectory set comprises: determining a cluster matched global trajectory set by clustering local target trajectories in the overall local target trajectory set; determining a cost-minimum path set by implementing a cost-minimum path algorithm on a directed graph constructed with each trajectory in the cluster matched global trajectory set as a vertex; and merging corresponding trajectories in the cluster matched global trajectory set based on the cost-minimum path set.
According to still another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing a program. When the program is executed by a computer, the computer implements operations of: determining an overall local target trajectory set including a local target trajectory set of each camera by performing single-camera multi-target tracking on a corresponding image sequence provided by each camera of a plurality of cameras; and determining a global target trajectory set for the plurality of cameras by performing multi-camera multi-target matching on the overall local target trajectory set; wherein determining the global target trajectory set comprises: determining a cluster matched global trajectory set by clustering local target trajectories in the overall local target trajectory set; determining a cost-minimum path set by implementing a cost-minimum path algorithm on a directed graph constructed with each trajectory in the cluster matched global trajectory set as a vertex; and merging corresponding trajectories in the cluster matched global trajectory set based on the cost-minimum path set.
The beneficial effects of the methods, apparatuses and storage media of the present disclosure include at least one of: improving the accuracy of a result of multi-camera multi-target tracking, and reducing identification-switch.
Embodiments of the present disclosure will be described below with reference to the accompanying drawings, which will help to more easily understand the above and other objects, features and advantages of the present disclosure. The accompanying drawings are merely intended to illustrate the principles of the present disclosure. The sizes and relative positions of units are not necessarily drawn to scale in the accompanying drawings. The same reference numbers may denote the same features. In the accompanying drawings:
Hereinafter, exemplary embodiments of the present disclosure will be described in conjunction with the accompanying drawings. For the sake of clarity and conciseness, the specification does not describe all features of actual embodiments. However, it should be understood that many decisions specific to the embodiments may be made in developing any such actual embodiment so as to achieve the developer's specific objects, and these decisions may vary from one embodiment to another.
It should also be noted herein that, to avoid obscuring the present disclosure with unnecessary details, only those device structures closely related to the solution according to the present disclosure are shown in the accompanying drawings, while other details not closely related to the present disclosure are omitted.
It should be understood that the present disclosure is not limited to the embodiments described below with reference to the accompanying drawings. Herein, where feasible, embodiments may be combined with each other, features may be substituted or borrowed between different embodiments, and one or more features may be omitted in an embodiment.
Computer program code for performing operations of various aspects of embodiments of the present disclosure can be written in any combination of one or more programming languages, the programming languages including object-oriented programming languages, such as Java, Smalltalk, C++ and the like, and further including conventional procedural programming languages, such as “C” programming language or similar programming languages.
Methods of the present disclosure can be implemented by circuitry having corresponding functional configurations. The circuitry includes circuitry for a processor.
An aspect of the present disclosure relates to a method for Multi-Target Multi-Camera Tracking (MTMCT).
In operation S101, an overall local target trajectory set including a local target trajectory set of each camera is determined by performing single-camera multi-target tracking on a corresponding image sequence provided by each camera of a plurality of cameras. The operation of this step is marked as a “single-camera multi-target tracking operation Op_mtt”.
The plurality of cameras include, for example, a first camera Cam[1] that monitors a first local space (e.g., a first room) and a second camera Cam[2] that monitors a second local space (e.g., a second room) adjacent to the first local space. It can be understood that the plurality of cameras may include cameras whose monitoring areas overlap.
The local target trajectory set TJs[c] includes, for example, a plurality of local target trajectories TJ[c][jCstart] to TJ[c][jCend], a single local trajectory (when only one target of a type of interest appears in the corresponding image sequence SqIm[c]), or zero local trajectories (an empty set; no target of a type of interest appears in the corresponding image sequence SqIm[c]).
One local target trajectory TJ[c][jc] corresponds to one tracklet Trk[c][jc].
An overall local target trajectory set LTJs is the union of the single-camera target trajectory sets TJs[cStart] to TJs[cEnd]. Since no multi-camera multi-target matching across cameras (i.e., cross-camera trajectory matching) has been performed, for a circumstance where a specific target Tg[x] has appeared in videos from both a first and a second camera, the two local target trajectories of the target Tg[x] corresponding to the first and second cameras are not identified with the same ID in the overall local target trajectory set LTJs. It is desired that these two local target trajectories be identified with the same ID once they are found to match (i.e., to correspond to the same target) in subsequent operations, and be merged into one target trajectory identified with that ID. Such a target trajectory obtained by merging target trajectories of a plurality of cameras can be referred to as a "cross-camera target trajectory". As an example, if single-camera multi-target tracking is performed on 2 image sequences provided by 2 cameras to obtain 12 local target trajectories, and 2 targets have appeared in the monitoring spaces of both cameras, then in an ideal case the global target trajectory set ultimately obtained through subsequent processing comprises only 10 global target trajectories. In other words, when videos from a plurality of cameras have only been subjected to single-camera multi-target tracking but not yet to multi-camera multi-target matching, if videos with added target identifiers (e.g., a bounding box with a corresponding identifier added to each target image block) are played, different target identifiers in videos from different cameras may correspond to the same target (for example, the #1 bounding box in video 1 and the #3 bounding box in video 2 may correspond to the same person). That is, the target identifier at this stage is localized: it can distinguish targets within the same video, but cannot distinguish targets across videos of different cameras.
After the image sequence SqIm[c] from the camera Cam[c] is processed through single-camera multi-target tracking, the targets in each image of the image sequence are positioned and identified. When playing back the image sequence (video), it is possible to superimpose on each image the bounding box of each positioned target; bounding boxes of different targets can, for example, be distinguished with different colors, so that the moving trajectory of a target identified with a certain color within the corresponding time period can be clearly seen; of course, during playback, it is also possible to superimpose on an image the unique identifier of a target. It is possible to display a tracking result in quasi real-time: after one frame of image is captured, single-camera target matching is performed to match a target in the new image with a forward trajectory and assign an identifier, and the new image is then displayed with a bounding box carrying the determined identifier superimposed on it. After multi-camera multi-target matching is performed, it is possible to display results in a similar manner using a target identifier set shared by the plurality of cameras; a shared target identifier is global, and in an ideal case the same target is assigned a unique target identifier regardless of the camera, so that bounding boxes with the same target identifier in different camera videos indicate the same target.
In clustering local target trajectories in the overall local target trajectory set LTJs, when two local target trajectories are clustered into one class, it is regarded that the two local target trajectories correspond to the same target (that is, the two local target trajectories match each other), and thus, these two local target trajectories can be merged into one trajectory, reducing the number of trajectories in the trajectory set. Since there may be trajectory merging in the clustering operation Op_cluster, the number of trajectories in the obtained cluster matched global trajectory set GTJcms may be less than the number of trajectories in the overall local target trajectory set LTJs.
In the merging operation Op_merge, trajectory merging (i.e., trajectory matching) may also occur, and thus the number of trajectories in the global target trajectory set GTJs may be less than the number of trajectories in the cluster matched global trajectory set GTJcms. For example, a target Tg[x] has a trajectory TJ1,2 at times t1 to t2, a target Tg[x′] has a trajectory TJ3,4 at times t3 to t4, and when it is determined that the trajectory TJ1,2 matches the trajectory TJ3,4 (that is, the target Tg[x] and the target Tg[x′] are regarded as the same target Tg[X]), the trajectory TJ1,2 is merged (connected) with the trajectory TJ3,4 into one trajectory of the target Tg[X] spanning the times t1 to t4.
Exemplary description of further details of the method 100 will be made below.
In an embodiment, the single-camera multi-target tracking operation Op_mtt comprises a target detection operation Op_detB, a re-identification feature extraction operation Op_extF, a single-camera target matching operation Op_matS, and a local target trajectory post-processing operation Op_postP. The single-camera multi-target tracking operation Op_mtt detects targets in each input image; each target is represented by an image block indicated by a rectangular bounding box, and by matching image blocks at different times, an intra-single-camera target identifier can be assigned to each image block. A time sequence of image blocks sharing the same intra-single-camera target identifier indicates one local target trajectory in the local target trajectory set.
For the target detection operation Op_detB, it is possible to detect, with a target detection network NwTag (e.g., YoloX), targets in an image of the corresponding image sequence provided by a camera, and to output information (e.g., position, size, etc.) on the rectangular bounding boxes of the detected targets (briefly referred to as detected bounding boxes) together with the bounding box confidence. Such a target detection network is a conventional technique and will not be described repeatedly herein.
For the re-identification feature extraction operation Op_extF, a target-related re-identification feature Freid of the image block indicated by each detected bounding box in the image can be provided by a re-identification network NwReID. Such a re-identification network is a conventional technique and will not be described repeatedly herein.
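As an illustration of how the detection operation Op_detB and the feature extraction operation Op_extF can be chained per frame, a minimal sketch follows. The callables det_net and reid_net are hypothetical stand-ins for the networks NwTag and NwReID; their interfaces, and the feature normalization step, are assumptions of this sketch rather than details given in the disclosure.

```python
import numpy as np

def detect_and_extract(image, det_net, reid_net):
    """Per-frame sketch: detect targets, then extract a re-identification
    feature for the image block indicated by each detected bounding box."""
    # det_net is assumed to return bounding boxes (x, y, w, h) and confidences
    boxes, cfds = det_net(image)
    detections = []
    for box, cfd in zip(boxes, cfds):
        x, y, w, h = (int(v) for v in box)
        patch = image[y:y + h, x:x + w]            # image block of the detected box
        freid = np.asarray(reid_net(patch), dtype=float)
        freid /= np.linalg.norm(freid) + 1e-12     # normalize for cosine similarity
        detections.append({"box": box, "cfd": cfd, "freid": freid})
    return detections
```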
In an embodiment, determining an overall local target trajectory set LTJs comprises, for each camera of the plurality of cameras, performing a single-camera target matching operation Op_matS. The single-camera target matching operation Op_matS comprises determining a current local target trajectory set TJst based on a previous local target trajectory set TJst′ (i.e., a set of trajectories having been detected) and a current image Imt in the corresponding image sequence. The corresponding image sequence includes adjacent images: the previous image Imt′ and the current image Imt, where t′ is a previous time, t is a current time, and it is also possible to include an image captured prior to the time t′ (when Imt′ is not a start frame of the image sequence). Determining a current local trajectory set TJst comprises: determining, with a target detection network NwTag, detected bounding boxes Box[bStart] to Box[bEnd] and bounding box confidence cfd[bStart] to cfd[bEnd] of a predetermined class of targets in the current image Imt; and updating the previous local target trajectory set TJst′ as the current local target trajectory set TJst by performing single-camera target matching based on each detected bounding box in the current image Imt, each bounding box confidence and a previous image Imt′. The previous image Imt′ is the last image (i.e., latest image) in a corresponding image sequence of the previous local target trajectory set TJst′.
In an embodiment, updating a previous local target trajectory set by performing single-camera target matching based on each detected bounding box in the current image, each bounding box confidence and a previous image comprises: determining target identifiers of credible bounding boxes whose bounding box confidence is greater than a bounding box confidence threshold among the detected bounding boxes by performing first tracking matching on the credible bounding boxes and each target trajectory having been detected in the previous local target trajectory set; determining target identifiers of remaining detected bounding boxes by performing, for unmatched trajectories among target trajectories having been detected in the previous local target trajectory set, second tracking matching in the remaining detected bounding boxes; and generating, for bounding boxes whose bounding box confidence is greater than the bounding box confidence threshold and which fail to match the target trajectories having been detected in the previous local target trajectory set among the detected bounding boxes, new target identifiers.
Exemplarily, the single-camera target matching operation Op_matS can comprise steps of: (1) at the current time t, predicting, with a Kalman filter, the position of each bounding box at the current time t according to the bounding boxes at the time t′; (2) performing ID matching (first tracking matching) between detected bounding boxes whose bounding box confidence is greater than a bounding box confidence threshold (e.g., 0.5) and each trajectory having been detected in the previous local target trajectory set TJst′; (3) for unmatched trajectories, performing matching (second tracking matching) among the remaining detected bounding boxes; (4) generating new target identifiers for bounding boxes whose bounding box confidence is greater than the bounding box confidence threshold and which fail to match the target trajectories having been detected in the previous local target trajectory set TJst′; (5) updating parameters of the Kalman filter according to the latest bounding box set corresponding to the current trajectory points. In an example, in the single-camera multi-target tracking operation Op_mtt for one camera, for the first image in an image sequence, it is possible to perform only the target detection operation Op_detB, without performing the single-camera target matching operation Op_matS; the operation Op_detB detects targets in the first image, determines their bounding boxes, and obtains n trajectory points (i.e., n trajectories) corresponding to the number of the targets, each trajectory point corresponding to a bounding box, and the n trajectories constitute a local target trajectory set. For subsequent images, it is possible to perform the single-camera target matching operation Op_matS to determine a current local target trajectory set (i.e., to update the local target trajectory set) by matching targets detected in the current image with trajectories in the previous local target trajectory set. The local target trajectory set is iteratively updated image by image, in the order of the images in the image sequence, thereby ultimately obtaining a local target trajectory set for the entire image sequence.
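The sketch below illustrates steps (2) to (4) of this flow under stated assumptions: each track carries a Kalman-predicted box pred_box, cost_fn is a pluggable matching cost that is minimized (e.g., derived from Equation (3) below), and the Hungarian method performs the assignment. None of these implementation choices is mandated by the disclosure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign(tracks, dets, cost_fn, cost_th):
    """Hungarian assignment between Kalman-predicted track boxes and detected
    boxes; pairs whose cost exceeds cost_th are treated as unmatched."""
    if not tracks or not dets:
        return [], list(tracks), list(dets)
    cost = np.array([[cost_fn(t["pred_box"], d["box"]) for d in dets]
                     for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    matched, used_t, used_d = [], set(), set()
    for r, c in zip(rows, cols):
        if cost[r, c] <= cost_th:
            matched.append((tracks[r], dets[c]))
            used_t.add(r)
            used_d.add(c)
    unmatched_tracks = [t for i, t in enumerate(tracks) if i not in used_t]
    unmatched_dets = [d for j, d in enumerate(dets) if j not in used_d]
    return matched, unmatched_tracks, unmatched_dets

def single_camera_match(tracks, detections, cost_fn, cfd_th=0.5, cost_th=0.8):
    """Sketch of Op_matS steps (2)-(4); Kalman prediction (step (1)) and the
    filter update (step (5)) are assumed to happen outside this function."""
    credible = [d for d in detections if d["cfd"] > cfd_th]
    remaining = [d for d in detections if d["cfd"] <= cfd_th]
    # (2) first tracking matching: credible boxes vs. existing trajectories
    m1, unmatched_tracks, leftover = assign(tracks, credible, cost_fn, cost_th)
    # (3) second tracking matching: unmatched trajectories vs. remaining boxes
    m2, _, _ = assign(unmatched_tracks, remaining, cost_fn, cost_th)
    # (4) credible boxes that matched no trajectory receive new target identifiers
    return m1 + m2, leftover
```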
In an embodiment, at least one of the first tracking matching and the second tracking matching is performed by: predicting predicted bounding boxes for the current image based on the detected bounding boxes in the previous image (i.e., the image at the previous time t′); and determining target identifiers of detected bounding boxes in the current image (i.e., the image at the current time t) based on an area overlap cost function and a vertex overlap cost function associated with the detected bounding boxes and the predicted bounding boxes of the current image.
In an embodiment, for the image sequence provided by the camera Cam[c], a target identifier of a bounding box of a target detected in an image at the current time t can be determined based on a bounding box of a target detected in a previous frame of image (corresponding to the time t′) through single-camera target matching, thereby updating the previous local target trajectory set as the current local target trajectory set by, for example, adding new trajectories or trajectory points to the previous local target trajectory set. The single-camera target matching operation Op_matS exemplarily can include operations as follows.
An area overlap cost Iou_cost is computed according to Equation (1):

Iou_cost = 2*s_overlap/(s_predicted + s_detected)   (1)

where s_overlap is the overlap area of the predicted bounding box and the detected bounding box, s_predicted is the area of the predicted bounding box, and s_detected is the area of the detected bounding box.
A vertex overlap cost Y_cost is computed according to Equation (2):

Y_cost = 2*|Y0_predicted − Y0_detected|/(h_predicted + h_detected)*scale   (2)

where Y0_predicted is the vertical coordinate of the upper right corner of the predicted bounding box, Y0_detected is the vertical coordinate of the upper right corner of the detected bounding box, h_predicted is the height of the predicted bounding box, h_detected is the height of the detected bounding box, and scale is a scaling factor, which exemplarily can be taken as 10 herein.
A matching cost m_cost is then obtained as a weighted combination of the two costs according to Equation (3):

m_cost = λ*Iou_cost + (1 − λ)*Y_cost   (3)

where λ is a predetermined weighting constant, which exemplarily can be taken as 0.8 herein.
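Transcribed directly into code, Equations (1) to (3) could look as follows; boxes are assumed to be (x, y, w, h) tuples, with the box's top-edge y-coordinate standing in for the vertical corner coordinate Y0. This is a sketch of the formulas as written, not of any particular implementation.

```python
def m_cost(pred_box, det_box, scale=10.0, lam=0.8):
    """Literal transcription of Equations (1)-(3).
    pred_box, det_box: (x, y, w, h) of the predicted and detected boxes."""
    px, py, pw, ph = pred_box
    dx, dy, dw, dh = det_box
    # Equation (1): area overlap cost (Dice-style overlap of the two boxes)
    ix = max(0.0, min(px + pw, dx + dw) - max(px, dx))
    iy = max(0.0, min(py + ph, dy + dh) - max(py, dy))
    iou_cost = 2.0 * (ix * iy) / (pw * ph + dw * dh)
    # Equation (2): vertex overlap cost on the vertical corner coordinate Y0
    y_cost = 2.0 * abs(py - dy) / (ph + dh) * scale
    # Equation (3): weighted combination, with lambda exemplarily 0.8
    return lam * iou_cost + (1.0 - lam) * y_cost
```

Note that, as written, Iou_cost grows with overlap while Y_cost grows with vertical displacement; the text does not state whether m_cost is minimized or maximized during matching, so a sign convention (e.g., negating the overlap term before feeding it to a minimizing assignment such as the assign sketch above) would have to be chosen in practice.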
In order to improve the accuracy of trajectories determined in the local trajectory set, suppress the appearance of different targets in the same trajectory, and suppress identification-switch (incorrectly matching an image block of a target x′ to the trajectory of an existing target x), determining a current local trajectory set further comprises: updating the current local target trajectory set by performing post-processing on the current local trajectory set.
In an embodiment, performing post-processing on the current local trajectory set comprises (marked as a “first post-processing operation Op_posP1” briefly): determining whether to generate a new trajectory based on a motion characteristic of a trajectory in the current local target trajectory set.
In an example, the motion direction dir is one of a positive X direction, a negative X direction, a positive Y direction, and a negative Y direction, where the positive X direction is perpendicular to the positive Y direction.
The re-identification feature Freidt of the corresponding image block of the current trajectory point and the re-identification feature Freidt′ of the corresponding image block of the previous trajectory point can be extracted from corresponding image blocks by a re-identification network NwReID.
The moving distance dis of the current trajectory point PTJt relative to the previous trajectory point PTJt′ on the trajectory TJ[c][x], taken in the motion direction dirt, can be determined according to Equation (5). That is, the moving distance can be the component, in the motion direction, of the real moving distance in the image.
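Equations (4) and (5) themselves are not reproduced in this excerpt, so the following sketch only illustrates one plausible reading of the description: the displacement between consecutive trajectory points is quantized to one of the four axis directions, and the moving distance is the displacement component along that direction. Both function bodies are assumptions.

```python
def motion_direction(p_prev, p_cur):
    """Quantize the displacement between two trajectory points (x, y) into
    one of the four directions +X, -X, +Y, -Y (a plausible reading of
    Equation (4), which is not reproduced here)."""
    dx = p_cur[0] - p_prev[0]
    dy = p_cur[1] - p_prev[1]
    if abs(dx) >= abs(dy):
        return "+X" if dx >= 0 else "-X"
    return "+Y" if dy >= 0 else "-Y"

def moving_distance(p_prev, p_cur, direction):
    """Component of the displacement along the motion direction in the image
    (a plausible reading of Equation (5), which is not reproduced here)."""
    dx = p_cur[0] - p_prev[0]
    dy = p_cur[1] - p_prev[1]
    return abs(dx) if direction in ("+X", "-X") else abs(dy)
```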
In an image of an image sequence, an overlap between image blocks of a plurality of targets may occur. At this time, identification-switch is prone to occur. Therefore, the inventor conceived the following embodiment related to a “second post-processing operation Op_posP2” to suppress occurrence of this case.
In an embodiment, the second post-processing operation Op_posP2 comprises determining whether a first similarity condition C1 as follows is satisfied:

Sim(Freid_x′t, Freid_xt′) − Sim(Freid_xt, Freid_xt′) > sTh2   (6)
Target identifiers of the corresponding bounding box Boxx′t and the overlapping bounding box Boxxt are exchanged in a case where it is determined that the first similarity condition C1 is satisfied. Sim(Freid_x′t, Freid_xt′) is the similarity between the re-identification feature Freid_x′t of the image block with target identifier x′ corresponding to the corresponding bounding box Boxx′t in the current image Imt and the re-identification feature Freid_xt′ of the image block with target identifier x corresponding to the overlapping bounding box in the previous image Imt′; Sim(Freid_xt, Freid_xt′) is the similarity between the re-identification features Freid_xt and Freid_xt′ of the image blocks with target identifier x corresponding to the overlapping bounding box Boxxt in the current image Imt and in the previous image Imt′; and sTh2 is a second similarity threshold. Exemplarily, sTh2 is taken as 0.1. The similarity can be a cosine similarity between feature vectors.
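A minimal sketch of this check, assuming the re-identification features are vectors compared by cosine similarity (as suggested above); the variable names follow the notation of Equation (6). If the function returns True, the target identifiers of the two bounding boxes are exchanged.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two re-identification feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def c1_satisfied(freid_xp_t, freid_x_t, freid_x_tp, s_th2=0.1):
    """First similarity condition C1 of Equation (6).
    freid_xp_t : Freid_x't, current image block carrying identifier x'
    freid_x_t  : Freid_xt,  current image block carrying identifier x
    freid_x_tp : Freid_xt', previous image block of identifier x"""
    return (cosine_sim(freid_xp_t, freid_x_tp)
            - cosine_sim(freid_x_t, freid_x_tp)) > s_th2
```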
For a case where two targets overlap and are then separated, since occlusion has occurred, the separated trajectory points may have been assigned incorrect target identifiers. Therefore, the inventor conceived the following embodiment related to a "third post-processing operation Op_posP3" to suppress occurrence of this case.
In an embodiment, the third post-processing operation Op_posP3 comprises: determining whether a first trajectory and a second trajectory that satisfy the following overlapping condition exist in the current local trajectory set TJst: a first image block at a previous time t′ prior to a current time t in the corresponding image block sequence of the first trajectory overlaps with a second image block at the previous time t′ in the corresponding image block sequence of the second trajectory; a third image block at the current time t in the corresponding image block sequence of the first trajectory has no overlap with a fourth image block at the current time t in the corresponding image block sequence of the second trajectory; and a fifth image block at a more previous time t″ prior to the previous time t′ in the corresponding image block sequence of the first trajectory has no overlap with a sixth image block at the more previous time t″ in the corresponding image block sequence of the second trajectory.
The third post-processing operation Op_posP3 further comprises: determining, for the first trajectory TJ[c][x] and the second trajectory TJ[c][x′] that satisfy the above overlapping condition, whether a second similarity condition C2 as follows is satisfied.
Sim(Freid_xt″, Freid_x′t) + Sim(Freid_x′t″, Freid_xt) − Sim(Freid_xt, Freid_xt″) − Sim(Freid_x′t″, Freid_x′t) > sTh3   (7)
Sim(Freid_xt″,Freid_x′t) is a similarity between a re-identification feature Freid_xt″ of the fifth image block and a re-identification feature Freid_x′t of the fourth image block;
Sim(Freid_x′t″, Freid_xt) is a similarity between a re-identification feature Freid_x′t″ of the sixth image block and a re-identification feature Freid_xt of the third image block;
Sim(Freid_xt, Freid_xt″) is a similarity between the re-identification feature Freid_xt of the third image block and the re-identification feature Freid_xt″ of the fifth image block;
Sim(Freid_x′t″, Freid_x′t) is a similarity between the re-identification feature Freid_x′t″ of the sixth image block and the re-identification feature Freid_x′t of the fourth image block.
sTh3 is a third similarity threshold.
The third post-processing operation Op_posP3 further comprises: exchanging target identifiers of the fourth image block and the third image block in a case where it is determined that the second similarity condition C2 is satisfied.
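A sketch of the condition C2 check follows, reusing cosine_sim from the previous sketch. It reads Equation (7) as the two cross-identifier similarities minus the two same-identifier similarities, consistent with the formula above.

```python
def c2_satisfied(f_x_t2, f_xp_t2, f_x_t, f_xp_t, s_th3):
    """Second similarity condition C2 of Equation (7).
    f_x_t2, f_xp_t2 : Freid_xt'', Freid_x't'' (blocks at the more previous time t'')
    f_x_t,  f_xp_t  : Freid_xt,   Freid_x't   (blocks at the current time t)"""
    # similarities across identifiers (x at t'' vs. x' at t, and vice versa)
    cross = cosine_sim(f_x_t2, f_xp_t) + cosine_sim(f_xp_t2, f_x_t)
    # similarities within the same identifier over time
    same = cosine_sim(f_x_t2, f_x_t) + cosine_sim(f_xp_t2, f_xp_t)
    return cross - same > s_th3
```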
Further exemplary description of the clustering operation Op_cluster in the method 100 will be made below.
In operation S701, based on re-identification feature pairs of a plurality of corresponding cross-camera image block pairs of each inter-camera trajectory pair in the overall local target trajectory set LTJs, a trajectory similarity of the inter-camera trajectory pair is determined. For example, the local target trajectories obtained from the image sequence provided by the first camera Cam[c1] include a first trajectory TJ[c1][j1]; TJ[c1][j1] corresponds to K1 image blocks, any one of which is marked as a first image block Patch[c1][j1][k1]. The local target trajectories obtained from the image sequence provided by the second camera Cam[c2] include a second trajectory TJ[c2][j2]; TJ[c2][j2] corresponds to K2 image blocks, any one of which is marked as a second image block Patch[c2][j2][k2]. It is then possible to determine a trajectory similarity SimTbC(TJ[c1][j1], TJ[c2][j2]) (also referred to as an "inter-camera trajectory similarity") of the inter-camera trajectory pair (the first trajectory TJ[c1][j1], the second trajectory TJ[c2][j2]) composed of the first and second trajectories, based on the re-identification feature pairs (Freid[c1][j1][k1], Freid[c2][j2][k2]) of the K1*K2 cross-camera image block pairs composed of the K1 first image blocks and the K2 second image blocks.
In an example, a trajectory similarity is determined by calculating the mean of image block similarities, which is marked as an "averaging operation Op_mean". Determining the trajectory similarity SimTbC(TJ[c1][j1], TJ[c2][j2]) comprises: operation S7011, determining an image block similarity SimPbT(Patch[c1][j1][k1], Patch[c2][j2][k2]) of an inter-tracklet image block pair, based on the re-identification feature pair of the inter-tracklet image block pair of the two corresponding tracklets of the inter-camera trajectory pair, the pair serving as a cross-camera image block pair (Patch[c1][j1][k1], Patch[c2][j2][k2]); operation S7013, determining the mean of the top-n largest image block similarities among the image block similarities of the plurality of inter-tracklet image block pairs of the two corresponding tracklets of the inter-camera trajectory pair, as the trajectory similarity SimTbC(TJ[c1][j1], TJ[c2][j2]). n is an integer, for example, n=5. For example, the image block similarity can be a cosine similarity of the re-identification features of the two image blocks. For the first trajectory TJ[c1][j1] with K1 trajectory points and the second trajectory TJ[c2][j2] with K2 trajectory points, K1*K2 image block similarities SimPbT can be obtained, and the mean of the top-n largest among these image block similarities SimPbT is taken as the similarity SimTbC of the trajectory pair composed of the first and second trajectories. Exemplarily, if there are 2 cameras, J1 trajectories are obtained from images of the first camera, and J2 trajectories are obtained from images of the second camera, then the number of trajectory similarities obtained is J1*J2.
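A compact sketch of the averaging operation Op_mean, assuming the re-identification features of the two tracklets are stacked into L2-normalized matrices (an implementation convenience, not something prescribed by the text):

```python
import numpy as np

def trajectory_similarity(freids_1, freids_2, n=5):
    """Trajectory similarity SimTbC as the mean of the top-n largest image
    block similarities SimPbT (cosine similarities of ReID features).
    freids_1, freids_2: arrays of shape (K1, D) and (K2, D) holding
    L2-normalized re-identification features of the two tracklets."""
    sims = freids_1 @ freids_2.T          # (K1, K2) matrix of SimPbT values
    flat = np.sort(sims.ravel())[::-1]    # all K1*K2 similarities, descending
    return float(flat[:n].mean())         # mean of the top-n largest
```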
In operation S703, a cluster matched global trajectory set GTJcms is determined by clustering the plurality of target trajectories in the overall local target trajectory set based on the trajectory similarities of the plurality of inter-camera trajectory pairs in the set. In an example, an agglomerative clustering algorithm is used to cluster the plurality of target trajectories in the overall local target trajectory set with a relatively low threshold (e.g., 0.5). Compared to the number of trajectories in the overall local target trajectory set LTJs, the number of trajectories in the cluster matched global trajectory set GTJcms may decrease.
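As a sketch of operation S703, agglomerative clustering over a precomputed distance matrix can be used; scikit-learn's AgglomerativeClustering is one possible choice (the parameter is named metric in recent releases, affinity in older ones). Building a full pairwise matrix, and giving intra-camera pairs zero similarity so that they are not merged here, are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_trajectories(sim_matrix, dist_th=0.5):
    """Cluster trajectories from a square pairwise similarity matrix;
    trajectories falling in the same cluster are regarded as the same target.
    Entries for intra-camera pairs are assumed to have been set to 0 so that
    their distance is maximal. dist_th echoes the threshold (e.g., 0.5) above."""
    dist = 1.0 - sim_matrix               # turn similarity into a distance
    np.fill_diagonal(dist, 0.0)
    model = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=dist_th,
        metric="precomputed",
        linkage="average",
    )
    return model.fit_predict(dist)        # one cluster label per trajectory
```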
Further exemplary description of the merging operation Op_merge in the method 100 will be made below.
A directed graph constructed with each trajectory in the cluster matched global trajectory set GTJcms as a vertex is denoted G(V, E); a vertex in the directed graph is represented by vi, and ei,j is a directed edge between the vertex vi and a vertex vj. During construction, first, the cost of the directed edge between each pair of vertices is initialized to infinity. Then, the cost of a corresponding directed edge is adjusted based on the times corresponding to the end points of a trajectory pair, which will be described by way of a trajectory pair composed of trajectories TJa, TJb. If, for a first trajectory TJa and a second trajectory TJb in the cluster matched global trajectory set GTJcms, the difference between the start time of the second trajectory TJb and the end time of the first trajectory TJa is greater than zero and less than a predetermined time threshold tTh (e.g., 1 second), then the cost of the directed edge between the two vertices associated with the first trajectory TJa and the second trajectory TJb is adjusted based on at least one of a trajectory similarity cost function cost_reid, a temporal cost function cost_time and a spatial distance cost function cost_spatial associated with the first trajectory TJa and the second trajectory TJb.
As shown in Equation (8), the trajectory similarity cost function cost_reid is associated with a trajectory similarity Sim(TJa, TJb) between the first trajectory TJa and the second trajectory TJb.
cost_reid = log(1 − Sim(TJa, TJb))   (8)
A determination manner of the trajectory similarity Sim(TJa, TJb) can be the same as the determination manner of the trajectory similarity as used in the clustering operation. It is also possible to use, as Sim(TJa, TJb), a re-identification feature similarity between re-identification features Freid_a, Freid_b of image block sequences from the trajectories TJa and TJb respectively, wherein the features Freid_a, Freid_b are re-identification features of key image blocks with identification degrees higher than a predetermined identification degree in the corresponding image block sequences of the first trajectory TJa and the second trajectory TJb. The identification degree can be determined based on at least one of bounding box confidence, an overlapping ratio, and a relative height of an image block (a ratio of an image block height to an image height), and thereby the key image blocks can be selected from the corresponding image block sequences based on identification degrees of the image blocks to calculate Sim(TJa, TJb). Image blocks with the highest identification degree are preferable.
As shown in Equation (9), the temporal cost function cost_time is associated with the difference diffTime between the start time of the second trajectory and the end time of the first trajectory.
cost_time = log(|diffTime|)   (9)
As shown in Equation (10), the spatial distance cost function cost_spatial is associated with a spatial distance disSpatial, in the world coordinate system, between an end location of the first trajectory and a start location of the second trajectory.
cost_spatial = log(disSpatial)   (10)
The cost of a directed edge between two vertices can be a total cost given by a weighted sum of at least two of the above three costs, for example, the total cost shown in Equation (11):

cost = α*cost_reid + β*cost_time + γ*cost_spatial   (11)

where α, β, γ are weighting coefficients, and α+β+γ=1.
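Transcribed into code, the edge cost of Equations (8) to (11) might look as follows. The weight values are illustrative, since the text only requires α+β+γ=1; note that cost_reid is negative when the trajectory similarity is close to 1, which is why a path algorithm tolerating negative edge costs is needed below.

```python
import math

def edge_cost(sim_ab, diff_time, dis_spatial,
              alpha=0.4, beta=0.3, gamma=0.3):
    """Total directed-edge cost of Equations (8)-(11) for an edge from
    trajectory TJa to trajectory TJb.
    sim_ab      : trajectory similarity Sim(TJa, TJb), in (0, 1)
    diff_time   : start time of TJb minus end time of TJa (0 < diff_time < tTh)
    dis_spatial : world-coordinate distance from TJa's end to TJb's start"""
    cost_reid = math.log(1.0 - sim_ab)    # Eq. (8); negative when sim_ab is high
    cost_time = math.log(abs(diff_time))  # Eq. (9)
    cost_spatial = math.log(dis_spatial)  # Eq. (10)
    return alpha * cost_reid + beta * cost_time + gamma * cost_spatial  # Eq. (11)
```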
In order to determine a cost-minimum path in the directed graph G, a virtual start point can be added as a vertex of the directed graph when the cost-minimum path algorithm is implemented, and the cost from this virtual start point to every other vertex is set to 0.
The Bellman-Ford algorithm can solve a single-source shortest path problem in a cost graph. In this algorithm, the weight values of edges can be negative, which is an improvement over the Dijkstra shortest path algorithm, in which weight values cannot be negative. In an example of the disclosure, the Bellman-Ford algorithm is used to calculate cost-minimum paths from the virtual vertex to the other vertices, and all trajectory vertices on a cost-minimum path constitute a multi-camera multi-target tracking result. Ultimately, it is possible to merge corresponding trajectories in the cluster matched global trajectory set based on the cost-minimum path set, thereby obtaining the global target trajectory set GTJs.
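A self-contained sketch of this step is given below, with a hand-rolled Bellman-Ford relaxation (chosen because edge costs can be negative). Since a directed edge only runs forward in time (the second trajectory must start after the first one ends), the graph is acyclic and negative cycles cannot occur. Vertex 0 plays the role of the virtual start point, connected to every trajectory vertex with cost 0.

```python
import math

def bellman_ford(num_vertices, edges, source=0):
    """Single-source cost-minimum paths; negative edge costs are allowed.
    edges: dict mapping (u, v) -> cost of the directed edge u -> v."""
    dist = [math.inf] * num_vertices
    pred = [None] * num_vertices
    dist[source] = 0.0
    for _ in range(num_vertices - 1):     # relax all edges |V|-1 times
        updated = False
        for (u, v), c in edges.items():
            if dist[u] + c < dist[v]:
                dist[v] = dist[u] + c
                pred[v] = u
                updated = True
        if not updated:                   # early exit once distances stabilize
            break
    return dist, pred

def extract_path(pred, v):
    """Walk predecessors back from vertex v; the trajectory vertices on the
    returned cost-minimum path are candidates for merging into one
    global trajectory."""
    path = []
    while v is not None:
        path.append(v)
        v = pred[v]
    return path[::-1]

# Usage sketch: vertex 0 is the virtual start point,
# vertices 1..n correspond to trajectories in GTJcms.
# edges = {(0, i): 0.0 for i in range(1, n + 1)}
# edges.update(trajectory_edges)          # (u, v) -> cost from Equation (11)
# dist, pred = bellman_ford(n + 1, edges)
```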
An example of the overall processing flow, in which two cameras provide image sequences SqIm[1] and SqIm[2], is described below.
An overall local target trajectory set LTJs can be obtained by performing the single-camera multi-target tracking operation Op_mtt on the image sequences SqIm[1], SqIm[2], respectively, wherein the operation Op_mtt can comprise: a target detection operation Op_detB, a re-identification feature extraction operation Op_extF, a single-camera target matching operation Op_matS, and a local target trajectory post-processing operation Op_postP.
Based on the re-identification features, trajectory similarities SimTbC can be determined by performing an averaging operation Op_mean. In this example, where 4 local target trajectories are obtained from one camera and 5 from the other, 4*5 inter-camera trajectory similarities SimTbC will be determined.
Based on the plurality of inter-camera trajectory similarities determined, a cluster matched global trajectory set GTJcms can be determined by performing a clustering operation Op_cluster.
A cost-minimum path set Pmins is determined by performing a cost-minimum operation Op_cminp for the cluster matched global trajectory set GTJcms.
A merging operation Op_merge is performed for the cluster matched global trajectory set GTJcms: the trajectory set GTJcms is updated as the global target trajectory set GTJs by merging corresponding trajectories in the cluster matched global trajectory set GTJcms based on the cost-minimum path set Pmins.
In an embodiment of the present disclosure, there is provided an apparatus for multi-target multi-camera tracking. Exemplary description will be made with reference to the accompanying drawings.
In an embodiment of the present disclosure, there is provided another apparatus for multi-target multi-camera tracking. Exemplary description will be made with reference to the accompanying drawings.
An aspect of the present disclosure provides a non-transitory computer-readable storage medium having a program stored thereon. When the program is executed by a computer, it is possible to implement operations of: determining an overall local target trajectory set including a local target trajectory set of each camera by performing single-camera multi-target tracking on a corresponding image sequence provided by each camera of a plurality of cameras; and determining a global target trajectory set for the plurality of cameras by performing multi-camera multi-target matching on the overall local target trajectory set; wherein determining the global target trajectory set comprises: determining a cluster matched global trajectory set by clustering local target trajectories in the overall local target trajectory set; determining a cost-minimum path set by implementing a cost-minimum path algorithm on a directed graph constructed with each trajectory in the cluster matched global trajectory set as a vertex; and merging corresponding trajectories in the cluster matched global trajectory set based on the cost-minimum path set. The program corresponds to the method 100; for further configuration of the program, reference may be made to the description of the method 100 of the present disclosure.
According to an aspect of the present disclosure, there is further provided an information processing apparatus.
The CPU 1101, the ROM 1102 and the RAM 1103 are connected to each other via a bus 1104. An input/output interface 1105 is also connected to the bus 1104.
The following components are connected to the input/output interface 1105: an input part 1106, including a soft keyboard and the like; an output part 1107, including a display such as a Liquid Crystal Display (LCD), as well as a speaker and the like; a storage part 1108, such as a hard disk and the like; and a communication part 1109, including a network interface card such as a LAN card, a modem and the like. The communication part 1109 executes communication processing via a network such as the Internet, a local area network, a mobile network, or a combination thereof.
A drive 1110 is also connected to the input/output interface 1105 as needed. A removable medium 1111, such as a semiconductor memory and the like, is mounted on the drive 1110 as needed, such that programs read therefrom are installed into the storage part 1108 as needed.
The CPU 1101 can run a program corresponding to a method for multi-target multi-camera tracking.
In the embodiments of the present disclosure, rectangular bounding box information, motion information, and re-identification features are fused in single-camera multi-target tracking, which can optimize multi-target tracking performance under a single camera and effectively reduce ID-switch. In multi-camera matching, target matching is optimized over trajectory similarities (re-identification similarities), time and space as a whole by utilizing a cost-minimum path algorithm based on a directed graph, thereby improving performance. The use of the cost-minimum path algorithm based on the directed graph can avoid errors brought by occlusion, illumination, and pose changes to multi-camera multi-target tracking.
The beneficial effects of the methods, apparatuses, and storage media of the present disclosure include at least one of: improving the accuracy of a result of multi-target multi-camera tracking, and reducing identification-switch.
As described above, according to the present disclosure, the principle of multi-target multi-camera tracking has been disclosed. It should be noted that, the effects of the solution of the present disclosure are not necessarily limited to the above-mentioned effects, and in addition to or instead of the effects described in the preceding paragraphs, any of the effects as shown in the specification or other effects that can be understood from the specification can be obtained.
Although the present invention has been disclosed above through the description with regard to specific embodiments of the present invention, it should be understood that those skilled in the art can design various modifications (including, where feasible, combinations or substitutions of features between various embodiments), improvements, or equivalents to the present invention within the spirit and scope of the appended claims. These modifications, improvements or equivalents should also be considered to be included within the protection scope of the present invention.
It should be emphasized that, the term “comprise/include” as used herein refers to the presence of features, elements, operations or assemblies, but does not exclude the presence or addition of one or more other features, elements, operations or assemblies.
In addition, the methods of the various embodiments of the present invention are not limited to being executed in the time order described in the specification or shown in the accompanying drawings; they may also be executed in other time orders, in parallel, or independently. Therefore, the execution order of the methods described in the specification does not constitute a limitation on the technical scope of the present invention.