Certain embodiments of the present disclosure relate to object tracking. More particularly, some embodiments of the present disclosure relate to multiple-sensor object tracking.
There are two major computer-vision-based object tracking models: multiple object tracking (MOT) and single object tracking (SOT). Often, MOT models require training a detector with predefined classes and then tracking detected objects of the predefined classes across frames of a video. Comparatively, SOT models often do not require a separately trained object detector, but instead can be designed to track any generic object by specifying the target of interest.
Conventional systems and methods for object tracking may use one or more sensors, such as narrow band cameras, wide band cameras, infrared (IR) cameras, electro-optic (EO) cameras, and/or the like. In some examples, conventional systems and methods typically use object detection techniques that are computationally costly, inefficient (e.g., in terms of time and/or resources), and that make it difficult to track an object across different sensor feeds.
Hence, it is desirable to improve techniques for object tracking.
Certain embodiments of the present disclosure relate to object tracking. More particularly, some embodiments of the present disclosure relate to multiple sensor object tracking.
At least some aspects of the present disclosure are directed to a method for multiple sensor object tracking. In some embodiments, the method includes: receiving a first sensor feed and a second sensor feed from a plurality of sensors respectively. In some embodiments, the first sensor feed includes a set of first images, and the second sensor feed includes a set of second images. In some embodiments, the method includes generating an image transformation based on at least one first image in the set of first images and at least one second image in the set of second images, applying the image transformation to the set of second images, aggregating the set of first images and the set of transformed second images to generate a set of aggregated images, and applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects. In some embodiments, the method is performed using one or more processors.
At least some aspects of the present disclosure are directed to a system for multiple sensor object tracking. In some embodiments, the system includes at least one processor, and at least one memory storing instructions that, when executed by the at least one processor, cause the system to perform a set of operations. In some embodiments, the set of operations includes: receiving a first sensor feed and a second sensor feed from a plurality of sensors respectively. In some embodiments, the first sensor feed includes a set of first images, and the second sensor feed includes a set of second images. In some embodiments, the set of operations includes generating an image transformation based on at least one first image in the set of first images and at least one second image in the set of second images, applying the image transformation to the set of second images, aggregating the set of first images and the set of transformed second images to generate a set of aggregated images, and applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects.
At least some aspects of the present disclosure are directed to a method for multiple sensor object tracking. In some embodiments, the method includes: receiving a first sensor feed and a second sensor feed from a plurality of sensors respectively. In some embodiments, the first sensor feed includes a set of first images, and the second sensor feed includes a set of second images. In some embodiments, the method further includes generating an image transformation based on at least one first image in the set of first images and at least one second image in the set of second images, wherein the generating an image transformation based at least in part on the first image and the second image includes: applying an image matching model to a first image in the set of first images and a second image in the set of second images to generate an image matching result; and generating the image transformation based on the image matching result. In some embodiments, the method further includes: applying the image transformation to the set of second images; and aggregating the set of first images and the set of transformed second images to generate a set of aggregated images, wherein the aggregating the set of first images and the set of transformed second images includes: arranging the first image in the set of first images captured at a first time and a transformed second image in the set of transformed second images captured at approximately the first time adjacent to each other. In some embodiments, the method further includes: applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects. In some embodiments, the method is performed using one or more processors.
Depending upon embodiment, one or more benefits may be achieved. These benefits and various additional objects, features and advantages of the present disclosure can be fully appreciated with reference to the detailed description and accompanying drawings that follow.
Unless otherwise indicated, all numbers expressing feature sizes, amounts, and physical properties used in the specification and claims are to be understood as being modified in all instances by the term “about.” Accordingly, unless indicated to the contrary, the numerical parameters set forth in the foregoing specification and attached claims are approximations that can vary depending upon the desired properties sought to be obtained by those skilled in the art utilizing the teachings disclosed herein. The use of numerical ranges by endpoints includes all numbers within that range (e.g., 1 to 5 includes 1, 1.5, 2, 2.75, 3, 3.80, 4, and 5) and any number within that range.
Although illustrative methods may be represented by one or more drawings (e.g., flow diagrams, communication flows, etc.), the drawings should not be interpreted as implying any requirement of, or particular order among or between, various steps disclosed herein. However, some embodiments may require certain steps and/or certain orders between certain steps, as may be explicitly described herein and/or as may be understood from the nature of the steps themselves (e.g., the performance of some steps may depend on the outcome of a previous step). Additionally, a “set,” “subset,” or “group” of items (e.g., inputs, algorithms, data values, etc.) may include one or more items and, similarly, a subset or subgroup of items may include one or more items. A “plurality” means more than one.
As used herein, the term “based on” is not meant to be restrictive, but rather indicates that a determination, identification, prediction, calculation, and/or the like, is performed by using, at least, the term following “based on” as an input. For example, predicting an outcome based on a particular piece of information may additionally, or alternatively, base the same determination on another piece of information. As used herein, the term “receive” or “receiving” means obtaining from a data repository (e.g., database), from another system or service, from another software, or from another software component in a same software. In certain embodiments, the term “access” or “accessing” means retrieving data or information, and/or generating data or information.
Conventional systems and methods are often not capable of efficiently identifying objects across different sensor feeds (e.g., video feeds). Conventional systems and methods typically use object detection techniques, which leads to increased cost and decreased efficiency and makes it more difficult to track an object across different sensor feeds. For example, such conventional systems often cannot efficiently identify the same object across different video feeds.
Various embodiments of the present disclosure can achieve benefits and/or improvements by a computing system receiving data from multiple sensors and using computing models to process the sensor data and identify and track one or more objects. In some embodiments, benefits include significant improvements, including, for example, increased efficiency, reduced complexity, and improved accuracy, in tracking multiple objects across multiple sensor feeds. In certain embodiments, benefits include increased object tracking accuracy by fusing different sensor feeds, for example, sensor feeds that complement each other. In some embodiments, systems and methods are configured to convert different video feeds to a same pixel space and use temporal context and spatial context.
At least some embodiments of the present disclosure are directed to multiple sensor object tracking. According to certain embodiments (e.g., certain overhead imagery/video application scenarios), there are multiple types of sensors looking at the same location at the same time. For example, the types of sensors can include narrow band cameras, wide band cameras, infrared (IR) cameras, electro-optic (EO) cameras, and/or the like. In some embodiments, one or more sensors (e.g., each sensor) feed an object detector and tracker to generate detection tracks. In certain embodiments, given that detectors and trackers may not be perfectly accurate and that different sensors may complement each other, it can be beneficial to fuse the output of object trackers running independently on these different sensor feeds (e.g., video feeds) to generate a single set of tracks. In some embodiments, systems and methods may include fusing detection tracks between narrow and wide camera feeds as well as between EO and IR feeds.
According to some embodiments, systems and methods may first geo-register different video feeds independently. In certain embodiments, systems and methods may map detections from each video feed to the same geo-coordinate space. In some embodiments, with sufficiently accurate georegistration, the detections across two or more overlapping sensor feeds (e.g., camera feeds) can be instantly matched, resulting in a single set of tracks. In certain examples, this approach would not work well because georegistration can often fail, especially in the case of narrow video feeds. In such examples, detections from the individual video feeds will not properly align, resulting in poor performance.
According to certain embodiments, fusion at the object and/or track level, where an identified object is associated with a track identifier (ID), also faces one or more challenges, such as: 1) different sensor modalities: objects in EO and IR sensors have very different appearances; 2) different sensor bands: wide vs. narrow bands cause different object sizes/scales; 3) different rotation angles: objects have different orientations; 4) different view angles: different degrees of geometric distortion; 5) temporal out-of-synchronization: different object motion and trajectories; and/or 6) inaccurate object detection/tracking results (false positives, false negatives, ID switches). In some embodiments, these challenges may cause some object/track-level fusion approaches (e.g., matching objects by appearance or motion/trajectory) to not work well.
According to some embodiments, systems and methods can convert a first set of image frames in a first video and a second set of image frames in a second video to a same pixel space. In certain embodiments, systems and methods can convert the first set of image frames and the second set of image frames to the same pixel space via image matching. In some embodiments, systems and methods can generate a sequence of image frames by combining the first set of image frames and the second set of image frames to generate an aggregated set of image frames. In certain embodiments, two respective image frames, one from each of the two sets of image frames, that were taken at the same time or close in time are arranged as adjacent image frames in the aggregated set of image frames. In some embodiments, systems and methods can feed the aggregated set of image frames to an object tracker to track one or more objects across the two videos.
According to certain embodiments, an object tracking system performs image matching across different sensor feeds (e.g., visual data, videos) captured by the different sensors to map detections to the same pixel space. In some embodiments, the system treats the detections of the image-matched sensor feeds as though they are consecutive frames captured by a single (virtual) sensor. In certain embodiments, the system can use an object tracker to match the detections (e.g., tracking objects) across sensor feeds of two or more sensors and output a single set of tracks (e.g., an identified object with a same track ID across different sensor feeds). In some embodiments, an image frame, also referred to as a frame, is an image in a sequence of images or an image in a video. In certain embodiments, a sensor feed, also referred to as a video feed, is a video stream or a sequence of images captured over time by a sensor.
According to some embodiments, instead of performing georegistration on the individual video feeds, the object tracking systems and methods apply image matching techniques such as, for example, AWOGs (Angle-Weighted Oriented Gradients), CFOG (Channel Features of Orientated Gradients), and/or the like, across different video feeds. In certain embodiments, image matching allows the object tracking system to map the different video feeds to the same pixel space. In some embodiments, the system maps different video feeds to the same pixel space as opposed to the same geolocation when geo-registration is used. In certain embodiments, image matching is a simpler problem (in the computational sense) compared to georegistration and allows for more reliable/persistent results.
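By way of a non-limiting illustration, the sketch below shows the shape of this image-matching step. Because AWOG and CFOG are specialized dense-matching techniques without standard library implementations, OpenCV's ORB feature matching with a RANSAC homography fit is used here as a stand-in; the function name, parameter values, and the choice of a homography are illustrative assumptions rather than a prescribed implementation.

```python
import cv2
import numpy as np

def estimate_transformation(first_image, second_image, min_matches=10):
    """Estimate a homography mapping second_image into first_image's pixel space.

    Inputs are assumed to be 8-bit grayscale frames from the two feeds.
    """
    orb = cv2.ORB_create(nfeatures=2000)
    keypoints1, descriptors1 = orb.detectAndCompute(first_image, None)
    keypoints2, descriptors2 = orb.detectAndCompute(second_image, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(descriptors2, descriptors1),
                     key=lambda m: m.distance)
    if len(matches) < min_matches:
        return None  # low-confidence matching; a neighboring frame may be tried instead
    source = np.float32([keypoints2[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    target = np.float32([keypoints1[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    transformation, _ = cv2.findHomography(source, target, cv2.RANSAC, 5.0)
    return transformation
```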
According to certain embodiments, the systems and methods of object tracking use a computing model (e.g., a machine-learning model). In certain embodiments, a model, also referred to as a computing model, includes a model to process data. A model includes, for example, an artificial intelligence (AI) model, a machine-learning (ML) model, a deep-learning (DL) model, an artificial neural network (ANN), a deep neural network (DNN), an image processing model, an algorithm, a rule, other computing models, and/or a combination thereof.
Although the above has been shown using a selected group of components in the object tracking system or workflow 100, there can be many alternatives, modifications, and variations. For example, some of the components may be expanded and/or combined. Other components may be inserted into those noted above. Depending upon the embodiment, some of the components may be interchanged with others or replaced. Further details of these components are found throughout the present disclosure.
According to certain embodiments, the video feeds 110 are from two or more image sensors. In some embodiments, the video feeds 110 are from two or more types of image sensors. In certain embodiments, each video feed 110 includes a set of images (e.g., image frames). In some embodiments, the video feeds 110 include a first video (e.g., a video stream) and a second video, which correspond to a set of first images and a set of second images, respectively. In certain embodiments, the first video feed and the second video feed are captured by different image sensors. In some embodiments, the first video feed and the second video feed are captured by different types of image sensors (e.g., narrow band image sensor, wide band image sensor, EO image sensor, IR image sensor, satellite image sensor, etc.).
According to some embodiments, the detection and transformation processor 120 identifies a set of detected objects in the video feeds 110. In certain embodiments, the detection and transformation processor 120 identifies a set of first detected objects in the first video and/or one or more first images in the set of first images. In some embodiments, the detection and transformation processor 120 identifies a set of second detected objects in the second video and/or one or more second images in the set of second images.
According to certain embodiments, the detection and transformation processor 120 applies one or more image-matching techniques (e.g., AWOGs, CFOG, etc.) to the first video and the second video. In some embodiments, the detection and transformation processor 120 applies one or more image-matching techniques to one or more first images in the set of first images and one or more second images in the set of second images. In certain embodiments, the detection and transformation processor 120 generates an image transformation using the one or more image-matching techniques. In some embodiments, the detection and transformation processor 120 generates an image transformation between the first video and the second video using the one or more image-matching techniques, for example, in a same pixel space.
According to some embodiments, the detection and transformation processor 120 applies the image transformation (e.g., image matching techniques) to the set of first images to generate a set of transformed (e.g., mapped) first images. In certain embodiments, the detection and transformation processor 120 applies the image transformation to the set of second images to generate a set of transformed second images. In some embodiments, the image transformation is applied to the set of first images or the set of second images. In certain embodiments, the image transformation is applied to only one of the set of first images and the set of second images.
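Continuing the illustrative sketch above, applying the resulting transformation to one feed is a per-frame warp into the other sensor's pixel space; `transform_feed` is a hypothetical helper name.

```python
import cv2

def transform_feed(second_images, transformation, frame_size):
    """Warp every second image into the first sensor's pixel space.

    frame_size is (width, height) of the first sensor's frames.
    """
    return [cv2.warpPerspective(image, transformation, frame_size)
            for image in second_images]
```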
According to certain embodiments, the detection and transformation processor 120 aggregates the set of first images (e.g., the set of transformed first images) with the set of second images (e.g., the set of transformed second images) to generate a set of aggregated images 125. In some embodiments, the detection and transformation processor 120 aggregates the set of first images (e.g., the set of transformed first images) with the set of second images (e.g., the set of transformed second images) to generate the set of aggregated images 125 based on temporal context. In certain embodiments, the detection and transformation processor 120 arranges a first image in the set of first images captured at a first time and a second image in the set of second images captured at approximately the first time adjacent to each other in the set of aggregated images.
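A minimal sketch of this temporal arrangement, assuming both feeds are sorted by timestamp: the merge walks the first feed and places each (approximately) same-time transformed second frame immediately after its counterpart, yielding the interleaved sequence consumed by the tracker. The 0.1-second tolerance is an illustrative assumption, not a value prescribed by the disclosure.

```python
def aggregate_feeds(first_frames, transformed_second_frames, tolerance=0.1):
    """Each input is a time-sorted list of (timestamp_seconds, image) pairs."""
    aggregated, j = [], 0
    for t_first, image_first in first_frames:
        aggregated.append(image_first)
        # skip second-feed frames that have no close first-feed counterpart
        while (j < len(transformed_second_frames)
               and transformed_second_frames[j][0] < t_first - tolerance):
            j += 1
        if (j < len(transformed_second_frames)
                and abs(transformed_second_frames[j][0] - t_first) <= tolerance):
            aggregated.append(transformed_second_frames[j][1])
            j += 1
    return aggregated
```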
According to some embodiments, referring back to
According to certain embodiments, the MOT tracker 130 can be leveraged to match non-overlapping detections over time in a video feed. In some embodiments, the MOT tracker 130 includes a motion model 136. In certain embodiments, the motion model 136 includes a linear motion model. In some embodiments, the motion model 136 includes a filter that estimates the state of a linear dynamic system from a series of noisy measurements. For example, in some embodiments, the motion model 136 includes a Kalman filter.
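As one concrete, deliberately minimal instance of such a linear motion model, the sketch below implements a constant-velocity Kalman filter over object-center positions; the noise covariances and initial uncertainty are illustrative assumptions.

```python
import numpy as np

class ConstantVelocityKalman:
    """Tracks [x, y, vx, vy] from noisy (x, y) detections."""

    def __init__(self, x, y, dt=1.0):
        self.state = np.array([x, y, 0.0, 0.0])
        self.P = np.eye(4) * 10.0                      # state covariance
        self.F = np.array([[1, 0, dt, 0],              # constant-velocity transition
                           [0, 1, 0, dt],
                           [0, 0, 1, 0],
                           [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],               # only position is measured
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * 0.01                      # process noise
        self.R = np.eye(2) * 1.0                       # measurement noise

    def predict(self):
        self.state = self.F @ self.state
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.state[:2]                          # predicted position

    def update(self, measurement):
        innovation = np.asarray(measurement, dtype=float) - self.H @ self.state
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)       # Kalman gain
        self.state = self.state + K @ innovation
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

In use, such a filter would be stepped with `predict` once per aggregated frame and corrected with `update` whenever a detection is associated with the track, so the predicted position can gate the association step described below.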
According to some embodiments, the MOT tracker 130 includes a re-identification (ReID) model 132. In certain embodiments, the re-identification model 132 includes an appearance model (e.g., a similarity machine-learning model, a regression similarity learning model, a classification similarity learning model, a ranking similarity learning model, etc.). In some embodiments, the re-identification model 132 includes a deep neural network-based model to capture appearance similarity. In some examples, the re-identification model 132 includes a Siamese neural network and/or the like. In certain embodiments, the re-identification model 132 determines one or more similarity metrics between the set of first detected objects and the set of second detected objects. In some embodiments, the re-identification model 132 determines one or more similarity distances between the set of first detected objects and the set of second detected objects. In the example illustrated in
According to certain embodiments, the re-identification model 132, the object association model 134, and/or the MOT tracker 130 can match and/or associate objects based on motion and/or appearance. In some embodiments, the re-identification model 132, the object association model 134, and/or the MOT tracker 130 can assign weights to one or more motion parameters and/or one or more appearance parameters. In some embodiments, the re-identification model 132, the object association model 134, and/or the MOT tracker 130 can rank one or more motion parameters and/or one or more appearance parameters. In certain embodiments, the re-identification model 132, the object association model 134, and/or the MOT tracker 130 can assign tracking IDs based on the weighted and/or ranked result of the motion model and/or one or more appearance models and one or more criteria, for example, one or more predetermined thresholds. In some embodiments, the re-identification model 132, the object association model 134, and/or the MOT tracker 130 can determine a first object in the set of first detected objects and a second object in the set of second detected objects to have a same tracking ID if the weighted and/or ranked result of the motion model and/or one or more appearance models meets one or more criteria, for example, higher than one or more predetermined thresholds.
In certain embodiments, the re-identification model 132, the object association model 134, and/or the MOT tracker 130 can determine a similarity metric between a first object in the set of first detected objects and a second object in the set of second detected objects, determine the similarity metric to be higher than a predetermined threshold, and determine the first object and the second object to have a same tracking ID and/or re-assign a tracking ID to the first object and/or the second object. In some embodiments, the re-identification model 132, the object association model 134, and/or the MOT tracker 130 can determine a similarity distance between a first object in the set of first detected objects and a second object in the set of second detected objects, determine the similarity distance to be lower than a predetermined threshold, and determine the first object and the second object to have a same tracking ID and/or re-assign a tracking ID to the first object and/or the second object.
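A minimal sketch of this appearance comparison, assuming each detection has already been mapped to an embedding vector by the re-identification network: cosine similarities are computed for every first/second detection pair and thresholded. The 0.7 threshold and the function name are illustrative assumptions.

```python
import numpy as np

def appearance_matches(first_embeddings, second_embeddings, threshold=0.7):
    """Embeddings are (n, d) and (m, d) arrays, one row per detected object."""
    a = first_embeddings / np.linalg.norm(first_embeddings, axis=1, keepdims=True)
    b = second_embeddings / np.linalg.norm(second_embeddings, axis=1, keepdims=True)
    similarity = a @ b.T                               # (n, m) cosine similarities
    # pairs above the threshold are candidates for sharing a tracking ID
    return [(i, j)
            for i in range(similarity.shape[0])
            for j in range(similarity.shape[1])
            if similarity[i, j] > threshold]
```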
In certain embodiments, the re-identification model 132, the object association model 134, and/or the MOT tracker 130 can determine a similarity metric between a first object in the set of first detected objects and a second object in the set of second detected objects. In some examples, the re-identification model 132, the object association model 134, and/or the MOT tracker 130 can determine a spatial relationship of the first object and the second object. In some examples, the re-identification model 132, the object association model 134, and/or the MOT tracker 130 can determine whether the similarity metric meets a similarity criterion and/or whether the spatial relationship meets a spatial criterion. In some embodiments, if both the similarity criterion and the spatial criterion are met, the re-identification model 132, the object association model 134, and/or the MOT tracker 130 can determine the first object and the second object to have a same tracking ID and/or re-assign a tracking ID to the first object and/or the second object.
In certain examples, the system 100 combines the re-identification model 132 with the motion model 136 to assign the tracking IDs. In some examples, the system can perform fusion of sensor feeds between narrow and wide band sensors, where appearances still look similar between sensors after image matching since both have similar scale, rotation, and/or the like. In some examples, performing fusion of sensor feeds refers to the process of combining data from multiple sensors to improve the overall accuracy, reliability, and/or completeness of the data being gathered about a system or environment.
According to certain embodiments, the MOT tracker 130 determines a spatial context 138 of one or more objects. In some embodiments, the spatial context includes a spatial relationship among different objects (e.g., two or more detected objects). In certain embodiments, the spatial relationship includes a spatial graph. In some embodiments, the MOT tracker 130 includes a spatial graph matching module along with appearance and motion to more robustly fuse sensor feeds 110 among different sensors. In certain embodiments, the MOT tracker 130 can use temporal context (e.g., the track trajectory and appearance changes, over time, of each sensor feed) to track objects, for example, establishing the initial tracks or assigning initial tracking IDs.
According to some embodiments, the object association model 134 (e.g., the object association module) uses or includes an optimization algorithm (e.g., Hungarian algorithm, etc.) that takes into consideration motion and/or distance, appearance similarity, spatial context, and/or the like, to determine tracking ID assignment. In some embodiments, the object association model 134 includes one or more association functions, for example, association functions for associating objects in different frames as one object. In the example illustrated in
According to certain embodiments, the object association model 134 may use spatial context 138, the motion model 136 (e.g., more complex motion model), and/or a different association algorithm for fusion. In some embodiments, for low confidence image matching and/or detection track matching scenarios, the object tracking system 100 can use image matching at neighboring frames (e.g., adjacent frames) to ensemble results since for high frame rate videos, scene/objects may look very similar, and objects may not move much between neighboring frames. In certain embodiments, the MOT tracker 130 may perform track initialization process 142, track update process 144, and/or track deletion process 146. In some embodiments, the MOT tracker 130 may perform track initialization process 142, track update process 144, and/or track deletion process 146, based on inputs from the object association model 134.
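A minimal sketch of such an association step, assuming motion distances have been normalized to [0, 1] so they are commensurate with appearance dissimilarity: a weighted cost matrix is solved with `scipy`'s Hungarian-algorithm implementation, and matches above a gating threshold are rejected. The weights and gate are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(motion_distance, appearance_similarity,
              w_motion=0.5, w_appearance=0.5, gate=0.7):
    """Both inputs are (num_tracks, num_detections) matrices scaled to [0, 1]."""
    cost = w_motion * motion_distance + w_appearance * (1.0 - appearance_similarity)
    rows, cols = linear_sum_assignment(cost)           # Hungarian assignment
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= gate]
    unmatched_tracks = set(range(cost.shape[0])) - {r for r, _ in matches}
    unmatched_detections = set(range(cost.shape[1])) - {c for _, c in matches}
    return matches, unmatched_tracks, unmatched_detections
```

Under this sketch, unmatched detections would feed the track initialization process 142, while tracks that remain unmatched over several frames would become candidates for the track deletion process 146.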
According to some embodiments, to further improve runtime efficiency, if objects are sparse in an image, the object tracking system 100 can use object region aware image matching to only match images for certain parts of the images to save computation time. In certain embodiments, the object tracking system 100 can sample templates (e.g., structured templates) only for the regions where there are objects of interest and potentially make image matching more accurate for these regions.
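The sketch below illustrates the object-region-aware idea under the stated sparsity assumption: a template is sampled only around a detected box and matched against a padded neighborhood of the other (already roughly aligned) image, rather than matching full frames. The padding value and function name are illustrative assumptions, and the box is assumed to lie away from the image border.

```python
import cv2

def match_object_region(first_image, second_image, box, pad=32):
    """box = (x, y, w, h) of a detected object in second_image (8-bit images)."""
    x, y, w, h = box
    template = second_image[y:y + h, x:x + w]
    x0, y0 = max(x - pad, 0), max(y - pad, 0)
    search = first_image[y0:y0 + h + 2 * pad, x0:x0 + w + 2 * pad]
    scores = cv2.matchTemplate(search, template, cv2.TM_CCOEFF_NORMED)
    _, confidence, _, top_left = cv2.minMaxLoc(scores)
    # matched top-left corner in first_image coordinates, with a confidence score
    return (x0 + top_left[0], y0 + top_left[1]), confidence
```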
In some embodiments, the object tracking system 100 may include a storage repository (not shown) (e.g., a data repository) that can store one or more sensor feeds, one or more aggregated sets of sensor feeds, one or more tracked objects (e.g., with track IDs), one or more temporal contexts, one or more spatial contexts, and/or the like. The repository may be implemented using any one of the configurations described below. A data repository may include random access memories, flat files, XML files, and/or one or more database management systems (DBMS) executing on one or more database servers or a data center. A database management system may be a relational (RDBMS), hierarchical (HDBMS), multidimensional (MDBMS), object-oriented (ODBMS or OODBMS), or object-relational (ORDBMS) database management system, and the like. The data repository may be, for example, a single relational database. In some cases, the data repository may include a plurality of databases that can exchange and aggregate data by a data integration process or software application. In an exemplary embodiment, at least part of the data repository may be hosted in a cloud data center. In some cases, a data repository may be hosted on a single computer, a server, a storage device, a cloud server, or the like. In some other cases, a data repository may be hosted on a series of networked computers, servers, or devices. In some cases, a data repository may be hosted on tiers of data storage devices including local, regional, and central.
In certain embodiments, various components in the object tracking system 100 can interact with one another via a software interface. In some embodiments, a software interface includes an application programming interface (API), a web service interface, retrieving information from a file, retrieving information from a data repository, and/or the like. In some cases, various components in the object tracking system 100 can execute software or firmware stored in a non-transitory computer-readable medium to implement various processing steps. Various components and processors of the object tracking system 100 can be implemented by one or more computing devices including, but not limited to, circuits, a computer, a cloud-based processing unit, a processor, a processing unit, a microprocessor, a mobile computing device, and/or a tablet computer. In some cases, various components of the object tracking system 100 (e.g., the detection and transformation processor 120, the multiple object tracker 130, etc.) can be implemented on a shared computing device. Alternatively, a component of the object tracking system 100 can be implemented on multiple computing devices. In some implementations, various modules and components of the object tracking system 100 can be implemented as software, hardware, firmware, or a combination thereof. In some cases, various components of the object tracking system 100 can be implemented in software or firmware executed by a computing device.
Various components of the object tracking system 100 can communicate with one another via, or be coupled to one another via, a communication interface, for example, a wired or wireless interface. The communication interface includes, but is not limited to, any wired or wireless short-range and long-range communication interfaces. The short-range communication interfaces may be, for example, local area network (LAN) interfaces conforming to a known communications standard, such as the Bluetooth® standard, IEEE 802 standards (e.g., IEEE 802.11), a ZigBee® or similar specification, such as those based on the IEEE 802.15.4 standard, or other public or proprietary wireless protocols. The long-range communication interfaces may be, for example, wide area network (WAN) interfaces, cellular network interfaces, satellite communication interfaces, etc. The communication interface may be either within a private computer network, such as an intranet, or on a public computer network, such as the internet.
In some embodiments, some or all processes (e.g., steps) of the method 200 are performed by a system (e.g., the computing system 500). In certain examples, some or all processes (e.g., steps) of the method 200 are performed by a computer and/or a processor directed by a code. For example, a computer includes a server computer and/or a client computer (e.g., a personal computer). In some examples, some or all processes (e.g., steps) of the method 200 are performed according to instructions included by a non-transitory computer-readable medium (e.g., in a computer program product, such as a computer-readable flash drive). For example, a non-transitory computer-readable medium is readable by a computer including a server computer and/or a client computer (e.g., a personal computer, and/or a server rack). As an example, instructions included by a non-transitory computer-readable medium are executed by a processor including a processor of a server computer and/or a processor of a client computer (e.g., a personal computer, and/or server rack).
According to some embodiments, at process 210, the system receives multiple videos from multiple sensors of one or more types of sensors (e.g., narrow-band camera, wide-band camera, infrared camera, electro-optic (EO) camera, satellite camera, etc.) and/or the same type of sensor with one or more zoom levels (e.g., EO narrow and EO wide). For example, EO narrow video A includes original images A1, A2, etc., and EO wide video B includes original images B1, B2, etc., where original images A1 and B1 were captured at a same time, original images A2 and B2 were captured at another same time, and original images A3 and B3 were captured at yet another same time.
In certain embodiments, at process 215, from the received multiple videos, the system selects multiple images that were captured at the same time or approximately at the same time (e.g., within 0.1 second, within 1 second, within 5 seconds, etc.), wherein one image is selected from each video of the received multiple videos. As an example, the system selects original image A1 from EO narrow video A, and selects original image B1 from EO wide video B. In some embodiments, at process 220, using image matching techniques (e.g., AWOGs, CFOG, etc.), the system performs image matching on the selected multiple images to map (e.g., transform) the selected multiple images to the same pixel space. For example, using image matching techniques, the system performs image matching on original images A1 and B1 in order to map original images A1 and B1 to the same pixel space.
According to some embodiments, the system may repeat processes 215 and 220 at different times to form multiple groups of transformed (e.g., mapped) images. For example, the system forms a first group of transformed A1 and B1, a second group of transformed A2 and B2, and a third group of transformed A3 and B3. In certain embodiments, at process 225, the system generates an aggregated set of transformed images. In some embodiments, the aggregated set of transformed images includes the multiple groups of transformed images, where the transformed images in a group are arranged adjacent to each other. For example, the aggregated set of transformed images includes the transformed images A1, B1, A2, B2, A3, B3, and the like.
According to certain embodiments, at process 230, the system feeds the aggregated set of transformed images sequentially into a MOT tracker (e.g., the MOT tracker 130, or a detector and a MOT tracker). In some embodiments, at process 235, using the MOT tracker, the system assigns the same object on these image-matched multiple images (selected from different videos) to the same tracking ID. As an example, using the detector and the MOT tracker, the system assigns the same object on these transformed images A1, B1, A2, B2, A3, B3, . . . to the same tracking ID.
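Putting the pieces together for the concrete A/B example, a minimal end-to-end sketch of processes 210 through 235 might look as follows. It reuses the `estimate_transformation` and `aggregate_feeds` sketches above, and `Detector` and `MOTTracker` are hypothetical placeholders for whatever detector and MOT tracker are deployed; only the data flow is taken from the method 200.

```python
import cv2

def track_across_feeds(video_a, video_b):
    """video_a, video_b: time-sorted lists of (timestamp, image) pairs."""
    _, image_a1 = video_a[0]
    _, image_b1 = video_b[0]
    transformation = estimate_transformation(image_a1, image_b1)   # process 220
    height, width = image_a1.shape[:2]
    warped_b = [(t, cv2.warpPerspective(image, transformation, (width, height)))
                for t, image in video_b]
    aggregated = aggregate_feeds(video_a, warped_b)                # process 225
    detector, tracker = Detector(), MOTTracker()                   # hypothetical components
    tracks = []
    for frame in aggregated:                                       # processes 230 and 235
        tracks = tracker.update(detector.detect(frame))
    return tracks
```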
According to certain embodiments, one or more aspects of the method 200 use a computing model (e.g., a machine-learning model). In certain embodiments, a model, also referred to as a computing model, includes a model to process data. A model includes, for example, an artificial intelligence (AI) model, a machine-learning (ML) model, a deep-learning (DL) model, an artificial neural network (ANN), a deep neural network (DNN), an image processing model, an algorithm, a rule, other computing models, and/or a combination thereof.
According to some embodiments, at process 240, the system is configured to play the multiple videos (from the multiple types of sensors) separately, where the same object in different videos is labeled with the same tracking ID. For example, the system plays EO narrow video A (including original images A1, A2, A3, etc.) and plays EO wide video B (including original images B1, B2, B3, etc.) separately, where the same object is labeled with the same tracking ID in these two different videos A and B.
In some embodiments, some or all processes (e.g., steps) of the method 300 are performed by a system (e.g., the computing system 500). In certain examples, some or all processes (e.g., steps) of the method 300 are performed by a computer and/or a processor directed by a code. For example, a computer includes a server computer and/or a client computer (e.g., a personal computer). In some examples, some or all processes (e.g., steps) of the method 300 are performed according to instructions included by a non-transitory computer-readable medium (e.g., in a computer program product, such as a computer-readable flash drive). For example, a non-transitory computer-readable medium is readable by a computer including a server computer and/or a client computer (e.g., a personal computer, and/or a server rack). As an example, instructions included by a non-transitory computer-readable medium are executed by a processor including a processor of a server computer and/or a processor of a client computer (e.g., a personal computer, and/or server rack).
According to some embodiments, at process 310, the system receives a first sensor feed and a second sensor feed from a plurality of sensors respectively, where the first sensor feed includes a set of first images and the second sensor feed includes a set of second images. In certain embodiments, the first sensor feed (e.g., video feed) is received from a first sensor, the second sensor feed is received from a second sensor different from the first sensor. In some embodiments, the first sensor is a different type of sensor from the second sensor. In certain embodiments, the first sensor has a different band from the second sensor. In some embodiments, the first sensor has a different zoom level from the second sensor. In certain embodiments, the first sensor has a different view angle from the second sensor.
According to certain embodiments, the sensor feeds (e.g., video feeds) are from two or more image sensors. In some embodiments, the video feeds are from two or more types of image sensors. In certain embodiments, each video feed includes a set of images (e.g., image frames). In some embodiments, the video feeds include a first video (e.g., a video stream) and a second video, which correspond to a set of first images and a set of second images, respectively. In certain embodiments, the first video feed and the second video feed are captured by different image sensors. In some embodiments, the first video feed and the second video feed are captured by different types of image sensors (e.g., narrow band image sensor, wide band image sensor, EO image sensor, IR image sensor, satellite image sensor, etc.).
According to some embodiments, the system identifies a set of detected objects in the video feeds. In certain embodiments, the system identifies a set of first detected objects in the first video and/or one or more first images in the set of first images. In some embodiments, the system identifies a set of second detected objects in the second video and/or one or more second images in the set of second images.
According to certain embodiments, at process 315, the system generates an image transformation based on at least one first image in the set of first images and at least one second image in the set of second images. In some embodiments, at process 320, which may be a part of the process 315, the system applies an image matching model (e.g., AWOGs, CFOG, etc.) to the at least one first image and the at least one second image to generate an image matching result. In certain embodiments, at process 325, which may be a part of the process 315, the system generates the image transformation based on the image matching result. In some embodiments, the system applies one or more image-matching techniques (e.g., AWOGs, CFOG, etc.) to the first sensor feed and the second sensor feed. In certain embodiments, the system applies one or more image-matching techniques to one or more first images in the set of first images and one or more second images in the set of second images. In certain embodiments, the system generates the image transformation using the one or more image-matching techniques. In some embodiments, the system generates an image transformation between the first video and the second video using the one or more image-matching techniques, for example, in a same pixel space.
According to certain embodiments, one or more aspects of the method 300 use a computing model (e.g., a machine-learning model). In certain embodiments, a model, also referred to as a computing model, includes a model to process data. A model includes, for example, an artificial intelligence (AI) model, a machine-learning (ML) model, a deep-learning (DL) model, an artificial neural network (ANN), a deep neural network (DNN), an image processing model, an algorithm, a rule, other computing models, and/or a combination thereof.
According to some embodiments, at process 330, the system applies the image transformation (e.g., image matching techniques) to the set of first images and/or the set of second images. In certain embodiments, the system applies the image transformation to the set of first images to generate a set of transformed (e.g., mapped) first images. In certain embodiments, the system applies the image transformation to the set of second images to generate a set of transformed second images. In some embodiments, the image transformation is applied to the set of first images or the set of second images. In certain embodiments, the image transformation is applied to only one of the set of first images and the set of second images.
According to certain embodiments, at process 335, the system aggregates the set of first images (e.g., the set of transformed first images) and the set of second images (e.g., the set of transformed second images) to generate a set of aggregated images. In some embodiments, the system aggregates the set of first images (e.g., the set of transformed first images) with the set of second images (e.g., the set of transformed second images) to generate the set of aggregated images based on temporal context. In certain embodiments, the system arranges a first image in the set of first images captured at a first time and a second image in the set of second images captured at approximately the first time (e.g., within 0.1 second, within 1 second, within 5 seconds, etc.) adjacent to each other in the set of aggregated images. In some embodiments, the system aggregates the set of transformed first images and the set of transformed second images to generate a set of aggregated images, including arranging a transformed first image in the set of transformed first images captured at a first time and a transformed second image in the set of transformed second images captured at approximately the first time adjacent to each other.
According to some embodiments, the set of first detected objects are associated with or assigned to a set of first tracking IDs and the set of second detected objects are associated with or assigned to a set of second tracking IDs different from the set of first tracking IDs. In certain embodiments, the system can match non-overlapping detections over time in a video feed (e.g., can match moving cars or people). In some embodiments, the system (e.g., MOT tracker) can match the set of first detected objects and the set of second detected objects by processing the set of aggregated images. In some embodiments, the system can match the set of first detected objects and the set of second detected objects by processing the set of aggregated images, using the object association model. In some embodiments, the non-overlapping detections across video feeds, which might result from error in the image matching step, are very similar to the non-overlapping detections of moving objects, such as vehicles, in a single video feed over time. In certain embodiments, assuming the image matching error characteristics do not drastically change over time, the system may leverage a motion model, for example, to reliably generate a single set of tracks across all video feeds.
According to certain embodiments, at process 340, the system applies a multiple object tracking model to the set of aggregated images to identify a plurality of objects. In some embodiments, the system applies a re-identification model (345), applies a motion model (350), determines spatial context (355), and/or applies an object association model (360) to the set of aggregated images, for example, to identify the plurality of objects across at least two images in the set of aggregated images. In certain embodiments, the system applies an appearance model to the set of aggregated images to identify the plurality of objects across at least two images in the set of aggregated images. In some embodiments, the system applies an optimization algorithm to a first result of the motion model and a second result of the appearance model to identify the plurality of objects across at least two images in the set of aggregated images.
According to some embodiments, the system determines a spatial relationship (e.g., spatial context) between two objects in the set of first detected objects. In certain embodiments, the spatial relationship includes a spatial graph. In some embodiments, a spatial graph includes one or more nodes of spatial locations, usually given by coordinates in one, two, or three dimensions. In some embodiments, the spatial graph includes a distance between two or more nodes. In certain embodiments, the spatial relationship includes relative positions of two or more objects. In certain embodiments, the system can be leveraged to match non-overlapping detections over time in a video feed. In some embodiments, the system includes a motion model. In certain embodiments, the motion model includes a linear motion model. In some embodiments, the motion model includes a filter that estimates the state of a linear dynamic system from a series of noisy measurements. For example, in some embodiments, the motion model includes a Kalman filter.
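A minimal sketch of such a spatial graph: each frame's detections become nodes at their center coordinates, edges carry pairwise distances, and two frames' graphs can be compared by how well corresponding edge lengths agree. The comparison assumes the two detection sets are the same size and in correspondence; the function names are illustrative.

```python
import numpy as np

def spatial_graph(centers):
    """centers: (n, 2) array of object-center coordinates -> (n, n) edge lengths."""
    differences = centers[:, None, :] - centers[None, :, :]
    return np.linalg.norm(differences, axis=-1)

def graph_agreement(centers_a, centers_b):
    """Mean absolute edge-length difference between two corresponding detection sets."""
    return float(np.mean(np.abs(spatial_graph(centers_a) - spatial_graph(centers_b))))
```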
According to some embodiments, the system includes a re-identification (ReID) model. In certain embodiments, the re-identification model includes an appearance model (e.g., a similarity machine-learning model, a regression similarity learning model, a classification similarity learning model, a ranking similarity learning model, etc.). In some embodiments, the re-identification model includes a deep neural network-based model to capture appearance similarity. In some examples, the re-identification model includes a Siamese neural network and/or the like. In certain embodiments, the re-identification model determines one or more similarity metrics between the set of first detected objects and the set of second detected objects. In some embodiments, the re-identification model determines one or more similarity distances between the set of first detected objects and the set of second detected objects. In the example illustrated in
According to certain embodiments, the re-identification model, the object association model, and/or the system can match and/or associate objects based on motion and/or appearance. In some embodiments, the re-identification model, the object association model, and/or the system can assign weights to one or more motion parameters and/or one or more appearance parameters. In certain embodiments, the re-identification model, the object association model, and/or the system can assign tracking IDs based on the weighted result of the motion model and/or one or more appearance models and one or more criteria, for example, one or more predetermined thresholds. In some embodiments, the re-identification model, the object association model, and/or the system can determine a first object in the set of first detected objects and a second object in the set of second detected objects to have a same tracking ID if the weighted result of the motion model and/or one or more appearance models meets one or more criteria, for example, higher than one or more predetermined thresholds.
In certain embodiments, the re-identification model, the object association model, and/or the system can determine a similarity metric between a first object in the set of first detected objects and a second object in the set of second detected objects, determine the similarity metric to be higher than a predetermined threshold, and determine the first object and the second object to have a same tracking ID and/or re-assign a tracking ID to the first object and/or the second object. In some embodiments, the re-identification model, the object association model, and/or the system can determine a similarity distance between a first object in the set of first detected objects and a second object in the set of second detected objects, determine the similarity distance to be lower than a predetermined threshold, and determine the first object and the second object to have a same tracking ID and/or re-assign a tracking ID to the first object and/or the second object.
In certain embodiments, the re-identification model, the object association model, and/or the system can determine a similarity metric between a first object in the set of first detected objects and a second object in the set of second detected objects and a spatial relationship of the first object and the second object, and determine whether the similarity metric meets a similarity criterion and whether the spatial relationship meets a spatial criterion. In some embodiments, if both the similarity criterion and the spatial criterion are met, the re-identification model, the object association model, and/or the system can determine the first object and the second object to have a same tracking ID and/or re-assign a tracking ID to the first object and/or the second object.
In certain examples, the system combines the re-identification model with the motion model to assign the tracking IDs. In some examples, the system can perform fusion of sensor feeds between narrow and wide band sensors where appearances still look similar between sensors after image matching since both have similar scale, rotation, and/or the like.
According to certain embodiments, the system determines a spatial context of one or more objects. In some embodiments, the spatial context includes a spatial relationship among different objects (e.g., two or more detected objects). In certain embodiments, the spatial relationship includes a spatial graph. In some embodiments, the system includes a spatial graph matching module along with appearance and motion to more robustly fuse sensor feeds among different sensors. In certain embodiments, the system can use temporal context (e.g., the track trajectory and appearance changes over time of each sensor feed) to track objects, for example, establishing the initial tracks or assigning initial tracking IDs.
According to some embodiments, the object association model (e.g., the object association module) uses or includes an optimization algorithm (e.g., Hungarian algorithm, etc.) that takes into consideration motion and/or distance, appearance similarity, spatial context, and/or the like, to determine tracking ID assignment. In some embodiments, the object association model includes one or more association functions, for example, association functions for associating objects in different frames as one object. In the example illustrated in
According to certain embodiments, the object association model may use spatial context, the motion model (e.g., a more complex motion model), and/or a different association algorithm for fusion. In some embodiments, for low confidence image matching and/or detection track matching scenarios, the system can use image matching at neighboring frames to ensemble results since, for high frame rate videos, scenes/objects will look very similar, and objects will not move much between neighboring frames.
According to some embodiments, to further improve runtime efficiency, if objects are sparse in an image, the system can use object region aware image matching to only match images for certain parts of the images to save computation time. In certain embodiments, the system can sample templates (e.g., structured templates) only for the regions where there are objects of interest and potentially make image matching more accurate for these regions.
The computing system 500 includes a bus 502 or other communication mechanism for communicating information, a processor 504, a display 506, a cursor control component 508, an input device 510, a main memory 512, a read only memory (ROM) 514, a storage unit 516, and a network interface 518. In some embodiments, some or all processes (e.g., steps) of the methods 200 and/or 300 are performed by the computing system 500. In some examples, the bus 502 is coupled to the processor 504, the display 506, the cursor control component 508, the input device 510, the main memory 512, the read only memory (ROM) 514, the storage unit 516, and/or the network interface 518. In certain examples, the network interface is coupled to a network 520. For example, the processor 504 includes one or more general purpose microprocessors. In some examples, the main memory 512 (e.g., random access memory (RAM), cache and/or other dynamic storage devices) is configured to store information and instructions to be executed by the processor 504. In certain examples, the main memory 512 is configured to store temporary variables or other intermediate information during execution of instructions to be executed by processor 504. For example, the instructions, when stored in the storage unit 516 accessible to processor 504, render the computing system 500 into a special-purpose machine that is customized to perform the operations specified in the instructions. In some examples, the ROM 514 is configured to store static information and instructions for the processor 504. In certain examples, the storage unit 516 (e.g., a magnetic disk, optical disk, or flash drive) is configured to store information and instructions.
In some embodiments, the display 506 (e.g., a cathode ray tube (CRT), an LCD display, or a touch screen) is configured to display information to a user of the computing system 500. In some examples, the input device 510 (e.g., alphanumeric and other keys) is configured to communicate information and commands to the processor 504. For example, the cursor control component 508 (e.g., a mouse, a trackball, or cursor direction keys) is configured to communicate additional information and commands (e.g., to control cursor movements on the display 506) to the processor 504.
According to certain embodiments, a method for multiple sensor object tracking is provided. The method includes: receiving a first sensor feed and a second sensor feed from a plurality of sensors respectively, the first sensor feed including a set of first images, and the second sensor feed including a set of second images; generating an image transformation based on at least one first image in the set of first images and at least one second image in the set of second images; applying the image transformation to the set of second images; aggregating the set of first images and the set of transformed second images to generate a set of aggregated images; and applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects, wherein the method is performed using one or more processors. For example, the method is implemented according to at least
In some embodiments, the aggregating the set of first images and the set of transformed second images includes: arranging a first image in the set of first images captured at a first time and a transformed second image in the set of transformed second images captured at approximately the first time adjacent to each other. In certain embodiments, the generating an image transformation based at least in part on the first image and the second image includes: applying an image matching model to a first image in the set of first images and a second image in the set of second images to generate an image matching result; and generating the image transformation based on the image matching result. In some embodiments, the image matching model includes at least one selected from a group consisting of an angle-weighted oriented gradients (AWOGs) algorithm and a channel features of orientated gradients (CFOG) algorithm.
In certain embodiments, the method further includes: identifying a set of first detected objects from a first image in the set of first images; and identifying a set of second detected objects from a second image in the set of second images; wherein the applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects includes: associating the set of first detected objects and the set of second detected objects such that a specific object in the set of first detected objects is assigned to a specific tracking identifier and the specific object in the set of second detected objects is assigned to the specific tracking identifier. In some embodiments, the applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects includes: applying a motion model to the set of aggregated images to identify the plurality of objects across at least two images in the set of aggregated images. In certain embodiments, the applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects includes: applying an appearance model to the set of aggregated images to identify the plurality of objects across at least two images in the set of aggregated images.
In some embodiments, the applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects includes: applying a motion model to the set of aggregated images to generate a first result; applying an appearance model to the set of aggregated images to generate a second result; and applying an optimization algorithm to the first result and the second result to identify the plurality of objects across at least two images in the set of aggregated images. In certain embodiments, the method further includes: determining a spatial relationship between two objects in the set of first detected objects, wherein the applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects includes: associating the set of first detected objects and the set of second detected objects based at least in part on the spatial relationship. In some embodiments, the spatial relationship includes a spatial graph.
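One plausible reading of combining a motion-model result and an appearance-model result with an optimization algorithm is a weighted cost fusion solved by linear assignment, sketched below. The constant-velocity motion model, cosine appearance distance, and weight alpha are all assumptions of the sketch rather than requirements of the disclosure.

```python
# Sketch: fuse a motion cost (distance from a constant-velocity prediction)
# with an appearance cost (cosine distance of embeddings), then solve the
# combined assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

def motion_cost(tracks, det_centers):
    # tracks: (N, 4) rows of [x, y, vx, vy]; det_centers: (M, 2) box centers.
    predicted = tracks[:, :2] + tracks[:, 2:]  # one constant-velocity step
    return np.linalg.norm(predicted[:, None, :] - det_centers[None, :, :],
                          axis=-1)

def appearance_cost(track_embs, det_embs):
    # Cosine distance between L2-normalized appearance embeddings.
    t = track_embs / np.linalg.norm(track_embs, axis=1, keepdims=True)
    d = det_embs / np.linalg.norm(det_embs, axis=1, keepdims=True)
    return 1.0 - t @ d.T

def fuse_and_assign(tracks, det_centers, track_embs, det_embs, alpha=0.5):
    m = motion_cost(tracks, det_centers)
    m = m / (m.max() + 1e-9)                   # scale motion cost to [0, 1]
    a = appearance_cost(track_embs, det_embs)
    rows, cols = linear_sum_assignment(alpha * m + (1.0 - alpha) * a)
    return list(zip(rows, cols))
```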
According to certain embodiments, a system for multiple sensor object tracking is provided. The system includes: at least one processor; and at least one memory storing instructions that, when executed by the at least one processor, causes the system to perform a set of operations. The set of operations includes: receiving a first sensor feed and a second sensor feed from a plurality of sensors respectively, the first sensor feed including a set of first images, and the second sensor feed including a set of second images; generating an image transformation based on at least one first image in the set of first images and at least one second image in the set of second images; applying the image transformation to the set of second images; aggregating the set of first images and the set of transformed second images to generate a set of aggregated images; and applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects. For example, the system is implemented according to at least
In some embodiments, the aggregating the set of first images and the set of transformed second images includes: arranging a first image in the set of first images captured at a first time and a transformed second image in the set of transformed second images captured at approximately the first time adjacent to each other. In certain embodiments, the generating an image transformation based at least in part on the first image and the second image includes: applying an image matching model to a first image in the set of first images and a second image in the set of second images to generate an image matching result; and generating the image transformation based on the image matching result. In some embodiments, the image matching model includes at least one selected from a group consisting of an angle-weighted oriented gradients (AWOG) algorithm and a channel features of orientated gradients (CFOG) algorithm.
In certain embodiments, the set of operations further includes: identifying a set of first detected objects from a first image in the set of first images; and identifying a set of second detected objects from a second image in the set of second images, wherein the applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects includes: associating the set of first detected objects and the set of second detected objects such that a specific object in the set of first detected objects is assigned to a specific tracking identifier and the specific object in the set of second detected objects is assigned to the specific tracking identifier. In some embodiments, the applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects includes: applying a motion model to the set of aggregated images to identify the plurality of objects across at least two images in the set of aggregated images. In certain embodiments, the applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects includes: applying an appearance model to the set of aggregated images to identify the plurality of objects across at least two images in the set of aggregated images.
In some embodiments, the applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects includes: applying a motion model to the set of aggregated images to generate a first result; applying an appearance model to the set of aggregated images to generate a second result; and applying an optimization algorithm to the first result and the second result to identify the plurality of objects across at least two images in the set of aggregated images. In certain embodiments, the set of operations further includes: determining a spatial relationship between two objects in the set of first detected objects, wherein the applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects includes: associating the set of first detected objects and the set of second detected objects based at least in part on the spatial relationship. In some embodiments, the spatial relationship includes a spatial graph.
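For the spatial-graph variant, one simple realization is a pairwise-distance graph over detections, with a consistency score that checks whether candidate cross-feed correspondences preserve inter-object geometry. The disclosure does not define the graph's structure, so everything below is an assumption of the sketch.

```python
# Sketch: a "spatial graph" over detections — nodes are detections, edges
# carry pairwise center distances. A low consistency score suggests the
# candidate correspondences between two feeds preserve the scene's geometry.
import numpy as np

def spatial_graph(centers):
    """centers: (N, 2) detection centers -> (N, N) pairwise-distance matrix."""
    diff = centers[:, None, :] - centers[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def graph_consistency(graph_a, graph_b, mapping):
    # mapping: list of (i, j) candidate correspondences between the feeds.
    idx_a = [i for i, _ in mapping]
    idx_b = [j for _, j in mapping]
    sub_a = graph_a[np.ix_(idx_a, idx_a)]
    sub_b = graph_b[np.ix_(idx_b, idx_b)]
    return float(np.abs(sub_a - sub_b).mean())
```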
According to certain embodiments, a method for multiple sensor object tracking is provided. The method includes: receiving a first sensor feed and a second sensor feed from a plurality of sensors respectively, the first sensor feed including a set of first images, and the second sensor feed including a set of second images; and generating an image transformation based on at least one first image in the set of first images and at least one second image in the set of second images, wherein the generating an image transformation based at least in part on the first image and the second image includes: applying an image matching model to a first image in the set of first images and a second image in the set of second images to generate an image matching result; and generating the image transformation based on the image matching result. The method further includes: applying the image transformation to the set of second images; and aggregating the set of first images and the set of transformed second images to generate a set of aggregated images, wherein the aggregating the set of first images and the set of transformed second images includes: arranging the first image in the set of first images captured at a first time and a transformed second image in the set of transformed second images captured at approximately the first time adjacent to each other. The method further includes: applying a multiple object tracking model to the set of aggregated images to identify a plurality of objects. In some embodiments, the method is performed using one or more processors. For example, the method is implemented according to at least
For example, some or all components of various embodiments of the present disclosure each are, individually and/or in combination with at least another component, implemented using one or more software components, one or more hardware components, and/or one or more combinations of software and hardware components. In another example, some or all components of various embodiments of the present disclosure each are, individually and/or in combination with at least another component, implemented in one or more circuits, such as one or more analog circuits and/or one or more digital circuits. In yet another example, while the embodiments described above refer to particular features, the scope of the present disclosure also includes embodiments having different combinations of features and embodiments that do not include all of the described features. In yet another example, various embodiments and/or examples of the present disclosure can be combined.
Additionally, the methods and systems described herein may be implemented on many different types of processing devices by program code comprising program instructions that are executable by the device processing subsystem. The software program instructions may include source code, object code, machine code, or any other stored data that is operable to cause a processing system (e.g., one or more components of the processing system) to perform the methods and operations described herein. However, other implementations may also be used, such as firmware or even appropriately designed hardware configured to carry out the methods and implement the systems described herein.
The systems' and methods' data (e.g., associations, mappings, data input, data output, intermediate data results, final data results, etc.) may be stored and implemented in one or more different types of computer-implemented data stores, such as different types of storage devices and programming constructs (e.g., RAM, ROM, EEPROM, Flash memory, flat files, databases, programming data structures, programming variables, IF-THEN (or similar type) statement constructs, application programming interface, etc.). It is noted that data structures describe formats for use in organizing and storing data in databases, programs, memory, or other computer-readable media for use by a computer program.
The systems and methods may be provided on many different types of computer-readable media including computer storage mechanisms (e.g., CD-ROM, diskette, RAM, flash memory, computer's hard drive, DVD, etc.) that contain instructions (e.g., software) for use in execution by a processor to perform the methods' operations and implement the systems described herein. The computer components, software modules, functions, data stores and data structures described herein may be connected directly or indirectly to each other in order to allow the flow of data needed for their operations. It is also noted that a module or processor includes a unit of code that performs a software operation and can be implemented, for example, as a subroutine unit of code, or as a software function unit of code, or as an object (as in an object-oriented paradigm), or as an applet, or in a computer script language, or as another type of computer code. The software components and/or functionality may be located on a single computer or distributed across multiple computers depending upon the situation at hand.
The computing system can include client devices and servers. A client device and server are generally remote from each other and typically interact through a communication network. The relationship of client device and server arises by virtue of computer programs running on the respective computers and having a client device-server relationship to each other.
This specification contains many specifics for particular embodiments. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations, one or more features from a combination can in some cases be removed from the combination, and a combination may, for example, be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Although specific embodiments of the present disclosure have been described, it will be understood by those of skill in the art that there are other embodiments that are equivalent to the described embodiments. Accordingly, it is to be understood that the invention is not to be limited by the specific illustrated embodiments. Various modifications and alterations of the disclosed embodiments will be apparent to those skilled in the art. The embodiments described herein are illustrative examples. The features of one disclosed example can also be applied to all other disclosed examples unless otherwise indicated. It should also be understood that all U.S. patents, patent application publications, and other patent and non-patent documents referred to herein are incorporated by reference, to the extent they do not contradict the foregoing disclosure.
This application claims priority to U.S. Provisional Application No. 63/462,790, entitled “SYSTEMS AND METHODS FOR MULTIPLE SENSOR OBJECT TRACKING,” and filed on Apr. 28, 2023, which is incorporated by reference herein for all purposes in its entirety.
Number | Date | Country
---|---|---
63462790 | Apr 2023 | US