This disclosure relates generally to computer vision systems, and specifically to visual analytics for tracking and identifying persons and objects within a location.
Cognitive environments which allow personalized services to be offered to customers in a frictionless manner are highly appealing to businesses, as frictionless environments are capable of operating and delivering services without requiring the customers to actively and consciously perform special actions to make use of those services. Cognitive environments utilize contextual information along with information regarding customer emotions in order to identify customer needs. Furthermore, frictionless systems can be configured to operate in a privacy-protecting manner without intruding on the privacy of the customers through aggressive locational tracking and facial recognition, which require the use of customers' real identities.
Conventional surveillance and tracking technologies pose a significant barrier to effective implementation of frictionless, privacy-protecting cognitive environments. Current vision-based systems identify persons using high-resolution close-up images of faces, which commonly available surveillance cameras cannot produce. In addition to identifying persons using facial recognition, existing vision-based tracking systems require prior knowledge of the placement of each camera within a map of the environment in order to monitor the movements of each person. Tracking systems that do not rely on vision rely instead on beacons which monitor customers' portable devices, such as smartphones. Such systems are imprecise, and intrude on privacy by linking the customer's activity to the customer's real identity.
Various embodiments are directed to one or more unique systems, apparatuses, devices, hardware, methods, and combinations thereof for tracking and identifying people and objects within a monitored location.
According to an embodiment, a system for tracking and identifying people and objects within a monitored location may include a first camera positioned at a monitored location, the first camera being configured to capture a first video stream of a first area of interest of the monitored location. The system may also include a second camera positioned at the monitored location, the second camera being configured to capture a second video stream of a second area of interest of the monitored location, wherein the second area of interest overlaps at least a portion of the first area of interest. The system may further include an edge server positioned at the monitored location and communicatively coupled to each of the first camera and the second camera over an internal communications network at the monitored location, the edge server having a processor executing a plurality of instructions stored in memory, wherein the plurality of instructions cause the processor of the edge server to receive the first video stream from the first camera and the second video stream from the second camera, extract a first plurality of images from the first video stream, analyze each image of the first plurality of images with one or more artificial intelligence models to detect a presence of a person within the first area of interest, and generate, in response to detection of the presence of the person within the first area of interest, first metadata indicative of the presence of the person within the first area of interest. The plurality of instructions may also cause the processor of the edge server to extract a second plurality of images from the second video stream, analyze each image of the second plurality of images with the one or more artificial intelligence models to detect the presence of the person within the second area of interest, and generate, in response to detection of the presence of the person within the second area of interest, second metadata indicative of the presence of the person within the second area of interest. Additionally, the plurality of instructions may cause the processor of the edge server to infer that the person moved from the first area of interest to the second area of interest as a function of the first metadata and the second metadata, and transmit the first metadata and the second metadata to a remote processing server using an external communications network.
In some embodiments, to infer that the person moved from the first area of interest to the second area of interest as a function of the first and second metadata may include to analyze the first metadata and the second metadata with the one or more artificial intelligence models.
In some embodiments, the one or more artificial intelligence models include a machine learning model.
In some embodiments, the machine learning model is a neural network.
In some embodiments, the plurality of instructions further cause the processor of the edge server to analyze each image of the first plurality of images with the one or more artificial intelligence models to detect the presence of an object within the first area of interest, generate, in response to detection of the presence of the object within the first area of interest, third metadata indicative of the presence of the object within the first area of interest, and transmit the third metadata to the remote processing server using the external communications network.
In some embodiments, the plurality of instructions further cause the processor of the edge server to analyze the first metadata and the third metadata with the one or more artificial intelligence models to determine whether the person detected in the first area of interest is interacting with the object detected in the first area of interest.
In some embodiments, the remote processing server may include a processor executing a plurality of instructions stored in memory, wherein the plurality of instructions cause the processor of the remote processing server to receive the first metadata and the second metadata from the edge server, analyze the first metadata and the second metadata with the one or more artificial intelligence models to predict a future event at the monitored location, and generate an alert indicative of the predicted future event at the monitored location.
In some embodiments, the plurality of instructions further cause the processor of the edge server to adaptively increase or decrease a number of the first plurality of images extracted from the first video stream as a function of the analysis of each image of the first plurality of images with the one or more artificial intelligence models.
In some embodiments, the first plurality of images extracted from the first video stream is less than a total number of images making up the first video stream.
In some embodiments, the plurality of instructions further cause the processor of the edge server to store the first video stream in a data storage device of the edge server, determine, based on the analysis of each image of the first plurality of images, that additional images are required to detect the presence of the person within the first area of interest, retrieve a portion of the first video stream from the data storage device, the portion of the first video stream being associated with the first plurality of images, extract a third plurality of images from the retrieved portion of the first video stream, wherein the third plurality of images extracted from the portion of the first video stream includes more extracted images than the first plurality of images, and analyze each image of the third plurality of images with the one or more artificial intelligence models to detect the presence of the person within the first area of interest.
In some embodiments, the visual analytics system further includes a global configuration server communicatively coupled to the edge server with the external network and remotely positioned from the monitored location, wherein the global configuration server includes a processor executing a plurality of instructions stored in memory, wherein the plurality of instructions cause the processor of the global configuration server to transmit the one or more artificial intelligence models to the edge server.
According to another embodiment, a method for tracking and identifying people and objects within a monitored location may include capturing, by a first camera positioned at a monitored location, a first video stream of a first area of interest of the monitored location, and capturing, by a second camera positioned at the monitored location, a second video stream of a second area of interest of the monitored location, wherein the second area of interest overlaps at least a portion of the first area of interest. In such an embodiment, the method may further include receiving, by an edge server positioned at the monitored location and communicatively coupled to each of the first camera and the second camera over an internal communications network, the first video stream from the first camera and the second video stream from the second camera, extracting, by the edge server, a first plurality of images from the first video stream, analyzing, by the edge server, each image of the first plurality of images with one or more artificial intelligence models to detect a presence of a person within the first area of interest, and generating, by the edge server and in response to detecting the presence of the person within the first area of interest, first metadata indicative of the presence of the person within the first area of interest. Such method may also include extracting, by the edge server, a second plurality of images from the second video stream, analyzing, by the edge server, each image of the second plurality of images with the one or more artificial intelligence models to detect the presence of the person within the second area of interest, generating, by the edge server and in response to detecting the presence of the person within the second area of interest, second metadata indicative of the presence of the person within the second area of interest, inferring, by the edge server, that the person moved from the first area of interest to the second area of interest as a function of the first metadata and the second metadata, and transmitting, by the edge server, the first metadata and the second metadata to a remote processing server using an external communications network.
In some embodiments, inferring that the person moved from the first area of interest to the second area of interest as a function of the first and second metadata may include analyzing the first metadata and the second metadata with the one or more artificial intelligence models.
In some embodiments, the one or more artificial intelligence models include a machine learning model.
In some embodiments, the machine learning model is a neural network.
In some embodiments, the method may further include analyzing, by the edge server, each image of the first plurality of images with the one or more artificial intelligence models to detect the presence of an object within the first area of interest, generating, by the edge server and in response to detecting the presence of the object within the first area of interest, third metadata indicative of the presence of the object within the first area of interest, and transmitting, by the edge server, the third metadata to the remote processing server using the external communications network.
In some embodiments, the method may further include analyzing, by the edge server, the first metadata and the third metadata with the one or more artificial intelligence models to determine whether the person detected in the first area of interest is interacting with the object detected in the first area of interest.
In some embodiments, the method may further include receiving, by the remote processing server, the first metadata and the second metadata from the edge server, analyzing, by the remote processing server, the first metadata and the second metadata with the one or more artificial intelligence models to predict a future event at the monitored location, and generating, by the remote processing server, an alert indicative of the predicted future event at the monitored location.
In some embodiments, the method may further include adaptively increasing or decreasing, by the edge server, a number of the first plurality of images extracted from the first video stream as a function of analyzing each image of the first plurality of images with the one or more artificial intelligence models.
In some embodiments, the first plurality of images extracted from the first video stream is less than a total number of images making up the first video stream.
In some embodiments, the method may further include storing, by the edge server, the first video stream in a data storage device of the edge server, determining, by the edge server and based on the analysis of each image of the first plurality of images, that additional images are required to detect the presence of the person within the first area of interest, and retrieving, by the edge server, a portion of the first video stream from the data storage device, the portion of the first video stream being associated with the first plurality of images. In such embodiments, the method may further include extracting, by the edge server, a third plurality of images from the retrieved portion of the first video stream, wherein the third plurality of images extracted from the portion of the first video stream includes more extracted images than the first plurality of images, and analyzing, by the edge server, each image of the third plurality of images with the one or more artificial intelligence models to detect the presence of the person within the first area of interest.
In some embodiments, the method may further include transmitting, by a global configuration server, the one or more artificial intelligence models to the edge server, wherein the global configuration server is communicatively coupled to the edge server with the external network and remotely positioned from the monitored location.
Various embodiments will become better understood with regard to the following description, appended claims and accompanying drawings wherein:
Various non-limiting embodiments of the present disclosure will now be described to provide an overall understanding of the principles of the structure, function, and use of the apparatuses, systems, methods, and processes disclosed herein. One or more examples of these non-limiting embodiments are illustrated in the accompanying drawings, wherein like numbers indicate the same or corresponding elements throughout the views. Those of ordinary skill in the art will understand that systems and methods specifically described herein and illustrated in the accompanying drawings are non-limiting embodiments. The features illustrated or described in connection with one non-limiting embodiment may be combined with the features of other non-limiting embodiments. Such modifications and variations are intended to be included within the scope of the present disclosure.
Reference throughout the specification to “various embodiments,” “some embodiments,” “one embodiment,” “some example embodiments,” “one example embodiment,” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with any embodiment is included in at least one embodiment. Thus, appearances of the phrases “in various embodiments,” “in some embodiments,” “in one embodiment,” “some example embodiments,” “one example embodiment,” or “in an embodiment” in places throughout the specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined in any suitable manner in one or more embodiments.
The examples discussed herein are examples only and are provided to assist in the explanation of the apparatuses, devices, systems, and methods described herein. None of the features or components shown in the drawings or discussed below should be taken as mandatory for any specific implementation of any of these apparatuses, devices, systems, or methods unless specifically designated as mandatory. For ease of reading and clarity, certain components, modules, or methods may be described solely in connection with a specific figure. Any failure to specifically describe a combination or sub-combination of components should not be understood as an indication that any combination or sub-combination is not possible. Also, for any methods described, it should be understood that unless otherwise specified or required by context, any explicit or implicit ordering of steps performed in the execution of a method does not imply that those steps must be performed in the order presented but instead may be performed in a different order or in parallel.
Additionally, it should be appreciated that items included in a list in the form of “at least one of A, B, and C” can mean (A); (B); (C); (A and B); (B and C); (A and C); or (A, B, and C). Similarly, items listed in the form of “at least one of A, B, or C” can mean (A); (B); (C); (A and B); (B and C); (A and C); or (A, B, and C). Further, with respect to the claims, the use of words and phrases such as “a,” “an,” “at least one,” and/or “at least one portion” should not be interpreted so as to be limiting to only one such element unless specifically stated to the contrary, and the use of phrases such as “at least a portion” and/or “a portion” should be interpreted as encompassing both embodiments including only a portion of such element and embodiments including the entirety of such element unless specifically stated to the contrary.
The disclosed embodiments may, in some cases, be implemented in hardware, firmware, software, or a combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. A machine-readable storage medium may be embodied as any storage device, mechanism, or other physical structure for storing or transmitting information in a form readable by a machine (e.g., a volatile or non-volatile memory, a media disc, or other media device).
Referring now to
It should further be understood that, unless otherwise specifically limited, any of the computing elements of the present invention may be implemented in cloud-based or cloud computing environments. As used herein and further described below in reference to the cloud-based system 108, “cloud computing”—or, simply, the “cloud”—is defined as a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned via virtualization and released with minimal management effort or service provider interaction, and then scaled accordingly. Cloud computing can be composed of various characteristics (e.g., on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, etc.), service models (e.g., Software as a Service (“SaaS”), Platform as a Service (“PaaS”), Infrastructure as a Service (“IaaS”), etc.), and deployment models (e.g., private cloud, community cloud, public cloud, hybrid cloud, etc.). Often referred to as a “serverless architecture,” a cloud execution model generally includes a service provider dynamically managing an allocation and provisioning of remote servers for achieving a desired functionality.
It should be understood that any of the computer-implemented components, modules, or servers described in relation to
As disclosed herein, the illustrative visual analytics system 100 is capable of identifying and tracking persons or objects within or around a monitored location 101 using existing or additional infrastructure. The monitored location 101 can be, for example, a retail store or shopping mall, a healthcare facility (e.g., hospital), an amusement park, a warehouse, a distribution center, a factory or plant, an office complex, a military installation, a gas station, a parking facility or parking lot, an apartment building or other residential space, or any other place where there is interest in monitoring and analyzing movement and activity of people, items, or vehicles within, throughout, and/or around a space. It should be appreciated that although only one monitored location 101 is illustratively depicted in
As illustratively shown in
In some embodiments, one or more of the cameras 102 are high-resolution cameras, e.g., having high definition (HD) resolution or better. The cameras 102 can be indoor or outdoor cameras. The cameras 102 can also be black and white, color, passive infrared, active infrared, or sensitive to other spectral wavelengths. In some embodiments, the cameras 102 are three-channel color and provide RGB (red-green-blue) video data. The cameras 102 can be configured, for example, to deliver video as digital data streams using, e.g., H.264 or H.265 video encoding. The video streams can be delivered to other computing devices using, for example, Internet Protocol (IP). Additionally, in some embodiments, the Real-Time Streaming Protocol (RTSP) can be used to deliver the video streams generated by the cameras 102 to one or more other computing devices of the monitored location 101. In some examples, the cameras 102 are configured for wireless video data transmission, and use, for example, WiFi to deliver their video streams.
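By way of a non-limiting illustration, the following Python sketch shows one way a computing device of the monitored location 101 might ingest an H.264/H.265 video stream from a camera 102 over RTSP using the OpenCV library; the stream URL, credentials, and frame handling shown are hypothetical assumptions rather than required features of any embodiment.

# Minimal sketch: reading an H.264/H.265 RTSP stream from an IP camera with OpenCV.
# The URL, credentials, and endpoint path below are hypothetical placeholders.
import cv2

RTSP_URL = "rtsp://user:password@192.168.1.20:554/stream1"  # hypothetical camera endpoint

def read_stream(url: str) -> None:
    cap = cv2.VideoCapture(url)          # OpenCV decodes H.264/H.265 via its FFmpeg backend
    if not cap.isOpened():
        raise RuntimeError(f"Could not open stream: {url}")
    try:
        while True:
            ok, frame = cap.read()       # frame is a BGR image (numpy array)
            if not ok:
                break                    # stream ended or dropped
            # ... hand the frame to downstream analysis ...
    finally:
        cap.release()

if __name__ == "__main__":
    read_stream(RTSP_URL)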
The monitored location 101 can include an internal communications network 103, which can be a local area network (LAN) or any other type of network interconnecting computing devices of the monitored location 101. The internal communications network 103 can include one or more switches, routers, access points, or any other suitable equipment or infrastructure configured to enable wired or wireless communications between devices of the monitored location 101.
In some embodiments, the monitored location 101 can include additional sensors 162 or devices 172 configured to generate information for tracking and identifying people and objects within or around the monitored location 101. For example, the monitored location 101 can include one or more Radio Frequency Identification (RFID) devices, Near Field Communication (NFC) devices, temperature sensors, humidity sensors, light sensors, weight sensors, or sound sensors. In such embodiments, the information generated from the additional sensors 162 or devices 172 can be used by the edge server 104 or devices of the cloud-based system 108 to identify and track people and objects within or around the monitored location 101.
The monitored location 101 also includes an edge server 104, which is configured to identify and track persons and objects within or around the monitored location 101. The edge server 104 is a computer system having one or more AI acceleration units, such as graphics processing units (GPUs) or tensor processing units (TPUs), to provide AI processing capabilities. As illustratively shown in
As discussed herein, each camera 102 of the monitored location 101 is configured to capture or generate a video stream (i.e., a sequence of individual image frames) of an area within or around the monitored location 101. The video stream generated by each of the cameras 102 is streamed or transmitted to the edge server 104 via the internal network 103, which in some embodiments may occur in real-time or near real-time. The edge server 104 is configured to analyze the video streams generated by the cameras 102. To do so, the edge server 104 extracts individual image frames from each of the video streams generated by the cameras 102, and then analyzes the extracted image frames using one or more artificial intelligence (AI) methodologies or models to, for example, identify and track any number of in-frame people or objects (e.g., items, vehicles, etc.) present within or around the monitored location 101. To do so, in some embodiments, the edge server 104 is configured to analyze the extracted image frames using one or more machine learning models (e.g., neural networks, deep learning neural networks, convolutional neural networks, etc.). Additionally, the edge server 104 can also use AI models that are configured or otherwise trained to not only identify the presence of a person or an object, but also to determine or infer poses, gazes, gestures, behaviors, movement, and/or actions of the detected people or objects. As illustratively shown in
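As a non-limiting sketch of the frame-extraction and analysis loop described above, the following Python example samples a few frames per second from a stored or streamed video and applies a person detector to each sampled frame. A classical HOG pedestrian detector is substituted here for the neural network models of the edge server 104 so that the sketch runs with OpenCV alone; the video path and sampling rate are illustrative assumptions.

# Minimal sketch: extract a subset of frames and detect people in each sampled frame.
import cv2
import numpy as np

def detect_people(video_path: str, frames_to_analyze_per_second: int = 3):
    # classical HOG pedestrian detector standing in for the neural models
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0       # fall back if FPS is unknown
    step = max(int(native_fps // frames_to_analyze_per_second), 1)

    detections, frame_index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break                                        # end of stream
        if frame_index % step == 0:                      # analyze only a subset of frames
            boxes, weights = hog.detectMultiScale(frame, winStride=(8, 8))
            for (x, y, w, h), score in zip(boxes, np.ravel(weights)):
                detections.append({"frame": frame_index,
                                   "bbox": [int(x), int(y), int(w), int(h)],
                                   "score": float(score)})
        frame_index += 1
    cap.release()
    return detections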
In some embodiments, the edge server 104 is configured to generate metadata based at least in part on, or otherwise as a function of, the output of the AI models. The generated metadata can be used by the edge server 104 to make inferences as to the presence of a person or object at the monitored location 101 as well as the behavior or actions of the detected person or objects at the monitored location 101. In some embodiments, the generated metadata can also be used as an input to other AI models for generating subsequent inferences or for further training of the utilized AI models. For example, the generated metadata can be used by the edge server 104 to determine or infer the location of a person or an object within or around the monitored location 101, determine or infer an action of a detected person in connection with a detected object (e.g., a detected person picked up a disposable cup, walked to a beverage station, and filled the cup with ice), determine or infer physical characteristics of a detected person or object (e.g., the person's height, etc.), or determine or infer any other relevant information.
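The following Python sketch illustrates, in a non-limiting way, the kind of metadata record that could be generated from a single detection and transmitted upstream in place of raw video; the field names, the foot-point heuristic, and the camera identifier are hypothetical assumptions, not a required schema.

# Minimal sketch: turn one detection into a compact metadata record (no image pixels).
import json
import time
from typing import Dict, Sequence

def detection_to_metadata(camera_id: str, bbox: Sequence[int],
                          label: str, confidence: float) -> Dict:
    x, y, w, h = bbox
    return {
        "camera_id": camera_id,
        "timestamp": time.time(),          # capture time of the analyzed frame
        "label": label,                    # e.g. "person" or "object"
        "confidence": round(confidence, 3),
        "bbox": [int(x), int(y), int(w), int(h)],
        # a foot point in image coordinates, useful later for floor-plane projection
        "foot_point": [int(x + w / 2), int(y + h)],
    }

record = detection_to_metadata("cam-entrance-01", (120, 80, 60, 180), "person", 0.91)
print(json.dumps(record))                  # compact payload sent over the external network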
As disclosed herein, the edge server 104 is configured to extract individual image frames from each of the video streams generated by the cameras 102, and analyze those image frames using one or more AI models. In some embodiments, the edge server 104 is configured to extract and analyze only a subset of the image frames making up a video stream. For example, the cameras 102 may be configured to generate video streams each containing 24 or 30 (or more) frames per second. In such cases, the edge server 104 can be configured to extract and analyze only three frames per second, thereby conserving computing resources (e.g., CPU, GPU, memory, bandwidth, etc.) of the edge server 104 or the internal network 103. Additionally or alternatively, the edge server 104 can be configured to downsample or otherwise reduce the resolution of the extracted image frames. In some embodiments, the edge server 104 may determine that it needs to extract and analyze more than three frames per second. For example, the edge server 104 may detect a quick movement made by a person or object. In such cases, the edge server 104 can retrieve the corresponding portion of the video stream containing the detected quick movement from a video archive (e.g., a storage device or memory), extract a larger number of individual image frames from the video stream, and re-analyze the larger number of image frames with the one or more AI models. It should be appreciated that the video streams generated by the cameras 102 can include any number of frames per second at any resolution. Further, the edge server 104 can be configured to extract and analyze any number of frames per second, and then adaptively adjust the number of frames extracted and analyzed based on any suitable determination.
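A non-limiting Python sketch of this adaptive behavior follows: the analysis rate is raised when frame-to-frame change suggests quick movement, and an archived segment can be re-read at full frame rate for a denser second pass. The thresholds, rates, and archive interface are illustrative assumptions.

# Minimal sketch: adapt the analysis rate to motion, and re-extract an archived segment densely.
import cv2
import numpy as np

def choose_analysis_rate(prev_frame: np.ndarray, frame: np.ndarray,
                         base_fps: int = 3, max_fps: int = 15,
                         motion_threshold: float = 12.0) -> int:
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    motion = float(np.mean(cv2.absdiff(prev_gray, gray)))  # mean pixel change between frames
    return max_fps if motion > motion_threshold else base_fps

def reanalyze_archived_segment(archive_path: str, start_s: float, end_s: float):
    """Re-extract every frame of an archived segment for a second, denser analysis pass."""
    cap = cv2.VideoCapture(archive_path)
    cap.set(cv2.CAP_PROP_POS_MSEC, start_s * 1000.0)        # seek to the segment start
    frames = []
    while cap.get(cv2.CAP_PROP_POS_MSEC) <= end_s * 1000.0:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)                                # keep every frame this time
    cap.release()
    return frames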
The cloud-based system 108 (or cloud computing service 108) can be a collection of one or more physical or virtual computing devices positioned remotely from the monitored location 101. In some embodiments, such as the one shown in
The remote processing server 180 can be configured to make inferences based at least in part on, or otherwise as a function of, the analyzed metadata. For example, the remote processing server 180 can be configured to analyze the metadata generated and provided by the edge server 104 and predict the future occurrence of an event at the monitored location 101. In such examples, the remote processing server 180 can be configured to generate an alert or message, which can be displayed to a user of the visual analytics system 100 via the remote computing device 156. Additionally or alternatively, the remote processing server 180 can use the metadata received from the edge server 104 for further training or adjustment of the one or more AI models.
As illustratively shown in
Additionally, the cloud-based system 108 can include a web server 390, which can be accessed by the remote computing device 156 or any other computing device of the visual analytics system 100. In some embodiments, the web server 390 can provide information or alerts concerning the metadata provided by the edge server 104. The web server 390 can also be configured to facilitate the configuration, control, and maintenance of devices of the monitored location 101. It should be appreciated that any of the systems and services making up the cloud-based system 108 (e.g., the remote processing server 180, the global configuration control server 182, etc.) can include the artificial intelligence system 380, machine learning system 382, computer vision system 386, learning model system 388, and/or web server 390.
Referring back to
Referring now to
The illustrative method 400 begins with block 402 in which the edge server 104 receives a video stream captured by a camera 102 at a monitored location 101. In block 404, the edge server 104 extracts individual images from the received video stream. In some embodiments, the edge server 104 may extract a subset of the total number of individual image frames that make up the video stream captured by the camera 102. In block 406, the edge server 104 analyzes each image extracted from the received video stream using one or more artificial intelligence models. In block 408, the edge server 104 detects whether a person or an object is present within the images extracted from the video stream as a function of the one or more artificial intelligence models. If, in block 410, the edge server 104 determines that a person or an object is not present within the images, the method 400 starts over. If, however, the edge server 104 determines in block 410 that a person or an object is present within the one or more images, the method advances to block 412. In block 412, the edge server 104 generates metadata indicative of the presence of the person or object within the extracted images. In some embodiments of the method 400, the edge server 104 receives a second video stream captured by a second camera 102 at the monitored location 101. In such embodiments, the edge server 104 extracts individual image frames from the second video stream, and analyzes each of the extracted images from the second video stream using the one or more artificial intelligence models. The edge server 104 then detects whether a person or an object is present within the images extracted from the second video stream as a function of the one or more artificial intelligence models. If the edge server 104 determines that a person or an object is detected in the second video stream, the edge server 104 utilizes the one or more artificial intelligence models to infer whether the person or object detected in the second video stream is the same as the person or object detected in the first video stream.
As discussed herein, visual analytics can be derived, using computer vision and machine learning (ML) techniques or other artificial intelligence (AI) techniques, from video camera data streams sourced from video cameras 102 placed throughout a location (e.g., a monitored location 101). The location 101 can be, as examples, a retail store or shopping mall, a healthcare facility (e.g., hospital), an amusement park, a warehouse, a distribution center, a factory or plant, an office complex, a military installation, a gas station, a parking facility or parking lot, an apartment building or other residential space, or any other place where there is interest in monitoring and analyzing movement and activity of people, items, or vehicles within, throughout, and/or around a space. The goals of visual analytics generally relate to quantifying flow patterns, activities, and behaviors, on both the individual level and in the aggregate, to improve economies and efficiencies, to generate and test hypotheses, and to timely and meaningfully address threats, annoyances, and aggravations that can singly or cumulatively impact, depending on the context, business goals, quality of service, quality of care, or quality of life. Advantageously, the visual analytics disclosed herein can be derived from cameras 102 already in place for security or loss prevention purposes, meaning that the visual analytics system 100 can leverage existing video surveillance infrastructure without substantial additional investment cost in camera and data network equipment.
As discussed herein and referring back to
The edge server 104 can process the video streams using deep neural networks as well as other artificial intelligence (AI) methodologies. In some examples, still image frames are extracted from streams from multiple cameras 102, and machine learning (ML) models detect and/or identify in-frame people, items, vehicles, and so forth. Some of the ML models can be trained, for example, to detect people, the objects that the people interact with, and/or vehicles, with the overall goal of trying to understand what people and/or vehicles shown in the video streams are doing, and to quantify their actions and behaviors both on the individual level and in the aggregate. Others of the ML models can be trained to perform functions other than detection, such as determining poses, gazes, gestures, and/or actions of the detected people. Different time steps and camera views can be merged to form a three-dimensional view of a scene and to obtain tracks representing the spatiotemporal motions of people, without using facial recognition or personally identifiable information (PII) about the people in the scene. Privacy regulations in many legal jurisdictions require that individuals not be identified with PII. Therefore, in some examples, appearance may be used as but one factor in tracking people in scenes across multiple different camera views. Foveal techniques can be used to ascertain details that resolve ambiguities. Derived metadata indicative of paths, actions, gestures, and behaviors can be analyzed for anomalies and signatures. Alerts, reports, recommendations, and orders can be generated based on the analysis of the metadata. Metadata can be accumulated, stored, and subsequently used in simulated testing to address hypotheses related to the uses of the spaces. A large host of questions and problems can thereby be addressed with quantifiable certainty not possible, or not practicable within budget and time constraints, in the absence of the visual analytics.
A location 101 can be initially set up for the visual analytics system 100 with (1) the installation of cameras 102 and other location devices (e.g., sensors 162 of various types, RFID readers, electronic locks or other access controls, lighting, and environmental controls, or any other device 172 configured to generate data indicative of objects or persons within or around the monitored location 101), (2) installation of an edge server 104 on the location premises, (3) connection of the cameras 102 and other location devices to the edge server 104 (e.g., over a local area network (LAN) 103, either wired or wirelessly), (4) software provisioning and configuration of the edge server 104, cameras 102, and other location devices, which can be done remotely by the cloud-based system 108 using a web-based interface of a configuration control platform (e.g., the global configuration control center 182), (5) calibration and registration of the cameras 102, (6) designation of areas of interest, and/or (7) definitions of actions. In some embodiments, the cameras 102 are optimally positioned to ensure adequate visual coverage within or around the monitored location 101. Additionally or alternatively, the cameras 102 can be positioned at the monitored location 101 according to a template or floor plan.
As discussed herein, in some examples, the cameras 102 are high-resolution cameras 102, e.g., having high definition (HD) resolution or better. The cameras 102 can be indoor or outdoor cameras. The cameras 102 can be black and white, color, passive infrared, active infrared, or sensitive to other spectral wavelengths. In some examples, the cameras 102 are three-channel color and provide RGB (red-green-blue) video data. The cameras 102 are configured, for example, to deliver video as digital data streams using, e.g., H.264 or H.265 video encoding. The video streams can be delivered using, for example, internet protocol (IP). In some examples, the cameras 102 are configured for wireless video data transmission, and use, for example, WiFi to deliver their video streams. In some examples, cameras 102 and other sensors 162, some of which have IP communication capability, will have already been installed at the location 101, and can be incorporated into the visual analytics system 100. In some such examples, it will be desirable to supplement existing surveillance infrastructure with additional cameras 102 and/or other sensors 162 or devices 172. Installed cameras 102 can have their positions, orientations, fields of view, and focuses adjusted and tuned to provide information needed for the visual analytics system 100. A floor plan of a location 101 (e.g., a retail store) or, in some examples, a 2.5D or 3D model of the location 101, can serve as a guide for planning camera installation locations, including camera 102 angles and fields of view.
In examples, the edge server 104 is a computer system having one or more AI acceleration units, such as graphics processing units (GPUs) or tensor processing units (TPUs), to provide AI processing capabilities. The edge server 104 may be behind one or more security firewalls and/or network address translation (NAT) firewalls. The cameras 102 and other sensors 162 and devices 172, if any, can be configured to connect to the edge server 104 to provide data to the edge server 104, including, in the case of the cameras 102, video streams depicting various views of the location 101 provided by the cameras 102. In some examples, a location 101 is equipped with a single edge server 104. In other examples, a location 101 is equipped with multiple edge servers 104.
In embodiments, each camera 102 of the visual analytics system 100 is calibrated. Camera calibration involves mapping the correspondences between the cameras 102 of different views and can be performed, in some examples, using structural features found in the location 101, or in other examples, from people or objects in a scene. All cameras 102 in a location 101 can be calibrated automatically, manually, or a combination of automatically and manually. To compute the calibrations for the cameras 102 automatically, the geometrical properties of the scene, e.g., floor tile patterns, floor tile sizes, known heights of objects, or intersections of the walls, can be relied upon by the edge server 104 (or other computing devices of the visual analytics system 100). For example, grout lines between floor tiles in view of each camera 102 can be detected and intersections of tile lines can provide reference points that can be used to determine the intrinsics and the extrinsics of each camera 102. In one method of camera calibration, the structure that is found in between the tiles on a floor—e.g., a grid of lines—can be used. A hypothetical grid of lines can be drawn in a model of the space. The hypothetical grid can then be fitted to the detected edges in between the tiles on the floor. As a result of this fitting, the 3D locations of the cameras 102 become known. Intersections at corners of walls can also be used, instead of floor tile points.
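By way of non-limiting illustration, the Python sketch below relates a camera view to the floor plane from a handful of grout-line intersections with known grid positions, yielding a homography that maps image pixels to floor coordinates; the tile pitch and the point coordinates are synthetic assumptions rather than measured values.

# Minimal sketch: fit an image-to-floor homography from tile-grid intersections.
import cv2
import numpy as np

TILE_SIZE_M = 0.6                                # assumed tile pitch in metres

# image-plane pixel coordinates of detected tile intersections (hypothetical values)
image_points = np.array([[412, 655], [590, 640], [402, 520], [556, 510]], dtype=np.float32)
# the same intersections expressed on the floor grid, in metres
floor_points = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=np.float32) * TILE_SIZE_M

H, _ = cv2.findHomography(image_points, floor_points)

def pixel_to_floor(u: float, v: float) -> np.ndarray:
    """Map an image pixel to floor coordinates (metres) using the fitted homography."""
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]

print(pixel_to_floor(500, 600))                  # floor position of an example pixel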
Because some locations 101 do not have floor tiles, calibration methods that rely on floor tile detection are not reliable in all cases. Accordingly, in another method of camera calibration, people or objects can be used as calibration targets. Such a calibration method can look for key points on the human body. There are, for example, 17 important joint locations, any one or more of which can be detected during the calibration process. Once people are detected in the scene, the people can be re-identified in different views. Pose estimation can be performed in each view to find the joint locations on the people, and then the joint locations can be used as calibration targets. For different camera views, the joint locations can be made the same at 3D points in physical space, giving accurate camera calibration.
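The following non-limiting Python sketch illustrates the underlying geometry: matched body-joint locations seen in two views are used to recover the relative rotation and translation between the cameras via the essential matrix. The joints, intrinsics, and camera poses are synthetic stand-ins for the pose-estimation outputs described above.

# Minimal sketch: relative camera pose from matched joint keypoints in two views.
import cv2
import numpy as np

K = np.array([[900.0, 0, 640], [0, 900.0, 360], [0, 0, 1]])    # assumed camera intrinsics

# synthetic 3D joint locations (e.g. 17 keypoints of a detected person)
joints_3d = np.random.default_rng(0).uniform([-1, -1, 4], [1, 1, 8], size=(17, 3))

def project(points_3d, rvec, tvec):
    pts, _ = cv2.projectPoints(points_3d, rvec, tvec, K, None)
    return pts.reshape(-1, 2)

# view A: reference camera; view B: rotated and translated relative to A
pts_a = project(joints_3d, np.zeros(3), np.zeros(3))
rvec_b = np.array([0.0, 0.3, 0.0])                             # roughly 17 degrees of yaw
tvec_b = np.array([-0.8, 0.0, 0.1])
pts_b = project(joints_3d, rvec_b, tvec_b)

# recover the relative pose of camera B from the joint correspondences alone
E, _ = cv2.findEssentialMat(pts_a, pts_b, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts_a, pts_b, K)
print("recovered rotation:\n", R)
print("recovered translation direction:", t.ravel())           # translation is up to scale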
As disclosed herein, the visual analytics system 100 may be used across multiple locations 101 (e.g., stores) with identical or highly similar layouts. In some examples, a full calibration can be performed for a first location 101, and then the relationships can be transferred to cameras 102 of a second location 101 of similar or identical layout using an image matching technique.
Calibration of a single camera 102 (based, for example, on grout lines) provides information about how that camera 102 sits in a local coordinate system. To relate all the local coordinate systems to each other, corresponding points are identified in different camera views. These points can also be related to a floor plan (e.g., CAD model, etc.) of the location 101, which provides an additional correspondence. For example, by also identifying the same wall corner in the floor plan, the 3D camera coordinate system can be related to the 2D floor plan coordinate system.
Camera layouts may be similar, but not exactly the same, from location 101 to location 101 (e.g., from store to store). Using just the camera image from the first location 101, an image matching technique can be used to identify the image-to-image relationships of camera images from the first location 101 and camera images from the second location (not shown). The image-to-image relationships permit matching the tile grids to each other. Operating under the assumption that the first and second locations 101 are built to the same architectural plan can obviate the need to do cross-camera relating of correspondence points in the second location 101.
In some embodiments, the edge server 104 and/or a device of the cloud-based system 108 (e.g., the global configuration control server 182 or the remote processing server 180) can be configured to enable a system installer to view, e.g., via a personal computing device, a live view of the camera 102 being installed with a semi-transparent template image superimposed thereon. The semi-transparent template image may include an image captured by a camera 102 previously installed at another location 101 having the same or a substantially similar floor plan. The system installer can use the live view of the camera 102 being installed and the superimposed image to adjust the position of the camera 102 so that it matches the previously-installed camera 102. Additionally or alternatively, the edge server 104 and/or a device of the cloud-based system 108 can provide an installer a live view of the camera 102 being installed and superimpose directional arrows instructing the installer where to move the camera 102 being installed. A template image depicting the correct positioning of a camera 102 can also be used for maintenance and security operations. For example, the edge server 104 and/or a device of the cloud-based system 108 can run a difference comparison model or any other suitable AI model that compares the current view of the camera 102 to the template view of the camera 102. If the edge server 104 and/or device of the cloud-based system 108 detect a difference between the current view of the camera 102 and the template view, an alert to maintenance or security personnel can be generated. Such functionality can be used by the analytics system 100 to determine, for example, whether the camera 102 was knocked out of its optimal position, whether the images being captured by the camera 102 are blurry, or whether the view of the camera 102 is obstructed (e.g., due to spray painting, covering, or other forms of vandalism).
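A non-limiting Python sketch of such a difference comparison follows; it flags a camera 102 when too large a fraction of the current view differs from the stored template view, which may indicate that the camera was moved, blurred, or obstructed. The blur kernel, difference threshold, and changed-pixel limit are illustrative assumptions rather than a trained AI model.

# Minimal sketch: compare the current camera view against its template view.
import cv2
import numpy as np

def camera_view_ok(current_bgr: np.ndarray, template_bgr: np.ndarray,
                   changed_fraction_limit: float = 0.35) -> bool:
    current = cv2.cvtColor(current_bgr, cv2.COLOR_BGR2GRAY)
    template = cv2.cvtColor(template_bgr, cv2.COLOR_BGR2GRAY)
    template = cv2.resize(template, (current.shape[1], current.shape[0]))

    diff = cv2.absdiff(cv2.GaussianBlur(current, (5, 5), 0),
                       cv2.GaussianBlur(template, (5, 5), 0))
    changed = np.count_nonzero(diff > 40) / diff.size   # fraction of noticeably changed pixels
    return changed < changed_fraction_limit             # False -> raise a maintenance alert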
As illustratively shown in
To avoid the need to repeat steps 1506 and 1508 at every location 101, these steps can be performed once for a template location 1520, and then the results can be reused at other locations 1518 of the same architectural design, which can be referred to as query locations 1518. At a query location 1518, the cameras 102 are installed according to the same plan as at the template location 1520. There may be small differences in the placement and orientation of the cameras 102 between locations 101, 1518, 1520. These differences can be compensated for using the following process. For each pair of corresponding cameras 102: (a) points are matched across the images using a dense image feature matching algorithm (e.g., LoFTR), which is a robust process due to the differences between the two images being minor; and (b) an arbitrary floor point is mapped from one image to the other by interpolating among the nearby feature matches from step (a). By relating a small number of floor points across the images, the relationship between the two camera-local coordinate systems 1512, 1516 can be determined. Thus, the camera image at the query location 1518 is transitively connected to all of the coordinate systems of the template location 1520.
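As a non-limiting illustration of steps (a) and (b), the Python sketch below matches features between a template-location camera image and the corresponding query-location image and transfers a floor point from one to the other. ORB matching and a single RANSAC homography are substituted for the dense matcher (e.g., LoFTR) and the local interpolation named above so that the sketch runs with OpenCV alone.

# Minimal sketch: map a floor point from a template camera image to a query camera image.
import cv2
import numpy as np

def transfer_floor_point(template_img: np.ndarray, query_img: np.ndarray,
                         floor_point_uv: tuple) -> np.ndarray:
    gray_t = cv2.cvtColor(template_img, cv2.COLOR_BGR2GRAY)
    gray_q = cv2.cvtColor(query_img, cv2.COLOR_BGR2GRAY)

    orb = cv2.ORB_create(nfeatures=2000)
    kp_t, des_t = orb.detectAndCompute(gray_t, None)
    kp_q, des_q = orb.detectAndCompute(gray_q, None)

    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_t, des_q), key=lambda m: m.distance)[:200]

    src = np.float32([kp_t[m.queryIdx].pt for m in matches])
    dst = np.float32([kp_q[m.trainIdx].pt for m in matches])
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)    # template -> query mapping

    u, v = floor_point_uv
    p = H @ np.array([u, v, 1.0])
    return p[:2] / p[2]                                     # floor point in the query image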
In some embodiments of camera calibration, the edge server 104 (or another component of the visual analytics system 100) analyzes high resolution images captured from one or more of the cameras 102 positioned at the monitored location 101 and being calibrated. For example, the edge server 104 can extract, from the high resolution images, one or more edges or geometric characteristics of an object (e.g., tile edges, a grid of tiles, a shelf, a door, curvature, dimensions, or any other feature or object). The edge server 104 can then compare the extracted geometric features or edges to expected dimensions, edges, or curvature. In some embodiments, the edge server 104 can utilize one or more AI models. Additionally, in some embodiments, parameters of the camera 102 being calibrated can be analyzed as part of the calibration process. For example, the Cartesian coordinates (e.g., X, Y, Z position), spherical coordinates (e.g., phi angle, theta angle, etc.), and other positional or directional information relating to the camera 102 being calibrated can be analyzed.
Camera registration is the process of matching the local floor coordinate systems of individual calibrated cameras 102 in a location 101. The process yields a global 3D model of the location that contains a global floor plane and the registered cameras. Any two cameras 102 with overlapping views can be registered by picking a single tile corner (or point on the floor) mutually visible to them. After cameras 102 are activated, camera registration may be established by identifying and corresponding location features in a scene where two or more cameras 102 have views of the scene (the same physical space) from different angles. From these correspondences, positions and angles in three-dimensional space may be identified. Proper camera registration is useful in later determining that a detected first person in a first camera view is identical to a detected second person in a second camera view—i.e., that the first person and the second person are the same person. By establishing that the first person in the first camera view is occupying roughly the same three-dimensional space seen by both the first and second camera views, a degree of confidence can be assigned that the first person and the second person are the same, and tracks for the first and second person can be merged. Other computer vision techniques can also be used to understand where detected persons are in three-dimensional space and to equate people detected in different camera views as being one and the same person.
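The following non-limiting Python sketch illustrates the cross-view association idea once registration is in place: detections from two cameras 102 are expressed in the shared floor coordinate system and treated as the same person when they are sufficiently close in space and time. The distance and time tolerances, field names, and example values are illustrative assumptions.

# Minimal sketch: decide whether two registered-camera detections are the same person.
from dataclasses import dataclass
import math

@dataclass
class FloorDetection:
    camera_id: str
    track_id: int
    x_m: float          # floor-plane coordinates in the shared (registered) frame
    y_m: float
    t_s: float          # capture timestamp, seconds

def same_person(a: FloorDetection, b: FloorDetection,
                max_distance_m: float = 0.75, max_dt_s: float = 0.5) -> bool:
    close_in_space = math.hypot(a.x_m - b.x_m, a.y_m - b.y_m) <= max_distance_m
    close_in_time = abs(a.t_s - b.t_s) <= max_dt_s
    return close_in_space and close_in_time

# example: the same shopper seen by two overlapping cameras
d1 = FloorDetection("cam-03", 17, 4.10, 2.35, 1000.2)
d2 = FloorDetection("cam-07", 42, 4.25, 2.41, 1000.4)
if same_person(d1, d2):
    print(f"merge track {d2.track_id} of {d2.camera_id} into track {d1.track_id}")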
As described in greater detail below with regard to the image processing pipelines, a deep neural network can derive a depth map for pixels in a two-dimensional image derived from a single camera. Depth information about a scene can also, in some examples, be derived from two or more camera views.
Information other than from video streams can be provided to the visual analytics system 100. As one example, a CAD layout of the location 101 (e.g., store) can be provided. The CAD layout can include a 2D floor plan, or can be a 2.5D (2D plus elevation layers) or 3D model of the location 101. In connection with the floor plan or other location model, a planogram can be provided. The planogram provides data on how items are placed in the store. A 2D planogram can be mapped to a 3D location in the store. A list of items sellable in the store can be provided, e.g., as a database. The list can include definition, stock keeping unit (SKU) code, Uniform Product Code (UPC) barcode, and/or other information. The item list can also include items without barcodes, like certain fountain drinks or prepared foods. Certain items may be customer-assembled, like drinks in cups that may or may not have barcodes, or hot dogs with condiments. These may be priced as one item or a combination of items. There may be some logic for how combinations are priced. Each of the floor plan, planogram, and item list may be dynamically updatable. In some examples, signals from sensors 162 other than cameras 102 can be provided to an edge server 104. For example, RFID sensors can provide information about access to secure areas having RFID-based door locks. Temperature, humidity, light, weight, and sound sensors can also provide the visual analytics system 100 with additional information that can be processed to determine information about scenes and/or journeys throughout a location 101.
Once the cameras 102 are calibrated and registered, the visual analytics system 100 can go into operation. Areas of interest (AOIs) can be marked manually or can be inferred using ML models (or other AI methodologies) based on the movements of people and/or vehicles. Manual marking of AOIs can involve, for example, using a graphical user interface to draw an area or region (e.g., a polygonal area) on the floor of a scene, either in a camera view or in a 2D diagram or 3D model that can be derived from a floor plan or a 3D model of the location or scene. People entering the defined AOI can subsequently be flagged as having entered the AOI, and entry time, exit time, total time spent in the AOI, number of times exiting and re-entering the AOI, etc., can be noted as metrics relevant to the journey of the person in the location.
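By way of non-limiting illustration, the Python sketch below tests each point of a person's track against a manually drawn AOI polygon and derives entry, exit, and dwell intervals of the kind noted above; the polygon, the track, and the coordinate units are hypothetical examples.

# Minimal sketch: dwell intervals of a track inside a manually defined AOI polygon.
from typing import List, Tuple

Point = Tuple[float, float]

def point_in_polygon(p: Point, polygon: List[Point]) -> bool:
    """Standard ray-casting test: count edge crossings of a ray from p toward +x."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

def dwell_intervals(track: List[Tuple[float, Point]], aoi: List[Point]):
    """track is a list of (timestamp, floor point); returns (entry, exit) time pairs."""
    intervals, entered_at = [], None
    for t, p in track:
        if point_in_polygon(p, aoi):
            entered_at = entered_at if entered_at is not None else t
        elif entered_at is not None:
            intervals.append((entered_at, t))
            entered_at = None
    if entered_at is not None:
        intervals.append((entered_at, track[-1][0]))
    return intervals

checkout_queue = [(2.0, 1.0), (5.0, 1.0), (5.0, 4.0), (2.0, 4.0)]   # hypothetical AOI, metres
track = [(0.0, (0.5, 2.0)), (1.0, (2.5, 2.0)), (2.0, (3.0, 3.0)), (3.0, (6.0, 3.0))]
print(dwell_intervals(track, checkout_queue))    # e.g. [(1.0, 3.0)]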
AOIs can also be automatically established by ML inference. For example, an AOI ML model can be trained using defined AOIs and tracks, and can thereby learn the meaning of an AOI with relation to the tracks of people as they move through the monitored location 101. Subsequently, the AOI ML model can observe the tracks of people derived from measured data and can automatically propose or designate certain areas in the location 101 to be AOIs. As examples, in the context of a store location 101, an AOI ML model may establish that certain areas in a store are a checkout station, a checkout queue, a coffee bar, a soda fountain, a restroom, a walk-in beverage cooler, a restricted “employees only” space or entrance thereto, and so forth, without these areas in the store needing to be manually defined by a human user in accordance with the manual AOI definition method described above.
Certain areas of a monitored location 101 may be restricted by role. For example, an area that is an employee door may be restricted to employee-only use. The visual analytics system 100 can determine whether certain people are in certain places inappropriately. The appropriateness of such use may be based on timing (e.g., an area appropriate to access during the day may be off-limits at night), dress or appearance (e.g., a uniformed employee or vendor may be allowed access to the back of a refrigerated case whereas a customer would not be) or presented identification (e.g., whether an RFID card reader has sensed a valid ID tag), among other factors. Information from non-camera sensors 162, like an RFID card reader, can be fused with metadata derived from cameras 102 to interpret the appropriateness of access events. In other examples, customers may be allowed access to certain areas, but may be prohibited from certain activities in those areas, and the visual analytics system 100 may be able to detect and notify about such prohibited behavior. For example, a customer may be allowed to serve himself coffee from a coffee maker, but may not be allowed to open a meat container and scoop out some meat, whereas an employee would be allowed to do so.
Accurate retail analytics may depend on an ability to localize a person in some predefined AOI. Such an ability can help provide information such as a time when the person reaches a cashier counter, how long a person has been in a checkout queue, or how long it takes a person to check out. To accurately estimate this information, the 3D perception of the environment can be leveraged. As described in greater detail below, the edge server 104 can utilize a depth estimation ML model to provide dense depth estimates for locations in an image. Respective singular coordinates can be determined as representing spatial positions of each person in a scene or set of scenes. Since each person is made up of many points in a point cloud of the depth estimate, spatial sample clustering and averaging techniques can be used by the edge server 104 to predict the centroid of a person in the point cloud space. A 3D vector that is normal to the floor can be estimated using the point cloud. Using this 3D normal vector, the person's 3D centroid coordinate can be projected onto the same plane as the floor. This enables a “birds-eye view” representation of all the people in the monitored location 101. With this 2D projected floor coordinate information, any given person's position can be localized with respect to defined geofence polygons or AOIs.
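A non-limiting Python sketch of the centroid-and-projection step follows: the person's depth points are averaged to a 3D centroid, which is then dropped onto the floor plane along the estimated floor normal to give the birds-eye position. The point cloud, floor normal, and plane point are synthetic assumptions.

# Minimal sketch: project a person's point-cloud centroid onto the floor plane.
import numpy as np

def floor_projection(person_points: np.ndarray,
                     floor_normal: np.ndarray,
                     floor_point: np.ndarray) -> np.ndarray:
    """Return the person's centroid projected onto the floor plane (a 3D point)."""
    n = floor_normal / np.linalg.norm(floor_normal)
    centroid = person_points.mean(axis=0)                  # crude centroid of the cloud
    distance_to_plane = np.dot(centroid - floor_point, n)  # signed height above the floor
    return centroid - distance_to_plane * n                # drop the centroid to the floor

rng = np.random.default_rng(1)
person_cloud = rng.normal(loc=[2.0, 1.7, 0.9], scale=0.2, size=(500, 3))  # synthetic person
floor_normal = np.array([0.0, 1.0, 0.0])                   # assumed: +y is "up"
floor_point = np.array([0.0, 0.0, 0.0])

print(floor_projection(person_cloud, floor_normal, floor_point))  # approximately [2.0, 0.0, 0.9]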
The visual analytics system 100 can also be configured to distinguish people by their function, via dress, behaviors, or some combination of these or other factors. Functions can include employee, customer, security guard, or vendor, as examples. Roles can mutate over time, as when an employee or a security guard makes a purchase and temporarily assumes the role of customer. The visual analytics system 100 may be able to trigger an employee discount, in such a case. Roles and functions classifications can lead to actions and interpretations downstream in the processing of visual analytics information, as described in greater detail below.
In some embodiments, the visual analytics system 100 can use ML models or other AI techniques to detect items as they move through a store or monitored location 101. An item picked up by a customer from a shelf or rack can be tracked through the store as it journeys with the customer in a hand, shopping basket, or shopping cart. In some examples, a planogram and/or item lists can be utilized to facilitate identifying or monitoring items that are picked up and carried. For example, if a customer is detected as “reaching” at a particular position and/or direction within a store 101 (see discussion of gestures, below), and the planogram for that store 101 indicates that a shelf at or near the particular position contains one or more of a particular item, then the visual analytics system 100 can identify the customer as having picked up the particular item, to within some degree of certainty. The certainty can be augmented or reduced, for example, by subsequent observations of the item in, or not in, the customer's hand, shopping cart, basket, etc. The visual analytics system 100 can also track objects that are brought into the store 101, such as a customer's own cup, wallet, keys, children, wig, etc. Customers may put personal items that do not belong to the store 101 in different places throughout the store, such that these items may become part of the environment and would be useful to track, for example, to ensure that customers are not charged for their own items that they brought into the store or to assist in locating lost or stolen personal items.
Detected or derived visual characteristics: poses, gestures, facial expressions, demographics, tracks, and journeys
In some embodiments, an ML model can be used by the edge server 104 and/or the remote processing server 180 to draw a “skeleton” (a small number of connected lines) on a detected person (or object) in a single given frame, the skeleton indicative of the relative placement in space of the person's limbs, trunk, and head. In some examples, the skeleton is in 2D and can be extrapolated into 3D. The skeleton can define a particular pose of the person, which is a representation of the person's body state (e.g., the relation of limbs, trunk, and head with respect to each other) in three-dimensional space for a given instant in time. Particularly when used in conjunction with other information, such as a floor plan or a planogram (a 2D or 3D map defining locations of items), determined pose can be useful in determining where a person is, what the person is looking at, where the person's gaze is directed, etc. Pose information therefore allows a determination to be made, for example, that a person looked at a certain item on a shelf.
A gesture is a sequence of poses over a period of time, rather than at an instance of time. Gestures can include things like “reaching,” “pointing,” “sitting down,” “standing up,” “falling,” “stretching,” “swiping a credit card,” “extending a hand to tender cash,” and so forth. In some examples, the edge server 104 utilizes a trained ML model to determine gestures from image data from multiple successive captured frames. In other examples, the edge server 104 can utilize a trained ML model to determine gestures from poses determined from multiple successive captured frames.
Given a clear view of the face of a person, the edge server 104 can utilize a trained ML model that determines whether the person is, for example, smiling, frowning, has raised eyebrows, and so forth. Facial expressions can, in turn, be used, in some examples in conjunction with other indicators, to determine an emotional state of a person, such as “content,” “annoyed,” “delighted,” “angry,” and so forth. Facial expression determination can be done without performing facial recognition, that is, without determining information indicative of the person's personal identity.
Demographics of interest such as age and gender of detected people can also be determined by the edge server 104 using visual analytics (e.g., via various ML and/or AI models). In some embodiments, the edge server 104 is also configured to determine various demographics of interest relating to detected vehicles. Demographics of interest for vehicles can include make, model, year, and color. The ML models utilized by the edge server 104 can be trained to detect these demographics, and various other metrics can be derived from these demographics.
A track is a spatiotemporal representation of the position of a person over time. Tracks of the same person from multiple camera views can be merged to create tracks that span across different scenes in a location. The collection of tracks spanning between a person's entrance into and an exit out of a surveilled location (including, in some examples, outdoor areas around the location) can be termed a journey. In some examples, each person can be assigned a track ID, a unique identifier for a particular journey conducted by the person. A journey can be annotated with various actions taken during the journey, such as “got coffee,” “stood in line,” “paid for coffee,” and “left the store,” or with various AOIs entered and exited during the journey, along with timestamps for these various actions and AOI entrances and exits. In a retail store context, a journey serves as a record of a person throughout the person's travels into, around, and out of the store. A journey can thus be consulted to tell, for example, when a person enters a checkout queue and exits the queue, in order to determine how long the person spent in the queue, or when they enter a cashier area and exit a cashier area, in order to determine how long the person spent at checkout. Journeys can be consulted to access information on the individual person level, or in aggregate to determine, for example, how many people are in a queue at any given time, how many people are checking out at any given time, what the average check-out time is, what the average time is that a person or people were standing in a queue between 8:00 a.m. and 9:00 a.m., what the average length of a queue is, and so forth.
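The following sketch illustrates how a journey composed of AOI visits might be represented and queried for the queue-time metrics described above; the class names and field layout are assumptions for illustration rather than the system's actual schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple

@dataclass
class AoiVisit:
    aoi_name: str
    entered: datetime
    exited: datetime

@dataclass
class Journey:
    track_id: str
    visits: List[AoiVisit] = field(default_factory=list)
    actions: List[Tuple[datetime, str]] = field(default_factory=list)  # e.g., (t, "got coffee")

    def time_in(self, aoi_name):
        """Total seconds the person spent in a named AOI during the journey."""
        return sum((v.exited - v.entered).total_seconds()
                   for v in self.visits if v.aoi_name == aoi_name)

def average_queue_time(journeys, aoi_name="checkout_queue"):
    """Aggregate metric across journeys, e.g., average time spent in the checkout queue."""
    times = [j.time_in(aoi_name) for j in journeys if j.time_in(aoi_name) > 0]
    return sum(times) / len(times) if times else 0.0
```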
As described above with regard to camera registration, the visual analytics system 100 can recognize people or vehicles in different image frames or camera views as identical across the different image frames or camera views. In some examples, the edge server 104 can utilize an object detection model that works on a per-frame basis, meaning that no temporal information (i.e., previous detection coordinates) is utilized by the object detection model. For person and vehicle tracking, detections from previous frames can be matched in a re-identification process to detections in the current frame.
To perform person or vehicle re-identification, visual features can be extracted from the detected person or vehicle in each view, a similarity measure between those features can be computed by the edge server 104, and, if the computed similarity exceeds a threshold (e.g., an adaptive threshold), the two people or the two vehicles in the different camera views can be declared to be the same person or vehicle. Otherwise, they can be declared by the edge server 104 to be different people or vehicles.
By embedding person images into features, distance comparisons between non-facial features from consecutive frames can be used by the edge server 104 in the re-identification process. Using the detection box coordinates from the person detection ML model, a cropped image containing a single person in each image can be generated by the edge server 104 for all person detections in a video frame. These cropped RGB images can then be used by the edge server 104 as inputs to a custom embedding convolutional neural network (CNN) inspired by the OSNet architecture. The output of the CNN can be, for example, a 512-D feature vector representation of the input image. The edge server 104 can then generate a feature vector for each cropped-person image using the CNN; each feature vector represents compressed visual features extracted from an image of a person. The edge server 104 can aggregate these visual features with other local features extracted from a person to re-identify persons with similar features in consecutive frames. A re-identification (ReID) ML model can be trained using circle loss, GeM pooling, and random grayscale transformation data augmentation. Such a ReID model can be trained and utilized by the edge server 104, devices of the cloud-based system 108, or any other device or computing system of the visual analytics system 100.
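A minimal sketch of the similarity-and-threshold matching described above, assuming embedding vectors have already been produced by the ReID network, could look like this; the greedy matching strategy and the 0.6 threshold are illustrative assumptions rather than the system's actual matching logic.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (e.g., 512-D ReID features)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_detections(prev_features, curr_features, threshold=0.6):
    """Greedily match current-frame detections to previous-frame detections.

    prev_features, curr_features -- dicts mapping detection IDs to embedding vectors
    Returns a dict mapping each current ID to a matched previous ID, or None if the
    detection appears to be a newly seen person or vehicle.
    """
    matches, used = {}, set()
    for curr_id, curr_vec in curr_features.items():
        best_id, best_sim = None, threshold
        for prev_id, prev_vec in prev_features.items():
            if prev_id in used:
                continue
            sim = cosine_similarity(curr_vec, prev_vec)
            if sim > best_sim:
                best_id, best_sim = prev_id, sim
        if best_id is not None:
            used.add(best_id)
        matches[curr_id] = best_id
    return matches
```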
As another example, the edge server 104 can compute a physical attribute vector for each detected person or vehicle, which vector can include, for a person, information such as estimated height or other body dimensions, estimated gender, estimated age, or other metrics as determined by the visual analytics system, or for a vehicle, information such as estimated make, model, year, and color of the vehicle. The identity declaration by the edge server 104 can then be based on comparison of attribute vectors computed for detected people or vehicles in different frames or camera views.
As another example, for each detected person in a video camera view, the floor projections (the pixels where their feet touch the ground) can be determined by the edge server 104. These pixels are then mapped to the two-dimensional store map. This is done across all cameras 102 in the location 101 by the edge server 104. All people found to occupy the same physical space on the location floor can then be declared to be the same person. Different methods for person re-identification, such as those described above, can be combined and weighted in various ways to improve re-identification confidence and accuracy.
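A sketch of this floor-projection matching, assuming a calibrated image-to-map homography is available for each camera, might look like the following; the homography representation, the grouping radius, and the function names are assumptions for illustration.

```python
import numpy as np

def feet_to_map(feet_pixel, homography):
    """Map a feet pixel (u, v) from a camera image to (x, y) on the 2D store map
    using that camera's 3x3 image-to-map homography."""
    u, v = feet_pixel
    p = homography @ np.array([u, v, 1.0])
    return p[:2] / p[2]

def merge_same_floor_position(detections, homographies, radius=0.4):
    """Group detections from different cameras that occupy the same floor position.

    detections   -- list of (camera_id, detection_id, feet_pixel)
    homographies -- dict of camera_id -> 3x3 homography to the store map
    radius       -- distance in map units (e.g., meters) to treat as the same spot
    Returns groups of detection IDs declared to be the same person.
    """
    mapped = [(det_id, feet_to_map(px, homographies[cam])) for cam, det_id, px in detections]
    groups = []
    for det_id, xy in mapped:
        for group in groups:
            if any(np.linalg.norm(xy - other_xy) < radius for _, other_xy in group):
                group.append((det_id, xy))
                break
        else:
            groups.append([(det_id, xy)])
    return [[det_id for det_id, _ in group] for group in groups]
```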
Actions can be derived from (defined based on) poses, gestures, facial expressions, tracks, AOIs, time markers, or other information. As with AOIs, actions can be defined manually or automatically. As an example of a manual definition, a “waiting in checkout queue” action can be defined based on a determination that a person has lingered in a checkout queue AOI for more than a defined amount of time (e.g., five seconds). As another example, a “getting coffee” action can be defined based on a determination that a person has entered a coffee station AOI and has made a gesture that corresponds to reaching for a coffee pot or coffee dispenser control. Action definitions can be entered using a GUI or programmatically. Subsequently, ML models can be used by the edge server 104 (or a computing device of the cloud-based system 108) to determine signatures of behavior, based on any of the underlying data types discussed above. For example, in combination with planogram information, gesture information can be used by the edge server 104 to determine, for example, the action that a person picked up an item from a shelf after looking at it, the action that a person did not pick up an item from a shelf after looking at it, or that a person picked up an item and then put it back.
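The manually defined “waiting in checkout queue” action above could be expressed as a simple rule over AOI entry and exit events, as in the hedged sketch below; the event dictionary format and the five-second default are illustrative assumptions rather than the system's actual action definitions.

```python
from datetime import timedelta

def detect_waiting_in_queue(journey_events, aoi="checkout_queue", min_dwell=timedelta(seconds=5)):
    """Emit a 'waiting in checkout queue' action when dwell in the AOI exceeds a threshold.

    journey_events -- time-ordered list of dicts such as
                      {"type": "aoi_enter", "aoi": "checkout_queue", "time": t} or
                      {"type": "aoi_exit",  "aoi": "checkout_queue", "time": t}
    """
    actions, entered_at = [], None
    for event in journey_events:
        if event["type"] == "aoi_enter" and event["aoi"] == aoi:
            entered_at = event["time"]
        elif event["type"] == "aoi_exit" and event["aoi"] == aoi and entered_at is not None:
            if event["time"] - entered_at >= min_dwell:
                actions.append({"action": "waiting_in_checkout_queue",
                                "start": entered_at, "end": event["time"]})
            entered_at = None
    return actions
```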
An action ML model, having been trained on defined actions and other data, can also be used by the edge server 104 to define actions automatically, without the actions needing to have been defined manually. For example, the edge server 104 can utilize an ML model to recognize, based on video data, pose data, or other data, a signature of a person waiting in line, a signature of a person completing a purchase transaction, and so forth.
Certain actions that employees or vendors may be supposed to do in certain ways can be documented, e.g., in an employee playbook. The visual analytics system 100 can be configured to validate whether employee or vendor actions comport with reference policies or expectations. As an example, employees may have a task list (e.g., clean floors, put items on the shelves, help a customer do something or find something, go to the back room, as distinct from a shopping area, clean the clogged pipes of a soda dispenser, change the carbon dioxide tank, etc.). The visual analytics system 100 may determine whether employee tasks were performed, how well they were performed, and how much time it took to perform them. Similarly, in some embodiments, the visual analytics system 100 can be used to determine whether vendors acted in accordance with permissible activities or established routines. As an example, a vendor may be permitted or required to use a separate door to access a location, or to stock items in certain places. The visual analytics system 100 can be used to answer questions about whether activities or behaviors of a vendor were expected or unexpected, whether a vendor arrived on time, or whether a vendor did the vendor's job. Vendors may be expected to place items but not remove them, or vice versa. In some embodiments, the visual analytics system 100 can alert to vendor theft via the edge server 104 and/or a computing device of the cloud-based system 108. The visual analytics system 100 can also be configured to follow the flow of items associated with people of certain roles, and whether the items are carried in accordance with expectations or routines. In some examples, the visual analytics system 100 can compare observed item transits with an item list for validation.
The edge server 104 and/or computing devices of the cloud-based system 108 can interpret actions. For example, the edge server 104 and/or other computing devices of the visual analytics system 100 can make interpretations such as, “this person is in this spot” and “this item is in this spot.” The edge server 104 and/or computing devices of the cloud-based system 108 can also be employed to determine whether items are in the right place in the store per the planogram, whether the items are stacked correctly, whether there are any misplaced items, and whether any items are out of stock. Another example action interpretation that could be made by the edge server 104 can be, “this person took an item from here and put it back in the wrong place.” Additionally or alternatively, any action interpretations made by the edge server 104 and/or computing devices of the cloud-based system 108 can be location-based.
When processing video streams, it is not always necessary, and generally not desirable, for the edge server 104 to capture details at the highest resolution or frame rates. Such processing can be expensive both in terms of LAN network bandwidth and edge server 104 processing bandwidth. In embodiments in which the edge server 104 observes a location 101 in four-dimensional views including the three-dimensional space plus time, the edge server 104 can extract four-dimensional hypertubes of relevance from processed data, and those four-dimensional hypertubes can be zoomed in on in a foveal manner by the edge server 104. Thus, for example, initial processing to derive the hypertubes can be performed by the edge server 104 at a relatively sparse data rate of three frames of video data per second, and/or at less than the full HD resolution of the frames. At the reduced frame rate and/or image resolution, the determinations made by the edge server 104 about actions and their causality may not initially be made with high confidence. As an example, it may be determined from a first scene at a first time that a person had nothing in the person's hands, and it may be determined from a second scene at a second time subsequent to the first time that the person had a candy bar in the person's hands, but no action corresponding to picking up a candy bar may have been recorded for the person in the intervening time between the first time and the second time. Upon recognition of this causal gap, the edge server 104 can automatically re-consult and re-process the stored video streams to examine previously unprocessed frames, and/or previously unprocessed pixels within frames, effectively “zooming in” in space and time, in an attempt to more clearly determine, with higher confidence, that the candy bar was picked up during the person's journey, and where it was picked up from, based on signatures determined from the previously unexamined video data. Such foveal processing can automatically and adaptively be performed by the edge server 104 using buffered video data that is re-consulted as necessary.
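A hedged sketch of the causal-gap detection and foveal re-processing trigger described above follows; the observation dictionary format and the buffered_video.frames_between hook are hypothetical placeholders, not actual interfaces of the system.

```python
def find_causal_gaps(observations):
    """Return time windows that should be re-processed at higher resolution or frame rate.

    observations -- time-ordered list of dicts such as
                    {"time": t, "items_in_hand": {"candy_bar"}, "actions": {"pickup"}}
    A gap is flagged when an item appears in a later observation without a matching
    pickup action having been recorded in between.
    """
    gaps = []
    for earlier, later in zip(observations, observations[1:]):
        new_items = later["items_in_hand"] - earlier["items_in_hand"]
        if new_items and "pickup" not in later["actions"]:
            gaps.append({"start": earlier["time"], "end": later["time"],
                         "unexplained_items": new_items})
    return gaps

def reprocess(gaps, buffered_video, full_fps=30, full_resolution=True):
    """Hypothetical hook: re-run the detection pipeline over buffered frames in each gap."""
    for gap in gaps:
        # 'frames_between' is a stand-in for whatever buffered-video access the edge server uses.
        frames = buffered_video.frames_between(gap["start"], gap["end"],
                                               fps=full_fps, full_resolution=full_resolution)
        # ... run the detector and gesture models again on 'frames' to confirm the pickup ...
```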
As another example, as a customer approaches a checkout queue, the hypertube of the customer can be reviewed by the edge server 104 to determine or estimate what items the customer picked up along their journey, and thus to probabilistically determine or estimate a checkout list (“basket”) for the customer. The action of entering the checkout queue can thus serve as a trigger to the edge server 104 to review a substantial amount of video data spanning back in time minutes or hours to note where the customer stopped, what the customer looked at, what the customer reached for, what the customer picked up, what the customer put back, and so forth. The edge server 104 can assign a level of confidence to the determination or estimation of the checkout list. The checkout list can, in some examples, also be informed by information from stored point-of-sale (POS) records showing the purchase history of the customer. The edge server 104 can compare the determined or estimated checkout list against an actual checkout list, in some embodiments. Anomalies between the determined or estimated checkout list and the actual checkout list can form the basis of alerts, notifications, or reports, which can be generated by the edge server 104 and/or computing devices of the cloud-based system 108.
Collected data of the types described above can be used by the edge server 104 and/or computing devices of the cloud-based system 108 to train predictive models (e.g., ML or AI models) capable of predicting when certain situations will arise in the respective context of a particular location 101. For example, an ML or AI model can be trained to predict, based on tracks and behaviors of customers in a store, that a run on the checkout will occur within a certain time (e.g., within five minutes). Such a prediction can form the basis of an alert that can be generated by the edge server 104 and/or a computing device of the cloud-based system 108, and subsequently provided to a cashier to go and staff a checkout station in advance of the predicted run on the checkout.
Collected data of the types described above can also be used by the remote processing server 180 (or other devices of the cloud-based system 108) to generate reports that can provide a manager (e.g., a store manager, or a head nurse) views into business performance, such as a view of customer traffic, or a view of how one staffing shift is performing, e.g., relative to other shifts. In some embodiments, the remote processing server 180 can classify collected information according to certain metrics that are key performance indicators, e.g., customer conversions or queue time. In an example, a retailer may have as a business goal that they never want a person to wait more than a certain amount of time (e.g., 30 seconds) in queue. The remote processing server 180 can be configured to analyze stored data (e.g., metadata generated and transmitted by the edge server 104) to identify instances in which queue time for a customer exceeded this threshold, and to determine the reasons why it happened, e.g., whether the customer was busy on the customer's cell phone or, by contrast, whether the cashier failed to timely and attentively serve the customer. Reports can be generated by the remote processing server 180 that can link to individual video clips evidencing the determined behaviors. Such reports can be used for performance reviews or training. Training or teamwork exercises can be automatically suggested by the remote processing server 180 based on the reports. The reports can be used as feedback to a team to improve its performance.
Reports can be generated by the remote processing server (or other computing devices of the cloud-based system 108) that provide insights as to different kinds of customer conversion rates and location space, facility, or equipment utilization rates. As an example, a first customer conversion can occur from the street to the parking lot of a store, a second customer conversion can occur from the parking lot to the store, and a number of other conversions can occur inside the store, such as a conversion from entering the store to collecting items to purchase, and a conversion from collecting the items to actually purchasing the items. Visual analytics can be used by the remote processing server 180 to compute conversion rates for various types of conversions, such as the example conversion types given above and others that may be desired, based on journey data. As an example, the fact that a person has entered a store does not mean that the person is going to buy an item in the store: the person may be in the store to pay for gas, use the restroom, stay cool on a hot day, or rob the store.
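A minimal sketch of how conversion rates such as those described above could be computed from journey data follows; the per-journey stage labels are assumptions for illustration rather than the system's actual metadata schema.

```python
def conversion_rate(journeys, from_stage, to_stage):
    """Fraction of journeys that reached `to_stage` among those that reached `from_stage`.

    Each journey is assumed to carry a set of stage labels, e.g.,
    {"parking_lot", "entered_store", "collected_items", "purchased"}.
    """
    reached_from = [j for j in journeys if from_stage in j["stages"]]
    if not reached_from:
        return 0.0
    reached_to = [j for j in reached_from if to_stage in j["stages"]]
    return len(reached_to) / len(reached_from)

# Example: in-store conversion from entering the store to making a purchase.
journeys = [{"stages": {"parking_lot", "entered_store", "purchased"}},
            {"stages": {"parking_lot", "entered_store"}},
            {"stages": {"entered_store", "collected_items", "purchased"}}]
print(conversion_rate(journeys, "entered_store", "purchased"))  # about 0.67
```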
In some embodiments, the remote processing server 180 (or other computing devices of the cloud-based system 108) can compute and compare utilization from journeys of people and vehicles. Examples of utilization metrics can include how often a parking spot, or a range of parking spots, is utilized. The remote processing server 180 can use visual analytics to consider parking as an activity and can compute metrics on how frequently an unallowed space is used for parking. The remote processing server 180 can also compute metrics on how often cars are detected loitering in places that are not parking spots. Utilization per spot or area can be measured by the remote processing server 180. A location 101, such as a store or filling station, may be equipped with gas pumps. With regard to gas pumps, the fact that a car is detected at a gas pump may not by itself be a good indicator that the car is pumping gas, without a further indicator such as a detected gesture that a person has picked up a pump nozzle and inserted it into the car. Accordingly, the remote processing server 180 can use visual analytics to obtain much more accurate metrics regarding pump parking utilization, answering when pump parking is being used for gas pumping as opposed to when it is just being used for store parking or other purposes. For loss prevention purposes, the remote processing server 180 can use visual analytics to detect if an unauthorized device is used at a gas pump and can create alerts for theft.
In some embodiments, the visual analytics system 100 can be used as a system to enhance the performance of a human activity. The activity can be any activity. The visual analytics can be applied to healthcare, e.g., in the context of nurses performing actions on patients. Every action a nurse performs on a patient may be coded and can generate a medical record, e.g., in EPIC. An action ML model can be trained on how various nursing care actions are performed. Once the visual analytics system 100 learns how an action is supposed to be performed, the system 100 can flag outlier actions whose performance appears to fall outside the norms for that action. Such outliers can be flagged for review. A head nurse, for example, can receive a report from the visual analytics system 100, including a link to a video of an action determined to be outside the norm, and the head nurse can evaluate whether the action represented care provided outside the norm. Where the head nurse provides feedback to the visual analytics system 100 (e.g., “the flagged action was normal”), the action ML model can be retrained, and the visual analytics system 100 can learn. Such analysis and processing can be performed by the edge server 104 and/or the remote processing server 180 (or any other computing devices of the cloud-based system 108).
Additionally or alternatively, the visual analytics system 100 (e.g., via the edge server 104 and/or the remote processing server 180) can automate certain things, such as allocation of resources. For example, for a nurse making the rounds, the visual analytics system 100 can make a determination that a patient is awake, and can issue a recommendation that the nurse should go to the patient's room first, because later the patient may not be awake, and visiting the patient then would be a waste of nursing time. In another example, the visual analytics system 100 can also optimize announcements to nurses. Interruption from announcements and notification communications poses a major problem in nursing. Nurses can receive hundreds of messages a day, often at times when they are doing something critical. Such interruptions can be distressing to nurses. Thus, if the visual analytics system 100 can accurately determine what kind of task the nurse is engaged in, the system 100 can hold off on delivery of a message until the nurse finishes performing a high-priority task. For instance, if a patient is in distress, a high-priority notification should issue, instructing the nurse to leave the patient whose diaper the nurse is changing and run to the next room to attend to the distressed patient. But if a notification is only regarding a patient who wants water, an alert to that effect can be held off until the nurse to whom the notification is directed finishes with a higher-priority task. Accordingly, the visual analytics system 100 can manage prioritization of actions according to the context in which humans find themselves, giving them situational awareness.
As discussed herein, the visual analytics system 100 can track and quantify the behavior and motions of objects and people at a monitored location 101. In addition, the visual analytics system can be configured to utilize that capability to increase operational efficiency of a highly organized operation with well-defined workflows, e.g., at a retail store. There are numerous capabilities of the visual analytics system 100 available to define such workflows, track workflows using manually entered data, and help optimize the workflows. For example, the working hours of each worker can be entered by a supervisor and tracked when the workers check in and out of work. The workers' responsibilities can be listed and optionally scheduled for certain time periods. Workers and supervisors can then manually mark those tasks complete and may assess the duration and quality of the work. These inputs can be used by the visual analytics system 100 to understand how long it takes to perform a task and potentially train team members who are performing below a target efficiency. The visual analytics system 100 may also analyze indirect metrics that relate to an entire team of workers, such as, for example, the wait time for a customer, the cleanliness of a restroom, etc. In some embodiments, measurement of the various metrics can be performed by “secret shoppers” in order to normalize scores across various locations 101 and avoid intentional or inadvertent measurement errors. It should be appreciated that store teams can be rewarded on the basis of their performance relative to a target and/or relative to each other.
In some embodiments, the edge server 104 at each monitored location 101 or the remote processing server 180 (or other computing devices of the visual analytics system 100) can be configured to automatically score worker performance as a function of data captured by on-premises cameras 102 that is analyzed using one or more artificial intelligence models (e.g., machine learning models, etc.). For example, the visual analytics system 100 can be used for compliance checking and measuring operational efficiencies in connection with food operations at a convenience store. Convenience stores compete with Quick Serve Restaurants (QSR) and serve prepared food items. Some of these food items are prepared to order, in which case the operation is very similar to a QSR. Most of the food items, however, are pre-cooked or pre-warmed to reduce the wait time and increase convenience. The preparation and maintenance of these food items is a complex workflow. The parameters of this workflow and the execution of the workflow determine its success, i.e., the revenue and profit at the store and the satisfaction of customers.
For example, a simplified workflow for a convenience store roller grill includes the following steps: (1) Nh1 hot dogs are placed at planogram location Ah at time T1; (2) Nt1 taquitos are placed at planogram location At at time T1; (3) remaining hot dogs on the roller grill are moved forward and Nh2 hot dogs are placed in the back of the grill at time T2 (replenishment); (4) remaining taquitos on the roller grill are moved forward and Nt2 taquitos are placed in the back of the grill at time T2 (replenishment); (5) hot dogs of the first batch remaining on the roller grill with age th,i > tshelflife-h (of count Sh3) at time T3 are thrown away by an employee; (6) taquitos of the first batch remaining on the roller grill with age tt,i > tshelflife-t (of count St3) at time T3 are thrown away by an employee; and (7) repeat steps 1-6. In this workflow, Ni and Tj are critical parameters that are determined through an analysis of previous data. Sk are measurements that are currently employee estimates. The ages ti of hot dogs and taquitos are currently unknown. Even the current POS data on the number of hot dogs and taquitos sold is inaccurate today.
In some embodiments, the visual analytics system 100 (via the edge server 104 and/or a computing device of the cloud-based system 108) is configured to classify food items by their appearance and their placement in a planogram. In some embodiments, labels can be placed on the roller grill to override the default planogram, which allows for a “soft planogram.” The visual analytics system 100 (e.g., the edge server 104, etc.) is configured to read those labels and update the planogram currently in effect. The visual analytics system 100 is also configured to track items as they are placed, moved, and removed, which allows the age (t) attribute to be assigned to each item. The visual analytics system 100 (via the edge server 104 and/or a computing device of the cloud-based system 108) can visualize the data in the GUI to help employees. An expiration date can be set for each type of item, and the GUI can be configured to display the approximate age of each item through a color code (e.g., green for fresh, yellow for about to expire, red for expired).
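A hedged sketch of per-item age tracking and the color coding described above follows; the shelf-life values, class layout, and warning fraction are illustrative assumptions only.

```python
from datetime import datetime, timedelta

# Assumed shelf lives for illustration; actual values would come from store configuration.
SHELF_LIFE = {"hotdog": timedelta(hours=4), "taquito": timedelta(hours=3)}

class GrillItem:
    def __init__(self, item_type, placed_at):
        self.item_type = item_type
        self.placed_at = placed_at   # set when the item is first detected on the grill

    def age(self, now):
        return now - self.placed_at

    def status_color(self, now, warn_fraction=0.8):
        """Color code for the GUI: green = fresh, yellow = about to expire, red = expired."""
        life = SHELF_LIFE[self.item_type]
        age = self.age(now)
        if age >= life:
            return "red"
        if age >= warn_fraction * life:
            return "yellow"
        return "green"

now = datetime.now()
item = GrillItem("hotdog", placed_at=now - timedelta(hours=3, minutes=30))
print(item.status_color(now))   # "yellow" under the assumed 4-hour shelf life
```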
The visual analytics system 100 (via the edge server 104 and/or a computing device of the cloud-based system 108) may also be configured to classify people who interact with the roller grill (e.g., customers, employees, maintenance workers, etc.). It should be appreciated that by classifying the people interacting with the roller grill, the visual analytics system 100 can more accurately classify their behavior. For example, when an employee removes an item from the roller grill, that item is marked by the visual analytics system 100 as disposed. When a customer removes the item from the monitored location, it can be marked by the visual analytics system 100 as sold.
In some embodiments, the visual analytics system 100 is configured to measure missed opportunities by identifying customers (of quantity Mi) who spend a certain amount of time looking at a food section (e.g., a roller grill) that is not stocked at the time, and who end up buying no food items from the food section.
Every store has different objectives when it comes to the balance among revenue, food waste, labor cost, food quality, customer satisfaction, etc. The relative importance of these key performance indicators (KPIs) is entered as weights in an optimization algorithm utilized by the visual analytics system 100 (e.g., the edge server 104 and/or the remote processing server 180). Based at least in part on, or otherwise as a function of, the entered weights, the optimization algorithm yields parameters Ni and Tj to guide food operations. Tj can be determined ahead of time assuming a certain flow of customers and is simpler to implement. Those times are presented to employees as tasks at given times. Unfortunately, the predicted flow of customers and purchases can be inaccurate on certain days. This results in suboptimal performance. The next step in improvement is a flexible schedule communicated to the employees through alerts. In some embodiments, the remote processing server 180 and/or the edge server 104 is configured to generate and provide such alerts to employees. Management sets deviations at which these alerts should go out. As disclosed herein, the visual analytics system 100 utilizes one or more AI models (e.g., machine learning models, etc.) to provide inferences and optimizations. Such models continually update predictions on when individual food items will run out. Additionally, the models can be used by the analytics system to predict or keep a current record of which food items have expired (or are about to expire). When any of these predictions is off by the deviation threshold, employees are requested to act out of schedule (in real time and predictively) through alerts, which can be generated by the edge server 104 and/or the remote processing server 180. In some embodiments, the visual analytics system 100 is configured to determine and record whether those generated alerts lead to prompt action.
In some embodiments, the visual analytics system 100 also provides more effective dynamic pricing schemes. Dynamic pricing is used in food sales to reduce waste and improve profit margins by offering a discount on food that is about to expire (avoiding a complete write-off). There are several difficulties in a traditional implementation. For example, the manual revision of pricing is labor intensive and error prone. If implemented incorrectly, price erosion on food that would nevertheless have been sold at full price offsets any gains. As such, the visual analytics system 100 improves on the traditional flow in the following ways: (1) it has more refined and accurate data on the age of food items; (2) it has more accurate predictions of customer flow (taking into account all lot and store movement); (3) thanks to alerts, the time for a discount is not fixed, preventing customers from bargain-hunting (and thus eroding full-price sales); (4) the discount can be removed instantly if the current sales prediction points to regular price sales being sufficient; (5) thanks to SA, the customer is charged what they saw, preventing confusion, potential customer dissatisfaction (overcharging), and lost revenue (undercharging); and (6) if the store prefers to place older discounted items in a certain area in the planogram, the visual analytics system 100 is informed whether the customer's item is from that area or from the regular area (which is impossible to tell visually for a traditional POS operation). It should be appreciated that although the above examples relate to retail stores and food items, the visual analytics system 100 is configured to provide comparable workflow functionality to other areas.
Video streams from the camera 102 can be provided in real-time (or near real-time) to the edge server 104. An image frame capture software component 122 running on the edge server 104 can capture images from the video streams. ML or other AI models or software components 112, 114, 118, 120, 122, which can be arranged in stages as pipelines (e.g., one pipeline per camera), can be run on the captured images to generate interpretations of the video streams in the form of metadata. The metadata can then be sent over the Internet (or other external networks 105) to the cloud computing service 108, e.g., either directly or via a virtual private network (VPN) 126, which may run through one or more other network facilities not illustrated in
The cloud computing service 108 (e.g., the cloud-based system 108) can provide additional data processing and storage of the metadata and other data derived from the metadata. For example, the cloud computing service 108 can include a web frontend 138 that serves a web-based dashboard application 140 to control an application-programmer interface (API) engine 142 that in turn can control a relational database service (RDS) 144 that can store metadata from the edge server 104. The cloud computing service 108 can also include a Kubernetes engine 146, a key management service (KMS) 148, and video storage 150 that can store video data from the edge server 104.
A configuration control platform (e.g., the global configuration control server 182 and/or the remote processing server 180 of
Because the camera 102 is managed by the edge server 104 and communicates directly with the edge server 104 (as opposed to communicating with the cloud-based system 108 or the cloud 128, for instance), the configuration control platform can provide a single point of entry for configuration of the cameras 102 as well as what may be a large plurality of other cameras 102, sensors 162, and devices 172 at the location. This single point of entry for device configuration stands in contrast to conventional internet of things (IoT) device configuration, in which every device may communicate directly with the cloud 128. The architecture 500 of the visual analytics system 100 is therefore particularly advantageous in a low network bandwidth situation, as compared to architectures using IoT devices in which every device at the location (potentially hundreds of them) consumes LAN and internet bandwidth on the way to the cloud 128. Likewise, use of a single configuration control platform for all such devices offers advantages as compared to IoT devices that are each managed through a different interface. For instance, the device infrastructure of a retail location can be impracticable to manage when the coffee machine has its own web-based interface, each one of the security cameras has its own separate web-based interface, etc. The configuration control platform (e.g., the global configuration control server 182 and/or the remote processing server 180 of
The web-based interface of the configuration control platform (e.g., the global configuration control server 182 and/or the remote processing server 180 of
Camera health status can be based on more than whether a camera 102 is reachable on a network. For example, camera health status can compare a last validated image from the camera 102 to a present view of the camera 102 and compute a correspondence score between the validated image and the present view image to determine whether the camera is operating as expected, including in terms of its viewpoint and lens clarity. That way, if the camera 102 has been vandalized, e.g., by having its lens spray-painted or cracked, or if the orientation of the camera has been intentionally or inadvertently redirected, the camera health status indicator lights can provide a quick way of determining that, for a large number of cameras 102 at once.
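A minimal sketch of the correspondence-score check described above, using a simple normalized cross-correlation between the last validated image and the present view, follows; the 0.7 threshold and function names are illustrative assumptions rather than the system's actual health-check implementation.

```python
import numpy as np

def correspondence_score(reference, current):
    """Normalized cross-correlation between a validated reference image and the camera's
    present view (both 2D grayscale arrays of the same shape). Values near 1.0 indicate
    an unchanged view; low values suggest the camera was moved, obstructed, or defaced."""
    ref = (reference - reference.mean()) / (reference.std() + 1e-9)
    cur = (current - current.mean()) / (current.std() + 1e-9)
    return float((ref * cur).mean())

def camera_health(reference, current, threshold=0.7):
    score = correspondence_score(reference, current)
    return {"score": score, "status": "ok" if score >= threshold else "check_camera"}
```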
The example configuration control platform interface may allow an option to select and display a camera view assist page. Particularly, the interface may show live or recent views of the interior cameras of the location, allowing a quick check that all cameras are working and unobstructed. Clicking on a particular one of the camera displays in the view can lead to an interface depicting a larger view of the live image of the camera, and also a comparison of a live image with a reference image, which can be used to check that the camera is in its expected location, orientation, and field of view. In some embodiments, the example configuration control platform interface may also allow an option to display a plane calibration tool for a selected camera. The entire image may be analyzed to determine which areas of the image constitute the floor. In some embodiments, the floor determination is human-assisted by drawing polygons over the floor. Tile grout lines in the floor, or other floor features, can then be analyzed by the edge server 104 to automatically estimate the intrinsic parameters of the camera such that lens distortion can be removed.
Other screens in the configuration control platform, not shown, can display details about the location. The configuration control platform can use a templated layout to provide different presentations for different contexts, e.g., whether the platform is used for a retail store or healthcare setting.
The configuration control platform (e.g., provided by the global configuration control server 182 and/or the remote processing server 180 of
In some embodiments, the configuration control platform advantageously aids in deployment of updated models to edge servers 104. After physically installing the edge server 104 at the edge location 101, AI inferencing can be run on the edge server 104, and data derived from such inferencing can be uploaded from the edge server 104 to the cloud-based system 108 (or computing devices of the cloud-based system 108, such as the remote processing server 180), as discussed herein.
The configuration control platform (e.g., provided by the global configuration control server 182 and/or the remote processing server 180 of
Installation of the visual analytics system 100 at a location 101 can be a complex human process that involves multiple companies and people, with different legal rights and obligations as defined by contracts. The configuration control platform 182 and its role-based access can be used to enforce these rights and obligations by defining roles in a software environment where the roles and their performance are tracked and validated. For example, when a portion of an installation process is complete, the configuration control platform can show a completion or provide an alert. The configuration control platform 182 can also integrate into a workflow management tool to manage installation processes and check their progress.
In some embodiments, the visual analytics system 100 includes a monitoring system (not shown), portions of which may run on any or all of the edge server(s) 104, cloud-based system 108, data center 106, or cloud 128. The monitoring system can be configured to identify a functionality or performance issue with one of the pipelines in the edge server 104, and may attempt to automatically resolve the issue without any manual intervention on the part of a human technician. The monitoring system can use ML models (or other AI methodologies) to estimate or predict when such issues have arisen or may soon arise on the edge server 104. Such predictions can be made based on metrics that are established, through machine learning, to be indicative of an imminent pipeline issue. Such metrics can include a utilization spike at the central processing unit (CPU) of the edge server 104, a spike in usage of system memory of the edge server 104, or a spike in network bandwidth utilization at or around the edge server 104. Such predictive and self-healing features aid in the seamless operation of tens of thousands or maybe hundreds of thousands of edge servers 104 at scale, without requiring a proportional scale-up in human staffing, and in many instances in a challenging network environment, with very limited to no connectivity.
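A hedged sketch of how the monitoring system might flag CPU, memory, or network spikes and trigger an automatic remediation follows; the z-score heuristic, metric names, and restart hook are assumptions for illustration, not the actual monitoring implementation.

```python
import statistics

def spike_detected(history, latest, z_threshold=3.0):
    """Flag a spike when the latest metric reading deviates strongly from recent history."""
    if len(history) < 10:
        return False
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1e-9
    return (latest - mean) / stdev > z_threshold

def check_and_heal(metrics_history, latest_metrics, restart_component):
    """metrics_history    -- dict like {"cpu": [...], "memory": [...], "network": [...]}
    latest_metrics     -- dict of the newest readings for the same keys
    restart_component  -- callable invoked to attempt an automatic remediation
    """
    for name, history in metrics_history.items():
        if spike_detected(history, latest_metrics[name]):
            restart_component(name)   # e.g., restart the pipeline stage associated with the spike
            return f"remediation triggered by {name} spike"
    return "healthy"
```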
The RTSP streamer component 1116 can capture still image frames from video streams streamed from the camera 1104 (e.g., the camera 102 of
After lens distortion removal, the detector component 1120 can use ML inferencing to detect objects in the undistorted still image frame. For example, the detector component 1120 can use an ML model trained to detect people and vehicles. In other examples, such as in an assisted checkout, the detector component 1120 can be trained to detect individual items available for sale and presented for checkout. The detector component 1120 can, for example, draw anchor boxes around the objects in an image. Object detection can thus take place in the form of predicting rectangular coordinates for cars and people in an entire RGB image or portion of the image. In some examples, two binary object detection models can be used for their respective task of detecting people or cars. For detecting cars from outdoor video, a detector based on a YOLO architecture with a CSPDarknet53 feature extracting backbone can be used. Non-maximum suppression can be used in a post-processing stage to eliminate duplicate detections.
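As one illustration of the non-maximum suppression post-processing stage mentioned above, a standard IoU-based implementation might look like the following sketch; the IoU threshold and array layout are conventional choices rather than values taken from the system.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Eliminate duplicate detections, keeping the highest-scoring box per object.

    boxes  -- (N, 4) array of [x1, y1, x2, y2]
    scores -- (N,) array of detection confidences
    Returns the indices of the boxes to keep."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection-over-union of the top box with the remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou <= iou_threshold]
    return keep
```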
The depth component 1122 can add a third dimension to the 2D still image frame data by using an ML model trained to estimate a depth value for each pixel in the 2D still image frame. This depth data can be important for gauging how far away from a camera a detected person is, and thus for assigning the person a position in 3D space. While 2D rectangular detections provide useful insight into the location of objects in an image, they do not carry the same information that 3D data does. Working in three dimensions provides structure and important relative relationships that cannot be decoded from 2D information alone. Knowing the spatial locations of objects in a scene provides the basis for detecting and setting important landmarks in, for example, a retail environment for establishing metrics. To determine the 3D coordinate for every pixel in the image, an RGB image is used by the edge server 104 as input into a depth estimation ML model and the output is a two dimensional depth image with the same shape as the input image. Each spatial location in the input image corresponds to the same spatial location in the depth image. The depth image estimates the distance to the camera at every pixel location in the image. The 2D coordinates in the image domain combined with the depth estimates can be used by the edge server 104 to generate a point cloud with each point containing a geometric Cartesian coordinate (x, y, z) and a color coordinate (r, g, b). The 3D point cloud coordinates are calculated by the edge server 104 using the intrinsic parameters of the camera 102 and the estimated depth values. A transformer-based architecture can be used by the edge server 104 (or computing devices of the cloud-based system 108) to design a CNN that performs dense depth estimation. This ML model can use a hybrid vision transformer (ViT) combined with ResNet-50 as an encoder network. A convolutional decoder with resampling and fusion blocks can be utilized by the edge server 104 (or computing devices of the cloud-based system 108) to result in the final depth map. Multiple datasets can be used by the edge server 104 (or computing devices of the cloud-based system 108) to train this network using a scale-invariant loss and dataset sampling process.
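A minimal sketch of the depth-to-point-cloud back-projection described above, assuming a standard pinhole camera model with known intrinsics, could look like this; the variable names and units are assumptions for illustration.

```python
import numpy as np

def depth_to_point_cloud(depth, rgb, fx, fy, cx, cy):
    """Back-project a dense depth image into a 3D point cloud.

    depth          -- (H, W) array of per-pixel depth estimates (e.g., meters)
    rgb            -- (H, W, 3) color image aligned with the depth image
    fx, fy, cx, cy -- camera intrinsics (focal lengths and principal point)
    Returns an (H*W, 6) array of [x, y, z, r, g, b] points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx            # pinhole camera model back-projection
    y = (v - cy) * z / fy
    xyz = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3)
    return np.hstack([xyz, colors])
```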
The 2D pose component 1124 can estimate a 2D pose of a detected object (e.g., person, vehicle) based in part on the depth information provided by the depth component 1122 and based on the object detection performed by the detector component 1120. Detecting the location of human joints (e.g., shoulders, elbows, wrists) is referred to as human pose estimation. 2D pose estimation can be performed using a CNN architecture that can estimate 17 major joint locations in the 2D image space. Connected together, the estimated joint locations form a 2D skeletal structure representation of people in a scene. Pose can be used to estimate where a person is looking (gaze), whether the person is walking or standing still, whether the person sanitized hands, etc. When combined with the AOI localization, pose information improves analytics quality. 3D pose estimation can be performed by 3D pose models that transform a person in a two-dimensional image into a 3D object. This transformation aids in estimating three-dimensional skeletal structure and accurate relative positioning of joints for a person. This improves the quality of gesture/activity recognition. A three dimensional pose ML model can, for example, take a 2D pose of a person as input and use a neural network inspired by a canonical 3D pose network for non-rigid structure from motion to perform 3D pose estimation. Such a model can learn a factorization network that can project the input 2D pose into 3D pose. The 2D pose can then be reconstructed from the 3D pose. The factorization network improves at constructing the 3D pose by minimizing the difference between the original and the reconstructed 2D pose.
The featurizer component 1126 is used by the edge server 104 to crop the images of objects (e.g., people, vehicles) detected by the detector component 1120 (e.g., to their respective anchor boxes) and to process the cropped images to provide an output vector representative of the features of the detected object. Two objects (e.g., people, vehicles) in different camera views having similar feature vectors can be determined to be the same object on the basis of the similar feature vectors, among other criteria. In some examples, the separate views can be merged using a global tracker (not shown in
The tracker component 1128 uses ML inferencing to tell whether an identified person is the same person from image frame to image frame, whether within the same camera view or across multiple camera views, based on the feature vector, the 2D pose information, and the depth information, without using facial recognition. The tracker component 1128 outputs a track for the person, containing data representative of the position of the person over time.
The gesture component 1130 uses ML inferencing to determine a gesture for a detected person, based on the 2D pose information and the track for the person. Gesture information can be used to interpret actions, that is, to help understand what a person is doing. For example, if a person is trying to use a credit card machine to pay, the gesture component 1130 may output data indicative of a credit card payment attempt gesture. As another example, if a cash customer is interacting with the cashier, e.g., handing cash to a cashier, the gesture component 1130 may output data indicative of a cash payment gesture. Gestures are interpreted from a combination of multiple frames that, in a certain sequence or with certain patterns, permit assumptions such as “the person is interacting with the cashier” or “the person is trying to use a credit card to do the payment.”
The attributer component 1132 can use ML inferencing to determine other attributes of a person based on feature vectors and/or tracks. As examples, the attributer component 1132 can estimate demographics such as age or gender, or roles, such as whether the person is an employee, a non-employee, a security guard, a customer who is a security risk, etc. Attributes can be used as logged metrics of interest, and/or can be used in the person or vehicle re-identification process. As an example, cropped object (e.g., person or vehicle) detection images can be used as input to a ReID network to produce 512-D feature vectors that represent embedded visual features. These feature vectors can be used by the edge server 104 as inputs to a multi-layer perceptron (MLP) neural network ML model to perform attribute classification. The output of the ML model can be the class each person belongs to (e.g., customer, security, employee, risk, etc.). Each of the above components can make use of the model inference engine 1114 to perform model inferencing, via, for example, a remote procedure call request.
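A hedged sketch of the MLP-based attribute classification described above, taking a 512-D ReID embedding as input, follows; the layer sizes, class labels, and use of PyTorch are illustrative assumptions rather than the system's actual model.

```python
import torch
import torch.nn as nn

ROLE_CLASSES = ["customer", "employee", "security", "risk"]   # assumed label set

class AttributeClassifier(nn.Module):
    """Small MLP that maps a 512-D ReID embedding to a role/attribute class."""
    def __init__(self, embedding_dim=512, hidden_dim=128, num_classes=len(ROLE_CLASSES)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, embeddings):
        return self.net(embeddings)

# Classify one (untrained, purely illustrative) embedding.
model = AttributeClassifier()
embedding = torch.randn(1, 512)                    # stand-in for a featurizer output
probs = torch.softmax(model(embedding), dim=-1)
print(ROLE_CLASSES[int(probs.argmax())])
```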
Information from the various pipeline components can be fed to the eventify component 1106, which is an event-driven alerting framework. When a defined event is identified as occurring by the eventify component 1106, the eventify component 1106 can initiate an action in real time, by merging tracks across time and space to create a journey and integrating aspects of the journey. For example, if a person with a certain track ID is at a check stand at a given time, goes to the restroom at a second given time, and then comes back and gets a soda at a third given time, the eventify component 1106 can stitch all of this information together. The information is ultimately transmitted to the cloud for storage and further processing.
Architecture 1100 thus provides an overview of how inference is performed at the edge server 104, in the one example illustrated. Other examples can use more or fewer stages, or different stages, to provide different visual analytics functionality, which may be important in different contexts. For example, the types of tracks, features, gestures, and attributes of interest in a healthcare facility may differ from those of interest in a retail store setting. In all instances, however, the visual analytics pipeline 1102 assists in understanding and interpreting a scene, rather than merely detecting or identifying objects. As discussed herein, visual analytics has a goal of trying to understand the interaction between people, between a person and a machine, or between a person and the environment, interpret the observed interaction, and produce analytics based on the interpretation. The understanding of the scene can be accomplished in real-time or near real-time.
The central tool 1236 in
The visual analytics systems 100 and methods as described herein can provide three classes of analytics. A first class of analytics is looking back and providing historic data and comparisons. A second class of analytics is real-time or near-real-time, giving alerts. A third class of analytics is predictive, based on analysis of historic patterns. As an example of the third, predictive class of analytics, eventify component 1238 may determine that a run on the checkout is imminent, prompting an alert to a cashier to staff a checkout station, because soon the queue is going to get longer. Some of these insights may be generated by processing on the cloud-based system 108, rather than at the edge server 104.
In some embodiments, the visual analytics system 100 as described herein can include health monitoring and self-healing features configured to determine or predict existing or future system failures, to automatically initiate remedial actions, and thus to reduce system downtimes.
With reference again to
With reference once again to
In some embodiments, a local queue at the edge server 104 can store pipeline component health data locally for up to about 24 hours. When connectivity to the edge server 104 is available, the health data can be regularly pushed to the data center 106 and/or the cloud services 108 (e.g., the cloud-based system 108). When connectivity to the edge server 104 is interrupted, health data can continue to accumulate in the local queue. When the connectivity returns, a process can automatically push to the data center 106 and/or the cloud services 108 all the locally queued data that has not yet been transmitted to the data center 106 and/or the cloud services 108.
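A hedged sketch of such a local health-data queue with retention and push-on-reconnect behavior follows; the class shape, 24-hour retention default, and push callback are illustrative assumptions.

```python
from collections import deque
from datetime import datetime, timedelta

class HealthDataQueue:
    """Local buffer for pipeline health records, retained up to roughly 24 hours and
    pushed upstream whenever connectivity is available."""
    def __init__(self, retention=timedelta(hours=24)):
        self.retention = retention
        self.queue = deque()

    def record(self, health_data):
        self.queue.append((datetime.now(), health_data))
        self._expire()

    def _expire(self):
        cutoff = datetime.now() - self.retention
        while self.queue and self.queue[0][0] < cutoff:
            self.queue.popleft()

    def flush(self, push_upstream, connected):
        """Push queued records to the data center/cloud when connected; otherwise keep them."""
        if not connected:
            return 0
        pushed = 0
        while self.queue:
            _, data = self.queue.popleft()
            push_upstream(data)          # e.g., an HTTPS POST to the cloud service
            pushed += 1
        return pushed
```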
When publishing to a new location or starting a pipeline does not provide the expected results, a set of test suites can be run on the edge server 104 and can provide a report. The test suites can determine, for example, whether the Argo tunnel 130 is up, whether the Tailscale VPN is up, whether the edge configuration control platform 124 is responsive, whether the Cloudflare tunnel 132 is up and running, whether the pipeline components are running, etc. The resultant report allows pinpointing of exactly where an issue is for debugging. The report can also trigger self-healing capabilities. In many cases, a solution action to resolve a pinpointed problem is as simple as, for example, restarting a pipeline component or reestablishing the Cloudflare tunnel. Such fixes can be done automatically, without manual intervention. Accordingly, many fixes can be automated so that the edge server 104 can generally continue to run in an uninterrupted fashion.
The visual analytics system 100 has further detected that NR6_customer has approached the checkout station and is engaged in a checkout transaction, based on one or more of this customer's spatial position (possibly as being within a designated AOI of the checkout station) and/or one or more gestures of this person as being consistent with conducting a checkout transaction. The visual analytics system 100 has therefore color-coded the anchor box for NR6_customer as green, for checking out, and no longer as blue, for browsing the store. The visual analytics system 100 has also relabeled “NR6_customer” as “NR6_customer_transaction,” signifying, again, the change in action from browsing to checking out of NR6_customer. By contrast, the visual analytics system 100 detects and displays NR7_customer as still browsing. This is the case even though NR7_customer may have entered an AOI defined as a checkout queue area, possibly because the visual analytics system has detected the pose and gaze of NR7_customer as directed away from the checkout station, which may indicate that NR7_customer is still browsing rather than intentionally entering the checkout queue.
Based on the above-described detections, the visual analytics system 100 is able to compute, provide, and display (e.g., on a GUI) operational metrics 1606 in
Based on the above-described detections, the visual analytics system is able to compute, provide, and display (e.g., on a GUI) revised operational metrics 1606 in
The computed metrics 1606 can be used by the edge server 104 and/or the remote processing server 180 to generate alerts in real time or near real time. For example, if NR7_customer is determined to have waited in the queue for longer than a threshold amount of time (which may, in some examples, be an adaptive threshold), an alert can be delivered to another employee (not shown), e.g., an employee who is otherwise engaged with another task such as stocking shelves, to come staff a second checkout station. The computed metrics can be logged and later analyzed to highlight anomalies (e.g., relating to instances when customers were obligated to wait in the checkout queue for longer than a threshold amount of time), draw out patterns, and provide reports.
A multi-agent simulator can simulate how people behave in a location (e.g., store or hospital). Agents are trained on data collected by the visual analytics system 100, so they become predictive and can anticipate what would happen based on a change to a location. Optimizations can be tested, for example, to reduce congestion in a store and improve flow in the store. As an example, a hypothesis may be developed that changing the location of movable shelves called gondolas in the store will improve flow. The simulator could be tested for different modified shelf spacings, e.g., three-foot spacing versus two-foot spacing. The simulated agents will react to the modified spacings as if they are real people, and measurements can be made based on the actions of the trained simulated agents. Such simulated experiments can be performed much less expensively than experiments done in the real world.
Each edge server 104, 608, 718, 720, 1302, 1402 and the other computing devices/servers of the visual analytics system 100 may be embodied as one or more computing devices similar to the computing device 1800 described below.
The computing device 1800 includes a processing device 1802 that executes algorithms and/or processes data in accordance with operating logic 1808, an input/output device 1804 that enables communication between the computing device 1800 and one or more external devices 1810, and memory 1806 which stores, for example, data received from the external device 1810 via the input/output device 1804.
The input/output device 1804 allows the computing device 1800 to communicate with the external device 1810. For example, the input/output device 1804 may include a transceiver, a network adapter, a network card, an interface, one or more communication ports (e.g., a USB port, serial port, parallel port, an analog port, a digital port, VGA, DVI, HDMI, FireWire, CAT 5, or any other type of communication port or interface), and/or other communication circuitry. Communication circuitry of the computing device 1800 may be configured to use any one or more communication technologies (e.g., wireless or wired communications) and associated protocols (e.g., Ethernet, Bluetooth®, Wi-Fi®, WiMAX, etc.) to implement such communication depending on the particular computing device 1800. The input/output device 1804 may include hardware, software, and/or firmware suitable for performing the techniques described herein.
The external device 1810 may be any type of device that allows data to be inputted to or outputted from the computing device 1800. For example, in various embodiments, the external device 1810 may be embodied as a camera 102 described herein.
The processing device 1802 may be any type of processor(s) capable of performing the functions described herein. In particular, the processing device 1802 may be one or more single or multi-core processors, microcontrollers, or other processor or processing/controlling circuits. For example, in some embodiments, the processing device 1802 may include or be embodied as an arithmetic logic unit (ALU), central processing unit (CPU), digital signal processor (DSP), AI acceleration unit, graphics processing unit (GPU), tensor processing unit (TPU), and/or another suitable processor(s). The processing device 1802 may be a programmable type, a dedicated hardwired state machine, or a combination thereof. Processing devices 1802 with multiple processing units may utilize distributed, pipelined, and/or parallel processing in various embodiments. Further, the processing device 1802 may be dedicated to performance of just the operations described herein or may be utilized in one or more additional applications. In the illustrative embodiment, the processing device 1802 is programmable and executes algorithms and/or processes data in accordance with operating logic 1808 as defined by programming instructions (such as software or firmware) stored in memory 1806. Additionally, or alternatively, the operating logic 1808 for processing device 1802 may be at least partially defined by hardwired logic or other hardware. Further, the processing device 1802 may include one or more components of any type suitable to process the signals received from input/output device 1804 or from other components or devices and to provide desired output signals. Such components may include digital circuitry, analog circuitry, or a combination thereof.
The memory 1806 may be of one or more types of non-transitory computer-readable media, such as a solid-state memory, electromagnetic memory, optical memory, or a combination thereof. Furthermore, the memory 1806 may be volatile and/or nonvolatile and, in some embodiments, some or all of the memory 1806 may be of a portable type, such as a disk, tape, memory stick, cartridge, and/or other suitable portable memory. In operation, the memory 1806 may store various data and software used during operation of the computing device 1800, such as operating systems, applications, programs, libraries, and drivers. The memory 1806 may store data that is manipulated by the operating logic 1808 of the processing device 1802, such as, for example, data representative of signals received from and/or sent to the input/output device 1804, in addition to or in lieu of storing programming instructions defining the operating logic 1808.
In some embodiments, various components of the computing device 1800 (e.g., the processing device 1802 and the memory 1806) may be communicatively coupled via an input/output subsystem, which may be embodied as circuitry and/or components to facilitate input/output operations with the processing device 1802, the memory 1806, and other components of the computing device 1800. For example, the input/output subsystem may be embodied as, or otherwise include, memory controller hubs, input/output control hubs, firmware devices, communication links (i.e., point-to-point links, bus links, wires, cables, light guides, printed circuit board traces, etc.) and/or other components and subsystems to facilitate the input/output operations.
The computing device 1800 may include other or additional components, such as those commonly found in a typical computing device (e.g., various input/output devices and/or other components), in other embodiments. One or more of the components of the computing device 1800 described herein may be distributed across multiple computing devices. In other words, the techniques described herein may be employed by a computing system that includes one or more computing devices. Additionally, although only a single processing device 1802, I/O device 1804, and memory 1806 are shown and described herein, the computing device 1800 may include multiple processing devices 1802, I/O devices 1804, and/or memories 1806 in other embodiments.
The foregoing description of embodiments and examples has been presented for purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the forms described. Numerous modifications are possible in light of the above teachings. Some of those modifications have been discussed, and others will be understood by those skilled in the art. The embodiments were chosen and described in order to best illustrate the principles of various embodiments as suited to the particular uses contemplated. The scope is, of course, not limited to the examples set forth herein, as the described techniques can be employed in any number of applications and equivalent devices by those of ordinary skill in the art. Rather, it is intended that the scope of the invention be defined by the claims appended hereto.
This application claims the benefit of U.S. Provisional Application No. 63/439,113, filed Jan. 14, 2023, U.S. Provisional Application No. 63/439,149, filed Jan. 15, 2023, and U.S. Provisional Application No. 63/587,874, filed Oct. 4, 2023, the disclosures of which are hereby incorporated herein by reference in their entireties.