The present invention relates to the field of technology-enabled object detection, identification, tracking, and counting. Specifically, the present method and/or system relates to technologically identifying, classifying, counting, and tracking individual people among a human crowd from one or more video streams of the crowd.
Counting, tracking, and analyzing people at events and points of interest play an important role in understanding crowds and their movements. People counting and occupancy monitoring via deep learning algorithms, in indoor and outdoor areas of interest, garners interest from business and/or institutional organizations. It is a cost-effective method to detect and track individual people in real-time video as well as in pre-recorded videos from CCTV or surveillance cameras in order to fulfill various business/institutional needs. These include, but are not limited to, monitoring people inflow inside particular building premises, monitoring sports venues, shopping mall analytics, crowd management, stampede prediction, etc. The solution is implemented by using AI algorithms (AI-enabled computer vision algorithms) that can automatically detect people in video footage; track them while they are inside or outside of a particular area or point of interest; and keep a tally of their number present at a given time, whether in a pre-recorded video or in the stream coming from a LIVE RTSP (Real Time Streaming Protocol) photographing device (such as a camera). Such information systems can be very useful for businesses and organizations to attain deep knowledge and/or implications from people traffic, estimate the occupancy levels of people, and ensure compliance with regulations such as social distancing, occupancy limits, and healthy living standards.
Object detection, tracking, and classification have been a challenging problem in AI computer vision applications. Person detection is one such area where person identification, tracking, and counting have been studied using deep learning approaches. The methodologies adopted to extract person objects from images and videos broadly fall into two categories: one-stage detection and two-stage detection and tracking. While the one-stage approach offers speed and efficiency, its accuracy and precision are affected. The two-stage approach offers accuracy and precision at the cost of speed. A balanced approach, therefore, is needed to not only accurately position the bounding boxes around detected persons and label the boxes with relevant information about the detected persons, but also accurately return detected persons' relative positions from one frame to another in one or more video streams, meaning effective tracking. In other words, effective tracking depends on, among other things, accurate and efficient detection.
One approach for achieving accurate and efficient detection is using an artificial intelligence (AI) model that is trained to detect people from video footage. However, the effectiveness of such an AI model depends on the model's architecture, the data used for training the model, and the training methods. There is room for improvement in the training of an AI model for detecting, especially detecting in real time, people from one or more video footages of people traffic. The disclosure of this application proposes and implements a hybrid people detector and counter which incorporates existing Yolo-based end-to-end deep learning models to detect, track, and count people in different scenarios. Note: Yolo (You Only Look Once) is an open-sourced real-time object detection algorithm that uses a single convolutional neural network (CNN) to spatially separate bounding boxes and associate probabilities with each detected region. It has high speed, high detection accuracy, and better generalization (in terms of application in new fields). Version 7 of Yolo, codenamed Yolov7, integrates the focus layer, represented by a single layer, which is created by replacing the first three layers of Yolov3. This integration reduced the number of layers and the number of parameters and also increased both forward and backward speed without any major impact on the mAP (mean average precision). Note, any version of Yolo later than version 4 would be a good baseline model to start the training; newer versions of Yolo simply pack more features and benefits.
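For illustration only, the following is a minimal sketch of running a Yolo-family detector on a single frame using the open-source ultralytics Python package; the package choice, the generic pretrained weights file, and the confidence threshold are assumptions made for this sketch and are not the specifically trained Yolov7 model of this disclosure.

```python
# Minimal sketch: run a Yolo-family detector on one frame and list person detections.
# Assumptions: the open-source "ultralytics" package and a generic COCO-pretrained
# checkpoint are used purely for illustration.
from ultralytics import YOLO
import cv2

model = YOLO("yolov8n.pt")                      # placeholder baseline checkpoint
frame = cv2.imread("crowd_frame.jpg")           # placeholder input frame

results = model(frame, conf=0.25)               # single forward pass over the frame
for box in results[0].boxes:
    if int(box.cls[0]) == 0:                    # class 0 == "person" in COCO-style models
        x1, y1, x2, y2 = map(int, box.xyxy[0])  # bounding-box corners
        print(f"person at ({x1},{y1})-({x2},{y2}) with confidence {float(box.conf[0]):.2f}")
```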
Based on Yolov7, the disclosure of the application describes a particularly crafted training (based on video footage of the local population), testing, and evaluation of detection, tracking, and re-identification models to achieve maximum people counting and gender-based counting accuracy.
With the benefits offered by the particularly crafted models (fast, accurate, efficient, and particularly customized for the application of human crowd tracking), there are many real and potential use cases under which a variety of business/institutional needs are met.
Smart cities: People analytics comprising people counting and occupancy monitoring can also be used in urban environments to optimize traffic flow, manage public spaces, and improve public safety. Pedestrian traffic management can be a game changer in modern smart cities where the emphasis will be on resource optimization and facility maximization for a fairly large population. Using an AI-based people analytics system, municipalities and city planners can identify areas that may need more resources and improved infrastructure to improve the flow of people and traffic.
Transportation: People counting and analytics can also be used at airports, train stations, and bus and ferry terminals to monitor passenger traffic and optimize capacity planning aimed at improving services and passenger experiences. By tracking the number of people entering and exiting a given area, transportation organizations can better predict passenger demand and adjust transportation schedules accordingly, along with arranging the required number of staff to provide services.
Safety and Security: The people analytics technology can be used to monitor access to secure areas such as hospitals, airports, bus terminals, government buildings, etc. By keeping track of the number of people entering and exiting a given area, security personnel can better manage and optimize available resources aimed at improving security and preventing overcrowding in those places.
Educational institutions: People analytics and monitoring applications have great prospects in schools and universities for tracking student attendance and ensuring compliance with required occupancy limits. By intelligently tracking the number of people in a given classroom, lecture theatre, or student lab, educators can better allocate required resources and plan for capacity needs and capacity augmentation. In addition, people analytics technology can help educators ensure that classrooms are not overcrowded and hence that students do not lose out in terms of learning and teacher-student engagement. It also ensures that seasonal or pandemic rules, like social distancing rules, are correctly followed by a large majority of students in a given educational facility.
Retail shopping malls: Similarly, people counting and occupancy monitoring with an analytics system can be very useful to monitor customer traffic and optimize the availability of staff at various levels and time slots. By analyzing foot traffic entering and exiting shopping malls and stores, businesses can determine the busiest times of the day, month, season, etc. As a result, they can allocate staff resources accordingly and optimize store layouts to improve the overall customer experience while customers remain on the premises.
Sports & entertainment: Sports and entertainment feature prominently in people's lives, and authorities ensure that every sports event takes place without any mayhem or unexpected incident. In order to keep track of the number of people and their subcategories, such as individuals, families, and kids, along with gender and age attributes, AI-based people counting and analytics will be of huge significance. An accurate system will prevent dangerous crowd density from forming, thereby avoiding stampedes, and entry and exit can be easily arranged with the help of an intelligent people monitoring and analytics system.
Heritage: People analytics, including people counting, occupancy monitoring, idle people detection, and crowd density prediction at heritage sites, makes a good use case for organizers to ensure artifacts get even attention. Keeping track of the number of people visiting certain artifacts in a day, week, or month enables organizations to improve the experiences of people visiting heritage sites such as museums, art galleries, antique exhibitions, etc.
This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
The purpose of the present disclosure is to resolve the aforementioned problems and/or limitations of the conventional technique, that is, to provide a technology-enabled crowd analysis capability for accurately detecting, identifying, counting, classifying, and tracking people from one or more video footages of crowd scenes.
Provided is a computer-implemented method for crowd analysis, comprising: training an artificial intelligence (AI) model based on a pre-collected dataset of human images in which each human image is annotated for its body and face to enable the AI model to detect humans based on body or face appearance thereof; deploying the AI model into a production environment, wherein the production environment comprises a computing system and at least one photographing device; receiving a video sequence of a crowd from a photographing device; preprocessing the video sequence, wherein the preprocessing the video sequence includes extracting a plurality of frames from the video sequence, removing noise from the video sequence, and subtracting background information from the video sequence; detecting, for each frame of the plurality of frames, a plurality of persons, and, for each person of the plurality of persons, extracting a plurality of global and local spatial features of the person that can be used for tracking the person in the video sequence, and identifying the person based on the plurality of global and local features of the person including body features, clothing features, or facial features; re-identifying, for each frame of the plurality of frames, the plurality of persons detected from each frame by monitoring whether there is re-appearance, for each person of the plurality of persons, across a plurality of a pre-determined number of frames of the plurality of frames, wherein the re-identifying maintains a same identification for a person of the plurality of persons across all of the person's appearances across the plurality of the pre-determined number of frames; counting persons from the video sequence of the crowd to produce a counting report; classifying persons from the video sequence of the crowd to produce a classifying report; and outputting the counting report and the classifying report.
In one embodiment of the provided method, wherein counting persons is based on gender features of the plurality of persons detected for each frame of the plurality of frames, and classifying persons is based on gender features of the plurality of persons detected from each frame of the plurality of frames.
In another embodiment of the provided method, wherein the outputting the counting report and classifying report comprises displaying the counting report and classifying report.
In another embodiment of the provided method, wherein the outputting the counting report and classifying report comprises analyzing the counting report and classifying report for the purpose of meeting one or more institutional needs.
In another embodiment of the provided method, it further comprises receiving a second video sequence of the crowd from a second photographing device, wherein the second photographing device is non-overlapping disjoint with the photographing device; preprocessing the second video sequence, wherein the preprocessing the second video sequence includes extracting a plurality of frames from the second video sequence, removing noise from the second video sequence, and subtracting background information from the second video sequence; detecting, for each frame of the plurality of frames extracted from the second video sequence, a plurality of persons, and, for each person of the plurality of persons, extracting a plurality of global and local features of the person that can be used for tracking the person in the second video sequence, and identifying the person based on the plurality of global and local features of the person; and re-identifying, for each frame of the plurality of frames extracted from the video sequence and the second video sequence, the plurality of persons detected from each frame of the plurality of frames extracted from the video sequence and each frame of the plurality of frames extracted from the second video sequence by monitoring whether there is re-appearance, for each person of the plurality of persons, across a plurality of a second pre-determined number of frames of the plurality of frames extracted from the video sequence and the second video sequence, wherein the re-identifying maintains a same identification for a person of the plurality of persons across all of the person's appearances across the plurality of the pre-determined number of frames among the plurality of frames extracted from the video sequence and the second video sequence.
In another embodiment of the provided method, wherein the re-identifying step further comprises using a deep learning algorithm to detect whether two persons detected from two frames are the same person, wherein the deep learning algorithm is pretrained to attain a loss function that makes the distance between two images of the same person as small as possible and the distance between two images of different persons as large as possible.
In another embodiment of the provided method, wherein the training an artificial intelligence (AI) model step comprises: modifying a Yolov7 model pretrained on an image-set of crowd human to obtain a set of hyperparameters used for the Yolov7 model to achieve a better detection accuracy, wherein the image-set of crowd human has one or more annotation labels for each image therein; analyzing the labeling annotation strategy of the image-set of crowd human and then parsing through all images in the image-set of crowd human to determine any inconsistency of the one or more annotation labels for each image in the image-set of crowd human, and correcting all the inconsistent labels in the image-set of crowd human; retraining the Yolov7 model with the image-set of crowd human to produce a new version of the Yolov7 model, investigating whether the new version of the Yolov7 model would produce the same set of results as the previous version of the Yolov7 model when running against a same set of test data, and updating the previous version of the Yolov7 model with the new version of the Yolov7 model in the case that the new version of the Yolov7 model outperforms the previous version of the Yolov7 model; retraining the new version of the Yolov7 model with only an image-set of human body to make the new version of the Yolov7 model detect humans from images based on imagery of human bodies only as opposed to imagery of human bodies and human faces; integrating the new version of the Yolov7 model in a semi-production environment, and generating a batch of labeled images by running the new version of the Yolov7 model on a semi-production dataset; correcting any labeling mistakes in the batch of labeled images according to a set of pre-determined labeling strategies; converting annotations on all images in the batch of labelled images according to the format of the image-set of crowd human; evaluating the performance of all versions of the Yolov7 models running on the batch of labeled images in terms of accuracy of detection, and determining the performance trend across all previous versions of the Yolov7 models; retraining the latest version of the Yolov7 model with a new batch of data that is different from any batch of data previously used; repeating the steps from 5 to 9, until reaching a pre-determined threshold of detection accuracy; and outputting the latest version of the Yolov7 model as the AI model.
In another embodiment of the provided method, wherein the training an artificial intelligence (AI) model is conducted based on a locally collected image-set containing images of local Arab people's clothing customs, in which females wear the Abaya and males wear the Shimaagh.
In another embodiment of the provided method, it further comprises benchmarking the performance of detecting people, counting detected people, and classifying detected people, and retraining the AI model in an interactive way, in the case that the benchmarking produces an unsatisfactory result, based on a plurality of custom datasets until the benchmarking produces a satisfactory result.
In another embodiment of the provided method, wherein the classifying step uses a file and folder directory structure in the production environment to facilitate the classifying step.
Provided is a system, comprising: a computing device and one or more photographing devices, wherein the computing device comprises a GPU, a processor, one or more computer-readable memories and one or more computer-readable, tangible storage devices, one or more input devices, one or more output devices, and one or more communication devices, and wherein the one or more photographing devices are connected to the computing device for feeding one or more captured video streams of human crowd scenes to the computing device's video buffer, and the computing device, prior to receiving the one or more captured video streams of human crowd scenes from the one or more photographing devices, to perform operations of training an artificial intelligence (AI) model based on a pre-collected dataset of human images in which each human image is annotated for its body and face to enable the AI model to detect humans based on body or face appearance thereof; and deploying the AI model into a production environment, wherein the production environment comprises a computing system and at least one photographing device, and wherein the computing device, upon receiving a video sequence of a crowd from one of the one or more photographing devices, to perform operations comprising: preprocessing the video sequence, wherein the preprocessing the video sequence includes extracting a plurality of frames from the video sequence, removing noise from the video sequence, and subtracting background information from the video sequence; detecting, for each frame of the plurality of frames, a plurality of persons, and, for each person of the plurality of persons, extracting a plurality of global and local spatial features of the person that can be used for tracking the person in the video sequence, and identifying the person based on the plurality of global and local features of the person including body features, clothing features, or facial features; re-identifying, for each frame of the plurality of frames, the plurality of persons detected from each frame by monitoring whether there is re-appearance, for each person of the plurality of persons, across a plurality of a pre-determined number of frames of the plurality of frames, wherein the re-identifying maintains a same identification for a person of the plurality of persons across all of the person's appearances across the plurality of the pre-determined number of frames; counting persons from the video sequence of the crowd to produce a counting report; classifying persons from the video sequence of the crowd to produce a classifying report; and outputting the counting report and the classifying report.
In an embodiment of the provided system, wherein counting persons is based on gender features of the plurality of persons detected for each frame of the plurality of frames, and classifying persons is based on gender features of the plurality of persons detected from each frame of the plurality of frames.
In another embodiment of the provided system, wherein the outputting the counting report and classifying report comprises displaying the counting report and classifying report.
In another embodiment of the provided system, wherein the outputting the counting report and classifying report comprises analyzing the counting report and classifying report for the purpose of meeting one or more institutional needs.
In another embodiment of the provided system, the computing device to perform operations further comprising: receiving a second video sequence of the crowd from a second photographing device, wherein the second photographing device is non-overlapping disjoint with the photographing device; preprocessing the second video sequence, wherein the preprocessing the second video sequence includes extracting a plurality of frames from the second video sequence, removing noise from the second video sequence, and subtracting background information from the second video sequence; detecting, for each frame of the plurality of frames extracted from the second video sequence, a plurality of persons, and, for each person of the plurality of persons, extracting a plurality of global and local features of the person that can be used for tracking the person in the second video sequence, and identifying the person based on the plurality of global and local features of the person; and re-identifying, for each frame of the plurality of frames extracted from the video sequence and the second video sequence, the plurality of persons detected from each frame of the plurality of frames extracted from the video sequence and each frame of the plurality of frames extracted from the second video sequence by monitoring whether there is re-appearance, for each person of the plurality of persons, across a plurality of a second pre-determined number of frames of the plurality of frames extracted from the video sequence and the second video sequence, wherein the re-identifying maintains a same identification for a person of the plurality of persons across all of the person's appearances across the plurality of the pre-determined number of frames among the plurality of frames extracted from the video sequence and the second video sequence.
In another embodiment of the provided system, wherein the re-identifying step further comprises using a deep learning algorithm to detect whether two persons detected from two frames are the same person, wherein the deep learning algorithm is pretrained to attain a loss function that makes the distance between two images of the same person as small as possible and the distance between two images of different persons as large as possible.
In another embodiment of the provided system, wherein the training an artificial intelligence (AI) model step comprises: modifying a Yolov7 model pretrained on an image-set of crowd human to obtain a set of hyperparameters used for the Yolov7 model to achieve a better detection accuracy, wherein the image-set of crowd human has one or more annotation labels for each image therein; analyzing the labeling annotation strategy of the image-set of crowd human and then parsing through all images in the image-set of crowd human to determine any inconsistency of the one or more annotation labels for each image in the image-set of crowd human, and correcting all the inconsistent labels in the image-set of crowd human; retraining the Yolov7 model with the image-set of crowd human to produce a new version of the Yolov7 model, investigating whether the new version of the Yolov7 model would produce the same set of results as the previous version of the Yolov7 model when running against a same set of test data, and updating the previous version of the Yolov7 model with the new version of the Yolov7 model in the case that the new version of the Yolov7 model outperforms the previous version of the Yolov7 model; retraining the new version of the Yolov7 model with only an image-set of human body to make the new version of the Yolov7 model detect humans from images based on imagery of human bodies only as opposed to imagery of human bodies and human faces; integrating the new version of the Yolov7 model in a semi-production environment, and generating a batch of labeled images by running the new version of the Yolov7 model on a semi-production dataset; correcting any labeling mistakes in the batch of labeled images according to a set of pre-determined labeling strategies; converting annotations on all images in the batch of labelled images according to the format of the image-set of crowd human; evaluating the performance of all versions of the Yolov7 models running on the batch of labeled images in terms of accuracy of detection, and determining the performance trend across all previous versions of the Yolov7 models; retraining the latest version of the Yolov7 model with a new batch of data that is different from any batch of data previously used; repeating the steps from 5 to 9, until reaching a pre-determined threshold of detection accuracy; and outputting the latest version of the Yolov7 model as the AI model.
In another embodiment of the provided system, wherein the training an artificial intelligence (AI) model is conducted based on a locally collected image-set containing images of local Arab people's clothing customs, in which females wear the Abaya and males wear the Shimaagh.
In another embodiment of the provided system, the computing device, prior to receiving the one or more captured video streams of human crowd scenes from the one or more photographing devices, to perform operations that further comprise benchmarking the performance of detecting people, counting detected people, and classifying detected people, and retraining the AI model in an interactive way, in the case that the benchmarking produces an unsatisfactory result, based on a plurality of custom datasets until the benchmarking produces a satisfactory result.
In another embodiment of the provided system, wherein the classifying step uses a file and folder directory structure in the production environment to facilitate the classifying step.
Other features and aspects will be apparent from the following detailed description, the drawings, and the claims.
The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.
Various embodiments and aspects of the inventions will be described with reference to details discussed below, and the accompanying drawings will illustrate some embodiments. The following description and drawings are illustrative of the invention and are not to be construed as limiting the scope of the invention. Numerous specific details are described to provide an overall understanding of the present invention to one of ordinary skill in the art.
Reference in the specification to “one embodiment” or “an embodiment” or “another embodiment” means that a particular feature, structure, or characteristic described in conjunction with the embodiment can be included in at least one embodiment of the invention but need not be in all embodiments. The appearances of the phrase “in one embodiment” in various places in the specification do not necessarily all refer to the same embodiment.
Embodiments use a computer system for receiving, storing, and analyzing image/video data of human crowd scenes and for providing information for detecting, tracking, counting, and classifying the people appearing therein. The system, in particular, employs artificial intelligence techniques to train predictive models for such detection, tracking, re-identification, and classification.
Input/Output (I/O) devices 112, 114 (including but not limited to keyboards, displays, pointing devices, transmitting devices, mobile phones, edge devices, verbal devices such as a microphone driven by voice recognition software, or other known equivalent devices, etc.) may be coupled to the system either directly or through intervening I/O controllers 110. More pertinent to the embodiments of the disclosure are photographing devices, as one genre of input device. A photographing device can be a camera, a mobile phone that is equipped with a camera, an edge device that is equipped with a camera, or any other device that can capture one or more images/videos of an object (or a view) via various means (such as optical means or radio-wave based means), store the captured images/videos in some local storage (such as a memory, a flash disk, or the like), and transmit the captured images/videos, as input data, to either a more permanent storage (such as a database 118 or a storage 116) or the at least one processor 102, depending on where the captured images/videos are to be transmitted.
Input Devices 112 receive input data (raw and/or processed) and instructions from a user or other source. Input data includes, inter alia, (i) captured images of crowd scenes, (ii) captured videos of crowd scenes, and/or (iii) the angles between the captured scene and the surface of the photographing device's optical lens that faces the scene and that is used when capturing the images/videos.
Network adapters 108 may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters 108. Network adapters 108 may also be communicatively coupled to internet 122 and/or cloud 124 to access remote computer resources such as on-premise computing systems (not shown).
The computer architecture 100 may be coupled to storage 116 (e.g., any type of storage device; a non-volatile storage area, such as magnetic disk drives, optical disk drives, a tape drive, etc.). The storage 116 may comprise an internal storage device or an attached or network accessible storage. Computer programs 106 in storage 116 may be loaded into the memory elements 104 and executed by a processor 102 in a manner known in the art.
Computer programs 106 may include AI programs or ML programs, and the computer programs 106 may partially reside in the memory elements 104, partially reside in storage 116, and partially reside in cloud 124 or in an on-premise computing system reachable via internet 122.
The computer architecture 100 may include fewer components than illustrated, additional components not illustrated herein, or some combination of the components illustrated and additional components. The computer architecture 100 may comprise any computing device known in the art, such as a mainframe, server, personal computer, workstation, laptop, handheld computer, telephony device, network appliance, virtualization device, storage controller, virtual machine, smartphone, tablet, etc.
Input device(s) 112 transmit input data to processor(s) 102 via memory elements 104 under the control of operating system 105 and computer program(s) 106. The processor(s) 102 may be central processing units (CPUs) and/or any other types of processing device known in the art. In certain embodiments, the processing devices 102 are capable of receiving and processing input data from multiple users or sources; thus, the processing devices 102 have multiple cores. In addition, certain embodiments involve the use of videos (i.e., graphics-intensive information) or digitized information (i.e., digitized graphics); these embodiments therefore employ graphics processing units (GPUs) as the processor(s) 102 in lieu of or in addition to CPUs.
Certain embodiments also comprise at least one database 118 for storing desired data. Some raw input data are converted into digitized data format before being stored in the database 118 or being used to create the desired output data. It's worth noting that storage(s) 116, in addition to being used to store computer program(s) 106, are also sometimes used to store input data, raw or processed, and to store intermediate data. The permanent storage of input data and intermediate data is primarily database(s) 118. It is also noted that the database(s) 118 may reside in close proximity to the computer architecture 100, or remotely in the cloud 124, and the database(s) 118 may be in various forms or database architectures. Because certain embodiments need a storage for storing large volumes of photo image/video data, more than one database likely is used.
Computer Architecture 100 generically represents either a standalone computing device or an ensemble of computing devices, such as a computer server, an ensemble of computer servers (a server farm), a mobile computing device (such as a mobile phone), or communicatively coupled and distributed computing resources that collectively have the elements and structures of the Computer Architecture 100.
In all embodiments, the Computer Architecture 100 is the system behind all kinds of deployment environments: the training environment (for model training), the test environment (for testing a trained model), the semi-production environment (for beta testing a trained model in a close-to-production environment), and the production environment.
Because certain embodiments need a storage or buffer for storing large volumes of video footage, more than one database likely is used.
First of all, a sequence of frames from a video or a CCTV (RTSP-enabled) camera 302 is passed through preprocessing 304 to format and clean the video data (for example, noise removal from the video sequences and subtraction of the background in order to extract the foreground objects of interest; note, in the application of all embodiments, the objects of interest in a video sequence are one or more persons in a single frame). Then, the video data, after being divided into multiple frames, is fed to a CNN-based detector 308 to extract person features. The image of a single frame is then divided into a grid of cells, and each cell is assigned a set of anchor boxes (a process called localization 306), which are pre-defined boxes of different sizes and aspect ratios that are used to predict the bounding boxes of the objects or persons in an image or multiple images (a process called identification 310).
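By way of a hedged illustration of the preprocessing stage 304 (noise removal and background subtraction), the following minimal sketch uses the OpenCV library; the RTSP URL and the parameter values are placeholder assumptions, and the disclosure does not mandate these particular APIs.

```python
# Illustrative-only preprocessing sketch: frame extraction, noise removal, and
# background subtraction. The video source and parameters are placeholders.
import cv2

cap = cv2.VideoCapture("rtsp://camera.example/stream")   # hypothetical RTSP source
bg_subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)

frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    denoised = cv2.GaussianBlur(frame, (5, 5), 0)         # simple noise removal
    fg_mask = bg_subtractor.apply(denoised)               # background subtraction
    foreground = cv2.bitwise_and(denoised, denoised, mask=fg_mask)
    frames.append(foreground)                             # foreground frames go to the detector
cap.release()
```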
Detected persons are passed to a tracker module 314, which makes associations among persons being detected (312) and then makes predictions about the detected persons in terms of whether they are the same persons previously detected in one or more previous frames (316). This process is also called re-identification. After re-identification, people counting is conducted at 318, which counts detected persons only once despite their re-appearance in multiple frames. Afterwards, classifying the detected and identified persons is conducted at 320, which classifies them according to their features, such as gender, height, or size. The last process 322 either outputs the result of counting and classification or conducts a deeper analysis on the result report to find patterns of the detected persons (such as their movement speed around a certain type of merchant goods in a mall, which sub-groups of persons tend to cluster together, etc.).
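As a simple worked example of the counting step 318, the following sketch counts each tracked identity only once even when it re-appears across frames; the per-frame lists of track IDs are hypothetical stand-ins for the tracker output.

```python
# Minimal sketch: count each tracked person only once across frames.
# The per-frame ID lists below are placeholders for real tracker output.
from typing import Iterable

def count_unique_persons(frames_of_track_ids: Iterable[Iterable[int]]) -> int:
    seen_ids = set()
    for track_ids in frames_of_track_ids:
        seen_ids.update(track_ids)        # a re-appearing ID does not increase the count
    return len(seen_ids)

# Person 7 appears in frames 1 and 3 but is counted once; the total count is 3.
print(count_unique_persons([[3, 7], [3], [7, 9]]))
```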
The CNN-based detector 308 plays a vital role in the crowd analysis flow described above.
Detected persons 410 are used to crop the frame images into individual images of detected persons 416, to which feature extraction 418 is applied in order to extract the spatial features of each detected person. These detected persons are then clustered at 420 to form a cluster table 424 in which each row lists a number of key features of a person (such as person 1, person 2, person 4, etc.). The table then goes through another QA process 428 to find out whether there is confusion among the tabled persons (such as whether there are redundantly identified persons). If there is no confusion, then the process is successfully concluded 430. Otherwise, the QA process further labels (re-identifies) and sorts the tabled persons with the help of files and folders that hold the person images in an orderly way 426.
Afterwards, the model is retrained at 432 to learn the re-identification of the mis-identified person(s) that caused confusion at step 428. The retrained model is then applied to extract features of detected persons at step 418 again. The cycle of 418, 420, 424, 428, 426, and 432 repeats until success 430 is produced (i.e., there is no mis-identification of detected persons).
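As a hedged illustration of the clustering step 420, the following sketch groups per-crop feature vectors into identity clusters; the use of scikit-learn's agglomerative clustering, the embedding dimensionality, and the distance threshold are assumptions for this example rather than the disclosure's exact clustering method.

```python
# Illustrative sketch: cluster per-person feature vectors so that crops sharing a
# cluster label are treated as the same person. Values below are placeholders.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

embeddings = np.random.rand(12, 512)            # 12 cropped person images, 512-d features
embeddings = normalize(embeddings)              # unit vectors so euclidean ~ cosine distance

clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=0.8, linkage="average")
labels = clusterer.fit_predict(embeddings)

for person_id in np.unique(labels):
    crops = np.where(labels == person_id)[0].tolist()
    print(f"person {person_id}: crops {crops}")
```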
In our embodiments, the method takes videos and images and crops out global and local features from each person in the frame for important feature extraction, so that each person can be tracked for some time and re-identified if he or she re-appears after a few frames in a video sequence. It is important to mention here that feature extraction of each person in every single frame becomes very laborious during tracking unless some persons are dropped, for example when a person goes out of frame and never appears again. There are various factors that have to be kept in mind, the most important one being occlusion due to crowd density. Occlusion prevents a person from re-appearing, or his/her partial appearance confuses the system and, in typical cases, prevents it from re-identifying the same person. In order to address the challenge imposed by occlusion, the method resorts to memories, keeping all the features in short-term memory for every 500 frames in order to increase the chances of a person being re-identified. With the help of these memories, the overall error of mis-identification in the total number of people being counted is minimized.
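A minimal sketch of such a 500-frame short-term feature memory follows; the 500-frame window comes from the description above, while the class name, matching threshold, and feature dimensionality are assumptions made only for illustration.

```python
# Sketch of a short-term feature memory covering the most recent 500 frames,
# used to re-identify a person who re-appears after a brief occlusion.
from collections import deque
import numpy as np

MEMORY_FRAMES = 500   # window taken from the description; other values are possible

class ShortTermReIDMemory:
    def __init__(self):
        self.entries = deque()   # each entry: (frame_index, person_id, unit feature vector)

    def add(self, frame_index: int, person_id: int, feature: np.ndarray) -> None:
        self.entries.append((frame_index, person_id, feature / np.linalg.norm(feature)))
        while self.entries and frame_index - self.entries[0][0] > MEMORY_FRAMES:
            self.entries.popleft()          # forget features older than the window

    def match(self, feature: np.ndarray, threshold: float = 0.7):
        """Return the remembered person_id most similar to `feature`, or None."""
        feature = feature / np.linalg.norm(feature)
        best_id, best_sim = None, threshold
        for _, person_id, stored in self.entries:
            sim = float(np.dot(feature, stored))   # cosine similarity of unit vectors
            if sim > best_sim:
                best_id, best_sim = person_id, sim
        return best_id
```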
The above-described model training and retraining is specifically based on 120,000 or more images that include western and middle eastern men and women of various age groups. Although tremendous advancements have taken place in AI computer vision research, particularly focusing on gender detection analytics, generalization of an AI model catered toward a specific region of interest still poses a huge challenge. Gender detection is a powerful tool that can be used in a plethora of AI applications ranging from marketing to security, surveillance to employee attendance systems, and controlling the flow of people to a POI (point of interest) at the time of an eventuality such as the COVID-19 pandemic. With the capacity of an AI application to accurately identify and authenticate gender-based entries/exits, businesses and organizations can improve their services, leading to timely decisions based on accurate information. Some of the applications that can be built on accurate gender-based counting systems include, but are not limited to, customer segmentation, store layout optimization, staff allocation on a particular day of the week, product recommendation, and sales optimization performance analysis.
In order to detect the gender of a detected person, the Re-ID and gender-based models are trained with the mixed custom dataset in order to cater to the gaps in the CrowdHuman dataset. In other words, the training data has custom-made images (images labelled with gender information and/or gender-specific clothing information) in addition to the CrowdHuman dataset.
RTSP Cam(s) 702 capture a number of video footage streams, and the streams are then processed into video frames 704. Frames of footage are then fed into the person detector for detection 708. Detected persons, represented as their respective bboxed (bounding-boxed) objects in the frames, are put into an image gallery 710.
In another route, video frames 704 can also be used to extract target person images (706) via a person detector, and the target person images are input into a ReID system 712 for training the ReID system. The ReID system 712 receives a large number of target person images 706 and goes through the processes of feature extraction, metric learning (of the extracted features), and similarity measurement (finding similarities among features of different images). After the training, the ReID system 712 is able to uniquely identify a person even if the person appears in different frame images from different view perspectives.
Once the ReID system 712 is ready, it can take tasks either directly from the person detector 708 or directly from the gallery 710. Since the output (detected and identified persons) from the person detector 708 needs to be tracked, the detected and identified persons need to be re-identified before being properly tracked. As a result, the information about the detected and identified persons is sent 714 to the ReID system 712 for re-identification, and the outcome of the re-identification is used as input for the tracker (not shown).
On the other hand, queries (such as the query 718 for searching for a white-haired man captured by camera 1, or the query 720 for searching for a red T-shirted woman captured by camera 2) can be issued to the gallery 710, which in turn processes the queries 716 by using the ReID system 712. The trained ReID system 712 then processes the queries and returns cropped images of the queried persons in a relevance-based ranking 722. 722 shows that the white-haired man is found in the gallery in two different cropped images, each of which is apparently captured from a different photographing angle and not necessarily from the same camera. Similarly, the red T-shirted woman is found in the gallery in two different cropped images, each of which is apparently captured from a different photographing angle and not necessarily from the same camera.
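As an illustration of producing the relevance-based ranking 722, the following sketch ranks gallery crops against a query embedding by cosine similarity; the embedding arrays and gallery size are placeholders rather than outputs of the actual ReID system.

```python
# Sketch: rank gallery crops by cosine similarity to a query person embedding.
# The feature arrays are random placeholders standing in for ReID embeddings.
import numpy as np

def rank_gallery(query_feat: np.ndarray, gallery_feats: np.ndarray, top_k: int = 5):
    q = query_feat / np.linalg.norm(query_feat)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    similarities = g @ q                             # one cosine score per gallery crop
    order = np.argsort(-similarities)[:top_k]        # most similar crops first
    return order, similarities[order]

query = np.random.rand(512)                          # e.g., the "white-haired man" query crop
gallery = np.random.rand(1000, 512)                  # features of all gallery crops
indices, scores = rank_gallery(query, gallery)
print(list(zip(indices.tolist(), np.round(scores, 3).tolist())))
```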
In some embodiments, various models are trained on a variety of custom data sets in an interactive way. Note, for the detector model training, the model can start with one of the available versions of the YOLO model, such as version 5 or version 7.
More specifically, the YOLOv7 object detector receives input images or a video stream from one or more RTSP cameras and outputs bounding boxes for detected objects, including people. The Deep SORT object tracker takes the bounding boxes from YOLOv7 as input and generates tracklets with unique IDs for each detected object. The ReID feature helps in maintaining consistent identities across frames. This has been achieved through rigorous training of an AI algorithm (a ResNet50 backbone and an OSNet ReID predictor) which has been fine-tuned rigorously until optimum accuracy was achieved. The datasets in the figure above show the extent to which data collection from multiple venues was made possible with human annotators and labelers in the loop; this was needed to ensure that several hundred images of a single person having a unique ID were validated before the OSNet ReID model was trained on the data iteratively until optimum accuracy was achieved.
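A hedged sketch of wiring such a detector to such a tracker is given below; it assumes the open-source ultralytics and deep-sort-realtime Python packages and their documented interfaces, along with placeholder weights and video paths, and it is not the specifically trained YOLOv7/Deep SORT/OSNet pipeline of this disclosure.

```python
# Illustrative sketch: feed Yolo-family person detections into a Deep SORT tracker
# to obtain tracklets with unique IDs. Package choices and file names are assumptions.
import cv2
from ultralytics import YOLO
from deep_sort_realtime.deepsort_tracker import DeepSort

detector = YOLO("yolov8n.pt")                 # placeholder detector checkpoint
tracker = DeepSort(max_age=30)                # keep an ID alive for up to 30 missed frames

cap = cv2.VideoCapture("crowd.mp4")           # placeholder video source
while True:
    ok, frame = cap.read()
    if not ok:
        break
    detections = []
    for box in detector(frame, conf=0.3)[0].boxes:
        if int(box.cls[0]) == 0:              # class 0 == person in COCO-style models
            x1, y1, x2, y2 = map(float, box.xyxy[0])
            detections.append(([x1, y1, x2 - x1, y2 - y1], float(box.conf[0]), "person"))
    for track in tracker.update_tracks(detections, frame=frame):
        if track.is_confirmed():
            print(f"track {track.track_id}: {track.to_ltrb()}")
cap.release()
```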
Finally, the gender classifier predicts whether each tracklet corresponds to a male or female, providing gender predictions associated with each detected object.
The output from the various models can be integrated into an overall prediction, a process referred to as Integration of Predictions.
After the outputs are integrated, an Output and Visualization process is conducted.
The output can also be evaluated and trigger further training for optimization; this process is referred to as Evaluation and Optimization.
After all models are trained and optimized, the next step is to deploy and use them; this process is referred to as Deployment and Usage.
In the system 900, each component is represented as a separate node, and arrows indicate the flow of data between them. The YOLO detector detects objects and outputs bounding box coordinates along with object class probabilities. These outputs are then used as inputs to the Deep SORT tracker, which tracks the objects and updates their bounding box coordinates over time. Simultaneously, the bounding box coordinates are passed to the gender classifier focusing on Arab person classification and localization, the ethnicity classifier (Arab & Non-Arab), and the age classifier (Child, Adult, Elderly), which predict gender, ethnicity, and age based on the detected objects. Finally, the predictions from all classifiers are provided as the output of the system, in the form of an integrated output or in the form of separate outputs.
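For illustration, the following sketch fuses per-track classifier outputs into a single integrated record; the classifier functions are hypothetical stand-ins for the trained gender, ethnicity, and age models, and the field values are placeholders.

```python
# Sketch: integrate gender, ethnicity, and age predictions for one tracked person.
# The classifier callables below are hypothetical stand-ins for trained models.
from dataclasses import dataclass

@dataclass
class TrackPrediction:
    track_id: int
    gender: str        # "male" / "female"
    ethnicity: str     # "Arab" / "Non-Arab"
    age_group: str     # "Child" / "Adult" / "Elderly"

def classify_track(track_id, crop, gender_model, ethnicity_model, age_model):
    return TrackPrediction(
        track_id=track_id,
        gender=gender_model(crop),
        ethnicity=ethnicity_model(crop),
        age_group=age_model(crop),
    )

# Example with trivial stand-in models and a placeholder crop:
print(classify_track(7, crop=None,
                     gender_model=lambda c: "female",
                     ethnicity_model=lambda c: "Arab",
                     age_model=lambda c: "Adult"))
```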
In training the model to be able to re-identify a detected person, the cropped images of detected persons (508 and 608) are organized under a folder structure in which each individual unique person's images are put in a folder separate from the folder that holds images of a different person.
The goals of the annotation and labelling task include sorting the images into the following semi-structured folder structure, wherein each folder holds one or more images for one person:
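Since the exact listing is not reproduced here, the following is an illustrative-only sketch of producing such a per-identity layout; every folder and file naming convention shown is hypothetical.

```python
# Illustrative sketch: sort cropped person images into one folder per unique person.
# e.g. crops/person_0001_cam1_f000123.jpg  ->  dataset/person_0001/cam1_f000123.jpg
# All names and the naming convention are hypothetical examples.
from pathlib import Path
import shutil

crops_dir = Path("crops")
dataset_dir = Path("dataset")

for crop in crops_dir.glob("person_*_*.jpg"):
    person_id = "_".join(crop.stem.split("_")[:2])     # e.g. "person_0001"
    target_folder = dataset_dir / person_id            # one folder per unique person
    target_folder.mkdir(parents=True, exist_ok=True)
    shutil.copy(crop, target_folder / crop.name.replace(person_id + "_", ""))
```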
With these semi-structured folders, the task of the labeler included verifying and correcting the sorting so that each folder contains images of only one unique person.
The task of the labeler also includes deleting images that are not of interest: among the cropped-image folders, some cropped images may contain scenarios that are not of interest, in which case the labeler should delete those images. A pre-determined list of cases serves as the guideline the labeler should use to decide which images to delete.
Person re-identification (Re-ID) technology uses AI computer vision to judge whether a specific or sought-after person is present in a particular video sequence or set of images. This can be in a video sequence from a single camera or from multiple non-overlapping disjoint cameras. The task is extremely challenging due to changes in person poses, different camera views, local and global spatial features of persons, and occlusions between several persons. It is a widely researched problem, explored alongside image or frame retrieval in a video sequence. Given a range of images, the algorithm is asked to retrieve the images of the person in question in a video sequence. AI deep neural networks integrate spatial features from two images/frames in a video sequence and ascertain whether the two images belong to the same person. Re-ID aims to make up for the visual limitation of current fixed cameras and can be combined with custom-based person detection and tracking technology. Such technology is widely used in intelligent video monitoring, building security monitoring, heritage site analytics applications, retail customer segmentation, and the like.
In all embodiments, the model for Re-ID is fine-tuned to re-identify persons more accurately from video sequences or images, and/or from different sources of video sequences or images, by using the above-mentioned person image labelling and using the labelled images to retrain the model.
Retraining the ReID model involves dataset preparation: augmenting the image dataset of global people (world-wide people) with images of local people.
Specifically, we ran experiments on a variety of datasets to ensure that the AI deep learning algorithms learn a diverse set of features from the global and local features of people, such as wearables, structure (height, appearance, etc.), and the number of appearances in a video sequence. We have trained the model on a mixed dataset which includes Market1501, a large-scale public benchmark dataset for person re-identification. It contains 1,501 identities captured by six different cameras, and 32,668 pedestrian image bounding boxes obtained using the Deformable Part Models pedestrian detector. Each person has 3.6 images on average at each viewpoint. The dataset is split into two parts: 750 identities are utilized for training and the remaining 751 identities are used for testing. In the official testing protocol, 3,368 query images are selected as the probe set to find the correct match across 19,732 reference gallery images.
Positive Sample ReIDs Vs. Negative Sample ReIDs
We have trained a deep learning algorithm, OSNet, on a custom dataset with a view to computing the loss function based on the similarity between the two persons in two images in a video sequence. The loss function in our network makes the distance between images of the same person (a positive sample pair) as small as possible and the distance between images of different persons (negative sample pairs) as large as possible.
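One common way to express such an objective is a triplet margin loss; the following PyTorch sketch is an assumption-laden illustration of that idea, and the disclosure's actual loss formulation may differ.

```python
# Sketch: a triplet margin loss pulls same-person embeddings together and pushes
# different-person embeddings apart. Embedding sizes and margin are placeholders.
import torch
import torch.nn as nn

loss_fn = nn.TripletMarginLoss(margin=0.3)

anchor   = torch.randn(8, 512, requires_grad=True)   # embeddings of person A (view 1)
positive = torch.randn(8, 512)                        # embeddings of person A (view 2)
negative = torch.randn(8, 512)                        # embeddings of other persons

loss = loss_fn(anchor, positive, negative)            # small positive-pair distance,
loss.backward()                                       # large negative-pair distance
print(float(loss))
```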
For this task, the goal is to classify each folder labeled for person Re-ID as either male or female. How this is recorded depends on the labeler: he/she (or an automatic software app) may create an xlsx/csv file with two columns, one for the folder id and the other for the gender, or may create two parent folders, one holding the male folders and one holding the female folders from the person Re-ID task.
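A minimal sketch of producing the two-column folder-id/gender file is shown below; the folder layout and the labeling function are placeholders for the human labeler or the automatic app.

```python
# Sketch: write a two-column CSV mapping each per-person folder to a gender label.
# The dataset layout and the label_gender() decision are placeholders.
import csv
from pathlib import Path

dataset_dir = Path("dataset")          # e.g. dataset/person_0001, dataset/person_0002, ...

def label_gender(folder: Path) -> str:
    """Placeholder for the human labeler's or classifier's decision."""
    return "female"

with open("folder_gender.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["folder_id", "gender"])
    for person_folder in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        writer.writerow([person_folder.name, label_gender(person_folder)])
```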
The person Re-ID and gender-based labeling tasks can also be combined into one task to reduce the complexity of searching for the re-appearance of a person among multiple folders.
The goal of this section is to specify the labelling requirements for the gender-based classification model. We have developed a gender-based person cropper tool which crops out male and female instances from video sequences as well as from 2D images. The program provides a directory structure comprising Male (M) and Female (F) folders, which are then given to human labelers/annotators to rectify any misclassification. The annotators/labelers carry out the task by following a set of pre-defined steps.
The detector is a Yolov7-based model (Ultralytics) that is trained and evaluated on the CrowdHuman dataset. As CrowdHuman is a dataset that contains annotations for both the human body and the face, the model is trained to predict body and face at the same time.
To integrate the model into our crowd-counter pipeline, a sequence of integration and validation steps is carried out.
Note, the above steps can be conducted in a training environment alone, or can be conducted in a training environment and then tested in a semi-production environment or a test environment.
Thorough experimentation and benchmarking were performed to validate the accuracy of the detection, tracking, and gender classification algorithms. Most of the algorithms in the online public space have been trained on people having some part of the skin visible, including the face, which makes a person easily identifiable for a trained algorithm. However, such algorithms struggle to perform well on a dataset containing videos and images of people wearing the Abaya (a loose over-garment), which is meant to cover the entire body of a person, especially a female. Similarly, men wear a headcloth called the Shimaagh, which is a piece of cloth designed for a desert environment to protect the wearer from sand and heat. Our dataset, collected from local regions, includes more than 80% men and women wearing traditional dress (i.e., Abaya and Shimaagh), and it was collected, annotated, and labelled for training a detector model, tracking model, and gender classification model as part of the people counting and occupancy monitoring system invention. A benchmarking validation analysis was conducted which involved testing pre-trained detection, tracking, and gender classification models on locally collected videos. Such analysis informed us that, using such models trained on non-Arab people datasets, we would not be able to achieve sufficient accuracy in terms of detection, tracking, and gender classification. All of this required us to thoroughly train ML algorithms on locally collected data, with individual algorithms trained using different data parameters and hyperparameters. The performance of these individual algorithms is demonstrated in some of the visualizations shown below under the Results & Evaluation section. The segmentation part is done using the features extracted from the model backbone to predict pixel-wise areas of interest; the predicted areas are then passed as binary image masks into a dimension estimation network, in parallel for each detected and identified element, which estimates the size thereof given the mask image reference.
After training the YOLOv7 CrowdHuman detection model (in some embodiments, YOLOv7 is used) on a custom annotated dataset, we calculate the average precision for each class based on the model predictions, as shown below.
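As a simplified worked example (not the disclosure's reported results), per-class average precision can be computed from match labels and confidence scores as sketched below; a full detection mAP additionally requires IoU-based matching of predictions to ground-truth boxes, which is omitted here for brevity.

```python
# Simplified sketch: per-class average precision from detection confidences.
# 1 = detection matched a ground-truth box of that class, 0 = false positive.
# The label/score arrays are illustrative placeholders, not reported results.
import numpy as np
from sklearn.metrics import average_precision_score

per_class = {
    "body": (np.array([1, 1, 0, 1, 0, 1]), np.array([0.92, 0.81, 0.40, 0.77, 0.35, 0.66])),
    "face": (np.array([1, 0, 1, 1, 0, 0]), np.array([0.88, 0.52, 0.75, 0.69, 0.30, 0.22])),
}
for cls, (y_true, y_score) in per_class.items():
    print(f"AP[{cls}] = {average_precision_score(y_true, y_score):.3f}")
```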
The present invention may be a system, a method, and/or a computer program product. The computer program product and the system may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device or a computer cloud via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, Java, Python or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages, and scripting programming languages, such as Perl, JavaScript, or the like. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.
The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention (especially in the context of the following claims) are to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. Recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the invention.
Certain embodiments of this invention are described herein, including the best mode known to the inventors for carrying out the invention. It should be understood that the illustrated embodiments are exemplary only, and should not be taken as limiting the scope of the invention.
Benefits, other advantages, and solutions to problems have been described herein with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and element(s) that may cause benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of the claims. Reference to an element in the singular is not intended to mean “one and only one” unless explicitly so stated, but rather “one or more.” As used herein, the terms “comprise”, “comprising”, or a variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, no element described herein is required for practice unless expressly described as “essential” or “critical”. Moreover, those skilled in the art will recognize that changes and modifications may be made to the exemplary embodiments without departing from the scope of the present invention. Thus, different embodiments may include different combinations, arrangements and/or orders of elements or processing steps described herein, or as shown in the drawing figures. For example, the various components, elements or process steps may be configured in alternate ways depending upon the particular application or in consideration of cost. These and other changes or modifications are intended to be included within the scope of the present invention, as set forth in the following claims.
This application claims the benefit of priority under 35 U.S.C. § 119 to U.S. Provisional Application No. 63/494,571, filed on Apr. 6, 2023, the entire contents of which are incorporated herein by reference.