In certain contexts, intelligent processing and playback of recorded video is an important function to have in a video surveillance system. For example, a video surveillance system may include many cameras, each of which records video. The total amount of video recorded by those cameras, much of which is typically recorded concurrently, makes relying upon manual location and tracking of an object-of-interest that appears in the recorded video inefficient. Intelligent processing and playback of video, and in particular automated search functionality, may accordingly be used to increase the efficiency with which an object-of-interest can be identified using a video surveillance system.
In the accompanying figures similar or the same reference numerals may be repeated to indicate corresponding or analogous elements. These figures, together with the detailed description below, are incorporated in and form part of the specification and serve to further illustrate various embodiments of concepts that include the claimed invention, and to explain various principles and advantages of those embodiments.
Skilled artisans will appreciate that elements in the figures are illustrated for simplicity and clarity and have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be exaggerated relative to other elements to help improve understanding of embodiments of the present disclosure.
The apparatus and method components have been represented where appropriate by conventional symbols in the drawings, showing only those specific details that are pertinent to understanding the embodiments of the present disclosure so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
According to a first aspect, there is provided a method comprising: receiving search commencement input requesting that an appearance search for one or more objects-of-interest commence; in response to the search commencement input, searching one or more video recordings for the one or more objects-of-interest; and displaying, in conjunction with a map on a display, one or more appearance search results depicting the one or more objects-of-interest, wherein each of the appearance search results depicts the one or more objects-of-interest as captured by a camera at a time during the one or more video recordings, and is depicted in conjunction with the map at a location indicative of a geographical location of the camera.
At least one of the appearance search results may be a still image of one of the one or more objects-of-interest.
At least one of the appearance search results may be a video recording of one of the one or more objects-of-interest.
The appearance search results may appear in an order corresponding to a sequence in which the appearance search results appear in the one or more video recordings.
The appearance search results may appear at times proportional to when the appearance search results appear in the one or more video recordings.
The method may further comprise: receiving playback input indicating that the appearance search results are to appear, wherein the playback input comprises a playback speed at which the appearance search results are to appear; and only causing the appearance search results to appear once the playback input is received, wherein the times at which the appearance search results appear are adjusted in proportion to the playback speed.
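For illustration only, the playback-speed scaling described above may be sketched as follows. This is a minimal, hypothetical example; the function name and data layout are assumptions and do not correspond to any particular implementation.

```python
# Hypothetical sketch: schedule appearance search results so the delays
# between them are proportional to their timestamps in the recordings,
# scaled by a user-selected playback speed.

def schedule_results(timestamps_s, playback_speed):
    """Return the wall-clock delay (seconds) before each result appears.

    timestamps_s   -- sorted times (s) at which results appear in the video
    playback_speed -- e.g. 2.0 shows results twice as fast as real time
    """
    if playback_speed <= 0:
        raise ValueError("playback speed must be positive")
    start = timestamps_s[0]
    # Delay of each result relative to the first, compressed or stretched
    # in proportion to the playback speed.
    return [(t - start) / playback_speed for t in timestamps_s]
```

For example, results occurring at 10 s, 20 s, and 40 s of recorded video, played back at 2x speed, would appear 0 s, 5 s, and 15 s after playback begins.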
A path connecting sequentially appearing ones of the appearance search results may be displayed.
The method may further comprise: determining whether at least one of the appearance search results is located within a building; and if the at least one of the appearance search results is located within the building, determining at least one of an entrance and exit of the building. The path may pass through the at least one of an entrance and exit.
Searching the one or more video recordings may comprise searching for a single object-of-interest regardless of facets of the single object-of-interest.
The appearance search results may comprise the single object-of-interest, and the method may further comprise: receiving additional search commencement input indicating that a search is to be done for one or more objects-of-interest that share one or more facets of the single object-of-interest; in response to the additional search commencement input, searching the one or more video recordings for the one or more objects-of-interest that share the one or more facets of the single object-of-interest; and updating, on the display, the one or more appearance search results to depict the one or more objects-of-interest that share the one or more facets of the single object-of-interest.
The additional search commencement input may specify which of the one or more facets of the single object-of-interest are to be searched.
Searching the one or more video recordings may comprise searching for objects-of-interest comprising one or more facets of identical type and value.
The search commencement input may specify a descriptor and a tag of the one or more facets to be searched.
The appearance search results may comprise multiple objects-of-interest sharing one or more facets of identical descriptor and tag, and the method may further comprise: receiving additional search commencement input indicating that a search is to be done for a single object-of-interest comprising part of the appearance search results; in response to the additional search commencement input, searching the one or more video recordings for the single object-of-interest comprising part of the appearance search results regardless of facets of the single object-of-interest; and updating, on the display, the one or more appearance search results to depict the single object-of-interest comprising part of the appearance search results.
Each of the one or more facets may comprise age, gender, a type of clothing, a color of clothing, a pattern displayed on clothing, a hair color, a footwear color, or a clothing accessory.
Each of the one or more appearance search results may be associated with a confidence level, and the method may further comprise: receiving confidence level input specifying a minimum confidence level; and in response to the confidence level input, updating, on the display, the one or more appearance search results to depict only the one or more search results having a confidence level at or above the minimum confidence level.
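The confidence-level filtering described above may be sketched, purely for illustration, as a simple threshold over per-result confidence values (the dictionary layout is an assumption):

```python
def filter_by_confidence(results, min_confidence):
    """Keep only search results at or above the minimum confidence level.

    results        -- list of dicts, each with a "confidence" value in [0, 1]
    min_confidence -- threshold received as confidence level input
    """
    return [r for r in results if r["confidence"] >= min_confidence]
```

In response to new confidence level input, the display would be updated with the filtered list rather than the full result set.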
At least one of the appearance search results may be overlaid on the map.
According to an aspect, the one or more objects-of-interest may comprise a vehicle, and searching the one or more video recordings for the one or more objects-of-interest may comprise searching the one or more video recordings for a license plate of the vehicle.
According to another aspect, there is provided a system comprising: a display; an input device; a processor communicatively coupled to the display and the input device;
and a memory communicatively coupled to the processor and having stored thereon computer program code that is executable by the processor, wherein the computer program code, when executed by the processor, causes the processor to perform the method of any of the foregoing aspects or suitable combinations thereof.
According to another aspect, there is provided a non-transitory computer readable medium having stored thereon computer program code that is executable by a processor and that, when executed by the processor, causes the processor to perform the method of any of the foregoing aspects or suitable combinations thereof.
Each of the above-mentioned embodiments will be discussed in more detail below, starting with example system and device architectures of the system in which the embodiments may be practiced, followed by an illustration of processing blocks for achieving an improved technical method, device, and system for an appearance search using a map. Example embodiments are herein described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to example embodiments. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. The methods and processes set forth herein need not, in some embodiments, be performed in the exact sequence as shown and likewise various blocks may be performed in parallel rather than in sequence. Accordingly, the elements of methods and processes are referred to herein as “blocks” rather than “steps.”
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational blocks to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide blocks for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. It is contemplated that any part of any aspect or embodiment discussed in this specification can be implemented or combined with any part of any other aspect or embodiment discussed in this specification.
Further advantages and features consistent with this disclosure will be set forth in the following detailed description, with reference to the figures.
Reference is now made to
The computer terminal 104 communicates with the server system 108 through one or more networks. These networks can include the Internet, or one or more other public/private networks coupled together by network switches or other communication elements. The network(s) could be of the form of, for example, client-server networks, peer-to-peer networks, etc. Data connections between the computer terminal 104 and the server system 108 can be any number of known arrangements for accessing a data communications network, such as, for example, dial-up Serial Line Interface Protocol/Point-to-Point Protocol (SLIP/PPP), Integrated Services Digital Network (ISDN), dedicated leased line service, broadband (e.g. cable) access, Digital Subscriber Line (DSL), Asynchronous Transfer Mode (ATM), Frame Relay, or other known access techniques (for example, radio frequency (RF) links). In at least one example embodiment, the computer terminal 104 and the server system 108 are within the same Local Area Network (LAN).
The computer terminal 104 includes at least one processor 112 that controls the overall operation of the computer terminal. The processor 112 interacts with various subsystems such as, for example, input devices 114 (such as a selected one or more of a keyboard, mouse, touch pad, roller ball and voice control means, for example), random access memory (RAM) 116, non-volatile storage 120, display controller subsystem 124 and other subsystems (not shown). The display controller subsystem 124 interacts with display 126 and it renders graphics and/or text upon the display 126.
Still with reference to the computer terminal 104 of the surveillance system 100, operating system 140 and various software applications used by the processor 112 are stored in the non-volatile storage 120. The non-volatile storage 120 is, for example, one or more hard disks, solid state drives, or some other suitable form of computer readable medium that retains recorded information after the computer terminal 104 is turned off.
Regarding the operating system 140, this includes software that manages computer hardware and software resources of the computer terminal 104 and provides common services for computer programs. Also, those skilled in the art will appreciate that the operating system 140, client-side video review application 144, and other applications 152, or parts thereof, may be temporarily loaded into a volatile store such as the RAM 116. The processor 112, in addition to its operating system functions, can enable execution of the various software applications on the computer terminal 104.
More details of the video review application 144 are shown in the block diagram of
The video review application 144 also includes the search session manager module 204 mentioned above. The search session manager module 204 provides a communications interface between the search UI module 202 and a query manager module 164 (
Besides the query manager module 164, the server system 108 includes several software components for carrying out other functions of the server system 108. For example, the server system 108 includes a media server module 168. The media server module 168 handles client requests related to storage and retrieval of video taken by video cameras 169 in the surveillance system 100. The server system 108 also includes an analytics engine module 172. The analytics engine module 172 can, in some examples, be any suitable known, commercially available software that carries out mathematical calculations (and other operations) to attempt computerized matching of the same individuals or objects as between different portions of video recordings (or as between any reference image and video compared to the reference image). For example, the analytics engine module 172 can, in one specific example, be a software component of the Avigilon Control Center™ server software sold by Avigilon Corporation. In some examples the analytics engine module 172 can use the descriptive characteristics of the person's or object's appearance. Examples of these characteristics include the person's or object's shape, size, textures and color.
The server system 108 also includes a number of other software components 176. These other software components will vary depending on the requirements of the server system 108 within the overall system. As just one example, the other software components 176 might include special test and debugging software, or software to facilitate version updating of modules within the server system 108. The server system 108 also includes one or more data stores 190. In some examples, the data store 190 comprises one or more databases 191 which facilitate the organized storing of recorded video.
Regarding the video cameras 169, each of these includes a camera module 198.
In some examples, the camera module 198 includes one or more specialized integrated circuit chips to facilitate processing and encoding of video before it is even received by the server system 108. For instance, the specialized integrated circuit chip may be a System-on-Chip (SoC) solution including both an encoder and a Central Processing Unit (CPU) and/or Vision Processing Unit (VPU). These permit the camera module 198 to carry out the processing and encoding functions. Also, in some examples, part of the processing functions of the camera module 198 includes creating metadata for recorded video. For instance, metadata may be generated relating to one or more foreground areas that the camera module 198 has detected, and the metadata may define the location and reference coordinates of the foreground visual object within the image frame. For example, the location metadata may be further used to generate a bounding box, typically rectangular in shape, outlining the detected foreground visual object. The image within the bounding box may be extracted for inclusion in metadata. The extracted image may alternately be smaller than what was in the bounding box or may be larger than what was in the bounding box. The size of the image being extracted can also be close to, but outside of, the actual boundaries of a detected object.
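Purely as an illustrative sketch of the metadata just described, a per-detection record might pair the bounding box with an extracted image chip whose bounds are padded slightly beyond the box. The field names and padding scheme below are assumptions, not a documented format:

```python
# Illustrative per-frame metadata a camera module might emit for a detected
# foreground object: a bounding box plus chip bounds that are close to, but
# possibly outside of, the object's actual boundaries.

def make_detection_metadata(frame_w, frame_h, box, pad=4):
    """box is (x, y, w, h) in pixel coordinates of the detected object."""
    x, y, w, h = box
    # Pad the chip bounds, clamped to the frame edges.
    cx0 = max(0, x - pad)
    cy0 = max(0, y - pad)
    cx1 = min(frame_w, x + w + pad)
    cy1 = min(frame_h, y + h + pad)
    return {
        "bounding_box": {"x": x, "y": y, "width": w, "height": h},
        "chip": {"x": cx0, "y": cy0, "width": cx1 - cx0, "height": cy1 - cy0},
    }
```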
In some examples, the camera module 198 includes a number of submodules for video analytics such as, for instance, an object detection submodule, an instantaneous object classification submodule, a temporal object classification submodule and an object tracking submodule. Regarding the object detection submodule, such a submodule can be provided for detecting objects appearing in the field of view of the camera 169. The object detection submodule may employ any of various object detection methods understood by those skilled in the art such as, for example, motion detection and/or blob detection.
Regarding the object tracking submodule that may form part of the camera module 198, this may be operatively coupled to both the object detection submodule and the temporal object classification submodule. The object tracking submodule may be included for the purpose of temporally associating instances of an object detected by the object detection submodule. The object tracking submodule may also generate metadata corresponding to visual objects it tracks.
Regarding the instantaneous object classification submodule that may form part of the camera module 198, this may be operatively coupled to the object detection submodule and employed to determine a visual object's type (such as, for example, human, vehicle or animal) based upon a single instance of the object. The input to the instantaneous object classification submodule may optionally be a sub-region of an image in which the visual object-of-interest is located rather than the entire image frame.
Regarding the temporal object classification submodule that may form part of the camera module 198, this may be operatively coupled to the instantaneous object classification submodule and employed to maintain class information of an object over a period of time. The temporal object classification submodule may average the instantaneous class information of an object provided by the instantaneous classification submodule over a period of time during the lifetime of the object. In other words, the temporal object classification submodule may determine a type of an object based on its appearance in multiple frames. For example, gait analysis of the way a person walks can be useful to classify a person, or analysis of the legs of a person can be useful to classify a cyclist. The temporal object classification submodule may combine information regarding the trajectory of an object (e.g. whether the trajectory is smooth or chaotic, whether the object is moving or motionless) and confidence of the classifications made by the instantaneous object classification submodule averaged over multiple frames. For example, determined classification confidence values may be adjusted based on the smoothness of trajectory of the object. The temporal object classification submodule may assign an object to an unknown class until the visual object has been classified by the instantaneous object classification submodule a sufficient number of times and a predetermined number of statistics have been gathered. In classifying an object, the temporal object classification submodule may also take into account how long the object has been in the field of view. The temporal object classification submodule may make a final determination about the class of an object based on the information described above. The temporal object classification submodule may also use a hysteresis approach for changing the class of an object.
More specifically, a threshold may be set for transitioning the classification of an object from unknown to a definite class, and that threshold may be larger than a threshold for the opposite transition (for example, from a human to unknown). The temporal object classification submodule may aggregate the classifications made by the instantaneous object classification submodule.
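As an illustrative sketch only, the hysteresis just described can be expressed with two thresholds, where promotion out of the unknown class requires higher confidence than demotion back into it. The threshold values and function below are assumptions chosen for illustration:

```python
# Minimal sketch of classification hysteresis: a larger threshold is needed
# to move an object from "unknown" to a definite class than to fall back.
PROMOTE_THRESHOLD = 0.8   # unknown -> definite class
DEMOTE_THRESHOLD = 0.4    # definite class -> unknown

def update_class(current_class, candidate_class, confidence):
    if current_class == "unknown":
        # Promote only on high confidence.
        return candidate_class if confidence >= PROMOTE_THRESHOLD else "unknown"
    # Already classified: only fall back to unknown on low confidence.
    return current_class if confidence >= DEMOTE_THRESHOLD else "unknown"
```

The gap between the two thresholds prevents an object's class from flickering when its instantaneous confidence hovers near a single cut-off.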
In accordance with at least some examples, a feature vector is an n-dimensional vector of numerical features (numbers) that represent an image of an object processable by computers. By comparing the feature vector of a first image of one object with the feature vector of a second image, a computer implementable process may determine whether the first image and the second image are images of the same object.
Similarity calculation can be just an extension of the above. Specifically, by calculating the Euclidean distance between two feature vectors of two images captured by one or more of the cameras 169, a computer implementable process can determine a similarity score to indicate how similar the two images may be.
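For illustration, the distance-based similarity just described may be sketched as below. The mapping from distance to a bounded similarity score is one possible convention, chosen here as an assumption:

```python
import math

def euclidean_distance(v1, v2):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def similarity_score(v1, v2):
    # Map distance into (0, 1]: identical vectors score 1.0, and the
    # score falls toward 0 as the vectors grow farther apart.
    return 1.0 / (1.0 + euclidean_distance(v1, v2))
```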
In some examples, the camera module 198 is able to detect humans and extract images of humans with respective bounding boxes outlining the human objects for inclusion in metadata which along with the associated video may be transmitted to the server system 108. At the server system 108, the media server module 168 can process extracted images and generate signatures (e.g. feature vectors) to represent objects. In this example implementation, the media server module 168 uses a learning machine to process the bounding boxes to generate the feature vectors or signatures of the images of the objects captured in the video. The learning machine is for example a neural network such as a convolutional neural network (CNN) running on a graphics processing unit (GPU). The CNN may be trained using training datasets containing millions of pairs of similar and dissimilar images. The CNN, for example, is a Siamese network architecture trained with a contrastive loss function to train the neural networks. An example of a Siamese network is described in Bromley, Jane, et al. “Signature verification using a “Siamese” time delay neural network.” International Journal of Pattern Recognition and Artificial Intelligence 7.04 (1993): 669-688, the contents of which is hereby incorporated by reference in its entirety.
The media server module 168 deploys a trained model in what is known as batch learning, where all of the training is done before it is used in the appearance search system. The trained model, in this embodiment, is a CNN learning model with one possible set of parameters. There is, practically speaking, an infinite number of possible sets of parameters for a given learning model. Optimization methods (such as stochastic gradient descent), and numerical gradient computation methods (such as backpropagation) may be used to find the set of parameters that minimize the objective function (also known as a loss function). A contrastive loss function may be used as the objective function. A contrastive loss function is defined such that it takes high values when the current trained model is less accurate (assigns high distance to similar pairs, or low distance to dissimilar pairs), and low values when the current trained model is more accurate (assigns low distance to similar pairs, and high distance to dissimilar pairs). The training process is thus reduced to a minimization problem. The process of finding the most accurate model is the training process; the resulting model with the set of parameters is the trained model; and the set of parameters is not changed once it is deployed onto the appearance search system.
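The standard contrastive loss for a single pair may be sketched as follows. This is a textbook formulation offered for illustration; the margin value and function signature are assumptions, not a description of the deployed model:

```python
def contrastive_loss(distance, is_similar_pair, margin=1.0):
    """Contrastive loss for one pair of feature vectors.

    Low when similar pairs are close and dissimilar pairs are at least
    `margin` apart; high otherwise, matching the behavior described above.
    """
    if is_similar_pair:
        # Penalize any distance between images of the same object.
        return 0.5 * distance ** 2
    # Penalize dissimilar pairs only when they are closer than the margin.
    return 0.5 * max(0.0, margin - distance) ** 2
```

Summed over a training set of similar and dissimilar pairs, minimizing this quantity drives the learned feature vectors of the same object together and those of different objects apart.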
In at least some alternative example embodiments, the media server module 168 may determine feature vectors by implementing a learning machine using what is known as online machine learning algorithms. The media server module 168 deploys the learning machine with an initial set of parameters; however, the appearance search system keeps updating the parameters of the model based on some source of truth (for example, user feedback in the selection of the images of the objects of interest). Such learning machines also include other types of neural networks as well as convolutional neural networks.
In accordance with at least some examples, storage of feature vectors within the surveillance system 100 is contemplated. For instance, feature vectors may be indexed and stored in the database 191 with respective video. The feature vectors may also be associated with reference coordinates to where extracted images of respective objects are located in respective video. Storing may include storing video with, for example, time stamps, camera identifications, metadata with the feature vectors and reference coordinates, etc.
Referring now to
Referring now to
The image frame 306 of the selected video recording occupies the entirety of the top-right quadrant of the page 300. The frame 306 depicts a scene in which multiple persons are present. The server system 108 automatically identifies persons appearing in the scene that may be the subject of a search, and thus who are potential persons-of-interest 308 to the user, and highlights each of those persons by enclosing all or part of each in a bounding box 310. In
In
Immediately to the left of the image frame 306 is a bookmark list 302 showing all of the user's bookmarks, with a selected bookmark 304 corresponding to the image frame 306. Immediately below the bookmark list 302 are bookmark options 318 permitting the user to perform actions such as to lock or unlock any one or more of the bookmarks to prevent them from being changed, to permit them to be changed, to export any one or more of the bookmarks, and to delete any one or more of the bookmarks.
Immediately below the bookmark options 318 and bordering a bottom-left edge of the page 300 are video control buttons 322 permitting the user to play, pause, fast forward, and rewind the selected video recording. Immediately to the right of the video control buttons 322 is a video time indicator 324, displaying the date and time corresponding to the image frame 306. Extending along a majority of the bottom edge of the page 300 is a timeline 320 permitting the user to scroll through the selected video recording and through the video collectively represented by the collection of video recordings. The user may, for example, select a cursor 326 located along the timeline 320 and move the cursor 326 along the timeline to scroll to the time in the video corresponding to the cursor's 326 location. As discussed in further detail below in respect of
Referring now to
While video is being recorded, at least one of the cameras 169 and server system 108 identify, in real-time, when people, each of whom is a potential person-of-interest 308, are being recorded and, for those people, attempt to identify each of their faces. The server system 108 generates signatures based on the faces (when identified) and bodies of the people who are identified, as described above. The server system 108 stores information on whether faces were identified and the signatures as metadata together with the video recordings.
In response to the search commencement user input the user provides using the context menu 312 of
In one example embodiment, the face search is done by searching the collection of video recordings for faces. Once a face is identified, the coordinates of a bounding box that bounds the face (e.g., in terms of an (x,y) coordinate identifying one corner of the box and width and height of the box) and an estimation of the head pose (e.g., in terms of yaw, pitch, and roll) are generated. A feature vector may be generated that characterizes those faces using any one or more metrics, as discussed above.
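The face record just described may be sketched, for illustration only, as a corner-plus-size bounding box paired with a yaw/pitch/roll head-pose estimate. The field names below are assumptions, not a real schema:

```python
# Hypothetical record for one detected face: bounding box given by one
# corner plus width and height, and head pose given by yaw, pitch, roll.

def make_face_record(box, pose, frame_size):
    """box = (x, y, w, h); pose = (yaw, pitch, roll); frame_size = (W, H)."""
    x, y, w, h = box
    W, H = frame_size
    # Reject boxes that fall outside the image frame.
    if not (0 <= x and 0 <= y and x + w <= W and y + h <= H):
        raise ValueError("bounding box falls outside the frame")
    yaw, pitch, roll = pose
    return {
        "box": {"x": x, "y": y, "width": w, "height": h},
        "pose": {"yaw": yaw, "pitch": pitch, "roll": roll},
    }
```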
In at least one example embodiment, the cameras 169 generate the metadata and associated feature vectors in or nearly in real-time, and the server system 108 subsequently assesses face similarity using those feature vectors. However, in at least one alternative example embodiment the functionality performed by the cameras 169 and server system 108 may be different. For example, functionality may be divided between the server system 108 and cameras 169 in a manner different than as described above. Alternatively, one of the server system 108 and the cameras 169 may generate the feature vectors and assess face similarity.
In
In
In the example embodiment shown in
In an alternative embodiment, the image search results 408 may be displayed only in order of likelihood of correspondence to the person-of-interest.
In the depicted embodiment, all of the search results 408 satisfy a minimum likelihood that they correspond to the person-of-interest 308; for example, in certain embodiments the video review application 144 only displays search results 408 that have at least a 25% likelihood (“match likelihood threshold”) of corresponding to the person-of-interest 308. However, in certain other embodiments, the video review application 144 may display all search results 408 without taking into account a match likelihood threshold, or may use a non-zero match likelihood threshold that is other than 25%.
In
Located immediately below the image frame 306 of the selected video recording are playback controls 426 that allow the user to play and pause the selected video recording. Located immediately above the horizontal scroll bar 418 beneath the image search results 408 is a load more results button 424, which permits the user to prompt the video review application 144 for additional search results 408. For example, in one embodiment, the video review application 144 may initially deliver at most a certain number of search results 408 even if additional results 408 exceed the match likelihood threshold. In that example, the user may request another tranche of results 408 that exceed the match likelihood threshold by selecting the load more results button 424. In certain other embodiments, the video review application 144 may be configured to display additional results 408 in response to the user's selecting the button 424 even if those additional results 408 are below the match likelihood threshold.
Located below the body and face thumbnails 404, 402 is a filter toggle 422 that permits the user to restrict the image search results 408 to those that the user has confirmed correspond to the person-of-interest 308 by having provided match confirmation user input to the video review application 144, as discussed further below.
Spanning the width of the page 300 and located below the body and face thumbnails 404, 402, search results 408, and image frame 306 is an appearance likelihood plot for the person-of-interest 308 in the form of a bar graph 412. The bar graph 412 depicts the likelihood that the person-of-interest 308 appears in the collection of video recordings over a given time span. In
To determine the bar graph 412, the server system 108 determines, for each of the time intervals, a likelihood that the person-of-interest 308 appears in the collection of video recordings for the time interval, and then represents that likelihood as the height of the bar 414 for that time interval. In this example embodiment, the server system 108 determines that likelihood as a maximum likelihood that the person-of-interest 308 appears in any one of the collection of video recordings for that time interval. In different embodiments, that likelihood may be determined differently. For example, in one different embodiment the server system 108 determines that likelihood as an average likelihood that the person-of-interest 308 appears in the image search results 408 that satisfy the match likelihood threshold.
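The maximum-likelihood-per-interval computation just described may be sketched as follows, for illustration only; the data layout is an assumption:

```python
# Sketch of the per-interval computation above: for each time interval,
# take the maximum likelihood that the person-of-interest appears in any
# of the recordings during that interval.

def appearance_plot(detections, interval_s, span_s):
    """detections: list of (timestamp_s, likelihood) pairs.

    Returns one bar height per interval: the maximum likelihood of any
    detection falling in that interval, or 0.0 if there are none.
    """
    n_bars = span_s // interval_s
    bars = [0.0] * n_bars
    for t, likelihood in detections:
        i = int(t // interval_s)
        if 0 <= i < n_bars:
            bars[i] = max(bars[i], likelihood)
    return bars
```

A different embodiment, as noted above, could replace `max` with an average over the qualifying detections in each interval.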
While in the depicted embodiment the appearance likelihood plot is shown as comprising the bar graph 412, in different embodiments (not depicted) the plot may take different forms. For example, the plot in different embodiments may include a line graph, with different points on the line graph corresponding to appearance likelihood at different time intervals, or use different colors to indicate different appearance likelihoods.
The video review application 144 permits the user to provide match confirmation user input regarding whether at least one of the image search results 408 depicts the person-of-interest 308. The user may provide the match confirmation user input by, for example, selecting one of the image search results 408 to bring up a context menu (not shown) allowing the user to confirm whether that search result 408 depicts the person-of-interest 308. In response to the match confirmation user input, the server system 108 in the depicted embodiment determines whether any match likelihoods change and, accordingly, whether positioning of the image search results 408 is to be changed in response to the match confirmation user input. For example, in one embodiment when the user confirms one of the results 408 is a match, the server system 108 may use that confirmed image as a reference for comparisons when performing one or both of face and body searches. When the positioning of the image search results is to be changed, the video review application 144 updates the positioning of the image search results 408 in response to the match confirmation user input. For example, the video review application 144 may delete from the image search results 408 any result the user indicates does not contain the person-of-interest 308 and rearrange the remaining results 408 accordingly. In one example embodiment, one or both of the face and body thumbnails 402, 404 may change in response to the match confirmation user input. In another example embodiment, if the server system 108 is initially unable to identify any faces of the person-of-interest 308 and the video review application 144 accordingly does not display the face thumbnail 402, the server system 108 may be able to identify the person-of-interest's 308 face after receiving match confirmation user input and the video review application 144 may then show the face thumbnail 402.
When the match confirmation user input indicates that any one of the selected image search results 408 depicts the person-of-interest 308, the video review application 144 displays a third indicator 410c over each of the selected image results 408 that the user confirms corresponds to the person-of-interest 308. As shown in the user interface page 300 of
The page 300 of
Referring now to
Referring now to
The method 900 starts at block 902, following which the processor 112 proceeds to block 904 and concurrently displays, on the display 126, the face thumbnail 402, body thumbnail 404, and the image search results 408 of the person-of-interest 308.
The processor 112 proceeds to block 906 where it receives some form of user input; example forms of user input are the match confirmation user input and search commencement user input described above. Additionally or alternatively, the user input may comprise another type of user input, such as any one or more of interaction with the playback controls 426, the bar graph 412, and the timeline 320.
Following receiving the user input, the processor proceeds to block 908 where it determines whether the server system 108 is required to process the user input received at block 906. For example, if the user input is scrolling through the image search results 408 using the scroll bars 418, then the server system 108 is not required and the processor 112 proceeds directly to block 914 where it processes the user input itself. When processing input in the form of scrolling, the processor 112 determines how to update the array of image search results 408 in response to the scrolling and then proceeds to block 916 where it actually updates the display 126 accordingly.
In certain examples, the processor 112 determines that the server system 108 is required to properly process the user input. For example, the user input may include search commencement user input, which results in the server system 108 commencing a new search of the collection of video recordings for the person-of-interest 308. In that example, the processor 112 proceeds to block 910 where it sends a request to the server system 108 to process the search commencement user input in the form, for example, of a remote procedure call. At block 912 the processor 112 receives the result from the server system 108, which may include an updated array of image search results 408 and associated images.
The processor 112 subsequently proceeds to block 914 where it determines how to update the display 126 in view of the updated search results 408 and images received from the server system 108 at block 912, and subsequently proceeds to block 916 to actually update the display 126.
Regardless of whether the processor 112 relies on the server system 108 to perform any operations at blocks 910 and 912, a reference herein to the processor 112 or video review application 144 performing an operation includes an operation that the processor 112 or video review application 144 performs with assistance from the server system 108, and an operation that the processor 112 or video review application 144 performs without assistance from the server system 108.
After completing block 916, regardless of whether the processor 112 communicated with the server system 108 in response to the user input, the processor 112 proceeds to block 918 where the method 900 ends. The processor 112 may repeat the method 900 as desired, such as by starting the method 900 again at block 902 or at block 906.
Facet Search
In at least some example embodiments, the methods, systems, and techniques as described herein are adapted as described further below to search for an object-of-interest. An object-of-interest may comprise the person-of-interest 308 described above in respect of
The server system 108 in at least some example embodiments saves the facet in storage 190 as a data structure comprising a “descriptor” and a “tag”. The facet descriptor may comprise a text string describing the type of facet, while the facet tag may comprise a value indicating the nature of that facet. For example, when the facet is hair color, the facet descriptor may be “hair color” and the facet tag may be “brown” or another color drawn from a list of colors. Similarly, when the facet is a type of clothing, the facet descriptor may be “clothing type” and the facet tag may be “jacket” or another clothing type drawn from a list of clothing types.
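A minimal sketch of such a descriptor/tag data structure might look like the following; the helper and field names are assumptions, not taken from the specification:

```javascript
// Sketch of the facet data structure described above: a "descriptor"
// naming the facet type and a "tag" giving its value.
function makeFacet(descriptor, tag) {
  return { descriptor: descriptor, tag: tag };
}

var hairFacet = makeFacet("hair color", "brown");
var clothingFacet = makeFacet("clothing type", "jacket");
```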
In at least some example embodiments and as described in respect of
Referring now to
After selecting “Appearances” in
The facet selectors 1010, 1016, 1018 allow the user to adjust any one or more of the person-of-interest's 308 gender (selected in
In at least some different example embodiments (not depicted), the user interface may differ from that which is depicted. For example, instead of the text-based drop-down menus 1020a,b depicted in
In response to the facet search commencement user input that the user provides by selecting the search button 1006, the server system 108 searches one or more of the video recordings for the facets. The server system 108 may perform the searching using a suitably trained artificial neural network, such as a convolutional neural network as described above for the body/face search. The server system 108 displays, on the display, facet image search results depicting the facets, with the facet image search results being selected from the one or more video recordings that were searched. In at least the depicted example embodiment, the facet image search results depict the facet in conjunction with a common type of object-of-interest common to the image search results.
Each of the entries in the searched facet list 1025 displays an “X” that is user selectable, and that when selected by the user causes that entry in the searched facet list 1025 to disappear. Removing a facet from the searched facet list 1025 in this manner represents updated facet search commencement user input, and causes the server system 108 to update the facet image search results by searching for the updated list of facets. The results of this updated search are displayed in the n×m array of image search results 408. In at least some example embodiments, the act of removing a facet from the searched facet list 1025 in this manner is implemented by the server system 108 deleting the contents of a tag associated with the removed facet.
Below the searched facet list 1025 is a series of menus 1026 allowing the user to further revise the list of facets to be searched by adding or removing facets in a manner analogous to that described in respect of the facet search menu 1004 of
The user may commence a body/face search directly from the page 300 of
In response to that object-of-interest search commencement user input, the server system 108 searches the one or more video recordings for the object-of-interest. In at least some example embodiments, the search is not restricted to the one or more video recordings from which the facet image search results were selected; for example, the server system 108 may search the same video recordings that were searched when performing the facet search. In at least some other example embodiments, the one or more video recordings that are searched are the one or more video recordings from which the facet image search results were selected, and the object-of-interest search results are selected from those one or more video recordings. After the server system 108 performs the object-of-interest search, it displays, on the display, the object-of-interest search results. In at least some of those example embodiments in which the object-of-interest search is done on the video recordings that were also searched when performing the facet search, the object-of-interest search results depict the object-of-interest and the facet. The object-of-interest search results are depicted in the user interface page 300 of
The object-of-interest search described immediately above is done after one or more facet searches. In at least some example embodiments, the object-of-interest search may be done before a facet search is done. For example, a body/face search may be done, and those image search results displayed, in accordance with the embodiments of
Referring now to
In at least some example embodiments, the server system 108 performs a facet search immediately after receiving queries of the type depicted in
The facet search as described above may be performed with an artificial neural network trained as described below. In at least some example embodiments, including the embodiments described below, the artificial neural network comprises a convolutional neural network.
In at least some example embodiments, training images are used to train the convolutional neural network. The user generates a facet image training set that comprises the training images by, for example, selecting images that depict a common type of object-of-interest shown in conjunction with a common type of facet. For example, in at least some example embodiments the server system 108 displays a collection of images to the user, and the user selects which of those images depict a type of facet that the user wishes to train the server system 108 to recognize. The server system 108 may, for example, show the user a set of potential training images, of which a subset depict a person (the object) having brown hair (the facet); the user then selects only those images showing a person with brown hair as the training images comprising the training set. Different training images may show different people, although all of the training images show a common type of object in conjunction with a common type of facet. The training images may comprise image chips derived from images captured by one of the cameras 169, where a “chip” is a region corresponding to a portion of a frame of a selected video recording, such as that portion within a bounding box 310.
Once the facet image training set is generated, it is used to train the artificial neural network to classify the type of facet depicted in the training images comprising the set when a sample image comprising that type of facet is input to the network. An example of a “sample image” is an image comprising part of one of the video recordings searched after the network has been trained, such as in the facet search described above. During training, optimization methods (such as stochastic gradient descent) and numerical gradient computation methods (such as backpropagation) are used to find the set of parameters that minimize the objective function (also known as a loss function). A cross entropy function is used as the objective function in the depicted example embodiments. This function is defined such that it takes high values when the current trained model is less accurate (i.e., incorrectly classifies facets), and low values when the current trained model is more accurate (i.e., correctly classifies facets). The training process is thus reduced to a minimization problem. The process of finding the most accurate model is the training process, the resulting model together with its set of parameters is the trained model, and that set of parameters is not changed once the model is deployed. While in some example embodiments the user generates the training set, in other example embodiments a training set is provided to the artificial neural network for training. For example, a third party may provide a training set, and the user may then provide that training set to the artificial neural network.
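The training procedure described above can be illustrated, in highly simplified form, with a one-parameter logistic model standing in for the convolutional network; the model, data, and learning-rate choices below are illustrative assumptions only:

```javascript
// Highly simplified sketch of the training loop described above.
// Stochastic gradient descent minimizes a cross-entropy objective; a
// real facet classifier would be a convolutional network, not a scalar
// logistic model.
function sigmoid(z) { return 1 / (1 + Math.exp(-z)); }

function crossEntropy(p, label) {
  // High when the model is inaccurate, low when it is accurate
  return -(label * Math.log(p) + (1 - label) * Math.log(1 - p));
}

function train(samples, epochs, learningRate) {
  var w = 0, b = 0;
  for (var e = 0; e < epochs; e++) {
    samples.forEach(function (s) {
      var p = sigmoid(w * s.x + b);
      var grad = p - s.label; // d(crossEntropy)/dz for the sigmoid model
      w -= learningRate * grad * s.x;
      b -= learningRate * grad;
    });
  }
  return { w: w, b: b }; // the fixed parameter set of the trained model
}
```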
During training, the server system 108 records state data corresponding to different states of the convolutional neural network during the training. In at least some example embodiments, the state data is indexed to index data such as at least one of the common type of facet depicted in the training images, identification credentials of a user who is performing the training, the training images, cameras used to capture the training images, timestamps of the training images, and a time when the training commenced. This allows the state of the convolutional neural network to be rolled back in response to a user request. For example, the server system 108 in at least some example embodiments receives index data corresponding to an earlier state of the network, and reverts to that earlier state by loading the state data indexed to the index data for that earlier state. This allows network training to be undone if the user deems it to have been unsuccessful. For example, if the user determines that a particular type of facet is now irrelevant, the network may be reverted to an earlier state prior to when it had been trained to classify that type of facet, thereby potentially saving computational resources. Similarly, a reversion to an earlier network state may be desirable based on time, in which case the index data may comprise the time prior to when undesirable training started, or on operator credentials in order to effectively eliminate poor training done by another user.
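The indexing and reversion scheme described above might be sketched as follows; the class, key names, and matching rule are assumptions for illustration:

```javascript
// Sketch: network state is recorded during training and indexed so an
// earlier state can be reloaded later.
function TrainingHistory() {
  this.checkpoints = []; // [{ index: {...}, state: ... }]
}

TrainingHistory.prototype.record = function (indexData, stateData) {
  this.checkpoints.push({ index: indexData, state: stateData });
};

// Revert to the most recent state whose index matches the query, e.g.
// { facetType: "hair color" } or { user: "alice" }.
TrainingHistory.prototype.revert = function (query) {
  for (var i = this.checkpoints.length - 1; i >= 0; i--) {
    var c = this.checkpoints[i];
    var matches = Object.keys(query).every(function (k) {
      return c.index[k] === query[k];
    });
    if (matches) return c.state;
  }
  return null; // no matching earlier state recorded
};
```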
Certain adaptations and modifications of the described embodiments can be made. For example, with respect to either the client-side video review application 144 (
Map Integration
In the example embodiments of
Referring now to
The user interface page 300 of
In response to the search commencement user input, the server system 108 searches one or more video recordings for the one or more objects-of-interest. After the server system 108 has performed the appearance search, it causes to be displayed, in conjunction with the map 1400 on the display 126, one or more of the image search results 408 depicting the one or more objects-of-interest. Each of the image search results 408 depicts the one or more objects-of-interest as captured by a camera 169 at a time during the one or more video recordings, and is depicted in conjunction with the map 1400 at a location indicative of a geographical location of the camera 169.
In the particular example embodiment of
While the locations 1502 are indicated on the map using circular icons, in at least some different example embodiments different icons may be used. For example, each of the icons may depict a camera 169. In order to populate the locations 1502 on the map 1400, the user may drag and drop icons representing each of the cameras 169 onto the map 1400 at their respective locations 1502, and also orient those icons so that they correspond to the orientations of the actual cameras 169 deployed in the field.
Referring now to
Referring now to
The confidence selector 1504 is an example type of confidence level input specifying that only results 408a-f that are at or above that minimum confidence level are to be displayed. While a single “high” confidence level is used in
In at least some example embodiments, the search UI module 202 may update the page 300 over time to graphically indicate to the user when the search results 408 were obtained relative to each other; that is, the search results 408 may appear in an order corresponding to a sequence in which the results appear in the one or more video recordings. This may permit the user to, for example, track the path the person-of-interest 308 is traveling over time.
Each of the pages 300 of
As described above, the path 1506 may comprise a series of linear line segments that connect locations 1502 corresponding to sequentially obtained search results 408. The path 1506 may be determined differently in at least some example embodiments; for example, multiple search results 408 may be averaged, and a line segment may terminate at a location on the map 1400 corresponding to that average as opposed to any single one of the camera locations 1502.
More particularly, the user interface page 300 of
The search results 408b-d are respectively returned with metadata that describes the time at which the search results 408b-d are obtained, the camera 169 used to obtain the search results 408b-d, and a confidence level associated with the search results 408b-d. In at least some example embodiments, a search result 408b-d may only be returned and used to determine the averaged location 1802 if it has a confidence level greater than or equal to a minimum confidence threshold (e.g. 80%). In the depicted example embodiment, the second through fourth results 408b-d are concurrently obtained by the cameras 169 at those respective locations 1502b-d, and consequently the search UI module 202 averages them to determine a single location on the map 1400 at which to place the person-of-interest 308 at that time. However, in at least some different example embodiments, the search UI module 202 may average two or more of the search results 408b-d even if they do not overlap in time. For example, the search UI module 202 may average any two of the search results 408b-d that are not concurrent but that occur within a certain time of each other.
When determining the averaged location 1802 for any particular time, the search UI module 202 determines an average position and confidence of the search results 408b-d being averaged, and a total number of search results 408b-d that are averaged. The average position may comprise an average horizontal position (longitude) and an average vertical position (latitude) on the map 1400. Metadata such as numerical longitude and latitude positions, the number of search results 408b-d averaged to determine the averaged location 1802, and the averaged weight of the averaged location 1802 may be accessed by the user via the user interface page 300, such as by invoking the context menu 312. In at least some different example embodiments, the averaged location 1802 may be determined as a weighted average of the locations 1502b-d of the search results 408b-d, with the weights used in determining the weighted average being the confidence levels of the search results 408b-d. In still other example embodiments, one or more of the search results 408b-d may not be associated with a confidence value at all, and the averaged location 1802 may lack any associated metadata describing a confidence level.
In at least some example embodiments, the cameras 169 that generate the search results 408 may differ in at least one of frame rate and resolution. Without compensating for differences in frame rate and resolution between different cameras 169, the averaged location 1802 generated using the search results 408 from those different cameras 169 may be temporally or spatially biased. For example, if the averaged location 1802 is determined by averaging the locations 1502b-d associated with three different cameras 169 generating different search results 408b-d and the camera 169 at one of the locations 1502b has a frame rate N times greater than the cameras 169 at the other two locations 1502c,d, then an average over a certain period of time may be determined using N times more images from the camera 169 with the higher frame rate than either of the other cameras 169. To compensate for this, the search UI module 202 may decimate the number of images generated from the camera 169 with the higher frame rate by a certain factor (e.g., N) before determining the averaged location 1802. Additionally or alternatively, the search UI module 202 may generate a weighted average (e.g., by weighting the contribution from the camera 169 with the higher frame rate by 1/N) to perform temporal compensation.
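The 1/N weighting just described can be sketched as follows; the sample objects and their `lat`, `lng`, and `frameRate` fields are illustrative assumptions:

```javascript
// Sketch of temporal compensation: samples from a camera whose frame
// rate is N times the base rate are weighted by 1/N so that the faster
// camera does not dominate the average.
function frameRateCompensatedAverage(samples, baseFrameRate) {
  var sumLat = 0, sumLng = 0, totalWeight = 0;
  samples.forEach(function (s) {
    var weight = baseFrameRate / s.frameRate; // 1/N for an N-times-faster camera
    sumLat += s.lat * weight;
    sumLng += s.lng * weight;
    totalWeight += weight;
  });
  return { lat: sumLat / totalWeight, lng: sumLng / totalWeight };
}
```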
As another example, if the averaged location 1802 is determined by averaging the locations 1502b-d associated with three different cameras 169 generating different search results 408b-d and the camera 169 at one of the locations 1502b has a higher resolution than the other cameras 169, the confidence level of the search results 408b from that camera 169 may be higher than the confidence level of the search results 408c,d from the cameras 169 with lower resolutions. To compensate for this spatial bias, the search UI module 202 may access a lookup table stored in the non-volatile storage 120 that contains correction factors taking into account image resolution and distance of an object-of-interest from the camera 169, and determine the averaged location 1802 as a weighted average that applies the correction factor to the higher resolution camera 169.
The JavaScript code below describes an example implementation of how to determine the averaged location 1802 according to the embodiment of
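One minimal sketch of such an implementation is given below; each search result is assumed to carry `lat`, `lng`, and `confidence` fields, which are illustrative names rather than the specification's own:

```javascript
// Sketch: average the camera locations of a group of search results into
// a single map position, along with the group's average confidence and
// the number of results averaged.
function averageLocation(results) {
  var sumLat = 0, sumLng = 0, sumConfidence = 0;
  results.forEach(function (r) {
    sumLat += r.lat;
    sumLng += r.lng;
    sumConfidence += r.confidence;
  });
  var n = results.length;
  return {
    lat: sumLat / n,
    lng: sumLng / n,
    confidence: sumConfidence / n, // average confidence of the group
    count: n                       // number of results averaged
  };
}
```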
In another example embodiment, the code below may be used in place of the analogous code above to determine the averaged location 1802 using confidence weighting:
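A sketch of the confidence-weighted variant, under the same assumed field names, is:

```javascript
// Sketch: each result's camera location contributes to the averaged
// location in proportion to its confidence level.
function weightedAverageLocation(results) {
  var sumLat = 0, sumLng = 0, totalWeight = 0;
  results.forEach(function (r) {
    sumLat += r.lat * r.confidence;
    sumLng += r.lng * r.confidence;
    totalWeight += r.confidence;
  });
  return {
    lat: sumLat / totalWeight,
    lng: sumLng / totalWeight,
    count: results.length
  };
}
```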
Additionally, the following code may be applied to group the search results 408 by time into different “buckets”. In at least some example embodiments, the buckets are non-overlapping in time. A single time period may, for example, be divided into sequential buckets such that all times during that period fall into one of the buckets. Each non-empty bucket may then be further processed to eventually become one of the averaged locations 1802 on the path 1506 drawn on the map 1400.
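A sketch of such bucketing, assuming each result carries a `time` field, is:

```javascript
// Sketch: group results into non-overlapping time buckets; each
// non-empty bucket can then be averaged into one point on the path.
function bucketByTime(results, startTime, bucketLength) {
  var buckets = {};
  results.forEach(function (r) {
    var index = Math.floor((r.time - startTime) / bucketLength);
    if (!buckets[index]) buckets[index] = [];
    buckets[index].push(r);
  });
  return buckets; // bucket index -> results falling in that interval
}
```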
Generating the averaged location 1802 may be done live as the search UI module 202 is obtaining the search results 408 in real-time from at least one live video stream and/or based on recorded data to reconstruct the person-of-interest's 308 path. Various parameters, such as how many search results 408 to average and whether a weighted average is used, may be adjusted to generate a variety of different paths 1506 for review by the user. In at least some example embodiments, the averaged location 1802 may be generated using the most recent search results 408, and the path 1506 may accordingly terminate at the averaged location 1802. The averaged location's 1802 position on the map 1400 may also change as the search UI module 202 obtains new search results 408 and updates the latitude and longitude of the averaged location 1802.
The search UI module 202 may also determine the speed of the person-of-interest 308 from the search results 408. If two search results 408 are indexed at times t1 and t2 and are a distance D apart, the average speed between the locations 1502 corresponding to those results 408 is D/(t2−t1). The search UI module 202 may display this average speed, which permits the user to infer locations at which the person-of-interest 308 may have traveled or lingered when not directly observed by at least one of the cameras 169. In at least some example embodiments, the search UI module 202 may determine from the average speed and from the person-of-interest's 308 direction of travel as indicated by the direction indicator 1804 an inferred area in which the person-of-interest 308 may be located. Each of
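This computation can be sketched as follows; the haversine formula converts the two camera latitude/longitude positions into a distance in metres, and the field names are illustrative assumptions:

```javascript
// Sketch: distance D between two camera locations (haversine formula)
// divided by the elapsed time t2 - t1 gives the average speed.
function haversineMetres(a, b) {
  var R = 6371000; // mean Earth radius in metres
  var toRad = function (deg) { return deg * Math.PI / 180; };
  var dLat = toRad(b.lat - a.lat);
  var dLng = toRad(b.lng - a.lng);
  var h = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
          Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) *
          Math.sin(dLng / 2) * Math.sin(dLng / 2);
  return 2 * R * Math.asin(Math.sqrt(h));
}

// Average speed in metres per second between two timestamped results
function averageSpeed(r1, r2) {
  return haversineMetres(r1, r2) / (r2.time - r1.time);
}
```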
More generally, the search UI module 202 may highlight to the user the fifth location 1502e, which in
The region 1806 in
While in at least some of the example embodiments described herein the search UI module 202 presumes the position of the person-of-interest 308 is that of the camera 169 that captures the search result 408, this may differ in at least some different example embodiments. For example, the camera 169 may capture depth data, and the search UI module 202 may accordingly determine the person-of-interest's 308 location on the map 1400 as being spaced away from the location 1502 of the camera 169 by a distance corresponding to that depth.
Additionally or alternatively, following an initial selection of facets based on the search results 406 depicted on the page 300, the user may revise or add to those facets by providing inputs removed from the map 1400, such as by using the menus 1004 and 1020a,b of
Via the page 300, the user may accordingly commence a search for a person-of-interest 308 (regardless of the person-of-interest's 308 facets), or a search for one or more facets of a person-of-interest 308 shown in one of the search results 406. The user may also chain these searches together. For example, the user may commence a search for a person-of-interest 308 regardless of that person-of-interest's 308 facets, and then commence a facet search based on one or more facets of one or more persons depicted in the consequent search results 406, regardless of whether the result 406 depicts the actual person-of-interest 308 for whom the user was searching or a false positive. The user may then analogously perform one or more appearance searches for a person-of-interest 308 (regardless of his or her facets) and/or one or more facet searches from the results, as desired. Similarly, the user may start the chain by performing a facet search, and based on the results 406 of the facet search commence an appearance search for a particular person-of-interest 308 (regardless of his or her facets).
At least some of the foregoing example embodiments display results of an appearance search on the map 1400. In at least some different example embodiments, different types of search results may additionally or alternatively be displayed on the map 1400. For example, the search UI module 202 may display results of a non-appearance search performed using video analytics, or of a motion search. The search UI module 202 may depict, for example, lists of different video analytics-detected events detected using the analytics engine module 172 on the map 1400, with one or more of the locations 1502 being associated with a list of events detected at that location 1502. Example video analytics events comprise one or more of foreground/background segmentation, object detection, object tracking, object classification, virtual tripwire, anomaly detection, facial detection, facial recognition, license plate recognition, identifying objects “left behind”, monitoring objects (i.e. to protect from stealing), business intelligence and deciding a position change action.
The map integration described in respect of
Although example embodiments have described a reference image for a search as being taken from an image within recorded video, in some example embodiments it may be possible to conduct a search based on a scanned photograph or still image taken by a digital camera. This may be particularly true where the photo or other image is, for example, taken recently enough that the clothing and appearance are likely to be the same as what may be found in the video recordings.
As should be apparent from this detailed description, the operations and functions of the electronic computing device are sufficiently complex as to require their implementation on a computer system, and cannot be performed, as a practical matter, in the human mind. Electronic computing devices such as set forth herein are understood as requiring and providing speed and accuracy and complexity management that are not obtainable by human mental steps, in addition to the inherently digital nature of such operations (e.g., a human mind cannot interface directly with RAM or other digital storage, cannot transmit or receive electronic messages, electronically encoded video, electronically encoded audio, etc., and cannot display content, such as a map, on a display, among other features and functions set forth herein).
In the foregoing specification, specific embodiments have been described.
However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the invention as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of present teachings. The benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. The invention is defined solely by the appended claims including any amendments made during the pendency of this application and all equivalents of those claims as issued.
Moreover, in this document, relational terms such as first and second, top and bottom, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms “comprises,” “comprising,” “has”, “having,” “includes”, “including,” “contains”, “containing” or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises, has, includes, contains a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. An element preceded by “comprises . . . a”, “has . . . a”, “includes . . . a”, “contains . . . a” does not, without more constraints, preclude the existence of additional identical elements in the process, method, article, or apparatus that comprises, has, includes, contains the element. The terms “a” and “an” are defined as one or more unless explicitly stated otherwise herein. The terms “substantially”, “essentially”, “approximately”, “about” or any other version thereof, are defined as being close to as understood by one of ordinary skill in the art, and in one non-limiting embodiment the term is defined to be within 10%, in another embodiment within 5%, in another embodiment within 1% and in another embodiment within 0.5%. The term “one of”, without a more limiting modifier such as “only one of”, and when applied herein to two or more subsequently defined options such as “one of A and B” should be construed to mean an existence of any one of the options in the list alone (e.g., A alone or B alone) or any combination of two or more of the options in the list (e.g., A and B together).
A device or structure that is “configured” in a certain way is configured in at least that way, but may also be configured in ways that are not listed.
The terms “coupled,” “coupling,” or “connected” as used herein can have several different meanings depending on the context in which these terms are used. For example, the terms coupled, coupling, or connected can have a mechanical or electrical connotation. For example, as used herein, the terms coupled, coupling, or connected can indicate that two elements or devices are directly connected to one another, or connected to one another through intermediate elements or devices via an electrical element, electrical signal, or a mechanical element, depending on the particular context.
It will be appreciated that some embodiments may be composed of one or more generic or specialized processors (or “processing devices”) such as microprocessors, digital signal processors, customized processors, and field programmable gate arrays (FPGAs), and unique stored program instructions (including both software and firmware) that control the one or more processors to implement, in conjunction with certain non-processor circuits, some, most, or all of the functions of the method and/or apparatus described herein. Alternatively, some or all functions could be implemented by a state machine that has no stored program instructions, or in one or more application specific integrated circuits (ASICs), in which each function or some combinations of certain of the functions are implemented as custom logic. Of course, a combination of the two approaches could be used.
Moreover, an embodiment can be implemented as a computer-readable storage medium having computer-readable code stored thereon for programming a computer (e.g., comprising a processor) to perform a method as described and claimed herein. Any suitable computer-usable or computer-readable medium may be utilized. Examples of such computer-readable storage media include, but are not limited to, a hard disk, a CD-ROM, an optical storage device, a magnetic storage device, a ROM (Read Only Memory), a PROM (Programmable Read Only Memory), an EPROM (Erasable Programmable Read Only Memory), an EEPROM (Electrically Erasable Programmable Read Only Memory), and a Flash memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
Further, it is expected that one of ordinary skill, notwithstanding possibly significant effort and many design choices motivated by, for example, available time, current technology, and economic considerations, when guided by the concepts and principles disclosed herein will be readily capable of generating such software instructions and programs and ICs with minimal experimentation. For example, computer program code for carrying out operations of various example embodiments may be written in an object oriented programming language such as Java, Smalltalk, C++, Python, or the like.
However, the computer program code for carrying out operations of various example embodiments may also be written in conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or server or entirely on the remote computer or server. In the latter scenario, the remote computer or server may be connected to the computer through a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
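The split execution described above, with program code running partly on the computer and partly on a remote computer or server reached over a network, can be illustrated with a minimal sketch. The sketch below is purely illustrative and forms no part of any claimed embodiment; the function names and the loopback echo server standing in for the remote server are hypothetical choices made for the example.

```python
import socket
import threading

def start_remote_side(host="127.0.0.1", port=0):
    """Minimal TCP server standing in for the 'remote computer or server'.

    It uppercases whatever it receives; port=0 asks the OS for a free
    port, which is returned so the client side can connect to it.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind((host, port))
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        with conn:
            data = conn.recv(1024)
            conn.sendall(data.upper())  # work performed on the "remote" side
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return srv.getsockname()[1]

def run_split(text):
    """Client side: perform part of the work locally, delegate the rest."""
    local_part = text.strip()  # work performed "partly on the computer"
    port = start_remote_side()
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall(local_part.encode())
        return sock.recv(1024).decode()  # result from the "remote" side

print(run_split("  appearance search  "))  # APPEARANCE SEARCH
```

In a deployed system the connection would of course traverse a LAN, a WAN, or the Internet rather than the loopback interface used here for self-containment.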
The Abstract of the Disclosure is provided to allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in various embodiments for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as separately claimed subject matter.